# Lecture 10: Artificial neural networks (ANNs)

![](https://www.tensorflow.org/images/colab_logo_32px.png)
[Run in colab](https://colab.research.google.com/drive/1PKCyhLqpAeBA3u8yurcgCeaFYcKZNnQE)

In [1]:
import datetime
now = datetime.datetime.now()
print("Last executed: " + now.strftime("%Y-%m-%d %H:%M:%S"))

Last executed: 2024-01-10 00:20:22


## Biological inspiration

Architecture of neural networks originally inspired by the brain.


### Rewiring the brain


<img src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture10_Images/brain_rewriting_auditory.png" width="750px" style="display:block; margin:auto"/>

Study performed with ferrets by [Roe et al. (1992)](https://www.ncbi.nlm.nih.gov/pubmed/1527604).

[[Image credit](https://www.coursera.org/learn/machine-learning)]

<img src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture10_Images/brain_rewriting_somatosensory.png" width="750px" style="display:block; margin:auto"/>

Study performed with hamsters by [Metin & Frost (1989)](https://www.ncbi.nlm.nih.gov/pubmed/2911580).

[[Image credit](https://www.coursera.org/learn/machine-learning)]

These experiments led to the "*one learning algorithm*" hypothesis.

### Biological neurons

<img src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture10_Images/Blausen_0657_MultipolarNeuron.png" width="600px" style="display:block; margin:auto"/>

[Image credit: [Bruce Blaus, Wikipedia](https://en.wikipedia.org/wiki/Neuron)]

Biological neurons consist of a cell body containing the nucleus, branching dendrites (inputs) and an axon (output).

The axon connects neurons, and its length can range from a few times to 10,000 times the size of the cell body.

The axon splits into branches called telodendria, with synaptic terminals at their ends, which are connected to the dendrites of other neurons.

Although individual biological neurons are rather simple, complexity comes from networks of billions of neurons, each connected to thousands of other neurons.

## Artificial neurons (units)

### Perceptron

<img src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture10_Images/perceptron.jpg" width="600px" style="display:block; margin:auto"/>

[[Image credit](https://blog.dbrgn.ch/2013/3/26/perceptrons-in-python/)]

### General logistic unit

<img src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture10_Images/general_logistic_unit.png" width="500px" style="display:block; margin:auto"/>


*Weighted sum*:

$$z = \sum_{j=1}^n \theta_j x_j =\theta^{\rm T} x.$$

*Activations*:
$a = h(z),$
for non-linear activation function $h$.

We generally refer to this as a logistic unit (rather than an artificial neuron), since we will consider generalisations that go beyond the concepts motivated by biology.
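As a minimal sketch of a single unit, the snippet below computes the weighted sum and activation defined above using NumPy. The function name `unit_output`, the choice of a sigmoid for $h$, and the numerical values of the weights and inputs are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, used here as one possible activation h."""
    return 1.0 / (1.0 + np.exp(-z))

def unit_output(theta, x, h=sigmoid):
    """Return the activation a = h(theta^T x) of a single unit."""
    z = np.dot(theta, x)  # weighted sum z = sum_j theta_j x_j
    return h(z)

theta = np.array([0.5, -1.0, 2.0])  # illustrative weights
x = np.array([1.0, 2.0, 0.5])       # illustrative inputs
a = unit_output(theta, x)           # here z = 0.5 - 2.0 + 1.0 = -0.5
print(a)
```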

### Examples of activation functions

Step

$$ a(z) = \begin{cases}
0, & \text{if}\ z < 0 \\
1, & \text{if}\ z \geq 0
\end{cases}
$$

Sigmoid

$$
a(z) = \frac{1}{1+\exp{(-z)}}
$$

Hyperbolic tangent

$$
a(z) = \tanh(z)
$$

Rectified linear unit (ReLU)

$$
a(z) = \max(0, z)
$$


<img src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture10_Images/activation_func.png" width="900px" style="display:block; margin:auto"/>
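The four activation functions above can be written in a few lines of NumPy. This is a sketch only; the function names are chosen here for illustration.

```python
import numpy as np

def step(z):
    """Step function: 0 for z < 0, 1 for z >= 0."""
    return np.where(z >= 0, 1.0, 0.0)

def sigmoid(z):
    """Logistic sigmoid 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    """Hyperbolic tangent."""
    return np.tanh(z)

def relu(z):
    """Rectified linear unit max(0, z)."""
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 2.0])
for name, h in [("step", step), ("sigmoid", sigmoid),
                ("tanh", tanh), ("relu", relu)]:
    print(name, h(z))
```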

#### Gradients of activation functions

<img src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture10_Images/grad_activation_func.png" width="900px" style="display:block; margin:auto"/>

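Notice the step function has zero gradient everywhere (except at $z=0$, where it is not differentiable), so it provides no signal for gradient-based training.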
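The gradients plotted above have simple closed forms, which can be checked against a central finite difference. The sketch below assumes the convention that the ReLU gradient is taken to be 0 at $z=0$; the helper names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def sigmoid_grad(z):
    """d/dz sigmoid(z) = sigmoid(z) * (1 - sigmoid(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_grad(z):
    """d/dz tanh(z) = 1 - tanh(z)^2."""
    return 1.0 - np.tanh(z) ** 2

def relu_grad(z):
    """1 for z > 0, 0 otherwise (value at z = 0 is a convention)."""
    return np.where(z > 0, 1.0, 0.0)

def numerical_grad(h, z, eps=1e-6):
    """Central finite-difference approximation to h'(z)."""
    return (h(z + eps) - h(z - eps)) / (2.0 * eps)

# test points chosen away from z = 0, where ReLU is not differentiable
z = np.array([-1.5, -0.3, 0.7, 2.0])
print(np.max(np.abs(sigmoid_grad(z) - numerical_grad(sigmoid, z))))
print(np.max(np.abs(tanh_grad(z) - numerical_grad(np.tanh, z))))
print(np.max(np.abs(relu_grad(z) - numerical_grad(relu, z))))
```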

**Exercises:** *You can now complete Exercise 1-2 in the exercises associated with this lecture.*

## Neural network

Construct an artificial neural network by combining layers of multiple logistic units.

<img src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture10_Images/ann.png" width="500px"  style="display:block; margin:auto"/>

*Weighted sums*:
$
z_j = \sum_{i=1}^n \theta_{ij} x_i
$

*Activations*:
$
a_j = h(z_j)
$
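For a whole layer, the weighted sums for all units can be computed at once as a matrix-vector product. In the sketch below the weights are stored in an array `Theta` with `Theta[i, j]` holding $\theta_{ij}$; this storage layout, the sigmoid activation and the random values are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer_forward(Theta, x, h=sigmoid):
    """Activations of one layer.

    Theta has shape (n_inputs, n_units), with Theta[i, j] = theta_ij.
    """
    z = Theta.T @ x  # z_j = sum_i theta_ij x_i
    return h(z)      # a_j = h(z_j)

rng = np.random.default_rng(0)
Theta = rng.normal(size=(3, 2))  # 3 inputs feeding 2 units
x = np.array([1.0, 0.5, -1.0])
a = layer_forward(Theta, x)
print(a.shape)
```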



### Architectures and terminology

<img src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture10_Images/ann_layers.jpg" width="500px" style="display:block; margin:auto"/>

[[Image credit](https://medium.com/@xenonstack/overview-of-artificial-neural-networks-and-its-applications-2525c1addff7)]

Networks can be wide/narrow and deep/shallow.

Here we consider feedforward networks only.  Other configurations can also be considered, as we will see later in the course.

### Universal approximation theorem

The *universal approximation theorem* states that a feedforward network *can* accurately approximate any continuous function from one finite dimensional space to another, given enough hidden units (Hornik et al. 1989, Cybenko 1989).

(Some technical caveats that are beyond the scope of this course regarding properties of the mapping and activation functions.)

ANNs thus have the *potential* to be universal approximators.  

Universal approximation theorem does *not* provide any guarantee that training finds this representation.

## Multi-class classification

Multi-class classification can be easily performed with an ANN, where each output node corresponds to a certain class.

<img src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture10_Images/ann_layers_multiclass_classification.jpg" width="500px" style="display:block; margin:auto"/>

[[Image credit (adapted)](https://medium.com/@xenonstack/overview-of-artificial-neural-networks-and-its-applications-2525c1addff7)]

Set up training data as unit vectors, with 1 for the target class and 0 for all other classes.
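This encoding of the targets is often called one-hot encoding. A minimal sketch, assuming the class labels are integers from 0 to `n_classes - 1` (the helper name `one_hot` is illustrative):

```python
import numpy as np

def one_hot(labels, n_classes):
    """Return an array with one row per label: 1 at the target class, 0 elsewhere."""
    Y = np.zeros((len(labels), n_classes))
    Y[np.arange(len(labels)), labels] = 1.0
    return Y

labels = np.array([2, 0, 1])
Y = one_hot(labels, n_classes=3)
print(Y)
```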

### Softmax

Map predictions to *"probabilities"* using the softmax function for all output nodes with activations $a_j$:

$$ 
\hat{p}_j = \frac{\exp(a_j)}{\sum_{j^\prime} \exp(a_{j^\prime})}
.
$$

Normalised such that
- $\sum_j \hat{p}_j=1$
- $0 \leq \hat{p}_j \leq 1$
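A sketch of the softmax in NumPy follows. Subtracting the maximum activation before exponentiating is a common numerical safeguard that is assumed here; it leaves the result unchanged because the same factor cancels in the numerator and denominator.

```python
import numpy as np

def softmax(a):
    """Softmax over the last axis of the activations a."""
    shifted = a - np.max(a, axis=-1, keepdims=True)  # avoids overflow in exp
    e = np.exp(shifted)
    return e / np.sum(e, axis=-1, keepdims=True)

a = np.array([2.0, 1.0, 0.1])  # illustrative output activations
p = softmax(a)
print(p, p.sum())
```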

## Cost functions

Appropriate cost functions depend on whether performing regression or classification.  Consider targets $y_j^{(i)}$ and outputs (predictions) of ANN $\hat{p}_j^{(i)}$, for training instance $i$ and output node $j$.

Typical cost function for regression is the mean square error: 

$$\text{MSE}(\Theta) = \frac{1}{m} \sum_{i} \sum_{j} \left(\hat{p}_j^{(i)} - y_j^{(i)}\right)^2 .$$

Typical cost function for classification is cross-entropy:

$$
C(\Theta) = -\frac{1}{m} \sum_{i} \sum_{j} y_j^{(i)} \log \left(\hat{p}_j^{(i)}\right)
.
$$

(Although other cost functions are used widely.)
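Both cost functions are short to write down in NumPy. The sketch below assumes predictions and targets are stored as arrays of shape `(m, n_outputs)`, and clips the predictions away from zero before taking the logarithm; the clipping threshold `tiny` is an assumption added for numerical safety, not part of the definition.

```python
import numpy as np

def mse(p_hat, y):
    """Mean square error, averaged over the m training instances."""
    m = y.shape[0]
    return np.sum((p_hat - y) ** 2) / m

def cross_entropy(p_hat, y, tiny=1e-12):
    """Cross-entropy, averaged over the m training instances."""
    m = y.shape[0]
    p_safe = np.clip(p_hat, tiny, 1.0)  # avoid log(0)
    return -np.sum(y * np.log(p_safe)) / m

# two training instances, three classes (illustrative values)
y = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
p_hat = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
mse_value = mse(p_hat, y)
ce_value = cross_entropy(p_hat, y)
print(mse_value, ce_value)
```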

Various forms of regularisation often considered, e.g. $\ell_2$ regularisation.
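As one common illustration (the symbol $\lambda$ for the regularisation strength and the factor of $1/2$ are assumptions here, and conventions vary), $\ell_2$ regularisation adds the sum of squared weights to the cost:

$$
C_{\rm reg}(\Theta) = C(\Theta) + \frac{\lambda}{2} \sum_{i,j} \theta_{ij}^2 ,
$$

where the sum runs over all weights of the network. In the scikit-learn example later in this lecture, the `alpha` argument of `MLPClassifier` sets the strength of this kind of penalty.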

Error surface non-convex, potentially with many local optima.  Historically training ANNs has been difficult.

## Backpropagation

### Problem

To train ANNs want to exploit gradient of error surface (e.g. for gradient descent algorithms).  Therefore need an efficient method to compute gradients.

Backpropagation algorithm developed by [Rumelhart, Hinton & Williams (1986)](https://www.nature.com/articles/323533a0) to efficiently compute the gradient of the error surface (i.e. cost function) with respect to each weight of the network.

Gradients then accessible for training.

### Overview of backpropagation

Backpropagation consists of forward and reverse (backwards) passes (hence name).

Consider each training instance.  A forward run of the network is applied to compute the output error $\epsilon$.  Then errors are backpropagated through the network to compute the rate of change of the error with respect to the weights of the network.

In practice, the error gradients $\frac{\partial \epsilon}{\partial z_j}$ are computed and backpropagated, and the error gradients with respect to the weights, $\frac{\partial \epsilon}{\partial \theta_{ij}}$, are then computed from them.

Backpropagation algorithm follows from a straightforward application of the chain rule.

### Define network architecture and notation

<img src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture10_Images/backpropagation_architecture2.png" width="400px" style="display:block; margin:auto"/>

Now make network layer explicit in notation.

Weighted sum:
$
z_j^l = \sum_i \theta_{ij}^l a_i^{l-1} ,
$
where $\theta_{ij}^l$ is the weight between node $i$ at layer $l-1$ and node $j$ at layer $l$ (note that different conventions are often used, e.g. $\theta_{ji}^{l-1}$ for the same connection). Consider $L$ layers.

Activations:
$
a_i^l = h(z_i^l) .
$



### Backpropagation calculations

Want to compute

$$\Delta \theta_{ij}^l = -\eta \frac{\partial \epsilon}{\partial \theta_{ij}^l}.$$

By chain rule:

$$\frac{\partial \epsilon}{\partial \theta_{ij}^l}=\frac{\partial \epsilon}{\partial z_{j}^l}\frac{\partial z_{j}^l}{\partial \theta_{ij}^l}=\frac{\partial \epsilon}{\partial z_{j}^l}a_{i}^{l-1}=\delta_{j}^l a_{i}^{l-1}, $$

where $\delta_j^l = \frac{\partial \epsilon}{\partial z_{j}^l}$

$\left(
\text{recall}\ 
z_j^l = \sum_i \theta_{ij}^l a_i^{l-1}
\right)$.

Now need to compute

$$\delta_j^l = \frac{\partial \epsilon}{\partial z_{j}^l} .$$

#### Functional dependence

<img src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture10_Images/backpropagation_functional_dependence2.png" width="500px" style="display:block; margin:auto"/>


By chain rule again:

$$\delta_j^l = \frac{\partial \epsilon}{\partial z_{j}^l} = \sum_i \frac{\partial \epsilon}{\partial z_{i}^{l+1}} \frac{\partial z_{i}^{l+1}}{\partial a_{j}^l} \frac{\partial a_{j}^l}{\partial z_{j}^l} = \sum_i \delta_i^{l+1} \theta_{ji}^{l+1} h^\prime(z_j^l).$$

Note the term $h^\prime(z_j^l)$ is independent of $i$ and so can be moved outside the summation.

Boundary condition:

$$\delta_j^L = \frac{\partial \epsilon}{\partial z_{j}^L} = \frac{\partial \epsilon}{\partial a_{j}^L} \frac{\partial a_{j}^L}{\partial z_{j}^L} = \frac{\partial \epsilon}{\partial a_{j}^L} h^\prime(z_j^L).$$
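As a worked instance of the boundary condition, assume (for illustration only) a squared error for a single training instance with targets $y_j$ and a sigmoid activation $h$ at the output layer:

$$\epsilon = \frac{1}{2} \sum_j \left(a_j^L - y_j\right)^2 
\quad \Rightarrow \quad 
\frac{\partial \epsilon}{\partial a_{j}^L} = a_j^L - y_j .$$

Since the sigmoid satisfies $h^\prime(z) = h(z)\bigl(1 - h(z)\bigr)$, the output-layer term becomes

$$\delta_j^L = \left(a_j^L - y_j\right) a_j^L \left(1 - a_j^L\right) .$$

Other choices of error or output activation change only this starting point; the backwards recursion is unaffected.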

### Summary of backpropagation

For current set of weights $\theta_{ij}^l$, compute forward pass through network:

$$z_j^l = \sum_i \theta_{ij}^l a_i^{l-1} ,$$

$$a_i^l = h(z_i^l) .$$

Propagate errors backwards through network:

$$\delta_j^l = \frac{\partial \epsilon}{\partial z_{j}^l}=\sum_i \delta_i^{l+1} \theta_{ji}^{l+1} h^\prime(z_j^l) .$$

Compute derivatives of error with respect to weights:

$$\frac{\partial \epsilon}{\partial \theta_{ij}^l} = \delta_j^l a_i^{l-1}.$$
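The three steps above can be sketched in NumPy for a small fully connected network. This is an illustration under stated assumptions rather than a reference implementation: sigmoid activations in every layer, a squared error $\epsilon = \frac{1}{2}\sum_j (a_j^L - y_j)^2$ for a single training instance, no bias terms, and weights stored so that `thetas[l][i, j]` connects node $i$ of one layer to node $j$ of the next. The gradients are compared with central finite differences as a sanity check.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(thetas, x):
    """Forward pass: return the list of activations, starting with the input."""
    activations = [x]
    for Theta in thetas:
        z = Theta.T @ activations[-1]   # z_j = sum_i theta_ij a_i (previous layer)
        activations.append(sigmoid(z))  # a_j = h(z_j)
    return activations

def error(thetas, x, y):
    """Squared error for one training instance (an assumed choice of epsilon)."""
    return 0.5 * np.sum((forward(thetas, x)[-1] - y) ** 2)

def backprop(thetas, x, y):
    """Return the gradients of the error with respect to each weight array."""
    activations = forward(thetas, x)
    grads = [None] * len(thetas)
    a_out = activations[-1]
    # boundary condition: delta = d(eps)/d(a) * h'(z), with h'(z) = a (1 - a)
    delta = (a_out - y) * a_out * (1.0 - a_out)
    for l in reversed(range(len(thetas))):
        # d(eps)/d(theta_ij) = delta_j * a_i (activation of the previous layer)
        grads[l] = np.outer(activations[l], delta)
        if l > 0:
            a_prev = activations[l]
            # delta_j (previous layer) = h'(z_j) * sum_i theta_ji delta_i
            delta = (thetas[l] @ delta) * a_prev * (1.0 - a_prev)
    return grads

def numerical_grad(thetas, x, y, l, eps=1e-6):
    """Central finite-difference gradient with respect to thetas[l]."""
    num = np.zeros_like(thetas[l])
    for i in range(num.shape[0]):
        for j in range(num.shape[1]):
            plus = [t.copy() for t in thetas]
            minus = [t.copy() for t in thetas]
            plus[l][i, j] += eps
            minus[l][i, j] -= eps
            num[i, j] = (error(plus, x, y) - error(minus, x, y)) / (2.0 * eps)
    return num

rng = np.random.default_rng(0)
thetas = [rng.normal(size=(3, 4)), rng.normal(size=(4, 2))]  # 3 -> 4 -> 2 units
x = rng.normal(size=3)
y = np.array([0.0, 1.0])

grads = backprop(thetas, x, y)
for l in range(len(thetas)):
    diff = np.max(np.abs(grads[l] - numerical_grad(thetas, x, y, l)))
    print("layer", l, "max difference:", diff)
```

Agreement with the finite-difference values is a useful check whenever backpropagation is implemented by hand, since index mistakes are easy to make.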

### Training with backpropagation

Backpropagation simply computes derivatives of error with respect to weights.

Still need training algorithm to update weights given derivatives, e.g. $\Delta \theta_{ij}^l = -\eta \frac{\partial \epsilon}{\partial \theta_{ij}^l}.$  


Various approaches can be considered:
- Online: update weights after each training instance.
- Full-batch: update weights after full sweep through training data.
- Mini-batch: update weights after a small sample of training cases.
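The three schemes differ only in how many training instances contribute to each weight update. The sketch below makes this explicit through a single `batch_size` parameter. To stay short it trains a single sigmoid unit with a cross-entropy cost on synthetic data rather than a full network; the data, learning rate `eta` and number of epochs are all illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, batch_size, eta=0.5, n_epochs=50, seed=0):
    """Gradient descent for a single sigmoid unit with cross-entropy cost."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_epochs):
        order = rng.permutation(m)  # shuffle the instances each epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            a = sigmoid(X[idx] @ theta)
            # gradient of the cost, averaged over the instances in this batch
            grad = X[idx].T @ (a - y[idx]) / len(idx)
            theta -= eta * grad     # update: delta theta = -eta * gradient
    return theta

# synthetic, linearly separable data
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

accuracies = {}
for name, batch_size in [("online", 1), ("mini-batch", 20), ("full-batch", 200)]:
    theta = train(X, y, batch_size)
    predictions = (sigmoid(X @ theta) > 0.5).astype(float)
    accuracies[name] = np.mean(predictions == y)
    print(name, accuracies[name])
```

Setting `batch_size=1` gives online updates, `batch_size=m` gives full-batch updates, and values in between give mini-batch updates.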

#### Example

Scikit-learn supports ANNs, but its implementation is not intended for large-scale problems.

The example below trains with stochastic gradient descent (`solver='sgd'`), with gradients computed by backpropagation.

In [2]:
# suppress the ConvergenceWarning raised when training stops at max_iter before converging
from warnings import simplefilter
from sklearn.exceptions import ConvergenceWarning
simplefilter("ignore", category=ConvergenceWarning)

In [3]:
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.neural_network import MLPClassifier

mnist = fetch_openml('mnist_784', version=1, cache=True, parser='auto')
# rescale the data, use the traditional train/test split
X, y = mnist.data / 255., mnist.target
X_train, X_test = X[:60000], X[60000:]
y_train, y_test = y[:60000], y[60000:]

mlp = MLPClassifier(hidden_layer_sizes=(50,), max_iter=10, alpha=1e-4,
                    solver='sgd', verbose=10, tol=1e-4, random_state=1,
                    learning_rate_init=.1)

mlp.fit(X_train, y_train)
print("Training set score: %f" % mlp.score(X_train, y_train))
print("Test set score: %f" % mlp.score(X_test, y_test))

Iteration 1, loss = 0.32009978


Iteration 2, loss = 0.15347534


Iteration 3, loss = 0.11544755


Iteration 4, loss = 0.09279764


Iteration 5, loss = 0.07889367


Iteration 6, loss = 0.07170497


Iteration 7, loss = 0.06282111


Iteration 8, loss = 0.05530788


Iteration 9, loss = 0.04960484


Iteration 10, loss = 0.04645355


Training set score: 0.986800
Test set score: 0.970000
