{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Lecture 10: Artificial neural networks (ANNs)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"\n",
"[Run in colab](https://colab.research.google.com/drive/1PKCyhLqpAeBA3u8yurcgCeaFYcKZNnQE)"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:20:22.727146Z",
"iopub.status.busy": "2024-01-10T00:20:22.726716Z",
"iopub.status.idle": "2024-01-10T00:20:22.735610Z",
"shell.execute_reply": "2024-01-10T00:20:22.735071Z"
},
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Last executed: 2024-01-10 00:20:22\n"
]
}
],
"source": [
"import datetime\n",
"now = datetime.datetime.now()\n",
"print(\"Last executed: \" + now.strftime(\"%Y-%m-%d %H:%M:%S\"))"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Biological inspiration"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The architecture of neural networks was originally inspired by the brain.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Rewiring the brain\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture10_Images/brain_rewriting_auditory.png\" width=\"750px\" style=\"display:block; margin:auto\"/>\n",
"\n",
"Study performed with ferrets by [Roe et al. (1992)](https://www.ncbi.nlm.nih.gov/pubmed/1527604).\n",
"\n",
"[[Image credit](https://www.coursera.org/learn/machine-learning)]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"<img src=\"https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture10_Images/brain_rewriting_somatosensory.png\" width=\"750px\" style=\"display:block; margin:auto\"/>\n",
"\n",
"Study performed with hamsters by [Metin & Frost (1989)](https://www.ncbi.nlm.nih.gov/pubmed/2911580).\n",
"\n",
"[[Image credit](https://www.coursera.org/learn/machine-learning)]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"These studies led to the \"*one learning algorithm*\" hypothesis."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Biological neurons"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"source": [
"<img src=\"https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture10_Images/Blausen_0657_MultipolarNeuron.png\" width=\"600px\" style=\"display:block; margin:auto\"/>\n",
"\n",
"[Image credit: [Bruce Blaus, Wikipedia](https://en.wikipedia.org/wiki/Neuron)]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"Biological neurons consist of a cell body containing the nucleus, dendrite branches (inputs) and an axon (output).\n",
"\n",
"The axon connects neurons and its length can be a few to 10,000 times the size of the cell body.\n",
"\n",
"The axon splits into branches called telodendria, with synaptic terminals at their ends, which connect to the dendrites of other neurons.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"Although individual biological neurons are rather simple, complexity arises from networks of billions of neurons, each connected to thousands of other neurons."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Artificial neurons (units)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Perceptron"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"source": [
"<img src=\"https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture10_Images/perceptron.jpg\" width=\"600px\" style=\"display:block; margin:auto\"/>\n",
"\n",
"[[Image credit](https://blog.dbrgn.ch/2013/3/26/perceptrons-in-python/)]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### General logistic unit"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture10_Images/general_logistic_unit.png\" width=\"500px\" style=\"display:block; margin:auto\"/>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Weighted sum*:\n",
"\n",
"$$z = \\sum_{j=1}^n \\theta_j x_j = \\theta^{\\rm T} x.$$\n",
"\n",
"*Activations*:\n",
"$a = h(z),$\n",
"for non-linear activation function $h$."
]
},
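{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"A minimal NumPy sketch of a single logistic unit (not part of the original lecture; the sigmoid choice and variable names are illustrative): compute the weighted sum $z = \\theta^{\\rm T} x$ and apply a non-linear activation $a = h(z)$."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def sigmoid(z):\n",
"    return 1.0 / (1.0 + np.exp(-z))\n",
"\n",
"# Illustrative inputs and weights (x[0] = 1 acts as a bias term)\n",
"x = np.array([1.0, 0.5, -1.2, 3.0])\n",
"theta = np.array([0.1, -0.4, 0.25, 0.7])\n",
"\n",
"z = theta @ x    # weighted sum z = theta^T x\n",
"a = sigmoid(z)   # activation a = h(z)\n",
"print(f'z = {z:.3f}, a = {a:.3f}')"
]
},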
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"Generally we refer to this as a logistic unit (rather than an artificial neuron), since we will consider generalisations beyond the concepts motivated by biology."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Examples of activation functions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Step\n",
"\n",
"$$\n",
"a(z) = \\begin{cases}\n",
"0, & \\text{if}\\ z < 0 \\\\\n",
"1, & \\text{if}\\ z \\geq 0\n",
"\\end{cases}\n",
"$$\n",
"\n",
"Sigmoid\n",
"\n",
"$$\n",
"a(z) = \\frac{1}{1+\\exp{(-z)}}\n",
"$$\n",
"\n",
"Hyperbolic tangent\n",
"\n",
"$$\n",
"a(z) = \\tanh(z)\n",
"$$\n",
"\n",
"Rectified linear unit (ReLU)\n",
"\n",
"$$\n",
"a(z) = \\max(0, z)\n",
"$$\n"
]
},
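{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"A minimal NumPy sketch of these activation functions (not part of the original lecture; function names are illustrative):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def step(z):\n",
"    return np.where(z < 0.0, 0.0, 1.0)\n",
"\n",
"def sigmoid(z):\n",
"    return 1.0 / (1.0 + np.exp(-z))\n",
"\n",
"def relu(z):\n",
"    return np.maximum(0.0, z)\n",
"\n",
"z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])\n",
"for name, h in [('step', step), ('sigmoid', sigmoid), ('tanh', np.tanh), ('relu', relu)]:\n",
"    print(name, h(z))"
]
},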
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture10_Images/activation_func.png\" width=\"900px\" style=\"display:block; margin:auto\"/>"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"#### Gradients of activation functions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture10_Images/grad_activation_func.png\" width=\"900px\" style=\"display:block; margin:auto\"/>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice the step function has zero gradient everywhere (except at the origin, where it is undefined), so it provides no useful gradient information for training."
]
},
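{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"A minimal NumPy sketch of these gradients (not part of the original lecture): the analytic derivatives of the sigmoid, tanh and ReLU, and the zero gradient of the step function away from the origin."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def sigmoid(z):\n",
"    return 1.0 / (1.0 + np.exp(-z))\n",
"\n",
"z = np.array([-2.0, -0.5, 0.5, 2.0])\n",
"\n",
"grad_step = np.zeros_like(z)                      # zero everywhere (undefined at z = 0)\n",
"grad_sigmoid = sigmoid(z) * (1.0 - sigmoid(z))    # sigmoid'(z) = sigmoid(z)(1 - sigmoid(z))\n",
"grad_tanh = 1.0 - np.tanh(z) ** 2                 # tanh'(z) = 1 - tanh^2(z)\n",
"grad_relu = np.where(z > 0.0, 1.0, 0.0)           # ReLU'(z) = 1 for z > 0, else 0\n",
"\n",
"print('step:   ', grad_step)\n",
"print('sigmoid:', grad_sigmoid)\n",
"print('tanh:   ', grad_tanh)\n",
"print('relu:   ', grad_relu)"
]
},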
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": [
"exercise_pointer"
]
},
"source": [
"**Exercises:** *You can now complete Exercises 1-2 in the exercises associated with this lecture.*"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Neural network"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Construct an artificial neural network by combining layers of multiple logistic units."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture10_Images/ann.png\" width=\"500px\" style=\"display:block; margin:auto\"/>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Weighted sums*:\n",
"$\n",
"z_j = \\sum_{i=1}^n \\theta_{ij} x_i\n",
"$\n",
"\n",
"*Activations*:\n",
"$\n",
"a_j = h(z_j)\n",
"$\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Architectures and terminology"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture10_Images/ann_layers.jpg\" width=\"500px\" style=\"display:block; margin:auto\"/>\n",
"\n",
"[[Image credit](https://medium.com/@xenonstack/overview-of-artificial-neural-networks-and-its-applications-2525c1addff7)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Networks can be wide/narrow and deep/shallow.\n",
"\n",
"Here we consider feedforward networks only. Other configurations can also be considered, as we will see later in the course."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Universal approximation theorem"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The *universal approximation theorem* states that a feedforward network *can* accurately approximate any continuous function from one finite dimensional space to another, given enough hidden units (Hornik et al. 1989, Cybenko 1989).\n",
"\n",
"(There are some technical caveats, beyond the scope of this course, regarding properties of the mapping and activation functions.)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"ANNs thus have the *potential* to be universal approximators. \n",
"\n",
"The universal approximation theorem does *not* provide any guarantee that training finds this representation."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Multi-class classification"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"Multi-class classification can be easily performed with an ANN, where each output node corresponds to a certain class."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture10_Images/ann_layers_multiclass_classification.jpg\" width=\"500px\" style=\"display:block; margin:auto\"/>\n",
"\n",
"[[Image credit (adapted)](https://medium.com/@xenonstack/overview-of-artificial-neural-networks-and-its-applications-2525c1addff7)]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"Set up the training targets as one-hot unit vectors, with 1 for the target class and 0 for all other classes (see the sketch below)."
]
},
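{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"A minimal NumPy sketch of such one-hot target vectors (not part of the original lecture; the class labels are illustrative):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"labels = np.array([0, 2, 1, 2])   # illustrative class labels for 3 classes\n",
"n_classes = 3\n",
"\n",
"# One row per training instance: 1 for the target class, 0 elsewhere\n",
"Y = np.eye(n_classes)[labels]\n",
"print(Y)"
]
},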
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Softmax"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"source": [
"Map predictions to *\"probabilities\"* using the softmax function for all output nodes with activations $a_j$:\n",
"\n",
"$$ \n",
"\\hat{p}_j = \\frac{\\exp(a_j)}{\\sum_{j^\\prime} \\exp(a_{j^\\prime})}\n",
".\n",
"$$\n",
"\n",
"Normalised such that\n",
"- $\\sum_j \\hat{p}_j=1$\n",
"- $0 \\leq \\hat{p}_j \\leq 1$"
]
},
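{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"A minimal NumPy sketch of the softmax mapping (not part of the original lecture). Subtracting the maximum activation before exponentiating is a standard numerical-stability trick and does not change the result, since softmax is invariant to shifting all activations by a constant."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def softmax(a):\n",
"    # Subtract max(a) for numerical stability; softmax is shift-invariant\n",
"    e = np.exp(a - np.max(a))\n",
"    return e / np.sum(e)\n",
"\n",
"a = np.array([2.0, 1.0, 0.1])   # illustrative output-layer activations\n",
"p_hat = softmax(a)\n",
"print(p_hat, p_hat.sum())       # probabilities sum to 1"
]
},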
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Cost functions"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"Appropriate cost functions depend on whether we are performing regression or classification. Consider targets $y_j^{(i)}$ and ANN outputs (predictions) $\\hat{p}_j^{(i)}$, for training instance $i$ and output node $j$."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"A typical cost function for regression is the mean squared error:\n",
"\n",
"$$\\text{MSE}(\\Theta) = \\frac{1}{m} \\sum_{i} \\sum_{j} \\left(\\hat{p}_j^{(i)} - y_j^{(i)}\\right)^2 .$$\n",
"\n",
"A typical cost function for classification is the cross-entropy:\n",
"\n",
"$$\n",
"C(\\Theta) = -\\frac{1}{m} \\sum_{i} \\sum_{j} y_j^{(i)} \\log \\left(\\hat{p}_j^{(i)}\\right)\n",
".\n",
"$$\n",
"\n",
"(Other cost functions are also widely used.)"
]
},
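{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"A minimal NumPy sketch of these two cost functions (not part of the original lecture; the one-hot targets and predicted probabilities are illustrative):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# m = 2 training instances, 3 output nodes\n",
"y = np.array([[1.0, 0.0, 0.0],\n",
"              [0.0, 1.0, 0.0]])       # one-hot targets\n",
"p_hat = np.array([[0.7, 0.2, 0.1],\n",
"                  [0.1, 0.8, 0.1]])   # predicted probabilities\n",
"m = y.shape[0]\n",
"\n",
"mse = np.sum((p_hat - y) ** 2) / m\n",
"cross_entropy = -np.sum(y * np.log(p_hat)) / m\n",
"print(f'MSE = {mse:.4f}, cross-entropy = {cross_entropy:.4f}')"
]
},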
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"Various forms of regularisation are often considered, e.g. $\\ell_2$ regularisation."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"The error surface is non-convex, potentially with many local optima. Historically, training ANNs has been difficult."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Backpropagation"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Problem\n",
"\n",
"To train ANNs we want to exploit the gradient of the error surface (e.g. for gradient descent algorithms). We therefore need an efficient method to compute gradients."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"The backpropagation algorithm was developed by [Rumelhart, Hinton & Williams (1986)](https://www.nature.com/articles/323533a0) to efficiently compute the gradient of the error surface (i.e. cost function) with respect to each weight of the network.\n",
"\n",
"The gradients are then available for training."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Overview of backpropagation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Backpropagation consists of forward and reverse (backward) passes (hence the name).\n",
"\n",
"Consider each training instance. A forward run of the network is applied to compute the output error $\\epsilon$. Then errors are backpropagated through the network to compute the rate of change of the error with respect to the weights of the network."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"In practice, error gradients $\\frac{\\partial \\epsilon}{\\partial z_j}$ are computed and backpropagated, from which the error gradients with respect to the weights, $\\frac{\\partial \\epsilon}{\\partial \\theta_{ij}}$, can be computed."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"The backpropagation algorithm follows from a straightforward application of the chain rule."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Define network architecture and notation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture10_Images/backpropagation_architecture2.png\" width=\"400px\" style=\"display:block; margin:auto\"/>"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"Now make the network layer explicit in the notation.\n",
"\n",
"Weighted sum:\n",
"$\n",
"z_j^l = \\sum_i \\theta_{ij}^l a_i^{l-1} ,\n",
"$\n",
"where $\\theta_{ij}^l$ is the weight between node $i$ at layer $l-1$ and node $j$ at layer $l$ (note that different conventions are often used, e.g. $\\theta_{ji}^{l-1}$ for the same connection). Consider $L$ layers.\n",
"\n",
"Activations:\n",
"$\n",
"a_i^l = h(z_i^l) .\n",
"$\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Backpropagation calculations\n",
"\n",
"We want to compute\n",
"\n",
"$$\\Delta \\theta_{ij}^l = -\\eta \\frac{\\partial \\epsilon}{\\partial \\theta_{ij}^l}.$$"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"By the chain rule:\n",
"\n",
"$$\\frac{\\partial \\epsilon}{\\partial \\theta_{ij}^l}=\\frac{\\partial \\epsilon}{\\partial z_{j}^l}\\frac{\\partial z_{j}^l}{\\partial \\theta_{ij}^l}=\\frac{\\partial \\epsilon}{\\partial z_{j}^l}a_{i}^{l-1}=\\delta_{j}^l a_{i}^{l-1}, $$\n",
"\n",
"where $\\delta_j^l = \\frac{\\partial \\epsilon}{\\partial z_{j}^l}$\n",
"\n",
"$\\left(\n",
"\\text{recall}\\ \n",
"z_j^l = \\sum_i \\theta_{ij}^l a_i^{l-1}\n",
"\\right)$."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"We now need to compute\n",
"\n",
"$$\\delta_j^l = \\frac{\\partial \\epsilon}{\\partial z_{j}^l} .$$"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"#### Functional dependence\n",
"\n",
"<img src=\"https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture10_Images/backpropagation_functional_dependence2.png\" width=\"500px\" style=\"display:block; margin:auto\"/>\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"By the chain rule again:\n",
"\n",
"$$\\delta_j^l = \\frac{\\partial \\epsilon}{\\partial z_{j}^l} = \\sum_i \\frac{\\partial \\epsilon}{\\partial z_{i}^{l+1}} \\frac{\\partial z_{i}^{l+1}}{\\partial a_{j}^l} \\frac{\\partial a_{j}^l}{\\partial z_{j}^l} = \\sum_i \\delta_i^{l+1} \\theta_{ji}^{l+1} h^\\prime(z_j^l).$$"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"Note the term $h^\\prime(z_j^l)$ is independent of $i$ and so can be moved outside the summation."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"Boundary condition (at the output layer $L$):\n",
"\n",
"$$\\delta_j^L = \\frac{\\partial \\epsilon}{\\partial z_{j}^L} = \\frac{\\partial \\epsilon}{\\partial a_{j}^L} \\frac{\\partial a_{j}^L}{\\partial z_{j}^L} = \\frac{\\partial \\epsilon}{\\partial a_{j}^L} h^\\prime(z_j^L).$$"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Summary of backpropagation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For the current set of weights $\\theta_{ij}^l$, compute a forward pass through the network:\n",
"\n",
"$$z_j^l = \\sum_i \\theta_{ij}^l a_i^{l-1} ,$$\n",
"\n",
"$$a_i^l = h(z_i^l) .$$"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"Propagate errors backwards through the network:\n",
"\n",
"$$\\delta_j^l = \\frac{\\partial \\epsilon}{\\partial z_{j}^l}=\\sum_i \\delta_i^{l+1} \\theta_{ji}^{l+1} h^\\prime(z_j^l) .$$"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"Compute derivatives of the error with respect to the weights:\n",
"\n",
"$$\\frac{\\partial \\epsilon}{\\partial \\theta_{ij}^l} = \\delta_j^l a_i^{l-1}.$$"
]
},
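{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"A minimal NumPy sketch of one forward and backward pass through a tiny two-layer network, following the equations above (not part of the original lecture; sigmoid activations, a squared-error cost and the layer sizes are illustrative, and bias terms are omitted for brevity)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def sigmoid(z):\n",
"    return 1.0 / (1.0 + np.exp(-z))\n",
"\n",
"rng = np.random.default_rng(0)\n",
"x = rng.normal(size=3)             # single training instance (3 inputs)\n",
"y = np.array([1.0, 0.0])           # one-hot target (2 outputs)\n",
"\n",
"theta1 = rng.normal(size=(3, 4))   # theta_ij^1: node i of layer 0 -> node j of layer 1\n",
"theta2 = rng.normal(size=(4, 2))   # theta_ij^2: node i of layer 1 -> node j of layer 2\n",
"\n",
"# Forward pass: z_j^l = sum_i theta_ij^l a_i^(l-1), a_j^l = h(z_j^l)\n",
"z1 = x @ theta1\n",
"a1 = sigmoid(z1)\n",
"z2 = a1 @ theta2\n",
"a2 = sigmoid(z2)\n",
"eps = 0.5 * np.sum((a2 - y) ** 2)  # squared-error cost\n",
"\n",
"# Backward pass: delta_j^L at the output, then propagate delta backwards\n",
"delta2 = (a2 - y) * a2 * (1.0 - a2)            # d eps / d z^2\n",
"delta1 = (theta2 @ delta2) * a1 * (1.0 - a1)   # d eps / d z^1\n",
"\n",
"# Gradients: d eps / d theta_ij^l = delta_j^l a_i^(l-1)\n",
"grad_theta2 = np.outer(a1, delta2)\n",
"grad_theta1 = np.outer(x, delta1)\n",
"print(eps, grad_theta1.shape, grad_theta2.shape)"
]
},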
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Training with backpropagation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Backpropagation simply computes the derivatives of the error with respect to the weights.\n",
"\n",
"We still need a training algorithm to update the weights given the derivatives, e.g. $\\Delta \\theta_{ij}^l = -\\eta \\frac{\\partial \\epsilon}{\\partial \\theta_{ij}^l}.$\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"Various approaches can be considered (see the sketch below):\n",
"- Online: update weights after each training instance.\n",
"- Full-batch: update weights after a full sweep through the training data.\n",
"- Mini-batch: update weights after a small sample of training instances."
]
},
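{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"A minimal sketch of these update schemes (not part of the original lecture). Here `grad_cost` stands for any hypothetical routine, such as backpropagation, that returns the cost gradient averaged over the given instances; `eta` is the learning rate."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def gradient_step(theta, grad_cost, X, Y, eta):\n",
"    # Generic update: Delta theta = -eta * (d eps / d theta), averaged over the batch\n",
"    return theta - eta * grad_cost(theta, X, Y)\n",
"\n",
"def train(theta, grad_cost, X, Y, eta=0.1, batch_size=None, n_epochs=10, seed=0):\n",
"    rng = np.random.default_rng(seed)\n",
"    m = X.shape[0]\n",
"    batch_size = m if batch_size is None else batch_size   # None -> full-batch\n",
"    for _ in range(n_epochs):\n",
"        idx = rng.permutation(m)   # shuffle instances each epoch\n",
"        for start in range(0, m, batch_size):\n",
"            batch = idx[start:start + batch_size]\n",
"            theta = gradient_step(theta, grad_cost, X[batch], Y[batch], eta)\n",
"    return theta\n",
"\n",
"# batch_size=1 gives online updates, batch_size=m full-batch, anything in between mini-batch"
]
},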
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"#### Example\n",
"\n",
"Scikit-learn now supports ANNs, but it is not intended for large-scale problems. \n",
"\n",
"It trains using some form of gradient descent, with gradients computed by backpropagation."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:20:22.781603Z",
"iopub.status.busy": "2024-01-10T00:20:22.781145Z",
"iopub.status.idle": "2024-01-10T00:20:23.161352Z",
"shell.execute_reply": "2024-01-10T00:20:23.160637Z"
}
},
"outputs": [],
"source": [
"# disable convergence warning from early stopping\n",
"from warnings import simplefilter\n",
"from sklearn.exceptions import ConvergenceWarning\n",
"simplefilter(\"ignore\", category=ConvergenceWarning)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:20:23.165008Z",
"iopub.status.busy": "2024-01-10T00:20:23.164451Z",
"iopub.status.idle": "2024-01-10T00:20:38.093599Z",
"shell.execute_reply": "2024-01-10T00:20:38.092993Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Iteration 1, loss = 0.32009978\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Iteration 2, loss = 0.15347534\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Iteration 3, loss = 0.11544755\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Iteration 4, loss = 0.09279764\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Iteration 5, loss = 0.07889367\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Iteration 6, loss = 0.07170497\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Iteration 7, loss = 0.06282111\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Iteration 8, loss = 0.05530788\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Iteration 9, loss = 0.04960484\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Iteration 10, loss = 0.04645355\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Training set score: 0.986800\n",
"Test set score: 0.970000\n"
]
}
],
"source": [
"import matplotlib.pyplot as plt\n",
"from sklearn.datasets import fetch_openml\n",
"from sklearn.neural_network import MLPClassifier\n",
"\n",
"mnist = fetch_openml('mnist_784', version=1, cache=True, parser='auto')\n",
"# rescale the data, use the traditional train/test split\n",
"X, y = mnist.data / 255., mnist.target\n",
"X_train, X_test = X[:60000], X[60000:]\n",
"y_train, y_test = y[:60000], y[60000:]\n",
"\n",
"mlp = MLPClassifier(hidden_layer_sizes=(50,), max_iter=10, alpha=1e-4,\n",
"                    solver='sgd', verbose=10, tol=1e-4, random_state=1,\n",
"                    learning_rate_init=.1)\n",
"\n",
"mlp.fit(X_train, y_train)\n",
"print(\"Training set score: %f\" % mlp.score(X_train, y_train))\n",
"print(\"Test set score: %f\" % mlp.score(X_test, y_test))"
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.18"
}
},
"nbformat": 4,
"nbformat_minor": 4
}