{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Lecture 5: Training I"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"![](https://www.tensorflow.org/images/colab_logo_32px.png)\n",
"[Run in colab](https://colab.research.google.com/drive/10z5cZZcHnp1cfCRiD4ESsdwLR7NFb5p0)"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:19:25.718881Z",
"iopub.status.busy": "2024-01-10T00:19:25.718648Z",
"iopub.status.idle": "2024-01-10T00:19:25.726287Z",
"shell.execute_reply": "2024-01-10T00:19:25.725751Z"
},
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Last executed: 2024-01-10 00:19:25\n"
]
}
],
"source": [
"import datetime\n",
"now = datetime.datetime.now()\n",
"print(\"Last executed: \" + now.strftime(\"%Y-%m-%d %H:%M:%S\"))"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Linear regression"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Linear regression mode (scalar form)\n",
"\n",
"$$\\hat{y} = \\theta_0 + \\theta_1 x_1 + ... + \\theta_n x_n$$\n",
"\n",
"- $\\hat{y}$ is the predicted value.\n",
"- $n$ is the number of features.\n",
"- $x_j$ is the $j$th feature, for $j=0,1,...,n$.\n",
"- $\\theta_j$ is the $j$th model parameter\n",
"<br>(with bias $\\theta_0$ and feature weights $\\theta_1, ..., \\theta_n)$."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Linear regression mode (vector form)\n",
"\n",
"$$\\hat{y} =\\theta^{\\rm T} x = h_\\theta(x)$$\n",
"\n",
"- $x$ is the instance feature vector $x=(x_0, x_1, ... x_n)^{\\rm T}$, where $x_0 = 1$.\n",
"- $\\theta$ is the vector of model parameters $\\theta=(\\theta_0, \\theta_1, ... \\theta_n)^{\\rm T}$.\n",
"- $\\cdot^{\\rm T}$ denotes transpose.\n",
"- $h_\\theta(x)$ is the hypothesis function, with parameters $\\theta$."
]
},
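{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"A minimal sketch of evaluating the model with NumPy (the numbers below are hypothetical, not from a fitted model): with the bias term $x_0 = 1$ included in the feature vector, the prediction reduces to a single dot product."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"import numpy as np\n",
"theta_demo = np.array([4.0, 3.0, 0.5]) # [theta_0, theta_1, theta_2], hypothetical values\n",
"x_demo = np.array([1.0, 2.0, 1.5]) # x_0 = 1 (bias), then features x_1, x_2\n",
"y_hat = theta_demo @ x_demo # hat{y} = theta^T x\n",
"y_hat"
]
},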
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Train linear regression\n",
"\n",
"Give training data $\\{ x^{(i)}, y^{(i)} \\}$, for $m$ instances $i=1,2,...,m$.\n",
"\n",
"Minimise mean square error (MSE):\n",
"\n",
"$$\\text{MSE} = \\frac{1}{m} \\sum_{i=1}^{m} \\left(\\theta^{\\rm T} x^{(i)} - y^{(i)}\\right)^2 .$$"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Concise matrix-vector notation\n",
"\n",
"Recall:\n",
"- $x_j^{(i)}$ is (scalar) value of feature $j$ in $i$th training example.\n",
"- $x^{(i)}$ is the (column) vector of features of the $i$th training example.\n",
"- $y^{(i)}$ is the (scalar) value of target of the $i$th training example.\n",
"- $n$ features and $m$ training instances\n",
"\n",
"Define:\n",
"- Feature matrix: $X_{m \\times n} = [ x^{(1)},\\ x^{(2)},\\ ...,\\ x^{(m)}]^{\\rm T}$.\n",
"- Target vector: $y_{m \\times 1} = [ y^{(1)},\\ y^{(2)},\\ ...,\\ y^{(m)}]^{\\rm T}$."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Recall features matrix and target vector\n",
"\n",
"<img src=\"https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture05_Images/data-layout.png\" width=\"500\" style=\"display:block; margin:auto\"/>\n",
"\n",
"[Image source](https://github.com/jakevdp/sklearn_tutorial)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"Minimising the MSE is equivalent to minimising the cost function\n",
"\n",
"$$C(\\theta) = \\frac{1}{m} (X \\theta - y)^{\\rm T}(X \\theta - y),$$\n",
"\n",
"where for notational convenience we denote the dependence on $\\theta$ only and consider $X$ and $y$ fixed."
]
},
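{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"A minimal sketch of evaluating $C(\\theta)$ with NumPy, on a hypothetical $m=3$, $n=1$ example with the bias column $x_0 = 1$ included in $X$:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"X_demo = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.0]]) # m=3 instances, bias column first\n",
"y_demo = np.array([[5.0], [8.0], [10.0]])\n",
"theta_demo = np.array([[4.0], [3.0]])\n",
"residual = X_demo @ theta_demo - y_demo # X theta - y\n",
"cost = (residual.T @ residual).item() / len(X_demo) # (1/m) (X theta - y)^T (X theta - y)\n",
"cost"
]
},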
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Normal equations\n",
"\n",
"Minimise the cost function analytically \n",
"\n",
"$$\\min_\\theta\\ C(\\theta) = \\min_\\theta \\ (X \\theta - y)^{\\rm T}(X \\theta - y).$$\n",
"\n",
"Solution given by \n",
"\n",
"$$ \\hat{\\theta} = \\left( X^{\\rm T} X \\right)^{-1} X^{\\rm T} y. $$"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:19:25.767150Z",
"iopub.status.busy": "2024-01-10T00:19:25.766614Z",
"iopub.status.idle": "2024-01-10T00:19:26.182564Z",
"shell.execute_reply": "2024-01-10T00:19:26.181848Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"# Common imports\n",
"import os\n",
"import numpy as np\n",
"np.random.seed(42) # To make this notebook's output stable across runs\n",
"\n",
"# To plot pretty figures\n",
"%matplotlib inline\n",
"import matplotlib\n",
"import matplotlib.pyplot as plt\n",
"plt.rcParams['axes.labelsize'] = 14\n",
"plt.rcParams['xtick.labelsize'] = 12\n",
"plt.rcParams['ytick.labelsize'] = 12"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:19:26.186110Z",
"iopub.status.busy": "2024-01-10T00:19:26.185473Z",
"iopub.status.idle": "2024-01-10T00:19:26.426019Z",
"shell.execute_reply": "2024-01-10T00:19:26.425332Z"
}
},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAw0AAAIbCAYAAACpGXLSAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjcuNCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8WgzjOAAAACXBIWXMAAA9hAAAPYQGoP6dpAABCYUlEQVR4nO3deXxU5b3H8e8kSFgTiqAsCYsQtbKjgoDFaKvBIrgCVaxQYoMWwRUpFS8KAqVaV1TE5iVWpepFvbXiragXN4oogtZesVBtMENAZMvIkgDJc/+Ym8BkOZlMZuZsn/frlRevnJzJPHPmZHi+5/k9zwkYY4wAAAAAoA4pdjcAAAAAgLMRGgAAAABYIjQAAAAAsERoAAAAAGCJ0AAAAADAEqEBAAAAgCVCAwAAAABLhAYAAAAAlprY3YBEqaioUHFxsVq3bq1AIGB3cwAAAIC4MMbo+++/V6dOnZSSkpwxAM+GhuLiYmVlZdndDAAAACAhioqKlJmZmZTn8mxoaN26taTwwUxPT7e5NQAAAEB8hEIhZWVlVfV3k8GzoaGyJCk9PZ3QAAAAAM9JZgk+E6EBAAAAWCI0AAAAALBEaAAAAABgidAAAAAAwBKhAQAAAIAlQgMAAAAAS4QGAAAAAJYIDQAAAAAsERoAAAAAWCI0AAAAALBEaAAAAABgidAAAAAAwBKhAQAAAIAlQgMAAAAAS4QGAAAAAJYIDQAAAAAsERoAAAAAWCI0AAAAALBEaAAAAABgidAAAAAAwBKhAQAAAIAlQgMAAAAAS0kLDfv27dPs2bM1YsQItW3bVoFAQEuXLrV8zOHDh3XaaacpEAjovvvuS05DAQAAAERIWmjYuXOn5syZo40bN6pfv35RPeaRRx7RN998k+CWAQAAALCStNDQsWNHbdu2TVu2bNG9995b7/47duzQnDlzNGPGjCS0DgAAAEBdkhYa0tLS1KFDh6j3//Wvf61TTjlFV199dQJbBQAAAKA+TexuQG0++ugjPf300/rggw8UCATsbg4AAADga44LDcYYTZ06VePGjdOQIUNUWFgY1ePKyspUVlZW9X0oFEpQCwEAAAB/cdySq0uXLtXnn3+uhQsXNuhxCxYsUEZGRtVXVlZWgloIAAAA+IujQkMoFNLMmTM1ffr0Bnf6Z86cqZKSkqqvoqKiBLUSAAAA8BdHlSfdd999OnTokMaNG1dVlhQMBiVJe/bsUWFhoTp16qSmTZvWeGxaWprS0tKS2VwAAADAFxw10vDNN99oz5496tWrl7p3767u3bvrRz/6kSRp/vz56t69u7744gubWwkAAAD4i6NGGqZNm6ZLLrkkYtuOHTs0efJkTZw4URdffLG6d+9uT+MAAAAAn0pqaFi0aJH27t2r4uJiSdJf/vKXqvKjqVOnauDAgRo4cGDEYyrLlHr16lUjUAAAAABIvKSGhvvuu09btmyp+v7ll1/Wyy+/LEm6+uqrlZGRkczmAAAAAIhCUkNDtPdcOFa3bt1kjIl/YwAAAABExVEToQEAAAA4D6EBAAAAgCVCAwAAAABLhAYAAAAAlggNAAAAACwRGgAAAABYIjQAAAAAsERoAAAAAGCJ0AAAAADAEqEBAAAAgCVCAwAAAABLhAYAAAAAlggNAAAAACwRGgAAAABYIjQAAAAAsERoAAAAAGCJ0AAAAADAEqEBAAAAgCVCAwAAAABLhAYAAAAAlggNAAAAACwRGgAAAABYIjQAAAAAsERoAAAAAGCJ0AAAAADAEqEBAAAAgCVCAwAAAABLhAYAAAAAlggNAAAAACwRGgAAAABYIjQAAAAAsERoAAAAAGCJ0AAAAADAEqEBAAAAgCVCAwAAAABLhAYAAAAAlggNAAAAACwRGgAAAABYIjQAAAAAsERoAAAAAGCJ0AAAAADAEqEBAAAAgCVCAwAAAABLhAYAAAAAlggNAAAAACwRGgAAAABYIjQAAAAAsJS00LBv3z7Nnj1bI0aMUNu2bRUIBLR06dKIfSoqKrR06VKNHj1aWVlZatmypXr37q177rlHpaWlyWoqAAAAgGMkLTTs3LlTc+bM0caNG9WvX79a9zlw4IB+8Ytf6LvvvtN1112nBx98UIMGDdLs2bN14YUXyhiTrOYCAAAA+H9NkvVEHTt21LZt29ShQwetW7dOZ555Zo19mjZtqtWrV2vo0KFV2375y1+qW7dumj17tt5++2395Cc/SVaTAQAAACiJIw1paWnq0KGD5T5NmzaNCAyVLr30UknSxo0bE9I2AAAAAHVzxUTo7du3S5LatWtnc0sAAAAA/0laeVJj/O53v1N6erouvPDCOvcpKytTWVlZ1fehUCgZTQMAAAA8z/EjDfPnz9dbb72l3/72t2rTpk2d+y1YsEAZGRlVX1lZWclrJAAAAOBhjg4NL7zwgmbNmqW8vDxdf/31lvvOnDlTJSUlVV9FRUVJaiUAAADgbY4tT3rzzTd1zTXXaOTIkVq8eHG9+6elpSktLS0JLQMAAAD8xZEjDWvXrtWll16qM844Qy+++KKaNHFstgEAAAA8z3GhYePGjRo5cqS6deum1157Tc2bN7e7SQAAAICvJfUS/qJFi7R3714VFxdLkv7yl78oGAxKkqZOnaqUlBTl5uZqz549mj59ulasWBHx+B49emjIkCHJbDIAAADgewFjjEnWk3Xr1k1btmyp9Wf//ve/JUndu3ev8/ETJkzQ0qVLo3quUCikjIwMlZSUKD09vcFtBQAAAJzIjn5uUkcaCgsL690niRkGAAAAQBQcN6cBAAAAgLMQGgAAAABYIjQAAAAAsERoAAAAAGCJ0AAAAADAEqEBAAAAgCVCAwAAAABLhAYAAAAAlggNAAAAACwRGgAAAABYIjQAAAAAsERoAAAAAGCJ0AAAAADAEqEBAAAAgCVCAwAAAABLhAYAAAAAlggNAAAAACwRGgAAAABYIjQAAAAAsERoAAAAQIMEg9KqVeF/4Q+EBgAAAEStoEDq2lU677zwvwUFdrcIyUBoAAAAQFSCQSk/X6qoCH9fUSFNnsyIgx8QGgAAABCVzZuPBoZK5eXSv/5lT3uQPIQGAAAARCU7W0qp1ntMTZV69rSnPUgeQgMAAACikpkpLVkSDgpS+N8nnghvh7c1sbsBAAAAcI+8PCk3N1yS1LMngcEvCA0AAABokMxMwoLfUJ4EAAAAwBKhAQAAAIAlQgMAAADgEE692zahAQAAAHAAJ99tm9AAAAAA2Mzpd9smNAAAAAA2c/rdtgkNAAAAgM2cfrdtQgMAAABgM6ffbZubuwEAAAAO4OS7bRMaAAAAAIdw6t22KU8CAAAAYInQAAAAAMASoQEAAACAJUIDAAAAAEuEBgAAAACWCA0AAABIumBQWrUq/C+cj9AAAACApCookLp2lc47L/xvQYHdLUJ9CA0AAABImmBQys+XKirC31dUSJMnN3zEgZGK5CI0AAAAx6FD6F2bNx8NDJXKy8N3QY4WIxXJR2gAAACOQofQ27KzpZRqPdDUVKlnz+geH6+RimTxSgAmNAAAA
MdwW4cQDZeZKS1ZEg4KUvjfJ54Ib49GPEYqksVLAThpoWHfvn2aPXu2RowYobZt2yoQCGjp0qW17rtx40aNGDFCrVq1Utu2bfXzn/9c3333XbKaCgAAbOKmDiFil5cnFRaGr8AXFoa/j1ZjRyqSxWsBOGmhYefOnZozZ442btyofv361blfMBjU8OHD9a9//Uvz58/XbbfdphUrVuj888/XoUOHktVcAABgA7d0CNF4mZlSTk70IwzHPq4xIxXJ4rUA3CRZT9SxY0dt27ZNHTp00Lp163TmmWfWut/8+fO1f/9+ffLJJ+rSpYskadCgQTr//PO1dOlS5efnJ6vJAAAgySo7hJMnhztYTu0Qwl55eVJubrgD3rOnM8+PygB8bHBwcwBO2khDWlqaOnToUO9+L730ki666KKqwCBJP/nJT3TyySfrxRdfTGQTAQCAAzSmdAX+EetIRbK4ZUQkWkkbaYjG1q1btWPHDp1xxhk1fjZo0CC9/vrrNrQKAAAkW2ameztXQCU3jIhEy1GhYdu
"text/plain": [
"<Figure size 900x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import numpy as np\n",
"m = 100\n",
"X = 2 * np.random.rand(m, 1)\n",
"y = 4 + 3 * X + np.random.randn(m, 1)\n",
"plt.figure(figsize=(9,6))\n",
"plt.plot(X, y, \"b.\")\n",
"plt.xlabel(\"$x_1$\", fontsize=18)\n",
"plt.ylabel(\"$y$\", rotation=0, fontsize=18)\n",
"plt.axis([0, 2, 0, 15]);"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": [
"exercise_pointer"
]
},
"source": [
"**Exercises:** *You can now complete Exercise 1 in the exercises associated with this lecture.*"
]
},
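{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Solve using the normal equations\n",
"\n",
"A minimal sketch using NumPy and the `X`, `y` and `m` generated above (treat this as a reference after attempting the exercise):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X_b = np.c_[np.ones((m, 1)), X] # add x0 = 1 to each instance\n",
"theta_best = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y # (X^T X)^{-1} X^T y\n",
"theta_best # close to the true parameters (4, 3), up to noise"
]
},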
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Solve using Scikit-Learn"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:19:26.430231Z",
"iopub.status.busy": "2024-01-10T00:19:26.428868Z",
"iopub.status.idle": "2024-01-10T00:19:26.811959Z",
"shell.execute_reply": "2024-01-10T00:19:26.811220Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"(array([4.21509616]), array([[2.77011339]]))"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.linear_model import LinearRegression\n",
"lin_reg = LinearRegression()\n",
"lin_reg.fit(X, y)\n",
"lin_reg.intercept_, lin_reg.coef_"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:19:26.815257Z",
"iopub.status.busy": "2024-01-10T00:19:26.814784Z",
"iopub.status.idle": "2024-01-10T00:19:26.820509Z",
"shell.execute_reply": "2024-01-10T00:19:26.819869Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"array([[4.21509616],\n",
" [9.75532293]])"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_new = np.array([[0], [2]])\n",
"lin_reg.predict(X_new)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Computational complexity\n",
"\n",
"Solving the normal equation requires inverting $X^{\\rm T} X$, which is an $n \\times n$ matrix (where $n$ is the number of features).\n",
"\n",
"Inverting an $n \\times n$ matrix is naively of complexity $\\mathcal{O}(n^3)$ ($\\mathcal{O}(n^{2.4})$ if certain fast algorithms are used).\n",
"\n",
"Also, typically requires all training instances to be held in memory at once.\n",
"\n",
"Other methods are required to handle large data-sets, i.e. when considering large numbers of features and large numbers of training instances."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Batch gradient descent"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture05_Images/optimization_gd.png\" width=\"400px\" style=\"display:block; margin:auto\"/>\n",
"\n",
"[[Image source](http://www.holehouse.org/mlclass)]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"Gradient of the cost function is given by\n",
"\n",
"$$ \\frac{\\partial}{\\partial \\theta} C(\\theta)\n",
"=\\frac{2}{m} X^{\\rm T} \\left( X \\theta - y \\right) \n",
"= \\left[\\frac{\\partial C}{\\partial \\theta_0},\\ \\frac{\\partial C}{\\partial \\theta_1},\\ ...,\\ \\frac{\\partial C}{\\partial \\theta_n} \\right]^{\\rm T} \n",
".\n",
"$$"
]
},
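{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"As a quick sanity check (a sketch on small hypothetical data), the analytic gradient can be compared against a central finite-difference approximation:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"rng = np.random.default_rng(0)\n",
"X_chk = np.c_[np.ones((5, 1)), rng.random((5, 1))] # small demo design matrix\n",
"y_chk = rng.random((5, 1))\n",
"theta_chk = rng.random((2, 1))\n",
"\n",
"def cost(theta):\n",
"    r = X_chk @ theta - y_chk\n",
"    return (r.T @ r).item() / len(X_chk)\n",
"\n",
"analytic = 2 / len(X_chk) * X_chk.T @ (X_chk @ theta_chk - y_chk)\n",
"eps = 1e-6\n",
"numeric = np.zeros_like(theta_chk)\n",
"for j in range(len(theta_chk)):\n",
"    step = np.zeros_like(theta_chk)\n",
"    step[j] = eps\n",
"    numeric[j] = (cost(theta_chk + step) - cost(theta_chk - step)) / (2 * eps)\n",
"print(np.allclose(analytic, numeric)) # True: the two gradients agree"
]
},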
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"Batch gradient descent algorithm defined by taking a step $\\alpha$ in the direction of the gradient:\n",
"\n",
"$$\\theta^{(t)} = \\theta^{(t-1)} - \\alpha \\frac{\\partial C}{\\partial \\theta},$$\n",
"\n",
"where $t$ denotes the iteration number."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"This algorithm is called *batch* gradient descent since it uses the full *batch* of training data ($X$) at each iteration. We will see other forms of gradient descent soon..."
]
},
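{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"A minimal sketch of the full algorithm in NumPy, using the `X`, `y` and `m` from above (the choice of the step size $\\alpha$ is discussed in the next section):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"alpha = 0.1 # learning rate\n",
"n_iterations = 1000\n",
"X_b = np.c_[np.ones((m, 1)), X] # add x0 = 1 to each instance\n",
"theta = np.random.randn(2, 1) # random initialisation\n",
"for iteration in range(n_iterations):\n",
"    gradients = 2 / m * X_b.T @ (X_b @ theta - y) # gradient of the cost\n",
"    theta = theta - alpha * gradients # step against the gradient\n",
"theta # approaches the normal-equation solution"
]
},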
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": [
"exercise_pointer"
]
},
"source": [
"**Exercises:** *You can now complete Exercise 2 in the exercises associated with this lecture.*"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Learning rate\n",
"\n",
"The step size $\\alpha$ is also called the *learning rate*."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Good learning rate\n",
"\n",
"<img src=\"https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture05_Images/bgd_alpha_good.png\" width=\"600px\" style=\"display:block; margin:auto\"/>\n",
"\n",
"[Source: Geron]\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Learning rate too small \n",
"\n",
"Convergence will be slow.\n",
"\n",
"<img src=\"https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture05_Images/bgd_alpha_small.png\" width=\"600px\" style=\"display:block; margin:auto\"/>\n",
"\n",
"[Source: Geron]\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Learning rate too large \n",
"\n",
"May not converge.\n",
"\n",
"<img src=\"https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture05_Images/bgd_alpha_large.png\" alt=\"data-layout\" width=\"600px\" style=\"display:block; margin:auto\"/>\n",
"\n",
"[Source: Geron]\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Evolution of fitted curve \n",
"\n",
"Consider the evolution of the curve corresponding to the best fit parameters for the first 10 iterations for different learning rates $\\alpha$."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:19:26.824434Z",
"iopub.status.busy": "2024-01-10T00:19:26.823834Z",
"iopub.status.idle": "2024-01-10T00:19:26.831253Z",
"shell.execute_reply": "2024-01-10T00:19:26.830646Z"
}
},
"outputs": [],
"source": [
"theta_path_bgd = []\n",
"\n",
"X_b = np.c_[np.ones((m, 1)), X] # add x0 = 1 to each instance\n",
"X_new_b = np.c_[np.ones((2, 1)), X_new] \n",
"\n",
"def plot_gradient_descent(theta, alpha, theta_path=None):\n",
" m = len(X_b)\n",
" plt.plot(X, y, \"b.\")\n",
" n_iterations = 1000\n",
" for iteration in range(n_iterations):\n",
" if iteration < 10:\n",
" y_predict = X_new_b.dot(theta)\n",
" style = \"b-\" if iteration > 0 else \"r--\"\n",
" plt.plot(X_new, y_predict, style)\n",
" gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y)\n",
" theta = theta - alpha * gradients\n",
" if theta_path is not None:\n",
" theta_path.append(theta)\n",
" plt.xlabel(\"$x_1$\", fontsize=18)\n",
" plt.axis([0, 2, 0, 15])\n",
" plt.title(r\"$\\alpha = {}$\".format(alpha), fontsize=16)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:19:26.834198Z",
"iopub.status.busy": "2024-01-10T00:19:26.833760Z",
"iopub.status.idle": "2024-01-10T00:19:27.369568Z",
"shell.execute_reply": "2024-01-10T00:19:27.368907Z"
}
},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAA1UAAAGZCAYAAABhShsgAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjcuNCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8WgzjOAAAACXBIWXMAAA9hAAAPYQGoP6dpAADNIElEQVR4nOydd3gU9fPH35eQhFASQi+h996rVEUQC6iAKCKgKGChKCCgKE1FRbGhFEHATvsi2KmiKEpHkCpSEnoJCS315vfH/Jbd63v99m5ez3NPLrfts5vbyWd2Zt5jIiKCIAiCIAiCIAiC4BFRwR6AIAiCIAiCIAiCkRGnShAEQRAEQRAEwQvEqRIEQRAEQRAEQfACcaoEQRAEQRAEQRC8QJwqQRAEQRAEQRAELxCnShAEQRAEQRAEwQvEqRIEQRAEQRAEQfACcaoEQRAEQRAEQRC8QJwqQRAEQRAEQRAELxCnSjAES5cuRceOHZGUlISCBQuiYcOGePPNN5GTkxOw/ebk5GDdunUYM2YMmjdvjiJFiiAmJgalS5dG9+7d8f3333s1FkEQgoOv7cvBgwfxwQcfYODAgahfvz7y5csHk8mEV155xccjFwQhVPC1HRk4cCBMJpPTV2Zmpo/PQvAGExFRsAchCM4YOXIk3nvvPeTLlw+33norChUqhPXr1+Py5cto27YtVq9ejfj4eL/vd+3atbj99tsBAKVLl0bTpk1RsGBB7Nu3D3v37gUADB48GLNnz4bJZPLNyQuC4Ff8YV+UfVozdepUTJgwwVdDFwQhRPCHHRk4cCAWLVqEW265BdWqVbO7zscff4yYmBhfnILgC0gQQpgVK1YQACpUqBBt37795ufnz5+n+vXrEwAaNWpUQPa7bt066tmzJ/366682+/v6668pOjqaANCiRYvcHo8gCIHHX/bl448/ptGjR9MXX3xB+/fvp0ceeYQA0NSpU305fEEQQgB/2ZEBAwYQAFqwYIEPRyv4E3GqhJCmefPmBIBeeeUVm2W//fYbAaC4uDi6fPly0Pc7aNAgAkC33XabW2MRBCE4+Mu+WKNMjsSpEoTww192RJwq4yE1VQIOHz6MwYMHo2LFisifPz+qVauGyZMn38wDvv/++5E/f34cP348oOM6efIktm7dCgDo27evzfK2bduifPnyyMrKwg8//BD0/TZu3BgAkJKSonsbQQh3Is2+CILge8SOCEZAnKoIZ/78+WjQoAEWLFiAmjVr4o477sCpU6cwadIkvPfee9ixYwdWrFiBoUOHomLFigEd286dOwEARYsWReXKle2u06xZM4t1g7nfw4cPAwDKlCmjextBCGci0b4IguBbIt2ObNiwAaNGjcLgwYMxfvx4rFixAllZWZ4NWPAr+YI9ACF4LF++HE888QQSExOxdu1aNG3aFACwevVqdO3aFStXrsQvv/yCwoUL48UXX3S6L6Wg0l02bNiAjh072l129OhRAECFChUcbl++fHmLdfXgj/2eOXMGCxcuBAD07NlT91gEIVyJVPsiCILvEDsCfPrppzaflSlTBp988gnuuOMOj/Yp+AdxqiKUrKwsDBs2DESEGTNm3DRUANClSxcUKFAA27ZtQ2ZmJiZOnIgSJUo43V/btm09Gkfp0qUdLrty5QoAoGDBgg7XKVSoEAAgIyND9zF9vd/c3Fz069cP6enpqF+/PoYMGaJ7LIIQjkSyfREEwTdEuh1p2LAh3nvvPdx2222oUKECbty4gd27d2PSpEn4448/0L17d6xevdqhwycEHnGqIpQVK1bg9OnTqFOnDh599FGb5UlJSTh58iRKlCiBUaNGudzf448/jscff9wfQw15hg4dinXr1qFYsWJYtmwZYmNjgz0kQQgqYl8EQfCWSLcjzz77rMXvhQsXxu23347OnTvjvvvuw8qVKzFy5Ejs2rUrOAMUbJCaqghFKZjs1auX0/VeeOEFFC5cOBBDskE57rVr1xyuc/XqVQBAQkJCUPY7YsQIzJ8/H0lJSVizZg1q1KihexyCEK5Esn0RBME3iB2xj8lkwuTJkwEAu3fvFnGsEEIiVRGKolbToUMHm2U5OTm4fv06SpQogSeffFLX/ubNm4dNmza5PY5x48ahVq1adpdVqlQJgHM1PWWZsq4efLXfUaNG4f3330eRIkWwevXqm+p/ghDpRLJ9EQTBN4gdcUzt2rVvvk9NTb1ZtyUEF3GqIhRFdrRcuXI2y9555x2kpaWhbt26iIuL07W/TZs2eVQAOnDgQIfGSnFSLl68iKNHj9pV1tm2bRsAoEmTJrqP6Yv9Pv/885gxYwYSExOxevXqm+o+giBEtn0RBME3iB1xzMWLF2++D1aUTrBF0v8ilKgo/tNfvnzZ4vOjR49i6tSpAIDo6Gjd+1u4cCGIm0m79XJWYJmcnIzmzZsDAL788kub5Zs2bUJKSgri4uJw55136h6rt/sdN24cpk+fjsTERKxZs+bmvgRBYCLZvgiC4BvEjjjm66+/BsAphTVr1vTZfgUv8VdXYSG0adWqFQGgRx55hMxmMxERXbhwgRo3bkwmk4liYmKoaNGidP369aCOc8WKFQSAChUqRNu3b7/5+YULF6h+/foEgEaNGmV323HjxlHNmjVp3LhxPtvviy++SACoSJEitGXLFh+coSCEH5FuX6wZMGAAAaCpU6f6bOyCEO5Esh3ZuXMnrVy5knJyciw+z8vLo3nz5lH+/PkJAE2YMMH3JyR4jDhVEcrSpUsJAAGgRo0aUe/evalYsWIEgKZPn06tW7cmANSiRQt67733gjrW4cOHEwCKiYmhO+64g3r27ElFihQhAHTLLbc4NKjKRGbAgAE+2e/KlStvXrNmzZrRgAED7L4cGU9BiBQi3b5s376dWrZsefNVvHhxAkDJyckWn586dcrPZycIxiWS7YjiqCUlJdFtt91Gffv2pTvvvJMqVKhw85o89NBDNk6XEFzEqYpgvvrqK2revDkVKFCA8ufPT7Vr16avvvqKiIh27dpFDRs2JJPJRGPHjg3ySIkWL15M7du3p4SEBIqPj6d69erR66+/TllZWQ63ceVUubvfBQsW3DRmzl4VK1b0wRkLgrGJZPuyYcMGXbbi6NGj/jspQQgDItWO/PfffzRy5Ehq27YtlStXjvLnz09xcXFUoUIF6tWrF33//fd+PhvBE0xERN4nEQqCIAiCIAiCIEQmIlQhCIIgCIIgCILgBeJUCYIgCIIgCIIgeIE4VYIgCIIgCIIgCF4gTpUgCIIgCIIgCIIXiFMlCIIgCIIgCILgBeJUCYIgCIIgCIIgeEG+YA/AX5jNZpw6dQqFCxeGyWQK9nAEIaIhIly5cgVly5ZFVJRxnuWIHRGE0MGIdkRsiCCEDv62IWHrVJ06dQrly5cP9jAEQdCQkpKC5OTkYA9DN2JHBCH0MJIdERsiCKGHv2xI2DpVhQsXBsAXLiEhIcijEYTIJiMjA+XLl795XxoFsSOCEDoY0Y6IDbHko4+A8eOBIkWA7duB4sX1b7twITBiBBAXB/zxB1Ctm
v5ts7KA224D9uwBbr0VWL4ccCdQceYM0LYtcP488PDDfB7u8MMPwEMP8fuPPuJ9CIHH3zYkbJ0qJcyekJAghkwQQgSjpb+IHRGE0MNIdkRsiEpqKvDqq/x++nSgShX92544AUyYwO9few1o0sS9Y48ezQ5V8eLA55+zU6eX3Fxg8GB2qOrXB+bOBQoU0L/9gQO8PQA88wzw5JNuDV3wA/6yIcZIShYEQRAEQRAMy8iRwNWrQOvWwGOP6d+OCHj8ceDKFaBNG45WucOaNcDbb/P7+fOBMmXc2/7ll4GNG4FChYClS91zqDIygHvv5bG3awfMmOHesQVjIU6VIAiCIAiC4Dd++IFT7qKjgdmz3Uu9mz+fHaP8+YFPPuF96OXCBWDAAH7/5JNA9+7ujfv774Fp09Rx1Kypf1uzGXjkEeDgQSA5mR2ymBj3ji8YC3GqBEEQBEEQBL9w/TqnvQHAs88CDRro3/bECeC55/j9K6+459QQAYMGAadPA7VrA2+9pX9bADh+nJ0igMf/wAPubf/
"text/plain": [
"<Figure size 1000x400 with 3 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"np.random.seed(42)\n",
"theta = np.random.randn(2,1) # random initialization\n",
"\n",
"plt.figure(figsize=(10,4))\n",
"plt.subplot(131); plot_gradient_descent(theta, alpha=0.02)\n",
"plt.ylabel(\"$y$\", rotation=0, fontsize=18)\n",
"plt.subplot(132); plot_gradient_descent(theta, alpha=0.1)\n",
"plt.subplot(133); plot_gradient_descent(theta, alpha=0.5)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"- $\\alpha=0.02$: learning rate is too small and will take a long time to reach the global optimum.\n",
"- $\\alpha=0.1$: learning rate is good and global optimum will be found quickly.\n",
"- $\\alpha=0.5$: learning rate is too large and parameters jump about (may not converge)."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Convergence"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Cost function may not be nice with a single local minimum that coincides with the global minimum.\n",
"\n",
"<img src=\"https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture05_Images/bgd_pitfalls.png\" alt=\"data-layout\" width=\"600px\" style=\"display:block; margin:auto\"/>\n",
"\n",
"[Source: Geron]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Convex objective function\n",
"\n",
"MSE cost function for linear regression is convex (for any two points on the curve, the line segment connecting those points lies above the curve).\n",
"\n",
"Consequently, the cost function has a single local minimum (that therefore coincides with the global minimum).\n",
"\n",
"Furthermore, the cost function is smooth with a slope that never changes abruptly (i.e. it is Lipschitz continuous). \n",
"\n",
"Gradient descent is therefore guaranteed to converge to the global minimum."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Feature scaling\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For gradient descent to work well it is critical for all features to have a similar scale."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture05_Images/bgd_feature_scaling.png\" alt=\"data-layout\" width=\"800px\" style=\"display:block; margin:auto\"/>\n",
"\n",
"[Source: Geron]\n",
"\n",
"In the case on the left, the features have the same scale and even intially the solution moves towards the minimum, thus reaching it quickly.\n",
"\n",
"In the case on the right, the features have different scales and initially the solution does not move directly towards the minimum. Convergence is therefore slow."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Feature scaling methods\n",
"\n",
"In general should ensure features have a similar scale since many machine learning methods do not work well when numerical attributes have very different scales.\n",
"\n",
"#### Min-max scaling\n",
"\n",
"Min-max scaling given by\n",
"\\begin{align*}\n",
"x_j \\rightarrow \\frac{x_j - \\min_i x_i}{\\max_i x_i - \\min_i x_i} .\n",
"\\end{align*}\n",
"See Scikit-Learn [`MinMaxScalar`](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html).\n",
"\n",
"\n",
"#### Standardization\n",
"\n",
"Standardization scaling given by\n",
"\\begin{align*}\n",
"x_j \\rightarrow \\frac{x_j - \\mu_j}{\\sigma_j},\n",
"\\end{align*}\n",
"where $\\mu_j$ and $\\sigma_j$ are the mean and standard deviation of feature $j$ computed from the training instances.\n",
"See Scikit-Learn [`StandardScaler`](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)."
]
}
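,
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"A minimal sketch of both scalers, assuming the `X` generated earlier (note that in practice a scaler should be fit on the training set only and then applied to the test set):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.preprocessing import MinMaxScaler, StandardScaler\n",
"X_minmax = MinMaxScaler().fit_transform(X) # each feature mapped to [0, 1]\n",
"X_standard = StandardScaler().fit_transform(X) # each feature to zero mean, unit variance\n",
"print(X_minmax.min(), X_minmax.max())\n",
"print(X_standard.mean(), X_standard.std())"
]
}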
],
"metadata": {
"celltoolbar": "Slideshow",
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.18"
}
},
"nbformat": 4,
"nbformat_minor": 4
}