{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "source": [
    "# Lecture 18: Dimensionality reduction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "skip"
    },
    "tags": []
   },
   "source": [
    "\n",
    "[Run in colab](https://colab.research.google.com/drive/19J80Hg1ZLxLrpHWbgxFZoyPfql_AdeGN)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "editable": true,
    "execution": {
     "iopub.execute_input": "2025-02-27T23:21:48.075087Z",
     "iopub.status.busy": "2025-02-27T23:21:48.074841Z",
     "iopub.status.idle": "2025-02-27T23:21:48.081076Z",
     "shell.execute_reply": "2025-02-27T23:21:48.080546Z"
    },
    "slideshow": {
     "slide_type": "skip"
    },
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Last executed: 2025-02-27 23:21:48\n"
     ]
    }
   ],
   "source": [
    "import datetime\n",
    "now = datetime.datetime.now()\n",
    "print(\"Last executed: \" + now.strftime(\"%Y-%m-%d %H:%M:%S\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "source": [
    "## Overview of dimensionality reduction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "source": [
    "* Many problems have thousands or millions of features.\n",
    "* We often want to reduce the number of dimensions, i.e. the number of features.\n",
    "* Dimensionality reduction is (typically) lossy."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "subslide"
    },
    "tags": []
   },
   "source": [
    "### Disadvantages\n",
    "\n",
    "* It may (or may not) speed up training, but it discards some information and so can degrade the resulting performance.\n",
    "* It also makes pipelines more complex.\n",
    "\n",
    "$\\Rightarrow$ You should always try using the original data before considering dimensionality reduction."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "subslide"
    },
    "tags": []
   },
   "source": [
    "### Advantages\n",
    "\n",
    "* Can make problems tractable that would otherwise have been intractable.\n",
    "* Very useful for visualization (2D, 3D representations are more intuitive)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "subslide"
    },
    "tags": []
   },
   "source": [
    "### Curse of Dimensionality\n",
    "\n",
    "Many things behave differently in high dimensional space.\n",
    "\n",
    "Our intuition is built from our experience of the 3D world and often fails in higher dimensions."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "subslide"
    },
    "tags": []
   },
   "source": [
    "For example:\n",
    "- Pick a random point in a unit square (a 1 × 1 square): it has only about a 0.4% chance of being located less than 0.001 from a border (in other words, it is very unlikely that a random point will be “extreme” along any dimension).\n",
    "- But in a 10,000-dimensional unit hypercube (a 1 × 1 × ⋯ × 1 cube, with ten thousand 1s), this probability is greater than 99.999999%. Most points in a high-dimensional hypercube are very close to the border."
   ]
  },
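The hypercube claim above can be checked directly. A minimal sketch (not part of the lecture code): the probability that at least one of $d$ independent uniform coordinates lies within 0.001 of a border is $1 - (1 - 0.002)^d$.

```python
def border_prob(d, eps=1e-3):
    """P(a uniform random point in the d-dimensional unit hypercube
    lies within eps of some face) = 1 - (1 - 2*eps)**d."""
    return 1.0 - (1.0 - 2.0 * eps) ** d

print(f"d = 2:      {border_prob(2):.4%}")       # about 0.4%
print(f"d = 10,000: {border_prob(10_000):.8%}")  # greater than 99.999999%
```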
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "subslide"
    },
    "tags": []
   },
   "source": [
    "### Two main approaches for dimensionality reduction\n",
    "\n",
    "1. Projection\n",
    "2. Manifold learning"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "subslide"
    },
    "tags": []
   },
   "source": [
    "#### Projection\n",
    "\n",
    "- In most real-world problems, training instances are not spread out uniformly across all dimensions.\n",
    "- Many features are almost constant, while others are highly correlated.\n",
    "- As a result, all training instances actually lie within (or close to) a much lower-dimensional subspace."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "subslide"
    },
    "tags": []
   },
   "source": [
    "For example, consider the following 3D data-set lying close to a 2D subspace.\n",
    "\n",
    "<img src=\"https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture18_Images/Projection.png\" alt=\"data-layout\" width=\"700\" style=\"display:block; margin:auto\"/>\n",
    "\n",
    "[Source: Geron]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "subslide"
    },
    "tags": []
   },
   "source": [
    "The data can be projected onto a 2D plane without losing too much information.\n",
    "\n",
    "We've reduced the dimension of the data from 3 to 2.\n",
    "\n",
    "<img src=\"https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture18_Images/projection_2d.png\" alt=\"data-layout\" width=\"700\" style=\"display:block; margin:auto\"/>\n",
    "\n",
    "[Source: Geron]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "subslide"
    },
    "tags": []
   },
   "source": [
    "#### Manifold learning"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "source": [
    "However, projection is not always the best approach for dimensionality reduction.\n",
    "\n",
    "Some data-sets may lie on spaces (manifolds) that twist and turn.\n",
    "\n",
    "<img src=\"https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture18_Images/swiss_roll.png\" alt=\"data-layout\" width=\"500\" style=\"display:block; margin:auto\"/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "subslide"
    },
    "tags": []
   },
   "source": [
    "##### Projection versus unrolling\n",
    "\n",
    "Projecting would squash the layers of the swiss roll together (left plot), whereas it is better to \"unroll\" the swiss roll (right plot).\n",
    "\n",
    "<img src=\"https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture18_Images/swiss_roll_unrolled.png\" alt=\"data-layout\" width=\"800\" style=\"display:block; margin:auto\"/>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "subslide"
    },
    "tags": []
   },
   "source": [
    "##### Manifold hypothesis\n",
    "\n",
    "Many dimensionality reduction algorithms work by modeling the manifold on which the training instances lie; this is called *Manifold Learning*.\n",
    "\n",
    "It relies on the manifold hypothesis.\n",
    "\n",
    "> Manifold hypothesis: most real-world high-dimensional data-sets lie close to a much lower-dimensional manifold. This assumption is very often empirically observed.\n"
   ]
  },
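For experimenting with manifold learning, scikit-learn can generate a swiss-roll data-set like the one pictured above. A short sketch (the sample count and noise level are illustrative choices, not from the lecture):

```python
from sklearn.datasets import make_swiss_roll

# X holds the 3D samples; t is the position along the roll, i.e. one
# intrinsic coordinate of the underlying 2D manifold
X, t = make_swiss_roll(n_samples=1000, noise=0.05, random_state=42)
print(X.shape)  # (1000, 3)
```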
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "source": [
    "## Principal component analysis (PCA)\n",
    "\n",
    "Most popular dimensionality reduction algorithm.\n",
    "\n",
    "1. Find the hyperplane that lies closest to the data.\n",
    "2. Project the data onto it.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "subslide"
    },
    "tags": []
   },
   "source": [
    "### Principal components"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "source": [
    "Choose directions in the data that preserve the maximum variability.\n",
    "\n",
    "<img src=\"https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture18_Images/pca_variance.png\" alt=\"data-layout\" width=\"800\" style=\"display:block; margin:auto\"/>\n",
    "\n",
    "[Source: Geron]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "subslide"
    },
    "tags": []
   },
   "source": [
    "PCA finds the direction that accounts for the largest amount of *variance* in the data-set.\n",
    "\n",
    "It then finds the next direction, orthogonal to the first, that accounts for the largest amount of the remaining variance.\n",
    "\n",
    "The process repeats, up to the full dimension of the data. Typically, however, the majority of the variance in the data is captured in many fewer dimensions than the maximum number.\n",
    "\n",
    "Each axis vector is called a principal component."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "subslide"
    },
    "tags": []
   },
   "source": [
    "Principal components are found by applying Singular Value Decomposition (SVD), a matrix factorization technique, to the feature matrix $X$.\n",
    "\n",
    "The principal components are the eigenvectors of the covariance matrix of the data, $X^\\text{T} X$.\n",
    "\n",
    "The SVD of $X$ is given by:\n",
    "\n",
    "$$X=U \\Sigma V^T,$$\n",
    " \n",
    "where the columns of $V$ give the principal components (which are normalised and orthogonal).\n",
    "\n",
    "See the [tutorial here](https://arxiv.org/abs/1404.1100) for further details about PCA."
   ]
  },
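The stated equivalence between the SVD and the eigendecomposition of $X^\text{T} X$ can be verified numerically. A small sketch on random data (any centered data matrix would do):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X_centered = X - X.mean(axis=0)

# SVD route: rows of Vt (columns of V) are the principal components
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Eigen route: eigenvectors of X^T X, sorted by decreasing eigenvalue
eigvals, eigvecs = np.linalg.eigh(X_centered.T @ X_centered)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Same directions (up to sign), and squared singular values = eigenvalues
for k in range(3):
    assert np.allclose(np.abs(Vt[k]), np.abs(eigvecs[:, k]))
assert np.allclose(s**2, eigvals)
```

The eigenvectors are only defined up to a sign flip, hence the comparison of absolute values.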
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "subslide"
    },
    "tags": []
   },
   "source": [
    "**Note: PCA assumes the data is centered on the origin. Scikit-Learn's PCA centers the data for you.**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "subslide"
    },
    "tags": []
   },
   "source": [
    "#### Example\n",
    "\n",
    "Generate example data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "editable": true,
    "execution": {
     "iopub.execute_input": "2025-02-27T23:21:48.115817Z",
     "iopub.status.busy": "2025-02-27T23:21:48.115633Z",
     "iopub.status.idle": "2025-02-27T23:21:48.339410Z",
     "shell.execute_reply": "2025-02-27T23:21:48.338849Z"
    },
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from scipy.spatial.transform import Rotation\n",
    "\n",
    "m = 60\n",
    "X = np.zeros((m, 3))  # initialize 3D dataset\n",
    "np.random.seed(42)\n",
    "angles = (np.random.rand(m) ** 3 + 0.5) * 2 * np.pi  # uneven distribution\n",
    "X[:, 0], X[:, 1] = np.cos(angles), np.sin(angles) * 0.5  # oval\n",
    "X += 0.28 * np.random.randn(m, 3)  # add more noise\n",
    "X = Rotation.from_rotvec([np.pi / 29, -np.pi / 20, np.pi / 4]).apply(X)\n",
    "X += [0.2, 0, 0.2]  # shift a bit"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "editable": true,
    "execution": {
     "iopub.execute_input": "2025-02-27T23:21:48.342580Z",
     "iopub.status.busy": "2025-02-27T23:21:48.341611Z",
     "iopub.status.idle": "2025-02-27T23:21:49.227869Z",
     "shell.execute_reply": "2025-02-27T23:21:49.227246Z"
    },
    "slideshow": {
     "slide_type": "subslide"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "from sklearn.decomposition import PCA\n",
    "\n",
    "pca = PCA(n_components=2)\n",
    "X2D = pca.fit_transform(X)  # dataset reduced to 2D\n",
    "X3D_inv = pca.inverse_transform(X2D)  # 3D position of the projected samples\n",
    "X_centered = X - X.mean(axis=0)\n",
    "U, s, Vt = np.linalg.svd(X_centered)\n",
    "\n",
    "axes = [-1.4, 1.4, -1.4, 1.4, -1.1, 1.1]\n",
    "x1, x2 = np.meshgrid(np.linspace(axes[0], axes[1], 10),\n",
    "                     np.linspace(axes[2], axes[3], 10))\n",
    "w1, w2 = np.linalg.solve(Vt[:2, :2], Vt[:2, 2])  # projection plane coefs\n",
    "z = w1 * (x1 - pca.mean_[0]) + w2 * (x2 - pca.mean_[1]) - pca.mean_[2]  # plane\n",
    "X3D_above = X[X[:, 2] >= X3D_inv[:, 2]]  # samples above plane\n",
    "X3D_below = X[X[:, 2] < X3D_inv[:, 2]]  # samples below plane"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "editable": true,
    "execution": {
     "iopub.execute_input": "2025-02-27T23:21:49.230095Z",
     "iopub.status.busy": "2025-02-27T23:21:49.229790Z",
     "iopub.status.idle": "2025-02-27T23:21:49.520446Z",
     "shell.execute_reply": "2025-02-27T23:21:49.519862Z"
    },
    "slideshow": {
     "slide_type": "subslide"
    },
    "tags": []
   },
   "outputs": [
    {
     "data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAs4AAALJCAYAAACk6aWUAAAAOnRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjEwLjAsIGh0dHBzOi8vbWF0cGxvdGxpYi5vcmcvlHJYcgAAAAlwSFlzAAAPYQAAD2EBqD+naQABAABJREFUeJzs/Xl4I/l534t+q7ARO0A2d7KbbPbGnpnep7vJ9mjzSCNFiTL2OToeO0dWrFga27J0HflajiON5Xis5MaKLW9yNE6urhwd58hWYsV6jmx5kbXPjGZr7vvSJLu5NgFwwV5Vv/sHWEXsKAAFoEC+n+eZp3uaIPBDoVD1rbe+7/flGGMMBEEQBEEQBEHkha/1AgiCIAiCIAiiHiDhTBAEQRAEQRAqIOFMEARBEARBECog4UwQBEEQBEEQKiDhTBAEQRAEQRAqIOFMEARBEARBECog4UwQBEEQBEEQKiDhTBAEQRAEQRAqIOFMEARBEARBECog4UwQBEEQBEEQKiDhTBAEQRAEQRAqIOFMEARBEARBECog4UwQBEEQBEEQKiDhTBAEQRAEQRAqIOFMEARBEARBECog4UwQBEEQBEEQKiDhTBAEQRAEQRAqIOFMEARBEARBECog4UwQBEEQBEEQKiDhTBAEQRAEQRAqIOFMEARBEARBECog4UwQBEEQBEEQKiDhTBAEQRAEQRAqIOFMEARBEARBECog4UwQBEEQBEEQKiDhTBAEQRAEQRAqIOFMEARBEARBECog4UwQBEEQBEEQKiDhTBAEQRAEQRAqIOFMEARBEARBECog4UwQBEEQBEEQKiDhTBAEQRAEQRAqIOFMEARBEARBECog4UwQBEEQBEEQKiDhTBAEQRAEQRAqIOFMEARBEARBECog4UwQBEEQBEEQKiDhTBAEQRAEQRAqIOFMEARBEARBECog4UwQBEEQBEEQKiDhTBAEQRAEQRAqIOFMEARBEARBECog4UwQBEEQBEEQKiDhTBAEQRAEQRAqIOFMEARBEARBECog4UwQBEEQBEEQKiDhTBAEQRAEQRAqIOFMEARBEARBECog4UwQBEEQBEEQKiDhTBAEQRAEQRAqIOFMEARBEARBECog4UwQBEEQBEEQKiDhTBAEQRAEQRAqIOFMEARBEARBECog4UwQBEEQBEEQKiDhTBAEQRAEQRAqIOFMEARBEARBECog4UwQBEEQBEEQKiDhTBAEQRAEQRAqIOFMEARBEARBECog4UwQBEEQBEEQKiDhTBBEzWCM1XoJBEEQBKEaY60XQBDE8UOSJMTjcUQiEZhMJhiNRhgMBvA8D47jar08giAIgsgKx6jkQxBElWCMKaJZFEVEIhHwPA/GGHieB8/zMBqNJKQJgiAIXULCmSCIqsAYgyAIEARB+f9YLAaDwQDGmCKqZTiOA8/zMJlMMBgMMBqN4DiOhDRBEARRM8iqQRBExUmuMsuCWBRF5eeyIOb5RNuFLKRFUYQgCMrPZQEtV6RJSBMEQRDVhIQzQRAVI1n8SpKUYr3IJ3hzCWlBEBCPx1OEtFyRlq0dBEEQBFEpSDgTBFERGGNKlRlAWX5lEtIEQRCEHiDhTBCE5kiShFgsllFlTqdSQhpARqMhCWmCIAiiXKg5kCAIzZCtGfF4XEnKyCWOZXFdCTGb3GzIGFOENglpgiAIohxIOBMEoQnp1oxCjXuVFM7Z1iaLaJl0IS2ndhAEQRBELkg4EwRRNnKVuZA1I5lqCud0koV0top0cmoHQRAEQciQcCYIomTSs5mLiYerpXBOJ5uQ5nk+o9mQhDRBEMTxhoQzQRAlIWczy0NLihXAehLO6eQS0ukeaRLSBEEQxwsSzgRBFEXy2OxirBnp6Fk4JyMfIklIEwRBECScCYJQTbENgPmeZ35+Htvb2/B4PPB6vXC5XLoX0UBuIS3/Z7FYyCNNEARxRCHhTBCEKpLHZpczzCQSiWBkZASRSAQdHR3Y3d2F3+
+HKIpwu93wer3wer1wOp11I6QZY4hEInj55ZfxxBNPUEWaIAjiiEIDUAiCyEu+sdnFsrW1hdHRUZw4cQJXrlxRno8xhlAoBL/fD7/fj+XlZTDGlGq01+uFw+HQpfBMnloIHE5IZIwhGo0iGo0qQlpuNDQajWVtR4IgCKI2kHAmCCInWo3NliQJMzMzWFlZwcWLF9HZ2al4nIGE+LTb7bDb7ejq6gJjDPv7+/D7/QgEAlhcXATHcSlC2m6361J4Jk80NBgMSkVarkonP8ZkMikVaRLSBEEQ+oeEM0EQWSklmzkboVAIw8PDkCQJAwMDcDgcBX+H4zg4nU44nU6cPHkSkiQpQnp7exsLCwvgeR5er1cR0zabrabCM99Ycfln+YS0XIkmIU0QBKFfSDgTBJFCejZzOQJubW0N4+Pj6OjowPnz5xU7Q7HwPA+XywWXy4VTp05BkiTs7e3B5/Nha2sLc3NzMBqNSjXa6/WioaFBl8Izl5CWJEkR0jzPZ3ikSUgTBEHUHhLOBEEopGczl5qaIYoiJicnsbGxgcceewytra2arpPnebjdbrjdbuX15CbDtbU1TE9Pw2w2ZwjpalBsv3U+IR2NRhGJREhIEwRB6AQSzgRBpGQzJ8erlcLe3h6Gh4dhNBoxODgIq9Wq8WozMRgMikAGEkJ6Z2cHfr8fDx48wNTUFBoaGlKsHRaLpeLrKoX0bS8LaVEUIYpizmbDcj4zgiAIQh0knAnimKNlNvP9+/cxNTWFnp4e9PX11SxOzmAwoLGxEY2NjQAAQRAQCATg9/uxsrKCiYkJ2Gw2RWx7PB6YzeayXrNSolX+PORtmSykBUFISfVIrkiTkCYIgtAeEs4EcYzRKps5Ho9jfHwcfr8f165dQ1NTk8YrLQ+j0YgTJ07gxIkTABLrlYX04uIigsEgHA6HUo32eDwwmUw1XnV2cglpQRAQj8dzCul6yMQmCILQOyScCeIYIic6BINB2Gy2skRzIBDA8PAw7HY7BgcHdWuBSMZkMqG5uRnNzc0AgFgspgjp+fl5hEIhOJ1OpSLtdrthNKo7XFZ7plQxQlq2dpCQJgiCKA0SzgRxzJCtGQ8fPsT09DR+5Ed+pGRrxuLiIubm5nD27Fn09PTUrTXAbDajpaUFLS0tAIBoNKoMY5menkY0Gs0Q0qUmhFSaQkIaQNaphiSkCYIgCkPCmSCOEenZzHIjYLFEo1GMjo4iGAzi5s2b8Hg82i+2hlgsFrS1taGtrQ0AEA6HlYr05OQkYrGYMh7c4/HA7Xbr9qIhl5COx+OIxWLKz0lIEwRBFIaEM0EcA5KzmRljSrxZKbaChw8fYnR0FF6vF4ODg7r1AmuJ1WqF1WpFe3s7GGMIh8PKVMMHDx5AEAS4XC4AwO7uLjwej26FZzYhLXvd5Yp0upCWUzsIgiCOOyScCeKII0kSBEHIGJvNcVxRwlmSJMzNzWFpaQkXLlxAV1dX2WJKXkM9iTKO42Cz2WCz2dDZ2QnGGEKhELa3txEIBDA2NgbGGDwej9Js6HQ6dfseZf+zTLKQzlWRJiFNEMRxhYQzQRxRCmUzFyOcw+EwhoeHIQgCbt++DafTqdka612AcRwHu90Oi8WCubk53L59G9FoVLF2LC0tAYAior1eL+x2u27ftxohzfN8RrOhXt8PQRCElpBwJogjSPrY7GyZvhzHKRMC87GxsYGxsTG0traiv79ft01xtUbevhzHwel0wul0oru7G4wx7O3twe/3Y3t7GwsLC+B5PkVI22w23QpPtUI63SOt1/dDEARRDiScCeKIkZzNnOxlTaeQx1kURUxPT2N1dRWPPPII2tvbK7XkIw3HcXC5XHC5XDh16hQkSVKE9NbWFubm5mA0GlOmGlqtVt0Kz2QhLe8/kiQhFospUw0FQQDP83A4HCSkCYI4UpBwJogjQvI0OTk1I59YyWfV2N/fx/DwMHiex+DgIGw2W6WWfeQoZH/heR5utxtutxs9PT2QJEkZD7
6xsYGZmRmYzWalGu31etHQ0FCl1ReHvH+lC+nl5WWEQiGcP3+eKtIEQRwpSDgTxBEgfWy2moEm2YQzYwwPHjzA5OQ
      "text/plain": [
       "<Figure size 900x900 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "fig = plt.figure(figsize=(9, 9))\n",
    "ax = fig.add_subplot(111, projection=\"3d\")\n",
    "\n",
    "# plot samples and projection lines below plane first\n",
    "ax.plot(X3D_below[:, 0], X3D_below[:, 1], X3D_below[:, 2], \"ro\", alpha=0.3)\n",
    "for i in range(m):\n",
    "    if X[i, 2] < X3D_inv[i, 2]:\n",
    "        ax.plot([X[i][0], X3D_inv[i][0]],\n",
    "                [X[i][1], X3D_inv[i][1]],\n",
    "                [X[i][2], X3D_inv[i][2]], \":\", color=\"#F88\")\n",
    "\n",
    "ax.plot_surface(x1, x2, z, alpha=0.1, color=\"b\")  # projection plane\n",
    "ax.plot(X3D_inv[:, 0], X3D_inv[:, 1], X3D_inv[:, 2], \"b+\")  # projected samples\n",
    "ax.plot(X3D_inv[:, 0], X3D_inv[:, 1], X3D_inv[:, 2], \"b.\")\n",
    "\n",
    "# now plot projection lines and samples above plane\n",
    "for i in range(m):\n",
    "    if X[i, 2] >= X3D_inv[i, 2]:\n",
    "        ax.plot([X[i][0], X3D_inv[i][0]],\n",
    "                [X[i][1], X3D_inv[i][1]],\n",
    "                [X[i][2], X3D_inv[i][2]], \"r--\")\n",
    "\n",
    "ax.plot(X3D_above[:, 0], X3D_above[:, 1], X3D_above[:, 2], \"ro\")\n",
    "\n",
    "def set_xyz_axes(ax, axes):\n",
    "    ax.xaxis.set_rotate_label(False)\n",
    "    ax.yaxis.set_rotate_label(False)\n",
    "    ax.zaxis.set_rotate_label(False)\n",
    "    ax.set_xlabel(\"$x_1$\", labelpad=8, rotation=0)\n",
    "    ax.set_ylabel(\"$x_2$\", labelpad=8, rotation=0)\n",
    "    ax.set_zlabel(\"$x_3$\", labelpad=8, rotation=0)\n",
    "    ax.set_xlim(axes[0:2])\n",
    "    ax.set_ylim(axes[2:4])\n",
    "    ax.set_zlim(axes[4:6])\n",
    "\n",
    "set_xyz_axes(ax, axes)\n",
    "ax.set_zticks([-1, -0.5, 0, 0.5, 1]);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "subslide"
    },
    "tags": []
   },
   "source": [
    "Compute SVD"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "editable": true,
    "execution": {
     "iopub.execute_input": "2025-02-27T23:21:49.523152Z",
     "iopub.status.busy": "2025-02-27T23:21:49.522594Z",
     "iopub.status.idle": "2025-02-27T23:21:49.526229Z",
     "shell.execute_reply": "2025-02-27T23:21:49.525813Z"
    },
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "X_centered = X - X.mean(axis=0)\n",
    "U, s, Vt = np.linalg.svd(X_centered)\n",
    "V = Vt.T\n",
    "c1 = V[:, 0]\n",
    "c2 = V[:, 1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "editable": true,
    "execution": {
     "iopub.execute_input": "2025-02-27T23:21:49.527883Z",
     "iopub.status.busy": "2025-02-27T23:21:49.527709Z",
     "iopub.status.idle": "2025-02-27T23:21:49.531736Z",
     "shell.execute_reply": "2025-02-27T23:21:49.531386Z"
    },
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(array([0.67857588, 0.70073508, 0.22023881]),\n",
       " array([-0.72817329,  0.6811147 ,  0.07646185]))"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "c1, c2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "source": [
    "Reconstruct $\\Sigma$ from `s`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "editable": true,
    "execution": {
     "iopub.execute_input": "2025-02-27T23:21:49.533494Z",
     "iopub.status.busy": "2025-02-27T23:21:49.533326Z",
     "iopub.status.idle": "2025-02-27T23:21:49.536249Z",
     "shell.execute_reply": "2025-02-27T23:21:49.535878Z"
    },
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "m, n = X.shape\n",
    "Σ = np.zeros_like(X_centered)\n",
    "Σ[:n, :n] = np.diag(s)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "source": [
    "Check that we recover `X_centered`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "editable": true,
    "execution": {
     "iopub.execute_input": "2025-02-27T23:21:49.537981Z",
     "iopub.status.busy": "2025-02-27T23:21:49.537785Z",
     "iopub.status.idle": "2025-02-27T23:21:49.540650Z",
     "shell.execute_reply": "2025-02-27T23:21:49.540108Z"
    },
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "assert np.allclose(X_centered, U @ Σ @ Vt)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "subslide"
    },
    "tags": []
   },
   "source": [
    "### Projecting down to d dimensions\n",
    "\n",
    "Once the principal components are identified, we can project the data onto the hyperplane defined by the first $d$ principal components.\n",
    "\n",
    "This is performed by taking the dot product between the data and the principal components.\n",
    "\n",
    "1. Construct the projection matrix $P$ by taking the first $d$ columns of $V$.\n",
    "2. Multiply the feature matrix by the projection matrix to recover a lower-dimensional representation of the data: $X_d = X P$.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "subslide"
    },
    "tags": []
   },
   "source": [
    "#### Check sizes\n",
    "\n",
    "For $X$ of size $m \\times n$, i.e. $n$ features and $m$ data instances, the matrices have the following sizes:\n",
    "\n",
    "- $V$ is of size $n \\times n$.\n",
    "- $P$ (first $d$ columns of $V$) is of size $n \\times d$.\n",
    "- $X_d = X P$ is of size $m \\times d$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "editable": true,
    "execution": {
     "iopub.execute_input": "2025-02-27T23:21:49.542460Z",
     "iopub.status.busy": "2025-02-27T23:21:49.542294Z",
     "iopub.status.idle": "2025-02-27T23:21:49.545152Z",
     "shell.execute_reply": "2025-02-27T23:21:49.544566Z"
    },
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "W2 = Vt[:2].T\n",
    "X2D = X_centered @ W2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "editable": true,
    "execution": {
     "iopub.execute_input": "2025-02-27T23:21:49.546946Z",
     "iopub.status.busy": "2025-02-27T23:21:49.546752Z",
     "iopub.status.idle": "2025-02-27T23:21:49.550054Z",
     "shell.execute_reply": "2025-02-27T23:21:49.549458Z"
    },
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Original dimension is 3\n",
      "Reduced dimension is 2\n"
     ]
    }
   ],
   "source": [
    "print(f\"Original dimension is {X_centered.shape[-1]}\")\n",
    "print(f\"Reduced dimension is {X2D.shape[-1]}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "subslide"
    },
    "tags": []
   },
   "source": [
    "### Scikit-Learn PCA\n",
    "\n",
    "Scikit-Learn's `PCA` class also uses SVD but makes things even easier.\n",
    "\n",
    "You can access each principal component via the `components_` attribute."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "editable": true,
    "execution": {
     "iopub.execute_input": "2025-02-27T23:21:49.551841Z",
     "iopub.status.busy": "2025-02-27T23:21:49.551672Z",
     "iopub.status.idle": "2025-02-27T23:21:49.555266Z",
     "shell.execute_reply": "2025-02-27T23:21:49.554662Z"
    },
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "from sklearn.decomposition import PCA\n",
    "\n",
    "pca = PCA(n_components=2)\n",
    "X2D = pca.fit_transform(X)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "editable": true,
    "execution": {
     "iopub.execute_input": "2025-02-27T23:21:49.557016Z",
     "iopub.status.busy": "2025-02-27T23:21:49.556827Z",
     "iopub.status.idle": "2025-02-27T23:21:49.560687Z",
     "shell.execute_reply": "2025-02-27T23:21:49.560112Z"
    },
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 0.67857588,  0.70073508,  0.22023881],\n",
       "       [ 0.72817329, -0.6811147 , -0.07646185]])"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pca.components_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "subslide"
    },
    "tags": []
   },
   "source": [
    "### Explained variance ratio\n",
    "\n",
    "The *explained variance ratio* is a very useful metric.\n",
    "\n",
    "It gives the proportion of the dataset's variance that lies along the axis of each principal component."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "editable": true,
    "execution": {
     "iopub.execute_input": "2025-02-27T23:21:49.562418Z",
     "iopub.status.busy": "2025-02-27T23:21:49.562252Z",
     "iopub.status.idle": "2025-02-27T23:21:49.565839Z",
     "shell.execute_reply": "2025-02-27T23:21:49.565454Z"
    },
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([0.7578477 , 0.15186921])"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pca.explained_variance_ratio_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "source": [
    "The first dimension explains about 76% of the variance, while the second explains about 15%."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "source": [
    "By projecting down to 2D, we lost about 9% of the variance:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "editable": true,
    "execution": {
     "iopub.execute_input": "2025-02-27T23:21:49.567654Z",
     "iopub.status.busy": "2025-02-27T23:21:49.567484Z",
     "iopub.status.idle": "2025-02-27T23:21:49.571274Z",
     "shell.execute_reply": "2025-02-27T23:21:49.570904Z"
    },
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "np.float64(0.09028309326742034)"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "1 - pca.explained_variance_ratio_.sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "subslide"
    },
    "tags": []
   },
   "source": [
    "### Choosing the right number of dimensions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "source": [
    "Compute the explained variance cumulatively to see how much is captured by a given number of principal components."
   ]
  },
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 15,
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"execution": {
|
|||
|
"iopub.execute_input": "2025-02-27T23:21:49.573126Z",
|
|||
|
"iopub.status.busy": "2025-02-27T23:21:49.572952Z",
|
|||
|
"iopub.status.idle": "2025-02-27T23:21:53.659798Z",
|
|||
|
"shell.execute_reply": "2025-02-27T23:21:53.659297Z"
|
|||
|
},
|
|||
|
"slideshow": {
|
|||
|
"slide_type": ""
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"np.int64(154)"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 15,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"from sklearn.datasets import fetch_openml\n",
|
|||
|
"\n",
|
|||
|
"mnist = fetch_openml('mnist_784', as_frame=False, parser=\"auto\")\n",
|
|||
|
"X_train, y_train = mnist.data[:60_000], mnist.target[:60_000]\n",
|
|||
|
"X_test, y_test = mnist.data[60_000:], mnist.target[60_000:]\n",
|
|||
|
"\n",
|
|||
|
"pca = PCA()\n",
|
|||
|
"pca.fit(X_train)\n",
|
|||
|
"cumsum = np.cumsum(pca.explained_variance_ratio_)\n",
|
|||
|
"d = np.argmax(cumsum >= 0.95) + 1\n",
|
|||
|
"d"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"slideshow": {
|
|||
|
"slide_type": ""
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"So we capture $\\geq95$% of the variance in $154$ dimensions.\n",
|
|||
|
"\n",
|
|||
|
"Compared to original dimensionality of $28 \\times 28 =784$), we compress to $\\sim 20$% of the original size."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "subslide"
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"Scikit-Learn also has built in functionality to automatically select number of principal components that capture a given level of variance."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 16,
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"execution": {
|
|||
|
"iopub.execute_input": "2025-02-27T23:21:53.662280Z",
|
|||
|
"iopub.status.busy": "2025-02-27T23:21:53.662094Z",
|
|||
|
"iopub.status.idle": "2025-02-27T23:21:55.179585Z",
|
|||
|
"shell.execute_reply": "2025-02-27T23:21:55.179013Z"
|
|||
|
},
|
|||
|
"slideshow": {
|
|||
|
"slide_type": ""
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"pca = PCA(n_components=0.95)\n",
|
|||
|
"X_reduced = pca.fit_transform(X_train)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 17,
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"execution": {
|
|||
|
"iopub.execute_input": "2025-02-27T23:21:55.183080Z",
|
|||
|
"iopub.status.busy": "2025-02-27T23:21:55.182254Z",
|
|||
|
"iopub.status.idle": "2025-02-27T23:21:55.187758Z",
|
|||
|
"shell.execute_reply": "2025-02-27T23:21:55.187347Z"
|
|||
|
},
|
|||
|
"slideshow": {
|
|||
|
"slide_type": ""
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"np.int64(154)"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 17,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"pca.n_components_"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "subslide"
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"### PCA for compression"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"slideshow": {
|
|||
|
"slide_type": ""
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"We can also reconstruct high dimensional representation from compressed representation.\n",
|
|||
|
"\n",
|
|||
|
"Compression is lossy, so we won't recover the original high dimensional representation perfectly.\n",
|
|||
|
"\n",
|
|||
|
"Example applying PCA to MNIST dataset with 95% preservation = results in ~150 features (original = 28x28 = 784)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"slideshow": {
|
|||
|
"slide_type": ""
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"Since principal components are orthogonal, we can estimate a recovered high dimensional representation by\n",
|
|||
|
"\n",
|
|||
|
"$$X_\\text{recovered} = X_{d} P^\\text{T}.$$"
|
|||
|
]
|
|||
|
},
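{
"cell_type": "markdown",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"source": [
"The mean squared distance between an original instance and its reconstruction is called the *reconstruction error*:\n",
"\n",
"$$\\text{MSE} = \\frac{1}{m} \\sum_{i=1}^m \\left\\| x^{(i)} - x^{(i)}_\\text{recovered} \\right\\|^2.$$"
]
},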
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 18,
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"execution": {
|
|||
|
"iopub.execute_input": "2025-02-27T23:21:55.190225Z",
|
|||
|
"iopub.status.busy": "2025-02-27T23:21:55.190043Z",
|
|||
|
"iopub.status.idle": "2025-02-27T23:21:55.643751Z",
|
|||
|
"shell.execute_reply": "2025-02-27T23:21:55.643184Z"
|
|||
|
},
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "subslide"
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"X_recovered = pca.inverse_transform(X_reduced)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 19,
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"execution": {
|
|||
|
"iopub.execute_input": "2025-02-27T23:21:55.647099Z",
|
|||
|
"iopub.status.busy": "2025-02-27T23:21:55.646185Z",
|
|||
|
"iopub.status.idle": "2025-02-27T23:21:55.864486Z",
|
|||
|
"shell.execute_reply": "2025-02-27T23:21:55.863997Z"
|
|||
|
},
|
|||
|
"scrolled": true,
|
|||
|
"slideshow": {
|
|||
|
"slide_type": ""
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAjIAAAEgCAYAAACuOplUAAAAOnRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjEwLjAsIGh0dHBzOi8vbWF0cGxvdGxpYi5vcmcvlHJYcgAAAAlwSFlzAAAPYQAAD2EBqD+naQABAABJREFUeJzsvXeUpFd5Jv5Uzrk6VHVX59zTEzRRGmmUGQlJCFaATbIkE89iFq9l9uzyMwcMNtjYa2PwYsOaxWDQEgQiCQkJpJHQKIxmRtMz0z3TOXdXd1dVV85V3++P2ffOra+rOqeRvuecPt1d9YV7v3Dve5/3ed9XJgiCAAkSJEiQIEGChGsQ8u1ugAQJEiRIkCBBwlohGTISJEiQIEGChGsWkiEjQYIECRIkSLhmIRkyEiRIkCBBgoRrFpIhI0GCBAkSJEi4ZiEZMhIkSJAgQYKEaxaSISNBggQJEiRIuGYhGTISJEiQIEGChGsWkiEjQYIECRIkSLhmIRkybxJ87nOfg0wmW9O+//7v/w6ZTIbR0dGNbRSH0dFRyGQy/Pu///umnUOCBAkSthsPPfQQ6urqtrsZbyhIhsw1gJ6eHrz//e9HVVUVNBoN3G433ve+96Gnp2e7myZBgoRrDENDQ/joRz+KhoYGaLVamM1mHD16FP/0T/+ERCKx3c2TIGHVkAyZHY6f/vSnuO666/C73/0ODz/8ML7+9a/jgx/8IJ577jlcd911ePzxx1d0nL/4i79Y8yD1gQ98AIlEArW1tWvaX4IECTsDTzzxBLq6uvCjH/0I9913H772ta/hS1/6EmpqavCpT30Kn/zkJ7e7iRIkrBrK7W6AhNIYGhrCBz7wATQ0NOCFF15AWVkZ++6Tn/wkbrrpJnzgAx/A+fPn0dDQUPQYsVgMBoMBSqUSSuXabrdCoYBCoVjTvhIkSNgZGBkZwR/+4R+itrYWzz77LFwuF/vu4x//OAYHB/HEE09sYwtXh2w2i3w+D7Vavd1NkbDNkBiZHYy/+7u/Qzwexze/+c0CIwYAnE4nvvGNbyAWi+HLX/4ygKs6mN7eXrz3ve+FzWbDjTfeWPAdj0Qigf/yX/4LnE4nTCYT3va2t2FqagoymQyf+9zn2HbFNDJ1dXW499578eKLL+LQoUPQarVoaGjAd7/73YJzBAIB/Pmf/zm6urpgNBphNptx9913o7u7ewOvlAQJEpbDl7/8ZUSjUXzrW98qMGIITU1NjJHJZrP4whe+gMbGRmg0GtTV1eHTn/40UqlUwT40Dpw4cQIHDhyATqdDV1cXTpw4AeAKo9zV1QWtVov9+/fj9ddfL9j/oYcegtFoxPDwMI4fPw6DwQC3243Pf/7zEASBbUcaur//+7/HV77yFdau3t5eAMDly5fxzne+E3a7HVqtFgcOHMAvfvGLgnNlMhn85V/+JZqbm6HVauFwOHDjjTfimWeeYdt4vV48/PDDqK6uhkajgcvlwv33379IH/jkk0/ipptugsFggMlkwj333FPU1f+zn/0Mu3btglarxa5du1bMoEtYHSRGZgfjl7/8Jerq6nDTTTcV/f7YsWOoq6tbtIp617vehebmZnzxi18sGAzEeOihh/CjH/0IH/jAB3DkyBE8//zzuOeee1bcvsHBQbzzne/EBz/4QTz44IP4P//n/+Chhx7C/v370dnZCQAYHh7Gz372M7zrXe9CfX09Zmdn8Y1vfAM333wzent74Xa7V3w+CRIkrB2//OUv0dDQgBtuuGHZbT/0oQ/hO9/5Dt75znfikUcewauvvoovfelLuHTp0qLJeHBwEO9973vx0Y9+FO9///vx93//97jvvvvwr//6r/j0pz+N//yf/zMA4Etf+hLe/e53o6+vD3L51TV0LpfDXXfdhSNHjuDLX/4ynnrqKXz2s59FNpvF5z//+YJzffvb30YymcRHPvIRaDQa2O129PT04OjRo6iqqsJ//+//HQaDAT/60Y/w9r
e/HT/5yU/wjne8A8CVxdyXvvQlfOhDH8KhQ4cQDodx+vRpnD17FnfeeScA4IEHHkBPTw8+8YlPoK6uDnNzc3jmmWcwPj7OBLr/8R//gQcffBDHjx/H3/7t3yIej+Nf/uVfcOONN+L1119n2z399NN44IEH0NHRgS996Uvw+/3MSJKwwRAk7EgEg0EBgHD//fcvud3b3vY2AYAQDoeFz372swIA4T3vec+i7eg7wpkzZwQAwp/+6Z8WbPfQQw8JAITPfvaz7LNvf/vbAgBhZGSEfVZbWysAEF544QX22dzcnKDRaIRHHnmEfZZMJoVcLldwjpGREUGj0Qif//znCz4DIHz7299esr8SJEhYPUKh0IrGE0EQhHPnzgkAhA996EMFn//5n/+5AEB49tln2Wc0Drz00kvss9/85jcCAEGn0wljY2Ps82984xsCAOG5555jnz344IMCAOETn/gE+yyfzwv33HOPoFarhfn5eUEQro4PZrNZmJubK2jX7bffLnR1dQnJZLLgGDfccIPQ3NzMPtuzZ49wzz33lOz3wsKCAED4u7/7u5LbRCIRwWq1Ch/+8IcLPvd6vYLFYin4fO/evYLL5RKCwSD77OmnnxYACLW1tSXPIWH1kFxLOxSRSAQAYDKZltyOvg+Hw+yzj33sY8se/6mnngIAtloifOITn1hxGzs6OgrYorKyMrS2tmJ4eJh9ptFo2Oorl8vB7/fDaDSitbUVZ8+eXfG5JEiQsHbQ+LDceAIAv/71rwEAf/Znf1bw+SOPPAIAixjgjo4OXH/99ez/w4cPAwBuu+021NTULPqcHx8If/Inf8L+lslk+JM/+ROk02n89re/LdjugQceKHCzBwIBPPvss3j3u9+NSCQCn88Hn88Hv9+P48ePY2BgAFNTUwAAq9WKnp4eDAwMFO23TqeDWq3GiRMnsLCwUHSbZ555BsFgEO95z3vYuXw+HxQKBQ4fPoznnnsOADAzM4Nz587hwQcfhMViYfvfeeed6OjoKHpsCWuHZMjsUNCAQwZNKRQzeOrr65c9/tjYGORy+aJtm5qaVtxGfpAi2Gy2gkEgn8/jH//xH9Hc3AyNRgOn04mysjKcP38eoVBoxeeSIEHC2mE2mwEsP54AV8cG8VhQWVkJq9WKsbGxgs/F4wBN3B6Pp+jnYiNBLpcvClZoaWkBgEXaFPF4NTg4CEEQ8JnPfAZlZWUFP5/97GcBAHNzcwCAz3/+8wgGg2hpaUFXVxc+9alP4fz58+xYGo0Gf/u3f4snn3wSFRUVOHbsGL785S/D6/WybcgIuu222xad7+mnn2bnomvU3NwMMVpbWxd9JmF9kDQyOxQWiwUul6vgRSuG8+fPo6qqig1UwJWVxVagVCSTwOlyvvjFL+Izn/kM/viP/xhf+MIXYLfbIZfL8ad/+qfI5/Nb0k4JEt7sMJvNcLvduHjx4or3WWkCzVLjwErGh9VCPLbRGPLnf/7nOH78eNF9yCA7duwYhoaG8POf/xxPP/00/u3f/g3/+I//iH/913/Fhz70IQDAn/7pn+K+++7Dz372M/zmN7/BZz7zGXzpS1/Cs88+i3379rHz/cd//AcqKysXnWutkaES1gfpqu9g3Hvvvfjf//t/48UXX2TRRzx+//vfY3R0FB/96EdXfeza2lrk83mMjIwUrBoGBwfX1WYxHnvsMdx666341re+VfB5MBiE0+nc0HNJkCChNO69915885vfxMsvv1zgChKDxoaBgQG0t7ezz2dnZxEMBjc8n1Q+n8fw8DBjYQCgv78fAJbNgEtMjkqlwh133LHsuex2Ox5++GE8/PDDiEajOHbsGD73uc8xQwYAGhsb8cgjj+CRRx7BwMAA9u7di//5P/8nvve976GxsREAUF5evuT56BoVc2P19fUt204Jq4PkWtrB+NSnPgWdToePfvSj8Pv9Bd8FAgF87GMfg16vx6c+9alVH5tWL1//+tcLPv/a17629gYXgUKhWLQC+/
GPf8z81hIkSNga/Lf/9t9gMBjwoQ99CLOzs4u+Hxoawj/90z/hrW99KwDgK1/5SsH3//AP/wAAq4psXCn++Z//mf0
|
|||
|
"text/plain": [
|
|||
|
"<Figure size 700x400 with 2 Axes>"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"plt.figure(figsize=(7, 4))\n",
|
|||
|
"for idx, X in enumerate((X_train[::2100], X_recovered[::2100])):\n",
|
|||
|
" plt.subplot(1, 2, idx + 1)\n",
|
|||
|
" plt.title([\"Original\", \"Compressed\"][idx])\n",
|
|||
|
" for row in range(5):\n",
|
|||
|
" for col in range(5):\n",
|
|||
|
" plt.imshow(X[row * 5 + col].reshape(28, 28), cmap=\"binary\",\n",
|
|||
|
" vmin=0, vmax=255, extent=(row, row + 1, col, col + 1))\n",
|
|||
|
" plt.axis([0, 5, 0, 5])\n",
|
|||
|
" plt.axis(\"off\")"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "subslide"
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"### Randomized PCA \n",
|
|||
|
"\n",
|
|||
|
"*Randomized PCA* quickly find an approximation of the first first $d$ principal components.\n",
|
|||
|
"\n",
|
|||
|
"It is much quicker than the full SVD approach.\n",
|
|||
|
"\n",
|
|||
|
"Its computational complexity is ${\\mathcal O}( m\\times d^2) + {\\mathcal O}( d^3)$, instead of ${\\mathcal O}( m \\times n^2) + {\\mathcal O}( n^3)$, so it is dramatically faster than the previous algorithms when $d$ is much smaller than $n$."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 20,
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"execution": {
|
|||
|
"iopub.execute_input": "2025-02-27T23:21:55.866516Z",
|
|||
|
"iopub.status.busy": "2025-02-27T23:21:55.866345Z",
|
|||
|
"iopub.status.idle": "2025-02-27T23:22:02.100911Z",
|
|||
|
"shell.execute_reply": "2025-02-27T23:22:02.100391Z"
|
|||
|
},
|
|||
|
"slideshow": {
|
|||
|
"slide_type": ""
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"rnd_pca = PCA(n_components=154, svd_solver=\"randomized\", random_state=42)\n",
|
|||
|
"X_reduced = rnd_pca.fit_transform(X_train)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 21,
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"execution": {
|
|||
|
"iopub.execute_input": "2025-02-27T23:22:02.103482Z",
|
|||
|
"iopub.status.busy": "2025-02-27T23:22:02.103301Z",
|
|||
|
"iopub.status.idle": "2025-02-27T23:22:02.108558Z",
|
|||
|
"shell.execute_reply": "2025-02-27T23:22:02.108081Z"
|
|||
|
},
|
|||
|
"slideshow": {
|
|||
|
"slide_type": ""
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"154"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 21,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"rnd_pca.n_components_"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "subslide"
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"### Incremental PCA \n",
|
|||
|
"\n",
|
|||
|
"One problem with PCA is that it requires the entire training set to be read into memory at once. Thus, cannot apply to very large training data-sets. \n",
|
|||
|
"\n",
|
|||
|
"Instead *Incremental PCA (IPCA)* splits the data into batches, and allows for combination. \n",
|
|||
|
"\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 22,
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"execution": {
|
|||
|
"iopub.execute_input": "2025-02-27T23:22:02.110486Z",
|
|||
|
"iopub.status.busy": "2025-02-27T23:22:02.110313Z",
|
|||
|
"iopub.status.idle": "2025-02-27T23:22:20.395636Z",
|
|||
|
"shell.execute_reply": "2025-02-27T23:22:20.395095Z"
|
|||
|
},
|
|||
|
"slideshow": {
|
|||
|
"slide_type": ""
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"from sklearn.decomposition import IncrementalPCA\n",
|
|||
|
"\n",
|
|||
|
"n_batches = 100\n",
|
|||
|
"inc_pca = IncrementalPCA(n_components=154)\n",
|
|||
|
"for X_batch in np.array_split(X_train, n_batches):\n",
|
|||
|
" inc_pca.partial_fit(X_batch)\n",
|
|||
|
"\n",
|
|||
|
"X_reduced = inc_pca.transform(X_train)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 23,
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"execution": {
|
|||
|
"iopub.execute_input": "2025-02-27T23:22:20.398654Z",
|
|||
|
"iopub.status.busy": "2025-02-27T23:22:20.397666Z",
|
|||
|
"iopub.status.idle": "2025-02-27T23:22:20.403297Z",
|
|||
|
"shell.execute_reply": "2025-02-27T23:22:20.402792Z"
|
|||
|
},
|
|||
|
"slideshow": {
|
|||
|
"slide_type": ""
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"154"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 23,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"inc_pca.n_components_"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "subslide"
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"**Exercises:** *You can now complete Exercise 1 in the exercises associated with this lecture.*"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "slide"
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"## Random projections"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"slideshow": {
|
|||
|
"slide_type": ""
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"Even with randomized or incremental PCA, PCA can be too computationally expensive for very large data-sets.\n",
|
|||
|
"\n",
|
|||
|
"An alternative it to consider *random* projections to lower dimensional spaces through linear projections."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "subslide"
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"Random projections can actually preserve distances reasonably well. Hence, two similar data instances remain similar after projection. Similarly, two very different data instances remain very different after projection.\n",
|
|||
|
"\n",
|
|||
|
"The more dimensions that are dropped, the more information is lost, and hence the more distances get distorted."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "subslide"
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"### Selecting the reduced dimension size\n",
|
|||
|
"\n",
|
|||
|
"There is some deep mathematical theory that specifies how much dimensions can be compressed in order to ensure distances are distorted by less than some tolerance $\\epsilon$. \n",
|
|||
|
"\n",
|
|||
|
"This is called the *Johnson-Lindenstrauss* minimum dimension:\n",
|
|||
|
"\n",
|
|||
|
"$$d \\geq \\frac{4 \\log(m)}{\\epsilon^2/2 - \\epsilon^3/3}.$$\n",
|
|||
|
"\n",
|
|||
|
"(Doesn't depend on number of features $n$.)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "subslide"
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"### Example"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"slideshow": {
|
|||
|
"slide_type": ""
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"For example:"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 24,
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"execution": {
|
|||
|
"iopub.execute_input": "2025-02-27T23:22:20.405853Z",
|
|||
|
"iopub.status.busy": "2025-02-27T23:22:20.405675Z",
|
|||
|
"iopub.status.idle": "2025-02-27T23:22:20.409905Z",
|
|||
|
"shell.execute_reply": "2025-02-27T23:22:20.409455Z"
|
|||
|
},
|
|||
|
"slideshow": {
|
|||
|
"slide_type": ""
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"5920"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 24,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"m, ε = 1_000, 0.1\n",
|
|||
|
"\n",
|
|||
|
"d = int(4 * np.log(m) / (ε ** 2 / 2 - ε ** 3 / 3))\n",
|
|||
|
"d"
|
|||
|
]
|
|||
|
},
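{
"cell_type": "markdown",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"source": [
"As a sanity check, plugging the numbers into the Johnson-Lindenstrauss bound by hand gives\n",
"\n",
"$$d \\geq \\frac{4 \\log(1000)}{0.1^2/2 - 0.1^3/3} \\approx \\frac{27.631}{0.004667} \\approx 5921,$$\n",
"\n",
"which truncates to $5920$ under `int`."
]
},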
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"slideshow": {
|
|||
|
"slide_type": ""
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"Or using Scikit-Learn:"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 25,
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"execution": {
|
|||
|
"iopub.execute_input": "2025-02-27T23:22:20.413424Z",
|
|||
|
"iopub.status.busy": "2025-02-27T23:22:20.412448Z",
|
|||
|
"iopub.status.idle": "2025-02-27T23:22:20.419642Z",
|
|||
|
"shell.execute_reply": "2025-02-27T23:22:20.419208Z"
|
|||
|
},
|
|||
|
"slideshow": {
|
|||
|
"slide_type": ""
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"np.int64(5920)"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 25,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"from sklearn.random_projection import johnson_lindenstrauss_min_dim\n",
|
|||
|
"\n",
|
|||
|
"\n",
|
|||
|
"d = johnson_lindenstrauss_min_dim(m, eps=ε)\n",
|
|||
|
"d"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "subslide"
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"Then can compute a random projection down to dimension $d$:"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 26,
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"execution": {
|
|||
|
"iopub.execute_input": "2025-02-27T23:22:20.421721Z",
|
|||
|
"iopub.status.busy": "2025-02-27T23:22:20.421549Z",
|
|||
|
"iopub.status.idle": "2025-02-27T23:22:23.802948Z",
|
|||
|
"shell.execute_reply": "2025-02-27T23:22:23.802408Z"
|
|||
|
},
|
|||
|
"slideshow": {
|
|||
|
"slide_type": ""
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"(1000, 5920)"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 26,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"n = 8_000\n",
|
|||
|
"np.random.seed(42)\n",
|
|||
|
"\n",
|
|||
|
"# Radom projection matrix\n",
|
|||
|
"P = np.random.randn(d, n) / np.sqrt(d) # std dev = square root of variance\n",
|
|||
|
"\n",
|
|||
|
"X = np.random.randn(m, n) # generate a fake dataset\n",
|
|||
|
"X_reduced = X @ P.T\n",
|
|||
|
"X_reduced.shape"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"slideshow": {
|
|||
|
"slide_type": ""
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"Or use Scikit-Learn built in functionality:"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 27,
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"execution": {
|
|||
|
"iopub.execute_input": "2025-02-27T23:22:23.805852Z",
|
|||
|
"iopub.status.busy": "2025-02-27T23:22:23.804842Z",
|
|||
|
"iopub.status.idle": "2025-02-27T23:22:27.060605Z",
|
|||
|
"shell.execute_reply": "2025-02-27T23:22:27.060048Z"
|
|||
|
},
|
|||
|
"slideshow": {
|
|||
|
"slide_type": ""
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"(1000, 5920)"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 27,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"from sklearn.random_projection import GaussianRandomProjection\n",
|
|||
|
"\n",
|
|||
|
"gaussian_rnd_proj = GaussianRandomProjection(eps=ε, random_state=42)\n",
|
|||
|
"X_reduced = gaussian_rnd_proj.fit_transform(X) \n",
|
|||
|
"X_reduced.shape"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "slide"
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"## Locally linear embedding (LLE)\n",
|
|||
|
"\n",
|
|||
|
"Locally linear embedding (LLE) is a powerful non-linear dimensionality reduction tool.\n",
|
|||
|
"\n",
|
|||
|
"It is a form of *manifold learning* and doesn't rely on projections.\n",
|
|||
|
"\n",
|
|||
|
"LLE measures how each instance relates to closest neighbors, then looks for low-D representation where local relations are best preserved."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "subslide"
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"### First Step: linearly modelling local relationships\n",
|
|||
|
"\n",
|
|||
|
"- First, for each training instance $x^{(i)}$, the algorithm identifies its $k$ closest neighbours.\n",
|
|||
|
"- Then tries to reconstruct $x^{(i)}$ as a linear weighted sum of these neighbors.\n",
|
|||
|
"\n",
|
|||
|
"More specifically, it finds the weights $w_{ij}$ such that the squared distance between $x^{(i)}$ and $\\sum_{j=1}^m w_{ij}x^{(j)}$ is as small as possible, assuming $w_{ij}=0$ if $x^{(j)}$ is not one of the $k$-nearest neighbours of $x^{(i)}$.\n",
|
|||
|
"\n",
|
|||
|
"That is, solve the minimisation problem\n",
|
|||
|
"\n",
|
|||
|
"$$\n",
|
|||
|
"\\text{argmin}_{W} \\sum_{i=1}^m \\left( x^{(i)} - \\sum_{j=1}^m w_{ij}x^{(j)} \\right)^2 ,\n",
|
|||
|
"$$\n",
|
|||
|
"\n",
|
|||
|
"where $W$ is the matrix of weights. Note that weights are constrained to be normalised such that $\\sum_j w_{ij}=1$."
|
|||
|
]
|
|||
|
},
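{
"cell_type": "markdown",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"source": [
"A sketch of how this is solved in practice, following the standard LLE derivation: for each instance the problem has a closed-form solution. Form the local Gram matrix of neighbour differences, $C_{jl} = \\left( x^{(i)} - x^{(j)} \\right) \\cdot \\left( x^{(i)} - x^{(l)} \\right)$ over the $k$ nearest neighbours, solve the small linear system $C w = \\mathbf{1}$, and rescale $w$ so that its entries sum to one."
]
},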
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "subslide"
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"### Second step: reducing dimensionality while preserving relationships\n",
|
|||
|
"\n",
|
|||
|
"The second step is to map the training instances into a $d$-dimensional space (where $d < n$) while preserving these local relationships as much as possible.\n",
|
|||
|
"\n",
|
|||
|
"If $z^{(i)}$ is the d-space equivalent of $x^{(i)}$ then we want the squared distance between $z^{(i)}$ and $\\sum_{i=1}^m w_{ij}z^{(j)}$ to be as small as possible, given **the fixed set of weights** from step 1. \n",
|
|||
|
"\n",
|
|||
|
"That is, solve the minimisation problem\n",
|
|||
|
"\n",
|
|||
|
"$$\n",
|
|||
|
"\\text{argmin}_{Z} \\sum_{i=1}^m \\left( z^{(i)} - \\sum_{j=1}^m \\hat{w}_{ij}z^{(j)} \\right)^2 ,\n",
|
|||
|
"$$\n",
|
|||
|
"\n",
|
|||
|
"where $Z$ is the matrix of lower dimensional positions and $\\hat{w}_{ij}$ are the weights determined in step 1.\n",
|
|||
|
"\n",
|
|||
|
"Note that step 2 optimises positions of $z$ for fixed $w$, while step 1 optimises the $w$ for fixed positions $x$\n",
|
|||
|
"\n"
|
|||
|
]
|
|||
|
},
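{
"cell_type": "markdown",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"source": [
"In the standard derivation (sketched here without proof), this second minimisation also reduces to an eigenvalue problem: subject to centring and unit-covariance constraints on $Z$, the optimal embedding is given by the bottom $d$ non-trivial eigenvectors of the sparse matrix $M = (I - W)^\\text{T} (I - W)$."
]
},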
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "subslide"
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"### Example"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"slideshow": {
|
|||
|
"slide_type": ""
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"#### Plot swiss roll"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 28,
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"execution": {
|
|||
|
"iopub.execute_input": "2025-02-27T23:22:27.063998Z",
|
|||
|
"iopub.status.busy": "2025-02-27T23:22:27.062959Z",
|
|||
|
"iopub.status.idle": "2025-02-27T23:22:27.067705Z",
|
|||
|
"shell.execute_reply": "2025-02-27T23:22:27.067256Z"
|
|||
|
},
|
|||
|
"slideshow": {
|
|||
|
"slide_type": ""
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"from sklearn.datasets import make_swiss_roll\n",
|
|||
|
"\n",
|
|||
|
"X_swiss, t = make_swiss_roll(n_samples=1000, noise=0.2, random_state=42)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 29,
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"execution": {
|
|||
|
"iopub.execute_input": "2025-02-27T23:22:27.070034Z",
|
|||
|
"iopub.status.busy": "2025-02-27T23:22:27.069836Z",
|
|||
|
"iopub.status.idle": "2025-02-27T23:22:27.289425Z",
|
|||
|
"shell.execute_reply": "2025-02-27T23:22:27.288912Z"
|
|||
|
},
|
|||
|
"slideshow": {
|
|||
|
"slide_type": ""
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAi8AAAIvCAYAAAC81DtEAAAAOnRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjEwLjAsIGh0dHBzOi8vbWF0cGxvdGxpYi5vcmcvlHJYcgAAAAlwSFlzAAAPYQAAD2EBqD+naQABAABJREFUeJzs/Xd4Hcd974+/Znb3FOCgV/ZOkZQokupWl2xJlpvkXhKXOIm/N91x7nVuHCdxEif3Jk5xkpufnThxIluJHfeq4shWF9UpkRLF3kH0jtN2d2Z+f+weEABBEiBBEGVez8OHwMHZ3dk9uzPv86nCGGOwWCwWi8VimSXICz0Ai8VisVgslslgxYvFYrFYLJZZhRUvFovFYrFYZhVWvFgsFovFYplVWPFisVgsFotlVmHFi8VisVgsllmFFS8Wi8VisVhmFVa8WCwWi8VimVVY8WKxWCwWi2VWYcWLxWKxWCyWWYUVLxaLxWKxWGYVVrxYLBaLxWKZVVjxYrFYLBaLZVZhxYvFYrFYLJZZhRUvFovFYrFYZhVWvFgsFovFYplVWPFisVgsFotlVmHFi8VisVgsllmFFS8Wi8VisVhmFVa8WCwWi8VimVVY8WKxWCwWi2VWYcWLxWKxWCyWWYUVLxaLxWKxWGYVVrxYLBaLxWKZVVjxYrFYLBaLZVZhxYvFYrFYLJZZhRUvFovFYrFYZhVWvFgsFovFYplVWPFisVgsFotlVmHFi8VisVgsllmFFS8Wi8VisVhmFVa8WCwWi8VimVVY8WKxWCwWi2VWYcWLxWKxWCyWWYUVLxaLxWKxWGYVVrxYLBaLxWKZVVjxYrFYLBaLZVZhxYvFYrFYLJZZhRUvFovFYrFYZhVWvFgsFovFYplVWPFisVgsFotlVmHFi8VisVgsllmFFS8Wi8VisVhmFVa8WCwWi8VimVVY8WKxWCwWi2VWYcWLxWKxWCyWWYUVLxaLxWKxWGYVVrxYLBaLxWKZVVjxYrFYLBaLZVZhxYvFYrFYLJZZhRUvFovFYrFYZhVWvFgsFovFYplVWPFisVgsFotlVmHFi8VisVgsllmFFS8Wi8VisVhmFVa8WCwWi8VimVVY8WKxWCwWi2VWYcWLxWKxWCyWWYUVLxaLxWKxWGYVVrxYLBaLxWKZVVjxYrFYLBaLZVZhxYvFYrFYLJZZhRUvFovFYrFYZhVWvFgsFovFYplVWPFisVgsFotlVmHFi8VisVgsllmFFS8Wi8VisVhmFVa8WCwWi8VimVVY8WKxWCwWi2VWYcWLxWKxWCyWWYUVLxaLxWKxWGYVVrxYLBaLxWKZVVjxYrFYLBaLZVZhxYvFYrFYLJZZhRUvFovFYrFYZhVWvFgsFovFYplVWPFisVgsFotlVmHFi8VisVgsllmFFS8Wi8VisVhmFVa8WCwWi8VimVVY8WKxWCwWi2VWYcWLxWKxWCyWWYUVLxaLxWKxWGYVVrxYLBaLxWKZVVjxYrFYLBaLZVZhxYvFYrFYLJZZhRUvFovFYrFYZhVWvFgsFovFYplVWPFisVgsFotlVmHFi8VisVgsllmFFS8Wi8VisVhmFVa8WCwWi8VimVVY8WKxWCwWi2VWYcWLxWKxWCyWWYUVLxaLxWKxWGYVVrxYLBaLxWKZVVjxYrFYLBaLZVZhxYvFYrFYLJZZhRUvFovFYrFYZhVWvFgsFovFYplVWPFisVgsFotlVmHFi8VisVgsllmFFS8Wi8VisVhmFVa8WCwWi8VimVVY8WKxWCwWi2VWYcWLxWKxWCyWWYUVLxaLxWKxWGYVVrxYLBaLxWKZVVjxYrFYLBaLZVZhxYvFYrFYLJZZhRUvFovFYrFYZhVWvFgsFovFYplVWPFisVgsFotlVmHFi8VisVgsllmFe6EHYLFMF1pr+vv7aW9vRymF67q4rovjOCf97DgOnufhOA6O4yDlCZ0vhLiAZ2GxWCwWK14scxqlFF1dXbS2tt
LR0UGxWKSmpgYpJVprlFLD/0b+rrUetZ+SiBn5DyCRSJwkekb+PFYcjf0nhBgWQ1YUWSwWy8QQxhhzoQdhsUwVxhh836etrY329nY6OztxHIf6+noaGxupq6tDa43WepQ1pbTtyJ9HChulFGEYDgucYrHIvn37WLFixUmi53SCSCk16jhSylHWnVOJoPGE0KnEkbUWWSyWuY61vFhmPcYYstksra2ttLW10dvbSyaToaGhgcsvv5yqqqrhhdwYc5JVpcTIBV4IgZQSz/PGfa/v++zbt49ly5adJILGG99ItNaEYXiSOBrvX+m9vu9PSByNHf9Ya5HrukgpxxU/pxNGJReatRZZLJaZgBUvllmJ1pqenp5hC0s2m6WmpobGxkYuueQS0un0uAvrVC+2EzFcjj3mSLfTVB7fGDPKOlSyFo0UQWMFT7FYPEkQjSeSRjKeKBorkMazGp0qpmjk38YKSIvFYhkPK14ss4YgCOjs7By2sBhjaGhoYOXKldTX1+N53rxb8MYu9olEYkr2O1YU9fb28tprr3HZZZdNyGI00lp0OpfaWGvRSPfZmVxnE4ktstYii2VuYsWLZcZijKFQKNDa2kp7eztdXV0kk0kaGhqorq6moqKCiy66aNL7nYrFa64vgGNFUckqUlZWds77Hs9aNNZNdiZrUS6XO2VM0dlai0aKpNMJoZLVqOSCs6LIYpl+rHixzCi01gwMDAwLloGBASorK2lsbGTNmjWUl5cjpeTVV1+dEYuFjXefPNNlLRoraMYKobEutiAIKBaLw7/n8/nhlPozWYsm40YbL77oVC40K4wslvGx4sVywSmlM5fiV4rFInV1dSxevJiGhgaSyeS4k7cVDtPHbFg8xwu4PhcOHTpENpvl4osvPq21aCL/giA4pZVo5M8jGc+FNhk32qlcadZaZJkLWPFimXZK6czt7e20tbXR2dmJlJKGhgbWrVtHXV0drnv6W9NOupbpYLxF/nxbi8Zah8a61MZaiwqFwqhtxnvfyOOcyVpUykYbWaTxTDFF1lpkmW6seLFMCyPTmdvb2+np6aG8vHzcdObJ7PNCYSdmy1QwmfT8yTLy+Rgvo+xU/44dO0ZFRQWJRGJCNYxGMl7NovFS80+Xlj+eG81aiyxjseLFct4opTOXLCyldOaGhgY2bNhAWVnZWU9EM2UCs66ructs/2xHPiOTSc/v7Oxk0aJFNDQ0jPv3MxVzHK9O0UixM561aDyBNBFr0XjiaDy32kgxNPJ3ay2avVjxYplSwjCko6ODtrY2Ojo6UEpRX1/PihUraGhomNJ05tm+uFgsM5XTPaPTaS0a6f46kwttZJD1SBfaeIJqJCOtRSWX4OlijCaSnl+a56woOn9Y8WI5J0rpzG1tbbS1tdHV1QVAc3Mzl1xyCbW1tVNWkG0kQgjrNrJY5hhnay06ExOxFg0ODnLgwAGWLVt2ktjxff+0MUWnav1xNgHXtvXHxLDixTJpSunMpeyg/v5+KisraWhoYPXq1WzdupVVq1ZRXl5+XscxEywvM2EM08V8Otf5zFz8nCdiLSoJg4ULF05q3+NZi8a6zE5Xt2iyrT9GjnVsoPVnPvMZ0uk0+XweKSU33HADmUyG8vJyMpkMVVVVvPWtbz2LKzjzsOLFMiFOlc68cOFCNm/ePCqdeTqsIufybWO+flOxWCbKfH1Gzua8L5S1aLzMs8svv5xsNsu2bdtoaWlhyZIlZLNZhoaGyGazeJ43KfHy2GOP8bnPfY4XXniB1tZWvvvd73L33XcP//0jH/kI99xzz6ht7rjjDh544IFzPv8zYcWLZVxGpjO3t7fT0dExnM580UUXUV9ff8p05vkw8c2Hc7TMz895LlpeJsJMO++ziS36wz/8QwC+8IUv8Mgjj/Bv//Zv5zSGbDbLpk2b+OhHP8o73vGOcd/zxje+cdRxksnkOR1zoljxYhmmlM5cil8Zmc582WWXUV1dPeF05umwvJztMaZybDNtwr
NYpoL5KtrmynkrpabEAnTnnXdy5513nvY9yWSS5ubmcz7WZLHiZZ6jtaa3t3fYHTQ0NHTO6czTNQFY4WA5n9j7a/4
"text/plain": [
"<Figure size 800x700 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from matplotlib.colors import ListedColormap\n",
"\n",
"darker_hot = ListedColormap(plt.cm.hot(np.linspace(0, 0.8, 256)))\n",
"\n",
"axes = [-11.5, 14, -2, 23, -12, 15]\n",
"\n",
"fig = plt.figure(figsize=(8, 7))\n",
"ax = fig.add_subplot(111, projection='3d')\n",
"\n",
"ax.scatter(X_swiss[:, 0], X_swiss[:, 1], X_swiss[:, 2], c=t, cmap=darker_hot)\n",
"ax.view_init(10, -70)\n",
"set_xyz_axes(ax, axes)"
]
},
{
"cell_type": "markdown",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"source": [
"#### LLE"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"editable": true,
"execution": {
"iopub.execute_input": "2025-02-27T23:22:27.291420Z",
"iopub.status.busy": "2025-02-27T23:22:27.291245Z",
"iopub.status.idle": "2025-02-27T23:22:27.383133Z",
"shell.execute_reply": "2025-02-27T23:22:27.382651Z"
},
"slideshow": {
"slide_type": ""
},
"tags": []
},
"outputs": [],
"source": [
"from sklearn.manifold import LocallyLinearEmbedding\n",
"\n",
"lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10, random_state=42)\n",
"X_unrolled = lle.fit_transform(X_swiss)"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"editable": true,
"execution": {
"iopub.execute_input": "2025-02-27T23:22:27.385572Z",
"iopub.status.busy": "2025-02-27T23:22:27.385146Z",
"iopub.status.idle": "2025-02-27T23:22:27.570413Z",
"shell.execute_reply": "2025-02-27T23:22:27.569904Z"
},
"slideshow": {
"slide_type": ""
},
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"Text(0.5, 1.0, 'Unrolled swiss roll using LLE')"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAlQAAAHHCAYAAAB5gsZZAAAAOnRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjEwLjAsIGh0dHBzOi8vbWF0cGxvdGxpYi5vcmcvlHJYcgAAAAlwSFlzAAAPYQAAD2EBqD+naQABAABJREFUeJzsnXeY1FQXh99kZrayhd679N57ESmCDQVFLFQ7oIiiYgFsgIgIUkTgQ0FAEBFEpS1NOlKl916WBZYtbJ2Z5Psjk9npM8uygHLf5xmW3Nzk3mQmyS/nnnuOpKqqikAgEAgEAoHgppHvdAcEAoFAIBAI/u0IQSUQCAQCgUCQQ4SgEggEAoFAIMghQlAJBAKBQCAQ5BAhqAQCgUAgEAhyiBBUAoFAIBAIBDlECCqBQCAQCASCHCIElUAgEAgEAkEOEYJKIBAIBAKBIIcIQSUQ/EeRJInhw4fbl3/44QckSeL06dO3rI1evXpRpkyZW7a/7JIbx5SbeOpv69atad269W3tx53+3gSC/yJCUAkEN8nw4cORJImrV696XF+9evXb/qAUCO52WrduTfXq1X3W8XdtAaxbtw5Jkrx+5s2bd6u7LhD4xHinOyAQCAQ3y/PPP8/TTz9NcHDwne7Kv4pp06ahKMqd7sYt4fXXX6dBgwZu5U2aNLkDvRHcywhBJRD8S0hNTSUsLOxOd+OuwmAwYDAY7lj7FosFRVEICgq6Y324GUwm053uwi2jRYsWdO3a9U53QyAQQ34Cwe1CH6L4+eef+fzzzylRogQhISE88MADHD9+3KmuPiyyc+dOWrZsSVhYGO+//z4AcXFx9O3bl8KFCxMSEkKtWrWYOXPmTfdr2bJltGjRgvDwcCIiInjooYc4cOCAW73FixdTvXp1QkJCqF69OosWLQq4jR07dtChQwcKFChAaGgoZcuWpU+fPvb1devW5YknnnDapkaNGkiSxN69e+1l8+fPR5IkDh06BHj2SfLXFsC8efOoV68eERERREZGUqNGDcaPH+/zGE6fPo0kSYwZM4Zx48ZRvnx5goODOXjwIABr1qyxn8fo6Ggee+wxez9zit72Dz/84LbO1VcuOTmZgQMHUqZMGYKDgylUqBDt2rVj165d9jquPlSOxzZ16lT7sTVo0IDt27e7tblgwQKqVq3q9FsQflmCex1hoRIIbjOjRo1ClmXefvttEhMTGT16NM8++yzbtm1zqnft2jU6duzI008/zXPPPUfhwoVJS0ujdevWHD9+nP79+1O2bFkWLFhAr169SEhI4I033shWX3788Ud69uxJhw4d+OKLL0hNTeXbb7+lefPm7N692/6AXLlyJV26dKFq1aqMHDmSa9eu0bt3b0qUKOG3jbi4ONq3b0/BggV57733iI6O5vTp0/z666/2Oi1atOCnn36yL8fHx3PgwAFkWWbDhg3UrFkTgA0bNlCwYEGqVKly023FxMTQvXt3HnjgAb744gsADh06xKZNmwI6f99//z3p6em89NJLBAcHky9fPlatWkXHjh0pV64cw4cPJy0tjQkTJtCsWTN27dp1W4XGK6+8wi+//EL//v2pWrUq165dY+PGjRw6dIi6dev63Hbu3LkkJyfz8ssvI0kSo0eP5oknnuDkyZN2q9aff/5Jt27dqFGjBiNHjuT69ev07duX4sWL347DcyM5Odmjr1X+/PmRJOkO9Ehwz6IKBIKbYtiwYSqgXrlyxeP6atWqqa1atbIvr127VgXUKlWqqBkZGfby8ePHq4C6b98+e1mrVq1UQJ0yZYrTPseNG6cC6uzZs+1lmZmZapMmTdQ8efKoSUlJ9nJAHTZsmH35+++/VwH11KlTqqqqanJyshodHa2++OKLTm3ExsaqUVFRTuW1a9dWixYtqiYkJNjLVq5cqQJq6dKlvZ8kVVUXLVqkAur27du91lmwYIEKqAcPHlRVVVWXLFmiBgcHq48++qjarVs3e72aNWuqjz
/+uNdjCqStN954Q42MjFQtFovPfrty6tQpFVAjIyPVuLg4p3W1a9dWCxUqpF67ds1e9s8//6iyLKs9evTw2l9V1b5rx9+Jr7a///57t3Wu33NUVJTar18/n/vr2bOn0/em7z9//vxqfHy8vfy3335TAfX333+3l9WoUUMtUaKEmpycbC9bt25dQL8FVdWOt1q1aj7r+Lu2VDXrevL2uXTpkt++CAS3EjHkJxDcZnr37u3kc9OiRQsATp486VQvODiY3r17O5UtXbqUIkWK0L17d3uZyWTi9ddf58aNG/z1118B9yMmJoaEhAS6d+/O1atX7R+DwUCjRo1Yu3YtAJcuXWLPnj307NmTqKgo+/bt2rWjatWqftuJjo4G4I8//sBsNnuso5+D9evXA5olqkGDBrRr144NGzYAkJCQwP79++11b7at6OhoUlJSiImJ8dt3T3Tp0oWCBQval/Xz06tXL/Lly2cvr1mzJu3atWPp0qU31c7NEh0dzbZt27h48WK2t+3WrRt58+a1L7v+Ni9evMi+ffvo0aMHefLksddr1aoVNWrUyGHPb46hQ4cSExPj9nH8LgSC24EQVAJBLuJpyKFUqVJOy/oD7Pr1607lxYsXd3N2PnPmDBUqVECWnS9dfQjszJkzAfft2LFjALRp04aCBQs6fVauXElcXJzTPitUqOC2j0qVKvltp1WrVnTp0oWPP/6YAgUK8Nhjj/H999+TkZFhr1O4cGEqVKhgF08bNmygRYsWtGzZkosXL3Ly5Ek2bdqEoig+BVUgbb322mtUrFiRjh07UqJECfr06cPy5csDOGMaZcuWdVrWz4+nc1GlShWuXr1KSkpKwPvPKaNHj2b//v2ULFmShg0bMnz4cDex7g1/v039WO+77z63bT2V3Q5q1KhB27Zt3T7/tokCgn8/QlAJBDdJSEgIAGlpaR7Xp6am2us44m1WmqqqTsuhoaE57KFv9GnzP/74o8c3/N9+++2WtCNJEr/88gtbtmyhf//+XLhwgT59+lCvXj1u3Lhhr9e8eXM2bNhAWloaO3fupEWLFlSvXp3o6Gg2bNjAhg0byJMnD3Xq1MlRW4UKFWLPnj0sWbKERx99lLVr19KxY0d69uwZ0PHk9vfiCW++QFar1a3sqaee4uTJk0yYMIFixYrx5ZdfUq1aNZYtW+a3nUB/mwKBwB0hqASCm6R06dIAHDlyxG1damoq586ds9e5lW0eO3bMLYbQ4cOHnfoUCOXLlwc0geHpDV8PSqrvU7doOeLp2L3RuHFjPv/8c3bs2MGcOXM4cOCAU/DFFi1acPbsWebNm4fVaqVp06bIsmwXWhs2bKBp06YBhUnw11ZQUBCPPPIIkydP5sSJE7z88svMmjXLbbZlIPj6HRw+fJgCBQoQHh6e7f06oluKEhISnMq9WSSLFi3Ka6+9xuLFizl16hT58+fn888/z1EfIOtYPZ2nmzl3AsF/CSGoBIKb5IEHHiAoKIhvv/3WTeBMnToVi8VCx44db2mbnTp1IjY2lvnz59vLLBYLEyZMIE+ePLRq1SrgfXXo0IHIyEhGjBjh0d/oypUrgPZwrl27NjNnziQxMdG+PiYmxh4ywBfXr193s3DUrl0bwGkoTh/K++KLL6hZs6bdX6tFixasXr2aHTt2+BzuC7Sta9euOa2XZdk+i9CxP4HieH4cBc/+/ftZuXIlnTp1yvY+XYmMjKRAgQJ2HzOdyZMnOy1brVan7wg0wVysWLGbOjZXihUrRvXq1Zk1a5aTdfGvv/5i3759Od6/QPBvRoRNEAhukkKFCjF06FA+/PBDWrZsyaOPPkpYWBibN2/mp59+on379jzyyCO3tM2XXnqJ7777jl69erFz507KlCnDL7/8wqZNmxg3bhwREREB7ysyMpJvv/2W559/nrp16/L0009TsGBBzp49y59//kmzZs2YOHEiACNHjuShhx6iefPm9OnTh/j4eCZMmEC1atWcHqyemDlzJpMnT+
bxxx+nfPnyJCcnM23aNCIjI53Exn333UeRIkU4cuQIAwYMsJe3bNmSd999F8CvoAqkrRdeeIH4+HjatGlDiRIlOHP
"text/plain": [
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"plt.scatter(X_unrolled[:, 0], X_unrolled[:, 1],\n",
"            c=t, cmap=darker_hot)\n",
"plt.xlabel(\"$z_1$\")\n",
"plt.ylabel(\"$z_2$\", rotation=0)\n",
"plt.axis([-0.055, 0.060, -0.070, 0.090])\n",
"plt.grid(True)\n",
"plt.title(\"Unrolled swiss roll using LLE\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"source": [
"The Swiss roll is unrolled, and distances between instances are well preserved locally.\n",
"\n",
"However, distances are not preserved on a larger scale."
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.11"
}
},
"nbformat": 4,
"nbformat_minor": 4
}