{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "source": [
    "# Lecture 18: Dimensionality reduction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "skip"
    },
    "tags": []
   },
   "source": [
    "\n",
    "[Run in colab](https://colab.research.google.com/drive/19J80Hg1ZLxLrpHWbgxFZoyPfql_AdeGN)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "editable": true,
    "execution": {
     "iopub.execute_input": "2025-02-27T23:21:48.075087Z",
     "iopub.status.busy": "2025-02-27T23:21:48.074841Z",
     "iopub.status.idle": "2025-02-27T23:21:48.081076Z",
     "shell.execute_reply": "2025-02-27T23:21:48.080546Z"
    },
    "slideshow": {
     "slide_type": "skip"
    },
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Last executed: 2025-02-27 23:21:48\n"
     ]
    }
   ],
   "source": [
    "import datetime\n",
    "now = datetime.datetime.now()\n",
    "print(\"Last executed: \" + now.strftime(\"%Y-%m-%d %H:%M:%S\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "source": [
    "## Overview of dimensionality reduction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "source": [
    "* Many problems have thousands or millions of features.\n",
    "* We often want to reduce the number of dimensions, i.e. the number of features.\n",
    "* Dimensionality reduction is (typically) lossy."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "subslide"
    },
    "tags": []
   },
   "source": [
    "### Disadvantages\n",
    "\n",
    "* It may (or may not) speed up training, but it discards some information and so can degrade the resulting performance.\n",
    "* It also makes pipelines more complex.\n",
    "\n",
    "$\\Rightarrow$ You should always try using the original data before considering dimensionality reduction."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "subslide"
    },
    "tags": []
   },
   "source": [
    "### Advantages\n",
    "\n",
    "* Can make problems tractable that would otherwise have been intractable.\n",
    "* Very useful for visualization (2D, 3D representations are more intuitive)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "subslide"
    },
    "tags": []
   },
   "source": [
    "### Curse of Dimensionality\n",
    "\n",
    "Many things behave differently in high dimensional space.\n",
    "\n",
    "Our intuition is built from our experience of the 3D world and often fails in higher dimensions."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "subslide"
    },
    "tags": []
   },
   "source": [
    "For example:\n",
    "- Pick a random point in a unit square (a 1 × 1 square): it has only about a 0.4% chance of being located less than 0.001 from a border (in other words, it is very unlikely that a random point will be “extreme” along any dimension).\n",
    "- But in a 10,000-dimensional unit hypercube (a 1 × 1 × ⋯ × 1 cube, with ten thousand 1s), this probability is greater than 99.999999%. Most points in a high-dimensional hypercube are very close to the border."
   ]
  },
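The hypercube claim above can be checked directly. A minimal sketch (not part of the lecture code): the probability that at least one of $d$ independent uniform coordinates lies within 0.001 of a border is $1 - (1 - 0.002)^d$.

```python
def border_prob(d, eps=1e-3):
    """P(a uniform random point in the d-dimensional unit hypercube
    lies within eps of some face) = 1 - (1 - 2*eps)**d."""
    return 1.0 - (1.0 - 2.0 * eps) ** d

print(f"d = 2:      {border_prob(2):.4%}")       # about 0.4%
print(f"d = 10,000: {border_prob(10_000):.8%}")  # greater than 99.999999%
```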
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "subslide"
    },
    "tags": []
   },
   "source": [
    "### Two main approaches for dimensionality reduction\n",
    "\n",
    "1. Projection\n",
    "2. Manifold learning"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "subslide"
    },
    "tags": []
   },
   "source": [
    "#### Projection\n",
    "\n",
    "- In most real-world problems, training instances are not spread out uniformly across all dimensions.\n",
    "- Many features are almost constant, while others are highly correlated.\n",
    "- As a result, all training instances actually lie within (or close to) a much lower-dimensional subspace."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "subslide"
    },
    "tags": []
   },
   "source": [
    "For example, consider the following 3D data-set lying close to a 2D subspace.\n",
    "\n",
    "<img src=\"https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture18_Images/Projection.png\" alt=\"data-layout\" width=\"700\" style=\"display:block; margin:auto\"/>\n",
    "\n",
    "[Source: Geron]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "subslide"
    },
    "tags": []
   },
   "source": [
    "The data can be projected onto a 2D plane without losing too much information.\n",
    "\n",
    "We've reduced the dimension of the data from 3 to 2.\n",
    "\n",
    "<img src=\"https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture18_Images/projection_2d.png\" alt=\"data-layout\" width=\"700\" style=\"display:block; margin:auto\"/>\n",
    "\n",
    "[Source: Geron]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "subslide"
    },
    "tags": []
   },
   "source": [
    "#### Manifold learning"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "source": [
    "However, projection is not always the best approach for dimensionality reduction.\n",
    "\n",
    "Some data-sets may lie on spaces (manifolds) that twist and turn.\n",
    "\n",
    "<img src=\"https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture18_Images/swiss_roll.png\" alt=\"data-layout\" width=\"500\" style=\"display:block; margin:auto\"/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "subslide"
    },
    "tags": []
   },
   "source": [
    "##### Projection versus unrolling\n",
    "\n",
    "Projecting would squash the layers of the swiss roll together (left plot), whereas it is better to \"unroll\" the swiss roll (right plot).\n",
    "\n",
    "<img src=\"https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture18_Images/swiss_roll_unrolled.png\" alt=\"data-layout\" width=\"800\" style=\"display:block; margin:auto\"/>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "subslide"
    },
    "tags": []
   },
   "source": [
    "##### Manifold hypothesis\n",
    "\n",
    "Many dimensionality reduction algorithms work by modeling the manifold on which the training instances lie; this is called *Manifold Learning*.\n",
    "\n",
    "It relies on the manifold hypothesis.\n",
    "\n",
    "> Manifold hypothesis: most real-world high-dimensional data-sets lie close to a much lower-dimensional manifold. This assumption is very often empirically observed.\n"
   ]
  },
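For experimenting with manifold learning, scikit-learn can generate a swiss-roll data-set like the one pictured above. A short sketch (the sample count and noise level are illustrative choices, not from the lecture):

```python
from sklearn.datasets import make_swiss_roll

# X holds the 3D samples; t is the position along the roll, i.e. one
# intrinsic coordinate of the underlying 2D manifold
X, t = make_swiss_roll(n_samples=1000, noise=0.05, random_state=42)
print(X.shape)  # (1000, 3)
```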
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "source": [
    "## Principal component analysis (PCA)\n",
    "\n",
    "Most popular dimensionality reduction algorithm.\n",
    "\n",
    "1. Find the hyperplane that lies closest to the data.\n",
    "2. Project the data onto it.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "subslide"
    },
    "tags": []
   },
   "source": [
    "### Principal components"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "source": [
    "Choose directions in the data that preserve the maximum variability.\n",
    "\n",
    "<img src=\"https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture18_Images/pca_variance.png\" alt=\"data-layout\" width=\"800\" style=\"display:block; margin:auto\"/>\n",
    "\n",
    "[Source: Geron]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "subslide"
    },
    "tags": []
   },
   "source": [
    "PCA finds the direction that accounts for the largest amount of *variance* in the data-set.\n",
    "\n",
    "It then finds the next direction, orthogonal to the first, that accounts for the largest amount of the remaining variance.\n",
    "\n",
    "The process repeats, up to the full dimension of the data. Typically, however, the majority of the variance in the data is captured in many fewer dimensions than the maximum number.\n",
    "\n",
    "Each axis vector is called a principal component."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "subslide"
    },
    "tags": []
   },
   "source": [
    "Principal components are found by applying Singular Value Decomposition (SVD), a matrix factorization technique, to the feature matrix $X$.\n",
    "\n",
    "The principal components are the eigenvectors of the covariance matrix of the data, $X^\\text{T} X$.\n",
    "\n",
    "The SVD of $X$ is given by:\n",
    "\n",
    "$$X=U \\Sigma V^T,$$\n",
    " \n",
    "where the columns of $V$ give the principal components (which are normalised and orthogonal).\n",
    "\n",
    "See the [tutorial here](https://arxiv.org/abs/1404.1100) for further details about PCA."
   ]
  },
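The stated equivalence between the SVD and the eigendecomposition of $X^\text{T} X$ can be verified numerically. A small sketch on random data (any centered data matrix would do):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X_centered = X - X.mean(axis=0)

# SVD route: rows of Vt (columns of V) are the principal components
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Eigen route: eigenvectors of X^T X, sorted by decreasing eigenvalue
eigvals, eigvecs = np.linalg.eigh(X_centered.T @ X_centered)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Same directions (up to sign), and squared singular values = eigenvalues
for k in range(3):
    assert np.allclose(np.abs(Vt[k]), np.abs(eigvecs[:, k]))
assert np.allclose(s**2, eigvals)
```

The eigenvectors are only defined up to a sign flip, hence the comparison of absolute values.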
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "subslide"
    },
    "tags": []
   },
   "source": [
    "**Note: PCA assumes the data is centered on the origin. Scikit-Learn's PCA centers the data for you.**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "subslide"
    },
    "tags": []
   },
   "source": [
    "#### Example\n",
    "\n",
    "Generate example data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "editable": true,
    "execution": {
     "iopub.execute_input": "2025-02-27T23:21:48.115817Z",
     "iopub.status.busy": "2025-02-27T23:21:48.115633Z",
     "iopub.status.idle": "2025-02-27T23:21:48.339410Z",
     "shell.execute_reply": "2025-02-27T23:21:48.338849Z"
    },
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from scipy.spatial.transform import Rotation\n",
    "\n",
    "m = 60\n",
    "X = np.zeros((m, 3))  # initialize 3D dataset\n",
    "np.random.seed(42)\n",
    "angles = (np.random.rand(m) ** 3 + 0.5) * 2 * np.pi  # uneven distribution\n",
    "X[:, 0], X[:, 1] = np.cos(angles), np.sin(angles) * 0.5  # oval\n",
    "X += 0.28 * np.random.randn(m, 3)  # add more noise\n",
    "X = Rotation.from_rotvec([np.pi / 29, -np.pi / 20, np.pi / 4]).apply(X)\n",
    "X += [0.2, 0, 0.2]  # shift a bit"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "editable": true,
    "execution": {
     "iopub.execute_input": "2025-02-27T23:21:48.342580Z",
     "iopub.status.busy": "2025-02-27T23:21:48.341611Z",
     "iopub.status.idle": "2025-02-27T23:21:49.227869Z",
     "shell.execute_reply": "2025-02-27T23:21:49.227246Z"
    },
    "slideshow": {
     "slide_type": "subslide"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "from sklearn.decomposition import PCA\n",
    "\n",
    "pca = PCA(n_components=2)\n",
    "X2D = pca.fit_transform(X)  # dataset reduced to 2D\n",
    "X3D_inv = pca.inverse_transform(X2D)  # 3D position of the projected samples\n",
    "X_centered = X - X.mean(axis=0)\n",
    "U, s, Vt = np.linalg.svd(X_centered)\n",
    "\n",
    "axes = [-1.4, 1.4, -1.4, 1.4, -1.1, 1.1]\n",
    "x1, x2 = np.meshgrid(np.linspace(axes[0], axes[1], 10),\n",
    "                     np.linspace(axes[2], axes[3], 10))\n",
    "w1, w2 = np.linalg.solve(Vt[:2, :2], Vt[:2, 2])  # projection plane coefs\n",
    "z = w1 * (x1 - pca.mean_[0]) + w2 * (x2 - pca.mean_[1]) - pca.mean_[2]  # plane\n",
    "X3D_above = X[X[:, 2] >= X3D_inv[:, 2]]  # samples above plane\n",
    "X3D_below = X[X[:, 2] < X3D_inv[:, 2]]  # samples below plane"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "editable": true,
    "execution": {
     "iopub.execute_input": "2025-02-27T23:21:49.230095Z",
     "iopub.status.busy": "2025-02-27T23:21:49.229790Z",
     "iopub.status.idle": "2025-02-27T23:21:49.520446Z",
     "shell.execute_reply": "2025-02-27T23:21:49.519862Z"
    },
    "slideshow": {
     "slide_type": "subslide"
    },
    "tags": []
   },
   "outputs": [
    {
     "data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAs4AAALJCAYAAACk6aWUAAAAOnRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjEwLjAsIGh0dHBzOi8vbWF0cGxvdGxpYi5vcmcvlHJYcgAAAAlwSFlzAAAPYQAAD2EBqD+naQABAABJREFUeJzs/Xl4I/l534t+q7ARO0A2d7KbbPbGnpnep7vJ9mjzSCNFiTL2OToeO0dWrFga27J0HflajiON5Xis5MaKLW9yNE6urhwd58hWYsV6jmx5kbXPjGZr7vvSJLu5NgFwwV5Vv/sHWEXsKAAFoEC+n+eZp3uaIPBDoVD1rbe+7/flGGMMBEEQBEEQBEHkha/1AgiCIAiCIAiiHiDhTBAEQRAEQRAqIOFMEARBEARBECog4UwQBEEQBEEQKiDhTBAEQRAEQRAqIOFMEARBEARBECog4UwQBEEQBEEQKiDhTBAEQRAEQRAqIOFMEARBEARBECog4UwQBEEQBEEQKiDhTBAEQRAEQRAqIOFMEARBEARBECog4UwQBEEQBEEQKiDhTBAEQRAEQRAqIOFMEARBEARBECog4UwQBEEQBEEQKiDhTBAEQRAEQRAqIOFMEARBEARBECog4UwQBEEQBEEQKiDhTBAEQRAEQRAqIOFMEARBEARBECog4UwQBEEQBEEQKiDhTBAEQRAEQRAqIOFMEARBEARBECog4UwQBEEQBEEQKiDhTBAEQRAEQRAqIOFMEARBEARBECog4UwQBEEQBEEQKiDhTBAEQRAEQRAqIOFMEARBEARBECog4UwQBEEQBEEQKiDhTBAEQRAEQRAqIOFMEARBEARBECog4UwQBEEQBEEQKiDhTBAEQRAEQRAqIOFMEARBEARBECog4UwQBEEQBEEQKiDhTBAEQRAEQRAqIOFMEARBEARBECog4UwQBEEQBEEQKiDhTBAEQRAEQRAqIOFMEARBEARBECog4UwQBEEQBEEQKiDhTBAEQRAEQRAqIOFMEARBEARBECog4UwQBEEQBEEQKiDhTBAEQRAEQRAqIOFMEARBEARBECog4UwQBEEQBEEQKiDhTBAEQRAEQRAqIOFMEARBEARBECog4UwQBEEQBEEQKiDhTBAEQRAEQRAqIOFMEARBEARBECog4UwQBEEQBEEQKiDhTBAEQRAEQRAqIOFMEARBEARBECog4UwQBEEQBEEQKiDhTBAEQRAEQRAqIOFMEARBEARBECog4UwQBEEQBEEQKiDhTBBEzWCM1XoJBEEQBKEaY60XQBDE8UOSJMTjcUQiEZhMJhiNRhgMBvA8D47jar08giAIgsgKx6jkQxBElWCMKaJZFEVEIhHwPA/GGHieB8/zMBqNJKQJgiAIXULCmSCIqsAYgyAIEARB+f9YLAaDwQDGmCKqZTiOA8/zMJlMMBgMMBqN4DiOhDRBEARRM8iqQRBExUmuMsuCWBRF5eeyIOb5RNuFLKRFUYQgCMrPZQEtV6RJSBMEQRDVhIQzQRAVI1n8SpKUYr3IJ3hzCWlBEBCPx1OEtFyRlq0dBEEQBFEpSDgTBFERGGNKlRlAWX5lEtIEQRCEHiDhTBCE5kiShFgsllFlTqdSQhpARqMhCWmCIAiiXKg5kCAIzZCtGfF4XEnKyCWOZXFdCTGb3GzIGFOENglpgiAIohxIOBMEoQnp1oxCjXuVFM7Z1iaLaJl0IS2ndhAEQRBELkg4EwRRNnKVuZA1I5lqCud0koV0top0cmoHQRAEQciQcCYIomTSs5mLiYerpXBOJ5uQ5nk+o9mQhDRBEMTxhoQzQRAlIWczy0NLihXAehLO6eQS0ukeaRLSBEEQxwsSzgRBFEXy2OxirBnp6Fk4JyMfIklIEwRBECScCYJQTbENgPmeZ35+Htvb2/B4PPB6vXC5XLoX0UBuIS3/Z7FYyCNNEARxRCHhTBCEKpLHZpczzCQSiWBkZASRSAQdHR3Y3d2F3+
+HKIpwu93wer3wer1wOp11I6QZY4hEInj55ZfxxBNPUEWaIAjiiEIDUAiCyEu+sdnFsrW1hdHRUZw4cQJXrlxRno8xhlAoBL/fD7/fj+XlZTDGlGq01+uFw+HQpfBMnloIHE5IZIwhGo0iGo0qQlpuNDQajWVtR4IgCKI2kHAmCCInWo3NliQJMzMzWFlZwcWLF9HZ2al4nIGE+LTb7bDb7ejq6gJjDPv7+/D7/QgEAlhcXATHcSlC2m6361J4Jk80NBgMSkVarkonP8ZkMikVaRLSBEEQ+oeEM0EQWSklmzkboVAIw8PDkCQJAwMDcDgcBX+H4zg4nU44nU6cPHkSkiQpQnp7exsLCwvgeR5er1cR0zabrabCM99Ycfln+YS0XIkmIU0QBKFfSDgTBJFCejZzOQJubW0N4+Pj6OjowPnz5xU7Q7HwPA+XywWXy4VTp05BkiTs7e3B5/Nha2sLc3NzMBqNSjXa6/WioaFBl8Izl5CWJEkR0jzPZ3ikSUgTBEHUHhLOBEEopGczl5qaIYoiJicnsbGxgcceewytra2arpPnebjdbrjdbuX15CbDtbU1TE9Pw2w2ZwjpalBsv3U+IR2NRhGJREhIEwRB6AQSzgRBpGQzJ8erlcLe3h6Gh4dhNBoxODgIq9Wq8WozMRgMikAGEkJ6Z2cHfr8fDx48wNTUFBoaGlKsHRaLpeLrKoX0bS8LaVEUIYpizmbDcj4zgiAIQh0knAnimKNlNvP9+/cxNTWFnp4e9PX11SxOzmAwoLGxEY2NjQAAQRAQCATg9/uxsrKCiYkJ2Gw2RWx7PB6YzeayXrNSolX+PORtmSykBUFISfVIrkiTkCYIgtAeEs4EcYzRKps5Ho9jfHwcfr8f165dQ1NTk8YrLQ+j0YgTJ07gxIkTABLrlYX04uIigsEgHA6HUo32eDwwmUw1XnV2cglpQRAQj8dzCul6yMQmCILQOyScCeIYIic6BINB2Gy2skRzIBDA8PAw7HY7BgcHdWuBSMZkMqG5uRnNzc0AgFgspgjp+fl5hEIhOJ1OpSLtdrthNKo7XFZ7plQxQlq2dpCQJgiCKA0SzgRxzJCtGQ8fPsT09DR+5Ed+pGRrxuLiIubm5nD27Fn09PTUrTXAbDajpaUFLS0tAIBoNKoMY5menkY0Gs0Q0qUmhFSaQkIaQNaphiSkCYIgCkPCmSCOEenZzHIjYLFEo1GMjo4iGAzi5s2b8Hg82i+2hlgsFrS1taGtrQ0AEA6HlYr05OQkYrGYMh7c4/HA7Xbr9qIhl5COx+OIxWLKz0lIEwRBFIaEM0EcA5KzmRljSrxZKbaChw8fYnR0FF6vF4ODg7r1AmuJ1WqF1WpFe3s7GGMIh8PKVMMHDx5AEAS4XC4AwO7uLjwej26FZzYhLXvd5Yp0upCWUzsIgiCOOyScCeKII0kSBEHIGJvNcVxRwlmSJMzNzWFpaQkXLlxAV1dX2WJKXkM9iTKO42Cz2WCz2dDZ2QnGGEKhELa3txEIBDA2NgbGGDwej9Js6HQ6dfseZf+zTLKQzlWRJiFNEMRxhYQzQRxRCmUzFyOcw+EwhoeHIQgCbt++DafTqdka612AcRwHu90Oi8WCubk53L59G9FoVLF2LC0tAYAior1eL+x2u27ftxohzfN8RrOhXt8PQRCElpBwJogjSPrY7GyZvhzHKRMC87GxsYGxsTG0traiv79ft01xtUbevhzHwel0wul0oru7G4wx7O3twe/3Y3t7GwsLC+B5PkVI22w23QpPtUI63SOt1/dDEARRDiScCeKIkZzNnOxlTaeQx1kURUxPT2N1dRWPPPII2tvbK7XkIw3HcXC5XHC5XDh16hQkSVKE9NbWFubm5mA0GlOmGlqtVt0Kz2QhLe8/kiQhFospUw0FQQDP83A4HCSkCYI4UpBwJogjQvI0OTk1I59YyWfV2N/fx/DwMHiex+DgIGw2W6WWfeQoZH/heR5utxtutxs9PT2QJEkZD7
6xsYGZmRmYzWalGu31etHQ0FCl1ReHvH+lC+nl5WWEQiGcP3+eKtIEQRwpSDgTxBEgfWy2moEm2YQzYwwPHjzA5OQ
      "text/plain": [
       "<Figure size 900x900 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "fig = plt.figure(figsize=(9, 9))\n",
    "ax = fig.add_subplot(111, projection=\"3d\")\n",
    "\n",
    "# plot samples and projection lines below plane first\n",
    "ax.plot(X3D_below[:, 0], X3D_below[:, 1], X3D_below[:, 2], \"ro\", alpha=0.3)\n",
    "for i in range(m):\n",
    "    if X[i, 2] < X3D_inv[i, 2]:\n",
    "        ax.plot([X[i][0], X3D_inv[i][0]],\n",
    "                [X[i][1], X3D_inv[i][1]],\n",
    "                [X[i][2], X3D_inv[i][2]], \":\", color=\"#F88\")\n",
    "\n",
    "ax.plot_surface(x1, x2, z, alpha=0.1, color=\"b\")  # projection plane\n",
    "ax.plot(X3D_inv[:, 0], X3D_inv[:, 1], X3D_inv[:, 2], \"b+\")  # projected samples\n",
    "ax.plot(X3D_inv[:, 0], X3D_inv[:, 1], X3D_inv[:, 2], \"b.\")\n",
    "\n",
    "# now plot projection lines and samples above plane\n",
    "for i in range(m):\n",
    "    if X[i, 2] >= X3D_inv[i, 2]:\n",
    "        ax.plot([X[i][0], X3D_inv[i][0]],\n",
    "                [X[i][1], X3D_inv[i][1]],\n",
    "                [X[i][2], X3D_inv[i][2]], \"r--\")\n",
    "\n",
    "ax.plot(X3D_above[:, 0], X3D_above[:, 1], X3D_above[:, 2], \"ro\")\n",
    "\n",
    "def set_xyz_axes(ax, axes):\n",
    "    ax.xaxis.set_rotate_label(False)\n",
    "    ax.yaxis.set_rotate_label(False)\n",
    "    ax.zaxis.set_rotate_label(False)\n",
    "    ax.set_xlabel(\"$x_1$\", labelpad=8, rotation=0)\n",
    "    ax.set_ylabel(\"$x_2$\", labelpad=8, rotation=0)\n",
    "    ax.set_zlabel(\"$x_3$\", labelpad=8, rotation=0)\n",
    "    ax.set_xlim(axes[0:2])\n",
    "    ax.set_ylim(axes[2:4])\n",
    "    ax.set_zlim(axes[4:6])\n",
    "\n",
    "set_xyz_axes(ax, axes)\n",
    "ax.set_zticks([-1, -0.5, 0, 0.5, 1]);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "subslide"
    },
    "tags": []
   },
   "source": [
    "Compute SVD"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "editable": true,
    "execution": {
     "iopub.execute_input": "2025-02-27T23:21:49.523152Z",
     "iopub.status.busy": "2025-02-27T23:21:49.522594Z",
     "iopub.status.idle": "2025-02-27T23:21:49.526229Z",
     "shell.execute_reply": "2025-02-27T23:21:49.525813Z"
    },
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "X_centered = X - X.mean(axis=0)\n",
    "U, s, Vt = np.linalg.svd(X_centered)\n",
    "V = Vt.T\n",
    "c1 = V[:, 0]\n",
    "c2 = V[:, 1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "editable": true,
    "execution": {
     "iopub.execute_input": "2025-02-27T23:21:49.527883Z",
     "iopub.status.busy": "2025-02-27T23:21:49.527709Z",
     "iopub.status.idle": "2025-02-27T23:21:49.531736Z",
     "shell.execute_reply": "2025-02-27T23:21:49.531386Z"
    },
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(array([0.67857588, 0.70073508, 0.22023881]),\n",
       " array([-0.72817329,  0.6811147 ,  0.07646185]))"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "c1, c2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "source": [
    "Reconstruct $\\Sigma$ from `s`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "editable": true,
    "execution": {
     "iopub.execute_input": "2025-02-27T23:21:49.533494Z",
     "iopub.status.busy": "2025-02-27T23:21:49.533326Z",
     "iopub.status.idle": "2025-02-27T23:21:49.536249Z",
     "shell.execute_reply": "2025-02-27T23:21:49.535878Z"
    },
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "m, n = X.shape\n",
    "Σ = np.zeros_like(X_centered)\n",
    "Σ[:n, :n] = np.diag(s)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "source": [
    "Check that we recover `X_centered`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "editable": true,
    "execution": {
     "iopub.execute_input": "2025-02-27T23:21:49.537981Z",
     "iopub.status.busy": "2025-02-27T23:21:49.537785Z",
     "iopub.status.idle": "2025-02-27T23:21:49.540650Z",
     "shell.execute_reply": "2025-02-27T23:21:49.540108Z"
    },
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "assert np.allclose(X_centered, U @ Σ @ Vt)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "subslide"
    },
    "tags": []
   },
   "source": [
    "### Projecting down to d dimensions\n",
    "\n",
    "Once the principal components are identified, we can project the data onto the hyperplane defined by the first $d$ principal components.\n",
    "\n",
    "This is performed by taking the dot product between the data and the principal components.\n",
    "\n",
    "1. Construct the projection matrix $P$ by taking the first $d$ columns of $V$.\n",
    "2. Multiply the feature matrix by the projection matrix to recover a lower-dimensional representation of the data: $X_d = X P$.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "subslide"
    },
    "tags": []
   },
   "source": [
    "#### Check sizes\n",
    "\n",
    "For $X$ of size $m \\times n$, i.e. $n$ features and $m$ data instances, the matrices have the following sizes:\n",
    "\n",
    "- $V$ is of size $n \\times n$.\n",
    "- $P$ (first $d$ columns of $V$) is of size $n \\times d$.\n",
    "- $X_d = X P$ is of size $m \\times d$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "editable": true,
    "execution": {
     "iopub.execute_input": "2025-02-27T23:21:49.542460Z",
     "iopub.status.busy": "2025-02-27T23:21:49.542294Z",
     "iopub.status.idle": "2025-02-27T23:21:49.545152Z",
     "shell.execute_reply": "2025-02-27T23:21:49.544566Z"
    },
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "W2 = Vt[:2].T\n",
    "X2D = X_centered @ W2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "editable": true,
    "execution": {
     "iopub.execute_input": "2025-02-27T23:21:49.546946Z",
     "iopub.status.busy": "2025-02-27T23:21:49.546752Z",
     "iopub.status.idle": "2025-02-27T23:21:49.550054Z",
     "shell.execute_reply": "2025-02-27T23:21:49.549458Z"
    },
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Original dimension is 3\n",
      "Reduced dimension is 2\n"
     ]
    }
   ],
   "source": [
    "print(f\"Original dimension is {X_centered.shape[-1]}\")\n",
    "print(f\"Reduced dimension is {X2D.shape[-1]}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "subslide"
    },
    "tags": []
   },
   "source": [
    "### Scikit-Learn PCA\n",
    "\n",
    "Scikit-Learn's `PCA` class also uses SVD but makes things even easier.\n",
    "\n",
    "You can access each principal component via the `components_` attribute."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "editable": true,
    "execution": {
     "iopub.execute_input": "2025-02-27T23:21:49.551841Z",
     "iopub.status.busy": "2025-02-27T23:21:49.551672Z",
     "iopub.status.idle": "2025-02-27T23:21:49.555266Z",
     "shell.execute_reply": "2025-02-27T23:21:49.554662Z"
    },
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "from sklearn.decomposition import PCA\n",
    "\n",
    "pca = PCA(n_components=2)\n",
    "X2D = pca.fit_transform(X)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "editable": true,
    "execution": {
     "iopub.execute_input": "2025-02-27T23:21:49.557016Z",
     "iopub.status.busy": "2025-02-27T23:21:49.556827Z",
     "iopub.status.idle": "2025-02-27T23:21:49.560687Z",
     "shell.execute_reply": "2025-02-27T23:21:49.560112Z"
    },
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 0.67857588,  0.70073508,  0.22023881],\n",
       "       [ 0.72817329, -0.6811147 , -0.07646185]])"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pca.components_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "subslide"
    },
    "tags": []
   },
   "source": [
    "### Explained variance ratio\n",
    "\n",
    "The *explained variance ratio* is a very useful metric.\n",
    "\n",
    "It gives the proportion of the dataset's variance that lies along the axis of each principal component."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "editable": true,
    "execution": {
     "iopub.execute_input": "2025-02-27T23:21:49.562418Z",
     "iopub.status.busy": "2025-02-27T23:21:49.562252Z",
     "iopub.status.idle": "2025-02-27T23:21:49.565839Z",
     "shell.execute_reply": "2025-02-27T23:21:49.565454Z"
    },
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([0.7578477 , 0.15186921])"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pca.explained_variance_ratio_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "source": [
    "The first dimension explains about 76% of the variance, while the second explains about 15%."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "source": [
    "By projecting down to 2D, we lost about 9% of the variance:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "editable": true,
    "execution": {
     "iopub.execute_input": "2025-02-27T23:21:49.567654Z",
     "iopub.status.busy": "2025-02-27T23:21:49.567484Z",
     "iopub.status.idle": "2025-02-27T23:21:49.571274Z",
     "shell.execute_reply": "2025-02-27T23:21:49.570904Z"
    },
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "np.float64(0.09028309326742034)"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "1 - pca.explained_variance_ratio_.sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "subslide"
    },
    "tags": []
   },
   "source": [
    "### Choosing the right number of dimensions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "source": [
    "Compute the explained variance cumulatively to see how much is captured by a given number of principal components."
   ]
  },
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 15,
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"execution": {
|
|||
|
"iopub.execute_input": "2025-02-27T23:21:49.573126Z",
|
|||
|
"iopub.status.busy": "2025-02-27T23:21:49.572952Z",
|
|||
|
"iopub.status.idle": "2025-02-27T23:21:53.659798Z",
|
|||
|
"shell.execute_reply": "2025-02-27T23:21:53.659297Z"
|
|||
|
},
|
|||
|
"slideshow": {
|
|||
|
"slide_type": ""
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"np.int64(154)"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 15,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"from sklearn.datasets import fetch_openml\n",
|
|||
|
"\n",
|
|||
|
"mnist = fetch_openml('mnist_784', as_frame=False, parser=\"auto\")\n",
|
|||
|
"X_train, y_train = mnist.data[:60_000], mnist.target[:60_000]\n",
|
|||
|
"X_test, y_test = mnist.data[60_000:], mnist.target[60_000:]\n",
|
|||
|
"\n",
|
|||
|
"pca = PCA()\n",
|
|||
|
"pca.fit(X_train)\n",
|
|||
|
"cumsum = np.cumsum(pca.explained_variance_ratio_)\n",
|
|||
|
"d = np.argmax(cumsum >= 0.95) + 1\n",
|
|||
|
"d"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"slideshow": {
|
|||
|
"slide_type": ""
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"So we capture $\\geq95$% of the variance in $154$ dimensions.\n",
|
|||
|
"\n",
|
|||
|
"Compared to original dimensionality of $28 \\times 28 =784$), we compress to $\\sim 20$% of the original size."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "subslide"
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"Scikit-Learn also has built in functionality to automatically select number of principal components that capture a given level of variance."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 16,
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"execution": {
|
|||
|
"iopub.execute_input": "2025-02-27T23:21:53.662280Z",
|
|||
|
"iopub.status.busy": "2025-02-27T23:21:53.662094Z",
|
|||
|
"iopub.status.idle": "2025-02-27T23:21:55.179585Z",
|
|||
|
"shell.execute_reply": "2025-02-27T23:21:55.179013Z"
|
|||
|
},
|
|||
|
"slideshow": {
|
|||
|
"slide_type": ""
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"pca = PCA(n_components=0.95)\n",
|
|||
|
"X_reduced = pca.fit_transform(X_train)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 17,
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"execution": {
|
|||
|
"iopub.execute_input": "2025-02-27T23:21:55.183080Z",
|
|||
|
"iopub.status.busy": "2025-02-27T23:21:55.182254Z",
|
|||
|
"iopub.status.idle": "2025-02-27T23:21:55.187758Z",
|
|||
|
"shell.execute_reply": "2025-02-27T23:21:55.187347Z"
|
|||
|
},
|
|||
|
"slideshow": {
|
|||
|
"slide_type": ""
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"np.int64(154)"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 17,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"pca.n_components_"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "subslide"
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"### PCA for compression"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"slideshow": {
|
|||
|
"slide_type": ""
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"We can also reconstruct high dimensional representation from compressed representation.\n",
|
|||
|
"\n",
|
|||
|
"Compression is lossy, so we won't recover the original high dimensional representation perfectly.\n",
|
|||
|
"\n",
|
|||
|
"Example applying PCA to MNIST dataset with 95% preservation = results in ~150 features (original = 28x28 = 784)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"slideshow": {
|
|||
|
"slide_type": ""
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"Since principal components are orthogonal, we can estimate a recovered high dimensional representation by\n",
|
|||
|
"\n",
|
|||
|
"$$X_\\text{recovered} = X_{d} P^\\text{T}.$$"
|
|||
|
]
|
|||
|
},
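{
"cell_type": "markdown",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"source": [
"The mean squared distance between an original instance and its reconstruction is called the *reconstruction error*:\n",
"\n",
"$$\\text{MSE} = \\frac{1}{m} \\sum_{i=1}^m \\left\\| x^{(i)} - x^{(i)}_\\text{recovered} \\right\\|^2.$$"
]
},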
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 18,
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"execution": {
|
|||
|
"iopub.execute_input": "2025-02-27T23:21:55.190225Z",
|
|||
|
"iopub.status.busy": "2025-02-27T23:21:55.190043Z",
|
|||
|
"iopub.status.idle": "2025-02-27T23:21:55.643751Z",
|
|||
|
"shell.execute_reply": "2025-02-27T23:21:55.643184Z"
|
|||
|
},
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "subslide"
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"X_recovered = pca.inverse_transform(X_reduced)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 19,
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"execution": {
|
|||
|
"iopub.execute_input": "2025-02-27T23:21:55.647099Z",
|
|||
|
"iopub.status.busy": "2025-02-27T23:21:55.646185Z",
|
|||
|
"iopub.status.idle": "2025-02-27T23:21:55.864486Z",
|
|||
|
"shell.execute_reply": "2025-02-27T23:21:55.863997Z"
|
|||
|
},
|
|||
|
"scrolled": true,
|
|||
|
"slideshow": {
|
|||
|
"slide_type": ""
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAjIAAAEgCAYAAACuOplUAAAAOnRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjEwLjAsIGh0dHBzOi8vbWF0cGxvdGxpYi5vcmcvlHJYcgAAAAlwSFlzAAAPYQAAD2EBqD+naQABAABJREFUeJzsvXeUpFd5Jv5Uzrk6VHVX59zTEzRRGmmUGQlJCFaATbIkE89iFq9l9uzyMwcMNtjYa2PwYsOaxWDQEgQiCQkJpJHQKIxmRtMz0z3TOXdXd1dVV85V3++P2ffOra+rOqeRvuecPt1d9YV7v3Dve5/3ed9XJgiCAAkSJEiQIEGChGsQ8u1ugAQJEiRIkCBBwlohGTISJEiQIEGChGsWkiEjQYIECRIkSLhmIRkyEiRIkCBBgoRrFpIhI0GCBAkSJEi4ZiEZMhIkSJAgQYKEaxaSISNBggQJEiRIuGYhGTISJEiQIEGChGsWkiEjQYIECRIkSLhmIRkybxJ87nOfg0wmW9O+//7v/w6ZTIbR0dGNbRSH0dFRyGQy/Pu///umnUOCBAkSthsPPfQQ6urqtrsZbyhIhsw1gJ6eHrz//e9HVVUVNBoN3G433ve+96Gnp2e7myZBgoRrDENDQ/joRz+KhoYGaLVamM1mHD16FP/0T/+ERCKx3c2TIGHVkAyZHY6f/vSnuO666/C73/0ODz/8ML7+9a/jgx/8IJ577jlcd911ePzxx1d0nL/4i79Y8yD1gQ98AIlEArW1tWvaX4IECTsDTzzxBLq6uvCjH/0I9913H772ta/hS1/6EmpqavCpT30Kn/zkJ7e7iRIkrBrK7W6AhNIYGhrCBz7wATQ0NOCFF15AWVkZ++6Tn/wkbrrpJnzgAx/A+fPn0dDQUPQYsVgMBoMBSqUSSuXabrdCoYBCoVjTvhIkSNgZGBkZwR/+4R+itrYWzz77LFwuF/vu4x//OAYHB/HEE09sYwtXh2w2i3w+D7Vavd1NkbDNkBiZHYy/+7u/Qzwexze/+c0CIwYAnE4nvvGNbyAWi+HLX/4ygKs6mN7eXrz3ve+FzWbDjTfeWPAdj0Qigf/yX/4LnE4nTCYT3va2t2FqagoymQyf+9zn2HbFNDJ1dXW499578eKLL+LQoUPQarVoaGjAd7/73YJzBAIB/Pmf/zm6urpgNBphNptx9913o7u7ewOvlAQJEpbDl7/8ZUSjUXzrW98qMGIITU1NjJHJZrP4whe+gMbGRmg0GtTV1eHTn/40UqlUwT40Dpw4cQIHDhyATqdDV1cXTpw4AeAKo9zV1QWtVov9+/fj9ddfL9j/oYcegtFoxPDwMI4fPw6DwQC3243Pf/7zEASBbUcaur//+7/HV77yFdau3t5eAMDly5fxzne+E3a7HVqtFgcOHMAvfvGLgnNlMhn85V/+JZqbm6HVauFwOHDjjTfimWeeYdt4vV48/PDDqK6uhkajgcvlwv33379IH/jkk0/ipptugsFggMlkwj333FPU1f+zn/0Mu3btglarxa5du1bMoEtYHSRGZgfjl7/8Jerq6nDTTTcV/f7YsWOoq6tbtIp617vehebmZnzxi18sGAzEeOihh/CjH/0IH/jAB3DkyBE8//zzuOeee1bcvsHBQbzzne/EBz/4QTz44IP4P//n/+Chhx7C/v370dnZCQAYHh7Gz372M7zrXe9CfX09Zmdn8Y1vfAM333wzent74Xa7V3w+CRIkrB2//OUv0dDQgBtuuGHZbT/0oQ/hO9/5Dt75znfikUcewauvvoovfelLuHTp0qLJeHBwEO9973vx0Y9+FO9///vx93//97jvvvvwr//6r/j0pz+N//yf/zMA4Etf+hLe/e53o6+vD3L51TV0LpfDXXfdhSNHjuDLX/4ynnrqKXz2s59FNpvF5z//+YJzffvb30YymcRHPvIRaDQa2O129PT04OjRo6iqqsJ//+//HQaDAT/60Y/w9r
e/HT/5yU/wjne8A8CVxdyXvvQlfOhDH8KhQ4cQDodx+vRpnD17FnfeeScA4IEHHkBPTw8+8YlPoK6uDnNzc3jmmWcwPj7OBLr/8R//gQcffBDHjx/H3/7t3yIej+Nf/uVfcOONN+L1119n2z399NN44IEH0NHRgS996Uvw+/3MSJKwwRAk7EgEg0EBgHD//fcvud3b3vY2AYAQDoeFz372swIA4T3vec+i7eg7wpkzZwQAwp/+6Z8WbPfQQw8JAITPfvaz7LNvf/vbAgBhZGSEfVZbWysAEF544QX22dzcnKDRaIRHHnmEfZZMJoVcLldwjpGREUGj0Qif//znCz4DIHz7299esr8SJEhYPUKh0IrGE0EQhHPnzgkAhA996EMFn//5n/+5AEB49tln2Wc0Drz00kvss9/85jcCAEGn0wljY2Ps82984xsCAOG5555jnz344IMCAOETn/gE+yyfzwv33HOPoFarhfn5eUEQro4PZrNZmJubK2jX7bffLnR1dQnJZLLgGDfccIPQ3NzMPtuzZ49wzz33lOz3wsKCAED4u7/7u5LbRCIRwWq1Ch/+8IcLPvd6vYLFYin4fO/evYLL5RKCwSD77OmnnxYACLW1tSXPIWH1kFxLOxSRSAQAYDKZltyOvg+Hw+yzj33sY8se/6mnngIAtloifOITn1hxGzs6OgrYorKyMrS2tmJ4eJh9ptFo2Oorl8vB7/fDaDSitbUVZ8+eXfG5JEiQsHbQ+LDceAIAv/71rwEAf/Znf1bw+SOPPAIAixjgjo4OXH/99ez/w4cPAwBuu+021NTULPqcHx8If/Inf8L+lslk+JM/+ROk02n89re/LdjugQceKHCzBwIBPPvss3j3u9+NSCQCn88Hn88Hv9+P48ePY2BgAFNTUwAAq9WKnp4eDAwMFO23TqeDWq3GiRMnsLCwUHSbZ555BsFgEO95z3vYuXw+HxQKBQ4fPoznnnsOADAzM4Nz587hwQcfhMViYfvfeeed6OjoKHpsCWuHZMjsUNCAQwZNKRQzeOrr65c9/tjYGORy+aJtm5qaVtxGfpAi2Gy2gkEgn8/jH//xH9Hc3AyNRgOn04mysjKcP38eoVBoxeeSIEHC2mE2mwEsP54AV8cG8VhQWVkJq9WKsbGxgs/F4wBN3B6Pp+jnYiNBLpcvClZoaWkBgEXaFPF4NTg4CEEQ8JnPfAZlZWUFP5/97GcBAHNzcwCAz3/+8wgGg2hpaUFXVxc+9alP4fz58+xYGo0Gf/u3f4snn3wSFRUVOHbsGL785S/D6/WybcgIuu222xad7+mnn2bnomvU3NwMMVpbWxd9JmF9kDQyOxQWiwUul6vgRSuG8+fPo6qqig1UwJWVxVagVCSTwOlyvvjFL+Izn/kM/viP/xhf+MIXYLfbIZfL8ad/+qfI5/Nb0k4JEt7sMJvNcLvduHjx4or3WWkCzVLjwErGh9VCPLbRGPLnf/7nOH78eNF9yCA7duwYhoaG8POf/xxPP/00/u3f/g3/+I//iH/913/Fhz70IQDAn/7pn+K+++7Dz372M/zmN7/BZz7zGXzpS1/Cs88+i3379rHz/cd//AcqKysXnWutkaES1gfpqu9g3Hvvvfjf//t/48UXX2TRRzx+//vfY3R0FB/96EdXfeza2lrk83mMjIwUrBoGBwfX1WYxHnvsMdx666341re+VfB5MBiE0+nc0HNJkCChNO69915885vfxMsvv1zgChKDxoaBgQG0t7ezz2dnZxEMBjc8n1Q+n8fw8DBjYQCgv78fAJbNgEtMjkqlwh133LHsuex2Ox5++GE8/PDDiEajOHbsGD73uc8xQwYAGhsb8cgjj+CRRx7BwMAA9u7di//5P/8nvve976GxsREAUF5evuT56BoVc2P19fUt204Jq4PkWtrB+NSnPgWdToePfvSj8Pv9Bd8FAgF87GMfg16vx6c+9alVH5tWL1//+tcLPv/a17629gYXgUKhWLQC+/
GPf8z81hIkSNga/Lf/9t9gMBjwoQ99CLOzs4u+Hxoawj/90z/hrW99KwDgK1/5SsH3//AP/wAAq4psXCn++Z//mf0
|
|||
|
"text/plain": [
|
|||
|
"<Figure size 700x400 with 2 Axes>"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"plt.figure(figsize=(7, 4))\n",
|
|||
|
"for idx, X in enumerate((X_train[::2100], X_recovered[::2100])):\n",
|
|||
|
" plt.subplot(1, 2, idx + 1)\n",
|
|||
|
" plt.title([\"Original\", \"Compressed\"][idx])\n",
|
|||
|
" for row in range(5):\n",
|
|||
|
" for col in range(5):\n",
|
|||
|
" plt.imshow(X[row * 5 + col].reshape(28, 28), cmap=\"binary\",\n",
|
|||
|
" vmin=0, vmax=255, extent=(row, row + 1, col, col + 1))\n",
|
|||
|
" plt.axis([0, 5, 0, 5])\n",
|
|||
|
" plt.axis(\"off\")"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "subslide"
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"### Randomized PCA \n",
|
|||
|
"\n",
|
|||
|
"*Randomized PCA* quickly find an approximation of the first first $d$ principal components.\n",
|
|||
|
"\n",
|
|||
|
"It is much quicker than the full SVD approach.\n",
|
|||
|
"\n",
|
|||
|
"Its computational complexity is ${\\mathcal O}( m\\times d^2) + {\\mathcal O}( d^3)$, instead of ${\\mathcal O}( m \\times n^2) + {\\mathcal O}( n^3)$, so it is dramatically faster than the previous algorithms when $d$ is much smaller than $n$."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 20,
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"execution": {
|
|||
|
"iopub.execute_input": "2025-02-27T23:21:55.866516Z",
|
|||
|
"iopub.status.busy": "2025-02-27T23:21:55.866345Z",
|
|||
|
"iopub.status.idle": "2025-02-27T23:22:02.100911Z",
|
|||
|
"shell.execute_reply": "2025-02-27T23:22:02.100391Z"
|
|||
|
},
|
|||
|
"slideshow": {
|
|||
|
"slide_type": ""
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"rnd_pca = PCA(n_components=154, svd_solver=\"randomized\", random_state=42)\n",
|
|||
|
"X_reduced = rnd_pca.fit_transform(X_train)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 21,
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"execution": {
|
|||
|
"iopub.execute_input": "2025-02-27T23:22:02.103482Z",
|
|||
|
"iopub.status.busy": "2025-02-27T23:22:02.103301Z",
|
|||
|
"iopub.status.idle": "2025-02-27T23:22:02.108558Z",
|
|||
|
"shell.execute_reply": "2025-02-27T23:22:02.108081Z"
|
|||
|
},
|
|||
|
"slideshow": {
|
|||
|
"slide_type": ""
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"154"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 21,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"rnd_pca.n_components_"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "subslide"
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"### Incremental PCA \n",
|
|||
|
"\n",
|
|||
|
"One problem with PCA is that it requires the entire training set to be read into memory at once. Thus, cannot apply to very large training data-sets. \n",
|
|||
|
"\n",
|
|||
|
"Instead *Incremental PCA (IPCA)* splits the data into batches, and allows for combination. \n",
|
|||
|
"\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 22,
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"execution": {
|
|||
|
"iopub.execute_input": "2025-02-27T23:22:02.110486Z",
|
|||
|
"iopub.status.busy": "2025-02-27T23:22:02.110313Z",
|
|||
|
"iopub.status.idle": "2025-02-27T23:22:20.395636Z",
|
|||
|
"shell.execute_reply": "2025-02-27T23:22:20.395095Z"
|
|||
|
},
|
|||
|
"slideshow": {
|
|||
|
"slide_type": ""
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"from sklearn.decomposition import IncrementalPCA\n",
|
|||
|
"\n",
|
|||
|
"n_batches = 100\n",
|
|||
|
"inc_pca = IncrementalPCA(n_components=154)\n",
|
|||
|
"for X_batch in np.array_split(X_train, n_batches):\n",
|
|||
|
" inc_pca.partial_fit(X_batch)\n",
|
|||
|
"\n",
|
|||
|
"X_reduced = inc_pca.transform(X_train)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 23,
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"execution": {
|
|||
|
"iopub.execute_input": "2025-02-27T23:22:20.398654Z",
|
|||
|
"iopub.status.busy": "2025-02-27T23:22:20.397666Z",
|
|||
|
"iopub.status.idle": "2025-02-27T23:22:20.403297Z",
|
|||
|
"shell.execute_reply": "2025-02-27T23:22:20.402792Z"
|
|||
|
},
|
|||
|
"slideshow": {
|
|||
|
"slide_type": ""
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"154"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 23,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"inc_pca.n_components_"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "subslide"
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"**Exercises:** *You can now complete Exercise 1 in the exercises associated with this lecture.*"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "slide"
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"## Random projections"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"slideshow": {
|
|||
|
"slide_type": ""
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"Even with randomized or incremental PCA, PCA can be too computationally expensive for very large data-sets.\n",
|
|||
|
"\n",
|
|||
|
"An alternative it to consider *random* projections to lower dimensional spaces through linear projections."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "subslide"
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"Random projections can actually preserve distances reasonably well. Hence, two similar data instances remain similar after projection. Similarly, two very different data instances remain very different after projection.\n",
|
|||
|
"\n",
|
|||
|
"The more dimensions that are dropped, the more information is lost, and hence the more distances get distorted."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "subslide"
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"### Selecting the reduced dimension size\n",
|
|||
|
"\n",
|
|||
|
"There is some deep mathematical theory that specifies how much dimensions can be compressed in order to ensure distances are distorted by less than some tolerance $\\epsilon$. \n",
|
|||
|
"\n",
|
|||
|
"This is called the *Johnson-Lindenstrauss* minimum dimension:\n",
|
|||
|
"\n",
|
|||
|
"$$d \\geq \\frac{4 \\log(m)}{\\epsilon^2/2 - \\epsilon^3/3}.$$\n",
|
|||
|
"\n",
|
|||
|
"(Doesn't depend on number of features $n$.)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "subslide"
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"### Example"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"slideshow": {
|
|||
|
"slide_type": ""
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"For example:"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 24,
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"execution": {
|
|||
|
"iopub.execute_input": "2025-02-27T23:22:20.405853Z",
|
|||
|
"iopub.status.busy": "2025-02-27T23:22:20.405675Z",
|
|||
|
"iopub.status.idle": "2025-02-27T23:22:20.409905Z",
|
|||
|
"shell.execute_reply": "2025-02-27T23:22:20.409455Z"
|
|||
|
},
|
|||
|
"slideshow": {
|
|||
|
"slide_type": ""
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"5920"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 24,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"m, ε = 1_000, 0.1\n",
|
|||
|
"\n",
|
|||
|
"d = int(4 * np.log(m) / (ε ** 2 / 2 - ε ** 3 / 3))\n",
|
|||
|
"d"
|
|||
|
]
|
|||
|
},
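{
"cell_type": "markdown",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"source": [
"As a sanity check, plugging the numbers into the Johnson-Lindenstrauss bound by hand gives\n",
"\n",
"$$d \\geq \\frac{4 \\log(1000)}{0.1^2/2 - 0.1^3/3} \\approx \\frac{27.631}{0.004667} \\approx 5921,$$\n",
"\n",
"which truncates to $5920$ under `int`."
]
},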
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"slideshow": {
|
|||
|
"slide_type": ""
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"Or using Scikit-Learn:"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 25,
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"execution": {
|
|||
|
"iopub.execute_input": "2025-02-27T23:22:20.413424Z",
|
|||
|
"iopub.status.busy": "2025-02-27T23:22:20.412448Z",
|
|||
|
"iopub.status.idle": "2025-02-27T23:22:20.419642Z",
|
|||
|
"shell.execute_reply": "2025-02-27T23:22:20.419208Z"
|
|||
|
},
|
|||
|
"slideshow": {
|
|||
|
"slide_type": ""
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"np.int64(5920)"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 25,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"from sklearn.random_projection import johnson_lindenstrauss_min_dim\n",
|
|||
|
"\n",
|
|||
|
"\n",
|
|||
|
"d = johnson_lindenstrauss_min_dim(m, eps=ε)\n",
|
|||
|
"d"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "subslide"
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"Then can compute a random projection down to dimension $d$:"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 26,
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"execution": {
|
|||
|
"iopub.execute_input": "2025-02-27T23:22:20.421721Z",
|
|||
|
"iopub.status.busy": "2025-02-27T23:22:20.421549Z",
|
|||
|
"iopub.status.idle": "2025-02-27T23:22:23.802948Z",
|
|||
|
"shell.execute_reply": "2025-02-27T23:22:23.802408Z"
|
|||
|
},
|
|||
|
"slideshow": {
|
|||
|
"slide_type": ""
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"(1000, 5920)"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 26,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"n = 8_000\n",
|
|||
|
"np.random.seed(42)\n",
|
|||
|
"\n",
|
|||
|
"# Radom projection matrix\n",
|
|||
|
"P = np.random.randn(d, n) / np.sqrt(d) # std dev = square root of variance\n",
|
|||
|
"\n",
|
|||
|
"X = np.random.randn(m, n) # generate a fake dataset\n",
|
|||
|
"X_reduced = X @ P.T\n",
|
|||
|
"X_reduced.shape"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"slideshow": {
|
|||
|
"slide_type": ""
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"Or use Scikit-Learn built in functionality:"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 27,
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"execution": {
|
|||
|
"iopub.execute_input": "2025-02-27T23:22:23.805852Z",
|
|||
|
"iopub.status.busy": "2025-02-27T23:22:23.804842Z",
|
|||
|
"iopub.status.idle": "2025-02-27T23:22:27.060605Z",
|
|||
|
"shell.execute_reply": "2025-02-27T23:22:27.060048Z"
|
|||
|
},
|
|||
|
"slideshow": {
|
|||
|
"slide_type": ""
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"(1000, 5920)"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 27,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"from sklearn.random_projection import GaussianRandomProjection\n",
|
|||
|
"\n",
|
|||
|
"gaussian_rnd_proj = GaussianRandomProjection(eps=ε, random_state=42)\n",
|
|||
|
"X_reduced = gaussian_rnd_proj.fit_transform(X) \n",
|
|||
|
"X_reduced.shape"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "slide"
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"## Locally linear embedding (LLE)\n",
|
|||
|
"\n",
|
|||
|
"Locally linear embedding (LLE) is a powerful non-linear dimensionality reduction tool.\n",
|
|||
|
"\n",
|
|||
|
"It is a form of *manifold learning* and doesn't rely on projections.\n",
|
|||
|
"\n",
|
|||
|
"LLE measures how each instance relates to closest neighbors, then looks for low-D representation where local relations are best preserved."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "subslide"
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"### First Step: linearly modelling local relationships\n",
|
|||
|
"\n",
|
|||
|
"- First, for each training instance $x^{(i)}$, the algorithm identifies its $k$ closest neighbours.\n",
|
|||
|
"- Then tries to reconstruct $x^{(i)}$ as a linear weighted sum of these neighbors.\n",
|
|||
|
"\n",
|
|||
|
"More specifically, it finds the weights $w_{ij}$ such that the squared distance between $x^{(i)}$ and $\\sum_{j=1}^m w_{ij}x^{(j)}$ is as small as possible, assuming $w_{ij}=0$ if $x^{(j)}$ is not one of the $k$-nearest neighbours of $x^{(i)}$.\n",
|
|||
|
"\n",
|
|||
|
"That is, solve the minimisation problem\n",
|
|||
|
"\n",
|
|||
|
"$$\n",
|
|||
|
"\\text{argmin}_{W} \\sum_{i=1}^m \\left( x^{(i)} - \\sum_{j=1}^m w_{ij}x^{(j)} \\right)^2 ,\n",
|
|||
|
"$$\n",
|
|||
|
"\n",
|
|||
|
"where $W$ is the matrix of weights. Note that weights are constrained to be normalised such that $\\sum_j w_{ij}=1$."
|
|||
|
]
|
|||
|
},
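{
"cell_type": "markdown",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"source": [
"A sketch of how this is solved in practice, following the standard LLE derivation: for each instance the problem has a closed-form solution. Form the local Gram matrix of neighbour differences, $C_{jl} = \\left( x^{(i)} - x^{(j)} \\right) \\cdot \\left( x^{(i)} - x^{(l)} \\right)$ over the $k$ nearest neighbours, solve the small linear system $C w = \\mathbf{1}$, and rescale $w$ so that its entries sum to one."
]
},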
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "subslide"
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"### Second step: reducing dimensionality while preserving relationships\n",
|
|||
|
"\n",
|
|||
|
"The second step is to map the training instances into a $d$-dimensional space (where $d < n$) while preserving these local relationships as much as possible.\n",
|
|||
|
"\n",
|
|||
|
"If $z^{(i)}$ is the d-space equivalent of $x^{(i)}$ then we want the squared distance between $z^{(i)}$ and $\\sum_{i=1}^m w_{ij}z^{(j)}$ to be as small as possible, given **the fixed set of weights** from step 1. \n",
|
|||
|
"\n",
|
|||
|
"That is, solve the minimisation problem\n",
|
|||
|
"\n",
|
|||
|
"$$\n",
|
|||
|
"\\text{argmin}_{Z} \\sum_{i=1}^m \\left( z^{(i)} - \\sum_{j=1}^m \\hat{w}_{ij}z^{(j)} \\right)^2 ,\n",
|
|||
|
"$$\n",
|
|||
|
"\n",
|
|||
|
"where $Z$ is the matrix of lower dimensional positions and $\\hat{w}_{ij}$ are the weights determined in step 1.\n",
|
|||
|
"\n",
|
|||
|
"Note that step 2 optimises positions of $z$ for fixed $w$, while step 1 optimises the $w$ for fixed positions $x$\n",
|
|||
|
"\n"
|
|||
|
]
|
|||
|
},
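{
"cell_type": "markdown",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"source": [
"In the standard derivation (sketched here without proof), this second minimisation also reduces to an eigenvalue problem: subject to centring and unit-covariance constraints on $Z$, the optimal embedding is given by the bottom $d$ non-trivial eigenvectors of the sparse matrix $M = (I - W)^\\text{T} (I - W)$."
]
},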
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "subslide"
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"### Example"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"slideshow": {
|
|||
|
"slide_type": ""
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"#### Plot swiss roll"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 28,
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"execution": {
|
|||
|
"iopub.execute_input": "2025-02-27T23:22:27.063998Z",
|
|||
|
"iopub.status.busy": "2025-02-27T23:22:27.062959Z",
|
|||
|
"iopub.status.idle": "2025-02-27T23:22:27.067705Z",
|
|||
|
"shell.execute_reply": "2025-02-27T23:22:27.067256Z"
|
|||
|
},
|
|||
|
"slideshow": {
|
|||
|
"slide_type": ""
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"from sklearn.datasets import make_swiss_roll\n",
|
|||
|
"\n",
|
|||
|
"X_swiss, t = make_swiss_roll(n_samples=1000, noise=0.2, random_state=42)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 29,
|
|||
|
"metadata": {
|
|||
|
"editable": true,
|
|||
|
"execution": {
|
|||
|
"iopub.execute_input": "2025-02-27T23:22:27.070034Z",
|
|||
|
"iopub.status.busy": "2025-02-27T23:22:27.069836Z",
|
|||
|
"iopub.status.idle": "2025-02-27T23:22:27.289425Z",
|
|||
|
"shell.execute_reply": "2025-02-27T23:22:27.288912Z"
|
|||
|
},
|
|||
|
"slideshow": {
|
|||
|
"slide_type": ""
|
|||
|
},
|
|||
|
"tags": []
|
|||
|
},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAi8AAAIvCAYAAAC81DtEAAAAOnRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjEwLjAsIGh0dHBzOi8vbWF0cGxvdGxpYi5vcmcvlHJYcgAAAAlwSFlzAAAPYQAAD2EBqD+naQABAABJREFUeJzs/Xd4Hcd974+/Znb3FOCgV/ZOkZQokupWl2xJlpvkXhKXOIm/N91x7nVuHCdxEif3Jk5xkpufnThxIluJHfeq4shWF9UpkRLF3kH0jtN2d2Z+f+weEABBEiBBEGVez8OHwMHZ3dk9uzPv86nCGGOwWCwWi8VimSXICz0Ai8VisVgslslgxYvFYrFYLJZZhRUvFovFYrFYZhVWvFgsFovFYplVWPFisVgsFotlVmHFi8VisVgsllmFFS8Wi8VisVhmFVa8WCwWi8VimVVY8WKxWCwWi2VWYcWLxWKxWCyWWYUVLxaLxWKxWGYVVrxYLBaLxWKZVVjxYrFYLBaLZVZhxYvFYrFYLJZZhRUvFovFYrFYZhVWvFgsFovFYplVWPFisVgsFotlVmHFi8VisVgsllmFFS8Wi8VisVhmFVa8WCwWi8VimVVY8WKxWCwWi2VWYcWLxWKxWCyWWYUVLxaLxWKxWGYVVrxYLBaLxWKZVVjxYrFYLBaLZVZhxYvFYrFYLJZZhRUvFovFYrFYZhVWvFgsFovFYplVWPFisVgsFotlVmHFi8VisVgsllmFFS8Wi8VisVhmFVa8WCwWi8VimVVY8WKxWCwWi2VWYcWLxWKxWCyWWYUVLxaLxWKxWGYVVrxYLBaLxWKZVVjxYrFYLBaLZVZhxYvFYrFYLJZZhRUvFovFYrFYZhVWvFgsFovFYplVWPFisVgsFotlVmHFi8VisVgsllmFFS8Wi8VisVhmFVa8WCwWi8VimVVY8WKxWCwWi2VWYcWLxWKxWCyWWYUVLxaLxWKxWGYVVrxYLBaLxWKZVVjxYrFYLBaLZVZhxYvFYrFYLJZZhRUvFovFYrFYZhVWvFgsFovFYplVWPFisVgsFotlVmHFi8VisVgsllmFFS8Wi8VisVhmFVa8WCwWi8VimVVY8WKxWCwWi2VWYcWLxWKxWCyWWYUVLxaLxWKxWGYVVrxYLBaLxWKZVVjxYrFYLBaLZVZhxYvFYrFYLJZZhRUvFovFYrFYZhVWvFgsFovFYplVWPFisVgsFotlVmHFi8VisVgsllmFFS8Wi8VisVhmFVa8WCwWi8VimVVY8WKxWCwWi2VWYcWLxWKxWCyWWYUVLxaLxWKxWGYVVrxYLBaLxWKZVVjxYrFYLBaLZVZhxYvFYrFYLJZZhRUvFovFYrFYZhVWvFgsFovFYplVWPFisVgsFotlVmHFi8VisVgsllmFFS8Wi8VisVhmFVa8WCwWi8VimVVY8WKxWCwWi2VWYcWLxWKxWCyWWYUVLxaLxWKxWGYVVrxYLBaLxWKZVVjxYrFYLBaLZVZhxYvFYrFYLJZZhRUvFovFYrFYZhVWvFgsFovFYplVWPFisVgsFotlVmHFi8VisVgsllmFFS8Wi8VisVhmFVa8WCwWi8VimVVY8WKxWCwWi2VWYcWLxWKxWCyWWYUVLxaLxWKxWGYVVrxYLBaLxWKZVVjxYrFYLBaLZVZhxYvFYrFYLJZZhRUvFovFYrFYZhVWvFgsFovFYplVWPFisVgsFotlVmHFi8VisVgsllmFFS8Wi8VisVhmFVa8WCwWi8VimVVY8WKxWCwWi2VWYcWLxWKxWCyWWYUVLxaLxWKxWGYVVrxYLBaLxWKZVVjxYrFYLBaLZVZhxYvFYrFYLJZZhRUvFovFYrFYZhVWvFgsFovFYplVWPFisVgsFotlVmHFi8VisVgsllmFe6EHYLFMF1pr+vv7aW9vRymF67q4rovjOCf97DgOnufhOA6O4yDlCZ0vhLiAZ2GxWCwWK14scxqlFF1dXbS2tt
LR0UGxWKSmpgYpJVprlFLD/0b+rrUetZ+SiBn5DyCRSJwkekb+PFYcjf0nhBgWQ1YUWSwWy8QQxhhzoQdhsUwVxhh836etrY329nY6OztxHIf6+noaGxupq6tDa43WepQ1pbTtyJ9HChulFGEYDgucYrHIvn37WLFixUmi53SCSCk16jhSylHWnVOJoPGE0KnEkbUWWSyWuY61vFhmPcYYstksra2ttLW10dvbSyaToaGhgcsvv5yqqqrhhdwYc5JVpcTIBV4IgZQSz/PGfa/v++zbt49ly5adJILGG99ItNaEYXiSOBrvX+m9vu9PSByNHf9Ya5HrukgpxxU/pxNGJReatRZZLJaZgBUvllmJ1pqenp5hC0s2m6WmpobGxkYuueQS0un0uAvrVC+2EzFcjj3mSLfTVB7fGDPKOlSyFo0UQWMFT7FYPEkQjSeSRjKeKBorkMazGp0qpmjk38YKSIvFYhkPK14ss4YgCOjs7By2sBhjaGhoYOXKldTX1+N53rxb8MYu9olEYkr2O1YU9fb28tprr3HZZZdNyGI00lp0OpfaWGvRSPfZmVxnE4ktstYii2VuYsWLZcZijKFQKNDa2kp7eztdXV0kk0kaGhqorq6moqKCiy66aNL7nYrFa64vgGNFUckqUlZWds77Hs9aNNZNdiZrUS6XO2VM0dlai0aKpNMJoZLVqOSCs6LIYpl+rHixzCi01gwMDAwLloGBASorK2lsbGTNmjWUl5cjpeTVV1+dEYuFjXefPNNlLRoraMYKobEutiAIKBaLw7/n8/nhlPozWYsm40YbL77oVC40K4wslvGx4sVywSmlM5fiV4rFInV1dSxevJiGhgaSyeS4k7cVDtPHbFg8xwu4PhcOHTpENpvl4osvPq21aCL/giA4pZVo5M8jGc+FNhk32qlcadZaZJkLWPFimXZK6czt7e20tbXR2dmJlJKGhgbWrVtHXV0drnv6W9NOupbpYLxF/nxbi8Zah8a61MZaiwqFwqhtxnvfyOOcyVpUykYbWaTxTDFF1lpkmW6seLFMCyPTmdvb2+np6aG8vHzcdObJ7PNCYSdmy1QwmfT8yTLy+Rgvo+xU/44dO0ZFRQWJRGJCNYxGMl7NovFS80+Xlj+eG81aiyxjseLFct4opTOXLCyldOaGhgY2bNhAWVnZWU9EM2UCs66ructs/2xHPiOTSc/v7Oxk0aJFNDQ0jPv3MxVzHK9O0UixM561aDyBNBFr0XjiaDy32kgxNPJ3ay2avVjxYplSwjCko6ODtrY2Ojo6UEpRX1/PihUraGhomNJ05tm+uFgsM5XTPaPTaS0a6f46kwttZJD1SBfaeIJqJCOtRSWX4OlijCaSnl+a56woOn9Y8WI5J0rpzG1tbbS1tdHV1QVAc3Mzl1xyCbW1tVNWkG0kQgjrNrJY5hhnay06ExOxFg0ODnLgwAGWLVt2ktjxff+0MUWnav1xNgHXtvXHxLDixTJpSunMpeyg/v5+KisraWhoYPXq1WzdupVVq1ZRXl5+XscxEywvM2EM08V8Otf5zFz8nCdiLSoJg4ULF05q3+NZi8a6zE5Xt2iyrT9GjnVsoPVnPvMZ0uk0+XweKSU33HADmUyG8vJyMpkMVVVVvPWtbz2LKzjzsOLFMiFOlc68cOFCNm/ePCqdeTqsIufybWO+flOxWCbKfH1Gzua8L5S1aLzMs8svv5xsNsu2bdtoaWlhyZIlZLNZhoaGyGazeJ43KfHy2GOP8bnPfY4XXniB1tZWvvvd73L33XcP//0jH/kI99xzz6ht7rjjDh544IFzPv8zYcWLZVxGpjO3t7fT0dExnM580UUXUV9ff8p05vkw8c2Hc7TMz895LlpeJsJMO++ziS36wz/8QwC+8IUv8Mgjj/Bv//Zv5zSGbDbLpk2b+OhHP8o73vGOcd/zxje+cdRxksnkOR1zoljxYhmmlM5cil8Zmc582WWXUV1dPeF05umwvJztMaZybDNtwr
NYpoL5KtrmynkrpabEAnTnnXdy5513nvY9yWSS5ubmcz7WZLHiZZ6jtaa3t3fYHTQ0NHTO6czTNQFY4WA5n9j7a/4
"text/plain": [
"<Figure size 800x700 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from matplotlib.colors import ListedColormap\n",
"\n",
"darker_hot = ListedColormap(plt.cm.hot(np.linspace(0, 0.8, 256)))\n",
"\n",
"axes = [-11.5, 14, -2, 23, -12, 15]\n",
"\n",
"fig = plt.figure(figsize=(8, 7))\n",
"ax = fig.add_subplot(111, projection='3d')\n",
"\n",
"ax.scatter(X_swiss[:, 0], X_swiss[:, 1], X_swiss[:, 2], c=t, cmap=darker_hot)\n",
"ax.view_init(10, -70)\n",
"set_xyz_axes(ax, axes)"
]
},
{
"cell_type": "markdown",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"source": [
"#### LLE"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"editable": true,
"execution": {
"iopub.execute_input": "2025-02-27T23:22:27.291420Z",
"iopub.status.busy": "2025-02-27T23:22:27.291245Z",
"iopub.status.idle": "2025-02-27T23:22:27.383133Z",
"shell.execute_reply": "2025-02-27T23:22:27.382651Z"
},
"slideshow": {
"slide_type": ""
},
"tags": []
},
"outputs": [],
"source": [
"from sklearn.manifold import LocallyLinearEmbedding\n",
"\n",
"lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10, random_state=42)\n",
"X_unrolled = lle.fit_transform(X_swiss)"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"editable": true,
"execution": {
"iopub.execute_input": "2025-02-27T23:22:27.385572Z",
"iopub.status.busy": "2025-02-27T23:22:27.385146Z",
"iopub.status.idle": "2025-02-27T23:22:27.570413Z",
"shell.execute_reply": "2025-02-27T23:22:27.569904Z"
},
"slideshow": {
"slide_type": ""
},
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"Text(0.5, 1.0, 'Unrolled swiss roll using LLE')"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAlQAAAHHCAYAAAB5gsZZAAAAOnRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjEwLjAsIGh0dHBzOi8vbWF0cGxvdGxpYi5vcmcvlHJYcgAAAAlwSFlzAAAPYQAAD2EBqD+naQABAABJREFUeJzsnXeY1FQXh99kZrayhd679N57ESmCDQVFLFQ7oIiiYgFsgIgIUkTgQ0FAEBFEpS1NOlKl916WBZYtbJ2Z5Psjk9npM8uygHLf5xmW3Nzk3mQmyS/nnnuOpKqqikAgEAgEAoHgppHvdAcEAoFAIBAI/u0IQSUQCAQCgUCQQ4SgEggEAoFAIMghQlAJBAKBQCAQ5BAhqAQCgUAgEAhyiBBUAoFAIBAIBDlECCqBQCAQCASCHCIElUAgEAgEAkEOEYJKIBAIBAKBIIcIQSUQ/EeRJInhw4fbl3/44QckSeL06dO3rI1evXpRpkyZW7a/7JIbx5SbeOpv69atad269W3tx53+3gSC/yJCUAkEN8nw4cORJImrV696XF+9evXb/qAUCO52WrduTfXq1X3W8XdtAaxbtw5Jkrx+5s2bd6u7LhD4xHinOyAQCAQ3y/PPP8/TTz9NcHDwne7Kv4pp06ahKMqd7sYt4fXXX6dBgwZu5U2aNLkDvRHcywhBJRD8S0hNTSUsLOxOd+OuwmAwYDAY7lj7FosFRVEICgq6Y324GUwm053uwi2jRYsWdO3a9U53QyAQQ34Cwe1CH6L4+eef+fzzzylRogQhISE88MADHD9+3KmuPiyyc+dOWrZsSVhYGO+//z4AcXFx9O3bl8KFCxMSEkKtWrWYOXPmTfdr2bJltGjRgvDwcCIiInjooYc4cOCAW73FixdTvXp1QkJCqF69OosWLQq4jR07dtChQwcKFChAaGgoZcuWpU+fPvb1devW5YknnnDapkaNGkiSxN69e+1l8+fPR5IkDh06BHj2SfLXFsC8efOoV68eERERREZGUqNGDcaPH+/zGE6fPo0kSYwZM4Zx48ZRvnx5goODOXjwIABr1qyxn8fo6Ggee+wxez9zit72Dz/84LbO1VcuOTmZgQMHUqZMGYKDgylUqBDt2rVj165d9jquPlSOxzZ16lT7sTVo0IDt27e7tblgwQKqVq3q9FsQflmCex1hoRIIbjOjRo1ClmXefvttEhMTGT16NM8++yzbtm1zqnft2jU6duzI008/zXPPPUfhwoVJS0ujdevWHD9+nP79+1O2bFkWLFhAr169SEhI4I033shWX3788Ud69uxJhw4d+OKLL0hNTeXbb7+lefPm7N692/6AXLlyJV26dKFq1aqMHDmSa9eu0bt3b0qUKOG3jbi4ONq3b0/BggV57733iI6O5vTp0/z666/2Oi1atOCnn36yL8fHx3PgwAFkWWbDhg3UrFkTgA0bNlCwYEGqVKly023FxMTQvXt3HnjgAb744gsADh06xKZNmwI6f99//z3p6em89NJLBAcHky9fPlatWkXHjh0pV64cw4cPJy0tjQkTJtCsWTN27dp1W4XGK6+8wi+//EL//v2pWrUq165dY+PGjRw6dIi6dev63Hbu3LkkJyfz8ssvI0kSo0eP5oknnuDkyZN2q9aff/5Jt27dqFGjBiNHjuT69ev07duX4sWL347DcyM5Odmjr1X+/PmRJOkO9Ehwz6IKBIKbYtiwYSqgXrlyxeP6atWqqa1atbIvr127VgXUKlWqqBkZGfby8ePHq4C6b98+e1mrVq1UQJ0yZYrTPseNG6cC6uzZs+1lmZmZapMmTdQ8efKoSUlJ9nJAHTZsmH35+++/VwH11KlTqqqqanJyshodHa2++OKLTm3ExsaqUVFRTuW1a9dWixYtqiYkJNjLVq5cqQJq6dKlvZ8kVVUXLVqkAur27du91lmwYIEKqAcPHlRVVVWXLFmiBgcHq48++qjarVs3e72aNWuqjz
/+uNdjCqStN954Q42MjFQtFovPfrty6tQpFVAjIyPVuLg4p3W1a9dWCxUqpF67ds1e9s8//6iyLKs9evTw2l9V1b5rx9+Jr7a///57t3Wu33NUVJTar18/n/vr2bOn0/em7z9//vxqfHy8vfy3335TAfX333+3l9WoUUMtUaKEmpycbC9bt25dQL8FVdWOt1q1aj7r+Lu2VDXrevL2uXTpkt++CAS3EjHkJxDcZnr37u3kc9OiRQsATp486VQvODiY3r17O5UtXbqUIkWK0L17d3uZyWTi9ddf58aNG/z1118B9yMmJoaEhAS6d+/O1atX7R+DwUCjRo1Yu3YtAJcuXWLPnj307NmTqKgo+/bt2rWjatWqftuJjo4G4I8//sBsNnuso5+D9evXA5olqkGDBrRr144NGzYAkJCQwP79++11b7at6OhoUlJSiImJ8dt3T3Tp0oWCBQval/Xz06tXL/Lly2cvr1mzJu3atWPp0qU31c7NEh0dzbZt27h48WK2t+3WrRt58+a1L7v+Ni9evMi+ffvo0aMHefLksddr1aoVNWrUyGHPb46hQ4cSExPj9nH8LgSC24EQVAJBLuJpyKFUqVJOy/oD7Pr1607lxYsXd3N2PnPmDBUqVECWnS9dfQjszJkzAfft2LFjALRp04aCBQs6fVauXElcXJzTPitUqOC2j0qVKvltp1WrVnTp0oWPP/6YAgUK8Nhjj/H999+TkZFhr1O4cGEqVKhgF08bNmygRYsWtGzZkosXL3Ly5Ek2bdqEoig+BVUgbb322mtUrFiRjh07UqJECfr06cPy5csDOGMaZcuWdVrWz4+nc1GlShWuXr1KSkpKwPvPKaNHj2b//v2ULFmShg0bMnz4cDex7g1/v039WO+77z63bT2V3Q5q1KhB27Zt3T7/tokCgn8/QlAJBDdJSEgIAGlpaR7Xp6am2us44m1WmqqqTsuhoaE57KFv9GnzP/74o8c3/N9+++2WtCNJEr/88gtbtmyhf//+XLhwgT59+lCvXj1u3Lhhr9e8eXM2bNhAWloaO3fupEWLFlSvXp3o6Gg2bNjAhg0byJMnD3Xq1MlRW4UKFWLPnj0sWbKERx99lLVr19KxY0d69uwZ0PHk9vfiCW++QFar1a3sqaee4uTJk0yYMIFixYrx5ZdfUq1aNZYtW+a3nUB/mwKBwB0hqASCm6R06dIAHDlyxG1damoq586ds9e5lW0eO3bMLYbQ4cOHnfoUCOXLlwc0geHpDV8PSqrvU7doOeLp2L3RuHFjPv/8c3bs2MGcOXM4cOCAU/DFFi1acPbsWebNm4fVaqVp06bIsmwXWhs2bKBp06YBhUnw11ZQUBCPPPIIkydP5sSJE7z88svMmjXLbbZlIPj6HRw+fJgCBQoQHh6e7f06oluKEhISnMq9WSSLFi3Ka6+9xuLFizl16hT58+fn888/z1EfIOtYPZ2nmzl3AsF/CSGoBIKb5IEHHiAoKIhvv/3WTeBMnToVi8VCx44db2mbnTp1IjY2lvnz59vLLBYLEyZMIE+ePLRq1SrgfXXo0IHIyEhGjBjh0d/oypUrgPZwrl27NjNnziQxMdG+PiYmxh4ywBfXr193s3DUrl0bwGkoTh/K++KLL6hZs6bdX6tFixasXr2aHTt2+BzuC7Sta9euOa2XZdk+i9CxP4HieH4cBc/+/ftZuXIlnTp1yvY+XYmMjKRAgQJ2HzOdyZMnOy1brVan7wg0wVysWLGbOjZXihUrRvXq1Zk1a5aTdfGvv/5i3759Od6/QPBvRoRNEAhukkKFCjF06FA+/PBDWrZsyaOPPkpYWBibN2/mp59+on379jzyyCO3tM2XXnqJ7777jl69erFz507KlCnDL7/8wqZNmxg3bhwREREB7ysyMpJvv/2W559/nrp16/L0009TsGBBzp49y59//kmzZs2YOHEiACNHjuShhx6iefPm9OnTh/j4eCZMmEC1atWcHqyemDlzJpMnT+
bxxx+nfPnyJCcnM23aNCIjI53Exn333UeRIkU4cuQIAwYMsJe3bNmSd999F8CvoAqkrRdeeIH4+HjatGlDiRIlOHP
"text/plain": [
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"plt.scatter(X_unrolled[:, 0], X_unrolled[:, 1],\n",
"            c=t, cmap=darker_hot)\n",
"plt.xlabel(\"$z_1$\")\n",
"plt.ylabel(\"$z_2$\", rotation=0)\n",
"plt.axis([-0.055, 0.060, -0.070, 0.090])\n",
"plt.grid(True)\n",
"plt.title(\"Unrolled swiss roll using LLE\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"source": [
"The Swiss roll is unrolled, and distances between instances are well preserved locally.\n",
"\n",
"However, distances are not preserved on a larger scale."
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.11"
}
},
"nbformat": 4,
"nbformat_minor": 4
}