|
{
|
||
|
"cells": [
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {
|
||
|
"slideshow": {
|
||
|
"slide_type": "slide"
|
||
|
}
|
||
|
},
|
||
|
"source": [
|
||
|
"# Lecture 3: Introduction to Scikit-Learn"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {
|
||
|
"slideshow": {
|
||
|
"slide_type": "skip"
|
||
|
}
|
||
|
},
|
||
|
"source": [
|
||
|
"\n",
|
||
|
"[Run in colab](https://colab.research.google.com/drive/1TZW7xcheEHt7DdDraOZUiSG92rqF3TGF)"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 1,
|
||
|
"metadata": {
|
||
|
"execution": {
|
||
|
"iopub.execute_input": "2024-01-10T00:13:23.213712Z",
|
||
|
"iopub.status.busy": "2024-01-10T00:13:23.213476Z",
|
||
|
"iopub.status.idle": "2024-01-10T00:13:23.223868Z",
|
||
|
"shell.execute_reply": "2024-01-10T00:13:23.223286Z"
|
||
|
},
|
||
|
"slideshow": {
|
||
|
"slide_type": "skip"
|
||
|
}
|
||
|
},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"name": "stdout",
|
||
|
"output_type": "stream",
|
||
|
"text": [
|
||
|
"Last executed: 2024-01-10 00:13:23\n"
|
||
|
]
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"import datetime\n",
|
||
|
"now = datetime.datetime.now()\n",
|
||
|
"print(\"Last executed: \" + now.strftime(\"%Y-%m-%d %H:%M:%S\"))"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {
|
||
|
"slideshow": {
|
||
|
"slide_type": "slide"
|
||
|
}
|
||
|
},
|
||
|
"source": [
"## Scikit-Learn overview\n",
"\n",
"[Scikit-Learn](http://scikit-learn.org/stable/) is an extremely popular Python machine learning package.\n",
"\n",
"It provides implementations of many machine learning algorithms."
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {
|
||
|
"slideshow": {
|
||
|
"slide_type": "fragment"
|
||
|
}
|
||
|
},
|
||
|
"source": [
|
||
|
"- Clean, uniform and streamlined API.\n",
|
||
|
"- Useful and complete online documentation.\n",
|
||
|
"- Straightforward to switch models or algorithms."
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {
|
||
|
"slideshow": {
|
||
|
"slide_type": "subslide"
|
||
|
}
|
||
|
},
|
||
|
"source": [
|
||
|
"Two main general concepts:\n",
|
||
|
"- Data representation\n",
|
||
|
"- Estimator API"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {
|
||
|
"slideshow": {
|
||
|
"slide_type": "slide"
|
||
|
}
|
||
|
},
|
||
|
"source": [
|
||
|
"## Data representations"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {
|
||
|
"slideshow": {
|
||
|
"slide_type": "subslide"
|
||
|
}
|
||
|
},
|
||
|
"source": [
|
||
|
"### Scikit-Learn includes a number of example data-sets"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 2,
|
||
|
"metadata": {
|
||
|
"execution": {
|
||
|
"iopub.execute_input": "2024-01-10T00:13:23.262339Z",
|
||
|
"iopub.status.busy": "2024-01-10T00:13:23.261807Z",
|
||
|
"iopub.status.idle": "2024-01-10T00:13:23.802820Z",
|
||
|
"shell.execute_reply": "2024-01-10T00:13:23.802100Z"
|
||
|
}
|
||
|
},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"from sklearn import datasets"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 3,
|
||
|
"metadata": {
|
||
|
"execution": {
|
||
|
"iopub.execute_input": "2024-01-10T00:13:23.806483Z",
|
||
|
"iopub.status.busy": "2024-01-10T00:13:23.805791Z",
|
||
|
"iopub.status.idle": "2024-01-10T00:13:23.810230Z",
|
||
|
"shell.execute_reply": "2024-01-10T00:13:23.809598Z"
|
||
|
}
|
||
|
},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"# Type datasets.<TAB> to see more\n",
|
||
|
"#datasets."
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {
|
||
|
"slideshow": {
|
||
|
"slide_type": "subslide"
|
||
|
}
|
||
|
},
|
||
|
"source": [
"### Data as a table\n",
"\n",
"The best way to think about data in Scikit-Learn is in terms of tables of data.\n",
"\n",
"Using the [`seaborn`](http://seaborn.pydata.org/) library we can read example data-sets as a Pandas `DataFrame`."
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 4,
|
||
|
"metadata": {
|
||
|
"execution": {
|
||
|
"iopub.execute_input": "2024-01-10T00:13:23.813758Z",
|
||
|
"iopub.status.busy": "2024-01-10T00:13:23.813178Z",
|
||
|
"iopub.status.idle": "2024-01-10T00:13:25.297828Z",
|
||
|
"shell.execute_reply": "2024-01-10T00:13:25.297118Z"
|
||
|
}
|
||
|
},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"pandas.core.frame.DataFrame"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": 4,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"import seaborn as sns\n",
|
||
|
"iris = sns.load_dataset('iris')\n",
|
||
|
"type(iris)"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 5,
|
||
|
"metadata": {
|
||
|
"execution": {
|
||
|
"iopub.execute_input": "2024-01-10T00:13:25.301227Z",
|
||
|
"iopub.status.busy": "2024-01-10T00:13:25.300607Z",
|
||
|
"iopub.status.idle": "2024-01-10T00:13:25.313145Z",
|
||
|
"shell.execute_reply": "2024-01-10T00:13:25.312527Z"
|
||
|
}
|
||
|
},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/html": [
|
||
|
"<div>\n",
|
||
|
"<style scoped>\n",
|
||
|
" .dataframe tbody tr th:only-of-type {\n",
|
||
|
" vertical-align: middle;\n",
|
||
|
" }\n",
|
||
|
"\n",
|
||
|
" .dataframe tbody tr th {\n",
|
||
|
" vertical-align: top;\n",
|
||
|
" }\n",
|
||
|
"\n",
|
||
|
" .dataframe thead th {\n",
|
||
|
" text-align: right;\n",
|
||
|
" }\n",
|
||
|
"</style>\n",
|
||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
||
|
" <thead>\n",
|
||
|
" <tr style=\"text-align: right;\">\n",
|
||
|
" <th></th>\n",
|
||
|
" <th>sepal_length</th>\n",
|
||
|
" <th>sepal_width</th>\n",
|
||
|
" <th>petal_length</th>\n",
|
||
|
" <th>petal_width</th>\n",
|
||
|
" <th>species</th>\n",
|
||
|
" </tr>\n",
|
||
|
" </thead>\n",
|
||
|
" <tbody>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>0</th>\n",
|
||
|
" <td>5.1</td>\n",
|
||
|
" <td>3.5</td>\n",
|
||
|
" <td>1.4</td>\n",
|
||
|
" <td>0.2</td>\n",
|
||
|
" <td>setosa</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>1</th>\n",
|
||
|
" <td>4.9</td>\n",
|
||
|
" <td>3.0</td>\n",
|
||
|
" <td>1.4</td>\n",
|
||
|
" <td>0.2</td>\n",
|
||
|
" <td>setosa</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>2</th>\n",
|
||
|
" <td>4.7</td>\n",
|
||
|
" <td>3.2</td>\n",
|
||
|
" <td>1.3</td>\n",
|
||
|
" <td>0.2</td>\n",
|
||
|
" <td>setosa</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>3</th>\n",
|
||
|
" <td>4.6</td>\n",
|
||
|
" <td>3.1</td>\n",
|
||
|
" <td>1.5</td>\n",
|
||
|
" <td>0.2</td>\n",
|
||
|
" <td>setosa</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>4</th>\n",
|
||
|
" <td>5.0</td>\n",
|
||
|
" <td>3.6</td>\n",
|
||
|
" <td>1.4</td>\n",
|
||
|
" <td>0.2</td>\n",
|
||
|
" <td>setosa</td>\n",
|
||
|
" </tr>\n",
|
||
|
" </tbody>\n",
|
||
|
"</table>\n",
|
||
|
"</div>"
|
||
|
],
|
||
|
"text/plain": [
|
||
|
" sepal_length sepal_width petal_length petal_width species\n",
|
||
|
"0 5.1 3.5 1.4 0.2 setosa\n",
|
||
|
"1 4.9 3.0 1.4 0.2 setosa\n",
|
||
|
"2 4.7 3.2 1.3 0.2 setosa\n",
|
||
|
"3 4.6 3.1 1.5 0.2 setosa\n",
|
||
|
"4 5.0 3.6 1.4 0.2 setosa"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": 5,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"iris.head()"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {
|
||
|
"slideshow": {
|
||
|
"slide_type": "subslide"
|
||
|
}
|
||
|
},
|
||
|
"source": [
"### Iris data\n",
"\n",
"Here we consider the [Iris flower data](https://en.wikipedia.org/wiki/Iris_flower_data_set).\n",
"\n",
"- Introduced by statistician and biologist Ronald Fisher in a 1936 paper.\n",
"\n",
"- Consists of 50 samples from each of three species of Iris (Iris Setosa, Iris Virginica and Iris Versicolor), 150 samples in total.\n",
"\n",
"- Four features were measured from each sample: the length and the width of the sepals and petals, in centimetres.\n"
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 6,
|
||
|
"metadata": {
|
||
|
"execution": {
|
||
|
"iopub.execute_input": "2024-01-10T00:13:25.316231Z",
|
||
|
"iopub.status.busy": "2024-01-10T00:13:25.315790Z",
|
||
|
"iopub.status.idle": "2024-01-10T00:13:25.327069Z",
|
||
|
"shell.execute_reply": "2024-01-10T00:13:25.326456Z"
|
||
|
}
|
||
|
},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/html": [
|
||
|
"<div>\n",
|
||
|
"<style scoped>\n",
|
||
|
" .dataframe tbody tr th:only-of-type {\n",
|
||
|
" vertical-align: middle;\n",
|
||
|
" }\n",
|
||
|
"\n",
|
||
|
" .dataframe tbody tr th {\n",
|
||
|
" vertical-align: top;\n",
|
||
|
" }\n",
|
||
|
"\n",
|
||
|
" .dataframe thead th {\n",
|
||
|
" text-align: right;\n",
|
||
|
" }\n",
|
||
|
"</style>\n",
|
||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
||
|
" <thead>\n",
|
||
|
" <tr style=\"text-align: right;\">\n",
|
||
|
" <th></th>\n",
|
||
|
" <th>sepal_length</th>\n",
|
||
|
" <th>sepal_width</th>\n",
|
||
|
" <th>petal_length</th>\n",
|
||
|
" <th>petal_width</th>\n",
|
||
|
" <th>species</th>\n",
|
||
|
" </tr>\n",
|
||
|
" </thead>\n",
|
||
|
" <tbody>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>145</th>\n",
|
||
|
" <td>6.7</td>\n",
|
||
|
" <td>3.0</td>\n",
|
||
|
" <td>5.2</td>\n",
|
||
|
" <td>2.3</td>\n",
|
||
|
" <td>virginica</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>146</th>\n",
|
||
|
" <td>6.3</td>\n",
|
||
|
" <td>2.5</td>\n",
|
||
|
" <td>5.0</td>\n",
|
||
|
" <td>1.9</td>\n",
|
||
|
" <td>virginica</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>147</th>\n",
|
||
|
" <td>6.5</td>\n",
|
||
|
" <td>3.0</td>\n",
|
||
|
" <td>5.2</td>\n",
|
||
|
" <td>2.0</td>\n",
|
||
|
" <td>virginica</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>148</th>\n",
|
||
|
" <td>6.2</td>\n",
|
||
|
" <td>3.4</td>\n",
|
||
|
" <td>5.4</td>\n",
|
||
|
" <td>2.3</td>\n",
|
||
|
" <td>virginica</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>149</th>\n",
|
||
|
" <td>5.9</td>\n",
|
||
|
" <td>3.0</td>\n",
|
||
|
" <td>5.1</td>\n",
|
||
|
" <td>1.8</td>\n",
|
||
|
" <td>virginica</td>\n",
|
||
|
" </tr>\n",
|
||
|
" </tbody>\n",
|
||
|
"</table>\n",
|
||
|
"</div>"
|
||
|
],
|
||
|
"text/plain": [
|
||
|
" sepal_length sepal_width petal_length petal_width species\n",
|
||
|
"145 6.7 3.0 5.2 2.3 virginica\n",
|
||
|
"146 6.3 2.5 5.0 1.9 virginica\n",
|
||
|
"147 6.5 3.0 5.2 2.0 virginica\n",
|
||
|
"148 6.2 3.4 5.4 2.3 virginica\n",
|
||
|
"149 5.9 3.0 5.1 1.8 virginica"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": 6,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"iris.tail()"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {
|
||
|
"slideshow": {
|
||
|
"slide_type": "subslide"
|
||
|
}
|
||
|
},
|
||
|
"source": [
|
||
|
"#### Parts of a flower\n",
|
||
|
"\n",
|
||
|
"Measured flower [petals](https://en.wikipedia.org/wiki/Petal) and [sepals](https://en.wikipedia.org/wiki/Sepal).\n",
|
||
|
"\n",
|
||
|
"<img src=\"https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture03_Images/Mature_flower_diagram.png\" width=\"1000px\" style=\"display:block; margin:auto\"/>\n",
|
||
|
"\n",
|
||
|
"[Image credit: [Mariana Ruiz](https://en.wikipedia.org/wiki/Sepal#/media/File:Mature_flower_diagram.svg)]"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {
|
||
|
"slideshow": {
|
||
|
"slide_type": "subslide"
|
||
|
}
|
||
|
},
|
||
|
"source": [
|
||
|
"#### Images of different species\n",
|
||
|
"\n",
|
||
|
"<!--\n",
|
||
|
"<table border=\"0\" cellpadding=\"0\">\n",
|
||
|
" <tr>\n",
|
||
|
" <td><center><img src=\"https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture03_Images/iris_setosa.jpg\" width=\"60%\"/></center></td>\n",
|
||
|
" <td><center><img src=\"https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture03_Images/iris_versicolor.jpg\" width=\"70%\"/></center></td>\n",
|
||
|
" <td><center><img src=\"https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture03_Images/iris_virginica.jpg\" width=\"50%\"/></center></td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <td><center>Iris Setosa</center></td>\n",
|
||
|
" <td><center>Iris Versicolor</center></td>\n",
|
||
|
" <td><center>Iris Virginica</center></td> \n",
|
||
|
" </tr>\n",
|
||
|
"</table>\n",
|
||
|
"-->\n",
|
||
|
"\n",
|
||
|
"##### Iris Setosa\n",
|
||
|
"\n",
|
||
|
"<img src=\"https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture03_Images/iris_setosa.jpg\" width=\"300\" style=\"display:block; margin:auto\"/>\n",
|
||
|
"\n",
|
||
|
"##### Iris Versicolor\n",
|
||
|
"\n",
|
||
|
"<img src=\"https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture03_Images/iris_versicolor.jpg\" width=\"300\" style=\"display:block; margin:auto\"/>\n",
|
||
|
"\n",
|
||
|
"##### Iris Virginica\n",
|
||
|
"\n",
|
||
|
"<img src=\"https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture03_Images/iris_virginica.jpg\" width=\"300\" style=\"display:block; margin:auto\"/>\n",
|
||
|
"\n",
|
||
|
"[[Image source](https://github.com/jakevdp/sklearn_tutorial)]\n"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {
|
||
|
"slideshow": {
|
||
|
"slide_type": "subslide"
|
||
|
}
|
||
|
},
|
||
|
"source": [
"### Features matrix\n",
"\n",
"Recall that data are presented to the learning algorithm as \"*features*\".\n",
"\n",
"Each row corresponds to an observed (*sampled*) flower, with a number of *features*."
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 7,
|
||
|
"metadata": {
|
||
|
"execution": {
|
||
|
"iopub.execute_input": "2024-01-10T00:13:25.330360Z",
|
||
|
"iopub.status.busy": "2024-01-10T00:13:25.329898Z",
|
||
|
"iopub.status.idle": "2024-01-10T00:13:25.341334Z",
|
||
|
"shell.execute_reply": "2024-01-10T00:13:25.340725Z"
|
||
|
}
|
||
|
},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/html": [
|
||
|
"<div>\n",
|
||
|
"<style scoped>\n",
|
||
|
" .dataframe tbody tr th:only-of-type {\n",
|
||
|
" vertical-align: middle;\n",
|
||
|
" }\n",
|
||
|
"\n",
|
||
|
" .dataframe tbody tr th {\n",
|
||
|
" vertical-align: top;\n",
|
||
|
" }\n",
|
||
|
"\n",
|
||
|
" .dataframe thead th {\n",
|
||
|
" text-align: right;\n",
|
||
|
" }\n",
|
||
|
"</style>\n",
|
||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
||
|
" <thead>\n",
|
||
|
" <tr style=\"text-align: right;\">\n",
|
||
|
" <th></th>\n",
|
||
|
" <th>sepal_length</th>\n",
|
||
|
" <th>sepal_width</th>\n",
|
||
|
" <th>petal_length</th>\n",
|
||
|
" <th>petal_width</th>\n",
|
||
|
" <th>species</th>\n",
|
||
|
" </tr>\n",
|
||
|
" </thead>\n",
|
||
|
" <tbody>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>0</th>\n",
|
||
|
" <td>5.1</td>\n",
|
||
|
" <td>3.5</td>\n",
|
||
|
" <td>1.4</td>\n",
|
||
|
" <td>0.2</td>\n",
|
||
|
" <td>setosa</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>1</th>\n",
|
||
|
" <td>4.9</td>\n",
|
||
|
" <td>3.0</td>\n",
|
||
|
" <td>1.4</td>\n",
|
||
|
" <td>0.2</td>\n",
|
||
|
" <td>setosa</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>2</th>\n",
|
||
|
" <td>4.7</td>\n",
|
||
|
" <td>3.2</td>\n",
|
||
|
" <td>1.3</td>\n",
|
||
|
" <td>0.2</td>\n",
|
||
|
" <td>setosa</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>3</th>\n",
|
||
|
" <td>4.6</td>\n",
|
||
|
" <td>3.1</td>\n",
|
||
|
" <td>1.5</td>\n",
|
||
|
" <td>0.2</td>\n",
|
||
|
" <td>setosa</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>4</th>\n",
|
||
|
" <td>5.0</td>\n",
|
||
|
" <td>3.6</td>\n",
|
||
|
" <td>1.4</td>\n",
|
||
|
" <td>0.2</td>\n",
|
||
|
" <td>setosa</td>\n",
|
||
|
" </tr>\n",
|
||
|
" </tbody>\n",
|
||
|
"</table>\n",
|
||
|
"</div>"
|
||
|
],
|
||
|
"text/plain": [
|
||
|
" sepal_length sepal_width petal_length petal_width species\n",
|
||
|
"0 5.1 3.5 1.4 0.2 setosa\n",
|
||
|
"1 4.9 3.0 1.4 0.2 setosa\n",
|
||
|
"2 4.7 3.2 1.3 0.2 setosa\n",
|
||
|
"3 4.6 3.1 1.5 0.2 setosa\n",
|
||
|
"4 5.0 3.6 1.4 0.2 setosa"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": 7,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"iris.head()"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {
|
||
|
"slideshow": {
|
||
|
"slide_type": "subslide"
|
||
|
}
|
||
|
},
|
||
|
"source": [
|
||
|
"In this example we extract a feature matrix, removing species (which we want to predict)."
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 8,
|
||
|
"metadata": {
|
||
|
"execution": {
|
||
|
"iopub.execute_input": "2024-01-10T00:13:25.344521Z",
|
||
|
"iopub.status.busy": "2024-01-10T00:13:25.343963Z",
|
||
|
"iopub.status.idle": "2024-01-10T00:13:25.356078Z",
|
||
|
"shell.execute_reply": "2024-01-10T00:13:25.355443Z"
|
||
|
}
|
||
|
},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/html": [
|
||
|
"<div>\n",
|
||
|
"<style scoped>\n",
|
||
|
" .dataframe tbody tr th:only-of-type {\n",
|
||
|
" vertical-align: middle;\n",
|
||
|
" }\n",
|
||
|
"\n",
|
||
|
" .dataframe tbody tr th {\n",
|
||
|
" vertical-align: top;\n",
|
||
|
" }\n",
|
||
|
"\n",
|
||
|
" .dataframe thead th {\n",
|
||
|
" text-align: right;\n",
|
||
|
" }\n",
|
||
|
"</style>\n",
|
||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
||
|
" <thead>\n",
|
||
|
" <tr style=\"text-align: right;\">\n",
|
||
|
" <th></th>\n",
|
||
|
" <th>sepal_length</th>\n",
|
||
|
" <th>sepal_width</th>\n",
|
||
|
" <th>petal_length</th>\n",
|
||
|
" <th>petal_width</th>\n",
|
||
|
" </tr>\n",
|
||
|
" </thead>\n",
|
||
|
" <tbody>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>0</th>\n",
|
||
|
" <td>5.1</td>\n",
|
||
|
" <td>3.5</td>\n",
|
||
|
" <td>1.4</td>\n",
|
||
|
" <td>0.2</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>1</th>\n",
|
||
|
" <td>4.9</td>\n",
|
||
|
" <td>3.0</td>\n",
|
||
|
" <td>1.4</td>\n",
|
||
|
" <td>0.2</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>2</th>\n",
|
||
|
" <td>4.7</td>\n",
|
||
|
" <td>3.2</td>\n",
|
||
|
" <td>1.3</td>\n",
|
||
|
" <td>0.2</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>3</th>\n",
|
||
|
" <td>4.6</td>\n",
|
||
|
" <td>3.1</td>\n",
|
||
|
" <td>1.5</td>\n",
|
||
|
" <td>0.2</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>4</th>\n",
|
||
|
" <td>5.0</td>\n",
|
||
|
" <td>3.6</td>\n",
|
||
|
" <td>1.4</td>\n",
|
||
|
" <td>0.2</td>\n",
|
||
|
" </tr>\n",
|
||
|
" </tbody>\n",
|
||
|
"</table>\n",
|
||
|
"</div>"
|
||
|
],
|
||
|
"text/plain": [
|
||
|
" sepal_length sepal_width petal_length petal_width\n",
|
||
|
"0 5.1 3.5 1.4 0.2\n",
|
||
|
"1 4.9 3.0 1.4 0.2\n",
|
||
|
"2 4.7 3.2 1.3 0.2\n",
|
||
|
"3 4.6 3.1 1.5 0.2\n",
|
||
|
"4 5.0 3.6 1.4 0.2"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": 8,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"X_iris = iris.drop('species', axis='columns')\n",
|
||
|
"X_iris.head()"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 9,
|
||
|
"metadata": {
|
||
|
"execution": {
|
||
|
"iopub.execute_input": "2024-01-10T00:13:25.358964Z",
|
||
|
"iopub.status.busy": "2024-01-10T00:13:25.358716Z",
|
||
|
"iopub.status.idle": "2024-01-10T00:13:25.365488Z",
|
||
|
"shell.execute_reply": "2024-01-10T00:13:25.364851Z"
|
||
|
}
|
||
|
},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"pandas.core.frame.DataFrame"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": 9,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"type(X_iris)"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {
|
||
|
"slideshow": {
|
||
|
"slide_type": "subslide"
|
||
|
}
|
||
|
},
|
||
|
"source": [
"### Target array\n",
"\n",
"Consider a 1D *target array* containing the labels or targets that we want to predict.\n",
"\n",
"These may be numerical values or discrete classes/labels."
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {
|
||
|
"slideshow": {
|
||
|
"slide_type": "subslide"
|
||
|
}
|
||
|
},
|
||
|
"source": [
|
||
|
"In this example we want to predict the flower species from other measurements."
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 10,
|
||
|
"metadata": {
|
||
|
"execution": {
|
||
|
"iopub.execute_input": "2024-01-10T00:13:25.368660Z",
|
||
|
"iopub.status.busy": "2024-01-10T00:13:25.368188Z",
|
||
|
"iopub.status.idle": "2024-01-10T00:13:25.374927Z",
|
||
|
"shell.execute_reply": "2024-01-10T00:13:25.374241Z"
|
||
|
}
|
||
|
},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"0 setosa\n",
|
||
|
"1 setosa\n",
|
||
|
"2 setosa\n",
|
||
|
"3 setosa\n",
|
||
|
"4 setosa\n",
|
||
|
"Name: species, dtype: object"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": 10,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"y_iris = iris['species']\n",
|
||
|
"y_iris.head()"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 11,
|
||
|
"metadata": {
|
||
|
"execution": {
|
||
|
"iopub.execute_input": "2024-01-10T00:13:25.377983Z",
|
||
|
"iopub.status.busy": "2024-01-10T00:13:25.377595Z",
|
||
|
"iopub.status.idle": "2024-01-10T00:13:25.381761Z",
|
||
|
"shell.execute_reply": "2024-01-10T00:13:25.381237Z"
|
||
|
}
|
||
|
},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"pandas.core.series.Series"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": 11,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"type(y_iris)"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {
|
||
|
"slideshow": {
|
||
|
"slide_type": "subslide"
|
||
|
}
|
||
|
},
|
||
|
"source": [
|
||
|
"### Features matrix and target vector\n",
|
||
|
"\n",
|
||
|
"<img src=\"https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture03_Images/data-layout.png\" alt=\"data-layout\" width=\"500\" style=\"display:block; margin:auto\"/>\n",
|
||
|
"\n",
|
||
|
"[[Image source](https://github.com/jakevdp/sklearn_tutorial)]"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 12,
|
||
|
"metadata": {
|
||
|
"execution": {
|
||
|
"iopub.execute_input": "2024-01-10T00:13:25.384689Z",
|
||
|
"iopub.status.busy": "2024-01-10T00:13:25.384254Z",
|
||
|
"iopub.status.idle": "2024-01-10T00:13:25.390554Z",
|
||
|
"shell.execute_reply": "2024-01-10T00:13:25.390008Z"
|
||
|
}
|
||
|
},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"(150, 4)"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": 12,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"X_iris.shape"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 13,
|
||
|
"metadata": {
|
||
|
"execution": {
|
||
|
"iopub.execute_input": "2024-01-10T00:13:25.393600Z",
|
||
|
"iopub.status.busy": "2024-01-10T00:13:25.393051Z",
|
||
|
"iopub.status.idle": "2024-01-10T00:13:25.399664Z",
|
||
|
"shell.execute_reply": "2024-01-10T00:13:25.399032Z"
|
||
|
}
|
||
|
},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"(150,)"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": 13,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"y_iris.shape"
|
||
|
]
|
||
|
},
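The features matrix and target vector layout can be sketched without any libraries. This is only a minimal illustration, assuming four rows copied from the `iris.head()` and `iris.tail()` outputs; the helper `split_features_target` is a hypothetical name and is not part of Scikit-Learn or Pandas.

```python
# Rows copied from the iris.head() / iris.tail() outputs shown earlier.
COLUMNS = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]
ROWS = [
    (5.1, 3.5, 1.4, 0.2, "setosa"),
    (4.9, 3.0, 1.4, 0.2, "setosa"),
    (6.7, 3.0, 5.2, 2.3, "virginica"),
    (5.9, 3.0, 5.1, 1.8, "virginica"),
]


def split_features_target(rows, columns, target):
    """Hypothetical helper: split table rows into a features matrix X and a target vector y."""
    t = columns.index(target)
    # X keeps every column except the target; y keeps only the target column.
    X = [[v for i, v in enumerate(row) if i != t] for row in rows]
    y = [row[t] for row in rows]
    return X, y


X, y = split_features_target(ROWS, COLUMNS, target="species")

n_samples, n_features = len(X), len(X[0])
print((n_samples, n_features))  # shape of X: one row per sample, one column per feature
print((len(y),))                # shape of y: one entry per sample
```

The same relationship holds for the full data: `X_iris` has one row per flower and one column per measurement, while `y_iris` has one label per flower.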
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {
|
||
|
"slideshow": {
|
||
|
"slide_type": "subslide"
|
||
|
}
|
||
|
},
|
||
|
"source": [
|
||
|
"### Visualizing the data"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 14,
|
||
|
"metadata": {
|
||
|
"execution": {
|
||
|
"iopub.execute_input": "2024-01-10T00:13:25.402904Z",
|
||
|
"iopub.status.busy": "2024-01-10T00:13:25.402329Z",
|
||
|
"iopub.status.idle": "2024-01-10T00:13:29.947973Z",
|
||
|
"shell.execute_reply": "2024-01-10T00:13:29.947247Z"
|
||
|
}
|
||
|
},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
"image/png": "[base64-encoded PNG omitted: seaborn pair plot of the four iris features, coloured by species]",
|
||
|
"text/plain": [
|
||
|
"<Figure size 730x600 with 20 Axes>"
|
||
|
]
|
||
|
},
|
||
|
"metadata": {},
|
||
|
"output_type": "display_data"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"%matplotlib inline\n",
|
||
|
"import seaborn as sns; sns.set()\n",
|
||
|
"sns.pairplot(iris, hue='species', height=1.5);"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {
|
||
|
"tags": [
|
||
|
"inclass_exercise"
|
||
|
]
|
||
|
},
|
||
|
"source": [
|
||
|
"\n",
|
||
|
"How well do you expect classification to perform with these features and why?"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {
|
||
|
"slideshow": {
|
||
|
"slide_type": "fragment"
|
||
|
},
|
||
|
"tags": [
|
||
|
"solution",
|
||
|
"inclass_exercise"
|
||
|
]
|
||
|
},
|
||
|
"source": [
|
||
|
"Fairly well since the different classes are reasonably well separated in feature space."
|
||
|
]
|
||
|
},
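A rough numerical check of the "well separated" claim can be made on a single feature. This is a sketch under a strong assumption: it uses only the ten petal lengths visible in the `iris.head()` and `iris.tail()` outputs, so it says nothing about Iris Versicolor or about the full 150 samples, for which the pair plot remains the better guide.

```python
# Petal lengths (cm) copied from the iris.head() and iris.tail() outputs.
petal_length = {
    "setosa": [1.4, 1.4, 1.3, 1.5, 1.4],
    "virginica": [5.2, 5.0, 5.2, 5.4, 5.1],
}

for species, values in petal_length.items():
    mean = sum(values) / len(values)
    print(f"{species}: min={min(values)}, max={max(values)}, mean={mean:.2f}")

# If the smallest virginica value exceeds the largest setosa value, the two
# ranges do not overlap and one threshold on this feature separates these rows.
gap = min(petal_length["virginica"]) - max(petal_length["setosa"])
separable = gap > 0
print("separable on petal_length:", separable)
```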
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {
|
||
|
"slideshow": {
|
||
|
"slide_type": "slide"
|
||
|
}
|
||
|
},
|
||
|
"source": [
|
||
|
"## Scikit-Learn's Estimator API"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {
|
||
|
"slideshow": {
|
||
|
"slide_type": "subslide"
|
||
|
}
|
||
|
},
|
||
|
"source": [
"### Scikit-Learn API design principles\n",
"\n",
"- Consistency: All objects share a common interface.\n",
"- Inspection: All specified parameter values are exposed as public attributes.\n",
"- Limited object hierarchy: Only algorithms are represented by Python classes; data-sets and parameters are represented in standard formats.\n",
"- Composition: Many machine learning tasks can be expressed as sequences of more fundamental algorithms.\n",
"- Sensible defaults: The library defines appropriate default values."
]
|
||
|
},
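Several of these principles can be illustrated with toy classes. The sketch below is not Scikit-Learn code: `ToyScaler`, `ToyCentering` and `run_steps` are hypothetical names invented for illustration. They only mimic the conventions of hyperparameters stored as public attributes, default values, learned state with a trailing underscore, and steps chained in sequence.

```python
class ToyScaler:
    """Toy transformer (not a Scikit-Learn class): rescales values by a fixed factor."""

    def __init__(self, factor=1.0):  # sensible default
        self.factor = factor         # inspection: hyperparameter kept as a public attribute

    def fit(self, X):
        return self                  # nothing to learn in this toy example

    def transform(self, X):
        return [[v * self.factor for v in row] for row in X]


class ToyCentering:
    """Toy transformer: subtracts the per-column means learned in fit."""

    def fit(self, X):
        n = len(X)
        # Learned state gets a trailing underscore, mirroring Scikit-Learn's naming convention.
        self.means_ = [sum(col) / n for col in zip(*X)]
        return self

    def transform(self, X):
        return [[v - m for v, m in zip(row, self.means_)] for row in X]


def run_steps(steps, X):
    """Composition: fit and apply a sequence of transformers in order."""
    for step in steps:
        X = step.fit(X).transform(X)
    return X


X = [[1.0, 10.0], [3.0, 30.0]]  # plain nested lists: data stays in a standard format
scaler = ToyScaler(factor=2.0)
result = run_steps([scaler, ToyCentering()], X)
print(scaler.factor)
print(result)
```

Because both toy classes share the same `fit` and `transform` interface, `run_steps` does not need to know which one it is calling, which is the point of the consistency principle.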
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {
|
||
|
"slideshow": {
|
||
|
"slide_type": "subslide"
|
||
|
}
|
||
|
},
|
||
|
"source": [
"### Impact of design principles\n",
"\n",
"- Makes Scikit-Learn easy to use, once the basic principles are understood.\n",
"- Every machine learning algorithm in Scikit-Learn is implemented via the Estimator API.\n",
"- Provides a consistent interface for a wide range of machine learning applications."
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {
|
||
|
"slideshow": {
|
||
|
"slide_type": "subslide"
|
||
|
}
|
||
|
},
|
||
|
"source": [
"### Typical Scikit-Learn Estimator API steps\n",
"\n",
"1. Choose a class of model (import appropriate estimator class).\n",
"2. Choose model hyperparameters (instantiate class with desired values).\n",
"3. Arrange data into a features matrix and target vector.\n",
"4. Fit the model to data (calling `fit` method of model instance).\n",
"5. Apply model to new data:\n",
"    - Supervised learning: often predict targets for unknown data using the `predict` method.\n",
"    - Unsupervised learning: often transform or infer properties of the data using the `transform` or `predict` method."
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {
|
||
|
"slideshow": {
|
||
|
"slide_type": "slide"
|
||
|
}
|
||
|
},
|
||
|
"source": [
|
||
|
"## Linear regression as machine learning"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 15,
|
||
|
"metadata": {
|
||
|
"execution": {
|
||
|
"iopub.execute_input": "2024-01-10T00:13:29.951745Z",
|
||
|
"iopub.status.busy": "2024-01-10T00:13:29.951355Z",
|
||
|
"iopub.status.idle": "2024-01-10T00:13:30.270629Z",
|
||
|
"shell.execute_reply": "2024-01-10T00:13:30.269923Z"
|
||
|
}
|
||
|
},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAjQAAAGhCAYAAAB2yC5uAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjcuNCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8WgzjOAAAACXBIWXMAAA9hAAAPYQGoP6dpAABAy0lEQVR4nO3de3iU9Z3//1cmISGGGTKjCYpEIBgCkUMQ5SAtWAxFPFRZ2wr0pygiWAER9btSu1i1XNW1a7sVqQpCPVSpp2ILRlQsDaviKRYpKpAaoREW4pLDJIHEZOb+/UETnRwmc09m5p7D83Fde3nlnvu+5+O7WXnxOSYZhmEIAAAghtmsbgAAAEBPEWgAAEDMI9AAAICYR6ABAAAxj0ADAABiHoEGAADEPAINAACIeQQaAAAQ81KsbkAkGIYhr9fc/oE2W5LpZxAa1N461N461N461N46/mpvsyUpKSkp4HclRKDxeg1VVTUEfH9Kik1OZ4bc7mNqafGGsWVoj9pbh9pbh9pbh9pbp7vau1wZSk4OPNAw5AQAAGIegQYAAMQ8U4HmlVde0Y9//GNNnjxZhYWFuuyyy/TCCy+o/fmWzz//vKZPn66RI0fqe9/7nrZt2xbQ+48cOaIlS5ZozJgxGjdunH7605+qvr7eTBMBAEACMhVoHn/8caWnp2v58uV6+OGHNXnyZK1YsUKrV69uu+fll1/WihUrNGPGDK1du1aFhYVavHixdu7c6ffdzc3Nmj9/vvbv368HHnhAd911l958803deuutQf2LAQCAxGFqUvDDDz8sl8vV9vPEiRNVU1Oj3/3ud7rxxhtls9n04IMP6uKLL9bNN98sSZowYYL27dun1atXa+3atV2++9VXX1VZWZmKi4uVm5srSXI4HLruuuu0a9cujRo1Koh/PQAAkAhM9dB8M8y0Gj58uOrr63Xs2DFVVFRo//79mjFjhs89F110kXbs2KGvvvqqy3dv375d+fn5bWFGkiZNmqTMzEyVlJSYaSYAAEgwPV62XVpaqn79+qlPnz4qLS2VJA0ePNjnniFDhqi5uVkVFRUaMmRIp+8pLy/3CTOSlJSUpMGDB6u8vLynzVRKSuDZLTnZ5vNPRA61tw61tw61tw61t06oa9+jQPPBBx+ouLhYt99+uySptrZW0omhom9q/bn188643W7Z7fYO1/v27ev3uUDYbElyOjNMP+dwpPfoexE8am8dam8dam8dam+dUNU+6EBz+PBhLVu2TOPHj9fVV18dksaEi9dryO0+FvD9yck2ORzpcruPy+Nho6VIovbWofbWofbWofbW6a72Dke6qd6boAKN2+3W9ddfr8zMTK1atUo224kv7Nu3rySprq5OWVlZPvd/8/POOByOTpdo19bW6rTTTgummT6C2QHS4/Gyc6RFqL11qL11qL11qL11QlV70wNXjY2NWrhwoerq6vTYY4/5DBO1zoFpP+elvLxcvXr1Uk5OTpfvzc3N7fCcYRj6/PPPO8ytAQAA4eH1GtpzoFrvfHJYew5Ux8w5V6Z6aFpaWnTzzTervLxcTz/9tPr16+fzeU5OjgYNGqQtW7aoqKio7XpxcbEmTpyo1NTULt89efJk/fnPf9b+/fs1aNAgSdKOHTtUU1OjKVOmmGkmAAAIQuneSj2ztUzVdU1t15z2NM0pytPY/GwLW9Y9Uz00d999t7Zt26YbbrhB9fX12rlzZ9v/tS7JXrJkiTZv3qwHH3xQ7777rn72s59p165duvHGG9vec/DgQRUUFOihhx5quzZ9+nTl5eVpyZIl2rZtm4qLi3XHHXfo/PPPZw8aAADCrHRvpVZv3O0TZiSpuq5JqzfuVuneSotaFhhTPTRvvfWWJOm+++7r8Nkbb7yhAQMG6JJLLtHx48e1du1arVmzRoMHD9ZDDz2kMWPGtN1rGIY8Ho/PkQm9evXSY489ppUrV+qWW25RSkqKpk2bpjvuuCPYfzcAABAAr9
fQM1vL/N6zYWuZxuRlyWYL/ATsSEoy2h/EFIc8Hq+qqhoCvr/1SPPq6gYmiUUYtbcOtbcOtbcOtT9hz4Fq3b/hb93e9++zx2jYQGdIvrO72rtcGaZWObGTEAAACa6moan7m0zcZwUCDQAACS4zIy2k91mBQAMAQIIbmpMpp91/WHHZ0zQ0JzMyDQoCgQYAgARnsyVpTlGe33tmF+VF7YRgiUADAAAkjc3P1qKZIzr01LjsaVo0c0TU70PT49O2AQBAfBibn60xeVnaV1GjmoYmZWacGGaK5p6ZVgQaAADQxmZLCtnS7EhiyAkAAMQ8Ag0AAIh5BBoAABDzCDQAACDmEWgAAEDMI9AAAICYR6ABAAAxj0ADAABiHoEGAADEPAINAACIeQQaAAAQ8zjLCQAA+PB6jZg7oJJAAwAA2pTurdQzW8tUXdfUds1pT9OcojyNzc+2sGX+MeQEAAAknQgzqzfu9gkzklRd16TVG3erdG9l2zWv19CeA9V655PD2nOgWl6vEenm+qCHBgAAyOs19MzWMr/3bNhapjF5Wfpb2ZdR14tDDw0AANC+ipoOPTPtVdU1afPb+wPuxYkkAg0AAFBNg/8w0+r1Dyr8fr5ha5klw08EGgAAoMyMtIDua2hs8ft5VV2T9lXUhKBF5hBoAACAhuZkymn3H2oyegc29TbQ3p5QItAAAADZbEmaU5Tn955p5wwI6F2B9vaEEoEGAABIksbmZ2vRzBEdempc9jQtmjlCl5w3uNteHJf9xEZ8kcaybQAA0GZsfrbG5GV1uVPwnKI8rd64u8vnZxflWbKrsOlAc+DAAa1bt04fffSRysrKlJubq82bN7d9/sUXX+iCCy7o9NnU1FT9/e9/7/Ld7777rq6++uoO1y+66CL9+te/NttUAAAQBJstScMGOjv9rLUXp/0+NC57mmZbuA+N6UBTVlamkpISjR49Wl6vV4bhuzQrOztbzz77rM81wzA0f/58TZgwIaDvuPfee5Wbm9v2s9PZeVEBAEDkddeLYwXTgWbq1KkqKiqSJC1fvly7d/t2O6WmpqqwsNDn2rvvvqv6+npdcsklAX1HXl6eRo4cabZpAADEhFg8/LE9f704VjAdaGw28/OIN2/erD59+mjq1KmmnwUAIJ7E6uGP0S7sk4Kbm5v12muvadq0aUpLC2wZ14IFC1RTU6OsrCxdfPHFWrp0qXr37t2jdqSkBB7EkpNtPv9E5FB761B761B760S69u/vqex0Qm3rsQFLvj9K5w5LjFAT6tqHPdBs375dNTU1AQ032e12zZ8/X+eee67S0tL0zjvvaP369SovL9ejjz4adBtstiQ5nRmmn3M40oP+TvQMtbcOtbcOtbdOJGrv8Rp65vV9fu/ZsLVMF4wfpOQYG37qiVDVPuyBZtOmTTrllFM0ceLEbu8tKChQQUFB288TJ05Udna27rnnHu3atUujRo0Kqg1eryG3+1jA9ycn2+RwpMvtPi6PxxvUdyI41N461N461N46kaz9p/urdLS20e89/1dzXO9+9IWGD3KFtS3RoLvaOxzppnpvwhpoGhoatG3bNv3gBz9QcnJyUO+YMWOG7rnnHu3evTvoQCNJLS3mf1E9Hm9Qz6HnqL11qL11qL11IlH7o27/Yeab9yXS70Goah/WQcPXX39djY2NuvTSS8P5NQAARL1AjwOw4tiAeBDWQLN582adccYZGj16dNDvePnllyWJZdwAgJgWyOGPVh0bEA9MDzkdP35cJSUlkqSDBw+qvr5eW7ZskSSNGzdOLteJcb+qqirt2LFD119/fafvOXjwoKZNm6Ybb7xRixcvliTddtttGjhwoAoKCtomBT/++OMqKioi0AAAYlrr4Y/ReGxAPDAdaI4ePaqlS5f6XGv9+cknn9T48eMlSa+88opaWlq6HG4yDEMej8dnp+G8vDxt2rRJ69evV3Nzs04//XTdcMMNWrBggdlmAgAQdaL12IB4kGS0P7sgDnk8Xl
VVNQR8f0qKTU5nhqqrGxJqYlY0oPbWofbWofbWsar28bBTcE91V3uXKyN6VjkBAICOou3YgHhAoAEAIA4keq8PgQY
|
||
|
"text/plain": [
|
||
|
"<Figure size 640x480 with 1 Axes>"
|
||
|
]
|
||
|
},
|
||
|
"metadata": {},
|
||
|
"output_type": "display_data"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"import matplotlib.pyplot as plt\n",
|
||
|
"import numpy as np\n",
|
||
|
"\n",
|
||
|
"n_samples = 50\n",
|
||
|
"rng = np.random.RandomState(42)\n",
|
||
|
"x = 10 * rng.rand(n_samples)\n",
|
||
|
"y = 2 * x - 1 + rng.randn(n_samples)\n",
|
||
|
"plt.scatter(x, y);"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {
|
||
|
"slideshow": {
|
||
|
"slide_type": "subslide"
|
||
|
}
|
||
|
},
|
||
|
"source": [
|
||
|
"### 1. Choose a class of model\n",
|
||
|
"\n",
|
||
|
"Every class of model is represented by a Python class."
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 16,
|
||
|
"metadata": {
|
||
|
"execution": {
|
||
|
"iopub.execute_input": "2024-01-10T00:13:30.274187Z",
|
||
|
"iopub.status.busy": "2024-01-10T00:13:30.273534Z",
|
||
|
"iopub.status.idle": "2024-01-10T00:13:30.331758Z",
|
||
|
"shell.execute_reply": "2024-01-10T00:13:30.331059Z"
|
||
|
}
|
||
|
},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"from sklearn.linear_model import LinearRegression"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {
|
||
|
"slideshow": {
|
||
|
"slide_type": "subslide"
|
||
|
}
|
||
|
},
|
||
|
"source": [
|
||
|
"### 2. Choose model hyperparameters\n",
|
||
|
"\n",
|
||
|
"Create an instance of the model with the desired hyperparameters (e.g. whether to fit the y-intercept, regularization)."
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 17,
|
||
|
"metadata": {
|
||
|
"execution": {
|
||
|
"iopub.execute_input": "2024-01-10T00:13:30.335535Z",
|
||
|
"iopub.status.busy": "2024-01-10T00:13:30.335036Z",
|
||
|
"iopub.status.idle": "2024-01-10T00:13:30.343050Z",
|
||
|
"shell.execute_reply": "2024-01-10T00:13:30.342450Z"
|
||
|
}
|
||
|
},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/html": [
|
||
|
"<style>#sk-container-id-1 {color: black;}#sk-container-id-1 pre{padding: 0;}#sk-container-id-1 div.sk-toggleable {background-color: white;}#sk-container-id-1 label.sk-toggleable__label {cursor: pointer;display: block;width: 100%;margin-bottom: 0;padding: 0.3em;box-sizing: border-box;text-align: center;}#sk-container-id-1 label.sk-toggleable__label-arrow:before {content: \"▸\";float: left;margin-right: 0.25em;color: #696969;}#sk-container-id-1 label.sk-toggleable__label-arrow:hover:before {color: black;}#sk-container-id-1 div.sk-estimator:hover label.sk-toggleable__label-arrow:before {color: black;}#sk-container-id-1 div.sk-toggleable__content {max-height: 0;max-width: 0;overflow: hidden;text-align: left;background-color: #f0f8ff;}#sk-container-id-1 div.sk-toggleable__content pre {margin: 0.2em;color: black;border-radius: 0.25em;background-color: #f0f8ff;}#sk-container-id-1 input.sk-toggleable__control:checked~div.sk-toggleable__content {max-height: 200px;max-width: 100%;overflow: auto;}#sk-container-id-1 input.sk-toggleable__control:checked~label.sk-toggleable__label-arrow:before {content: \"▾\";}#sk-container-id-1 div.sk-estimator input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-1 div.sk-label input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-1 input.sk-hidden--visually {border: 0;clip: rect(1px 1px 1px 1px);clip: rect(1px, 1px, 1px, 1px);height: 1px;margin: -1px;overflow: hidden;padding: 0;position: absolute;width: 1px;}#sk-container-id-1 div.sk-estimator {font-family: monospace;background-color: #f0f8ff;border: 1px dotted black;border-radius: 0.25em;box-sizing: border-box;margin-bottom: 0.5em;}#sk-container-id-1 div.sk-estimator:hover {background-color: #d4ebff;}#sk-container-id-1 div.sk-parallel-item::after {content: \"\";width: 100%;border-bottom: 1px solid gray;flex-grow: 1;}#sk-container-id-1 div.sk-label:hover label.sk-toggleable__label 
{background-color: #d4ebff;}#sk-container-id-1 div.sk-serial::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: 0;}#sk-container-id-1 div.sk-serial {display: flex;flex-direction: column;align-items: center;background-color: white;padding-right: 0.2em;padding-left: 0.2em;position: relative;}#sk-container-id-1 div.sk-item {position: relative;z-index: 1;}#sk-container-id-1 div.sk-parallel {display: flex;align-items: stretch;justify-content: center;background-color: white;position: relative;}#sk-container-id-1 div.sk-item::before, #sk-container-id-1 div.sk-parallel-item::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: -1;}#sk-container-id-1 div.sk-parallel-item {display: flex;flex-direction: column;z-index: 1;position: relative;background-color: white;}#sk-container-id-1 div.sk-parallel-item:first-child::after {align-self: flex-end;width: 50%;}#sk-container-id-1 div.sk-parallel-item:last-child::after {align-self: flex-start;width: 50%;}#sk-container-id-1 div.sk-parallel-item:only-child::after {width: 0;}#sk-container-id-1 div.sk-dashed-wrapped {border: 1px dashed gray;margin: 0 0.4em 0.5em 0.4em;box-sizing: border-box;padding-bottom: 0.4em;background-color: white;}#sk-container-id-1 div.sk-label label {font-family: monospace;font-weight: bold;display: inline-block;line-height: 1.2em;}#sk-container-id-1 div.sk-label-container {text-align: center;}#sk-container-id-1 div.sk-container {/* jupyter's `normalize.less` sets `[hidden] { display: none; }` but bootstrap.min.css set `[hidden] { display: none !important; }` so we also need the `!important` here to be able to override the default hidden behavior on the sphinx rendered scikit-learn.org. 
See: https://github.com/scikit-learn/scikit-learn/issues/21755 */display: inline-block !important;position: relative;}#sk-container-id-1 div.sk-text-repr-fallback {display: none;}</style><div id=\"sk-container-id-1\" class=\"sk-top-container\"><div class=\"sk-text-r
|
||
|
],
|
||
|
"text/plain": [
|
||
|
"LinearRegression()"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": 17,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"model = LinearRegression(fit_intercept=True)\n",
|
||
|
"model"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {
|
||
|
"slideshow": {
|
||
|
"slide_type": "subslide"
|
||
|
}
|
||
|
},
|
||
|
"source": [
|
||
|
"### 3. Arrange data into a features matrix and target vector"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 18,
|
||
|
"metadata": {
|
||
|
"execution": {
|
||
|
"iopub.execute_input": "2024-01-10T00:13:30.347095Z",
|
||
|
"iopub.status.busy": "2024-01-10T00:13:30.345723Z",
|
||
|
"iopub.status.idle": "2024-01-10T00:13:30.352782Z",
|
||
|
"shell.execute_reply": "2024-01-10T00:13:30.352163Z"
|
||
|
}
|
||
|
},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"(50, 1)"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": 18,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"X = x.reshape(n_samples,1)\n",
|
||
|
"X.shape"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 19,
|
||
|
"metadata": {
|
||
|
"execution": {
|
||
|
"iopub.execute_input": "2024-01-10T00:13:30.355875Z",
|
||
|
"iopub.status.busy": "2024-01-10T00:13:30.355272Z",
|
||
|
"iopub.status.idle": "2024-01-10T00:13:30.361812Z",
|
||
|
"shell.execute_reply": "2024-01-10T00:13:30.361206Z"
|
||
|
}
|
||
|
},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"(50,)"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": 19,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"y.shape "
|
||
|
]
|
||
|
},
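{
"cell_type": "markdown",
"metadata": {},
"source": [
"Scikit-Learn estimators generally expect the features matrix to be two-dimensional, with shape `(n_samples, n_features)`, while the target vector can stay one-dimensional. The sketch below shows two equivalent ways of turning a 1D array into a single-feature matrix; the variable names (`x_demo`, `X_reshape`, `X_newaxis`) are illustrative and do not appear elsewhere in the lecture."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"rng_demo = np.random.RandomState(42)\n",
"x_demo = 10 * rng_demo.rand(50)\n",
"\n",
"X_reshape = x_demo.reshape(-1, 1)   # -1 lets NumPy infer the number of samples\n",
"X_newaxis = x_demo[:, np.newaxis]   # add a trailing axis of length 1\n",
"\n",
"print(X_reshape.shape, X_newaxis.shape)\n",
"print(np.array_equal(X_reshape, X_newaxis))"
]
},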
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {
|
||
|
"slideshow": {
|
||
|
"slide_type": "subslide"
|
||
|
}
|
||
|
},
|
||
|
"source": [
|
||
|
"### 4. Fit the model to data\n"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 20,
|
||
|
"metadata": {
|
||
|
"execution": {
|
||
|
"iopub.execute_input": "2024-01-10T00:13:30.364747Z",
|
||
|
"iopub.status.busy": "2024-01-10T00:13:30.364379Z",
|
||
|
"iopub.status.idle": "2024-01-10T00:13:30.371856Z",
|
||
|
"shell.execute_reply": "2024-01-10T00:13:30.371202Z"
|
||
|
}
|
||
|
},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/html": [
|
||
|
"<style>#sk-container-id-2 {color: black;}#sk-container-id-2 pre{padding: 0;}#sk-container-id-2 div.sk-toggleable {background-color: white;}#sk-container-id-2 label.sk-toggleable__label {cursor: pointer;display: block;width: 100%;margin-bottom: 0;padding: 0.3em;box-sizing: border-box;text-align: center;}#sk-container-id-2 label.sk-toggleable__label-arrow:before {content: \"▸\";float: left;margin-right: 0.25em;color: #696969;}#sk-container-id-2 label.sk-toggleable__label-arrow:hover:before {color: black;}#sk-container-id-2 div.sk-estimator:hover label.sk-toggleable__label-arrow:before {color: black;}#sk-container-id-2 div.sk-toggleable__content {max-height: 0;max-width: 0;overflow: hidden;text-align: left;background-color: #f0f8ff;}#sk-container-id-2 div.sk-toggleable__content pre {margin: 0.2em;color: black;border-radius: 0.25em;background-color: #f0f8ff;}#sk-container-id-2 input.sk-toggleable__control:checked~div.sk-toggleable__content {max-height: 200px;max-width: 100%;overflow: auto;}#sk-container-id-2 input.sk-toggleable__control:checked~label.sk-toggleable__label-arrow:before {content: \"▾\";}#sk-container-id-2 div.sk-estimator input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-2 div.sk-label input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-2 input.sk-hidden--visually {border: 0;clip: rect(1px 1px 1px 1px);clip: rect(1px, 1px, 1px, 1px);height: 1px;margin: -1px;overflow: hidden;padding: 0;position: absolute;width: 1px;}#sk-container-id-2 div.sk-estimator {font-family: monospace;background-color: #f0f8ff;border: 1px dotted black;border-radius: 0.25em;box-sizing: border-box;margin-bottom: 0.5em;}#sk-container-id-2 div.sk-estimator:hover {background-color: #d4ebff;}#sk-container-id-2 div.sk-parallel-item::after {content: \"\";width: 100%;border-bottom: 1px solid gray;flex-grow: 1;}#sk-container-id-2 div.sk-label:hover label.sk-toggleable__label 
{background-color: #d4ebff;}#sk-container-id-2 div.sk-serial::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: 0;}#sk-container-id-2 div.sk-serial {display: flex;flex-direction: column;align-items: center;background-color: white;padding-right: 0.2em;padding-left: 0.2em;position: relative;}#sk-container-id-2 div.sk-item {position: relative;z-index: 1;}#sk-container-id-2 div.sk-parallel {display: flex;align-items: stretch;justify-content: center;background-color: white;position: relative;}#sk-container-id-2 div.sk-item::before, #sk-container-id-2 div.sk-parallel-item::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: -1;}#sk-container-id-2 div.sk-parallel-item {display: flex;flex-direction: column;z-index: 1;position: relative;background-color: white;}#sk-container-id-2 div.sk-parallel-item:first-child::after {align-self: flex-end;width: 50%;}#sk-container-id-2 div.sk-parallel-item:last-child::after {align-self: flex-start;width: 50%;}#sk-container-id-2 div.sk-parallel-item:only-child::after {width: 0;}#sk-container-id-2 div.sk-dashed-wrapped {border: 1px dashed gray;margin: 0 0.4em 0.5em 0.4em;box-sizing: border-box;padding-bottom: 0.4em;background-color: white;}#sk-container-id-2 div.sk-label label {font-family: monospace;font-weight: bold;display: inline-block;line-height: 1.2em;}#sk-container-id-2 div.sk-label-container {text-align: center;}#sk-container-id-2 div.sk-container {/* jupyter's `normalize.less` sets `[hidden] { display: none; }` but bootstrap.min.css set `[hidden] { display: none !important; }` so we also need the `!important` here to be able to override the default hidden behavior on the sphinx rendered scikit-learn.org. 
See: https://github.com/scikit-learn/scikit-learn/issues/21755 */display: inline-block !important;position: relative;}#sk-container-id-2 div.sk-text-repr-fallback {display: none;}</style><div id=\"sk-container-id-2\" class=\"sk-top-container\"><div class=\"sk-text-r
|
||
|
],
|
||
|
"text/plain": [
|
||
|
"LinearRegression()"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": 20,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"model.fit(X, y)"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {
|
||
|
"slideshow": {
|
||
|
"slide_type": "fragment"
|
||
|
}
|
||
|
},
|
||
|
"source": [
|
||
|
"All model parameters that were learned during the `fit()` process have *trailing underscores*."
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 21,
|
||
|
"metadata": {
|
||
|
"execution": {
|
||
|
"iopub.execute_input": "2024-01-10T00:13:30.374971Z",
|
||
|
"iopub.status.busy": "2024-01-10T00:13:30.374528Z",
|
||
|
"iopub.status.idle": "2024-01-10T00:13:30.378642Z",
|
||
|
"shell.execute_reply": "2024-01-10T00:13:30.378102Z"
|
||
|
}
|
||
|
},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"-0.9033107255311146"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": 21,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"model.intercept_"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 22,
|
||
|
"metadata": {
|
||
|
"execution": {
|
||
|
"iopub.execute_input": "2024-01-10T00:13:30.381546Z",
|
||
|
"iopub.status.busy": "2024-01-10T00:13:30.380989Z",
|
||
|
"iopub.status.idle": "2024-01-10T00:13:30.387913Z",
|
||
|
"shell.execute_reply": "2024-01-10T00:13:30.387091Z"
|
||
|
}
|
||
|
},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"array([1.9776566])"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": 22,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"model.coef_"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"The fitted intercept and slope are close to the values used to generate the data (-1 and 2, respectively)."
|
||
|
]
|
||
|
},
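{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to quantify this agreement, sketched under the assumption that the same synthetic data are regenerated with the same seed: compare the fitted parameters with the generating values and compute the coefficient of determination with the estimator's `score` method. The names `check_model`, `x_chk`, `y_chk` and `X_chk` are introduced here only for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.linear_model import LinearRegression\n",
"\n",
"# Regenerate the synthetic data: true slope 2, true intercept -1\n",
"rng_chk = np.random.RandomState(42)\n",
"x_chk = 10 * rng_chk.rand(50)\n",
"y_chk = 2 * x_chk - 1 + rng_chk.randn(50)\n",
"X_chk = x_chk.reshape(-1, 1)\n",
"\n",
"check_model = LinearRegression(fit_intercept=True).fit(X_chk, y_chk)\n",
"\n",
"print('slope error:', abs(check_model.coef_[0] - 2))\n",
"print('intercept error:', abs(check_model.intercept_ - (-1)))\n",
"print('R^2 on training data:', check_model.score(X_chk, y_chk))"
]
},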
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {
|
||
|
"slideshow": {
|
||
|
"slide_type": "subslide"
|
||
|
}
|
||
|
},
|
||
|
"source": [
|
||
|
"### 5. Predict targets for unknown data"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 23,
|
||
|
"metadata": {
|
||
|
"execution": {
|
||
|
"iopub.execute_input": "2024-01-10T00:13:30.391306Z",
|
||
|
"iopub.status.busy": "2024-01-10T00:13:30.390667Z",
|
||
|
"iopub.status.idle": "2024-01-10T00:13:30.396006Z",
|
||
|
"shell.execute_reply": "2024-01-10T00:13:30.395397Z"
|
||
|
}
|
||
|
},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"n_fit = 50\n",
|
||
|
"xfit = np.linspace(-1, 11, n_fit)\n",
|
||
|
"Xfit = xfit.reshape(n_fit,1)\n",
|
||
|
"yfit = model.predict(Xfit)"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 24,
|
||
|
"metadata": {
|
||
|
"execution": {
|
||
|
"iopub.execute_input": "2024-01-10T00:13:30.398979Z",
|
||
|
"iopub.status.busy": "2024-01-10T00:13:30.398541Z",
|
||
|
"iopub.status.idle": "2024-01-10T00:13:30.618415Z",
|
||
|
"shell.execute_reply": "2024-01-10T00:13:30.617751Z"
|
||
|
}
|
||
|
},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAiQAAAGhCAYAAABRZq+GAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjcuNCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8WgzjOAAAACXBIWXMAAA9hAAAPYQGoP6dpAABNLUlEQVR4nO3deWBUVZ73/3dVZV8qC2Rj38ImsogKiKIi7iigooAtirIp2t30+Ou25xmdx+l+pv35TM+MY7MGUBDEHXGJ2OICLrihgohCBMEA2ci+p6ruff5IJ5K9Eiq5leTz+kdTdavqmyMmH+4553tspmmaiIiIiFjIbnUBIiIiIgokIiIiYjkFEhEREbGcAomIiIhYToFERERELKdAIiIiIpZTIBERERHLKZCIiIiI5QKsLsBbpmliGN2rh5vdbut233Nbaay8p7HynsbKexor73WnsbLbbdhsNq+u7TSBxDBM8vJKrS6jwwQE2ImJCaeoqAy327C6HL+msfKexsp7Givvaay8193GKjY2HIfDu0CiKRsRERGxnAKJiIiIWE6BRERERCynQCIiIiKWUyARERERyymQiIiIiOUUSERERMRyCiQiIiJiOQUSERERsZwCiYiIiFhOgUREREQsp0AiIiIillMgEREREcspkIiIiHRzpmlimqalNSiQiIiIdGNl3x/k2B9/T8aaVZbWEWDpp4uIiIglTI+H3Ne3k/fm62CaBCUlWVqPAomIiEg34y7IJ2PtasoPHwIgasplxM2ZZ2lNCiQiIiLdSOmB/WSuS8FTUowtOISEO+/CeeFEq8tSIBEREekOTLeb09u3kf/WmwAE9+1H0tL7CEpItLiyagokIiIiXZwrL5eMNauoOPIjAFGXTyXu1jnYA4MsruwXCiQiIiJdWMm+b8jckIJRWoo9NJSEO+8m8vwLrC6rAQUSERGRLsh0uzn9ykvk/30HAMH9B5C05D6C4uMtrqxxrQokb731Fq+99hrfffcdRUVF9O/fnzvuuIObb74Zm81We92LL77IunXrOHXqFAMHDmT58uVcfvnlPi9eREREGnKdzqmeovnpKADRV1xJz1tuxR4YaHFlTWtVIHn66afp3bs3Dz30EDExMXzyySc8/PDDZGZmcv/99wPw5ptv8vDDD7N06VImTpxIamoq999/P1u2bGHs2LHt8T2IiIjIPxR/tZesp9djlJVhDwsjccE9RIwbb3VZLbKZregVm5eXR2xsbJ3HHn74YVJTU/niiy+w2+1cffXVjBo1ir/+9a+118yZM4fIyEhSUlLaXKjHY5CXV9rm13c2AQF2YmLCyc8vxe02rC7Hr2msvKex8p7GynsaK++151gZLhenX3yegvd2AhAyaBBJS+4jsEdPn35Oa8TGhuNweNcUvlWt4+uHEYARI0ZQUlJCWVkZ6enpHDt2jGuvvbbONddddx179uyhqqqqNR8nIiIiXqjKzib9sf9TG0Zirr6Gvr//Z0vDSGud9aLWvXv3kpCQQEREBHv37gVg4MCBda4ZPHgwLpeL9PR0Bg8e3ObPCgjoPkfv1CRKb5Nld6ax8p7GynsaK+9prLzXHmNV+PlnZGxYj1FRgSMigl4LFxPZCZdInFUg+fLLL0lNTeUPf/gDAIWFhQA4nc4619V8XfN8W9jtNmJiwtv8+s7K6Qy1uoROQ2PlPY2V9zRW3tNYec8XY2VUVfHT+qfI3PF3ACJHDGfYg78juGePs35vK7Q5kGRmZrJ8+XImTJjA/PnzfVlTowzDpKiorN0/x184HHaczlCKisrxeDQn2xyNlfc0Vt7TWHlPY+U9X41VZUYGJ1b+jcr0dAB6TL+B+Fk3UeZwUJbvP+stnc5Qr+8GtSmQFBUVsWjRIqKjo3nyySex26s/LCoqCoDi4mLi4uLqXH/m823VHRdLeTxGt/y+20Jj5T2Nlfc0Vt7rimNlGCaH0wsoKK0kOjyYoX2jsdttLb+wBWczVkWffk
LWMxsxKytxREaSeM9iwkedi8cEOvH4tzqQVFRUsGTJEoqLi3n++eeJjIysfW7QoEEAHD16tPbfa74ODAykb9++PihZRESk/e09lM2zO9PIL66sfSwmMph505IZP6zjm4sZlZVkb91C0Ue7AQgdNpykRUsIiI7p8FraQ6tW1bjdbn77299y9OhR1q1bR0JCQp3n+/bty4ABA9ixY0edx1NTU5k0aRJBQf7TM19ERKQpew9ls2LbgTphBCC/uJIV2w6w91B2h9ZTeeokP/+ff6sOIzYbsTfMoM8//b7LhBFo5R2SRx99lPfff5+HHnqIkpISvvnmm9rnRo4cSVBQEA888AAPPvgg/fr1Y8KECaSmprJ//342b97s69pFRER8zjBMnt2Z1uw1W3emMS45zifTNy0p/PhDsrc8g1lVhcPpJGnRUsJGjGz3z+1orQokH3/8MQCPPfZYg+feffdd+vTpw/Tp0ykvLyclJYW1a9cycOBA/va3vzFu3DjfVCwiItKODqcXNLgzUl9ecSWH0wsY3r/97lAYFRVkb3mGoj3Vv3vDRowkceFiAqKi2+0zrdSqQPLee+95dd3s2bOZPXt2mwoSERGxUkFp82Gktde1ReWJdDJWr6QqMwNsNnrMmEXsddOx2bturxed9isiInKG6PBgn17XGqZpUvThbrK3bsZ0uXBER1dP0Qwb7vPP8jcKJCIiImcY2jeamMjgZqdtYiOrtwD7klFRTtamjRR//ikAYeeMqp6iiXS28MquQYFERETkDHa7jXnTklmx7UCT18ydluzTBa0VPx8nY81KXFlZYLfTc9bNxFx9bZeeoqlPgURERKSe8cPiWTZrVIM+JLGRwcz1YR8S0zQp/OB9cp5/FtPtJiAmlqTF9xKanOyT9+9MFEhEREQaMX5YPOOS49qlUyuAp6yMrE1PUfLlFwCEjx5D4t2LcERE+OT9OxsFEhERkSbY7bZ22dpb/tNRTqxcgSsnBxwO4m6eTfSVV2OztX9fE3+lQCIiItJBTNPk1Otv8tNTG8HjIaBHD5KW3EfooMFWl2Y5BRIREZEO4CktJWPjBoq/2gtAxLjxJNx1N47wcIsr8w8KJCIiIu2s/OgRMtasxJ2biy0ggITb5hB52RXdeoqmPgUSERGRdmKaJvl/38HpV14Cj4fAuHhGPvQgrh6JuN2G1eX5FQUSERGRduApKSFzQwql+/cBEHH+hfS++24ieseRn19qcXX+R4FERETEx8rT0shYuwp3fh62gADi5swj6tLLcQQ6rC7NbymQiIiI+IhpGOTvSOX0q6+AYRCYkEivpfcR3Lef1aX5PQUSERGRZhiG6VVzNHdxEZnrUyg78C0AkRMmkXDHfOwhoR1dcqekQCIiItKEvYeyG7SPj4kMZl699vFlh34gI2U1noICbEFBxM+9HefFU7SLphUUSERERBqx91B2owfs5RdXsmLbAZbNGsXYwT1I2/oStt07sJkmgUm9qqdoevexoOLOTYFERESkHsMweXZnWrPXvPjaVxSe3E2fklMA7I8czBdxl3BrSRDjO6LILqb7nGssIiLipcPpBXWmaerrX5bBrT++Sp+SU1TZAngjfjKpCZPJKTNYse0Aew9ld2C1XYPukIiIiNRTUNp4GLGZBhfn7eei/P3YgOygaLYnTiE3KLrOdVt3pjEuOc5nJwN3BwokIiIi9USHBzd4LMJdxo2ZH9KvIguAb5xD2NnzQtz2hr9K84orOZxe0C4nBXdVCiQiIiL1DO0bTUxkcO20zcDSk0zP/phwTwWVtgDejp/IwchBzb5HU3dZpHFaQyIiIlKP3W5j3rRkbKbBpae/4raMdwn3VJAVFMPTfae3GEag8bss0jTdIREREWnE6DgH/1TxEQEFxwDYGzWM93qcj9MZSrjboLTC3eRrYyOrG6iJ9xRIRERE6inZ9w2ZG1IIKC3FHhKK+/pbGdh/BP/0j06tX6flNNqjpMbcacla0NpKCiQiIuLXvG
3d7gum283pV14i/+87AAjuP4CkJfcRFB9f57rxw+JZNmtUgy6usZHBzK3XxVW8o0AiIiJ+y9vW7b7gOp1DxtpVVBw
|
||
|
"text/plain": [
|
||
|
"<Figure size 640x480 with 1 Axes>"
|
||
|
]
|
||
|
},
|
||
|
"metadata": {},
|
||
|
"output_type": "display_data"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"plt.scatter(x, y)\n",
|
||
|
"plt.plot(xfit, yfit, 'r');"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {
|
||
|
"slideshow": {
|
||
|
"slide_type": "slide"
|
||
|
}
|
||
|
},
|
||
|
"source": [
|
||
|
"## Supervised learning: classification\n",
|
||
|
"\n",
|
||
|
"Consider the Iris data-set and predict the species."
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {
|
||
|
"slideshow": {
|
||
|
"slide_type": "subslide"
|
||
|
}
|
||
|
},
|
||
|
"source": [
|
||
|
"Split data into training and test sets (hint: [`train_test_split`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) is a convenient scikit-learn function for this task)."
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 25,
|
||
|
"metadata": {
|
||
|
"execution": {
|
||
|
"iopub.execute_input": "2024-01-10T00:13:30.621937Z",
|
||
|
"iopub.status.busy": "2024-01-10T00:13:30.621464Z",
|
||
|
"iopub.status.idle": "2024-01-10T00:13:30.627694Z",
|
||
|
"shell.execute_reply": "2024-01-10T00:13:30.627089Z"
|
||
|
}
|
||
|
},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"from sklearn.model_selection import train_test_split\n",
|
||
|
"\n",
|
||
|
"X_train, X_test, y_train, y_test = train_test_split(X_iris, y_iris, test_size=0.5, random_state=1)"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 26,
|
||
|
"metadata": {
|
||
|
"execution": {
|
||
|
"iopub.execute_input": "2024-01-10T00:13:30.630352Z",
|
||
|
"iopub.status.busy": "2024-01-10T00:13:30.630134Z",
|
||
|
"iopub.status.idle": "2024-01-10T00:13:30.641703Z",
|
||
|
"shell.execute_reply": "2024-01-10T00:13:30.641083Z"
|
||
|
}
|
||
|
},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/html": [
|
||
|
"<div>\n",
|
||
|
"<style scoped>\n",
|
||
|
" .dataframe tbody tr th:only-of-type {\n",
|
||
|
" vertical-align: middle;\n",
|
||
|
" }\n",
|
||
|
"\n",
|
||
|
" .dataframe tbody tr th {\n",
|
||
|
" vertical-align: top;\n",
|
||
|
" }\n",
|
||
|
"\n",
|
||
|
" .dataframe thead th {\n",
|
||
|
" text-align: right;\n",
|
||
|
" }\n",
|
||
|
"</style>\n",
|
||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
||
|
" <thead>\n",
|
||
|
" <tr style=\"text-align: right;\">\n",
|
||
|
" <th></th>\n",
|
||
|
" <th>sepal_length</th>\n",
|
||
|
" <th>sepal_width</th>\n",
|
||
|
" <th>petal_length</th>\n",
|
||
|
" <th>petal_width</th>\n",
|
||
|
" </tr>\n",
|
||
|
" </thead>\n",
|
||
|
" <tbody>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>74</th>\n",
|
||
|
" <td>6.4</td>\n",
|
||
|
" <td>2.9</td>\n",
|
||
|
" <td>4.3</td>\n",
|
||
|
" <td>1.3</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>116</th>\n",
|
||
|
" <td>6.5</td>\n",
|
||
|
" <td>3.0</td>\n",
|
||
|
" <td>5.5</td>\n",
|
||
|
" <td>1.8</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>93</th>\n",
|
||
|
" <td>5.0</td>\n",
|
||
|
" <td>2.3</td>\n",
|
||
|
" <td>3.3</td>\n",
|
||
|
" <td>1.0</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>100</th>\n",
|
||
|
" <td>6.3</td>\n",
|
||
|
" <td>3.3</td>\n",
|
||
|
" <td>6.0</td>\n",
|
||
|
" <td>2.5</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>89</th>\n",
|
||
|
" <td>5.5</td>\n",
|
||
|
" <td>2.5</td>\n",
|
||
|
" <td>4.0</td>\n",
|
||
|
" <td>1.3</td>\n",
|
||
|
" </tr>\n",
|
||
|
" </tbody>\n",
|
||
|
"</table>\n",
|
||
|
"</div>"
|
||
|
],
|
||
|
"text/plain": [
|
||
|
" sepal_length sepal_width petal_length petal_width\n",
|
||
|
"74 6.4 2.9 4.3 1.3\n",
|
||
|
"116 6.5 3.0 5.5 1.8\n",
|
||
|
"93 5.0 2.3 3.3 1.0\n",
|
||
|
"100 6.3 3.3 6.0 2.5\n",
|
||
|
"89 5.5 2.5 4.0 1.3"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": 26,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"X_train.head() "
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {
|
||
|
"slideshow": {
|
||
|
"slide_type": "subslide"
|
||
|
}
|
||
|
},
|
||
|
"source": [
|
||
|
"\n",
|
||
|
"Use a Gaussian Naive Bayes (`GaussianNB`) model to predict Iris species. Then evaluate performance on test data.\n",
|
||
|
"\n",
|
||
|
"(Hint: choose, instantiate, fit and predict.) \n",
|
||
|
"\n",
|
||
|
"See Scikit-Learn documentation on [`GaussianNB`](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html).\n",
|
||
|
"\n",
|
||
|
"Evaluate performance using simple [`accuracy_score`](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html#sklearn.metrics.accuracy_score).\n",
|
||
|
"\n",
|
||
|
"(Do not set any priors.)"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 27,
|
||
|
"metadata": {
|
||
|
"execution": {
|
||
|
"iopub.execute_input": "2024-01-10T00:13:30.644869Z",
|
||
|
"iopub.status.busy": "2024-01-10T00:13:30.644398Z",
|
||
|
"iopub.status.idle": "2024-01-10T00:13:30.654902Z",
|
||
|
"shell.execute_reply": "2024-01-10T00:13:30.654270Z"
|
||
|
},
|
||
|
"tags": [
|
||
|
"solution"
|
||
|
]
|
||
|
},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"from sklearn.naive_bayes import GaussianNB # 1. choose model class\n",
|
||
|
"model = GaussianNB() # 2. instantiate model\n",
|
||
|
"model.fit(X_train, y_train) # 3. fit model to data\n",
|
||
|
"y_model = model.predict(X_test) # 4. predict on new data"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {
|
||
|
"slideshow": {
|
||
|
"slide_type": "fragment"
|
||
|
}
|
||
|
},
|
||
|
"source": [
|
||
|
"Evaluate performance on test data."
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 28,
|
||
|
"metadata": {
|
||
|
"execution": {
|
||
|
"iopub.execute_input": "2024-01-10T00:13:30.658096Z",
|
||
|
"iopub.status.busy": "2024-01-10T00:13:30.657663Z",
|
||
|
"iopub.status.idle": "2024-01-10T00:13:30.665168Z",
|
||
|
"shell.execute_reply": "2024-01-10T00:13:30.664579Z"
|
||
|
},
|
||
|
"tags": [
|
||
|
"solution"
|
||
|
]
|
||
|
},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"0.96"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": 28,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"from sklearn.metrics import accuracy_score\n",
|
||
|
"accuracy_score(y_test, y_model)"
|
||
|
]
|
||
|
},
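{
"cell_type": "markdown",
"metadata": {},
"source": [
"Accuracy alone does not show which species are confused with one another. A hedged sketch using `confusion_matrix` follows. It loads Iris through `sklearn.datasets.load_iris`, which is an assumption made so the cell is self-contained and may differ from how `X_iris` and `y_iris` were built earlier in the lecture; the `*_demo` names are illustrative and chosen so that existing variables are not overwritten."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.datasets import load_iris\n",
"from sklearn.metrics import confusion_matrix\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.naive_bayes import GaussianNB\n",
"\n",
"# Assumption: load Iris from scikit-learn rather than reuse X_iris / y_iris\n",
"X_demo, y_demo = load_iris(return_X_y=True)\n",
"Xd_train, Xd_test, yd_train, yd_test = train_test_split(\n",
"    X_demo, y_demo, test_size=0.5, random_state=1)\n",
"\n",
"yd_model = GaussianNB().fit(Xd_train, yd_train).predict(Xd_test)\n",
"\n",
"# Rows are true classes, columns are predicted classes\n",
"print(confusion_matrix(yd_test, yd_model))"
]
},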
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {
|
||
|
"slideshow": {
|
||
|
"slide_type": "slide"
|
||
|
}
|
||
|
},
|
||
|
"source": [
|
||
|
"## Unsupervised learning: dimensionality reduction\n",
|
||
|
"\n",
|
||
|
"Reduce dimensionality of Iris data for visualisation or to discover structure.\n",
|
||
|
"\n",
|
||
|
"Recall the original Iris data has four features."
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 29,
|
||
|
"metadata": {
|
||
|
"execution": {
|
||
|
"iopub.execute_input": "2024-01-10T00:13:30.667996Z",
|
||
|
"iopub.status.busy": "2024-01-10T00:13:30.667639Z",
|
||
|
"iopub.status.idle": "2024-01-10T00:13:30.676274Z",
|
||
|
"shell.execute_reply": "2024-01-10T00:13:30.675749Z"
|
||
|
}
|
||
|
},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/html": [
|
||
|
"<div>\n",
|
||
|
"<style scoped>\n",
|
||
|
" .dataframe tbody tr th:only-of-type {\n",
|
||
|
" vertical-align: middle;\n",
|
||
|
" }\n",
|
||
|
"\n",
|
||
|
" .dataframe tbody tr th {\n",
|
||
|
" vertical-align: top;\n",
|
||
|
" }\n",
|
||
|
"\n",
|
||
|
" .dataframe thead th {\n",
|
||
|
" text-align: right;\n",
|
||
|
" }\n",
|
||
|
"</style>\n",
|
||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
||
|
" <thead>\n",
|
||
|
" <tr style=\"text-align: right;\">\n",
|
||
|
" <th></th>\n",
|
||
|
" <th>sepal_length</th>\n",
|
||
|
" <th>sepal_width</th>\n",
|
||
|
" <th>petal_length</th>\n",
|
||
|
" <th>petal_width</th>\n",
|
||
|
" </tr>\n",
|
||
|
" </thead>\n",
|
||
|
" <tbody>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>0</th>\n",
|
||
|
" <td>5.1</td>\n",
|
||
|
" <td>3.5</td>\n",
|
||
|
" <td>1.4</td>\n",
|
||
|
" <td>0.2</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>1</th>\n",
|
||
|
" <td>4.9</td>\n",
|
||
|
" <td>3.0</td>\n",
|
||
|
" <td>1.4</td>\n",
|
||
|
" <td>0.2</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>2</th>\n",
|
||
|
" <td>4.7</td>\n",
|
||
|
" <td>3.2</td>\n",
|
||
|
" <td>1.3</td>\n",
|
||
|
" <td>0.2</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>3</th>\n",
|
||
|
" <td>4.6</td>\n",
|
||
|
" <td>3.1</td>\n",
|
||
|
" <td>1.5</td>\n",
|
||
|
" <td>0.2</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>4</th>\n",
|
||
|
" <td>5.0</td>\n",
|
||
|
" <td>3.6</td>\n",
|
||
|
" <td>1.4</td>\n",
|
||
|
" <td>0.2</td>\n",
|
||
|
" </tr>\n",
|
||
|
" </tbody>\n",
|
||
|
"</table>\n",
|
||
|
"</div>"
|
||
|
],
|
||
|
"text/plain": [
|
||
|
" sepal_length sepal_width petal_length petal_width\n",
|
||
|
"0 5.1 3.5 1.4 0.2\n",
|
||
|
"1 4.9 3.0 1.4 0.2\n",
|
||
|
"2 4.7 3.2 1.3 0.2\n",
|
||
|
"3 4.6 3.1 1.5 0.2\n",
|
||
|
"4 5.0 3.6 1.4 0.2"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": 29,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"X_iris.head()"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 30,
|
||
|
"metadata": {
|
||
|
"execution": {
|
||
|
"iopub.execute_input": "2024-01-10T00:13:30.679207Z",
|
||
|
"iopub.status.busy": "2024-01-10T00:13:30.678646Z",
|
||
|
"iopub.status.idle": "2024-01-10T00:13:30.685276Z",
|
||
|
"shell.execute_reply": "2024-01-10T00:13:30.684661Z"
|
||
|
}
|
||
|
},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/plain": [
|
||
|
"(150, 4)"
|
||
|
]
|
||
|
},
|
||
|
"execution_count": 30,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"X_iris.shape"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {
|
||
|
"slideshow": {
|
||
|
"slide_type": "subslide"
|
||
|
}
|
||
|
},
|
||
|
"source": [
|
||
|
"Compute principle component analysis (`PCA`), with 2 components, and apply transform. Plot data in PCA space. \n",
|
||
|
"\n",
|
||
|
"(Hint: choose, instantiate, fit and transform.)\n",
|
||
|
"\n",
|
||
|
"See Scikit-Learn documentation on [`PCA`](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html).\n",
|
||
|
"\n",
|
||
|
"See Seaborn documentation on [`lmplot`](https://seaborn.pydata.org/generated/seaborn.lmplot.html)."
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 31,
|
||
|
"metadata": {
|
||
|
"execution": {
|
||
|
"iopub.execute_input": "2024-01-10T00:13:30.688494Z",
|
||
|
"iopub.status.busy": "2024-01-10T00:13:30.687909Z",
|
||
|
"iopub.status.idle": "2024-01-10T00:13:30.703068Z",
|
||
|
"shell.execute_reply": "2024-01-10T00:13:30.702344Z"
|
||
|
},
|
||
|
"tags": [
|
||
|
"solution"
|
||
|
]
|
||
|
},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"from sklearn.decomposition import PCA # 1. Choose the model class\n",
|
||
|
"model = PCA(n_components=2) # 2. Instantiate the model with hyperparameters\n",
|
||
|
"model.fit(X_iris) # 3. Fit to data. Notice y is not specified!\n",
|
||
|
"X_2D = model.transform(X_iris) # 4. Transform the data to two dimensions "
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 32,
|
||
|
"metadata": {
|
||
|
"execution": {
|
||
|
"iopub.execute_input": "2024-01-10T00:13:30.706164Z",
|
||
|
"iopub.status.busy": "2024-01-10T00:13:30.705697Z",
|
||
|
"iopub.status.idle": "2024-01-10T00:13:30.720075Z",
|
||
|
"shell.execute_reply": "2024-01-10T00:13:30.719309Z"
|
||
|
},
|
||
|
"tags": [
|
||
|
"solution"
|
||
|
]
|
||
|
},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"text/html": [
|
||
|
"<div>\n",
|
||
|
"<style scoped>\n",
|
||
|
" .dataframe tbody tr th:only-of-type {\n",
|
||
|
" vertical-align: middle;\n",
|
||
|
" }\n",
|
||
|
"\n",
|
||
|
" .dataframe tbody tr th {\n",
|
||
|
" vertical-align: top;\n",
|
||
|
" }\n",
|
||
|
"\n",
|
||
|
" .dataframe thead th {\n",
|
||
|
" text-align: right;\n",
|
||
|
" }\n",
|
||
|
"</style>\n",
|
||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
||
|
" <thead>\n",
|
||
|
" <tr style=\"text-align: right;\">\n",
|
||
|
" <th></th>\n",
|
||
|
" <th>sepal_length</th>\n",
|
||
|
" <th>sepal_width</th>\n",
|
||
|
" <th>petal_length</th>\n",
|
||
|
" <th>petal_width</th>\n",
|
||
|
" <th>species</th>\n",
|
||
|
" <th>PCA1</th>\n",
|
||
|
" <th>PCA2</th>\n",
|
||
|
" </tr>\n",
|
||
|
" </thead>\n",
|
||
|
" <tbody>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>0</th>\n",
|
||
|
" <td>5.1</td>\n",
|
||
|
" <td>3.5</td>\n",
|
||
|
" <td>1.4</td>\n",
|
||
|
" <td>0.2</td>\n",
|
||
|
" <td>setosa</td>\n",
|
||
|
" <td>-2.684126</td>\n",
|
||
|
" <td>0.319397</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>1</th>\n",
|
||
|
" <td>4.9</td>\n",
|
||
|
" <td>3.0</td>\n",
|
||
|
" <td>1.4</td>\n",
|
||
|
" <td>0.2</td>\n",
|
||
|
" <td>setosa</td>\n",
|
||
|
" <td>-2.714142</td>\n",
|
||
|
" <td>-0.177001</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>2</th>\n",
|
||
|
" <td>4.7</td>\n",
|
||
|
" <td>3.2</td>\n",
|
||
|
" <td>1.3</td>\n",
|
||
|
" <td>0.2</td>\n",
|
||
|
" <td>setosa</td>\n",
|
||
|
" <td>-2.888991</td>\n",
|
||
|
" <td>-0.144949</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>3</th>\n",
|
||
|
" <td>4.6</td>\n",
|
||
|
" <td>3.1</td>\n",
|
||
|
" <td>1.5</td>\n",
|
||
|
" <td>0.2</td>\n",
|
||
|
" <td>setosa</td>\n",
|
||
|
" <td>-2.745343</td>\n",
|
||
|
" <td>-0.318299</td>\n",
|
||
|
" </tr>\n",
|
||
|
" <tr>\n",
|
||
|
" <th>4</th>\n",
|
||
|
" <td>5.0</td>\n",
|
||
|
" <td>3.6</td>\n",
|
||
|
" <td>1.4</td>\n",
|
||
|
" <td>0.2</td>\n",
|
||
|
" <td>setosa</td>\n",
|
||
|
" <td>-2.728717</td>\n",
|
||
|
" <td>0.326755</td>\n",
|
||
|
" </tr>\n",
|
||
|
" </tbody>\n",
|
||
|
"</table>\n",
|
||
|
"</div>"
|
||
|
],
|
||
|
"text/plain": [
|
||
|
" sepal_length sepal_width petal_length petal_width species PCA1 \\\n",
|
||
|
"0 5.1 3.5 1.4 0.2 setosa -2.684126 \n",
|
||
|
"1 4.9 3.0 1.4 0.2 setosa -2.714142 \n",
|
||
|
"2 4.7 3.2 1.3 0.2 setosa -2.888991 \n",
|
||
|
"3 4.6 3.1 1.5 0.2 setosa -2.745343 \n",
|
||
|
"4 5.0 3.6 1.4 0.2 setosa -2.728717 \n",
|
||
|
"\n",
|
||
|
" PCA2 \n",
|
||
|
"0 0.319397 \n",
|
||
|
"1 -0.177001 \n",
|
||
|
"2 -0.144949 \n",
|
||
|
"3 -0.318299 \n",
|
||
|
"4 0.326755 "
|
||
|
]
|
||
|
},
|
||
|
"execution_count": 32,
|
||
|
"metadata": {},
|
||
|
"output_type": "execute_result"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"iris['PCA1'] = X_2D[:, 0]\n",
|
||
|
"iris['PCA2'] = X_2D[:, 1]\n",
|
||
|
"iris.head() "
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 33,
|
||
|
"metadata": {
|
||
|
"execution": {
|
||
|
"iopub.execute_input": "2024-01-10T00:13:30.723401Z",
|
||
|
"iopub.status.busy": "2024-01-10T00:13:30.723006Z",
|
||
|
"iopub.status.idle": "2024-01-10T00:13:31.391220Z",
|
||
|
"shell.execute_reply": "2024-01-10T00:13:31.390526Z"
|
||
|
},
|
||
|
"tags": [
|
||
|
"solution"
|
||
|
]
|
||
|
},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAlgAAAHjCAYAAAD/g2H3AAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjcuNCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8WgzjOAAAACXBIWXMAAA9hAAAPYQGoP6dpAACfsUlEQVR4nOzdeXzcVb34/9dnm5nMJJOlSZs2aZO2QNqyWAGhBaS0QguU+0UEBRRuWYSKKCrw86rXK3i9XrcrKIuyCFpRWQQXaKEtaKkIF7yyY0sEStMmbdM0aTKZmczy+XzO749Jplkm+2R/Px8PHiWTmc+ck2XynnPe5/3WlFIKIYQQQgiRNfpYD0AIIYQQYrKRAEsIIYQQIsskwBJCCCGEyDIJsIQQQgghskwCLCGEEEKILJMASwghhBAiyyTAEkIIIYTIMgmwhBBCCCGyTAIsIYQQQogsM8d6AMNVU1PDfffdx+uvv84777zDvHnzWL9+fb+PW7FiBXV1dT1uf+ONN/B6vSMxVCGEEEJMERM+wHrnnXfYunUrH/jAB3Bdl8F0/lm1ahVXXHFFl9s8Hk+2hyiEEEKIKWbCB1grVqzg9NNPB+ArX/kKb7311oAfW1xczOLFi0doZEIIIYSYqiZ8DpauT/gpCCGEEGKSmfArWMPxxBNP8Mgjj2BZFscffzw33ngjVVVVw7qm47iEQm1ZGmHfNE0jPz+Hlpa2QW2Njicyh/FB5jA+yBzGh/7mUFgYGINRiYlmygZYK1as4JhjjmHWrFns3r2bu+66i09+8pP84Q9/YPbs2UO+rq5ro/7LV1DgH9XnGwkyh/FB5jA+yBzGh8kwBzF2pmyA9fWvfz39/8cffzwnn3wyZ511Fvfddx8333zzkK/ruopQKJqFEfbPMHSCwRxCoTYcxx2V58w2mcP4IHMYH2QO40N/c5AVLDEQUzbA6m769Okcd9xx/OMf/xj2tWx7dF9UHMcd9efMNpnD+CBzGB9kDuPDZJiDGDuSIS6EEEIIkWUSYLWrr6/n5Zdf5uijjx7roQghhBBigpvwW4RtbW1s3boVgLq6OsLhMBs3bgTghBNOoKioiDVr1rBnzx6efvppANavX8+WLVtYtmwZ06dPZ/fu3dxzzz0YhsHll18+ZnMRQgghxOQw4QOsxsZGvvCFL3S5rePjX/7yl5x44om4rovjOOnPl5eXs3//fv77v/+b1tZW8vLyWLJkCdddd92wThAKIYQQQsAkCLDKy8uprq7u8z4PPPBAl48XL17c4zYhhBBCiGyRHCwhhBBCiCyTAEsIIYQQIsskwBJCCCGEyDIJsIQQQgghskwCLCGEEEKILJvwpwiFEEJMPK5yqW3dQzgZIdcKUJ43C12T9/xi8pAASwghxKiqbnqXzTVbqI824CgHQzOY4S9hZcVyqooOG+vhCZEV8nZBCCHEqKluepcHqx+jLrwXr+Eh6MnFa3ioi+zlwerHqG56d6yHKERWSIAlhBBiVLjKZXPNFmJ2nAJvEI9hoWs6HsOiwBMk5sTZXLMFV7ljPVQhhk0CLCGEEKOitnUP9dEGApYfTdO6fE7TNAKmn/poA7Wte8ZohEJkj+RgiUFxlWJXfSvhaJJcv8WcGXno3V4ohRAik3AygqMcTN3I+HlTN4jaDuFkZJRHJkT2SYAlBmz7ziY2vFjDvqYojqMwDI3SIj+rl1SwsLJorIcnhBjncq0AhmZguw4eo+cGiu2mEt5zrcAYjE6I7JItQjEg23c2sW5TNbUNYbyWQTDXg9cyqG2IsG5TNdt3No31EIUQ41x53ixm+EuI2FGUUl0+p5QiYkeZ4S+hPG/WGI1QiOyRAEv0y1WKDS/WEEvYFOR68VgGuqbhsQwKcj3EEg4bXqzB7faCKYQQnemazsqK5fgML82JEAkniatcEk6S5kQIn+FjZcVyqYclJgX5KRb92lXfyr6mKAGflTkx1WeyrynKrvrWMRqhEGKiqCo6jI
urzqcsMJO4kyCUCBN3EpQFZnJx1cekDpaYNCQHS/QrHE3iOAozJ3M8bpo60ZhNOJoc5ZEJISaiqqLDOLxwnlRyF5OaBFiiX7l+C8PQsG0Xj9Xz9I9tuxiGRq7fGoPRCSEmIl3TmRMsH+thCDFi5O2C6NecGXmUFvmJxOzMiakxm9IiP3Nm5I3RCIUQQojxRQIs0S9d01i9pAKfx6A5nCCRdHCVIpF0aA4n8HkMVi+pkHpYQgghRDsJsMSALKwsYs2qKspLAsSTDqFwgnjSobwkwJpVVVIHSwghhOhEcrDEgC2sLKKqolAquQshhBD9kABLDIquaVSWBsd6GEIIIcS4JluEQgghhBBZJgGWEEIIIUSWSYAlhBBCCJFlEmAJIYQQQmSZBFhCCCGEEFkmAZYQQgghRJZJgCWEEEIIkWUSYAkhhBBCZJkEWEIIIYQQWSYBlhBCCCFElkmrnCnMVUr6CgohJgxXudS27iGcjJBrBSjPm4WuyTqBGJ8kwJqitu9sYsOLNexriuI4CsPQKC3ys3pJBQsri8Z6eEII0UV107tsrtlCfbQBRzkYmsEMfwkrK5ZTVXTYWA9PiB4k9J+Ctu9sYt2mamobwngtg2CuB69lUNsQYd2marbvbBrrIQohRFp107s8WP0YdeG9eA0PQU8uXsNDXWQvD1Y/RnXTu2M9RCF6kABrinGVYsOLNcQSNgW5XjyWga5peCyDglwPsYTDhhdrcJUa66EKIQSuctlcs4WYHafAG8RjWOiajsewKPAEiTlxNtdswVXuWA9ViC4kwJpidtW3sq8pSsBnoXXLt9I0jYDPZF9TlF31rWM0QiGEOKS2dQ/10QYClj/za5bppz7aQG3rnjEaoRCZSYA1xYSjSRxHYZqZv/WmqeM4inA0OcojE0KInsLJCI5yMHUj4+dN3cBRDuFkZJRHJkTfJMCaYnL9FoahYduZl9Nt28UwNHL91iiPTAghesq1Ahiage06GT9vu6mE91wrMMojE6JvEmBNMXNm5FFa5CcSs1Hd8qyUUkRiNqVFfubMyBujEQohxCHlebOY4S8hYkczv2bZUWb4SyjPmzVGIxQiMwmwphhd01i9pAKfx6A5nCCRdHCVIpF0aA4n8HkMVi+pkHpYQohxQdd0VlYsx2d4aU6ESDhJXOWScJI0J0L4DB8rK5ZLPSwx7shP5BS0sLKINauqKC8JEE86hMIJ4kmH8pIAa1ZVSR0sIcS4UlV0GBdXnU9ZYCZxJ0EoESbuJCgLzOTiqo9JHSwxLkmh0SlqYWURVRWFUsldCDEhVBUdxuGF86SSu5gwJMCawnRNo7I0ONbDEEKIAdE1nTnB8rEehhADIgGWkJ6EQgghRJZJgDXFSU9CIYQQIvtk83oKk56EQgghxMiQAGuKkp6EQgghxMiRAGuKkp6EQgghxMiRAGuKkp6EQgghxMiRAGuKkp6EQgghxMiRAGuKkp6EQgghxMiRAGuKkp6EQgghxMiRAGsKk56EQgghxMiQQqNTnPQkFEIIIbJPAiwhPQmFEEKILJMtQiGEEEKILJvwAVZNTQ3f+MY3OPfcc1m0aBHnnHPOgB6nlOKee+7htNNO45hjjuHCCy/ktddeG9nBCiGEEGJKmPAB1jvvvMPWrVupqKhg/vz5A37cvffey2233cZll13G3XffTUlJCVdccQW7d+8ewdEKIYQQYiqY8AHWihUr2Lp1K7fddhtHHnnkgB4Tj8e5++67ueKKK7jssstYunQpt9xyCwUFBdx3330jPGIhhBBCTHYTPsDS9cFP4ZVXXiEcDnPWWWelb/N4PJxxxhn85S9/yebwhBBCCDEFTclThDt27ABg3rx5XW6fP38+69atIxaL4fP5hnz93vr7ZZth6F3+nYhkDuODzGF8kDmMD5NhDmLsTckAKxQK4fF48Hq9XW4PBoMopWhpaRlygKXrGoWFgWwMc8CCwZxRfb6RIHMYH2QO44PMYXyYDHMQY2dKBlgjyXUVoV
B0VJ7LMHSCwRxCoTYcJ3PT5vFO5jA+yBzGB5nD+NDfHEb7TbSYmKZkgBUMBkkkEsTj8S6rWKFQCE3TyM/PH9b1bXt
|
||
|
"text/plain": [
|
||
|
"<Figure size 630x500 with 1 Axes>"
|
||
|
]
|
||
|
},
|
||
|
"metadata": {},
|
||
|
"output_type": "display_data"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"sns.lmplot(data=iris, x=\"PCA1\", y=\"PCA2\", hue='species', fit_reg=False);"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {
|
||
|
"slideshow": {
|
||
|
"slide_type": "subslide"
|
||
|
}
|
||
|
},
|
||
|
"source": [
|
||
|
"How well do you expect classification to perform using PCA components as features and why?"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {
|
||
|
"slideshow": {
|
||
|
"slide_type": "fragment"
|
||
|
},
|
||
|
"tags": [
|
||
|
"solution"
|
||
|
]
|
||
|
},
|
||
|
"source": [
|
||
|
"Very well since the different classes are well separated in PCA feature space."
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {
|
||
|
"slideshow": {
|
||
|
"slide_type": "slide"
|
||
|
}
|
||
|
},
|
||
|
"source": [
|
||
|
"## Unsupervised learning: clustering\n",
|
||
|
"\n",
|
||
|
"Attempt to find \"groups\" in Iris data without given labels or training data.\n"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {
|
||
|
"slideshow": {
|
||
|
"slide_type": "subslide"
|
||
|
}
|
||
|
},
|
||
|
"source": [
|
||
|
" \n",
|
||
|
"Cluster Iris data into 3 components using Gaussian Mixture Model (GMM). Plot the 3 components separately in PCA space.\n",
|
||
|
"\n",
|
||
|
"(Hint: choose, instantiate, fit and predict.)\n",
|
||
|
"\n",
|
||
|
"See Scikit-Learn documentation on [`GaussianMixture`](http://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html)."
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 34,
|
||
|
"metadata": {
|
||
|
"execution": {
|
||
|
"iopub.execute_input": "2024-01-10T00:13:31.394705Z",
|
||
|
"iopub.status.busy": "2024-01-10T00:13:31.394206Z",
|
||
|
"iopub.status.idle": "2024-01-10T00:13:31.493375Z",
|
||
|
"shell.execute_reply": "2024-01-10T00:13:31.492329Z"
|
||
|
},
|
||
|
"tags": [
|
||
|
"solution"
|
||
|
]
|
||
|
},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"from sklearn.mixture import GaussianMixture # 1. Choose the model class\n",
|
||
|
"model = GaussianMixture(n_components=3) # 2. Instantiate the model with hyperparameters\n",
|
||
|
"model.fit(X_iris) # 3. Fit to data. Notice y is not specified!\n",
|
||
|
"y_gmm = model.predict(X_iris) # 4. Determine cluster labels"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 35,
|
||
|
"metadata": {
|
||
|
"execution": {
|
||
|
"iopub.execute_input": "2024-01-10T00:13:31.497402Z",
|
||
|
"iopub.status.busy": "2024-01-10T00:13:31.496762Z",
|
||
|
"iopub.status.idle": "2024-01-10T00:13:33.256040Z",
|
||
|
"shell.execute_reply": "2024-01-10T00:13:33.255265Z"
|
||
|
},
|
||
|
"tags": [
|
||
|
"solution"
|
||
|
]
|
||
|
},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"data": {
|
||
|
"image/png": "iVBORw0KGgoAAAANSUhEUgAABlAAAAHkCAYAAABBiGI5AAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjcuNCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8WgzjOAAAACXBIWXMAAA9hAAAPYQGoP6dpAADVGElEQVR4nOzdfXxcdZn//9c5Z2YymZlMk7RpmiZNQm9MCwUKuNCKUFqhBYKL3MiNylZBrC674OquC+t+V91dV1fXqty4CKJWUNAFEaFIC1oB4Vd25R4pkVKSNGkb0g5pMjOZzJyb3x9DQ9OkbdJOMknm/Xw8eJScc+ac65ppc5K55nNdhud5HiIiIiIiIiIiIiIiItLPzHcAIiIiIiIiIiIiIiIi440KKCIiIiIiIiIiIiIiIvtRAUVERERERERERERERGQ/KqCIiIiIiIiIiIiIiIjsRwUUERERERERERERERGR/aiAIiIiIiIiIiIiIiIish8VUERERERERERERERERPajAoqIiIiIiIiIiIiIiMh+VEARERERERERERERERHZjwooInJEnnnmGRoaGnjmmWfyHYqIiMiEpHupiIjIkdG9VERERosKKCIyrj333HPcdNNNdHd35zuUw5ZOp/nmN7/J+9//fo477jg+/OEP89RTT+U7LBERKRAT/V6aSCS48cYbueqqqzj55JNpaGjgl7/8Zb7DEhGRAjLR76UvvfQS//qv/0pjYyOLFi3ijDPO4LrrruPNN9/Md2giIuOeCigiMq49//zz3HzzzRP2B1WA66+/nh//+Md88IMf5Itf/CKWZfGpT32KP/7xj/kOTURECsBEv5e+/fbb3HLLLWzdupWGhoZ8hyMiIgVoot9Lf/CDH7BhwwaWLFnCF7/4RS655BL++Mc/cuGFF/LnP/853+GJiIxrvnwHICKSD729vRQXF4/6dV566SXWrVvHF77wBa666ioAPvShD3HeeefxX//1X9xzzz2jHoOIiMhoGKt76fTp0/nDH/5ARUUFL7/8MhdffPGoX1NERGQsjNW99OMf/zj/9V//RSAQ6N927rnn8sEPfpDbbruN//qv/xr1GEREJiqtQBGRg+ro6OCf/umfeP/738/ChQtZvnw5X/rSl0in0wd8zPLly7n++usHbb/iiiu44oorBmy78847aWxs5Pjjj+cv/uIvuPDCC3nwwQcBuOmmm/jGN74BwAc+8AEaGhpoaGigra2t//EPPPAAF154Iccddxwnn3wyf/d3f8eOHTsGXfe8887jlVde4aMf/SjHH388a9asOeznZCQeeeQRLMvi0ksv7d9WVFTExRdfzPPPPz8oVhERmXx0Lz0ygUCAioqKMbmWiIiMT7qXHpkTTzxxQPEEoL6+nnnz5rF169YxiUFEZKLSChQROaCOjg4uvvhienp6uOSSS5g9ezYdHR2sX7+eVCo16AewkfrFL37Bv//7v7Ny5Ur+6q/+ir6+PpqamnjxxRf54Ac/yFlnnUVzczMPPfQQN9xwA2VlZQCUl5cD8N///d9897vf5ZxzzuHiiy8mFotx11138dGPfpRf/epXRKPR/mt1dXVx9dVX09jYyF/+5V8yderUA8aVTqeJx+PDymFvLAeyefNm6uvriUQiA7Yfd9xx/furqqqGdS0REZl4dC89tEPdS0VEpLDpXnpoh3Mv9TyPXbt2MW/evBE/VkSkkKiAIiIHtGbNGnbt2sUvfvELjj322P7t1113HZ7nHfH5f//73zNv3jxuvPHGIffPnz+fo48+moceeogzzzyTmpqa/n3t7e3cdNNNfPazn+XTn/50//YVK1ZwwQUX8LOf/WzA9s7OTr7yla9w2WWXHTKuvT8YD0dTU9NB93d2dg75qdm92956661hXUdERCYm3UsP7VD3UhERKWy6lx7a4dxLf/3rX9PR0cG111474seKiBQSFVBEZEiu6/LYY4+xbNmyAT+k7mUYxhFfIxqNsn
PnTl566aX+FRnD9eijj+K6Lueccw6xWKx/+7Rp06irq+OZZ54Z8INqIBDgwgsvHNa53//+9/OjH/1oRPEcyIE+EVVUVNS/X0REJifdS3NzLxURkcKle+no3EvfeOMN/vVf/5UTTjiBCy64YFSuISIyWaiAIiJDisVixOPxUV3Oe/XVV/P000/z4Q9/mLq6Ok499VTOO+88TjrppEM+trm5Gc/zWLFixZD7fb6B394qKyuHvbR7+vTpTJ8+fVjHHkowGByyL29fX1//fhERmZx0L83NvVRERAqX7qW5v5d2dnayevVqSkpK+O53v4tlWTm/hojIZKICioiMGcdxBvxwNmfOHB555BF+//vf8+STT7JhwwZ+9rOfcc011xxyGbHruhiGwe233z7kD3yhUGjA1yMpVKRSKXp6eoZ17KGG2lZUVNDR0TFoe2dnJ4DeXBIRkREpxHupiIhILhXyvbSnp4err76anp4efvrTn1JZWTnseERECpUKKCIypPLyciKRCK+//vqIHztlyhS6u7sHbd++fTuzZs0asC0UCnHuuedy7rnnkk6n+du//VtuvfVWVq9eTVFR0QGXZNfW1uJ5HjU1NRx11FEjjvFgHn744Zz1mp0/fz7PPPMM8Xh8wCD5F198EYAFCxYcfqAiIjKu6V6qGSgiInJkdC/N3b20r6+PT3/60zQ3N/OjH/2IuXPnHmmIIiIFQQUUERmSaZqceeaZ/PrXv+bll18e1G/W87wD/hA5a9Ysnn32WdLpdP/y5I0bN7Jjx44BP6i+/fbblJWV9X8dCASYM2cOTzzxBJlMhqKiIoqLiwEGffJmxYoVrFmzhptvvpn/+q//GhCL53l0dXUNOPdI5LLX7Nlnn80Pf/hDfv7zn3PVVVcBkE6n+eUvf8nxxx9PVVVVTq4jIiLjj+6lmoEiIiJHRvfS3NxLHcfhs5/9LC+88ALf+973OOGEE3JyXhGRQqACiogc0Oc+9zmeeuoprrjiCi655BLmzJlDZ2cnjzzyCD/72c+IRqNDPu7DH/4w69ev55Of/CTnnHMOra2tPPjgg9TW1g447qqrrmLatGmceOKJTJ06la1bt3LXXXexdOnS/tUaxxxzDADf/va3Offcc/H7/Sxbtoza2lo++9nP8q1vfYv29nbOPPNMwuEwbW1tPPbYY1xyySX9BYuRymWv2eOPP56zzz6bNWvWsHv3burq6rj//vtpb2/nq1/9ak6uISIi45fupblx11130d3dzVtvvQVk3wDbuXMnAFdccQUlJSU5u5aIiIwvupceua9//ev87ne/Y9myZXR1dfHAAw8M2H/++efn5DoiIpORCigickCVlZX84he/4Lvf/S4PPvgg8XicyspKTj/99IP2bj3ttNO4/vrr+dGPfsR//Md/sHDhQm699Vb+8z//c8Bxl156KQ8++CA/+tGPSCaTzJgxgyuuuIK//uu/7j/muOOO47rrruOee+7hySefxHVdfvvb3xIKhfjUpz5FfX09P/7xj7nlllsAmDFjBqeeeirLly8fnSflMHzjG9/gO9/5Dr/+9a/Zs2cPDQ0N3HrrrfzFX/xFvkMTEZFRpntpbvzwhz+kvb29/+sNGzawYcMGAP7yL/9SBRQRkUlM99Ij99prrwHZDyBs3Lhx0H4VUEREDszwPM/LdxAiIiIiIiIiIiIiIiLjiZnvAERERERERERERERERMYbFVBERERERERERERERET2M+FnoLS0tHDHHXfw4osv8vrrrzN79mweeuihQz5u+fLlA/oo7/XSSy9RVFQ0GqGKiIiIiIiIiIiIiMgEMeELKK+//jqPP/44xx9/PK7rMpKRLitXruTKK68csC0QCOQ6RBERERERERERERERmWAmfAFl+fLlnHnmmQBcf/31vPLKK8N+7LRp01i0aNEoRSYiIiIiIiIiIiIiIhPVhJ+BYpoTPgURERERERERERERERlnCrr68OCDD7Jw4UJOOOEErr76apqamv
IdkoiIiIiIiIiIiIiIjAMTvoXX4Vq+fDnHHXccM2fOZNu2bdx666185CMf4Ve/+hWzZs067PM6jkt3d28OIx0ZwzC
|
||
|
"text/plain": [
|
||
|
"<Figure size 1630x500 with 3 Axes>"
|
||
|
]
|
||
|
},
|
||
|
"metadata": {},
|
||
|
"output_type": "display_data"
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"iris['cluster'] = y_gmm\n",
|
||
|
"sns.lmplot(data=iris, x=\"PCA1\", y=\"PCA2\", hue='species',\n",
|
||
|
" col='cluster', fit_reg=False);"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {
|
||
|
"slideshow": {
|
||
|
"slide_type": "fragment"
|
||
|
},
|
||
|
"tags": [
|
||
|
"solution"
|
||
|
]
|
||
|
},
|
||
|
"source": [
|
||
|
"The GMM has done a reasonably good job of separating the different classes. Setosa is perfectly separated in one cluster, while there remains some mixing between versicolor and viginica."
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {
|
||
|
"slideshow": {
|
||
|
"slide_type": "subslide"
|
||
|
},
|
||
|
"tags": [
|
||
|
"exercise_pointer"
|
||
|
]
|
||
|
},
|
||
|
"source": [
|
||
|
"**Exercises:** *You can now complete Exercise 1 in the exercises associated with this lecture.*"
|
||
|
]
|
||
|
}
|
||
|
],
|
||
|
"metadata": {
|
||
|
"celltoolbar": "Slideshow",
|
||
|
"kernelspec": {
|
||
|
"display_name": "Python 3 (ipykernel)",
|
||
|
"language": "python",
|
||
|
"name": "python3"
|
||
|
},
|
||
|
"language_info": {
|
||
|
"codemirror_mode": {
|
||
|
"name": "ipython",
|
||
|
"version": 3
|
||
|
},
|
||
|
"file_extension": ".py",
|
||
|
"mimetype": "text/x-python",
|
||
|
"name": "python",
|
||
|
"nbconvert_exporter": "python",
|
||
|
"pygments_lexer": "ipython3",
|
||
|
"version": "3.13.1"
|
||
|
}
|
||
|
},
|
||
|
"nbformat": 4,
|
||
|
"nbformat_minor": 4
|
||
|
}
|