7205 lines
175 KiB
Plaintext
7205 lines
175 KiB
Plaintext
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "slide"
|
|
}
|
|
},
|
|
"source": [
|
|
"# Lecture 2: Data wrangling with Pandas "
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "skip"
|
|
}
|
|
},
|
|
"source": [
|
|
"data:image/s3,"s3://crabby-images/8691a/8691ad7e08fdc24de3bea5cea2b2da59630fa836" alt=""\n",
|
|
"[Run in colab](https://colab.research.google.com/drive/1L7sAw22PfopC1z8ANRdgFCnJBXiZIdtI)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 1,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:14.768141Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:14.767528Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:14.775602Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:14.775063Z"
|
|
},
|
|
"slideshow": {
|
|
"slide_type": "skip"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Last executed: 2024-01-10 00:13:14\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"import datetime\n",
|
|
"now = datetime.datetime.now()\n",
|
|
"print(\"Last executed: \" + now.strftime(\"%Y-%m-%d %H:%M:%S\"))"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "slide"
|
|
}
|
|
},
|
|
"source": [
|
|
"## Why Pandas?\n",
|
|
"\n",
|
|
"[Pandas](https://pandas.pydata.org/) is a very useful package for data wrangling.\n",
|
|
"\n",
|
|
"Particularly useful when working with real data, which can be messy."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "fragment"
|
|
}
|
|
},
|
|
"source": [
|
|
"Combines advantages of a number of different data structures (NumPy arrays, dictionaries, relational databases).\n",
|
|
"\n",
|
|
"Can also be more efficient than native Python data structures for certain operators (as we will see)."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"Particularly useful for dealing with:\n",
|
|
"- Labelled data\n",
|
|
"- Missing data\n",
|
|
"- Heteterogenous types\n",
|
|
"- Groupings"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "fragment"
|
|
}
|
|
},
|
|
"source": [
|
|
"We will focus mostly on Pandas `Series` and `DataFrame` objects."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"### Import Pandas"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 2,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:14.816280Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:14.815818Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.132633Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.131910Z"
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"import pandas as pd\n",
|
|
"import numpy as np"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"### Documentation\n",
|
|
"\n",
|
|
"Recall can check documentation with `pd?`, `pd.<TAB>`, and/or print documentation for specific function with `print(pd.<function_name>.__doc__)`."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 3,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.136072Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.135754Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.139939Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.139326Z"
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"#pd?"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 4,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.142929Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.142560Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.145430Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.144796Z"
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"#pd."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 5,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.148415Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.147981Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.152029Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.151417Z"
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"#pd.concat?"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 6,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.154836Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.154383Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.158458Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.157831Z"
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"#print(pd.concat.__doc__)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "slide"
|
|
}
|
|
},
|
|
"source": [
|
|
"## Pandas `Series`\n",
|
|
"\n",
|
|
"A Pandas `Series` is a *1D* array of *indexed* data. "
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "fragment"
|
|
}
|
|
},
|
|
"source": [
|
|
"Can be created from a list or array:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 7,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.161285Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.161073Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.171895Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.171212Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"0 0.25\n",
|
|
"1 0.50\n",
|
|
"2 0.75\n",
|
|
"3 1.00\n",
|
|
"dtype: float64"
|
|
]
|
|
},
|
|
"execution_count": 7,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"data = pd.Series([0.25, 0.5, 0.75, 1.0])\n",
|
|
"data"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"The `Series` wraps both a sequence of *values* and a sequence of *indices*, which we can access with the `values` and `index` attributes."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 8,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.174915Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.174556Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.181701Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.181103Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"0 0.25\n",
|
|
"1 0.50\n",
|
|
"2 0.75\n",
|
|
"3 1.00\n",
|
|
"dtype: float64"
|
|
]
|
|
},
|
|
"execution_count": 8,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"data"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 9,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.184468Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.184105Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.188580Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.187941Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"array([0.25, 0.5 , 0.75, 1. ])"
|
|
]
|
|
},
|
|
"execution_count": 9,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"data.values"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 10,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.192321Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.191893Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.196459Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.195837Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"RangeIndex(start=0, stop=4, step=1)"
|
|
]
|
|
},
|
|
"execution_count": 10,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"data.index"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"### `Series` as generalized NumPy array \n",
|
|
"\n",
|
|
"Values are simply NumPy array.\n",
|
|
"\n",
|
|
"Index need not be an integer, but can consist of values of any desired type. "
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 11,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.199711Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.199076Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.204940Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.204319Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"a 0.25\n",
|
|
"b 0.50\n",
|
|
"c 0.75\n",
|
|
"d 1.00\n",
|
|
"dtype: float64"
|
|
]
|
|
},
|
|
"execution_count": 11,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"data = pd.Series([0.25, 0.5, 0.75, 1.0],\n",
|
|
" index=['a', 'b', 'c', 'd'])\n",
|
|
"data"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 12,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.207914Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.207336Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.211637Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.210997Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"0.5"
|
|
]
|
|
},
|
|
"execution_count": 12,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"data['b']"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"### `Series` as specialized dictionary\n",
|
|
"\n",
|
|
"Can also think of a Pandas `Series` like a specialization of a Python dictionary."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 13,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.214449Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.214085Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.217795Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.217150Z"
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"population_dict = {'California': 38332521,\n",
|
|
" 'Texas': 26448193,\n",
|
|
" 'New York': 19651127,\n",
|
|
" 'Florida': 19552860,\n",
|
|
" 'Illinois': 12882135}\n",
|
|
"population = pd.Series(population_dict) # Instantiate from dictionary"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 14,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.220887Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.220288Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.224757Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.224118Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"(dict, pandas.core.series.Series)"
|
|
]
|
|
},
|
|
"execution_count": 14,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"type(population_dict), type(population)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 15,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.228122Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.227555Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.231762Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.231241Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"38332521"
|
|
]
|
|
},
|
|
"execution_count": 15,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"population['California']"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"- Python dictionary: maps *arbitrary* keys to *arbitrary* values.\n",
|
|
"- Pandas `Series`: maps *typed* indices to *typed* values."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "fragment"
|
|
}
|
|
},
|
|
"source": [
|
|
"Type information of Pandas `Series` makes it much more efficient than Python dictionaries for certain operations."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "slide"
|
|
}
|
|
},
|
|
"source": [
|
|
"## Pandas `DataFrame`\n",
|
|
"\n",
|
|
"`DataFrame` can be thought of as a sequence of aligned `Series` objects, with *indices* and *columns*."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 16,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.234800Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.234218Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.242493Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.241954Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>foo</th>\n",
|
|
" <th>bar</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>a</th>\n",
|
|
" <td>0.150862</td>\n",
|
|
" <td>0.234217</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>b</th>\n",
|
|
" <td>0.667193</td>\n",
|
|
" <td>0.961023</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>c</th>\n",
|
|
" <td>0.030670</td>\n",
|
|
" <td>0.910813</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" foo bar\n",
|
|
"a 0.150862 0.234217\n",
|
|
"b 0.667193 0.961023\n",
|
|
"c 0.030670 0.910813"
|
|
]
|
|
},
|
|
"execution_count": 16,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"pd.DataFrame(np.random.rand(3, 2),\n",
|
|
" columns=['foo', 'bar'],\n",
|
|
" index=['a', 'b', 'c'])"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"### `DataFrame` as generalized NumPy array\n",
|
|
"\n",
|
|
"`DataFrame` is an analog of a two-dimensional array with both flexible row indices and flexible column names."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "fragment"
|
|
}
|
|
},
|
|
"source": [
|
|
"Contstruct another `Series` with same indices."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 17,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.245372Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.245019Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.248490Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.247931Z"
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,\n",
|
|
" 'Florida': 170312, 'Illinois': 149995}\n",
|
|
"area = pd.Series(area_dict)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"Combine two `Series` into a `DataFrame`."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 18,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.251243Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.250893Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.257216Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.256698Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>population</th>\n",
|
|
" <th>area</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>California</th>\n",
|
|
" <td>38332521</td>\n",
|
|
" <td>423967</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>Texas</th>\n",
|
|
" <td>26448193</td>\n",
|
|
" <td>695662</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>New York</th>\n",
|
|
" <td>19651127</td>\n",
|
|
" <td>141297</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>Florida</th>\n",
|
|
" <td>19552860</td>\n",
|
|
" <td>170312</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>Illinois</th>\n",
|
|
" <td>12882135</td>\n",
|
|
" <td>149995</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" population area\n",
|
|
"California 38332521 423967\n",
|
|
"Texas 26448193 695662\n",
|
|
"New York 19651127 141297\n",
|
|
"Florida 19552860 170312\n",
|
|
"Illinois 12882135 149995"
|
|
]
|
|
},
|
|
"execution_count": 18,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"states = pd.DataFrame({'population': population,\n",
|
|
" 'area': area})\n",
|
|
"states"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"`DataFrame` has both `index` and `column` attributes."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 19,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.260053Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.259693Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.263625Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.263113Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')"
|
|
]
|
|
},
|
|
"execution_count": 19,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"states.index"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 20,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.266338Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.265998Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.269838Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.269330Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"Index(['population', 'area'], dtype='object')"
|
|
]
|
|
},
|
|
"execution_count": 20,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"states.columns"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"### `DataFrame` as specialized dictionary\n",
|
|
"\n",
|
|
"Can also think of a Pandas `DataFrame` like a specialization of a Python dictionary."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "fragment"
|
|
}
|
|
},
|
|
"source": [
|
|
"`DataFrame` maps a column name to a `Series`."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 21,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.272676Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.272320Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.276748Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.276239Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"California 423967\n",
|
|
"Texas 695662\n",
|
|
"New York 141297\n",
|
|
"Florida 170312\n",
|
|
"Illinois 149995\n",
|
|
"Name: area, dtype: int64"
|
|
]
|
|
},
|
|
"execution_count": 21,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"states['area']"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 22,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.279494Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.279148Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.282910Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.282377Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"pandas.core.series.Series"
|
|
]
|
|
},
|
|
"execution_count": 22,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"type(states['area'])"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "slide"
|
|
}
|
|
},
|
|
"source": [
|
|
"## Pandas `Index`\n",
|
|
"\n",
|
|
"Both Pandas `Series` and `DataFrame` contain `Index` object(s).\n",
|
|
"\n",
|
|
"Can be thought of as *immutable array* (i.e. cannot be changed) or *ordered multi-set* (may contain repeated values)."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 23,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.285632Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.285291Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.289529Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.289021Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"Index([2, 3, 5, 7, 11], dtype='int64')"
|
|
]
|
|
},
|
|
"execution_count": 23,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"ind = pd.Index([2, 3, 5, 7, 11])\n",
|
|
"ind "
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"### `Index` as immutable array\n",
|
|
"\n",
|
|
"Immutability makes it safer to share indices between multiple `DataFrames`."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 24,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.292351Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.291994Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.295778Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.295270Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"3"
|
|
]
|
|
},
|
|
"execution_count": 24,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"ind[1]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 25,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.298395Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.298061Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.300547Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.300030Z"
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"#ind[1] = 0"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"### `Index` as ordered multi-set\n",
|
|
"\n",
|
|
"`Index` objects support many set operations, e.g. joins, unions, intersections, differences."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"### Example: Compute the intersection and union of the following two `Index` objects."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 26,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.303350Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.303007Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.306205Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.305664Z"
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"indA = pd.Index([1, 3, 5, 7, 9])\n",
|
|
"indB = pd.Index([2, 3, 5, 7, 11]) "
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 27,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.308868Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.308513Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.312952Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.312424Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"Index([3, 5, 7], dtype='int64')"
|
|
]
|
|
},
|
|
"execution_count": 27,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"indA.intersection(indB) # intersection"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 28,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.315736Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.315378Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.319554Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.319039Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')"
|
|
]
|
|
},
|
|
"execution_count": 28,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"indA.union(indB) # union"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "slide"
|
|
}
|
|
},
|
|
"source": [
|
|
"## Data indexing and selection"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"### Data selection in a `Series`"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"In additional to acting like a dictionary, a `Series` also provies array-style selection like NumPy arrays."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 29,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.322458Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.322095Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.327409Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.326886Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"a 0.25\n",
|
|
"b 0.50\n",
|
|
"c 0.75\n",
|
|
"d 1.00\n",
|
|
"dtype: float64"
|
|
]
|
|
},
|
|
"execution_count": 29,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"data = pd.Series([0.25, 0.5, 0.75, 1.0],\n",
|
|
" index=['a', 'b', 'c', 'd'])\n",
|
|
"data"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 30,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.330109Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.329744Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.334772Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.334235Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"a 0.25\n",
|
|
"b 0.50\n",
|
|
"c 0.75\n",
|
|
"dtype: float64"
|
|
]
|
|
},
|
|
"execution_count": 30,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"# slicing by explicit index\n",
|
|
"data['a':'c']"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 31,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.337529Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.337173Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.341776Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.341267Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"a 0.25\n",
|
|
"b 0.50\n",
|
|
"dtype: float64"
|
|
]
|
|
},
|
|
"execution_count": 31,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"# slicing by implicit integer index\n",
|
|
"data[0:2]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "fragment"
|
|
}
|
|
},
|
|
"source": [
|
|
"When slicing by an explicit index (e.g. `data['a':'c']`), the final index *is* included.\n",
|
|
"\n",
|
|
"When slicing by an implicit index (e.g. `data[0:2]`), the final index *is not* included.\n",
|
|
"\n",
|
|
"This can be a source of much confusion."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"Consider a Series with integer indices."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 32,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.344695Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.344351Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.349397Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.348877Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"1 a\n",
|
|
"3 b\n",
|
|
"5 c\n",
|
|
"dtype: object"
|
|
]
|
|
},
|
|
"execution_count": 32,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])\n",
|
|
"data"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 33,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.352122Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.351783Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.355483Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.354953Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"'a'"
|
|
]
|
|
},
|
|
"execution_count": 33,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"# explicit index when indexing\n",
|
|
"data[1]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 34,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.358468Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.358085Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.362748Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.362199Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"3 b\n",
|
|
"5 c\n",
|
|
"dtype: object"
|
|
]
|
|
},
|
|
"execution_count": 34,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"# implicit index when slicing\n",
|
|
"data[1:3]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"#### Indexers\n",
|
|
"\n",
|
|
"Indexers `loc` (explicit) and `iloc` (implicit) are introduced to avoid confusion."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 35,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.365530Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.365179Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.369614Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.369097Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"1 a\n",
|
|
"3 b\n",
|
|
"5 c\n",
|
|
"dtype: object"
|
|
]
|
|
},
|
|
"execution_count": 35,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"data"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 36,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.372452Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.372086Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.375935Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.375418Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"'a'"
|
|
]
|
|
},
|
|
"execution_count": 36,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"data.loc[1]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 37,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.378492Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.378208Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.382914Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.382378Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"1 a\n",
|
|
"3 b\n",
|
|
"dtype: object"
|
|
]
|
|
},
|
|
"execution_count": 37,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"data.loc[1:3]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 38,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.385667Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.385325Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.389111Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.388596Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"'b'"
|
|
]
|
|
},
|
|
"execution_count": 38,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"data.iloc[1]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 39,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.391891Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.391527Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.396244Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.395728Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"3 b\n",
|
|
"5 c\n",
|
|
"dtype: object"
|
|
]
|
|
},
|
|
"execution_count": 39,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"data.iloc[1:3]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"### Data selection in a `DataFrame`"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"In additional to acting like a dictionary of `Series` objects with the same index, a `DataFrame` also provies array-style selection like NumPy arrays."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 40,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.399127Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.398770Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.406757Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.406203Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>area</th>\n",
|
|
" <th>population</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>California</th>\n",
|
|
" <td>423967</td>\n",
|
|
" <td>38332521</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>Texas</th>\n",
|
|
" <td>695662</td>\n",
|
|
" <td>26448193</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>New York</th>\n",
|
|
" <td>141297</td>\n",
|
|
" <td>19651127</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>Florida</th>\n",
|
|
" <td>170312</td>\n",
|
|
" <td>19552860</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>Illinois</th>\n",
|
|
" <td>149995</td>\n",
|
|
" <td>12882135</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" area population\n",
|
|
"California 423967 38332521\n",
|
|
"Texas 695662 26448193\n",
|
|
"New York 141297 19651127\n",
|
|
"Florida 170312 19552860\n",
|
|
"Illinois 149995 12882135"
|
|
]
|
|
},
|
|
"execution_count": 40,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"area = pd.Series({'California': 423967, 'Texas': 695662,\n",
|
|
" 'New York': 141297, 'Florida': 170312,\n",
|
|
" 'Illinois': 149995})\n",
|
|
"pop = pd.Series({'California': 38332521, 'Texas': 26448193,\n",
|
|
" 'New York': 19651127, 'Florida': 19552860,\n",
|
|
" 'Illinois': 12882135})\n",
|
|
"data = pd.DataFrame({'area':area, 'population':pop})\n",
|
|
"data"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"#### Indexers\n",
|
|
"\n",
|
|
"Indexers `loc` (explicit) and `iloc` (implicit) are also available to avoid confusion when selecting data.\n",
|
|
"\n",
|
|
"Note that index and column labels are preserved in the result."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 41,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.409634Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.409286Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.414919Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.414384Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>area</th>\n",
|
|
" <th>population</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>California</th>\n",
|
|
" <td>423967</td>\n",
|
|
" <td>38332521</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>Texas</th>\n",
|
|
" <td>695662</td>\n",
|
|
" <td>26448193</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>New York</th>\n",
|
|
" <td>141297</td>\n",
|
|
" <td>19651127</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>Florida</th>\n",
|
|
" <td>170312</td>\n",
|
|
" <td>19552860</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>Illinois</th>\n",
|
|
" <td>149995</td>\n",
|
|
" <td>12882135</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" area population\n",
|
|
"California 423967 38332521\n",
|
|
"Texas 695662 26448193\n",
|
|
"New York 141297 19651127\n",
|
|
"Florida 170312 19552860\n",
|
|
"Illinois 149995 12882135"
|
|
]
|
|
},
|
|
"execution_count": 41,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"data"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 42,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.417710Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.417363Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.423392Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.422870Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>area</th>\n",
|
|
" <th>population</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>California</th>\n",
|
|
" <td>423967</td>\n",
|
|
" <td>38332521</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>Texas</th>\n",
|
|
" <td>695662</td>\n",
|
|
" <td>26448193</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>New York</th>\n",
|
|
" <td>141297</td>\n",
|
|
" <td>19651127</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" area population\n",
|
|
"California 423967 38332521\n",
|
|
"Texas 695662 26448193\n",
|
|
"New York 141297 19651127"
|
|
]
|
|
},
|
|
"execution_count": 42,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"data.iloc[:3, :2]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 43,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.426169Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.425803Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.431850Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.431323Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>area</th>\n",
|
|
" <th>population</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>California</th>\n",
|
|
" <td>423967</td>\n",
|
|
" <td>38332521</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>Texas</th>\n",
|
|
" <td>695662</td>\n",
|
|
" <td>26448193</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>New York</th>\n",
|
|
" <td>141297</td>\n",
|
|
" <td>19651127</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>Florida</th>\n",
|
|
" <td>170312</td>\n",
|
|
" <td>19552860</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>Illinois</th>\n",
|
|
" <td>149995</td>\n",
|
|
" <td>12882135</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" area population\n",
|
|
"California 423967 38332521\n",
|
|
"Texas 695662 26448193\n",
|
|
"New York 141297 19651127\n",
|
|
"Florida 170312 19552860\n",
|
|
"Illinois 149995 12882135"
|
|
]
|
|
},
|
|
"execution_count": 43,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"data.loc[:'Illinois', :'population']"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"### Additional array-style selection \n",
|
|
"\n",
|
|
"Other NumPy selection approaches can also be applied (e.g. masking)."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
},
|
|
"tags": [
|
|
"exercise_pointer"
|
|
]
|
|
},
|
|
"source": [
|
|
"**Exercises:** *You can now complete Exercise 1 in the exercises associated with this lecture.*"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "slide"
|
|
}
|
|
},
|
|
"source": [
|
|
"## Operating on data in Pandas"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Elementwise operations in Pandas automatically aligns indices and preserves index/column labels.\n",
|
|
"\n",
|
|
"Can avoid many errors and bugs in data wrangling."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"### Index preservation"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 44,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.435183Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.434837Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.440045Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.439526Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"0 6\n",
|
|
"1 3\n",
|
|
"2 7\n",
|
|
"3 4\n",
|
|
"dtype: int64"
|
|
]
|
|
},
|
|
"execution_count": 44,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"rng = np.random.RandomState(42)\n",
|
|
"ser = pd.Series(rng.randint(0, 10, 4))\n",
|
|
"ser"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 45,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.442924Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.442580Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.447315Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.446795Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"0 403.428793\n",
|
|
"1 20.085537\n",
|
|
"2 1096.633158\n",
|
|
"3 54.598150\n",
|
|
"dtype: float64"
|
|
]
|
|
},
|
|
"execution_count": 45,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"np.exp(ser)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 46,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.450058Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.449718Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.456731Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.456205Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>A</th>\n",
|
|
" <th>B</th>\n",
|
|
" <th>C</th>\n",
|
|
" <th>D</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>6</td>\n",
|
|
" <td>9</td>\n",
|
|
" <td>2</td>\n",
|
|
" <td>6</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>7</td>\n",
|
|
" <td>4</td>\n",
|
|
" <td>3</td>\n",
|
|
" <td>7</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>7</td>\n",
|
|
" <td>2</td>\n",
|
|
" <td>5</td>\n",
|
|
" <td>4</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" A B C D\n",
|
|
"0 6 9 2 6\n",
|
|
"1 7 4 3 7\n",
|
|
"2 7 2 5 4"
|
|
]
|
|
},
|
|
"execution_count": 46,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"df = pd.DataFrame(rng.randint(0, 10, (3, 4)),\n",
|
|
" columns=['A', 'B', 'C', 'D'])\n",
|
|
"df"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 47,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.459559Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.459213Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.466329Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.465808Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>A</th>\n",
|
|
" <th>B</th>\n",
|
|
" <th>C</th>\n",
|
|
" <th>D</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>403.428793</td>\n",
|
|
" <td>8103.083928</td>\n",
|
|
" <td>7.389056</td>\n",
|
|
" <td>403.428793</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>1096.633158</td>\n",
|
|
" <td>54.598150</td>\n",
|
|
" <td>20.085537</td>\n",
|
|
" <td>1096.633158</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>1096.633158</td>\n",
|
|
" <td>7.389056</td>\n",
|
|
" <td>148.413159</td>\n",
|
|
" <td>54.598150</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" A B C D\n",
|
|
"0 403.428793 8103.083928 7.389056 403.428793\n",
|
|
"1 1096.633158 54.598150 20.085537 1096.633158\n",
|
|
"2 1096.633158 7.389056 148.413159 54.598150"
|
|
]
|
|
},
|
|
"execution_count": 47,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"np.exp(df)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
},
|
|
"tags": [
|
|
"exercise_pointer"
|
|
]
|
|
},
|
|
"source": [
|
|
"**Exercises:** *You can now complete Exercise 2 in the exercises associated with this lecture.*"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"### Index alignment"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"Index alignment works similarly for `DataFrames`."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 48,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.469283Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.468937Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.475078Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.474536Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>A</th>\n",
|
|
" <th>B</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>1</td>\n",
|
|
" <td>11</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>5</td>\n",
|
|
" <td>1</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" A B\n",
|
|
"0 1 11\n",
|
|
"1 5 1"
|
|
]
|
|
},
|
|
"execution_count": 48,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"A = pd.DataFrame(rng.randint(0, 20, (2, 2)),\n",
|
|
" columns=list('AB'))\n",
|
|
"A"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 49,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.478954Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.478608Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.484998Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.484464Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>B</th>\n",
|
|
" <th>A</th>\n",
|
|
" <th>C</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>4</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>9</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>5</td>\n",
|
|
" <td>8</td>\n",
|
|
" <td>0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>9</td>\n",
|
|
" <td>2</td>\n",
|
|
" <td>6</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" B A C\n",
|
|
"0 4 0 9\n",
|
|
"1 5 8 0\n",
|
|
"2 9 2 6"
|
|
]
|
|
},
|
|
"execution_count": 49,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"B = pd.DataFrame(rng.randint(0, 10, (3, 3)),\n",
|
|
" columns=list('BAC'))\n",
|
|
"B"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 50,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.488869Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.488521Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.497638Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.497097Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>A</th>\n",
|
|
" <th>B</th>\n",
|
|
" <th>C</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>1.0</td>\n",
|
|
" <td>15.0</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>13.0</td>\n",
|
|
" <td>6.0</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" A B C\n",
|
|
"0 1.0 15.0 NaN\n",
|
|
"1 13.0 6.0 NaN\n",
|
|
"2 NaN NaN NaN"
|
|
]
|
|
},
|
|
"execution_count": 50,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"A + B"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"### Operations between `DataFrame` and `Series` objects"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 51,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.505677Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.505241Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.512507Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.511979Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>Q</th>\n",
|
|
" <th>R</th>\n",
|
|
" <th>S</th>\n",
|
|
" <th>T</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>3</td>\n",
|
|
" <td>8</td>\n",
|
|
" <td>2</td>\n",
|
|
" <td>4</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>2</td>\n",
|
|
" <td>6</td>\n",
|
|
" <td>4</td>\n",
|
|
" <td>8</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>6</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>3</td>\n",
|
|
" <td>8</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" Q R S T\n",
|
|
"0 3 8 2 4\n",
|
|
"1 2 6 4 8\n",
|
|
"2 6 1 3 8"
|
|
]
|
|
},
|
|
"execution_count": 51,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"A = rng.randint(10, size=(3, 4))\n",
|
|
"df = pd.DataFrame(A, columns=list('QRST'))\n",
|
|
"df"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 52,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.515360Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.515014Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.519595Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.519059Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"Q 3\n",
|
|
"R 8\n",
|
|
"S 2\n",
|
|
"T 4\n",
|
|
"Name: 0, dtype: int64"
|
|
]
|
|
},
|
|
"execution_count": 52,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"s = df.iloc[0]\n",
|
|
"s"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"Difference between the `DataFrame` and `Series`:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 53,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.522471Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.522110Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.528586Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.528055Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>Q</th>\n",
|
|
" <th>R</th>\n",
|
|
" <th>S</th>\n",
|
|
" <th>T</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>-1</td>\n",
|
|
" <td>-2</td>\n",
|
|
" <td>2</td>\n",
|
|
" <td>4</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>3</td>\n",
|
|
" <td>-7</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>4</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" Q R S T\n",
|
|
"0 0 0 0 0\n",
|
|
"1 -1 -2 2 4\n",
|
|
"2 3 -7 1 4"
|
|
]
|
|
},
|
|
"execution_count": 53,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"df - s"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Convention is to operate row-wise."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"Can also operate column-wise using object methods."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 54,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.532195Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.531844Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.538629Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.538075Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>Q</th>\n",
|
|
" <th>R</th>\n",
|
|
" <th>S</th>\n",
|
|
" <th>T</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>-5</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>-6</td>\n",
|
|
" <td>-4</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>-4</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>-2</td>\n",
|
|
" <td>2</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>5</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>2</td>\n",
|
|
" <td>7</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" Q R S T\n",
|
|
"0 -5 0 -6 -4\n",
|
|
"1 -4 0 -2 2\n",
|
|
"2 5 0 2 7"
|
|
]
|
|
},
|
|
"execution_count": 54,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"df.subtract(df['R'], axis=0)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "slide"
|
|
}
|
|
},
|
|
"source": [
|
|
"## Handling missing data"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Real data is messy. Often some data are missing."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"Various conventions can be considered to handle missing data.\n",
|
|
"\n",
|
|
"We will focus on the use of the floating point IEEE value NaN (not a number) to represent missing data."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "fragment"
|
|
}
|
|
},
|
|
"source": [
|
|
"Pandas interprets NaN as Null values.\n",
|
|
"\n",
|
|
"(Pandas also supports `None` but we will focus on NaN here.)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"Arithematic operations with NaN values result in NaN."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 55,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.542302Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.541956Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.545706Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.545196Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"nan"
|
|
]
|
|
},
|
|
"execution_count": 55,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"1 + np.nan"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"### Operating on Null values"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Several useful methods exist to work with NaNs, for example to detect, drop or replace:\n",
|
|
"- `isnull()`: Generate a boolean mask indicating missing values.\n",
|
|
"- `notnull()`: Opposite of `isnull()`.\n",
|
|
"- `dropna()`: Return a filtered version of the data.\n",
|
|
"- `fillna()`: Return a copy of the data with missing values filled."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"### Detecting null values\n",
|
|
"\n",
|
|
"Pandas `isnull` and `notnull` are useful for detecting null values."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
},
|
|
"tags": [
|
|
"exercise_pointer"
|
|
]
|
|
},
|
|
"source": [
|
|
"**Exercises:** *You can now complete Exercise 3 in the exercises associated with this lecture.*"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"## Dropping null values\n",
|
|
"\n",
|
|
"Direct routines may be used to drop null values (i.e. `dropna`), rather than constructing masks as performed above."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
},
|
|
"tags": [
|
|
"exercise_pointer"
|
|
]
|
|
},
|
|
"source": [
|
|
"**Exercises:** *You can now complete Exercise 4 in the exercises associated with this lecture.*"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"### Dropping null values from `DataFrames`\n",
|
|
"\n",
|
|
"For `DataFrames`, there are multiple ways null values can be dropped."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 56,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.549490Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.549132Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.556771Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.556194Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>0</th>\n",
|
|
" <th>1</th>\n",
|
|
" <th>2</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>1.0</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>2</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>2.0</td>\n",
|
|
" <td>3.0</td>\n",
|
|
" <td>5</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>4.0</td>\n",
|
|
" <td>6</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" 0 1 2\n",
|
|
"0 1.0 NaN 2\n",
|
|
"1 2.0 3.0 5\n",
|
|
"2 NaN 4.0 6"
|
|
]
|
|
},
|
|
"execution_count": 56,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"df = pd.DataFrame([[1, np.nan, 2],\n",
|
|
" [2, 3, 5],\n",
|
|
" [np.nan, 4, 6]])\n",
|
|
"df"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "fragment"
|
|
}
|
|
},
|
|
"source": [
|
|
"By default `dropna` operates row-wise and drops all rows that contain any NaNs."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 57,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.559929Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.559542Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.566641Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.566094Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>0</th>\n",
|
|
" <th>1</th>\n",
|
|
" <th>2</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>2.0</td>\n",
|
|
" <td>3.0</td>\n",
|
|
" <td>5</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" 0 1 2\n",
|
|
"1 2.0 3.0 5"
|
|
]
|
|
},
|
|
"execution_count": 57,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"df.dropna()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"Can also operate column-wise."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 58,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.569601Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.569257Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.575786Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.575259Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>0</th>\n",
|
|
" <th>1</th>\n",
|
|
" <th>2</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>1.0</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>2</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>2.0</td>\n",
|
|
" <td>3.0</td>\n",
|
|
" <td>5</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>4.0</td>\n",
|
|
" <td>6</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" 0 1 2\n",
|
|
"0 1.0 NaN 2\n",
|
|
"1 2.0 3.0 5\n",
|
|
"2 NaN 4.0 6"
|
|
]
|
|
},
|
|
"execution_count": 58,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"df"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 59,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.578790Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.578414Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.584880Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.584357Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>2</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>2</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>5</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>6</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" 2\n",
|
|
"0 2\n",
|
|
"1 5\n",
|
|
"2 6"
|
|
]
|
|
},
|
|
"execution_count": 59,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"df.dropna(axis='columns')"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "fragment"
|
|
}
|
|
},
|
|
"source": [
|
|
"More sophisticated approaches can also be considered (e.g. only dropping rows/columns if all entries or a certain number of NaNs appear)."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"### Replacing null values\n",
|
|
"\n",
|
|
"Null values can be easily replaced using `fillna`."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 60,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.588227Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.587864Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.594332Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.593819Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>0</th>\n",
|
|
" <th>1</th>\n",
|
|
" <th>2</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>1.0</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>2</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>2.0</td>\n",
|
|
" <td>3.0</td>\n",
|
|
" <td>5</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>4.0</td>\n",
|
|
" <td>6</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" 0 1 2\n",
|
|
"0 1.0 NaN 2\n",
|
|
"1 2.0 3.0 5\n",
|
|
"2 NaN 4.0 6"
|
|
]
|
|
},
|
|
"execution_count": 60,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"df"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 61,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.597648Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.597298Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.604154Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.603634Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>0</th>\n",
|
|
" <th>1</th>\n",
|
|
" <th>2</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>1.0</td>\n",
|
|
" <td>0.0</td>\n",
|
|
" <td>2</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>2.0</td>\n",
|
|
" <td>3.0</td>\n",
|
|
" <td>5</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>0.0</td>\n",
|
|
" <td>4.0</td>\n",
|
|
" <td>6</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" 0 1 2\n",
|
|
"0 1.0 0.0 2\n",
|
|
"1 2.0 3.0 5\n",
|
|
"2 0.0 4.0 6"
|
|
]
|
|
},
|
|
"execution_count": 61,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"df.fillna(0.0)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"Can also fill using adjacent values."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 62,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.607243Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.606901Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.613298Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.612778Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>0</th>\n",
|
|
" <th>1</th>\n",
|
|
" <th>2</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>1.0</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>2</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>2.0</td>\n",
|
|
" <td>3.0</td>\n",
|
|
" <td>5</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>4.0</td>\n",
|
|
" <td>6</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" 0 1 2\n",
|
|
"0 1.0 NaN 2\n",
|
|
"1 2.0 3.0 5\n",
|
|
"2 NaN 4.0 6"
|
|
]
|
|
},
|
|
"execution_count": 62,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"df"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 63,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.615998Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.615699Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.623380Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.622863Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>0</th>\n",
|
|
" <th>1</th>\n",
|
|
" <th>2</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>1.0</td>\n",
|
|
" <td>1.0</td>\n",
|
|
" <td>2.0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>2.0</td>\n",
|
|
" <td>3.0</td>\n",
|
|
" <td>5.0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>4.0</td>\n",
|
|
" <td>6.0</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" 0 1 2\n",
|
|
"0 1.0 1.0 2.0\n",
|
|
"1 2.0 3.0 5.0\n",
|
|
"2 NaN 4.0 6.0"
|
|
]
|
|
},
|
|
"execution_count": 63,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"df.fillna(method='ffill', axis='columns')"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 64,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.626132Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.625790Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.632720Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.632070Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>0</th>\n",
|
|
" <th>1</th>\n",
|
|
" <th>2</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>1.0</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>2</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>2.0</td>\n",
|
|
" <td>3.0</td>\n",
|
|
" <td>5</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>2.0</td>\n",
|
|
" <td>4.0</td>\n",
|
|
" <td>6</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" 0 1 2\n",
|
|
"0 1.0 NaN 2\n",
|
|
"1 2.0 3.0 5\n",
|
|
"2 2.0 4.0 6"
|
|
]
|
|
},
|
|
"execution_count": 64,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"df.fillna(method='ffill', axis='rows')"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 65,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.635515Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.635170Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.642851Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.642316Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>0</th>\n",
|
|
" <th>1</th>\n",
|
|
" <th>2</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>1.0</td>\n",
|
|
" <td>2.0</td>\n",
|
|
" <td>2.0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>2.0</td>\n",
|
|
" <td>3.0</td>\n",
|
|
" <td>5.0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>4.0</td>\n",
|
|
" <td>4.0</td>\n",
|
|
" <td>6.0</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" 0 1 2\n",
|
|
"0 1.0 2.0 2.0\n",
|
|
"1 2.0 3.0 5.0\n",
|
|
"2 4.0 4.0 6.0"
|
|
]
|
|
},
|
|
"execution_count": 65,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"df.fillna(method='bfill', axis='columns')"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "slide"
|
|
}
|
|
},
|
|
"source": [
|
|
"## Combining data-sets"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"### Define helper functions"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 66,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.645704Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.645362Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.653232Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.652717Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>A</th>\n",
|
|
" <th>B</th>\n",
|
|
" <th>C</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>A0</td>\n",
|
|
" <td>B0</td>\n",
|
|
" <td>C0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>A1</td>\n",
|
|
" <td>B1</td>\n",
|
|
" <td>C1</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>A2</td>\n",
|
|
" <td>B2</td>\n",
|
|
" <td>C2</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" A B C\n",
|
|
"0 A0 B0 C0\n",
|
|
"1 A1 B1 C1\n",
|
|
"2 A2 B2 C2"
|
|
]
|
|
},
|
|
"execution_count": 66,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"def make_df(cols, ind):\n",
|
|
" \"\"\"Quickly make a DataFrame\"\"\"\n",
|
|
" data = {c: [str(c) + str(i) for i in ind]\n",
|
|
" for c in cols}\n",
|
|
" return pd.DataFrame(data, ind)\n",
|
|
"\n",
|
|
"# example DataFrame\n",
|
|
"make_df('ABC', range(3))"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 67,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.655999Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.655658Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.658376Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.657849Z"
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"from IPython.display import display"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"### Concatenation\n",
|
|
"\n",
|
|
"Can concatenate `Series` and `DataFrame` objects with `pd.concat()`."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "fragment"
|
|
}
|
|
},
|
|
"source": [
|
|
"Default is to concatenate over rows."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 68,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.661244Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.660906Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.674198Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.673678Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>A</th>\n",
|
|
" <th>B</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>A1</td>\n",
|
|
" <td>B1</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>A2</td>\n",
|
|
" <td>B2</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" A B\n",
|
|
"1 A1 B1\n",
|
|
"2 A2 B2"
|
|
]
|
|
},
|
|
"metadata": {},
|
|
"output_type": "display_data"
|
|
},
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>A</th>\n",
|
|
" <th>B</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>3</th>\n",
|
|
" <td>A3</td>\n",
|
|
" <td>B3</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>4</th>\n",
|
|
" <td>A4</td>\n",
|
|
" <td>B4</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" A B\n",
|
|
"3 A3 B3\n",
|
|
"4 A4 B4"
|
|
]
|
|
},
|
|
"metadata": {},
|
|
"output_type": "display_data"
|
|
},
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>A</th>\n",
|
|
" <th>B</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>A1</td>\n",
|
|
" <td>B1</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>A2</td>\n",
|
|
" <td>B2</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>3</th>\n",
|
|
" <td>A3</td>\n",
|
|
" <td>B3</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>4</th>\n",
|
|
" <td>A4</td>\n",
|
|
" <td>B4</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" A B\n",
|
|
"1 A1 B1\n",
|
|
"2 A2 B2\n",
|
|
"3 A3 B3\n",
|
|
"4 A4 B4"
|
|
]
|
|
},
|
|
"metadata": {},
|
|
"output_type": "display_data"
|
|
}
|
|
],
|
|
"source": [
|
|
"df1 = make_df('AB', [1, 2])\n",
|
|
"df2 = make_df('AB', [3, 4])\n",
|
|
"display(df1, df2, pd.concat([df1, df2]))"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"Can also concatenate over columns."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 69,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.677093Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.676742Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.690361Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.689828Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>A</th>\n",
|
|
" <th>B</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>A0</td>\n",
|
|
" <td>B0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>A1</td>\n",
|
|
" <td>B1</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" A B\n",
|
|
"0 A0 B0\n",
|
|
"1 A1 B1"
|
|
]
|
|
},
|
|
"metadata": {},
|
|
"output_type": "display_data"
|
|
},
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>C</th>\n",
|
|
" <th>D</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>C0</td>\n",
|
|
" <td>D0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>C1</td>\n",
|
|
" <td>D1</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" C D\n",
|
|
"0 C0 D0\n",
|
|
"1 C1 D1"
|
|
]
|
|
},
|
|
"metadata": {},
|
|
"output_type": "display_data"
|
|
},
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>A</th>\n",
|
|
" <th>B</th>\n",
|
|
" <th>C</th>\n",
|
|
" <th>D</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>A0</td>\n",
|
|
" <td>B0</td>\n",
|
|
" <td>C0</td>\n",
|
|
" <td>D0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>A1</td>\n",
|
|
" <td>B1</td>\n",
|
|
" <td>C1</td>\n",
|
|
" <td>D1</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" A B C D\n",
|
|
"0 A0 B0 C0 D0\n",
|
|
"1 A1 B1 C1 D1"
|
|
]
|
|
},
|
|
"metadata": {},
|
|
"output_type": "display_data"
|
|
}
|
|
],
|
|
"source": [
|
|
"df3 = make_df('AB', [0, 1])\n",
|
|
"df4 = make_df('CD', [0, 1])\n",
|
|
"display(df3, df4, pd.concat([df3, df4], axis='columns'))"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"#### Duplicated indices\n",
|
|
"\n",
|
|
"Can have duplicated indices."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 70,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.693461Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.693106Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.706128Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.705603Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>A</th>\n",
|
|
" <th>B</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>A0</td>\n",
|
|
" <td>B0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>A1</td>\n",
|
|
" <td>B1</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" A B\n",
|
|
"0 A0 B0\n",
|
|
"1 A1 B1"
|
|
]
|
|
},
|
|
"metadata": {},
|
|
"output_type": "display_data"
|
|
},
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>A</th>\n",
|
|
" <th>B</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>A2</td>\n",
|
|
" <td>B2</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>A3</td>\n",
|
|
" <td>B3</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" A B\n",
|
|
"0 A2 B2\n",
|
|
"1 A3 B3"
|
|
]
|
|
},
|
|
"metadata": {},
|
|
"output_type": "display_data"
|
|
},
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>A</th>\n",
|
|
" <th>B</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>A0</td>\n",
|
|
" <td>B0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>A1</td>\n",
|
|
" <td>B1</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>A2</td>\n",
|
|
" <td>B2</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>A3</td>\n",
|
|
" <td>B3</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" A B\n",
|
|
"0 A0 B0\n",
|
|
"1 A1 B1\n",
|
|
"0 A2 B2\n",
|
|
"1 A3 B3"
|
|
]
|
|
},
|
|
"metadata": {},
|
|
"output_type": "display_data"
|
|
}
|
|
],
|
|
"source": [
|
|
"x = make_df('AB', [0, 1])\n",
|
|
"y = make_df('AB', [2, 3])\n",
|
|
"y.index = x.index # make duplicate indices!\n",
|
|
"display(x, y, pd.concat([x, y]))"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"#### Ignoring index\n",
|
|
"\n",
|
|
"Can ignore index."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 71,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.708970Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.708624Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.720590Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.720060Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>A</th>\n",
|
|
" <th>B</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>A0</td>\n",
|
|
" <td>B0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>A1</td>\n",
|
|
" <td>B1</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" A B\n",
|
|
"0 A0 B0\n",
|
|
"1 A1 B1"
|
|
]
|
|
},
|
|
"metadata": {},
|
|
"output_type": "display_data"
|
|
},
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>A</th>\n",
|
|
" <th>B</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>A2</td>\n",
|
|
" <td>B2</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>A3</td>\n",
|
|
" <td>B3</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" A B\n",
|
|
"0 A2 B2\n",
|
|
"1 A3 B3"
|
|
]
|
|
},
|
|
"metadata": {},
|
|
"output_type": "display_data"
|
|
},
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>A</th>\n",
|
|
" <th>B</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>A0</td>\n",
|
|
" <td>B0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>A1</td>\n",
|
|
" <td>B1</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>A2</td>\n",
|
|
" <td>B2</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>3</th>\n",
|
|
" <td>A3</td>\n",
|
|
" <td>B3</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" A B\n",
|
|
"0 A0 B0\n",
|
|
"1 A1 B1\n",
|
|
"2 A2 B2\n",
|
|
"3 A3 B3"
|
|
]
|
|
},
|
|
"metadata": {},
|
|
"output_type": "display_data"
|
|
}
|
|
],
|
|
"source": [
|
|
"display(x, y, pd.concat([x, y], ignore_index=True))"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"#### Concantenation with joins\n",
|
|
"\n",
|
|
"Can join `DataFrames` with different column names."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 72,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.723441Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.723097Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.738457Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.737891Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>A</th>\n",
|
|
" <th>B</th>\n",
|
|
" <th>C</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>A1</td>\n",
|
|
" <td>B1</td>\n",
|
|
" <td>C1</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>A2</td>\n",
|
|
" <td>B2</td>\n",
|
|
" <td>C2</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" A B C\n",
|
|
"1 A1 B1 C1\n",
|
|
"2 A2 B2 C2"
|
|
]
|
|
},
|
|
"metadata": {},
|
|
"output_type": "display_data"
|
|
},
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>B</th>\n",
|
|
" <th>C</th>\n",
|
|
" <th>D</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>3</th>\n",
|
|
" <td>B3</td>\n",
|
|
" <td>C3</td>\n",
|
|
" <td>D3</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>4</th>\n",
|
|
" <td>B4</td>\n",
|
|
" <td>C4</td>\n",
|
|
" <td>D4</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" B C D\n",
|
|
"3 B3 C3 D3\n",
|
|
"4 B4 C4 D4"
|
|
]
|
|
},
|
|
"metadata": {},
|
|
"output_type": "display_data"
|
|
},
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>A</th>\n",
|
|
" <th>B</th>\n",
|
|
" <th>C</th>\n",
|
|
" <th>D</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>A1</td>\n",
|
|
" <td>B1</td>\n",
|
|
" <td>C1</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>A2</td>\n",
|
|
" <td>B2</td>\n",
|
|
" <td>C2</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>3</th>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>B3</td>\n",
|
|
" <td>C3</td>\n",
|
|
" <td>D3</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>4</th>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>B4</td>\n",
|
|
" <td>C4</td>\n",
|
|
" <td>D4</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" A B C D\n",
|
|
"1 A1 B1 C1 NaN\n",
|
|
"2 A2 B2 C2 NaN\n",
|
|
"3 NaN B3 C3 D3\n",
|
|
"4 NaN B4 C4 D4"
|
|
]
|
|
},
|
|
"metadata": {},
|
|
"output_type": "display_data"
|
|
}
|
|
],
|
|
"source": [
|
|
"df5 = make_df('ABC', [1, 2])\n",
|
|
"df6 = make_df('BCD', [3, 4])\n",
|
|
"display(df5, df6, pd.concat([df5, df6]))"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Entries with no data are filled with NaN.\n",
|
|
"\n",
|
|
"Default join is the *union* of the columns of the two `DataFrames`."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"Can also perform different types of joins.\n",
|
|
"\n",
|
|
"For example, the *intersection* of the columns of the two DataFrames."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 73,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.741470Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.741118Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.754499Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.753943Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>A</th>\n",
|
|
" <th>B</th>\n",
|
|
" <th>C</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>A1</td>\n",
|
|
" <td>B1</td>\n",
|
|
" <td>C1</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>A2</td>\n",
|
|
" <td>B2</td>\n",
|
|
" <td>C2</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" A B C\n",
|
|
"1 A1 B1 C1\n",
|
|
"2 A2 B2 C2"
|
|
]
|
|
},
|
|
"metadata": {},
|
|
"output_type": "display_data"
|
|
},
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>B</th>\n",
|
|
" <th>C</th>\n",
|
|
" <th>D</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>3</th>\n",
|
|
" <td>B3</td>\n",
|
|
" <td>C3</td>\n",
|
|
" <td>D3</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>4</th>\n",
|
|
" <td>B4</td>\n",
|
|
" <td>C4</td>\n",
|
|
" <td>D4</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" B C D\n",
|
|
"3 B3 C3 D3\n",
|
|
"4 B4 C4 D4"
|
|
]
|
|
},
|
|
"metadata": {},
|
|
"output_type": "display_data"
|
|
},
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>B</th>\n",
|
|
" <th>C</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>B1</td>\n",
|
|
" <td>C1</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>B2</td>\n",
|
|
" <td>C2</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>3</th>\n",
|
|
" <td>B3</td>\n",
|
|
" <td>C3</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>4</th>\n",
|
|
" <td>B4</td>\n",
|
|
" <td>C4</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" B C\n",
|
|
"1 B1 C1\n",
|
|
"2 B2 C2\n",
|
|
"3 B3 C3\n",
|
|
"4 B4 C4"
|
|
]
|
|
},
|
|
"metadata": {},
|
|
"output_type": "display_data"
|
|
}
|
|
],
|
|
"source": [
|
|
"display(df5, df6, pd.concat([df5, df6], join='inner'))"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"### Relational combinations\n",
|
|
"\n",
|
|
"Pandas also provides functionality to perform relational algebra (cf. relational databases).\n",
|
|
"\n",
|
|
"Hence, Pandas data structures provide analog not only of NumPy array and dictionary, but also relational database."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "fragment"
|
|
}
|
|
},
|
|
"source": [
|
|
"Functionality provied by `pd.merge()` function."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"#### One-to-one join"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 74,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.757544Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.757195Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.772846Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.772302Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>employee</th>\n",
|
|
" <th>group</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>Bob</td>\n",
|
|
" <td>Accounting</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>Jake</td>\n",
|
|
" <td>Engineering</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>Lisa</td>\n",
|
|
" <td>Engineering</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>3</th>\n",
|
|
" <td>Sue</td>\n",
|
|
" <td>HR</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" employee group\n",
|
|
"0 Bob Accounting\n",
|
|
"1 Jake Engineering\n",
|
|
"2 Lisa Engineering\n",
|
|
"3 Sue HR"
|
|
]
|
|
},
|
|
"metadata": {},
|
|
"output_type": "display_data"
|
|
},
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>employee</th>\n",
|
|
" <th>hire_date</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>Lisa</td>\n",
|
|
" <td>2004</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>Bob</td>\n",
|
|
" <td>2008</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>Jake</td>\n",
|
|
" <td>2012</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>3</th>\n",
|
|
" <td>Sue</td>\n",
|
|
" <td>2014</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" employee hire_date\n",
|
|
"0 Lisa 2004\n",
|
|
"1 Bob 2008\n",
|
|
"2 Jake 2012\n",
|
|
"3 Sue 2014"
|
|
]
|
|
},
|
|
"metadata": {},
|
|
"output_type": "display_data"
|
|
},
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>employee</th>\n",
|
|
" <th>group</th>\n",
|
|
" <th>hire_date</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>Bob</td>\n",
|
|
" <td>Accounting</td>\n",
|
|
" <td>2008</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>Jake</td>\n",
|
|
" <td>Engineering</td>\n",
|
|
" <td>2012</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>Lisa</td>\n",
|
|
" <td>Engineering</td>\n",
|
|
" <td>2004</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>3</th>\n",
|
|
" <td>Sue</td>\n",
|
|
" <td>HR</td>\n",
|
|
" <td>2014</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" employee group hire_date\n",
|
|
"0 Bob Accounting 2008\n",
|
|
"1 Jake Engineering 2012\n",
|
|
"2 Lisa Engineering 2004\n",
|
|
"3 Sue HR 2014"
|
|
]
|
|
},
|
|
"metadata": {},
|
|
"output_type": "display_data"
|
|
}
|
|
],
|
|
"source": [
|
|
"df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],\n",
|
|
" 'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})\n",
|
|
"df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],\n",
|
|
" 'hire_date': [2004, 2008, 2012, 2014]})\n",
|
|
"df3 = pd.merge(df1, df2)\n",
|
|
"display(df1,df2,df3)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Recognises that \"employee\" column common and automatically selects as key for the relational join."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"#### Many-to-one joins"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 75,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.775906Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.775550Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.790962Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.790405Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>employee</th>\n",
|
|
" <th>group</th>\n",
|
|
" <th>hire_date</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>Bob</td>\n",
|
|
" <td>Accounting</td>\n",
|
|
" <td>2008</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>Jake</td>\n",
|
|
" <td>Engineering</td>\n",
|
|
" <td>2012</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>Lisa</td>\n",
|
|
" <td>Engineering</td>\n",
|
|
" <td>2004</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>3</th>\n",
|
|
" <td>Sue</td>\n",
|
|
" <td>HR</td>\n",
|
|
" <td>2014</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" employee group hire_date\n",
|
|
"0 Bob Accounting 2008\n",
|
|
"1 Jake Engineering 2012\n",
|
|
"2 Lisa Engineering 2004\n",
|
|
"3 Sue HR 2014"
|
|
]
|
|
},
|
|
"metadata": {},
|
|
"output_type": "display_data"
|
|
},
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>group</th>\n",
|
|
" <th>supervisor</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>Accounting</td>\n",
|
|
" <td>Carly</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>Engineering</td>\n",
|
|
" <td>Guido</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>HR</td>\n",
|
|
" <td>Steve</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" group supervisor\n",
|
|
"0 Accounting Carly\n",
|
|
"1 Engineering Guido\n",
|
|
"2 HR Steve"
|
|
]
|
|
},
|
|
"metadata": {},
|
|
"output_type": "display_data"
|
|
},
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>employee</th>\n",
|
|
" <th>group</th>\n",
|
|
" <th>hire_date</th>\n",
|
|
" <th>supervisor</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>Bob</td>\n",
|
|
" <td>Accounting</td>\n",
|
|
" <td>2008</td>\n",
|
|
" <td>Carly</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>Jake</td>\n",
|
|
" <td>Engineering</td>\n",
|
|
" <td>2012</td>\n",
|
|
" <td>Guido</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>Lisa</td>\n",
|
|
" <td>Engineering</td>\n",
|
|
" <td>2004</td>\n",
|
|
" <td>Guido</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>3</th>\n",
|
|
" <td>Sue</td>\n",
|
|
" <td>HR</td>\n",
|
|
" <td>2014</td>\n",
|
|
" <td>Steve</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" employee group hire_date supervisor\n",
|
|
"0 Bob Accounting 2008 Carly\n",
|
|
"1 Jake Engineering 2012 Guido\n",
|
|
"2 Lisa Engineering 2004 Guido\n",
|
|
"3 Sue HR 2014 Steve"
|
|
]
|
|
},
|
|
"metadata": {},
|
|
"output_type": "display_data"
|
|
}
|
|
],
|
|
"source": [
|
|
"df4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],\n",
|
|
" 'supervisor': ['Carly', 'Guido', 'Steve']})\n",
|
|
"display(df3, df4, pd.merge(df3, df4))"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"#### The `on` keyword"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 76,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.793866Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.793517Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.807076Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.806535Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>employee</th>\n",
|
|
" <th>group</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>Bob</td>\n",
|
|
" <td>Accounting</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>Jake</td>\n",
|
|
" <td>Engineering</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>Lisa</td>\n",
|
|
" <td>Engineering</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>3</th>\n",
|
|
" <td>Sue</td>\n",
|
|
" <td>HR</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" employee group\n",
|
|
"0 Bob Accounting\n",
|
|
"1 Jake Engineering\n",
|
|
"2 Lisa Engineering\n",
|
|
"3 Sue HR"
|
|
]
|
|
},
|
|
"metadata": {},
|
|
"output_type": "display_data"
|
|
},
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>employee</th>\n",
|
|
" <th>hire_date</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>Lisa</td>\n",
|
|
" <td>2004</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>Bob</td>\n",
|
|
" <td>2008</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>Jake</td>\n",
|
|
" <td>2012</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>3</th>\n",
|
|
" <td>Sue</td>\n",
|
|
" <td>2014</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" employee hire_date\n",
|
|
"0 Lisa 2004\n",
|
|
"1 Bob 2008\n",
|
|
"2 Jake 2012\n",
|
|
"3 Sue 2014"
|
|
]
|
|
},
|
|
"metadata": {},
|
|
"output_type": "display_data"
|
|
},
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>employee</th>\n",
|
|
" <th>group</th>\n",
|
|
" <th>hire_date</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>Bob</td>\n",
|
|
" <td>Accounting</td>\n",
|
|
" <td>2008</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>Jake</td>\n",
|
|
" <td>Engineering</td>\n",
|
|
" <td>2012</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>Lisa</td>\n",
|
|
" <td>Engineering</td>\n",
|
|
" <td>2004</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>3</th>\n",
|
|
" <td>Sue</td>\n",
|
|
" <td>HR</td>\n",
|
|
" <td>2014</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" employee group hire_date\n",
|
|
"0 Bob Accounting 2008\n",
|
|
"1 Jake Engineering 2012\n",
|
|
"2 Lisa Engineering 2004\n",
|
|
"3 Sue HR 2014"
|
|
]
|
|
},
|
|
"metadata": {},
|
|
"output_type": "display_data"
|
|
}
|
|
],
|
|
"source": [
|
|
"display(df1, df2, pd.merge(df1, df2, on='employee'))"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"#### The `left_on` and `right_on` keywords"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 77,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.809978Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.809622Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.823891Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.823352Z"
|
|
},
|
|
"scrolled": true
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>employee</th>\n",
|
|
" <th>group</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>Bob</td>\n",
|
|
" <td>Accounting</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>Jake</td>\n",
|
|
" <td>Engineering</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>Lisa</td>\n",
|
|
" <td>Engineering</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>3</th>\n",
|
|
" <td>Sue</td>\n",
|
|
" <td>HR</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" employee group\n",
|
|
"0 Bob Accounting\n",
|
|
"1 Jake Engineering\n",
|
|
"2 Lisa Engineering\n",
|
|
"3 Sue HR"
|
|
]
|
|
},
|
|
"metadata": {},
|
|
"output_type": "display_data"
|
|
},
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>name</th>\n",
|
|
" <th>salary</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>Bob</td>\n",
|
|
" <td>70000</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>Jake</td>\n",
|
|
" <td>80000</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>Lisa</td>\n",
|
|
" <td>120000</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>3</th>\n",
|
|
" <td>Sue</td>\n",
|
|
" <td>90000</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" name salary\n",
|
|
"0 Bob 70000\n",
|
|
"1 Jake 80000\n",
|
|
"2 Lisa 120000\n",
|
|
"3 Sue 90000"
|
|
]
|
|
},
|
|
"metadata": {},
|
|
"output_type": "display_data"
|
|
},
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>employee</th>\n",
|
|
" <th>group</th>\n",
|
|
" <th>name</th>\n",
|
|
" <th>salary</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>Bob</td>\n",
|
|
" <td>Accounting</td>\n",
|
|
" <td>Bob</td>\n",
|
|
" <td>70000</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>Jake</td>\n",
|
|
" <td>Engineering</td>\n",
|
|
" <td>Jake</td>\n",
|
|
" <td>80000</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>Lisa</td>\n",
|
|
" <td>Engineering</td>\n",
|
|
" <td>Lisa</td>\n",
|
|
" <td>120000</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>3</th>\n",
|
|
" <td>Sue</td>\n",
|
|
" <td>HR</td>\n",
|
|
" <td>Sue</td>\n",
|
|
" <td>90000</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" employee group name salary\n",
|
|
"0 Bob Accounting Bob 70000\n",
|
|
"1 Jake Engineering Jake 80000\n",
|
|
"2 Lisa Engineering Lisa 120000\n",
|
|
"3 Sue HR Sue 90000"
|
|
]
|
|
},
|
|
"metadata": {},
|
|
"output_type": "display_data"
|
|
}
|
|
],
|
|
"source": [
|
|
"df3 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],\n",
|
|
" 'salary': [70000, 80000, 120000, 90000]})\n",
|
|
"display(df1, df3, pd.merge(df1, df3, left_on=\"employee\", right_on=\"name\"))"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"Employee and name both included now, so may want to drop one."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 78,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.826815Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.826465Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.834368Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.833835Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>employee</th>\n",
|
|
" <th>group</th>\n",
|
|
" <th>name</th>\n",
|
|
" <th>salary</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>Bob</td>\n",
|
|
" <td>Accounting</td>\n",
|
|
" <td>Bob</td>\n",
|
|
" <td>70000</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>Jake</td>\n",
|
|
" <td>Engineering</td>\n",
|
|
" <td>Jake</td>\n",
|
|
" <td>80000</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>Lisa</td>\n",
|
|
" <td>Engineering</td>\n",
|
|
" <td>Lisa</td>\n",
|
|
" <td>120000</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>3</th>\n",
|
|
" <td>Sue</td>\n",
|
|
" <td>HR</td>\n",
|
|
" <td>Sue</td>\n",
|
|
" <td>90000</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" employee group name salary\n",
|
|
"0 Bob Accounting Bob 70000\n",
|
|
"1 Jake Engineering Jake 80000\n",
|
|
"2 Lisa Engineering Lisa 120000\n",
|
|
"3 Sue HR Sue 90000"
|
|
]
|
|
},
|
|
"execution_count": 78,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"pd.merge(df1, df3, left_on=\"employee\", right_on=\"name\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 79,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.837233Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.836883Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.844804Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.844272Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>employee</th>\n",
|
|
" <th>group</th>\n",
|
|
" <th>salary</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>Bob</td>\n",
|
|
" <td>Accounting</td>\n",
|
|
" <td>70000</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>Jake</td>\n",
|
|
" <td>Engineering</td>\n",
|
|
" <td>80000</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>Lisa</td>\n",
|
|
" <td>Engineering</td>\n",
|
|
" <td>120000</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>3</th>\n",
|
|
" <td>Sue</td>\n",
|
|
" <td>HR</td>\n",
|
|
" <td>90000</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" employee group salary\n",
|
|
"0 Bob Accounting 70000\n",
|
|
"1 Jake Engineering 80000\n",
|
|
"2 Lisa Engineering 120000\n",
|
|
"3 Sue HR 90000"
|
|
]
|
|
},
|
|
"execution_count": 79,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"pd.merge(df1, df3, left_on=\"employee\", right_on=\"name\").drop('name', axis='columns')"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"#### The `left_index` and `right_index` keywords\n",
|
|
"\n",
|
|
"Often one wants to join on index."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 80,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.847768Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.847396Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.856391Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.855859Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>group</th>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>employee</th>\n",
|
|
" <th></th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>Bob</th>\n",
|
|
" <td>Accounting</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>Jake</th>\n",
|
|
" <td>Engineering</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>Lisa</th>\n",
|
|
" <td>Engineering</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>Sue</th>\n",
|
|
" <td>HR</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" group\n",
|
|
"employee \n",
|
|
"Bob Accounting\n",
|
|
"Jake Engineering\n",
|
|
"Lisa Engineering\n",
|
|
"Sue HR"
|
|
]
|
|
},
|
|
"metadata": {},
|
|
"output_type": "display_data"
|
|
},
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>hire_date</th>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>employee</th>\n",
|
|
" <th></th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>Lisa</th>\n",
|
|
" <td>2004</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>Bob</th>\n",
|
|
" <td>2008</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>Jake</th>\n",
|
|
" <td>2012</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>Sue</th>\n",
|
|
" <td>2014</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" hire_date\n",
|
|
"employee \n",
|
|
"Lisa 2004\n",
|
|
"Bob 2008\n",
|
|
"Jake 2012\n",
|
|
"Sue 2014"
|
|
]
|
|
},
|
|
"metadata": {},
|
|
"output_type": "display_data"
|
|
}
|
|
],
|
|
"source": [
|
|
"df1a = df1.set_index('employee')\n",
|
|
"df2a = df2.set_index('employee')\n",
|
|
"display(df1a, df2a)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 81,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.859251Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.858901Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.866112Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.865585Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>group</th>\n",
|
|
" <th>hire_date</th>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>employee</th>\n",
|
|
" <th></th>\n",
|
|
" <th></th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>Bob</th>\n",
|
|
" <td>Accounting</td>\n",
|
|
" <td>2008</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>Jake</th>\n",
|
|
" <td>Engineering</td>\n",
|
|
" <td>2012</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>Lisa</th>\n",
|
|
" <td>Engineering</td>\n",
|
|
" <td>2004</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>Sue</th>\n",
|
|
" <td>HR</td>\n",
|
|
" <td>2014</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" group hire_date\n",
|
|
"employee \n",
|
|
"Bob Accounting 2008\n",
|
|
"Jake Engineering 2012\n",
|
|
"Lisa Engineering 2004\n",
|
|
"Sue HR 2014"
|
|
]
|
|
},
|
|
"metadata": {},
|
|
"output_type": "display_data"
|
|
}
|
|
],
|
|
"source": [
|
|
"display(pd.merge(df1a, df2a, left_index=True, right_index=True))"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"#### Set arithmetic for joins \n",
|
|
"\n",
|
|
"Have so far been considering relational joins based on *intersection* \n",
|
|
"(also called *inner* join)."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 82,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.869060Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.868709Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.885530Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.884965Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>name</th>\n",
|
|
" <th>food</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>Peter</td>\n",
|
|
" <td>fish</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>Paul</td>\n",
|
|
" <td>beans</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>Mary</td>\n",
|
|
" <td>bread</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" name food\n",
|
|
"0 Peter fish\n",
|
|
"1 Paul beans\n",
|
|
"2 Mary bread"
|
|
]
|
|
},
|
|
"metadata": {},
|
|
"output_type": "display_data"
|
|
},
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>name</th>\n",
|
|
" <th>drink</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>Mary</td>\n",
|
|
" <td>wine</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>Joseph</td>\n",
|
|
" <td>beer</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" name drink\n",
|
|
"0 Mary wine\n",
|
|
"1 Joseph beer"
|
|
]
|
|
},
|
|
"metadata": {},
|
|
"output_type": "display_data"
|
|
},
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>name</th>\n",
|
|
" <th>food</th>\n",
|
|
" <th>drink</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>Mary</td>\n",
|
|
" <td>bread</td>\n",
|
|
" <td>wine</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" name food drink\n",
|
|
"0 Mary bread wine"
|
|
]
|
|
},
|
|
"metadata": {},
|
|
"output_type": "display_data"
|
|
}
|
|
],
|
|
"source": [
|
|
"df6 = pd.DataFrame({'name': ['Peter', 'Paul', 'Mary'],\n",
|
|
" 'food': ['fish', 'beans', 'bread']},\n",
|
|
" columns=['name', 'food'])\n",
|
|
"df7 = pd.DataFrame({'name': ['Mary', 'Joseph'],\n",
|
|
" 'drink': ['wine', 'beer']},\n",
|
|
" columns=['name', 'drink'])\n",
|
|
"display(df6, df7, pd.merge(df6, df7, how='inner'))"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"##### Outer join\n",
|
|
"\n",
|
|
"Can also join based on *union* (missing entries filled with NaNs)."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 83,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.888739Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.888382Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.903223Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.902673Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>name</th>\n",
|
|
" <th>food</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>Peter</td>\n",
|
|
" <td>fish</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>Paul</td>\n",
|
|
" <td>beans</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>Mary</td>\n",
|
|
" <td>bread</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" name food\n",
|
|
"0 Peter fish\n",
|
|
"1 Paul beans\n",
|
|
"2 Mary bread"
|
|
]
|
|
},
|
|
"metadata": {},
|
|
"output_type": "display_data"
|
|
},
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>name</th>\n",
|
|
" <th>drink</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>Mary</td>\n",
|
|
" <td>wine</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>Joseph</td>\n",
|
|
" <td>beer</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" name drink\n",
|
|
"0 Mary wine\n",
|
|
"1 Joseph beer"
|
|
]
|
|
},
|
|
"metadata": {},
|
|
"output_type": "display_data"
|
|
},
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>name</th>\n",
|
|
" <th>food</th>\n",
|
|
" <th>drink</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>Peter</td>\n",
|
|
" <td>fish</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>Paul</td>\n",
|
|
" <td>beans</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>Mary</td>\n",
|
|
" <td>bread</td>\n",
|
|
" <td>wine</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>3</th>\n",
|
|
" <td>Joseph</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>beer</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" name food drink\n",
|
|
"0 Peter fish NaN\n",
|
|
"1 Paul beans NaN\n",
|
|
"2 Mary bread wine\n",
|
|
"3 Joseph NaN beer"
|
|
]
|
|
},
|
|
"metadata": {},
|
|
"output_type": "display_data"
|
|
}
|
|
],
|
|
"source": [
|
|
"display(df6, df7, pd.merge(df6, df7, how='outer'))"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"##### Left and right join\n",
|
|
"\n",
|
|
"Can also join based on *left* or *right* entries."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 84,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.906173Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.905822Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.920315Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.919750Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>name</th>\n",
|
|
" <th>food</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>Peter</td>\n",
|
|
" <td>fish</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>Paul</td>\n",
|
|
" <td>beans</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>Mary</td>\n",
|
|
" <td>bread</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" name food\n",
|
|
"0 Peter fish\n",
|
|
"1 Paul beans\n",
|
|
"2 Mary bread"
|
|
]
|
|
},
|
|
"metadata": {},
|
|
"output_type": "display_data"
|
|
},
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>name</th>\n",
|
|
" <th>drink</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>Mary</td>\n",
|
|
" <td>wine</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>Joseph</td>\n",
|
|
" <td>beer</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" name drink\n",
|
|
"0 Mary wine\n",
|
|
"1 Joseph beer"
|
|
]
|
|
},
|
|
"metadata": {},
|
|
"output_type": "display_data"
|
|
},
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>name</th>\n",
|
|
" <th>food</th>\n",
|
|
" <th>drink</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>Peter</td>\n",
|
|
" <td>fish</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>Paul</td>\n",
|
|
" <td>beans</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>Mary</td>\n",
|
|
" <td>bread</td>\n",
|
|
" <td>wine</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" name food drink\n",
|
|
"0 Peter fish NaN\n",
|
|
"1 Paul beans NaN\n",
|
|
"2 Mary bread wine"
|
|
]
|
|
},
|
|
"metadata": {},
|
|
"output_type": "display_data"
|
|
}
|
|
],
|
|
"source": [
|
|
"display(df6, df7, pd.merge(df6, df7, how='left'))"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"#### Overlapping column names\n",
|
|
"\n",
|
|
"Possible for DataFrames to have conflicting columns."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 85,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.923323Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.922959Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.933270Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.932631Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>name</th>\n",
|
|
" <th>rank</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>Bob</td>\n",
|
|
" <td>1</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>Jake</td>\n",
|
|
" <td>2</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>Lisa</td>\n",
|
|
" <td>3</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>3</th>\n",
|
|
" <td>Sue</td>\n",
|
|
" <td>4</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" name rank\n",
|
|
"0 Bob 1\n",
|
|
"1 Jake 2\n",
|
|
"2 Lisa 3\n",
|
|
"3 Sue 4"
|
|
]
|
|
},
|
|
"metadata": {},
|
|
"output_type": "display_data"
|
|
},
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>name</th>\n",
|
|
" <th>rank</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>Bob</td>\n",
|
|
" <td>3</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>Jake</td>\n",
|
|
" <td>1</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>Lisa</td>\n",
|
|
" <td>4</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>3</th>\n",
|
|
" <td>Sue</td>\n",
|
|
" <td>2</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" name rank\n",
|
|
"0 Bob 3\n",
|
|
"1 Jake 1\n",
|
|
"2 Lisa 4\n",
|
|
"3 Sue 2"
|
|
]
|
|
},
|
|
"metadata": {},
|
|
"output_type": "display_data"
|
|
}
|
|
],
|
|
"source": [
|
|
"df8 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],\n",
|
|
" 'rank': [1, 2, 3, 4]})\n",
|
|
"df9 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],\n",
|
|
" 'rank': [3, 1, 4, 2]})\n",
|
|
"display(df8, df9)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 86,
|
|
"metadata": {
|
|
"execution": {
|
|
"iopub.execute_input": "2024-01-10T00:13:15.937472Z",
|
|
"iopub.status.busy": "2024-01-10T00:13:15.936984Z",
|
|
"iopub.status.idle": "2024-01-10T00:13:15.946330Z",
|
|
"shell.execute_reply": "2024-01-10T00:13:15.945561Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>name</th>\n",
|
|
" <th>rank_L</th>\n",
|
|
" <th>rank_R</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>Bob</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>3</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>Jake</td>\n",
|
|
" <td>2</td>\n",
|
|
" <td>1</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>Lisa</td>\n",
|
|
" <td>3</td>\n",
|
|
" <td>4</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>3</th>\n",
|
|
" <td>Sue</td>\n",
|
|
" <td>4</td>\n",
|
|
" <td>2</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" name rank_L rank_R\n",
|
|
"0 Bob 1 3\n",
|
|
"1 Jake 2 1\n",
|
|
"2 Lisa 3 4\n",
|
|
"3 Sue 4 2"
|
|
]
|
|
},
|
|
"metadata": {},
|
|
"output_type": "display_data"
|
|
}
|
|
],
|
|
"source": [
|
|
"display(pd.merge(df8, df9, on=\"name\", suffixes=[\"_L\", \"_R\"]))"
|
|
]
|
|
}
|
|
],
|
|
"metadata": {
|
|
"celltoolbar": "Slideshow",
|
|
"kernelspec": {
|
|
"display_name": "Python 3 (ipykernel)",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.8.18"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 4
|
|
}
|