2025-01-24 13:21:11 +00:00

7205 lines
175 KiB
Plaintext

{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Lecture 2: Data wrangling with Pandas "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"![](https://www.tensorflow.org/images/colab_logo_32px.png)\n",
"[Run in colab](https://colab.research.google.com/drive/1L7sAw22PfopC1z8ANRdgFCnJBXiZIdtI)"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:14.768141Z",
"iopub.status.busy": "2024-01-10T00:13:14.767528Z",
"iopub.status.idle": "2024-01-10T00:13:14.775602Z",
"shell.execute_reply": "2024-01-10T00:13:14.775063Z"
},
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Last executed: 2024-01-10 00:13:14\n"
]
}
],
"source": [
"import datetime\n",
"now = datetime.datetime.now()\n",
"print(\"Last executed: \" + now.strftime(\"%Y-%m-%d %H:%M:%S\"))"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Why Pandas?\n",
"\n",
"[Pandas](https://pandas.pydata.org/) is a very useful package for data wrangling.\n",
"\n",
"Particularly useful when working with real data, which can be messy."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"Combines advantages of a number of different data structures (NumPy arrays, dictionaries, relational databases).\n",
"\n",
"Can also be more efficient than native Python data structures for certain operators (as we will see)."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"Particularly useful for dealing with:\n",
"- Labelled data\n",
"- Missing data\n",
"- Heteterogenous types\n",
"- Groupings"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"We will focus mostly on Pandas `Series` and `DataFrame` objects."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Import Pandas"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:14.816280Z",
"iopub.status.busy": "2024-01-10T00:13:14.815818Z",
"iopub.status.idle": "2024-01-10T00:13:15.132633Z",
"shell.execute_reply": "2024-01-10T00:13:15.131910Z"
}
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Documentation\n",
"\n",
"Recall can check documentation with `pd?`, `pd.<TAB>`, and/or print documentation for specific function with `print(pd.<function_name>.__doc__)`."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.136072Z",
"iopub.status.busy": "2024-01-10T00:13:15.135754Z",
"iopub.status.idle": "2024-01-10T00:13:15.139939Z",
"shell.execute_reply": "2024-01-10T00:13:15.139326Z"
}
},
"outputs": [],
"source": [
"#pd?"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.142929Z",
"iopub.status.busy": "2024-01-10T00:13:15.142560Z",
"iopub.status.idle": "2024-01-10T00:13:15.145430Z",
"shell.execute_reply": "2024-01-10T00:13:15.144796Z"
}
},
"outputs": [],
"source": [
"#pd."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.148415Z",
"iopub.status.busy": "2024-01-10T00:13:15.147981Z",
"iopub.status.idle": "2024-01-10T00:13:15.152029Z",
"shell.execute_reply": "2024-01-10T00:13:15.151417Z"
}
},
"outputs": [],
"source": [
"#pd.concat?"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.154836Z",
"iopub.status.busy": "2024-01-10T00:13:15.154383Z",
"iopub.status.idle": "2024-01-10T00:13:15.158458Z",
"shell.execute_reply": "2024-01-10T00:13:15.157831Z"
}
},
"outputs": [],
"source": [
"#print(pd.concat.__doc__)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Pandas `Series`\n",
"\n",
"A Pandas `Series` is a *1D* array of *indexed* data. "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"Can be created from a list or array:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.161285Z",
"iopub.status.busy": "2024-01-10T00:13:15.161073Z",
"iopub.status.idle": "2024-01-10T00:13:15.171895Z",
"shell.execute_reply": "2024-01-10T00:13:15.171212Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"0 0.25\n",
"1 0.50\n",
"2 0.75\n",
"3 1.00\n",
"dtype: float64"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data = pd.Series([0.25, 0.5, 0.75, 1.0])\n",
"data"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"The `Series` wraps both a sequence of *values* and a sequence of *indices*, which we can access with the `values` and `index` attributes."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.174915Z",
"iopub.status.busy": "2024-01-10T00:13:15.174556Z",
"iopub.status.idle": "2024-01-10T00:13:15.181701Z",
"shell.execute_reply": "2024-01-10T00:13:15.181103Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"0 0.25\n",
"1 0.50\n",
"2 0.75\n",
"3 1.00\n",
"dtype: float64"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.184468Z",
"iopub.status.busy": "2024-01-10T00:13:15.184105Z",
"iopub.status.idle": "2024-01-10T00:13:15.188580Z",
"shell.execute_reply": "2024-01-10T00:13:15.187941Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"array([0.25, 0.5 , 0.75, 1. ])"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.values"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.192321Z",
"iopub.status.busy": "2024-01-10T00:13:15.191893Z",
"iopub.status.idle": "2024-01-10T00:13:15.196459Z",
"shell.execute_reply": "2024-01-10T00:13:15.195837Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"RangeIndex(start=0, stop=4, step=1)"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.index"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### `Series` as generalized NumPy array \n",
"\n",
"Values are simply NumPy array.\n",
"\n",
"Index need not be an integer, but can consist of values of any desired type. "
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.199711Z",
"iopub.status.busy": "2024-01-10T00:13:15.199076Z",
"iopub.status.idle": "2024-01-10T00:13:15.204940Z",
"shell.execute_reply": "2024-01-10T00:13:15.204319Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"a 0.25\n",
"b 0.50\n",
"c 0.75\n",
"d 1.00\n",
"dtype: float64"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data = pd.Series([0.25, 0.5, 0.75, 1.0],\n",
" index=['a', 'b', 'c', 'd'])\n",
"data"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.207914Z",
"iopub.status.busy": "2024-01-10T00:13:15.207336Z",
"iopub.status.idle": "2024-01-10T00:13:15.211637Z",
"shell.execute_reply": "2024-01-10T00:13:15.210997Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"0.5"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data['b']"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### `Series` as specialized dictionary\n",
"\n",
"Can also think of a Pandas `Series` like a specialization of a Python dictionary."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.214449Z",
"iopub.status.busy": "2024-01-10T00:13:15.214085Z",
"iopub.status.idle": "2024-01-10T00:13:15.217795Z",
"shell.execute_reply": "2024-01-10T00:13:15.217150Z"
}
},
"outputs": [],
"source": [
"population_dict = {'California': 38332521,\n",
" 'Texas': 26448193,\n",
" 'New York': 19651127,\n",
" 'Florida': 19552860,\n",
" 'Illinois': 12882135}\n",
"population = pd.Series(population_dict) # Instantiate from dictionary"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.220887Z",
"iopub.status.busy": "2024-01-10T00:13:15.220288Z",
"iopub.status.idle": "2024-01-10T00:13:15.224757Z",
"shell.execute_reply": "2024-01-10T00:13:15.224118Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"(dict, pandas.core.series.Series)"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(population_dict), type(population)"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.228122Z",
"iopub.status.busy": "2024-01-10T00:13:15.227555Z",
"iopub.status.idle": "2024-01-10T00:13:15.231762Z",
"shell.execute_reply": "2024-01-10T00:13:15.231241Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"38332521"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"population['California']"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"- Python dictionary: maps *arbitrary* keys to *arbitrary* values.\n",
"- Pandas `Series`: maps *typed* indices to *typed* values."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"Type information of Pandas `Series` makes it much more efficient than Python dictionaries for certain operations."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Pandas `DataFrame`\n",
"\n",
"`DataFrame` can be thought of as a sequence of aligned `Series` objects, with *indices* and *columns*."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.234800Z",
"iopub.status.busy": "2024-01-10T00:13:15.234218Z",
"iopub.status.idle": "2024-01-10T00:13:15.242493Z",
"shell.execute_reply": "2024-01-10T00:13:15.241954Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>foo</th>\n",
" <th>bar</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>a</th>\n",
" <td>0.150862</td>\n",
" <td>0.234217</td>\n",
" </tr>\n",
" <tr>\n",
" <th>b</th>\n",
" <td>0.667193</td>\n",
" <td>0.961023</td>\n",
" </tr>\n",
" <tr>\n",
" <th>c</th>\n",
" <td>0.030670</td>\n",
" <td>0.910813</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" foo bar\n",
"a 0.150862 0.234217\n",
"b 0.667193 0.961023\n",
"c 0.030670 0.910813"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.DataFrame(np.random.rand(3, 2),\n",
" columns=['foo', 'bar'],\n",
" index=['a', 'b', 'c'])"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### `DataFrame` as generalized NumPy array\n",
"\n",
"`DataFrame` is an analog of a two-dimensional array with both flexible row indices and flexible column names."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"Contstruct another `Series` with same indices."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.245372Z",
"iopub.status.busy": "2024-01-10T00:13:15.245019Z",
"iopub.status.idle": "2024-01-10T00:13:15.248490Z",
"shell.execute_reply": "2024-01-10T00:13:15.247931Z"
}
},
"outputs": [],
"source": [
"area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,\n",
" 'Florida': 170312, 'Illinois': 149995}\n",
"area = pd.Series(area_dict)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"Combine two `Series` into a `DataFrame`."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.251243Z",
"iopub.status.busy": "2024-01-10T00:13:15.250893Z",
"iopub.status.idle": "2024-01-10T00:13:15.257216Z",
"shell.execute_reply": "2024-01-10T00:13:15.256698Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>population</th>\n",
" <th>area</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>California</th>\n",
" <td>38332521</td>\n",
" <td>423967</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Texas</th>\n",
" <td>26448193</td>\n",
" <td>695662</td>\n",
" </tr>\n",
" <tr>\n",
" <th>New York</th>\n",
" <td>19651127</td>\n",
" <td>141297</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Florida</th>\n",
" <td>19552860</td>\n",
" <td>170312</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Illinois</th>\n",
" <td>12882135</td>\n",
" <td>149995</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" population area\n",
"California 38332521 423967\n",
"Texas 26448193 695662\n",
"New York 19651127 141297\n",
"Florida 19552860 170312\n",
"Illinois 12882135 149995"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"states = pd.DataFrame({'population': population,\n",
" 'area': area})\n",
"states"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"`DataFrame` has both `index` and `column` attributes."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.260053Z",
"iopub.status.busy": "2024-01-10T00:13:15.259693Z",
"iopub.status.idle": "2024-01-10T00:13:15.263625Z",
"shell.execute_reply": "2024-01-10T00:13:15.263113Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"states.index"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.266338Z",
"iopub.status.busy": "2024-01-10T00:13:15.265998Z",
"iopub.status.idle": "2024-01-10T00:13:15.269838Z",
"shell.execute_reply": "2024-01-10T00:13:15.269330Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Index(['population', 'area'], dtype='object')"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"states.columns"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### `DataFrame` as specialized dictionary\n",
"\n",
"Can also think of a Pandas `DataFrame` like a specialization of a Python dictionary."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"`DataFrame` maps a column name to a `Series`."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.272676Z",
"iopub.status.busy": "2024-01-10T00:13:15.272320Z",
"iopub.status.idle": "2024-01-10T00:13:15.276748Z",
"shell.execute_reply": "2024-01-10T00:13:15.276239Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"California 423967\n",
"Texas 695662\n",
"New York 141297\n",
"Florida 170312\n",
"Illinois 149995\n",
"Name: area, dtype: int64"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"states['area']"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.279494Z",
"iopub.status.busy": "2024-01-10T00:13:15.279148Z",
"iopub.status.idle": "2024-01-10T00:13:15.282910Z",
"shell.execute_reply": "2024-01-10T00:13:15.282377Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"pandas.core.series.Series"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(states['area'])"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Pandas `Index`\n",
"\n",
"Both Pandas `Series` and `DataFrame` contain `Index` object(s).\n",
"\n",
"Can be thought of as *immutable array* (i.e. cannot be changed) or *ordered multi-set* (may contain repeated values)."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.285632Z",
"iopub.status.busy": "2024-01-10T00:13:15.285291Z",
"iopub.status.idle": "2024-01-10T00:13:15.289529Z",
"shell.execute_reply": "2024-01-10T00:13:15.289021Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Index([2, 3, 5, 7, 11], dtype='int64')"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ind = pd.Index([2, 3, 5, 7, 11])\n",
"ind "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### `Index` as immutable array\n",
"\n",
"Immutability makes it safer to share indices between multiple `DataFrames`."
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.292351Z",
"iopub.status.busy": "2024-01-10T00:13:15.291994Z",
"iopub.status.idle": "2024-01-10T00:13:15.295778Z",
"shell.execute_reply": "2024-01-10T00:13:15.295270Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"3"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ind[1]"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.298395Z",
"iopub.status.busy": "2024-01-10T00:13:15.298061Z",
"iopub.status.idle": "2024-01-10T00:13:15.300547Z",
"shell.execute_reply": "2024-01-10T00:13:15.300030Z"
}
},
"outputs": [],
"source": [
"#ind[1] = 0"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### `Index` as ordered multi-set\n",
"\n",
"`Index` objects support many set operations, e.g. joins, unions, intersections, differences."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Example: Compute the intersection and union of the following two `Index` objects."
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.303350Z",
"iopub.status.busy": "2024-01-10T00:13:15.303007Z",
"iopub.status.idle": "2024-01-10T00:13:15.306205Z",
"shell.execute_reply": "2024-01-10T00:13:15.305664Z"
}
},
"outputs": [],
"source": [
"indA = pd.Index([1, 3, 5, 7, 9])\n",
"indB = pd.Index([2, 3, 5, 7, 11]) "
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.308868Z",
"iopub.status.busy": "2024-01-10T00:13:15.308513Z",
"iopub.status.idle": "2024-01-10T00:13:15.312952Z",
"shell.execute_reply": "2024-01-10T00:13:15.312424Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Index([3, 5, 7], dtype='int64')"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"indA.intersection(indB) # intersection"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.315736Z",
"iopub.status.busy": "2024-01-10T00:13:15.315378Z",
"iopub.status.idle": "2024-01-10T00:13:15.319554Z",
"shell.execute_reply": "2024-01-10T00:13:15.319039Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"indA.union(indB) # union"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Data indexing and selection"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Data selection in a `Series`"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In additional to acting like a dictionary, a `Series` also provies array-style selection like NumPy arrays."
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.322458Z",
"iopub.status.busy": "2024-01-10T00:13:15.322095Z",
"iopub.status.idle": "2024-01-10T00:13:15.327409Z",
"shell.execute_reply": "2024-01-10T00:13:15.326886Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"a 0.25\n",
"b 0.50\n",
"c 0.75\n",
"d 1.00\n",
"dtype: float64"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data = pd.Series([0.25, 0.5, 0.75, 1.0],\n",
" index=['a', 'b', 'c', 'd'])\n",
"data"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.330109Z",
"iopub.status.busy": "2024-01-10T00:13:15.329744Z",
"iopub.status.idle": "2024-01-10T00:13:15.334772Z",
"shell.execute_reply": "2024-01-10T00:13:15.334235Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"a 0.25\n",
"b 0.50\n",
"c 0.75\n",
"dtype: float64"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# slicing by explicit index\n",
"data['a':'c']"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.337529Z",
"iopub.status.busy": "2024-01-10T00:13:15.337173Z",
"iopub.status.idle": "2024-01-10T00:13:15.341776Z",
"shell.execute_reply": "2024-01-10T00:13:15.341267Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"a 0.25\n",
"b 0.50\n",
"dtype: float64"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# slicing by implicit integer index\n",
"data[0:2]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"When slicing by an explicit index (e.g. `data['a':'c']`), the final index *is* included.\n",
"\n",
"When slicing by an implicit index (e.g. `data[0:2]`), the final index *is not* included.\n",
"\n",
"This can be a source of much confusion."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"Consider a Series with integer indices."
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.344695Z",
"iopub.status.busy": "2024-01-10T00:13:15.344351Z",
"iopub.status.idle": "2024-01-10T00:13:15.349397Z",
"shell.execute_reply": "2024-01-10T00:13:15.348877Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"1 a\n",
"3 b\n",
"5 c\n",
"dtype: object"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])\n",
"data"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.352122Z",
"iopub.status.busy": "2024-01-10T00:13:15.351783Z",
"iopub.status.idle": "2024-01-10T00:13:15.355483Z",
"shell.execute_reply": "2024-01-10T00:13:15.354953Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'a'"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# explicit index when indexing\n",
"data[1]"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.358468Z",
"iopub.status.busy": "2024-01-10T00:13:15.358085Z",
"iopub.status.idle": "2024-01-10T00:13:15.362748Z",
"shell.execute_reply": "2024-01-10T00:13:15.362199Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"3 b\n",
"5 c\n",
"dtype: object"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# implicit index when slicing\n",
"data[1:3]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"#### Indexers\n",
"\n",
"Indexers `loc` (explicit) and `iloc` (implicit) are introduced to avoid confusion."
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.365530Z",
"iopub.status.busy": "2024-01-10T00:13:15.365179Z",
"iopub.status.idle": "2024-01-10T00:13:15.369614Z",
"shell.execute_reply": "2024-01-10T00:13:15.369097Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"1 a\n",
"3 b\n",
"5 c\n",
"dtype: object"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.372452Z",
"iopub.status.busy": "2024-01-10T00:13:15.372086Z",
"iopub.status.idle": "2024-01-10T00:13:15.375935Z",
"shell.execute_reply": "2024-01-10T00:13:15.375418Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'a'"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.loc[1]"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.378492Z",
"iopub.status.busy": "2024-01-10T00:13:15.378208Z",
"iopub.status.idle": "2024-01-10T00:13:15.382914Z",
"shell.execute_reply": "2024-01-10T00:13:15.382378Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"1 a\n",
"3 b\n",
"dtype: object"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.loc[1:3]"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.385667Z",
"iopub.status.busy": "2024-01-10T00:13:15.385325Z",
"iopub.status.idle": "2024-01-10T00:13:15.389111Z",
"shell.execute_reply": "2024-01-10T00:13:15.388596Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'b'"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.iloc[1]"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.391891Z",
"iopub.status.busy": "2024-01-10T00:13:15.391527Z",
"iopub.status.idle": "2024-01-10T00:13:15.396244Z",
"shell.execute_reply": "2024-01-10T00:13:15.395728Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"3 b\n",
"5 c\n",
"dtype: object"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.iloc[1:3]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Data selection in a `DataFrame`"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In additional to acting like a dictionary of `Series` objects with the same index, a `DataFrame` also provies array-style selection like NumPy arrays."
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.399127Z",
"iopub.status.busy": "2024-01-10T00:13:15.398770Z",
"iopub.status.idle": "2024-01-10T00:13:15.406757Z",
"shell.execute_reply": "2024-01-10T00:13:15.406203Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>area</th>\n",
" <th>population</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>California</th>\n",
" <td>423967</td>\n",
" <td>38332521</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Texas</th>\n",
" <td>695662</td>\n",
" <td>26448193</td>\n",
" </tr>\n",
" <tr>\n",
" <th>New York</th>\n",
" <td>141297</td>\n",
" <td>19651127</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Florida</th>\n",
" <td>170312</td>\n",
" <td>19552860</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Illinois</th>\n",
" <td>149995</td>\n",
" <td>12882135</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" area population\n",
"California 423967 38332521\n",
"Texas 695662 26448193\n",
"New York 141297 19651127\n",
"Florida 170312 19552860\n",
"Illinois 149995 12882135"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"area = pd.Series({'California': 423967, 'Texas': 695662,\n",
" 'New York': 141297, 'Florida': 170312,\n",
" 'Illinois': 149995})\n",
"pop = pd.Series({'California': 38332521, 'Texas': 26448193,\n",
" 'New York': 19651127, 'Florida': 19552860,\n",
" 'Illinois': 12882135})\n",
"data = pd.DataFrame({'area':area, 'population':pop})\n",
"data"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"#### Indexers\n",
"\n",
"Indexers `loc` (explicit) and `iloc` (implicit) are also available to avoid confusion when selecting data.\n",
"\n",
"Note that index and column labels are preserved in the result."
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.409634Z",
"iopub.status.busy": "2024-01-10T00:13:15.409286Z",
"iopub.status.idle": "2024-01-10T00:13:15.414919Z",
"shell.execute_reply": "2024-01-10T00:13:15.414384Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>area</th>\n",
" <th>population</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>California</th>\n",
" <td>423967</td>\n",
" <td>38332521</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Texas</th>\n",
" <td>695662</td>\n",
" <td>26448193</td>\n",
" </tr>\n",
" <tr>\n",
" <th>New York</th>\n",
" <td>141297</td>\n",
" <td>19651127</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Florida</th>\n",
" <td>170312</td>\n",
" <td>19552860</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Illinois</th>\n",
" <td>149995</td>\n",
" <td>12882135</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" area population\n",
"California 423967 38332521\n",
"Texas 695662 26448193\n",
"New York 141297 19651127\n",
"Florida 170312 19552860\n",
"Illinois 149995 12882135"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.417710Z",
"iopub.status.busy": "2024-01-10T00:13:15.417363Z",
"iopub.status.idle": "2024-01-10T00:13:15.423392Z",
"shell.execute_reply": "2024-01-10T00:13:15.422870Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>area</th>\n",
" <th>population</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>California</th>\n",
" <td>423967</td>\n",
" <td>38332521</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Texas</th>\n",
" <td>695662</td>\n",
" <td>26448193</td>\n",
" </tr>\n",
" <tr>\n",
" <th>New York</th>\n",
" <td>141297</td>\n",
" <td>19651127</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" area population\n",
"California 423967 38332521\n",
"Texas 695662 26448193\n",
"New York 141297 19651127"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.iloc[:3, :2]"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.426169Z",
"iopub.status.busy": "2024-01-10T00:13:15.425803Z",
"iopub.status.idle": "2024-01-10T00:13:15.431850Z",
"shell.execute_reply": "2024-01-10T00:13:15.431323Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>area</th>\n",
" <th>population</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>California</th>\n",
" <td>423967</td>\n",
" <td>38332521</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Texas</th>\n",
" <td>695662</td>\n",
" <td>26448193</td>\n",
" </tr>\n",
" <tr>\n",
" <th>New York</th>\n",
" <td>141297</td>\n",
" <td>19651127</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Florida</th>\n",
" <td>170312</td>\n",
" <td>19552860</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Illinois</th>\n",
" <td>149995</td>\n",
" <td>12882135</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" area population\n",
"California 423967 38332521\n",
"Texas 695662 26448193\n",
"New York 141297 19651127\n",
"Florida 170312 19552860\n",
"Illinois 149995 12882135"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.loc[:'Illinois', :'population']"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Additional array-style selection \n",
"\n",
"Other NumPy selection approaches can also be applied (e.g. masking)."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": [
"exercise_pointer"
]
},
"source": [
"**Exercises:** *You can now complete Exercise 1 in the exercises associated with this lecture.*"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Operating on data in Pandas"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Elementwise operations in Pandas automatically aligns indices and preserves index/column labels.\n",
"\n",
"Can avoid many errors and bugs in data wrangling."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Index preservation"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.435183Z",
"iopub.status.busy": "2024-01-10T00:13:15.434837Z",
"iopub.status.idle": "2024-01-10T00:13:15.440045Z",
"shell.execute_reply": "2024-01-10T00:13:15.439526Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"0 6\n",
"1 3\n",
"2 7\n",
"3 4\n",
"dtype: int64"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"rng = np.random.RandomState(42)\n",
"ser = pd.Series(rng.randint(0, 10, 4))\n",
"ser"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.442924Z",
"iopub.status.busy": "2024-01-10T00:13:15.442580Z",
"iopub.status.idle": "2024-01-10T00:13:15.447315Z",
"shell.execute_reply": "2024-01-10T00:13:15.446795Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"0 403.428793\n",
"1 20.085537\n",
"2 1096.633158\n",
"3 54.598150\n",
"dtype: float64"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.exp(ser)"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.450058Z",
"iopub.status.busy": "2024-01-10T00:13:15.449718Z",
"iopub.status.idle": "2024-01-10T00:13:15.456731Z",
"shell.execute_reply": "2024-01-10T00:13:15.456205Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>A</th>\n",
" <th>B</th>\n",
" <th>C</th>\n",
" <th>D</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>6</td>\n",
" <td>9</td>\n",
" <td>2</td>\n",
" <td>6</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>7</td>\n",
" <td>4</td>\n",
" <td>3</td>\n",
" <td>7</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>7</td>\n",
" <td>2</td>\n",
" <td>5</td>\n",
" <td>4</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" A B C D\n",
"0 6 9 2 6\n",
"1 7 4 3 7\n",
"2 7 2 5 4"
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.DataFrame(rng.randint(0, 10, (3, 4)),\n",
" columns=['A', 'B', 'C', 'D'])\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.459559Z",
"iopub.status.busy": "2024-01-10T00:13:15.459213Z",
"iopub.status.idle": "2024-01-10T00:13:15.466329Z",
"shell.execute_reply": "2024-01-10T00:13:15.465808Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>A</th>\n",
" <th>B</th>\n",
" <th>C</th>\n",
" <th>D</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>403.428793</td>\n",
" <td>8103.083928</td>\n",
" <td>7.389056</td>\n",
" <td>403.428793</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1096.633158</td>\n",
" <td>54.598150</td>\n",
" <td>20.085537</td>\n",
" <td>1096.633158</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1096.633158</td>\n",
" <td>7.389056</td>\n",
" <td>148.413159</td>\n",
" <td>54.598150</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" A B C D\n",
"0 403.428793 8103.083928 7.389056 403.428793\n",
"1 1096.633158 54.598150 20.085537 1096.633158\n",
"2 1096.633158 7.389056 148.413159 54.598150"
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.exp(df)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": [
"exercise_pointer"
]
},
"source": [
"**Exercises:** *You can now complete Exercise 2 in the exercises associated with this lecture.*"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Index alignment"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"Index alignment works similarly for `DataFrames`."
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.469283Z",
"iopub.status.busy": "2024-01-10T00:13:15.468937Z",
"iopub.status.idle": "2024-01-10T00:13:15.475078Z",
"shell.execute_reply": "2024-01-10T00:13:15.474536Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>A</th>\n",
" <th>B</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>11</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>5</td>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" A B\n",
"0 1 11\n",
"1 5 1"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"A = pd.DataFrame(rng.randint(0, 20, (2, 2)),\n",
" columns=list('AB'))\n",
"A"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.478954Z",
"iopub.status.busy": "2024-01-10T00:13:15.478608Z",
"iopub.status.idle": "2024-01-10T00:13:15.484998Z",
"shell.execute_reply": "2024-01-10T00:13:15.484464Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>B</th>\n",
" <th>A</th>\n",
" <th>C</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>4</td>\n",
" <td>0</td>\n",
" <td>9</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>5</td>\n",
" <td>8</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>9</td>\n",
" <td>2</td>\n",
" <td>6</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" B A C\n",
"0 4 0 9\n",
"1 5 8 0\n",
"2 9 2 6"
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"B = pd.DataFrame(rng.randint(0, 10, (3, 3)),\n",
" columns=list('BAC'))\n",
"B"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.488869Z",
"iopub.status.busy": "2024-01-10T00:13:15.488521Z",
"iopub.status.idle": "2024-01-10T00:13:15.497638Z",
"shell.execute_reply": "2024-01-10T00:13:15.497097Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>A</th>\n",
" <th>B</th>\n",
" <th>C</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1.0</td>\n",
" <td>15.0</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>13.0</td>\n",
" <td>6.0</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" A B C\n",
"0 1.0 15.0 NaN\n",
"1 13.0 6.0 NaN\n",
"2 NaN NaN NaN"
]
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"A + B"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Operations between `DataFrame` and `Series` objects"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.505677Z",
"iopub.status.busy": "2024-01-10T00:13:15.505241Z",
"iopub.status.idle": "2024-01-10T00:13:15.512507Z",
"shell.execute_reply": "2024-01-10T00:13:15.511979Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Q</th>\n",
" <th>R</th>\n",
" <th>S</th>\n",
" <th>T</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>3</td>\n",
" <td>8</td>\n",
" <td>2</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>6</td>\n",
" <td>4</td>\n",
" <td>8</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>6</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>8</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Q R S T\n",
"0 3 8 2 4\n",
"1 2 6 4 8\n",
"2 6 1 3 8"
]
},
"execution_count": 51,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"A = rng.randint(10, size=(3, 4))\n",
"df = pd.DataFrame(A, columns=list('QRST'))\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.515360Z",
"iopub.status.busy": "2024-01-10T00:13:15.515014Z",
"iopub.status.idle": "2024-01-10T00:13:15.519595Z",
"shell.execute_reply": "2024-01-10T00:13:15.519059Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Q 3\n",
"R 8\n",
"S 2\n",
"T 4\n",
"Name: 0, dtype: int64"
]
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"s = df.iloc[0]\n",
"s"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"Difference between the `DataFrame` and `Series`:"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.522471Z",
"iopub.status.busy": "2024-01-10T00:13:15.522110Z",
"iopub.status.idle": "2024-01-10T00:13:15.528586Z",
"shell.execute_reply": "2024-01-10T00:13:15.528055Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Q</th>\n",
" <th>R</th>\n",
" <th>S</th>\n",
" <th>T</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>-1</td>\n",
" <td>-2</td>\n",
" <td>2</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>-7</td>\n",
" <td>1</td>\n",
" <td>4</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Q R S T\n",
"0 0 0 0 0\n",
"1 -1 -2 2 4\n",
"2 3 -7 1 4"
]
},
"execution_count": 53,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df - s"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Convention is to operate row-wise."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"Can also operate column-wise using object methods."
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.532195Z",
"iopub.status.busy": "2024-01-10T00:13:15.531844Z",
"iopub.status.idle": "2024-01-10T00:13:15.538629Z",
"shell.execute_reply": "2024-01-10T00:13:15.538075Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Q</th>\n",
" <th>R</th>\n",
" <th>S</th>\n",
" <th>T</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>-5</td>\n",
" <td>0</td>\n",
" <td>-6</td>\n",
" <td>-4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>-4</td>\n",
" <td>0</td>\n",
" <td>-2</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>7</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Q R S T\n",
"0 -5 0 -6 -4\n",
"1 -4 0 -2 2\n",
"2 5 0 2 7"
]
},
"execution_count": 54,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.subtract(df['R'], axis=0)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Handling missing data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Real data is messy. Often some data are missing."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"Various conventions can be considered to handle missing data.\n",
"\n",
"We will focus on the use of the floating point IEEE value NaN (not a number) to represent missing data."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"Pandas interprets NaN as Null values.\n",
"\n",
"(Pandas also supports `None` but we will focus on NaN here.)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"Arithematic operations with NaN values result in NaN."
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.542302Z",
"iopub.status.busy": "2024-01-10T00:13:15.541956Z",
"iopub.status.idle": "2024-01-10T00:13:15.545706Z",
"shell.execute_reply": "2024-01-10T00:13:15.545196Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"nan"
]
},
"execution_count": 55,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"1 + np.nan"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Operating on Null values"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Several useful methods exist to work with NaNs, for example to detect, drop or replace:\n",
"- `isnull()`: Generate a boolean mask indicating missing values.\n",
"- `notnull()`: Opposite of `isnull()`.\n",
"- `dropna()`: Return a filtered version of the data.\n",
"- `fillna()`: Return a copy of the data with missing values filled."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Detecting null values\n",
"\n",
"Pandas `isnull` and `notnull` are useful for detecting null values."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": [
"exercise_pointer"
]
},
"source": [
"**Exercises:** *You can now complete Exercise 3 in the exercises associated with this lecture.*"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Dropping null values\n",
"\n",
"Direct routines may be used to drop null values (i.e. `dropna`), rather than constructing masks as performed above."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": [
"exercise_pointer"
]
},
"source": [
"**Exercises:** *You can now complete Exercise 4 in the exercises associated with this lecture.*"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Dropping null values from `DataFrames`\n",
"\n",
"For `DataFrames`, there are multiple ways null values can be dropped."
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.549490Z",
"iopub.status.busy": "2024-01-10T00:13:15.549132Z",
"iopub.status.idle": "2024-01-10T00:13:15.556771Z",
"shell.execute_reply": "2024-01-10T00:13:15.556194Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1.0</td>\n",
" <td>NaN</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2.0</td>\n",
" <td>3.0</td>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>NaN</td>\n",
" <td>4.0</td>\n",
" <td>6</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2\n",
"0 1.0 NaN 2\n",
"1 2.0 3.0 5\n",
"2 NaN 4.0 6"
]
},
"execution_count": 56,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.DataFrame([[1, np.nan, 2],\n",
" [2, 3, 5],\n",
" [np.nan, 4, 6]])\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"By default `dropna` operates row-wise and drops all rows that contain any NaNs."
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.559929Z",
"iopub.status.busy": "2024-01-10T00:13:15.559542Z",
"iopub.status.idle": "2024-01-10T00:13:15.566641Z",
"shell.execute_reply": "2024-01-10T00:13:15.566094Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2.0</td>\n",
" <td>3.0</td>\n",
" <td>5</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2\n",
"1 2.0 3.0 5"
]
},
"execution_count": 57,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.dropna()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"Can also operate column-wise."
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.569601Z",
"iopub.status.busy": "2024-01-10T00:13:15.569257Z",
"iopub.status.idle": "2024-01-10T00:13:15.575786Z",
"shell.execute_reply": "2024-01-10T00:13:15.575259Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1.0</td>\n",
" <td>NaN</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2.0</td>\n",
" <td>3.0</td>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>NaN</td>\n",
" <td>4.0</td>\n",
" <td>6</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2\n",
"0 1.0 NaN 2\n",
"1 2.0 3.0 5\n",
"2 NaN 4.0 6"
]
},
"execution_count": 58,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.578790Z",
"iopub.status.busy": "2024-01-10T00:13:15.578414Z",
"iopub.status.idle": "2024-01-10T00:13:15.584880Z",
"shell.execute_reply": "2024-01-10T00:13:15.584357Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>6</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 2\n",
"0 2\n",
"1 5\n",
"2 6"
]
},
"execution_count": 59,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.dropna(axis='columns')"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"More sophisticated approaches can also be considered (e.g. only dropping rows/columns if all entries or a certain number of NaNs appear)."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Replacing null values\n",
"\n",
"Null values can be easily replaced using `fillna`."
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.588227Z",
"iopub.status.busy": "2024-01-10T00:13:15.587864Z",
"iopub.status.idle": "2024-01-10T00:13:15.594332Z",
"shell.execute_reply": "2024-01-10T00:13:15.593819Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1.0</td>\n",
" <td>NaN</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2.0</td>\n",
" <td>3.0</td>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>NaN</td>\n",
" <td>4.0</td>\n",
" <td>6</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2\n",
"0 1.0 NaN 2\n",
"1 2.0 3.0 5\n",
"2 NaN 4.0 6"
]
},
"execution_count": 60,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.597648Z",
"iopub.status.busy": "2024-01-10T00:13:15.597298Z",
"iopub.status.idle": "2024-01-10T00:13:15.604154Z",
"shell.execute_reply": "2024-01-10T00:13:15.603634Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2.0</td>\n",
" <td>3.0</td>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0.0</td>\n",
" <td>4.0</td>\n",
" <td>6</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2\n",
"0 1.0 0.0 2\n",
"1 2.0 3.0 5\n",
"2 0.0 4.0 6"
]
},
"execution_count": 61,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.fillna(0.0)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"Can also fill using adjacent values."
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.607243Z",
"iopub.status.busy": "2024-01-10T00:13:15.606901Z",
"iopub.status.idle": "2024-01-10T00:13:15.613298Z",
"shell.execute_reply": "2024-01-10T00:13:15.612778Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1.0</td>\n",
" <td>NaN</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2.0</td>\n",
" <td>3.0</td>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>NaN</td>\n",
" <td>4.0</td>\n",
" <td>6</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2\n",
"0 1.0 NaN 2\n",
"1 2.0 3.0 5\n",
"2 NaN 4.0 6"
]
},
"execution_count": 62,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.615998Z",
"iopub.status.busy": "2024-01-10T00:13:15.615699Z",
"iopub.status.idle": "2024-01-10T00:13:15.623380Z",
"shell.execute_reply": "2024-01-10T00:13:15.622863Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>2.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2.0</td>\n",
" <td>3.0</td>\n",
" <td>5.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>NaN</td>\n",
" <td>4.0</td>\n",
" <td>6.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2\n",
"0 1.0 1.0 2.0\n",
"1 2.0 3.0 5.0\n",
"2 NaN 4.0 6.0"
]
},
"execution_count": 63,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.fillna(method='ffill', axis='columns')"
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.626132Z",
"iopub.status.busy": "2024-01-10T00:13:15.625790Z",
"iopub.status.idle": "2024-01-10T00:13:15.632720Z",
"shell.execute_reply": "2024-01-10T00:13:15.632070Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1.0</td>\n",
" <td>NaN</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2.0</td>\n",
" <td>3.0</td>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2.0</td>\n",
" <td>4.0</td>\n",
" <td>6</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2\n",
"0 1.0 NaN 2\n",
"1 2.0 3.0 5\n",
"2 2.0 4.0 6"
]
},
"execution_count": 64,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.fillna(method='ffill', axis='rows')"
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.635515Z",
"iopub.status.busy": "2024-01-10T00:13:15.635170Z",
"iopub.status.idle": "2024-01-10T00:13:15.642851Z",
"shell.execute_reply": "2024-01-10T00:13:15.642316Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1.0</td>\n",
" <td>2.0</td>\n",
" <td>2.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2.0</td>\n",
" <td>3.0</td>\n",
" <td>5.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>4.0</td>\n",
" <td>4.0</td>\n",
" <td>6.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2\n",
"0 1.0 2.0 2.0\n",
"1 2.0 3.0 5.0\n",
"2 4.0 4.0 6.0"
]
},
"execution_count": 65,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.fillna(method='bfill', axis='columns')"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Combining data-sets"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Define helper functions"
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.645704Z",
"iopub.status.busy": "2024-01-10T00:13:15.645362Z",
"iopub.status.idle": "2024-01-10T00:13:15.653232Z",
"shell.execute_reply": "2024-01-10T00:13:15.652717Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>A</th>\n",
" <th>B</th>\n",
" <th>C</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>A0</td>\n",
" <td>B0</td>\n",
" <td>C0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>A1</td>\n",
" <td>B1</td>\n",
" <td>C1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>A2</td>\n",
" <td>B2</td>\n",
" <td>C2</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" A B C\n",
"0 A0 B0 C0\n",
"1 A1 B1 C1\n",
"2 A2 B2 C2"
]
},
"execution_count": 66,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def make_df(cols, ind):\n",
" \"\"\"Quickly make a DataFrame\"\"\"\n",
" data = {c: [str(c) + str(i) for i in ind]\n",
" for c in cols}\n",
" return pd.DataFrame(data, ind)\n",
"\n",
"# example DataFrame\n",
"make_df('ABC', range(3))"
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.655999Z",
"iopub.status.busy": "2024-01-10T00:13:15.655658Z",
"iopub.status.idle": "2024-01-10T00:13:15.658376Z",
"shell.execute_reply": "2024-01-10T00:13:15.657849Z"
}
},
"outputs": [],
"source": [
"from IPython.display import display"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Concatenation\n",
"\n",
"Can concatenate `Series` and `DataFrame` objects with `pd.concat()`."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"Default is to concatenate over rows."
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.661244Z",
"iopub.status.busy": "2024-01-10T00:13:15.660906Z",
"iopub.status.idle": "2024-01-10T00:13:15.674198Z",
"shell.execute_reply": "2024-01-10T00:13:15.673678Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>A</th>\n",
" <th>B</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>A1</td>\n",
" <td>B1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>A2</td>\n",
" <td>B2</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" A B\n",
"1 A1 B1\n",
"2 A2 B2"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>A</th>\n",
" <th>B</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>A3</td>\n",
" <td>B3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>A4</td>\n",
" <td>B4</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" A B\n",
"3 A3 B3\n",
"4 A4 B4"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>A</th>\n",
" <th>B</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>A1</td>\n",
" <td>B1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>A2</td>\n",
" <td>B2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>A3</td>\n",
" <td>B3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>A4</td>\n",
" <td>B4</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" A B\n",
"1 A1 B1\n",
"2 A2 B2\n",
"3 A3 B3\n",
"4 A4 B4"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"df1 = make_df('AB', [1, 2])\n",
"df2 = make_df('AB', [3, 4])\n",
"display(df1, df2, pd.concat([df1, df2]))"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"Can also concatenate over columns."
]
},
{
"cell_type": "code",
"execution_count": 69,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.677093Z",
"iopub.status.busy": "2024-01-10T00:13:15.676742Z",
"iopub.status.idle": "2024-01-10T00:13:15.690361Z",
"shell.execute_reply": "2024-01-10T00:13:15.689828Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>A</th>\n",
" <th>B</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>A0</td>\n",
" <td>B0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>A1</td>\n",
" <td>B1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" A B\n",
"0 A0 B0\n",
"1 A1 B1"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>C</th>\n",
" <th>D</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>C0</td>\n",
" <td>D0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>C1</td>\n",
" <td>D1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" C D\n",
"0 C0 D0\n",
"1 C1 D1"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>A</th>\n",
" <th>B</th>\n",
" <th>C</th>\n",
" <th>D</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>A0</td>\n",
" <td>B0</td>\n",
" <td>C0</td>\n",
" <td>D0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>A1</td>\n",
" <td>B1</td>\n",
" <td>C1</td>\n",
" <td>D1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" A B C D\n",
"0 A0 B0 C0 D0\n",
"1 A1 B1 C1 D1"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"df3 = make_df('AB', [0, 1])\n",
"df4 = make_df('CD', [0, 1])\n",
"display(df3, df4, pd.concat([df3, df4], axis='columns'))"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"#### Duplicated indices\n",
"\n",
"Can have duplicated indices."
]
},
{
"cell_type": "code",
"execution_count": 70,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.693461Z",
"iopub.status.busy": "2024-01-10T00:13:15.693106Z",
"iopub.status.idle": "2024-01-10T00:13:15.706128Z",
"shell.execute_reply": "2024-01-10T00:13:15.705603Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>A</th>\n",
" <th>B</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>A0</td>\n",
" <td>B0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>A1</td>\n",
" <td>B1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" A B\n",
"0 A0 B0\n",
"1 A1 B1"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>A</th>\n",
" <th>B</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>A2</td>\n",
" <td>B2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>A3</td>\n",
" <td>B3</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" A B\n",
"0 A2 B2\n",
"1 A3 B3"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>A</th>\n",
" <th>B</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>A0</td>\n",
" <td>B0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>A1</td>\n",
" <td>B1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>A2</td>\n",
" <td>B2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>A3</td>\n",
" <td>B3</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" A B\n",
"0 A0 B0\n",
"1 A1 B1\n",
"0 A2 B2\n",
"1 A3 B3"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"x = make_df('AB', [0, 1])\n",
"y = make_df('AB', [2, 3])\n",
"y.index = x.index # make duplicate indices!\n",
"display(x, y, pd.concat([x, y]))"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"#### Ignoring index\n",
"\n",
"Can ignore index."
]
},
{
"cell_type": "code",
"execution_count": 71,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.708970Z",
"iopub.status.busy": "2024-01-10T00:13:15.708624Z",
"iopub.status.idle": "2024-01-10T00:13:15.720590Z",
"shell.execute_reply": "2024-01-10T00:13:15.720060Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>A</th>\n",
" <th>B</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>A0</td>\n",
" <td>B0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>A1</td>\n",
" <td>B1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" A B\n",
"0 A0 B0\n",
"1 A1 B1"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>A</th>\n",
" <th>B</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>A2</td>\n",
" <td>B2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>A3</td>\n",
" <td>B3</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" A B\n",
"0 A2 B2\n",
"1 A3 B3"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>A</th>\n",
" <th>B</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>A0</td>\n",
" <td>B0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>A1</td>\n",
" <td>B1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>A2</td>\n",
" <td>B2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>A3</td>\n",
" <td>B3</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" A B\n",
"0 A0 B0\n",
"1 A1 B1\n",
"2 A2 B2\n",
"3 A3 B3"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"display(x, y, pd.concat([x, y], ignore_index=True))"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"#### Concantenation with joins\n",
"\n",
"Can join `DataFrames` with different column names."
]
},
{
"cell_type": "code",
"execution_count": 72,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.723441Z",
"iopub.status.busy": "2024-01-10T00:13:15.723097Z",
"iopub.status.idle": "2024-01-10T00:13:15.738457Z",
"shell.execute_reply": "2024-01-10T00:13:15.737891Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>A</th>\n",
" <th>B</th>\n",
" <th>C</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>A1</td>\n",
" <td>B1</td>\n",
" <td>C1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>A2</td>\n",
" <td>B2</td>\n",
" <td>C2</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" A B C\n",
"1 A1 B1 C1\n",
"2 A2 B2 C2"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>B</th>\n",
" <th>C</th>\n",
" <th>D</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>B3</td>\n",
" <td>C3</td>\n",
" <td>D3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>B4</td>\n",
" <td>C4</td>\n",
" <td>D4</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" B C D\n",
"3 B3 C3 D3\n",
"4 B4 C4 D4"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>A</th>\n",
" <th>B</th>\n",
" <th>C</th>\n",
" <th>D</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>A1</td>\n",
" <td>B1</td>\n",
" <td>C1</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>A2</td>\n",
" <td>B2</td>\n",
" <td>C2</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>NaN</td>\n",
" <td>B3</td>\n",
" <td>C3</td>\n",
" <td>D3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>NaN</td>\n",
" <td>B4</td>\n",
" <td>C4</td>\n",
" <td>D4</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" A B C D\n",
"1 A1 B1 C1 NaN\n",
"2 A2 B2 C2 NaN\n",
"3 NaN B3 C3 D3\n",
"4 NaN B4 C4 D4"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"df5 = make_df('ABC', [1, 2])\n",
"df6 = make_df('BCD', [3, 4])\n",
"display(df5, df6, pd.concat([df5, df6]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Entries with no data are filled with NaN.\n",
"\n",
"Default join is the *union* of the columns of the two `DataFrames`."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"Can also perform different types of joins.\n",
"\n",
"For example, the *intersection* of the columns of the two DataFrames."
]
},
{
"cell_type": "code",
"execution_count": 73,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.741470Z",
"iopub.status.busy": "2024-01-10T00:13:15.741118Z",
"iopub.status.idle": "2024-01-10T00:13:15.754499Z",
"shell.execute_reply": "2024-01-10T00:13:15.753943Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>A</th>\n",
" <th>B</th>\n",
" <th>C</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>A1</td>\n",
" <td>B1</td>\n",
" <td>C1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>A2</td>\n",
" <td>B2</td>\n",
" <td>C2</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" A B C\n",
"1 A1 B1 C1\n",
"2 A2 B2 C2"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>B</th>\n",
" <th>C</th>\n",
" <th>D</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>B3</td>\n",
" <td>C3</td>\n",
" <td>D3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>B4</td>\n",
" <td>C4</td>\n",
" <td>D4</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" B C D\n",
"3 B3 C3 D3\n",
"4 B4 C4 D4"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>B</th>\n",
" <th>C</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>B1</td>\n",
" <td>C1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>B2</td>\n",
" <td>C2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>B3</td>\n",
" <td>C3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>B4</td>\n",
" <td>C4</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" B C\n",
"1 B1 C1\n",
"2 B2 C2\n",
"3 B3 C3\n",
"4 B4 C4"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"display(df5, df6, pd.concat([df5, df6], join='inner'))"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Relational combinations\n",
"\n",
"Pandas also provides functionality to perform relational algebra (cf. relational databases).\n",
"\n",
"Hence, Pandas data structures provide analog not only of NumPy array and dictionary, but also relational database."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"Functionality provied by `pd.merge()` function."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"#### One-to-one join"
]
},
{
"cell_type": "code",
"execution_count": 74,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.757544Z",
"iopub.status.busy": "2024-01-10T00:13:15.757195Z",
"iopub.status.idle": "2024-01-10T00:13:15.772846Z",
"shell.execute_reply": "2024-01-10T00:13:15.772302Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>employee</th>\n",
" <th>group</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Bob</td>\n",
" <td>Accounting</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Jake</td>\n",
" <td>Engineering</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Lisa</td>\n",
" <td>Engineering</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Sue</td>\n",
" <td>HR</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" employee group\n",
"0 Bob Accounting\n",
"1 Jake Engineering\n",
"2 Lisa Engineering\n",
"3 Sue HR"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>employee</th>\n",
" <th>hire_date</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Lisa</td>\n",
" <td>2004</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Bob</td>\n",
" <td>2008</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Jake</td>\n",
" <td>2012</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Sue</td>\n",
" <td>2014</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" employee hire_date\n",
"0 Lisa 2004\n",
"1 Bob 2008\n",
"2 Jake 2012\n",
"3 Sue 2014"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>employee</th>\n",
" <th>group</th>\n",
" <th>hire_date</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Bob</td>\n",
" <td>Accounting</td>\n",
" <td>2008</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Jake</td>\n",
" <td>Engineering</td>\n",
" <td>2012</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Lisa</td>\n",
" <td>Engineering</td>\n",
" <td>2004</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Sue</td>\n",
" <td>HR</td>\n",
" <td>2014</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" employee group hire_date\n",
"0 Bob Accounting 2008\n",
"1 Jake Engineering 2012\n",
"2 Lisa Engineering 2004\n",
"3 Sue HR 2014"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],\n",
" 'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})\n",
"df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],\n",
" 'hire_date': [2004, 2008, 2012, 2014]})\n",
"df3 = pd.merge(df1, df2)\n",
"display(df1,df2,df3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Recognises that \"employee\" column common and automatically selects as key for the relational join."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"#### Many-to-one joins"
]
},
{
"cell_type": "code",
"execution_count": 75,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.775906Z",
"iopub.status.busy": "2024-01-10T00:13:15.775550Z",
"iopub.status.idle": "2024-01-10T00:13:15.790962Z",
"shell.execute_reply": "2024-01-10T00:13:15.790405Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>employee</th>\n",
" <th>group</th>\n",
" <th>hire_date</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Bob</td>\n",
" <td>Accounting</td>\n",
" <td>2008</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Jake</td>\n",
" <td>Engineering</td>\n",
" <td>2012</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Lisa</td>\n",
" <td>Engineering</td>\n",
" <td>2004</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Sue</td>\n",
" <td>HR</td>\n",
" <td>2014</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" employee group hire_date\n",
"0 Bob Accounting 2008\n",
"1 Jake Engineering 2012\n",
"2 Lisa Engineering 2004\n",
"3 Sue HR 2014"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>group</th>\n",
" <th>supervisor</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Accounting</td>\n",
" <td>Carly</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Engineering</td>\n",
" <td>Guido</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>HR</td>\n",
" <td>Steve</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" group supervisor\n",
"0 Accounting Carly\n",
"1 Engineering Guido\n",
"2 HR Steve"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>employee</th>\n",
" <th>group</th>\n",
" <th>hire_date</th>\n",
" <th>supervisor</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Bob</td>\n",
" <td>Accounting</td>\n",
" <td>2008</td>\n",
" <td>Carly</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Jake</td>\n",
" <td>Engineering</td>\n",
" <td>2012</td>\n",
" <td>Guido</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Lisa</td>\n",
" <td>Engineering</td>\n",
" <td>2004</td>\n",
" <td>Guido</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Sue</td>\n",
" <td>HR</td>\n",
" <td>2014</td>\n",
" <td>Steve</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" employee group hire_date supervisor\n",
"0 Bob Accounting 2008 Carly\n",
"1 Jake Engineering 2012 Guido\n",
"2 Lisa Engineering 2004 Guido\n",
"3 Sue HR 2014 Steve"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"df4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],\n",
" 'supervisor': ['Carly', 'Guido', 'Steve']})\n",
"display(df3, df4, pd.merge(df3, df4))"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"#### The `on` keyword"
]
},
{
"cell_type": "code",
"execution_count": 76,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.793866Z",
"iopub.status.busy": "2024-01-10T00:13:15.793517Z",
"iopub.status.idle": "2024-01-10T00:13:15.807076Z",
"shell.execute_reply": "2024-01-10T00:13:15.806535Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>employee</th>\n",
" <th>group</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Bob</td>\n",
" <td>Accounting</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Jake</td>\n",
" <td>Engineering</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Lisa</td>\n",
" <td>Engineering</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Sue</td>\n",
" <td>HR</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" employee group\n",
"0 Bob Accounting\n",
"1 Jake Engineering\n",
"2 Lisa Engineering\n",
"3 Sue HR"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>employee</th>\n",
" <th>hire_date</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Lisa</td>\n",
" <td>2004</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Bob</td>\n",
" <td>2008</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Jake</td>\n",
" <td>2012</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Sue</td>\n",
" <td>2014</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" employee hire_date\n",
"0 Lisa 2004\n",
"1 Bob 2008\n",
"2 Jake 2012\n",
"3 Sue 2014"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>employee</th>\n",
" <th>group</th>\n",
" <th>hire_date</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Bob</td>\n",
" <td>Accounting</td>\n",
" <td>2008</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Jake</td>\n",
" <td>Engineering</td>\n",
" <td>2012</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Lisa</td>\n",
" <td>Engineering</td>\n",
" <td>2004</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Sue</td>\n",
" <td>HR</td>\n",
" <td>2014</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" employee group hire_date\n",
"0 Bob Accounting 2008\n",
"1 Jake Engineering 2012\n",
"2 Lisa Engineering 2004\n",
"3 Sue HR 2014"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"display(df1, df2, pd.merge(df1, df2, on='employee'))"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"#### The `left_on` and `right_on` keywords"
]
},
{
"cell_type": "code",
"execution_count": 77,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.809978Z",
"iopub.status.busy": "2024-01-10T00:13:15.809622Z",
"iopub.status.idle": "2024-01-10T00:13:15.823891Z",
"shell.execute_reply": "2024-01-10T00:13:15.823352Z"
},
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>employee</th>\n",
" <th>group</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Bob</td>\n",
" <td>Accounting</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Jake</td>\n",
" <td>Engineering</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Lisa</td>\n",
" <td>Engineering</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Sue</td>\n",
" <td>HR</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" employee group\n",
"0 Bob Accounting\n",
"1 Jake Engineering\n",
"2 Lisa Engineering\n",
"3 Sue HR"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>name</th>\n",
" <th>salary</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Bob</td>\n",
" <td>70000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Jake</td>\n",
" <td>80000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Lisa</td>\n",
" <td>120000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Sue</td>\n",
" <td>90000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" name salary\n",
"0 Bob 70000\n",
"1 Jake 80000\n",
"2 Lisa 120000\n",
"3 Sue 90000"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>employee</th>\n",
" <th>group</th>\n",
" <th>name</th>\n",
" <th>salary</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Bob</td>\n",
" <td>Accounting</td>\n",
" <td>Bob</td>\n",
" <td>70000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Jake</td>\n",
" <td>Engineering</td>\n",
" <td>Jake</td>\n",
" <td>80000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Lisa</td>\n",
" <td>Engineering</td>\n",
" <td>Lisa</td>\n",
" <td>120000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Sue</td>\n",
" <td>HR</td>\n",
" <td>Sue</td>\n",
" <td>90000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" employee group name salary\n",
"0 Bob Accounting Bob 70000\n",
"1 Jake Engineering Jake 80000\n",
"2 Lisa Engineering Lisa 120000\n",
"3 Sue HR Sue 90000"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"df3 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],\n",
" 'salary': [70000, 80000, 120000, 90000]})\n",
"display(df1, df3, pd.merge(df1, df3, left_on=\"employee\", right_on=\"name\"))"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"Employee and name both included now, so may want to drop one."
]
},
{
"cell_type": "code",
"execution_count": 78,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.826815Z",
"iopub.status.busy": "2024-01-10T00:13:15.826465Z",
"iopub.status.idle": "2024-01-10T00:13:15.834368Z",
"shell.execute_reply": "2024-01-10T00:13:15.833835Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>employee</th>\n",
" <th>group</th>\n",
" <th>name</th>\n",
" <th>salary</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Bob</td>\n",
" <td>Accounting</td>\n",
" <td>Bob</td>\n",
" <td>70000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Jake</td>\n",
" <td>Engineering</td>\n",
" <td>Jake</td>\n",
" <td>80000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Lisa</td>\n",
" <td>Engineering</td>\n",
" <td>Lisa</td>\n",
" <td>120000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Sue</td>\n",
" <td>HR</td>\n",
" <td>Sue</td>\n",
" <td>90000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" employee group name salary\n",
"0 Bob Accounting Bob 70000\n",
"1 Jake Engineering Jake 80000\n",
"2 Lisa Engineering Lisa 120000\n",
"3 Sue HR Sue 90000"
]
},
"execution_count": 78,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.merge(df1, df3, left_on=\"employee\", right_on=\"name\")"
]
},
{
"cell_type": "code",
"execution_count": 79,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.837233Z",
"iopub.status.busy": "2024-01-10T00:13:15.836883Z",
"iopub.status.idle": "2024-01-10T00:13:15.844804Z",
"shell.execute_reply": "2024-01-10T00:13:15.844272Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>employee</th>\n",
" <th>group</th>\n",
" <th>salary</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Bob</td>\n",
" <td>Accounting</td>\n",
" <td>70000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Jake</td>\n",
" <td>Engineering</td>\n",
" <td>80000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Lisa</td>\n",
" <td>Engineering</td>\n",
" <td>120000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Sue</td>\n",
" <td>HR</td>\n",
" <td>90000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" employee group salary\n",
"0 Bob Accounting 70000\n",
"1 Jake Engineering 80000\n",
"2 Lisa Engineering 120000\n",
"3 Sue HR 90000"
]
},
"execution_count": 79,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.merge(df1, df3, left_on=\"employee\", right_on=\"name\").drop('name', axis='columns')"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"#### The `left_index` and `right_index` keywords\n",
"\n",
"Often one wants to join on index."
]
},
{
"cell_type": "code",
"execution_count": 80,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.847768Z",
"iopub.status.busy": "2024-01-10T00:13:15.847396Z",
"iopub.status.idle": "2024-01-10T00:13:15.856391Z",
"shell.execute_reply": "2024-01-10T00:13:15.855859Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>group</th>\n",
" </tr>\n",
" <tr>\n",
" <th>employee</th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Bob</th>\n",
" <td>Accounting</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Jake</th>\n",
" <td>Engineering</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Lisa</th>\n",
" <td>Engineering</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Sue</th>\n",
" <td>HR</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" group\n",
"employee \n",
"Bob Accounting\n",
"Jake Engineering\n",
"Lisa Engineering\n",
"Sue HR"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>hire_date</th>\n",
" </tr>\n",
" <tr>\n",
" <th>employee</th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Lisa</th>\n",
" <td>2004</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Bob</th>\n",
" <td>2008</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Jake</th>\n",
" <td>2012</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Sue</th>\n",
" <td>2014</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" hire_date\n",
"employee \n",
"Lisa 2004\n",
"Bob 2008\n",
"Jake 2012\n",
"Sue 2014"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"df1a = df1.set_index('employee')\n",
"df2a = df2.set_index('employee')\n",
"display(df1a, df2a)"
]
},
{
"cell_type": "code",
"execution_count": 81,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.859251Z",
"iopub.status.busy": "2024-01-10T00:13:15.858901Z",
"iopub.status.idle": "2024-01-10T00:13:15.866112Z",
"shell.execute_reply": "2024-01-10T00:13:15.865585Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>group</th>\n",
" <th>hire_date</th>\n",
" </tr>\n",
" <tr>\n",
" <th>employee</th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Bob</th>\n",
" <td>Accounting</td>\n",
" <td>2008</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Jake</th>\n",
" <td>Engineering</td>\n",
" <td>2012</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Lisa</th>\n",
" <td>Engineering</td>\n",
" <td>2004</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Sue</th>\n",
" <td>HR</td>\n",
" <td>2014</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" group hire_date\n",
"employee \n",
"Bob Accounting 2008\n",
"Jake Engineering 2012\n",
"Lisa Engineering 2004\n",
"Sue HR 2014"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"display(pd.merge(df1a, df2a, left_index=True, right_index=True))"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"#### Set arithmetic for joins \n",
"\n",
"Have so far been considering relational joins based on *intersection* \n",
"(also called *inner* join)."
]
},
{
"cell_type": "code",
"execution_count": 82,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.869060Z",
"iopub.status.busy": "2024-01-10T00:13:15.868709Z",
"iopub.status.idle": "2024-01-10T00:13:15.885530Z",
"shell.execute_reply": "2024-01-10T00:13:15.884965Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>name</th>\n",
" <th>food</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Peter</td>\n",
" <td>fish</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Paul</td>\n",
" <td>beans</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Mary</td>\n",
" <td>bread</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" name food\n",
"0 Peter fish\n",
"1 Paul beans\n",
"2 Mary bread"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>name</th>\n",
" <th>drink</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Mary</td>\n",
" <td>wine</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Joseph</td>\n",
" <td>beer</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" name drink\n",
"0 Mary wine\n",
"1 Joseph beer"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>name</th>\n",
" <th>food</th>\n",
" <th>drink</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Mary</td>\n",
" <td>bread</td>\n",
" <td>wine</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" name food drink\n",
"0 Mary bread wine"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"df6 = pd.DataFrame({'name': ['Peter', 'Paul', 'Mary'],\n",
" 'food': ['fish', 'beans', 'bread']},\n",
" columns=['name', 'food'])\n",
"df7 = pd.DataFrame({'name': ['Mary', 'Joseph'],\n",
" 'drink': ['wine', 'beer']},\n",
" columns=['name', 'drink'])\n",
"display(df6, df7, pd.merge(df6, df7, how='inner'))"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"##### Outer join\n",
"\n",
"Can also join based on *union* (missing entries filled with NaNs)."
]
},
{
"cell_type": "code",
"execution_count": 83,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.888739Z",
"iopub.status.busy": "2024-01-10T00:13:15.888382Z",
"iopub.status.idle": "2024-01-10T00:13:15.903223Z",
"shell.execute_reply": "2024-01-10T00:13:15.902673Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>name</th>\n",
" <th>food</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Peter</td>\n",
" <td>fish</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Paul</td>\n",
" <td>beans</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Mary</td>\n",
" <td>bread</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" name food\n",
"0 Peter fish\n",
"1 Paul beans\n",
"2 Mary bread"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>name</th>\n",
" <th>drink</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Mary</td>\n",
" <td>wine</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Joseph</td>\n",
" <td>beer</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" name drink\n",
"0 Mary wine\n",
"1 Joseph beer"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>name</th>\n",
" <th>food</th>\n",
" <th>drink</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Peter</td>\n",
" <td>fish</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Paul</td>\n",
" <td>beans</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Mary</td>\n",
" <td>bread</td>\n",
" <td>wine</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Joseph</td>\n",
" <td>NaN</td>\n",
" <td>beer</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" name food drink\n",
"0 Peter fish NaN\n",
"1 Paul beans NaN\n",
"2 Mary bread wine\n",
"3 Joseph NaN beer"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"display(df6, df7, pd.merge(df6, df7, how='outer'))"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"##### Left and right join\n",
"\n",
"Can also join based on *left* or *right* entries."
]
},
{
"cell_type": "code",
"execution_count": 84,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.906173Z",
"iopub.status.busy": "2024-01-10T00:13:15.905822Z",
"iopub.status.idle": "2024-01-10T00:13:15.920315Z",
"shell.execute_reply": "2024-01-10T00:13:15.919750Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>name</th>\n",
" <th>food</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Peter</td>\n",
" <td>fish</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Paul</td>\n",
" <td>beans</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Mary</td>\n",
" <td>bread</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" name food\n",
"0 Peter fish\n",
"1 Paul beans\n",
"2 Mary bread"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>name</th>\n",
" <th>drink</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Mary</td>\n",
" <td>wine</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Joseph</td>\n",
" <td>beer</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" name drink\n",
"0 Mary wine\n",
"1 Joseph beer"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>name</th>\n",
" <th>food</th>\n",
" <th>drink</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Peter</td>\n",
" <td>fish</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Paul</td>\n",
" <td>beans</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Mary</td>\n",
" <td>bread</td>\n",
" <td>wine</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" name food drink\n",
"0 Peter fish NaN\n",
"1 Paul beans NaN\n",
"2 Mary bread wine"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"display(df6, df7, pd.merge(df6, df7, how='left'))"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"#### Overlapping column names\n",
"\n",
"Possible for DataFrames to have conflicting columns."
]
},
{
"cell_type": "code",
"execution_count": 85,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.923323Z",
"iopub.status.busy": "2024-01-10T00:13:15.922959Z",
"iopub.status.idle": "2024-01-10T00:13:15.933270Z",
"shell.execute_reply": "2024-01-10T00:13:15.932631Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>name</th>\n",
" <th>rank</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Bob</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Jake</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Lisa</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Sue</td>\n",
" <td>4</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" name rank\n",
"0 Bob 1\n",
"1 Jake 2\n",
"2 Lisa 3\n",
"3 Sue 4"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>name</th>\n",
" <th>rank</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Bob</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Jake</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Lisa</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Sue</td>\n",
" <td>2</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" name rank\n",
"0 Bob 3\n",
"1 Jake 1\n",
"2 Lisa 4\n",
"3 Sue 2"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"df8 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],\n",
" 'rank': [1, 2, 3, 4]})\n",
"df9 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],\n",
" 'rank': [3, 1, 4, 2]})\n",
"display(df8, df9)"
]
},
{
"cell_type": "code",
"execution_count": 86,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-10T00:13:15.937472Z",
"iopub.status.busy": "2024-01-10T00:13:15.936984Z",
"iopub.status.idle": "2024-01-10T00:13:15.946330Z",
"shell.execute_reply": "2024-01-10T00:13:15.945561Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>name</th>\n",
" <th>rank_L</th>\n",
" <th>rank_R</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Bob</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Jake</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Lisa</td>\n",
" <td>3</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Sue</td>\n",
" <td>4</td>\n",
" <td>2</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" name rank_L rank_R\n",
"0 Bob 1 3\n",
"1 Jake 2 1\n",
"2 Lisa 3 4\n",
"3 Sue 4 2"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"display(pd.merge(df8, df9, on=\"name\", suffixes=[\"_L\", \"_R\"]))"
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.18"
}
},
"nbformat": 4,
"nbformat_minor": 4
}