spce0038-machine-learning-w.../week1/slides/Lecture02_Pandas.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Lecture 2: Data wrangling with Pandas  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "![](https://www.tensorflow.org/images/colab_logo_32px.png)\n",
    "[Run in colab](https://colab.research.google.com/drive/1L7sAw22PfopC1z8ANRdgFCnJBXiZIdtI)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:14.768141Z",
     "iopub.status.busy": "2024-01-10T00:13:14.767528Z",
     "iopub.status.idle": "2024-01-10T00:13:14.775602Z",
     "shell.execute_reply": "2024-01-10T00:13:14.775063Z"
    },
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Last executed: 2024-01-10 00:13:14\n"
     ]
    }
   ],
   "source": [
    "import datetime\n",
    "now = datetime.datetime.now()\n",
    "print(\"Last executed: \" + now.strftime(\"%Y-%m-%d %H:%M:%S\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Why Pandas?\n",
    "\n",
    "[Pandas](https://pandas.pydata.org/) is a very useful package for data wrangling.\n",
    "\n",
    "Particularly useful when working with real data, which can be messy."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "Combines advantages of a number of different data structures (NumPy arrays, dictionaries, relational databases).\n",
    "\n",
    "Can also be more efficient than native Python data structures for certain operators (as we will see)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "Particularly useful for dealing with:\n",
    "- Labelled data\n",
    "- Missing data\n",
    "- Heteterogenous types\n",
    "- Groupings"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "We will focus mostly on Pandas `Series` and `DataFrame` objects."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "### Import Pandas"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:14.816280Z",
     "iopub.status.busy": "2024-01-10T00:13:14.815818Z",
     "iopub.status.idle": "2024-01-10T00:13:15.132633Z",
     "shell.execute_reply": "2024-01-10T00:13:15.131910Z"
    }
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "### Documentation\n",
    "\n",
    "Recall can check documentation with `pd?`, `pd.<TAB>`, and/or print documentation for specific function with `print(pd.<function_name>.__doc__)`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.136072Z",
     "iopub.status.busy": "2024-01-10T00:13:15.135754Z",
     "iopub.status.idle": "2024-01-10T00:13:15.139939Z",
     "shell.execute_reply": "2024-01-10T00:13:15.139326Z"
    }
   },
   "outputs": [],
   "source": [
    "#pd?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.142929Z",
     "iopub.status.busy": "2024-01-10T00:13:15.142560Z",
     "iopub.status.idle": "2024-01-10T00:13:15.145430Z",
     "shell.execute_reply": "2024-01-10T00:13:15.144796Z"
    }
   },
   "outputs": [],
   "source": [
    "#pd."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.148415Z",
     "iopub.status.busy": "2024-01-10T00:13:15.147981Z",
     "iopub.status.idle": "2024-01-10T00:13:15.152029Z",
     "shell.execute_reply": "2024-01-10T00:13:15.151417Z"
    }
   },
   "outputs": [],
   "source": [
    "#pd.concat?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.154836Z",
     "iopub.status.busy": "2024-01-10T00:13:15.154383Z",
     "iopub.status.idle": "2024-01-10T00:13:15.158458Z",
     "shell.execute_reply": "2024-01-10T00:13:15.157831Z"
    }
   },
   "outputs": [],
   "source": [
    "#print(pd.concat.__doc__)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Pandas `Series`\n",
    "\n",
    "A Pandas `Series` is a *1D* array of *indexed* data. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "Can be created from a list or array:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.161285Z",
     "iopub.status.busy": "2024-01-10T00:13:15.161073Z",
     "iopub.status.idle": "2024-01-10T00:13:15.171895Z",
     "shell.execute_reply": "2024-01-10T00:13:15.171212Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    0.25\n",
       "1    0.50\n",
       "2    0.75\n",
       "3    1.00\n",
       "dtype: float64"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data = pd.Series([0.25, 0.5, 0.75, 1.0])\n",
    "data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "The `Series` wraps both a sequence of *values* and a sequence of *indices*, which we can access with the `values` and `index` attributes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.174915Z",
     "iopub.status.busy": "2024-01-10T00:13:15.174556Z",
     "iopub.status.idle": "2024-01-10T00:13:15.181701Z",
     "shell.execute_reply": "2024-01-10T00:13:15.181103Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    0.25\n",
       "1    0.50\n",
       "2    0.75\n",
       "3    1.00\n",
       "dtype: float64"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.184468Z",
     "iopub.status.busy": "2024-01-10T00:13:15.184105Z",
     "iopub.status.idle": "2024-01-10T00:13:15.188580Z",
     "shell.execute_reply": "2024-01-10T00:13:15.187941Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([0.25, 0.5 , 0.75, 1.  ])"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.192321Z",
     "iopub.status.busy": "2024-01-10T00:13:15.191893Z",
     "iopub.status.idle": "2024-01-10T00:13:15.196459Z",
     "shell.execute_reply": "2024-01-10T00:13:15.195837Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "RangeIndex(start=0, stop=4, step=1)"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.index"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "### `Series` as generalized NumPy array  \n",
    "\n",
    "Values are simply NumPy array.\n",
    "\n",
    "Index need not be an integer, but can consist of values of any desired type. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.199711Z",
     "iopub.status.busy": "2024-01-10T00:13:15.199076Z",
     "iopub.status.idle": "2024-01-10T00:13:15.204940Z",
     "shell.execute_reply": "2024-01-10T00:13:15.204319Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "a    0.25\n",
       "b    0.50\n",
       "c    0.75\n",
       "d    1.00\n",
       "dtype: float64"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data = pd.Series([0.25, 0.5, 0.75, 1.0],\n",
    "                 index=['a', 'b', 'c', 'd'])\n",
    "data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.207914Z",
     "iopub.status.busy": "2024-01-10T00:13:15.207336Z",
     "iopub.status.idle": "2024-01-10T00:13:15.211637Z",
     "shell.execute_reply": "2024-01-10T00:13:15.210997Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.5"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data['b']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "### `Series` as specialized dictionary\n",
    "\n",
    "Can also think of a Pandas `Series` like a specialization of a Python dictionary."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.214449Z",
     "iopub.status.busy": "2024-01-10T00:13:15.214085Z",
     "iopub.status.idle": "2024-01-10T00:13:15.217795Z",
     "shell.execute_reply": "2024-01-10T00:13:15.217150Z"
    }
   },
   "outputs": [],
   "source": [
    "population_dict = {'California': 38332521,\n",
    "                   'Texas': 26448193,\n",
    "                   'New York': 19651127,\n",
    "                   'Florida': 19552860,\n",
    "                   'Illinois': 12882135}\n",
    "population = pd.Series(population_dict) # Instantiate from dictionary"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.220887Z",
     "iopub.status.busy": "2024-01-10T00:13:15.220288Z",
     "iopub.status.idle": "2024-01-10T00:13:15.224757Z",
     "shell.execute_reply": "2024-01-10T00:13:15.224118Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(dict, pandas.core.series.Series)"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "type(population_dict), type(population)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.228122Z",
     "iopub.status.busy": "2024-01-10T00:13:15.227555Z",
     "iopub.status.idle": "2024-01-10T00:13:15.231762Z",
     "shell.execute_reply": "2024-01-10T00:13:15.231241Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "38332521"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "population['California']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "- Python dictionary: maps *arbitrary* keys to *arbitrary* values.\n",
    "- Pandas `Series`: maps *typed* indices to *typed* values."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "Type information of Pandas `Series` makes it much more efficient than Python dictionaries for certain operations."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Pandas `DataFrame`\n",
    "\n",
    "`DataFrame` can be thought of as a sequence of aligned `Series` objects, with *indices* and *columns*."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.234800Z",
     "iopub.status.busy": "2024-01-10T00:13:15.234218Z",
     "iopub.status.idle": "2024-01-10T00:13:15.242493Z",
     "shell.execute_reply": "2024-01-10T00:13:15.241954Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>foo</th>\n",
       "      <th>bar</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>a</th>\n",
       "      <td>0.150862</td>\n",
       "      <td>0.234217</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>b</th>\n",
       "      <td>0.667193</td>\n",
       "      <td>0.961023</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>c</th>\n",
       "      <td>0.030670</td>\n",
       "      <td>0.910813</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        foo       bar\n",
       "a  0.150862  0.234217\n",
       "b  0.667193  0.961023\n",
       "c  0.030670  0.910813"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.DataFrame(np.random.rand(3, 2),\n",
    "             columns=['foo', 'bar'],\n",
    "             index=['a', 'b', 'c'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "### `DataFrame` as generalized NumPy array\n",
    "\n",
    "`DataFrame` is an analog of a two-dimensional array with both flexible row indices and flexible column names."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "Contstruct another `Series` with same indices."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.245372Z",
     "iopub.status.busy": "2024-01-10T00:13:15.245019Z",
     "iopub.status.idle": "2024-01-10T00:13:15.248490Z",
     "shell.execute_reply": "2024-01-10T00:13:15.247931Z"
    }
   },
   "outputs": [],
   "source": [
    "area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,\n",
    "             'Florida': 170312, 'Illinois': 149995}\n",
    "area = pd.Series(area_dict)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "Combine two `Series` into a `DataFrame`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.251243Z",
     "iopub.status.busy": "2024-01-10T00:13:15.250893Z",
     "iopub.status.idle": "2024-01-10T00:13:15.257216Z",
     "shell.execute_reply": "2024-01-10T00:13:15.256698Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>population</th>\n",
       "      <th>area</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>California</th>\n",
       "      <td>38332521</td>\n",
       "      <td>423967</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Texas</th>\n",
       "      <td>26448193</td>\n",
       "      <td>695662</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>New York</th>\n",
       "      <td>19651127</td>\n",
       "      <td>141297</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Florida</th>\n",
       "      <td>19552860</td>\n",
       "      <td>170312</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Illinois</th>\n",
       "      <td>12882135</td>\n",
       "      <td>149995</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            population    area\n",
       "California    38332521  423967\n",
       "Texas         26448193  695662\n",
       "New York      19651127  141297\n",
       "Florida       19552860  170312\n",
       "Illinois      12882135  149995"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "states = pd.DataFrame({'population': population,\n",
    "                       'area': area})\n",
    "states"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "`DataFrame` has both `index` and `column` attributes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.260053Z",
     "iopub.status.busy": "2024-01-10T00:13:15.259693Z",
     "iopub.status.idle": "2024-01-10T00:13:15.263625Z",
     "shell.execute_reply": "2024-01-10T00:13:15.263113Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "states.index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.266338Z",
     "iopub.status.busy": "2024-01-10T00:13:15.265998Z",
     "iopub.status.idle": "2024-01-10T00:13:15.269838Z",
     "shell.execute_reply": "2024-01-10T00:13:15.269330Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['population', 'area'], dtype='object')"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "states.columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "### `DataFrame` as specialized dictionary\n",
    "\n",
    "Can also think of a Pandas `DataFrame` like a specialization of a Python dictionary."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "`DataFrame` maps a column name to a `Series`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.272676Z",
     "iopub.status.busy": "2024-01-10T00:13:15.272320Z",
     "iopub.status.idle": "2024-01-10T00:13:15.276748Z",
     "shell.execute_reply": "2024-01-10T00:13:15.276239Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "California    423967\n",
       "Texas         695662\n",
       "New York      141297\n",
       "Florida       170312\n",
       "Illinois      149995\n",
       "Name: area, dtype: int64"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "states['area']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.279494Z",
     "iopub.status.busy": "2024-01-10T00:13:15.279148Z",
     "iopub.status.idle": "2024-01-10T00:13:15.282910Z",
     "shell.execute_reply": "2024-01-10T00:13:15.282377Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "pandas.core.series.Series"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "type(states['area'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Pandas `Index`\n",
    "\n",
    "Both Pandas `Series` and `DataFrame` contain `Index` object(s).\n",
    "\n",
    "Can be thought of as *immutable array* (i.e. cannot be changed) or *ordered multi-set* (may contain repeated values)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.285632Z",
     "iopub.status.busy": "2024-01-10T00:13:15.285291Z",
     "iopub.status.idle": "2024-01-10T00:13:15.289529Z",
     "shell.execute_reply": "2024-01-10T00:13:15.289021Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index([2, 3, 5, 7, 11], dtype='int64')"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ind = pd.Index([2, 3, 5, 7, 11])\n",
    "ind  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "### `Index` as immutable array\n",
    "\n",
    "Immutability makes it safer to share indices between multiple `DataFrames`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.292351Z",
     "iopub.status.busy": "2024-01-10T00:13:15.291994Z",
     "iopub.status.idle": "2024-01-10T00:13:15.295778Z",
     "shell.execute_reply": "2024-01-10T00:13:15.295270Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "3"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ind[1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.298395Z",
     "iopub.status.busy": "2024-01-10T00:13:15.298061Z",
     "iopub.status.idle": "2024-01-10T00:13:15.300547Z",
     "shell.execute_reply": "2024-01-10T00:13:15.300030Z"
    }
   },
   "outputs": [],
   "source": [
    "#ind[1] = 0"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "### `Index` as ordered multi-set\n",
    "\n",
    "`Index` objects support many set operations, e.g. joins, unions, intersections, differences."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "### Example: Compute the intersection and union of the following two `Index` objects."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.303350Z",
     "iopub.status.busy": "2024-01-10T00:13:15.303007Z",
     "iopub.status.idle": "2024-01-10T00:13:15.306205Z",
     "shell.execute_reply": "2024-01-10T00:13:15.305664Z"
    }
   },
   "outputs": [],
   "source": [
    "indA = pd.Index([1, 3, 5, 7, 9])\n",
    "indB = pd.Index([2, 3, 5, 7, 11]) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.308868Z",
     "iopub.status.busy": "2024-01-10T00:13:15.308513Z",
     "iopub.status.idle": "2024-01-10T00:13:15.312952Z",
     "shell.execute_reply": "2024-01-10T00:13:15.312424Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index([3, 5, 7], dtype='int64')"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "indA.intersection(indB)  # intersection"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.315736Z",
     "iopub.status.busy": "2024-01-10T00:13:15.315378Z",
     "iopub.status.idle": "2024-01-10T00:13:15.319554Z",
     "shell.execute_reply": "2024-01-10T00:13:15.319039Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "indA.union(indB)  # union"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Data indexing and selection"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "### Data selection in a `Series`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In additional to acting like a dictionary, a `Series` also provies array-style selection like NumPy arrays."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.322458Z",
     "iopub.status.busy": "2024-01-10T00:13:15.322095Z",
     "iopub.status.idle": "2024-01-10T00:13:15.327409Z",
     "shell.execute_reply": "2024-01-10T00:13:15.326886Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "a    0.25\n",
       "b    0.50\n",
       "c    0.75\n",
       "d    1.00\n",
       "dtype: float64"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data = pd.Series([0.25, 0.5, 0.75, 1.0],\n",
    "                 index=['a', 'b', 'c', 'd'])\n",
    "data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.330109Z",
     "iopub.status.busy": "2024-01-10T00:13:15.329744Z",
     "iopub.status.idle": "2024-01-10T00:13:15.334772Z",
     "shell.execute_reply": "2024-01-10T00:13:15.334235Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "a    0.25\n",
       "b    0.50\n",
       "c    0.75\n",
       "dtype: float64"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# slicing by explicit index\n",
    "data['a':'c']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.337529Z",
     "iopub.status.busy": "2024-01-10T00:13:15.337173Z",
     "iopub.status.idle": "2024-01-10T00:13:15.341776Z",
     "shell.execute_reply": "2024-01-10T00:13:15.341267Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "a    0.25\n",
       "b    0.50\n",
       "dtype: float64"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# slicing by implicit integer index\n",
    "data[0:2]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "When slicing by an explicit index (e.g. `data['a':'c']`), the final index *is* included.\n",
    "\n",
    "When slicing by an implicit index (e.g. `data[0:2]`), the final index *is not* included.\n",
    "\n",
    "This can be a source of much confusion."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "Consider a Series with integer indices."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.344695Z",
     "iopub.status.busy": "2024-01-10T00:13:15.344351Z",
     "iopub.status.idle": "2024-01-10T00:13:15.349397Z",
     "shell.execute_reply": "2024-01-10T00:13:15.348877Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1    a\n",
       "3    b\n",
       "5    c\n",
       "dtype: object"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])\n",
    "data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.352122Z",
     "iopub.status.busy": "2024-01-10T00:13:15.351783Z",
     "iopub.status.idle": "2024-01-10T00:13:15.355483Z",
     "shell.execute_reply": "2024-01-10T00:13:15.354953Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'a'"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# explicit index when indexing\n",
    "data[1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.358468Z",
     "iopub.status.busy": "2024-01-10T00:13:15.358085Z",
     "iopub.status.idle": "2024-01-10T00:13:15.362748Z",
     "shell.execute_reply": "2024-01-10T00:13:15.362199Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "3    b\n",
       "5    c\n",
       "dtype: object"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# implicit index when slicing\n",
    "data[1:3]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "#### Indexers\n",
    "\n",
    "Indexers `loc` (explicit) and `iloc` (implicit) are introduced to avoid confusion."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.365530Z",
     "iopub.status.busy": "2024-01-10T00:13:15.365179Z",
     "iopub.status.idle": "2024-01-10T00:13:15.369614Z",
     "shell.execute_reply": "2024-01-10T00:13:15.369097Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1    a\n",
       "3    b\n",
       "5    c\n",
       "dtype: object"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.372452Z",
     "iopub.status.busy": "2024-01-10T00:13:15.372086Z",
     "iopub.status.idle": "2024-01-10T00:13:15.375935Z",
     "shell.execute_reply": "2024-01-10T00:13:15.375418Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'a'"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.loc[1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.378492Z",
     "iopub.status.busy": "2024-01-10T00:13:15.378208Z",
     "iopub.status.idle": "2024-01-10T00:13:15.382914Z",
     "shell.execute_reply": "2024-01-10T00:13:15.382378Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1    a\n",
       "3    b\n",
       "dtype: object"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.loc[1:3]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.385667Z",
     "iopub.status.busy": "2024-01-10T00:13:15.385325Z",
     "iopub.status.idle": "2024-01-10T00:13:15.389111Z",
     "shell.execute_reply": "2024-01-10T00:13:15.388596Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'b'"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.iloc[1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.391891Z",
     "iopub.status.busy": "2024-01-10T00:13:15.391527Z",
     "iopub.status.idle": "2024-01-10T00:13:15.396244Z",
     "shell.execute_reply": "2024-01-10T00:13:15.395728Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "3    b\n",
       "5    c\n",
       "dtype: object"
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.iloc[1:3]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "### Data selection in a `DataFrame`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In additional to acting like a dictionary of `Series` objects with the same index, a `DataFrame` also provies array-style selection like NumPy arrays."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.399127Z",
     "iopub.status.busy": "2024-01-10T00:13:15.398770Z",
     "iopub.status.idle": "2024-01-10T00:13:15.406757Z",
     "shell.execute_reply": "2024-01-10T00:13:15.406203Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>area</th>\n",
       "      <th>population</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>California</th>\n",
       "      <td>423967</td>\n",
       "      <td>38332521</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Texas</th>\n",
       "      <td>695662</td>\n",
       "      <td>26448193</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>New York</th>\n",
       "      <td>141297</td>\n",
       "      <td>19651127</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Florida</th>\n",
       "      <td>170312</td>\n",
       "      <td>19552860</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Illinois</th>\n",
       "      <td>149995</td>\n",
       "      <td>12882135</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              area  population\n",
       "California  423967    38332521\n",
       "Texas       695662    26448193\n",
       "New York    141297    19651127\n",
       "Florida     170312    19552860\n",
       "Illinois    149995    12882135"
      ]
     },
     "execution_count": 40,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "area = pd.Series({'California': 423967, 'Texas': 695662,\n",
    "                  'New York': 141297, 'Florida': 170312,\n",
    "                  'Illinois': 149995})\n",
    "pop = pd.Series({'California': 38332521, 'Texas': 26448193,\n",
    "                 'New York': 19651127, 'Florida': 19552860,\n",
    "                 'Illinois': 12882135})\n",
    "data = pd.DataFrame({'area':area, 'population':pop})\n",
    "data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "#### Indexers\n",
    "\n",
    "Indexers `loc` (explicit) and `iloc` (implicit) are also available to avoid confusion when selecting data.\n",
    "\n",
    "Note that index and column labels are preserved in the result."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.409634Z",
     "iopub.status.busy": "2024-01-10T00:13:15.409286Z",
     "iopub.status.idle": "2024-01-10T00:13:15.414919Z",
     "shell.execute_reply": "2024-01-10T00:13:15.414384Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>area</th>\n",
       "      <th>population</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>California</th>\n",
       "      <td>423967</td>\n",
       "      <td>38332521</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Texas</th>\n",
       "      <td>695662</td>\n",
       "      <td>26448193</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>New York</th>\n",
       "      <td>141297</td>\n",
       "      <td>19651127</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Florida</th>\n",
       "      <td>170312</td>\n",
       "      <td>19552860</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Illinois</th>\n",
       "      <td>149995</td>\n",
       "      <td>12882135</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              area  population\n",
       "California  423967    38332521\n",
       "Texas       695662    26448193\n",
       "New York    141297    19651127\n",
       "Florida     170312    19552860\n",
       "Illinois    149995    12882135"
      ]
     },
     "execution_count": 41,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.417710Z",
     "iopub.status.busy": "2024-01-10T00:13:15.417363Z",
     "iopub.status.idle": "2024-01-10T00:13:15.423392Z",
     "shell.execute_reply": "2024-01-10T00:13:15.422870Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>area</th>\n",
       "      <th>population</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>California</th>\n",
       "      <td>423967</td>\n",
       "      <td>38332521</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Texas</th>\n",
       "      <td>695662</td>\n",
       "      <td>26448193</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>New York</th>\n",
       "      <td>141297</td>\n",
       "      <td>19651127</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              area  population\n",
       "California  423967    38332521\n",
       "Texas       695662    26448193\n",
       "New York    141297    19651127"
      ]
     },
     "execution_count": 42,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.iloc[:3, :2]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.426169Z",
     "iopub.status.busy": "2024-01-10T00:13:15.425803Z",
     "iopub.status.idle": "2024-01-10T00:13:15.431850Z",
     "shell.execute_reply": "2024-01-10T00:13:15.431323Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>area</th>\n",
       "      <th>population</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>California</th>\n",
       "      <td>423967</td>\n",
       "      <td>38332521</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Texas</th>\n",
       "      <td>695662</td>\n",
       "      <td>26448193</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>New York</th>\n",
       "      <td>141297</td>\n",
       "      <td>19651127</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Florida</th>\n",
       "      <td>170312</td>\n",
       "      <td>19552860</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Illinois</th>\n",
       "      <td>149995</td>\n",
       "      <td>12882135</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              area  population\n",
       "California  423967    38332521\n",
       "Texas       695662    26448193\n",
       "New York    141297    19651127\n",
       "Florida     170312    19552860\n",
       "Illinois    149995    12882135"
      ]
     },
     "execution_count": 43,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.loc[:'Illinois', :'population']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "### Additional array-style selection \n",
    "\n",
    "Other NumPy selection approaches can also be applied (e.g. masking)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    },
    "tags": [
     "exercise_pointer"
    ]
   },
   "source": [
    "**Exercises:** *You can now complete Exercise 1 in the exercises associated with this lecture.*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Operating on data in Pandas"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Elementwise operations in Pandas automatically aligns indices and preserves index/column labels.\n",
    "\n",
    "Can avoid many errors and bugs in data wrangling."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "### Index preservation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.435183Z",
     "iopub.status.busy": "2024-01-10T00:13:15.434837Z",
     "iopub.status.idle": "2024-01-10T00:13:15.440045Z",
     "shell.execute_reply": "2024-01-10T00:13:15.439526Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    6\n",
       "1    3\n",
       "2    7\n",
       "3    4\n",
       "dtype: int64"
      ]
     },
     "execution_count": 44,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "rng = np.random.RandomState(42)\n",
    "ser = pd.Series(rng.randint(0, 10, 4))\n",
    "ser"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.442924Z",
     "iopub.status.busy": "2024-01-10T00:13:15.442580Z",
     "iopub.status.idle": "2024-01-10T00:13:15.447315Z",
     "shell.execute_reply": "2024-01-10T00:13:15.446795Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0     403.428793\n",
       "1      20.085537\n",
       "2    1096.633158\n",
       "3      54.598150\n",
       "dtype: float64"
      ]
     },
     "execution_count": 45,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.exp(ser)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.450058Z",
     "iopub.status.busy": "2024-01-10T00:13:15.449718Z",
     "iopub.status.idle": "2024-01-10T00:13:15.456731Z",
     "shell.execute_reply": "2024-01-10T00:13:15.456205Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>A</th>\n",
       "      <th>B</th>\n",
       "      <th>C</th>\n",
       "      <th>D</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>6</td>\n",
       "      <td>9</td>\n",
       "      <td>2</td>\n",
       "      <td>6</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>7</td>\n",
       "      <td>4</td>\n",
       "      <td>3</td>\n",
       "      <td>7</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>7</td>\n",
       "      <td>2</td>\n",
       "      <td>5</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   A  B  C  D\n",
       "0  6  9  2  6\n",
       "1  7  4  3  7\n",
       "2  7  2  5  4"
      ]
     },
     "execution_count": 46,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.DataFrame(rng.randint(0, 10, (3, 4)),\n",
    "                  columns=['A', 'B', 'C', 'D'])\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.459559Z",
     "iopub.status.busy": "2024-01-10T00:13:15.459213Z",
     "iopub.status.idle": "2024-01-10T00:13:15.466329Z",
     "shell.execute_reply": "2024-01-10T00:13:15.465808Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>A</th>\n",
       "      <th>B</th>\n",
       "      <th>C</th>\n",
       "      <th>D</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>403.428793</td>\n",
       "      <td>8103.083928</td>\n",
       "      <td>7.389056</td>\n",
       "      <td>403.428793</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1096.633158</td>\n",
       "      <td>54.598150</td>\n",
       "      <td>20.085537</td>\n",
       "      <td>1096.633158</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1096.633158</td>\n",
       "      <td>7.389056</td>\n",
       "      <td>148.413159</td>\n",
       "      <td>54.598150</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "             A            B           C            D\n",
       "0   403.428793  8103.083928    7.389056   403.428793\n",
       "1  1096.633158    54.598150   20.085537  1096.633158\n",
       "2  1096.633158     7.389056  148.413159    54.598150"
      ]
     },
     "execution_count": 47,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.exp(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    },
    "tags": [
     "exercise_pointer"
    ]
   },
   "source": [
    "**Exercises:** *You can now complete Exercise 2 in the exercises associated with this lecture.*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "### Index alignment"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "Index alignment works similarly for `DataFrames`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.469283Z",
     "iopub.status.busy": "2024-01-10T00:13:15.468937Z",
     "iopub.status.idle": "2024-01-10T00:13:15.475078Z",
     "shell.execute_reply": "2024-01-10T00:13:15.474536Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>A</th>\n",
       "      <th>B</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>11</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>5</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   A   B\n",
       "0  1  11\n",
       "1  5   1"
      ]
     },
     "execution_count": 48,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "A = pd.DataFrame(rng.randint(0, 20, (2, 2)),\n",
    "                 columns=list('AB'))\n",
    "A"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.478954Z",
     "iopub.status.busy": "2024-01-10T00:13:15.478608Z",
     "iopub.status.idle": "2024-01-10T00:13:15.484998Z",
     "shell.execute_reply": "2024-01-10T00:13:15.484464Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>B</th>\n",
       "      <th>A</th>\n",
       "      <th>C</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>4</td>\n",
       "      <td>0</td>\n",
       "      <td>9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>5</td>\n",
       "      <td>8</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>9</td>\n",
       "      <td>2</td>\n",
       "      <td>6</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   B  A  C\n",
       "0  4  0  9\n",
       "1  5  8  0\n",
       "2  9  2  6"
      ]
     },
     "execution_count": 49,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "B = pd.DataFrame(rng.randint(0, 10, (3, 3)),\n",
    "                 columns=list('BAC'))\n",
    "B"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.488869Z",
     "iopub.status.busy": "2024-01-10T00:13:15.488521Z",
     "iopub.status.idle": "2024-01-10T00:13:15.497638Z",
     "shell.execute_reply": "2024-01-10T00:13:15.497097Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>A</th>\n",
       "      <th>B</th>\n",
       "      <th>C</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1.0</td>\n",
       "      <td>15.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>13.0</td>\n",
       "      <td>6.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      A     B   C\n",
       "0   1.0  15.0 NaN\n",
       "1  13.0   6.0 NaN\n",
       "2   NaN   NaN NaN"
      ]
     },
     "execution_count": 50,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "A + B"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "### Operations between `DataFrame` and `Series` objects"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.505677Z",
     "iopub.status.busy": "2024-01-10T00:13:15.505241Z",
     "iopub.status.idle": "2024-01-10T00:13:15.512507Z",
     "shell.execute_reply": "2024-01-10T00:13:15.511979Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Q</th>\n",
       "      <th>R</th>\n",
       "      <th>S</th>\n",
       "      <th>T</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>3</td>\n",
       "      <td>8</td>\n",
       "      <td>2</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>6</td>\n",
       "      <td>4</td>\n",
       "      <td>8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>6</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>8</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Q  R  S  T\n",
       "0  3  8  2  4\n",
       "1  2  6  4  8\n",
       "2  6  1  3  8"
      ]
     },
     "execution_count": 51,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "A = rng.randint(10, size=(3, 4))\n",
    "df = pd.DataFrame(A, columns=list('QRST'))\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.515360Z",
     "iopub.status.busy": "2024-01-10T00:13:15.515014Z",
     "iopub.status.idle": "2024-01-10T00:13:15.519595Z",
     "shell.execute_reply": "2024-01-10T00:13:15.519059Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Q    3\n",
       "R    8\n",
       "S    2\n",
       "T    4\n",
       "Name: 0, dtype: int64"
      ]
     },
     "execution_count": 52,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "s = df.iloc[0]\n",
    "s"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "Difference between the `DataFrame` and `Series`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.522471Z",
     "iopub.status.busy": "2024-01-10T00:13:15.522110Z",
     "iopub.status.idle": "2024-01-10T00:13:15.528586Z",
     "shell.execute_reply": "2024-01-10T00:13:15.528055Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Q</th>\n",
       "      <th>R</th>\n",
       "      <th>S</th>\n",
       "      <th>T</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>-1</td>\n",
       "      <td>-2</td>\n",
       "      <td>2</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>-7</td>\n",
       "      <td>1</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Q  R  S  T\n",
       "0  0  0  0  0\n",
       "1 -1 -2  2  4\n",
       "2  3 -7  1  4"
      ]
     },
     "execution_count": 53,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df - s"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Convention is to operate row-wise."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "Can also operate column-wise using object methods."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.532195Z",
     "iopub.status.busy": "2024-01-10T00:13:15.531844Z",
     "iopub.status.idle": "2024-01-10T00:13:15.538629Z",
     "shell.execute_reply": "2024-01-10T00:13:15.538075Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Q</th>\n",
       "      <th>R</th>\n",
       "      <th>S</th>\n",
       "      <th>T</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>-5</td>\n",
       "      <td>0</td>\n",
       "      <td>-6</td>\n",
       "      <td>-4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>-4</td>\n",
       "      <td>0</td>\n",
       "      <td>-2</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>5</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>7</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Q  R  S  T\n",
       "0 -5  0 -6 -4\n",
       "1 -4  0 -2  2\n",
       "2  5  0  2  7"
      ]
     },
     "execution_count": 54,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.subtract(df['R'], axis=0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Handling missing data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Real data is messy.  Often some data are missing."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "Various conventions can be considered to handle missing data.\n",
    "\n",
    "We will focus on the use of the floating point IEEE value NaN (not a number) to represent missing data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "Pandas interprets NaN as Null values.\n",
    "\n",
    "(Pandas also supports `None` but we will focus on NaN here.)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "Arithematic operations with NaN values result in NaN."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.542302Z",
     "iopub.status.busy": "2024-01-10T00:13:15.541956Z",
     "iopub.status.idle": "2024-01-10T00:13:15.545706Z",
     "shell.execute_reply": "2024-01-10T00:13:15.545196Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "nan"
      ]
     },
     "execution_count": 55,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "1 + np.nan"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "### Operating on Null values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Several useful methods exist to work with NaNs, for example to detect, drop or replace:\n",
    "- `isnull()`: Generate a boolean mask indicating missing values.\n",
    "- `notnull()`: Opposite of `isnull()`.\n",
    "- `dropna()`: Return a filtered version of the data.\n",
    "- `fillna()`: Return a copy of the data with missing values filled."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "### Detecting null values\n",
    "\n",
    "Pandas `isnull` and `notnull` are useful for detecting null values."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    },
    "tags": [
     "exercise_pointer"
    ]
   },
   "source": [
    "**Exercises:** *You can now complete Exercise 3 in the exercises associated with this lecture.*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "## Dropping null values\n",
    "\n",
    "Direct routines may be used to drop null values (i.e. `dropna`), rather than constructing masks as performed above."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    },
    "tags": [
     "exercise_pointer"
    ]
   },
   "source": [
    "**Exercises:** *You can now complete Exercise 4 in the exercises associated with this lecture.*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "### Dropping null values from `DataFrames`\n",
    "\n",
    "For `DataFrames`, there are multiple ways null values can be dropped."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.549490Z",
     "iopub.status.busy": "2024-01-10T00:13:15.549132Z",
     "iopub.status.idle": "2024-01-10T00:13:15.556771Z",
     "shell.execute_reply": "2024-01-10T00:13:15.556194Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>NaN</td>\n",
       "      <td>4.0</td>\n",
       "      <td>6</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     0    1  2\n",
       "0  1.0  NaN  2\n",
       "1  2.0  3.0  5\n",
       "2  NaN  4.0  6"
      ]
     },
     "execution_count": 56,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.DataFrame([[1,      np.nan, 2],\n",
    "                   [2,      3,      5],\n",
    "                   [np.nan, 4,      6]])\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "By default `dropna` operates row-wise and drops all rows that contain any NaNs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.559929Z",
     "iopub.status.busy": "2024-01-10T00:13:15.559542Z",
     "iopub.status.idle": "2024-01-10T00:13:15.566641Z",
     "shell.execute_reply": "2024-01-10T00:13:15.566094Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     0    1  2\n",
       "1  2.0  3.0  5"
      ]
     },
     "execution_count": 57,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.dropna()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "Can also operate column-wise."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.569601Z",
     "iopub.status.busy": "2024-01-10T00:13:15.569257Z",
     "iopub.status.idle": "2024-01-10T00:13:15.575786Z",
     "shell.execute_reply": "2024-01-10T00:13:15.575259Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>NaN</td>\n",
       "      <td>4.0</td>\n",
       "      <td>6</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     0    1  2\n",
       "0  1.0  NaN  2\n",
       "1  2.0  3.0  5\n",
       "2  NaN  4.0  6"
      ]
     },
     "execution_count": 58,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.578790Z",
     "iopub.status.busy": "2024-01-10T00:13:15.578414Z",
     "iopub.status.idle": "2024-01-10T00:13:15.584880Z",
     "shell.execute_reply": "2024-01-10T00:13:15.584357Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>6</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   2\n",
       "0  2\n",
       "1  5\n",
       "2  6"
      ]
     },
     "execution_count": 59,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.dropna(axis='columns')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "More sophisticated approaches can also be considered (e.g. only dropping rows/columns if all entries or a certain number of NaNs appear)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "### Replacing null values\n",
    "\n",
    "Null values can be easily replaced using `fillna`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.588227Z",
     "iopub.status.busy": "2024-01-10T00:13:15.587864Z",
     "iopub.status.idle": "2024-01-10T00:13:15.594332Z",
     "shell.execute_reply": "2024-01-10T00:13:15.593819Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>NaN</td>\n",
       "      <td>4.0</td>\n",
       "      <td>6</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     0    1  2\n",
       "0  1.0  NaN  2\n",
       "1  2.0  3.0  5\n",
       "2  NaN  4.0  6"
      ]
     },
     "execution_count": 60,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.597648Z",
     "iopub.status.busy": "2024-01-10T00:13:15.597298Z",
     "iopub.status.idle": "2024-01-10T00:13:15.604154Z",
     "shell.execute_reply": "2024-01-10T00:13:15.603634Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>6</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     0    1  2\n",
       "0  1.0  0.0  2\n",
       "1  2.0  3.0  5\n",
       "2  0.0  4.0  6"
      ]
     },
     "execution_count": 61,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.fillna(0.0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "Can also fill using adjacent values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.607243Z",
     "iopub.status.busy": "2024-01-10T00:13:15.606901Z",
     "iopub.status.idle": "2024-01-10T00:13:15.613298Z",
     "shell.execute_reply": "2024-01-10T00:13:15.612778Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>NaN</td>\n",
       "      <td>4.0</td>\n",
       "      <td>6</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     0    1  2\n",
       "0  1.0  NaN  2\n",
       "1  2.0  3.0  5\n",
       "2  NaN  4.0  6"
      ]
     },
     "execution_count": 62,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.615998Z",
     "iopub.status.busy": "2024-01-10T00:13:15.615699Z",
     "iopub.status.idle": "2024-01-10T00:13:15.623380Z",
     "shell.execute_reply": "2024-01-10T00:13:15.622863Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>2.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>5.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>NaN</td>\n",
       "      <td>4.0</td>\n",
       "      <td>6.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     0    1    2\n",
       "0  1.0  1.0  2.0\n",
       "1  2.0  3.0  5.0\n",
       "2  NaN  4.0  6.0"
      ]
     },
     "execution_count": 63,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.fillna(method='ffill', axis='columns')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.626132Z",
     "iopub.status.busy": "2024-01-10T00:13:15.625790Z",
     "iopub.status.idle": "2024-01-10T00:13:15.632720Z",
     "shell.execute_reply": "2024-01-10T00:13:15.632070Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>6</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     0    1  2\n",
       "0  1.0  NaN  2\n",
       "1  2.0  3.0  5\n",
       "2  2.0  4.0  6"
      ]
     },
     "execution_count": 64,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.fillna(method='ffill', axis='rows')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.635515Z",
     "iopub.status.busy": "2024-01-10T00:13:15.635170Z",
     "iopub.status.idle": "2024-01-10T00:13:15.642851Z",
     "shell.execute_reply": "2024-01-10T00:13:15.642316Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>2.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>5.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>4.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>6.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     0    1    2\n",
       "0  1.0  2.0  2.0\n",
       "1  2.0  3.0  5.0\n",
       "2  4.0  4.0  6.0"
      ]
     },
     "execution_count": 65,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.fillna(method='bfill', axis='columns')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Combining data-sets"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "### Define helper functions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.645704Z",
     "iopub.status.busy": "2024-01-10T00:13:15.645362Z",
     "iopub.status.idle": "2024-01-10T00:13:15.653232Z",
     "shell.execute_reply": "2024-01-10T00:13:15.652717Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>A</th>\n",
       "      <th>B</th>\n",
       "      <th>C</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>A0</td>\n",
       "      <td>B0</td>\n",
       "      <td>C0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>A1</td>\n",
       "      <td>B1</td>\n",
       "      <td>C1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>A2</td>\n",
       "      <td>B2</td>\n",
       "      <td>C2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    A   B   C\n",
       "0  A0  B0  C0\n",
       "1  A1  B1  C1\n",
       "2  A2  B2  C2"
      ]
     },
     "execution_count": 66,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def make_df(cols, ind):\n",
    "    \"\"\"Quickly make a DataFrame\"\"\"\n",
    "    data = {c: [str(c) + str(i) for i in ind]\n",
    "            for c in cols}\n",
    "    return pd.DataFrame(data, ind)\n",
    "\n",
    "# example DataFrame\n",
    "make_df('ABC', range(3))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.655999Z",
     "iopub.status.busy": "2024-01-10T00:13:15.655658Z",
     "iopub.status.idle": "2024-01-10T00:13:15.658376Z",
     "shell.execute_reply": "2024-01-10T00:13:15.657849Z"
    }
   },
   "outputs": [],
   "source": [
    "from IPython.display import display"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "### Concatenation\n",
    "\n",
    "Can concatenate `Series` and `DataFrame` objects with `pd.concat()`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "Default is to concatenate over rows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.661244Z",
     "iopub.status.busy": "2024-01-10T00:13:15.660906Z",
     "iopub.status.idle": "2024-01-10T00:13:15.674198Z",
     "shell.execute_reply": "2024-01-10T00:13:15.673678Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>A</th>\n",
       "      <th>B</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>A1</td>\n",
       "      <td>B1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>A2</td>\n",
       "      <td>B2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    A   B\n",
       "1  A1  B1\n",
       "2  A2  B2"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>A</th>\n",
       "      <th>B</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>A3</td>\n",
       "      <td>B3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>A4</td>\n",
       "      <td>B4</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    A   B\n",
       "3  A3  B3\n",
       "4  A4  B4"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>A</th>\n",
       "      <th>B</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>A1</td>\n",
       "      <td>B1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>A2</td>\n",
       "      <td>B2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>A3</td>\n",
       "      <td>B3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>A4</td>\n",
       "      <td>B4</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    A   B\n",
       "1  A1  B1\n",
       "2  A2  B2\n",
       "3  A3  B3\n",
       "4  A4  B4"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "df1 = make_df('AB', [1, 2])\n",
    "df2 = make_df('AB', [3, 4])\n",
    "display(df1, df2, pd.concat([df1, df2]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "Can also concatenate over columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.677093Z",
     "iopub.status.busy": "2024-01-10T00:13:15.676742Z",
     "iopub.status.idle": "2024-01-10T00:13:15.690361Z",
     "shell.execute_reply": "2024-01-10T00:13:15.689828Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>A</th>\n",
       "      <th>B</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>A0</td>\n",
       "      <td>B0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>A1</td>\n",
       "      <td>B1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    A   B\n",
       "0  A0  B0\n",
       "1  A1  B1"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>C</th>\n",
       "      <th>D</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>C0</td>\n",
       "      <td>D0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>C1</td>\n",
       "      <td>D1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    C   D\n",
       "0  C0  D0\n",
       "1  C1  D1"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>A</th>\n",
       "      <th>B</th>\n",
       "      <th>C</th>\n",
       "      <th>D</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>A0</td>\n",
       "      <td>B0</td>\n",
       "      <td>C0</td>\n",
       "      <td>D0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>A1</td>\n",
       "      <td>B1</td>\n",
       "      <td>C1</td>\n",
       "      <td>D1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    A   B   C   D\n",
       "0  A0  B0  C0  D0\n",
       "1  A1  B1  C1  D1"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "df3 = make_df('AB', [0, 1])\n",
    "df4 = make_df('CD', [0, 1])\n",
    "display(df3, df4, pd.concat([df3, df4], axis='columns'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "#### Duplicated indices\n",
    "\n",
    "Can have duplicated indices."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 70,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.693461Z",
     "iopub.status.busy": "2024-01-10T00:13:15.693106Z",
     "iopub.status.idle": "2024-01-10T00:13:15.706128Z",
     "shell.execute_reply": "2024-01-10T00:13:15.705603Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>A</th>\n",
       "      <th>B</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>A0</td>\n",
       "      <td>B0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>A1</td>\n",
       "      <td>B1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    A   B\n",
       "0  A0  B0\n",
       "1  A1  B1"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>A</th>\n",
       "      <th>B</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>A2</td>\n",
       "      <td>B2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>A3</td>\n",
       "      <td>B3</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    A   B\n",
       "0  A2  B2\n",
       "1  A3  B3"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>A</th>\n",
       "      <th>B</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>A0</td>\n",
       "      <td>B0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>A1</td>\n",
       "      <td>B1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>A2</td>\n",
       "      <td>B2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>A3</td>\n",
       "      <td>B3</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    A   B\n",
       "0  A0  B0\n",
       "1  A1  B1\n",
       "0  A2  B2\n",
       "1  A3  B3"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "x = make_df('AB', [0, 1])\n",
    "y = make_df('AB', [2, 3])\n",
    "y.index = x.index  # make duplicate indices!\n",
    "display(x, y, pd.concat([x, y]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "#### Ignoring index\n",
    "\n",
    "Can ignore index."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 71,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.708970Z",
     "iopub.status.busy": "2024-01-10T00:13:15.708624Z",
     "iopub.status.idle": "2024-01-10T00:13:15.720590Z",
     "shell.execute_reply": "2024-01-10T00:13:15.720060Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>A</th>\n",
       "      <th>B</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>A0</td>\n",
       "      <td>B0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>A1</td>\n",
       "      <td>B1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    A   B\n",
       "0  A0  B0\n",
       "1  A1  B1"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>A</th>\n",
       "      <th>B</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>A2</td>\n",
       "      <td>B2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>A3</td>\n",
       "      <td>B3</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    A   B\n",
       "0  A2  B2\n",
       "1  A3  B3"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>A</th>\n",
       "      <th>B</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>A0</td>\n",
       "      <td>B0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>A1</td>\n",
       "      <td>B1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>A2</td>\n",
       "      <td>B2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>A3</td>\n",
       "      <td>B3</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    A   B\n",
       "0  A0  B0\n",
       "1  A1  B1\n",
       "2  A2  B2\n",
       "3  A3  B3"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "display(x, y, pd.concat([x, y], ignore_index=True))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "#### Concantenation with joins\n",
    "\n",
    "Can join `DataFrames` with different column names."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.723441Z",
     "iopub.status.busy": "2024-01-10T00:13:15.723097Z",
     "iopub.status.idle": "2024-01-10T00:13:15.738457Z",
     "shell.execute_reply": "2024-01-10T00:13:15.737891Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>A</th>\n",
       "      <th>B</th>\n",
       "      <th>C</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>A1</td>\n",
       "      <td>B1</td>\n",
       "      <td>C1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>A2</td>\n",
       "      <td>B2</td>\n",
       "      <td>C2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    A   B   C\n",
       "1  A1  B1  C1\n",
       "2  A2  B2  C2"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>B</th>\n",
       "      <th>C</th>\n",
       "      <th>D</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>B3</td>\n",
       "      <td>C3</td>\n",
       "      <td>D3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>B4</td>\n",
       "      <td>C4</td>\n",
       "      <td>D4</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    B   C   D\n",
       "3  B3  C3  D3\n",
       "4  B4  C4  D4"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>A</th>\n",
       "      <th>B</th>\n",
       "      <th>C</th>\n",
       "      <th>D</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>A1</td>\n",
       "      <td>B1</td>\n",
       "      <td>C1</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>A2</td>\n",
       "      <td>B2</td>\n",
       "      <td>C2</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>NaN</td>\n",
       "      <td>B3</td>\n",
       "      <td>C3</td>\n",
       "      <td>D3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>NaN</td>\n",
       "      <td>B4</td>\n",
       "      <td>C4</td>\n",
       "      <td>D4</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     A   B   C    D\n",
       "1   A1  B1  C1  NaN\n",
       "2   A2  B2  C2  NaN\n",
       "3  NaN  B3  C3   D3\n",
       "4  NaN  B4  C4   D4"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "df5 = make_df('ABC', [1, 2])\n",
    "df6 = make_df('BCD', [3, 4])\n",
    "display(df5, df6, pd.concat([df5, df6]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Entries with no data are filled with NaN.\n",
    "\n",
    "Default join is the *union* of the columns of the two `DataFrames`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "Can also perform different types of joins.\n",
    "\n",
    "For example, the *intersection* of the columns of the two DataFrames."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 73,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.741470Z",
     "iopub.status.busy": "2024-01-10T00:13:15.741118Z",
     "iopub.status.idle": "2024-01-10T00:13:15.754499Z",
     "shell.execute_reply": "2024-01-10T00:13:15.753943Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>A</th>\n",
       "      <th>B</th>\n",
       "      <th>C</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>A1</td>\n",
       "      <td>B1</td>\n",
       "      <td>C1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>A2</td>\n",
       "      <td>B2</td>\n",
       "      <td>C2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    A   B   C\n",
       "1  A1  B1  C1\n",
       "2  A2  B2  C2"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>B</th>\n",
       "      <th>C</th>\n",
       "      <th>D</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>B3</td>\n",
       "      <td>C3</td>\n",
       "      <td>D3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>B4</td>\n",
       "      <td>C4</td>\n",
       "      <td>D4</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    B   C   D\n",
       "3  B3  C3  D3\n",
       "4  B4  C4  D4"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>B</th>\n",
       "      <th>C</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>B1</td>\n",
       "      <td>C1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>B2</td>\n",
       "      <td>C2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>B3</td>\n",
       "      <td>C3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>B4</td>\n",
       "      <td>C4</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    B   C\n",
       "1  B1  C1\n",
       "2  B2  C2\n",
       "3  B3  C3\n",
       "4  B4  C4"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "display(df5, df6, pd.concat([df5, df6], join='inner'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "### Relational combinations\n",
    "\n",
    "Pandas also provides functionality to perform relational algebra (cf. relational databases).\n",
    "\n",
    "Hence, Pandas data structures provide analog not only of NumPy array and dictionary, but also relational database."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "Functionality provied by `pd.merge()` function."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "#### One-to-one join"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 74,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.757544Z",
     "iopub.status.busy": "2024-01-10T00:13:15.757195Z",
     "iopub.status.idle": "2024-01-10T00:13:15.772846Z",
     "shell.execute_reply": "2024-01-10T00:13:15.772302Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>employee</th>\n",
       "      <th>group</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Bob</td>\n",
       "      <td>Accounting</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Jake</td>\n",
       "      <td>Engineering</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Lisa</td>\n",
       "      <td>Engineering</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Sue</td>\n",
       "      <td>HR</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  employee        group\n",
       "0      Bob   Accounting\n",
       "1     Jake  Engineering\n",
       "2     Lisa  Engineering\n",
       "3      Sue           HR"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>employee</th>\n",
       "      <th>hire_date</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Lisa</td>\n",
       "      <td>2004</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Bob</td>\n",
       "      <td>2008</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Jake</td>\n",
       "      <td>2012</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Sue</td>\n",
       "      <td>2014</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  employee  hire_date\n",
       "0     Lisa       2004\n",
       "1      Bob       2008\n",
       "2     Jake       2012\n",
       "3      Sue       2014"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>employee</th>\n",
       "      <th>group</th>\n",
       "      <th>hire_date</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Bob</td>\n",
       "      <td>Accounting</td>\n",
       "      <td>2008</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Jake</td>\n",
       "      <td>Engineering</td>\n",
       "      <td>2012</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Lisa</td>\n",
       "      <td>Engineering</td>\n",
       "      <td>2004</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Sue</td>\n",
       "      <td>HR</td>\n",
       "      <td>2014</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  employee        group  hire_date\n",
       "0      Bob   Accounting       2008\n",
       "1     Jake  Engineering       2012\n",
       "2     Lisa  Engineering       2004\n",
       "3      Sue           HR       2014"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],\n",
    "                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})\n",
    "df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],\n",
    "                    'hire_date': [2004, 2008, 2012, 2014]})\n",
    "df3 = pd.merge(df1, df2)\n",
    "display(df1,df2,df3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Recognises that \"employee\" column common and automatically selects as key for the relational join."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "#### Many-to-one joins"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 75,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.775906Z",
     "iopub.status.busy": "2024-01-10T00:13:15.775550Z",
     "iopub.status.idle": "2024-01-10T00:13:15.790962Z",
     "shell.execute_reply": "2024-01-10T00:13:15.790405Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>employee</th>\n",
       "      <th>group</th>\n",
       "      <th>hire_date</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Bob</td>\n",
       "      <td>Accounting</td>\n",
       "      <td>2008</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Jake</td>\n",
       "      <td>Engineering</td>\n",
       "      <td>2012</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Lisa</td>\n",
       "      <td>Engineering</td>\n",
       "      <td>2004</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Sue</td>\n",
       "      <td>HR</td>\n",
       "      <td>2014</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  employee        group  hire_date\n",
       "0      Bob   Accounting       2008\n",
       "1     Jake  Engineering       2012\n",
       "2     Lisa  Engineering       2004\n",
       "3      Sue           HR       2014"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>group</th>\n",
       "      <th>supervisor</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Accounting</td>\n",
       "      <td>Carly</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Engineering</td>\n",
       "      <td>Guido</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>HR</td>\n",
       "      <td>Steve</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         group supervisor\n",
       "0   Accounting      Carly\n",
       "1  Engineering      Guido\n",
       "2           HR      Steve"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>employee</th>\n",
       "      <th>group</th>\n",
       "      <th>hire_date</th>\n",
       "      <th>supervisor</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Bob</td>\n",
       "      <td>Accounting</td>\n",
       "      <td>2008</td>\n",
       "      <td>Carly</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Jake</td>\n",
       "      <td>Engineering</td>\n",
       "      <td>2012</td>\n",
       "      <td>Guido</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Lisa</td>\n",
       "      <td>Engineering</td>\n",
       "      <td>2004</td>\n",
       "      <td>Guido</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Sue</td>\n",
       "      <td>HR</td>\n",
       "      <td>2014</td>\n",
       "      <td>Steve</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  employee        group  hire_date supervisor\n",
       "0      Bob   Accounting       2008      Carly\n",
       "1     Jake  Engineering       2012      Guido\n",
       "2     Lisa  Engineering       2004      Guido\n",
       "3      Sue           HR       2014      Steve"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "df4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],\n",
    "                    'supervisor': ['Carly', 'Guido', 'Steve']})\n",
    "display(df3, df4, pd.merge(df3, df4))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "#### The `on` keyword"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 76,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.793866Z",
     "iopub.status.busy": "2024-01-10T00:13:15.793517Z",
     "iopub.status.idle": "2024-01-10T00:13:15.807076Z",
     "shell.execute_reply": "2024-01-10T00:13:15.806535Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>employee</th>\n",
       "      <th>group</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Bob</td>\n",
       "      <td>Accounting</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Jake</td>\n",
       "      <td>Engineering</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Lisa</td>\n",
       "      <td>Engineering</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Sue</td>\n",
       "      <td>HR</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  employee        group\n",
       "0      Bob   Accounting\n",
       "1     Jake  Engineering\n",
       "2     Lisa  Engineering\n",
       "3      Sue           HR"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>employee</th>\n",
       "      <th>hire_date</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Lisa</td>\n",
       "      <td>2004</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Bob</td>\n",
       "      <td>2008</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Jake</td>\n",
       "      <td>2012</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Sue</td>\n",
       "      <td>2014</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  employee  hire_date\n",
       "0     Lisa       2004\n",
       "1      Bob       2008\n",
       "2     Jake       2012\n",
       "3      Sue       2014"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>employee</th>\n",
       "      <th>group</th>\n",
       "      <th>hire_date</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Bob</td>\n",
       "      <td>Accounting</td>\n",
       "      <td>2008</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Jake</td>\n",
       "      <td>Engineering</td>\n",
       "      <td>2012</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Lisa</td>\n",
       "      <td>Engineering</td>\n",
       "      <td>2004</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Sue</td>\n",
       "      <td>HR</td>\n",
       "      <td>2014</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  employee        group  hire_date\n",
       "0      Bob   Accounting       2008\n",
       "1     Jake  Engineering       2012\n",
       "2     Lisa  Engineering       2004\n",
       "3      Sue           HR       2014"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "display(df1, df2, pd.merge(df1, df2, on='employee'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "#### The `left_on` and `right_on` keywords"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 77,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.809978Z",
     "iopub.status.busy": "2024-01-10T00:13:15.809622Z",
     "iopub.status.idle": "2024-01-10T00:13:15.823891Z",
     "shell.execute_reply": "2024-01-10T00:13:15.823352Z"
    },
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>employee</th>\n",
       "      <th>group</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Bob</td>\n",
       "      <td>Accounting</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Jake</td>\n",
       "      <td>Engineering</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Lisa</td>\n",
       "      <td>Engineering</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Sue</td>\n",
       "      <td>HR</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  employee        group\n",
       "0      Bob   Accounting\n",
       "1     Jake  Engineering\n",
       "2     Lisa  Engineering\n",
       "3      Sue           HR"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>name</th>\n",
       "      <th>salary</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Bob</td>\n",
       "      <td>70000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Jake</td>\n",
       "      <td>80000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Lisa</td>\n",
       "      <td>120000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Sue</td>\n",
       "      <td>90000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   name  salary\n",
       "0   Bob   70000\n",
       "1  Jake   80000\n",
       "2  Lisa  120000\n",
       "3   Sue   90000"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>employee</th>\n",
       "      <th>group</th>\n",
       "      <th>name</th>\n",
       "      <th>salary</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Bob</td>\n",
       "      <td>Accounting</td>\n",
       "      <td>Bob</td>\n",
       "      <td>70000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Jake</td>\n",
       "      <td>Engineering</td>\n",
       "      <td>Jake</td>\n",
       "      <td>80000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Lisa</td>\n",
       "      <td>Engineering</td>\n",
       "      <td>Lisa</td>\n",
       "      <td>120000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Sue</td>\n",
       "      <td>HR</td>\n",
       "      <td>Sue</td>\n",
       "      <td>90000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  employee        group  name  salary\n",
       "0      Bob   Accounting   Bob   70000\n",
       "1     Jake  Engineering  Jake   80000\n",
       "2     Lisa  Engineering  Lisa  120000\n",
       "3      Sue           HR   Sue   90000"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "df3 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],\n",
    "                    'salary': [70000, 80000, 120000, 90000]})\n",
    "display(df1, df3, pd.merge(df1, df3, left_on=\"employee\", right_on=\"name\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "Employee and name both included now, so may want to drop one."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 78,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.826815Z",
     "iopub.status.busy": "2024-01-10T00:13:15.826465Z",
     "iopub.status.idle": "2024-01-10T00:13:15.834368Z",
     "shell.execute_reply": "2024-01-10T00:13:15.833835Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>employee</th>\n",
       "      <th>group</th>\n",
       "      <th>name</th>\n",
       "      <th>salary</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Bob</td>\n",
       "      <td>Accounting</td>\n",
       "      <td>Bob</td>\n",
       "      <td>70000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Jake</td>\n",
       "      <td>Engineering</td>\n",
       "      <td>Jake</td>\n",
       "      <td>80000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Lisa</td>\n",
       "      <td>Engineering</td>\n",
       "      <td>Lisa</td>\n",
       "      <td>120000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Sue</td>\n",
       "      <td>HR</td>\n",
       "      <td>Sue</td>\n",
       "      <td>90000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  employee        group  name  salary\n",
       "0      Bob   Accounting   Bob   70000\n",
       "1     Jake  Engineering  Jake   80000\n",
       "2     Lisa  Engineering  Lisa  120000\n",
       "3      Sue           HR   Sue   90000"
      ]
     },
     "execution_count": 78,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.merge(df1, df3, left_on=\"employee\", right_on=\"name\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 79,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.837233Z",
     "iopub.status.busy": "2024-01-10T00:13:15.836883Z",
     "iopub.status.idle": "2024-01-10T00:13:15.844804Z",
     "shell.execute_reply": "2024-01-10T00:13:15.844272Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>employee</th>\n",
       "      <th>group</th>\n",
       "      <th>salary</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Bob</td>\n",
       "      <td>Accounting</td>\n",
       "      <td>70000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Jake</td>\n",
       "      <td>Engineering</td>\n",
       "      <td>80000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Lisa</td>\n",
       "      <td>Engineering</td>\n",
       "      <td>120000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Sue</td>\n",
       "      <td>HR</td>\n",
       "      <td>90000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  employee        group  salary\n",
       "0      Bob   Accounting   70000\n",
       "1     Jake  Engineering   80000\n",
       "2     Lisa  Engineering  120000\n",
       "3      Sue           HR   90000"
      ]
     },
     "execution_count": 79,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.merge(df1, df3, left_on=\"employee\", right_on=\"name\").drop('name', axis='columns')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "#### The `left_index` and `right_index` keywords\n",
    "\n",
    "Often one wants to join on index."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 80,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.847768Z",
     "iopub.status.busy": "2024-01-10T00:13:15.847396Z",
     "iopub.status.idle": "2024-01-10T00:13:15.856391Z",
     "shell.execute_reply": "2024-01-10T00:13:15.855859Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>group</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>employee</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Bob</th>\n",
       "      <td>Accounting</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Jake</th>\n",
       "      <td>Engineering</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Lisa</th>\n",
       "      <td>Engineering</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Sue</th>\n",
       "      <td>HR</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                group\n",
       "employee             \n",
       "Bob        Accounting\n",
       "Jake      Engineering\n",
       "Lisa      Engineering\n",
       "Sue                HR"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>hire_date</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>employee</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Lisa</th>\n",
       "      <td>2004</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Bob</th>\n",
       "      <td>2008</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Jake</th>\n",
       "      <td>2012</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Sue</th>\n",
       "      <td>2014</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          hire_date\n",
       "employee           \n",
       "Lisa           2004\n",
       "Bob            2008\n",
       "Jake           2012\n",
       "Sue            2014"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "df1a = df1.set_index('employee')\n",
    "df2a = df2.set_index('employee')\n",
    "display(df1a, df2a)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 81,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.859251Z",
     "iopub.status.busy": "2024-01-10T00:13:15.858901Z",
     "iopub.status.idle": "2024-01-10T00:13:15.866112Z",
     "shell.execute_reply": "2024-01-10T00:13:15.865585Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>group</th>\n",
       "      <th>hire_date</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>employee</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Bob</th>\n",
       "      <td>Accounting</td>\n",
       "      <td>2008</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Jake</th>\n",
       "      <td>Engineering</td>\n",
       "      <td>2012</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Lisa</th>\n",
       "      <td>Engineering</td>\n",
       "      <td>2004</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Sue</th>\n",
       "      <td>HR</td>\n",
       "      <td>2014</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                group  hire_date\n",
       "employee                        \n",
       "Bob        Accounting       2008\n",
       "Jake      Engineering       2012\n",
       "Lisa      Engineering       2004\n",
       "Sue                HR       2014"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "display(pd.merge(df1a, df2a, left_index=True, right_index=True))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "#### Set arithmetic for joins \n",
    "\n",
    "Have so far been considering relational joins based on *intersection* \n",
    "(also called *inner* join)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 82,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.869060Z",
     "iopub.status.busy": "2024-01-10T00:13:15.868709Z",
     "iopub.status.idle": "2024-01-10T00:13:15.885530Z",
     "shell.execute_reply": "2024-01-10T00:13:15.884965Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>name</th>\n",
       "      <th>food</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Peter</td>\n",
       "      <td>fish</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Paul</td>\n",
       "      <td>beans</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Mary</td>\n",
       "      <td>bread</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    name   food\n",
       "0  Peter   fish\n",
       "1   Paul  beans\n",
       "2   Mary  bread"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>name</th>\n",
       "      <th>drink</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Mary</td>\n",
       "      <td>wine</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Joseph</td>\n",
       "      <td>beer</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     name drink\n",
       "0    Mary  wine\n",
       "1  Joseph  beer"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>name</th>\n",
       "      <th>food</th>\n",
       "      <th>drink</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Mary</td>\n",
       "      <td>bread</td>\n",
       "      <td>wine</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   name   food drink\n",
       "0  Mary  bread  wine"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "df6 = pd.DataFrame({'name': ['Peter', 'Paul', 'Mary'],\n",
    "                    'food': ['fish', 'beans', 'bread']},\n",
    "                   columns=['name', 'food'])\n",
    "df7 = pd.DataFrame({'name': ['Mary', 'Joseph'],\n",
    "                    'drink': ['wine', 'beer']},\n",
    "                   columns=['name', 'drink'])\n",
    "display(df6, df7, pd.merge(df6, df7, how='inner'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "##### Outer join\n",
    "\n",
    "Can also join based on *union* (missing entries filled with NaNs)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 83,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.888739Z",
     "iopub.status.busy": "2024-01-10T00:13:15.888382Z",
     "iopub.status.idle": "2024-01-10T00:13:15.903223Z",
     "shell.execute_reply": "2024-01-10T00:13:15.902673Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>name</th>\n",
       "      <th>food</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Peter</td>\n",
       "      <td>fish</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Paul</td>\n",
       "      <td>beans</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Mary</td>\n",
       "      <td>bread</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    name   food\n",
       "0  Peter   fish\n",
       "1   Paul  beans\n",
       "2   Mary  bread"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>name</th>\n",
       "      <th>drink</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Mary</td>\n",
       "      <td>wine</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Joseph</td>\n",
       "      <td>beer</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     name drink\n",
       "0    Mary  wine\n",
       "1  Joseph  beer"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>name</th>\n",
       "      <th>food</th>\n",
       "      <th>drink</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Peter</td>\n",
       "      <td>fish</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Paul</td>\n",
       "      <td>beans</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Mary</td>\n",
       "      <td>bread</td>\n",
       "      <td>wine</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Joseph</td>\n",
       "      <td>NaN</td>\n",
       "      <td>beer</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     name   food drink\n",
       "0   Peter   fish   NaN\n",
       "1    Paul  beans   NaN\n",
       "2    Mary  bread  wine\n",
       "3  Joseph    NaN  beer"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "display(df6, df7, pd.merge(df6, df7, how='outer'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "##### Left and right join\n",
    "\n",
    "Can also join based on *left* or *right* entries."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 84,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.906173Z",
     "iopub.status.busy": "2024-01-10T00:13:15.905822Z",
     "iopub.status.idle": "2024-01-10T00:13:15.920315Z",
     "shell.execute_reply": "2024-01-10T00:13:15.919750Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>name</th>\n",
       "      <th>food</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Peter</td>\n",
       "      <td>fish</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Paul</td>\n",
       "      <td>beans</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Mary</td>\n",
       "      <td>bread</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    name   food\n",
       "0  Peter   fish\n",
       "1   Paul  beans\n",
       "2   Mary  bread"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>name</th>\n",
       "      <th>drink</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Mary</td>\n",
       "      <td>wine</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Joseph</td>\n",
       "      <td>beer</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     name drink\n",
       "0    Mary  wine\n",
       "1  Joseph  beer"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>name</th>\n",
       "      <th>food</th>\n",
       "      <th>drink</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Peter</td>\n",
       "      <td>fish</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Paul</td>\n",
       "      <td>beans</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Mary</td>\n",
       "      <td>bread</td>\n",
       "      <td>wine</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    name   food drink\n",
       "0  Peter   fish   NaN\n",
       "1   Paul  beans   NaN\n",
       "2   Mary  bread  wine"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "display(df6, df7, pd.merge(df6, df7, how='left'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "#### Overlapping column names\n",
    "\n",
    "Possible for DataFrames to have conflicting columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 85,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.923323Z",
     "iopub.status.busy": "2024-01-10T00:13:15.922959Z",
     "iopub.status.idle": "2024-01-10T00:13:15.933270Z",
     "shell.execute_reply": "2024-01-10T00:13:15.932631Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>name</th>\n",
       "      <th>rank</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Bob</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Jake</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Lisa</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Sue</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   name  rank\n",
       "0   Bob     1\n",
       "1  Jake     2\n",
       "2  Lisa     3\n",
       "3   Sue     4"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>name</th>\n",
       "      <th>rank</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Bob</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Jake</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Lisa</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Sue</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   name  rank\n",
       "0   Bob     3\n",
       "1  Jake     1\n",
       "2  Lisa     4\n",
       "3   Sue     2"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "df8 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],\n",
    "                    'rank': [1, 2, 3, 4]})\n",
    "df9 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],\n",
    "                    'rank': [3, 1, 4, 2]})\n",
    "display(df8, df9)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 86,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-10T00:13:15.937472Z",
     "iopub.status.busy": "2024-01-10T00:13:15.936984Z",
     "iopub.status.idle": "2024-01-10T00:13:15.946330Z",
     "shell.execute_reply": "2024-01-10T00:13:15.945561Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>name</th>\n",
       "      <th>rank_L</th>\n",
       "      <th>rank_R</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Bob</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Jake</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Lisa</td>\n",
       "      <td>3</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Sue</td>\n",
       "      <td>4</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   name  rank_L  rank_R\n",
       "0   Bob       1       3\n",
       "1  Jake       2       1\n",
       "2  Lisa       3       4\n",
       "3   Sue       4       2"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "display(pd.merge(df8, df9, on=\"name\", suffixes=[\"_L\", \"_R\"]))"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Slideshow",
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.18"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}