{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Lecture 2: Data wrangling with Pandas " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "![](https://www.tensorflow.org/images/colab_logo_32px.png)\n", "[Run in colab](https://colab.research.google.com/drive/1L7sAw22PfopC1z8ANRdgFCnJBXiZIdtI)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:14.768141Z", "iopub.status.busy": "2024-01-10T00:13:14.767528Z", "iopub.status.idle": "2024-01-10T00:13:14.775602Z", "shell.execute_reply": "2024-01-10T00:13:14.775063Z" }, "slideshow": { "slide_type": "skip" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Last executed: 2024-01-10 00:13:14\n" ] } ], "source": [ "import datetime\n", "now = datetime.datetime.now()\n", "print(\"Last executed: \" + now.strftime(\"%Y-%m-%d %H:%M:%S\"))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Why Pandas?\n", "\n", "[Pandas](https://pandas.pydata.org/) is a very useful package for data wrangling.\n", "\n", "Particularly useful when working with real data, which can be messy." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Combines advantages of a number of different data structures (NumPy arrays, dictionaries, relational databases).\n", "\n", "Can also be more efficient than native Python data structures for certain operators (as we will see)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Particularly useful for dealing with:\n", "- Labelled data\n", "- Missing data\n", "- Heteterogenous types\n", "- Groupings" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "We will focus mostly on Pandas `Series` and `DataFrame` objects." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Import Pandas" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:14.816280Z", "iopub.status.busy": "2024-01-10T00:13:14.815818Z", "iopub.status.idle": "2024-01-10T00:13:15.132633Z", "shell.execute_reply": "2024-01-10T00:13:15.131910Z" } }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Documentation\n", "\n", "Recall can check documentation with `pd?`, `pd.`, and/or print documentation for specific function with `print(pd..__doc__)`." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.136072Z", "iopub.status.busy": "2024-01-10T00:13:15.135754Z", "iopub.status.idle": "2024-01-10T00:13:15.139939Z", "shell.execute_reply": "2024-01-10T00:13:15.139326Z" } }, "outputs": [], "source": [ "#pd?" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.142929Z", "iopub.status.busy": "2024-01-10T00:13:15.142560Z", "iopub.status.idle": "2024-01-10T00:13:15.145430Z", "shell.execute_reply": "2024-01-10T00:13:15.144796Z" } }, "outputs": [], "source": [ "#pd." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.148415Z", "iopub.status.busy": "2024-01-10T00:13:15.147981Z", "iopub.status.idle": "2024-01-10T00:13:15.152029Z", "shell.execute_reply": "2024-01-10T00:13:15.151417Z" } }, "outputs": [], "source": [ "#pd.concat?" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.154836Z", "iopub.status.busy": "2024-01-10T00:13:15.154383Z", "iopub.status.idle": "2024-01-10T00:13:15.158458Z", "shell.execute_reply": "2024-01-10T00:13:15.157831Z" } }, "outputs": [], "source": [ "#print(pd.concat.__doc__)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Pandas `Series`\n", "\n", "A Pandas `Series` is a *1D* array of *indexed* data. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Can be created from a list or array:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.161285Z", "iopub.status.busy": "2024-01-10T00:13:15.161073Z", "iopub.status.idle": "2024-01-10T00:13:15.171895Z", "shell.execute_reply": "2024-01-10T00:13:15.171212Z" } }, "outputs": [ { "data": { "text/plain": [ "0 0.25\n", "1 0.50\n", "2 0.75\n", "3 1.00\n", "dtype: float64" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = pd.Series([0.25, 0.5, 0.75, 1.0])\n", "data" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The `Series` wraps both a sequence of *values* and a sequence of *indices*, which we can access with the `values` and `index` attributes." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.174915Z", "iopub.status.busy": "2024-01-10T00:13:15.174556Z", "iopub.status.idle": "2024-01-10T00:13:15.181701Z", "shell.execute_reply": "2024-01-10T00:13:15.181103Z" } }, "outputs": [ { "data": { "text/plain": [ "0 0.25\n", "1 0.50\n", "2 0.75\n", "3 1.00\n", "dtype: float64" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.184468Z", "iopub.status.busy": "2024-01-10T00:13:15.184105Z", "iopub.status.idle": "2024-01-10T00:13:15.188580Z", "shell.execute_reply": "2024-01-10T00:13:15.187941Z" } }, "outputs": [ { "data": { "text/plain": [ "array([0.25, 0.5 , 0.75, 1. ])" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.values" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.192321Z", "iopub.status.busy": "2024-01-10T00:13:15.191893Z", "iopub.status.idle": "2024-01-10T00:13:15.196459Z", "shell.execute_reply": "2024-01-10T00:13:15.195837Z" } }, "outputs": [ { "data": { "text/plain": [ "RangeIndex(start=0, stop=4, step=1)" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.index" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### `Series` as generalized NumPy array \n", "\n", "Values are simply NumPy array.\n", "\n", "Index need not be an integer, but can consist of values of any desired type. " ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.199711Z", "iopub.status.busy": "2024-01-10T00:13:15.199076Z", "iopub.status.idle": "2024-01-10T00:13:15.204940Z", "shell.execute_reply": "2024-01-10T00:13:15.204319Z" } }, "outputs": [ { "data": { "text/plain": [ "a 0.25\n", "b 0.50\n", "c 0.75\n", "d 1.00\n", "dtype: float64" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = pd.Series([0.25, 0.5, 0.75, 1.0],\n", " index=['a', 'b', 'c', 'd'])\n", "data" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.207914Z", "iopub.status.busy": "2024-01-10T00:13:15.207336Z", "iopub.status.idle": "2024-01-10T00:13:15.211637Z", "shell.execute_reply": "2024-01-10T00:13:15.210997Z" } }, "outputs": [ { "data": { "text/plain": [ "0.5" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data['b']" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### `Series` as specialized dictionary\n", "\n", "Can also think of a Pandas `Series` like a specialization of a Python dictionary." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.214449Z", "iopub.status.busy": "2024-01-10T00:13:15.214085Z", "iopub.status.idle": "2024-01-10T00:13:15.217795Z", "shell.execute_reply": "2024-01-10T00:13:15.217150Z" } }, "outputs": [], "source": [ "population_dict = {'California': 38332521,\n", " 'Texas': 26448193,\n", " 'New York': 19651127,\n", " 'Florida': 19552860,\n", " 'Illinois': 12882135}\n", "population = pd.Series(population_dict) # Instantiate from dictionary" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.220887Z", "iopub.status.busy": "2024-01-10T00:13:15.220288Z", "iopub.status.idle": "2024-01-10T00:13:15.224757Z", "shell.execute_reply": "2024-01-10T00:13:15.224118Z" } }, "outputs": [ { "data": { "text/plain": [ "(dict, pandas.core.series.Series)" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(population_dict), type(population)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.228122Z", "iopub.status.busy": "2024-01-10T00:13:15.227555Z", "iopub.status.idle": "2024-01-10T00:13:15.231762Z", "shell.execute_reply": "2024-01-10T00:13:15.231241Z" } }, "outputs": [ { "data": { "text/plain": [ "38332521" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "population['California']" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "- Python dictionary: maps *arbitrary* keys to *arbitrary* values.\n", "- Pandas `Series`: maps *typed* indices to *typed* values." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Type information of Pandas `Series` makes it much more efficient than Python dictionaries for certain operations." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Pandas `DataFrame`\n", "\n", "`DataFrame` can be thought of as a sequence of aligned `Series` objects, with *indices* and *columns*." ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.234800Z", "iopub.status.busy": "2024-01-10T00:13:15.234218Z", "iopub.status.idle": "2024-01-10T00:13:15.242493Z", "shell.execute_reply": "2024-01-10T00:13:15.241954Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
foobar
a0.1508620.234217
b0.6671930.961023
c0.0306700.910813
\n", "
" ], "text/plain": [ " foo bar\n", "a 0.150862 0.234217\n", "b 0.667193 0.961023\n", "c 0.030670 0.910813" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame(np.random.rand(3, 2),\n", " columns=['foo', 'bar'],\n", " index=['a', 'b', 'c'])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### `DataFrame` as generalized NumPy array\n", "\n", "`DataFrame` is an analog of a two-dimensional array with both flexible row indices and flexible column names." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Contstruct another `Series` with same indices." ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.245372Z", "iopub.status.busy": "2024-01-10T00:13:15.245019Z", "iopub.status.idle": "2024-01-10T00:13:15.248490Z", "shell.execute_reply": "2024-01-10T00:13:15.247931Z" } }, "outputs": [], "source": [ "area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,\n", " 'Florida': 170312, 'Illinois': 149995}\n", "area = pd.Series(area_dict)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Combine two `Series` into a `DataFrame`." ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.251243Z", "iopub.status.busy": "2024-01-10T00:13:15.250893Z", "iopub.status.idle": "2024-01-10T00:13:15.257216Z", "shell.execute_reply": "2024-01-10T00:13:15.256698Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
populationarea
California38332521423967
Texas26448193695662
New York19651127141297
Florida19552860170312
Illinois12882135149995
\n", "
" ], "text/plain": [ " population area\n", "California 38332521 423967\n", "Texas 26448193 695662\n", "New York 19651127 141297\n", "Florida 19552860 170312\n", "Illinois 12882135 149995" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "states = pd.DataFrame({'population': population,\n", " 'area': area})\n", "states" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "`DataFrame` has both `index` and `column` attributes." ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.260053Z", "iopub.status.busy": "2024-01-10T00:13:15.259693Z", "iopub.status.idle": "2024-01-10T00:13:15.263625Z", "shell.execute_reply": "2024-01-10T00:13:15.263113Z" } }, "outputs": [ { "data": { "text/plain": [ "Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "states.index" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.266338Z", "iopub.status.busy": "2024-01-10T00:13:15.265998Z", "iopub.status.idle": "2024-01-10T00:13:15.269838Z", "shell.execute_reply": "2024-01-10T00:13:15.269330Z" } }, "outputs": [ { "data": { "text/plain": [ "Index(['population', 'area'], dtype='object')" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "states.columns" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### `DataFrame` as specialized dictionary\n", "\n", "Can also think of a Pandas `DataFrame` like a specialization of a Python dictionary." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "`DataFrame` maps a column name to a `Series`." ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.272676Z", "iopub.status.busy": "2024-01-10T00:13:15.272320Z", "iopub.status.idle": "2024-01-10T00:13:15.276748Z", "shell.execute_reply": "2024-01-10T00:13:15.276239Z" } }, "outputs": [ { "data": { "text/plain": [ "California 423967\n", "Texas 695662\n", "New York 141297\n", "Florida 170312\n", "Illinois 149995\n", "Name: area, dtype: int64" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "states['area']" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.279494Z", "iopub.status.busy": "2024-01-10T00:13:15.279148Z", "iopub.status.idle": "2024-01-10T00:13:15.282910Z", "shell.execute_reply": "2024-01-10T00:13:15.282377Z" } }, "outputs": [ { "data": { "text/plain": [ "pandas.core.series.Series" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(states['area'])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Pandas `Index`\n", "\n", "Both Pandas `Series` and `DataFrame` contain `Index` object(s).\n", "\n", "Can be thought of as *immutable array* (i.e. cannot be changed) or *ordered multi-set* (may contain repeated values)." ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.285632Z", "iopub.status.busy": "2024-01-10T00:13:15.285291Z", "iopub.status.idle": "2024-01-10T00:13:15.289529Z", "shell.execute_reply": "2024-01-10T00:13:15.289021Z" } }, "outputs": [ { "data": { "text/plain": [ "Index([2, 3, 5, 7, 11], dtype='int64')" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ind = pd.Index([2, 3, 5, 7, 11])\n", "ind " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### `Index` as immutable array\n", "\n", "Immutability makes it safer to share indices between multiple `DataFrames`." ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.292351Z", "iopub.status.busy": "2024-01-10T00:13:15.291994Z", "iopub.status.idle": "2024-01-10T00:13:15.295778Z", "shell.execute_reply": "2024-01-10T00:13:15.295270Z" } }, "outputs": [ { "data": { "text/plain": [ "3" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ind[1]" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.298395Z", "iopub.status.busy": "2024-01-10T00:13:15.298061Z", "iopub.status.idle": "2024-01-10T00:13:15.300547Z", "shell.execute_reply": "2024-01-10T00:13:15.300030Z" } }, "outputs": [], "source": [ "#ind[1] = 0" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### `Index` as ordered multi-set\n", "\n", "`Index` objects support many set operations, e.g. joins, unions, intersections, differences." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Example: Compute the intersection and union of the following two `Index` objects." ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.303350Z", "iopub.status.busy": "2024-01-10T00:13:15.303007Z", "iopub.status.idle": "2024-01-10T00:13:15.306205Z", "shell.execute_reply": "2024-01-10T00:13:15.305664Z" } }, "outputs": [], "source": [ "indA = pd.Index([1, 3, 5, 7, 9])\n", "indB = pd.Index([2, 3, 5, 7, 11]) " ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.308868Z", "iopub.status.busy": "2024-01-10T00:13:15.308513Z", "iopub.status.idle": "2024-01-10T00:13:15.312952Z", "shell.execute_reply": "2024-01-10T00:13:15.312424Z" } }, "outputs": [ { "data": { "text/plain": [ "Index([3, 5, 7], dtype='int64')" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "indA.intersection(indB) # intersection" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.315736Z", "iopub.status.busy": "2024-01-10T00:13:15.315378Z", "iopub.status.idle": "2024-01-10T00:13:15.319554Z", "shell.execute_reply": "2024-01-10T00:13:15.319039Z" } }, "outputs": [ { "data": { "text/plain": [ "Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "indA.union(indB) # union" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Data indexing and selection" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Data selection in a `Series`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In additional to acting like a dictionary, a `Series` also provies array-style selection like NumPy arrays." ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.322458Z", "iopub.status.busy": "2024-01-10T00:13:15.322095Z", "iopub.status.idle": "2024-01-10T00:13:15.327409Z", "shell.execute_reply": "2024-01-10T00:13:15.326886Z" } }, "outputs": [ { "data": { "text/plain": [ "a 0.25\n", "b 0.50\n", "c 0.75\n", "d 1.00\n", "dtype: float64" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = pd.Series([0.25, 0.5, 0.75, 1.0],\n", " index=['a', 'b', 'c', 'd'])\n", "data" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.330109Z", "iopub.status.busy": "2024-01-10T00:13:15.329744Z", "iopub.status.idle": "2024-01-10T00:13:15.334772Z", "shell.execute_reply": "2024-01-10T00:13:15.334235Z" } }, "outputs": [ { "data": { "text/plain": [ "a 0.25\n", "b 0.50\n", "c 0.75\n", "dtype: float64" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# slicing by explicit index\n", "data['a':'c']" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.337529Z", "iopub.status.busy": "2024-01-10T00:13:15.337173Z", "iopub.status.idle": "2024-01-10T00:13:15.341776Z", "shell.execute_reply": "2024-01-10T00:13:15.341267Z" } }, "outputs": [ { "data": { "text/plain": [ "a 0.25\n", "b 0.50\n", "dtype: float64" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# slicing by implicit integer index\n", "data[0:2]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "When slicing by an explicit index (e.g. `data['a':'c']`), the final index *is* included.\n", "\n", "When slicing by an implicit index (e.g. `data[0:2]`), the final index *is not* included.\n", "\n", "This can be a source of much confusion." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Consider a Series with integer indices." ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.344695Z", "iopub.status.busy": "2024-01-10T00:13:15.344351Z", "iopub.status.idle": "2024-01-10T00:13:15.349397Z", "shell.execute_reply": "2024-01-10T00:13:15.348877Z" } }, "outputs": [ { "data": { "text/plain": [ "1 a\n", "3 b\n", "5 c\n", "dtype: object" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])\n", "data" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.352122Z", "iopub.status.busy": "2024-01-10T00:13:15.351783Z", "iopub.status.idle": "2024-01-10T00:13:15.355483Z", "shell.execute_reply": "2024-01-10T00:13:15.354953Z" } }, "outputs": [ { "data": { "text/plain": [ "'a'" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# explicit index when indexing\n", "data[1]" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.358468Z", "iopub.status.busy": "2024-01-10T00:13:15.358085Z", "iopub.status.idle": "2024-01-10T00:13:15.362748Z", "shell.execute_reply": "2024-01-10T00:13:15.362199Z" } }, "outputs": [ { "data": { "text/plain": [ "3 b\n", "5 c\n", "dtype: object" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# implicit index when slicing\n", "data[1:3]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Indexers\n", "\n", "Indexers `loc` (explicit) and `iloc` (implicit) are introduced to avoid confusion." ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.365530Z", "iopub.status.busy": "2024-01-10T00:13:15.365179Z", "iopub.status.idle": "2024-01-10T00:13:15.369614Z", "shell.execute_reply": "2024-01-10T00:13:15.369097Z" } }, "outputs": [ { "data": { "text/plain": [ "1 a\n", "3 b\n", "5 c\n", "dtype: object" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.372452Z", "iopub.status.busy": "2024-01-10T00:13:15.372086Z", "iopub.status.idle": "2024-01-10T00:13:15.375935Z", "shell.execute_reply": "2024-01-10T00:13:15.375418Z" } }, "outputs": [ { "data": { "text/plain": [ "'a'" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.loc[1]" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.378492Z", "iopub.status.busy": "2024-01-10T00:13:15.378208Z", "iopub.status.idle": "2024-01-10T00:13:15.382914Z", "shell.execute_reply": "2024-01-10T00:13:15.382378Z" } }, "outputs": [ { "data": { "text/plain": [ "1 a\n", "3 b\n", "dtype: object" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.loc[1:3]" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.385667Z", "iopub.status.busy": "2024-01-10T00:13:15.385325Z", "iopub.status.idle": "2024-01-10T00:13:15.389111Z", "shell.execute_reply": "2024-01-10T00:13:15.388596Z" } }, "outputs": [ { "data": { "text/plain": [ "'b'" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.iloc[1]" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.391891Z", "iopub.status.busy": "2024-01-10T00:13:15.391527Z", "iopub.status.idle": "2024-01-10T00:13:15.396244Z", "shell.execute_reply": "2024-01-10T00:13:15.395728Z" } }, "outputs": [ { "data": { "text/plain": [ "3 b\n", "5 c\n", "dtype: object" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.iloc[1:3]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Data selection in a `DataFrame`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In additional to acting like a dictionary of `Series` objects with the same index, a `DataFrame` also provies array-style selection like NumPy arrays." ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.399127Z", "iopub.status.busy": "2024-01-10T00:13:15.398770Z", "iopub.status.idle": "2024-01-10T00:13:15.406757Z", "shell.execute_reply": "2024-01-10T00:13:15.406203Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
areapopulation
California42396738332521
Texas69566226448193
New York14129719651127
Florida17031219552860
Illinois14999512882135
\n", "
" ], "text/plain": [ " area population\n", "California 423967 38332521\n", "Texas 695662 26448193\n", "New York 141297 19651127\n", "Florida 170312 19552860\n", "Illinois 149995 12882135" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "area = pd.Series({'California': 423967, 'Texas': 695662,\n", " 'New York': 141297, 'Florida': 170312,\n", " 'Illinois': 149995})\n", "pop = pd.Series({'California': 38332521, 'Texas': 26448193,\n", " 'New York': 19651127, 'Florida': 19552860,\n", " 'Illinois': 12882135})\n", "data = pd.DataFrame({'area':area, 'population':pop})\n", "data" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Indexers\n", "\n", "Indexers `loc` (explicit) and `iloc` (implicit) are also available to avoid confusion when selecting data.\n", "\n", "Note that index and column labels are preserved in the result." ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.409634Z", "iopub.status.busy": "2024-01-10T00:13:15.409286Z", "iopub.status.idle": "2024-01-10T00:13:15.414919Z", "shell.execute_reply": "2024-01-10T00:13:15.414384Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
areapopulation
California42396738332521
Texas69566226448193
New York14129719651127
Florida17031219552860
Illinois14999512882135
\n", "
" ], "text/plain": [ " area population\n", "California 423967 38332521\n", "Texas 695662 26448193\n", "New York 141297 19651127\n", "Florida 170312 19552860\n", "Illinois 149995 12882135" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.417710Z", "iopub.status.busy": "2024-01-10T00:13:15.417363Z", "iopub.status.idle": "2024-01-10T00:13:15.423392Z", "shell.execute_reply": "2024-01-10T00:13:15.422870Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
areapopulation
California42396738332521
Texas69566226448193
New York14129719651127
\n", "
" ], "text/plain": [ " area population\n", "California 423967 38332521\n", "Texas 695662 26448193\n", "New York 141297 19651127" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.iloc[:3, :2]" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.426169Z", "iopub.status.busy": "2024-01-10T00:13:15.425803Z", "iopub.status.idle": "2024-01-10T00:13:15.431850Z", "shell.execute_reply": "2024-01-10T00:13:15.431323Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
areapopulation
California42396738332521
Texas69566226448193
New York14129719651127
Florida17031219552860
Illinois14999512882135
\n", "
" ], "text/plain": [ " area population\n", "California 423967 38332521\n", "Texas 695662 26448193\n", "New York 141297 19651127\n", "Florida 170312 19552860\n", "Illinois 149995 12882135" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.loc[:'Illinois', :'population']" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Additional array-style selection \n", "\n", "Other NumPy selection approaches can also be applied (e.g. masking)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" }, "tags": [ "exercise_pointer" ] }, "source": [ "**Exercises:** *You can now complete Exercise 1 in the exercises associated with this lecture.*" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Operating on data in Pandas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Elementwise operations in Pandas automatically aligns indices and preserves index/column labels.\n", "\n", "Can avoid many errors and bugs in data wrangling." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Index preservation" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.435183Z", "iopub.status.busy": "2024-01-10T00:13:15.434837Z", "iopub.status.idle": "2024-01-10T00:13:15.440045Z", "shell.execute_reply": "2024-01-10T00:13:15.439526Z" } }, "outputs": [ { "data": { "text/plain": [ "0 6\n", "1 3\n", "2 7\n", "3 4\n", "dtype: int64" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rng = np.random.RandomState(42)\n", "ser = pd.Series(rng.randint(0, 10, 4))\n", "ser" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.442924Z", "iopub.status.busy": "2024-01-10T00:13:15.442580Z", "iopub.status.idle": "2024-01-10T00:13:15.447315Z", "shell.execute_reply": "2024-01-10T00:13:15.446795Z" } }, "outputs": [ { "data": { "text/plain": [ "0 403.428793\n", "1 20.085537\n", "2 1096.633158\n", "3 54.598150\n", "dtype: float64" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.exp(ser)" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.450058Z", "iopub.status.busy": "2024-01-10T00:13:15.449718Z", "iopub.status.idle": "2024-01-10T00:13:15.456731Z", "shell.execute_reply": "2024-01-10T00:13:15.456205Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABCD
06926
17437
27254
\n", "
" ], "text/plain": [ " A B C D\n", "0 6 9 2 6\n", "1 7 4 3 7\n", "2 7 2 5 4" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame(rng.randint(0, 10, (3, 4)),\n", " columns=['A', 'B', 'C', 'D'])\n", "df" ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.459559Z", "iopub.status.busy": "2024-01-10T00:13:15.459213Z", "iopub.status.idle": "2024-01-10T00:13:15.466329Z", "shell.execute_reply": "2024-01-10T00:13:15.465808Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABCD
0403.4287938103.0839287.389056403.428793
11096.63315854.59815020.0855371096.633158
21096.6331587.389056148.41315954.598150
\n", "
" ], "text/plain": [ " A B C D\n", "0 403.428793 8103.083928 7.389056 403.428793\n", "1 1096.633158 54.598150 20.085537 1096.633158\n", "2 1096.633158 7.389056 148.413159 54.598150" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.exp(df)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" }, "tags": [ "exercise_pointer" ] }, "source": [ "**Exercises:** *You can now complete Exercise 2 in the exercises associated with this lecture.*" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Index alignment" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Index alignment works similarly for `DataFrames`." ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.469283Z", "iopub.status.busy": "2024-01-10T00:13:15.468937Z", "iopub.status.idle": "2024-01-10T00:13:15.475078Z", "shell.execute_reply": "2024-01-10T00:13:15.474536Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AB
0111
151
\n", "
" ], "text/plain": [ " A B\n", "0 1 11\n", "1 5 1" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "A = pd.DataFrame(rng.randint(0, 20, (2, 2)),\n", " columns=list('AB'))\n", "A" ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.478954Z", "iopub.status.busy": "2024-01-10T00:13:15.478608Z", "iopub.status.idle": "2024-01-10T00:13:15.484998Z", "shell.execute_reply": "2024-01-10T00:13:15.484464Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
BAC
0409
1580
2926
\n", "
" ], "text/plain": [ " B A C\n", "0 4 0 9\n", "1 5 8 0\n", "2 9 2 6" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "B = pd.DataFrame(rng.randint(0, 10, (3, 3)),\n", " columns=list('BAC'))\n", "B" ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.488869Z", "iopub.status.busy": "2024-01-10T00:13:15.488521Z", "iopub.status.idle": "2024-01-10T00:13:15.497638Z", "shell.execute_reply": "2024-01-10T00:13:15.497097Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABC
01.015.0NaN
113.06.0NaN
2NaNNaNNaN
\n", "
" ], "text/plain": [ " A B C\n", "0 1.0 15.0 NaN\n", "1 13.0 6.0 NaN\n", "2 NaN NaN NaN" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "A + B" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Operations between `DataFrame` and `Series` objects" ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.505677Z", "iopub.status.busy": "2024-01-10T00:13:15.505241Z", "iopub.status.idle": "2024-01-10T00:13:15.512507Z", "shell.execute_reply": "2024-01-10T00:13:15.511979Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
QRST
03824
12648
26138
\n", "
" ], "text/plain": [ " Q R S T\n", "0 3 8 2 4\n", "1 2 6 4 8\n", "2 6 1 3 8" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "A = rng.randint(10, size=(3, 4))\n", "df = pd.DataFrame(A, columns=list('QRST'))\n", "df" ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.515360Z", "iopub.status.busy": "2024-01-10T00:13:15.515014Z", "iopub.status.idle": "2024-01-10T00:13:15.519595Z", "shell.execute_reply": "2024-01-10T00:13:15.519059Z" } }, "outputs": [ { "data": { "text/plain": [ "Q 3\n", "R 8\n", "S 2\n", "T 4\n", "Name: 0, dtype: int64" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s = df.iloc[0]\n", "s" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Difference between the `DataFrame` and `Series`:" ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.522471Z", "iopub.status.busy": "2024-01-10T00:13:15.522110Z", "iopub.status.idle": "2024-01-10T00:13:15.528586Z", "shell.execute_reply": "2024-01-10T00:13:15.528055Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
QRST
00000
1-1-224
23-714
\n", "
" ], "text/plain": [ " Q R S T\n", "0 0 0 0 0\n", "1 -1 -2 2 4\n", "2 3 -7 1 4" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df - s" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Convention is to operate row-wise." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Can also operate column-wise using object methods." ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.532195Z", "iopub.status.busy": "2024-01-10T00:13:15.531844Z", "iopub.status.idle": "2024-01-10T00:13:15.538629Z", "shell.execute_reply": "2024-01-10T00:13:15.538075Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
QRST
0-50-6-4
1-40-22
25027
\n", "
" ], "text/plain": [ " Q R S T\n", "0 -5 0 -6 -4\n", "1 -4 0 -2 2\n", "2 5 0 2 7" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.subtract(df['R'], axis=0)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Handling missing data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Real data is messy. Often some data are missing." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Various conventions can be considered to handle missing data.\n", "\n", "We will focus on the use of the floating point IEEE value NaN (not a number) to represent missing data." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Pandas interprets NaN as Null values.\n", "\n", "(Pandas also supports `None` but we will focus on NaN here.)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Arithematic operations with NaN values result in NaN." ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.542302Z", "iopub.status.busy": "2024-01-10T00:13:15.541956Z", "iopub.status.idle": "2024-01-10T00:13:15.545706Z", "shell.execute_reply": "2024-01-10T00:13:15.545196Z" } }, "outputs": [ { "data": { "text/plain": [ "nan" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "1 + np.nan" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Operating on Null values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Several useful methods exist to work with NaNs, for example to detect, drop or replace:\n", "- `isnull()`: Generate a boolean mask indicating missing values.\n", "- `notnull()`: Opposite of `isnull()`.\n", "- `dropna()`: Return a filtered version of the data.\n", "- `fillna()`: Return a copy of the data with missing values filled." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Detecting null values\n", "\n", "Pandas `isnull` and `notnull` are useful for detecting null values." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" }, "tags": [ "exercise_pointer" ] }, "source": [ "**Exercises:** *You can now complete Exercise 3 in the exercises associated with this lecture.*" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Dropping null values\n", "\n", "Direct routines may be used to drop null values (i.e. `dropna`), rather than constructing masks as performed above." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" }, "tags": [ "exercise_pointer" ] }, "source": [ "**Exercises:** *You can now complete Exercise 4 in the exercises associated with this lecture.*" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Dropping null values from `DataFrames`\n", "\n", "For `DataFrames`, there are multiple ways null values can be dropped." ] }, { "cell_type": "code", "execution_count": 56, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.549490Z", "iopub.status.busy": "2024-01-10T00:13:15.549132Z", "iopub.status.idle": "2024-01-10T00:13:15.556771Z", "shell.execute_reply": "2024-01-10T00:13:15.556194Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
01.0NaN2
12.03.05
2NaN4.06
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 1.0 NaN 2\n", "1 2.0 3.0 5\n", "2 NaN 4.0 6" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame([[1, np.nan, 2],\n", " [2, 3, 5],\n", " [np.nan, 4, 6]])\n", "df" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "By default `dropna` operates row-wise and drops all rows that contain any NaNs." ] }, { "cell_type": "code", "execution_count": 57, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.559929Z", "iopub.status.busy": "2024-01-10T00:13:15.559542Z", "iopub.status.idle": "2024-01-10T00:13:15.566641Z", "shell.execute_reply": "2024-01-10T00:13:15.566094Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
12.03.05
\n", "
" ], "text/plain": [ " 0 1 2\n", "1 2.0 3.0 5" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.dropna()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Can also operate column-wise." ] }, { "cell_type": "code", "execution_count": 58, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.569601Z", "iopub.status.busy": "2024-01-10T00:13:15.569257Z", "iopub.status.idle": "2024-01-10T00:13:15.575786Z", "shell.execute_reply": "2024-01-10T00:13:15.575259Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
01.0NaN2
12.03.05
2NaN4.06
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 1.0 NaN 2\n", "1 2.0 3.0 5\n", "2 NaN 4.0 6" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 59, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.578790Z", "iopub.status.busy": "2024-01-10T00:13:15.578414Z", "iopub.status.idle": "2024-01-10T00:13:15.584880Z", "shell.execute_reply": "2024-01-10T00:13:15.584357Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
2
02
15
26
\n", "
" ], "text/plain": [ " 2\n", "0 2\n", "1 5\n", "2 6" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.dropna(axis='columns')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "More sophisticated approaches can also be considered (e.g. only dropping rows/columns if all entries or a certain number of NaNs appear)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Replacing null values\n", "\n", "Null values can be easily replaced using `fillna`." ] }, { "cell_type": "code", "execution_count": 60, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.588227Z", "iopub.status.busy": "2024-01-10T00:13:15.587864Z", "iopub.status.idle": "2024-01-10T00:13:15.594332Z", "shell.execute_reply": "2024-01-10T00:13:15.593819Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
01.0NaN2
12.03.05
2NaN4.06
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 1.0 NaN 2\n", "1 2.0 3.0 5\n", "2 NaN 4.0 6" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 61, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.597648Z", "iopub.status.busy": "2024-01-10T00:13:15.597298Z", "iopub.status.idle": "2024-01-10T00:13:15.604154Z", "shell.execute_reply": "2024-01-10T00:13:15.603634Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
01.00.02
12.03.05
20.04.06
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 1.0 0.0 2\n", "1 2.0 3.0 5\n", "2 0.0 4.0 6" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.fillna(0.0)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Can also fill using adjacent values." ] }, { "cell_type": "code", "execution_count": 62, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.607243Z", "iopub.status.busy": "2024-01-10T00:13:15.606901Z", "iopub.status.idle": "2024-01-10T00:13:15.613298Z", "shell.execute_reply": "2024-01-10T00:13:15.612778Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
01.0NaN2
12.03.05
2NaN4.06
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 1.0 NaN 2\n", "1 2.0 3.0 5\n", "2 NaN 4.0 6" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 63, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.615998Z", "iopub.status.busy": "2024-01-10T00:13:15.615699Z", "iopub.status.idle": "2024-01-10T00:13:15.623380Z", "shell.execute_reply": "2024-01-10T00:13:15.622863Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
01.01.02.0
12.03.05.0
2NaN4.06.0
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 1.0 1.0 2.0\n", "1 2.0 3.0 5.0\n", "2 NaN 4.0 6.0" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.fillna(method='ffill', axis='columns')" ] }, { "cell_type": "code", "execution_count": 64, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.626132Z", "iopub.status.busy": "2024-01-10T00:13:15.625790Z", "iopub.status.idle": "2024-01-10T00:13:15.632720Z", "shell.execute_reply": "2024-01-10T00:13:15.632070Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
01.0NaN2
12.03.05
22.04.06
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 1.0 NaN 2\n", "1 2.0 3.0 5\n", "2 2.0 4.0 6" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.fillna(method='ffill', axis='rows')" ] }, { "cell_type": "code", "execution_count": 65, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.635515Z", "iopub.status.busy": "2024-01-10T00:13:15.635170Z", "iopub.status.idle": "2024-01-10T00:13:15.642851Z", "shell.execute_reply": "2024-01-10T00:13:15.642316Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
01.02.02.0
12.03.05.0
24.04.06.0
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 1.0 2.0 2.0\n", "1 2.0 3.0 5.0\n", "2 4.0 4.0 6.0" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.fillna(method='bfill', axis='columns')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Combining data-sets" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Define helper functions" ] }, { "cell_type": "code", "execution_count": 66, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.645704Z", "iopub.status.busy": "2024-01-10T00:13:15.645362Z", "iopub.status.idle": "2024-01-10T00:13:15.653232Z", "shell.execute_reply": "2024-01-10T00:13:15.652717Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABC
0A0B0C0
1A1B1C1
2A2B2C2
\n", "
" ], "text/plain": [ " A B C\n", "0 A0 B0 C0\n", "1 A1 B1 C1\n", "2 A2 B2 C2" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def make_df(cols, ind):\n", " \"\"\"Quickly make a DataFrame\"\"\"\n", " data = {c: [str(c) + str(i) for i in ind]\n", " for c in cols}\n", " return pd.DataFrame(data, ind)\n", "\n", "# example DataFrame\n", "make_df('ABC', range(3))" ] }, { "cell_type": "code", "execution_count": 67, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.655999Z", "iopub.status.busy": "2024-01-10T00:13:15.655658Z", "iopub.status.idle": "2024-01-10T00:13:15.658376Z", "shell.execute_reply": "2024-01-10T00:13:15.657849Z" } }, "outputs": [], "source": [ "from IPython.display import display" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Concatenation\n", "\n", "Can concatenate `Series` and `DataFrame` objects with `pd.concat()`." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Default is to concatenate over rows." ] }, { "cell_type": "code", "execution_count": 68, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.661244Z", "iopub.status.busy": "2024-01-10T00:13:15.660906Z", "iopub.status.idle": "2024-01-10T00:13:15.674198Z", "shell.execute_reply": "2024-01-10T00:13:15.673678Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AB
1A1B1
2A2B2
\n", "
" ], "text/plain": [ " A B\n", "1 A1 B1\n", "2 A2 B2" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AB
3A3B3
4A4B4
\n", "
" ], "text/plain": [ " A B\n", "3 A3 B3\n", "4 A4 B4" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AB
1A1B1
2A2B2
3A3B3
4A4B4
\n", "
" ], "text/plain": [ " A B\n", "1 A1 B1\n", "2 A2 B2\n", "3 A3 B3\n", "4 A4 B4" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df1 = make_df('AB', [1, 2])\n", "df2 = make_df('AB', [3, 4])\n", "display(df1, df2, pd.concat([df1, df2]))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Can also concatenate over columns." ] }, { "cell_type": "code", "execution_count": 69, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.677093Z", "iopub.status.busy": "2024-01-10T00:13:15.676742Z", "iopub.status.idle": "2024-01-10T00:13:15.690361Z", "shell.execute_reply": "2024-01-10T00:13:15.689828Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AB
0A0B0
1A1B1
\n", "
" ], "text/plain": [ " A B\n", "0 A0 B0\n", "1 A1 B1" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CD
0C0D0
1C1D1
\n", "
" ], "text/plain": [ " C D\n", "0 C0 D0\n", "1 C1 D1" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABCD
0A0B0C0D0
1A1B1C1D1
\n", "
" ], "text/plain": [ " A B C D\n", "0 A0 B0 C0 D0\n", "1 A1 B1 C1 D1" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df3 = make_df('AB', [0, 1])\n", "df4 = make_df('CD', [0, 1])\n", "display(df3, df4, pd.concat([df3, df4], axis='columns'))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Duplicated indices\n", "\n", "Can have duplicated indices." ] }, { "cell_type": "code", "execution_count": 70, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.693461Z", "iopub.status.busy": "2024-01-10T00:13:15.693106Z", "iopub.status.idle": "2024-01-10T00:13:15.706128Z", "shell.execute_reply": "2024-01-10T00:13:15.705603Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AB
0A0B0
1A1B1
\n", "
" ], "text/plain": [ " A B\n", "0 A0 B0\n", "1 A1 B1" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AB
0A2B2
1A3B3
\n", "
" ], "text/plain": [ " A B\n", "0 A2 B2\n", "1 A3 B3" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AB
0A0B0
1A1B1
0A2B2
1A3B3
\n", "
" ], "text/plain": [ " A B\n", "0 A0 B0\n", "1 A1 B1\n", "0 A2 B2\n", "1 A3 B3" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "x = make_df('AB', [0, 1])\n", "y = make_df('AB', [2, 3])\n", "y.index = x.index # make duplicate indices!\n", "display(x, y, pd.concat([x, y]))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Ignoring index\n", "\n", "Can ignore index." ] }, { "cell_type": "code", "execution_count": 71, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.708970Z", "iopub.status.busy": "2024-01-10T00:13:15.708624Z", "iopub.status.idle": "2024-01-10T00:13:15.720590Z", "shell.execute_reply": "2024-01-10T00:13:15.720060Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AB
0A0B0
1A1B1
\n", "
" ], "text/plain": [ " A B\n", "0 A0 B0\n", "1 A1 B1" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AB
0A2B2
1A3B3
\n", "
" ], "text/plain": [ " A B\n", "0 A2 B2\n", "1 A3 B3" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AB
0A0B0
1A1B1
2A2B2
3A3B3
\n", "
" ], "text/plain": [ " A B\n", "0 A0 B0\n", "1 A1 B1\n", "2 A2 B2\n", "3 A3 B3" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "display(x, y, pd.concat([x, y], ignore_index=True))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Concantenation with joins\n", "\n", "Can join `DataFrames` with different column names." ] }, { "cell_type": "code", "execution_count": 72, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.723441Z", "iopub.status.busy": "2024-01-10T00:13:15.723097Z", "iopub.status.idle": "2024-01-10T00:13:15.738457Z", "shell.execute_reply": "2024-01-10T00:13:15.737891Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABC
1A1B1C1
2A2B2C2
\n", "
" ], "text/plain": [ " A B C\n", "1 A1 B1 C1\n", "2 A2 B2 C2" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
BCD
3B3C3D3
4B4C4D4
\n", "
" ], "text/plain": [ " B C D\n", "3 B3 C3 D3\n", "4 B4 C4 D4" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABCD
1A1B1C1NaN
2A2B2C2NaN
3NaNB3C3D3
4NaNB4C4D4
\n", "
" ], "text/plain": [ " A B C D\n", "1 A1 B1 C1 NaN\n", "2 A2 B2 C2 NaN\n", "3 NaN B3 C3 D3\n", "4 NaN B4 C4 D4" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df5 = make_df('ABC', [1, 2])\n", "df6 = make_df('BCD', [3, 4])\n", "display(df5, df6, pd.concat([df5, df6]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Entries with no data are filled with NaN.\n", "\n", "Default join is the *union* of the columns of the two `DataFrames`." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Can also perform different types of joins.\n", "\n", "For example, the *intersection* of the columns of the two DataFrames." ] }, { "cell_type": "code", "execution_count": 73, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.741470Z", "iopub.status.busy": "2024-01-10T00:13:15.741118Z", "iopub.status.idle": "2024-01-10T00:13:15.754499Z", "shell.execute_reply": "2024-01-10T00:13:15.753943Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABC
1A1B1C1
2A2B2C2
\n", "
" ], "text/plain": [ " A B C\n", "1 A1 B1 C1\n", "2 A2 B2 C2" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
BCD
3B3C3D3
4B4C4D4
\n", "
" ], "text/plain": [ " B C D\n", "3 B3 C3 D3\n", "4 B4 C4 D4" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
BC
1B1C1
2B2C2
3B3C3
4B4C4
\n", "
" ], "text/plain": [ " B C\n", "1 B1 C1\n", "2 B2 C2\n", "3 B3 C3\n", "4 B4 C4" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "display(df5, df6, pd.concat([df5, df6], join='inner'))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Relational combinations\n", "\n", "Pandas also provides functionality to perform relational algebra (cf. relational databases).\n", "\n", "Hence, Pandas data structures provide analog not only of NumPy array and dictionary, but also relational database." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Functionality provied by `pd.merge()` function." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### One-to-one join" ] }, { "cell_type": "code", "execution_count": 74, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.757544Z", "iopub.status.busy": "2024-01-10T00:13:15.757195Z", "iopub.status.idle": "2024-01-10T00:13:15.772846Z", "shell.execute_reply": "2024-01-10T00:13:15.772302Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
employeegroup
0BobAccounting
1JakeEngineering
2LisaEngineering
3SueHR
\n", "
" ], "text/plain": [ " employee group\n", "0 Bob Accounting\n", "1 Jake Engineering\n", "2 Lisa Engineering\n", "3 Sue HR" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
employeehire_date
0Lisa2004
1Bob2008
2Jake2012
3Sue2014
\n", "
" ], "text/plain": [ " employee hire_date\n", "0 Lisa 2004\n", "1 Bob 2008\n", "2 Jake 2012\n", "3 Sue 2014" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
employeegrouphire_date
0BobAccounting2008
1JakeEngineering2012
2LisaEngineering2004
3SueHR2014
\n", "
" ], "text/plain": [ " employee group hire_date\n", "0 Bob Accounting 2008\n", "1 Jake Engineering 2012\n", "2 Lisa Engineering 2004\n", "3 Sue HR 2014" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],\n", " 'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})\n", "df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],\n", " 'hire_date': [2004, 2008, 2012, 2014]})\n", "df3 = pd.merge(df1, df2)\n", "display(df1,df2,df3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Recognises that \"employee\" column common and automatically selects as key for the relational join." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Many-to-one joins" ] }, { "cell_type": "code", "execution_count": 75, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.775906Z", "iopub.status.busy": "2024-01-10T00:13:15.775550Z", "iopub.status.idle": "2024-01-10T00:13:15.790962Z", "shell.execute_reply": "2024-01-10T00:13:15.790405Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
employeegrouphire_date
0BobAccounting2008
1JakeEngineering2012
2LisaEngineering2004
3SueHR2014
\n", "
" ], "text/plain": [ " employee group hire_date\n", "0 Bob Accounting 2008\n", "1 Jake Engineering 2012\n", "2 Lisa Engineering 2004\n", "3 Sue HR 2014" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
groupsupervisor
0AccountingCarly
1EngineeringGuido
2HRSteve
\n", "
" ], "text/plain": [ " group supervisor\n", "0 Accounting Carly\n", "1 Engineering Guido\n", "2 HR Steve" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
employeegrouphire_datesupervisor
0BobAccounting2008Carly
1JakeEngineering2012Guido
2LisaEngineering2004Guido
3SueHR2014Steve
\n", "
" ], "text/plain": [ " employee group hire_date supervisor\n", "0 Bob Accounting 2008 Carly\n", "1 Jake Engineering 2012 Guido\n", "2 Lisa Engineering 2004 Guido\n", "3 Sue HR 2014 Steve" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],\n", " 'supervisor': ['Carly', 'Guido', 'Steve']})\n", "display(df3, df4, pd.merge(df3, df4))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### The `on` keyword" ] }, { "cell_type": "code", "execution_count": 76, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.793866Z", "iopub.status.busy": "2024-01-10T00:13:15.793517Z", "iopub.status.idle": "2024-01-10T00:13:15.807076Z", "shell.execute_reply": "2024-01-10T00:13:15.806535Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
employeegroup
0BobAccounting
1JakeEngineering
2LisaEngineering
3SueHR
\n", "
" ], "text/plain": [ " employee group\n", "0 Bob Accounting\n", "1 Jake Engineering\n", "2 Lisa Engineering\n", "3 Sue HR" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
employeehire_date
0Lisa2004
1Bob2008
2Jake2012
3Sue2014
\n", "
" ], "text/plain": [ " employee hire_date\n", "0 Lisa 2004\n", "1 Bob 2008\n", "2 Jake 2012\n", "3 Sue 2014" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
employeegrouphire_date
0BobAccounting2008
1JakeEngineering2012
2LisaEngineering2004
3SueHR2014
\n", "
" ], "text/plain": [ " employee group hire_date\n", "0 Bob Accounting 2008\n", "1 Jake Engineering 2012\n", "2 Lisa Engineering 2004\n", "3 Sue HR 2014" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "display(df1, df2, pd.merge(df1, df2, on='employee'))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### The `left_on` and `right_on` keywords" ] }, { "cell_type": "code", "execution_count": 77, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.809978Z", "iopub.status.busy": "2024-01-10T00:13:15.809622Z", "iopub.status.idle": "2024-01-10T00:13:15.823891Z", "shell.execute_reply": "2024-01-10T00:13:15.823352Z" }, "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
employeegroup
0BobAccounting
1JakeEngineering
2LisaEngineering
3SueHR
\n", "
" ], "text/plain": [ " employee group\n", "0 Bob Accounting\n", "1 Jake Engineering\n", "2 Lisa Engineering\n", "3 Sue HR" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namesalary
0Bob70000
1Jake80000
2Lisa120000
3Sue90000
\n", "
" ], "text/plain": [ " name salary\n", "0 Bob 70000\n", "1 Jake 80000\n", "2 Lisa 120000\n", "3 Sue 90000" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
employeegroupnamesalary
0BobAccountingBob70000
1JakeEngineeringJake80000
2LisaEngineeringLisa120000
3SueHRSue90000
\n", "
" ], "text/plain": [ " employee group name salary\n", "0 Bob Accounting Bob 70000\n", "1 Jake Engineering Jake 80000\n", "2 Lisa Engineering Lisa 120000\n", "3 Sue HR Sue 90000" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df3 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],\n", " 'salary': [70000, 80000, 120000, 90000]})\n", "display(df1, df3, pd.merge(df1, df3, left_on=\"employee\", right_on=\"name\"))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Employee and name both included now, so may want to drop one." ] }, { "cell_type": "code", "execution_count": 78, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.826815Z", "iopub.status.busy": "2024-01-10T00:13:15.826465Z", "iopub.status.idle": "2024-01-10T00:13:15.834368Z", "shell.execute_reply": "2024-01-10T00:13:15.833835Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
employeegroupnamesalary
0BobAccountingBob70000
1JakeEngineeringJake80000
2LisaEngineeringLisa120000
3SueHRSue90000
\n", "
" ], "text/plain": [ " employee group name salary\n", "0 Bob Accounting Bob 70000\n", "1 Jake Engineering Jake 80000\n", "2 Lisa Engineering Lisa 120000\n", "3 Sue HR Sue 90000" ] }, "execution_count": 78, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.merge(df1, df3, left_on=\"employee\", right_on=\"name\")" ] }, { "cell_type": "code", "execution_count": 79, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.837233Z", "iopub.status.busy": "2024-01-10T00:13:15.836883Z", "iopub.status.idle": "2024-01-10T00:13:15.844804Z", "shell.execute_reply": "2024-01-10T00:13:15.844272Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
employeegroupsalary
0BobAccounting70000
1JakeEngineering80000
2LisaEngineering120000
3SueHR90000
\n", "
" ], "text/plain": [ " employee group salary\n", "0 Bob Accounting 70000\n", "1 Jake Engineering 80000\n", "2 Lisa Engineering 120000\n", "3 Sue HR 90000" ] }, "execution_count": 79, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.merge(df1, df3, left_on=\"employee\", right_on=\"name\").drop('name', axis='columns')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### The `left_index` and `right_index` keywords\n", "\n", "Often one wants to join on index." ] }, { "cell_type": "code", "execution_count": 80, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.847768Z", "iopub.status.busy": "2024-01-10T00:13:15.847396Z", "iopub.status.idle": "2024-01-10T00:13:15.856391Z", "shell.execute_reply": "2024-01-10T00:13:15.855859Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
group
employee
BobAccounting
JakeEngineering
LisaEngineering
SueHR
\n", "
" ], "text/plain": [ " group\n", "employee \n", "Bob Accounting\n", "Jake Engineering\n", "Lisa Engineering\n", "Sue HR" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
hire_date
employee
Lisa2004
Bob2008
Jake2012
Sue2014
\n", "
" ], "text/plain": [ " hire_date\n", "employee \n", "Lisa 2004\n", "Bob 2008\n", "Jake 2012\n", "Sue 2014" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df1a = df1.set_index('employee')\n", "df2a = df2.set_index('employee')\n", "display(df1a, df2a)" ] }, { "cell_type": "code", "execution_count": 81, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.859251Z", "iopub.status.busy": "2024-01-10T00:13:15.858901Z", "iopub.status.idle": "2024-01-10T00:13:15.866112Z", "shell.execute_reply": "2024-01-10T00:13:15.865585Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
grouphire_date
employee
BobAccounting2008
JakeEngineering2012
LisaEngineering2004
SueHR2014
\n", "
" ], "text/plain": [ " group hire_date\n", "employee \n", "Bob Accounting 2008\n", "Jake Engineering 2012\n", "Lisa Engineering 2004\n", "Sue HR 2014" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "display(pd.merge(df1a, df2a, left_index=True, right_index=True))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Set arithmetic for joins \n", "\n", "Have so far been considering relational joins based on *intersection* \n", "(also called *inner* join)." ] }, { "cell_type": "code", "execution_count": 82, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.869060Z", "iopub.status.busy": "2024-01-10T00:13:15.868709Z", "iopub.status.idle": "2024-01-10T00:13:15.885530Z", "shell.execute_reply": "2024-01-10T00:13:15.884965Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namefood
0Peterfish
1Paulbeans
2Marybread
\n", "
" ], "text/plain": [ " name food\n", "0 Peter fish\n", "1 Paul beans\n", "2 Mary bread" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namedrink
0Marywine
1Josephbeer
\n", "
" ], "text/plain": [ " name drink\n", "0 Mary wine\n", "1 Joseph beer" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namefooddrink
0Marybreadwine
\n", "
" ], "text/plain": [ " name food drink\n", "0 Mary bread wine" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df6 = pd.DataFrame({'name': ['Peter', 'Paul', 'Mary'],\n", " 'food': ['fish', 'beans', 'bread']},\n", " columns=['name', 'food'])\n", "df7 = pd.DataFrame({'name': ['Mary', 'Joseph'],\n", " 'drink': ['wine', 'beer']},\n", " columns=['name', 'drink'])\n", "display(df6, df7, pd.merge(df6, df7, how='inner'))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "##### Outer join\n", "\n", "Can also join based on *union* (missing entries filled with NaNs)." ] }, { "cell_type": "code", "execution_count": 83, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.888739Z", "iopub.status.busy": "2024-01-10T00:13:15.888382Z", "iopub.status.idle": "2024-01-10T00:13:15.903223Z", "shell.execute_reply": "2024-01-10T00:13:15.902673Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namefood
0Peterfish
1Paulbeans
2Marybread
\n", "
" ], "text/plain": [ " name food\n", "0 Peter fish\n", "1 Paul beans\n", "2 Mary bread" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namedrink
0Marywine
1Josephbeer
\n", "
" ], "text/plain": [ " name drink\n", "0 Mary wine\n", "1 Joseph beer" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namefooddrink
0PeterfishNaN
1PaulbeansNaN
2Marybreadwine
3JosephNaNbeer
\n", "
" ], "text/plain": [ " name food drink\n", "0 Peter fish NaN\n", "1 Paul beans NaN\n", "2 Mary bread wine\n", "3 Joseph NaN beer" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "display(df6, df7, pd.merge(df6, df7, how='outer'))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "##### Left and right join\n", "\n", "Can also join based on *left* or *right* entries." ] }, { "cell_type": "code", "execution_count": 84, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.906173Z", "iopub.status.busy": "2024-01-10T00:13:15.905822Z", "iopub.status.idle": "2024-01-10T00:13:15.920315Z", "shell.execute_reply": "2024-01-10T00:13:15.919750Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namefood
0Peterfish
1Paulbeans
2Marybread
\n", "
" ], "text/plain": [ " name food\n", "0 Peter fish\n", "1 Paul beans\n", "2 Mary bread" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namedrink
0Marywine
1Josephbeer
\n", "
" ], "text/plain": [ " name drink\n", "0 Mary wine\n", "1 Joseph beer" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namefooddrink
0PeterfishNaN
1PaulbeansNaN
2Marybreadwine
\n", "
" ], "text/plain": [ " name food drink\n", "0 Peter fish NaN\n", "1 Paul beans NaN\n", "2 Mary bread wine" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "display(df6, df7, pd.merge(df6, df7, how='left'))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Overlapping column names\n", "\n", "Possible for DataFrames to have conflicting columns." ] }, { "cell_type": "code", "execution_count": 85, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.923323Z", "iopub.status.busy": "2024-01-10T00:13:15.922959Z", "iopub.status.idle": "2024-01-10T00:13:15.933270Z", "shell.execute_reply": "2024-01-10T00:13:15.932631Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namerank
0Bob1
1Jake2
2Lisa3
3Sue4
\n", "
" ], "text/plain": [ " name rank\n", "0 Bob 1\n", "1 Jake 2\n", "2 Lisa 3\n", "3 Sue 4" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namerank
0Bob3
1Jake1
2Lisa4
3Sue2
\n", "
" ], "text/plain": [ " name rank\n", "0 Bob 3\n", "1 Jake 1\n", "2 Lisa 4\n", "3 Sue 2" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df8 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],\n", " 'rank': [1, 2, 3, 4]})\n", "df9 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],\n", " 'rank': [3, 1, 4, 2]})\n", "display(df8, df9)" ] }, { "cell_type": "code", "execution_count": 86, "metadata": { "execution": { "iopub.execute_input": "2024-01-10T00:13:15.937472Z", "iopub.status.busy": "2024-01-10T00:13:15.936984Z", "iopub.status.idle": "2024-01-10T00:13:15.946330Z", "shell.execute_reply": "2024-01-10T00:13:15.945561Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namerank_Lrank_R
0Bob13
1Jake21
2Lisa34
3Sue42
\n", "
" ], "text/plain": [ " name rank_L rank_R\n", "0 Bob 1 3\n", "1 Jake 2 1\n", "2 Lisa 3 4\n", "3 Sue 4 2" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "display(pd.merge(df8, df9, on=\"name\", suffixes=[\"_L\", \"_R\"]))" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.18" } }, "nbformat": 4, "nbformat_minor": 4 }