# Lecture 1: Introduction to machine learning

![](https://www.tensorflow.org/images/colab_logo_32px.png)
[Run in colab](https://colab.research.google.com/drive/1zNonj4k0gGhz8Q9kg-5kMk2y9Rq-yjJQ)

In [1]:
import datetime
now = datetime.datetime.now()
print("Last executed: " + now.strftime("%Y-%m-%d %H:%M:%S"))

Last executed: 2024-01-10 00:13:11


## Course overview

### Description and objectives

This module covers how to apply machine learning techniques to large data-sets, so-called *big-data*. 

An introduction to machine learning (ML) is presented to provide a general understanding of the concepts of machine learning, common machine learning techniques, and how to apply these methods to data-sets of moderate sizes.  

Deep learning and computing frameworks to scale machine learning techniques to big-data are then presented. 

Scientific data formats and data curation methods are also discussed.

### Syllabus

Foundations of ML (e.g. overview of ML, training, data wrangling, scikit-learn, performance analysis, gradient descent), data formats and curation (e.g. data pipelines, data version control, databases, big-data), ML methods (e.g. logistic regression, SVMs, ANNs, decision trees, ensemble learning and random forests, dimensionality reduction), deep learning and scaling to big-data (e.g. TensorFlow, deep ANNs, CNNs, RNNs, autoencoders) and applications of ML in astrophysics, high-energy physics and industry.

### Prerequisites

Students should have a reasonable working knowledge of Python, some familiarity with working in the command line environment in Linux/Unix based operating systems, and a general understanding of elementary mathematics, including linear algebra and calculus. 

No previous familiarity with machine learning is required.

### Resources

#### Textbooks 

- VanderPlas, ["*Python data science handbook*"](https://jakevdp.github.io/PythonDataScienceHandbook/), O'Reilly, 2017, ISBN 9781491912058
  ([Example code](https://github.com/jakevdp/PythonDataScienceHandbook))

- Geron (1st Edition), ["*Hands-on machine learning with Scikit-Learn and TensorFlow*"](https://www.oreilly.com/library/view/hands-on-machine-learning/9781491962282/), O'Reilly, 2017, ISBN 9781491962299
  ([Example code](https://github.com/ageron/handson-ml))

- Geron (2nd Edition), ["*Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow*"](https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/), O'Reilly, 2019, ISBN 9781492032649 ([Example code](https://github.com/ageron/handson-ml2))

- Goodfellow, Bengio, Courville (GBC), ["*Deep learning*"](http://www.deeplearningbook.org), MIT Press, 2016, ISBN 9780262035613

#### Tutorials 
 
- [Scikit-Learn tutorial](https://github.com/jakevdp/sklearn_tutorial), VanderPlas

#### Main code frameworks and libraries

- [Scikit-Learn](http://scikit-learn.org/stable/)
 
- [TensorFlow](https://www.tensorflow.org/)

### Schedule

Lectures will run on Fridays from 10am-1pm.


### Jupyter notebooks

Each lecture has an accompanying Jupyter notebook, with executable code.


These slides are a Jupyter notebook.

Notebooks can be viewed in slide mode using [RISE](https://rise.readthedocs.io/en/stable/).

The supporting Jupyter notebooks thus serve as the course *slides*, *lecture notes*, and *examples*.

A book version is also made available.

### Course philosophy

This is a practical, hands-on course.  While we will cover basic concepts and background theory (but not in great mathematical depth or rigor), a large component of the course will focus on implementing and running machine learning algorithms.  Many code examples and exercises will be considered.

The course Jupyter notebooks will be made available weekly, in advance of lectures.  Students can then follow examples in the lectures by running code live (and inspecting variables and making modifications).  

### Exercises

A number of lectures are accompanied by an additional Jupyter notebook with related examples for you to complete.  The solutions to these exercises will be made available as the module progresses.  These exercises will not be graded but are intended to help improve your understanding of the lecture material.

### Assessment


- Courseworks: 2 x 20% = 40%
- Exam: 60%


#### Coursework

Courseworks will involve downloading a Jupyter notebook, which you will need to complete.  

Throughout the notebook you will need to complete code, analytic exercises and descriptive answers. Much of the grading of the coursework will be performed automatically.

There will be two courseworks.  The first coursework will be issued after the first 9 lectures, by which point all the material required to complete it will have been covered.  The second coursework will be issued after the first 15 lectures, by which point all the material required to complete it will have been covered.

#### Exam

*Answer THREE questions* of the FOUR questions provided.

Each question carries equal marks (15 marks per question).

Markers place importance on clarity, and a portion of the marks is awarded for clear descriptions, answers, drawings and diagrams, and for attention to precision in quantitative answers.

### Computing setup

Students can bring their own laptops to class in order to run notebooks and complete examples.
 
All examples are implemented in Python 3. 

The main Python libraries that are required include the following:
```
- numpy 
- scipy
- matplotlib
- scikit-learn
- ipython/jupyter
- seaborn
- tensorflow
- astroML
```

An environment to run the notebooks can be set up with the versions of the libraries in `requirements.txt` (details below), following the steps below in a terminal (macOS, Linux) or Anaconda Prompt (Windows): 

1. Create an environment named mlbd with Python 3.11.

   ```
   conda create --name mlbd python=3.11
   ```

2. Activate the `mlbd` environment and then install the libraries in the requirements.txt file. 

   ```
   conda activate mlbd 
   pip install -r requirements.txt  
   ```
3. Finally, start Jupyter, which will open the explorer and let you run the notebooks. 

    ```
    jupyter lab
    ```

Content of `requirements.txt`:

```
numpy==1.24.3
matplotlib==3.7.4
pandas==2.0.3
scikit-learn==1.3.2
seaborn==0.13.1
tensorflow==2.13.1
tensorflow_datasets==4.9.2
jupyterlab==4.0.10
jupyter-book==0.15.1
jupyterlab_rise==0.42.0
astroML==1.0.2.post1
nbdime==4.0.1
boto3==1.34.15
pyarrow==14.0.2
pyspark==3.5.0
pyppeteer==1.0.2
dvc==3.38.1
```

Content of `requirements_macosx.txt` for Mac:

```
numpy==1.24.3
matplotlib==3.7.4
pandas==2.0.3
scikit-learn==1.3.2
seaborn==0.13.1
tensorflow==2.13.1
tensorflow-metal==1.1.0
tensorflow_datasets==4.9.2
jupyterlab==4.0.10
jupyter-book==0.15.1
jupyterlab_rise==0.42.0
astroML==1.0.2.post1
nbdime==4.0.1
boto3==1.34.15
pyarrow==14.0.2
pyspark==3.5.0
pyppeteer==1.0.2
dvc==3.38.1
```

## What is machine learning?

### Artificial intelligence (AI)

Ironically...

- Solving "computational problems" that are difficult for humans is straightforward for machines (i.e. problems described by a list of formal mathematical rules).

- Solving "intuitive problems" that are easy for humans is difficult for machines (i.e. problems difficult to describe formally).

This is often known as [Moravec's paradox](https://en.wikipedia.org/wiki/Moravec%27s_paradox) (although the formal definition is a little more specific).

Solution is to allow computers to learn from experience and to build an understanding of the world through a hierarchy of concepts.

### Knowledge base approach

Hard-code knowledge about world in formal set of rules and use logical inference.

Very difficult to capture complexity of intuitive problems in this manner.


### Machine learning (ML)

Arthur Samuel (1959):
> "[Machine learning is the] field of study that gives computers the ability to learn without being explicitly programmed."

<br>

Tom Mitchell (1997):
> "A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E."



### Uses of machine learning

1. **Prediction:** Predict outcome given data.
2. **Inference:** Better understand data (and their distribution).

### Data representations


Performance of machine learning depends on representation of data given.

Data presented to learning algorithm as *features*.

Traditional approach to machine learning involved *"feature engineering"*, where a practitioner with domain expertise would develop techniques to extract informative features from raw data. 


#### Examples of features

- Computer vision: edges and corners
- Spam: frequency of words
- Character recognition: histograms of black pixels along rows/columns, number of holes, number of strokes
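
As a minimal sketch of hand-crafted feature extraction, the spam example above could be implemented as follows. The vocabulary, the message and the function name `word_frequencies` are hypothetical choices made purely for illustration.

```python
from collections import Counter


def word_frequencies(message, vocabulary):
    """Return the relative frequency of each vocabulary word in a message."""
    words = message.lower().split()
    counts = Counter(words)
    total = len(words)
    return [counts[word] / total if total else 0.0 for word in vocabulary]


# Hypothetical hand-picked vocabulary (this choice is the "feature engineering" step)
vocabulary = ["free", "win", "meeting"]

# One feature per vocabulary word; this vector is what a learning algorithm would see
features = word_frequencies("Win a free prize free today", vocabulary)
print(features)
```

The quality of the resulting features depends entirely on how well the vocabulary was chosen, which is why domain expertise matters in this approach.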

### Learning representations

Alternative is to learn features.

- Can discover informative features from data.
- Minimal human intervention.


### Approaches to representation learning



- Dedicated feature learning, e.g. autoencoder combining encoder and decoder.

- Representation learning integral to overall machine learning technique, e.g. deep learning.

### Approaches to artificial intelligence

<img src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture01_Images/ai_venn_diagram.png" width="500" style="display:block; margin:auto"/>

[Image credit: [GBC](http://www.deeplearningbook.org/)]

### AI pipelines

<img src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture01_Images/ai_approaches.png" width="400" style="display:block; margin:auto"/>

[Image credit: [GBC](http://www.deeplearningbook.org/)]

## The unreasonable effectiveness of data

As society becomes increasingly digitised, the volume of available data is exploding. 

A significant increase in the volume of data can lead to dramatic increases in the performance of machine learning techniques.


(Term coined in Halevy, Norvig & Pereira, 2009, [*The unreasonable effectiveness of data*](http://goo.gl/q6LaZ8).)

### Size of benchmark data-sets

<img src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture01_Images/data_sizes.png" width="800" style="display:block; margin:auto"/>
    
[Image credit: [GBC](http://www.deeplearningbook.org/)]

### Size of data can have a larger impact than algorithm

<img src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture01_Images/importance_of_data.png" width="500" style="display:block; margin:auto"/>

Source: Banko & Brill, 2001, [*Scaling to very very large corpora for natural language disambiguation*](http://goo.gl/R5enIE)



> As a rule of thumb, a supervised deep learning algorithm will perform reasonably well with around 5,000 labelled samples per category.  

> With 10 million samples, it will match or exceed human performance. 

[Source: [GBC](http://www.deeplearningbook.org/)]


However, in many cases very large datasets are not available, and in some cases they cannot be obtained at all.  

Hence, developing effective algorithms remains critical.

## A brief history of deep learning

### AlexNet: an inflection point in machine learning

<img src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture01_Images/alexnet_performance.png" width="800" style="display:block; margin:auto"/>

Source: [*Ten Years of AI in Review*](https://towardsdatascience.com/ten-years-of-ai-in-review-85decdb2a540), Towards Data Science

### Deep learning timeline

<img src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture01_Images/deeplearning_timeline.png" width="800" style="display:block; margin:auto"/>

Source: [*Ten Years of AI in Review*](https://towardsdatascience.com/ten-years-of-ai-in-review-85decdb2a540), Towards Data Science

### A fourth industrial revolution?

<img src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture01_Images/industrial_revolution_4.jpg" width="600" style="display:block; margin:auto"/>

[[Image Source](https://rw-rw.facebook.com/195228108045971/photos/a.195229821379133/195229781379137/)]

- First industrial revolution (1760-1840): mechanisation through steam and water power.
- Second industrial revolution (1871-1914): electrification, railroad and telegraph networks.
- Third industrial revolution (late 20th century): digital revolution.
- Fourth industrial revolution (21st century): AI revolution.

## Classes of machine learning

1. **Supervised:** Learn to predict output given input (given labelled training data).
2. **Unsupervised:** Discover internal representation of input.
3. **Reinforcement:** Learn action to maximise payoff.

<img src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture01_Images/supervised_unsupervised_learning.png" width="800" style="display:block; margin:auto"/>
    
[[Image source](http://beta.cambridgespark.com/courses/jpm/01-module.html)]

### Supervised learning

Learn to predict output given input (given labelled training data).

1. **Regression:** Target output is a (real) number, <br>
    e.g. estimate flux intensity.

2. **Classification:** Target output is a class label,<br>
    e.g. classify galaxy morphology.

#### How supervised learning works

- Select model defined by function $f$, and model target $y$ from inputs $x$ by
$y = f(x, \theta),$
where $\theta$ are the parameters of the model that are learnt during training.



- Learning typically involves minimising the difference between the model predictions $f(x, \theta)$ and the target outputs $y$ over a training data-set (more on training, validation and test data-sets later).
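
A minimal sketch of these two steps, assuming a linear model and synthetic data (both chosen only for illustration): the parameters $\theta$ are learnt by least squares, i.e. by minimising the squared difference between the predictions $f(x, \theta)$ and the targets $y$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training data (assumed for illustration): y = 2x + 1 plus a little noise
x = rng.uniform(0.0, 1.0, size=100)
y = 2.0 * x + 1.0 + 0.05 * rng.normal(size=100)


def f(x, theta):
    """Linear model with parameters theta = (intercept, slope)."""
    return theta[0] + theta[1] * x


# Design matrix with a column of ones for the intercept
A = np.column_stack([np.ones_like(x), x])

# Learn theta by minimising the sum of squared differences between f(x, theta) and y
theta, *_ = np.linalg.lstsq(A, y, rcond=None)

mse = np.mean((f(x, theta) - y) ** 2)
print("theta:", theta, "training MSE:", mse)
```

The learnt parameters should land close to the values used to generate the data, although not exactly on them because of the noise.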

### Unsupervised learning

Discover internal representation of input.

1. **Cluster finding:** Learn clusters of similar structure in data.
2. **Density estimation:** Learn representations of data (probability distributions).
3. **Dimensionality reduction:** Provides compact, low-dimensional representation of data.

#### Unsupervised learning examples


Anomaly detection, clustering groups of similar objects, and visualising high-dimensional data in 2D or 3D plots are examples of unsupervised learning.
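
As a rough sketch of two of these tasks, the snippet below applies scikit-learn's `KMeans` (cluster finding) and `PCA` (dimensionality reduction) to synthetic, unlabelled data. The data and the choice of two clusters are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Two well-separated synthetic blobs in 5 dimensions (no labels are used)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(50, 5))
blob_b = rng.normal(loc=5.0, scale=0.5, size=(50, 5))
X = np.vstack([blob_a, blob_b])

# Cluster finding: assign each sample to one of two clusters
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(X)

# Dimensionality reduction: project to 2D, e.g. for plotting
X_2d = PCA(n_components=2).fit_transform(X)

print(cluster_labels[:5], cluster_labels[-5:], X_2d.shape)
```

Note that the cluster indices themselves are arbitrary: the algorithm recovers the grouping, not a meaningful label for each group.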

### Reinforcement learning

Learn action to maximise payoff.

- Output is an action or sequence of actions and the only supervisory signal is an occasional numerical (scalar) reward.
- Difficult since rewards are delayed.
- Not covered in this course.

<img src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture01_Images/rl_interaction.png" width="500" style="display:block; margin:auto"/>
    
[[Image credit](https://www.analyticsvidhya.com/blog/2016/12/getting-ready-for-ai-based-gaming-agents-overview-of-open-source-reinforcement-learning-platforms/)]

### Reinforcement learning examples

Go, playing computer games, driverless cars, self-navigating vacuum cleaners, and scheduling of elevators are all applications of reinforcement learning.

E.g. [Google [DeepMind] machine learns to master video games](http://www.bbc.co.uk/news/science-environment-31623427)

## Training

Machine *learning* often involves solving an *optimization* problem, i.e. finding the parameters $\theta$ of the model $f$ to best represent the training data (for supervised learning).


### Objective function

Typically maximise/minimise some goodness-of-fit/cost function.
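
For instance, one common choice of cost function for regression is the mean squared error. The notation here is an assumption for illustration: $N$ training pairs $(x_i, y_i)$ and a model $f$ with parameters $\theta$.

$$
J(\theta) = \frac{1}{N} \sum_{i=1}^{N} \bigl( f(x_i, \theta) - y_i \bigr)^2,
\qquad
\hat{\theta} = \arg\min_{\theta} J(\theta).
$$

Training then amounts to finding the parameters $\hat{\theta}$ that minimise $J$.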

#### Example of convex objective function

<img src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture01_Images/optimization_convex.png" width="500" style="display:block; margin:auto"/>
    
[Image credit: Kirkby, UC Irvine, LSST Dark Energy Summer School 2017]

#### Example of non-convex objective function

<img src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture01_Images/optimization_nonconvex.jpg" width="500" style="display:block; margin:auto"/>

[[Image source](https://cs.hse.ru/data/2016/08/26/1121363361/moml.jpg)]

### Using gradients to optimize objective function (i.e. perform training)

- **(Batch) Gradient descent:** Use all training data to compute the gradient at each iteration.
- **Stochastic gradient descent:** Use a single randomly selected data-point to estimate the gradient at each iteration.
- **Backpropagation:** Propagate errors backwards through a network to compute the gradients.
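
The difference between the first two approaches can be sketched in NumPy as follows, assuming a linear model with a mean squared error objective. The synthetic data, step sizes, iteration counts and decaying learning-rate schedule are all illustrative choices rather than recommended settings.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic linear data (assumed for illustration): y = 3x - 1 plus noise
X = np.column_stack([np.ones(200), rng.uniform(-1.0, 1.0, size=200)])
y = X @ np.array([-1.0, 3.0]) + 0.1 * rng.normal(size=200)


def gradient(theta, X_batch, y_batch):
    """Gradient of the mean squared error with respect to theta."""
    residual = X_batch @ theta - y_batch
    return 2.0 * X_batch.T @ residual / len(y_batch)


# Batch gradient descent: every step uses the full training set
theta_gd = np.zeros(2)
for _ in range(500):
    theta_gd -= 0.1 * gradient(theta_gd, X, y)

# Stochastic gradient descent: every step uses one randomly chosen sample
theta_sgd = np.zeros(2)
for step in range(5000):
    i = rng.integers(len(y))
    learning_rate = 0.1 / (1.0 + 0.01 * step)  # decaying step size (hypothetical schedule)
    theta_sgd -= learning_rate * gradient(theta_sgd, X[i:i + 1], y[i:i + 1])

print("batch GD:", theta_gd)
print("SGD:     ", theta_sgd)
```

Each stochastic step is much cheaper than a batch step but follows a noisy gradient estimate, which is why a decaying step size is used here to help the iterates settle.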


#### Batch gradient descent 
<img src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture01_Images/optimization_gd.png" width="400" style="display:block; margin:auto"/>

#### Stochastic gradient descent 
<img src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture01_Images/optimization_sgd.png" width="400" style="display:block; margin:auto"/>

[[Image source](http://www.holehouse.org/mlclass)]

### Batch and online learning


#### Batch learning

Algorithm is trained using all available training data at once.

Also called *offline learning*.

- Requires substantial resources (CPU, memory space, disk space).
- If want to add new training data, must re-train from scratch on new full set of data (i.e. not just the new data but also the old data).

#### Online learning

Algorithm is trained incrementally, using a sub-set of the training data (individual instances or small batches) at each step.


- Each learning step does *not* require substantial resources. 
- Can integrate new training data on the fly.
- May be able to throw away data once it has been used (although might not want to).
- If fed bad data, performance will decline.
- Noisy training.
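
A minimal sketch of online learning, assuming scikit-learn's `SGDRegressor` and its `partial_fit` method, with a simulated stream of mini-batches. The synthetic data, batch size and `eta0=0.1` initial learning rate are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(2)
model = SGDRegressor(eta0=0.1, random_state=0)

# Simulate a data stream arriving in mini-batches (synthetic: y = 4x + 2 plus noise)
for _ in range(200):
    X_batch = rng.uniform(-1.0, 1.0, size=(20, 1))
    y_batch = 4.0 * X_batch[:, 0] + 2.0 + 0.1 * rng.normal(size=20)
    # Update the model with this batch only; the batch could then be discarded
    model.partial_fit(X_batch, y_batch)

print("slope:", model.coef_[0], "intercept:", model.intercept_[0])
```

No step ever holds more than one mini-batch in memory, in contrast to batch learning, where the full data-set is needed at once.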

## Overfitting and underfitting

- **Problem:** The learned model may fit the training set extremely well but fail to generalise to new examples.

### 1D example

<img src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture01_Images/overfitting_1d.png" width="900" style="display:block; margin:auto"/>

[[Image source](http://scikit-learn.org/stable/_images/sphx_glr_plot_underfitting_overfitting_001.png)]

### 2D example

<img src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture01_Images/overfitting_2d.png" width="900" style="display:block; margin:auto"/>

[[Image source](https://www.safaribooksonline.com/library/view/deep-learning/9781491924570/assets/dpln_0107.png)]

### Techniques to avoid overfitting

- Reduce complexity of model.
- Regularization:
    - Place additional constraints (priors) on features/parameters.
    - E.g. smoothness of parameters, sparsity of model (i.e. limit complexity).
- Split data into training, validation and test sets (e.g. cross-validation). 
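
To illustrate regularization, here is a sketch that fits a flexible polynomial model with an added penalty on the size of the parameters (ridge regression, used here as one example of a regularizer). The data, polynomial degree and penalty strengths are assumptions chosen for illustration, and `fit_ridge` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(3)

# Small noisy data-set and a flexible degree-9 polynomial model (assumed for illustration)
x = np.linspace(-1.0, 1.0, 15)
y = np.sin(np.pi * x) + 0.2 * rng.normal(size=x.size)
A = np.vander(x, N=10, increasing=True)  # columns 1, x, ..., x^9


def fit_ridge(A, y, alpha):
    """Minimise ||A theta - y||^2 + alpha ||theta||^2 in closed form.

    For simplicity the intercept is penalised along with the other parameters.
    """
    n_params = A.shape[1]
    return np.linalg.solve(A.T @ A + alpha * np.eye(n_params), A.T @ y)


theta_weak = fit_ridge(A, y, alpha=1e-8)    # almost unregularised
theta_strong = fit_ridge(A, y, alpha=1e-1)  # stronger penalty on large parameters

print("parameter norm, weak penalty:  ", np.linalg.norm(theta_weak))
print("parameter norm, strong penalty:", np.linalg.norm(theta_strong))
```

The stronger penalty shrinks the parameters, trading a slightly worse fit to the training points for a smoother model that is less able to chase the noise.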


## Testing and validation

### No free lunch theorem

Essentially, all algorithms are equivalent when performance is averaged over all possible problems.

Consequently, there is no a priori model that is guaranteed to work best on all problems.

(Wolpert, 1996, [*The lack of a priori distinctions between learning algorithms*](http://goo.gl/q6LaZ8))

It is therefore a matter of validating models empirically.

### Training and test datasets

Split data into training and test sets (e.g. 80% for training and 20% for testing).

The model is trained on the *training set* and then tested on the *test set*.  

**No data used in training the method is then used to evaluate it.**

Error rate on the test set is called the *generalization error* or *out of sample error*.

If the training error is low but the generalization error is high, it suggests the model is overfitted.
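
A minimal sketch of this procedure using scikit-learn's `train_test_split`, assuming synthetic data and a linear regression model chosen for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
X = rng.uniform(-1.0, 1.0, size=(200, 1))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=200)

# Hold out 20% of the data; the model never sees it during training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)

train_error = mean_squared_error(y_train, model.predict(X_train))
test_error = mean_squared_error(y_test, model.predict(X_test))
print("training error:", train_error)
print("estimated generalization error:", test_error)
```

Here the two errors should be similar, since the model matches how the data were generated; a large gap between them would be the warning sign of overfitting.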

### Hyperparameters

Many machine learning algorithms contain hyperparameters to control the model.  

One (**bad**) approach is to evaluate alternative models defined by different hyperparameters on test set and select the model that performs best.

However, this optimizes the model for the test set and may not generalise to other data well.

### Validation


A better approach is to split the data into three sets: 
1. Training set
2. Validation set
3. Test set

Train models on the training set and evaluate different models (with different hyperparameters) on the validation set.

Only once the final model to be used is fully specified should it be applied to the test set to estimate its generalization performance.
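
The three-way split can be sketched as follows, assuming a ridge regression model whose regularization strength `alpha` plays the role of the hyperparameter. The 60/20/20 proportions, candidate values and synthetic data are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=300)

# 60% training, 20% validation, 20% test (proportions are an assumption)
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=0
)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0
)

# Compare candidate hyperparameter values on the validation set only
candidate_alphas = [0.01, 1.0, 100.0]
val_errors = {}
for alpha in candidate_alphas:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    val_errors[alpha] = mean_squared_error(y_val, model.predict(X_val))

best_alpha = min(val_errors, key=val_errors.get)

# Touch the test set once, after the model is fully specified
final_model = Ridge(alpha=best_alpha).fit(X_train, y_train)
test_error = mean_squared_error(y_test, final_model.predict(X_test))
print("best alpha:", best_alpha, "test error:", test_error)
```

Because the test set plays no part in choosing `alpha`, the final test error remains an honest estimate of generalization performance.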

### Cross-validation

A disadvantage of the previous approach is that less data are available for training.


*Cross-validation* addresses this issue by performing a sequence of fits where each subset of the data is used both as a training set and a validation set.


<img src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture01_Images/2-fold-CV.png" width="600" style="display:block; margin:auto"/>


[Image credit: [VanderPlas](https://github.com/jakevdp/PythonDataScienceHandbook)]

Get a validation accuracy score for each trial; the scores can then be combined (e.g. averaged).

#### Extension to n-fold cross-validation

<img src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture01_Images/5-fold-CV.png" width="700" style="display:block; margin:auto"/>

[Image credit: [VanderPlas](https://github.com/jakevdp/PythonDataScienceHandbook)]
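
As a minimal sketch, 5-fold cross-validation can be run with scikit-learn's `cross_val_score`. The synthetic two-class data and the choice of logistic regression are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(6)

# Synthetic two-class problem: the label depends on the sign of the first feature
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)

# 5-fold cross-validation: each fold is used once for validation
# and contributes to the training set in the other four fits
scores = cross_val_score(LogisticRegression(), X, y, cv=5)

print("fold accuracies:", scores)
print("mean accuracy:", scores.mean())
```

Averaging the per-fold scores gives a single validation estimate that makes use of all the available data.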