# Coursework
# SPCE0038: Machine Learning with Big-Data

---

## Overview

This coursework is provided as a Jupyter notebook, which you will need to complete.  

Throughout the notebook you will need to complete code, analytic exercises and descriptive answers. If equations are required, please typeset your solutions using LaTeX in the markdown cell provided. Much of the grading of the coursework will be performed automatically, so it is critical that you name your variables as requested.

Before you turn this coursework in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says "YOUR ANSWER HERE" or `YOUR CODE HERE` and remove the `raise NotImplementedError()` exceptions that are thrown before you have added your answers.

Please also:
- Make sure you use a Python environment created from the `requirements.txt` files provided by the course.
- Make sure your notebook executes without errors.
- Do not add or remove cells; only provide your answers in the spaces given.
- Do not add or change code in the cells other than the ones marked with `# YOUR CODE HERE`.
- Do not overwrite or rename any existing variables.
- Do not install code or packages in the notebooks.
- Do not import any libraries other than modules from `sklearn`.
- Always label your plots.
- Answer the questions concisely and show your work/derivations/reasoning.

**Please rename the notebook file to include your candidate number in the filename, and add your candidate number below:**

In [None]:
CANDIDATE_NUMBER = ""

You will be able to run some basic tests in the notebook to check that the basic operation of your code is as expected.  However, do not assume your responses are complete or fully correct just because the basic tests pass.

Once you have renamed the notebook file and completed the exercises, please upload the notebook to Moodle.


---

## AstroML

The data used in this coursework is obtained using [AstroML](http://www.astroml.org), a Python package for machine learning in astronomy.  Although we take data from AstroML, this coursework is not based on standard AstroML examples, so you will *not* find the solutions in AstroML examples!

## SDSS

The data obtained through AstroML was observed by the [Sloan Digital Sky Survey](https://www.sdss.org/) (SDSS), which began observations in 2000.  SDSS data have led to many scientific advances and the experiment is widely seen as one of the most successful surveys in the history of astronomy.

---

## Dependencies

- Standard course dependencies (e.g. numpy, scikit-learn, etc.)
- [AstroML](http://www.astroml.org)
- [AstroPy](http://www.astropy.org/)

---

In [None]:
import numpy as np
from matplotlib import pyplot as plt

In [None]:
def check_var_defined(var):
    try:
        exec(var)
    except NameError:
        raise NameError(var + " not defined.")
    else:
        print(var + " defined.")

## Part 1: Regression

In these exercises we will consider the regression problem of the astronomical distance modulus vs redshift relationship.

In astronomy, the [distance modulus](https://en.wikipedia.org/wiki/Distance_modulus) specifies the difference between the apparent and absolute magnitudes of an astronomical object.  It provides a way of expressing astrophysical distances.

Astronomical [redshift](https://en.wikipedia.org/wiki/Redshift) specifies the shift in wavelength that astronomical objects undergo due to the expansion of the Universe.  Due to Hubble's Law, more distant objects experience a greater redshift.


In [None]:
from astroML.datasets import generate_mu_z

In [None]:
# Load data
m = 150
z_sample, mu_sample, dmu = generate_mu_z(m, random_state=3)

*Plot the distance modulus ($\mu$) vs redshift ($z$), including error bars.*

In [None]:
# Plot data
def plot_dist_mod():
    # YOUR CODE HERE
    raise NotImplementedError()
    plt.xlabel('$z$')
    plt.ylabel(r'$\mu$')
    plt.title('Distance modulus vs redshift')
    plt.ylim(36, 50)
    plt.xlim(0, 1.5)
plot_dist_mod()

Recall that the normal equations for linear regression follow from analytically minimising the cost function:

$$\min_\theta\ C(\theta) = \min_\theta \ (X \theta - y)^{\rm T}(X \theta - y).$$

Show analytically that the solution is given by 

$$ \hat{\theta} = \left( X^{\rm T} X \right)^{-1} X^{\rm T} y. $$

[Matrix calculus identities](https://en.wikipedia.org/wiki/Matrix_calculus) may be useful (note that we use the denominator layout convention).
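As a hedged illustration of the closed-form expression above, the sketch below applies the normal equations to synthetic data (not the coursework dataset). The names `x_toy`, `y_toy` and `X_b` are assumptions introduced for this example only; a column of ones is prepended so that the first parameter plays the role of the intercept.

```python
import numpy as np

# Synthetic data (hypothetical): y = 2 + 3 x + small Gaussian noise
rng = np.random.default_rng(0)
x_toy = rng.uniform(0.0, 1.0, size=50)
y_toy = 2.0 + 3.0 * x_toy + rng.normal(scale=0.1, size=50)

# Design matrix with a leading column of ones for the intercept term
X_b = np.column_stack([np.ones_like(x_toy), x_toy])

# Normal equations: solve (X^T X) theta = X^T y rather than forming the inverse
theta_hat = np.linalg.solve(X_b.T @ X_b, X_b.T @ y_toy)

# Cross-check against NumPy's least-squares solver
theta_lstsq, *_ = np.linalg.lstsq(X_b, y_toy, rcond=None)
print(theta_hat, theta_lstsq)
```

Solving the linear system is usually preferred numerically to computing the inverse explicitly, although both correspond to the same analytic expression.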

*Expand the cost function and drop terms that do not depend on $\theta$ (use LaTeX mathematics expressions):*

YOUR ANSWER HERE

*Calculate the derivative, set it to zero, and solve for $\theta$ (use LaTeX mathematics expressions):*

YOUR ANSWER HERE

*Solve for $\theta$ by numerically implementing the analytic solution given above.*

In [None]:
def compute_theta_lin_reg(X, y):
    # YOUR CODE HERE
    raise NotImplementedError()
    return theta

In [None]:
assert compute_theta_lin_reg(z_sample, mu_sample).shape == (2,)
theta = compute_theta_lin_reg(z_sample, mu_sample)
(theta_c, theta_m) = theta
print("Linear regression parameters recovered analytically: intercept={0:.4f}, slope={1:.4f}".format(theta_c, theta_m))

In [None]:
check_var_defined('theta_c')
check_var_defined('theta_m')

*Write a method to make a prediction for a given redshift.*

In [None]:
def predict_lin_reg(theta, x):
    # YOUR CODE HERE
    raise NotImplementedError()
    return y

*Predict the distance modulus for a range of redshift values between 0.01 and 1.5 and plot the predicted curve overlaid on the data (make a new plot; do not revise the plot above).  Call the variable used to store the predictions for your linear model `mu_pred_lin`.*

In [None]:
z = np.linspace(0.01, 1.5, 1000)
plot_dist_mod()
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
check_var_defined('mu_pred_lin')
assert mu_pred_lin.shape == (len(z),), "Make sure the shape of your predictions is correct"

*Solve for the parameters $\theta$ using Scikit-Learn.*

In [None]:
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert lin_reg.coef_.shape == (1,), "Make sure your features have the right shape, such that we have 1 fitted coefficient"
print("Linear regression parameters recovered by scikit-learn: intercept={0:.4f}, slope={1:.4f}"
      .format(lin_reg.intercept_, lin_reg.coef_[0]))

*Extend your model to include polynomial features up to degree 15 (using Scikit-Learn).  Use variable `lin_reg_poly` for your revised model.*

In [None]:
degree = 15
bias = False
from sklearn.preprocessing import PolynomialFeatures
def compute_poly_features(degree, bias):
    # Return polynomial features of samples and class
    # YOUR CODE HERE
    raise NotImplementedError()
    return z_sample_poly, poly_features
z_sample_poly, poly_features = compute_poly_features(degree, bias)
# Train model
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
check_var_defined('lin_reg_poly')

*Plot the data and the predictions of your models considered so far (linear and polynomial regression).  Call the variable used to store the predictions for your polynomial model `mu_pred_poly`.*

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
check_var_defined('mu_pred_poly')
assert mu_pred_poly.shape == (len(z),)

*Comment on the accuracy of your models.*

YOUR ANSWER HERE

*Think about methods that could be used to improve the performance of your models. Improve your polynomial model and use the improved model to make predictions. Call the variable used to store the polynomial model `ridge_reg_poly`. Call the variable used to store the predictions for your polynomial model `mu_pred_poly_improved`.*
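One common option, hinted at by the variable name `ridge_reg_poly`, is ridge ($\ell_2$) regularisation. The following is a minimal sketch on synthetic data, not a reference solution: the toy data, the polynomial degree, the scaling step and the value `alpha=1.0` are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data (hypothetical): a smooth curve plus noise
rng = np.random.default_rng(1)
x_toy = np.sort(rng.uniform(0.05, 1.5, size=60))
y_toy = np.log(x_toy) + rng.normal(scale=0.2, size=60)
X_toy = x_toy.reshape(-1, 1)  # Scikit-Learn expects 2D feature arrays

def make_model(regressor, degree=8):
    # Polynomial expansion, then scaling so the penalty treats features evenly
    return make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        StandardScaler(),
        regressor,
    )

ols_model = make_model(LinearRegression()).fit(X_toy, y_toy)
ridge_model = make_model(Ridge(alpha=1.0)).fit(X_toy, y_toy)

# Ridge shrinks the coefficients, which tends to smooth the fitted curve
ols_norm = np.linalg.norm(ols_model[-1].coef_)
ridge_norm = np.linalg.norm(ridge_model[-1].coef_)
print("coefficient norms: OLS = {0:.3g}, ridge = {1:.3g}".format(ols_norm, ridge_norm))

x_new = np.linspace(0.05, 1.5, 200).reshape(-1, 1)
y_ridge = ridge_model.predict(x_new)
```

In practice the regularisation strength would typically be chosen by cross-validation rather than fixed by hand.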

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
check_var_defined('ridge_reg_poly')
check_var_defined('mu_pred_poly_improved')
assert mu_pred_poly_improved.shape == (len(z),), "Make sure the shape of your predictions is correct"

*Plot the predictions made with the new model and all previous models considered.*

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

*Compute the RMS error between your predictions and the data samples.*
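For reference, the root-mean-square (RMS) error between two equal-length arrays $a$ and $b$ is $\sqrt{\frac{1}{n}\sum_i (a_i - b_i)^2}$. A minimal sketch with hypothetical arrays `a` and `b` (not coursework data):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([1.0, 2.0, 5.0, 4.0])

# Square the residuals, average them, then take the square root
rms = np.sqrt(np.mean((a - b) ** 2))
print(rms)  # 1.0
```

Note that the two arrays must be evaluated at the same points for the comparison to be meaningful.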

In [None]:
# Define a general function to compute the RMS error
def compute_rms(mu_1, mu_2):
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
assert np.isclose(compute_rms(mu_pred_lin, mu_pred_lin), 0.0)

In [None]:
# Compute the RMS error between the data and the predictions for each model.
# Use variables rms_sample_lin, rms_sample_poly and rms_sample_poly_improved.
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# Print RMS values computed.
print("rms_sample_lin = {0:.4f}".format(rms_sample_lin))
print("rms_sample_poly = {0:.4f}".format(rms_sample_poly))
print("rms_sample_poly_improved = {0:.4f}".format(rms_sample_poly_improved))

In [None]:
check_var_defined('rms_sample_lin')

In [None]:
check_var_defined('rms_sample_poly')

In [None]:
check_var_defined('rms_sample_poly_improved')

*Comment on what models you believe are best.*

YOUR ANSWER HERE

Using our cosmological concordance model we can predict the theoretical distance modulus vs redshift relationship using our understanding of the physics.

In [None]:
from astroML.cosmology import Cosmology
cosmo = Cosmology()
mu_cosmo = np.array(list(map(cosmo.mu, z)))

*Plot the data, predictions made with all regression models, and the values predicted by the cosmological model.*

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

*Compute the RMS error between the predictions made by the cosmological model and each of the regression models, over the sample array `z`.*

In [None]:
# Compute the RMS error between the cosmological model and the predictions of each regression model.
# Use variables rms_cosmo_lin, rms_cosmo_poly and rms_cosmo_poly_improved.
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# Print RMS values computed.
print("rms_cosmo_lin = {0:.4f}".format(rms_cosmo_lin))
print("rms_cosmo_poly = {0:.4f}".format(rms_cosmo_poly))
print("rms_cosmo_poly_improved = {0:.4f}".format(rms_cosmo_poly_improved))

In [None]:
check_var_defined('rms_cosmo_lin')

In [None]:
check_var_defined('rms_cosmo_poly')

In [None]:
check_var_defined('rms_cosmo_poly_improved')

*Comment on the RMS values computed and the implications for the accuracy of the different regression models considered.*

YOUR ANSWER HERE

---

## Part 2: Classification

In these exercises we will consider classification of [RR Lyrae](https://en.wikipedia.org/wiki/RR_Lyrae_variable) variable stars.  RR Lyrae variables are often used as standard candles to measure astronomical distances since their period of pulsation can be related to their absolute magnitude.

Observations of star magnitudes are made in each [SDSS filter band](http://skyserver.sdss.org/dr2/en/proj/advanced/color/sdssfilters.asp): u, g, r, i, z.

We will consider the space of astronomical "colours" to distinguish RR Lyraes from background stars.  Astronomical colours are simply differences in magnitudes between bands, e.g. u-g, g-r, r-i, i-z.  You can find further background [here](https://en.wikipedia.org/wiki/Color%E2%80%93color_diagram).
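To make the definition of a colour concrete, the sketch below computes the four colours from a small array of magnitudes. The magnitude values are hypothetical and chosen only for illustration; the coursework data already provide the colours directly.

```python
import numpy as np

# Hypothetical magnitudes for three stars in the bands u, g, r, i, z
mags = np.array([
    [19.2, 18.1, 17.8, 17.7, 17.6],
    [20.0, 18.9, 18.7, 18.6, 18.6],
    [18.5, 17.3, 17.2, 17.1, 17.1],
])

# Colours are differences of adjacent bands: u-g, g-r, r-i, i-z
colours = mags[:, :-1] - mags[:, 1:]
print(colours.shape)  # (3, 4)
```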

First, download the data.  (This may take some time on first execution.  Subsequent executions will read from cached data on your system.)

In [None]:
# Load data
from astroML.datasets import fetch_rrlyrae_combined
X, y = fetch_rrlyrae_combined()

You can learn more about the format of the returned data [here](http://www.astroml.org/modules/generated/astroML.datasets.fetch_rrlyrae_combined.html).  In particular, note that the columns of `X` are u-g, g-r, r-i, i-z.

*Construct a Pandas DataFrame for the `X` data and a Series for the `y` data.  Call your Pandas objects `X_pd` and `y_pd` respectively.*

Be sure to give your columns the correct colour name, e.g. 'u-g'.

In [None]:
import pandas as pd
cols=['u-g', 'g-r', 'r-i', 'i-z']
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
check_var_defined('X_pd')
print(X_pd)

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
check_var_defined('y_pd')
print(y_pd)

*Combine your data and targets into a single Pandas DataFrame, labelling the target column 'target'.  Call the resulting Pandas DataFrame `X_pd_all`.*

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
check_var_defined('X_pd_all')
print(X_pd_all)

*Add a 'target description' column to your existing `X_pd_all` DataFrame, with fields 'Background' and 'RR Lyrae' to specify the target type.*
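The Pandas operations involved in the steps above can be sketched on a tiny hypothetical array. The names `X_toy`, `y_toy`, `df_X`, `s_y` and `df_all` are assumptions for this example and are deliberately different from the variable names requested in the exercises.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the feature matrix and binary targets
X_toy = np.array([[1.1, 0.3, 0.1, 0.0],
                  [1.2, 0.2, 0.1, 0.1],
                  [1.0, 0.1, 0.0, 0.0]])
y_toy = np.array([0, 1, 0])
names = ['u-g', 'g-r', 'r-i', 'i-z']

df_X = pd.DataFrame(X_toy, columns=names)
s_y = pd.Series(y_toy, name='target')

# Column-wise concatenation aligns on the shared default integer index
df_all = pd.concat([df_X, s_y], axis=1)

# Map numeric labels to readable descriptions
df_all['target description'] = df_all['target'].map({0: 'Background', 1: 'RR Lyrae'})
print(df_all)
```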

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
print(X_pd_all)

*How many RR Lyrae variable stars are there in the dataset (i.e. compute `n_rrlyrae`)?*

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
check_var_defined('n_rrlyrae')
print("n_rrlyrae = {0}".format(n_rrlyrae))

*How many background stars are there in the dataset (i.e. compute `n_background`)?*

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
check_var_defined('n_background')
print("n_background = {0}".format(n_background))

*Plot scatter plot pairs for all colour combinations using `seaborn`.  Colour the points by target type. Make sure the distribution plots are normalised to have an area of 1 under the curve for each of the classes.*

In [None]:
%matplotlib inline
import seaborn as sns; sns.set()

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

Let's separate the data into training and test sets, keeping 25% of the data for testing.  

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

First let's consider 1D classification for the zeroth colour, i.e. $u-g$. 

In [None]:
ind = 0
col=cols[ind]
col

In [None]:
X_train_1d = X_train[:, ind]
X_train_1d = X_train_1d.reshape(-1,1)
X_test_1d = X_test[:, ind]
X_test_1d = X_test_1d.reshape(-1,1)

To get some further intuition about the 1D classification problem consider a 1D plot of class against colour.

In [None]:
def plot_scatter():
    plt.figure(figsize=(10,5))
    plt.scatter(X_train_1d[y_train==1], y_train[y_train==1], c='m', marker='^', label='RR Lyrae')
    plt.scatter(X_train_1d[y_train==0], y_train[y_train==0], c='c', marker='v', label='Background')
    plt.xlabel('$' + col + '$')
    plt.ylabel('Probability of type RR Lyrae')
plot_scatter()    
plt.legend()

*Given the plot shown above, comment on how well you expect logistic regression to perform.*

YOUR ANSWER HERE

*Where would you guess the decision boundary should lie?  Set the variable `decision_boundary_guess` to your guess.*

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
check_var_defined('decision_boundary_guess')
print("decision_boundary_guess = {0:.4f}".format(decision_boundary_guess))

Use Scikit-Learn to perform logistic regression to classify the two classes for this 1D problem.

First, set the inverse regularisation strength `C` such that regularisation is effectively not performed.

In [None]:
C = 1e10

*Second, fit the model using Scikit-Learn. Use the variable `clf` for your classification model.*

In [None]:
from sklearn.linear_model import LogisticRegression
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
check_var_defined('clf')

*Compute the decision boundary of the logistic regression model fitted by Scikit-Learn.  Use variable `decision_boundary_sklearn` for your result.*

(Ensure your result is a scalar and not an array of length 1.)
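For a 1D logistic regression model the decision boundary is the point where the predicted probability equals 0.5, i.e. where $\theta_0 + \theta_1 x = 0$. The sketch below illustrates this on synthetic data; the toy data and the names `toy_clf` and `boundary` are assumptions, and indexing into `intercept_` and `coef_` is one way to obtain a scalar rather than a length-1 array.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic 1D data (hypothetical): two overlapping Gaussian classes
rng = np.random.default_rng(2)
x0 = rng.normal(loc=0.0, scale=1.0, size=200)  # class 0
x1 = rng.normal(loc=2.0, scale=1.0, size=200)  # class 1
X_toy = np.concatenate([x0, x1]).reshape(-1, 1)
y_toy = np.concatenate([np.zeros(200), np.ones(200)])

toy_clf = LogisticRegression(C=1e10, max_iter=1000).fit(X_toy, y_toy)

# p = 0.5 where theta_0 + theta_1 x = 0, i.e. x = -theta_0 / theta_1
boundary = -toy_clf.intercept_[0] / toy_clf.coef_[0, 0]
print("boundary = {0:.4f}".format(boundary))

# Sanity check: the predicted probability at the boundary should be 0.5
p_at_boundary = toy_clf.predict_proba([[boundary]])[0, 1]
```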

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert not hasattr(decision_boundary_sklearn, "__len__")
print("decision_boundary_sklearn = {0:.4f}".format(decision_boundary_sklearn))

*Evaluate the probabilities predicted by your logistic regression model over the domain specified by the variable `X_1d_new`. Use variable `y_1d_proba` for your computed probabilities.*

In [None]:
X_1d_new = np.linspace(0.3, 2.0, 1000).reshape(-1, 1)
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
check_var_defined('y_1d_proba')

*Plot the probability of a star being of type RR Lyrae against the colour variable considered.  Also plot the probability of being a Background star.  Overlay these plots on the scatter plot of class types.  Also plot the decision boundary that you guessed previously and the one computed by Scikit-Learn.*

In [None]:
plot_scatter()
# YOUR CODE HERE
raise NotImplementedError()

*From inspection of your plot, how would all objects in the training set be classified?*

YOUR ANSWER HERE

*Use your logistic regression model fitted by Scikit-Learn to predict the class of all objects in the test set. Use variable `y_test_1d_pred` to specify your answer.*

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
check_var_defined('y_test_1d_pred')

*How many objects are classified as of type RR Lyrae?  Use variable `n_rrlyrae_pred` to specify your answer.*

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
check_var_defined('n_rrlyrae_pred')
assert n_rrlyrae_pred % 1 == 0 # check integer
print("n_rrlyrae_pred = {0}".format(n_rrlyrae_pred))

*How many objects are classified as of type Background?  Use variable `n_background_pred` to specify your answer.*

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
check_var_defined('n_background_pred')
assert n_background_pred % 1 == 0 # check integer
print("n_background_pred = {0}".format(n_background_pred))

Let's check the Scikit-Learn result by solving the logistic regression problem (without regularisation) manually.

Recall that the cost function for logistic regression is given by
$$
C(\theta) = -\frac{1}{m} \sum_{i=1}^m 
\left [ 
y^{(i)} \log(\hat{p}^{(i)})
+
(1 - y^{(i)}) \log(1 - \hat{p}^{(i)})
\right],
$$


where

$$\hat{p} = \sigma(\theta^\text{T} x) = \frac{1}{1+\exp{(-\theta^\text{T} x)}}. $$

Show analytically that the derivative of the cost function is given by
$$\begin{eqnarray}
\frac{\partial C}{\partial \theta} 
&=& 
\frac{1}{m} \sum_{i=1}^m 
\left[ \sigma\left(\theta^{\rm T} x^{(i)} \right) - y^{(i)} \right]
x^{(i)}\\
&=&
\frac{1}{m} 
X^{\rm T}
\left[ \sigma\left(X \theta \right) - y \right]
\end{eqnarray}$$

(use LaTeX mathematics expressions).
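Before working through the derivation, it can be reassuring to check the stated gradient numerically. The sketch below compares the vectorised expression with central finite differences of the cost on a small synthetic problem; it is a numerical sanity check under assumed toy data, not a substitute for the analytic derivation requested. The helper names `sigma`, `cost` and `grad` are hypothetical.

```python
import numpy as np

def sigma(t):
    # Logistic function 1 / (1 + exp(-t))
    return 1.0 / (1.0 + np.exp(-t))

def cost(theta, X, y):
    # Cross-entropy cost C(theta), averaged over the m instances
    p = sigma(X @ theta)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def grad(theta, X, y):
    # Vectorised form (1/m) X^T [sigma(X theta) - y]
    return X.T @ (sigma(X @ theta) - y) / len(y)

# Synthetic problem (hypothetical): bias column plus one random feature
rng = np.random.default_rng(3)
X_toy = np.column_stack([np.ones(100), rng.normal(size=100)])
y_toy = (rng.uniform(size=100) < 0.5).astype(float)
theta_toy = np.array([0.3, -0.7])

# Central finite differences, one parameter at a time
eps = 1e-6
grad_numeric = np.zeros_like(theta_toy)
for j in range(theta_toy.size):
    step = np.zeros_like(theta_toy)
    step[j] = eps
    grad_numeric[j] = (cost(theta_toy + step, X_toy, y_toy)
                       - cost(theta_toy - step, X_toy, y_toy)) / (2.0 * eps)

print(np.max(np.abs(grad_numeric - grad(theta_toy, X_toy, y_toy))))
```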

*First, simplify the cost function terms $\log(\hat{p})$ and $\log(1-\hat{p})$ to express them in terms that are linear in $\log\left({1+{\rm e}^{-\theta^{\rm T}x}}\right)$.*

(You may drop $i$ superscripts for notational brevity.)

YOUR ANSWER HERE

*Next, substitute these terms into the cost function and simplify to also express the cost function in terms linear in $\log\left({1+{\rm e}^{-\theta^{\rm T}x}}\right)$.*

YOUR ANSWER HERE

*Now compute the derivative of the cost function with respect to variable $\theta_j$, i.e. compute $\partial C / \partial \theta_j$.*

YOUR ANSWER HERE

*Combine terms for all $\theta_j$ to give the overall derivative with respect to $\theta$, i.e. $\partial C / \partial \theta$.*

YOUR ANSWER HERE

Using the analytic expression for the derivative of the cost function, we will solve the logistic regression problem by implementing a gradient descent algorithm.

*First, define the sigmoid function.*

In [None]:
def sigmoid(x):
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
assert np.isclose(sigmoid(0), 0.5)

*Next, extend the training data to account for a bias term in your model. Use variable `X_train_1d_b` to specify your result.*

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
check_var_defined('X_train_1d_b')

*Implement batch gradient descent to fit the parameters of your logistic regression model.  Consider `n_iterations = 4000` iterations and a learning rate of `alpha = 100.0`. Consider a starting point of $\theta_0 = (1, 1)$, i.e. `theta = np.array([[1], [1]])`. Use variable `theta` to specify your estimated parameters.*

*(Make sure your implementation is reasonably efficient. If it takes longer than 2 minutes to execute when running on our server it may not complete and you will not be awarded grades. The reference solution runs in under 10 seconds.)*
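The general shape of a vectorised batch gradient descent loop can be sketched on synthetic data as follows. Everything here is an assumption for illustration: the toy data, the zero starting point, the step size of 0.5 and the 5000 iterations are suited to this toy problem only and differ from the settings requested above. Using matrix products rather than Python loops over instances is what keeps each iteration cheap.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic 1D data (hypothetical): two overlapping Gaussian classes
rng = np.random.default_rng(4)
x_toy = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(2.0, 1.0, 300)])
y_toy = np.concatenate([np.zeros(300), np.ones(300)]).reshape(-1, 1)
X_b = np.column_stack([np.ones_like(x_toy), x_toy])  # prepend bias column
m = len(y_toy)

theta_gd = np.zeros((2, 1))
step_size = 0.5  # assumed value suited to this toy data only
for _ in range(5000):
    p_hat = 1.0 / (1.0 + np.exp(-X_b @ theta_gd))
    gradient = X_b.T @ (p_hat - y_toy) / m  # (1/m) X^T [sigma(X theta) - y]
    theta_gd = theta_gd - step_size * gradient

# Compare with an effectively unregularised Scikit-Learn fit
ref = LogisticRegression(C=1e10, tol=1e-10, max_iter=10000)
ref.fit(x_toy.reshape(-1, 1), y_toy.ravel())
print(theta_gd.ravel(), ref.intercept_[0], ref.coef_[0, 0])
```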

In [None]:
n_iterations = 4000
alpha = 100.0
theta = np.array([[1], [1]])

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
check_var_defined('theta')
print("theta[0] = {0:.4f}".format(theta[0][0]))
print("theta[1] = {0:.4f}".format(theta[1][0]))

*Compute the difference between the logistic regression model intercept computed by Scikit-Learn and manually.  Use variable `intercept_diff` for your result.*

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
check_var_defined('intercept_diff')
print("intercept_diff = {0:.4E}".format(intercept_diff))

*Compute the difference between the logistic regression model* slope *(i.e. coefficient) computed by Scikit-Learn and manually.  Use variable `coeff_diff` for your result.*

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
check_var_defined('coeff_diff')
print("coeff_diff = {0:.4E}".format(coeff_diff))

You should find that the solution from your gradient descent algorithm is close (although not identical) to that recovered by Scikit-Learn. 

Both fitted logistic regression models, however, are not effective. The reason is class imbalance.  *Describe the class imbalance problem in your own words and how it manifests itself in the classification problem at hand.*

YOUR ANSWER HERE

The class imbalance problem can be addressed by weighting the training instances of each class in inverse proportion to the frequency of that class.

*Repeat the fitting of your logistic regression model but this time perform class weighting.  Use variable `clf_balanced` for your new model.*

See the `class_weight` argument of the Scikit-Learn [Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) classifier for further details on how to perform class weighting.
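To illustrate the effect of class weighting, the sketch below fits an unweighted and a weighted logistic regression model to a synthetic, heavily imbalanced 1D dataset and compares the recall on the rare class. The toy data and the names `plain` and `weighted` are assumptions for this example; the `'balanced'` option weights each class by `n_samples / (n_classes * np.bincount(y))`, as described in the Scikit-Learn documentation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Synthetic imbalanced data (hypothetical): 2000 negatives, 40 positives
rng = np.random.default_rng(5)
X_toy = np.concatenate([rng.normal(0.0, 1.0, 2000),
                        rng.normal(1.5, 1.0, 40)]).reshape(-1, 1)
y_toy = np.concatenate([np.zeros(2000, dtype=int), np.ones(40, dtype=int)])

plain = LogisticRegression(C=1e10, max_iter=1000).fit(X_toy, y_toy)
weighted = LogisticRegression(C=1e10, max_iter=1000,
                              class_weight='balanced').fit(X_toy, y_toy)

# 'balanced' weights each class by n_samples / (n_classes * class count)
weights = len(y_toy) / (2 * np.bincount(y_toy))
print("class weights:", weights)

recall_plain = recall_score(y_toy, plain.predict(X_toy))
recall_weighted = recall_score(y_toy, weighted.predict(X_toy))
print("recall: plain = {0:.2f}, balanced = {1:.2f}".format(recall_plain, recall_weighted))
```

The gain in recall for the rare class generally comes at the cost of more false positives, so precision should be examined alongside it.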

In [None]:
from sklearn.linear_model import LogisticRegression
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
check_var_defined('clf_balanced')

*Compute the decision boundary of the logistic regression model fitted by Scikit-Learn when weighting classes.  Use variable `decision_boundary_sklearn_balanced` for your result.*

(Ensure your result is a scalar and not an array of length 1.)

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
check_var_defined('decision_boundary_sklearn_balanced')
assert not hasattr(decision_boundary_sklearn_balanced, "__len__")
print("decision_boundary_sklearn_balanced = {0:.4f}".format(decision_boundary_sklearn_balanced))

*Evaluate the probabilities predicted by your new logistic regression model over the domain specified by the variable `X_1d_new`. Use variable `y_1d_proba_balanced` for your computed probabilities.*

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
check_var_defined('y_1d_proba_balanced')

*For your new balanced model, plot the probability of a star being of type RR Lyrae against the colour variable considered.  Also plot the probability of being a Background star.  Overlay these plots on the scatter plot of class types.  Also plot the decision boundary that you guessed previously, the one computed by Scikit-Learn initially, and the one computed by Scikit-Learn for your new balanced model.*

In [None]:
plot_scatter()
# YOUR CODE HERE
raise NotImplementedError()

*Comment on the decision boundary of the balanced model compared to the unbalanced models fitted previously.*

YOUR ANSWER HERE

Now that we've built up good intuition surrounding the subtleties of the classification problem at hand in 1D, let's consider the 2D problem (we will keep to 2D for plotting convenience).

For the 2D case we consider the following colours.

In [None]:
ind = 1
cols[:ind+1]

Consider the following training and test data for the 2D problem.

In [None]:
X_train_2d = X_train[:, :ind+1]
X_train_2d = X_train_2d.reshape(-1,ind+1)
X_test_2d = X_test[:, :ind+1]
X_test_2d = X_test_2d.reshape(-1,ind+1)

*Train a logistic regression model for this 2D problem.  Use variable `clf_2d_logistic` for your classifier.*

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
check_var_defined('clf_2d_logistic')

*Compute the precision and recall of your 2D logistic regression model. Use variables `precision_logistic` and `recall_logistic` for your results.*

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
check_var_defined('precision_logistic')
print("precision_logistic = {0:.6f}".format(precision_logistic))

In [None]:
check_var_defined('recall_logistic')
print("recall_logistic = {0:.6f}".format(recall_logistic))

Consider the following meshgrid defining the u-g and g-r colour domain of interest.

In [None]:
xlim = (0.7, 1.45)  # u-g
ylim = (-0.15, 0.4) # g-r
xx, yy = np.meshgrid(np.linspace(xlim[0], xlim[1], 100),
                     np.linspace(ylim[0], ylim[1], 100))

*Over the domain specified above plot the predicted classification probability.  Overlay on your plot the data instances, highlighting whether each is an RR Lyrae or a background star, and the decision boundary.*

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

*Now train an SVM classifier that can support a non-linear decision boundary on the same problem. Use the variable `clf_2d_svm` for your model.*

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
check_var_defined('clf_2d_svm')

*Replicate for the SVM your plot above for the 2D logistic regression model.  Over the domain specified above plot the decision function score.  Overlay on your plot the data instances, highlighting whether each is an RR Lyrae or a background star, and the decision boundary.*
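The quantities needed for such a plot can be sketched on a synthetic 2D dataset. This is a minimal illustration under stated assumptions, not a reference solution: the `make_moons` data, the RBF kernel and the hyperparameter values `C=10.0` and `gamma=2.0` are choices made for this example only, and the grid names `gx` and `gy` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Synthetic 2D data (hypothetical) that is not linearly separable
X_toy, y_toy = make_moons(n_samples=300, noise=0.1, random_state=0)

# The RBF kernel allows a curved decision boundary
toy_svm = SVC(kernel='rbf', C=10.0, gamma=2.0).fit(X_toy, y_toy)

# Evaluate the decision function on a grid covering the data
gx, gy = np.meshgrid(np.linspace(-1.5, 2.5, 50), np.linspace(-1.0, 1.5, 50))
grid = np.c_[gx.ravel(), gy.ravel()]
scores = toy_svm.decision_function(grid).reshape(gx.shape)

# The decision boundary is the zero level set of `scores`, which could be
# drawn with, e.g., plt.contour(gx, gy, scores, levels=[0]) in matplotlib
train_accuracy = toy_svm.score(X_toy, y_toy)
print(scores.shape, train_accuracy)
```

Unlike `predict_proba` for logistic regression, `decision_function` returns a signed score rather than a probability, with the sign indicating the predicted class.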

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

*Compute the precision and recall of your 2D SVM model. Use variables `precision_svm` and `recall_svm` for your results.*

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
check_var_defined('precision_svm')
print("precision_svm = {0:.6f}".format(precision_svm))

In [None]:
check_var_defined('recall_svm')
print("recall_svm = {0:.6f}".format(recall_svm))

*Comment on the difference in decision boundary between your logistic regression and SVM models and how this impacts the effectiveness of the models.*

YOUR ANSWER HERE