class: center, middle

# Regression

### I 2020

---
class: center, middle

# Schedule Changes

---

# Schedule Changes

- We did not have anything planned for Semana U, so we only "lost" one class
- We merged two classes into one
- We cut one lab (new grading: Lab 1: 15%, Labs 2-5: 12.5% each)
- We also moved the deadlines for the project

---
class: medium

# Labs

* Lab 1: Python/Stats/PyTorch intro, 10/3 - 21/4
* Lab 2: Regression, 28/4 - 5/5
* Lab 3: Classification, 12/5 - 19/5
* Lab 4: GANs, 26/5 - 9/6
* Lab 5: Ethics, 16/6 - 23/6

---

# Project

* 14/4: Proposal: 10 min presentation and a document
* 5/5: Update 1: 7 min presentation and a document
* 2/6: Update 2: 3 min presentation and a document
* 23/6: Q&A
* 30/6: Presentations: 15 min presentation and a document

---
class: center, middle

# Machine Learning

---
class: medium

# Introduction

In any machine learning problem, one of the most important things is to know whether the task at hand is:

1. Classification
2. Regression

Knowing this will allow you to pick an appropriate algorithm for the task.

We will now present the concepts that allow you to differentiate between the two.

---
class: medium

# Examples of Classification

* Given a set of input features, predict whether a breast cancer is benign or malignant.
* Given an image, correctly classify it as containing cats or dogs.
* Given an email, predict whether it is spam or not.

---
class: medium

# Regression Problems

* In regression problems we try to predict continuously valued outputs

  - Given the size of a house, predict its price (a real value).

Regression models emulate generative processes that derive one or more values from a set of variables. These input variables are capable of explaining the output, by correlation or causality.

---
class: center, middle

# Regression

---

# Regression

Given: Some x and corresponding observed values y (real numbers)

Wanted: A function f that, given a (potentially new!) x, produces a prediction y

What property do we want this function to have?

---

# A good fit

- Our goal is to find a function that fits our data "well"
- Note that there is an infinite number of possible functions
- If we only have one y value for each x (not necessarily true!), there is an infinite number of functions that fit "perfectly"
- What we want to do with our function, though, is **predict** values for new x

---

# Occam's Razor

"Numquam ponenda est pluralitas sine necessitate"

("Plurality must never be posited without necessity")

Or, in a more popular formulation:

"Simpler hypotheses are generally better than complex ones."

Or just:

**The simplest solution is the best.**

---

# Simpler Hypothesis
---

# Beware!

We can go too simple ...
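---
class: medium

# Preview: too simple vs. too complex

A minimal, hypothetical sketch (the data is made up; the error measure used here, the MSE, is defined formally on the next slide): polynomials of different degrees fit to noisy samples of the same quadratic process.

```python
# Hypothetical illustration: how well do polynomials of different
# degrees fit the data they were trained on, and new data from the
# same process?
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    x = rng.uniform(-3, 3, n)
    y = 0.5 * x**2 + rng.normal(0, 0.5, n)   # made-up "true" process
    return x, y

x_train, y_train = sample(20)   # data we fit on
x_new, y_new = sample(200)      # new data we would like to predict

for degree in [0, 2, 9]:        # too simple, about right, quite complex
    coeffs = np.polyfit(x_train, y_train, degree)   # least-squares fit
    err_train = np.mean((y_train - np.polyval(coeffs, x_train)) ** 2)
    err_new = np.mean((y_new - np.polyval(coeffs, x_new)) ** 2)
    print(f"degree {degree}: train error {err_train:.2f}, new-data error {err_new:.2f}")
```

The training error can only shrink as the degree grows; the error on new data typically will not, which is exactly why "fits the training data perfectly" is not the property we are after.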
---

# Regression Metrics

* How can we describe a "good fit" mathematically?
* The MSE, or mean squared error, is one of the most widely used metrics in regression
* For a regression model that predicts house prices in Cartago, where ŷᵢ is the predicted price and yᵢ the real price of house i, the MSE is calculated as follows:

$$
\textit{MSE}(\hat{y}) = \frac{1}{n} \sum_i (y_i - \hat{y}_i)^2
$$

---

# Linear Regression

* Last time, some of you already fit a linear model using Ordinary Least Squares
* How does it work?

$$
\hat{y} = \vec{w}'\cdot \vec{x}\\\\
\textit{MSE}(\hat{y}) = (y - \hat{y})^2 = (y - \vec{w}'\cdot \vec{x})^2\\\\
\textit{MSE}(\hat{\mathbf{y}}) = \frac{1}{n} \sum_i (y_i - \hat{y}_i)^2 = \frac{1}{n} \lVert \mathbf{y} - X\vec{w} \rVert^2 \\\\
\ldots \\\\
\vec{w} = (X^T\cdot X)^{-1}X^T \mathbf{y}
$$

(A short code sketch of the MSE and this closed-form solution appears a few slides ahead.)

---

# Linear Regression

WHY?

--

* What we want to do is find a value for w that minimizes the MSE
* How can we find the minimum of a function? We calculate the derivative and set it to 0
* And then we solve for w

---
class: medium

# A different view

* We are looking for a line:

$$
y = w x + b
$$

* We have two parameters: the slope w and the y-intercept b. These are knobs that we can adjust to find the "best" linear equation
* What is "best"? The values for w and b that minimize the MSE!

---

# Turning the knobs

* We could start by "guessing" some values for w and b, and see how good the resulting line is
* Then we search for better values from there
* How? Let's figure out in which "direction" the error becomes lower

---
class: medium

# An example

* Let's say we have 16 samples of (different) children's ages and their weights
* Each sample was taken on the child's birthday
* We would like a prediction/estimate of children's weights for any age (e.g. 12.15 years)
* A statistician said something about using a linear model, and that the line would pass through the mean x and y value, but we forgot the rest :(
* Idea: Start with a line with w=0 and "turn" it in the direction where the MSE decreases

---

# Finding a Line: Example

$$
y = \beta x + \varepsilon = w\cdot x + b
$$
*(Interactive demo: data points and an adjustable slope w; submitting a value shows the resulting MSE.)*
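---
class: medium

# Sketch: MSE and the closed-form solution

A minimal sketch with made-up data (not the demo's data) of the two formulas above: computing the MSE of a prediction, and solving for the weights directly with the normal equation.

```python
# Hypothetical data: 16 (age, weight) samples, as in the children example.
import numpy as np

rng = np.random.default_rng(0)
ages = np.arange(1, 17, dtype=float)                  # x values
weights = 3.0 * ages + 4.0 + rng.normal(0, 2.0, 16)   # noisy y values

def mse(y, y_hat):
    """Mean squared error: the average squared residual."""
    return np.mean((y - y_hat) ** 2)

# Design matrix with a column of ones, so the intercept b is part of w.
X = np.column_stack([ages, np.ones_like(ages)])

# Normal equation: w = (X^T X)^{-1} X^T y, solved without an explicit inverse.
w = np.linalg.solve(X.T @ X, X.T @ weights)

y_hat = X @ w
print("w (slope, intercept):", w)
print("MSE:", mse(weights, y_hat))
```

Using `np.linalg.solve` instead of explicitly inverting `X.T @ X` is the usual, numerically safer way to evaluate the closed-form formula.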
---
class: mmedium

# Gradient

* We were looking for the direction in which the error (or **loss**) decreases
* The gradient of a function tells us its slope, i.e. which way it decreases!
* General algorithm, starting with some initial estimate/guess for the parameters:

  - Calculate the loss, and its gradient with respect to the parameters
  - Move the parameters "a little bit" in the direction of the decrease
  - Repeat

---
class: mmedium

# Gradient Descent

* Note: This procedure makes no assumptions about what you are looking for. It could be a line, a polynomial, or anything else
* There are some requirements, though:

  - You want to minimize a loss (or maximize some quantity, e.g. a likelihood)
  - Your loss function has a derivative wrt the parameters
  - And that also requires that your model has a derivative wrt the parameters

---

# Gradient of the Loss

Our loss function (MSE) looks like this (for one sample):

$$
L(\hat{y}) = (y - \hat{y})^2
$$

Or, using the model (for some fixed x and y):

$$
L(M_w) = (y - M_w(x))^2
$$

We want the gradient:

$$
\frac{\partial }{\partial w} L(M_w)
$$

---

# The Chain Rule

Remember the Chain Rule?

$$
\frac{\partial }{\partial w} f(g(w)) = \frac{\partial }{\partial g(w)} f(g(w)) \cdot \frac{\partial }{\partial w} g(w)
$$

In our case, f is the loss, and g is the model:

$$
\frac{\partial }{\partial w} L(M_w) = \frac{\partial }{\partial M_w} L(M_w) \cdot \frac{\partial }{\partial w} M_w
$$

---

# Calculating the Gradient

The chain rule is nice, because we can calculate things in parts. We need:

$$
\frac{\partial }{\partial w} L(M_w) = \frac{\partial }{\partial M_w} L(M_w) \cdot \frac{\partial }{\partial w} M_w
$$

First, the derivative of the loss wrt the model:

$$
\frac{\partial }{\partial M_w} L(M_w) = \frac{\partial }{\partial M_w} (y - M_w(x))^2 = -2 \cdot (y - M_w(x))
$$

---

# Calculating the Gradient

Reminder: We need:

$$
\frac{\partial }{\partial w} L(M_w) = \frac{\partial }{\partial M_w} L(M_w) \cdot \frac{\partial }{\partial w} M_w
$$

We're still missing the gradient of the model wrt the parameters:

$$
\frac{\partial }{\partial w} M_w = \frac{\partial }{\partial w} (w\cdot x + b) = x
$$

---

# Putting it all together

$$
\frac{\partial }{\partial w} L(M_w) = -2 \cdot (y - M_w(x)) \cdot x
$$

Now we can do our update:

$$
w' = w - \alpha \cdot \frac{\partial }{\partial w} L(M_w) = w + 2 \alpha \cdot (y - M_w(x)) \cdot x
$$

Where does this `\(\alpha\)` come from?

--

We said we move "a little bit" in the downhill direction. The `\(\alpha\)` defines how fast we are going.

(A code sketch of the full update loop follows the polynomial example.)

---

# Why not just OLS?

* Of course, we have not really invented anything new
* In fact, we could just calculate the best line in one step using a formula ...
* But note: Now we can try more complex models!

---

# Fitting a polynomial

$$
y = a x^b
$$
*(Interactive demo: data points and adjustable parameters a and b; submitting values shows the resulting MSE, with an option to show the loss contours.)*
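---
class: medium

# Sketch: the full gradient descent loop

A sketch of how the derived pieces fit together for the line `\(y = wx + b\)`, with made-up data. The names `dmodel_w`, `dmodel_b`, and `dloss_m` mirror the lab, but the actual signatures there may differ.

```python
# Gradient descent by hand for y = w*x + b (hypothetical data).
import numpy as np

x = np.arange(1, 17, dtype=float)   # e.g. children's ages
y = 3.0 * x + 4.0                   # made-up "true" weights

def model(w, b, x):
    return w * x + b

def dloss_m(y, y_hat):
    return -2.0 * (y - y_hat)       # dL/dM for the squared error

def dmodel_w(x):
    return x                        # dM/dw

def dmodel_b(x):
    return 1.0                      # dM/db

w, b = 0.0, 0.0                     # initial guess
alpha = 0.005                       # learning rate: "a little bit"

for step in range(20000):
    y_hat = model(w, b, x)
    # Chain rule: dL/dw = dL/dM * dM/dw, averaged over all samples
    grad_w = np.mean(dloss_m(y, y_hat) * dmodel_w(x))
    grad_b = np.mean(dloss_m(y, y_hat) * dmodel_b(x))
    w -= alpha * grad_w
    b -= alpha * grad_b

print(f"w = {w:.3f}, b = {b:.3f}")  # should approach 3 and 4
```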
---
class: mmmedium

# For the Lab

* In the lab, you will implement exactly the parts we need for gradient descent: `dmodel_w`, `dmodel_b`, and `dloss_m`
* Then you run a loop (how long?) and update w and b in the downhill direction of the error function
* **Then** you will see that PyTorch can calculate the gradients automatically for you, and also has classes that perform the update step (see the short autograd sketch after the references)
* With that, you can start playing with different kinds of models, like a polynomial or exponential one
* The main purpose of this lab is to learn how gradient descent works, and how to make PyTorch do it for you automatically
* **New** deadline: 28/4, before class (= 3 more class sessions + 1 extra week)

---

# References

* Ian Goodfellow, Yoshua Bengio, and Aaron Courville, *Deep Learning*, MIT Press, 2016
* Christopher M. Bishop, *Pattern Recognition and Machine Learning*, Springer, 2006
* Jerome Friedman, Trevor Hastie, and Robert Tibshirani, *The Elements of Statistical Learning*, Springer Series in Statistics, 2001
* [Overfitting and Underfitting in Machine Learning](https://medium.com/greyatom/what-is-underfitting-and-overfitting-in-machine-learning-and-how-to-deal-with-it-6803a989c76)
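---
class: medium

# Bonus: the same update with PyTorch autograd (sketch)

A minimal sketch (made-up data, not the lab template) of letting PyTorch compute the gradients we derived by hand; the lab will introduce the exact classes to use.

```python
# Hypothetical example: PyTorch computes dL/dw and dL/db for us.
import torch

x = torch.arange(1.0, 17.0)
y = 3.0 * x + 4.0                       # made-up "true" line

w = torch.zeros(1, requires_grad=True)  # parameters we want to learn
b = torch.zeros(1, requires_grad=True)
alpha = 0.005                           # learning rate

for step in range(20000):
    y_hat = w * x + b
    loss = torch.mean((y - y_hat) ** 2)  # MSE
    loss.backward()                      # autograd fills w.grad and b.grad
    with torch.no_grad():                # plain update, no gradient tracking
        w -= alpha * w.grad
        b -= alpha * b.grad
        w.grad.zero_()
        b.grad.zero_()

print(w.item(), b.item())                # should approach 3 and 4
```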