Lecture 4: Neural Networks for Regression

# Neural Networks for Regression

### I 2020

---

# Reminder: Regression Problems

* In regression problems we try to predict continuously valued outputs, e.g. "Given the size of a house, predict its price (a real value)."

* Regression models emulate generative processes that derive one or more values from a set of variables.

* These input variables are capable of explaining the output, by correlation or causality.

---

# Reminder: Regression

* Given: Some x and corresponding observed values y (real numbers)

* Wanted: A function f that, given a (potentially new!) x, produces a prediction y

* Our goal is to find a function that fits our data "well"

---

# Reminder: Gradient Descent

* We are looking for a line:

$$
y = w x + b
$$

* We have two parameters, the slope `w` and the bias `b`. These are knobs that we can adjust or change to find the "best" linear equation

* What is "best"? The values for w and b that minimize the MSE!
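
To make "best" concrete, here is a minimal sketch of the MSE for a candidate line. The data points and the helper name `mse` are made up for illustration.

```Python
def mse(w, b, xs, ys):
    """Mean squared error of the line y = w*x + b on the data (xs, ys)."""
    errors = [(w * x + b - y) ** 2 for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)

# Made-up data that lies exactly on the line y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

perfect = mse(2.0, 1.0, xs, ys)  # the true parameters
worse = mse(1.0, 0.0, xs, ys)    # a poor guess
```

The true parameters give an error of 0, while the poor guess gives a larger value, so minimizing the MSE prefers the better line.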

---

# Reminder: Dot Product

Remember the dot product:

$$
\vec{w} \cdot \vec{x}' = \begin{pmatrix}w\\\\b\end{pmatrix} \cdot \begin{pmatrix}x \\\\ 1\end{pmatrix} = wx + b
$$

This lets us write the linear model more concisely:

$$
y = w x + b = \vec{w} \cdot \vec{x}'
$$
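
As a small sketch of the same idea in plain Python (the helper names `dot` and `extend` are hypothetical, chosen only for this example):

```Python
def dot(u, v):
    """Dot product of two equally long vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))

def extend(x):
    """Turn x into x' by appending a constant 1 for the bias."""
    return list(x) + [1.0]

w_vec = [2.0, 0.5]        # (w, b): slope 2, bias 0.5
x_prime = extend([3.0])   # x = 3 becomes x' = (3, 1)
y = dot(w_vec, x_prime)   # 2*3 + 0.5*1
```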

---

# Reminder: Gradient Descent

* We were looking in which direction the error (or **loss**) decreases

* The gradient of a function tells us its slope: it points in the direction of steepest increase, so the opposite direction is the way the function decreases!

* General algorithm, starting with some initial estimate/guess for the parameters:

  - Calculate the loss, and its gradient with respect to the parameters

  - Move the parameters "a little bit" into the direction of the decrease

  - Repeat
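
The loop above can be sketched for the linear model with the MSE as loss. The data, the learning rate and the number of steps are assumed values chosen for illustration.

```Python
def loss_and_gradient(w, b, xs, ys):
    """MSE of y = w*x + b and its gradient with respect to (w, b)."""
    n = len(xs)
    residuals = [w * x + b - y for x, y in zip(xs, ys)]
    loss = sum(r * r for r in residuals) / n
    grad_w = sum(2 * r * x for r, x in zip(residuals, xs)) / n
    grad_b = sum(2 * r for r in residuals) / n
    return loss, grad_w, grad_b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]   # made-up data on the line y = 2x + 1

w, b = 0.0, 0.0             # initial guess
learning_rate = 0.05        # how far "a little bit" is (assumed value)
losses = []
for step in range(2000):
    loss, grad_w, grad_b = loss_and_gradient(w, b, xs, ys)
    losses.append(loss)
    # Move against the gradient, i.e. in the direction of decrease
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b
```

On this toy data the parameters approach the slope 2 and the bias 1 that generated it.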
  
---

# Reminder: The Chain Rule

Remember the Chain Rule? We want to calculate the gradient of the loss function wrt the parameters. This is the product of the gradient of the loss function wrt the model output and the gradient of the model wrt the parameters:

$$
 \frac{\partial }{\partial w} L(M_w) = \frac{\partial }{\partial M_w} L(M_w) \cdot \frac{\partial }{\partial w} M_w
$$
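
As a worked example, assume a squared-error loss and the linear model from before, with `$\hat{y}$` denoting the observed value (this choice of loss is an assumption for illustration):

$$
L(M_w) = (M_w - \hat{y})^2, \quad M_w = w x + b\\\\
\frac{\partial }{\partial M_w} L(M_w) = 2 (M_w - \hat{y}), \quad \frac{\partial }{\partial w} M_w = x\\\\
\frac{\partial }{\partial w} L(M_w) = 2 (w x + b - \hat{y}) \cdot x
$$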

---

# Non-linear models

* In Lab 1 you tried some other functions

* We "only" had to make sure that we could calculate the gradient of the model wrt the parameters

* This means we can just use "any" differentiable function as our model!

* Ideally we want something that can represent many different functions

---

# Something non-linear

Linear model:

$$
M_w = y = \vec{w} \cdot \vec{x}'
$$

Non-linear model:

$$
M_w = y = h(\vec{w} \cdot \vec{x}')
$$

with "some" non-linear function `h`.

---

# Differentiability

Calculate the gradient:

$$
M_w = y = h(\vec{w} \cdot \vec{x}')\\\\
\frac{\partial }{\partial w} M_w = \frac{\partial }{\partial \vec{w} \cdot \vec{x}'} h(\vec{w} \cdot \vec{x}') \cdot \frac{\partial}{\partial w} \vec{w} \cdot \vec{x}'
$$

Or, in words: We will need to calculate the derivative of `h` wrt its input.

Another take: We **only** need to be able to calculate the derivative of `h` wrt its input.

---

# Functions

Summary: What do we want from h?

- Non-linear

- Differentiable

- "Interesting"

For example:

$$
h(z) = \frac{1}{1 + e^{-z}}
$$
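
A minimal sketch of this function and its derivative in plain Python. The closed form `h(z) * (1 - h(z))` is the standard derivative of the sigmoid; the finite-difference helper is only a sanity check and its name is hypothetical.

```Python
import math

def sigmoid(z):
    """h(z) = 1 / (1 + e^(-z))"""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    """Derivative of the sigmoid wrt its input: h(z) * (1 - h(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def numerical_derivative(f, z, eps=1e-6):
    """Central finite difference, used here only as a sanity check."""
    return (f(z + eps) - f(z - eps)) / (2 * eps)

z = 0.5
analytic = sigmoid_derivative(z)
numeric = numerical_derivative(sigmoid, z)
```

Both values should agree up to numerical error, which is all we need from `h` for gradient descent.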

---

# Sigmoid Function

Non-linear, differentiable and "interesting"!

Any problems?

---

# Output values

* Our sigmoid function only produces values between 0 and 1

* Often we want to predict values in a different range

* We could "scale" (and shift) the output!

$$
y = w_1 \cdot h(\vec{w}\cdot \vec{x}') + b_1
$$

This looks like a linear model that uses the result of h as its input!

---

# Recursion

* Let us call the operation `$h(\vec{w}\cdot \vec{x}')$` a "Neuron"

* This "neuron" takes some inputs `x'`, performs a linear transformation, and applies a function `h` to produce a result

* We can then pass this result to another neuron, which may use the same or a different `h`

* Implementation detail: Remember that we added a 1 to `x` to produce `x'`, so we can write the linear model as a simple dot product. We will use the `'` to denote when this operation is performed.

---

# Two Neurons

Now we have a compact representation:

$$
y = h_1(\vec{w}_1 \cdot h(\vec{w} \cdot \vec{x}')')
$$

`$h_1$` can just be the identity function, then we have exactly the same thing we had earlier:

$$
y = w_1 \cdot h(\vec{w}\cdot \vec{x}') + b_1
$$

Fun fact: As long as the additional layers are also differentiable, the entire function will be differentiable (you can try this at home).
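
A small sketch of two chained neurons in plain Python, assuming a sigmoid for `h` and the identity for `h_1`. All weights are made up, and the helper `neuron` is a hypothetical name.

```Python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def identity(z):
    return z

def neuron(weights, inputs, h):
    """Append the constant 1 to the inputs (x -> x'), take the dot
    product with the weights, and apply the activation function h."""
    extended = list(inputs) + [1.0]
    z = sum(w * x for w, x in zip(weights, extended))
    return h(z)

# Made-up parameters: (w, b) for the first neuron, (w_1, b_1) for the second
w_first = [1.5, -0.5]
w_second = [10.0, 2.0]

x = 0.8
hidden = neuron(w_first, [x], sigmoid)
y = neuron(w_second, [hidden], identity)

# With the identity as h_1 this is the scaled and shifted sigmoid
y_direct = 10.0 * sigmoid(1.5 * x - 0.5) + 2.0
```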

---

# Why?

---

# More Nonlinearity!

Recall:

$$
\vec{w} \cdot \vec{x}' = \begin{pmatrix}w\\\\b\end{pmatrix} \cdot \begin{pmatrix}x \\\\ 1\end{pmatrix} = wx + b
$$

The same thing works with more values in the vectors!

$$
\vec{w} \cdot \vec{x}' = \begin{pmatrix}w_1 \\\\ w_2 \\\\b\end{pmatrix} \cdot \begin{pmatrix}x_1 \\\\ x_2 \\\\ 1\end{pmatrix} = w_1 x_1 + w_2 x_2 + b
$$

This means each of our neurons could take more than just one input!

---

# Artificial Neural Networks

$$
\vec{a} = h_1(W_1 \cdot \vec{x}')\\\\
y = h_2(\vec{w_2} \cdot \vec{a}')\\\\
y = h_2(\vec{w_2} \cdot h_1(W_1 \cdot \vec{x}')')
$$
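
The forward pass of such a network can be sketched in plain Python. This assumes a sigmoid for `h_1`, the identity for `h_2`, and made-up weights; each row of `W_1` holds the weights of one hidden neuron with the bias as last entry.

```Python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(W, inputs, h):
    """One layer: every row of W holds the weights (bias last) of one neuron."""
    extended = list(inputs) + [1.0]   # x -> x'
    return [h(sum(w * x for w, x in zip(row, extended))) for row in W]

def network(W_1, w_2, x):
    a = layer(W_1, x, sigmoid)            # hidden layer, h_1 = sigmoid
    y = layer([w_2], a, lambda z: z)[0]   # output neuron, h_2 = identity
    return y

# Made-up weights: 2 inputs, 3 hidden neurons, 1 output
W_1 = [[0.1, -0.2, 0.0],
       [0.4, 0.3, -0.1],
       [-0.5, 0.2, 0.3]]
w_2 = [1.0, -2.0, 0.5, 0.1]

y = network(W_1, w_2, [1.0, 2.0])
```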

---

# Terminology

* Input Layer: Our data

* Hidden Layers: The intermediate neurons

* Output Layer: The neurons that actually produce the outputs

* The Hidden Layer and the Output Layer Neurons have activation functions. While not theoretically necessary, we normally use the same activation function for all neurons in a layer.

---

# Some Activation Functions

---

# Gradient Calculation

To get the gradient of the squared error wrt the weights of the second layer, where `$\hat{y}$` is the observed value and `$\vec{a}' = h_1(W_1 \cdot \vec{x}')'$`:

$$
\frac{\partial}{\partial \vec{w_2}} (h_2(\vec{w_2} \cdot h_1(W_1 \cdot \vec{x}')') - \hat{y})^2 = \\\\
2\cdot(h_2(\vec{w_2} \cdot h_1(W_1 \cdot \vec{x}')') - \hat{y})\cdot \frac{\partial}{\partial \vec{w_2} \cdot \vec{a}'} h_2(\vec{w_2} \cdot \vec{a}') \cdot \frac{\partial}{\partial \vec{w_2}} \vec{w_2} \cdot \vec{a}' =\\\\
2\cdot(y - \hat{y})\cdot \frac{\partial}{\partial \vec{w_2} \cdot \vec{a}'} h_2(\vec{w_2} \cdot \vec{a}') \cdot \vec{a}'
$$

As mentioned before, we can use the chain rule a couple times more to also get the gradient wrt the weights of the first layer.
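
As a sanity check, the second-layer gradient can be compared against finite differences. This sketch assumes a sigmoid for `h_1` and the identity for `h_2` (so the derivative of `h_2` is 1); the weights and the single sample are made up.

```Python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hidden_activations(W_1, x):
    """a' = h_1(W_1 . x')' with h_1 = sigmoid, including the appended 1."""
    x_ext = list(x) + [1.0]
    a = [sigmoid(sum(w * xi for w, xi in zip(row, x_ext))) for row in W_1]
    return a + [1.0]

def loss(W_1, w_2, x, target):
    a_ext = hidden_activations(W_1, x)
    y = sum(w * a for w, a in zip(w_2, a_ext))   # h_2 = identity
    return (y - target) ** 2

def gradient_w2(W_1, w_2, x, target):
    """2 * (y - target) * h_2'(z) * a', where h_2' = 1 for the identity."""
    a_ext = hidden_activations(W_1, x)
    y = sum(w * a for w, a in zip(w_2, a_ext))
    return [2.0 * (y - target) * a for a in a_ext]

# Made-up weights and a single made-up sample
W_1 = [[0.1, -0.2, 0.0], [0.4, 0.3, -0.1]]
w_2 = [1.0, -2.0, 0.5]
x, target = [1.0, 2.0], 0.7

analytic = gradient_w2(W_1, w_2, x, target)

# Finite-difference check of each component
eps = 1e-6
numeric = []
for i in range(len(w_2)):
    plus = w_2[:i] + [w_2[i] + eps] + w_2[i + 1:]
    minus = w_2[:i] + [w_2[i] - eps] + w_2[i + 1:]
    numeric.append((loss(W_1, plus, x, target) - loss(W_1, minus, x, target)) / (2 * eps))
```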

---

# What does this do?

* We can take "any" number of inputs, send them through a couple of layers of transformations, and obtain a result

* We call this structure a **Feed-Forward** Neural Network

* Gradient descent will change the weights to minimize the distance of the output from our training data

* Does this work?

---

# Universal Approximation Theorem

A Feed-Forward Neural Network with a single hidden layer and a linear output layer can approximate any continuous function on compact subsets of `$\mathbb{R}^n$` to arbitrary precision, given enough neurons.

* Which activation function? Sigmoid works (Cybenko, 1989), but any activation function that is not a polynomial will do (Leshno et al., 1993)

* How many neurons do we need? Potentially exponentially many (in the dimensionality of the input) :(

* Can we learn the weights? Who knows ...

---

# In Practice

* For many applications neural networks produce useful approximations to functions

* The number of neurons is usually determined by educated guesses and tweaking

* Adding more layers helps with some tasks

* Rule of thumb: You don't want to use too many neurons (overfitting!)

---

# Myths

* Myth: "Neural Networks are how the brain works"

* Truth: **At most** the original development drew some inspiration from our understanding of the brain

* Myth: "Neural Networks are a black box that no one understands"

* Truth: Neural Networks are nothing magical, they're "just" giant non-linear functions. We have a very good understanding of how they work. Interpreting their operation can be challenging, though.

* Myth: Neural Networks are "human-like intelligence"

---

# Braaaaaains

<blockquote class="twitter-tweet" data-conversation="none" data-lang="es"><p lang="en" dir="ltr">4yo child is one thing but we need to explain how Megaphragma mymaripenne can fly and navigate with a brain of only 7400 neurons. Each neuron must be doing much more (1000x) than our Perceptron model explains. <a href="https://t.co/xNPq0RAgPj">pic.twitter.com/xNPq0RAgPj</a></p>— Mark Sugrue (@marksugruek) <a href="https://twitter.com/marksugruek/status/1206130412260646912?ref_src=twsrc%5Etfw">December 15, 2019</a></blockquote>

---

# Using Neural Networks

---

# An AI Koan

In the days when Sussman was a novice, Minsky once came to him as he sat hacking at the PDP-6.

"What are you doing?", asked Minsky.
"I am training a randomly wired neural net to play Tic-Tac-Toe" Sussman replied.
"Why is the net wired randomly?", asked Minsky.
"I do not want it to have any preconceptions of how to play", Sussman said.

Minsky then shut his eyes.
"Why do you close your eyes?", Sussman asked his teacher.
"So that the room will be empty."

At that moment, Sussman was enlightened.

---

# Initialization

* What happens if we initialize the weights with all 0s?

* What happens if we initialize the weights with very large values?

* Zero initialization: All gradients of a layer are **the same**, updating the weights in the same way!

* Large initial values: All gradients of a layer are **zero** (look at the sigmoid curve, for example), and our weights don't change!

* Therefore, we typically initialize our weights randomly, with a scale that **depends on the square root of the size of the previous layer** (e.g. He, or Xavier-initialization)
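
A rough sketch of the two named schemes in plain Python, using their normal-distribution variants (uniform variants also exist). Biases are left out, and the function names and layer sizes are chosen only for this example.

```Python
import math
import random

def he_init(fan_in, fan_out, rng):
    """He initialization: normal with standard deviation sqrt(2 / fan_in)."""
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]

def xavier_init(fan_in, fan_out, rng):
    """Xavier/Glorot initialization: normal with standard deviation
    sqrt(2 / (fan_in + fan_out))."""
    std = math.sqrt(2.0 / (fan_in + fan_out))
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]

rng = random.Random(0)   # fixed seed, only for reproducibility
W = he_init(fan_in=100, fan_out=50, rng=rng)

# The empirical spread should be close to sqrt(2 / 100)
values = [w for row in W for w in row]
mean = sum(values) / len(values)
spread = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
```

The larger the previous layer, the smaller the individual weights, which keeps the sum that enters the activation function in a range where its gradient is not close to zero.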

---

# Training

Typically our training set contains many different samples. How do we update the weights?

* Idea 1: Calculate the error on the entire training set and use that for the gradient ("classical" gradient descent)

* Idea 2: Shuffle the training set, and calculate the error/gradient for each item one at a time ("stochastic"/"on-line" gradient descent)

---

# Mini-Batch training

* Classical gradient descent: Imagine you have two (or more) training samples that are very similar. In classical gradient descent, you calculate the error and gradient using both/all of them before you do any update.

* Stochastic gradient descent: Updating after every sample avoids redundant computations, and often converges faster. However, the parameters may change erratically, especially if you have outliers or a wide variety of different data. It is also not easy to parallelize.

* In practice: Mini-Batch gradient descent. Instead of using one or all of your data, split it into smaller batches ("minibatches"), commonly between 50 and 256 samples, and perform the calculations and update for each minibatch in random order.
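
One way to split the data is to shuffle the indices once per epoch and cut them into chunks. This is a minimal sketch; the generator name `minibatches` and the tiny data set are made up.

```Python
import random

def minibatches(samples, batch_size, rng):
    """Shuffle the samples and yield them in chunks of batch_size.
    The last batch may be smaller if the sizes do not divide evenly."""
    indices = list(range(len(samples)))
    rng.shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [samples[i] for i in indices[start:start + batch_size]]

rng = random.Random(0)          # fixed seed, only for reproducibility
data = list(range(10))          # stand-in for 10 training samples
batches = list(minibatches(data, batch_size=4, rng=rng))
# One epoch: compute the gradient and update the weights once per batch
```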

---

# Other improvements

* Momentum: When updating the weights, also add a fraction of the previous update again. If the gradient points in the same direction, this will accelerate it (like a ball rolling down a hill)

* Nesterov accelerated gradient

$$
\Delta w \leftarrow \gamma \cdot \Delta w + \nabla L(M(w - \alpha \Delta w, x))\\\\
w \leftarrow w - \alpha \Delta w
$$

Instead of using the current values for the weights, we "imagine" what would happen if we performed the same update again. This allows us to "look into the future" and slow down when we get near the optimum.
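
The update rule above can be tried on a one-dimensional toy problem. The loss, `gamma`, `alpha` and the number of steps are assumed values for illustration only.

```Python
def gradient(w):
    """Gradient of the toy loss L(w) = (w - 3)^2 (made up for illustration)."""
    return 2.0 * (w - 3.0)

gamma = 0.9    # momentum factor (assumed value)
alpha = 0.1    # learning rate (assumed value)

w = 0.0        # initial guess
delta = 0.0    # previous update direction, the Delta w from the formula
for step in range(200):
    # Evaluate the gradient where the previous update would take us again
    lookahead = w - alpha * delta
    delta = gamma * delta + gradient(lookahead)
    w = w - alpha * delta
```

With these values `w` settles at the minimum of the toy loss, `w = 3`.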

---

# Optimizers

* SGD: Perform the basic (Stochastic) Gradient Descent update for every (mini-)batch

* Adagrad: Accumulates the sum of the squares of all gradients **for each parameter/weight** over all previous iterations and divides the learning rate by the square root of that value. Problem: Learning rate can only get smaller.

* Adadelta: Instead of accumulating **all** previous gradients, only use a window of recent ones (in practice an exponentially decaying average)

* Adam: Stores estimates for the mean and variance of previous gradients and uses them to scale the learning rate (this is similar to a momentum term). **When in doubt, use this one.**

* Nadam: Like Adam, but instead of vanilla momentum it uses the Nesterov accelerated gradient idea to "predict the future"
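
To illustrate why the Adagrad learning rate can only get smaller, here is a sketch for a single parameter on a toy loss. The loss and all constants are assumptions for this example; real implementations keep one accumulator per weight.

```Python
import math

def gradient(w):
    """Gradient of the toy loss L(w) = (w - 3)^2 (made up for illustration)."""
    return 2.0 * (w - 3.0)

learning_rate = 1.0   # base learning rate (assumed value)
eps = 1e-8            # avoids division by zero

w = 0.0
squared_sum = 0.0     # accumulated squared gradients for this one parameter
effective_rates = []
for step in range(100):
    g = gradient(w)
    squared_sum += g * g
    rate = learning_rate / (math.sqrt(squared_sum) + eps)
    effective_rates.append(rate)
    w -= rate * g
```

Because `squared_sum` never decreases, the values in `effective_rates` never increase.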

---

# Animation

---

# Neural Networks in PyTorch

---

# Neural Networks in PyTorch

Construct the neural network:

```Python
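import torch

INPUTS = 6          # number of input features (example value)
HIDDEN_NEURONS = 3  # size of the hidden layer (example value)

model = torch.nn.Sequential(
    # Hidden Layer
    torch.nn.Linear(INPUTS, HIDDEN_NEURONS),
    torch.nn.Sigmoid(),

    # Output Layer
    torch.nn.Linear(HIDDEN_NEURONS, 1),
)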
```

Apply it to data:

```Python
y = model(x)
```

---

# Training

```Python
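for t in range(n):
    # Input ALWAYS has to be a matrix!
    # Rows: samples, columns: features
    y_pred = model(x.view(-1, INPUTS))
    # Reshape y to (samples, 1) so it matches the shape of y_pred
    loss = loss_fn(y_pred, y.view(-1, 1))

    # Reset old gradients, backpropagate, then update the weights
    optimizer.zero_grad()
    loss.backward()

    optimizer.step()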
```

We still need a `loss_fn`, and an `optimizer`!

---

# Loss Functions and Optimizers

* `torch.nn.MSELoss`: Mean Squared Error

* `torch.optim.SGD`: Stochastic Gradient Descent

* `torch.optim.Adam`: Adam Optimizer

* etc.

---

# Full code

```Python
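import torch

# Assumes x (samples with 6 features each), y (targets) and
# n (number of training iterations) are already defined
model = torch.nn.Sequential(torch.nn.Linear(6, 3),
                            torch.nn.Sigmoid(),
                            torch.nn.Linear(3, 1))

loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)

for t in range(n):
    y_pred = model(x.view(-1, 6))
    # Reshape y to (samples, 1) so it matches the shape of y_pred
    loss = loss_fn(y_pred, y.view(-1, 1))

    optimizer.zero_grad()
    loss.backward()

    optimizer.step()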
```

---

# Modules

It is also possible to define a new class for your neural network, which gives you more control:

```Python
import torch

class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        super().__init__()
        self.linear1 = torch.nn.Linear(D_in, H)
        self.sig1 = torch.nn.Sigmoid()
        self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        # Hidden layer with sigmoid activation, then a linear output layer
        h = self.linear1(x)
        h = self.sig1(h)
        y_pred = self.linear2(h)
        return y_pred

# 6 inputs, 5 hidden neurons, 4 outputs
model = TwoLayerNet(6, 5, 4)
y_pred = model(x.view(-1, 6))
```

---

# Lab 2

---

# Lab 2

* In lab 2 you will replace your non-linear model from lab 1 with a neural network

* First, construct a neural network with one hidden layer and one output layer that takes the `ActionLatency` as input and predicts the `APM`

* Try some different numbers of neurons

* Then add more inputs. The lab description has a list of 6 you should use. If you want, you can play with adding more

* Deadline: May 12, before class

---

class: small
  
# References
  
  * [Machine Learning for Artists: Guides](https://ml4a.github.io/guides/)
  
  * [PyTorch](https://pytorch.org/)
  
  * [Deep Learning with PyTorch](https://pytorch.org/deep-learning-with-pytorch)
  
  * [Introduction to Neural Networks](http://mt-class.org/jhu/slides/lecture-nn-intro.pdf)
  
  * Cybenko, G. (1989) ["Approximation by superpositions of a sigmoidal function"](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.441.7873&rep=rep1&type=pdf), Mathematics of Control, Signals, and Systems, 2(4), 303–314. doi:10.1007/BF02551274
  
  * Leshno, M., Lin, V. Ya., Pinkus, A., Schocken, S. (1993) "Multilayer feedforward networks with a nonpolynomial activation function can approximate any function", Neural Networks, 6(6), 861–867. doi:10.1016/S0893-6080(05)80131-5
  
  * [3Blue1Brown: Neural Networks](https://www.youtube.com/watch?v=aircAruvnKk)
  
  * [Comparison of different optimizers](https://ruder.io/optimizing-gradient-descent/)