class: center, middle

# Machine Learning

## Neural Networks 2

### III-Verano 2019

---

# Artificial Neural Networks
What is an (Artificial) Neural Network?
---

# Artificial Neural Networks

.left-column[
]

.right-column[
We introduced one or more "hidden layers", which hold the intermediate values h:

$$
\vec{h} = f_1(W_1 \cdot \vec{x})\\\\
y = f_2(\vec{w_2} \cdot \vec{h})\\\\
y = f_2(\vec{w_2} \cdot f_1(W_1 \cdot \vec{x}))
$$
]

---

class: mediumt

# Training

Remember:

$$
y = f_2(\vec{w_2} \cdot f_1(W_1 \cdot \vec{x}))\\\\
\Delta \vec{w_2} = 2\cdot(y - \hat{y})\cdot f_2'(\vec{w_2} \cdot \vec{h}) \cdot \vec{h} \\\\
\vec{w_2} = \vec{w_2} - \alpha\cdot \Delta \vec{w_2}
$$

We glossed over some details last week: initialization and the exact learning procedure.

Remember the problem with local minima? How do we choose the initial values?

---

class: small

# An AI Koan

In the days when Sussman was a novice, Minsky once came to him as he sat hacking at the PDP-6.

"What are you doing?", asked Minsky.

"I am training a randomly wired neural net to play Tic-Tac-Toe", Sussman replied.

"Why is the net wired randomly?", asked Minsky.

"I do not want it to have any preconceptions of how to play", Sussman said.

Minsky then shut his eyes.

"Why do you close your eyes?", Sussman asked his teacher.

"So that the room will be empty."

At that moment, Sussman was enlightened.

---

class: medium

# Initialization

* What happens if we initialize the weights with all 0s?

--

* What happens if we initialize the weights with very high values?

--

* Zero initialization: All gradients of a layer are **the same**, updating all weights of the layer in the same way!

* Large initial values: Activation functions like the sigmoid **saturate**, so the gradients become very small and learning stalls

* Therefore, we typically initialize our weights randomly, **depending on the square root of the size of the previous layer** (e.g. He or Xavier initialization)

---

class: mediumt

# Training

Remember:

$$
y = f_2(\vec{w_2} \cdot f_1(W_1 \cdot \vec{x}))\\\\
\Delta \vec{w_2} = 2\cdot(y - \hat{y})\cdot f_2'(\vec{w_2} \cdot \vec{h}) \cdot \vec{h} \\\\
\vec{w_2} = \vec{w_2} - \alpha\cdot \Delta \vec{w_2}
$$

Typically our training set contains many different samples. How do we do this update?

Idea 1: Calculate the error on the entire training set and use that for the gradient ("classical" gradient descent)

Idea 2: Shuffle the training set, and calculate the error/gradient for each item one at a time ("stochastic"/"on-line" gradient descent)

---

class: mmedium

# Mini-Batch training

* Classical gradient descent: Imagine you have two (or more) training samples that are very similar. In classical gradient descent, you calculate the error and gradient using both/all of them before you do any update.

* Stochastic gradient descent: Updating after every sample avoids these redundant computations, and often converges faster. However, the parameters may change erratically, especially if you have outliers or a wide variety of different data. It is also not parallelizable.

* In practice: Mini-batch gradient descent. Instead of using one sample or all of your data, split the data into smaller batches ("mini-batches"), commonly between 50 and 256 samples, and perform the calculation and update for each mini-batch in random order.

---

# Other improvements

* Momentum: When updating the weights, also add a fraction of the previous update. If the gradient points in the same direction, this accelerates the updates (like a ball rolling down a hill)

* Nesterov accelerated gradient:

$$
\Delta w \leftarrow \gamma \cdot \Delta w + \nabla L(M(w - \alpha \Delta w, x))\\\\
w \leftarrow w - \alpha \Delta w
$$

Instead of using the current values for the weights, we "imagine" what would happen if we performed the same update again. This allows us to "look into the future" and slow down when we get near the optimum.
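---

class: medium

# Momentum: A Sketch

A minimal sketch of the momentum update in numpy, assuming a hypothetical `gradient(w)` function that returns the gradient of the loss for the current weights:

```Python
import numpy as np

def train_with_momentum(w, gradient, alpha=0.01, gamma=0.9, steps=100):
    delta_w = np.zeros_like(w)
    for _ in range(steps):
        # Add a fraction gamma of the previous update to the new gradient
        delta_w = gamma * delta_w + gradient(w)
        # Nesterov variant: evaluate the gradient at the look-ahead
        # point instead, i.e. gradient(w - alpha * gamma * delta_w)
        w = w - alpha * delta_w
    return w
```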
---

class: mmedium

# Optimizers

* SGD: Perform the basic (stochastic) gradient descent update for every sample/mini-batch

* Adagrad: Accumulates the sum of the squares of all gradients **for each parameter/weight** over all previous iterations, and divides the learning rate by that value. Problem: The learning rate can only get smaller.

* Adadelta: Instead of accumulating **all** previous gradients, only use (a decaying average over) the last n

* Adam: Stores estimates for the mean and variance of previous gradients and uses them to scale the learning rate (this is similar to a momentum term). When in doubt, use this one.

* Nadam: Instead of vanilla momentum, use the Nesterov accelerated gradient idea to "predict the future"

---

class: medium

# Animation

See the animations of these optimizers in [Comparison of different optimizers](https://ruder.io/optimizing-gradient-descent/).
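---

class: medium

# Optimizers in PyTorch

All of these are available in `torch.optim`. A sketch of how they could be constructed; the learning rates are only illustrative, and `model` is assumed to be an already-built network (as shown later):

```Python
import torch

# Materialize the parameters so the list can be passed to an optimizer
params = list(model.parameters())

opt_sgd      = torch.optim.SGD(params, lr=0.01, momentum=0.9, nesterov=True)
opt_adagrad  = torch.optim.Adagrad(params, lr=0.01)
opt_adadelta = torch.optim.Adadelta(params)
opt_adam     = torch.optim.Adam(params, lr=0.001)
```

In practice you would create only one of these and use it for the whole training run.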
---

class: center, middle

# Classification

---

# Artificial Neural Networks

.left-column[
]

.right-column[
Last week we talked about Neural Networks for regression, to estimate a continuous function. What about classification?
]

---

class: medium

# Classification

* Say we want to distinguish between cats and dogs in pictures

* We have images with 128x128 pixels and 3 color channels

* This means we have 128x128x3 = 49 152 inputs ("features")

* We use a neural network with 2 layers, some hidden neurons, and **one** output neuron

* If the output is greater than some threshold x, we say the input is a cat, otherwise it is a dog

---

class: mmedium

# More classes

* What if we also have pictures of fish and microwaves?

* We could just say: less than 0.5 is a dog, 0.5-1.5 is a cat, 1.5-2.5 is a fish, more than 2.5 is a microwave

* Problem: When we do gradient descent, we may increase/decrease the output for *all* samples with some property

* Generally, our classes do not have a numeric relationship: this encoding says a cat is "one more" than a dog, but does the same numeric difference hold between fish and microwaves?

* Better: Look at each class "independently"

---

class: medium

# One-Hot-Encoding

* Instead of one output, we have **one output per class**

* We translate our training labels ("dog" = 0, "cat" = 1, "fish" = 2, "microwave" = 3) to vectors: "dog" = (1,0,0,0), "cat" = (0,1,0,0), "fish" = (0,0,1,0), "microwave" = (0,0,0,1)

* The idea is that the output for **each** entry can be interpreted as "how likely it is that this picture shows this class" (a probability)

* The classification produced by our network is then whichever class has the **maximum** value

---

# Probabilities?

* Say we use a sigmoid function as the activation function

* All values will be between 0 and 1!

* However: Cat may have a "probability" of 0.7, dog a "probability" of 0.2, fish a "probability" of 0.5, and microwave a "probability" of 0.45 ... the sum is greater than 1

* Instead: **Softmax**, a "generalization" of the sigmoid to multiple classes, which makes sure that the probabilities sum to 1

---

class: medium

# Classification Error

* Because our outputs are probabilities, there are better ways to measure the error than the (squared) distance

* In particular, in combination with the softmax activation function (which is similar to the sigmoid function and uses exp), a loss function that uses logarithms is advantageous

* Observe: The logarithm of 1 is 0, and the logarithm of values between 0 and 1 is negative

* Loss: The negation of the logarithm of the probability of the desired class

---

# Cross-Entropy Loss

Output probability for each class:

$$
p_c = \frac{\exp(w_c \cdot x)}{\sum_j \exp(w_j \cdot x)}
$$

Loss:

$$
L(y) = - \log(p_y)
$$

When we take the derivative, the log and the exp cancel out nicely, making the update very efficient.
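---

class: medium

# Softmax and Cross-Entropy in Code

A minimal numpy sketch of these two formulas; the scores are hypothetical network outputs:

```Python
import numpy as np

def softmax(z):
    # Subtracting the maximum does not change the result (it cancels
    # in the fraction), but avoids overflow in exp
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

scores = np.array([2.0, 1.0, 0.1, -1.0])  # outputs for dog, cat, fish, microwave
p = softmax(scores)                       # probabilities, sum to 1
loss = -np.log(p[1])                      # cross-entropy loss if the true class is "cat"
```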
---

# Tensorflow Playground

[A Neural Network Playground](https://playground.tensorflow.org)

---

class: center, middle

# Neural Networks in PyTorch

---

# Neural Networks in PyTorch

Construct the neural network:

```Python
model = torch.nn.Sequential(
    torch.nn.Linear(128*128*3, 100),
    torch.nn.Sigmoid(),
    torch.nn.Linear(100, 4),
    torch.nn.Softmax(dim=1)
)
```

Apply it to data:

```Python
y = model(x)
```

---

# Training

```Python
for t in range(n):
    y_pred = model(x.view(-1, 128*128*3))
    loss = loss_fn(y_pred, y)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

We still need a `loss_fn` and an `optimizer`!

---

# Loss Functions and Optimizers

* `torch.nn.MSELoss`: Mean Squared Error

* `torch.nn.CrossEntropyLoss`: Cross-Entropy Loss

* `torch.optim.SGD`: Stochastic Gradient Descent

* `torch.optim.Adam`: Adam Optimizer

* etc.

---

# Full code

```Python
import torch

# CrossEntropyLoss applies (log-)softmax internally, so the
# network itself should output raw scores ("logits"), without
# a final Softmax layer
model = torch.nn.Sequential(
    torch.nn.Linear(128*128*3, 100),
    torch.nn.Sigmoid(),
    torch.nn.Linear(100, 4))
loss_fn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

for t in range(n):
    y_pred = model(x.view(-1, 128*128*3))
    loss = loss_fn(y_pred, y)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

---

class: medium

# Modules

It is also possible to define a new class for your neural network, which gives you more control:

```Python
class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        super(TwoLayerNet, self).__init__()
        self.linear1 = torch.nn.Linear(D_in, H)
        self.sig1 = torch.nn.Sigmoid()
        self.linear2 = torch.nn.Linear(H, D_out)
        self.softmax = torch.nn.Softmax(dim=1)

    def forward(self, x):
        h = self.linear1(x)
        h = self.sig1(h)
        y_pred = self.linear2(h)
        return self.softmax(y_pred)

model = TwoLayerNet(128*128*3, 100, 4)
y_pred = model(x.view(-1, 128*128*3))
```

---

# References

* [Machine Learning 4 All: Guides](https://ml4a.github.io/guides/)
* [PyTorch](https://pytorch.org/)
* [Deep Learning with PyTorch](https://pytorch.org/deep-learning-with-pytorch)
* [Introduction to Neural Networks](http://mt-class.org/jhu/slides/lecture-nn-intro.pdf)
* [3Blue1Brown: Neural Networks](https://www.youtube.com/watch?v=aircAruvnKk)
* [Comparison of different optimizers](https://ruder.io/optimizing-gradient-descent/)