In Supervised Learning, we are given some x and y, and want to learn a function f(x) = y
The goal is to be able to apply this function to unknown values of x
For example: We are given the stats of several houses (x) in Cartago and their prices (y), and want to be able to predict the prices of other houses
Another example: We are given pictures of animals (x), and labels of cat or dog (y), and want to recognize cats and dogs in other pictures
There are two main tasks:
Regression: We predict a (continuous) number (price, duration, score) as our output
Classification: We predict one of a limited set of (discrete) categories/"classes" (cat vs. dog, spam vs. not spam, movie genre) as our output
There are many methods for each of these two tasks. We will mostly focus on Neural Networks in this class, which can be applied to both problems
We have some points $(x_i, y_i)$ in the plane
We want to find a line that "best" represents these points
What's a line? $\hat{y}_i = w \cdot x_i + b$
What is the "best" representation? We want ˆyi
and yi
to be "the same"
We minimize the mean squared error (MSE) $\frac{1}{n}\sum(\hat{y}_i - y_i)^2$ by changing $w$ and $b$
*(Interactive demo: plots of the data and of the MSE for the model $y_i = w \cdot x_i + b$, with adjustable values for $w$ and $b$.)*
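To make this concrete, here is a minimal sketch (using NumPy, with made-up numbers) of how the MSE of a linear model could be computed for a particular choice of $w$ and $b$:

```python
import numpy as np

# Made-up data, only for illustration
x = np.array([50.0, 80.0, 120.0, 200.0])   # e.g. house sizes
y = np.array([60.0, 95.0, 130.0, 210.0])   # e.g. house prices

def mse(w, b):
    y_hat = w * x + b                      # predictions of the linear model
    return np.mean((y_hat - y) ** 2)       # 1/n * sum of squared errors

print(mse(w=1.0, b=0.0))                   # error for one choice of w and b
print(mse(w=1.0, b=10.0))                  # a different choice, different error
```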
The process we just did is called "gradient descent"
We took our error or loss function $\frac{1}{n}\sum(\hat{y}_i - y_i)^2$
Then we calculated in which direction this loss decreases when we change w
Then we changed $w$ "a little bit" and continued
We could do the same for $b$!
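As an illustration, a minimal sketch of this loop for the linear model (NumPy, with made-up data and a made-up step size):

```python
import numpy as np

x = np.array([50.0, 80.0, 120.0, 200.0])   # made-up inputs
y = np.array([60.0, 95.0, 130.0, 210.0])   # made-up targets

w, b = 0.0, 0.0                            # arbitrary starting values
alpha = 1e-5                               # "a little bit" (step size)

for step in range(1000):
    error = (w * x + b) - y
    # Gradients of the MSE 1/n * sum((y_hat - y)^2) with respect to w and b
    grad_w = 2 * np.mean(error * x)
    grad_b = 2 * np.mean(error)
    # Move w and b a little bit in the direction that decreases the loss
    w -= alpha * grad_w
    b -= alpha * grad_b

print(w, b, np.mean((w * x + b - y) ** 2))
```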
2D Lines only get us so far
We can extend this process to more dimensions:
$$\hat{y} = w_1 \cdot x_1 + w_2 \cdot x_2 + w_3 \cdot x_3 + b$$
Gradient descent will also work for this case, but our prediction model is still linear
What if our data is nonlinear?
Now we have $\hat{y}_i = f(x_i)$, where $f$ can be "anything"
We can still use the same idea as before, though!
Minimize the mean squared error (MSE) $\frac{1}{n}\sum(\hat{y}_i - y_i)^2$ by changing $f$
For that, we need to be able to change $f$ in some way: It should have parameters that we can change
Earlier, we looked at the gradient to see in which direction the loss decreases. For that, $f$ needs to be differentiable with respect to these parameters
Vectors are neat because we have mathematical operations defined on them: Addition, subtraction, multiplication with a scalar, etc.
One particularly important operation is the dot product:
$$\vec{v} \cdot \vec{w} = \begin{pmatrix}v_1\\v_2\\\vdots\\v_n\end{pmatrix} \cdot \begin{pmatrix}w_1\\w_2\\\vdots\\w_n\end{pmatrix} = v_1 w_1 + v_2 w_2 + \ldots + v_n w_n$$
$$\vec{w} \cdot \vec{x}' = \begin{pmatrix}w\\b\end{pmatrix} \cdot \begin{pmatrix}x\\1\end{pmatrix} = wx + b$$
This lets us write the linear model more concisely:
$$y = wx + b = \vec{w} \cdot \vec{x}'$$
We will use the notation $\vec{x}'$ to mean "adding a 1 to the end of the vector $\vec{x}$"
Now let us build a non-linear function!
Linear model:
$$M_w = y = \vec{w} \cdot \vec{x}'$$
Non-linear model:
$$M_w = y = h(\vec{w} \cdot \vec{x}')$$
with "some" non-linear function h
.
To calculate the gradient wrt the parameters:
$$M_w = y = h(\vec{w} \cdot \vec{x}')$$
$$\frac{\partial}{\partial \vec{w}} M_w = \frac{\partial}{\partial (\vec{w} \cdot \vec{x}')} h(\vec{w} \cdot \vec{x}') \cdot \frac{\partial}{\partial \vec{w}} (\vec{w} \cdot \vec{x}')$$
Or, in words: We will need to calculate the derivative of $h$ wrt its input.
Another take: We only need to be able to calculate the derivative of $h$ wrt its input.
Summary: What do we want from h?
Non-linear
Differentiable
"Interesting"
For example:
$$h(z) = \frac{1}{1 + e^{-z}}$$
Non-linear, differentiable and "interesting"!
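As a sketch, here is the sigmoid together with its derivative (using the standard identity $h'(z) = h(z)(1 - h(z))$; the function names are just illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):            # derivative of the sigmoid wrt its input
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.linspace(-5, 5, 5)
print(sigmoid(z))                # values squashed into (0, 1)
print(sigmoid_prime(z))          # non-zero everywhere, so gradients can flow
```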
Any problems?
Our sigmoid function only produces values between 0 and 1
Often, though, we want to predict values in a different range
We could "scale" (and shift) the output!
$$y = w_1 \cdot h(\vec{w} \cdot \vec{x}') + b_1$$
This looks like a linear model that uses the result of h as its input!
Let us call the operation $h(\vec{w} \cdot \vec{x}')$ a "Neuron"
This "neuron" takes some inputs x'
, performs a linear transformation, and applies a function h
to produce a result
We can then pass this result to another neuron, which may use the same or a different h
Implementation detail: Remember that we added a 1 to $\vec{x}$ to produce $\vec{x}'$, so we can write the linear model as a simple dot product. We do the same with the result of $h$!
Now we have a compact representation:
$$y = h_1(\vec{w}_1 \cdot h(\vec{w} \cdot \vec{x}')')$$
$h_1$ can just be the identity function; then we have exactly the same thing we had earlier:
$$y = w_1 \cdot h(\vec{w} \cdot \vec{x}') + b_1$$
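A minimal sketch of one such "neuron" followed by a linear output (NumPy, made-up weights, sigmoid assumed as $h$):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(w_vec, x):
    x_prime = np.append(x, 1.0)          # append the 1 for the bias
    return sigmoid(np.dot(w_vec, x_prime))

w_vec = np.array([0.8, -0.2])            # weight and bias of the neuron
w1, b1 = 3.0, -1.0                       # scale and shift of the output

x = np.array([2.0])
a = neuron(w_vec, x)                     # non-linear value in (0, 1)
y = w1 * a + b1                          # scaled and shifted output
print(a, y)
```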
Fun fact: As long as the additional layers are also differentiable, the entire function will be differentiable (you can try this at home).
Recall:
$$\vec{w} \cdot \vec{x}' = \begin{pmatrix}w\\b\end{pmatrix} \cdot \begin{pmatrix}x\\1\end{pmatrix} = wx + b$$
The same thing works with more values in the vectors!
$$\vec{w} \cdot \vec{x}' = \begin{pmatrix}w_1\\w_2\\b\end{pmatrix} \cdot \begin{pmatrix}x_1\\x_2\\1\end{pmatrix} = w_1 x_1 + w_2 x_2 + b$$
This means each of our neurons could take more than just one input!
With two layers:
$$\vec{a} = h_1(W_1 \cdot \vec{x}')$$
$$y = h_2(\vec{w}_2 \cdot \vec{a}')$$
$$y = h_2(\vec{w}_2 \cdot h_1(W_1 \cdot \vec{x}'))$$
Input Layer: Our data
Hidden Layers: The intermediate neurons
Output Layer: The neurons that actually produce the outputs
The Hidden Layer and the Output Layer Neurons have activation functions. While not theoretically necessary, we normally use the same activation function for all neurons in a layer.
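Putting this together, a minimal sketch of the two-layer forward pass $\vec{a} = h_1(W_1 \cdot \vec{x}')$, $y = h_2(\vec{w}_2 \cdot \vec{a}')$, assuming sigmoid in the hidden layer and the identity in the output layer (all weights are made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(W1, w2, x):
    x_prime = np.append(x, 1.0)          # input plus the bias 1
    a = sigmoid(W1 @ x_prime)            # hidden layer activations
    a_prime = np.append(a, 1.0)          # hidden activations plus the bias 1
    return np.dot(w2, a_prime)           # identity output layer

W1 = np.array([[0.5, -0.3, 0.1],         # 2 hidden neurons, 2 inputs + bias
               [0.2,  0.8, -0.4]])
w2 = np.array([1.5, -2.0, 0.3])          # 2 hidden activations + bias

print(forward(W1, w2, np.array([1.0, 2.0])))
```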
To get the gradient for the weights of the second layer:
$$\frac{\partial}{\partial \vec{w}_2}\left(h_2(\vec{w}_2 \cdot h_1(W_1 \cdot \vec{x}')) - y\right)^2 = 2 \cdot \left(h_2(\vec{w}_2 \cdot h_1(W_1 \cdot \vec{x}')) - y\right) \cdot h_2'(\vec{w}_2 \cdot h_1(W_1 \cdot \vec{x}')) \cdot h_1(W_1 \cdot \vec{x}') = 2 \cdot (\hat{y} - y) \cdot h_2'(\vec{w}_2 \cdot \vec{a}) \cdot \vec{a}$$

where $\hat{y} = h_2(\vec{w}_2 \cdot \vec{a})$ is the prediction and $\vec{a} = h_1(W_1 \cdot \vec{x}')$ are the hidden activations.
As mentioned before, we can use the chain rule a couple times more to also get the gradient wrt the weights of the first layer.
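A minimal sketch of the second-layer gradient above, assuming sigmoid for both $h_1$ and $h_2$ (weights, input and target are made up; in the code $\vec{a}$ includes the appended 1, so the last entry is the bias gradient):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

W1 = np.array([[0.5, -0.3, 0.1],
               [0.2,  0.8, -0.4]])
w2 = np.array([1.5, -2.0, 0.3])
x = np.array([1.0, 2.0])
y = 0.7                                    # made-up target

a = np.append(sigmoid(W1 @ np.append(x, 1.0)), 1.0)  # hidden activations + bias
z2 = np.dot(w2, a)
y_hat = sigmoid(z2)                        # prediction

# Gradient of (y_hat - y)^2 wrt w2, following the chain rule
grad_w2 = 2 * (y_hat - y) * sigmoid_prime(z2) * a
print(grad_w2)
```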
We can take "any" number of inputs, send them through a couple of layers of transformations, and obtain a result
We call this structure Feed-Forward Neural Networks
Gradient descent will change the weights to minimize the distance of the output from our training data
Does this work?
A Feed-Forward Neural Network with a single hidden layer and a linear output layer can approximate continuous functions on compact subsets of $\mathbb{R}^n$ to arbitrary precision, given enough neurons.
Which activation function? Sigmoid works (Cybenko, 1989), but anything that's not a simple polynomial will do (Leshno et al. 1993)
How many neurons do we need? Potentially exponentially many (in the dimensionality of the input) :(
Can we learn the weights? Who knows ...
For many applications neural networks produce useful approximations to functions
The number of neurons is usually determined by educated guesses and tweaking
Adding more layers helps with some tasks
Rule of thumb: You don't want to use too many neurons (overfitting!)
Myth: "Neural Networks are how the brain works"
Truth: At most the original development drew some inspiration from our understanding of the brain
Myth: "Neural Networks are a black box that no one understands"
Truth: Neural Networks are nothing magical, they're "just" giant non-linear functions. We have a very good understanding of how they work. Interpreting their operation can be challenging, though.
Myth: Neural Networks are "human-like intelligence"
> 4yo child is one thing but we need to explain how Megaphragma mymaripenne can fly and navigate with a brain of only 7400 neurons. Each neuron must be doing much more (1000x) than our Perceptron model explains. pic.twitter.com/xNPq0RAgPj
>
> — Mark Sugrue (@marksugruek) December 15, 2019
Give the network an input sample
Record the output
Calculate the error and the gradient
Change the weights "a little" in the opposite direction of the gradient
$$\hat{y} = h_2(\vec{w}_2 \cdot h_1(W_1 \cdot \vec{x}'))$$
$$\Delta \vec{w}_2 = 2 \cdot (\hat{y} - y) \cdot h_2'(\vec{w}_2 \cdot \vec{a}) \cdot \vec{a}$$
$$\vec{w}_2 = \vec{w}_2 - \alpha \cdot \Delta \vec{w}_2$$
How do we start?
How often do we update?
What is $\alpha$?
Any problems with this approach?
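One common set of answers, sketched below with NumPy: start from small random weights, update after every sample, and treat $\alpha$ as a hyperparameter. All of these choices (and the data) are illustrative assumptions, not the only option:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

W1 = rng.normal(scale=0.5, size=(2, 3))    # "How do we start?" -> small random weights
w2 = rng.normal(scale=0.5, size=3)
alpha = 0.1                                # "What is alpha?" -> a tunable step size

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
Y = np.array([0.1, 0.9, 0.8, 0.3])         # made-up targets

for epoch in range(100):
    for x, y in zip(X, Y):                 # "How often?" -> here, after every sample
        a = np.append(sigmoid(W1 @ np.append(x, 1.0)), 1.0)
        z2 = np.dot(w2, a)
        y_hat = sigmoid(z2)
        delta_w2 = 2 * (y_hat - y) * sigmoid_prime(z2) * a
        w2 = w2 - alpha * delta_w2         # a small step against the gradient
```

(This sketch only updates $\vec{w}_2$, like the formula above; the first layer's weights would be updated the same way via the chain rule.)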
Another problem with Neural Networks is overfitting
If you train your Neural Network on data, it might just start to "memorize" that data
The actual application, however, is to predict values for other data
We would therefore like to test our network
Neural Networks are typically trained on only part (often around 80%) of the available data
We can then use the rest of the data to test how well the network generalizes
These sets are called the training set and the test set
The (relevant) performance of the network is its error on the test set
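A minimal sketch of such an 80/20 split with NumPy (X and Y stand in for whatever inputs and targets you have; the numbers are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

X = rng.normal(size=(100, 3))      # 100 made-up samples with 3 features
Y = rng.normal(size=100)           # made-up targets

indices = rng.permutation(len(X))  # shuffle before splitting
split = int(0.8 * len(X))          # 80% for training
X_train, Y_train = X[indices[:split]], Y[indices[:split]]
X_test, Y_test = X[indices[split:]], Y[indices[split:]]

# Train only on (X_train, Y_train); report the error on (X_test, Y_test).
```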