In Supervised Learning, we are given some x and y, and want to learn a function f(x) = y
The goal is to be able to apply this function to unknown values of x
For example: We are given the stats of several houses (x) in Cartago and their prices (y), and want to be able to predict the prices of other houses
Another example: We are given pictures of animals (x), and labels of cat or dog (y), and want to recognize cats and dogs in other pictures
There are two main tasks:
Regression: We predict a (continuous) number (price, duration, score) as our output
Classification: We predict one of a limited set of (discrete) categories/"classes" (cat vs. dog, spam vs. not spam, movie genre) as our output
There are many methods for each of these two tasks. We will mostly focus on Neural Networks in this class, which can be applied to both problems
We have some points $(x_i, y_i)$ in the plane
We want to find a line that "best" represents these points
What's a line? $\hat{y}_i = w \cdot x_i + b$
What is the "best" representation? We want ˆyi
and yi
to be "the same"
We minimize the mean squared error (MSE) $\frac{1}{n}\sum(\hat{y}_i - y_i)^2$ by changing $w$ and $b$
*(Interactive demo: plots of the data and of the MSE for the model $y_i = w \cdot x_i + b$, with adjustable values for $w$ and $b$.)*
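To make this concrete, here is a minimal sketch (using NumPy, with made-up numbers) of how the MSE of a linear model could be computed for a particular choice of $w$ and $b$:

```python
import numpy as np

# Made-up data, only for illustration
x = np.array([50.0, 80.0, 120.0, 200.0])   # e.g. house sizes
y = np.array([60.0, 95.0, 130.0, 210.0])   # e.g. house prices

def mse(w, b):
    y_hat = w * x + b                      # predictions of the linear model
    return np.mean((y_hat - y) ** 2)       # 1/n * sum of squared errors

print(mse(w=1.0, b=0.0))                   # error for one choice of w and b
print(mse(w=1.0, b=10.0))                  # a different choice, different error
```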
The process we just did is called "gradient descent"
We took our error or loss function $\frac{1}{n}\sum(\hat{y}_i - y_i)^2$
Then we calculated in which direction this loss decreases when we change w
Then we changed $w$ "a little bit" and continued
We could do the same for $b$!
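As an illustration, a minimal sketch of this loop for the linear model (NumPy, with made-up data and a made-up step size):

```python
import numpy as np

x = np.array([50.0, 80.0, 120.0, 200.0])   # made-up inputs
y = np.array([60.0, 95.0, 130.0, 210.0])   # made-up targets

w, b = 0.0, 0.0                            # arbitrary starting values
alpha = 1e-5                               # "a little bit" (step size)

for step in range(1000):
    error = (w * x + b) - y
    # Gradients of the MSE 1/n * sum((y_hat - y)^2) with respect to w and b
    grad_w = 2 * np.mean(error * x)
    grad_b = 2 * np.mean(error)
    # Move w and b a little bit in the direction that decreases the loss
    w -= alpha * grad_w
    b -= alpha * grad_b

print(w, b, np.mean((w * x + b - y) ** 2))
```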
2D Lines only get us so far
We can extend this process to more dimensions:
$$\hat{y} = w_1 \cdot x_1 + w_2 \cdot x_2 + w_3 \cdot x_3 + b$$
Gradient descent will also work for this case, but our prediction model is still linear
What if our data is nonlinear?
Now we have $\hat{y}_i = f(x_i)$, where $f$ can be "anything"
We can still use the same idea as before, though!
Minimize the mean squared error (MSE) $\frac{1}{n}\sum(\hat{y}_i - y_i)^2$ by changing $f$
For that, we need to be able to change $f$ in some way: It should have parameters that we can change
Earlier, we looked at the gradient to see in which direction the loss decreases. For that, $f$ needs to be differentiable with respect to these parameters
Vectors are neat because we have mathematical operations defined on them: Addition, subtraction, multiplication with a scalar, etc.
One particularly important operation is the dot product:
$$\vec{v} \cdot \vec{w} = \begin{pmatrix}v_1\\v_2\\\vdots\\v_n\end{pmatrix} \cdot \begin{pmatrix}w_1\\w_2\\\vdots\\w_n\end{pmatrix} = v_1 w_1 + v_2 w_2 + \ldots + v_n w_n$$
$$\vec{w} \cdot \vec{x}' = \begin{pmatrix}w\\b\end{pmatrix} \cdot \begin{pmatrix}x\\1\end{pmatrix} = wx + b$$
This lets us write the linear model more concisely:
$$y = wx + b = \vec{w} \cdot \vec{x}'$$
We will use the notation $\vec{x}'$ to mean "adding a 1 to the end of the vector $\vec{x}$"
Now let us build a non-linear function!
Linear model:
$$M_w = y = \vec{w} \cdot \vec{x}'$$
Non-linear model:
$$M_w = y = h(\vec{w} \cdot \vec{x}')$$
with "some" non-linear function h
.
To calculate the gradient wrt the parameters:
$$M_w = y = h(\vec{w} \cdot \vec{x}')$$
$$\frac{\partial}{\partial \vec{w}} M_w = \frac{\partial}{\partial (\vec{w} \cdot \vec{x}')} h(\vec{w} \cdot \vec{x}') \cdot \frac{\partial}{\partial \vec{w}} (\vec{w} \cdot \vec{x}')$$
Or, in words: We will need to calculate the derivative of $h$ wrt its input.
Another take: We only need to be able to calculate the derivative of $h$ wrt its input.
Summary: What do we want from h?
Non-linear
Differentiable
"Interesting"
For example:
$$h(z) = \frac{1}{1 + e^{-z}}$$
Non-linear, differentiable and "interesting"!
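As a sketch, here is the sigmoid together with its derivative (using the standard identity $h'(z) = h(z)(1 - h(z))$; the function names are just illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):            # derivative of the sigmoid wrt its input
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.linspace(-5, 5, 5)
print(sigmoid(z))                # values squashed into (0, 1)
print(sigmoid_prime(z))          # non-zero everywhere, so gradients can flow
```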
Any problems?
Our sigmoid function only produces values between 0 and 1
Often, though, we want to predict values in a different range
We could "scale" (and shift) the output!
$$y = w_1 \cdot h(\vec{w} \cdot \vec{x}') + b_1$$
This looks like a linear model that uses the result of h as its input!
Let us call the operation $h(\vec{w} \cdot \vec{x}')$ a "Neuron"
This "neuron" takes some inputs x'
, performs a linear transformation, and applies a function h
to produce a result
We can then pass this result to another neuron, which may use the same or a different h
Implementation detail: Remember that we added a 1 to $\vec{x}$ to produce $\vec{x}'$, so we can write the linear model as a simple dot product. We do the same with the result of $h$!
Now we have a compact representation:
$$y = h_1(\vec{w}_1 \cdot h(\vec{w} \cdot \vec{x}')')$$
$h_1$ can just be the identity function; then we have exactly the same thing we had earlier:
$$y = w_1 \cdot h(\vec{w} \cdot \vec{x}') + b_1$$
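A minimal sketch of one such "neuron" followed by a linear output (NumPy, made-up weights, sigmoid assumed as $h$):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(w_vec, x):
    x_prime = np.append(x, 1.0)          # append the 1 for the bias
    return sigmoid(np.dot(w_vec, x_prime))

w_vec = np.array([0.8, -0.2])            # weight and bias of the neuron
w1, b1 = 3.0, -1.0                       # scale and shift of the output

x = np.array([2.0])
a = neuron(w_vec, x)                     # non-linear value in (0, 1)
y = w1 * a + b1                          # scaled and shifted output
print(a, y)
```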
Fun fact: As long as the additional layers are also differentiable, the entire function will be differentiable (you can try this at home).
Recall:
$$\vec{w} \cdot \vec{x}' = \begin{pmatrix}w\\b\end{pmatrix} \cdot \begin{pmatrix}x\\1\end{pmatrix} = wx + b$$
The same thing works with more values in the vectors!
$$\vec{w} \cdot \vec{x}' = \begin{pmatrix}w_1\\w_2\\b\end{pmatrix} \cdot \begin{pmatrix}x_1\\x_2\\1\end{pmatrix} = w_1 x_1 + w_2 x_2 + b$$
This means each of our neurons could take more than just one input!
With two layers:
$$\vec{a} = h_1(W_1 \cdot \vec{x}')$$
$$y = h_2(\vec{w}_2 \cdot \vec{a}')$$
$$y = h_2(\vec{w}_2 \cdot h_1(W_1 \cdot \vec{x}'))$$
Input Layer: Our data
Hidden Layers: The intermediate neurons
Output Layer: The neurons that actually produce the outputs
The Hidden Layer and the Output Layer Neurons have activation functions. While not theoretically necessary, we normally use the same activation function for all neurons in a layer.
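Putting this together, a minimal sketch of the two-layer forward pass $\vec{a} = h_1(W_1 \cdot \vec{x}')$, $y = h_2(\vec{w}_2 \cdot \vec{a}')$, assuming sigmoid in the hidden layer and the identity in the output layer (all weights are made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(W1, w2, x):
    x_prime = np.append(x, 1.0)          # input plus the bias 1
    a = sigmoid(W1 @ x_prime)            # hidden layer activations
    a_prime = np.append(a, 1.0)          # hidden activations plus the bias 1
    return np.dot(w2, a_prime)           # identity output layer

W1 = np.array([[0.5, -0.3, 0.1],         # 2 hidden neurons, 2 inputs + bias
               [0.2,  0.8, -0.4]])
w2 = np.array([1.5, -2.0, 0.3])          # 2 hidden activations + bias

print(forward(W1, w2, np.array([1.0, 2.0])))
```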
To get the gradient for the weights of the second layer:
$$\frac{\partial}{\partial \vec{w}_2}\left(h_2(\vec{w}_2 \cdot h_1(W_1 \cdot \vec{x}')) - y\right)^2 = 2 \cdot \left(h_2(\vec{w}_2 \cdot h_1(W_1 \cdot \vec{x}')) - y\right) \cdot h_2'(\vec{w}_2 \cdot h_1(W_1 \cdot \vec{x}')) \cdot h_1(W_1 \cdot \vec{x}') = 2 \cdot (\hat{y} - y) \cdot h_2'(\vec{w}_2 \cdot \vec{a}) \cdot \vec{a}$$

where $\hat{y} = h_2(\vec{w}_2 \cdot \vec{a})$ is the prediction and $\vec{a} = h_1(W_1 \cdot \vec{x}')$ are the hidden activations.
As mentioned before, we can use the chain rule a couple times more to also get the gradient wrt the weights of the first layer.
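A minimal sketch of the second-layer gradient above, assuming sigmoid for both $h_1$ and $h_2$ (weights, input and target are made up; in the code $\vec{a}$ includes the appended 1, so the last entry is the bias gradient):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

W1 = np.array([[0.5, -0.3, 0.1],
               [0.2,  0.8, -0.4]])
w2 = np.array([1.5, -2.0, 0.3])
x = np.array([1.0, 2.0])
y = 0.7                                    # made-up target

a = np.append(sigmoid(W1 @ np.append(x, 1.0)), 1.0)  # hidden activations + bias
z2 = np.dot(w2, a)
y_hat = sigmoid(z2)                        # prediction

# Gradient of (y_hat - y)^2 wrt w2, following the chain rule
grad_w2 = 2 * (y_hat - y) * sigmoid_prime(z2) * a
print(grad_w2)
```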
We can take "any" number of inputs, send them through a couple of layers of transformations, and obtain a result
We call this structure Feed-Forward Neural Networks
Gradient descent will change the weights to minimize the distance of the output from our training data
Does this work?
A Feed-Forward Neural Network with a single hidden layer and a linear output layer can approximate continuous functions on compact subsets of $\mathbb{R}^n$ to arbitrary precision, given enough neurons.
Which activation function? Sigmoid works (Cybenko, 1989), but anything that's not a simple polynomial will do (Leshno et al. 1993)
How many neurons do we need? Potentially exponentially many (in the dimensionality of the input) :(
Can we learn the weights? Who knows ...
For many applications neural networks produce useful approximations to functions
The number of neurons is usually determined by educated guesses and tweaking
Adding more layers helps with some tasks
Rule of thumb: You don't want to use too many neurons (overfitting!)
Myth: "Neural Networks are how the brain works"
Truth: At most the original development drew some inspiration from our understanding of the brain
Myth: "Neural Networks are a black box that no one understands"
Truth: Neural Networks are nothing magical, they're "just" giant non-linear functions. We have a very good understanding of how they work. Interpreting their operation can be challenging, though.
Myth: Neural Networks are "human-like intelligence"
> 4yo child is one thing but we need to explain how Megaphragma mymaripenne can fly and navigate with a brain of only 7400 neurons. Each neuron must be doing much more (1000x) than our Perceptron model explains. pic.twitter.com/xNPq0RAgPj
>
> — Mark Sugrue (@marksugruek) December 15, 2019
Give the network an input sample
Record the output
Calculate the error and the gradient
Change the weights "a little" in the opposite direction of the gradient
$$\hat{y} = h_2(\vec{w}_2 \cdot h_1(W_1 \cdot \vec{x}'))$$
$$\Delta \vec{w}_2 = 2 \cdot (\hat{y} - y) \cdot h_2'(\vec{w}_2 \cdot \vec{a}) \cdot \vec{a}$$
$$\vec{w}_2 = \vec{w}_2 - \alpha \cdot \Delta \vec{w}_2$$
How do we start?
How often do we update?
What is $\alpha$?
Any problems with this approach?
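One common set of answers, sketched below with NumPy: start from small random weights, update after every sample, and treat $\alpha$ as a hyperparameter. All of these choices (and the data) are illustrative assumptions, not the only option:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

W1 = rng.normal(scale=0.5, size=(2, 3))    # "How do we start?" -> small random weights
w2 = rng.normal(scale=0.5, size=3)
alpha = 0.1                                # "What is alpha?" -> a tunable step size

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
Y = np.array([0.1, 0.9, 0.8, 0.3])         # made-up targets

for epoch in range(100):
    for x, y in zip(X, Y):                 # "How often?" -> here, after every sample
        a = np.append(sigmoid(W1 @ np.append(x, 1.0)), 1.0)
        z2 = np.dot(w2, a)
        y_hat = sigmoid(z2)
        delta_w2 = 2 * (y_hat - y) * sigmoid_prime(z2) * a
        w2 = w2 - alpha * delta_w2         # a small step against the gradient
```

(This sketch only updates $\vec{w}_2$, like the formula above; the first layer's weights would be updated the same way via the chain rule.)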
Another problem with Neural Networks is overfitting
If you train your Neural Network on data, it might just start to "memorize" that data
The actual application, however, is to predict values for other data
We would therefore like to test our network
Neural Networks are typically trained on only part (often around 80%) of the available data
We can then use the rest of the data to test how well the network generalizes
These sets are called the training set and the test set
The (relevant) performance of the network is its error on the test set
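A minimal sketch of such an 80/20 split with NumPy (X and Y stand in for whatever inputs and targets you have; the numbers are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

X = rng.normal(size=(100, 3))      # 100 made-up samples with 3 features
Y = rng.normal(size=100)           # made-up targets

indices = rng.permutation(len(X))  # shuffle before splitting
split = int(0.8 * len(X))          # 80% for training
X_train, Y_train = X[indices[:split]], Y[indices[:split]]
X_test, Y_test = X[indices[split:]], Y[indices[split:]]

# Train only on (X_train, Y_train); report the error on (X_test, Y_test).
```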