In regression problems we try to predict continuously valued outputs, e.g. "Given the size of a house, predict its price (a real value)."
Regression models emulate generative processes that derive one or more values from a set of variables.
These input variables are capable of explaining the output, by correlation or causality.
Given: Some x and corresponding observed values y (real numbers)
Wanted: A function f that, given a (potentially new!) x, produces a prediction y
Our goal is to find a function that fits our data "well"
y=wx+b
We have two parameters, the slope w and the bias b. These are knobs that we can adjust or change to find the "best" linear equation
What is "best"? The values for w and b that minimize the MSE!
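As a minimal sketch of what "minimize the MSE" means, the following computes the mean squared error of a linear model on a few made-up data points. The data and the names `predict` and `mse` are illustrative assumptions, not part of the lecture material.

```python
def predict(w, b, x):
    """Linear model y = w*x + b."""
    return w * x + b


def mse(w, b, xs, ys):
    """Mean squared error between predictions and observed values."""
    return sum((predict(w, b, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up data generated by y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

loss_good = mse(2.0, 1.0, xs, ys)  # the parameters that generated the data
loss_bad = mse(1.0, 0.0, xs, ys)   # a worse guess gives a larger error
```

The "best" parameters are the ones for which this number is smallest.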
Remember the dot product:
$$\vec{w} \cdot \vec{x}' = \begin{pmatrix} w \\ b \end{pmatrix} \cdot \begin{pmatrix} x \\ 1 \end{pmatrix} = wx + b$$
This lets us write the linear model more concisely:
$$y = wx + b = \vec{w} \cdot \vec{x}'$$
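The dot-product form can be sketched in a few lines of plain Python. The helper names `dot` and `augment` are hypothetical and chosen only for this illustration.

```python
def dot(u, v):
    """Dot product of two equally long lists of numbers."""
    return sum(a * b for a, b in zip(u, v))


def augment(x):
    """Append a constant 1 so the bias becomes part of the weight vector."""
    return list(x) + [1.0]


w_vec = [2.0, 1.0]        # (w, b)
x_prime = augment([3.0])  # (x, 1)
y = dot(w_vec, x_prime)   # 2*3 + 1*1
```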
We are looking for the direction in which the error (or loss) decreases
The gradient of a function tells us its slope: it points in the direction of steepest increase, so its negative tells us which way the function decreases!
General algorithm, starting with some initial estimate/guess for the parameters:
Calculate the loss, and its gradient with respect to the parameters
Move the parameters "a little bit" into the direction of the decrease
Repeat
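The loop above can be sketched for the linear model with an MSE loss as follows. This is an illustrative sketch: the data, the learning rate of 0.05 and the number of steps are assumptions picked so that the toy example converges.

```python
def loss_and_gradient(w, b, xs, ys):
    """MSE of y = w*x + b and its partial derivatives wrt w and b."""
    n = len(xs)
    errors = [(w * x + b) - y for x, y in zip(xs, ys)]
    loss = sum(e * e for e in errors) / n
    grad_w = sum(2 * e * x for e, x in zip(errors, xs)) / n
    grad_b = sum(2 * e for e in errors) / n
    return loss, grad_w, grad_b


# Made-up data generated by y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w, b = 0.0, 0.0        # initial guess
learning_rate = 0.05   # how far "a little bit" is
losses = []
for step in range(2000):
    loss, grad_w, grad_b = loss_and_gradient(w, b, xs, ys)
    losses.append(loss)
    # Move against the gradient, i.e. in the direction of decrease
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b
```

After the loop, `w` and `b` should be close to the values 2 and 1 that generated the data.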
Remember the Chain Rule? We want to calculate the gradient wrt the parameters of the loss function. This results in the product of the gradient of the loss function wrt the model, and the gradient of the model wrt the parameters
$$\frac{\partial}{\partial w} L(M_w) = \frac{\partial}{\partial M_w} L(M_w) \cdot \frac{\partial}{\partial w} M_w$$
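A small numeric check can make the chain rule concrete. The sketch below assumes a one-parameter model M_w = w*x and a squared-error loss on a single made-up sample, and compares the chain-rule gradient with a finite-difference estimate.

```python
def model(w, x):
    """One-parameter model, kept minimal for the illustration."""
    return w * x


def loss(prediction, target):
    """Squared error for a single sample."""
    return (prediction - target) ** 2


w, x, target = 1.5, 2.0, 4.0

# Chain rule: dL/dw = dL/dM * dM/dw
dL_dM = 2 * (model(w, x) - target)  # gradient of the loss wrt the model output
dM_dw = x                           # gradient of the model wrt the parameter
analytic = dL_dM * dM_dw

# Central finite difference as an independent check
eps = 1e-6
numeric = (loss(model(w + eps, x), target)
           - loss(model(w - eps, x), target)) / (2 * eps)
```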
In Lab 1 you tried some other functions
We "only" had to make sure that we could calculate the gradient of the model wrt the parameters
This means we can just use "any" differentiable function as our model!
Ideally we want something that can represent many different functions
Linear model:
$$M_w = y = \vec{w} \cdot \vec{x}'$$
Non-linear model:
$$M_w = y = h(\vec{w} \cdot \vec{x}')$$
with "some" non-linear function h.
Calculate the gradient:
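$$M_w = y = h(\vec{w} \cdot \vec{x}')$$
$$\frac{\partial}{\partial w} M_w = \frac{\partial}{\partial (\vec{w} \cdot \vec{x}')} h(\vec{w} \cdot \vec{x}') \cdot \frac{\partial}{\partial w} (\vec{w} \cdot \vec{x}')$$
Or, in words: We will need to calculate the derivative of h wrt its input.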
Another take: We only need to be able to calculate the derivative of h wrt its input.
Summary: What do we want from h?
Non-linear
Differentiable
"Interesting"
For example:
$$h(z) = \frac{1}{1 + e^{-z}}$$
Non-linear, differentiable and "interesting"!
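As a sketch, the sigmoid and its derivative can be written with the standard library only. The closed form h'(z) = h(z)(1 - h(z)) is a well-known property of the sigmoid; the function names are chosen for this illustration.

```python
import math


def sigmoid(z):
    """h(z) = 1 / (1 + e^(-z)), always between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_derivative(z):
    """Derivative of the sigmoid wrt its input: h(z) * (1 - h(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


# Finite-difference check of the derivative at an arbitrary point
z, eps = 0.7, 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
```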
Any problems?
Our sigmoid function only produces values between 0 and 1
Often we want to predict values in a different range
We could "scale" (and shift) the output!
$$y = w_1 \cdot h(\vec{w} \cdot \vec{x}') + b_1$$
This looks like a linear model that uses the result of h as its input!
Let us call the operation $h(\vec{w} \cdot \vec{x}')$ a "Neuron"
This "neuron" takes some inputs x', performs a linear transformation, and applies a function h to produce a result
We can then pass this result to another neuron, which may use the same or a different h
Implementation detail: Remember that we added a 1 to x to produce x', so we can write the linear model as a simple dot product. We will use the ' to denote when this operation is performed.
Now we have a compact representation:
$$y = h_1(\vec{w}_1 \cdot h(\vec{w} \cdot \vec{x}')')$$
$h_1$ can just be the identity function, then we have exactly the same thing we had earlier:
$$y = w_1 \cdot h(\vec{w} \cdot \vec{x}') + b_1$$
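The equivalence of the two forms can be checked with a small sketch. The `neuron` helper, the weights and the input below are made-up values for illustration. The second neuron uses the identity as its function, so it only scales and shifts the sigmoid output.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def identity(z):
    return z


def neuron(weights, inputs, activation):
    """Append a 1 to the inputs (the ' operation), take the dot product, apply h."""
    x_prime = list(inputs) + [1.0]
    return activation(sum(w * x for w, x in zip(weights, x_prime)))


w = [0.8, -0.2]    # first neuron: weight and bias
w1 = [10.0, 5.0]   # second neuron: scale w1 and shift b1

x = [1.5]
hidden = neuron(w, x, sigmoid)
y_compact = neuron(w1, [hidden], identity)

# The explicit "scale and shift" form of the same computation
y_explicit = 10.0 * sigmoid(0.8 * 1.5 - 0.2) + 5.0
```

With these values the output lies between 5 and 15 instead of between 0 and 1.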
Fun fact: As long as the additional layers are also differentiable, the entire function will be differentiable (you can try this at home).
Recall:
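$$\vec{w} \cdot \vec{x}' = \begin{pmatrix} w \\ b \end{pmatrix} \cdot \begin{pmatrix} x \\ 1 \end{pmatrix} = wx + b$$
The same thing works with more values in the vectors!
$$\vec{w} \cdot \vec{x}' = \begin{pmatrix} w_1 \\ w_2 \\ b \end{pmatrix} \cdot \begin{pmatrix} x_1 \\ x_2 \\ 1 \end{pmatrix} = w_1 x_1 + w_2 x_2 + b$$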
This means each of our neurons could take more than just one input!
With two layers:
$$\vec{a} = h_1(W_1 \cdot \vec{x}')$$
$$y = h_2(\vec{w}_2 \cdot \vec{a}')$$
$$y = h_2(\vec{w}_2 \cdot h_1(W_1 \cdot \vec{x}')')$$
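A forward pass through two layers might look like the sketch below. It assumes each row of `W1` holds the weights and bias of one hidden neuron; all numbers are made up for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def identity(z):
    return z


def layer(weight_rows, inputs, activation):
    """Apply one layer: every row of weight_rows is one neuron (weights + bias)."""
    x_prime = list(inputs) + [1.0]  # the ' operation
    return [activation(sum(w * x for w, x in zip(row, x_prime)))
            for row in weight_rows]


# Hidden layer: 3 neurons, each with 2 input weights and a bias
W1 = [[0.5, -0.5, 0.0],
      [1.0, 1.0, -1.0],
      [0.0, 0.0, 0.0]]
# Output layer: 1 neuron with 3 weights (one per hidden neuron) and a bias
w2 = [1.0, 2.0, 3.0, 0.5]

x = [1.0, 1.0]
a = layer(W1, x, sigmoid)          # a = h1(W1 . x')
y = layer([w2], a, identity)[0]    # y = h2(w2 . a')
```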
Input Layer: Our data
Hidden Layers: The intermediate neurons
Output Layer: The neurons that actually produce the outputs
The Hidden Layer and the Output Layer Neurons have activation functions. While not theoretically necessary, we normally use the same activation function for all neurons in a layer.
To get the gradient for the weights of the second layer:
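$$\begin{aligned}
\frac{\partial}{\partial \vec{w}_2} \left(h_2(\vec{w}_2 \cdot h_1(W_1 \cdot \vec{x}')') - \hat{y}\right)^2
&= 2 \cdot \left(h_2(\vec{w}_2 \cdot h_1(W_1 \cdot \vec{x}')') - \hat{y}\right) \cdot \frac{\partial}{\partial (\vec{w}_2 \cdot \vec{a}')} h_2(\vec{w}_2 \cdot h_1(W_1 \cdot \vec{x}')') \cdot h_1(W_1 \cdot \vec{x}')' \\
&= 2 \cdot (y - \hat{y}) \cdot \frac{\partial}{\partial (\vec{w}_2 \cdot \vec{a}')} h_2(\vec{w}_2 \cdot \vec{a}') \cdot \vec{a}'
\end{aligned}$$
Here $\hat{y}$ is the observed value, $\vec{a} = h_1(W_1 \cdot \vec{x}')$, and the middle factor is the derivative of $h_2$ wrt its input.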
As mentioned before, we can use the chain rule a couple times more to also get the gradient wrt the weights of the first layer.
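The gradient wrt the second-layer weights can be checked numerically. The sketch below assumes sigmoid activations in both layers and a squared-error loss on one made-up sample; the weights and helper names are illustrative assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def forward(W1, w2, x):
    """Return the augmented hidden activations a' and the output y."""
    x_prime = list(x) + [1.0]
    a_prime = [sigmoid(dot(row, x_prime)) for row in W1] + [1.0]
    return a_prime, sigmoid(dot(w2, a_prime))


def grad_w2(W1, w2, x, target):
    """2 * (y - target) * h2'(w2 . a') * a', using h2' = y * (1 - y) for the sigmoid."""
    a_prime, y = forward(W1, w2, x)
    return [2.0 * (y - target) * y * (1.0 - y) * a_i for a_i in a_prime]


def squared_error(W1, w2, x, target):
    return (forward(W1, w2, x)[1] - target) ** 2


W1 = [[0.5, -0.3, 0.1], [-0.2, 0.4, 0.0]]  # 2 hidden neurons, 2 inputs + bias
w2 = [0.7, -0.6, 0.2]                      # 2 hidden activations + bias
x, target = [1.0, 2.0], 1.0

analytic = grad_w2(W1, w2, x, target)

# Central finite differences, one weight at a time
eps = 1e-6
numeric = []
for i in range(len(w2)):
    plus, minus = list(w2), list(w2)
    plus[i] += eps
    minus[i] -= eps
    numeric.append((squared_error(W1, plus, x, target)
                    - squared_error(W1, minus, x, target)) / (2 * eps))
```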
We can take "any" number of inputs, send them through a couple of layers of transformations, and obtain a result
We call this structure a Feed-Forward Neural Network
Gradient descent will change the weights to minimize the distance of the output from our training data
Does this work?
A Feed-Forward Neural Network with a single hidden layer and a linear output layer can approximate continuous functions on compact subsets of $\mathbb{R}^n$ to arbitrary precision, given enough neurons.
Which activation function? Sigmoid works (Cybenko, 1989), but anything that's not a simple polynomial will do (Leshno et al. 1993)
How many neurons do we need? Potentially exponentially many (in the dimensionality of the input) :(
Can we learn the weights? Who knows ...
For many applications neural networks produce useful approximations to functions
The number of neurons is usually determined by educated guesses and tweaking
Adding more layers helps with some tasks
Rule of thumb: You don't want to use too many neurons (overfitting!)
Myth: "Neural Networks are how the brain works"
Truth: At most the original development drew some inspiration from our understanding of the brain
Myth: "Neural Networks are a black box that no one understands"
Truth: Neural Networks are nothing magical, they're "just" giant non-linear functions. We have a very good understanding of how they work. Interpreting their operation can be challenging, though.
Myth: Neural Networks are "human-like intelligence"
4yo child is one thing but we need to explain how Megaphragma mymaripenne can fly and navigate with a brain of only 7400 neurons. Each neuron must be doing much more (1000x) than our Perceptron model explains.
- Mark Sugrue (@marksugruek), December 15, 2019
In the days when Sussman was a novice, Minsky once came to him as he sat hacking at the PDP-6.
"What are you doing?", asked Minsky. "I am training a randomly wired neural net to play Tic-Tac-Toe" Sussman replied. "Why is the net wired randomly?", asked Minsky. "I do not want it to have any preconceptions of how to play", Sussman said.
Minsky then shut his eyes. "Why do you close your eyes?", Sussman asked his teacher. "So that the room will be empty."
At that moment, Sussman was enlightened.
What happens if we initialize the weights with all 0s?
What happens if we initialize the weights with very large values?
Zero initialization: All gradients of a layer are the same, updating the weights in the same way!
Large initial values: All gradients of a layer are close to zero (look at the flat ends of the sigmoid curve, for example), and our weights barely change!
Therefore, we typically initialize our weights randomly, with a scale that depends on the square root of the size of the previous layer (e.g. He or Xavier initialization)
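The two initialization schemes mentioned above can be sketched as follows. This uses the commonly cited formulas (uniform bound sqrt(6 / (fan_in + fan_out)) for Xavier, standard deviation sqrt(2 / fan_in) for He); the function names and layer sizes are assumptions for illustration, and libraries such as PyTorch ship their own implementations.

```python
import math
import random


def xavier_uniform(fan_in, fan_out, rng):
    """Xavier/Glorot initialization: uniform in [-bound, bound]."""
    bound = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-bound, bound) for _ in range(fan_in)]
            for _ in range(fan_out)]


def he_normal(fan_in, fan_out, rng):
    """He initialization: normal with standard deviation sqrt(2 / fan_in)."""
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)]
            for _ in range(fan_out)]


rng = random.Random(0)             # fixed seed so the sketch is reproducible
W_xavier = xavier_uniform(6, 3, rng)  # 6 inputs, 3 neurons
W_he = he_normal(6, 3, rng)
```

Both give every neuron different, moderately sized starting weights, which avoids the two failure cases described above.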
Typically our training set contains many different samples. How do we update the weights?
Idea 1: Calculate the error on the entire training set and use that for the gradient ("classical" gradient descent)
Idea 2: Shuffle the training set, and calculate the error/gradient for each item one at a time ("stochastic"/"on-line" gradient descent)
Classical gradient descent: Imagine you have two (or more) training samples that are very similar. In classical gradient descent, you calculate the error and gradient using both/all of them before you do any update.
Stochastic gradient descent: Updating after every sample prevents redundant computations, and is often faster to converge. However, the parameters may change erratically, especially if you have outliers or a wide variety of different data. Also not parallelizable.
In practice: Mini-Batch gradient descent. Instead of using one or all of your data, split it into smaller batches ("minibatches"), commonly between 50 and 256 samples, and perform the calculations and update for each minibatch in random order.
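Splitting a training set into minibatches could be sketched like this. Shuffling the samples first and then slicing is one common variant, assumed here; the helper name `minibatches` and the tiny data set are made up for illustration.

```python
import random


def minibatches(samples, batch_size, rng):
    """Yield the samples in shuffled order, batch_size at a time."""
    indices = list(range(len(samples)))
    rng.shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [samples[i] for i in indices[start:start + batch_size]]


rng = random.Random(0)
data = list(range(10))  # stand-in for 10 training samples
batches = list(minibatches(data, 4, rng))
# In a training loop, one gradient step would be performed per batch.
```

The last batch can be smaller than the others when the data set size is not a multiple of the batch size.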
Momentum: When updating the weights, also add a fraction of the previous update again. If the gradient points in the same direction, this will accelerate it (like a ball rolling down a hill)
Nesterov accelerated gradient
$$\Delta w \leftarrow \gamma \cdot \Delta w + \nabla L(M(w - \alpha \Delta w, x))$$
$$w \leftarrow w - \alpha \Delta w$$
Instead of using the current values for the weights, we "imagine" what would happen if we performed the same update again. This allows us to "look into the future" and slow down when we get near the optimum.
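The update rule above can be tried on a toy problem. The sketch below assumes a one-dimensional loss L(w) = (w - 3)^2 and the values alpha = 0.1 and gamma = 0.9, all chosen for illustration; the only difference between plain momentum and the Nesterov variant is the point at which the gradient is evaluated.

```python
def gradient(w):
    """Gradient of the toy loss L(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)


def run(steps, alpha, gamma, nesterov):
    """Momentum update; with nesterov=True the gradient uses the look-ahead point."""
    w, delta = 0.0, 0.0
    for _ in range(steps):
        # Nesterov "imagines" the previous update being applied again
        point = w - alpha * delta if nesterov else w
        delta = gamma * delta + gradient(point)
        w = w - alpha * delta
    return w


w_momentum = run(200, alpha=0.1, gamma=0.9, nesterov=False)
w_nesterov = run(200, alpha=0.1, gamma=0.9, nesterov=True)
```

Both variants approach the optimum at w = 3 on this toy loss.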
SGD: Perform the basic (Stochastic) Gradient Descent update at every step
Adagrad: Accumulates the sum of the squares of all gradients for each parameter/weight over all previous iterations and divides the learning rate by the square root of that value. Problem: Learning rate can only get smaller.
Adadelta: Instead of accumulating all previous squared gradients, only consider a recent window of them (implemented as an exponentially decaying average)
Adam: Stores estimates for the mean and variance of previous gradients and uses them to scale the learning rate (this is similar to a momentum term). When in doubt, use this one.
Nadam: Instead of vanilla momentum use the Nesterov accelerated gradient idea to "predict the future"
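To illustrate how Adam uses running estimates of the gradients, here is a sketch of the commonly published update rule on a toy loss L(w) = (w - 3)^2. The toy loss, the learning rate of 0.1 and the step count are assumptions for this example; in practice you would use a library implementation such as `torch.optim.Adam`.

```python
import math


def gradient(w):
    """Gradient of the toy loss L(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)


def adam(steps, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    w, m, v = 0.0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = gradient(w)
        m = beta1 * m + (1 - beta1) * g      # running mean of the gradients
        v = beta2 * v + (1 - beta2) * g * g  # running mean of the squared gradients
        m_hat = m / (1 - beta1 ** t)         # bias correction for the zero start
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


w_final = adam(2000)
```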
Construct the neural network:
```python
model = torch.nn.Sequential(
    # Hidden Layer
    torch.nn.Linear(INPUTS, HIDDEN_NEURONS),
    torch.nn.Sigmoid(),
    # Output Layer
    torch.nn.Linear(HIDDEN_NEURONS, 1),
)
```
Apply it to data:
y = model(x)
```python
for t in range(n):
    # Input ALWAYS has to be a matrix!
    # Rows: samples, columns: features
    y_pred = model(x.view(-1, INPUTS))
    loss = loss_fn(y_pred, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```
We still need a loss_fn, and an optimizer!
torch.nn.MSELoss: Mean Squared Error
torch.optim.SGD: Stochastic Gradient Descent
torch.optim.Adam: Adam Optimizer
etc.
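```python
import torch

# x, y (tensors with the training data) and n (number of steps) are assumed to be defined
model = torch.nn.Sequential(
    torch.nn.Linear(6, 3),   # hidden layer: 6 inputs, 3 neurons
    torch.nn.Sigmoid(),
    torch.nn.Linear(3, 1),   # output layer: 1 output
)
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)

for t in range(n):
    y_pred = model(x.view(-1, 6))
    # y should have the same shape as y_pred, i.e. (samples, 1)
    loss = loss_fn(y_pred, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```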
It is also possible to define a new class for your neural network, which gives you more control:
```python
class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        super().__init__()
        self.linear1 = torch.nn.Linear(D_in, H)
        self.sig1 = torch.nn.Sigmoid()
        self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        h = self.linear1(x)
        h = self.sig1(h)
        y_pred = self.linear2(h)
        return y_pred

model = TwoLayerNet(6, 5, 4)
y_pred = model(x.view(-1, 6))
```
In Lab 2 you will replace your non-linear model from Lab 1 with a neural network
First, construct a neural network with one hidden layer and one output layer that takes the ActionLatency as input and predicts the APM
Try some different numbers of neurons
Then add more inputs. The lab description lists 6 you should use; if you want, you can experiment with adding more
Deadline: May 12, before class
Cybenko, G. (1989). "Approximation by superpositions of a sigmoidal function". Mathematics of Control, Signals, and Systems, 2(4), 303–314. doi:10.1007/BF02551274
Leshno, M., Lin, V. Ya., Pinkus, A., Schocken, S. (1993). "Multilayer feedforward networks with a nonpolynomial activation function can approximate any function". Neural Networks, 6(6), 861–867. doi:10.1016/S0893-6080(05)80131-5