
Artificial Intelligence

Neural Networks

1 / 40

Supervised Learning

2 / 40

Supervised Learning

  • In Supervised Learning, we are given some x and y, and want to learn a function f(x) = y

  • The goal is to be able to apply this function to unknown values of x

  • For example: We are given the stats of several houses (x) in Cartago and their prices (y), and want to be able to predict the prices of other houses

  • Another example: We are given pictures of animals (x), and labels of cat or dog (y), and want to recognize cats and dogs in other pictures

3 / 40

Supervised Learning

There are two main tasks:

  • Regression: We predict a (continuous) number (price, duration, score) as our output

  • Classification: We predict one of several limited (discrete) categories/"classes" (cat vs. dog, spam vs. not spam, movie genre), as our output

There are many methods for each of these two tasks. In this class we will mostly focus on Neural Networks, which can be applied to both problems

4 / 40

Linear Regression

  • We have some points (xi,yi) in the plane

  • We want to find a line that "best" represents these points

  • What's a line? $\hat{y}_i = w x_i + b$

  • What is the "best" representation? We want $\hat{y}_i$ and $y_i$ to be "the same"

  • We minimize the mean squared error (MSE) $\frac{1}{n}\sum_i (\hat{y}_i - y_i)^2$ by changing $w$ and $b$
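A minimal sketch, with made-up data points, of computing this MSE for a candidate line:

```python
def mse(w, b, xs, ys):
    """Mean squared error between predictions w*x + b and targets y."""
    n = len(xs)
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / n

# Illustrative data generated by y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

print(mse(2.0, 1.0, xs, ys))  # perfect fit -> 0.0
print(mse(1.0, 0.0, xs, ys))  # worse fit  -> 7.5
```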

5 / 40

Finding a Line: Example

$\hat{y}_i = w x_i + b$

[Interactive demo: a scatter plot of the data with the current line over $x$, and a plot of the MSE as a function of $w$; controls adjust $w$ and $b$ and display the resulting MSE]

6 / 40

Gradient Descent

  • The process we just did is called "gradient descent"

  • We took our error or loss function $\frac{1}{n}\sum_i (\hat{y}_i - y_i)^2$

  • Then we calculated in which direction this loss decreases when we change w

  • Then we changed w "a little bit" and continued

  • We could do the same for b!
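The steps above can be sketched as a loop that updates both $w$ and $b$; the data, learning rate, and step count are made up for illustration:

```python
def gradient_step(w, b, xs, ys, lr):
    """One gradient descent step on the MSE 1/n * sum((w*x + b - y)^2)."""
    n = len(xs)
    # Partial derivatives of the MSE with respect to w and b
    dw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    db = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    # Move "a little bit" against the gradient
    return w - lr * dw, b - lr * db

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated by y = 2x + 1
w, b = 0.0, 0.0
for _ in range(2000):
    w, b = gradient_step(w, b, xs, ys, lr=0.05)
# w and b should approach 2 and 1
```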

7 / 40

Nonlinear models

  • 2D Lines only get us so far

  • We can extend this process to more dimensions:

$\hat{y} = w_1 x_1 + w_2 x_2 + w_3 x_3 + b$

  • Gradient descent will also work for this case, but our prediction model is still linear

  • What if our data is nonlinear?

8 / 40

Nonlinear regression

9 / 40

Nonlinear functions

  • Now we have $\hat{y}_i = f(x_i)$, where $f$ can be "anything"

  • We can still use the same idea as before, though!

  • Minimize the mean squared error (MSE) $\frac{1}{n}\sum_i (\hat{y}_i - y_i)^2$ by changing $f$

  • For that, we need to be able to change f in some way: It should have parameters that we can change

  • Earlier, we looked at the gradient to see in which direction the loss decreases. For that, f needs to be differentiable with respect to these parameters

10 / 40

Some Notation

  • Vectors are neat because we have mathematical operations defined on them: Addition, subtraction, multiplication with a scalar, etc.

  • One particularly important operation is the dot product:

$\vec{v} \cdot \vec{w} = (v_1, v_2, \dots, v_n) \cdot (w_1, w_2, \dots, w_n) = v_1 w_1 + v_2 w_2 + \dots + v_n w_n$

11 / 40

Dot Product

$\vec{w} \cdot \vec{x}' = (w \;\; b) \cdot (x \;\; 1) = wx + b$

This lets us write the linear model more concisely:

$\hat{y} = wx + b = \vec{w} \cdot \vec{x}'$

We will use the notation $\vec{x}'$ to mean "adding a 1 to the end of the vector $\vec{x}$"
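A small sketch of this trick: folding the bias into the weight vector and appending a 1 to the input gives the same result as $wx + b$:

```python
def dot(v, w):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(v, w))

w, b, x = 3.0, 2.0, 5.0
w_vec = [w, b]       # (w  b): bias folded into the weight vector
x_prime = [x, 1.0]   # (x  1): the x' notation
assert dot(w_vec, x_prime) == w * x + b  # both are 17.0
```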

Now let us build a non-linear function!

12 / 40

Something non-linear

Linear model:

$M_{\vec{w}} = \hat{y} = \vec{w} \cdot \vec{x}'$

Non-linear model:

$M_{\vec{w}} = \hat{y} = h(\vec{w} \cdot \vec{x}')$

with "some" non-linear function h.

13 / 40

Differentiability

To calculate the gradient with respect to the parameters:

$M_{\vec{w}} = \hat{y} = h(\vec{w} \cdot \vec{x}')$

$\nabla_{\vec{w}} M_{\vec{w}} = \frac{\partial h(\vec{w} \cdot \vec{x}')}{\partial\, \vec{w} \cdot \vec{x}'} \cdot \nabla_{\vec{w}}\, \vec{w} \cdot \vec{x}'$

Or, in words: We will need to calculate the derivative of h wrt its input.

14 / 40

Differentiability

To calculate the gradient with respect to the parameters:

$M_{\vec{w}} = \hat{y} = h(\vec{w} \cdot \vec{x}')$

$\nabla_{\vec{w}} M_{\vec{w}} = \frac{\partial h(\vec{w} \cdot \vec{x}')}{\partial\, \vec{w} \cdot \vec{x}'} \cdot \nabla_{\vec{w}}\, \vec{w} \cdot \vec{x}'$

Or, in words: We will need to calculate the derivative of h wrt its input.

Another take: We only need to be able to calculate the derivative of h wrt its input.

15 / 40

Functions

Summary: What do we want from h?

  • Non-linear

  • Differentiable

  • "Interesting"

For example:

$h(z) = \frac{1}{1 + e^{-z}}$
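A quick sketch of the sigmoid and its derivative, using the identity $h'(z) = h(z)(1 - h(z))$:

```python
import math

def sigmoid(z):
    """h(z) = 1 / (1 + e^{-z}) -- squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid: h'(z) = h(z) * (1 - h(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

print(sigmoid(0.0))        # 0.5
print(sigmoid_prime(0.0))  # 0.25
```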

16 / 40

Sigmoid Function

Non-linear, differentiable and "interesting"!

17 / 40

Sigmoid Function

Non-linear, differentiable and "interesting"!

Any problems?

18 / 40

Output values

  • Our sigmoid function only produces values between 0 and 1

  • Often, however, the values we want to predict lie in a different range

  • We could "scale" (and shift) the output!

$\hat{y} = w_1 h(\vec{w} \cdot \vec{x}') + b_1$

This looks like a linear model that uses the result of h as its input!

19 / 40

Recursion

  • Let us call the operation $h(\vec{w} \cdot \vec{x}')$ a "Neuron"

  • This "neuron" takes some inputs x', performs a linear transformation, and applies a function h to produce a result

  • We can then pass this result to another neuron, which may use the same or a different h

  • Implementation detail: Remember that we added a 1 to $\vec{x}$ to produce $\vec{x}'$, so we can write the linear model as a simple dot product. We do the same with the result of $h$!
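One such "neuron" might be sketched like this (the bias-last weight layout is an assumption for illustration):

```python
import math

def neuron(weights, inputs, h=lambda z: 1.0 / (1.0 + math.exp(-z))):
    """One 'neuron': append 1 to the inputs, take the dot product with the
    weights (whose last entry is the bias), and apply the activation h."""
    x_prime = list(inputs) + [1.0]
    z = sum(w * x for w, x in zip(weights, x_prime))
    return h(z)

# Weight 1.0 and bias 0.0 applied to input 0.0 gives h(0) = 0.5
print(neuron([1.0, 0.0], [0.0]))
```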

20 / 40

Two Neurons

Now we have a compact representation:

$\hat{y} = h_1(\vec{w}_1 \cdot h(\vec{w} \cdot \vec{x}')')$

$h_1$ can just be the identity function, then we have exactly the same thing we had earlier:

$\hat{y} = w_1 h(\vec{w} \cdot \vec{x}') + b_1$

Fun fact: As long as the additional layers are also differentiable, the entire function will be differentiable (you can try this at home).

21 / 40

Why?

22 / 40

More Nonlinearity!

Recall:

$\vec{w} \cdot \vec{x}' = (w \;\; b) \cdot (x \;\; 1) = wx + b$

The same thing works with more values in the vectors!

$\vec{w} \cdot \vec{x}' = (w_1 \;\; w_2 \;\; b) \cdot (x_1 \;\; x_2 \;\; 1) = w_1 x_1 + w_2 x_2 + b$

This means each of our neurons could take more than just one input!

23 / 40

Artificial Neural Networks

24 / 40

Artificial Neural Networks

With two layers:

$\vec{a} = h_1(W_1 \cdot \vec{x}')$

$\hat{y} = h_2(\vec{w}_2 \cdot \vec{a}')$

$\hat{y} = h_2(\vec{w}_2 \cdot h_1(W_1 \cdot \vec{x}')')$
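A sketch of this two-layer forward pass; the weights are made-up numbers, with the bias stored in the last column of each weight matrix:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(W, x, h):
    """Apply one layer: append 1 to x, multiply by the weight matrix W
    (one row per neuron, last column = bias), then apply h elementwise."""
    x_prime = list(x) + [1.0]
    return [h(sum(w * v for w, v in zip(row, x_prime))) for row in W]

# Illustrative network: 2 inputs -> 3 hidden (sigmoid) -> 1 output (identity)
W1 = [[ 0.5, -0.3, 0.1],
      [ 0.2,  0.8, 0.0],
      [-0.7,  0.4, 0.2]]
w2 = [[1.0, -1.0, 0.5, 0.1]]

a = layer(W1, [1.0, 2.0], sigmoid)   # hidden activations, each in (0, 1)
y = layer(w2, a, lambda z: z)        # linear output layer
```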

25 / 40

Terminology

  • Input Layer: Our data

  • Hidden Layers: The intermediate neurons

  • Output Layer: The neurons that actually produce the outputs

  • The Hidden Layer and the Output Layer Neurons have activation functions. While not theoretically necessary, we normally use the same activation function for all neurons in a layer.

26 / 40

Some Activation Functions

27 / 40

Gradient Calculation

To get the gradient for the weights of the second layer:

$\nabla_{\vec{w}_2} \left( h_2(\vec{w}_2 \cdot h_1(W_1 \cdot \vec{x}')') - y \right)^2$

$= 2 \left( h_2(\vec{w}_2 \cdot h_1(W_1 \cdot \vec{x}')') - y \right) \cdot \frac{\partial h_2(\vec{w}_2 \cdot h_1(W_1 \cdot \vec{x}')')}{\partial\, \vec{w}_2 \cdot h_1(W_1 \cdot \vec{x}')'} \cdot h_1(W_1 \cdot \vec{x}')'$

$= 2 (\hat{y} - y) \cdot \frac{\partial h_2(\vec{w}_2 \cdot \vec{a}')}{\partial\, \vec{w}_2 \cdot \vec{a}'} \cdot \vec{a}'$

As mentioned before, we can use the chain rule a couple times more to also get the gradient wrt the weights of the first layer.
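One way to sanity-check such a gradient is to compare it against finite differences; the activations, weights, target, and choice of a sigmoid output activation below are made up for illustration:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(v, w):
    return sum(a * b for a, b in zip(v, w))

# Fixed hidden activations a' (with the appended 1) and a target value
a_prime = [0.3, 0.9, 1.0]
target = 0.7
w2 = [0.5, -0.2, 0.1]

def loss(w):
    return (sigmoid(dot(w, a_prime)) - target) ** 2

# Chain-rule gradient: 2*(y_hat - y) * h2'(z) * a', with h2' = h2*(1 - h2)
z = dot(w2, a_prime)
y_hat = sigmoid(z)
analytic = [2 * (y_hat - target) * y_hat * (1 - y_hat) * a for a in a_prime]

# Central finite-difference check of each component
eps = 1e-6
for i, g in enumerate(analytic):
    w_plus = list(w2); w_plus[i] += eps
    w_minus = list(w2); w_minus[i] -= eps
    numeric = (loss(w_plus) - loss(w_minus)) / (2 * eps)
    assert abs(g - numeric) < 1e-6
```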

28 / 40

What does this do?

  • We can take "any" number of inputs, send them through a couple of layers of transformations, and obtain a result

  • We call these structures Feed-Forward Neural Networks

  • Gradient descent will change the weights to minimize the distance of the output from our training data

  • Does this work?

29 / 40

Universal Approximation Theorem

A Feed-Forward Neural Network with a single hidden layer and a linear output layer can approximate continuous functions on compact subsets of $\mathbb{R}^n$ to arbitrary precision, given enough neurons.

  • Which activation function? Sigmoid works (Cybenko, 1989), but anything that's not a simple polynomial will do (Leshno et al. 1993)

  • How many neurons do we need? Potentially exponentially many (in the dimensionality of the input) :(

  • Can we learn the weights? Who knows ...

30 / 40

In Practice

  • For many applications neural networks produce useful approximations to functions

  • The number of neurons is usually determined by educated guesses and tweaking

  • Adding more layers helps with some tasks

  • Rule of thumb: You don't want to use too many neurons (overfitting!)

31 / 40

Myths

  • Myth: "Neural Networks are how the brain works"

  • Truth: At most the original development drew some inspiration from our understanding of the brain

  • Myth: "Neural Networks are a black box that no one understands"

  • Truth: Neural Networks are nothing magical, they're "just" giant non-linear functions. We have a very good understanding of how they work. Interpreting their operation can be challenging, though.

  • Myth: Neural Networks are "human-like intelligence"

32 / 40

Braaaaaains

33 / 40

Learning!

  • Give the network an input sample

  • Record the output

  • Calculate the error and the gradient

  • Change the weights "a little" in the opposite direction of the gradient
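These four steps can be sketched as a training loop; the single-neuron model, toy data, and learning rate are assumptions for illustration:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy data: label 1 when x > 0, else 0 (made up for illustration)
data = [(-2.0, 0.0), (-1.0, 0.0), (1.0, 1.0), (2.0, 1.0)]
w, b, alpha = 0.0, 0.0, 0.5

for _ in range(500):
    for x, target in data:
        y = sigmoid(w * x + b)          # 1. give the network an input sample
        error = y - target              # 2./3. record output, compute error
        grad = 2 * error * y * (1 - y)  # 3. gradient of (y - target)^2 wrt z
        w -= alpha * grad * x           # 4. step against the gradient
        b -= alpha * grad
# After training, the neuron should separate positives from negatives
```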

34 / 40

Learning

$\hat{y} = h_2(\vec{w}_2 \cdot h_1(W_1 \cdot \vec{x}')')$

$\Delta \vec{w}_2 = 2 (\hat{y} - y) \frac{d}{d\vec{w}_2} h_2(\vec{w}_2 \cdot h_1(W_1 \cdot \vec{x}')')$

$\vec{w}_2 = \vec{w}_2 - \alpha \Delta \vec{w}_2$

  • How do we start?

  • How often do we update?

  • What is α?

  • Any problems with this approach?

35 / 40

Gradient Descent: Problems

36 / 40

Gradient Descent: More Problems

37 / 40

Generalizability

  • Another problem with Neural Networks is overfitting

  • If you train your Neural Network on data, it might just start to "memorize" that data

  • The actual application, however, is to predict values for other data

  • We would therefore like to test our network

38 / 40

Training and Test Set

  • Neural Networks are typically trained on only parts (typically around 80%) of the available data

  • We can then use the rest of the data to test how well the network generalizes

  • These sets are called the training set and the test set

  • The (relevant) performance of the network is its error on the test set
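A sketch of such a split (the 80/20 ratio and fixed seed are illustrative choices):

```python
import random

def train_test_split(data, train_fraction=0.8, seed=0):
    """Shuffle the data, then split it into a training and a test set."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

data = list(range(100))
train, test = train_test_split(data)
print(len(train), len(test))  # 80 20
```

Shuffling before splitting matters: if the data is ordered (say, by price), a plain head/tail split would train and test on systematically different samples.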

39 / 40
