class: center, middle

# Computational and statistical techniques of Machine Learning

## Advanced Neural Networks

---

# Artificial Neural Networks

.left-column[
]
.right-column[
We introduced one or more "hidden layers", which hold the intermediate values h:

$$ \vec{h} = f_1(W_1 \cdot \vec{x})\\\\
y = f_2(\vec{w_2} \cdot \vec{h})\\\\
y = f_2(\vec{w_2} \cdot f_1(W_1 \cdot \vec{x})) $$
]

---

class: medium

# Deep Networks

* The Universal Approximation Theorem states that an ANN with *one* hidden layer can approximate any continuous function to arbitrary precision (given enough neurons)
* But we could also add more layers!
* Why? Learning features!
* Viewed another way: If the ANN has to produce output for an input of dimensionality `n`, but the data has to pass through a layer with `m` neurons, where `m` is much smaller than `n`, the ANN has to encode enough information about the input into these `m` numbers/features to produce the desired output

---

class: medium

# Auto-Encoders

* One application of this approach is the Auto-encoder
* Auto-encoders are neural networks with many layers that become narrower and narrower before widening again
* The number of inputs is the same as the number of outputs, and the training examples use the *same* values for input and output
* The goal is to learn a smaller *representation* for the input data
* In essence, the ANN has to reconstruct the input from fewer values

---

class: medium

# Auto-Encoders
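Below is a minimal PyTorch sketch of this idea; the layer sizes and the 784-dimensional (MNIST-like) input are illustrative assumptions, not prescribed values:

```python
import torch
import torch.nn as nn

# The encoder squeezes 784 inputs down to 32 values; the decoder has to
# reconstruct the 784 inputs from only those 32 values.
class AutoEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(),
                                     nn.Linear(128, 32), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(),
                                     nn.Linear(128, 784), nn.Sigmoid())

    def forward(self, x):
        code = self.encoder(x)       # the smaller learned representation
        return self.decoder(code)    # the reconstruction of the input

model = AutoEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.rand(64, 784)              # stand-in for a batch of flattened images
loss = loss_fn(model(x), x)          # input and target are the *same* values
loss.backward()
optimizer.step()
```

Everything left of the 32-value bottleneck is the encoder; the decoder only ever sees those 32 numbers.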
---

class: center, middle

# Vector Models

---

# I Dreamed a Dream

* Let's say we want to work with words
* Imagine we could turn a word into a vector
* Then we could take vector differences to get the "relation" between words
* For example: (Vienna - Austria) + Costa Rica = San Jose

--

* Turns out, we can actually do that!

---

# Vector Space Models

* As before, we will have a smaller representation of our input
* Our vector elements don't have any clear interpretation, they are just numbers representing a word
* The vectors are *learned* from a corpus
* We can then compare vectors using algebra, for example calculating the cosine (dot product!) between two vectors to get their similarity

---

# The Distributional Hypothesis

* The reason this vector trick works is the Distributional Hypothesis
* It says that words that are used in the same **context** are also semantically similar
* However, the actual relation between this hypothesis and the vectors has been called "very hand-wavy" by some authors

---

class: medium

# Context

* For each word, we use `k` words before and after as the "context"
* For example: "Machine Learning is a fascinating subject" with context 1 becomes:
  - [(machine, is), learning]
  - [(learning, a), is]
  - [(is, fascinating), a]
  - [(a, subject), fascinating]
* What do we do with these? Learn the word from the context!

---

class: medium

# Word2Vec

* We use a neural network to learn the relationship between words and their context
* Because this relationship has to be calculated by the neural network, the values of one of the hidden layers have to encode it
* We can use these values as a vector representation for a word!
* Two approaches:
  - Context is the input, word is the output (Continuous Bag-of-Words)
  - Word is the input, context is the output (skip-gram)

---

class: medium

# Word2Vec
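A minimal PyTorch sketch of the skip-gram direction (word in, context word out); the vocabulary size, embedding size, and word ids are illustrative assumptions:

```python
import torch
import torch.nn as nn

vocab_size, embedding_dim = 5000, 100    # illustrative sizes

model = nn.Sequential(
    nn.Embedding(vocab_size, embedding_dim),   # hidden layer = the word vectors
    nn.Linear(embedding_dim, vocab_size),      # predict a context word
)
loss_fn = nn.CrossEntropyLoss()

word_ids = torch.tensor([12, 7])      # hypothetical ids of center words
context_ids = torch.tensor([40, 3])   # hypothetical ids of observed context words
loss = loss_fn(model(word_ids), context_ids)
loss.backward()

# After training, the rows of the embedding matrix are the word vectors:
word_vectors = model[0].weight        # shape (vocab_size, embedding_dim)
```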
---

# Word2Vec Results
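Once vectors have been learned, comparisons like the analogy above reduce to vector algebra. A sketch with tiny, made-up vectors (real embeddings come from a trained model and have hundreds of dimensions):

```python
import torch
import torch.nn.functional as F

# Hypothetical learned word vectors with made-up values, just for illustration
vectors = {
    "vienna":     torch.tensor([0.9, 0.1, 0.8]),
    "austria":    torch.tensor([0.8, 0.0, 0.1]),
    "costa_rica": torch.tensor([0.1, 0.9, 0.2]),
    "san_jose":   torch.tensor([0.2, 1.0, 0.9]),
}

# Cosine similarity: the dot product of the normalized vectors
def similarity(a, b):
    return F.cosine_similarity(vectors[a], vectors[b], dim=0).item()

# The analogy from the earlier slide, as vector arithmetic
query = vectors["vienna"] - vectors["austria"] + vectors["costa_rica"]
answer = max(vectors,
             key=lambda w: F.cosine_similarity(query, vectors[w], dim=0).item())
```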
---

class: medium

# Doc2Vec

* It may seem tricky to go from learning a representation for single words to a representation for entire documents
* But it is actually not!
* Here is the trick: For each word that we learn, we provide the document ID *in addition* to the context
* This means that the network has to learn some vector representation for the document during training as well
* Note: The "documents" are sometimes also called "paragraphs" in the literature

---

# Distributed Memory version of Paragraph Vector
---

# Distributed Bag of Words version of Paragraph Vector
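In practice, libraries provide ready-made implementations. A sketch using gensim's Doc2Vec (assuming gensim >= 4 is installed; the toy corpus, tags, and parameter values are purely illustrative):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Each training document is a list of tokens plus a document ID (tag)
corpus = [
    TaggedDocument(words=["machine", "learning", "is", "fascinating"], tags=[0]),
    TaggedDocument(words=["neural", "networks", "learn", "features"], tags=[1]),
]

# dm=1 is the Distributed Memory variant, dm=0 the Distributed Bag of Words one
model = Doc2Vec(corpus, vector_size=50, window=2, min_count=1, epochs=40, dm=1)

doc_vector = model.dv[0]                                   # learned document vector
new_vector = model.infer_vector(["learning", "features"])  # vector for unseen text
```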
---

class: center, middle

# More Neural Network Architectures

---

class: small

# Convolutional Neural Networks

* In traditional ANNs, all neurons in a layer are connected to all neurons in the previous layer
* For very high-dimensional data (e.g. images) this results in a very large number of weights that need to be learned
* In many practical applications, there is some sort of locality present in the data
* For example, in image recognition: Pixels that are close together are more likely to be correlated than pixels from opposite corners of the image
* Idea: Instead of connecting the network fully, define a "receptive field" for each neuron that determines how much of the previous layer it "perceives"

---

# Convolutional Neural Networks
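A minimal PyTorch sketch of this idea for 28x28 grayscale (MNIST-like) images; the channel counts and kernel sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            # each output neuron only "sees" a 3x3 receptive field
            nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                               # 28x28 -> 14x14
            nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                               # 14x14 -> 7x7
        )
        self.classifier = nn.Linear(16 * 7 * 7, 10)        # 10 class scores

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

out = SmallCNN()(torch.rand(4, 1, 28, 28))   # batch of 4 images -> (4, 10) scores
```

Each convolutional filter is reused across the whole image, so the number of weights depends on the kernel size and channel counts, not on the image resolution.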
---

# Recurrent Neural Networks

* So far we have looked at Neural Networks with a static number of inputs
* However, we often have variable-length input, for example if we collect time series data (like cards played in Hearthstone)
* One approach to this is to feed the network one time step at a time and give it "memory"
* We can conceptualize this memory as a "hidden variable", or **hidden state**

---

class: medium

# Recurrent Neural Networks

.left-column[
]
.right-column[
* The hidden state is initialized to some values (typically zeros)
* Then the first input element/step is passed to the network, and it produces output **and** a new hidden state
* This new hidden state is passed to the network together with the next input element/step
]

---

# Recurrent Neural Networks: Unfolding
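Unfolding over time is just a loop that carries the hidden state from one step to the next. A minimal PyTorch sketch (the input size, hidden size, and sequence length are illustrative assumptions):

```python
import torch
import torch.nn as nn

cell = nn.RNNCell(input_size=8, hidden_size=16)
readout = nn.Linear(16, 1)                 # one output value per time step

sequence = torch.rand(5, 8)                # 5 time steps, 8 features each
h = torch.zeros(1, 16)                     # hidden state starts as zeros

outputs = []
for x_t in sequence:
    h = cell(x_t.unsqueeze(0), h)          # new hidden state from input + old state
    outputs.append(readout(h))             # output at this time step
```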
---

# Recurrent Neural Networks: Modes
---

class: center, middle

# Generative Adversarial Networks

---

# Generative Adversarial Networks

* So far we have used Neural Networks to classify images, or to predict some value
* Could we **generate** things with a Neural Network?
* Crazy idea: We pass the Neural Network some random numbers and it produces a new Picasso-like painting

--

* That's exactly what we'll do!

---

# First: Classification

* To produce a Picasso-like painting, we first need to know which paintings *are* Picasso-like
* We could train a Neural Network that detects "real" Picassos (the "Discriminator")
* Input: An image
* Output: "True" Picasso, or "fake"
* So we'll need some real and fake Picassos to start with ...

---

# Art Connoisseur Network

* After some training, our network will be able to distinguish real and fake Picassos
* This means we can give this network a new painting, and it will tell us if it is real or not
* Now we can define the task for our generator more clearly: Fool the discriminator network, i.e. generate paintings that the discriminator recognizes as "real" Picassos

---

class: medium

# The Generator Network

* The Generator Network takes, as we wanted, a vector of random numbers as input, and produces a picture as output
* The **loss function** for this network then consists of passing the produced image through the discriminator and determining whether it believes the painting to be real or not
* We can then use backpropagation and gradient descent, as usual, to update the weights in our generator
* Over time, our generator will learn to fool the discriminator!

---

# Not quite enough ...

* If our discriminator were "perfect", this would already be enough
* However, to start, we needed some "fake" Picassos, which we just generated randomly
* Once the Generator produces some images, we actually have "better fakes"!
* So we can improve the Discriminator with those
* And then we need to improve the Generator again, etc.

---

# Generative Adversarial Networks

* Generative: We **generate** images
* Adversarial: The Generator and the Discriminator play a "game" against each other
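A minimal PyTorch sketch of the two players, assuming 28x28 images flattened to 784 values and a 100-dimensional random input (all layer sizes are illustrative):

```python
import torch
import torch.nn as nn

# Discriminator: image in -> estimate of "how real" it looks (0 = fake, 1 = real)
discriminator = nn.Sequential(nn.Linear(784, 128), nn.LeakyReLU(0.2),
                              nn.Linear(128, 1), nn.Sigmoid())

# Generator: random numbers in -> image out
generator = nn.Sequential(nn.Linear(100, 128), nn.ReLU(),
                          nn.Linear(128, 784), nn.Sigmoid())

noise = torch.randn(16, 100)            # a batch of random inputs
fake_images = generator(noise)          # (16, 784): one generated image each
realness = discriminator(fake_images)   # (16, 1): the generator wants these near 1
```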
---

class: center, middle

# Lab 4

---

# Lab 4

* Let's build a GAN to generate more MNIST-like images
* For now, focus on only one digit (pick one: your birthday, the last digit of your student ID (carné), your favorite number, ...)
* As mentioned above, we need two neural networks: a discriminator and a generator

---

class: mmedium

# Lab 4: Discriminator

* Start with your network from lab 3, maybe add more layers and neurons
* We only need one output to distinguish between fake (0) and real (1)
* Use `torch.nn.BCELoss` (Binary Cross-Entropy Loss), which expects the output from the network and the expected output (0 or 1) **in the same shape**
* Take the images from the MNIST data set for one digit as real data
* Generate completely random images as initial fake data
* Train the network to distinguish the two

---

class: mmedium

# Lab 4: Generator

* For the Generator, we want a very different network
* It takes 100 inputs and produces 784 outputs (one for each pixel)
* Sample these 100 inputs randomly from a normal distribution (`torch.randn`)
* The loss function is a bit more complicated: First pass the output through the (trained) discriminator network, then pass the output from that to `BCELoss`, where we **want our outputs to be 1** (i.e. the generator wants the discriminator to say that the images are real), then do a `backward` call and an optimizer step as normal (see the sketch after the references)
* Your generator optimizer will only have the generator parameters as its target and therefore will not change the discriminator!

---

# Lab 4: Training

* In a loop, alternate between training the generator and the discriminator
* Keep an "episode memory" of fake images, to which you add newly generated fakes in every iteration
* Train the generator and discriminator each for a couple of iterations
* Put both into a larger loop
* The lab description has skeleton code for this training loop setup

---

class: medium

# Lab 4: Summary

* You need two Neural Networks, two optimizers, two training functions
* There is one overall loop that contains two sub-parts:
  - Train the discriminator
  - Train the generator
* Store images produced by the generator as fake images to train the discriminator
* Make sure to preserve some old images!

---

# References

* [Torch Tensor Operations Overview](https://jhui.github.io/2018/02/09/PyTorch-Basic-operations/)
* [RNN Introduction](https://medium.com/explore-artificial-intelligence/an-introduction-to-recurrent-neural-networks-72c97bf0912)
* [GAN Introduction](https://machinelearningmastery.com/how-to-develop-a-generative-adversarial-network-for-an-mnist-handwritten-digits-from-scratch-in-keras/)
* [GAN hacks](https://github.com/soumith/ganhacks)
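---

class: medium

# Lab 4: Loss Setup (Sketch)

A minimal sketch of the loss setup described in the lab, assuming flattened 784-pixel images, illustrative layer sizes, and a stand-in batch of real data; the skeleton code in the lab description takes precedence:

```python
import torch
import torch.nn as nn

# Illustrative networks: 784-pixel images, 100-dimensional noise input
discriminator = nn.Sequential(nn.Linear(784, 128), nn.LeakyReLU(0.2),
                              nn.Linear(128, 1), nn.Sigmoid())
generator = nn.Sequential(nn.Linear(100, 128), nn.ReLU(),
                          nn.Linear(128, 784), nn.Sigmoid())

d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)  # generator params only
loss_fn = nn.BCELoss()

real_images = torch.rand(32, 784)   # stand-in for a batch of real MNIST digits

# Discriminator step: real images should score 1, fake images should score 0
d_opt.zero_grad()
fakes = generator(torch.randn(32, 100)).detach()   # don't update the generator here
d_loss = loss_fn(discriminator(real_images), torch.ones(32, 1)) + \
         loss_fn(discriminator(fakes), torch.zeros(32, 1))
d_loss.backward()
d_opt.step()

# Generator step: we *want* the discriminator to output 1 for our fakes
g_opt.zero_grad()
g_loss = loss_fn(discriminator(generator(torch.randn(32, 100))), torch.ones(32, 1))
g_loss.backward()
g_opt.step()    # only updates the generator's parameters
```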