class: center, middle # AI in Digital Entertainment ### Neural Networks --- class: center, middle # Machine Learning --- # Machine Learning * Supervised Learning: ML algorithm gets inputs and outputs and has to learn the mapping (in some sense) * Unsupervised Learning: ML algorithm gets inputs, and has to find out something "interesting" about them * Reinforcement Learning: ML algorithm can perform actions that give it a reward, and has to select actions to maximize its reward --- # Supervised Learning * Classification: The outputs are labels/classes of objects * Regression: The outputs are numeric values (function approximation) * Preference learning: The outputs are ranks --- class: center, middle # Neural Networks --- class: medium # Neural Networks * Neural Networks are **not** "how the brain works" * Their structure was "inspired" by the basic building blocks of the brain: Neurons * Artificial Neural Networks (ANNs) also consist of units, called neurons, that take numeric inputs and produce a numeric output * The true power of ANNs is that they define non-linear functions that can approximate a wide variety of different functions and are differentiable --- class: medium # ANN Structure * An ANN has n inputs and m outputs * Each input and output is represented by a floating point number * For regression tasks the output represents the predicted number(s) * For classification tasks, there is typically one output per class, and the classification is the class with the highest output value * For preference learning tasks, the output defines the predicted ranking --- class: medium # Neurons * Neurons are the building blocks that ANNs are composed of * Each Neuron has k inputs, a bias and an output * Each input is associated with a weight (the bias can be viewed as an extra input that is always 1, and also has a weight associated with it) * A Neuron performs the operation $$ h \left( \sum_i w_i x_i \right) $$ --- # Neurons
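A minimal sketch of this operation in Python/NumPy (the function and variable names and the example values are my own, not from the slides); the bias is folded in as an extra input that is always 1, exactly as described above:

```python
import numpy as np

def neuron(x, w, h):
    """One artificial neuron: apply the activation h to the weighted sum of the inputs."""
    x = np.append(x, 1.0)              # bias handled as an extra input that is always 1
    return h(np.dot(w, x))

# Example: 3 inputs, so 4 weights (the last one is the bias weight)
x = np.array([0.2, -0.5, 1.0])
w = np.array([0.4, 0.1, -0.3, 0.05])
print(neuron(x, w, h=lambda s: s))     # identity activation, just for illustration
```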
--- # Activation Functions * What is h? The *activation function* of the neuron * Typical choices are linear functions or sigmoids
*(Figure: graph of the logistic curve, rising from 0 toward 1 as the input goes from −6 to 6, with value 0.5 at 0)*
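As a small illustration (the function name is mine), the logistic curve above is easy to compute directly; note how it squashes any real input into the range (0, 1):

```python
import numpy as np

def sigmoid(s):
    """Logistic activation: squashes any real input into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-s))

# Values matching the curve above: close to 0 at -6, exactly 0.5 at 0, close to 1 at 6
print(sigmoid(np.array([-6.0, 0.0, 6.0])))   # approx. [0.0025 0.5 0.9975]
```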
--- class: small # Feed-Forward Neural Networks * We can now assemble multiple neurons into a network * The inputs to the network are fed to one or more neurons * The output from one or more neurons is fed as input to another, until we use a neuron's output as the network output * For now we restrict ourselves to networks that: - Are sequentially organized in layers - On each layer, each neuron gets the output from *all* neurons on the previous layer as inputs - The first layer (*input layer*) takes the network input as input - The last layer (*output layer*) defines the outputs of the network - The other layers are called "hidden layers" --- # Feed-Forward Neural Network
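A minimal forward-pass sketch of such a layered network, assuming sigmoid activations and a single hidden layer (the matrix shapes and names are my own choices):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def forward(x, W, v):
    """Forward pass: every hidden neuron sees all inputs, the output neuron sees all hidden outputs."""
    s = sigmoid(W @ x)      # hidden layer: one row of W per hidden neuron
    y = sigmoid(v @ s)      # output layer
    return y

x = np.array([0.5, -1.0, 2.0])       # 3 network inputs
W = np.random.randn(4, 3) * 0.5      # 4 hidden neurons, fully connected to the inputs
v = np.random.randn(4) * 0.5         # 1 output neuron, fully connected to the hidden layer
print(forward(x, W, v))
```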
--- class: medium # Neural Network Function Now we can define what the Neural Network actually computes: $$ s_j = h \left( \sum_i w^j_i x_i \right) = h(\vec{w}^j \cdot \vec{x})\\\\ y = h \left( \sum_i v_i s_i \right) = h(\vec{v} \cdot h(\vec{w} \cdot \vec{x})) $$ --- class: medium # Training a Neural Network * To train a neural network we need training data (examples to learn from) * We also need to define what we actually want from our neural network * A "low error" would be nice * So we need an error function $$ E(v,w) = \frac{1}{2n} \sum_i \left(y(x_i) - \hat{y}_i\right)^2 $$ --- # The Training Process * We take one of our training examples `\((x_i,\hat{y}_i)\)` and feed `\(x_i\)` as the input to the network * We calculate the difference between the output and `\(\hat{y}_i\)` * Then we change v and w depending on the error * How? By calculating the gradient --- class: small # The Gradient of the Error Function We can get the gradient of the error function through judicious application of the chain rule
$$ E(v,w) = \frac{1}{2n} \sum_i \left(h(\vec{v} \cdot h(\vec{w} \cdot \vec{x}_i)) - \hat{y}_i\right)^2\\\\ \frac{\partial}{\partial v} E(v,w) = \frac{1}{n} \sum_i \left(h(\vec{v} \cdot h(\vec{w} \cdot \vec{x}_i)) - \hat{y}_i\right) \cdot \frac{\partial}{\partial v} h(\vec{v} \cdot h(\vec{w} \cdot \vec{x}_i))\\\\ $$ $$ \frac{\partial}{\partial v} h(\vec{v} \cdot h(\vec{w} \cdot \vec{x}_i)) = h'(\vec{v} \cdot h(\vec{w} \cdot \vec{x}_i)) \cdot h(\vec{w} \cdot \vec{x}_i) $$
(And analogously for w) --- class: medium # Backpropagation * To train, we iterate over our training examples * For each example, we calculate the output (forward propagation) * Then we use the gradient of the error function to change the weights (back propagation)
$$ v_{\text{new}} = v - \alpha \cdot \frac{\partial}{\partial v} E(v,w) $$
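Putting the forward pass, the error function, and this update rule together, here is a toy training sketch. It assumes sigmoid activations and uses XOR as the dataset; the layer sizes, learning rate, and names are my own choices, not a reference implementation:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

# Toy dataset: XOR, with the bias folded in as an always-1 third input
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
Y = np.array([0.0, 1.0, 1.0, 0.0])

rng = np.random.default_rng(0)
W = rng.normal(scale=0.5, size=(4, 3))   # hidden layer: 4 neurons, 3 inputs
v = rng.normal(scale=0.5, size=5)        # output neuron: 4 hidden values + 1 bias
alpha = 0.5                              # learning rate

for epoch in range(10000):
    for x, y_hat in zip(X, Y):
        # Forward propagation
        s = sigmoid(W @ x)
        s_b = np.append(s, 1.0)          # bias input for the output neuron
        y = sigmoid(v @ s_b)

        # Backpropagation: gradients of E = 1/2 (y - y_hat)^2 via the chain rule
        delta_out = (y - y_hat) * y * (1 - y)
        grad_v = delta_out * s_b
        delta_hid = delta_out * v[:4] * s * (1 - s)
        grad_W = np.outer(delta_hid, x)

        # Gradient descent step: move *against* the gradient
        v -= alpha * grad_v
        W -= alpha * grad_W

for x, y_hat in zip(X, Y):
    y = sigmoid(v @ np.append(sigmoid(W @ x), 1.0))
    print(x[:2], "->", round(float(y), 2), "(target", y_hat, ")")
```

This sketch usually learns XOR after a few thousand passes over the data, though convergence is not guaranteed from every random initialization.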
* We can also accumulate multiple training examples at a time (a "batch") -- "Backprop is the cockroach of machine learning. It's ugly, and annoying, but you just can't get rid of it." - Geoffrey Hinton --- class: medium # Overfitting
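The slide above refers to overfitting; as a hedged stand-in for a figure, the effect can be shown with a simple polynomial fit instead of an ANN (the dataset and names are my own): a high-capacity model can fit the training points almost perfectly yet typically does worse on held-out data.

```python
import numpy as np

rng = np.random.default_rng(1)

def noisy_samples(n):
    x = rng.uniform(-1, 1, n)
    return x, np.sin(3 * x) + rng.normal(scale=0.2, size=n)

x_train, y_train = noisy_samples(12)     # few training examples
x_test, y_test = noisy_samples(200)      # held-out test set

for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train error {train_err:.3f}, test error {test_err:.3f}")
```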
--- class: medium # Fighting Overfitting * The error rate on the training data is *not* a good measure for the quality of the learned function * Instead, a separate *test set* is used to measure how well the ANN generalizes * One indicator of overfitting is when the weights have very high values (indicating that certain parts of the network are highly specialized to certain cases) * Adding the magnitude of the weights to the error can help mitigate overfitting (regularization) --- class: medium # Universal Approximation Theorem Given: * A feed-forward Neural Network, * with **one** hidden layer * an "arbitrary" function f It has been shown that: * With enough neurons in the hidden layer it is possible to choose weights such that the neural network approximates f with an arbitrarily small error ### Neural Networks are universal function approximators! --- class: medium # Universal Approximation Theorem Not addressed: * How can you actually learn the weights? (backpropagation may find local minima) * How many neurons yield "good" results? * How do you achieve generalizability? --- # Tensorflow Playground [A Neural Network Playground](https://playground.tensorflow.org) --- class: small # An AI Koan In the days when Sussman was a novice, Minsky once came to him as he sat hacking at the PDP-6. "What are you doing?", asked Minsky. "I am training a randomly wired neural net to play Tic-Tac-Toe", Sussman replied. "Why is the net wired randomly?", asked Minsky. "I do not want it to have any preconceptions of how to play", Sussman said. Minsky then shut his eyes. "Why do you close your eyes?", Sussman asked his teacher. "So that the room will be empty." At that moment, Sussman was enlightened. --- class: center, middle # Neural Network Architectures --- class: medium # Deep Networks * The Universal Approximation Theorem states that an ANN with *one* hidden layer can approximate any function * But we could also add more layers! * Why? Learning features! * Viewed another way: If the ANN has to produce output for an input of dimensionality `n`, but the data has to pass through a layer with `m` neurons, where `m` is much smaller than `n`, the ANN has to encode enough information about the input into these `m` numbers/features to produce the desired output --- class: medium # Auto-Encoders * One application of this approach is the auto-encoder * They are neural networks with many layers that become narrower and narrower before widening again * The number of inputs is the same as the number of outputs, and the training examples use the *same* values for input and output * The goal is to learn a smaller *representation* for the input data * In essence, the ANN has to reconstruct the input from fewer values --- class: medium # Auto-Encoders
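A structural sketch of an auto-encoder's forward pass, assuming sigmoid activations and an 8 → 3 → 8 bottleneck (the sizes and names are my own; the weights here are random and untrained):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

rng = np.random.default_rng(2)

# Narrowing, then widening again: 8 inputs -> 3 "code" values -> 8 outputs
W_enc = rng.normal(scale=0.5, size=(3, 8))   # encoder weights
W_dec = rng.normal(scale=0.5, size=(8, 3))   # decoder weights

def autoencode(x):
    code = sigmoid(W_enc @ x)                # compressed representation
    reconstruction = sigmoid(W_dec @ code)   # attempt to rebuild the input
    return code, reconstruction

x = rng.uniform(size=8)
code, x_rec = autoencode(x)
print("input         :", np.round(x, 2))
print("3-value code  :", np.round(code, 2))
print("reconstruction:", np.round(x_rec, 2))
```

Training would proceed with backpropagation exactly as before, except that the target output is the input itself, which forces the network to squeeze the input through the 3-value bottleneck.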
--- class: small # Convolutional Neural Networks * In traditional ANNs, all neurons in a layer are connected to all neurons in the previous layer * For very high dimensional data (e.g. images) this results in a very large number of weights that need to be learned * In many practical applications, there is some sort of locality present in the data * For example, in image recognition: Pixels that are close together are more likely to be correlated than pixels from opposite corners of the image * Idea: Instead of connecting the network fully, define a "receptive field" for each neuron that defines how much of the previous layer it "perceives" --- class: medium # Generative Adversarial Networks (GANs) * Challenge: How can a computer generate art? * Idea: Let's use two Neural Networks - One generator network that learns to generate replicas of art - One discriminator network that learns to *detect* replicas * The generator and the discriminator are playing a zero-sum game, where each is trying to become better than the other * If successful, the generator will become very good at faking data --- # GAN training * Train the discriminator on real data for a while * Generate fake images with the generator and train the discriminator on them (to learn fakes) * Train the generator by using the results from the discriminator as the error function (try minimizing detected fakes) * Alternate between training the discriminator and the generator until the images have the desired quality (a toy sketch of this loop follows below) --- # GAN results
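A deliberately tiny sketch of the alternating GAN training loop described above, on 1D data rather than images (the linear generator, the logistic discriminator, the hand-derived gradients, and all names are my own simplifications, not how the image GANs referenced here are built):

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def real_batch(n):
    # "Real" data: samples from a 1D Gaussian the generator should imitate
    return rng.normal(loc=4.0, scale=1.5, size=n)

# Generator G(z) = a*z + c, discriminator D(x) = sigmoid(w*x + b)
a, c = 1.0, 0.0
w, b = 0.1, 0.0
lr, batch = 0.05, 64

for step in range(2000):
    # 1) Train the discriminator on real and fake samples
    x_real = real_batch(batch)
    x_fake = a * rng.normal(size=batch) + c
    p_real = sigmoid(w * x_real + b)
    p_fake = sigmoid(w * x_fake + b)
    # Gradients of the loss -log D(real) - log(1 - D(fake))
    w -= lr * (np.mean((p_real - 1) * x_real) + np.mean(p_fake * x_fake))
    b -= lr * (np.mean(p_real - 1) + np.mean(p_fake))

    # 2) Train the generator so the discriminator mistakes fakes for real data
    z = rng.normal(size=batch)
    p_fake = sigmoid(w * (a * z + c) + b)
    # Gradients of -log D(G(z)), backpropagated through the discriminator
    a -= lr * np.mean((p_fake - 1) * w * z)
    c -= lr * np.mean((p_fake - 1) * w)

# With z ~ N(0, 1), the generator now produces samples from N(c, |a|)
print("generator parameters: a = %.2f, c = %.2f" % (a, c))
```

In runs of this toy, the generator's mean tends to drift toward the real mean while the spread is matched much less reliably, which hints at why GAN training is finicky in practice.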
--- # GAN problems
--- class: small # Long Short-Term Memory Networks * So far we have discussed networks that are only connected in one direction * What if there are loops? * Loops are basically "memory": If a neuron sends something to a previous layer, that information can then be recalled in the next iteration * Memory is useful for many kinds of data, especially sequential data * The new memory can depend on the "old" memory, allowing information to be stored across multiple iterations * Backpropagation has to take this memory into account, using an approach called "Backpropagation through time" --- # Exploding and Vanishing Gradients * If you backpropagate through many time steps you essentially multiply many numbers together * If the numbers (gradients) are small, the product becomes basically 0 * If the numbers are large, the product becomes basically infinite * We need some mechanism to control how long information is kept, instead of carrying memory along one time step at a time --- # LSTMs * Let's use cells with built-in memory instead of simple neurons * The memory can be caused to "forget" information after some time * It can also store information for longer times * Mathematically, the gradient may stay unchanged (not multiplied!) when backpropagating through time --- # LSTMs
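A sketch of a single LSTM cell step in NumPy using the standard gating equations (biases omitted, weights random and untrained; the names are mine, and real implementations come from libraries rather than hand-rolled code):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def lstm_step(x, h, c, Wf, Wi, Wo, Wg):
    """One LSTM time step: update the cell memory c and the output h."""
    xh = np.concatenate([x, h])      # current input plus previous output
    f = sigmoid(Wf @ xh)             # forget gate: which memory to discard
    i = sigmoid(Wi @ xh)             # input gate: which new information to store
    o = sigmoid(Wo @ xh)             # output gate: which memory to reveal
    g = np.tanh(Wg @ xh)             # candidate new memory content
    c_new = f * c + i * g            # memory is updated additively ...
    h_new = o * np.tanh(c_new)       # ... which is what lets gradients survive many steps
    return h_new, c_new

rng = np.random.default_rng(4)
n_in, n_mem = 3, 5
Wf, Wi, Wo, Wg = (rng.normal(scale=0.5, size=(n_mem, n_in + n_mem)) for _ in range(4))
h, c = np.zeros(n_mem), np.zeros(n_mem)
for x in rng.normal(size=(6, n_in)):     # a short input sequence
    h, c = lstm_step(x, h, c, Wf, Wi, Wo, Wg)
print("final memory:", np.round(c, 2))
```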
--- class: center, middle # Neural Networks in Games --- class: medium # Neural Networks in Games * We have already heard of some applications of Neural Networks for games: - AlphaGo used Neural Networks to evaluate states and moves - The MCTS Hanabi agent used Neural Networks to actually play the game during the competition * As computational resources improve, more and more games make use of Neural Networks --- # Academic Uses * AlphaStar used a deep LSTM to play StarCraft * OpenAI's DotA 2 AI agents use five separate but coordinated Neural Networks to play DotA 2 * GANs can be used to generate content ([DOOM levels](https://arxiv.org/pdf/1804.09154.pdf), [Super Mario levels](https://arxiv.org/abs/1805.00728)) * Next week's paper! --- class: medium # Forza Motorsport's Drivatars * Forza is a long-running series of racing games that has always used some sort of Neural Network for its AI players * Forza 5 introduced "Drivatars" to allow players to play against AI "imitations" of their friends * Basically, the game records the actions you perform in a race and trains a neural network with them * When someone races against your Drivatar, the Neural Network is used to predict what actions you would have taken --- class: small # Race for the Galaxy * Race for the Galaxy is a digital adaptation of a card game * The AI agents in the game use a neural network to decide on their actions * The initial "training data" was obtained by having the agent play against itself thousands of times, and assigning "blame" to each turn/action for the final score * The AI agents are *really* good ... To make them more fun to play with for newer players, random noise is added to their inputs * The different Neural Networks (for different player counts) make up a quarter of the download size of the game --- # Next Week * [Mystical Tutor: A Magic: The Gathering Design Assistant via Denoising Sequence-to-Sequence Learning](https://www.aaai.org/ocs/index.php/AIIDE/AIIDE16/paper/viewFile/13980/13599) --- # References * [Generative Adversarial Nets - Fresh Machine Learning #2](https://www.youtube.com/watch?v=deyOX6Mt_As) * [5 New Generative Adversarial Network (GAN) Architectures For Image Synthesis](https://www.topbots.com/ai-research-generative-adversarial-network-images/) * [Introduction to LSTM networks](https://skymind.ai/wiki/lstm) * ['Race for the Galaxy': A Neural Network in Production](https://www.gdcvault.com/play/1025504/-Race-for-the-Galaxy)