class: center, middle

# Artificial Intelligence
## Advanced Neural Networks

---

# Artificial Neural Networks

.left-column[
]

.right-column[
We introduced one or more "hidden layers", which will hold intermediate values h

$$ \vec{h} = f_1(W_1 \cdot \vec{x})\\\\
y = f_2(\vec{w_2} \cdot \vec{h})\\\\
y = f_2(\vec{w_2} \cdot f_1(W_1 \cdot \vec{x})) $$
]

---

# The Curse of Dimensionality

* Imagine we have 20 data points with a number of features
* If we have only one feature, all our data points are on a single line
* If we have two features (= dimensions), our data points are spread out on a grid
* If we have three features, they are spread out in a cube
* Etc.

---

# The Curse of Dimensionality
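To make this concrete, here is a small numpy sketch (an illustration added here, not part of the original figure): the same 20 points become more and more isolated as we add feature dimensions.

```python
# Illustrative sketch: 20 uniform random points in d dimensions.
# The average distance to the nearest neighbour grows with d,
# i.e. the same amount of data becomes sparser and sparser.
import numpy as np

rng = np.random.default_rng(0)

for d in (1, 2, 3, 10, 100):
    points = rng.random((20, d))                     # 20 points in the unit hypercube
    diffs = points[:, None, :] - points[None, :, :]  # pairwise differences
    dists = np.sqrt((diffs ** 2).sum(axis=-1))       # pairwise distances
    np.fill_diagonal(dists, np.inf)                  # ignore the distance to itself
    print(d, "dims:", round(dists.min(axis=1).mean(), 2))
```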
---

# The Curse of Dimensionality

* On one hand, this is "good", because our data may be easier to distinguish with more dimensions
* But on the other hand, if we have "too many" dimensions, we may not be able to learn what each one "means"
* The more dimensions we have, the more data we need to learn something meaningful!

---

class: medium

# Deep Networks

* The Universal Approximation Theorem states that an ANN with *one* hidden layer can approximate any continuous function
* But we could also add more layers!
* Why? Learning features!
* Viewed another way: If the ANN has to produce output for an input of dimensionality `n`, but the data has to pass through a layer with `m` neurons, where `m` is much smaller than `n`, the ANN has to encode enough information about the input into these `m` numbers/features to produce the desired output

---

# Deep Networks
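As a small illustration (a sketch assuming TensorFlow/Keras is available; the layer sizes are arbitrary example values): an input of dimensionality `n` that has to pass through a layer of only `m` neurons.

```python
# A minimal Keras sketch of a "bottleneck" layer: everything the network
# knows about the 784 inputs must be squeezed into 16 numbers.
from tensorflow import keras
from tensorflow.keras import layers

n = 784   # input dimensionality
m = 16    # m is much smaller than n

model = keras.Sequential([
    keras.Input(shape=(n,)),
    layers.Dense(128, activation="relu"),
    layers.Dense(m, activation="relu"),     # the bottleneck layer
    layers.Dense(10, activation="softmax"), # e.g. a 10-class output
])
model.summary()
```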
---

# Deep Networks

* When we train a network, we pass the input through the network, calculate the loss and the gradient, and change the weights
* The more neurons we have, the more weights we have
* Training only really works well if we have "enough" data to train *all* weights
* In other words: More weights = we need more data

---

class: small

# Convolutional Neural Networks

* In traditional ANNs, all neurons in a layer are connected to all neurons in the previous layer
* For very high dimensional data (e.g. images) this results in a very large number of weights that need to be learned
* In many practical applications, there is some sort of locality present in the data
* For example, in image recognition: Pixels that are close together are more likely to be correlated than pixels from opposite corners of the image
* Idea: Instead of connecting the network fully, define a "receptive field" for each neuron that defines how much of the previous layer it "perceives"

---

# Convolutional Neural Networks
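A minimal numpy sketch of the receptive-field idea (added for illustration, not from the original figure): each output value only depends on a 3x3 patch of the input, and all positions share the same 3x3 weights.

```python
# Each output "neuron" sees only a 3x3 receptive field of the input image,
# and all of them share the same 3x3 kernel of weights.
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((28, 28))   # a toy grayscale image
kernel = rng.random((3, 3))    # the shared weights of one filter

out = np.zeros((26, 26))       # 28 - 3 + 1 = 26 valid positions per axis
for i in range(26):
    for j in range(26):
        patch = image[i:i + 3, j:j + 3]     # the receptive field at (i, j)
        out[i, j] = np.sum(patch * kernel)  # one neuron's weighted sum

print(out.shape)  # (26, 26), using only 9 weights instead of 784 per neuron
```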
---

class: center, middle

# Encoding Data

---

class: medium

# Encodings

* Neural Networks are nonlinear functions
* "Function" means that the output only depends on the input
* Each layer is also a function!
* This means that if we know the input of a particular layer, we also know the output of the network!
* The input of a layer is the output of the previous layer

---

# Encodings
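A minimal sketch of this idea, assuming TensorFlow/Keras (the 784-dimensional input and the layer sizes are arbitrary example values): because each layer is a function, we can read off an intermediate layer's output as an encoding of the input.

```python
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np

inputs = keras.Input(shape=(784,))
h1 = layers.Dense(128, activation="relu")(inputs)
h2 = layers.Dense(32, activation="relu")(h1)   # we treat this layer's output as an encoding
outputs = layers.Dense(10, activation="softmax")(h2)

network = keras.Model(inputs, outputs)  # the full network
encoder = keras.Model(inputs, h2)       # same layers and weights, but stops at h2

x = np.random.random((1, 784)).astype("float32")
print(encoder.predict(x).shape)  # (1, 32): 32 numbers that determine the network's output
```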
---

class: medium

# Auto-Encoders

* One application of this idea is the Auto-encoder
* They are neural networks with several layers that become narrower and narrower (fewer neurons) before they widen again
* The number of inputs is the same as the number of outputs, and the training examples use the *same* values for input and output
* The goal is to learn a smaller *representation* for the input data
* In essence, the ANN has to reconstruct the input from fewer values

---

class: medium

# Auto-Encoders
---

class: medium

# Auto-Encoders
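A minimal auto-encoder sketch, assuming TensorFlow/Keras; the sizes (784 inputs, a 32-value representation) and the random placeholder data are arbitrary choices for the example.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

autoencoder = keras.Sequential([
    keras.Input(shape=(784,)),
    layers.Dense(128, activation="relu"),
    layers.Dense(32, activation="relu"),     # the narrow representation
    layers.Dense(128, activation="relu"),
    layers.Dense(784, activation="sigmoid"), # as many outputs as inputs
])
autoencoder.compile(optimizer="adam", loss="mse")

# The training target is the input itself: reconstruct x from 32 values.
x_train = np.random.random((1000, 784)).astype("float32")  # placeholder data
autoencoder.fit(x_train, x_train, epochs=1, batch_size=64, verbose=0)
```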
---

class: center, middle

# Vector Models

---

# I Dreamed a Dream

* Let's say we want to work with words
* Imagine we could turn a word into a vector
* And then we can take vector differences to get the "relation" between words
* For example: (Vienna - Austria) + Costa Rica = San Jose

--

* Turns out, we can actually do that!

---

# Vector Space Models

* As before, we will have a smaller representation of our input
* Our vector elements don't have any clear interpretation; they are just numbers representing a word
* The vectors are *learned* from a corpus
* We can then compare vectors using algebra, for example calculate the cosine (dot product!) between two vectors to get their similarity

---

# The Distributional Hypothesis

* The reason this vector trick works is the Distributional Hypothesis
* It says that words that are used in the same **context** are semantically similar, too
* However, the actual relation between this hypothesis and the vectors has been called "very hand-wavy" by some authors

---

class: medium

# Context

* For each word, we use `k` words before and after as the "context"
* For example: "Machine Learning is a fascinating subject" with context 1 becomes:
  - [(machine, is), learning]
  - [(learning, a), is]
  - [(is, fascinating), a]
  - [(a, subject), fascinating]
* What do we do with these? Learn the word from the context!

---

class: medium

# Word2Vec

* We use a neural network to learn the relationship between words and their context
* Because this relationship has to be calculated by the neural network, the values of one of the hidden layers have to encode it
* We can use these values as a vector representation for a word!
* Two approaches:
  - Context is the input, word is the output (Continuous Bag-of-Words)
  - Word is the input, context is the output (skip-gram)

---

class: medium

# Word2Vec
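A minimal sketch of this, assuming the gensim library (version 4.x; older versions use `size=` instead of `vector_size=`). The toy corpus is far too small to give meaningful vectors; it only shows the workflow.

```python
from gensim.models import Word2Vec

corpus = [
    ["machine", "learning", "is", "a", "fascinating", "subject"],
    ["deep", "learning", "is", "a", "part", "of", "machine", "learning"],
]

model = Word2Vec(
    corpus,
    vector_size=50,  # dimensionality of the word vectors
    window=1,        # context size k
    min_count=1,
    sg=1,            # 1 = skip-gram, 0 = continuous bag-of-words
)

print(model.wv["learning"].shape)                  # the learned vector for a word
print(model.wv.similarity("machine", "learning"))  # cosine similarity of two vectors
```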
---

# Word2Vec Results
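One way to reproduce this kind of result yourself (a sketch assuming gensim's downloader and the pretrained `glove-wiki-gigaword-100` vectors, which need a one-time download): the classic king/man/woman analogy, the same kind of relation as the capital-city example from a few slides ago.

```python
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # pretrained word vectors

# vector("king") - vector("man") + vector("woman") should be closest to "queen"
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```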
---

class: medium

# Doc2Vec

* It may seem tricky to go from learning a representation for single words to a representation for entire documents
* But it is actually not!
* Here is the trick: For each word that we learn, we provide the document ID *in addition* to the context
* This means that the network will have to learn some vector representation for the document while training as well
* Note: The "documents" are sometimes also called "paragraphs" in the literature

---

# Distributed Memory version of Paragraph Vector
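A minimal sketch assuming gensim 4.x (the tiny documents are placeholders); `dm=1` selects this Distributed Memory variant, `dm=0` the Distributed Bag of Words variant on the next slide.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

documents = [
    TaggedDocument(words=["machine", "learning", "is", "fascinating"], tags=["doc0"]),
    TaggedDocument(words=["neural", "networks", "learn", "representations"], tags=["doc1"]),
]

model = Doc2Vec(documents, vector_size=50, window=2, min_count=1, epochs=40, dm=1)

print(model.dv["doc0"].shape)                    # the learned document vector
print(model.infer_vector(["deep", "learning"]))  # a vector for an unseen document
```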
---

# Distributed Bag of Words version of Paragraph Vector
---

class: center, middle

# Sequential Data

---

# Recurrent Neural Networks

* So far we have looked at Neural Networks with a static number of inputs
* However, often we have variable-length input, for example if we collect time series data (like cards played in Hearthstone)
* One approach to this is to feed the network one time step at a time and give it "memory"
* We can conceptualize this memory as a "hidden variable", or **hidden state**

---

class: medium

# Recurrent Neural Networks

.left-column[
]

.right-column[
* The hidden state is initialized to some values (e.g. zeros)
* Then the first input element/step is passed to the network, and it produces output **and** a new hidden state
* This new hidden state is passed to the network together with the next input element/step
]

---

# Recurrent Neural Networks: Unfolding
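A minimal numpy sketch of the unfolded computation (illustrative, not from the original figure): the same weights are reused at every time step, and only the hidden state changes.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, output_size = 4, 8, 2

W_xh = rng.standard_normal((hidden_size, input_size)) * 0.1   # input -> hidden
W_hh = rng.standard_normal((hidden_size, hidden_size)) * 0.1  # hidden -> hidden
W_hy = rng.standard_normal((output_size, hidden_size)) * 0.1  # hidden -> output

sequence = rng.standard_normal((5, input_size))  # five time steps of toy input
h = np.zeros(hidden_size)                        # hidden state initialized to zeros

for t, x in enumerate(sequence):
    h = np.tanh(W_xh @ x + W_hh @ h)  # new hidden state from input and old state
    y = W_hy @ h                      # output at this time step
    print("step", t, "output:", y)
```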
---

# Recurrent Neural Networks: Modes
---

# References

* [Convolutional Neural Networks](https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53)
* [Auto-Encoder: What Is It? And What Is It Used For?](https://towardsdatascience.com/auto-encoder-what-is-it-and-what-is-it-used-for-part-1-3e5c6f017726)
* [Word Vector Models](https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa?gi=5fc95c310fbe)
* [Document Vector Models](https://medium.com/wisio/a-gentle-introduction-to-doc2vec-db3e8c0cce5e)
* [RNN Introduction](https://medium.com/explore-artificial-intelligence/an-introduction-to-recurrent-neural-networks-72c97bf0912)