class: center, middle

# Machine Learning

## Neural Networks 3

### III-Verano 2019

---

# Artificial Neural Networks
What is an (Artificial) Neural Network?
---

# Artificial Neural Networks

.left-column[
]

.right-column[
We introduced one or more "hidden layers", which will hold the intermediate values h:

$$ \vec{h} = f_1(W_1 \cdot \vec{x})\\\\ y = f_2(\vec{w_2} \cdot \vec{h})\\\\ y = f_2(\vec{w_2} \cdot f_1(W_1 \cdot \vec{x})) $$
]

---

class: medium

# Deep Networks

* The Universal Approximation Theorem states that an ANN with a *single* hidden layer can approximate any continuous function arbitrarily well (given enough neurons)
* But we could also add more layers!
* Why? Learning features!
* Viewed another way: If the ANN has to produce output for an input of dimensionality `n`, but the data has to pass through a layer with `m` neurons, where `m` is much smaller than `n`, the ANN has to encode enough information about the input into these `m` numbers/features to produce the desired output

---

class: medium

# Auto-Encoders

* One application of this approach is the Auto-Encoder
* Auto-Encoders are neural networks with many layers that become progressively narrower before widening again
* The number of inputs is the same as the number of outputs, and the training examples use the *same* values for input and output
* The goal is to learn a smaller *representation* for the input data
* In essence, the ANN has to reconstruct the input from fewer values

---

class: medium

# Auto-Encoders
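---

class: small

# Auto-Encoder: Code Sketch

A minimal sketch of an auto-encoder in Keras (assuming TensorFlow 2.x is available; the random stand-in data, the layer sizes, and the 8-dimensional bottleneck are illustrative choices, not fixed by the method). The network is trained to reconstruct its own input, so the bottleneck values become the learned representation.

```python
import numpy as np
import tensorflow as tf

# Stand-in data: 1000 samples with 64 features each
X = np.random.rand(1000, 64).astype("float32")

inputs = tf.keras.Input(shape=(64,))
h = tf.keras.layers.Dense(32, activation="relu")(inputs)      # narrowing...
code = tf.keras.layers.Dense(8, activation="relu")(h)         # the bottleneck representation
h = tf.keras.layers.Dense(32, activation="relu")(code)        # ...and widening again
outputs = tf.keras.layers.Dense(64, activation="sigmoid")(h)  # same size as the input

autoencoder = tf.keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=10, batch_size=32, verbose=0)    # note: input == target

# A second model that maps each input to its 8-value representation
encoder = tf.keras.Model(inputs, code)
codes = encoder.predict(X)
```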
---

class: small

# Convolutional Neural Networks

* In traditional ANNs, all neurons in a layer are connected to all neurons in the previous layer
* For very high-dimensional data (e.g. images), this results in a very large number of weights that need to be learned
* In many practical applications, there is some sort of locality present in the data
* For example, in image recognition: Pixels that are close together are more likely to be correlated than pixels from opposite corners of the image
* Idea: Instead of connecting the network fully, define a "receptive field" for each neuron that defines how much of the previous layer it "perceives"

---

class: center, middle

# Document Processing

---

# Documents

* For our purposes, a *document* is a text of arbitrary length
* Not all of our documents need to have the same length!
* For example: A data set of emails, or stories
* We want to determine "something interesting" from the data

---

# Examples

* Emails: Find spam emails, group emails by topic
* Tweets: Group tweets by mood/opinion
* Stories: Find recurring themes, group similar stories
* Movie dialog: Group by character affiliation

---

# Challenges

* Text can have a lot of nuance
* While you may have "a lot" of data, any individual pattern may only show up once or twice
* And if you actually have enough data to cover all cases, you may not be able to process it in reasonable time

---

class: mmedium

# Approaches

* Text is messy, and of variable length
* Numbers are easier and faster to reason about
* So we want to encode our documents in some numerical way
* Basically, we will get a vector (of floating point numbers) that represents each document
* It would be nice if numerical operations on these vectors did "something useful"
* For example, the distance between vectors should represent the similarity of the corresponding documents

---

class: center, middle

# Topic Models

---

class: medium

# Documents and Topics

* A document consists of sentences, which consist of words
* We assume all documents are in the same language
* We don't know how long each document is
* We also don't know the actual topics a priori
* We may have to estimate/guess *how many* different topics there could be

---

# The Bag-of-Words (BoW) Model

* We can look at our document as a "bag of words"
* This "bag" does not have grammar
* Basically, for each word, we count how often it shows up in the document
* We may want to only count the "important" words, to get a fixed-length vector

---

class: medium

# Preprocessing

* In this simple model, "annoy", "annoyed", and "annoying" are all different words
* But since we ignore grammar, we might as well just count them together
* The process of reducing a word to its "root" (or stem) is called stemming
* While we are at it, we might also want to run a spellchecker and fix misspellings
* Other preprocessing steps: remove punctuation, convert to lowercase, ignore common words that don't carry much information, like "a", "to", "the"

---

# Bag-of-Words Vector
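---

class: small

# Bag-of-Words: Code Sketch

A minimal sketch of building BoW count vectors with scikit-learn (assuming a recent scikit-learn is installed; the two example documents are made up). `CountVectorizer` already lowercases, strips punctuation, and can drop English stop words; stemming would need an extra preprocessing step (e.g. with NLTK).

```python
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "The cake eating contest was not annoying.",
    "Not eating the cake annoyed everyone.",
]

# Lowercase, strip punctuation, drop common English stop words, then count words
vectorizer = CountVectorizer(lowercase=True, stop_words="english")
bow = vectorizer.fit_transform(documents)   # sparse matrix: documents x vocabulary

print(vectorizer.get_feature_names_out())   # the vocabulary (one column per word)
print(bow.toarray())                        # the count vector for each document
```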
---

class: medium

# Word Importance

* So far, our vectors basically just consisted of word counts
* However, not all words are equally important to a document
* For example, if only 5 of 1000 emails mention "russia", that may be a more important term than "hack", which is mentioned in 700 of the emails
* Idea: Instead of just counting, put the count in relation to how often the word shows up overall

---

class: medium

# Term-Frequency Inverse-Document-Frequency

* Term-Frequency (TF): How often the word shows up in the current document
* Document-Frequency (DF): In how many of *all* documents the word shows up
* Term-Frequency, **Inverse**-Document-Frequency (TF-IDF): Weight the term frequency by the *inverse* of the document frequency, typically multiplying TF by the logarithm of the total number of documents divided by DF
* Gives a measure of "how unique" a word is to a document

---

# Topic Analysis

* Documents with similar BoW vectors are likely about similar topics
* We may also want to put a higher weight on lower-frequency words (higher TF-IDF)
* A common approach is Latent Dirichlet Allocation (LDA), where each document is described as a mixture of `k` topics
* We will talk about some clustering algorithms later in the semester

---

# Limitations of BoW

* Each document is just a count/frequency of words
* Order/grammar is lost
* But context often changes meanings (compare "(not eating the) cake" and "cake (eating contest)")
* While the frequency vectors of similar documents have some overlap, vector operations on them are somewhat nonsensical

---

class: center, middle

# Vector Models

---

# I Dreamed a Dream

* Let's start with words
* Imagine we could turn a word into a vector
* And then we could take vector differences to get the "relation" between words
* For example: (Vienna - Austria) + Costa Rica = San Jose

--

* Turns out, we can actually do that!

---

# Vector Space Models

* As before, we will have a representation in the form of vectors
* But now our vector elements don't have any clear interpretation anymore, they are just numbers
* The vectors are *learned* from a corpus
* We can then compare vectors using algebra, for example calculate the cosine (a normalized dot product!) between two vectors to get their similarity

---

# The Distributional Hypothesis

* The reason this vector trick works is the Distributional Hypothesis
* It says that words that are used in the same **context** are semantically similar, too
* However, the actual relation between this hypothesis and the vectors has been called "very hand-wavy" by some authors

---

class: medium

# Context

* For each word, we use the `k` words before and after it as its "context"
* For example: "Artificial Intelligence is a fascinating subject" with context 1 becomes:
  - [(artificial, is), intelligence]
  - [(intelligence, a), is]
  - [(is, fascinating), a]
  - [(a, subject), fascinating]
* What do we do with these? Learn the word from the context!

---

class: medium

# Word2Vec

* We use a neural network to learn the relationship between words and their context
* Because this relationship has to be computed by the neural network, the values of one of the hidden layers have to encode it
* We can use these values as a vector representation for a word!
* Two approaches:
  - Context is the input, word is the output (Continuous Bag-of-Words)
  - Word is the input, context is the output (skip-gram)

---

class: medium

# Word2Vec
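---

class: small

# Word2Vec: Training Sketch

A minimal training sketch with gensim (assuming gensim 4.x; the toy corpus and parameter values are illustrative). `sg=1` selects skip-gram, `sg=0` the Continuous Bag-of-Words variant; `window` is the context size `k` from before.

```python
from gensim.models import Word2Vec

# Toy corpus: each document is a list of (preprocessed) tokens
sentences = [
    ["artificial", "intelligence", "is", "a", "fascinating", "subject"],
    ["machine", "learning", "is", "a", "subfield", "of", "artificial", "intelligence"],
]

model = Word2Vec(
    sentences,
    vector_size=100,  # dimensionality of the learned word vectors
    window=2,         # context size k (words before and after)
    min_count=1,      # keep even rare words (only sensible for a toy corpus)
    sg=1,             # 1 = skip-gram, 0 = CBOW
)

vec = model.wv["intelligence"]  # the learned 100-dimensional vector for one word
```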
---

# Word2Vec Results
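---

class: small

# Word2Vec: Using the Vectors

Once vectors are trained (or pretrained ones are loaded), similarity and analogy queries are plain vector algebra. A sketch using gensim's downloader API (an assumption: the optional `gensim-data` download is available, and pretrained GloVe vectors stand in for our own model). Multi-word names like "Costa Rica" would need extra phrase handling, so the analogy below uses single-token names.

```python
import gensim.downloader as api

# Downloads pretrained 50-dimensional GloVe vectors on first use
wv = api.load("glove-wiki-gigaword-50")

print(wv.most_similar("wheat", topn=5))   # nearest words by cosine similarity
print(wv.similarity("corn", "barley"))    # cosine similarity of two specific words

# Analogy via vector arithmetic: vienna - austria + france ≈ ?
# (gensim adds the "positive" vectors and subtracts the "negative" ones)
print(wv.most_similar(positive=["vienna", "france"], negative=["austria"], topn=1))
```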
---

class: medium

# Doc2Vec

* It may seem tricky to go from learning a representation for single words to a representation for entire documents
* But it is actually not!
* Here is the trick: For each word that we learn, we provide the document ID *in addition* to the context
* This means that the network has to learn a vector representation for each document during training as well
* Note: The "documents" are sometimes also called "paragraphs" in the literature

---

# Distributed Memory version of Paragraph Vector
---

# Distributed Bag of Words version of Paragraph Vector
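---

class: small

# Doc2Vec: Code Sketch

A minimal sketch with gensim (assuming gensim 4.x; the corpus and parameters are illustrative). Each document gets a tag (here simply its index), and the model learns one vector per tag alongside the word vectors.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [
    ["the", "cake", "eating", "contest", "was", "fun"],
    ["the", "election", "results", "were", "surprising"],
]

# Attach a document ID (tag) to every token list
documents = [TaggedDocument(words=doc, tags=[i]) for i, doc in enumerate(corpus)]

model = Doc2Vec(documents, vector_size=50, window=2, min_count=1, epochs=40)

doc_vec = model.dv[0]                                    # learned vector for document 0
new_vec = model.infer_vector(["cake", "eating", "fun"])  # vector for an unseen document
similar = model.dv.most_similar([new_vec], topn=1)       # most similar training document
```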
---

class: center, middle

# Applications

---

class: mmedium

# Useful Properties

* Vector operations can be used to find analogies
  - Answer the query: Corn is to Tortilla like Flour is to ?
  - Calculate the vector from the embedding of `corn` to the embedding of `tortilla` and add it to the embedding of `flour`
  - Convert the result back to a word by using it as input to the corresponding hidden layer of the ANN
* Vector distance can be used to find similarities:
  - `corn`, `wheat`, `barley` all have less distance from each other than from `house` or `airport`
  - Distance measure: Usually cosine distance (the angle between the vectors)
  - For document vectors, similar documents have a lower distance (e.g. all emails from foreign royalty offering lucrative opportunities have a low distance from each other)
* The word model can predict words from context

---

class: medium

# What can this be used for?

* Narrative Generation: Find analogues
* Machine Translation: Training on known translations *and* other text in both languages can yield *unknown* translations
* Speech Recognition: Predict words that were hard to hear/understand
* Movie/book/product recommendations based on similarity of descriptions
* Spotify uses songs in a playlist as "words", and uses the vectors to find new songs the user might like
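---

class: small

# Similarity in Practice: Code Sketch

The distance measure behind most of these applications is just the cosine of the angle between two vectors; gensim's `most_similar` computes exactly this internally. A small self-contained sketch (the three vectors are made-up stand-ins for learned word or document embeddings):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between a and b: 1 = same direction, 0 = orthogonal."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Stand-in embeddings (in reality these come from Word2Vec/Doc2Vec)
corn  = np.array([0.9, 0.1, 0.3])
wheat = np.array([0.8, 0.2, 0.4])
house = np.array([0.1, 0.9, 0.7])

print(cosine_similarity(corn, wheat))  # relatively high: similar words
print(cosine_similarity(corn, house))  # lower: unrelated words

# Analogy by vector arithmetic: corn -> tortilla, so flour -> ?
# compute tortilla - corn + flour, then look up the nearest embedding to the result
```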
---

# Circular Mayan Narrative Art Pieces

[*The Shape of Story: A Semiotic Artistic Visualization of a Communal Storytelling Experience*](https://www.aaai.org/ocs/index.php/AIIDE/AIIDE17/paper/viewPDFInterstitial/15876/15243) by Long et al.

---

# Other Data

* What else could we convert to vectors?
* The sky is the limit!
* Sounds, to find [similar sounds for poems](https://aaai.org/ocs/index.php/AIIDE/AIIDE17/paper/view/15879/15227)
* Maybe story or game characters?
* Levels?

---

class: medium

# GPT-2

* OpenAI trained a language model on 40GB of internet text
* It is **very** good at "predicting" a word from context
* What do we mean by "good"? It will predict a word that is close to what a human might have written
* But note: The resulting text is not simply copied from the training set; the "prediction" is actually the generation of new text
* This can be used to generate text: Write the start of a sentence, and ask the model to predict the next word, then the next, etc.

---

class: small

# GPT-2 Example

*In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.*

The scientist named the population, after their distinctive horn, Ovid’s Unicorn. These four-horned, silver-white unicorns were previously unknown to science.

Now, after almost two centuries, the mystery of what sparked this odd phenomenon is finally solved.

Dr. Jorge Pérez, an evolutionary biologist from the University of La Paz, and several companions, were exploring the Andes Mountains when they found a small valley, with no other animals or humans. Pérez noticed that the valley had what appeared to be a natural fountain, surrounded by two peaks of rock and silver snow.

---

class: medium

# GPT-2: Weaknesses

* The text is mostly topic-consistent, but not always sensical
* For example: "four-horned unicorns", or "fire under water"
* It can also be repetitive
* Note how it says that the mystery is "finally solved", but never actually explains how
* The good texts are hand-picked (the authors report that it produces reasonable text in about 50% of the cases for popular topics)

---

# GPT-2: Risks

* Automated text generation can also be used for bad purposes: Generating fake news articles, Facebook posts, scam emails, or even impersonating someone else's writing style
* Therefore, OpenAI had initially decided **not** to release the model publicly
* [Better Language Models and Their Implications](https://openai.com/blog/better-language-models/)
* By now, they have released the full model, though

---

# References

* [A Gentle Introduction to the Bag-of-Words Model](https://machinelearningmastery.com/gentle-introduction-bag-words-model/)
* [How does LDA work](https://medium.com/@lettier/how-does-lda-work-ill-explain-using-emoji-108abf40fa7d)
* [Word2Vec](https://skymind.ai/wiki/word2vec)
* [A Gentle Introduction to Doc2Vec](https://medium.com/scaleabout/a-gentle-introduction-to-doc2vec-db3e8c0cce5e)
* [GenSim (Python Library for Topic Modeling)](https://radimrehurek.com/gensim/)