For our purposes, a document is a text of arbitrary length
Not all of our documents need to have the same length!
For example: A data set of emails, or stories
We want to determine "something interesting" from the data
Emails: Find spam emails, group emails by topic
Tweets: Group tweets by mood/opinion
Stories: Find recurring themes, group similar stories
Movie dialog: Group by character affiliation
Text can have a lot of nuance
While you may have "a lot" of data, any individual pattern may only show up once or twice
And if you actually have enough data to cover all cases, you may not be able to process it in reasonable time
Text is messy, and of variable length
Numbers are easier and faster to reason about
So we want to encode our documents in some numerical way
Basically, we will get a vector (of floating-point numbers) that represents each document
It would be nice if numerical operations on these vectors would do "something useful"
For example, the distance between vectors should represent the similarity of the corresponding documents
A document consists of sentences, which consist of words
We assume all documents are in the same language
We don't know how long each document is
We also don't know the actual topics a priori
We may have to estimate/guess how many different topics there could be
We can look at our document as a "bag of words"
This "bag" does not have grammar
Basically, for each word, we count how often it shows up in the document
We may want to only count the "important" words, to get a fixed-length vector
In this simple model, "annoy", "annoyed", and "annoying" are all different words
But since we ignore grammar, we might as well just count them together
The process to get the "root" (or stem) of a word is called stemming
While we are at it, we might also want to run a spellchecker and fix misspellings
Other preprocessing steps: remove punctuation, convert to lowercase, and ignore common words that don't carry much information, like "a", "to", and "the"
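As a minimal sketch, the counting and preprocessing steps might look as follows (the tiny stopword list is illustrative, and NLTK's PorterStemmer is just one possible stemmer):

```python
import string
from collections import Counter
from nltk.stem.porter import PorterStemmer  # assumes NLTK is installed

STOPWORDS = {"a", "an", "to", "the", "is", "was", "of", "and"}
stemmer = PorterStemmer()

def bag_of_words(document):
    # lowercase, strip punctuation, drop stopwords, stem what remains
    text = document.lower().translate(str.maketrans("", "", string.punctuation))
    return Counter(stemmer.stem(w) for w in text.split() if w not in STOPWORDS)

print(bag_of_words("The cake was annoying; the annoyed baker ate the cake."))
# Counter({'cake': 2, 'annoy': 2, 'baker': 1, 'ate': 1})
```

Note how "annoying" and "annoyed" are counted together under the stem "annoy"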
So far, our vectors basically just consisted of word counts
However, not all words are equally important to a document
For example, if only 5 of 1,000 emails mention "russia", this may be a more important term than "hack", which is mentioned in 700 of the emails
Idea: Instead of just counting, put the count in relation to how often the word shows up across all documents
Term-Frequency (TF): How often does the word show up in the current document
Document-Frequency (DF): How often does the word show up over all documents
Term-Frequency, Inverse-Document-Frequency (TF-IDF): Weight the term frequency by the inverse document frequency, commonly TF × log(N / DF), where N is the total number of documents
Gives a measure of "how unique" a word is to a document
Documents with similar BoW-vectors are likely about similar topics
We may also want to put a higher weight on lower-frequency words (higher TF-IDF)
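A small pure-Python sketch of this weighting, using the common TF × log(N/DF) variant (libraries like scikit-learn's TfidfVectorizer add further smoothing and normalization):

```python
import math
from collections import Counter

docs = [Counter(text.split()) for text in [
    "russia hack email",
    "hack the email server",
    "buy cheap pills now",
]]

def tf_idf(term, doc, docs):
    tf = doc[term] / sum(doc.values())      # frequency in the current document
    df = sum(1 for d in docs if term in d)  # number of documents containing the term
    return tf * math.log(len(docs) / df) if df else 0.0

print(tf_idf("russia", docs[0], docs))  # rare term: higher weight (~0.37)
print(tf_idf("hack", docs[0], docs))    # common term: lower weight (~0.14)
```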
A common approach is Latent Dirichlet Allocation (LDA), where each document is described as a mixture of k topics
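As a hedged sketch, LDA is available in scikit-learn, for example (the corpus and the choice of k = 2 are made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["the hack exposed the email server",
        "russia email hack investigation",
        "buy cheap pills online now",
        "cheap pills shipped online"]

counts = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)  # k = 2 topics
mixtures = lda.fit_transform(counts)  # one row per document: its topic mixture
print(mixtures.round(2))
```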
We will talk about more concrete clustering algorithms in 2 weeks
Each document is just a count/frequency of words
Order/grammar is lost
But context often changes meanings (compare "(not eating the) cake" and "cake (eating contest)")
While the frequency vectors of similar documents have some overlap, vector operations on them are somewhat nonsensical
Let's start with words
Imagine we could turn a word into a vector
And then we can take vector differences to get the "relation" between words
For example: Vienna - Austria + Costa Rica = San Jose
Turns out, we can actually do that!
As before, we will have a representation in the form of vectors
But now our vector elements don't have any clear interpretation anymore; they are just numbers
The vectors are learned from a corpus
We can then compare vectors using algebra, for example by calculating the cosine between two vectors to get their similarity
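For example, a minimal cosine-similarity helper with numpy (the toy vectors are made up):

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| |b|); 1 means same direction, 0 means orthogonal
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v_cat, v_dog, v_car = np.array([1.0, 2.0]), np.array([1.5, 1.8]), np.array([-2.0, 0.5])
print(cosine_similarity(v_cat, v_dog))  # close to 1: similar
print(cosine_similarity(v_cat, v_car))  # negative: dissimilar
```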
The reason this vector trick works is the Distributional Hypothesis
It says that words that are used in the same context are semantically similar, too
However, the actual relation between this hypothesis and the vectors has been called "very hand-wavy" by some authors
For each word, we use the k words before and after as the "context"
For example: "Artificial Intelligence is a fascinating subject" with context 1 becomes:
What do we do with these? Learn the word from the context!
We use a neural network to learn the relationship between words and their context
Because this relationship has to be calculated by the neural network, the values of one of the hidden layers have to encode it
We can use these values as a vector representation for a word!
Two approaches: continuous bag-of-words (CBOW), which predicts a word from its context, and skip-gram, which predicts the context from a word
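A hedged sketch of training such vectors with gensim's Word2Vec (the library choice and the toy corpus are assumptions; real training needs far more text):

```python
from gensim.models import Word2Vec

sentences = [["artificial", "intelligence", "is", "fascinating"],
             ["machine", "learning", "is", "fascinating"]]
# sg=0 selects CBOW (predict the word from its context),
# sg=1 selects skip-gram (predict the context from the word)
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)
print(model.wv["intelligence"])  # the learned 50-dimensional vector
```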
It may seem tricky to go from learning a representation for single words, to a representation for entire documents
But it is actually not!
Here is the trick: For each word that we learn, we provide the document ID in addition to the context
This means that the network will have to learn some vector representation for the document while training as well
Note: The "documents" are sometimes also called "paragraphs" in the literature
Vector operations can be used to find analogies: Take the difference from the embedding of corn to the embedding of tortilla, and add it to the vector embedding of flour
Vector distance can be used to find similarities: corn, wheat, and barley all have less distance from each other than from house or airport
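Both operations are one-liners with pretrained vectors; a sketch using gensim's downloader (the "glove-wiki-gigaword-50" vectors are one readily available choice, not the only one):

```python
import gensim.downloader

vectors = gensim.downloader.load("glove-wiki-gigaword-50")  # pretrained KeyedVectors

# Analogy: tortilla - corn + flour ~= ?
print(vectors.most_similar(positive=["tortilla", "flour"], negative=["corn"], topn=1))

# Similarity: related grains score higher than unrelated words
print(vectors.similarity("corn", "wheat"))    # relatively high
print(vectors.similarity("corn", "airport"))  # much lower
```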
Word model can predict words from context
Narrative Generation: Find analogues
Machine Translation: Training on known translations, plus other text in both languages, can yield unknown translations
Speech Recognition: Predict words that were hard to hear/understand
Movie/book/product recommendations based on similarity of descriptions
Spotify uses songs in a playlist as "words", and uses the vectors to find new songs the user might like
The Shape of Story: A Semiotic Artistic Visualization of a Communal Storytelling Experience by Long et al.
What else could we convert to vectors?
The sky is the limit!
Sounds, to find similar sounds for poems
Maybe story or game characters?
Levels?
OpenAI trained a language model on 40GB of internet text
It is very good at "predicting" a word from context
What do we mean by "good"? It will predict a word that is close to what a human might have written
But note: These words are not actually part of the training set; the "prediction" is really a generation of new text
This can be used to generate text: Write the start of a sentence, and ask it to predict the next word, then the next, etc.
In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.
The scientist named the population, after their distinctive horn, Ovid’s Unicorn. These four-horned, silver-white unicorns were previously unknown to science.
Now, after almost two centuries, the mystery of what sparked this odd phenomenon is finally solved.
Dr. Jorge Pérez, an evolutionary biologist from the University of La Paz, and several companions, were exploring the Andes Mountains when they found a small valley, with no other animals or humans. Pérez noticed that the valley had what appeared to be a natural fountain, surrounded by two peaks of rock and silver snow.
The text is mostly topic-consistent, but not always sensical
For example: "four-horned unicorns", or "fire under water"
It can also be repetitive
Note how it says that the phenomenon is "finally solved", but doesn't actually explain how
The good texts are hand-picked (the authors report that it produces reasonable text in 50% of the cases for popular topics)
Automated text generation can also be used for bad purposes: Generating fake news articles, Facebook posts, scam emails, or even impersonating others' writing styles
Therefore, OpenAI has decided not to release the model publicly
You can download a smaller, worse version, though
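As a hedged sketch, the smaller released models can be sampled with the Hugging Face transformers library (an assumption; the slides don't prescribe a library):

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")  # the small public model
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = tokenizer("In a shocking finding, scientist discovered", return_tensors="pt")
# generate() repeatedly predicts the next token and appends it to the text
output = model.generate(**prompt, max_length=50, do_sample=True)
print(tokenizer.decode(output[0]))
```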
I also believe that headlines like this are not helpful for anyone
Do you think GPT-2 should have been released?
What are your concerns?
How should AI research handle potentially dangerous technologies?
dAIrector: Automatic Story Beat Generation through Knowledge Synthesis
How to use vector models in improv theater!
Due 3/6, AoE
Send me code via email or add me (yawgmoth) to a github repository
Write a short readme explaining what already works and what is missing
Goal: Convince me that you'll be able to finish by the end of the semester
This would also be a good time to tell me if you have to cut/change something