For our purposes, a document is a text of arbitrary length
Not all of our documents need to have the same length!
For example: A data set of emails, or stories
We want to determine "something interesting" from the data
Emails: Find spam emails, group emails by topic
Tweets: Group tweets by mood/opinion
Stories: Find recurring themes, group similar stories
Movie dialog: Group by character affiliation
Text can have a lot of nuance
While you may have "a lot" of data, any individual pattern may only show up once or twice
And if you actually have enough data to cover all cases, you may not be able to process it in reasonable time
Text is messy, and of variable length
Numbers are easier and faster to reason about
So we want to encode our documents in some numerical way
Basically, we will get a vector (of floating-point numbers) that represents each document
It would be nice if numerical operations on these vectors would do "something useful"
For example, the distance between vectors should represent the similarity of the corresponding documents
A document consists of sentences, which consist of words
We assume all documents are in the same language
We don't know how long each document is
We also don't know the actual topics a priori
We may have to estimate/guess how many different topics there could be
We can look at our document as a "bag of words"
This "bag" does not have grammar
Basically, for each word, we count how often it shows up in the document
We may want to only count the "important" words, to get a fixed-length vector
In this simple model, "annoy", "annoyed", and "annoying" are all different words
But since we ignore grammar, we might as well just count them together
The process to get the "root" (or stem) of a word is called stemming
While we are at it, we might also want to run a spellchecker and fix misspellings
Other preprocessing steps: remove punctuation, convert to lowercase, and ignore common words that don't carry much information, like "a", "to", and "the"
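As a minimal sketch, the counting and preprocessing steps might look as follows (the tiny stopword list is illustrative, and NLTK's PorterStemmer is just one possible stemmer):

```python
import string
from collections import Counter
from nltk.stem.porter import PorterStemmer  # assumes NLTK is installed

STOPWORDS = {"a", "an", "to", "the", "is", "was", "of", "and"}
stemmer = PorterStemmer()

def bag_of_words(document):
    # lowercase, strip punctuation, drop stopwords, stem what remains
    text = document.lower().translate(str.maketrans("", "", string.punctuation))
    return Counter(stemmer.stem(w) for w in text.split() if w not in STOPWORDS)

print(bag_of_words("The cake was annoying; the annoyed baker ate the cake."))
# Counter({'cake': 2, 'annoy': 2, 'baker': 1, 'ate': 1})
```

Note how "annoying" and "annoyed" are counted together under the stem "annoy"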
So far, our vectors basically just consisted of word counts
However, not all words are equally important to a document
For example, if only 5 of 1,000 emails mention "russia", this may be a more important term than "hack", which is mentioned in 700 of the emails
Idea: Instead of just counting, put the count in relation to how often the word shows up across all documents
Term-Frequency (TF): How often does the word show up in the current document
Document-Frequency (DF): How often does the word show up over all documents
Term-Frequency, Inverse-Document-Frequency (TF-IDF): Weight the term frequency by the inverse document frequency, commonly TF × log(N / DF), where N is the total number of documents
Gives a measure of "how unique" a word is to a document
Documents with similar BoW-vectors are likely about similar topics
We may also want to put a higher weight on lower-frequency words (higher TF-IDF)
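A small pure-Python sketch of this weighting, using the common TF × log(N/DF) variant (libraries like scikit-learn's TfidfVectorizer add further smoothing and normalization):

```python
import math
from collections import Counter

docs = [Counter(text.split()) for text in [
    "russia hack email",
    "hack the email server",
    "buy cheap pills now",
]]

def tf_idf(term, doc, docs):
    tf = doc[term] / sum(doc.values())      # frequency in the current document
    df = sum(1 for d in docs if term in d)  # number of documents containing the term
    return tf * math.log(len(docs) / df) if df else 0.0

print(tf_idf("russia", docs[0], docs))  # rare term: higher weight (~0.37)
print(tf_idf("hack", docs[0], docs))    # common term: lower weight (~0.14)
```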
A common approach is Latent Dirichlet Allocation (LDA), where each document is described as a mixture of k topics
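As a hedged sketch, LDA is available in scikit-learn, for example (the corpus and the choice of k = 2 are made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["the hack exposed the email server",
        "russia email hack investigation",
        "buy cheap pills online now",
        "cheap pills shipped online"]

counts = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)  # k = 2 topics
mixtures = lda.fit_transform(counts)  # one row per document: its topic mixture
print(mixtures.round(2))
```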
We will talk about more concrete clustering algorithms in 2 weeks
Each document is just a count/frequency of words
Order/grammar is lost
But context often changes meanings (compare "(not eating the) cake" and "cake (eating contest)")
While the frequency vectors of similar documents have some overlap, vector operations on them are somewhat nonsensical
Let's start with words
Imagine we could turn a word into a vector
And then we can take vector differences to get the "relation" between words
For example: Vienna - Austria + Costa Rica = San Jose
Turns out, we can actually do that!
As before, we will have a representation in the form of vectors
But now our vector elements don't have any clear interpretation anymore; they are just numbers
The vectors are learned from a corpus
We can then compare vectors using algebra, for example by calculating the cosine between two vectors to get their similarity
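For example, a minimal cosine-similarity helper with numpy (the toy vectors are made up):

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| |b|); 1 means same direction, 0 means orthogonal
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v_cat, v_dog, v_car = np.array([1.0, 2.0]), np.array([1.5, 1.8]), np.array([-2.0, 0.5])
print(cosine_similarity(v_cat, v_dog))  # close to 1: similar
print(cosine_similarity(v_cat, v_car))  # negative: dissimilar
```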
The reason this vector trick works is the Distributional Hypothesis
It says that words that are used in the same context are semantically similar, too
However, the actual relation between this hypothesis and the vectors has been called "very hand-wavy" by some authors
For each word, we use the k words before and after as the "context"
For example: "Artificial Intelligence is a fascinating subject" with context 1 becomes:
What do we do with these? Learn the word from the context!
We use a neural network to learn the relationship between words and their context
Because this relationship has to be calculated by the neural network, the values of one of the hidden layers have to encode it
We can use these values as a vector representation for a word!
Two approaches: continuous bag-of-words (CBOW), which predicts a word from its context, and skip-gram, which predicts the context from a word
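A hedged sketch of training such vectors with gensim's Word2Vec (the library choice and the toy corpus are assumptions; real training needs far more text):

```python
from gensim.models import Word2Vec

sentences = [["artificial", "intelligence", "is", "fascinating"],
             ["machine", "learning", "is", "fascinating"]]
# sg=0 selects CBOW (predict the word from its context),
# sg=1 selects skip-gram (predict the context from the word)
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)
print(model.wv["intelligence"])  # the learned 50-dimensional vector
```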
It may seem tricky to go from learning a representation for single words, to a representation for entire documents
But it is actually not!
Here is the trick: For each word that we learn, we provide the document ID in addition to the context
This means that the network will have to learn some vector representation for the document while training as well
Note: The "documents" are sometimes also called "paragraphs" in the literature
Vector operations can be used to find analogies: Take the difference from the embedding of corn to the embedding of tortilla, and add it to the vector embedding of flour
Vector distance can be used to find similarities: corn, wheat, and barley all have less distance from each other than from house or airport
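Both operations are one-liners with pretrained vectors; a sketch using gensim's downloader (the "glove-wiki-gigaword-50" vectors are one readily available choice, not the only one):

```python
import gensim.downloader

vectors = gensim.downloader.load("glove-wiki-gigaword-50")  # pretrained KeyedVectors

# Analogy: tortilla - corn + flour ~= ?
print(vectors.most_similar(positive=["tortilla", "flour"], negative=["corn"], topn=1))

# Similarity: related grains score higher than unrelated words
print(vectors.similarity("corn", "wheat"))    # relatively high
print(vectors.similarity("corn", "airport"))  # much lower
```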
Word model can predict words from context
Narrative Generation: Find analogues
Machine Translation: Training on known translations, plus other text in both languages, can yield unknown translations
Speech Recognition: Predict words that were hard to hear/understand
Movie/book/product recommendations based on similarity of descriptions
Spotify uses songs in a playlist as "words", and uses the vectors to find new songs the user might like
The Shape of Story: A Semiotic Artistic Visualization of a Communal Storytelling Experience by Long et al.
What else could we convert to vectors?
The sky is the limit!
Sounds, to find similar sounds for poems
Maybe story or game characters?
Levels?
OpenAI trained a language model on 40GB of internet text
It is very good at "predicting" a word from context
What do we mean by "good"? It will predict a word that is close to what a human might have written
But note: These words are not actually part of the training set; the "prediction" is really a generation of new text
This can be used to generate text: Write the start of a sentence, and ask it to predict the next word, then the next, etc.
In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.
The scientist named the population, after their distinctive horn, Ovid’s Unicorn. These four-horned, silver-white unicorns were previously unknown to science.
Now, after almost two centuries, the mystery of what sparked this odd phenomenon is finally solved.
Dr. Jorge Pérez, an evolutionary biologist from the University of La Paz, and several companions, were exploring the Andes Mountains when they found a small valley, with no other animals or humans. Pérez noticed that the valley had what appeared to be a natural fountain, surrounded by two peaks of rock and silver snow.
The text is mostly topic-consistent, but not always sensical
For example: "four-horned unicorns", or "fire under water"
It can also be repetitive
Note how it says that the phenomenon is "finally solved", but doesn't actually explain how
The good texts are hand-picked (the authors report that it produces reasonable text in 50% of the cases for popular topics)
Automated text generation can also be used for bad purposes: Generating fake news articles, Facebook posts, scam emails, or even impersonating others' writing styles
Therefore, OpenAI has decided not to release the model publicly
You can download a smaller, worse version, though
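As a hedged sketch, the smaller released models can be sampled with the Hugging Face transformers library (an assumption; the slides don't prescribe a library):

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")  # the small public model
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = tokenizer("In a shocking finding, scientist discovered", return_tensors="pt")
# generate() repeatedly predicts the next token and appends it to the text
output = model.generate(**prompt, max_length=50, do_sample=True)
print(tokenizer.decode(output[0]))
```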
I also believe that headlines like this are not helpful for anyone
Do you think GPT-2 should have been released?
What are your concerns?
How should AI research handle potentially dangerous technologies?
dAIrector: Automatic Story Beat Generation through Knowledge Synthesis
How to use vector models in improv theater!
Due 3/6, AoE
Send me code via email or add me (yawgmoth) to a github repository
Write a short readme explaining what already works and what is missing
Goal: Convince me that you'll be able to finish by the end of the semester
This would also be a good time to tell me if you have to cut/change something