class: center, middle

# Artificial Intelligence

## Advanced Neural Networks

---

# Artificial Neural Networks

.left-column[
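A minimal numpy sketch of the forward pass defined by the formulas on the right; the sizes, random weights, and the tanh/sigmoid activation choices are illustrative assumptions, not something fixed by the formulas:

```python
import numpy as np

# Illustrative sizes: 3 inputs, 4 hidden neurons, 1 output (assumptions)
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))        # hidden-layer weight matrix W1
w2 = rng.normal(size=4)             # output weight vector w2

f1 = np.tanh                            # hidden activation (illustrative choice)
f2 = lambda z: 1 / (1 + np.exp(-z))     # output activation, sigmoid (illustrative choice)

x = np.array([0.5, -1.0, 2.0])
h = f1(W1 @ x)        # hidden values: h = f1(W1 · x)
y = f2(w2 @ h)        # output:        y = f2(w2 · h)
```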
]
.right-column[
We introduced one or more "hidden layers", which will hold intermediate values h

$$ \vec{h} = f_1(W_1 \cdot \vec{x})\\\\
y = f_2(\vec{w_2} \cdot \vec{h})\\\\
y = f_2(\vec{w_2} \cdot f_1(W_1 \cdot \vec{x})) $$
]

---

# The Curse of Dimensionality

* Imagine we have 20 data points with a number of features
* If we have only one feature, all our data points are on a single line
* If we have two features (= dimensions), our data points are spread out on a grid
* If we have three features, they are spread out in a cube
* Etc.

---

# The Curse of Dimensionality
---

# The Curse of Dimensionality

* On one hand, this is "good", because our data may be easier to distinguish with more dimensions
* But on the other hand, if we have "too many" dimensions, we may not be able to learn what each one "means"
* The more dimensions we have, the more data we need to learn something meaningful!

---

class: medium

# Deep Networks

* The Universal Approximation Theorem states that an ANN with *one* hidden layer can approximate any continuous function arbitrarily well
* But we could also add more layers!
* Why? Learning features!
* Viewed another way: If the ANN has to produce output for an input of dimensionality `n`, but the data has to pass through a layer with `m` neurons, where `m` is much smaller than `n`, the ANN has to encode enough information about the input into these `m` numbers/features to produce the desired output

---

# Deep Networks
---

# Deep Networks

* When we train a network, we pass the input through the network, calculate the loss and the gradient, and change the weights
* The more neurons we have, the more weights we have
* Training only really works well if we have "enough" data to train *all* weights
* In other words: More weights = we need more data

---

class: small

# Convolutional Neural Networks

* In traditional ANNs, all neurons in a layer are connected to all neurons in the previous layer
* For very high-dimensional data (e.g. images) this results in a very large number of weights that need to be learned
* In many practical applications, there is some sort of locality present in the data
* For example, in image recognition: pixels that are close together are more likely to be correlated than pixels from opposite corners of the image
* Idea: Instead of connecting the network fully, define a "receptive field" for each neuron that determines how much of the previous layer it "perceives" (sketched in code a couple of slides ahead)

---

# Convolutional Neural Networks
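---

class: medium

# Convolutional Neural Networks in Code

A minimal numpy sketch of the "receptive field" idea: each output value only depends on a small patch of the input, and the same small set of weights is reused everywhere. The image size, kernel size, and random values are illustrative assumptions.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide a small kernel over the image; each output value only
    depends on the patch of the image under the kernel."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i:i + kh, j:j + kw]       # the receptive field
            out[i, j] = np.sum(patch * kernel)      # one small set of weights, reused everywhere
    return out

image = np.random.rand(28, 28)       # e.g. a small grayscale image (illustrative)
kernel = np.random.rand(3, 3)        # 9 weights instead of 28*28 per neuron
feature_map = conv2d(image, kernel)  # shape (26, 26)
```

In a real CNN the kernel values are learned, and a deep learning framework would provide this operation as a convolution layer.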
---

class: center, middle

# Encoding Data

---

class: medium

# Encodings

* Neural Networks are nonlinear functions
* "Function" means that the output only depends on the input
* Each layer is also a function!
* This means that if we know the input of a particular layer, we can determine the output of the network!
* The input of a layer is the output of the previous layer

---

# Encodings
---

class: medium

# Auto-Encoders
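---

class: medium

# Auto-Encoders in Code

A hedged Keras sketch of an auto-encoder: the input is forced through a much smaller bottleneck layer and then reconstructed, so the bottleneck has to encode the input. The layer sizes, activations, and placeholder data are illustrative assumptions.

```python
import numpy as np
from tensorflow import keras

inputs = keras.Input(shape=(784,))                                  # e.g. a flattened 28x28 image
encoded = keras.layers.Dense(32, activation="relu")(inputs)         # bottleneck: 784 -> 32 (illustrative size)
decoded = keras.layers.Dense(784, activation="sigmoid")(encoded)    # reconstruction: 32 -> 784

autoencoder = keras.Model(inputs, decoded)
encoder = keras.Model(inputs, encoded)            # just the "encoding" half

autoencoder.compile(optimizer="adam", loss="mse")
x = np.random.rand(1000, 784)                     # placeholder data (assumption)
autoencoder.fit(x, x, epochs=5, batch_size=64)    # note: the target is the input itself!

codes = encoder.predict(x)                        # 32-dimensional encodings of the inputs
```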
---

class: center, middle

# Vector Models

---

# I Dreamed a Dream

* Let's say we want to work with words
* Imagine we could turn a word into a vector
* And then we can take vector differences to get the "relation" between words
* For example: (Vienna - Austria) + Costa Rica = San Jose

--

* Turns out, we can actually do that!

---

# Vector Space Models

* As before, we will have a smaller representation of our input
* Our vector elements don't have any clear interpretation, they are just numbers representing a word
* The vectors are *learned* from a corpus
* We can then compare vectors using algebra, for example calculate the cosine (dot product!) between two vectors to get their similarity

---

# The Distributional Hypothesis

* The reason this vector trick works is the Distributional Hypothesis
* It says that words that are used in the same **context** are semantically similar, too
* However, the actual relation between this hypothesis and the vectors has been called "very hand-wavy" by some authors

---

class: medium

# Context

* For each word, we use `k` words before and after as the "context"
* For example: "Machine Learning is a fascinating subject" with context 1 becomes:
  - [(machine, is), learning]
  - [(learning, a), is]
  - [(is, fascinating), a]
  - [(a, subject), fascinating]
* What do we do with these? Learn the word from the context! (a code sketch for building these pairs follows in a few slides)

---

class: medium

# Word2Vec

* We use a neural network to learn the relationship between words and their context
* Because this relationship has to be calculated by the neural network, the values of one of the hidden layers have to encode it
* We can use these values as a vector representation for a word!
* Two approaches:
  - Context is the input, word is the output (Continuous Bag-of-Words)
  - Word is the input, context is the output (skip-gram)

---

class: medium

# Word2Vec
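---

class: medium

# Building Context Pairs

A small plain-Python sketch of how such (context, word) pairs could be generated; the function name `context_pairs` and the simple whitespace tokenization are assumptions for illustration.

```python
def context_pairs(sentence, k=1):
    """Return (context, word) pairs using k words before and after each word."""
    words = sentence.lower().split()
    pairs = []
    for i, word in enumerate(words):
        # the k words before and after the current word (shorter at the edges)
        context = words[max(0, i - k):i] + words[i + 1:i + 1 + k]
        pairs.append((tuple(context), word))
    return pairs

print(context_pairs("Machine Learning is a fascinating subject", k=1))
# [(('learning',), 'machine'), (('machine', 'is'), 'learning'),
#  (('learning', 'a'), 'is'), (('is', 'fascinating'), 'a'), ...]
```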
---

# Word2Vec Results
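---

class: medium

# Word2Vec in Practice

In practice the training loop is usually left to a library. A hedged sketch using gensim (parameter names assume gensim 4.x; the toy corpus is made up and far too small to give meaningful vectors):

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences (a real corpus would be much larger)
corpus = [
    ["vienna", "is", "the", "capital", "of", "austria"],
    ["san", "jose", "is", "the", "capital", "of", "costa", "rica"],
]

model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1)  # sg=1: skip-gram

vec = model.wv["vienna"]                          # the learned 50-dimensional vector
print(model.wv.similarity("vienna", "austria"))   # cosine similarity between two words

# With a large enough corpus, analogies can be queried, e.g.:
# model.wv.most_similar(positive=["vienna", "costa_rica"], negative=["austria"])
# (assuming the corpus was preprocessed so that "costa_rica" is a single token)
```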
---

class: medium

# Doc2Vec

* It may seem tricky to go from learning a representation for single words to a representation for entire documents
* But it is actually not!
* Here is the trick: For each word that we learn, we provide the document ID *in addition* to the context
* This means that the network will have to learn some vector representation for the document while training as well
* Note: The "documents" are sometimes also called "paragraphs" in the literature

---

# Distributed Memory version of Paragraph Vector
---

# Distributed Bag of Words version of Paragraph Vector
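---

class: medium

# Doc2Vec in Practice

A hedged gensim sketch of this idea: each document gets a tag (its ID) whose vector is trained alongside the word contexts. The corpus, parameters, and gensim 4.x names are assumptions for illustration.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

raw_docs = [
    "machine learning is a fascinating subject",
    "neural networks learn representations from data",
]

# Each document gets a tag (its ID) that is trained alongside the word contexts
tagged = [TaggedDocument(words=doc.split(), tags=[i]) for i, doc in enumerate(raw_docs)]

model = Doc2Vec(tagged, vector_size=50, window=2, min_count=1, epochs=40)

doc_vec = model.dv[0]                                # learned vector for document 0
new_vec = model.infer_vector(                        # vector for an unseen document
    "learning about neural networks".split())
```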
---

class: center, middle

# Sequential Data

---

# Recurrent Neural Networks

* So far we have looked at Neural Networks with a fixed number of inputs
* However, often we have variable-length input, for example if we collect time series data (like cards played in Hearthstone)
* One approach to this is to feed the network one time step at a time and give it "memory"
* We can conceptualize this memory as a "hidden variable", or **hidden state**

---

class: medium

# Recurrent Neural Networks

.left-column[
]
.right-column[
* The hidden state is initialized to some values (e.g. zeros)
* Then the first input element/step is passed to the network, and it produces output **and** a new hidden state
* This new hidden state is passed to the network together with the next input element/step (a code sketch follows on a later slide)
]

---

# Recurrent Neural Networks: Unfolding
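---

class: medium

# Recurrent Neural Networks in Code

A minimal numpy sketch of this unfolding: the same weight matrices are applied at every time step, and the hidden state carries information forward. The sizes, random weights, and tanh activation are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
W_xh = rng.normal(size=(8, 3))   # input -> hidden weights (illustrative sizes)
W_hh = rng.normal(size=(8, 8))   # hidden -> hidden weights (the "memory")
W_hy = rng.normal(size=(1, 8))   # hidden -> output weights

def rnn_step(x_t, h_prev):
    """One time step: new hidden state and output from the input and the previous hidden state."""
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev)
    y_t = W_hy @ h_t
    return h_t, y_t

sequence = [rng.normal(size=3) for _ in range(5)]  # 5 time steps, 3 features each (placeholder data)
h = np.zeros(8)                                    # hidden state initialized to zeros
for x_t in sequence:                               # "unfolding" over the sequence
    h, y = rnn_step(x_t, h)
```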
---

# Recurrent Neural Networks: Modes
---

class: center, middle

# Transformers

---

# Text Processing

* One might think that using an RNN with word vector inputs could allow us to process text
* And that works ... to a point
* The problem is: The hidden vector has to store *everything* about the sequence, and gradients have to be passed backwards through every time step
* Maybe we can find a more "direct" way?

---

# A Task: Translation
[Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/)

---

# Getting rid of the bottleneck

* Instead of just using the final output of processing the input sequence, maybe we can use **all** hidden states?
* When the decoder produces a result (word), *some* input parts may be more important than others
* So let's see if we can pay more attention to those parts?

---

# Attention
[Visualizing A Neural Machine Translation Model](https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/)

---

# Encoder
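---

class: medium

# Attention in Code

A minimal numpy sketch of (dot-product) attention over a set of hidden states: a query scores every hidden state, softmax turns the scores into weights, and the result is a weighted sum. The names and values here are illustrative assumptions, not the exact Transformer formulation.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def attention(query, hidden_states):
    """Weight all hidden states by how well they match the query."""
    scores = hidden_states @ query          # one score per hidden state (dot products)
    weights = softmax(scores)               # turn scores into a probability distribution
    return weights @ hidden_states          # weighted sum: "pay attention" to the relevant states

rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(6, 8))     # e.g. 6 encoder steps, 8-dimensional each (assumption)
query = rng.normal(size=8)                  # e.g. the current decoder state (assumption)
context = attention(query, hidden_states)   # 8-dimensional context vector
```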
---

# Transformer
---

# Beyond Words
---

class: medium

# References

* [Convolutional Neural Networks](https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53)
* [Auto-Encoder: What Is It? And What Is It Used For?](https://towardsdatascience.com/auto-encoder-what-is-it-and-what-is-it-used-for-part-1-3e5c6f017726)
* [Word Vector Models](https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa?gi=5fc95c310fbe)
* [Document Vector Models](https://medium.com/wisio/a-gentle-introduction-to-doc2vec-db3e8c0cce5e)
* [RNN Introduction](https://medium.com/explore-artificial-intelligence/an-introduction-to-recurrent-neural-networks-72c97bf0912)
* [Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/)
* [Visualizing A Neural Machine Translation Model](https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/)