class: center, middle

# Computational and statistical techniques of Machine Learning

## Advanced Neural Networks

---

# Artificial Neural Networks

.left-column[
]
.right-column[
We introduced one or more "hidden layers", which hold the intermediate values h:

$$ \vec{h} = f_1(W_1 \cdot \vec{x})\\\\
y = f_2(\vec{w_2} \cdot \vec{h})\\\\
y = f_2(\vec{w_2} \cdot f_1(W_1 \cdot \vec{x})) $$
]

---

class: medium

# Deep Networks

* The Universal Approximation Theorem states that an ANN with *one* hidden layer can approximate any continuous function to arbitrary precision (given enough neurons)
* But we could also add more layers!
* Why? Learning features!
* Viewed another way: If the ANN has to produce output for an input of dimensionality `n`, but the data has to pass through a layer with `m` neurons, where `m` is much smaller than `n`, the ANN has to encode enough information about the input into these `m` numbers/features to produce the desired output

---

class: medium

# Auto-Encoders

* One application of this approach is the Auto-encoder
* Auto-encoders are neural networks with many layers that become narrower and narrower before widening again
* The number of inputs is the same as the number of outputs, and the training examples use the *same* values for input and output
* The goal is to learn a smaller *representation* for the input data
* In essence, the ANN has to reconstruct the input from fewer values

---

class: medium

# Auto-Encoders
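Below is a minimal PyTorch sketch of this idea; the layer sizes and the 784-dimensional (MNIST-like) input are illustrative assumptions, not prescribed values:

```python
import torch
import torch.nn as nn

# The encoder squeezes 784 inputs down to 32 values; the decoder has to
# reconstruct the 784 inputs from only those 32 values.
class AutoEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(),
                                     nn.Linear(128, 32), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(),
                                     nn.Linear(128, 784), nn.Sigmoid())

    def forward(self, x):
        code = self.encoder(x)       # the smaller learned representation
        return self.decoder(code)    # the reconstruction of the input

model = AutoEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.rand(64, 784)              # stand-in for a batch of flattened images
loss = loss_fn(model(x), x)          # input and target are the *same* values
loss.backward()
optimizer.step()
```

Everything left of the 32-value bottleneck is the encoder; the decoder only ever sees those 32 numbers.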
---

class: center, middle

# Vector Models

---

# I Dreamed a Dream

* Let's say we want to work with words
* Imagine we could turn a word into a vector
* Then we could take vector differences to get the "relation" between words
* For example: (Vienna - Austria) + Costa Rica = San Jose

--

* Turns out, we can actually do that!

---

# Vector Space Models

* As before, we will have a smaller representation of our input
* Our vector elements don't have any clear interpretation, they are just numbers representing a word
* The vectors are *learned* from a corpus
* We can then compare vectors using algebra, for example calculating the cosine (dot product!) between two vectors to get their similarity

---

# The Distributional Hypothesis

* The reason this vector trick works is the Distributional Hypothesis
* It says that words that are used in the same **context** are also semantically similar
* However, the actual relation between this hypothesis and the vectors has been called "very hand-wavy" by some authors

---

class: medium

# Context

* For each word, we use `k` words before and after as the "context"
* For example: "Machine Learning is a fascinating subject" with context 1 becomes:
  - [(machine, is), learning]
  - [(learning, a), is]
  - [(is, fascinating), a]
  - [(a, subject), fascinating]
* What do we do with these? Learn the word from the context!

---

class: medium

# Word2Vec

* We use a neural network to learn the relationship between words and their context
* Because this relationship has to be calculated by the neural network, the values of one of the hidden layers have to encode it
* We can use these values as a vector representation for a word!
* Two approaches:
  - Context is the input, word is the output (Continuous Bag-of-Words)
  - Word is the input, context is the output (skip-gram)

---

class: medium

# Word2Vec
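A minimal PyTorch sketch of the skip-gram direction (word in, context word out); the vocabulary size, embedding size, and word ids are illustrative assumptions:

```python
import torch
import torch.nn as nn

vocab_size, embedding_dim = 5000, 100    # illustrative sizes

model = nn.Sequential(
    nn.Embedding(vocab_size, embedding_dim),   # hidden layer = the word vectors
    nn.Linear(embedding_dim, vocab_size),      # predict a context word
)
loss_fn = nn.CrossEntropyLoss()

word_ids = torch.tensor([12, 7])      # hypothetical ids of center words
context_ids = torch.tensor([40, 3])   # hypothetical ids of observed context words
loss = loss_fn(model(word_ids), context_ids)
loss.backward()

# After training, the rows of the embedding matrix are the word vectors:
word_vectors = model[0].weight        # shape (vocab_size, embedding_dim)
```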
---

# Word2Vec Results
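Once vectors have been learned, comparisons like the analogy above reduce to vector algebra. A sketch with tiny, made-up vectors (real embeddings come from a trained model and have hundreds of dimensions):

```python
import torch
import torch.nn.functional as F

# Hypothetical learned word vectors with made-up values, just for illustration
vectors = {
    "vienna":     torch.tensor([0.9, 0.1, 0.8]),
    "austria":    torch.tensor([0.8, 0.0, 0.1]),
    "costa_rica": torch.tensor([0.1, 0.9, 0.2]),
    "san_jose":   torch.tensor([0.2, 1.0, 0.9]),
}

# Cosine similarity: the dot product of the normalized vectors
def similarity(a, b):
    return F.cosine_similarity(vectors[a], vectors[b], dim=0).item()

# The analogy from the earlier slide, as vector arithmetic
query = vectors["vienna"] - vectors["austria"] + vectors["costa_rica"]
answer = max(vectors,
             key=lambda w: F.cosine_similarity(query, vectors[w], dim=0).item())
```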
---

class: medium

# Doc2Vec

* It may seem tricky to go from learning a representation for single words to a representation for entire documents
* But it is actually not!
* Here is the trick: For each word that we learn, we provide the document ID *in addition* to the context
* This means that the network has to learn some vector representation for the document during training as well
* Note: The "documents" are sometimes also called "paragraphs" in the literature

---

# Distributed Memory version of Paragraph Vector
---

# Distributed Bag of Words version of Paragraph Vector
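In practice, libraries provide ready-made implementations. A sketch using gensim's Doc2Vec (assuming gensim >= 4 is installed; the toy corpus, tags, and parameter values are purely illustrative):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Each training document is a list of tokens plus a document ID (tag)
corpus = [
    TaggedDocument(words=["machine", "learning", "is", "fascinating"], tags=[0]),
    TaggedDocument(words=["neural", "networks", "learn", "features"], tags=[1]),
]

# dm=1 is the Distributed Memory variant, dm=0 the Distributed Bag of Words one
model = Doc2Vec(corpus, vector_size=50, window=2, min_count=1, epochs=40, dm=1)

doc_vector = model.dv[0]                                   # learned document vector
new_vector = model.infer_vector(["learning", "features"])  # vector for unseen text
```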
---

class: center, middle

# More Neural Network Architectures

---

class: small

# Convolutional Neural Networks

* In traditional ANNs, all neurons in a layer are connected to all neurons in the previous layer
* For very high-dimensional data (e.g. images) this results in a very large number of weights that need to be learned
* In many practical applications, there is some sort of locality present in the data
* For example, in image recognition: Pixels that are close together are more likely to be correlated than pixels from opposite corners of the image
* Idea: Instead of connecting the network fully, define a "receptive field" for each neuron that determines how much of the previous layer it "perceives"

---

# Convolutional Neural Networks
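A minimal PyTorch sketch of this idea for 28x28 grayscale (MNIST-like) images; the channel counts and kernel sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            # each output neuron only "sees" a 3x3 receptive field
            nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                               # 28x28 -> 14x14
            nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                               # 14x14 -> 7x7
        )
        self.classifier = nn.Linear(16 * 7 * 7, 10)        # 10 class scores

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

out = SmallCNN()(torch.rand(4, 1, 28, 28))   # batch of 4 images -> (4, 10) scores
```

Each convolutional filter is reused across the whole image, so the number of weights depends on the kernel size and channel counts, not on the image resolution.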
---

# Recurrent Neural Networks

* So far we have looked at Neural Networks with a static number of inputs
* However, we often have variable-length input, for example if we collect time series data (like cards played in Hearthstone)
* One approach to this is to feed the network one time step at a time and give it "memory"
* We can conceptualize this memory as a "hidden variable", or **hidden state**

---

class: medium

# Recurrent Neural Networks

.left-column[
]
.right-column[
* The hidden state is initialized to some values (typically zeros)
* Then the first input element/step is passed to the network, and it produces output **and** a new hidden state
* This new hidden state is passed to the network together with the next input element/step
]

---

# Recurrent Neural Networks: Unfolding
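Unfolding over time is just a loop that carries the hidden state from one step to the next. A minimal PyTorch sketch (the input size, hidden size, and sequence length are illustrative assumptions):

```python
import torch
import torch.nn as nn

cell = nn.RNNCell(input_size=8, hidden_size=16)
readout = nn.Linear(16, 1)                 # one output value per time step

sequence = torch.rand(5, 8)                # 5 time steps, 8 features each
h = torch.zeros(1, 16)                     # hidden state starts as zeros

outputs = []
for x_t in sequence:
    h = cell(x_t.unsqueeze(0), h)          # new hidden state from input + old state
    outputs.append(readout(h))             # output at this time step
```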
---

# Recurrent Neural Networks: Modes
---

class: center, middle

# Generative Adversarial Networks

---

# Generative Adversarial Networks

* So far we have used Neural Networks to classify images, or to predict some value
* Could we **generate** things with a Neural Network?
* Crazy idea: We pass the Neural Network some random numbers and it produces a new Picasso-like painting

--

* That's exactly what we'll do!

---

# First: Classification

* To produce a Picasso-like painting, we first need to know which paintings *are* Picasso-like
* We could train a Neural Network that detects "real" Picassos (the "Discriminator")
* Input: An image
* Output: "True" Picasso, or "fake"
* So we'll need some real and fake Picassos to start with ...

---

# Art Connoisseur Network

* After some training, our network will be able to distinguish real and fake Picassos
* This means we can give this network a new painting, and it will tell us if it is real or not
* Now we can define the task for our generator more clearly: Fool the discriminator network, i.e. generate paintings that the discriminator recognizes as "real" Picassos

---

class: medium

# The Generator Network

* The Generator Network takes, as we wanted, a vector of random numbers as input, and produces a picture as output
* The **loss function** for this network then consists of passing the produced image through the discriminator and determining whether it believes the painting to be real or not
* We can then use backpropagation and gradient descent, as usual, to update the weights in our generator
* Over time, our generator will learn to fool the discriminator!

---

# Not quite enough ...

* If our discriminator were "perfect", this would already be enough
* However, to start, we needed some "fake" Picassos, which we just generated randomly
* Once the Generator produces some images, we actually have "better fakes"!
* So we can improve the Discriminator with those
* And then we need to improve the Generator again, etc.

---

# Generative Adversarial Networks

* Generative: We **generate** images
* Adversarial: The Generator and the Discriminator play a "game" against each other
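A minimal PyTorch sketch of the two players, assuming 28x28 images flattened to 784 values and a 100-dimensional random input (all layer sizes are illustrative):

```python
import torch
import torch.nn as nn

# Discriminator: image in -> estimate of "how real" it looks (0 = fake, 1 = real)
discriminator = nn.Sequential(nn.Linear(784, 128), nn.LeakyReLU(0.2),
                              nn.Linear(128, 1), nn.Sigmoid())

# Generator: random numbers in -> image out
generator = nn.Sequential(nn.Linear(100, 128), nn.ReLU(),
                          nn.Linear(128, 784), nn.Sigmoid())

noise = torch.randn(16, 100)            # a batch of random inputs
fake_images = generator(noise)          # (16, 784): one generated image each
realness = discriminator(fake_images)   # (16, 1): the generator wants these near 1
```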
---

class: center, middle

# Lab 4

---

# Lab 4

* Let's build a GAN to generate more MNIST-like images
* For now, focus on only one digit (pick one: your birthday, the last digit of your student ID (carné), your favorite number, ...)
* As mentioned above, we need two neural networks: a discriminator and a generator

---

class: mmedium

# Lab 4: Discriminator

* Start with your network from lab 3, maybe add more layers and neurons
* We only need one output to distinguish between fake (0) and real (1)
* Use `torch.nn.BCELoss` (Binary Cross-Entropy Loss), which expects the output from the network and the expected output (0 or 1) **in the same shape**
* Take the images from the MNIST data set for one digit as real data
* Generate completely random images as initial fake data
* Train the network to distinguish the two

---

class: mmedium

# Lab 4: Generator

* For the Generator, we want a very different network
* It takes 100 inputs and produces 784 outputs (one for each pixel)
* Sample these 100 inputs randomly from a normal distribution (`torch.randn`)
* The loss function is a bit more complicated: First pass the output through the (trained) discriminator network, then pass the output from that to `BCELoss`, where we **want our outputs to be 1** (i.e. the generator wants the discriminator to say that the images are real), then do a `backward` call and an optimizer step as normal (see the sketch after the references)
* Your generator optimizer will only have the generator parameters as its target and therefore will not change the discriminator!

---

# Lab 4: Training

* In a loop, alternate between training the generator and the discriminator
* Keep an "episode memory" of fake images, to which you add newly generated fakes in every iteration
* Train the generator and discriminator each for a couple of iterations
* Put both into a larger loop
* The lab description has skeleton code for this training loop setup

---

class: medium

# Lab 4: Summary

* You need two Neural Networks, two optimizers, two training functions
* There is one overall loop that contains two sub-parts:
  - Train the discriminator
  - Train the generator
* Store images produced by the generator as fake images to train the discriminator
* Make sure to preserve some old images!

---

# References

* [Torch Tensor Operations Overview](https://jhui.github.io/2018/02/09/PyTorch-Basic-operations/)
* [RNN Introduction](https://medium.com/explore-artificial-intelligence/an-introduction-to-recurrent-neural-networks-72c97bf0912)
* [GAN Introduction](https://machinelearningmastery.com/how-to-develop-a-generative-adversarial-network-for-an-mnist-handwritten-digits-from-scratch-in-keras/)
* [GAN hacks](https://github.com/soumith/ganhacks)
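---

class: medium

# Lab 4: Loss Setup (Sketch)

A minimal sketch of the loss setup described in the lab, assuming flattened 784-pixel images, illustrative layer sizes, and a stand-in batch of real data; the skeleton code in the lab description takes precedence:

```python
import torch
import torch.nn as nn

# Illustrative networks: 784-pixel images, 100-dimensional noise input
discriminator = nn.Sequential(nn.Linear(784, 128), nn.LeakyReLU(0.2),
                              nn.Linear(128, 1), nn.Sigmoid())
generator = nn.Sequential(nn.Linear(100, 128), nn.ReLU(),
                          nn.Linear(128, 784), nn.Sigmoid())

d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)  # generator params only
loss_fn = nn.BCELoss()

real_images = torch.rand(32, 784)   # stand-in for a batch of real MNIST digits

# Discriminator step: real images should score 1, fake images should score 0
d_opt.zero_grad()
fakes = generator(torch.randn(32, 100)).detach()   # don't update the generator here
d_loss = loss_fn(discriminator(real_images), torch.ones(32, 1)) + \
         loss_fn(discriminator(fakes), torch.zeros(32, 1))
d_loss.backward()
d_opt.step()

# Generator step: we *want* the discriminator to output 1 for our fakes
g_opt.zero_grad()
g_loss = loss_fn(discriminator(generator(torch.randn(32, 100))), torch.ones(32, 1))
g_loss.backward()
g_opt.step()    # only updates the generator's parameters
```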