Lab 1
- Introduction
- Report
- The SkillCraft1 Master Table Dataset
- Exploratory Data Analysis
- Ordinary Least Squares
- Fitting a Linear Model with Gradient Descent
- Automated Gradients and Optimization
- A Better Fit
- Useful Resources
Introduction
In this lab, we will use data collected from players in StarCraft II to predict the number of actions per minute of a new player. Note: This lab is essentially chapter 4.1 of “Deep Learning with PyTorch: Essential Excerpts”, and you may want to refer to it for more details if anything is unclear. The lab is structured into two parts: initial data analysis and model fitting. We suggest that you work on the data analysis during the first week of class, and perform the model fitting and tweaking during the following two weeks, following the contents of the lectures:
- First week of class: Introduction to python and statistics, apply knowledge in lab to read and interpret data
- Second week of class: Introduction to learning as model fitting, apply knowledge in lab to fit linear (and non-linear!) models
- Third week of class: Vocabulary, best practices, apply knowledge in lab to prepare the lab report
Report
You are required to document your work in a report, which you should write while you work on the lab. Include all requested images, and any other graphs you deem interesting, and describe what you observe. The lab text will prompt you for specific information at times, but you are expected to fill in other text to produce a coherent document. At the end of the lab, send an email with the names and carnés of the students in the group, as well as a zip file containing the lab report as a PDF and all code you wrote, to the two professors (markus.eger.ucr@gmail.com, marcela.alfarocordoba@ucr.ac.cr) with the subject “[PF-3115]Lab 1, carné 1, carné 2” before the start of class on 28/4. Do not include the data sets in this zip file or email.
The SkillCraft1 Master Table Dataset
The SkillCraft1 Master Table Dataset contains information collected from StarCraft 2 replay files and includes a measure of player activity. Basically, the researchers noted when the screen was moving and when it was at rest, and used that to determine when the players were focusing on something. The delay between when the screen came to a rest and when the player actually performed an action is called the Action Latency. An important measure of player skill in StarCraft 2 is a player’s actions per minute (APM). All other things being equal, a player that can perform more actions per minute has an advantage. It would be reasonable to assume that a lower latency also means a higher number of actions per minute, which we will investigate in this lab.
Your first task is to read the data set from the provided CSV file and put the data into tensors. We will need two tensors: one for the input (the column ActionLatency) and one for the variable we want to predict (the column APM). Implement a function read_csv that takes a file name and two column names, and returns two tensors, one for each of the columns. Split the data into three sets: use the first 30 entries as test set 1, then split the rest randomly into 20% test set 2 and 80% training set.
Note: The way this assignment is written, you will return two separate tensors and have to select the same random indices from both of them to get test set 2 and the training set. One way to do this is to generate a random permutation of all possible indices (using torch.randperm) and use the first 20% and the last 80% of this permutation as the indices for the ActionLatency as well as the APM tensors. As a smaller example, if you have a tensor [12, 23, 38, 42] as ActionLatency, and a tensor [123, 215, 333, 228] as APM, you can use torch.randperm(4) to generate a random permutation p of the indices from 0 to 3, for example [2,1,0,3]. Then you can use p[1:] to get [1,0,3], and you can use this directly as an index: ActionLatency[p[1:]] will return [23,12,42], and APM[p[1:]] will return [215, 123, 228].
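A minimal sketch of one way to implement this, assuming the file is a standard comma-separated CSV with a header row (the file name below is a placeholder for whatever you were given):

```python
import pandas as pd
import torch

def read_csv(filename, col_x, col_y):
    # Read the CSV with pandas and convert the two requested
    # columns into float tensors.
    df = pd.read_csv(filename)
    x = torch.tensor(df[col_x].values, dtype=torch.float)
    y = torch.tensor(df[col_y].values, dtype=torch.float)
    return x, y

x, y = read_csv("SkillCraft1_Dataset.csv", "ActionLatency", "APM")

# The first 30 entries form test set 1.
x_test1, y_test1 = x[:30], y[:30]
x_rest, y_rest = x[30:], y[30:]

# Shuffle the remaining indices once, and use the same permutation
# for both tensors so that (latency, APM) pairs stay aligned.
p = torch.randperm(x_rest.shape[0])
n_test2 = int(0.2 * x_rest.shape[0])
test_idx, train_idx = p[:n_test2], p[n_test2:]
x_test2, y_test2 = x_rest[test_idx], y_rest[test_idx]
x_train, y_train = x_rest[train_idx], y_rest[train_idx]
```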
Exploratory Data Analysis
Take test set 1 (the thirty entries) and plot it as a scatterplot with Matplotlib. Does the data look linear? Try the same with the training set. What are the maximum, minimum, and mean values of APM and ActionLatency? What is the standard deviation of each of the two variables? What is the correlation between the two variables?
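One way to compute these statistics and the scatterplot, assuming the tensors from the previous step (torch.corrcoef requires a reasonably recent PyTorch; numpy.corrcoef works just as well):

```python
import matplotlib.pyplot as plt

def summarize(name, t):
    # Basic descriptive statistics for one variable.
    print(name, "min:", t.min().item(), "max:", t.max().item(),
          "mean:", t.mean().item(), "std:", t.std().item())

summarize("ActionLatency", x_test1)
summarize("APM", y_test1)

# Pearson correlation between the two variables.
stacked = torch.stack([x_test1, y_test1])
print("correlation:", torch.corrcoef(stacked)[0, 1].item())

plt.scatter(x_test1.numpy(), y_test1.numpy())
plt.xlabel("ActionLatency")
plt.ylabel("APM")
plt.show()
```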
Ordinary Least Squares
The first thing we will do with the data is to fit a line using Ordinary Least Squares. The idea is to calculate a line that minimizes the sum of squared distances from the predictions (the line) to the data we have for each x value. Recall that a linear model has the general form

$$y = w \cdot x + b$$

In our case, x is the action latency and y is the actions per minute; w contains the coefficients for the line, and b is the bias. However, in the general formulation x can be a vector of arbitrary length. Likewise, we can take our value for x and turn it into a vector $x'$ by adding an extra element, such as $x' = [x, 1]$. This allows us to “merge” the bias into the coefficients $w' = [w, b]$, leading to a simpler equation:

$$y = w' \cdot x'$$

The Ordinary Least Squares method can be used to calculate the “best” coefficients (with respect to our error metric) directly. For this, we collect all our data into a matrix $X$ (where the second column is filled with 1s, as described above) and a vector $y$, and solve the normal equations:

$$w' = (X^T X)^{-1} X^T y$$
Implement a function ols that takes as parameters vectors x and y and returns the vector of coefficients. Use this function to calculate the best coefficients from the training set and plot the resulting line together with the training set and each test set (3 plots in total).
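A sketch of ols via the normal equations, plus one of the three requested plots:

```python
def ols(x, y):
    # Build the design matrix X with a column of 1s to absorb the
    # bias, then solve (X^T X) w = X^T y.
    X = torch.stack([x, torch.ones_like(x)], dim=1)
    return torch.linalg.solve(X.T @ X, X.T @ y)  # [slope, bias]

coeffs = ols(x_train, y_train)

plt.scatter(x_train.numpy(), y_train.numpy())
xs = torch.linspace(x_train.min().item(), x_train.max().item(), 100)
plt.plot(xs.numpy(), (coeffs[0] * xs + coeffs[1]).numpy(), color="red")
plt.xlabel("ActionLatency")
plt.ylabel("APM")
plt.show()
```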
Fitting a Linear Model with Gradient Descent
Note: The material for this and the following sections will be covered in weeks 2 and 3 of the class.
While having a closed form solution is a big advantage, we want to be able to fit more complex models in the future. We will therefore reformulate the problem of finding a good line, and perform a search through the space of available lines “manually”. Note: The instructions will require you to write a series of functions. Most of these functions will only consist of a single line of code. The reason we keep these functions separate is that we will replace them one by one with operations provided by PyTorch.
First, write a function model that accepts three tensors w, x and b, and returns the value for y.
Then, write a function loss_fn that takes a tensor containing output values y from our model and another tensor containing observed output values, and computes the mean squared distance between the two (i.e. for each pair, calculate the difference and square it, then calculate the mean over the entire tensor). Mathematically, our loss function has the form:

$$L(w, b) = \frac{1}{N} \sum_{i=1}^{N} \left(\hat{y}_i - y_i\right)^2 \qquad \text{with } \hat{y}_i = w x_i + b$$
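Both functions can be a single line each, for example:

```python
def model(w, x, b):
    # A line: y = w * x + b, applied elementwise over the tensor x.
    return w * x + b

def loss_fn(y_pred, y_true):
    # Mean squared error between predictions and observations.
    return ((y_pred - y_true) ** 2).mean()
```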
In order to actually learn good values for the model parameters w and b, we will use an optimization procedure, namely gradient descent. In order to minimize the loss, we calculate it for some parameter values, and then “move” in the direction in which the loss decreases. This “movement” is done by changing the parameters w and b appropriately, and the direction in which the loss decreases is calculated using the gradient. What we want to calculate are the partial derivatives:

$$\frac{\partial L}{\partial w} \qquad \text{and} \qquad \frac{\partial L}{\partial b}$$

For this, we use the chain rule:

$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial w} \qquad \frac{\partial L}{\partial b} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial b}$$

In our case, we have:

$$\frac{\partial L}{\partial w} = \frac{2}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)\, x_i \qquad \frac{\partial L}{\partial b} = \frac{2}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)$$
What does this tell us? First, we “randomly” choose starting values for w and b (you can just try 1 and 0, for example), then we calculate the gradient (which tells us the direction of ascent), and then we move in the other direction (to descend/minimize). However, we need to be careful how far we move, and for that we introduce a constant called the “learning rate” alpha.
Implement functions dmodel_w, dmodel_b and dloss_m, corresponding to the partial derivatives of the model and the loss function, and then implement a function training(n, w, b, alpha, x, y) that does the following n times:
- Calculate the current estimate for y using the current values for w and b, as well as the loss
- Calculate the gradient
- Print the current loss, gradient and iteration count
- Update w and b using the gradient
After the loop, return the values of w and b. For now, use a constant number for n (e.g. 1000 iterations), and guess values for w, b, and alpha (important: even though w and b are just single numbers, make sure that they are tensors with one entry!). When you call this function using the training set as x and y and run this loop, you should see the loss gradually decrease, until at some point it does not change much anymore. If not, and especially if the loss is increasing over time, try changing the learning rate. Note this in your report, along with how you determined a good learning rate. Another issue that may come up: w and b use very different scales (w should be between -5 and 5, b will grow to over 200), so if you set your learning rate to a low value, b will take forever to be updated properly, but if you set it to a high value, w will “blow up”. A sketch of these functions follows below.
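A sketch of the derivative functions and the training loop, assuming the model and loss_fn from above (the starting values and alpha = 1e-4 are only guesses to be tuned):

```python
def dmodel_w(x):
    # Partial derivative of w*x + b with respect to w.
    return x

def dmodel_b(x):
    # Partial derivative of w*x + b with respect to b.
    return torch.ones_like(x)

def dloss_m(y_pred, y_true):
    # Partial derivative of the mean squared error with respect
    # to the model output.
    return 2.0 * (y_pred - y_true) / y_pred.shape[0]

def training(n, w, b, alpha, x, y):
    for i in range(n):
        y_pred = model(w, x, b)
        loss = loss_fn(y_pred, y)
        # Chain rule: dL/dw = dL/dy * dy/dw, dL/db = dL/dy * dy/db.
        grad_w = (dloss_m(y_pred, y) * dmodel_w(x)).sum()
        grad_b = (dloss_m(y_pred, y) * dmodel_b(x)).sum()
        print(i, loss.item(), grad_w.item(), grad_b.item())
        # Move against the gradient, scaled by the learning rate.
        w = w - alpha * grad_w
        b = b - alpha * grad_b
    return w, b

w, b = training(1000, torch.tensor([1.0]), torch.tensor([0.0]),
                1e-4, x_train, y_train)
```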
In practice, you will try to normalize your inputs so that they are (roughly) between -1 and 1. Create a new variable xn that is a copy of x, where you subtract the mean and divide by half of the range of values. Try using this new xn as a parameter to training and note any differences you observe.
Create another plot that contains the training data as a scatterplot like before, and add the learned model (line) to it. What is the loss? Is it a good fit? Also plot the model compared to the two test sets, and calculate their corresponding loss values. Does the model generalize well to the test sets?
Automated Gradients and Optimization
In practice, our models will be a lot more complex than simple lines, and calculating gradients by hand, while possible, is a bit annoying. Fortunately, PyTorch can automatically calculate gradients for us! All you have to do is pass an additional parameter requires_grad=True when you create the tensors w and b (that’s why they have to be tensors!). Create a new function training_auto, which is a copy of the original training function, but instead of calculating the gradient manually, you have PyTorch calculate it automatically as an attribute of w and b; see the sketch after the pitfalls below.
Recall that every tensor “remembers” where it comes from. When you calculate the loss, the resulting tensor knows that it is the result of w and b, which require a gradient. You can then call backward() on the loss tensor, which will cause PyTorch to backpropagate the gradient and accumulate it in the leaves (in our case w and b). Then, you can access it as the .grad attribute of w and b. There are two pitfalls:
- The gradient will accumulate, which means you have to manually set it back to zero on every iteration (call w.grad.zero_(), if w.grad is not None!)
- When you update w and b, they will be the results of tensor operations, so they will also track where they are coming from (namely the old w and b), which would mean that you keep the entire history of the computation in memory (and calculate the gradients for everything). In order to “break” this connection, call the detach method on w and b after the update, and also call requires_grad_() in order to re-enable the gradient calculation.
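Putting the two pitfalls together, training_auto could look like this sketch (the hyperparameters are again only guesses):

```python
def training_auto(n, w, b, alpha, x, y):
    for i in range(n):
        # Pitfall 1: gradients accumulate, so reset them first.
        if w.grad is not None:
            w.grad.zero_()
        if b.grad is not None:
            b.grad.zero_()
        loss = loss_fn(model(w, x, b), y)
        loss.backward()  # backpropagate; fills w.grad and b.grad
        print(i, loss.item(), w.grad.item(), b.grad.item())
        # Pitfall 2: detach the updated tensors from the computation
        # history, then re-enable gradient tracking.
        w = (w - alpha * w.grad).detach().requires_grad_()
        b = (b - alpha * b.grad).detach().requires_grad_()
    return w, b

w = torch.tensor([1.0], requires_grad=True)
b = torch.tensor([0.0], requires_grad=True)
w, b = training_auto(1000, w, b, 1e-4, xn, y_train)
```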
This new version of the training function should behave exactly like the old one, with one big advantage: We can now use any (differentiable) function as our model without having to calculate the gradient manually!
The combination of calculating the gradient and updating some model parameters to minimize a loss function is so common that PyTorch also provides support for it: optimizers. In fact, one challenge with optimization is that you may run into a local minimum (in our toy example the loss function does not have any local minima, so we avoided that problem), and there are also different ways of using the learning rate, adding additional parameters, etc. PyTorch therefore provides not just the simple gradient descent that we used, but several different kinds of optimizers with different advantages and trade-offs. The common API for them consists of the constructor, which takes a list of parameters to optimize (in our case w and b) and a learning rate lr; a method called step, which updates the parameters using the calculated gradients; and a method called zero_grad, which you have to call to reset the gradients. Implement a new function training_opt, which uses the optimizer SGD (“Stochastic Gradient Descent”) in the following loop:
Do n times:
- Call zero_grad on the optimizer
- Calculate the current estimate for y using the current values for w and b, as well as the loss, and call backward() on the loss
- Print the current loss, gradient and iteration count
- Call the step method of the optimizer
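A sketch, assuming w and b were created with requires_grad=True as above:

```python
import torch.optim as optim

def training_opt(n, w, b, alpha, x, y):
    # The optimizer takes over the parameter updates for us.
    optimizer = optim.SGD([w, b], lr=alpha)
    for i in range(n):
        optimizer.zero_grad()
        loss = loss_fn(model(w, x, b), y)
        loss.backward()
        print(i, loss.item(), w.grad.item(), b.grad.item())
        optimizer.step()  # update w and b in place using their .grad
    return w, b
```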
As before, plot the learned line with this method. Is it the same as before? Better, worse? Why? How does it behave on the test sets?
Also try a different optimizer, such as ADAM, using the scaled as well as the unscaled data.
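Switching optimizers only changes the constructor call, for example (the learning rate is again a guess):

```python
optimizer = optim.Adam([w, b], lr=0.1)
```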
A Better Fit
As you may have noticed, the data itself is not exactly suited for a linear model. As a last step, come up with a better family of functions to model the behavior of the data. Write a new function training_nonlin, which uses a new model model_nonlin, which should be a parameterized function, and performs the optimization procedure using that function’s parameters. You could use something like

$$y = a x^2 + b x + c$$

or

$$y = a e^{b x} + c$$

where a, b, and c are the model parameters. Try several different functions, and note which produced the best fit in your report. Always perform your training on the training data, and calculate the performance on the test sets!
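As an illustration only (the exponential family below is one assumption among many you could try), training_nonlin might look like:

```python
def model_nonlin(params, x):
    # One candidate family: an exponential with an offset,
    # y = a * exp(b*x) + c, with learned parameters a, b, c.
    a, b, c = params
    return a * torch.exp(b * x) + c

def training_nonlin(n, params, alpha, x, y):
    optimizer = optim.Adam(params, lr=alpha)
    for i in range(n):
        optimizer.zero_grad()
        loss = loss_fn(model_nonlin(params, x), y)
        loss.backward()
        optimizer.step()
    return params

# Three scalar parameters a, b, c; using the normalized xn keeps
# exp() numerically stable.
params = [torch.tensor([1.0], requires_grad=True) for _ in range(3)]
params = training_nonlin(5000, params, 0.01, xn, y_train)
```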