Instructors: Dra. Marcela Alfaro Córdoba, Dr. Markus Eger
Email: marcela.alfarocordoba@ucr.ac.cr, markus.eger.ucr@gmail.com
Office hours:
Class: Tuesday, 5.30pm-9pm, Lab 102, Edificio ECCI
Originally from Austria
BSc and MSc in Computer Science from University of Technology Graz, Austria
PhD in Computer Science from NC State University, USA, working on game AI for games involving communication
Games: Smite, Guild Wars 2, Incremental Games
I also like board games (Ricochet Robots, Dominion, Brewcrafters, ...)
Originally from San Ramón, Alajuela
BSc in Statistics from UCR
MSc from Iowa State, USA
PhD in Statistics from NC State University, USA, working on spatial-temporal models for atmospheric data
Games: 2048 obsessively
I like communities: R-ladies, RDA, CODATA, ASA, Datos abiertos, etc.
Name, Program?
Games, Communities, Interests?
Fun facts?
Python and Stats Introduction/Revision (10/3-17/3)
Two worlds, two vocabularies: Statistics and Computer Science (24/3)
Supervised Learning: Neural Networks, SVMs, etc. (31/3-5/5)
Dimensionality Reduction (12/5)
Bayesian Statistics (19/5)
Advanced Topics (26/5-2/6)
Data Collection, Storage, Ethical Considerations (9/6-23/6)
Each class will be a combination of lecture and lab work
There will be 6 different lab exercises
You should work on the labs during class time and finish them at home, if necessary
Deadline for submission is before the start of the next lab session (Tuesday before class)
Submit labs by email to the two professors
For each lab you have to work in groups of two: One statistician and one computer scientist
Lab 1: Python/Stats/PyTorch intro, 10/3 - 24/3
Lab 2: Regression, 31/3 - 21/4
Lab 3: Classification, 28/4 - 5/5
Lab 4: Dimensionality Reduction, 12/5 - 19/5
Lab 5: GANs, 26/5 - 9/6
Lab 6: Ethics, 16/6 - 23/6
In addition to the labs, you will work on a semester-long project of your choice
Work in groups of two: One statistician and one computer scientist
We will provide several project ideas, but feel free to propose your own
Important: We encourage creative ideas, but the more "interesting" your proposal, the more important it is to coordinate it with us beforehand
31/3: Proposal: 10 mins presentation and a document.
28/4: Update 1: 7 mins presentation and a document.
26/5: Update 2: 3 mins presentation and a document.
23/6: Q&A
30/6: Presentations: 15 mins presentation and a document.
We have game logs from an experiment with the cooperative game Hanabi
Over 200 players played about 2000 games
Ideas: Group players according to play style, analyze differences between different experience levels, predict player actions depending on the game state, etc.
I (Markus) have about 16000 game logs from Hearthstone games (and a Python script to download more)
Games come tagged with deck type, player name, winner, and some other information
Ideas: Analyze deck type development over time, predict the winner after n turns, etc.
Collect data from a game (like PUBG, Fortnite): Game logs, network traffic, maybe CPU/GPU usage
Some players claim that the game starts to lag when someone is approaching from behind them
Investigate whether there is such a correlation
The Smithsonian recently made 3 million pieces of art digitally available
For example, there is a collection of stamps
We could classify stamps by type, year, price, etc.
Or we could use a GAN to generate more stamps of a particular type
Lab 1: 15% - document with code
Lab 2-6: 10% each - document with code
Project Proposal: 5% - presentation and document
Project Updates: 5% each - presentation and document
Final Project: 20% - presentation and document
Specific formats will be clarified in class.
The Hundred-Page Machine Learning Book, Burkov
Deep Learning with PyTorch, Stevens et al.
The Elements of Statistical Learning, Hastie et al.
If you are completely lost with Python, you might want to start here
[Figure: Venn diagram relating Artificial Intelligence, Machine Learning, Statistics, Data Science, and Computer Science]
Source: https://jamesmccaffrey.wordpress.com/2016/09/29/machine-learning-data-science-and-statistics/machinelearningdatasciencestatistics/
Supervised Learning: $\{(x_1,y_1),\ldots,(x_n,y_n)\}$
Learn a mapping from examples.
Unsupervised Learning: $\{x_1,\ldots,x_m\}$
Learn something interesting about the data.
Semi-supervised Learning: $\{(x_1,y_1),\ldots,(x_n,y_n)\} \cup \{x_{n+1},\ldots,x_{n+m}\}$
Reinforcement Learning: Learn what to do in an environment, given feedback information.
We will cover some algorithms from each of these areas in this course!
Say there is a function $f(\vec{x}) = y$
Supervised Learning: We know x and y, and are trying to find f
Unsupervised Learning: We know x and are trying to find "interesting" f and y
Reinforcement Learning: We know f*, and are trying to get "the best y" by choosing x
*: Terms and Conditions may apply
What we want is to give the computer "some" x and y, and it finds the "general connection" between them
The function that we learn is not just a memorization of the values; it also gives us a "good" value of y for an x that we haven't seen before
Inter- and Extrapolation: From the known data, we can "predict" values for new inputs
But what are x and y?
Descriptive statistics vs Inference
Describing vs Predicting - why is this useful?
Correlation vs Causation: https://xkcd.com/925/
Remember our function $f(\vec{x}) = y$
This function takes a vector of real numbers and produces one real number
Unlike vectors you may have seen before, this vector is really just "an ordered collection of numbers"
For example: We want to predict the price of google stock given the day of the year, the temperature, the position of Mars in its orbit, and the number of Marvel movies released so far
We construct a four-dimensional vector with one entry for each of these numbers
Our (supervised) learning algorithm then has to figure out how to turn these four values into a stock price (not all values may be relevant)
Vectors are neat because we have mathematical operations defined on them: addition, subtraction, multiplication with a scalar, etc.
One particularly important operation is the dot product:
$$\vec{v} \cdot \vec{w} = \begin{pmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{pmatrix} \cdot \begin{pmatrix} w_1 \\ w_2 \\ \vdots \\ w_n \end{pmatrix} = v_1 w_1 + v_2 w_2 + \ldots + v_n w_n$$
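As a quick sanity check, here is a minimal sketch computing the dot product by hand and with torch.dot (the example values are made up):

```python
import torch

v = torch.tensor([1.0, 2.0, 3.0])
w = torch.tensor([4.0, 5.0, 6.0])

# sum of element-wise products: 1*4 + 2*5 + 3*6 = 32
manual = sum(vi * wi for vi, wi in zip(v, w))
print(manual)           # tensor(32.)
print(torch.dot(v, w))  # tensor(32.)
```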
Vector: An ordered list of numbers
Matrix: A grid of numbers
We could store a matrix in a vector, if we remember the dimensions!
Tensor: What if we "need" more dimensions?
For example: We have 10000 images of 28x28 pixels. We store them in a 10000x28x28 tensor
In memory the pixels will still be stored sequentially, the tensor is really just a different "view" on the data
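A minimal sketch of this idea, using random numbers as a stand-in for real image data:

```python
import torch

# hypothetical stand-in for 10000 grayscale 28x28 images (random data)
images = torch.randn(10000, 28, 28)
print(images.shape)     # torch.Size([10000, 28, 28])
print(images[0].shape)  # torch.Size([28, 28]) -- the first "image"
# the underlying storage is still one flat buffer of 10000*28*28 numbers
print(images.numel())   # 7840000
```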
To develop in Python, use your favorite text editor
Save the file with the extension ".py"
Then you can run the file from the command line with python <filename>.py
There are some ways to make this more "comfortable"
PyCharm is an IDE for Python
You can write your code, as with any other editor
To run it, you can use a button in the top right corner that will open a console on the bottom and run your code
You will need to tell PyCharm which file to run, though
```python
import sys

def main(who):
    print("Hello", who)
    for i in range(100):
        if (i+1) % 10 == 0:
            print("Iteration %d" % (i+1))

if __name__ == "__main__":
    if len(sys.argv) > 1:
        main(sys.argv[1])
    else:
        main("World")
```
Save as helloworld.py, then run with:
python helloworld.py Universe
One of the main differences you will run into: Python arrays (and matrices, vectors, tensors, etc.) start with index 0
a[0] is the first element
a[-1] is the last element
a[1:] is everything except the first element
a[:-1] is everything except the last element
a[2:5] is the 3rd to the 5th element
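A small sketch you can paste into a Python console to check these rules (the example list is made up):

```python
a = [10, 20, 30, 40, 50]

print(a[0])    # 10 -- first element
print(a[-1])   # 50 -- last element
print(a[1:])   # [20, 30, 40, 50] -- everything except the first
print(a[:-1])  # [10, 20, 30, 40] -- everything except the last
print(a[2:5])  # [30, 40, 50] -- 3rd to 5th element
```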
Install Python 3.7 or 3.8, 64 bit from https://www.python.org/
You can install additional packages using the command line with pip install <packagename>
For example, numpy can be installed with pip install numpy
If you don't have administrator rights, use pip install --user numpy
pip will automatically install all dependencies
PyTorch is a Python library for Machine Learning
At every major Machine Learning conference last year, the majority of papers used PyTorch
The other "big one" is Tensorflow, which is a bit older and therefore has more adoption in industry
PyTorch is easier to get started with, and once you know the core concepts you can easily pick up Tensorflow, too
(Almost) everything is a torch.Tensor!
These tensors really are just views into a sequential array of numbers
Each tensor can also "remember" where it came from (e.g. if it is the result of an addition)
This means you can also view a tensor as a "tree" of computation, with the result at the top and the inputs as the leaves
You can also tell your tensors that the operations should be performed on the GPU
The dimensionality of a tensor defines how many "sizes" it has
The shape of a tensor tells us how many elements exist in each dimension
A tensor with shape [24,1] has a different shape than a tensor using the same data with shape [12,2], but also differs from tensors with shape [24] or [24,1,1]
You can use torch.matmul(x,y) for a multiplication that automatically "fixes" dimensions in most cases
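For example, a sketch with random data showing how torch.matmul handles matrix-vector and matrix-matrix products:

```python
import torch

x = torch.randn(12, 2)
v = torch.randn(2)
z = torch.randn(2, 5)

print(torch.matmul(x, v).shape)  # torch.Size([12]) -- matrix-vector product
print(torch.matmul(x, z).shape)  # torch.Size([12, 5]) -- matrix-matrix product
```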
Let x be a tensor with shape [12,2]
x[0,0] is the first element
x[0] (or x[0,:]) is the first row (a tensor of shape [2])
x[:,0] is the first column (a tensor of shape [12])
x.T is a tensor with shape [2,12] (the transpose)
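The same rules as a runnable sketch, using random data:

```python
import torch

x = torch.randn(12, 2)

print(x[0, 0])        # first element (a scalar tensor)
print(x[0].shape)     # torch.Size([2]) -- first row
print(x[:, 0].shape)  # torch.Size([12]) -- first column
print(x.T.shape)      # torch.Size([2, 12]) -- the transpose
```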
Let x be a tensor with shape [24]
x > 0 produces a tensor that tells you for each element whether it is greater than 0
x[x > 0] produces a tensor with all values that are greater than 0
x[labels == "C"] produces a tensor with all x values which have the label "C" (labels has to be an array containing the label for each entry in x)
Note: In many cases our x will have more than one dimension; x[labels == "C",:] does the same for tensors with dimensionality 2, etc.
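A sketch of boolean indexing; since torch tensors cannot hold strings, the labels here are assumed to live in a numpy array (the values are made up):

```python
import numpy as np
import torch

x = torch.tensor([-1.0, 2.0, -3.0, 4.0])
labels = np.array(["A", "C", "C", "B"])  # one made-up label per entry of x

print(x > 0)     # tensor([False,  True, False,  True])
print(x[x > 0])  # tensor([2., 4.])

mask = torch.from_numpy(labels == "C")  # boolean tensor
print(x[mask])   # tensor([ 2., -3.]) -- the entries labeled "C"
```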
A very popular library for linear algebra in Python is numpy
PyTorch interacts very nicely with numpy
You can convert a tensor x
to a numpy array with x.numpy()
You can convert a numpy array y
to a torch tensor with torch.from_numpy(y)
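A small sketch; note that both conversions share memory with the original object rather than copying it, so changes are visible on both sides:

```python
import numpy as np
import torch

x = torch.ones(3)
a = x.numpy()                       # numpy view of the tensor's data
y = torch.from_numpy(np.zeros(3))   # tensor view of a numpy array

a[0] = 5.0
print(x)        # tensor([5., 1., 1.]) -- the tensor sees the change
print(y.dtype)  # torch.float64, inherited from numpy
```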
Pandas is a library for data manipulation and analysis
It is particularly nice for reading data from a variety of formats
R programmers will also like the dataframes it provides (which are similar to the ones in R)
It is built on top of numpy, and therefore plays nicely with PyTorch
```python
import pandas as pd
import numpy as np
import torch
import sys

def read_csv(fname, colx, coly):
    data = pd.read_csv(fname)
    x = torch.tensor(data[colx])
    y = torch.tensor(data[coly])
    return x, y

def main(datafile):
    x, y = read_csv(datafile, "ActionLatency", "APM")
    print("mean x: %.2f, max y: %.2f" % (x.mean(), y.max()))
    ymax, ymaxat = y.max(0)
    print("max y: %.2f at index %d" % (ymax, ymaxat))
    print(x.shape, x[x > 125])
    import pdb; pdb.set_trace()

if __name__ == "__main__":
    main(sys.argv[1])
```
Squeeze is used to remove one/all dimension(s) of size 1:
If x.shape is [12,2], x.squeeze() does nothing
If x.shape is [24,1], x.squeeze() produces a tensor of shape [24]
If x.shape is [24,1,1], x.squeeze() produces a tensor of shape [24]
If x.shape is [24,1,1], x.squeeze(1) produces a tensor of shape [24,1]
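These squeeze rules as a runnable sketch:

```python
import torch

x = torch.randn(24, 1, 1)
print(x.squeeze().shape)   # torch.Size([24]) -- all size-1 dimensions removed
print(x.squeeze(1).shape)  # torch.Size([24, 1]) -- only dimension 1 removed

y = torch.randn(12, 2)
print(y.squeeze().shape)   # torch.Size([12, 2]) -- nothing to remove
```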
Unsqueeze is used to insert a dimension of size 1:
If x.shape is [12,2], x.unsqueeze(0) produces a tensor of shape [1,12,2]
If x.shape is [12,2], x.unsqueeze(1) produces a tensor of shape [12,1,2]
If x.shape is [12,2], x.unsqueeze(2) produces a tensor of shape [12,2,1]
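And the unsqueeze rules:

```python
import torch

x = torch.randn(12, 2)
print(x.unsqueeze(0).shape)  # torch.Size([1, 12, 2])
print(x.unsqueeze(1).shape)  # torch.Size([12, 1, 2])
print(x.unsqueeze(2).shape)  # torch.Size([12, 2, 1])
```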
View is used to convert the shape of a tensor to something "arbitrary" (with the same total number of elements)
If x.shape is [12,2], x.view(24) produces a tensor of shape [24]
If x.shape is [24], x.view((24,1)) produces a tensor of shape [24,1] (exactly like x.unsqueeze(1))
If x.shape is [24], x.view((2,3,4)) produces a tensor of shape [2,3,4]
If x.shape is [24,1], x.view(24) produces a tensor of shape [24] (exactly like x.squeeze(1))
If x.shape is [12,2], x.view((8,3)) produces a tensor of shape [8,3]
If x.shape is [12,2], x.view((8,6)) produces an error
One dimension passed to view can be -1. Because view knows how many elements there are in total, it will just put "the rest" in that dimension
If x.shape is [12,2], x.view(-1) produces a tensor of shape [24]
If x.shape is [n], x.view((n,-1)) produces a tensor of shape [n,1] (exactly like x.unsqueeze(1))
If x.shape is [24], x.view((2,-1,4)) produces a tensor of shape [2,3,4]
If x.shape is [24,1], x.view(-1) produces a tensor of shape [24] (exactly like x.squeeze(1))
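The view examples above, condensed into a sketch with random data:

```python
import torch

x = torch.randn(12, 2)
print(x.view(24).shape)        # torch.Size([24])
print(x.view(8, 3).shape)      # torch.Size([8, 3])
print(x.view(-1).shape)        # torch.Size([24]) -- the -1 is inferred
print(x.view(2, -1, 4).shape)  # torch.Size([2, 3, 4])
# x.view(8, 6) would raise a RuntimeError: 48 != 24 elements
```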
permute allows you to reorder dimensions (useful when you have image data with color channels, for example)
If x.shape is [3,2,4], x.permute((1,0,2)) produces a tensor of shape [2,3,4]
If x.shape is [3,2,4], x.permute((2,1,0)) produces a tensor of shape [4,2,3]
If x.shape is [3,2,4], x.permute((1,0)) produces an error
If x.shape is [3,2,4], x.permute((1,0,2,1)) produces an error
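And permute, sketched with random data:

```python
import torch

x = torch.randn(3, 2, 4)
print(x.permute(1, 0, 2).shape)  # torch.Size([2, 3, 4])
print(x.permute(2, 1, 0).shape)  # torch.Size([4, 2, 3])
# x.permute(1, 0) raises an error: all 3 dimensions must be listed
```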
Torch is a bit picky about data types (for optimization reasons)
Each tensor has a data type associated with it
If you have an integer tensor x, you can get a floating point tensor with x.float()
You can also specify an extra parameter dtype=torch.float64 for torch.tensor(...)
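A short sketch of the dtype behavior:

```python
import torch

x = torch.tensor([1, 2, 3])  # integers are inferred as torch.int64
print(x.dtype)               # torch.int64
print(x.float().dtype)       # torch.float32

y = torch.tensor([1, 2, 3], dtype=torch.float64)
print(y.dtype)               # torch.float64
```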
CUDA ("Compute Unified Device Architecture") allows you to program your (Nvidia) graphics card
Computer graphics needs many vector operations in parallel, which the GPU can perform
Someone noticed that performing many vector operations in parallel is useful in other contexts as well
You can move any tensor x to the graphics card by calling x.cuda() (if you have an Nvidia card and CUDA installed)
This will return a new tensor (don't mistakenly use the old one!)
Any operation on that tensor will then run on the graphics cards
Important: You cannot mix tensors that live on the graphics card with ones that live in RAM/on the CPU
RuntimeError: expected device cuda:0 but got device cpu
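A guarded sketch (it only does GPU work if CUDA is actually available; the matrix sizes are arbitrary):

```python
import torch

x = torch.randn(1000, 1000)
if torch.cuda.is_available():
    x_gpu = x.cuda()                 # a *new* tensor living on the GPU
    y_gpu = torch.randn(1000, 1000).cuda()
    z = torch.matmul(x_gpu, y_gpu)   # this multiplication runs on the GPU
    # torch.matmul(x_gpu, x) would raise the RuntimeError above
    z_cpu = z.cpu()                  # move the result back to the CPU
```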
To make plots with python, we use matplotlib (and seaborn for prettier graphs)
The usual convention is import matplotlib.pyplot as plt and import seaborn as sns
plt has many different plot types available, and you can manipulate axes, labels, etc.
sns integrates into that by adding more plot types (and nicer colors, styles, etc.)
```python
import matplotlib.pyplot as plt
import seaborn as sns
import torch

x = [1, 2, 3, 4, 5]
y = torch.randn(5)

# line plot
plt.figure()
plt.plot(x, y)
plt.show()

# scatter plot and line in the same figure
plt.figure()
plt.scatter(x, y)
plt.plot(x, y)
plt.savefig("scatterplot.png")

# hexbin plot with marginal distributions (seaborn)
plt.figure()
sns.jointplot(x, y, kind="hex")
plt.show()
```