
Computational and statistical techniques of Machine Learning

Semester I, 2020

1 / 62

Instructor and Schedule

  • Instructors: Dra. Marcela Alfaro Córdoba, Dr. Markus Eger

  • Email: marcela.alfarocordoba@ucr.ac.cr, markus.eger.ucr@gmail.com

  • Office hours:

    • Markus: Monday, 4pm-5.30pm, Tuesday 4.30pm-5.25pm, Thursday 3pm-5pm, office 3-23, ECCI Anexo
    • Marcela: Tuesday, 4.30pm-5.25pm, office 3-23, ECCI Anexo
  • Class: Tuesday, 5.30pm-9pm, Lab 102, Edificio ECCI

2 / 62

About Markus

  • Originally from Austria
3 / 62

About Markus

4 / 62

About Markus

  • Originally from Austria

  • BSc and MSc in Computer Science from University of Technology Graz, Austria

  • PhD in Computer Science from NC State University, USA, working on game AI for games involving communication

  • Games: Smite, Guild Wars 2, Incremental Games

  • I also like board games (Ricochet Robots, Dominion, Brewcrafters, ...)

5 / 62

About Marcela

  • Originally from San Ramón, Alajuela

  • BSc in Statistics from UCR

  • MSc from Iowa State, USA

  • PhD in Statistics from NC State University, USA, working on spatial-temporal models for atmospheric data

  • Games: 2048 obsessively

  • I like communities: R-ladies, RDA, CODATA, ASA, Datos abiertos, etc.

6 / 62

About Markus and Marcela

7 / 62

About You

  • Name, Program?

  • Games, Communities, Interests?

  • Fun facts?

8 / 62

Class Resources

9 / 62

Class contents

  • Python and Stats Introduction/Revision (10/3-17/3)

  • Two worlds, two vocabularies: Statistics and Computer Science (24/3)

  • Supervised Learning: Neural Networks, SVMs, etc. (31/3-5/5)

  • Dimensionality Reduction (12/5)

  • Bayesian Statistics (19/5)

  • Advanced Topics (26/5-2/6)

  • Data Collection, Storage, Ethical Considerations (9/6-23/6)

10 / 62

Labs

  • Each class will be a combination of lecture and lab work

  • There will be 6 different lab exercises

  • You should work on the labs during class time and finish them at home if necessary

  • Deadline for submission is before the start of the next lab session (Tuesday before class)

  • Submit labs by email to the two professors

  • For each lab you have to work in groups of two: One statistician and one computer scientist

11 / 62

Labs

  • Lab 1: Python/Stats/PyTorch intro, 10/3 - 24/3

  • Lab 2: Regression, 31/3 - 21/4

  • Lab 3: Classification, 28/4 - 5/5

  • Lab 4: Dimensionality Reduction, 12/5 - 19/5

  • Lab 5: GANs, 26/5 - 9/6

  • Lab 6: Ethics, 16/6 - 23/6

12 / 62

Project

  • In addition to the labs, you will work on a semester-long project of your choice

  • Work in groups of two: One statistician and one computer scientist

  • We will provide several project ideas, but feel free to propose your own

  • Important: We encourage creative ideas, but the more "interesting" your proposal, the more important it is to coordinate it with us beforehand

13 / 62

Project

  • 31/3: Proposal: 10 mins presentation and a document.

  • 28/4: Update 1: 7 mins presentation and a document.

  • 26/5: Update 2: 3 mins presentation and a document.

  • 23/6: Q&A

  • 30/6: Presentations: 15 mins presentation and a document.

14 / 62

Project Ideas

  • We have game logs from an experiment with the cooperative game Hanabi

  • Over 200 players played about 2000 games

  • Ideas: Group players according to play style, analyze differences between different experience levels, predict player actions depending on the game state, etc.

15 / 62

Project Ideas

  • I (Markus) have about 16000 game logs from Hearthstone games (and a Python script to download more)

  • Games come tagged with deck type, player name, winner, and some other information

  • Ideas: Analyze deck type development over time, predict winner after n turns, etc.

16 / 62

Project Ideas

  • Collect data from a game (like PUBG, Fortnite): Game logs, network traffic, maybe CPU/GPU usage

  • Some players say that the game starts to lag when someone is coming up behind them

  • Investigate whether there is a correlation

17 / 62

Project Ideas

18 / 62

Project Ideas

  • Public data you can use

  • Bring your own data: let us know ASAP :)

19 / 62

Grading

  • Lab 1: 15% - document with code

  • Lab 2-6: 10% each - document with code

  • Project Proposal: 5% - presentation and document

  • Project Updates: 5% each - presentation and document

  • Final Project: 20% - presentation and document

Specific formats will be clarified in class.

20 / 62

Textbook

  • The 100-page ML Book, Burkov

  • Deep Learning with PyTorch, Stevens et al

  • Elements of Statistical Learning, Hastie et al

  • If you are completely lost with Python, you might want to start here

21 / 62

Introduction

22 / 62

Definitions - Brainstorming

Artificial Intelligence | Machine Learning | Statistics | Data Science | Computer Science

Source: https://jamesmccaffrey.wordpress.com/2016/09/29/machine-learning-data-science-and-statistics/machinelearningdatasciencestatistics/

23 / 62

Why This Course?

24 / 62

Observe

25 / 62

Machine Learning

  • Supervised Learning: $\{(x_1,y_1),\ldots,(x_n,y_n)\}$. Learn a mapping from examples.

  • Unsupervised Learning: $\{x_1,\ldots,x_m\}$. Learn an interesting thing about the data.

  • Semi-supervised Learning: $\{(x_1,y_1),\ldots,(x_n,y_n)\} \cup \{x_1,\ldots,x_m\}$. A few labeled examples plus additional unlabeled data.

  • Reinforcement Learning: Learn what to do in an environment, given feedback information.

We will cover some algorithms from each of these areas in this course!

26 / 62

Machine Learning

Say there is a function f(x)=y

  • Supervised Learning: We know x and y, and are trying to find f

  • Unsupervised Learning: We know x and are trying to find "interesting" f and y

  • Reinforcement Learning: We know f*, and are trying to get "the best y" by choosing x

*: Terms and Conditions may apply

27 / 62

Supervised Learning?

  • What we want is to give the computer "some" x and y and have it find the "general connection" between them

  • The function that we learn is not just a memorization of the values; it also gives us a "good" y for an x that we haven't seen before

  • Inter- and Extrapolation: From the known data, we can "predict" values for new inputs

  • But what are x and y?

28 / 62

But first - Stats 101

  • Descriptive vs Inferential statistics

  • Describing vs Predicting - why is this useful?

  • Correlation vs Causation: https://xkcd.com/925/

29 / 62

Some Linear Algebra

30 / 62

Functions

Remember our function f(x)=y

  • This function takes a vector of real numbers and produces one real number

  • Unlike vectors you may have seen before, this vector is really just "an ordered collection of numbers"

  • For example: We want to predict the price of google stock given the day of the year, the temperature, the position of Mars in its orbit, and the number of Marvel movies released so far

  • We construct a four-dimensional vector with one entry for each of these numbers

  • Our (supervised) learning algorithm then has to figure out how to turn these four values into a stock price (not all values may be relevant)
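
For instance (with made-up numbers), such an input is just four values in a fixed order:

import torch

# hypothetical feature vector: day of year, temperature,
# Mars orbit position, number of Marvel movies so far
x = torch.tensor([68.0, 24.5, 112.0, 23.0])
print(x.shape)  # torch.Size([4])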

31 / 62

Vectors

  • Vectors are neat because we have mathematical operations defined on them: Addition, subtraction, multiplication with a scalar, etc.

  • One particularly important operation is the dot product:

$$\vec{v} \cdot \vec{w} = \begin{pmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{pmatrix} \cdot \begin{pmatrix} w_1 \\ w_2 \\ \vdots \\ w_n \end{pmatrix} = v_1 w_1 + v_2 w_2 + \cdots + v_n w_n$$

  • We will use this to concisely define learning systems and algorithms!
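
A quick sketch in PyTorch (with arbitrary values):

import torch

v = torch.tensor([1.0, 2.0, 3.0])
w = torch.tensor([4.0, 5.0, 6.0])
print(torch.dot(v, w))  # tensor(32.) = 1*4 + 2*5 + 3*6
print((v * w).sum())    # the same result, written element-wise
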
32 / 62

Vectors and Matrices and Tensors, oh my!

  • Vector: An ordered list of numbers

  • Matrix: A grid of numbers

  • We could store a matrix in a vector, if we remember the dimensions!

  • Tensor: What if we "need" more dimensions?

  • For example: We have 10000 images of 28x28 pixels. We store them in a 10000x28x28 tensor

  • In memory the pixels will still be stored sequentially, the tensor is really just a different "view" on the data

33 / 62

Review of Statistical Learning

http://www.mit.edu/~9.520/fall18/slides/class02_SLT.pdf

34 / 62

Python

35 / 62

Python: How to start

  • To develop in Python, use your favorite text editor

  • Save the file with the extension ".py"

  • Then you can run the file from the command line with python <filename>.py

  • There are some ways to make this more "comfortable"

36 / 62

PyCharm

  • PyCharm is an IDE for Python

  • You can write your code, as with any other editor

  • To run it, you can use a button in the top right corner that will open a console on the bottom and run your code

  • You will need to tell PyCharm which file to run, though

37 / 62

Python: Hello World

import sys

def main(who):
    print("Hello", who)
    for i in range(100):
        if (i+1) % 10 == 0:
            print("Iteration %d" % (i+1))

if __name__ == "__main__":
    if len(sys.argv) > 1:
        main(sys.argv[1])
    else:
        main("World")

Save as helloworld.py, run with:

python helloworld.py Universe
38 / 62

Python vs. R

  • One of the main differences you will run into: Python arrays (and matrices, vectors, tensors, etc.) start with index 0

  • a[0] is the first element

  • a[-1] is the last element

  • a[1:] is everything except the first element

  • a[:-1] is everything except the last element

  • a[2:5] is the 3rd to the 5th element
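
For example, with a plain Python list:

a = [10, 20, 30, 40, 50]
print(a[0])    # 10, the first element
print(a[-1])   # 50, the last element
print(a[1:])   # [20, 30, 40, 50]
print(a[:-1])  # [10, 20, 30, 40]
print(a[2:5])  # [30, 40, 50]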

39 / 62

Python at Home

  • Install Python 3.7 or 3.8, 64 bit from https://www.python.org/

  • You can install additional packages using the command line with pip install <packagename>

  • For example, numpy can be installed with pip install numpy

  • If you don't have administrator rights, use pip install --user numpy

  • pip will automatically install all dependencies

40 / 62

PyTorch

41 / 62

PyTorch

  • PyTorch is a Python library for Machine Learning

  • At every major Machine Learning conference last year, the majority of papers used PyTorch

  • The other "big one" is Tensorflow, which is a bit older and has therefore more adoption in industry

  • PyTorch is easier to get started with, and once you know the core concepts you can easily pick up Tensorflow, too

42 / 62

PyTorch vs. Tensorflow

43 / 62

PyTorch

  • (Almost) everything is a torch.Tensor!

  • These tensors really are just views into a sequential array of numbers

  • Each tensor can also "remember" where it came from (e.g. if it is the result of an addition)

  • This means you can also view a tensor as a "tree" of computation, with the result at the top and the inputs as the leaves

  • You can also tell your tensors that the operations should be performed on the GPU

44 / 62

Tensors in PyTorch

  • The dimensionality of a tensor is the number of dimensions (axes) it has

  • The shape of a tensor tells us how many elements exist in each dimension

  • A tensor with shape [24,1] is different from a tensor using the same data with shape [12,2], but also from tensors with shape [24] or [24,1,1]

  • You can use torch.matmul(x,y) for a multiplication that automatically "fixes" dimensions, in most cases
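
A quick sketch of the shapes above (the matmul line is a matrix-vector product):

import torch

data = torch.arange(24.0)  # 24 numbers, shape [24]
a = data.view(24, 1)       # shape [24,1], same underlying data
b = data.view(12, 2)       # shape [12,2], same underlying data
print(data.shape, a.shape, b.shape)
v = torch.randn(2)
print(torch.matmul(b, v).shape)  # torch.Size([12])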

45 / 62

Tensors in PyTorch

Let x be a tensor with shape [12,2]

  • x[0,0] is the first element

  • x[0] (or x[0,:]) is the first row (a tensor of shape [2])

  • x[:,0] is the first column (a tensor of shape [12])

  • x.T is a tensor with shape [2,12] (the transpose)
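
For example:

import torch

x = torch.randn(12, 2)
print(x[0, 0])        # the first element
print(x[0].shape)     # torch.Size([2]), the first row
print(x[:, 0].shape)  # torch.Size([12]), the first column
print(x.T.shape)      # torch.Size([2, 12]), the transpose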

46 / 62

Tensors in PyTorch

Let x be a tensor with shape [24]

  • x > 0 produces a tensor that tells you for each element if it is greater than 0 or not

  • x[x > 0] produces a tensor with all values that are greater than 0

  • x[labels == "C"] produces a tensor with all x values which have label "C" (labels has to be a tensor containing the labels for each entry in x)

Note: In many cases our x will have more than one dimension. x[labels == "C",:] does the same for tensors with dimensionality 2, etc.
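
A minimal sketch (since PyTorch tensors cannot hold strings, labels is a numpy array here, converted to a boolean mask before indexing):

import numpy as np
import torch

x = torch.tensor([-1.0, 2.0, -3.0, 4.0])
labels = np.array(["A", "C", "C", "B"])  # one label per entry in x
print(x > 0)     # tensor([False, True, False, True])
print(x[x > 0])  # tensor([2., 4.])
mask = torch.from_numpy(labels == "C")
print(x[mask])   # tensor([ 2., -3.])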

47 / 62

Numpy and PyTorch

  • A very popular library for linear algebra in Python is numpy

  • PyTorch interacts very nicely with numpy

  • You can convert a tensor x to a numpy array with x.numpy()

  • You can convert a numpy array y to a torch tensor with torch.from_numpy(y)
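
For example:

import numpy as np
import torch

x = torch.ones(3)
a = x.numpy()                      # tensor -> numpy array
y = torch.from_numpy(np.zeros(3))  # numpy array -> tensor
print(type(a), type(y))

Note that both conversions share the underlying memory for CPU tensors, so modifying one modifies the other.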

48 / 62

Pandas

  • Pandas is a library for data manipulation and analysis

  • It is particularly nice for reading data from a variety of formats

  • R programmers will also like the dataframes it provides (which are similar to the ones in R)

  • It is built on top of numpy, and therefore plays nicely with PyTorch

49 / 62

An Example

import pandas as pd
import numpy as np
import torch
import sys

def read_csv(fname, colx, coly):
    data = pd.read_csv(fname)
    x = torch.tensor(data[colx])
    y = torch.tensor(data[coly])
    return x, y

def main(datafile):
    x, y = read_csv(datafile, "ActionLatency", "APM")
    print("mean x: %.2f, max y: %.2f" % (x.mean(), y.max()))
    ymax, ymaxat = y.max(0)
    print("max y: %.2f at index %d" % (ymax, ymaxat))
    print(x.shape, x[x > 125])
    import pdb; pdb.set_trace()

if __name__ == "__main__":
    main(sys.argv[1])
50 / 62

Squeeze

Squeeze is used to remove one/all dimension(s) of size 1:

  • If x.shape is [12,2], x.squeeze() does nothing

  • If x.shape is [24,1], x.squeeze() produces a tensor of shape [24]

  • If x.shape is [24,1,1], x.squeeze() produces a tensor of shape [24]

  • If x.shape is [24,1,1], x.squeeze(1) produces a tensor of shape [24,1]
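
For example:

import torch

x = torch.zeros(24, 1, 1)
print(x.squeeze().shape)   # torch.Size([24])
print(x.squeeze(1).shape)  # torch.Size([24, 1])
print(torch.zeros(12, 2).squeeze().shape)  # torch.Size([12, 2]), nothing to remove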

51 / 62

Unsqueeze

Unsqueeze is used to insert a dimension of size 1:

  • If x.shape is [12,2], x.unsqueeze(0) produces a tensor of shape [1,12,2]

  • If x.shape is [12,2], x.unsqueeze(1) produces a tensor of shape [12,1,2]

  • If x.shape is [12,2], x.unsqueeze(2) produces a tensor of shape [12,2,1]
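
For example:

import torch

x = torch.zeros(12, 2)
print(x.unsqueeze(0).shape)  # torch.Size([1, 12, 2])
print(x.unsqueeze(1).shape)  # torch.Size([12, 1, 2])
print(x.unsqueeze(2).shape)  # torch.Size([12, 2, 1])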

52 / 62

View

View is used to convert the shape of a tensor to something "arbitrary" (with the same total number of elements)

  • If x.shape is [12,2], x.view(24) produces a tensor of shape [24]

  • If x.shape is [24], x.view((24,1)) produces a tensor of shape [24,1] (exactly like x.unsqueeze(1))

  • If x.shape is [24], x.view((2,3,4)) produces a tensor of shape [2,3,4]

  • If x.shape is [24,1], x.view(24) produces a tensor of shape [24] (exactly like x.squeeze(1))

  • If x.shape is [12,2], x.view((8,3)) produces a tensor of shape [8,3]

  • If x.shape is [12,2], x.view((8,6)) produces an error
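
In code:

import torch

x = torch.arange(24).view(12, 2)
print(x.view(24).shape)    # torch.Size([24])
print(x.view(8, 3).shape)  # torch.Size([8, 3])
# x.view(8, 6) raises a RuntimeError: it would need 48 elements, x only has 24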

53 / 62

View

One dimension passed to view can be -1: because view knows how many elements there are in total, it can infer "the rest" on its own

  • If x.shape is [12,2], x.view(-1) produces a tensor of shape [24]

  • If x.shape is [n], x.view((n,-1)) produces a tensor of shape [n,1] (exactly like x.unsqueeze(1))

  • If x.shape is [24], x.view((2,-1,4)) produces a tensor of shape [2,3,4]

  • If x.shape is [24,1], x.view(-1) produces a tensor of shape [24] (exactly like x.squeeze(1))
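
For example:

import torch

x = torch.arange(24)
print(x.view(2, -1, 4).shape)  # torch.Size([2, 3, 4]); the -1 is inferred as 24/(2*4) = 3
print(x.view(-1, 1).shape)     # torch.Size([24, 1])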

54 / 62

Permute

  • permute allows you to reorder dimensions (useful when you have image data with color channels, for example)

  • If x.shape is [3,2,4], x.permute((1,0,2)) produces a tensor of shape [2,3,4]

  • If x.shape is [3,2,4], x.permute((2,1,0)) produces a tensor of shape [4,2,3]

  • If x.shape is [3,2,4], x.permute((1,0)) produces an error

  • If x.shape is [3,2,4], x.permute((1,0,2,1)) produces an error
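
For example:

import torch

x = torch.zeros(3, 2, 4)
print(x.permute(1, 0, 2).shape)  # torch.Size([2, 3, 4])
print(x.permute(2, 1, 0).shape)  # torch.Size([4, 2, 3])
# x.permute(1, 0) raises an error: all three dimensions must be listed exactly once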

55 / 62

Data Types

  • Torch is a bit picky about data types (for optimization reasons)

  • Each tensor has a data type associated with it

  • If you have an integer tensor x, you can get a floating point tensor with x.float()

  • You can also specify an extra parameter dtype=torch.float64 for torch.tensor(...)
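
For example:

import torch

x = torch.tensor([1, 2, 3])  # integers, so the dtype is torch.int64
print(x.dtype)
print(x.float().dtype)       # torch.float32
y = torch.tensor([1, 2, 3], dtype=torch.float64)
print(y.dtype)               # torch.float64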

56 / 62

CUDA

  • CUDA ("Compute Unified Device Architecture") allows you to program your (Nvidia) graphics card

  • Computer graphics needs many vector operations in parallel, which the GPU can perform

  • Someone noticed that performing many vector operations in parallel is useful in other contexts as well

57 / 62

PyTorch and CUDA

  • You can move any tensor x to the graphics card by calling x.cuda() (if you have an Nvidia card and CUDA installed)

  • This will return a new tensor (don't mistakenly use the old one!)

  • Any operation on that tensor will then run on the graphics cards

  • Important: You cannot mix tensors that live on the graphics card with ones that live in RAM/on the CPU

RuntimeError: expected device cuda:0 but got device cpu
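
A minimal sketch, guarded so it also runs on machines without a GPU:

import torch

x = torch.randn(3)
if torch.cuda.is_available():
    x_gpu = x.cuda()  # a new tensor that lives on the GPU
    y = x_gpu * 2     # computed on the GPU
    print(y.cpu())    # move back before mixing with CPU tensors
    # x_gpu + x would raise the RuntimeError shown above
else:
    print("No CUDA device available")
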
58 / 62

Matplotlib and Seaborn

  • To make plots with Python, we use matplotlib (and seaborn for prettier graphs)

  • The usual convention is to import matplotlib.pyplot as plt and import seaborn as sns

  • plt has many different plot types available, and you can manipulate axes, labels, etc.

  • sns integrates into that by adding more plot types (and nicer colors, styles, etc.)

59 / 62

Matplotlib Scatterplot

import torch
import matplotlib.pyplot as plt
import seaborn as sns

x = [1, 2, 3, 4, 5]
y = torch.randn(5)
# line plot
plt.figure()
plt.plot(x, y)
plt.show()
# scatter plot and line in the same figure
plt.figure()
plt.scatter(x, y)
plt.plot(x, y)
plt.savefig("scatterplot.png")
# hexbin plot with marginal distributions (jointplot creates its own figure)
sns.jointplot(x, y, kind="hex")
plt.show()
60 / 62

Hexbin Plots

61 / 62
