class: center, middle

# Computational and statistical techniques of Machine Learning

### I 2020

---

class: medium

# Instructor and Schedule

* Instructors: Dra. Marcela Alfaro Córdoba, Dr. Markus Eger
* Email: marcela.alfarocordoba@ucr.ac.cr, markus.eger.ucr@gmail.com
* Office hours:
  - Markus: Monday, 4pm-5.30pm, Tuesday, 4.30pm-5.25pm, Thursday, 3pm-5pm, office 3-23, ECCI Anexo
  - Marcela: Tuesday, 4.30pm-5.25pm, office 3-23, ECCI Anexo
* Class: Tuesday, 5.30pm-9pm, Lab 102, Edificio ECCI
---

class: medium

# About Markus

* Originally from Austria
* BSc and MSc in Computer Science from the University of Technology Graz, Austria
* PhD in Computer Science from NC State University, USA, working on game AI for games involving communication
* Games: Smite, Guild Wars 2, incremental games
* I also like board games (Ricochet Robots, Dominion, Brewcrafters, ...)

---

class: medium

# About Marcela

* Originally from San Ramón, Alajuela
* BSc in Statistics from UCR
* MSc from Iowa State University, USA
* PhD in Statistics from NC State University, USA, working on spatio-temporal models for atmospheric data
* Games: 2048, obsessively
* I like communities: R-Ladies, RDA, CODATA, ASA, Datos Abiertos, etc.

---

# About Markus and Marcela
---

# About You

* Name, program?
* Games, communities, interests?
* Fun facts?

---

# Class Resources

* Website: http://bit.ly/PF-3115
---

class: small

# Class contents

* Python and Stats Introduction/Revision (10/3-17/3)
* Two worlds, two vocabularies: Statistics and Computer Science (24/3)
* Supervised Learning: Neural Networks, SVMs, etc. (31/3-5/5)
* Dimensionality Reduction (12/5)
* Bayesian Statistics (19/5)
* Advanced Topics (26/5-2/6)
* Data Collection, Storage, Ethical Considerations (9/6-23/6)

---

class: mmedium

# Labs

* Each class will be a combination of lecture and lab work
* There will be 6 different lab exercises
* You should work on each lab during class time and finish it at home, if necessary
* The deadline for submission is before the start of the next lab session (Tuesday, before class)
* Submit labs by email to both professors
* For each lab you have to work in groups of two: one statistician and one computer scientist

---

class: medium

# Labs

* Lab 1: Python/Stats/PyTorch intro, 10/3 - 24/3
* Lab 2: Regression, 31/3 - 21/4
* Lab 3: Classification, 28/4 - 5/5
* Lab 4: Dimensionality Reduction, 12/5 - 19/5
* Lab 5: GANs, 26/5 - 9/6
* Lab 6: Ethics, 16/6 - 23/6

---

# Project

* In addition to the labs, you will work on a semester-long project of your choice
* Work in groups of two: one statistician and one computer scientist
* We will provide several ideas for projects, but feel free to propose your own
* **Important:** We encourage creative ideas, but the more "interesting" your proposal, the more important it is to coordinate it with us **beforehand**

---

# Project

* 31/3: Proposal: 10 min presentation and a document
* 28/4: Update 1: 7 min presentation and a document
* 26/5: Update 2: 3 min presentation and a document
* 23/6: Q&A
* 30/6: Presentations: 15 min presentation and a document

---

class: medium

# Project Ideas

* We have game logs from an experiment with the cooperative game Hanabi
* Over 200 players played about 2000 games
* Ideas: Group players according to play style, analyze differences between experience levels, predict player actions depending on the game state, etc.

---

# Project Ideas

* I (Markus) have about 16000 game logs from Hearthstone games (and a Python script to download more)
* Games come tagged with deck type, player name, winner, and some other information
* Ideas: Analyze deck type development over time, predict the winner after `n` turns, etc.

---

# Project Ideas

* Collect data from a game (like PUBG or Fortnite): game logs, network traffic, maybe CPU/GPU usage
* Some people claim that the game starts to lag when someone is coming up behind them
* Investigate whether there is such a correlation

---

# Project Ideas

* The Smithsonian recently made [3 million pieces of art digitally available](https://www.si.edu/openaccess)
* For example, there is a [collection of stamps](http://collections.si.edu/search/gallery.htm?og=postage-stamps)
* We could classify stamps by type, year, price, etc.
* Or we could use a GAN to generate more stamps of a particular type

---

class: medium

# Project Ideas

* Public data you can use
  - Data from [538](https://data.fivethirtyeight.com/)
  - Public data from [Medium](https://medium.com/towards-artificial-intelligence/the-50-best-public-datasets-for-machine-learning-d80e9f030279)
  - San Francisco [Open Data](https://datasf.org/opendata/)
  - Bioinformatics [data sets](https://bmcresnotes.biomedcentral.com/articles/10.1186/s13104-020-4922-8)
* Bring your own data: let us know ASAP :)

---

class: mmedium

# Grading

* Lab 1: 15% - document with code
* Labs 2-6: 10% each - document with code
* Project Proposal: 5% - presentation and document
* Project Updates: 5% each - presentation and document
* Final Project: 20% - presentation and document

Specific formats will be clarified in class.

---

# Textbooks

* The Hundred-Page Machine Learning Book, [Burkov](http://themlbook.com/)
* Deep Learning with PyTorch, [Stevens et al.](https://pytorch.org/deep-learning-with-pytorch)
* The Elements of Statistical Learning, [Hastie et al.](https://web.stanford.edu/~hastie/ElemStatLearn/)
* If you are completely lost with Python, you might want to start [here](https://swcarpentry.github.io/python-novice-inflammation/)

---

class: center, middle

# Introduction

---

# Definitions - Brainstorming

Artificial Intelligence | Machine Learning | Statistics | Data Science | Computer Science
.tiny[Source: https://jamesmccaffrey.wordpress.com/2016/09/29/machine-learning-data-science-and-statistics/machinelearningdatasciencestatistics/]

---

# Why This Course?
---

# Observe
---

# Machine Learning

- Supervised Learning: `\( \{(x_1, y_1), \ldots, (x_n, y_n)\} \)`

  Learn a mapping from examples.
- Unsupervised Learning: `\( \{x_1, \ldots, x_m\} \)`

  Learn something interesting about the data.
- Semi-supervised Learning: `\( \{(x_1, y_1), \ldots, (x_n, y_n)\} \cup \{x_1, \ldots, x_m\} \)`
- Reinforcement Learning: Learn what to do in an environment, given feedback information.

We will cover some algorithms from each of these areas in this course!

---

# Machine Learning

Say there is a function `\(f(\vec{x}) = y\)`

- Supervised Learning: We know x and y, and are trying to find f
- Unsupervised Learning: We know x, and are trying to find an "interesting" f and y
- Reinforcement Learning: We know f*, and are trying to get "the best y" by choosing x

.tiny[*: Terms and Conditions may apply]

---

class: medium

# Supervised Learning?

- What we want is to give the computer "some" x and y values, and have it find the "general connection" between them
- The function that we learn is not just a memorization of the values: it should also give us a "good" value of y for an x we haven't seen before
- Inter- and extrapolation: from the known data, we can "predict" values for new inputs
- But what are x and y?

---

class: medium

# But first - Stats 101

- Descriptive statistics vs. inference
- Describing vs. predicting - why is this useful?
- Correlation vs. causation: https://xkcd.com/925/

---

class: center, middle

# Some Linear Algebra

---

class: small

# Functions

Remember our function `\(f(\vec{x}) = y\)`

- This function takes a vector of real numbers and produces one real number
- Unlike vectors you may have seen before, this vector is really just "an ordered collection of numbers"
- For example: we want to predict the price of Google stock given the day of the year, the temperature, the position of Mars in its orbit, and the number of Marvel movies released so far
- We construct a four-dimensional vector with one entry for each of these numbers
- Our (supervised) learning algorithm then has to figure out how to turn these four values into a stock price (not all values may be relevant)

---

class: medium

# Vectors

* Vectors are neat because we have mathematical operations defined on them: addition, subtraction, multiplication with a scalar, etc.
* One particularly important operation is the dot product:

$$ \vec{v} \cdot \vec{w} = \begin{pmatrix}v_1\\\\v_2\\\\\vdots\\\\v_n\end{pmatrix}\cdot\begin{pmatrix}w_1\\\\w_2\\\\\vdots\\\\w_n\end{pmatrix} = v_1 \cdot w_1 + v_2 \cdot w_2 + \ldots + v_n \cdot w_n $$

* We will use this to concisely define learning systems and algorithms! (A quick code version is on the next slide.)
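---

class: medium

# Vectors

The dot product is also easy to write out in plain Python. A minimal sketch (the example vectors are arbitrary); later we will use library implementations such as `torch.dot` instead:

```Python
def dot(v, w):
    # v and w are equal-length lists of numbers
    return sum(vi * wi for vi, wi in zip(v, w))

print(dot([1, 2, 3], [4, 5, 6]))  # 1*4 + 2*5 + 3*6 = 32
```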
---

class: medium

# Vectors and Matrices and Tensors, oh my!

* Vector: an ordered list of numbers
* Matrix: a grid of numbers
* We could store a matrix in a vector, if we remember the dimensions!
* **Tensor**: what if we "need" more dimensions?
* For example: we have 10000 images of 28x28 pixels.

  We store them in a 10000x28x28 tensor
* In memory the pixels will still be stored sequentially; the tensor is really just a different "view" on the data

---

# Review of Statistical Learning

http://www.mit.edu/~9.520/fall18/slides/class02_SLT.pdf

---

class: center, middle

# Python

---

# Python: How to start

* To develop in Python, use your favorite text editor
* Save the file with the extension ".py"
* Then you can run the file from the command line with `python <filename>.py`
* There are some ways to make this more "comfortable"

---

# PyCharm

* PyCharm is an IDE for Python
* You can write your code, as with any other editor
* To run it, you can use a button in the top right corner that will open a console on the bottom and run your code
* You will need to tell PyCharm which file to run, though

---

# Python: Hello World

```Python
import sys

def main(who):
    print("Hello", who)
    for i in range(100):
        if (i+1)%10 == 0:
            print("Iteration %d"%(i+1))

if __name__ == "__main__":
    if len(sys.argv) > 1:
        main(sys.argv[1])
    else:
        main("World")
```

Save as [`helloworld.py`](/PF-3115/assets/helloworld.py), run with:

```
python helloworld.py Universe
```

---

class: medium

# Python vs. R

* One of the main differences you will run into: Python arrays (and matrices, vectors, tensors, etc.) start with index 0
* `a[0]` is the first element
* `a[-1]` is the last element
* `a[1:]` is everything except the first element
* `a[:-1]` is everything except the last element
* `a[2:5]` is the 3rd to the 5th element

A runnable version of these rules is on the next slide.
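---

class: medium

# Python vs. R

A quick sketch you can paste into a Python console to try the indexing rules (the list contents are arbitrary):

```Python
a = [10, 20, 30, 40, 50]

print(a[0])    # 10 - first element
print(a[-1])   # 50 - last element
print(a[1:])   # [20, 30, 40, 50] - everything except the first
print(a[:-1])  # [10, 20, 30, 40] - everything except the last
print(a[2:5])  # [30, 40, 50] - 3rd to 5th element
```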
---

# Python at Home

* Install Python 3.7 or 3.8, 64 bit, from https://www.python.org/
* You can install additional packages from the command line with `pip install <packagename>`
* For example, numpy can be installed with `pip install numpy`
* If you don't have administrator rights, use `pip install --user numpy`
* pip will automatically install all dependencies

---

class: center, middle

# PyTorch

---

# PyTorch

* PyTorch is a Python library for Machine Learning
* At every major Machine Learning conference last year, the majority of papers used PyTorch
* The other "big one" is Tensorflow, which is a bit older and therefore has more adoption in industry
* PyTorch is easier to get started with, and once you know the core concepts you can easily pick up Tensorflow, too

---

# PyTorch vs. Tensorflow

---

class: medium

# PyTorch

* (Almost) everything is a `torch.Tensor`!
* These tensors really are just views into a sequential array of numbers
* Each tensor can also "remember" where it came from (e.g. if it is the result of an addition)
* This means you can also view a tensor as a "tree" of computation, with the result at the top and the inputs as the leaves
* You can also tell your tensors that their operations should be performed on the GPU

---

class: medium

# Tensors in PyTorch

* The **dimensionality** of a tensor defines how many "sizes" it has
* The **shape** of a tensor tells us how many elements exist in each dimension
* A tensor with shape `[24,1]` has a different shape than a tensor using the same data with shape `[12,2]`, but also from a tensor with shape `[24]`, or `[24,1,1]`
* You can use `torch.matmul(x,y)` for a multiplication that automatically "fixes" dimensions, in most cases

---

# Tensors in PyTorch

Let `x` be a tensor with shape `[12,2]`

* `x[0,0]` is the first element
* `x[0]` (or `x[0,:]`) is the first **row** (a tensor of shape `[2]`)
* `x[:,0]` is the first **column** (a tensor of shape `[12]`)
* `x.T` is a tensor with shape `[2,12]` (the transpose)

---

# Tensors in PyTorch

Let `x` be a tensor with shape `[24]`

* `x > 0` produces a tensor that tells you for each element whether it is greater than 0 or not
* `x[x > 0]` produces a tensor with all values that are greater than 0
* `x[labels == "C"]` produces a tensor with all `x` values which have label "C" (`labels` has to be a tensor containing the labels for each entry in `x`)

Note: In many cases our `x` will have more than one dimension. `x[labels == "C",:]` does the same for tensors with dimensionality 2, etc.

---

# Numpy and PyTorch

* A very popular library for linear algebra in Python is numpy
* PyTorch interacts very nicely with numpy
* You can convert a tensor `x` to a numpy array with `x.numpy()`
* You can convert a numpy array `y` to a torch tensor with `torch.from_numpy(y)`

---

# Pandas

* Pandas is a library for data manipulation and analysis
* It is particularly nice for reading data from a variety of formats
* R programmers will also like the dataframes it provides (which are similar to the ones in R)
* It is built on top of numpy, and therefore plays nicely with PyTorch

---

# An Example

```Python
import pandas as pd
import numpy as np
import torch
import sys

def read_csv(fname, colx, coly):
    # read the CSV file and turn two of its columns into tensors
    data = pd.read_csv(fname)
    x = torch.tensor(data[colx])
    y = torch.tensor(data[coly])
    return x,y

def main(datafile):
    x,y = read_csv(datafile, "ActionLatency", "APM")
    print("mean x: %.2f, max y: %.2f"%(x.mean(), y.max()))
    # max with a dimension argument returns the value and its index
    ymax,ymaxat = y.max(0)
    print("max y: %.2f at index %d"%(ymax, ymaxat))
    print(x.shape, x[x > 125])
    # drop into the interactive debugger to explore the tensors
    import pdb; pdb.set_trace()

if __name__ == "__main__":
    main(sys.argv[1])
```

---

# Squeeze

Squeeze is used to *remove* one/all dimension(s) of size 1:

* If `x.shape` is `[12,2]`, `x.squeeze()` does *nothing*
* If `x.shape` is `[24,1]`, `x.squeeze()` produces a tensor of shape `[24]`
* If `x.shape` is `[24,1,1]`, `x.squeeze()` produces a tensor of shape `[24]`
* If `x.shape` is `[24,1,1]`, `x.squeeze(1)` produces a tensor of shape `[24,1]`

---

# Unsqueeze

Unsqueeze is used to *insert* a dimension of size 1:

* If `x.shape` is `[12,2]`, `x.unsqueeze(0)` produces a tensor of shape `[1,12,2]`
* If `x.shape` is `[12,2]`, `x.unsqueeze(1)` produces a tensor of shape `[12,1,2]`
* If `x.shape` is `[12,2]`, `x.unsqueeze(2)` produces a tensor of shape `[12,2,1]`

Both operations are demonstrated on the next slide.
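---

class: medium

# Squeeze and Unsqueeze

A minimal sketch to try these out (the tensor contents are zeros; only the shapes matter):

```Python
import torch

x = torch.zeros(24, 1)
print(x.squeeze().shape)     # torch.Size([24])
print(x.squeeze(1).shape)    # torch.Size([24])

y = torch.zeros(12, 2)
print(y.squeeze().shape)     # torch.Size([12, 2]) - no size-1 dimension to remove
print(y.unsqueeze(0).shape)  # torch.Size([1, 12, 2])
print(y.unsqueeze(2).shape)  # torch.Size([12, 2, 1])
```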
---

class: mmmedium

# View

View is used to convert the shape of a tensor to something "arbitrary" (with the same total number of elements):

* If `x.shape` is `[12,2]`, `x.view(24)` produces a tensor of shape `[24]`
* If `x.shape` is `[24]`, `x.view((24,1))` produces a tensor of shape `[24,1]` (exactly like `x.unsqueeze(1)`)
* If `x.shape` is `[24]`, `x.view((2,3,4))` produces a tensor of shape `[2,3,4]`
* If `x.shape` is `[24,1]`, `x.view(24)` produces a tensor of shape `[24]` (exactly like `x.squeeze(1)`)
* If `x.shape` is `[12,2]`, `x.view((8,3))` produces a tensor of shape `[8,3]`
* If `x.shape` is `[12,2]`, `x.view((8,6))` produces an error

---

class: medium

# View

**One** dimension passed to `view` can be `-1`. Because `view` knows how many elements there are in total, it will just put "the rest" there:

* If `x.shape` is `[12,2]`, `x.view(-1)` produces a tensor of shape `[24]`
* If `x.shape` is `[n]`, `x.view((n,-1))` produces a tensor of shape `[n,1]` (exactly like `x.unsqueeze(1)`)
* If `x.shape` is `[24]`, `x.view((2,-1,4))` produces a tensor of shape `[2,3,4]`
* If `x.shape` is `[24,1]`, `x.view(-1)` produces a tensor of shape `[24]` (exactly like `x.squeeze(1)`)

---

class: mmedium

# Permute

* `permute` allows you to reorder dimensions (useful when you have image data with color channels, for example)
* If `x.shape` is `[3,2,4]`, `x.permute((1,0,2))` produces a tensor of shape `[2,3,4]`
* If `x.shape` is `[3,2,4]`, `x.permute((2,1,0))` produces a tensor of shape `[4,2,3]`
* If `x.shape` is `[3,2,4]`, `x.permute((1,0))` produces an error
* If `x.shape` is `[3,2,4]`, `x.permute((1,0,2,1))` produces an error

---

# Data Types

* Torch is a bit picky about data types (for optimization reasons)
* Each tensor has a data type associated with it
* If you have an integer tensor `x`, you can get a floating point tensor with `x.float()`
* You can also specify an extra parameter `dtype=torch.float64` for `torch.tensor(...)`

---

# CUDA

* CUDA ("Compute Unified Device Architecture") allows you to program your (Nvidia) graphics card
* Computer graphics needs many vector operations in parallel, which the GPU can perform
* Someone noticed that performing many vector operations in parallel is useful in other contexts as well

---

class: medium

# PyTorch and CUDA

* You can move any tensor `x` to the graphics card by calling `x.cuda()` (if you have an Nvidia card and CUDA installed)
* This will return a **new** tensor (don't mistakenly use the old one!)
* Any operation on that tensor will then run on the graphics card
* **Important**: You can not mix tensors that live on the graphics card with ones that live in RAM/on the CPU

```
RuntimeError: expected device cuda:0 but got device cpu
```

---

# Matplotlib and Seaborn

* To make plots with Python, we use matplotlib (and seaborn for prettier graphs)
* The usual convention is to `import matplotlib.pyplot as plt` and `import seaborn as sns`
* `plt` has many different plot types available, and you can manipulate axes, labels, etc.
* `sns` integrates into that by adding more plot types (and nicer colors, styles, etc.)

---

# Matplotlib Scatterplot

```Python
import torch
import matplotlib.pyplot as plt
import seaborn as sns

x = [1,2,3,4,5]
y = torch.randn(5)

# line plot
plt.figure()
plt.plot(x,y)
plt.show()

# scatter plot and line in the same figure
plt.figure()
plt.scatter(x,y)
plt.plot(x,y)
plt.savefig("scatterplot.png")

# hexbin plot with marginal distributions
sns.jointplot(x=x, y=y, kind="hex")
plt.show()
```

---

# Hexbin Plots
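A hexbin plot like the one shown here can be generated with a sketch like the following; the data is randomly generated for illustration (the data behind the original figure is not available):

```Python
import torch
import seaborn as sns
import matplotlib.pyplot as plt

# two correlated variables (hypothetical data)
x = torch.randn(1000)
y = 0.5*x + 0.5*torch.randn(1000)

# hexbin plot with marginal histograms
sns.jointplot(x=x.numpy(), y=y.numpy(), kind="hex")
plt.show()
```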
---

# References

* [Starting with Pandas and Numpy](https://www.hackerearth.com/practice/machine-learning/data-manipulation-visualisation-r-python/tutorial-data-manipulation-numpy-pandas-python/tutorial/)
* [Numpy, Pandas, and Matplotlib](https://cloudxlab.com/blog/numpy-pandas-introduction/)
* [CUDA Toolkit](https://developer.nvidia.com/cuda-downloads) (requires an Nvidia card)
* [Matplotlib Sample Gallery](https://matplotlib.org/3.1.3/tutorials/introductory/sample_plots.html)
* [Seaborn Sample Gallery](https://seaborn.pydata.org/examples/index.html)