Lecture 19: Supervised Learning

# Artificial Intelligence

### Supervised Learning

---

# Machine Learning

- Supervised Learning: `$ \{(x_1, y_1), . . . ,(x_n, y_n)\} $` 
Learn a mapping from examples.

- Unsupervised Learning: `$ \{x_1, . . . , x_m\} $` 
Learn an interesting thing about data.

- Semi-supervised Learning: `$ \{(x_1, y_1), . . . ,(x_n, y_n)\} \cup \{x_1, . . . , x_m\} $`

- Reinforcement Learning: Learn what to do in an environment, given feedback information.

---

# Machine Learning

Say there is a function `$f(\vec{x}) = y$`

- Supervised Learning: We know x and y, and are trying to find f

- Unsupervised Learning: We know x and are trying to find "interesting" f and y

- Reinforcement Learning: We know f*, and are trying to get "the best y" by choosing x

---

# Supervised Learning

* Today we will discuss supervised learning: We have x and y, and try to find f

* What are x and y?

* Depends on the problem type:

   - Classification: Determine a **discrete** class for the input

   - Regression: Predict/infer a **continuous** number as a function of the input

* Some approaches work for both tasks, in a way
   
---

# First: Inputs and Outputs

Remember our function `$f(\vec{x}) = y$`

- This function takes a vector of real numbers and produces one real number

- Unlike vectors you may have seen before, this vector is really just "an ordered collection of numbers"

- For example: We want to predict the price of Google/Alphabet stock given the day of the year, the temperature in Mountain View, the position of Mars in its orbit, and the number of Marvel movies released so far

- We construct a four-dimensional vector with one entry for each of these numbers

- Our (supervised) learning algorithm then has to figure out how to turn these four values into a stock price (not all values may be relevant)

---

# Vectors

* Vectors are neat because we have mathematical operations defined on them: Addition, subtraction, multiplication with a scalar, etc.

* One particularly important operation is the dot product:

$$
\vec{v} \cdot \vec{w} = \begin{pmatrix}v_1\\\\v_2\\\\\vdots\\\\v_n\end{pmatrix}\cdot\begin{pmatrix}w_1\\\\w_2\\\\\vdots\\\\w_n\end{pmatrix} =  v_1 \cdot w_1 + v_2 \cdot w_2 + \ldots + v_n \cdot w_n
$$

* We can use this to concisely define learning systems and algorithms!
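
As a minimal sketch, the dot product defined above fits in a few lines of Python. The function name `dot` and the example vectors are made up for illustration.

```python
def dot(v, w):
    """Dot product of two equal-length sequences of numbers."""
    if len(v) != len(w):
        raise ValueError("vectors must have the same length")
    # Multiply matching entries and add up the products
    return sum(vi * wi for vi, wi in zip(v, w))


v = [1, 2, 3]
w = [4, 5, 6]
result = dot(v, w)  # 1*4 + 2*5 + 3*6
print(result)
```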

---

# Learning

* We are given "some" example pairs of `$\vec{x}$` and y

* Our machine learning "algorithm" consists of two parts:

   - A representation of a predicted function `$\hat{f}$`

   - A search (or "optimization") procedure
   
* We split our examples into three sets:

   - Training Set

   - Validation Set

   - Test Set
   
---

# Learning

* To learn a (hopefully) "good" `$\hat{f}$` we pass our training set to our search procedure

* This will produce "some" `$\hat{f}$`

* We then **measure** how well this works on the **validation set**

* If it works well, great; if not, maybe we can tweak some parameters of the search procedure or representation and try again

* Only at the **very end** do we try the algorithm on the test set to report generalization performance!
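
One possible way to produce the three sets is a shuffled split. This is only a sketch: the 70/15/15 proportions, the fixed seed, and the name `split_examples` are assumptions for illustration, not something the lecture prescribes.

```python
import random


def split_examples(examples, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle the examples and cut them into training, validation, and test sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # everything left over
    return train, val, test


# Hypothetical data: 100 (x, y) pairs
examples = [([i, i * 2], i % 2) for i in range(100)]
train, val, test = split_examples(examples)
print(len(train), len(val), len(test))
```

The test set is created once and then left alone until the final evaluation.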

---

# Learning

* Different learning algorithms differ in how they produce this `$\hat{f}$`

* Depending on the approach there may be many parameters to decide

* We also need to decide **how** to measure how well our approach does

* Let's start with classification

---

# Classification

---

# Classification

A classification problem is one where you are trying to predict one of a discrete set of values, for example:

* Given a set of input features predict whether a breast cancer is benign or malignant

* Given an image correctly classify it as containing cats or dogs (or birds, or ...)

* From a given email determine whether it's spam or not

* Determine if a person's fingerprint is valid or not

---

# Types of classification

* *Binary classification*: when there are only two classes to predict, usually labeled as 0 and 1 (e.g. valid = 0, invalid = 1; or not spam = 0, spam = 1)

* *Multi-Class Classification*: When there are more than two class labels to predict, e.g. three types of flower species, image classification problems where there are more than a thousand classes (facial recognition), etc.

---

# Binary Classification

* If we have two classes, we usually like to maintain a continuous output in the interval [0,1]

* This can be interpreted as a "probability" to a certain degree

* However, the ultimate output for a given task is likely a binary choice

* We will have to decide how "certain" we want to be of something (e.g. we classify as spam what our algorithm says is spam with more than 95% probability)

---

# Multi-Class Classification

* We can also have more than two classes

* In this case, we would assign each class a label that we want our classifier to produce

* Since we're working with computers, these labels will usually be numbers again

* We could say a value between 0 and 1 is a cat, between 1 and 2 is a microwave, between 2 and 3 is a dog, etc.

* We will see that that is not necessarily a good idea, and there are often better encodings
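
One commonly used alternative is a "one-hot" encoding, where each class gets its own position in a vector. The sketch below assumes this is the kind of encoding meant; the helper `one_hot` and the class list are made up for illustration.

```python
def one_hot(label, classes):
    """Vector with a 1 at the label's position and 0 everywhere else."""
    return [1 if c == label else 0 for c in classes]


classes = ["cat", "microwave", "dog"]
encoded = [one_hot(label, classes) for label in ["dog", "cat"]]
print(encoded)
```

Unlike the interval encoding, this does not suggest that a microwave lies "between" a cat and a dog.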

---

# Classification Metrics

---

# Classification Metrics

* In order to be able to tell if our classifier is "working", we need some metrics to determine its quality

* We will have a training set, used to train our classifier, and then we will measure its "error" on the validation set

* Imagine the problem of classifying images as cats or non-cats: What errors could we make?

---

# Error Types

<img src="/PF-3115/assets/img/error_types.png" width="40%"/><br/>
Source: https://xkcd.com/2303/

---

# Confusion Matrix/Contingency Table

* The confusion matrix or error matrix is a tabular representation of the model predictions vs ground truth labels

* Each row represents the instances in a predicted class and each column represents the instances in an actual class (some people flip the two, so always label the axes!)

* For example, say we have a binary classifier that classifies cat- and non-cat images. We also have a validation set with 1100 images (1000 non-cat images and 100 cat images)

---

# Overview of the Matrix:

The diagonal elements are the correct prediction for different classes (true positives/negatives), the off-diagonal elements are samples which are mis-classified, or false positives/negatives.
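
As a rough sketch, the four cells of a binary confusion matrix can be counted directly from the actual and predicted labels. The function name `confusion_counts` and the tiny label lists are made up for illustration.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count true positives, false positives, true negatives, and false negatives."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # predicted cat, really a cat
        elif p == positive:
            fp += 1  # predicted cat, really not a cat
        elif a == positive:
            fn += 1  # predicted non-cat, really a cat
        else:
            tn += 1  # predicted non-cat, really not a cat
    return tp, fp, tn, fn


# Made-up labels: 1 = cat, 0 = non-cat
actual = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]
tp, fp, tn, fn = confusion_counts(actual, predicted)
print(tp, fp, tn, fn)
```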

---

# Metrics

* **Accuracy**: The number of correct predictions divided by the total number of predictions (percentage of samples we got right)

* **Precision**: Percentage of true positives among all samples identified as positive (i.e. what percentage of images we labeled as "cats" were actually cats)

* **Recall**: Percentage of true positives among all samples from a class (i.e. what percentage of cat pictures did we identify as "cats")

* **F1-Score**: Harmonic mean between precision and recall: `$2\cdot \text{Precision} \cdot \text{Recall}/(\text{Precision} + \text{Recall})$`

---

# Metrics

$$ \text{accuracy} = \frac{\mathit{TP} + \mathit{TN}}{\mathit{TP} + \mathit{FP} + \mathit{TN} + \mathit{FN}} = \frac{90 + 940}{90 + 60 + 940  + 10} = 0.94$$

$$ \text{precision} = \frac{\mathit{TP}}{\mathit{TP} + \mathit{FP}} = \frac{90}{90+ 60} = 0.6 $$

$$ \text{recall} = \frac{\mathit{TP}}{\mathit{TP} + \mathit{FN}} = \frac{90}{90 + 10} = 0.9$$

$$ \text{F1} = \frac{2\cdot\text{precision}\cdot\text{recall}}{\text{precision} + \text{recall}} = \frac{2\cdot 0.6\cdot 0.9}{0.6+0.9} = 0.72$$
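
The same numbers can be reproduced with a short script. This is a sketch using the counts from the formulas above (TP = 90, FP = 60, TN = 940, FN = 10); the function name `binary_metrics` is made up, and it does not guard against division by zero.

```python
def binary_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, and F1 from the four confusion matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


accuracy, precision, recall, f1 = binary_metrics(tp=90, fp=60, tn=940, fn=10)
print(round(accuracy, 2), round(precision, 2), round(recall, 2), round(f1, 2))
```

Note how the accuracy looks high even though 40% of the images labeled "cat" are wrong: with 1000 non-cat images, the true negatives dominate.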

---

# Cutoff thresholds

* Recall that our model usually produces a continuous output in the interval [0,1], which we interpreted as "probabilities"

* For example, for three images a model might predict 0.45, 0.6, and 0.7. Depending on the threshold values, the labels will change:

- cut-off = 0.5: predicted labels = 0, 1, 1 (default threshold)

- cut-off = 0.2: predicted labels = 1, 1, 1

- cut-off = 0.8: predicted labels = 0, 0, 0

* Changing the threshold changes the labels, which affects precision and recall
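
The thresholding step can be sketched as follows. Treating a score exactly equal to the cut-off as class 1 is an assumption here, and `apply_cutoff` is a made-up name.

```python
def apply_cutoff(scores, cutoff=0.5):
    """Label a score 1 if it reaches the cut-off, otherwise 0."""
    return [1 if s >= cutoff else 0 for s in scores]


scores = [0.45, 0.6, 0.7]
for cutoff in (0.5, 0.2, 0.8):
    print(cutoff, apply_cutoff(scores, cutoff))
```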

---

# Multi-class Metrics

* What if we have more than two classes?

* We can calculate each metric (precision, recall, F1) **per class**, e.g. the precision of "cat", "dog", "fish", "microwave"

* We can then calculate the (weighted or unweighted) "average" of these metrics to gain one "overall" metric

* Usually it's best to report **all** values
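
A small sketch of the per-class computation, followed by an unweighted ("macro") average. The labels are made up, the function name is hypothetical, and returning 0.0 when a class is never predicted is just one possible convention.

```python
def per_class_precision_recall(actual, predicted):
    """Return {label: (precision, recall)}, treating each class as "positive" in turn."""
    results = {}
    for label in sorted(set(actual) | set(predicted)):
        pairs = list(zip(actual, predicted))
        tp = sum(1 for a, p in pairs if a == label and p == label)
        fp = sum(1 for a, p in pairs if a != label and p == label)
        fn = sum(1 for a, p in pairs if a == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        results[label] = (precision, recall)
    return results


actual = ["cat", "cat", "dog", "dog", "fish", "fish"]
predicted = ["cat", "dog", "dog", "dog", "fish", "cat"]
scores = per_class_precision_recall(actual, predicted)

# Unweighted average: every class counts the same, regardless of its size
macro_precision = sum(p for p, _ in scores.values()) / len(scores)
macro_recall = sum(r for _, r in scores.values()) / len(scores)
print(scores, macro_precision, macro_recall)
```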

---

# Classification Algorithms

---

# Decision Trees

* When the input variables are categorical/discrete it is possible to apply logical rules to arrive at an answer

* This can be done without the need to go through numerical computations. In these cases tree-based models are very useful

* Continuous input variables can be discretized

---

# Decision Tree Classifier

---

# Decision Tree Learning

* There are several algorithms to actually **learn** the tree from data (given features and labels)

* One set of algorithms (ID3, C4.5, and C5.0) uses the concept of entropy and information gain

$$
H(S) = - \sum_x p(x) \log_2 p(x)
$$

* If all items are from one class, this entropy is 0; with two classes and exactly half of the items in each, it is 1 ("number of bits needed to represent the information")

* ID3 and C4.5 choose the feature that decreases the entropy the most (e.g. if **all** emails that contain "viagra" are spam, the resulting entropy for that part of the data will be 0), and then recursively apply the same algorithm to the branches
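
The entropy and the "decrease in entropy" (information gain) for one split can be sketched like this. Base-2 logarithms are used so that a 50/50 split of two classes gives exactly 1 bit. The toy emails and function names are made up, and this covers only the feature selection step, not a full ID3 implementation.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Entropy before the split minus the weighted entropy of the groups after it."""
    n = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Made-up emails: does the text contain the given word?
rows = [
    {"viagra": True, "hello": True},
    {"viagra": True, "hello": False},
    {"viagra": False, "hello": True},
    {"viagra": False, "hello": False},
]
labels = ["spam", "spam", "ham", "ham"]

# Pick the feature with the largest information gain
best = max(rows[0], key=lambda f: information_gain(rows, labels, f))
print(best)
```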

---

# Limitations

* Practical algorithms use a (greedy) heuristic, which can lead to trees that generalize poorly
  
  * Pruning (trimming the tree) may be needed

* Trees are not very good for non-structured data, such as images

* Trees capture neither causality nor ordering in sequences

---

# Support Vector Classifiers

---

# Support Vector Machines

* Once again we are given training data consisting of features and labels

* We interpret our features as vectors in n-dimensional space, each with its label

* Our goal is to find a (hyper-)plane, such that all data points from one class are on one side, and the ones from the other class on the other side of this plane

* We also want this plane to have the maximum possible distance ("margin") from the data points it is closest to (the "support vectors")
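
To make "margin" concrete, here is a sketch that measures the distance of points from a given hyperplane `$\vec{w} \cdot \vec{x} + b = 0$`. The hyperplane below is picked by hand, not learned, and the data points and function names are made up.

```python
from math import sqrt


def signed_distance(w, b, x):
    """Signed distance of point x from the hyperplane w . x + b = 0."""
    norm = sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def margin(w, b, points):
    """Smallest absolute distance of any point from the hyperplane."""
    return min(abs(signed_distance(w, b, x)) for x in points)


# Hypothetical 2D data and a hand-picked hyperplane x1 + x2 - 3 = 0
w, b = [1.0, 1.0], -3.0
positives = [[3.0, 3.0], [4.0, 2.0]]
negatives = [[0.0, 0.0], [1.0, 0.0]]

print([signed_distance(w, b, x) for x in positives + negatives])
print(margin(w, b, positives + negatives))
```

The sign tells us which side of the plane a point is on; an SVM searches for the `w` and `b` that make the smallest distance as large as possible.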

---

# Maximum Margin Hyperplane

---

# Support Vector Classifier

* Data can usually not be quite as cleanly separated

* We specify a parameter "C" that says how costly "errors" are

* We will end up with an optimization problem to maximize the margin while minimizing the cost of errors
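
One standard way to write this trade-off is the hinge-loss objective `$\frac{1}{2}\|\vec{w}\|^2 + C \sum_i \max(0, 1 - y_i(\vec{w}\cdot\vec{x}_i + b))$` with labels `$y_i \in \{-1, +1\}$`. The sketch below assumes that formulation and only evaluates the objective for a given plane; it does not perform the optimization, and all names and data are made up.

```python
def soft_margin_objective(w, b, points, labels, C=1.0):
    """0.5 * ||w||^2 + C * (sum of hinge losses); labels must be +1 or -1."""
    regularizer = 0.5 * sum(wi * wi for wi in w)  # small ||w|| means a large margin
    hinge = 0.0
    for x, y in zip(points, labels):
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        hinge += max(0.0, 1.0 - y * score)  # 0 if the point is safely on its side
    return regularizer + C * hinge


points = [[2.0, 2.0], [0.0, 0.0], [1.0, 1.0]]
labels = [+1, -1, -1]  # the last point ends up on the "wrong" side of the plane below
w, b = [1.0, 1.0], -1.5

low_c = soft_margin_objective(w, b, points, labels, C=0.1)
high_c = soft_margin_objective(w, b, points, labels, C=10.0)
print(low_c, high_c)
```

With a large C the single misclassified point dominates the objective, so the optimizer would rather shrink the margin than tolerate the error.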

---

# Projection

* What if our data is not linearly separable at all?

* We can "project" it into more dimensions

* For example

$$
(x_1, x_2) \mapsto (x_1^2, x_2^2, \sqrt{2} x_1 \cdot x_2)
$$

And then we "hope" that our data can be linearly separated in these 3 dimensions!

Or we use more dimensions, maybe even infinitely many ...

---

# The Kernel Trick

How do you calculate the distance of a point from a (hyper-)plane?

The dot product!

Here is a neat trick: We don't need to actually "project" our data, we just need to calculate the dot product "as if" we had projected it.

$$
x = (x_1,x_2) \mapsto X = (x_1^2, x_2^2, \sqrt{2}x_1\cdot x_2)\\\\
(x_i \cdot x_j)^2 = X_i \cdot X_j
$$
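
The identity is easy to check numerically. In this sketch, `project` performs the explicit mapping and `poly_kernel` computes the same dot product without ever leaving two dimensions; both names and the example points are made up.

```python
from math import sqrt


def dot(v, w):
    return sum(vi * wi for vi, wi in zip(v, w))


def project(x):
    """Explicit map (x1, x2) -> (x1^2, x2^2, sqrt(2) * x1 * x2)."""
    x1, x2 = x
    return (x1 * x1, x2 * x2, sqrt(2) * x1 * x2)


def poly_kernel(x, y):
    """Dot product "as if" projected, computed in the original 2D space."""
    return dot(x, y) ** 2


a, b = (1.0, 2.0), (3.0, 4.0)
explicit = dot(project(a), project(b))
implicit = poly_kernel(a, b)
print(explicit, implicit)
```

The two values agree up to floating point rounding, but the kernel version never builds the 3-dimensional vectors.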

---

# SVM with Kernel

---

# Limitations

* We have to tweak C and the kernel function, but often we don't know if/how the data can be linearly separated even in higher spaces

* There is no "interpretation" of the classification in terms of probabilities

* Not well suited for categorical data, because we want to interpret everything as a vector

* Slow for large data sets

* If we have more than two classes, the/one "standard" way is to train multiple SVMs

---

# Other Approaches

* Logistic Regression

* Naive Bayes

* k-Nearest Neighbors

* Neural Networks (next time!)

---

# Regression

---

# Regression

* The outputs of our classification problems were classes

* The output of a regression problem is a number

* In a way, this is "simpler", because we can just math our way through

* Note: Even for classification we mathed it out, and said "the output will be a probability (i.e. a number)"

---

# Learning?

* Now we have a prediction for y (which we call `$\hat{y}$`)

* What is our goal? To get a good prediction

* How would we measure that?

* Just as for classification, we need some metric again!

---

# Mean Absolute Error

* The MAE is the average magnitude of error in a set of predictions, i.e. `$\frac{1}{n}\sum_i | y_i - \hat{y}_i|$`

* This has one big advantage: It uses the same scale as the response variable (output)

* For example, if your y measures Google stock price in USD, an MAE of 2 amounts to an average error of $2

* It weighs all errors equally, i.e. does not consider outliers "worse" than many small errors
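
A minimal sketch of the MAE computation, with made-up prices and predictions:

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |y_i - y_hat_i| over all examples."""
    return sum(abs(y - y_hat) for y, y_hat in zip(y_true, y_pred)) / len(y_true)


y_true = [100.0, 102.0, 98.0, 101.0]
y_pred = [101.0, 100.0, 99.0, 105.0]
mae = mean_absolute_error(y_true, y_pred)  # (1 + 2 + 1 + 4) / 4
print(mae)
```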

---

# Mean Squared Error

* MSE or mean square error is one of the most used metrics in regression

* It is the average (over all examples) squared error between the predicted values `$\hat{y}$` and the actual values `$y$`: `$\frac{1}{n}\sum_i (y_i - \hat{y}_i)^2$`

* Why squared? Non-negative and punishes outliers more

* It's also differentiable, if you care about that sort of thing (spoilers: we do!)

* A numeric interpretation may be tricky, since it does not use the same scale as y, but we can take the square root to get the Root Mean Squared Error (RMSE)
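
A matching sketch for MSE and RMSE, again with made-up numbers:

```python
from math import sqrt


def mean_squared_error(y_true, y_pred):
    """Average of (y_i - y_hat_i)^2 over all examples."""
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / len(y_true)


y_true = [100.0, 102.0, 98.0, 101.0]
y_pred = [101.0, 100.0, 99.0, 105.0]
mse = mean_squared_error(y_true, y_pred)  # (1 + 4 + 1 + 16) / 4
rmse = sqrt(mse)  # back on the same scale as y
print(mse, rmse)
```

The single error of 4 contributes 16 of the 22 squared-error units, which is what "punishes outliers more" means in practice.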

---

# Linear Regression

* Say we have a vector with four values, and want to predict the Google stock price

* We also add an extra "1" at the end, because math

$$
\begin{pmatrix}w_1\\\\w_2\\\\w_3\\\\w_4\\\\b\end{pmatrix}\cdot \begin{pmatrix}d\\\\t\\\\o\\\\m\\\\1\end{pmatrix} =  w_1\cdot d + w_2 \cdot t + w_3\cdot o + w_4\cdot m + b = \hat{y}
$$
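
In code, the extra "1" lets the bias ride along in the same dot product. A sketch with made-up parameters and inputs (here d, t, o, m stand for day, temperature, orbit position, and movie count):

```python
def predict(weights_and_bias, features):
    """Linear model: dot product of (w_1, ..., w_n, b) with (x_1, ..., x_n, 1)."""
    extended = list(features) + [1.0]  # append the constant 1 for the bias
    return sum(w * x for w, x in zip(weights_and_bias, extended))


# Made-up parameters (w_1, w_2, w_3, w_4, b) and one input (d, t, o, m)
params = [0.5, -1.0, 0.0, 2.0, 10.0]
x = [100.0, 20.0, 0.3, 30.0]
y_hat = predict(params, x)  # 0.5*100 - 1*20 + 0*0.3 + 2*30 + 10
print(y_hat)
```

A weight of 0 (as for the position of Mars here) is how the model expresses that a feature is irrelevant.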

---

# MSE

$$
\begin{aligned}
\mathit{MSE} =& \frac{1}{n}\sum_i (y_i - \hat{y}_i)^2 \\\\
             =& \frac{1}{n}\sum_i\\left(y_i - \begin{pmatrix}w_1\\\\w_2\\\\w_3\\\\w_4\\\\b\end{pmatrix}\cdot \begin{pmatrix}d_i\\\\t_i\\\\o_i\\\\m_i\\\\1\end{pmatrix}\\right)^2
\end{aligned}
$$

And now? How do we learn a good "function" with this?

Well, we need to choose the **weights** `$w_1, w_2, w_3, w_4$` and the **bias** `$b$`

---

# A Small Matter of Calculus

* What we want is the **smallest** MSE

* How do we minimize a function? We differentiate it and set the derivative to 0

* Then we "solve" for `$w_1, w_2, w_3, w_4, b$`

* I hope you didn't sleep through your calculus and linear algebra classes ...

Just kidding. We're not going to do that.

---

# Ordinary Least Squares

* In your statistics class you may actually have seen this exact process

* Some statistics classes teach it for one variable only, and call it something like `$y_i = \alpha + \beta x_i + \varepsilon_i$`

* Or you may have seen it for any number of variables, with a solution of `$\beta = (X^T X)^{-1} X^T y$`

* In this case, the matrix X contains **all** training examples, y is a vector of **all** training outputs, and `$\beta$` will be the weights + bias as a vector

* The problem is: This only works because our model is linear!

* Since our model is linear, and our error function quadratic, there is only one minimum, for which we can get a closed-form solution
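
For the one-variable case, the closed-form solution reduces to two short formulas: `$\beta = \sum_i (x_i - \bar{x})(y_i - \bar{y}) / \sum_i (x_i - \bar{x})^2$` and `$\alpha = \bar{y} - \beta\bar{x}$`. A sketch in plain Python, with made-up data and a hypothetical function name `fit_line`:

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = alpha + beta * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta = sxy / sxx  # slope (the "weight")
    alpha = mean_y - beta * mean_x  # intercept (the "bias")
    return alpha, beta


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated from y = 1 + 2x without noise
alpha, beta = fit_line(xs, ys)
print(alpha, beta)
```

No search is needed: the weights come straight out of the formula, which is exactly what stops working once the model is no longer linear.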

---

# Beyond Linearity

Next time we will solve several issues:

* We need something more than just a linear model (preferably arbitrarily complex)

* On the other hand, we want to be able to calculate "good" values for our "weights" and "bias"

* It would also be nice if we can handle "arbitrary" input (numbers, pixels, etc.)

* And maybe there can be multiple options for output as well (numbers, probability, one of multiple classes, etc.)

That's where Neural Networks come in!

---
  
# References

* [Multi-class Metrics Made Simple](https://towardsdatascience.com/multi-class-metrics-made-simple-part-i-precision-and-recall-9250280bddc2?gi=d80b77850252)

* [SKLearn Classification Report](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html)

* [Deep Learning With PyTorch](https://pytorch.org/assets/deep-learning/Deep-Learning-with-PyTorch.pdf)