Lecture 5: Classification

# Computational and statistical techniques of Machine Learning

## Classification

---

# Classification

---

# Classification

A classification problem is one where you are trying to predict one of a discrete set of values, for example:

* Given a set of input features, predict whether a breast tumor is benign or malignant.

* Given an image, classify it as containing a cat or a dog.

* From a given email determine whether it's spam or not.

---

# Types of classification

* Binary classification: when there are only two classes to predict, usually labeled as 0 and 1

* Multi-Class Classification: when there are more than two class labels to predict, e.g. 3 flower species, or image classification problems with thousands of classes (cat, dog, fish, car, ...)

---

# Binary Classification

* If we have two classes, we usually like to maintain a continuous output in the interval [0,1]

* This allows a probabilistic degree of certainty for the model's prediction

* This output can be interpreted as a probability, under some assumptions (which depend on the model and the task)

---

# Multi-Class Classification

* We can also have more than two classes

* In this case, we would assign each class a label that we want our classifier to produce

* Since we're working with computers, these labels will usually be numbers

* We could say a value between 0 and 1 is a cat, between 1 and 2 is a dog, between 2 and 3 is a microwave, etc.

---

# Classification Metrics

---

# Classification Metrics

* In order to be able to tell if our classifier is "working", we need some metrics to determine its quality

* As before, we will have a training set, used to train our classifier, and then we will measure its "error" on the test set

* Imagine the problem of classifying images as cats or non-cats: What errors could we make?

---

# Error Types

<img src="/PF-3115/assets/img/error_types.png" width="40%"/><br/>
Source: https://xkcd.com/2303/

---

# Confusion Matrix/Contingency Table

* An important concept in Classification is the confusion matrix or error matrix – a tabular representation of the model predictions vs ground truth labels

* Each row represents the instances in a predicted class and each column represents the instances in an actual class.

* The easiest way to understand it is with an example: a binary classifier that labels images as cat or non-cat, evaluated on a test set of 1100 images (1000 non-cat images and 100 cat images)
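
As a minimal sketch of how such a table is filled, the following counts the four outcomes for a handful of made-up labels (the data and the helper name `confusion_counts` are illustrative assumptions, not part of the lecture material):

```python
# Hypothetical toy data: 1 = cat, 0 = non-cat
actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

def confusion_counts(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1          # predicted cat, is a cat
        elif p == positive:
            fp += 1          # predicted cat, is not a cat
        elif a == positive:
            fn += 1          # predicted non-cat, is a cat
        else:
            tn += 1          # predicted non-cat, is not a cat
    return tp, fp, fn, tn

tp, fp, fn, tn = confusion_counts(actual, predicted)

# Rows = predicted class, columns = actual class
matrix = [[tp, fp],
          [fn, tn]]
print(matrix)  # [[3, 1], [1, 3]]
```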

---

# Overview of the Matrix:

The diagonal elements are the correct predictions for each class (true positives/negatives); the off-diagonal elements are misclassified samples (false positives/negatives).

---

# Metrics

* **Accuracy**: The number of correct predictions divided by the total number of predictions (percentage of samples we got right)

* **Precision**: Percentage of true positives among all samples identified as positive (i.e. what percentage of images we labeled as "cats" were actually cats)

* **Recall**: Percentage of true positives among all samples from a class (i.e. what percentage of cat pictures did we identify as "cats")

* **F1-Score**: Harmonic mean between precision and recall: 2\*Precision\*Recall/(Precision + Recall)

---

# Metrics

<img src="/CI-2600/assets/img/confusion_matrix.png" width="70%"/>

$$ \text{accuracy} = \frac{\mathit{TP} + \mathit{TN}}{\mathit{TP} + \mathit{FP} + \mathit{TN} + \mathit{FN}} = \frac{90 + 940}{90 + 60 + 940 + 10} \approx 0.94$$

$$ \text{precision} = \frac{\mathit{TP}}{\mathit{TP} + \mathit{FP}} = \frac{90}{90+ 60} = 0.6 $$

$$ \text{recall} = \frac{\mathit{TP}}{\mathit{TP} + \mathit{FN}} = \frac{90}{90 + 10} = 0.9$$

$$ \text{F1} = \frac{2\cdot\text{precision}\cdot\text{recall}}{\text{precision} + \text{recall}} = \frac{2\cdot 0.6\cdot 0.9}{0.6+0.9} = 0.72$$
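
The same arithmetic can be checked in a few lines of Python. This is only a sketch: the function name `binary_metrics` is an assumption, and it does not guard against empty denominators (e.g. a classifier that never predicts "cat").

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from the four confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Counts from the cat / non-cat example
accuracy, precision, recall, f1 = binary_metrics(tp=90, fp=60, fn=10, tn=940)
print(f"accuracy={accuracy:.3f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

Note that a classifier that always answers "non-cat" would still reach an accuracy of 1000/1100 ≈ 0.91 on this test set, which is why accuracy alone says little when the classes are imbalanced.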

---

# Cutoff thresholds

* Recall that our model usually produces a continuous output in the interval [0,1], which we interpreted as "probabilities"

* For example, a model might predict the following for 3 images: [0.45, 0.6, 0.7]. Depending on the threshold value, the labels will change:

- cut-off = 0.5: predicted-labels= [0,1,1] (default threshold)

- cut-off = 0.2: predicted-labels= [1,1,1]

- cut-off = 0.8: predicted-labels= [0,0,0]

* Changing the threshold changes the labels, which affects precision and recall
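
The three cut-offs above can be reproduced with a small helper. Whether a score exactly equal to the cut-off counts as positive is a convention; `>=` is assumed here, and the name `apply_threshold` is illustrative.

```python
def apply_threshold(scores, cutoff):
    """Label a sample 1 if its score reaches the cut-off, otherwise 0."""
    return [1 if s >= cutoff else 0 for s in scores]

scores = [0.45, 0.6, 0.7]
for cutoff in (0.5, 0.2, 0.8):
    print(cutoff, apply_threshold(scores, cutoff))
# 0.5 [0, 1, 1]
# 0.2 [1, 1, 1]
# 0.8 [0, 0, 0]
```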

---

# ROC Curve and AUC-ROC

* The ROC curve (receiver operating characteristic curve) shows the performance of a binary classifier as a function of its cut-off threshold (or other parameters)

* In essence what it shows is the true positive rate against the false positive rate for various threshold values

* Many classification models are probabilistic: they predict the probability that an instance belongs to a given class.

* The predicted output probability is then compared to a threshold. If it is larger, then the model predicts a label (cat), otherwise it assigns another label (non-cat).
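
To make the construction concrete, here is a minimal sketch that computes the points of a ROC curve by sweeping the threshold over every distinct score. The true positive rate is TP/(TP+FN) and the false positive rate is FP/(FP+TN); the toy labels and scores are made up for illustration.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    # Sweep from the strictest threshold to the most lenient one; the initial
    # infinite threshold labels nothing as positive and yields the point (0, 0)
    for threshold in [float("inf")] + sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

labels = [0, 0, 1, 1]              # 1 = cat, 0 = non-cat
scores = [0.1, 0.4, 0.35, 0.8]     # model outputs
print(roc_points(labels, scores))
# [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
```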

---

# AUC

* It is the area under the ROC curve, and thus it is between 0 and 1.

* It can be interpreted as the probability that the model ranks a random positive instance more highly than a random negative one.

* Basically, the higher the AUC, the better. But in some cases you may care more about recall (while maintaining reasonable precision)

* In such cases you tune the model to meet your requirements, and your AUC might not be very high
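
The ranking interpretation can be computed directly by comparing every positive sample with every negative one. This brute-force sketch (quadratic in the number of samples, and counting ties as half a win, which is a common convention assumed here) is meant for illustration, not for large data sets.

```python
def auc_by_ranking(labels, scores):
    """Fraction of (positive, negative) pairs in which the positive scores higher."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc_by_ranking(labels, scores))  # 0.75
```

Here three of the four positive/negative pairs are ordered correctly; only the pair (0.35, 0.4) is not.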

---

# AUC-ROC

---

# Multi-class Metrics

* What if we have more than two classes?

* We can calculate each metric (precision, recall, F1) **per class**, e.g. the precision for "cat", "dog", "fish", "microwave"

* We can then calculate the (weighted or unweighted) "average" of these metrics to gain one "overall" metric
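
A possible sketch of this procedure treats each class in turn as "positive" and all others as "negative", then averages. The toy labels and the helper name `per_class_metrics` are assumptions made for the example.

```python
def per_class_metrics(actual, predicted, classes):
    """Return {class: (precision, recall, support)} using one-vs-rest counts."""
    result = {}
    for c in classes:
        tp = sum(1 for a, p in zip(actual, predicted) if a == c and p == c)
        fp = sum(1 for a, p in zip(actual, predicted) if a != c and p == c)
        fn = sum(1 for a, p in zip(actual, predicted) if a == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        result[c] = (precision, recall, tp + fn)   # support = number of actual samples
    return result

actual    = ["cat", "cat", "cat", "cat", "dog", "dog", "fish", "fish"]
predicted = ["cat", "cat", "cat", "dog", "dog", "fish", "fish", "fish"]
metrics = per_class_metrics(actual, predicted, ["cat", "dog", "fish"])

# Unweighted ("macro") average: every class counts equally
macro_precision = sum(p for p, _, _ in metrics.values()) / len(metrics)

# Weighted average: each class counts in proportion to its number of samples
total = sum(n for _, _, n in metrics.values())
weighted_precision = sum(p * n for p, _, n in metrics.values()) / total

print(metrics)
print(macro_precision, weighted_precision)
```

The two averages differ whenever the classes have different sizes: the weighted one is dominated by the frequent classes, the unweighted one treats a rare class as equally important.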

---
class: center, middle

# Classification Algorithms

---

# Logistic Regression

.center[
<iframe width="560" height="315" src="https://www.youtube.com/embed/yIYKR4sgzI8" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
]

---

# Decision Trees

---

# A little more classification

* When the input variables are categorical it is possible to apply logical rules to arrive at an answer

* This can be done without the need to go through numerical computations. In these cases tree-based models are very useful

* Trees can also handle numerical variables by discretizing them

---

# Decision Tree Classifier

---

# Decision Tree Learning

* There are several algorithms to actually **learn** the tree from data (given features and labels)

* One set of algorithms (ID3, C4.5, and C5.0) uses the concept of entropy and information gain

$$
H(S) = - \sum_x p(x) \log_2 p(x)
$$

* If all items are from one class, this entropy is 0; with two classes and exactly half of the items in each, it is 1 ("number of bits needed to represent the information")

* ID3 and C4.5 choose the feature that decreases the entropy the most (e.g. if **all** emails that contain "viagra" are spam, the resulting entropy for that part of the data will be 0), and then recursively apply the same algorithm to the branches
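
The split criterion can be sketched as follows. This shows only the entropy and information-gain computation for one split on made-up data, not a full ID3 implementation; the function names and the toy emails are assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Entropy before the split minus the weighted entropy of the branches."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical toy emails: does the text contain a given word?
rows = [
    {"viagra": True,  "meeting": False},
    {"viagra": True,  "meeting": False},
    {"viagra": False, "meeting": True},
    {"viagra": False, "meeting": False},
]
labels = ["spam", "spam", "ham", "ham"]

print(entropy(labels))                            # 1.0 (half spam, half ham)
print(information_gain(rows, labels, "viagra"))   # 1.0 (both branches are pure)
print(information_gain(rows, labels, "meeting"))  # smaller gain
```

A greedy learner would pick "viagra" as the first split here, because it removes all of the entropy.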

---

# Limitations

* Practical algorithms use a (greedy) heuristic, which can lead to trees that generalize poorly
  
  * Pruning may be needed

* Trees are not very good for non-structured data, such as images

* Trees cannot capture causality or ordering in sequences

---

# Support Vector Classifiers

---

# Support Vector Machines

* Once again we are given training data consisting of features and labels

* We interpret our features as vectors in n-dimensional space, each with its label

* Our goal is to find a (hyper-)plane, such that all data points from one class are on one side, and the ones from the other class on the other side of this plane

* We also want this plane to have the maximum possible distance ("margin") from the data points it is closest to (the "support vectors")
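
As an illustration of what "distance from the plane" and "margin" mean, the sketch below evaluates a hand-picked hyperplane `w . x + b = 0` on four made-up 2D points. It does not train an SVM; `w`, `b` and the data are assumptions chosen for the example.

```python
import math

def signed_distance(w, b, x):
    """Signed distance of point x from the hyperplane w . x + b = 0."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (dot + b) / norm

# Hypothetical data with labels +1 / -1 and a hand-picked separating line
points = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
labels = [1, 1, -1, -1]
w, b = (1.0, 1.0), -2.0

# A point lies on the correct side if label * distance is positive
distances = [y * signed_distance(w, b, x) for x, y in zip(points, labels)]

# The margin is the distance to the closest point(s), the "support vectors"
margin = min(distances)
print(distances, margin)
```

In this example (2, 2) and (0, 0) are the closest points, each at distance √2 from the line.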

---

# Maximum Margin Hyperplane

---

# Support Vector Classifier

* Data usually cannot be separated quite as cleanly

* We specify a parameter "C" that says how costly "errors" are

* We will end up with an optimization problem to maximize the margin while minimizing the cost of errors
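
One common way to write this optimization problem is to minimize `0.5 * ||w||^2 + C * (sum of hinge losses)`: a smaller `||w||` corresponds to a larger margin, and the hinge loss `max(0, 1 - y * (w . x + b))` is non-zero only for points inside the margin or on the wrong side. The sketch below merely evaluates this objective for a fixed, hand-picked `w` and `b`; the data and function name are assumptions, and an actual solver would search for the minimizing `w` and `b`.

```python
def soft_margin_objective(w, b, points, labels, C):
    """0.5 * ||w||^2 + C * sum of hinge losses (labels must be +1 or -1)."""
    regularizer = 0.5 * sum(wi * wi for wi in w)
    hinge = 0.0
    for x, y in zip(points, labels):
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        hinge += max(0.0, 1.0 - y * score)   # 0 for points outside the margin
    return regularizer + C * hinge

points = [(2.0, 2.0), (0.0, 0.0), (1.5, 1.5)]
labels = [1, -1, -1]          # the last point lies on the "wrong" side
w, b = (1.0, 1.0), -2.0

print(soft_margin_objective(w, b, points, labels, C=1.0))   # 3.0
print(soft_margin_objective(w, b, points, labels, C=10.0))  # 21.0
```

With a larger C the single misclassified point dominates the objective, so the optimizer would accept a smaller margin to fix it.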

---

# Projection

* What if our data is not linearly separable at all?

* We can "project" it into more dimensions

* For example

$$
(x_1, x_2) \mapsto (x_1^2, x_2^2, \sqrt{2} x_1 \cdot x_2)
$$

And then we "hope" that our data can be linearly separated in these 3 dimensions!

Or we use more dimensions, maybe even infinitely many ...

---

# The Kernel Trick

How do you calculate the distance of a point from a (hyper-)plane?

The dot product!

Here is a neat trick: We don't need to actually "project" our data, we just need to calculate the dot product "as if" we had projected it.

$$
x = (x_1,x_2) \mapsto X = (x_1^2, x_2^2, \sqrt{2}x_1\cdot x_2)\\\\
(x_i \cdot x_j)^2 = X_i \cdot X_j
$$
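
The identity can be checked numerically. In this small sketch the helper names (`project`, `dot`, `poly_kernel`) and the two sample points are illustrative assumptions:

```python
import math

def project(x):
    """Explicit map (x1, x2) -> (x1^2, x2^2, sqrt(2) * x1 * x2)."""
    x1, x2 = x
    return (x1 * x1, x2 * x2, math.sqrt(2) * x1 * x2)

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def poly_kernel(a, b):
    """Dot product in the projected space, computed without projecting."""
    return dot(a, b) ** 2

xi, xj = (1.0, 2.0), (3.0, -1.0)
explicit = dot(project(xi), project(xj))   # project first, then dot product
implicit = poly_kernel(xi, xj)             # kernel on the original 2D points
print(explicit, implicit)                  # equal up to floating-point rounding
```

The kernel version never builds the 3-dimensional vectors, which is what makes very high-dimensional (even infinite-dimensional) projections feasible.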

---

# SVM with Kernel

---

# Limitations

* We have to tweak C and the kernel function, but often we don't know if/how the data can be linearly separated even in higher spaces

* There is no "interpretation" of the classification in terms of probabilities

* Not well suited for categorical data, because we want to interpret everything as a vector

* Slow for large data sets

* If we have more than two classes, the/one "standard" way is to train multiple SVMs

---

# Neural Networks for Classification

---

# Artificial Neural Networks

.right-column[
Two weeks ago we talked about Neural Networks for regression, to estimate a continuous function.

What about classification?
]

---

# Classification

* Say we want to distinguish between cats and dogs in pictures

* We have images, with 128x128 pixels, 3 color channels

* This means we have 128x128x3 = 49 152 inputs ("features")

* We use a neural network with 2 layers, some hidden neurons, and **one** output neuron

* If the output is greater than some threshold x we say the input is a cat, otherwise it is a dog

---

# More classes

* What if we also have pictures of fish and microwaves?

* We could just say: less than 0.5 is a dog, 0.5 - 1.5 is a cat, 1.5 - 2.5 is a fish, more than 2.5 is a microwave

* Problem: When we do gradient descent we may increase/decrease numbers for *all* samples with some property

* Generally, our classes do not have a numeric relationship: a cat is better than a dog, but does the same numeric difference hold between fish and microwaves?

* Better: Look at each class "independently"

---

# One-Hot-Encoding

* Instead of one output, we have **one output per class**

* We translate our training labels ("dog" = 0, "cat" = 1, "fish" = 2, "microwave" = 3) to vectors: "dog" = (1,0,0,0), "cat" = (0,1,0,0), "fish" = (0,0,1,0), "microwave" = (0,0,0,1)

* The idea is that the output for **each** entry can be interpreted as "this is how likely this picture is of this class" (a probability)

* The classification produced by our network is then whichever class has the **maximum** value
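
A minimal sketch of the encoding and of the "take the maximum" decision, using plain Python lists (the helper names and the example outputs are assumptions):

```python
classes = ["dog", "cat", "fish", "microwave"]

def one_hot(label_index, num_classes):
    """Vector with a 1 at the label's position and 0 everywhere else."""
    vec = [0] * num_classes
    vec[label_index] = 1
    return vec

def predicted_class(outputs):
    """Index of the largest output value."""
    return max(range(len(outputs)), key=lambda i: outputs[i])

target = one_hot(classes.index("fish"), len(classes))
print(target)                                # [0, 0, 1, 0]

outputs = [0.1, 0.2, 0.6, 0.1]               # hypothetical network outputs
print(classes[predicted_class(outputs)])     # fish
```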

---

# Probabilities?

* Say we put a sigmoid function as the activation function **of the output layer**

* All values will be between 0 and 1, which we could interpret as a "probability"!

* However: Cat may have a "probability" of 0.7, dog a "probability" of 0.2, fish a "probability" of 0.5, and microwave a "probability" of 0.45 ... The sum is greater than 1

* Instead: **Softmax**, a "generalization" of a sigmoid to multiple classes that makes sure that the probabilities sum to 1
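
The difference is easy to see numerically. The sketch below applies both functions to the same made-up scores; subtracting the maximum inside `softmax` is a standard numerical-stability trick and does not change the result.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    """Exponentiate each score and normalize so that the results sum to 1."""
    m = max(zs)                              # avoids overflow in exp
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 0.5, 1.0, -1.0]               # hypothetical output-layer scores
independent = [sigmoid(z) for z in scores]   # each value in (0, 1), sum unconstrained
joint = softmax(scores)                      # values in (0, 1) that sum to 1

print(sum(independent))   # greater than 1
print(sum(joint))         # 1 up to rounding
```

The "generalization" claim can also be checked: with two classes and scores `(z, 0)`, the first softmax output equals `sigmoid(z)`.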

---

# Classification Error

* Because our outputs are probabilities, there are better ways to measure the error than the (squared) distance

* In particular, in combination with the softmax activation function (which is similar to the sigmoid function and uses exp), a loss function that uses logarithms is advantageous

* Observe: The logarithm of 1 is 0, the logarithm of values between 0 and 1 is negative

* Loss: The negation of the logarithm of the probability of the desired class

---

# Cross-Entropy Loss

Output probability for each class:

$$
p_c = \frac{\exp(w_c \cdot x)}{\sum_j \exp(w_j \cdot x)}
$$

Loss:

$$
L(y) = - \log(p_y)
$$

When we take the derivative, the log and the exp cancel out nicely, making the update very efficient.
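
Concretely, writing the scores as `z_c = w_c · x`, the derivative of the loss with respect to `z_c` works out to `p_c - 1` for the desired class and `p_c` for every other class. The following sketch (plain Python, made-up scores, illustrative function names) compares that closed form against a finite-difference approximation:

```python
import math

def softmax(zs):
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(zs, y):
    """Negative log-probability of the desired class y."""
    return -math.log(softmax(zs)[y])

def cross_entropy_grad(zs, y):
    """Gradient w.r.t. the scores: p_c - 1 for the desired class, p_c otherwise."""
    return [p - (1.0 if c == y else 0.0) for c, p in enumerate(softmax(zs))]

zs, y = [2.0, 0.5, -1.0], 0
analytic = cross_entropy_grad(zs, y)

# Finite-difference check: nudge each score and see how the loss changes
eps = 1e-6
numeric = []
for c in range(len(zs)):
    bumped = list(zs)
    bumped[c] += eps
    numeric.append((cross_entropy(bumped, y) - cross_entropy(zs, y)) / eps)

print(analytic)
print(numeric)
```

No logarithm or exponential beyond the softmax itself appears in the gradient, which is the "cancelling out" referred to above.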

---

# Lab 3

---

# Lab 3

* We are given pictures of handwritten digits, with labels telling us which digit is shown in each picture

* Our goal is to train a neural network to determine the digit shown in each picture

* This is a **classification** problem: For each image, determine one of 10 classes

---

# Lab 3

---

# Neural Networks for Classification in PyTorch

* Start with the code from the last lab

* Each image is encoded as 28x28 grayscale (0 to 255) pixels

* **Normalize** the pixels (divide by 255)

* Change your neural network to accept 28x28 = 784 inputs

* You will also need 10 outputs, using the Softmax activation function on the output layer (one-hot encoding)

---

# Reading MNIST

* You will have to download the MNIST data

* The images come in a (somewhat weird) binary format

* There is code to load the images into tensors (and save tensors to image files) available on the lab website

---

# Reading MNIST

```Python
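import torch
# read_idx and show_image are assumed to be the helpers from the lab website

# Load the first 500 images (faster for debugging!)
data,dims = read_idx.read(trainingdataf, 500)

# Flatten each 28x28 image into one row of 784 values
td = torch.tensor(data, dtype=torch.float)
training_data = td.view((-1,28*28))

# NORMALIZE!
training_data /= 255

show_image(training_data[0], scale=SCALE_01)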
```

You can store images to the disk, or display them on the screen, which may be useful for debugging!

---

# Classification Network in Python

```Python
import torch

class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        super(TwoLayerNet, self).__init__()
        self.linear1 = torch.nn.Linear(D_in, H)
        self.sig1 = torch.nn.Sigmoid()
        self.linear2 = torch.nn.Linear(H, D_out)
        # Which *dimension* to calculate the softmax over
        self.softmax = torch.nn.Softmax(1)

    def forward(self, x):
        h = self.linear1(x)
        h = self.sig1(h)
        y_pred = self.linear2(h)
        return self.softmax(y_pred)

# 784 inputs, 100 hidden neurons, 10 outputs (one per digit)
model = TwoLayerNet(28*28,100,10)
# x is a batch of images; flatten each one to 784 values
y_pred = model(x.view(-1,28*28))

# Which digit do we predict for each image
predictions = y_pred.max(1).indices
```

---

# Cross-Entropy Loss

* PyTorch comes with a built-in `CrossEntropyLoss` function

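* **Note:** This loss function accepts a two-dimensional output tensor from your neural network (number of rows = number of samples, number of columns = number of classes), and a one-dimensional tensor with the **indices** of the desired class. It applies the (log-)softmax itself, so it expects the raw scores of the last linear layer rather than values that already went through `Softmax`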

* For our purpose, we can use the label as the index (e.g. the digit "0" uses index 0)

* `training_data` and `labels` already have the correct shapes!
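
The expected shapes can be illustrated with stand-in data. In this sketch the random tensor, the label values and the single `Linear` layer are placeholders for the lab's real data and network; only `torch.nn.CrossEntropyLoss` and its input shapes are the point.

```python
import torch

torch.manual_seed(0)

# Stand-ins for the lab data: 5 flattened "images" and their digit labels
training_data = torch.rand(5, 28 * 28)
labels = torch.tensor([3, 0, 9, 1, 3])       # class indices, shape (5,)

# Placeholder model producing 10 raw scores per image (no softmax applied)
model = torch.nn.Linear(28 * 28, 10)
scores = model(training_data)                # shape (5, 10)

loss_fn = torch.nn.CrossEntropyLoss()
loss = loss_fn(scores, labels)               # a single number (mean over the batch)

loss.backward()                              # gradients for the model parameters
print(scores.shape, labels.shape, loss.item())
```

The labels are passed as plain indices, not as one-hot vectors; the loss function picks out the score of the desired class itself.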

---

# Metrics

* You will need to calculate per-class metrics

* The easiest way is to construct the confusion matrix as a 10x10 tensor, and fill its values using the predicted and actual labels

```Python
# Rows = predicted class, columns = actual class
matrix = torch.zeros(10, 10, dtype=torch.long)
for pred,lab in zip(predictions,labels):
    matrix[pred,lab] += 1
```

* You can then calculate the precision as the diagonal element divided by the sum of the row, and the recall as the diagonal element divided by the sum of the column

* If you want, you can also use `sklearn.metrics.classification_report`

---

# Weight Interpretation

* One way to investigate what a Neural Network might be doing is by looking at the weights

* In our first layer, each neuron has 785 weights (one for each of the 784 inputs, plus a bias)

* We can "draw" these weights as images (e.g. the weight used for pixel (0,0), is drawn as pixel (0,0) of a new image)

* Many of these images may just show shades of gray, but some (may) have identifiable responsibilities!

---

# References

* [Understanding AUC - ROC Curve](https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5)

* [SVM Tutorial](https://blog.statsbot.co/support-vector-machines-tutorial-c1618e635e93?gi=991cc54573a8)

* [Multi-class Metrics Made Simple](https://towardsdatascience.com/multi-class-metrics-made-simple-part-i-precision-and-recall-9250280bddc2?gi=d80b77850252)

* [SKLearn Classification Report](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html)