class: center, middle

# Classification and Regression

### III-Verano 2019

---
class: medium

# Introduction

In any machine learning problem, one of the most important things is to know whether the presented task is:

1. Classification
2. Regression

Knowing this will allow you to pick the right algorithm for the task at hand.

We will present the concepts that allow you to differentiate between the two.

---
class: medium

# Classification

A classification problem is defined as one where you are trying to predict a discrete set of values.

* Can you give an example?

- By convention, the labels are provided in categorical form and represent a finite number of classes.
- Recap from stats: why are they discrete?

---
class: medium

# Examples of Classification

The following are examples of classification:

* Given a set of input features, predict whether a breast cancer is benign or malignant.
* Given an image, correctly classify it as containing cats or dogs.
* From a given email, predict whether it is spam or not.

---

# Types of Classification

* Binary classification: when there are only two classes to predict
  - The label (y) is usually assigned the value 1 or 0
* Multi-class classification: when there are more than two class labels to predict, e.g. predicting 3 flower species, or image classification problems with thousands of classes (cat, dog, fish, car, ...).

---
class: medium

# Algorithms for Classification

* Decision Trees
* Logistic Regression
* Naive Bayes
* K Nearest Neighbors
* SVC (Support Vector Classifier)

---
class: medium

# Regression Problems

* In regression problems we try to predict continuously valued outputs
  - Given the size of a house, predict the price (a real value).

Regression models emulate generative processes that derive one or more values from a set of variables. These input variables are capable of explaining the output, by correlation or causality.
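---
class: medium

# Sketch: Telling the Two Apart

As a rough illustration of the distinction above, the hypothetical helper below inspects a target array and guesses the task type. This is only a sketch; in practice the decision comes from domain knowledge, not code.

```python
# Minimal sketch (assumption: a small, finite set of integer-like label
# values signals classification; anything else is treated as regression).
def task_type(y, max_classes=20):
    """Guess whether target values y look discrete or continuous."""
    distinct = set(y)
    if all(float(v).is_integer() for v in distinct) and len(distinct) <= max_classes:
        return "classification"
    return "regression"

print(task_type([0, 1, 1, 0]))           # benign / malignant labels
print(task_type([222.6, 300.0, 568.4]))  # house prices
```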
---

# Regression Algorithms

* Linear Regression
* Regression Trees (e.g. Random Forest)
* Support Vector Regression (SVR)

---
class: medium

# Classification vs. Regression

* Classification: discrete-valued Y (e.g. 1, 2, 3, and 4)
* Regression: continuous-valued Y (e.g. 222.6, 300, 568, ...)
* Whenever you face a machine learning problem, first define whether you are dealing with a classification or a regression problem.
* This can be done by analyzing the target variable (Y). The input X can be of any kind (continuous or discrete) and does not count toward defining the problem.
* After defining the problem and getting to know the data, it is much easier to choose or try out some algorithms.

---
class: medium

# Performance Metrics

* One of the most significant skills when working in machine learning is understanding how to evaluate a model
* Multiple metrics are at your disposal, several for each class of model (classification or regression)
* However, the problem at hand, and what the model is meant to solve, will determine which evaluation metric is best suited

---
class: medium

# Further Concepts in Classification

* Classification consists of assigning labels to data samples. The simplest is the binary assignment (one or zero, true or false). As before, it might say whether an image shows a cat, a dog, a car, or a cactus.
* But it can also tell us the breed of the cat (multiclass)
* Classification models arrive at discrete, categorical answers
* Most classification models can be seen as regression models with a decision boundary: if the predicted value is less than the threshold, it is interpreted as zero
* Otherwise it is one

---

# A Little More Classification

* When the input variables are categorical, it is possible to apply logical rules to arrive at an answer
* This can be done without the need to go through numerical computations. In these cases tree-based models are very useful.
* Trees can also handle numerical variables by discretizing them.
This is one reason why trees are very popular.

---

# Limitations

However, trees have limitations, including:

* They are not very good for non-structured data, such as images
* They are unable to capture causality or ordering in sequences

---
class: medium

# Other Important Considerations in Classification

* The most common problem we mentioned is binary classification. In this case we usually like to maintain the continuous output in [0, 1]
* This allows a probabilistic degree of certainty for the model's prediction
* This can be interpreted, of course, under some assumptions (based on the model and the task)

---
class: medium

# Classification Metrics: Confusion Matrix

* An important concept in classification is the confusion matrix, or error matrix: a tabular representation of the model predictions vs. the ground-truth labels
* Each row represents the instances in a predicted class and each column represents the instances in an actual class.
* An easy way to see this is with an example: a binary classifier that classifies cat and non-cat images, with a test set of 1100 images (1000 non-cat images and 100 cat images)

---

# Overview of the Matrix:
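The matrix was shown here as a figure; the sketch below reconstructs it from the counts used in the precision and recall slides that follow (90 cats predicted as cat, 10 as non-cat; 60 non-cats predicted as cat, 940 as non-cat).

```python
# Confusion matrix for the cat / non-cat example.
# Rows = predicted class, columns = actual class, matching the
# convention stated on the previous slide.
matrix = {
    "pred_cat":    {"actual_cat": 90, "actual_noncat": 60},
    "pred_noncat": {"actual_cat": 10, "actual_noncat": 940},
}

total = sum(v for row in matrix.values() for v in row.values())
print(total)  # all 1100 test-set images are accounted for
```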
---

# Assessment of the Confusion Matrix

Simply put, the diagonal elements are the correct predictions for the different classes (the true positives and true negatives), while the off-diagonal elements are samples that are misclassified (the false positives and false negatives).

---

# Classification Accuracy

Classification accuracy, or simply accuracy, is defined as the number of correct predictions divided by the total number of predictions.
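Written out for the binary case, using the counts from the cat and non-cat example on the following slides (90 cats and 940 non-cats correctly predicted out of 1100 images):

$$ Accuracy = \frac{TP + TN}{TP + TN + FP + FN} = \frac{90 + 940}{1100} \approx 0.936 $$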
---

# Precision

* Classification accuracy is not always a good metric (for example, detection of credit card fraud), especially in cases where your class distribution is imbalanced
* In such cases, even if the model predicts all instances as the most frequent class, you get a high accuracy, but the model is not good because it is not learning anything. It is just predicting everything as the majority class

---
class: small

# Precision

In such cases, like the cat and non-cat predictor, a class-specific metric is important, like precision:

$$ Precision = \frac{TP}{TP + FP} $$

* Thus we calculate the precision for cat and non-cat as the number of samples correctly predicted as a class, divided by the number of samples predicted as that class:

$$ Precision_{cat} = \frac{90}{90 + 60} = 0.60 $$

$$ Precision_{non\text{-}cat} = \frac{940}{950} = 0.989 $$

* In this case, the classifier has higher precision predicting non-cat samples than cat samples.

---
class: medium

# Recall

* Recall is the fraction of samples from a class that are correctly predicted

$$ Recall = \frac{TP}{TP + FN} $$

* For our cat and non-cat example:

$$ Recall_{cat} = \frac{90}{100} = 90\% $$

$$ Recall_{non\text{-}cat} = \frac{940}{1000} = 94\% $$

---
class: small

# F1 Score

* Depending on your task, you might want to focus on precision or recall. However, in some cases both are important. In such scenarios, a metric that takes both into account is relevant
* The F1 score does this: it is the harmonic mean of precision and recall

$$ F1 = 2\cdot\frac{Precision \cdot Recall}{Precision + Recall} $$

* So for the confusion matrix shown before, the F1 for the cat class is:

$$ F1_{cat} = 2\cdot\frac{0.6\cdot 0.9}{0.6 + 0.9} = 0.72 $$

* There is always a trade-off between precision and recall
* If you push the precision too high, you will see a drop in the recall rate, and vice versa.

---
class: medium

# Sensitivity and Specificity

* Sensitivity and specificity are also important metrics, especially in the medical and biological fields (bioinformatics).
They are defined as:

- $$ Sensitivity = Recall = \frac{TP}{TP + FN} $$
- $$ Specificity = \frac{TN}{TN + FP} $$

---
class: medium

# ROC Curve and AUC-ROC

* The ROC curve (receiver operating characteristic curve) shows the performance of a binary classifier as a function of its cut-off threshold.
* In essence, it shows the true positive rate against the false positive rate for various threshold values
* Classification models are probabilistic: they predict the probability of an instance belonging to a given class.
* The predicted output probability is then compared to a threshold. If it is larger, the model predicts one label (cat); otherwise it assigns the other label (non-cat).

---
class: medium

# Example

* For example, a model might predict the following for 3 images: [0.45, 0.6, 0.7]. Depending on the threshold value, the labels will change:
* cut-off = 0.5: predicted labels = [0, 1, 1] (default threshold)
* cut-off = 0.2: predicted labels = [1, 1, 1]
* cut-off = 0.8: predicted labels = [0, 0, 0]
* Changing the threshold changes the labels, which affects precision and recall.

---
class: medium

# AUC

* The AUC is the area under the ROC curve, and thus it is between 0 and 1.
* It can be interpreted as the probability that the model ranks a random positive instance higher than a random negative one.
* Basically, the higher the AUC, the better. But in some cases you may care more about recall (while maintaining reasonable precision)
* In such cases you tune the model to meet your requirements, and your AUC might not be very high

---

# Regression Metrics

* MSE, or mean squared error, is one of the most used metrics in regression
* In essence, it finds the average squared error between the predicted and the actual values
* So for a regression model that predicts house prices in Cartago, denoted ŷᵢ, where for each house we have the real price yᵢ, the MSE can be calculated as follows...

---

# MSE
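In symbols, with n houses, real prices yᵢ, and predictions ŷᵢ:

$$ MSE = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 $$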
---

# RMSE

* RMSE is the square root of the MSE.
* It is useful because it is scaled in the same base unit as the variables. For example, if your y measures house prices in colones, an RMSE of 75,000 amounts to an average error of 75,000 colones.

---

# MAE

* MAE is a measure of the average magnitude of error in a set of predictions, but it does not consider its direction.
* It is the absolute difference between predictions and observations, where all individual differences have equal weight
* It is more robust to outliers than MSE, since squaring errors can magnify the error of the model.

---

# MAE
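In symbols, with n predictions, real values yᵢ, and predictions ŷᵢ:

$$ MAE = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right| $$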
---
class: medium

# Gradient

* A simple way to understand it is to compare it to ascending or climbing a mountain. We want to reach the top as fast as we can
* Of course, there is more than one way to reach the top. But you want to use the fastest way possible.
* The gradient fits in here because the derivative is the rate of change, or the slope, of a function at a point
* So, for `\(f(x) = x^{2}\)`, what will the derivative be? At the point x = 2, the slope will be 4.

---
class: medium

# Gradient Ascent

* The derivative indicates the direction of steepest ascent, and this is what the gradient gives us. However, the gradient is a vector-valued function.
* Again with our mountain climber: the gradient tells you the direction in which you get to the top of the mountain the fastest. So if you have a function of 2, 3, or 4 variables, you get a gradient vector with 2, 3, or 4 partial derivatives.
* Generally, an n-variable function results in an n-dimensional gradient vector.
* Gradient descent goes in the opposite direction of the gradient: we do NOT want to maximize f, but minimize it.

---
class: medium

# Linear Regression and Gradient Descent

* A linear regression model is given by:

$$ y = mx + b $$

* For a linear model we have two parameters: the slope, m, and the y-intercept, b. These are knobs that we can adjust or change to find the best linear equation
* This is done by minimizing the difference between the predictions and the observations.
* Iteratively, these parameters are changed in the direction of steepest descent on the error. After each iteration, the weight changes refine the model

---

# References

* Ian Goodfellow, Yoshua Bengio, and Aaron Courville, *Deep Learning*, MIT Press, 2016
* Christopher M. Bishop, *Pattern Recognition and Machine Learning*, Springer, 2006
* Jerome Friedman, Trevor Hastie, and Robert Tibshirani, *The Elements of Statistical Learning*, Springer Series in Statistics, 2001
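---
class: medium

# Example: Gradient Descent for a Line

The iterative update described on the linear regression slide can be sketched in a few lines. The toy data and learning rate below are made up for illustration.

```python
# Gradient descent for y = m*x + b, minimizing the MSE.
# The partial derivatives of the MSE with respect to m and b
# give the direction of steepest ascent; we step the opposite way.
def fit_line(xs, ys, lr=0.01, steps=5000):
    m, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_m = sum(2 * (m * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (m * x + b - y) for x, y in zip(xs, ys)) / n
        m -= lr * grad_m  # move against the gradient
        b -= lr * grad_b
    return m, b

# Toy data generated from y = 2x + 1; the fit should recover m ≈ 2, b ≈ 1.
m, b = fit_line([0, 1, 2, 3, 4], [1, 3, 5, 7, 9])
print(round(m, 2), round(b, 2))
```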