class: center, middle

# Artificial Intelligence

### Supervised Learning

---

# Machine Learning

- Supervised Learning: `\( \{(x_1, y_1), \ldots, (x_n, y_n)\} \)` Learn a mapping from examples.
- Unsupervised Learning: `\( \{x_1, \ldots, x_m\} \)` Learn something interesting about the data.
- Semi-supervised Learning: `\( \{(x_1, y_1), \ldots, (x_n, y_n)\} \cup \{x_1, \ldots, x_m\} \)`
- Reinforcement Learning: Learn what to do in an environment, given feedback information.

---

# Machine Learning

Say there is a function `\(f(\vec{x}) = y\)`

- Supervised Learning: We know x and y, and are trying to find f
- Unsupervised Learning: We know x and are trying to find "interesting" f and y
- Reinforcement Learning: We know f*, and are trying to get "the best y" by choosing x

.tiny[*: Terms and Conditions may apply]

---
class: medium

# Supervised Learning

* Today we will discuss supervised learning: We have x and y, and try to find f
* What are x and y?
* Depends on the problem type:
  - Classification: Determine a **discrete** class for the input
  - Regression: Predict/infer a **continuous** number as a function of the input
* Some approaches work for both tasks, in a way

---
class: small

# First: Inputs and Outputs

Remember our function `\(f(\vec{x}) = y\)`

- This function takes a vector of real numbers and produces one real number
- Unlike vectors you may have seen before, this vector is really just "an ordered collection of numbers"
- For example: We want to predict the price of Google/Alphabet stock given the day of the year, the temperature in Mountain View, the position of Mars in its orbit, and the number of Marvel movies released so far
- We construct a four-dimensional vector with one entry for each of these numbers
- Our (supervised) learning algorithm then has to figure out how to turn these four values into a stock price (not all values may be relevant)

---
class: medium

# Vectors

* Vectors are neat because we have mathematical operations defined on them: Addition,
subtraction, multiplication with a scalar, etc.
* One particularly important operation is the dot product:

$$ \vec{v} \cdot \vec{w} = \begin{pmatrix}v_1\\\\v_2\\\\\vdots\\\\v_n\end{pmatrix}\cdot\begin{pmatrix}w_1\\\\w_2\\\\\vdots\\\\w_n\end{pmatrix} = v_1 \cdot w_1 + v_2 \cdot w_2 + \ldots + v_n \cdot w_n $$

* We can use this to concisely define learning systems and algorithms!

---
class: mmedium

# Learning

* We are given "some" example pairs of `\(\vec{x}\)` and y
* Our machine learning "algorithm" consists of two parts:
  - A representation of a predicted function `\(\hat{f}\)`
  - A search (or "optimization") procedure
* We split our examples into three sets:
  - Training Set
  - Validation Set
  - Test Set

---
class: medium

# Learning

* To learn a (hopefully) "good" `\(\hat{f}\)` we pass our training set to our search procedure
* This will produce "some" `\(\hat{f}\)`
* We then **measure** how well this works on the **validation set**
* If it works well, great; if not, maybe we can tweak some parameters of the search procedure or representation and try again
* Only at the **very end** do we try the algorithm on the test set to report generalization performance!

---

# Learning

* Different learning algorithms differ in how they produce this `\(\hat{f}\)`
* Depending on the approach there may be many parameters to decide
* We also need to decide **how** to measure how well our approach does
* Let's start with classification

---
class: center, middle

# Classification

---
class: medium

# Classification

A classification problem is defined as one where you are trying to predict one of a discrete set of values, for example:

* Given a set of input features predict whether a breast cancer is benign or malignant
* Given an image correctly classify it as containing cats or dogs (or birds, or ...)
* From a given email determine whether it's spam or not
* Determine if a person's fingerprint is valid or not

---

# Types of classification

* *Binary classification*: when there are only two classes to predict, usually labeled as 0 and 1 (e.g. valid = 0, invalid = 1; or not spam = 0, spam = 1)
* *Multi-Class Classification*: When there are more than two class labels to predict, e.g. three types of flower species, image classification problems where there are more than a thousand classes (facial recognition), etc.

---
class: medium

# Binary Classification

* If we have two classes, we usually like to maintain a continuous output in the interval [0,1]
* This can be interpreted as a "probability" to a certain degree
* However, the ultimate output for a given task is likely a binary choice
* We will have to decide how "certain" we want to be of something (e.g. we classify as spam what our algorithm says is spam with more than 95% probability)

---
class: medium

# Multi-Class Classification

* We can also have more than two classes
* In this case, we would assign each class a label that we want our classifier to produce
* Since we're working with computers, these labels will usually be numbers again
* We could say a value between 0 and 1 is a cat, between 1 and 2 is a microwave, between 2 and 3 is a dog, etc.
* We will see that that is not necessarily a good idea, and there are often better encodings

---
class: center, middle

# Classification Metrics

---

# Classification Metrics

* In order to be able to tell if our classifier is "working", we need some metrics to determine its quality
* We will have a training set, used to train our classifier, and then we will measure its "error" on the validation set
* Imagine the problem of classifying images as cats or non-cats: What errors could we make?

---

# Error Types
Source: https://xkcd.com/2303/

---
class: medium

# Confusion Matrix/Contingency Table

* The confusion matrix or error matrix is a tabular representation of the model predictions vs ground truth labels
* Each row represents the instances in a predicted class and each column represents the instances in an actual class (some people flip the two, always use labels!)
* For example, say we have a binary classifier that classifies cat- and non-cat images. We also have a validation set with 1100 images (1000 non-cat images and 100 cat images)

---

# Overview of the Matrix:
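A minimal sketch of how such a matrix could be tallied, using the row = predicted, column = actual convention from the previous slide. The counts (90 true positives, 60 false positives, 10 false negatives, 940 true negatives) are the ones used in the worked metrics later in this lecture; the function name and the 1 = cat, 0 = non-cat encoding are assumptions for illustration.

```python
from collections import Counter

def confusion_matrix(predicted, actual, labels):
    """Rows are predicted classes, columns are actual classes."""
    counts = Counter(zip(predicted, actual))
    return [[counts[(p, a)] for a in labels] for p in labels]

# Rebuild the 1100-image validation set from the lecture's counts
# (assumed encoding: 1 = cat, 0 = non-cat).
predicted = [1] * 90 + [1] * 60 + [0] * 10 + [0] * 940
actual    = [1] * 90 + [0] * 60 + [1] * 10 + [0] * 940

matrix = confusion_matrix(predicted, actual, labels=[1, 0])
print(matrix)  # [[90, 60], [10, 940]]
```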
The diagonal elements are the correct predictions for the different classes (true positives/negatives); the off-diagonal elements are the misclassified samples (false positives/negatives).

---
class: medium

# Metrics

* **Accuracy**: The number of correct predictions divided by the total number of predictions (percentage of samples we got right)
* **Precision**: Percentage of true positives among all samples identified as positive (i.e. what percentage of images we labeled as "cats" were actually cats)
* **Recall**: Percentage of true positives among all samples that actually belong to the class (i.e. what percentage of cat pictures did we identify as "cats")
* **F1-Score**: Harmonic mean of precision and recall: `\(2\cdot \text{Precision} \cdot \text{Recall}/(\text{Precision} + \text{Recall})\)`

---
class: small, smallermath

# Metrics
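The four definitions from the previous slide can also be sketched in a few lines of code. The function name is made up, and the sketch assumes no denominator is zero (a real implementation would have to guard against that).

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F1 from the four confusion-matrix counts.

    Assumes tp + fp, tp + fn and precision + recall are all non-zero.
    """
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Counts from the cat/non-cat validation set used in this lecture.
accuracy, precision, recall, f1 = classification_metrics(tp=90, fp=60, tn=940, fn=10)
print(round(accuracy, 2), round(precision, 2), round(recall, 2), round(f1, 2))
```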
$$ \text{accuracy} = \frac{\mathit{TP} + \mathit{TN}}{\mathit{TP} + \mathit{FP} + \mathit{TN} + \mathit{FN}} = \frac{90 + 940}{90 + 60 + 940 + 10} \approx 0.94$$

$$ \text{precision} = \frac{\mathit{TP}}{\mathit{TP} + \mathit{FP}} = \frac{90}{90+ 60} = 0.6 $$

$$ \text{recall} = \frac{\mathit{TP}}{\mathit{TP} + \mathit{FN}} = \frac{90}{90 + 10} = 0.9$$

$$ \text{F1} = \frac{2\cdot\text{precision}\cdot\text{recall}}{\text{precision} + \text{recall}} = \frac{2\cdot 0.6\cdot 0.9}{0.6+0.9} = 0.72$$

---
class: mmedium

# Cutoff thresholds

* Recall that our model usually produces a continuous output in the interval [0,1], which we interpreted as "probabilities"
* For example, for three images a model might predict 0.45, 0.6, and 0.7. Depending on the threshold values, the labels will change:
  - cut-off = 0.5: predicted labels = 0, 1, 1 (default threshold)
  - cut-off = 0.2: predicted labels = 1, 1, 1
  - cut-off = 0.8: predicted labels = 0, 0, 0
* Changing the threshold changes the labels, which affects precision and recall

---

# Multi-class Metrics

* What if we have more than two classes?
* We can calculate each metric (precision, recall, F1) **per class**, e.g. for "cat", "dog", "fish", "microwave"
* We can then calculate the (weighted or unweighted) "average" of these metrics to gain one "overall" metric
* Usually it's best to report **all** values

---
class: center, middle

# Classification Algorithms

---

# Decision Trees

* When the input variables are categorical/discrete it is possible to apply logical rules to arrive at an answer
* This can be done without the need to go through numerical computations. In these cases tree-based models are very useful
* Continuous input variables can be discretized

---

# Decision Tree Classifier
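A decision tree over categorical inputs is really just a set of nested logical rules. A minimal sketch, assuming a hand-written (not learned) spam tree; the feature names and values are made up for illustration:

```python
# Hypothetical hand-written tree: inner nodes are (feature, branches) tuples,
# leaves are class labels. Feature names and values are invented.
TREE = ("contains_link", {
    "yes": ("sender_known", {"yes": "not spam", "no": "spam"}),
    "no": "not spam",
})

def classify(tree, example):
    """Follow the branch matching each feature value until a leaf is reached."""
    while isinstance(tree, tuple):
        feature, branches = tree
        tree = branches[example[feature]]
    return tree

print(classify(TREE, {"contains_link": "yes", "sender_known": "no"}))  # spam
```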
---
class: mmedium

# Decision Tree Learning

* There are several algorithms to actually **learn** the tree from data (given features and labels)
* One set of algorithms (ID3, C4.5, and C5.0) uses the concept of entropy and information gain

$$ H(S) = - \sum_x p(x) \log_2 p(x) $$

* If all items are from one class, this entropy is 0; if the items are split exactly in half between two classes, it is 1 ("number of bits needed to represent the information")
* ID3 and C4.5 choose the feature that decreases the entropy the most (e.g. if **all** emails that contain "viagra" are spam, the resulting entropy for that part of the data will be 0), and then recursively apply the same algorithm to the branches

---

# Limitations

* Practical algorithms use a (greedy) heuristic, which can lead to trees that generalize poorly
* Pruning (trimming the tree) may be needed
* Trees are not very good for non-structured data, such as images
* Trees capture neither causality nor the ordering in sequences

---
class: center, middle

# Support Vector Classifiers

---
class: medium

# Support Vector Machines

* Once again we are given training data consisting of features and labels
* We interpret our features as vectors in n-dimensional space, each with its label
* Our goal is to find a (hyper-)plane, such that all data points from one class are on one side, and the ones from the other class on the other side of this plane
* We also want this plane to have the maximum possible distance ("margin") from the data points it is closest to (the "support vectors")

---

# Maximum Margin Hyperplane
Figure: SVM Separating Hyperplanes (H1, H2, H3 over axes X1 and X2). Zachary Weinberg, 2012-11-26.
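To make "margin" concrete, here is a small sketch that only *measures* the margin of a given hyperplane; it does not search for the best one. The data points, the two candidate planes, and the +1/-1 label encoding are all made up for illustration.

```python
import math

def margin(w, b, points, labels):
    """Smallest signed distance of any point to the hyperplane w.x + b = 0.

    Labels are +1 or -1. A positive result means every point lies on its
    correct side; the value is the distance of the closest point.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in zip(points, labels)
    )

# Made-up 2D data: two points per class.
points = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
labels = [1, 1, -1, -1]

diagonal = margin((1.0, 1.0), -2.0, points, labels)  # plane x1 + x2 = 2
vertical = margin((1.0, 0.0), -1.0, points, labels)  # plane x1 = 1
print(diagonal > vertical)  # both separate the data, the diagonal plane has the larger margin
```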
---

# Support Vector Classifier

* Data usually cannot be separated quite as cleanly
* We specify a parameter "C" that says how costly "errors" are
* We will end up with an optimization problem to maximize the margin while minimizing the cost of errors

---

# Projection

* What if our data is not linearly separable at all?
* We can "project" it into more dimensions
* For example

$$ (x_1, x_2) \mapsto (x_1^2, x_2^2, \sqrt{2} x_1 \cdot x_2) $$

And then we "hope" that our data can be linearly separated in these 3 dimensions!

--

Or we use more dimensions, maybe even infinitely many ...

---

# The Kernel Trick

How do you calculate the distance of a point from a (hyper-)plane?

--

The dot product!

Here is a neat trick: We don't need to actually "project" our data, we just need to calculate the dot product "as if" we had projected it.

$$ x = (x_1,x_2) \mapsto X = (x_1^2, x_2^2, \sqrt{2}x_1\cdot x_2)\\\\ (x_i \cdot x_j)^2 = X_i \cdot X_j $$

---

# SVM with Kernel
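The identity behind the kernel trick can be checked numerically: the squared dot product in two dimensions equals the plain dot product after the explicit three-dimensional projection. A minimal sketch; the helper names and the two sample points are made up.

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def project(x):
    """Explicit map (x1, x2) -> (x1^2, x2^2, sqrt(2) * x1 * x2)."""
    x1, x2 = x
    return (x1 * x1, x2 * x2, math.sqrt(2) * x1 * x2)

def poly_kernel(x, y):
    """The same dot product, computed without leaving two dimensions."""
    return dot(x, y) ** 2

a, b = (1.0, 2.0), (3.0, -1.0)
explicit = dot(project(a), project(b))  # project first, then dot product
implicit = poly_kernel(a, b)            # kernel only, no projection
print(math.isclose(explicit, implicit))  # True
```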
---
class: medium

# Limitations

* We have to tweak C and the kernel function, but often we don't know if/how the data can be linearly separated even in higher spaces
* There is no "interpretation" of the classification in terms of probabilities
* Not well suited for categorical data, because we want to interpret everything as a vector
* Slow for large data sets
* If we have more than two classes, a "standard" way is to train multiple SVMs

---

# Other Approaches

* Logistic Regression
* Naive Bayes
* k-Nearest Neighbors
* Neural Networks (next time!)

---
class: center, middle

# Regression

---

# Regression

* The outputs of our classification problems were classes
* The output of a regression problem is a number
* In a way, this is "simpler", because we can just math our way through
* Note: Even for classification we mathed it out, and said "the output will be a probability (i.e. a number)"

---

# Learning?

* Now we have a prediction for y (which we call `\(\hat{y}\)`)
* What is our goal? To get a good prediction
* How would we measure that?
* Just as for classification, we need some metric again!

---

# Mean Absolute Error

* The MAE is the average magnitude of error in a set of predictions, i.e. `\(\frac{1}{n}\sum_i | y_i - \hat{y}_i|\)`
* This has one big advantage: It uses the same scale as the response variable (output)
* For example, if your y measures the Google stock price in USD, an MAE of 2 amounts to an average error of $2
* It weighs all errors equally, i.e. does not consider outliers "worse" than many small errors

---
class: mmedium

# Mean Squared Error

* MSE or mean squared error is one of the most used metrics in regression
* It is the average (over all examples) squared error between the predicted values `\(\hat{y}\)` and the actual values `\(y\)`: `\(\frac{1}{n}\sum_i (y_i - \hat{y}_i)^2\)`
* Why squared? Non-negative and punishes outliers more
* It's also differentiable, if you care about that sort of thing (spoilers: we do!)
* A numeric interpretation may be tricky, since it does not use the same scale as y, but we can take the square root to get the Root Mean Squared Error (RMSE)

---

# Linear Regression

* Say we have a vector with four values (day d, temperature t, orbit position o, Marvel movies m), and want to predict the Google stock price
* We also add an extra "1" at the end, because math

$$ \begin{pmatrix}w_1\\\\w_2\\\\w_3\\\\w_4\\\\b\end{pmatrix}\cdot \begin{pmatrix}d\\\\t\\\\o\\\\m\\\\1\end{pmatrix} = w_1\cdot d + w_2 \cdot t + w_3\cdot o + w_4\cdot m + b = \hat{y} $$

---

# MSE

$$ \begin{aligned} \mathit{MSE} =& \frac{1}{n}\sum_i (y_i - \hat{y}_i)^2 \\\\ =& \frac{1}{n}\sum_i\\left(y_i - \begin{pmatrix}w_1\\\\w_2\\\\w_3\\\\w_4\\\\b\end{pmatrix}\cdot \begin{pmatrix}d_i\\\\t_i\\\\o_i\\\\m_i\\\\1\end{pmatrix}\\right)^2 \end{aligned} $$

And now? How do we learn a good "function" with this?

Well, we need to choose the **weights** `\(w_1, w_2, w_3, w_4\)` and the **bias** `\(b\)`

---

# A Small Matter of Calculus

* What we want is the **smallest** MSE
* How do we minimize a function? We differentiate it and set the derivative to 0
* Then we "solve" for `\(w_1, w_2, w_3, w_4, b\)`
* I hope you didn't sleep through your calculus and linear algebra classes ...

--

Just kidding. We're not going to do that.

---
class: mmedium

# Ordinary Least Squares

* In your statistics class you may actually have seen this exact process
* Some statistics classes teach it for one variable only, and call it something like `\(y_i = \alpha + \beta x_i + \varepsilon_i\)`
* Or you may have seen it for any number of variables, with a solution of `\(\beta = (X^T X)^{-1} X^T y\)`
* In this case, the matrix X contains **all** training examples, y is a vector of **all** training outputs, and `\(\beta\)` will be the weights + bias as a vector
* The problem is: This only works because our model is linear!
* Since our model is linear, and our error function quadratic, there is only one minimum, and we can get a closed-form solution for it

---
class: medium

# Beyond Linearity

Next time we will solve several issues:

* We need something more than just a linear model (preferably arbitrarily complex)
* On the other hand, we want to be able to calculate "good" values for our "weights" and "bias"
* It would also be nice if we can handle "arbitrary" input (numbers, pixels, etc.)
* And maybe there can be multiple options for output as well (numbers, probability, one of multiple classes, etc.)

That's where Neural Networks come in!

---

# References

* [Multi-class Metrics Made Simple](https://towardsdatascience.com/multi-class-metrics-made-simple-part-i-precision-and-recall-9250280bddc2?gi=d80b77850252)
* [SKLearn Classification Report](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html)
* [Deep Learning With PyTorch](https://pytorch.org/assets/deep-learning/Deep-Learning-with-PyTorch.pdf)