class: center, middle

# Machine Learning
## SVM for Classification
### III-Verano 2019

---

# Support Vector Machines
Fundamentals of an SVM
---

# Simple SVM
---

class: medium

# Support Vector Machines

* SVM finds the decision boundary that separates different classes and maximizes the margin
* But, what is the margin?
* The margin is the perpendicular distance between the line and the instances closest to it

---

class: mmedium

# What is the Goal of an SVM

* The objective of the SVM is to find a hyperplane that distinctly classifies the data points
* The hyperplane is N-dimensional, defined by the number of features
* Hyperplanes are decision boundaries that allow classification of instances
* Data points located on either side of the hyperplane can be attributed to different classes
* The dimension of the hyperplane depends on the number of input features
* With 2 features the hyperplane is a line; with 3 features it is a 2D plane

---

class: mmmedium

# Support Vectors

* Learning algorithms try to learn a common characteristic or signal, something that makes one class different from another
* Classification is thus based on such characteristics, or the differences between the classes
* Support Vector Machines have an interesting characteristic: they find the points of the classes that are most similar, i.e., closest to each other
* These are the support vectors, more formally defined as the data points closest to the hyperplane
* Support vectors influence the position and orientation of the hyperplane
* Support vectors are used to maximize the margin of the classifier; if they are deleted or removed, the position of the hyperplane changes

---

# SVM
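---

class: medium

# Support Vectors in Code (Sketch)

A minimal sketch, assuming scikit-learn, of fitting a linear SVM on a toy 2D dataset and inspecting the support vectors and the hyperplane they define:

```python
# Minimal sketch: fit a linear SVM and inspect its support vectors
# (assumes scikit-learn; the data and parameters are illustrative only)
import numpy as np
from sklearn.svm import SVC

# Toy 2D data: two small clusters (think "apples" vs. "lemons")
X = np.array([[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print(clf.support_vectors_)       # the points that define position and margin
print(clf.coef_, clf.intercept_)  # w and b of the hyperplane w·x + b = 0
```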
---

# Getting the Best Possible Hyperplane

* In our apples and lemons example the best line is the one that has the largest margin
* This means we maximize the margin between both classes

---

# Sub-optimal Hyperplane
---

# The Optimal Margin

* For multiple classes, we have to maximize the margin considering each of the classes
* The idea is to maximize the margins based on all the classes, and not just one margin for each class
* This results in a global margin that takes into account all the classes

---

# Optimal Margin
This margin is perpendicular to the boundary and equidistant to the support vectors.

---

# Margins and Support Vectors

* So support vectors are the data points that define the position and margin of the hyperplane
* They are given the name support vectors because they are the representative data points of the classes.
* If a given support vector is moved, the position and/or the margin will change.
* What happens to the margin if you move a point other than a support vector?

---

class: medium

# Margins and Support Vectors

* So support vectors are the data points that define the position and margin of the hyperplane
* They are given the name support vectors because they are the representative data points of the classes.
* If a given support vector is moved, the position and/or the margin will change.
* What happens to the margin if you move a point other than a support vector?
* It won't have an effect! Only moving support vectors has an effect!

---

# Classification and Support Vectors

* It is not necessary to keep ALL the training data points; it is actually enough to keep only the support vectors.
* In some scenarios, all the points will be support vectors, but that is not common. If this happens, review your model.
* Learning is equivalent to finding the hyperplane with the best margin, which is an optimization problem.

---

# SVM in a Nutshell

## Basic Steps

* Select 2 hyperplanes, in 2D, which can separate the data, with no points in between them.
* Maximize their distance (margin)
* Define the decision boundary, which can be the average line between the two hyperplanes.

---

# SVM in Non-Linear Space

* Finding the best margin is not a trivial task. However, it is easier to achieve in 2D, with 2 attributes
* The problem becomes more complex when we have N attributes
* This sort of problem is solved with Lagrange multipliers
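---

class: medium

# The Margin as an Optimization Problem

For a linear SVM, "finding the best margin" has a standard formulation. With a hyperplane $w \cdot x + b = 0$ and labels $y_i \in \{-1, +1\}$, maximizing the margin is equivalent to:

$$\min_{w,\,b} \; \frac{1}{2}\lVert w \rVert^2 \quad \text{subject to} \quad y_i \,(w \cdot x_i + b) \ge 1 \;\; \text{for all } i$$

This constrained problem is what the Lagrange multipliers mentioned on the previous slide are used to solve.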
---

# SVM in Non-Linear Space

* A straight line cannot separate the data points (apples and lemons)
* So, to solve this we use the kernel trick

---

class: medium

# Kernel Tricks, Apples and Lemons

* The idea is that when the data is not separable in the current dimension, it is possible to add another dimension.
* By adding this dimension, the data becomes separable.
* In this example, adding a third dimension can separate the two classes, because there is a gap between level one and level two of the apples and lemons.
* By adding the new dimension it is possible to draw a separating hyperplane between the apples and the lemons.

---

# Mapping to 3D from 2D

* When mapping the classes (data points) to the new space, this was done using a transformation that assigns levels based on distance
* The distance is measured from the origin: points on the lowest level lie at the origin, and the level increases as you move away from the center towards the margin
* If the origin is the lemon in the center, then we get the following

---

class: medium

# Kernel Trick and Mapping to 3D from 2D

* Now it is easy to separate the two classes. These transformations are called kernels.
* Popular kernels are: Polynomial Kernel, Gaussian Kernel, Radial Basis Function (RBF), Laplace RBF Kernel, Sigmoid Kernel
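---

class: medium

# Kernel Trick in Code (Sketch)

A minimal sketch, assuming scikit-learn, of the kernel trick on data that no straight line can separate (two concentric rings generated with `make_circles`):

```python
# Minimal sketch: linear vs. RBF kernel on non-linearly-separable data
# (assumes scikit-learn; the dataset and parameters are illustrative only)
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: a straight line cannot separate them
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=42)

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf").fit(X, y)  # kernel trick: implicit mapping to a higher dimension

print("linear accuracy:", linear_svm.score(X, y))  # poor, close to chance
print("rbf accuracy:", rbf_svm.score(X, y))        # close to 1.0
```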
---

class: medium

# Overview of Parameter Tuning

* The right kernel is always important; as a rule of thumb, first check whether the data is linearly separable and try a linear SVM
* A linear SVM is parametric, whereas an RBF SVM is not, so the complexity of the latter grows with the size of the dataset.
* Because of this complexity you have more hyperparameters to tune, and so model selection is more expensive.
* Also, complex models are potentially easier to overfit

---

# Regularization

* The regularization parameter (in Python it is called `C`) tells the SVM optimization how much you want to avoid misclassifying each training example.
* If C is higher, the optimization will choose a smaller-margin hyperplane, so the misclassification rate on the training data will be lower.
* For lower values of C the margin will be larger, even if that means some training examples are misclassified.

---

# Regularization
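---

class: medium

# Regularization in Code (Sketch)

A minimal sketch, assuming scikit-learn, of how `C` is set when building the classifier; the dataset and the values of `C` are illustrative only:

```python
# Minimal sketch: the regularization parameter C in scikit-learn
# (assumes scikit-learn; dataset and C values are illustrative only)
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two slightly overlapping clusters
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)

# Large C: smaller margin, tries hard not to misclassify training points
strict_svm = SVC(kernel="linear", C=100.0).fit(X, y)

# Small C: larger margin, tolerates some misclassified training points
relaxed_svm = SVC(kernel="linear", C=0.01).fit(X, y)

print(strict_svm.score(X, y), relaxed_svm.score(X, y))
```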
---

# Gamma

* The gamma parameter defines how far the influence of a single training example reaches
* A high gamma considers only the points close to the plausible hyperplane, while a low gamma also considers points at a greater distance
* Decreasing gamma means that points at greater distances are also taken into account, so more and more points are used when finding the optimal hyperplane

---

# Gamma
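---

class: medium

# Gamma in Code (Sketch)

A minimal sketch, assuming scikit-learn, of how `gamma` is set for an RBF-kernel SVM; the dataset and the gamma values are illustrative only:

```python
# Minimal sketch: the gamma parameter of an RBF-kernel SVM
# (assumes scikit-learn; dataset and gamma values are illustrative only)
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# High gamma: each training point has very local influence (risk of overfitting)
high_gamma_svm = SVC(kernel="rbf", gamma=50.0).fit(X, y)

# Low gamma: points far from the boundary still influence it (smoother boundary)
low_gamma_svm = SVC(kernel="rbf", gamma=0.1).fit(X, y)

print(high_gamma_svm.score(X, y), low_gamma_svm.score(X, y))
```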
---

# Pros

* SVMs can be very efficient: the problem of finding the maximum-margin hyperplane can be formulated as an optimization problem that only depends on the support vectors
* Work well on small datasets, as well as on non-linear and high-dimensional data, and are less prone to overfitting than other algorithms.
* Effective when the number of dimensions is greater than the number of instances

---

# Cons

* High training time with large datasets
* Noisy datasets, where the target classes overlap, are not a good fit for SVMs

---

# Popular Use Cases

* Text Classification
* Detecting spam
* Sentiment analysis
* Aspect-based recognition
* Handwritten digit recognition
* Candidate gene identification
* Sign-language recognition

---

# References

* [Kernel Functions-Introduction to SVM Kernel & Examples](https://data-flair.training/blogs/svm-kernel-functions/)
* [A Simple Explanation of Why Lagrange Multipliers Works](https://medium.com/@andrew.chamberlain/a-simple-explanation-of-why-lagrange-multipliers-works-253e2cdcbf74)