class: center, middle

# Machine Learning

## Hyperparameter Tuning and Feature Selection, Extraction and Engineering

### III-Verano 2019

---
Defining Hyperparameters and Parameters for Tuning
---

class: small

# Model Parameters

* Machine learning models are mathematical functions that represent the relationship between different aspects of data.

* For example, the linear regression model uses a line to represent the relationship between the features and the target:

$$ y = w^T x $$

--

* x is a vector of features from the data and y is a scalar, the response variable

* Linear regression assumes that the relationship between x and y is linear.

* w is a weight vector (the **parameters** of the model) that specifies the slope of the line/plane. It is learned during the training phase.

* When we say "training a model" we mean running an optimization process to find the model parameters that best fit the data

---

class: medium

# So What is a Hyperparameter?

* Hyperparameters, or nuisance parameters, are values that MUST be specified outside the training procedure

* Hyperparameters for linear regression?

--

* None!

* Variations do have them: LASSO adds L1 regularization and Ridge Regression adds L2 regularization, and the regularization strength is a hyperparameter.

* Decision trees have hyperparameters such as depth and number of leaves

* What about SVC and kernelized SVM?

---

class: mmedium

# Working with Hyperparameters

* Regularization hyperparameters serve useful functions such as controlling model capacity (how flexible the model is, how many degrees of freedom it has in fitting the data)

* Proper control of model capacity prevents overfitting.

* Hyperparameters can also come from the training process.

* Training a machine learning model often involves optimizing a loss function (the training metric).

* Mathematical optimization techniques may be employed, which may have parameters of their own.

* Stochastic gradient descent optimization requires a learning rate

* Random forests and boosted decision trees require specifying the total number of trees

---

class: small

# Mechanisms to Tune Hyperparameters

* Hyperparameter settings can have a big impact on the prediction accuracy of the trained model.

* Optimal hyperparameter settings often differ for different datasets.

* Therefore they should be tuned for each dataset.

* Since the training process doesn't set the hyperparameters, there needs to be a meta process that tunes them.

* This is what we mean by hyperparameter tuning.

* Hyperparameter tuning is a meta-optimization task.

* The outcome of hyperparameter tuning is the best hyperparameter setting, and the outcome of model training is the best model parameter setting.

---

class: mmedium

# Hyperparameter Tuning Algorithms

* Hyperparameter tuning is an optimization task.

* This is similar to model training.

* In model training, the quality of the model is expressed as a mathematical representation, the loss function.

* In hyperparameter tuning, however, the quality of the hyperparameters cannot be written down as a formula, since it depends on the results of training the model.

* This makes hyperparameter tuning a difficult task.

* Up until a few years ago, the only available methods were grid search and random search.

---

class: mmedium

# Grid Search

* Grid search, as its name indicates, picks out a grid of hyperparameter values and evaluates EVERY ONE of them

* It returns the highest performing parameter or combination of parameters based on performance metrics.

* For example, if the hyperparameter is the number of leaves in a decision tree, then the grid could be 10, 20, 30, ..., 100.

* For regularization parameters, it is common to use an exponential scale: 1e-5, 1e-4, 1e-3, ..., 1

* It is necessary to use guesswork to specify the minimum and maximum values

* Solution?

---

class: mmedium

# Grid Search

* Solution?

* Run a smaller grid search, see if the optimum lies at either endpoint, and then expand the grid in that direction.

* This is manual grid search

* Grid search is dead simple to set up and trivial to parallelize.

* It is the most expensive method in terms of total computation time.

* If parallelized it can be fast. **If parallelized!**
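---

class: small

# Sketch: Grid Search in Code

* A minimal sketch, assuming scikit-learn and a Ridge regression on synthetic data; the grid of `alpha` values is only illustrative

* Note how `alpha` (the hyperparameter) is set outside training, while `coef_` (the parameters w) is learned during training

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# Exponential scale for the regularization hyperparameter: 1e-5, 1e-4, ..., 1
param_grid = {"alpha": np.logspace(-5, 0, 6)}

# Fits one model per grid point and CV fold, then picks the best grid point
search = GridSearchCV(Ridge(), param_grid, cv=5)
search.fit(X, y)

print("Best hyperparameter:", search.best_params_)
print("Learned parameters w:", search.best_estimator_.coef_)
```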
---

class: mmedium

# Random Search

* Instead of searching across the full grid, random search evaluates only a random sample of the grid points

* This in turn makes random search a cheaper alternative than grid search.

* Because it does not search the full grid it was not considered a serious strategy.

* Why?

--

* Because it was assumed that it could not beat the optimum found by grid search

* However, Bergstra and Bengio showed that, in surprisingly many instances, random search performs about as well as grid search

---

class: mmedium

# Fundamentals of Random Search

* For any distribution over a sample space with a finite maximum, the maximum of 60 random observations lies within the top 5% of the true maximum, with 95% probability.

* How?

* Imagine the 5% interval around the true maximum.

* We sample points from the space and see if any land within that interval.

* Each random sample has a 5% chance of landing in that interval

* If points are drawn independently, the probability of all of them missing the interval is `\( (1 - 0.05)^n \)`

* What does this mean?

---

class: small

# Random Sampling

* If points are drawn independently, the probability of all of them missing the interval is `\( (1 - 0.05)^n \)`

* What does this mean?

* The probability that at least one of them succeeds in hitting the interval is 1 minus that quantity.

* We want at least a 0.95 probability of success.

* To determine the number of draws needed we solve:

$$ 1 - (1 - 0.05)^n > 0.95 $$

* Resulting in:

$$ n \geq 60 $$

---

class: mmedium

# So, what does this all mean?

* It means that if at least 5% of the grid points give close-to-optimal solutions, then random search with 60 trials will identify that region with high probability (95%).

* The condition of the if-statement is very important.

* It can be satisfied if either the close-to-optimal region is large, or if somehow there is a high concentration of grid points in that region.

* The former is more likely, because a good machine learning model should not be overly sensitive to the hyperparameters, i.e., the close-to-optimal region is large.
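---

class: small

# Sketch: Random Search in Code

* A minimal sketch, assuming scikit-learn; the estimator and the sampling distributions are illustrative choices

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Hyperparameters are drawn from distributions instead of a fixed grid
param_distributions = {
    "n_estimators": randint(50, 300),
    "max_depth": randint(2, 20),
}

# 60 random draws: 1 - 0.95^60 is roughly 0.95, so with high probability
# at least one draw lands in the top 5% region (see the previous slides)
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions,
    n_iter=60,
    cv=3,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```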
---

class: medium

# Smart Hyperparameter Tuning

* Smart hyperparameter tuning is much less parallelizable than grid search or random search.

* In smart tuning one does not generate and evaluate all points up front. Instead, a few are selected and their quality is evaluated, to decide where to sample next.

* This is iterative and sequential, and therefore not very parallelizable.

* The goal is to make fewer evaluations and reduce the overall computation time.

---

class: mmedium

# Caveats of Smart Hyperparameter Tuning

* Smart search algorithms require computation time to figure out where to place the next set of samples.

* Some algorithms require much more time than others.

* Thus, they only make sense if the inner optimization (training the model) takes much longer than the process of deciding where to sample next.

* Smart search algorithms also contain parameters or hyperparameters of their own that need to be tuned.

* Sometimes tuning these hyper-hyperparameters is crucial to make the smart search algorithm faster than random search.

* Hyperparameter tuning is difficult because it is not possible to write down the mathematical formula (response surface) for the function being optimized.

---

class: medium

# Examples of Smart Hyperparameter Tuning

* In recent years there have been three approaches to smart hyperparameter tuning:

1- Derivative-free optimization

2- Bayesian optimization

3- Random forest smart tuning

* Derivative-free methods employ heuristics to determine where to sample next.

* Bayesian optimization and random forest smart tuning both model the response surface with another function, then sample more points based on what the model says.

---

class: mmedium

# Examples of Smart Hyperparameter Tuning

* RF smart tuning consists of training a random forest of regression trees to approximate the response surface. New points are sampled where the random forest predicts the optimal regions to be (see the sketch after this slide).

* Derivative-free optimization is a branch of mathematical optimization for situations where no derivative information is available

* Notable example methods include genetic algorithms

* It can be summarized in the following way:

1- Try a number of random points

2- Approximate the gradient

3- Find the most likely search direction

4- Go there.

---
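class: small

# Sketch: Smart Tuning with a Surrogate Model

* A toy sketch of the random-forest smart-tuning idea, not a real implementation; the SVC objective and the search ranges are illustrative assumptions (libraries such as SMAC or scikit-optimize do this properly)

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
rng = np.random.default_rng(0)

def objective(log_c):
    """Response surface: CV accuracy of an SVC as a function of log10(C)."""
    return cross_val_score(SVC(C=10.0 ** log_c), X, y, cv=3).mean()

# 1) Try a few random points first
tried = list(rng.uniform(-3, 3, size=5))
scores = [objective(c) for c in tried]

for _ in range(10):
    # 2) Model the response surface with a random forest of regression trees
    surrogate = RandomForestRegressor(n_estimators=50, random_state=0)
    surrogate.fit(np.array(tried).reshape(-1, 1), scores)
    # 3) Sample the next point where the surrogate predicts the optimum to be
    candidates = rng.uniform(-3, 3, size=200)
    best = candidates[np.argmax(surrogate.predict(candidates.reshape(-1, 1)))]
    tried.append(best)
    scores.append(objective(best))

print("best log10(C):", tried[int(np.argmax(scores))])
```

---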
Feature Selection, Engineering and Extraction
---

class: mmedium

# Defining Feature or Variable Selection

* The task of selecting a subset of variables from the set of input variables used by a learning algorithm

* The selected variables will be used for training whereas the remainder will be ignored (considered noise)

* This is also known as reducing the dimension of the set (dimensionality reduction)

---

# In More Specific Terms

* Mathematically:

* Given a set of features

$$ F = (f_1, \ldots, f_i, \ldots, f_n) $$

the Feature Selection problem is to find a subset that maximizes the learner's ability to classify or identify a pattern
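---

class: small

# Sketch: Scoring a Candidate Subset

* A minimal sketch of the objective above, assuming scikit-learn: a candidate subset of features is scored by the learner's cross-validated performance; the dataset, learner, and example subset are illustrative

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

def score_subset(columns):
    """The learner's ability to classify using only the given feature columns."""
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    return cross_val_score(model, X[:, columns], y, cv=5).mean()

print(score_subset(list(range(X.shape[1]))))  # F itself: all 30 features
print(score_subset([0, 1, 2, 3]))             # one candidate subset
```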
---

class: mmedium

# Dichotomy of Feature Selection

---

class: medium

# Importance of Feature Selection

* It can help with large or high-dimensional datasets where there are noisy features

* It can help with the performance of the model, although expecting a performance improvement is risky!

* It means we can counter overfitting

* It can also reduce computational time

* Finally, it can translate into actual increases in performance metrics

---

class: mmedium

# What is Feature Extraction?

* "Feature Extraction aims to reduce the number of features in a dataset by creating new features from the existing ones"

* The original features are then discarded, since you will be using the newly created ones!

* In other words:

* From the given set of features F you create a new set of features that summarizes the information in F.

* This is different from feature selection because this summarized version of F is a new set of features created from combinations of the original ones.

---

class: medium

# And Feature Engineering?

* This one is a gray area!

* Feature engineering is usually done during data processing and involves:

- combining features into new ones

- creating new features from data (word counting, vectorization, k-mers, etc.)

- creating a new feature using existing data (gross income from income data, duplication effects on gene expression)

* Feature engineering usually requires 2 things:

1- EDA

2- domain knowledge

---

# The Overall Purpose of FS, FE and FEng

* Reduce dimensionality and obtain informative features
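---

class: small

# Sketch: Feature Extraction with PCA

* A minimal sketch, assuming scikit-learn; PCA is one common extraction technique, and the wine dataset and the choice of 2 components are illustrative

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# The 2 new features are combinations of the 13 originals,
# which are then discarded
pca = PCA(n_components=2)
X_new = pca.fit_transform(X_scaled)

print(X.shape, "->", X_new.shape)      # (178, 13) -> (178, 2)
print(pca.explained_variance_ratio_)   # information kept per new feature
```

---

class: small

# Sketch: Feature Engineering with pandas

* A minimal sketch; the toy DataFrame and the derived columns are hypothetical examples of combining features and counting words

```python
import pandas as pd

df = pd.DataFrame({
    "income": [1200, 3400, 2100],
    "deductions": [200, 400, 100],
    "review_text": ["great product", "not good at all", "ok"],
})

# Combine existing columns into a new feature
df["net_income"] = df["income"] - df["deductions"]

# Create a feature from raw data (word counting)
df["review_word_count"] = df["review_text"].str.split().str.len()

print(df)
```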
---

# The Curse of Dimensionality

* The number of samples needed to achieve the same accuracy grows exponentially with the number of features

* However, the number of training instances is fixed, meaning the performance of the classifier will degrade for a large number of features

* Also, the information lost by discarding features can be compensated for by a more accurate mapping in a lower-dimensional space.

---

class: medium

# The Optimal Feature Set

* It is usually not feasible to find the optimal subset of features (the one that maximizes the scoring function)

* For most problems it is computationally intractable to reach the optimal result or optimal subset.

* This leads to settling for a sub-optimal subset

* "Optimal feature subset" usually refers to the performance metrics achieved, not to the subset per se

---

# Types of Feature Selection Methods

1- Wrapper

2- Filter

3- Embedded

---

class: small

# Wrapper Methods

* Wrapper methods train a new model for each candidate subset and use the error rate of the model on a hold-out set to score feature subsets.

* Wrapper methods are subdivided into exhaustive search, heuristic search, and random search.

* Exhaustive search enumerates all possible feature combinations. These methods are rarely used since the time complexity would be `\( O(2^n) \)`

* Non-exhaustive search methods are optimizations based on exhaustive search.

* Branch and Bound search saves time by cutting off branches that cannot contain a solution better than the best one found so far.

---

class: medium

# Heuristic Wrapper Methods

* Heuristic search includes SFS (Sequential Forward Selection) and SBS (Sequential Backward Selection).

* SFS starts from an empty set. At each step a feature x is added to the feature subset X so as to optimize the evaluation metric.

* SBS starts from the universal set and deletes one feature x at a time, evaluating the metric at each step.

* Both SFS and SBS are greedy and will likely fall into local optima.

---

class: mmedium

# Filter Methods

* Filter methods use evaluation criteria based on the intrinsic relationships between features to score a feature subset.

* Filter methods are independent of the type of predictive model.

* The result of a filter is therefore more general than that of a wrapper.

* They are usually less computationally intensive than wrapper methods.

* Filter methods have also been used as a preprocessing step for wrapper methods, allowing a wrapper to be used on much larger problems.

* Common measures come in four types: distance metrics, correlation, mutual information, and consistency metrics.

---

class: mmedium

# Embedded Methods

* The feature selection algorithm is integrated as part of the learning algorithm.

* The decision tree algorithm is the most typical and representative example.

* Decision tree algorithms select a feature at each recursive step of the tree growth process and divide the sample set into smaller subsets.

* The more the child nodes of a split are dominated by a single class, the more informative the feature is.

* The process of decision tree generation is thus also a process of feature selection.

* ID3, C4.5, and CART are all common decision tree algorithms.
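---

class: small

# Sketch: The Three Families in scikit-learn

* A minimal sketch, assuming scikit-learn; the dataset, estimators, and the choice of 10 features are illustrative

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import (SelectFromModel, SelectKBest,
                                       SequentialFeatureSelector,
                                       mutual_info_classif)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Filter: score features by mutual information, independent of any model
X_filter = SelectKBest(mutual_info_classif, k=10).fit_transform(X, y)

# Wrapper (SFS): greedily add features, retraining a model at each step
sfs = SequentialFeatureSelector(KNeighborsClassifier(),
                                n_features_to_select=10,
                                direction="forward", cv=3)
X_wrapper = sfs.fit_transform(X, y)

# Embedded: selection falls out of the learner itself
# (a CART tree keeps the features whose importance is at or above the mean)
embedded = SelectFromModel(DecisionTreeClassifier(random_state=0))
X_embedded = embedded.fit_transform(X, y)

print(X_filter.shape, X_wrapper.shape, X_embedded.shape)
```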
---

class: mmedium

# References

* [Chapter 4: Hyperparameter Tuning](https://www.oreilly.com/library/view/evaluating-machine-learning/9781492048756/ch04.html)

* [Hyperparameter Tuning](https://towardsdatascience.com/hyperparameter-tuning-c5619e7e6624)

* [Hyperparameter Optimization](https://towardsdatascience.com/hyperparameters-optimization-526348bb8e2d)

* [A review of feature selection techniques in bioinformatics](https://academic.oup.com/bioinformatics/article/23/19/2507/185254)

* [Common Methods for Feature Selection You Should Know](https://medium.com/@cxu24/common-methods-for-feature-selection-you-should-know-2346847fdf31)

* [Feature Extraction Techniques](https://towardsdatascience.com/feature-extraction-techniques-d619b56e31be)

* [Getting Data ready for modelling: Feature engineering, Feature Selection, Dimension Reduction (Part 1)](https://towardsdatascience.com/getting-data-ready-for-modelling-feature-engineering-feature-selection-dimension-reduction-77f2b9fadc0b)