class: center, middle

# Computational and statistical techniques of Machine Learning

## ML Pipeline

---

# ML Pipeline

---

class: center, middle

# Data Collection, Storage and Loading

---

# Data Collection

* For your projects, y'all needed data
* For our purposes, we used data sets that someone else had already collected and prepared
* In other cases, we will have to collect data ourselves (experiments, surveys, web crawlers/scrapers, etc.)

---

# Data Collection

1. API use
2. Data wrangling: find the appropriate format
3. EDA as a quality control process
4. Data input decisions

---

# Data Storage

* Where? Public repos, local servers, the cloud
* How? Metadata, format, ontologies
* What? Data as numbers, images, sound, documents, etc. In some cases we can only store the code to generate the data, or the protocol to retrieve it.
* How long? Financial and ethical issues

---

# Data Loading

Questions we need to answer:

* How are we going to split the data?
* Do we need to normalize it?
* Do we need feature selection?

This is not an exhaustive list, but it can be a good start for a data project

---

# Data Loading

* So far we have used custom code to load data into tensors
* PyTorch actually has a class `Dataset`, which you can subclass to represent a data set
* There are various utilities to process these datasets in parallel
* Torchvision also provides subclasses for several popular datasets, including MNIST

---

# Image Folders

* For several common tasks, torchvision also provides specialized classes
* For example, if you have a classification task over images, with one folder per class containing all images of that class, there is `torchvision.datasets.ImageFolder`
* Similarly, if your data is not in an image format, there is an extensible `torchvision.datasets.DatasetFolder`

---

# Streaming Data

* Remember mini-batch learning? Instead of using **all** data, we divided it into smaller batches
* Well, if we don't use it all at once, we don't have to load it all at once
* PyTorch gives you a `torch.utils.data.IterableDataset`, which you can subclass to load data piece by piece
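
---

class: small

# Data Loading in Code

A minimal sketch of the `Dataset` idea from the previous slides: a made-up `TabularDataset` wraps a pair of in-memory tensors (hypothetical random data here), and a `DataLoader` handles shuffling and mini-batching.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class TabularDataset(Dataset):
    """A tiny custom Dataset wrapping a feature tensor X and a target tensor y."""
    def __init__(self, X, y):
        self.X = X
        self.y = y

    def __len__(self):
        return len(self.X)                 # number of samples

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]    # one (features, target) pair

# hypothetical data: 1000 samples, 20 features, binary labels
X = torch.randn(1000, 20)
y = torch.randint(0, 2, (1000,))

loader = DataLoader(TabularDataset(X, y), batch_size=32, shuffle=True)
for xb, yb in loader:
    pass   # each iteration yields one mini-batch
```

For streaming data, the same pattern works with `IterableDataset`, except that you implement `__iter__` instead of `__len__`/`__getitem__`.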

---

class: center, middle

# Feature Selection, Engineering and Extraction

---

class: mmedium

# Defining Feature or Variable Selection

* The task of selecting a subset of variables from the set of input variables used by a learning algorithm
* These input variables will be used for training, whereas the remainder will be ignored (considered noise)
* This is also known as reducing the dimension of the set (dimensionality reduction)

---

# In more specific Terms

* Mathematically: given a set of features

$$ F = (f_1, \dots, f_i, \dots, f_n) $$

  the Feature Selection problem is to find a subset of F that maximizes the learner's ability to classify or identify a pattern

---

class: mmedium

# Dichotomy of Feature Selection

---

class: medium

# Importance of Feature Selection

* It can help with large or high-dimensional datasets where there are noisy features
* It can help with the performance of the model (but beware: relying on it to improve performance is risky!)
* It means we can counter overfitting
* It can also reduce computational time
* Finally, it can translate into actual increases in performance metrics

---

class: mmedium

# What is Feature Extraction?

* "Feature Extraction aims to reduce the number of features in a dataset by creating new features from the existing ones"
* The original features are then discarded, since you will be using the newly created ones!
* In other words: from a given set of input features F, you create a new set of features that summarizes the information in F
* This is different from feature selection because the new, summarized version of F is created from combinations of the original features

---

class: medium

# And Feature Engineering?

* This one is a gray area!
* Feature engineering is usually done during data processing and involves:
  - combining features into new ones
  - creating new features from data (word counting, vectorization, k-mers, etc.)
  - creating a new feature using existing data (gross income from income data, duplication effects on gene expression)
* Feature engineering usually requires two things:
  1. exploratory data analysis
  2. domain knowledge

---

# The Overall Purpose of FS, FE and FEng

* Reduce dimensionality and obtain informative features
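
---

class: small

# Feature Extraction in Code

A minimal sketch of feature extraction with PCA (one common technique, shown here on hypothetical random data): the new features are combinations of the original ones, which are then discarded.

```python
import numpy as np
from sklearn.decomposition import PCA

# hypothetical data: 200 samples described by 50 (partly redundant) features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))

# extract 10 new features, each a linear combination of the original 50
pca = PCA(n_components=10)
X_new = pca.fit_transform(X)

print(X_new.shape)                    # (200, 10)
print(pca.explained_variance_ratio_)  # how much variance each new feature keeps
```

Feature selection, in contrast, would keep 10 of the original 50 columns unchanged and drop the rest.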

---

# The Curse of Dimensionality

* The number of samples needed to achieve the same accuracy grows exponentially with the number of features
* However, the number of training instances is fixed, meaning the performance of the classifier will degrade for a large number of features
* Also, the information lost by discarding features can be compensated for by a more accurate mapping in the lower-dimensional space

---

class: medium

# The Optimal Feature Set

* It is usually not feasible to find the optimal subset of features (the one that maximizes the scoring function)
* For most problems it is computationally intractable to reach the optimal result or optimal subset
* This leads to settling for a sub-optimal subset
* "Optimal Feature Subset" usually refers to the performance metrics achieved, not to the subset per se

---

# Types of Feature Selection Methods

1. Wrapper
2. Filter
3. Embedded

---

class: small

# Wrapper Methods

* Wrapper methods train a new model for each feature subset and use the error rate of the model on a hold-out set to score the subset
* Wrapper methods are subdivided into exhaustive search, heuristic search, and random search
* Exhaustive search enumerates all possible feature combinations. These methods are rarely used since the time complexity would be `\( O(2^n) \)`
* Non-exhaustive search methods are optimizations based on exhaustive search
* Branch and Bound search saves time by cutting off branches that cannot contain a solution better than the best one found so far

---

class: medium

# Heuristic Wrapper Methods

* Heuristic search includes SFS (Sequential Forward Selection) and SBS (Sequential Backward Selection)
* SFS starts from an empty set; at each step, the feature x that most improves the evaluation metric is added to the feature subset X
* SBS starts from the universal set and deletes one feature x at a time, evaluating the metric at each step
* Both SFS and SBS are greedy and will likely fall into local optima

---

class: mmedium

# Filter Methods

* Filter methods score a feature subset using evaluation criteria derived from the intrinsic connections between features
* Filter methods are independent of the type of predictive model
* The result of a filter is therefore more general than that of a wrapper
* They are usually less computationally intensive than wrapper methods
* Filter methods have also been used as a preprocessing step for wrapper methods, allowing a wrapper to be applied to larger problems
* Common measures fall into four types: distance metrics, correlation, mutual information, and consistency metrics

---

class: mmedium

# Embedded Methods

* The feature selection algorithm is integrated as part of the learning algorithm
* Example: decision trees
* A decision tree algorithm selects a feature in each recursive step of the tree-growing process and divides the sample set into smaller subsets
* We can end the splitting at any point (using an entropy threshold, for example), and this naturally limits which features are used
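
---

class: small

# Wrapper Methods in Code

A minimal SFS sketch, assuming hypothetical data, logistic regression as an example learner, and cross-validated accuracy as the scoring function: each round greedily adds the single feature that helps the most.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# hypothetical data: 15 features, only a few of them informative
X, y = make_classification(n_samples=300, n_features=15, n_informative=4, random_state=0)

model = LogisticRegression(max_iter=1000)
selected, remaining = [], list(range(X.shape[1]))

for _ in range(5):   # stop after 5 features (itself a choice we have to make)
    # score every candidate subset "already selected + one extra feature"
    scores = {f: cross_val_score(model, X[:, selected + [f]], y, cv=5).mean()
              for f in remaining}
    best = max(scores, key=scores.get)   # greedy step: keep the most helpful feature
    selected.append(best)
    remaining.remove(best)

print("selected features:", selected)
```

SBS would be the mirror image: start from all 15 features and greedily remove the least useful one each round.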

---

class: center, middle

# Hyperparameter Tuning

---

class: small

# Model Parameters

* Machine learning models are mathematical functions that represent the relationship between different aspects of data
* The linear regression model, for example, uses a line to represent the relationship between features and target:

$$ y = \vec{w}\cdot \vec{x}' $$

--

* x is a vector of features from the data and y is a scalar variable, which represents the response variable
* Linear regression assumes that the relationship between x and y is linear
* w is a weight vector (the **parameters** of the model) that specifies the slope of the line/plane. It is learned during the training phase.
* When we say "training a model", we are talking about optimization processes that find the best model parameters to fit the data

---

class: medium

# So What is a Hyperparameter?

* Hyperparameters are values that MUST be specified outside the training procedure
* Hyperparameters for linear regression?

--

* None! (But there are variations that have them)
* Neural networks: number of layers, neurons, type of activation function, ...
* Decision trees have hyperparameters such as a depth cutoff

---

class: mmedium

# Working with Hyperparameters

* Hyperparameters have interesting functions such as controlling model capacity (how flexible the model is, how many degrees of freedom it has in fitting the data)
* Proper control of the model capacity prevents overfitting
* Hyperparameters can also come from the training process
* Training a machine learning model often involves optimizing a loss function (the training metric)
* Mathematical optimization techniques may be employed, which may have parameters of their own
* Gradient descent optimization requires a learning rate
* Random forests and boosted decision trees require knowing the total number of trees

---

class: small

# Mechanisms to Tune Hyperparameters

* Hyperparameter settings can have a big impact on the prediction accuracy of the trained model
* Optimal hyperparameter settings often differ between datasets
* Therefore they should be tuned for each dataset
* Since the training process doesn't set the hyperparameters, there needs to be a meta-process that tunes them
* This is what we mean by hyperparameter tuning
* Hyperparameter tuning is a meta-optimization task
* The outcome of hyperparameter tuning is the best hyperparameter setting, and the outcome of model training is the best model parameter setting

---

class: mmedium

# Hyperparameter Tuning Algorithms

* Hyperparameter tuning is an optimization task
* This is similar to model training
* In model training, the quality of the model is expressed as a mathematical representation, the loss function
* In hyperparameter tuning, however, the quality of the hyperparameters cannot be expressed formally, as it depends on the results of the trained model
* This makes hyperparameter tuning a difficult task
* Up until a few years ago, the only available methods were grid search and random search

---

class: mmedium

# Grid Search

* Grid search picks out a grid of hyperparameter values and evaluates EVERY ONE of them
* It returns the highest-performing parameter or combination of parameters based on performance metrics
* For example, if the hyperparameter is the number of neurons in the hidden layer, then the grid could be 10, 20, 30, ..., 100
* For other parameters, it is common to use an exponential scale: 1e-5, 1e-4, 1e-3, ..., 1
* It is necessary to use guesswork to specify the minimum and maximum values
* Solution?

---

class: mmedium

# Grid Search

* Solution?
* Run a smaller grid search, see if the optimum lies at either endpoint, and then expand the grid in that direction
* This is the manual grid search
* Grid search is dead simple to set up and trivial to parallelize
* It is the most expensive method in terms of total computation time
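
---

class: small

# Grid Search in Code

A minimal sketch using scikit-learn's `GridSearchCV` on hypothetical data, with a random forest as an example model: every combination in the grid is trained and scored with cross-validation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# hypothetical data set
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# the grid: all 3 x 3 = 9 combinations are evaluated
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, None],
}

search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)  # the highest-scoring combination
print(search.best_score_)   # its mean cross-validated accuracy
```

The cost is the number of grid points times the cost of one cross-validated training run, which is why grid search gets expensive quickly.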

---

class: mmedium

# Random Search

* Instead of searching across the full grid space, random search evaluates only a random sample of the grid
* This in turn makes random search a cheaper alternative to grid search
* Because it does not search the full grid, it was not considered a serious strategy
* Why?

--

* Because it was assumed that it could not beat the optimum found by grid search
* However, Bergstra and Bengio showed that, in surprisingly many instances, random search performs about as well as grid search

---

class: mmedium

# Fundamentals of Random Search

* For any distribution over a sample space with a finite maximum, the maximum of 60 random observations lies within the top 5% of the true maximum, with 95% probability
* How?
* Imagine the 5% interval around the true maximum
* We sample points from the space and see if any land within that interval
* Each random sample has a 5% chance of landing in that interval
* If points are drawn independently, the probability of all of them missing the interval is `\( (1 - 0.05)^n \)`
* What does this mean?

---

class: small

# Random Sampling

* If points are drawn independently, the probability of all of them missing the interval is `\( (1 - 0.05)^n \)`
* What does this mean?
* The probability that at least one of them succeeds in hitting the interval is 1 minus that quantity
* We want at least a 0.95 probability of success
* To determine the number of draws needed, we solve:

$$ 1 - (1 - 0.05)^n > 0.95 $$

* Resulting in:

$$ n \geq 60 $$
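
---

class: small

# Random Search in Code

A sketch with `RandomizedSearchCV` on the same kind of hypothetical setup, drawing 60 random hyperparameter settings in line with the argument above (quick check: `1 - 0.95**60 ≈ 0.954 > 0.95`).

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# hypothetical data set
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# sample hyperparameters from distributions instead of fixing a grid
param_distributions = {
    "n_estimators": randint(10, 300),
    "max_depth": randint(2, 20),
}

# 60 random trials: with ~95% probability at least one lands in the top-5% region
search = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                            param_distributions, n_iter=60, cv=5, random_state=0)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```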

---

class: mmedium

# So, what does this all mean?

* It means that if at least 5% of the points in the search space result in close-to-optimal solutions, then random search with 60 trials will identify that region with high probability (95%)
* The condition of that if-statement is very important
* It can be satisfied if either the close-to-optimal region is large, or if somehow there is a high concentration of grid points in that region
* The former is more likely, because a good machine learning model should not be overly sensitive to the hyperparameters, i.e., the close-to-optimal region is large

---

class: medium

# Smart Hyperparameter Tuning

* Smart hyperparameter tuning is much less parallelizable than grid search or random search
* In smart tuning, one does not generate and evaluate all points up front; instead a few are selected and their quality is evaluated, to decide where to sample next
* This is iterative and sequential, and therefore not very parallelizable
* The goal is to make fewer evaluations and reduce the overall computation time

---

class: mmedium

# Caveats of Smart Hyperparameter Tuning

* Smart search algorithms require computation time to figure out where to place the next set of samples
* Some algorithms require much more time than others
* Thus, they only make sense if the inner optimization takes much longer than the process of deciding where to sample next
* Smart search algorithms also contain parameters or hyperparameters of their own that need to be tuned
* Sometimes tuning these hyper-hyperparameters is crucial to make the smart search algorithm faster than random search
* Hyperparameter tuning is difficult because it is not possible to write down the mathematical formula (the response surface) for the function being optimized

---

class: mmedium

# References

* [Chapter 4: Hyperparameter Tuning](https://www.oreilly.com/library/view/evaluating-machine-learning/9781492048756/ch04.html)
* [Hyperparameter Tuning](https://towardsdatascience.com/hyperparameter-tuning-c5619e7e6624)
* [Hyperparameter Optimization](https://towardsdatascience.com/hyperparameters-optimization-526348bb8e2d)
* [A review of feature selection techniques in bioinformatics](https://academic.oup.com/bioinformatics/article/23/19/2507/185254)
* [Common Methods for Feature Selection You Should Know](https://medium.com/@cxu24/common-methods-for-feature-selection-you-should-know-2346847fdf31)
* [Feature Extraction Techniques](https://towardsdatascience.com/feature-extraction-techniques-d619b56e31be)
* [Getting Data ready for modelling: Feature engineering, Feature Selection, Dimension Reduction (Part 1)](https://towardsdatascience.com/getting-data-ready-for-modelling-feature-engineering-feature-selection-dimension-reduction-77f2b9fadc0b)