Lecture 13: Unsupervised Learning

# Machine Learning

## Unsupervised Learning
### III-Verano 2019

---

# Unsupervised Learning

---

# Unsupervised Learning

* In Unsupervised Learning we give our algorithm data, and let it find "something interesting"
  
  * Examples:
  
     - Clusters of similar data: Similar players, cliques in social networks, similar sounding words
     - Common patterns of data: Attack or other action sequences used by many players
     - Related data items: Purchases often made together (with real or in-game currency), quests often chosen together, games played by the same people
     
---

# Why Unsupervised Learning?

* If we can find similar players, we can make them play together 
  
  * Friends can be recommended based on player type/preference 
  
  * Help new players by suggesting common actions/purchases made by similar players 
  
  * Recommend new games to players
  
  * If genes "behave" similarly, they are likely related or involved in the same biological process

---

# Clustering

---

# Clustering

* We are given `n` vectors, representing our players/games/words/...
  
  * How can we determine which vectors belong to the same "class"/"type"?
  
  * How many classes are there?
  
  * We call the classes *clusters*

---

# What is a Cluster?

* For now, we assume that each of our clusters is defined by a single point, the *center* of the cluster
  
  * Each data point is then assigned to a cluster based on which cluster center it is closest to
  
<img src="/PF-3341/assets/img/voronoigrowth.gif" width="40%"/>

---

# What is a good Clustering?

* Say we are told that we should create `k` "good" clusters 
  
    * k-center clustering: Minimize the maximum distance of any data point from its cluster center (firehouse placement)
  
    * k-median clustering: Minimize the sum of the distances of data points to their cluster center 
  
    * k-means clustering: Minimize the within-cluster variance, i.e. the average squared distance of data points from the mean of their cluster
  
  * Each of these is a measure for how "compact" a cluster is, but that does not necessarily tell us anything about cluster "usefulness", which is application-dependent
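
To make the three criteria concrete, here is a minimal sketch that evaluates all of them for a fixed set of centers. The function name `clustering_costs` and the toy data are made up for illustration. The k-means cost is written as the sum of squared distances, which differs from the average only by the constant factor `n`.

```python
import math

def clustering_costs(points, centers):
    """Return the (k-center, k-median, k-means) costs of the given centers."""
    # Distance from every data point to its closest cluster center.
    nearest = [min(math.dist(p, c) for c in centers) for p in points]
    k_center = max(nearest)                  # worst-case distance
    k_median = sum(nearest)                  # sum of distances
    k_means = sum(d * d for d in nearest)    # sum of squared distances
    return k_center, k_median, k_means

points = [(0, 0), (0, 2), (10, 0), (10, 4)]
centers = [(0, 1), (10, 2)]
print(clustering_costs(points, centers))     # (2.0, 6.0, 10.0)
```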

---

# k-Means Clustering

* k-means clustering puts more weight on outliers than k-median, but is not dominated by them like k-center
  
  * Especially for d-dimensional vectors, k-means is usually the first choice

* How do we find a k-means clustering? Naive approach: try all possible assignments of data points to clusters
  
  * Finding an optimal clustering is NP-hard :(
  
  * Lloyd's algorithm! (Often also just called "k-means algorithm")
  
---

# Lloyd's algorithm

* Determine `k` initial cluster centers, then iterate these two steps:
  
     - Assign each data point to its cluster based on the current centers 
     
     - Compute new centers as the mean of each cluster 
     
  * After "some" iterations we will have a clustering of the data 
  
  * This may be a local minimum when compared to the k-means criterion, but is often "good enough"
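
The two steps can be sketched in a few lines of plain Python. This is an illustrative sketch, not a tuned implementation: it assumes the data points are tuples of floats, picks random data points as initial centers, and stops once the assignment no longer changes. The names `lloyd` and `mean` are chosen here for illustration.

```python
import math
import random

def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def lloyd(points, k, iterations=100, seed=0):
    """Run Lloyd's algorithm and return (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)           # random data points as initial centers
    labels = None
    for _ in range(iterations):
        # Assignment step: index of the closest center for every data point.
        new_labels = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:              # nothing changed, so we have converged
            break
        labels = new_labels
        # Update step: move each center to the mean of its cluster.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:                       # keep the old center if a cluster is empty
                centers[j] = mean(members)
    return centers, labels

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centers, labels = lloyd(points, k=2)
print(centers, labels)
```

On this toy data the two groups are far apart, so the first three points end up in one cluster and the last three in the other; which cluster gets index 0 depends on the random initialization.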
  
---

# Generalized Distance Functions

* What if our data is not d-dimensional vectors, but e.g. all the data we have about each player?
  
  * For any two players, we can calculate a distance, but we can't make up an "average value"
  
  * In other words, all we have are our data points and pairwise distances, but no vector embedding
  
  * We can still cluster, we just have the restriction that each cluster center must be exactly on a data point 
  
---

# k-Medoids Clustering

* The cluster centers are called "medoids"
  
  * We use a variation of Lloyd's algorithm 
  
  * The only difference is how we assign new cluster centers 
  
  * One option: Use the data point that has the lowest sum of distance to the other data points in the cluster 
  
  * Other option: Choose a new data point as the new cluster center, and check if that new cluster center would result in a better clustering (slower, but more stable)
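
A minimal sketch of the first option, assuming all we have is a pairwise distance matrix `dist` (a list of lists) and that distinct items have a positive distance. The function names and the toy "player ratings" used to build the matrix are made up for illustration.

```python
import random

def assign(dist, medoids):
    """Label every item with the index of its closest medoid."""
    return [min(range(len(medoids)), key=lambda j: dist[i][medoids[j]])
            for i in range(len(dist))]

def k_medoids(dist, k, iterations=100, seed=0):
    """Cluster items 0..n-1 using only pairwise distances.

    Returns (medoids, labels); every medoid is the index of a data point.
    """
    n = len(dist)
    medoids = random.Random(seed).sample(range(n), k)
    for _ in range(iterations):
        labels = assign(dist, medoids)        # same assignment step as Lloyd's algorithm
        new_medoids = []
        for j in range(k):
            members = [i for i in range(n) if labels[i] == j]
            # New center: the member with the lowest sum of distances
            # to the other members of its cluster.
            new_medoids.append(min(members, key=lambda c: sum(dist[c][m] for m in members)))
        if new_medoids == medoids:            # centers are stable
            break
        medoids = new_medoids
    return medoids, assign(dist, medoids)

# Hypothetical example: the "distance" between players is their rating difference.
ratings = [1, 2, 3, 20, 21, 22]
dist = [[abs(a - b) for b in ratings] for a in ratings]
medoids, labels = k_medoids(dist, k=2)
print(sorted(medoids), labels)
```

Note that no average is ever computed: the update step only compares sums of entries of the distance matrix.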
  
---

# Lloyd's algorithm vs. k-Medoids Clustering

<img src="/PF-3341/assets/img/kmedoidsdetail.png" width="100%"/>
  
---

# Lloyd's algorithm vs. k-Medoids Clustering

<img src="/PF-3341/assets/img/kmedoidsresult.png" width="40%"/>
  
---

# We forgot something ...

* Where do we get the initial cluster centers from?
  
  * Simplest approach: Just pick data points at random; Problem: Results may be poor/unpredictable
  
  * Another idea: Pick a data point at random, then pick the data point that is furthest away from it, then pick the data point furthest away from both, etc.; Problem: Outliers affect the initialization
  
  * Another idea: Pick a data point at random, and assign a weight to every other data point based on its distance to the closest center chosen so far. Then pick the next center using these weights as probabilities, etc. (this is the idea behind k-means++)
  
  * You can also use the result from any other algorithm/guess/heuristic as an initialization: Lloyd's algorithm will never make the solution worse (as measured by the k-means clustering goal)!
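
The distance-weighted idea could look like the sketch below. It assumes the squared distance to the closest already chosen center as the weight, which is the choice made by k-means++; the slide itself only says "based on the distance". The function name `weighted_init` is made up, and the data is assumed to contain at least `k` distinct points.

```python
import math
import random

def weighted_init(points, k, seed=0):
    """Pick k initial centers, favoring points far from the centers chosen so far."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]            # first center: uniformly at random
    while len(centers) < k:
        # Weight of a point: squared distance to its closest chosen center.
        # Points that are already centers get weight 0 and are never picked again.
        weights = [min(math.dist(p, c) for c in centers) ** 2 for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers

points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centers = weighted_init(points, k=3)
print(centers)
```

Compared to always taking the furthest point, a single outlier is likely but not guaranteed to be picked, which softens its effect on the initialization.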
  
---

# Ward's Algorithm

* Start with each data point in its own cluster 
  
  * Merge two clusters until there are only `k` clusters left 
  
  * Which two clusters do you merge? The pair whose merge increases the sum of squared distances to the cluster centers the least 
  
  * This is basically "greedy" k-means
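
A small sketch of this greedy merging, assuming the usual formulation of Ward's criterion: merging clusters A and B increases the sum of squared distances to the centers by |A||B| / (|A| + |B|) times the squared distance between their centers. The function names are chosen for illustration, and the quadratic search over all pairs is only suitable for small data sets.

```python
import math
from itertools import combinations

def centroid(cluster):
    """Mean of a list of points given as tuples."""
    n = len(cluster)
    return tuple(sum(c) / n for c in zip(*cluster))

def merge_cost(a, b):
    """Increase of the sum of squared distances to the center if a and b are merged."""
    d2 = math.dist(centroid(a), centroid(b)) ** 2
    return len(a) * len(b) / (len(a) + len(b)) * d2

def ward(points, k):
    clusters = [[p] for p in points]          # every data point starts in its own cluster
    while len(clusters) > k:
        # Greedy choice: the pair of clusters that is cheapest to merge.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: merge_cost(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]                       # i < j, so index i stays valid
    return clusters

points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
print(ward(points, k=3))
```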
  
---

# Hierarchical Clustering

* A potential disadvantage of k-means clustering is that a human has to define the number of clusters in advance.

* Hierarchical clustering, however, does not require an initial number of clusters.
 
  * The most common type of hierarchical clustering is bottom-up or agglomerative clustering.

* It generates a dendrogram, starting from the leaves and combining clusters up to the trunk.

---

# Bottom-up Hierarchical Clustering

* Clustering by constructing a dendrogram

---

# HClust Construction

* The algorithm follows this simple process:

  1. A dissimilarity measure is defined between each pair of observations, e.g. Euclidean distance

  2. Each observation starts out in its own cluster

  3. The two most similar clusters are fused/merged, so that there are n-1 clusters

  4. The next two most similar clusters are fused, resulting in n-2 clusters

* The process is repeated iteratively until all observations are part of a single cluster.
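
The process can be sketched as a loop that records every merge; the recorded history is exactly the information a dendrogram displays. This sketch assumes Euclidean distance between observations and takes the dissimilarity between clusters as a function argument. The example uses the largest pairwise distance as one possible choice, and all names are made up for illustration.

```python
import math
from itertools import combinations

def agglomerate(points, cluster_dissimilarity):
    """Return the merge history as a list of (cluster_a, cluster_b, dissimilarity)."""
    clusters = [(p,) for p in points]         # one cluster per observation
    history = []
    while len(clusters) > 1:
        # Find and fuse the two least dissimilar clusters.
        a, b = min(combinations(clusters, 2),
                   key=lambda ab: cluster_dissimilarity(*ab))
        history.append((a, b, cluster_dissimilarity(a, b)))
        clusters = [c for c in clusters if c is not a and c is not b] + [a + b]
    return history

def largest_pairwise_distance(a, b):
    return max(math.dist(p, q) for p in a for q in b)

history = agglomerate([(0, 0), (0, 1), (4, 0), (4, 3)], largest_pairwise_distance)
for a, b, height in history:
    print(a, "+", b, "at height", height)
```

For `n` observations the history always contains `n - 1` merges, and cutting it off early yields any desired number of clusters.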

---

# Dissimilarity

* The algorithm is simple, but one thing has not been defined yet: the dissimilarity measure between clusters

* This is achieved with the concept of linkage.

* There are 4 main types of linkage: complete, average, single and centroid
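
As a sketch, the four linkage types can be written as functions of two clusters, assuming Euclidean distance between observations. Complete linkage uses the largest pairwise distance, single linkage the smallest, average linkage the mean of all pairwise distances, and centroid linkage the distance between the two cluster means. The helper `pairwise` is made up for illustration.

```python
import math
from itertools import product
from statistics import mean

def pairwise(a, b):
    """All distances between a point of cluster a and a point of cluster b."""
    return [math.dist(p, q) for p, q in product(a, b)]

def complete(a, b):
    return max(pairwise(a, b))

def single(a, b):
    return min(pairwise(a, b))

def average(a, b):
    return mean(pairwise(a, b))

def centroid(a, b):
    center_a = tuple(mean(c) for c in zip(*a))
    center_b = tuple(mean(c) for c in zip(*b))
    return math.dist(center_a, center_b)

a = [(0, 0), (0, 2)]
b = [(3, 0), (3, 2)]
for linkage in (complete, single, average, centroid):
    print(linkage.__name__, linkage(a, b))
```

Single, average and complete linkage always satisfy `single <= average <= complete`, since they are the minimum, mean and maximum of the same list of distances.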

---

# Linkage

---

# Which one do I use?

* Complete, average and centroid are the most commonly used

* Single linkage tends to yield unbalanced dendrograms.

---

# Other Measures

* Selecting the appropriate dissimilarity measure is important.

* Besides Euclidean distance there is also correlation-based distance.

* This considers two observations to be similar if their features are highly correlated, meaning that they have similar profiles, even if the observed values are far apart in terms of Euclidean distance.

---

# Correlation in Practice

* As an example, an online retailer is interested in clustering shoppers based on their previous shopping history.

* The goal is to identify subgroups of similar shoppers and show them ads to entice them to buy more.

* Using Euclidean distance, shoppers who buy very little overall will be grouped together, regardless of what they bought. This is not ideal.

* Using correlation-based distance, shoppers with similar preferences (they bought items A and B, but not C and D) will be clustered together, even if they bought different volumes of items.
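
A small numeric sketch of the shopper example. The purchase counts are invented, and the correlation-based distance is assumed to be one minus the Pearson correlation, which is a common choice but not the only one.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_distance(x, y):
    return 1 - pearson(x, y)

# Hypothetical purchase counts for the items A, B, C, D.
light = [2, 1, 0, 0]
heavy = [20, 10, 0, 0]    # same preferences as `light`, ten times the volume
other = [0, 0, 1, 2]      # low volume like `light`, but opposite preferences

print(math.dist(light, heavy), math.dist(light, other))
print(correlation_distance(light, heavy), correlation_distance(light, other))
```

Euclidean distance puts `light` closer to `other` (both buy little), while the correlation-based distance puts `light` right next to `heavy` (both prefer A and B).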

---

# Distribution-Based Clustering

* Our representation of clusters as single vectors had the advantage of being simple
  
  * However, clusters sometimes have different sizes/distributions 
  
  * So let's assume our clusters are probability distributions 
  
  * Let's start with Gaussians
  
---

# Why Gaussians?

* Many datasets can be modeled by Gaussian distributions

* It is intuitive to think of the clusters as coming from different Gaussian distributions

* With this notion it is possible to model a dataset as a mixture of several Gaussian distributions

* This is the core idea of Gaussian Mixture Models (GMMs)

---

# GMM

* Each Gaussian in a GMM has the following parameters:

  1. A mean μ that defines its center.

  2. A covariance Σ that defines its width.

  3. A mixing probability π that defines how much weight the Gaussian has in the mixture.
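
Written as a formula, and assuming `k` Gaussians indexed by `j` (the index names are chosen here for illustration), the density of the mixture is the weighted sum of the individual Gaussian densities, with mixing probabilities that sum to one:

```latex
p(x) = \sum_{j=1}^{k} \pi_j \, \mathcal{N}(x \mid \mu_j, \Sigma_j),
\qquad \pi_j \ge 0, \quad \sum_{j=1}^{k} \pi_j = 1
```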

---

# Gaussian Clusters

* Each cluster has a mean (point) and covariance (matrix)
  
  * The mean defines where the center of the cluster is 
  
  * The covariance matrix defines the size and extent 
  
  * The mean and covariance are the *parameters* of the distribution
  
  * Technically, all Gaussian distributions extend infinitely; we assign each data point to the cluster for which it has the highest probability (but we could allow membership in multiple clusters!). In other words, each Gaussian *contributes* to each data point with some (non-zero) probability
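
A one-dimensional sketch of these contributions, assuming each cluster is described by a hypothetical `(weight, mean, standard deviation)` triple. In one dimension the covariance matrix reduces to a single variance.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def contributions(x, params):
    """Normalized contribution of every Gaussian to the data point x."""
    dens = [w * gaussian_pdf(x, mu, sigma) for w, mu, sigma in params]
    total = sum(dens)
    return [d / total for d in dens]

params = [(0.5, 0.0, 1.0), (0.5, 4.0, 1.0)]   # two equally weighted Gaussians
r = contributions(1.0, params)
cluster = max(range(len(r)), key=r.__getitem__)  # hard assignment: largest contribution
print(r, cluster)
```

Every entry of `r` is strictly positive, which is the "soft" membership mentioned above; taking the largest entry turns it into a hard cluster assignment.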

---

# Expectation Maximization (EM)

* Similar to k-means, we can determine parameter values for k Gaussians iteratively
   
* Initialize k means and covariance matrices, then iterate:
   
      - (**E**xpectation Step) Calculate the current responsibilities/contributions for each data point from each Gaussian 
      
      - (**M**aximization Step) Use these responsibilities to calculate new means (weighted average of all data points) and covariance matrices
  
* Repeat until the clusters don't change anymore
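
A minimal one-dimensional sketch of this loop. Several things here are assumptions made for illustration: the initial means are passed in by the caller, all variances start at 1, the mixing weights start uniform and are re-estimated as well, a fixed number of iterations replaces the convergence check, and a small floor keeps the variances from collapsing to zero.

```python
import math

def pdf(x, mu, var):
    """Density of a 1-D Gaussian with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(xs, mus, iterations=50):
    k, n = len(mus), len(xs)
    mus = list(mus)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(iterations):
        # E step: resp[i][j] is the contribution of Gaussian j to data point i.
        resp = []
        for x in xs:
            dens = [weights[j] * pdf(x, mus[j], variances[j]) for j in range(k)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M step: responsibility-weighted means, variances and mixing weights.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            mus[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var, 1e-6)     # floor against collapsing variances
            weights[j] = nj / n
    return weights, mus, variances

xs = [-0.2, 0.0, 0.1, 0.3, 4.8, 5.0, 5.1, 5.3]
weights, mus, variances = em_gmm_1d(xs, mus=[0.0, 1.0])
print(weights, mus, variances)
```

Even though the second initial mean starts far from the second group of points, the means move to roughly the centers of the two groups after a few iterations.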
  
---

# Expectation Maximization

* The general mathematical formulation of EM is actually more powerful
  
  * It works for general, parameterized models with latent (inferred) variables
  
  * The Expectation step computes the probabilities for these latent variables (which we called the "contribution" of a Gaussian to a data point)
  
  * The Maximization step finds new parameters using these probabilities (our parameters were the mean and covariance) that maximize the likelihood of the data points
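
For the Gaussian mixture case the two steps have a closed form. The notation below is an assumption chosen for illustration: `n` data points `x_i`, `k` Gaussians indexed by `j`, mixing probabilities `\pi_j`, and `\gamma_{ij}` for the contribution (responsibility) of Gaussian `j` to data point `i`.

```latex
% E step: responsibility of Gaussian j for data point x_i
\gamma_{ij} = \frac{\pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}
                   {\sum_{l=1}^{k} \pi_l \, \mathcal{N}(x_i \mid \mu_l, \Sigma_l)}

% M step: re-estimate the parameters from the responsibilities
N_j = \sum_{i=1}^{n} \gamma_{ij}, \qquad
\mu_j = \frac{1}{N_j} \sum_{i=1}^{n} \gamma_{ij} \, x_i, \qquad
\Sigma_j = \frac{1}{N_j} \sum_{i=1}^{n} \gamma_{ij} \, (x_i - \mu_j)(x_i - \mu_j)^{\top}, \qquad
\pi_j = \frac{N_j}{n}
```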

---

# Density-Based Clustering

How do we cluster this?

No matter where we put our cluster centers, we can't cluster it into the inner and outer ring.

---

# Density-Based Clustering

* We can observe that clusters generally are "more dense" than the regions in between 
  
  * Let's start with each data point in its own cluster 
  
  * Single Linkage: We merge the two clusters with the smallest *distance* between any two of their points, among all cluster pairs 
  
  * Repeat until we have `k` clusters
  
  * Sometimes there are a few single points that would link two clusters, resulting in undesirable connections 
  
  * Robust Linkage: Connect two clusters only if there are `t` points in each close to the other cluster
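
A sketch of single linkage on two concentric rings. It processes point pairs from closest to farthest and merges clusters with a union-find structure, which yields the same merges as repeatedly connecting the two closest clusters. The ring data and all function names are made up for illustration; robust linkage is not implemented here.

```python
import math
from itertools import combinations

def single_linkage(points, k):
    """Merge the closest pair of clusters until k remain; returns one label per point."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]     # path halving
            i = parent[i]
        return i

    # All point pairs, sorted from closest to farthest.
    pairs = sorted(combinations(range(len(points)), 2),
                   key=lambda ij: math.dist(points[ij[0]], points[ij[1]]))
    clusters = len(points)
    for i, j in pairs:
        if clusters == k:
            break
        ri, rj = find(i), find(j)
        if ri != rj:                          # the pair connects two different clusters
            parent[ri] = rj
            clusters -= 1
    return [find(i) for i in range(len(points))]

def ring(radius, n):
    return [(radius * math.cos(2 * math.pi * t / n),
             radius * math.sin(2 * math.pi * t / n)) for t in range(n)]

points = ring(1.0, 20) + ring(3.0, 40)        # inner ring, then outer ring
labels = single_linkage(points, k=2)
print(set(labels[:20]), set(labels[20:]))
```

Neighboring points on each ring are much closer to each other than the gap between the rings, so each ring ends up as one cluster.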

---

# How many clusters?

* So far we have mostly ignored how many clusters there are. How do we get k?

* Define "some measure" of cluster quality, and then try `\(k=1,2,3,4,\ldots\)`
   - Statistical: Variance explained by clusters
   - Measurements of cluster density, span, etc. 
   - Usefulness in application (!)
   - etc.

* There are also some more advanced algorithms that don't need to be told k explicitly (e.g. DBSCAN)
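
As an illustration of trying several values of k, the sketch below uses the k-means cost (sum of squared distances to the closest center) as the quality measure. To keep the result deterministic it initializes the centers with the furthest-point heuristic instead of random points. The toy data with three groups and all names are made up.

```python
import math

def farthest_first(points, k):
    """Start with the first point, then repeatedly add the point furthest from all centers."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers

def kmeans_cost(points, k, iterations=50):
    """Run Lloyd's algorithm and return the sum of squared distances to the centers."""
    centers = farthest_first(points, k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda j: math.dist(p, centers[j]))].append(p)
        centers = [tuple(sum(c) / len(g) for c in zip(*g)) if g else centers[j]
                   for j, g in enumerate(groups)]
    return sum(min(math.dist(p, c) for c in centers) ** 2 for p in points)

points = [(0, 0), (0, 1), (1, 0),
          (10, 0), (10, 1), (11, 0),
          (5, 9), (5, 10), (6, 9)]
costs = {k: kmeans_cost(points, k) for k in range(1, 6)}
for k, cost in costs.items():
    print(k, round(cost, 2))
```

The cost drops sharply up to `k = 3` and only marginally afterwards. Such a bend in the curve is one possible hint at a reasonable number of clusters, but whether the clusters are useful still depends on the application.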

---

# References
  
  * [Foundations of Data Science](https://www.cs.cornell.edu/jeh/book.pdf)
  
  * *Pattern Recognition and Machine Learning* (Chapter 9), by Christopher Bishop