Supervised Learning: Learn a mapping from examples
Unsupervised Learning: Learn an interesting thing about data
Reinforcement Learning: Learn what to do in an environment, given feedback information
Say there is a function f(x) = y (where x is typically a vector)
Supervised Learning: We know x and y, and are trying to find f (more accurately: a probability distribution P(y|x))
Unsupervised Learning: We know x and are trying to find "interesting" f and y
Reinforcement Learning: We know f*, and are trying to get "the best y" by choosing x
*: Terms and Conditions may apply
In Unsupervised Learning we give our algorithm data, and let it find "something interesting"
Examples:
If we can find similar players, we can make them play together
Friends can be recommended based on player type/preference
Help new players by suggesting common actions/purchases made by similar players
Recommend new games to players
If genes "behave" similarly, they are likely related or associated in a biological process
We are given n vectors, representing our players/games/words/...
How can we determine which vectors belong to the same "class"/"type"?
How many classes are there?
We call the classes clusters
For now, we assume that each of our clusters is defined by a single point, the center of the cluster
Each data point is then assigned to a cluster based on which cluster center it is closest to
Say we are told that we should create k "good" clusters
k-center clustering: Minimize the maximum distance of any data point from its cluster center (firehouse placement)
k-median clustering: Minimize the sum of the distances of data points to their cluster center
k-means clustering: Minimize the within-cluster variance, i.e. the average squared distance of data points from their cluster mean
Each of these is a measure for how "compact" a cluster is, but that does not necessarily tell us anything about cluster "usefulness", which is application-dependent
k-means clustering puts more weight on outliers than k-median, but is not dominated by them like k-center
Especially for d-dimensional vectors, k-means is usually the first choice
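As a sketch, writing c(x_i) for the center that data point x_i is assigned to, the three objectives can be stated as:

```latex
\text{k-center:} \quad \min \; \max_i \; d(x_i, c(x_i)) \\
\text{k-median:} \quad \min \; \sum_i d(x_i, c(x_i)) \\
\text{k-means:} \quad \min \; \sum_i \lVert x_i - c(x_i) \rVert^2
```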
How do we find a k-means clustering? Naive idea: try all possible assignments of data points to clusters
Finding an optimal clustering is NP-hard :(
Lloyd's algorithm! (Often also just called "k-means algorithm")
Determine k initial cluster centers, then iterate these two steps:
Assign each data point to its cluster based on the current centers
Compute new centers as the mean of each cluster
After "some" iterations we will have a clustering of the data
This may be a local minimum when compared to the k-means criterion, but is often "good enough"
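A minimal NumPy sketch of Lloyd's algorithm (the function name and the empty-cluster handling are my own choices, not from the slides):

```python
import numpy as np

def lloyd(points, centers, n_iter=100):
    """Lloyd's algorithm sketch: points is (n, d), centers is (k, d)."""
    for _ in range(n_iter):
        # Assignment step: each data point goes to its nearest center
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center becomes the mean of its cluster
        new_centers = centers.copy()
        for j in range(len(centers)):
            members = points[labels == j]
            if len(members) > 0:  # keep the old center if a cluster runs empty
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break  # assignments will no longer change
        centers = new_centers
    return labels, centers
```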
What if our data is not d-dimensional vectors, but e.g. all the data we have about each player?
For any two players, we can calculate a distance, but we can't make up an "average value"
In other words, all we have are our data points and pairwise distances, but no vector embedding
We can still cluster, we just have the restriction that each cluster center must be exactly on a data point
The cluster centers are called "medoids"
We use a variation of Lloyd's algorithm
The only difference is how we assign new cluster centers
One option: Use the data point that has the lowest sum of distances to the other data points in the cluster
Other option: Choose a new data point as the new cluster center, and check if that new cluster center would result in a better clustering (slower, but more stable)
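A sketch of the first option, assuming a precomputed pairwise distance matrix D (both names are mine):

```python
import numpy as np

def update_medoid(D, members):
    """Return the cluster member with the lowest sum of distances
    to the other members. D is an (n, n) pairwise distance matrix,
    members an array of indices of the points in one cluster."""
    within = D[np.ix_(members, members)]  # distances inside the cluster
    return members[within.sum(axis=1).argmin()]
```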
Where do we get the initial cluster centers from?
Simplest approach: Just pick data points at random; Problem: Results may be poor/unpredictable
Another idea: Pick a data point at random, then pick the data point that is furthest away from it, then pick the data point furthest away from both, etc.; Problem: Outliers affect the initialization
Another idea: Pick a data point at random, and assign weights to each other data point based on the distance. Then pick the next center using these weights as probabilities, etc.
You can also use the result from any other algorithm/guess/heuristic as an initialization, Lloyd's algorithm will never make the solution worse (as measured by the k-means clustering goal)!
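The distance-weighted idea above is essentially the k-means++ initialization. A sketch (using squared distances as weights, one common choice):

```python
import numpy as np

def pp_init(points, k, rng=None):
    """k-means++-style initialization: far-away points are more likely
    to be picked as the next center."""
    rng = rng or np.random.default_rng()
    centers = [points[rng.integers(len(points))]]  # first center: uniform pick
    for _ in range(k - 1):
        # squared distance of each point to its nearest chosen center
        d2 = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(points[rng.choice(len(points), p=d2 / d2.sum())])
    return np.array(centers)
```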
Start with each data point in its own cluster
Merge two clusters at a time until there are only k clusters left
Which two clusters do we merge? The pair whose merge increases the average distance from the cluster centers the least
This is basically "greedy" k-means
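A closely related merge criterion (Ward's method, which greedily merges the pair that increases within-cluster variance the least) is available in SciPy; a sketch, assuming X is an (n, d) data array:

```python
from scipy.cluster.hierarchy import linkage, fcluster

Z = linkage(X, method="ward")                    # greedy bottom-up merging
labels = fcluster(Z, t=3, criterion="maxclust")  # stop once k = 3 clusters remain
```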
Our representation of clusters as single vectors had the advantage of being simple
However, clusters sometimes have different sizes/distributions
So let's assume our clusters are probability distributions
Let's start with Gaussians
Many datasets can be modeled by a Gaussian distribution
It is intuitive to think of the clusters as coming from different Gaussian distributions
With this notion, we can model the dataset as a mixture of several Gaussian distributions
This is the core idea of Gaussian Mixture Models
A Gaussian (normal) distribution in our model has several parameters:
A mean μ that defines its center
A covariance matrix Σ that defines its width (in each dimension!)
Each cluster has a mean (point) and covariance (matrix)
The mean defines where the center of the cluster is
The covariance matrix defines the size and extent
The mean and covariance are the parameters of the distribution
Technically, every Gaussian distribution extends infinitely, so each Gaussian contributes to each data point with some (non-zero) probability
We assign each data point to the cluster for which it has the highest probability (but we could also allow membership in multiple clusters!)
Similar to k-means, we can determine parameter values for k Gaussians iteratively
Initialize k means and covariance matrices, then iterate:
(Expectation Step) Calculate the current responsibilities/contributions for each data point from each Gaussian
(Maximization Step) Use these responsibilities to calculate new means (as weighted averages of all data points) and covariance matrices
Repeat until the clusters don't change anymore
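A compact EM sketch for a Gaussian mixture (no numerical safeguards; all names are mine):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, means, covs, weights, n_iter=100):
    """X: (n, d) data; means: (k, d); covs: (k, d, d); weights: (k,)."""
    n, k = len(X), len(means)
    for _ in range(n_iter):
        # E step: responsibility of each Gaussian for each data point
        r = np.column_stack([w * multivariate_normal.pdf(X, m, c)
                             for w, m, c in zip(weights, means, covs)])
        r /= r.sum(axis=1, keepdims=True)
        # M step: re-estimate parameters from the responsibilities
        nk = r.sum(axis=0)                # effective cluster sizes
        weights = nk / n
        means = (r.T @ X) / nk[:, None]   # responsibility-weighted means
        covs = np.array([(r[:, j, None] * (X - means[j])).T @ (X - means[j]) / nk[j]
                         for j in range(k)])
    return means, covs, weights, r
```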
The general mathematical formulation of EM is actually more powerful
It works for general, parameterized models with latent (inferred) variables
The Expectation step computes the probabilities for these latent variables (which we called the "contribution" of a Gaussian to a data point)
The Maximization step finds new parameters using these probabilities (our parameters were the means and covariances) that maximize the likelihood of the data points
How do we cluster data that forms two concentric rings?
No matter where we put our cluster centers, we can't split it into the inner and the outer ring.
We can observe that clusters generally are "more dense" than the regions in between
Let's start with each data point in its own cluster
Single Linkage: Connect the two clusters that contain the closest pair of points across all cluster pairs
Repeat until we have k clusters
Sometimes there are a few single points that would link two clusters, resulting in undesirable connections
Robust Linkage: Connect two clusters only if there are t points in each that are close to the other cluster
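A sketch of single linkage on the two-rings example from above (the toy data generation is mine):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data: two concentric rings, which center-based clustering can't separate
rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, 200)
radii = np.repeat([1.0, 3.0], 100)  # 100 points on the inner ring, 100 on the outer
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
X += rng.normal(scale=0.05, size=X.shape)

Z = linkage(X, method="single")                  # single-linkage merging
labels = fcluster(Z, t=2, criterion="maxclust")  # cut into k = 2 clusters
```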
So far we have kind of ignored how many clusters there are, but how do we get k?
Define "some measure" of cluster quality, and then try k=1,2,3,4,…
There are also some more advanced algorithms that don't need to be told k explicitly (e.g. DBSCAN)
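A sketch of the try-every-k approach (the "elbow" heuristic) with scikit-learn, assuming X is an (n, d) data array:

```python
from sklearn.cluster import KMeans

# Print the k-means objective for increasing k and look for the
# "elbow" where adding more clusters stops helping much
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10).fit(X)
    print(k, km.inertia_)  # inertia_: sum of squared distances to centers
```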
Pattern Recognition and Machine Learning (Chapter 9), by Christopher Bishop