
Computational and statistical techniques of Machine Learning

Clustering

1 / 47

Reminder: Machine Learning

  • Supervised Learning: Learn a mapping from examples

  • Unsupervised Learning: Learn an interesting thing about data

  • Reinforcement Learning: Learn what to do in an environment, given feedback information

2 / 47

Machine Learning

Say there is a function f(x)=y

  • Supervised Learning: We know x and y, and are trying to find f (more accurately: a probability distribution P(y|x))

  • Unsupervised Learning: We know x and are trying to find "interesting" f and y

  • Reinforcement Learning: We know f*, and are trying to get "the best y" by choosing x

*: Terms and Conditions may apply

3 / 47

Unsupervised Learning

  • In Unsupervised Learning we give our algorithm data, and let it find "something interesting"

  • Examples:

    • Clusters of similar data: Similar players, cliques in social networks, similar sounding words
    • Common patterns of data: Attack or other action sequences used by many players
    • Related data items: Purchases often made together (with real or in-game currency), quests often chosen together, games played by the same people
4 / 47

Why Unsupervised Learning?

  • If we can find similar players, we can make them play together

  • Friends can be recommended based on player type/preference

  • Help new players by suggesting common actions/purchases made by similar players

  • Recommend new games to players

  • If genes "behave" similarly, they are likely to be related or associated in a biological process

5 / 47

Clustering

6 / 47

Clustering

  • We are given n vectors, representing our players/games/words/...

  • How can we determine which vectors belong to the same "class"/"type"?

  • How many classes are there?

  • We call the classes clusters

7 / 47

Clustering

8 / 47

Clustering

9 / 47

What is a Cluster?

  • For now, we assume that each of our clusters is defined by a single point, the center of the cluster

  • Each data point is then assigned to a cluster based on which cluster center it is closest to

10 / 47

What is a good Clustering?

  • Say we are told that we should create k "good" clusters

    • k-center clustering: Minimize the maximum distance of any data point from its cluster center (firehouse placement)

    • k-median clustering: Minimize the sum of the distances of data points to their cluster center

    • k-means clustering: Minimize the within-cluster variance (the average squared distance of the data points from their cluster mean)

  • Each of these is a measure for how "compact" a cluster is, but that does not necessarily tell us anything about cluster "usefulness", which is application-dependent
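
For illustration, here is a minimal Python/NumPy sketch (not from the slides; the function and variable names are made up) that evaluates all three objectives for a given assignment of points to centers:

    import numpy as np

    def clustering_objectives(points, centers, labels):
        """points: (n, d) array, centers: (k, d) array,
        labels: length-n array giving each point's center index."""
        dists = np.linalg.norm(points - centers[labels], axis=1)
        k_center = dists.max()          # k-center: worst-case distance
        k_median = dists.sum()          # k-median: sum of distances
        k_means = np.sum(dists ** 2)    # k-means: sum of squared distances
        return k_center, k_median, k_means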

11 / 47

k-Means Clustering

  • k-means clustering puts more weight on outliers than k-median, but is not dominated by them like k-center

  • Especially for d-dimensional vectors, k-means is usually the first choice

  • How do we find a k-means clustering? Try all possible assignments of data points to clusters

  • Finding an optimal clustering is NP-hard :(

  • Lloyd's algorithm! (Often also just called "k-means algorithm")

12 / 47

Lloyd's algorithm

  • Determine k initial cluster centers, then iterate these two steps:

    • Assign each data point to the cluster whose current center is closest to it

    • Compute new centers as the mean of each cluster

  • After "some" iterations we will have a clustering of the data

  • This may only be a local optimum of the k-means objective, but it is often "good enough"
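
A minimal NumPy sketch of these two alternating steps (an illustration, not the lecture's reference code; it initializes with randomly chosen data points and does not handle empty clusters):

    import numpy as np

    def lloyd(points, k, iters=100, seed=0):
        rng = np.random.default_rng(seed)
        # initial centers: k distinct data points chosen at random
        centers = points[rng.choice(len(points), size=k, replace=False)]
        for _ in range(iters):
            # assignment step: each point goes to its nearest center
            dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # update step: each center becomes the mean of its cluster
            new_centers = np.array([points[labels == j].mean(axis=0) for j in range(k)])
            if np.allclose(new_centers, centers):  # no change: converged
                break
            centers = new_centers
        return centers, labels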

13 / 47

Lloyd's algorithm

(Slides 14–28: figures illustrating Lloyd's algorithm.)

28 / 47

Generalized Distance Functions

  • What if our data points are not d-dimensional vectors, but e.g. all the data we have about each player?

  • For any two players, we can calculate a distance, but we can't make up an "average value"

  • In other words, all we have are our data points and pairwise distances, but no vector embedding

  • We can still cluster; we just add the restriction that each cluster center must be one of the data points

29 / 47

k-Medoids Clustering

  • The cluster centers are called "medoids"

  • We use a variation of Lloyd's algorithm

  • The only difference is how we assign new cluster centers

  • One option: Use the data point that has the lowest sum of distances to the other data points in the cluster

  • Other option: Try swapping the current medoid with another data point, and keep the swap only if it results in a better clustering (slower, but more stable)
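
A small sketch of the first option (names are illustrative; dist is assumed to be a precomputed n×n matrix of pairwise distances):

    import numpy as np

    def update_medoid(dist, members):
        """members: indices of the data points in one cluster.
        Returns the member with the smallest sum of distances
        to all other members (the new medoid)."""
        sub = dist[np.ix_(members, members)]   # pairwise distances within the cluster
        return members[np.argmin(sub.sum(axis=1))]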

30 / 47

Lloyd's algorithm vs. k-Medoids Clustering

(Slides 31–33: figures comparing Lloyd's algorithm and k-medoids clustering.)

33 / 47

We forgot something ...

  • Where do we get the initial cluster centers from?

  • Simplest approach: Just pick data points at random; Problem: Results may be poor/unpredictable

  • Another idea: Pick a data point at random, then pick the data point that is furthest away from it, then pick the data point furthest away from both, etc.; Problem: Outliers affect the initialization

  • Another idea: Pick a data point at random, and assign weights to each other data point based on the distance. Then pick the next center using these weights as probabilities, etc.

  • You can also use the result from any other algorithm/guess/heuristic as an initialization, Lloyd's algorithm will never make the solution worse (as measured by the k-means clustering goal)!
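
The distance-weighted idea is essentially the k-means++ initialization; a sketch (here using squared distances to the nearest already-chosen center as weights, a common choice the slide leaves open):

    import numpy as np

    def init_centers(points, k, seed=0):
        rng = np.random.default_rng(seed)
        centers = [points[rng.integers(len(points))]]  # first center: a random data point
        for _ in range(k - 1):
            # squared distance of every point to its nearest chosen center
            d2 = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
            probs = d2 / d2.sum()                      # far-away points are more likely
            centers.append(points[rng.choice(len(points), p=probs)])
        return np.array(centers)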

34 / 47

Ward's Algorithm

  • Start with each data point in its own cluster

  • Repeatedly merge two clusters until there are only k clusters left

  • Which two clusters do we merge? The pair whose merge increases the within-cluster variance (the squared distances from the cluster centers) the least

  • This is basically "greedy" k-means
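
In practice this is rarely implemented by hand; SciPy's hierarchical clustering offers Ward linkage. A sketch on placeholder data (the array X and the value of k are made up for the example):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.random.default_rng(0).normal(size=(100, 2))  # placeholder data
    k = 3
    Z = linkage(X, method="ward")                    # bottom-up merges, Ward's criterion
    labels = fcluster(Z, t=k, criterion="maxclust")  # cut the merge tree into k clusters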

35 / 47

Distribution-Based Clustering

  • Our representation of clusters as single vectors had the advantage of being simple

  • However, clusters sometimes have different sizes/distributions

  • So let's assume our clusters are probability distributions

  • Let's start with Gaussians

36 / 47

Why Gaussians?

  • Many datasets can be modeled by Gaussian distributions

  • It is intuitive to think of the clusters as coming from different Gaussian distributions

  • With this notion, we can model the dataset as a mixture of several Gaussian distributions

  • This is the core idea of Gaussian Mixture Models

37 / 47

Gaussian Mixture Models

  • A Gaussian (normal) distribution in our model has two parameters:

    1. A mean μ that defines its center

    2. A covariance matrix Σ that defines its width (in each dimension!)

38 / 47

Gaussian Clusters

39 / 47

Gaussian Clusters

  • Each cluster has a mean (point) and covariance (matrix)

  • The mean defines where the center of the cluster is

  • The covariance matrix defines the size and extent

  • The mean and covariance are the parameters of the distribution

  • Technically, all Gaussian distributions extend infinitely; we assign each data point to the cluster for which it has the highest probability (but we could allow membership in multiple clusters!); in other words, each Gaussian contributes to each data point with some (non-zero) probability
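
Written out (standard notation, with mixing weights $\pi_j$ that the slides do not mention explicitly; with equal weights they cancel), the assignment rule is:

$$P(\text{cluster } j \mid x) = \frac{\pi_j \,\mathcal{N}(x \mid \mu_j, \Sigma_j)}{\sum_{l=1}^{k} \pi_l \,\mathcal{N}(x \mid \mu_l, \Sigma_l)}, \qquad \text{assign } x \text{ to } \arg\max_j P(\text{cluster } j \mid x)$$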

40 / 47

Expectation Maximization (EM)

  • Similar to k-means, we can determine parameter values for k Gaussians iteratively

  • Initialize k means and covariance matrices, then iterate:

    • (Expectation Step) Calculate the current responsibilities/contributions for each data point from each Gaussian

    • (Maximization Step) Use these responsibilities to calculate new means (weighted averages of all data points) and covariance matrices

  • Repeat until the clusters don't change anymore
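
A compact NumPy/SciPy sketch of this simplified loop (equal mixing weights, no safeguards against degenerate covariances; names are illustrative, not the lecture's code):

    import numpy as np
    from scipy.stats import multivariate_normal

    def em_gaussians(points, means, covs, iters=50):
        k = len(means)
        for _ in range(iters):
            # E-step: responsibility of each Gaussian for each data point
            dens = np.column_stack([multivariate_normal(means[j], covs[j]).pdf(points)
                                    for j in range(k)])
            resp = dens / dens.sum(axis=1, keepdims=True)
            # M-step: responsibility-weighted means and covariances
            for j in range(k):
                w = resp[:, j]
                means[j] = (w[:, None] * points).sum(axis=0) / w.sum()
                diff = points - means[j]
                covs[j] = (w[:, None, None] * np.einsum("ni,nj->nij", diff, diff)).sum(axis=0) / w.sum()
        return means, covs, resp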

41 / 47

Expectation Maximization

42 / 47

Expectation Maximization

  • The general mathematical formulation of EM is actually more powerful

  • It works for general, parameterized models with latent (inferred) variables

  • The Expectation step computes the probabilities for these latent variables (which we called the "contribution" of a Gaussian to a data point)

  • The Maximization step uses these probabilities to find new parameters (our parameters were the means and covariances) that maximize the likelihood of the data points
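
In the standard textbook notation (not from the slides): with observed data $X$, latent variables $Z$ and parameters $\theta$, EM iterates

$$Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{Z \mid X, \theta^{(t)}}\!\left[\log p(X, Z \mid \theta)\right], \qquad \theta^{(t+1)} = \arg\max_{\theta} Q(\theta \mid \theta^{(t)})$$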

43 / 47

Density-Based Clustering

How do we cluster data that forms two concentric rings?

No matter where we put our cluster centers, we cannot separate the inner ring from the outer ring.

44 / 47

Density-Based Clustering

  • We can observe that clusters generally are "more dense" than the regions in between

  • Let's start with each data point in its own cluster

  • Single Linkage: Merge the two clusters whose closest pair of points (one from each cluster) has the smallest distance among all cluster pairs

  • Repeat until we have k clusters

  • Sometimes there are a few single points that would link two clusters, resulting in undesirable connections

  • Robust Linkage: Connect two clusters only if there are t points in each close to the other cluster
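
Single linkage is also available in SciPy; a sketch on ring-shaped data (scikit-learn's make_circles is used here only to generate such data):

    from sklearn.datasets import make_circles
    from scipy.cluster.hierarchy import linkage, fcluster

    X, _ = make_circles(n_samples=300, factor=0.4, noise=0.03, random_state=0)
    Z = linkage(X, method="single")                  # merge by closest pair of points
    labels = fcluster(Z, t=2, criterion="maxclust")  # two clusters: inner and outer ring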

45 / 47

How many clusters?

  • So far we have kind of ignored how many clusters there are, but how do we get k?

  • Define "some measure" of cluster quality, and then try k=1,2,3,4,

    • Statistical: Variance explained by clusters
    • Measurements of cluster density, span, etc.
    • Usefulness in application (!)
    • etc.
  • There are also some more advanced algorithms that don't need to be told k explicitly (e.g. DBSCAN)
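
One common version of this is an "elbow" plot of the k-means objective for increasing k (a sketch using scikit-learn on toy data; KMeans.inertia_ is the sum of squared distances to the nearest center):

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=500, centers=4, random_state=0)  # toy data
    for k in range(1, 9):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        print(k, km.inertia_)  # look for the "elbow" where the decrease levels off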

46 / 47

References

47 / 47
