Lecture 17: Machine Learning

# Artificial Intelligence

### Machine Learning

---

# Artificial Intelligence and Machine Learning

---

# Machine Learning

What is Machine Learning?

* In classical AI we write algorithms that solve a problem "intelligently"

* In Machine Learning we use **data** to understand relationships

* For example, given 10000 pictures of cats and dogs, we learn how to distinguish them by learning the relationship between the pixels and the label (animal)

---

# Machine Learning

[Source](https://xkcd.com/1838/)

---

# Machine Learning

- Supervised Learning: `\( \{(x_1, y_1), \ldots, (x_n, y_n)\} \)` 
Learn a mapping from examples.

- Unsupervised Learning: `\( \{x_1, \ldots, x_m\} \)` 
Learn something interesting about the data.

- Semi-supervised Learning: `\( \{(x_1, y_1), \ldots, (x_n, y_n)\} \cup \{x_{n+1}, \ldots, x_{n+m}\} \)`

- Reinforcement Learning: Learn what to do in an environment, given feedback information.

---

# Machine Learning

Say there is a function `\(f(\vec{x}) = y\)`

- Supervised Learning: We know x and y, and are trying to find f (more accurately: a probability distribution `\(P(y|x)\)`)

- Unsupervised Learning: We know x and are trying to find "interesting" f and y

- Reinforcement Learning: We know f*, and are trying to get "the best y" by choosing x

---

# Supervised Learning: Rote Learning

- Say we want to "learn" a function given some x and y (supervised learning)

- The simplest thing to do is: Just memorize the values

- Computers are good at remembering things!

- However, in most interesting applications, we would need to store a lot of data

- It also does not generalize: We only know the values we have seen
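
A minimal sketch of rote learning, assuming a hypothetical training set of three XOR examples (the table `memory` and the name `rote_predict` are made up for illustration). Inputs we have seen are answered from memory, inputs we have not seen get no answer at all:

```python
# Rote learning: memorize every (x, y) pair in a lookup table.
# The three XOR examples below are hypothetical training data.
memory = {(0, 0): 0, (0, 1): 1, (1, 0): 1}

def rote_predict(x):
    """Return the memorized y for x, or None if x was never seen."""
    return memory.get(x)

known = rote_predict((1, 0))    # seen during "training"
unknown = rote_predict((1, 1))  # never seen: nothing to look up
```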

---

# Supervised Learning?

- What we want is to give the computer "some" x and y, and it finds the "general connection" between them

- The function that we learn is not just a memorization of the values, but it also gives us a "good" value for y for x that we haven't seen before

- Inter- and Extrapolation: From the known data, we can "predict" values for new inputs

- But what are x and y?
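
As one possible illustration of "finding the general connection", here is a sketch that fits a straight line to a few points with ordinary least squares. The data and the names `fit_line` and `predict` are assumptions for this example; a line is just one of many model families one could choose:

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical training data
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
slope, intercept = fit_line(xs, ys)

def predict(x):
    return slope * x + intercept

inside = predict(2.5)    # interpolation: between known inputs
outside = predict(10.0)  # extrapolation: beyond known inputs
```

Unlike rote learning, this stores only two numbers and still produces a value for inputs that never appeared in the data.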

---

# Unsupervised Learning

---

# Unsupervised Learning

* In Unsupervised Learning we give our algorithm data, and let it find "something interesting"
  
  * Examples:
  
     - Clusters of similar data: Similar players, cliques in social networks, similar sounding words
     - Common patterns of data: Action sequences used by many people, common network traffic
     - Related data items: Purchases often made together, music listened to by the same people
     
---

# Why Unsupervised Learning?

* If we can find similar players, we can make them play together 
  
  * Music can be recommended based on what you already listen to
  
  * Help new users by suggesting common actions/purchases made by similar users 
  
  * If genes "behave" similarly, they are likely to be related or associated in a biological process

---

# Clustering

---

# Clustering

* We are given `n` vectors, representing our players/games/words/music/...
  
  * How can we determine which vectors belong to the same "class"/"type"?
  
  * How many classes are there?
  
  * We call the classes *clusters*
  
---

# What is a Cluster?

* For now, we assume that each of our clusters is defined by a single point, the *center* of the cluster
  
  * Each data point is then assigned to a cluster based on which cluster center it is closest to
  
<img src="/PF-3341/assets/img/voronoigrowth.gif" width="40%"/>
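
The assignment rule can be sketched as follows, assuming Euclidean distance and made-up example points (`nearest_center` is a hypothetical helper name):

```python
import math

def nearest_center(point, centers):
    """Index of the center that is closest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))

centers = [(0.0, 0.0), (10.0, 10.0)]
points = [(1.0, 2.0), (9.0, 8.0), (4.0, 4.0)]
labels = [nearest_center(p, centers) for p in points]
```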

---

# What is a good Clustering?

* Say we are told that we should create `k` "good" clusters 
  
    * k-center clustering: Minimize the maximum distance of any data point from its cluster center (firehouse placement)
  
    * k-median clustering: Minimize the sum of the distances of data points to their cluster center 
  
    * k-means clustering: Minimize the sum of squared distances of data points to their cluster center, where the center is the cluster mean (this is the variance within each cluster, up to normalization)
  
  * Each of these is a measure for how "compact" a cluster is, but that does not necessarily tell us anything about cluster "usefulness", which is application-dependent
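
To compare the three criteria on the same clustering, here is a small sketch. It assumes Euclidean distance, and the k-means cost is computed as the sum of squared distances (the average squared distance times the number of points). The function name and the data are made up:

```python
import math

def clustering_costs(points, centers):
    # Distance from each point to its closest center
    dists = [min(math.dist(p, c) for c in centers) for p in points]
    return {
        "k_center": max(dists),                 # worst-case distance
        "k_median": sum(dists),                 # sum of distances
        "k_means": sum(d * d for d in dists),   # sum of squared distances
    }

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 4.0)]
centers = [(0.0, 1.0), (10.0, 2.0)]
costs = clustering_costs(points, centers)
```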

---

# k-Means Clustering

* k-means clustering puts more weight on outliers than k-median, but is not dominated by them like k-center
  
  * Especially for d-dimensional vectors, k-means is usually the first choice

* How do we find a k-means clustering? Trying all possible assignments of data points to clusters is not feasible
  
  * Finding an optimal clustering is NP-hard :(
  
  * Lloyd's algorithm! (Often also just called "k-means algorithm")
  
---

# Lloyd's algorithm

* Determine `k` initial cluster centers, then iterate these two steps:
  
     - Assign each data point to its cluster based on the current centers 
     
     - Compute new centers as the mean of each cluster 
     
  * After "some" iterations we will have a clustering of the data 
  
  * This may be a local minimum when compared to the k-means criterion, but is often "good enough"
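
A minimal sketch of the two steps, assuming Euclidean distance, random data points as initial centers, and "no center moved" as the stopping rule (these choices, and the name `lloyd`, are assumptions for illustration):

```python
import math
import random

def lloyd(points, k, iterations=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # simplest initialization: random data points
    for _ in range(iterations):
        # Step 1: assign each data point to its closest current center
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: math.dist(p, centers[j]))
            clusters[i].append(p)
        # Step 2: move each center to the mean of its cluster
        new_centers = []
        for i, cluster in enumerate(clusters):
            if cluster:
                dim = len(cluster[0])
                new_centers.append(
                    tuple(sum(p[d] for p in cluster) / len(cluster) for d in range(dim))
                )
            else:
                new_centers.append(centers[i])  # empty cluster: keep the old center
        if new_centers == centers:  # nothing moved, so we have converged
            break
        centers = new_centers
    return centers, clusters

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, clusters = lloyd(points, 2)
```

On this toy data the two groups are far apart, so the result does not depend on which points are picked initially; on real data, different seeds can end in different local minima.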
  
---

# Generalized Distance Functions

* What if our data is not d-dimensional vectors, but e.g. all the data we have about a player 
  
  * For any two players, we can calculate a distance, but we can't make up an "average value"
  
  * In other words, all we have are our data points and pairwise distances, but no vector embedding
  
  * We can still cluster, we just have the restriction that each cluster center must be exactly on a data point 
  
---

# k-Medoids Clustering

* The cluster centers are called "medoids"
  
  * We use a variation of Lloyd's algorithm 
  
  * The only difference is how we assign new cluster centers 
  
  * One option: Use the data point that has the lowest sum of distance to the other data points in the cluster 
  
  * Other option: Choose a new data point as the new cluster center, and check if that new cluster center would result in a better clustering (slower, but more stable)
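
A sketch of the first option, using only a pairwise distance function. As a stand-in for "data without a vector embedding" it uses short words with the Hamming distance; the words, the initial medoids and all function names are hypothetical:

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def medoid(cluster, dist):
    """Data point with the lowest sum of distances to the others in its cluster."""
    return min(cluster, key=lambda c: sum(dist(c, p) for p in cluster))

def k_medoids(items, medoids, dist, iterations=100):
    medoids = list(medoids)
    for _ in range(iterations):
        # Assignment step: same as in Lloyd's algorithm
        clusters = [[] for _ in medoids]
        for item in items:
            i = min(range(len(medoids)), key=lambda j: dist(item, medoids[j]))
            clusters[i].append(item)
        # Update step: the new center must itself be a data point
        new_medoids = [medoid(c, dist) if c else m for c, m in zip(clusters, medoids)]
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, clusters

words = ["cat", "cot", "cut", "dog", "dig", "dug"]
medoids, clusters = k_medoids(words, ["cat", "dug"], hamming)
```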
  
---

# Lloyd's algorithm vs. k-Medoids Clustering

<img src="/PF-3341/assets/img/kmedoidsdetail.png" width="100%"/>
  
---

# Lloyd's algorithm vs. k-Medoids Clustering

---

# Lloyd's algorithm vs. k-Medoids Clustering

<img src="/PF-3341/assets/img/kmedoidsresult.png" width="40%"/>
  
---

# We forgot something ...

* Where do the initial cluster centers come from?
  
  * Simplest approach: Just pick data points at random; Problem: Results may be poor/unpredictable
  
  * Another idea: Pick a data point at random, then pick the data point that is furthest away from it, then pick the data point furthest away from both, etc.; Problem: Outliers affect the initialization
  
  * Another idea: Pick a data point at random, and assign a weight to every other data point based on its distance to the centers chosen so far. Then pick the next center using these weights as probabilities, etc. (k-means++ does this with squared distances)
  
  * You can also use the result from any other algorithm/guess/heuristic as an initialization; Lloyd's algorithm will never make the solution worse (as measured by the k-means clustering goal)!
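
The second and third idea can be sketched like this. Both function names are made up, and the weighted variant assumes squared distance to the nearest chosen center as the weight, which is the choice k-means++ makes:

```python
import math
import random

def farthest_first(points, k, first):
    """Repeatedly pick the point that is furthest from all centers chosen so far."""
    centers = [first]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers

def weighted_seeding(points, k, seed=0):
    """Pick each new center with probability proportional to its squared distance."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Already chosen points get weight 0 and cannot be picked again.
        # (Assumes at least k distinct points, otherwise all weights become 0.)
        weights = [min(math.dist(p, c) for c in centers) ** 2 for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers

points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (5.0, 0.0)]
deterministic = farthest_first(points, 3, first=(0.0, 0.0))
randomized = weighted_seeding(points, 3)
```

The weighted version still prefers far-away points, but a single outlier is no longer guaranteed to become a center.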
  
---

# Ward's Algorithm

* Start with each data point in its own cluster 
  
  * Merge two clusters until there are only `k` clusters left 
  
  * Which two clusters do you merge? The two whose merger increases the total squared distance of the data points from their cluster centers the least 
  
  * This is basically "greedy" k-means
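
A naive sketch of this greedy merging. It assumes the usual Ward criterion, where merging clusters A and B increases the summed squared distance to the centers by |A||B| / (|A| + |B|) times the squared distance between their means. The function names are hypothetical, and the code favors readability over speed:

```python
import math

def mean(cluster):
    dim = len(cluster[0])
    return tuple(sum(p[d] for p in cluster) / len(cluster) for d in range(dim))

def merge_cost(a, b):
    """Increase in summed squared distance to the centers if a and b are merged."""
    return len(a) * len(b) / (len(a) + len(b)) * math.dist(mean(a), mean(b)) ** 2

def ward(points, k):
    clusters = [[p] for p in points]  # start: every point is its own cluster
    while len(clusters) > k:
        # Greedy step: find the cheapest pair of clusters to merge
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: merge_cost(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (20.0, 20.0)]
result = ward(points, 3)
```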
  
---

# Distribution-Based Clustering

* Our representation of clusters as single vectors had the advantage of being simple
  
  * However, clusters sometimes have different sizes/distributions 
  
  * So let's assume our clusters are probability distributions 
  
  * Let's start with Gaussians
  
---

# Why Gaussians?

* Many datasets can be modeled by a Gaussian distribution

* It is often intuitive to think of the clusters as coming from different Gaussian distributions

* With this notion, we can model a dataset as a mixture of several Gaussian distributions

* This is the core idea of Gaussian Mixture Models

---

# Gaussian Mixture Models

* A Gaussian (normal) distribution in our model has two parameters:
 
  1. A mean μ that defines its center
 
  2. A covariance Σ that defines its width (in each dimension!)

---
# Gaussian Clusters

---

# Gaussian Clusters

* Each cluster has a mean (point) and covariance (matrix)
  
  * The mean defines where the center of the cluster is 
  
  * The covariance matrix defines the size and extent 
  
  * The mean and covariance are the *parameters* of the distribution
  
  * Technically, all Gaussian distributions extend infinitely, so each Gaussian *contributes* to each data point with some (non-zero) probability; we assign each data point to the cluster for which it has the highest probability (but we could allow membership in multiple clusters!)
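
To make the "contribution" concrete, here is a one-dimensional sketch, so the covariance matrix shrinks to a single standard deviation. The two Gaussians, the equal mixing weights and the name `contributions` are assumptions for illustration:

```python
from statistics import NormalDist

def contributions(x, gaussians, weights):
    """Share of each Gaussian in explaining the data point x (sums to 1)."""
    densities = [w * g.pdf(x) for g, w in zip(gaussians, weights)]
    total = sum(densities)
    return [d / total for d in densities]

# Two hypothetical 1-D clusters: a narrow one at 0 and a wide one at 5
gaussians = [NormalDist(mu=0.0, sigma=1.0), NormalDist(mu=5.0, sigma=2.0)]
weights = [0.5, 0.5]  # assumed equal mixing weights

shares = contributions(1.0, gaussians, weights)
hard_label = shares.index(max(shares))  # cluster with the highest probability
```

Both entries of `shares` are non-zero, but the point at 1.0 is mostly explained by the first Gaussian.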

---

# Expectation Maximization (EM)

* Similar to k-means, we can determine parameter values for k Gaussians iteratively
   
* Initialize k means and covariance matrices, then iterate:
   
      - (**E**xpectation Step) Calculate the current responsibilities/contributions for each data point from each Gaussian 
      
      - (**M**aximization Step) Use these responsibilities to calculate new means (weighted average of all data points) and covariance matrices
  
* Repeat until the clusters don't change anymore
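
A sketch of this loop for one-dimensional data, where each covariance matrix is a single variance. It makes several simplifying assumptions that are not part of the slides: it also re-estimates mixing weights, runs for a fixed number of iterations instead of testing for convergence, and floors the standard deviation so that no Gaussian collapses onto a single point:

```python
import math
from statistics import NormalDist

def em_gmm_1d(data, means, sigmas, weights, iterations=50):
    means, sigmas, weights = list(means), list(sigmas), list(weights)
    k = len(means)
    for _ in range(iterations):
        # E-step: responsibility of each Gaussian for each data point
        resp = []
        for x in data:
            dens = [weights[j] * NormalDist(means[j], sigmas[j]).pdf(x) for j in range(k)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: responsibility-weighted mean, variance and mixing weight
        for j in range(k):
            nj = sum(r[j] for r in resp)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            sigmas[j] = max(math.sqrt(var), 1e-3)  # floor: avoid a collapsed Gaussian
            weights[j] = nj / len(data)
    return means, sigmas, weights

data = [0.0, 0.5, 1.0, 9.0, 9.5, 10.0]  # hypothetical data with two groups
means, sigmas, weights = em_gmm_1d(data, [0.0, 10.0], [1.0, 1.0], [0.5, 0.5])
```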
  
---

# Expectation Maximization

---

# Expectation Maximization

* The general mathematical formulation of EM is actually more powerful
  
  * It works for general, parameterized models with latent (inferred) variables
  
  * The Expectation step computes the probabilities for these latent variables (which we called the "contribution" of a Gaussian to a data point)
  
  * The Maximization step uses these probabilities to find new parameters (our parameters were the mean and covariance) that maximize the likelihood of the data points

---

# Density-Based Clustering

How do we cluster this?

No matter where we put our cluster centers, we can't cluster it into the inner and outer ring.

---

# Density-Based Clustering

* We can observe that clusters generally are "more dense" than the regions in between 
  
  * Let's start with each data point in its own cluster 
  
  * Single Linkage: We merge the two clusters whose closest pair of points (one from each cluster) has the smallest *distance* among all cluster pairs 
  
  * Repeat until we have `k` clusters
  
  * Sometimes there are a few single points that would link two clusters, resulting in undesirable connections 
  
  * Robust Linkage: Connect two clusters only if each of them contains at least `t` points that are close to the other cluster
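
A naive sketch of single linkage on two concentric rings. The ring data is generated here as an assumption (8 points at radius 1, 24 points at radius 5), and `single_linkage` is a hypothetical name; the implementation recomputes all distances in every round and is only meant for small examples:

```python
import math

def single_linkage(points, k):
    clusters = [[p] for p in points]  # start: every point is its own cluster

    def gap(a, b):
        # Distance between the closest pair of points, one from each cluster
        return min(math.dist(p, q) for p in a for q in b)

    while len(clusters) > k:
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: gap(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters

inner = [(math.cos(2 * math.pi * i / 8), math.sin(2 * math.pi * i / 8))
         for i in range(8)]
outer = [(5 * math.cos(2 * math.pi * i / 24), 5 * math.sin(2 * math.pi * i / 24))
         for i in range(24)]
rings = single_linkage(inner + outer, 2)
```

Neighboring points on the same ring are closer to each other than any point is to the other ring, so the two remaining clusters are exactly the two rings.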

---

# How many clusters?

* So far we have kind of ignored how many clusters there are, but how do we get k?

* Define "some measure" of cluster quality, and then try `\(k=1,2,3,4,\ldots\)`
   - Statistical: Variance explained by clusters
   - Measurements of cluster density, span, etc. 
   - Usefulness in application (!)
   - etc.

* There are also some more advanced algorithms that don't need to be told k explicitly (e.g. DBSCAN)
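
One of the statistical measures, variance explained, can be sketched as the share of the total squared distance from the overall mean that disappears once each point is measured from its own cluster mean. The function names and data are made up, and the sketch assumes that not all points are identical:

```python
def mean(points):
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))

def squared_error(points):
    """Sum of squared distances of the points from their mean."""
    m = mean(points)
    return sum(sum((a - b) ** 2 for a, b in zip(p, m)) for p in points)

def variance_explained(clusters):
    all_points = [p for c in clusters for p in c]
    within = sum(squared_error(c) for c in clusters)
    return 1.0 - within / squared_error(all_points)

one_cluster = [[(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]]
two_clusters = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
score_k1 = variance_explained(one_cluster)
score_k2 = variance_explained(two_clusters)
```

Evaluating this for `k = 1, 2, 3, ...` and looking for the point where the score stops improving much is one common heuristic for choosing k.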

---
  
# References

* [Foundations of Data Science](https://www.cs.cornell.edu/jeh/book.pdf)

* [Rote Learning](http://users.cs.cf.ac.uk/Dave.Marshall/AI2/node133.html)
  
  * [Machine Learning 4 All: Guides](https://ml4a.github.io/guides/)
  
  * *Pattern Recognition and Machine Learning* (Chapter 9), by Christopher Bishop
  
  * [So You Have Some Clusters, Now What?](https://medium.com/@fan_zhang/so-you-have-some-clusters-now-what-4cc39a531e9b)