AI in Digital Entertainment

Unsupervised Learning

1 / 50

Unsupervised Learning

2 / 50

Unsupervised Learning

  • In Unsupervised Learning we give our algorithm data, and let it find "something interesting"

  • Examples:

    • Clusters of similar data: Similar players, cliques in social networks, similar sounding words
    • Common patterns of data: Attack or other action sequences used by many players
    • Related data items: Purchases often made together (with real or in-game currency), quests often chosen together, games played by the same people
3 / 50

Why Unsupervised Learning for Games?

  • If we can find similar players, we can make them play together

  • Friends can be recommended based on player type/preference

  • Help new players by suggesting common actions/purchases made by similar players

  • Recommend new games to players

4 / 50

Clustering

5 / 50

Clustering

  • We are given n vectors, representing our players/games/words/...

  • How can we determine which vectors belong to the same "class"/"type"?

  • How many classes are there?

  • We call the classes clusters

6 / 50

What is a Cluster?

  • For now, we assume that each of our clusters is defined by a single point, the center of the cluster

  • Each data point is then assigned to a cluster based on which cluster center it is closest to

7 / 50

What is a good Clustering?

  • Say we are told that we should create k "good" clusters

    • k-center clustering: Minimize the maximum distance of any data point from its cluster center (firehouse placement)

    • k-median clustering: Minimize the sum of the distances of data points to their cluster center

    • k-means clustering: Minimize the sum of squared distances of data points to their cluster center (the within-cluster variance, i.e. the average squared distance from the cluster mean), as written out below

  • Each of these is a measure for how "compact" a cluster is, but that does not necessarily tell us anything about cluster "usefulness", which is application-dependent
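Written out (standard formulations; the notation is added here, not from the slide): for data points $x_1, \dots, x_n$ with cluster centers $c_1, \dots, c_k$, where $c(x_i)$ is the center closest to $x_i$,

$$\text{k-center:}\ \min \max_i\, d\big(x_i, c(x_i)\big) \qquad \text{k-median:}\ \min \sum_i d\big(x_i, c(x_i)\big) \qquad \text{k-means:}\ \min \sum_i \lVert x_i - c(x_i) \rVert^2$$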

8 / 50

k-Means Clustering

  • k-means clustering puts more weight on outliers than k-median, but is not dominated by them like k-center

  • Especially for d-dimensional vectors, k-means is usually the first choice

  • How do we find a k-means clustering? Naive idea: try all possible assignments of data points to clusters

  • Finding an optimal clustering is NP-hard :(

  • Lloyd's algorithm! (Often also just called "k-means algorithm")

9 / 50

Lloyd's algorithm

  • Determine k initial cluster centers, then iterate these two steps:

    • Assign each data point to its cluster based on the current centers

    • Compute new centers as the mean of each cluster

  • After "some" iterations we will have a clustering of the data

  • This may be a local minimum when compared to the k-means criterion, but is often "good enough"
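A minimal numpy sketch of these two steps (assuming the data is an (n, d) array X and the initial centers are already chosen; illustrative, not the lecture's reference implementation):

```python
import numpy as np

def lloyds_algorithm(X, centers, iterations=100):
    """Plain k-means / Lloyd's algorithm: X is (n, d), centers is (k, d)."""
    for _ in range(iterations):
        # Assignment step: index of the closest center for every data point
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assignment = distances.argmin(axis=1)

        # Update step: each center moves to the mean of its assigned points
        new_centers = np.array([
            X[assignment == j].mean(axis=0) if np.any(assignment == j) else centers[j]
            for j in range(len(centers))
        ])

        if np.allclose(new_centers, centers):  # no more change: (local) optimum reached
            break
        centers = new_centers
    return centers, assignment
```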

10 / 50

Lloyd's algorithm

[Figure: step-by-step animation of Lloyd's algorithm on example data]

11–25 / 50

Generalized Distance Functions

  • What if our data is not d-dimensional vectors, but e.g. all the data we have about each player?

  • For any two players, we can calculate a distance, but we can't make up an "average value"

  • In other words, all we have are our data points and pairwise distances, but no vector embedding

  • We can still cluster, we just have the restriction that each cluster center must be exactly on a data point

26 / 50

k-Medoids Clustering

  • The cluster centers are called "medoids"

  • We use a variation of Lloyd's algorithm

  • The only difference is how we assign new cluster centers

  • One option: Use the data point that has the lowest sum of distances to the other data points in the cluster

  • Other option: Choose a new data point as the new cluster center, and check if that new cluster center would result in a better clustering (slower, but more stable)
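A sketch of the first update option, assuming all we have is a precomputed (n, n) pairwise distance matrix D (function and variable names are illustrative, not from the slides):

```python
import numpy as np

def k_medoids(D, medoids, iterations=100):
    """k-medoids on a precomputed (n, n) distance matrix D.
    `medoids` is a list of data-point indices used as the initial centers."""
    medoids = list(medoids)
    for _ in range(iterations):
        # Assignment step: closest medoid for every data point
        assignment = D[:, medoids].argmin(axis=1)

        # Update step: within each cluster, pick the point with the
        # smallest sum of distances to the other points in that cluster
        new_medoids = []
        for j in range(len(medoids)):
            members = np.where(assignment == j)[0]
            if len(members) == 0:
                new_medoids.append(medoids[j])
                continue
            within = D[np.ix_(members, members)].sum(axis=1)
            new_medoids.append(members[within.argmin()])

        if new_medoids == medoids:  # no medoid changed: done
            break
        medoids = new_medoids
    return medoids, assignment
```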

27 / 50

Lloyd's algorithm vs. k-Medoids Clustering

[Figure: Lloyd's algorithm and k-medoids clustering compared on example data]

28–30 / 50

We forgot something ...

  • We still need to get the initial cluster centers from somewhere!

  • Simplest approach: Just pick data points at random; Problem: Results may be poor/unpredictable

  • Another idea: Pick a data point at random, then pick the data point that is furthest away from it, then pick the data point furthest away from both, etc.; Problem: Outliers affect the initialization

  • Another idea: Pick a data point at random, and assign a weight to every other data point based on its distance to the already-chosen centers. Then pick the next center using these weights as probabilities, etc. (sketched below)

  • You can also use the result from any other algorithm/guess/heuristic as an initialization; Lloyd's algorithm will never make the solution worse (as measured by the k-means objective)!
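A sketch of the distance-weighted initialization idea (using squared distances as weights, as k-means++ does; the helper name is made up for illustration):

```python
import numpy as np

def weighted_init(X, k, rng=None):
    """Pick k initial centers; points far from already-chosen centers are more likely."""
    rng = rng or np.random.default_rng()
    centers = [X[rng.integers(len(X))]]               # first center: uniform at random
    for _ in range(k - 1):
        # distance of every point to its nearest already-chosen center
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        weights = d ** 2                               # squared distances, as in k-means++
        probs = weights / weights.sum()
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)
```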

31 / 50

Ward's Algorithm

  • Start with each data point in its own cluster

  • Repeatedly merge two clusters until only k clusters are left

  • Which two clusters do you merge? The pair whose merge increases the total (squared) distance of the points from their cluster centers the least

  • This is basically "greedy" k-means
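In practice this bottom-up merging is usually done with a library; a short sketch using SciPy's hierarchical clustering, assuming the data is an (n, d) numpy array:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(200, 2)                        # example data: 200 points in 2D
Z = linkage(X, method="ward")                     # greedy bottom-up merging (Ward's criterion)
labels = fcluster(Z, t=3, criterion="maxclust")   # stop once only k = 3 clusters remain
```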

32 / 50

Distribution-Based Clustering

  • Our representation of clusters as single vectors had the advantage of being simple

  • However, clusters sometimes have different sizes/distributions

  • So let's assume our clusters are probability distributions

  • Let's start with Gaussians

33 / 50

Gaussian Clusters

[Figure: scatter plot of example data forming Gaussian clusters (both axes from 0 to 1)]
34 / 50

Gaussian Clusters

  • Each cluster has a mean (point) and covariance (matrix)

  • The mean defines where the center of the cluster is

  • The covariance matrix defines the size and extent

  • The mean and covariance are the parameters of the distribution

  • Technically, all Gaussian distributions extend infinitely. We assign each data point to the cluster for which it has the highest probability (but we could allow membership in multiple clusters!). In other words, each Gaussian contributes to each data point with some (non-zero) probability (the density formula is given below)
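For reference, the density of a d-dimensional Gaussian cluster with mean $\mu$ and covariance matrix $\Sigma$ (standard formula, added here):

$$\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^d\, \lvert \Sigma \rvert}} \exp\!\left( -\tfrac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right)$$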

35 / 50

Expectation Maximization (EM)

  • Similar to k-means, we can determine parameter values for k Gaussians iteratively

  • Initialize k means and covariance matrices, then iterate:

    • (Expectation Step) Calculate the current responsibilities/contributions for each data point from each Gaussian

    • (Maximization Step) Use these responsibilities to calculate new means (weighted averages of all data points) and covariance matrices

  • Repeat until the clusters don't change anymore
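A numpy/SciPy sketch of a single EM iteration for a mixture of Gaussians; the "responsibilities" are what the slide calls contributions (illustrative code, with explicit mixture weights added for completeness):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, means, covs, weights):
    """One EM iteration for a Gaussian mixture; X is (n, d)."""
    k = len(means)
    # E-step: responsibility of each Gaussian for each data point
    dens = np.array([weights[j] * multivariate_normal.pdf(X, means[j], covs[j])
                     for j in range(k)])                   # shape (k, n)
    resp = dens / dens.sum(axis=0, keepdims=True)          # normalize per data point

    # M-step: re-estimate parameters from the responsibilities
    nk = resp.sum(axis=1)                                  # "effective" cluster sizes
    new_means = (resp @ X) / nk[:, None]                   # weighted averages of the data
    new_covs = np.array([(resp[j][:, None] * (X - new_means[j])).T @ (X - new_means[j]) / nk[j]
                         for j in range(k)])
    new_weights = nk / len(X)
    return new_means, new_covs, new_weights
```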

36 / 50

Expectation Maximization

[Figure: EM fitting Gaussian clusters to example data]

37 / 50

Expectation Maximization

  • The general mathematical formulation of EM is actually more powerful

  • It works for general, parameterized models with latent (inferred) variables

  • The Expectation step computes the probabilities for these latent variables (which we called the "contribution" of a Gaussian to a data point)

  • The Maximization step finds new parameters using these probabilities (our parameters were the means and covariances) that maximize the likelihood of the data points

38 / 50

Density-Based Clustering

How do we cluster this?

No matter where we put our cluster centers, we can't cluster it into the inner and outer ring.

39 / 50

Density-Based Clustering

  • We can observe that clusters generally are "more dense" than the regions in between

  • Let's start with each data point in its own cluster

  • Single Linkage: In each step, we connect the two clusters that contain the closest pair of points (i.e. the minimum point-to-point distance over all cluster pairs)

  • Repeat until we have k clusters

  • Sometimes a few stray points form a bridge between two clusters, resulting in undesirable connections

  • Robust Linkage: Only connect two clusters if at least t points in each are close to the other cluster
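A SciPy sketch of single linkage on ring-shaped data like the example above (illustrative; the robust-linkage variant is not shown):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Example data: points on an inner and an outer ring
rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, 200)
radii = np.concatenate([np.full(100, 1.0), np.full(100, 3.0)])
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])

Z = linkage(X, method="single")                   # always merge the two closest clusters
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the hierarchy at k = 2 clusters
```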

40 / 50

How many clusters?

  • So far we have kind of ignored how many clusters there are, but how do we get k?

  • Define "some measure" of cluster quality, and then try k = 1, 2, 3, 4, ...

    • Statistical: Variance explained by clusters
    • Measurements of cluster density, span, etc.
    • Usefulness in application (!)
    • etc.
  • There are also some more advanced algorithms that don't need to be told k explicitly (e.g. DBSCAN)
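One concrete way to do the "try k = 1, 2, 3, ..." loop is to track the within-cluster sum of squares and look for an elbow; a scikit-learn sketch (the tooling choice is an assumption, not from the slides):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(500, 4)                        # placeholder data
inertia = {}
for k in range(1, 11):
    # inertia_ is the within-cluster sum of squared distances for this k
    inertia[k] = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
# Pick the k where the curve flattens out (the "elbow"),
# or whichever k turns out to be most useful in the application.
print(inertia)
```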

41 / 50

Frequent Pattern Mining

42 / 50

Frequent Pattern Mining

  • Let's say we collect information from many sources (e.g. people)

  • Now we want to see what is "common"

  • Which topics do many people like on social media, which actions are often performed in sequence, which cards are often played together in Hearthstone, which games are played by the same people, etc.

  • Applications: Group similar people/decks, find imbalance in game play, give recommendations, ...

43 / 50

Apriori Algorithm

  • Let's say we are given a set of sets, like a set of people, and each person has a set of games they play

  • We want to know which games are commonly played together

  • Define the support for a game as the number of people that play that game

  • Define a support threshold for frequent games

44 / 50

Apriori Algorithm

  • Identify all games for which the support is above the support threshold

  • Merge all such games into pairs

  • Discard all pairs for which the shared support is below the threshold

  • Continue merging until all item sets are below the threshold

  • Return the item sets from before the last merge
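A small, unoptimized Python sketch of these steps for the "games played together" setting (names and structure are illustrative):

```python
def apriori(baskets, threshold):
    """baskets: list of sets (e.g. the games each person plays)."""
    def support(itemset):
        # number of baskets that contain the whole item set
        return sum(1 for b in baskets if itemset <= b)

    # Level 1: single items whose support is above the threshold
    items = {x for b in baskets for x in b}
    levels = [{frozenset([x]) for x in items if support(frozenset([x])) >= threshold}]

    # Merge frequent sets into larger candidates until nothing survives the threshold
    while levels[-1]:
        size = len(next(iter(levels[-1]))) + 1
        candidates = {a | b for a in levels[-1] for b in levels[-1] if len(a | b) == size}
        levels.append({c for c in candidates if support(c) >= threshold})

    # The item sets from before the last (empty) level
    return levels[-2] if len(levels) >= 2 else set()
```

Run on the example from the next slide with a support threshold of 3, this would return {Quake, Tetris} and {Quake, Super Mario}.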

45 / 50

Apriori: An Example

  • A plays Call of Duty, Quake, Overwatch, Super Mario
  • B plays Call of Duty, Quake, Super Mario, Tetris
  • C plays Tetris, Quake, Super Mario
  • D plays Quake, Overwatch, Tetris

Support threshold: 3

  • Quake (4), Tetris (3), Super Mario (3) are all kept; Call of Duty (2), Overwatch (2) are discarded
  • {Quake, Tetris} (3), {Quake, Super Mario} (3) are kept; {Tetris, Super Mario} (2) is discarded
  • {Quake, Tetris, Super Mario} (2) is discarded
  • Return {Quake, Tetris} and {Quake, Super Mario}
46 / 50

What can we use that for?

  • Analytics: Look at why people prefer Quake over Call of Duty

  • Game Balancing: If everyone uses the same 3 spells or buys the same 3 items, determine if they are too strong or the others too weak

  • Recommendations: Say someone plays Tetris and Super Mario. Each shows up in one of our frequent game sets, both paired with Quake, so we should recommend Quake to them

47 / 50

Sequence Mining

  • For some data, such as actions, just the presence may not be as important as the actual ordering/sequencing

  • We can modify the Apriori algorithm into the Generalized Sequential Patterns algorithm by considering sequences instead of sets

    • Start with all common sequences of length 1
    • Merge sequences by concatenating them, and count occurrences in the data
    • Continue adding 1-sequences until all sequences are below the threshold
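A simplified sketch along these lines, counting contiguous action subsequences (an illustrative reading of the steps above, not the full GSP algorithm):

```python
def frequent_sequences(logs, threshold):
    """logs: list of action sequences (tuples). Returns frequent contiguous subsequences."""
    def support(seq):
        # number of logs containing seq as a contiguous subsequence
        return sum(any(tuple(log[i:i + len(seq)]) == seq
                       for i in range(len(log) - len(seq) + 1))
                   for log in logs)

    actions = {a for log in logs for a in log}
    levels = [{(a,) for a in actions if support((a,)) >= threshold}]
    singles = levels[0]

    # Extend each frequent sequence by one frequent action at a time
    while levels[-1]:
        candidates = {seq + s for seq in levels[-1] for s in singles}
        levels.append({c for c in candidates if support(c) >= threshold})
    return levels[-2] if len(levels) >= 2 else set()
```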
48 / 50

References

50 / 50
