class: center, middle

# AI in Digital Entertainment
### Monte Carlo Tree Search

---

# Game Trees

* We want to build an agent for a turn-based game
* On its turn, the agent needs to figure out which action to perform
* Ideally, all the agent knows are the rules of the game (including scoring), and it is then able to figure out the best move automatically
* We've talked about this before, in the form of game trees

---
class: small

# Minimax

* Let's say we want to get the highest possible score
* Then our opponent wants us to get the lowest possible score
* For each of our potential actions, we look at each of the opponent's possible actions
* The opponent will pick the action that gives us the lowest score, and we will pick from our actions the one where the opponent's choice gives us the highest score
* How does the opponent decide what to pick? The same way!
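As a quick refresher, here is a minimal Python sketch of this reasoning; the `Game` interface used here (`is_terminal`, `score`, `legal_actions`, `result`) is an assumption for illustration, not something defined on these slides:

```python
def minimax(game, state, maximizing):
    # Assumed Game interface: is_terminal(state), score(state) from our
    # perspective, legal_actions(state), and result(state, action).
    if game.is_terminal(state):
        return game.score(state)
    values = [minimax(game, game.result(state, a), not maximizing)
              for a in game.legal_actions(state)]
    # On our turn we take the best value; the opponent takes the worst (for us).
    return max(values) if maximizing else min(values)

# Picking the action whose resulting state has the highest minimax value
# gives the move described on this slide.
```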
---

# Minimax

(Figure: example game tree with levels 0 to 4 and node values including +∞, -∞, 10, 5, 7, -5, -7, and -10.)
--- # Minimax
(Figure: the same example game tree with levels 0 to 4 and node values including +∞, -∞, 10, 5, 7, -5, -7, and -10.)
---

# Minimax

* Remember what we did next?
* I played a game against a student, and the student was upset because he lost.
* So let's make a fairer game: instead of me, a die will play

---

# A fairer game

You choose a number between 1 and 6 as the first digit, then I roll a d6 (six-sided die) to get a second digit.

If the resulting number is prime *or* divisible by 3, you get that many points; otherwise you lose that many points.

For example: You pick 3, the die roll shows a 1, so the resulting number is 31, which is prime, and you get 31 points.

What's your move?

---
class: small

# How would we play this game?

* Winning numbers: 11, 12, 13, 15, 21, 23, 24, 31, 33, 36, 41, 42, 43, 45, 51, 53, 54, 61, 63, 66
* If we pick 2, 3, 5 or 6, there are three possible die rolls that give us points; for 1 or 4 there are four.
* We can calculate the **expected value** for each choice:
$$ E(c=1) = \frac{1}{6}\cdot 11 + \frac{1}{6} \cdot 12 + \frac{1}{6} \cdot 13 + \frac{1}{6} \cdot 15 - (\frac{1}{6}\cdot 14 + \frac{1}{6}\cdot 16) = 3.5\\\\ E(c=2) = \frac{1}{6}\cdot 21 + \frac{1}{6}\cdot 23 + \frac{1}{6}\cdot 24 - (\frac{1}{6} \cdot 22 + \frac{1}{6}\cdot 25 + \frac{1}{6}\cdot 26) = -0.8\\\\ E(c=3) = \frac{1}{6}\cdot 31 + \frac{1}{6}\cdot 33 + \frac{1}{6}\cdot 36 - (\frac{1}{6} \cdot 32 + \frac{1}{6}\cdot 34 + \frac{1}{6}\cdot 35) = -0.2\\\\ E(c=4) = \frac{1}{6}\cdot 41 + \frac{1}{6} \cdot 42 + \frac{1}{6} \cdot 43 + \frac{1}{6} \cdot 45 - (\frac{1}{6}\cdot 44 + \frac{1}{6}\cdot 46) = 13.5\\\\ E(c=5) = \frac{1}{6} \cdot 51 + \frac{1}{6}\cdot 53 + \frac{1}{6}\cdot 54 - (\frac{1}{6}\cdot 52 + \frac{1}{6}\cdot 55 + \frac{1}{6}\cdot 56) = -0.8\\\\ E(c=6) = \frac{1}{6}\cdot 61 + \frac{1}{6}\cdot 63 + \frac{1}{6}\cdot 66 - (\frac{1}{6} \cdot 62 + \frac{1}{6}\cdot 64 + \frac{1}{6}\cdot 65) = -0.2 $$
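These expected values are easy to check programmatically; a small Python sketch using exactly the scoring rule above:

```python
def is_prime(n):
    return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))

def score(n):
    # You gain n points if n is prime or divisible by 3, otherwise you lose n.
    return n if is_prime(n) or n % 3 == 0 else -n

# Expected value of each first digit, averaging over the six die rolls.
for first in range(1, 7):
    ev = sum(score(10 * first + roll) for roll in range(1, 7)) / 6
    print(first, round(ev, 1))
```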
---
class: small

# How about this game?

Let's make the game more interesting: You choose a number from 1 to 6, I roll a die, then you choose another number, then I roll a die, etc., until we have 10 digits (5 each).

Same scoring as before: If the number is prime or divisible by 3 you get that many points, otherwise you lose them.

For example, you choose 3, I roll 2, you choose 4, I roll 1, you choose 5, I roll 1, you choose 6, I roll 1, you choose 4, I roll 3. Result: 3241516143 = 3 * 1080505381, so you get 3241516143 points.

What's your move?

---

# How would we play this game?

* There are 6^10 = 60,466,176 possible outcomes
* We could compute expected values again; it might just take a bit
* How about we just try a few random combinations for each of our options to get a feeling for what might be a good move?
* But let's do this in a systematic way!

---
class: small

# Sampling!

* Say we try 10 random numbers for each of our 6 options, and get averages of:
* 1: -241591424.3
* 2: -1049205520.1
* 3: -711555192.8
* 4: 838179893.4
* 5: -2208620186.1
* 6: 1289273320.5
* Instead of trying another 10 numbers for each option, we may want to focus more on the more promising ones?
* But 10 tries are very few. Maybe the value for, e.g., 2 is completely wrong?

---
class: small

# A more structured approach

* How do we randomly try options?
* First we need to pick the first number
* Then we randomly generate the second one
* Then we pick the third number, etc.
* We can memorize these picks!
* It will basically be a game tree, but incomplete

---
class: center, middle

# Monte Carlo Tree Search

---

# Initialization and Data Structure

* We always have a partial game tree
* For each node, remember the expected score/win rate
* We start with a single node: the current state
* A node can be expanded by performing a possible action in that node, resulting in a child node
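A minimal Python sketch of this data structure; the field and method names are illustrative, and `game.result` is the same assumed helper as in the earlier minimax sketch:

```python
class Node:
    """One node of the partial game tree."""
    def __init__(self, state, parent=None, action=None):
        self.state = state        # game state this node represents
        self.parent = parent      # None for the root (the current state)
        self.action = action      # action that led here from the parent
        self.children = []        # expanded child nodes
        self.visits = 0           # how often this node was visited
        self.total_score = 0.0    # sum of simulation results seen here

    def expected_score(self):
        return self.total_score / self.visits if self.visits else 0.0

    def expand(self, game, action):
        # Expanding: perform one possible action in this node's state,
        # resulting in a new child node.
        child = Node(game.result(self.state, action), parent=self, action=action)
        self.children.append(child)
        return child
```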
---

# Monte Carlo Tree Search

(Figure: the four MCTS phases, Selection, Expansion, Simulation, and Backpropagation, illustrated on a partial game tree whose nodes are labelled with win/visit counts such as 11/21 at the root; backpropagation updates the counts of every node on the selected path.)
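A minimal sketch of the loop formed by the four phases in the figure, reusing the `Node` sketch from before; the tree policy here is plain greedy selection for brevity, and the epsilon-greedy and UCT1 policies on the following slides would slot into `select`:

```python
import random

def select(game, node):
    # Selection: descend while the node is fully expanded and not terminal.
    while node.children and len(node.children) == len(game.legal_actions(node.state)):
        node = max(node.children, key=Node.expected_score)  # greedy tree policy
    return node

def expand_one(game, node):
    # Expansion: add one child for an action we have not tried yet.
    if game.is_terminal(node.state):
        return node
    tried = {child.action for child in node.children}
    untried = [a for a in game.legal_actions(node.state) if a not in tried]
    return node.expand(game, random.choice(untried))

def simulate(game, state):
    # Simulation ("rollout"): play random moves until the game ends.
    while not game.is_terminal(state):
        state = game.result(state, random.choice(game.legal_actions(state)))
    return game.score(state)

def backpropagate(node, score):
    # Backpropagation: update every node on the path back to the root.
    while node is not None:
        node.visits += 1
        node.total_score += score
        node = node.parent

def mcts(game, root_state, iterations=1000):
    root = Node(root_state)
    for _ in range(iterations):
        leaf = expand_one(game, select(game, root))
        backpropagate(leaf, simulate(game, leaf.state))
    # We can stop at any time and pick the best move found so far.
    return max(root.children, key=Node.expected_score).action
```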
---

# Selection

* Start at the root of the partial tree and select a child node
* Continue selecting child nodes until you reach a node that is not fully expanded
* The selection is done using a *tree policy*
* For example: With probability `\(1-\varepsilon\)` we pick the child with the highest expected value, otherwise we pick a child uniformly at random (epsilon-greedy policy)

---

# Expansion

* When we find a node with unexpanded children, we expand it
* Expansion also depends on the tree policy (we may want to revisit a child we've already expanded)

---

# Simulation

* Using a *default policy*, the game is played until the end
* This default policy can be a simple AI agent, or just pure random moves
* The simulation is sometimes also called a "rollout"
* At the end of the game, we obtain the score

---

# Backpropagation

* Record the score obtained from the simulation in the tree
* For each node that was on the path to the end, update the expected value with the new score

---
class: middle, center

# Example

---
class: medium

# Features of Monte Carlo Tree Search

* No heuristic needed, only the game rules
* Opponents can be modeled by a policy or random moves
* Hidden information (such as shuffled decks, die rolls, etc.) can be modeled as random moves
* When you run the algorithm, you can stop it at *any* time and just pick the best move so far, giving you a trade-off between time needed and quality of the estimate

---
class: medium

# How does this algorithm do?

* The good news: MCTS converges to minimax!
* The bad news: It converges very slowly
* Imagine a move like sacrificing a queen in chess: It may have a high reward in *some* scenarios, but most of the future results will be bad
* To actually find the "good" version, many different simulations have to be performed

---

# Exploration vs. Exploitation

* The epsilon-greedy approach strongly prefers nodes with high scores
* But what if we "missed" a better option?
* Should we try to improve the most promising option (exploitation) or try to find a different, but potentially better move (exploration)?
* The number of times a node is visited may give us an indication of how good the score estimate is

---

# Upper Confidence Bound 1 applied to trees

Use the UCT1 value to determine which child to visit:

$$ X_j + c \sqrt{\frac{\ln N}{n_j}} $$

Where `\(X_j\)` is the current estimate for the score of the jth child, `\(N\)` is the number of times the current node was visited, and `\(n_j\)` is the number of times the jth child was visited. `\(c\)` determines the trade-off between exploitation and exploration.

A high value for `\(X_j\)` makes a node more likely to be visited (exploitation), as does a low value for `\(n_j\)` compared to the total `\(N\)` (exploration). If `\(n_j\)` is 0, the value is infinite, guaranteeing that unvisited children are visited. A code sketch of this selection rule follows after the next slide.

---
class: small

# What is MCTS good for?

* MCTS does extremely well on a wide variety of games, even those with a large branching factor (many possible actions), hidden information, and/or non-determinism
* Has been successfully used for Go, Scrabble, Mancala, constructing crossword puzzles, PacMan, Poker, Magic: The Gathering, etc.
* Also useful for non-games, such as the traveling salesman problem, scheduling, and vulnerability testing
* Challenges:
  - Random rollouts may take a long time in many domains
  - With a large branching factor, a good tree policy can be helpful
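As referenced above, a minimal sketch of UCT1-based child selection, reusing the `Node` fields from the earlier sketch; the value of `c` is a tunable assumption (√2 is a common default):

```python
import math

def uct1(child, parent_visits, c=math.sqrt(2)):
    # Unvisited children get an infinite value, so they are tried first.
    if child.visits == 0:
        return math.inf
    exploitation = child.total_score / child.visits              # X_j
    exploration = c * math.sqrt(math.log(parent_visits) / child.visits)
    return exploitation + exploration

def select_child(node):
    # Tree policy: pick the child with the highest UCT1 value.
    return max(node.children, key=lambda ch: uct1(ch, node.visits))
```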
---
class: center, middle

# AlphaGo

---
class: medium

# Go

.left-column[
]
.right-column[
* Over 2500 years old
* 40 million active players
* Simple rules, but a lot of strategic depth
* At least `\(10^{170}\)` different game states
* Games can last hundreds of turns
]

---

# Go

* Before MCTS, AI agents struggled with Go
* When MCTS was first developed (2006/2007), it was successfully applied to Go on a smaller (9x9) board
* By 2012, Go programs could beat an international champion with a small handicap
* In 2016, AlphaGo by DeepMind beat the world champion
* How?

---

# Go with MCTS

* Because of the enormous game tree, standard MCTS runs into limitations
* AlphaGo uses two key techniques to overcome these:
  - Better action selection in the tree policy
  - Limited rollouts
* Both of these were implemented using Neural Networks

---
class: medium

# Better Action Selection

* UCT1 uses the results obtained during sampling to determine which action to try next
* It actually guarantees that unvisited actions are visited first
* But what if we "knew" that some moves are definitely bad? We wouldn't need to visit them at all
* AlphaGo uses a Neural Network to estimate how likely a move is to be the best move
* This does not need to be perfect; a reasonable guess is enough

---
class: medium

# Limited Rollouts

* Instead of playing games to the end with random moves, we can use a "guess" for how the game will end at some point
* AlphaGo trained a Neural Network to provide this "guess"
* Both of these Neural Networks were trained with data from human expert players and then refined with data from AlphaGo playing against itself
* AlphaGo Zero (2017) then *only* used self-play data

---
class: center, middle

# MCTS Variants

---

# Single-Player MCTS

* In single-player games, there is no opponent that tries to minimize our score
* In addition to determining the "expected" reward, it may also be useful to store the maximum reward for each branch
* The "opponent" is usually some form of randomness, so the variance of the results may also be of interest
* MCTS can then use these values in the tree policy (e.g. as an additional term in the UCT1 formula)

---

# Multi-Player MCTS

* If there are more than two players (or the game isn't zero-sum), minimizing an opponent's score may no longer be the same as maximizing your own score
* Instead, store a vector of rewards, one for each player, and maximize the appropriate one for each action
* Alternatively: pretend all other players are playing as a team against you ("Paranoid UCT")

---

# Multi-Agent MCTS

* During the rollouts, the default policy determines how the simulation is done
* Instead of a pure random selection, this may use some simple heuristic
* Multi-Agent MCTS uses multiple different policies for the rollouts, to better predict interactions between agent types
* However, defining these different agents requires significant engineering effort for each game

---
class: medium

# Information Set MCTS

* Hidden information, such as a shuffled deck, can be accounted for by MCTS by sampling and pretending the game state is fully visible
* Store a tree of information sets, pick one possible state, and only follow actions compatible with this state (e.g. if the state picked has me holding the ace of spades, *any* state in which I am holding the ace of spades is compatible with that fact)
* This avoids fully determinizing the state with information that is not relevant to the rollout, like cards at the bottom of the deck
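A minimal sketch of the sampling step this describes, for a card game; all names here are illustrative assumptions, not from a specific library:

```python
import random

def sample_determinization(my_hand, seen_cards, full_deck, hidden_slots):
    """Fill every hidden slot (opponent hand positions, deck positions)
    with a random arrangement of the cards we have not seen yet."""
    unseen = [c for c in full_deck if c not in my_hand and c not in seen_cards]
    random.shuffle(unseen)
    return dict(zip(hidden_slots, unseen))

# Each MCTS iteration then runs on one such sampled state and only follows
# actions compatible with it, sharing statistics across the information set.
```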
---

# Forward-Search Sparse Sampling

* MCTS can also be used for planning
* Instead of finding a winning set of moves, we are looking for a plan
* Challenge: How do you define the "score"? There are very few "wins" in planning

---

# Next Week

* [Re-determinizing Information Set Monte Carlo Tree Search in Hanabi](https://arxiv.org/pdf/1902.06075.pdf)

---

# Resources

* [Chapter 2.3.4 in the Game AI Book](http://gameaibook.org)
* [A Survey of Monte Carlo Tree Search Methods](http://repository.essex.ac.uk/4117/1/MCTS-Survey.pdf)