class: center, middle

# Artificial Intelligence

### Monte Carlo Tree Search

---

# Adversarial Search: Recap

- We have a turn-based game
- Our goal is to get the highest possible score
- Our opponent wants us to get the lowest possible score
- It is our turn

---

# Adversarial Search: Recap

- For every possible move we could make, we consider every possible move our opponent could make, etc.
- For each possible sequence of moves we calculate our score
- We then assume that our opponent will choose the action that results in the lowest score for us

---

# Adversarial Search: Limitations

- This *game tree* will be huge
- We heard about alpha-beta pruning to reduce the tree size, but even with that the tree is too large to calculate for many games
- Many games also have random components
- What now?

---

# Monte Carlo Tree Search

* Idea: Don't calculate the entire tree, but instead *sample* random(-ish) sequences of moves ("*rollouts*")
* Record the outcome of each of these rollouts
* Repeat a large number of times
* At the end, we will have an *estimate* of the number of points we will get for each action

---

class: mmedium

# Monte Carlo Tree Search

- If we pick actions completely at random for our rollouts, we will need too many repetitions to get a good estimate
- But we can use the information we learn during the rollouts to "guide" future iterations
- For example: Say we have already performed 100 rollouts, which gives us a (probably bad) estimate of the expected value of each action
- For the next rollouts, we choose the action with the highest expected value with the highest probability
- Over time, our sampling process will collect more samples for the more promising actions

---

class: medium

# Monte Carlo Tree Search

Our algorithm constructs a game tree piece by piece. In each iteration, it expands the partial tree in four steps (a code sketch of this loop follows the diagram on the next slide):

- *Select* actions from the tree until we reach a node we haven't fully expanded yet
- *Expand* a new action from that node
- *Simulate* the game until the end, and note the result
- *Backpropagate* this result back up the tree

---

# MCTS
(Figure: one MCTS iteration on an example tree, shown in four panels: Selection, Expansion, Simulation, and Backpropagation. Each node is labeled with its win/visit count.)
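---

class: medium

# MCTS: The Loop in Code

The sketch below shows one way the four phases can fit together. It is a minimal sketch, not a reference implementation: the `Node` class and the game-state interface (`get_actions()`, `apply()`, `is_terminal()`, `get_result()`) are hypothetical names assumed for illustration, and `select_child` is one of the selection strategies discussed later.

```python
import random

class Node:
    """One node of the partial game tree, storing wins/visits like the diagram."""
    def __init__(self, state, parent=None, action=None):
        self.state = state                          # hypothetical game-state object
        self.parent = parent
        self.action = action                        # move that led here from the parent
        self.children = []
        self.untried_actions = list(state.get_actions())
        self.wins = 0.0
        self.visits = 0

def random_rollout(state):
    """Simplest simulation: both players move completely at random."""
    while not state.is_terminal():
        state = state.apply(random.choice(state.get_actions()))
    return state.get_result()                       # e.g. 1 for a win, 0 for a loss

def mcts(root_state, iterations, select_child, rollout=random_rollout):
    root = Node(root_state)
    for _ in range(iterations):
        node = root

        # 1. Select: walk down the tree while the current node is fully expanded
        while not node.untried_actions and node.children:
            node = select_child(node)

        # 2. Expand: add one action we have not tried from this node yet
        if node.untried_actions:
            action = node.untried_actions.pop()
            child = Node(node.state.apply(action), parent=node, action=action)
            node.children.append(child)
            node = child

        # 3. Simulate: play the game to the end and note the result
        result = rollout(node.state)

        # 4. Backpropagate: update win/visit counts along the path to the root
        # (in a two-player game you would flip the result at every other level)
        while node is not None:
            node.visits += 1
            node.wins += result
            node = node.parent

    # Finally, pick the action whose estimate looks best
    best = max(root.children, key=lambda c: c.wins / c.visits)
    return best.action
```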
---

# MCTS for Tic-Tac-Toe

(Figure sequence: MCTS applied to a game of Tic-Tac-Toe, shown over several slides.)
---

class: medium

# Monte Carlo Tree Search

Algorithm:

- *Select* actions from the tree until we reach a node we haven't fully expanded yet
- *Expand* a new action from that node
- *Simulate* the game until the end, and note the result
- *Backpropagate* this result back up the tree

---

# Selection

* We use the scores we have obtained so far to choose which action to select, until we reach a leaf
* One approach would be to always pick the action with the (currently) highest expected value
* However, this would ignore actions that got bad results due to "bad luck" in the rollout
* There are several different *selection strategies* we can use to overcome this problem

---

# Epsilon-Greedy Selection

* One of the simplest selection strategies uses a single parameter: `\(\varepsilon\)`
* When we have to select an action, we choose a number between 0 and 1 uniformly at random
* If that number is less than `\(\varepsilon\)`, we choose an action uniformly at random
* Otherwise, we choose the action with the highest expected value

---

class: medium

# Roulette-Wheel Selection

* Epsilon-Greedy may be problematic if two actions have almost the same expected value
* Ideally, we would choose each of these two actions with (almost) the same probability
* Roulette-Wheel Selection picks an action at random, with weights determined by the expected value of each action
* For example, if the expected values of four actions are 1, 4, 8, and 7, we choose the actions with probabilities 1/20, 4/20, 8/20, and 7/20, respectively
* This is also called "fitness proportionate selection"

(A short code sketch of both strategies follows.)

---

# Roulette-Wheel Selection
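---

class: medium

# Selection Strategies: Sketch

Both strategies can be written as small functions that plug into the selection step of the loop sketched earlier. This is a minimal illustration under the same assumptions as that sketch (each child node stores `wins` and `visits`); the helper names are made up here.

```python
import random

def expected_value(child):
    # Average result observed through this child so far
    return child.wins / child.visits

def epsilon_greedy(node, epsilon=0.1):
    """With probability epsilon pick a random child, otherwise the best one."""
    if random.random() < epsilon:
        return random.choice(node.children)
    return max(node.children, key=expected_value)

def roulette_wheel(node):
    """Pick a child with probability proportional to its expected value."""
    weights = [expected_value(c) for c in node.children]
    total = sum(weights)
    if total == 0:                        # all estimates are zero: fall back to uniform
        return random.choice(node.children)
    # Spin the wheel; random.choices(node.children, weights=weights) does the same
    spin = random.uniform(0, total)
    for child, weight in zip(node.children, weights):
        spin -= weight
        if spin <= 0:
            return child
    return node.children[-1]              # guard against floating-point rounding
```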
---

class: mmedium

# UCT

* We can also use a more sophisticated selection strategy
* UCT = "Upper Confidence Bound 1 applied to trees", based on the UCB-1 formula:
$$ E + c \sqrt{\frac{\ln N}{n}} $$
* Where E is the expected value of an action, N is the number of times we have chosen any action in the current state, and n is the number of times we have chosen this particular action
* We can use c to "tune" the behavior: prefer choosing the best action (lower c) or try each action equally often (higher c)
* (A code sketch of UCT can be found at the end of these slides)

---

# Simulation

* We said that once we reach a node we haven't fully expanded yet, we "simulate" the game until the end to get a result
* How can we simulate a game?
* Simplest variant: Each player performs completely random moves
* We build our tree piece by piece, but we will still need "many" repetitions to get good simulation results

---

# Simulation

* Instead of moving randomly, we can use any other strategy we might know
* For example, if we have a (bad) agent for the game, it could play the game for our simulation
* As we build our tree, we will use more actions selected by our selection strategy and fewer by our "bad" agent
* By not playing completely randomly, we may need fewer repetitions

---

class: medium

# Simulation

* What if we don't actually simulate?
* Ideally, we would have the exact result of the game for a new action that we're exploring
* For many games we can instead come up with a game state evaluation
* In Chess, for example, we can say that whoever has more valuable pieces on the board will likely win
* Advanced ideas: Play a few random turns and then evaluate the state, use a Neural Network to evaluate the board state, etc.

---

class: center, middle

# Randomness

---

# Randomness

* What if we have a (shuffled) deck?
* We already sample our actions, so we can also sample from the deck!
* For every iteration, we shuffle the deck, too (using known information)

---

# Example: Card Game

* Imagine a card game like Hearthstone, Magic: The Gathering, etc.
* On your turn, you can perform actions with the cards in your hand
* What the "best" action is depends on cards you will draw in the future
* But you don't know those cards yet ...

---

# Example: Card Game

* Some future cards may have a similar effect on your strategy
* In general, you want to play in a way that **maximizes** your chance of winning
* You can use MCTS to help you!
* Instead of just sampling the actions during rollouts, you also sample shuffles

---

# Example: Card Game

* Before **each** rollout, you take the knowledge you have of the deck (i.e. which cards are remaining)
* Then you shuffle these remaining cards
* You can combine this with partial knowledge, for example if some effect revealed the third card from the top
* Then you perform the rollout (a sketch of this sampling step is at the end of these slides)

---

# Example: Card Game

* Previously, our rollouts just sampled the expected value over random action sequences
* Now we also sample over possible deck orders
* With enough rollouts, this tells us the expected value of each action over all possible future cards we could draw
* Next time we will look at a simpler game to apply this idea: Blackjack (Lab 2)

---

# References

* [MCTS Tutorial](https://www.cs.swarthmore.edu/~bryce/cs63/s16/reading/mcts.html)
* [MCTS Slides](https://www.lri.fr/~sebag/Slides/InvitedTutorial_CP12.pdf)
* [MCTS for a Card Game](http://teaching.csse.uwa.edu.au/units/CITS3001/project/2017/paper1.pdf)
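---

class: medium

# Sketch: UCT Selection

A possible implementation of UCT as another drop-in selection function for the loop sketched earlier. This is only an illustration: the `Node` fields (`wins`, `visits`, `children`) are the hypothetical ones used before, and the default `c = 1.41` (roughly `\(\sqrt{2}\)`) is just a common choice, not a value prescribed by these slides.

```python
import math

def uct_select(node, c=1.41):
    """Pick the child maximizing E + c * sqrt(ln(N) / n), the UCB-1 formula."""
    def ucb1(child):
        exploit = child.wins / child.visits                              # E
        explore = c * math.sqrt(math.log(node.visits) / child.visits)    # c * sqrt(ln N / n)
        return exploit + explore
    return max(node.children, key=ucb1)
```

A larger c makes rarely-tried children more attractive (more exploration); a smaller c makes the choice closer to pure exploitation.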
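---

class: medium

# Sketch: Sampling the Deck

One way the "shuffle before each rollout" step could look. Everything here is a made-up interface for illustration: `unseen_cards`, `known_positions`, and the card name in the example are not from these slides.

```python
import random

def sample_deck(unseen_cards, known_positions=None):
    """Build one possible deck order consistent with what we know.

    unseen_cards:    all cards still in the deck (including any whose position we know)
    known_positions: e.g. {2: "Fireball"} if some effect revealed the
                     third card from the top (index 2)
    """
    known_positions = known_positions or {}
    # Shuffle only the cards whose positions we do not know
    pool = [card for card in unseen_cards if card not in known_positions.values()]
    random.shuffle(pool)

    deck = [None] * len(unseen_cards)
    for index, card in known_positions.items():
        deck[index] = card                  # pin down the cards we actually know
    for i in range(len(deck)):
        if deck[i] is None:
            deck[i] = pool.pop()            # fill the rest with the shuffled pool
    return deck

# Before each rollout we would draw one fresh sample, e.g.:
#   deck = sample_deck(remaining_cards, known_positions={2: "Fireball"})
# and then run the rollout against this particular deck order.
```

With enough rollouts, the averaged results approximate the expected value of each action over all deck orders consistent with our knowledge.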