class: center, middle

# Artificial Intelligence

### Monte Carlo Tree Search

---

# Adversarial Search: Recap

- We have a turn-based game
- Our goal is to get the highest possible score
- Our opponent wants us to get the lowest possible score
- It is our turn

---

# Adversarial Search: Recap

- For every possible move we could make, we consider every possible move our opponent could make, etc.
- For each possible sequence of moves we calculate our score
- We then assume that our opponent will choose the action that results in the lowest score for us

---

# Adversarial Search: Limitations

- This *game tree* will be huge
- We heard about alpha-beta pruning to reduce the tree size, but even with that the tree is too large to calculate for many games
- Many games also have random components
- What now?

---

# Monte Carlo Tree Search

* Idea: Don't calculate the entire tree, but instead *sample* random(-ish) sequences of moves ("*rollouts*")
* Record the outcome of each rollout
* Repeat a large number of times
* At the end, we will have an *estimate* of how many points we can expect from each action

---

class: mmedium

# Monte Carlo Tree Search

- If we pick actions completely at random for our rollouts, we will need too many repetitions to get a good estimate
- But we can use the information we learn during the rollouts to "guide" future iterations
- For example: Say we have already performed 100 rollouts, which gives us a (probably bad) estimate of the expected value of each action
- For the next rollouts, we choose actions with higher expected values with higher probability
- Over time, our sampling process will collect more samples for the more promising actions

---

class: medium

# Monte Carlo Tree Search

Our algorithm constructs a game tree piece by piece. In each iteration, it expands the partial tree in four steps:

- *Select* actions from the tree until we reach a node we haven't fully expanded yet
- *Expand* a new action from that node
- *Simulate* the game until the end, and note the result
- *Backpropagate* this result back up the tree

A small code sketch of these four steps follows the figure on the next slide.

---

# MCTS
*(Figure: the four phases of one MCTS iteration, Selection, Expansion, Simulation, and Backpropagation, illustrated on a tree whose nodes carry wins/visits counts such as 11/21 at the root.)*
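---

class: small

# MCTS: A Minimal Code Sketch

The following is a minimal sketch of the four steps in Python. It is not the lab framework: it uses a toy Nim-like game (remove 1-3 sticks, whoever takes the last stick wins) so the example is self-contained, and all names in it are made up for illustration.

```python
import random

# Toy game: state = (sticks_left, player_to_move); remove 1-3 sticks,
# whoever takes the last stick wins.
def legal_actions(state):
    return [n for n in (1, 2, 3) if n <= state[0]]

def apply_action(state, action):
    return (state[0] - action, 1 - state[1])

def is_terminal(state):
    return state[0] == 0

def winner(state):
    return 1 - state[1]   # the player who just took the last stick

class Node:
    def __init__(self, state, parent=None, action=None):
        self.state, self.parent, self.action = state, parent, action
        self.children = []
        self.untried = legal_actions(state)
        self.visits = 0
        self.wins = 0.0    # from the view of the player who moved INTO this node

def select(node):
    # Walk down the tree while the current node is fully expanded.
    # Naive strategy: pick the child with the highest average value.
    # (Better strategies: epsilon-greedy, roulette-wheel, UCT; see later slides.)
    while not node.untried and node.children:
        node = max(node.children, key=lambda c: c.wins / c.visits)
    return node

def expand(node):
    if not node.untried:
        return node        # terminal node, nothing left to expand
    action = node.untried.pop(random.randrange(len(node.untried)))
    child = Node(apply_action(node.state, action), parent=node, action=action)
    node.children.append(child)
    return child

def simulate(state):
    # Completely random rollout until the game ends.
    while not is_terminal(state):
        state = apply_action(state, random.choice(legal_actions(state)))
    return winner(state)

def backpropagate(node, win_player):
    while node is not None:
        node.visits += 1
        if win_player == 1 - node.state[1]:  # the player who moved into this node won
            node.wins += 1
        node = node.parent

def mcts(root_state, iterations=2000):
    root = Node(root_state)
    for _ in range(iterations):
        leaf = select(root)              # 1. Select
        child = expand(leaf)             # 2. Expand
        result = simulate(child.state)   # 3. Simulate
        backpropagate(child, result)     # 4. Backpropagate
    return max(root.children, key=lambda c: c.visits).action

print(mcts((10, 0)))   # estimated best first move with 10 sticks for player 0
```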
---

# MCTS for Tic-Tac-Toe

*(Figure sequence: the MCTS tree for a Tic-Tac-Toe position, built up step by step over several iterations.)*
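---

class: small

# Tic-Tac-Toe as a Game for MCTS

To run the earlier sketch on Tic-Tac-Toe, all we need are the game rules in the same four functions. This is again only an illustrative sketch with made-up names; note that a draw (`winner` returns `None`) would have to be handled during backpropagation, e.g. counted as half a win.

```python
import random

# State = (board, player_to_move); board is a tuple of 9 cells holding 0, 1, or None.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
         (0, 4, 8), (2, 4, 6)]              # diagonals

def legal_actions(state):
    board, _ = state
    return [i for i in range(9) if board[i] is None]

def apply_action(state, action):
    board, player = state
    cells = list(board)
    cells[action] = player
    return (tuple(cells), 1 - player)

def winner(state):
    board, _ = state
    for a, b, c in LINES:
        if board[a] is not None and board[a] == board[b] == board[c]:
            return board[a]
    return None   # no winner (yet, or a draw)

def is_terminal(state):
    return winner(state) is not None or all(cell is not None for cell in state[0])

# Sanity check: one completely random game.
state = (tuple([None] * 9), 0)
while not is_terminal(state):
    state = apply_action(state, random.choice(legal_actions(state)))
print("winner:", winner(state))   # 0, 1, or None for a draw
```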
---

class: medium

# Monte Carlo Tree Search

Algorithm:

- *Select* actions from the tree until we reach a node we haven't fully expanded yet
- *Expand* a new action from that node
- *Simulate* the game until the end, and note the result
- *Backpropagate* this result back up the tree

---

# Selection

* We use the scores we have obtained so far to choose which action to select until we reach a leaf
* One approach would be to always pick the action with the (currently) highest expected value
* However, this would ignore actions that got bad results due to "bad luck" in the rollout
* There are several different *selection strategies* we can use to overcome this problem

---

# Epsilon-Greedy Selection

* One of the simplest selection strategies uses a single parameter: `\(\varepsilon\)`
* When we have to select an action, we choose a number between 0 and 1 uniformly at random
* If that number is less than `\(\varepsilon\)`, we choose an action uniformly at random
* Otherwise we choose the action with the highest expected value

---

class: medium

# Roulette-Wheel Selection

* Epsilon-Greedy may be problematic if two actions have almost the same expected value
* Ideally, we would choose each of these two with (almost) the same probability
* Roulette-Wheel Selection picks an action at random, with weights determined by the expected value of each action
* For example, if the expected values of four actions are 1, 4, 8, and 7, we choose the actions with probability 1/20, 4/20, 8/20 and 7/20, respectively
* This is also called "fitness proportionate selection"
* (A small code sketch of both strategies follows the figure on the next slide)

---

# Roulette-Wheel Selection
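---

class: small

# Selection Strategies in Code

A minimal sketch of both strategies, assuming we already have an estimated value for each action; the names and the `(action, expected value)` representation are made up for illustration and are not part of the lab framework.

```python
import random

# 'action_values' is a list of (action, expected value) pairs, e.g. from the tree so far.
def epsilon_greedy(action_values, epsilon=0.1):
    # With probability epsilon pick uniformly at random, otherwise pick greedily.
    if random.random() < epsilon:
        return random.choice(action_values)[0]
    return max(action_values, key=lambda av: av[1])[0]

def roulette_wheel(action_values):
    # Pick each action with probability proportional to its (non-negative) expected value.
    total = sum(value for _, value in action_values)
    r = random.uniform(0, total)
    cumulative = 0.0
    for action, value in action_values:
        cumulative += value
        if r <= cumulative:
            return action
    return action_values[-1][0]   # guard against floating-point rounding

# The example from the slide: values 1, 4, 8, 7 -> probabilities 1/20, 4/20, 8/20, 7/20.
example = [("a", 1.0), ("b", 4.0), ("c", 8.0), ("d", 7.0)]
print(epsilon_greedy(example), roulette_wheel(example))
```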
---

class: mmedium

# UCT

* We can also use a more sophisticated selection strategy
* UCT = "Upper Confidence Bound 1 applied to trees", based on the UCB-1 formula:

$$ E + c \sqrt{\frac{\ln N}{n}} $$

* Where E is the expected value of an action, N is the number of times we have chosen any action in the current state, and n is the number of times we have chosen this particular action
* We can use c to "tune" the behavior, to prefer choosing the best action (lower c), or trying each action equally often (higher c)

---

# Simulation

* We said that once we reach a node we haven't fully expanded yet, we "simulate" the game until the end to get a result
* How can we simulate a game?
* Simplest variant: Each player performs completely random moves
* We will build our tree piece by piece, but we will still need "many" repetitions to get good simulation results

---

# Simulation

* Instead of moving randomly, we can use any other strategy we might know
* For example, if we have a (bad) agent for the game, it could play the game for our simulation
* As we build our tree, we will use more actions selected by our selection strategy and fewer by our "bad" agent
* By not playing completely randomly, we may need fewer repetitions

---

class: medium

# Simulation

* What if we don't actually simulate?
* Ideally, we would have the exact result of the game for a new action that we're exploring
* For many games we can instead come up with a game state evaluation
* In Chess, for example, we can say whoever has more valuable pieces on the board will likely win
* Advanced ideas: Play a few random turns and then evaluate the state, use a Neural Network to evaluate the board state, etc.

---

class: center, middle

# Randomness

---

# Randomness

* What if we have a (shuffled) deck?
* We already sample our actions; we can also sample from the deck!
* For every iteration, we shuffle the deck, too (using known information)
* (A small code sketch of this idea follows after the Blackjack rules)

---

# Example: Blackjack

* In Blackjack the player can request to draw cards from the deck
* The goal is to get a sum of card values close to 21, but not over 21
* For example: Ten of Spades, Three of Hearts, Seven of Clubs are 10+3+7=20 points
* Jack, Queen and King are 10 points each; an Ace can count as 1 **or** 11 points (player's choice)

---

class: medium

# Blackjack

* After the player has performed their actions, the dealer draws cards until they have more than 16 points
* If the player has more than 21 points, they lose
* If the dealer has more than 21 points, the player wins
* If the player then has more points than the dealer, the player wins
* If there is a tie, no one wins
* The winner gets an amount of money (like $1)

---

class: medium

# Blackjack: Player Actions

A player can do one of four things:

* **Hit**: Request one more card
* **Stand**: Stop taking cards, passing the turn to the dealer
* **Double Down**: Draw exactly one more card and then stand, and double the bet (win or lose $2)
* **Split**: If the first two cards have the same value, the player can split them into two hands, and continue playing with these two independently (each hand wins/loses $1)
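---

class: small

# Blackjack Rollouts: A Sketch

Before looking at the tree itself, here is a small, hypothetical sketch of the bookkeeping a rollout needs: computing hand values (with the Ace counting as 1 or 11) and sampling the unseen part of the deck, as suggested on the Randomness slide. The lab framework has its own card representation; none of these names come from it.

```python
import random

def hand_value(cards):
    # Card values: 2-10 for number cards, 10 for J/Q/K, 11 for an Ace.
    total = sum(cards)
    aces = cards.count(11)
    # Count an Ace as 1 instead of 11 while we would otherwise go over 21.
    while total > 21 and aces > 0:
        total -= 10
        aces -= 1
    return total

def simulate_dealer(dealer_cards, deck):
    # The dealer draws until they have more than 16 points.
    cards = list(dealer_cards)
    while hand_value(cards) <= 16:
        cards.append(deck.pop())
    return hand_value(cards)

# One rollout "determinizes" the hidden information: shuffle the cards we have
# not seen yet and play the hand out against that sampled ordering.
seen = [10, 11, 7]                                         # e.g. our two cards and the dealer's upcard
deck = [2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10, 11] * 4   # one 52-card deck by value
for card in seen:
    deck.remove(card)
random.shuffle(deck)

player_total = hand_value([10, 11])
dealer_total = simulate_dealer([7], deck)
print(player_total, dealer_total)   # compare using the rules from the previous slides
```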
---

# MCTS for Blackjack

*(Figure sequence: the MCTS tree for a Blackjack hand, built up over several iterations.)*

---

# Why Blackjack

* Implementing MCTS for Blackjack is Lab 2 (4/5-22/5)
* You will get a framework with a Blackjack implementation (and some AI agents)
* Your task is to extend the existing sampling player with proper action selection and tree construction
* Weird feature: You can play with non-standard decks (e.g. only even cards), and your agent will figure it out!

---

class: small

# References

* [MCTS Tutorial](https://www.cs.swarthmore.edu/~bryce/cs63/s16/reading/mcts.html)
* [MCTS Slides](https://www.lri.fr/~sebag/Slides/InvitedTutorial_CP12.pdf)
* [MCTS for a Card Game](http://teaching.csse.uwa.edu.au/units/CITS3001/project/2017/paper1.pdf)