class: center, middle

# Artificial Intelligence

### Monte Carlo Tree Search

---

# Adversarial Search: Recap

- We have a turn-based game
- Our goal is to get the highest possible score
- Our opponent wants us to get the lowest possible score
- It is our turn

---

# Adversarial Search: Recap

- For every possible move we could make, we consider every possible move our opponent could make, etc.
- For each possible sequence of moves we calculate our score
- We then assume that our opponent will choose the action that results in the lowest score for us

---

# Adversarial Search: Limitations

- This *game tree* will be huge
- We heard about alpha-beta pruning to reduce the tree size, but even with that the tree is too large to calculate for many games
- Many games also have random components
- What now?

---

# Monte Carlo Tree Search

* Idea: Don't calculate the entire tree, but instead *sample* random(-ish) sequences of moves ("*rollouts*")
* Record the outcome of each of these rollouts
* Repeat a large number of times
* At the end, we will have an *estimate* of the number of points we will get for each action

---

class: mmedium

# Monte Carlo Tree Search

- If we pick actions completely at random for our rollouts, we will need too many repetitions to get a good estimate
- But we can use the information we learn during the rollouts to "guide" future iterations
- For example: Say we have already performed 100 rollouts, which gives us a (probably bad) estimate of the expected value of each action
- For the next rollouts, we choose the action with the highest expected value with the highest probability
- Over time, our sampling process will collect more samples for the more promising actions

---

class: medium

# Monte Carlo Tree Search

Our algorithm constructs a game tree piece by piece. In each iteration, it expands the partial tree in four steps (a code sketch of this loop follows the diagram on the next slide):

- *Select* actions from the tree until we reach a node we haven't fully expanded yet
- *Expand* a new action from that node
- *Simulate* the game until the end, and note the result
- *Backpropagate* this result back up the tree

---

# MCTS
(Figure: one MCTS iteration on an example tree, shown in four panels: Selection, Expansion, Simulation, and Backpropagation. Each node is labeled with its win/visit count.)
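---

class: medium

# MCTS: The Loop in Code

The sketch below shows one way the four phases can fit together. It is a minimal sketch, not a reference implementation: the `Node` class and the game-state interface (`get_actions()`, `apply()`, `is_terminal()`, `get_result()`) are hypothetical names assumed for illustration, and `select_child` is one of the selection strategies discussed later.

```python
import random

class Node:
    """One node of the partial game tree, storing wins/visits like the diagram."""
    def __init__(self, state, parent=None, action=None):
        self.state = state                          # hypothetical game-state object
        self.parent = parent
        self.action = action                        # move that led here from the parent
        self.children = []
        self.untried_actions = list(state.get_actions())
        self.wins = 0.0
        self.visits = 0

def random_rollout(state):
    """Simplest simulation: both players move completely at random."""
    while not state.is_terminal():
        state = state.apply(random.choice(state.get_actions()))
    return state.get_result()                       # e.g. 1 for a win, 0 for a loss

def mcts(root_state, iterations, select_child, rollout=random_rollout):
    root = Node(root_state)
    for _ in range(iterations):
        node = root

        # 1. Select: walk down the tree while the current node is fully expanded
        while not node.untried_actions and node.children:
            node = select_child(node)

        # 2. Expand: add one action we have not tried from this node yet
        if node.untried_actions:
            action = node.untried_actions.pop()
            child = Node(node.state.apply(action), parent=node, action=action)
            node.children.append(child)
            node = child

        # 3. Simulate: play the game to the end and note the result
        result = rollout(node.state)

        # 4. Backpropagate: update win/visit counts along the path to the root
        # (in a two-player game you would flip the result at every other level)
        while node is not None:
            node.visits += 1
            node.wins += result
            node = node.parent

    # Finally, pick the action whose estimate looks best
    best = max(root.children, key=lambda c: c.wins / c.visits)
    return best.action
```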
---

# MCTS for Tic-Tac-Toe

(Figure sequence: MCTS applied to a game of Tic-Tac-Toe, shown over several slides.)
---

class: medium

# Monte Carlo Tree Search

Algorithm:

- *Select* actions from the tree until we reach a node we haven't fully expanded yet
- *Expand* a new action from that node
- *Simulate* the game until the end, and note the result
- *Backpropagate* this result back up the tree

---

# Selection

* We use the scores we have obtained so far to choose which action to select, until we reach a leaf
* One approach would be to always pick the action with the (currently) highest expected value
* However, this would ignore actions that got bad results due to "bad luck" in the rollout
* There are several different *selection strategies* we can use to overcome this problem

---

# Epsilon-Greedy Selection

* One of the simplest selection strategies uses a single parameter: `\(\varepsilon\)`
* When we have to select an action, we choose a number between 0 and 1 uniformly at random
* If that number is less than `\(\varepsilon\)`, we choose an action uniformly at random
* Otherwise, we choose the action with the highest expected value

---

class: medium

# Roulette-Wheel Selection

* Epsilon-Greedy may be problematic if two actions have almost the same expected value
* Ideally, we would choose each of these two actions with (almost) the same probability
* Roulette-Wheel Selection picks an action at random, with weights determined by the expected value of each action
* For example, if the expected values of four actions are 1, 4, 8, and 7, we choose the actions with probabilities 1/20, 4/20, 8/20, and 7/20, respectively
* This is also called "fitness proportionate selection"

(A short code sketch of both strategies follows.)

---

# Roulette-Wheel Selection
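---

class: medium

# Selection Strategies: Sketch

Both strategies can be written as small functions that plug into the selection step of the loop sketched earlier. This is a minimal illustration under the same assumptions as that sketch (each child node stores `wins` and `visits`); the helper names are made up here.

```python
import random

def expected_value(child):
    # Average result observed through this child so far
    return child.wins / child.visits

def epsilon_greedy(node, epsilon=0.1):
    """With probability epsilon pick a random child, otherwise the best one."""
    if random.random() < epsilon:
        return random.choice(node.children)
    return max(node.children, key=expected_value)

def roulette_wheel(node):
    """Pick a child with probability proportional to its expected value."""
    weights = [expected_value(c) for c in node.children]
    total = sum(weights)
    if total == 0:                        # all estimates are zero: fall back to uniform
        return random.choice(node.children)
    # Spin the wheel; random.choices(node.children, weights=weights) does the same
    spin = random.uniform(0, total)
    for child, weight in zip(node.children, weights):
        spin -= weight
        if spin <= 0:
            return child
    return node.children[-1]              # guard against floating-point rounding
```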
---

class: mmedium

# UCT

* We can also use a more sophisticated selection strategy
* UCT = "Upper Confidence Bound 1 applied to trees", based on the UCB-1 formula:
$$ E + c \sqrt{\frac{\ln N}{n}} $$
* Where E is the expected value of an action, N is the number of times we have chosen any action in the current state, and n is the number of times we have chosen this particular action
* We can use c to "tune" the behavior: prefer choosing the best action (lower c) or try each action equally often (higher c)
* (A code sketch of UCT can be found at the end of these slides)

---

# Simulation

* We said that once we reach a node we haven't fully expanded yet, we "simulate" the game until the end to get a result
* How can we simulate a game?
* Simplest variant: Each player performs completely random moves
* We build our tree piece by piece, but we will still need "many" repetitions to get good simulation results

---

# Simulation

* Instead of moving randomly, we can use any other strategy we might know
* For example, if we have a (bad) agent for the game, it could play the game for our simulation
* As we build our tree, we will use more actions selected by our selection strategy and fewer by our "bad" agent
* By not playing completely randomly, we may need fewer repetitions

---

class: medium

# Simulation

* What if we don't actually simulate?
* Ideally, we would have the exact result of the game for a new action that we're exploring
* For many games we can instead come up with a game state evaluation
* In Chess, for example, we can say that whoever has more valuable pieces on the board will likely win
* Advanced ideas: Play a few random turns and then evaluate the state, use a Neural Network to evaluate the board state, etc.

---

class: center, middle

# Randomness

---

# Randomness

* What if we have a (shuffled) deck?
* We already sample our actions, so we can also sample from the deck!
* For every iteration, we shuffle the deck, too (using known information)

---

# Example: Card Game

* Imagine a card game like Hearthstone, Magic: The Gathering, etc.
* On your turn, you can perform actions with the cards in your hand
* What the "best" action is depends on cards you will draw in the future
* But you don't know those cards yet ...

---

# Example: Card Game

* Some future cards may have a similar effect on your strategy
* In general, you want to play in a way that **maximizes** your chance of winning
* You can use MCTS to help you!
* Instead of just sampling the actions during rollouts, you also sample shuffles

---

# Example: Card Game

* Before **each** rollout, you take the knowledge you have of the deck (i.e. which cards are remaining)
* Then you shuffle these remaining cards
* You can combine this with partial knowledge, for example if some effect revealed the third card from the top
* Then you perform the rollout (a sketch of this sampling step is at the end of these slides)

---

# Example: Card Game

* Previously, our rollouts just sampled the expected value over random action sequences
* Now we also sample over possible deck orders
* With enough rollouts, this tells us the expected value of each action over all possible future cards we could draw
* Next time we will look at a simpler game to apply this idea: Blackjack (Lab 2)

---

# References

* [MCTS Tutorial](https://www.cs.swarthmore.edu/~bryce/cs63/s16/reading/mcts.html)
* [MCTS Slides](https://www.lri.fr/~sebag/Slides/InvitedTutorial_CP12.pdf)
* [MCTS for a Card Game](http://teaching.csse.uwa.edu.au/units/CITS3001/project/2017/paper1.pdf)
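---

class: medium

# Sketch: UCT Selection

A possible implementation of UCT as another drop-in selection function for the loop sketched earlier. This is only an illustration: the `Node` fields (`wins`, `visits`, `children`) are the hypothetical ones used before, and the default `c = 1.41` (roughly `\(\sqrt{2}\)`) is just a common choice, not a value prescribed by these slides.

```python
import math

def uct_select(node, c=1.41):
    """Pick the child maximizing E + c * sqrt(ln(N) / n), the UCB-1 formula."""
    def ucb1(child):
        exploit = child.wins / child.visits                              # E
        explore = c * math.sqrt(math.log(node.visits) / child.visits)    # c * sqrt(ln N / n)
        return exploit + explore
    return max(node.children, key=ucb1)
```

A larger c makes rarely-tried children more attractive (more exploration); a smaller c makes the choice closer to pure exploitation.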
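---

class: medium

# Sketch: Sampling the Deck

One way the "shuffle before each rollout" step could look. Everything here is a made-up interface for illustration: `unseen_cards`, `known_positions`, and the card name in the example are not from these slides.

```python
import random

def sample_deck(unseen_cards, known_positions=None):
    """Build one possible deck order consistent with what we know.

    unseen_cards:    all cards still in the deck (including any whose position we know)
    known_positions: e.g. {2: "Fireball"} if some effect revealed the
                     third card from the top (index 2)
    """
    known_positions = known_positions or {}
    # Shuffle only the cards whose positions we do not know
    pool = [card for card in unseen_cards if card not in known_positions.values()]
    random.shuffle(pool)

    deck = [None] * len(unseen_cards)
    for index, card in known_positions.items():
        deck[index] = card                  # pin down the cards we actually know
    for i in range(len(deck)):
        if deck[i] is None:
            deck[i] = pool.pop()            # fill the rest with the shuffled pool
    return deck

# Before each rollout we would draw one fresh sample, e.g.:
#   deck = sample_deck(remaining_cards, known_positions={2: "Fireball"})
# and then run the rollout against this particular deck order.
```

With enough rollouts, the averaged results approximate the expected value of each action over all deck orders consistent with our knowledge.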