class: center, middle

# Artificial Intelligence

### Monte Carlo Tree Search

---

# Adversarial Search: Recap

- We have a turn-based game
- Our goal is to get the highest possible score
- Our opponent wants us to get the lowest possible score
- It is our turn

---

# Adversarial Search: Recap

- For every possible move we could make, we consider every possible move our opponent could make, etc.
- For each possible sequence of moves we calculate our score
- We then assume that our opponent will choose the action that results in the lowest score for us

---

# Adversarial Search: Limitations

- This *game tree* will be huge
- We heard about alpha-beta pruning to reduce the tree size, but even with that the tree is too large to calculate for many games
- Many games also have random components
- What now?

---

# Monte Carlo Tree Search

* Idea: Don't calculate the entire tree, but instead *sample* random(-ish) sequences of moves ("*rollouts*")
* Record the outcome of each rollout
* Repeat a large number of times
* At the end, we will have an *estimate* of how many points we can expect from each action

---

class: mmedium

# Monte Carlo Tree Search

- If we pick actions completely at random for our rollouts, we will need too many repetitions to get a good estimate
- But we can use the information we learn during the rollouts to "guide" future iterations
- For example: Say we have already performed 100 rollouts, which gives us a (probably bad) estimate of the expected value of each action
- For the next rollouts, we choose actions with higher expected values with higher probability
- Over time, our sampling process will collect more samples for the more promising actions

---

class: medium

# Monte Carlo Tree Search

Our algorithm constructs a game tree piece by piece. In each iteration, it expands the partial tree in four steps:

- *Select* actions from the tree until we reach a node we haven't fully expanded yet
- *Expand* a new action from that node
- *Simulate* the game until the end, and note the result
- *Backpropagate* this result back up the tree

A small code sketch of these four steps follows the figure on the next slide.

---

# MCTS
*(Figure: the four phases of one MCTS iteration, Selection, Expansion, Simulation, and Backpropagation, illustrated on a tree whose nodes carry wins/visits counts such as 11/21 at the root.)*
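---

class: small

# MCTS: A Minimal Code Sketch

The following is a minimal sketch of the four steps in Python. It is not the lab framework: it uses a toy Nim-like game (remove 1-3 sticks, whoever takes the last stick wins) so the example is self-contained, and all names in it are made up for illustration.

```python
import random

# Toy game: state = (sticks_left, player_to_move); remove 1-3 sticks,
# whoever takes the last stick wins.
def legal_actions(state):
    return [n for n in (1, 2, 3) if n <= state[0]]

def apply_action(state, action):
    return (state[0] - action, 1 - state[1])

def is_terminal(state):
    return state[0] == 0

def winner(state):
    return 1 - state[1]   # the player who just took the last stick

class Node:
    def __init__(self, state, parent=None, action=None):
        self.state, self.parent, self.action = state, parent, action
        self.children = []
        self.untried = legal_actions(state)
        self.visits = 0
        self.wins = 0.0    # from the view of the player who moved INTO this node

def select(node):
    # Walk down the tree while the current node is fully expanded.
    # Naive strategy: pick the child with the highest average value.
    # (Better strategies: epsilon-greedy, roulette-wheel, UCT; see later slides.)
    while not node.untried and node.children:
        node = max(node.children, key=lambda c: c.wins / c.visits)
    return node

def expand(node):
    if not node.untried:
        return node        # terminal node, nothing left to expand
    action = node.untried.pop(random.randrange(len(node.untried)))
    child = Node(apply_action(node.state, action), parent=node, action=action)
    node.children.append(child)
    return child

def simulate(state):
    # Completely random rollout until the game ends.
    while not is_terminal(state):
        state = apply_action(state, random.choice(legal_actions(state)))
    return winner(state)

def backpropagate(node, win_player):
    while node is not None:
        node.visits += 1
        if win_player == 1 - node.state[1]:  # the player who moved into this node won
            node.wins += 1
        node = node.parent

def mcts(root_state, iterations=2000):
    root = Node(root_state)
    for _ in range(iterations):
        leaf = select(root)              # 1. Select
        child = expand(leaf)             # 2. Expand
        result = simulate(child.state)   # 3. Simulate
        backpropagate(child, result)     # 4. Backpropagate
    return max(root.children, key=lambda c: c.visits).action

print(mcts((10, 0)))   # estimated best first move with 10 sticks for player 0
```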
---

# MCTS for Tic-Tac-Toe

*(Figure sequence: the MCTS tree for a Tic-Tac-Toe position, built up step by step over several iterations.)*
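---

class: small

# Tic-Tac-Toe as a Game for MCTS

To run the earlier sketch on Tic-Tac-Toe, all we need are the game rules in the same four functions. This is again only an illustrative sketch with made-up names; note that a draw (`winner` returns `None`) would have to be handled during backpropagation, e.g. counted as half a win.

```python
import random

# State = (board, player_to_move); board is a tuple of 9 cells holding 0, 1, or None.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
         (0, 4, 8), (2, 4, 6)]              # diagonals

def legal_actions(state):
    board, _ = state
    return [i for i in range(9) if board[i] is None]

def apply_action(state, action):
    board, player = state
    cells = list(board)
    cells[action] = player
    return (tuple(cells), 1 - player)

def winner(state):
    board, _ = state
    for a, b, c in LINES:
        if board[a] is not None and board[a] == board[b] == board[c]:
            return board[a]
    return None   # no winner (yet, or a draw)

def is_terminal(state):
    return winner(state) is not None or all(cell is not None for cell in state[0])

# Sanity check: one completely random game.
state = (tuple([None] * 9), 0)
while not is_terminal(state):
    state = apply_action(state, random.choice(legal_actions(state)))
print("winner:", winner(state))   # 0, 1, or None for a draw
```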
---

class: medium

# Monte Carlo Tree Search

Algorithm:

- *Select* actions from the tree until we reach a node we haven't fully expanded yet
- *Expand* a new action from that node
- *Simulate* the game until the end, and note the result
- *Backpropagate* this result back up the tree

---

# Selection

* We use the scores we have obtained so far to choose which action to select until we reach a leaf
* One approach would be to always pick the action with the (currently) highest expected value
* However, this would ignore actions that got bad results due to "bad luck" in the rollout
* There are several different *selection strategies* we can use to overcome this problem

---

# Epsilon-Greedy Selection

* One of the simplest selection strategies uses a single parameter: `\(\varepsilon\)`
* When we have to select an action, we choose a number between 0 and 1 uniformly at random
* If that number is less than `\(\varepsilon\)`, we choose an action uniformly at random
* Otherwise we choose the action with the highest expected value

---

class: medium

# Roulette-Wheel Selection

* Epsilon-Greedy may be problematic if two actions have almost the same expected value
* Ideally, we would choose each of these two with (almost) the same probability
* Roulette-Wheel Selection picks an action at random, with weights determined by the expected value of each action
* For example, if the expected values of four actions are 1, 4, 8, and 7, we choose the actions with probability 1/20, 4/20, 8/20 and 7/20, respectively
* This is also called "fitness proportionate selection"
* (A small code sketch of both strategies follows the figure on the next slide)

---

# Roulette-Wheel Selection
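---

class: small

# Selection Strategies in Code

A minimal sketch of both strategies, assuming we already have an estimated value for each action; the names and the `(action, expected value)` representation are made up for illustration and are not part of the lab framework.

```python
import random

# 'action_values' is a list of (action, expected value) pairs, e.g. from the tree so far.
def epsilon_greedy(action_values, epsilon=0.1):
    # With probability epsilon pick uniformly at random, otherwise pick greedily.
    if random.random() < epsilon:
        return random.choice(action_values)[0]
    return max(action_values, key=lambda av: av[1])[0]

def roulette_wheel(action_values):
    # Pick each action with probability proportional to its (non-negative) expected value.
    total = sum(value for _, value in action_values)
    r = random.uniform(0, total)
    cumulative = 0.0
    for action, value in action_values:
        cumulative += value
        if r <= cumulative:
            return action
    return action_values[-1][0]   # guard against floating-point rounding

# The example from the slide: values 1, 4, 8, 7 -> probabilities 1/20, 4/20, 8/20, 7/20.
example = [("a", 1.0), ("b", 4.0), ("c", 8.0), ("d", 7.0)]
print(epsilon_greedy(example), roulette_wheel(example))
```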
---

class: mmedium

# UCT

* We can also use a more sophisticated selection strategy
* UCT = "Upper Confidence Bound 1 applied to trees", based on the UCB-1 formula:

$$ E + c \sqrt{\frac{\ln N}{n}} $$

* Where E is the expected value of an action, N is the number of times we have chosen any action in the current state, and n is the number of times we have chosen this particular action
* We can use c to "tune" the behavior, to prefer choosing the best action (lower c), or trying each action equally often (higher c)

---

# Simulation

* We said that once we reach a node we haven't fully expanded yet, we "simulate" the game until the end to get a result
* How can we simulate a game?
* Simplest variant: Each player performs completely random moves
* We will build our tree piece by piece, but we will still need "many" repetitions to get good simulation results

---

# Simulation

* Instead of moving randomly, we can use any other strategy we might know
* For example, if we have a (bad) agent for the game, it could play the game for our simulation
* As we build our tree, we will use more actions selected by our selection strategy and fewer by our "bad" agent
* By not playing completely randomly, we may need fewer repetitions

---

class: medium

# Simulation

* What if we don't actually simulate?
* Ideally, we would have the exact result of the game for a new action that we're exploring
* For many games we can instead come up with a game state evaluation
* In Chess, for example, we can say whoever has more valuable pieces on the board will likely win
* Advanced ideas: Play a few random turns and then evaluate the state, use a Neural Network to evaluate the board state, etc.

---

class: center, middle

# Randomness

---

# Randomness

* What if we have a (shuffled) deck?
* We already sample our actions; we can also sample from the deck!
* For every iteration, we shuffle the deck, too (using known information)
* (A small code sketch of this idea follows after the Blackjack rules)

---

# Example: Blackjack

* In Blackjack the player can request to draw cards from the deck
* The goal is to get a sum of card values close to 21, but not over 21
* For example: Ten of Spades, Three of Hearts, Seven of Clubs are 10+3+7=20 points
* Jack, Queen and King are 10 points each; an Ace can count as 1 **or** 11 points (player's choice)

---

class: medium

# Blackjack

* After the player has performed their actions, the dealer draws cards until they have more than 16 points
* If the player has more than 21 points, they lose
* If the dealer has more than 21 points, the player wins
* If the player then has more points than the dealer, the player wins
* If there is a tie, no one wins
* The winner gets an amount of money (like $1)

---

class: medium

# Blackjack: Player Actions

A player can do one of four things:

* **Hit**: Request one more card
* **Stand**: Stop taking cards, passing the turn to the dealer
* **Double Down**: Draw exactly one more card and then stand, and double the bet (win or lose $2)
* **Split**: If the first two cards have the same value, the player can split them into two hands, and continue playing with these two independently (each hand wins/loses $1)
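---

class: small

# Blackjack Rollouts: A Sketch

Before looking at the tree itself, here is a small, hypothetical sketch of the bookkeeping a rollout needs: computing hand values (with the Ace counting as 1 or 11) and sampling the unseen part of the deck, as suggested on the Randomness slide. The lab framework has its own card representation; none of these names come from it.

```python
import random

def hand_value(cards):
    # Card values: 2-10 for number cards, 10 for J/Q/K, 11 for an Ace.
    total = sum(cards)
    aces = cards.count(11)
    # Count an Ace as 1 instead of 11 while we would otherwise go over 21.
    while total > 21 and aces > 0:
        total -= 10
        aces -= 1
    return total

def simulate_dealer(dealer_cards, deck):
    # The dealer draws until they have more than 16 points.
    cards = list(dealer_cards)
    while hand_value(cards) <= 16:
        cards.append(deck.pop())
    return hand_value(cards)

# One rollout "determinizes" the hidden information: shuffle the cards we have
# not seen yet and play the hand out against that sampled ordering.
seen = [10, 11, 7]                                         # e.g. our two cards and the dealer's upcard
deck = [2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10, 11] * 4   # one 52-card deck by value
for card in seen:
    deck.remove(card)
random.shuffle(deck)

player_total = hand_value([10, 11])
dealer_total = simulate_dealer([7], deck)
print(player_total, dealer_total)   # compare using the rules from the previous slides
```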
---

# MCTS for Blackjack

*(Figure sequence: the MCTS tree for a Blackjack hand, built up over several iterations.)*

---

# Why Blackjack

* Implementing MCTS for Blackjack is Lab 2 (4/5-22/5)
* You will get a framework with a Blackjack implementation (and some AI agents)
* Your task is to extend the existing sampling player with proper action selection and tree construction
* Weird feature: You can play with non-standard decks (e.g. only even cards), and your agent will figure it out!

---

class: small

# References

* [MCTS Tutorial](https://www.cs.swarthmore.edu/~bryce/cs63/s16/reading/mcts.html)
* [MCTS Slides](https://www.lri.fr/~sebag/Slides/InvitedTutorial_CP12.pdf)
* [MCTS for a Card Game](http://teaching.csse.uwa.edu.au/units/CITS3001/project/2017/paper1.pdf)