class: center, middle

# AI in Digital Entertainment
### Monte Carlo Tree Search

---

# Game Trees

* We want to build an agent for a turn-based game
* On its turn, the agent needs to figure out which action to perform
* Ideally, all the agent knows are the rules of the game (including scoring), and it is then able to figure out the best move automatically
* We've talked about this before, in the form of game trees

---
class: small

# Minimax

* Let's say we want to get the highest possible score
* Then our opponent wants us to get the lowest possible score
* For each of our potential actions, we look at each of the opponent's possible actions
* The opponent will pick the action that gives us the lowest score, and we will pick from our actions the one where the opponent's choice gives us the highest score
* How does the opponent decide what to pick? The same way!
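As a quick refresher, here is a minimal Python sketch of this reasoning; the `Game` interface used here (`is_terminal`, `score`, `legal_actions`, `result`) is an assumption for illustration, not something defined on these slides:

```python
def minimax(game, state, maximizing):
    # Assumed Game interface: is_terminal(state), score(state) from our
    # perspective, legal_actions(state), and result(state, action).
    if game.is_terminal(state):
        return game.score(state)
    values = [minimax(game, game.result(state, a), not maximizing)
              for a in game.legal_actions(state)]
    # On our turn we take the best value; the opponent takes the worst (for us).
    return max(values) if maximizing else min(values)

# Picking the action whose resulting state has the highest minimax value
# gives the move described on this slide.
```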
---

# Minimax

(Figure: example game tree with levels 0 to 4 and node values including +∞, -∞, 10, 5, 7, -5, -7, and -10.)
--- # Minimax
(Figure: the same example game tree with levels 0 to 4 and node values including +∞, -∞, 10, 5, 7, -5, -7, and -10.)
---

# Minimax

* Remember what we did next?
* I played a game against a student, and the student was upset because he lost.
* So let's make a fairer game: instead of me, a die will play

---

# A fairer game

You choose a number between 1 and 6 as the first digit, then I roll a d6 (six-sided die) to get a second digit.

If the resulting number is prime *or* divisible by 3, you get that many points; otherwise you lose that many points.

For example: You pick 3, the die roll shows a 1, so the resulting number is 31, which is prime, and you get 31 points.

What's your move?

---
class: small

# How would we play this game?

* Winning numbers: 11, 12, 13, 15, 21, 23, 24, 31, 33, 36, 41, 42, 43, 45, 51, 53, 54, 61, 63, 66
* If we pick 2, 3, 5 or 6, there are three possible die rolls that give us points; for 1 or 4 there are four.
* We can calculate the **expected value** for each choice:
$$ E(c=1) = \frac{1}{6}\cdot 11 + \frac{1}{6} \cdot 12 + \frac{1}{6} \cdot 13 + \frac{1}{6} \cdot 15 - (\frac{1}{6}\cdot 14 + \frac{1}{6}\cdot 16) = 3.5\\\\ E(c=2) = \frac{1}{6}\cdot 21 + \frac{1}{6}\cdot 23 + \frac{1}{6}\cdot 24 - (\frac{1}{6} \cdot 22 + \frac{1}{6}\cdot 25 + \frac{1}{6}\cdot 26) = -0.8\\\\ E(c=3) = \frac{1}{6}\cdot 31 + \frac{1}{6}\cdot 33 + \frac{1}{6}\cdot 36 - (\frac{1}{6} \cdot 32 + \frac{1}{6}\cdot 34 + \frac{1}{6}\cdot 35) = -0.2\\\\ E(c=4) = \frac{1}{6}\cdot 41 + \frac{1}{6} \cdot 42 + \frac{1}{6} \cdot 43 + \frac{1}{6} \cdot 45 - (\frac{1}{6}\cdot 44 + \frac{1}{6}\cdot 46) = 13.5\\\\ E(c=5) = \frac{1}{6} \cdot 51 + \frac{1}{6}\cdot 53 + \frac{1}{6}\cdot 54 - (\frac{1}{6}\cdot 52 + \frac{1}{6}\cdot 55 + \frac{1}{6}\cdot 56) = -0.8\\\\ E(c=6) = \frac{1}{6}\cdot 61 + \frac{1}{6}\cdot 63 + \frac{1}{6}\cdot 66 - (\frac{1}{6} \cdot 62 + \frac{1}{6}\cdot 64 + \frac{1}{6}\cdot 65) = -0.2 $$
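These expected values are easy to check programmatically; a small Python sketch using exactly the scoring rule above:

```python
def is_prime(n):
    return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))

def score(n):
    # You gain n points if n is prime or divisible by 3, otherwise you lose n.
    return n if is_prime(n) or n % 3 == 0 else -n

# Expected value of each first digit, averaging over the six die rolls.
for first in range(1, 7):
    ev = sum(score(10 * first + roll) for roll in range(1, 7)) / 6
    print(first, round(ev, 1))
```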
---
class: small

# How about this game?

Let's make the game more interesting: You choose a number from 1 to 6, I roll a die, then you choose another number, then I roll a die, etc., until we have 10 digits (5 each).

Same scoring as before: If the number is prime or divisible by 3 you get that many points, otherwise you lose them.

For example, you choose 3, I roll 2, you choose 4, I roll 1, you choose 5, I roll 1, you choose 6, I roll 1, you choose 4, I roll 3. Result: 3241516143 = 3 * 1080505381, so you get 3241516143 points.

What's your move?

---

# How would we play this game?

* There are 6^10 = 60,466,176 possible outcomes
* We could compute expected values again; it might just take a bit
* How about we just try a few random combinations for each of our options to get a feeling for what might be a good move?
* But let's do this in a systematic way!

---
class: small

# Sampling!

* Say we try 10 random numbers for each of our 6 options, and get averages of:
* 1: -241591424.3
* 2: -1049205520.1
* 3: -711555192.8
* 4: 838179893.4
* 5: -2208620186.1
* 6: 1289273320.5
* Instead of trying another 10 numbers for each option, we may want to focus more on the more promising ones?
* But 10 tries are very few. Maybe the value for, e.g., 2 is completely wrong?

---
class: small

# A more structured approach

* How do we randomly try options?
* First we need to pick the first number
* Then we randomly generate the second one
* Then we pick the third number, etc.
* We can memorize these picks!
* It will basically be a game tree, but incomplete

---
class: center, middle

# Monte Carlo Tree Search

---

# Initialization and Data Structure

* We always have a partial game tree
* For each node, remember the expected score/win rate
* We start with a single node: the current state
* A node can be expanded by performing a possible action in that node, resulting in a child node
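A minimal Python sketch of this data structure; the field and method names are illustrative, and `game.result` is the same assumed helper as in the earlier minimax sketch:

```python
class Node:
    """One node of the partial game tree."""
    def __init__(self, state, parent=None, action=None):
        self.state = state        # game state this node represents
        self.parent = parent      # None for the root (the current state)
        self.action = action      # action that led here from the parent
        self.children = []        # expanded child nodes
        self.visits = 0           # how often this node was visited
        self.total_score = 0.0    # sum of simulation results seen here

    def expected_score(self):
        return self.total_score / self.visits if self.visits else 0.0

    def expand(self, game, action):
        # Expanding: perform one possible action in this node's state,
        # resulting in a new child node.
        child = Node(game.result(self.state, action), parent=self, action=action)
        self.children.append(child)
        return child
```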
---

# Monte Carlo Tree Search

(Figure: the four MCTS phases, Selection, Expansion, Simulation, and Backpropagation, illustrated on a partial game tree whose nodes are labelled with win/visit counts such as 11/21 at the root; backpropagation updates the counts of every node on the selected path.)
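A minimal sketch of the loop formed by the four phases in the figure, reusing the `Node` sketch from before; the tree policy here is plain greedy selection for brevity, and the epsilon-greedy and UCT1 policies on the following slides would slot into `select`:

```python
import random

def select(game, node):
    # Selection: descend while the node is fully expanded and not terminal.
    while node.children and len(node.children) == len(game.legal_actions(node.state)):
        node = max(node.children, key=Node.expected_score)  # greedy tree policy
    return node

def expand_one(game, node):
    # Expansion: add one child for an action we have not tried yet.
    if game.is_terminal(node.state):
        return node
    tried = {child.action for child in node.children}
    untried = [a for a in game.legal_actions(node.state) if a not in tried]
    return node.expand(game, random.choice(untried))

def simulate(game, state):
    # Simulation ("rollout"): play random moves until the game ends.
    while not game.is_terminal(state):
        state = game.result(state, random.choice(game.legal_actions(state)))
    return game.score(state)

def backpropagate(node, score):
    # Backpropagation: update every node on the path back to the root.
    while node is not None:
        node.visits += 1
        node.total_score += score
        node = node.parent

def mcts(game, root_state, iterations=1000):
    root = Node(root_state)
    for _ in range(iterations):
        leaf = expand_one(game, select(game, root))
        backpropagate(leaf, simulate(game, leaf.state))
    # We can stop at any time and pick the best move found so far.
    return max(root.children, key=Node.expected_score).action
```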
---

# Selection

* Start at the root of the partial tree and select a child node
* Continue selecting child nodes until you reach a node that is not fully expanded
* The selection is done using a *tree policy*
* For example: With probability `\(1-\varepsilon\)` we pick the child with the highest expected value, otherwise we pick a child uniformly at random (epsilon-greedy policy)

---

# Expansion

* When we find a node with unexpanded children, we expand it
* Expansion also depends on the tree policy (we may want to revisit a child we've already expanded)

---

# Simulation

* Using a *default policy*, the game is played until the end
* This default policy can be a simple AI agent, or just pure random moves
* The simulation is sometimes also called a "rollout"
* At the end of the game, we obtain the score

---

# Backpropagation

* Record the score obtained from the simulation in the tree
* For each node that was on the path to the end, update the expected value with the new score

---
class: middle, center

# Example

---
class: medium

# Features of Monte Carlo Tree Search

* No heuristic needed, only the game rules
* Opponents can be modeled by a policy or random moves
* Hidden information (such as shuffled decks, die rolls, etc.) can be modeled as random moves
* When you run the algorithm, you can stop it at *any* time and just pick the best move so far, giving you a trade-off between time needed and quality of the estimate

---
class: medium

# How does this algorithm do?

* The good news: MCTS converges to minimax!
* The bad news: It converges very slowly
* Imagine a move like sacrificing a queen in chess: It may have a high reward in *some* scenarios, but most of the future results will be bad
* To actually find the "good" version, many different simulations have to be performed

---

# Exploration vs. Exploitation

* The epsilon-greedy approach strongly prefers nodes with high scores
* But what if we "missed" a better option?
* Should we try to improve the most promising option (exploitation) or try to find a different, but potentially better move (exploration)?
* The number of times a node is visited may give us an indication of how good the score estimate is

---

# Upper Confidence Bound 1 applied to trees

Use the UCT1 value to determine which child to visit:

$$ X_j + c \sqrt{\frac{\ln N}{n_j}} $$

Where `\(X_j\)` is the current estimate for the score of the jth child, `\(N\)` is the number of times the current node was visited, and `\(n_j\)` is the number of times the jth child was visited. `\(c\)` determines the trade-off between exploitation and exploration.

A high value for `\(X_j\)` makes a node more likely to be visited (exploitation), as does a low value for `\(n_j\)` compared to the total `\(N\)` (exploration). If `\(n_j\)` is 0, the value is infinite, guaranteeing that unvisited children are visited. A code sketch of this selection rule follows after the next slide.

---
class: small

# What is MCTS good for?

* MCTS does extremely well on a wide variety of games, even those with a large branching factor (many possible actions), hidden information, and/or non-determinism
* Has been successfully used for Go, Scrabble, Mancala, constructing crossword puzzles, PacMan, Poker, Magic: The Gathering, etc.
* Also useful for non-games, such as the traveling salesman problem, scheduling, and vulnerability testing
* Challenges:
  - Random rollouts may take a long time in many domains
  - With a large branching factor, a good tree policy can be helpful
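As referenced above, a minimal sketch of UCT1-based child selection, reusing the `Node` fields from the earlier sketch; the value of `c` is a tunable assumption (√2 is a common default):

```python
import math

def uct1(child, parent_visits, c=math.sqrt(2)):
    # Unvisited children get an infinite value, so they are tried first.
    if child.visits == 0:
        return math.inf
    exploitation = child.total_score / child.visits              # X_j
    exploration = c * math.sqrt(math.log(parent_visits) / child.visits)
    return exploitation + exploration

def select_child(node):
    # Tree policy: pick the child with the highest UCT1 value.
    return max(node.children, key=lambda ch: uct1(ch, node.visits))
```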
---
class: center, middle

# AlphaGo

---
class: medium

# Go

.left-column[
]
.right-column[
* Over 2500 years old
* 40 million active players
* Simple rules, but a lot of strategic depth
* At least `\(10^{170}\)` different game states
* Games can last hundreds of turns
]

---

# Go

* Before MCTS, AI agents struggled with Go
* When MCTS was first developed (2006/2007), it was successfully applied to Go on a smaller (9x9) board
* By 2012, Go programs could beat an international champion with a small handicap
* In 2016, AlphaGo by DeepMind beat the world champion
* How?

---

# Go with MCTS

* Because of the enormous game tree, standard MCTS runs into limitations
* AlphaGo uses two key techniques to overcome these:
  - Better action selection in the tree policy
  - Limited rollouts
* Both of these were implemented using Neural Networks

---
class: medium

# Better Action Selection

* UCT1 uses the results obtained during sampling to determine which action to try next
* It actually guarantees that unvisited actions are visited first
* But what if we "knew" that some moves are definitely bad? We wouldn't need to visit them at all
* AlphaGo uses a Neural Network to estimate how likely a move is to be the best move
* This does not need to be perfect; a reasonable guess is enough

---
class: medium

# Limited Rollouts

* Instead of playing games to the end with random moves, we can use a "guess" for how the game will end at some point
* AlphaGo trained a Neural Network to provide this "guess"
* Both of these Neural Networks were trained with data from human expert players and then refined with data from AlphaGo playing against itself
* AlphaGo Zero (2017) then *only* used self-play data

---
class: center, middle

# MCTS Variants

---

# Single-Player MCTS

* In single-player games, there is no opponent that tries to minimize our score
* In addition to determining the "expected" reward, it may also be useful to store the maximum reward for each branch
* The "opponent" is usually some form of randomness, so the variance of the results may also be of interest
* MCTS can then use these values in the tree policy (e.g. as an additional term in the UCT1 formula)

---

# Multi-Player MCTS

* If there are more than two players (or the game isn't zero-sum), minimizing an opponent's score may no longer be the same as maximizing your own score
* Instead, store a vector of rewards, one for each player, and maximize the appropriate one for each action
* Alternatively: pretend all other players are playing as a team against you ("Paranoid UCT")

---

# Multi-Agent MCTS

* During the rollouts, the default policy determines how the simulation is done
* Instead of a pure random selection, this may use some simple heuristic
* Multi-Agent MCTS uses multiple different policies for the rollouts, to better predict interactions between agent types
* However, defining these different agents requires significant engineering effort for each game

---
class: medium

# Information Set MCTS

* Hidden information, such as a shuffled deck, can be accounted for by MCTS by sampling and pretending the game state is fully visible
* Store a tree of information sets, pick one possible state, and only follow actions compatible with this state (e.g. if the state picked has me holding the ace of spades, *any* state in which I am holding the ace of spades is compatible with that fact)
* This avoids fully determinizing the state with information that is not relevant to the rollout, like cards at the bottom of the deck
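A minimal sketch of the sampling step this describes, for a card game; all names here are illustrative assumptions, not from a specific library:

```python
import random

def sample_determinization(my_hand, seen_cards, full_deck, hidden_slots):
    """Fill every hidden slot (opponent hand positions, deck positions)
    with a random arrangement of the cards we have not seen yet."""
    unseen = [c for c in full_deck if c not in my_hand and c not in seen_cards]
    random.shuffle(unseen)
    return dict(zip(hidden_slots, unseen))

# Each MCTS iteration then runs on one such sampled state and only follows
# actions compatible with it, sharing statistics across the information set.
```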
---

# Forward-Search Sparse Sampling

* MCTS can also be used for planning
* Instead of finding a winning set of moves, we are looking for a plan
* Challenge: How do you define the "score"? There are very few "wins" in planning

---

# Next Week

* [Re-determinizing Information Set Monte Carlo Tree Search in Hanabi](https://arxiv.org/pdf/1902.06075.pdf)

---

# Resources

* [Chapter 2.3.4 in the Game AI Book](http://gameaibook.org)
* [A Survey of Monte Carlo Tree Search Methods](http://repository.essex.ac.uk/4117/1/MCTS-Survey.pdf)