class: center, middle

# Machine Learning

## Reinforcement Learning

### III-Verano 2019

---

class: center, middle

# Reinforcement Learning

---

class: medium

# The Problem

* We want an agent that "figures out" how to "perform well"
* Given an environment in which the agent performs actions, we tell the agent which reward it receives
* We generally assume that our states are discrete, and that actions transition between them
* The goal of the agent is to learn which actions are "good" to perform in which state
* Rewards may be negative, representing "punishment"

---

# The Problem
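To make the setup from the previous slide concrete, here is a minimal sketch of one way such an environment could be written down; the states, actions, and rewards below are made up purely for illustration:

```Python
# A tiny, hypothetical environment with discrete states and actions.
# T[state][action] is the state an action leads to, R[state] is the
# immediate reward for being in a state (names and numbers made up).
T = {
    "start":  {"left": "start",  "right": "middle"},
    "middle": {"left": "start",  "right": "goal"},
    "goal":   {"left": "middle", "right": "goal"},
}
R = {"start": 0.0, "middle": 0.0, "goal": 1.0}

state = "start"
for action in ["right", "right"]:
    state = T[state][action]          # follow the transition
    print(action, "->", state, "reward:", R[state])
```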
---

# Reinforcement Learning

* The idea is that the agent performs several trial runs in the environment to determine a good policy
* Compare to how a human player might learn how to play a game: try some actions, observe the result
* Games are a natural fit for this kind of learning, but there are other applications as well!

---

# An Example
---

# An Example
---

class: medium

# Reinforcement Learning: Notation

* `s` is a state
* `a` is an action
* `T(s,a)` is the *state transition function*: it tells us which state we reach when we use action `a` in state `s`
* `R(s)` is the (immediate) *reward* for a particular state
* `V(s)` is the (utility) value of a state

---

# Policies

* `\(\pi(s)\)` is a *policy*
* A policy tells the agent which action it should take in each state (see the sketch on the next slide)
* The goal of learning is to learn a "good" policy
* The *optimal policy* is usually written as `\(\pi^*(s)\)`

---

# A Policy
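As a sketch (reusing the kind of made-up dictionary environment from the earlier example), a deterministic policy is simply a mapping from each state to the action to take there:

```Python
# A hypothetical policy for a three-state environment: one action per state.
policy = {
    "start":  "right",
    "middle": "right",
    "goal":   "left",
}

def act(state):
    """Return the action the policy prescribes for the given state."""
    return policy[state]

print(act("start"))   # -> right
```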
---

# The Bellman Equation

* `V(s)` is the (utility) value of a state
* This utility value depends on the reward of the state, as well as all future rewards the agent will receive
* However, the future rewards depend on what the agent will do, i.e. what the policy says

$$ V^\pi(s) = R(s) + \gamma V^\pi(T(s, \pi(s))) $$

For the optimal policy:

$$ V(s) = R(s) + \max_a \gamma V(T(s, a)) $$

---

# (Partial) Value Function
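The fixed-policy form of the Bellman equation can be applied repeatedly to compute (part of) a value function. Below is a minimal sketch on a made-up deterministic environment; the states, rewards, policy, and `\(\gamma\)` are all illustrative:

```Python
# Hypothetical environment: transitions T, rewards R, and a fixed policy.
T = {
    "start":  {"left": "start",  "right": "middle"},
    "middle": {"left": "start",  "right": "goal"},
    "goal":   {"left": "middle", "right": "goal"},
}
R = {"start": 0.0, "middle": 0.0, "goal": 1.0}
policy = {"start": "right", "middle": "right", "goal": "right"}
gamma = 0.9

# Start from V(s) = 0 and repeatedly apply V(s) = R(s) + gamma * V(T(s, pi(s))).
V = {s: 0.0 for s in T}
for _ in range(100):
    V = {s: R[s] + gamma * V[T[s][policy[s]]] for s in T}

for s, v in V.items():
    print(s, round(v, 2))
```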
---

# Learning

* The problem is: we don't generally know `V`, because we don't know `T(s,a)` (or `R(s)`) a priori
* What can we do?
* In our learning runs (episodes), we record the rewards we get, and use them to find an estimate of `V`
* But we would also need to learn which state each action takes us to, in order to determine which action we should take

---

# The Q function

* Instead of learning `V` directly, we define a new function `Q(s,a)` that satisfies `\(V(s) = \max_a Q(s,a)\)`

$$ Q(s,a) = R(s) + \max_{a'} \gamma Q(T(s,a),a') $$

* Now we learn the value of `Q` for "each" pair of state and action
* The agent's policy is then `\(\pi(s) = \text{argmax}_a Q(s,a)\)`
* How do we learn `Q`?

---

# Q-Learning

* We store Q as a table, with one row per state and one column per action (see the sketch below)
* We then initialize the table "somehow"
* During each training episode, when we are in a state `s` and perform the action `a`, we update the table:
$$ Q(s,a) \leftarrow Q(s,a) + \alpha (R(s) + \gamma \max_{a'} Q(T(s,a), a') - Q(s,a)) $$
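As a sketch of what this looks like in code, with a Q-table stored as a dictionary keyed by (state, action) pairs; the states, actions, and parameter values below are made up:

```Python
# Hypothetical Q-table: one entry per (state, action) pair, initialized to zero.
states = ["start", "middle", "goal"]
actions = ["left", "right"]
Q = {(s, a): 0.0 for s in states for a in actions}

alpha = 0.5   # learning rate
gamma = 0.9   # discount factor

def q_update(s, a, reward, s_next):
    """One Q-Learning update: reward is R(s), s_next is T(s, a)."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])

# Example: we were in "middle", took "right", received R("middle") = 0,
# and ended up in "goal".
q_update("middle", "right", 0.0, "goal")
```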
---

# Q-Learning: Training

* How do we train this agent?
* We could just pick the action with the highest Q-value in our table
* But then the initial values of the table would guide the exploration
* Instead, we use an exploration policy
* This could be random, or `\(\varepsilon\)`-greedy (sketched on the next slide)

---

# Q-Table
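A sketch of an `\(\varepsilon\)`-greedy choice from such a table; the table contents and `\(\varepsilon\)` below are made up:

```Python
import random

actions = ["left", "right"]

# A hypothetical Q-table after some training episodes.
Q = {
    ("start", "left"):  0.1, ("start", "right"):  0.7,
    ("middle", "left"): 0.2, ("middle", "right"): 0.9,
    ("goal", "left"):   0.0, ("goal", "right"):   0.0,
}

def epsilon_greedy(state, epsilon=0.1):
    """Explore with probability epsilon, otherwise pick the best-known action."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

print(epsilon_greedy("start"))   # usually "right"
```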
---

class: small

# SARSA

* Q-Learning is very flexible, because it can use any policy to explore and construct the Q-table (off-policy learning)
* However, when we already have a somewhat reasonable policy, it may be faster to use the *actual actions* the agent takes to update the Q-values (on-policy learning)
* This approach is called SARSA: state-action-reward-state-action

SARSA:
$$ Q(s,a) \leftarrow Q(s,a) + \alpha (R(s) + \gamma Q(T(s,a), a') - Q(s,a)) $$

where `\(a'\)` is the action the agent actually takes in the next state.
Compare with Q-Learning:
$$ Q(s,a) \leftarrow Q(s,a) + \alpha (R(s) + \gamma \max_{a'} Q(T(s,a), a') - Q(s,a)) $$
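Written as code, the two updates differ only in which next action enters the target: SARSA uses the action the agent actually takes next, Q-Learning uses the best action according to the current table. A sketch, using the same kind of dictionary Q-table as in the earlier examples:

```Python
def sarsa_update(Q, s, a, reward, s_next, a_next, alpha=0.5, gamma=0.9):
    """On-policy: a_next is the action the agent really takes in s_next."""
    Q[(s, a)] += alpha * (reward + gamma * Q[(s_next, a_next)] - Q[(s, a)])

def q_learning_update(Q, s, a, reward, s_next, actions, alpha=0.5, gamma=0.9):
    """Off-policy: use the best action available in s_next, whatever is taken next."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])
```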
---

class: small

# Policy Search

* Finally, instead of learning the values of `Q`, we could just directly learn a good policy
* For example, start with an initial policy and tweak it until a good result is obtained
* The idea is to use a parameterized representation of `\(\pi\)` that has fewer parameters than there are states
* The problem is that a policy is usually a discontinuous function (if we change one parameter a little bit, we get a completely different action), so we can't use the gradient to find optima
* Solution: use a stochastic policy, which has probabilities for each action to be selected, and change these probabilities continuously

---

class: mmedium

# Markov Decision Processes

* So far, we have assumed actions will take us from one state to another deterministically
* However, in many environments, transitions are non-deterministic
* Fortunately, we don't have to change much: instead of a transition function `T(s,a)` we have transition probabilities `\(P(s' | s, a)\)`
* Wherever we used `T(s,a)`, we now use the expected value over all possible successor states
* Note that with Q-Learning we did not have to learn `T(s,a)`, and we also do **not** have to learn `\(P(s' | s, a)\)` (model-free algorithm)

---

class: center, middle

# Some Applications

---

class: medium

# Applications

* Games are a natural fit for Reinforcement Learning
* Arguably the first ever AI agent was a "Reinforcement Learning" agent for checkers, in 1959
* The Backgammon agent TD-Gammon was another early success, able to compete with top players in 1992
* Another popular application is robot control, such as for an inverted pendulum, autonomous cars, helicopters, etc.

---

# Games

* Games, particularly classic arcade/Atari/NES games, have become a standard testing environment for Reinforcement Learning
* Often, the state is given to the agent in terms of the raw pixels of the screen
* Reinforcement Learning agents have become *very* good at playing these games
* Sometimes they find exploits!

---

# Exploiting Coast Runners
---

# Inverted Pendulum
*(Diagram: a cart of mass M on the x axis, pushed by a force F, balancing a pole of mass m and length l at an angle θ.)*
---

# Stanford Autonomous Helicopter
---

# Circuit Design
---

# Lab 6

```Python
import gym

env = gym.make("CartPole-v1")
observation = env.reset()
action = env.action_space.sample()      # start with a random action
i = 0
for _ in range(100000):
    env.render()
    observation, reward, done, info = env.step(action)
    action = env.action_space.sample()  # keep acting randomly
    i += 1
    if done:                            # the pole fell (or time ran out)
        print(i)                        # episode length
        i = 0
        observation = env.reset()
env.close()
```

---

# Lab 6
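For the lab you will replace the random action choice above with choices from a Q-table. One piece you will need is a way to turn the continuous observation (described on the next slide) into a discrete state. Here is a minimal sketch of one possible discretization; the bin counts, ranges, and choice of variables are arbitrary, not a prescribed solution:

```Python
import numpy as np

# Arbitrary example: bin only the pole angle and the pole tip velocity,
# ignoring the cart position and cart velocity entirely.
ANGLE_BINS = np.linspace(-0.42, 0.42, 7)     # ~±24 degrees, in radians
VELOCITY_BINS = np.linspace(-2.0, 2.0, 5)

def discretize(observation):
    """Map a continuous CartPole observation to a small discrete state tuple."""
    cart_pos, cart_vel, angle, tip_velocity = observation
    return (int(np.digitize(angle, ANGLE_BINS)),
            int(np.digitize(tip_velocity, VELOCITY_BINS)))
```

The resulting tuple can then be used as the state key in your Q-table.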
---

# Step

```Python
observation, reward, done, info = env.step(action)
```

Observations:

| Index | Name                 | Min     | Max    |
|-------|----------------------|---------|--------|
| 0     | Cart Position        | -4.8    | 4.8    |
| 1     | Cart Velocity        | -Inf    | Inf    |
| 2     | Pole Angle           | -24 deg | 24 deg |
| 3     | Pole Velocity At Tip | -Inf    | Inf    |

Reward: 1 for every time step in which the pole has not "fallen"

Actions:

* 0: push left
* 1: push right

---

# Objective

"Considered solved when the average reward is greater than or equal to 195.0 over 100 consecutive trials."

For the lab, an average score/time of over 40-50 is already good (after several hundred training episodes).

You will get skeleton Python code that outputs the episode length, and the average for every 100 trials.

---

class: medium

# Lab Implementation

* Set up a Q table
* Discretize the states (you don't have to use all 4 elements)
* Select an action using the current Q table and random choice (epsilon-greedy)
* Perform the action, obtain the reward
* Update the Q table

---

class: medium

# Some Pointers to Start

* If you use a low learning rate (0.1 or less), you will need more iterations, but your Q table should be more accurate
* You should start with a high value for `\(\varepsilon\)` (close to 1), and decrease it over time
* `\(\gamma\)` can just be 1
* The angle of the pole is probably important; the other variables may or may not be
* Inspect your Q table manually from time to time

---

# References

* Chapter 21 of *AI: A Modern Approach*, by Russell and Norvig
* [A Painless Q-Learning Tutorial](http://mnemstudio.org/path-finding-q-learning-tutorial.htm)
* [Faulty Reward Functions in the Wild](https://openai.com/blog/faulty-reward-functions/)
* [Exploiting Q*Bert](https://www.theverge.com/tldr/2018/2/28/17062338/ai-agent-atari-q-bert-cracked-bug-cheat)