class: center, middle

# Machine Learning

## Reinforcement Learning

### III-Verano 2019

---

class: center, middle

# Reinforcement Learning

---

class: medium

# The Problem

* We want an agent that "figures out" how to "perform well"
* Given an environment in which the agent performs actions, we tell the agent which reward it receives
* We generally assume that our states are discrete, and that actions transition between them
* The goal of the agent is to learn which actions are "good" to perform in which state
* Rewards may be negative, representing "punishment"

---

# The Problem
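To make the setup from the previous slide concrete, here is a minimal sketch of one way such an environment could be written down; the states, actions, and rewards below are made up purely for illustration:

```Python
# A tiny, hypothetical environment with discrete states and actions.
# T[state][action] is the state an action leads to, R[state] is the
# immediate reward for being in a state (names and numbers made up).
T = {
    "start":  {"left": "start",  "right": "middle"},
    "middle": {"left": "start",  "right": "goal"},
    "goal":   {"left": "middle", "right": "goal"},
}
R = {"start": 0.0, "middle": 0.0, "goal": 1.0}

state = "start"
for action in ["right", "right"]:
    state = T[state][action]          # follow the transition
    print(action, "->", state, "reward:", R[state])
```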
---

# Reinforcement Learning

* The idea is that the agent performs several trial runs in the environment to determine a good policy
* Compare to how a human player might learn how to play a game: try some actions, observe the result
* Games are a natural fit for this kind of learning, but there are other applications as well!

---

# An Example
---

# An Example
---

class: medium

# Reinforcement Learning: Notation

* `s` is a state
* `a` is an action
* `T(s,a)` is the *state transition function*: it tells us which state we reach when we use action `a` in state `s`
* `R(s)` is the (immediate) *reward* for a particular state
* `V(s)` is the (utility) value of a state

---

# Policies

* `\(\pi(s)\)` is a *policy*
* A policy tells the agent which action it should take in each state (see the sketch on the next slide)
* The goal of learning is to learn a "good" policy
* The *optimal policy* is usually written as `\(\pi^*(s)\)`

---

# A Policy
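As a sketch (reusing the kind of made-up dictionary environment from the earlier example), a deterministic policy is simply a mapping from each state to the action to take there:

```Python
# A hypothetical policy for a three-state environment: one action per state.
policy = {
    "start":  "right",
    "middle": "right",
    "goal":   "left",
}

def act(state):
    """Return the action the policy prescribes for the given state."""
    return policy[state]

print(act("start"))   # -> right
```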
---

# The Bellman Equation

* `V(s)` is the (utility) value of a state
* This utility value depends on the reward of the state, as well as all future rewards the agent will receive
* However, the future rewards depend on what the agent will do, i.e. what the policy says

$$ V^\pi(s) = R(s) + \gamma V^\pi(T(s, \pi(s))) $$

For the optimal policy:

$$ V(s) = R(s) + \max_a \gamma V(T(s, a)) $$

---

# (Partial) Value Function
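The fixed-policy form of the Bellman equation can be applied repeatedly to compute (part of) a value function. Below is a minimal sketch on a made-up deterministic environment; the states, rewards, policy, and `\(\gamma\)` are all illustrative:

```Python
# Hypothetical environment: transitions T, rewards R, and a fixed policy.
T = {
    "start":  {"left": "start",  "right": "middle"},
    "middle": {"left": "start",  "right": "goal"},
    "goal":   {"left": "middle", "right": "goal"},
}
R = {"start": 0.0, "middle": 0.0, "goal": 1.0}
policy = {"start": "right", "middle": "right", "goal": "right"}
gamma = 0.9

# Start from V(s) = 0 and repeatedly apply V(s) = R(s) + gamma * V(T(s, pi(s))).
V = {s: 0.0 for s in T}
for _ in range(100):
    V = {s: R[s] + gamma * V[T[s][policy[s]]] for s in T}

for s, v in V.items():
    print(s, round(v, 2))
```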
---

# Learning

* The problem is: we don't generally know `V`, because we don't know `T(s,a)` (or `R(s)`) a priori
* What can we do?
* In our learning runs (episodes), we record the rewards we get, and use them to find an estimate of `V`
* But we would also need to learn which state each action takes us to, in order to determine which action we should take

---

# The Q function

* Instead of learning `V` directly, we define a new function `Q(s,a)` that satisfies `\(V(s) = \max_a Q(s,a)\)`

$$ Q(s,a) = R(s) + \max_{a'} \gamma Q(T(s,a),a') $$

* Now we learn the value of `Q` for "each" pair of state and action
* The agent's policy is then `\(\pi(s) = \text{argmax}_a Q(s,a)\)`
* How do we learn `Q`?

---

# Q-Learning

* We store Q as a table, with one row per state and one column per action (see the sketch below)
* We then initialize the table "somehow"
* During each training episode, when we are in a state `s` and perform the action `a`, we update the table:
$$ Q(s,a) \leftarrow Q(s,a) + \alpha (R(s) + \gamma \max_{a'} Q(T(s,a), a') - Q(s,a)) $$
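As a sketch of what this looks like in code, with a Q-table stored as a dictionary keyed by (state, action) pairs; the states, actions, and parameter values below are made up:

```Python
# Hypothetical Q-table: one entry per (state, action) pair, initialized to zero.
states = ["start", "middle", "goal"]
actions = ["left", "right"]
Q = {(s, a): 0.0 for s in states for a in actions}

alpha = 0.5   # learning rate
gamma = 0.9   # discount factor

def q_update(s, a, reward, s_next):
    """One Q-Learning update: reward is R(s), s_next is T(s, a)."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])

# Example: we were in "middle", took "right", received R("middle") = 0,
# and ended up in "goal".
q_update("middle", "right", 0.0, "goal")
```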
---

# Q-Learning: Training

* How do we train this agent?
* We could just pick the action with the highest Q-value in our table
* But then the initial values of the table would guide the exploration
* Instead, we use an exploration policy
* This could be random, or `\(\varepsilon\)`-greedy (sketched on the next slide)

---

# Q-Table
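A sketch of an `\(\varepsilon\)`-greedy choice from such a table; the table contents and `\(\varepsilon\)` below are made up:

```Python
import random

actions = ["left", "right"]

# A hypothetical Q-table after some training episodes.
Q = {
    ("start", "left"):  0.1, ("start", "right"):  0.7,
    ("middle", "left"): 0.2, ("middle", "right"): 0.9,
    ("goal", "left"):   0.0, ("goal", "right"):   0.0,
}

def epsilon_greedy(state, epsilon=0.1):
    """Explore with probability epsilon, otherwise pick the best-known action."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

print(epsilon_greedy("start"))   # usually "right"
```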
---

class: small

# SARSA

* Q-Learning is very flexible, because it can use any policy to explore and construct the Q-table (off-policy learning)
* However, when we already have a somewhat reasonable policy, it may be faster to use the *actual actions* the agent takes to update the Q-values (on-policy learning)
* This approach is called SARSA: state-action-reward-state-action

SARSA:
$$ Q(s,a) \leftarrow Q(s,a) + \alpha (R(s) + \gamma Q(T(s,a), a') - Q(s,a)) $$

where `\(a'\)` is the action the agent actually takes in the next state.
Compare with Q-Learning:
$$ Q(s,a) \leftarrow Q(s,a) + \alpha (R(s) + \gamma \max_{a'} Q(T(s,a), a') - Q(s,a)) $$
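Written as code, the two updates differ only in which next action enters the target: SARSA uses the action the agent actually takes next, Q-Learning uses the best action according to the current table. A sketch, using the same kind of dictionary Q-table as in the earlier examples:

```Python
def sarsa_update(Q, s, a, reward, s_next, a_next, alpha=0.5, gamma=0.9):
    """On-policy: a_next is the action the agent really takes in s_next."""
    Q[(s, a)] += alpha * (reward + gamma * Q[(s_next, a_next)] - Q[(s, a)])

def q_learning_update(Q, s, a, reward, s_next, actions, alpha=0.5, gamma=0.9):
    """Off-policy: use the best action available in s_next, whatever is taken next."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])
```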
---

class: small

# Policy Search

* Finally, instead of learning the values of `Q`, we could just directly learn a good policy
* For example, start with an initial policy and tweak it until a good result is obtained
* The idea is to use a parameterized representation of `\(\pi\)` that has fewer parameters than there are states
* The problem is that a policy is usually a discontinuous function (if we change one parameter a little bit, we get a completely different action), so we can't use the gradient to find optima
* Solution: use a stochastic policy, which has probabilities for each action to be selected, and change these probabilities continuously

---

class: mmedium

# Markov Decision Processes

* So far, we have assumed actions will take us from one state to another deterministically
* However, in many environments, transitions are non-deterministic
* Fortunately, we don't have to change much: instead of a transition function `T(s,a)` we have transition probabilities `\(P(s' | s, a)\)`
* Wherever we used `T(s,a)`, we now use the expected value over all possible successor states
* Note that with Q-Learning we did not have to learn `T(s,a)`, and we also do **not** have to learn `\(P(s' | s, a)\)` (model-free algorithm)

---

class: center, middle

# Some Applications

---

class: medium

# Applications

* Games are a natural fit for Reinforcement Learning
* Arguably the first ever AI agent was a "Reinforcement Learning" agent for checkers, in 1959
* The Backgammon agent TD-Gammon was another early success, able to compete with top players in 1992
* Another popular application is robot control, such as for an inverted pendulum, autonomous cars, helicopters, etc.

---

# Games

* Games, particularly classic arcade/Atari/NES games, have become a standard testing environment for Reinforcement Learning
* Often, the state is given to the agent in terms of the raw pixels of the screen
* Reinforcement Learning agents have become *very* good at playing these games
* Sometimes they find exploits!

---

# Exploiting Coast Runners
---

# Inverted Pendulum
*(Diagram: a cart of mass M on the x axis, pushed by a force F, balancing a pole of mass m and length l at an angle θ.)*
---

# Stanford Autonomous Helicopter
---

# Circuit Design
---

# Lab 6

```Python
import gym

env = gym.make("CartPole-v1")
observation = env.reset()
action = env.action_space.sample()      # start with a random action
i = 0
for _ in range(100000):
    env.render()
    observation, reward, done, info = env.step(action)
    action = env.action_space.sample()  # keep acting randomly
    i += 1
    if done:                            # the pole fell (or time ran out)
        print(i)                        # episode length
        i = 0
        observation = env.reset()
env.close()
```

---

# Lab 6
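For the lab you will replace the random action choice above with choices from a Q-table. One piece you will need is a way to turn the continuous observation (described on the next slide) into a discrete state. Here is a minimal sketch of one possible discretization; the bin counts, ranges, and choice of variables are arbitrary, not a prescribed solution:

```Python
import numpy as np

# Arbitrary example: bin only the pole angle and the pole tip velocity,
# ignoring the cart position and cart velocity entirely.
ANGLE_BINS = np.linspace(-0.42, 0.42, 7)     # ~±24 degrees, in radians
VELOCITY_BINS = np.linspace(-2.0, 2.0, 5)

def discretize(observation):
    """Map a continuous CartPole observation to a small discrete state tuple."""
    cart_pos, cart_vel, angle, tip_velocity = observation
    return (int(np.digitize(angle, ANGLE_BINS)),
            int(np.digitize(tip_velocity, VELOCITY_BINS)))
```

The resulting tuple can then be used as the state key in your Q-table.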
---

# Step

```Python
observation, reward, done, info = env.step(action)
```

Observations:

| Index | Name                 | Min     | Max    |
|-------|----------------------|---------|--------|
| 0     | Cart Position        | -4.8    | 4.8    |
| 1     | Cart Velocity        | -Inf    | Inf    |
| 2     | Pole Angle           | -24 deg | 24 deg |
| 3     | Pole Velocity At Tip | -Inf    | Inf    |

Reward: 1 for every time step in which the pole has not "fallen"

Actions:

* 0: push left
* 1: push right

---

# Objective

"Considered solved when the average reward is greater than or equal to 195.0 over 100 consecutive trials."

For the lab, an average score/time of over 40-50 is already good (after several hundred training episodes).

You will get skeleton Python code that outputs the episode length, and the average for every 100 trials.

---

class: medium

# Lab Implementation

* Set up a Q table
* Discretize the states (you don't have to use all 4 elements)
* Select an action using the current Q table and random choice (epsilon-greedy)
* Perform the action, obtain the reward
* Update the Q table

---

class: medium

# Some Pointers to Start

* If you use a low learning rate (0.1 or less), you will need more iterations, but your Q table should be more accurate
* You should start with a high value for `\(\varepsilon\)` (close to 1), and decrease it over time
* `\(\gamma\)` can just be 1
* The angle of the pole is probably important; the other variables may or may not be
* Inspect your Q table manually from time to time

---

# References

* Chapter 21 of *AI: A Modern Approach*, by Russell and Norvig
* [A Painless Q-Learning Tutorial](http://mnemstudio.org/path-finding-q-learning-tutorial.htm)
* [Faulty Reward Functions in the Wild](https://openai.com/blog/faulty-reward-functions/)
* [Exploiting Q*Bert](https://www.theverge.com/tldr/2018/2/28/17062338/ai-agent-atari-q-bert-cracked-bug-cheat)