Lecture 17: Deep Q Learning

# Artificial Intelligence

## Deep Q Learning

---

# Reinforcement Learning

---


# The Q function

Utility of a state:
  
$$
V^\pi(s) = R(s) + \gamma V^\pi(T(s, \pi(s)))
$$

Q function:

$$
Q(s,a) = R(s) + \gamma \max_{a'} Q(T(s,a),a')
$$
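
To make the recursion concrete, here is a minimal sketch that applies the Q equation repeatedly on a made-up three-state chain until the values settle. The states, `R`, `T` and the discount `GAMMA = 0.9` are illustrative assumptions, not part of the lecture.

```python
STATES = [0, 1, 2]      # hypothetical chain of three states
ACTIONS = [0, 1]        # 0 = left, 1 = right
GAMMA = 0.9             # assumed discount factor

def R(s):
    """Reward of a state: only the rightmost state pays off."""
    return 1.0 if s == 2 else 0.0

def T(s, a):
    """Deterministic transition; state 2 is absorbing."""
    if s == 2:
        return 2
    return max(0, s - 1) if a == 0 else s + 1

def solve_q(iterations=500):
    Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(iterations):
        # Apply Q(s,a) = R(s) + gamma * max_a' Q(T(s,a), a') to every entry.
        Q = {(s, a): R(s) + GAMMA * max(Q[(T(s, a), a2)] for a2 in ACTIONS)
             for (s, a) in Q}
    return Q

Q = solve_q()
for s in STATES:
    print(s, [round(Q[(s, a)], 3) for a in ACTIONS])
```

Because state 2 is absorbing and pays 1 on every step, its value approaches 1 / (1 - 0.9) = 10, and the values of the other states are discounted copies of it.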

---

# Q Learning

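* Q Learning is based on trials: the agent acts and observes the outcome

* We store a Q table and continuously update it

* The rewards our agent receives tell us what the values in the Q table "should be"

* Drawback: the Q table needs one entry per state-action pair, so it quickly becomes huge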
 
---

# Deep Q Learning

* `Q(s,a)` is a function

* We have heard of a way to approximate functions: Neural Networks!

* Instead of storing all possible Q-values in a table, we train a neural network

* How?
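
As a rough picture of what replaces the table, the sketch below builds a tiny two-layer network that maps a state vector to one Q-value per action. The layer sizes, the `tanh` activation and the helper names `make_network` and `q_network` are assumptions chosen for illustration, and the weights are random (untrained).

```python
import math
import random

def make_network(n_inputs, n_hidden, n_actions, seed=0):
    """Random weights for a tiny two-layer network (sizes are made up)."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
          for _ in range(n_hidden)]
    w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
          for _ in range(n_actions)]
    return w1, w2

def q_network(network, state):
    """Forward pass: state vector in, one Q-value per action out."""
    w1, w2 = network
    hidden = [math.tanh(sum(w * x for w, x in zip(row, state))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

network = make_network(n_inputs=4, n_hidden=8, n_actions=2)
state = [0.1, -0.2, 0.05, 0.0]          # a hypothetical 4-number observation
q = q_network(network, state)
best_action = max(range(len(q)), key=lambda a: q[a])
```

This network has 4 * 8 + 8 * 2 = 48 weights no matter how many distinct states exist, which is the point of replacing the table.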
   
---

# Deep Q Learning

---

# Q Learning

Repeat:

* Observe state

* Choose action according to Q table

* Perform action, get reward and new state

* Update Q table
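
The loop above can be sketched on a toy problem as follows. The corridor environment, the epsilon-greedy action choice and the values of `ALPHA`, `GAMMA` and `EPSILON` are assumptions made for this example; here the reward is given on entering the goal cell.

```python
import random

N_STATES, ACTIONS = 5, [0, 1]            # hypothetical corridor; 0 = left, 1 = right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1    # assumed hyperparameters

def step(state, action):
    """Toy environment: returns (next_state, reward, done)."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def choose(Q, state, rng):
    """Epsilon-greedy choice from the Q table (ties broken at random)."""
    if rng.random() < EPSILON:
        return rng.choice(ACTIONS)
    best = max(Q[state])
    return rng.choice([a for a in ACTIONS if Q[state][a] == best])

def train(episodes=200, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False                              # observe state
        while not done:
            action = choose(Q, state, rng)                  # choose action
            next_state, reward, done = step(state, action)  # perform action
            # Update the table: move the entry toward the observed target.
            target = reward + (0.0 if done else GAMMA * max(Q[next_state]))
            Q[state][action] += ALPHA * (target - Q[state][action])
            state = next_state
    return Q

Q = train()
```

After training, moving right has the higher Q-value in every non-goal cell, so the greedy policy walks straight to the goal.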
  
---

# Deep Q Learning

Repeat:

* Observe state

* Choose action according to Q .red[network]

* Perform action, get reward and new state

* Update Q .red[network]
  
--

How do we update? Gradient descent!
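
A minimal sketch of one such update, assuming a squared-error loss between the target `reward + gamma * max Q(s', a')` and the current estimate `Q(s, a)`. To keep it short, a linear model with one weight vector per action stands in for the neural network; the feature vectors, `gamma` and `lr` are made-up values.

```python
def q_values(weights, features):
    """Linear stand-in for a Q network: one weight vector per action."""
    return [sum(w * x for w, x in zip(w_a, features)) for w_a in weights]

def gradient_step(weights, features, action, reward, next_features, done,
                  gamma=0.9, lr=0.1):
    """One gradient-descent step on 0.5 * (target - Q(s, a))**2."""
    target = reward
    if not done:
        target += gamma * max(q_values(weights, next_features))
    error = target - q_values(weights, features)[action]
    # For a linear model dQ/dw equals the features, so descending the loss
    # means adding lr * error * features (the target is treated as constant).
    weights[action] = [w + lr * error * x
                       for w, x in zip(weights[action], features)]
    return error

weights = [[0.0, 0.0], [0.0, 0.0]]     # 2 actions, 2 features (hypothetical sizes)
s, s_next = [1.0, 0.0], [0.0, 1.0]     # made-up feature vectors
errors = [abs(gradient_step(weights, s, 1, 1.0, s_next, done=True))
          for _ in range(50)]
```

Repeating the step on the same transition shrinks the error by a factor of `1 - lr` each time, and only the weights of the action that was taken change.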

---

# Deep Q Learning

---

# Deep Q Learning: Challenge

---

# Deep Q Learning: Two Networks

---

# Deep Q Learning: Process

Use two networks: Target and prediction

* Run the agent for `n` iterations, using the prediction network and an epsilon-greedy strategy to choose actions, and collect all states, actions, next states and rewards `(s,a,s',r)`

* After collecting "some" data (like a "minibatch"), use the difference between the **target** (reward plus discounted best Q-value of the next state, computed with the target network) and the **prediction** (Q-value from the prediction network) as the loss, and calculate its gradient to **update the prediction network**

* After `C` iterations, copy the prediction network and use it as the new target network
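
The process above can be sketched end to end on a toy corridor task. This is an illustration under several assumptions: a linear model over one-hot state features stands in for each network, minibatches are sampled uniformly from all stored transitions, and the task and all constants (`GAMMA`, `LR`, `EPSILON`, `BATCH`, `C`) are made up.

```python
import copy
import random

N_STATES, N_ACTIONS = 5, 2             # hypothetical corridor; 0 = left, 1 = right
GAMMA, LR, EPSILON = 0.9, 0.1, 0.2     # assumed hyperparameters
BATCH, C = 8, 20                       # minibatch size, target copy period

def features(state):
    return [1.0 if i == state else 0.0 for i in range(N_STATES)]

def q_values(net, state):
    x = features(state)
    return [sum(w * xi for w, xi in zip(row, x)) for row in net]

def env_step(state, action):
    nxt = max(0, state - 1) if action == 0 else state + 1
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done

def train(steps=3000, seed=0):
    rng = random.Random(seed)
    prediction = [[0.0] * N_STATES for _ in range(N_ACTIONS)]
    target = copy.deepcopy(prediction)
    buffer, state = [], 0
    for t in range(1, steps + 1):
        # Act epsilon-greedily with the prediction network, store (s, a, s', r).
        if rng.random() < EPSILON:
            action = rng.randrange(N_ACTIONS)
        else:
            q = q_values(prediction, state)
            action = rng.choice([a for a in range(N_ACTIONS) if q[a] == max(q)])
        nxt, reward, done = env_step(state, action)
        buffer.append((state, action, nxt, reward, done))
        state = 0 if done else nxt
        # Update the prediction network on a random minibatch.
        if len(buffer) >= BATCH:
            for s, a, s2, r, d in rng.sample(buffer, BATCH):
                y = r + (0.0 if d else GAMMA * max(q_values(target, s2)))
                error = y - q_values(prediction, s)[a]
                for i, xi in enumerate(features(s)):
                    prediction[a][i] += LR * error * xi
        # Every C iterations the prediction network becomes the new target.
        if t % C == 0:
            target = copy.deepcopy(prediction)
    return prediction

net = train()
greedy = [max(range(N_ACTIONS), key=lambda a: q_values(net, s)[a])
          for s in range(N_STATES - 1)]
```

Keeping the target network fixed between copies means the values the prediction network is trained toward do not shift with every single update.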
 
---

# Applications

---

# Applications

* Reinforcement Learning/Q Learning often performs very well on problems where we can define a reward

* Classical Q Learning has to store a value for every state-action pair, which limits it to small state spaces

* Deep Q Learning can be applied to larger problems, such as self-driving cars, video games, etc.
  
---

# Applications

---

# OpenAI Gym

---

# OpenAI Gym

* OpenAI is a research center with a focus on AI and Machine Learning

* Their stated goal is to develop AI for "the benefit of all humans"

* They provide many resources for free

* One of their offerings is the "Gym"

---

# OpenAI Gym

- The Gym is a Python API, together with implementations of many different Reinforcement Learning problems

- Each task, or *environment*, provides observations, rewards, and actions

- The environments typically also have a visualization so that researchers can see what is happening

- There are also many third-party environments that use the same API
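
To illustrate what "the same API" means, here is a sketch of a toy environment that mimics the classic Gym interface: `reset()` returns an observation, and `step(action)` returns observation, reward, done flag and an info dictionary. `CorridorEnv` is a hypothetical class written for this example; it does not import or subclass anything from the `gym` package.

```python
import random

class CorridorEnv:
    """Hypothetical toy environment with a Gym-like reset()/step() interface."""

    def __init__(self, length=5):
        self.length = length
        self.actions = [0, 1]      # 0 = left, 1 = right
        self.position = 0

    def reset(self):
        """Start a new episode and return the first observation."""
        self.position = 0
        return self.position

    def step(self, action):
        """Apply an action; return (observation, reward, done, info)."""
        move = -1 if action == 0 else 1
        self.position = min(self.length - 1, max(0, self.position + move))
        done = self.position == self.length - 1
        reward = 1.0 if done else 0.0
        return self.position, reward, done, {}

env = CorridorEnv()
observation = env.reset()
rng = random.Random(0)
total_reward, done = 0.0, False
while not done:
    # A random agent; a learning agent would use observation and reward here.
    observation, reward, done, info = env.step(rng.choice(env.actions))
    total_reward += reward
```

Any agent written against `reset()` and `step()` can then be pointed at a different environment without changing its code.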

---

# API

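```Python
import gym

def choose(observation, action_space):
    # Placeholder policy: pick a random action (replace with your agent)
    return action_space.sample()

# Note: this uses the classic Gym API (gym < 0.26), where reset() returns
# only the observation and step() returns four values
env = gym.make('CartPole-v0')
observation = env.reset()

for _ in range(1000):
    env.render()
    # Select an action
    action = choose(observation, env.action_space)

    # Perform action and get reward
    observation, reward, done, info = env.step(action)

    # Use reward to update policy here

    # If the episode is over: reset
    if done:
        observation = env.reset()

env.close()
```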

---

# Environment: Inverted Pendulum

---

# Environment: Ant

---

# Environment: Breakout

---

# Application: Minecraft

---

# Malmö

"Project Malmö, named after a town between Cambridge, UK where it was developed and Stockholm, Sweden where Minecraft was created, is an Artificial Intelligence (AI) platform that allows researchers to 
create challenging and interesting tasks for evaluating agents, and Minecraft enthusiasts to engage in the Modding community and help advance AI."

"MalmoEnv is an OpenAI "gym" Python Environment for Malmo/Minecraft"

---

# Malmö Task: Get Diamond

A potential task is to actually play Minecraft (without enemies) and obtain a diamond

* First, gather wood to craft a wooden pickaxe

* Then use the wooden pickaxe to mine stone and craft a stone pickaxe

* Find and mine iron

* Build a furnace to smelt the iron and craft an iron pickaxe

* Find and mine diamond
  
---

# Malmö Task: Navigation

A simpler task: Get to the gold.

---

# Minecraft Environment

* If you want to play around with Minecraft, we made a bundle for CI-2600 during the third term (III ciclo) of 2019

* There is a zip file with Minecraft, a batch file to run everything, and a Python file with the skeleton code

* This is completely optional!

---

# References

* [A Hands-On Introduction to Deep Q-Learning using OpenAI Gym in Python](https://www.analyticsvidhya.com/blog/2019/04/introduction-deep-q-learning-python/)

* [A Comprehensive Guide to Convolutional Neural Networks](https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53)

* [Deep Reinforcement Learning for General Video Game AI](https://arxiv.org/pdf/1806.02448.pdf)

* [OpenAI Gym](https://gym.openai.com/)

* [Project Malmo](https://www.microsoft.com/en-us/research/project/project-malmo/)

* [Lab 8 from the Machine Learning Course](https://yawgmoth.github.io/CI-2600/labs/lab8.html)