Notes: Playing Atari with Deep Reinforcement Learning

Paper notes and implementation details.

Code available here

Introduction

Challenge: Learning to control agents directly from high-dimensional sensory inputs like vision and speech.

Recent advances in deep learning have made it possible to extract high-level features from raw sensory data, leading to breakthroughs in computer vision and speech recognition.

Special RL challenges from a deep learning perspective:

  1. Most successful DL applications have required large amounts of hand-labelled training data. RL algorithms, in contrast, must learn from a scalar reward signal that is frequently sparse, noisy and delayed; the delay between actions and the resulting rewards can be thousands of timesteps long;
  2. Most DL algorithms assume i.i.d. data, while in RL one typically encounters sequences of highly correlated states;
  3. In RL the data distribution changes as the algorithm learns new behaviours, while DL methods assume a fixed underlying distribution → non-stationary distributions

Contributions:

  1. Use a variant of the Q-learning algorithm to address challenge 1.
  2. Use an Experience Replay mechanism to address challenges 2 and 3.

Deep Reinforcement Learning

Experience replay: store the agent's experiences at each time-step, $e_{t}=\left(s_{t}, a_{t}, r_{t}, s_{t+1}\right)$
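
A minimal sketch of such a replay memory (not the authors' code): a fixed-capacity buffer that overwrites the oldest transitions and supports uniform minibatch sampling. The `done` flag is an extra convenience for building targets later and is an assumption of this sketch.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-capacity buffer of transitions e_t = (s_t, a_t, r_t, s_{t+1}, done)."""

    def __init__(self, capacity):
        # deque with maxlen=N silently discards the oldest transition once full
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # uniform random minibatch, as in Algorithm 1
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```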

$$\begin{array}{l} \hline \textbf{Algorithm 1} \text{ Deep Q-learning with Experience Replay} \\\hline \quad \text{Initialize replay memory } \mathcal{D} \text{ to capacity } N \\ \quad \text{Initialize action-value function } Q \text{ with random weights} \\ \quad \textbf{for} \text{ episode} =1, M \textbf{ do} \\\quad \qquad \text{Initialise sequence } s_{1}=\left\{x_{1}\right\} \text{ and preprocessed sequence } \phi_{1}=\phi\left(s_{1}\right) \\\quad \qquad \textbf{for } t=1, T \textbf{ do} \\ \quad\qquad \qquad \text{With probability } \epsilon \text{ select a random action } a_{t} \\\quad\qquad \qquad\text{otherwise select } a_{t}=\max _{a} Q^{*}\left(\phi\left(s_{t}\right), a ; \theta\right) \\\quad\qquad \qquad\text{Execute action } a_{t} \text{ in emulator and observe reward } r_{t} \text{ and image } x_{t+1} \\\quad\qquad \qquad\text{Set } s_{t+1}=s_{t}, a_{t}, x_{t+1} \text{ and preprocess } \phi_{t+1}=\phi\left(s_{t+1}\right) \\\quad\qquad \qquad\text{Store transition }\left(\phi_{t}, a_{t}, r_{t}, \phi_{t+1}\right) \text{ in } \mathcal{D} \\\quad\qquad \qquad\text{Sample random minibatch of transitions }\left(\phi_{j}, a_{j}, r_{j}, \phi_{j+1}\right) \text{ from } \mathcal{D} \\\quad\qquad \qquad\text{Set } y_{j}=\begin{cases} r_{j} & \text{for terminal } \phi_{j+1} \\ r_{j}+\gamma \max _{a^{\prime}} Q\left(\phi_{j+1}, a^{\prime} ; \theta\right) & \text{for non-terminal } \phi_{j+1} \end{cases} \\\quad\qquad \qquad\text{Perform a gradient descent step on }\left(y_{j}-Q\left(\phi_{j}, a_{j} ; \theta\right)\right)^{2} \text{ according to equation 3} \\\quad \qquad \textbf{end for} \\\quad \textbf{end for} \\ \hline \end{array}$$
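
The inner loop's target construction and gradient step could be sketched as follows in PyTorch; `q_net`, `optimizer`, and the batch-as-tensors layout are assumptions of this sketch rather than the paper's exact setup.

```python
import torch
import torch.nn.functional as F


def dqn_update(q_net, optimizer, batch, gamma=0.99):
    """One gradient step of Algorithm 1 on a sampled minibatch of transitions."""
    # states/next_states: float tensors, actions: int64 tensor,
    # rewards/dones: float tensors (dones is 1.0 for terminal phi_{j+1})
    states, actions, rewards, next_states, dones = batch

    # Q(phi_j, a_j; theta) for the actions actually taken
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # y_j = r_j for terminal phi_{j+1}, else r_j + gamma * max_a' Q(phi_{j+1}, a'; theta)
    with torch.no_grad():
        next_q = q_net(next_states).max(dim=1).values
        targets = rewards + gamma * next_q * (1.0 - dones)

    # gradient descent step on (y_j - Q(phi_j, a_j; theta))^2
    loss = F.mse_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```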

Advantages over standard online Q-learning:

  1. Greater data efficiency: each step of experience is potentially used in many weight updates;
  2. Reduced variance of gradient updates: randomising the samples breaks the correlations between consecutive samples;
  3. Avoiding oscillations or divergence in the parameters: with experience replay the behaviour distribution is averaged over many of its previous states, smoothing out learning.

Deficiencies:

  1. The memory buffer does not differentiate important transitions: it only stores the last $N$ experience tuples, always overwriting the oldest ones with recent transitions due to the finite memory size;
  2. Uniform sampling gives equal importance to all transitions in the replay memory (but actually they are not equally important).

Possible improvement:

  1. Prioritised experience replay (see the sketch below).
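
As a rough illustration of the prioritised idea (from later work on prioritised experience replay, not this paper): sample transition $j$ with probability proportional to a priority $p_j$ (e.g. its TD-error magnitude) raised to a power $\alpha$. The function below is a minimal proportional-sampling sketch with illustrative names.

```python
import numpy as np


def sample_prioritised(priorities, batch_size, alpha=0.6):
    """Return indices sampled with P(j) proportional to p_j**alpha."""
    p = np.asarray(priorities, dtype=np.float64) ** alpha
    probs = p / p.sum()
    idx = np.random.choice(len(priorities), size=batch_size, p=probs)
    return idx, probs[idx]
```

In the full method, the returned sampling probabilities are also used to compute importance-sampling weights that correct for the bias introduced by non-uniform sampling.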

Model Architecture

  • Use a separate output unit for each possible action, so that only the state representation is an input to the neural network.
    • This design computes the Q-values for all possible actions in a given state with a single forward pass through the network (see the sketch below).
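
The layer sizes below follow the paper's description (84×84×4 input, 16 8×8 filters with stride 4, 32 4×4 filters with stride 2, a 256-unit fully-connected layer, one output per action), but the PyTorch implementation itself is only a sketch, not the authors' code.

```python
import torch
import torch.nn as nn


class DQN(nn.Module):
    """State in, one Q-value per action out: a single forward pass scores every action."""

    def __init__(self, num_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4),   # 4x84x84 -> 16x20x20
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),  # 16x20x20 -> 32x9x9
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256),
            nn.ReLU(),
            nn.Linear(256, num_actions),                 # one output unit per action
        )

    def forward(self, x):
        return self.net(x)


# q = DQN(num_actions=4)(torch.zeros(1, 4, 84, 84))  # shape (1, 4): all Q-values at once
```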

Answer the following questions

  1. What did authors try to accomplish?
    • Presented the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning.
    • Outperforms all previous approaches on six of the seven games tested and surpasses a human expert on three of them.
  2. What were the key elements of the approach?
    • Q-learning with neural networks
    • Experience Replay
  3. What other references do you want to follow?
    • Vanilla Q-learning.
    • Prioritised Experience Replay