Code available here
Introduction
Challenge: Learning to control agents directly from high-dimensional sensory inputs like vision and speech.
Recent advances in deep learning: Making it possible to extract high-level features from raw sensory data, leading to breakthroughs in computer vision and speech recognition.
Special RL challenges from a deep learning perspective:
- Most successful DL applications have required large amounts of hand-labelled training data. RL algorithms, in contrast, must learn from a scalar reward signal that is frequently sparse, noisy and delayed; the delay between actions and the resulting rewards can be thousands of timesteps long;
- Most DL algorithms assume i.i.d. data, while in RL one typically encounters sequences of highly correlated states;
- In RL the data distribution changes as the algorithm learns new behaviours (i.e. it is non-stationary), while DL methods assume a fixed underlying distribution.
Contributions:
- Use a variant of the Q-learning algorithm, with a neural network as the function approximator, to address challenge 1.
- Use an experience replay mechanism to address challenges 2 and 3.
Deep Reinforcement Learning
Experience replay: store the agent's experiences at each time-step, $e_{t}=\left(s_{t}, a_{t}, r_{t}, s_{t+1}\right)$, in a replay memory $\mathcal{D}$ of capacity $N$; during training, Q-learning updates are applied to minibatches sampled at random from $\mathcal{D}$.
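As a minimal sketch of how such a replay memory might look in Python (a fixed-capacity `deque` with uniform sampling; the class and method names are illustrative, not the paper's code):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity memory of transitions e_t = (s_t, a_t, r_t, s_{t+1}, done)."""

    def __init__(self, capacity):
        # deque with maxlen=N silently drops the oldest transition once full
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # uniform random minibatch, as in the paper
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```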
$$\begin{array}{l}
\hline
\textbf{Algorithm 1} \text{ Deep Q-learning with Experience Replay} \\
\hline
\quad \text{Initialize replay memory } \mathcal{D} \text{ to capacity } N \\
\quad \text{Initialize action-value function } Q \text{ with random weights} \\
\quad \textbf{for } \text{episode} = 1, M \textbf{ do} \\
\quad\qquad \text{Initialise sequence } s_{1}=\left\{x_{1}\right\} \text{ and preprocessed sequence } \phi_{1}=\phi\left(s_{1}\right) \\
\quad\qquad \textbf{for } t=1, T \textbf{ do} \\
\quad\qquad\qquad \text{With probability } \epsilon \text{ select a random action } a_{t} \\
\quad\qquad\qquad \text{otherwise select } a_{t}=\arg\max_{a} Q^{*}\left(\phi\left(s_{t}\right), a ; \theta\right) \\
\quad\qquad\qquad \text{Execute action } a_{t} \text{ in emulator and observe reward } r_{t} \text{ and image } x_{t+1} \\
\quad\qquad\qquad \text{Set } s_{t+1}=s_{t}, a_{t}, x_{t+1} \text{ and preprocess } \phi_{t+1}=\phi\left(s_{t+1}\right) \\
\quad\qquad\qquad \text{Store transition } \left(\phi_{t}, a_{t}, r_{t}, \phi_{t+1}\right) \text{ in } \mathcal{D} \\
\quad\qquad\qquad \text{Sample random minibatch of transitions } \left(\phi_{j}, a_{j}, r_{j}, \phi_{j+1}\right) \text{ from } \mathcal{D} \\
\quad\qquad\qquad \text{Set } y_{j}=\begin{cases} r_{j} & \text{for terminal } \phi_{j+1} \\ r_{j}+\gamma \max_{a^{\prime}} Q\left(\phi_{j+1}, a^{\prime} ; \theta\right) & \text{for non-terminal } \phi_{j+1} \end{cases} \\
\quad\qquad\qquad \text{Perform a gradient descent step on } \left(y_{j}-Q\left(\phi_{j}, a_{j} ; \theta\right)\right)^{2} \text{ according to equation 3} \\
\quad\qquad \textbf{end for} \\
\quad \textbf{end for} \\
\hline
\end{array}$$
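The inner loop of Algorithm 1 (sampling a minibatch, forming the targets $y_j$, and taking a gradient step on the squared error) could look roughly like the following PyTorch sketch. PyTorch itself, the `dqn_update` name, and the `done` flag stored with each transition are assumptions for illustration rather than the paper's implementation; the paper's frame preprocessing and reward clipping are omitted.

```python
import numpy as np
import torch
import torch.nn.functional as F

def dqn_update(q_net, optimizer, batch, gamma=0.99):
    """One gradient step on (y_j - Q(phi_j, a_j; theta))^2 for a sampled minibatch."""
    states, actions, rewards, next_states, dones = [np.asarray(x) for x in batch]
    states      = torch.as_tensor(states, dtype=torch.float32)
    actions     = torch.as_tensor(actions, dtype=torch.int64)
    rewards     = torch.as_tensor(rewards, dtype=torch.float32)
    next_states = torch.as_tensor(next_states, dtype=torch.float32)
    dones       = torch.as_tensor(dones, dtype=torch.float32)   # 1.0 if phi_{j+1} is terminal

    # Q(phi_j, a_j; theta) for the actions actually taken
    q_taken = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # y_j = r_j for terminal phi_{j+1}; r_j + gamma * max_a' Q(phi_{j+1}, a'; theta) otherwise
    with torch.no_grad():
        next_q = q_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_q

    loss = F.mse_loss(q_taken, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```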
Advantages over standard online Q-learning:
- Greater data efficiency: each step of experience is potentially used in many weight updates;
- Reduced variance of gradient updates: randomising the samples breaks the correlations between consecutive transitions;
- Avoidance of parameter oscillations or divergence: with experience replay the behaviour distribution is averaged over many of its previous states, which smooths out learning.
Deficiencies:
- The memory buffer does not differentiate important transitions: because of its finite size it only stores the last $N$ experience tuples, always overwriting the oldest ones with recent transitions;
- Uniform sampling gives equal importance to all transitions in the replay memory, even though they are not equally important.
Possible improvement:
- Prioritised experience replay (a minimal sketch follows below).
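Prioritised experience replay (Schaul et al., 2015) replays transitions with large temporal-difference error more often instead of sampling uniformly. The sketch below shows the proportional variant only; it assumes the absolute TD errors are fed back after each update, and it omits the importance-sampling weights and the sum-tree data structure that the original paper uses for efficiency. Names and hyperparameters are illustrative.

```python
import numpy as np

class PrioritizedReplayBuffer:
    """Proportional prioritisation: transition i is sampled with probability
    p_i^alpha / sum_k p_k^alpha, where p_i = |TD error_i| + eps."""

    def __init__(self, capacity, alpha=0.6, eps=1e-6):
        self.capacity = capacity
        self.alpha = alpha          # alpha = 0 recovers uniform sampling
        self.eps = eps              # keeps every priority strictly positive
        self.buffer = []
        self.priorities = np.zeros(capacity, dtype=np.float64)
        self.pos = 0                # index of the slot to overwrite next

    def push(self, transition):
        # New transitions get the current maximum priority so they are replayed at least once.
        max_prio = self.priorities[: len(self.buffer)].max() if self.buffer else 1.0
        if len(self.buffer) < self.capacity:
            self.buffer.append(transition)
        else:
            self.buffer[self.pos] = transition
        self.priorities[self.pos] = max_prio
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size):
        prios = self.priorities[: len(self.buffer)] ** self.alpha
        probs = prios / prios.sum()
        indices = np.random.choice(len(self.buffer), batch_size, p=probs)
        return [self.buffer[i] for i in indices], indices

    def update_priorities(self, indices, td_errors):
        # Larger |TD error| -> higher priority -> replayed more often.
        self.priorities[indices] = np.abs(td_errors) + self.eps
```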
Model Architecture
- Use a separate output unit for each possible action, and only the state representation is an input to the neural network.
- This design computes the Q-values for all possible actions in a given state with a single forward pass through the network (see the sketch below).
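A sketch of such a Q-network in PyTorch (PyTorch is an assumption; the paper predates it). The layer sizes roughly follow the architecture described in the paper for $84 \times 84 \times 4$ preprocessed inputs, but treat the exact numbers as illustrative.

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Q-network: state representation in, one Q-value per action out."""

    def __init__(self, n_actions, in_channels=4):
        super().__init__()
        # Convolutional stack roughly as described in the paper (illustrative sizes):
        # 16 8x8 filters, stride 4, then 32 4x4 filters, stride 2, on 84x84 frames.
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256), nn.ReLU(),   # 84x84 input -> 9x9x32 feature map
            nn.Linear(256, n_actions),               # one output unit per possible action
        )

    def forward(self, x):
        return self.head(self.features(x))
```

With this layout, greedy action selection for a batch of preprocessed states `phi` is just `q_net(phi).argmax(dim=1)`, i.e. one forward pass per state rather than one pass per (state, action) pair.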
Answer the following questions
- What did authors try to accomplish?
- Presented the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning.
- Outperforms all previous approaches on six of the seven games tested and surpasses a human expert on three of them.
- What were the key elements of the approach?
- Q-learning with neural networks
- Experience Replay
- What other references do you want to follow?
- Vanilla Q-learning.
- Prioritised Experience Replay