Notes: Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning (REINFORCE)


Before we start: this paper is a bit dated (published in 1992), and some of its terminology differs from what the contemporary deep reinforcement learning community uses. Luckily, the author explains everything patiently and in detail.

Code available here.

Introduction

This paper presents analytical results concerning certain algorithms for tasks that are associative, meaning that the learner is required to perform an input-output mapping, and, with one limited exception, that involve immediate reinforcement, meaning that the reinforcement (i.e., payoff) provided to the learner is determined by the most recent input-output pair only.

A distinguishing property of the algorithms presented here is that while they can be described roughly as statistically climbing an appropriate gradient, they manage to do this without explicitly computing an estimate of this gradient or even storing information from which such an estimate could be directly computed. A more informative adjective would be non-model-based.

The expected reinforcement performance criterion

In order to consider gradient learning algorithms, it is necessary to have a performance measure to optimise. A very natural one for any immediate-reinforcement learning problem, associative or not, is the expected value of the reinforcement signal, conditioned on a particular choice of parameters of the learning system. Thus, for a reinforcement learning network, our performance measure is $E\{r \mid \mathbf{W}\}$, where $E$ denotes the expectation operator, $r$ the reinforcement signal, and $\mathbf{W}$ the network's weight matrix.

Why use expected values $E$ here?

Because of potential randomness in any of the following:

  1. the environment's choice of input to the network;
  2. the network's choice of output corresponding to any particular input;
  3. the environment's choice of reinforcement value for any particular input/output pair.

An important assumption: the randomness of the environment does not change over time (i.e., it is stationary).

$E\{r \mid \mathbf{W}\}$ is a well-defined, deterministic function of $\mathbf{W}$ (but one which is unknown to the learning system). Thus, in this formulation, the objective of the reinforcement learning system is to search the space of all possible weight matrices $\mathbf{W}$ for a point where $E\{r \mid \mathbf{W}\}$ is maximum.
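As a toy example (mine, not the paper's): take a single stochastic unit whose output $y \in \{0,1\}$ is drawn with $\operatorname{Pr}\{y=1 \mid \mathbf{w}, \mathbf{x}\}=1 /\left(1+e^{-\mathbf{w}^{\mathrm{T}} \mathbf{x}}\right)$ (the paper's Bernoulli-logistic unit), a fixed input $\mathbf{x}$, and the reinforcement rule $r=y$. Then

$$E\{r \mid \mathbf{W}\}=\operatorname{Pr}\{y=1 \mid \mathbf{w}, \mathbf{x}\}=\frac{1}{1+e^{-\mathbf{w}^{\mathrm{T}} \mathbf{x}}},$$

a perfectly deterministic function of the weights even though every individual trial is random; maximising expected reinforcement here simply means increasing $\mathbf{w}^{\mathrm{T}} \mathbf{x}$.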

REINFORCE algorithms

Notations first:

$y_i$: the output of the $i$th unit in the network;

$\mathbf{x}^i$: the pattern of input to the $i$th unit in the network;

$\mathbf{w}^i$: the weight vector consisting of all the weights $w_{i j}$ of the $i$th unit, i.e., the collection of all parameters on which the behaviour of the $i$th unit depends;

$\mathbf{W}$: the weight matrix consisting of all weights $w_{i j}$ in the network, i.e., the collection of all parameters on which the behaviour of the entire network depends.

Definition of REINFORCE algorithms

Consider a network facing an associative immediate-reinforcement learning task. Recall that weights are adjusted in this network following receipt of the reinforcement value $r$ at each trial. Suppose that the learning algorithm for this network is such that at the end of each trial each parameter $w_{ij}$ in the network is incremented by an amount:

$$\Delta w_{i j}=\alpha_{i j}\left(r-b_{i j}\right) e_{i j},$$

where $\alpha_{i j}$ is a learning rate factor, $b_{i j}$ is a reinforcement baseline, and $e_{i j}=\partial \ln g_{i} / \partial w_{i j}$ is called the characteristic eligibility of $w_{i j}$. Here $g_{i}\left(\xi, \mathbf{w}^{i}, \mathbf{x}^{i}\right)=\operatorname{Pr}\left\{y_{i}=\xi \mid \mathbf{w}^{i}, \mathbf{x}^{i}\right\}$ is the probability mass function determining the unit's output $y_i$ as a function of its parameters and its input.

Suppose further that the reinforcement baseline $b_{i j}$ is conditionally independent of $y_{i}$, given $\mathbf{W}$ and $\mathbf{x}^{i}$, and the rate factor $\alpha_{i j}$ is nonnegative and depends at most on $\mathbf{w}^i$ and $t$.

Any learning algorithm having this particular form will be called a REINFORCE algorithm. The name is an acronym for "REward Increment = Nonnegative Factor x Offset Reinforcement x Characteristic Eligibility".
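To make the update rule concrete, here is a minimal sketch (my own, not code from the paper) of one REINFORCE trial for a single Bernoulli-logistic unit; reward_fn stands in for whatever immediate reinforcement the environment provides:

import numpy as np

def reinforce_trial(w, x, reward_fn, alpha=0.1, b=0.0, rng=np.random.default_rng()):
    """One immediate-reinforcement trial for a single Bernoulli-logistic unit.

    The unit outputs y = 1 with probability p = sigmoid(w . x), so the
    characteristic eligibility is d ln g / d w_j = (y - p) * x_j.
    """
    p = 1.0 / (1.0 + np.exp(-np.dot(w, x)))   # g(1, w, x) = p and g(0, w, x) = 1 - p
    y = float(rng.random() < p)               # sample the stochastic output
    r = reward_fn(x, y)                       # immediate reinforcement from the environment
    e = (y - p) * x                           # characteristic eligibility
    return w + alpha * (r - b) * e            # Delta w = alpha * (r - b) * e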

But why does this form of algorithm work? Here is the math.

$\textbf{Theorem 1.}$ For any REINFORCE algorithm, the inner product of $E\{\Delta \mathbf{W} \mid \mathbf{W}\}$ and $\nabla_{\mathbf{W}} E\{r \mid \mathbf{W}\}$ is nonnegative. Furthermore, if $\alpha_{i j}>0$ for all $i$ and $j$, then this inner product is zero only when $\nabla_{\mathbf{W}} E\{r \mid \mathbf{W}\}=0$. Also, if $\alpha_{i j}=\alpha$ is independent of $i$ and $j$, then $E\{\Delta \mathbf{W} \mid \mathbf{W}\}=\alpha \nabla_{\mathbf{W}} E\{r \mid \mathbf{W}\}$.

What does it mean? It says that for any such algorithm the average update vector in weight space lies in a direction for which this performance measure is increasing. The last sentence of the theorem is equivalent to the claim that for each weight $w_{i j}$ the quantity $\left(r-b_{i j}\right) \partial \ln g_{i} / \partial w_{i j}$ represents an unbiased estimate of $\partial E\{r \mid \mathbf{W}\} / \partial w_{i j}$.
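To see the unbiasedness claim numerically, here is a small self-contained check (my own sanity test, not from the paper): for the toy Bernoulli-logistic unit above with $r=y$ and $b=0$, the true gradient is $p(1-p)\,\mathbf{x}$, and the Monte Carlo average of $(r-b)\,\partial \ln g / \partial \mathbf{w}$ converges to it.

import numpy as np

rng = np.random.default_rng(0)
w, x = np.array([0.5, -0.3]), np.array([1.0, 2.0])   # arbitrary toy values
p = 1.0 / (1.0 + np.exp(-np.dot(w, x)))              # Pr{y = 1 | w, x}

true_grad = p * (1.0 - p) * x                        # gradient of E{r | W} = p

n_trials, b = 200_000, 0.0
estimate = np.zeros_like(w)
for _ in range(n_trials):
    y = float(rng.random() < p)                      # sample the output
    r = y                                            # toy reinforcement rule: r = y
    estimate += (r - b) * (y - p) * x                # (r - b) * d ln g / d w
estimate /= n_trials

print(true_grad, estimate)                           # the two vectors should be close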

Proof of Theorem 1

(Feel free to skip this part.)

Notations first:

$Y_{i}$: the set of possible output values $y_{i}$ of the $i$th unit

$X_{i}$: the set of possible values of the input vector $\mathbf{x}^i$ to the $i$th unit

$I$: the index set for elements of $\mathbf{W}$, so that $(i,j)\in I$ if and only if $w_{i j}$ is a parameter in the system.

$\textbf{Fact 1.}$

$$\begin{aligned}\frac{\partial E\left\{r \mid \mathbf{W}, \mathbf{x}^{i}\right\}}{\partial w_{i j}}=\sum_{\xi \in Y_{i}} E\left\{r \mid \mathbf{W}, \mathbf{x}^{i}, y_{i}=\xi\right\} \frac{\partial g_{i}}{\partial w_{i j}}\left(\xi, \mathbf{w}^{i}, \mathbf{x}^{i}\right)\end{aligned}$$

$Proof.$ Conditioning on the possible values of the output $y_{i}$, we may write:

$$\begin{aligned}E\left\{r \mid \mathbf{W}, \mathbf{x}^{i}\right\} &=\sum_{\xi \in Y_{i}} E\left\{r \mid \mathbf{W}, \mathbf{x}^{i}, y_{i}=\xi\right\} \operatorname{Pr}\left\{y_{i}=\xi \mid \mathbf{W}, \mathbf{x}^{i}\right\} \\&=\sum_{\xi \in Y_{i}} E\left\{r \mid \mathbf{W}, \mathbf{x}^{i}, y_{i}=\xi\right\} g_{i}\left(\xi, \mathbf{w}^{i}, \mathbf{x}^{i}\right)\end{aligned}$$

Note that specification of the value of $y_{i}$ causes $w_{i j}$ to have no influence on the ultimate value of $r$, which means that $E\left\{r \mid \mathbf{W}, \mathbf{x}^{i}, y_{i}=\xi\right\}$ does not depend on $w_{i j}$. The result then follows by differentiating both sides of this last equation with respect to $w_{i j}$.

$\textbf{Fact 2.}$

We define $g(\xi, \mathbf{W}, \mathbf{x})=\operatorname{Pr}\{\mathbf{y}=\xi \mid \mathbf{W}, \mathbf{x}\}$ to be the overall probability mass function describing the input-output behaviour of the entire network.

$$\sum_{\xi \in Y_{i}} \frac{\partial g_{i}}{\partial w_{i j}}\left(\xi, \mathbf{w}^{i}, \mathbf{x}^{i}\right)=0$$

$Proof.$

$$\sum_{\xi \in Y_{i}} g_{i}\left(\xi, \mathbf{w}^{i}, \mathbf{x}^{i}\right)=\sum_{\xi \in Y_{i}} \operatorname{Pr}\left\{y_{i}=\xi \mid \mathbf{w}^{i}, \mathbf{x}^{i}\right\}=1$$

and the result follows by differentiating with respect to $w_{i j}$.
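A quick numerical illustration of Fact 2 (my own, for a single softmax unit): because the output probabilities always sum to one, their derivatives with respect to any weight sum to zero.

import torch

torch.manual_seed(0)
w = torch.randn(3, 4, requires_grad=True)   # one softmax unit: 3 output values, 4 inputs
x = torch.randn(4)

probs = torch.softmax(w @ x, dim=0)         # g_i(xi, w, x) for each possible output xi
probs.sum().backward()                      # differentiate sum_xi g_i(xi, w, x), which is constant 1
print(w.grad.abs().max())                   # ~0 up to floating-point error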

$\textbf{Lemma 1.}\text{ For any REINFORCE algorithm,}$

$$E\left\{\Delta w_{i j} \mid \mathbf{W}, \mathbf{x}^{i}\right\}=\alpha_{i j} \frac{\partial E\left\{r \mid \mathbf{W}, \mathbf{x}^{i}\right\}}{\partial w_{i j}}$$

$Proof.$ First note that the characteristic eligibility can be written:

$$e_{i j}=\frac{\partial \ln g_{i}}{\partial w_{i j}}=\frac{1}{g_{i}} \frac{\partial g_{i}}{\partial w_{i j}}$$

Although this fails to be defined when $g_{i} = 0$, $\Delta w_{i j}$ is still well-defined for any REINFORCE algorithm as long as $Y_i$ is discrete. This is because $g_{i}\left(\xi, \mathbf{w}^{i}, \mathbf{x}^{i}\right)=0$ means that the value $\xi$ has zero probability of occurring as the output $y_{i}$, so the eligibility never has to be evaluated at such a point.

Then

$$\begin{aligned}E\left\{\Delta w_{i j} \mid \mathbf{W}, \mathbf{x}^{i}\right\} &=\sum_{\xi \in Y_{i}} E\left\{\Delta w_{i j} \mid \mathbf{W}, \mathbf{x}^{i}, y_{i}=\xi\right\} \operatorname{Pr}\left\{y_{i}=\xi \mid \mathbf{W}, \mathbf{x}^{i}\right\} \\&=\sum_{\xi \in Y_{i}} E\left\{\frac{\alpha_{i j}\left(r-b_{i j}\right)}{g_{i}\left(\xi, \mathbf{w}^{i}, \mathbf{x}^{i}\right)} \frac{\partial g_{i}}{\partial w_{i j}}\left(\xi, \mathbf{w}^{i}, \mathbf{x}^{i}\right) \mid \mathbf{W}, \mathbf{x}^{i}, y_{i}=\xi\right\} g_{i}\left(\xi, \mathbf{w}^{i}, \mathbf{x}^{i}\right) \\&= \alpha_{i j} \sum_{\xi \in Y_{i}} E\left\{r \mid \mathbf{W}, \mathbf{x}^{i}, y_{i}=\xi\right\} \frac{\partial g_{i}}{\partial w_{i j}}\left(\xi, \mathbf{w}^{i}, \mathbf{x}^{i}\right) \\&\quad-\alpha_{i j} \sum_{\xi \in Y_{i}} E\left\{b_{i j} \mid \mathbf{W}, \mathbf{x}^{i}, y_{i}=\xi\right\} \frac{\partial g_{i}}{\partial w_{i j}}\left(\xi, \mathbf{w}^{i}, \mathbf{x}^{i}\right)\end{aligned}$$

making use of the fact that $\alpha_{i j}$ does not depend on the particular value of the output $y_{i}$. By Fact 1, the first term of this last expression is $\alpha_{i j}\left(\partial E\left\{r \mid \mathbf{W}, \mathbf{x}^{i}\right\} / \partial w_{i j}\right)$. Consider the remaining term. Since $E\left\{b_{i j} \mid \mathbf{W}, \mathbf{x}^{i}, y_{i}=\xi\right\}=E\left\{b_{i j} \mid \mathbf{W}, \mathbf{x}^{i}\right\}$, by assumption, we have

$$\begin{aligned}\sum_{\xi \in Y_{i}} E\left\{b_{i j} \mid \mathbf{W}, \mathbf{x}^{i}, y_{i}=\xi\right\} \frac{\partial g_{i}}{\partial w_{i j}}\left(\xi, \mathbf{w}^{i}, \mathbf{x}^{i}\right) &=E\left\{b_{i j} \mid \mathbf{W}, \mathbf{x}^{i}\right\} \sum_{\xi \in Y_{i}} \frac{\partial g_{i}}{\partial w_{i j}}\left(\xi, \mathbf{w}^{i}, \mathbf{x}^{i}\right) \\&=0\end{aligned}$$

by Fact 2, and the Lemma is proved.

$\textbf{Fact 3.}$

$$\frac{\partial E\{r \mid \mathbf{W}\}}{\partial w_{i j}}=\sum_{x \in X_{i}} \frac{\partial E\left\{r \mid \mathbf{W}, \mathbf{x}^{i}=\mathbf{x}\right\}}{\partial w_{i j}} \operatorname{Pr}\left\{\mathbf{x}^{i}=\mathbf{x} \mid \mathbf{W}\right\}$$

$Proof.$ Conditioning on the possible input patterns $\mathbf{x}^i$, we may write

$$E\{r \mid \mathbf{W}\}=\sum_{x \in X_{i}} E\left\{r \mid \mathbf{W}, \mathbf{x}^{i}=\mathbf{x}\right\} \operatorname{Pr}\left\{\mathbf{x}^{i}=\mathbf{x} \mid \mathbf{W}\right\}$$

Note that the weight $w_{i j}$ lies downstream of all computation performed to determine $\mathbf{x}^i$. This means that $\operatorname{Pr}\left\{\mathbf{x}^{i}=\mathbf{x} \mid \mathbf{W}\right\}$ does not depend on $w_{i j}$, so the result follows by differentiating both sides of this last equation with respect to $w_{i j}$.

$\textbf{Lemma 2.}\text{ For any REINFORCE algorithm,}$

$$E\left\{\Delta w_{i j} \mid \mathbf{W}\right\}=\alpha_{i j} \frac{\partial E\{r \mid \mathbf{W}\}}{\partial w_{i j}}$$

$Proof.$

$$\begin{aligned}E\left\{\Delta w_{i j} \mid \mathbf{W}\right\} &=\sum_{x \in X_{i}} E\left\{\Delta w_{i j} \mid \mathbf{W}, \mathbf{x}^{i}=\mathbf{x}\right\} \operatorname{Pr}\left\{\mathbf{x}^{i}=\mathbf{x} \mid \mathbf{W}\right\} \\&=\sum_{x \in X_{i}} \alpha_{i j} \frac{\partial E\left\{r \mid \mathbf{W}, \mathbf{x}^{i}=\mathbf{x}\right\}}{\partial w_{i j}} \operatorname{Pr}\left\{\mathbf{x}^{i}=\mathbf{x} \mid \mathbf{W}\right\} \\&=\alpha_{i j} \sum_{x \in X_{i}} \frac{\partial E\left\{r \mid \mathbf{W}, \mathbf{x}^{i}=\mathbf{x}\right\}}{\partial w_{i j}} \operatorname{Pr}\left\{\mathbf{x}^{i}=\mathbf{x} \mid \mathbf{W}\right\} \\&=\alpha_{i j} \frac{\partial E\{r \mid \mathbf{W}\}}{\partial w_{i j}}\end{aligned}$$

where the first equality is obtained by conditioning on the possible input patterns to the unit, the second equality follows from Lemma 1, the third from the assumption that $\alpha_{i j}$ does not depend on the input to the unit, and the last from Fact 3.

Establishing this last result, which is just like Lemma 1 except that the conditioning on input to unit $i$ has been removed from both sides of the equation, is a key step. It relates two quantities that, unlike those of Lemma 1, would be quite messy to compute explicitly in general because $\operatorname{Pr}\left\{\mathbf{x}^{i}=\mathbf{x} \mid \mathbf{W}\right\}$ can be quite complicated. From this lemma our main result follows easily.

Now, again:

$\textbf{Theorem 1.}$ For any REINFORCE algorithm, $E\{\Delta \mathbf{W} \mid \mathbf{W}\}^{\mathrm{T}} \nabla_{\mathbf{W}} E\{r \mid \mathbf{W}\} \geq 0$. Furthermore, if $\alpha_{i j}>0$ for all $i$ and $j$, then equality holds if and only if $\nabla_{\mathbf{W}} E\{r \mid \mathbf{W}\}=0$.

$Proof.$

$$\begin{aligned}E\{\Delta \mathbf{W} \mid \mathbf{W}\}^{\mathrm{T}} \nabla_{\mathbf{W}} E\{r \mid \mathbf{W}\} &=\sum_{(i, j) \in I} E\left\{\Delta w_{i j} \mid \mathbf{W}\right\} \frac{\partial E\{r \mid \mathbf{W}\}}{\partial w_{i j}} \\&=\sum_{(i, j) \in I} \alpha_{i j}\left(\frac{\partial E\{r \mid \mathbf{W}\}}{\partial w_{i j}}\right)^{2}\end{aligned}$$

by Lemma 2, and the result is immediate.

Episodic REINFORCE algorithms

In this part, the author discusses how the REINFORCE class of algorithms can be extended to certain learning problems handled on an episode-by-episode basis (which is more common in recent RL research).

Definition of episodic REINFORCE algorithms

Assume a net $N$ is trained on an episode-by-episode basis, where each episode consists of $k$ time steps, during which the units may recompute their outputs and the environment may alter its non-reinforcement input to the system at each time step. A single reinforcement value $r$ is delivered to the net at the end of each episode.

Here, the paper introduces an "unfolding-in-time" mapping, obtained by duplicating $N$ once for each time step. The resulting network, denoted $N^{*}$, has no cycles but exhibits behaviour corresponding to that of $N$.

Formally, this amounts to associating with each time-dependent variable $v$ in $N$ a corresponding time-indexed set of variables $\{v^t\}$ in $N^*$ whose values do not depend on time, and which have the property that $v(t) = v^t$ for all appropriate $t$.

In particular, each weight $w_{ij}$ in $N$ gives rise to several weights $w^t_{i j }$ in $N^*$, all of whose values happen to be equal to each other and to the value of $w_{i j}$ in $N$ since it is assumed that $w_{i j}$ is constant over the episode.

At the conclusion of each episode, each parameter $w_{i j}$ is incremented by

$$\Delta w_{i j}=\alpha_{i j}\left(r-b_{i j}\right) \sum_{t=1}^{k} e_{i j}(t)$$

where all notation is the same as that defined earlier, with $e_{i j}(t)$ representing the characteristic eligibility for $w_{i j}$ evaluated at the particular time $t$.

All quantities are assumed to satisfy the same conditions required for a REINFORCE algorithm; in particular, for each $i$ and $j$, the reinforcement baseline $b_{i j}$ is conditionally independent of all the output values $y_i(t)$, and the rate factor $\alpha_{i j}$ depends at most on $\mathbf{w}^i$ and the episode number. Any algorithm of this form (and intended for such a learning problem) is called an episodic REINFORCE algorithm.
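To make this concrete, here is a minimal sketch (mine, not the paper's) of one episode of episodic REINFORCE for a single Bernoulli-logistic unit; the step_fn environment interface and its names are placeholders for illustration. The eligibilities are accumulated across the $k$ time steps, and a single update is made with the end-of-episode reinforcement $r$.

import numpy as np

def episodic_reinforce_episode(w, step_fn, k, alpha=0.1, b=0.0,
                               rng=np.random.default_rng()):
    """One episode for a single Bernoulli-logistic unit.

    step_fn(t, y_prev) -> (x_t, r_t) is a stand-in environment; only the
    reinforcement returned at the final step t = k - 1 is used.
    """
    eligibility_sum = np.zeros_like(w)                # sum over t of e(t) = d ln g / d w at time t
    y, r = 0.0, 0.0
    for t in range(k):
        x, r = step_fn(t, y)
        p = 1.0 / (1.0 + np.exp(-np.dot(w, x)))       # Pr{y(t) = 1}
        y = float(rng.random() < p)                   # sample the output at time t
        eligibility_sum += (y - p) * x                # accumulate e(t)
    return w + alpha * (r - b) * eligibility_sum      # Delta w = alpha * (r - b) * sum_t e(t)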

Math part:

$\textbf{Theorem 2.}$ For any episodic REINFORCE algorithm, the inner product of $E\{\Delta \mathbf{W} \mid \mathbf{W}\}$ and $\nabla_{\mathbf{W}} E\{r \mid \mathbf{W}\}$ is nonnegative. Furthermore, if $\alpha_{i j}>0$ for all $i$ and $j$, then this inner product is zero only when $\nabla_{\mathbf{W}} E\{r \mid \mathbf{W}\}=0$. Also, if $\alpha_{i j}=\alpha$ is independent of $i$ and $j$, then $E\{\Delta \mathbf{W} \mid \mathbf{W}\}=\alpha \nabla_{\mathbf{W}} E\{r \mid \mathbf{W}\}$.

Proof of Theorem 2

Notations first:

$\mathbf{W}^{*}$: the weight matrix for $N^*$, with its individual components being denoted $w^{t}_{i j}$.

The weight $w^{t}_{i j}$ in $N^*$ corresponds to the weight $w_{i j}$ in $N$ at the $t$th time step, so that $w^{t}_{i j}=w_{i j}$ for all $i$, $j$, and $t$. Specifying $\mathbf{W}$ is equivalent to specifying $\mathbf{W}^*$.

$\textbf{Fact 4.}$

$$\frac{\partial E\{r \mid \mathbf{W}\}}{\partial w_{i j}}=\sum_{t=1}^{k} \frac{\partial E\{r \mid \mathbf{W}^{*}\}}{\partial w_{i j}^{t}}$$

$Proof.$ Using the chain rule, we have

$$\frac{\partial E\{r \mid \mathbf{W}\}}{\partial w_{i j}}=\sum_{t=1}^{k} \frac{\partial E\{r \mid \mathbf{W}\}}{\partial w_{i j}^{t}} \frac{\partial w_{i j}^{t}}{\partial w_{i j}}=\sum_{t=1}^{k} \frac{\partial E\{r \mid \mathbf{W}\}}{\partial w_{i j}^{t}}=\sum_{t=1}^{k} \frac{\partial E\left\{r \mid \mathbf{W}^{*}\right\}}{\partial w_{i j}^{t}}$$

since $w^{t}_{i j}=w_{i j}$ for all $t$.

$\textbf{Lemma 3.}\text{ For any episodic REINFORCE algorithm,}$

$$E\left\{\Delta w_{i j} \mid \mathbf{W}\right\}=\alpha_{i j} \frac{\partial E\{r \mid \mathbf{W}\}}{\partial w_{i j}}$$

$Proof.$ Let $\Delta w_{i j}^{t}=\alpha_{i j}\left(r-b_{i j}\right) e_{i j}(t)$, so that $\Delta w_{i j}=\sum_{t=1}^{k} \Delta w_{i j}^{t}$. Note that this represents a REINFORCE algorithm in $N^*$, so it follows from Lemma 2 that

$$E\left\{\Delta w_{i j}^{t} \mid \mathbf{W}^{*}\right\}=\alpha_{i j} \frac{\partial E\left\{r \mid \mathbf{W}^{*}\right\}}{\partial w_{i j}^{t}}$$

But then

$$\begin{aligned}E\left\{\Delta w_{i j} \mid \mathbf{W}\right\} &=E\left\{\sum_{t=1}^{k} \Delta w_{i j}^{t} \mid \mathbf{W}^{*}\right\} \\&=\sum_{t=1}^{k} E\left\{\Delta w_{i j}^{t} \mid \mathbf{W}^ {*}\right\} \\&=\sum_{t=1}^{k} \alpha_{i j} \frac{\partial E\{r \mid \mathbf{W}^{*}\}}{\partial w_{i j}^{t}} \\&=\alpha_{i j} \frac{\partial E\{r \mid \mathbf{W}\}}{\partial w_{i j}}\end{aligned}$$

where the last equality follows from Fact 4.

$\textbf{Theorem 2.}$ For any episodic REINFORCE algorithm, $E\{\Delta \mathbf{W} \mid \mathbf{W}\}^{\mathrm{T}} \nabla_{\mathbf{W}} E\{r \mid \mathbf{W}\} \geq 0$. Furthermore, if $\alpha_{i j}>0$ for all $i$ and $j$, then equality holds if and only if $\nabla_{\mathbf{W}} E\{r \mid \mathbf{W}\}=0$.

$Proof.$

$$\begin{aligned}E\{\Delta \mathbf{W} \mid \mathbf{W}\}^{\mathrm{T}} \nabla_{\mathbf{W}} E\{r \mid \mathbf{W}\} &=\sum_{(i, j) \in I} E\left\{\Delta w_{i j} \mid \mathbf{W}\right\} \frac{\partial E\{r \mid \mathbf{W}\}}{\partial w_{i j}} \\&=\sum_{(i, j) \in I} \alpha_{i j}\left(\frac{\partial E\{r \mid \mathbf{W}\}}{\partial w_{i j}}\right)^{2}\end{aligned}$$

by Lemma 3, and the result is immediate.

Note that the proof of Theorem 2 is identical to that for Theorem 1. This is because Theorem 1 uses Lemma 2 and Theorem 2 uses Lemma 3, and both lemmas have the same conclusion.

Code Details

After reading this paper I still felt quite confused about implementing the whole algorithm. Thanks to this code by Denny Britz, I was finally able to adapt his early-version TensorFlow code into my PyTorch version. Following are some important code excerpts:

# Imports implied by the excerpts below; NUM_EPISODES, env, writer, device, agent
# and the Transition record all come from the full code in the repository.
import torch
import torch.nn as nn
import torch.nn.functional as F
from itertools import count

# Training Loop. Pretty simple.
for i_episode in range(NUM_EPISODES):
    state = env.reset()
    for t in count():
        state = torch.from_numpy(state).unsqueeze_(0).to(device=device, dtype=torch.float)
        action, action_prob = agent.act(state)
        next_state, reward, done, _ = env.step(action.item())
        agent.memorize(reward, action_prob)
        state = next_state
        if done:
            writer.add_scalar("Train durations/episode", t, global_step=i_episode)
            break
    episode_loss = agent.learn()

# Policy Net. Feel free to play with the network structure.
class MLPPolicy(nn.Module):
    def __init__(self, d_state, d_hidden, d_action):
        super(MLPPolicy, self).__init__()
        self.linear1 = nn.Linear(d_state, d_hidden)
        self.linear2 = nn.Linear(d_hidden, d_action)
        # This is important. Softmax outputs a normalized (sum=1) probability distribution over all actions.
        self.head = nn.Softmax(dim=1)

    def forward(self, x):
        x = F.relu(self.linear1(x))
        x = self.linear2(x)
        return self.head(x)

# A sketch of the agent class. Refer to the github code for full details.
class VPGAgent:
    def __init__(self, device, lr, gamma=0.99):
        ...

    def act(self, state):
        self.time_step += 1
        action_probs = self.policy_net(state)
        # Sample from the distribution over actions.
        action = torch.multinomial(action_probs, 1)
        action_prob = action_probs.gather(1, action)
        return action, action_prob

    def memorize(self, reward, action_prob):
        reward = torch.tensor([reward]).to(self.device)
        self.replay_buffer.append(Transition(reward, action_prob))
        return

    def learn(self):
        ...
        self.optimizer.zero_grad()
        accumulative_loss = 0
        total_return = 0
        # Walk the episode backwards so the discounted return can be built up incrementally.
        for t, transition in enumerate(reversed(self.replay_buffer)):
            total_return *= self.gamma
            total_return += transition.reward
            loss = - torch.log(transition.action_prob) * total_return
            accumulative_loss += loss
            # Calculate the loss but don't do the optimizer step until the end of the episode.
            # The gradients will be accumulated automatically.
            loss.backward()
        self.optimizer.step()
        self.replay_buffer.clear()
        return accumulative_loss

Remember the definition of episodic REINFORCE? $\Delta w_{i j}=\alpha_{i j}\left(r-b_{i j}\right) \sum_{t=1}^{k} e_{i j}(t)$

This is done by:

loss = - torch.log(transition.action_prob) * total_return

Here $e_{i j}(t)=\partial \ln g_{i}(t) / \partial w_{i j}$ is produced automatically when backward() is called on this loss (the log-probability of the chosen action, weighted by the return), and $\alpha_{i j}$ is the learning rate applied by the optimiser.
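As a standalone illustration of that point (my own sketch, independent of the repository code), calling backward() on the log-probability of a sampled action leaves exactly the characteristic eligibility $\partial \ln g_{i} / \partial w_{i j}$ in the weights' .grad field:

import torch

torch.manual_seed(0)
w = torch.randn(3, 4, requires_grad=True)   # a single softmax "unit": 3 actions, 4 state features
x = torch.randn(4)                          # input pattern

probs = torch.softmax(w @ x, dim=0)         # distribution over the 3 possible outputs
y = torch.multinomial(probs, 1).item()      # sample an output
torch.log(probs[y]).backward()              # d ln g / d w at the sampled output

eligibility = w.grad                        # this is e_ij for every weight of the unit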

Notice that here $b_{i j}=0$, meaning that no baseline value is used to compute the advantage. (We will delve into this in later posts.)
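For reference, here is a minimal sketch of how a baseline could be introduced (hypothetical code, not in the repository; it assumes the same Transition fields, replay buffer, and torch import as above): compute the discounted returns first, use their mean as a constant baseline $b$, and weight each log-probability by the offset return.

def reinforce_loss_with_baseline(replay_buffer, gamma):
    # Hypothetical variant of the loss computation with a mean-return baseline.
    returns, G = [], 0.0
    for transition in reversed(replay_buffer):
        G = transition.reward + gamma * G
        returns.append(G)
    returns.reverse()
    b = sum(returns) / len(returns)          # reinforcement baseline b (same for every weight)
    loss = 0
    for transition, g in zip(replay_buffer, returns):
        loss = loss - torch.log(transition.action_prob) * (g - b)
    return loss                              # call loss.backward() and optimizer.step() as before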

Another trick: When computing the total_return, what Denny Britz wrote was:

# The episode is a list working as replay buffer.
for t, transition in enumerate(episode):
    total_return = sum(discount_factor**i * t.reward for i, t in enumerate(episode[t:]))

Computing total_return front-to-back this way does a lot of redundant work (each step re-sums the rest of the episode, which is quadratic in the episode length), so I reversed the loop:

total_return = 0
for t, transition in enumerate(reversed(self.replay_buffer)):
    total_return *= self.gamma
    total_return += transition.reward

Works well and much faster. :)
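If you want to convince yourself that the two versions compute the same thing, here is a tiny self-contained check (plain Python, made-up rewards) comparing the quadratic-time forward computation with the linear-time reversed recursion $G_t = r_t + \gamma G_{t+1}$:

rewards = [1.0, 0.0, 2.0, 1.0]   # made-up episode rewards
gamma = 0.99

# Front-to-back, O(k^2): re-sum the remaining tail for every t.
forward = [sum(gamma**i * r for i, r in enumerate(rewards[t:]))
           for t in range(len(rewards))]

# Back-to-front, O(k): G_t = r_t + gamma * G_{t+1}.
backward, G = [], 0.0
for r in reversed(rewards):
    G = r + gamma * G
    backward.append(G)
backward.reverse()

assert all(abs(a - b) < 1e-9 for a, b in zip(forward, backward))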

Answer the following questions

  1. What did authors try to accomplish?
    • To introduce a class of simple statistical gradient-following algorithms for connectionist reinforcement learning networks.
  2. What were the key elements of the approach?
    • $\Delta w_{i j}=\alpha_{i j}\left(r-b_{i j}\right) e_{i j}$
    • REward Increment = Nonnegative Factor x Offset Reinforcement x Characteristic Eligibility
  3. What other references do you want to follow?
    • The paper's Conclusion points out two main disadvantages of these algorithms:
      1. Lack of a general convergence theory applicable to this class of algorithms;
      2. As with all gradient algorithms, an apparent susceptibility to convergence to false optima.