**Introduction**

This post is aimed at introducing the concept of reinforcement learning for sequence tasks. This particular method has been used in many tasks involving recurrent neural networks such as LSTM’s where we would like to optimize the LSTM on an external reward, rather than providing per time-step supervision labels.

**Reinforcement Learning**

Let’s imagine that you want to train an LSTM to maximize an external reward such as the ROUGE metric. There is no way to do this using standard backprop. But if we frame this as a reinforcement learning problem, we can view the *sequences of actions we take until the LSTM produces an EOS token* as an action, and the ROUGE score of the produced sentence as its corresponding reward. Reinforcement learning allows us to bridge the gap between an external metric such as ROUGE and a sequential model such as LSTM.

This adaptation of reinforcement learning to sequence-based tasks can be seen as a special case of the general policy gradient method, which optimizes . In our case , we are specifically interested in the case where the state is an internalization of the actions taken so far, , a.k.a the hidden state of the LSTM or GRU, which gives us

**Policy Gradient**

Let’s define the components of the reinforcement learning model.

- First, we have the policy function . The policy function is the behavior rule that our reinforcement learning agent will follow. The only requirement is that it defines a probability distribution over the space of possible actions. In most cases this is simply the familiar softmax function.
- Secondly, we will define the reward function as the reward we expect to get by behaving according to .

Since the policy is probabilistic, our reward should be an expectation over the probability distribution defined by :

(2)

Where is a sequence of actions that constitute an episode that ends at time .

**Policy Gradient Derivation**

We’ve now defined our reward function. In order to do learning on this reward, we will need to compute the gradient of this expression, . Unfortunately, itself is an impossible sum to calculate in most cases, since the possible combination of action sequences grows exponentially with the sequence length T. The gradient is likewise impossible to evaluate. However, since the reward is a sum over a probability distribution and therefore an expectation, we have the option to approximate it via an unbiased estimator (random sampling). How does this help us? As we will show below, a mathematical trick allows us to calculate also as an expectation, allowing it to be estimated via random sampling as well.

By the chain rule, we have:

If we multiply both sides by , we get

(3)

This is often referred to as the *log trick*. Using it, we can write:

(4)

Which is an expectation:

(5)

We can use (5) to approximate the gradient we should follow, by randomly sampling the action sequence until the end of an episode, observing the reward we got, and plugging it into the expression. The learning process now boils down to executing many trials of randomly sampled action sequences (this is called Monte-Carlo sampling), observing their rewards, and updating the policy weights accordingly until we reach convergence.

**Note on the Policy Gradient with an LSTM**

We have now derived the policy gradient equation. However, the implementation might not be clear in your mind. Implementing the sequential version of the policy gradient using LSTM is extremely simple because we do not manually loop over the time dimension to calculate any gradients. This is already baked into the backpropagation-through-time (BPTT) calculation for the LSTM function , as we will see below.

Recall that the LSTM outputs at each timestep are defined as . This is the same as because the hidden state encodes the actions taken so far, .

Over multiple timesteps, the LSTM models the join probability distribution that is simply the multiplication of the conditional probabilities at each timestep:

Since we will take the entire action-sequence chosen by the LSTM during an episode as an ‘action’, and therefore have a single corresponding reward for the action-sequence, the gradients at each time-step are simply the standard BPTT gradients. All modern deep learning graph libraries support the BPTT feature which automatically unrolls the graph in the time dimension and uses the chain-rule to calculate each for . By default, these BPTT gradients are accumulated as a sum over time, which essentially replaces the sum over time that appears in the general policy gradient algorithm. Therefore, the only thing we need to do to calculate the correct gradients for each time-step is to weight the standard BPTT gradients with the final episode reward.

**Policy Gradient with a Baseline**

Although the estimate of the gradient is unbiased there is high variance. A popular method to reduce the variance is to subtract a baseline from the reward, which reduces the variance but does not effect the expected value of the gradient.

We subtract from in (4):

Using (3) to rewrite the second term on the RHS,

Since is independent of ,

And finally the sum rule gives us,

The base line trick can be interpreted as measuring a relative reward (w.r.t the baseline), rather than the raw reward, to reduce the variance of the gradient estimate.

**Conclusion**

We’ve reviewed the derivation of policy gradients in the sequence setting. As we can see, the method rather straightforward, but can be useful in scenarios where standard supervised learning can’t be employed. It should be noted that other variations and extensions have improved this method in useful and interesting ways, and further research into the recent literature is strongly recommended.