I have been working with policy gradient (PG) methods for quite some time now and I thought I should continue sharing our findings here. In June, we put out a paper on how we could see PG methods from an operator’s perspective. I even wrote a blog post on this. Today I’m going to talk about a paper we put out last month about the role of baselines in PG methods.

**The standard view of policy gradient methods**

As in my previous post, let me first talk about the standard view of PG methods. I’ll skip basic definitions and use a somewhat standard notation ($s$, $a$, $p$, $r$, …). There are great tutorials about this out there (e.g., this and this one) but, in short, reinforcement learning algorithms aim to find a policy maximizing the total reward an agent collects through repeated interactions with the environment.

The simplest PG method is REINFORCE. The goal is to maximize the expected discounted sum of rewards, $J(\theta)$. PG methods do that by directly learning the parameters of a policy with gradient ascent:

$$\nabla_\theta J(\theta) = \int_\tau \pi_\theta(\tau) R(\tau) \nabla_\theta \log \pi_\theta(\tau) d\tau,$$

where $\pi_\theta$ is the parameterized policy and $R(\tau)$ is the return of a trajectory $\tau$. The problem is that this update can have very high variance. In this context, what people often do is introduce a function $b$ that doesn’t change the expected value of the update but that has a big impact on its variance. Instead of using $R(\tau)$, we use $R(\tau) - b$ in the update. We call this function $b$ the **baseline**.
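To make the update concrete, here is a minimal sketch of REINFORCE with a fixed baseline on a multi-armed bandit with a softmax policy. The function name and hyperparameters are my own illustrative choices, not something from the paper:

```python
import numpy as np

def reinforce_bandit(rewards, baseline, steps=1000, lr=0.1, seed=0):
    """Vanilla REINFORCE with a fixed scalar baseline on a bandit.

    `rewards` holds the (deterministic) reward of each arm. The softmax
    policy logits `theta` are updated with the score-function estimator
    [R(a) - b] * grad log pi(a). Returns the final arm probabilities.
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros(len(rewards))
    for _ in range(steps):
        probs = np.exp(theta - theta.max())
        probs /= probs.sum()
        a = rng.choice(len(rewards), p=probs)
        # Gradient of log softmax w.r.t. theta: one_hot(a) - probs.
        grad_log_pi = -probs
        grad_log_pi[a] += 1.0
        theta += lr * (rewards[a] - baseline) * grad_log_pi
    probs = np.exp(theta - theta.max())
    return probs / probs.sum()
```

Subtracting `baseline` leaves the expected update unchanged (since the score function has zero mean under the policy), which is exactly why it is tempting to tune it purely for variance.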

**Is it all about variance?**

Baselines are often introduced as I did above. They are often seen as a trick to reduce variance because, well, we always want lower variance, right? Not really. What I’m going to show in this post is that baselines impact not only the variance of the updates but the learning dynamics itself. Moreover, sometimes the lowest possible variance is not what you are looking for.

So, one question we asked in our paper was whether low variance is all we really care about (and whether this is everything baselines affect). We did so in a bandit setting to make the explanations simple and to be able to derive more theoretical results (please check the paper if you are into that). Because it is a bandit problem, we can actually compute the minimum-variance baseline. Since the variance is quadratic in the baseline, if we add or subtract $\epsilon$ from the minimum-variance baseline $b^*$ we end up with two different updates with exactly the same variance everywhere. We did that and, lo and behold, the learning dynamics are quite different! Take a look below.
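To see why the variance is symmetric around $b^*$: the estimator’s mean does not depend on $b$ (the score function has zero mean), while its second moment is quadratic in $b$, minimized at $b^* = \mathbb{E}[R\,\|\nabla\log\pi\|^2]\,/\,\mathbb{E}[\|\nabla\log\pi\|^2]$. Here is a sketch that checks this numerically for a softmax bandit (the helper names are mine):

```python
import numpy as np

def grad_log_pi(probs, a):
    """Gradient of log softmax probability of arm `a` w.r.t. the logits."""
    g = -probs.copy()
    g[a] += 1.0
    return g

def estimator_variance(probs, rewards, b):
    """Trace of the covariance of (R(a) - b) * grad log pi(a), a ~ pi."""
    mean = sum(probs[a] * (rewards[a] - b) * grad_log_pi(probs, a)
               for a in range(len(probs)))
    second_moment = sum(probs[a] * (rewards[a] - b) ** 2
                        * grad_log_pi(probs, a) @ grad_log_pi(probs, a)
                        for a in range(len(probs)))
    return second_moment - mean @ mean

def min_variance_baseline(probs, rewards):
    """b* = E[R * ||grad log pi||^2] / E[||grad log pi||^2]."""
    sq_norms = np.array([grad_log_pi(probs, a) @ grad_log_pi(probs, a)
                         for a in range(len(probs))])
    return (probs * rewards * sq_norms).sum() / (probs * sq_norms).sum()
```

Evaluating this at $b^* + 1$ and $b^* - 1$ gives identical variances, both strictly larger than the variance at $b^*$, which is the setup in the experiment below.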

What you are seeing here are 15 different trajectories of vanilla policy gradient on a 3-arm bandit problem with rewards (1, 0.7, 0). The black dot is the initial policy and colors represent time, from purple to yellow. Again, the variance of both updates is exactly the same at any point of the simplex, but you can see that the paths these agents take are quite different. Say we have $b^* + \epsilon$. For $\epsilon = -1$, we can see the agent gets much closer to deterministic policies, oftentimes a suboptimal one. Some of these trajectories actually do get stuck and take a long time to recover (yellow lines in the left plot). But wait, isn’t that weird? Aren’t baselines only supposed to reduce variance, and isn’t lower variance always good? Should baselines be impacting the convergence of REINFORCE? Well, they do.

In hindsight, the reason is kind of obvious. When using a baseline $b$, the update above becomes

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_i \big[R(\tau_i) - b \big] \nabla_\theta \log \pi_\theta (\tau_i), \ \tau_i \sim \pi_\theta.$$

A sample of the gradient of the expected return is $\big[R(\tau_i) – b \big] \nabla_\theta \log\pi_\theta(\tau_i)$. Thus, if $R(\tau_i) − b$ is positive, which happens more often when the baseline $b$ is small, the update rule will increase the probability $\pi_\theta(\tau_i)$. This leads to an increase in the probability of taking the actions the agent took before, regardless of their quality. This is reflected in the results for the $\epsilon = −1$ case. Because the agent is likely to choose the same actions again, we call this *committal behaviour*.

While a smaller baseline leads to committal behaviour, a larger baseline makes the agent second-guess itself. If $R(\tau_i) − b$ is negative, which happens more often when $b$ is large, the parameter update decreases the probability $\pi_\theta(\tau_i)$ of the sampled trajectory $\tau_i$, reducing the probability the agent will retake the actions it just took. While this might slow down convergence, this also makes it harder for the agent to get stuck. This is reflected in the $\epsilon = 1$ case. We call this *non-committal behaviour*.
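This sign argument is easy to verify in a few lines of code: starting from a uniform policy on the 3-arm bandit above, a baseline below every reward makes a single update *increase* the probability of whichever arm was pulled, while a baseline above every reward *decreases* it. A quick sketch (the function names are mine):

```python
import numpy as np

def softmax(theta):
    """Softmax policy over bandit arms."""
    e = np.exp(theta - theta.max())
    return e / e.sum()

def one_step(theta, a, reward, b, lr=0.1):
    """One REINFORCE update after pulling arm `a` with baseline `b`."""
    probs = softmax(theta)
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    return theta + lr * (reward - b) * grad_log_pi

rewards = np.array([1.0, 0.7, 0.0])
theta = np.zeros(3)  # uniform initial policy
for a in range(3):
    p_before = softmax(theta)[a]
    # Baseline below every reward: the pulled arm is always reinforced,
    # regardless of its quality (committal behaviour).
    p_committal = softmax(one_step(theta, a, rewards[a], b=-1.0))[a]
    # Baseline above every reward: the pulled arm is always discouraged
    # (non-committal behaviour).
    p_noncommittal = softmax(one_step(theta, a, rewards[a], b=2.0))[a]
    assert p_committal > p_before > p_noncommittal
```

Note that even the worst arm (reward 0) gets reinforced under the small baseline, which is how trajectories end up stuck near suboptimal deterministic policies.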

More than that, we can also compare the results above with the minimum variance baseline. You can see a comparison below (I replicated a plot to make your life easier).

What we see here is that $b^* + 1$ can converge faster than $b^*$. However, **recall that $b^* + 1$ has higher variance!** Yes, sometimes a higher-variance update induced by a baseline is better than using the minimum variance update. At least to me, this was extremely surprising. This is different from what I heard over and over about baselines and variance. But again, it makes sense, it is about the behaviour the baseline induces.

**The baseline can impact the convergence point of PG methods**

If you keep digging, as we did, you learn more. One thing we actually learned is that **baselines can impact the policy the agent converges to**. When looking at natural policy gradient updates, for example, we even derived theoretical results about this. To stick with empirical observations, take a look at the plot below.

This is a two-armed bandit problem and we are using a baseline $b = -1$. In short, the red curves are the runs that converge to the suboptimal arm. Again, this surprised me since baselines don’t change the updates in expectation. The problem, though, is that for natural PG we can have unbounded variance, and things go bad. There are several ways to address this; we are not trying to point out a fundamental flaw in PG methods or the theory. We just want to highlight that the baseline can impact the learning dynamics in ways that are often not discussed by practitioners. The problem here is that policies can become deterministic too fast, and baselines that induce committal behaviour provoke exactly that. In our paper we even have theoretical results showing that reducing the variance of the updates with baselines can be detrimental. Think about that.

Notice that in our paper we not only present several theoretical results on all the things I discussed here but also have experiments in bigger MDPs (a.k.a. not bandit problems) to show that what we are talking about is quite pervasive in PG methods.

**A potential solution**

All our results are about on-policy methods. This is actually a problem because it means the updates we make directly impact the samples we observe. A natural thing to consider here is to break this dependence. One way we can do this is with importance sampling, in which the policy we sample actions from is not the same as the policy we are optimizing. Notice that here we are talking about *designing* these sampling policies, instead of just correcting for different distributions, which is what importance sampling is often used for in reinforcement learning. I’m not going to delve too much into this, so as not to make this post even longer, but long story short, breaking this dependence works, even when it increases the variance at some points. Just so I don’t miss the opportunity to put a pretty plot here, see the teaser below. The colors highlight where the variance of one method is higher than the other’s. See the paper for a more thorough discussion and theoretical results.
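To give a flavour of what such a decoupled update looks like, here is a sketch of a single policy gradient step where the action comes from a separate sampling policy $\mu$ and the update is corrected by the importance weight $\pi/\mu$. This illustrates the general mechanism only, not the specific sampling policies designed in the paper, and the names are my own:

```python
import numpy as np

def softmax(theta):
    """Softmax policy over bandit arms."""
    e = np.exp(theta - theta.max())
    return e / e.sum()

def is_pg_step(theta, mu, rewards, b, lr, rng):
    """One PG step where the action is drawn from a separate sampling
    policy `mu` and reweighted by pi/mu to keep the update unbiased."""
    pi = softmax(theta)
    a = rng.choice(len(pi), p=mu)
    grad_log_pi = -pi
    grad_log_pi[a] += 1.0
    w = pi[a] / mu[a]  # importance weight
    return theta + lr * w * (rewards[a] - b) * grad_log_pi
```

With a fixed exploratory $\mu$ (e.g., uniform), which actions the agent observes no longer depends on its own recent updates, so a committal update can no longer starve the other arms of samples.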

**Wrapping up**

If we put on an optimization hat to analyze PG methods, optimization theory predicts that their convergence speed will be affected by the variance of the estimates and by the geometry of the function, represented by its curvature. Roughly speaking, the geometry dictates how effective true gradient ascent is at optimizing $J(\theta)$, while the variance can be viewed as a penalty capturing how much slower the optimization process becomes when using noisy versions of this true gradient. In a sense, in our paper we showed that this is actually not enough to entirely explain the behaviour of PG methods. As I mentioned above, this is because the updates in RL methods directly impact the samples the algorithm observes.

I believe our paper opens up interesting research directions. One topic I’m excited about is exploration. How do we properly characterize exploration here? Let me conclude this post the same way we concluded our paper: We believe our observations should make us think of exploration as a property of the learning dynamics. For instance, one could view it as a mechanism to ensure iterates will converge to similar solutions regardless of the trajectories sampled, thus countering the committal effect of negative baselines where trajectories are unlikely to intersect. While encouraging the policies to have high entropy, as done by entropy regularization, might be a way, we believe blindly enforced stochasticity is not the only, nor the best, way to yield efficient policy gradient methods.

**Acknowledgements**

Wesley Chung, Valentin Thomas, and Courtney Paquette gave me valuable feedback on this post. Obviously, any mistakes or imprecisions are my fault.