
The True Impact of Baselines in Policy Gradient Methods

I have been working with policy gradient (PG) methods for quite some time now and I thought I should continue sharing our findings here. In June, we put out a paper on how we could see PG methods from an operators perspective. I even wrote a blog post on this. Today I’m going to talk about a paper we put out last month about the role of baselines in PG methods.

The standard view of policy gradient methods

As in my previous post, let me first talk about the standard view of PG methods. I’ll skip basic definitions and use a somewhat standard notation ($s$, $a$, $p$, $r$, …). There are great tutorials about this out there (e.g., this and this one) but, in short, reinforcement learning algorithms aim to find a policy maximizing the total reward an agent collects through repeated interactions with the environment.

The simplest PG method is REINFORCE. The goal is to maximize the expected discounted sum of rewards, $J(\theta)$. PG methods do that by directly learning the parameters of a policy with gradient ascent:

$$\nabla_\theta J(\theta) = \int_\tau \pi_\theta(\tau) R(\tau) \nabla_\theta \log \pi_\theta(\tau) d\tau,$$

where $\pi_\theta$ is the parameterized policy and $R(\tau)$ is the return of a trajectory $\tau$. The problem is that this update can have very high variance. In this context, what people often do is introduce a function $b$ that doesn't change the expected value of the update but that has a big impact on its variance. Instead of using $R(\tau)$, we use $R(\tau) - b$ in the update. We call this function $b$ the baseline.
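To make the baseline's role concrete, here is a minimal numpy sketch (my own illustration, not code from the paper) on a small softmax bandit. It computes the expected REINFORCE update exactly and checks that subtracting a baseline leaves that expectation unchanged:

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def grad_log_pi(logits, a):
    """Gradient of log pi(a) w.r.t. softmax logits: e_a - pi."""
    g = -softmax(logits)
    g[a] += 1.0
    return g

def expected_update(logits, rewards, b):
    """Exact expectation of (R(a) - b) * grad log pi(a) with a ~ pi."""
    pi = softmax(logits)
    return sum(pi[a] * (rewards[a] - b) * grad_log_pi(logits, a)
               for a in range(len(rewards)))

logits = np.array([0.2, -0.1, 0.0])
rewards = np.array([1.0, 0.7, 0.0])

# The baseline shifts individual samples but not the expectation,
# because E[grad log pi] = 0 under pi.
g0 = expected_update(logits, rewards, b=0.0)
g1 = expected_update(logits, rewards, b=5.0)
assert np.allclose(g0, g1)
```

Individual samples $(R - b)\nabla\log\pi$ do change with $b$, which is exactly why the variance (and, as argued below, the learning dynamics) depends on the baseline even though the mean does not.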

Is it all about variance?

Baselines are often introduced as I did above. They are often seen as a trick to reduce variance because, well, we always want lower variance, right? Not really. What I’m going to show in this post is that baselines impact not only the variance of the updates but the learning dynamics itself. Moreover, sometimes the lowest possible variance is not what you are looking for.

So, one question we asked in our paper was whether low variance is all we really care about (and whether this is everything baselines affect). We did so in a bandit setting to make the explanations simple and to be able to derive more theoretical results (please check the paper if you are into that). Because it is a bandit problem, we can actually compute the minimum-variance baseline for the problem. Since the variance is quadratic in the baseline, if we add or subtract $\epsilon$ from the minimum-variance baseline $b^*$ we end up with two different updates with exactly the same variance everywhere. We did that and, lo and behold, the learning dynamics are quite different! Take a look below.

What you are seeing here are 15 different trajectories of vanilla policy gradient on a 3-arm bandit problem with rewards (1, 0.7, 0). The black dot is the initial policy and colors represent time, from purple to yellow. Again, the variance of both updates is exactly the same at any point of the simplex, but the paths these agents take are quite different. Say we have $b^* + \epsilon$. For $\epsilon = -1$, we can see the agent gets much closer to deterministic policies, oftentimes a suboptimal one. Some of these trajectories actually do get stuck and take a long time to recover (yellow lines in the left plot). But wait, isn't that weird? Aren't baselines only supposed to reduce variance, and isn't low variance necessarily good? Should baselines be impacting the convergence of REINFORCE? Well, they do.
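You can verify the "same variance" claim yourself. Below is a sketch of my own (not the paper's code) for the softmax bandit above: the variance of the update is a quadratic in $b$, so $b^* + \epsilon$ and $b^* - \epsilon$ induce exactly the same variance, and both are higher than the variance at $b^*$:

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

logits = np.zeros(3)                 # uniform initial policy
rewards = np.array([1.0, 0.7, 0.0])  # the 3-arm bandit from the plot
pi = softmax(logits)

# grad log pi(a) w.r.t. the logits is e_a - pi; one row per arm.
G = np.eye(3) - pi
g2 = (G ** 2).sum(axis=1)            # ||grad log pi(a)||^2

# Minimum-variance baseline: a weighted average of the rewards,
# with weights pi(a) * ||grad log pi(a)||^2.
w = pi * g2
b_star = (w * rewards).sum() / w.sum()

def update_variance(b):
    """Trace of the covariance of (R(a) - b) grad log pi(a), a ~ pi."""
    mean = ((pi * (rewards - b))[:, None] * G).sum(axis=0)
    second_moment = (w * (rewards - b) ** 2).sum()
    return second_moment - (mean ** 2).sum()

# b* + eps and b* - eps induce exactly the same variance...
assert np.isclose(update_variance(b_star + 1), update_variance(b_star - 1))
# ...and both are strictly higher than the variance at b* itself.
assert update_variance(b_star + 1) > update_variance(b_star)
```

The cross term $\sum_a w(a)(R(a) - b^*)$ vanishes by the definition of $b^*$, which is why adding or subtracting the same $\epsilon$ gives identical variance.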

In hindsight, the reason is kind of obvious. When using a baseline $b$, the update above becomes

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_i \big[R(\tau_i) - b \big] \nabla_\theta \log \pi_\theta (\tau_i), \ \tau_i \sim \pi_\theta.$$

A sample of the gradient of the expected return is $\big[R(\tau_i) - b \big] \nabla_\theta \log\pi_\theta(\tau_i)$. Thus, if $R(\tau_i) - b$ is positive, which happens more often when the baseline $b$ is small, the update rule will increase the probability $\pi_\theta(\tau_i)$. This leads to an increase in the probability of taking the actions the agent took before, regardless of their quality. This is reflected in the results for the $\epsilon = -1$ case. Because the agent is likely to choose the same actions again, we call this committal behaviour.

While a smaller baseline leads to committal behaviour, a larger baseline makes the agent second-guess itself. If $R(\tau_i) - b$ is negative, which happens more often when $b$ is large, the parameter update decreases the probability $\pi_\theta(\tau_i)$ of the sampled trajectory $\tau_i$, reducing the probability the agent will retake the actions it just took. While this might slow down convergence, this also makes it harder for the agent to get stuck. This is reflected in the $\epsilon = 1$ case. We call this non-committal behaviour.
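The sign argument above is easy to check directly. A small sketch of my own (the arm, reward, and baselines are made-up numbers): one REINFORCE step after sampling an arm, with a baseline below and then above the observed return:

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def reinforce_step(logits, a, reward, baseline, lr=0.1):
    """One REINFORCE update on the sampled arm a: (R - b) * (e_a - pi)."""
    g = -softmax(logits)
    g[a] += 1.0
    return logits + lr * (reward - baseline) * g

logits = np.zeros(2)
a, reward = 1, 0.5          # suppose the agent sampled arm 1 and observed 0.5

p_before = softmax(logits)[a]
# Small baseline (R - b > 0): the sampled arm becomes MORE likely (committal).
p_small_b = softmax(reinforce_step(logits, a, reward, baseline=-1.0))[a]
# Large baseline (R - b < 0): the sampled arm becomes LESS likely (non-committal).
p_large_b = softmax(reinforce_step(logits, a, reward, baseline=2.0))[a]

assert p_small_b > p_before > p_large_b
```

Note that the reward itself never changed; only the baseline flipped the sign of the update on the action that happened to be sampled.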

More than that, we can also compare the results above with the minimum variance baseline. You can see a comparison below (I replicated a plot to make your life easier). 

What we see here is that $b^* + 1$ can converge faster than $b^*$. However, recall that $b^* + 1$ has higher variance! Yes, sometimes a higher-variance update induced by a baseline is better than the minimum-variance update. At least to me, this was extremely surprising. It is different from what I had heard over and over about baselines and variance. But again, it makes sense: it is about the behaviour the baseline induces.

The baseline can impact the convergence point of PG methods

If you keep digging, as we did, you learn more. One thing we actually learned is that baselines can impact the policy the agent converges to. When looking at natural policy gradient updates, for example, we even derived theoretical results about this. To stick with empirical observations, take a look at the plot below.

This is a two-armed bandit problem and we are using a baseline $b = -1$. In short, the red curves are the runs that converge to the suboptimal arm. Again, this surprised me, since baselines don't change the updates in expectation. The problem, though, is that for natural PG we can have unbounded variance, and things go bad. There are several ways to address this. We are not trying to point out a fundamental flaw in PG methods or the theory; we just want to highlight that the baseline can impact the learning dynamics in ways that are often not discussed by practitioners. The problem here is that policies can become deterministic too fast, and baselines that induce committal behaviour provoke exactly that. In our paper we even have theoretical results showing that reducing the variance of updates with baselines can be detrimental. Think about that.

Notice that in our paper we not only present several theoretical results on all the things I discussed here but also include experiments in bigger MDPs (a.k.a. not bandit problems) to show that what we are talking about is quite pervasive in PG methods.

A potential solution

All our results are about on-policy methods. This is actually a problem because it means the updates we make directly impact the samples we observe. A natural thing to consider is to break this dependence. One way we can do this is with importance sampling, in which the policy we sample actions from is not the same as the policy we are optimizing. Notice that here we are talking about designing these sampling policies, instead of just correcting for different distributions, which is what importance sampling is often used for in reinforcement learning. I'm not going to delve too much into this, to not make this post even longer, but long story short, breaking this dependence works, even when it increases the variance at some points. Just so I don't miss the opportunity to put a pretty plot here, see the teaser below. The colors highlight when the variance of one method is higher than the other's. See the paper for a more thorough discussion and theoretical results.
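The basic mechanism is easy to sketch. Below is my own illustration of the general idea (not the estimator from the paper): sample actions from a separate behaviour policy $\mu$ and reweight by $\pi_\theta/\mu$, which keeps the update unbiased while decoupling what is sampled from what is being optimized:

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def grad_log_pi(logits, a):
    """Gradient of log pi(a) w.r.t. softmax logits: e_a - pi."""
    g = -softmax(logits)
    g[a] += 1.0
    return g

def is_update_expectation(logits, mu, rewards, b):
    """Exact expectation of the IS-corrected update, actions sampled from mu:
       E_mu[ (pi(a)/mu(a)) (R(a) - b) grad log pi(a) ]."""
    pi = softmax(logits)
    return sum(mu[a] * (pi[a] / mu[a]) * (rewards[a] - b) * grad_log_pi(logits, a)
               for a in range(len(rewards)))

def on_policy_expectation(logits, rewards, b):
    pi = softmax(logits)
    return sum(pi[a] * (rewards[a] - b) * grad_log_pi(logits, a)
               for a in range(len(rewards)))

logits = np.array([0.3, -0.2, 0.0])
rewards = np.array([1.0, 0.7, 0.0])
mu = np.array([0.5, 0.25, 0.25])   # a fixed, sufficiently exploratory sampler

# The importance-sampling correction leaves the expected update unchanged.
assert np.allclose(is_update_expectation(logits, mu, rewards, b=0.5),
                   on_policy_expectation(logits, rewards, b=0.5))
```

Because $\mu$ is fixed, a committal update to $\pi_\theta$ no longer changes which actions get sampled next, which is precisely the dependence being broken.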

Wrapping up

If we put on an optimization hat to analyze PG methods, optimization theory predicts that their convergence speed will be affected by the variance of the estimates and by the geometry of the function, represented by its curvature. Roughly speaking, the geometry dictates how effective true gradient ascent is at optimizing $J(\theta)$, while the variance can be viewed as a penalty capturing how much the optimization process slows down when using noisy versions of this true gradient. In a sense, in our paper we showed that this is actually not enough to entirely explain the behaviour of PG methods. As I mentioned above, this is because the updates in RL methods directly impact the samples the algorithm observes.

I believe our paper opens up interesting research directions. One topic I’m excited about is exploration. How do we properly characterize exploration here? Let me conclude this post the same way we concluded our paper: We believe our observations should make us think of exploration as a property of the learning dynamics. For instance, one could view it as a mechanism to ensure iterates will converge to similar solutions regardless of the trajectories sampled, thus countering the committal effect of negative baselines where trajectories are unlikely to intersect. While encouraging the policies to have high entropy, as done by entropy regularization, might be a way, we believe blindly enforced stochasticity is not the only, nor the best, way to yield efficient policy gradient methods.


Wesley Chung, Valentin Thomas, and Courtney Paquette gave me valuable feedback on this post. Obviously, any mistakes or imprecisions are my fault.

Looking at Policy Gradient Methods from an Operators Perspective

Policy gradient (PG) methods are one of the cornerstones of the field of reinforcement learning (RL). They are quite popular and have been behind high-profile success stories in the field (e.g., this and this one). They are often presented as an alternative to value-based methods where, instead of learning value functions and using them to take actions, they directly learn a parameterized policy by doing gradient ascent on the objective function. This view suggests that value-based methods and PG methods are fundamentally different. In this post I'll share a different perspective on PG methods. I'll present them from an operators perspective and I'll show that PG methods and value-based methods are directly related, lying on opposite ends of the same spectrum. This post is a subset of the results presented in a paper Dibya Ghosh, Nicolas Le Roux, and I recently wrote. This has been a very collaborative effort and the only reason I ended up writing this post with "I" instead of "we" was to ensure I'm responsible for any mistakes 😅 .

The standard view of policy gradient methods

Let me first talk about the standard view of policy gradient methods. I’ll skip basic definitions and use a somewhat standard notation ($s$, $a$, $p$, $r$, …). There are great tutorials about this out there (e.g., this and this one) but, in short, reinforcement learning algorithms aim to find a policy maximizing the total reward an agent collects through repeated interactions with the environment.

The simplest PG method is REINFORCE, and its update in the state-action formulation is

$$\theta_{t+1} = \theta_t + \epsilon \sum_s d^{\pi_t}(s) \sum_a \pi_t(a | s)\, Q^{\pi_t}(s, a)\, \frac{\partial \log \pi_{\theta}(a | s)}{\partial \theta}\bigg|_{\theta = \theta_t} \; ,$$

where $d^{\pi}$ is the discounted stationary distribution induced by the policy $\pi$ and $\pi_t$ is the policy parameterized by $\theta_t$.

Policy gradient methods from an operators perspective

Our realization was that we can also look at PG methods from an operators perspective. That is, instead of seeing them as doing gradient ascent on some parameters, we can see them as successively applying a policy improvement step followed by a projection step, which ensures we can still represent the new policy.

To give an intuition on this, let’s say we design an algorithm that starts with an initial policy $\pi_{\theta_t}$, at time step $t$, and we then estimate the Q-values with respect to that policy, $Q^{\pi_{\theta_t}}$. We then decide to improve this policy by reweighting the probability of each state-action pair to be proportional to its estimated return. For a given state $s$,

$$\pi'_{\theta_t}(a|s) \ \propto \ \pi_{\theta_t}(a|s)\, Q^{\pi_{\theta_t}}(s,a).$$

The policy we just obtained is not necessarily realizable. Thus, we need to find the closest realizable policy, according to some distance. We do that with the KL divergence, projecting each state independently,

$$\pi_{\theta_{t+1}} \ = \ \arg\min_{\pi \in \Pi} KL(\pi'_{\theta_t} || \pi).$$

This seems quite reasonable, right? The cool part is that, when we put everything together, if we normalize the distributions and we weight everything by the discounted stationary distribution, we see that the REINFORCE update is nothing more than the improvement step followed by one gradient step of the projection!

$$\theta_{t+1} = \theta_t + \epsilon \sum_s d^{\pi_t}(s) \sum_a \color{red}{\pi_t(a | s)\, Q^{\pi_t}(s, a)}\, \frac{\partial \log \pi_{\theta}(a | s)}{\partial \theta}\bigg|_{\theta = \theta_t} \; .$$

The term in red is the policy improvement step and the rest is the KL divergence. Formally, if all $Q^\pi(s,a)$ are positive, the equation above can be seen as doing a gradient step, with respect to $\pi$, starting at $\pi = \pi_t$, to minimize

$$D_{d^{\pi_t} V^{\pi_t}}(Q^{\pi_t}\pi_t || \pi) = \sum_s d^{\pi_t}(s)\, V^{\pi_t}(s)\, KL(Q^{\pi_t}\pi_t(\cdot | s) || \pi(\cdot | s)) \; ,$$

where the weighted divergence $D_z$ and the distribution $Q^\pi\pi$ over actions are defined as

$$D_{z}(\mu || \pi) = \sum_s z(s)\, KL(\mu(\cdot | s) || \pi(\cdot | s)) \; , \qquad Q^\pi\pi(a | s) = \frac{1}{V^\pi(s)}\, Q^\pi(s, a)\, \pi(a | s) \; .$$

Hence, the two operators associated with the state-action formulation are:

$$\mathcal{I}_V(\pi)(s,a) = \left(\frac{1}{\mathbb{E}_{d^\pi}[V^\pi]}\, d^\pi(s)\, V^\pi(s)\right) Q^\pi\pi(a | s) \; ,$$
$$\mathcal{P}_{V}(\mu) = \arg\min_{z \in \Pi} \sum_s \mu(s)\, KL(\mu(\cdot | s) || z(\cdot | s)) \; .$$

You might be wondering about the assumption up there that all returns $Q^\pi(s,a)$ are positive. Well, as far as we know, we currently need that. However, we are only making explicit what is actually a common assumption. Several algorithms need it, such as those that see RL as inference. One way to deal with that, which is what a lot of the existing approaches do, is to introduce reward transformations to ensure positivity. The most common transformation is the exponential function, used by PPO and MPO. We are still thinking about these things and we feel there’s more to be said/done with respect to that, but I’ll stop here for now.
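The "improvement step followed by one gradient step of the projection" claim can be checked numerically. Here is a sketch of my own in a stateless (bandit) version, assuming positive Q-values as in the text: the gradient of $KL(Q\pi \,||\, \pi_\theta)$ at $\theta = \theta_t$ is the negative REINFORCE gradient, up to the positive factor $V$:

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

logits_t = np.array([0.4, 0.1, -0.3])
Q = np.array([1.0, 0.7, 0.2])        # assumed positive, as in the text
pi_t = softmax(logits_t)

# Improvement step: reweight the policy by its Q-values.
V = (pi_t * Q).sum()
pi_improved = pi_t * Q / V           # pi'(a) proportional to pi(a) Q(a)

# Projection objective: KL(pi' || pi_theta) as a function of theta.
def kl(logits):
    pi = softmax(logits)
    return (pi_improved * (np.log(pi_improved) - np.log(pi))).sum()

# Gradient of the projection at theta = theta_t, by central differences.
eps = 1e-6
grad_kl = np.array([(kl(logits_t + eps * e) - kl(logits_t - eps * e)) / (2 * eps)
                    for e in np.eye(3)])

# REINFORCE gradient: sum_a pi(a) Q(a) (e_a - pi).
reinforce_grad = ((pi_t * Q)[:, None] * (np.eye(3) - pi_t)).sum(axis=0)

# Descending the projection objective moves theta along the REINFORCE
# ascent direction, up to the positive scale V.
assert np.allclose(-V * grad_kl, reinforce_grad, atol=1e-5)
```

Analytically, $\nabla_\theta KL(\pi' || \pi_\theta)\big|_{\theta_t} = \pi_t - \pi'$ for a softmax policy, and $V(\pi' - \pi_t)$ is exactly the REINFORCE gradient, which is what the check confirms.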

Coming up with new operators and the relationship to value-based methods

Once we start to see PG methods from this operators perspective, it is natural to wonder if we can come up with new operators. An obvious thing to try is to have more aggressive improvement steps. The simplest way to do so is to replace $\mathcal{I}_V$ by $\mathcal{I}_V^\alpha: \; \pi \longrightarrow (Q^\pi)^\frac{1}{\alpha}\pi$, with $\alpha \leq 1$. In words, to use the state-action values raised to the $\alpha^{-1}$-th power. Smaller values of $\alpha$ place higher emphasis on actions with high expected return and, as $\alpha$ goes to zero, $(Q^{\pi})^\frac{1}{\alpha}\pi$ becomes the deterministic distribution that assigns probability 1 to the action $a^\ast(s) = \arg\max_a Q^\pi(s, a)$.
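A quick numerical illustration of my own (with made-up values of $\pi$ and $Q$): raising the Q-values to the power $1/\alpha$ and renormalizing collapses the improved policy onto the greedy action as $\alpha$ shrinks:

```python
import numpy as np

def improved_policy(pi, Q, alpha):
    """The I_V^alpha improvement step: pi'(a) proportional to Q(a)^(1/alpha) pi(a)."""
    w = Q ** (1.0 / alpha) * pi
    return w / w.sum()

pi = np.array([0.3, 0.4, 0.3])
Q = np.array([1.0, 0.7, 0.2])

# alpha = 1 is the plain REINFORCE reweighting; smaller alpha is greedier.
for alpha in [1.0, 0.5, 0.1]:
    print(alpha, improved_policy(pi, Q, alpha))

# As alpha -> 0 the improved policy approaches the one-hot argmax of Q.
greedy = np.zeros(3)
greedy[Q.argmax()] = 1.0
assert np.allclose(improved_policy(pi, Q, 0.02), greedy, atol=1e-6)
```

This is the interpolation discussed below: $\alpha = 1$ recovers the REINFORCE reweighting, while the $\alpha \to 0$ limit behaves like the greedy, value-based improvement.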

It is natural to wonder whether different improvement steps, for $\alpha \neq 1$, still lead to the same fixed point. The answer is no. Maybe surprisingly, projecting a policy that achieves a higher expected return can lead to a worse policy. Nevertheless, in our paper we show that projecting the improved policy back with the $\alpha$-divergence guarantees a fixed point for these bigger improvement steps. For more details about the $\alpha$-divergence, check out this paper.

We can also empirically check if this mismatch really matters, and it does! The figure below depicts some of these analyses in the four-room domain. The leftmost panel shows, as theory predicts, that using more aggressive steps accelerates learning at first, but leads to a suboptimal solution. The center panel shows that this doesn’t happen when we project the improved policies back using the $\alpha$-divergence. The rightmost panel is maybe just a teaser, but it starts to show some other things we can do empirically, such as annealing the $\alpha$ term with time. So yes, the right projection makes all the difference!

But, personally, the reason I'm most excited about this result is that it gives us the connection I mentioned at the beginning. The operator view can be seen as offering us an interpolation between REINFORCE and value-based methods: we recover REINFORCE when $\alpha = 1$ and, in the limit $\alpha \to 0$, we have the policy which assigns probability 1 to the action $a^\ast(s) = \arg\max_a Q^\pi(s, a)$, and $\mathcal{I}_V^\alpha$ becomes the Bellman optimality operator. From this perspective, one may say that the main difference between REINFORCE and value-based methods is how aggressively they use value estimates to define their policy: REINFORCE generates smooth policies that choose actions proportionally to their estimated value, while value-based methods choose the action with the highest value. That's cool, isn't it?

Wrapping Up

I wanted to write this blog post to present this new perspective in an easier and more succinct way. I also wanted to present this notion that PG methods and value-based methods are directly related, lying on opposite ends of the same spectrum. Obviously, our paper has a much more formal presentation of these ideas. We discuss related work, we have results about the fixed point of these operations and about the size of the improvements in each step, we discuss the trajectory and state-action formulations, we have other operators (e.g., for PPO and MPO), and more. At the same time, this post (and even our paper) just highlights some of the things this view can give us. There's still a lot to do and to leverage here!


Nicolas Le Roux, as always, was very helpful with his thorough feedback. Obviously, any mistakes or imprecisions are my fault.


Dibya Ghosh, Marlos C. Machado, Nicolas Le Roux. An Operator View of Policy Gradient Methods. CoRR abs/2006.11266, 2020.

Hello, world!

I recently received an email saying that my account at University of Alberta expired. I lost my old webpage (I mean, the hosting service) and my university email account. I decided to use that as an opportunity to make a new website, which should be easier to keep up to date. With this new website I also started a blog.

I won’t actually say anything in this post, it is just a start. I’m planning on using this blog as an excuse to post about some of my papers and, sporadically, some reflections about life as a researcher. 

Realistically speaking, I don't plan to write that often here. I'm frequently feeling overwhelmed by what I want to do in terms of research, collaborations, and my personal life. This past year has been crazy. I defended my Ph.D. in February, I moved to Montreal and started a new job at Google Brain in March, and my daughter was born in April. I think I'm still catching my breath. Anyways, I digress. Maybe I'll talk about all this in a future post.

If you are interested in this type of thing, take a look at it every now and then. I’m much more present on Twitter (@MarlosCMachado) and I’m sure I’ll announce my posts there. That’s it for now 🙂