Policy gradient (PG) methods are one of the cornerstones of the field of reinforcement learning (RL). They are quite popular and have been behind high-profile success stories in the field (e.g., this and this one). They are often presented as an alternative to value-based methods: instead of learning value functions and using them to take actions, they *directly* learn a parameterized policy by doing gradient ascent on the objective function. This view suggests that value-based methods and PG methods are fundamentally different. In this post I’ll share a different perspective on PG methods. I’ll present them from an operators perspective and show that PG methods and value-based methods are directly related, lying on opposite ends of the same spectrum. This post is a subset of the results presented in a paper Dibya Ghosh, Nicolas Le Roux, and I recently wrote. This has been a very collaborative effort, and the only reason I ended up writing this post with “I” instead of “we” was to ensure I’m responsible for any mistakes 😅.

#### The standard view of policy gradient methods

Let me first talk about the standard view of policy gradient methods. I’ll skip basic definitions and use a somewhat standard notation ($s$, $a$, $p$, $r$, …). There are great tutorials about this out there (e.g., this and this one) but, in short, reinforcement learning algorithms aim to find a policy maximizing the total reward an agent collects through repeated interactions with the environment.

The simplest PG method is REINFORCE, and its update in the state-action formulation is

$$

\begin{align}

\label{eq:update_state-action}

\theta_{t+1} &= \theta_t + \epsilon \sum_s d^{\pi_t}(s) \sum_a \pi_t(a | s) Q^{\pi_t}(s, a) \frac{\partial \log \pi_{\theta}(a | s)}{\partial \theta}\bigg|_{\theta = \theta_t} \; ,

\end{align}

$$

where $d^{\pi}$ is the discounted stationary distribution induced by the policy $\pi$ and $\pi_t$ is the policy parameterized by $\theta_t$.
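To make the update concrete, here is a minimal tabular sketch of one exact REINFORCE step for a softmax policy (the function names and toy numbers are mine, not from the paper; for a tabular softmax, $\partial \log \pi(a|s)/\partial \theta$ has a closed form that yields the $\pi \cdot (Q - V)$ expression in the code):

```python
import numpy as np

def softmax_policy(theta):
    """Tabular softmax policy; theta has shape (num_states, num_actions)."""
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def reinforce_update(theta, d, Q, eps):
    """One exact REINFORCE step. d is the state distribution, shape (S,);
    Q holds the action values, shape (S, A)."""
    pi = softmax_policy(theta)
    # For a tabular softmax, sum_a pi(a|s) Q(s,a) dlog pi(a|s)/dtheta[s,b]
    # simplifies to pi(b|s) * (Q(s,b) - V(s)), with V(s) = sum_a pi(a|s) Q(s,a).
    V = (pi * Q).sum(axis=1, keepdims=True)
    grad = d[:, None] * pi * (Q - V)
    return theta + eps * grad

# Toy check: one state, two actions; the better action gains probability.
theta = reinforce_update(np.zeros((1, 2)), np.array([1.0]),
                         np.array([[1.0, 2.0]]), eps=0.5)
pi = softmax_policy(theta)
```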

#### Policy gradient methods from an operators perspective

Our realization was that we can also look at PG methods from an operators perspective. That is, instead of seeing them as doing gradient ascent on some parameters, we can see them as successively applying a policy improvement step followed by a projection step, which ensures we can still represent the new policy.

To give an intuition on this, let’s say we design an algorithm that starts with an initial policy $\pi_{\theta_t}$, at time step $t$, and we then estimate the Q-values with respect to that policy, $Q^{\pi_{\theta_t}}$. We then decide to *improve* this policy by reweighting the probability of each state-action pair to be proportional to its estimated return. For a given state $s$,

$$

\begin{align} \pi'_{\theta_t}(a|s) \ \ \propto \ \ \pi_{\theta_t}(a|s)Q^{\pi_{\theta_t}}(s,a).

\end{align}

$$

The policy we just obtained is generally not realizable by our parameterization. Thus, we need to find the closest realizable policy, according to some distance. We do that with the KL divergence, *projecting* each state independently,

$$

\begin{align} \pi_{\theta_{t+1}} \ \ = \ \ \arg\min_{\pi \in \Pi} KL(\pi'_{\theta_t} || \pi).

\end{align}

$$
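Putting the two steps into code, here is a minimal sketch (tabular softmax; function names are mine) of the improvement step followed by gradient steps on the per-state KL projection:

```python
import numpy as np

def improve(pi, Q):
    """Improvement step: reweight pi(a|s) by Q(s,a) and renormalize per
    state (assumes all Q(s,a) > 0)."""
    w = pi * Q
    return w / w.sum(axis=1, keepdims=True)

def kl_projection_step(theta, target, d, eps):
    """One gradient step on sum_s d(s) KL(target(.|s) || pi_theta(.|s))
    for a tabular softmax policy."""
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    pi = z / z.sum(axis=1, keepdims=True)
    # The gradient of the KL w.r.t. theta[s, b] is pi(b|s) - target(b|s),
    # so the descent direction is target - pi.
    return theta + eps * d[:, None] * (target - pi)

# Improve a uniform policy, then project it back onto the softmax family.
pi0 = np.full((1, 2), 0.5)
target = improve(pi0, np.array([[1.0, 3.0]]))  # [[0.25, 0.75]]
theta = np.zeros((1, 2))
for _ in range(200):
    theta = kl_projection_step(theta, target, np.array([1.0]), eps=0.5)
z = np.exp(theta - theta.max(axis=1, keepdims=True))
pi = z / z.sum(axis=1, keepdims=True)
```

Here the tabular family can represent the improved policy exactly, so repeated projection steps converge to it; with function approximation the projection generally stops short, which is where viewing REINFORCE as a single gradient step of the projection comes in.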

This seems quite reasonable, right? The cool part is that, when we put everything together, if we normalize the distributions and we weight everything by the discounted stationary distribution, we see that the REINFORCE update is nothing more than the improvement step followed by one gradient step of the projection!

$$

\begin{align}

\theta_{t+1} &= \theta_t + \epsilon \sum_s d^{\pi_t}(s) \sum_a \color{red}{ \pi_t(a | s) Q^{\pi_t}(s, a)} \frac{\partial \log \pi_{\theta}(a | s)}{\partial \theta}\bigg|_{\theta = \theta_t} \; ,

\end{align}

$$

The term in red is the policy improvement step and the rest is the KL divergence. Formally, if all $Q^\pi(s,a)$ are positive, the equation above can be seen as doing a gradient step, with respect to $\pi$, starting at $\pi = \pi_t$, to minimize

$$

\begin{align}

D_{V^{\pi_t}\pi_t}(Q^{\pi_t}\pi_t || \pi) &= \sum_s d^{\pi_t}(s) V^{\pi_t}(s) KL(Q^{\pi_t}\pi_t || \pi) \; ,

\end{align}

$$

where $D_{V^{\pi_t}\pi_t}$ and the distribution $Q^\pi\pi$ over actions are defined as

$$

\begin{align}

D_{z}(\mu || \pi) = \sum_s z(s) KL(\mu(\cdot | s) || \pi(\cdot | s)) \; , && Q^\pi\pi(a | s) = \frac{1}{V^\pi(s)}Q^\pi(s, a)\pi(a | s) \; .

\end{align}

$$

Hence, the two operators associated with the state-action formulation are:

$$\begin{align}

\mathcal{I}_V(\pi)(s,a) &= \left(\frac{1}{\mathbb{E}_{d^\pi}[V^\pi]}d^\pi(s) V^\pi(s)\right) Q^\pi\pi(a | s) \\

\mathcal{P}_{V}(\mu) &= \arg\min_{z \in \Pi} \sum_s \mu(s) KL(\mu(\cdot | s) || z(\cdot | s)) \; .

\end{align}

$$

You might be wondering about the assumption up there that all returns $Q^\pi(s,a)$ are positive. Well, as far as we know, we currently need that. However, we are only making explicit what is actually a common assumption. Several algorithms need it, such as those that see RL as inference. One way to deal with that, which is what a lot of the existing approaches do, is to introduce reward transformations to ensure positivity. The most common transformation is the exponential function, used by PPO and MPO. We are still thinking about these things and we feel there’s more to be said/done with respect to that, but I’ll stop here for now.
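As a quick illustration of such a transformation (a sketch with my own naming; the temperature parameter `tau` is mine, for illustration), exponentiating the values gives strictly positive weights while preserving the ordering of actions:

```python
import numpy as np

def positive_weights(Q, tau=1.0):
    """Exponential transformation: strictly positive and order-preserving."""
    return np.exp(Q / tau)

Q = np.array([[-2.0, 0.5, 1.0]])  # some returns are negative
w = positive_weights(Q)           # all weights are now positive
```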

#### Coming up with new operators and the relationship to value-based methods

Once we start to see PG methods from this operators perspective, it is natural to wonder if we can come up with new operators. An obvious thing to try is to have more aggressive improvement steps. The simplest way to do so is to replace $\mathcal{I}_V$ by $\mathcal{I}_V^\alpha: \; \pi \longrightarrow (Q^\pi)^\frac{1}{\alpha}\pi$, with $\alpha \leq 1$. In words, to use the state-action values raised to the $\alpha^{-1}$-th power. Smaller values of $\alpha$ place higher emphasis on state-action values with high expected return and, as $\alpha$ goes to zero, $(Q^{\pi})^\frac{1}{\alpha}\pi$ becomes the deterministic distribution that assigns probability 1 to the action $a^\ast(s) = \arg\max_a Q^\pi(s, a)$.
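A small numerical sketch of $\mathcal{I}_V^\alpha$ (toy numbers, assuming positive $Q$-values; the function name is mine): smaller $\alpha$ sharpens the reweighted policy toward the greedy one.

```python
import numpy as np

def improved_policy(pi, Q, alpha):
    """Apply I_V^alpha: reweight pi by Q**(1/alpha), renormalize per state
    (assumes all Q(s,a) > 0)."""
    w = pi * Q ** (1.0 / alpha)
    return w / w.sum(axis=1, keepdims=True)

pi = np.full((1, 3), 1.0 / 3.0)
Q = np.array([[1.0, 2.0, 3.0]])
soft = improved_policy(pi, Q, alpha=1.0)   # the REINFORCE reweighting
sharp = improved_policy(pi, Q, alpha=0.1)  # nearly greedy on argmax_a Q
```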

It is natural to wonder whether different improvement steps, for $\alpha \neq 1$, still lead to the same fixed point. The answer is no. Maybe surprisingly, projecting a policy that achieves a higher expected return can lead to a worse policy. Nevertheless, in our paper we show that projecting the improved policy back with the $\alpha$-divergence guarantees a fixed point for these bigger improvement steps. For more details about the $\alpha$-divergence, check out this paper.

We can also empirically check if this mismatch really matters, and it does! The figure below depicts some of these analyses in the four-room domain. The leftmost panel shows, as theory predicts, that using more aggressive steps accelerates learning at first, but leads to a suboptimal solution. The center panel shows that this doesn’t happen when we project the improved policies back using the $\alpha$-divergence. The rightmost panel is maybe just a teaser, but it starts to show some other things we can do empirically, such as annealing the $\alpha$ term with time. So yes, the right projection makes all the difference!

But, personally, the reason I’m most excited about this result is that it gives us the result I mentioned at the beginning. **The operator view can be seen as offering us an interpolation between REINFORCE and value-based methods**: we recover REINFORCE when $\alpha = 1$ and, in the limit $\alpha \to 0$, we have the policy that assigns probability 1 to the action $a^\ast(s) = \arg\max_a Q^\pi(s, a)$, and $\mathcal{I}_V^\alpha$ becomes the Bellman optimality operator. From this perspective, one may say that the main difference between REINFORCE and value-based methods is how aggressively they use value estimates to define their policy: REINFORCE generates smooth policies that choose actions proportionally to their estimated value, while value-based methods choose the action with the highest value. That’s cool, isn’t it?

#### Wrapping Up

I wanted to write this blog post to present this new perspective in an easier and more succinct way. I also wanted to present the notion that PG methods and value-based methods are directly related, lying on opposite ends of the same spectrum. Obviously, our paper has a much more formal presentation of these ideas. We discuss related work, we have results about the fixed points of these operators and about the size of the improvements in each step, we discuss the trajectory and state-action formulations, we have other operators (e.g., for PPO and MPO), and more. At the same time, this post (and even our paper) just highlights some of the things this view can give us. There’s still a lot to do and to leverage here!

#### Acknowledgements

Nicolas Le Roux, as always, was very helpful with his thorough feedback. Obviously, any mistakes or imprecisions are my fault.