Value, advantage, baselines
Critics, GAE, and variance reduction
REINFORCE works, but it shudders. The previous chapter ended on its fatal flaw: the gradient is multiplied by the raw reward, so when every response earns a similar positive score, the algorithm enthusiastically pushes up the probability of everything it just did, learning almost nothing about which responses were actually better. This chapter is about taming that variance: first with a free trick called the baseline, then with the value functions and advantage estimates at the core of PPO-style RLHF systems.
A free lunch: subtracting a baseline
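Here is the key observation. We can subtract any constant $b$ from the reward inside the policy gradient without changing what it estimates:

$$\nabla_\theta J(\theta) = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[(R(x, y) - b)\,\nabla_\theta \log \pi_\theta(y \mid x)\big]$$

Here $x$ is the prompt, $y$ the sampled response, and $R(x, y)$ its reward.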
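Why is this allowed? Because the extra term we introduced, $b\,\mathbb{E}_{y \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(y \mid x)]$, is exactly zero. The proof is two lines and worth seeing.

$$\mathbb{E}_{y \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(y \mid x)\big] = \sum_y \pi_\theta(y \mid x)\,\frac{\nabla_\theta \pi_\theta(y \mid x)}{\pi_\theta(y \mid x)} = \sum_y \nabla_\theta \pi_\theta(y \mid x)$$

$$= \nabla_\theta \sum_y \pi_\theta(y \mid x) = \nabla_\theta 1 = 0$$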
So the baseline is a genuine free lunch: it cannot bias the estimator, but a well-chosen $b$ slashes its variance. Intuitively, instead of asking “was this response good?” we ask “was this response better than my typical response?” If $b$ is the average reward, a response that scores exactly the average contributes nothing (no spurious push), while one that scores above its peers gets a strong upward push and one that scores below them gets pushed down. The signal becomes relative, which is exactly what we want.
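The free-lunch claim can be checked numerically. The sketch below is only an illustration under stated assumptions: a three-action softmax policy with uniform probabilities and made-up rewards (`rewards`, `softmax`, and `estimator_stats` are hypothetical names, not from the chapter). It computes the exact mean and total variance of the single-sample gradient estimator with respect to the logits, once with no baseline and once with the mean reward as baseline.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def estimator_stats(probs, rewards, baseline):
    """Exact mean and total variance of g(a) = (R(a) - b) * grad log pi(a)."""
    k = len(probs)
    # For a softmax policy, d log pi(a) / d logit_i = 1[i == a] - pi_i.
    samples = [
        [(rewards[a] - baseline) * ((1.0 if i == a else 0.0) - probs[i])
         for i in range(k)]
        for a in range(k)
    ]
    # Expectation over actions drawn from the policy.
    mean = [sum(probs[a] * samples[a][i] for a in range(k)) for i in range(k)]
    # Total variance: expected squared distance from the mean gradient.
    variance = sum(
        probs[a] * sum((samples[a][i] - mean[i]) ** 2 for i in range(k))
        for a in range(k)
    )
    return mean, variance

probs = softmax([0.0, 0.0, 0.0])
rewards = [4.0, 5.0, 6.0]  # hypothetical rewards: all positive, all similar
mean_reward = sum(p * r for p, r in zip(probs, rewards))

mean_raw, var_raw = estimator_stats(probs, rewards, baseline=0.0)
mean_base, var_base = estimator_stats(probs, rewards, baseline=mean_reward)
print("no baseline:  ", mean_raw, var_raw)
print("mean baseline:", mean_base, var_base)
```

In this toy setting the two mean gradients agree, while the variance with the mean-reward baseline is far smaller. How large the gap is in practice depends on the reward scale and the policy.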
The simplest useful baseline is just the mean reward of the current batch of rollouts. That alone helps enormously, and, as a preview, it is essentially what GRPO (chapter 23) does to avoid training a separate network at all.
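As a minimal sketch of the batch-mean baseline, assuming one scalar reward per rollout (the function name `batch_mean_advantages` is hypothetical), centering the rewards is a one-liner:

```python
def batch_mean_advantages(rewards):
    """Center each reward on the batch mean: A_i = R_i - mean(R)."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]

# Identical scores: every centered reward is zero, so nothing gets pushed.
print(batch_mean_advantages([5.0, 5.0, 5.0, 5.0]))
# Mixed scores: only the rollouts that differ from the mean carry signal.
print(batch_mean_advantages([5.0, 5.0, 9.0, 1.0]))
```

The centered values always sum to zero, which is the "relative signal" described above. This shows mean-centering only, not a full algorithm.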
The value function and the critic
We can do better than a single batch-wide constant. The ideal baseline is state-dependent: the reward we expect from a given starting point. That is the value function, the expected return (discounted by a factor $\gamma$) from state $s$ under the current policy $\pi$:

$$V^\pi(s) = \mathbb{E}_{\pi}\Big[\sum_{k \ge 0} \gamma^k r_{t+k} \,\Big|\, s_t = s\Big]$$
$V^\pi(s)$ answers: “Starting from state $s$ (this prompt, these tokens so far), how much reward do I expect this policy to earn from here on out?” An easy prompt where the model usually does well has a high $V^\pi(s)$; a hard one has a low $V^\pi(s)$. Subtracting $V^\pi(s)$ as the baseline asks the sharpest possible question: did this rollout beat what was expected from this particular state?
We don’t know $V^\pi$, so we learn it with a second network: the critic. The critic is typically a copy of the model with a scalar output head, trained by regression to predict the returns the policy actually receives. The policy (the “actor”) proposes; the critic evaluates. This actor–critic split is the structural backbone of PPO.
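To make "trained by regression" concrete, here is a deliberately tiny sketch. It assumes a lookup table in place of the neural critic and made-up `(state, return)` pairs; `fit_tabular_critic` and the state labels are hypothetical. Each update is one squared-error gradient step toward an observed return.

```python
def fit_tabular_critic(samples, lr=0.05, epochs=500):
    """Regress V[state] toward observed returns with squared-error SGD."""
    values = {}
    for _ in range(epochs):
        for state, observed_return in samples:
            v = values.get(state, 0.0)
            # Gradient step on 0.5 * (v - observed_return) ** 2.
            values[state] = v - lr * (v - observed_return)
    return values

# Hypothetical (state, return) pairs collected from rollouts of the policy.
samples = [
    ("easy prompt", 0.9), ("easy prompt", 0.7),
    ("hard prompt", 0.2), ("hard prompt", 0.0),
]
critic = fit_tabular_critic(samples)
print(critic)
```

The learned entries settle close to the average return seen from each state, which is what the critic is meant to predict. A real critic shares parameters across states through the network, so it can generalize to prompts it has not seen.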
Advantage: better or worse than expected
Put the pieces together and you get the central quantity of policy optimization, the advantage:

$$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$$
where $Q^\pi(s, a)$ is the expected reward of taking action $a$ in state $s$ and then continuing. In words: how much better is this specific action than the policy’s average behavior in this state? A positive advantage means “this was a pleasant surprise, so do more of it”; a negative advantage means “worse than usual, so do less.” Replacing the raw reward with the advantage gives the modern policy gradient:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\Big[\sum_t A^{\pi_\theta}(s_t, a_t)\,\nabla_\theta \log \pi_\theta(a_t \mid s_t)\Big]$$
For a terminal reward $R$ with a learned baseline $V_\phi$, this is just $\hat{A} = R - V_\phi(s)$, reward minus the critic’s prediction. The advantage is the cleaned-up, centered learning signal: the variance-inflating level of the reward has been subtracted away, leaving only the informative part.
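A small worked example shows why the state-dependent baseline matters. The critic predictions below are invented for illustration, and `terminal_advantage` is a hypothetical helper. The same terminal reward counts as a disappointment on a prompt where the critic expected more, and as a pleasant surprise on one where it expected less.

```python
def terminal_advantage(reward, predicted_value):
    """Advantage for a single terminal reward: reward minus the critic's prediction."""
    return reward - predicted_value

# Hypothetical critic predictions for two starting states.
critic_prediction = {"easy prompt": 0.8, "hard prompt": 0.1}
reward = 0.5  # the same score on both prompts

for state, value in critic_prediction.items():
    print(state, terminal_advantage(reward, value))
```

The easy prompt yields a negative advantage and the hard prompt a positive one, even though the raw reward is identical.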
GAE: trading bias against variance
There’s one more wrinkle, and it’s where the real engineering lives. To compute the advantage we need to estimate returns, and we have a spectrum of ways to do it.
At one extreme, use the actual reward earned over the whole rollout. This is unbiased — it’s what really happened — but high-variance, because a single noisy outcome stands in for the expectation. At the other extreme, lean entirely on the critic’s one-step prediction. This is low-variance (the critic averages over many episodes) but biased (the critic is imperfect). Neither extreme is ideal; we want a dial between them.
That dial is Generalized Advantage Estimation (GAE, Schulman 2016). It is built from the per-step temporal-difference error:

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$
Each $\delta_t$ is a small one-step “surprise”: the reward you just got, plus the discounted value of where you landed, minus the value of where you were. GAE then sums these surprises down the trajectory with an exponentially decaying weight:

$$\hat{A}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l}$$
Two knobs. The discount $\gamma$ sets how much future reward counts now. The GAE parameter $\lambda$ is the bias/variance dial:
- $\lambda = 0$ collapses the sum to a single term, $\hat{A}_t = \delta_t$: pure reliance on the critic. Low variance, higher bias.
- $\lambda = 1$ recovers the full Monte-Carlo return minus the baseline: unbiased, but high variance.
- Intermediate $\lambda$ (a value close to 1 is typical for RLHF) blends them, keeping most of the variance reduction while paying only a little bias.
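The two formulas and the two limiting cases can be sketched in a few lines. This is an illustrative implementation, not a reference one. It assumes a finite rollout, so the infinite sum stops at the end of the episode, and a `values` list with one extra entry for the state after the last step (set to 0 here). The rewards and value estimates are made up.

```python
def td_errors(rewards, values, gamma):
    """delta_t = r_t + gamma * V(s_{t+1}) - V(s_t); values has len(rewards) + 1 entries."""
    return [r + gamma * values[t + 1] - values[t] for t, r in enumerate(rewards)]

def gae(rewards, values, gamma, lam):
    """A_t = sum_l (gamma * lam)^l * delta_{t+l}, accumulated backwards in one pass."""
    deltas = td_errors(rewards, values, gamma)
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # Recursion: A_t = delta_t + gamma * lam * A_{t+1}.
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages

rewards = [0.0, 0.0, 1.0]        # hypothetical per-step rewards
values = [0.5, 0.6, 0.7, 0.0]    # hypothetical critic outputs; last entry is post-episode
gamma = 0.99

print("lambda=0   :", gae(rewards, values, gamma, lam=0.0))
print("lambda=0.95:", gae(rewards, values, gamma, lam=0.95))
print("lambda=1   :", gae(rewards, values, gamma, lam=1.0))
```

With `lam=0.0` the output equals the raw TD errors, and with `lam=1.0` it equals the discounted return from each step minus the critic's value at that step, matching the first two bullets.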
Try it
Below, a short rollout with per-step rewards and a learned value estimate. Turn the $\gamma$ and $\lambda$ dials and watch the advantage estimates along the trajectory respond: crank $\lambda$ toward 1 and the estimates get spikier (high variance); pull it toward 0 and they smooth out toward the critic’s one-step view (low variance, more bias). This single picture is the heart of why PPO is stable.
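The variance half of that trade-off can also be reproduced offline with a small simulation. Everything here is an assumption chosen for illustration: rewards are pure Gaussian noise, the critic predicts 0 everywhere (which happens to be correct for this toy environment, so no bias appears), and the helper names are hypothetical. The script measures how much the first-step advantage estimate varies across rollouts for several values of $\lambda$.

```python
import random
import statistics

def gae_first_step(rewards, values, gamma, lam):
    """GAE advantage at t = 0 for one rollout (values has len(rewards) + 1 entries)."""
    advantage = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        advantage = delta + gamma * lam * advantage
    return advantage

def spread_of_first_advantage(lam, n_rollouts=2000, horizon=5, gamma=1.0, seed=0):
    """Variance of the t = 0 advantage estimate across simulated rollouts."""
    rng = random.Random(seed)
    values = [0.0] * (horizon + 1)  # hypothetical critic: predicts 0 in every state
    estimates = []
    for _ in range(n_rollouts):
        rewards = [rng.gauss(0.0, 1.0) for _ in range(horizon)]  # noise-only rewards
        estimates.append(gae_first_step(rewards, values, gamma, lam))
    return statistics.pvariance(estimates)

for lam in (0.0, 0.5, 0.95, 1.0):
    print(lam, round(spread_of_first_advantage(lam), 3))
```

The spread grows as $\lambda$ moves toward 1, because more noisy future rewards enter each estimate. Showing the bias side would require a deliberately wrong critic, which this sketch leaves out.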
Where this is heading
Baselines, a critic, and GAE give us a low-variance advantage signal $\hat{A}_t$. But we still haven’t fixed the other failure mode of naive policy gradients: taking a step so large that it wrecks the policy in a single update. That is the problem trust regions and clipping solve, and it brings us to the algorithm at the center of RLHF, PPO, in the next chapter.