Section 24

GRPO refinements

DAPO, Dr.GRPO, VAPO, RLOO, REINFORCE++

Papers: DAPO: An Open-Source LLM Reinforcement Learning System at Scale — Yu et al., 2025 · Understanding R1-Zero-Like Training (Dr. GRPO) — Liu et al., 2025 · VAPO — Yue et al., 2025 · RLOO — Ahmadian et al., 2024 · REINFORCE++ — Hu et al., 2025

GRPO lit the fuse for DeepSeek-R1, but the moment thousands of people tried to reproduce it, the cracks showed: runs that collapsed into repetition, entropy that flatlined after a few hundred steps, and rewards that crept up while the actual reasoning stopped improving. The first half of 2025 was a flurry of papers fixing GRPO, and the striking thing is that almost every fix is a rediscovery of something the reinforcement-learning literature already knew about variance reduction and exploration. This chapter walks the wave.

DAPO: GRPO with four sharp edges filed off

The most complete public fix is DAPO (Yu et al., 2025), from ByteDance and Tsinghua. Crucially, it shipped as a fully open system: code, data, and the exact recipe, where DeepSeek had described R1 but not handed over the training stack. DAPO is GRPO plus four targeted tricks, each aimed at a specific failure mode the authors hit when scaling long chain-of-thought RL.

  • Clip-Higher. Standard PPO/GRPO uses a symmetric clipped surrogate with the same $\epsilon$ on both sides. The problem: the upper clip at $1+\epsilon$ silently caps how much probability mass a rare but good token can gain, so exploration dies and entropy collapses. With $\epsilon = 0.2$, a token at probability 0.01 can rise to at most 0.012 in a single update. DAPO decouples the bounds, keeping $\epsilon_{\text{low}}$ and raising the upper clip to a larger $\epsilon_{\text{high}}$, which lets promising low-probability tokens grow:
$$\min\!\left(r_t\,\hat{A}_t,\ \operatorname{clip}\!\left(r_t,\,1-\epsilon_{\text{low}},\,1+\epsilon_{\text{high}}\right)\hat{A}_t\right)$$
  • Dynamic Sampling. When every completion in a group is right (or every one is wrong), the group’s reward variance is zero, so the group-relative advantage is zero for all of them and that prompt contributes no gradient. DAPO filters these out and keeps sampling until the batch is full of prompts that actually carry signal, which both speeds convergence and stops the effective batch from quietly shrinking.
  • Token-Level loss. GRPO normalizes the loss per sample, which means a 4,000-token answer and a 40-token answer get the same total weight — so individual tokens in long answers are under-counted, and long-form reasoning is poorly shaped. DAPO averages over tokens across the whole batch, giving every token equal pull regardless of which response it lived in.
  • Overlong Reward Shaping. Truncated answers (cut off at the length limit) would otherwise be scored as wrong, teaching the model that being long is bad rather than that rambling is bad. DAPO adds a soft, length-aware penalty near the limit instead of a hard truncation cliff, a form of length-reward shaping that controls verbosity without punishing legitimately long reasoning.
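The two loss-side tricks in the list above, Clip-Higher and token-level averaging, can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not DAPO's actual code: the function names are made up for this example, each response is represented as a list of (ratio, advantage) pairs, and the defaults `eps_low=0.2` and `eps_high=0.28` are assumed illustrative values.

```python
def clipped_term(ratio, adv, eps_low=0.2, eps_high=0.28):
    """Per-token surrogate: min(r*A, clip(r, 1-eps_low, 1+eps_high)*A)."""
    clipped = max(1.0 - eps_low, min(ratio, 1.0 + eps_high))
    return min(ratio * adv, clipped * adv)


def sample_level_objective(batch, **eps):
    """GRPO-style: average tokens inside each response, then average responses."""
    per_response = [
        sum(clipped_term(r, a, **eps) for r, a in resp) / len(resp)
        for resp in batch
    ]
    return sum(per_response) / len(per_response)


def token_level_objective(batch, **eps):
    """DAPO-style: a single average over every token in the batch."""
    terms = [clipped_term(r, a, **eps) for resp in batch for r, a in resp]
    return sum(terms) / len(terms)


# A good token whose ratio has grown to 1.25 (positive advantage).
# With symmetric clipping the term is frozen at 1.2; with the higher
# upper bound the ratio is still inside the range and keeps its gradient.
symmetric = clipped_term(1.25, 1.0, eps_low=0.2, eps_high=0.2)
clip_higher = clipped_term(1.25, 1.0)

# One short correct answer and one long wrong answer, all ratios at 1.0.
short_good = [(1.0, 1.0)] * 40
long_bad = [(1.0, -1.0)] * 4000
batch = [short_good, long_bad]

per_sample = sample_level_objective(batch)  # the two responses cancel out
per_token = token_level_objective(batch)    # the 4000 bad tokens dominate
```

In the sample-level average, each of the 4,000 tokens in the long answer carries one hundredth of the weight of a token in the short answer, which is why the two responses cancel exactly. The token-level average gives every token the same weight, so the long wrong answer is no longer discounted.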

With these, DAPO reached 50 points on AIME 2024 with Qwen2.5-32B, surpassing the 47 points of DeepSeek-R1-Zero-Qwen-32B in roughly half the training steps. Because it’s open, it also became a widely forked reference implementation.
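The two data-side tricks, Dynamic Sampling and Overlong Reward Shaping, are just as easy to sketch. The code below is a hedged illustration, not the released implementation: `fill_batch`, `has_signal`, and `sample_group` are hypothetical names, and the piecewise-linear penalty with `max_len=20480` and `buffer=4096` is an assumed shape and set of constants chosen to show a soft ramp in place of a cliff.

```python
def has_signal(rewards):
    """A group carries gradient only if its rewards are not all identical."""
    return max(rewards) != min(rewards)


def fill_batch(prompts, sample_group, batch_size):
    """Keep sampling prompts until batch_size groups with signal are collected.

    sample_group is a hypothetical callable: prompt -> list of rewards,
    one per sampled completion.
    """
    batch = []
    for prompt in prompts:
        rewards = sample_group(prompt)
        if has_signal(rewards):
            batch.append((prompt, rewards))
        if len(batch) == batch_size:
            break
    return batch


def overlong_penalty(length, max_len=20480, buffer=4096):
    """Soft length penalty: 0 below the buffer zone, ramping linearly to -1."""
    soft_start = max_len - buffer
    if length <= soft_start:
        return 0.0
    if length <= max_len:
        return (soft_start - length) / buffer
    return -1.0


# Toy reward table standing in for real rollouts (insertion order is kept).
toy = {
    "easy": [1, 1, 1, 1],   # all correct: zero advantage, skipped
    "hard": [0, 0, 0, 0],   # all wrong: zero advantage, skipped
    "mid": [1, 0, 0, 1],
    "mid2": [0, 1, 0, 0],
}
batch = fill_batch(toy, toy.get, batch_size=2)
```

With these assumed constants, a response of 18,432 tokens sits halfway through the buffer zone and receives a penalty of -0.5, while anything at or under 16,384 tokens is untouched.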

Dr. GRPO: the bias hiding in the denominator

Dr. GRPO (Liu et al., 2025) takes the opposite tack: not adding machinery but removing it. The authors show GRPO’s normalization contains two statistical biases. First, dividing each response’s summed token loss by the response length makes the per-token gradient depend on how long the answer is. A wrong answer’s penalty is spread thinner the longer it gets, so long wrong answers are penalized less per token and response length inflates over training (the dreaded “length hacking”). Second, dividing by the group’s reward standard deviation up-weights prompts whose groups have low reward variance, meaning those that are nearly always solved or nearly always failed in that batch, which distorts the objective. Dr. GRPO drops both normalizers, recovering an unbiased estimator that keeps response length in check and the optimization honest. It’s a clean reminder that a “harmless-looking” normalization can quietly encode a preference you never intended.
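Both biases show up in a few lines of arithmetic. The sketch below is illustrative only: the function names are invented for this example, rewards are binary, and the population standard deviation with a small `eps` is an assumption (implementations differ on this detail).

```python
import statistics


def grpo_advantages(rewards, eps=1e-6):
    """Mean-centered and divided by the group's standard deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def dr_grpo_advantages(rewards):
    """Mean-centered only: no standard-deviation scaling."""
    mean = statistics.fmean(rewards)
    return [r - mean for r in rewards]


def per_token_weight(advantage, length, length_normalized):
    """Weight each token receives; length-normalized losses divide by length."""
    return advantage / length if length_normalized else advantage


# Std bias: the same correct answer is weighted very differently depending
# on how many siblings in its group also happened to be correct.
hard = [1, 0, 0, 0, 0, 0, 0, 0]
medium = [1, 1, 1, 1, 0, 0, 0, 0]
grpo_hard, grpo_medium = grpo_advantages(hard)[0], grpo_advantages(medium)[0]
dr_hard, dr_medium = dr_grpo_advantages(hard)[0], dr_grpo_advantages(medium)[0]

# Length bias: two wrong answers with the same advantage of -0.5.
short_wrong = per_token_weight(-0.5, 100, length_normalized=True)
long_wrong = per_token_weight(-0.5, 1000, length_normalized=True)
```

Under these assumptions the standard-deviation scaling gives the lone correct answer in the hard group an advantage of about 2.65, against about 1.0 in the medium group, while the unscaled advantages are 0.875 and 0.5. The length-normalized penalty per token is ten times weaker for the 1,000-token wrong answer than for the 100-token one.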

VAPO: bringing the critic back

GRPO’s headline selling point was killing the critic: no value function to train, roughly half the memory. VAPO (Yue et al., 2025) argues that for long chains of thought, that was a false economy. With rewards arriving only at the end of a 10,000-token answer, the group-mean baseline is a blunt instrument, because it assigns the same advantage to every token in a response. A trained value function gives a far lower-variance, per-token advantage via GAE, which matters most exactly when the horizon is long. VAPO is value-augmented PPO built on DAPO’s tricks, and it edged out DAPO on the same AIME benchmark, which is evidence that the critic-free era is a pendulum, not a one-way door.
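To make the per-token claim concrete, here is a minimal sketch of standard GAE applied to a response whose only reward arrives on the final token. It is plain textbook GAE, not VAPO's full recipe (which layers further modifications on top), and the critic values are invented numbers for illustration.

```python
def gae(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.

    values has len(rewards) + 1 entries; the last one is the bootstrap
    value after the final token (0.0 when the episode has ended).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error at token t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# Four tokens, reward of 1 only at the end.
rewards = [0.0, 0.0, 0.0, 1.0]

# An uninformative critic that predicts 0.5 everywhere: with lam=1 every
# token gets the same advantage, just like a response-level baseline.
flat_critic = [0.5, 0.5, 0.5, 0.5, 0.0]
uniform = gae(rewards, flat_critic, lam=1.0)

# A critic whose estimate jumps after token 1 (say, the key step of the
# solution): with lam=0 the credit concentrates on that token.
informed_critic = [0.2, 0.2, 0.9, 0.9, 0.0]
localized = gae(rewards, informed_critic, lam=0.0)
```

With the flat critic every token gets an advantage of 0.5, which is all a group baseline can offer. With the informed critic, token 1 gets about 0.7 and the final token about 0.1, so most of the credit lands on the step where the predicted outcome changed.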

RLOO and REINFORCE++: maybe PPO was overkill

A parallel thread asks whether we needed PPO’s clipping and critics at all. RLOO (REINFORCE Leave-One-Out; Ahmadian et al., 2024) points out that for LLMs, where a full response can be treated as a single action and the reward is assigned only to the completed response, plain REINFORCE with a good baseline is enough. Its baseline is elegantly simple: for each of the $k$ sampled completions, use the mean reward of the other $k-1$ as that sample’s baseline. That leave-one-out estimate is unbiased and needs no critic. It is essentially GRPO’s group baseline without the standard-deviation scaling, derived independently from first principles.
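The kinship with GRPO's baseline is easy to check numerically. In the sketch below (function names are illustrative), the leave-one-out advantage turns out to be the plain mean-centered advantage scaled by k/(k-1), which follows from r_i - (S - r_i)/(k-1) = (k/(k-1)) (r_i - S/k), where S is the sum of the group's rewards.

```python
def rloo_advantages(rewards):
    """Each sample's baseline is the mean reward of the other k-1 samples."""
    k = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


def mean_centered(rewards):
    """Group-mean baseline: every sample shares the same mean, itself included."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


rewards = [1.0, 0.0, 0.0, 1.0, 1.0]
k = len(rewards)

loo = rloo_advantages(rewards)
centered = mean_centered(rewards)
rescaled = [a * k / (k - 1) for a in centered]
```

For this group of five, the leave-one-out advantages are 0.5 for each correct answer and -0.75 for each wrong one, exactly 1.25 times the mean-centered values of 0.4 and -0.6.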

REINFORCE++ (Hu et al., 2025) splits the difference: keep critic-free REINFORCE, but bolt on PPO’s stabilizers (a token-level KL penalty, reward normalization, and clipping) to get robustness without a value network. It’s positioned as a strong, simple baseline that’s harder to make explode than vanilla GRPO.
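One way to picture how those stabilizers fit together without a critic is sketched below. This is a loose illustration under assumptions, not the paper's implementation: the function names, the simple log-ratio KL estimate, the undiscounted return-to-go, and the batch-wide normalization are all choices made for this example, and the clipping step is omitted.

```python
import statistics


def token_rewards(final_reward, logp, logp_ref, beta=0.01):
    """Per-token reward: a KL penalty on every token, task reward on the last."""
    rewards = [-beta * (lp - lr) for lp, lr in zip(logp, logp_ref)]
    rewards[-1] += final_reward
    return rewards


def returns_to_go(rewards):
    """Undiscounted sum of rewards from each token to the end of the response."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running += r
        out.append(running)
    return out[::-1]


def normalize_across_batch(per_response):
    """Standardize using statistics of the whole batch, not of each group."""
    flat = [a for resp in per_response for a in resp]
    mean, std = statistics.fmean(flat), statistics.pstdev(flat)
    return [[(a - mean) / (std + 1e-8) for a in resp] for resp in per_response]


# Response A: rewarded, with some drift from the reference policy.
rew_a = token_rewards(1.0, logp=[-1.0, -2.0, -0.5],
                      logp_ref=[-1.0, -1.0, -1.5], beta=0.1)
# Response B: unrewarded, identical to the reference policy.
rew_b = token_rewards(0.0, logp=[-1.0, -1.0], logp_ref=[-1.0, -1.0], beta=0.1)

ret_a, ret_b = returns_to_go(rew_a), returns_to_go(rew_b)
advantages = normalize_across_batch([ret_a, ret_b])
```

The KL term plays the role a critic would otherwise play in differentiating tokens: in response A the last token, which drifted above the reference, ends up with a slightly lower return (0.9) than the earlier tokens (1.0).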

Try it

The knobs below are DAPO’s four tricks on a toy long-CoT training run. Toggle Clip-Higher, Dynamic Sampling, Token-Level loss, and Overlong Shaping on and off, and watch the effect on the reward curve, the entropy, and the average response length. The default GRPO baseline (everything off) shows the classic failure: reward stalls while entropy collapses and length balloons. Turn the tricks on one at a time to see which symptom each one treats.

[Interactive widget: DAPO’s four knobs over GRPO. Toggles: Clip-Higher, Dynamic Sampling, Token-Level loss, Overlong Shaping. Panels: reward / accuracy vs. training steps (0.0 to 1.0, plain GRPO vs. current setting) and policy entropy (exploration). With all four on (full DAPO), the curve climbs highest and smoothest while entropy stays healthy; without Clip-Higher, the policy keeps narrowing, entropy collapses, and the reward curve sags late in training. Curves are illustrative and qualitative, not real runs.]

A few things to notice as you experiment:

  • Clip-Higher is the entropy medicine — turn it off and watch the entropy line nose-dive as exploration dies.
  • Overlong Shaping and Token-Level loss are the length medicine — without them, average response length drifts up without a matching reward gain, the visual signature of length hacking.
  • Dynamic Sampling mostly buys speed: the reward curve climbs faster because no step is wasted on zero-variance groups.

No single trick is magic; the point of the widget is that they’re complementary, each closing one specific leak.

Where this leaves us

The algorithm zoo has gotten crowded — DAPO, Dr. GRPO, VAPO, RLOO, REINFORCE++ — but they’re variations on a now-stable theme: a group baseline (or a revived critic) for the advantage, asymmetric clipping and KL for stability, and length-aware shaping to keep answers honest. With the optimizer mostly settled, the action moves to assembling whole pipelines at scale — how the open ecosystem stitches SFT, preference optimization, and verifiable-reward RL into a single reproducible recipe. That’s the next chapter.