Section 24

GRPO refinements

DAPO, Dr.GRPO, VAPO, RLOO, REINFORCE++

Papers: DAPO: An Open-Source LLM Reinforcement Learning System at Scale — Yu et al., 2025 · Understanding R1-Zero-Like Training (Dr. GRPO) — Liu et al., 2025 · VAPO — Yue et al., 2025 · RLOO — Ahmadian et al., 2024 · REINFORCE++ — Hu et al., 2025

GRPO lit the fuse for DeepSeek-R1, but the moment thousands of people tried to reproduce it, the cracks showed: runs that collapsed into repetition, entropy that flatlined after a few hundred steps, and rewards that crept up while the actual reasoning stopped improving. The first half of 2025 was a flurry of papers fixing GRPO — and the striking thing is that almost every fix is a rediscovery of something the reinforcement-learning literature already knew about variance reduction and exploration. This chapter walks the wave.

DAPO: GRPO with four sharp edges filed off

The most complete public fix is DAPO (Yu et al., 2025), from ByteDance and Tsinghua — and crucially, it shipped as a fully open system: code, data, and the exact recipe, where DeepSeek had described R1 but not handed over the training stack. DAPO is GRPO plus four targeted tricks, each aimed at a specific failure mode the authors hit when scaling long chain-of-thought RL.

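Clip-Higher. Standard PPO/GRPO uses a symmetric clipped surrogate with the same $\epsilon$ on both sides. The problem: the upper clip silently caps how much probability mass a rare but good token can gain in one update (with $\epsilon = 0.2$, a token at probability 0.01 can climb to at most 0.012), so exploration dies and entropy collapses. DAPO decouples the bounds, keeping $\epsilon_{\text{low}}$ small while raising the upper clip $\epsilon_{\text{high}}$, which lets promising low-probability tokens grow. Here $r_t$ is the probability ratio between the new and old policy for token $t$, and $\hat{A}_t$ is that token’s advantage estimate: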

$$
\min\!\left(r_t\,\hat{A}_t,\ \operatorname{clip}(r_t,\,1-\epsilon_{\text{low}},\,1+\epsilon_{\text{high}})\,\hat{A}_t\right)
$$
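To make the decoupled clip concrete, here is a minimal per-token sketch in plain Python. The function name `clipped_surrogate` and the default values for `eps_low` and `eps_high` are illustrative assumptions, not taken from any particular codebase.

```python
def clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """Per-token clipped surrogate with decoupled clip bounds.

    ratio:     pi_new(token) / pi_old(token)
    advantage: advantage estimate for that token
    eps_low / eps_high: illustrative defaults (assumed), not prescribed values
    """
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)


# A token with positive advantage whose ratio has already grown to 1.25.
symmetric = clipped_surrogate(1.25, 1.0, eps_low=0.2, eps_high=0.2)
clip_higher = clipped_surrogate(1.25, 1.0, eps_low=0.2, eps_high=0.28)

# A token with negative advantage whose ratio has dropped to 0.7.
pushed_down = clipped_surrogate(0.7, -1.0)

print(symmetric, clip_higher, pushed_down)
```

With the symmetric bound the surrogate is pinned at the clip value, so that token stops contributing gradient; with the wider upper bound the same token is still inside the trust region and keeps learning. The lower bound behaves identically in both settings.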

Dynamic Sampling. When every completion in a group is right (or every one is wrong), the group’s reward variance is zero, so the group-relative advantage is zero for all of them — that prompt contributes no gradient. DAPO filters these out and keeps sampling until the batch is full of prompts that actually carry signal, which both speeds convergence and stops the effective batch from quietly shrinking.
Token-Level loss. GRPO normalizes the loss per sample, which means a 4,000-token answer and a 40-token answer get the same total weight — so individual tokens in long answers are under-counted, and long-form reasoning is poorly shaped. DAPO averages over tokens across the whole batch, giving every token equal pull regardless of which response it lived in.
Overlong Reward Shaping. Truncated answers (cut off at the length limit) would otherwise be scored as wrong, teaching the model that being long is bad rather than that rambling is bad. DAPO adds a soft, length-aware penalty near the limit instead of a hard truncation cliff — a form of length-reward shaping that controls verbosity without punishing legitimately long reasoning.
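The remaining three tricks can each be sketched in a few lines of plain Python. All function names, the example rewards, and the length parameters (`max_len`, `buffer_len`) are hypothetical, chosen for illustration; the penalty is one simple linear-ramp shape consistent with the description above, and the filter shows only the "drop zero-variance groups" step, not the resampling loop around it.

```python
def has_signal(rewards):
    """Dynamic Sampling filter: keep a group only if its rewards differ."""
    return max(rewards) != min(rewards)


def sample_level_loss(token_losses):
    """Per-sample normalization: average inside each response, then across."""
    per_response = [sum(t) / len(t) for t in token_losses]
    return sum(per_response) / len(per_response)


def token_level_loss(token_losses):
    """Token-level normalization: one average over every token in the batch."""
    flat = [x for t in token_losses for x in t]
    return sum(flat) / len(flat)


def soft_overlong_penalty(length, max_len, buffer_len):
    """Zero until the buffer zone, then a linear ramp down to -1 at max_len."""
    start = max_len - buffer_len
    if length <= start:
        return 0.0
    if length <= max_len:
        return (start - length) / buffer_len
    return -1.0


# Dynamic Sampling: only the mixed group survives the filter.
groups = {"p1": [1, 1, 1, 1], "p2": [1, 0, 0, 1], "p3": [0, 0, 0, 0]}
kept = [p for p, r in groups.items() if has_signal(r)]

# Token-Level loss: a 4-token response and a 40-token response.
losses = [[1.0] * 4, [0.0] * 40]
per_sample = sample_level_loss(losses)   # each response weighted equally
per_token = token_level_loss(losses)     # each token weighted equally

# Overlong shaping: inside budget, halfway into the buffer, past the limit.
penalties = [soft_overlong_penalty(n, 4096, 1024) for n in (3000, 3584, 5000)]
print(kept, per_sample, per_token, penalties)
```

The two loss functions disagree sharply on the same batch because the short response supplies half the weight under per-sample averaging but only 4 of 44 tokens under token-level averaging.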

With these, DAPO reached 50 points on AIME 2024 with Qwen2.5-32B, ahead of the 47 points reported for DeepSeek-R1-Zero-Qwen-32B and in roughly half the training steps. Because it’s open, it became the reference implementation everyone forks.

Dr. GRPO: the bias hiding in the denominator

Dr. GRPO (Liu et al., 2025) takes the opposite tack: not adding machinery but removing it. The authors show GRPO’s advantage normalization contains two statistical biases. First, dividing each response’s loss by its length makes the per-token gradient depend on how long the answer is. A negative advantage spread over more tokens penalizes each token less, so the objective systematically favors longer wrong answers and inflates response length over training (the dreaded “length hacking”). Second, dividing by the group’s reward standard deviation re-weights prompts by how hard they happen to be: groups with nearly uniform rewards (very easy or very hard prompts) have a small standard deviation and get amplified, distorting the objective. Dr. GRPO drops both normalizers, recovering an unbiased estimator that keeps responses short and the optimization honest. It’s a clean reminder that a “harmless-looking” normalization can quietly encode a preference you never intended.
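Both biases are easy to see numerically. The sketch below is a toy illustration in plain Python; the function names are hypothetical, and using the population standard deviation (`pstdev`) with a small `eps` is an assumption, since implementations differ on that detail.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Mean-centered and divided by the group's reward std (GRPO-style)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def dr_grpo_advantages(rewards):
    """Mean-centered only: no std division (Dr. GRPO-style)."""
    mu = mean(rewards)
    return [r - mu for r in rewards]


def per_token_signal(advantage, length):
    """With 1/length normalization, each token receives advantage / length."""
    return advantage / length


# Std bias: an easy prompt (7 of 8 correct) vs. an evenly split prompt.
easy = [1, 1, 1, 1, 1, 1, 1, 0]
split = [1, 0, 1, 0, 1, 0, 1, 0]
easy_wrong_grpo = grpo_advantages(easy)[-1]
split_wrong_grpo = grpo_advantages(split)[-1]
easy_wrong_dr = dr_grpo_advantages(easy)[-1]

# Length bias: the same wrong answer at 100 tokens vs. 1,000 tokens.
short_penalty = per_token_signal(-1.0, 100)
long_penalty = per_token_signal(-1.0, 1000)
print(easy_wrong_grpo, split_wrong_grpo, easy_wrong_dr)
print(short_penalty, long_penalty)
```

Under the std division, the lone wrong answer on the easy prompt receives a much larger magnitude than a wrong answer on the evenly split prompt; without it, the magnitude is bounded by the reward range. Under length normalization, the longer wrong answer is penalized ten times less per token.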

VAPO: bringing the critic back

GRPO’s headline selling point was killing the critic: no value function to train and no second network to hold in memory. VAPO (Yue et al., 2025) argues that for long chains of thought, that was a false economy. With rewards arriving only at the end of a 10,000-token answer, the group-mean baseline is a blunt instrument, since it assigns the same advantage to every token in a response; a trained value function gives a lower-variance, per-token advantage via GAE, which matters most exactly when the horizon is long. VAPO is value-augmented PPO built on DAPO’s tricks, and it reported 60.4 on the same AIME 2024 benchmark against DAPO’s 50, evidence that the critic-free era is a pendulum, not a one-way door.
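For readers who have not seen GAE written out, here is a minimal sketch of the standard estimator on a single response with one terminal reward. This is generic GAE, not VAPO’s specific variant; the function name, the toy `values`, and the `gamma` and `lam` defaults are assumptions for illustration.

```python
def gae(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation over one response.

    rewards[t]: reward received after token t (here, nonzero only at the end)
    values[t]:  critic's value estimate before token t
    The value after the final token is taken to be 0.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD error
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# Four tokens, reward 1.0 only at the end, critic growing more confident.
rewards = [0.0, 0.0, 0.0, 1.0]
values = [0.2, 0.4, 0.7, 0.9]

adv = gae(rewards, values)                       # lam = 0.95
adv_monte_carlo = gae(rewards, values, lam=1.0)  # reduces to return - value
print(adv)
print(adv_monte_carlo)
```

Each token gets its own advantage, shaped by how much the critic’s estimate moved at that step, whereas a group-mean baseline would hand all four tokens the same number.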

RLOO and REINFORCE++: maybe PPO was overkill

A parallel thread asks whether we needed PPO’s clipping and critics at all. RLOO (REINFORCE Leave-One-Out; Ahmadian et al., 2024) points out that for LLMs, where a full response can be treated as one action and the reward arrives once for the finished response, plain REINFORCE with a good baseline is enough. Its baseline is elegantly simple: for each of the $k$ sampled completions, use the mean reward of the other $k-1$ as that sample’s baseline. Because that baseline never depends on the sample it is subtracted from, the gradient estimate stays unbiased, and no critic is needed. It is essentially GRPO’s group baseline, derived independently from first principles.
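The leave-one-out baseline fits in a few lines. The sketch below uses plain Python and a hypothetical function name; it also checks the algebraic identity that the leave-one-out advantage equals the ordinary mean-centered advantage scaled by $k/(k-1)$, which is why it sits so close to a group-mean baseline.

```python
def rloo_advantages(rewards):
    """Advantage of each sample against the mean reward of the other k-1."""
    k = len(rewards)
    if k < 2:
        raise ValueError("leave-one-out needs at least two samples")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


rewards = [1.0, 0.0, 0.0, 1.0]
loo = rloo_advantages(rewards)

# Same thing via the group mean, rescaled by k / (k - 1).
k = len(rewards)
group_mean = sum(rewards) / k
rescaled = [(r - group_mean) * k / (k - 1) for r in rewards]
print(loo)
print(rescaled)
```

Computing the baseline from the running total keeps the cost linear in $k$ rather than recomputing a mean for every sample.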

REINFORCE++ (Hu et al., 2025) splits the difference: keep critic-free REINFORCE, but bolt on PPO’s stabilizers (token-level KL penalty, reward normalization, and clipping) to get robustness without a value network. It’s positioned as a strong, simple baseline that’s harder to make explode than vanilla GRPO.
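As a rough sketch of how those stabilizers might fit together without a critic, the plain-Python example below folds a per-token KL penalty into the reward, sums rewards-to-go, and normalizes across the whole batch. This is an illustration of the ingredients named above under stated assumptions, not the paper’s exact formulation: the function names, the simple log-ratio KL estimate, and the `beta` value are all hypothetical choices.

```python
from statistics import mean, pstdev


def token_rewards(final_reward, logp_policy, logp_ref, beta=0.01):
    """Per-token reward: -beta * (log-ratio KL estimate) at every token,
    plus the sequence-level reward added on the last token."""
    shaped = [-beta * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]
    shaped[-1] += final_reward
    return shaped


def returns_to_go(rewards):
    """Undiscounted sum of rewards from each token to the end."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running += r
        out.append(running)
    return out[::-1]


def normalize_across_batch(batch, eps=1e-8):
    """Standardize using statistics over every token in the batch."""
    flat = [a for seq in batch for a in seq]
    mu, sigma = mean(flat), pstdev(flat)
    return [[(a - mu) / (sigma + eps) for a in seq] for seq in batch]


# Two toy responses: (final reward, policy log-probs, reference log-probs).
batch = [
    (1.0, [-1.0, -2.0], [-1.5, -2.0]),
    (0.0, [-0.5, -0.5, -1.0], [-0.5, -1.0, -1.0]),
]
shaped = [token_rewards(r, lp, lr, beta=0.1) for r, lp, lr in batch]
raw_adv = [returns_to_go(s) for s in shaped]
advantages = normalize_across_batch(raw_adv)
print(raw_adv)
print(advantages)
```

The normalized advantages would then feed a PPO-style clipped objective in place of critic-based estimates.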

Try it

The knobs below are DAPO’s four tricks on a toy long-CoT training run. Toggle Clip-Higher, Dynamic Sampling, Token-Level loss, and Overlong Shaping on and off, and watch the effect on the reward curve, the entropy, and the average response length. The default GRPO baseline (everything off) shows the classic failure: reward stalls while entropy collapses and length balloons. Turn the tricks on one at a time to see which symptom each one treats.

[Interactive figure: DAPO’s four knobs over GRPO. Panels: reward / accuracy vs. training steps, and policy entropy (exploration). With all four fixes on, the reward curve climbs highest and smoothest while entropy stays healthy; without Clip-Higher, the policy keeps narrowing, entropy collapses, and the reward curve sags late in training. Curves are illustrative and qualitative, not real runs.]

A few things to notice as you experiment:

Clip-Higher is the entropy medicine — turn it off and watch the entropy line nose-dive as exploration dies.
Overlong Shaping and Token-Level loss are the length medicine — without them, average response length drifts up without a matching reward gain, the visual signature of length hacking.
Dynamic Sampling mostly buys speed: the reward curve climbs faster because no step is wasted on zero-variance groups.

No single trick is magic; the point of the widget is that they’re complementary, each closing one specific leak.

Where this leaves us

The algorithm zoo has gotten crowded — DAPO, Dr. GRPO, VAPO, RLOO, REINFORCE++ — but they’re variations on a now-stable theme: a group baseline (or a revived critic) for the advantage, asymmetric clipping and KL for stability, and length-aware shaping to keep answers honest. With the optimizer mostly settled, the action moves to assembling whole pipelines at scale — how the open ecosystem stitches SFT, preference optimization, and verifiable-reward RL into a single reproducible recipe. That’s the next chapter.