GRPO & DeepSeek-R1
Group-relative advantage and critic-free RL
Papers: DeepSeekMath (GRPO) — Shao et al., 2024 · DeepSeek-R1 — DeepSeek-AI, 2025
This is the payoff. Everything in this section has been converging here: chain-of-thought made reasoning visible, self-consistency and o1 made it scale with compute, process-vs-outcome rewards taught us about credit assignment, and RLVR gave us a reward we can trust. The last two pieces are an algorithm lean enough to run RL at scale — GRPO — and the model that combined it with verifiable rewards to reproduce o1 in the open and, astonishingly, watched reasoning emerge from a base model with no supervised examples at all — DeepSeek-R1. Let’s finish the arc.
GRPO: throw away the critic
Recall PPO from chapter 13. To compute the advantage (“was this action better or worse than expected?”), PPO trains a second network alongside the policy: a value function, or critic, that estimates expected return at each token. That critic is roughly the same size as the policy, must be trained in lockstep, and roughly doubles the model memory while adding a big chunk of compute to the whole RL run. For a giant LLM, the critic is a brutal tax.
GRPO (Group Relative Policy Optimization), introduced by Shao et al. (2024) in DeepSeekMath, asks: what if we just delete the critic? The critic only ever existed to provide a baseline, a reference point to subtract from each reward so we know whether a sample was above or below average. GRPO gets that baseline a completely different way: sampling.
For each prompt, GRPO samples a whole group of $G$ completions from the current policy. Each completion $o_i$ gets a reward $r_i$ from the verifier. Now the group itself defines “average”: just take the mean reward across the group. The group-relative advantage of completion $i$ is how much it beat the group mean, normalized by the group’s spread:

$$\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)}$$
That’s the whole trick. A completion that scored above the group average gets a positive advantage and is reinforced; one below average gets a negative advantage and is suppressed. The group is its own baseline. No value network, ever.
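To make the group baseline concrete, here is a minimal Python sketch of the advantage computation for a single prompt. The function name `group_relative_advantages`, the small `eps` added to the denominator, and the choice of population standard deviation are assumptions for illustration, not details taken from the papers.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Return (r_i - group mean) / (group std + eps) for one prompt's group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumption)
    # eps (also an assumption) avoids division by zero when all rewards are equal
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical group of four completions: two pass the verifier, two fail.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group where every completion gets the same reward yields all-zero advantages.
flat = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

Passing completions come out with a positive advantage and failing ones with a negative advantage. When every sibling earns the same reward, all advantages are zero, so that prompt contributes no learning signal for the step.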
The GRPO objective
GRPO keeps PPO’s stabilizers: it’s PPO with the advantage swapped out. For each token $t$ in completion $o_i$, let $\rho_{i,t} = \pi_\theta(o_{i,t} \mid q, o_{i,<t}) \,/\, \pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})$ be the importance ratio (new-policy probability over old-policy probability for that token, given prompt $q$). The objective is the familiar clipped surrogate, using the group-relative advantage $\hat{A}_i$, plus a KL penalty to a frozen reference model $\pi_{\text{ref}}$:

$$J_{\text{GRPO}}(\theta) = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \Big\{ \min\!\big( \rho_{i,t}\,\hat{A}_i,\ \operatorname{clip}(\rho_{i,t},\, 1-\varepsilon,\, 1+\varepsilon)\,\hat{A}_i \big) - \beta\, D_{\text{KL}}\!\left[\pi_\theta \,\|\, \pi_{\text{ref}}\right] \Big\} \right]$$

Here $\varepsilon$ is the clip range and $\beta$ weights the KL penalty.
The min/clip is exactly PPO’s trust-region trick: don’t let the policy move too far in one step. The KL term is the leash to the reference model that keeps the policy from drifting into degenerate text. The only structural change from PPO is that the advantage $\hat{A}_i$ comes from the group mean instead of a critic, and that single change removes an entire network from the training loop.
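The clipped surrogate and KL leash described above can be sketched in plain Python over per-token log-probabilities. This is an illustrative sketch, not a training implementation: the function name `grpo_objective`, the default values of `clip_eps` and `beta`, and the per-token KL estimate (ratio minus log-ratio minus one) are assumptions chosen for the example.

```python
import math

def grpo_objective(logp_new, logp_old, logp_ref, advantages,
                   clip_eps=0.2, beta=0.04):
    """Clipped surrogate minus a KL penalty, averaged per token, then per group.

    logp_new, logp_old, logp_ref: one list per completion of per-token
    log-probabilities under the current, sampling-time, and reference policies.
    advantages: one group-relative advantage per completion.
    Returns the quantity to maximize (a trainer would minimize its negative).
    """
    per_completion = []
    for new, old, ref, adv in zip(logp_new, logp_old, logp_ref, advantages):
        token_terms = []
        for lp_new, lp_old, lp_ref in zip(new, old, ref):
            ratio = math.exp(lp_new - lp_old)  # importance ratio for this token
            clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
            surrogate = min(ratio * adv, clipped * adv)
            # Non-negative KL estimate (an assumption): x - log(x) - 1, x = pi_ref / pi_new
            log_x = lp_ref - lp_new
            kl = math.exp(log_x) - log_x - 1.0
            token_terms.append(surrogate - beta * kl)
        per_completion.append(sum(token_terms) / len(token_terms))
    return sum(per_completion) / len(per_completion)

# One hypothetical single-token completion whose probability tripled (0.3 -> 0.9).
clipped_value = grpo_objective(
    logp_new=[[math.log(0.9)]], logp_old=[[math.log(0.3)]],
    logp_ref=[[math.log(0.9)]], advantages=[1.0],
)
```

In the usage example the raw ratio is 3, but with a positive advantage the clip caps the contribution at 1 + `clip_eps`, so the update gets no extra credit for moving further than the trust region allows.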
DeepSeek-R1-Zero: reasoning from nothing
In January 2025, DeepSeek-AI put GRPO and verifiable rewards together and ran an experiment so clean it became instantly famous. They took DeepSeek-V3-Base — a pre-trained base model, with no supervised fine-tuning, no instruction tuning, no demonstrations of how to reason — and trained it with pure GRPO against verifiable rewards (answer-key match for math, test execution for code). That’s it. No SFT cold-start, no human reasoning traces. They called it DeepSeek-R1-Zero.
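As a rough picture of what an answer-key reward for math can look like, here is a minimal Python sketch. The function name `math_reward`, the convention that the final answer appears inside `\boxed{...}`, and exact string comparison are all assumptions for illustration; a real verifier would normalize equivalent expressions, and the test-execution reward for code is not shown.

```python
import re

def math_reward(completion: str, answer_key: str) -> float:
    """Return 1.0 if the last boxed answer equals the key exactly, else 0.0."""
    # Assumed format: the model writes its final answer as \boxed{...}
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    if not matches:
        return 0.0  # no parseable final answer counts as incorrect
    return 1.0 if matches[-1].strip() == answer_key.strip() else 0.0

# Hypothetical completions for a problem whose answer key is "42".
correct = math_reward(r"6 times 7 gives \boxed{42}", "42")
wrong = math_reward(r"I get \boxed{41}", "42")
unformatted = math_reward("The answer is 42.", "42")
```

The reward is binary and computed by a rule rather than a learned model, which is what lets it feed the group-relative advantage directly.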
And it worked. With nothing but a correctness signal and group-relative advantages, the model taught itself to reason. Over training, its chains of thought grew longer on their own. It began to allocate more thinking to harder problems, to decompose them, and — most strikingly — to backtrack and self-correct, all without ever being shown a single example of doing so. These behaviors weren’t programmed or imitated; they emerged because they were the strategies that earned reward. On AIME 2024, R1-Zero climbed from 15.6% to 71.0% pass@1 (and to 86.7% with majority voting) — from barely-trying to genuinely-competitive, on pure RL.
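The two AIME numbers measure different things, and a small Python sketch may help separate them. It assumes pass@1 is estimated as the fraction of independently sampled answers that are correct, and that majority voting scores the single most common answer; the function names and the sample data are hypothetical.

```python
from collections import Counter

def pass_at_1(answers, key):
    """Fraction of sampled answers that match the key (a sampled pass@1 estimate)."""
    return sum(a == key for a in answers) / len(answers)

def majority_vote_correct(answers, key):
    """True if the most common sampled answer matches the key."""
    # Ties are broken by first occurrence in this sketch.
    top_answer, _ = Counter(answers).most_common(1)[0]
    return top_answer == key

# Hypothetical final answers from five samples on one problem whose key is "17".
samples = ["17", "17", "23", "17", "9"]
p1 = pass_at_1(samples, "17")
voted = majority_vote_correct(samples, "17")
```

Here an individual sample is right only 60% of the time, yet the vote is right, which is why a majority-voting score can sit well above pass@1 for the same model.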
DeepSeek-R1: cold-start for readability
R1-Zero proved the principle, but it had warts. Pure-RL reasoning traces were powerful yet unreadable: mixed languages, chaotic formatting, chains that worked but no human would want to read. So the full DeepSeek-R1 added a small amount of supervised cold-start data: a curated set of clean, well-formatted long chain-of-thought examples used to fine-tune the base model before the GRPO stage. Cold-start gave the model a readable, well-behaved starting point; GRPO then drove its reasoning ability up from there, followed by a final alignment pass for helpfulness and safety.
The result was a model that matched OpenAI’s o1 on hard reasoning benchmarks — and, unlike o1, was released openly, with weights and a detailed technical report. The black box of chapter 21 had been reproduced, explained, and opened to the world. The route o1 hid, R1 published.
The whole arc, reassembled
Stand back and look at what just happened across this section. STaR used correctness as a filter. The PRM/ORM work taught us to think about credit assignment. o1 showed reasoning scales with test-time compute. RLVR gave us a trustworthy, hard-to-hack reward. And GRPO supplied a critic-free algorithm lean enough to run that reward at scale. DeepSeek-R1-Zero is what you get when you snap all five together: a base model, a verifier, and group-relative RL, from which reasoning, complete with self-correction and the aha moment, emerges.
This is the climax of the reinforcement-learning story this explainer has been building since the preference era. We began by imitating good answers (SFT), moved to learning human preferences (RLHF), and arrive here: a model improving its own thinking against the bedrock of verifiable truth, inventing strategies no one taught it. The remaining chapters refine the algorithm (DAPO, Dr.GRPO, and friends), scale the recipe across the open ecosystem, and push it into agentic, tool-using, multi-turn settings. But the conceptual summit is this one — the moment reasoning stopped being something we wrote into a model, and became something a model could learn for itself.