Section 23

GRPO & DeepSeek-R1

Group-relative advantage and critic-free RL

Papers: DeepSeekMath (GRPO) — Shao et al., 2024 · DeepSeek-R1 — DeepSeek-AI, 2025

This is the payoff. Everything in this section has been converging here: chain-of-thought made reasoning visible, self-consistency and o1 made it scale with compute, process-vs-outcome rewards taught us about credit assignment, and RLVR gave us a reward we can trust. Two pieces remain. The first is GRPO, an algorithm lean enough to run RL at scale. The second is DeepSeek-R1, the model that combined GRPO with verifiable rewards to reproduce o1-style reasoning in the open and whose R1-Zero variant, astonishingly, showed reasoning emerging from a base model with no supervised examples at all. Let’s finish the arc.

GRPO: throw away the critic

Recall PPO from chapter 13. To compute the advantage (“was this action better or worse than expected?”), PPO trains a second network alongside the policy: a value function that estimates expected return at each token. That critic is typically about the same size as the policy and must be trained in lockstep with it, so it roughly doubles the memory needed for trainable weights and adds a big chunk of compute to the whole RL run. For a giant LLM, the critic is a brutal tax.

GRPO (Group Relative Policy Optimization), introduced by Shao et al. (2024) in DeepSeekMath, asks: what if we just delete the critic? The critic only ever existed to provide a baseline — a reference point to subtract from each reward so we know whether a sample was above or below average. GRPO gets that baseline a completely different way: sampling.

For each prompt, GRPO samples a whole group of $G$ completions from the current policy (say $G = 16$). Each completion $i$ gets a reward $r_i$ from the verifier. Now the group itself defines “average”: just take the mean reward across the group. The group-relative advantage of completion $i$ is how much it beat the group mean, normalized by the group’s spread:

A_i = \frac{r_i - \operatorname{mean}(r_1, \ldots, r_G)}{\operatorname{std}(r_1, \ldots, r_G)}

That’s the whole trick. A completion that scored above the group average gets a positive advantage and is reinforced; one below average gets a negative advantage and is suppressed. The group is its own baseline. No value network, ever.
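The formula above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name `group_relative_advantages`, the small `eps` guard against a zero standard deviation, the choice of population standard deviation, and the example rewards are all assumptions made for this sketch.

```python
import math

def group_relative_advantages(rewards, eps=1e-8):
    """Return (r_i - mean) / std for each reward in one prompt's group.

    eps is an assumption added here so that a group whose completions all
    receive the same reward yields zero advantages instead of dividing by zero.
    """
    g = len(rewards)
    mean = sum(rewards) / g
    # Population standard deviation over the group (an assumption of this sketch).
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / g)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical verifier rewards for G = 6 completions: 1 = pass, 0 = fail.
rewards = [1, 1, 0, 1, 1, 0]
advantages = group_relative_advantages(rewards)
print([round(a, 2) for a in advantages])
# prints [0.71, 0.71, -1.41, 0.71, 0.71, -1.41]
```

Note the edge case: if every completion in a group passes, or every one fails, all advantages are zero and that prompt contributes no learning signal for the step.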

The GRPO objective

GRPO keeps PPO’s stabilizers: it is PPO with the advantage swapped out. For each token in completion $i$, let $\rho_{i,t}$ be the importance ratio (new-policy probability over old-policy probability for that token). The objective is the familiar clipped surrogate, using the group-relative advantage $A_i$, plus a KL penalty to a frozen reference model:

\mathcal{J}_{\text{GRPO}} = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t} \min\!\big(\rho_{i,t}\,A_i,\; \operatorname{clip}(\rho_{i,t}, 1-\epsilon, 1+\epsilon)\,A_i\big)\right] - \beta\, D_{\text{KL}}\!\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right)

The $\min$/clip is exactly PPO’s trust-region trick: don’t let the policy move too far in one step. The KL term is the leash to the reference model that keeps the policy from drifting into degenerate text. The only structural change from PPO is that $A_i$ comes from the group mean instead of a critic, and that single change removes an entire network from the training loop.
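To make the clipping behavior concrete, here is a small sketch of the surrogate for one group, written in plain Python over lists of per-token log-probabilities. The function names, the default `clip_eps=0.2`, and the input layout are assumptions for illustration. The KL helper uses the non-negative per-token estimator that DeepSeekMath describes; treat its exact form here as a best-effort reading rather than a verified transcription.

```python
import math

def clipped_term(ratio, advantage, clip_eps=0.2):
    """min(rho * A, clip(rho, 1 - eps, 1 + eps) * A) for a single token."""
    clipped = max(1 - clip_eps, min(ratio, 1 + clip_eps))
    return min(ratio * advantage, clipped * advantage)

def grpo_surrogate(new_logps, old_logps, advantages, clip_eps=0.2):
    """Clipped surrogate averaged over tokens, then over the group.

    new_logps / old_logps: one list of per-token log-probs per completion.
    advantages: one group-relative advantage per completion.
    The KL penalty is left out; subtract beta times a KL estimate separately.
    """
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logps, old_logps, advantages):
        terms = [clipped_term(math.exp(n - o), adv, clip_eps)
                 for n, o in zip(new_lp, old_lp)]
        total += sum(terms) / len(terms)   # (1 / |o_i|) * sum over tokens
    return total / len(advantages)         # (1 / G) * sum over completions

def kl_estimate(ref_logp, new_logp):
    """Per-token estimate pi_ref/pi_theta - log(pi_ref/pi_theta) - 1 (always >= 0)."""
    log_ratio = ref_logp - new_logp
    return math.exp(log_ratio) - log_ratio - 1

# How the clip treats a ratio that has moved outside [0.8, 1.2]:
print(clipped_term(1.5, +1.0))  # 1.2  (gain capped once the ratio exceeds 1 + eps)
print(clipped_term(1.5, -1.0))  # -1.5 (penalty not capped in this direction)
print(clipped_term(0.5, +1.0))  # 0.5
print(clipped_term(0.5, -1.0))  # -0.8 (capped once the ratio drops below 1 - eps)
```

The asymmetry in the four printed cases is the point of the $\min$: the objective stops rewarding further movement in the direction the advantage already favors, but never hides a move that makes things worse.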

Try it

Below, sample a group of completions, see each one’s verifier reward, watch the group mean become the baseline, and see the resulting normalized advantages — positive for above-average samples, negative for below. Toggle the critic on and off to feel what GRPO removes.

GRPO: group-relative advantage

Sample a group of completions for one prompt. The baseline is the group's own mean reward; no critic is needed.

Prompt: Prove that the sum of the first n odd numbers is n².

Group size: G = 6

Completion · verifier result (reward) → advantage

induction, base + step · ✓ (1) → A = +1.00

visual L-shapes argument · ✓ (1) → A = +1.00

algebraic sum, off-by-one · ✗ (0) → A = -1.00

restates claim, no proof · ✗ (0) → A = -1.00

telescoping sum · ✓ (1) → A = +1.00

induction, missing base case · ✗ (0) → A = -1.00

Group mean (baseline): 0.500

Group std: 0.500

Critic network: none

GRPO drops PPO's critic. Instead of training a separate value network to predict a baseline, it samples a group of completions for the same prompt and uses the group's own mean reward as the baseline; each completion's advantage is (reward − mean) / std. Completions above the group average get pushed up, and those below get pushed down. With no value network to store and train, GRPO needs roughly half the trainable-model memory of PPO, which is part of what made DeepSeek-R1's RL run practical. Under PPO, the same update would be driven by a learned, imperfect critic estimate that costs an entire extra network to train.

DeepSeek-R1-Zero: reasoning from nothing

In January 2025, DeepSeek-AI put GRPO and verifiable rewards together and ran an experiment so clean it became instantly famous. They took DeepSeek-V3-Base, a pre-trained base model with no supervised fine-tuning, no instruction tuning, and no demonstrations of how to reason, and trained it with pure GRPO against rule-based rewards: verifiable correctness (answer-key match for math, test execution for code) plus a simple format check that the reasoning is wrapped in designated tags. That’s it. No SFT cold-start, no human reasoning traces. They called it DeepSeek-R1-Zero.
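An answer-key reward for math can be as simple as the sketch below. Everything in it is an assumption for illustration: the helper names, the convention that the final answer appears inside `\boxed{...}`, and exact string matching. It is not DeepSeek's actual verifier, and a practical one would also normalize equivalent forms such as `1/2` and `0.5`.

```python
import re

# Matches \boxed{...} with no nested braces (a simplification of this sketch).
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")

def extract_final_answer(completion):
    """Return the content of the last \\boxed{...} span, or None if absent."""
    matches = BOXED.findall(completion)
    return matches[-1].strip() if matches else None

def math_reward(completion, answer_key):
    """Binary outcome reward: 1.0 on an exact match with the key, else 0.0."""
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0  # no parseable final answer counts as a failure
    return 1.0 if answer == answer_key.strip() else 0.0

# Hypothetical completions for a problem whose answer key is "42".
print(math_reward(r"... so the total is \boxed{42}.", "42"))   # 1.0
print(math_reward(r"... so the total is \boxed{41}.", "42"))   # 0.0
print(math_reward("I think the answer is probably 42.", "42")) # 0.0
```

Because the reward depends only on the final answer, the model is free to discover whatever intermediate reasoning gets it there, which is what makes the emergence described next possible.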

And it worked. With nothing but rule-based rewards and group-relative advantages, the model taught itself to reason. Over training, its chains of thought grew longer on their own. It began to allocate more thinking to harder problems, to decompose them, and, most strikingly, to backtrack and self-correct, all without ever being shown a single example of doing so. These behaviors weren’t programmed or imitated; they emerged because they were the strategies that earned reward. On AIME 2024, R1-Zero climbed from 15.6% to 71.0% pass@1 (and to 86.7% with majority voting), going from barely trying to genuinely competitive on pure RL.
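The two metrics quoted above measure different things, and a short sketch makes the gap visible. Here pass@1 is estimated as the fraction of sampled answers that are correct, while majority voting scores a problem as solved if the most common sampled answer is correct. The function names and the sample answers are hypothetical.

```python
from collections import Counter

def pass_at_1(samples, answer_key):
    """Fraction of sampled final answers that match the key."""
    return sum(answer == answer_key for answer in samples) / len(samples)

def majority_vote_correct(samples, answer_key):
    """True if the most frequent sampled answer matches the key.

    Ties go to the answer that appeared first, which is how
    Counter.most_common orders equal counts.
    """
    top_answer, _count = Counter(samples).most_common(1)[0]
    return top_answer == answer_key

# Hypothetical final answers from five samples on one problem.
samples = ["204", "204", "17", "204", "96"]
print(pass_at_1(samples, "204"))              # 0.6
print(majority_vote_correct(samples, "204"))  # True
```

In this made-up case a single sample is right only 60% of the time, yet the vote is right, which is why a majority-vote score can sit well above pass@1 on the same model.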

The aha moment

The DeepSeek team documented a now-legendary training artifact: mid-trajectory, R1-Zero would literally write something like “Wait, wait. That’s an aha moment. Let me re-evaluate this step by step.” and then go back and fix its own reasoning. The model wasn’t told to second-guess itself; it discovered that pausing to reconsider earns reward. This is the aha moment: emergent self-correction, arising spontaneously from reinforcement learning against a correctness signal. It is as close as anything we have to a glimpse of a model learning how to think rather than being shown what to say, and it is one of the most quoted results of the reasoning era.

DeepSeek-R1: cold-start for readability

R1-Zero proved the principle, but it had warts. Pure-RL reasoning traces were powerful yet unreadable — mixed languages, chaotic formatting, chains that worked but no human would want to read. So the full DeepSeek-R1 added a small amount of supervised cold-start data: a curated set of clean, well-formatted long chain-of-thought examples used to fine-tune the base model before the GRPO stage. Cold-start gave the model a readable, well-behaved starting point; GRPO then drove its reasoning ability up from there, followed by a final alignment pass for helpfulness and safety.

The result was a model that matched OpenAI’s o1 on hard reasoning benchmarks — and, unlike o1, was released openly, with weights and a detailed technical report. The black box of chapter 21 had been reproduced, explained, and opened to the world. The route o1 hid, R1 published.

The whole arc, reassembled

Stand back and look at what just happened across this section. STaR used correctness as a filter. The PRM/ORM work taught us to think about credit assignment. o1 showed reasoning scales with test-time compute. RLVR gave us a trustworthy reward that is hard to game. And GRPO supplied a critic-free algorithm lean enough to run that reward at scale. DeepSeek-R1-Zero is what you get when you snap all five ideas together into a base model, a verifier, and group-relative RL: reasoning, complete with self-correction and the aha moment, emerges.

This is the climax of the reinforcement-learning story this explainer has been building since the preference era. We began by imitating good answers (SFT), moved to learning human preferences (RLHF), and arrive here: a model improving its own thinking against the bedrock of verifiable truth, inventing strategies no one taught it. The remaining chapters refine the algorithm (DAPO, Dr.GRPO, and friends), scale the recipe across the open ecosystem, and push it into agentic, tool-using, multi-turn settings. But the conceptual summit is this one — the moment reasoning stopped being something we wrote into a model, and became something a model could learn for itself.