Section 27

Recap

The pipeline reassembled, and further reading

We started with a base model that could only continue text and ended at agents that search, run code, and reason for thousands of tokens before they answer. This final chapter steps back to find the through-line: the one idea underneath every chapter, the handful of levers that explain the differences, and the order in which to read the papers if you want to go deeper.

The through-line

Here is the single sentence that ties the whole explainer together: post-training is the search for a better training signal than next-token likelihood.

Pre-training gave us a model that’s superb at imitating text. But imitation has a ceiling — it can only reproduce the distribution of what humans wrote, and it has no notion of good versus bad. Every technique in this explainer is an attempt to get a richer signal:

  • Imitation (SFT) upgrades the signal from “any text on the internet” to “good demonstrations of helpful behavior.”
  • Preferences (RLHF, DPO) upgrade it again from “copy this answer” to “this answer is better than that one,” a comparative signal that can exceed the best single demonstration.
  • Verifiable rewards (RLVR, reasoning RL) upgrade it once more to “this answer is correct,” a ground-truth signal a checker can produce at scale, with no human and no learned judge in the loop.

The objective got better and better; the model’s raw next-token machinery never changed. That’s the constant. Everything else is engineering around the signal.
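As a rough illustration of the three signals above, here is a minimal Python sketch. The function names are hypothetical, the preference loss uses the usual Bradley-Terry form (which this chapter does not spell out), and the exact-match checker is a stand-in for a real grader such as unit tests or a math verifier.

```python
import math

def sft_loss(token_logprobs):
    """Imitation: mean negative log-likelihood of a demonstrated response."""
    return -sum(token_logprobs) / len(token_logprobs)

def preference_loss(score_chosen, score_rejected):
    """Preferences: Bradley-Terry loss, -log sigmoid(score_chosen - score_rejected)."""
    gap = score_chosen - score_rejected
    return math.log1p(math.exp(-gap))

def verifiable_reward(model_answer, reference_answer):
    """Verifiable rewards: 1.0 if a (toy, exact-match) checker accepts the answer."""
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0
```

Note that only the first function needs a demonstration to copy. The second needs only a comparison, and the third needs only a checker, which is what lets the signal scale without a human in the loop.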

The recurring levers

Almost every paper in this explainer is a particular setting of four dials:

  • The signal. What tells the model it did well? It marched from demonstrations (SFT) → preferences (reward models) → verifiable rewards (RLVR). Each step removes a bottleneck: human demonstrations, then human comparisons, then human judgment altogether on checkable tasks.
  • The optimizer. How is the signal turned into a weight update? REINFORCE → PPO → DPO → GRPO and its refinements. The arc here is mostly about variance and stability: better baselines, trust regions, dropping the critic, then sometimes bringing it back.
  • The leash. What stops the model from wandering off into gibberish that games the reward? The KL penalty to a reference model, present from Ziegler 2019 through most modern recipes and baked into DPO’s very derivation. It’s the quiet constant that keeps optimization honest.
  • The eternal enemy. Reward hacking: Goodhart’s law made concrete. Optimize any proxy hard enough and it stops tracking what you wanted. Reward ensembles, KL leashes, verifiable rewards, and careful eval are all, at bottom, defenses against this one failure mode.
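Two of these dials are small enough to sketch in a few lines of Python. This is a toy illustration under stated assumptions, not any paper's reference implementation: the function names and the default `beta` are hypothetical, the KL term is a single-sample log-ratio estimate, and the group baseline divides by the group standard deviation as in the GRPO formulation.

```python
import statistics

def kl_penalized_reward(reward, logp_policy, logp_ref, beta=0.1):
    """The leash: subtract beta times a single-sample KL estimate (the log-ratio)."""
    return reward - beta * (logp_policy - logp_ref)

def group_relative_advantages(rewards, eps=1e-6):
    """The optimizer's baseline, GRPO-style: center on the group mean, scale by group std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps keeps the division safe when every response in the group scored the same
    return [(r - mean) / (std + eps) for r in rewards]
```

For a group of four sampled responses with rewards `[1.0, 0.0, 0.0, 1.0]`, the two correct responses get a positive advantage and the two incorrect ones a negative advantage, with no learned critic involved.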

The arc, in one paragraph each

  • The instruction-tuning era. A base model isn’t an assistant; instruction tuning taught it to follow instructions, and SFT on good demonstrations — increasingly synthetic and self-generated — gave us the first usable assistants. Signal: demonstrations.
  • The RLHF era. Imitation can’t exceed its data, so we learned from human preferences: collect pairwise comparisons, fit a reward model, and optimize the policy against it with PPO under a KL leash. RLAIF and Constitutional AI then swapped the human judge for a model. Signal: preferences.
  • The DPO / offline era. Running an online RL loop is fiddly, so DPO collapsed RLHF into a single supervised loss with an implicit reward, and a whole zoo of variants (IPO, KTO, ORPO, SimPO) plus rejection-sampling methods followed. Signal: preferences, optimized offline.
  • The reasoning / RLVR era. For math and code, correctness is checkable, so a verifier can replace the reward model entirely. Bootstrapped reasoning, process rewards, test-time compute (o1), and GRPO / DeepSeek-R1 turned verifiable rewards into emergent long-form reasoning. Signal: verifiable correctness.
  • The agentic frontier. The unit of optimization moved from the response to the whole trajectory: agentic and tool-use RL, where the model acts, observes, and acts again, and credit must be assigned across many turns. Signal: outcomes of multi-step interaction.
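To make the DPO step in this arc concrete, here is a minimal per-pair sketch in Python. It assumes you already have summed log-probabilities of the chosen and rejected responses under both the policy and the frozen reference. The function names and the default `beta` are hypothetical, and real implementations work on batched tensors rather than scalars.

```python
import math

def implicit_reward(logp_policy, logp_ref, beta=0.1):
    """DPO's implied reward: beta times the policy-to-reference log-ratio."""
    return beta * (logp_policy - logp_ref)

def dpo_loss(logp_pol_chosen, logp_ref_chosen,
             logp_pol_rejected, logp_ref_rejected, beta=0.1):
    """-log sigmoid(implicit reward of chosen minus implicit reward of rejected)."""
    margin = (implicit_reward(logp_pol_chosen, logp_ref_chosen, beta)
              - implicit_reward(logp_pol_rejected, logp_ref_rejected, beta))
    return math.log1p(math.exp(-margin))
```

When the policy equals the reference, both implicit rewards are zero and the loss sits at log 2. It falls as the policy shifts probability toward the chosen response and away from the rejected one, which is the sense in which no separate reward model or RL loop is needed.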

The pipeline, reassembled

Here is the whole thing as one map. Every node is a stage we built up across the explainer. Read left to right and you’re reading the standard 2026 recipe: a base model, made into an assistant by SFT, polished by preference optimization, and sharpened by verifiable-reward RL.

[Figure: The post-training stack. How a raw pretrained model becomes an aligned reasoning assistant. Path A: RLHF. Path B: DPO (direct), no separate reward model.]

Further reading — the papers, in order

The instruction-tuning and SFT roots:

  • Wei et al., Finetuned Language Models Are Zero-Shot Learners (FLAN, 2021) — arxiv 2109.01652.
  • Wang et al., Self-Instruct (2022) — arxiv 2212.10560. Bootstrapped instruction data.

The RLHF lineage:

  • Christiano et al., Deep RL from Human Preferences (2017) — arxiv 1706.03741. The founding RLHF paper.
  • Ziegler et al., Fine-Tuning LMs from Human Preferences (2019) — arxiv 1909.08593. The KL-to-reference penalty.
  • Ouyang et al., Training LMs to Follow Instructions with Human Feedback (InstructGPT, 2022) — arxiv 2203.02155. The SFT → RM → PPO recipe.
  • Bai et al., Constitutional AI (2022) — arxiv 2212.08073. RLAIF and scalable oversight.

The optimization algorithms:

  • Schulman et al., Proximal Policy Optimization (2017) — arxiv 1707.06347. The RLHF workhorse.
  • Rafailov et al., Direct Preference Optimization (DPO, 2023) — arxiv 2305.18290. RLHF as one supervised step.
  • Shao et al., DeepSeekMath (GRPO, 2024) — arxiv 2402.03300. The critic-free group baseline.
  • Yu et al., DAPO (2025) — arxiv 2503.14476. The open GRPO refinement.

The reasoning / RLVR era:

  • Lightman et al., Let’s Verify Step by Step (2023) — arxiv 2305.20050. Process reward models.
  • OpenAI, o1 (2024). RL-trained long chain-of-thought; test-time compute.
  • DeepSeek-AI, DeepSeek-R1 (2025) — arxiv 2501.12948. Pure-RL reasoning and the “aha moment.”
  • Lambert et al., Tülu 3 (2024) — arxiv 2411.15124. The open post-training reference manual.

The agentic frontier:

  • Zhang et al., The Landscape of Agentic RL for LLMs: A Survey (2025) — arxiv 2509.02547.