Section 27

Recap

The pipeline reassembled, and further reading

We started with a base model that could only continue text and ended at agents that search, run code, and reason for thousands of tokens before they answer. This final chapter steps back to find the through-line — the one idea underneath every chapter, the handful of levers that explain the differences, and the order in which to read the papers if you want to go deeper.

The through-line

Here is the single sentence that ties the whole explainer together: post-training is the search for a better training signal than next-token likelihood.

Pre-training gave us a model that’s superb at imitating text. But imitation has a ceiling — it can only reproduce the distribution of what humans wrote, and it has no notion of good versus bad. Every technique in this explainer is an attempt to get a richer signal:

Imitation (SFT) upgrades the signal from “any text on the internet” to “good demonstrations of helpful behavior.”
Preferences (RLHF, DPO) upgrade it again from “copy this answer” to “this answer is better than that one”: a comparative signal that can exceed the best single demonstration.
Verifiable rewards (RLVR, reasoning RL) upgrade it once more to “this answer is correct”: a ground-truth signal a checker can produce at scale, with no human and no learned judge in the loop.

The objective got better and better; the model’s raw next-token machinery never changed. That’s the constant. Everything else is engineering around the signal.
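One way to make the three signals concrete is to write each as an objective. This is a sketch in standard notation that the explainer's text above does not itself define, so treat every symbol as an assumption: $\pi_\theta$ is the model being trained, $x$ a prompt, $y$ a response, $y_w$ and $y_l$ the preferred and rejected responses in a comparison, $r_\phi$ a learned reward model, $\sigma$ the logistic function, and $v(x, y) \in \{0, 1\}$ a checker's verdict.

```latex
\begin{align*}
\text{Imitation (SFT):}\quad
  & \mathcal{L}_{\text{SFT}}(\theta)
    = -\,\mathbb{E}_{(x,y)\sim\mathcal{D}_{\text{demo}}}
      \big[\log \pi_\theta(y \mid x)\big] \\
\text{Preferences (reward model):}\quad
  & \mathcal{L}_{\text{RM}}(\phi)
    = -\,\mathbb{E}_{(x,y_w,y_l)\sim\mathcal{D}_{\text{pref}}}
      \big[\log \sigma\big(r_\phi(x,y_w) - r_\phi(x,y_l)\big)\big] \\
\text{Verifiable rewards (RLVR):}\quad
  & J(\theta)
    = \mathbb{E}_{x\sim\mathcal{D},\; y\sim\pi_\theta(\cdot \mid x)}
      \big[v(x,y)\big]
\end{align*}
```

Read top to bottom: the first objective needs a good answer written out in full, the second needs only a judgment about which of two answers is better, and the third needs only a checker that can say whether a sampled answer is correct.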

The recurring levers

Almost every paper in this explainer is a particular setting of four dials:

The signal. What tells the model it did well? It marched from demonstrations (SFT) → preferences (reward models) → verifiable rewards (RLVR). Each step removes a bottleneck — human demonstrations, then human comparisons, then human judgment altogether on checkable tasks.
The optimizer. How is the signal turned into a weight update? REINFORCE → PPO → DPO → GRPO and its refinements. The arc here is mostly about variance and stability: better baselines, trust regions, dropping the critic, then sometimes bringing it back.
The leash. What stops the model from wandering off into gibberish that games the reward? The KL penalty to a reference model: introduced in Ziegler 2019, kept in most recipes since (a few recent RLVR recipes, DAPO among them, drop it), and baked into DPO’s very derivation. It’s the quiet near-constant that keeps optimization honest.
The eternal enemy. Reward hacking — Goodhart’s law made concrete. Optimize any proxy hard enough and it stops tracking what you wanted. Reward ensembles, KL leashes, verifiable rewards, and careful eval are all, at bottom, defenses against this one failure mode.
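Two of these dials are small enough to sketch directly. The following is a minimal illustration, not any paper's exact implementation: `kl_shaped_reward` subtracts a single-sample KL-style penalty from a scalar reward (the leash), and `group_relative_advantages` scores each sampled response against the other samples for the same prompt instead of against a learned critic (the critic-free group baseline). The function names, the `beta` default, and the `eps` constant are assumptions made for this example.

```python
from statistics import mean, pstdev


def kl_shaped_reward(reward, logp_policy, logp_ref, beta=0.1):
    """Reward minus a KL-style penalty.

    Uses the single-sample estimate log pi(y|x) - log pi_ref(y|x) for the
    sampled response y. The penalty grows as the policy assigns the response
    more probability than the reference model does.
    """
    return reward - beta * (logp_policy - logp_ref)


def group_relative_advantages(rewards, eps=1e-6):
    """Critic-free baseline over one group of samples for the same prompt.

    Each reward is centered on the group mean and scaled by the group's
    standard deviation; eps (an assumed constant) avoids division by zero
    when every sample in the group earns the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled answers to one prompt; a checker marked two as correct.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
    # A correct answer that drifted 2 nats above the reference log-prob.
    print(kl_shaped_reward(1.0, logp_policy=-10.0, logp_ref=-12.0))
```

With the group `[1.0, 0.0, 0.0, 1.0]`, the two rewarded responses get advantages of roughly +1 and the other two roughly -1. When every response in a group earns the same reward, all advantages are zero and that group contributes no learning signal.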

The arc, in one paragraph each

The instruction-tuning era. A base model isn’t an assistant; instruction tuning taught it to follow instructions, and SFT on good demonstrations — increasingly synthetic and self-generated — gave us the first usable assistants. Signal: demonstrations.
The RLHF era. Imitation can’t exceed its data, so we learned from human preferences: collect pairwise comparisons, fit a reward model, and optimize it with PPO under a KL leash. RLAIF and Constitutional AI then swapped the human judge for a model. Signal: preferences.
The DPO / offline era. Running an online RL loop is fiddly, so DPO collapsed RLHF into a single supervised loss with an implicit reward — and a whole zoo of variants (IPO, KTO, ORPO, SimPO) plus rejection-sampling methods followed. Signal: preferences, optimized offline.
The reasoning / RLVR era. For math and code, correctness is checkable, so a verifier can replace the reward model entirely. Bootstrapped reasoning, process rewards, test-time compute (o1), and GRPO / DeepSeek-R1 turned verifiable rewards into emergent long-form reasoning. Signal: verifiable correctness.
The agentic frontier. The unit of optimization moved from the response to the whole trajectory: agentic and tool-use RL, where the model acts, observes, and acts again, and credit must be assigned across many turns. Signal: outcomes of multi-step interaction.
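The “single supervised loss with an implicit reward” from the DPO era above is compact enough to write out for one preference pair. This is a hedged sketch that assumes you already have the summed log-probabilities of the chosen and rejected responses under both the policy and a frozen reference model. The function name `dpo_loss`, its argument names, and the `beta` default are illustrative choices, not taken from any library.

```python
import math


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for a single (chosen, rejected) pair.

    Each argument is the summed token log-probability of a full response.
    The implicit reward of a response is beta times the log-ratio of the
    policy to the reference model.
    """
    r_chosen = beta * (logp_chosen - ref_logp_chosen)
    r_rejected = beta * (logp_rejected - ref_logp_rejected)
    margin = r_chosen - r_rejected

    # -log(sigmoid(margin)) == log(1 + exp(-margin)); branch on the sign
    # so exp() is only ever called on a non-positive argument.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Policy identical to the reference: margin is 0, loss is log(2).
    print(dpo_loss(-5.0, -5.0, -5.0, -5.0))
    # Policy has shifted probability toward the chosen response.
    print(dpo_loss(-3.0, -7.0, -5.0, -5.0))
```

When the policy equals the reference, the margin is zero and the loss is log 2 (about 0.693). It falls as the policy raises the chosen response's log-ratio relative to the rejected one, and no separate reward model or sampling loop appears anywhere.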

The pipeline, reassembled

Here is the whole thing as one map. Every node is a stage we built up across the explainer. Read left to right and you’re reading the standard 2026 recipe: a base model, made into an assistant by SFT, polished by preference optimization, and sharpened by verifiable-reward RL.

Figure: The post-training stack. How a raw pretrained model becomes an aligned reasoning assistant.

Base model (pretrained). The raw pretrained language model. It has absorbed broad world knowledge from next-token prediction over a huge corpus, but it only continues text; it has not yet been taught to follow instructions, hold a conversation, or behave like a helpful assistant.
Path A (RLHF). Reward model → PPO / RLHF.
Path B (DPO). DPO (direct), with no separate reward model.

Further reading — the papers, in order

The instruction-tuning and SFT roots:

Wei et al., Finetuned Language Models Are Zero-Shot Learners (FLAN, 2021) — arxiv 2109.01652.
Wang et al., Self-Instruct (2022) — arxiv 2212.10560. Bootstrapped instruction data.

The RLHF lineage:

Christiano et al., Deep RL from Human Preferences (2017) — arxiv 1706.03741. The founding RLHF paper.
Ziegler et al., Fine-Tuning LMs from Human Preferences (2019) — arxiv 1909.08593. The KL-to-reference penalty.
Ouyang et al., Training LMs to Follow Instructions with Human Feedback (InstructGPT, 2022) — arxiv 2203.02155. The SFT → RM → PPO recipe.
Bai et al., Constitutional AI (2022) — arxiv 2212.08073. RLAIF and scalable oversight.

The optimization algorithms:

Schulman et al., Proximal Policy Optimization (2017) — arxiv 1707.06347. The RLHF workhorse.
Rafailov et al., Direct Preference Optimization (DPO, 2023) — arxiv 2305.18290. RLHF as one supervised step.
Shao et al., DeepSeekMath (GRPO, 2024) — arxiv 2402.03300. The critic-free group baseline.
Yu et al., DAPO (2025) — arxiv 2503.14476. The open GRPO refinement.

The reasoning / RLVR era:

Lightman et al., Let’s Verify Step by Step (2023) — arxiv 2305.20050. Process reward models.
OpenAI, o1 (2024). RL-trained long chain-of-thought; test-time compute.
DeepSeek-AI, DeepSeek-R1 (2025) — arxiv 2501.12948. Pure-RL reasoning and the “aha moment.”
Lambert et al., Tülu 3 (2024) — arxiv 2411.15124. The open post-training reference manual.

The agentic frontier:

Zhang et al., The Landscape of Agentic RL for LLMs: A Survey (2025) — arxiv 2509.02547.

Evaluating the judge, and where to go next

A theme that runs under everything here: a reward signal is only as good as your ability to trust it. Reward models are themselves models, and they can be wrong in ways that quietly corrupt the whole pipeline — which is why RewardBench (Lambert et al., 2024) exists: a standard benchmark for measuring reward-model quality, so you can catch a bad judge before it teaches your policy bad habits. For every term in this explainer with its acronym spelled out, see the glossary. And the two sibling explainers complete the picture: LLM Pre-training covers how the base model is built in the first place, and LLM & vLLM Inference covers how the finished model is actually served.