Section 01

What is post-training?

Turning a base model into an assistant

Ask a freshly pre-trained language model “What is the capital of France?” and you might get this back:

What is the capital of France? What is the largest city in France? What is the official language of France? List three famous French painters.

It didn’t answer. It did something stranger and, on reflection, completely logical: it continued the document. Somewhere in its training data lived a quiz, a worksheet, a list of trivia questions — and the single most likely thing to follow one question is another question. The model did exactly what it was built to do. It is a brilliant autocomplete. It is also a terrible assistant.

This explainer is about closing that gap. The model that ships in a chat product (the one that answers your question, refuses to help you build a weapon, thinks step by step through a proof, and calls a tool to look something up) is not the model that came out of pre-training. It is that model after a second, entirely different phase of training. That phase is post-training, and it is the whole subject of this explainer.

Two models, one set of weights

Pre-training, the subject of the sibling explainer, produces a base model: a network trained on a large fraction of the public web to do one thing, predict the next token. That objective forces it to absorb grammar, facts, code, and reasoning patterns, because all of those help it guess what comes next. The result is a model that knows an enormous amount and has no idea what you want from it.

The reason is subtle but decisive. The base model’s training distribution is “text on the internet.” Most internet text is not a helpful assistant responding to a user. It is articles, forum flame wars, half-finished code, SEO spam, and, yes, trivia worksheets. When you prompt the base model, it doesn’t ask “how would a helpful assistant respond?” It asks “what is the most likely continuation of this text?” — and that continuation is often unhelpful, repetitive, or actively wrong, because plenty of unhelpful, repetitive, wrong text exists.
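To make the "most likely continuation" behavior concrete, here is a minimal sketch in Python. It is a toy bigram counter, not a neural network, and the three-question corpus and the names `most_likely_next` and `continue_text` are made up for illustration. The only point is that a model trained purely to predict what comes next will follow a question with another question if that is what its data looks like.

```python
from collections import Counter, defaultdict

# Hypothetical miniature "web corpus": quiz-style text, as in the example above.
corpus = (
    "what is the capital of france ? what is the largest city in france ? "
    "what is the official language of france ?"
).split()

# Count which token follows each token. A bigram table is the simplest
# possible stand-in for next-token prediction.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1


def most_likely_next(token):
    """Return the most frequent continuation seen after `token`, or None if unseen."""
    counts = follows.get(token)
    return counts.most_common(1)[0][0] if counts else None


def continue_text(prompt, n_tokens):
    """Greedily extend `prompt` by up to `n_tokens` tokens."""
    tokens = prompt.split()
    for _ in range(n_tokens):
        nxt = most_likely_next(tokens[-1])
        if nxt is None:
            break
        tokens.append(nxt)
    return " ".join(tokens)


# The "model" does not answer; it starts another question.
print(continue_text("what is the capital of france ?", 2))
# -> what is the capital of france ? what is
```

Nothing in the counting step knows what an answer is. In this corpus a question mark is always followed by "what", so that is what the model produces.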

Post-training doesn’t add new world knowledge so much as it reshapes how the existing knowledge is expressed. The capital of France was already in there. What was missing was the disposition to answer the question directly, in a helpful tone, while declining the requests it shouldn’t honor. Crucially, post-training touches the same parameters pre-training set: it nudges them rather than rebuilding them from scratch. That is why it can be done with a tiny fraction of pre-training’s data, and why it can be redone without repeating pre-training when priorities change.

A note on cost

It’s tempting to call post-training “the cheap part” and pre-training “the expensive part.” Resist it. Pre-training is a massive one-time compute bill — a frontier run can cost tens of millions of dollars in GPU time. But modern post-training is far from trivial: it involves collecting human preference labels at scale, training auxiliary reward models, and running reinforcement-learning loops that repeatedly sample from the model. The reasoning-focused RL pipelines we’ll meet later can burn enormous amounts of inference compute generating and grading rollouts.

The honest framing is that the two phases trade in different currencies. Pre-training spends raw FLOPs to build knowledge. Post-training spends a mix of human labor, careful data curation, and a different shape of compute to build behavior. Neither is categorically “the expensive one” — it depends on the model, the goals, and the year.

The post-training stack

Here is the whole arc at a glance. A base model goes through some subset of these stages, in roughly this order:

Supervised fine-tuning (SFT) / instruction tuning. Show the model thousands of high-quality examples of instructions paired with good responses, and continue next-token training on just those examples. This teaches the model the format of being an assistant — that a user turn should be followed by a helpful answer, not another question. We cover it in Section 2.
Preference optimization. SFT can only imitate the demonstrations it’s given, which caps quality at the demonstrators’ level. To push past that, we collect preferences, with humans (or other models) judging which of two responses is better, and optimize the model to produce the preferred kind. This is the heart of RLHF (reinforcement learning from human feedback). The classic recipe trains a reward model from those preferences and then uses Proximal Policy Optimization (PPO) to optimize against it. A later, simpler alternative, Direct Preference Optimization (DPO), skips the reward model and the RL loop, collapsing the whole thing into a single supervised-style loss.
RL from verifiable rewards (RLVR) / reasoning RL. For tasks where correctness can be checked automatically (math with a known answer, code that must pass tests) we can drop the learned reward model entirely and reward the model directly for getting the right answer. Optimizing this signal with algorithms like Group Relative Policy Optimization (GRPO) is what produces the long-chain-of-thought reasoning models that emerged in 2024–2025.
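The verifiable-reward idea in the last item can be sketched in a few lines. This is an illustrative simplification: it assumes an exact-string answer check and the group-normalized advantage commonly associated with GRPO. The function names are hypothetical, and real pipelines use far more careful checkers and add further terms to the training objective.

```python
from statistics import mean, pstdev


def verifiable_reward(model_answer, reference_answer):
    """Return 1.0 if the final answer matches the known-correct one, else 0.0.

    Assumption: exact string match after trimming whitespace. Real checkers
    (unit tests, symbolic math comparison) are more involved.
    """
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0


def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled response relative to the other samples for the same prompt.

    Subtract the group mean and divide by the group standard deviation, so
    better-than-average samples get a positive training signal and
    worse-than-average samples get a negative one. No learned reward model
    is involved.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical sampled answers to "What is 2 + 2?"
samples = ["4", "5", "4 ", "3"]
rewards = [verifiable_reward(s, "4") for s in samples]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

If every sample in a group earns the same reward, all advantages come out as zero, so a prompt the model always solves (or never solves) contributes no learning signal in this scheme.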

The map below lays out these stages and how they connect.

[Figure: The post-training stack. How a raw pretrained model becomes an aligned reasoning assistant. Nodes include the base (pretrained) model, which has absorbed broad world knowledge from next-token prediction but only continues text, and two preference-optimization paths: Path A (RLHF), reward model → PPO / RLHF; and Path B (DPO), direct optimization with no separate reward model.]

Three eras, briefly

It helps to see how this stack assembled itself historically, because each layer was a response to the limits of the last.

2021–2022: instruction tuning and RLHF. FLAN and T0 showed that fine-tuning on instructions phrased in natural language unlocks zero-shot generalization. Then InstructGPT, whose recipe the first ChatGPT followed, cemented the three-step recipe of SFT, then a reward model, then PPO, and RLHF went mainstream. Alignment became a training problem, not just a prompting trick.
2023: offline and direct methods. RLHF’s RL loop is finicky and expensive to run. DPO showed you could optimize the same underlying objective with a plain supervised loss on preference pairs: no reward model, no rollouts. A wave of variants followed, and “offline” preference optimization became the default for many open models.
2024–2026: verifiable rewards and reasoning. OpenAI’s o1 and then DeepSeek-R1 showed that large-scale RL could teach a model to reason at length before answering. DeepSeek-R1-Zero, trained against automatically verifiable rewards, did so with no SFT at all. This is the current frontier, extending now into agentic, tool-using, multi-turn RL.
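The contrast between the reward-model recipe and the later direct methods shows up clearly in the losses themselves. The sketch below assumes the standard pairwise (Bradley-Terry style) formulation for the reward model and the published form of the DPO loss for a single preference pair. The function names, the example log-probabilities, and `beta=0.1` are illustrative choices, not values from any particular system.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def reward_model_loss(reward_chosen, reward_rejected):
    """Pairwise loss for a learned reward model.

    Small when the preferred response receives the higher scalar score.
    """
    return -log_sigmoid(reward_chosen - reward_rejected)


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair.

    The reward is implicit: beta times how much more likely the policy
    makes a response than a frozen reference model does. The loss then
    has the same shape as the reward-model loss above.
    """
    implicit_chosen = beta * (policy_logp_chosen - ref_logp_chosen)
    implicit_rejected = beta * (policy_logp_rejected - ref_logp_rejected)
    return -log_sigmoid(implicit_chosen - implicit_rejected)


# Illustrative numbers only: summed log-probabilities of each full response.
loss_before = dpo_loss(-12.0, -12.0, -12.0, -12.0)  # policy identical to reference
loss_after = dpo_loss(-10.0, -14.0, -12.0, -12.0)   # policy now favors the chosen response
print(loss_before, loss_after)
```

Both functions end in the same `-log_sigmoid(difference)` term. The only change is where the scores come from: a separate reward network in the first case, the policy's own log-probabilities relative to a reference in the second.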

What this explainer covers

We build the stack from the bottom up. After this on-ramp, the next two chapters lay the conceptual foundation: the math of likelihood, KL divergence, and entropy that every later method leans on, and the alignment problem that motivates moving beyond imitation in the first place.

From there: instruction tuning and SFT (Section 2), the RLHF preference era and reward models (Section 3), the reinforcement-learning fundamentals and PPO that make RLHF work (Section 4), the offline and direct-preference methods like DPO (Section 5), the RLVR and reasoning era including GRPO and DeepSeek-R1 (Section 6), and finally the modern algorithms and agentic frontier (Section 7).

Throughout, we care about both halves of the craft: the machine-learning theory — why optimizing a preference signal is different from imitating a demonstration — and the practice — the data, the instabilities, the failure modes that make post-training as much engineering as science. Let’s start where pre-training left off: with the objective itself, and what it means to turn next-token prediction into behavior.