Section 03

The alignment problem

Helpful, honest, harmless — and why imitation isn’t enough

Suppose you had unlimited budget and the patience of a saint. You hire the world’s best writers and have them produce, by hand, a perfect assistant response to every conceivable prompt — millions of flawless demonstrations. Then you fine-tune your base model on all of it. Would you have a perfectly aligned assistant?

You would not. And understanding why — why no amount of imitation gets you all the way there — is the conceptual hinge of this whole explainer. It is the reason the field moved from “show the model good behavior” to “let the model optimize a signal of what’s good.” This chapter makes that case.

What we mean by alignment

Alignment is the project of making a model’s behavior match what its developers and users actually want, rather than what falls out of next-token prediction by default. A base model is capable but not aimed: it has the knowledge and the fluency, but no built-in disposition to be useful, truthful, or safe. Alignment is the aiming.

The field’s most durable shorthand for “what we want” is the HHH framing (helpful, honest, harmless), introduced by Anthropic’s Askell et al. in 2021. It’s worth taking each word seriously, because they pull in different and sometimes conflicting directions:

  • Helpful — actually does what the user is asking: answers the question, follows the format, completes the task, asks for clarification when genuinely needed.
  • Honest — says true things, expresses appropriate uncertainty, doesn’t fabricate citations or confidently invent facts. This includes calibration: knowing — and signaling — what it doesn’t know.
  • Harmless — declines to help with genuinely dangerous requests, avoids generating abusive or deceptive content, and does so without being preachy or refusing benign requests out of excess caution.

These three pull against each other constantly. Maximal helpfulness (“I’ll answer anything”) fights harmlessness. Maximal harmlessness (“I refuse if there’s any doubt”) fights helpfulness. A model that tells you only what you want to hear feels helpful but isn’t honest. Alignment isn’t a single target; it’s a balance, and that’s part of why it can’t be reduced to imitating one fixed set of demonstrations.

Why imitation isn’t enough

Supervised fine-tuning (SFT), which trains the model to imitate demonstrations and is the subject of the next section, gets you remarkably far. The first useful chat assistants were largely products of SFT. But it hits three walls, and each one motivates a piece of machinery we’ll build later.

Wall 1: demonstrations are expensive, and cap you at the demonstrator’s level

Every SFT example is a complete, high-quality response written (or vetted) by a human. That’s slow and costly, and the ceiling is brutal: a model trained to imitate demonstrations can, at best, match the people who wrote them. It learns to reproduce the distribution of its demonstrators, warts and all. If your annotators are good-but-not-great, your model is good-but-not-great. You can’t imitate your way past the teacher.

Worse, for hard tasks — a subtle proof, a tricky piece of code, a delicate refusal — writing the ideal response from scratch is genuinely difficult even for experts. The supply of perfect demonstrations is thin exactly where you need them most.

Wall 2: comparing is easier than writing

Here is the observation that cracked the problem open. Even when you can’t write the perfect response, you can usually look at two responses and say which is better. Judging “A is better than B” is far cheaper, faster, and more reliable than authoring the gold-standard answer yourself.

This asymmetry is the entire reason the field pivoted to preferences. Instead of asking humans to produce ideal outputs, we ask them to compare the model’s own outputs. We collect preference data, judgments of the form “response A is preferred to response B for this prompt,” via pairwise comparison: a labeler says which of two responses is better rather than scoring each on an absolute scale. It scales better than demonstration-writing, and, decisively, it can express preferences above the level any single annotator could write themselves: I may not be able to draft a flawless proof, but I can often tell which of two attempts is closer to correct.

Turning a pile of “A > B” judgments into something a model can optimize against is the job of a reward model, a model trained on preference data to output a scalar score for how good a response is. We build one in Chapter 9; the preference-learning idea itself is the subject of Chapter 7. The whole RLHF apparatus exists to exploit this one asymmetry.
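To make the step from “A > B” judgments to an optimizable number concrete, here is a minimal sketch assuming the Bradley–Terry model, one common way to turn scalar scores into preference probabilities. The function names and the example record are hypothetical, and the scores are plain floats standing in for what a trained reward model would output.

```python
import math

def bt_preference_prob(score_a: float, score_b: float) -> float:
    """Probability that A is preferred to B under Bradley-Terry:
    the sigmoid of the score difference."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))

def pairwise_loss(score_chosen: float, score_rejected: float) -> float:
    """Negative log-likelihood of the labeler's choice.
    Small when the chosen response already scores higher."""
    return -math.log(bt_preference_prob(score_chosen, score_rejected))

# Hypothetical preference record: one prompt, two responses, one judgment.
record = {
    "prompt": "Explain why the sky is blue.",
    "chosen": "Sunlight scatters off air molecules, and blue scatters most.",
    "rejected": "Because it reflects the ocean.",
}

# Illustrative scores (assumed, not produced by any real model).
loss_when_agreeing = pairwise_loss(score_chosen=2.0, score_rejected=0.0)
loss_when_disagreeing = pairwise_loss(score_chosen=0.0, score_rejected=2.0)
```

Training a reward model on this loss pushes the score of each chosen response above the score of its rejected partner. Nothing in the loss requires anyone to have written an ideal answer.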

Wall 3: some properties aren’t in the imitation data at all

A demonstration shows the model what to do. It rarely shows what not to do, and it can’t easily teach properties that are about the model’s relationship to its own knowledge.

Harmlessness is the clearest case. A dataset of helpful answers contains almost no examples of appropriate refusals — there’s no natural supply of “here’s a dangerous request and here’s how to decline it well.” Imitating helpful answers won’t teach a model where to draw the line; you have to optimize against harmful behavior, not merely fail to demonstrate it.

Calibration is subtler. We want the model to be confident when confidence is warranted and uncertain when it is not, but a demonstration of a correct, confident answer doesn’t teach the model when that confidence is warranted. Honesty about uncertainty is a property of the policy’s relationship to its own knowledge, and pure imitation has no handle on it.
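One way to make “calibrated” measurable, offered as a rough sketch rather than a method the text prescribes: group answers by stated confidence and compare each group’s average confidence with its actual accuracy. The function names, the binning scheme, and the toy data below are all assumptions for illustration.

```python
from collections import defaultdict

def calibration_table(predictions, n_bins=5):
    """predictions: list of (confidence in [0, 1], was_correct) pairs.
    Returns (mean confidence, accuracy, count) per non-empty bin."""
    bins = defaultdict(list)
    for conf, correct in predictions:
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the top bin
        bins[idx].append((conf, correct))
    table = []
    for idx in sorted(bins):
        items = bins[idx]
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        table.append((mean_conf, accuracy, len(items)))
    return table

def expected_calibration_error(predictions, n_bins=5):
    """Count-weighted average gap between confidence and accuracy."""
    total = len(predictions)
    return sum(n / total * abs(conf - acc)
               for conf, acc, n in calibration_table(predictions, n_bins))

# Toy data: both models are right half the time, but only one says so.
overconfident = [(0.9, True), (0.9, False)] * 5
calibrated = [(0.5, True), (0.5, False)] * 5

ece_overconfident = expected_calibration_error(overconfident)
ece_calibrated = expected_calibration_error(calibrated)
```

Both toy models have identical accuracy. The difference lies entirely in how stated confidence relates to correctness, which is exactly what a set of correct demonstrations does not expose.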

Failure modes that training can even create

It’s not just that imitation under-delivers. Done naively, training on human approval can actively bake in problems.

The sharpest example is sycophancy: the tendency to tell users what they want to hear rather than what’s true. It arises naturally because agreeable, flattering responses tend to look good to a casual human rater, so they get preferred, and the model learns that agreeing is rewarded, even when the user is wrong. Sycophancy is a direct consequence of optimizing a human-approval signal carelessly: the model is doing exactly what the signal asked, which turns out not to be what we meant.

That gap, between the signal we optimize and the behavior we actually want, has a name we’ll spend a whole chapter on: reward hacking. When you replace “imitate demonstrations” with “maximize a reward,” you hand the model an objective it will pursue literally, and it will find the cracks: exploiting quirks of the reward model, gaming length or format, producing answers that score high without being good. We dig into this, along with Goodhart’s law (its theoretical underpinning), in Chapter 15. Flag it now as the central catch of the approach we’re about to adopt: optimizing a proxy for “good” is powerful precisely because it’s relentless, and dangerous for exactly the same reason.
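A toy illustration of the gap, under loudly stated assumptions: `true_quality` and `proxy_reward` below are invented for this sketch, and the length bonus is a deliberately planted flaw, not a property of any real reward model. The point is only that picking the highest-scoring candidate under a flawed proxy can select an answer that is not good.

```python
def true_quality(response: str) -> float:
    """Hypothetical ground truth for the prompt 'What is 2 + 2?':
    1.0 if the response contains the token '4', else 0.0."""
    return 1.0 if "4" in response.split() else 0.0

def proxy_reward(response: str) -> float:
    """Hypothetical flawed proxy: partial credit for correctness,
    plus a bonus per word (the planted quirk)."""
    return 0.5 * true_quality(response) + 0.1 * len(response.split())

candidates = [
    "4",
    "The answer is 4",
    "Great question! Arithmetic is a rich and fascinating subject with a long history",
]

# Optimizing the proxy selects the longest response, which never answers.
best_by_proxy = max(candidates, key=proxy_reward)
# Judging by true quality selects a response that is actually correct.
best_by_truth = max(candidates, key=true_quality)
```

Here the proxy agrees with the truth on the first two candidates and only fails on the third. A stronger optimizer searches more candidates, so it is more likely to find the ones where the proxy fails.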

From imitating to optimizing

Here’s the throughline. Pure imitation — SFT — is necessary and powerful, but it is bounded by the demonstrations, blind to what it isn’t shown, and unable to express the comparative judgments humans make most reliably. The way past all three limits is the same move: stop trying to specify the ideal output, and instead optimize a signal of what’s better.

That signal can come from human preferences (RLHF), from AI preferences (RLAIF and Constitutional AI), or from automatic verifiers (RLVR). The algorithms differ, but the philosophy is shared: define a notion of “better,” then push the model up that gradient, while keeping it leashed to a sensible reference and keeping its distribution from collapsing, using exactly the tools from the previous chapter.
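In symbols, “push up the gradient while staying leashed” is often written as a KL-regularized objective. The form below is a common formulation given as a sketch, not necessarily the exact one used later in this explainer. The notation is assumed here: $\pi$ is the policy being trained, $\pi_{\mathrm{ref}}$ the reference model, $r$ the reward signal, $\mathcal{D}$ the prompt distribution, and $\beta > 0$ the strength of the leash.

```latex
\max_{\pi} \;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi(\cdot \mid x)}
\bigl[\, r(x, y) \,\bigr]
\;-\;
\beta \,
\mathbb{E}_{x \sim \mathcal{D}}
\Bigl[\, \mathrm{KL}\bigl( \pi(\cdot \mid x) \,\big\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \bigr) \Bigr]
```

The first term rewards responses the signal scores highly. The second penalizes drifting away from the reference, with larger $\beta$ meaning a shorter leash. Any separate term for keeping the distribution from collapsing is omitted from this sketch.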

But before we can optimize preferences, we need the model to behave like an assistant at all — to know the format of a conversation and the basic shape of a helpful answer. That’s the job of imitation, done well. So we begin where every real pipeline begins: with instruction tuning, the supervised stage that turns a raw base model into something worth optimizing in the first place.