Section 10

RLAIF & Constitutional AI

AI feedback and scalable oversight

Papers: Constitutional AI: Harmlessness from AI Feedback — Bai et al., 2022 · RLAIF vs. RLHF — Lee et al., 2024 · Llama 2: Open Foundation and Fine-Tuned Chat Models — Touvron et al., 2023

The three-step recipe has one expensive, slow ingredient buried inside it: humans. Every reward model needs a steady diet of human comparisons, and humans are costly, inconsistent, and — for some tasks — simply not good enough or numerous enough to keep up. The natural question: what if the judge in the loop were itself a model? This chapter is about replacing human labels with AI labels, the surprising fact that it works about as well, and the new problems it creates.

RLAIF: swap the labeler

RLAIF (Reinforcement Learning from AI Feedback) keeps the entire RLHF machinery from the last three chapters and changes exactly one thing: the pairwise comparisons that train the reward model come from an AI labeler instead of a human. You prompt a capable model with two candidate responses and ask which is better, and why; its judgment becomes the preference label. Everything downstream (the Bradley–Terry reward model, the PPO optimization, the KL leash) is unchanged.
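The labeling step can be sketched in a few lines of Python. This is a minimal illustration, not the procedure from any particular paper: the prompt template, the `Verdict:` output convention, and the names `ai_preference_label` and `toy_judge` are all assumptions made for the example, and `toy_judge` stands in for a real model call.

```python
from typing import Callable, Dict

# Hypothetical prompt template; the wording is an assumption, not taken from any paper.
JUDGE_TEMPLATE = (
    "Prompt:\n{prompt}\n\n"
    "Response A:\n{a}\n\n"
    "Response B:\n{b}\n\n"
    "Which response is better? Explain briefly, then end with "
    "'Verdict: A' or 'Verdict: B'."
)


def ai_preference_label(
    prompt: str,
    response_a: str,
    response_b: str,
    judge: Callable[[str], str],
) -> Dict[str, str]:
    """Ask a judge model to compare two responses and return a preference record."""
    judge_output = judge(JUDGE_TEMPLATE.format(prompt=prompt, a=response_a, b=response_b))
    # Only the text after the last "Verdict:" is parsed; the rationale is ignored.
    verdict = judge_output.strip().rsplit("Verdict:", 1)[-1].strip().upper()
    if verdict not in ("A", "B"):
        raise ValueError(f"Could not parse a verdict from: {judge_output!r}")
    if verdict == "A":
        chosen, rejected = response_a, response_b
    else:
        chosen, rejected = response_b, response_a
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


def toy_judge(judge_prompt: str) -> str:
    """Stand-in for a real model call: always prefers response B."""
    return "Response B is more complete.\nVerdict: B"


record = ai_preference_label("Summarize the article.", "Short.", "A fuller summary.", toy_judge)
print(record["chosen"])
```

The returned record has the same shape a human comparison would have, which is why nothing downstream needs to change.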

Is an AI judge good enough? The clearest evidence came from RLAIF vs. RLHF (Lee et al., 2024), which ran the two pipelines head to head on summarization and dialogue and found that RLAIF achieves performance comparable to RLHF: human evaluators preferred the AI-feedback-trained models at roughly the same rate as the human-feedback-trained ones. On those tasks, the labeler can be a model and quality barely moves. Results like this are a large part of why AI feedback is now so common in post-training.

Constitutional AI: the model critiques itself

The idea was pioneered, and given its sharpest form, in Constitutional AI (Bai et al., 2022). The goal was harmlessness without armies of humans labeling toxic content — work that is both expensive and genuinely unpleasant. The method has two phases, and the elegant move is that the model improves itself against a short written document called a constitution: a list of plain-language principles like “choose the response that is least harmful” or “prefer the answer that is honest and non-evasive.”

The first phase is supervised self-revision. The model generates a response to a red-teaming prompt, then is asked to critique its own answer against a constitutional principle, and finally to revise it to better satisfy that principle. Fine-tuning on the revised answers produces a model that has, in effect, edited itself toward the constitution — no human ever wrote a demonstration.
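The critique-and-revise loop described above can be sketched as follows. Treat it as an illustration under stated assumptions: the prompt wording, the choice to sample one principle at random per round, and the names `critique_and_revise` and `toy_generate` are hypothetical, and `toy_generate` replaces a real model so the example runs on its own.

```python
import random
from typing import Callable, Dict, List

# Example principles quoted from the text above; a real constitution is longer.
CONSTITUTION: List[str] = [
    "Choose the response that is least harmful.",
    "Prefer the answer that is honest and non-evasive.",
]


def critique_and_revise(
    prompt: str,
    generate: Callable[[str], str],
    rng: random.Random,
    num_rounds: int = 1,
) -> Dict[str, str]:
    """Draft a response, then critique and revise it against sampled principles."""
    response = generate(prompt)
    for _ in range(num_rounds):
        principle = rng.choice(CONSTITUTION)
        critique = generate(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique the response against this principle: {principle}"
        )
        response = generate(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            f"Rewrite the response so it better satisfies: {principle}"
        )
    # The (prompt, final revision) pair is the supervised fine-tuning example.
    return {"prompt": prompt, "completion": response}


def toy_generate(text: str) -> str:
    """Stand-in for a model call so the sketch runs without any model."""
    if text.startswith("Prompt:") and "Rewrite the response" in text:
        return "revised answer"
    if text.startswith("Prompt:") and "Critique the response" in text:
        return "the draft could be safer"
    return "draft answer"


example = critique_and_revise("a red-teaming prompt", toy_generate, random.Random(0))
print(example)
```

Only the final revision is kept for fine-tuning; the draft and the critique are scaffolding that gets discarded.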

The second phase is RLAIF proper. The model generates pairs of responses and an AI labeler picks which one better satisfies the constitution, producing AI preference data. That data trains a reward model, and PPO optimizes against it: the standard loop, but with the human judge replaced, for harmlessness, by a model consulting a written rule book. (In Bai et al., the helpfulness labels still came from humans.)
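Whether the labels come from humans or from an AI judge, the reward model is fit with the same Bradley–Terry objective: the probability that the chosen response beats the rejected one is modeled as the sigmoid of their reward difference, and training minimizes the negative log of that probability. The sketch below computes the loss on plain floats for illustration; the function name `bradley_terry_loss` is an assumption, and a real reward model would produce the scores with a neural network and backpropagate through them.

```python
import math
from typing import List, Tuple


def bradley_terry_loss(score_pairs: List[Tuple[float, float]]) -> float:
    """Mean negative log-likelihood that each chosen response beats its rejected one.

    Each pair is (reward of chosen, reward of rejected). Under Bradley-Terry,
    P(chosen beats rejected) = sigmoid(r_chosen - r_rejected).
    """
    total = 0.0
    for r_chosen, r_rejected in score_pairs:
        margin = r_chosen - r_rejected
        # -log(sigmoid(margin)) == log(1 + exp(-margin)), in a numerically stable form.
        total += max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return total / len(score_pairs)


# Illustrative scores only: the first pair is ranked correctly, the second is not.
print(bradley_terry_loss([(1.5, 0.2), (0.1, 0.4)]))
```

The loss shrinks as the reward model scores chosen responses further above rejected ones, and equals log 2 when it cannot tell them apart.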

Scalable oversight: why this matters beyond cost

The deeper motivation isn’t just saving money. It’s scalable oversight: the problem of supervising tasks that are too hard, too numerous, or too specialized for humans to reliably judge. How do you human-label the correctness of a 2,000-line program, a subtle mathematical proof, or a million-response firehose? You often can’t, at least not fast enough or well enough. If models can help judge other models, oversight can scale alongside capability instead of being throttled by human bandwidth. AI feedback is the first rung on that ladder, and the same impulse runs straight through to the verifiable-reward methods later in this explainer.

The open case study: Llama 2-Chat

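The most detailed open account of an RLHF pipeline at scale is Llama 2 (Touvron et al., 2023), and it’s worth knowing as the canonical worked example. Its preference labels came from human annotators rather than an AI judge, but it shows in detail the downstream machinery that AI feedback plugs into. A few choices stand out: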

  • Two reward models, not one. Llama 2-Chat trains a separate reward model for helpfulness and another for safety, because optimizing a single blended score tends to let one objective quietly swallow the other. Keeping them apart lets the team balance “be useful” against “be safe” explicitly.
  • Rejection sampling and PPO. Each round, the policy samples many candidate responses; the best-scoring ones (by the reward models) are kept and fine-tuned on. This is rejection sampling, a simple offline form of preference optimization. PPO then does the on-policy RL refinement on top. The two are complementary stages, not rivals.
  • Iterative rounds. The whole loop runs repeatedly, with fresh preference data collected on each improved model — Christiano’s original “loop and refine” structure, at production scale.
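The rejection-sampling stage from the list above can be sketched as a best-of-n selection. This is a simplified illustration, not Llama 2's implementation: `sample` and `score` are hypothetical callables standing in for the policy and for whatever combination of reward models is used, and the toy versions below exist only so the example runs.

```python
from typing import Callable, Dict, List


def rejection_sample(
    prompts: List[str],
    sample: Callable[[str, int], List[str]],
    score: Callable[[str, str], float],
    num_candidates: int = 4,
) -> List[Dict[str, str]]:
    """For each prompt, draw candidates and keep the highest-scoring one."""
    kept = []
    for prompt in prompts:
        candidates = sample(prompt, num_candidates)
        best = max(candidates, key=lambda response: score(prompt, response))
        kept.append({"prompt": prompt, "completion": best})
    return kept


def toy_sample(prompt: str, n: int) -> List[str]:
    """Toy policy: candidate i ends with i exclamation marks."""
    return [f"{prompt} answer " + "!" * i for i in range(n)]


def toy_score(prompt: str, response: str) -> float:
    """Toy reward: prefers responses with exactly two exclamation marks."""
    return -abs(response.count("!") - 2)


sft_data = rejection_sample(["Q1", "Q2"], toy_sample, toy_score)
print(sft_data)
```

The kept pairs are then used as ordinary supervised fine-tuning data, which is what makes this stage simple and stable compared with on-policy RL.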

Llama 2 is the downstream recipe made real: multiple specialized reward models and rejection sampling feeding PPO, all wrapped in iterative rounds, with human annotators in the seat where an RLAIF pipeline would put an AI labeler.

The catch: who writes the constitution?

AI feedback is cheap, fast, scalable, and — usefully — consistent, since a model applies the same criteria to every example without the fatigue and drift that plague human labelers. But the benefits come with a real cost. The AI labeler’s biases get baked in. If the judge model has a blind spot, a stylistic preference, or a subtle misjudgment, that flaw is now stamped onto every preference label and propagated into the reward model and the policy. You’ve automated the judge, including its mistakes.

And the constitution itself raises the uncomfortable question the method can’t escape: who writes it? The principles encode values — about what’s harmful, what’s honest, what’s worth refusing — and those choices are made by a small group of people and frozen into a document that then shapes the model’s behavior at scale. Scalable oversight makes the mechanism of alignment cheaper; it does nothing to settle the content of what we’re aligning to.

Where this is going

We’ve now spent four chapters on the preference half of RLHF — comparisons, reward models, and where the labels come from, human or AI. But we’ve kept treating the reinforcement learning itself as a black box labeled “PPO.” Section 4 finally opens it. We’ll build the RL machinery from the ground up — policies, returns, the policy gradient and REINFORCE, then value functions and baselines, and finally the trust-region idea that leads to PPO — and answer the question this whole section has been deferring: what is the algorithm that turns a reward signal into a better model?