Section 17

The DPO zoo

IPO, KTO, ORPO, and SimPO

DPO was so clean that within a year the literature filled with descendants — each one a small surgical edit to the loss, each fixing one specific weakness we flagged at the end of the last chapter. Together they form what people half-jokingly call the DPO zoo. You don’t need to memorize the menagerie, but you should know the four that matter and exactly which problem each one solves.

IPO: stop trusting deterministic preferences

The first crack in DPO is subtle. Its loss pushes the implicit-reward gap between chosen and rejected ever wider, and when a preference pair is labeled with certainty (always $y_w \succ y_l$, never the reverse), there’s nothing stopping that gap from running off to infinity. The model can drive $\pi_\theta(y_l)$ toward zero, and the KL regularization, in this limit, fails to hold it back. The result is over-fitting to the exact preference labels you happened to collect.
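To see the runaway concretely, here is a minimal sketch of the per-pair DPO loss in plain Python. The function names, the log-probability values, and `beta=0.1` are illustrative assumptions; the inputs stand for summed sequence log-probabilities under the policy and the frozen reference.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-pair DPO loss from summed sequence log-probabilities."""
    # Implicit reward of a response: beta * log(pi_theta / pi_ref).
    reward_w = beta * (logp_w - ref_logp_w)
    reward_l = beta * (logp_l - ref_logp_l)
    # -log(sigmoid(margin)) equals softplus(-margin).
    return softplus(-(reward_w - reward_l))

# Hypothetical numbers: hold the chosen answer fixed and keep pushing
# the rejected answer's log-probability further below the reference.
losses = [dpo_loss(-15.0, logp_l, -15.0, -20.0)
          for logp_l in (-20.0, -200.0, -2000.0)]
print(losses)
```

Each step down in the rejected log-probability lowers the loss again, and the loss never reaches a finite minimizer. With a deterministic label, nothing in the objective says "enough".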

IPO (Identity Preference Optimization, Azar et al. 2023) fixes this by replacing the logistic loss with a squared loss that targets a finite margin rather than an ever-growing one. Instead of “make the gap as large as possible,” IPO says “make the log-ratio gap (the implicit-reward gap before the factor of $\beta$) equal to $\tfrac{1}{2\beta}$, and no larger.” That bounded target keeps the regularization meaningful even when preferences are deterministic, so IPO is harder to over-fit and degrades more gracefully on small or noisy preference sets.
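A minimal sketch of that squared objective follows, assuming the target $\tfrac{1}{2\beta}$ is applied to the gap between the chosen and rejected log-ratios (policy versus reference). The function name and all numbers are hypothetical.

```python
def ipo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-pair IPO-style loss: squared distance of the gap from 1/(2*beta)."""
    # Gap between the chosen and rejected log-ratios (policy vs. reference).
    gap = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    target = 1.0 / (2.0 * beta)
    return (gap - target) ** 2

# Hypothetical numbers with beta=0.1, so the target gap is 5.
at_zero_gap = ipo_loss(-15.0, -20.0, -15.0, -20.0)   # gap = 0
at_target = ipo_loss(-15.0, -25.0, -15.0, -20.0)     # gap = 5
overshoot = ipo_loss(-15.0, -200.0, -15.0, -20.0)    # gap = 180
print(at_zero_gap, at_target, overshoot)
```

The loss is smallest when the gap sits at the target and grows again on the far side, so pushing the rejected answer toward zero probability is penalized rather than rewarded.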

KTO: drop the pairs entirely

DPO needs paired data: for every prompt, a chosen and a rejected answer, judged against each other. But a lot of real feedback isn’t paired — it’s a thumbs-up or thumbs-down on a single response, with no matched counterpart. Collecting clean pairs is expensive; collecting binary labels is cheap and abundant.

KTO (Kahneman–Tversky Optimization, Ethayarajh et al. 2024) throws out the pairing requirement. It works on unpaired binary good/bad labels, and it borrows its loss shape from prospect theory, Kahneman and Tversky’s model of how humans weigh gains and losses asymmetrically (losses loom larger). Each example is scored relative to a reference point, with desirable and undesirable outputs handled by separate, asymmetric terms. The practical win is large: KTO lets you align on the messy, plentiful thumbs-up/thumbs-down signal that real products actually generate, rather than the curated comparison data DPO demands.
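The shape of such a loss can be sketched per example as below. This is a simplified illustration, not the full published recipe: the reference point `z_ref` is passed in as a plain number (an assumption made here for brevity, where the method estimates it during training), and the names `kto_loss`, `lambda_d`, and `lambda_u` and their default values are hypothetical.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def kto_loss(logp, ref_logp, desirable, z_ref=0.0, beta=0.1,
             lambda_d=1.0, lambda_u=1.0):
    """KTO-style loss for ONE unpaired example with a binary label."""
    log_ratio = logp - ref_logp  # log(pi_theta / pi_ref) for this response
    if desirable:
        # Thumbs-up: loss falls as the response rises above the reference point.
        return lambda_d * (1.0 - sigmoid(beta * (log_ratio - z_ref)))
    # Thumbs-down: loss falls as the response drops below the reference point.
    return lambda_u * (1.0 - sigmoid(beta * (z_ref - log_ratio)))

print(kto_loss(-10.0, -12.0, desirable=True))
print(kto_loss(-10.0, -12.0, desirable=False))
```

Note that the function takes a single response and a label, never a pair. Setting `lambda_u` above `lambda_d` is one way to make undesirable examples weigh more heavily.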

ORPO: fold SFT and preference into one stage

Both DPO and KTO still assume you’ve already run supervised fine-tuning (SFT), and both still need a frozen reference model sitting in memory for every forward pass. That’s two training stages and two copies of the model.

ORPO (Odds-Ratio Preference Optimization, Hong et al. 2024) collapses both. It adds an odds-ratio penalty term directly onto the ordinary SFT loss: alongside the standard next-token objective on the chosen answer, it adds a term that increases the odds of the chosen response relative to the rejected one. Because the odds ratio is a self-contained contrast between chosen and rejected, ORPO needs no reference model at all. The payoff is a single-stage, reference-free recipe: one pass over your data does instruction-following and preference alignment together, with one copy of the model in memory instead of DPO’s two.
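As a rough sketch of how the two terms combine, the function below takes the average per-token log-probability of each response under the policy, adds the SFT term on the chosen answer, and adds a weighted odds-ratio penalty. The names `orpo_loss` and `lam`, the weight 0.1, and the input values are illustrative assumptions.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def log_odds(avg_logp):
    """log(p / (1 - p)) for p = exp(avg_logp); requires avg_logp < 0."""
    return avg_logp - math.log1p(-math.exp(avg_logp))

def orpo_loss(avg_logp_w, avg_logp_l, lam=0.1):
    """ORPO-style loss: SFT term on the chosen answer plus odds-ratio penalty."""
    sft = -avg_logp_w  # ordinary next-token negative log-likelihood
    log_odds_ratio = log_odds(avg_logp_w) - log_odds(avg_logp_l)
    penalty = softplus(-log_odds_ratio)  # equals -log(sigmoid(log_odds_ratio))
    return sft + lam * penalty

# Hypothetical values. No reference-model log-probabilities appear anywhere.
print(orpo_loss(-0.5, -0.5))  # chosen and rejected equally likely
print(orpo_loss(-0.5, -2.0))  # rejected already less likely: smaller penalty
```

Every input comes from the policy being trained, which is what makes the objective reference-free.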

SimPO: kill the length bias, kill the reference model

The last lingering problem is one we met two chapters ago: length bias. DPO’s implicit reward is a sum of per-token log-probabilities, so longer sequences accumulate larger magnitudes — the loss has a built-in thumb on the scale for length, exactly the hack we want to avoid.

SimPO (Simple Preference Optimization, Meng et al. 2024) makes two changes. First, it length-normalizes the implicit reward by dividing by the number of tokens, so the reward is an average log-probability rather than a sum, neutralizing the length advantage. Second, like ORPO it drops the reference model entirely, and it adds an explicit target margin $\gamma$ that the chosen answer must clear. The result is a strikingly simple, memory-light loss that, on many benchmarks, matches or beats DPO and produces noticeably less length-inflated output.
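Both changes fit in a few lines. The sketch below assumes each response is described by its summed log-probability under the policy and its token count; the name `simpo_loss` and the values of `beta` and `gamma` are illustrative, not recommended settings.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def simpo_loss(sum_logp_w, len_w, sum_logp_l, len_l, beta=2.0, gamma=0.5):
    """SimPO-style loss: length-normalized reward, no reference, margin gamma."""
    # Reward is the AVERAGE per-token log-probability, scaled by beta.
    reward_w = beta * sum_logp_w / len_w
    reward_l = beta * sum_logp_l / len_l
    # The chosen reward must beat the rejected one by at least gamma.
    return softplus(-(reward_w - reward_l - gamma))

# Hypothetical pair, then the same per-token quality at double the length.
short = simpo_loss(-10.0, 10, -20.0, 10)
long = simpo_loss(-20.0, 20, -40.0, 20)
print(short, long)
```

Doubling both lengths at the same per-token log-probability leaves the loss unchanged, which is the length neutrality the normalization is meant to buy. Raising `gamma` increases the loss for the same pair, since the chosen answer has a higher bar to clear.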

The zoo at a glance

| Method | Reference-free? | Paired data? | Key idea |
| --- | --- | --- | --- |
| DPO | No | Yes | Implicit reward = $\beta\log\frac{\pi_\theta}{\pi_{\text{ref}}}$; logistic preference loss |
| IPO | No | Yes | Squared loss to a bounded margin; resists deterministic-preference over-fitting |
| KTO | No | No (binary) | Prospect-theory loss on unpaired good/bad labels |
| ORPO | Yes | Yes | Odds-ratio term folds SFT + preference into one stage |
| SimPO | Yes | Yes | Length-normalized, reference-free reward + target margin $\gamma$ |

Try it

Pick a variant and watch how its loss responds as you vary the preference margin and the response length. Notice how IPO’s squared loss bottoms out at a finite target instead of pushing forever, how SimPO’s length-normalized curve refuses to reward sheer length, and which methods need that frozen reference curve at all.

[Interactive figure: The DPO variant zoo. Each variant tweaks one ingredient of direct preference optimization. Panel shown: DPO (reference-free: no; needs paired data: yes), the logistic loss $-\log\sigma(\text{margin})$ on the implicit reward margin. Axes: loss against reward margin $r(y_w) - r(y_l)$, from −4 to +4.]
The DPO "zoo": each variant changes one ingredient — the loss shape, the reference model, or paired-vs-unpaired data — to fix a specific weakness. IPO swaps the logistic for a squared loss to a target margin; KTO drops the need for pairs; ORPO and SimPO remove the reference model. The curves above are qualitative illustrations of each loss's character, not the exact published formulas.
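The same comparison can be sketched offline by sweeping the margin and tabulating two of the loss shapes side by side. The margin grid and the squared-loss target of 2.0 are arbitrary illustrative choices, and, as with the curves above, these are qualitative shapes rather than the exact published formulas.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def logistic_loss(margin):
    """DPO-style shape: keeps falling as the margin grows."""
    return softplus(-margin)

def squared_loss(margin, target):
    """IPO-style shape: bottoms out at a finite target margin."""
    return (margin - target) ** 2

margins = [-4.0, -2.0, 0.0, 2.0, 4.0, 8.0]  # hypothetical sweep
target = 2.0                                # hypothetical target margin

print("margin  logistic   squared")
for m in margins:
    print(f"{m:+6.1f}  {logistic_loss(m):8.4f}  {squared_loss(m, target):8.4f}")

# Where does each shape reach its lowest value on this grid?
best_logistic = min(margins, key=logistic_loss)
best_squared = min(margins, key=lambda m: squared_loss(m, target))
print(best_logistic, best_squared)
```

On any grid, the logistic column is lowest at the largest margin you include, while the squared column is lowest at the target and rises on both sides.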

Where this leaves us

The offline-preference family — DPO and its zoo — gives you alignment without a reward model and without an RL loop: cheap, stable, and reference-free in its most modern forms. What it still requires is preference data: someone, human or model, deciding which of two answers is better. The next chapter steps to an even simpler RL-free idea that needs only a score, not a comparison — rejection-sampling alignment — and it turns out to be the bridge straight into the reasoning era.