Section 16

Direct Preference Optimization

Collapsing RLHF into a single loss

Paper: Direct Preference Optimization: Your Language Model is Secretly a Reward Model — Rafailov et al., 2023

By 2023, the RLHF recipe was the standard but also the headache: train a reward model, then run a delicate PPO loop against it, juggling a policy, a reference model, a reward model, and a critic all at once. That is four networks, a sampling loop, and every instability we’ve spent the last chapters cataloguing. Then a paper arrived with a startling claim: you can throw all of that away. No reward model. No RL loop. Just one supervised loss on your preference pairs. This is Direct Preference Optimization (DPO), and it reorganized the field almost overnight.

The trick in one sentence

DPO’s insight is that the reward model and the policy are not really two separate things. The RLHF objective already implies a relationship between them — and once you write that relationship down, you can express the reward entirely in terms of the policy, substitute it into the preference loss, and the reward model simply vanishes. The policy, it turns out, is secretly a reward model. Let’s derive it, because the derivation is short and the payoff is the whole chapter.

Step 1: the optimal RLHF policy has a closed form

Recall the RLHF objective from the PPO chapter: maximize reward while staying close to the reference, with a KL penalty of strength $\beta$ holding you near $\pi_{\text{ref}}$:

$$\max_{\pi}\ \mathbb{E}_{x,\,y\sim\pi}\big[\,r(x,y)\,\big] \;-\; \beta\,\mathrm{KL}\!\big(\pi(\cdot|x)\,\big\|\,\pi_{\text{ref}}(\cdot|x)\big).$$

This particular objective has a known closed-form solution. The policy that maximizes it is the reference distribution, reweighted by the exponentiated reward:

$$\pi^{*}(y|x) \;=\; \frac{1}{Z(x)}\,\pi_{\text{ref}}(y|x)\,\exp\!\Big(\tfrac{1}{\beta}\,r(x,y)\Big),$$

where $Z(x) = \sum_{y}\pi_{\text{ref}}(y|x)\exp(r(x,y)/\beta)$ is a normalizing constant (the partition function) that makes it a valid distribution. Intuitively: start from the reference, then up-weight high-reward answers and down-weight low-reward ones, with $\beta$ controlling how aggressively (a smaller $\beta$ means a sharper reweighting, a larger $\beta$ keeps you closer to the reference). This is just the familiar fact that the KL-regularized reward objective is solved by a softmax-like Boltzmann distribution.
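To make the closed form concrete, here is a minimal sketch for a single prompt with a finite set of candidate answers. The three-answer toy distribution, the reward values, and the function names are assumptions chosen for illustration; real answer spaces are far too large to enumerate this way.

```python
import math

def optimal_policy(ref_probs, rewards, beta):
    """Return pi*(y|x), proportional to pi_ref(y|x) * exp(r(x,y)/beta), and Z(x)."""
    weights = [p * math.exp(r / beta) for p, r in zip(ref_probs, rewards)]
    z = sum(weights)  # partition function Z(x)
    return [w / z for w in weights], z

def objective(probs, ref_probs, rewards, beta):
    """Expected reward minus beta * KL(pi || pi_ref) for one prompt."""
    expected_reward = sum(p * r for p, r in zip(probs, rewards))
    kl = sum(p * math.log(p / q) for p, q in zip(probs, ref_probs) if p > 0)
    return expected_reward - beta * kl

# Hypothetical toy prompt with three candidate answers.
ref_probs = [0.5, 0.3, 0.2]
rewards = [1.0, 2.0, 0.0]
beta = 0.5

pi_star, z = optimal_policy(ref_probs, rewards, beta)
best = objective(pi_star, ref_probs, rewards, beta)

# Two alternatives for comparison: staying at the reference,
# and a near-greedy policy that piles mass on the top-reward answer.
ref_score = objective(ref_probs, ref_probs, rewards, beta)
greedy_score = objective([0.001, 0.998, 0.001], ref_probs, rewards, beta)

print(pi_star, best, ref_score, greedy_score)
```

In this toy setting the reweighted policy scores higher than both alternatives: the reference pays no KL cost but leaves reward on the table, while the near-greedy policy collects more reward but pays too much KL. The optimal objective value also works out to $\beta\log Z(x)$, which is a quick sanity check on the implementation.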

Step 2: invert it to read off the reward

Here’s the move. That equation relates the optimal policy to the reward. So solve it for the reward. Take logs and rearrange:

$$r(x,y) \;=\; \beta\,\log\frac{\pi^{*}(y|x)}{\pi_{\text{ref}}(y|x)} \;+\; \beta\,\log Z(x).$$

Read that carefully, because it is the heart of DPO. The reward of any answer is just $\beta$ times the log-ratio between the optimal policy and the reference, plus a term $\beta\log Z(x)$ that depends only on the prompt $x$, not on the answer $y$. The policy encodes the reward. If you know the optimal policy and the reference, you already know the reward up to a per-prompt constant. Writing the same log-ratio with the trainable policy $\pi_\theta$ in place of $\pi^{*}$ gives what we call the implicit reward: $\hat r(x,y) = \beta\log\frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)}$.
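For readers who want the "take logs and rearrange" step of Step 2 spelled out, the intermediate lines are below. Nothing here goes beyond the Step 1 closed form; it is only algebra.

```latex
\begin{aligned}
\pi^{*}(y|x) &= \frac{1}{Z(x)}\,\pi_{\text{ref}}(y|x)\,\exp\!\Big(\tfrac{1}{\beta}\,r(x,y)\Big) \\
\log \pi^{*}(y|x) &= \log \pi_{\text{ref}}(y|x) + \tfrac{1}{\beta}\,r(x,y) - \log Z(x) \\
\tfrac{1}{\beta}\,r(x,y) &= \log\frac{\pi^{*}(y|x)}{\pi_{\text{ref}}(y|x)} + \log Z(x) \\
r(x,y) &= \beta\,\log\frac{\pi^{*}(y|x)}{\pi_{\text{ref}}(y|x)} + \beta\,\log Z(x)
\end{aligned}
```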

Step 3: substitute into the preference likelihood

Now bring in the Bradley–Terry model from the reward-models chapter. It says the probability that a “winning” answer $y_w$ beats a “losing” one $y_l$ is a logistic function of their reward difference:

$$P(y_w \succ y_l \mid x) \;=\; \sigma\!\big(r(x,y_w) - r(x,y_l)\big).$$

Substitute our expression for $r$ from Step 2. The magic: the troublesome $\beta\log Z(x)$ term depends only on $x$, so it is identical for $y_w$ and $y_l$ and cancels in the difference. The intractable partition function, the thing that normally forces you into RL, disappears completely. What’s left, with the trainable policy $\pi_\theta$ standing in for $\pi^{*}$, is written purely in terms of the policy and the reference:

$$P(y_w \succ y_l \mid x) \;=\; \sigma\!\Big(\beta\log\frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta\log\frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\Big).$$
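The cancellation of the partition function can be checked numerically. The sketch below builds the optimal policy for a hypothetical three-answer prompt (the probabilities, rewards, and $\beta$ are made-up values) and compares the Bradley–Terry probability computed from the true rewards with the one computed from policy log-ratios alone.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical toy prompt with three candidate answers.
ref_probs = [0.5, 0.3, 0.2]
rewards = [1.0, 2.0, 0.0]
beta = 0.5

# Optimal policy from the Step 1 closed form.
weights = [p * math.exp(r / beta) for p, r in zip(ref_probs, rewards)]
z = sum(weights)  # partition function Z(x)
pi_star = [w / z for w in weights]

def implicit_reward(i):
    """beta * log(pi*(y_i|x) / pi_ref(y_i|x)), with no Z(x) term."""
    return beta * math.log(pi_star[i] / ref_probs[i])

w, l = 1, 2  # indices of the winning and losing answers

# Bradley-Terry probability from the true rewards ...
p_true = sigmoid(rewards[w] - rewards[l])
# ... and from the log-ratios alone.
p_implicit = sigmoid(implicit_reward(w) - implicit_reward(l))

# Each implicit reward differs from the true one by the same per-prompt constant.
offset_w = rewards[w] - implicit_reward(w)
offset_l = rewards[l] - implicit_reward(l)

print(p_true, p_implicit, offset_w, offset_l)
```

Both probabilities agree, and both offsets equal $\beta\log Z(x)$, which is exactly the term that drops out of the difference. Note that the match is exact here only because the log-ratios use the optimal policy for the given reward.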

Maximize the likelihood of the observed preferences — equivalently, minimize its negative log — and you have the DPO loss:

$$\mathcal{L}_{\text{DPO}} \;=\; -\,\mathbb{E}_{(x,y_w,y_l)}\Big[\log\sigma\Big(\beta\log\tfrac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta\log\tfrac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\Big)\Big].$$

That’s it. A plain supervised loss over preference triples $(x, y_w, y_l)$: no reward model, no sampling, no RL loop, no critic. You compute four log-probabilities (the chosen and rejected answers under both $\pi_\theta$ and the frozen $\pi_{\text{ref}}$), form the difference, push it through $-\log\sigma$, and backpropagate.
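The recipe of four log-probabilities, a difference, and $-\log\sigma$ can be sketched in a few lines. This is a scalar, framework-free illustration for a single pair, not a training implementation: the function name `dpo_loss`, the default `beta=0.1`, and the example log-probabilities are assumptions. In practice each log-probability would be a sum of per-token log-probabilities computed in an autodiff framework so the loss can be backpropagated.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss from four sequence log-probabilities."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) equals softplus(-margin); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Hypothetical sequence log-probabilities under the frozen reference.
ref_chosen, ref_rejected = -12.0, -11.5

# At initialization the policy equals the reference, so the margin is 0.
loss_at_init = dpo_loss(ref_chosen, ref_rejected, ref_chosen, ref_rejected)

# Later the policy has raised the chosen log-prob and lowered the rejected one.
loss_after = dpo_loss(-11.0, -12.5, ref_chosen, ref_rejected)

print(loss_at_init, loss_after)
```

When the policy starts as a copy of the reference, both log-ratios are zero and the loss is $\log 2$ regardless of the data. The loss drops only as the chosen log-ratio pulls ahead of the rejected one.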

Try it

Toggle between the two-stage RLHF pipeline (reward model + PPO loop) and the single DPO loss on the same preference pair. Watch the implicit reward $\beta\log\frac{\pi_\theta}{\pi_{\text{ref}}}$ move the chosen and rejected log-probabilities apart, and see that it lands in the same place the RM-plus-PPO route was trying to reach, just without ever building the reward model.

RLHF vs DPO: same goal, fewer stages

One preference pair, a chosen response $y_w$ and a rejected one $y_l$, optimized two different ways.

RLHF (3 stages)

1. Collect preferences: $(y_w \succ y_l)$ pairs
2. Train reward model: fit $\hat r(x, y)$ via Bradley–Terry
3. PPO RL loop: sample → score with $\hat r$ → KL-regularized update → repeat

DPO (1 stage)

1. One supervised loss on the pair: $-\log\sigma\big(\beta\log\frac{\pi(y_w)}{\pi_{\text{ref}}(y_w)} - \beta\log\frac{\pi(y_l)}{\pi_{\text{ref}}(y_l)}\big)$, with no reward model and no RL loop

DPO implicit reward: $r(x, y) = \beta\log\frac{\pi(y|x)}{\pi_{\text{ref}}(y|x)}$

$y_w$ (chosen): r = 0.140 (log-ratio 1.40)

$y_l$ (rejected): r = -0.080 (log-ratio -0.80)

Implicit reward margin $r(y_w) - r(y_l)$: +0.220

DPO optimizes the same objective as RLHF but skips the explicit reward model and the RL loop by using a closed-form loss. The reward is implicit in the policy-versus-reference log-ratio: $r(x, y) = \beta\log(\pi/\pi_{\text{ref}})$. Both pipelines are ultimately doing the same thing, increasing the margin between the chosen and rejected responses, but DPO collapses the three RLHF stages into one supervised step.
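The numbers displayed in the comparison can be reproduced by hand. The sketch below takes the two log-ratios shown (1.40 and -0.80) and assumes $\beta = 0.1$, a value that is inferred from the displayed rewards (0.140 / 1.40) rather than stated anywhere in the text.

```python
import math

beta = 0.1  # inferred from the displayed values (0.140 / 1.40), not stated in the text
logratio_chosen = 1.40    # log(pi / pi_ref) for y_w, as displayed
logratio_rejected = -0.80  # log(pi / pi_ref) for y_l, as displayed

# Implicit rewards and their margin.
reward_chosen = beta * logratio_chosen
reward_rejected = beta * logratio_rejected
margin = reward_chosen - reward_rejected

# Bradley-Terry preference probability and the per-pair DPO loss.
pref_prob = 1.0 / (1.0 + math.exp(-margin))
pair_loss = -math.log(pref_prob)

print(f"rewards: {reward_chosen:.3f}, {reward_rejected:.3f}; margin: {margin:.3f}")
print(f"P(y_w beats y_l) = {pref_prob:.3f}, loss = {pair_loss:.3f}")
```

Under that assumed $\beta$, a margin of 0.220 corresponds to a modelled preference probability of about 0.555 and a per-pair loss of about 0.589, only slightly below the $\log 2 \approx 0.693$ you get at zero margin.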

What DPO trades away

DPO is simpler, cheaper, and far more stable — which is why it became the default for open post-training and shows up in Llama 3, Tülu, Zephyr, and countless others. But the simplification is not free, and the trade is worth naming precisely.

DPO is offline. It learns only from the fixed set of preference pairs you collected up front; it never generates new samples and never gets fresh feedback on them. PPO, by contrast, is on-policy: at every step it samples from the current policy and scores those samples, so it can discover and reinforce good behaviors that weren’t in any human-written dataset. DPO cannot explore beyond its data. If the best answer to a prompt was never in the preference set, DPO has no way to find it.

This makes DPO sensitive to distribution shift. As training pushes $\pi_\theta$ away from the distribution that generated the preference pairs, those pairs describe a region the policy has already left, and the implicit-reward signal gets stale. PPO re-samples and stays current; DPO is stuck with the snapshot it was handed. In practice this shows up as DPO over-fitting to quirks of the dataset, and it’s a recurring motivation for the variants in the next chapter, and for iterating DPO on freshly generated, freshly labeled pairs.

The next chapter tours the DPO zoo — IPO, KTO, ORPO, SimPO — each one targeting a specific weakness we just identified: the deterministic-preference overfitting, the need for paired data, the separate SFT stage, and the lingering length bias from the previous chapter.