Section 16

Direct Preference Optimization

Collapsing RLHF into a single loss

Paper: Direct Preference Optimization: Your Language Model is Secretly a Reward Model — Rafailov et al., 2023

By 2023, the RLHF recipe was the standard but also the headache: train a reward model, then run a delicate PPO loop against it, juggling a policy, a reference, a reward model, and a critic all at once. That is four networks, a sampling loop, and every instability we’ve spent the last chapters cataloguing. Then a paper arrived with a startling claim: you can throw all of that away. No reward model. No RL loop. Just one supervised loss on your preference pairs. This is Direct Preference Optimization, and it reorganized the field almost overnight.

The trick in one sentence

DPO’s insight is that the reward model and the policy are not really two separate things. The RLHF objective already implies a relationship between them — and once you write that relationship down, you can express the reward entirely in terms of the policy, substitute it into the preference loss, and the reward model simply vanishes. The policy, it turns out, is secretly a reward model. Let’s derive it, because the derivation is short and the payoff is the whole chapter.

Step 1: the optimal RLHF policy has a closed form

Recall the RLHF objective from the PPO chapter, which maximizes reward while a KL penalty of strength $\beta$ holds you near $\pi_{\text{ref}}$:

\max_{\pi}\ \mathbb{E}_{x,\,y\sim\pi}\big[\,r(x,y)\,\big] \;-\; \beta\,\mathrm{KL}\!\big(\pi(\cdot|x)\,\big\|\,\pi_{\text{ref}}(\cdot|x)\big).

This particular objective has a known closed-form solution. The policy that maximizes it is the reference distribution, reweighted by the exponentiated reward:

\pi^{*}(y|x) \;=\; \frac{1}{Z(x)}\,\pi_{\text{ref}}(y|x)\,\exp\!\Big(\tfrac{1}{\beta}\,r(x,y)\Big),

where $Z(x) = \sum_{y}\pi_{\text{ref}}(y|x)\exp(r(x,y)/\beta)$ is a normalizing constant (the partition function) that makes it a valid distribution. Intuitively: start from the reference, then up-weight high-reward answers and down-weight low-reward ones, with $\beta$ controlling how aggressively. A small $\beta$ sharpens the distribution toward the highest-reward answers, while a large $\beta$ keeps it close to the reference. This is just the familiar fact that the KL-regularized reward objective is solved by a softmax-like Boltzmann distribution.
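To make the closed form concrete, here is a minimal numerical sketch for a single prompt with three candidate answers. The reference probabilities, rewards, and $\beta$ are made-up toy values, and the helper names are illustrative rather than taken from the paper. It builds $\pi^{*}$ from the formula above and checks that it scores higher on the KL-regularized objective than two alternative distributions.

```python
import math

def kl_regularized_objective(pi, pi_ref, rewards, beta):
    """E_{y~pi}[r(y)] - beta * KL(pi || pi_ref) for one fixed prompt."""
    expected_reward = sum(p * r for p, r in zip(pi, rewards))
    kl = sum(p * math.log(p / q) for p, q in zip(pi, pi_ref) if p > 0)
    return expected_reward - beta * kl

def optimal_policy(pi_ref, rewards, beta):
    """Reference reweighted by exp(r / beta), normalized by the partition function."""
    weights = [q * math.exp(r / beta) for q, r in zip(pi_ref, rewards)]
    z = sum(weights)  # Z(x) for this prompt
    return [w / z for w in weights], z

# Toy values (assumptions for illustration only).
pi_ref = [0.5, 0.3, 0.2]
rewards = [1.0, 2.0, 0.0]
beta = 0.5

pi_star, z = optimal_policy(pi_ref, rewards, beta)
best = kl_regularized_objective(pi_star, pi_ref, rewards, beta)

# Two competitors: staying at the reference, and going almost greedy.
ref_value = kl_regularized_objective(pi_ref, pi_ref, rewards, beta)
greedy_value = kl_regularized_objective([0.01, 0.98, 0.01], pi_ref, rewards, beta)

print("pi* =", [round(p, 4) for p in pi_star])
print("objective at pi*:", best, " at pi_ref:", ref_value, " near-greedy:", greedy_value)
```

In this toy case the objective value at $\pi^{*}$ also equals $\beta\log Z(x)$, which follows from substituting the closed form back into the objective.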

Step 2: invert it to read off the reward

Here’s the move. That equation relates the optimal policy to the reward. So solve it for the reward. Take logs and rearrange:

r(x,y) \;=\; \beta\,\log\frac{\pi^{*}(y|x)}{\pi_{\text{ref}}(y|x)} \;+\; \beta\,\log Z(x).

Read that carefully, because it is the heart of DPO. The reward of any answer is just $\beta$ times the log-ratio between the optimal policy and the reference, plus a term $\beta\log Z(x)$ that depends only on the prompt $x$, not on the answer $y$. The policy encodes the reward. If you know the optimal policy and the reference, you already know the reward up to a per-prompt constant. Since $\pi^{*}$ is exactly what we are trying to learn, we put the trainable policy $\pi_\theta$ in its place and call the resulting quantity the implicit reward: $\hat r(x,y) = \beta\log\frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)}$.
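The inversion can be checked numerically. The sketch below reuses the same kind of toy setup (three answers, made-up rewards and reference probabilities), builds $\pi^{*}$, and then recovers each reward from the log-ratio. The point to look for is that the gap between the true reward and $\beta\log(\pi^{*}/\pi_{\text{ref}})$ is the same number for every answer, namely $\beta\log Z(x)$.

```python
import math

# Toy values (assumptions for illustration only).
beta = 0.5
pi_ref = [0.5, 0.3, 0.2]
rewards = [1.0, 2.0, 0.0]

# Step 1: closed-form optimal policy.
weights = [q * math.exp(r / beta) for q, r in zip(pi_ref, rewards)]
z = sum(weights)
pi_star = [w / z for w in weights]

# Step 2: read the reward back off the policy, ignoring Z(x).
log_ratio_rewards = [beta * math.log(p / q) for p, q in zip(pi_star, pi_ref)]

# The leftover is constant across answers: it is beta * log Z(x).
offsets = [r - h for r, h in zip(rewards, log_ratio_rewards)]
print("beta * log Z =", beta * math.log(z))
print("offsets      =", offsets)
```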

Step 3: substitute into the preference likelihood

Now bring in the Bradley–Terry model from the reward-models chapter. It says the probability that a “winning” answer $y_w$ beats a “losing” one $y_l$ is a logistic function of their reward difference:

P(y_w \succ y_l \mid x) \;=\; \sigma\!\big(r(x,y_w) - r(x,y_l)\big).

Substitute our expression for $r$ from Step 2, with the trainable policy $\pi_\theta$ standing in for $\pi^{*}$. The magic: the troublesome $\beta\log Z(x)$ term depends only on $x$, so it is identical for $y_w$ and $y_l$ and cancels in the difference. The intractable partition function, the thing that normally forces you into RL, disappears completely. What’s left is written purely in terms of the policy and the reference:

P(y_w \succ y_l \mid x) \;=\; \sigma\!\Big(\beta\log\frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta\log\frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\Big).
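As a sanity check on the cancellation, the following sketch compares the Bradley–Terry probability computed from the true rewards with the one computed from policy log-ratios alone. It uses the optimal policy of a toy problem in place of $\pi_\theta$, which is the one case where the two are expected to agree exactly. All numbers and names are illustrative assumptions.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

# Toy values (assumptions for illustration only).
beta = 0.5
pi_ref = [0.5, 0.3, 0.2]
rewards = [1.0, 2.0, 0.0]

weights = [q * math.exp(r / beta) for q, r in zip(pi_ref, rewards)]
z = sum(weights)
pi_star = [w / z for w in weights]

win, lose = 1, 0  # answer 1 plays y_w, answer 0 plays y_l

# Preference probability from the explicit rewards.
p_from_reward = sigmoid(rewards[win] - rewards[lose])

# Preference probability from log-ratios only; Z(x) never appears.
log_ratio_w = math.log(pi_star[win] / pi_ref[win])
log_ratio_l = math.log(pi_star[lose] / pi_ref[lose])
p_from_policy = sigmoid(beta * log_ratio_w - beta * log_ratio_l)

print(p_from_reward, p_from_policy)
```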

Maximize the likelihood of the observed preferences — equivalently, minimize its negative log — and you have the DPO loss:

\mathcal{L}_{\text{DPO}} \;=\; -\,\mathbb{E}_{(x,y_w,y_l)}\Big[\log\sigma\Big(\beta\log\tfrac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta\log\tfrac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\Big)\Big].

That’s it. A plain supervised loss over preference triples $(x, y_w, y_l)$: no reward model, no sampling, no RL loop, no critic. You compute four log-probabilities (the chosen and rejected answers under both $\pi_\theta$ and the frozen $\pi_{\text{ref}}$), form the $\beta$-scaled difference of the two log-ratios, push it through $-\log\sigma$, and backpropagate.
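The loss itself fits in a few lines. The sketch below is a framework-free illustration, not a training implementation: it assumes each sequence log-probability has already been computed (in practice, as the sum of token log-probabilities of the answer given the prompt), and the function names and batch layout are hypothetical. A real setup would compute the same expression on tensors so that gradients flow into $\pi_\theta$ only.

```python
import math

def log_sigmoid(t):
    """Numerically stable log(sigmoid(t))."""
    if t >= 0:
        return -math.log1p(math.exp(-t))
    return t - math.log1p(math.exp(t))

def dpo_loss(batch, beta=0.1):
    """Mean DPO loss over (policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l) tuples."""
    total = 0.0
    for policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l in batch:
        chosen_reward = beta * (policy_logp_w - ref_logp_w)    # implicit reward of y_w
        rejected_reward = beta * (policy_logp_l - ref_logp_l)  # implicit reward of y_l
        total += -log_sigmoid(chosen_reward - rejected_reward)
    return total / len(batch)

# At initialization the policy equals the reference, so every margin is 0
# and the loss is log(2) per pair.
at_init = dpo_loss([(-12.0, -15.0, -12.0, -15.0)])

# After the policy has raised y_w and lowered y_l relative to the reference,
# the margin is positive and the loss drops below log(2).
after_update = dpo_loss([(-10.0, -18.0, -12.0, -15.0)])

print(at_init, after_update)
```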

What the gradient actually does

The DPO gradient has a clean interpretation. It increases the likelihood of the chosen answer $y_w$ and decreases that of the rejected answer $y_l$ — but each pair is weighted by $\sigma\big(\hat r(x,y_l) - \hat r(x,y_w)\big)$ , how wrong the implicit reward currently is. Pairs the model already ranks correctly contribute almost nothing; pairs it gets backwards get the biggest push. So DPO automatically spends its gradient where the implicit reward most disagrees with the human label — a built-in hard-example weighting, with no reward model in sight.
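A quick numerical look at that weighting term, using made-up implicit rewards: a pair the model already ranks correctly, a tie, and a pair it ranks backwards. The function name is illustrative.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def gradient_weight(r_hat_w, r_hat_l):
    """Per-pair weight sigma(r_hat(y_l) - r_hat(y_w)) on the DPO gradient."""
    return sigmoid(r_hat_l - r_hat_w)

correct = gradient_weight(r_hat_w=2.0, r_hat_l=-2.0)   # already ranked correctly
tie = gradient_weight(r_hat_w=0.0, r_hat_l=0.0)        # no opinion yet
backwards = gradient_weight(r_hat_w=-2.0, r_hat_l=2.0)  # ranked the wrong way round

print(round(correct, 3), round(tie, 3), round(backwards, 3))  # 0.018 0.5 0.982
```

The weight is bounded between 0 and 1, so a badly mis-ranked pair gets close to the full push while a confidently correct pair is nearly ignored.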

Try it

Toggle between the two-stage RLHF pipeline (reward model + PPO loop) and the single DPO loss on the same preference pair. Watch the implicit reward $\beta\log\frac{\pi_\theta}{\pi_{\text{ref}}}$ move the chosen and rejected log-probabilities apart, and see that it lands in the same place the RM-plus-PPO route was trying to reach — just without ever building the reward model.

RLHF vs DPO: same goal, fewer stages

Interactive figure: one preference pair, a chosen response y_w and a rejected one y_l, optimized two different ways.

RLHF (3 stages)

Collect preferences: (y_w ≻ y_l) pairs

Train reward model: fit r̂(x, y) via Bradley–Terry

PPO RL loop: sample → score with r̂ → KL-reg update → repeat

DPO (1 stage)

One supervised loss on the pair, with no reward model and no RL loop: −log σ( β·log π(y_w)/π_ref(y_w) − β·log π(y_l)/π_ref(y_l) )

DPO implicit reward: r(x, y) = β · log( π(y|x) / π_ref(y|x) )

Example readout at β = 0.10 (strength of the KL constraint to π_ref):

y_w (chosen): r = 0.140 (log-ratio 1.40)

y_l (rejected): r = -0.080 (log-ratio -0.80)

Implicit reward margin r(y_w) − r(y_l): +0.220

DPO optimizes the same objective as RLHF but skips the explicit reward model and the RL loop by using a closed-form loss. The reward is implicit in the policy-versus-reference log-ratio: r(x, y) = β·log(π/π_ref). Both pipelines are ultimately doing the same thing — increasing the margin between the chosen and rejected responses — but DPO collapses the three RLHF stages into one supervised step.
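The figure's example numbers can be reproduced by hand. This small sketch takes the two log-ratios shown there (1.40 for the chosen answer, -0.80 for the rejected one) and β = 0.10, and computes the implicit rewards, the margin, and the per-pair DPO loss that margin implies. The loss value is not shown in the figure; it is derived here from the loss formula.

```python
import math

beta = 0.10
log_ratio_w = 1.40   # log(pi / pi_ref) for the chosen answer
log_ratio_l = -0.80  # log(pi / pi_ref) for the rejected answer

r_w = beta * log_ratio_w   # implicit reward of y_w, about 0.140
r_l = beta * log_ratio_l   # implicit reward of y_l, about -0.080
margin = r_w - r_l         # about 0.220

# Per-pair DPO loss: -log sigmoid(margin) = log(1 + exp(-margin)).
loss = math.log1p(math.exp(-margin))

print(round(r_w, 3), round(r_l, 3), round(margin, 3), round(loss, 3))
```

With a margin this small the loss sits only slightly below log 2 ≈ 0.693, the value for a pair the policy cannot yet tell apart.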

What DPO trades away

DPO is simpler, cheaper, and far more stable — which is why it became the default for open post-training and shows up in Llama 3, Tülu, Zephyr, and countless others. But the simplification is not free, and the trade is worth naming precisely.

DPO is offline. It learns only from the fixed set of preference pairs you collected up front; it never generates new samples and never gets fresh feedback on them. PPO, by contrast, is on-policy: at every step it samples from the current policy and scores those samples, so it can discover and reinforce good behaviors that weren’t in any human-written dataset. DPO cannot explore beyond its data. If the best answer to a prompt was never in the preference set, DPO has no way to find it.

This makes DPO sensitive to distribution shift. As training pushes $\pi_\theta$ away from the distribution that generated the preference pairs, those pairs describe a region the policy has already left, and the implicit-reward signal gets stale. PPO re-samples and stays current; DPO is stuck with the snapshot it was handed. In practice this shows up as DPO over-fitting to quirks of the dataset, and it’s a recurring motivation for the variants in the next chapter — and for iterating DPO on freshly generated, freshly labeled pairs.

The next chapter tours the DPO zoo — IPO, KTO, ORPO, SimPO — each one targeting a specific weakness we just identified: the deterministic-preference overfitting, the need for paired data, the separate SFT stage, and the lingering length bias from the previous chapter.