DPO: Direct Preference Optimization

Rafailov et al. · NeurIPS 2023 · arXiv 2305.18290

TL;DR

DPO eliminates the need for a separate reward model in RLHF. By reparameterizing the reward function, it converts the RL problem into a simple classification loss on preference pairs. No PPO, no reward model, no sampling: just supervised learning on (preferred, rejected) pairs.

DPO vs Standard RLHF Pipeline

Standard RLHF: 3-stage pipeline (complex, unstable, expensive)
  1. SFT
  2. Train a reward model
  3. PPO with the reward model

DPO insight: the reward model is implicit in the policy! There is a closed-form mapping between the optimal policy and the reward function.

DPO: 2-stage pipeline (simple, stable, cheap)
  1. SFT
  2. DPO on preferences

No reward model needed, no RL sampling needed: just supervised learning.

1. Background: The RLHF Problem

Standard RLHF optimizes a language model to maximize a learned reward while staying close to the reference (SFT) model:

Standard RLHF objective
maxβ‘Ο€ΞΈβ€…β€ŠEx∼D, yβˆΌΟ€ΞΈ(β‹…βˆ£x)[rΟ•(x,y)]βˆ’Ξ²β€‰DKL ⁣[πθ(β‹…βˆ£x)βˆ₯Ο€ref(β‹…βˆ£x)]\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot|x)} \left[ r_\phi(x, y) \right] - \beta \, D_{\text{KL}}\!\left[\pi_\theta(\cdot|x) \| \pi_{\text{ref}}(\cdot|x)\right]
πθ\pi_\thetaThe policy (language model) we're trainingΟ€ref\pi_{\text{ref}}Reference policy β€” usually the SFT model, prevents the policy from drifting too farrΟ•(x,y)r_\phi(x, y)Learned reward model: scores how good response y is for prompt xΞ²\betaKL penalty coefficient β€” controls how far policy can deviate from referenceDKLD_{\text{KL}}KL divergence β€” measures distribution difference between policy and reference

This requires (1) training a separate reward model $r_\phi$ on human preferences, and (2) running PPO with $r_\phi$ as the reward signal, which involves sampling from the policy, computing rewards, estimating advantages, and updating. It's complex and unstable.
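Concretely, the per-sample training signal PPO sees is the reward minus the KL penalty. A minimal sketch in plain Python (the function name and scalar log-prob inputs are illustrative assumptions, using the single-sample estimate $\log \pi_\theta - \log \pi_{\text{ref}}$ for the KL term):

```python
def kl_penalized_reward(r_phi: float, logp_policy: float, logp_ref: float,
                        beta: float) -> float:
    """Per-sample PPO training signal for the RLHF objective above.

    log pi_theta(y|x) - log pi_ref(y|x) is a single-sample estimate of the
    KL divergence between the policy and the reference.
    """
    return r_phi - beta * (logp_policy - logp_ref)

# A response the policy already upweights relative to the reference
# gets its reward discounted by the KL penalty.
signal = kl_penalized_reward(r_phi=2.0, logp_policy=-10.0, logp_ref=-12.0, beta=0.1)
print(signal)  # 2.0 - 0.1 * 2.0 = 1.8
```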

2. The Key Derivation

DPO's insight: the RLHF objective has a closed-form optimal solution. The optimal policy is:

Optimal policy (closed-form solution to RLHF)
$$\pi^*(y \mid x) = \frac{1}{Z(x)} \pi_{\text{ref}}(y \mid x) \exp\!\left(\frac{1}{\beta} r(x, y)\right)$$

  • $\pi^*$: optimal policy, the solution to the RLHF objective
  • $Z(x)$: partition function (normalizer); ensures probabilities sum to 1
  • $\exp(r/\beta)$: responses with higher reward get exponentially more probability mass
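To make the closed form concrete, here is a toy sketch over a finite set of candidate responses (`optimal_policy` is an illustrative name; for a real model, $Z(x)$ sums over all sequences and is intractable):

```python
import math

def optimal_policy(ref_probs, rewards, beta):
    """pi*(y|x) proportional to pi_ref(y|x) * exp(r(x, y) / beta),
    normalized over a finite candidate set."""
    weights = [p * math.exp(r / beta) for p, r in zip(ref_probs, rewards)]
    z = sum(weights)  # partition function Z(x)
    return [w / z for w in weights]

# Two candidates the reference treats equally: the higher-reward one
# receives exponentially more probability mass.
probs = optimal_policy(ref_probs=[0.5, 0.5], rewards=[1.0, 0.0], beta=0.5)
```

Shrinking beta sharpens the distribution toward the highest-reward candidate, mirroring the KL coefficient's role in the objective.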

Now rearrange to express reward in terms of policy:

Reward as function of policy (the key reparameterization)
$$r(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)$$

Key insight: The reward is just the log-ratio of policy to reference, scaled by $\beta$! We don't NEED a separate reward model: the policy itself implicitly defines one.
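The intractable $\beta \log Z(x)$ term looks problematic, but the Bradley-Terry model only ever compares rewards for the same prompt, so it cancels. Spelling out the substitution:

```latex
p(y_w \succ y_l \mid x)
  = \sigma\big(r(x, y_w) - r(x, y_l)\big)
  = \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}
    \right)
```

with both $\beta \log Z(x)$ terms dropping out of the difference.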

3. The DPO Loss Function

Substituting the implicit reward into the Bradley-Terry preference model gives the DPO loss:

DPO loss function
$$\mathcal{L}_{\text{DPO}}(\theta) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]$$

  • $(x, y_w, y_l)$: a preference triple: prompt $x$, preferred response $y_w$ (winner), rejected response $y_l$ (loser)
  • $\sigma(\cdot)$: sigmoid function; turns log-odds into probability
  • $\log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)}$: log-ratio for the preferred response: how much has the policy upweighted $y_w$ relative to the reference?
  • $\beta$: temperature; larger $\beta$ means the policy changes more aggressively

Intuition: The loss encourages the model to increase the probability of y_w and decrease the probability of y_l, relative to the reference model. It's essentially binary classification: "which response is better?"
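A minimal single-pair sketch in plain Python (illustrative only: real implementations batch this over summed token-level log-probs from the policy and a frozen reference model):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probs.

    Each response's implicit reward is beta * (log pi_theta - log pi_ref);
    the loss is binary cross-entropy on the reward margin.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)
```

Plugging in the numbers from the worked example below (log-probs -12.3/-14.1 for the winner, -8.5/-7.2 for the loser) recovers a loss of about 0.55.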

Prompt: "Explain gravity."

Preferred (y_w): "Gravity is the force that attracts objects with mass toward each other..."
$\log \pi_\theta(y_w|x) = -12.3$, $\log \pi_{\text{ref}}(y_w|x) = -14.1$
Log-ratio = $-12.3 - (-14.1) = +1.8$ (policy likes it MORE than ref)
Rejected (y_l): "Gravity is when things fall down."
$\log \pi_\theta(y_l|x) = -8.5$, $\log \pi_{\text{ref}}(y_l|x) = -7.2$
Log-ratio = $-8.5 - (-7.2) = -1.3$ (policy likes it LESS than ref)
With $\beta = 0.1$:
$\sigma(0.1 \times (1.8 - (-1.3))) = \sigma(0.31) \approx 0.577$
Loss = $-\log(0.577) \approx 0.550$
The gradient will push to increase this gap further: make $y_w$ even more likely and $y_l$ even less likely.
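The strength of that push can be made precise: differentiating the loss with respect to the log-ratio margin gives a per-example weight of $\beta\,\sigma(-\beta \cdot \text{margin})$, so pairs the implicit reward already orders correctly receive small gradients. A small sketch (function names are illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dpo_gradient_weight(margin, beta=0.1):
    """|d loss / d margin| for loss = -log sigmoid(beta * margin).

    margin = difference of the two log-ratios (here 1.8 - (-1.3) = 3.1).
    """
    return beta * sigmoid(-beta * margin)

# The worked example's pair still gets a nonzero push...
w_example = dpo_gradient_weight(3.1)
# ...but a pair with a very large margin is nearly ignored.
w_easy = dpo_gradient_weight(100.0)
```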

4. Results & Impact

  • Matches or exceeds PPO-based RLHF on summarization and dialogue tasks
  • Much simpler to implement (~50 lines of core code)
  • More stable training: no reward hacking, no sampling instabilities
  • Became the default alignment method for many open-source LLMs

5. Limitations & Future Work

  • Offline only: DPO uses fixed preference data and can't explore like online RL (PPO)
  • Reference model dependency: Quality depends on having a good reference model
  • No reward shaping: Can't add auxiliary rewards for specific behaviors (safety, factuality)

6. Connections to Other Work

PPO

The RL algorithm DPO replaces. PPO is more flexible but harder to tune.

GRPO

DeepSeek's PPO variant that uses group-relative advantages instead of a value network: a middle ground between PPO's flexibility and DPO's simplicity.

7. Additional Resources