DPO eliminates the need for a separate reward model in RLHF. By reparameterizing the reward function, it converts the RL problem into a simple classification loss on preference pairs. No PPO, no reward model, no sampling: just supervised learning on (preferred, rejected) pairs.
DPO vs. Standard RLHF Pipeline

Standard RLHF: 3-stage pipeline (complex, unstable, expensive)
1. SFT
2. Train reward model
3. PPO with the reward model

DPO insight: the reward model is implicit in the policy. There is a closed-form mapping between the optimal policy and the reward function.

DPO: 2-stage pipeline (simple, stable, cheap)
1. SFT
2. DPO on preferences

No reward model needed, no RL sampling needed: just supervised learning.
1. Background: The RLHF Problem
Standard RLHF optimizes a language model to maximize a learned reward while staying close to the reference (SFT) model:
$$\max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot|x)}\big[\, r_\phi(x, y) \,\big] \;-\; \beta\, D_{\mathrm{KL}}\!\big(\pi_\theta(\cdot|x) \,\|\, \pi_{\mathrm{ref}}(\cdot|x)\big)$$

where:
- $\pi_\theta$: the policy (language model) we're training
- $\pi_{\mathrm{ref}}$: reference policy, usually the SFT model; prevents the policy from drifting too far
- $r_\phi(x, y)$: learned reward model; scores how good response $y$ is for prompt $x$
- $\beta$: KL penalty coefficient; controls how far the policy can deviate from the reference
- $D_{\mathrm{KL}}$: KL divergence; measures the distribution difference between policy and reference
This requires: (1) training a separate reward model $r_\phi$ on human preferences, and (2) running PPO with $r_\phi$ as the reward signal, which involves sampling from the policy, computing rewards, estimating advantages, and updating. It's complex and unstable.
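As a sanity check, the KL-regularized objective can be evaluated exactly on a toy discrete response set. This is a minimal sketch in plain Python; `rlhf_objective` and the toy distributions are illustrative names, not from any library.

```python
import math

def rlhf_objective(policy, ref, reward, beta):
    """E_{y~policy}[r(y)] - beta * KL(policy || ref), computed exactly
    over an enumerable set of responses (toy sketch)."""
    expected_reward = sum(policy[y] * reward[y] for y in policy)
    kl = sum(policy[y] * math.log(policy[y] / ref[y]) for y in policy)
    return expected_reward - beta * kl

# Toy distributions over two candidate responses for one prompt:
policy = {"a": 0.7, "b": 0.3}   # pi_theta(y|x)
ref    = {"a": 0.5, "b": 0.5}   # pi_ref(y|x)
reward = {"a": 1.0, "b": 0.0}   # r_phi(x, y)
objective = rlhf_objective(policy, ref, reward, beta=0.1)
```

Upweighting the high-reward response raises the expected reward but pays a KL penalty for drifting from the reference; `beta` trades the two off.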
2. The Key Derivation
DPO's insight: the RLHF objective has a closed-form optimal solution. The optimal policy is:
$$\pi^*(y|x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y|x)\, \exp\!\left(\frac{r(x, y)}{\beta}\right)$$

where:
- $\pi^*$: optimal policy; the solution to the RLHF objective
- $Z(x)$: partition function (normalizer); ensures probabilities sum to 1
- $\exp(r/\beta)$: responses with higher reward get exponentially more probability mass
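On a small discrete set of responses, this closed form can be computed directly: reweight the reference by exp(reward/beta) and normalize. A minimal sketch with illustrative names:

```python
import math

def optimal_policy(ref, reward, beta):
    """pi*(y|x) proportional to pi_ref(y|x) * exp(r(x,y)/beta),
    normalized by the partition function Z(x) (toy discrete case)."""
    unnorm = {y: ref[y] * math.exp(reward[y] / beta) for y in ref}
    Z = sum(unnorm.values())  # partition function Z(x)
    return {y: w / Z for y, w in unnorm.items()}

ref    = {"a": 0.5, "b": 0.5}   # pi_ref(y|x)
reward = {"a": 1.0, "b": 0.0}   # r(x, y)
pi_star = optimal_policy(ref, reward, beta=1.0)
# The higher-reward response "a" receives exponentially more mass than "b".
```

Shrinking `beta` sharpens the distribution toward the highest-reward response; growing it keeps the policy close to the reference.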
Now rearrange to express the reward in terms of the policy:

$$r(x, y) = \beta \log \frac{\pi^*(y|x)}{\pi_{\mathrm{ref}}(y|x)} + \beta \log Z(x)$$

Reward as a function of the policy (the key reparameterization).
Key insight: The reward is just the log-ratio of policy to reference, scaled by $\beta$! We don't NEED a separate reward model: the policy itself implicitly defines one.
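In code, the implicit reward is one line: beta times the difference of log-probabilities under the policy and the reference. (The $\beta \log Z(x)$ term is dropped here; it depends only on the prompt and cancels when two responses to the same prompt are compared.) A sketch with illustrative names:

```python
import math

def implicit_reward(logp_policy, logp_ref, beta):
    """beta * log(pi_theta(y|x) / pi_ref(y|x)), i.e. the reward implicitly
    defined by the policy, up to a prompt-only constant."""
    return beta * (logp_policy - logp_ref)

# If the policy assigns a response twice the reference probability,
# the implicit reward is beta * log 2:
r = implicit_reward(math.log(0.4), math.log(0.2), beta=0.1)
```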
3. The DPO Loss Function
Substituting the implicit reward into the Bradley-Terry preference model gives the DPO loss:
$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\mathrm{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\mathrm{ref}}(y_l|x)}\right)\right]$$

where:
- $(x, y_w, y_l)$: a preference triple: prompt $x$, preferred response $y_w$ (winner), rejected response $y_l$ (loser)
- $\sigma(\cdot)$: sigmoid function; turns log-odds into a probability
- $\log \frac{\pi_\theta(y_w|x)}{\pi_{\mathrm{ref}}(y_w|x)}$: log-ratio for the preferred response: how much the policy has upweighted $y_w$ relative to the reference
- $\beta$: temperature; larger $\beta$ means the policy changes more aggressively
Intuition: The loss encourages the model to increase the probability of y_w and decrease the probability of y_l, relative to the reference model. It's essentially binary classification: "which response is better?"
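The loss for a single preference pair is short enough to write out in plain Python. This sketch takes summed sequence log-probabilities (scalar $\log \pi(y|x)$ values) as inputs; the function and argument names are illustrative, not from any library.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta):
    """-log sigmoid(beta * (winner log-ratio - loser log-ratio))
    for one (preferred, rejected) pair."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# Policy upweights the winner and downweights the loser vs. the reference:
loss = dpo_loss(logp_w=-5.0, logp_l=-9.0,
                ref_logp_w=-6.0, ref_logp_l=-8.0, beta=0.1)
# The loss shrinks as the margin grows; at margin 0 it equals log 2.
```

In a real implementation the log-probabilities come from summing per-token logits over each response, and a numerically stable log-sigmoid (e.g. PyTorch's `F.logsigmoid`) replaces the explicit `1/(1+exp(-x))`.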
Prompt: "Explain gravity."
Preferred (y_w): "Gravity is the force that attracts objects with mass toward each other..."
Related: GRPO, DeepSeek's PPO variant that uses group-relative advantages instead of a value network; a middle ground between PPO's flexibility and DPO's simplicity.