DPO: Direct Preference Optimization

Rafailov et al. · NeurIPS 2023 · arXiv 2305.18290

TL;DR

DPO eliminates the need for a separate reward model in RLHF. By reparameterizing the reward function, it converts the RL problem into a simple classification loss on preference pairs. No PPO, no reward model, no sampling: just supervised learning on (preferred, rejected) pairs.

DPO vs Standard RLHF Pipeline

Standard RLHF: 3-stage pipeline (complex, unstable, expensive)
  1. SFT
  2. Train a reward model
  3. PPO with the reward model

DPO insight: the reward model is implicit in the policy! There is a closed-form mapping between the optimal policy and the reward function.

DPO: 2-stage pipeline (simple, stable, cheap)
  1. SFT
  2. DPO on preferences

No reward model needed, no RL sampling needed: just supervised learning.

1. Background: The RLHF Problem

Standard RLHF optimizes a language model to maximize a learned reward while staying close to the reference (SFT) model:

Standard RLHF objective
maxβ‘Ο€ΞΈβ€…β€ŠEx∼D, yβˆΌΟ€ΞΈ(β‹…βˆ£x)[rΟ•(x,y)]βˆ’Ξ²β€‰DKL ⁣[πθ(β‹…βˆ£x)βˆ₯Ο€ref(β‹…βˆ£x)]\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot|x)} \left[ r_\phi(x, y) \right] - \beta \, D_{\text{KL}}\!\left[\pi_\theta(\cdot|x) \| \pi_{\text{ref}}(\cdot|x)\right]
πθ\pi_\thetaThe policy (language model) we're trainingΟ€ref\pi_{\text{ref}}Reference policy β€” usually the SFT model, prevents the policy from drifting too farrΟ•(x,y)r_\phi(x, y)Learned reward model: scores how good response y is for prompt xΞ²\betaKL penalty coefficient β€” controls how far policy can deviate from referenceDKLD_{\text{KL}}KL divergence β€” measures distribution difference between policy and reference

This requires (1) training a separate reward model $r_\phi$ on human preferences, and (2) running PPO with $r_\phi$ as the reward signal, which involves sampling from the policy, computing rewards, estimating advantages, and updating. It's complex and unstable.
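Concretely, the per-sample training signal PPO sees is the reward minus the KL penalty. A minimal sketch in plain Python (the function name and scalar log-prob inputs are illustrative assumptions, using the single-sample estimate $\log \pi_\theta - \log \pi_{\text{ref}}$ for the KL term):

```python
def kl_penalized_reward(r_phi: float, logp_policy: float, logp_ref: float,
                        beta: float) -> float:
    """Per-sample PPO training signal for the RLHF objective above.

    log pi_theta(y|x) - log pi_ref(y|x) is a single-sample estimate of the
    KL divergence between the policy and the reference.
    """
    return r_phi - beta * (logp_policy - logp_ref)

# A response the policy already upweights relative to the reference
# gets its reward discounted by the KL penalty.
signal = kl_penalized_reward(r_phi=2.0, logp_policy=-10.0, logp_ref=-12.0, beta=0.1)
print(signal)  # 2.0 - 0.1 * 2.0 = 1.8
```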

2. The Key Derivation

DPO's insight: the RLHF objective has a closed-form optimal solution. The optimal policy is:

Optimal policy (closed-form solution to RLHF)
$$\pi^*(y \mid x) = \frac{1}{Z(x)} \pi_{\text{ref}}(y \mid x) \exp\!\left(\frac{1}{\beta} r(x, y)\right)$$

  • $\pi^*$: optimal policy, the solution to the RLHF objective
  • $Z(x)$: partition function (normalizer); ensures probabilities sum to 1
  • $\exp(r/\beta)$: responses with higher reward get exponentially more probability mass
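To make the closed form concrete, here is a toy sketch over a finite set of candidate responses (`optimal_policy` is an illustrative name; for a real model, $Z(x)$ sums over all sequences and is intractable):

```python
import math

def optimal_policy(ref_probs, rewards, beta):
    """pi*(y|x) proportional to pi_ref(y|x) * exp(r(x, y) / beta),
    normalized over a finite candidate set."""
    weights = [p * math.exp(r / beta) for p, r in zip(ref_probs, rewards)]
    z = sum(weights)  # partition function Z(x)
    return [w / z for w in weights]

# Two candidates the reference treats equally: the higher-reward one
# receives exponentially more probability mass.
probs = optimal_policy(ref_probs=[0.5, 0.5], rewards=[1.0, 0.0], beta=0.5)
```

Shrinking beta sharpens the distribution toward the highest-reward candidate, mirroring the KL coefficient's role in the objective.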

Now rearrange to express reward in terms of policy:

Reward as function of policy (the key reparameterization)
$$r(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)$$

Key insight: The reward is just the log-ratio of policy to reference, scaled by $\beta$! We don't NEED a separate reward model: the policy itself implicitly defines one.
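The intractable $\beta \log Z(x)$ term looks problematic, but the Bradley-Terry model only ever compares rewards for the same prompt, so it cancels. Spelling out the substitution:

```latex
p(y_w \succ y_l \mid x)
  = \sigma\big(r(x, y_w) - r(x, y_l)\big)
  = \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}
    \right)
```

with both $\beta \log Z(x)$ terms dropping out of the difference.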

3. The DPO Loss Function

Substituting the implicit reward into the Bradley-Terry preference model gives the DPO loss:

DPO loss function
$$\mathcal{L}_{\text{DPO}}(\theta) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]$$

  • $(x, y_w, y_l)$: a preference triple: prompt $x$, preferred response $y_w$ (winner), rejected response $y_l$ (loser)
  • $\sigma(\cdot)$: sigmoid function; turns log-odds into probability
  • $\log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)}$: log-ratio for the preferred response: how much has the policy upweighted $y_w$ relative to the reference?
  • $\beta$: temperature; larger $\beta$ means the policy changes more aggressively

Intuition: The loss encourages the model to increase the probability of y_w and decrease the probability of y_l, relative to the reference model. It's essentially binary classification: "which response is better?"
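A minimal single-pair sketch in plain Python (illustrative only: real implementations batch this over summed token-level log-probs from the policy and a frozen reference model):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probs.

    Each response's implicit reward is beta * (log pi_theta - log pi_ref);
    the loss is binary cross-entropy on the reward margin.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)
```

Plugging in the numbers from the worked example below (log-probs -12.3/-14.1 for the winner, -8.5/-7.2 for the loser) recovers a loss of about 0.55.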

Prompt: "Explain gravity."

Preferred (y_w): "Gravity is the force that attracts objects with mass toward each other..."
$\log \pi_\theta(y_w|x) = -12.3$, $\log \pi_{\text{ref}}(y_w|x) = -14.1$
Log-ratio = $-12.3 - (-14.1) = +1.8$ (policy likes it MORE than ref)
Rejected (y_l): "Gravity is when things fall down."
$\log \pi_\theta(y_l|x) = -8.5$, $\log \pi_{\text{ref}}(y_l|x) = -7.2$
Log-ratio = $-8.5 - (-7.2) = -1.3$ (policy likes it LESS than ref)
With $\beta = 0.1$:
$\sigma(0.1 \times (1.8 - (-1.3))) = \sigma(0.31) \approx 0.577$
Loss = $-\log(0.577) \approx 0.550$
The gradient will push to increase this gap further: make $y_w$ even more likely and $y_l$ even less likely.
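The strength of that push can be made precise: differentiating the loss with respect to the log-ratio margin gives a per-example weight of $\beta\,\sigma(-\beta \cdot \text{margin})$, so pairs the implicit reward already orders correctly receive small gradients. A small sketch (function names are illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dpo_gradient_weight(margin, beta=0.1):
    """|d loss / d margin| for loss = -log sigmoid(beta * margin).

    margin = difference of the two log-ratios (here 1.8 - (-1.3) = 3.1).
    """
    return beta * sigmoid(-beta * margin)

# The worked example's pair still gets a nonzero push...
w_example = dpo_gradient_weight(3.1)
# ...but a pair with a very large margin is nearly ignored.
w_easy = dpo_gradient_weight(100.0)
```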

4. Results & Impact

  • Matches or exceeds PPO-based RLHF on summarization and dialogue tasks
  • Much simpler to implement (~50 lines of core code)
  • More stable training: no reward hacking, no sampling instabilities
  • Became the default alignment method for many open-source LLMs

5. Limitations & Future Work

  • Offline only: DPO uses fixed preference data and can't explore like online RL (PPO)
  • Reference model dependency: Quality depends on having a good reference model
  • No reward shaping: Can't add auxiliary rewards for specific behaviors (safety, factuality)

6. Connections to Other Work

PPO

The RL algorithm DPO replaces. PPO is more flexible but harder to tune.

GRPO

DeepSeek's PPO variant that uses group-relative advantages instead of a value network: a middle ground between PPO's flexibility and DPO's simplicity.

7. Additional Resources