Zephyr: Direct Distillation of LM Alignment

Tunstall et al. · HuggingFace · 2023 · arXiv 2310.16944

TL;DR

Zephyr aligns a 7B model (Mistral-7B) to ChatGPT-level helpfulness using only AI-generated data — no PPO, no human labelers, no reward model. Two steps: (1) distilled SFT on UltraChat conversations, (2) distilled DPO on GPT-4-ranked preference pairs. Zephyr-7B-β scores 7.34 on MT-Bench, beating Llama-2-Chat-70B despite being 10× smaller.

◆ Zephyr Two-Step Distillation Pipeline
  • GPT-4 teacher: generates conversations (UltraChat 200k) and ranks responses (UltraFeedback)
  • Step 1: dSFT — distilled supervised fine-tuning of Mistral-7B on UltraChat 200k
  • Step 2: dDPO — distilled DPO on UltraFeedback preference pairs (GPT-4 scores 4 responses per prompt)
  • Final model: Zephyr-7B-β, with MT-Bench 7.34 and a 90.6% AlpacaEval win rate
  • No PPO, no reward model, no human labelers

1. The Alignment Tax

Aligning large language models to be helpful, harmless, and honest has traditionally required a three-stage pipeline: supervised fine-tuning (SFT), reward model training on human preference data, and reinforcement learning via PPO. This pipeline is expensive, brittle, and requires massive infrastructure.

For small labs and open-source practitioners, replicating ChatGPT-level alignment is nearly impossible: you need thousands of high-quality human preference labels, a stable PPO implementation, and significant compute for reward model training. Zephyr's key insight is that GPT-4 can replace all of this.

Core idea: Use GPT-4 as both a teacher (generating high-quality conversations) and a judge (ranking model outputs). Distill this "alignment signal" into a 7B model using SFT followed by DPO — no RL, no reward model, no PPO.

2. dSFT: Learning From GPT-4 Conversations

The first stage is distilled Supervised Fine-Tuning (dSFT). The model is fine-tuned on UltraChat 200k — a dataset of 200,000 multi-turn conversations generated by GPT-3.5/GPT-4 covering a wide range of topics: world knowledge, creative writing, coding, reasoning, and more.

The SFT objective is standard next-token prediction over the assistant turns only — the model learns to predict GPT-4's responses given the conversation context:

SFT loss (next-token prediction on assistant turns):

\mathcal{L}_{\text{SFT}} = -\sum_{t=1}^{T} \log \pi_\theta(y_t \mid x,\, y_{<t})

  • \pi_\theta: the student model being trained (Mistral-7B)
  • x: the conversation context (system prompt + user turns)
  • y_t: the t-th token of the GPT-4-generated assistant response
  • y_{<t}: all preceding tokens of the assistant response

Why only assistant turns? We mask out the user turns and system prompt during loss computation. The model only needs to learn how to respond — not how to formulate questions. This is the standard SFT protocol for chat models.
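As a concrete sketch of this masking (the helper and span format below are hypothetical, not from the paper; it assumes the common convention that label −100 is ignored by the framework's cross-entropy loss):

```python
IGNORE_INDEX = -100  # conventionally skipped by framework cross-entropy losses


def build_sft_labels(input_ids, turn_spans):
    """Copy input_ids into a label sequence, keeping only assistant tokens.

    turn_spans: list of (start, end, role) half-open token ranges.
    Tokens outside assistant turns become IGNORE_INDEX, so the loss is
    next-token prediction on assistant turns only.
    """
    labels = [IGNORE_INDEX] * len(input_ids)
    for start, end, role in turn_spans:
        if role == "assistant":
            labels[start:end] = input_ids[start:end]
    return labels


# A 6-token conversation: 3 user tokens, then 3 assistant tokens.
labels = build_sft_labels([11, 12, 13, 21, 22, 23],
                          [(0, 3, "user"), (3, 6, "assistant")])
# labels == [-100, -100, -100, 21, 22, 23]
```

Only the assistant tokens contribute to the loss; the user prompt still conditions the prediction because it remains in `input_ids`.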

The result is a model that can follow instructions and produce coherent, helpful responses. But it doesn't yet know how to tell good answers from bad ones — that's what dDPO adds.

3. UltraFeedback Dataset

UltraFeedback is a large-scale preference dataset where GPT-4 evaluates multiple model outputs for the same prompt. For each instruction, four different models generate responses. GPT-4 then rates each response on a 1–5 scale across dimensions like instruction-following, honesty, and truthfulness.

Dataset construction
  1. Sample a diverse instruction from curated sources (ShareGPT, FLAN, etc.)
  2. Generate 4 responses using different models (GPT-4, Claude, LLaMA, etc.)
  3. Ask GPT-4 to rate each response (score 1–5) and provide a rationale
  4. Construct preference pairs: the highest-rated response becomes y_w; y_l is sampled at random from the remaining responses

This dataset contains ~64,000 preference triples (x, y_w, y_l). Crucially, no human labelers are involved — GPT-4 serves as the sole annotator. This makes the entire pipeline scalable and reproducible.
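A minimal sketch of this binarization step (the function name and record layout are illustrative assumptions; the top-scored response becomes the winner and the loser is drawn from the rest):

```python
import random


def binarize(prompt, scored_responses, rng):
    """Turn one prompt's scored responses into a DPO preference pair.

    scored_responses: list of (response_text, gpt4_score) tuples.
    The highest-scored response becomes the winner y_w; the loser y_l
    is sampled at random from the remaining responses.
    """
    ranked = sorted(scored_responses, key=lambda rs: rs[1], reverse=True)
    y_w = ranked[0][0]
    y_l = rng.choice(ranked[1:])[0]
    return {"prompt": prompt, "chosen": y_w, "rejected": y_l}


pair = binarize("Explain DPO in one line.",
                [("A", 4.5), ("B", 2.0), ("C", 3.5), ("D", 1.0)],
                random.Random(0))
# pair["chosen"] == "A"; pair["rejected"] is one of "B", "C", "D"
```

Passing an explicit `random.Random` instance keeps the construction reproducible across runs.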

AI Feedback vs. Human Feedback: Traditional RLHF uses human annotators to rank responses — slow, expensive, and inconsistent. UltraFeedback replaces humans with GPT-4 ("AI Feedback"), which is fast, cheap, and surprisingly aligned with human judgments at scale.

4. dDPO: Preference Learning Without a Reward Model

The second stage is distilled DPO (dDPO). Starting from the dSFT model (which both initializes the policy and serves as the frozen reference), the model is fine-tuned on UltraFeedback preference pairs using the DPO loss. No reward model is trained. No PPO rollouts happen. The preference signal comes directly from GPT-4's ratings.

The key insight from the DPO paper (Rafailov et al., 2023): you don't need an explicit reward model at all. The optimal policy under the KL-constrained RLHF objective implicitly defines a reward function. By reparameterizing this reward in terms of the policy ratio, DPO converts preference learning into a simple binary classification problem.

The implicit reward that DPO recovers is:

Implicit reward (emerges from the optimal policy)
r^*(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)

The implicit reward is just the log-ratio between the optimal policy and the reference policy, scaled by β. This tells us: a response y has high reward if the aligned model assigns it much higher probability than the base model.

  • \pi^*(y \mid x): the optimal aligned policy, what we're training toward
  • \pi_{\text{ref}}(y \mid x): the reference policy, i.e. the dSFT model, frozen during DPO
  • Z(x): the partition function, a per-prompt constant that cancels in the preference model
  • \beta: controls how far the policy may deviate from the reference (KL penalty strength)
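In code, the implicit reward (dropping the β log Z(x) term, which is constant per prompt and cancels when comparing two responses) is just a scaled difference of sequence log-probabilities. A sketch with made-up numbers, not from the paper:

```python
def implicit_reward(beta, logp_policy, logp_ref):
    """DPO's implicit reward, up to the prompt-only constant beta*log Z(x).

    logp_policy / logp_ref: total log-probability of response y under the
    trained policy and the frozen dSFT reference (one forward pass each).
    """
    return beta * (logp_policy - logp_ref)


# A response the aligned policy likes more than the reference gets a
# positive reward; beta scales how sharply the log-ratio is rewarded.
r = implicit_reward(beta=0.1, logp_policy=-42.0, logp_ref=-45.0)
# r == 0.1 * 3.0, i.e. about 0.3
```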

5. DPO Loss Derivation

Plugging the implicit reward into the Bradley-Terry preference model (which says the probability that y_w is preferred over y_l is σ(r(y_w) − r(y_l))) gives the DPO loss:

DPO loss (full form)
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]

  • (x, y_w, y_l) \sim \mathcal{D}: a sampled preference triple from UltraFeedback (prompt, winner, loser)
  • \sigma(\cdot): the sigmoid, mapping the score difference to a probability in (0, 1)
  • \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)}: scaled implicit reward for the winning response, positive when the policy prefers y_w more than the reference does
  • \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}: scaled implicit reward for the losing response, subtracted to push the model away from y_l

Why no reward model is needed

The DPO loss only requires computing log-probabilities under π_θ and π_ref — both are just forward passes through the language model. No separate reward network, no PPO rollouts, no advantage estimation. The entire training loop is as simple as cross-entropy classification.
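Under that framing, the per-example loss reduces to a few lines of arithmetic on four summed log-probabilities. A minimal sketch (not the actual training implementation; real code batches this and works on token-level log-probs):

```python
import math


def dpo_loss(beta, logp_pi_w, logp_ref_w, logp_pi_l, logp_ref_l):
    """Per-example DPO loss from four sequence log-probabilities:
    winner y_w and loser y_l, each scored by the policy pi_theta and
    the frozen reference pi_ref. No reward network anywhere.
    """
    margin = beta * ((logp_pi_w - logp_ref_w) - (logp_pi_l - logp_ref_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))


# When policy and reference agree, the margin is 0 and the loss is log 2;
# as the policy learns to prefer y_w over y_l, the loss falls toward 0.
print(dpo_loss(0.1, -40.0, -40.0, -55.0, -55.0))  # log(2), about 0.693
print(dpo_loss(0.1, -35.0, -40.0, -60.0, -55.0))  # smaller than 0.693
```

Minimizing this pushes the policy's implicit reward margin between winner and loser upward, which is exactly the Bradley-Terry classification objective.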

6. Results

Zephyr-7B-β is evaluated on MT-Bench (a multi-turn benchmark scored 1–10 by GPT-4 as judge) and AlpacaEval (single-turn win rate against text-davinci-003). The results are striking:

Model            Params  MT-Bench  AlpacaEval
Zephyr-7B-β      7B      7.34      90.6%
Llama-2-Chat     70B     6.86      92.7%
Llama-2-Chat     13B     6.65      81.1%
Falcon-Instruct  40B     5.17      45.7%
GPT-3.5-turbo    —       7.94      89.4%
  • Zephyr-7B-β scores 7.34 on MT-Bench, beating Llama-2-Chat-70B (a model 10× larger) at 6.86
  • AlpacaEval win rate of 90.6% against text-davinci-003, competitive with much larger models
  • dSFT alone (without dDPO) significantly underperforms — the preference learning stage is critical
  • The gap between dSFT and dDPO is larger than the gap between different SFT datasets, suggesting the alignment method matters more than the SFT data

7. Implications for Open-Source Alignment

Zephyr demonstrates that the gap between open-source and proprietary models is largely a data and alignment gap, not a capability gap. The Mistral-7B base model already has strong reasoning abilities — what it lacked was exposure to high-quality aligned conversations and preference training.

Several implications follow for the community:

  • AI Feedback scales: GPT-4 as a judge is surprisingly effective and eliminates the bottleneck of human annotation. Later work (RLAIF, Constitutional AI) confirms this direction.
  • DPO over PPO for small teams: DPO is far easier to implement and tune than PPO, making high-quality alignment accessible without massive RL infrastructure.
  • Scale is not everything: A well-aligned 7B model can outperform poorly-aligned 70B models. Alignment quality matters as much as raw parameter count.
  • Dataset curation is a competitive moat: UltraChat and UltraFeedback were pivotal to Zephyr's success. High-quality curated data often matters more than model architecture.

Historical significance: Zephyr was one of the first public demonstrations that open-source 7B models could reach GPT-3.5-level helpfulness purely through distillation-based alignment. It catalyzed an explosion of DPO-trained models and AI-feedback datasets in the open-source community throughout 2023–2024.

8. Additional Resources

DPO Deep Dive

Full derivation of the DPO loss — how the RLHF objective reduces to a simple classification loss without a reward model.

PPO Deep Dive

The RL algorithm that Zephyr avoids. Understanding PPO makes it clearer why dDPO is such an improvement for open-source alignment.