Zephyr: Direct Distillation of LM Alignment

Tunstall et al. · HuggingFace · 2023 · arXiv 2310.16944

TL;DR

Zephyr aligns a 7B model (Mistral-7B) to ChatGPT-level helpfulness using only AI-generated data — no PPO, no human labelers, no reward model. Two steps: (1) distilled SFT on UltraChat conversations, (2) distilled DPO on GPT-4-ranked preference pairs. Zephyr-7B-β scores 7.34 on MT-Bench, beating Llama-2-Chat-70B despite being 10× smaller.

◆ Zephyr Two-Step Distillation Pipeline
  • GPT-4 teacher: generates conversations (UltraChat 200k) and ranks responses (UltraFeedback)
  • Step 1: dSFT — distilled supervised fine-tuning of Mistral-7B on UltraChat 200k
  • Step 2: dDPO — distilled DPO on UltraFeedback preference pairs (GPT-4 scores 4 responses per prompt)
  • Final model: Zephyr-7B-β, with MT-Bench 7.34 and a 90.6% AlpacaEval win rate
  • No PPO, no reward model, no human labelers

1. The Alignment Tax

Aligning large language models to be helpful, harmless, and honest has traditionally required a three-stage pipeline: supervised fine-tuning (SFT), reward model training on human preference data, and reinforcement learning via PPO. This pipeline is expensive, brittle, and requires massive infrastructure.

For small labs and open-source practitioners, replicating ChatGPT-level alignment is nearly impossible: you need thousands of high-quality human preference labels, a stable PPO implementation, and significant compute for reward model training. Zephyr's key insight is that GPT-4 can replace all of this.

Core idea: Use GPT-4 as both a teacher (generating high-quality conversations) and a judge (ranking model outputs). Distill this "alignment signal" into a 7B model using SFT followed by DPO — no RL, no reward model, no PPO.

2. dSFT: Learning From GPT-4 Conversations

The first stage is distilled Supervised Fine-Tuning (dSFT). The model is fine-tuned on UltraChat 200k — a dataset of 200,000 multi-turn conversations generated by GPT-3.5/GPT-4 covering a wide range of topics: world knowledge, creative writing, coding, reasoning, and more.

The SFT objective is standard next-token prediction over the assistant turns only — the model learns to predict GPT-4's responses given the conversation context:

SFT loss (next-token prediction on assistant turns):

\mathcal{L}_{\text{SFT}} = -\sum_{t=1}^{T} \log \pi_\theta(y_t \mid x,\, y_{<t})

  • \pi_\theta: the student model being trained (Mistral-7B)
  • x: the conversation context (system prompt + user turns)
  • y_t: the t-th token of the GPT-4-generated assistant response
  • y_{<t}: all preceding tokens of the assistant response

Why only assistant turns? We mask out the user turns and system prompt during loss computation. The model only needs to learn how to respond — not how to formulate questions. This is the standard SFT protocol for chat models.
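As a concrete sketch of this masking (the helper and span format below are hypothetical, not from the paper; it assumes the common convention that label −100 is ignored by the framework's cross-entropy loss):

```python
IGNORE_INDEX = -100  # conventionally skipped by framework cross-entropy losses


def build_sft_labels(input_ids, turn_spans):
    """Copy input_ids into a label sequence, keeping only assistant tokens.

    turn_spans: list of (start, end, role) half-open token ranges.
    Tokens outside assistant turns become IGNORE_INDEX, so the loss is
    next-token prediction on assistant turns only.
    """
    labels = [IGNORE_INDEX] * len(input_ids)
    for start, end, role in turn_spans:
        if role == "assistant":
            labels[start:end] = input_ids[start:end]
    return labels


# A 6-token conversation: 3 user tokens, then 3 assistant tokens.
labels = build_sft_labels([11, 12, 13, 21, 22, 23],
                          [(0, 3, "user"), (3, 6, "assistant")])
# labels == [-100, -100, -100, 21, 22, 23]
```

Only the assistant tokens contribute to the loss; the user prompt still conditions the prediction because it remains in `input_ids`.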

The result is a model that can follow instructions and produce coherent, helpful responses. But it doesn't yet know how to tell good answers from bad ones — that's what dDPO adds.

3. UltraFeedback Dataset

UltraFeedback is a large-scale preference dataset where GPT-4 evaluates multiple model outputs for the same prompt. For each instruction, four different models generate responses. GPT-4 then rates each response on a 1–5 scale across dimensions like instruction-following, honesty, and truthfulness.

Dataset construction
  1. Sample a diverse instruction from curated sources (ShareGPT, FLAN, etc.)
  2. Generate 4 responses using different models (GPT-4, Claude, LLaMA, etc.)
  3. Ask GPT-4 to rate each response (score 1–5) and provide a rationale
  4. Construct preference pairs: the highest-rated response becomes y_w; y_l is sampled at random from the remaining responses

This dataset contains ~64,000 preference triples (x, y_w, y_l). Crucially, no human labelers are involved — GPT-4 serves as the sole annotator. This makes the entire pipeline scalable and reproducible.
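A minimal sketch of this binarization step (the function name and record layout are illustrative assumptions; the top-scored response becomes the winner and the loser is drawn from the rest):

```python
import random


def binarize(prompt, scored_responses, rng):
    """Turn one prompt's scored responses into a DPO preference pair.

    scored_responses: list of (response_text, gpt4_score) tuples.
    The highest-scored response becomes the winner y_w; the loser y_l
    is sampled at random from the remaining responses.
    """
    ranked = sorted(scored_responses, key=lambda rs: rs[1], reverse=True)
    y_w = ranked[0][0]
    y_l = rng.choice(ranked[1:])[0]
    return {"prompt": prompt, "chosen": y_w, "rejected": y_l}


pair = binarize("Explain DPO in one line.",
                [("A", 4.5), ("B", 2.0), ("C", 3.5), ("D", 1.0)],
                random.Random(0))
# pair["chosen"] == "A"; pair["rejected"] is one of "B", "C", "D"
```

Passing an explicit `random.Random` instance keeps the construction reproducible across runs.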

AI Feedback vs. Human Feedback: Traditional RLHF uses human annotators to rank responses — slow, expensive, and inconsistent. UltraFeedback replaces humans with GPT-4 ("AI Feedback"), which is fast, cheap, and surprisingly aligned with human judgments at scale.

4. dDPO: Preference Learning Without a Reward Model

The second stage is distilled DPO (dDPO). Starting from the dSFT model (which both initializes the policy and serves as the frozen reference), the model is fine-tuned on UltraFeedback preference pairs using the DPO loss. No reward model is trained. No PPO rollouts happen. The preference signal comes directly from GPT-4's ratings.

The key insight from the DPO paper (Rafailov et al., 2023): you don't need an explicit reward model at all. The optimal policy under the KL-constrained RLHF objective implicitly defines a reward function. By reparameterizing this reward in terms of the policy ratio, DPO converts preference learning into a simple binary classification problem.

The implicit reward that DPO recovers is:

Implicit reward (emerges from the optimal policy)
r^*(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)

The implicit reward is just the log-ratio between the optimal policy and the reference policy, scaled by β. This tells us: a response y has high reward if the aligned model assigns it much higher probability than the base model.

  • \pi^*(y \mid x): the optimal aligned policy, what we're training toward
  • \pi_{\text{ref}}(y \mid x): the reference policy, i.e. the dSFT model, frozen during DPO
  • Z(x): the partition function, a per-prompt constant that cancels in the preference model
  • \beta: controls how far the policy may deviate from the reference (KL penalty strength)
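In code, the implicit reward (dropping the β log Z(x) term, which is constant per prompt and cancels when comparing two responses) is just a scaled difference of sequence log-probabilities. A sketch with made-up numbers, not from the paper:

```python
def implicit_reward(beta, logp_policy, logp_ref):
    """DPO's implicit reward, up to the prompt-only constant beta*log Z(x).

    logp_policy / logp_ref: total log-probability of response y under the
    trained policy and the frozen dSFT reference (one forward pass each).
    """
    return beta * (logp_policy - logp_ref)


# A response the aligned policy likes more than the reference gets a
# positive reward; beta scales how sharply the log-ratio is rewarded.
r = implicit_reward(beta=0.1, logp_policy=-42.0, logp_ref=-45.0)
# r == 0.1 * 3.0, i.e. about 0.3
```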

5. DPO Loss Derivation

Plugging the implicit reward into the Bradley-Terry preference model (which says the probability that y_w is preferred over y_l is σ(r(y_w) − r(y_l))) gives the DPO loss:

DPO loss (full form)
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]

  • (x, y_w, y_l) \sim \mathcal{D}: a sampled preference triple from UltraFeedback (prompt, winner, loser)
  • \sigma(\cdot): the sigmoid, mapping the score difference to a probability in (0, 1)
  • \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)}: scaled implicit reward for the winning response, positive when the policy prefers y_w more than the reference does
  • \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}: scaled implicit reward for the losing response, subtracted to push the model away from y_l

Why no reward model is needed

The DPO loss only requires computing log-probabilities under π_θ and π_ref — both are just forward passes through the language model. No separate reward network, no PPO rollouts, no advantage estimation. The entire training loop is as simple as cross-entropy classification.
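Under that framing, the per-example loss reduces to a few lines of arithmetic on four summed log-probabilities. A minimal sketch (not the actual training implementation; real code batches this and works on token-level log-probs):

```python
import math


def dpo_loss(beta, logp_pi_w, logp_ref_w, logp_pi_l, logp_ref_l):
    """Per-example DPO loss from four sequence log-probabilities:
    winner y_w and loser y_l, each scored by the policy pi_theta and
    the frozen reference pi_ref. No reward network anywhere.
    """
    margin = beta * ((logp_pi_w - logp_ref_w) - (logp_pi_l - logp_ref_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))


# When policy and reference agree, the margin is 0 and the loss is log 2;
# as the policy learns to prefer y_w over y_l, the loss falls toward 0.
print(dpo_loss(0.1, -40.0, -40.0, -55.0, -55.0))  # log(2), about 0.693
print(dpo_loss(0.1, -35.0, -40.0, -60.0, -55.0))  # smaller than 0.693
```

Minimizing this pushes the policy's implicit reward margin between winner and loser upward, which is exactly the Bradley-Terry classification objective.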

6. Results

Zephyr-7B-β is evaluated on MT-Bench (a multi-turn benchmark scored 1–10 by GPT-4 as judge) and AlpacaEval (single-turn win rate against text-davinci-003). The results are striking:

Model            Params  MT-Bench  AlpacaEval
Zephyr-7B-β      7B      7.34      90.6%
Llama-2-Chat     70B     6.86      92.7%
Llama-2-Chat     13B     6.65      81.1%
Falcon-Instruct  40B     5.17      45.7%
GPT-3.5-turbo    —       7.94      89.4%
  • Zephyr-7B-β scores 7.34 on MT-Bench, beating Llama-2-Chat-70B (a model 10× larger) at 6.86
  • AlpacaEval win rate of 90.6% against text-davinci-003, competitive with much larger models
  • dSFT alone (without dDPO) significantly underperforms — the preference learning stage is critical
  • The gap between dSFT and dDPO is larger than the gap between different SFT datasets, suggesting the alignment method matters more than the SFT data

7. Implications for Open-Source Alignment

Zephyr demonstrates that the gap between open-source and proprietary models is largely a data and alignment gap, not a capability gap. The Mistral-7B base model already has strong reasoning abilities — what it lacked was exposure to high-quality aligned conversations and preference training.

Several implications follow for the community:

  • AI Feedback scales: GPT-4 as a judge is surprisingly effective and eliminates the bottleneck of human annotation. Later work (RLAIF, Constitutional AI) confirms this direction.
  • DPO over PPO for small teams: DPO is far easier to implement and tune than PPO, making high-quality alignment accessible without massive RL infrastructure.
  • Scale is not everything: A well-aligned 7B model can outperform poorly-aligned 70B models. Alignment quality matters as much as raw parameter count.
  • Dataset curation is a competitive moat: UltraChat and UltraFeedback were pivotal to Zephyr's success. High-quality curated data often matters more than model architecture.

Historical significance: Zephyr was one of the first public demonstrations that open-source 7B models could reach GPT-3.5-level helpfulness purely through distillation-based alignment. It catalyzed an explosion of DPO-trained models and AI-feedback datasets in the open-source community throughout 2023–2024.

8. Additional Resources

DPO Deep Dive

Full derivation of the DPO loss — how the RLHF objective reduces to a simple classification loss without a reward model.

PPO Deep Dive

The RL algorithm that Zephyr avoids. Understanding PPO makes it clearer why dDPO is such an improvement for open-source alignment.