Fast Discrete Diffusion Language Models (Fast DLLM)

Shi et al. · 2025

TL;DR

Masked diffusion LMs (LLaDA, MDLM) need many denoising steps at inference → slow. Fast DLLM speeds this up with: (1) an optimal denoising schedule that allocates more steps where they matter, (2) importance sampling to skip low-value steps, and (3) DualCache for KV reuse. Result: 3-10x fewer steps with minimal quality loss.

Architecture Overview

Problem: diffusion LMs are slow at inference. 100+ denoising steps, each a full forward pass.

Insight: not all steps are equally important. Early steps (high noise) contribute much more to quality than late steps.

Method: three acceleration techniques:

  • Optimal schedule: concentrate steps where L_t is large
  • Importance sampling: probabilistically skip low-value steps
  • DualCache: reuse KV across denoising steps

Result: 3-10x speedup with near-lossless quality (10 steps ≈ 100 steps); compatible with any masked diffusion LM.

1. Background: The Speed Problem

In masked diffusion LMs, generation requires T denoising steps. At each step, the model does a full forward pass through all layers to predict masked tokens. With T=100, generating a sequence takes 100x more compute than a single AR forward pass.

Cost comparison
$$\text{AR cost} = L \times C_{\text{forward}} \quad \text{vs} \quad \text{Diffusion cost} = T \times C_{\text{forward}}$$

2. Key Insight: Non-uniform Step Importance

The contribution of each denoising step to the overall ELBO is highly non-uniform:

Per-timestep ELBO contribution
$$L_t = -\mathbb{E}\left[\frac{1}{t \cdot L} \sum_{i:\, x_t^i = \texttt{[M]}} \log p_\theta(x_0^i \mid x_t)\right]$$

  • $L_t$: loss at timestep t; measures how "hard" this denoising step is
  • $t$: timestep (noise level). High t = many masks = hard; low t = few masks = easy
  • $x_t^i = \texttt{[M]}$: only masked positions contribute to the loss

Intuition: Early steps (high t) have high L_t because the model sees few tokens and must guess a lot. Late steps (low t) have low L_t because almost everything is revealed — just filling in 1-2 obvious tokens.

(Interactive figure: per-step ELBO contribution. One bar per timestep t = 1…8; hovering shows each timestep's loss, and the total ELBO loss is the sum over the bars.)
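In practice, L_t can be estimated by Monte-Carlo on held-out data: mask tokens at rate t, score the model on the masked positions, and apply the 1/t weighting from the ELBO. A minimal sketch, where `model_nll(tokens, mask)` is a hypothetical callback (not the paper's API) returning the mean negative log-likelihood over masked positions:

```python
import numpy as np

def estimate_L(model_nll, val_batches, timesteps, rng=None):
    """Monte-Carlo estimate of the per-timestep loss L_t."""
    if rng is None:
        rng = np.random.default_rng(0)
    L = np.empty(len(timesteps))
    for i, t in enumerate(timesteps):
        losses = []
        for tokens in val_batches:
            mask = rng.random(len(tokens)) < t  # mask each token with prob. t
            if mask.any():
                # the 1/t weighting comes from the ELBO decomposition above
                losses.append(model_nll(tokens, mask) / t)
        L[i] = np.mean(losses)
    return L
```

With L_t tabulated on a grid of timesteps, the schedule of Section 3 can be built directly from these estimates.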

3. Optimal Denoising Schedule

Instead of uniform timesteps, Fast DLLM places steps proportionally to their importance using the inverse CDF:

Cumulative importance function
$$F(t) = \frac{\int_0^t L_s \, ds}{\int_0^1 L_s \, ds}$$

Optimal schedule via inverse CDF

$$t_k = F^{-1}\!\left(\frac{k}{K}\right), \quad k = 0, 1, \ldots, K$$

  • $F(t)$: cumulative importance up to time t; F(0) = 0, F(1) = 1
  • $L_s$: per-step loss at time s (from Section 2)
  • $F^{-1}$: inverse CDF; finds the time t where cumulative importance equals the target
  • $K$: total number of denoising steps to use (e.g. 10)
  • $t_k$: the k-th timestep in the schedule

Suppose L_t looks like: high at t=0.8-1.0, medium at t=0.4-0.8, low at t=0-0.4. With K=4 steps:

Uniform schedule
t = [1.0, 0.75, 0.50, 0.25, 0.0]
Equal spacing, wastes steps in low-L region
Optimal schedule
t = [1.0, 0.90, 0.78, 0.55, 0.0]
Dense in high-L region, sparse in low-L

Each step in the optimal schedule handles an equal share of the total importance. This means no compute is wasted on easy steps.
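The inverse-CDF construction is a few lines of NumPy. A sketch under toy assumptions (the L_t curve below is illustrative; in practice it would be estimated as in Section 2):

```python
import numpy as np

def optimal_schedule(ts, Ls, K):
    """ts: ascending time grid on [0, 1]; Ls: estimated L_t on that grid."""
    # Trapezoidal cumulative importance, normalized so F(1) = 1.
    F = np.concatenate([[0.0], np.cumsum(0.5 * (Ls[1:] + Ls[:-1]) * np.diff(ts))])
    F = F / F[-1]
    # Invert F by linear interpolation: t_k satisfies F(t_k) = k/K.
    return np.interp(np.linspace(0.0, 1.0, K + 1), F, ts)

ts = np.linspace(0.0, 1.0, 101)
Ls = ts ** 2                        # toy importance concentrated near t = 1
schedule = optimal_schedule(ts, Ls, 4)
# Steps cluster in the high-L_t region near t = 1; reverse the array
# to denoise from t = 1 down to t = 0.
```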

4. Importance Sampling for Step Skipping

Beyond scheduling, Fast DLLM can skip steps entirely. The idea: if a step contributes very little to quality, we can skip it with high probability:

Skip probability
$$P(\text{execute step } k) = \min\!\left(1,\; \frac{L_{t_k} \cdot K}{\sum_{j=1}^{K} L_{t_j}}\right)$$

  • $P(\text{execute step } k)$: probability of actually running step k
  • $L_{t_k}$: importance of step k (its ELBO contribution)
  • $K$: total steps in the schedule
  • $\min(1, \ldots)$: cap at 1; important steps are always executed

Steps with above-average importance ($L_{t_k} >$ average) get probability 1 and always run. Steps with below-average importance run with reduced probability. Reweighting keeps the estimate unbiased.
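The skip rule is a direct transcription of the formula. A minimal sketch (function names are mine, not the paper's):

```python
import numpy as np

def execute_probabilities(L_steps):
    """Execution probability per step; above-average steps get p = 1."""
    L_steps = np.asarray(L_steps, dtype=float)
    K = len(L_steps)
    return np.minimum(1.0, L_steps * K / L_steps.sum())

def sample_executed_steps(L_steps, rng):
    """Indices of the steps actually run in one stochastic pass."""
    p = execute_probabilities(L_steps)
    return np.flatnonzero(rng.random(len(p)) < p)

p = execute_probabilities([4.0, 2.0, 1.0, 1.0])  # average importance = 2.0
# p = [1.0, 1.0, 0.5, 0.5]: the two easy steps are each skipped half the time
```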

5. DualCache: KV Cache for Diffusion

Standard AR models use KV-cache to avoid recomputing key/value projections. But diffusion models change all token representations at each step — so KV-cache seems impossible. Fast DLLM's insight: tokens that are already unmasked DON'T change much between steps.

DualCache split
$$\text{KV}_{t} = \underbrace{\text{KV}_{\text{unmasked}}}_{\text{cached from step } t+1} \cup \underbrace{\text{KV}_{\text{masked}}}_{\text{recomputed at step } t}$$
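A minimal sketch of the cache update (function and argument names are illustrative, not the paper's API): K/V rows for already-unmasked positions are carried over from the previous step, and only masked rows are recomputed.

```python
import numpy as np

def dualcache_step(hidden, masked, kv_proj, cache):
    """hidden: (L, d) states; masked: (L,) bool; kv_proj: row -> (k, v)."""
    K, V = cache
    for i in np.flatnonzero(masked):    # recompute only masked positions
        K[i], V[i] = kv_proj(hidden[i])
    return K, V                         # unmasked rows reused as-is

L, d = 4, 2
hidden = np.arange(L * d, dtype=float).reshape(L, d)
masked = np.array([True, False, True, False])
cache = (np.ones((L, d)), np.ones((L, d)))  # stale K/V from the previous step
K, V = dualcache_step(hidden, masked, lambda h: (2 * h, 3 * h), cache)
# Rows 0 and 2 are refreshed; rows 1 and 3 keep their cached values.
```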

6. Experiments & Results

| Method | Steps | PPL | Speedup |
| --- | --- | --- | --- |
| MDLM (uniform) | 100 | 31.2 | 1x |
| Uniform (fewer steps) | 25 | 35.4 | 4x |
| Uniform (fewer steps) | 10 | 45.8 | 10x |
| Fast DLLM (adaptive) | 10 | 33.1 | 10x |

Key result: With 10 steps (10x speedup), adaptive schedule: 33.1 PPL vs uniform's 45.8 PPL. Nearly matches the 100-step baseline (31.2). This makes diffusion LMs practical for real-time applications.

7. Limitations & Future Work

  • Schedule estimation: Need to estimate L_t on a validation set first — adds a pre-processing step
  • DualCache approximation: Reusing unmasked KV is an approximation that can degrade at very few steps
  • Distillation unexplored: Could potentially learn to generate in fewer steps via distillation (like in image diffusion)

8. Connections to Other Work

LLaDA / MDLM

Fast DLLM is compatible with any masked diffusion LM — it's a sampling strategy, not a new model. Drop it into LLaDA or MDLM for instant speedup.

Block Diffusion

Addresses speed from a different angle (block-level AR). Both can be combined: block diffusion with Fast DLLM's adaptive schedule within each block.

DDIM / DPM-Solver (image diffusion)

Analogous works in continuous image diffusion that also optimize the denoising schedule. Fast DLLM adapts these ideas to the discrete (masked) setting.

9. Additional Resources