TL;DR
Masked diffusion LMs (LLaDA, MDLM) need many denoising steps at inference → slow. Fast DLLM speeds this up with: (1) an optimal denoising schedule that allocates more steps where they matter, (2) importance sampling to skip low-value steps, and (3) DualCache for KV reuse. Result: 3-10x fewer steps with minimal quality loss.
1. Background: The Speed Problem
In masked diffusion LMs, generation requires T denoising steps, and each step runs a full forward pass through all layers over the entire sequence to predict the still-masked tokens. With T=100, generating one sequence costs 100 full-sequence forward passes — 100x the compute of a single forward pass. This per-step cost is the bottleneck.
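To make the cost concrete, here is a minimal toy sketch of the masked-diffusion generation loop. The "unmask an equal fraction per step" policy and `dummy_model` stand-in are illustrative assumptions, not the paper's sampler; the point is that every step pays one full forward pass.

```python
import random

MASK = -1  # illustrative mask token id

def dummy_model(tokens):
    """Stand-in for a full transformer forward pass: predicts 0 for every mask."""
    return [0 if t == MASK else t for t in tokens]

def generate(seq_len=16, T=8, seed=0):
    rng = random.Random(seed)
    tokens = [MASK] * seq_len
    forward_passes = 0
    for step in range(T):
        preds = dummy_model(tokens)  # one full forward pass per step
        forward_passes += 1
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        # unmask an equal share of the remaining masks each step
        k = max(1, len(masked) // (T - step))
        for i in rng.sample(masked, min(k, len(masked))):
            tokens[i] = preds[i]
    return tokens, forward_passes
```

Running `generate(seq_len=16, T=8)` finishes with all positions decoded after exactly 8 forward passes — the T-passes-per-sequence cost that Fast DLLM attacks.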
2. Key Insight: Non-uniform Step Importance
The contribution of each denoising step to the overall ELBO is highly non-uniform:
Intuition: Early steps (high t) have high L_t because the model sees few tokens and must guess a lot. Late steps (low t) have low L_t because almost everything is revealed — just filling in 1-2 obvious tokens.
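A tiny numeric illustration of this skew, assuming an invented per-step loss shape L(t) = t² that grows with the mask ratio t (the real L_t would be measured from the model's ELBO terms):

```python
# Illustrative only: L(t) = t**2 is an assumed shape, not the measured loss.
def L(t):
    return t ** 2

T = 10
contribs = [L((k + 1) / T) for k in range(T)]  # per-step contributions
total = sum(contribs)
shares = [c / total for c in contribs]         # normalized importance
```

With this shape, the final (fully-masked, t=1.0) step carries roughly 100x the share of the first (nearly-revealed, t=0.1) step — exactly the non-uniformity a uniform schedule ignores.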
3. Optimal Denoising Schedule
Instead of uniform timesteps, Fast DLLM places steps proportionally to their importance using the inverse CDF:
Suppose L_t looks like: high at t=0.8-1.0, medium at t=0.4-0.8, low at t=0-0.4. With K=4 steps:
- Uniform schedule: equal spacing, wastes steps in the low-L region
- Optimal schedule: dense in the high-L region, sparse in the low-L region
Each step in the optimal schedule handles an equal share of the total importance. This means no compute is wasted on easy steps.
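The inverse-CDF placement can be sketched as follows. This is a minimal numerical version under the same assumed L(t) = t² importance curve; the grid resolution and helper names are mine, not the paper's.

```python
import bisect

def make_schedule(L, K, grid=1000):
    """Place K timesteps at equal quantiles of cumulative importance L(t)."""
    ts = [(i + 1) / grid for i in range(grid)]
    cdf, acc = [], 0.0
    for t in ts:
        acc += L(t) / grid  # Riemann-sum CDF of importance
        cdf.append(acc)
    total = cdf[-1]
    # invert: find the t where the CDF crosses each equal-importance quantile
    return [ts[bisect.bisect_left(cdf, (k + 1) / K * total)] for k in range(K)]

sched = make_schedule(lambda t: t ** 2, K=4)
```

With L(t) = t², the CDF grows like t³, so the four steps land near t ≈ 0.63, 0.79, 0.91, 1.0: densely packed where importance is high, with one wide gap covering the easy low-t region.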
4. Importance Sampling for Step Skipping
Beyond scheduling, Fast DLLM can skip steps entirely. The idea: if a step contributes very little to quality, we can skip it with high probability:
Steps with above-average importance (L_{t_k} > mean) always run (probability 1). Steps with below-average importance run with reduced probability, and reweighting each surviving step's contribution by 1/p keeps the estimate unbiased.
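A sketch of this skip-and-reweight scheme, assuming a fixed list of per-step importances. The "run with probability min(1, L_k / mean)" rule and the Monte Carlo check are my illustration of the unbiasedness claim, not the paper's exact estimator:

```python
import random

def skip_probs(importances):
    """Above-mean steps always run; below-mean steps run proportionally less."""
    mean = sum(importances) / len(importances)
    return [min(1.0, Lk / mean) for Lk in importances]

def estimate(importances, trials=20000, seed=0):
    """Monte Carlo average of the skip-and-reweight sum."""
    rng = random.Random(seed)
    probs = skip_probs(importances)
    total = 0.0
    for _ in range(trials):
        for Lk, p in zip(importances, probs):
            if rng.random() < p:        # skip the step with probability 1 - p
                total += Lk / p         # reweight by 1/p to stay unbiased
    return total / trials

Ls = [0.1, 0.2, 0.5, 1.0, 2.0]
est = estimate(Ls)
```

Averaged over many trials, the reweighted sum matches the full sum of `Ls` even though low-importance steps are usually skipped — the unbiasedness the reweighting buys.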
5. DualCache: KV Cache for Diffusion
Standard AR models use a KV cache to avoid recomputing key/value projections. But diffusion models change all token representations at each step — so a KV cache seems impossible. Fast DLLM's insight: tokens that are already unmasked DON'T change much between steps, so their KV entries can be reused.
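A toy DualCache-style sketch of that reuse, under the stated assumption that an unmasked token's key/value projections stay stable across steps. `project_kv` and the class layout are illustrative stand-ins, not the paper's implementation:

```python
MASK = -1  # illustrative mask token id

def project_kv(token):
    """Stand-in for the key/value projection of one token."""
    return (token * 2, token * 3)

class DualCache:
    def __init__(self):
        self.kv = {}          # position -> cached (key, value)
        self.recomputed = 0   # projections actually recomputed

    def step(self, tokens):
        out = []
        for pos, tok in enumerate(tokens):
            if tok != MASK and pos in self.kv:
                out.append(self.kv[pos])   # reuse: token already unmasked
            else:
                kv = project_kv(tok)       # recompute masked / new positions
                self.recomputed += 1
                if tok != MASK:
                    self.kv[pos] = kv      # cache once the token is unmasked
                out.append(kv)
        return out

cache = DualCache()
cache.step([5, MASK, MASK, 7])   # first step: all 4 projections computed
cache.step([5, 9, MASK, 7])      # second step: positions 0 and 3 reused
```

After the two steps above, only 6 projections were recomputed instead of the naive 8, and the saving grows as more tokens become unmasked over a full denoising run.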
6. Experiments & Results
| Method | Steps | PPL | Speedup |
|---|---|---|---|
| MDLM (uniform) | 100 | 31.2 | 1x |
| Uniform (fewer steps) | 25 | 35.4 | 4x |
| Uniform (fewer steps) | 10 | 45.8 | 10x |
| Fast DLLM (adaptive) | 10 | 33.1 | 10x |
Key result: With 10 steps (10x speedup), adaptive schedule: 33.1 PPL vs uniform's 45.8 PPL. Nearly matches the 100-step baseline (31.2). This makes diffusion LMs practical for real-time applications.
7. Limitations & Future Work
- Schedule estimation: Need to estimate L_t on a validation set first — adds a pre-processing step
- DualCache approximation: Reusing unmasked KV is an approximation that can degrade at very few steps
- Distillation unexplored: Could potentially learn to generate in fewer steps via distillation (like in image diffusion)
8. Connections to Other Work
Fast DLLM is compatible with any masked diffusion LM — it's a sampling strategy, not a new model. Drop it into LLaDA or MDLM for instant speedup.
Block diffusion addresses speed from a different angle (block-level AR). The two can be combined: block diffusion with Fast DLLM's adaptive schedule within each block.
Schedule-optimization methods in continuous image diffusion are analogous; Fast DLLM adapts these ideas to the discrete (masked) setting.