TL;DR
Block Diffusion bridges AR and diffusion by splitting text into blocks. Blocks are generated left-to-right (AR-style), but tokens within each block are generated via masked diffusion (parallel). Block size B controls the tradeoff: B=1 is pure AR, B=L is pure diffusion.
1. The AR-Diffusion Spectrum
Two extremes of text generation:
Autoregressive (B = 1)
- + Excellent long-range coherence
- + Strong benchmarks
- - Sequential: 1 token per step
- - Left-to-right only
Full Diffusion (B = L)
- + Parallel within block
- + Bidirectional context
- - Many denoising steps needed
- - Weaker long-range coherence
Block Diffusion sits in between: block size B controls the tradeoff.
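The tradeoff can be made concrete by counting forward passes. The helper below is illustrative (the name `num_forward_passes` is not from the paper), assuming T denoising steps per block:

```python
# Illustrative helper (not from the paper): forward passes needed for
# sequence length L, block size B, and T denoising steps per block.
import math

def num_forward_passes(L: int, B: int, T: int) -> int:
    # ceil(L / B) blocks, each denoised in T steps
    return math.ceil(L / B) * T

assert num_forward_passes(256, 1, 1) == 256   # B=1, T=1: pure AR, one pass per token
assert num_forward_passes(256, 256, 8) == 8   # B=L: pure diffusion, T passes total
assert num_forward_passes(256, 4, 2) == 128   # intermediate B trades the two off
```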
2. Mathematical Formulation
Partition a sequence of length L into blocks of size B:
Block-level AR factorization

p_θ(x) = ∏_{b=1}^{L/B} p_θ( x^(b) | x^(<b) )
This is exactly like standard AR, but at the block level. Each block conditions on all previous blocks, just like how each token conditions on all previous tokens in GPT.
Within-block diffusion
Within each block, we use masked diffusion (same as LLaDA/MDLM): each factor p_θ(x^(b) | x^(<b)) is modeled by a masked denoising process. Tokens of block b are corrupted by independently replacing them with [M] at noise level t, and the denoiser predicts the clean x_0^(b) from the noisy x_t^(b) together with the clean prefix x^(<b).
Generate "The cat sat on the mat" with B=2 (3 blocks), T=2 denoising steps per block:

Block 1: [M] [M] → "The" [M] → "The cat"
Block 2: [M] [M] → "sat" [M] → "sat on" (conditioned on "The cat")
Block 3: [M] [M] → "the" [M] → "the mat" (conditioned on "The cat sat on")
Total forward passes: 3 blocks × 2 steps = 6. Pure AR would need 6 steps too (one per token). But with B=4 and L=8: only 2 blocks × 2 steps = 4 passes for 8 tokens. That's the speedup.
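The block-by-block procedure can be sketched as follows. The `denoise` callback and the reveal schedule (an equal share of masked positions per step) are simplifying assumptions for illustration, not the paper's actual sampler:

```python
# Sketch of block-wise generation. `denoise` is a hypothetical callback
# that returns a token prediction for every position of the current block.
import math

MASK = -1  # stand-in for the [M] token id

def generate(L, B, T, denoise):
    seq = []                                      # clean tokens decoded so far
    for start in range(0, L, B):
        block = [MASK] * min(B, L - start)        # block starts fully masked
        for step in range(T):
            preds = denoise(seq, block)           # parallel predictions for the block
            masked = [i for i, t in enumerate(block) if t == MASK]
            k = math.ceil(len(masked) / (T - step))  # reveal an equal share per step
            for i in masked[:k]:
                block[i] = preds[i]
        seq.extend(block)                         # block is now clean (and cacheable)
    return seq
```

With L=6, B=2, T=2 this makes 3 blocks × 2 steps = 6 calls to `denoise`, matching the count above.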
3. Training Objective
The training loss is a masked-token-only cross-entropy, similar to LLaDA but with block-aware conditioning:

L(θ) = - E_{t, x_0, x_t} [ Σ_b (1/t) Σ_{i : x_t^(b,i) = [M]} log p_θ( x_0^(b,i) | x_t^(b), x_0^(<b) ) ]
Key difference from LLaDA: the conditioning includes both the AR prefix (previous blocks) AND bidirectional context within the current block. This is what makes the attention mask design special.
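The loss can be sketched as a cross-entropy averaged over masked positions only; shapes and names below are assumptions for illustration, and the logits would come from the block-conditioned denoiser:

```python
# Minimal sketch of the masked-token cross-entropy (LLaDA/MDLM-style).
import numpy as np

def masked_ce_loss(logits, targets, mask):
    """logits: (L, V); targets: (L,) token ids; mask: (L,) bool, True = masked.

    The loss is averaged over masked positions only; unmasked positions
    contribute nothing, since the model never has to predict them."""
    logits = logits - logits.max(axis=-1, keepdims=True)   # stable log-softmax
    logp = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    token_logp = logp[np.arange(len(targets)), targets]
    return -(token_logp * mask).sum() / mask.sum()
```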
4. Attention Mask Design (Figure 7)
The heart of Block Diffusion is its specialized attention mask. During training, the input is the concatenation of the noisy sequence x_t and clean targets x_0. The mask combines three types of attention:
1. Intra-block bidirectional attention among noisy tokens. Within the same noisy block, every token can see every other token, just like BERT. This enables parallel denoising within a block.
2. Noisy tokens attend to the clean tokens of all preceding blocks. This is how the model sees its conditioning context: x_t^(b) can look at x_0^(1), x_0^(2), ..., x_0^(b-1) while learning the denoising mapping. (Noisy tokens must not attend to x_0^(b) itself, which would leak the current block's answer.)
3. Standard left-to-right causal attention among clean tokens, exactly like GPT's causal mask. Clean block b can see clean blocks 1..b-1 and tokens to its left within block b.
Consider block 2 (tokens 3-4) during training. The model receives:
- Noisy input: x_t^(2), whose two tokens might be [M], [M] or partially revealed
- Intra-block context (Blue): the two noisy tokens see each other bidirectionally
- Clean prefix (Green): the noisy tokens can see block 1's clean tokens x_0^(1) (but not block 2's own clean tokens, which would reveal the targets)
- Prediction: output probability for the masked positions in block 2
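The three rules can be assembled into a single mask. The sketch below assumes the input is the concatenation [x_t ; x_0] with noisy positions first; that indexing convention is chosen here for illustration:

```python
# Sketch of the combined training mask over the concatenation [x_t ; x_0].
import numpy as np

def block_diffusion_mask(L: int, B: int) -> np.ndarray:
    """True at (q, k) means query position q may attend to key position k.
    Positions 0..L-1 are noisy tokens, positions L..2L-1 are clean tokens."""
    M = np.zeros((2 * L, 2 * L), dtype=bool)
    blk = lambda i: i // B
    for q in range(2 * L):
        for k in range(2 * L):
            qn, kn = q < L, k < L      # is the position in the noisy half?
            qi, ki = q % L, k % L      # token index within the sequence
            if qn and kn:              # noisy -> noisy: intra-block, bidirectional
                M[q, k] = blk(qi) == blk(ki)
            elif qn and not kn:        # noisy -> clean: strictly preceding blocks
                M[q, k] = blk(ki) < blk(qi)
            elif not qn and not kn:    # clean -> clean: causal, like GPT
                M[q, k] = ki <= qi
            # clean -> noisy stays False
    return M
```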
At inference, the clean tokens of previous blocks are cached (like KV-cache in standard AR). Only the current block's noisy tokens go through the full attention computation. This is the DualCache mechanism from Fast-dLLM.
5. Inference Process (Figure 3)
At inference, blocks are decoded sequentially, and each decoded block is cached:
DualCache: Two-level KV Cache
Block Diffusion uses two types of caching to speed up inference:
1. Block-level cache: KV cache of all previously decoded blocks, just like the standard AR KV-cache. Computed once per block.
2. Step-level cache: KV cache within the current block across denoising steps. Since denoising is iterative, token representations from earlier steps can be reused (a technique from Fast-dLLM).
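A toy accounting of why the block-level cache helps: with caching, the prefix KV is computed once per block and reused across denoising steps, while a fully bidirectional dLLM recomputes everything at every step. The model below is a simplification (it ignores the step-level cache), and the numbers are illustrative only:

```python
# Toy accounting: tokens whose K/V are (re)computed over a full generation,
# assuming num_blocks blocks of B tokens and T denoising steps per block.
def kv_computed(num_blocks: int, B: int, T: int, cached: bool) -> int:
    total = 0
    for b in range(num_blocks):          # blocks decoded left to right
        for _ in range(T):               # T denoising steps per block
            if cached:
                total += B               # prefix KV reused from the block cache
            else:
                total += (b + 1) * B     # recompute prefix + current block
    return total

# 8 blocks of 32 tokens, 4 steps each:
assert kv_computed(8, 32, 4, cached=True) == 1024    # 8 * 4 * 32
assert kv_computed(8, 32, 4, cached=False) == 4608   # 32 * 4 * (1 + ... + 8)
```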
6. D2F: Discrete Diffusion Forcing
D2F is a training technique that converts a pre-trained bidirectional dLLM into a model supporting block-wise causal attention, enabling KV cache reuse and inter-block parallel decoding at inference time.
D2F dLLM (student)
- Block-wise causal attention mask
- Sees past blocks, current block tokens left-to-right
- Compatible with KV cache across denoising steps
Pre-trained dLLM (teacher)
- Bidirectional full attention
- Sees all tokens simultaneously
- Strong quality, but no KV cache reuse
Training with Monotonically Increasing Masks
The answer sequence is divided into blocks with progressively increasing masking ratios. The D2F student model is trained to mimic the teacher's predictions on partially denoised preceding tokens, via a KL divergence loss between the two models' output distributions.
Why this matters
Standard dLLMs use bidirectional attention, so each denoising step must recompute all token representations from scratch. D2F's causal attention structure means the KV activations from step t can be partially reused in step t+1, breaking the O(N²) per-step compute barrier.
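The distillation signal can be sketched as a KL divergence between teacher and student token distributions, averaged over the positions the student must denoise. Shapes and names below are assumptions for illustration, not D2F's actual implementation:

```python
# Sketch of a per-token KL distillation loss, masked to denoised positions.
import numpy as np

def kl_distill_loss(student_logp, teacher_p, mask):
    """student_logp: (L, V) log-probs; teacher_p: (L, V) probs;
    mask: (L,) bool, True = position contributes to the loss."""
    # KL(teacher || student) per position; epsilon guards log(0)
    kl = (teacher_p * (np.log(teacher_p + 1e-9) - student_logp)).sum(axis=-1)
    return (kl * mask).sum() / mask.sum()
```

When the student matches the teacher exactly the loss is zero; any mismatch on a masked position pushes it positive.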
7. Results: The Block Size Tradeoff
| Block Size | Perplexity | Steps for 256 tokens | Speedup |
|---|---|---|---|
| B=1 (pure AR) | 24.1 | 256 | 1x |
| B=4 | 24.8 | 64 × T | ~2-4x |
| B=16 | 26.3 | 16 × T | ~4-8x |
| B=256 (pure diffusion) | 31.2 | 1 × T | depends on T |
8. Connections to Other Work
- LLaDA: can be seen as Block Diffusion with B = L (one giant block). LLaDA's training objective is a special case of the block diffusion loss.
- Fast-dLLM: provides the adaptive denoising schedule and DualCache that Block Diffusion uses within each block. The two papers are complementary: Fast-dLLM reduces T (steps per block), while Block Diffusion shrinks what each step must process to a single block.
- MDLM: supplies the continuous-time ELBO theory that underpins the within-block diffusion process. Block Diffusion uses MDLM's training framework.