Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models

Arriola et al. · 2025

TL;DR

Block Diffusion bridges AR and diffusion by splitting text into blocks. Blocks are generated left-to-right (AR-style), but tokens within each block are generated via masked diffusion (parallel). Block size B controls the tradeoff: B=1 is pure AR, B=L is pure diffusion.

1. The AR-Diffusion Spectrum

Two extremes of text generation:

Autoregressive (B = 1)

  • + Excellent long-range coherence
  • + Strong benchmarks
  • - Sequential: 1 token per step
  • - Left-to-right only

Full Diffusion (B = L)

  • + Parallel within block
  • + Bidirectional context
  • - Many denoising steps needed
  • - Weaker long-range coherence

Block Diffusion sits in between: block size B controls the tradeoff.

2. Mathematical Formulation

Partition a sequence of length L into blocks of size B:

Block partition
x = \underbrace{[x_1, \ldots, x_B]}_{\text{Block 1}},\ \underbrace{[x_{B+1}, \ldots, x_{2B}]}_{\text{Block 2}},\ \ldots,\ \underbrace{[x_{L-B+1}, \ldots, x_L]}_{\text{Block } L/B}

  • x: the full sequence of L tokens
  • L: total sequence length (e.g. 256)
  • B: block size, the key hyperparameter; larger B means more parallelism, less coherence
  • L/B: the number of blocks (B must divide L evenly)
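The partition itself is trivial to write down in code. A minimal sketch (`partition_into_blocks` is my own helper name, not from the paper):

```python
def partition_into_blocks(x, B):
    """Split a token sequence into contiguous blocks of size B.

    The formulation assumes B divides L evenly, so we enforce that here.
    """
    L = len(x)
    assert L % B == 0, "block size B must divide sequence length L"
    return [x[b * B:(b + 1) * B] for b in range(L // B)]

tokens = list(range(8))                     # toy sequence, L = 8 token ids
print(partition_into_blocks(tokens, B=4))   # [[0, 1, 2, 3], [4, 5, 6, 7]]
```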

Block-level AR factorization

Outer loop: autoregressive over blocks
p(x) = \prod_{b=1}^{L/B} p_\theta\!\left(\mathbf{x}^{(b)} \mid \mathbf{x}^{(<b)}\right)

  • \mathbf{x}^{(b)}: the b-th block of B tokens
  • \mathbf{x}^{(<b)}: all blocks before block b (the "prefix")
  • p_\theta: the neural network (Transformer) parameterized by \theta

This is exactly like standard AR, but at the block level. Each block conditions on all previous blocks, just like how each token conditions on all previous tokens in GPT.
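The block-level factorization is just a loop over blocks that accumulates conditional log-probabilities. A toy sketch, where `block_cond_logprob` is a hypothetical stand-in for the learned p_theta(x^(b) | x^(<b)):

```python
import math

def block_log_prob(x, B, block_cond_logprob):
    """log p(x) = sum_b log p_theta(x^(b) | x^(<b)): AR over blocks."""
    total = 0.0
    for b in range(len(x) // B):
        prefix = x[:b * B]                # all previous blocks, x^(<b)
        block = x[b * B:(b + 1) * B]      # current block, x^(b)
        total += block_cond_logprob(block, prefix)
    return total

# Toy "model": uniform over a 4-token vocabulary, each token costs log(1/4).
uniform = lambda block, prefix: len(block) * math.log(0.25)
print(block_log_prob([0, 1, 2, 3], B=2, block_cond_logprob=uniform))
# same value as scoring the 4 tokens one at a time: 4 * log(0.25)
```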

Within-block diffusion

Within each block, we use masked diffusion (same as LLaDA/MDLM):

Inner loop: masked diffusion within block
p_\theta\!\left(\mathbf{x}^{(b)} \mid \mathbf{x}^{(<b)}\right) = \int \prod_{s=1}^{T} p_\theta\!\left(\mathbf{x}_{t_{s-1}}^{(b)} \mid \mathbf{x}_{t_s}^{(b)},\, \mathbf{x}^{(<b)}\right) d\mathbf{x}_{t_1:t_{T-1}}

Generate "The cat sat on the mat" with B=2 (3 blocks), T=2 denoising steps per block:

Block 1: start from [M][M], no prefix
t=1.0: [M] [M]
t=0.5: The [M]    ("The" has 0.92 confidence)
t=0.0: The cat    ✓
Block 2: start from [M][M], prefix = "The cat"
t=1.0: [M] [M]    |    prefix: The cat
t=0.5: sat [M]
t=0.0: sat on    ✓
Block 3: start from [M][M], prefix = "The cat sat on"
t=1.0: [M] [M]    |    prefix: The cat sat on
t=0.5: the [M]
t=0.0: the mat    ✓

Total forward passes: 3 blocks × 2 steps = 6. Pure AR would need 6 steps too (one per token), so there is no speedup at this size. But with B=4 and L=8: only 2 blocks × 2 steps = 4 passes for 8 tokens. That is the speedup!
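The pass counting generalizes; a tiny helper (my own naming) makes the tradeoff explicit:

```python
def forward_passes(L, B, T):
    """Model forward passes to generate L tokens with block size B and
    T denoising steps per block (caching effects ignored)."""
    assert L % B == 0, "B must divide L"
    return (L // B) * T

print(forward_passes(6, 2, 2))   # the worked example above: 6 passes
print(forward_passes(8, 4, 2))   # 4 passes for 8 tokens
print(forward_passes(8, 1, 1))   # pure AR baseline: 8 passes
```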

⚙ Interactive (original page): Block-by-Block Generation, showing blocks decoded left-to-right with within-block parallel diffusion; each block starts fully masked and is denoised step by step.

3. Training Objective

The training loss is a masked-token-only cross-entropy, similar to LLaDA but with block-aware conditioning:

Block diffusion training loss
\mathcal{L}_{\text{block}}(\theta) = -\mathbb{E}_{x,m}\left[\sum_{i=1}^{L} \mathbf{1}[x_t^i = \texttt{[MASK]}] \log p_\theta\!\left(x_0^i \mid x_{<i},\, \mathbf{x}_{\text{block}(i)}\right)\right]

  • \mathcal{L}_{\text{block}}(\theta): the loss we minimize during training
  • \mathbb{E}_{x,m}: expectation over training data x and random masks m
  • \mathbf{1}[x_t^i = \texttt{[MASK]}]: indicator; only masked positions contribute to the loss
  • x_{<i}: all tokens before position i in previous blocks (the AR prefix)
  • \mathbf{x}_{\text{block}(i)}: all tokens in the same block as position i (bidirectional context)
  • p_\theta(x_0^i \mid \cdot): the model's predicted probability of the true token at position i

Key difference from LLaDA: the conditioning includes both the AR prefix (previous blocks) AND bidirectional context within the current block. This is what makes the attention mask design special.
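The masked-token-only loss reduces to a few lines of numpy. A sketch under the assumption that the model's log-probabilities have already been computed with the block-aware conditioning described above:

```python
import numpy as np

def block_masked_ce(log_probs, targets, is_masked):
    """Masked-token-only cross-entropy, a sketch of the loss above.

    log_probs: (L, V) model log-probabilities, assumed already conditioned
    on the AR prefix and the within-block bidirectional context.
    is_masked: boolean (L,), True where x_t^i == [MASK].
    Only masked positions contribute, per the indicator in the loss.
    """
    picked = log_probs[np.arange(len(targets)), targets]  # log p(x_0^i | .)
    return float(-(picked * is_masked).sum())

# Toy example: L = 3 positions, vocab of 2, positions 0 and 2 masked.
lp = np.log(np.array([[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]))
loss = block_masked_ce(lp, np.array([0, 0, 1]), np.array([True, False, True]))
print(round(loss, 4))   # -(log 0.9 + log 0.8) ≈ 0.3285
```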

4. Attention Mask Design (Figure 7)

The heart of Block Diffusion is its specialized attention mask. During training, the input is the concatenation of the noisy sequence x_t and clean targets x_0. The mask combines three types of attention:

Block Diagonal

Intra-block bidirectional attention among noisy tokens. Within the same noisy block, every token can see every other token, just like BERT. This enables parallel denoising within a block.

Offset Block Causal

Noisy tokens attend to clean tokens in all preceding blocks: x_t^(b) can look at x_0^(1), x_0^(2), ..., x_0^(b-1) to learn the denoising mapping. The "offset" is essential; if the noisy block could attend to its own clean block x_0^(b), the denoising target would leak into the input.

Block Causal

Standard left-to-right causal attention among clean tokens, applied at the block level: clean block b can see clean blocks 1..b-1 and attends bidirectionally within block b. This mirrors GPT's causal mask, with blocks in place of tokens.
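The three components can be assembled into an explicit boolean matrix. A numpy sketch under my own indexing convention, following the paper's Figure 7 in that a noisy block attends only to strictly preceding clean blocks (so its own clean block, the denoising target, never leaks):

```python
import numpy as np

def bd_attention_mask(n_blocks, B):
    """Attention mask over the concatenated input [noisy x_t | clean x_0].

    Returns a (2L, 2L) boolean matrix, L = n_blocks * B; True means the
    query row may attend to the key column.
    """
    L = n_blocks * B
    blk = lambda i: i // B                      # block index of position i
    M = np.zeros((2 * L, 2 * L), dtype=bool)
    for q in range(L):                          # noisy queries x_t
        for k in range(L):
            M[q, k] = blk(q) == blk(k)          # block diagonal (bidirectional)
            M[q, L + k] = blk(k) < blk(q)       # offset block causal (noisy -> clean)
    for q in range(L):                          # clean queries x_0
        for k in range(L):
            M[L + q, L + k] = blk(k) <= blk(q)  # block causal (clean AR)
    return M

mask = bd_attention_mask(n_blocks=3, B=2)       # 3 blocks of 2: 6 noisy + 6 clean
print(mask.shape)                               # (12, 12)
```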

⚙ Interactive (original page): Attention Mask Matrix (Figure 7). With 3 blocks of 2 tokens the matrix is 12 × 12 (6 noisy + 6 clean tokens); cells are colored by attention type: block diagonal (intra-block bidirectional), offset block causal (noisy → clean), block causal (clean AR), or no attention.

Consider block 2 (tokens 3-4) during training. The model receives:

  1. Noisy input: x_t^3, x_t^4, which might be [M], [M] or partially revealed
  2. Intra-block context (Blue): x_t^3 ↔ x_t^4 see each other bidirectionally
  3. Clean prefix (Green): can see x_0^1, x_0^2 (block 1's clean tokens), but not its own block's clean tokens x_0^3, x_0^4, which are the denoising targets
  4. Prediction: output probabilities for the masked positions in block 2

At inference, the clean tokens of previous blocks are cached (like KV-cache in standard AR). Only the current block's noisy tokens go through the full attention computation. This is the DualCache mechanism from Fast-dLLM.

5. Inference Process (Figure 3)

At inference, blocks are decoded sequentially, and each decoded block is cached:

Inference: block-by-block decoding
\text{For } b = 1, 2, \ldots, L/B: \quad \hat{\mathbf{x}}^{(b)} = \text{Denoise}_T\!\left(\mathbf{x}_T^{(b)} = [\texttt{M}]^B \;\middle|\; \hat{\mathbf{x}}^{(<b)}_{\text{cached}}\right)

DualCache: Two-level KV Cache

Block Diffusion uses two types of caching to speed up inference:

Prefix Cache

KV cache of all previously decoded blocks. Just like standard AR KV-cache. Only computed once per block.

Block Cache

KV cache within the current block across denoising steps. Since denoising is iterative, token representations from earlier steps can be reused (from Fast-dLLM).
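Putting the decoding rule and the caches together, inference is two nested loops. A toy Python sketch; `denoise_step` is a stand-in for one model call, and the actual KV caching is elided:

```python
MASK = "[M]"

def generate(denoise_step, L, B, T):
    """Block-by-block decoding: each block starts fully masked and is
    denoised for T steps, conditioned on all previously decoded blocks."""
    out = []
    for b in range(L // B):
        block = [MASK] * B                    # x_T^(b) = [M]^B
        for _ in range(T):
            block = denoise_step(block, out)  # prefix = decoded (cached) blocks
        out.extend(block)                     # block is now clean; cache it
    return out

# Toy denoiser: reveal one masked position per call with a dummy token.
def toy_step(block, prefix):
    block = list(block)
    if MASK in block:
        i = block.index(MASK)
        block[i] = f"tok{len(prefix) + i}"
    return block

print(generate(toy_step, L=4, B=2, T=2))   # ['tok0', 'tok1', 'tok2', 'tok3']
```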

6. D2F: Discrete Diffusion Forcing

D2F is a training technique that converts a pre-trained bidirectional dLLM into a model supporting block-wise causal attention, enabling KV cache reuse and inter-block parallel decoding at inference time.

🔥 D2F dLLM (student)

  • Block-wise causal attention mask
  • Sees past blocks, current block tokens left-to-right
  • Compatible with KV cache across denoising steps

❄️ Pre-trained dLLM (teacher)

  • Bidirectional full attention
  • Sees all tokens simultaneously
  • Strong quality, but no KV cache reuse

Training with Monotonically Increasing Masks

The answer sequence is divided into blocks with progressively increasing masking ratios. The D2F student model is trained to mimic the teacher's predictions on partially denoised preceding tokens, via a KL divergence loss between the two models' output distributions.

D2F distillation objective
\mathcal{L}_{\text{D2F}} = \mathbb{E}\left[\mathrm{KL}\left(p_{\text{teacher}}(\cdot \mid x_t) \;\big\|\; p_{\text{student}}(\cdot \mid x_t^{\text{causal}})\right)\right]
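Numerically this objective is just a KL between two categorical distributions per position. A toy numpy sketch; the distributions below are made up for illustration, and in practice they come from the teacher and student forward passes:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for categorical distributions; eps for numerical stability."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

teacher = [0.7, 0.2, 0.1]   # p_teacher(. | x_t), bidirectional attention
student = [0.6, 0.3, 0.1]   # p_student(. | x_t^causal), block-causal attention
print(round(kl_divergence(teacher, student), 4))   # ≈ 0.0268
```

Minimizing this pulls the causal-attention student's per-position distribution toward the bidirectional teacher's.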

Why this matters

Standard dLLMs use bidirectional attention, so each denoising step must recompute all token representations from scratch. D2F's causal attention structure means the KV activations from step t can be partially reused at step t+1, breaking the O(N²) per-step compute barrier.

7. Results: The Block Size Tradeoff

Block Size               Perplexity   Steps for 256 tokens   Speedup
B=1 (pure AR)            24.1         256                    1x
B=4                      24.8         64 × T                 ~2-4x
B=16                     26.3         16 × T                 ~4-8x
B=256 (pure diffusion)   31.2         1 × T                  depends on T

8. Connections to Other Work

LLaDA

Can be seen as Block Diffusion with B = L (one giant block). LLaDA's training objective is a special case of the block diffusion loss.

Fast-dLLM

Provides the adaptive denoising schedule and DualCache that Block Diffusion uses within each block. The two papers are complementary: Fast-dLLM reduces T, the denoising steps per block, while Block Diffusion reduces the number of sequential generation stages from L tokens to L/B blocks.

MDLM

The continuous-time ELBO theory that underpins the within-block diffusion process. Block Diffusion uses MDLM's training framework.

9. Additional Resources