Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models

Arriola et al. · 2025

TL;DR

Block Diffusion bridges AR and diffusion by splitting text into blocks. Blocks are generated left-to-right (AR-style), but tokens within each block are generated via masked diffusion (parallel). Block size B controls the tradeoff: B=1 is pure AR, B=L is pure diffusion.

1. The AR-Diffusion Spectrum

Two extremes of text generation:

Autoregressive (B = 1)

  • + Excellent long-range coherence
  • + Strong benchmarks
  • - Sequential: 1 token per step
  • - Left-to-right only

Full Diffusion (B = L)

  • + Parallel within block
  • + Bidirectional context
  • - Many denoising steps needed
  • - Weaker long-range coherence

Block Diffusion sits in between: block size B controls the tradeoff.

2. Mathematical Formulation

Partition a sequence of length L into blocks of size B:

Block partition
x = \underbrace{[x_1, \ldots, x_B]}_{\text{Block 1}},\ \underbrace{[x_{B+1}, \ldots, x_{2B}]}_{\text{Block 2}},\ \ldots,\ \underbrace{[x_{L-B+1}, \ldots, x_L]}_{\text{Block } L/B}

  • x: the full sequence of L tokens
  • L: total sequence length (e.g. 256)
  • B: block size, the key hyperparameter; larger B means more parallelism, less coherence
  • L/B: the number of blocks (B must divide L evenly)
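The partition itself is trivial to write down in code. A minimal sketch (`partition_into_blocks` is my own helper name, not from the paper):

```python
def partition_into_blocks(x, B):
    """Split a token sequence into contiguous blocks of size B.

    The formulation assumes B divides L evenly, so we enforce that here.
    """
    L = len(x)
    assert L % B == 0, "block size B must divide sequence length L"
    return [x[b * B:(b + 1) * B] for b in range(L // B)]

tokens = list(range(8))                     # toy sequence, L = 8 token ids
print(partition_into_blocks(tokens, B=4))   # [[0, 1, 2, 3], [4, 5, 6, 7]]
```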

Block-level AR factorization

Outer loop: autoregressive over blocks
p(x) = \prod_{b=1}^{L/B} p_\theta\!\left(\mathbf{x}^{(b)} \mid \mathbf{x}^{(<b)}\right)

  • \mathbf{x}^{(b)}: the b-th block of B tokens
  • \mathbf{x}^{(<b)}: all blocks before block b (the "prefix")
  • p_\theta: the neural network (Transformer) parameterized by \theta

This is exactly like standard AR, but at the block level. Each block conditions on all previous blocks, just like how each token conditions on all previous tokens in GPT.
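The block-level factorization is just a loop over blocks that accumulates conditional log-probabilities. A toy sketch, where `block_cond_logprob` is a hypothetical stand-in for the learned p_theta(x^(b) | x^(<b)):

```python
import math

def block_log_prob(x, B, block_cond_logprob):
    """log p(x) = sum_b log p_theta(x^(b) | x^(<b)): AR over blocks."""
    total = 0.0
    for b in range(len(x) // B):
        prefix = x[:b * B]                # all previous blocks, x^(<b)
        block = x[b * B:(b + 1) * B]      # current block, x^(b)
        total += block_cond_logprob(block, prefix)
    return total

# Toy "model": uniform over a 4-token vocabulary, each token costs log(1/4).
uniform = lambda block, prefix: len(block) * math.log(0.25)
print(block_log_prob([0, 1, 2, 3], B=2, block_cond_logprob=uniform))
# same value as scoring the 4 tokens one at a time: 4 * log(0.25)
```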

Within-block diffusion

Within each block, we use masked diffusion (same as LLaDA/MDLM):

Inner loop: masked diffusion within block
p_\theta\!\left(\mathbf{x}^{(b)} \mid \mathbf{x}^{(<b)}\right) = \int \prod_{s=1}^{T} p_\theta\!\left(\mathbf{x}_{t_{s-1}}^{(b)} \mid \mathbf{x}_{t_s}^{(b)},\, \mathbf{x}^{(<b)}\right) d\mathbf{x}_{t_1:t_{T-1}}

Generate "The cat sat on the mat" with B=2 (3 blocks), T=2 denoising steps per block:

Block 1: start from [M][M], no prefix
t=1.0: [M] [M]
t=0.5: The [M]    ("The" has 0.92 confidence)
t=0.0: The cat    ✓
Block 2: start from [M][M], prefix = "The cat"
t=1.0: [M] [M]    |    prefix: The cat
t=0.5: sat [M]
t=0.0: sat on    ✓
Block 3: start from [M][M], prefix = "The cat sat on"
t=1.0: [M] [M]    |    prefix: The cat sat on
t=0.5: the [M]
t=0.0: the mat    ✓

Total forward passes: 3 blocks × 2 steps = 6. Pure AR would need 6 steps too (one per token), so there is no speedup at this size. But with B=4 and L=8: only 2 blocks × 2 steps = 4 passes for 8 tokens. That is the speedup!
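The pass counting generalizes; a tiny helper (my own naming) makes the tradeoff explicit:

```python
def forward_passes(L, B, T):
    """Model forward passes to generate L tokens with block size B and
    T denoising steps per block (caching effects ignored)."""
    assert L % B == 0, "B must divide L"
    return (L // B) * T

print(forward_passes(6, 2, 2))   # the worked example above: 6 passes
print(forward_passes(8, 4, 2))   # 4 passes for 8 tokens
print(forward_passes(8, 1, 1))   # pure AR baseline: 8 passes
```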

⚙ Interactive (original page): Block-by-Block Generation, showing blocks decoded left-to-right with within-block parallel diffusion; each block starts fully masked and is denoised step by step.

3. Training Objective

The training loss is a masked-token-only cross-entropy, similar to LLaDA but with block-aware conditioning:

Block diffusion training loss
\mathcal{L}_{\text{block}}(\theta) = -\mathbb{E}_{x,m}\left[\sum_{i=1}^{L} \mathbf{1}[x_t^i = \texttt{[MASK]}] \log p_\theta\!\left(x_0^i \mid x_{<i},\, \mathbf{x}_{\text{block}(i)}\right)\right]

  • \mathcal{L}_{\text{block}}(\theta): the loss we minimize during training
  • \mathbb{E}_{x,m}: expectation over training data x and random masks m
  • \mathbf{1}[x_t^i = \texttt{[MASK]}]: indicator; only masked positions contribute to the loss
  • x_{<i}: all tokens before position i in previous blocks (the AR prefix)
  • \mathbf{x}_{\text{block}(i)}: all tokens in the same block as position i (bidirectional context)
  • p_\theta(x_0^i \mid \cdot): the model's predicted probability of the true token at position i

Key difference from LLaDA: the conditioning includes both the AR prefix (previous blocks) AND bidirectional context within the current block. This is what makes the attention mask design special.
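The masked-token-only loss reduces to a few lines of numpy. A sketch under the assumption that the model's log-probabilities have already been computed with the block-aware conditioning described above:

```python
import numpy as np

def block_masked_ce(log_probs, targets, is_masked):
    """Masked-token-only cross-entropy, a sketch of the loss above.

    log_probs: (L, V) model log-probabilities, assumed already conditioned
    on the AR prefix and the within-block bidirectional context.
    is_masked: boolean (L,), True where x_t^i == [MASK].
    Only masked positions contribute, per the indicator in the loss.
    """
    picked = log_probs[np.arange(len(targets)), targets]  # log p(x_0^i | .)
    return float(-(picked * is_masked).sum())

# Toy example: L = 3 positions, vocab of 2, positions 0 and 2 masked.
lp = np.log(np.array([[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]))
loss = block_masked_ce(lp, np.array([0, 0, 1]), np.array([True, False, True]))
print(round(loss, 4))   # -(log 0.9 + log 0.8) ≈ 0.3285
```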

4. Attention Mask Design (Figure 7)

The heart of Block Diffusion is its specialized attention mask. During training, the input is the concatenation of the noisy sequence x_t and clean targets x_0. The mask combines three types of attention:

Block Diagonal

Intra-block bidirectional attention among noisy tokens. Within the same noisy block, every token can see every other token, just like BERT. This enables parallel denoising within a block.

Offset Block Causal

Noisy tokens attend to clean tokens in all preceding blocks: x_t^(b) can look at x_0^(1), x_0^(2), ..., x_0^(b-1) to learn the denoising mapping. The "offset" is essential; if the noisy block could attend to its own clean block x_0^(b), the denoising target would leak into the input.

Block Causal

Standard left-to-right causal attention among clean tokens, applied at the block level: clean block b can see clean blocks 1..b-1 and attends bidirectionally within block b. This mirrors GPT's causal mask, with blocks in place of tokens.
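The three components can be assembled into an explicit boolean matrix. A numpy sketch under my own indexing convention, following the paper's Figure 7 in that a noisy block attends only to strictly preceding clean blocks (so its own clean block, the denoising target, never leaks):

```python
import numpy as np

def bd_attention_mask(n_blocks, B):
    """Attention mask over the concatenated input [noisy x_t | clean x_0].

    Returns a (2L, 2L) boolean matrix, L = n_blocks * B; True means the
    query row may attend to the key column.
    """
    L = n_blocks * B
    blk = lambda i: i // B                      # block index of position i
    M = np.zeros((2 * L, 2 * L), dtype=bool)
    for q in range(L):                          # noisy queries x_t
        for k in range(L):
            M[q, k] = blk(q) == blk(k)          # block diagonal (bidirectional)
            M[q, L + k] = blk(k) < blk(q)       # offset block causal (noisy -> clean)
    for q in range(L):                          # clean queries x_0
        for k in range(L):
            M[L + q, L + k] = blk(k) <= blk(q)  # block causal (clean AR)
    return M

mask = bd_attention_mask(n_blocks=3, B=2)       # 3 blocks of 2: 6 noisy + 6 clean
print(mask.shape)                               # (12, 12)
```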

⚙ Interactive (original page): Attention Mask Matrix (Figure 7). With 3 blocks of 2 tokens the matrix is 12 × 12 (6 noisy + 6 clean tokens); cells are colored by attention type: block diagonal (intra-block bidirectional), offset block causal (noisy → clean), block causal (clean AR), or no attention.

Consider block 2 (tokens 3-4) during training. The model receives:

  1. Noisy input: x_t^3, x_t^4, which might be [M], [M] or partially revealed
  2. Intra-block context (Blue): x_t^3 ↔ x_t^4 see each other bidirectionally
  3. Clean prefix (Green): can see x_0^1, x_0^2 (block 1's clean tokens), but not its own block's clean tokens x_0^3, x_0^4, which are the denoising targets
  4. Prediction: output probabilities for the masked positions in block 2

At inference, the clean tokens of previous blocks are cached (like KV-cache in standard AR). Only the current block's noisy tokens go through the full attention computation. This is the DualCache mechanism from Fast-dLLM.

5. Inference Process (Figure 3)

At inference, blocks are decoded sequentially, and each decoded block is cached:

Inference: block-by-block decoding
\text{For } b = 1, 2, \ldots, L/B: \quad \hat{\mathbf{x}}^{(b)} = \text{Denoise}_T\!\left(\mathbf{x}_T^{(b)} = [\texttt{M}]^B \;\middle|\; \hat{\mathbf{x}}^{(<b)}_{\text{cached}}\right)

DualCache: Two-level KV Cache

Block Diffusion uses two types of caching to speed up inference:

Prefix Cache

KV cache of all previously decoded blocks. Just like standard AR KV-cache. Only computed once per block.

Block Cache

KV cache within the current block across denoising steps. Since denoising is iterative, token representations from earlier steps can be reused (from Fast-dLLM).
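Putting the decoding rule and the caches together, inference is two nested loops. A toy Python sketch; `denoise_step` is a stand-in for one model call, and the actual KV caching is elided:

```python
MASK = "[M]"

def generate(denoise_step, L, B, T):
    """Block-by-block decoding: each block starts fully masked and is
    denoised for T steps, conditioned on all previously decoded blocks."""
    out = []
    for b in range(L // B):
        block = [MASK] * B                    # x_T^(b) = [M]^B
        for _ in range(T):
            block = denoise_step(block, out)  # prefix = decoded (cached) blocks
        out.extend(block)                     # block is now clean; cache it
    return out

# Toy denoiser: reveal one masked position per call with a dummy token.
def toy_step(block, prefix):
    block = list(block)
    if MASK in block:
        i = block.index(MASK)
        block[i] = f"tok{len(prefix) + i}"
    return block

print(generate(toy_step, L=4, B=2, T=2))   # ['tok0', 'tok1', 'tok2', 'tok3']
```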

6. D2F: Discrete Diffusion Forcing

D2F is a training technique that converts a pre-trained bidirectional dLLM into a model supporting block-wise causal attention, enabling KV cache reuse and inter-block parallel decoding at inference time.

🔥 D2F dLLM (student)

  • Block-wise causal attention mask
  • Sees past blocks, current block tokens left-to-right
  • Compatible with KV cache across denoising steps

❄️ Pre-trained dLLM (teacher)

  • Bidirectional full attention
  • Sees all tokens simultaneously
  • Strong quality, but no KV cache reuse

Training with Monotonically Increasing Masks

The answer sequence is divided into blocks with progressively increasing masking ratios. The D2F student model is trained to mimic the teacher's predictions on partially denoised preceding tokens, via a KL divergence loss between the two models' output distributions.

D2F distillation objective
\mathcal{L}_{\text{D2F}} = \mathbb{E}\left[\mathrm{KL}\left(p_{\text{teacher}}(\cdot \mid x_t) \;\big\|\; p_{\text{student}}(\cdot \mid x_t^{\text{causal}})\right)\right]
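Numerically this objective is just a KL between two categorical distributions per position. A toy numpy sketch; the distributions below are made up for illustration, and in practice they come from the teacher and student forward passes:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for categorical distributions; eps for numerical stability."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

teacher = [0.7, 0.2, 0.1]   # p_teacher(. | x_t), bidirectional attention
student = [0.6, 0.3, 0.1]   # p_student(. | x_t^causal), block-causal attention
print(round(kl_divergence(teacher, student), 4))   # ≈ 0.0268
```

Minimizing this pulls the causal-attention student's per-position distribution toward the bidirectional teacher's.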

Why this matters

Standard dLLMs use bidirectional attention, so each denoising step must recompute all token representations from scratch. D2F's causal attention structure means the KV activations from step t can be partially reused at step t+1, breaking the O(N²) per-step compute barrier.

7. Results: The Block Size Tradeoff

Block Size               Perplexity   Steps for 256 tokens   Speedup
B=1 (pure AR)            24.1         256                    1x
B=4                      24.8         64 × T                 ~2-4x
B=16                     26.3         16 × T                 ~4-8x
B=256 (pure diffusion)   31.2         1 × T                  depends on T

8. Connections to Other Work

LLaDA

Can be seen as Block Diffusion with B = L (one giant block). LLaDA's training objective is a special case of the block diffusion loss.

Fast-dLLM

Provides the adaptive denoising schedule and DualCache that Block Diffusion uses within each block. The two papers are complementary: Fast-dLLM reduces T, the denoising steps per block, while Block Diffusion reduces the number of sequential generation stages from L tokens to L/B blocks.

MDLM

The continuous-time ELBO theory that underpins the within-block diffusion process. Block Diffusion uses MDLM's training framework.

9. Additional Resources