LLaDA: Large Language Diffusion with mAsking

Nie et al. · 2025 · arXiv 2502.09992

TL;DR

LLaDA treats language modeling as a masked diffusion process: the forward process randomly masks tokens, and the reverse process learns to predict all masked tokens simultaneously. Unlike AR models that generate left-to-right, LLaDA can fill in tokens in any order. Scaled to 8B parameters, it matches LLaMA3 on many benchmarks.

Architecture Overview

  • Problem: AR LMs have inherent limitations (unidirectional, sequential, error accumulation)
  • Core idea: masked diffusion for language; masking is the natural "noise" for discrete tokens
      • Forward process: progressively mask tokens with probability t
      • Reverse process: predict all [MASK] positions in parallel
  • Design choice: standard Transformer architecture (like LLaMA) with bidirectional attention (no causal mask) + RoPE + RMSNorm + SwiGLU
  • Training: variable-rate MLM; cross-entropy on masked positions, t ~ Uniform(0, 1)
  • Generation: iterative unmasking over T steps, unmasking highest-confidence tokens first
  • Result: 8B params, competitive with LLaMA3, with bidirectional context and parallel-decoding potential

1. Background & Motivation

Autoregressive (AR) LLMs like GPT generate tokens one-by-one, left to right. This is the dominant paradigm, but it has inherent limitations:

  • Unidirectional: can only condition on left context; it can't "look ahead"
  • Sequential decoding: generating N tokens requires N forward passes; decoding can't be parallelized
  • Error accumulation: mistakes in early tokens propagate to all later tokens
  • Fixed generation order: always left→right, even when later tokens are more certain

Continuous diffusion models (like DALL-E for images) have shown that iterative denoising can be a powerful generative paradigm. But text is discrete: you can't add Gaussian noise to words. LLaDA's key insight: masking IS the natural noise for discrete tokens.

2. Forward Process: Progressive Masking

Given a clean sequence, the forward process independently masks each token with probability t:

Forward transition kernel

q(x_t^i \mid x_0^i) = \begin{cases} x_0^i & \text{with probability } 1 - t \\ \texttt{[MASK]} & \text{with probability } t \end{cases}

  • x_0^i: the original (clean) token at position i
  • x_t^i: the token at position i at noise level t (either the original or [MASK])
  • t ∈ [0, 1]: noise level / timestep; t = 0 means clean, t = 1 means fully masked
  • q(· | ·): the forward transition distribution, i.e. how we add noise

Key properties of this forward process:

  • At t=0, the sequence is fully clean
  • At t=1, every token is [MASK]
  • Masking is independent across positions β€” each token decides on its own
  • Expected number of masked tokens at time t: t × L
Joint forward distribution (all positions independent)

q(x_t \mid x_0) = \prod_{i=1}^{L} q(x_t^i \mid x_0^i)
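
The forward process is simple enough to sketch in a few lines. This is an illustration, not code from the paper; `forward_mask` and the `[M]` placeholder are hypothetical names:

```python
import random

MASK = "[M]"

def forward_mask(tokens, t, seed=None):
    """Forward process: mask each token independently with probability t."""
    rng = random.Random(seed)
    return [MASK if rng.random() < t else tok for tok in tokens]

tokens = "The cat sat on the mat".split()
print(forward_mask(tokens, 0.0))        # t = 0: the sequence stays clean
print(forward_mask(tokens, 0.5, seed=0))  # stochastic: about t*L tokens masked
print(forward_mask(tokens, 1.0))        # t = 1: every token is [M]
```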

Take the sentence "The cat sat on the mat" (L=6 tokens). Each token independently decides whether to mask:

t = 0.0: mask prob = 0% → The cat sat on the mat
Expected masked: 0×6 = 0 tokens
t = 0.25: mask prob = 25% → The cat sat [M] the mat
Expected: 0.25×6 = 1.5 tokens (got 1 here; it's stochastic!)
t = 0.5: mask prob = 50% → The [M] sat [M] [M] mat
Expected: 0.5×6 = 3 tokens
t = 0.75: mask prob = 75% → [M] [M] sat [M] [M] [M]
Expected: 0.75×6 = 4.5 tokens
t = 1.0: mask prob = 100% → [M] [M] [M] [M] [M] [M]
Expected: 1.0×6 = 6 tokens (deterministic)

3. Reverse Process: Learning to Unmask

The reverse process is a neural network that takes a partially masked sequence and predicts the original token at every masked position:

Reverse process: predict clean tokens from noisy input

p_\theta(x_0 \mid x_t) = \prod_{i=1}^{L} p_\theta(x_0^i \mid x_t)

  • p_\theta: neural network (Transformer) with parameters θ
  • x_t: the noisy (partially masked) input sequence
  • x_0: the clean target sequence we want to predict
  • \prod_{i=1}^{L}: independent prediction at each position; this is what enables parallelism

Key insight: The model predicts ALL masked tokens independently in a single forward pass. This means we don't need to generate tokens one by one like in AR models.

4. Training Objective

LLaDA training objective (ELBO-derived)

\mathcal{L}(\theta) = -\mathbb{E}_{t \sim \mathcal{U}(0,1)} \, \mathbb{E}_{x_t \sim q(x_t \mid x_0)} \left[ \frac{1}{t \cdot L} \sum_{i : x_t^i = \texttt{[M]}} \log p_\theta(x_0^i \mid x_t) \right]

  • \mathcal{L}(\theta): the loss function we minimize
  • t ~ U(0, 1): sample a random timestep uniformly from [0, 1], so the model trains at ALL noise levels
  • x_t ~ q(x_t | x_0): apply the forward process, masking each token with probability t
  • \sum_{i : x_t^i = [M]}: sum only over masked positions; no loss is computed on unmasked tokens
  • \log p_\theta(x_0^i | x_t): log probability of predicting the correct original token
  • 1/(t · L): normalization; the expected number of masked tokens is t × L

Sentence: "The cat sat on the mat" (L=6)

Step 1: Sample t = 0.5
Step 2: Mask with prob 0.5 → "The [M] sat [M] [M] mat"
        Masked positions: {2, 4, 5} (cat, on, the)
Step 3: Model predicts at masked positions:
        pos 2: p(cat | x_t) = 0.7 → log(0.7) = -0.357
        pos 4: p(on | x_t) = 0.9 → log(0.9) = -0.105
        pos 5: p(the | x_t) = 0.6 → log(0.6) = -0.511
Step 4: Loss = -1/(0.5 × 6) × (-0.357 - 0.105 - 0.511)
             = -1/3 × (-0.973) = 0.324
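
The arithmetic above can be checked with a short script. A minimal sketch, assuming the model's probabilities at the masked positions are given; `llada_loss` is a hypothetical helper, not the paper's code:

```python
import math

def llada_loss(t, L, masked_probs):
    """Single-sample estimate of the LLaDA objective:
    -1/(t*L) * sum of log-probs over the masked positions."""
    return -sum(math.log(p) for p in masked_probs) / (t * L)

# t = 0.5, L = 6, model probabilities at the 3 masked positions
print(round(llada_loss(0.5, 6, [0.7, 0.9, 0.6]), 3))  # → 0.324
```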

Connection to BERT: LLaDA's training is essentially BERT with a variable masking rate (t ~ Uniform) instead of BERT's fixed 15%. By training across ALL masking rates, the model learns to handle any degree of partial information, from nearly complete to fully masked.

5. Generation: Iterative Unmasking

At inference, LLaDA starts from a fully masked sequence and iteratively unmasks tokens over T steps:

Generation algorithm

\text{For } s = T, T{-}1, \ldots, 1: \quad x_{t_{s-1}} = \text{Unmask}(x_{t_s}, p_\theta, n_s)

  • T: total number of denoising steps (e.g. 10, 50, 100)
  • t_s: noise level at step s; t_T = 1, t_0 = 0, uniformly spaced
  • n_s: number of tokens to unmask at this step
  • Unmask(·): 1) the model predicts all masked tokens; 2) keep the top n_s by confidence; 3) re-mask the rest

Tokens to unmask per step

n_s = \lfloor (t_s - t_{s-1}) \cdot L \rfloor
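
The loop can be sketched as follows. This is an illustration, not the paper's implementation: `model` is a stand-in that returns a (token, confidence) pair per position, and the step sizes use an integer schedule so the per-step counts sum to exactly L:

```python
MASK = "[M]"

def generate(model, L, T):
    """Reverse process: start fully masked, then over T steps commit the
    most confident predictions and keep the rest masked."""
    seq = [MASK] * L
    for k in range(1, T + 1):
        # integer version of n_s = floor((t_s - t_{s-1}) * L): cumulative
        # unmasked count after step k is L*k // T, so the steps sum to L
        n_k = L * k // T - L * (k - 1) // T
        preds = model(seq)  # one parallel forward pass over all positions
        masked = sorted((i for i in range(L) if seq[i] == MASK),
                        key=lambda i: preds[i][1], reverse=True)
        for i in masked[:n_k]:  # commit the n_k most confident predictions
            seq[i] = preds[i][0]
    return seq

# Toy "model" that always predicts the target sentence with fixed confidences
target = "The cat sat on the mat".split()
conf = [0.9, 0.6, 0.8, 0.5, 0.7, 0.95]
toy_model = lambda seq: list(zip(target, conf))
print(generate(toy_model, L=6, T=3))  # → ['The', 'cat', 'sat', 'on', 'the', 'mat']
```

With T = 3 and L = 6, each step unmasks 2 tokens in decreasing order of confidence; T = 1 recovers single-step parallel decoding.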

6. Architecture Details

LLaDA uses a standard Transformer architecture (same as LLaMA), with one crucial difference:

| Component             | LLaMA (AR)             | LLaDA (Diffusion)            |
|-----------------------|------------------------|------------------------------|
| Attention             | Causal (left-to-right) | Bidirectional (full)         |
| Positional encoding   | RoPE                   | RoPE                         |
| Normalization         | RMSNorm                | RMSNorm                      |
| Activation            | SwiGLU                 | SwiGLU                       |
| Timestep conditioning | N/A                    | Implicit (via masking ratio) |

Why no causal mask? AR models use causal masking because they generate left-to-right and shouldn't see future tokens. LLaDA generates all positions simultaneously, so every token needs to see every other token (including the [MASK] tokens) to make coordinated predictions. This is the same as BERT-style bidirectional attention.
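
The difference is just the attention mask. A toy sketch with boolean masks, where entry [i][j] = True means position i may attend to position j (illustrative helper names, not any library's API):

```python
def causal_mask(L):
    """AR-style: position i attends only to positions j <= i."""
    return [[j <= i for j in range(L)] for i in range(L)]

def bidirectional_mask(L):
    """LLaDA/BERT-style: every position attends to every position."""
    return [[True] * L for _ in range(L)]

print(causal_mask(3))         # lower-triangular pattern
print(bidirectional_mask(3))  # all True
```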

7. Supervised Fine-tuning (SFT)

For instruction following, LLaDA only masks the response tokens (not the prompt) during fine-tuning:

SFT: mask response only

\mathcal{L}_{\text{SFT}} = -\mathbb{E}_t \left[ \frac{1}{t \cdot L_r} \sum_{i \in \text{response}} \mathbf{1}[x_t^i = \texttt{[M]}] \cdot \log p_\theta(x_0^i \mid x_{\text{prompt}}, x_{t,\text{response}}) \right]

Here L_r is the response length; the prompt stays clean and serves purely as conditioning.
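
In code, the only change from pre-training is where the noise is applied. A minimal sketch (hypothetical `sft_mask` helper; the prompt passes through untouched):

```python
import random

MASK = "[M]"

def sft_mask(prompt, response, t, seed=None):
    """SFT forward process: prompt tokens stay clean; only response
    tokens are masked, each independently with probability t."""
    rng = random.Random(seed)
    noisy = [MASK if rng.random() < t else tok for tok in response]
    return prompt + noisy

prompt = ["Translate", "to", "English:", "bonjour"]
response = ["hello"]
print(sft_mask(prompt, response, 1.0))  # prompt intact, response fully masked
```

The loss is then computed only on the masked response positions, exactly as in pre-training.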

8. Experiments & Results

LLaDA was trained at two scales: 1.1B and 8B parameters.

| Model  | Type      | Params | MMLU | ARC-C | HellaSwag |
|--------|-----------|--------|------|-------|-----------|
| LLaMA3 | AR        | 8B     | 65.3 | 53.7  | 82.1      |
| LLaDA  | Diffusion | 8B     | 67.0 | 55.6  | 79.8      |
| GPT-2  | AR        | 1.5B   | 32.4 | 33.3  | 71.3      |
| LLaDA  | Diffusion | 1.1B   | 38.9 | 38.2  | 62.1      |

Key takeaway: A diffusion LM can be competitive with AR models at scale. LLaDA 8B slightly outperforms LLaMA3 8B on MMLU and ARC-C, a strong signal that AR is not the only path to powerful LLMs.

9. Limitations & Future Work

  • Inference speed: Multiple denoising steps needed vs. AR's single pass per token. Mitigated by Fast DLLM's adaptive schedule.
  • No KV-cache: Bidirectional attention means standard KV-cache doesn't work. Block Diffusion addresses this with block-level caching.
  • RLHF unexplored: How to do reinforcement learning from human feedback with diffusion LMs is an open question.
  • Long-form generation: Performance on very long sequences (>4K tokens) not yet studied at scale.

10. Connections to Other Work

MDLM

Provides the rigorous continuous-time ELBO theory that underpins LLaDA's training objective. MDLM's loss is essentially the same as LLaDA's, derived from first principles.

Fast DLLM

Addresses LLaDA's main weakness (slow multi-step generation) with adaptive denoising schedules and importance sampling, reducing steps by 3-10x.

Block Diffusion

Combines AR and diffusion at the block level. Can be seen as a generalization of LLaDA where B=L is one extreme (full diffusion = LLaDA) and B=1 is the other (pure AR).

11. Additional Resources