TL;DR
LLaDA treats language modeling as a masked diffusion process: the forward process randomly masks tokens, and the reverse process learns to predict all masked tokens simultaneously. Unlike AR models that generate left-to-right, LLaDA can fill in tokens in any order. Scaled to 8B parameters, it matches LLaMA3 on many benchmarks.
1. Background & Motivation
Autoregressive (AR) LLMs like GPT generate tokens one-by-one, left to right. This is the dominant paradigm, but it has inherent limitations:
- Unidirectional: can only condition on left context, can't "look ahead"
- Sequential decoding: generating N tokens requires N sequential forward passes, so decoding can't be parallelized
- Error accumulation: mistakes in early tokens propagate to all later tokens
- Fixed generation order: always left-to-right, even when later tokens are more certain
Continuous diffusion models (like DALL·E 2 or Stable Diffusion for images) have shown that iterative denoising can be a powerful generative paradigm. But text is discrete: you can't add Gaussian noise to words. LLaDA's key insight: masking IS the natural noise for discrete tokens.
2. Forward Process: Progressive Masking
Given a clean sequence, the forward process independently masks each token with probability t:
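As a minimal sketch (on string tokens for readability; real implementations operate on integer token ids), the forward process is just independent Bernoulli masking:

```python
import random

MASK = "[MASK]"

def forward_mask(tokens, t, rng=random):
    """Forward (noising) process: independently replace each token
    with [MASK] with probability t. At t=0 nothing is masked; at t=1
    everything is."""
    return [MASK if rng.random() < t else tok for tok in tokens]
```

Because each position flips its own coin, the expected number of masked tokens is exactly t × L, matching the properties listed below.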
Key properties of this forward process:
- At t=0, the sequence is fully clean
- At t=1, every token is [MASK]
- Masking is independent across positions: each token decides on its own
- Expected number of masked tokens at time t: t Γ L
Take the sentence "The cat sat on the mat" (L=6 tokens). Each token independently decides whether to mask:
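Running the coin flips for this sentence, a toy sketch (the seed and t=0.5 are arbitrary choices for illustration):

```python
import random

MASK = "[MASK]"
tokens = "The cat sat on the mat".split()  # L = 6

rng = random.Random(0)  # arbitrary seed, for reproducibility only
t = 0.5                 # each token masked with probability 0.5
noisy = [MASK if rng.random() < t else tok for tok in tokens]
print(noisy)  # some subset of the six tokens becomes [MASK]
```

Every position either keeps its original token or becomes [MASK]; on average half the tokens are masked at t=0.5.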
3. Reverse Process: Learning to Unmask
The reverse process is a neural network that takes a partially masked sequence and predicts the original token at every masked position:
Key insight: The model predicts ALL masked tokens independently in a single forward pass. This means we don't need to generate tokens one by one like in AR models.
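A toy sketch of one reverse step, where `predict_fn` stands in for the bidirectional Transformer (it sees the whole partially masked sequence and returns a predicted token per position):

```python
MASK = "[MASK]"

def reverse_predict(predict_fn, x_t):
    """One reverse step: fill every [MASK] with the model's prediction.
    All masked positions are filled from a single call to predict_fn,
    i.e. a single forward pass; unmasked tokens are left untouched."""
    preds = predict_fn(x_t)  # one "forward pass" over the full sequence
    return [p if tok == MASK else tok for tok, p in zip(x_t, preds)]
```

The real model outputs a distribution over the vocabulary per position; this sketch assumes `predict_fn` already returns its top token.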
4. Training Objective
Sentence: "The cat sat on the mat" (L=6). Suppose the sampled masking ratio is t ≈ 1/3 and positions 4 and 5 ("on", "the") happen to be masked. The model predicts each masked token from the noisy sequence x_t:
pos 4: p(on|x_t) = 0.9 → log(0.9) = -0.105
pos 5: p(the|x_t) = 0.6 → log(0.6) = -0.511
The loss sums the negative log-probabilities over masked positions and scales the sum by 1/t, so that every masking ratio contributes comparably in expectation.
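Completing the arithmetic for this example (taking t = 2/6, the realized mask fraction, purely for illustration):

```python
import math

# Worked numbers from the example above: positions 4 and 5 are masked.
log_probs = [math.log(0.9), math.log(0.6)]  # log p(on|x_t), log p(the|x_t)
t = 2 / 6                                   # masking ratio: 2 of L=6 tokens masked

# LLaDA's objective: negative log-likelihood over masked positions, scaled by 1/t.
loss = -sum(log_probs) / t
print(round(loss, 3))  # ≈ 1.849
```

The 1/t factor matters: at small t few tokens are masked, so each surviving prediction term is up-weighted to keep the estimator unbiased across masking ratios.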
Connection to BERT: LLaDA's training is essentially BERT with a variable masking rate (t ~ Uniform(0, 1)) instead of BERT's fixed 15%. By training across ALL masking rates, the model learns to handle any degree of partial information, from nearly complete to fully masked.
5. Generation: Iterative Unmasking
At inference, LLaDA starts from a fully masked sequence and iteratively unmasks tokens over T steps:
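A toy sketch of the sampling loop, assuming a low-confidence remasking strategy (one of the remasking schemes LLaDA describes): each step predicts every masked position, commits only the most confident predictions, and re-masks the rest. Here `predict_fn` stands in for the model and returns a `(token, confidence)` pair per position:

```python
MASK = "[MASK]"

def generate(predict_fn, length, steps):
    """Iteratively unmask a fully masked sequence over `steps` steps,
    committing the most confident predictions first."""
    seq = [MASK] * length
    for step in range(steps):
        masked = [i for i, tok in enumerate(seq) if tok == MASK]
        if not masked:
            break
        preds = predict_fn(seq)  # (token, confidence) for every position
        # Commit roughly an equal share of the remaining masks per step.
        k = max(1, len(masked) // (steps - step))
        best = sorted(masked, key=lambda i: preds[i][1], reverse=True)[:k]
        for i in best:
            seq[i] = preds[i][0]
    return seq
```

With T = L steps and one commit per step this degenerates to a fixed-order AR-like sampler; with T = 1 it is a single parallel fill of all positions.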
6. Architecture Details
LLaDA uses a standard Transformer architecture (same as LLaMA), with one crucial difference:
| Component | LLaMA (AR) | LLaDA (Diffusion) |
|---|---|---|
| Attention | Causal (left-to-right) | Bidirectional (full) |
| Positional Encoding | RoPE | RoPE |
| Normalization | RMSNorm | RMSNorm |
| Activation | SwiGLU | SwiGLU |
| Timestep conditioning | N/A | Implicit (via masking ratio) |
Why no causal mask? AR models use causal masking because they generate left-to-right and shouldn't see future tokens. LLaDA generates all positions simultaneously, so every token needs to see every other token (including the [MASK] tokens) to make coordinated predictions. This is the same as BERT-style bidirectional attention.
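The difference between the two attention patterns can be sketched as boolean masks (`mask[i][j]` is True when position i may attend to position j):

```python
def attention_mask(L, causal):
    """Causal: position i sees only j <= i (AR, LLaMA-style).
    Bidirectional: every position sees every position (LLaDA/BERT-style)."""
    return [[(j <= i) if causal else True for j in range(L)] for i in range(L)]
```

The causal variant is lower-triangular; the bidirectional variant is all-True, which is exactly why standard per-token KV-caching no longer applies (see the limitations below).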
7. Supervised Fine-tuning (SFT)
For instruction following, LLaDA only masks the response tokens (not the prompt) during fine-tuning:
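A minimal sketch of the SFT noising step, assuming the prompt/response split is known from the chat template: noise is applied only to response tokens, so the model learns to reconstruct the response conditioned on an always-clean prompt.

```python
import random

MASK = "[MASK]"

def sft_mask(prompt, response, t, rng=random):
    """SFT forward process: mask each response token with probability t;
    prompt tokens are never masked."""
    noisy_response = [MASK if rng.random() < t else tok for tok in response]
    return prompt + noisy_response
```

The loss is then computed only over the masked response positions, exactly as in pre-training.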
8. Experiments & Results
LLaDA was trained at two scales: 1.1B and 8B parameters.
| Model | Type | Params | MMLU | ARC-C | HellaSwag |
|---|---|---|---|---|---|
| LLaMA3 | AR | 8B | 65.3 | 53.7 | 82.1 |
| LLaDA | Diffusion | 8B | 67.0 | 55.6 | 79.8 |
| GPT-2 | AR | 1.5B | 32.4 | 33.3 | 71.3 |
| LLaDA | Diffusion | 1.1B | 38.9 | 38.2 | 62.1 |
Key takeaway: A diffusion LM can be competitive with AR models at scale. LLaDA 8B slightly outperforms LLaMA3 8B on MMLU and ARC-C, a strong signal that AR is not the only path to powerful LLMs.
9. Limitations & Future Work
- Inference speed: Multiple denoising steps needed vs. AR's single pass per token. Mitigated by Fast-dLLM's adaptive schedule.
- No KV-cache: Bidirectional attention means standard KV-cache doesn't work. Block Diffusion addresses this with block-level caching.
- RLHF unexplored: How to do reinforcement learning from human feedback with diffusion LMs is an open question.
- Long-form generation: Performance on very long sequences (>4K tokens) not yet studied at scale.
10. Connections to Other Work
- MDLM: Provides the rigorous continuous-time ELBO theory that underpins LLaDA's training objective. MDLM's loss is essentially the same as LLaDA's, derived from first principles.
- Fast-dLLM: Addresses LLaDA's main weakness (slow multi-step generation) with adaptive denoising schedules and importance sampling. Reduces steps by 3-10x.
- Block Diffusion: Combines AR and diffusion at the block level. Can be seen as a generalization of LLaDA where block size B=L is one extreme (full diffusion, i.e. LLaDA) and B=1 is the other (pure AR).