TL;DR
MDLM derives a clean, principled training objective for masked diffusion language models from first principles, starting from a continuous-time ELBO. It shows that a simple absorbing-state diffusion (tokens → [MASK]) with the right loss weighting achieves strong perplexity results, providing the theoretical foundation that LLaDA later scales up.
1. Background: Why We Need Better Theory
Discrete diffusion models (like D3PM) existed before MDLM, but they had issues:
- Complex ELBO: D3PM's loss has many terms that are hard to balance
- Auxiliary losses: needed extra loss terms to work well in practice
- Gap from continuous: continuous diffusion (for images) had cleaner theory; can we match it for discrete data?
MDLM answers: yes. By taking the continuous-time limit of the discrete ELBO, we get an elegant, simple loss.
2. The Forward Process: Absorbing Diffusion
MDLM uses an absorbing-state forward process: each token independently transitions to [MASK] (the absorbing state) at rate β(t).
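One standard way to write this process (a sketch consistent with the absorbing-state formulation; $\mathbf{m}$ denotes the one-hot vector for [MASK]):

$$
q(x_t \mid x_0) = \mathrm{Cat}\!\left(x_t;\; \alpha_t\, x_0 + (1-\alpha_t)\, \mathbf{m}\right),
\qquad
\alpha_t = \exp\!\left(-\int_0^t \beta(s)\, ds\right)
$$

Each token survives unmasked with probability $\alpha_t$; once a token is masked, it stays masked.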
This is more general than LLaDA's "mask with probability t". By choosing different β(t), we can have non-linear masking schedules. When β(t) = 1/(1-t), we recover LLaDA's linear schedule where α_t = 1-t.
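The forward process is easy to simulate. Below is a minimal NumPy sketch for the linear schedule α_t = 1-t (the `MASK_ID` value and shapes are illustrative assumptions, not from the paper):

```python
import numpy as np

MASK_ID = 0  # hypothetical token id for [MASK]

def alpha(t):
    """Survival probability alpha_t for the linear schedule beta(t) = 1/(1-t),
    which integrates to alpha_t = 1 - t."""
    return 1.0 - t

def forward_mask(x0, t, rng):
    """Absorbing forward process: each token independently stays unmasked
    with probability alpha_t, otherwise it jumps to [MASK]."""
    keep = rng.random(x0.shape) < alpha(t)
    return np.where(keep, x0, MASK_ID)

rng = np.random.default_rng(0)
x0 = np.arange(1, 11)               # a toy 10-token sequence (ids 1..10)
xt = forward_mask(x0, t=0.7, rng=rng)  # roughly 70% of tokens become [MASK]
```

At t = 0 nothing is masked and at t = 1 everything is, matching the boundary conditions of the schedule.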
3. The Continuous-Time ELBO
The key contribution: MDLM derives a clean ELBO in continuous time, starting from the standard variational bound and taking the number of diffusion steps to infinity.
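A sketch of the resulting bound, following MDLM's notation (with $z_t$ the partially masked sequence, $x_\theta$ the model's prediction of the clean token, and constants omitted):

$$
-\log p_\theta(x) \;\le\; \mathbb{E}_{q} \int_0^1 \frac{\alpha_t'}{1-\alpha_t} \sum_{\ell \,:\, z_t^\ell = \text{[MASK]}} \log \big\langle x_\theta^{\ell}(z_t),\, x^{\ell} \big\rangle \, dt
$$

Since $\alpha_t' \le 0$ and each $\log\langle\cdot,\cdot\rangle \le 0$, the right-hand side is nonnegative: the bound is a weighted cross-entropy over the masked positions only.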
Practical Training Loss
In practice, we can't compute the integral, so we sample t uniformly and use a single-sample Monte Carlo estimate.
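The estimator takes a form like the following (a sketch, with $\alpha_t$ the survival probability, $z_t$ the partially masked sequence, and $x_\theta$ the model's clean-token prediction):

$$
\mathcal{L} = \mathbb{E}_{t \sim \mathcal{U}[0,1]}\; \mathbb{E}_{q}\left[ \frac{\alpha_t'}{1-\alpha_t} \sum_{\ell \,:\, z_t^\ell = \text{[MASK]}} \log \big\langle x_\theta^{\ell}(z_t),\, x^{\ell} \big\rangle \right]
$$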
Connection to LLaDA: With the linear schedule β(t) = 1/(1-t), we get α_t = 1-t, so the ELBO weight |α_t'|/(1-α_t) = 1/t; averaging over the L tokens of the sequence gives the 1/(t·L) normalization of LLaDA's loss. This is exactly LLaDA's training objective, and MDLM provides its theoretical justification.
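A minimal NumPy sketch of this loss under the linear schedule (function name, shapes, and the toy check are illustrative assumptions):

```python
import numpy as np

def mdlm_loss(logits, x0, masked, t):
    """Single-sample MDLM/LLaDA loss for the linear schedule alpha_t = 1 - t:
    cross-entropy on masked positions only, with the 1/(t*L) weighting.
    Shapes (a sketch): logits [L, V], x0 [L] int ids, masked [L] bool."""
    L = x0.shape[0]
    # log-softmax over the vocabulary dimension
    logp = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    nll = -logp[np.arange(L), x0]           # per-token negative log-likelihood
    return (masked * nll).sum() / (t * L)   # 1/(t*L) weighting from the ELBO

# Toy check: uniform logits over a 4-word vocab, both tokens masked, t = 0.5
loss = mdlm_loss(np.zeros((2, 4)), np.array([1, 3]), np.array([True, True]), t=0.5)
```

With uniform logits each masked token contributes log 4 of cross-entropy, so the weighted loss here is 2·log 4 / (0.5·2).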
4. Noise Schedule Design
MDLM explores different noise schedules β(t) and finds that the choice of schedule matters in practice.
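A schedule can be specified either via the rate β(t) or via the survival probability α_t, related by β(t) = -d/dt log α_t. The exact parameterizations MDLM tests are not reproduced here; the sketch below shows the linear schedule plus a cosine form common in the diffusion literature, and recovers β(t) numerically:

```python
import numpy as np

def alpha_linear(t):
    # linear schedule: alpha_t = 1 - t (recovers LLaDA's masking)
    return 1.0 - t

def alpha_cosine(t):
    # cosine schedule, a common alternative from the diffusion literature
    return np.cos(0.5 * np.pi * t)

def beta_of(alpha, t, eps=1e-6):
    # masking rate implied by a schedule: beta(t) = -d/dt log(alpha_t),
    # estimated here by a central finite difference
    return -(np.log(alpha(t + eps)) - np.log(alpha(t - eps))) / (2 * eps)
```

For the linear schedule this reproduces β(t) = 1/(1-t), e.g. β(0.5) = 2.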
5. Experiments & Results
| Model | Type | BPC (text8) | PPL (OpenWebText) |
|---|---|---|---|
| D3PM (absorbing) | Discrete Diff. | 1.45 | – |
| SEDD | Score Entropy | 1.39 | 32.1 |
| MDLM | Masked Diff. | 1.36 | 31.2 |
| GPT-2 (small) | AR | – | 29.1 |
Key result: MDLM outperforms all prior discrete diffusion models and nearly matches GPT-2 on OpenWebText. The gap to AR models is small enough to suggest that with more scale (which LLaDA later demonstrates), diffusion LMs can be fully competitive.
6. Limitations & Future Work
- Scale: Only tested up to ~110M parameters. LLaDA later proves it works at 8B.
- Generation quality: perplexity is good, but unconditional text samples can be incoherent; needs SFT/RLHF.
- Schedule sensitivity: performance depends on the noise schedule choice; it is not fully understood why log-linear works best.
7. Connections to Other Work
LLaDA: Scales up MDLM's framework to 8B parameters. Uses MDLM's training objective (with the linear schedule). Proves that masked diffusion works at LLM scale.
Optimizes MDLM's inference speed using the ELBO decomposition that MDLM derives. The per-step L_t values come directly from MDLM's theory.
Uses MDLM's within-block diffusion framework combined with AR between blocks.
The predecessor. D3PM introduced discrete diffusion with transition matrices. MDLM simplifies D3PM's approach by taking the continuous-time limit and focusing on the absorbing state.