MDLM: Simple and Effective Masked Diffusion Language Models

Sahoo et al. · NeurIPS 2024 · arXiv 2406.07524

TL;DR

MDLM derives a clean, principled training objective for masked diffusion language models from first principles, starting from a continuous-time ELBO. It shows that a simple absorbing-state diffusion (tokens → [MASK]) with the right loss weighting achieves strong perplexity results, providing the theoretical foundation that LLaDA later scales up.

Architecture Overview

  • Problem: no principled training objective for discrete diffusion. Prior work (D3PM) had complex objectives with many terms.
  • Insight: the continuous-time limit simplifies everything. Taking the ELBO in continuous time yields a clean closed-form loss.
  • Method: absorbing-state masked diffusion trained with the ELBO loss. The forward process sends tokens to [MASK] (absorbing state = masking); the loss is a weighted cross-entropy on masked positions, with an ELBO-derived weight (1/t under the linear schedule).
  • Result: strong perplexity from a simple implementation (~100 lines of core code), providing the theoretical foundation for LLaDA.

1. Background: Why We Need Better Theory

Discrete diffusion models (like D3PM) existed before MDLM, but they had issues:

  • Complex ELBO: D3PM's loss has many terms that are hard to balance
  • Auxiliary losses: needed extra loss terms to work well in practice
  • Gap from continuous: continuous diffusion (images) had cleaner theory; can we match it for discrete?

MDLM answers: yes. By taking the continuous-time limit of the discrete ELBO, we get an elegant, simple loss.

2. The Forward Process: Absorbing Diffusion

MDLM uses an absorbing-state forward process: each token independently transitions to [MASK] (the absorbing state) at rate β(t):

Forward transition probability

q(x_t = \texttt{[M]} \mid x_0) = 1 - e^{-\int_0^t \beta(s)\,ds} \triangleq 1 - \alpha_t

  • \beta(t): noise rate at time t; controls how fast tokens get masked
  • \alpha_t: survival probability, i.e. the probability that a token is still unmasked at time t
  • e^{-\int_0^t \beta(s)\,ds}: exponential decay (the same math as radioactive decay!)

This is more general than LLaDA's "mask with probability t". By choosing different β(t), we can have non-linear masking schedules. When β(t) = 1/(1-t), we recover LLaDA's linear schedule, where α_t = 1-t.
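As a concrete illustration, here is a minimal sketch (not the authors' code) of the absorbing forward process under the linear schedule; the token ids and `mask_id` are toy placeholders:

```python
import random

def alpha_linear(t):
    # Linear schedule: beta(t) = 1/(1-t)  =>  alpha_t = exp(-integral of beta) = 1 - t
    return 1.0 - t

def forward_mask(x0, t, mask_id, alpha=alpha_linear, rng=random):
    """Absorbing forward process: each token independently stays itself
    with probability alpha_t and transitions to [MASK] otherwise."""
    return [tok if rng.random() < alpha(t) else mask_id for tok in x0]

x0 = [17, 4, 92, 8, 55, 3]                 # toy token ids
xt = forward_mask(x0, t=0.7, mask_id=103)  # ~70% of tokens masked on average
```

Swapping in a different `alpha` gives any other masking schedule; only the survival probability changes.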

3. The Continuous-Time ELBO

The key contribution: MDLM derives a clean ELBO in continuous time. Starting from the standard variational bound:

Continuous-time ELBO

\log p(x_0) \geq -\,\mathbb{E}_{q}\left[\int_0^1 \underbrace{\frac{\beta(t)\,\alpha_t}{1-\alpha_t} \sum_{i:\, x_t^i = \texttt{[M]}} \bigl(-\log p_\theta(x_0^i \mid x_t)\bigr)}_{L_t}\, dt\right]

  • \log p(x_0): log-likelihood of the data; what we want to maximize
  • \geq: Evidence Lower BOund; maximizing the right side pushes up \log p(x_0)
  • \int_0^1 \cdots\, dt: integrate over all timesteps, the continuous analogue of summing over T discrete steps
  • \frac{\beta(t)\,\alpha_t}{1-\alpha_t}: weighting factor, derived from the math rather than hand-tuned; this is what makes the objective "principled"
  • \sum_{i:\, x_t^i = \texttt{[M]}}: sum over masked positions only
  • -\log p_\theta(x_0^i \mid x_t): cross-entropy; how well the model predicts each masked token
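One way to see that the weighting factor is fully determined by the schedule: differentiating \alpha_t = e^{-\int_0^t \beta(s)\,ds} gives \alpha'_t = -\beta(t)\,\alpha_t, so the weight can be written either way:

```latex
\alpha'_t = -\beta(t)\,\alpha_t
\quad\Longrightarrow\quad
\frac{-\alpha'_t}{1-\alpha_t} = \frac{\beta(t)\,\alpha_t}{1-\alpha_t},
\qquad\text{e.g. } \beta(t) = \frac{1}{1-t}
\;\Rightarrow\; \alpha_t = 1-t
\;\Rightarrow\; \text{weight} = \frac{1}{t}.
```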

Practical Training Loss

In practice, we can't compute the integral exactly, so we sample t uniformly and use a single-sample Monte Carlo estimate:

MDLM training loss (Monte Carlo estimate)

\mathcal{L}_{\text{MDLM}} = -\,\mathbb{E}_{t \sim \mathcal{U}(0,1)}\left[\frac{\beta(t)\,\alpha_t}{1-\alpha_t} \sum_{i:\, x_t^i = \texttt{[M]}} \log p_\theta(x_0^i \mid x_t)\right]

Connection to LLaDA: with the linear schedule β(t) = 1/(1-t), we get α_t = 1-t, so the weight is β(t)α_t/(1-α_t) = (1/(1-t)) · (1-t)/t = 1/t. Normalized by sequence length L, this is exactly LLaDA's 1/t-weighted loss: MDLM provides the theoretical justification for LLaDA's training objective.
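A hedged sketch of one training step under the linear schedule (an illustration, not the released MDLM code; `model` stands for any network mapping token ids to per-position logits, and `mask_id`/`eps` are placeholder choices):

```python
import torch
import torch.nn.functional as F

def mdlm_loss(model, x0, mask_id, eps=1e-3):
    """Single Monte Carlo sample of the MDLM objective with
    beta(t) = 1/(1-t) and alpha_t = 1-t, so the ELBO weight is 1/t."""
    B, L = x0.shape
    t = eps + (1 - eps) * torch.rand(B, 1)   # t ~ U(eps, 1); eps avoids divide-by-zero at t=0
    masked = torch.rand(B, L) < t            # mask each token w.p. 1 - alpha_t = t
    xt = torch.where(masked, torch.full_like(x0, mask_id), x0)
    logits = model(xt)                       # (B, L, V)
    ce = F.cross_entropy(logits.transpose(1, 2), x0, reduction="none")  # (B, L)
    # Weighted cross-entropy on masked positions only, with weight 1/t
    return ((ce * masked) / t).sum() / masked.sum().clamp(min=1)
```

For a general schedule, the 1/t factor becomes β(t)α_t/(1-α_t) and the masking probability becomes 1-α_t; everything else is unchanged.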

[Interactive figure: ELBO decomposition across timesteps. Eight bars show the per-timestep loss contributions L_t, with ELBO = -Σ L_t over the steps; hovering reveals each bar's value (total ELBO loss shown: 1.790).]

4. Noise Schedule Design

MDLM explores different noise schedules β(t) and finds that the choice matters significantly:

Log-linear schedule (best performing)

\alpha_t = \frac{1 - t}{1 + (e^{10} - 1) \cdot t} \quad \Rightarrow \quad \text{slow start, fast end}

5. Experiments & Results

Model              Type             BPC (text8)   PPL (OpenWebText)
D3PM (absorbing)   Discrete Diff.   1.45          —
SEDD               Score Entropy    1.39          32.1
MDLM               Masked Diff.     1.36          31.2
GPT-2 (small)      AR               —             29.1

(text8 results are reported in bits per character, the standard metric for that character-level benchmark.)

Key result: MDLM outperforms all prior discrete diffusion models and nearly matches GPT-2 on OpenWebText. The gap to AR models is small enough to suggest that with more scale (which LLaDA later demonstrates), diffusion LMs can be fully competitive.

6. Limitations & Future Work

  • Scale: Only tested up to ~110M parameters. LLaDA later proves it works at 8B.
  • Generation quality: Perplexity is good but unconditional text samples can be incoherent; needs SFT/RLHF.
  • Schedule sensitivity: Performance depends on noise schedule choice; not fully understood why log-linear works best.

7. Connections to Other Work

LLaDA

Scales up MDLM's framework to 8B parameters. Uses MDLM's training objective (with linear schedule). Proves that masked diffusion works at LLM scale.

Fast DLLM

Optimizes MDLM's inference speed using the ELBO decomposition that MDLM derives. The per-step L_t values come directly from MDLM's theory.

Block Diffusion

Uses MDLM's within-block diffusion framework combined with AR between blocks.

D3PM (Austin et al. 2021)

The predecessor. D3PM introduced discrete diffusion with transition matrices. MDLM simplifies D3PM's approach by taking the continuous-time limit and focusing on the absorbing state.

8. Additional Resources