A Tutorial Introduction to the Minimum Description Length Principle

Peter Grünwald · 2004 · arXiv math/0406077

TL;DR

The Minimum Description Length (MDL) principle frames statistical learning as a data compression problem: the best model is the one that compresses the data most. By formalizing Occam's razor as code-length minimization, MDL provides a rigorous, prior-free foundation for model selection that bridges information theory, statistics, and Kolmogorov complexity.

1. Learning as Compression

MDL rests on a single foundational insight: regularity in data enables compression. If a dataset has no structure (if it is pure noise), then no model can describe it more concisely than listing every observation. But if there is genuine pattern, a compact model can describe the pattern plus the residual errors in fewer bits than the raw data.

This correspondence between compression and learning is not a metaphor; it is exact. Every probability model implicitly defines a code via Shannon's source coding theorem, and every code implicitly defines a probability model. Learning a good model and finding a short description are the same task.

The MDL Slogan

"The best model for a dataset is the one that leads to the shortest description of the data." This description must include a description of the model itself; complexity is not free, and simpler models simply cost fewer bits to state.

The Kraft inequality makes this precise: the code lengths $L(s_1), L(s_2), \ldots$ of any prefix-free code must satisfy:

$$\sum_{s} 2^{-L(s)} \leq 1$$

Conversely, any set of lengths satisfying this inequality corresponds to a valid prefix-free code. The bijection between valid codes and sub-probability distributions means that assigning shorter codes to more probable sequences is always consistent.
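As a sanity check, a Kraft-satisfying list of lengths really can be turned into a concrete prefix-free code. Below is a minimal sketch using the standard canonical-code construction (the construction and example lengths are illustrative, not taken from the paper):

```python
import math

def kraft_sum(lengths):
    """Kraft sum: a prefix-free code with these lengths exists iff this is <= 1."""
    return sum(2.0 ** -L for L in lengths)

def canonical_prefix_code(lengths):
    """Build a prefix-free code from Kraft-satisfying integer lengths."""
    assert kraft_sum(lengths) <= 1 + 1e-12, "Kraft inequality violated"
    # Assign codewords in order of increasing length (canonical construction):
    # each codeword is the previous one plus one, left-shifted to the new length.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    code, next_val, prev_len = {}, 0, 0
    for i in order:
        next_val <<= (lengths[i] - prev_len)   # pad to the new, longer length
        code[i] = format(next_val, f"0{lengths[i]}b")
        next_val += 1
        prev_len = lengths[i]
    return code

lengths = [1, 2, 3, 3]    # Kraft sum = 1/2 + 1/4 + 1/8 + 1/8 = 1 exactly
codes = canonical_prefix_code(lengths)
```

Because no codeword is a prefix of another, concatenated codewords decode unambiguously, which is exactly why prefix-free lengths behave like (sub-)probabilities.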

2. Shannon Codes and Entropy

If data source X has distribution p, the optimal code assigns length $L(x) = -\log_2 p(x)$ bits to outcome x. Shannon's noiseless coding theorem proves this is tight: no code can do better on average than the Shannon entropy.

$$\mathbb{E}[L(X)] \geq H(X) = -\sum_x p(x) \log_2 p(x)$$

The Shannon code achieves expected length less than H(X) + 1 bits (the extra bit comes from rounding to integer code lengths). The entropy H(X) is therefore the fundamental limit of lossless compression for source p.
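Both the entropy bound and the integer-rounded Shannon code lengths are easy to compute directly. A small illustration (the example distribution is made up; since its probabilities are powers of two, the Shannon code is exactly optimal here):

```python
import math

def entropy_bits(p):
    """Shannon entropy H(X) in bits for a probability dict."""
    return -sum(q * math.log2(q) for q in p.values() if q > 0)

def shannon_code_lengths(p):
    """Integer Shannon code lengths ceil(-log2 p(x)); Kraft-valid by construction."""
    return {x: math.ceil(-math.log2(q)) for x, q in p.items() if q > 0}

p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
H = entropy_bits(p)                   # 1.75 bits
L = shannon_code_lengths(p)           # {'a': 1, 'b': 2, 'c': 3, 'd': 3}
avg = sum(p[x] * L[x] for x in p)     # expected code length, H <= avg < H + 1
```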

When the true distribution is unknown (the typical case), we must use a model. If model M assigns probability $p_M(x^n)$ to data sequence $x^n$, the code length is:

$$L_M(x^n) = -\log_2 p_M(x^n)$$

Model selection in MDL reduces to: which model M gives the shortest code length for the observed data? But we must add the cost of describing M itself.

3. Two-Part MDL

The simplest and most interpretable formulation of MDL is the two-part (or crude) code. The total description of data D given a hypothesis class requires two parts:

  1. Model description: the code length L(H) to communicate which hypothesis H was selected
  2. Data-given-model description: the code length L(D | H) to communicate the data assuming the receiver knows H

The two-part MDL criterion selects the hypothesis that minimizes the total:

$$\bar{L}(H_n, D^n) = L(H_n) + L(D^n \mid H_n)$$

Here, H_n is a hypothesis from a parametric family fit to n data points, and D^n is the observed sequence of length n. The first term penalizes model complexity; the second penalizes poor fit. Their sum is minimized at the sweet spot between underfitting and overfitting.

For parametric models with k real-valued parameters, the model description cost scales roughly as $\frac{k}{2} \log_2 n$ bits (since parameters need to be described to precision ~1/√n for n observations). This is exactly the penalty term in the Bayesian Information Criterion (BIC), revealing a deep connection between MDL and BIC.
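The MDL–BIC connection can be verified numerically: converting a two-part score from bits to the BIC's nats scale (a factor of 2 ln 2) reproduces the BIC formula exactly. A sketch, with a hypothetical likelihood value:

```python
import math

def two_part_score_bits(data_bits, k, n):
    """Two-part MDL score: data code length in bits plus ~(k/2) log2 n bits
    to state k parameters at precision 1/sqrt(n)."""
    return data_bits + (k / 2) * math.log2(n)

# Hypothetical fit: negative log-likelihood of 120 nats, k = 3 parameters, n = 100.
nll_nats = 120.0
k, n = 3, 100
mdl_bits = two_part_score_bits(nll_nats / math.log(2), k, n)
bic = 2 * nll_nats + k * math.log(n)   # standard BIC: -2 ln L + k ln n
# BIC = 2 ln(2) * (MDL score in bits): the same criterion in different units.
```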

4. Concrete Example: Polynomial Model Selection

Suppose we observe n data points (x_i, y_i) and want to decide whether to fit a degree-1, degree-2, or degree-3 polynomial. Let k denote the number of parameters (k = 2, 3, 4 respectively) and let RSS_k denote the residual sum of squares under the best-fit degree-k polynomial.

Under a Gaussian noise model with variance σ², the negative log-likelihood of the data given the model is:

$$L(D^n \mid H_k) = \frac{n}{2}\log(2\pi\sigma^2) + \frac{\text{RSS}_k}{2\sigma^2}$$

Adding the model description cost, the total two-part code length for each degree is:

| Degree | Params (k) | Model cost | Data cost | Total |
| --- | --- | --- | --- | --- |
| 1 | 2 | $\log_2 n$ | $\text{RSS}_1 / (2\sigma^2)$ | (large RSS) |
| 2 | 3 | $\frac{3}{2}\log_2 n$ | $\text{RSS}_2 / (2\sigma^2)$ | Shortest: select! |
| 3 | 4 | $2\log_2 n$ | $\text{RSS}_3 / (2\sigma^2)$ | (smaller RSS but bigger model cost) |

(The constant term $\frac{n}{2}\log(2\pi\sigma^2)$ is omitted from the data cost, since it is identical across models.)

The degree-2 polynomial wins when its reduction in data cost exceeds the extra model description cost relative to degree-1, and its extra model cost exceeds the reduction in data cost relative to degree-3. MDL selects the model at the compression sweet spot, not the best-fitting model.
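The whole procedure can be simulated end to end. The sketch below generates noisy data from a hypothetical quadratic, fits degrees 1 through 3 by ordinary least squares (normal equations), and compares total two-part code lengths using the formulas above; for simplicity σ² is assumed known, which the original example does not require:

```python
import math
import random

def polyfit_rss(xs, ys, degree):
    """Least-squares polynomial fit via normal equations; returns the RSS."""
    m = degree + 1
    # Normal equations A c = b with A[i][j] = sum x^(i+j), b[i] = sum y * x^i.
    A = [[sum(x ** (i + j) for x in xs) for j in range(m)] for i in range(m)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(m)]
    # Gaussian elimination with partial pivoting.
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, m):
            f = A[r][col] / A[col][col]
            for c in range(col, m):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    coeffs = [0.0] * m
    for i in reversed(range(m)):
        coeffs[i] = (b[i] - sum(A[i][j] * coeffs[j] for j in range(i + 1, m))) / A[i][i]
    return sum((y - sum(c * x ** i for i, c in enumerate(coeffs))) ** 2
               for x, y in zip(xs, ys))

def total_code_length(n, k, rss, sigma2):
    """Two-part length in bits: Gaussian data cost plus (k/2) log2 n model cost."""
    data_bits = (n / 2) * math.log2(2 * math.pi * sigma2) \
                + rss / (2 * sigma2 * math.log(2))      # nats -> bits for the RSS term
    return (k / 2) * math.log2(n) + data_bits

random.seed(0)
sigma2 = 0.25
xs = [i / 10 for i in range(-20, 21)]                   # n = 41 points
ys = [1.0 + 2.0 * x - 1.5 * x ** 2 + random.gauss(0, math.sqrt(sigma2)) for x in xs]
n = len(xs)
scores = {d: total_code_length(n, d + 1, polyfit_rss(xs, ys, d), sigma2)
          for d in (1, 2, 3)}
best = min(scores, key=scores.get)   # typically degree 2: the compression sweet spot
```

Degree 1 pays a huge data cost for missing the curvature; degree 3's small RSS gain rarely covers its extra half-log-n of model cost.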

5. Refined MDL and Normalized Maximum Likelihood

Two-part MDL requires choosing a separate code for the model, and that choice is somewhat arbitrary. Refined MDL eliminates this arbitrariness by using a single-part code that is minimax optimal over all data sequences in the model class.

The key concept is the Normalized Maximum Likelihood (NML) distribution. For a parametric model class M with data sequence x^n, the NML distribution is defined as:

$$p_{\text{NML}}(x^n) = \frac{p(x^n;\, \hat{\theta}(x^n))}{\sum_{y^n} p(y^n;\, \hat{\theta}(y^n))}$$

where $\hat{\theta}(x^n)$ is the maximum likelihood estimate for data $x^n$, and the denominator (the parametric complexity) sums over all possible data sequences of length n. The NML assigns each sequence a probability proportional to how well the model class can explain it.

The stochastic complexity of x^n under model class M is the NML code length:

$$SC(x^n) = -\log_2 p_{\text{NML}}(x^n) = -\log_2 p(x^n;\hat{\theta}(x^n)) + \log_2 \sum_{y^n} p(y^n;\hat{\theta}(y^n))$$

The stochastic complexity decomposes into the maximum likelihood code length plus the log of the parametric complexity (the model class overhead). Model selection with refined MDL chooses M to minimize SC(x^n).
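For a concrete model class the parametric complexity is a finite sum. A sketch for the Bernoulli class (not an example from the paper), where sequences can be grouped by their count of ones, so the exponential sum collapses to n + 1 terms:

```python
import math

def bernoulli_ml(k, n):
    """Maximized likelihood of a length-n binary sequence with k ones:
    (k/n)^k * ((n-k)/n)^(n-k), with the convention 0^0 = 1."""
    th = k / n
    return (th ** k) * ((1 - th) ** (n - k))

def parametric_complexity(n):
    """Sum of maximized likelihoods over all 2^n binary sequences,
    grouped by the number of ones (binomial coefficient per group)."""
    return sum(math.comb(n, k) * bernoulli_ml(k, n) for k in range(n + 1))

def stochastic_complexity_bits(seq):
    """NML code length: -log2 p(x^n; theta_hat) + log2 COMP(n)."""
    n, k = len(seq), sum(seq)
    return -math.log2(bernoulli_ml(k, n)) + math.log2(parametric_complexity(n))

seq_ones = [1] * 10
sc_ones = stochastic_complexity_bits(seq_ones)  # data term is 0: SC = log2 COMP(10)
```

Note that even the best-case sequence (all ones, which the model fits perfectly) pays the full log-complexity overhead: that overhead is the price of the model class's flexibility.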

6. Kolmogorov Complexity Connection

MDL has a deep theoretical cousin in Kolmogorov complexity, the central quantity of algorithmic information theory. The Kolmogorov complexity K(x) of a string x is the length of the shortest computer program (on a universal Turing machine) that outputs x and halts:

$$K(x) = \min_{p\, :\, U(p) = x} |p|$$

where U is a universal Turing machine and |p| is the length of program p in bits. K(x) represents the ultimate compression limit: the shortest possible description of x across all computable descriptions.

The connection to MDL is precise: MDL can be viewed as the computable, statistical approximation to Kolmogorov complexity. While K(x) is uncomputable (no algorithm can compute it for all x), MDL restricts the search to a parametric model class and finds the shortest description within that class.
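A practical stand-in for this idea is any off-the-shelf compressor: its output length is a computable upper bound on a string's description length, relative to that compressor rather than a true K(x). A toy illustration using zlib (the example strings are made up):

```python
import random
import zlib

def description_length_upper_bound(s: bytes) -> int:
    """Bytes of zlib-compressed output: a computable upper bound on the
    description length of s (relative to deflate; K(x) itself is uncomputable)."""
    return len(zlib.compress(s, 9))

structured = b"ab" * 500    # highly regular: compresses to a handful of bytes
random.seed(1)
noise = bytes(random.randrange(256) for _ in range(1000))  # no structure to exploit
short_desc = description_length_upper_bound(structured)    # far below 1000
long_desc = description_length_upper_bound(noise)          # roughly 1000 or more
```

The regular string compresses dramatically while the random one does not, mirroring the opening claim that only structured data admits a short description.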

Kolmogorov Complexity K(x)

  • Uncomputable
  • Universal (all programs)
  • Absolute optimal compression
  • Model-free

MDL (Stochastic Complexity)

  • Computable
  • Restricted to model class M
  • Class-optimal compression
  • Requires model specification

The conditional Kolmogorov complexity K(x | y), the shortest program that outputs x given y as input, is the algorithmic analogue of MDL's two-part code. When y is the model and x is the data, K(x | y) plays the role of L(D | H).

7. Why Simpler Models Generalize Better

MDL provides a rigorous formalization of Occam's razor: simpler models generalize better because they are better compressors of genuinely structured data. This is not a philosophical preference; it follows from the mathematics of coding.

A sufficiently rich model class can fit any data sequence perfectly (by memorizing it), but must pay for this flexibility in model description cost. A model that achieves a short total description length must genuinely compress; it cannot be memorizing noise.

Formally, a model selected by MDL satisfies: the data is compressible under the model. This is equivalent to: the model has captured the regularities in the data. Regularities that generalize across samples will compress future samples too.

8. MDL in Modern ML

MDL's influence permeates modern machine learning, often under different names. Understanding these connections reveals the unified information-theoretic structure beneath diverse regularization techniques.

MDL ↔ Regularization

L2 regularization (weight decay) can be derived as MDL with a Gaussian prior on weights: the regularization coefficient sets the model description cost. L1 (Lasso) corresponds to a Laplace prior. Both penalize model complexity, exactly as the MDL model description term does.
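The Gaussian-prior reading of weight decay is a one-line calculation. Assuming weights $w \in \mathbb{R}^d$ with prior $\mathcal{N}(0, \tau^2 I)$ (the scale $\tau$ is an illustrative choice, not from the paper), the model description cost is:

```latex
-\ln p(w) \;=\; \frac{\lVert w \rVert^2}{2\tau^2} \;+\; \frac{d}{2}\ln(2\pi\tau^2)
```

Since the second term does not depend on $w$, minimizing $-\ln p(D \mid w) - \ln p(w)$ is exactly L2-regularized fitting with $\lambda = 1/(2\tau^2)$: a tighter prior (smaller $\tau$) means a higher description cost per unit of weight norm, hence stronger shrinkage.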

MDL ↔ Variational Inference

The negative ELBO in variational autoencoders decomposes as $\mathbb{E}_q[-\log p(x \mid z)] + D_{\mathrm{KL}}(q(z \mid x)\,\|\,p(z))$. The KL term is the description length of the latent code z under the prior, and the reconstruction term is the description length of x given z. VAE training minimizes a two-part MDL code.
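The two-part reading of the negative ELBO can be made concrete with a tiny discrete latent variable. All the numbers below are hypothetical, chosen only to show the bits-accounting:

```python
import math

def kl_bits(q, p):
    """KL(q || p) in bits: the extra description length for coding z with
    prior p when z is actually drawn from the encoder distribution q."""
    return sum(qi * math.log2(qi / pi) for qi, pi in zip(q, p) if qi > 0)

# Toy discrete "VAE" with 4 latent states, for a single observation x.
prior = [0.25, 0.25, 0.25, 0.25]      # p(z)
posterior = [0.7, 0.1, 0.1, 0.1]      # q(z | x), hypothetical encoder output
log_lik = [-1.0, -5.0, -5.0, -5.0]    # log2 p(x | z) per state, hypothetical

recon_bits = -sum(qi * l for qi, l in zip(posterior, log_lik))  # L(D | H) analogue
latent_bits = kl_bits(posterior, prior)                         # L(H) analogue
neg_elbo_bits = recon_bits + latent_bits    # total two-part description length
```

Shrinking the KL term (a latent code closer to the prior) trades off against reconstruction quality, exactly the model-cost-versus-fit tradeoff of two-part MDL.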

MDL ↔ Early Stopping

The number of gradient descent steps is itself a complexity parameter: more steps means a more complex effective model. Early stopping caps this complexity, implicitly optimizing an MDL-like tradeoff between fit and complexity measured by training steps.

MDL ↔ Minimum Message Length (MML)

Developed independently by Wallace and Boulton (1968), MML is MDL's Australian cousin with a Bayesian twist. MML encodes both the parameter estimate and the data under that estimate, accounting for the precision with which parameters need to be stated. The two frameworks give identical results in many cases.

Perhaps most strikingly, the double descent phenomenon in modern overparameterized networks (where test loss decreases again after the interpolation threshold) can be partially understood through MDL: very large models can fit the training data exactly by interpolation, but the specific interpolant induced by gradient descent on neural architectures turns out to generalize surprisingly well.

Key Takeaway: Compression = Understanding

MDL gives a principled answer to the question "what does it mean for a model to have learned something?" A model has learned when it can compress: when it describes the data more concisely than any simpler model. The bits saved are exactly the regularities captured. This makes MDL not just a model selection technique but a theory of statistical learning itself.