Keeping Neural Networks Simple by Minimizing the Description Length of the Weights

Hinton & van Camp · COLT 1993 · Toronto PDF

TL;DR

Hinton and van Camp propose training neural networks by minimizing the total description length of the weights plus the data given the weights. This is Occam's Razor formalized as information theory. Adding Gaussian noise to the weights during training forces the network to encode them cheaply — penalizing large, precise weights in favor of small, uncertain ones. This 1993 paper quietly derives weight decay, dropout-style noise injection, KL regularization, and variational inference for neural networks, decades before those ideas became mainstream.

1. Occam's Razor as Information Theory

Occam's Razor says: prefer simpler explanations. But "simpler" is vague. Information theory makes it precise. If you need to communicate a model and the residual errors to a receiver, the best model is the one that minimizes the total number of bits sent.

This framing — model selection as communication cost — is the Minimum Description Length (MDL) principle, developed by Rissanen and rooted in Kolmogorov complexity. Hinton and van Camp apply it directly to neural network weights.

Imagine two models: Model A fits the training data perfectly but needs 1000 bits to describe its weights. Model B fits the data almost as well but only needs 50 bits for weights (plus a few more for the errors). MDL says Model B is better — it has found a more compact explanation of the data. Complex models that memorize noise need many bits to store every wiggle; simple models that capture true structure need few.

2. The MDL Principle

Formally, given weights w and dataset D, we want to minimize:

MDL objective
$$\min_w \; \big[\, L(w) + L(D \mid w) \,\big]$$

where:

  • $L(w)$ — the number of bits needed to describe the weights
  • $L(D \mid w)$ — the number of bits needed to describe the data given those weights (i.e., the errors)

This is the two-part MDL code: first send the model, then send the data compressed with that model. The total cost trades off model complexity against goodness of fit.

3. Applying MDL to Neural Weights

How do you encode a real-valued weight w on a finite grid? If you transmit w to precision σ (i.e., you round to the nearest multiple of σ), then the number of bits needed is approximately:

Encoding cost of one weight
$$L(w_i) \approx \log_2\!\left(\frac{|w_i|}{\sigma} + 1\right) \text{ bits}$$

This makes intuitive sense: a weight of 0.001 with noise σ = 0.01 essentially encodes as zero — nearly free. A weight of 100 with σ = 0.01 needs many bits. The noise level σ determines the precision, and precision costs bits.
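The cost formula above is easy to check numerically. A minimal sketch (the function name is ours, not the paper's):

```python
import math

def encoding_bits(w, sigma):
    """Approximate bits to encode one weight w at precision sigma,
    per the paper's L(w_i) ≈ log2(|w_i|/sigma + 1)."""
    return math.log2(abs(w) / sigma + 1.0)

# A tiny weight at coarse precision is nearly free...
print(encoding_bits(0.001, 0.01))   # ≈ 0.14 bits
# ...while a large weight at the same precision is expensive.
print(encoding_bits(100.0, 0.01))   # ≈ 13.3 bits
```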

The key move: Hinton and van Camp propose adding Gaussian noise ε ~ N(0, σ²) to each weight during the forward pass. This has two effects:

  • Noisy weights are cheaper to encode (you only need to encode them to the precision of σ)
  • Noisy weights hurt predictions — the network is forced to find weights that are both small and robust
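The noise-injection step can be sketched in a few lines (a toy illustration, not the paper's implementation; seeding and the example values are our assumptions):

```python
import random

def sample_noisy_weights(weights, sigmas, seed=0):
    """Draw w~_i = w_i + eps_i with eps_i ~ N(0, sigma_i^2) for one
    noisy forward pass, as Hinton & van Camp propose."""
    rng = random.Random(seed)
    return [w + rng.gauss(0.0, s) for w, s in zip(weights, sigmas)]

# Per-weight noise levels: a large sigma means a cheap, imprecise weight.
print(sample_noisy_weights([0.5, -1.2, 3.0], [0.1, 0.1, 0.01]))
```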

4. Weight Noise as Compression

Formally, the noisy weight used in the forward pass is:

Noisy weight
$$\tilde{w}_i = w_i + \varepsilon_i, \quad \varepsilon_i \sim \mathcal{N}(0,\, \sigma_i^2)$$

Both the mean w_i and the noise level σ_i are learned parameters. The total MDL training objective becomes:

MDL training objective
$$\mathcal{L} = \underbrace{-\mathbb{E}_{\varepsilon}\big[\log p(D \mid \tilde{w})\big]}_{\text{fit the data}} + \underbrace{\sum_i \log_2\!\left(\frac{|w_i|}{\sigma_i} + 1\right)}_{\text{cost of encoding weights}}
$$

During training, the network simultaneously optimizes prediction quality and compression of the weights. A weight that can be set to zero (or driven small relative to σ) is essentially pruned for free.

5. The Bits-Back Argument

Here is the deepest insight of the paper. Naive two-part coding — send weights, then send data — seems to require L(w) + L(D|w) bits. But if the weights have uncertainty (a posterior distribution q(w|D)), the sender can use that uncertainty to communicate for free.

The "bits-back" trick: the sender samples a weight w from the posterior q(w|D) to encode the data. The random choice of w itself conveys information — on average H(q(w|D)) bits — which the receiver can recover ("get back") once it has reconstructed the posterior. The actual description length is therefore:

Bits-back description length
$$\text{DL} = L(w) + L(D \mid w) - \underbrace{H\big(q(w \mid D)\big)}_{\text{bits got back}}$$

The entropy H(q(w|D)) represents how uncertain the weights are — the more uncertain, the more bits you get back for free. Minimizing this actual description length is exactly the variational free energy.
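For a Gaussian posterior, the refund per weight is the Gaussian's differential entropy in bits. A small sketch (the function name is ours):

```python
import math

def bits_back(sigma):
    """Differential entropy of a Gaussian posterior N(mu, sigma^2) in bits —
    the per-weight refund in the bits-back argument. (Differential entropy
    can be negative when the posterior is very precise.)"""
    return 0.5 * math.log2(2.0 * math.pi * math.e * sigma ** 2)

# More posterior uncertainty means a bigger refund.
print(bits_back(0.01))
print(bits_back(0.5))
```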

6. Connection to Weight Decay and L2 Regularization

Consider the special case where the prior over weights is a Gaussian with variance σ²_0:

Gaussian weight prior
$$p(w) = \prod_i \mathcal{N}(w_i;\, 0,\, \sigma_0^2)$$

The description length of the weights under this prior (via Shannon's theorem) is:

Description length under Gaussian prior = weight decay
$$L(w) = -\log p(w) = \sum_i \frac{w_i^2}{2\sigma_0^2} + \text{const} = \frac{\|w\|^2}{2\sigma_0^2} + \text{const}$$

This is exactly L2 regularization — weight decay! Minimizing MDL with a Gaussian prior is identical to training with weight decay. The regularization strength λ = 1/(2σ²₀) is the inverse prior variance: a tighter prior (smaller σ₀) means stronger weight decay.
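The equivalence can be verified numerically: the gradient of the coding cost is the familiar weight-decay update w/σ₀². A minimal sketch, with σ₀ = 1 as an illustrative assumption:

```python
def nll_prior(w, sigma0=1.0):
    """Description length -log p(w) for w ~ N(0, sigma0^2), constants dropped:
    exactly the L2 / weight-decay penalty w^2 / (2 sigma0^2)."""
    return w * w / (2.0 * sigma0 ** 2)

def numerical_grad(f, w, h=1e-5):
    """Central-difference derivative of f at w."""
    return (f(w + h) - f(w - h)) / (2.0 * h)

# The gradient of the coding cost is the weight-decay update w / sigma0^2.
print(numerical_grad(nll_prior, 0.7))   # ≈ 0.7 when sigma0 = 1
```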

7. Connection to Variational Inference

The full MDL objective — with bits-back correction — has an exact algebraic identity with the variational lower bound (ELBO) used in VAEs and Bayesian deep learning. Define a variational posterior q(w) ≈ p(w|D). The bits-back MDL objective is:

MDL = Variational Free Energy = negative ELBO
$$\mathcal{F} = \underbrace{-\mathbb{E}_{w \sim q}\big[\log p(D \mid w)\big]}_{\text{expected NLL}} + \underbrace{D_{\mathrm{KL}}\!\big(q(w) \;\|\; p(w)\big)}_{\text{coding cost of weights}}
$$

This is precisely the VAE objective applied to weight space. The KL term plays the role of the weight description length; the expected NLL plays the role of the data description length. Hinton and van Camp derived this in 1993 — the VAE paper (Kingma & Welling) appeared in 2013, and Bayes By Backprop (Blundell et al.) in 2015.
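When both posterior and prior are Gaussian — the Bayes By Backprop setting — the per-weight KL coding cost has a closed form. A sketch (function name and example values are ours):

```python
import math

def kl_gaussians(mu_q, sigma_q, mu_p=0.0, sigma_p=1.0):
    """Closed-form KL( N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2) ) in nats:
    the per-weight 'coding cost' term of the variational free energy."""
    return (math.log(sigma_p / sigma_q)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p ** 2)
            - 0.5)

print(kl_gaussians(0.0, 1.0))   # 0.0 — posterior equals prior, zero coding cost
print(kl_gaussians(2.0, 0.1))   # a large, precise weight costs many nats
```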

8. Thirty Years Ahead of Its Time

This 1993 paper implicitly contains the mathematical foundations of multiple techniques that would be independently rediscovered and celebrated over the next three decades:

| Modern technique | Year popularized | How it appears in MDL-weights (1993) |
| --- | --- | --- |
| Weight decay / L2 reg | 1950s–1980s | Exact derivation from Gaussian prior MDL |
| Weight noise / stochastic training | 2000s | Core mechanism: $\tilde{w} = w + \varepsilon$ |
| Dropout (Srivastava et al. 2014) | 2014 | Multiplicative noise on activations/weights |
| VAE (Kingma & Welling 2013) | 2013 | ELBO $= \mathbb{E}[\log p(D \mid w)] - \mathrm{KL}(q \,\Vert\, p)$ on weights |
| Bayes By Backprop (Blundell et al. 2015) | 2015 | Variational Bayes over network weights |
| Model compression / pruning | 2015–2020 | Weights with high $\sigma$ relative to $\lvert w \rvert$ are prunable |

The 1993 context matters: backpropagation had only been widely known for ~6 years (since Rumelhart et al. 1986). GPUs did not exist as compute accelerators. Neural networks were trained on tiny datasets on CPUs. Hinton was thinking about generalization and compression as first principles — not as engineering tricks.