Scaling Laws for Neural Language Models

Kaplan, McCandlish, Henighan et al. Β· OpenAI 2020 Β· arXiv 2001.08361

TL;DR

Language model loss follows smooth power laws in model size (N), dataset size (D), and compute (C) β€” each independently, over many orders of magnitude. This means you can predict how much better a model will be before training it. The key insight: given a fixed compute budget, you should scale model size faster than dataset size. This paper directly motivated GPT-3 and inaugurated the LLM scaling era.

1. Why Does Scaling Matter?

Before this paper, practitioners had to run expensive experiments to figure out whether making a model bigger would help. The conventional wisdom was vague: bigger is often better, but by how much? And where should you invest β€” more parameters, more data, or longer training?

Kaplan et al. ran hundreds of language model training runs across a vast range of model sizes (768 parameters to 1.5 billion), dataset sizes, and compute budgets. They found something remarkable: performance doesn't improve randomly or chaotically β€” it follows precise mathematical laws.

Key insight: Loss (cross-entropy, in nats) behaves like a power law in N, D, and C separately. Power laws are straight lines on a log-log plot. This predictability is what makes scaling laws so powerful β€” and so actionable.

2. The Three Power Laws

The paper identifies three fundamental scaling laws β€” one for each of the three ways you can invest resources in training a language model.

2a. Loss vs. Parameters (N)

When you train with unlimited data (so data is never the bottleneck), loss follows:

L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}

where the exponent Ξ±_N β‰ˆ 0.076 and N_c is a fitted constant. This is a small exponent β€” every 10Γ— increase in parameters only reduces loss by a factor of 10^{0.076} β‰ˆ 1.19. Progress is real but gradual.
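A minimal sketch of this law in code. The exponent is the paper's fit; the value of N_c below is the rounded constant reported by Kaplan et al. (in non-embedding parameters) and should be treated as illustrative, not re-fitted:

```python
# Sketch of L(N): loss as a power law in parameter count, data unlimited.
ALPHA_N = 0.076   # fitted exponent from Kaplan et al.
N_C = 8.8e13      # fitted constant (non-embedding params); rounded published value

def loss_vs_params(n_params: float) -> float:
    """Predicted cross-entropy loss (nats) when data is not a bottleneck."""
    return (N_C / n_params) ** ALPHA_N

# A 10x jump in parameters divides the loss by 10**0.076 ~ 1.19,
# regardless of where you start on the curve:
ratio = loss_vs_params(1e8) / loss_vs_params(1e9)
```

Because the law is a pure power law, the improvement ratio depends only on the scale factor, not on the starting model size.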

2b. Loss vs. Dataset Size (D)

When you train with an unlimited model (so capacity is never the bottleneck), loss follows:

L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}

where Ξ±_D β‰ˆ 0.095. The dataset exponent is slightly larger than the parameter exponent (0.095 vs 0.076), meaning more data gives marginally more benefit per doubling than more parameters β€” but only marginally.

2c. Loss vs. Compute Budget (C)

When you optimize the training run for a fixed compute budget (choosing the best N and D), loss follows:

L(C_{\min}) = \left(\frac{C_c^{\min}}{C_{\min}}\right)^{\alpha_C^{\min}}

where Ξ±_C β‰ˆ 0.050. This is the smallest exponent, meaning compute is the hardest axis to gain from. But compute is also the axis you can always increase β€” you just need more GPUs and more time. This law is what made the paper so actionable for practitioners.

2d. The Combined Law: L(N, D)

When both N and D are finite (the realistic case), both effects combine. The paper fits a unified formula:

L(N, D) = \left[\left(\frac{N_c}{N}\right)^{\alpha_N / \alpha_D} + \frac{D_c}{D}\right]^{\alpha_D}

This formula captures the key competition: as N grows, the first term shrinks. As D grows, the second term shrinks. Whichever term is larger dominates the loss. This is why undertrained large models can be beaten by well-trained smaller ones β€” the data term is too large.
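The competition between the two terms is easy to see numerically. This sketch uses the rounded constants reported in the paper (N_c in non-embedding parameters, D_c in tokens); the specific model/data sizes below are illustrative:

```python
# Sketch of the combined law L(N, D) and the "undertrained giant" effect.
ALPHA_N, ALPHA_D = 0.076, 0.095
N_C, D_C = 8.8e13, 5.4e13  # rounded fitted constants (params, tokens)

def loss(n: float, d: float) -> float:
    """Predicted loss when both model size and data are finite."""
    model_term = (N_C / n) ** (ALPHA_N / ALPHA_D)
    data_term = D_C / d
    return (model_term + data_term) ** ALPHA_D

# A 100B-param model starved of data vs. a 1B-param model with ample data:
big_starved = loss(n=1e11, d=1e9)   # data term dominates
small_fed = loss(n=1e9, d=1e11)     # model term dominates, but is smaller
```

Under this fit, the well-fed small model achieves lower loss than the data-starved large one, which is exactly the failure mode the combined law predicts.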

3. Compute-Optimal Training

Given a fixed compute budget C (measured in FLOPs), how should you divide it between model size N and training tokens D? First, recall the cost approximation:

C \approx 6ND

The factor of 6 comes from 2 FLOPs per multiply-add, times roughly 3 passes' worth of work per token (one forward pass plus a backward pass costing about twice the forward). So for a 1B-parameter model trained on 300B tokens: C β‰ˆ 6 Γ— 10^9 Γ— 3Γ—10^{11} = 1.8Γ—10^{21} FLOPs.
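The cost rule is simple enough to encode directly; this just restates the arithmetic above as a reusable helper:

```python
# Sanity check of the C ~ 6*N*D training-cost approximation.
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute: ~6 FLOPs per parameter per token."""
    return 6.0 * n_params * n_tokens

c = training_flops(1e9, 3e11)  # 1B params on 300B tokens -> ~1.8e21 FLOPs
```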

By minimizing L(N,D) subject to C = 6ND, Kaplan et al. derived the optimal allocation:

N_{\text{opt}} \propto C^{0.73}, \quad D_{\text{opt}} \propto C^{0.27}

The exponents sum to 1.0 (since C ∝ NΒ·D). The key ratio: N grows much faster than D as compute scales. For every 10Γ— increase in compute, optimal N should grow by 10^{0.73} β‰ˆ 5.4Γ—, while D grows only 10^{0.27} β‰ˆ 1.9Γ—.

Practical implication: spend most of your extra compute on bigger models, not more data. This is what motivated training GPT-3 (175B parameters) on "only" 300B tokens β€” by these scaling laws, that allocation was roughly compute-optimal given the budget.

4. Sample Efficiency: Bigger Models Learn More Per Token

One of the most counterintuitive findings: larger models are more sample-efficient. They reach the same loss level with fewer training tokens.

Why? Because a larger model has more capacity to extract structure from each example. A small model might need to see the same pattern thousands of times before it reliably captures it. A large model may get it right in dozens of exposures.

This also connects to convergence behavior: smaller models converge faster in wall-clock time (fewer parameters to update) but plateau at higher loss. Larger models take longer to converge but end up at much lower loss. If you stop early, a smaller model might look better β€” but that's an artifact of not training long enough.

5. Worked Example: Doubling Your Compute Budget

Let's make this concrete. Suppose you currently run 1Γ— compute (say, 10^{22} FLOPs, roughly a mid-sized LLM training run). Now you get access to 2Γ— that compute. How should you allocate it?

Applying N_opt ∝ C^{0.73}, D_opt ∝ C^{0.27}:

  • Scale N by: 2^{0.73} β‰ˆ 1.66Γ— (e.g., go from 7B β†’ 11.6B parameters)
  • Scale D by: 2^{0.27} β‰ˆ 1.21Γ— (e.g., go from 200B β†’ 242B tokens)
  • Check: (1.66 Γ— 1.21) β‰ˆ 2.0 βœ“ β€” budget is fully used
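The steps above can be sketched as a small helper. The 7B/200B starting point is the hypothetical one from the bullets, not a real training run:

```python
# Rescaling a (params, tokens) budget under Kaplan et al.'s exponents.
KAPLAN_N_EXP, KAPLAN_D_EXP = 0.73, 0.27

def rescale(n: float, d: float, compute_factor: float) -> tuple[float, float]:
    """Return the new (params, tokens) when compute grows by compute_factor."""
    return (n * compute_factor ** KAPLAN_N_EXP,
            d * compute_factor ** KAPLAN_D_EXP)

# Doubling compute from a 7B-param / 200B-token run (illustrative numbers):
new_n, new_d = rescale(7e9, 2e11, 2.0)
# new_n ~ 1.16e10 params, new_d ~ 2.41e11 tokens; their product grows ~2x,
# so the doubled budget is fully used (exponents sum to 1).
```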

The wrong answer (and what many teams did before this paper): split the compute evenly β€” 1.41Γ— more parameters, 1.41Γ— more data. Under Kaplan's fit, that undersizes the model and overspends on data, so you land at a higher loss than the optimal split achieves for the same budget.

For a $1M compute budget (roughly 10^{22}–10^{23} FLOPs at 2020 prices), these laws say: spend it almost entirely on a bigger model with a modest data increase, not an equal split. The model is the cheaper investment per unit of loss reduction.

6. Chinchilla's Correction (2022)

Two years later, Hoffmann et al. at DeepMind published "Training Compute-Optimal Large Language Models" (the Chinchilla paper), which revisited Kaplan's optimal allocation using a different methodology β€” and got a substantially different answer.

Chinchilla's finding: the optimal allocation is approximately equal scaling β€” N and D should grow at the same rate (both roughly ∝ C^{0.5}). More precisely:

N_{\text{opt}} \approx \frac{D_{\text{opt}}}{20}, \quad \text{i.e.,} \quad D \approx 20 \cdot N

This rule of thumb β€” 20 tokens per parameter β€” became the dominant heuristic. Chinchilla itself (70B parameters) was trained on 1.4T tokens (= 20 Γ— 70B) and outperformed Gopher (280B) trained on only 300B tokens.
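Combining the 20-tokens-per-parameter rule with the C β‰ˆ 6ND cost approximation gives a quick model-sizing calculator. This combination is a simple algebraic consequence of the two stated relations, not the Chinchilla paper's own fitting procedure:

```python
import math

# Size a compute-optimal model under the D ~ 20*N rule.
# Substituting D = 20*N into C = 6*N*D gives C = 120*N^2.
def chinchilla_sizing(flops: float) -> tuple[float, float]:
    """Return (params, tokens) that spend `flops` at 20 tokens per parameter."""
    n = math.sqrt(flops / 120.0)
    return n, 20.0 * n

# Roughly Chinchilla's budget: 6 * 70e9 * 1.4e12 = 5.88e23 FLOPs.
n, d = chinchilla_sizing(5.88e23)
# Recovers n ~ 7e10 (70B params) and d ~ 1.4e12 (1.4T tokens).
```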

What changed? Kaplan's runs used a learning rate schedule that didn't fully decay β€” models were slightly undertrained. This biased the exponents. Chinchilla used IsoFLOP curves (fixing C and varying N/D at constant C), which gave a cleaner estimate. The qualitative message is the same β€” scaling laws hold β€” but the optimal split shifted toward more data.

Today, most frontier labs use roughly the Chinchilla ratio or beyond. Llama 3 (8B) was trained on 15T tokens β€” nearly 2,000 tokens per parameter, far beyond both recipes, because inference efficiency favors smaller, longer-trained models.

7. Why This Paper Changed Everything

Before scaling laws, building large models felt like exploration in the dark. Each new model was an expensive bet β€” nobody could reliably predict if going bigger would help or by how much.

After this paper, scaling became an engineering discipline. You could:

  • Run small experiments and extrapolate to predict large model performance
  • Plan your compute budget before training β€” know roughly what loss you'll hit
  • Justify building GPT-3 (175B) with a quantitative argument, not just intuition
  • Understand the 'compute frontier': what's the best model achievable for a given budget?

The broader intellectual impact: scaling laws suggested that intelligence might be more predictable and more continuous than previously believed. There's no sharp threshold where models become capable β€” just steady improvement along smooth power laws. This view, sometimes called the 'smooth scaling hypothesis', drove the bet that GPT-3, GPT-4, and their successors would keep getting better.

The compute frontier: Every point on the curve L(C_min) represents the best possible loss for a given compute budget. Models trained below this frontier are suboptimal β€” they're either too small (undertrained) or too data-limited. The frontier has been pushed down consistently for five years, and scaling laws let you project where it'll be next.

Quick Reference: The Key Numbers

| Law | Formula | Exponent |
| --- | --- | --- |
| Parameters | L(N) ∝ N^{-Ξ±_N} | Ξ±_N β‰ˆ 0.076 |
| Dataset size | L(D) ∝ D^{-Ξ±_D} | Ξ±_D β‰ˆ 0.095 |
| Compute | L(C) ∝ C^{-Ξ±_C} | Ξ±_C β‰ˆ 0.050 |
| Optimal N | N_opt ∝ C^{0.73} | (Kaplan) |
| Optimal D | D_opt ∝ C^{0.27} | (Kaplan) |
| Chinchilla rule | D β‰ˆ 20Β·N | (Hoffmann 2022) |

Further Reading