Section 14

Scaling laws

Loss as a power law in size, data, and compute

Paper: Scaling Laws for Neural Language Models — Kaplan et al., 2020

GPT-2 showed that scaling works. Kaplan et al.’s 2020 Scaling Laws for Neural Language Models showed that scaling is predictable — so precisely that you can forecast a model’s loss before you train it. This is the paper that turned pre-training from a craft into something you could put on a spreadsheet and budget against.

Loss is a power law in scale

The central finding is startlingly clean. Train transformers of many sizes on many amounts of data, and the test loss (cross-entropy loss: the negative log-probability the model assigned to the actual next token, averaged over all positions) follows a smooth power law (a relationship of the form y = a·x^(−b)) in each of the three scale factors, model size $N$, dataset size $D$, and compute $C$, over many orders of magnitude, as long as the other two factors are not the bottleneck:

$$L(N) = \left(\frac{N_c}{N}\right)^{0.076} \qquad L(D) = \left(\frac{D_c}{D}\right)^{0.095}$$

with $N_c \approx 8.8\times10^{13}$ (non-embedding parameters) and $D_c \approx 5.4\times10^{13}$ tokens. There’s a matching law for compute with exponent $\approx 0.050$. A power law plots as a straight line on log-log axes: each multiplicative step in scale multiplies the loss by a fixed factor. With an exponent of 0.076, each 10× in model size cuts the loss by about 16% ($10^{-0.076} \approx 0.84$).
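To make the numbers concrete, here is a minimal Python sketch that evaluates the two fitted laws quoted above. The function names `loss_from_params` and `loss_from_tokens` are illustrative, not from the paper; the constants are the fitted values given in the text, and each law is assumed to apply only when the other scale factors are not limiting.

```python
# Fitted constants quoted in the text (Kaplan et al., 2020).
N_C = 8.8e13      # critical scale, in non-embedding parameters
ALPHA_N = 0.076   # exponent of the model-size law
D_C = 5.4e13      # critical scale, in tokens
ALPHA_D = 0.095   # exponent of the dataset-size law


def loss_from_params(n_params: float) -> float:
    """Evaluate L(N) = (N_c / N) ** alpha_N."""
    return (N_C / n_params) ** ALPHA_N


def loss_from_tokens(n_tokens: float) -> float:
    """Evaluate L(D) = (D_c / D) ** alpha_D."""
    return (D_C / n_tokens) ** ALPHA_D


if __name__ == "__main__":
    # Walk up the model-size axis one decade at a time.
    for n in (1e6, 1e7, 1e8, 1e9, 1e10):
        print(f"N = {n:.0e}  predicted loss = {loss_from_params(n):.3f}")

    # The ratio between consecutive decades is the same everywhere on the curve.
    ratio = loss_from_params(1e9) / loss_from_params(1e8)
    print(f"loss ratio per 10x in N: {ratio:.3f}")
```

The ratio printed at the end equals 10^(−0.076) no matter which decade you pick, which is the "straight line on log-log axes" property in numeric form.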

Figure: Scaling laws (Kaplan et al.). Test loss (log scale) against model size N in non-embedding parameters (log scale, 1M to 1T), following L(N) = (8.8e13 / N)^0.076. Test loss falls as a power law in scale; on log-log axes, a power law is a straight line.
Over the studied range the line never bends: each 10× in scale cuts the loss by a fixed fraction. The exponents are small (~0.08), so gains are real but slow, which is precisely why frontier labs spend 10× more compute for each increment. These curves let you predict a big model's loss from small ones, and they set up the next question: given a fixed compute budget, how should you split it between N and D?

These are scaling laws: empirical formulas showing that test loss falls as a smooth power law in model size, dataset size, and compute. They let you predict a large model's performance from small experiments.
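As a rough illustration of predicting a large model from small ones, the sketch below fits a straight line in log-log space and extrapolates it. Everything here is an assumption for demonstration: `fit_power_law` is a hypothetical helper, and the "small model" losses are synthetic, noise-free points generated from the L(N) formula in the text rather than real training results. Real runs are noisy, so an actual fit would carry uncertainty that this sketch ignores.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of a line to (log x, log y).

    Returns (scale, exponent) such that y ~= (scale / x) ** exponent.
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((lx - mean_x) ** 2 for lx in log_x)
    sxy = sum((lx - mean_x) * (ly - mean_y) for lx, ly in zip(log_x, log_y))
    slope = sxy / sxx                     # negative for a falling loss curve
    intercept = mean_y - slope * mean_x
    exponent = -slope
    # log y = exponent * (log scale - log x), so log scale = intercept / exponent.
    scale = math.exp(intercept / exponent)
    return scale, exponent


# Synthetic stand-ins for small training runs (NOT measured data):
# points generated from L(N) = (8.8e13 / N) ** 0.076.
small_sizes = [1e6, 3e6, 1e7, 3e7, 1e8]
small_losses = [(8.8e13 / n) ** 0.076 for n in small_sizes]

scale, exponent = fit_power_law(small_sizes, small_losses)

# Extrapolate two orders of magnitude beyond the largest fitted model.
predicted_loss = (scale / 1e10) ** exponent

if __name__ == "__main__":
    print(f"fitted exponent: {exponent:.4f}")
    print(f"fitted scale:    {scale:.3e}")
    print(f"predicted loss at N = 1e10: {predicted_loss:.3f}")
```

Fitting in log space turns the power law into ordinary linear regression, which is why a handful of small runs is enough to pin down the two constants.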

A few properties made the result so influential:

  • Smoothness. No bumps, no plateaus across the studied range — just clean power laws.
  • Universality. The shape barely depends on architectural details (depth vs. width, etc.) within reason; it’s dominated by scale.
  • Sample efficiency of large models. Bigger models reach any given loss using fewer tokens. A large model “learns more per example.”

Kaplan’s recipe — and the catch

From these laws, Kaplan et al. derived how to spend a fixed compute budget optimally, using the 6ND rule of thumb: training a dense model with N parameters on D tokens costs about 6ND floating-point operations (≈2ND forward + ≈4ND backward), so $C \approx 6ND$. Their answer: pour most of the increase into model size, with data growing only slowly. They estimated $D \propto C^{0.27}$, i.e. as you scale compute, grow the model fast and the dataset gently, training very large models and stopping well before convergence.
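The allocation can be worked through numerically. If compute scales as the product of N and D and the dataset grows as C^0.27, the model must grow as C^0.73 for the product to keep pace. The sketch below applies that split to a baseline run; the helper names and the baseline of 1e9 parameters on 2e10 tokens are hypothetical values chosen only for illustration.

```python
DATA_EXPONENT = 0.27                    # D grows as C ** 0.27 (from the text)
MODEL_EXPONENT = 1.0 - DATA_EXPONENT    # implied by C ~= 6 * N * D


def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training cost using the 6ND rule of thumb."""
    return 6.0 * n_params * n_tokens


def scale_allocation(n_params: float, n_tokens: float, compute_multiplier: float):
    """Scale a baseline (N, D) to a larger budget using the Kaplan-style split."""
    new_n = n_params * compute_multiplier ** MODEL_EXPONENT
    new_d = n_tokens * compute_multiplier ** DATA_EXPONENT
    return new_n, new_d


if __name__ == "__main__":
    # Hypothetical baseline run, chosen only for illustration.
    base_n, base_d = 1e9, 2e10
    new_n, new_d = scale_allocation(base_n, base_d, compute_multiplier=10.0)

    print(f"baseline FLOPs: {training_flops(base_n, base_d):.2e}")
    print(f"scaled FLOPs:   {training_flops(new_n, new_d):.2e}")
    print(f"model grows by   {new_n / base_n:.2f}x")
    print(f"dataset grows by {new_d / base_d:.2f}x")
```

Under these assumptions a 10× compute increase goes mostly into parameters (10^0.73, a bit over 5×), while the token count less than doubles (10^0.27).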

This recipe shaped a generation of models, including GPT-3 — make them enormous, don’t worry too much about training on proportionally more tokens. It was also, in one important respect, wrong. A couple of years later, Chinchilla showed Kaplan had badly under-weighted data, largely because of a subtle flaw in how the learning rate was scheduled across the experiments. We’ll see exactly what changed in chapter 16.

First, though, the model that took Kaplan’s “go big” recipe to its logical extreme — and discovered something nobody had predicted from the loss curves alone.