Section 03

How a model learns

Gradient descent and backpropagation

We have an objective, minimizing the cross-entropy loss, and a model with billions of parameters (the weights that training adjusts) to tune. This chapter is about the engine that does the adjusting: gradient descent, powered by backpropagation. Together they are the reason a pile of random numbers turns into something that can write code.

The loss is a landscape; training walks downhill

Fix the training data and think of the loss $\mathcal{L}(\theta)$ as a function of the parameters $\theta$ alone. With billions of parameters this loss landscape lives in billions of dimensions, but the three-dimensional picture carries over: two parameters spread across the ground, height is the loss, and we want to reach a low valley.

The tool for going downhill is the gradient, $\nabla_\theta \mathcal{L}$ (the symbol $\nabla$ is read “nabla” or “del”): the vector of partial derivatives of the loss with respect to every parameter. It points in the direction of steepest increase, so to decrease the loss we step the opposite way:

$$\theta \leftarrow \theta - \eta \, \nabla_\theta \mathcal{L}$$

The scalar $\eta$ (the Greek letter eta) is the learning rate: how big a step to take. Too high and training diverges; too low and it crawls. That single update rule, repeated, is gradient descent.
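As a minimal sketch of the update rule, here it is applied to a two-parameter toy loss. The bowl-shaped `loss`, its minimum at (3, -1), the starting point, and the learning rate of 0.1 are all illustrative assumptions, not anything taken from a real model.

```python
def loss(theta):
    """Illustrative bowl-shaped loss with its minimum at (3, -1)."""
    x, y = theta
    return (x - 3.0) ** 2 + (y + 1.0) ** 2


def grad(theta):
    """Analytic gradient: the partial derivative of loss for each parameter."""
    x, y = theta
    return [2.0 * (x - 3.0), 2.0 * (y + 1.0)]


def gradient_descent(theta, lr, steps):
    """Apply theta <- theta - lr * grad(theta) repeatedly."""
    for _ in range(steps):
        g = grad(theta)
        theta = [t - lr * gi for t, gi in zip(theta, g)]
    return theta


theta0 = [0.0, 0.0]
theta = gradient_descent(theta0, lr=0.1, steps=100)
print("start loss:", loss(theta0))
print("final parameters:", theta, "final loss:", loss(theta))
```

On this toy surface each step shrinks the distance to the minimum by a constant factor, so the parameters settle at (3, -1). Real loss surfaces are not this well behaved.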

[Interactive figure: gradient descent on a loss surface. The bowl is steeper top-to-bottom than side-to-side; a ball steps downhill along the negative gradient at an adjustable learning rate.]
One learning rate has to serve both directions at once: too large for the steep axis means zig-zagging or blowing up, too small for the shallow axis means crawling. Optimizers like Adam address this by giving every parameter its own effective step size.

Vary the learning rate on a surface like this and the central tension of all training appears immediately. Too small and progress is glacial. Too large and the ball overshoots the valley, zig-zags, or flies off entirely. And because the surface is steeper in one direction than the other, no single learning rate is ideal for both axes at once, and a real network has billions of axes with wildly different curvatures. Hold that thought; it is precisely what the optimizer in the next chapter exists to solve.
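The tension can be reproduced in a few lines. The sketch below assumes a quadratic bowl that is ten times steeper along `y` than along `x`; the curvatures `A` and `B`, the starting point, and the three learning rates are hypothetical values chosen to make the three regimes visible.

```python
A = 1.0    # curvature along the shallow axis (assumed)
B = 10.0   # curvature along the steep axis (assumed)


def bowl(x, y):
    """Anisotropic quadratic loss: steeper in y than in x."""
    return 0.5 * (A * x * x + B * y * y)


def descend(lr, steps=50, start=(4.0, 1.0)):
    """Run plain gradient descent and return the loss after every step."""
    x, y = start
    losses = [bowl(x, y)]
    for _ in range(steps):
        # The gradient of the bowl is (A * x, B * y).
        x -= lr * A * x
        y -= lr * B * y
        losses.append(bowl(x, y))
    return losses


too_small = descend(lr=0.01)  # stable but slow along the shallow axis
workable = descend(lr=0.18)   # y flips sign every step (zig-zag) yet converges
too_large = descend(lr=0.21)  # exceeds 2 / B, so the steep axis blows up

for name, run in [("0.01", too_small), ("0.18", workable), ("0.21", too_large)]:
    print(f"lr={name}: start loss {run[0]:.3f}, final loss {run[-1]:.3e}")
```

For a quadratic like this, the steep axis is stable only while the learning rate stays below `2 / B`, which caps how fast the shallow axis can move. That cap is the single-learning-rate compromise described above.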

Stochastic gradient descent: don’t read the whole library each step

The true loss is an average over the entire corpus. Computing its exact gradient would mean a forward pass over trillions of tokens for a single update, which is absurd. Instead we estimate the gradient from a mini-batch of sequences sampled from the data. The estimate is noisy, but it is unbiased and millions of times cheaper, and the noise even helps escape bad regions. This is stochastic gradient descent (SGD), and one such update is a training step: a forward pass on a batch, a backward pass to get gradients, and an optimizer update.
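To make "noisy but unbiased" concrete, the sketch below fits a one-parameter linear model to synthetic data. Everything here is an assumption for illustration: the dataset, the true weight `W_TRUE`, the batch size of 32, and the helper names. It compares the full-data gradient with the average of many mini-batch estimates, then trains with mini-batches only.

```python
import random

random.seed(0)
W_TRUE = 2.0  # weight used to generate the synthetic data (assumed)
xs = [random.uniform(-1.0, 1.0) for _ in range(10_000)]
data = [(x, W_TRUE * x + random.gauss(0.0, 0.1)) for x in xs]


def batch_grad(w, batch):
    """Gradient of the mean squared error (w*x - y)^2 with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def sgd(w, lr, steps, batch_size):
    """Each step uses a fresh random mini-batch instead of the full dataset."""
    for _ in range(steps):
        batch = random.sample(data, batch_size)
        w -= lr * batch_grad(w, batch)
    return w


full_gradient = batch_grad(0.0, data)
estimates = [batch_grad(0.0, random.sample(data, 32)) for _ in range(2000)]
mean_estimate = sum(estimates) / len(estimates)
print("full gradient:", full_gradient)
print("mean of mini-batch estimates:", mean_estimate)
print("spread of single estimates:", min(estimates), "to", max(estimates))

w = sgd(0.0, lr=0.1, steps=500, batch_size=32)
print("weight learned by SGD:", w)
```

Single estimates scatter widely, but their average lands close to the full gradient, and the noisy updates still recover a weight near `W_TRUE` while touching only 32 examples per step.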

A frontier run is hundreds of thousands of these steps. Frontier LLMs are often trained for roughly a single epoch, one full pass over a deduplicated corpus, so each token is seen about once. (Why not many passes? Because with enough fresh data, a new token teaches more than a repeated one; we will quantify this with scaling laws.)
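The step count follows from simple arithmetic: the tokens in the corpus divided by the tokens consumed per step. The figures below (a 2-trillion-token corpus, 1024 sequences per batch, 4096 tokens per sequence) are hypothetical round numbers chosen for illustration, not the configuration of any particular model.

```python
def steps_for_one_epoch(corpus_tokens, sequences_per_batch, tokens_per_sequence):
    """Number of full training steps needed to see every token about once."""
    tokens_per_step = sequences_per_batch * tokens_per_sequence
    return corpus_tokens // tokens_per_step


# Hypothetical run: all three numbers are illustrative assumptions.
steps = steps_for_one_epoch(
    corpus_tokens=2 * 10**12,
    sequences_per_batch=1024,
    tokens_per_sequence=4096,
)
print("steps for one epoch:", steps)
```

With these assumed numbers each step consumes about 4.2 million tokens, so one epoch takes roughly 477 thousand steps, consistent with the "hundreds of thousands" order of magnitude.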

Backpropagation: getting a billion derivatives for the price of one pass

The catch is computing the gradient itself. We need $\partial \mathcal{L} / \partial \theta_i$ for every parameter: billions of derivatives. Computing each one independently would be hopeless. Backpropagation computes them all in a single backward sweep, and it is the algorithm that makes deep learning feasible.

The idea is the chain rule. A neural network is a long composition of simple operations (matrix multiplies, normalizations, nonlinearities). In the forward pass we run inputs through to produce the loss, caching the intermediate activations. In the backward pass we walk the operations in reverse, and at each one multiply the gradient flowing back by that operation’s local derivative. Each layer needs only its cached inputs and the gradient arriving from the layer above, so computing the whole gradient costs roughly as much as two forward passes, a fixed multiple that does not grow with the number of parameters.
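The forward-then-backward pattern can be written out by hand for a network small enough to read. The sketch below assumes a hypothetical four-parameter model (one tanh hidden unit and a squared-error loss, with arbitrary parameter values) and checks the backpropagated gradient against finite differences.

```python
import math


def forward(params, x, y):
    """Forward pass: compute the loss and cache what the backward pass needs."""
    w1, b1, w2, b2 = params
    h = math.tanh(w1 * x + b1)   # hidden activation
    y_hat = w2 * h + b2          # prediction
    loss = (y_hat - y) ** 2
    return loss, (x, h, y_hat)


def backward(params, cache, y):
    """Backward pass: the chain rule applied from the loss back to each parameter."""
    w1, b1, w2, b2 = params
    x, h, y_hat = cache
    d_y_hat = 2.0 * (y_hat - y)      # dL/dy_hat
    d_w2 = d_y_hat * h
    d_b2 = d_y_hat
    d_h = d_y_hat * w2               # gradient flowing into the hidden unit
    d_z = d_h * (1.0 - h * h)        # local derivative of tanh is 1 - tanh^2
    d_w1 = d_z * x
    d_b1 = d_z
    return [d_w1, d_b1, d_w2, d_b2]


def numeric_grad(params, x, y, eps=1e-6):
    """Central finite differences: two extra forward passes per parameter."""
    grads = []
    for i in range(len(params)):
        up, down = list(params), list(params)
        up[i] += eps
        down[i] -= eps
        grads.append((forward(up, x, y)[0] - forward(down, x, y)[0]) / (2 * eps))
    return grads


params = [0.5, -0.3, 1.2, 0.1]   # arbitrary illustrative values
x, y = 0.8, 1.0
loss, cache = forward(params, x, y)
analytic = backward(params, cache, y)
numeric = numeric_grad(params, x, y)
print("backprop:          ", analytic)
print("finite differences:", numeric)
```

The finite-difference check needs two forward passes per parameter, which is exactly the "each one independently" approach that does not scale. The backward pass gets all four derivatives in one sweep by reusing `d_y_hat`, `d_h`, and the cached `h`.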

Two consequences of backprop shape everything downstream:

  • Memory. The backward pass needs the forward activations, so they must be kept in memory until used. At long context length these activations balloon, which is why gradient checkpointing (discarding most activations in the forward pass and recomputing them during the backward pass) becomes essential at scale.
  • Numerics. Gradients are computed through many multiplications and can become very small or very large. Keeping them representable is the job of the precision machinery — loss scaling, BF16, careful normalization.
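The numerics point can be seen with one number. The sketch below uses Python's `struct` half-precision format to round values to FP16; the gradient value `1e-8` and the scale factor `65536` are illustrative assumptions. In real training the loss is multiplied by the scale before the backward pass, so the chain rule scales every gradient. Here a single value stands in for that.

```python
import struct


def to_fp16(value):
    """Round a Python float to IEEE half precision and back."""
    return struct.unpack("e", struct.pack("e", value))[0]


tiny_grad = 1e-8        # assumed gradient, below FP16's smallest positive value
scale = 65536.0         # assumed loss-scale factor (a power of two)

naive = to_fp16(tiny_grad)             # stored directly: underflows to zero
scaled = to_fp16(tiny_grad * scale)    # stored after scaling: representable
recovered = scaled / scale             # unscaled again in higher precision

print("stored directly in FP16:", naive)
print("scaled, stored, unscaled:", recovered)
```

Stored directly, the gradient becomes exactly zero and that parameter stops learning. Scaled first, it survives with a small rounding error. BF16 sidesteps much of this by keeping the same exponent range as FP32 at the cost of fewer mantissa bits.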

You rarely write backprop by hand; frameworks like PyTorch build a computational graph during the forward pass and differentiate it automatically. But knowing that the gradient is cheap to compute but expensive to store explains an enormous amount about how large models are actually engineered.
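The compute-for-memory trade behind gradient checkpointing can be sketched on a toy chain of scalar layers. The layer function, the 64-layer depth, the checkpoint interval of 8, and the way stored activations are counted are all assumptions made for illustration. Frameworks implement this on tensors, not scalars.

```python
import math


def layer(w, a):
    """One toy layer: a scalar weight followed by tanh."""
    return math.tanh(w * a)


def backward_segment(weights, acts, d_out, grads, start):
    """Backprop through the layers whose inputs and outputs are in acts."""
    for j in reversed(range(len(acts) - 1)):
        i = start + j
        dz = d_out * (1.0 - acts[j + 1] ** 2)   # tanh local derivative
        grads[i] = dz * acts[j]
        d_out = dz * weights[i]
    return d_out


def grads_full_cache(weights, x):
    """Standard backprop: keep every activation from the forward pass."""
    acts = [x]
    for w in weights:
        acts.append(layer(w, acts[-1]))
    grads = [0.0] * len(weights)
    backward_segment(weights, acts, 1.0, grads, 0)
    return grads, len(acts)


def grads_checkpointed(weights, x, every):
    """Keep only every `every`-th activation; recompute the rest when needed."""
    n = len(weights)
    checkpoints = {}
    a = x
    for i, w in enumerate(weights):
        if i % every == 0:
            checkpoints[i] = a   # input to layer i
        a = layer(w, a)
    grads = [0.0] * n
    d_out, peak = 1.0, 0
    for start in reversed(range(0, n, every)):
        acts = [checkpoints[start]]
        for i in range(start, min(start + every, n)):
            acts.append(layer(weights[i], acts[-1]))   # extra forward compute
        peak = max(peak, len(checkpoints) + len(acts) - 1)
        d_out = backward_segment(weights, acts, d_out, grads, start)
    return grads, peak


weights = [0.9 + 0.005 * i for i in range(64)]
full, stored_full = grads_full_cache(weights, 0.5)
ckpt, stored_ckpt = grads_checkpointed(weights, 0.5, every=8)
print("activations held, full cache:", stored_full)
print("peak activations held, checkpointed:", stored_ckpt)
```

Both routes produce the same gradients. The checkpointed one holds 16 values at its peak instead of 65, and pays for it by running each segment's forward computation a second time.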

With gradients in hand, the remaining question is how to use them well — which is the optimizer.