Section 03

How a model learns

Gradient descent and backpropagation

We have an objective (minimize the cross-entropy loss) and a model with billions of parameters to adjust. This chapter is about the engine that does the adjusting: gradient descent, powered by backpropagation. Together they are the reason a pile of random numbers turns into something that can write code.

The loss is a landscape; training walks downhill

Fix the training data and think of the loss $\mathcal{L}(\theta)$ as a function of the parameters $\theta$ alone. With billions of parameters this loss landscape lives in billions of dimensions, but the three-dimensional picture carries over: two parameters spread across the ground, height is the loss, and we want to reach a low valley.

The tool for going downhill is the gradient, $\nabla_\theta \mathcal{L}$ (the symbol $\nabla$ is read “nabla” or “del”): the vector of partial derivatives of the loss with respect to every parameter. It points in the direction of steepest increase, so to decrease the loss we step the opposite way:

$$\theta \leftarrow \theta - \eta \, \nabla_\theta \mathcal{L}$$

The scalar $\eta$ (the Greek letter eta) is the learning rate — how big a step to take. That single update rule, repeated, is gradient descent.
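To make the update rule concrete, here is a minimal sketch on a one-parameter toy problem. The parabola, its minimum at 3, the starting point, and the learning rate are all illustrative assumptions, not anything taken from a real model.

```python
def loss(theta):
    """Toy loss (an assumption for illustration): a parabola with minimum at theta = 3."""
    return (theta - 3.0) ** 2


def grad(theta):
    """Derivative of the toy loss with respect to theta."""
    return 2.0 * (theta - 3.0)


def gradient_descent(theta, eta, steps):
    """Repeat the update theta <- theta - eta * grad(theta), recording the loss."""
    history = [loss(theta)]
    for _ in range(steps):
        theta = theta - eta * grad(theta)
        history.append(loss(theta))
    return theta, history


theta_final, history = gradient_descent(theta=0.0, eta=0.1, steps=50)
print(f"theta after 50 steps: {theta_final:.5f}")
print(f"loss went from {history[0]:.3f} to {history[-1]:.2e}")
```

With a real network the only things that change are the size of `theta` and where the gradient comes from; the loop itself stays this simple.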

Figure (interactive): Gradient descent on a loss surface. The bowl is steeper top-to-bottom than side-to-side. Set the learning rate, then step the ball downhill along the negative gradient. Initial state shown: learning rate η = 0.30 (healthy descent toward the minimum), step 0, loss 6.528.

The gradient points uphill, so we step the opposite way: θ ← θ − η·∇L. One learning rate has to serve both directions at once — too large for the steep axis means zig-zagging or blowing up, too small for the shallow axis means crawling. Optimizers like Adam fix this by giving every parameter its own effective step size.

Play with the learning rate above and the central tension of all training appears immediately. Too small and progress is glacial. Too large and the ball overshoots the valley, zig-zags, or flies off entirely. And because the surface is steeper in one direction than the other, no single learning rate is ideal for both axes at once, and a real network has billions of axes with wildly different curvatures. Hold that thought; it is precisely what the optimizer in the next chapter exists to solve.
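The same tension can be reproduced in a few lines. The sketch below assumes a hypothetical two-parameter bowl that is ten times steeper along `y` than along `x`; the function, the starting point, and the three learning rates are chosen for illustration and are not the surface used in the figure.

```python
def bowl_loss(x, y):
    """Hypothetical bowl: shallow along x, ten times steeper along y."""
    return 0.5 * (x * x + 10.0 * y * y)


def final_loss(eta, steps=30, start=(4.0, 1.0)):
    """Run plain gradient descent with one shared learning rate and return the final loss."""
    x, y = start
    for _ in range(steps):
        grad_x, grad_y = x, 10.0 * y          # partial derivatives of bowl_loss
        x, y = x - eta * grad_x, y - eta * grad_y
    return bowl_loss(x, y)


print(f"starting loss: {bowl_loss(4.0, 1.0):.3f}")
for eta in (0.01, 0.15, 0.25):
    print(f"eta = {eta:<5} loss after 30 steps = {final_loss(eta):.3e}")
```

On this bowl the smallest rate barely moves along the shallow axis, the middle one zig-zags along the steep axis but still converges, and the largest one diverges because each step overshoots the steep axis by more than it corrects.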

Stochastic gradient descent: don’t read the whole library each step

The true loss is an average over the entire corpus. Computing its exact gradient would mean a forward and backward pass over trillions of tokens for a single update, which is absurd. Instead we estimate the gradient from a mini-batch of sequences sampled from the data. The estimate is noisy, but it is unbiased and millions of times cheaper, and the noise can even help escape bad regions. This is stochastic gradient descent, and one such update is a training step.
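A small experiment shows what "noisy but unbiased" means. This sketch assumes a synthetic one-parameter regression problem (true slope 2, squared-error loss) rather than a language model; the data, batch size, and learning rate are all made up for illustration.

```python
import random

random.seed(0)

# Synthetic data (an assumption for illustration): y = 2x plus a little noise.
xs = [random.uniform(-1.0, 1.0) for _ in range(1000)]
ys = [2.0 * x + random.gauss(0.0, 0.1) for x in xs]


def grad(w, indices):
    """Gradient of the mean squared error of y = w * x over the chosen examples."""
    return sum(2.0 * (w * xs[i] - ys[i]) * xs[i] for i in indices) / len(indices)


# Compare the full-data gradient with mini-batch estimates at the same point.
w0 = 0.0
full_grad = grad(w0, range(len(xs)))

order = list(range(len(xs)))
random.shuffle(order)
batch_size = 10
batch_grads = [grad(w0, order[i:i + batch_size])
               for i in range(0, len(order), batch_size)]
mean_of_batches = sum(batch_grads) / len(batch_grads)

print(f"full gradient:          {full_grad:.4f}")
print(f"mini-batch range:       {min(batch_grads):.4f} to {max(batch_grads):.4f}")
print(f"mean of batch gradients {mean_of_batches:.4f}")

# Stochastic gradient descent: one freshly sampled mini-batch per training step.
w, eta = 0.0, 0.1
for step in range(500):
    batch = random.sample(range(len(xs)), batch_size)
    w -= eta * grad(w, batch)
print(f"slope learned by SGD:   {w:.3f}")
```

Individual mini-batch gradients scatter around the full gradient, but their average over a full shuffled pass lands on it, which is why following them still makes progress.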

A frontier run is hundreds of thousands of these steps. Most models make roughly a single epoch — one pass over a deduplicated corpus — so each token is seen about once. (Why not many passes? Because with enough fresh data, a new token teaches more than a repeated one; we will quantify this with scaling laws.)

Backpropagation: getting a billion derivatives for the price of one pass

The catch is step 3, computing the gradient. We need $\partial \mathcal{L} / \partial \theta_i$ for every parameter, which means billions of derivatives. Computing each one independently would be hopeless. Backpropagation computes them all in a single backward sweep, and it is the algorithm that makes deep learning feasible.

The idea is the chain rule. A neural network is a long composition of simple operations (matrix multiplies, normalizations, nonlinearities). In the forward pass we run inputs through to produce the loss, caching the intermediate activations. In the backward pass we walk the operations in reverse, and at each one multiply the gradient flowing back by that operation’s local derivative. Each layer needs only its cached inputs and the gradient arriving from the layer above, so the backward pass costs roughly as much as two forward passes: a fixed multiple that does not grow with the number of parameters.
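The forward-then-backward pattern can be written out by hand for a tiny network. This is a sketch under simplifying assumptions: two scalar weights, a tanh nonlinearity, and a squared-error loss instead of cross-entropy. The backward function applies the chain rule one operation at a time, and a finite-difference check confirms the result.

```python
import math


def forward(w1, w2, x, y):
    """Forward pass: compute the loss and cache what the backward pass will need."""
    z = w1 * x
    h = math.tanh(z)
    y_hat = w2 * h
    loss = 0.5 * (y_hat - y) ** 2
    cache = (x, h, y_hat)
    return loss, cache


def backward(w2, y, cache):
    """Backward pass: walk the operations in reverse, multiplying local derivatives."""
    x, h, y_hat = cache
    d_y_hat = y_hat - y              # d loss / d y_hat
    d_w2 = d_y_hat * h               # y_hat = w2 * h
    d_h = d_y_hat * w2
    d_z = d_h * (1.0 - h * h)        # tanh'(z) = 1 - tanh(z)^2
    d_w1 = d_z * x                   # z = w1 * x
    return d_w1, d_w2


# Arbitrary illustrative values.
w1, w2, x, y = 0.5, -1.2, 0.8, 1.0
loss, cache = forward(w1, w2, x, y)
d_w1, d_w2 = backward(w2, y, cache)

# Numerical check: nudge each weight and measure the change in loss.
eps = 1e-6
num_w1 = (forward(w1 + eps, w2, x, y)[0] - forward(w1 - eps, w2, x, y)[0]) / (2 * eps)
num_w2 = (forward(w1, w2 + eps, x, y)[0] - forward(w1, w2 - eps, x, y)[0]) / (2 * eps)

print(f"backprop:           d_w1 = {d_w1:.6f}, d_w2 = {d_w2:.6f}")
print(f"finite differences: d_w1 = {num_w1:.6f}, d_w2 = {num_w2:.6f}")
```

The finite-difference check needs two extra forward passes per weight, which is exactly the cost that becomes hopeless at billions of parameters. The backward pass gets both derivatives from one sweep.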

Two consequences of backprop shape everything downstream:

Memory. The backward pass needs the forward activations, so they must be kept in memory until used. At long context length these activations balloon — which is why gradient checkpointing (recomputing activations instead of storing them) becomes essential at scale.
Numerics. Gradients are computed through many multiplications and can become very small or very large. Keeping them representable is the job of the precision machinery — loss scaling, BF16, careful normalization.
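The numerics problem is easy to see with half precision. The sketch below uses Python's `struct` module to round values to IEEE 754 FP16; the gradient value and the scale factor are hypothetical. Real loss scaling multiplies the loss before the backward pass so that every gradient comes out scaled, which this example imitates by scaling one gradient directly.

```python
import struct


def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


tiny_grad = 1e-8      # hypothetical gradient, too small for FP16
scale = 65536.0       # hypothetical loss scale, 2**16

stored_directly = to_fp16(tiny_grad)        # underflows in FP16
stored_scaled = to_fp16(tiny_grad * scale)  # scaled value fits in FP16
recovered = stored_scaled / scale           # unscale in higher precision

print(f"stored directly in FP16: {stored_directly}")
print(f"stored after scaling:    {stored_scaled}")
print(f"recovered after unscale: {recovered:.3e}")
```

BF16 keeps the same exponent range as FP32, so it largely avoids this kind of underflow, at the price of fewer mantissa bits.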

You rarely write backprop by hand; frameworks like PyTorch build a computational graph during the forward pass and differentiate it automatically. Still, knowing that the gradient is cheap to compute but expensive in memory, because of the activations it needs, explains an enormous amount about how large models are actually engineered.
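The compute-for-memory trade behind gradient checkpointing can be sketched on a toy chain of layers. Everything here is an illustrative assumption: scalar tanh layers, the final activation used as the loss, and a checkpoint every four layers. It is not how any particular framework implements the feature.

```python
import math


def layer(w, a):
    """One toy layer: a scalar weight followed by tanh."""
    return math.tanh(w * a)


def backprop_segment(weights, acts, grad_out):
    """Backprop through consecutive layers; acts[i] is the input of layer i."""
    grads = [0.0] * len(weights)
    for i in reversed(range(len(weights))):
        d_pre = grad_out * (1.0 - acts[i + 1] ** 2)  # tanh'(z) = 1 - tanh(z)^2
        grads[i] = d_pre * acts[i]                   # gradient for this weight
        grad_out = d_pre * weights[i]                # gradient for the layer input
    return grads, grad_out


def grads_store_all(weights, x):
    """Baseline: keep every activation from the forward pass."""
    acts = [x]
    for w in weights:
        acts.append(layer(w, acts[-1]))
    grads, _ = backprop_segment(weights, acts, 1.0)  # loss = final activation
    return grads, len(acts)


def grads_checkpointed(weights, x, every):
    """Keep only each segment's input; recompute the rest during backward."""
    checkpoints = {}
    a = x
    for i, w in enumerate(weights):
        if i % every == 0:
            checkpoints[i] = a
        a = layer(w, a)

    grads = [0.0] * len(weights)
    grad_out = 1.0
    peak = 0
    for start in sorted(checkpoints, reverse=True):
        segment = weights[start:start + every]
        acts = [checkpoints[start]]
        for w in segment:                            # recompute this segment only
            acts.append(layer(w, acts[-1]))
        # Values held at once: all checkpoints plus the recomputed activations.
        peak = max(peak, len(checkpoints) + len(acts) - 1)
        seg_grads, grad_out = backprop_segment(segment, acts, grad_out)
        grads[start:start + len(segment)] = seg_grads
    return grads, peak


weights = [0.9 + 0.05 * i for i in range(16)]
full_grads, n_full = grads_store_all(weights, 0.3)
ckpt_grads, n_ckpt = grads_checkpointed(weights, 0.3, every=4)
print(f"activations held: {n_full} (store all) vs {n_ckpt} (checkpointed)")
print("gradients match:", all(abs(a - b) < 1e-12 for a, b in zip(full_grads, ckpt_grads)))
```

The checkpointed version runs each layer's forward computation twice, once in the forward pass and once during recomputation, in exchange for holding fewer activations at any moment.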

With gradients in hand, the remaining question is how to use them well — which is the optimizer.