Section 11

Prefill and decode

The two phases of inference

The naive generation loop from the last section has a particular structure that, once you notice it, divides inference into two very different regimes. The first call processes a long, parallel input (the prompt). Every subsequent call processes exactly one new token. These two regimes — prefill and decode — have utterly different performance characteristics, and every later optimization in this essay is grounded in understanding why.

Prefill: parallel and compute-bound

When a request arrives with a 1,000-token prompt, we need to run the whole prompt through the model once to get to a state where we can start generating the next token. That first pass — called prefill — is happy news for the GPU:

All 1,000 tokens can be processed in parallel. The attention computation sees all of them; the MLP runs on all 1,000 hidden vectors at once.
A single forward pass does ~1,000 tokens’ worth of work per layer.
The matrix multiplications are big — the Q/K/V projections become a (1000, 4096) × (4096, 4096) matmul, which is a tensor-core’s dream.

Prefill is compute-bound: the GPU’s arithmetic units are the bottleneck. On an H100, prefill processes tokens at thousands of tokens per second per request because all that work fits cleanly into matrix multiplies.

The latency you experience as a user during prefill is the time to first token (TTFT): wall-clock from request submitted to first generated token returned. Prefill dominates TTFT for any non-trivial prompt.

Decode: serial and memory-bound

Once prefill is done, we enter the autoregressive loop. Every step:

Take the model’s last hidden state.
Compute logits at that one position via the LM head.
Sample the next token.
Run another full forward pass through all 32 layers, but with input length = 1.

Step 4 is the painful one. The forward pass still requires reading every parameter of the model from HBM — about 16 GB for Llama-3-8B, ~140 GB for 70B — but the matrix multiplies are now tiny: (1, 4096) × (4096, 4096). The matrix units finish in microseconds and then wait. The bottleneck is not the math; it’s how fast bytes can move from HBM into the matrix units.

This is memory-bound behavior, and on an H100 it caps decode throughput at, roughly, the model’s parameter count divided by the GPU’s HBM bandwidth. Llama-3-8B at fp16 = 16 GB; H100 SXM HBM bandwidth = 3.35 TB/s; theoretical max ≈ 3,350 ÷ 16 ≈ 209 tokens/second from a single batched stream. In practice you see numbers in that ballpark (sometimes a bit lower, sometimes — with batching — much higher per-GPU because reading the weights once serves many requests).

The latency per token during decode is the inter-token latency (ITL): the gap between two consecutive output tokens.

Why this split matters

You can see the asymmetry in a back-of-envelope ratio called arithmetic intensity: roughly, FLOPs per byte read. Prefill might do 100s of FLOPs per byte (highly compute-bound). Decode does ~2 FLOPs per byte (highly memory-bound). The GPU’s hardware ratio — its FLOPs-per-byte capability — is hundreds. So in prefill we use almost all the compute we paid for; in decode we use about 1%. That’s the gap inference engines are trying to close.

Inside one inference: prefill, then a stream of decodes

Step through a single request. Watch what changes from prefill to decode — and what stays the same.

Prefill

Sequence so far

Thecatsatonthesoft?

input to this pass just produced will be predicted

Input to this forward pass

6 tokens

All 6 prompt tokens at once, processed in parallel.

Output of this forward pass

1 new token

We only need logits at the last position to sample the next token. Prefill computes logits at every prompt position (training-style), but in inference we throw away all but the last.

Where the time goes this pass (Llama-3-8B on H100, ballpark)

Weight read from HBM4.78 ms · every parameter, once

Tensor-core compute0.10 ms · 2 · 8B · 6 = 96.0B FLOPs

Pass duration

4.78 ms

Compute utilization

2.0%

Tokens/sec

1256

Prefill is compute-bound: with 6 tokens in the batch, the tensor cores have real work to do and weight bandwidth isn't the bottleneck. The whole prompt gets processed in one shot.

Latency vs throughput: TTFT, ITL, throughput

Three metrics define a serving system’s performance:

TTFT (time to first token) — primarily prefill time + queue time. What you notice when you press Enter.
ITL (inter-token latency) — decode time per token, including time waiting for the GPU to finish a batched step. What you notice as the speed of the streaming text.
Throughput (tokens/sec) — total tokens generated per second across all concurrent requests. What the cloud bill cares about.

These three are in tension. Higher throughput usually means bigger batches, which means more queue time (worse TTFT) and slower per-step time (worse ITL). One of the central jobs of a scheduler is to manage this tradeoff in a way that respects whatever Service Level Objectives (SLOs) the service has — e.g. “p99 TTFT < 1 s, p99 ITL < 50 ms, maximize throughput within that envelope.”

What’s wrong with our naive loop

Right now, our generation loop redoes huge amounts of work. To predict token 1,001, we run the model on tokens 1–1,000 (already computed during prefill). To predict token 1,002, we run on 1–1,001 again. To predict 1,003, 1–1,002. By the time we’ve generated 500 output tokens, we’ve done ~500,000 tokens worth of attention math, when we should only need ~500.

There has to be a better way, and there is. The model only ever attends to the past (causal mask). So once we’ve computed the keys and values for token 5, those values will never change as more tokens are appended. We can save them.

The cache of those saved $K$ and $V$ vectors is called the KV cache , and it is the single most important data structure in modern LLM serving. That’s the next section.