Section 17

Chunked prefill

Stop blocking decoders with one big prompt

We’ve made decode efficient (batching), KV memory efficient (paging), and prefix sharing efficient (caching). One serving pathology remains: a single huge prompt blocking everything else. Chunked prefill is the fix.

The problem

Suppose 32 decode requests are running, each happily generating one token per step at, say, 50 ms per step. Now a new request arrives with a 64,000-token prompt. Its prefill has to push all 64k tokens through the model, and the attention part of that compute is quadratic in sequence length (every token attends to every earlier token). On an H100 that prefill might take ~500 ms.

If you naively schedule the prefill as a single forward pass, every one of the 32 decoders is frozen for those 500 ms. Their inter-token latency spikes from 50 ms to 550 ms. Users see the stream stall, then catch up in a sudden burst. SLOs blow up.

Worse, you can’t just lower the priority of the prefill — it has to happen before the new request can generate even its first token, and there is no way to “decode without prefilling first.”

The fix: cut the prefill into bites

Chunked prefill splits a long prompt across multiple forward passes. Instead of one 64k-token prefill, you do, say, 8 chunks of 8k tokens, interleaved with decode steps from the other 32 requests. Each step’s forward pass is:

[ 8k prefill tokens from request A ] + [ 1 decode token each from requests 1..32 ]

All 8,000 + 32 = 8,032 tokens flow through one packed batch. The attention kernel handles the variable-length structure (it has to do this anyway for continuous batching). Per-step latency rises modestly (50 ms might become 80 ms), but no one is frozen anymore; everyone makes progress every step.

After 8 such steps, request A’s prefill is done; it then joins the decoders, generating one token per step like everyone else.
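The arithmetic above can be checked with a few lines of Python. This is only a sketch of the bookkeeping: plan_chunks and step_token_counts are hypothetical helper names, not vLLM functions, and the numbers are the illustrative ones from the text.

```python
def plan_chunks(prompt_len: int, chunk_size: int) -> list[tuple[int, int]]:
    """Split a prompt into (start, end) token ranges, one per forward pass."""
    return [
        (start, min(start + chunk_size, prompt_len))
        for start in range(0, prompt_len, chunk_size)
    ]


def step_token_counts(prompt_len: int, chunk_size: int, num_decoders: int) -> list[int]:
    """Tokens in each packed batch: one prefill chunk plus one token per decoder."""
    return [
        (end - start) + num_decoders
        for start, end in plan_chunks(prompt_len, chunk_size)
    ]


# Numbers from the running example: a 64,000-token prompt, 8k chunks, 32 decoders.
chunks = plan_chunks(64_000, 8_000)
tokens_per_step = step_token_counts(64_000, 8_000, 32)

print(len(chunks))         # number of forward passes needed for the prefill
print(tokens_per_step[0])  # tokens in the first packed batch
```

The last chunk is allowed to be shorter than the others, so prompts that are not a multiple of the chunk size need no special case.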

Mixing prefill and decode kernels

The implementation challenge is that prefill tokens and decode tokens, in the same batch, have very different shapes for attention:

A prefill token at position $p$ attends to positions $0..p$ within its own request (causal-masked).
A decode token attends to all of its request’s cached positions, plus itself.

The vLLM attention kernel handles this by laying out a “per-token position” tensor that records, for every token in the packed batch, which request it belongs to and how many keys/values are visible to it. The same kernel walks the page table and computes attention correctly for both kinds in one fused pass.
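To make the bookkeeping concrete, here is a minimal pure-Python sketch of that kind of per-token metadata. It is not vLLM's actual data layout; BatchEntry, per_token_metadata, and attention_pairs are names invented for illustration. The point it shows is that a decode token can be treated as a prefill chunk of length one, which is why a single kernel can serve both.

```python
from dataclasses import dataclass


@dataclass
class BatchEntry:
    request_id: str
    num_cached: int  # tokens of this request already in the KV cache
    num_new: int     # tokens contributed this step (chunk length, or 1 for decode)


def per_token_metadata(entries: list[BatchEntry]) -> list[tuple[str, int, int]]:
    """For every token in the packed batch: (request_id, position, visible_kv)."""
    out = []
    for e in entries:
        for i in range(e.num_new):
            pos = e.num_cached + i
            # Causal mask: a token at position pos sees keys 0..pos inclusive.
            out.append((e.request_id, pos, pos + 1))
    return out


def attention_pairs(entries: list[BatchEntry]) -> int:
    """Total query-key pairs the kernel evaluates for this batch."""
    return sum(visible for _, _, visible in per_token_metadata(entries))


# One step: the third 4-token chunk of request A, plus one decode token from r1.
step = [BatchEntry("A", num_cached=8, num_new=4), BatchEntry("r1", num_cached=100, num_new=1)]
meta = per_token_metadata(step)

# Chunking does not change the total attention work for a prompt:
# a 16-token prompt costs 16 * 17 / 2 pairs whether done in one pass or four.
one_pass = attention_pairs([BatchEntry("A", 0, 16)])
four_chunks = sum(attention_pairs([BatchEntry("A", c, 4)]) for c in (0, 4, 8, 12))
```

A real kernel stores this per request rather than per token and resolves positions through the page table, but the visibility rule is the same.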

Throughput vs latency, finally meeting

Chunked prefill is the moment in this essay where the textbook tradeoff between throughput and latency becomes adjustable in real time. Three knobs:

Chunk size: smaller = better decoder latency, lower per-chunk compute efficiency.
Max number of decode requests per step: bigger = more throughput, more per-step latency.
Admission policy: do we accept a new long prompt now, or queue it until the decoder population is smaller?
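A toy scheduler makes the three knobs tangible. The sketch below is an assumption-laden illustration, not vLLM's scheduler: PendingPrefill, schedule_step, and the parameter names chunk_size, max_decode_reqs, and admit_max_decoders are all hypothetical. The admission rule used here (start a new prompt only when the decoder population is at or below a threshold, but never pause one already in flight) is just one possible policy.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class PendingPrefill:
    request_id: str
    prompt_len: int
    done: int = 0  # prompt tokens already prefilled


def schedule_step(decoders, prefills, *, chunk_size, max_decode_reqs, admit_max_decoders):
    """Pick the work for one forward pass.

    Returns (decode_ids, chunk), where chunk is (request_id, start, end) or None.
    Mutates `prefills` and `decoders` as prompts progress and finish.
    """
    # Knob 2: cap how many decode requests ride in this step.
    decode_ids = list(decoders[:max_decode_reqs])
    chunk = None
    if prefills:
        head = prefills[0]
        in_flight = head.done > 0
        # Knob 3: admit a new prompt only if few enough decoders are running.
        if in_flight or len(decoders) <= admit_max_decoders:
            # Knob 1: take at most chunk_size prompt tokens this step.
            end = min(head.done + chunk_size, head.prompt_len)
            chunk = (head.request_id, head.done, end)
            head.done = end
            if end == head.prompt_len:
                # Prefill finished: the request decodes from the next step on.
                prefills.popleft()
                decoders.append(head.request_id)
    return decode_ids, chunk


decoders = [f"r{i}" for i in range(32)]
prefills = deque([PendingPrefill("A", 20_000)])
knobs = dict(chunk_size=8_000, max_decode_reqs=32, admit_max_decoders=32)
first_ids, first_chunk = schedule_step(decoders, prefills, **knobs)
```

The sketch always takes the first max_decode_reqs decoders, so a real scheduler would also need some rotation or priority rule to avoid starving the rest.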

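A good scheduler tunes these continuously based on current SLO measurements. In vLLM the first two are exposed as flags to “vllm serve”: --max-num-batched-tokens caps the tokens per step, and with it the prefill chunk size, while --max-num-seqs caps the number of requests per step. Good values for a given workload mix can be found offline by benchmarking candidate settings.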
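An offline search over chunk size can be sketched with a toy cost model. Everything here is an assumption made for illustration: step latency is taken to be linear in chunk size and calibrated to the example numbers above (50 ms with no prefill, 80 ms with an 8k chunk), which ignores the growing attention cost of later chunks. The function names are hypothetical, and a real search would measure latency on a live server rather than predict it.

```python
import math


def step_latency_ms(chunk_tokens: int, base_ms: float = 50.0, extra_ms_at_8k: float = 30.0) -> float:
    """Toy model: decode-only step cost plus a linear cost per prefill token."""
    return base_ms + extra_ms_at_8k * chunk_tokens / 8_000


def pick_chunk_size(prompt_len: int, candidates: list[int], itl_slo_ms: float):
    """Return (chunk_size, ttft_ms) minimizing time-to-first-token under the ITL SLO.

    Returns None if no candidate keeps per-step latency within the SLO.
    """
    best = None
    for chunk in candidates:
        latency = step_latency_ms(chunk)
        if latency > itl_slo_ms:
            continue  # decoders would miss their inter-token latency target
        ttft = math.ceil(prompt_len / chunk) * latency
        if best is None or ttft < best[1]:
            best = (chunk, ttft)
    return best


candidates = [1_000, 2_000, 4_000, 8_000, 16_000]
loose = pick_chunk_size(64_000, candidates, itl_slo_ms=80.0)
tight = pick_chunk_size(64_000, candidates, itl_slo_ms=70.0)
```

Under this model a tighter decoder SLO forces a smaller chunk and a longer time to first token for the long prompt, which is the tradeoff the chunk-size knob controls.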

The corollary: prefill-decode disaggregation

For very large deployments, the asymmetry between prefill (compute-bound, big matmuls) and decode (memory-bound, tiny matmuls) is so stark that some teams run them on different GPUs entirely:

Prefill GPUs: optimized for high compute, smaller HBM. Receive prompts, do prefill, emit KV cache.
Decode GPUs: optimized for high HBM bandwidth, larger HBM. Receive KV cache, do decode.

The KV cache has to be shipped from prefill GPU to decode GPU (NVLink or RDMA) when prefill finishes — adding latency but unlocking much higher throughput per dollar. vLLM supports this as “P/D disaggregation” via the KVConnector interface. It’s the most production-grade variant of the prefill/decode split we’ve been treating as just two phases of one process.
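A back-of-the-envelope calculation shows why that transfer is not free. The sketch below assumes a hypothetical model shape (32 layers, 8 KV heads, head dimension 128, 2-byte elements) and a hypothetical 50 GB/s link; none of these numbers come from the text, so substitute your own model and interconnect.

```python
def kv_cache_bytes(num_tokens: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Size of the KV cache for one request.

    The leading 2 accounts for storing one key and one value vector
    per token, per layer, per KV head.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * num_tokens


def transfer_seconds(num_bytes: int, link_bytes_per_s: float) -> float:
    """Idealized transfer time: payload size over link bandwidth, no overheads."""
    return num_bytes / link_bytes_per_s


# The 64,000-token prompt from the running example, with the assumed model shape.
size = kv_cache_bytes(64_000, num_layers=32, num_kv_heads=8, head_dim=128)
seconds = transfer_seconds(size, link_bytes_per_s=50e9)

print(f"{size / 1e9:.2f} GB, {seconds * 1e3:.0f} ms at 50 GB/s")
```

Under these assumptions the cache is several gigabytes and takes on the order of a hundred milliseconds or more to move, which is why the choice of interconnect matters for disaggregated serving.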

We’ve now covered every batching, memory, and scheduling trick in vLLM’s main playbook. There’s one more category, orthogonal to all of the above: making the autoregressive decode loop generate more than one token per forward pass. That’s speculative decoding.