Continuous batching
Stop wasting GPU steps
The previous section established that a single request only uses ~1% of the GPU’s compute during decode. The fix is obvious: run many requests at once, so a single read of the weights serves many tokens. A group of requests processed together is a batch. The interesting question is: how do you assemble those batches?
There’s a naive answer and a much better answer. They are far enough apart in throughput that the better one, continuous batching, basically defines the difference between “a personal-use server” and “a production-grade inference engine.”
Static batching: the naive answer
The simplest scheme: take the first N requests in the queue, run them all through the model together, wait for all of them to finish, then take the next N. This is called static batching.
Static batching has two problems, both severe:
- Slow requests stall the batch. If your batch contains one request that wants 8 tokens of output and three that want 200, the 8-token request finishes after step 8, and then its GPU slot sits idle for the next 192 steps waiting for the others. For those 192 steps the GPU does 3/4 useful work instead of 4/4.
- New arrivals wait for the next batch. A request that shows up while the batch is mid-flight has to wait for the entire batch to drain before it gets to start. At a load of 100 requests with an average output length of 200 tokens, a new arrival can wait hundreds of decode steps, or several full batch drains if the queue is deep, before its prefill even begins, so the worst-case TTFT is terrible.
Production systems used to ship with this. People complained.
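To put a number on the idle-slot problem, here is a small sketch that computes the fraction of slot-steps doing useful work in one static batch. The function name is made up for illustration, and it assumes every live request emits exactly one token per decode step.

```python
def static_batch_utilization(output_lengths):
    """Fraction of slot-steps that produce a token in one static batch."""
    steps = max(output_lengths)      # the batch runs until its longest request finishes
    useful = sum(output_lengths)     # each request emits one token per step while alive
    return useful / (steps * len(output_lengths))


# The example above: one 8-token request batched with three 200-token requests.
print(static_batch_utilization([8, 200, 200, 200]))  # 0.76
```

The short request's slot is idle for 192 of the 200 steps, so roughly a quarter of the batch's decode capacity is thrown away.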
Continuous batching: drop in, drop out, every step
The fix, proposed by the Orca paper and popularized by vLLM, is to make the batch dynamic at every decode step:
- At every decode step, check which requests have finished (hit EOS or max length).
- Evict the finished requests from the batch.
- Admit waiting requests into the now-empty slots.
- Run the next decode step with the new batch composition.
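The four steps above can be sketched as a toy simulation. This is a decode-only model with hypothetical names, not vLLM's scheduler: each request is reduced to the number of tokens it still has to generate (assumed to be at least 1), and prefill cost is ignored.

```python
from collections import deque


def run_continuous(output_lengths, num_slots):
    """Simulate continuous batching; return (total_steps, idle_slot_steps)."""
    waiting = deque(output_lengths)  # tokens still to generate, in arrival order
    running = []                     # remaining tokens of in-flight requests
    steps = idle = 0
    while True:
        # Steps 1-2: find and evict requests that finished on the previous step.
        running = [r for r in running if r > 0]
        # Step 3: admit waiting requests into the free slots.
        while waiting and len(running) < num_slots:
            running.append(waiting.popleft())
        if not running:
            break
        # Step 4: one decode step, one token per running request.
        running = [r - 1 for r in running]
        steps += 1
        idle += num_slots - len(running)
    return steps, idle


# Four slots; a fifth request is queued behind the first four.
print(run_continuous([8, 200, 200, 200, 192], num_slots=4))  # (200, 0)
```

In this workload the queued request takes over the 8-token request's slot at step 9, so all five requests finish in 200 steps with no idle slots. Static batching with N = 4 would need 200 + 192 = 392 steps for the same work.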
The key implementation detail is that prefill and decode can be mixed in the same step. A new request needs prefill (a long sequence of new tokens) while old requests need decode (one new token each). With the right kernel design, which flattens each request's tokens for this step (many for a prefill, one for a decode) into a single packed batch with a per-token position mask, both can happen in the same forward pass. vLLM ships exactly this kernel.
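As a rough illustration of what "one packed batch" means, the sketch below flattens a mixed prefill and decode step into parallel per-token lists. The layout and names (`pack_step`, `owners`) are assumptions made for this example and are not vLLM's actual data structures. A real kernel would use the position and ownership information to keep attention inside each request.

```python
def pack_step(requests):
    """Flatten one step's work into per-token lists.

    requests: list of (request_id, new_token_ids, num_cached_tokens).
    A prefill contributes many new tokens, a decode contributes one.
    """
    token_ids, positions, owners = [], [], []
    for request_id, new_tokens, num_cached in requests:
        for offset, tok in enumerate(new_tokens):
            token_ids.append(tok)
            positions.append(num_cached + offset)  # position within its own sequence
            owners.append(request_id)              # which sequence the token belongs to
    return token_ids, positions, owners


step = [
    ("new", [11, 12, 13, 14], 0),  # prefill: 4 prompt tokens, nothing cached yet
    ("old-a", [57], 9),            # decode: 1 new token, 9 tokens already cached
    ("old-b", [92], 30),           # decode: 1 new token, 30 tokens already cached
]
token_ids, positions, owners = pack_step(step)
print(token_ids)   # [11, 12, 13, 14, 57, 92]
print(positions)   # [0, 1, 2, 3, 9, 30]
```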
Try it
Below is a simplified Gantt chart: 4 GPU slots running 8 requests with different output lengths and arrival times. Toggle between static and continuous to see what changes.
Notice in static batching how much “gray” idle area accumulates — those are decode steps that were paid for in HBM traffic but produced nothing. Continuous batching squeezes most of that gray out.
The role of the scheduler
A scheduler is the component that decides, every step, which requests to run. vLLM’s scheduler maintains two queues:
- WAITING: requests that arrived but haven’t started prefill yet.
- RUNNING: requests currently being decoded (or partially prefilled in a chunked-prefill setup).
At each step:
- Compute the available KV cache budget (how many free pages remain).
- Try to admit waiting requests until the budget is exhausted or a policy says stop.
- If memory is tight, preempt some RUNNING requests — either recompute their KV from scratch later, or swap them out to host RAM.
- Build the per-step batch from the surviving RUNNING set + admitted prefills.
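A minimal sketch of one such scheduling step follows, under several assumptions: FCFS admission, recompute-style preemption of the most recently admitted request, and a page size of 16 tokens. `Request`, `schedule_step`, and `PAGE_SIZE` are hypothetical names, not vLLM's API. For simplicity the sketch makes room for the running set first and only then admits, so admissions never push the batch over budget.

```python
from collections import deque
from dataclasses import dataclass
from math import ceil

PAGE_SIZE = 16  # tokens per KV page (assumed value)


@dataclass
class Request:
    rid: str
    num_tokens: int  # prompt tokens + tokens generated so far

    def pages_needed(self, extra_tokens=0):
        return ceil((self.num_tokens + extra_tokens) / PAGE_SIZE)


def schedule_step(waiting, running, total_pages, max_batch_size):
    """Mutate `waiting` (deque) and `running` (list); return (admitted, preempted)."""

    def pages_in_use():
        # Every running request needs room for one more token this step.
        return sum(r.pages_needed(extra_tokens=1) for r in running)

    # Memory is tight: preempt the newest requests until the rest fit.
    preempted = []
    while running and pages_in_use() > total_pages:
        victim = running.pop()
        waiting.appendleft(victim)  # KV dropped; it is recomputed when re-admitted
        preempted.append(victim)

    # Admit FCFS until the page budget or the batch-size cap is hit.
    admitted = []
    free = total_pages - pages_in_use()
    while waiting and not preempted and len(running) < max_batch_size:
        need = waiting[0].pages_needed(extra_tokens=1)
        if need > free:
            break
        request = waiting.popleft()
        running.append(request)
        admitted.append(request)
        free -= need
    return admitted, preempted


waiting = deque([Request("c", 40), Request("d", 10)])
running = [Request("a", 31), Request("b", 15)]
admitted, preempted = schedule_step(waiting, running, total_pages=6, max_batch_size=8)
print([r.rid for r in admitted], [r.rid for r in preempted])  # ['c'] []
```

After this call the per-step batch is simply `running`: the surviving requests plus the newly admitted prefill. Request "d" stays in the waiting queue because no free pages remain.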
There are several knobs here:
- Policy: FCFS (first-come-first-served), priority, fairness.
- Max batch size: a hard upper bound to keep latency bounded.
- Max KV memory utilization: the fraction of HBM the cache pool can take.
- Chunked prefill (§17): split a single huge prefill across multiple steps to avoid blocking decoders.
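To make the chunked-prefill knob concrete, here is a small sketch of how a long prompt might be spread across steps. It assumes a fixed per-step token budget and a policy in which running decoders claim one token each before the prefill gets the remainder. Both the policy and the function name are assumptions for illustration. §17 covers the real mechanism.

```python
def chunk_schedule(prompt_len, num_decoding, token_budget):
    """Return the prefill chunk sizes used on successive steps."""
    per_step = token_budget - num_decoding  # decoders take one token each first
    if per_step <= 0:
        raise ValueError("token budget leaves no room for prefill")
    chunks = []
    remaining = prompt_len
    while remaining > 0:
        chunk = min(remaining, per_step)
        chunks.append(chunk)
        remaining -= chunk
    return chunks


# A 5000-token prompt arriving while 48 requests are decoding.
print(chunk_schedule(prompt_len=5000, num_decoding=48, token_budget=2048))
# [2000, 2000, 1000]
```

The 48 decoders keep producing a token on each of those three steps instead of stalling behind one 5000-token forward pass.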
The scheduler is where SLOs are enforced. If your service promises p99 TTFT < 1 s, the scheduler is the thing that has to actually do something when that target is at risk: preempt low-priority work, refuse new admissions, or fall back to a lighter sampling strategy.
Why batching alone isn’t enough
Continuous batching gets you most of the GPU back. But there’s still a subtle problem: the KV cache itself is contiguous in memory per-request. When a request finishes and its slot is reused for a new request, what happens to the old request’s KV? When the new request grows, what happens if it needs more memory than the old request’s old slot? You end up either pre-reserving enormous contiguous chunks (wasting memory) or moving things around (wasting time).
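A back-of-the-envelope calculation shows how much the pre-reservation option can cost. The model dimensions below (32 layers, 8 KV heads, head dimension 128, 2-byte values) and the request lengths are made-up illustrative numbers, and the helper names are hypothetical.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    # Factor of 2: one key vector and one value vector per layer and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def contiguous_waste(actual_lengths, max_len, bytes_per_token):
    """Reserve max_len tokens of contiguous KV per request; report the waste."""
    reserved = len(actual_lengths) * max_len * bytes_per_token
    used = sum(actual_lengths) * bytes_per_token
    return reserved, used, 1 - used / reserved


per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
reserved, used, wasted = contiguous_waste([300, 1200, 500, 2000], 4096, per_token)
print(per_token)           # 131072 bytes, i.e. 128 KiB per token
print(reserved / 2**30)    # 2.0 GiB reserved
print(used / 2**20)        # 500.0 MiB actually used
print(round(wasted, 3))    # 0.756
```

With these numbers about three quarters of the reserved KV memory is never touched, and that memory could otherwise have held more concurrent requests.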
This is the problem vLLM’s signature contribution — paged attention — was invented to solve. That’s the next section.