Section 16

Prefix caching

Shared system prompts for free

In any real deployment, lots of requests start with exactly the same tokens. A chat assistant prepends the same system prompt to every conversation. A code-completion service prepends a header file every time. A few-shot classifier sends the same handful of examples before every query. In each case, the first several hundred or thousand tokens are byte-identical across requests.

If two requests share their first 1,000 tokens and we are doing prefill from scratch for each one, we’re paying for the same 1,000-token forward pass twice. The KV cache for those first 1,000 tokens would be byte-identical too — the math is deterministic and depends only on the input tokens. Why not just share it?

That’s prefix caching prefix caching Sharing KV pages across requests that start with the same tokens (system prompts, few-shot prefixes), so the prefill is computed once. See in glossary → , and once you have paged attention, it falls out almost for free.

Sharing pages instead of copying them

Here is the recipe:

  1. Hash each KV-cache page by the token IDs that produced it (and any earlier prefix it depends on — this matters because each block’s KV depends on all previous blocks via attention). The hash of block ii is something like hash(hash(block_{i-1}), tokens_in_block_i).

  2. When a new request arrives, hash its prompt block-by-block.

  3. For each block, look up the hash in a global table mapping hash → physical block. If a hit, point the new request’s page table at the existing physical block instead of allocating a new one. The block is now shared — multiple page tables reference it.

  4. Use reference counts to free shared blocks: a block is only returned to the free pool when its refcount drops to zero.

  5. On the first block where the hashes diverge (the system prompt is identical but the user message is different), the new request branches off and allocates fresh blocks from there on.

The implementation is exactly the OS-y trick of copy-on-write, except writes never happen here (the cache is append-only by design), so it’s really just “share-on-read.”

Why this is huge

Suppose your service has a 1,000-token system prompt and serves 100 concurrent users. Without prefix caching: 100 separate 1,000-token prefills + 100 × 1,000 tokens of KV cache. With prefix caching: 1 prefill (the first request triggers it), then 99 “skip the prefix, here are pre-built blocks” admissions. The 99 follow-on requests get their first generated token in milliseconds instead of seconds.

  • TTFT drops dramatically for the cache-hit fraction of requests.
  • KV cache memory pressure drops: 1× shared prefix instead of N×.
  • Total throughput rises because GPU cycles previously spent on duplicate prefills are now available for decode.

Try it

Toggle prefix caching on in the allocator below, then add a couple of “shared prefix” requests in a row. Watch the same physical blocks appear in multiple page tables, with the teal “shared” highlight.

Paged attention allocator
24 physical blocks of 4 tokens each. Add requests, step decoding, finish them — watch the page table track everything.
Physical KV blocks
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
Per-request page tables
Click "+ Request" to start.
Used / total
0 / 24
Utilization
0%
Wasted (internal frag)
0.0%
Prefix caching ON: when you add a "shared prefix" request, the first 5 prefix tokens reuse existing blocks (look for the teal outline). The same KV pages serve multiple requests — the prefill is computed once.

A subtle thing happens here: the saving isn’t just memory. The prefill computation for those shared blocks is also skipped — the second request that shares a prefix gets to “jump in” mid-prefix, with the KV values already populated by the first request’s earlier work.

Trickier sharing patterns

The basic scheme assumes “requests share a left-anchored prefix.” But prefix caching also benefits:

  • Tree-structured generation (e.g. beam search, best-of-N sampling, agentic tree search): a single request branches into multiple continuations. Each continuation shares the common ancestor’s KV.
  • System prompt + few-shot examples + user message: in some deployments, the cached prefix has nested layers — every user shares the system prompt; users in the same workflow share the same few-shot examples; the last few tokens diverge per user. Multiple prefix levels can be cached.
  • Document-prefilled RAG: if your retrieved context is shared across many users (a popular doc), its KV can be cached.

In all cases, hashed page-level sharing handles them uniformly. You don’t need to know in advance what’s reusable; you just hash blocks as you create them.

Limits of prefix caching

A few caveats:

  • The prefix has to be exactly the same. One token of difference in the system prompt means cache miss.
  • Prefix caching doesn’t help decode. Once you’re past the cache hit and into per-request generated tokens, every request is unique.
  • Position encoding matters. With RoPE RoPE Rotary Position Embeddings — rotates Q/K vectors by an angle proportional to position. Standard in modern LLMs. See in glossary → , the cached keys have already been rotated for their absolute position. Two requests can share KV at position 0–N only if they would have rotated to the same angles — which they do, because both started at position 0. Fortunately, this is the case in every practical scenario.

We’ve now made KV memory efficient (paging) and shareable (prefix caching). There’s one more thing the scheduler has to be smart about: what happens when a single request shows up with a 64,000-token prompt and would dominate the GPU for half a second of pure prefill, blocking every other decoder. That’s the next section: chunked prefill.