Section 18

Speculative decoding

Draft fast, verify in bulk

Every trick in the previous three sections improves batching. None of them changes the fundamental rule that decode produces one token per forward pass through the model. Speculative decoding challenges that rule directly. By using a tiny “draft” model to guess several future tokens and verifying them all in one big-model pass, we can generate 2–4× more tokens per target-model step on many workloads, with no change to the output distribution. The only price is the extra compute spent on the draft model and on verifying guesses that end up rejected.

It is one of those rare ideas that feels too good to be true and turns out to actually be true.

The asymmetry it exploits

Recall from §11/§13 that decode is memory-bound: the GPU spends roughly 99% of its time reading the weights and roughly 1% doing math. So a forward pass over 1 input token and a forward pass over, say, 8 input tokens take almost the same amount of time. The dominant cost is the read, and you can do 8 tokens’ worth of math during one read just as easily as 1.

So if we could somehow propose 8 candidate future tokens and check whether the target model would have generated each of them, we could verify all 8 in a single target-model forward pass at almost the same cost as decoding 1 token. That’s the entire premise.
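The asymmetry can be made concrete with a back-of-the-envelope cost model. The sketch below is not a benchmark: all four constants are round numbers assumed purely for illustration (they do not describe any specific GPU or model), and the hypothetical `step_time` helper ignores KV-cache reads, activations, and attention cost.

```python
# Illustrative round numbers only; they do not describe any specific GPU or model.
PARAMS = 8e9            # model parameters (assumed)
BYTES_PER_PARAM = 2     # 16-bit weights (assumed)
MEM_BANDWIDTH = 2e12    # bytes/s of memory bandwidth (assumed)
PEAK_FLOPS = 4e14       # FLOP/s of matrix-math throughput (assumed)


def step_time(n_tokens: int) -> float:
    """Estimated seconds for one forward pass over n_tokens input tokens."""
    read_time = PARAMS * BYTES_PER_PARAM / MEM_BANDWIDTH  # weights are read once per pass
    math_time = 2 * PARAMS * n_tokens / PEAK_FLOPS        # ~2 FLOPs per parameter per token
    return max(read_time, math_time)                      # the slower of the two dominates


for n in (1, 8, 64, 1024):
    print(f"{n:5d} tokens -> {step_time(n) * 1e3:.2f} ms")
```

Under these assumed numbers, passes over 1, 8, and 64 tokens all cost the same 8 ms, because the weight read dominates. Only the 1024-token pass becomes compute-bound, at about 41 ms.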

The protocol

Two models cooperate:

  • Target model — the big one you actually want to sample from (Llama-3-70B, say).
  • Draft model — a much smaller, faster model that approximates the target (e.g. Llama-3-8B, or a custom-trained 1B model).

Each iteration:

  1. Draft proposes K tokens. Run the draft model autoregressively for K steps starting from the current context. This produces K candidate tokens with their probabilities.

  2. Target verifies. Run the target model ONCE on the (current context + K candidates) as input. The target produces logits at each of the K positions in parallel, because this is just a prefill of K tokens.

  3. Accept / reject. Walk through the K candidates left to right. At each position, compare the draft’s probability for that token to the target’s probability. Accept stochastically with probability $\min(1, p_{\text{target}} / p_{\text{draft}})$. The first time you reject, stop. (Importantly: when you reject, you also get the target’s correction for free, sampled from a corrected residual distribution. If all K candidates are accepted, the target’s logits after the last candidate supply one bonus token instead. Either way, the number of new tokens per step is at least 1 and at most K + 1.)

  4. Add the accepted prefix + the corrective token to the context. Repeat.

Step 3 is the magic step. Done correctly (the “rejection sampling” version), the output distribution is provably identical to plain sampling from the target. There is no quality loss. The whole thing is a way to do the same sampling, just faster.
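The accept/reject walk in step 3 can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not any library's implementation: the `sample` and `verify` names are hypothetical, distributions are plain lists of probabilities over a toy vocabulary, and each draft token is assumed to have been sampled from its draft distribution (so its draft probability is nonzero).

```python
import random


def sample(weights, rng):
    """Draw an index with probability proportional to weights."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def verify(draft_tokens, draft_dists, target_dists, rng):
    """Return the tokens to append after one draft-then-verify step.

    draft_dists[i] and target_dists[i] are full next-token distributions at
    drafted position i; target_dists has one extra entry for the position
    after the last drafted token.
    """
    accepted = []
    for i, tok in enumerate(draft_tokens):
        p_target, p_draft = target_dists[i][tok], draft_dists[i][tok]
        if rng.random() < min(1.0, p_target / p_draft):
            accepted.append(tok)  # keep the draft token and move on
            continue
        # Rejected: resample from the residual max(0, target - draft), then stop.
        residual = [max(0.0, t - d) for t, d in zip(target_dists[i], draft_dists[i])]
        accepted.append(sample(residual, rng))
        return accepted
    # Every draft token survived: take one bonus token from the target.
    accepted.append(sample(target_dists[len(draft_tokens)], rng))
    return accepted


rng = random.Random(0)
same = [[0.5, 0.5]] * 3
# Identical draft and target: both drafts are accepted, plus one bonus token.
print(len(verify([0, 1], same[:2], same, rng)))  # 3
```

The residual is what keeps the output distribution equal to the target's: the mass the draft over-proposed is thrown away on rejection, and the correction is drawn only from the mass the draft under-proposed.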

What determines speedup

Two factors:

  1. Acceptance rate. If the draft model is well-aligned with the target — i.e. it tends to agree with what the target would have generated — the accepted prefix is long and you get many tokens per target step. EAGLE-style drafts achieve 70–85% acceptance on typical chat workloads. Cheap n-gram drafts (just predict the next token from a Markov chain over the context) sit around 20–40%. The higher the acceptance, the bigger the win.

  2. Draft cost. Running the draft model has its own forward-pass cost. If that cost is large relative to a target step, the math stops working. You want the draft model to be cheap — typically less than 10–15% of a target-model forward pass per token. A 1B model paired with a 70B target is great; a 7B paired with a 70B is borderline.

The net speedup is roughly:

$$\text{speedup} \approx \frac{\text{avg new tokens per step}}{1 + K \cdot \text{draft cost ratio}}$$
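The numerator can be tied to the acceptance rate with a common simplification. As a sketch, assume that every drafted token is accepted independently with the same probability $\alpha$ (an assumption made here for illustration; real acceptance varies by position and by prompt), and write $c$ for the draft cost ratio. The $i$-th drafted token is kept only if all earlier ones were, and the correction or bonus token always adds one more:

$$\mathbb{E}[\text{new tokens per step}] = \sum_{i=0}^{K} \alpha^{i} = \frac{1 - \alpha^{K+1}}{1 - \alpha}, \qquad \text{speedup} \approx \frac{1 - \alpha^{K+1}}{(1 - \alpha)(1 + K c)}$$

For example, $\alpha = 0.7$, $K = 4$, $c = 0.1$ gives $(1 - 0.7^{5}) / 0.3 \approx 2.77$ tokens per step and a speedup of about $2.77 / 1.4 \approx 1.98\times$.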

Try it

The widget below simulates the loop. Move the sliders for K, acceptance rate, and draft cost to see how speedup changes.

[Interactive widget: Speculative decoding timeline. Each step, the draft model proposes K tokens; the target model verifies them all in one pass. Tokens up to the first disagreement get accepted; the target's correction is appended for free. Example run with K = 4: 8 target-model steps yield 3, 4, 5, 1, 1, 2, 1, and 3 new tokens, for 20 tokens in total, an average of 2.50 tokens per step, and an effective speedup of 1.79× over baseline.]
The dial that matters most is acceptance rate. Higher = the draft model is well-aligned with the target. EAGLE-style drafts reach 70–85% on many workloads; cheap n-gram drafts manage 20–40%. The draft cost is the floor: speedup only happens when the per-step cost overhead is less than the average tokens-per-step won.

A few things worth seeing:

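  • At K = 4, acceptance = 70%, draft cost = 10%, you get roughly 2.5 to 2.8 new tokens per target step. Dividing by the step cost of 1 + 4 × 0.1 = 1.4 gives a speedup of about 1.8–2×. Real systems often match this.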
  • Push K too high and you start rejecting most of the back of the chain; the marginal token is rarely accepted, and the draft cost adds up. There’s a sweet spot per workload, usually K = 4–6.
  • Push acceptance below ~25% and you actually slow down — the draft costs more than it saves.
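These three observations can be reproduced numerically. The sketch below assumes each drafted token is accepted independently with the same probability, so the expected tokens per step is (1 − α^(K+1)) / (1 − α). That independence is a simplifying assumption, and the function names are hypothetical.

```python
def expected_tokens(k: int, accept: float) -> float:
    """Expected new tokens per target step, assuming i.i.d. acceptance."""
    if accept >= 1.0:
        return k + 1.0
    return (1.0 - accept ** (k + 1)) / (1.0 - accept)


def speedup(k: int, accept: float, draft_cost: float) -> float:
    """Tokens per step divided by step cost, in units of one target pass."""
    return expected_tokens(k, accept) / (1.0 + k * draft_cost)


def best_k(accept: float, draft_cost: float, max_k: int = 16) -> int:
    """Draft length in 1..max_k with the highest modeled speedup."""
    return max(range(1, max_k + 1), key=lambda k: speedup(k, accept, draft_cost))


for accept in (0.25, 0.5, 0.7, 0.8):
    k = best_k(accept, 0.10)
    print(f"accept={accept:.2f}  best K={k}  speedup={speedup(k, accept, 0.10):.2f}x  "
          f"(at K=4: {speedup(4, accept, 0.10):.2f}x)")
```

With a 10% draft cost, this model puts the best K at 4 for 70% acceptance and at 6 for 80%. At 25% acceptance and K = 4 the modeled speedup is about 0.95×, a net slowdown, although a shorter draft can still come out slightly ahead under the same assumptions.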

Draft model designs

The drafts are where most of the recent research lives:

  • n-gram drafts: no neural model at all; just predict the next token from frequency tables over the context. Near-zero draft cost, low acceptance, surprisingly often net-positive.

  • Smaller-model drafts: a smaller version of the same architecture. Easy to set up, but acceptance is mediocre because the small model’s distribution differs from the target’s in ways that matter.

  • EAGLE / EAGLE-2 / EAGLE-3: train a tiny draft head that predicts the target model’s hidden states (feature vectors, not tokens) and uses tree-structured branching. Very high acceptance rates; the current state of the art for most workloads.

  • Medusa: bolt several “Medusa heads” onto the target model itself, each predicting a future token in parallel. No separate draft model required. Slightly lower acceptance than EAGLE but no two-model coordination.

  • Token-level tree drafts: propose not one chain of K tokens but a tree of many candidate continuations at once, verified together. Boosts the expected accepted-prefix length at moderate extra cost.

vLLM ships built-in support for n-gram, EAGLE, and Medusa-style speculation.
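To show how little machinery the simplest design in the list needs, here is a minimal sketch of an n-gram draft. It is one possible lookup-style variant, not vLLM's implementation: the hypothetical `ngram_draft` function finds the most recent earlier occurrence of the last `n` tokens in the context and proposes whatever followed them.

```python
def ngram_draft(context, k, n=2):
    """Propose up to k tokens by copying what followed the last n tokens earlier on."""
    if len(context) <= n:
        return []
    tail = context[-n:]
    # Scan backwards so the most recent earlier match wins.
    for start in range(len(context) - n - 1, -1, -1):
        if context[start:start + n] == tail:
            return context[start + n:start + n + k]
    return []  # no match: propose nothing and fall back to plain decoding


tokens = "the cat sat on the cat".split()
print(ngram_draft(tokens, k=3))  # ['sat', 'on', 'the']
```

The draft costs a list scan rather than a forward pass, which is why even a low acceptance rate can pay off on repetitive text such as code or quoted documents.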

Where speculative decoding fits in the bigger picture

The trick changes the balance between the three serving metrics (§11):

  • Throughput: improves, often substantially, as long as the target step has spare compute (see the batching note below), since you push more tokens through the same target-model step.
  • ITL: improves (each new token arrives faster on average).
  • TTFT: unchanged — speculation kicks in after prefill.

It also interacts with batching. Speculation works best with smaller batches, where the target step still has spare compute for the K verification tokens of each request. At very high batch sizes the target-model step is already compute-bound and there’s no spare capacity to amortize, so speculation gains shrink. vLLM’s scheduler can switch speculation on and off dynamically based on current batch fullness.
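The batching interaction can be illustrated with a simple cost model: a step costs whichever is slower, reading the weights or doing the math. As before, this is a sketch under assumed round numbers rather than a measurement. The constants describe no specific GPU or model, the helper names are hypothetical, and KV-cache and attention costs are ignored.

```python
# Illustrative round numbers only; they do not describe any specific GPU or model.
PARAMS = 8e9            # model parameters (assumed)
BYTES_PER_PARAM = 2     # 16-bit weights (assumed)
MEM_BANDWIDTH = 2e12    # bytes/s (assumed)
PEAK_FLOPS = 4e14       # FLOP/s (assumed)


def step_time(n_tokens):
    """Estimated seconds for one forward pass over n_tokens tokens in total."""
    read_time = PARAMS * BYTES_PER_PARAM / MEM_BANDWIDTH
    math_time = 2 * PARAMS * n_tokens / PEAK_FLOPS
    return max(read_time, math_time)


def verify_cost_ratio(batch, k):
    """Cost of a verify step (k + 1 positions per request) relative to a plain decode step."""
    return step_time(batch * (k + 1)) / step_time(batch)


for batch in (1, 8, 32, 128, 512):
    print(f"batch={batch:4d}  verify step costs {verify_cost_ratio(batch, 4):.2f}x a plain step")
```

Under these assumptions a K = 4 verify step is free (1.0×) up to a batch of 32, costs 3.2× a plain step at batch 128, and reaches the full 5× at batch 512. At that point verifying K + 1 positions costs as much as decoding them one at a time, so nothing is left to win.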

We have now covered every major optimization in modern single-GPU inference. The remaining sections look at what happens when one GPU isn’t enough.