Section 10

Sampling

Logits → the next token

We have logits, one vector of 128,256 numbers, sitting at the last position of the model’s output. We need to turn this into a single chosen token to append to the sequence and feed back in. That choice is called sampling: choosing the next token from the logits by greedy argmax, temperature scaling, top-k, top-p, and so on. How we sample materially shapes what the model feels like.

Greedy decoding: just pick the max

The simplest possible strategy: take the argmax. Whichever token has the highest logit is the chosen token.

next_token = logits.argmax()

Greedy decoding is fully deterministic — same prompt always yields the same completion. It’s also great when there is one obviously correct answer (factual lookup, code completion of a syntactically constrained snippet). It’s bad when there are many reasonable continuations, because it always picks the model’s most confident guess, which can lead to repetitive, bland, or stuck-in-a-loop text.
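As a minimal sketch of the determinism claim, the snippet below runs greedy selection several times over the same logits. The five-token vocabulary and its logit values are hypothetical, chosen only for illustration, and `greedy_pick` is a name introduced here rather than part of any library.

```python
def greedy_pick(logits):
    """Return the index of the largest logit (ties go to the lowest index)."""
    return max(range(len(logits)), key=lambda i: logits[i])

# Hypothetical 5-token vocabulary and logits, for illustration only.
vocab = ["mat", "bed", "cushion", "couch", "ground"]
logits = [8.2, 6.5, 5.9, 5.5, 4.8]

# No randomness is involved, so repeated calls always agree.
picks = [vocab[greedy_pick(logits)] for _ in range(3)]
print(picks)  # ['mat', 'mat', 'mat']
```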

Temperature scaling: sharpen or flatten the distribution

Before applying softmax, divide the logits by a number $T$ called the temperature:

$$P(v) \propto \exp(\text{logit}_v / T)$$

  • $T = 1$: leaves the distribution unchanged.
  • $T < 1$: sharpens (the top tokens get even more probability mass, the long tail gets squashed). $T \to 0$ recovers greedy decoding.
  • $T > 1$: flattens (the model becomes more random and willing to pick unusual tokens).

Then we sample from this distribution: roll a weighted die.

probs = softmax(logits / T)
next_token = multinomial(probs)  # sample one

Temperature alone is the simplest random strategy. The problem: even at $T = 1$, the long tail of the vocabulary still has some non-zero probability. Every token has some chance of being chosen, including weird and clearly wrong ones. To clip the tail we add top-k or top-p.

Temperature sampling

The model has predicted these logits for the token after "The cat sat on the soft". The probabilities below are softmax(logit / T) at T = 1.

| token | logit | probability |
|---|---|---|
| mat | 8.2 | 70.7% |
| bed | 6.5 | 12.9% |
| cushion | 5.9 | 7.1% |
| couch | 5.5 | 4.8% |
| ground | 4.8 | 2.4% |
| carpet | 4.2 | 1.3% |
| fur | 3.1 | 0.4% |
| side | 2.4 | 0.2% |
| spot | 2.0 | 0.1% |
| cloud | 0.8 | <0.1% |

P(top guess "mat"): 70.7%
P(top 3 combined): 90.8%
Entropy: 1.49 bits, out of a maximum of 3.32 bits for 10 candidates

The logit column is fixed: those are what the model produced. Only the divisor T changes. At T → 0 the distribution collapses to a single spike on the top guess (greedy). At T = 1 you sample from the model's natural distribution. As T → ∞ all candidates approach equal probability, and the output becomes essentially random.
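The figures above can be reproduced with a few lines of standard-library Python. This is a sketch, not the code behind the essay: the helper names `softmax_with_temperature` and `entropy_bits` are introduced here for illustration, and the random seed is fixed only so the run is repeatable.

```python
import math
import random

def softmax_with_temperature(logits, T):
    """Softmax of logits / T, shifted by the max for numerical stability."""
    scaled = [x / T for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

tokens = ["mat", "bed", "cushion", "couch", "ground",
          "carpet", "fur", "side", "spot", "cloud"]
logits = [8.2, 6.5, 5.9, 5.5, 4.8, 4.2, 3.1, 2.4, 2.0, 0.8]

# Lower T concentrates mass on "mat"; higher T spreads it out.
for T in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, T)
    print(f"T={T}: P(mat)={probs[0]:.3f}, entropy={entropy_bits(probs):.2f} bits")

# Sampling: roll the weighted die once at T = 1.
rng = random.Random(0)  # seeded only so this sketch is repeatable
probs = softmax_with_temperature(logits, 1.0)
next_token = rng.choices(tokens, weights=probs, k=1)[0]
print(next_token)
```

At T = 1 the sketch gives P(mat) of about 0.707 and an entropy of about 1.49 bits, matching the values listed above.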

Top-k: only consider the top k tokens

Keep only the $k$ highest-logit tokens, discard the rest, re-normalize, sample.

top_k_logits, top_k_indices = topk(logits, k)
probs = softmax(top_k_logits / T)
next_token = top_k_indices[multinomial(probs)]

Common values: $k = 40$ or $k = 50$. Cheap, easy, and works well when the model is fairly confident, but $k$ is a fixed shape, while the right “cutoff” actually varies by context. Sometimes only 3 tokens are reasonable; sometimes 500 are. A fixed $k$ either truncates good candidates or admits bad ones.
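Here is a runnable sketch of top-k sampling in plain Python, applied to the "The cat sat on the soft" logits with $k = 3$. The function name `top_k_sample` and its return shape (chosen index plus the surviving distribution) are choices made for this example, not a reference implementation.

```python
import math
import random

def top_k_sample(logits, k, T=1.0, rng=random):
    """Sample a token index from the k highest logits.

    Returns (chosen_index, {kept_index: renormalized_probability}).
    """
    # Indices of the k largest logits, highest first.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / T for i in top]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    # Softmax over the survivors only is the same as zeroing the rest and renormalizing.
    probs = [e / total for e in exps]
    choice = rng.choices(top, weights=probs, k=1)[0]
    return choice, dict(zip(top, probs))

tokens = ["mat", "bed", "cushion", "couch", "ground",
          "carpet", "fur", "side", "spot", "cloud"]
logits = [8.2, 6.5, 5.9, 5.5, 4.8, 4.2, 3.1, 2.4, 2.0, 0.8]

choice, kept = top_k_sample(logits, k=3, rng=random.Random(0))
for idx, prob in kept.items():
    print(f"{tokens[idx]}: {prob:.3f}")  # mat ~0.779, bed ~0.142, cushion ~0.078
print("sampled:", tokens[choice])
```

Only three candidates survive, so "mat" rises from about 70.7% to about 77.9% once the mass of the dropped tokens is redistributed.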

Top-p (nucleus sampling): use the smallest set whose mass ≥ p

Top-p picks the cutoff dynamically. Sort tokens by probability, accumulate until you’ve covered fraction $p$ of the mass, and sample only from that “nucleus.”

sorted_probs, sorted_idx = sort(softmax(logits / T), descending=True)
cumulative = cumsum(sorted_probs)
# keep a token if the mass ranked above it is still below p;
# this always keeps the top token and includes the one that crosses p
keep = (cumulative - sorted_probs) < p
probs = renormalize(sorted_probs * keep)
next_token = sorted_idx[multinomial(probs)]

Common values: $p = 0.9$ or $p = 0.95$. This adapts: if the model is very confident, the nucleus is tiny; if many tokens are plausible, the nucleus widens. Top-p is among the most widely used sampling strategies.
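To make the adaptive behaviour concrete, the sketch below builds the nucleus for two hypothetical logit vectors, one sharply peaked and one nearly flat, at $p = 0.9$. The names `nucleus` and `top_p_sample` and both logit vectors are assumptions made for this illustration.

```python
import math
import random

def softmax(logits, T=1.0):
    """Softmax of logits / T, shifted by the max for numerical stability."""
    scaled = [x / T for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nucleus(probs, p):
    """Smallest top-ranked set whose mass reaches p, with renormalized probabilities."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)          # the first token is always kept
        mass += probs[i]
        if mass >= p:           # stop as soon as the nucleus covers p
            break
    return kept, [probs[i] / mass for i in kept]

def top_p_sample(logits, p, T=1.0, rng=random):
    kept, probs = nucleus(softmax(logits, T), p)
    return rng.choices(kept, weights=probs, k=1)[0]

# Hypothetical logits: one confident distribution, one nearly uniform.
peaked = [10.0, 2.0, 1.0, 0.5, 0.0]
flat = [1.0, 0.9, 0.8, 0.7, 0.6]

print(len(nucleus(softmax(peaked), 0.9)[0]))  # 1: the top token alone exceeds 0.9
print(len(nucleus(softmax(flat), 0.9)[0]))    # 5: every token is needed to reach 0.9
print(top_p_sample(flat, 0.9, rng=random.Random(0)))
```

The same value of $p$ yields a one-token nucleus in the confident case and the full five-token set in the flat case, which is the context sensitivity a fixed $k$ cannot provide.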

Top-k and top-p sampling

Same prompt as before, "The cat sat on the soft", at T = 1. The table shows which candidates get kept, dropped, and re-weighted when the top three are kept.

| rank | token | original P | after renormalization |
|---|---|---|---|
| #1 | mat | 70.7% | 77.9% |
| #2 | bed | 12.9% | 14.2% |
| #3 | cushion | 7.1% | 7.8% |
| #4 | couch | 4.8% | dropped |
| #5 | ground | 2.4% | dropped |
| #6 | carpet | 1.3% | dropped |
| #7 | fur | 0.4% | dropped |
| #8 | side | 0.2% | dropped |
| #9 | spot | 0.1% | dropped |
| #10 | cloud | <0.1% | dropped |

Candidates kept: 3 of 10
Mass covered before renorm: 90.8%
P(top guess "mat") after renorm: 77.9%

Top-k uses a fixed shape: it always keeps exactly k candidates, regardless of how peaked or flat the distribution is. Cheap, easy, but blind to context: if only 2 tokens are reasonable, k = 40 still lets 38 garbage tokens in; if 500 are reasonable, k = 40 truncates the good ones.

Combining them

In practice, production samplers usually apply: temperature → top-k → top-p, in that order. You can also stack in:

  • Repetition penalty / frequency penalty / presence penalty: lower the logits of tokens that have already appeared, discouraging loops.
  • Min-p: keep only tokens whose probability is at least $p_{\min}$ times the top probability; a newer alternative to top-p.
  • Logit bias: directly add a value to specific token IDs (e.g. to forbid a token, set its logit to $-\infty$).
  • Guided / constrained decoding: at every step, mask out any token that would violate a grammar (JSON schema, regex, function call format). vLLM ships with this via outlines and xgrammar.
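Three of the add-ons listed above are small enough to sketch directly. The code below is illustrative only: the function names, the flat subtractive form of the presence penalty, the order in which the steps are chained, and the example values are all assumptions, and real samplers differ in the details.

```python
import math

def softmax(logits):
    """Numerically stable softmax; a logit of -inf becomes probability 0."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def apply_presence_penalty(logits, seen_ids, penalty):
    """Subtract a flat penalty from every token id that has already appeared."""
    return [x - penalty if i in seen_ids else x for i, x in enumerate(logits)]

def apply_logit_bias(logits, bias):
    """Add bias[i] to logit i; a bias of -inf forbids the token outright."""
    return [x + bias.get(i, 0.0) for i, x in enumerate(logits)]

def min_p_filter(probs, p_min):
    """Keep tokens with probability >= p_min * max(probs), then renormalize."""
    threshold = p_min * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]

# Hypothetical 5-token example: token 0 was already generated, token 4 is banned.
logits = [8.2, 6.5, 5.9, 5.5, 4.8]
logits = apply_presence_penalty(logits, seen_ids={0}, penalty=1.0)
logits = apply_logit_bias(logits, bias={4: float("-inf")})
final = min_p_filter(softmax(logits), p_min=0.2)
print([round(p, 3) for p in final])
```

In this run the banned token ends at probability zero, and min-p additionally drops token 3, whose probability falls below 0.2 times that of the top token.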

The whole inference loop

Now we can write the entire generation loop:

tokens = tokenize(prompt)
while True:
    logits = model.forward(tokens)
    next_token = sample(logits[-1])           # logits at the last position only
    if next_token == END_OF_TEXT:
        break
    tokens.append(next_token)
print(detokenize(tokens))

This is logically correct, and an introductory tutorial would stop here. But it would also be catastrophically slow and wasteful at any real scale, because every iteration of that while loop runs the entire model — billions of parameters of work — to produce one token, and re-does all the work for every previous token along the way.
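A rough way to see the waste is to count how many token positions the naive loop pushes through the model. The sketch below assumes exactly `new_tokens` tokens are generated with no early end-of-text, and it counts positions rather than FLOPs; the function name and the prompt and output lengths are hypothetical.

```python
def count_positions_processed(prompt_len, new_tokens):
    """Token positions pushed through the model by the naive loop (no reuse)."""
    total = 0
    seq_len = prompt_len
    for _ in range(new_tokens):
        total += seq_len   # model.forward(tokens) touches every position again
        seq_len += 1       # the sampled token is appended
    return total

prompt_len, new_tokens = 1000, 500
naive = count_positions_processed(prompt_len, new_tokens)
once = prompt_len + new_tokens - 1   # if every position were processed only once

print(naive)  # 624750
print(once)   # 1499
```

The naive count grows quadratically with the number of generated tokens, while processing each position once grows only linearly.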

The rest of this essay is the answer to the question: how do we make this loop fast?

That story comes in stages. First, we’ll separate the loop into two phases (prefill and decode) and observe that they have very different performance characteristics. Then we’ll introduce the KV cache, which makes decode cheap per step but expensive per request. Finally, everything from §13 onwards is about managing that KV cache cleverly enough to keep a $30k GPU busy with hundreds of users. The model is done. The serving system is just beginning.