Sampling
Logits → the next token
We have logits, one vector of 128,256 numbers, sitting at the last position of the model’s output. We need to turn this into a single chosen token to append to the sequence and feed back in. That step is called sampling, and how you sample materially shapes what the model feels like.
Greedy decoding: just pick the max
The simplest possible strategy: take the argmax. Whichever token has the highest logit is the chosen token.
next_token = logits.argmax()
Greedy decoding is fully deterministic — same prompt always yields the same completion. It’s also great when there is one obviously correct answer (factual lookup, code completion of a syntactically constrained snippet). It’s bad when there are many reasonable continuations, because it always picks the model’s most confident guess, which can lead to repetitive, bland, or stuck-in-a-loop text.
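To make the argmax concrete, here is a minimal sketch in plain Python. The five-entry `toy_logits` list is a made-up stand-in for the real 128,256-entry vector, and `greedy_pick` is a hypothetical helper name, not part of any library.

```python
def greedy_pick(logits):
    """Return the index of the largest logit (ties go to the lowest index)."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Hypothetical 5-token vocabulary; token 2 has the highest logit.
toy_logits = [1.2, -0.3, 4.1, 0.7, 4.0]
print(greedy_pick(toy_logits))  # prints 2
```

Note that token 4 (logit 4.0) is almost as good as token 2 (logit 4.1), yet greedy decoding will never choose it. That is the blandness problem in miniature.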
Temperature scaling: sharpen or flatten the distribution
Before applying softmax, divide the logits by a number called the temperature, T:
- T = 1: leaves the distribution unchanged.
- T < 1: sharpens (the top tokens get even more probability mass, the long tail gets squashed). T → 0 recovers greedy decoding.
- T > 1: flattens (the model becomes more random and willing to pick unusual tokens).
Then we sample from this distribution: roll a weighted die.
probs = softmax(logits / T)
next_token = multinomial(probs) # sample one
Temperature alone is the simplest random strategy. The problem: at any T > 0, the long tail of the vocabulary still has non-zero probability. Every token has some chance of being chosen, including weird and clearly wrong ones. To clip the tail we add top-k or top-p.
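A small runnable sketch of temperature sampling, using only the standard library. The helper names (`softmax_with_temperature`, `sample_index`) and the four-token `toy_logits` are illustrative assumptions; a real implementation would do this on the GPU with tensor operations.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature, computed stably by subtracting the max."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; treat 0 as greedy decoding")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_index(probs, rng=random):
    """Roll the weighted die: draw one index according to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


toy_logits = [2.0, 1.0, 0.0, -3.0]  # hypothetical 4-token vocabulary
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 4) for p in probs], "sampled:", sample_index(probs))
```

Running it shows the top token's share shrinking as T grows, while the last token's probability stays small but never reaches zero at any of the three settings.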
Top-k: only consider the top k tokens
Keep only the k highest-logit tokens, zero out the probability of the rest, re-normalize, sample.
top_k_logits, top_k_indices = topk(logits, k)
probs = softmax(top_k_logits / T)
next_token = top_k_indices[multinomial(probs)]
Common values: or . Cheap, easy, and works well when the model is fairly confident, but k is fixed, while the right “cutoff” actually varies by context. Sometimes only 3 tokens are reasonable; sometimes 500 are. A fixed k either truncates good candidates or admits bad ones.
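Here is one way the top-k step might look as runnable Python, again on a toy vocabulary. `top_k_sample` is a hypothetical helper; it relies on `random.choices` accepting unnormalized weights, so the re-normalization happens implicitly.

```python
import math
import random


def top_k_sample(logits, k, temperature=1.0, rng=random):
    """Sample a token index from the k highest-logit tokens only."""
    # Indices of the k largest logits, best first.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)
    # Unnormalized softmax weights; random.choices normalizes them for us.
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(top, weights=weights, k=1)[0]


toy_logits = [1.2, -0.3, 4.1, 0.7, 4.0]  # hypothetical 5-token vocabulary
print([top_k_sample(toy_logits, k=2) for _ in range(10)])
```

With `k=2` only tokens 2 and 4 can ever appear in the output, no matter how many draws you take. With `k=1` the function degenerates to greedy decoding.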
Top-p (nucleus sampling): use the smallest set whose mass ≥ p
Top-p picks the cutoff dynamically. Sort tokens by probability, accumulate until you’ve covered a fraction p of the mass, and sample only from that “nucleus.”
sorted_probs, sorted_idx = sort(softmax(logits / T), descending=True)
cumulative = cumsum(sorted_probs)
keep = (cumulative - sorted_probs) < p # keep a token while the mass before it is < p; the top token is always kept
probs = renormalize(sorted_probs * keep)
next_token = sorted_idx[multinomial(probs)]
Common values: or . This adapts: if the model is very confident, the nucleus is tiny; if many tokens are plausible, the nucleus widens. Top-p is the most common high-quality sampling strategy.
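The adaptive behaviour is easy to see on two toy distributions. The sketch below is plain Python; `nucleus` and `top_p_sample` are hypothetical helper names, and the two probability lists are invented for illustration.

```python
import random


def nucleus(probs, p):
    """Indices of the smallest high-probability set whose total mass is >= p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)          # the token that crosses the threshold is included
        mass += probs[i]
        if mass >= p:
            break
    return kept


def top_p_sample(probs, p, rng=random):
    """Sample one index from the nucleus, weighted by the original probabilities."""
    kept = nucleus(probs, p)
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


confident = [0.92, 0.05, 0.02, 0.01]        # one token dominates
uncertain = [0.30, 0.25, 0.20, 0.15, 0.10]  # many plausible tokens
print(len(nucleus(confident, 0.8)), len(nucleus(uncertain, 0.8)))  # prints 1 4
```

The same threshold keeps one token in the confident case and four in the uncertain case, which is exactly the flexibility a fixed k cannot offer.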
Combining them
In practice, production samplers usually apply: temperature → top-k → top-p, in that order. You can also stack in:
- Repetition penalty / frequency penalty / presence penalty: subtract from the logits of tokens that have already appeared, discouraging loops.
- Min-p: keep only tokens whose probability is at least a chosen fraction of the top probability; a newer alternative to top-p.
- Logit bias: directly add a value to the logits of specific token IDs (e.g. to forbid a token, set its logit to −∞).
- Guided / constrained decoding: at every step, mask out any token that would violate a grammar (JSON schema, regex, function call format). vLLM ships with this via outlines and xgrammar.
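Stacked together, the pieces might look like the sketch below. This is an illustration under assumptions, not how any particular serving engine does it: the function name `sample` and its parameters are hypothetical, the penalty shown is the simple subtract-a-constant variant, and the placement of logit bias, the penalty, and min-p relative to the temperature → top-k → top-p chain is my own choice. Constrained decoding is left out because it needs a grammar engine.

```python
import math
import random

NEG_INF = float("-inf")


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # exp(-inf) is 0.0, so masked tokens vanish
    total = sum(exps)
    return [e / total for e in exps]


def sample(logits, history=(), temperature=1.0, top_k=None, top_p=None,
           min_p=None, presence_penalty=0.0, logit_bias=None, rng=random):
    logits = list(logits)
    # Logit bias: add an offset to chosen token IDs (-inf forbids a token).
    for token_id, bias in (logit_bias or {}).items():
        logits[token_id] += bias
    # Presence-style penalty: subtract a constant from tokens already generated.
    for token_id in set(history):
        logits[token_id] -= presence_penalty
    # Temperature.
    logits = [x / temperature for x in logits]
    # Top-k: mask everything outside the k largest logits.
    if top_k is not None:
        order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        keep = set(order[:top_k])
        logits = [x if i in keep else NEG_INF for i, x in enumerate(logits)]
    probs = softmax(logits)
    # Min-p: drop tokens below min_p times the top probability.
    if min_p is not None:
        floor = min_p * max(probs)
        probs = [q if q >= floor else 0.0 for q in probs]
    # Top-p: keep the smallest set whose share of the remaining mass reaches top_p.
    if top_p is not None:
        order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        total, mass, keep = sum(probs), 0.0, set()
        for i in order:
            keep.add(i)
            mass += probs[i] / total
            if mass >= top_p:
                break
        probs = [q if i in keep else 0.0 for i, q in enumerate(probs)]
    # random.choices re-normalizes the surviving weights.
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


toy_logits = [1.0, 2.0, 5.0, 0.5]  # hypothetical 4-token vocabulary
print(sample(toy_logits, temperature=0.8, top_k=3, top_p=0.95))
print(sample(toy_logits, top_k=1, logit_bias={2: NEG_INF}))  # token 2 is forbidden
```

Working on logits for the masking steps and on probabilities for the mass-based steps keeps each filter close to its definition, at the cost of a little extra bookkeeping.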
The whole inference loop
Now we can write the entire generation loop:
tokens = tokenize(prompt)
while True:
    logits = model.forward(tokens)
    next_token = sample(logits[-1])  # logits at the last position only
    if next_token == END_OF_TEXT:
        break
    tokens.append(next_token)
print(detokenize(tokens))
This is logically correct, and an introductory tutorial would stop here. But it would also be catastrophically slow and wasteful at any real scale, because every iteration of that while loop runs the entire model — billions of parameters of work — to produce one token, and re-does all the work for every previous token along the way.
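A back-of-the-envelope count shows how quickly the waste adds up. The sketch below uses the number of token positions pushed through the model as a rough proxy for work, which is an assumption (real cost also depends on attention over the context length). Both function names are hypothetical.

```python
def naive_generation_cost(prompt_len, new_tokens):
    """Token positions processed when every step re-runs the whole sequence."""
    processed = 0
    length = prompt_len
    for _ in range(new_tokens):
        processed += length  # model.forward(tokens) touches every position
        length += 1          # one token appended per step
    return processed


def single_pass_cost(prompt_len, new_tokens):
    """Positions processed if each position were only ever computed once."""
    # The prompt once, then each generated token once as it is fed back in
    # (the final token is emitted but never fed back).
    return prompt_len + new_tokens - 1


print(naive_generation_cost(1000, 500))  # prints 624750
print(single_pass_cost(1000, 500))       # prints 1499
```

For a 1,000-token prompt and 500 generated tokens, the naive loop processes several hundred times more positions than there are distinct positions in the sequence.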
The rest of this essay is the answer to the question: how do we make this loop fast?
That story unfolds in three steps. First, we’ll separate the loop into two phases (prefill and decode) and observe that they have very different performance characteristics. Then we’ll introduce the KV cache, which makes decode cheap per step but expensive per request. Finally, everything from §13 onwards is about managing that KV cache cleverly enough to keep a $30k GPU busy with hundreds of users. The model is done. The serving system is just beginning.