Attention
Queries, keys, and values
Attention is the idea that made modern LLMs possible.
This section is the longest in the essay. Get it, and the rest of the transformer falls into place.
The setup
Imagine the model has read the prompt “the cat sat on the soft” and now has to predict what comes next. After the embedding step from the previous section, you have a sequence of 6 vectors, one per token, plus a 7th position where the next token will go. Call the embeddings $x_0, x_1, \dots, x_6$, each of size $d_{\text{model}}$ (say 4096). Attention’s job is to produce a new sequence of 7 vectors $z_0, \dots, z_6$, where each $z_i$ is informed by all the earlier $x_j$ it might find relevant. The one we care about most is $z_6$: the vector that, after passing through every layer, will produce the logits used to sample the actual next token (with luck, “mat”).
(There’s one thing missing here that you might already be wondering about: nothing in this setup says where each token sits in the sequence. We’ll come back to that in section 6 — for now, just imagine the model already knows.)
The first move is to project each input vector $x_i$ (treated here as a row vector) into three different vectors using three learned weight matrices, $W_Q$, $W_K$, and $W_V$:
$$q_i = x_i W_Q, \qquad k_i = x_i W_K, \qquad v_i = x_i W_V$$
Each $q_i$, $k_i$, $v_i$ vector has the same dimension $d_k$ (usually equal to $d_{\text{model}}$ in a single-head sketch, smaller per head in the real multi-head version we’ll see next section). These are the famous queries, keys, and values.
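The projection step can be sketched in a few lines of plain Python. This is only an illustration under assumed toy sizes (a width of 4 and a head dimension of 2, not 4096), with random matrices standing in for learned weights; the helper names `matvec` and `random_matrix` are made up for this example.

```python
import random

def matvec(x, W):
    """Row vector x (length d_in) times matrix W (d_in x d_out)."""
    d_out = len(W[0])
    return [sum(x[r] * W[r][c] for r in range(len(x))) for c in range(d_out)]

def random_matrix(rows, cols, rng):
    # Stand-in for a learned weight matrix; real weights come from training.
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_model, d_k, n_tokens = 4, 2, 7  # toy sizes, assumed for illustration

# One embedding per position (random placeholders, not real embeddings).
X = [[rng.gauss(0.0, 1.0) for _ in range(d_model)] for _ in range(n_tokens)]

W_Q = random_matrix(d_model, d_k, rng)
W_K = random_matrix(d_model, d_k, rng)
W_V = random_matrix(d_model, d_k, rng)

Q = [matvec(x, W_Q) for x in X]  # one query per token
K = [matvec(x, W_K) for x in X]  # one key per token
V = [matvec(x, W_V) for x in X]  # one value per token
```

Every position gets its own query, key, and value, but all positions share the same three matrices.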
The names are an information-retrieval analogy:
- The query asks: “What information am I looking for?”
- The key advertises: “Here’s what kind of information I contain.”
- The value holds: “Here’s the actual information I’ll contribute if someone picks me.”
Scoring: how much does $i$ care about $j$?
For every pair $(i, j)$, we compute an attention score by taking the dot product of the query $q_i$ and the key $k_j$:
$$s_{ij} = q_i \cdot k_j = \sum_{m=0}^{d_k - 1} q_{i,m} \, k_{j,m}$$
That is: multiply $q_i$ and $k_j$ component-by-component, then add up all the products into a single scalar. If the two vectors point in similar directions in their high-dimensional space, the matched components reinforce each other and the sum is large and positive. If they point in unrelated directions, positive and negative products cancel and the sum is near zero. A high score therefore means “token $i$’s query lines up with token $j$’s key”: token $i$ wants information from token $j$.
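To make the three cases concrete, here is a small sketch with hand-picked vectors. The vectors and the `dot` helper are hypothetical, chosen only so the arithmetic is easy to follow.

```python
def dot(a, b):
    """Multiply matching components and sum the products."""
    assert len(a) == len(b)
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 2.0, 0.0]            # hypothetical query
k_aligned = [2.0, 4.0, 0.0]    # points the same way as q
k_unrelated = [0.0, 0.0, 3.0]  # orthogonal to q
k_opposed = [-1.0, -2.0, 0.0]  # points the opposite way

print(dot(q, k_aligned))    # 10.0
print(dot(q, k_unrelated))  # 0.0
print(dot(q, k_opposed))    # -5.0
```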
We then divide by $\sqrt{d_k}$ to keep the numbers from blowing up. This is the “scaled” in scaled dot-product, and it’s worth understanding where that exact factor comes from.
$d_k$ is the head dimension: the length of each query and key vector. It isn’t a free choice: the model has a width $d_{\text{model}}$ (say 512) that gets split across $h$ attention heads, so $d_k = d_{\text{model}} / h$. With 512 dimensions and 8 heads, $d_k = 512 / 8 = 64$.
Why does the score need scaling at all? Look back at the sum in $q_i \cdot k_j$: it adds up $d_k$ separate products. If the query and key components are roughly independent with zero mean and unit variance, each product contributes variance $1$, and variance adds, so the whole sum has variance $d_k$. That means a typical score has magnitude around $\sqrt{d_k}$, and it grows as the head gets wider. Feed scores that large into softmax and it saturates: almost all the weight collapses onto a single token, and the gradient through softmax goes nearly flat, which stalls learning.
Dividing by $\sqrt{d_k}$ is exactly the fix. Dividing a random quantity by a constant divides its variance by the square of that constant, so variance $d_k$ divided by $(\sqrt{d_k})^2 = d_k$ brings the variance back to $1$, independent of how wide the head is. So $\sqrt{d_k}$ isn’t arbitrary: it’s the standard deviation of the unscaled scores, the precise amount needed to renormalize them. For $d_k = 64$, that factor is $\sqrt{64} = 8$.
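The variance argument is easy to check empirically. The sketch below assumes query and key components drawn independently from a standard normal distribution, which is an idealization and not a claim about trained weights; the trial count and seed are arbitrary.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def variance(samples):
    mean = sum(samples) / len(samples)
    return sum((s - mean) ** 2 for s in samples) / len(samples)

rng = random.Random(0)
d_k, trials = 64, 5000

raw = []
for _ in range(trials):
    # Independent zero-mean, unit-variance components (an assumption).
    q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
    k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
    raw.append(dot(q, k))

scaled = [s / math.sqrt(d_k) for s in raw]

print(round(variance(raw), 1))     # should land near d_k = 64
print(round(variance(scaled), 2))  # should land near 1
```

Changing `d_k` moves the first number with it while the second stays near 1, which is the point of the scaling.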
Finally, we apply the softmax function across each row. Writing $s_{ij}$ for the raw score of query $i$ against key $j$, the attention weight is:
$$a_{ij} = \frac{\exp\left(s_{ij} / \sqrt{d_k}\right)}{\sum_{j'} \exp\left(s_{ij'} / \sqrt{d_k}\right)}$$
Softmax turns any row of real numbers into a probability distribution: every weight is between 0 and 1, and they sum to 1 across $j$. The high-scoring keys end up with most of the weight; the rest get squeezed toward zero.
Softmax is often written with a temperature $T$, a single positive number that divides every score before exponentiation:
$$\text{softmax}_T(s)_j = \frac{\exp(s_j / T)}{\sum_{j'} \exp(s_{j'} / T)}$$
$T = 1$ is "plain" softmax. $T < 1$ divides by a small number, blowing up the differences between scores and sharpening the distribution. $T > 1$ shrinks the differences and flattens the distribution toward uniform.
Inside attention, $T$ is always 1; temperature shows up later, at sampling time (§10), to control how "creative" generation feels.
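A minimal softmax with a temperature parameter might look like the following. Subtracting the maximum before exponentiating is a common numerical-stability habit that the text above does not discuss; it leaves the result unchanged. The example scores are made up.

```python
import math

def softmax(scores, temperature=1.0):
    """Exponentiate scores / temperature and normalize to sum to 1."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)  # subtract the max so exp() never overflows
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.0]                  # hypothetical raw scores
plain = softmax(scores)                   # T = 1
sharp = softmax(scores, temperature=0.5)  # T < 1: more peaked
flat = softmax(scores, temperature=5.0)   # T > 1: closer to uniform

for name, dist in [("plain", plain), ("sharp", sharp), ("flat", flat)]:
    print(name, [round(p, 3) for p in dist])
```

The largest score takes a bigger share as the temperature drops and a smaller share as it rises.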
Combining: weighted sum of values
We now have, for each token $i$, a row of attention weights $a_{ij}$ that sum to 1 across $j$. These weights are how much of each token to mix in. The remaining question is: mix in what?
This is where the third projection finally pays off. Recall we made $W_V$ produce a value vector $v_j$ at every position. The queries and keys were used only to figure out the weights; they don’t appear in the output. The values are the payload: the actual content vector that gets sent forward when a token gets attended to.
So the new representation for token $i$, call it $z_i$, is constructed by taking every token’s value vector (its own included), scaling it by how much $i$ cares about that token, and adding them all up:
$$z_i = \sum_j a_{ij} \, v_j$$
Each $v_j$ is a $d_k$-dimensional vector, and so is $z_i$. The weights $a_{ij}$ tell you the mixing ratios: if $a_{i3} = 0.7$ and $a_{i5} = 0.2$ and the rest are tiny, then 70 % of the resulting $z_i$ comes from token 3’s content, 20 % from token 5’s, and a small smear from everything else. In effect, every token gets to “pull in” content from the tokens it found relevant, weighted by how relevant.
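The 70 % / 20 % example can be replayed directly. The value vectors below are hypothetical and deliberately trivial (token 3 carries its content in the first coordinate, token 5 in the second, everyone else is zero) so the mixing ratios show up unchanged in the output.

```python
def weighted_sum(weights, values):
    """Mix value vectors using one row of attention weights."""
    dim = len(values[0])
    out = [0.0] * dim
    for w, v in zip(weights, values):
        for c in range(dim):
            out[c] += w * v[c]
    return out

# Hypothetical weights for one token: 70% on token 3, 20% on token 5.
weights = [0.02, 0.02, 0.02, 0.70, 0.02, 0.20, 0.02]
values = [
    [0.0, 0.0],  # token 0
    [0.0, 0.0],  # token 1
    [0.0, 0.0],  # token 2
    [1.0, 0.0],  # token 3
    [0.0, 0.0],  # token 4
    [0.0, 1.0],  # token 5
    [0.0, 0.0],  # token 6
]

z = weighted_sum(weights, values)
print(z)  # [0.7, 0.2]
```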
That’s the entire operation. The three projections do three different jobs:
- $W_Q$: “what am I asking for?”
- $W_K$: “what kind of thing am I?” (used to compute relevance against queries)
- $W_V$: “if you pick me, here is the content I’ll contribute”
You can think of it as a soft, differentiable dictionary lookup: query meets keys to decide who; values are the what.
In matrix form
Doing this for every token at once, and lining everything up into matrices, gives you the famous formula from “Attention Is All You Need”:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$
Reading left to right: compute all pairwise scores ($QK^\top$), scale by $1/\sqrt{d_k}$, softmax each row, then multiply the resulting weight matrix by $V$. Because matrix multiplication is “weighted sums of rows,” that last multiply is exactly the per-token mixing we just did, executed for all tokens in parallel. We call this scaled dot-product attention, and it is the centerpiece of every transformer-based model on the planet.
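Put together, the whole formula fits in a short function. This is a plain-Python sketch for readability, not how production code does it (real implementations use batched matrix libraries), and it omits the causal mask. The tiny `Q`, `K`, and `V` are invented so that each query clearly matches one key.

```python
import math

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, with Q, K, V as lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # One row of Q K^T, scaled by 1 / sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the rows of V.
        out.append([sum(w * v[c] for w, v in zip(weights, V))
                    for c in range(len(V[0]))])
    return out

Q = [[1.0, 0.0], [0.0, 1.0]]    # query 0 matches key 0, query 1 matches key 1
K = [[10.0, 0.0], [0.0, 10.0]]
V = [[1.0, 2.0], [3.0, 4.0]]

Z = attention(Q, K, V)
print([[round(x, 2) for x in row] for row in Z])
```

Because each query lines up strongly with a single key, each output row is almost exactly the matching row of `V`.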
The causal mask
There’s one wrinkle specific to generation. When the model produces text autoregressively, position $i$ must not be allowed to attend to positions in the future ($j > i$); otherwise the model would “cheat” by peeking at tokens that come later. To prevent this, we set those scores to $-\infty$ before the softmax, which makes $e^{-\infty} = 0$ and zeros out their weights cleanly.
This is the causal mask. It’s why the heatmap below is triangular: the lower-left half is computed, the upper-right is forbidden.
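Here is a small sketch of the masking step, assuming three positions with identical raw scores so that the only thing shaping the weights is the mask itself. The function name `causal_weights` is made up for this example.

```python
import math

def causal_weights(scores):
    """Row-wise softmax where position i may only see positions j <= i."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Future positions get -inf, and exp(-inf) is exactly 0.0.
        row = [scores[i][j] if j <= i else -math.inf for j in range(n)]
        m = max(row)  # always finite, because j == i is allowed
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

scores = [[0.0] * 3 for _ in range(3)]  # equal raw scores, for illustration
W = causal_weights(scores)
for row in W:
    print([round(w, 2) for w in row])
# [1.0, 0.0, 0.0]
# [0.5, 0.5, 0.0]
# [0.33, 0.33, 0.33]
```

The first position can only attend to itself, so its weight there is 1; each later row spreads its weight over itself and everything before it, which produces the triangular shape.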
Try it
Below is an interactive attention map for our running example: the model has read “The cat sat on the soft” and is about to predict the next token, shown as ? at position 6. Pick a head profile (one of a few canonical attention patterns), then click any row to see how that token’s weights are constructed — by default we show position 6, the one doing the prediction. You’ll see the raw scores, the softmax that turns them into a probability distribution, and the resulting weighted-value combination.
| query ↓ / key (attended-to) → | The | cat | sat | on | the | soft | ? |
|---|---|---|---|---|---|---|---|
| 0 The | 1.00 | — | — | — | — | | |