Section 04

Attention

Queries, keys, and values

Attention is the idea that made modern LLMs possible.

This section is the longest in the essay. Get it, and the rest of the transformer falls into place.

The setup

Imagine the model has read the prompt “the cat sat on the soft” and now has to predict what comes next. After the embedding step from the previous section, you have a sequence of 6 vectors, one per token, plus a 7th position where the next token will go. Call the embeddings $x_1, \ldots, x_7$, each of size $d_{\text{model}}$ (say 4096). Attention’s job is to produce a new sequence of 7 vectors $y_1, \ldots, y_7$ where each $y_i$ is informed by all the earlier $x_j$ it might find relevant. The one we care about most is $y_7$: the vector that, after passing through every layer, will produce the logits used to sample the actual next token (with luck, “mat”).

(There’s one thing missing here that you might already be wondering about: nothing in this setup says where each token sits in the sequence. We’ll come back to that in section 6 — for now, just imagine the model already knows.)

The first move is to project each input vector $x_i$ into three different vectors using three learned weight matrices: $W_Q$, $W_K$, and $W_V$.

$$q_i = W_Q\, x_i \qquad k_i = W_K\, x_i \qquad v_i = W_V\, x_i$$

Each $q$, $k$, $v$ vector has the same dimension (usually equal to $d_{\text{model}}$ in a single-head sketch, smaller per head in the real multi-head version we’ll see next section). These are the famous queries, keys, and values.
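The three projections can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the width is a toy 8 instead of 4096, the embeddings and weight matrices are random stand-ins for learned ones, and the helper names `matvec` and `random_matrix` are made up for this example.

```python
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def random_matrix(rows, cols, rng):
    """Stand-in for a learned weight matrix: small random values."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_model = 8    # toy width (assumption); the text uses 4096
n_tokens = 7   # six prompt tokens plus the position being predicted

# Toy embeddings x_1..x_7; a real model would look these up.
xs = [[rng.gauss(0.0, 1.0) for _ in range(d_model)] for _ in range(n_tokens)]

# Three separate matrices, so each token gets three different views of itself.
W_Q = random_matrix(d_model, d_model, rng)
W_K = random_matrix(d_model, d_model, rng)
W_V = random_matrix(d_model, d_model, rng)

qs = [matvec(W_Q, x) for x in xs]  # q_i = W_Q x_i
ks = [matvec(W_K, x) for x in xs]  # k_i = W_K x_i
vs = [matvec(W_V, x) for x in xs]  # v_i = W_V x_i

print(len(qs), len(qs[0]))  # one query per token, each of length d_model
```

A real implementation would use an array library and batch all tokens into a single matrix multiply; the loop form is kept here because it mirrors the per-token equations.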

The names are an information-retrieval analogy:

  • The query $q_i$ asks: “What information am I looking for?”
  • The key $k_j$ advertises: “Here’s what kind of information I contain.”
  • The value $v_j$ holds: “Here’s the actual information I’ll contribute if someone picks me.”

Scoring: how much does $i$ care about $j$?

For every pair $(i, j)$, we compute an attention score by taking the dot product of $q_i$ and $k_j$:

$$s_{ij} = q_i \cdot k_j = \sum_{d=1}^{d_k} q_{i,d} \cdot k_{j,d}$$

That is: multiply $q_i$ and $k_j$ component-by-component, then add up all the products into a single scalar. If the two vectors point in similar directions in their high-dimensional space, the matched components reinforce each other and the sum is large and positive. If they point in unrelated directions, positive and negative products cancel and the sum is near zero. A high score therefore means “token $i$’s query lines up with token $j$’s key”: token $i$ wants information from token $j$.

Dot product: sign and magnitude

Three 2-D examples. Real query and key vectors live in 128 dimensions, but the geometric rule is identical: alignment → big positive, perpendicular → near zero, opposite → big negative.

  • Aligned: $q = (3, 1)$, $k = (2, 1.4)$, so $q \cdot k = 3 \cdot 2 + 1 \cdot 1.4 = 7.40$. $q$ and $k$ point in roughly the same direction; the token is highly relevant.
  • Orthogonal: $q = (3, 1)$, $k = (-1, 3)$, so $q \cdot k = 3 \cdot (-1) + 1 \cdot 3 = 0.00$. $q$ and $k$ are perpendicular; positive and negative products cancel out. The token is irrelevant.
  • Opposite: $q = (3, 1)$, $k = (-2.4, -1)$, so $q \cdot k = 3 \cdot (-2.4) + 1 \cdot (-1) = -8.20$. $q$ and $k$ point in opposite directions and are strongly anti-aligned. The token is actively unwanted.
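The three 2-D cases are easy to check by hand, and just as easy to check in code. The snippet below is a small sketch; the `dot` helper is written out for illustration rather than taken from any library.

```python
def dot(a, b):
    """Multiply matching components and add up the products."""
    return sum(x * y for x, y in zip(a, b))

q = (3.0, 1.0)
keys = {
    "aligned": (2.0, 1.4),
    "orthogonal": (-1.0, 3.0),
    "opposite": (-2.4, -1.0),
}

scores = {name: dot(q, k) for name, k in keys.items()}
for name, s in scores.items():
    print(f"{name:>10}: {s:+.2f}")
# aligned: +7.40, orthogonal: +0.00, opposite: -8.20
```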

We then divide by $\sqrt{d_k}$ to keep the numbers from blowing up. This is the “scaled” in scaled dot-product, and it’s worth understanding where that exact factor comes from.

$d_k$ is the head dimension: the length of each query and key vector. In the standard design it isn’t a free choice: the model has a width $d_{\text{model}}$ (say 512, as in the original Transformer; the arithmetic is the same for the 4096 used earlier) that gets split across $h$ attention heads, so $d_k = d_{\text{model}} / h$. With 512 dimensions and 8 heads, $d_k = 64$.

Why does the score need scaling at all? Look back at the sum in $s_{ij} = \sum_{d=1}^{d_k} q_{i,d}\, k_{j,d}$: it adds up $d_k$ separate products. If the query and key components are roughly independent with zero mean and unit variance, each product contributes variance $\approx 1$, and variance adds, so the whole sum has variance $\approx d_k$. That means a typical score has magnitude around $\sqrt{d_k}$, and it grows as the head gets wider. Feed scores that large into softmax and it saturates: almost all the weight collapses onto a single token, and the gradient through softmax goes nearly flat, which stalls learning.

Dividing by $\sqrt{d_k}$ is exactly the fix: variance $d_k$ divided by $(\sqrt{d_k})^2 = d_k$ brings the variance back to $\approx 1$, independent of how wide the head is. So $\sqrt{d_k}$ isn’t arbitrary: it’s the standard deviation of the unscaled scores, the precise amount needed to renormalize them. For $d_k = 64$, that factor is $\sqrt{64} = 8$.
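The variance argument can be checked empirically. The sketch below draws query and key components as independent standard normals (that is the assumption behind the argument, not a claim about any trained model) and measures the variance of the raw and scaled scores. The function name `sample_scores` and the sample count are arbitrary choices for this example.

```python
import math
import random
import statistics

def sample_scores(d_k, n_samples, rng):
    """Dot products of random q and k with zero-mean, unit-variance components."""
    scores = []
    for _ in range(n_samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        scores.append(sum(a * b for a, b in zip(q, k)))
    return scores

rng = random.Random(0)
results = {}
for d_k in (16, 64, 256):
    raw = sample_scores(d_k, 2000, rng)
    scaled = [s / math.sqrt(d_k) for s in raw]
    results[d_k] = (statistics.pvariance(raw), statistics.pvariance(scaled))
    print(f"d_k={d_k:>3}  var(raw)={results[d_k][0]:7.1f}  var(scaled)={results[d_k][1]:.2f}")
```

The raw variance should land near $d_k$ in each case, while the scaled variance stays near 1 no matter how wide the head is.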

Finally, we apply the softmax function across each row:

$$w_{ij} = \frac{\exp(s_{ij} / \sqrt{d_k})}{\sum_{j'} \exp(s_{ij'} / \sqrt{d_k})}$$

Softmax turns any row of real numbers into a probability distribution: every weight is between 0 and 1, and they sum to 1 across $j$. The high-scoring keys end up with most of the weight; the rest get squeezed toward zero.

Softmax: raw scores → probability distribution

An example row of five raw attention scores and the weights softmax assigns them. The weights always sum to 1, and high scores soak up most of the mass.

  • Raw scores (any real number): $s_1 = 2.4$, $s_2 = 0.5$, $s_3 = 3.1$, $s_4 = -1.0$, $s_5 = 1.7$
  • After softmax (sum = 1.0000): $w_1 = 27.1\%$, $w_2 = 4.0\%$, $w_3 = 54.5\%$, $w_4 = 0.9\%$, $w_5 = 13.4\%$
The formula

$$\text{softmax}(s)_i = \frac{\exp(s_i / T)}{\sum_j \exp(s_j / T)}$$

Exponentiate each score (makes everything positive and amplifies differences), then divide by the total so the row sums to 1. Bigger scores → exponentially more mass.

$T$ is the temperature: a single positive number that divides every score before exponentiation. $T = 1$ is "plain" softmax. $T < 1$ divides by a small number, blowing up the differences between scores and sharpening the distribution. $T > 1$ shrinks the differences and flattens the distribution toward uniform.

Inside attention there is no separate temperature knob: the only rescaling is the fixed division by $\sqrt{d_k}$ described earlier, and the already-scaled scores go through plain softmax ($T = 1$). Temperature shows up later, at sampling time (§10), to control how "creative" generation feels.

If one score is much larger than the others, softmax gives it nearly all the probability mass. Lower the temperature and the distribution gets even sharper; raise it toward 3 and the distribution flattens toward uniform. This same operation runs at every attention layer to decide who attends to whom, and again at the very end of the model to turn logits into a distribution over the next token (§10).
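Here is a minimal sketch of softmax with a temperature parameter, run on the five example scores from this section. Subtracting the maximum before exponentiating is a common numerical-stability trick that is not part of the formula itself; it leaves the result unchanged.

```python
import math

def softmax(scores, temperature=1.0):
    """Turn a list of real scores into weights that are positive and sum to 1."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)  # subtracting the max avoids overflow and does not change the result
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.4, 0.5, 3.1, -1.0, 1.7]

weights = softmax(scores)
print([f"{w:.1%}" for w in weights])
# ['27.1%', '4.0%', '54.5%', '0.9%', '13.4%']

sharp = softmax(scores, temperature=0.5)  # T < 1: winner takes more
flat = softmax(scores, temperature=3.0)   # T > 1: closer to uniform
print(f"top weight at T=0.5: {max(sharp):.2f}, at T=3: {max(flat):.2f}")
```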

Combining: weighted sum of values

We now have, for each token $i$, a row of attention weights $w_{i,1}, w_{i,2}, \ldots, w_{i,n}$ that sum to 1. These weights are how much of each token to mix in. The remaining question is: mix in what?

This is where the third projection finally pays off. Recall we made $W_V x$ produce a value vector $v_j$ at every position. The queries and keys were used only to figure out the weights; they don’t appear in the output. The values are the payload: the actual content vector that gets sent forward when a token gets attended to.

So the new representation for token $i$ is constructed by taking every token’s value vector (its own included), scaling it by how much $i$ cares about that token, and adding them all up:

$$y_i = w_{i,1}\, v_1 \;+\; w_{i,2}\, v_2 \;+\; \cdots \;+\; w_{i,n}\, v_n \;=\; \sum_j w_{ij}\, v_j$$

Each $v_j$ is a $d_k$-dimensional vector, and so is $y_i$. The weights tell you the mixing ratios: if $w_{i,3} = 0.7$ and $w_{i,5} = 0.2$ and the rest are tiny, then 70% of the resulting $y_i$ comes from token 3’s content, 20% from token 5’s, and a small smear from everything else. In effect, every token gets to “pull in” content from the tokens it found relevant, weighted by how relevant.
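The mixing step is small enough to write out directly. In the sketch below the weights follow the 0.7 / 0.2 example from the text, with the remaining 0.1 split across the other three tokens. The 3-D value vectors are invented for illustration so that the mix is easy to read off.

```python
def weighted_sum(weights, values):
    """Compute y = sum_j w_j * v_j, one output component at a time."""
    dim = len(values[0])
    return [sum(w * v[c] for w, v in zip(weights, values)) for c in range(dim)]

# Hypothetical attention weights for one token over five positions (sum to 1).
weights = [0.04, 0.03, 0.70, 0.03, 0.20]

# Hypothetical value vectors v_1..v_5.
values = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],  # token 3, which gets weight 0.7
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0],  # token 5, which gets weight 0.2
]

y = weighted_sum(weights, values)
print([round(c, 2) for c in y])  # [0.07, 0.26, 0.9]
```

The last component is dominated by tokens 3 and 5 (0.7 + 0.2), which is the "pull in content from relevant tokens" behavior in miniature.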

That’s the entire operation. The three projections do three different jobs:

  • $q_i$: “what am I asking for?”
  • $k_j$: “what kind of thing am I?” (used to compute relevance against queries)
  • $v_j$: “if you pick me, here is the content I’ll contribute”

You can think of it as a soft, differentiable dictionary lookup: query meets keys to decide who; values are the what.

In matrix form

Doing this for every token ii at once, and lining everything up into matrices, gives you the famous formula from “Attention Is All You Need”:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$

Reading left to right: compute all pairwise scores ($Q K^\top$), divide by $\sqrt{d_k}$, softmax each row, then multiply the resulting weight matrix by $V$. Because each row of a matrix product is a weighted sum of the rows of the right-hand matrix, that last multiply is exactly the per-token mixing we just did, executed for all tokens in parallel. We call this scaled dot-product attention, and it is the centerpiece of essentially every transformer-based model.
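The matrix formula translates almost line for line into code. This is a sketch in plain Python, not an optimized implementation: it assumes the rows of `Q`, `K`, and `V` are the per-token vectors $q_i$, $k_i$, $v_i$, applies no mask, and uses tiny made-up matrices. The helper names are chosen for this example.

```python
import math

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    """Each output row is a weighted sum of the rows of B."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def softmax_row(row):
    m = max(row)  # stability shift; does not change the result
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, returning the outputs and the weight matrix."""
    d_k = len(K[0])
    scores = matmul(Q, transpose(K))                            # all pairwise q_i . k_j
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = [softmax_row(row) for row in scaled]              # each row sums to 1
    return matmul(weights, V), weights

# Toy inputs (assumptions): 3 tokens, d_k = 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

Y, W = attention(Q, K, V)
for row in W:
    print([round(w, 3) for w in row])
```

In practice this whole function is a handful of batched array operations, which is a large part of why attention maps so well onto GPUs.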

The causal mask

There’s one wrinkle specific to generation. When the model produces text autoregressively, position $i$ must not be allowed to attend to positions in the future ($j > i$); otherwise the model would “cheat” by peeking at tokens that come later. To prevent this, we set those scores to $-\infty$ before the softmax, which makes $\exp(-\infty) = 0$ and zeros them out cleanly.

This is the causal mask. It’s why the heatmap below is triangular: the lower-left half is computed, the upper-right is forbidden.
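The mask amounts to one extra step before the softmax. The sketch below takes a square matrix of scores, replaces every entry with $j > i$ by negative infinity, and then normalizes each row. The function name `causal_softmax` and the all-zero toy scores are choices made for this example.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax where position i may only see positions j <= i."""
    weights = []
    for i, row in enumerate(scores):
        masked = [s if j <= i else -math.inf for j, s in enumerate(row)]
        m = max(masked)  # position i itself is never masked, so m is finite
        exps = [math.exp(s - m) for s in masked]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Toy scores (assumption): all zeros, so each row is uniform over what it can see.
n = 4
scores = [[0.0] * n for _ in range(n)]

for row in causal_softmax(scores):
    print([round(w, 2) for w in row])
# [1.0, 0.0, 0.0, 0.0]
# [0.5, 0.5, 0.0, 0.0]
# [0.33, 0.33, 0.33, 0.0]
# [0.25, 0.25, 0.25, 0.25]
```

The printed matrix has the same lower-triangular shape as the heatmap: the first token can only attend to itself, and each later token sees one more position.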

Try it

Below is an interactive attention map for our running example: the model has read “The cat sat on the soft” and is about to predict the next token, shown as ? at position 6 (positions in the map are counted from 0, so this is the 7th slot from the setup). Pick a head profile (one of a few canonical attention patterns), then click any row to see how that token’s weights are constructed; by default we show position 6, the one doing the prediction. You’ll see the raw $Q \cdot K^\top$ scores, the softmax that turns them into a probability distribution, and the resulting weighted-value combination.

Attention heatmap: predicting the next token

The model has read "The cat sat on the soft" and must now predict the next token (the ? at position 6). Each row is a query position (asking) and each column a key position (attended-to); blank cells are masked. These patterns are illustrative: real models learn many more, and not always so clean.

In the pattern shown, the token being predicted ("?") pulls strongly from "cat", the only noun in the sentence, and the most likely thing the new noun will rhyme with or modify. Determiners ("the", "The") light each other up too.

| query \ key | The | cat | sat | on | the | soft | ? |
|---|---|---|---|---|---|---|---|
| 0 The | 1.00 | | | | | | |
| 1 cat | 0.12 | 0.88 | | | | | |
| 2 sat | 0.11 | 0.11 | 0.79 | | | | |
| 3 on | 0.10 | 0.10 | 0.10 | 0.71 | | | |
| 4 the | 0.84 | 0.02 | 0.02 | 0.02 | 0.11 | | |
| 5 soft | 0.08 | 0.08 | 0.08 | 0.08 | 0.08 | 0.60 | |
| 6 ? | 0.01 | 0.88 | 0.01 | 0.01 | 0.01 | 0.01 | 0.07 |

How the output at position 6 (?) is computed:

  • Raw scores $Q \cdot K^\top$: -1.00, 3.50, -1.00, -1.00, -1.00, -1.00, 1.00
  • After softmax: 0.01, 0.88, 0.01, 0.01, 0.01, 0.01, 0.07
  • Output: $\sum_j w_j \cdot v(\text{token}_j)$, where the $w_j$ are the softmaxed weights above and $v(\text{token}_j)$ is the value vector for each attended-to token.

A few things to look for as you click around:

  • Pattern A attends mostly to the previous token. For the predicting position, that’s “soft” — a useful starting place for “what kind of word follows an adjective?”.
  • Pattern B matches content: the “?” position pulls on “cat” (the noun whose role it is filling), and the two “the”s lock together.
  • Pattern C spreads attention evenly over all preceding tokens — a kind of “summarizer” that mixes in broad context.
  • Pattern D dumps most of its probability mass onto the first token. This is the real “attention sink” pattern that shows up in many trained models — a kind of pressure-relief valve when nothing else feels clearly relevant.

Notice the upper-right triangle is always masked: position 2 (“sat”) can attend to positions 0, 1, 2 but not 3+. That’s the causal mask in action.

Each of these is one plausible thing a single attention computation could learn, but a real model needs to track many such patterns at once: tense, subject-verb agreement, coreference (two expressions referring to the same thing, as when “she” points back to “Marie” in “Marie went home because she was tired”), sentence boundaries, you name it. The mechanism that lets one layer run many independent attention patterns side by side is called multi-head attention, and it’s the next section.

What just happened, at the level of vectors

The end result of one attention layer is that every position’s vector has been updated to incorporate information from positions it considered relevant. The “?” position’s vector now contains some “cat” flavor and some “soft” flavor, weighted by how much the model finds each relevant. The model gets to learn what flavors matter.

One catch: with what we’ve described so far, a single attention layer can only express one pattern of who-attends-to-whom at a time. Real language has many things to track at once (tense, subjects, references, sentence boundaries, …) and a single pattern can’t capture all of them. The fix is to run many attention computations side by side, each with its own $W_Q$, $W_K$, $W_V$. That is multi-head attention, the next section.