Section 04

Attention

Queries, keys, and values

Attention is the idea that made modern LLMs possible.

This section is the longest in the essay. Get it, and the rest of the transformer falls into place.

The setup

Imagine the model has read the prompt “the cat sat on the soft” and now has to predict what comes next. After the embedding step from the previous section, you have a sequence of 6 vectors, one per token, plus a 7th position where the next token will go. Call the embeddings $x_1, \ldots, x_7$, each of size $d_{\text{model}}$ (say 4096). Attention’s job is to produce a new sequence of 7 vectors $y_1, \ldots, y_7$ where each $y_i$ is informed by all the earlier $x_j$ it might find relevant. The one we care about most is $y_7$: the vector that, after passing through every layer, will produce the logits used to sample the actual next token (with luck, “mat”).

(There’s one thing missing here that you might already be wondering about: nothing in this setup says where each token sits in the sequence. We’ll come back to that in section 6 — for now, just imagine the model already knows.)

The first move is to project each input vector $x_i$ into three different vectors using three learned weight matrices: $W_Q$, $W_K$, and $W_V$.

$$q_i = W_Q\, x_i \qquad k_i = W_K\, x_i \qquad v_i = W_V\, x_i$$

Each $q$, $k$, $v$ vector has the same dimension (usually equal to $d_{\text{model}}$ in a single-head sketch, smaller per head in the real multi-head version we’ll see next section). These are the famous queries, keys, and values.
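As a minimal sketch of the three projections, here is one token embedding pushed through three small matrices in plain Python. The sizes ($d_{\text{model}} = 3$, $d_k = 2$), the matrix entries, and the helper name `matvec` are all made up for illustration; in a trained model the matrices are learned and far larger.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

# Hypothetical 2x3 weight matrices (d_model = 3, d_k = 2).
# Real models learn these values during training.
W_Q = [[1.0, 0.0, 0.0],
       [0.0, 1.0, 0.0]]
W_K = [[0.0, 1.0, 0.0],
       [0.0, 0.0, 1.0]]
W_V = [[0.5, 0.5, 0.0],
       [0.0, 0.5, 0.5]]

x_i = [2.0, 4.0, 6.0]   # one token embedding

q_i = matvec(W_Q, x_i)  # [2.0, 4.0]
k_i = matvec(W_K, x_i)  # [4.0, 6.0]
v_i = matvec(W_V, x_i)  # [3.0, 5.0]
print(q_i, k_i, v_i)
```

The same input vector yields three different views of the token because each matrix reads different combinations of its components.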

The names are an information-retrieval analogy:

The query $q_i$ asks: “What information am I looking for?”
The key $k_j$ advertises: “Here’s what kind of information I contain.”
The value $v_j$ holds: “Here’s the actual information I’ll contribute if someone picks me.”

Scoring: how much does $i$ care about $j$?

For every pair $(i, j)$, we compute an attention score by taking the dot product of $q_i$ and $k_j$:

$$s_{ij} = q_i \cdot k_j = \sum_{d=1}^{d_k} q_{i,d} \cdot k_{j,d}$$

That is: multiply $q_i$ and $k_j$ component-by-component, then add up all the products into a single scalar. If the two vectors point in similar directions in their high-dimensional space, the matched components reinforce each other and the sum is large and positive. If they point in unrelated directions, positive and negative products cancel and the sum is near zero. If they point in opposite directions, the sum is large and negative. A high score therefore means “token $i$’s query lines up with token $j$’s key”: token $i$ wants information from token $j$.

Dot product: sign and magnitude

Three 2-D examples. Real query and key vectors typically have 64 or 128 dimensions, but the geometric rule is identical: alignment → big positive, perpendicular → near zero, opposite → big negative.

Aligned: q = (3, 1), k = (2, 1.4), so q · k = 3·2 + 1·1.4 = 7.40. q and k point in roughly the same direction, so the token is highly relevant.

Orthogonal: q = (3, 1), k = (-1, 3), so q · k = 3·(-1) + 1·3 = 0.00. q and k are perpendicular, so the positive and negative products cancel out. The token is irrelevant.

Opposite: q = (3, 1), k = (-2.4, -1), so q · k = 3·(-2.4) + 1·(-1) = -8.20. q and k point in opposite directions and are strongly anti-aligned. The token is actively unwanted.
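The three cases can be checked with a few lines of Python. This is only a sketch of the arithmetic; the helper name `dot` is chosen here for illustration.

```python
def dot(q, k):
    """Multiply component by component, then add up the products."""
    return sum(a * b for a, b in zip(q, k))

q = (3, 1)
examples = [
    ("aligned", (2, 1.4)),
    ("orthogonal", (-1, 3)),
    ("opposite", (-2.4, -1)),
]
for label, k in examples:
    print(f"{label:>10}: q . k = {dot(q, k):.2f}")
# prints 7.40, 0.00 and -8.20 for the three cases
```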

We then divide by $\sqrt{d_k}$ to keep the numbers from blowing up. This is the “scaled” in scaled dot-product, and it’s worth understanding where that exact factor comes from.

$d_k$ is the head dimension, the length of each query and key vector. It isn’t a free choice: the model has a width $d_{\text{model}}$ (say 512) that gets split across $h$ attention heads, so $d_k = d_{\text{model}} / h$. With 512 dimensions and 8 heads, $d_k = 64$.

Why does the score need scaling at all? Look back at the sum $s_{ij} = \sum_{d=1}^{d_k} q_{i,d}\, k_{j,d}$: it adds up $d_k$ separate products. If the query and key components are roughly independent with zero mean and unit variance, each product has mean 0 and variance $\approx 1$, and the variances of independent terms add, so the whole sum has variance $\approx d_k$. That means a typical score has magnitude around $\sqrt{d_k}$, and it grows as the head gets wider. Feed scores that large into softmax and it saturates: almost all the weight collapses onto a single token, and the gradient through softmax goes nearly flat, which stalls learning.

Dividing by $\sqrt{d_k}$ is exactly the fix: variance $d_k$ divided by $(\sqrt{d_k})^2 = d_k$ brings the variance back to $\approx 1$, independent of how wide the head is. So $\sqrt{d_k}$ isn’t arbitrary. It is the standard deviation of the unscaled scores, the precise amount needed to renormalize them. For $d_k = 64$, that factor is $\sqrt{64} = 8$.
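The variance argument is easy to check empirically. The sketch below assumes query and key components are independent standard normal draws, which is an idealization rather than a property of any trained model, and measures the variance of the dot product before and after dividing by $\sqrt{d_k}$. The function names and the sample count are arbitrary choices.

```python
import math
import random

random.seed(0)  # fixed seed so the run is repeatable

def sample_score(d_k):
    """Dot product of two random vectors with zero-mean, unit-variance components."""
    return sum(random.gauss(0, 1) * random.gauss(0, 1) for _ in range(d_k))

def variance(xs):
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

d_k = 64
raw = [sample_score(d_k) for _ in range(5000)]
scaled = [s / math.sqrt(d_k) for s in raw]

var_raw = variance(raw)        # should land near d_k = 64
var_scaled = variance(scaled)  # should land near 1
print(f"unscaled variance ~ {var_raw:.1f}, scaled variance ~ {var_scaled:.2f}")
```

Changing `d_k` changes the unscaled variance roughly in proportion, while the scaled variance should stay close to 1.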

Finally, we apply the softmax function across each row:

$$w_{ij} = \frac{\exp(s_{ij} / \sqrt{d_k})}{\sum_{j'} \exp(s_{ij'} / \sqrt{d_k})}$$

Softmax turns any row of real numbers into a probability distribution — every weight is between 0 and 1, and they sum to 1 across $j$ . The high-scoring keys end up with most of the weight; the rest get squeezed toward zero.

Softmax: raw scores → probability distribution

An example at temperature T = 1.00 (standard). The softmaxed weights always sum to 1, and high scores soak up most of the mass.

Index i	1	2	3	4	5
Raw score sᵢ (any real number)	2.4	0.5	3.1	-1.0	1.7
Weight wᵢ after softmax	27.1%	4.0%	54.5%	0.9%	13.4%

Sum of weights = 1.0000

The formula

$$\text{softmax}(s)_i = \frac{\exp(s_i / T)}{\sum_j \exp(s_j / T)}$$

Exponentiate each score (makes everything positive and amplifies differences), then divide by the total so the row sums to 1. Bigger scores → exponentially more mass.
T is the temperature — a single positive number that scales every score before exponentiation. T = 1 is "plain" softmax. T < 1 divides by a small number, blowing up the differences between scores and sharpening the distribution. T > 1 shrinks the differences and flattens the distribution toward uniform.
Inside attention, T is always 1 — temperature shows up later, at sampling time (§10), to control how "creative" generation feels.

If one score is much larger than the others, softmax gives it nearly all the probability mass. Lowering the temperature makes the distribution even sharper; raising it toward 3 flattens the distribution toward uniform. This same operation runs at every attention layer to decide who attends to whom, and again at the very end of the model to turn logits into a distribution over the next token (§10).
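Here is a small sketch of softmax with temperature, run on the five example scores 2.4, 0.5, 3.1, -1.0 and 1.7. Subtracting the maximum before exponentiating is a common numerical-stability trick that does not change the result; it is an addition for this sketch and not something the formula requires.

```python
import math

def softmax(scores, T=1.0):
    """Softmax with temperature T. Subtracts the max first for numerical stability."""
    m = max(scores)
    exps = [math.exp((s - m) / T) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.4, 0.5, 3.1, -1.0, 1.7]
for T in (0.5, 1.0, 3.0):
    weights = softmax(scores, T)
    print(f"T = {T}:", [f"{w:.1%}" for w in weights])
# At T = 1.0 the weights are 27.1%, 4.0%, 54.5%, 0.9%, 13.4%.
```

At T = 0.5 the largest score takes a bigger share than at T = 1, and at T = 3 the shares move closer together.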

Combining: weighted sum of values

We now have, for each token $i$, a row of attention weights $w_{i,1}, w_{i,2}, \ldots, w_{i,n}$ that sum to 1. These weights say how much of each token to mix in. The remaining question is: mix in what?

This is where the third projection finally pays off. Recall that $W_V$ produces a value vector $v_j = W_V\, x_j$ at every position. The queries and keys were used only to figure out the weights; they don’t appear in the output. The values are the payload: the actual content vector that gets sent forward when a token gets attended to.

So the new representation for token $i$ is constructed by taking every token’s value vector (including its own), scaling it by how much $i$ cares about that token, and adding them all up:

$$y_i = w_{i,1}\, v_1 \;+\; w_{i,2}\, v_2 \;+\; \cdots \;+\; w_{i,n}\, v_n \;=\; \sum_j w_{ij}\, v_j$$

Each $v_j$ is a $d_k$-dimensional vector in this sketch (the original paper allows a separate value dimension $d_v$, which is commonly set equal to $d_k$), and so is $y_i$. The weights tell you the mixing ratios: if $w_{i,3} = 0.7$ and $w_{i,5} = 0.2$ and the rest are tiny, then 70% of the weight in $y_i$ goes to token 3’s content, 20% to token 5’s, and a small smear to everything else. In effect, every token gets to “pull in” content from the tokens it found relevant, weighted by how relevant.
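The mixing step is a plain weighted sum. The sketch below uses five hypothetical 2-D value vectors and a weight row with $w_{i,3} = 0.7$ and $w_{i,5} = 0.2$; the vectors, the remaining small weights, and the helper name `mix_values` are invented for illustration.

```python
def mix_values(weights, values):
    """y_i = sum_j w_ij * v_j, computed one output dimension at a time."""
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

# Hypothetical 2-D value vectors for five tokens.
values = [[1.0, 0.0], [0.0, 1.0], [10.0, 0.0], [0.0, 0.0], [0.0, 10.0]]
# Weight row for token i: w_{i,3} = 0.7, w_{i,5} = 0.2, the rest tiny.
weights = [0.03, 0.03, 0.70, 0.04, 0.20]

y_i = mix_values(weights, values)
print([round(c, 2) for c in y_i])  # [7.03, 2.03]
```

The output is dominated by token 3’s vector, with a smaller contribution from token 5, which is what the weights dictate.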

That’s the entire operation. The three projections do three different jobs:

$q_i$ — “what am I asking for?”
$k_j$ — “what kind of thing am I?” (used to compute relevance against queries)
$v_j$ — “if you pick me, here is the content I’ll contribute”

You can think of it as a soft, differentiable dictionary lookup: query meets keys to decide who; values are the what.

In matrix form

Doing this for every token $i$ at once, and lining everything up into matrices, gives you the famous formula from “Attention Is All You Need”:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$

Reading left to right: compute all pairwise scores ($Q K^\top$), divide by $\sqrt{d_k}$, softmax each row, then multiply the resulting weight matrix by $V$. Because each row of a matrix product is a weighted sum of the rows of the right-hand matrix, that last multiply is exactly the per-token mixing we just did, executed for all tokens in parallel. We call this scaled dot-product attention, and it is the centerpiece of every transformer-based model on the planet.
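Putting the pieces together, here is an unoptimized sketch of scaled dot-product attention in plain Python, with each matrix stored as a list of row vectors. It applies no mask and uses explicit loops for readability; real implementations use batched matrix multiplies on accelerators. The toy `Q`, `K`, `V` values are made up.

```python
import math

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for lists of row vectors (no mask)."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # One row of Q K^T, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        # Weighted sum of the rows of V.
        out.append([sum(w_j * v[d] for w_j, v in zip(w, V)) for d in range(len(V[0]))])
    return out

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[10.0, 0.0], [0.0, 10.0]]
V = [[1.0, 2.0], [3.0, 4.0]]

Y = attention(Q, K, V)
print(Y)
```

In this toy setup each query lines up strongly with one key, so each output row ends up very close to the matching row of `V`.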

The causal mask

There’s one wrinkle specific to generation. When the model produces text autoregressively, position $i$ must not be allowed to attend to positions in the future ($j > i$); otherwise the model would “cheat” by peeking at tokens that come later. To prevent this, we set those scores to $-\infty$ before the softmax, which makes $\exp(-\infty) = 0$ and zeros them out cleanly.

This is the causal mask. It’s why the heatmap below is triangular: the lower-left half is computed, the upper-right is forbidden.
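A minimal sketch of the mask in plain Python: scores for future positions are replaced with negative infinity before the row-wise softmax. The function name `causal_softmax` and the all-zero score matrix are illustrative choices.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax where row i only sees columns j <= i."""
    weights = []
    for i, row in enumerate(scores):
        # Future positions get -inf, so exp() turns them into exactly 0.
        masked = [s if j <= i else -math.inf for j, s in enumerate(row)]
        m = max(masked)  # finite, because position i can always see itself
        exps = [math.exp(s - m) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# With identical scores everywhere, each row spreads its weight
# evenly over the positions it is allowed to see.
scores = [[0.0] * 4 for _ in range(4)]
W = causal_softmax(scores)
for row in W:
    print([round(w, 2) for w in row])
```

The printed matrix is lower-triangular: every entry above the diagonal is exactly zero, and each row still sums to 1.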

Try it

Below is an attention map for our running example: the model has read “The cat sat on the soft” and is about to predict the next token, shown as ? at position 6. Each row shows how one query position spreads its weight over the keys it is allowed to see. For position 6, the one doing the prediction, the breakdown after the map shows the raw $Q K^\top$ scores, the softmax that turns them into a probability distribution, and the resulting weighted-value combination.

Attention heatmap: predicting the next token

The map shows one of a few canonical attention patterns (the content-matching one, Pattern B in the list further down). These patterns are illustrative: real models learn many more, and not always so clean.

The token being predicted (“?”) pulls strongly from “cat”, the only noun in the sentence and the most likely thing the new noun will rhyme with or modify. Determiners (“the”, “The”) light each other up too.

Rows are queries (the position asking); columns are keys (the position attended to). A hyphen marks a masked (future) position.

pos	query	The	cat	sat	on	the	soft	?
0	The	1.00	-	-	-	-	-	-
1	cat	0.12	0.88	-	-	-	-	-
2	sat	0.11	0.11	0.79	-	-	-	-
3	on	0.10	0.10	0.10	0.71	-	-	-
4	the	0.84	0.02	0.02	0.02	0.11	-	-
5	soft	0.08	0.08	0.08	0.08	0.08	0.60	-
6	?	0.01	0.88	0.01	0.01	0.01	0.01	0.07

How the output at position 6 (?) is computed:

key	The	cat	sat	on	the	soft	?
Raw score Q·Kᵀ	-1.00	3.50	-1.00	-1.00	-1.00	-1.00	1.00
After softmax	0.01	0.88	0.01	0.01	0.01	0.01	0.07

output = Σⱼ wⱼ · v(token_j) — where the wⱼ are the softmaxed weights above and v(token_j) is the value vector for each attended-to token.
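The weights for position 6 can be reproduced from the raw scores. The sketch below assumes that every low-weight position (“The”, “sat”, “on”, “the”, “soft”) has a raw score of -1.00, with 3.50 for “cat” and 1.00 for the “?” position itself; that assignment is inferred from the displayed weights, not stated explicitly.

```python
import math

tokens = ["The", "cat", "sat", "on", "the", "soft", "?"]
# Assumed raw scores for the query at position 6 (see the note above).
scores = [-1.0, 3.5, -1.0, -1.0, -1.0, -1.0, 1.0]

exps = [math.exp(s) for s in scores]
total = sum(exps)
weights = [e / total for e in exps]

for token, w in zip(tokens, weights):
    print(f"{token:>4}: {w:.2f}")
# cat gets 0.88, "?" gets 0.07, and the other five get 0.01 each
```

Under that assumption the result matches the bottom row of the attention map.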

A few things to look for as you click around:

Pattern A attends mostly to the previous token. For the predicting position, that’s “soft” — a useful starting place for “what kind of word follows an adjective?”.
Pattern B matches content: the “?” position pulls on “cat” (the noun whose role it is filling), and the two “the”s lock together.
Pattern C spreads attention evenly over all preceding tokens — a kind of “summarizer” that mixes in broad context.
Pattern D dumps most of its probability mass onto the first token. This is the real “attention sink” pattern that shows up in many trained models — a kind of pressure-relief valve when nothing else feels clearly relevant.

Notice the upper-right triangle is always masked: position 2 (“sat”) can attend to positions 0, 1, 2 but not 3+. That’s the causal mask in action.

Each of these is one plausible thing a single attention computation could learn, but a real model needs to track many such patterns at once: tense, subject-verb agreement, coreference, sentence boundaries, you name it. The mechanism that lets one layer run many independent attention patterns side by side is called multi-head attention, and it’s the next section.

What just happened, at the level of vectors

The end result of one attention layer is that every position’s vector has been updated to incorporate information from positions it considered relevant. The “?” position’s vector now contains some “cat” flavor and some “soft” flavor, weighted by how much the model finds each relevant. The model gets to learn what flavors matter.

One catch: with what we’ve described so far, a single attention layer can only express one pattern of who-attends-to-whom at a time. Real language has many things to track at once (tense, subjects, references, sentence boundaries, …) and a single pattern can’t capture all of them. The fix — running many attention computations side by side, each with its own $W_Q, W_K, W_V$ — is multi-head attention, the next section.