Section 06

Positional encoding

Telling the model where each token sits

In the last two sections we walked through attention and multi-head attention without ever telling the model which token came first. That was a deliberate fudge: the math you saw treats its input sequence as a bag. Permute the input tokens and you get the same outputs, permuted the same way. “The cat sat on the mat” and “the mat sat on the cat” contain the same words, so every word would end up with exactly the same representation in both sentences, which is clearly not what you want.
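A minimal sketch of the bag-of-tokens behaviour, in plain Python so it runs without dependencies. The 2-D embeddings are made up, and the projections are taken to be the identity (q = k = v = x); both are simplifying assumptions for illustration, not how a real model is configured.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(xs):
    """Single-head attention with identity projections and no position signal."""
    d = len(xs[0])
    out = []
    for q in xs:
        weights = softmax([dot(q, k) / math.sqrt(d) for k in xs])
        out.append([sum(w * v[j] for w, v in zip(weights, xs)) for j in range(d)])
    return out

# Made-up toy embeddings, one vector per word.
emb = {"the": [0.1, 0.3], "cat": [1.0, -0.5], "sat": [0.2, 0.9],
       "on": [-0.4, 0.1], "mat": [0.7, 0.6]}

a = "the cat sat on the mat".split()
b = "the mat sat on the cat".split()

# Map each word to its output vector in each sentence.
out_a = dict(zip(a, self_attention([emb[w] for w in a])))
out_b = dict(zip(b, self_attention([emb[w] for w in b])))

for word in emb:
    same = all(math.isclose(x, y, abs_tol=1e-12)
               for x, y in zip(out_a[word], out_b[word]))
    print(word, "same output in both sentences:", same)
```

Every word comes out with the same vector in both sentences (up to floating-point rounding), because nothing in the computation depends on where a token sits.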

So before any real model starts running, we have to inject position information explicitly. That’s the job of positional encoding: information added to embeddings so the model knows where each token sits in the sequence.

The high-level idea is the same in every variant: each position in the sequence (0, 1, 2, …) gets associated with some pattern of numbers, and that pattern is mixed into the token’s representation so the network can tell positions apart. There are a few ways to do this, and which one you pick affects how the model generalizes to longer sequences than it saw during training.

Approach 1: Sinusoidal (the original Transformer)

The “Attention Is All You Need” paper added a fixed, hand-crafted pattern to each token’s embedding:

$$\text{PE}_{p, 2i} = \sin\!\left(\frac{p}{10000^{2i/d}}\right), \quad \text{PE}_{p, 2i+1} = \cos\!\left(\frac{p}{10000^{2i/d}}\right)$$

Here $p$ is the token’s position, $d$ is the embedding width, and $i$ indexes the sine/cosine pairs. Each position $p$ gets a vector of sines and cosines at exponentially-spaced frequencies. Position 0 looks one way, position 1 looks slightly different, position 100 looks very different. The vector is added to the token embedding before the first layer.
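The formula can be sketched directly in plain Python. The function names `sinusoidal_pe` and `distance` are hypothetical, and d = 512 is assumed to match the figure in this section.

```python
import math

def sinusoidal_pe(p, d_model=512, base=10000.0):
    """Return the d_model-long sinusoidal encoding for position p."""
    pe = [0.0] * d_model
    for i in range(d_model // 2):
        angle = p / base ** (2 * i / d_model)
        pe[2 * i] = math.sin(angle)       # even slots hold the sine
        pe[2 * i + 1] = math.cos(angle)   # odd slots hold the cosine
    return pe

def distance(a, b):
    """Euclidean distance between two encodings."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

pe0, pe1, pe100 = sinusoidal_pe(0), sinusoidal_pe(1), sinusoidal_pe(100)
print("first slots at p = 0:", pe0[:4])
print("distance p=0 to p=1:  ", distance(pe0, pe1))
print("distance p=0 to p=100:", distance(pe0, pe100))
```

In a real model this vector would be added element-wise to the token embedding; the sketch only builds the position part.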

[Figure: Sinusoidal PE across positions 0–2000. Four dimensions from the same encoding, picked to span the frequency range, plotted against position p (0 to 2000) with values between −1 and +1: i = 16 (high frequency, period ≈ 11.2), i = 64 (period ≈ 62.8), i = 128 (period ≈ 628), and i = 200 (low frequency, period ≈ 8,379). Low i = high frequency (changes fast); high i = low frequency (smooth, almost linear over this range).]
With d_model = 512, the dimension index i controls the wavelength geometrically. i = 16 oscillates about nine times within the first 100 positions; i = 200 completes only about a quarter of a cycle across all 2000 positions. The full encoding stacks 512 such curves (a sine and a cosine for each of 256 frequencies), giving every position a unique fingerprint across many frequency scales.
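The periods quoted for the four curves follow from the formula: a full cycle takes 2π · 10000^(2i/d) positions. A short check, assuming d = 512 as in the caption (the helper name `wavelength` is hypothetical):

```python
import math

def wavelength(i, d_model=512, base=10000.0):
    """Number of positions per full cycle for dimension index i."""
    return 2 * math.pi * base ** (2 * i / d_model)

for i in (16, 64, 128, 200):
    period = wavelength(i)
    print(f"i = {i:3d}: period = {period:8.1f} positions, "
          f"cycles across 2000 positions = {2000 / period:.2f}")
```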

This worked, but it has a problem: it doesn’t generalize well past the training context length. If the model was trained on sequences of 2k tokens, feeding it a 4k-token sequence still works mechanically, because the formula is defined for any position. But the model has never seen the combinations of sine and cosine values that appear at positions past 2k, so it behaves unpredictably there.

Approach 2: Learned positional embeddings

GPT-2 used a much simpler scheme: a second embedding matrix, one row per position, learned alongside everything else. Same problem, more so: a position past the training length has either no row at all or a row that was never trained and still holds its random initialization.

[Figure: Learned positional embeddings (e.g. GPT-2). A grid with one row of d_model floats per position (row labels p = 0, p = 47, p = 63; column labels dim 0 to dim 47) and a line marking the max training position. No formula: these values are learned alongside everything else. Past the training context length, you have nothing.]
Each row is a learned vector unique to that position. Patterns emerge during training but they don't encode any explicit formula — and the dashed area below the rose line is the model's failure mode: positions past the training context have never had their row updated, so those embeddings are whatever random initialization happens to be there. This is why context-length extension is hard for learned PE.
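A toy sketch of that failure mode in plain Python. The table size, the training cutoff, and the fake "gradient" are all hypothetical; the point is only that a row changes when its position is looked up during training, and not otherwise.

```python
import random

D_MODEL = 8          # hypothetical tiny embedding width
TABLE_ROWS = 64      # rows allocated in the position table
MAX_TRAIN_POS = 48   # hypothetical: training only ever saw positions 0..47

rng = random.Random(0)
pos_table = [[rng.gauss(0.0, 0.02) for _ in range(D_MODEL)]
             for _ in range(TABLE_ROWS)]
initial = [row[:] for row in pos_table]   # snapshot of the random init

def train_step(table, position, grad, lr=0.1):
    """Stand-in for one SGD update: only the looked-up row is modified."""
    table[position] = [w - lr * g for w, g in zip(table[position], grad)]

# Fake training: every position that occurs in training gets a made-up gradient.
for p in range(MAX_TRAIN_POS):
    train_step(pos_table, p, [1.0] * D_MODEL)

untouched = [p for p in range(TABLE_ROWS) if pos_table[p] == initial[p]]
print("rows still at random init:", untouched[0], "to", untouched[-1])

try:
    pos_table[TABLE_ROWS]                 # a position past the end of the table
except IndexError:
    print("position", TABLE_ROWS, "has no row at all")
```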

Approach 3: RoPE (what almost everyone uses now)

The dominant scheme in modern open-weight LLMs (Llama, Mistral, Qwen, DeepSeek) is Rotary Position Embeddings (RoPE), which rotates the Q/K vectors by an angle proportional to position. It works differently from the first two approaches. The token embedding itself is not modified. Instead, the position information is mixed in later, inside the attention computation: specifically, by rotating the query and key vectors before they’re dot-producted.

Recall from section 4 that for every token, attention computes a query $q$ and a key $k$: vectors of, say, 128 floats each per head. RoPE’s trick is to look at $q$ as a sequence of pairs of consecutive floats, $(q_0, q_1), (q_2, q_3), \ldots, (q_{126}, q_{127})$, so 64 pairs in total. Each pair is treated as a 2-D arrow lying in its own little plane, and that arrow gets rotated by an angle that depends on the token’s position in the prompt. The same rotation is applied to the matching key.

Different pairs rotate at very different rates. The first pair $(q_0, q_1)$ spins fast: many full rotations across just a few hundred positions. The last pair $(q_{126}, q_{127})$ barely budges, even across thousands of positions. The combined “fingerprint” of these fast and slow rotations is what carries position information.
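The pairwise rotation can be sketched in a few lines of plain Python. This assumes the standard rate θᵢ = 10000^(−2i/d) for pair i and the consecutive-pair layout described above; the function names and the made-up query are hypothetical, and real implementations vectorize this and may lay the pairs out differently.

```python
import math

def rope_angles(position, head_dim=128, base=10000.0):
    """One angle per pair: position * theta_i, with theta_i = base**(-2i/head_dim)."""
    return [position * base ** (-2 * i / head_dim) for i in range(head_dim // 2)]

def apply_rope(vec, position, base=10000.0):
    """Rotate each consecutive pair (vec[2i], vec[2i+1]) by its own angle."""
    out = []
    for i, angle in enumerate(rope_angles(position, len(vec), base)):
        x, y = vec[2 * i], vec[2 * i + 1]
        c, s = math.cos(angle), math.sin(angle)
        out += [x * c - y * s, x * s + y * c]   # standard 2-D rotation
    return out

q = [1.0, 0.0] * 64                  # made-up 128-float query: 64 unit arrows
q_rot = apply_rope(q, position=300)

turns = [a / (2 * math.pi) for a in rope_angles(300)]
print(f"pair 0 after 300 positions:  {turns[0]:.1f} full turns")
print(f"pair 63 after 300 positions: {turns[63]:.4f} of a turn")
```

Rotation never changes a pair's length, so the rotated vector keeps the norm of the original; only the directions of the 64 arrows carry the position.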

Try it below. The slider is the token’s position in the prompt — 0 means the first token, 2,000 means a token deep into a long context. Each circle is one of those query/key pairs. As you move the slider, watch the leftmost circles (fast pairs) sweep wildly while the rightmost (slow pairs) barely move.

[Interactive figure: RoPE: rotation per pair, per position. Each pair of dimensions is treated as a 2-D vector and rotated by an angle p · θᵢ. Low i = fast spin, high i = barely moves. Pairs shown, with rotation rate in radians per position: pair 0 (fastest), θ ≈ 1.0e+0; pair 4, θ ≈ 3.2e-1; pair 12, θ ≈ 3.2e-2; pair 24, θ ≈ 1.0e-3; pair 30 (slowest), θ ≈ 1.8e-4.]
Move the slider and watch the leftmost circles spin fast while the rightmost barely move. The same rotation gets applied to q and k at every layer — and when the model computes q · k, what survives is the difference in their rotation angles, which encodes the relative distance between two tokens for free. That's why RoPE generalizes to longer contexts so much more gracefully than fixed sinusoidal or learned PE.

The payoff comes from how attention later uses these rotated vectors. Attention scores are dot products of queries and keys. After RoPE, position enters that dot product only through the difference in rotation angles, which is determined by the difference in the two tokens’ positions. So RoPE bakes in relative position information automatically, with no learned parameters and no stored “this is position 17” vectors anywhere.
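For a single pair this can be written out in one line. The notation is introduced here for illustration: $q^{(i)}$ and $k^{(i)}$ are the i-th 2-D pairs of a query at position $m$ and a key at position $n$, $\theta_i$ is that pair's rotation rate, and $R(\alpha)$ is the usual 2-D rotation matrix.

```latex
\begin{aligned}
\big(R(m\theta_i)\,q^{(i)}\big)^{\top}\big(R(n\theta_i)\,k^{(i)}\big)
  &= q^{(i)\top} R(m\theta_i)^{\top} R(n\theta_i)\, k^{(i)} \\
  &= q^{(i)\top} R\big((n-m)\theta_i\big)\, k^{(i)},
\qquad
R(\alpha) = \begin{pmatrix} \cos\alpha & -\sin\alpha \\ \sin\alpha & \cos\alpha \end{pmatrix}
\end{aligned}
```

The second step uses $R(\alpha)^{\top} = R(-\alpha)$ and $R(\alpha)R(\beta) = R(\alpha + \beta)$. The full attention score is the sum of this expression over all pairs, so the absolute positions $m$ and $n$ appear only through the offset $n - m$.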

Three things that fell out of this:

  1. It generalizes well to longer contexts. Because the dot product only sees the difference in rotation, the model isn’t memorizing “position 17 looks like this.” It’s reading “you are 5 tokens apart” — and that’s the same kind of signal whether you’re at position 100 or position 100,000.

  2. No new parameters. RoPE is fixed math; there’s nothing to learn. The model only has to learn to use it.

  3. Extending context is tractable. Several tricks let you take a model trained at 8k context and extend it to 128k or more by rescaling the positions or the rotation frequencies. Position interpolation (PI) linearly scales incoming positions down so a model trained at length L “sees” a longer context as if it were still length L: to go from 4k to 16k, divide all positions by 4 before rotating. It is cheap and effective for short extensions, but degrades quality on the tasks the model was already good at. NTK-aware scaling (named after the Neural Tangent Kernel theory that originally motivated it) instead adjusts the rotation base, the 10000 in 10000^(2i/d), so the fast-spinning low-i pairs are preserved while the slow pairs get stretched; it gives better quality than plain position interpolation at modest extension factors. YaRN (Yet another RoPE extensioN method) combines NTK-aware scaling with a length-dependent attention-score scaling and a “ramp” that smoothly transitions between high- and low-frequency treatment, and has been used to ship long-context variants of models such as Qwen-2. This is why so many models now ship in “long context” variants without full retraining.
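A rough sketch of the two simplest extension tricks, in plain Python. Position interpolation follows the description above. For NTK-aware scaling, the base adjustment used here, base · s^(d/(d−2)), is the commonly cited form and should be read as an assumption rather than something this section specifies. The function names and the 4× factor are hypothetical.

```python
import math

def theta(i, head_dim=128, base=10000.0):
    """Rotation rate of pair i, in radians per position."""
    return base ** (-2 * i / head_dim)

def pi_angle(position, i, scale, head_dim=128):
    """Position interpolation: shrink the position, keep every frequency."""
    return (position / scale) * theta(i, head_dim)

def ntk_base(scale, head_dim=128, base=10000.0):
    """NTK-aware scaling (commonly cited form, assumed here): enlarge the base."""
    return base * scale ** (head_dim / (head_dim - 2))

scale = 4.0            # hypothetical: stretch a 4k-trained model to 16k
fast, slow = 0, 63     # fastest and slowest pairs for head_dim = 128

# PI: position 16000 is rotated exactly as position 4000 was during training.
print(pi_angle(16000, slow, scale), 4000 * theta(slow))

# NTK-aware: the fastest pair keeps its rate, the slowest is slowed by about 1/scale.
new_base = ntk_base(scale)
print(theta(fast, base=new_base) / theta(fast))
print(theta(slow, base=new_base) / theta(slow))
```

Under PI every pair is slowed by the same factor, including the fast ones that distinguish neighbouring tokens. The base adjustment leaves the fastest pair alone and applies nearly the full slowdown only to the slowest pair, which matches the motivation given for NTK-aware scaling.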

Where positional encoding actually lives in the pipeline

The three approaches plug into the pipeline at different points.

Sinusoidal and learned PE happen once, at the very top, before the first transformer layer:

hidden = embed(token_ids) + positional_encoding(positions)

No further position information is added after that; every later layer sees position only through what was mixed into these vectors at the start.

RoPE happens in every layer, inside attention, after queries and keys have been projected:

q = project_q(hidden); q = apply_rope(q, position)
k = project_k(hidden); k = apply_rope(k, position)
# v is not rotated

Why no rotation on values? Because the attention score (the part that decides who attends to whom) only involves $q$ and $k$. The values are just the “content” that gets mixed in once the weights are decided, and there’s no need to position-encode them.
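To make the data flow concrete, here is a small single-head sketch in plain Python. The already-projected q, k, and v vectors are made-up numbers, the names `rotate` and `attend` are hypothetical, and a causal mask is assumed. Queries and keys are rotated; values are used exactly as given.

```python
import math

def rotate(vec, position, base=10000.0):
    """RoPE on consecutive pairs, with theta_i = base**(-2i/d)."""
    d, out = len(vec), []
    for i in range(d // 2):
        a = position * base ** (-2 * i / d)
        x, y = vec[2 * i], vec[2 * i + 1]
        out += [x * math.cos(a) - y * math.sin(a),
                x * math.sin(a) + y * math.cos(a)]
    return out

def attend(qs, ks, vs, positions):
    """Causal single-head attention: q and k are rotated, v is not."""
    qs = [rotate(q, p) for q, p in zip(qs, positions)]
    ks = [rotate(k, p) for k, p in zip(ks, positions)]
    d = len(qs[0])
    outputs = []
    for t, q in enumerate(qs):
        # Token t scores itself and every earlier token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in ks[: t + 1]]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        total = sum(w)
        # Values are mixed by the weights, unrotated.
        outputs.append([sum(wi / total * v[j] for wi, v in zip(w, vs))
                        for j in range(len(vs[0]))])
    return outputs

# Made-up projected vectors for three tokens.
qs = [[0.5, -1.0, 0.3, 0.8], [1.2, 0.1, -0.7, 0.4], [-0.3, 0.9, 0.6, -0.2]]
ks = [[0.2, 0.7, -0.5, 1.0], [-0.9, 0.4, 0.8, 0.1], [0.6, -0.3, 0.2, 0.5]]
vs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

near = attend(qs, ks, vs, [0, 1, 2])
far = attend(qs, ks, vs, [100, 101, 102])   # same spacing, shifted by 100
spread = attend(qs, ks, vs, [0, 5, 10])     # different spacing
print(near[2], far[2], spread[2])
```

Shifting every position by the same amount leaves the outputs unchanged up to rounding, while changing the spacing between tokens changes them. Position reaches the output only through the attention weights, which is why the values need no rotation.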

We’ve now closed the loop on the cross-token mixing half of a transformer block: attention (with positional information now properly accounted for) lets every token gather information from every other relevant token. The other half of every block is much simpler — it just transforms each token’s vector independently, in a per-position feed-forward network called the MLP. That’s next.