Section 03

Embeddings

Token IDs → vectors

We now have a list of integers. That’s still not useful for a neural network: the integers are arbitrary indices, and the network needs to understand that “king” and “queen” are related while “king” and “banana” are not. We need each integer to become an object that can carry meaning: a list of real numbers. That object is called an embedding.

A giant lookup table

The first layer of the model is, structurally, the simplest one of all: an embedding matrix. It is a 2-D table with one row per token in the vocabulary and one column per dimension of the model’s “hidden size”, a number conventionally written $d_{\text{model}}$ (read “d-model”).

For Llama-3-8B, the vocabulary has 128,256 entries and $d_{\text{model}} = 4096$. So the embedding matrix is 128,256 × 4096: about 525 million numbers, or roughly a gigabyte on its own at 16 bits per value. To turn a token ID into a vector, you read the row at that index. Token ID 9906 (“Hello”) returns row 9906: a vector of 4,096 floats. No arithmetic at all, just indexing.
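As a minimal sketch of the lookup, the snippet below builds a tiny random table and indexes into it. The sizes (`VOCAB_SIZE = 10`, `D_MODEL = 4`) and the helper name `embed` are illustrative assumptions, not Llama's real dimensions or API. The size arithmetic at the end assumes 2 bytes per value.

```python
import random

# Toy sizes so the example runs instantly; Llama-3-8B uses 128,256 x 4,096.
VOCAB_SIZE = 10
D_MODEL = 4

rng = random.Random(0)  # fixed seed so the "random" table is reproducible
embedding_matrix = [
    [rng.gauss(0.0, 1.0) for _ in range(D_MODEL)]
    for _ in range(VOCAB_SIZE)
]

def embed(token_ids):
    """Map each token ID to its row of the table: pure indexing, no arithmetic."""
    return [embedding_matrix[i] for i in token_ids]

vectors = embed([7, 2, 7])
print(len(vectors), len(vectors[0]))  # 3 tokens, each a row of D_MODEL floats
print(vectors[0] is vectors[2])       # True: the same ID always gives the same row

# Size check for the real model, assuming 2 bytes per value (16-bit storage).
n_values = 128_256 * 4_096
print(n_values)                       # 525336576
print(round(n_values * 2 / 1e9, 2))   # 1.05 (GB)
```

Note that a repeated token ID returns the very same row: the table knows nothing about where in the sequence a token appears.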

What does an embedding mean?

A 4,096-dimensional vector is hard to picture. But the rough intuition is: each dimension represents something about the token. One dimension might roughly correspond to “is this a noun?”, another to “is this related to royalty?”, another to “is this a proper noun?”, and so on. Real embeddings don’t have such clean axes, because the meaningful directions are tangled across many dimensions, but the geometric facts are striking. Tokens with similar meanings tend to end up near each other. The vector for “king” tends to sit closer to “queen” than to “banana”, and the famous analogy king − man + woman ≈ queen, first reported for word-embedding models such as word2vec, holds at least roughly in these spaces.
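To make “near each other” concrete, here is a small sketch using cosine similarity on hand-made 3-D vectors. The vectors and their axis meanings are invented for illustration and are not taken from any real model. Real embeddings have no such clean axes, so real results are noisier.

```python
import math

# Hand-made toy vectors (hypothetical, NOT real embeddings).
# Axes, purely for illustration: [royalty, maleness, fruitiness].
toy = {
    "king":   [0.9,  0.3, 0.0],
    "queen":  [0.9, -0.3, 0.0],
    "man":    [0.1,  0.3, 0.0],
    "woman":  [0.1, -0.3, 0.0],
    "banana": [0.0,  0.0, 1.0],
}

def cosine(a, b):
    """Cosine similarity: 1 = same direction, 0 = unrelated (orthogonal)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(vec, exclude):
    """Return the vocabulary word whose vector points most nearly along vec."""
    candidates = [w for w in toy if w not in exclude]
    return max(candidates, key=lambda w: cosine(vec, toy[w]))

print(round(cosine(toy["king"], toy["queen"]), 2))   # 0.8
print(round(cosine(toy["king"], toy["banana"]), 2))  # 0.0

# king - man + woman, computed component by component.
analogy = [k - m + w for k, m, w in zip(toy["king"], toy["man"], toy["woman"])]
result = nearest(analogy, exclude={"king", "man", "woman"})
print(result)  # queen
```

Excluding the three input words from the search follows the usual convention for analogy tests. Without it, the nearest neighbour is often one of the inputs.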

The values are learned. During training the embedding matrix starts random and gets nudged so that the model is better at predicting next tokens. The model “discovers” that certain tokens behave similarly and pushes their rows toward each other.
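The “nudging” can be sketched as one gradient-descent step on a single row. This is a deliberately simplified stand-in: real training uses a next-token prediction loss over the whole network, whereas the squared-error loss, the `target` vector, and the learning rate here are assumptions chosen only to show the mechanics.

```python
import math
import random

rng = random.Random(0)
VOCAB_SIZE, D_MODEL = 5, 3
E = [[rng.gauss(0.0, 1.0) for _ in range(D_MODEL)] for _ in range(VOCAB_SIZE)]
before = [row[:] for row in E]  # snapshot of the random starting table

token_id = 2
target = [1.0, 0.0, -1.0]  # hypothetical vector the loss "wants" this row to be
lr = 0.1                   # learning rate (illustrative)

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

dist_before = distance(E[token_id], target)

# Stand-in loss: sum((row - target)^2). Its gradient w.r.t. the row is 2*(row - target).
grad = [2.0 * (e - t) for e, t in zip(E[token_id], target)]
E[token_id] = [e - lr * g for e, g in zip(E[token_id], grad)]

dist_after = distance(E[token_id], target)
print(dist_after < dist_before)  # True: the row moved toward the target

# Only the looked-up row received a gradient; every other row is unchanged.
unchanged = all(E[i] == before[i] for i in range(VOCAB_SIZE) if i != token_id)
print(unchanged)  # True
```

The second check shows a property that carries over to real training: a row only gets a gradient through the embedding lookup when its token appears in the batch.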

Try it

Below is a deterministic, fake embedding — the values are seeded from the token text so similar words won’t actually land near each other; they’ll just look like noise. But the shape is real: a token in, a row of d-many floats out. Visualized as colored bars (warm orange-red = negative, cyan = positive, brightness = magnitude, dark slate ≈ zero), this is what a token actually looks like to the rest of the model.

Embedding vector visualizer

Showing a fake 128-D vector. Real models (Llama-3-8B) use d_model = 4,096. The values aren't real embeddings — they're deterministic noise seeded from the token — but the *shape* is the point: a token becomes a long row of floats.

Example readout for the token "king": dim = 128, ‖v‖₂ = 4.251, first 3 values: 0.56, -0.00, -0.47…
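One way to build such a deterministic fake embedding is sketched below. It is not the visualizer's actual code: the hashing scheme, the function name `fake_embedding`, and the unit-variance Gaussian are all assumptions, so the numbers it prints will not match the readout above.

```python
import hashlib
import math
import random

def fake_embedding(token, dim=128):
    """Deterministic noise seeded from the token text (hypothetical scheme)."""
    # Hash the text to get a stable integer seed; Python's built-in hash()
    # is salted per process, so it would not be reproducible across runs.
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

v = fake_embedding("king")
norm = math.sqrt(sum(x * x for x in v))
print(len(v))                        # 128
print(v == fake_embedding("king"))   # True: same token, same vector
print(v == fake_embedding("queen"))  # False: unrelated noise, no shared meaning
```

Because the seed comes from a hash, “king” and “queen” get unrelated vectors, which is what makes this fake: only the shape is realistic.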

In the real network, most of the vectors you’ll meet, including the activations passed between layers and the final hidden state, have the same shape as one of these: a row of $d_{\text{model}}$ floats. Queries, keys, and values are computed from such vectors by projection and are typically split into smaller per-head pieces. The whole rest of the model is just an enormous sequence of operations that transform such vectors into other such vectors.

A common confusion: input embedding vs output projection

There is one detail worth flagging now because it confuses everyone the first time. The model also needs to produce tokens as output: at the end, a single vector of $d_{\text{model}}$ floats has to be turned back into a probability distribution over the vocabulary. That requires a “reverse embedding” matrix of the same shape (vocab × $d_{\text{model}}$), which scores the vector against every vocabulary entry before the scores are normalized into probabilities.
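The reverse direction can be sketched as a dot product of the final vector with every row of the output matrix, followed by a softmax. The toy sizes, random weights, and the names `W_out` and `hidden` are assumptions for illustration. A real model applies further steps, such as a final normalization layer, before this projection.

```python
import math
import random

rng = random.Random(0)
VOCAB_SIZE, D_MODEL = 6, 4  # toy sizes

# Output projection: same shape as the embedding matrix (vocab x d_model).
W_out = [[rng.gauss(0.0, 1.0) for _ in range(D_MODEL)] for _ in range(VOCAB_SIZE)]
hidden = [rng.gauss(0.0, 1.0) for _ in range(D_MODEL)]  # final hidden state

# One score ("logit") per vocabulary entry: dot product of hidden with each row.
logits = [sum(w * h for w, h in zip(row, hidden)) for row in W_out]

# Softmax turns scores into probabilities; subtracting the max avoids overflow.
m = max(logits)
exps = [math.exp(x - m) for x in logits]
total = sum(exps)
probs = [e / total for e in exps]

best = max(range(VOCAB_SIZE), key=lambda i: probs[i])
print(len(probs))            # 6: one probability per vocabulary entry
print(round(sum(probs), 6))  # 1.0
print(best == max(range(VOCAB_SIZE), key=lambda i: logits[i]))  # True
```

Unlike the input lookup, this step is real arithmetic: every vocabulary row takes part in the computation for every output position.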

Many models save memory by tying these two matrices: the input embedding and the output projection are literally the same parameters. Llama-3-8B does not tie them; it keeps two separate matrices. Either choice works. Untied weights cost an extra vocab × $d_{\text{model}}$ parameters in exchange for a small quality gain. We’ll come back to this when we talk about the LM head in section 9.
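A rough sketch of what tying means, in numbers and in code. The parameter counts use the Llama-3-8B figures quoted earlier. The tiny 3 × 2 matrix and the variable names are made up for illustration.

```python
VOCAB_SIZE, D_MODEL = 128_256, 4_096  # Llama-3-8B figures quoted in the text

params_per_matrix = VOCAB_SIZE * D_MODEL
untied_params = 2 * params_per_matrix  # separate input and output matrices
tied_params = params_per_matrix        # one matrix shared by both roles
print(untied_params - tied_params)     # 525336576 parameters saved by tying

# Toy illustration of "literally the same parameters".
embedding = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # 3 tokens, 2 dims
tied_output = embedding                            # same object, no copy
untied_output = [row[:] for row in embedding]      # independent copy

embedding[0][0] = 9.0        # pretend a training update touched row 0
print(tied_output[0][0])     # 9.0: the tied matrix sees the update
print(untied_output[0][0])   # 0.1: the untied matrix does not
```

With tying, every update to the embedding is also an update to the output projection, so the two roles have to share one set of values.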

Embeddings give every token a vector. We now have everything we need to look at the first real operation the model does on those vectors — the one that lets information flow between tokens and is responsible for almost everything interesting an LLM does. That operation is attention, and it’s the topic of the next section.