Embeddings
Token IDs → vectors
We now have a list of integers. That’s still not useful for a neural network: the integers are arbitrary indices, and the network needs to understand that “king” and “queen” are related while “king” and “banana” are not. We need each integer to become an object that can carry meaning: a list of real numbers. That object is called an embedding, a dense vector representation of a token (typically 2k–8k floats), arranged so that similar tokens get nearby vectors.
A giant lookup table
The first layer of the model is, structurally, the simplest one of all: an embedding matrix. It is a 2-D table with one row per token in the vocabulary and one column per dimension of the model’s “hidden size”, a number conventionally written d_model (read “d-model”).
For Llama-3-8B, the vocabulary has 128,256 entries and d_model = 4096. So the embedding matrix is 128,256 × 4096: a little over half a billion numbers, around a gigabyte all on its own when each number is stored in 16 bits. To turn a token ID into a vector, you just go to that row of the table. Token ID 9906 (“Hello”) returns row 9906: a vector of 4,096 floats. No arithmetic at all, just indexing.
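The lookup can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the table is shrunk to a toy 8 × 4 size, the values are random rather than trained, and the names `embedding_matrix` and `embed` are made up for this example. The size arithmetic at the end assumes 2 bytes per number.

```python
import random

# Toy sizes so the example runs instantly; Llama-3-8B uses 128,256 x 4,096.
VOCAB_SIZE = 8
D_MODEL = 4

rng = random.Random(0)

# One row per token ID, one column per hidden dimension.
# Values are random placeholders, not trained weights.
embedding_matrix = [
    [rng.gauss(0.0, 0.02) for _ in range(D_MODEL)]
    for _ in range(VOCAB_SIZE)
]

def embed(token_ids):
    """Turn token IDs into vectors by plain row indexing, no arithmetic."""
    return [embedding_matrix[t] for t in token_ids]

# The same ID always returns the same row.
vectors = embed([5, 2, 5])

# Size of the full-scale table described above.
n_params = 128_256 * 4_096        # 525,336,576 numbers
n_bytes_16bit = n_params * 2      # assumes 2 bytes per number
print(n_params, round(n_bytes_16bit / 1e9, 2), "GB")
```

Real frameworks store the table as one contiguous array and gather rows in a batch, but the operation is the same: an index goes in, a row comes out.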
What does an embedding mean?
A 4,096-dimensional vector is hard to picture. But the rough intuition is: each dimension represents something about the token. One dimension might roughly correspond to “is this a noun?”, another to “is this related to royalty?”, another to “is this a proper noun?”, and so on. Real embeddings don’t have such clean axes — the meaningful directions are tangled across many dimensions — but the geometric facts are striking. Tokens with similar meanings really do end up near each other. The vector for “king” really is closer to “queen” than to “banana”, and the famous result king − man + woman ≈ queen works (roughly) in these spaces.
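“Near each other” is usually measured with cosine similarity. The sketch below uses hand-made 3-dimensional toy vectors, not values from any real model: the axes (royalty, maleness, fruitiness) are deliberately clean so the geometry is easy to follow, which real embeddings are not. The names `toy`, `cosine` and `analogy` are made up for this example.

```python
import math

# Hand-made toy vectors (NOT real model weights).
# Axes, by construction: [royalty, maleness, fruitiness]
toy = {
    "king":   [0.9,  0.3, 0.0],
    "queen":  [0.9, -0.3, 0.0],
    "man":    [0.1,  0.3, 0.0],
    "woman":  [0.1, -0.3, 0.0],
    "banana": [0.0,  0.0, 1.0],
}

def cosine(a, b):
    """Cosine similarity: 1 = same direction, 0 = unrelated (orthogonal)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c):
    """Return the word closest to a - b + c, excluding the three inputs."""
    target = [x - y + z for x, y, z in zip(toy[a], toy[b], toy[c])]
    candidates = [w for w in toy if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(toy[w], target))

print(cosine(toy["king"], toy["queen"]) > cosine(toy["king"], toy["banana"]))
print(analogy("king", "man", "woman"))
```

Excluding the input words from the candidates is the usual convention for this kind of analogy test; without it, the nearest vector is often one of the inputs themselves.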
The values are learned. During training the embedding matrix starts random and gets nudged so that the model is better at predicting next tokens. The model “discovers” that certain tokens behave similarly and pushes their rows toward each other.
Try it
Below is a deterministic, fake embedding — the values are seeded from the token text so similar words won’t actually land near each other; they’ll just look like noise. But the shape is real: a token in, a row of d-many floats out. Visualized as colored bars (warm orange-red = negative, cyan = positive, brightness = magnitude, dark slate ≈ zero), this is what a token actually looks like to the rest of the model.
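One way such a fake embedding could be produced is sketched here. This is an assumption about the general approach, not the code behind the visualization: it hashes the token text into a seed and draws d pseudo-random floats from it. The function name `fake_embedding`, the default `d=16`, and the value range are all made up for this example.

```python
import hashlib
import random

def fake_embedding(token_text, d=16):
    """Deterministic pseudo-random vector derived from the token's text."""
    # Hash the text to a stable integer seed. Python's built-in hash() is
    # salted per process for strings, so it would not be reproducible.
    digest = hashlib.sha256(token_text.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")
    rng = random.Random(seed)
    # d floats in [-1, 1]; the sign and magnitude carry no meaning here.
    return [rng.uniform(-1.0, 1.0) for _ in range(d)]

vec = fake_embedding("Hello")
print(len(vec), [round(x, 2) for x in vec[:4]])
```

The same text always yields the same vector, but “Hello” and “hello” land in unrelated places, which is exactly what distinguishes this noise from a trained embedding.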
In the real network, every other vector you’ll meet — the queries, keys, values, the activations between layers, the final hidden state — has the same shape as one of these: a row of floats. The whole rest of the model is just an enormous sequence of operations that transform such vectors into other such vectors.
A common confusion: input embedding vs output projection
There is one detail worth flagging now because it confuses everyone the first time. The model also needs to produce token outputs: at the end, a single vector of floats has to be turned back into a probability distribution over the vocabulary. That requires a “reverse embedding” matrix of the same shape (vocab × d_model).
Many models save memory by tying these two matrices: the input embedding and the output projection are literally the same parameters. Llama-3-8B does not tie them; they’re two separate matrices. Either choice works; it’s just a tradeoff between parameter count and a small quality gain. We’ll come back to this when we talk about the LM head (the final linear projection from hidden states of size d_model back to vocabulary size, producing logits over every token) in section 9.
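The difference between tied and untied matrices can be sketched as follows. This is a toy illustration, not any model's actual implementation: the sizes are tiny, the weights are random, and the helper names (`random_matrix`, `logits`, `softmax`, `param_count`) are made up for this example. The output step is shown as a dot product of the final hidden vector with every row of a vocab × d_model matrix, followed by a softmax.

```python
import math
import random

VOCAB_SIZE, D_MODEL = 6, 4
rng = random.Random(0)

def random_matrix(rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

input_embedding = random_matrix(VOCAB_SIZE, D_MODEL)

# Tied: the output projection is the very same object as the input embedding.
tied_output = input_embedding
# Untied: a second, independent matrix of the same shape.
untied_output = random_matrix(VOCAB_SIZE, D_MODEL)

def logits(hidden, output_matrix):
    """One score per vocabulary entry: dot product of hidden with each row."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in output_matrix]

def softmax(scores):
    """Turn scores into probabilities (max subtracted for numerical stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def param_count(vocab, d_model, tied):
    """Parameters spent on the embedding plus the output projection."""
    return vocab * d_model * (1 if tied else 2)

hidden = [rng.gauss(0.0, 1.0) for _ in range(D_MODEL)]
probs = softmax(logits(hidden, tied_output))

print(len(probs), round(sum(probs), 6))
print(param_count(128_256, 4_096, tied=True), param_count(128_256, 4_096, tied=False))
```

At the Llama-3-8B sizes quoted earlier, keeping the two matrices separate costs a second table of about 525 million parameters; that is the parameter-count side of the tradeoff.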
Embeddings give every token a vector. We now have everything we need to look at the first real operation the model does on those vectors — the one that lets information flow between tokens and is responsible for almost everything interesting an LLM does. That operation is attention, and it’s the topic of the next section.