Section 09

Stacking into a full model

From embeddings to logits

Stacking. That’s the punchline. A “large” language model is a transformer block repeated 30, 60, 100, or even 120 times, with three small pieces at the ends to convert tokens in and logits out. Here is the entire forward pass in a few lines:

def llm_forward(token_ids):
    h = embed(token_ids)              # (seq,) → (seq, d_model)
    for block in transformer_blocks:  # 32 blocks for Llama-3-8B, 80 for 70B
        h = block(h)                  # attention + MLP + residuals + RMSNorms
    h = rmsnorm(h)                    # one final norm
    logits = h @ W_lm_head            # (seq, d_model) → (seq, vocab_size)
    return logits

That’s it. That’s the whole model. The complexity is entirely in the scale of the matrices, not the structure.
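To check the shapes, the pseudocode above can be turned into a toy that actually runs. This is a minimal sketch in pure Python with tiny, made-up dimensions and random weights. `toy_block` is a hypothetical stand-in (one normalized linear map plus a residual add), not a real attention + MLP block, and the helper names (`rand_matrix`, `matmul`) are assumptions for illustration only.

```python
import math
import random

random.seed(0)
VOCAB, D_MODEL, N_BLOCKS = 50, 8, 4   # toy sizes, nothing like Llama's

def rand_matrix(rows, cols, scale=0.1):
    return [[random.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

def matmul(x, w):
    # (n, k) @ (k, m) -> (n, m)
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]

def rmsnorm(h, eps=1e-6):
    # rescale each position's vector to unit root-mean-square (learned gain omitted)
    out = []
    for row in h:
        rms = math.sqrt(sum(v * v for v in row) / len(row) + eps)
        out.append([v / rms for v in row])
    return out

embedding = rand_matrix(VOCAB, D_MODEL)                        # (vocab, d_model)
block_weights = [rand_matrix(D_MODEL, D_MODEL) for _ in range(N_BLOCKS)]
W_lm_head = rand_matrix(D_MODEL, VOCAB)                        # (d_model, vocab)

def toy_block(h, w):
    # hypothetical stand-in for attention + MLP: norm, linear map, residual add
    update = matmul(rmsnorm(h), w)
    return [[a + b for a, b in zip(h_row, u_row)] for h_row, u_row in zip(h, update)]

def llm_forward(token_ids):
    h = [embedding[t][:] for t in token_ids]   # (seq,) -> (seq, d_model)
    for w in block_weights:
        h = toy_block(h, w)                    # shape stays (seq, d_model)
    h = rmsnorm(h)                             # one final norm
    return matmul(h, W_lm_head)                # (seq, d_model) -> (seq, vocab)

logits = llm_forward([3, 17, 42])
print(len(logits), len(logits[0]))  # 3 50
```

The point is only that the hidden state keeps the shape (seq, d_model) through every block, and the vocabulary dimension appears at the very last step.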

A full LLM forward pass
Llama-3-8B as a concrete example: embedding lookup, 32 stacked transformer blocks, a final RMSNorm, and the LM head. Shape annotations on the right show the tensor at each stage.
[Figure: token IDs [9906, 11, 1917, …] → embedding lookup (vocab × d_model table, one row per token) → (seq, 4096) → transformer blocks 1 through 32 (each: RMSNorm → attention → residual add, then RMSNorm → MLP → residual add; same shape, fresh weights per block) → (seq, 4096) → final RMSNorm → (seq, 4096) → LM head (linear projection, d_model → vocab) → (seq, 128,256) logits, one per vocab token per position. Llama-3-8B has 32 blocks; the 70B has 80 and the 405B has 126. At inference we sample from logits[-1], the next token after the prompt.]

Embedding → blocks → norm → LM head

Let’s name the three end pieces:

  • Embedding lookup (already covered in §3): each token ID becomes a $d_{\text{model}}$ vector.
  • Final RMSNorm: a last normalization before reading anything off.
  • LM head: a single linear layer that projects from $d_{\text{model}}$ back up to $\text{vocab\_size}$.

The output of the LM head, for every position, is a vector of size $\text{vocab\_size}$ (128,256 for Llama-3). Those numbers are the logits. The LM head is the matrix that produces them.

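A logit is an unnormalized score: a bigger logit means the model finds that token more likely, and the value can be any real number, positive or negative. Only the differences between logits matter, because adding the same constant to every logit leaves the resulting probabilities unchanged. To convert logits into a probability distribution we apply softmax across the vocabulary axis: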

$$P(\text{token } v \mid \text{context}) = \frac{\exp(\text{logit}_v)}{\sum_{v'} \exp(\text{logit}_{v'})}$$

And then “sampling” (section 10) picks one token from that distribution.
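The softmax formula above can be sketched in a few lines. This is a minimal pure-Python version over a made-up four-token vocabulary; the logit values are invented for illustration. Subtracting the maximum before exponentiating is a common numerical-stability trick and does not change the result.

```python
import math

def softmax(logits):
    m = max(logits)                           # subtract the max so exp() cannot overflow
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 0.5, -1.0, -3.0]               # made-up scores, one per vocabulary token
probs = softmax(logits)

# Adding a constant to every logit gives the same distribution.
shifted = softmax([x + 100.0 for x in logits])

print([round(p, 3) for p in probs])
print(abs(sum(probs) - 1.0) < 1e-9)           # True
```

The largest logit gets the largest probability, the probabilities sum to 1, and `shifted` matches `probs`, which is why the absolute value of a single logit carries no meaning on its own.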

Picking apart the model file

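The file you download (model.safetensors or consolidated.00.pth) is a dictionary of named tensors. Using the tensor names from Meta’s original checkpoint (the Hugging Face safetensors files name the same tensors differently), Llama-3-8B looks roughly like: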

tok_embeddings.weight                # (128256, 4096)
layers.0.attention_norm.weight       # (4096,)
layers.0.attention.wq.weight         # (4096, 4096)
layers.0.attention.wk.weight         # (1024, 4096)   ← smaller, GQA
layers.0.attention.wv.weight         # (1024, 4096)   ← smaller, GQA
layers.0.attention.wo.weight         # (4096, 4096)
layers.0.ffn_norm.weight             # (4096,)
layers.0.feed_forward.w1.weight      # (14336, 4096)   gate
layers.0.feed_forward.w3.weight      # (14336, 4096)   up
layers.0.feed_forward.w2.weight      # (4096, 14336)   down
layers.1.attention_norm.weight       # ... and so on for 32 layers
norm.weight                          # (4096,)
output.weight                        # (128256, 4096)  ← LM head

Add up those shapes and you get 8.03 billion parameters. The file at fp16 (2 bytes per param) is about 16 GB.
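The sum is easy to reproduce from the shapes in the listing. The sketch below only multiplies out the dimensions shown above; the constant names are chosen here for illustration.

```python
VOCAB, D_MODEL, N_LAYERS = 128256, 4096, 32
D_KV, D_FFN = 1024, 14336        # GQA key/value width and MLP hidden width

attention = 2 * D_MODEL * D_MODEL + 2 * D_KV * D_MODEL   # wq and wo, plus wk and wv
mlp = 3 * D_FFN * D_MODEL                                # w1 (gate), w3 (up), w2 (down)
norms = 2 * D_MODEL                                      # attention_norm, ffn_norm
per_layer = attention + mlp + norms

total = (
    VOCAB * D_MODEL              # tok_embeddings
    + N_LAYERS * per_layer       # 32 transformer blocks
    + D_MODEL                    # final norm
    + VOCAB * D_MODEL            # output (LM head)
)

print(f"{per_layer:,} parameters per layer")    # 218,112,000 parameters per layer
print(f"{total:,} parameters in total")         # 8,030,261,248 parameters in total
print(f"{total * 2 / 1e9:.1f} GB at 2 bytes per parameter")   # 16.1 GB at 2 bytes per parameter
```

About 87% of the parameters sit in the 32 blocks; the embedding table and LM head account for roughly half a billion each.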

This matters for inference because that 16 GB has to live somewhere, and the only memory fast enough to feed an H100’s matrix units is HBM. Knowing where every byte of the model lives and how it moves is the topic of section 13.

How depth and width interact

Two knobs control the size of a transformer: depth (number of layers $L$) and width ($d_{\text{model}}$). Empirically, models perform better when these are scaled together: not too tall and skinny, not too short and wide. Modern models tend to follow rough ratios like $d_{\text{model}} \approx 128 \cdot L$. Llama-3-8B is 32 × 4096 (a ratio of exactly 128); Llama-3-70B is 80 × 8192 (a ratio of about 102). Because every weight matrix in a block has $d_{\text{model}}$ on one side and a multiple of it on the other, doubling width roughly quadruples each layer’s parameter count, provided the MLP and key/value widths scale along with it. Doubling depth doubles the number of layers and so roughly doubles the total. Each gives different tradeoffs at inference time:

  • More depth = more sequential work per token. Layers run one after another, so pipeline parallelism (§19) can spread them across devices but cannot shorten the per-token path.
  • More width = bigger matrices, more compute and memory per layer. Tensor parallelism (§19) helps here.
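A quick way to see how each knob moves the parameter count is to plug numbers into the per-layer shapes from the model file. This is a sketch under stated assumptions: the "doubled width" configuration is hypothetical and doubles $d_{\text{model}}$, the key/value width, and the MLP hidden width together, and it ignores the embedding table and LM head.

```python
def layer_params(d_model, d_kv, d_ffn):
    attention = 2 * d_model * d_model + 2 * d_kv * d_model   # wq, wo, wk, wv
    mlp = 3 * d_ffn * d_model                                # gate, up, down
    norms = 2 * d_model
    return attention + mlp + norms

base_layer = layer_params(4096, 1024, 14336)    # Llama-3-8B block shapes
wide_layer = layer_params(8192, 2048, 28672)    # hypothetical: every width doubled

base_stack = 32 * base_layer
deep_stack = 64 * base_layer                    # hypothetical: depth doubled
wide_stack = 32 * wide_layer

print(f"2x depth: {deep_stack / base_stack:.2f}x the block parameters")   # 2.00x
print(f"2x width: {wide_stack / base_stack:.2f}x the block parameters")   # 4.00x
```

Depth scales the stack linearly, while width scales it quadratically, since both dimensions of each weight matrix grow.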

We now have a forward pass that, given a list of token IDs, produces logits at every position. Almost done with the model itself. The remaining question: how do you turn those logits into the next token?