Section 09

Stacking into a full model

From embeddings to logits

Stacking. That’s the punchline. A “large” language model is a transformer block repeated 30, 60, 100, or even 120 times, with three small pieces at the ends to convert tokens in and logits out. Here is the entire forward pass in a few lines:

def llm_forward(token_ids):
    h = embed(token_ids)              # (seq,) → (seq, d_model)
    for block in transformer_blocks:  # 32 blocks for Llama-3-8B, 80 for 70B
        h = block(h)                  # attention + MLP + residuals + RMSNorms
    h = rmsnorm(h)                    # one final norm
    logits = h @ W_lm_head            # (seq, d_model) → (seq, vocab_size)
    return logits

That’s it. That’s the whole model. The complexity is entirely in the scale of the matrices, not the structure.
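To make the skeleton above concrete, here is a minimal runnable sketch in NumPy. It is an illustration under stated assumptions, not the real Llama implementation: the sizes are toy values, the weights are random, attention is single-head with no positional encoding, and every helper name (`init_block`, `embed_table`, `w_lm_head`, and so on) is invented for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes, chosen only so the example runs instantly (not real model dimensions).
VOCAB, D_MODEL, D_FF, N_LAYERS = 50, 16, 32, 3

def rmsnorm(x, weight, eps=1e-5):
    # Divide each vector by its root-mean-square, then apply a learned gain.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(x, p):
    # Simplified single-head causal self-attention.
    q, k, v = x @ p["wq"], x @ p["wk"], x @ p["wv"]
    scores = q @ k.T / np.sqrt(q.shape[-1])
    seq = x.shape[0]
    future = np.triu(np.ones((seq, seq), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(future, -np.inf, scores)              # hide future positions
    return softmax(scores) @ v @ p["wo"]

def mlp(x, p):
    # Gated MLP: silu(gate) * up, then project back down.
    gate = x @ p["w1"]
    silu = gate / (1.0 + np.exp(-gate))
    return (silu * (x @ p["w3"])) @ p["w2"]

def block(h, p):
    h = h + attention(rmsnorm(h, p["attn_norm"]), p)  # residual around attention
    h = h + mlp(rmsnorm(h, p["ffn_norm"]), p)         # residual around MLP
    return h

def init_block():
    def w(*shape):
        return rng.normal(0.0, 0.02, shape)
    return {
        "attn_norm": np.ones(D_MODEL), "ffn_norm": np.ones(D_MODEL),
        "wq": w(D_MODEL, D_MODEL), "wk": w(D_MODEL, D_MODEL),
        "wv": w(D_MODEL, D_MODEL), "wo": w(D_MODEL, D_MODEL),
        "w1": w(D_MODEL, D_FF), "w3": w(D_MODEL, D_FF), "w2": w(D_FF, D_MODEL),
    }

embed_table = rng.normal(0.0, 0.02, (VOCAB, D_MODEL))
blocks = [init_block() for _ in range(N_LAYERS)]
final_norm = np.ones(D_MODEL)
w_lm_head = rng.normal(0.0, 0.02, (D_MODEL, VOCAB))

def llm_forward(token_ids):
    h = embed_table[token_ids]        # (seq,) -> (seq, d_model)
    for p in blocks:
        h = block(h, p)               # shape stays (seq, d_model)
    h = rmsnorm(h, final_norm)        # one final norm
    return h @ w_lm_head              # (seq, d_model) -> (seq, vocab_size)

logits = llm_forward(np.array([3, 14, 15, 9, 2]))
print(logits.shape)  # (5, 50)
```

The shape never changes inside the loop, which is why blocks can be stacked to any depth. Only the embedding at the start and the LM head at the end touch the vocabulary dimension.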

Figure: A full LLM forward pass (embedding → blocks → norm → LM head). Llama-3-8B as a concrete example: embedding lookup, 32 stacked transformer blocks, a final RMSNorm, and the LM head. Shape annotations on the right show the tensor at each stage.

Let’s name the three end pieces:

Embedding lookup (already covered in §3): each token ID becomes a $d_{\text{model}}$ vector.
Final RMSNorm: a last normalization before reading anything off.
LM head: a single linear layer that projects from $d_{\text{model}}$ back up to $\text{vocab\_size}$.

The output of the LM head, for every position, is a vector of size $\text{vocab\_size}$, which is 128,256 for Llama-3. Those numbers are the logits. The LM head is the matrix that produces them.

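A logit is an unnormalized score: a higher logit means “the model thinks this token is more likely”, a lower one means less. Only the differences between logits matter, because adding the same constant to every logit leaves the final probabilities unchanged, so the sign of a single logit carries no meaning on its own. To convert logits into a probability distribution we apply softmax across the vocabulary axis: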

$$P(\text{token } v \mid \text{context}) = \frac{\exp(\text{logit}_v)}{\sum_{v'} \exp(\text{logit}_{v'})}$$

And then “sampling” (section 10) picks one token from that distribution.
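The softmax formula above can be sketched in a few lines of NumPy. The logit values here are made up for a four-token vocabulary, and subtracting the maximum is a common numerical-stability trick that is assumed for this sketch rather than taken from the text: it does not change the result, because softmax is unaffected by adding a constant to every logit.

```python
import numpy as np

def softmax(logits):
    # Subtract the max so exp() never overflows; the output is unchanged.
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

logits = np.array([2.0, 1.0, -1.0, 0.0])  # made-up scores for 4 tokens
probs = softmax(logits)

print(probs)        # non-negative values, largest for the largest logit
print(probs.sum())  # sums to 1 (up to floating-point rounding)

# Shifting every logit by the same amount gives the same distribution.
print(np.allclose(probs, softmax(logits + 100.0)))  # True
```

In a real model the same function would be applied to the last row of the `(seq, vocab_size)` logits to get the distribution over the next token.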

Picking apart the model file

The file you download — model.safetensors or consolidated.00.pth — is a dictionary of named tensors. For Llama-3-8B it looks roughly like:

tok_embeddings.weight                # (128256, 4096)
layers.0.attention_norm.weight       # (4096,)
layers.0.attention.wq.weight         # (4096, 4096)
layers.0.attention.wk.weight         # (1024, 4096)   ← smaller, GQA
layers.0.attention.wv.weight         # (1024, 4096)   ← smaller, GQA
layers.0.attention.wo.weight         # (4096, 4096)
layers.0.ffn_norm.weight             # (4096,)
layers.0.feed_forward.w1.weight      # (14336, 4096)   gate
layers.0.feed_forward.w3.weight      # (14336, 4096)   up
layers.0.feed_forward.w2.weight      # (4096, 14336)   down
layers.1.attention_norm.weight       # ... and so on for 32 layers
norm.weight                          # (4096,)
output.weight                        # (128256, 4096)  ← LM head

Add up those shapes and you get 8.03 billion parameters. The file at fp16 (2 bytes per param) is about 16 GB.
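The arithmetic is easy to check by hand. The sketch below copies the shapes from the listing above, multiplies out each one, and repeats the per-layer tensors 32 times. The dictionary names are just labels for this example, not keys guaranteed to match any particular checkpoint.

```python
from math import prod

N_LAYERS = 32

# Shapes copied from the tensor listing above (one transformer block).
per_layer_shapes = {
    "attention_norm": (4096,),
    "attention.wq": (4096, 4096),
    "attention.wk": (1024, 4096),
    "attention.wv": (1024, 4096),
    "attention.wo": (4096, 4096),
    "ffn_norm": (4096,),
    "feed_forward.w1": (14336, 4096),
    "feed_forward.w3": (14336, 4096),
    "feed_forward.w2": (4096, 14336),
}

# Tensors that appear once, outside the stack of blocks.
global_shapes = {
    "tok_embeddings": (128256, 4096),
    "norm": (4096,),
    "output": (128256, 4096),
}

per_layer = sum(prod(shape) for shape in per_layer_shapes.values())
total = N_LAYERS * per_layer + sum(prod(shape) for shape in global_shapes.values())

print(f"{per_layer:,} parameters per layer")              # 218,112,000
print(f"{total:,} parameters total")                      # 8,030,261,248
print(f"{total * 2 / 1e9:.2f} GB at 2 bytes per param")   # 16.06
```

About 87% of the parameters sit in the 32 blocks. The embedding table and the LM head together account for roughly 1.05 billion.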

This matters for inference because that 16 GB has to live somewhere, and the only memory fast enough to feed an H100’s matrix units is HBM. Knowing where every byte of the model lives and how it moves is the topic of section 13.

How depth and width interact

Two knobs control the size of a transformer: depth (number of layers $L$) and width ($d_{\text{model}}$). Empirically, models perform better when these are scaled together: not too tall and skinny, not too short and wide. Modern models tend to follow rough ratios like $d_{\text{model}} \approx 128 \cdot L$. Llama-3-8B is 32 × 4096 (exactly 128 per layer); Llama-3-70B is 80 × 8192 (about 102 per layer). Doubling width roughly quadruples each layer’s parameter count, because the weight matrices grow in both dimensions when the attention and MLP projections scale with $d_{\text{model}}$; doubling depth doubles the number of layers, and with it the parameters in the blocks. Each gives different tradeoffs at inference time:

More depth = more sequential work per token. Layers must run one after another, so pipeline parallelism (§19) can spread them across devices but cannot shorten the chain.
More width = bigger matrices, more compute and memory per layer. Tensor parallelism (§19) helps here.
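A quick way to see how the two knobs differ is to count parameters per block as a function of the dimensions. In the sketch below, `layer_params` is a hypothetical helper that follows the tensor layout listed earlier. The "wide" configuration simply doubles every dimension of the 8B block and is an illustrative assumption, not a real model.

```python
def layer_params(d_model, d_ff, d_kv):
    """Parameter count of one block: attention + gated MLP + two norms."""
    attention = 2 * d_model * d_model + 2 * d_kv * d_model  # wq, wo + wk, wv
    mlp = 3 * d_model * d_ff                                # w1, w3, w2
    norms = 2 * d_model                                     # attention_norm, ffn_norm
    return attention + mlp + norms

def stack_params(n_layers, d_model, d_ff, d_kv):
    """Parameters in all blocks, ignoring embeddings and the LM head."""
    return n_layers * layer_params(d_model, d_ff, d_kv)

base = layer_params(4096, 14336, 1024)   # Llama-3-8B block, shapes from the listing
wide = layer_params(8192, 28672, 2048)   # hypothetical: every dimension doubled

width_ratio = wide / base
depth_ratio = stack_params(64, 4096, 14336, 1024) / stack_params(32, 4096, 14336, 1024)

print(f"doubling width: {width_ratio:.3f}x parameters per layer")
print(f"doubling depth: {depth_ratio:.3f}x parameters in the stack")
```

The width ratio comes out just under 4 because the weight matrices quadruple while the two norm vectors only double. The depth ratio is exactly 2.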

We now have a forward pass that, given a list of token IDs, produces logits at every position. Almost done with the model itself. The remaining question: how do you turn those logits into the next token?