Section 08

A full transformer block

Putting it together

Attention and the MLP are the two main computations a transformer does. A layer — or transformer block, the names are interchangeable — is the recipe for combining them, with two helper ideas glued in: residual connections and normalization. Both helpers exist for a single reason: to make it possible to stack many such blocks on top of each other without the whole tower falling over.

A modern transformer block, in pseudocode

Here’s the canonical pre-norm structure used in Llama, Mistral, Qwen, and friends:

def transformer_block(x):
    # x: (seq, d_model)
    h = x + attention(rmsnorm(x))    # block A: communication
    y = h + mlp(rmsnorm(h))          # block B: computation
    return y

Figure: A modern (pre-norm) transformer block. Data flows top to bottom. The two side loops are residual connections: each sub-block's output is added back to the pre-block input rather than replacing it.

Two halves, both following the same shape:

1. Normalize the input.
2. Apply the operation (attention in the first half, the MLP in the second).
3. Add the operation’s output to the unnormalized input.

That + is the residual connection, and the normalize step is the RMSNorm. Let’s look at each.
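To make the normalize / apply / add pattern concrete, here is a minimal runnable sketch in plain Python. The sublayers `toy_attention` and `toy_mlp` are hypothetical stand-ins, not real attention or a real MLP: the first only mixes information across positions (a causal running average), the second only acts within each position. The learned RMSNorm gain is omitted for brevity.

```python
import math

def rmsnorm(v, eps=1e-6):
    # Rescale one vector to (approximately) unit root-mean-square.
    rms = math.sqrt(sum(t * t for t in v) / len(v) + eps)
    return [t / rms for t in v]

def toy_attention(rows):
    # Stand-in for attention: position i receives the average of positions 0..i.
    # Real attention uses learned, input-dependent weights instead of a uniform average.
    out, running = [], [0.0] * len(rows[0])
    for i, row in enumerate(rows):
        running = [r + t for r, t in zip(running, row)]
        out.append([r / (i + 1) for r in running])
    return out

def toy_mlp(rows):
    # Stand-in for the MLP: the same function applied to each position independently.
    return [[max(0.0, t) for t in row] for row in rows]

def add(a, b):
    # Elementwise sum of two (seq, d_model) nested lists.
    return [[s + t for s, t in zip(ra, rb)] for ra, rb in zip(a, b)]

def transformer_block(x):
    # x: list of seq rows, each of length d_model
    h = add(x, toy_attention([rmsnorm(r) for r in x]))  # block A: communication
    y = add(h, toy_mlp([rmsnorm(r) for r in h]))        # block B: computation
    return y

x = [[1.0, -2.0, 0.5, 3.0], [0.0, 1.0, -1.0, 2.0], [4.0, 0.0, 0.0, -4.0]]
y = transformer_block(x)
print(len(y), len(y[0]))  # 3 4
```

The output has the same shape as the input, which is what lets blocks be stacked. Because the stand-in attention is causal, changing the last position leaves the outputs at earlier positions untouched.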

Residual connections: $\text{out} = x + f(x)$

Without residuals, a deep stack of layers means the output is block_N(block_{N-1}(... block_1(x) ...)), so every step has to fully reproduce whatever it wants to preserve from earlier layers. Information has to flow through every operation intact, and gradients have to flow back through every operation during training. In practice, plain stacks like this become very hard to train beyond roughly ten layers: the signal gets washed out.

With residuals, each block computes a delta to add to the running representation. The “main stream” of information is the residual itself, sometimes called the residual stream. Each block reads from it (via the normalize-then-apply path), computes some refinement, and adds it back. If a block has nothing useful to contribute, it can output approximately zero and the residual passes through unchanged.
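The "delta added to a running stream" picture can be sketched in a few lines. This is an illustration only: `silent` and `nudge` are hypothetical blocks, and normalization is left out so the arithmetic stays visible.

```python
def residual_stack(x, blocks):
    # Each block reads the stream and returns a delta; the stream accumulates the deltas.
    stream = list(x)
    for block in blocks:
        delta = block(stream)
        stream = [s + d for s, d in zip(stream, delta)]
    return stream

def silent(v):
    # A block with nothing to contribute outputs zero.
    return [0.0 for _ in v]

def nudge(v):
    # A hypothetical block that contributes a small refinement.
    return [0.5 * t for t in v]

x = [1.0, -2.0, 3.0]
print(residual_stack(x, [silent] * 12))             # [1.0, -2.0, 3.0]
print(residual_stack(x, [silent, nudge, silent]))   # [1.5, -3.0, 4.5]
```

Twelve silent blocks return the input unchanged, and dropping the silent blocks from the second stack would give the same result as keeping them.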

This is also why models can survive layer pruning surprisingly well — many layers contribute small refinements, and removing one degrades quality smoothly rather than catastrophically.

RMSNorm: Root Mean Square Normalization

The other helper is normalization. Without it, the magnitudes of the residual stream can grow or shrink as the stack gets deeper, and activations explode or vanish. The fix is to renormalize the vector before each operation.

The original Transformer used LayerNorm:

$$\text{LayerNorm}(x) = \gamma \cdot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

where $\mu, \sigma$ are the mean and standard deviation across the vector’s dimensions, and $\gamma, \beta$ are learned per-dimension scale and shift.
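As a sanity check on the formula, here is a minimal plain-Python sketch of LayerNorm over a single vector. The function name and the default `eps` are illustrative choices, not taken from any particular library.

```python
import math

def layernorm(x, gamma, beta, eps=1e-5):
    d = len(x)
    mu = sum(x) / d                               # mean across dimensions
    var = sum((t - mu) ** 2 for t in x) / d       # variance across dimensions
    inv = 1.0 / math.sqrt(var + eps)
    # Center, rescale, then apply the learned per-dimension scale and shift.
    return [g * (t - mu) * inv + b for t, g, b in zip(x, gamma, beta)]

x = [2.0, 4.0, 6.0, 8.0]
ones, zeros = [1.0] * 4, [0.0] * 4
y = layernorm(x, ones, zeros)
shifted = layernorm([t + 100.0 for t in x], ones, zeros)

print(abs(sum(y) / len(y)) < 1e-9)                          # True: output mean is ~0
print(all(abs(a - b) < 1e-9 for a, b in zip(y, shifted)))   # True: a constant offset is removed
```

The second check shows what mean-centering buys: adding the same constant to every dimension does not change the output.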

Modern models use RMSNorm, which drops the mean-centering step:

$$\text{RMSNorm}(x) = \gamma \cdot \frac{x}{\sqrt{\text{RMS}(x)^2 + \epsilon}}, \quad \text{RMS}(x) = \sqrt{\tfrac{1}{d} \sum_i x_i^2}$$

This is empirically just as good as LayerNorm at training, and slightly cheaper at inference (no mean to compute, no bias term). Llama, Mistral, Qwen, Gemma all use RMSNorm.
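The RMSNorm formula translates almost line for line into code. This is a minimal plain-Python sketch over a single vector; the function name and default `eps` are illustrative, and production implementations operate on batched tensors.

```python
import math

def rmsnorm(x, gamma, eps=1e-6):
    mean_sq = sum(t * t for t in x) / len(x)      # RMS(x)^2
    inv = 1.0 / math.sqrt(mean_sq + eps)
    # Rescale only: no mean subtraction and no bias term.
    return [g * t * inv for t, g in zip(x, gamma)]

x = [3.0, -4.0, 0.0, 0.0]                         # RMS(x) = sqrt(25 / 4) = 2.5
y = rmsnorm(x, [1.0] * 4)

print([round(t, 4) for t in y])                   # [1.2, -1.6, 0.0, 0.0]
print(round(sum(y) / len(y), 4))                  # -0.1
```

The output has unit root-mean-square, but its mean is not zero: the vector is rescaled without being re-centered.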

Pre-norm vs post-norm

There’s one more architectural detail: where you normalize.

The original Transformer was post-norm:

out = norm(x + sublayer(x))

Modern models are pre-norm:

out = x + sublayer(norm(x))

Pre-norm keeps the residual stream unnormalized: only the input to each sublayer is normalized. This is much more stable to train at depth, because gradients can flow back along the residual addition without passing through a normalization. Most large language models since GPT-2 have been pre-norm.
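The difference between the two placements is easiest to see with a sublayer that contributes nothing. In this minimal sketch, `norm` is an RMS normalization without a learned gain and `zero_sublayer` is a hypothetical sublayer that always outputs zero.

```python
import math

def norm(v, eps=1e-6):
    rms = math.sqrt(sum(t * t for t in v) / len(v) + eps)
    return [t / rms for t in v]

def post_norm_step(x, sublayer):
    # out = norm(x + sublayer(x)): the stream itself is renormalized at every step.
    return norm([a + b for a, b in zip(x, sublayer(x))])

def pre_norm_step(x, sublayer):
    # out = x + sublayer(norm(x)): only the sublayer's input is normalized.
    return [a + b for a, b in zip(x, sublayer(norm(x)))]

def zero_sublayer(v):
    return [0.0] * len(v)

x = [10.0, -20.0, 30.0, 0.0]
pre = pre_norm_step(x, zero_sublayer)
post = post_norm_step(x, zero_sublayer)

print(pre == x)    # True: the stream passes through untouched
print(post == x)   # False: the stream was rescaled to unit RMS
```

With pre-norm the skip path is an exact identity, so a silent sublayer leaves the stream alone. With post-norm every step rewrites the stream, even when the sublayer adds nothing.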

Putting one layer’s parameters in your head

For Llama-3-8B, one transformer block contains, roughly:

Two RMSNorm scale vectors (4096 params each, tiny).
Attention: $W_Q, W_K, W_V, W_O$. $W_Q$ and $W_O$ are 4096 × 4096 (~16.8M each); $W_K$ and $W_V$ are smaller because of GQA (see §12), at 4096 × 1024 (~4.2M each). ~42M parameters total.
SwiGLU MLP: $W_{\text{up}}, W_{\text{gate}}, W_{\text{down}}$, each 4096 × 14336 (with $W_{\text{down}}$ transposed). ~176M parameters.

So roughly 220M parameters per layer, dominated by the MLP. Multiply by 32 layers and you get about 7B; the remaining ~1B sits mostly in the token embedding and output projection.
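The per-layer arithmetic can be checked with a short script. The values `d_model = 4096`, `d_ff = 14336`, and 32 layers come from the text above; the head counts (32 query heads, 8 key/value heads, head dimension 128) are assumptions taken from the published Llama-3-8B configuration rather than from this section.

```python
d_model = 4096
d_ff = 14336
n_layers = 32
n_heads = 32        # assumption: query heads
n_kv_heads = 8      # assumption: key/value heads under GQA
head_dim = d_model // n_heads   # 128

norms = 2 * d_model                           # two RMSNorm scale vectors
w_q = d_model * (n_heads * head_dim)          # 4096 x 4096
w_k = d_model * (n_kv_heads * head_dim)       # 4096 x 1024
w_v = d_model * (n_kv_heads * head_dim)       # 4096 x 1024
w_o = (n_heads * head_dim) * d_model          # 4096 x 4096
attn = w_q + w_k + w_v + w_o
mlp = 3 * d_model * d_ff                      # up, gate, down projections

per_layer = norms + attn + mlp
all_layers = n_layers * per_layer

print(f"attention: {attn / 1e6:.1f}M")        # attention: 41.9M
print(f"mlp:       {mlp / 1e6:.1f}M")         # mlp:       176.2M
print(f"per layer: {per_layer / 1e6:.1f}M")   # per layer: 218.1M
print(f"32 layers: {all_layers / 1e9:.2f}B")  # 32 layers: 6.98B
```

The count covers the transformer blocks only; the embedding table and output head are not included.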

We have one block. Now we just need to stack them — and bolt on the parts at the top and bottom that turn token IDs into logits.