Section 07

The MLP block

Per-token nonlinear processing

Attention let tokens look at each other. The other half of every layer does the opposite: it works on each token in isolation. That half is called the MLP block — short for Multi-Layer Perceptron, which is a fancy name for the most ordinary kind of neural net there is.

If attention is the social half of a transformer layer, the MLP is the private half — every token gets pulled aside, run through the same little network, and handed back with its representation refined.

What’s actually inside

An MLP, here, takes the token’s vector (size $d_{\text{model}}$, say 4,096) and runs it through three steps:

Project it up to a much wider vector, the expanded layer (size $d_{\text{ff}}$, usually 3–4× wider).
Apply an elementwise nonlinearity that decides which entries of that wider vector “fire.”
Project the result back down to $d_{\text{model}}$.

That’s it. Two matrix multiplications with a nonlinearity in between. The output is added back onto the token’s vector in the residual stream, and we move on.
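
The procedure above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not any particular model's implementation: the sizes are toy values, the weights are random, biases are left out, and the names `mlp`, `W_up`, and `W_down` are chosen here for readability. GELU is written in its common tanh approximation.

```python
import numpy as np

def gelu(x):
    # Tanh approximation of GELU; smooth, and close to ReLU for large |x|.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mlp(x, W_up, W_down):
    """Up-project, apply the nonlinearity, down-project (one token vector)."""
    hidden = x @ W_up        # (d_model,) -> (d_ff,)
    hidden = gelu(hidden)    # elementwise: decides which entries "fire"
    return hidden @ W_down   # (d_ff,) -> (d_model,)

rng = np.random.default_rng(0)
d_model, d_ff = 8, 24        # toy sizes; real models use thousands
W_up = rng.normal(scale=d_model ** -0.5, size=(d_model, d_ff))
W_down = rng.normal(scale=d_ff ** -0.5, size=(d_ff, d_model))

x = rng.normal(size=d_model)  # stand-in for one token's vector
out = mlp(x, W_up, W_down)
print(out.shape)              # same size as the input vector
```

In a trained model the two matrices are learned; everything else about the computation is as small as it looks here.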

An MLP block, per token

The same tiny neural net runs independently on every token. Input vector goes up to a wider "feature" layer; the nonlinearity decides which features fire; a second projection brings the result back to the original size.

1. Input vector: d_model = 8
↓ multiply by W_up, then apply the nonlinearity (GELU / SiLU)
2. Expanded vector: d_ff = 24 (~3× wider); large entries mean "this feature fired"
↓ multiply by W_down
3. Output vector: back to d_model = 8

Swap in a different token and the same little network sees a different input, so different feature detectors fire and a different output comes out. Crucially, the network is identical at every position: there's no mixing between tokens here, and no sense of sequence. Whatever the MLP learns about "what features matter for this kind of vector," it applies token by token, in parallel.

A useful way to think about the middle “expanded” layer: it’s a collection of feature detectors. Each entry asks something like “does this token look like a verb of motion?” or “does the residual stream look like it’s building up to a comma?” The nonlinearity decides whether each detector fires; the down-projection blends the firing detectors back into a new vector, which the rest of the model reads.

The detectors aren’t designed — they’re learned. But people who have probed real models can often find single MLP neurons that correspond to surprisingly clean things: “indented Python code,” “the word ‘because’ is coming up,” “this is a token inside a quoted string.” It’s not always so clean, but the principle holds — the wide middle layer is where the model stores most of what it knows.

Why there has to be a nonlinearity

Stack two matrix multiplies with nothing in between and they collapse into a single matrix multiply. Multiplying by a matrix can only apply a linear map (scaling, rotating, shearing, projecting), so without the nonlinearity, the MLP couldn’t represent any of the “if A and B, then turn on C” kind of logic that text understanding actually needs. The nonlinearity is the bend. It’s what lets the network be more than a glorified linear map.
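
The collapse is easy to check numerically. The sketch below uses random toy matrices (the names and sizes are illustrative assumptions): without a nonlinearity, the two multiplies match a single multiply by the product matrix; with a ReLU in between, they no longer do.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 8, 24
W_up = rng.normal(size=(d_model, d_ff))
W_down = rng.normal(size=(d_ff, d_model))
x = rng.normal(size=d_model)

# No nonlinearity: two multiplies equal one multiply by the merged matrix.
two_steps = (x @ W_up) @ W_down
W_merged = W_up @ W_down             # a single (d_model, d_model) matrix
one_step = x @ W_merged
linear_collapses = np.allclose(two_steps, one_step)

# ReLU in between: the merged matrix no longer reproduces the output.
with_relu = np.maximum(x @ W_up, 0.0) @ W_down
relu_collapses = np.allclose(with_relu, one_step)

print("linear stack collapses:", linear_collapses)
print("ReLU stack collapses:  ", relu_collapses)
```

Note that the merged matrix is only d_model × d_model, so the wide middle layer buys nothing at all unless something nonlinear happens inside it.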

The specific bend doesn’t matter much; what matters is that there is one. The original Transformer used ReLU. GPT-2 used GELU. Modern Llama-class models use a slightly fancier gated variant called SwiGLU, which adds a second up-projection that acts as a per-feature volume knob on the first. SwiGLU is consistently a little better at the same parameter count, which is why it’s now the default, but it’s an incremental improvement on the same three-step picture above.
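
As a rough sketch of the gated variant, assuming no biases, random toy weights, and the hypothetical names `W_gate`, `W_up`, and `W_down`: one up-projection is passed through SiLU and multiplied elementwise into the other before the down-projection.

```python
import numpy as np

def silu(x):
    # SiLU / Swish: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu_mlp(x, W_gate, W_up, W_down):
    """Gated MLP: the SiLU-activated gate scales each expanded feature."""
    gate = silu(x @ W_gate)      # per-feature "volume knob", shape (d_ff,)
    up = x @ W_up                # the ordinary up-projection, shape (d_ff,)
    return (gate * up) @ W_down  # elementwise product, then project down

rng = np.random.default_rng(0)
d_model, d_ff = 8, 24
W_gate = rng.normal(scale=d_model ** -0.5, size=(d_model, d_ff))
W_up = rng.normal(scale=d_model ** -0.5, size=(d_model, d_ff))
W_down = rng.normal(scale=d_ff ** -0.5, size=(d_ff, d_model))

x = rng.normal(size=d_model)
out = swiglu_mlp(x, W_gate, W_up, W_down)
print(out.shape)
```

The cost of the gate is a third weight matrix, which is why gated MLPs are usually given a somewhat narrower expanded layer than a plain two-matrix MLP at a similar parameter budget.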

Common activation functions

Three nonlinearities you'll see inside the MLP block. They all do the same job — bend the otherwise-linear network — and differ mostly in the shape of the bend.

ReLU: original Transformer

A literal hinge at zero. Below zero: dead. Above zero: passes through unchanged. Simple, cheap, but has a "dying ReLU" problem where neurons can get stuck outputting 0 forever.

GELU: GPT-2 / GPT-3 / BERT

A smooth, slightly curved relative of ReLU. Lets a small amount of negative signal through near zero, which empirically trains better. The default for most pre-Llama models.

SiLU (Swish): Llama, Mistral, Qwen (inside SwiGLU)

Even smoother than GELU. Has a small dip below zero before flattening, which seems to help gradient flow. Used as the gating function inside SwiGLU.
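
For reference, the three functions are short enough to write out with only the standard library. This sketch uses the exact (erf-based) form of GELU; many implementations use a tanh approximation instead.

```python
import math

def relu(x):
    # Hinge at zero: negatives become 0, positives pass through unchanged.
    return max(0.0, x)

def gelu(x):
    # Exact GELU: x times the standard normal CDF evaluated at x.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def silu(x):
    # SiLU / Swish: x times the logistic sigmoid of x.
    return x / (1.0 + math.exp(-x))

for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={x:+.1f}  relu={relu(x):+.3f}  gelu={gelu(x):+.3f}  silu={silu(x):+.3f}")
```

Printing a few values shows the difference in shape: for negative inputs ReLU returns exactly zero, while GELU and SiLU return small negative numbers that shrink back toward zero as the input becomes more negative.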

The MLP holds most of the weights

This is the load-bearing fact for the rest of the essay. For a Llama-3-8B layer:

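Attention: roughly 42 million parameters per layer (grouped-query attention with 8 key/value heads keeps the key and value projections small; with full-width key and value projections it would be about 67 million).
MLP: roughly 176 million parameters per layer, over 4× the attention.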

Multiplied by 32 layers, that’s about 5.6 billion of the model’s 8 billion parameters living in MLP blocks. Most of a modern LLM, by parameter count, is feed-forward networks. Attention gets the credit; the MLP carries the weight.
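
The arithmetic behind these counts can be reproduced in a few lines. The sketch assumes the publicly documented Llama-3-8B shapes (hidden size 4,096, MLP width 14,336, 32 layers, 32 query heads, 8 key/value heads) and ignores norms and embeddings. It counts attention two ways, because the figure depends on how wide the key and value projections are.

```python
d_model = 4096
d_ff = 14336
n_layers = 32
n_heads = 32
n_kv_heads = 8
head_dim = d_model // n_heads        # 128

# SwiGLU MLP: gate, up, and down projections, each d_model x d_ff.
mlp_params = 3 * d_model * d_ff

def attn_params(kv_heads):
    """Q and O are full width; K and V scale with the number of KV heads."""
    q_and_o = 2 * d_model * d_model
    k_and_v = 2 * d_model * (kv_heads * head_dim)
    return q_and_o + k_and_v

attn_full = attn_params(n_heads)     # full-width K and V
attn_gqa = attn_params(n_kv_heads)   # grouped-query attention, 8 KV heads
mlp_total = n_layers * mlp_params

print(f"MLP per layer:              {mlp_params:,}")
print(f"Attention (full-width K/V): {attn_full:,}")
print(f"Attention (8 KV heads):     {attn_gqa:,}")
print(f"MLP across all layers:      {mlp_total:,}")
```

Either way of counting attention leaves the MLP as the clear majority of each layer.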

This is a large part of what makes decoding memory-bound. Every time the model generates one new token, it has to read all of those MLP weights from HBM (the GPU’s main memory, which is slow relative to the compute units) and use them for a single token’s worth of math. The matrix units finish in microseconds and then sit waiting for the next chunk of weights to arrive. We’ll come back to this asymmetry in §11 and §13.
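
A back-of-the-envelope way to see the imbalance is to count floating-point operations per byte of weights read. The sketch below is an assumption-laden estimate, not a measurement: it assumes 16-bit weights, a three-matrix (SwiGLU) MLP, one multiply and one add per weight per token, and it ignores activation traffic. The function name is made up for this example.

```python
def mlp_arithmetic_intensity(d_model, d_ff, batch_tokens, bytes_per_weight=2):
    """FLOPs performed per byte of MLP weights read in one decoding step."""
    weights = 3 * d_model * d_ff
    flops = 2 * weights * batch_tokens        # multiply + add per weight, per token
    bytes_read = weights * bytes_per_weight   # each weight is read once per step
    return flops / bytes_read

for batch in (1, 8, 64, 512):
    intensity = mlp_arithmetic_intensity(4096, 14336, batch)
    print(f"tokens per step = {batch:4d}  ->  {intensity:6.1f} FLOPs per byte")
```

With one token per step the ratio is about one operation per byte. Current accelerators generally need a ratio far higher than that to keep their matrix units busy, which is why processing many tokens per step changes the picture.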

Position-wise, not sequence-wise

Worth saying once more, because it matters: the MLP runs the same way at every position, independently. There’s no mixing between tokens here, no sense of order, no past or future. All cross-token mixing happens in attention; all per-token transformation happens in the MLP. A transformer layer alternates those two operations, and that’s the whole recipe.
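
The independence claim can be checked directly. In this sketch (random toy weights, ReLU for brevity, illustrative names), running the MLP on a whole sequence matches running it one token at a time, editing one token leaves every other output untouched, and shuffling the tokens simply shuffles the outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 5, 8, 24
W_up = rng.normal(size=(d_model, d_ff))
W_down = rng.normal(size=(d_ff, d_model))

def mlp(X):
    # Works on one vector (d_model,) or a whole sequence (seq_len, d_model).
    return np.maximum(X @ W_up, 0.0) @ W_down

X = rng.normal(size=(seq_len, d_model))
Y = mlp(X)

# 1. Whole sequence at once equals one token at a time.
per_token = np.stack([mlp(x) for x in X])
same_as_loop = np.allclose(Y, per_token)

# 2. Changing token 2 leaves every other position's output unchanged.
X_edit = X.copy()
X_edit[2] = rng.normal(size=d_model)
Y_edit = mlp(X_edit)
others_unchanged = np.allclose(np.delete(Y, 2, axis=0), np.delete(Y_edit, 2, axis=0))

# 3. Shuffling the token order shuffles the outputs the same way.
perm = rng.permutation(seq_len)
order_blind = np.allclose(mlp(X[perm]), Y[perm])

print(same_as_loop, others_unchanged, order_blind)
```

None of these three checks would pass for an attention layer, where each output depends on the other tokens in the sequence.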

Speaking of which — we now have all the pieces. Let’s assemble them into a single transformer block.