Parallelism
Splitting one model across thousands of GPUs
The last chapter ended on an impossibility: a frontier model’s state is measured in terabytes, and its compute in numbers that would take one GPU centuries. The only way out is to split the work across thousands of GPUs at once. How you split it is parallelism, and there are four distinct axes — each with its own communication pattern and its own price. We’ll walk through all four, then bring them together in an interactive map.
Data parallelism: copy the model, split the batch
The simplest axis. Put a full copy of the model on every GPU, give each one a different slice of the mini-batch, and let them compute gradients independently. Then average the gradients across all GPUs with a collective operation called an all-reduce, so everyone applies the same update and the copies stay in sync.
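To make the averaging step concrete, here is a minimal sketch in plain Python. It assumes a toy one-parameter model `y_hat = w * x` with a mean-squared-error loss, and the "GPUs" are just list entries. The names `grad_mse` and `all_reduce_mean` are illustrative, not a real library API. It shows that when shards are equal in size, averaging the per-GPU gradients reproduces the gradient of the whole mini-batch.

```python
# Toy data parallelism: every "GPU" holds the same weight w
# and computes a gradient on its own shard of the mini-batch.

def grad_mse(w, shard):
    """Gradient of mean squared error for y_hat = w * x over one shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(values):
    """Stand-in for an all-reduce: every worker ends up with the same mean."""
    mean = sum(values) / len(values)
    return [mean] * len(values)

# Made-up (x, y) training examples, purely for illustration.
batch = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2),
         (5.0, 9.8), (6.0, 12.1), (7.0, 14.0), (8.0, 15.9)]

num_gpus = 4
shard_size = len(batch) // num_gpus
shards = [batch[i * shard_size:(i + 1) * shard_size] for i in range(num_gpus)]

w = 0.5    # identical starting weight on every replica
lr = 0.01  # learning rate

local_grads = [grad_mse(w, shard) for shard in shards]  # computed independently
synced_grads = all_reduce_mean(local_grads)             # one collective per step
new_weights = [w - lr * g for g in synced_grads]        # replicas stay in sync

full_batch_grad = grad_mse(w, batch)  # what a single big GPU would have computed
```

Every entry of `new_weights` is identical, which is the whole point: the replicas never drift apart, and the result matches a single device processing the full mini-batch.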
Data parallelism is easy and efficient, but it has a hard ceiling: each GPU still holds the entire model and optimizer state. If the model doesn’t fit on one GPU, pure data parallelism can’t help. That’s where the other axes, and a key refinement, come in.
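A rough back-of-the-envelope check of that ceiling can be sketched as follows. Two assumptions are baked in and are not taken from this chapter: roughly 16 bytes of training state per parameter (a commonly cited figure for mixed-precision Adam) and an 80 GB accelerator. Activations are ignored, so real usage is higher.

```python
# Assumption: mixed-precision Adam keeps about 16 bytes per parameter:
# 2 (16-bit weights) + 2 (16-bit gradients) + 12 (fp32 master weights,
# momentum, and variance). Activations are not counted here.
BYTES_PER_PARAM = 16

# Assumption: an 80 GB accelerator.
GPU_MEMORY_GB = 80

def training_state_gb(num_params):
    """Approximate weights + gradients + optimizer state, in gigabytes."""
    return num_params * BYTES_PER_PARAM / 1e9

def fits_on_one_gpu(num_params):
    """True if the training state alone fits in one device's memory."""
    return training_state_gb(num_params) <= GPU_MEMORY_GB

small = 1_000_000_000   # 1B parameters
large = 70_000_000_000  # 70B parameters

small_gb = training_state_gb(small)  # 16.0 GB: data parallelism alone works
large_gb = training_state_gb(large)  # 1120.0 GB: no single GPU can hold it
```

Under these assumptions, adding more data-parallel replicas does nothing for the 70B case, because every replica would need the full 1120 GB.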
Tensor parallelism: split each layer
Tensor parallelism cuts within a layer. A big matrix multiply is divided column-wise (or row-wise) across several GPUs, each computing part of the output, with a collective (typically an all-reduce) to stitch the pieces back together. Because this communication happens inside every layer, on every forward and backward pass, it is extremely chatty, so tensor parallelism is usually kept within a single node, where GPUs are joined by ultra-fast NVLink (roughly 900 GB/s per GPU on an H100, much faster than PCIe). Sequence parallelism is a common companion that splits the leftover normalization and dropout work along the token dimension.
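The two ways of cutting a matrix multiply can be sketched in plain Python with small integer matrices. This is an illustration under simple assumptions (two "GPUs", dimensions that divide evenly), and the helpers `split_columns` and `split_rows` are hypothetical names. A column split leaves each GPU with a slice of the output that must be concatenated, while a row split leaves each GPU with a full-size partial result that must be summed, which is the job an all-reduce does.

```python
def matmul(a, b):
    """Plain nested-list matrix multiply: (n x k) @ (k x m) -> (n x m)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def split_columns(m, parts):
    """Cut a matrix into `parts` equal blocks of columns."""
    width = len(m[0]) // parts
    return [[row[p * width:(p + 1) * width] for row in m] for p in range(parts)]

def split_rows(m, parts):
    """Cut a matrix into `parts` equal blocks of rows."""
    height = len(m) // parts
    return [m[p * height:(p + 1) * height] for p in range(parts)]

X = [[1, 2, 3, 4],
     [5, 6, 7, 8]]            # activations: 2 tokens x 4 features
W = [[1, 0, 2, 1],
     [0, 1, 1, 3],
     [2, 1, 0, 1],
     [1, 2, 1, 0]]            # weight matrix: 4 x 4
num_gpus = 2

# Column-parallel: each GPU holds some columns of W and sees all of X.
col_partials = [matmul(X, w_shard) for w_shard in split_columns(W, num_gpus)]
# Each GPU owns a slice of the output columns; concatenate them.
col_out = [[v for part in col_partials for v in part[i]] for i in range(len(X))]

# Row-parallel: each GPU holds some rows of W and the matching columns of X.
row_partials = [matmul(x_shard, w_shard)
                for x_shard, w_shard in zip(split_columns(X, num_gpus),
                                            split_rows(W, num_gpus))]
# Each GPU has a full-size partial sum; add them elementwise (the all-reduce).
row_out = [[sum(part[i][j] for part in row_partials) for j in range(len(W[0]))]
           for i in range(len(X))]

reference = matmul(X, W)  # what a single GPU would compute
```

Both `col_out` and `row_out` equal `reference`. In either layout the GPUs must exchange data for every such layer, which is why this axis is so sensitive to link speed.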
Pipeline parallelism: split the depth
Pipeline parallelism assigns each GPU a contiguous range of layers: GPU 0 does layers 1–8, GPU 1 does 9–16, and so on. A batch flows through like an assembly line, each stage passing its activations to the next. The communication is light (just activations at stage boundaries), but there’s a subtler cost: the pipeline bubble, the idle time at the start and end of each batch while the pipeline fills and drains. Splitting the batch into smaller micro-batches keeps more stages busy and shrinks the bubble.
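The size of the bubble can be estimated with a small simulation. The sketch below assumes a simplified schedule: every stage takes exactly one time slot per micro-batch, and only the forward pass is modelled. Real schedules interleave forward and backward work, so treat the numbers as illustrative. The function names are hypothetical.

```python
from fractions import Fraction

def schedule(num_stages, num_microbatches):
    """For each stage, list which micro-batch it runs in each time slot.

    None marks an idle slot. Stage s can start micro-batch b only at
    slot s + b, once the previous stage has handed it over.
    """
    total_slots = num_microbatches + num_stages - 1
    return [[t - s if 0 <= t - s < num_microbatches else None
             for t in range(total_slots)]
            for s in range(num_stages)]

def bubble_fraction(num_stages, num_microbatches):
    """Share of all (stage, slot) pairs in which a stage sits idle."""
    rows = schedule(num_stages, num_microbatches)
    idle = sum(row.count(None) for row in rows)
    total = sum(len(row) for row in rows)
    return Fraction(idle, total)

stages = 4
one_big_batch = bubble_fraction(stages, 1)    # no micro-batching at all
four_micro = bubble_fraction(stages, 4)
sixteen_micro = bubble_fraction(stages, 16)
```

In this simplified model the idle share works out to (stages - 1) / (micro-batches + stages - 1): with 4 stages it falls from 3/4 with a single batch to 3/7 with 4 micro-batches and 3/19 with 16.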
Expert parallelism: split the experts
For Mixture-of-Experts models (which we’ll meet properly with DeepSeek-V3), there’s a fourth axis. An expert is one of many parallel feed-forward sub-networks inside an MoE layer; a small router network scores the experts for each token and sends it to only the top few, so each token is processed by just a fraction of the layer’s parameters. Expert parallelism scatters those experts across GPUs, and each token is routed over the network to whichever experts it selected, using an all-to-all exchange. The challenge is balance: if routing sends too many tokens to experts living on one GPU, that GPU becomes the bottleneck while others idle. Much of the MoE literature is about keeping this routing balanced.
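The balance problem is easy to see in a toy example. The sketch below assumes 8 experts spread evenly over 4 GPUs, top-2 routing, and hand-written router scores chosen to be skewed. None of these numbers come from a real model, and the helper names are hypothetical.

```python
from collections import Counter

NUM_EXPERTS = 8
NUM_GPUS = 4
EXPERTS_PER_GPU = NUM_EXPERTS // NUM_GPUS  # experts 0-1 on GPU 0, 2-3 on GPU 1, ...
TOP_K = 2

def top_k_experts(scores, k):
    """Indices of the k highest-scoring experts for one token."""
    return sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:k]

def gpu_of(expert):
    """Which GPU hosts a given expert under the even placement above."""
    return expert // EXPERTS_PER_GPU

# Made-up router scores: one row per token, one column per expert.
router_scores = [
    [0.9, 0.8, 0.1, 0.0, 0.2, 0.1, 0.0, 0.3],
    [0.7, 0.1, 0.6, 0.0, 0.2, 0.1, 0.0, 0.3],
    [0.1, 0.9, 0.2, 0.0, 0.8, 0.1, 0.0, 0.3],
    [0.8, 0.7, 0.2, 0.0, 0.1, 0.1, 0.0, 0.3],
    [0.1, 0.2, 0.0, 0.3, 0.1, 0.0, 0.9, 0.8],
    [0.6, 0.1, 0.2, 0.0, 0.1, 0.5, 0.0, 0.3],
]

# The dispatch plan: which (token, expert) pairs each GPU must receive.
# In a real system this is what the all-to-all exchange delivers.
dispatch = {gpu: [] for gpu in range(NUM_GPUS)}
for token, scores in enumerate(router_scores):
    for expert in top_k_experts(scores, TOP_K):
        dispatch[gpu_of(expert)].append((token, expert))

load = Counter({gpu: len(items) for gpu, items in dispatch.items()})
mean_load = sum(load.values()) / NUM_GPUS
imbalance = max(load.values()) / mean_load  # 1.0 would be perfectly balanced
```

With these scores GPU 0 receives 7 of the 12 token-expert assignments while GPU 1 receives only 1, so the step takes as long as GPU 0 needs while the other devices wait.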
Here are all four axes side by side. Toggle each one to see what every GPU holds and what it has to communicate:
Putting it together: 3D (and 4D) parallelism
No real run uses just one axis. A frontier training job composes them: 3D parallelism (data × tensor × pipeline), plus expert parallelism when the model is sparse. The art is matching each axis to the interconnect it can afford: the chattiest axis (tensor) goes on the fastest links (NVLink within a node), while the axis that synchronizes least often (data, with one gradient all-reduce per step) can stretch across the slowest (Ethernet across racks).
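One way to picture the composition is as a coordinate system over GPU ranks. The sketch below assumes a hypothetical cluster of 64 GPUs with 8 per node, and a layout where the tensor index varies fastest. That ordering is a common convention rather than anything this chapter prescribes, and the sizes are chosen only for illustration.

```python
# Hypothetical cluster: 64 GPUs, 8 per node, tensor index varies fastest.
GPUS_PER_NODE = 8
TP, PP, DP = 8, 4, 2          # tensor, pipeline, and data parallel degrees
WORLD_SIZE = TP * PP * DP

def rank_to_coords(rank):
    """Map a global rank to (data, pipeline, tensor) coordinates."""
    tp_idx = rank % TP
    pp_idx = (rank // TP) % PP
    dp_idx = rank // (TP * PP)
    return dp_idx, pp_idx, tp_idx

def node_of(rank):
    """Which physical node a rank lives on."""
    return rank // GPUS_PER_NODE

def group_members(axis, rank):
    """Ranks that match `rank` on every coordinate except the named axis."""
    axis_pos = {"dp": 0, "pp": 1, "tp": 2}[axis]
    mine = rank_to_coords(rank)
    return [r for r in range(WORLD_SIZE)
            if all(a == b
                   for i, (a, b) in enumerate(zip(rank_to_coords(r), mine))
                   if i != axis_pos)]

tp_group = group_members("tp", 0)  # peers sharing every layer's matmuls
pp_group = group_members("pp", 0)  # successive pipeline stages
dp_group = group_members("dp", 0)  # replicas that average gradients

tp_nodes = {node_of(r) for r in tp_group}
dp_nodes = {node_of(r) for r in dp_group}
```

Because the tensor degree equals the node size here, every tensor-parallel group (`[0, 1, ..., 7]` for rank 0) sits inside one node, while the data-parallel group (`[0, 32]`) spans nodes 0 and 4 and can tolerate the slower links between them.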
We’ve now assembled the entire training machine: objective, gradient, optimizer, precision, compute budget, and the parallelism to run it at scale. Exactly one ingredient is left before we can turn to the models themselves — the thing all of this consumes. The data.