Section 07

Parallelism

Splitting one model across thousands of GPUs

The last chapter ended on an impossibility: a frontier model’s state is measured in terabytes, and its compute in numbers that would take one GPU centuries. The only way out is to split the work across thousands of GPUs at once. How you split it is parallelism, and there are four distinct axes — each with its own communication pattern and its own price. We’ll walk through all four, then bring them together in an interactive map.

Data parallelism: copy the model, split the batch

The simplest axis. Put a full copy of the model on every GPU, give each one a different slice of the mini-batch, and let them compute gradients independently. Then average the gradients across all GPUs (a collective operation called an all-reduce) so every copy applies the same update and the copies stay in sync.
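To make the averaging step concrete, here is a minimal sketch on a toy one-parameter model. The "GPUs" are just Python lists, and `grad_mse` and `all_reduce_mean` are illustrative names rather than any library's API. It assumes equal-sized shards, which is what makes the average of the local gradients match the full-batch gradient.

```python
# Toy data parallelism: a 1-parameter linear model y = w * x with MSE loss.
# Workers are simulated in one process; nothing here talks to real GPUs.

def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def all_reduce_mean(values):
    """Stand-in for an all-reduce: every worker ends up with the mean."""
    mean = sum(values) / len(values)
    return [mean] * len(values)

w = 0.5
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]

num_gpus = 4
shard = len(xs) // num_gpus  # assumes the batch divides evenly

# Each worker computes a gradient on its own slice of the mini-batch.
local_grads = [
    grad_mse(w, xs[i * shard:(i + 1) * shard], ys[i * shard:(i + 1) * shard])
    for i in range(num_gpus)
]

# After the all-reduce, every worker holds the same averaged gradient.
synced_grads = all_reduce_mean(local_grads)
full_batch_grad = grad_mse(w, xs, ys)
```

Here `synced_grads` holds the same value on every worker, and that value agrees with `full_batch_grad` up to floating-point rounding, which is why the copies stay in sync.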

Data parallelism is easy and efficient, but it has a hard ceiling: each GPU still holds the entire model and optimizer state. If the model doesn’t fit on one GPU, pure data parallelism can’t help. That’s where the other axes — and a key refinement — come in.

ZeRO / FSDP: data parallelism that shards the state

The fix is to stop replicating what you don’t need to. ZeRO (Zero Redundancy Optimizer) and its PyTorch incarnation FSDP (Fully Sharded Data Parallel) shard the optimizer states, gradients, and even parameters across the data-parallel GPUs. Each device stores only its shard and gathers the rest just in time for each layer’s compute, then frees it again. You keep data parallelism’s simplicity but cut the per-GPU memory for model state by roughly a factor equal to the number of data-parallel GPUs, at the price of some extra communication to gather the shards. It is the single most important trick for fitting large models. Recall from the last chapter that optimizer states are the biggest slice of the budget; this is what shards them.
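A rough back-of-the-envelope sketch shows how much each level of sharding buys. The byte counts below are an assumption (mixed-precision training with Adam: 2 bytes of weights, 2 bytes of gradients, and 12 bytes of fp32 optimizer state per parameter), the stage numbering follows ZeRO's convention, and the 7.5-billion-parameter model on 64 GPUs is a hypothetical example. Activations are ignored.

```python
# Assumed bytes of model state per parameter (mixed-precision Adam):
PARAM_BYTES = 2    # fp16/bf16 weights
GRAD_BYTES = 2     # fp16/bf16 gradients
OPTIM_BYTES = 12   # fp32 master weights + Adam momentum + Adam variance

def state_bytes_per_gpu(num_params, num_gpus, stage):
    """Model-state bytes held by one GPU.

    stage 0: plain data parallelism (everything replicated)
    stage 1: optimizer states sharded
    stage 2: optimizer states and gradients sharded
    stage 3: optimizer states, gradients, and parameters sharded
    """
    p = PARAM_BYTES / (num_gpus if stage >= 3 else 1)
    g = GRAD_BYTES / (num_gpus if stage >= 2 else 1)
    o = OPTIM_BYTES / (num_gpus if stage >= 1 else 1)
    return num_params * (p + g + o)

# Hypothetical example: 7.5B parameters on 64 data-parallel GPUs.
for stage in range(4):
    gb = state_bytes_per_gpu(7.5e9, 64, stage) / 1e9
    print(f"stage {stage}: {gb:.1f} GB of model state per GPU")
```

Under these assumptions, sharding only the optimizer states already removes most of the footprint, which matches the point that they are the biggest slice of the budget.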

Tensor parallelism: split each layer

Tensor parallelism cuts within a layer. A big matrix multiply is divided column-wise (or row-wise) across several GPUs, each computing part of the output, with an all-reduce to stitch the pieces back together. Because this communication happens inside every layer, on every forward and backward pass, it is extremely chatty, so tensor parallelism is usually kept within a single node, where GPUs are joined by ultra-fast NVLink. Sequence parallelism is a common companion that splits the leftover normalization and dropout work along the token dimension.
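The two ways of cutting a matrix multiply can be sketched with plain lists. This is a toy illustration with two simulated GPUs and made-up integer matrices, not how any framework implements it. In the row-wise split each GPU produces a partial sum that must be added together (the all-reduce); in the column-wise split each GPU produces a slice of the output columns that is concatenated instead.

```python
def matmul(a, b):
    """Plain matrix multiply on lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

x = [[1, 2, 3, 4],
     [5, 6, 7, 8]]            # activations: 2 tokens x 4 features
w = [[1, 0, 2],
     [0, 1, 1],
     [3, 1, 0],
     [2, 2, 1]]               # weights: 4 features x 3 outputs

# Row-wise split of w: "GPU 0" owns rows 0-1, "GPU 1" owns rows 2-3.
# Each GPU needs only the matching feature columns of x.
x_parts = [[row[:2] for row in x], [row[2:] for row in x]]
w_parts = [w[:2], w[2:]]
partials = [matmul(xp, wp) for xp, wp in zip(x_parts, w_parts)]

# All-reduce (sum): add the partial outputs element by element.
y_row_split = [[sum(p[i][j] for p in partials) for j in range(3)]
               for i in range(2)]

# Column-wise split of w: each GPU computes some of the output columns,
# and the slices are concatenated rather than summed.
w_cols = [[row[:2] for row in w], [row[2:] for row in w]]
col_outputs = [matmul(x, wc) for wc in w_cols]
y_col_split = [col_outputs[0][i] + col_outputs[1][i] for i in range(2)]

y_reference = matmul(x, w)    # what a single GPU would compute
```

Both `y_row_split` and `y_col_split` equal `y_reference`; the only difference is which collective is needed to reassemble the result.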

Pipeline parallelism: split the depth

Pipeline parallelism assigns each GPU a contiguous range of layers: GPU 0 does layers 1–8, GPU 1 does 9–16, and so on. A batch flows through like an assembly line, each stage passing its activations to the next. The communication is light (just activations at stage boundaries), but there’s a subtler cost: the pipeline bubble, the idle time at the start and end of each batch while the pipeline fills and drains. Splitting the batch into smaller micro-batches keeps more stages busy and shrinks the bubble.
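The effect of micro-batches on the bubble can be sketched with a small schedule simulation. It assumes a simple fill-and-drain schedule in which every stage takes one equal time slot per micro-batch and only one direction of the pass is modeled; real schedulers are more elaborate, and the function names are illustrative.

```python
def pipeline_schedule(num_stages, num_micro_batches):
    """Return grid[stage][slot] = micro-batch index, or None when idle.

    Fill-and-drain model: stage s works on micro-batch m during slot s + m.
    """
    num_slots = num_stages + num_micro_batches - 1
    grid = [[None] * num_slots for _ in range(num_stages)]
    for s in range(num_stages):
        for m in range(num_micro_batches):
            grid[s][s + m] = m
    return grid

def bubble_fraction(num_stages, num_micro_batches):
    """Share of stage-slots spent idle in the schedule above."""
    grid = pipeline_schedule(num_stages, num_micro_batches)
    cells = [cell for row in grid for cell in row]
    return cells.count(None) / len(cells)

# More micro-batches means a smaller share of time lost to fill and drain.
for m in (1, 4, 16, 64):
    print(f"4 stages, {m:>2} micro-batches: {bubble_fraction(4, m):.0%} idle")
```

In this model the idle share works out to (stages - 1) / (micro-batches + stages - 1), so with a single micro-batch a 4-stage pipeline sits idle three quarters of the time.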

Expert parallelism: split the experts

For Mixture-of-Experts models (which we’ll meet properly with DeepSeek-V3), there’s a fourth axis. An expert is one of many parallel feed-forward sub-networks inside an MoE layer; a small router network scores the experts for each token and sends it to only the top few, so each token is processed by just a fraction of the layer’s parameters. Expert parallelism scatters those experts across GPUs, and each token is routed over the network to whichever experts it selected, using an all-to-all exchange. The challenge is balance: if routing sends too many tokens to experts living on one GPU, that GPU becomes the bottleneck while others idle. Much of the MoE literature is about keeping this routing balanced.
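The balance problem is easy to see in a toy router. The sketch below uses made-up router scores for four tokens and four experts, picks the top two experts per token, and counts how many token assignments land on each GPU. The placement rule (consecutive experts share a GPU) and all names are assumptions for illustration.

```python
from collections import Counter

def top_k_experts(scores, k):
    """Indices of the k highest router scores for one token."""
    ranked = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)
    return ranked[:k]

def tokens_per_gpu(router_scores, k, experts_per_gpu):
    """Count how many (token, expert) assignments land on each GPU."""
    load = Counter()
    for scores in router_scores:
        for expert in top_k_experts(scores, k):
            # Assumed placement: experts 0..n-1 on GPU 0, n..2n-1 on GPU 1, ...
            load[expert // experts_per_gpu] += 1
    return dict(load)

# 4 tokens, 4 experts, 2 experts per GPU (experts 0-1 on GPU 0, 2-3 on GPU 1).
router_scores = [
    [0.7, 0.2, 0.05, 0.05],
    [0.6, 0.3, 0.05, 0.05],
    [0.5, 0.1, 0.3, 0.1],
    [0.4, 0.35, 0.15, 0.1],
]
load = tokens_per_gpu(router_scores, k=2, experts_per_gpu=2)
```

With these scores the router favors experts 0 and 1, so GPU 0 receives seven of the eight assignments while GPU 1 receives one: exactly the kind of skew that leaves one device as the bottleneck.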

Here are all four axes side by side. Toggle each one to see what every GPU holds and what it has to communicate:

[Interactive figure: Splitting a model across 8 GPUs. The four axes of parallelism. Real frontier runs combine all of them at once.]

View shown (data parallelism): GPUs 0 through 7 each hold the full model.

Communication: All-reduce gradients once per step

Memory it saves: Nothing; every GPU holds a full copy

The catch: Simple and fast, but caps model size at one GPU

These compose. A run might use tensor parallelism within a node (8 GPUs on fast NVLink), pipeline parallelism across a few nodes, expert parallelism for the MoE layers, and data parallelism across the whole cluster — "3D" (or 4D) parallelism. Choosing the split is a balance between each axis's communication cost and the speed of the links it has to cross.

Putting it together: 3D (and 4D) parallelism

No real run uses just one axis. A frontier training job composes them — 3D parallelism (data × tensor × pipeline), plus expert parallelism when the model is sparse. The art is matching each axis to the interconnect it can afford: the chattiest axis (tensor) goes on the fastest links (NVLink within a node), the lightest (data) can stretch across the slowest (Ethernet across racks).
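One way to picture the composition is as a coordinate system over GPU ranks. The sketch below assumes a layout in which the tensor-parallel index varies fastest, so that each tensor-parallel group occupies consecutive ranks and therefore a single node. The 8 × 4 × 2 shape, the 8 GPUs per node, and the function names are all hypothetical; real frameworks choose their own orderings.

```python
def rank_to_coords(rank, tp, pp, dp):
    """Map a global rank to (dp_idx, pp_idx, tp_idx), tensor axis fastest."""
    assert 0 <= rank < tp * pp * dp
    tp_idx = rank % tp
    pp_idx = (rank // tp) % pp
    dp_idx = rank // (tp * pp)
    return dp_idx, pp_idx, tp_idx

def tensor_group(rank, tp):
    """Ranks that share this rank's tensor-parallel group."""
    start = (rank // tp) * tp
    return list(range(start, start + tp))

TP, PP, DP = 8, 4, 2          # 8 * 4 * 2 = 64 GPUs in total
GPUS_PER_NODE = 8             # assumption: one NVLink domain per node

coords = rank_to_coords(43, TP, PP, DP)      # (dp_idx, pp_idx, tp_idx)
group = tensor_group(43, TP)                 # ranks 40..47
nodes = {r // GPUS_PER_NODE for r in group}  # nodes the group touches
```

Because the tensor axis is innermost, `nodes` contains a single node for every rank, which keeps the chattiest traffic on the fastest links, while the data-parallel axis is outermost and spans the slower links between nodes.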

Two more tricks that show up everywhere

Gradient accumulation sums gradients over several micro-batches before one optimizer step, simulating a huge batch size that wouldn’t fit in memory — important because large models want large batches for stable gradients.
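A minimal sketch of the idea on a toy one-parameter model: gradients from several micro-batches are accumulated and then used for a single update. It assumes equal-sized micro-batches and a mean-style loss, so the accumulated sum is divided by the number of micro-batches to match the large-batch gradient. The names and the learning rate are illustrative.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def accumulated_grad(w, xs, ys, micro_batch_size):
    """Accumulate gradients over micro-batches, then average once."""
    total, steps = 0.0, 0
    for start in range(0, len(xs), micro_batch_size):
        mx = xs[start:start + micro_batch_size]
        my = ys[start:start + micro_batch_size]
        total += grad_mse(w, mx, my)   # only one micro-batch is processed at a time
        steps += 1
    return total / steps

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [3.0, 6.0, 9.0, 12.0, 15.0, 18.0]
w, lr = 1.0, 0.01

g_accum = accumulated_grad(w, xs, ys, micro_batch_size=2)
g_full = grad_mse(w, xs, ys)

# One optimizer step for the whole large batch (plain SGD as a stand-in).
w_next = w - lr * g_accum
```

`g_accum` matches `g_full` up to floating-point rounding, even though the accumulation loop never processes more than two examples at once.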

Overlapping communication with compute hides the cost of all those collectives by launching them while the GPU is busy with other math. Squeezing out these overlaps is a big part of why DeepSeek-V3’s custom pipeline scheduler exists.
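A crude timing model hints at how much overlap can hide. It assumes a backward pass with one collective per layer, a communication channel that runs independently of compute, collectives that execute one after another, and each collective starting as soon as its layer's gradients are ready. The durations are made-up units, and this is not a description of any particular scheduler.

```python
def step_time(num_layers, compute, comm, overlap):
    """Finish time of a backward pass with one collective per layer."""
    if not overlap:
        # Compute and communication never run at the same time.
        return num_layers * (compute + comm)
    compute_done = 0.0
    comm_done = 0.0
    for _ in range(num_layers):
        compute_done += compute
        # A collective starts once its gradients exist and the link is free.
        comm_done = max(compute_done, comm_done) + comm
    return comm_done

# Compute-bound case: only the last collective is left exposed.
serial = step_time(4, compute=2.0, comm=1.0, overlap=False)
hidden = step_time(4, compute=2.0, comm=1.0, overlap=True)

# Communication-bound case: overlap helps, but the link stays the bottleneck.
comm_bound = step_time(4, compute=1.0, comm=2.0, overlap=True)
```

In the compute-bound case the step drops from 12 time units to 9 in this model, with only the final collective sticking out past the end of the math.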

We’ve now assembled the entire training machine: objective, gradient, optimizer, precision, compute budget, and the parallelism to run it at scale. Exactly one ingredient is left before we can turn to the models themselves — the thing all of this consumes. The data.