DeepSeek-V3
MoE, MLA, MTP, and FP8 at scale
Paper: DeepSeek-V3 Technical Report — DeepSeek-AI, 2024
If Llama 3 is the modern dense baseline, DeepSeek-V3 (DeepSeek-AI, 2024) is the modern efficiency masterclass. It trained a model with 671 billion total parameters to frontier quality for a reported compute cost of under $6 million (about 2.8 million H800 GPU-hours at an assumed rental price of $2 per GPU-hour, not counting earlier research and ablation runs), a result that reset expectations for what a training run costs. Almost all of that saving came from rethinking architecture and numerics together. This is the densest chapter; take it slowly.
Mixture-of-Experts: huge capacity, small active compute
DeepSeek-V3 is a Mixture-of-Experts (MoE) model. Of its 671B parameters, only 37B are active per token: a router sends each token to a small subset of experts, so compute tracks the 37B, not the 671B. You get the knowledge capacity of a giant model at the FLOP cost of a medium one.
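As a back-of-the-envelope check on that claim, the sketch below applies the common rule of thumb that a forward pass costs roughly 2 FLOPs per parameter used, per token. That rule is an approximation (it ignores attention over the context) and is an assumption of this example, not a figure from the report.

```python
# Rule-of-thumb assumption: forward FLOPs per token ~= 2 * (parameters used).
TOTAL_PARAMS = 671e9   # every expert plus all shared weights
ACTIVE_PARAMS = 37e9   # parameters a single token actually passes through

active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
flops_moe = 2 * ACTIVE_PARAMS     # approximate per-token cost of the MoE model
flops_dense = 2 * TOTAL_PARAMS    # approximate cost of a dense model of equal size

print(f"active fraction: {active_fraction:.1%}")
print(f"dense / MoE forward cost: {flops_dense / flops_moe:.1f}x")
```

Under this approximation each token touches about 5.5% of the weights, and a dense model of the same total size would cost roughly 18 times more per token.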
DeepSeek’s specific design, DeepSeekMoE, has two ingredients:
- Fine-grained experts. Many small experts instead of a few big ones, giving the router more precise specialization.
- Shared experts. One or more experts that every token always passes through, alongside its routed experts. They absorb common knowledge so the routed experts are free to specialize. DeepSeek-V3 uses one shared expert and 256 routed experts per MoE layer, with 8 routed experts active per token.
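The two ingredients above can be sketched as a toy layer for a single token. This is a minimal illustration, not DeepSeek's implementation: each "expert" is a single small linear map rather than a feed-forward network, the sizes are tiny, and the names `moe_forward`, `router`, and `rand_matrix` are hypothetical. Scoring with a sigmoid and renormalising over the selected experts is an assumption of this sketch.

```python
import math
import random

def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]

def rand_matrix(rng, rows, cols):
    """Small random matrix standing in for trained weights."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def moe_forward(x, shared_experts, routed_experts, router, top_k):
    """Return (output, gates) for one token vector x."""
    # Router: one affinity score per routed expert (sigmoid is an assumption).
    scores = [1.0 / (1.0 + math.exp(-s)) for s in matvec(router, x)]
    # Keep only the top_k highest-scoring routed experts for this token.
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    norm = sum(scores[i] for i in chosen)
    gates = {i: scores[i] / norm for i in chosen}

    out = list(x)  # residual connection
    for expert in shared_experts:      # shared experts: always applied, ungated
        out = [o + y for o, y in zip(out, matvec(expert, x))]
    for i, gate in gates.items():      # routed experts: only the selected ones run
        out = [o + gate * y for o, y in zip(out, matvec(routed_experts[i], x))]
    return out, gates

rng = random.Random(0)
dim, n_shared, n_routed, top_k = 8, 1, 16, 4   # toy sizes for illustration
shared = [rand_matrix(rng, dim, dim) for _ in range(n_shared)]
routed = [rand_matrix(rng, dim, dim) for _ in range(n_routed)]
router = rand_matrix(rng, n_routed, dim)

token = [rng.gauss(0.0, 1.0) for _ in range(dim)]
output, gates = moe_forward(token, shared, routed, router, top_k)
print("routed experts used:", sorted(gates), "out of", n_routed)
```

Only `top_k` of the routed experts do any work for this token, which is where the gap between total and active parameters comes from.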
Play with the router above and you’ll feel the core tension of MoE: as you add experts, total capacity grows but each expert sees fewer tokens, and the load across experts gets uneven. An overloaded expert (and the GPU holding it under expert parallelism, where different experts live on different GPUs and tokens are routed across the network to reach them) becomes a bottleneck.
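The same tension can be reproduced with a small simulation. This is a sketch under stated assumptions: a random per-expert "popularity" bias stands in for a learned router's preferences, and `expert_loads` is a hypothetical helper, not part of any library.

```python
import random
from collections import Counter

def expert_loads(num_tokens, num_experts, top_k, seed=0):
    """Count how many tokens each expert receives under a toy top-k router."""
    rng = random.Random(seed)
    # Hypothetical fixed bias per expert: some experts are simply more popular.
    bias = [rng.gauss(0.0, 1.0) for _ in range(num_experts)]
    loads = Counter()
    for _ in range(num_tokens):
        # Per-token score = expert bias + token-specific noise.
        scores = [b + rng.gauss(0.0, 1.0) for b in bias]
        top = sorted(range(num_experts), key=scores.__getitem__, reverse=True)[:top_k]
        loads.update(top)
    return [loads[e] for e in range(num_experts)]

for num_experts in (8, 32, 128):
    loads = expert_loads(num_tokens=2000, num_experts=num_experts, top_k=2)
    mean_load = sum(loads) / num_experts
    print(f"{num_experts:>3} experts: mean load {mean_load:7.1f}, "
          f"busiest expert {max(loads) / mean_load:.1f}x the mean")
```

The mean load per expert falls in direct proportion to the number of experts, while the ratio between the busiest expert and the mean shows how far the router is from a balanced assignment.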
Multi-head Latent Attention: a tiny KV cache
DeepSeek’s second architectural lever is Multi-head Latent Attention (MLA). Standard attention caches a key and value vector for every head at every position; this is the KV cache that dominates memory at long context. MLA instead compresses the keys and values into a single small low-rank latent vector per token, from which the per-head keys and values are reconstructed on the fly. The cache stores the latent, not all the heads.
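A minimal sketch of that compress-then-reconstruct idea follows. The sizes are toy values and the names `w_down`, `w_up_k`, `w_up_v`, and `latent_cache` are hypothetical. It also leaves out details of the real design, such as how positional information is handled, so treat it as an illustration of the caching pattern only.

```python
import random

def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]

def rand_matrix(rng, rows, cols):
    """Small random matrix standing in for trained weights."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_model, d_latent, n_heads, d_head = 16, 4, 2, 8   # toy sizes for illustration

w_down = rand_matrix(rng, d_latent, d_model)           # hidden state -> shared latent
w_up_k = rand_matrix(rng, n_heads * d_head, d_latent)  # latent -> keys for all heads
w_up_v = rand_matrix(rng, n_heads * d_head, d_latent)  # latent -> values for all heads

latent_cache = []  # the only thing stored per token

def cache_token(hidden):
    """Compress a token's hidden state and store just the latent."""
    latent_cache.append(matvec(w_down, hidden))

def split_heads(flat):
    """Cut a flat vector into one slice per attention head."""
    return [flat[h * d_head:(h + 1) * d_head] for h in range(n_heads)]

def reconstruct_keys_values():
    """Rebuild per-head keys and values from the cached latents on demand."""
    keys = [split_heads(matvec(w_up_k, c)) for c in latent_cache]
    values = [split_heads(matvec(w_up_v, c)) for c in latent_cache]
    return keys, values

for _ in range(3):  # three tokens arrive
    cache_token([rng.gauss(0.0, 1.0) for _ in range(d_model)])

keys, values = reconstruct_keys_values()
cached = len(latent_cache) * d_latent
full_kv = len(latent_cache) * 2 * n_heads * d_head
print(f"numbers cached: {cached}, full per-head K/V would need: {full_kv}")
```

With these toy sizes, three tokens cost 12 cached numbers instead of the 96 that per-head keys and values would need.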
The widget compares the three attention designs we’ve now met. GQA (Llama, Qwen, Gemma) shrinks the KV cache by sharing key/value heads; MLA shrinks it much further by storing one compressed latent. A smaller KV cache means longer contexts and bigger batches fit in memory, which chiefly lowers inference cost.
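The comparison comes down to counting how many numbers each design caches per token per layer. The sketch below does that count. The default sizes (128 heads of dimension 128, 16 shared K/V heads for GQA, a 512-wide latent plus a 64-wide positional key for MLA) are illustrative assumptions, not values taken from this chapter.

```python
def kv_elements_per_token(design, n_heads=128, head_dim=128,
                          n_kv_heads=16, latent_dim=512, rope_dim=64):
    """Numbers cached per token per layer. Default sizes are assumptions."""
    if design == "mha":   # one key and one value vector per head
        return 2 * n_heads * head_dim
    if design == "gqa":   # keys/values shared across groups of query heads
        return 2 * n_kv_heads * head_dim
    if design == "mla":   # one compressed latent, plus a small positional key
        return latent_dim + rope_dim
    raise ValueError(f"unknown design: {design}")

baseline = kv_elements_per_token("mha")
for design in ("mha", "gqa", "mla"):
    n = kv_elements_per_token(design)
    print(f"{design}: {n:>6} values per token per layer "
          f"({baseline / n:.0f}x smaller than MHA)")
```

With these assumed sizes the counts are 32768, 4096, and 576. Multiply by the number of layers, the context length, and the bytes per value to get an actual cache size.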
Multi-Token Prediction
The third change is to the objective itself. Alongside ordinary next-token prediction, DeepSeek-V3 trains with Multi-Token Prediction (MTP): at each position the model also predicts further future tokens (one extra token in DeepSeek-V3) using a small additional prediction module. This densifies the training signal (more to learn per position) and, as a bonus, the extra module can be reused for speculative decoding to speed up inference.
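A minimal sketch of how such an objective can be assembled is shown below: the usual next-token loss plus a down-weighted loss for the token after that. The function `mtp_loss` and the weight of 0.3 are hypothetical choices for illustration, and the logits are dummy values rather than model outputs.

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]

def mtp_loss(main_logits, extra_logits, tokens, weight=0.3):
    """main_logits[i] predicts tokens[i+1]; extra_logits[i] predicts tokens[i+2].

    `weight` (hypothetical value) scales the extra-token loss.
    """
    n = len(tokens)
    main = [cross_entropy(main_logits[i], tokens[i + 1]) for i in range(n - 1)]
    extra = [cross_entropy(extra_logits[i], tokens[i + 2]) for i in range(n - 2)]
    main_loss = sum(main) / len(main)
    extra_loss = sum(extra) / len(extra)
    return main_loss + weight * extra_loss, main_loss, extra_loss

# Dummy example: a 4-word vocabulary and uniform (all-zero) logits everywhere.
tokens = [2, 0, 3, 1, 2]
vocab_size = 4
uniform = [[0.0] * vocab_size for _ in tokens]
total, main_loss, extra_loss = mtp_loss(uniform, uniform, tokens)
print(f"next-token loss {main_loss:.4f}, extra-token loss {extra_loss:.4f}, "
      f"combined {total:.4f}")
```

Each position now contributes two targets instead of one, which is what "densifying" the signal means here. At inference time the extra prediction can be dropped or used to draft tokens.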
FP8 training, validated at scale
Finally, the numerics. DeepSeek-AI reports the first validation of FP8 mixed-precision training on an extremely large-scale model. Recall from the precision chapter that FP8’s range is tiny; DeepSeek’s answer is fine-grained scaling (separate scaling factors for small tiles or blocks of each tensor) plus keeping the most sensitive accumulations in higher precision. The payoff is, in theory, up to roughly double the matrix-multiply throughput of BF16 and about half the memory for tensors held in FP8, which is a big part of how the run came in so cheap.
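To see why fine-grained scaling matters, the sketch below simulates rounding to an E4M3-style FP8 format in plain Python and compares one scale for the whole tensor against one scale per block of 128 values. The rounding routine is a simplified model written for this example (it saturates at the format's maximum and ignores special values), and the block size and test data are assumptions chosen for illustration.

```python
import math
import random

E4M3_MAX = 448.0  # largest finite value in the common E4M3 FP8 layout

def round_to_e4m3(x):
    """Round x to the nearest value with 4 exponent bits and 3 mantissa bits."""
    if x == 0.0:
        return 0.0
    magnitude = min(abs(x), E4M3_MAX)         # saturate instead of overflowing
    exponent = math.frexp(magnitude)[1] - 1   # floor(log2(magnitude))
    exponent = max(exponent, -6)              # below 2**-6 values are subnormal
    step = 2.0 ** (exponent - 3)              # spacing between representable values
    return math.copysign(round(magnitude / step) * step, x)

def fake_quantize(values, block_size):
    """Scale each block to fill the FP8 range, round, then scale back."""
    out = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        peak = max(abs(v) for v in block) or 1.0   # avoid dividing by zero
        scale = E4M3_MAX / peak
        out.extend(round_to_e4m3(v * scale) / scale for v in block)
    return out

def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

rng = random.Random(0)
tensor = [rng.gauss(0.0, 1e-3) for _ in range(512)]
tensor[5] = 1000.0  # a single outlier, as often seen in activations

per_tensor = fake_quantize(tensor, block_size=len(tensor))  # one scale overall
per_block = fake_quantize(tensor, block_size=128)           # one scale per block

print(f"per-tensor scaling error: {mean_abs_error(tensor, per_tensor):.2e}")
print(f"per-block scaling error:  {mean_abs_error(tensor, per_block):.2e}")
```

With a single scale, the outlier forces every small value toward zero. With per-block scales, only the block containing the outlier pays that price, so the overall error drops.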
DeepSeek-V3 is the template for efficient frontier pre-training: sparse compute (MoE), a tiny KV cache (MLA), a denser objective (MTP), and aggressive numerics (FP8). We’ll see its ideas echo through the 2026 frontier. But first, two more open models — Qwen and Gemma — each with a different emphasis.