Section 26

DeepSeek-V4

The next generation of efficient MoE

Paper: DeepSeek-V4 Technical Report — DeepSeek-AI, 2026

DeepSeek-V4 (DeepSeek-AI, 2026) is the most architecturally dense report of the 2026 frontier, and the natural successor to chapter 18. Where V3 was about training a frontier model cheaply, V4 is about a single hard target: million-token context at affordable cost. Every new piece serves that goal.

The lineup

The V4 series is, unsurprisingly, Mixture-of-Experts (MoE): DeepSeek-V4-Pro with 1.6 trillion total parameters (49B active) and DeepSeek-V4-Flash at 284B (13B active), both supporting a one-million-token context window. Both were pre-trained on more than 32 trillion tokens, roughly double DeepSeek-V3’s 14.8T and a reminder that token counts keep climbing.
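
To make the gap between total and active parameters concrete, the figures quoted above can be turned into ratios. This is a minimal sketch that uses only the numbers in this section; the variable and function names are illustrative.

```python
# Parameter counts quoted in this section, in billions.
models = {
    "DeepSeek-V4-Pro": {"total_b": 1600.0, "active_b": 49.0},
    "DeepSeek-V4-Flash": {"total_b": 284.0, "active_b": 13.0},
}

def active_fraction(total_b: float, active_b: float) -> float:
    """Share of the model's parameters used to process one token."""
    return active_b / total_b

fractions = {name: active_fraction(**m) for name, m in models.items()}
for name, frac in fractions.items():
    print(f"{name}: {frac:.1%} of parameters active per token")

# Pre-training tokens: "more than 32T" for V4 versus 14.8T for V3,
# so this ratio is a lower bound.
token_ratio = 32.0 / 14.8
print(f"V4 / V3 pre-training tokens: at least {token_ratio:.2f}x")
```

Roughly 3% of V4-Pro's parameters and under 5% of V4-Flash's are touched per token, which is why compute tracks the active count rather than the total.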

Hybrid attention for million-token context

The KV cache is the enemy of long context, and V4 attacks it with a hybrid attention architecture — different attention mechanisms in different layers:

  • Compressed Sparse Attention (CSA): attend to a compressed, sparsely selected subset of past tokens rather than all of them.
  • Heavily Compressed Attention (HCA): an even more aggressively compressed variant for layers that can tolerate it.
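
This section does not give the mechanics of CSA or HCA, so the following is only a toy NumPy sketch of the general idea in the two bullets: compress past tokens into block summaries, then attend to a sparsely selected few of them. Mean-pooling as the compressor, top-k selection by query score, and all names (`compress_blocks`, `sparse_compressed_attention`, `block_size`, `top_k`) are assumptions made for illustration, not DeepSeek's design.

```python
import numpy as np

def compress_blocks(x: np.ndarray, block_size: int) -> np.ndarray:
    """Mean-pool consecutive tokens: (T, d) -> (T // block_size, d)."""
    n_blocks = x.shape[0] // block_size
    trimmed = x[: n_blocks * block_size]
    return trimmed.reshape(n_blocks, block_size, -1).mean(axis=1)

def sparse_compressed_attention(q, keys, values, block_size=4, top_k=2):
    """Single-query attention over the top_k highest-scoring compressed blocks."""
    ck = compress_blocks(keys, block_size)      # compressed keys (assumed compressor)
    cv = compress_blocks(values, block_size)    # compressed values
    scores = ck @ q / np.sqrt(q.shape[0])       # one score per compressed block
    k = min(top_k, scores.shape[0])
    selected = np.argsort(scores)[-k:]          # indices of the k best blocks
    s = scores[selected]
    weights = np.exp(s - s.max())
    weights /= weights.sum()                    # softmax over the selected blocks only
    return weights @ cv[selected], selected

rng = np.random.default_rng(0)
T, d = 32, 8
keys = rng.normal(size=(T, d))
values = rng.normal(size=(T, d))
q = rng.normal(size=d)

out, selected = sparse_compressed_attention(q, keys, values, block_size=4, top_k=2)
print("attended entries:", len(selected), "of", T, "past tokens")
```

In this toy setting the query attends to 2 compressed entries instead of 32 tokens. A more aggressive variant in the spirit of HCA could simply use a larger `block_size`, though that too is an assumption about the general idea rather than the actual mechanism.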

This is the lineage of V3’s Multi-head Latent Attention (MLA), pushed to the extreme that million-token context demands. The payoff is stark: at one-million-token context, V4-Pro reportedly needs only ~27% of the per-token inference FLOPs and ~10% of the KV cache of the previous DeepSeek generation. Long context stops being a memory catastrophe and becomes routine.
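
A back-of-the-envelope estimate shows why the KV cache dominates at this scale. The sketch below uses the standard size formula for uncompressed attention. Every model dimension in it (`n_layers`, `n_kv_heads`, `head_dim`, 2-byte values) is a hypothetical placeholder, not a V3 or V4 figure. The reported ~10% is relative to the previous DeepSeek generation, so applying it to this made-up baseline only illustrates what a tenfold reduction means in memory terms.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Keys plus values stored for every token, layer, and KV head."""
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_value

# Hypothetical uncompressed baseline, NOT a real DeepSeek configuration.
baseline = kv_cache_bytes(seq_len=1_000_000, n_layers=60, n_kv_heads=8, head_dim=128)
baseline_gib = baseline / 2**30

# Illustrative only: a 10x smaller cache for the same hypothetical model.
reduced_gib = 0.10 * baseline_gib

print(f"baseline: {baseline_gib:.1f} GiB per sequence")
print(f"at ~10%:  {reduced_gib:.1f} GiB per sequence")
```

The cache grows linearly with sequence length, so a configuration that is comfortable at a few thousand tokens reaches hundreds of gibibytes per sequence at one million. That is the cost the compressed attention layers are meant to cut.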

Two more upgrades: mHC and Muon

V4 also revisits two pieces we’d taken as fixed:

  • Manifold-Constrained Hyper-Connections (mHC). Recall from chapter 9 that residual connections, the simple $x + \text{Sublayer}(x)$, are what make deep transformers trainable. Hyper-connections generalize them, learning richer ways to combine layer inputs and outputs; the manifold-constrained variant keeps that flexibility numerically well-behaved. After years of plain residuals, even that primitive is being upgraded.
  • The Muon optimizer. Like Kimi, V4 moves off AdamW to Muon for faster convergence and better stability. Two independent frontier labs adopting Muon in 2026 is a strong signal that the optimizer landscape, static for years, is genuinely shifting.
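
The contrast between a plain residual and a hyper-connection in the first bullet can be sketched in NumPy. This section does not specify the parameterization or the manifold, so everything here is an assumption made for illustration: `n` parallel residual streams, a learned `n × n` stream-mixing matrix, and a constraint that makes that matrix doubly stochastic via Sinkhorn normalization. The names `sinkhorn`, `hyper_connection`, `read_w`, `write_w`, and `sublayer` are hypothetical.

```python
import numpy as np

def sublayer(x: np.ndarray) -> np.ndarray:
    """Stand-in for an attention or MLP block."""
    return np.tanh(x)

def residual(x: np.ndarray) -> np.ndarray:
    """Plain residual connection: x + Sublayer(x)."""
    return x + sublayer(x)

def sinkhorn(logits: np.ndarray, n_iters: int = 200) -> np.ndarray:
    """Push a square matrix toward doubly stochastic (rows and columns sum to 1)."""
    m = np.exp(logits)                            # strictly positive entries
    for _ in range(n_iters):
        m = m / m.sum(axis=1, keepdims=True)      # normalize rows
        m = m / m.sum(axis=0, keepdims=True)      # normalize columns
    return m

def hyper_connection(streams, mix_logits, read_w, write_w):
    """streams: (n, d). Mix the streams, then add the sublayer output to each."""
    mix = sinkhorn(mix_logits)                    # assumed constraint on the mixing matrix
    layer_in = read_w @ streams                   # (d,) weighted read across streams
    layer_out = sublayer(layer_in)                # (d,)
    return mix @ streams + np.outer(write_w, layer_out)

rng = np.random.default_rng(0)
n, d = 4, 8
streams = rng.normal(size=(n, d))
mix_logits = rng.normal(size=(n, n))
read_w = np.full(n, 1.0 / n)
write_w = np.ones(n)

out = hyper_connection(streams, mix_logits, read_w, write_w)
mix = sinkhorn(mix_logits)
print("row sums:", mix.sum(axis=1).round(4), "column sums:", mix.sum(axis=0).round(4))
```

With a single stream (`n = 1`) the mixing matrix is forced to 1 and the function collapses to the plain residual. Under the assumed constraint, each mixed stream is a convex combination of the incoming streams, so stacking many layers cannot inflate their scale through the mixing step alone. That is one way to read "numerically well-behaved".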
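
The Muon update in the second bullet can be sketched as well. Muon accumulates momentum and then approximately orthogonalizes the resulting matrix with a Newton-Schulz iteration before applying it, instead of scaling each element separately as Adam does. This sketch swaps in the classic cubic Newton-Schulz iteration for simplicity; practical implementations use a tuned higher-order polynomial and extra scaling that are omitted here. `muon_step` and its hyperparameter values are illustrative, not V4's settings.

```python
import numpy as np

def newton_schulz_orthogonalize(m: np.ndarray, n_iters: int = 30) -> np.ndarray:
    """Approximate the nearest semi-orthogonal matrix to m (cubic Newton-Schulz)."""
    x = m / (np.linalg.norm(m) + 1e-12)    # Frobenius norm, so singular values <= 1
    transposed = x.shape[0] > x.shape[1]
    if transposed:                         # iterate on the wide orientation
        x = x.T
    for _ in range(n_iters):
        x = 1.5 * x - 0.5 * (x @ x.T) @ x  # drives every singular value toward 1
    return x.T if transposed else x

def muon_step(weight, grad, momentum_buf, lr=0.02, beta=0.95):
    """One simplified Muon update for a 2-D weight matrix."""
    momentum_buf = beta * momentum_buf + grad
    update = newton_schulz_orthogonalize(momentum_buf)
    return weight - lr * update, momentum_buf

rng = np.random.default_rng(0)
weight = rng.normal(size=(4, 6))
grad = rng.normal(size=(4, 6))
buf = np.zeros_like(weight)

new_weight, buf = muon_step(weight, grad, buf)
update = newton_schulz_orthogonalize(buf)
singular_values = np.linalg.svd(update, compute_uv=False)
print("singular values of the update:", singular_values.round(4))
```

All singular values of the update end up near 1, so every direction in the weight matrix moves by a comparable amount regardless of how large or small the raw gradient was along it.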

One frontier report remains, and it widens the lens as far as it goes — to a model pre-trained on every modality at once.