Section 26

DeepSeek-V4

The next generation of efficient MoE

Paper: DeepSeek-V4 Technical Report — DeepSeek-AI, 2026

DeepSeek-V4 (DeepSeek-AI, 2026) is the most architecturally dense report of the 2026 frontier, and the natural successor to the DeepSeek-V3 work covered in chapter 18. Where V3 was about training a frontier model cheaply, V4 is about a single hard target: million-token context at affordable cost. Every new piece serves that goal.

The lineup

The V4 series is, unsurprisingly, Mixture-of-Experts: DeepSeek-V4-Pro with 1.6 trillion total parameters (49B active) and DeepSeek-V4-Flash at 284B (13B active), both supporting a one-million-token context window. Both were pre-trained on more than 32 trillion tokens, over double DeepSeek-V3’s 14.8T and a reminder that token counts keep climbing.
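
To make the sparsity concrete, the short calculation below works out what fraction of each model's parameters is active per token, and how the token budget compares with V3. It uses only the figures quoted above; the helper name active_fraction is illustrative, and "more than 32 trillion" is treated as exactly 32T, so the token ratio is a lower bound.

```python
# Figures quoted in the text: (total parameters, active parameters per token).
MODELS = {
    "DeepSeek-V4-Pro": (1.6e12, 49e9),
    "DeepSeek-V4-Flash": (284e9, 13e9),
}


def active_fraction(total_params: float, active_params: float) -> float:
    """Share of the parameters that participate in one token's forward pass."""
    return active_params / total_params


for name, (total, active) in MODELS.items():
    print(f"{name}: {active_fraction(total, active):.1%} of parameters active")

# Pre-training tokens: V4 (at least 32T) versus V3 (14.8T).
token_ratio = 32e12 / 14.8e12
print(f"V4 / V3 pre-training tokens: at least {token_ratio:.2f}x")
```

Roughly 3.1% of V4-Pro and 4.6% of V4-Flash is active for any given token, which is why total parameter count says little about per-token compute in an MoE model.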

Hybrid attention for million-token context

The KV cache is the enemy of long context, and V4 attacks it with a hybrid attention architecture — different attention mechanisms in different layers:

Compressed Sparse Attention (CSA): attend to a compressed, sparsely-selected subset of past tokens rather than all of them.
Heavily Compressed Attention (HCA): an even more aggressively compressed variant for layers that can tolerate it.

This is the lineage of V3’s Multi-head Latent Attention, pushed to the extreme that million-token context demands. The payoff is stark: at one-million-token context, V4-Pro reportedly needs only ~27% of the per-token inference FLOPs and ~10% of the KV cache of the previous DeepSeek generation. Long context stops being a memory catastrophe and becomes routine.
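
The general idea of compressing the cache and then attending sparsely can be sketched in a few lines of NumPy. This is a toy under stated assumptions, not V4's actual mechanism: mean pooling over fixed blocks stands in for whatever learned compression the report uses, top-k block selection stands in for its sparse selection, and the names compress_kv, sparse_attend, block, and top_k are hypothetical.

```python
import numpy as np


def compress_kv(keys, values, block):
    """Shrink the cache by mean-pooling each run of `block` tokens into one entry.

    Assumption: mean pooling is a stand-in for a learned compressor.
    A ragged tail shorter than `block` is dropped for simplicity.
    """
    seq_len, dim = keys.shape
    n_blocks = seq_len // block
    usable = n_blocks * block
    ck = keys[:usable].reshape(n_blocks, block, dim).mean(axis=1)
    cv = values[:usable].reshape(n_blocks, block, dim).mean(axis=1)
    return ck, cv


def sparse_attend(query, ck, cv, top_k):
    """Attend only to the `top_k` compressed entries that score highest."""
    scores = ck @ query / np.sqrt(query.shape[0])
    top_k = min(top_k, scores.shape[0])
    idx = np.argpartition(scores, -top_k)[-top_k:]  # indices of the top_k scores
    weights = np.exp(scores[idx] - scores[idx].max())  # stable softmax
    weights /= weights.sum()
    return weights @ cv[idx], idx


rng = np.random.default_rng(0)
keys = rng.standard_normal((1024, 64))
values = rng.standard_normal((1024, 64))
query = rng.standard_normal(64)

ck, cv = compress_kv(keys, values, block=16)
out, chosen = sparse_attend(query, ck, cv, top_k=8)
print("cache entries:", keys.shape[0], "->", ck.shape[0])
print("entries attended per query:", len(chosen))
```

In this toy the cache shrinks by the block size (16x here) and each query touches only 8 entries, so both memory and per-token compute fall. A "heavily compressed" layer would correspond to a larger block; the real trade-off is how much retrieval quality each layer can afford to lose.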

Two more upgrades: mHC and Muon

V4 also revisits two pieces we’d taken as fixed:

Manifold-Constrained Hyper-Connections (mHC). Recall from chapter 9 that residual connections — the simple $x + \text{Sublayer}(x)$ — are what make deep transformers trainable. Hyper-connections generalize them, learning richer ways to combine layer inputs and outputs; the manifold-constrained variant keeps that flexibility numerically well-behaved. After years of plain residuals, even that primitive is being upgraded.
The Muon optimizer. Like Kimi, V4 moves off AdamW to Muon for faster convergence and better stability. Two independent frontier labs adopting Muon in 2026 is a strong signal that the optimizer landscape, static for years, is genuinely shifting.
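
The hyper-connection idea can be sketched as follows. The sketch keeps n parallel copies of the residual stream, feeds the sublayer a learned mix of them, and writes the output back with learned weights. The constraint shown, forcing the stream-mixing matrix to be doubly stochastic through Sinkhorn normalization, is an assumption about what "manifold-constrained" means here rather than a reproduction of the report; the names sinkhorn, hyper_connection_block, pre_w, and post_w are hypothetical.

```python
import numpy as np


def sinkhorn(logits, n_iters=50):
    """Push a square matrix toward doubly stochastic form (rows and columns sum to 1)."""
    m = np.exp(logits - logits.max())  # strictly positive entries
    for _ in range(n_iters):
        m = m / m.sum(axis=1, keepdims=True)  # normalize rows
        m = m / m.sum(axis=0, keepdims=True)  # normalize columns
    return m


def hyper_connection_block(streams, sublayer, res_logits, pre_w, post_w):
    """One layer with n residual streams.

    streams:    (n, d) parallel copies of the hidden state
    res_logits: (n, n) unconstrained parameters for stream-to-stream mixing
    pre_w:      (n,)   weights that combine the streams into the sublayer input
    post_w:     (n,)   weights that distribute the sublayer output to the streams
    """
    mix = sinkhorn(res_logits)        # constrained mixing matrix (assumed form)
    layer_in = pre_w @ streams        # (d,)
    layer_out = sublayer(layer_in)    # (d,)
    return mix @ streams + np.outer(post_w, layer_out)


# With a single stream and unit weights, the block reduces to x + Sublayer(x).
x = np.array([[1.0, -2.0, 0.5]])
plain = hyper_connection_block(x, np.tanh, np.zeros((1, 1)), np.ones(1), np.ones(1))

# With four streams, the mixing matrix is doubly stochastic.
rng = np.random.default_rng(0)
mix4 = sinkhorn(rng.standard_normal((4, 4)))
```

A doubly stochastic matrix has spectral norm 1 and preserves the average of the streams, so repeated mixing across many layers cannot blow up the signal. That is one concrete reading of "numerically well-behaved".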

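For readers who have not met Muon, the sketch below shows its core step as it appears in the publicly available reference implementation: accumulate momentum, approximately orthogonalize it with a few Newton-Schulz iterations, and use the result as the update for a 2D weight matrix. It is a simplified NumPy illustration, not V4's training code: the learning rate, momentum value, and absence of per-shape scaling are assumptions, and the report's exact variant is not reproduced here.

```python
import numpy as np


def newton_schulz_orthogonalize(g, steps=5, eps=1e-7):
    """Push the singular values of g toward 1 without computing an SVD."""
    a, b, c = 3.4445, -4.7750, 2.0315  # quintic coefficients from the public Muon code
    x = g / (np.linalg.norm(g) + eps)  # Frobenius norm, so all singular values <= 1
    transposed = x.shape[0] > x.shape[1]
    if transposed:                     # iterate on the wide orientation
        x = x.T
    for _ in range(steps):
        gram = x @ x.T
        x = a * x + (b * gram + c * (gram @ gram)) @ x
    return x.T if transposed else x


def muon_step(weight, grad, momentum, lr=0.02, beta=0.95):
    """One simplified Muon update for a 2D weight matrix (no Nesterov, no scaling)."""
    momentum = beta * momentum + grad
    update = newton_schulz_orthogonalize(momentum)
    return weight - lr * update, momentum


rng = np.random.default_rng(0)
grad = rng.standard_normal((16, 32))
ortho = newton_schulz_orthogonalize(grad)
singular_values = np.linalg.svd(ortho, compute_uv=False)
new_weight, new_momentum = muon_step(np.zeros((16, 32)), grad, np.zeros((16, 32)))
```

The iteration does not produce an exactly orthogonal matrix; it leaves the singular values in a band around 1, which is enough to equalize the update across directions. In the reference setup this rule is applied to hidden-layer weight matrices, while embeddings and other non-matrix parameters typically stay on an Adam-style optimizer.
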
One frontier report remains, and it widens the lens as far as it goes — to a model pre-trained on every modality at once.