Section 24

Kimi K2.5

Trillion-parameter MoE and the Muon optimizer

Paper: Kimi K2.5: Visual Agentic Intelligence — Moonshot AI, 2026

We’ve reached the 2026 frontier. These four chapters cover the very latest reports, each read from its primary source, and each gets the “what’s genuinely new” treatment. Kimi K2.5 (Moonshot AI, Feb 2026) is a multimodal agentic model, but its pre-training contribution is a sharp, surprising finding about how to fold vision into a language model from the start.

Built on a trillion-parameter MoE base

Kimi K2.5 is built on Kimi K2, a roughly trillion-parameter Mixture-of-Experts transformer — so the sparse-compute, many-experts design from DeepSeek-V3 is now simply the default substrate for frontier models. We won’t re-derive MoE; assume it underneath everything in this group.

MuonClip: training a trillion-parameter model without loss spikes

The Kimi models are notable for using the Muon optimizer rather than AdamW, in a stabilized form called MuonClip. Muon orthogonalizes each weight-matrix update (instead of applying Adam’s per-element rescaling), which can converge faster. At trillion-parameter scale, however, raw Muon is prone to numerical blow-ups caused by exploding attention logits. MuonClip keeps the query-key logits bounded by rescaling them whenever they grow too large, which tames the loss spikes that otherwise derail such runs. It’s the clearest sign yet that the AdamW monopoly on large-scale pre-training is finally being challenged.
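To make the two mechanisms concrete, here is a minimal, dependency-free Python sketch. It is not Moonshot's implementation. `orthogonalize` uses the classic cubic Newton-Schulz iteration as a stand-in for Muon's orthogonalization step; public Muon implementations typically use a tuned higher-order polynomial plus momentum, both omitted here. `qk_clip` assumes the clipping is applied by shrinking the query and key projection matrices by the same factor when a head's largest observed logit exceeds a threshold `tau`. The function names, the threshold, and the toy matrices are all illustrative assumptions.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def orthogonalize(update, steps=30):
    """Approximate the semi-orthogonal polar factor of `update`.

    Assumes a nonzero, full-rank matrix. Scaling by the Frobenius norm
    puts every singular value in (0, 1], where the cubic iteration
    X <- 1.5 X - 0.5 (X X^T) X pushes each singular value toward 1.
    """
    norm = math.sqrt(sum(v * v for row in update for v in row))
    x = [[v / norm for v in row] for row in update]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(row, cubic_row)]
             for row, cubic_row in zip(x, xxt_x)]
    return x


def attention_logit(x_q, x_k, w_q, w_k):
    """Single-head logit (x_q W_q) . (x_k W_k), without softmax scaling."""
    q = matmul([x_q], w_q)[0]
    k = matmul([x_k], w_k)[0]
    return sum(a * b for a, b in zip(q, k))


def qk_clip(w_q, w_k, max_logit, tau):
    """Shrink W_q and W_k so a logit of `max_logit` is pulled back to `tau`.

    The logit is bilinear in (W_q, W_k), so scaling each matrix by
    sqrt(tau / max_logit) scales every logit by tau / max_logit.
    Weights are returned unchanged when the logit is within bounds.
    """
    if max_logit <= tau:
        return w_q, w_k
    scale = math.sqrt(tau / max_logit)
    return ([[scale * v for v in row] for row in w_q],
            [[scale * v for v in row] for row in w_k])


if __name__ == "__main__":
    # Hypothetical 2x3 gradient-like update; rows become orthonormal.
    ortho = orthogonalize([[2.0, 0.0, 1.0], [0.0, 1.0, -1.0]])
    print(matmul(ortho, transpose(ortho)))

    # Hypothetical projections whose logit (100) exceeds tau (25).
    w_q = [[4.0, 0.0], [0.0, 4.0]]
    w_k = [[5.0, 0.0], [0.0, 5.0]]
    x = [1.0, 2.0]
    clipped_q, clipped_k = qk_clip(w_q, w_k, attention_logit(x, x, w_q, w_k), tau=25.0)
    print(attention_logit(x, x, clipped_q, clipped_k))
```

The contrast with Adam is visible in `orthogonalize`: the update is treated as a whole matrix and all of its singular values are equalized, rather than each entry being rescaled independently. Splitting the correction evenly between the two projections in `qk_clip` is one simple design choice; it leaves weights untouched on steps where the logits are already within bounds.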

The new idea: native multimodal pre-training

Here’s the pre-training contribution. The conventional way to make a language model see is to train a strong text model first, then bolt on vision late in training by adding visual tokens. Kimi K2.5 rejects this. It does native multimodal pre-training: text and vision tokens are mixed at a constant ratio throughout the entire run, with vision fused in early rather than late. A vision encoder (MoonViT-3D, a native-resolution encoder) turns images into tokens that join the stream from the beginning.

K2.5 was pre-trained this way on roughly 15 trillion mixed visual-and-text tokens.
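The difference between the two recipes can be sketched as a data-mixing schedule. The Python sketch below is illustrative only: the 10% vision share, the 1,000-step run, and the switch-over point are hypothetical numbers chosen for the example, not values from the report. It contrasts a constant vision share with a late-fusion schedule that spends the same total vision budget in the final stretch of training.

```python
def constant_ratio(step, total_steps, vision_fraction):
    """Native multimodal recipe: the same vision share at every step."""
    return vision_fraction


def late_fusion(step, total_steps, vision_fraction, start_step):
    """Conventional recipe: text-only until `start_step`, then the same
    overall vision budget packed into the remaining steps."""
    if step < start_step:
        return 0.0
    return vision_fraction * total_steps / (total_steps - start_step)


def vision_budget(schedule, total_steps, **kwargs):
    """Total vision share summed over the run, in units of one step's batch."""
    return sum(schedule(step, total_steps, **kwargs) for step in range(total_steps))


if __name__ == "__main__":
    total = 1000  # hypothetical run length
    for step in (0, 500, 900):
        native = constant_ratio(step, total, vision_fraction=0.1)
        late = late_fusion(step, total, vision_fraction=0.1, start_step=800)
        print(step, native, late)
```

Both schedules consume the same number of vision tokens overall. Under late fusion the model sees no images for the first 80% of this toy run and then a much higher share, whereas the constant-ratio schedule exposes it to vision from step 0.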

This is the same underlying move as Gemma 3’s vision tokens, taken to its logical end: don’t adapt a text model to see — pre-train a model that sees and reads at once. It’s also the philosophy that the omni-modal models (chapter 27) push across every modality. (Everything in K2.5 about agentic behavior, Agent Swarm, and its reinforcement-learning stages is post-training, and out of scope here.)

The next chapter narrows the lens instead of widening it — to a single domain where the data itself has to be manufactured: code.