Section 24

Kimi K2.5

Trillion-parameter MoE and the Muon optimizer

Paper: Kimi K2.5: Visual Agentic Intelligence — Moonshot AI, 2026

We’ve reached the 2026 frontier. These four chapters cover the very latest reports, each read from its primary source, and each gets the “what’s genuinely new” treatment. Kimi K2.5 (Moonshot AI, Feb 2026) is a multimodal agentic model, but its pre-training contribution is a sharp, surprising finding about how to fold vision into a language model from the start.

Built on a trillion-parameter MoE base

Kimi K2.5 is built on Kimi K2, a roughly trillion-parameter Mixture-of-Experts (MoE) transformer, so the sparse-compute, many-experts design from DeepSeek-V3 is now simply the default substrate for frontier models. We won’t re-derive MoE; assume it underneath everything in this group.

The new idea: native multimodal pre-training

Here’s the pre-training contribution. The conventional way to make a language model see is to train a strong text model first, then bolt on vision late in training by adding visual tokens. Kimi K2.5 rejects this. It does native multimodal pre-training: text and vision tokens are mixed at a constant ratio throughout the entire run, with vision fused in early rather than late. A vision encoder (MoonViT-3D, a native-resolution encoder) turns images into tokens that join the stream from the beginning.

K2.5 was pre-trained this way on roughly 15 trillion mixed visual-and-text tokens.
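To make the contrast between the two schedules concrete, here is a minimal sketch of how the vision share of each batch might evolve over a run. The function names and every number in it (a 10% constant ratio, vision switched on at 80% of the run, a 50% late ratio) are hypothetical placeholders chosen for illustration; none of them come from the K2.5 report.

```python
def vision_fraction(step, total_steps, schedule,
                    ratio=0.1, late_start=0.8, late_ratio=0.5):
    """Return the share of a batch's tokens that are vision tokens at `step`.

    All default values are illustrative placeholders, not K2.5's settings.
    """
    if not 0 <= step < total_steps:
        raise ValueError("step must be in [0, total_steps)")
    if schedule == "constant":
        # Native multimodal pre-training: the same text/vision mix at every step.
        return ratio
    if schedule == "late":
        # Late fusion: text only at first, vision switched on near the end.
        return late_ratio if step >= late_start * total_steps else 0.0
    raise ValueError(f"unknown schedule: {schedule!r}")


def first_vision_step(total_steps, schedule, **kwargs):
    """Return the first step at which the model sees any vision tokens."""
    for step in range(total_steps):
        if vision_fraction(step, total_steps, schedule, **kwargs) > 0:
            return step
    return None


if __name__ == "__main__":
    for name in ("constant", "late"):
        print(name, "first sees vision at step", first_vision_step(1000, name))
```

Under the constant schedule the model meets vision tokens on its very first batch, while under the late schedule most of the run has passed before any image is seen. That difference in timing is the design choice the paragraph above describes.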

This is the same underlying move as Gemma 3’s vision tokens, taken to its logical end: don’t adapt a text model to see — pre-train a model that sees and reads at once. It’s also the philosophy that the omni-modal models (chapter 27) push across every modality. (Everything in K2.5 about agentic behavior, Agent Swarm, and its reinforcement-learning stages is post-training, and out of scope here.)

The next chapter narrows the lens instead of widening it — to a single domain where the data itself has to be manufactured: code.