Qwen3-Coder-Next
Pre-training for code at repository scale
Paper: Qwen3-Coder-Next Technical Report — Qwen Team, 2026
Qwen3-Coder-Next (Qwen Team, early 2026) is a code-specialized model, and it’s here because pre-training for code exposes a problem the general models can mostly ignore: for the most valuable coding skills, good training data barely exists in raw form and has to be constructed. Its pre-training story is about manufacturing data, plus the now-familiar efficiency recipe.
A small active footprint
Qwen3-Coder-Next is an 80-billion-parameter Mixture-of-Experts (MoE) model that activates only 3 billion parameters per forward pass, an even more aggressive sparsity ratio than DeepSeek-V3, paired with a hybrid attention design. The motivation is explicit: coding agents run in tight local-development loops where latency matters, so you want a model with a giant knowledge base but a tiny active compute cost. MoE is no longer exotic; it’s the obvious tool when you want capability without inference cost.
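To make the sparsity comparison concrete, here is a back-of-the-envelope sketch using the parameter counts quoted above (80B total / 3B active for Qwen3-Coder-Next, 671B total / 37B active for DeepSeek-V3). The helper names are illustrative, and the FLOPs figure relies on the common rule of thumb of roughly 2 FLOPs per active parameter per token, which ignores attention cost and is an assumption rather than a number from the report.

```python
def active_fraction(total_b: float, active_b: float) -> float:
    """Share of the model's parameters used to process one token."""
    return active_b / total_b


def approx_forward_gflops_per_token(active_b: float) -> float:
    """Rough forward-pass cost per token in GFLOPs.

    Assumption: ~2 FLOPs per active parameter per token (one multiply,
    one add), ignoring attention. Billions of parameters times 2 gives GFLOPs.
    """
    return 2.0 * active_b


# (total, active) parameter counts in billions, as quoted in the text.
MODELS = {
    "Qwen3-Coder-Next": (80.0, 3.0),
    "DeepSeek-V3": (671.0, 37.0),
}

if __name__ == "__main__":
    for name, (total_b, active_b) in MODELS.items():
        frac = active_fraction(total_b, active_b)
        gflops = approx_forward_gflops_per_token(active_b)
        print(
            f"{name}: {active_b:g}B of {total_b:g}B active "
            f"({frac:.2%}), ~{gflops:g} GFLOPs/token"
        )
```

Under these assumptions, Qwen3-Coder-Next touches about 3.75% of its weights per token against roughly 5.5% for DeepSeek-V3, and its per-token compute tracks the 3B figure, not the 80B one.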
Why code needs special data
Ordinary web text teaches a model to write plausible code, but the skills that matter for an agent — fixing a failing test, navigating a real repository, satisfying a build — require something the open web doesn’t readily provide: verifiable, executable, interaction-rich examples. You can’t learn “did this patch make the tests pass?” from static text.
Fill-in-the-middle: editing, not just continuing
One more code-specific pre-training detail worth knowing (standard across the Qwen-Coder and other code models) is the fill-in-the-middle (FIM) objective. Ordinary causal training only teaches left-to-right continuation, but real coding is mostly editing in place, such as inserting a function between existing code. FIM reorders training documents so the model sees a prefix and a suffix and must generate the middle, teaching it to complete code surrounded by context on both sides. It’s a small change to how the next-token objective is presented, with a big effect on how useful the model is in an editor.
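The reordering can be sketched in a few lines. This is a minimal illustration, not the report's pipeline: the sentinel strings, the character-level random split, the prefix-suffix-middle ordering, and the `fim_rate` parameter are all assumptions chosen for clarity, and real tokenizers define their own special tokens.

```python
import random

# Hypothetical sentinel strings; an actual tokenizer reserves its own special tokens.
FIM_PREFIX = "<|fim_prefix|>"
FIM_SUFFIX = "<|fim_suffix|>"
FIM_MIDDLE = "<|fim_middle|>"


def split_document(doc: str, rng: random.Random) -> tuple[str, str, str]:
    """Cut the document at two random character positions."""
    a, b = sorted(rng.randint(0, len(doc)) for _ in range(2))
    return doc[:a], doc[a:b], doc[b:]


def to_fim_example(doc: str, rng: random.Random, fim_rate: float = 0.5) -> str:
    """Turn a document into a training string.

    With probability `fim_rate` (an assumed hyperparameter), emit the
    document in prefix-suffix-middle order so that plain next-token
    prediction ends up generating the missing middle. Otherwise return
    the document unchanged for ordinary left-to-right training.
    """
    if rng.random() >= fim_rate:
        return doc
    prefix, middle, suffix = split_document(doc, rng)
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"


if __name__ == "__main__":
    source = "def add(a, b):\n    return a + b\n"
    print(to_fim_example(source, random.Random(0), fim_rate=1.0))
```

The model is still trained with the usual next-token loss on the resulting string; only the order of the pieces changes, so by the time it reaches the middle segment it has already seen the code on both sides.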
(As with Kimi, the reinforcement-learning-from-execution loop that this data feeds is post-training; we cover only the pre-training data and architecture here.) Next, the most architecturally ambitious 2026 report: DeepSeek-V4.