Section 19

Qwen2.5

18 trillion tokens and data quality

Paper: Qwen2.5 Technical Report — Qwen Team, 2024

Qwen2.5 (Qwen Team, Alibaba, 2024) doesn’t introduce a flashy new architecture: it’s a conventional dense transformer in the modern style (pre-norm, GQA, RoPE, SwiGLU). Its lesson is about data scale and data discipline, and it’s a clean illustration of the single most reliable lever in pre-training: more, better-filtered tokens.

From 7 trillion to 18 trillion tokens

The defining change from Qwen2 to Qwen2.5 is the corpus: pre-training data grew from 7 trillion to 18 trillion tokens — among the largest disclosed for an open model. Crucially, the growth wasn’t just “more web text.” The expansion concentrated on the categories that most improve capability: knowledge, code, and mathematics, with heavy quality filtering to keep the added tokens valuable rather than merely numerous.
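
As a rough illustration of what category-aware quality filtering can look like, the sketch below keeps a document only if its quality score clears a threshold set for its category. Everything here is hypothetical: the `Doc` type, the `quality` score (imagine it coming from some classifier), and the threshold values are made up for the example and are not taken from the Qwen2.5 report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    text: str
    category: str   # e.g. "web", "knowledge", "code", "math"
    quality: float  # hypothetical score in [0, 1] from some quality classifier


# Illustrative thresholds only: generic web text must clear a higher bar
# than the categories the corpus expansion is trying to grow.
THRESHOLDS = {"web": 0.8, "knowledge": 0.5, "code": 0.5, "math": 0.5}


def filter_docs(docs, thresholds=THRESHOLDS, default=0.8):
    """Keep documents whose score meets the threshold for their category."""
    return [d for d in docs if d.quality >= thresholds.get(d.category, default)]


docs = [
    Doc("Buy now, limited offer!!!", "web", 0.35),
    Doc("A proof that sqrt(2) is irrational ...", "math", 0.90),
    Doc("def add(a, b): return a + b", "code", 0.60),
    Doc("Long-form encyclopedia entry ...", "web", 0.85),
]
kept = filter_docs(docs)
print([d.category for d in kept])  # ['math', 'code', 'web']
```

The design point is that the bar differs by category: the same score that rejects a generic web page can admit a math or code document, which is one simple way to grow the valuable categories without letting low-quality web text grow with them.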

Staged training and mixture transitions

Qwen2.5’s pre-training is staged: the data mixture changes over the course of training rather than staying fixed. This is the curriculum idea from chapter 8 in practice — early training on broad data to build general ability, later stages shifting toward higher-quality and more specialized mixtures, much like Llama 3’s annealing finish. The model is fed a changing diet tuned to what it most needs at each phase.
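
A staged mixture can be pictured as a lookup from training progress to sampling weights. The sketch below is a minimal version under stated assumptions: the stage boundaries, the category names, and every weight are invented for illustration and are not Qwen2.5's actual schedule.

```python
import random

# Hypothetical schedule: (fraction of training at which the stage ends, weights).
# Later stages shift probability mass away from broad web text.
STAGES = [
    (0.70, {"web": 0.70, "knowledge": 0.10, "code": 0.12, "math": 0.08}),
    (0.90, {"web": 0.50, "knowledge": 0.15, "code": 0.20, "math": 0.15}),
    (1.00, {"web": 0.30, "knowledge": 0.25, "code": 0.25, "math": 0.20}),
]


def mixture_at(progress: float) -> dict:
    """Return the mixture weights active at a training progress in [0, 1]."""
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be in [0, 1]")
    for stage_end, weights in STAGES:
        if progress <= stage_end:
            return weights
    return STAGES[-1][1]


def sample_source(progress: float, rng: random.Random) -> str:
    """Pick the data source for the next batch according to the active mixture."""
    weights = mixture_at(progress)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


rng = random.Random(0)
for progress in (0.10, 0.80, 0.95):
    print(progress, mixture_at(progress), sample_source(progress, rng))
```

In a real pipeline the transitions are usually driven by tokens seen rather than a progress fraction, and they may be smoothed rather than switched abruptly; the step function here is only the simplest form of the idea.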

The family, and the MoE hint

Qwen2.5 ships an unusually wide range of open-weight dense sizes (0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B), a deliberate choice to serve everything from on-device to data-center deployment. Each size is over-trained on the full corpus, in the Llama style: spending extra training compute on a small model buys quality that stays cheap to serve. Alibaba also fielded proprietary Mixture-of-Experts variants (Qwen2.5-Turbo and Qwen2.5-Plus), and Turbo pushed context length to as much as a million tokens, foreshadowing the long-context and MoE directions that dominate the 2026 chapters.
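
To get a feel for how heavily over-trained these models are, the arithmetic below divides the 18-trillion-token corpus by each nominal model size. It assumes every size sees the full corpus, as stated above, and uses the nominal parameter counts from the model names rather than exact counts. The comparison point of roughly 20 tokens per parameter is the commonly cited compute-optimal rule of thumb, included only as a rough yardstick.

```python
CORPUS_TOKENS = 18e12                      # 18 trillion tokens
SIZES_B = [0.5, 1.5, 3, 7, 14, 32, 72]     # nominal sizes, billions of parameters
RULE_OF_THUMB = 20                         # approx. compute-optimal tokens/param


def tokens_per_param(params_billion: float,
                     corpus_tokens: float = CORPUS_TOKENS) -> float:
    """Training tokens seen per model parameter."""
    return corpus_tokens / (params_billion * 1e9)


for size in SIZES_B:
    ratio = tokens_per_param(size)
    print(f"{size:>4}B: {ratio:>8,.0f} tokens/param "
          f"(~{ratio / RULE_OF_THUMB:,.0f}x the rule of thumb)")
```

Even the 72B model lands at 250 tokens per parameter, more than ten times the yardstick, and the 0.5B model at 36,000. That gap is what "over-trained for inference efficiency" means in practice.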

Qwen scaled data. The next model, Gemma 2, does something cleverer with the targets the model trains against — and brings two architectural efficiency tricks worth knowing.