Section 19

Qwen2.5

18 trillion tokens and data quality

Paper: Qwen2.5 Technical Report — Qwen Team, 2024

Qwen2.5 (Qwen Team, Alibaba, 2024) doesn’t introduce a flashy new architecture: it’s a conventional dense transformer in the modern style (pre-norm, grouped-query attention (GQA), rotary position embeddings (RoPE), SwiGLU). Its lesson is about data scale and data discipline, and it’s a clean illustration of the single most reliable lever in pre-training: more, better-filtered tokens.

From 7 trillion to 18 trillion tokens

The defining change from Qwen2 to Qwen2.5 is the corpus: pre-training data grew from 7 trillion to 18 trillion tokens, among the largest disclosed for an open model. Crucially, the growth wasn’t just “more web text.” The expansion concentrated on the categories that most improve capability (knowledge, code, and mathematics), with heavy quality filtering to keep the added tokens valuable rather than merely numerous.
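To make "quality filtering" concrete, here is a minimal sketch of a heuristic filter. It is not the Qwen team's pipeline; the scoring signals, the `quality_score` and `filter_corpus` names, and the 0.3 threshold are all assumptions chosen for illustration. Production filters typically combine heuristics like these with trained classifiers.

```python
def quality_score(text: str) -> float:
    """Toy quality score in [0, 1]; higher means more worth keeping.

    Hypothetical heuristic: multiplies three cheap signals together.
    """
    words = text.split()
    if not words:
        return 0.0
    # Share of characters that are letters (penalizes symbol soup).
    alpha_ratio = sum(c.isalpha() for c in text) / len(text)
    # Share of distinct words (penalizes repetitive spam).
    unique_ratio = len({w.lower() for w in words}) / len(words)
    # Ramp from 0 to 1 over the first 20 words (penalizes fragments).
    length_factor = min(len(words) / 20, 1.0)
    return alpha_ratio * unique_ratio * length_factor


def filter_corpus(docs, threshold=0.3):
    """Keep only documents whose score reaches the (assumed) threshold."""
    return [d for d in docs if quality_score(d) >= threshold]


good = (
    "Grouped-query attention lets several query heads share one key and "
    "value head, which reduces the memory needed for the cache during "
    "decoding of long sequences."
)
spam = "buy now " * 30
kept = filter_corpus([good, spam, ""])
```

In this toy run only the first document survives: the repeated spam has almost no distinct words, and the empty string scores zero.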

Staged training and mixture transitions

Qwen2.5’s pre-training is staged: the data mixture changes over the course of training rather than staying fixed. This is the curriculum idea from chapter 8 in practice: early training on broad data to build general ability, later stages shifting toward higher-quality and more specialized mixtures, much like Llama 3’s annealing finish. The model is fed a changing diet tuned to what it most needs at each phase.
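A staged mixture can be sketched as a lookup from training progress to sampling weights. The stage boundaries, source names, and weights below are invented for illustration and are not the values used for Qwen2.5; the point is only the mechanism of switching mixtures partway through training.

```python
import random

# Hypothetical schedule: (progress at which the stage ends, sampling weights).
# None of these numbers come from the Qwen2.5 report.
STAGES = [
    (0.80, {"web": 0.70, "code": 0.15, "math": 0.05, "curated": 0.10}),
    (0.95, {"web": 0.45, "code": 0.25, "math": 0.15, "curated": 0.15}),
    (1.00, {"web": 0.20, "code": 0.30, "math": 0.25, "curated": 0.25}),
]


def mixture_at(progress: float) -> dict:
    """Return the mixture weights for training progress in [0, 1]."""
    for stage_end, weights in STAGES:
        if progress <= stage_end:
            return weights
    return STAGES[-1][1]  # clamp anything past 1.0 to the final stage


def sample_source(progress: float, rng: random.Random) -> str:
    """Pick which data source the next batch is drawn from."""
    weights = mixture_at(progress)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


rng = random.Random(0)
early = mixture_at(0.10)   # broad, web-heavy stage
late = mixture_at(0.99)    # final stage, weighted toward code and math
source = sample_source(0.99, rng)
```

A real data loader would call something like `sample_source` once per batch (or per document), so the diet shifts automatically as `progress` advances.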

The family, and the MoE hint

Qwen2.5 ships an unusually wide range of open dense sizes (0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B), a deliberate choice to serve everything from on-device to data-center deployment, each size over-trained on the full corpus for inference efficiency in the Llama style. Alibaba also fielded proprietary Mixture-of-Experts variants (Qwen2.5-Turbo and -Plus), and Turbo pushed context length to as much as a million tokens, foreshadowing the long-context and MoE directions that dominate the 2026 chapters.
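The degree of over-training is easy to quantify with back-of-the-envelope arithmetic. The sketch below divides the 18-trillion-token corpus by each nominal model size and compares the result with the commonly cited compute-optimal rule of thumb of roughly 20 tokens per parameter. That rule of thumb and the use of nominal rather than exact parameter counts are assumptions here, so treat the outputs as rough magnitudes.

```python
CORPUS_TOKENS = 18e12             # 18 trillion tokens, from the text
COMPUTE_OPTIMAL_TOKENS_PER_PARAM = 20  # rule-of-thumb assumption
NOMINAL_SIZES_B = [0.5, 1.5, 3, 7, 14, 32, 72]  # billions of parameters


def tokens_per_param(params_billion: float,
                     corpus_tokens: float = CORPUS_TOKENS) -> float:
    """Training tokens seen per model parameter."""
    return corpus_tokens / (params_billion * 1e9)


def overtraining_factor(params_billion: float) -> float:
    """How many times past the ~20 tokens/param heuristic a model is trained."""
    return tokens_per_param(params_billion) / COMPUTE_OPTIMAL_TOKENS_PER_PARAM


# One row per model size: (size in B, tokens per param, factor over heuristic).
table = [
    (size, tokens_per_param(size), overtraining_factor(size))
    for size in NOMINAL_SIZES_B
]

for size, tpp, factor in table:
    print(f"{size:>5}B  {tpp:>10,.0f} tokens/param  {factor:>8,.1f}x heuristic")
```

Under these assumptions the 72B model sees 250 tokens per parameter (12.5 times the heuristic), while the 0.5B model sees 36,000. The smaller the model, the further past compute-optimal it is pushed, which is exactly the trade of extra training compute for cheaper inference.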

Qwen scaled data. The next model, Gemma 2, does something cleverer with the targets the model trains against — and brings two architectural efficiency tricks worth knowing.