Synthetic data
When generated text helps pre-training
Paper: Demystifying Synthetic Data in LLM Pre-training — Kang et al., 2025
Every model so far has leaned on the same scarce resource: high-quality human-written text. But there’s only so much of it, and the largest runs are starting to bump against that ceiling, the so-called data wall: the looming limit where the supply of high-quality human-written text is exhausted relative to models' appetite for tokens. The obvious escape is to have models generate training data for other models. Demystifying Synthetic Data in LLM Pre-training (Kang et al., 2025) is the most careful empirical study of whether that actually works, and its answer is a precise, useful “it depends.”
The setup: a controlled study
Rather than rely on anecdotes, the authors ran a large-scale, controlled investigation: over 1,000 models and 100k+ GPU-hours under a unified protocol with scaling laws (empirical formulas showing that test loss falls as a smooth power law in model size, dataset size, and compute, which let you predict a large model's performance from small experiments). They compared natural web text against two kinds of synthetic data, that is, training text generated by another model or an automated pipeline rather than scraped from humans:
- Rephrased text: take real web documents and have a model reword/restructure them.
- Generated “textbooks”: have a model write fresh expository text from scratch.
Crucially, the study also covers mixtures of synthetic and natural data at varying ratios. This control is what makes the findings trustworthy where earlier synthetic-data claims were murky.
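To make "mixtures at varying ratios" concrete, here is a minimal sketch of building a training batch with a fixed synthetic share. It is an illustration only, not the paper's pipeline: the function name `mix_corpora`, its parameters, and the choice to mix by document count (real pre-training mixtures are usually specified in tokens) are all assumptions made for this example.

```python
import random

def mix_corpora(natural, synthetic, synthetic_fraction, n_docs, seed=0):
    """Return n_docs (source, document) pairs with a fixed synthetic share.

    Hypothetical helper: mixes by document count, sampling with replacement
    so that small toy corpora still work.
    """
    if not 0.0 <= synthetic_fraction <= 1.0:
        raise ValueError("synthetic_fraction must be in [0, 1]")
    rng = random.Random(seed)
    n_synth = round(n_docs * synthetic_fraction)
    n_nat = n_docs - n_synth
    batch = [("synthetic", rng.choice(synthetic)) for _ in range(n_synth)]
    batch += [("natural", rng.choice(natural)) for _ in range(n_nat)]
    rng.shuffle(batch)  # interleave the two sources
    return batch

# Toy corpora standing in for web text and rephrased text.
natural_docs = ["web doc A", "web doc B", "web doc C"]
synthetic_docs = ["rephrased doc A", "rephrased doc B"]
batch = mix_corpora(natural_docs, synthetic_docs, synthetic_fraction=0.3, n_docs=10)
```

Sweeping `synthetic_fraction` from 0 to 1 while holding everything else fixed is the kind of controlled comparison the study describes, here reduced to a toy.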
The findings, precisely
Model collapse: real, but conditional
A central fear about synthetic data is model collapse: train models on model-generated text over and over, and rare patterns in the true distribution get washed out, degrading quality generation after generation. The study gives this nuance teeth. For single-round training:
- Rephrased synthetic data showed no degradation at foreseeable scales — because it stays anchored to real documents, it preserves the true distribution’s diversity.
- Pure textbook-style generated mixtures did show the patterns predicted by model collapse — cut loose from real text, the distribution narrows.
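The "rare patterns get washed out" mechanism behind model collapse can be seen in a toy simulation. The sketch below is not the paper's experiment: it assumes a "model" is just a token-frequency table refit by counting, and each generation trains only on samples from the previous one. The names `next_generation` and `simulate` are hypothetical.

```python
import random
from collections import Counter

def next_generation(probs, n_samples, rng):
    """Sample tokens from probs, then refit the distribution by counting."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    sample = rng.choices(tokens, weights=weights, k=n_samples)
    counts = Counter(sample)
    # Tokens that were never sampled disappear from the refit model.
    return {t: c / n_samples for t, c in counts.items()}

def simulate(true_probs, generations, n_samples, seed=0):
    """Track how many distinct tokens survive each generation."""
    rng = random.Random(seed)
    probs = dict(true_probs)
    support_sizes = [len(probs)]
    for _ in range(generations):
        probs = next_generation(probs, n_samples, rng)
        support_sizes.append(len(probs))
    return support_sizes

# Zipf-like toy vocabulary: a few common tokens and a long rare tail.
weights = {f"tok{i}": 1.0 / (i + 1) for i in range(50)}
total = sum(weights.values())
true_probs = {t: w / total for t, w in weights.items()}
print(simulate(true_probs, generations=20, n_samples=200))
```

Once a rare token misses a single generation's sample, no later generation can recover it, so the number of surviving tokens can only shrink. Anchoring each generation to real documents, as rephrasing does, is what breaks this loop in the toy picture.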
This is the research frontier of the data pipeline — and a live question for every lab approaching the data wall. With it, we’ve covered data scale (Qwen), richer targets (Gemma 2), and now data generation. The last modern chapter, Gemma 3, folds in another axis entirely: data from other modalities.