Section 21

Synthetic data

When generated text helps pre-training

Paper: Demystifying Synthetic Data in LLM Pre-training — Kang et al., 2025

Every model so far has leaned on the same scarce resource: high-quality human-written text. There is only so much of it, and the largest runs are starting to bump against that ceiling, the so-called data wall. The obvious escape is to have models generate training data for other models. Demystifying Synthetic Data in LLM Pre-training (Kang et al., 2025) is among the most careful empirical studies of whether that actually works, and its answer is a precise, useful “it depends.”

The setup: a controlled study

Rather than rely on anecdotes, the authors ran a large-scale, controlled investigation (over 1,000 models and 100k+ GPU-hours, under a unified protocol with scaling laws) comparing natural web text against two kinds of synthetic data:

Rephrased text: take real web documents and have a model reword/restructure them.
Generated “textbooks”: have a model write fresh expository text from scratch.

Crucially, the study also covers mixtures of synthetic and natural data at varying ratios. This control is what makes the findings trustworthy where earlier synthetic-data claims were murky.
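To make "mixtures at varying ratios" concrete, here is a minimal sketch of assembling a training set with a chosen share of synthetic documents. It is not the paper's pipeline: the function name build_mixture, the document pools, and the ratio grid are all hypothetical, chosen only for illustration.

```python
import random


def build_mixture(natural, synthetic, synthetic_fraction, n_docs, seed=0):
    """Return n_docs documents, with about synthetic_fraction of them synthetic.

    Hypothetical helper for illustration only; not the study's actual code.
    """
    if not 0.0 <= synthetic_fraction <= 1.0:
        raise ValueError("synthetic_fraction must be in [0, 1]")
    rng = random.Random(seed)
    n_syn = round(n_docs * synthetic_fraction)
    n_nat = n_docs - n_syn
    # Sample with replacement so that small toy pools still work.
    mix = rng.choices(synthetic, k=n_syn) + rng.choices(natural, k=n_nat)
    rng.shuffle(mix)
    return mix


if __name__ == "__main__":
    natural = [f"nat-{i}" for i in range(50)]
    synthetic = [f"syn-{i}" for i in range(50)]
    # Illustrative ratio grid (an assumption, not the grid used in the paper).
    for frac in (0.0, 0.25, 0.5, 1.0):
        mix = build_mixture(natural, synthetic, frac, n_docs=100)
        n_syn = sum(doc.startswith("syn-") for doc in mix)
        print(f"synthetic_fraction={frac:.2f}: {n_syn} of {len(mix)} docs are synthetic")
```

In a controlled comparison, everything except the synthetic fraction (model size, data budget, training recipe) would be held fixed, so that differences in loss can be attributed to the mixture itself.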

The findings, precisely

Model collapse: real, but conditional

A central fear about synthetic data is model collapse: train models on model-generated text over and over, and rare patterns in the true distribution get washed out, degrading quality generation after generation. The study adds nuance to this fear. For single-round training:

Rephrased synthetic data showed no degradation at foreseeable scales. Because it stays anchored to real documents, it preserves the true distribution’s diversity.
Mixtures of textbook-style, purely generated synthetic data did show the patterns predicted by model collapse. Cut loose from real text, the distribution narrows.
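The washing-out of rare patterns that model collapse describes can be illustrated with a toy simulation. This is a sketch under strong assumptions, not the paper's experiment: the "model" is just an empirical token-frequency table, each generation is refit on a finite sample from the previous one, and the function name simulate_collapse and the real_fraction knob are hypothetical. Setting real_fraction above zero mixes in fresh draws from the true distribution, a loose analogy for data that stays anchored to real text.

```python
import random
from collections import Counter


def simulate_collapse(true_probs, sample_size, generations, real_fraction=0.0, seed=0):
    """Track how many distinct tokens survive when a frequency-table "model"
    is repeatedly refit on samples drawn from its own predecessor.

    Toy illustration only; all names and parameters here are hypothetical.
    """
    rng = random.Random(seed)
    tokens = list(range(len(true_probs)))
    weights = list(true_probs)  # generation 0 samples from the true distribution
    n_real = round(sample_size * real_fraction)
    support_sizes = []
    for _ in range(generations):
        # Part of the sample may come from the true distribution (the anchor),
        # the rest from the current model.
        sample = rng.choices(tokens, weights=true_probs, k=n_real)
        sample += rng.choices(tokens, weights=weights, k=sample_size - n_real)
        counts = Counter(sample)
        # "Refit": the next model is the empirical frequency table.
        weights = [counts.get(t, 0) for t in tokens]
        support_sizes.append(sum(1 for w in weights if w > 0))
    return support_sizes


if __name__ == "__main__":
    # Zipf-like true distribution over 50 tokens: a few common, many rare.
    probs = [1.0 / (rank + 1) for rank in range(50)]
    print("self-trained:", simulate_collapse(probs, 200, 30))
    print("anchored:    ", simulate_collapse(probs, 200, 30, real_fraction=0.5))
```

With real_fraction at 0, a token that drops out of one generation's sample can never return, so the number of surviving tokens can only stay flat or shrink. That is the narrowing the list above refers to, reduced to its simplest form.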

This is the research frontier of the data pipeline, and a live question for every lab approaching the data wall. With it, we’ve covered data scale (Qwen), richer targets (Gemma 2), and now data generation. The last modern chapter, Gemma 3, folds in another axis entirely: data from other modalities.