Section 28

Recap

The through-line, and further reading

We’ve traveled from a 65-million-parameter translation model in 2017 to omni-modal, million-token, trillion-parameter systems in 2026. This final chapter steps back to find the through-line — what changed, what stayed remarkably constant, and the handful of levers that explain almost everything in between.

The thing that never changed

Here is the most striking fact about nine years of progress: the core objective never moved. From GPT-1 to DeepSeek-V4, the model is (almost always) a decoder-only transformer trained to predict the next token with cross-entropy loss. Everything you learned in the foundations chapters (the objective, the gradient, the optimizer) is the same machinery running underneath a 2026 frontier model as under GPT-2.

What changed is everything around that objective. The history of pre-training isn’t a search for a better objective; it’s a relentless campaign to make each FLOP, each byte, and each token deliver more.
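As a reminder of how small that unchanged objective is, here is a minimal sketch of next-token cross-entropy in plain Python. The function name, the toy 4-token vocabulary, and the example logits are illustrative assumptions, not taken from any model in this explainer; real training code computes the same quantity in a tensor library over batches.

```python
import math

def next_token_cross_entropy(logits, targets):
    """Mean cross-entropy (in nats) of next-token predictions.

    logits:  list of per-position score lists, one score per vocabulary entry
    targets: list of the token ids that actually came next
    """
    total = 0.0
    for scores, target in zip(logits, targets):
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        # This term is -log softmax(scores)[target].
        total += log_z - scores[target]
    return total / len(targets)

# Hypothetical toy setup: vocabulary of 4 tokens, 2 positions.
uniform = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
confident = [[0.0, 5.0, 0.0, 0.0], [0.0, 0.0, 0.0, 5.0]]
targets = [1, 3]

loss_uniform = next_token_cross_entropy(uniform, targets)      # equals log(4)
loss_confident = next_token_cross_entropy(confident, targets)  # much smaller
```

A model that guesses uniformly pays log(vocabulary size) per token; training pushes probability mass toward the token that actually came next, which is all the loss rewards.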

The recurring levers

Every paper in this explainer pulls on some subset of the same five levers:

Scale. GPT-2 → GPT-3 showed scale alone unlocks new capabilities (in-context learning, emergence). Scaling laws made it predictable; Chinchilla made it efficient (~20 tokens per parameter), and then inference economics pushed everyone to deliberately over-train smaller models.
Data. From BooksCorpus to 32-trillion-token corpora, with ever-heavier filtering, deduplication, scaling-law-tuned mixtures, annealing, and, as the data wall looms, grounded synthetic data and new modalities.
Architectural efficiency. Mixture-of-Experts (huge capacity, small active compute), GQA and MLA and compressed attention (tiny KV cache, long context), distillation (richer targets).
Precision. FP32 → BF16 → FP8, each halving memory and bandwidth and roughly doubling throughput, turning numerical analysis into a frontier capability.
Parallelism & systems. Data, tensor, pipeline, and expert parallelism, ZeRO / FSDP, and bespoke schedulers (DualPipe) that turn ten-thousand-GPU clusters into a single trainable model.
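Three of these levers reduce to back-of-the-envelope arithmetic, sketched below. Only the ~20 tokens-per-parameter rule and the byte widths of FP32, BF16, and FP8 come from the list above. The 6 × N × D estimate of training FLOPs is a commonly used approximation assumed here, the KV-cache formula counts only keys and values for a single sequence, and the 70B model and the 32-layer attention configuration are hypothetical, chosen for round numbers.

```python
BYTES_PER_VALUE = {"FP32": 4, "BF16": 2, "FP8": 1}

def chinchilla_budget(n_params, tokens_per_param=20.0):
    """Rough compute-optimal token count and training FLOPs.

    tokens_per_param=20 is the rule of thumb from the Scale lever;
    6 * N * D is an assumed, widely used FLOPs approximation.
    """
    n_tokens = tokens_per_param * n_params
    train_flops = 6.0 * n_params * n_tokens
    return n_tokens, train_flops

def weight_memory_gb(n_params, fmt):
    """Memory for the weights alone, in GB (1e9 bytes)."""
    return n_params * BYTES_PER_VALUE[fmt] / 1e9

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, fmt="BF16"):
    """KV-cache size for one sequence: keys and values at every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * BYTES_PER_VALUE[fmt]

# Scale: a hypothetical 70B-parameter dense model.
n = 70e9
tokens, flops = chinchilla_budget(n)  # 1.4e12 tokens, about 5.9e23 FLOPs

# Precision: each step down halves the bytes needed for the weights.
fp32_gb = weight_memory_gb(n, "FP32")  # 280.0
bf16_gb = weight_memory_gb(n, "BF16")  # 140.0
fp8_gb = weight_memory_gb(n, "FP8")    # 70.0

# Architectural efficiency: hypothetical 32-layer model, head_dim 128,
# 8192-token context, comparing 32 KV heads (full multi-head attention)
# with 8 shared KV heads (a GQA-style layout).
mha_cache = kv_cache_bytes(32, 32, 128, 8192)
gqa_cache = kv_cache_bytes(32, 8, 128, 8192)
```

In this toy configuration the cache shrinks from 4 GiB to 1 GiB per sequence, which is the kind of saving that makes long context affordable. Over-training, as described under Scale, simply means choosing a token count well above what `chinchilla_budget` returns for a given parameter count.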

The arc, in one paragraph each

2017–2020, the paradigm. The transformer made sequence models parallelizable; GPT-1 turned it into the pre-train-then-adapt recipe; BERT and T5 explored the objective space; GPT-2 revealed that scale buys zero-shot capability.
2020–2022, the laws. GPT-3 proved scale unlocks in-context learning; Kaplan made loss predictable; Chinchilla corrected how to spend compute, and quietly shrank the models.
2024–2025, the modern era. Llama 3 industrialized data and over-training; DeepSeek-V3 rebuilt the architecture for efficiency (MoE, MLA, MTP, FP8); Qwen scaled tokens; Gemma brought distillation and long-context attention; and the field began seriously asking what to do when good data runs out.
2026, the frontier. Native multimodality (Kimi), grounded synthetic data for code (Qwen3-Coder), million-token context with compressed attention and a new optimizer (DeepSeek-V4), and omni-modal pre-training across audio and video (Qwen3.5-Omni). Same objective; wider world.