Recap
The through-line, and further reading
We’ve traveled from a 65-million-parameter translation model in 2017 to omni-modal, million-token, trillion-parameter systems in 2026. This final chapter steps back to find the through-line — what changed, what stayed remarkably constant, and the handful of levers that explain almost everything in between.
The thing that never changed
Here is the most striking fact about nine years of progress: the core objective never moved. From GPT-1 to DeepSeek-V4, the model is (almost always) a decoder-only transformer trained to predict the next token with cross-entropy loss. Everything you learned in the foundations chapters (the objective, the gradient, the optimizer) is the same machinery running underneath a 2026 frontier model as under GPT-2.
What changed is everything around that objective. The history of pre-training isn’t a search for a better objective; it’s a relentless campaign to make each FLOP, each byte, and each token deliver more.
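To make the unchanged objective concrete, here is a minimal sketch of next-token cross-entropy in plain Python. The function names (log_softmax, next_token_loss) and the toy three-token vocabulary are illustrative assumptions, not taken from any of the papers; a real training loop would compute the same quantity on batched tensors in a deep-learning framework.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vector of vocabulary scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def next_token_loss(logits_per_position, token_ids):
    """Mean cross-entropy for predicting each token from the ones before it.

    logits_per_position[t] holds the model's scores over the vocabulary
    after reading token_ids[: t + 1]; its target is token_ids[t + 1].
    The last position has no target and is ignored.
    """
    targets = token_ids[1:]
    if not targets:
        raise ValueError("need at least two tokens to form a prediction")
    preds = logits_per_position[: len(targets)]
    # Negative log-probability assigned to the token that actually came next.
    nll = [-log_softmax(scores)[tgt] for scores, tgt in zip(preds, targets)]
    return sum(nll) / len(nll)


# Toy data (hypothetical): a vocabulary of 3 token ids and a 3-token sequence.
VOCAB_SIZE = 3
tokens = [0, 2, 1]

# A model that knows nothing assigns equal scores to every token.
uniform_logits = [[0.0] * VOCAB_SIZE for _ in tokens]
uniform_loss = next_token_loss(uniform_logits, tokens)  # equals log(3)

# A model that strongly favors the correct next token at each position.
confident_logits = [[0.0, 0.0, 20.0], [0.0, 20.0, 0.0], [0.0, 0.0, 0.0]]
confident_loss = next_token_loss(confident_logits, tokens)  # close to zero
```

With uniform scores the loss equals the log of the vocabulary size, the "knows nothing" baseline. Pre-training, at any scale, is the process of pushing that number down.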
The recurring levers
Every paper in this explainer pulls on some subset of the same five levers:
- Scale. GPT-2 → GPT-3 showed scale alone unlocks new capabilities (in-context learning, emergence). Scaling laws made it predictable; Chinchilla made it efficient (~20 tokens per parameter), and then inference economics pushed everyone to deliberately over-train smaller models.
- Data. From BooksCorpus to 32-trillion-token corpora, with ever-heavier filtering, deduplication, scaling-law-tuned mixtures, annealing, and, as the data wall looms, grounded synthetic data and new modalities.
- Architectural efficiency. Mixture-of-Experts (huge capacity, small active compute); GQA, MLA, and compressed attention (tiny KV cache, long context); and distillation (richer targets).
- Precision. FP32 → BF16 → FP8, each step halving memory and bandwidth and roughly doubling throughput, turning numerical analysis into a frontier capability.
- Parallelism & systems. Data, tensor, pipeline, and expert parallelism, ZeRO / FSDP, and bespoke schedulers (DualPipe) that turn ten-thousand-GPU clusters into a single trainable model.
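Two of these levers reduce to arithmetic you can do on the back of an envelope. The sketch below is illustrative only: the 7-billion-parameter model and the 2-trillion-token budget are hypothetical numbers, the helper names are made up for this example, and the training-compute estimate relies on the common C ≈ 6·N·D rule of thumb, which is an assumption brought in here rather than something stated in this chapter.

```python
# Bytes needed to store one parameter in each format (32, 16, and 8 bits).
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp8": 1}


def chinchilla_tokens(n_params, tokens_per_param=20):
    """Training tokens at the roughly 20-tokens-per-parameter operating point."""
    return n_params * tokens_per_param


def tokens_per_param(n_tokens, n_params):
    """The D/N ratio; values far above ~20 indicate deliberate over-training."""
    return n_tokens / n_params


def training_flops(n_params, n_tokens):
    """Rough training compute, ASSUMING the C ~= 6 * N * D rule of thumb."""
    return 6 * n_params * n_tokens


def weight_gb(n_params, fmt):
    """Memory for the weights alone, in decimal gigabytes (10**9 bytes)."""
    return n_params * BYTES_PER_PARAM[fmt] / 1e9


# Hypothetical dense model used only for illustration.
N = 7_000_000_000

optimal_tokens = chinchilla_tokens(N)            # 140 billion tokens
optimal_flops = training_flops(N, optimal_tokens)

# The same model over-trained on a hypothetical 2-trillion-token corpus.
overtrained_ratio = tokens_per_param(2_000_000_000_000, N)

# Each precision step halves the bytes needed to hold the weights.
weights = {fmt: weight_gb(N, fmt) for fmt in BYTES_PER_PARAM}
```

For this hypothetical model the weights take 28 GB in FP32, 14 GB in BF16, and 7 GB in FP8, and the over-trained run sits at roughly 286 tokens per parameter, more than ten times the ~20 ratio. Optimizer state, activations, and the KV cache are left out on purpose, so treat these as lower bounds rather than a memory plan.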
The arc, in one paragraph each
- 2017–2020, the paradigm. The transformer made sequence models parallelizable; GPT-1 turned it into the pre-train-then-adapt recipe; BERT and T5 explored the objective space; GPT-2 revealed that scale buys zero-shot capability.
- 2020–2022, the laws. GPT-3 proved scale unlocks in-context learning; Kaplan made loss predictable; Chinchilla corrected how to spend compute, and quietly shrank the models.
- 2024–2025, the modern era. Llama 3 industrialized data and over-training; DeepSeek-V3 rebuilt the architecture for efficiency (MoE, MLA, MTP, FP8); Qwen scaled tokens; Gemma brought distillation and long-context attention; and the field began seriously asking what to do when good data runs out.
- 2026, the frontier. Native multimodality (Kimi), grounded synthetic data for code (Qwen3-Coder), million-token context with compressed attention and a new optimizer (DeepSeek-V4), and omni-modal pre-training across audio and video (Qwen3.5-Omni). Same objective; wider world.
Further reading — the papers, in order
The foundations:
- Vaswani et al., Attention Is All You Need (2017) — arxiv 1706.03762. The transformer.
- Radford et al., Improving Language Understanding by Generative Pre-Training (GPT-1, 2018). The pre-train-then-fine-tune recipe.
- Devlin et al., BERT (2019) — arxiv 1810.04805. Masked language modeling.
- Radford et al., Language Models are Unsupervised Multitask Learners (GPT-2, 2019). Scale and zero-shot.
- Raffel et al., Exploring the Limits of Transfer Learning (T5, 2020) — arxiv 1910.10683. Text-to-text and span corruption.
- Kaplan et al., Scaling Laws for Neural Language Models (2020) — arxiv 2001.08361.
- Brown et al., Language Models are Few-Shot Learners (GPT-3, 2020) — arxiv 2005.14165.
- Hoffmann et al., Training Compute-Optimal Large Language Models (Chinchilla, 2022) — arxiv 2203.15556.
The modern era:
- Grattafiori et al., The Llama 3 Herd of Models (2024) — arxiv 2407.21783.
- DeepSeek-AI, DeepSeek-V3 Technical Report (2024) — arxiv 2412.19437.
- Qwen Team, Qwen2.5 Technical Report (2024) — arxiv 2412.15115.
- Gemma Team, Gemma 2 (2024) — arxiv 2408.00118.
- Kang et al., Demystifying Synthetic Data in LLM Pre-training (2025) — arxiv 2510.01631.
- Gemma Team, Gemma 3 Technical Report (2025) — arxiv 2503.19786.
- Karydas, Margaritis & Leligou, Training Methods for Large Language Models (2026) — Technologies 14(2), 133.
The 2026 frontier:
- Moonshot AI, Kimi K2.5: Visual Agentic Intelligence (2026) — arxiv 2602.02276.
- Qwen Team, Qwen3-Coder-Next Technical Report (2026) — arxiv 2603.00729.
- DeepSeek-AI, DeepSeek-V4 Technical Report (2026).
- Qwen Team, Qwen3.5-Omni Technical Report (2026) — arxiv 2604.15804.
And don’t forget the glossary — every term in this explainer, with every acronym spelled out, in one place.