Section 23

The big picture

A survey of training methods

Paper: Training Methods for Large Language Models: Current Approaches and Challenges — Karydas, Margaritis & Leligou, 2026

Before we step into the 2026 frontier, it’s worth pausing to see the modern era mapped from above. Training Methods for Large Language Models: Current Approaches and Challenges (Karydas, Margaritis & Leligou, Technologies 2026) is a systematic survey of LLM training spanning the 2017–2025 literature. It’s the “bonus” overview in our reading list, and it’s a useful cross-check: an independent map of the same territory we’ve been walking. We’ll look only at its pre-training content (the survey also covers fine-tuning, alignment, and retrieval, which are out of scope here).

The survey’s organizing claim: objective follows architecture

The survey frames pre-training as self-supervised learning that “converts the data into the supervision signal”, which is exactly the free-label insight from chapter 1. Its central organizing principle is a clean mapping from architecture to objective, which neatly summarizes our paradigm chapters:

  • Decoder-only models (GPT-3, Llama) → Causal Language Modeling (next-token prediction), best for generation and notable for data efficiency and fine-tuning stability.
  • Encoder-only models (BERT) → Masked Language Modeling, best for representation/understanding tasks.
  • Encoder-decoder models (T5) → denoising/span objectives.
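To make the mapping concrete, here is a minimal sketch of how each objective turns the same sentence into (input, target) pairs. It is illustrative only and not taken from the survey: the function names, the single `[MASK]` replacement rule (BERT's real recipe is more elaborate), and the single-span, single-sentinel denoising setup are simplifying assumptions.

```python
import random

MASK = "[MASK]"   # placeholder token for hidden positions (illustrative)
IGNORE = None     # label meaning "no loss is computed at this position"


def causal_lm_example(tokens):
    """Decoder-only: at each position, the target is the next token."""
    inputs = tokens[:-1]
    labels = tokens[1:]
    return inputs, labels


def masked_lm_example(tokens, mask_prob=0.15, rng=None):
    """Encoder-only: hide a fraction of tokens and predict only those."""
    rng = rng or random.Random(0)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)     # the model sees the mask...
            labels.append(tok)      # ...and must recover the original token
        else:
            inputs.append(tok)
            labels.append(IGNORE)   # visible tokens contribute no loss
    return inputs, labels


def span_denoising_example(tokens, start, length):
    """Encoder-decoder: replace one span with a sentinel; the decoder rebuilds it."""
    sentinel = "<extra_id_0>"  # sentinel name chosen for illustration
    encoder_input = tokens[:start] + [sentinel] + tokens[start + length:]
    decoder_target = [sentinel] + tokens[start:start + length]
    return encoder_input, decoder_target


tokens = "the cat sat on the mat".split()
print(causal_lm_example(tokens))
print(masked_lm_example(tokens))
print(span_denoising_example(tokens, start=1, length=2))
```

In all three cases the targets are just tokens taken from the input text itself, which is the free-label point in action; only the choice of what to hide, and which side of the context the model may look at, changes.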

Data curation as a first-class subject

Tellingly, the survey gives data curation as much weight as architecture — and its checklist is the one we built in chapter 8. It frames modern corpora as a balance between web-scale coverage and curated quality:

  • Common Crawl (on the order of 100 TB) for breadth: “very large and noisy,” demanding heavy preprocessing.
  • Curated sources: Wikipedia (~15 GB) for trustworthy structured knowledge, book corpora for long coherent discourse, specialized code datasets (e.g. The Stack, GitHub-derived) for programming ability, and multilingual corpora for cross-lingual transfer.
  • “RefinedWeb-style” cleaned web datasets explicitly designed to outperform raw Common Crawl.
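The size gap in that list is worth a quick calculation. The sketch below assumes a hypothetical mixing rule in which each source is sampled in proportion to its size raised to an exponent `alpha`; the survey does not prescribe this rule, and the two sizes are simply the orders of magnitude quoted above. With `alpha = 1` (pure proportional sampling) Wikipedia all but disappears, while a smaller exponent pulls curated data back into view.

```python
def mixture_weights(sizes_bytes, alpha=1.0):
    """Sampling probability per source, proportional to size ** alpha.

    alpha = 1.0 samples in proportion to raw size; alpha < 1.0 flattens
    the distribution and upweights small sources. This rule is an
    assumption for illustration, not the survey's method.
    """
    scaled = {name: size ** alpha for name, size in sizes_bytes.items()}
    total = sum(scaled.values())
    return {name: value / total for name, value in scaled.items()}


# Order-of-magnitude sizes quoted in the text above.
sizes = {"common_crawl": 100e12, "wikipedia": 15e9}

proportional = mixture_weights(sizes, alpha=1.0)
flattened = mixture_weights(sizes, alpha=0.5)

print(f"proportional Wikipedia share: {proportional['wikipedia']:.4%}")
print(f"flattened Wikipedia share:    {flattened['wikipedia']:.4%}")
```

Under these assumptions the Wikipedia share rises from roughly 0.015% to roughly 1.2%, which illustrates why the balance between sources is a design decision in its own right.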

And it names the same filtering pipeline: deduplication (to reduce memorization and improve generalization), quality filtering (heuristics, perplexity scoring, classifier-based selection), and safety filtering (removing personally identifiable information and toxic content). The survey’s conclusion is one this explainer has repeated: “careful balancing of the scale, quality, and diversity of data sources is just as important for successful LLM training as dataset size.”
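As a toy illustration of those three stages, the sketch below chains exact-match deduplication, one heuristic quality check, and an email scrub. Every name, threshold, and pattern here is an assumption made for the example; real pipelines typically add near-duplicate detection, perplexity or classifier scoring, and far broader PII and toxicity handling.

```python
import hashlib
import re

# Deliberately simple email pattern; real PII detection covers much more.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash alike."""
    return " ".join(text.lower().split())


def deduplicate(docs):
    """Keep the first copy of each document, comparing normalized hashes."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def passes_quality(doc, min_words=5, max_symbol_ratio=0.3):
    """Heuristic filter: reject very short or symbol-heavy text."""
    if len(doc.split()) < min_words:
        return False
    symbols = sum(1 for ch in doc if not (ch.isalnum() or ch.isspace()))
    return symbols / len(doc) <= max_symbol_ratio


def scrub_pii(doc):
    """Safety step: replace email addresses with a placeholder."""
    return EMAIL_RE.sub("[EMAIL]", doc)


def run_pipeline(docs):
    """Dedup, then quality filter, then PII scrub."""
    return [scrub_pii(d) for d in deduplicate(docs) if passes_quality(d)]


raw = [
    "The quick brown fox jumps over the lazy dog.",
    "the quick  brown fox jumps over the lazy dog.",   # duplicate after normalization
    "$$$ !!! ### @@@ %%% ^^^",                         # symbol-heavy noise
    "Contact the maintainers at admin@example.com for access details.",
]
clean = run_pipeline(raw)
print(clean)
```

Running deduplication first is a common ordering choice because it shrinks the corpus before the later, potentially more expensive, checks are applied.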

The big axis: dense scaling vs. sparse efficiency

The survey’s analytical taxonomy organizes frontier pre-training along an axis we’ve watched emerge: dense scaling versus sparse efficiency. On one side are the scaling-law-driven dense models (GPT-3, Llama) that scale parameters, data, and compute together. On the other are the sparse Mixture-of-Experts models (the survey uses recent DeepSeek models as the case study) that decouple total capacity from active compute. It even cites the concrete numbers we’ve seen (14.8T training tokens and ~2.79M H800 GPU-hours for DeepSeek-V3) as evidence that sparse architectures shift the cost-capability frontier.
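A back-of-envelope calculation shows why that decoupling matters. The sketch below assumes the common rule of thumb that training costs about 6 FLOPs per parameter per token (an approximation that ignores attention and routing overhead, and is not a formula from the survey), and uses DeepSeek-V3's reported 671B total and 37B active parameters alongside the 14.8T tokens cited above.

```python
def training_flops(params, tokens):
    """Rule-of-thumb training cost: about 6 FLOPs per parameter per token."""
    return 6 * params * tokens


TOKENS = 14.8e12        # training tokens cited for DeepSeek-V3
TOTAL_PARAMS = 671e9    # reported total parameters
ACTIVE_PARAMS = 37e9    # reported parameters active per token

# Sparse model: only the active parameters do work on each token.
sparse_flops = training_flops(ACTIVE_PARAMS, TOKENS)

# Hypothetical dense model with the same total parameter count.
dense_flops = training_flops(TOTAL_PARAMS, TOKENS)

ratio = dense_flops / sparse_flops

print(f"sparse estimate: {sparse_flops:.2e} FLOPs")
print(f"dense estimate:  {dense_flops:.2e} FLOPs")
print(f"dense / sparse:  {ratio:.1f}x")
```

Under these assumptions, a dense model of the same total size would need roughly 18 times the training compute, which is the sense in which sparse routing shifts the cost-capability frontier.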

With the modern era both walked and mapped, we’re ready for the frontier — where every one of these levers is pushed to its 2026 extreme.