Section 23

The big picture

A survey of training methods

Paper: Training Methods for Large Language Models: Current Approaches and Challenges — Karydas, Margaritis & Leligou, 2026

Before we step into the 2026 frontier, it’s worth pausing to see the modern era mapped from above. Training Methods for Large Language Models: Current Approaches and Challenges (Karydas, Margaritis & Leligou, Technologies 2026) is a systematic survey of LLM training spanning the 2017–2025 literature. It’s the “bonus” overview in our reading list, and it’s a useful cross-check: an independent map of the same territory we’ve been walking. We’ll look only at its pre-training content (the survey also covers fine-tuning, alignment, and retrieval, which are out of scope here).

The survey’s organizing claim: objective follows architecture

The survey frames pre-training as self-supervised learning that “converts the data into the supervision signal” — exactly the free-label insight from chapter 1. Its central organizing principle is a clean mapping from architecture to objective, which neatly summarizes our paradigm chapters:

Decoder-only models (GPT-3, Llama) → Causal Language Modeling (next-token prediction), best for generation and notable for data efficiency and fine-tuning stability.
Encoder-only models (BERT) → Masked Language Modeling, best for representation/understanding tasks.
Encoder-decoder models (T5) → denoising/span objectives.
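
To make the mapping concrete, here is a minimal sketch of how each objective turns the same token sequence into an (input, target) pair. It is illustrative only and is not taken from the survey: the function names are hypothetical, the masking is simplified (BERT's actual recipe also swaps in random tokens and leaves some selected tokens unchanged), and the sentinel strings just mimic T5's naming convention.

```python
import random

MASK = "[MASK]"

def causal_lm_example(tokens):
    """Decoder-only: predict token t+1 from tokens up to t."""
    return tokens[:-1], tokens[1:]

def masked_lm_example(tokens, mask_prob=0.15, rng=None):
    """Encoder-only (simplified): hide random tokens, predict the originals."""
    rng = rng or random.Random(0)
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            targets.append(tok)      # loss is computed here
        else:
            inputs.append(tok)
            targets.append(None)     # no loss at unmasked positions
    return inputs, targets

def span_corruption_example(tokens, start, length):
    """Encoder-decoder (simplified): replace one span with a sentinel;
    the decoder target is the sentinel, the missing span, then a closing sentinel."""
    inputs = tokens[:start] + ["<extra_id_0>"] + tokens[start + length:]
    targets = ["<extra_id_0>"] + tokens[start:start + length] + ["<extra_id_1>"]
    return inputs, targets

if __name__ == "__main__":
    toks = ["the", "cat", "sat", "on", "the", "mat"]
    print(causal_lm_example(toks))
    print(masked_lm_example(toks, mask_prob=0.3))
    print(span_corruption_example(toks, start=1, length=2))
```

In all three cases the labels come from the text itself, which is the sense in which the data becomes the supervision signal.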

Data curation as a first-class subject

Tellingly, the survey gives data curation as much weight as architecture — and its checklist is the one we built in chapter 8. It frames modern corpora as a balance between web-scale coverage and curated quality:

Common Crawl (on the order of 100 TB) for breadth — “very large and noisy,” demanding heavy preprocessing.
Curated sources: Wikipedia (~15 GB) for trustworthy structured knowledge, book corpora for long coherent discourse, specialized code datasets (e.g. The Stack, GitHub-derived) for programming ability, and multilingual corpora for cross-lingual transfer.
“RefinedWeb-style” cleaned web datasets explicitly designed to outperform raw Common Crawl.

And it names the same filtering pipeline: deduplication (to reduce memorization and improve generalization), quality filtering (heuristics, perplexity scoring, classifier-based selection), and safety filtering (removing personally identifiable information and toxic content). The survey’s conclusion is one this explainer has repeated: “careful balancing of the scale, quality, and diversity of data sources is just as important for successful LLM training as dataset size.”
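
The three stages can be sketched as a toy pipeline. This is a deliberately minimal illustration under stated assumptions, not the survey's method or any production system: deduplication is exact-match on normalized text (real pipelines typically add near-duplicate detection), the quality filter is two hypothetical heuristics with made-up thresholds, and the safety step only redacts email addresses with a simple regular expression.

```python
import hashlib
import re

# Simple email pattern; an assumption for illustration, not a complete PII detector.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())

def passes_quality(text, min_words=5, min_alpha_ratio=0.6):
    """Hypothetical heuristics: enough words, and mostly alphabetic characters."""
    if len(text.split()) < min_words:
        return False
    alpha = sum(ch.isalpha() for ch in text)
    return alpha / max(len(text), 1) >= min_alpha_ratio

def scrub_pii(text):
    """Redact email addresses."""
    return EMAIL_RE.sub("[EMAIL]", text)

def curate(docs):
    """Deduplicate, then quality-filter, then scrub the documents that remain."""
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen:
            continue                 # exact duplicate after normalization
        seen.add(digest)
        if not passes_quality(doc):
            continue                 # fails the heuristic quality filter
        kept.append(scrub_pii(doc))
    return kept

if __name__ == "__main__":
    sample = [
        "The quick brown fox jumps over the lazy dog.",
        "the quick  brown fox jumps over the lazy dog.",
        "$$$ 123 !!! ### 456 ??? 789",
        "Contact me at alice@example.com for the full dataset details.",
    ]
    for doc in curate(sample):
        print(doc)
```

Perplexity scoring and classifier-based selection would slot in where `passes_quality` is called; they need a trained model, so they are left out of this sketch.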

The big axis: dense scaling vs. sparse efficiency

The survey’s analytical taxonomy organizes frontier pre-training along an axis we’ve watched emerge: dense scaling versus sparse efficiency. On one side are the scaling-law-driven dense models (GPT-3, Llama), which scale parameters, data, and compute together. On the other are the sparse Mixture-of-Experts models (the survey uses recent DeepSeek models as its case study), which decouple total capacity from active compute: only a few experts run for each token, so per-token cost tracks the active parameters rather than the total. It even cites the concrete numbers we’ve seen (14.8T training tokens and ~2.79M H800 GPU-hours for DeepSeek-V3) as evidence that sparse architectures shift the cost-capability frontier.
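
A back-of-the-envelope sketch shows what "decoupling total capacity from active compute" means for a single Mixture-of-Experts feed-forward layer. The configuration below is hypothetical and chosen only for round numbers; it does not describe DeepSeek-V3 or any model in the survey. The count assumes each expert is a plain two-matrix feed-forward block and ignores biases and router weights.

```python
def moe_ffn_params(d_model, d_ff, n_experts, top_k):
    """Return (total, active) feed-forward parameters for one MoE layer.

    Each expert is assumed to be an up projection (d_model x d_ff) plus a
    down projection (d_ff x d_model). Only top_k experts run per token.
    """
    per_expert = 2 * d_model * d_ff
    total = n_experts * per_expert   # capacity stored in the layer
    active = top_k * per_expert      # parameters actually used per token
    return total, active

if __name__ == "__main__":
    # Hypothetical configuration, for illustration only.
    total, active = moe_ffn_params(d_model=1024, d_ff=4096, n_experts=8, top_k=2)
    print(f"total:  {total:,}")
    print(f"active: {active:,}")
    print(f"ratio:  {total / active:.1f}x")
```

In this toy configuration the layer stores four times as many parameters as it uses for any one token. A dense layer, by contrast, has total equal to active, so adding capacity always adds per-token compute.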

With the modern era both walked and mapped, we’re ready for the frontier — where every one of these levers is pushed to its 2026 extreme.