LLM Pre-training, from the ground up

A long-form, interactive explainer

This is a walkthrough of how a large language model is actually built — the part that happens before anyone chats with it. We start from the single idea the whole field is built on (predict the next token) and follow it all the way to the architecture, data, and hardware tricks of the 2026 frontier models.

We cover both halves of the story: the machine-learning science (the objective, the gradient, the optimizer) and the systems engineering (GPU memory budgets, numerical precision, parallelism, and the data pipeline) that turns the science into a real trillion-token training run.

Only basic machine-learning knowledge is assumed. Every term gets defined the first time it shows up — hover any underlined word for a quick tooltip, or jump to the glossary at any time. There are interactive widgets throughout: a loss you can watch shrink, a gradient-descent ball you can roll down a hill, a GPU memory budget you can blow past, a scaling-law plot you can drag, and a Mixture-of-Experts router you can poke.

Scope note: this explainer is about pre-training only. Supervised fine-tuning, reinforcement learning, alignment, and other post-training steps are deliberately left out — they get their own essay in the sibling LLM Post-training explainer.

Reading time: ~3–4 hours, 28 sections

Contents

Foundations: how training works

  1. 01 What is pre-training? — Learning from raw text, no labels required
  2. 02 The objective — Next-token prediction and cross-entropy loss
  3. 03 How a model learns — Gradient descent and backpropagation
  4. 04 Optimizers & schedules — From SGD to AdamW, warmup and decay
  5. 05 Precision & numerics — FP32, BF16, FP8, and mixed precision
  6. 06 Compute & memory — FLOPs, the 6ND rule, and where the GBs go
  7. 07 Parallelism — Splitting one model across thousands of GPUs
  8. 08 The data pipeline — Crawl, filter, dedup, tokenize, mix

The transformer & the pre-training paradigm

  1. 09 Attention Is All You Need — The transformer, from a training lens
  2. 10 GPT-1 — Generative pre-training, then fine-tune
  3. 11 BERT — Masked language modeling and bidirectionality
  4. 12 GPT-2 — Scale, zero-shot, and the scaling hypothesis
  5. 13 T5 — Text-to-text and span corruption

Scaling laws & compute-optimal training

  1. 14 Scaling laws — Loss as a power law in size, data, and compute
  2. 15 GPT-3 — 175B parameters and in-context learning
  3. 16 Chinchilla — Compute-optimal training and the 20:1 rule

The modern open-model era

  1. 17 Llama 3 — 15 trillion tokens and a data engine
  2. 18 DeepSeek-V3 — MoE, MLA, MTP, and FP8 at scale
  3. 19 Qwen2.5 — 18 trillion tokens and data quality
  4. 20 Gemma 2 — Distillation as a pre-training objective
  5. 21 Synthetic data — When generated text helps pre-training
  6. 22 Gemma 3 — Long context and refined distillation
  7. 23 The big picture — A survey of training methods

The 2026 frontier

  1. 24 Kimi K2.5 — Trillion-parameter MoE and the Muon optimizer
  2. 25 Qwen3-Coder-Next — Pre-training for code at repository scale
  3. 26 DeepSeek-V4 — The next generation of efficient MoE
  4. 27 Qwen3.5-Omni — One model pre-trained on every modality
  5. 28 Recap — The through-line, and further reading