LLM Pre-training, from the ground up

A long-form, interactive explainer

This is a walkthrough of how a large language model is actually built — the part that happens before anyone chats with it. We start from the single idea the whole field is built on (predict the next token) and follow it all the way to the architecture, data, and hardware tricks of the 2026 frontier models.

We cover both halves of the story: the machine-learning science (the objective, the gradient, the optimizer) and the systems engineering (GPU memory budgets, numerical precision, parallelism, and the data pipeline) that turns the science into a real trillion-token training run.

Only basic machine-learning knowledge is assumed. Every term gets defined the first time it shows up — hover any underlined word for a quick tooltip, or jump to the glossary at any time. There are interactive widgets throughout: a loss you can watch shrink, a gradient-descent ball you can roll down a hill, a GPU memory budget you can blow past, a scaling-law plot you can drag, and a Mixture-of-Experts router you can poke.

Scope note: this explainer covers pre-training only. Supervised fine-tuning, reinforcement learning, alignment, and other post-training steps are deliberately left out; they are covered in the sibling LLM Post-training explainer.

Estimated reading time: about 3–4 hours, 28 sections

Published June 14, 2026

Foundations: how training works

The transformer & the pre-training paradigm

Scaling laws & compute-optimal training

The modern open-model era

The 2026 frontier

28 Recap — The through-line, and further reading