The data pipeline
Crawl, filter, dedup, tokenize, mix
We have the entire training machine. Now for its fuel. If there is one thing the modern papers agree on, it’s that data quality is as decisive as model size: two models of identical architecture and parameter count can differ enormously based purely on what they were fed. This chapter is the last piece of foundations: how a usable training corpus is built, and how it’s turned into tokens.
The funnel: from petabytes to trillions of tokens
Most pre-training data starts as raw web text. Common Crawl, a free, repeated public crawl of the web, provides petabytes of HTML, but the overwhelming majority of it is unusable: navigation menus, spam, machine-generated junk, near-duplicate boilerplate. The pipeline is essentially a giant filter that throws most of it away.
Two stages do most of the heavy lifting:
- Quality filtering discards low-value text using a mix of cheap heuristics (length, symbol ratios, language detection) and trained classifiers that score how “document-like” or “high-quality” a page is. The bar has risen over time: early corpora were lightly filtered; today’s best are aggressively curated.
- Deduplication removes documents that are duplicates or near-duplicates of others. Dedup matters more than it sounds: repeated text wastes capacity, encourages verbatim memorization, and distorts the data distribution. Removing duplicates consistently improves quality per token.
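To make the two stages concrete, here is a minimal sketch of a heuristic quality gate, exact deduplication, and a near-duplicate score. The thresholds (`min_words`, `max_symbol_ratio`), the function names, and the choice of word 3-gram Jaccard similarity are illustrative assumptions, not the settings of any particular pipeline; production systems typically add trained classifiers and scalable approximations such as MinHash.

```python
import hashlib
import re


def passes_heuristics(text, min_words=20, max_symbol_ratio=0.3):
    """Cheap quality gate: enough words, and not dominated by symbols."""
    if len(text.split()) < min_words:
        return False
    non_space = [c for c in text if not c.isspace()]
    if not non_space:
        return False
    symbols = sum(1 for c in non_space if not c.isalnum())
    return symbols / len(non_space) <= max_symbol_ratio


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def dedup_exact(docs):
    """Keep the first copy of each document, comparing normalized hashes."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def jaccard(a, b, n=3):
    """Near-duplicate score: Jaccard overlap of word n-gram (shingle) sets."""
    def shingles(text):
        words = normalize(text).split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    sa, sb = shingles(a), shingles(b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)
```

A pipeline built on this sketch would drop documents that fail `passes_heuristics`, run `dedup_exact`, and then treat pairs whose `jaccard` score exceeds some chosen threshold as near-duplicates.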
A related concern is contamination: benchmark or test data leaking into training, which silently inflates evaluation scores. Careful pipelines actively detect and strip known benchmarks before training.
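One common way to look for contamination is to check whether a training document shares a long word n-gram with any benchmark item. The sketch below assumes whitespace tokenization and an 8-word window; both are illustrative choices, and the function names are hypothetical.

```python
def ngrams(text, n):
    """Set of lowercase word n-grams in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def is_contaminated(document, benchmark_items, n=8):
    """Flag a document that shares any word n-gram with a benchmark item."""
    doc_grams = ngrams(document, n)
    return any(doc_grams & ngrams(item, n) for item in benchmark_items)
```

A shorter window catches more paraphrased leaks but also flags more innocent text, so the window size is a trade-off rather than a fixed constant.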
The data mixture
Filtered web text is the bulk, but it’s blended with curated high-value sources (code, books, scientific papers, math, and multilingual text) into a deliberate data mixture. The ratios are a genuine design decision with real consequences: adding code improves structured reasoning even on non-code tasks; adding math lifts quantitative ability; multilingual data broadens reach but competes for finite model capacity. As you saw in the widget, the mixture is a recipe, and modern teams tune it carefully, often with small proxy models and scaling-law extrapolation, before committing to a full run.
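In code, a mixture is just a set of sampling weights over sources. The sketch below draws the source for each training sample according to those weights; the specific fractions in `MIXTURE` are made up for illustration and are not taken from any real model's recipe.

```python
import random
from collections import Counter

# Hypothetical mixture weights, chosen only for illustration.
MIXTURE = {
    "web": 0.60,
    "code": 0.15,
    "books": 0.10,
    "papers": 0.05,
    "math": 0.05,
    "multilingual": 0.05,
}


def sample_sources(mixture, num_draws, seed=0):
    """Pick a source for each draw, with probability equal to its weight."""
    rng = random.Random(seed)
    names = list(mixture)
    weights = [mixture[name] for name in names]
    return rng.choices(names, weights=weights, k=num_draws)


counts = Counter(sample_sources(MIXTURE, 10_000))
for name in MIXTURE:
    print(f"{name:>12}: target {MIXTURE[name]:.2f}, sampled {counts[name] / 10_000:.3f}")
```

Changing the recipe means changing only the weights, which is what makes it cheap to compare many candidate mixtures on small proxy models.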
Tokenization: text becomes integers
Finally, text has to become numbers. The tokenizer converts raw characters into a sequence of integer token IDs drawn from a fixed vocabulary (often 100k–256k entries). The dominant algorithm is Byte Pair Encoding (BPE): start from individual characters or bytes and repeatedly merge the most frequent adjacent pair into a new token, until the vocabulary reaches its target size. Common words become single tokens; rare words split into pieces.
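The merge loop is short enough to sketch in full. This toy version works on characters within pre-split words and breaks ties by first occurrence; real tokenizers differ in pre-tokenization, tie-breaking, and efficiency, so treat the function names and the tiny corpus as illustrative assumptions.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(words, num_merges):
    """Learn an ordered list of merges, starting from single characters."""
    vocab = Counter(tuple(word) for word in words)  # symbol tuple -> frequency
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        merged = Counter()
        for symbols, freq in vocab.items():
            merged[merge_pair(symbols, best)] += freq
        vocab = merged
    return merges


def encode(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = train_bpe(corpus, num_merges=4)
print(merges)
print(encode("lowest", merges))
```

On this toy corpus, four merges are enough for the unseen word "lowest" to encode as the two pieces "low" and "est": the frequent fragments became tokens, and the rare word split along them.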
The tokenizer is fixed before pre-training begins and can’t easily be changed afterward — every parameter is trained against its specific vocabulary. Several variants matter across the papers:
- WordPiece: BERT’s close cousin of BPE.
- byte-level BPE: GPT-2’s innovation of running BPE over raw bytes, so any input (emoji, code, any language) is representable with a small base vocabulary and nothing is ever “out of vocabulary.”
- SentencePiece: a toolkit that tokenizes raw text directly (treating spaces as symbols), making it language-agnostic.
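The “nothing is ever out of vocabulary” property of the byte-level approach follows from UTF-8 itself: every string is a sequence of bytes, and there are only 256 possible byte values. A small sketch, with hypothetical helper names, shows the base representation that merges would then be learned on top of:

```python
def to_byte_ids(text):
    """Map any string to base token IDs in 0..255 via its UTF-8 bytes."""
    return list(text.encode("utf-8"))


def from_byte_ids(ids):
    """Invert to_byte_ids by decoding the bytes back into text."""
    return bytes(ids).decode("utf-8")


for sample in ["hello", "naïve", "日本語", "🙂", "x = 1;"]:
    ids = to_byte_ids(sample)
    assert all(0 <= i < 256 for i in ids)
    assert from_byte_ids(ids) == sample
    print(f"{sample!r}: {len(sample)} chars -> {len(ids)} byte IDs")
```

The cost is visible in the counts: non-ASCII characters expand to several byte IDs each, which is why learned merges on top of the byte base matter for efficiency.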
Variable documents, fixed window
There’s a mismatch hiding in plain sight. A corpus is a pile of documents of wildly different lengths: a tweet is a few dozen tokens, a forum post a few hundred, a news article a few thousand, a novel or a code repository hundreds of thousands. But the model trains on sequences of exactly one fixed length: its context length (say 4,096 or 8,192 tokens). Every training example must be precisely that many tokens. So how do you feed documents that are almost all shorter than the window, and occasionally much longer, into a window of fixed size?
The two obvious answers are both bad:
- One document per sequence, then pad the leftover space to reach the context length. Simple, but if your documents average, say, 500 tokens and the context length is 8,192, then ~94% of every sequence is meaningless filler: you’d burn the overwhelming majority of your FLOPs processing padding, since by the 6ND rule compute scales with the number of tokens processed. Unacceptable at scale.
- Truncate every document to the context length and throw the rest away. Long documents then fit without overflow, but you discard data and chop them mid-thought, and documents shorter than the window still need padding.
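The cost of both options is simple arithmetic. The sketch below reproduces the 500-token, 8,192-window padding example and adds a hypothetical 100,000-token document to show the truncation side; the function names are illustrative.

```python
def padding_fraction(doc_tokens, context_length):
    """Fraction of a one-document-per-sequence example that is padding."""
    used = min(doc_tokens, context_length)
    return 1 - used / context_length


def truncated_fraction(doc_tokens, context_length):
    """Fraction of a document's tokens discarded by truncation."""
    return max(doc_tokens - context_length, 0) / doc_tokens


print(f"padding:   {padding_fraction(500, 8192):.1%}")        # about 93.9%
print(f"truncated: {truncated_fraction(100_000, 8192):.1%}")  # about 91.8%
```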
Sequence packing
The standard fix is sequence packing: concatenate the entire tokenized corpus into one enormous continuous stream of token IDs, then slice that stream into back-to-back chunks, each exactly one context length long. There’s no padding (except possibly the final partial chunk) and no truncation loss: every token is used. A short document and the beginning of the next one simply share a chunk; a long document just spans several consecutive chunks.
To stop the model from blending unrelated documents, a document separator (a special token, commonly an end-of-sequence marker like <|endoftext|>) is inserted between documents in the stream. It teaches the model “this is a boundary; what comes next is unrelated, reset your expectations.”
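Packing with separators fits in a few lines. In this minimal sketch the separator ID `EOS_ID = 0` and the function name `pack` are assumptions for illustration; a real tokenizer assigns its own ID to the separator token, and real pipelines stream from disk rather than building one list in memory.

```python
EOS_ID = 0  # hypothetical token ID for the document separator


def pack(tokenized_docs, context_length, eos_id=EOS_ID):
    """Concatenate documents with separators, then slice into fixed chunks."""
    stream = []
    for doc in tokenized_docs:
        stream.extend(doc)
        stream.append(eos_id)  # mark the boundary after every document
    full = len(stream) - len(stream) % context_length
    chunks = [stream[i:i + context_length] for i in range(0, full, context_length)]
    remainder = stream[full:]  # final partial chunk: pad it or drop it
    return chunks, remainder


docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10, 11, 12]]
chunks, remainder = pack(docs, context_length=4)
print(chunks)     # [[1, 2, 3, 0], [4, 5, 0, 6], [7, 8, 9, 10]]
print(remainder)  # [11, 12, 0]
```

The second chunk shows two documents sharing a window across a separator, and the third document spills over into the following chunk, matching the two cases described above.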
What about documents longer than the window?
A document bigger than the context window is split across multiple chunks, and those chunks are typically shuffled into the training order, so the model usually sees a window-sized slice of a book, not the whole book in sequence. This is one reason long-context ability doesn’t come for free from pre-training: reaching 128K or 1M tokens requires deliberate long-context training stages (which we’ll meet in the modern models), not just longer documents.
The end result, whichever choices you make, is a stream of uniform, fixed-length sequences of token IDs ready for the training loop.
That completes the foundations. We have the objective, the optimizer, the numerics, the hardware budget, the parallelism, and the data. Everything from here on is innovation on top of this base — and it starts with the architecture that made all of it worth doing: the transformer.