Section 08

The data pipeline

Crawl, filter, dedup, tokenize, mix

We have the entire training machine. Now for its fuel. If there is one thing the modern papers agree on, it’s that data quality is as decisive as model size — two models of identical architecture and parameter count can differ enormously based purely on what they were fed. This chapter is the last piece of foundations: how a usable training corpus is built, and how it’s turned into tokens.

The funnel: from petabytes to trillions of tokens

Most pre-training data starts as raw web text. Common Crawl — a free, repeated public crawl of the web — provides petabytes of HTML, but the overwhelming majority of it is unusable: navigation menus, spam, machine-generated junk, near-duplicate boilerplate. The pipeline is essentially a giant filter that throws most of it away.

From the raw web to a training corpus

Most of the crawl is thrown away. What survives gets blended into a deliberate mixture.

The filtering funnel (illustrative proportions: roughly a tenth of the raw crawl survives to training)

Raw Common Crawl (web text): 100% (petabytes of HTML)
After language + quality filtering: 22% (drop spam, boilerplate, junk)
After deduplication: 13% (remove repeats and near-duplicates)
Final curated corpus: plus books, code, math, etc.

Your data mixture

Web text 50%, Code 20%, Books & papers 15%, Math & reasoning 7%, Multilingual 8%

The mixture is one of the highest-leverage knobs in pre-training. More code improves reasoning and structure; more math lifts quantitative skill; multilingual data broadens reach but competes for capacity. Modern teams tune these ratios with small-scale experiments and scaling laws before committing to a trillion-token run.

Two stages do most of the heavy lifting:

Quality filtering discards low-value text using a mix of cheap heuristics (length, symbol ratios, language detection) and trained classifiers that score how “document-like” or “high-quality” a page is. The bar has risen over time: early corpora were lightly filtered; today’s best are aggressively curated.
Deduplication removes documents that are duplicates or near-duplicates of others. Dedup matters more than it sounds: repeated text wastes capacity, encourages verbatim memorization, and distorts the data distribution. Removing duplicates consistently improves quality per token.
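
The two stages above can be sketched in a few lines. This is a minimal illustration, not a production pipeline: the thresholds (`min_words`, `max_symbol_ratio`, `threshold`), the word 3-gram shingles, and every function name are assumptions chosen for the example. Real pipelines usually replace the pairwise comparison with a scalable approximation such as MinHash with locality-sensitive hashing, and layer trained quality classifiers on top of the heuristics.

```python
import hashlib
import re

def passes_heuristics(text, min_words=20, max_symbol_ratio=0.3):
    """Cheap quality checks: enough words, and not dominated by symbols."""
    if len(text.split()) < min_words:
        return False
    non_space = [c for c in text if not c.isspace()]
    symbols = sum(1 for c in non_space if not c.isalnum())
    return symbols / len(non_space) <= max_symbol_ratio

def shingles(text, n=3):
    """Set of lowercase word n-grams used to compare two documents."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def jaccard(a, b):
    """Overlap between two shingle sets, from 0.0 (disjoint) to 1.0 (equal)."""
    return len(a & b) / len(a | b) if (a or b) else 1.0

def dedup(docs, threshold=0.8):
    """Keep the first copy of each document; drop exact and near duplicates."""
    seen_hashes = set()
    kept, kept_shingles = [], []
    for doc in docs:
        digest = hashlib.sha256(doc.strip().lower().encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate after light normalization
        sh = shingles(doc)
        # O(n^2) comparison: fine for a demo, too slow for a web-scale corpus.
        if any(jaccard(sh, other) >= threshold for other in kept_shingles):
            continue  # near duplicate of something already kept
        seen_hashes.add(digest)
        kept.append(doc)
        kept_shingles.append(sh)
    return kept

docs = [
    "the quick brown fox jumps over the lazy dog near the river bank today",
    "the quick brown fox jumps over the lazy dog near the river bank today",
    "the quick brown fox jumps over the lazy dog near the river bank now",
    "tokenizers turn raw text into integer ids drawn from a fixed vocabulary",
]
print(len(dedup(docs)))  # 2: one exact and one near duplicate are dropped
```

The third document differs from the first by a single word, so 11 of their 13 distinct shingles match and it falls above the 0.8 threshold.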

A related concern is contamination — benchmark or test data leaking into training, which silently inflates evaluation scores. Careful pipelines actively detect and strip known benchmarks before training.
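
One common way to detect contamination is n-gram overlap: flag any training document that shares a long word n-gram with a known benchmark. The sketch below assumes an 8-word window and hypothetical helper names; the window size and the decision to drop a document on a single match are illustrative choices, not a statement of what any particular pipeline does.

```python
import re

def ngrams(text, n=8):
    """Set of lowercase word n-grams in a text (empty if the text is shorter than n)."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def build_benchmark_index(benchmark_texts, n=8):
    """Collect every n-gram that appears in any benchmark item."""
    index = set()
    for text in benchmark_texts:
        index |= ngrams(text, n)
    return index

def is_contaminated(doc, index, n=8):
    """True if the document shares at least one n-gram with the benchmark index."""
    return not ngrams(doc, n).isdisjoint(index)

benchmark = ["What is the capital of France and which river runs through it?"]
index = build_benchmark_index(benchmark)

leaked = "Quiz answers: what is the capital of France and which river runs through it? Paris."
clean = "The Seine runs through Paris, which is the capital of France, a country in Europe."
print(is_contaminated(leaked, index), is_contaminated(clean, index))  # True False
```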

The data mixture

Filtered web text is the bulk, but it’s blended with curated high-value sources (code, books, scientific papers, math, and multilingual text) into a deliberate data mixture. The ratios are a genuine design decision with real consequences: adding code improves structured reasoning even on non-code tasks; adding math lifts quantitative ability; multilingual data broadens reach but competes for finite model capacity. As you saw in the widget, the mixture is a recipe, and modern teams tune it carefully, often with small proxy models and scaling-law extrapolation, before committing to a full run.
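
In practice a mixture is often just a table of sampling weights. The sketch below uses the illustrative ratios from the widget (50/20/15/7/8); the source names, the `token_budget` helper, and the idea of drawing the source of each training document at random are assumptions made for the example.

```python
import random
from collections import Counter

# Illustrative weights from the widget above; they must sum to 1.
MIXTURE = {
    "web": 0.50,
    "code": 0.20,
    "books_papers": 0.15,
    "math": 0.07,
    "multilingual": 0.08,
}

def token_budget(mixture, total_tokens):
    """Split a total token budget across sources according to the weights."""
    return {name: round(total_tokens * weight) for name, weight in mixture.items()}

def sample_sources(mixture, num_draws, seed=0):
    """Pick which source each of the next num_draws documents comes from."""
    rng = random.Random(seed)
    names = list(mixture)
    weights = [mixture[name] for name in names]
    return rng.choices(names, weights=weights, k=num_draws)

budget = token_budget(MIXTURE, 1_000_000_000_000)  # a one-trillion-token run
print(budget["code"])  # 200000000000

draws = Counter(sample_sources(MIXTURE, 10_000))
print({name: count / 10_000 for name, count in draws.items()})
```

Changing one weight means rescaling the others, which is why these ratios are usually tuned together on small proxy runs rather than one at a time.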

Tokenization: text becomes integers

Finally, text has to become numbers. The tokenizer converts raw characters into a sequence of integer token IDs drawn from a fixed vocabulary (often 100k–256k entries in recent models; earlier ones used tens of thousands). The dominant algorithm is Byte Pair Encoding (BPE): start from individual characters or bytes and repeatedly merge the most frequent adjacent pair into a new token, until the vocabulary reaches its target size. Common words become single tokens; rare words split into pieces.
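
The merge loop is short enough to write out. The following is a toy sketch that starts from characters and learns merges from a small word list; the function names and the tiny corpus are assumptions for illustration, and real tokenizers add pre-tokenization, byte-level handling, and much faster data structures.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def train_bpe(words, num_merges):
    """Learn an ordered list of merges from a list of words."""
    # Each distinct word is a tuple of symbols, weighted by its frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)  # most frequent pair; ties go to the first seen
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges

def encode(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = train_bpe(corpus, num_merges=4)
print(merges)                    # [('e', 's'), ('es', 't'), ('l', 'o'), ('lo', 'w')]
print(encode("lowest", merges))  # ['low', 'est']
```

Note that "lowest" never appears in the corpus, yet it is covered by two learned pieces: this is the "rare words split into pieces" behavior described above.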

The tokenizer is fixed before pre-training begins and can’t easily be changed afterward — every parameter is trained against its specific vocabulary. Several variants matter across the papers:

WordPiece: a close cousin of BPE, used by BERT.
byte-level BPE: GPT-2’s innovation of running BPE over raw bytes, so any input (emoji, code, any language) is representable with a small base vocabulary and nothing is ever “out of vocabulary.”
SentencePiece: a toolkit that tokenizes raw text directly (treating spaces as symbols), making it language-agnostic.
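
To see why a byte-level base vocabulary can never hit an unknown symbol, it helps to look at the starting point before any merges are learned. This small sketch (helper names are hypothetical) maps text to its UTF-8 bytes, which gives exactly 256 possible base IDs; a byte-level BPE tokenizer would then learn merges on top of these.

```python
def to_byte_tokens(text):
    """Map any string to base token IDs in the range 0..255 (its UTF-8 bytes)."""
    return list(text.encode("utf-8"))

def from_byte_tokens(ids):
    """Invert to_byte_tokens: bytes back to the original string."""
    return bytes(ids).decode("utf-8")

for sample in ["hello", "héllo", "🙂", "int x = 0;"]:
    ids = to_byte_tokens(sample)
    assert all(0 <= i < 256 for i in ids)    # always inside the base vocabulary
    assert from_byte_tokens(ids) == sample   # and always reversible
    print(repr(sample), len(ids), ids)
```

Plain ASCII costs one byte per character, while the accented letter takes two bytes and the emoji four, so before merges non-English text and symbols are more expensive in tokens.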

Tokenization shapes what's easy and hard

Because the model sees tokens, not characters, anything that doesn’t align to token boundaries is harder for it: arithmetic on long numbers, spelling, rhyming, manipulating individual letters. Tokenizer design also sets how many tokens a given text costs, which directly affects training compute (via the 6ND rule) and the effective context length in characters. Good tokenization is quietly load-bearing.
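
A quick back-of-the-envelope calculation makes the cost argument concrete. The sketch below uses the 6ND approximation (training FLOPs are roughly 6 times parameters times tokens); the corpus size, the model size, and the two characters-per-token figures are hypothetical numbers picked for illustration, not measurements of any real tokenizer.

```python
def training_flops(num_params, num_tokens):
    """Approximate training compute using the 6ND rule."""
    return 6 * num_params * num_tokens

def tokens_for_corpus(num_chars, chars_per_token):
    """How many tokens a corpus costs at a given compression rate."""
    return num_chars / chars_per_token

def context_in_chars(context_tokens, chars_per_token):
    """How much raw text fits in one context window."""
    return context_tokens * chars_per_token

NUM_PARAMS = 7_000_000_000         # hypothetical 7B-parameter model
CORPUS_CHARS = 4_000_000_000_000   # hypothetical 4 trillion characters of text

for chars_per_token in (3.0, 4.0):  # a less and a more efficient tokenizer
    tokens = tokens_for_corpus(CORPUS_CHARS, chars_per_token)
    flops = training_flops(NUM_PARAMS, tokens)
    window = context_in_chars(8192, chars_per_token)
    print(f"{chars_per_token} chars/token: {tokens:.2e} tokens, "
          f"{flops:.2e} FLOPs, {window:.0f} chars per 8,192-token window")
```

Under these assumptions, the less efficient tokenizer needs about a third more tokens, and therefore about a third more compute, to read the same text once, and fits proportionally less text in each window.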

Variable documents, fixed window

There’s a mismatch hiding in plain sight. A corpus is a pile of documents of wildly different lengths — a tweet is a few dozen tokens, a forum post a few hundred, a news article a few thousand, a novel or a code repository hundreds of thousands. But the model trains on sequences of exactly one fixed length: its context length $L$ (say 4,096 or 8,192 tokens). Every training example must be precisely $L$ tokens. So how do you feed documents that are almost all shorter than $L$ — and occasionally much longer — into a window of fixed size?

The two obvious answers are both bad:

One document per sequence, then pad the leftover space to reach $L$. Simple, but if your documents average, say, 500 tokens and $L$ is 8,192, then ~94% of every sequence is meaningless filler ($1 - 500/8192 \approx 0.94$), so you’d burn the overwhelming majority of your FLOPs (via the 6ND rule) processing padding. Unacceptable at scale.
Truncate every document to $L$ and throw the rest away. No padding waste, but you discard data and chop long documents mid-thought.
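
Both costs are easy to quantify. The sketch below measures the wasted fraction under each naive strategy for a list of document lengths; the function names and the example lengths are assumptions for illustration.

```python
def padding_fraction(doc_lengths, context_len):
    """Share of positions that are padding when each document gets its own sequence.

    Documents longer than context_len are truncated to fit.
    """
    useful = sum(min(n, context_len) for n in doc_lengths)
    total = len(doc_lengths) * context_len
    return 1 - useful / total

def truncation_loss(doc_lengths, context_len):
    """Share of all tokens discarded when every document is cut to context_len."""
    dropped = sum(max(n - context_len, 0) for n in doc_lengths)
    return dropped / sum(doc_lengths)

L = 8192
print(round(padding_fraction([500] * 1000, L), 2))     # 0.94
print(round(truncation_loss([500, 2000, 300_000], L), 2))
```

The first line reproduces the ~94% figure from the text. The second shows that a single book-length document among short ones can account for most of the tokens thrown away by truncation.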

Sequence packing

The standard fix is sequence packing: concatenate the entire tokenized corpus into one enormous continuous stream of token IDs, then slice that stream into back-to-back chunks of exactly $L$ tokens. There’s no padding (except possibly the final partial chunk) and no truncation loss: every token is used. A short document and the beginning of the next one simply share a chunk; a long document just spans several consecutive chunks.

To stop the model from blending unrelated documents, a document separator — a special token, commonly an end-of-sequence marker like <|endoftext|> — is inserted between documents in the stream. It teaches the model “this is a boundary; what comes next is unrelated, reset your expectations.”
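
Packing with a separator can be sketched as follows. The function name, the choice of `0` as the separator ID, and returning the leftover tokens separately are assumptions for the example; a real data loader would stream from disk rather than build one list in memory.

```python
def pack_sequences(tokenized_docs, context_len, eos_id):
    """Concatenate documents with a separator and slice into fixed-length chunks.

    Returns (chunks, remainder): every chunk has exactly context_len tokens,
    and remainder holds the final partial chunk (possibly empty).
    """
    stream = []
    for doc in tokenized_docs:
        stream.extend(doc)
        stream.append(eos_id)  # mark the document boundary
    num_full = len(stream) // context_len
    chunks = [stream[i * context_len:(i + 1) * context_len] for i in range(num_full)]
    remainder = stream[num_full * context_len:]
    return chunks, remainder

docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10, 11]]
chunks, remainder = pack_sequences(docs, context_len=4, eos_id=0)
print(chunks)     # [[1, 2, 3, 0], [4, 5, 0, 6], [7, 8, 9, 10]]
print(remainder)  # [11, 0]
```

The second chunk holds the whole second document plus the first token of the third, and the third document spills into the next chunk, which is exactly the sharing and spanning behavior described above.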

Cross-document attention — leak or don't bother?

Packing creates a subtle issue: a chunk might hold the tail of document A followed by the start of document B, and with a plain causal mask the tokens in B can still attend back to A — unrelated context bleeding across the seam. Two stances exist. Many models simply ignore it: the separator token plus sheer scale make it a minor effect, and it’s free. Others apply document attention masking — a block-diagonal mask so each document only attends within itself, never across a separator — for a cleaner training signal at the cost of some bookkeeping. It’s increasingly common in careful pipelines.
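
The block-diagonal mask can be built directly from the separator positions. This is a minimal sketch in plain Python, assuming the separator is counted as the last token of the document it ends; the function name and the list-of-lists representation are illustrative, and a real implementation would build the mask as a tensor or use a variable-length attention kernel.

```python
def document_causal_mask(chunk, eos_id):
    """Boolean mask where mask[i][j] is True if position i may attend to position j.

    A position may attend only to itself and earlier positions (causal)
    that belong to the same document (block-diagonal).
    """
    doc_ids, current = [], 0
    for token in chunk:
        doc_ids.append(current)
        if token == eos_id:
            current += 1  # the next token starts a new document
    n = len(chunk)
    return [[j <= i and doc_ids[i] == doc_ids[j] for j in range(n)] for i in range(n)]

chunk = [4, 5, 0, 6]  # tail of one document, separator, start of the next
for row in document_causal_mask(chunk, eos_id=0):
    print("".join("x" if allowed else "." for allowed in row))
# x...
# xx..
# xxx.
# ...x
```

With a plain causal mask the last row would be `xxxx`; here the first token of the new document sees only itself.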

What about documents longer than the window?

A document bigger than $L$ is split across multiple chunks, and those chunks are typically shuffled into the training order, so the model usually sees an $L$-token slice of a book, not the whole book in sequence. This is one reason long-context ability doesn’t come for free from pre-training: reaching 128K or 1M tokens requires deliberate long-context training stages (which we’ll meet in the modern models), not just longer documents.

The end result, whichever choices you make, is a stream of uniform sequences of exactly $L$ token IDs each, ready for the training loop.

That completes the foundations. We have the objective, the optimizer, the numerics, the hardware budget, the parallelism, and the data. Everything from here on is innovation on top of this base — and it starts with the architecture that made all of it worth doing: the transformer.