T5
Text-to-text and span corruption
Paper: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (T5) — Raffel et al., 2020
T5, introduced in Raffel et al.’s 2020 paper Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, is the last of our “paradigm” papers, and a different kind of contribution. Where GPT and BERT each made one bold bet, T5 ran the systematic study: it fixed a baseline setup and then varied the objective, the architecture, the data, and the scale, one factor at a time. Much of what the field “knows” about pre-training design choices traces back to this paper’s experiments.
Everything is text-to-text
T5’s unifying idea is to cast every task (translation, summarization, classification, even regression) as text-to-text: feed the model input text, ask it to produce output text. Classification becomes “generate the class name”; translation becomes “generate the translation.” One model, one loss (cross-entropy), one format for everything. This is mostly a fine-tuning/evaluation convenience, but it matters to pre-training because it let T5 compare wildly different tasks on an equal footing, and it foreshadows how today’s models treat all problems as next-token generation.
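To make the casting concrete, here is a minimal sketch of how three different tasks could be rendered as (input text, target text) pairs. The function name `to_text_to_text` and the prefixes are illustrative assumptions, not the exact strings used in the paper; the only point is that a class label and a numeric score both end up as plain output text.

```python
def to_text_to_text(prefix, text, label):
    """Render one example as an (input text, target text) pair.

    `prefix` is a hypothetical task tag; the exact wording is an assumption.
    """
    source = f"{prefix}: {text}"
    # Regression targets are written out as strings (here rounded to one
    # decimal place), so they go through the same generation loss as words.
    target = f"{label:.1f}" if isinstance(label, float) else str(label)
    return source, target


examples = [
    to_text_to_text("translate English to German", "That is good.", "Das ist gut."),
    to_text_to_text("sentiment", "A delightful film.", "positive"),
    to_text_to_text("similarity", "sentence1: A cat sits. sentence2: A cat is sitting.", 4.2),
]

for source, target in examples:
    print(f"{source!r} -> {target!r}")
```

Every pair has the same shape, so one model and one cross-entropy loss over the target tokens can serve all three tasks.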
The span-corruption objective
T5’s pre-training objective generalizes BERT’s masked language modeling into a sequence-to-sequence denoising task called span corruption. Instead of masking individual tokens, it masks whole contiguous spans:
- Randomly select ~15% of tokens; group consecutive selected tokens into spans (mean span length 3).
- Replace each span in the input with a single unique sentinel token (<X>, <Y>, …).
- The model’s target is just the dropped spans, each prefixed by its sentinel.
So “the quick brown fox jumps over” might become input “the quick <X> jumps over” with target “<X> brown fox <Y>”, where the trailing <Y> is a final sentinel marking the end of the target. Because one sentinel replaces a whole span, the corrupted input is shorter than the original and the target covers only the dropped tokens, which makes training cheaper than predicting every position.
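The transformation can be sketched in a few lines of Python. This is a simplified illustration, not the paper's implementation: the spans are passed in by hand rather than sampled (the random selection of ~15% of tokens is left out), and `span_corrupt` and the `SENTINELS` list are names invented for the example.

```python
SENTINELS = ["<X>", "<Y>", "<Z>", "<W>"]  # illustrative sentinel names


def span_corrupt(tokens, spans, sentinels=SENTINELS):
    """Build (corrupted input, target) from tokens and the spans to drop.

    `spans` must be sorted, non-overlapping (start, end) pairs, end exclusive.
    """
    if len(spans) >= len(sentinels):
        raise ValueError("need one sentinel per span plus a final one")
    corrupted, target, cursor = [], [], 0
    for sentinel, (start, end) in zip(sentinels, spans):
        corrupted.extend(tokens[cursor:start])  # keep the text before the span
        corrupted.append(sentinel)              # one sentinel stands in for the span
        target.append(sentinel)                 # target: sentinel, then dropped tokens
        target.extend(tokens[start:end])
        cursor = end
    corrupted.extend(tokens[cursor:])           # keep the text after the last span
    target.append(sentinels[len(spans)])        # final sentinel closes the target
    return corrupted, target


tokens = "the quick brown fox jumps over".split()
corrupted, target = span_corrupt(tokens, [(2, 4)])
print(" ".join(corrupted))  # the quick <X> jumps over
print(" ".join(target))     # <X> brown fox <Y>
```

With two spans, for instance `[(1, 2), (4, 5)]`, the input becomes “the <X> brown fox <Y> over” and the target “<X> quick <Y> jumps <Z>”.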
C4: a dataset as a deliverable
To run experiments at scale, T5’s authors built and released C4, the Colossal Clean Crawled Corpus: about 750 GB of cleaned English text filtered from Common Crawl. The cleaning was aggressive and rule-based (drop lines without terminal punctuation, remove boilerplate, drop pages containing words from an offensive-word blocklist, deduplicate). C4 became a standard pre-training dataset in its own right and a template for the heavy filtering pipelines that followed. Releasing the dataset, not just the model, was itself influential.
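To give a feel for what “rule-based” means here, the sketch below applies three heuristics in the spirit of the ones listed above. It is a toy under stated assumptions, not the actual C4 pipeline: `clean_pages`, the one-word `BLOCKLIST`, and the exact-match line deduplication are all simplifications chosen for illustration.

```python
import re

TERMINAL_PUNCTUATION = (".", "!", "?", '"')
BLOCKLIST = {"badword"}  # hypothetical placeholder for a real word list


def clean_pages(pages, blocklist=BLOCKLIST):
    """Filter raw page texts with simple rules; returns the cleaned pages."""
    seen_lines = set()
    cleaned = []
    for page in pages:
        # Rule 1: drop the whole page if it contains any blocklisted word.
        if blocklist & set(re.findall(r"[a-z']+", page.lower())):
            continue
        kept = []
        for line in page.splitlines():
            line = line.strip()
            # Rule 2: keep only lines that end like a sentence; menus and
            # button labels usually do not.
            if not line.endswith(TERMINAL_PUNCTUATION):
                continue
            # Rule 3: drop lines already seen anywhere in the corpus
            # (a simplified stand-in for real deduplication).
            if line in seen_lines:
                continue
            seen_lines.add(line)
            kept.append(line)
        if kept:
            cleaned.append("\n".join(kept))
    return cleaned


pages = [
    "Home | About | Login\nT5 is a text-to-text model.\nIt was trained on C4.",
    "T5 is a text-to-text model.\nClick here",
    "This page has a badword in it.",
]
print(clean_pages(pages))
```

Only the two real sentences from the first page survive: the navigation line and “Click here” fail the punctuation rule, the repeated sentence is deduplicated, and the third page is dropped by the blocklist.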
The architecture, and a note on what won
T5 used a standard encoder-decoder transformer, and its ablations found that, for their text-to-text setup, encoder-decoder beat the decoder-only language-model and prefix-LM variants. Scale ran from a 220M-parameter baseline up to an 11-billion-parameter model, trained for 2^19 (≈524k) steps on C4 with the memory-efficient AdaFactor optimizer, an inverse-square-root schedule, and SentencePiece tokenization (32k vocabulary).
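The inverse-square-root schedule is easy to state in code. The sketch below uses the form 1/sqrt(max(step, warmup_steps)) with a 10,000-step constant phase, which is our understanding of the paper's setup; treat the default value and the function name `inverse_sqrt_lr` as assumptions of this example.

```python
import math


def inverse_sqrt_lr(step, warmup_steps=10_000):
    """Learning rate that is constant during warmup, then decays as 1/sqrt(step).

    The warmup length is an assumed default for illustration.
    """
    return 1.0 / math.sqrt(max(step, warmup_steps))


for step in (1, 10_000, 40_000, 2**19):
    print(step, inverse_sqrt_lr(step))
```

The rate holds at 0.01 for the first 10,000 steps and then falls slowly: quadrupling the step count only halves the learning rate, so it is 0.005 at step 40,000.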
That closes the paradigm. We have the architecture (transformer), the generative recipe (GPT), the alternative objectives (BERT, T5), and the seed of an idea that scaling just works. The next group turns that seed into a quantitative science.