Section 13

T5

Text-to-text and span corruption

Paper: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (T5) — Raffel et al., 2020

T5 — Raffel et al.’s 2020 Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer — is the last of our “paradigm” papers, and a different kind of contribution. Where GPT and BERT each made one bold bet, T5 ran the systematic study: it held everything fixed and ablated the objective, the architecture, the data, and the scale, one factor at a time. Much of what the field “knows” about pre-training design choices traces back to this paper’s experiments.

Everything is text-to-text

T5’s unifying idea is to cast every task (translation, summarization, classification, even regression) as text-to-text: feed the model input text, ask it to produce output text. Classification becomes “generate the class name”; translation becomes “generate the translation.” One model, one loss (cross-entropy), one format for everything. This is mostly a fine-tuning/evaluation convenience, but it matters to pre-training because it let T5 compare wildly different tasks on an equal footing, and it foreshadows how today’s models treat all problems as next-token generation.
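To make the casting concrete, here is a minimal sketch of how examples from different tasks could be turned into (input text, target text) pairs. The task prefixes follow the style of the examples in the T5 paper; the function name, the dictionary field names, and the exact label strings are illustrative assumptions, not the paper’s code.

```python
from typing import Dict, Tuple


def to_text_pair(task: str, example: Dict[str, object]) -> Tuple[str, str]:
    """Cast one example into an (input_text, target_text) pair.

    Field names such as 'en', 'de', 'document' are hypothetical.
    """
    if task == "translate_en_de":
        return f"translate English to German: {example['en']}", str(example["de"])
    if task == "summarize":
        return f"summarize: {example['document']}", str(example["summary"])
    if task == "cola":
        # Classification: the target is the class name as literal text.
        label = "acceptable" if example["label"] == 1 else "not acceptable"
        return f"cola sentence: {example['sentence']}", label
    if task == "stsb":
        # Regression: round the score to the nearest 0.2 and emit it as a string.
        score = round(float(example["score"]) * 5) / 5
        text = f"stsb sentence1: {example['s1']} sentence2: {example['s2']}"
        return text, f"{score:.1f}"
    raise ValueError(f"unknown task: {task}")


pair = to_text_pair("translate_en_de", {"en": "That is good.", "de": "Das ist gut."})
```

Every task ends up as a pair of strings, so the same model and the same cross-entropy loss apply to all of them.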

The span-corruption objective

T5’s pre-training objective generalizes BERT’s masked language modeling into a sequence-to-sequence denoising task called span corruption. Instead of masking individual tokens, it masks whole contiguous spans:

Randomly select ~15% of tokens; group consecutive selected tokens into spans (mean span length 3).
Replace each span in the input with a single unique sentinel token (<X>, <Y>, …).
The model’s target is just the dropped spans, each prefixed by its sentinel.

So “the quick brown fox jumps over” might become input “the quick <X> jumps over” with target “<X> brown fox <Y>”, where the trailing <Y> is a final sentinel that closes the target. Because one sentinel replaces a whole span, the corrupted input is shorter than the original and the target contains only the dropped tokens, which makes training cheaper than predicting every position.
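The transformation above can be sketched in a few lines. This is a minimal illustration that works on whitespace tokens and takes the spans as given; it does not reproduce how T5 samples span positions and lengths (an independent 15% mask alone would not yield a mean span length of 3). The function names and the <X>, <Y>, <Z> sentinel naming are assumptions chosen to match the example in the text.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # half-open [start, end) token indices


def sentinel(i: int) -> str:
    # Illustrative sentinel names: <X>, <Y>, <Z>, then <S3>, <S4>, ...
    names = "XYZ"
    return f"<{names[i]}>" if i < len(names) else f"<S{i}>"


def mask_to_spans(mask: List[bool]) -> List[Span]:
    """Group consecutive selected positions into spans."""
    spans, start = [], None
    for i, selected in enumerate(mask):
        if selected and start is None:
            start = i
        elif not selected and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(mask)))
    return spans


def span_corrupt(tokens: List[str], spans: List[Span]) -> Tuple[List[str], List[str]]:
    """Replace each span with one sentinel; the target lists the dropped spans."""
    inputs, targets, cursor = [], [], 0
    for i, (start, end) in enumerate(sorted(spans)):
        inputs.extend(tokens[cursor:start])   # untouched text before the span
        inputs.append(sentinel(i))            # one sentinel stands in for the span
        targets.append(sentinel(i))           # the same sentinel prefixes the span
        targets.extend(tokens[start:end])
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(sentinel(len(spans)))      # final sentinel closes the target
    return inputs, targets


tokens = "the quick brown fox jumps over".split()
spans = mask_to_spans([False, False, True, True, False, False])
corrupted, target = span_corrupt(tokens, spans)
```

With the mask selecting “brown fox”, `corrupted` and `target` join back to the input and target strings given in the example.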

C4: a dataset as a deliverable

To run experiments at scale, T5’s authors built and released C4, the Colossal Clean Crawled Corpus: about 750 GB of cleaned English text filtered from Common Crawl. The cleaning was aggressive and rule-based (keep only lines that end in terminal punctuation, drop boilerplate and very short pages, discard any page containing a word from an offensive-word blocklist, deduplicate repeated text). C4 became a standard pre-training dataset in its own right and a template for the heavy filtering pipelines that followed. Releasing the dataset, not just the model, was itself influential.
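A rule-based cleaner in this spirit might look like the sketch below. It is a toy illustration of the kinds of heuristics listed above, not the actual C4 pipeline: the function names, the thresholds, the blocklist contents, and the exact-match page deduplication are all assumptions made for the example.

```python
import re
from typing import Iterable, List, Optional, Set

TERMINAL_PUNCTUATION = (".", "!", "?", '"')


def clean_page(
    text: str,
    blocklist: Set[str],
    min_words_per_line: int = 5,  # illustrative threshold
    min_lines: int = 3,           # illustrative threshold
) -> Optional[str]:
    """Return the cleaned page, or None if the whole page should be dropped."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    if words & blocklist:
        return None  # page contains a blocklisted word
    kept = [
        line.strip()
        for line in text.splitlines()
        if line.strip().endswith(TERMINAL_PUNCTUATION)
        and len(line.split()) >= min_words_per_line
    ]
    if len(kept) < min_lines:
        return None  # too little real text survived
    return "\n".join(kept)


def dedupe(pages: Iterable[str]) -> List[str]:
    """Drop exact duplicate pages, keeping the first occurrence."""
    seen, unique = set(), []
    for page in pages:
        if page not in seen:
            seen.add(page)
            unique.append(page)
    return unique


page = (
    "Home | About | Contact\n"
    "The model reads text and writes text.\n"
    "It was trained on a very large corpus.\n"
    "Click here\n"
    "Results improved steadily as the model grew."
)
cleaned = clean_page(page, blocklist={"badword"})
```

Here the navigation bar and the “Click here” line are dropped because they lack terminal punctuation, while the three full sentences survive.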

The architecture, and a note on what won

T5 used a standard encoder-decoder transformer, and its ablations found that, for their text-to-text setup, encoder-decoder beat the decoder-only language-model and prefix-LM variants it was compared against. Scale ran from a 220M-parameter baseline up to an 11-billion-parameter model. The baseline was pre-trained for $2^{19}$ (≈524k) steps on C4 with the memory-efficient Adafactor optimizer, an inverse-square-root learning-rate schedule, and SentencePiece tokenization (32k vocabulary).
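For a feel of the numbers, here is a small sketch of an inverse-square-root schedule evaluated over $2^{19}$ steps. The form 1/sqrt(max(step, warmup)) and the 10,000-step warm-up are taken from the T5 paper’s description rather than from this section, and the function name is an assumption.

```python
import math

TOTAL_STEPS = 2 ** 19  # 524,288 pre-training steps


def inverse_sqrt_lr(step: int, warmup_steps: int = 10_000) -> float:
    """Constant during warm-up, then decays as 1 / sqrt(step)."""
    return 1.0 / math.sqrt(max(step, warmup_steps))


lr_start = inverse_sqrt_lr(0)           # held at 0.01 during warm-up
lr_end = inverse_sqrt_lr(TOTAL_STEPS)   # roughly 0.0014 at the last step
```

The learning rate stays flat at 0.01 for the first 10,000 steps and has fallen by roughly a factor of seven by step $2^{19}$.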

That closes the paradigm. We have the architecture (transformer), the generative recipe (GPT), the alternative objectives (BERT, T5), and the seed of an idea that scaling just works. The next group turns that seed into a quantitative science.