Section 04

Instruction tuning is born

FLAN, T0, and zero-shot generalization

Papers: FLAN: Finetuned Language Models Are Zero-Shot Learners (Wei et al., 2021) · T0: Multitask Prompted Training Enables Zero-Shot Task Generalization (Sanh et al., 2021) · Super-NaturalInstructions (Wang et al., 2022) · Scaling Instruction-Finetuned Language Models (Flan 2022) (Chung et al., 2022)

A pre-trained model has read a large fraction of the internet, but ask it “Translate bonjour to English” and it might cheerfully reply with another French phrase — because in its training data, a line of French is most often followed by more French, not by someone obeying you. It has the knowledge; it just doesn’t have the habit of treating your text as a command. The first big idea of post-training is a remarkably blunt fix for this: if you want a model that follows instructions, train it on a pile of instructions being followed.

Pretrain-then-finetune, generalized

You have already seen the seed of this idea. The GPT-1 recipe was to pre-train a language model on raw text, then fine-tune it on a labeled task. That worked, but you got one fine-tuned model per task: a sentiment classifier here, a question-answerer there. Each adaptation was a dead end that only knew its one job.

Instruction tuning takes that same fine-tuning machinery and asks a more ambitious question: what if, instead of fine-tuning on one task, we fine-tune on many tasks, each phrased as a natural-language instruction, and then test on tasks the model was never trained on? The technique itself is just supervised fine-tuning — ordinary next-token training on (instruction, response) pairs, nothing exotic. What changes is the framing of the data and, as it turned out, the entire character of the resulting model.
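To make "ordinary next-token training on (instruction, response) pairs" concrete, here is a minimal sketch of how one pair might be turned into training targets. It assumes a decoder-only setup, a toy whitespace tokenizer, and a loss mask that trains only on the response tokens. All three are illustrative choices, not details taken from the papers, and the function names are hypothetical.

```python
from typing import List, Tuple


def build_sft_example(instruction: str, response: str) -> Tuple[List[str], List[int]]:
    """Concatenate instruction and response and mark which tokens are trained on."""
    # Toy whitespace "tokenizer" (assumption: a real setup uses the model's tokenizer).
    prompt_tokens = instruction.split()
    response_tokens = response.split() + ["<eos>"]
    tokens = prompt_tokens + response_tokens
    # 0 = ignored by the loss (instruction), 1 = contributes to the loss (response).
    loss_mask = [0] * len(prompt_tokens) + [1] * len(response_tokens)
    return tokens, loss_mask


def training_pairs(tokens: List[str], loss_mask: List[int]) -> List[Tuple[List[str], str]]:
    """List the (context, next token) predictions that the loss would score."""
    pairs = []
    for i in range(1, len(tokens)):
        if loss_mask[i] == 1:
            pairs.append((tokens[:i], tokens[i]))
    return pairs


tokens, loss_mask = build_sft_example("Translate bonjour to English", "hello")
for context, target in training_pairs(tokens, loss_mask):
    print(" ".join(context), "->", target)
```

The model is only ever asked to predict the next token. What makes this instruction tuning is that the context is an instruction and the scored tokens are a response that obeys it.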

FLAN: instruction tuning unlocks zero-shot

Google’s FLAN (Wei et al., 2021) was the paper that made the case. The authors took a 137B-parameter pre-trained model and instruction-tuned it on 60+ NLP datasets — translation, summarization, natural-language inference, sentiment, and more — but with a crucial twist: each dataset was rewritten into several natural-language instruction templates. Instead of feeding the model a raw premise–hypothesis pair, they wrote things like “Does the premise entail the hypothesis? Premise: … Hypothesis: …”
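The template idea can be sketched in a few lines: one labeled example is rendered through several instruction wordings, giving several (instruction, response) pairs. The first template follows the wording quoted above. The other two, the yes/no answer mapping, and the example sentence pair are hypothetical illustrations, not templates from the paper.

```python
from typing import Dict, List

# First template mirrors the wording quoted in the text; the rest are made up.
NLI_TEMPLATES = [
    "Does the premise entail the hypothesis? Premise: {premise} Hypothesis: {hypothesis}",
    "Premise: {premise}\nHypothesis: {hypothesis}\nCan we conclude the hypothesis from the premise?",
    'Read this: {premise}\nIs it true that "{hypothesis}"?',
]

# Hypothetical mapping from a boolean label to a natural-language answer.
LABEL_TO_ANSWER = {True: "yes", False: "no"}


def render_example(example: dict, templates: List[str]) -> List[Dict[str, str]]:
    """Turn one labeled NLI example into one training pair per template."""
    answer = LABEL_TO_ANSWER[example["entails"]]
    pairs = []
    for template in templates:
        instruction = template.format(
            premise=example["premise"], hypothesis=example["hypothesis"]
        )
        pairs.append({"instruction": instruction, "response": answer})
    return pairs


example = {
    "premise": "A dog is sleeping on the couch.",
    "hypothesis": "An animal is resting.",
    "entails": True,
}
rendered = render_example(example, NLI_TEMPLATES)
for pair in rendered:
    print(pair["instruction"], "=>", pair["response"])
```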

Then came the test that mattered. They grouped the tasks into clusters, held an entire cluster out of training, and evaluated the model on it zero-shot: no examples, just the instruction. FLAN substantially beat the same model’s plain zero-shot performance, surpassed zero-shot GPT-3 (175B) on 20 of the 25 datasets evaluated, and on several of them beat even GPT-3’s few-shot results. The headline, captured in the title, was that finetuned language models are zero-shot learners: training to follow many instructions made the model follow instructions in general, including kinds it had never been trained on.
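The held-out-cluster protocol is easy to get subtly wrong, so a small sketch may help. The point is that the split happens at the level of whole clusters: no dataset from the evaluation cluster may appear in training. The cluster and dataset names below are hypothetical placeholders, not the paper's actual task list.

```python
from typing import Dict, List, Tuple

# Hypothetical clusters; each maps a task type to the datasets that belong to it.
TASK_CLUSTERS: Dict[str, List[str]] = {
    "translation": ["wmt_en_fr", "wmt_en_de"],
    "summarization": ["news_summ", "dialog_summ"],
    "nli": ["rte_like", "anli_like"],
    "sentiment": ["movie_reviews", "product_reviews"],
}


def split_by_cluster(
    clusters: Dict[str, List[str]], held_out: str
) -> Tuple[List[str], List[str]]:
    """Train on every cluster except `held_out`; evaluate zero-shot on `held_out`."""
    if held_out not in clusters:
        raise KeyError(f"unknown cluster: {held_out}")
    train = sorted(
        dataset
        for cluster, datasets in clusters.items()
        if cluster != held_out
        for dataset in datasets
    )
    evaluate = sorted(clusters[held_out])
    return train, evaluate


train_sets, eval_sets = split_by_cluster(TASK_CLUSTERS, held_out="nli")
print("train on:", train_sets)
print("zero-shot eval on:", eval_sets)
```

Splitting by cluster rather than by dataset is what makes the result meaningful: if one NLI dataset were in training and another in evaluation, the model would have seen the task type and the test would no longer measure generalization to unseen kinds of instruction.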

T0: smaller, open, and template-explicit

Almost simultaneously, T0 (Sanh et al., 2021) from BigScience showed the same effect at a fraction of the size, with an 11B encoder–decoder model, and did it in the open. T0’s lasting contribution was methodological: it leaned hard on explicit prompt templates, crowd-sourcing many differently-worded prompts per dataset so that a single task was seen through many surface phrasings. A model trained this way can’t latch onto one rigid format; it has to learn the task behind the wording. T0 often matched or beat models up to 16× its size on held-out tasks, and made the recipe reproducible for everyone.

Scaling the recipe: Super-NaturalInstructions and Flan 2022

Once the effect was established, the obvious move was to push every knob. Super-NaturalInstructions (Wang et al., 2022) assembled a benchmark of 1,600+ tasks, each with a declarative instruction, drowning the model in instruction variety and giving the field a hard test of cross-task generalization.

Then Flan 2022 / Flan-T5 (Chung et al., 2022) ran the systematic scaling study. It pulled three levers at once and found they compound:

More tasks — combining the FLAN, T0, and Super-NaturalInstructions collections into 1,800+ tasks.
More model — scaling up to 540B parameters (Flan-PaLM).
More reasoning data — adding chain-of-thought examples, where the instruction’s answer shows its work step by step.
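Combining collections of very different sizes raises a practical question: how often should each source be sampled, so that a huge dataset does not swamp a small one such as the chain-of-thought examples? The text above does not say how Flan 2022 weighted its mixture. The sketch below shows one common scheme from the T5 line of work, examples-proportional mixing with a cap, purely as an illustration. The dataset names, sizes, and cap value are hypothetical.

```python
from typing import Dict


def mixing_rates(sizes: Dict[str, int], cap: int) -> Dict[str, float]:
    """Sample each source in proportion to min(size, cap), normalized to sum to 1."""
    capped = {name: min(n, cap) for name, n in sizes.items()}
    total = sum(capped.values())
    return {name: count / total for name, count in capped.items()}


# Hypothetical source sizes (number of training examples per source).
sizes = {"translation": 1_000_000, "nli": 2_500, "cot_math": 500}
rates = mixing_rates(sizes, cap=3_000)
for name, rate in rates.items():
    print(f"{name}: {rate:.3f}")
```

Without the cap, the translation source would account for more than 99% of samples. With it, the small sources keep a visible share of each training batch.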

That last ingredient is the sleeper. Mixing in CoT data didn’t just help on reasoning tasks; it improved instruction-following broadly and let the model produce step-by-step reasoning zero-shot when asked. Flan-T5 became, for years, the default open instruction-tuned model: a family of small checkpoints (80M to 11B parameters) that punched far above their weight precisely because they had been bathed in instruction diversity.
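The difference between an ordinary training pair and a chain-of-thought pair is only in the text of the target. A minimal sketch follows, assuming a hypothetical trigger phrase and answer format; the exact wordings used in the Flan collection are not given in this section.

```python
from typing import Dict, List


def direct_example(question: str, answer: str) -> Dict[str, str]:
    """Ordinary pair: the response is just the final answer."""
    return {"instruction": question, "response": answer}


def cot_example(question: str, steps: List[str], answer: str) -> Dict[str, str]:
    """Chain-of-thought pair: the response shows its work, then states the answer."""
    # The trigger phrase and "So the answer is" format are illustrative assumptions.
    instruction = f"{question} Explain your reasoning step by step."
    response = " ".join(steps) + f" So the answer is {answer}."
    return {"instruction": instruction, "response": response}


question = "A shelf holds 3 boxes with 4 books each. How many books are on the shelf?"
steps = ["There are 3 boxes.", "Each box has 4 books.", "3 times 4 is 12."]

direct = direct_example(question, "12")
cot = cot_example(question, steps, "12")
print(direct["response"])
print(cot["response"])
```

Both pairs go through the same supervised fine-tuning. The second simply teaches the model that, when asked, the response should include the intermediate steps.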