Section 20

Gemma 2

Distillation as a pre-training objective

Paper: Gemma 2: Improving Open Language Models at a Practical Size — Gemma Team, 2024

Gemma 2 (Gemma Team, Google DeepMind, 2024) is a family of small open models (2B, 9B, 27B) whose most interesting pre-training idea isn’t about data or scale; it’s about what the model trains against. For the two smaller models (2B and 9B), Gemma 2 replaces the one-hot next-token target with the soft predictions of a much larger teacher, while the 27B model is still trained from scratch on the usual next-token objective. That’s knowledge distillation, used as a pre-training objective.

Distillation: a richer target than one-hot

Recall the cross-entropy objective: the target is a one-hot vector — all probability on the single correct next token, zero on everything else. Knowledge distillation changes the target. A large, already-trained teacher model produces a full probability distribution over the next token, and the student is trained to match that distribution instead of (or alongside) the hard label.
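To make the difference between the two targets concrete, here is a minimal sketch in plain Python. It assumes the distillation loss is the cross-entropy between the teacher's and the student's next-token distributions, $-\sum_x P_T(x \mid x_c)\,\log P_S(x \mid x_c)$, where $P_T$, $P_S$ and the context $x_c$ are notation introduced here for illustration. The 4-token vocabulary and all logit values are made up.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, student_logits):
    """H(target, student) = -sum_x target[x] * log student[x]."""
    student = softmax(student_logits)
    # Terms with zero target probability contribute nothing to the sum.
    return -sum(t * math.log(s) for t, s in zip(target, student) if t > 0.0)

# Toy 4-token vocabulary; all numbers are invented for illustration.
student_logits = [2.0, 1.0, 0.5, -1.0]
teacher_logits = [3.0, 2.5, 0.0, -2.0]

one_hot = [1.0, 0.0, 0.0, 0.0]           # hard label: only token 0 counts
teacher_probs = softmax(teacher_logits)  # soft target: every token gets some mass

hard_loss = cross_entropy(one_hot, student_logits)
distill_loss = cross_entropy(teacher_probs, student_logits)

print(f"hard-label loss:   {hard_loss:.4f}")
print(f"distillation loss: {distill_loss:.4f}")
```

With the one-hot target, only the student's probability for token 0 enters the loss. With the teacher's distribution as the target, every vocabulary entry contributes, so the student also receives a signal about which wrong tokens are plausible and which are not.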

This is a genuinely different answer to the data-scarcity problem than Qwen’s “get more tokens”: instead of more data, get richer targets from a model that already learned from lots of data.

Two architecture efficiency tricks

Gemma 2 also brings two changes worth adding to our running list of modern techniques:

Interleaved local/global attention. Rather than every layer attending over the full sequence, Gemma 2 alternates sliding-window (local) attention layers with global attention layers, one after the other. Local layers attend only to a fixed window of nearby tokens (4096 in Gemma 2, against an 8192-token global span), which is much cheaper and shrinks the KV cache, while the interleaved global layers preserve long-range information. This local/global interleaving becomes a defining feature of the Gemma line.
Logit soft-capping. The model bounds its final logits and its attention scores with a scaled $\tanh$, $\text{logits} \leftarrow c \cdot \tanh(\text{logits} / c)$, so their magnitude can never exceed the cap $c$ (30 for the final logits and 50 for the attention scores in Gemma 2). This improves numerical stability during training, a small regularizing touch in the same spirit as gradient clipping.
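Both tricks are easy to sketch in a few lines of plain Python. The snippet below is an illustration under stated assumptions, not Gemma 2's implementation: the helper names `soft_cap`, `causal_mask` and `layer_windows` are hypothetical, the cap and window values are toy numbers, and the choice of which layer in each pair is local is an assumption.

```python
import math

def soft_cap(x, cap):
    """Squash x smoothly so that its magnitude never exceeds cap."""
    return cap * math.tanh(x / cap)

def causal_mask(seq_len, window=None):
    """mask[i][j] is True when query position i may attend to key position j.

    window=None gives a global causal mask. An integer window restricts each
    query to itself and the previous (window - 1) tokens.
    """
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            visible = j <= i                      # causal: no looking ahead
            if window is not None:
                visible = visible and (i - j) < window
            row.append(visible)
        mask.append(row)
    return mask

def layer_windows(num_layers, window):
    """Assumed schedule: even-indexed layers local, odd-indexed layers global."""
    return [window if layer % 2 == 0 else None for layer in range(num_layers)]

CAP = 30.0  # illustrative cap value
capped = [soft_cap(x, CAP) for x in (-1000.0, -1.0, 0.0, 1.0, 1000.0)]

local_mask = causal_mask(6, window=3)   # each query sees at most 3 keys
global_mask = causal_mask(6)            # each query sees every earlier key
schedule = layer_windows(4, window=3)   # [3, None, 3, None]

print(capped)
print(sum(local_mask[5]), "keys visible locally;", sum(global_mask[5]), "globally")
```

Small inputs pass through `soft_cap` almost unchanged, while very large ones saturate at the cap. In the masks, the last query position sees 3 keys in a local layer and all 6 in a global layer, which is why a local layer only needs to keep a window's worth of keys and values in the cache.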

Both sit alongside the now-standard GQA and RMSNorm.