Section 22

Gemma 3

Long context and refined distillation

Paper: Gemma 3 Technical Report — Gemma Team, 2025

Gemma 3 (Gemma Team, Google DeepMind, 2025) extends the Gemma line, now at 1B, 4B, 12B, and 27B parameters, along three axes that are increasingly the frontier of pre-training: multimodality, long context, and multilinguality. It does so while keeping Gemma 2’s distillation and refining its attention design. Following our “what’s new” rule, we focus on the multimodal and long-context pieces.

Pre-training with images: multimodality

Gemma 3 is multimodal: most models in the family can take images as well as text. The mechanism is a vision encoder, a tailored version of SigLIP, that turns an image into a sequence of embedding vectors. The language model then attends to those vectors as if they were tokens. To keep the cost down, each image is condensed into a fixed budget of 256 “soft tokens.”
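As a rough illustration of how a fixed token budget can be enforced, the sketch below average-pools a square grid of patch embeddings down to 256 vectors. The 64×64 input grid, the 4-dimensional toy embeddings, and the helper name pool_patches are assumptions made for this example. The text above states only the 256-token budget, not how the encoder reaches it.

```python
def pool_patches(grid, out_side=16):
    """Average-pool a square grid of patch embeddings.

    grid is a list of rows, each row a list of embedding vectors.
    Returns out_side * out_side pooled vectors in row-major order.
    """
    in_side = len(grid)
    assert in_side % out_side == 0, "grid side must be a multiple of out_side"
    k = in_side // out_side          # side length of each pooling window
    dim = len(grid[0][0])            # embedding width
    pooled = []
    for r in range(out_side):
        for c in range(out_side):
            acc = [0.0] * dim
            # Sum every embedding inside the k x k window, then divide.
            for dr in range(k):
                for dc in range(k):
                    vec = grid[r * k + dr][c * k + dc]
                    for i in range(dim):
                        acc[i] += vec[i]
            pooled.append([x / (k * k) for x in acc])
    return pooled


# Hypothetical input: 64 x 64 patches, each a toy 4-dimensional embedding.
grid = [[[float(r), float(c), 1.0, 0.0] for c in range(64)] for r in range(64)]
soft_tokens = pool_patches(grid)
print(len(soft_tokens))  # 256
```

Whatever the image content, the language model sees the same number of vectors per image, so the cost of an image in the context window is known in advance.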

Long context, paid for with attention design

Gemma 3 supports a context window of at least 128K tokens. The obstacle, as always, is the KV cache: at 128K tokens, the memory for keys and values balloons (recall the KV-footprint widget). Gemma 3’s answer builds directly on Gemma 2’s local/global idea: it increases the ratio of local to global attention layers to roughly five local (sliding-window) layers for each global one, and it keeps the local window short.

Because only the sparse global layers attend across the full 128K, the KV-cache cost of long context drops sharply while long-range information still flows through the periodic global layers. It’s a concrete example of co-designing the architecture around the memory budget rather than the other way around.
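To make the saving concrete, here is a back-of-the-envelope estimate of KV-cache size for a single sequence, comparing an all-global stack with a five-local-to-one-global stack. The layer count, number of KV heads, head dimension, 1,024-token local window, and 2-byte values are illustrative assumptions chosen for this sketch, and kv_cache_bytes is a hypothetical helper, not code from the report.

```python
def kv_cache_bytes(context_len, n_layers, n_kv_heads, head_dim,
                   local_per_global=0, window=1024, bytes_per_value=2):
    """Estimate KV-cache bytes for one sequence.

    local_per_global=0 means every layer uses full (global) attention.
    Otherwise each group of (local_per_global + 1) layers ends with one
    global layer, and the rest cache at most `window` tokens.
    """
    # Keys and values, per token, per layer.
    per_token_per_layer = 2 * n_kv_heads * head_dim * bytes_per_value
    total = 0
    for layer in range(n_layers):
        is_global = (layer + 1) % (local_per_global + 1) == 0
        cached_tokens = context_len if is_global else min(context_len, window)
        total += cached_tokens * per_token_per_layer
    return total


# Hypothetical model shape, chosen only to make the arithmetic easy to follow.
shape = dict(context_len=128 * 1024, n_layers=48, n_kv_heads=8, head_dim=128)

full = kv_cache_bytes(**shape)                        # all layers global
mixed = kv_cache_bytes(**shape, local_per_global=5)   # 5 local : 1 global

gib = 1024 ** 3
print(f"all global: {full / gib:.2f} GiB")   # 24.00 GiB
print(f"5:1 mix:    {mixed / gib:.2f} GiB")  # 4.16 GiB
```

Under these assumptions the global layers account for almost all of the remaining cache, so the total shrinks by close to the fraction of layers that were made local.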

Closing the modern era

Step back and the modern open-model era tells one story along several axes:

DeepSeek-V3 — architectural and numerical efficiency (MoE, MLA, MTP, FP8).
Llama 3 / Qwen2.5 — data at scale, scaling-law-driven mixes, deliberate over-training.
Gemma 2 / 3 — richer training targets (distillation), attention designs for long context, and new modalities.

None of these replaced the next-token objective or the transformer; they made each FLOP, each byte, and each token count for more. The next group pushes all of it further still — into the models of 2026. But each of those reports must be read from its primary source before we write about it, so we’ll proceed one at a time.