Gemma 3
Long context and refined distillation
Paper: Gemma 3 Technical Report — Gemma Team, 2025
Gemma 3 (Gemma Team, Google DeepMind, 2025) extends the Gemma line, now at 1B, 4B, 12B, and 27B parameters, along three axes that are increasingly the frontier of pre-training: multimodality, long context, and multilinguality. It does so while keeping Gemma 2’s knowledge distillation and refining its attention design. Following our “what’s new” rule, we focus on the multimodal and long-context pieces.
Pre-training with images: multimodality
Gemma 3 is multimodal: most models in the family can take images as well as text. The mechanism is a vision encoder, a tailored version of SigLIP, that turns an image into a sequence of embedding vectors. The language model then attends to those vectors as if they were tokens. To keep the cost down, each image is condensed into a fixed budget of 256 “soft tokens.”
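To make the fixed token budget concrete, here is a minimal sketch of one way a grid of patch embeddings could be condensed into 256 soft tokens. The text above states only the 256-token budget. The 64×64 input grid, the choice of average pooling, and the helper name `pool_patches` are assumptions made for illustration, not a description of the actual Gemma 3 implementation.

```python
def pool_patches(grid, out_side):
    """Average-pool a square grid of patch embeddings to out_side x out_side.

    grid[r][c] is one patch embedding (a list of floats). Returns a flat
    list of out_side * out_side pooled vectors in row-major order.
    Hypothetical helper: the pooling scheme is an assumption.
    """
    side = len(grid)
    if side % out_side != 0:
        raise ValueError("grid side must be a multiple of out_side")
    k = side // out_side              # each output token covers a k x k block
    dim = len(grid[0][0])
    tokens = []
    for r in range(out_side):
        for c in range(out_side):
            acc = [0.0] * dim
            for dr in range(k):
                for dc in range(k):
                    vec = grid[r * k + dr][c * k + dc]
                    for i in range(dim):
                        acc[i] += vec[i]
            tokens.append([a / (k * k) for a in acc])
    return tokens


# Toy input: a 64 x 64 grid (assumed size) of 2-dimensional "embeddings"
# whose values are simply the patch's row and column index.
toy_grid = [[[float(r), float(c)] for c in range(64)] for r in range(64)]
soft_tokens = pool_patches(toy_grid, out_side=16)
print(len(soft_tokens))  # 256
```

Whatever the encoder's internal resolution, the language model sees the same number of image positions, so the sequence-length cost of an image is known in advance.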
Long context, paid for with attention design
Gemma 3 supports at least a 128K-token context window. The obstacle, as always, is the KV cache: at 128K tokens, the memory for keys and values balloons (recall the KV-footprint widget). Gemma 3’s answer builds directly on Gemma 2’s local/global idea: it increases the ratio of local to global attention layers, to roughly five local (sliding-window) layers for each global one, and keeps the local window short.
Because only the sparse global layers attend across the full 128K, the KV-cache cost of long context drops sharply while long-range information still flows through the periodic global layers. It’s a concrete example of co-designing the architecture around the memory budget rather than the other way around.
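A back-of-the-envelope calculation shows why the interleaving matters. The sketch below counts how many token positions must be cached across all layers when local layers keep only their window and global layers keep the full context. Every number in it (48 layers, a 1,024-token window, 8 KV heads of dimension 128, 2 bytes per value) and the position of the global layer within each group of six are assumptions chosen for illustration, not figures taken from the text above.

```python
def kv_cache_slots(num_layers, context_len, local_per_global, window):
    """Total cached token positions, summed over layers.

    Layers repeat in groups of `local_per_global` local layers followed by
    one global layer (the ordering is an assumption). A local layer caches
    at most `window` positions; a global layer caches the whole context.
    """
    period = local_per_global + 1
    total = 0
    for layer in range(num_layers):
        is_global = layer % period == period - 1
        total += context_len if is_global else min(window, context_len)
    return total


def kv_cache_bytes(slots, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes for keys and values: two tensors per cached position."""
    return 2 * slots * num_kv_heads * head_dim * bytes_per_value


# Hypothetical configuration, for illustration only.
LAYERS, CONTEXT, WINDOW = 48, 128 * 1024, 1024
KV_HEADS, HEAD_DIM = 8, 128

all_global = kv_cache_slots(LAYERS, CONTEXT, local_per_global=0, window=WINDOW)
interleaved = kv_cache_slots(LAYERS, CONTEXT, local_per_global=5, window=WINDOW)

print(all_global)   # 6291456 cached positions
print(interleaved)  # 1089536 cached positions
for name, slots in (("all global", all_global), ("5:1 local/global", interleaved)):
    gib = kv_cache_bytes(slots, KV_HEADS, HEAD_DIM) / 2**30
    print(f"{name}: {gib:.2f} GiB per sequence")
```

Under these assumed numbers the interleaved layout caches under a fifth of the positions that an all-global stack would. Almost all of what remains comes from the few global layers, which is why the ratio of local to global layers is the main lever.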
Closing the modern era
Step back and the modern open-model era tells one story along several axes:
- DeepSeek-V3 — architectural and numerical efficiency (MoE, MLA, MTP, FP8).
- Llama 3 / Qwen2.5 — data at scale, scaling-law-driven mixes, deliberate over-training.
- Gemma 2 / 3 — richer training targets (distillation), attention designs for long context, and new modalities.
None of these replaced the next-token objective or the transformer; they made each FLOP, each byte, and each token count for more. The next group pushes all of it further still — into the models of 2026. But each of those reports must be read from its primary source before we write about it, so we’ll proceed one at a time.