Qwen3.5-Omni
One model pre-trained on every modality
Paper: Qwen3.5-Omni Technical Report — Qwen Team, 2026
The last frontier report, Qwen3.5-Omni (Qwen Team, April 2026), takes the multimodal thread of Gemma 3 and Kimi K2.5 to its conclusion. Where those added vision to a language model, Qwen3.5-Omni is omni-modal: a single model natively pre-trained on text, images, audio, and video together. It’s a fitting place to end, because it shows the next-token objective absorbing the entire sensory world.
Pre-training on every modality at once
Qwen3.5-Omni scales to hundreds of billions of parameters with a 256K-token context window, and the pre-training claim is the striking part: it is natively pre-trained omni-modally on massive text and visual data plus over 100 million hours of audio-visual content. Audio and video aren’t adapters stapled onto a finished text model; they’re in the pre-training mixture from the start, exactly the native multimodal philosophy Kimi K2.5 argued for, now extended across all modalities.
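To make "in the mixture from the start" concrete, here is a minimal sketch of a data schedule that draws every batch's modality from one fixed distribution for the whole run. The weights and the `sample_modalities` helper are hypothetical, chosen only for illustration; the report's actual ratios and data pipeline are not described here.

```python
import random
from collections import Counter

# Hypothetical mixture weights for illustration only; not taken from the report.
MIXTURE = {"text": 0.5, "image": 0.2, "audio": 0.15, "video": 0.15}

def sample_modalities(num_batches, mixture=MIXTURE, seed=0):
    """Draw the modality of each training batch from one fixed mixture."""
    rng = random.Random(seed)
    names = list(mixture)
    weights = [mixture[name] for name in names]
    return rng.choices(names, weights=weights, k=num_batches)

schedule = sample_modalities(10_000)

# Because the weights never change, the first and second halves of the run
# see roughly the same blend: no modality is held back for a late stage.
early = Counter(schedule[:5_000])
late = Counter(schedule[5_000:])
print(early)
print(late)
```

The contrast is with an adapter-style recipe, where the schedule would be all text for most of training and other modalities would appear only near the end.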
The efficiency recipe, inherited
Even at this scope, the now-standard frontier toolkit carries over. Qwen3.5-Omni uses a Hybrid Attention Mixture-of-Experts design (for both halves of its Thinker–Talker architecture) to keep long-sequence, multimodal inference affordable: the same marriage of sparse experts and compressed attention we saw in DeepSeek-V4, now applied to streams of audio and video as well as text. The levers don’t change with the modality; only the data does.
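The sparse-expert half of that recipe can be sketched in a few lines. This is a generic top-k router in plain Python, not Qwen3.5-Omni's actual routing scheme, which is not detailed here; the function names, the toy experts, and the choice of k = 2 are all assumptions made for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k=2):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # renormalise over the chosen experts
    return [(i, probs[i] / norm) for i in top]

def moe_layer(x, router_logits, experts, k=2):
    """Mix the outputs of the selected experts; the others are never called."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_logits, k))

# Eight toy "experts" (hypothetical): each just scales its input.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
logits = [0.1, 2.0, -1.0, 0.3, 1.5, -0.5, 0.0, 0.2]

selected = route_top_k(logits, k=2)
output = moe_layer(3.0, logits, experts, k=2)
print(selected, output)
```

Only two of the eight experts run for this token, which is why total parameter count can grow far faster than per-token compute. Nothing in the routing step depends on whether the token came from text, audio, or video.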
That completes our tour of the papers, from a 65-million-parameter translation model in 2017 to omni-modal, million-token, trillion-parameter systems in 2026. The final chapter steps back to trace the through-line — what changed, what didn’t, and why.