Section 27

Qwen3.5-Omni

One model pre-trained on every modality

Paper: Qwen3.5-Omni Technical Report — Qwen Team, 2026

The last frontier report, Qwen3.5-Omni (Qwen Team, April 2026), takes the multimodal thread of Gemma 3 and Kimi K2.5 to its conclusion. Where those added vision to a language model, Qwen3.5-Omni is omni-modal: a single model natively pre-trained on text, images, audio, and video together. It’s a fitting place to end, because it shows the next-token objective absorbing the entire sensory world.

Pre-training on every modality at once

Qwen3.5-Omni scales to hundreds of billions of parameters with a 256K-token context window, and the pre-training claim is the striking part: the model is natively pre-trained omni-modally on massive text and visual data plus over 100 million hours of audio-visual content. Audio and video aren’t adapters stapled onto a finished text model; they’re in the pre-training mixture from the start. This is exactly the native multimodal philosophy Kimi K2.5 argued for, now extended across all modalities.

How every modality becomes 'just tokens'

The deep idea threading the last several chapters is modality-agnostic pre-training. An image becomes tokens via a vision encoder; audio becomes tokens via an audio encoder; video is frames-plus-audio. Once each modality is a sequence of embeddings in a shared space, the transformer’s next-token objective doesn’t need to change: it just predicts the next token in a stream that might be words, image patches, or audio frames. The architecture and objective stay fixed; the encoders and the data do the work of opening new senses. That is why “predict the next token” has proven so absurdly general.
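
To make the "shared space" idea concrete, here is a minimal Python sketch. Nothing in it comes from the Qwen3.5-Omni report: the encoders (encode_text, encode_image, encode_audio), the helper _project, and the width D_MODEL are hypothetical stand-ins, with fixed random linear maps playing the role of learned encoders. The only point is that once every modality is a list of vectors of the same width, one stream and one set of (context, target) pairs covers all of them. Real systems also differ in which positions they actually train on as targets.

```python
import random

D_MODEL = 8  # shared embedding width (illustrative value, not from the report)


def _project(features, seed):
    """Hypothetical stand-in for a learned encoder: a fixed random linear map
    from a feature vector of any length to a D_MODEL-wide embedding."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-1, 1) for _ in features] for _ in range(D_MODEL)]
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def encode_text(token_ids):
    # One embedding per token id (toy feature: the id itself).
    return [_project([float(t)], seed=1) for t in token_ids]


def encode_image(patches):
    # One embedding per flattened image patch.
    return [_project(p, seed=2) for p in patches]


def encode_audio(frames):
    # One embedding per audio frame.
    return [_project(f, seed=3) for f in frames]


def build_stream(segments):
    """Concatenate (modality, embeddings) segments into a single sequence."""
    stream = []
    for modality, embeddings in segments:
        stream.extend((modality, e) for e in embeddings)
    return stream


def next_token_pairs(stream):
    """Build (context, target) pairs: the same recipe at every position,
    regardless of which modality the target came from."""
    return [(stream[:i], stream[i]) for i in range(1, len(stream))]


segments = [
    ("text", encode_text([5, 17, 3])),
    ("image", encode_image([[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]])),
    ("audio", encode_audio([[0.0, 1.0], [1.0, 0.0]])),
]
stream = build_stream(segments)
pairs = next_token_pairs(stream)
```

Note that build_stream and next_token_pairs never branch on the modality label; it is carried along only for inspection. Adding a new sense in this sketch means writing one more encoder, not touching the sequence logic.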

The efficiency recipe, inherited

Even at this scope, the now-standard frontier toolkit carries over. Qwen3.5-Omni uses a Hybrid Attention Mixture-of-Experts design (for both halves of its Thinker–Talker architecture) to keep long-sequence, multi-modal inference affordable — the same marriage of sparse experts and compressed attention we saw in DeepSeek-V4, now applied to streams of audio and video as well as text. The levers don’t change with the modality; only the data does.
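
The "sparse experts" half of that recipe can be sketched generically. The Python below is a textbook top-k mixture-of-experts router, not Qwen3.5-Omni's actual layer: the function names, the toy experts, the router logits, and k=2 are all assumptions chosen for illustration, and the hybrid-attention half is not shown. It illustrates why the cost stays manageable: each token runs through only k of the experts, whatever modality the token came from.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and renormalize their weights
    so they sum to 1. Returns a list of (expert_index, weight)."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def moe_layer(token, experts, router_logits, k=2):
    """Run only the selected experts and mix their outputs by router weight."""
    routes = top_k_route(router_logits, k)
    out = [0.0] * len(token)
    for idx, weight in routes:
        expert_out = experts[idx](token)
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out, routes


# Toy experts: each one just scales the token by a different factor.
experts = [lambda t, s=s: [s * x for x in t] for s in (1.0, 2.0, 3.0, 4.0)]

token = [1.0, -1.0]
router_logits = [0.0, 2.0, 0.0, 2.0]  # hypothetical router scores for this token
output, routes = moe_layer(token, experts, router_logits, k=2)
```

With these toy scores, experts 1 and 3 are selected with equal weight and the other two are never evaluated. In a real model the router logits come from a learned projection of the token, and the experts are feed-forward networks rather than scalings.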

That completes our tour of the papers, from a 65-million-parameter translation model in 2017 to omni-modal, million-token, trillion-parameter systems in 2026. The final chapter steps back to trace the through-line — what changed, what didn’t, and why.