LLM & vLLM Inference, from the ground up

A long-form, interactive explainer

This is a walkthrough of how modern Large Language Models actually run: from the moment text becomes a list of tokens, through every matrix multiplication inside a transformer, to the clever memory tricks that production serving systems like vLLM use to keep an NVIDIA GPU saturated.

No prior machine-learning knowledge is required. Every term is defined the first time it shows up; you can hover over any underlined word for a quick tooltip, or jump to the glossary at any time.

There are interactive widgets throughout: a real tokenizer you can type into, an attention heatmap you can hover over, a KV cache that fills up as you step through decoding, a paged-attention allocator you can poke at, and a data-flow visualization of where bytes actually live on an H100. They're meant to be played with, not just looked at.

Reading time: roughly 75–90 minutes, 21 sections

Published June 14, 2026

Foundations

How serving actually works

vLLM internals

Scaling out

21 Recap — And further reading
