LLM & vLLM Inference, from the ground up

A long-form, interactive explainer

This is a walkthrough of how modern large language models (LLMs) actually run: from the moment text becomes a list of tokens, through every matrix multiplication inside a transformer, to the clever memory tricks that production serving systems like vLLM use to keep an NVIDIA GPU saturated.

No prior machine-learning knowledge required. Every term is defined the first time it shows up. You can hover over any underlined word for a quick tooltip, or jump to the glossary at any time.

There are interactive widgets throughout: a real tokenizer you can type into, an attention heatmap you can hover over, a KV cache that fills up as you step through decoding, a paged-attention allocator you can poke at, and a data-flow visualization of where bytes actually live on an H100. They're meant to be played with, not just looked at.

Reading time: ~75–90 minutes, 21 sections

Contents

Foundations

  1. 01 What is an LLM? — And what does "inference" mean?
  2. 02 Tokens — Text → numbers the model can see
  3. 03 Embeddings — Token IDs → vectors
  4. 04 Attention — Queries, keys, and values
  5. 05 Multi-head attention — Many attentions in parallel
  6. 06 Positional encoding — Telling the model where each token sits
  7. 07 The MLP block — Per-token nonlinear processing
  8. 08 A full transformer block — Putting it together
  9. 09 Stacking into a full model — From embeddings to logits
  10. 10 Sampling — Logits → the next token

How serving actually works

  1. 11 Prefill and decode — The two phases of inference
  2. 12 The KV cache — Why decode is cheap and memory is expensive
  3. 13 GPU memory hierarchy — Where data actually lives on H100s
  4. 14 Continuous batching — Stop wasting GPU steps

vLLM internals

  1. 15 Paged attention — KV cache as a page table
  2. 16 Prefix caching — Shared system prompts for free
  3. 17 Chunked prefill — Stop blocking decoders with one big prompt
  4. 18 Speculative decoding — Draft fast, verify in bulk

Scaling out

  1. 19 Scaling out — Tensor parallelism vs pipeline parallelism
  2. 20 Throughput vs latency — What knobs move what
  3. 21 Recap — And further reading