Section 01

What is an LLM?

And what does "inference" mean?

When you type something into ChatGPT, Claude, or any modern chat assistant and watch words appear one by one, something concrete is happening on a computer. A program is reading your text, turning it into numbers, doing an enormous amount of arithmetic on those numbers, and turning the result back into more text. That program is a Large Language Model (LLM), a particular kind of neural network, and the act of running it to produce output is called inference.

This essay is about how inference works. We will start from “what is a token” and end at “how does vLLM keep a $30,000 GPU saturated with hundreds of concurrent users.” Every concept builds on the previous one, every new term is defined the first time it appears, and there are interactive widgets along the way so you can poke at the ideas instead of just reading about them.

Training vs inference

A neural network is, at heart, a giant function. It takes numbers in, multiplies them by other numbers (called parameters, or weights), passes the results through some simple nonlinear functions, and produces numbers out. The interesting trick is that the parameters are learned from data. We start with random parameters, show the network billions of examples (“here is some text: predict what comes next”), and slowly nudge the parameters so it does better. That nudging process is called training.
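To make “a giant function” concrete, here is a minimal sketch in plain Python of a two-layer network. The names (relu, matvec, tiny_network, W1, W2) and every number in it are illustrative assumptions, not values from any real model; a real LLM does the same kind of arithmetic with billions of parameters.

```python
def relu(x):
    """A simple nonlinear function: negative values become zero."""
    return max(0.0, x)


def matvec(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


# Hypothetical parameters, chosen by hand for the example.
# In a real network these are learned during training.
W1 = [[0.5, -1.0],
      [1.5, 2.0]]
W2 = [[1.0, -0.5]]


def tiny_network(inputs):
    """Numbers in, multiply by weights, apply a nonlinearity, numbers out."""
    hidden = [relu(h) for h in matvec(W1, inputs)]
    return matvec(W2, hidden)


print(tiny_network([2.0, 1.0]))  # [-2.5]
```

Changing any entry of W1 or W2 changes the output, which is all that “learning the parameters” can ever adjust.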

Training is expensive: a frontier training run costs tens of millions of dollars and lasts for months on tens of thousands of GPUs. But you only have to do it once. Afterward, the parameters are frozen: they are just a big file of numbers (hundreds of gigabytes for a flagship model), and what you ship to users is the much cheaper act of using those frozen parameters to answer one prompt at a time. That is inference.
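The split between “nudge the parameters many times” and “use the frozen parameters” can be sketched with a one-parameter model. This is a toy under stated assumptions: the data, the learning rate, and the names (examples, predict, train, frozen_weight) are invented for illustration, and the update rule is plain gradient descent on squared error rather than anything specific to LLM training.

```python
# Toy data generated by the rule y = 3 * x; the model must discover the 3.
examples = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]


def predict(weight, x):
    """The whole 'model': a single multiplication."""
    return weight * x


def train(examples, steps=200, learning_rate=0.01):
    """Repeatedly nudge the weight to shrink the prediction error."""
    weight = 0.0  # arbitrary starting guess (real networks start random)
    for _ in range(steps):
        for x, target in examples:
            error = predict(weight, x) - target
            # The gradient of squared error with respect to weight is 2 * error * x.
            weight -= learning_rate * 2 * error * x
    return weight


frozen_weight = train(examples)       # the expensive part, done once
print(predict(frozen_weight, 10.0))   # the cheap part: inference, very close to 30
```

After train returns, nothing modifies frozen_weight again; every later call to predict is pure inference.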

A useful analogy

Imagine an enormous network of interconnected pipes with a valve at every junction. Training is the painstaking process of tuning those valves: pour water in one end, see where it comes out, adjust the valves a tiny bit, repeat billions of times until the flow paths produce the right answers. It’s slow and expensive, and the result is a single object: a network with all its valves locked into a specific configuration. Inference is what happens after the valves stop moving — you pour a prompt in the top, water rushes through the fixed network, and what spills out the bottom is the completion. The pipes never change again, but how cleverly you pump water through them determines whether the network serves one user or a million. This essay is almost entirely about inference: the flow.

What does an LLM actually compute?

An LLM is trained for one job: given some text, predict the next token. We will say much more about what a “token” is in the next section, but for now think of it as roughly “a word or word-fragment”. The model is shown a chunk of text and asked: what comes next?

That sounds almost embarrassingly simple. But once you have a really good next-token predictor, you can string it together: predict the next token, append it to the input, predict the next token after that, and so on. That loop — generate one token, feed it back in, generate another — is what we call autoregressive generation. It is also exactly how all the chat assistants you have seen work, including the one currently writing words across your screen.

The piece you type in is called the prompt. The text the model generates in response is the completion. Everything you see being typed out token by token in a chat UI is the model running through its autoregressive loop, doing one forward pass through hundreds of billions of arithmetic operations to produce each token.
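The autoregressive loop itself is short enough to sketch. In the example below the “model” is a hypothetical lookup table (NEXT_TOKEN) that only looks at the last token, and "<end>" is an invented stop marker; a real LLM considers the whole input and scores every token in its vocabulary. Only the shape of the loop is the point.

```python
# Hypothetical stand-in for a trained model: maps the last token to the next one.
NEXT_TOKEN = {
    "the": "cat",
    "cat": "sat",
    "sat": "on",
    "on": "a",
    "a": "mat",
    "mat": "<end>",
}


def predict_next_token(tokens):
    """Pretend forward pass: return the predicted next token."""
    return NEXT_TOKEN.get(tokens[-1], "<end>")


def generate(prompt_tokens, max_new_tokens=10):
    """Autoregressive generation: predict, append, repeat."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = predict_next_token(tokens)  # one forward pass per token
        if next_token == "<end>":
            break
        tokens.append(next_token)  # feed the output back in as input
    return tokens


print(generate(["the"]))  # ['the', 'cat', 'sat', 'on', 'a', 'mat']
```

Here ["the"] is the prompt and everything appended after it is the completion. Each new token costs one full call to the model, which is why per-token cost dominates the rest of the discussion.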

Why is this hard?

If next-token prediction is the whole game, you might wonder why the engineering is interesting at all. It turns out the difficulty splits into two very different problems:

The model is gigantic. A flagship open-weights model like Llama-3-70B has 70 billion parameters. At 16-bit precision that is 140 GB of weights, more than fits on a single 80 GB GPU such as the NVIDIA H100. Even the smaller 8B model is 16 GB. Every single token you generate requires reading every parameter at least once from GPU memory. The chip is fast at math but limited by how fast it can move bytes around, and that asymmetry shapes every decision in a serving system.
Many users want answers at the same time. If you only ever ran one prompt at a time, GPUs would sit nearly idle: a single forward pass touches every weight, but most of the GPU’s arithmetic units have nothing to do because there’s only one token’s worth of work. Real systems pack many requests together so that one read of the weights serves many users, while juggling the fact that those requests have wildly different prompt lengths, completion lengths, and arrival times.
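The arithmetic behind both problems fits in a few lines. The sketch below reproduces the weight sizes quoted above and then estimates an upper bound on generation speed from memory bandwidth alone. The 3 TB/s bandwidth is an assumed round number for illustration, not a measured figure for any particular GPU. The bound also ignores everything except weight reads (compute limits and other memory traffic), so real throughput is lower. The function names are hypothetical.

```python
GB = 10**9


def weight_bytes(num_parameters, bits_per_parameter=16):
    """Size of the frozen weights in bytes."""
    return num_parameters * bits_per_parameter // 8


def max_forward_passes_per_second(num_parameters, bandwidth_bytes_per_second,
                                  bits_per_parameter=16):
    """Upper bound if every forward pass must read every weight once."""
    return bandwidth_bytes_per_second / weight_bytes(num_parameters, bits_per_parameter)


print(weight_bytes(70 * 10**9) / GB)  # 140.0 (GB for a 70B model)
print(weight_bytes(8 * 10**9) / GB)   # 16.0 (GB for an 8B model)

ASSUMED_BANDWIDTH = 3_000 * GB  # assumption: roughly 3 TB/s of memory bandwidth

passes = max_forward_passes_per_second(8 * 10**9, ASSUMED_BANDWIDTH)
for batch_size in (1, 32):
    # Each pass produces one token per request in the batch,
    # so one read of the weights is shared by the whole batch.
    print(batch_size, passes * batch_size)
# 1 187.5
# 32 6000.0
```

Under these assumptions a single request can get at most about 187 tokens per second from the 8B model no matter how fast the arithmetic units are, while a batch of 32 shares the same weight reads and raises the ceiling on total tokens per second by 32 times.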

A modern inference engine like vLLM is, fundamentally, an answer to both problems. It is a memory-management system disguised as a model server.

What this essay covers

We will move in four phases:

Foundations (sections 2–10). What is a token, what is an embedding, what is attention, what is a transformer block — culminating in a complete picture of “a forward pass through an LLM.” If you already know all of this, you can skim.
How inference actually runs (sections 11–14). What it means for a token to be generated, why generation splits into two phases, what the KV cache is, where bytes physically live on an NVIDIA H100, and how a scheduler folds many requests into one batch.
vLLM internals (sections 15–18). Paged attention, prefix caching, chunked prefill, and speculative decoding: the four ideas that took serving from “a single user per GPU” to “hundreds, with low latency.”
Scaling out (sections 19–21). What changes when one GPU isn’t enough.

The end goal is to give you a complete mental model of what happens between you pressing Enter and a stream of tokens coming back. Let’s start with the very first step: turning your text into numbers.