Section 01

What is an LLM?

And what does "inference" mean?

When you type something into ChatGPT, Claude, or any modern chat assistant and watch words appear one by one, something concrete is happening on a computer. A program is reading your text, turning it into numbers, doing an enormous amount of arithmetic on those numbers, and turning the result back into more text. That program is a Large Language Model (LLM), a particular kind of neural network, and the act of running it to produce output is called inference.

This essay is about how inference works. We will start from “what is a token” and end at “how does vLLM keep a $30,000 GPU saturated with hundreds of concurrent users.” Every concept builds on the previous one, every new term is defined the first time it appears, and there are interactive widgets along the way so you can poke at the ideas instead of just reading about them.

Training vs inference

A neural network is, at heart, a giant function. It takes numbers in, multiplies them by other numbers (called parameters, or weights), passes the results through some simple nonlinear functions, and produces numbers out. The interesting trick is that the parameters are learned from data. We start with random parameters, show the network billions of examples (“here is some text; predict what comes next”), and slowly nudge the parameters so it does better. That nudging process is called training.
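To make “a giant function” concrete, here is a minimal sketch of a two-layer network in plain Python. The names (`tiny_network`, `W1`, `W2`) and the hand-picked weights are illustrative assumptions, not anything from a real model; a real LLM does the same kind of multiply-then-nonlinearity step, only with billions of learned parameters.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def relu(x):
    """A simple nonlinear function: negative values become zero."""
    return [max(0.0, v) for v in x]


def tiny_network(x, W1, W2):
    """Numbers in, multiply by parameters, apply a nonlinearity, numbers out."""
    hidden = relu(matvec(W1, x))
    return matvec(W2, hidden)


# Hypothetical parameters chosen by hand; training would learn these from data.
W1 = [[1.0, -1.0],
      [0.5, 0.5]]
W2 = [[1.0, 2.0]]

output = tiny_network([2.0, 1.0], W1, W2)
print(output)  # [4.0]
```

Training would start from random values in `W1` and `W2` and adjust them until the outputs match the examples; inference is just calling `tiny_network` with the values frozen.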

Training is expensive. Training a frontier model costs tens of millions of dollars and runs for months on tens of thousands of GPUs. But you only have to do it once. Afterward, the parameters are frozen: they are just a big file of numbers, hundreds of gigabytes for a flagship model. What you ship to users is the much cheaper act of using those frozen parameters to answer one prompt at a time. That is inference.

What does an LLM actually compute?

An LLM is trained for one job: given some text, predict the next token. We will say much more about what a “token” is in the next section, but for now think of it as roughly “a word or word-fragment”. The model is shown a chunk of text and asked: what comes next?

That sounds almost embarrassingly simple. But once you have a really good next-token predictor, you can string it together: predict the next token, append it to the input, predict the next token after that, and so on. That loop (generate one token, feed it back in, generate another) is what we call autoregressive generation. It is also exactly how all the chat assistants you have seen work, including the one currently writing words across your screen.

The piece you type in is called the prompt. The text the model generates in response is the completion. Everything you see being typed out token by token in a chat UI is the model running through its autoregressive loop, doing one forward pass through hundreds of billions of arithmetic operations to produce each token.
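The generation loop itself is short enough to sketch. In the toy version below, `TOY_MODEL`, `predict_next_token`, and the `<eos>` stop marker are all hypothetical stand-ins: the “model” is a lookup table keyed on the last token, whereas a real LLM runs a full forward pass conditioned on every previous token. Only the shape of the loop is the point.

```python
# Hypothetical stand-in for a trained model: maps the last token to the next.
TOY_MODEL = {"the": "cat", "cat": "sat", "sat": "down", "down": "<eos>"}


def predict_next_token(tokens):
    """Pretend forward pass: look only at the most recent token."""
    return TOY_MODEL.get(tokens[-1], "<eos>")


def generate(prompt_tokens, max_new_tokens=10):
    """Autoregressive loop: predict a token, append it, and repeat."""
    tokens = list(prompt_tokens)   # the prompt plus everything generated so far
    completion = []
    for _ in range(max_new_tokens):
        next_token = predict_next_token(tokens)  # one forward pass per token
        if next_token == "<eos>":                # the model signals it is done
            break
        tokens.append(next_token)                # feed the output back in
        completion.append(next_token)
    return completion


print(generate(["the"]))  # ['cat', 'sat', 'down']
```

Each pass through the loop body produces exactly one token, which is why the cost of a completion grows with its length.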

Why is this hard?

If next-token prediction is the whole game, you might wonder why the engineering is interesting at all. It turns out the difficulty splits into two very different problems:

  1. The model is gigantic. A flagship open-weights model like Llama-3-70B has 70 billion parameters. At 16-bit precision that is 140 GB of weights — more than fits on a single GPU. Even the smaller 8B model is 16 GB. Every single token you generate requires reading every parameter at least once from GPU memory. The chip is fast at math but limited by how fast it can move bytes around, and that asymmetry shapes every decision in a serving system.

  2. Many users want answers at the same time. If you only ever ran one prompt at a time, GPUs would sit nearly idle: a single forward pass touches every weight, but most of the GPU’s arithmetic units have nothing to do because there’s only one token’s worth of work. Real systems pack many requests together so that one read of the weights serves many users, while juggling the fact that those requests have wildly different prompt lengths, completion lengths, and arrival times.
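Both problems can be put into rough numbers. The sketch below reproduces the weight sizes quoted above and then estimates an upper bound on token throughput if the only cost were reading the weights once per forward pass. The bandwidth figure is an assumption chosen to be in the range of a current high-end datacenter GPU, and the estimate ignores KV cache reads, activations, and compute limits, so treat it as a back-of-envelope bound rather than a prediction.

```python
# ASSUMPTION (not from the text): about 3.35e12 bytes/s of GPU memory
# bandwidth. Substitute the figure for your own hardware.
ASSUMED_BANDWIDTH_BYTES_PER_S = 3.35e12


def weight_bytes(num_params, bytes_per_param=2):
    """Size of the weights; 2 bytes per parameter at 16-bit precision."""
    return num_params * bytes_per_param


def min_seconds_per_forward_pass(num_params,
                                 bandwidth=ASSUMED_BANDWIDTH_BYTES_PER_S):
    """Lower bound: every weight is read from memory once per pass."""
    return weight_bytes(num_params) / bandwidth


def max_tokens_per_second(num_params, batch_size,
                          bandwidth=ASSUMED_BANDWIDTH_BYTES_PER_S):
    """One pass yields one token for each request in the batch."""
    return batch_size / min_seconds_per_forward_pass(num_params, bandwidth)


print(weight_bytes(70_000_000_000) / 1e9)  # 140.0 (GB, the 70B model)
print(weight_bytes(8_000_000_000) / 1e9)   # 16.0 (GB, the 8B model)

# The same read of the weights serves every request in the batch.
for batch_size in (1, 8, 64):
    rate = max_tokens_per_second(8_000_000_000, batch_size)
    print(f"batch={batch_size:>2}: at most ~{rate:,.0f} tokens/s in total")
```

Under these assumptions the bound scales linearly with batch size, which is the whole motivation for packing requests together: the expensive part, moving the weights, is paid once per pass regardless of how many users share it.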

A modern inference engine like vLLM is, fundamentally, an answer to both problems. It is a memory-management system disguised as a model server.

What this essay covers

We will move in four phases:

  1. Foundations (sections 2–10). What is a token, what is an embedding, what is attention, what is a transformer block — culminating in a complete picture of “a forward pass through an LLM.” If you already know all of this, you can skim.

  2. How inference actually runs (sections 11–14). What it means for a token to be generated, why decoding splits into two phases, what the KV cache is, where bytes physically live on an Nvidia H100, and how a scheduler folds many requests into one batch.

  3. vLLM internals (sections 15–18). Paged attention, prefix caching, chunked prefill, and speculative decoding — the four ideas that took serving from “a single user per GPU” to “hundreds, with great latency.”

  4. Scaling out (sections 19–21). What changes when one GPU isn’t enough.

The end goal is to give you a complete mental model of what happens between you pressing Enter and a stream of tokens coming back. Let’s start with the very first step: turning your text into numbers.