Section 21

Inference scaling & o1

Test-time compute and the reasoning model

System: Learning to Reason with LLMs (OpenAI o1) — OpenAI, September 2024

In September 2024 the ground shifted. OpenAI released o1, a model that thinks before answering, sometimes for many seconds, sometimes for minutes, emitting a long private chain of reasoning the user never sees. And the headline wasn’t just that it scored far higher on hard math, code, and science. It was the shape of the result: o1’s accuracy rose smoothly as you let it think longer. A new scaling axis had appeared (not bigger models, not more data, but more compute at inference time), and the plot that revealed it turned out to be one of the most important graphs in the field’s history.

A new kind of model

o1 is the first widely known reasoning model: a model explicitly trained, usually with RL, to spend a variable, often large, amount of computation reasoning through a problem before committing to an answer, trading inference compute for accuracy on hard problems. (DeepSeek-R1 is another example.) The mechanism is a long chain-of-thought: an internal reasoning trace that can run to thousands of tokens of self-correction, backtracking, and exploration, vastly longer than the few-step rationales of the chain-of-thought prompting we met in chapter 19.

What makes o1’s long CoT qualitatively different isn’t just length — it’s behavior. Inside the trace, the model does things earlier models didn’t: it tries an approach, notices it’s stuck, and backtracks. It second-guesses a calculation and redoes it. It decomposes a hard problem into sub-problems, explores a wrong branch, and abandons it. It checks its own work. This isn’t a single forward sweep to an answer; it’s something that looks like genuine deliberation, with all the false starts and self-corrections that implies. And — the crucial part — OpenAI didn’t prompt this behavior. They trained it in with reinforcement learning, rewarding the model for traces that reach correct answers and letting the deliberative strategy emerge from that pressure.
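OpenAI has not published o1's reward function, so the following is only a toy illustration of what "rewarding traces that reach correct answers" can mean in its simplest form: an outcome-only reward that scores the final answer and ignores the intermediate steps. The `Answer:` marker, the function names, and the example trace are all hypothetical choices made for this sketch.

```python
import re
from typing import Optional

# Hypothetical convention: the trace ends with a line "Answer: <final answer>".
ANSWER_PATTERN = re.compile(r"Answer:\s*(.+?)\s*$", re.MULTILINE)


def extract_final_answer(trace: str) -> Optional[str]:
    """Return the text after the last 'Answer:' marker, or None if absent."""
    matches = ANSWER_PATTERN.findall(trace)
    return matches[-1].strip() if matches else None


def outcome_reward(trace: str, reference: str) -> float:
    """1.0 if the trace's final answer matches the reference, else 0.0.

    Only the final answer is scored; false starts and backtracking inside
    the trace are neither rewarded nor penalized directly.
    """
    answer = extract_final_answer(trace)
    return 1.0 if answer is not None and answer == reference.strip() else 0.0


# A made-up trace containing a false start, a backtrack, and a final answer.
trace = (
    "Try factoring 91: 7 * 12? No, 7 * 12 = 84. Backtrack.\n"
    "7 * 13 = 91, so the factors are 7 and 13.\n"
    "Answer: 7 and 13"
)
print(outcome_reward(trace, "7 and 13"))
```

Because the reward looks only at the outcome, any deliberative behavior in the trace has to emerge from optimization pressure rather than from being graded step by step.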

Test-time compute as a scaling axis

Here is the result that reorganized the field. Plot o1’s accuracy against the amount of test-time compute it’s allowed to spend (compute spent at inference, here the length of its reasoning trace, as opposed to compute spent during training) and you get a clean, rising curve. Let it think longer, and it gets more right. This is inference scaling (also called test-time scaling): performance that improves predictably with compute spent at inference, not at training, giving a second scaling axis beyond model and data size.

To feel why this is a big deal, recall the pre-training scaling laws from the sibling pre-training explainer. For years, the recipe for a better model was: more parameters, more data, more training compute. Loss fell predictably as you scaled those up. That axis is expensive and slow — you train one giant model, once, and you’re stuck with whatever capability it has. o1 revealed an orthogonal axis. Take a fixed trained model and simply let it think longer on the hard problems, and accuracy climbs along its own curve. You can now dial capability up per query, at inference, by spending more tokens.

This connects straight back to the deepest idea of chapter 19: tokens are compute. Self-consistency scaled test-time compute in parallel: sample N independent chains and vote. o1 scales it serially: one very long chain that builds on itself, backtracks, and self-corrects. Serial scaling is more powerful for genuinely hard reasoning, because later steps can use the conclusions (and the discovered dead-ends) of earlier ones, which independent parallel samples cannot.
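The parallel side of that comparison is easy to sketch. The snippet below simulates self-consistency with a hypothetical `noisy_solver` standing in for one sampled chain of thought: it returns the right answer with a fixed probability and otherwise picks one of a few wrong answers. The probabilities, answers, and trial counts are illustrative assumptions, not measurements of any real model.

```python
import random
from collections import Counter


def majority_vote(answers):
    """Return the most common answer (ties go to the one seen first)."""
    return Counter(answers).most_common(1)[0][0]


def noisy_solver(rng, correct="42", wrong=("41", "43", "44"), p_correct=0.6):
    """Hypothetical stand-in for one independently sampled chain of thought."""
    return correct if rng.random() < p_correct else rng.choice(wrong)


def self_consistency_accuracy(n_samples, trials=2000, seed=0):
    """Fraction of trials in which the majority vote over n_samples is correct."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        answers = [noisy_solver(rng) for _ in range(n_samples)]
        hits += majority_vote(answers) == "42"
    return hits / trials


for n in (1, 5, 25):
    print(n, self_consistency_accuracy(n))
```

Note what the simulation cannot do: each sample is drawn independently, so no chain ever benefits from a dead-end that another chain discovered. That is exactly the limitation serial scaling removes.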

Try it

Below, slide the test-time-compute budget and watch accuracy climb. Compare the serial long-CoT curve against parallel self-consistency, and notice the diminishing returns — every doubling of compute buys a bit less than the last.

[Interactive figure: Test-time scaling. Accuracy (25% to 100%) versus test-time compute (log scale), with one curve for a reasoning-trained model and one for a base model. Spending more on sampling and reasoning length raises accuracy, which then saturates. In the captured state, at a budget of about 4 × 512 = 2,048 units of test-time compute, the figure reads 45.6% for the reasoning-trained model and 43.7% for the base model.]
Accuracy improves predictably with test-time compute — a second scaling axis beyond model size and training data. But the base model saturates early: more sampling and longer chains stop helping once it has exhausted what it knows. The reasoning-trained model has a higher ceiling and keeps climbing for longer, because it was trained to actually use the extra thinking budget rather than just repeat itself.
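One way to make "every doubling buys a bit less" concrete is a toy saturating curve. The functional form and every parameter below are assumptions chosen for illustration; they are not fit to o1, to the figure above, or to any published data. Accuracy starts at `floor` with one unit of compute and approaches `ceiling` as compute grows, with `k` controlling how quickly the gap closes.

```python
def toy_accuracy(compute, floor, ceiling, k):
    """Saturating toy curve: equals `floor` at compute=1, approaches `ceiling`."""
    return ceiling - (ceiling - floor) * compute ** (-k)


def gains_per_doubling(floor, ceiling, k, doublings=6):
    """Accuracy gained by each successive doubling of the compute budget."""
    budgets = [2 ** i for i in range(doublings + 1)]
    accs = [toy_accuracy(b, floor, ceiling, k) for b in budgets]
    return [later - earlier for earlier, later in zip(accs, accs[1:])]


# Hypothetical parameters: the base model has a low ceiling and saturates
# quickly; the reasoning-trained model has a higher ceiling and climbs longer.
base = dict(floor=0.3, ceiling=0.5, k=1.0)
reasoning = dict(floor=0.3, ceiling=0.9, k=0.3)

for budget in (1, 4, 16, 64, 256, 1024):
    print(
        budget,
        round(toy_accuracy(budget, **base), 3),
        round(toy_accuracy(budget, **reasoning), 3),
    )
```

Under this form each doubling closes a fixed fraction of the remaining gap to the ceiling, so the absolute gain shrinks geometrically. The base curve has almost nothing left to gain after a few doublings, while the reasoning curve is still rising at the largest budget.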

The cost, and the honesty about it

There is no free lunch. Every reasoning token is a real forward pass on a GPU, so a long chain of thought is expensive — o1 can spend orders of magnitude more inference compute than a normal model answering the same prompt, and that compute is paid on every single query you ask it to think hard about. The economics of generating those tokens — KV-cache growth, latency per token, throughput — are exactly the inference mechanics covered in the LLM & vLLM Inference explainer. Test-time scaling moves cost from a one-time training bill to a recurring per-query bill, which is a very different operational shape: cheaper to reach a capability, more expensive to use it at volume.
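A back-of-the-envelope calculation shows what that recurring bill looks like. The token counts, price, and query volume below are hypothetical placeholders, and the sketch assumes cost is linear in generated tokens, which ignores effects such as KV-cache growth that make long traces somewhat more expensive per token in practice.

```python
def query_cost(output_tokens, price_per_million_tokens):
    """Cost of one query, assuming cost is linear in generated tokens."""
    return output_tokens * price_per_million_tokens / 1_000_000


def monthly_cost(output_tokens, price_per_million_tokens, queries_per_month):
    """Recurring bill: the per-query cost is paid again on every query."""
    return query_cost(output_tokens, price_per_million_tokens) * queries_per_month


# Hypothetical numbers, for illustration only.
PRICE = 10.0               # currency units per million generated tokens
SHORT_ANSWER = 200         # tokens for a direct answer
LONG_TRACE = 20_000        # tokens for a long reasoning trace
VOLUME = 1_000_000         # queries per month

print("per query, direct:   ", query_cost(SHORT_ANSWER, PRICE))
print("per query, reasoning:", query_cost(LONG_TRACE, PRICE))
print("monthly, direct:     ", monthly_cost(SHORT_ANSWER, PRICE, VOLUME))
print("monthly, reasoning:  ", monthly_cost(LONG_TRACE, PRICE, VOLUME))
```

With these placeholder values the reasoning trace costs 100 times more per query, and that multiplier applies to every query at volume rather than being paid once at training time.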

The catch: o1 was a closed box

o1 was a landmark, but it was also a black box. OpenAI hid the reasoning traces, published no training details, and released no weights. The community knew the destination — RL-trained long chain-of-thought, scaling with test-time compute — but not the route. How do you train a base model to reason like this? What reward? What algorithm? Does it need the careful step-level supervision of chapter 20, or can a simple correctness signal suffice? Could anyone reproduce it in the open?

Within months, someone did. The next two chapters tell that story. Chapter 22 gives us the reward — RL from verifiable rewards, the automatic-correctness signal — and chapter 23 gives us the algorithm and the open model that put it all together: GRPO and DeepSeek-R1. o1 proved the destination existed. R1 drew the map.