Section 02

Tokens

Text → numbers the model can see

A neural network can only operate on numbers, not text. So before any actual model work happens, every prompt you type goes through a small program that turns text into a list of integers. That program is called a tokenizer, the chunks of text it produces are called tokens (roughly word fragments), and the fixed list of all possible tokens the model knows about is called its vocabulary.

For modern LLMs the vocabulary has somewhere between 32,000 and 200,000 entries. Each entry has an integer index, its token ID, and that integer is what flows into the rest of the network. From the model’s point of view, your prompt is not “Hello, world” but [9906, 11, 1917].
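To make the text-to-ID mapping concrete, here is a minimal sketch with a toy vocabulary of just three entries, reusing the IDs from the example above. The names `TOY_VOCAB`, `encode`, and `decode` are made up for illustration, and the greedy longest-match lookup is a simplification: real tokenizers apply learned merge rules over a vocabulary of tens of thousands of entries.

```python
# Hypothetical three-entry vocabulary; the IDs are borrowed from the example above.
TOY_VOCAB = {"Hello": 9906, ",": 11, " world": 1917}
ID_TO_TOKEN = {token_id: token for token, token_id in TOY_VOCAB.items()}


def encode(text: str) -> list[int]:
    """Greedy longest-match lookup against TOY_VOCAB (illustrative only)."""
    ids = []
    pos = 0
    while pos < len(text):
        # Pick the longest vocabulary entry that matches at this position.
        match = max(
            (tok for tok in TOY_VOCAB if text.startswith(tok, pos)),
            key=len,
            default=None,
        )
        if match is None:
            raise ValueError(f"no toy-vocabulary entry matches at position {pos}")
        ids.append(TOY_VOCAB[match])
        pos += len(match)
    return ids


def decode(ids: list[int]) -> str:
    """Map each ID back to its text chunk and glue the chunks together."""
    return "".join(ID_TO_TOKEN[i] for i in ids)


print(encode("Hello, world"))    # [9906, 11, 1917]
print(decode([9906, 11, 1917]))  # Hello, world
```

Notice that the space belongs to the token " world" rather than being a token of its own, and that decoding is nothing more than concatenating the chunks behind each ID.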

Why not just use characters? Or whole words?

If we tokenized one character at a time, every prompt would be very long. A 500-word email might become 3,000 tokens, and since the cost of inference grows with sequence length, that gets expensive fast. The model would also have to learn that c, a, t next to each other means something, instead of being told “this is the word cat” up front.
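A quick way to feel the length problem is to count both ways on one sentence. This is only a sketch: `char_tokens` and `word_tokens` are hypothetical helpers, and splitting on whitespace is a crude stand-in for a real word tokenizer.

```python
def char_tokens(text: str) -> list[str]:
    """One token per character, spaces and punctuation included."""
    return list(text)


def word_tokens(text: str) -> list[str]:
    """One token per whitespace-separated word (a crude approximation)."""
    return text.split()


sentence = "The cat sat on the mat."
print(len(char_tokens(sentence)))  # 23
print(len(word_tokens(sentence)))  # 6
```

Even on a six-word sentence, the character-level sequence is almost four times longer, and the gap widens with longer words.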

If we tokenized whole words, we’d be fine on common words — but English alone has hundreds of thousands of them, and we’d have no way to handle a word the tokenizer had never seen (a typo, a new product name, a piece of code). Worse, modern LLMs are expected to handle every major language at once: a vocabulary that covered the words of English, Mandarin, Spanish, Hindi, Arabic, Japanese, Russian, and the other 90+ languages users actually type would balloon into the millions of entries — and still miss every word never written down before. Out-of-vocabulary words would break the system.
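The out-of-vocabulary problem is easy to reproduce. The sketch below builds a word-level vocabulary from a tiny made-up corpus and then tries to encode a typo; `build_word_vocab` and `encode_words` are illustrative names, not part of any library.

```python
def build_word_vocab(corpus: list[str]) -> dict[str, int]:
    """Assign an ID to every distinct whitespace-separated word in the corpus."""
    words = sorted({word for line in corpus for word in line.split()})
    return {word: idx for idx, word in enumerate(words)}


def encode_words(text: str, vocab: dict[str, int]) -> list[int]:
    """Look each word up in the vocabulary; unseen words raise KeyError."""
    return [vocab[word] for word in text.split()]


vocab = build_word_vocab(["the cat sat", "the dog sat"])
print(vocab)                               # {'cat': 0, 'dog': 1, 'sat': 2, 'the': 3}
print(encode_words("the cat sat", vocab))  # [3, 0, 2]

try:
    encode_words("the catt sat", vocab)    # "catt" is a typo the vocabulary never saw
except KeyError as missing:
    print("out of vocabulary:", missing)
```

A single mistyped letter is enough to leave the tokenizer with no ID to emit, which is exactly the failure a subword scheme is designed to avoid.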

The practical compromise most modern LLMs use is somewhere in between: subword tokenization, where common words get a single token, less common words get split into a few pieces, and the worst case (total gibberish) falls back to one token per byte. The most widespread algorithm for this is Byte-Pair Encoding (BPE), which builds its vocabulary by repeatedly merging the most frequent adjacent pair of bytes or tokens into a new token.
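The core BPE training loop fits in a few dozen lines. What follows is a minimal sketch under stated assumptions, not the implementation behind any production tokenizer: it starts from raw UTF-8 bytes (IDs 0 to 255), merges the most frequent adjacent pair into a new ID, and repeats. Real tokenizers also pre-split the text with rules of their own and train on enormous corpora; the function names here are made up for illustration.

```python
from collections import Counter


def most_frequent_pair(ids):
    """Return the adjacent pair that occurs most often, or None if there is none."""
    pairs = Counter(zip(ids, ids[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges; return the encoded text and the merge table."""
    ids = list(text.encode("utf-8"))  # start from raw bytes: IDs 0..255
    merges = {}
    for step in range(num_merges):
        pair = most_frequent_pair(ids)
        if pair is None:
            break
        new_id = 256 + step           # new tokens get the next free ID
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges


def decode(ids, merges):
    """Rebuild the bytes behind every ID, then undo the UTF-8 encoding."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8")


sample = "low lower lowest"
ids, merges = train_bpe(sample, num_merges=3)
print(len(sample.encode("utf-8")), "bytes ->", len(ids), "tokens")
print(decode(ids, merges) == sample)  # True
```

Because every single byte keeps its own ID, any input can still be encoded even if no merge applies to it. That is the byte fallback described above.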

Try it

Below is a real tokenizer running in your browser — the same cl100k_base encoding used by GPT-4 and ChatGPT. Type anything, or pick one of the sample inputs. Watch how the same text gets chopped up into different numbers of tokens depending on whether it’s common English, code, a long word, or another language.

[Interactive tokenizer playground: cl100k_base, used by GPT-4 / ChatGPT]
Each colored chip is one token. The small number is its ID — the integer index into the vocabulary. Notice how common words and word-pieces get a single token, while rare words, code, and other languages get split into many. A leading · means there was a space before the token in the original text.

A few things worth noticing as you play with it:

  • Common words are one token. “The”, “and”, “world” — single chips, often with low IDs because they were merged early during training.
  • Rare words get split. Try “antidisestablishmentarianism”: it breaks into 5–6 pieces. The model still sees a meaningful sequence; it just has to do a bit more work.
  • Code is token-hungry. Programming languages weren’t the bulk of the tokenizer’s training data, so things like def, return, indentation, and parens often each get their own token. A Python file usually costs noticeably more tokens than English prose of the same length in characters.
  • Non-English text is even more token-hungry. Languages with non-Latin scripts often spend several tokens per character, which is one reason multilingual prompts have historically been more expensive per word.
  • Emoji and rare Unicode can hit the byte fallback — one token per byte of UTF-8.
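To see what the byte fallback costs, you can count UTF-8 bytes directly. This sketch shows the worst case only: it assumes no merged token covers the character, whereas a real vocabulary may well hold a shorter encoding for common emoji or characters. `utf8_bytes` is a hypothetical helper name.

```python
def utf8_bytes(text: str) -> list[int]:
    """Return the UTF-8 bytes of `text` as a list of integers in 0..255."""
    return list(text.encode("utf-8"))


samples = {
    "Latin letter a": "a",
    "e with acute accent": "\u00e9",
    "CJK character": "\u4e2d",
    "emoji": "\U0001F642",
}

for label, char in samples.items():
    raw = utf8_bytes(char)
    print(f"{label}: {len(raw)} byte(s) {raw}")
```

The four samples take 1, 2, 3, and 4 bytes respectively, so a single emoji that falls all the way back to bytes costs four tokens.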

What the model actually receives

By the end of the tokenizer’s work, your prompt — a string — has become a flat list of integers, each between 0 and vocab_size - 1. Those integers will be the very first thing the neural network sees. The model has no idea they used to be letters; as far as it knows, the world is just a sequence of integer indices.

Of course, integers alone don’t capture meaning. The token ID for “king” and the ID for “queen” are just two arbitrary numbers; they don’t tell the network that those words are related. The next step is to turn each integer into a much richer object that can carry information about meaning — a vector. That’s the topic of the next section: embeddings.