Glossary
Every term, in one place
Every defined term across the explainers, alphabetized. Each entry links to the section where the term is first introduced.
- "aha moment" Introduced in §23 · GRPO & DeepSeek-R1
- The observation in DeepSeek-R1-Zero that, under pure RL with verifiable rewards, the model spontaneously learns to pause, reconsider, and backtrack — reasoning behaviors no one demonstrated.
- 3D parallelism Introduced in §07 · Parallelism
- Combining data, tensor, and pipeline parallelism (three axes) at once to train a model too big for any single axis to handle. Frontier runs add expert parallelism as a fourth.
- 6ND rule Introduced in §06 · Compute & memory
- A rule of thumb: training a dense model with N parameters on D tokens costs about 6ND floating-point operations (≈2ND forward + ≈4ND backward).
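As a rough worked example of the 6ND rule, the sketch below multiplies out the estimate. The function name `training_flops` and the example run (8e9 parameters, 15e12 tokens) are illustrative assumptions, not figures taken from the explainers.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training cost of a dense model using the 6ND rule of thumb."""
    forward = 2 * n_params * n_tokens   # ~2ND for the forward pass
    backward = 4 * n_params * n_tokens  # ~4ND for the backward pass
    return forward + backward


# Hypothetical run: 8e9 parameters trained on 15e12 tokens.
example_flops = training_flops(8e9, 15e12)
print(f"{example_flops:.2e} FLOPs")
```

For this hypothetical run the estimate comes to roughly 7.2 × 10^23 FLOPs.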
- activations Introduced in §06 · Compute & memory
- The intermediate tensors produced during the forward pass. They must be kept around for the backward pass, and at long context they can dominate memory use.
- active parameters Introduced in §18 · DeepSeek-V3
- In a Mixture-of-Experts model, the subset of parameters actually used to process a given token. DeepSeek-V3 has 671B total but only 37B active per token, so compute tracks the smaller number.
- AdaFactor Introduced in §13 · T5
- A memory-efficient optimizer (used to train T5) that factorizes Adam's second-moment matrix into row and column statistics, drastically cutting optimizer-state memory for very large models.
- Adam Introduced in §04 · Optimizers & schedules
- Adaptive Moment Estimation — an optimizer that tracks running averages of the gradient (first moment) and its square (second moment) to give each parameter its own adaptive step size.
- AdamW Introduced in §04 · Optimizers & schedules
- Adam with decoupled Weight decay — the de facto standard LLM optimizer. It applies weight decay directly to the parameters instead of folding it into the gradient, which regularizes more cleanly.
- advantage Introduced in §12 · Value, advantage, baselines
- How much better an action was than the baseline expectation: A = reward − value. Positive advantage pushes an action’s probability up, negative pushes it down.
- agentic RL Introduced in §26 · Agentic & tool-use RL
- Reinforcement learning over multi-step, tool-using trajectories — the model acts, observes results, and acts again — rather than producing a single response. The 2025–26 frontier.
- alignment Introduced in §03 · The alignment problem
- The problem of making a model behave in accordance with human intent and values — helpful, honest, and harmless — rather than merely continuing text plausibly.
- all-reduce Introduced in §19 · Scaling out
- A collective op where every GPU contributes a tensor and every GPU ends up with the sum (or other reduction). The workhorse collective of tensor parallelism (TP), also used to average gradients in data parallelism.
- annealing Introduced in §08 · The data pipeline
- A final pre-training phase that upsamples small amounts of the highest-quality data (math, code, curated text) while the learning rate decays to its floor. Reliably boosts quality and can be used to gauge a dataset's value.
- arithmetic intensity Introduced in §06 · Compute & memory
- The ratio of compute (FLOPs) to memory traffic (bytes) for an operation. High-intensity ops keep the GPU's math units busy; low-intensity ops stall on memory.
- attention score Introduced in §04 · Attention
- A single number sᵢⱼ measuring how much token i wants to attend to token j. Computed as the dot product of i's query vector and j's key vector, divided by √d_k, then softmaxed across j so the weights for each i sum to 1. High score = i finds j relevant.
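The attention-score computation can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the function name `attention_weights` and the toy vectors are made up, and the causal mask is left out for brevity.

```python
import math


def attention_weights(queries, keys):
    """Return softmax(q_i . k_j / sqrt(d_k)) for every query i over all keys j."""
    d_k = len(keys[0])
    weights = []
    for q in queries:
        # Raw scores: dot product of this query with each key, divided by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        # Softmax across j (subtract the max first for numerical stability).
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Toy example: 2 tokens with 2-dimensional query/key vectors (made-up numbers).
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
w = attention_weights(q, k)
```

Each row of `w` sums to 1, and each toy token puts most of its weight on the key that matches its own query.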
- autoregressive Introduced in §01 · What is an LLM?
- Generating one token at a time, where each new token is conditioned on every token that came before it.
- auxiliary-loss-free load balancing Introduced in §18 · DeepSeek-V3
- DeepSeek-V3's load-balancing method that adjusts a per-expert routing bias instead of adding a balancing loss term, avoiding the quality hit that auxiliary losses impose on the main objective.
- backpropagation Introduced in §03 · How a model learns
- The algorithm that computes the loss gradient for every parameter efficiently by applying the chain rule backward through the network, reusing intermediate results from the forward pass.
- backward pass Introduced in §03 · How a model learns
- The second half of a training step: backpropagation walks from the loss back through the network, computing each parameter's gradient.
- base model Introduced in §01 · What is post-training?
- A model straight out of pre-training — a powerful text continuator that has not yet been taught to follow instructions, hold a conversation, or refuse harmful requests.
- baseline Introduced in §12 · Value, advantage, baselines
- A reference value subtracted from the reward to reduce gradient variance without adding bias. Can be a learned critic, a group mean (GRPO), or a leave-one-out average (RLOO).
- batch Introduced in §14 · Continuous batching
- A group of sequences processed together in one forward pass. Bigger batches = better GPU utilization, more memory used.
- best-of-N Introduced in §18 · Rejection-sampling alignment
- Sampling N responses and selecting the highest-reward one. Used both at inference time and as the data-generation step in rejection-sampling fine-tuning.
- BF16 Introduced in §05 · Precision & numerics
- Brain Floating-point 16-bit: 1 sign + 8 exponent + 7 mantissa bits. Keeps FP32's wide exponent range (so it rarely overflows) at the cost of precision — the workhorse format for modern pre-training.
- bidirectional Introduced in §11 · BERT
- Able to use context from both the left and the right of a token. BERT is bidirectional; a causal language model is left-to-right only.
- bits-per-token Introduced in §02 · The objective
- Cross-entropy loss measured in bits (log base 2) instead of nats. A compression-flavored view: a better language model encodes the next token in fewer bits.
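Converting between the two units is a single division, since a loss in nats divided by ln 2 gives the same loss in bits. A minimal sketch, with `nats_to_bits` as a hypothetical helper name:

```python
import math


def nats_to_bits(loss_nats: float) -> float:
    """Convert a cross-entropy loss from nats (natural log) to bits (log base 2)."""
    return loss_nats / math.log(2)


# A loss of ln(2) nats is exactly one bit: a coin flip between two tokens.
one_bit = nats_to_bits(math.log(2))
```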
- BooksCorpus Introduced in §10 · GPT-1
- A dataset of around 7,000 unpublished books (~800M words) used to pre-train GPT-1. Long contiguous passages made it good for learning long-range structure.
- BPE Introduced in §02 · Tokens
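- Byte-Pair Encoding: the most common tokenization algorithm. Starting from single bytes or characters, it repeatedly merges the most frequent adjacent pair of symbols into a new token until the vocabulary reaches a target size.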
- Bradley–Terry model Introduced in §09 · Reward models
- A statistical model that turns pairwise preferences into latent scalar scores: the probability A beats B is the logistic of the score difference, σ(s_A − s_B). The core of most reward models.
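The Bradley–Terry probability, together with the negative log-likelihood loss commonly paired with it when fitting scores to preference data, can be sketched as below. The function names are illustrative assumptions, not the API of any library.

```python
import math


def preference_probability(score_a: float, score_b: float) -> float:
    """Bradley-Terry probability that A is preferred over B: sigmoid(s_A - s_B)."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))


def pairwise_loss(score_chosen: float, score_rejected: float) -> float:
    """Negative log-likelihood of the observed preference (chosen beats rejected)."""
    return -math.log(preference_probability(score_chosen, score_rejected))
```

Equal scores give a probability of 0.5, and the loss shrinks as the chosen response's score pulls ahead of the rejected one.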
- byte-level BPE Introduced in §08 · The data pipeline
- Byte-level Byte Pair Encoding — running BPE over raw bytes rather than Unicode characters, so any possible input (emoji, code, any language) is representable with a small base vocabulary. Introduced by GPT-2.
- C4 Introduced in §13 · T5
- Colossal Clean Crawled Corpus — the ~750 GB cleaned web-text dataset built from Common Crawl for training T5, and widely reused since.
- causal language model Introduced in §02 · The objective
- A model that predicts each token using only earlier tokens (never future ones). "Causal" because information flows strictly left to right. The GPT family are causal LMs (Language Models).
- causal mask Introduced in §09 · Attention Is All You Need
- A mask applied before the attention softmax that sets future positions to −∞, preventing each token from attending to tokens that come after it. What makes a decoder autoregressive.
- chain rule Introduced in §03 · How a model learns
- The calculus rule for differentiating composed functions. Backpropagation is just the chain rule applied layer by layer, from the loss back to the inputs.
- chain-of-thought (CoT) Introduced in §19 · Bootstrapping reasoning
- Having a model write out intermediate reasoning steps before its final answer. Improves accuracy on multi-step problems and is the substrate reasoning RL optimizes.
- chat template Introduced in §05 · The SFT stage in practice
- The fixed formatting (with special tokens marking roles like system/user/assistant) that turns a multi-turn conversation into the single token stream a model is trained and served on.
- chunked prefill Introduced in §17 · Chunked prefill
- Splitting a long prompt into multiple smaller prefills so decoding requests aren’t blocked behind one giant compute step.
- clipped surrogate objective Introduced in §13 · TRPO to PPO
- PPO’s loss: maximize the probability-ratio-weighted advantage, but clip the ratio to [1−ε, 1+ε] so a single update can’t move the policy too far.
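For a single sample, the clipped surrogate can be sketched as below. The function name is hypothetical, and ε = 0.2 is an assumed default for illustration rather than a value stated in this glossary.

```python
def clipped_surrogate(ratio: float, advantage: float, epsilon: float = 0.2) -> float:
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    # Taking the min keeps the more pessimistic of the two estimates, so the
    # objective stops rewarding moves of the ratio beyond the clip range.
    return min(ratio * advantage, clipped_ratio * advantage)
```

With a positive advantage, a ratio of 1.5 is credited only as if it were 1.2. With a negative advantage, a ratio of 0.5 is treated as 0.8.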
- cold-start data Introduced in §23 · GRPO & DeepSeek-R1
- A small amount of high-quality SFT data used to "warm up" a base model before RL, so reasoning RL is more stable and readable. DeepSeek-R1 adds it; R1-Zero skips it.
- Common Crawl Introduced in §08 · The data pipeline
- A free, monthly public crawl of the web — petabytes of raw HTML. It is the raw feedstock for most large pre-training corpora after heavy filtering.
- completion Introduced in §01 · What is an LLM?
- The text the model generates in response to a prompt.
- Compressed Sparse Attention Introduced in §26 · DeepSeek-V4
- Compressed Sparse Attention (CSA) — a DeepSeek-V4 attention variant that attends to a compressed, sparsely-selected subset of past tokens to make million-token context affordable.
- compute-optimal Introduced in §16 · Chinchilla
- The allocation of a fixed compute budget between model size and training tokens that minimizes loss. Chinchilla showed it means scaling both roughly equally — about 20 tokens per parameter.
- Constitutional AI Introduced in §10 · RLAIF & Constitutional AI
- Anthropic’s method where a model critiques and revises its own outputs against a written set of principles (a "constitution"), then trains on AI-generated preferences — a form of RLAIF.
- context length Introduced in §02 · The objective
- The maximum number of tokens the model can attend to at once (also called the context window or sequence length). Pre-training picks a context length; later stages often extend it.
- continuous batching Introduced in §14 · Continuous batching
- A scheduler that swaps finished requests out and queued requests in at every decode step instead of waiting for the whole batch to finish.
- coreference Introduced in §04 · Attention
- When two words in a text refer to the same thing. In "Marie went home because she was tired," the pronoun "she" co-refers to "Marie." Resolving coreference — figuring out which earlier mention a pronoun, "this", "the company", etc. points back to — is one of the relationships transformer heads learn to track during training.
- corpus Introduced in §01 · What is pre-training?
- The body of text a model is trained on. Modern pre-training corpora are measured in trillions of tokens drawn from web crawls, books, code, and more.
- cosine decay Introduced in §04 · Optimizers & schedules
- A learning-rate schedule that follows a half-cosine curve from the peak down to a small floor, decaying slowly at first, fastest around the midpoint, and slowly again as it approaches the floor. The most common LLM schedule.
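One common way to write the half-cosine schedule is sketched below. Warmup is omitted, and the function name and the example learning rates in the comment are assumptions chosen for illustration.

```python
import math


def cosine_lr(step: int, total_steps: int, peak_lr: float, min_lr: float) -> float:
    """Half-cosine decay from peak_lr at step 0 down to min_lr at total_steps."""
    # Clamp so that steps past the end of the schedule stay at the floor.
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical schedule: peak 3e-4, floor 3e-5, 1,000 decay steps.
schedule = [cosine_lr(s, 1000, 3e-4, 3e-5) for s in (0, 500, 1000)]
```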
- critic Introduced in §12 · Value, advantage, baselines
- A model trained to predict the value function. PPO uses an actor (the policy) and a critic; GRPO drops the critic and uses a group average instead.
- cross-entropy loss Introduced in §02 · The objective
- The standard LM loss: the negative log-probability the model assigned to the actual next token, averaged over all positions. Zero would mean perfect confidence in every correct token.
- d_model Introduced in §03 · Embeddings
- The hidden dimension that flows through the whole transformer. Llama-3-8B uses 4096, GPT-3 12288.
- DAPO Introduced in §24 · GRPO refinements
- A fully open GRPO refinement (ByteDance/Tsinghua, 2025) combining Clip-Higher, dynamic sampling, token-level loss, and overlong-reward shaping to stabilize large-scale reasoning RL.
- data contamination Introduced in §08 · The data pipeline
- When test or benchmark data leaks into the training corpus, inflating scores. Careful pipelines try to detect and remove contamination before training.
- data mixture Introduced in §08 · The data pipeline
- The recipe specifying what fraction of training tokens comes from each source (web, code, books, math, multilingual). Tuning the mixture is one of the highest-leverage data decisions.
- data parallelism Introduced in §07 · Parallelism
- Replicating the whole model on each GPU, giving each a different slice of the batch, then averaging gradients across GPUs with an all-reduce. The simplest way to scale out.
- data wall Introduced in §21 · Synthetic data
- The looming limit where the supply of high-quality human-written text is exhausted relative to models' appetite for tokens, motivating interest in synthetic data and better filtering.
- decode Introduced in §11 · Prefill and decode
- The autoregressive phase: one forward pass per generated token. Memory-bandwidth-bound — the GPU mostly waits on weights.
- decoder Introduced in §09 · Attention Is All You Need
- The half of a transformer that generates a sequence one token at a time using masked (causal) self-attention. GPT-style language models are decoder-only.
- deduplication Introduced in §08 · The data pipeline
- Removing duplicate or near-duplicate documents from the corpus. Dedup improves quality, reduces memorization, and stops the model wasting capacity on repeated text.
- denoising objective Introduced in §13 · T5
- Any pre-training objective that corrupts the input (masking, deleting, or shuffling tokens) and trains the model to restore the original. Masked LM and span corruption are both denoising objectives.
- document attention masking Introduced in §08 · The data pipeline
- Restricting attention so each token can only attend within its own document (never across a separator) when several documents share one packed sequence. Prevents one document's tokens from leaking into another's predictions. Also called intra-document masking.
- document separator Introduced in §08 · The data pipeline
- A special token (often an End-Of-Sequence / EOS marker such as <|endoftext|>) inserted between documents packed into one training sequence, marking where one document ends and the next begins.
- dot product Introduced in §04 · Attention
- A single number summarizing how aligned two vectors are. To compute a · b: multiply corresponding components (a₀·b₀, a₁·b₁, …, a_{d-1}·b_{d-1}) and sum the results. Large positive = the two vectors point in similar directions; near zero = they're unrelated; large negative = opposite directions.
- downstream task Introduced in §01 · What is pre-training?
- Any specific job (translation, question answering, coding) a pre-trained model is later applied to. Pre-training is deliberately task-agnostic so it transfers to many downstream tasks.
- DPO Introduced in §16 · Direct Preference Optimization
- Direct Preference Optimization (Rafailov, 2023) — a closed-form supervised loss that optimizes the RLHF objective directly from preference pairs, with no separate reward model and no RL loop.
- Dr.GRPO Introduced in §24 · GRPO refinements
- A corrected GRPO that removes length and standard-deviation normalization biases, so the gradient is unbiased and long wrong answers aren’t implicitly favored.
- dropout Introduced in §09 · Attention Is All You Need
- A regularizer that randomly zeroes a fraction of activations during training, forcing the network not to rely on any single unit. Common in early models; large modern pre-training often uses little or none.
- dynamic range Introduced in §05 · Precision & numerics
- The span between the smallest and largest magnitudes a number format can represent, set by its exponent bits. BF16 has FP32-like range; FP16 does not.
- EAGLE Introduced in §18 · Speculative decoding
- A draft-model architecture that predicts feature vectors of the target model, achieving high acceptance rates.
- embedding Introduced in §03 · Embeddings
- A dense vector representation of a token (typically d=2k–8k floats). Similar tokens get nearby vectors.
- embedding matrix Introduced in §03 · Embeddings
- A table with one row per vocabulary entry. Looking up a token = indexing into this matrix.
- emergent abilities Introduced in §15 · GPT-3
- Capabilities that are absent in smaller models but appear, sometimes abruptly, once a model is large enough — e.g. multi-step arithmetic or in-context learning of novel tasks.
- encoder Introduced in §09 · Attention Is All You Need
- The half of a transformer that reads an input sequence with full (bidirectional) attention, producing a contextual representation of it. BERT is encoder-only.
- encoder-decoder Introduced in §09 · Attention Is All You Need
- An architecture with an encoder that reads the input and a decoder that writes the output, connected by cross-attention. The original transformer and T5 are encoder-decoder models.
- entropy Introduced in §02 · From next-token to behavior
- A measure of how spread-out (uncertain) a probability distribution is. In RL post-training, keeping entropy up preserves exploration and prevents premature collapse onto one answer.
- epoch Introduced in §03 · How a model learns
- One full pass over the training dataset. Frontier LLMs are often trained for roughly a single epoch over a deduplicated corpus, so each token is seen about once.
- expert parallelism Introduced in §07 · Parallelism
- Placing different experts of a Mixture-of-Experts layer on different GPUs, so each device holds only some experts and tokens are routed across the network to reach them.
- few-shot Introduced in §15 · GPT-3
- Giving the model a handful of worked examples in the prompt before the real query, so it infers the task from them. Contrast with zero-shot (instructions only) and one-shot (a single example).
- fill-in-the-middle Introduced in §25 · Qwen3-Coder-Next
- Fill-in-the-Middle (FIM) — a code pre-training objective that gives the model a prefix and a suffix and asks it to generate the missing middle, teaching it to edit and complete code in place, not just continue it.
- fine-tuning Introduced in §01 · What is pre-training?
- Continuing to train a pre-trained model on a smaller, task- or behavior-specific dataset. This explainer is about pre-training; fine-tuning and other post-training steps are out of scope.
- floating point Introduced in §05 · Precision & numerics
- How computers store real numbers: a sign, an exponent (range), and a mantissa (precision). The trade-off between range and precision is central to training numerics.
- FLOP Introduced in §06 · Compute & memory
- Floating-Point Operation — one multiply or add. Training compute is measured in total FLOPs; a frontier run is on the order of 10^24–10^25 FLOPs.
- forward pass Introduced in §03 · How a model learns
- Running inputs through the network to produce outputs (logits) and the loss, caching intermediate activations that backpropagation will need.
- foundation model Introduced in §01 · What is pre-training?
- A large model pre-trained on broad data that can be adapted to many downstream tasks. The pre-trained LLM is the foundation; fine-tuning specializes it.
- FP16 Introduced in §05 · Precision & numerics
- 16-bit half-precision Floating Point: 1 sign + 5 exponent + 10 mantissa bits. Half the memory of FP32 but a narrow exponent range, so it can overflow/underflow without loss scaling.
- FP32 Introduced in §05 · Precision & numerics
- 32-bit single-precision Floating Point: 1 sign + 8 exponent + 23 mantissa bits. The traditional "full precision" format; accurate but memory- and bandwidth-hungry.
- FP8 Introduced in §05 · Precision & numerics
- 8-bit Floating Point (typically E4M3 or E5M2 layouts). The newest training precision, used on H100/Blackwell GPUs to roughly double throughput; needs careful scaling to stay numerically stable.
- fragmentation Introduced in §15 · Paged attention
- Wasted memory from allocations that don’t fit cleanly. Paging eliminates external fragmentation at the cost of a little internal fragmentation (at most one partly filled page per request).
- FSDP Introduced in §07 · Parallelism
- Fully Sharded Data Parallel — PyTorch's implementation of ZeRO-style sharding: each GPU stores a shard of the parameters and gathers the rest just in time for each layer's compute.
- GAE Introduced in §12 · Value, advantage, baselines
- Generalized Advantage Estimation — a way to trade bias against variance in advantage estimates using a decay parameter λ. The standard advantage signal inside PPO.
- GELU Introduced in §07 · The MLP block
- Gaussian Error Linear Unit — a smooth nonlinearity used inside the MLP. SiLU/SwiGLU are common modern variants.
- generalization Introduced in §01 · What is pre-training?
- How well a model performs on data it never saw during training. The whole point of pre-training is to generalize, not to memorize the corpus.
- Goodhart’s law Introduced in §15 · Reward hacking & over-optimization
- "When a measure becomes a target, it ceases to be a good measure." Optimizing a proxy reward (the measure) eventually diverges from the true objective it stood in for.
- GPUDirect Introduced in §13 · GPU memory hierarchy
- NVIDIA technology that lets the NIC or NVMe DMA straight into/out of GPU HBM, bypassing host RAM.
- GQA Introduced in §12 · The KV cache
- Grouped-Query Attention — multiple query heads share one K/V head, shrinking the KV cache by 4–8× with minimal quality loss.
- gradient Introduced in §03 · How a model learns
- The vector of partial derivatives of the loss with respect to every parameter — it points in the direction of steepest loss increase, so we step the opposite way to reduce the loss.
- gradient accumulation Introduced in §07 · Parallelism
- Summing gradients over several mini-batches before doing one optimizer update, to simulate a large batch size that wouldn't fit in memory all at once.
- gradient checkpointing Introduced in §06 · Compute & memory
- Activation recomputation — saving memory by discarding most activations in the forward pass and recomputing them during the backward pass, trading extra compute for far less memory.
- gradient clipping Introduced in §04 · Optimizers & schedules
- Capping the overall size (norm) of the gradient before the update, to stop occasional huge gradients from destabilizing training.
- gradient descent Introduced in §03 · How a model learns
- The core training algorithm: repeatedly nudge each parameter a small step in the direction that lowers the loss, as told by the gradient.
- group-relative advantage Introduced in §23 · GRPO & DeepSeek-R1
- GRPO’s advantage estimate: a response’s reward minus the mean reward of its group of siblings (often divided by their standard deviation), replacing a learned value function.
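A minimal sketch of the group-relative advantage for one prompt follows. The function name, the `normalize_std` switch, and the small `eps` guard are illustrative assumptions. Real implementations differ in details such as which standard-deviation estimator they use.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, normalize_std=True, eps=1e-8):
    """Advantage of each response relative to its sibling group for one prompt."""
    baseline = mean(rewards)
    centered = [r - baseline for r in rewards]
    if not normalize_std:
        return centered
    spread = pstdev(rewards)
    # eps guards against division by zero when every response gets the same reward.
    return [c / (spread + eps) for c in centered]


# Four sampled responses to one prompt, scored 1 (correct) or 0 (wrong).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Here the correct responses get an advantage near +1 and the wrong ones near −1. A group whose rewards are all identical yields all-zero advantages, so it contributes no learning signal.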
- GRPO Introduced in §23 · GRPO & DeepSeek-R1
- Group Relative Policy Optimization (Shao, 2024) — drop PPO’s critic; sample a group of responses per prompt and use their mean reward as the baseline, giving a group-relative advantage. Memory-cheap RL that powered DeepSeek-R1.
- HBM Introduced in §13 · GPU memory hierarchy
- High-Bandwidth Memory — the DRAM stack soldered next to the GPU die. H100 SXM has 80 GB at ~3.35 TB/s.
- head Introduced in §05 · Multi-head attention
- One independent attention computation. Multi-head splits d_model into N parallel heads, each learning its own pattern.
- head_dim Introduced in §05 · Multi-head attention
- d_model / num_heads — the dimension each attention head operates in.
- helpful, honest, harmless Introduced in §03 · The alignment problem
- The "HHH" framing (from Anthropic) of what an aligned assistant should be: useful to the user, truthful, and unlikely to cause harm.
- hyper-connections Introduced in §26 · DeepSeek-V4
- A generalization of residual connections that learns richer ways to combine the inputs and outputs of layers. DeepSeek-V4 uses a Manifold-Constrained variant (mHC) in place of plain residuals.
- hyperparameter Introduced in §04 · Optimizers & schedules
- A training setting you choose rather than learn — learning rate, batch size, number of layers, etc. Tuning these well is much of the craft of pre-training.
- implicit reward Introduced in §16 · Direct Preference Optimization
- In DPO, the reward is never trained explicitly; it is implied by the log-ratio between the policy and the reference. Optimizing the DPO loss is equivalent to RLHF under that implied reward.
- importance sampling Introduced in §13 · TRPO to PPO
- Reweighting samples from one distribution to estimate expectations under another, via the probability ratio π_new/π_old. The ratio PPO clips comes from here.
- in-context learning Introduced in §15 · GPT-3
- A model performing a new task purely from examples or instructions placed in its prompt, with no gradient updates. GPT-3 showed this emerges from pure next-token pre-training at scale.
- inference Introduced in §01 · What is an LLM?
- Running a trained model to produce outputs. Training learns the weights once; inference uses them many times.
- inference scaling Introduced in §21 · Inference scaling & o1
- The empirical finding that accuracy improves predictably as you spend more test-time compute (longer reasoning, more samples) — a second scaling axis beyond model and data size.
- initialization Introduced in §05 · Precision & numerics
- The scheme for setting parameters before training starts. Good initialization keeps activations and gradients at sane scales through a deep network so training can get going.
- instruction tuning Introduced in §04 · Instruction tuning is born
- Fine-tuning on many tasks phrased as natural-language instructions so the model learns to follow instructions in general — including ones it never saw in training.
- interconnect Introduced in §07 · Parallelism
- The high-speed network linking GPUs — NVLink within a node, InfiniBand/Ethernet across nodes. Its bandwidth and latency cap how aggressively you can shard a model.
- IPO Introduced in §17 · The DPO zoo
- Identity Preference Optimization — a DPO variant that replaces the logistic loss with a squared loss to avoid overfitting to deterministic preferences.
- IsoFLOP Introduced in §16 · Chinchilla
- A curve of loss versus model size at a fixed compute budget ("iso" = equal FLOPs). Its minimum reveals the compute-optimal model size; Chinchilla used IsoFLOP profiles to find the 20:1 rule.
- ITL Introduced in §11 · Prefill and decode
- Inter-Token Latency — gap between consecutive generated tokens during decode.
- key Introduced in §04 · Attention
- A vector saying “what I represent”. Compared against queries to compute attention scores.
- KL divergence Introduced in §02 · From next-token to behavior
- Kullback–Leibler divergence — a measure of how far one probability distribution is from another. Used in post-training as a "leash" that keeps a model close to a reference policy.
- KL penalty Introduced in §14 · PPO for RLHF in practice
- A term added to the RLHF reward that subtracts β times the KL divergence from the reference policy, keeping the optimized model from drifting too far while chasing reward.
- knowledge distillation Introduced in §20 · Gemma 2
- Training a smaller "student" model to match the full output probability distribution of a larger "teacher" model, rather than just the one-hot next token. Richer targets let the student learn more per token.
- KTO Introduced in §17 · The DPO zoo
- Kahneman–Tversky Optimization — a preference method using a prospect-theory loss on unpaired, binary good/bad labels, so you don’t need matched preference pairs.
- KV cache Introduced in §12 · The KV cache
- The stored keys and values from all past tokens, so that at step t the model computes Q, K, and V only for the new token and reuses the cached K and V of every earlier token instead of recomputing them.
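To get a feel for how large the cache becomes, the sketch below counts the bytes needed to hold K and V for every layer and cached token. The function name and the configuration (32 layers, head dimension 128, 16-bit values, 8,192 tokens) are hypothetical and chosen only for illustration.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size=1, bytes_per_value=2):
    """Bytes needed to store keys and values for every layer and cached token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value  # 2 = K and V
    return per_token * seq_len * batch_size


# Hypothetical config: 32 layers, head_dim 128, 16-bit values, 8,192 cached tokens.
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=8192)
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=8192)
print(mha / 2**30, gqa / 2**30)  # cache size in GiB
```

With 32 K/V heads this configuration needs 4 GiB per sequence. Sharing K/V across groups of four query heads (8 K/V heads) cuts that to 1 GiB.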
- label smoothing Introduced in §09 · Attention Is All You Need
- Softening the one-hot target so a little probability mass is spread over all other tokens. It slightly worsens perplexity but discourages overconfidence and often improves downstream quality.
- language model Introduced in §01 · What is pre-training?
- A model that assigns probabilities to sequences of tokens — in practice, one that predicts the probability distribution of the next token given the preceding ones.
- layer Introduced in §08 · A full transformer block
- One transformer block (attention + MLP + residuals + norms). Modern LLMs stack 32–120 of them.
- LayerNorm Introduced in §05 · Precision & numerics
- Layer Normalization — rescales each token's activation vector to zero mean and unit variance (then applies learned scale/shift), stabilizing training. RMSNorm is the cheaper modern variant.
- learning rate Introduced in §04 · Optimizers & schedules
- The size of each parameter step. Too high and training diverges; too low and it crawls. The single most important hyperparameter in pre-training.
- learning-rate schedule Introduced in §04 · Optimizers & schedules
- A plan for changing the learning rate over training: typically a short warmup ramp followed by a long cosine or linear decay down to a small final value.
- length / format reward Introduced in §24 · GRPO refinements
- Auxiliary reward terms that shape output length or enforce a required format (e.g. putting reasoning in tags, the answer in a box) — used to keep reasoning-RL outputs usable.
- likelihood Introduced in §02 · From next-token to behavior
- The probability a model assigns to observed data. Supervised fine-tuning maximizes the likelihood of human-written target responses given their prompts.
- LLM Introduced in §01 · What is an LLM?
- Large Language Model — a neural network trained on huge text corpora to predict the next token given previous tokens.
- LM head Introduced in §09 · Stacking into a full model
- Language-Model head — the final linear projection from hidden states (d_model) back to vocab size, producing logits over every token. "Head" because it sits atop the transformer stack like the head of a body; "LM" because it's the layer specialized for the language-modeling (next-token-prediction) objective.
- load balancing Introduced in §18 · DeepSeek-V3
- Keeping tokens spread evenly across a Mixture-of-Experts layer's experts, so no single expert (or the GPU holding it) becomes a bottleneck while others sit idle.
- logit soft-capping Introduced in §20 · Gemma 2
- Bounding the model's logits (and/or attention scores) with a scaled tanh so they can't grow without limit, improving training stability. Used in Gemma 2.
- logits Introduced in §09 · Stacking into a full model
- The raw, pre-softmax scores the model produces — one per vocabulary token, per position. Bigger logit = the model finds that token more likely; the actual value can be any real number, positive or negative. Applying softmax across the vocabulary turns logits into a probability distribution that sums to 1. Sampling then picks one token from that distribution.
- long chain-of-thought Introduced in §21 · Inference scaling & o1
- Extended internal reasoning — thousands of tokens of self-correction, backtracking, and exploration — that reasoning-RL elicits and that test-time scaling rewards.
- loss function Introduced in §02 · The objective
- A single number measuring how wrong the model's predictions are on a batch of data. Training works by adjusting parameters to make this number smaller.
- loss landscape Introduced in §03 · How a model learns
- The (extremely high-dimensional) surface of loss as a function of the parameters. Training is a walk downhill on this surface toward a low-loss region.
- loss scaling Introduced in §05 · Precision & numerics
- Multiplying the loss by a large constant before backprop, then dividing the resulting gradients by the same constant before the update, to push small FP16 gradients up into the format's representable range.
- mantissa Introduced in §05 · Precision & numerics
- The significant-digits part of a floating-point number; more mantissa bits means finer precision. FP16 has 10, BF16 only 7.
- masked language model Introduced in §11 · BERT
- Masked Language Model (MLM) — a pre-training objective (used by BERT) that hides a fraction of tokens and trains the model to fill them in using context from both sides. Contrast with next-token prediction.
- Medusa Introduced in §18 · Speculative decoding
- Adds multiple parallel “medusa heads” onto the base model to propose several future tokens at once — no separate draft model.
- memory bandwidth Introduced in §06 · Compute & memory
- How fast data moves between GPU compute units and high-bandwidth memory. Many training kernels are bandwidth-bound, not compute-bound, so bandwidth often sets real speed.
- MFU Introduced in §06 · Compute & memory
- Model FLOPs Utilization — the fraction of a GPU's peak floating-point throughput actually used for useful model math. Real large-scale runs often land around 30–50%.
- mid-training Introduced in §25 · Qwen3-Coder-Next
- A phase between the main pre-training run and post-training, used to inject specialized data or capabilities (e.g. long context, code-from-execution) while still training the base model on a next-token-style objective.
- mini-batch Introduced in §03 · How a model learns
- The chunk of training examples processed together in one step. Gradients are averaged over the mini-batch, trading off gradient noise against memory and compute.
- mixed-precision training Introduced in §05 · Precision & numerics
- Doing the heavy matrix multiplies in a low-precision format (BF16/FP8) for speed while keeping a high-precision (FP32) copy of the weights and accumulating sensitive sums in FP32 for stability.
- Mixture of Experts Introduced in §18 · DeepSeek-V3
- Mixture of Experts (MoE) — a layer with many parallel sub-networks ("experts") where a router sends each token to only a few. The model has a huge total parameter count but activates only a fraction per token, so compute stays modest.
- MLP Introduced in §07 · The MLP block
- Multi-Layer Perceptron — a stack of dense (matrix-multiply + nonlinearity) layers applied per-token. The transformer’s feed-forward block.
- model collapse Introduced in §21 · Synthetic data
- Degradation that can occur when models are trained on too much model-generated data over generations, as rare patterns in the distribution get washed out. Observed for some pure-synthetic mixtures, not for moderate rephrased-data ratios.
- momentum Introduced in §04 · Optimizers & schedules
- An optimizer trick that accumulates a running average of past gradients, letting updates build up speed in consistent directions and damp out oscillations.
- MQA Introduced in §12 · The KV cache
- Multi-Query Attention — extreme GQA where all query heads share a single K/V head.
- multi-head attention Introduced in §09 · Attention Is All You Need
- Running several attention operations ("heads") in parallel, each with its own learned projections, so the layer can track many kinds of relationships at once, then concatenating their outputs.
- Multi-head Latent Attention Introduced in §18 · DeepSeek-V3
- Multi-head Latent Attention (MLA) — DeepSeek's attention variant that compresses the keys and values into a small shared low-rank latent vector, drastically shrinking the KV cache while keeping multi-head expressivity.
- Multi-Token Prediction Introduced in §18 · DeepSeek-V3
- Multi-Token Prediction (MTP) — a training objective where the model predicts several future tokens at each position (not just the next one), densifying the learning signal and enabling faster speculative decoding later.
- multi-turn RL Introduced in §26 · Agentic & tool-use RL
- RL where an episode spans many interaction turns (with a user or an environment), requiring credit assignment across turns rather than within one response.
- multimodal Introduced in §22 · Gemma 3
- A model that handles more than one input type — e.g. text plus images (or audio). Pre-training can fold in non-text data via encoders that turn it into token-like embeddings.
- Muon Introduced in §24 · Kimi K2.5
- A newer optimizer (Momentum Orthogonalized by Newton-Schulz) that orthogonalizes each weight-matrix update instead of scaling it per-element like Adam. Used at scale by Kimi K2.5 via the MuonClip variant.
- MuonClip Introduced in §24 · Kimi K2.5
- A stabilized variant of the Muon optimizer (used by the Kimi models) that clips/rescales attention query-key logits to prevent the loss spikes that can derail very large training runs.
- native multimodal pre-training Introduced in §24 · Kimi K2.5
- Training on a mix of text and other modalities (e.g. vision) from the very start, with a constant ratio, rather than bolting a modality onto a finished text model late in training. Kimi K2.5's approach.
- negative log-likelihood Introduced in §02 · The objective
- Another name for the cross-entropy LM (Language Model) loss: −log of the probability the model gave to the correct token. Big when the model was confidently wrong, small when it was confidently right.
- neural network Introduced in §01 · What is an LLM?
- A function built by stacking many simple operations — mostly matrix multiplies with nonlinearities between them — whose behavior is shaped by tuning billions of internal numbers (its parameters) from data.
- next sentence prediction Introduced in §11 · BERT
- Next Sentence Prediction (NSP) — a secondary BERT objective: given two sentences, predict whether the second actually follows the first. Later work found it largely unnecessary.
- next-token prediction Introduced in §02 · The objective
- The pre-training objective for GPT-style models: given the tokens so far, predict a probability distribution over the next token. Also called causal or autoregressive language modeling.
- nonlinear function Introduced in §01 · What is an LLM?
- A function whose output isn't just a scaled, shifted copy of its input — e.g. ReLU, GELU, sigmoid. Stacking nonlinearities between matrix multiplies is what lets a neural net represent anything more interesting than scaling and rotation.
- NTK-aware scaling Introduced in §06 · Positional encoding
- A RoPE-extension trick: instead of linearly shrinking all positions (which over-compresses the fast-spinning low-i pairs), adjust the rotation base — the 10000 in 10000^(2i/d) — so high-frequency pairs are preserved while only the slow pairs get stretched. Named after the Neural Tangent Kernel theory it was originally motivated by. Better quality than plain position interpolation at modest extension factors.
- NVLink Introduced in §13 · GPU memory hierarchy
- Nvidia’s high-speed GPU-to-GPU interconnect. H100 NVLink ≈ 900 GB/s per GPU — much faster than PCIe.
- off-policy Introduced in §11 · Policy gradients & REINFORCE
- RL that learns from data generated by a different (older or separate) policy. DPO and rejection-sampling methods are off-policy / offline.
- offline RL Introduced in §16 · Direct Preference Optimization
- Optimizing from a fixed dataset of responses and preferences without generating new rollouts during training. DPO and rejection-sampling methods are offline.
- omni-modal Introduced in §27 · Qwen3.5-Omni
- A model natively pre-trained to handle all major modalities at once — text, images, audio, and video — jointly, rather than text plus a single added modality.
- on-policy Introduced in §11 · Policy gradients & REINFORCE
- RL where the data used to update the policy was generated by the current policy. PPO and GRPO are (approximately) on-policy; they resample as the policy changes.
- one-hot vector Introduced in §02 · The objective
- A vector that is 1 at a single index and 0 everywhere else. The target next token is represented as a one-hot over the vocabulary; cross-entropy compares the model's distribution against it.
- online RL Introduced in §14 · PPO for RLHF in practice
- RL that generates fresh rollouts from the current policy during training (e.g. PPO, GRPO). Expensive but adaptive, since the data tracks the improving policy.
- optimizer Introduced in §04 · Optimizers & schedules
- The rule that turns gradients into parameter updates. Plain gradient descent is the simplest; Adam-family optimizers add per-parameter adaptive step sizes and dominate LLM training.
- optimizer states Introduced in §06 · Compute & memory
- Extra per-parameter values an optimizer maintains — for Adam, the first and second moment estimates. In FP32 these add 8 bytes per parameter, often dwarfing the weights themselves.
- ORPO Introduced in §17 · The DPO zoo
- Odds-Ratio Preference Optimization — folds SFT and preference optimization into a single reference-free stage using an odds-ratio penalty term.
- outcome reward model (ORM) Introduced in §20 · Process vs outcome rewards
- A reward model that scores only the final answer of a solution, ignoring how it was reached. Simpler than a PRM but gives sparser credit.
- over-training Introduced in §17 · Llama 3
- Deliberately training a model on far more tokens than the compute-optimal ~20 per parameter. This spends more training compute than the compute-optimal recipe would for the same quality, but the resulting model is much smaller and therefore cheaper to run at inference.
- overfitting Introduced in §01 · What is pre-training?
- When a model memorizes training-set quirks instead of learning general patterns, so it does well on training data but poorly on new data. Rarely the main worry in single-epoch LLM pre-training, but it shapes data choices.
- padding Introduced in §08 · The data pipeline
- Filler tokens added to a sequence to reach a fixed length. Padding wastes compute — the model still processes the meaningless tokens — which is exactly what sequence packing exists to avoid.
- page Introduced in §15 · Paged attention
- A fixed-size slab of KV cache memory (e.g. 16 tokens). The unit vLLM allocates and frees.
- page table Introduced in §15 · Paged attention
- Per-sequence mapping from logical position → physical page in the KV cache. Same idea as OS virtual memory, applied to attention.
- pairwise comparison Introduced in §07 · Learning from human preferences
- Asking a labeler which of two responses is better, rather than scoring each on an absolute scale. Easier and more reliable for humans, and the basis of the Bradley–Terry model.
- parameters Introduced in §01 · What is an LLM?
- The numbers (weights) inside a model that get adjusted during training. A “7B model” has 7 billion of them.
- PCIe Introduced in §13 · GPU memory hierarchy
- The bus between the GPU and the host (CPU/RAM/NVMe). PCIe Gen5 x16 ≈ 64 GB/s — far slower than HBM.
- perplexity Introduced in §02 · The objective
- The exponential of the cross-entropy loss — roughly "how many equally-likely tokens is the model choosing between?" Lower is better; a perplexity of 1 means perfect prediction.
- pipeline bubble Introduced in §07 · Parallelism
- Idle GPU time at the start and end of a pipeline-parallel batch, while stages wait for the first micro-batches to flow through. Splitting the batch into more (and therefore smaller) micro-batches shrinks the bubble.
- pipeline parallelism Introduced in §19 · Scaling out
- Splitting the model layer-wise across GPUs. Each GPU owns a contiguous slab of layers; activations flow from one to the next.
- policy Introduced in §11 · Policy gradients & REINFORCE
- In RL, the thing that chooses actions — here, the language model itself, viewed as a distribution over next tokens given the context. RL post-training optimizes the policy.
- policy gradient Introduced in §11 · Policy gradients & REINFORCE
- A family of RL methods that directly adjust the policy’s parameters in the direction that increases expected reward, using the score-function (REINFORCE) estimator.
- position interpolation (PI) Introduced in §06 · Positional encoding
- A RoPE-extension trick: linearly scale incoming positions down so a model trained at length L "sees" a longer context as if it were still length L. To go from 4k to 16k, divide all positions by 4 before rotating. Cheap, effective for short extensions, but degrades quality on the tasks the model was already good at.
- positional encoding Introduced in §06 · Positional encoding
- Information added to embeddings so the model knows where each token sits in the sequence.
- post-training Introduced in §01 · What is post-training?
- Everything done to a model after pre-training to turn a raw next-token predictor into a useful assistant: supervised fine-tuning, RLHF, and RL from verifiable rewards.
- power law Introduced in §14 · Scaling laws
- A relationship of the form y = a·x^(−b): on log-log axes it's a straight line. Pre-training loss follows a power law in scale, so each 10× of compute cuts loss by a roughly constant factor (a constant drop in log-loss, not in loss itself).
- PPO Introduced in §13 · TRPO to PPO
- Proximal Policy Optimization (Schulman, 2017) — approximates TRPO’s trust region with a simple clipped surrogate objective. The RLHF workhorse: stable, simple, widely used.
- pre-norm Introduced in §12 · GPT-2
- Placing the normalization layer before each sub-layer (inside the residual branch) rather than after it. Pre-norm transformers are far more stable to train at depth, and became standard after GPT-2.
- pre-training Introduced in §01 · What is pre-training?
- The first phase of building a language model: training on an enormous corpus of raw text to predict the next token, learning general-purpose language ability before any task-specific tuning.
- preference data Introduced in §07 · Learning from human preferences
- Data where humans (or an AI) compare two or more model responses to the same prompt and mark which is better. The training signal for reward models and DPO.
- prefill Introduced in §11 · Prefill and decode
- The first forward pass that processes the entire prompt at once. Compute-bound, parallel over prompt tokens.
- prefix caching Introduced in §16 · Prefix caching
- Sharing KV pages across requests that start with the same tokens (system prompts, few-shot prefixes), so the prefill is computed once.
- process reward model (PRM) Introduced in §20 · Process vs outcome rewards
- A reward model that scores each step of a reasoning chain, not just the final answer — giving denser, better-targeted credit. Trained on per-step correctness labels.
- process supervision Introduced in §20 · Process vs outcome rewards
- Training or rewarding a model on the correctness of intermediate reasoning steps rather than just outcomes — the idea behind PRMs and "Let’s Verify Step by Step."
- prompt Introduced in §01 · What is an LLM?
- The input text fed to the model — what you want it to continue or respond to.
- quality filtering Introduced in §08 · The data pipeline
- Discarding low-value text (spam, boilerplate, gibberish) using heuristics and trained classifiers, keeping the corpus closer to the kind of text you want the model to learn.
- query Introduced in §04 · Attention
- A vector asking “what am I looking for in other tokens?”. Computed per token, used to score against keys.
- RAFT Introduced in §18 · Rejection-sampling alignment
- Reward-rAnked Fine-Tuning — iteratively sample, rank by reward, and fine-tune on the top responses. Offline, RL-free preference alignment.
- RDMA Introduced in §13 · GPU memory hierarchy
- Remote Direct Memory Access: letting one node’s NIC read or write another node’s memory directly, without involving the CPU. The mechanism that InfiniBand and RoCE provide.
- reasoning model Introduced in §21 · Inference scaling & o1
- A model trained (usually with RL) to produce long internal chains of thought before answering, trading inference compute for accuracy on hard problems. o1 and DeepSeek-R1 are examples.
- reference model Introduced in §08 · RLHF scales to language
- A frozen copy of the policy (usually the SFT model) that RLHF and DPO stay close to via a KL penalty, preventing the optimized policy from drifting into degenerate text.
- REINFORCE Introduced in §11 · Policy gradients & REINFORCE
- The basic Monte-Carlo policy-gradient estimator (Williams, 1992): scale the gradient of each action’s log-probability by the reward (or advantage) it earned. Everything else builds on it.
- REINFORCE++ Introduced in §24 · GRPO refinements
- A critic-free baseline that adds PPO-style stabilizers (token-level KL, clipping) to plain REINFORCE, aiming for robustness without a value network.
- rejection sampling Introduced in §18 · Rejection-sampling alignment
- Generate several candidate responses, keep only the best-scoring one(s) by some reward or verifier, and fine-tune on those. A simple, stable, RL-free way to improve a model.
- residual connection Introduced in §08 · A full transformer block
- output = x + f(x). Lets gradients flow through deep stacks and means each block adds a refinement rather than rewriting.
- return Introduced in §11 · Policy gradients & REINFORCE
- The total (often discounted) reward accumulated over a trajectory. Policy-gradient methods push up the probability of actions that led to high return.
- reward Introduced in §11 · Policy gradients & REINFORCE
- A scalar signal saying how good an outcome was. In post-training it can come from a learned reward model, a verifier, or a rule, and is what RL maximizes.
- reward ensemble Introduced in §15 · Reward hacking & over-optimization
- Using several reward models and aggregating (e.g. taking the minimum) to make hacking harder — a policy must fool all of them at once.
- reward hacking Introduced in §15 · Reward hacking & over-optimization
- When a policy finds ways to score high on the reward model without actually being better — exploiting quirks of an imperfect proxy. A central danger of RL post-training.
- reward model (RM) Introduced in §09 · Reward models
- A model trained from human preference data to output a scalar score for how good a response is. Stands in for a human judge so RL can query reward millions of times.
- reward over-optimization Introduced in §15 · Reward hacking & over-optimization
- Pushing the policy so hard against a proxy reward that true quality starts to fall even as the proxy keeps rising — the quantitative face of reward hacking (Gao et al., 2022).
- RewardBench Introduced in §27 · Recap
- A standard benchmark for evaluating reward models across chat, safety, and reasoning, making reward-model quality measurable and comparable.
- RLAIF Introduced in §10 · RLAIF & Constitutional AI
- Reinforcement Learning from AI Feedback — replace human preference labels with labels from another model (or the model itself), making the feedback loop cheap and scalable.
- RLHF Introduced in §07 · Learning from human preferences
- Reinforcement Learning from Human Feedback — train a reward model on human preference comparisons, then optimize the policy against that reward with RL (typically PPO), with a KL leash to a reference.
- RLOO Introduced in §24 · GRPO refinements
- REINFORCE Leave-One-Out — use the average reward of the other samples in a group as each sample’s baseline. A simple, critic-free policy-gradient method for LLMs.
- RLVR Introduced in §22 · RL from verifiable rewards
- Reinforcement Learning from Verifiable Rewards — use an automatic checker (unit tests, an answer key, a math grader) as the reward instead of a learned reward model. No reward hacking of a neural proxy.
- RMSNorm Introduced in §08 · A full transformer block
- Root Mean Square Normalization — a normalization layer that divides each activation by the root-mean-square (√(mean(x²))) of the whole vector, then multiplies by a learned per-dimension scale. Cheaper than LayerNorm (no mean subtraction, no learned bias) and empirically just as good. Standard in Llama-class models.
- rollout Introduced in §11 · Policy gradients & REINFORCE
- A complete generated sample from the policy — for an LLM, one full response to a prompt. RL collects rollouts, scores them, and updates the policy.
- RoPE Introduced in §06 · Positional encoding
- Rotary Position Embeddings — rotates Q/K vectors by an angle proportional to position. Standard in modern LLMs.
- sampling Introduced in §10 · Sampling
- Choosing the next token from logits — greedy (argmax), temperature scaling, top-k, top-p, etc.
- scalable oversight Introduced in §10 · RLAIF & Constitutional AI
- The challenge of supervising models on tasks too hard or numerous for humans to label directly — addressed by AI feedback, critiques, and verifiers.
- scaled dot-product attention Introduced in §04 · Attention
- softmax(QKᵀ / √d_k) · V: the canonical attention formula from “Attention Is All You Need”.
- scaling hypothesis Introduced in §12 · GPT-2
- The idea — crystallized around GPT-2 — that simply scaling up model size, data, and compute keeps improving capabilities, without needing fundamentally new architectures.
- scaling laws Introduced in §14 · Scaling laws
- Empirical formulas showing that test loss falls as a smooth power law in model size, dataset size, and compute. They let you predict a large model's performance from small experiments.
- scheduler Introduced in §14 · Continuous batching
- The component that picks which requests run in the next forward pass given GPU memory and policy constraints.
- score-function estimator Introduced in §11 · Policy gradients & REINFORCE
- The identity ∇E[R] = E[R · ∇log π] that lets us estimate a reward gradient by sampling, even though the reward itself isn’t differentiable in the model’s parameters.
- self-attention Introduced in §09 · Attention Is All You Need
- Attention where the queries, keys, and values all come from the same sequence, so each token can gather information from every other token. The core operation of the transformer.
- self-consistency Introduced in §19 · Bootstrapping reasoning
- Sample many chain-of-thought solutions and take the majority-vote answer. A test-time technique that trades extra compute for accuracy.
- Self-Instruct Introduced in §06 · Synthetic & self-generated data
- A method that bootstraps instruction-tuning data from a model itself: seed it with a few tasks, have it generate many more, filter, and fine-tune. Made instruction data cheap and synthetic.
- self-supervised learning Introduced in §01 · What is pre-training?
- Training where the labels come for free from the data itself — e.g. hide the next word and ask the model to predict it. No human annotation needed, which is what makes training on trillions of tokens possible.
- SentencePiece Introduced in §08 · The data pipeline
- A tokenizer toolkit that operates directly on raw text (treating spaces as symbols), so it works language-agnostically without pre-splitting on whitespace.
- sequence packing Introduced in §08 · The data pipeline
- Concatenating many short documents into full-length training sequences (with separators) so no compute is wasted padding to the context length.
- sequence parallelism Introduced in §07 · Parallelism
- Splitting the work along the token/sequence dimension across GPUs, often paired with tensor parallelism to shard the normalization and dropout activations that tensor parallelism alone leaves replicated on every GPU.
- SGD Introduced in §04 · Optimizers & schedules
- Stochastic Gradient Descent — gradient descent using a noisy gradient estimated from one mini-batch at a time rather than the whole dataset.
- SimPO Introduced in §17 · The DPO zoo
- Simple Preference Optimization — a reference-free DPO variant using a length-normalized implicit reward plus a target margin, removing the need for a reference model.
- sliding-window attention Introduced in §20 · Gemma 2
- Restricting attention to a fixed-size window of nearby tokens instead of the whole sequence. Cheaper and smaller-KV than global attention; modern models interleave local (windowed) and global layers.
- SLO Introduced in §20 · Throughput vs latency
- Service Level Objective — a target like “p99 TTFT < 1 s”. Serving systems are tuned to maximize throughput subject to SLOs.
- softmax Introduced in §04 · Attention
- Function that turns any vector into a probability distribution (positive, sums to 1) by exponentiating and normalizing.
- span corruption Introduced in §13 · T5
- T5's pre-training objective: replace random contiguous spans of tokens with sentinel placeholders and train the model to reconstruct the missing spans. A denoising objective.
- special tokens Introduced in §05 · The SFT stage in practice
- Reserved tokens (e.g. role markers and end-of-turn markers) added to the vocabulary to delimit structure that ordinary text tokens cannot express.
- speculative decoding Introduced in §18 · Speculative decoding
- A small draft model proposes K tokens; the big target model verifies them all in one pass. Net effect: more tokens per target-model step.
- SRAM Introduced in §13 · GPU memory hierarchy
- Static Random-Access Memory — the on-chip scratchpad / L1+shared memory inside each SM. Tiny (~100s of KB per SM) but ~10× faster than HBM.
- SSM / hybrid architectures Introduced in §12 · The KV cache
- State-Space Models (SSMs) replace attention with a recurrent operator (Mamba, RWKV) that compresses the entire past into a fixed-size hidden state — no KV cache to grow with sequence length. Hybrids (Jamba, Zamba, RecurrentGemma) interleave a few attention layers with many SSM layers, keeping most of the recall power of attention while shrinking the KV cache by 5–20×. They're a different bet on the same memory problem.
- STaR Introduced in §19 · Bootstrapping reasoning
- Self-Taught Reasoner (Zelikman, 2022) — generate chain-of-thought rationales, keep those that reach the correct answer, fine-tune on them, and repeat. Bootstraps reasoning from a model’s own correct attempts.
- supervised fine-tuning (SFT) Introduced in §04 · Instruction tuning is born
- Training a pre-trained model on curated (prompt, response) pairs with the ordinary next-token objective, so it imitates demonstrated assistant behavior. The first stage of post-training.
- SwiGLU Introduced in §07 · The MLP block
- A gated MLP variant (Llama, PaLM): output = SiLU(xW₁) ⊙ (xW₂), then projected. Outperforms plain MLPs at the same param count.
- SXM5 Introduced in §13 · GPU memory hierarchy
- Server PCI eXpress Module, 5th generation: Nvidia's proprietary mezzanine board form factor for datacenter GPUs. (Despite the name, an SXM module does not sit in a PCIe card slot, although its link to the host CPU still runs over PCIe.) An H100 SXM5 module plugs directly into the motherboard via the SXM socket, which gives it more power (700 W vs ~350 W for PCIe), more NVLink bandwidth (900 GB/s per GPU), and higher HBM bandwidth than the PCIe variant of the same chip. Standard in HGX/DGX servers; what you get in most cloud H100 instances.
- sycophancy Introduced in §03 · The alignment problem
- A failure mode where a model tells the user what it thinks they want to hear rather than what is true or correct — often a side effect of preference optimization.
- synthetic data Introduced in §21 · Synthetic data
- Training text generated by another model or an automated pipeline, rather than scraped from humans. Used to augment scarce high-quality data; its benefits in pre-training are conditional.
- system prompt Introduced in §05 · The SFT stage in practice
- A special leading instruction that sets the assistant’s persona, rules, and constraints for a conversation, separate from the user’s turns.
- teacher forcing Introduced in §02 · The objective
- During training, feeding the model the true previous tokens (not its own guesses) at every position, so all next-token predictions in a sequence can be learned in parallel.
- temperature Introduced in §10 · Sampling
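- Divides logits before softmax. <1 sharpens (more deterministic), >1 flattens (more random). As temperature approaches 0, sampling approaches greedy (argmax); a setting of 0 is treated as greedy by convention, since dividing by zero is undefined.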
- tensor parallelism Introduced in §19 · Scaling out
- Splitting each weight matrix across N GPUs. Every GPU does a slice of every layer; activations get all-reduced across them.
- test-time compute Introduced in §21 · Inference scaling & o1
- Compute spent at inference — longer chains of thought, more samples — to improve answer quality, as opposed to compute spent during training.
- text-to-text Introduced in §13 · T5
- T5's framing in which every task — translation, classification, summarization — is cast as "input text → output text", so one model and one objective handle all of them.
- throughput Introduced in §11 · Prefill and decode
- Total tokens generated per second across all concurrent requests. Often traded against per-request latency.
- token Introduced in §02 · Tokens
- The atomic unit of text the model sees, roughly a word fragment. Tokenization maps a piece of text to a list of token IDs.
- token ID Introduced in §02 · Tokens
- An integer index into the vocabulary that uniquely identifies a token.
- tokenizer Introduced in §08 · The data pipeline
- The program that converts raw text into a sequence of integer token IDs (and back). Its vocabulary and merge rules are fixed before pre-training begins.
- tokens per parameter Introduced in §16 · Chinchilla
- The ratio of training tokens to model parameters (D/N). Chinchilla's compute-optimal point is around 20; modern models often deliberately exceed it to get smaller, cheaper-to-serve models.
- tool-use RL Introduced in §26 · Agentic & tool-use RL
- Training a model with RL to call external tools (search, code execution, calculators) effectively, rewarding trajectories that use tools to reach correct outcomes.
- top-k Introduced in §10 · Sampling
- Only sample from the k highest-probability tokens; zero out the rest.
- top-p Introduced in §10 · Sampling
- Nucleus sampling — keep the smallest set of tokens whose cumulative probability ≥ p, sample from that set.
- training step Introduced in §03 · How a model learns
- One iteration of the loop: forward pass on a batch, backward pass to get gradients, optimizer update. A large model is trained for hundreds of thousands of steps.
- trajectory Introduced in §11 · Policy gradients & REINFORCE
- The sequence of states and actions in a rollout. For text generation, the tokens generated one after another, each conditioned on those before it.
- transfer learning Introduced in §10 · GPT-1
- Learning general skills on one task (here, next-token prediction on huge text) and reusing them on other tasks. Pre-training plus adaptation is the transfer-learning recipe behind modern LLMs.
- transformer Introduced in §09 · Attention Is All You Need
- The neural-network architecture introduced in "Attention Is All You Need" (2017), built from stacked self-attention and feed-forward layers. Every model in this explainer is a transformer.
- TRPO Introduced in §13 · TRPO to PPO
- Trust Region Policy Optimization (Schulman, 2015) — take the largest policy-gradient step that stays within a trust region (a KL bound), guaranteeing stable improvement. PPO’s parent.
- truncation Introduced in §08 · The data pipeline
- Cutting a document off at the model's maximum context length and discarding the rest. It avoids overflow but throws away data and can split documents mid-thought.
- trust region Introduced in §13 · TRPO to PPO
- A bound on how far the policy may move in one update (measured in KL divergence), so the update stays in the region where the local approximation is trustworthy.
- TTFT Introduced in §11 · Prefill and decode
- Time-to-First-Token: wall-clock time from request submission to the first generated token being returned. Dominated by prefill.
- Tülu 3 Introduced in §25 · Scaling open post-training
- Allen AI’s fully open post-training recipe (2024) — SFT, then DPO, then RLVR — released with data, code, and evals. A reference manual for open post-training.
- turn-level reward Introduced in §26 · Agentic & tool-use RL
- A reward assigned to individual turns or tool calls within a multi-turn trajectory, giving denser feedback than a single end-of-episode reward.
- value Introduced in §04 · Attention
- A vector representing the content actually mixed into the output when a token gets attended to.
- value function Introduced in §12 · Value, advantage, baselines
- The expected return from a given state under the current policy. A learned value function (the critic) provides a baseline that reduces the variance of policy-gradient updates.
- VAPO Introduced in §24 · GRPO refinements
- Value-based Augmented PPO (2025): brings a well-trained critic back for long chain-of-thought RL, building on DAPO’s tricks to beat critic-free methods on reasoning.
- verifier Introduced in §22 · RL from verifiable rewards
- An automatic, often rule-based checker that returns whether a response is correct (e.g. runs unit tests, compares to a known answer). Provides the reward in RLVR.
- vision encoder Introduced in §22 · Gemma 3
- A module (such as SigLIP) that converts an image into a sequence of embedding vectors the language model can attend to, as if they were tokens. The bridge that makes a text model multimodal.
- vLLM Introduced in §01 · What is an LLM?
- An open-source LLM inference engine, originally from UC Berkeley, that introduced paged attention and is now one of the most widely used serving systems for open-weight models.
- vocabulary Introduced in §02 · Tokens
- The fixed set of tokens a model knows about. Modern LLMs have ~32k–200k entries.
- warmup Introduced in §04 · Optimizers & schedules
- Starting training with a tiny learning rate and ramping it up over the first few thousand steps, to avoid blowing up the still-random early model.
- WebText Introduced in §12 · GPT-2
- The dataset behind GPT-2: ~8 million web pages reached via outbound Reddit links from posts with at least 3 karma, with the karma threshold serving as a quality filter. About 40 GB of text.
- weight decay Introduced in §04 · Optimizers & schedules
- A regularizer that shrinks parameters toward zero a little each step, discouraging large weights and improving generalization.
- WordPiece Introduced in §08 · The data pipeline
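- A subword tokenization algorithm (used by BERT) closely related to Byte Pair Encoding. It builds a vocabulary of word pieces by repeatedly merging symbol pairs, choosing the merge that most increases training-data likelihood rather than simply the most frequent pair.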
- YaRN Introduced in §06 · Positional encoding
- Yet another RoPE extensioN method. Combines NTK-aware scaling with a length-dependent attention-score scaling and a "ramp" that smoothly transitions between high- and low-frequency treatment. A widely used, high-quality way to extend a RoPE model's context length with little or no additional training; used, directly or in closely related variants, by Llama-3, Qwen-2, and others to reach 128k+ contexts.
- ZeRO Introduced in §07 · Parallelism
- Zero Redundancy Optimizer — a family of techniques that shard optimizer states, gradients, and optionally parameters across data-parallel GPUs so no device holds a full redundant copy.
- zero-shot Introduced in §12 · GPT-2
- Performing a task from instructions alone, with no examples given. GPT-2 showed a pre-trained LM can do many tasks zero-shot, just by being prompted.