Section 05

Precision & numerics

FP32, BF16, FP8, and mixed precision

So far we’ve reasoned in real numbers. GPUs don’t have those; they have floating-point formats with a fixed number of bits, and the choice of format is one of the highest-leverage decisions in a training run. Picking a smaller format can roughly double throughput and halve memory, but get the numerics wrong and the run can silently corrupt. This chapter is where the ML science meets the hardware.

What a floating-point number is

A floating-point number splits its bits into three fields: a sign, an exponent (which sets the scale: how big or small), and a mantissa (which sets the precision: how many significant digits). More exponent bits means more dynamic range (the span between the smallest and largest magnitudes the format can represent); more mantissa bits means finer resolution. With a fixed bit budget you trade one against the other.

To turn those three fields back into a number, the formula is:

$$\text{value} = (-1)^{\text{sign}} \times \left(1 + \frac{\text{mantissa}}{2^{m}}\right) \times 2^{\,(\text{exponent} - \text{bias})}$$

where $m$ is the number of mantissa bits and $\text{bias} = 2^{(\text{exponent bits} - 1)} - 1$ lets one exponent field represent both very large and very small numbers. (When the exponent field is all zeros, the implicit leading $1$ becomes a $0$ and the power switches to $2^{(1-\text{bias})}$; these are the subnormal numbers near zero.) The widget below shows every field, in bits and in decimal, for any value you pick:

Floating-point formats, bit by bit

Interactive widget: pick a value and watch how each format encodes it. Boxes are grouped into bytes; colors mark sign, exponent (range), and mantissa (precision).

Encodings shown for the value 0.1:

  • FP32 (32 bits · 8 exp / 23 mantissa · full precision · 4 bytes): bits 00111101 11001100 11001100 11001101; sign = 0, exponent = 123 (bias 127 → 2^-4), mantissa = 5033165 / 8388608; max ≈ 3.4e+38, min normal ≈ 1.2e-38; stores as 1.000e-1 (err 0.00%).
  • FP16 (16 bits · 5 exp / 10 mantissa · narrow range · 2 bytes): bits 00101110 01100110; sign = 0, exponent = 11 (bias 15 → 2^-4), mantissa = 614 / 1024; max ≈ 6.6e+4, min normal ≈ 6.1e-5; stores as 9.998e-2 (err 0.02%).
  • BF16 (16 bits · 8 exp / 7 mantissa · FP32 range, coarse · 2 bytes): bits 00111101 11001101; sign = 0, exponent = 123 (bias 127 → 2^-4), mantissa = 77 / 128; max ≈ 3.4e+38, min normal ≈ 1.2e-38; stores as 1.001e-1 (err 0.10%).
  • FP8 (E4M3) (8 bits · 4 exp / 3 mantissa · H100/Blackwell · 1 byte): bits 00011101; sign = 0, exponent = 3 (bias 7 → 2^-4), mantissa = 5 / 8; max ≈ 4.5e+2, min normal ≈ 1.6e-2; stores as 1.016e-1 (err 1.56%).

Watch the exponent bits: try 70000. FP16 has too few exponent values and overflows, while BF16 (same 16 bits, more exponent) stores it. Try 0.1: no binary float can represent it exactly, and the error grows as the mantissa shrinks from FP32 → BF16 → FP8. That range-vs-precision trade is the whole story of training numerics.
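If you want to check the decoding formula by hand, here is a minimal Python sketch that applies it to the bit patterns the widget shows for 0.1. The function name `decode` and the `EXAMPLES` table are ours, not part of any library, and the sketch deliberately ignores the infinity/NaN encodings that real formats reserve.

```python
def decode(bits: int, exp_bits: int, man_bits: int) -> float:
    """Decode a sign/exponent/mantissa bit pattern into a Python float.

    Simplification (assumption of this sketch): the all-ones exponent,
    which IEEE-style formats reserve for inf/NaN, is treated as an
    ordinary exponent.
    """
    sign = (bits >> (exp_bits + man_bits)) & 1
    exponent = (bits >> man_bits) & ((1 << exp_bits) - 1)
    mantissa = bits & ((1 << man_bits) - 1)
    bias = (1 << (exp_bits - 1)) - 1
    fraction = mantissa / (1 << man_bits)
    if exponent == 0:
        # Subnormal: implicit leading 0 and a fixed power of 2^(1 - bias).
        magnitude = fraction * 2.0 ** (1 - bias)
    else:
        # Normal: implicit leading 1.
        magnitude = (1.0 + fraction) * 2.0 ** (exponent - bias)
    return -magnitude if sign else magnitude


# Bit patterns for 0.1, copied from the widget: (bits, exponent bits, mantissa bits).
EXAMPLES = {
    "FP32": (0b00111101110011001100110011001101, 8, 23),
    "FP16": (0b0010111001100110, 5, 10),
    "BF16": (0b0011110111001101, 8, 7),
    "FP8 (E4M3)": (0b00011101, 4, 3),
}

for name, (bits, e, m) in EXAMPLES.items():
    stored = decode(bits, e, m)
    rel_err = abs(stored - 0.1) / 0.1
    print(f"{name:11s} stores {stored:.6e}  (relative error {rel_err:.2%})")
```

The same three-field split works for every format; only the two bit counts change, which is why the range-vs-precision trade can be read straight off the layout.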

The four formats above are the ones that matter for pre-training:

  • FP32: 32-bit, the classic “full precision.” Accurate but twice the memory and bandwidth of the 16-bit formats.
  • FP16: 16-bit with only 5 exponent bits. Half the memory, but its range tops out at 65504, so large activations and small gradients fall off the edges.
  • BF16: 16-bit but with FP32’s full 8-bit exponent, sacrificing mantissa bits instead. It keeps the wide range (so it rarely overflows) at the cost of precision. This is the default format for modern pre-training.
  • FP8: 8-bit (typically E4M3 or E5M2 layouts), the newest training precision, used on H100 and Blackwell GPUs to roughly double throughput again. Its range is tiny, so it demands careful scaling.
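The ranges quoted for these formats follow directly from the bit counts. The sketch below recomputes them; the helper names (`bias`, `min_normal`, `max_finite`) are illustrative, and the `ieee_like=False` branch assumes the common E4M3 convention in which the top exponent code is still usable and only its all-ones mantissa is reserved for NaN.

```python
def bias(exp_bits: int) -> int:
    """Exponent bias: 2^(exponent bits - 1) - 1."""
    return (1 << (exp_bits - 1)) - 1


def min_normal(exp_bits: int) -> float:
    """Smallest normal magnitude: exponent field 1, mantissa 0."""
    return 2.0 ** (1 - bias(exp_bits))


def max_finite(exp_bits: int, man_bits: int, ieee_like: bool = True) -> float:
    """Largest finite magnitude the format can hold."""
    b = bias(exp_bits)
    if ieee_like:
        # All-ones exponent is reserved for inf/NaN, so the top usable
        # exponent code is 2^e - 2, with an all-ones mantissa.
        top_exponent = (1 << exp_bits) - 2 - b
        top_fraction = 2.0 - 2.0 ** -man_bits
    else:
        # E4M3-style (assumed convention): all-ones exponent is usable,
        # only the all-ones mantissa at that exponent encodes NaN.
        top_exponent = (1 << exp_bits) - 1 - b
        top_fraction = 2.0 - 2.0 ** (1 - man_bits)
    return top_fraction * 2.0 ** top_exponent


# (exponent bits, mantissa bits, follows IEEE-style inf/NaN reservation)
FORMATS = {
    "FP32": (8, 23, True),
    "FP16": (5, 10, True),
    "BF16": (8, 7, True),
    "FP8 (E4M3)": (4, 3, False),
}

for name, (e, m, ieee) in FORMATS.items():
    hi = max_finite(e, m, ieee)
    lo = min_normal(e)
    holds_70000 = 70000 <= hi
    print(f"{name:11s} max={hi:.4g}  min normal={lo:.4g}  holds 70000: {holds_70000}")
```

Note that FP16 and BF16 spend the same 16 bits, yet only BF16 can hold 70000: the three extra exponent bits buy range at the price of three mantissa bits.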

Mixed precision: fast where you can, safe where you must

You don’t train entirely in low precision. The standard recipe is mixed-precision training: do the expensive matrix multiplies in BF16 (or FP8) for speed, but keep a master copy of the weights in FP32 and accumulate sensitive sums, like the optimizer’s updates and the loss, in higher precision. The big, throughput-bound operations run fast; the small, accuracy-critical ones stay safe.
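To see why the FP32 master copy matters, consider a toy experiment: a single weight starting at 1.0 receives a thousand updates of 0.001. The sketch below emulates BF16 and FP32 rounding with the standard `struct` module; `to_bf16`, `to_fp32`, and `train` are hypothetical helpers written for this illustration (not a framework API), and the BF16 emulation handles only ordinary finite values.

```python
import struct


def to_fp32(x: float) -> float:
    """Round a Python float (64-bit) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x: float) -> float:
    """Round to the nearest BF16 value (round-to-nearest-even).

    BF16 is the top 16 bits of an FP32 pattern, so we round the FP32
    bits and clear the low 16. NaN/inf handling is omitted in this sketch.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round half to even
    bits &= 0xFFFF0000                   # drop the low 16 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def train(steps: int, update: float, keep_fp32_master: bool) -> float:
    """Apply `steps` identical updates to a weight that starts at 1.0.

    Returns the BF16 weight the matrix multiplies would actually see.
    """
    weight = 1.0
    for _ in range(steps):
        if keep_fp32_master:
            weight = to_fp32(weight + update)  # accumulate in FP32
        else:
            weight = to_bf16(weight + update)  # accumulate in BF16
    return to_bf16(weight)


print("BF16-only weight:  ", train(1000, 1e-3, keep_fp32_master=False))
print("FP32-master weight:", train(1000, 1e-3, keep_fp32_master=True))
```

Near 1.0, adjacent BF16 values are 2^-7 ≈ 0.0078 apart, so an update of 0.001 rounds away every single step and the BF16-only weight never moves. The FP32 master accumulates the same updates and ends up near 2.0.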

Normalization and initialization keep the numbers sane

Two more pieces of machinery exist largely to keep activations and gradients in a healthy numerical range as they flow through a deep stack of layers.

Initialization sets the starting parameters so that, before any training, signals neither explode nor vanish as they pass through dozens of layers. The scale is chosen to keep the variance of activations roughly constant with depth.
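A small experiment makes the variance argument concrete. The sketch below pushes a random vector through a stack of purely linear layers with Gaussian weights of standard deviation `scale / sqrt(width)`. The 1/sqrt(fan-in) rule is one commonly used choice that we assume here for illustration, not necessarily the scheme any particular model uses, and `final_rms` is a hypothetical helper.

```python
import math
import random


def final_rms(scale: float, width: int = 64, depth: int = 10, seed: int = 0) -> float:
    """RMS of activations after `depth` random linear layers.

    Each weight is drawn from N(0, (scale / sqrt(width))^2). With
    scale = 1 every layer preserves the expected squared activation,
    so the RMS stays near its starting value of about 1.
    """
    rng = random.Random(seed)
    std = scale / math.sqrt(width)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    for _ in range(depth):
        # One dense layer: every output is a fresh random combination of inputs.
        x = [sum(rng.gauss(0.0, std) * xi for xi in x) for _ in range(width)]
    return math.sqrt(sum(v * v for v in x) / width)


for scale in (0.5, 1.0, 2.0):
    print(f"scale={scale}: activation RMS after 10 layers = {final_rms(scale):.3g}")
```

Each layer multiplies the typical activation magnitude by roughly `scale`, so after ten layers a scale of 2 has grown the signal about a thousandfold and a scale of 0.5 has shrunk it by the same factor. In a narrow format like FP16 or FP8, either drift quickly runs off the edge of the representable range.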

LayerNorm re-centers and re-scales each token’s activation vector to zero mean and unit variance at every block (then applies a learned scale and shift), so no layer’s outputs drift to extreme magnitudes. Its leaner successor RMSNorm skips the re-centering: it divides the vector by its root-mean-square and applies a learned per-dimension scale, which is cheaper and empirically just as good. Beyond stabilizing training, normalization is what lets very deep transformers be trained at all. Almost every model after the original transformer moves the normalization before each sub-layer (“pre-norm”) for better gradient flow, a small change with a large effect on trainability, which we’ll flag when it appears in GPT-2.
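Both normalizations fit in a few lines. The sketch below is a plain-Python illustration that omits the learned scale and shift parameters for brevity; the function names, the `eps` value, and the toy sub-layer are assumptions for this example rather than any library's API.

```python
import math
from typing import Callable, List

Vector = List[float]


def layer_norm(x: Vector, eps: float = 1e-6) -> Vector:
    """Re-center to zero mean, then re-scale to unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def rms_norm(x: Vector, eps: float = 1e-6) -> Vector:
    """Divide by the root-mean-square; no mean subtraction."""
    mean_sq = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(mean_sq + eps) for v in x]


def pre_norm_block(x: Vector, sublayer: Callable[[Vector], Vector]) -> Vector:
    """Pre-norm: normalize the input to the sub-layer, leave the residual path untouched."""
    return [xi + yi for xi, yi in zip(x, sublayer(rms_norm(x)))]


def post_norm_block(x: Vector, sublayer: Callable[[Vector], Vector]) -> Vector:
    """Post-norm (original transformer ordering): normalize after the residual add."""
    return rms_norm([xi + yi for xi, yi in zip(x, sublayer(x))])


token = [1.0, 2.0, 3.0, 4.0]
huge = [1000.0 * v for v in token]   # same direction, 1000x the magnitude
print("layer_norm:", [round(v, 4) for v in layer_norm(token)])
print("rms_norm:  ", [round(v, 4) for v in rms_norm(token)])
print("rms_norm of 1000x input:", [round(v, 4) for v in rms_norm(huge)])
```

However large the incoming vector, the normalized output has the same unit scale, which is what keeps activations inside the comfortable range of a low-precision format. In the pre-norm block the residual path carries `x` through unmodified, which is the usual explanation for its better gradient flow.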

We now know how numbers are stored and kept stable. The next question is how many of them we can afford — the compute and memory budget of an actual run.