Section 05

Precision & numerics

FP32, BF16, FP8, and mixed precision

So far we’ve reasoned in real numbers. GPUs don’t have those: they have floating-point formats with a fixed number of bits, and the choice of format is one of the highest-leverage decisions in a training run. Halving the bit width can roughly double throughput and halve memory, but if the numerics go wrong the run can be silently corrupted. This chapter is where the ML science meets the hardware.

What a floating-point number is

A floating-point number splits its bits into three fields: a sign, an exponent (which sets the scale, i.e. how big or small), and a mantissa (which sets the precision, i.e. how many significant digits). More exponent bits means more dynamic range; more mantissa bits means finer resolution. With a fixed bit budget you trade one against the other.

To turn those three fields back into a number, the formula is:

$$\text{value} = (-1)^{\text{sign}} \times \left(1 + \frac{\text{mantissa}}{2^{m}}\right) \times 2^{\,(\text{exponent} - \text{bias})}$$

where $m$ is the number of mantissa bits and the $\text{bias} = 2^{(\text{exponent bits} - 1)} - 1$ lets one exponent field represent both very large and very small numbers. (When the exponent field is all zeros, the implicit leading $1$ becomes a $0$ and the power switches to $2^{(1-\text{bias})}$ — these are the subnormal numbers near zero.) The widget below shows every field, in bits and in decimal, for any value you pick:

Floating-point formats, bit by bit

How each format encodes the value 0.1. Bits are grouped into bytes; the fields are sign, exponent (range), and mantissa (precision).

FP32: 32 bits · 8 exp / 23 mantissa · full precision · 4 bytes
bits: 00111101 11001100 11001100 11001101
sign = 0, exponent = 123 (bias 127 → 2^-4), mantissa = 5033165 / 8388608
max ≈ 3.4e+38, min normal ≈ 1.2e-38, stores as 1.000e-1 (err 0.00%)

FP16: 16 bits · 5 exp / 10 mantissa · narrow range · 2 bytes
bits: 00101110 01100110
sign = 0, exponent = 11 (bias 15 → 2^-4), mantissa = 614 / 1024
max ≈ 6.6e+4, min normal ≈ 6.1e-5, stores as 9.998e-2 (err 0.02%)

BF16: 16 bits · 8 exp / 7 mantissa · FP32 range, coarse · 2 bytes
bits: 00111101 11001101
sign = 0, exponent = 123 (bias 127 → 2^-4), mantissa = 77 / 128
max ≈ 3.4e+38, min normal ≈ 1.2e-38, stores as 1.001e-1 (err 0.10%)

FP8 (E4M3): 8 bits · 4 exp / 3 mantissa · H100/Blackwell · 1 byte
bits: 00011101
sign = 0, exponent = 3 (bias 7 → 2^-4), mantissa = 5 / 8
max ≈ 4.5e+2, min normal ≈ 1.6e-2, stores as 1.016e-1 (err 1.56%)

Watch the exponent bits: try 70000 — FP16 has too few exponent values and overflows, while BF16 (same 16 bits, more exponent) stores it. Try 0.1 — no binary float can represent it exactly, and the error grows as the mantissa shrinks from FP32 → BF16 → FP8. That range-vs-precision trade is the whole story of training numerics.
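To make the decoding formula concrete, here is a minimal sketch in plain Python that applies it to the standard bit patterns for 0.1 in each format. The function name `decode` and the table `ENCODINGS_OF_0_1` are our own illustrative choices, and the sketch deliberately ignores the special infinity and NaN encodings.

```python
def decode(bits: int, exp_bits: int, man_bits: int) -> float:
    """Decode a raw bit pattern with the sign/exponent/mantissa formula.

    Illustrative only: infinity and NaN encodings are not handled.
    """
    bias = 2 ** (exp_bits - 1) - 1
    sign = (bits >> (exp_bits + man_bits)) & 1
    exponent = (bits >> man_bits) & ((1 << exp_bits) - 1)
    mantissa = bits & ((1 << man_bits) - 1)
    if exponent == 0:
        # Subnormal: the implicit leading 1 becomes 0, power fixed at 2^(1 - bias).
        magnitude = (mantissa / 2 ** man_bits) * 2.0 ** (1 - bias)
    else:
        magnitude = (1 + mantissa / 2 ** man_bits) * 2.0 ** (exponent - bias)
    return (-1) ** sign * magnitude


# (name, bit pattern for 0.1, exponent bits, mantissa bits)
ENCODINGS_OF_0_1 = [
    ("FP32", 0b00111101_11001100_11001100_11001101, 8, 23),
    ("FP16", 0b00101110_01100110, 5, 10),
    ("BF16", 0b00111101_11001101, 8, 7),
    ("FP8 E4M3", 0b00011101, 4, 3),
]

for name, bits, exp_bits, man_bits in ENCODINGS_OF_0_1:
    value = decode(bits, exp_bits, man_bits)
    rel_err = abs(value - 0.1) / 0.1
    print(f"{name:9s} stores {value:.10f} (relative error {rel_err:.2%})")
```

The relative error grows as the mantissa shrinks, which is the precision half of the trade described above.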

The four formats above are the ones that matter for pre-training:

FP32 — 32-bit, the classic “full precision.” Accurate but twice the memory and bandwidth of the 16-bit formats.
FP16 — 16-bit with only 5 exponent bits. Half the memory, but its range tops out at 65504, so large activations and small gradients fall off the edges.
BF16 — 16-bit but with FP32’s full 8-bit exponent, sacrificing mantissa bits instead. It keeps the wide range (so it rarely overflows) at the cost of precision. This is the default format for modern pre-training.
FP8 — 8-bit, the newest training precision, used on H100 and Blackwell GPUs to roughly double throughput again. Its range is tiny, so it demands careful scaling.
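The range and precision differences between FP16 and BF16 can be checked with nothing but the standard library. The sketch below is a rough illustration: `to_fp16` and `to_bf16` are helper names we made up, the BF16 rounding is done by hand on the FP32 bit pattern (round to nearest, ties to even), and NaN inputs are not handled.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest FP16 value via the stdlib half-precision codec."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        # The codec refuses finite values beyond FP16's range;
        # we map them to infinity to mimic an overflow.
        return float("inf") if x > 0 else float("-inf")


def to_bf16(x: float) -> float:
    """Round x to BF16 by keeping the top 16 bits of its FP32 encoding.

    Round to nearest, ties to even. NaN handling is omitted in this sketch.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding increment
    bits &= 0xFFFF0000                   # drop the low 16 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


for value in (0.1, 70000.0):
    print(f"{value!r}: FP16 -> {to_fp16(value)!r}, BF16 -> {to_bf16(value)!r}")
```

For 70000 the FP16 result is infinity while BF16 returns a nearby finite value; for 0.1 the FP16 result is the closer of the two. Same 16 bits, opposite strengths.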

Mixed precision: fast where you can, safe where you must

You don’t train entirely in low precision. The standard recipe is mixed-precision training: do the expensive matrix multiplies in BF16 (or FP8) for speed, but keep a master copy of the weights in FP32 and accumulate sensitive sums, like the optimizer’s updates and the loss, in higher precision. The big, throughput-bound operations run fast; the small, accuracy-critical ones stay safe.
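A toy experiment shows why the FP32 master copy matters. The sketch below applies the same small update to a single weight many times, once storing the weight in BF16 and once in FP32. The helper names, the step count, and the update size of 1e-3 (standing in for learning rate times gradient) are assumptions for illustration, not a real optimizer.

```python
import struct


def to_fp32(x: float) -> float:
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x: float) -> float:
    """Round to BF16 (round to nearest, ties to even; NaN not handled)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def run(steps: int = 1000, update: float = 1e-3):
    """Apply `update` to a weight of 1.0 `steps` times in two storage formats."""
    bf16_weight = to_bf16(1.0)   # weight stored only in BF16
    master = to_fp32(1.0)        # FP32 master copy
    for _ in range(steps):
        bf16_weight = to_bf16(bf16_weight + update)
        master = to_fp32(master + update)
    # The BF16 copy used by the matrix multiplies is re-derived from the master.
    return bf16_weight, master, to_bf16(master)


bf16_weight, master, compute_copy = run()
print("BF16-only weight:      ", bf16_weight)
print("FP32 master weight:    ", master)
print("BF16 copy of the master:", compute_copy)
```

Near 1.0, adjacent BF16 values are 2^-7 ≈ 0.0078 apart, so each 0.001 update rounds away and the BF16-only weight never moves. The FP32 master accumulates the updates, and the BF16 copy derived from it ends up where it should.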

The FP16 footgun that BF16 fixed

FP16’s narrow range bites in two places: large values overflow to infinity, and tiny gradients underflow to zero and vanish. The original workaround was loss scaling: multiply the loss by a large constant before backprop so that small gradients land inside FP16’s representable window, then divide the gradients by the same constant before the update. BF16’s wide exponent makes most of this unnecessary, which is a big reason the field migrated to it. FP8 brings the range problem back, harder, and solves it with scaling factors applied per tensor or, at finer granularity, per block, a technique DeepSeek-V3 pushes furthest.
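Loss scaling is easy to see on a single number. In the sketch below, a gradient of 1e-8 is stored in FP16 with and without scaling. The scale of 2^16 and the names `LOSS_SCALE` and `to_fp16` are illustrative assumptions. Scaling the loss scales every gradient by the same factor through the chain rule, so we scale the gradient directly here.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest FP16 value (overflow mapped to infinity)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        return float("inf") if x > 0 else float("-inf")


LOSS_SCALE = 2.0 ** 16  # hypothetical constant; real recipes often adapt it

true_grad = 1e-8

# Without scaling: below FP16's smallest subnormal (2^-24 ≈ 6e-8), so it flushes to zero.
naive = to_fp16(true_grad)

# With scaling: the gradient lands in FP16's normal range and survives.
scaled = to_fp16(true_grad * LOSS_SCALE)

# Unscale in higher precision (a Python float here) before the weight update.
recovered = scaled / LOSS_SCALE

print("stored without scaling:", naive)
print("stored with scaling:   ", scaled)
print("after unscaling:       ", recovered)
```

The scale has to be small enough that the largest gradients do not overflow FP16 once multiplied, which is why practical implementations usually adjust it during training.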

Normalization and initialization keep the numbers sane

Two more pieces of machinery exist largely to keep activations and gradients in a healthy numerical range as they flow through a deep stack of layers.

Initialization sets the starting parameters so that, before any training, signals neither explode nor vanish as they pass through dozens of layers. The scale is chosen to keep the variance of activations roughly constant with depth.
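The effect of the initialization scale can be sketched with a stack of random linear layers. This is a simplified illustration under our own assumptions: no nonlinearity, Gaussian weights, and a hypothetical helper `final_rms`. Real schemes also account for the activation function. With weight standard deviation 1/sqrt(width) the signal keeps roughly the same size; larger or smaller scales compound layer after layer.

```python
import math
import random


def final_rms(weight_std: float, width: int = 64, depth: int = 20, seed: int = 0) -> float:
    """Push a random vector through `depth` random linear layers and
    return the root-mean-square of the final activations."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    for _ in range(depth):
        weights = [[rng.gauss(0.0, weight_std) for _ in range(width)]
                   for _ in range(width)]
        x = [sum(w * v for w, v in zip(row, x)) for row in weights]
    return math.sqrt(sum(v * v for v in x) / width)


WIDTH = 64
preserved = final_rms(1.0 / math.sqrt(WIDTH))  # variance-preserving scale
exploded = final_rms(1.0)                      # too large: grows every layer
vanished = final_rms(0.01)                     # too small: shrinks every layer

print(f"std = 1/sqrt(width): RMS after 20 layers = {preserved:.3g}")
print(f"std = 1:             RMS after 20 layers = {exploded:.3g}")
print(f"std = 0.01:          RMS after 20 layers = {vanished:.3g}")
```

The exploded and vanished cases would overflow or underflow any of the 16-bit and 8-bit formats long before the last layer, which is the numerical reason the starting scale matters.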

LayerNorm re-centers and re-scales each token’s activation vector at every block, so no layer’s outputs drift to extreme magnitudes; its leaner successor RMSNorm skips the re-centering and only re-scales. Beyond stabilizing training, normalization is what lets very deep transformers be trained at all. Almost every model after the original transformer moves the normalization before each sub-layer (“pre-norm”) for better gradient flow. It is a small change with a large effect on trainability, which we’ll flag when it appears in GPT-2.
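Both normalizations fit in a few lines. The sketch below is a bare-bones version for a single activation vector, with the learned gain and bias parameters left out for brevity; the function names and the `eps` default are our own choices.

```python
import math


def layer_norm(x, eps: float = 1e-5):
    """Subtract the mean, then divide by the standard deviation."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def rms_norm(x, eps: float = 1e-5):
    """Divide by the root-mean-square only; no mean subtraction."""
    mean_sq = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(mean_sq + eps) for v in x]


# A token vector whose values have drifted to a large magnitude.
activations = [1000.0, 1002.0, 998.0, 1004.0]
print("LayerNorm:", [round(v, 3) for v in layer_norm(activations)])
print("RMSNorm:  ", [round(v, 3) for v in rms_norm(activations)])
```

Either way the output has a magnitude near 1 regardless of how large the input was, which keeps the next layer's inputs inside a range that low-precision formats represent well.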

We now know how numbers are stored and kept stable. The next question is how many of them we can afford — the compute and memory budget of an actual run.