What is NVFP4?

NVFP4 is NVIDIA's proprietary 4-bit floating-point format for Blackwell-architecture GPUs, introduced in June 2025. It departs from the OCP MX standard in two ways: blocks are 16 elements (vs. 32 for MXFP4), and the block scale is a full FP8 E4M3 value rather than a power-of-two E8M0, giving each block a finer-grained scale. Because E4M3 covers a much narrower range than E8M0, a second, per-tensor FP32 scale bridges the dynamic range gap. Native training in FP4 is used in some of the top open models as of mid-2026, like DeepSeek-V4.
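A minimal Python sketch of the two-level scaling described above. It assumes a recipe this text does not spell out: the per-tensor scale maps the tensor's largest magnitude to 6 × 448 (the E2M1 and E4M3 maxima), and each block scale is rounded to the nearest E4M3 value. All function names are hypothetical, and ties in the element grid fall toward the smaller magnitude, which may differ from hardware rounding.

```python
import math

E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # non-negative E2M1 values
E2M1_MAX, E4M3_MAX = 6.0, 448.0

def round_e2m1(x):
    """Round to the nearest signed E2M1 value, saturating at +/-6."""
    mag = min(E2M1_GRID, key=lambda g: abs(abs(x) - g))
    return math.copysign(mag, x)

def round_e4m3(x):
    """Round a non-negative float to the nearest FP8 E4M3 value, saturating at 448."""
    if x <= 0.0:
        return 0.0
    _, e = math.frexp(x)        # x = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, -6)        # below 2**-6 the format is subnormal
    step = 2.0 ** (exp - 3)     # 3 mantissa bits give 8 steps per binade
    return min(round(x / step) * step, E4M3_MAX)

def quantize_nvfp4(tensor, block=16):
    """Return (per-tensor FP32 scale, [(E4M3 block scale, E2M1 elements), ...])."""
    amax = max(abs(v) for v in tensor)
    # Assumed recipe: the largest magnitude lands on 6 * 448 after both scales.
    tensor_scale = amax / (E2M1_MAX * E4M3_MAX) if amax > 0 else 1.0
    blocks = []
    for i in range(0, len(tensor), block):
        chunk = tensor[i:i + block]
        bmax = max(abs(v) for v in chunk)
        block_scale = round_e4m3(bmax / E2M1_MAX / tensor_scale)
        s = block_scale * tensor_scale
        # If the block scale rounded down, the largest elements saturate at 6.
        elems = [round_e2m1(v / s) if s > 0 else 0.0 for v in chunk]
        blocks.append((block_scale, elems))
    return tensor_scale, blocks

def dequantize_nvfp4(tensor_scale, blocks):
    return [tensor_scale * bs * e for bs, elems in blocks for e in elems]

# Two blocks with very different magnitudes each get their own E4M3 scale.
tensor = [2688.0] + [0.0] * 15 + [3.0, 1.0, 0.5] + [0.0] * 13
tensor_scale, blocks = quantize_nvfp4(tensor)
restored = dequantize_nvfp4(tensor_scale, blocks)
```

In this hand-picked example every value lands exactly on the grid, so the round trip is lossless; real tensors generally incur rounding error in both the block scales and the elements.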

What are blockwise quantization/micro-scaling float formats?

Standard formats like FP16 encode each element independently, with one exponent and significand per value. Micro-scaling formats like OCP MXFP4 trade that independence for compression: every block of consecutive elements (32 for MXFP4, 16 for NVFP4) shares a single scale factor (stored as an E8M0 power of two in MXFP4, or as an E4M3 value in NVFP4), and each element stores only its relative magnitude within the block as a low-precision E2M1 value.
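The compression side of that trade can be worked out directly. The short Python sketch below amortizes the 8-bit shared scale over the block and compares the result with 16 bits per element for FP16; the helper name is hypothetical, and NVFP4's single per-tensor FP32 scale is ignored as negligible.

```python
def bits_per_element(element_bits, scale_bits, block_size):
    """Average storage per element, amortizing the shared block scale."""
    return element_bits + scale_bits / block_size

mxfp4 = bits_per_element(4, 8, 32)  # E2M1 elements, one E8M0 scale per 32
nvfp4 = bits_per_element(4, 8, 16)  # E2M1 elements, one E4M3 scale per 16

print(f"MXFP4: {mxfp4} bits/element, {16 / mxfp4:.2f}x smaller than FP16")
print(f"NVFP4: {nvfp4} bits/element, {16 / nvfp4:.2f}x smaller than FP16")
```

The smaller NVFP4 block costs an extra quarter bit per element in exchange for scales that track the data more closely.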

The banding in the image above is the block structure made visible. A block containing both a very bright and a very dark pixel must scale to fit the bright one, collapsing the darker values into only a handful of distinct levels. FP4 has just 8 non-negative representable values (0, 0.5, 1, 1.5, 2, 3, 4, and 6, each multiplied by the block scale), so FP4 blocks "posterize" to at most 8 colors. FP6 has 32 non-negative values and FP8 up to 127 (E4M3), so degradation at those precisions is subtler.
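To make the level count concrete, the Python sketch below decodes all sixteen E2M1 bit patterns to recover the eight non-negative values, then counts how many of them survive in a block of dark pixels with and without one bright pixel. It assumes an ideal, unrounded block scale that maps the brightest pixel to 6 and a block with at least one positive pixel; the function names and pixel values are illustrative only.

```python
def decode_e2m1(code):
    """Decode a 4-bit E2M1 code: 1 sign bit, 2 exponent bits, 1 mantissa bit."""
    sign = -1.0 if code & 0b1000 else 1.0
    exp = (code >> 1) & 0b11
    man = code & 0b1
    if exp == 0:                      # subnormal: 0 or 0.5
        mag = man * 0.5
    else:                             # normal: (1 + m/2) * 2**(exp - 1)
        mag = (1.0 + man / 2.0) * 2.0 ** (exp - 1)
    return sign * mag

# The eight non-negative magnitudes: 0, 0.5, 1, 1.5, 2, 3, 4, 6.
levels = sorted({abs(decode_e2m1(c)) for c in range(16)})

def levels_used(block):
    """Distinct E2M1 levels a block of non-negative pixels occupies."""
    scale = max(block) / 6.0          # assumed ideal scale: brightest pixel -> 6
    return sorted({min(levels, key=lambda g: abs(v / scale - g)) for v in block})

dark = [0.01 * i for i in range(1, 16)]   # 15 distinct dark pixels
with_bright = levels_used(dark + [1.0])   # one bright pixel dominates the scale
dark_only = levels_used(dark)

print(len(with_bright), "levels with a bright pixel,", len(dark_only), "without")
```

With the bright pixel present, the fifteen dark values share only the three lowest levels; on their own they spread across most of the grid.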

These quantized formats are used in LLM inference to reduce demand on memory bandwidth, especially during decode, and to take advantage of higher arithmetic bandwidth, especially during prefill. They are generally destined for use in the Tensor Cores, where the vast majority of that bandwidth lies in contemporary GPUs.

Explore how individual floats in these formats are encoded on the Quant Formats page. The image-as-tensor visualization technique is inspired by quant-jaunt.