Peak rate represents the absolute upper bound of GPU performance when every execution unit operates at maximum capacity with perfect efficiency. It assumes ideal operation, where no resource constraints (registers , memory bandwidth , synchronization barriers, etc.) create bottlenecks .

Peak rate is the yardstick against which all achieved performance is measured. It sets the compute-bound "roof" in a roofline analysis . It is the denominator in the utilization fraction reported in pipe utilization metrics and the ultimate arbiter of GPU utilization .

Poetically, NVIDIA engineers often call it the "speed of light" — the limit on program speed imposed by physics.

Peak rate is computed directly from the fixed hardware specifications of each GPU architecture.

For example, an NVIDIA H100 GPU with 132 SMs, each containing 128 FP32 cores, can issue 1 single precision Fused Multiply Add (FMA) operation, which comprises 2 floating point operations per core. That's 33,792 instructions per clock . The H100 can operate its compute subsystem clock at a maximum rate of 1980 MHz (million clocks per second) when using the FP32 cores, and so the peak rate is 66,908 billion FLOPS, or 66.9 TFLOPS.

This precisely matches the Peak FP32 TFLOPS (non-Tensor) rate advertised in NVIDIA's H100 whitepaper .

Building on GPUs? We know a thing or two about it.

Modal is an ergonomic Python SDK wrapped around a global GPU fleet. Deploy serverless AI workloads instantly without worrying about quota requests, driver compatibility issues, or managing bulky ML dependencies. Deploy on GPUs

Deploy on GPUs

Pipe Utilization Issue Efficiency