What is arithmetic bandwidth?
Arithmetic bandwidth is the peak rate at which arithmetic work can be performed by a system.
It is the theoretical maximum throughput for arithmetic operations, measured in operations per second, and it sets the height of the "compute roof" in a roofline model of the hardware.
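To make the compute roof concrete, here is a minimal Python sketch of the roofline bound. The function name and the device numbers are hypothetical, chosen only for illustration; they do not describe any particular GPU.

```python
def attainable_flops_per_s(peak_flops_per_s, mem_bytes_per_s, flops_per_byte):
    """Roofline upper bound on throughput for a kernel with the given
    arithmetic intensity (FLOPs performed per byte moved from memory)."""
    # Memory roof: throughput if every byte arrives at full memory bandwidth.
    memory_roof = mem_bytes_per_s * flops_per_byte
    # The compute roof (arithmetic bandwidth) caps the memory roof.
    return min(peak_flops_per_s, memory_roof)


# Hypothetical device: 100 TFLOP/s arithmetic bandwidth, 1 TB/s memory bandwidth.
PEAK = 100e12
MEM = 1e12

low = attainable_flops_per_s(PEAK, MEM, 10)     # limited by the memory roof
high = attainable_flops_per_s(PEAK, MEM, 1000)  # limited by the compute roof
print(low / 1e12, high / 1e12)
```

In this sketch, no kernel can exceed `PEAK` however high its arithmetic intensity, which is what "height of the compute roof" means.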
There are many arithmetic bandwidths in a complete system — one for each grouping of hardware units that provide bandwidth for executing arithmetic operations.
On many GPUs, the most important arithmetic bandwidth is the bandwidth of the CUDA Cores for floating point arithmetic. GPUs generally provide more bandwidth for floating point operations than for integer operations, and the key to the Compute Unified Device Architecture (CUDA) is that the CUDA Cores and supporting systems provide a unified computing interface for GPU applications (unlike prior GPU architectures).
But in recent GPUs, the unity of the architecture has been lessened by the introduction of Tensor Cores, which perform only matrix multiplication operations but do so at a much higher arithmetic bandwidth than the CUDA Cores. A ratio of 100:1 between Tensor Core and CUDA Core bandwidth is a good rule of thumb. That makes the Tensor Core arithmetic bandwidth the most important for kernels that wish to maximize performance.
Contemporary GPUs have Tensor Core arithmetic bandwidths measured in petaFLOPS — quadrillions of floating point operations per second. For example, B200 GPUs have a bandwidth of nine PFLOPS when running 4-bit floating point matrix multiplications.
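To give a feel for what a figure like nine PFLOPS implies, the sketch below estimates a lower bound on the time for one matrix multiplication at peak arithmetic bandwidth. It assumes the usual convention of counting 2·M·N·K floating point operations for an (M×K) by (K×N) product (one multiply and one add per term). The matrix size and function names are made up for the example. Real kernels will take longer, since the peak is a theoretical maximum.

```python
def matmul_flops(m, n, k):
    """FLOPs for an (m x k) @ (k x n) product, counting each
    multiply-accumulate as two operations (a common convention)."""
    return 2 * m * n * k


def min_seconds_at_peak(flops, peak_flops_per_s):
    """Lower bound on runtime if the hardware sustained its peak rate."""
    return flops / peak_flops_per_s


B200_FP4_PEAK = 9e15  # nine PFLOPS, the FP4 figure quoted in the text

flops = matmul_flops(8192, 8192, 8192)  # illustrative square problem size
seconds = min_seconds_at_peak(flops, B200_FP4_PEAK)
print(f"{flops:.3e} FLOPs, at least {seconds * 1e6:.0f} microseconds")
```

Under these assumptions, an 8192-cubed matrix multiplication is about a trillion operations, and the bound comes out on the order of a hundred microseconds.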
Representative bandwidth numbers for NVIDIA data center GPUs between the Ampere and Blackwell Streaming Multiprocessor architectures are listed in the table below.
System (Compute / Memory) | Arithmetic Bandwidth (TFLOPs/s) | Memory Bandwidth (TB/s) | Ridge Point (FLOPs/byte) |
---|---|---|---|
A100 80GB SXM BF16 TC / HBM2e | 312 | 2 | 156 |
H100 SXM BF16 TC / HBM3 | 989 | 3.35 | 295 |
B200 BF16 TC / HBM3e | 2250 | 8 | 281 |
H100 SXM FP8 TC / HBM3 | 1979 | 3.35 | 591 |
B200 FP8 TC / HBM3e | 4500 | 8 | 562 |
B200 FP4 TC / HBM3e | 9000 | 8 | 1125 |
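
The ridge point column is the arithmetic bandwidth divided by the memory bandwidth. A small Python sketch of that calculation follows, using a few rows from the table. The function and variable names are illustrative only.

```python
def ridge_point(arithmetic_tflops_per_s, memory_tb_per_s):
    """Ridge point in FLOPs/byte. The tera prefixes cancel, so
    TFLOPs/s divided by TB/s gives FLOPs per byte directly."""
    return arithmetic_tflops_per_s / memory_tb_per_s


# (system, arithmetic bandwidth in TFLOPs/s, memory bandwidth in TB/s),
# copied from the table above.
systems = [
    ("A100 80GB SXM BF16 TC / HBM2e", 312, 2),
    ("H100 SXM BF16 TC / HBM3", 989, 3.35),
    ("B200 FP4 TC / HBM3e", 9000, 8),
]

for name, tflops, tb_per_s in systems:
    print(f"{name}: {ridge_point(tflops, tb_per_s):.0f} FLOPs/byte")
```

In a roofline model, a kernel whose arithmetic intensity falls below the ridge point is limited by memory bandwidth; only above it does the arithmetic bandwidth become the binding limit.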