GPU Glossary

What is arithmetic intensity?

Arithmetic intensity is the ratio of arithmetic operations to memory operations in a kernel.
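In the roofline-model convention used in the tables below, the numerator is counted in floating point operations and the denominator in bytes of memory traffic:

```latex
\text{arithmetic intensity} \;=\; \frac{\text{arithmetic operations}}{\text{bytes moved}}
\quad \left[ \frac{\text{FLOPs}}{\text{byte}} \right]
```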

A high arithmetic intensity indicates that a kernel performs many arithmetic operations per byte loaded. Because the ratio of arithmetic bandwidth to memory bandwidth in modern GPUs is high, the most efficient kernels have high arithmetic intensity. That means that when alleviating a memory bottleneck, we can often shift work from the memory subsystem to the compute subsystem, saving memory bandwidth at the cost of additional load on the arithmetic units.

For example, compressing data in global memory reduces memory traffic since fewer bytes need to be transferred, but the compute units must perform additional decompression operations. If we were previously bottlenecked by memory, this can improve performance. It also increases the ratio of FLOPs to bytes moved, increasing the arithmetic intensity.
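To make the trade concrete, here is a toy calculation (all numbers are invented for illustration, not measured): a kernel that does one FLOP per 4-byte element, versus the same kernel reading a 2x-compressed representation that costs a few extra FLOPs per element to decode.

```python
# Invented numbers for illustration only.
flops_per_element = 1.0
bytes_per_element = 4.0             # uncompressed fp32 elements
compressed_bytes_per_element = 2.0  # assume a 2x compression ratio
decompress_flops_per_element = 3.0  # assumed cost of decoding in registers

ai_uncompressed = flops_per_element / bytes_per_element
ai_compressed = (flops_per_element + decompress_flops_per_element) / compressed_bytes_per_element

print(ai_uncompressed)  # 0.25 FLOPs/byte
print(ai_compressed)    # 2.0 FLOPs/byte: half the memory traffic, more arithmetic
```

If the kernel was memory-bound and remains so, halving the bytes moved roughly halves its runtime, even though the FLOP count went up.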

As another example, the backpropagation algorithm creates long-lived intermediates (activation values) that generally must be stored in global memory during the forward pass and then retrieved during the backward pass. In some cases, it is faster to store only a fraction of these intermediates and recompute the remainder (a technique known as gradient checkpointing), which increases arithmetic intensity.
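A minimal sketch of gradient checkpointing in PyTorch, using torch.utils.checkpoint (the block, sizes, and loss here are illustrative placeholders, not a recipe from this glossary):

```python
import torch
from torch.utils.checkpoint import checkpoint

# An illustrative block; any stretch of layers whose activations we would
# rather recompute than keep in global memory works the same way.
block = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 1024),
)
x = torch.randn(32, 1024, requires_grad=True)

# Without checkpointing: intermediate activations are stored during the
# forward pass and read back during the backward pass.
block(x).sum().backward()

# With checkpointing: intermediates inside the block are discarded after the
# forward pass and recomputed during backward, trading extra arithmetic for
# less stored (and later re-read) activation data.
x.grad = None
checkpoint(block, x, use_reentrant=False).sum().backward()
```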

Because different algorithms have different operational and memory complexities, their arithmetic intensity inherently scales differently with problem size. An algorithm with O(1) operational complexity and O(N) memory complexity has O(1/N) arithmetic intensity scaling, while one with O(N) operational complexity and O(1) memory complexity has O(N) arithmetic intensity scaling.
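Writing the operation count as W(N) and the memory traffic as Q(N) (our notation, not standard in this glossary), the scaling is just the ratio of the two:

```latex
I(N) = \frac{W(N)}{Q(N)}, \qquad
W = O(1),\ Q = O(N) \;\Rightarrow\; I = O(1/N), \qquad
W = O(N),\ Q = O(1) \;\Rightarrow\; I = O(N).
```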

| Kernel | FLOPs | Bytes Moved | Arithmetic Intensity | Arithmetic Intensity Scaling |
| --- | --- | --- | --- | --- |
| SAXPY (y = ax + y) | 2N | 8N | 1/4 | O(1) |
| Single-Precision Real FFT | 5/2 N log(N) | 16N | 5/32 log(N) | O(log(N)) |
| SGEMM | 2N^3 | 16N^2 | N/8 | O(N) |
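The scaling column can be sanity-checked numerically by plugging the FLOP and byte counts from the table into the ratio for a few problem sizes (the helper below is ours; it assumes base-2 logarithms for the FFT count):

```python
import math

# FLOP and byte counts from the table above, as functions of problem size N.
kernels = {
    "SAXPY":    (lambda n: 2 * n,                  lambda n: 8 * n),
    "Real FFT": (lambda n: 2.5 * n * math.log2(n), lambda n: 16 * n),
    "SGEMM":    (lambda n: 2 * n**3,               lambda n: 16 * n**2),
}

for name, (flops, bytes_moved) in kernels.items():
    intensities = [flops(n) / bytes_moved(n) for n in (1_024, 4_096, 16_384)]
    print(name, [round(i, 2) for i in intensities])

# SAXPY stays at 0.25 FLOPs/byte no matter how large N gets, the FFT's
# intensity grows with log(N), and SGEMM's grows linearly with N.
```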

Notably, matrix multiplication scales linearly, i.e. is O(N), in arithmetic intensity: it is O(N^3) in operational complexity and O(N^2) in memory complexity. This favorable scaling makes it easy to map applications of matrix multiplication onto arithmetic-intensity-oriented hardware (see the discussion in the article on roofline modeling). It is one key to the success of machine learning algorithms based on matrix multiplication, like neural networks, over the past few decades.

For a discussion of arithmetic intensity as applied to Bahdanau attention, used in Transformer neural networks, see this paper by Zadouri, Strauss, and Dao.

The minimum arithmetic intensity required for work to be compute-bound (that is, to be past the ridge point of the roofline model) is a fixed parameter of a system and so only needs to be derived once. Ridge point arithmetic intensities for recent NVIDIA data center GPUs appear in the table below. Notice that the highest ridge point has increased going from the Ampere to Hopper to Blackwell Streaming Multiprocessor architectures.

| System (Compute / Memory) | Arithmetic Bandwidth (TFLOP/s) | Memory Bandwidth (TB/s) | Ridge Point (FLOPs/byte) |
| --- | --- | --- | --- |
| A100 80GB SXM BF16 TC / HBM2e | 312 | 2 | 156 |
| H100 SXM BF16 TC / HBM3 | 989 | 3.35 | 295 |
| B200 BF16 TC / HBM3e | 2250 | 8 | 281 |
| H100 SXM FP8 TC / HBM3 | 1979 | 3.35 | 592 |
| B200 FP8 TC / HBM3e | 4500 | 8 | 562 |
| B200 FP4 TC / HBM3e | 9000 | 8 | 1125 |
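Since a ridge point is just the quotient of the two bandwidths, it is straightforward to recompute for any system. The sketch below derives the H100 BF16 row from the table and checks whether a kernel of a given arithmetic intensity lands on the compute-bound side (numbers come from the table above; names are ours):

```python
def ridge_point(arithmetic_bandwidth_tflops: float, memory_bandwidth_tbps: float) -> float:
    """Minimum arithmetic intensity (FLOPs/byte) needed to saturate the
    arithmetic units rather than the memory system."""
    return arithmetic_bandwidth_tflops / memory_bandwidth_tbps

# H100 SXM, BF16 Tensor Cores over HBM3 (values from the table above).
h100_bf16_ridge = ridge_point(989, 3.35)  # ~295 FLOPs/byte

# SGEMM at N = 4096 has arithmetic intensity N/8 = 512 FLOPs/byte,
# comfortably past the ridge point, so it can be compute-bound there.
sgemm_intensity = 4096 / 8
print(h100_bf16_ridge, sgemm_intensity > h100_bf16_ridge)
```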