GPU Glossary

What is arithmetic intensity?

Arithmetic intensity is the ratio of arithmetic operations to memory operations in a kernel.
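In the roofline-model convention used in the tables below, the numerator is counted in floating point operations and the denominator in bytes of memory traffic:

```latex
\text{arithmetic intensity} \;=\; \frac{\text{arithmetic operations}}{\text{bytes moved}}
\quad \left[ \frac{\text{FLOPs}}{\text{byte}} \right]
```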

A high arithmetic intensity indicates that a kernel performs many arithmetic operations per byte loaded. Because the ratio of arithmetic bandwidth to memory bandwidth in modern GPUs is high, the most efficient kernels have high arithmetic intensity. That means that when alleviating a memory bottleneck, we can often shift work from the memory subsystem to the compute subsystem, saving memory bandwidth at the cost of additional load on the arithmetic units.

For example, compressing data in global memory reduces memory traffic since fewer bytes need to be transferred, but the compute units must perform additional decompression operations. If we were previously bottlenecked by memory, this can improve performance. It also increases the ratio of FLOPs to bytes moved, increasing the arithmetic intensity.
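To make the trade concrete, here is a toy calculation (all numbers are invented for illustration, not measured): a kernel that does one FLOP per 4-byte element, versus the same kernel reading a 2x-compressed representation that costs a few extra FLOPs per element to decode.

```python
# Invented numbers for illustration only.
flops_per_element = 1.0
bytes_per_element = 4.0             # uncompressed fp32 elements
compressed_bytes_per_element = 2.0  # assume a 2x compression ratio
decompress_flops_per_element = 3.0  # assumed cost of decoding in registers

ai_uncompressed = flops_per_element / bytes_per_element
ai_compressed = (flops_per_element + decompress_flops_per_element) / compressed_bytes_per_element

print(ai_uncompressed)  # 0.25 FLOPs/byte
print(ai_compressed)    # 2.0 FLOPs/byte: half the memory traffic, more arithmetic
```

If the kernel was memory-bound and remains so, halving the bytes moved roughly halves its runtime, even though the FLOP count went up.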

As another example, the backpropagation algorithm creates long-lived intermediates (activation values) that generally must be stored in global memory during the forward pass and then retrieved during the backward pass. In some cases, it is faster to store only a fraction of these intermediates and recompute the remainder (a technique known as gradient checkpointing), which increases arithmetic intensity.
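A minimal sketch of gradient checkpointing in PyTorch, using torch.utils.checkpoint (the block, sizes, and loss here are illustrative placeholders, not a recipe from this glossary):

```python
import torch
from torch.utils.checkpoint import checkpoint

# An illustrative block; any stretch of layers whose activations we would
# rather recompute than keep in global memory works the same way.
block = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 1024),
)
x = torch.randn(32, 1024, requires_grad=True)

# Without checkpointing: intermediate activations are stored during the
# forward pass and read back during the backward pass.
block(x).sum().backward()

# With checkpointing: intermediates inside the block are discarded after the
# forward pass and recomputed during backward, trading extra arithmetic for
# less stored (and later re-read) activation data.
x.grad = None
checkpoint(block, x, use_reentrant=False).sum().backward()
```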

Because different algorithms have different operational and memory complexities, their arithmetic intensity inherently scales differently with problem size. An algorithm with O(1) operational complexity and O(N) memory complexity has O(1/N) arithmetic intensity scaling, while one with O(N) operational complexity and O(1) memory complexity has O(N) arithmetic intensity scaling.
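Writing the operation count as W(N) and the memory traffic as Q(N) (our notation, not standard in this glossary), the scaling is just the ratio of the two:

```latex
I(N) = \frac{W(N)}{Q(N)}, \qquad
W = O(1),\ Q = O(N) \;\Rightarrow\; I = O(1/N), \qquad
W = O(N),\ Q = O(1) \;\Rightarrow\; I = O(N).
```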

| Kernel | FLOPs | Bytes Moved | Arithmetic Intensity | Arithmetic Intensity Scaling |
| --- | --- | --- | --- | --- |
| SAXPY (y = ax + y) | 2N | 8N | 1/4 | O(1) |
| Single-Precision Real FFT | 5/2 N log(N) | 16N | 5/32 log(N) | O(log(N)) |
| SGEMM | 2N^3 | 16N^2 | N/8 | O(N) |
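The scaling column can be sanity-checked numerically by plugging the FLOP and byte counts from the table into the ratio for a few problem sizes (the helper below is ours; it assumes base-2 logarithms for the FFT count):

```python
import math

# FLOP and byte counts from the table above, as functions of problem size N.
kernels = {
    "SAXPY":    (lambda n: 2 * n,                  lambda n: 8 * n),
    "Real FFT": (lambda n: 2.5 * n * math.log2(n), lambda n: 16 * n),
    "SGEMM":    (lambda n: 2 * n**3,               lambda n: 16 * n**2),
}

for name, (flops, bytes_moved) in kernels.items():
    intensities = [flops(n) / bytes_moved(n) for n in (1_024, 4_096, 16_384)]
    print(name, [round(i, 2) for i in intensities])

# SAXPY stays at 0.25 FLOPs/byte no matter how large N gets, the FFT's
# intensity grows with log(N), and SGEMM's grows linearly with N.
```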

Notably, matrix multiplication scales linearly, i.e. is O(N), in arithmetic intensity: it is O(N^3) in operational complexity and O(N^2) in memory complexity. This favorable scaling makes it easy to map applications of matrix multiplication onto arithmetic-intensity-oriented hardware (see the discussion in the article on roofline modeling). It is one key to the success of machine learning algorithms based on matrix multiplication, like neural networks, over the past few decades.

For a discussion of arithmetic intensity as applied to Bahdanau attention, used in Transformer neural networks, see this paper by Zadouri, Strauss, and Dao.

The minimum arithmetic intensity required for work to be compute-bound (that is, to be past the ridge point of the roofline model) is a fixed parameter of a system and so only needs to be derived once. Ridge point arithmetic intensities for recent NVIDIA data center GPUs appear in the table below. Notice that the highest ridge point has increased going from the Ampere to Hopper to Blackwell Streaming Multiprocessor architectures.

| System (Compute / Memory) | Arithmetic Bandwidth (TFLOP/s) | Memory Bandwidth (TB/s) | Ridge Point (FLOPs/byte) |
| --- | --- | --- | --- |
| A100 80GB SXM BF16 TC / HBM2e | 312 | 2 | 156 |
| H100 SXM BF16 TC / HBM3 | 989 | 3.35 | 295 |
| B200 BF16 TC / HBM3e | 2250 | 8 | 281 |
| H100 SXM FP8 TC / HBM3 | 1979 | 3.35 | 592 |
| B200 FP8 TC / HBM3e | 4500 | 8 | 562 |
| B200 FP4 TC / HBM3e | 9000 | 8 | 1125 |
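Since a ridge point is just the quotient of the two bandwidths, it is straightforward to recompute for any system. The sketch below derives the H100 BF16 row from the table and checks whether a kernel of a given arithmetic intensity lands on the compute-bound side (numbers come from the table above; names are ours):

```python
def ridge_point(arithmetic_bandwidth_tflops: float, memory_bandwidth_tbps: float) -> float:
    """Minimum arithmetic intensity (FLOPs/byte) needed to saturate the
    arithmetic units rather than the memory system."""
    return arithmetic_bandwidth_tflops / memory_bandwidth_tbps

# H100 SXM, BF16 Tensor Cores over HBM3 (values from the table above).
h100_bf16_ridge = ridge_point(989, 3.35)  # ~295 FLOPs/byte

# SGEMM at N = 4096 has arithmetic intensity N/8 = 512 FLOPs/byte,
# comfortably past the ridge point, so it can be compute-bound there.
sgemm_intensity = 4096 / 8
print(h100_bf16_ridge, sgemm_intensity > h100_bf16_ridge)
```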