GPU Glossary

What is the roofline model?

The roofline model is a simplified, visual model of performance used to quickly determine whether a program is bound by memory bandwidth or arithmetic bandwidth.

In the roofline model, two hardware-derived "roofs" put a "ceiling" on the possible performance: a "memory roof" set by the system's memory bandwidth and a "compute roof" set by its arithmetic bandwidth.

These are visualized on a plane with arithmetic intensity (in operations per byte) on the x-axis and performance (in operations per second) on the y-axis. The "compute roof" is a horizontal line with height equal to the arithmetic bandwidth. The "memory roof" is a slanted line with slope equal to the memory bandwidth. Slope is "rise over run", and so the slope has units of bytes per second (operations per second divided by operations per byte).
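
In symbols (a standard way of writing the model down; the labels π, β, and I are just conventional names for the quantities above), a kernel with arithmetic intensity I can attain at most

```latex
% pi   = arithmetic bandwidth (operations per second): the compute roof
% beta = memory bandwidth (bytes per second)
% I    = arithmetic intensity (operations per byte)
% beta * I is the memory roof evaluated at intensity I
P(I) = \min(\pi, \; \beta \cdot I)
```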

A specific kernel's x-coordinate tells you instantly whether it is fundamentally compute-bound (the flat roof is the binding limit at that intensity) or memory-bound (the slanted roof is). Kernels are rarely up against either roof, due to the effects of overhead.
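
As a minimal sketch of that classification in code (the function name and parameters are hypothetical, and the hardware figures are approximate published A100 40 GB SXM peaks used purely for illustration):

```python
def roofline_bound(ops: float, bytes_moved: float,
                   peak_ops_per_s: float, mem_bytes_per_s: float) -> str:
    """Classify a kernel as memory- or compute-bound under the roofline model."""
    intensity = ops / bytes_moved              # x-coordinate: operations per byte
    memory_roof = mem_bytes_per_s * intensity  # slanted roof evaluated at this intensity
    attainable = min(peak_ops_per_s, memory_roof)  # the lower roof is the ceiling
    bound = "memory-bound" if memory_roof < peak_ops_per_s else "compute-bound"
    return f"{bound}: attainable <= {attainable:.3e} ops/s"

# Illustrative, approximate A100 (40 GB SXM) figures:
# ~312e12 FLOP/s BF16 Tensor Core peak and ~1.555e12 B/s of HBM2 bandwidth.
print(roofline_bound(ops=2e9, bytes_moved=1e8,
                     peak_ops_per_s=312e12, mem_bytes_per_s=1.555e12))
# -> memory-bound: attainable <= 3.110e+13 ops/s
```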

The point on the boundary, i.e. where the slanted and horizontal roofs meet, is called the "ridge point". Its x-coordinate is the minimum arithmetic intensity required to escape the memory bottleneck. Computer systems whose ridge point is further to the left are easier to achieve maximum performance on, but because memory has generally scaled more poorly than compute, the ridge points of systems have been pushed to the right over time.
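
Setting the two roofs equal gives the ridge point's x-coordinate. Plugging in the same approximate A100 (40 GB SXM) figures as above, again only as an illustrative assumption:

```latex
I_{\text{ridge}} = \frac{\pi}{\beta}
  \approx \frac{312 \times 10^{12}\ \text{FLOP/s}}{1.555 \times 10^{12}\ \text{B/s}}
  \approx 200\ \text{FLOP/B}
```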

The compute and memory roofs need only be derived once per subsystem (though, importantly, they vary by subsystem, not just by system: Tensor Cores have more FLOP/s than CUDA Cores).

NVIDIA's Nsight Compute tool for kernel performance engineering automatically performs roofline analysis for profiled kernels.

The roofline model is deceptively simple. Note that, for instance, system latencies do not appear anywhere in the diagram, only bandwidths and throughputs. It is simple because it is highly opinionated, and understanding those opinions and their reasoning is key to understanding the power and the proper application of the roofline.

The roofline model was introduced by Samuel Williams, Andrew Waterman, and David Patterson in a 2008 paper, "Roofline: An Insightful Visual Performance Model for Multicore Architectures". They introduced it in the face of several hardware scaling trends that shaped system architectures before and since.

First, as Patterson separately observed in a famous 2004 paper, "latency lags bandwidth". More specifically, across subsystems like compute, memory, and storage, a linear improvement in latency has historically been accompanied by a quadratic improvement in bandwidth. This suggested that future systems would be, like GPUs, throughput-oriented.

Second, as has long been observed, compute subsystems (like processor cores) have scaled their performance much more rapidly than memory subsystems like caches and DRAM. This was popularized as the "memory wall" by Wulf and McKee in 1994.

Finally, the early 2000s saw the end of Dennard scaling, aka increasing clock speed at equal power, due primarily to the fixed leakage current of transistors, which posed power draw and heat dissipation problems. Increasing clock speed had previously buoyed general-purpose, latency-oriented systems like CPUs over special-purpose hardware. This slowdown was not accompanied by a slowdown in Moore's Law, aka increasing transistor count per chip. The architectural solution to an abundance of transistors but a scarcity of power was hardware specialization: disaggregating computers into components specialized for distinct tasks. For a well-documented example, see the Pixel Visual Core image co-processor, explained in detail in chapter 7 of the sixth edition of Hennessy and Patterson's Computer Architecture.

Taken together, these trends correctly suggested to the authors that future systems would be throughput-oriented and that, among the various bandwidths at play, the bandwidth of memory subsystems would be the primary performance bottleneck. Applications that wanted to achieve peak performance on those systems would therefore need high operational intensity for the hardware's specialized operations. In the case of GPUs, that means high arithmetic intensity for Tensor Cores, which is to say very large matrix multiplications.
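
As a back-of-the-envelope check on that last claim (assuming fp16 elements and that each matrix moves through memory exactly once, which ignores caching): multiplying two n × n matrices takes 2n³ FLOPs but moves only 3n² elements, so

```latex
% C = AB with A, B, C all n x n, fp16 (2 bytes per element)
I = \frac{2n^3\ \text{FLOP}}{3n^2 \cdot 2\ \text{B}} = \frac{n}{3}\ \text{FLOP/B}
```

Arithmetic intensity grows linearly with n: at n = 4096, this idealized kernel already has I ≈ 1365 FLOP/B, far to the right of the ~200 FLOP/B ridge point estimated above.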
