The literal neck of a bottle limits the rate at which liquid can be poured; a metaphorical performance bottleneck in a system limits the rate at which tasks can be completed.

Bottlenecks are the target of performance optimization. The textbook approach to optimization is to

determine the bottleneck,
elevate the bottleneck until it is no longer such, and
repeat on the new bottleneck.

This approach is formalized in, for instance, the "Theory of Constraints" by Eliyahu Goldratt, which helped transmit the Toyota approach to manufacturing to manufacturers worldwide, and thence to software engineering and operations.
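As a rough illustration of this loop, the Python sketch below models a system as a set of stages, each with its own throughput, so that the slowest stage sets the throughput of the whole. The stage names, the rates, the target, and the assumption that each round of optimization doubles the bottleneck's throughput are all hypothetical, chosen only to show how the bottleneck moves once it has been elevated.

```python
def find_bottleneck(rates):
    """Return the name of the stage with the lowest throughput."""
    return min(rates, key=rates.get)


def optimize(rates, target, speedup=2.0, max_rounds=10):
    """Repeatedly elevate the current bottleneck until the target is met.

    `speedup` is a stand-in for whatever real optimization work would do;
    here it simply multiplies the bottleneck's throughput.
    """
    rates = dict(rates)  # work on a copy
    history = []
    for _ in range(max_rounds):
        stage = find_bottleneck(rates)      # 1. determine the bottleneck
        if rates[stage] >= target:
            break                           # slowest stage is fast enough
        history.append(stage)
        rates[stage] *= speedup             # 2. elevate the bottleneck
    return rates, history                   # 3. the loop repeats on the new one


# Hypothetical stage throughputs, in tasks per second.
stages = {"load": 100.0, "preprocess": 40.0, "compute": 250.0}
final, history = optimize(stages, target=150.0)
print(history)  # ['preprocess', 'preprocess', 'load']
```

Note that after two rounds of work on `preprocess`, the bottleneck shifts to `load`: further effort on `preprocess` would not have improved the system's throughput at all.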

In this talk for Jane Street, Horace He broke down the work done by the kernels of programs run on GPUs into three categories:

Compute (running floating point operations on CUDA Cores or Tensor Cores)
Memory (moving data in the system's memory hierarchy)
Overhead (everything else)

And so for GPU kernels, performance bottlenecks fall into three main categories:

compute-bound kernels, bottlenecked by the arithmetic bandwidth of compute units, like large matrix-matrix multiplication,
memory-bound kernels, bottlenecked by the bandwidth of memory subsystems, like large vector-vector multiplication, and
overhead-bound kernels, bottlenecked by latency, like small array operations.
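To make the three categories concrete, the sketch below estimates how long a kernel would spend on compute, on memory traffic, and on fixed overhead, then reports whichever is largest. The hardware figures are illustrative round numbers, not the specification of any particular GPU, and the fixed per-kernel overhead is likewise an assumption. Taking the largest of three independent terms is a deliberate simplification: real kernels overlap compute with memory traffic and rarely reach peak rates.

```python
# Assumed, illustrative hardware characteristics (not a real GPU's spec).
PEAK_FLOPS = 1.0e15        # arithmetic bandwidth, FLOP/s
MEM_BANDWIDTH = 3.0e12     # memory bandwidth, bytes/s
KERNEL_OVERHEAD = 5.0e-6   # fixed per-kernel latency, seconds

BYTES_PER_ELEMENT = 2      # 16-bit floating point


def estimate_times(flops, bytes_moved):
    """Best-case time in seconds attributed to each category of work."""
    return {
        "compute": flops / PEAK_FLOPS,
        "memory": bytes_moved / MEM_BANDWIDTH,
        "overhead": KERNEL_OVERHEAD,
    }


def classify(flops, bytes_moved):
    """Name the category with the largest estimated time."""
    times = estimate_times(flops, bytes_moved)
    return max(times, key=times.get)


def matmul_cost(n):
    """n x n matrix-matrix multiply: 2*n^3 FLOPs, read A and B, write C."""
    return 2 * n**3, 3 * n * n * BYTES_PER_ELEMENT


def vector_multiply_cost(n):
    """Elementwise product of two length-n vectors: n FLOPs, 3*n elements moved."""
    return n, 3 * n * BYTES_PER_ELEMENT


print(classify(*matmul_cost(8192)))            # compute
print(classify(*vector_multiply_cost(2**28)))  # memory
print(classify(*vector_multiply_cost(1024)))   # overhead
```

Under these assumptions, the same elementwise operation is memory-bound at large sizes and overhead-bound at small ones, since the fixed latency does not shrink with the input.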

Roofline model analysis helps quickly identify whether a program's performance is bottlenecked by compute/arithmetic bandwidth or memory bandwidth.
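A minimal sketch of that analysis follows. It compares a kernel's arithmetic intensity (FLOPs per byte moved) against the "ridge point" where the memory roof meets the compute roof. The peak figures are assumed round numbers rather than measurements of real hardware, and the byte counts assume each operand is moved exactly once, which is a best case.

```python
def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte moved to or from memory."""
    return flops / bytes_moved


def attainable_flops(intensity, peak_flops, mem_bandwidth):
    """Roofline: performance is capped by the lower of the two roofs."""
    return min(peak_flops, intensity * mem_bandwidth)


def bound_by(intensity, peak_flops, mem_bandwidth):
    """Kernels left of the ridge point are memory-bound, the rest compute-bound."""
    ridge = peak_flops / mem_bandwidth
    return "compute" if intensity >= ridge else "memory"


# Assumed, illustrative hardware characteristics (not a real GPU's spec).
peak = 1.0e15        # FLOP/s
bandwidth = 3.0e12   # bytes/s
elem = 2             # bytes per 16-bit element

n = 8192
# n x n matrix multiply: 2*n^3 FLOPs over 3*n^2 elements, so intensity grows with n.
matmul_ai = arithmetic_intensity(2 * n**3, 3 * n * n * elem)
# Elementwise vector multiply: n FLOPs over 3*n elements, so intensity is constant.
vector_ai = arithmetic_intensity(n, 3 * n * elem)

print(bound_by(matmul_ai, peak, bandwidth))  # compute
print(bound_by(vector_ai, peak, bandwidth))  # memory
print(attainable_flops(vector_ai, peak, bandwidth) / peak)  # fraction of peak reachable
```

With these numbers the ridge point sits near 333 FLOPs per byte, so the vector multiply, at one FLOP per six bytes, can reach only a small fraction of peak arithmetic throughput no matter how well it is written.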

Of course, any resource can become a bottleneck. For instance, power ingress and heat egress can and do bottleneck some GPUs below their theoretical maximum performance. See this article from NVIDIA explaining a 4% end-to-end performance improvement by redirecting power from the L2 cache to the Streaming Multiprocessors, or this article from Horace He indicating that matrix multiplication performance varies with the input data, because the data determines how much power transistor switching demands. But compute and memory are the most important resources and the most common bottlenecks.

