What does it mean to be memory-bound?
Kernels that are memory-bound are limited by the memory bandwidth of the GPU.
Specifically, they are limited by the bandwidth between the GPU RAM and the local caches of the Streaming Multiprocessors, because the problems of interest for GPU performance generally have working set sizes much larger than any higher level of the memory hierarchy.
Memory-bound kernels have an arithmetic intensity (operations performed per byte moved) below the ridge point of their roofline model.
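As a sketch of that check (the hardware numbers below are illustrative assumptions, not the specs of any particular GPU), the ridge point is peak compute throughput divided by peak memory bandwidth, and a kernel whose arithmetic intensity falls below it is memory-bound:

```python
# Hypothetical hardware numbers, for illustration only.
peak_flops = 1e15        # 1 PFLOP/s of compute throughput
peak_bandwidth = 10e12   # 10 TB/s of memory bandwidth

ridge_point = peak_flops / peak_bandwidth  # 100 FLOPs per byte

# A hypothetical kernel's arithmetic intensity: FLOPs per byte moved.
kernel_flops = 2e12
kernel_bytes = 1e12
arithmetic_intensity = kernel_flops / kernel_bytes  # 2 FLOPs per byte

memory_bound = arithmetic_intensity < ridge_point  # True: well below the ridge
```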
Technically, memory-boundedness is only defined for a single kernel, as part of the roofline model, but with a bit of squinting it can be generalized to cover the multiple kernels that make up a typical workload.
Contemporary large language model inference workloads are often memory-bound during the decode/output generation stage, when the weights must be loaded once per forward pass. A forward pass happens once per output token, unless multi-token prediction or speculative decoding is used, which makes it easy to calculate the minimum latency between tokens (intertoken latency, or time per output token) for memory-bound Transformer large language model inference.
Assume the model has 500B parameters, stored in 16-bit precision, for a total of 1 TB of weights. If we run inference on a single GPU with a memory bandwidth of 10 TB/s, we can load the weights at most once every 100 ms, which puts a lower bound on our intertoken latency. By batching multiple inputs together, we can linearly increase the number of floating point operations done per parameter loaded (the arithmetic intensity), in principle up to the point of compute-boundedness, without incurring any additional latency, which implies that throughput improves linearly in the batch size.
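That back-of-the-envelope calculation looks like the following sketch, where the batch size is an assumption chosen for illustration:

```python
params = 500e9            # 500B parameters
bytes_per_param = 2       # 16-bit precision
weight_bytes = params * bytes_per_param  # 1 TB of weights

memory_bandwidth = 10e12  # 10 TB/s

# Lower bound on intertoken latency: one full pass over the weights per token.
min_intertoken_latency = weight_bytes / memory_bandwidth  # 0.1 s = 100 ms

# Batching B requests raises arithmetic intensity roughly linearly: about
# 2 * B FLOPs per parameter loaded (one multiply-accumulate per request),
# so throughput grows linearly in B until the workload becomes compute-bound.
batch_size = 8  # illustrative
tokens_per_second = batch_size / min_intertoken_latency  # 80 tokens/s total
```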
For more on LLM inference, see our LLM Engineer's Almanac.