What does it mean to be compute-bound?
Kernels that are compute-bound are limited by the arithmetic bandwidth of the CUDA Cores or Tensor Cores.
Compute-bound kernels are characterized by high arithmetic intensity (many arithmetic operations per byte of memory loaded or stored). Utilization of arithmetic pipes is the limiting factor for a compute-bound kernel.
Technically, compute-boundedness is only defined for a single kernel, as part of the roofline model, but with a bit of squinting it can be generalized to cover the multiple kernels that make up a typical workload.
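In code, the roofline test reduces to comparing a kernel's arithmetic intensity against the hardware's "ridge point", the intensity at which a kernel stops being memory-bound and becomes compute-bound. Here's a minimal sketch; the peak FLOP/s and memory bandwidth figures are illustrative assumptions, not the specs of any particular GPU.

```python
# Roofline-style classification of a kernel. Hardware numbers below are
# assumed, illustrative figures, not specs of any particular chip.
PEAK_FLOP_PER_S = 1e15   # 16-bit matrix arithmetic bandwidth, FLOP/s
PEAK_BYTES_PER_S = 3e12  # memory bandwidth, bytes/s

# The "ridge point": the arithmetic intensity (FLOPs per byte) at which
# a kernel shifts from memory-bound to compute-bound.
RIDGE_POINT = PEAK_FLOP_PER_S / PEAK_BYTES_PER_S

def is_compute_bound(flops: float, bytes_moved: float) -> bool:
    """Compare a kernel's arithmetic intensity to the ridge point."""
    return flops / bytes_moved > RIDGE_POINT

# e.g. a large square GEMM: ~2*N^3 FLOPs over ~3*N^2 16-bit values moved
N = 8192
print(is_compute_bound(2 * N**3, 3 * N**2 * 2))  # True: intensity ~ N/3
```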
Large diffusion model inference workloads are generally compute-bound. Contemporary large language model inference workloads are often compute-bound during batch prefill/prompt processing, when each weight can be loaded into shared memory once and then used across many tokens.
Let's do a simple estimation, inspired by kipperrii's Transformer inference arithmetic framework, of the minimum latency between tokens (inter-token latency or time per output token) for compute-bound Transformer language model inference. Assume the model has 500B parameters, stored in 16-bit precision, for a total of 1 TB. This model will perform roughly one trillion floating point operations (one multiply and one accumulate per parameter) per batch element. On a GPU with one petaFLOP/s of arithmetic bandwidth for 16-bit matrix math, the minimum latency between tokens, assuming compute-boundedness, is one millisecond per batch element.
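Spelled out in code, the estimate looks like this (all numbers come from the text above):

```python
params = 500e9                # 500B parameters, stored in 16-bit precision
flops_per_token = 2 * params  # one multiply and one accumulate per parameter
peak_flop_per_s = 1e15        # 1 PFLOP/s of 16-bit matrix arithmetic

min_itl_s = flops_per_token / peak_flop_per_s  # per batch element
print(f"minimum inter-token latency: {min_itl_s * 1e3:.1f} ms")  # 1.0 ms
```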
Note that for this GPU to be compute-bound at batch size one, it would need a memory bandwidth of 1 PB/s (so that it can load all 1 TB of weights in one ms). Contemporary memory bandwidths are in the TB/s range, and so batches of hundreds of inputs are required to provide sufficient arithmetic intensity for execution to be compute-bound.
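We can run the same comparison the other way to estimate the batch size needed for compute-boundedness. The 3 TB/s memory bandwidth below is an assumed figure, in the contemporary range mentioned above:

```python
params = 500e9
weight_bytes = 2 * params  # 1 TB of 16-bit weights
peak_flop_per_s = 1e15     # as above
mem_bytes_per_s = 3e12     # ~3 TB/s, an assumed contemporary figure

# Each weight is loaded once but contributes 2 FLOPs per batch element,
# so compute time scales with batch size while weight traffic does not.
weight_load_s = weight_bytes / mem_bytes_per_s        # ~0.33 s to stream weights
compute_s_per_elem = 2 * params / peak_flop_per_s     # 1 ms per batch element
min_batch = weight_load_s / compute_s_per_elem        # ~333 batch elements
print(f"compute-bound above batch size ~{min_batch:.0f}")
```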
For more on LLM inference, see our LLM Engineer's Almanac.