What does it mean to be compute-bound?
Kernels that are compute-bound are limited by the arithmetic bandwidth of the CUDA Cores or Tensor Cores.
Compute-bound kernels are characterized by high arithmetic intensity (many arithmetic operations per byte of memory loaded or stored). Utilization of arithmetic pipes is the limiting factor for a compute-bound kernel.
Technically, compute-boundedness is only defined for a single kernel, as part of the roofline model, but with a bit of squinting it can be generalized to cover the multiple kernels that make up a typical workload.
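In code, the roofline test reduces to comparing a kernel's arithmetic intensity against the hardware's "ridge point", the intensity at which a kernel stops being memory-bound and becomes compute-bound. Here's a minimal sketch; the peak FLOP/s and memory bandwidth figures are illustrative assumptions, not the specs of any particular GPU.

```python
# Roofline-style classification of a kernel. Hardware numbers below are
# assumed, illustrative figures, not specs of any particular chip.
PEAK_FLOP_PER_S = 1e15   # 16-bit matrix arithmetic bandwidth, FLOP/s
PEAK_BYTES_PER_S = 3e12  # memory bandwidth, bytes/s

# The "ridge point": the arithmetic intensity (FLOPs per byte) at which
# a kernel shifts from memory-bound to compute-bound.
RIDGE_POINT = PEAK_FLOP_PER_S / PEAK_BYTES_PER_S

def is_compute_bound(flops: float, bytes_moved: float) -> bool:
    """Compare a kernel's arithmetic intensity to the ridge point."""
    return flops / bytes_moved > RIDGE_POINT

# e.g. a large square GEMM: ~2*N^3 FLOPs over ~3*N^2 16-bit values moved
N = 8192
print(is_compute_bound(2 * N**3, 3 * N**2 * 2))  # True: intensity ~ N/3
```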
Large diffusion model inference workloads are generally compute-bound. Contemporary large language model inference workloads are often compute-bound during batch prefill/prompt processing, when each weight can be loaded into shared memory once and then used across many tokens.
Let's do a simple estimation, inspired by kipperrii's Transformer inference arithmetic framework, of the minimum latency between tokens (inter-token latency or time per output token) for compute-bound Transformer language model inference. Assume the model has 500B parameters, stored in 16-bit precision, for a total of 1 TB. This model will perform roughly one trillion floating point operations (one multiply and one accumulate per parameter) per batch element. On a GPU with one petaFLOP/s of arithmetic bandwidth for 16-bit matrix math, the minimum latency between tokens, assuming compute-boundedness, is one millisecond per batch element.
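Spelled out in code, the estimate looks like this (all numbers come from the text above):

```python
params = 500e9                # 500B parameters, stored in 16-bit precision
flops_per_token = 2 * params  # one multiply and one accumulate per parameter
peak_flop_per_s = 1e15        # 1 PFLOP/s of 16-bit matrix arithmetic

min_itl_s = flops_per_token / peak_flop_per_s  # per batch element
print(f"minimum inter-token latency: {min_itl_s * 1e3:.1f} ms")  # 1.0 ms
```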
Note that for this GPU to be compute-bound at batch size one, it would need a memory bandwidth of 1 PB/s (so that it can load all 1 TB of weights in one ms). Contemporary memory bandwidths are in the TB/s range, and so batches of hundreds of inputs are required to provide sufficient arithmetic intensity for execution to be compute-bound.
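We can run the same comparison the other way to estimate the batch size needed for compute-boundedness. The 3 TB/s memory bandwidth below is an assumed figure, in the contemporary range mentioned above:

```python
params = 500e9
weight_bytes = 2 * params  # 1 TB of 16-bit weights
peak_flop_per_s = 1e15     # as above
mem_bytes_per_s = 3e12     # ~3 TB/s, an assumed contemporary figure

# Each weight is loaded once but contributes 2 FLOPs per batch element,
# so compute time scales with batch size while weight traffic does not.
weight_load_s = weight_bytes / mem_bytes_per_s        # ~0.33 s to stream weights
compute_s_per_elem = 2 * params / peak_flop_per_s     # 1 ms per batch element
min_batch = weight_load_s / compute_s_per_elem        # ~333 batch elements
print(f"compute-bound above batch size ~{min_batch:.0f}")
```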
For more on LLM inference, see our LLM Engineer's Almanac.