SM utilization is akin to the more familiar GPU utilization reported by nvidia-smi, but more fine-grained. Instead of reporting the fraction of time that a kernel is executing anywhere on the GPU, it reports the fraction of time that SMs spend executing kernels, averaged across all SMs. If a kernel uses only one SM, e.g. because it has only one thread block, then it will achieve 100% GPU utilization while it is active, but the SM utilization will be at most one over the number of SMs, which is under 1% on an H100 GPU.
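The gap between the two metrics can be illustrated with a small model. This is only a sketch, not how any profiler computes the metric: it assumes activity is recorded as a list of time samples, each holding one busy/idle flag per SM, and it assumes 132 SMs (the H100 SXM5 count; other H100 variants have fewer). The function names and the sample layout are hypothetical.

```python
NUM_SMS = 132  # assumption: H100 SXM5; other H100 variants have fewer SMs


def gpu_utilization(samples):
    """Fraction of samples in which at least one SM is running a kernel."""
    return sum(any(sample) for sample in samples) / len(samples)


def sm_utilization(samples):
    """Fraction of total SM-time spent running kernels, over all SMs."""
    busy = sum(sum(sample) for sample in samples)
    return busy / (len(samples) * len(samples[0]))


# A kernel with a single thread block keeps one SM (here SM 0) busy
# for the whole window while every other SM sits idle.
one_block = [[sm == 0 for sm in range(NUM_SMS)] for _ in range(100)]

print(f"GPU utilization: {gpu_utilization(one_block):.1%}")  # 100.0%
print(f"SM utilization:  {sm_utilization(one_block):.2%}")   # 0.76%
```

In this model the single-block kernel saturates GPU utilization while SM utilization stays at 1/132.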

As with GPU utilization but unlike CPU utilization, SM utilization should be high, even up to 100%.

Although SM utilization is finer-grained than GPU utilization, it still isn't fine-grained enough to capture how well the GPU's compute resources are being used. If SM utilization is high but performance is still inadequate, programmers should check pipe utilization, which measures how effectively each SM uses its internal functional units. High SM utilization with low pipe utilization indicates that your kernel is running on many SMs but not fully utilizing the computational resources within each one.
