GPU Glossary
What is a Warpgroup?

A warpgroup is a set of four contiguous warps such that the warp-rank of the first warp is a multiple of 4.

A warpgroup-level instruction coordinates 128 threads: 4 warps per warpgroup × 32 threads per warp. Operating at this larger granularity removes the need for explicit inter-warp synchronization and lets each instruction work on a larger problem size, especially larger matrix multiplications, which more readily saturate the massive arithmetic bandwidth of the Tensor Cores in recent data center GPUs.

Warpgroups were introduced in NVIDIA's Hopper Streaming Multiprocessor architecture, where they are used to support warpgroup-level matrix multiplication instructions like wgmma.mma_async. See this blog post from Colfax for a deep dive. Warpgroups feature prominently in the organization of the pipeline components of high-performance Hopper and Blackwell kernels, like Flash Attention 4.
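
As a rough illustration of what warpgroup-level execution looks like at the PTX level, the sketch below shows the fence/commit/wait instructions that bracket asynchronous warpgroup MMAs on Hopper (sm_90a). The surrounding function is ours and the wgmma.mma_async call itself is elided; consult the PTX ISA documentation for the full operand syntax.

```cpp
// Hedged sketch: the synchronization pattern around warpgroup-level MMA on Hopper.
// All 128 threads of the warpgroup issue these instructions together.
// Requires compiling for sm_90a; the actual wgmma.mma_async instruction (elided here)
// additionally takes a tile shape, accumulator registers, and matrix descriptors.
__device__ void wgmma_sync_pattern() {
    asm volatile("wgmma.fence.sync.aligned;" ::: "memory");         // order prior register/shared-memory accesses
    // ... one or more wgmma.mma_async.sync.aligned.<shape> instructions here ...
    asm volatile("wgmma.commit_group.sync.aligned;" ::: "memory");  // close the group of pending async MMAs
    asm volatile("wgmma.wait_group.sync.aligned 0;" ::: "memory");  // wait until all committed groups complete
}
```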

In Parallel Thread Execution (PTX) IR, the warp-rank of a warp is computed as:

```cpp
int linearIdx = %tid.x + %tid.y * %ntid.x + %tid.z * %ntid.x * %ntid.y;
int warpRank = linearIdx / 32;
```

where %tid is the thread index within the thread block and %ntid is the number of threads in each dimension of the block, both accessed via special PTX registers.
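
The same computation can be written in CUDA C++. The sketch below is a minimal illustration (the function name is ours, not from the glossary); it assumes the standard warp size of 32 threads and the warpgroup size of 4 warps described above.

```cpp
// Minimal sketch: compute a thread's warp-rank and warpgroup-rank in CUDA C++,
// mirroring the PTX formula above. Assumes 32 threads per warp, 4 warps per warpgroup.
__device__ void warpgroup_ids(int &warpRank, int &warpgroupRank, int &laneInWarpgroup) {
    int linearIdx = threadIdx.x
                  + threadIdx.y * blockDim.x
                  + threadIdx.z * blockDim.x * blockDim.y;  // linearized thread index, as in the PTX above
    warpRank        = linearIdx / 32;   // which warp this thread belongs to
    warpgroupRank   = warpRank / 4;     // warps 0-3 -> warpgroup 0, warps 4-7 -> warpgroup 1, ...
    laneInWarpgroup = linearIdx % 128;  // this thread's position within its 128-thread warpgroup
}
```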

So the valid warpgroups for an 8-warp dispatch are:

  • Warpgroup 0: warp-ranks 0, 1, 2, and 3
  • Warpgroup 1: warp-ranks 4, 5, 6, and 7
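
To make this concrete, the hypothetical kernel below launches a block of 256 threads (8 warps) and branches on the warpgroup-rank, in the spirit of the warpgroup-specialized pipelines mentioned above. The kernel name and the producer/consumer split are illustrative, not prescribed by the glossary.

```cpp
// Hypothetical sketch: a 256-thread (8-warp) block split into its two warpgroups,
// with each warpgroup taking on a different role, as in warp-specialized pipelines.
__global__ void two_warpgroup_kernel() {
    int warpgroupRank = (threadIdx.x / 32) / 4;  // assumes a 1D block of 256 threads
    if (warpgroupRank == 0) {
        // Warpgroup 0 (warp-ranks 0-3): e.g. a "producer" role, loading tiles of data
    } else {
        // Warpgroup 1 (warp-ranks 4-7): e.g. a "consumer" role, issuing warpgroup-level MMAs
    }
}

// Example launch with two complete warpgroups per block:
// two_warpgroup_kernel<<<1, 256>>>();
```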

To our knowledge, the purpose of the warp-rank alignment restriction is not documented. But the Streaming Multiprocessors of recent data center GPUs appear to contain four (unnamed) subunits, each with its own Warp Scheduler and Tensor Core, which suggests the alignment may let each warp of a warpgroup be scheduled by its own subunit.
