GPU Glossary

What is a Warp?

A warp is a group of threads that are scheduled together and execute in parallel. All threads in a warp are scheduled onto a single Streaming Multiprocessor (SM). A single SM typically executes multiple warps, at the very least all warps from the same Cooperative Thread Array, aka thread block.
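As a concrete illustration in CUDA C++ (a hypothetical kernel, not from the glossary), a thread can recover which warp it belongs to within its block, and its position ("lane") within that warp, from its linearized thread index, since blocks are partitioned into warps in order of consecutive thread IDs:

```cpp
#include <cstdio>

// Hypothetical kernel for illustration: recover the warp and lane indices of
// each thread from its linearized index within the block.
__global__ void warp_and_lane() {
    int linear_tid = threadIdx.x
                   + threadIdx.y * blockDim.x
                   + threadIdx.z * blockDim.x * blockDim.y;
    int warp_id = linear_tid / warpSize;  // which warp within the block
    int lane_id = linear_tid % warpSize;  // this thread's position in that warp
    if (lane_id == 0) {
        printf("block %d, warp %d begins at thread %d\n",
               blockIdx.x, warp_id, linear_tid);
    }
}
```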

Warps are the typical unit of execution on a GPU. In normal execution, all threads of a warp execute the same instruction in parallel — the so-called "Single-Instruction, Multiple Thread" or SIMT model. When the threads in a warp split from one another to execute different instructions, also known as warp divergence, performance generally drops precipitously.
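A hypothetical kernel sketch of the effect: branching on a value that differs between threads of the same warp forces the two paths to run one after the other, with part of the warp idle each time, whereas a branch that all 32 threads take the same way does not diverge.

```cpp
// Hypothetical kernel: the branch condition differs between adjacent lanes,
// so each warp must execute both the sinf and the cosf path serially.
__global__ void divergent(float* out) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0) {
        out[tid] = sinf((float)tid);  // even lanes run this path while odd lanes idle
    } else {
        out[tid] = cosf((float)tid);  // then odd lanes run this path while even lanes idle
    }
    // Branching on a value that is uniform across the warp, e.g. blockIdx.x % 2,
    // would not diverge: every thread in the warp takes the same path.
}
```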

Warp size is technically a machine-dependent constant, but in practice (and elsewhere in this glossary) it is 32.
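The value can be read at runtime rather than hard-coded: it is exposed as the warpSize field of cudaDeviceProp on the host and as the built-in warpSize variable in device code. A minimal host-side sketch, assuming device 0 is present:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);         // query device 0, assumed present
    printf("warp size: %d\n", prop.warpSize);  // 32 on current NVIDIA GPUs
    return 0;
}
```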

When a warp is issued an instruction, the results are generally not available within a single clock cycle, and so dependent instructions cannot be issued. While this is most obviously true for fetches from global memory, which generally go off-chip, it is also true for some arithmetic instructions (see the CUDA C++ Best Practices Guide for a table of results per clock cycle for specific instructions).
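As a rough sketch of the issue (a hypothetical kernel, not an example from the Best Practices Guide), consider a loop that carries a dependency: each operation needs the result of the previous one, so the warp cannot issue the next instruction until the earlier one's latency has elapsed.

```cpp
// Hypothetical kernel: acc feeds back into itself, so every fused multiply-add
// must wait for the previous one's result. The warp sits idle for those cycles
// unless the scheduler has another warp to switch to.
__global__ void dependent_chain(float* out, float x) {
    float acc = x;
    #pragma unroll
    for (int i = 0; i < 64; ++i) {
        acc = acc * 1.0001f + 0.5f;  // depends on acc from the previous iteration
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}
```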

A warp whose next instruction is delayed by missing operands is said to be stalled.

Instead of waiting for an instruction's results to return, when multiple warps are scheduled onto a single SM, the Warp Scheduler will select another warp to execute. This latency-hiding is how GPUs achieve high throughput and ensure work is always available for all of their cores during execution. For this reason, it is often beneficial to maximize the number of warps scheduled onto each SM, ensuring there is always an eligible warp for the SM to run. The fraction of cycles on which a warp was issued an instruction is known as the issue efficiency. The degree of concurrency in warp scheduling, measured as the ratio of warps resident on an SM to the maximum it can support, is known as occupancy.
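The CUDA runtime can estimate this directly. A minimal sketch using the occupancy API, where my_kernel and the block size are stand-ins:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void my_kernel(float* data) { /* placeholder workload */ }

int main() {
    int block_size = 256;  // 256 threads per block = 8 warps per block
    int max_blocks_per_sm = 0;
    // How many blocks of my_kernel can be resident on one SM at this block size?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &max_blocks_per_sm, my_kernel, block_size, /*dynamicSMemSize=*/0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int resident_warps = max_blocks_per_sm * block_size / prop.warpSize;
    int max_warps = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    printf("occupancy: %d of %d warps per SM\n", resident_warps, max_warps);
    return 0;
}
```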

Warps are not actually part of the CUDA programming model's thread hierarchy. Instead, they are a detail of how that model is implemented on NVIDIA GPUs. In that way, they are somewhat akin to cache lines in CPUs: a feature of the hardware that you don't directly control and don't need to consider for program correctness, but which is important for achieving maximum performance.

Warps are named in reference to weaving, "the first parallel thread technology", according to Lindholm et al., 2008. Equivalents of warps in other GPU programming models include subgroups in WebGPU, waves in DirectX, and simdgroups in Metal.
