GPU Glossary

What is a Warp?

A warp is a group of threads that are scheduled together and execute in parallel. All threads in a warp are scheduled onto a single Streaming Multiprocessor (SM). A single SM typically executes multiple warps, at the very least all warps from the same Cooperative Thread Array, aka thread block.
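As a concrete illustration in CUDA C++ (a hypothetical kernel, not from the glossary), a thread can recover which warp it belongs to within its block, and its position ("lane") within that warp, from its linearized thread index, since blocks are partitioned into warps in order of consecutive thread IDs:

```cpp
#include <cstdio>

// Hypothetical kernel for illustration: recover the warp and lane indices of
// each thread from its linearized index within the block.
__global__ void warp_and_lane() {
    int linear_tid = threadIdx.x
                   + threadIdx.y * blockDim.x
                   + threadIdx.z * blockDim.x * blockDim.y;
    int warp_id = linear_tid / warpSize;  // which warp within the block
    int lane_id = linear_tid % warpSize;  // this thread's position in that warp
    if (lane_id == 0) {
        printf("block %d, warp %d begins at thread %d\n",
               blockIdx.x, warp_id, linear_tid);
    }
}
```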

Warps are the typical unit of execution on a GPU. In normal execution, all threads of a warp execute the same instruction in parallel — the so-called "Single-Instruction, Multiple Thread" or SIMT model. When the threads in a warp split from one another to execute different instructions, also known as warp divergence, performance generally drops precipitously.
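A hypothetical kernel sketch of the effect: branching on a value that differs between threads of the same warp forces the two paths to run one after the other, with part of the warp idle each time, whereas a branch that all 32 threads take the same way does not diverge.

```cpp
// Hypothetical kernel: the branch condition differs between adjacent lanes,
// so each warp must execute both the sinf and the cosf path serially.
__global__ void divergent(float* out) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0) {
        out[tid] = sinf((float)tid);  // even lanes run this path while odd lanes idle
    } else {
        out[tid] = cosf((float)tid);  // then odd lanes run this path while even lanes idle
    }
    // Branching on a value that is uniform across the warp, e.g. blockIdx.x % 2,
    // would not diverge: every thread in the warp takes the same path.
}
```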

Warp size is technically a machine-dependent constant, but in practice (and elsewhere in this glossary) it is 32.
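The value can be read at runtime rather than hard-coded: it is exposed as the warpSize field of cudaDeviceProp on the host and as the built-in warpSize variable in device code. A minimal host-side sketch, assuming device 0 is present:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);         // query device 0, assumed present
    printf("warp size: %d\n", prop.warpSize);  // 32 on current NVIDIA GPUs
    return 0;
}
```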

When a warp is issued an instruction, the results are generally not available within a single clock cycle, and so dependent instructions cannot be issued. While this is most obviously true for fetches from global memory, which generally go off-chip, it is also true for some arithmetic instructions (see the CUDA C++ Best Practices Guide for a table of results per clock cycle for specific instructions).
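As a rough sketch of the issue (a hypothetical kernel, not an example from the Best Practices Guide), consider a loop that carries a dependency: each operation needs the result of the previous one, so the warp cannot issue the next instruction until the earlier one's latency has elapsed.

```cpp
// Hypothetical kernel: acc feeds back into itself, so every fused multiply-add
// must wait for the previous one's result. The warp sits idle for those cycles
// unless the scheduler has another warp to switch to.
__global__ void dependent_chain(float* out, float x) {
    float acc = x;
    #pragma unroll
    for (int i = 0; i < 64; ++i) {
        acc = acc * 1.0001f + 0.5f;  // depends on acc from the previous iteration
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}
```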

A warp whose next instruction is delayed by missing operands is said to be stalled.

Instead of waiting for an instruction's results to return, when multiple warps are scheduled onto a single SM, the Warp Scheduler will select another warp to execute. This latency-hiding is how GPUs achieve high throughput and ensure work is always available for all of their cores during execution. For this reason, it is often beneficial to maximize the number of warps scheduled onto each SM, ensuring there is always an eligible warp for the SM to run. The fraction of cycles on which a warp was issued an instruction is known as the issue efficiency. The degree of concurrency in warp scheduling, measured as the ratio of warps resident on an SM to the maximum it can support, is known as occupancy.
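The CUDA runtime can estimate this directly. A minimal sketch using the occupancy API, where my_kernel and the block size are stand-ins:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void my_kernel(float* data) { /* placeholder workload */ }

int main() {
    int block_size = 256;  // 256 threads per block = 8 warps per block
    int max_blocks_per_sm = 0;
    // How many blocks of my_kernel can be resident on one SM at this block size?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &max_blocks_per_sm, my_kernel, block_size, /*dynamicSMemSize=*/0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int resident_warps = max_blocks_per_sm * block_size / prop.warpSize;
    int max_warps = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    printf("occupancy: %d of %d warps per SM\n", resident_warps, max_warps);
    return 0;
}
```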

Warps are not actually part of the CUDA programming model's thread hierarchy. Instead, they are a detail of how that model is implemented on NVIDIA GPUs. In that way, they are somewhat akin to cache lines in CPUs: a feature of the hardware that you don't directly control and don't need to consider for program correctness, but which is important for achieving maximum performance.

Warps are named in reference to weaving, "the first parallel thread technology", according to Lindholm et al., 2008. Equivalents of warps in other GPU programming models include subgroups in WebGPU, waves in DirectX, and simdgroups in Metal.
