These groups of threads, known as warps, are switched out on a per-clock-cycle basis (roughly once per nanosecond), much like the fine-grained thread-level parallelism of simultaneous multi-threading ("hyper-threading") in CPUs, but at a much larger scale. The ability of the Warp Schedulers to switch rapidly between a large number of concurrent tasks as soon as their instructions' operands are available is key to the latency-hiding capabilities of GPUs.
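To make the latency-hiding argument concrete, here is a minimal toy model in Python. It is a sketch under simplifying assumptions, not a description of real hardware: a single scheduler issues at most one instruction per cycle, every instruction has the same fixed latency before its warp's next instruction becomes ready, and warps are picked in fixed priority order. The function name `simulate` and its parameters are hypothetical.

```python
def simulate(num_warps, instrs_per_warp, latency):
    """Toy warp scheduler: returns the number of cycles needed to issue everything.

    Assumptions (illustrative only): one issue slot per cycle, a fixed
    `latency` in cycles before a warp's next instruction is ready, and
    fixed-priority selection among ready warps.
    """
    ready_at = [0] * num_warps                  # cycle when each warp can issue again
    remaining = [instrs_per_warp] * num_warps   # instructions left per warp
    cycle = 0
    while any(remaining):
        # Issue from the first warp whose operands are available this cycle.
        for w in range(num_warps):
            if remaining[w] and ready_at[w] <= cycle:
                remaining[w] -= 1
                ready_at[w] = cycle + latency
                break
        # If no warp was ready, the issue slot is wasted this cycle.
        cycle += 1
    return cycle


if __name__ == "__main__":
    for warps in (1, 2, 4):
        cycles = simulate(warps, instrs_per_warp=4, latency=4)
        utilization = warps * 4 / cycles
        print(f"{warps} warp(s): {cycles} cycles, issue utilization {utilization:.2f}")
```

In this model, a single warp leaves most issue slots empty while it waits on its own latency, whereas four warps with a four-cycle latency keep the scheduler issuing every cycle. Real schedulers, latencies, and instruction mixes are far more varied, but the trend is the same: more ready warps means fewer idle cycles.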

Full CPU thread context switches take a few hundred to a few thousand clock cycles (closer to a microsecond than a nanosecond) because the context of one thread must be saved and the context of another restored. Additionally, context switches on CPUs reduce locality, further hurting performance by increasing cache miss rates (see Mogul and Borg, 1991).

Because each thread has its own private registers allocated from the register file of the SM, context switches on the GPU do not require any data movement to save or restore contexts.
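One consequence of keeping every thread's registers resident is that the register file bounds how many warps the scheduler can switch among. The back-of-the-envelope calculation below sketches that bound. The default values are assumptions for illustration (65,536 32-bit registers per SM, 32 threads per warp, and a hardware cap of 64 resident warps, figures that match several recent NVIDIA data center GPUs but vary by architecture), the helper name `max_resident_warps` is hypothetical, and real hardware allocates registers at a coarser granularity than this sketch models.

```python
def max_resident_warps(regs_per_thread,
                       regs_per_sm=65536,     # assumed register file size per SM
                       threads_per_warp=32,   # warp size on NVIDIA GPUs
                       hw_warp_limit=64):     # assumed per-SM cap; varies by architecture
    """Upper bound on warps whose registers fit in the register file at once."""
    regs_per_warp = regs_per_thread * threads_per_warp
    return min(hw_warp_limit, regs_per_sm // regs_per_warp)


if __name__ == "__main__":
    for regs in (32, 64, 128, 255):
        print(f"{regs} registers/thread -> at most {max_resident_warps(regs)} resident warps")
```

Under these assumptions, a kernel using 32 registers per thread can keep the full 64 warps resident, while one using 128 registers per thread is limited to 16, leaving the scheduler fewer candidates for hiding latency.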

And because the L1 caches on GPUs can be entirely programmer-managed and are shared between the warps scheduled together onto an SM (see cooperative thread array), context switches on the GPU have much less impact on cache hit rates. For details on the interaction between programmer-managed caches and hardware-managed caches in GPUs, see the "Maximize Memory Throughput" section of the CUDA C Programming Guide.

The Warp Schedulers also manage the execution state of warps.


