Warp Scheduler
The Warp Scheduler of the Streaming Multiprocessor (SM) decides which group of threads to execute.
These groups of threads, known as warps, can be switched out every clock cycle, which takes roughly one nanosecond.
CPU thread context switches, on the other hand, take a few hundred to a few thousand clock cycles (more like a microsecond than a nanosecond) because the context of one thread must be saved and the context of another restored. Additionally, context switches on CPUs reduce locality, which further hurts performance by increasing cache miss rates (see Mogul and Borg, 1991).
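The gap between the two costs is easier to see when both are converted to the same unit. The sketch below is a back-of-the-envelope calculation, not a measurement: the 1 GHz GPU clock, the 3 GHz CPU clock, and the 300 and 3,000 cycle counts are illustrative round numbers chosen to match "a few hundred to a few thousand", and the helper name cycles_to_ns is made up for this example.

```python
def cycles_to_ns(cycles: float, clock_hz: float) -> float:
    """Convert a number of clock cycles into nanoseconds at a given clock rate."""
    return cycles * 1e9 / clock_hz


# Assumed, illustrative clock rates (not taken from any specific chip).
GPU_CLOCK_HZ = 1e9  # ~1 GHz, so one cycle is about one nanosecond
CPU_CLOCK_HZ = 3e9  # ~3 GHz

# A warp switch costs on the order of one GPU cycle.
warp_switch_ns = cycles_to_ns(1, GPU_CLOCK_HZ)

# Assumed stand-ins for "a few hundred" and "a few thousand" CPU cycles.
cpu_switch_low_ns = cycles_to_ns(300, CPU_CLOCK_HZ)
cpu_switch_high_ns = cycles_to_ns(3000, CPU_CLOCK_HZ)

print(f"warp switch:        ~{warp_switch_ns:.0f} ns")
print(f"CPU context switch: ~{cpu_switch_low_ns:.0f} to {cpu_switch_high_ns:.0f} ns")
print(f"ratio:              ~{cpu_switch_low_ns / warp_switch_ns:.0f}x "
      f"to {cpu_switch_high_ns / warp_switch_ns:.0f}x")
```

Under these assumptions the CPU context switch lands between roughly a tenth of a microsecond and a full microsecond, two to three orders of magnitude above the single-cycle warp switch.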
Because each thread has its own private registers allocated from the register file of the SM, context switches on the GPU do not require any data movement to save or restore contexts.
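One consequence of keeping every thread's registers resident is that the register file bounds how many warps can be held on an SM at once. The sketch below is a simplified estimate under stated assumptions: a register file of 65,536 32-bit registers per SM, 32 threads per warp, and a hardware cap of 64 resident warps. These values vary by architecture, real hardware also rounds allocations up to a granularity that is ignored here, and the function name resident_warps is hypothetical.

```python
def resident_warps(
    regs_per_thread: int,
    regfile_size: int = 65536,  # assumption: 32-bit registers per SM
    warp_size: int = 32,        # threads per warp
    max_warps: int = 64,        # assumption: hardware cap on resident warps
) -> int:
    """Estimate how many warps can keep all their registers on one SM.

    Ignores register allocation granularity and every other resource
    limit (shared memory, thread block count), so it is an upper bound.
    """
    if regs_per_thread <= 0:
        raise ValueError("regs_per_thread must be positive")
    regs_per_warp = regs_per_thread * warp_size
    return min(max_warps, regfile_size // regs_per_warp)


for regs in (32, 64, 128):
    print(f"{regs:3d} registers/thread -> up to {resident_warps(regs)} resident warps")
```

Every warp counted here already has its registers in place, which is why switching among them needs no save or restore step. The trade-off in this simplified model is that kernels using more registers per thread leave the scheduler fewer warps to choose from.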
Because the L1 caches on GPUs can be entirely programmer-managed and are shared between the warps scheduled together onto an SM (see cooperative thread array), context switches on the GPU have much less impact on cache hit rates. For details on the interaction between programmer-managed caches and hardware-managed caches in GPUs, see the "Maximize Memory Throughput" section of the CUDA C Programming Guide.