GPU Glossary

Thread Hierarchy

The thread hierarchy of the CUDA programming model spans from individual threads to thread blocks to thread block grids (left), mapping onto the hardware from CUDA Cores to Streaming Multiprocessors to the entire GPU (right). Modified from diagrams in NVIDIA's CUDA Refresher: The CUDA Programming Model and the NVIDIA CUDA C++ Programming Guide.

The thread hierarchy is a key abstraction of the CUDA programming model, alongside the memory hierarchy. It organizes the execution of parallel programs across multiple levels, from individual threads up to entire GPU devices.

At the lowest level are individual threads. Like a thread of execution on a CPU, each CUDA thread executes a stream of instructions. The hardware resources that effect arithmetic and logic instructions are called Cores. Threads are selected for execution by the Warp Scheduler.
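
As a minimal sketch (the kernel name and launch configuration here are illustrative, not from this glossary), the kernel below is launched with a single thread, which executes its instruction stream one operation at a time, much as a CPU thread would:

```cpp
#include <cstdio>

// Hypothetical kernel: a single CUDA thread executing a short
// stream of arithmetic instructions on a CUDA Core.
__global__ void one_thread_demo(int *out) {
    int x = 1;    // each statement compiles to instructions
    x = x + 41;   // executed in order by this one thread
    *out = x;
}

int main() {
    int *d_out, h_out = 0;
    cudaMalloc(&d_out, sizeof(int));
    one_thread_demo<<<1, 1>>>(d_out);  // 1 block of 1 thread
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    printf("%d\n", h_out);             // prints 42
    cudaFree(d_out);
    return 0;
}
```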

The intermediate level consists of thread blocks, which are also known as cooperative thread arrays in PTX and SASS. Each thread has a unique identifier within its thread block. These thread identifiers are index-based, to support assignment of work to threads based on indices into input or output arrays. All threads within a block are scheduled simultaneously onto the same Streaming Multiprocessor (SM). They can coordinate through shared memory and synchronize with barriers.
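
To make block-level cooperation concrete, here is a sketch (the kernel name and block size are hypothetical): threads in one block stage data in shared memory, wait at a barrier, and then read elements loaded by their neighbors.

```cpp
// Hypothetical kernel: threads in a single block cooperate through
// shared memory and a barrier to reverse a 256-element array.
__global__ void reverse_block(float *data) {
    __shared__ float tile[256];          // shared memory, visible block-wide
    int t = threadIdx.x;                 // unique index within the block
    tile[t] = data[t];                   // each thread loads one element
    __syncthreads();                     // barrier: wait for all loads
    data[t] = tile[blockDim.x - 1 - t];  // read a neighbor's element
}

// Launched with exactly one block of 256 threads:
//   reverse_block<<<1, 256>>>(d_data);
```

Without the barrier, a thread might read a shared memory slot before the neighboring thread has written it; `__syncthreads()` guarantees all loads in the block complete first.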

At the highest level, multiple thread blocks are organized into a thread block grid that spans the entire GPU. Thread blocks are strictly limited in their coordination and communication. Blocks within a grid execute concurrently with respect to each other, with no guaranteed execution order. CUDA programs must be written so that any interleaving of blocks is valid, from fully serial to fully parallel. That means thread blocks cannot, for instance, synchronize with barriers. Like threads, each thread block has a unique, index-based identifier to support assignment of work based on array index.
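
A sketch of grid-level indexing (SAXPY is a standard CUDA teaching example, not taken from this glossary): each thread computes a global index from its block's identifier, and blocks proceed independently, with no barrier spanning the grid.

```cpp
// Each block handles an independent slice of the arrays; because
// blocks never synchronize with one another, any interleaving of
// blocks, from fully serial to fully parallel, gives the same result.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global index
    if (i < n)                     // guard: the grid may overshoot n
        y[i] = a * x[i] + y[i];
}

// Launch enough blocks to cover n elements, 256 threads per block:
//   int blocks = (n + 255) / 256;
//   saxpy<<<blocks, 256>>>(n, 2.0f, d_x, d_y);
```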

This hierarchy maps directly onto the GPU hardware: threads execute on individual cores, thread blocks are scheduled onto SMs, and grids utilize all available SMs on the device.
