GPU Glossary
/device-software/memory-hierarchy

What is the CUDA Memory Hierarchy?

As part of the CUDA programming model, each level of the thread hierarchy has access to a distinct block of memory shared by all threads in a group at that level: a "memory hierarchy". This memory can be used for coordination and communication and is managed by the programmer (not the hardware or a runtime).

For a thread block grid, that shared memory is in the GPU's RAM and is known as global memory. Access to this memory can be coordinated with atomic operations and barriers, but the execution order of thread blocks across the grid is indeterminate.
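As a minimal sketch (the kernel name, sizes, and launch configuration are illustrative, not part of the glossary), every thread in a grid can accumulate into a single global memory location with atomicAdd, which yields a correct result no matter what order the thread blocks execute in:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void sumIntoGlobal(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // atomicAdd serializes conflicting updates to global memory,
        // so the sum is correct whatever order thread blocks run in.
        atomicAdd(out, in[i]);
    }
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));  // visible to host and device
    cudaMallocManaged(&out, sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;
    *out = 0.0f;
    sumIntoGlobal<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaDeviceSynchronize();
    printf("sum = %.0f\n", *out);  // expect 1048576
    cudaFree(in);
    cudaFree(out);
}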

For a single thread, the memory is a chunk of the Streaming Multiprocessor's (SM's) register file. According to the original semantics of the CUDA programming model, this memory is private to a thread, but certain instructions added to PTX and SASS to target matrix multiplication on Tensor Cores share inputs and outputs across threads.
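For example, in the following sketch (the kernel and parameter names are our own), the local variable x is ordinarily placed in the SM's register file, where no other thread can read or write it:

__global__ void scaleAndShift(const float* in, float* out, int n,
                              float scale, float shift) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // `x` lives in a register: private to this thread unless shared
        // explicitly, e.g. via warp shuffles or Tensor Core instructions.
        float x = in[i];
        out[i] = fmaf(x, scale, shift);  // fused multiply-add on registers
    }
}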

In between, the shared memory for the thread block level of the thread hierarchy is stored in the L1 data cache of each SM. Careful management of this cache, e.g. loading data into it to support the maximum number of arithmetic operations before new data must be loaded, is key to the art of designing high-performance CUDA kernels.
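A sketch of that pattern (the kernel name and the 256-element tile size are assumptions of this example): each thread block stages one tile of global memory into __shared__ storage, synchronizes at a barrier, and then performs a cooperative tree reduction that reuses the resident tile for many arithmetic operations per global load:

__global__ void blockSum(const float* in, float* partial, int n) {
    __shared__ float tile[256];  // resides in the SM's L1 data cache

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();  // barrier: the whole tile is loaded before any reads

    // Tree reduction within the block: each pass halves the active threads,
    // reusing data already resident in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride) {
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        }
        __syncthreads();
    }

    if (threadIdx.x == 0) {
        partial[blockIdx.x] = tile[0];  // one partial sum per thread block
    }
}

// Launch with exactly 256 threads per block to match the tile size, e.g.
// blockSum<<<(n + 255) / 256, 256>>>(in, partial, n);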
