What is the CUDA Memory Hierarchy?
As part of the CUDA programming model, each level of the thread hierarchy has access to a distinct block of memory shared by all threads in a group at that level: a "memory hierarchy". This memory can be used for coordination and communication and is managed by the programmer (not the hardware or a runtime).
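The correspondence between the thread hierarchy and the memory hierarchy is visible directly in CUDA C++ syntax. As a minimal sketch (the kernel and variable names are illustrative, not part of the glossary), one variable at each level might look like:

```cpp
// One variable at each level of the memory hierarchy.
__global__ void memoryLevels(float* global_data) {  // global memory: visible to the whole grid
    __shared__ float tile[256];  // shared memory: visible to this thread block
    float local;                 // register: private to this thread

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    local = global_data[idx];           // read from global memory into a register
    tile[threadIdx.x] = local;          // stage into shared memory
    __syncthreads();                    // block-level barrier before reuse
    global_data[idx] = tile[threadIdx.x];
}
```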
For a thread block grid, that shared memory is in the GPU's RAM and is known as the global memory. Access to this memory can be coordinated with atomic operations and barriers, but execution order across thread blocks is indeterminate.
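For example, thread blocks can safely accumulate into a single global counter with an atomic, even though nothing is guaranteed about which block runs when. A minimal sketch (the kernel and names are hypothetical):

```cpp
// Thread blocks coordinate through global memory with an atomic.
__global__ void countPositives(const int* data, int n, int* counter) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n && data[idx] > 0) {
        // atomicAdd serializes conflicting updates: the final count is
        // correct regardless of the (indeterminate) block execution order.
        atomicAdd(counter, 1);
    }
}
```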
For a single thread, the memory is a chunk of the Streaming Multiprocessor's (SM's) register file. According to the original semantics of the CUDA programming model, this memory is private to a thread, but certain instructions added to PTX and SASS to target matrix multiplication on Tensor Cores share inputs and outputs across threads.
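The CUDA C++ `wmma` API exposes this register sharing: its "fragments" live in registers, but their contents are distributed across the 32 threads of a warp, so no single thread owns a whole tile. A minimal sketch, assuming one supported shape and type configuration (16x16x16, half inputs, float accumulator):

```cpp
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// A warp-wide Tensor Core multiply-accumulate. Every thread in the
// warp must execute these calls together; the fragment registers
// jointly hold the tile, breaking the thread-private register model.
__global__ void warpMatmul(const half* a, const half* b, float* c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, a, 16);   // all 32 threads cooperate
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // one Tensor Core op
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}
```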
In between, the shared memory for the thread block level of the thread hierarchy is stored in the L1 data cache of each SM. Careful management of this cache (e.g. loading data into it to support the maximum number of arithmetic operations before new data is loaded) is key to the art of designing high-performance CUDA kernels.
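The classic illustration of this reuse pattern is a tiled matrix multiplication: each global memory load is staged into shared memory once and then feeds many multiply-adds. A minimal sketch (names and sizes are illustrative, and it assumes `n` is a multiple of the tile width):

```cpp
#define TILE 32

// Stage tiles of A and B in shared memory so that each global load
// supports TILE multiply-adds before new data must be fetched.
__global__ void tiledMatmul(const float* A, const float* B, float* C, int n) {
    __shared__ float As[TILE][TILE];  // backed by the SM's L1/shared memory
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < n / TILE; ++t) {
        // Two global loads per thread per tile...
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();

        // ...then TILE multiply-adds served from fast shared memory.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * n + col] = acc;
}
```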