What is the Memory Hierarchy?
As part of the CUDA programming model, each level of the thread group hierarchy has access to a distinct block of memory shared by all threads in a group at that level: a "memory hierarchy" to match the thread group hierarchy. This memory can be used for coordination and communication and is managed by the programmer (not the hardware or a runtime).
For a thread block grid, that shared memory is in the GPU's RAM and is known as the global memory. Access to this memory can be coordinated with atomic operations and barriers, but execution order across thread blocks is indeterminate.
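As a minimal sketch of grid-level coordination through global memory, the kernel below has every thread in every block increment one counter with `atomicAdd`. The names `count_threads` and `run_count_threads` are hypothetical, chosen for illustration, and error checking on the CUDA runtime calls is omitted for brevity.

```cuda
#include <cuda_runtime.h>

// Every thread in the grid adds 1 to a single counter in global memory.
// atomicAdd makes each increment indivisible, so the final value does not
// depend on the (indeterminate) order in which thread blocks execute.
__global__ void count_threads(unsigned int *counter) {
    atomicAdd(counter, 1u);
}

// Hypothetical host helper: launches the kernel and returns the final count.
unsigned int run_count_threads(int blocks, int threads_per_block) {
    unsigned int *d_counter = nullptr;
    unsigned int h_counter = 0;

    cudaMalloc(&d_counter, sizeof(unsigned int));    // allocate in global memory
    cudaMemset(d_counter, 0, sizeof(unsigned int));  // start the counter at zero

    count_threads<<<blocks, threads_per_block>>>(d_counter);

    // This blocking copy on the default stream waits for the kernel to finish.
    cudaMemcpy(&h_counter, d_counter, sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_counter);
    return h_counter;
}
```

Calling `run_count_threads(4, 256)` should return 1024, one increment per thread, regardless of which block happens to run first. Replacing `atomicAdd` with a plain `*counter += 1` would introduce a data race and generally lose updates.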
For a single thread, the memory is a chunk of the Streaming Multiprocessor's (SM's) register file. In keeping with the memory semantics of the CUDA programming model, this memory is private to that thread.
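The thread-private level can be illustrated with an ordinary local variable. In the sketch below, each thread keeps its own accumulator `acc`, which the compiler will typically place in a register, so no synchronization is needed while it is updated. The kernel and helper names are hypothetical, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread sums tid * 1 + tid * 2 + ... + tid * n_terms into a local
// variable. `acc` is private to the thread: no other thread can read or
// write it, so the loop needs no atomics or barriers.
__global__ void private_accumulators(int *out, int n_terms) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int acc = 0;  // thread-private; typically held in a register
    for (int i = 1; i <= n_terms; ++i) {
        acc += tid * i;
    }
    out[tid] = acc;  // publish the private result to global memory
}

// Hypothetical host helper: runs the kernel and returns the value written
// by the thread with global index `query_tid`.
int run_private_accumulators(int blocks, int threads_per_block,
                             int n_terms, int query_tid) {
    int total_threads = blocks * threads_per_block;
    std::vector<int> h_out(total_threads);
    int *d_out = nullptr;

    cudaMalloc(&d_out, total_threads * sizeof(int));
    private_accumulators<<<blocks, threads_per_block>>>(d_out, n_terms);
    cudaMemcpy(h_out.data(), d_out, total_threads * sizeof(int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return h_out[query_tid];
}
```

With `n_terms = 4`, the thread with global index 100 computes 100 * (1 + 2 + 3 + 4) = 1000, unaffected by what any other thread does with its own `acc`.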
In between, the shared memory for the thread block level of the thread hierarchy is stored in the L1 data cache of each SM. Careful management of this cache (e.g. loading data into it so that as many arithmetic operations as possible run before new data must be loaded) is key to the art of designing high-performance CUDA kernels.
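One common way to use block-level shared memory is a per-block sum: each element is read from global memory once, and all later reads and writes during the reduction go to the `__shared__` buffer. The sketch below assumes a block size of 256 (a power of two, which the halving loop requires). The names `block_sum`, `run_block_sum`, and `kBlockSize` are hypothetical, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kBlockSize = 256;  // assumed block size; must be a power of two

// Each block sums its slice of `in` and writes one partial sum per block.
__global__ void block_sum(const float *in, float *block_sums, int n) {
    __shared__ float tile[kBlockSize];  // visible to all threads in this block

    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    // One global-memory load per thread; out-of-range threads contribute 0.
    tile[tid] = (gid < n) ? in[gid] : 0.0f;
    __syncthreads();  // wait until the whole tile has been loaded

    // Tree reduction carried out entirely in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) {
            tile[tid] += tile[tid + stride];
        }
        __syncthreads();  // finish this step before starting the next
    }

    if (tid == 0) {
        block_sums[blockIdx.x] = tile[0];
    }
}

// Hypothetical host helper: sums n ones on the GPU and returns the total.
float run_block_sum(int n) {
    int blocks = (n + kBlockSize - 1) / kBlockSize;
    std::vector<float> h_in(n, 1.0f);
    std::vector<float> h_sums(blocks);
    float *d_in = nullptr;
    float *d_sums = nullptr;

    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_sums, blocks * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    block_sum<<<blocks, kBlockSize>>>(d_in, d_sums, n);

    cudaMemcpy(h_sums.data(), d_sums, blocks * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_sums);

    // Combine the per-block partial sums on the host.
    float total = 0.0f;
    for (float s : h_sums) total += s;
    return total;
}
```

The `__syncthreads()` barriers matter here: without them, a thread could read a `tile` entry before the thread responsible for it has written it. Combining the partial sums on the host keeps the sketch short; a production kernel might instead finish the reduction on the device.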