What is Shared Memory?
Shared memory is the abstract memory associated with the thread block level (left, center) of the CUDA thread group hierarchy (left). Modified from diagrams in NVIDIA's CUDA Refresher: The CUDA Programming Model and the NVIDIA CUDA C++ Programming Guide.
Shared memory is the level of the memory hierarchy corresponding to the thread block level of the thread group hierarchy in the CUDA programming model. It is generally expected to be much smaller but much faster (in throughput and latency) than the global memory.
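As a concrete illustration, the sketch below (kernel and buffer names are purely illustrative) shows the two ways a kernel declares shared memory. Each thread block gets its own private copy of these buffers, which live on-chip only for the lifetime of the block and are not visible to other blocks.

```cuda
#include <cuda_runtime.h>

// Illustrative kernel: stage data from global memory into per-block
// shared memory, then operate on the on-chip copy.
__global__ void stageInSharedMemory(const float* in, float* out, int n) {
    __shared__ float staticTile[256];       // size fixed at compile time
    extern __shared__ float dynamicTile[];  // size supplied at launch time

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        staticTile[threadIdx.x]  = in[i];   // one trip to slow global memory
        dynamicTile[threadIdx.x] = in[i];
        // subsequent accesses to the tiles hit fast on-chip memory
        out[i] = staticTile[threadIdx.x] + dynamicTile[threadIdx.x];
    }
}

// Launch with the dynamic shared memory size (in bytes) as the third
// launch-configuration argument, e.g.:
//   stageInSharedMemory<<<blocks, 256, 256 * sizeof(float)>>>(d_in, d_out, n);
```

The static form fixes the buffer size at compile time; the dynamic form takes its size from the third launch-configuration argument.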
A fairly typical kernel therefore looks something like this (a concrete sketch follows the list):
- load data from global memory into shared memory
- perform a number of arithmetic operations on that data via the CUDA Cores and Tensor Cores
- optionally, synchronize threads within a thread block by means of barriers while performing those operations
- write data back into global memory, optionally preventing races across thread blocks by means of atomics
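Put together, a minimal sketch of that pattern might look like the following block-level sum reduction (all names are illustrative; error checking is omitted for brevity):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each block stages its slice of `in` in shared memory, reduces it
// cooperatively, and one thread per block atomically folds the partial
// sum into `*out` in global memory.
__global__ void blockSum(const float* in, float* out, int n) {
    __shared__ float tile[256];                 // per-block on-chip scratch

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f; // load global -> shared
    __syncthreads();                            // barrier: tile fully populated

    // arithmetic on the staged data: tree reduction over the block's tile
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        atomicAdd(out, tile[0]);                // write shared -> global, race-free
}

int main() {
    const int n = 1 << 20, threads = 256, blocks = (n + threads - 1) / threads;
    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemset(d_out, 0, sizeof(float));

    float* h_in = new float[n];
    for (int j = 0; j < n; ++j) h_in[j] = 1.0f;  // sum of n ones should be n
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    blockSum<<<blocks, threads>>>(d_in, d_out, n);

    float h_out = 0.0f;
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("sum = %.0f (expected %d)\n", h_out, n);

    cudaFree(d_in); cudaFree(d_out); delete[] h_in;
    return 0;
}
```

The __syncthreads() barrier guarantees that the whole tile has been written before any thread reads its neighbors' elements, and the single atomicAdd per block keeps the cross-block accumulation in global memory race-free.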
Shared memory is stored in the L1 data cache of the GPU's Streaming Multiprocessor (SM).
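Because shared memory and the L1 data cache occupy the same on-chip storage, the CUDA runtime lets a program query how much shared memory is available and hint at how the split should favor one or the other. The host-side sketch below is one way to do that on devices where the carveout hint applies (compute capability 7.0 and newer); the kernel name is a placeholder.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void someKernel() {}  // placeholder kernel for the carveout hint

int main() {
    int device = 0, perBlock = 0, perSM = 0;
    // How much shared memory the device offers per block and per SM.
    cudaDeviceGetAttribute(&perBlock, cudaDevAttrMaxSharedMemoryPerBlock, device);
    cudaDeviceGetAttribute(&perSM, cudaDevAttrMaxSharedMemoryPerMultiprocessor, device);
    printf("shared memory: %d B per block, %d B per SM\n", perBlock, perSM);

    // Hint that someKernel prefers the largest possible shared-memory
    // carveout relative to L1 cache.
    cudaFuncSetAttribute(someKernel, cudaFuncAttributePreferredSharedMemoryCarveout,
                         cudaSharedmemCarveoutMaxShared);
    return 0;
}
```

The carveout value is only a hint; the driver may pick a different supported split between shared memory and L1.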