GPU Glossary

What is the CUDA Memory Hierarchy?

As part of the CUDA programming model, each level of the thread hierarchy has access to a distinct block of memory shared by all threads in a group at that level: a "memory hierarchy". This memory can be used for coordination and communication and is managed by the programmer (not the hardware or a runtime).

For a thread block grid, that shared memory is in the GPU's RAM and is known as the global memory. Access to this memory can be coordinated with atomic operations and barriers, but execution order across thread blocks is indeterminate.
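As a rough sketch (the kernel name, sizes, and pointers below are hypothetical, not from the glossary), every thread block in a grid sees the same global memory, so a cross-block result can be accumulated safely with an atomic operation even though the blocks may run in any order:

```cuda
// Hypothetical kernel: count the positive entries of an array.
// Every block reads from and atomically updates the same global memory.
__global__ void count_positive(const float *in, int n, int *count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && in[i] > 0.0f) {
        atomicAdd(count, 1);  // atomic update visible to all thread blocks
    }
}

// Launch over a grid of blocks; the blocks execute in an indeterminate order,
// but the atomic guarantees a correct final count:
// count_positive<<<(n + 255) / 256, 256>>>(d_in, n, d_count);
```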

For a single thread, the memory is a chunk of the Streaming Multiprocessor's (SM's) register file. According to the original semantics of the CUDA programming model, this memory is private to a thread, but certain instructions added to PTX and SASS to target matrix multiplication on Tensor Cores share inputs and outputs across threads.
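For illustration (a hypothetical kernel, assumed here rather than taken from the glossary), a scalar local variable lives in a thread's registers, yet warp-level primitives such as __shfl_down_sync let threads in the same warp read each other's register values, one example of how that privacy has loosened over time:

```cuda
// Hypothetical kernel: sum the values held by the 32 threads of each warp.
__global__ void warp_sum(const float *in, float *out) {
    float val = in[threadIdx.x];   // thread-private value, held in a register
    // Tree reduction within a warp: each step reads a peer thread's register.
    for (int offset = 16; offset > 0; offset /= 2) {
        val += __shfl_down_sync(0xffffffff, val, offset);
    }
    if (threadIdx.x % 32 == 0) {
        out[threadIdx.x / 32] = val;  // lane 0 of each warp holds the warp's sum
    }
}
```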

In between, the shared memory for the thread block level of the thread hierarchy is stored in the L1 data cache of each SM. Careful management of this cache (e.g. loading data into it to support the maximum number of arithmetic operations before new data is loaded) is key to the art of designing high-performance CUDA kernels.
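A minimal sketch of that pattern (the kernel, tile size, and the assumption that the array length is a multiple of the tile size are all hypothetical): a __shared__ array is allocated once per thread block, filled cooperatively from global memory, and then reused by every thread in the block after a barrier:

```cuda
#define TILE 256

// Hypothetical kernel: reverse each TILE-sized chunk of an array in place.
__global__ void reverse_tile(const float *in, float *out) {
    __shared__ float tile[TILE];   // one copy per thread block, held in the SM's shared memory / L1

    int i = blockIdx.x * TILE + threadIdx.x;
    tile[threadIdx.x] = in[i];     // each thread loads one element from global memory
    __syncthreads();               // barrier: the whole tile is now visible to the block

    out[i] = tile[TILE - 1 - threadIdx.x];  // reuse data loaded by other threads in the block
}
```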
