What is Shared Memory?
Shared memory is the level of the memory hierarchy corresponding to the thread block level of the thread hierarchy in the CUDA programming model. It is generally much smaller but much faster (in both throughput and latency) than global memory.
A fairly typical kernel therefore looks something like this:
- load data from global memory into shared memory
- perform a number of arithmetic operations on that data via the CUDA Cores and Tensor Cores
- optionally, synchronize threads within a thread block by means of barriers while performing those operations
- write data back into global memory, optionally preventing races across thread blocks by means of atomics
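The steps above can be sketched as a simple block-level sum reduction. The kernel and buffer names here are illustrative, not from the original text:

```cuda
// Sketch of the load → compute → synchronize → write-back pattern.
// Launch with 256 threads per block, e.g. blockSum<<<numBlocks, 256>>>(...).
__global__ void blockSum(const float* in, float* out, int n) {
    __shared__ float tile[256];          // shared memory: one slot per thread

    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + tid;

    // 1. Load data from global memory into shared memory.
    tile[tid] = (idx < n) ? in[idx] : 0.0f;
    __syncthreads();                     // barrier: tile is fully populated

    // 2. Arithmetic on that data: tree reduction within the thread block.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) tile[tid] += tile[tid + stride];
        __syncthreads();                 // 3. barrier between reduction steps
    }

    // 4. Write back to global memory; atomicAdd prevents races across blocks.
    if (tid == 0) atomicAdd(out, tile[0]);
}
```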
Shared memory is stored in the L1 data cache of the GPU's Streaming Multiprocessor (SM).
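Because shared memory and the L1 data cache share the same on-SM SRAM, recent architectures let you hint at how that SRAM is split, per kernel, via the CUDA Runtime API. A minimal sketch, assuming a kernel named `blockSum`:

```cuda
// Hint that this kernel prefers a larger shared memory carveout.
// The value is a percentage of the SM's unified L1/shared SRAM.
cudaFuncSetAttribute(
    blockSum,
    cudaFuncAttributePreferredSharedMemoryCarveout,
    75  // prefer ~75% of the SRAM as shared memory; the driver may round
);
```

This is a preference, not a guarantee: the hardware supports only a fixed set of split configurations, and the driver picks the closest one.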