L1 Data Cache
The L1 data cache is the private memory of the Streaming Multiprocessor (SM).
Each SM partitions that memory among groups of threads scheduled onto it.
The L1 data cache is co-located with and nearly as fast as components that effect computations (e.g. the CUDA Cores).
It is implemented with SRAM, the same basic semiconductor cell used in CPU caches and registers and in the memory subsystem of Groq LPUs. The L1 data cache is accessed by the Load/Store Units of the SM.
CPUs also maintain an L1 cache. In CPUs, that cache is fully hardware-managed. In GPUs, that cache is mostly programmer-managed, even in high-level languages like CUDA C.
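To make "programmer-managed" concrete, here is a minimal sketch in CUDA C++. It assumes the programmer-managed portion of this on-chip memory is the part exposed as shared memory through the `__shared__` qualifier. The kernel name `reverse_in_shared`, the array `tile`, and the size `N` are hypothetical names chosen for illustration, not taken from any library.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int N = 256;  // elements handled by one thread block (illustrative size)

// Reverses N ints in place, staging them in the SM's on-chip memory.
__global__ void reverse_in_shared(int *data) {
    // __shared__ asks for storage in the SM's on-chip memory,
    // visible to every thread in this block. The programmer, not the
    // hardware, decides what is placed here and when.
    __shared__ int tile[N];

    int t = threadIdx.x;
    tile[t] = data[t];          // explicit copy: global memory -> shared memory
    __syncthreads();            // wait until every thread has written its element
    data[t] = tile[N - 1 - t];  // read another thread's element back out, reversed
}

int main() {
    int host[N];
    for (int i = 0; i < N; ++i) host[i] = i;

    int *dev = nullptr;
    if (cudaMalloc(&dev, sizeof(host)) != cudaSuccess) return 1;
    cudaMemcpy(dev, host, sizeof(host), cudaMemcpyHostToDevice);

    reverse_in_shared<<<1, N>>>(dev);  // one block of N threads
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) {
        // This blocking copy also waits for the kernel to finish.
        err = cudaMemcpy(host, dev, sizeof(host), cudaMemcpyDeviceToHost);
    }
    cudaFree(dev);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Verify on the host that the array was reversed.
    for (int i = 0; i < N; ++i) {
        if (host[i] != N - 1 - i) {
            std::fprintf(stderr, "mismatch at index %d\n", i);
            return 1;
        }
    }
    std::printf("ok\n");
    return 0;
}
```

The explicit copy into `tile` and the `__syncthreads()` barrier are the programmer's responsibility here. A hardware-managed CPU cache would perform the equivalent staging without being asked.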
The L1 data cache in each of an H100's SMs can store 256 KiB (2,097,152 bits). Across the 132 SMs in an H100 SXM5, that's 33 MiB (276,824,064 bits) of cache space.