What is Tensor Memory?
Tensor memory is a specialized memory in the Streaming Multiprocessor (SM) of certain GPUs, like the B200, for storing the inputs and outputs of Tensor Cores.
Tensor memory access is highly restricted. Data must be moved collectively by the four warps of a warpgroup, and those warps can only move data in specific patterns between tensor memory and registers, copy shared memory into tensor memory, or issue matrix-multiply-accumulate (MMA) instructions to the Tensor Cores that use tensor memory for specific operands. So much for a "compute-unified" device architecture!
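Concretely, each of these allowed movements has its own tcgen05 instruction family in PTX. The sketch below lists representative mnemonics only; the shape and synchronization qualifiers shown are illustrative examples, not the full set of legal forms.

```
// Sketch of the tcgen05 instruction families for tensor memory (Blackwell, sm_100a).
// Qualifiers and shapes (.sync.aligned, .32x32b, .x1, .128x256b, ...) are examples only.

tcgen05.alloc.cta_group::1.sync.aligned.b32   [dst], ncols;    // allocate columns of tensor memory
tcgen05.ld.sync.aligned.32x32b.x1.b32         {r0}, [taddr];   // tensor memory -> registers
tcgen05.st.sync.aligned.32x32b.x1.b32         [taddr], {r0};   // registers -> tensor memory
tcgen05.cp.cta_group::1.128x256b              [taddr], sdesc;  // shared memory -> tensor memory
tcgen05.dealloc.cta_group::1.sync.aligned.b32 taddr, ncols;    // release the columns again

// tcgen05.mma, the MMA instruction itself, is sketched further below.
```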
Specifically, for a tcgen05.mma Parallel Thread eXecution (PTX) instruction computing D += A @ B to use tensor memory, the "accumulator" matrix D must be in tensor memory, the left-hand matrix A may be in tensor memory or shared memory, and the right-hand matrix B must be in shared memory, not tensor memory.
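In PTX terms, this corresponds to two legal forms of the instruction, differing only in where A lives; D is always a tensor memory address and B is always described in shared memory. The sketch below abbreviates the qualifiers and omits how the descriptors are constructed.

```
// Sketch of the two legal operand forms of tcgen05.mma (qualifiers abbreviated).
// d_tmem: tensor memory address; a_sdesc/b_sdesc: shared memory descriptors;
// idesc: instruction descriptor; p: predicate enabling accumulation into D.

tcgen05.mma.cta_group::1.kind::f16 [d_tmem], a_sdesc,  b_sdesc, idesc, p;  // A in shared memory
tcgen05.mma.cta_group::1.kind::f16 [d_tmem], [a_tmem], b_sdesc, idesc, p;  // A in tensor memory

// There is no form that places B in tensor memory or any operand in registers.
```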
This is complex, but not arbitrary -- accumulators are accessed more frequently during matmuls than the tiles are, so they benefit more from specialized hardware, e.g. from shorter, simpler wiring between the Tensor Cores and the tensor memory. Note that none of the matrices are in the registers.
Beware: tensor memory is not directly related to the Tensor Memory Accelerator, which instead loads into the L1 data cache. Roughly speaking, data is moved from that cache into tensor memory only as a result of a Tensor Core operation and then is explicitly moved out for post-processing, e.g. the non-linearity after a matrix multiplication in a neural network.
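Putting the pieces together, a hedged sketch of that dataflow in PTX mnemonics (operands, mbarrier synchronization, and pipelining omitted) looks like:

```
// Sketch of the dataflow around tensor memory in a Blackwell matmul kernel.
// Mnemonics only; real kernels interleave and synchronize these stages.

cp.async.bulk.tensor.2d.shared::cluster.global ... ;  // TMA: global memory -> shared memory (L1/SMEM)
tcgen05.mma ... ;                                     // Tensor Cores read tiles, accumulate D in tensor memory
tcgen05.ld ... ;                                      // explicitly move the accumulator out to registers
// ... apply post-processing in registers, e.g. the non-linearity ...
st.global ... ;                                       // write finished results back to global memory
```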
For details on tensor memory and patterns for its use in matrix multiplications, see the Programming Blackwell Tensor Cores with CUTLASS talk from GTC 2025.