What is Tensor Memory?
Tensor memory is a specialized memory in the Streaming Multiprocessor (SM) of certain GPUs, like the B200, for storing the inputs and outputs of Tensor Cores.
Tensor memory access is highly restricted. Data must be moved collectively by the four warps of a warpgroup, and those warps can only move data in specific patterns between tensor memory and registers, copy shared memory into tensor memory, or issue matrix-multiply-accumulate (MMA) instructions to the Tensor Cores that use tensor memory for specific operands. So much for a "compute-unified" device architecture!
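Concretely, each of these allowed movements has its own tcgen05 instruction family in PTX. The sketch below lists representative mnemonics only; the shape and synchronization qualifiers shown are illustrative examples, not the full set of legal forms.

```
// Sketch of the tcgen05 instruction families for tensor memory (Blackwell, sm_100a).
// Qualifiers and shapes (.sync.aligned, .32x32b, .x1, .128x256b, ...) are examples only.

tcgen05.alloc.cta_group::1.sync.aligned.b32   [dst], ncols;    // allocate columns of tensor memory
tcgen05.ld.sync.aligned.32x32b.x1.b32         {r0}, [taddr];   // tensor memory -> registers
tcgen05.st.sync.aligned.32x32b.x1.b32         [taddr], {r0};   // registers -> tensor memory
tcgen05.cp.cta_group::1.128x256b              [taddr], sdesc;  // shared memory -> tensor memory
tcgen05.dealloc.cta_group::1.sync.aligned.b32 taddr, ncols;    // release the columns again

// tcgen05.mma, the MMA instruction itself, is sketched further below.
```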
Specifically, for a tcgen05.mma Parallel Thread eXecution (PTX) instruction computing D += A @ B to use tensor memory, the "accumulator" matrix D must be in tensor memory, the left-hand matrix A may be in tensor memory or shared memory, and the right-hand matrix B must be in shared memory, not tensor memory.
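In PTX terms, this corresponds to two legal forms of the instruction, differing only in where A lives; D is always a tensor memory address and B is always described in shared memory. The sketch below abbreviates the qualifiers and omits how the descriptors are constructed.

```
// Sketch of the two legal operand forms of tcgen05.mma (qualifiers abbreviated).
// d_tmem: tensor memory address; a_sdesc/b_sdesc: shared memory descriptors;
// idesc: instruction descriptor; p: predicate enabling accumulation into D.

tcgen05.mma.cta_group::1.kind::f16 [d_tmem], a_sdesc,  b_sdesc, idesc, p;  // A in shared memory
tcgen05.mma.cta_group::1.kind::f16 [d_tmem], [a_tmem], b_sdesc, idesc, p;  // A in tensor memory

// There is no form that places B in tensor memory or any operand in registers.
```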
This is complex, but not arbitrary -- accumulators are accessed more frequently during matmuls than the tiles are, so they benefit more from specialized hardware, e.g. from shorter, simpler wiring between the Tensor Cores and the tensor memory. Note that none of the matrices are in the registers.
Beware: tensor memory is not directly related to the Tensor Memory Accelerator, which instead loads into the L1 data cache. Roughly speaking, data is moved from that cache into tensor memory only as a result of a Tensor Core operation and then is explicitly moved out for post-processing, e.g. the non-linearity after a matrix multiplication in a neural network.
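Putting the pieces together, a hedged sketch of that dataflow in PTX mnemonics (operands, mbarrier synchronization, and pipelining omitted) looks like:

```
// Sketch of the dataflow around tensor memory in a Blackwell matmul kernel.
// Mnemonics only; real kernels interleave and synchronize these stages.

cp.async.bulk.tensor.2d.shared::cluster.global ... ;  // TMA: global memory -> shared memory (L1/SMEM)
tcgen05.mma ... ;                                     // Tensor Cores read tiles, accumulate D in tensor memory
tcgen05.ld ... ;                                      // explicitly move the accumulator out to registers
// ... apply post-processing in registers, e.g. the non-linearity ...
st.global ... ;                                       // write finished results back to global memory
```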
For details on tensor memory and patterns for its use in matrix multiplications, see the Programming Blackwell Tensor Cores with CUTLASS talk from GTC 2025.