What is a Tensor Memory Accelerator?
The Tensor Memory Accelerator (TMA) is specialized hardware in Hopper and Blackwell architecture GPUs designed to accelerate access to multi-dimensional arrays in GPU RAM.
The TMA loads data from global memory (GPU RAM) to shared memory (the L1 data cache), bypassing the register file entirely.
The first advantage of the TMA comes from reducing the use of other compute and memory resources. The TMA hardware calculates addresses for bulk affine memory accesses, i.e. accesses of the form addr = width * base + offset, for many bases and offsets concurrently; these are the most common access patterns for arrays. Offloading this work to the TMA saves space in the register file and cycles on the CUDA Cores. The savings are more pronounced for large (KB-scale) accesses to arrays with two or more dimensions.
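To make "offloading address calculation" concrete, here is a minimal host-side sketch using the CUDA driver API's cuTensorMapEncodeTiled, which packs an array's shape, strides, and tile size into an opaque descriptor that the TMA hardware consumes. The helper name make_2d_tile_map, the row-major float layout, and the tile dimensions are illustrative assumptions, not part of any NVIDIA API:

```cpp
#include <cuda.h>
#include <cstdint>

// Hypothetical helper: builds a tensor map describing tile_h x tile_w tiles
// of a row-major gmem_h x gmem_w float array in global memory.
CUtensorMap make_2d_tile_map(void* global_ptr, uint64_t gmem_h, uint64_t gmem_w,
                             uint32_t tile_h, uint32_t tile_w) {
  CUtensorMap tensor_map{};
  // Dimensions and strides are listed innermost (fastest-moving) first.
  uint64_t global_dim[2]     = {gmem_w, gmem_h};
  // Byte stride between rows; the innermost stride is implicitly the element size.
  uint64_t global_strides[1] = {gmem_w * sizeof(float)};
  uint32_t box_dim[2]        = {tile_w, tile_h};  // shape of one copied tile
  uint32_t elem_strides[2]   = {1, 1};            // dense, no striding within the tile

  // The driver encodes all of this into the descriptor once; the TMA hardware
  // later performs the per-element address arithmetic itself, instead of the
  // CUDA Cores computing addresses and the register file holding them.
  cuTensorMapEncodeTiled(&tensor_map, CU_TENSOR_MAP_DATA_TYPE_FLOAT32,
                         /*tensorRank=*/2, global_ptr, global_dim, global_strides,
                         box_dim, elem_strides, CU_TENSOR_MAP_INTERLEAVE_NONE,
                         CU_TENSOR_MAP_SWIZZLE_NONE, CU_TENSOR_MAP_L2_PROMOTION_NONE,
                         CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE);
  return tensor_map;
}
```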
The second advantage comes from the asynchronous execution model of TMA copies. A single CUDA thread can trigger a large copy and then rejoin its warp to perform other work. That thread and others in the same thread block can later detect the completion of the TMA copy and operate on the results (as in a producer-consumer model).
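A minimal device-side sketch of this pattern, in the style of the CUDA Programming Guide's Hopper TMA examples, using libcu++'s cuda::barrier and the experimental bulk-tensor copy from cuda::device::experimental: one thread issues the copy, every thread arrives at the barrier, and the whole block consumes the tile after the wait. The kernel name, tile sizes, and coordinates are illustrative; this requires sm_90 or newer:

```cpp
#include <cuda.h>          // CUtensorMap
#include <cuda/barrier>    // cuda::barrier, cuda::device::barrier_arrive_tx

using barrier = cuda::barrier<cuda::thread_scope_block>;
namespace cde = cuda::device::experimental;

constexpr int TILE_H = 64, TILE_W = 64;  // must match box_dim in the tensor map

__global__ void consume_tile(const __grid_constant__ CUtensorMap tensor_map,
                             int tile_x, int tile_y) {
  // TMA destinations in shared memory must be 128-byte aligned.
  __shared__ alignas(128) float smem[TILE_H][TILE_W];

  // Shared-memory barrier used to signal completion of the bulk copy.
  #pragma nv_diag_suppress static_var_with_dynamic_init
  __shared__ barrier bar;

  if (threadIdx.x == 0) {
    init(&bar, blockDim.x);
    // Make the initialized barrier visible to the async (TMA) proxy.
    cde::fence_proxy_async_shared_cta();
  }
  __syncthreads();

  barrier::arrival_token token;
  if (threadIdx.x == 0) {
    // One thread triggers the whole KB-scale copy...
    cde::cp_async_bulk_tensor_2d_global_to_shared(
        &smem, &tensor_map, tile_x, tile_y, bar);
    // ...and tells the barrier how many bytes to expect.
    token = cuda::device::barrier_arrive_tx(bar, 1, sizeof(smem));
  } else {
    // Other threads just arrive; they can do unrelated work
    // between arrive() and wait().
    token = bar.arrive();
  }

  // Blocks until the TMA hardware has deposited the tile in shared memory.
  bar.wait(std::move(token));

  // Every thread can now safely read the tile (the consumer side).
  float v = smem[threadIdx.x % TILE_H][threadIdx.x % TILE_W];
  (void)v;
}
```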
For details, see the TMA sections of Luo et al.'s Hopper micro-benchmarking paper and the NVIDIA Hopper Tuning Guide.
Note that, despite the name, the Tensor Memory Accelerator does not accelerate operations using tensor memory.