What is a Tensor Memory Accelerator?
The Tensor Memory Accelerator (TMA) is specialized hardware in Hopper and Blackwell architecture GPUs designed to accelerate access to multi-dimensional arrays in GPU RAM.
The TMA loads data from global memory (GPU RAM) to shared memory (the L1 data cache), bypassing the register file entirely.
The first advantage of the TMA comes from reducing the use of other compute and memory resources. The TMA hardware calculates addresses for bulk affine memory accesses, i.e. accesses of the form addr = width * base + offset, for many bases and offsets concurrently; these are the most common access patterns for arrays. Offloading this work to the TMA saves space in the register file and cycles on the CUDA Cores. The savings are more pronounced for large (KB-scale) accesses to arrays with two or more dimensions.
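To make "offloading address calculation" concrete, here is a minimal host-side sketch using the CUDA driver API's cuTensorMapEncodeTiled, which packs an array's shape, strides, and tile size into an opaque descriptor that the TMA hardware consumes. The helper name make_2d_tile_map, the row-major float layout, and the tile dimensions are illustrative assumptions, not part of any NVIDIA API:

```cpp
#include <cuda.h>
#include <cstdint>

// Hypothetical helper: builds a tensor map describing tile_h x tile_w tiles
// of a row-major gmem_h x gmem_w float array in global memory.
CUtensorMap make_2d_tile_map(void* global_ptr, uint64_t gmem_h, uint64_t gmem_w,
                             uint32_t tile_h, uint32_t tile_w) {
  CUtensorMap tensor_map{};
  // Dimensions and strides are listed innermost (fastest-moving) first.
  uint64_t global_dim[2]     = {gmem_w, gmem_h};
  // Byte stride between rows; the innermost stride is implicitly the element size.
  uint64_t global_strides[1] = {gmem_w * sizeof(float)};
  uint32_t box_dim[2]        = {tile_w, tile_h};  // shape of one copied tile
  uint32_t elem_strides[2]   = {1, 1};            // dense, no striding within the tile

  // The driver encodes all of this into the descriptor once; the TMA hardware
  // later performs the per-element address arithmetic itself, instead of the
  // CUDA Cores computing addresses and the register file holding them.
  cuTensorMapEncodeTiled(&tensor_map, CU_TENSOR_MAP_DATA_TYPE_FLOAT32,
                         /*tensorRank=*/2, global_ptr, global_dim, global_strides,
                         box_dim, elem_strides, CU_TENSOR_MAP_INTERLEAVE_NONE,
                         CU_TENSOR_MAP_SWIZZLE_NONE, CU_TENSOR_MAP_L2_PROMOTION_NONE,
                         CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE);
  return tensor_map;
}
```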
The second advantage comes from the asynchronous execution model of TMA copies. A single CUDA thread can trigger a large copy and then rejoin its warp to perform other work. That thread and others in the same thread block can later detect the completion of the TMA copy and operate on the results (as in a producer-consumer model).
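A minimal device-side sketch of this pattern, in the style of the CUDA Programming Guide's Hopper TMA examples, using libcu++'s cuda::barrier and the experimental bulk-tensor copy from cuda::device::experimental: one thread issues the copy, every thread arrives at the barrier, and the whole block consumes the tile after the wait. The kernel name, tile sizes, and coordinates are illustrative; this requires sm_90 or newer:

```cpp
#include <cuda.h>          // CUtensorMap
#include <cuda/barrier>    // cuda::barrier, cuda::device::barrier_arrive_tx

using barrier = cuda::barrier<cuda::thread_scope_block>;
namespace cde = cuda::device::experimental;

constexpr int TILE_H = 64, TILE_W = 64;  // must match box_dim in the tensor map

__global__ void consume_tile(const __grid_constant__ CUtensorMap tensor_map,
                             int tile_x, int tile_y) {
  // TMA destinations in shared memory must be 128-byte aligned.
  __shared__ alignas(128) float smem[TILE_H][TILE_W];

  // Shared-memory barrier used to signal completion of the bulk copy.
  #pragma nv_diag_suppress static_var_with_dynamic_init
  __shared__ barrier bar;

  if (threadIdx.x == 0) {
    init(&bar, blockDim.x);
    // Make the initialized barrier visible to the async (TMA) proxy.
    cde::fence_proxy_async_shared_cta();
  }
  __syncthreads();

  barrier::arrival_token token;
  if (threadIdx.x == 0) {
    // One thread triggers the whole KB-scale copy...
    cde::cp_async_bulk_tensor_2d_global_to_shared(
        &smem, &tensor_map, tile_x, tile_y, bar);
    // ...and tells the barrier how many bytes to expect.
    token = cuda::device::barrier_arrive_tx(bar, 1, sizeof(smem));
  } else {
    // Other threads just arrive; they can do unrelated work
    // between arrive() and wait().
    token = bar.arrive();
  }

  // Blocks until the TMA hardware has deposited the tile in shared memory.
  bar.wait(std::move(token));

  // Every thread can now safely read the tile (the consumer side).
  float v = smem[threadIdx.x % TILE_H][threadIdx.x % TILE_W];
  (void)v;
}
```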
For details, see the TMA sections of Luo et al.'s Hopper micro-benchmarking paper and the NVIDIA Hopper Tuning Guide.
Note that, despite the name, the Tensor Memory Accelerator does not accelerate operations using tensor memory.