GPU Glossary

What is a Tensor Core?

Tensor Cores are GPU cores that operate on entire matrices with each instruction.

The internal architecture of an H100 SM. Note the larger size and smaller number of Tensor Cores relative to CUDA Cores. Modified from NVIDIA's H100 white paper.

Operating on more data for a single instruction fetch dramatically reduces power requirements, which unlocks increased performance (see this talk by Bill Dally, Chief Scientist at NVIDIA). As of the Blackwell Streaming Multiprocessor (SM) Architecture generation, they are the only way to achieve the highest arithmetic throughput on NVIDIA GPUs.

As an example, the HMMA.16816.F32 SASS instruction calculates D = AB + C for matrices A, B, C, and D (where C is often the same physical matrix as D). The MMA stands for "Matrix Multiply and Accumulate". The H indicates that the inputs are half-precision (16 bit) floats and the F32 indicates that the outputs are accumulated into 32 bit (aka single-precision) floats.

16816 is not a single number larger than 16,000. Instead, the string of numbers 16, 8, 16 denotes the dimensions of the matrices. These dimensions are generally named m, n, and k by NVIDIA, for example in PTX instructions. The outer dimensions of A and B, aka m and n, come first and second, and the shared inner dimension for the accumulation, k, comes last. Multiplying these out, we see that the HMMA.16816.F32 instruction performs 16 × 8 × 16 = 2,048 multiply-accumulate (MAC) operations.
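
To make the semantics concrete, here is a plain reference loop for what a single m16n8k16 MMA computes (a sketch only: it uses float throughout for simplicity, whereas the real instruction takes half-precision inputs and accumulates in single precision, and it says nothing about how the work is laid out across threads or hardware):

// Reference semantics of D = AB + C for one m16n8k16 MMA.
// A is m x k = 16 x 16, B is k x n = 16 x 8, C and D are m x n = 16 x 8.
constexpr int M = 16, N = 8, K = 16;

void mma_reference(const float A[M][K], const float B[K][N],
                   const float C[M][N], float D[M][N]) {
    for (int m = 0; m < M; ++m) {
        for (int n = 0; n < N; ++n) {
            float acc = C[m][n];           // start from the accumulator input
            for (int k = 0; k < K; ++k) {
                acc += A[m][k] * B[k][n];  // one MAC; M * N * K = 2,048 in total
            }
            D[m][n] = acc;
        }
    }
}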

Note that a single instruction in a single thread does not produce the entire matrix multiplication. Instead, the 32 threads of a warp cooperatively produce the result by executing the instruction together. Most of the per-instruction power overhead is in decoding, which is shared across a warp thanks to the warp scheduler. But even spread across those 32 threads, that's 64 = 2,048 ÷ 32 MACs per thread for each instruction.

For this reason, it is helpful to think of Tensor Cores, and similar hardware like the systolic arrays in Google Tensor Processing Units (TPUs), as a form of complex instruction set computer (CISC) hardware. For more on this perspective, applied to TPUs, see this talk by computer architect David Patterson, who also coined the term.

That assembler-level instruction might be produced by a compiler to implement PTX-level matrix-multiply-and-accumulate instructions like wmma (documented here). Those instructions also calculate D = AB + C for matrices A, B, C, and D, but are generally compiled into many individual SASS Tensor Core instructions that operate on smaller matrices.

These instructions from the PTX instruction set architecture are exposed in the high-level CUDA C++ programming language as intrinsics.
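
For orientation, a minimal single-tile kernel using these intrinsics might look something like the sketch below (illustrative only: the pointer names are made up, the layouts are chosen to match the PTX shown further down, and the tiling, shared memory staging, and error handling of a real kernel are omitted):

#include <cuda_fp16.h>
#include <mma.h>

using namespace nvcuda;

// One warp (32 threads) cooperatively computes a single 16x16x16 tile: C = AB + C.
__global__ void wmma_16x16x16(const half *a_ptr, const half *b_ptr, float *c_ptr) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::col_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c;

    wmma::load_matrix_sync(a, a_ptr, 16);                       // leading dimension 16
    wmma::load_matrix_sync(b, b_ptr, 16);
    wmma::load_matrix_sync(c, c_ptr, 16, wmma::mem_row_major);  // accumulators carry a layout

    wmma::mma_sync(c, a, b, c);                                 // D = AB + C, with D == C

    wmma::store_matrix_sync(c_ptr, c, 16, wmma::mem_row_major);
}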

In reverse order, a line of CUDA C++ expressing the matrix multiply-accumulate C = AB + C, like

wmma::mma_sync(c, a, b, c);

might be compiled by nvcc to the PTX intermediate representation as

wmma.load.a.sync.aligned.col.m16n16k16.global.f16       {%r2, %r3, %r4, %r5, %r6, %r7, %r8, %r9}, [%rd6], %r1;
wmma.load.b.sync.aligned.row.m16n16k16.global.f16       {%r10, %r11, %r12, %r13, %r14, %r15, %r16, %r17}, [%rd4], %r1;

and then finally compiled by ptxas to SASS as

HMMA.1688.F32 R20, R12, R11, RZ   // 1
HMMA.1688.F32 R24, R12, R17, RZ   // 2
HMMA.1688.F32 R20, R14, R16, R20  // 3
HMMA.1688.F32 R24, R14, R18, R24  // 4

The operands to the instruction can be read as D = AB + C, and so instruction 3 uses register 20 for its output D, registers 14 and 16 for its inputs A and B, respectively, and re-uses register 20 for its input C, effecting the computation C += AB.

This program splits the accumulation dimension -- k, the column dimension of A and the row dimension of B -- across time: instructions 1 and 3 accumulate by reusing the "D matrix register" of 1, R20, as the "C matrix register" of 3 (RZ, the C operand of instruction 1, is a special-purpose "register" that always contains the value zero). On machines that support 16816 matrix multiplications, we might expect these two instructions to be fused into a single one.

The program splits n, the column dimension of B, across time and across space: distinct instructions 1 and 2 (time) operate on distinct registers R11 and R17 (space) holding distinct columns of B, and they accumulate into distinct output registers, R20 and R24. Within one instruction, the calculations occur concurrently inside the Tensor Core: each instruction handles an eight-row, eight-column block of B together with all sixteen rows of the matching eight columns of A -- that is, the full m dimension.
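
Putting the two splits together, the four HMMA.1688 instructions tile the 16 × 16 × 16 wmma operation over 8-wide blocks of n and k. The loop nest below sketches that decomposition (schematic only: plain float loops stand in for half-precision fragments spread across a warp's registers, and the mapping of (nb, kb) pairs to instructions in the comment is approximate):

// Schematic decomposition of a 16x16x16 wmma tile into four m16n8k8 MMAs.
// Loosely, instruction 1 above is (nb=0, kb=0), 2 is (nb=8, kb=0),
// 3 is (nb=0, kb=8), and 4 is (nb=8, kb=8).
constexpr int M = 16, N = 16, K = 16, NB = 8, KB = 8;

void wmma_16x16x16_decomposed(const float A[M][K], const float B[K][N], float D[M][N]) {
    // Assumes D holds the initial accumulator values
    // (e.g. C, or zeros, which is the role RZ plays above).
    for (int nb = 0; nb < N; nb += NB) {      // n split "across space": distinct B and D registers
        for (int kb = 0; kb < K; kb += KB) {  // k split "across time": reuse D as the next C
            // One m16n8k8 matrix multiply-accumulate into the same output block.
            for (int m = 0; m < M; ++m)
                for (int n = 0; n < NB; ++n)
                    for (int k = 0; k < KB; ++k)
                        D[m][nb + n] += A[m][kb + k] * B[kb + k][nb + n];
        }
    }
}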

Explore this compiler output on Godbolt if you want to dive deeper. Note that this is far from a utilization-maximizing matrix multiplication using Tensor Cores! For that, see this worklog by Pranjal Shankhdhar.

Programming Hopper and Blackwell Tensor Cores for maximum performance cannot be done in pure CUDA C++; it instead requires PTX-level instructions for both computation and data movement. It is generally recommended to instead use existing kernels from kernel libraries like cuBLAS (CUDA Basic Linear Algebra Subroutines) or higher-level kernel programming interfaces like CUTLASS (CUDA Templates for Linear Algebra Subroutines). For an introduction to CUTLASS, see this blog post series by Colfax Research.
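
As a sketch of that recommended path, a mixed-precision GEMM routed through cuBLAS might look roughly like the following (illustrative only: handle creation, allocation, data movement, and error checking are omitted, and whether a Tensor Core kernel is actually selected is up to the library, the GPU, and the problem shape):

#include <cublas_v2.h>
#include <cuda_fp16.h>

// Sketch: C (f32, m x n) = A (f16, m x k) * B (f16, k x n) + C via cublasGemmEx.
// d_A, d_B, and d_C are assumed to be device pointers already holding data,
// stored column-major as cuBLAS expects.
void gemm_f16_in_f32_out(cublasHandle_t handle, int m, int n, int k,
                         const __half *d_A, const __half *d_B, float *d_C) {
    const float alpha = 1.0f, beta = 1.0f;
    cublasGemmEx(handle,
                 CUBLAS_OP_N, CUBLAS_OP_N,
                 m, n, k,
                 &alpha,
                 d_A, CUDA_R_16F, m,    // half-precision inputs
                 d_B, CUDA_R_16F, k,
                 &beta,
                 d_C, CUDA_R_32F, m,    // single-precision accumulator
                 CUBLAS_COMPUTE_32F,    // accumulate in 32 bit
                 CUBLAS_GEMM_DEFAULT);  // let cuBLAS choose the kernel
}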

Tensor Cores are much larger and less numerous than CUDA Cores. An H100 SXM5 has only four Tensor Cores per SM, i.e. one per Warp Scheduler, compared to 128 CUDA Cores.

Tensor Cores were introduced in the V100 GPU, which represented a major improvement in the suitability of NVIDIA GPUs for large neural network workloads. For more, see the NVIDIA white paper introducing the V100.

The internals of Tensor Cores are not publicly documented and likely differ from SM Architecture to SM Architecture. They are commonly assumed to be systolic arrays, like TPUs, but there is no consensus in the microbenchmarking literature.
