Tensor Core
Tensor Cores are GPU cores that operate on entire matrices with each instruction. For example, the mma PTX instructions (documented here) calculate D = AB + C for matrices A, B, C, and D. Operating on more data for a single instruction fetch dramatically reduces power requirements (see this talk by Bill Dally, Chief Scientist at NVIDIA).
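To make the D = AB + C operation concrete, here is a minimal pure-Python sketch of what a single matrix multiply-accumulate computes on one tile. It assumes the m16n8k16 shape, which is one of the shapes PTX defines for half-precision inputs. The function name `mma_tile` and the input values are hypothetical, chosen only for illustration. Real Tensor Cores spread the tile across the threads of a warp and typically use reduced-precision inputs, which this sketch ignores.

```python
# Reference semantics of one matrix multiply-accumulate: D = A B + C.
# Assumed tile shape: m16n8k16 (A is MxK, B is KxN, C and D are MxN).
M, N, K = 16, 8, 16


def mma_tile(a, b, c):
    """Return D = A B + C for nested-list matrices A (MxK), B (KxN), C (MxN)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(K)) + c[i][j] for j in range(N)]
        for i in range(M)
    ]


# Illustrative inputs; small integer-valued floats keep the arithmetic exact.
a = [[float(i + k) for k in range(K)] for i in range(M)]
# B has ones on its main diagonal, so column j of A B equals column j of A.
b = [[1.0 if k == j else 0.0 for j in range(N)] for k in range(K)]
c = [[0.5 for _ in range(N)] for _ in range(M)]

d = mma_tile(a, b, c)

# One instruction of this shape covers M * N * K multiply-adds,
# whereas a scalar fused multiply-add covers exactly one.
fmas_per_mma = M * N * K

print(d[2][3])       # a[2][3] + 0.5 = 5.5
print(fmas_per_mma)  # 2048
```

Under these assumptions, one instruction fetch is amortized over 2048 multiply-adds rather than one, which is the effect the paragraph above describes.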
Tensor Cores are much larger and less numerous than CUDA Cores. An H100 SXM5 has only four Tensor Cores per Streaming Multiprocessor, compared to hundreds of CUDA Cores.
Tensor Cores were introduced in the V100 GPU, which represented a major improvement in the suitability of NVIDIA GPUs for large neural network workloads. For more, see the NVIDIA white paper introducing the V100.