GPU Glossary

CUDA (Device Architecture)

CUDA stands for Compute Unified Device Architecture. Depending on the context, "CUDA" can refer to multiple distinct things: a high-level device architecture, a parallel programming model for architectures with that design, or a software platform that extends high-level languages like C to add that programming model.

The vision for CUDA is laid out in the 2008 white paper by Lindholm et al. We highly recommend this paper, which is the original source for many claims, diagrams, and even specific turns of phrase in NVIDIA's documentation.

Here, we focus on the device architecture part of CUDA. The core feature of a "compute unified device architecture" is simplicity, relative to preceding GPU architectures.

Prior to the GeForce 8800 and the Tesla data center GPUs it spawned, NVIDIA GPUs were designed with a complex pipeline shader architecture that mapped software shader stages onto heterogeneous, specialized hardware units. This architecture was challenging for the software and hardware sides alike: it required software engineers to map programs onto a fixed pipeline and forced hardware engineers to guess the load ratios between pipeline steps.

A diagram of a fixed-pipeline device architecture (G71). Note the presence of a separate group of processors for handling fragment and vertex shading. Adapted from Fabien Sanglard's blog.

GPU devices with a unified architecture are much simpler: the hardware units are entirely uniform, each capable of a wide array of computations. These units are known as Streaming Multiprocessors (SMs) and their main subcomponents are the CUDA Cores and (for recent GPUs) Tensor Cores.

A diagram of a compute unified device architecture (G80). Note the absence of distinct processor types: all meaningful computation occurs in the identical Streaming Multiprocessors in the center of the diagram, fed with instructions for vertex, geometry, and pixel threads. Modified from Peter Glazkowsky's 2009 white paper on the Fermi Architecture.
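
Because every SM on a device is identical, a single set of properties describes the whole chip. As a minimal sketch of what that uniformity looks like from software (assuming a system with the CUDA Toolkit installed and at least one CUDA device; compiled with nvcc), the CUDA runtime API can report the number of identical SMs at runtime:

    // Report the uniform hardware units (SMs) on a CUDA device.
    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void) {
        cudaDeviceProp prop;
        // Query device 0; cudaGetDeviceProperties fills in a
        // description of the entire chip.
        if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
            fprintf(stderr, "no CUDA device found\n");
            return 1;
        }
        // All SMs are identical, so one count and one compute capability
        // characterize the device, unlike in fixed-pipeline designs.
        printf("%s: %d Streaming Multiprocessors, compute capability %d.%d\n",
               prop.name, prop.multiProcessorCount, prop.major, prop.minor);
        return 0;
    }

The multiProcessorCount field gives the number of SMs, which varies widely across GPUs, but every SM it counts is the same general-purpose unit.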

For an accessible introduction to the history and design of CUDA hardware architectures, see this blog post by Fabien Sanglard. That blog post cites its (high-quality) sources, like NVIDIA's Fermi Compute Architecture white paper. The white paper by Lindholm et al. in 2008 introducing the Tesla architecture is both well-written and thorough. The NVIDIA white paper for the Tesla P100 is less scholarly but documents the introduction of a number of features that are critical for today's large-scale neural network workloads, like NVLink and on-package high-bandwidth memory.
