GPU Glossary

CUDA (Programming Model)

CUDA stands for Compute Unified Device Architecture. Depending on the context, "CUDA" can refer to multiple distinct things: a high-level device architecture, a parallel programming model for architectures with that design, or a software platform that extends high-level languages like C to add that programming model.

The vision for CUDA is laid out in the Lindholm et al., 2008 white paper. We highly recommend this paper, which is the original source for many claims, diagrams, and even specific turns of phrase in NVIDIA's documentation.

Here, we focus on the CUDA programming model.

The Compute Unified Device Architecture (CUDA) programming model is a model for programming massively parallel processors.

Per the NVIDIA CUDA C++ Programming Guide, there are three key abstractions in the CUDA programming model: a hierarchy of thread groups, a hierarchy of memories, and barrier synchronization.
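
To make these abstractions concrete, here is a minimal CUDA C++ sketch (the kernel name, tile size, and use of managed memory are illustrative choices, not from this glossary). Each thread block stages one tile of an array in shared memory, waits at a barrier, and writes its tile back reversed, touching all three abstractions in a few lines:

```cpp
#include <cstdio>

// Illustrative kernel: reverses each 256-element tile of an array.
// It exercises the thread hierarchy (threadIdx, blockIdx), the memory
// hierarchy (per-block __shared__ memory), and barrier synchronization
// (__syncthreads).
__global__ void reverseTiles(float *data) {
    __shared__ float tile[256];  // shared memory: visible to one thread block

    int local  = threadIdx.x;                      // index within the block
    int global = blockIdx.x * blockDim.x + local;  // index within the grid

    tile[local] = data[global];  // stage this block's tile in shared memory
    __syncthreads();             // barrier: wait until the whole block has written

    data[global] = tile[255 - local];  // read an element written by another thread
}

int main() {
    const int n = 1024;  // four 256-element tiles
    float *data;
    cudaMallocManaged(&data, n * sizeof(float));
    for (int i = 0; i < n; ++i) data[i] = float(i);

    reverseTiles<<<n / 256, 256>>>(data);  // grid of 4 blocks, 256 threads each
    cudaDeviceSynchronize();

    printf("data[0] = %f\n", data[0]);  // prints 255.0: the first tile was reversed
    cudaFree(data);
}
```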

The hierarchies of execution and memory and their mapping onto device hardware are summarized in the following diagram.

Left: the abstract thread group and memory hierarchies of the CUDA programming model. Right: the matching hardware implementing those abstractions. Modified from diagrams in NVIDIA's CUDA Refresher: The CUDA Programming Model and the NVIDIA CUDA C++ Programming Guide.

Together, these three abstractions encourage the expression of programs in a way that scales transparently as GPU devices scale in their parallel execution resources.

Put provocatively: this programming model prevents programmers from writing programs for NVIDIA's CUDA-architected GPUs that fail to get faster when the program's user buys a new NVIDIA GPU.

For example, each thread block in a CUDA program can coordinate tightly, but coordination between blocks is limited. This ensures blocks capture parallelizable components of the program and can be scheduled in any order — in the terminology of NVIDIA documentation, the programmer "exposes" this parallelism to the compiler and hardware. When the program is executed on a new GPU that has more scheduling units (specifically, more Streaming Multiprocessors), more of these blocks can be executed in parallel.
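
A sketch of that transparent scaling, with a hypothetical kernel and launch shape: the source fixes only the number of blocks, and the number of waves falls out of how many SMs the device happens to have.

```cpp
#include <cstdio>

// Each block computes one independent slice of the output; no block
// depends on any other, so the hardware may schedule them in any order.
__global__ void work(float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = 2.0f * float(i);
}

int main() {
    int smCount;
    cudaDeviceGetAttribute(&smCount, cudaDevAttrMultiProcessorCount, 0);

    // Launch many more blocks than the device has SMs. The blocks run
    // in waves; a GPU with twice the SMs finishes in half as many
    // waves, with no change to this source code.
    int blocks = 8 * smCount, threads = 256;
    float *out;
    cudaMalloc(&out, blocks * threads * sizeof(float));
    work<<<blocks, threads>>>(out);
    cudaDeviceSynchronize();

    printf("ran %d independent blocks on %d SMs\n", blocks, smCount);
    cudaFree(out);
}
```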

A CUDA program with eight blocks runs in four sequential steps (waves) on a GPU with two SMs but in half as many steps on one with twice as many SMs. Modified from the CUDA Programming Guide.

The CUDA programming model abstractions are made available to programmers as extensions to high-level CPU programming languages, like the CUDA C++ extension of C++. The programming model is implemented in software by an instruction set architecture (Parallel Thread eXecution, or PTX) and a low-level assembly language (Streaming Assembler, or SASS). For example, the thread block level of the thread hierarchy is implemented via cooperative thread arrays in these languages.
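
As a rough illustration of that mapping (commands assume a standard CUDA Toolkit install; file and kernel names are illustrative), compiling a kernel down to PTX and SASS shows the thread block reappearing as a cooperative thread array:

```cpp
// kernel.cu: a trivial kernel
__global__ void scale(float *x) { x[threadIdx.x] *= 2.0f; }

// Compile to PTX, the virtual instruction set:
//   nvcc --ptx kernel.cu -o kernel.ptx
// In kernel.ptx the thread block appears as a cooperative thread array
// (CTA): special registers %tid and %ntid index a thread within its CTA,
// %ctaid indexes the CTA within the grid, and bar.sync implements
// __syncthreads.
//
// To dump the physical SASS for a concrete architecture:
//   nvcc -cubin -arch=sm_90 kernel.cu -o kernel.cubin
//   cuobjdump --dump-sass kernel.cubin
```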
