CUDA (Programming Model)
CUDA stands for Compute Unified Device Architecture. Depending on the context, "CUDA" can refer to multiple distinct things: a high-level device architecture, a parallel programming model for architectures with that design, or a software platform that extends high-level languages like C to add that programming model.
The vision for CUDA is laid out in the Lindholm et al., 2008 white paper. We highly recommend this paper, which is the original source for many claims, diagrams, and even specific turns of phrase in NVIDIA's documentation.
Here, we focus on the CUDA programming model.
The Compute Unified Device Architecture (CUDA) programming model is a model for programming massively parallel processors.
Per the NVIDIA CUDA C++ Programming Guide, there are three key abstractions in the CUDA programming model, all three of which are exercised in the kernel sketch after this list:
- Hierarchy of thread groups. Programs are executed in threads but can make reference to groups of threads in a nested hierarchy, from blocks to grids.
- Hierarchy of memories. Thread groups have access to a memory resource for communication between threads in the group. Accessing the lowest layer of the memory hierarchy should be nearly as fast as executing an instruction.
- Barrier synchronization. Thread groups can coordinate execution by means of barriers.
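As a concrete, if minimal, sketch of all three abstractions at work, here is a block-level sum reduction in CUDA C++. The kernel name, block size, and use of managed memory are our own illustrative choices, not from the source: each block loads a tile into shared memory (the fast, block-local layer of the memory hierarchy), and `__syncthreads()` barriers coordinate the threads within the block.

```cuda
#include <cstdio>

#define BLOCK_SIZE 256

// Each block sums BLOCK_SIZE elements of the input into one partial sum.
__global__ void blockSum(const float* in, float* out, int n) {
    __shared__ float tile[BLOCK_SIZE];              // memory shared by the thread block

    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's position in the grid
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                                // barrier: the tile is now fully loaded

    // Tree reduction within the block; each step halves the active threads.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride) {
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        }
        __syncthreads();                            // barrier between reduction steps
    }

    if (threadIdx.x == 0) {
        out[blockIdx.x] = tile[0];                  // one partial sum per block
    }
}

int main() {
    const int n = 1 << 20;
    const int blocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;

    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, blocks * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;

    blockSum<<<blocks, BLOCK_SIZE>>>(in, out, n);   // a grid of blocks of threads
    cudaDeviceSynchronize();

    float total = 0.0f;
    for (int b = 0; b < blocks; ++b) total += out[b];
    printf("sum = %.0f (expected %d)\n", total, n);

    cudaFree(in); cudaFree(out);
    return 0;
}
```

Note that the barrier synchronizes only the threads within one block; the partial sums produced by different blocks are combined on the host, since the programming model gives blocks no comparable way to coordinate with each other.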
The hierarchies of execution and memory and their mapping onto device hardware are summarized in the following diagram.
Together, these three abstractions encourage the expression of programs in a way that scales transparently as GPU devices scale in their parallel execution resources.
Put provocatively: this programming model prevents programmers from writing programs for NVIDIA's CUDA-architected GPUs that fail to get faster when the program's user buys a new NVIDIA GPU.
For example, each thread block in a CUDA program can coordinate tightly, but coordination between blocks is limited. This ensures blocks capture parallelizable components of the program and can be scheduled in any order — in the terminology of NVIDIA documentation, the programmer "exposes" this parallelism to the compiler and hardware. When the program is executed on a new GPU that has more scheduling units (specifically, more Streaming Multiprocessors), more of these blocks can be executed in parallel.
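The common grid-stride loop idiom makes this transparent scaling visible in a few lines. In this sketch (the kernel and launch parameters are illustrative, not from the source), the kernel is correct for any grid size, so the same program simply runs more blocks concurrently on a device with more Streaming Multiprocessors:

```cuda
#include <cstdio>

// Grid-stride loop: each thread handles every (gridDim.x * blockDim.x)-th
// element. Blocks never communicate, so the hardware may schedule them in
// any order, across however many Streaming Multiprocessors the device has.
__global__ void scale(float* x, float a, int n) {
    int stride = gridDim.x * blockDim.x;   // total threads in the grid
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        x[i] *= a;                         // each element touched by exactly one thread
    }
}

int main() {
    const int n = 1 << 20;
    float* x;
    cudaMallocManaged(&x, n * sizeof(float));
    for (int i = 0; i < n; ++i) x[i] = 1.0f;

    // Launch enough blocks to fill a small GPU; a larger GPU runs more of
    // them in parallel, with no change to the program.
    scale<<<128, 256>>>(x, 2.0f, n);
    cudaDeviceSynchronize();

    printf("x[0] = %.1f, x[n-1] = %.1f\n", x[0], x[n - 1]);
    cudaFree(x);
    return 0;
}
```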
The CUDA programming model abstractions are made available to programmers as extensions to high-level CPU programming languages, like the CUDA C++ extension of C++. The programming model is implemented in software by an instruction set architecture (Parallel Thread eXecution, or PTX) and a low-level assembly language (Streaming Assembler, or SASS). For example, the thread block level of the thread hierarchy is implemented via cooperative thread arrays in these languages.
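You can inspect these layers yourself with the CUDA toolkit. A sketch, assuming `nvcc` and `cuobjdump` are installed and picking an `-arch` your toolkit supports; the file and kernel names are our own:

```cuda
// saxpy.cu -- a one-line kernel to inspect at each level of the stack.
__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's place in its
                                                    // cooperative thread array (CTA)
    if (i < n) y[i] = a * x[i] + y[i];
}

// Compile and inspect (shell commands, shown here as comments):
//   nvcc -ptx saxpy.cu -o saxpy.ptx        # emit PTX, the virtual ISA
//   nvcc -cubin -arch=sm_80 saxpy.cu       # emit native code for one GPU architecture
//   cuobjdump --dump-sass saxpy.cubin      # disassemble to SASS
// In the PTX output, the thread block appears as a CTA: the block index is
// read from the %ctaid special register.
```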