What is Parallel Thread Execution?
Parallel Thread eXecution (PTX) is an intermediate representation (IR) for code
that will run on a parallel processor (almost always an NVIDIA GPU). It is one
of the formats output by nvcc, the NVIDIA CUDA Compiler Driver.
NVIDIA documentation refers to PTX as both a "virtual machine" and an "instruction set architecture".
From the programmer's perspective, PTX is an instruction set for programming against a virtual machine model. Programmers or compilers producing PTX can be confident their program will run with the same semantics on many distinct physical machines, including machines that do not yet exist. In this way, it is also similar to CPU instruction set architectures like x86_64, aarch64, or SPARC.
Unlike those ISAs, PTX is very much an intermediate representation, like LLVM IR. The PTX components of a CUDA binary are just-in-time (JIT) compiled by the host's CUDA driver into device-specific SASS for execution.
In the case of NVIDIA GPUs, PTX is forward-compatible: GPUs with a matching or higher compute capability version will be able to run the program, thanks to this mechanism of JIT compilation.
Some exemplary PTX:
.reg .f32 %f<7>;
- a compiler directive for the PTX-to-SASS compiler indicating that this kernel consumes seven 32-bit floating point registers. Registers are dynamically allocated to groups of threads (warps) from the SM's register file.
fma.rn.f32 %f5, %f4, %f3, 0f3FC00000;
- apply a fused multiply-add (fma) operation to multiply the contents of registers f3 and f4 and add the constant 0f3FC00000 (the hexadecimal bit pattern of the 32-bit float 1.5), storing the result in f5. All numbers are in 32-bit floating point representation. The rn suffix for the FMA operation sets the floating point rounding mode to IEEE 754 "round to nearest, ties to even" (the IEEE 754 default rounding mode).
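To see this instruction in context, here is a minimal sketch of a CUDA kernel that issues the same fma.rn.f32 through inline PTX. The kernel name fma_plus_1p5 and the input values are hypothetical, chosen only for illustration. Because 0f3FC00000 encodes 1.5, the device result should agree with the host's fmaf(a, b, 1.5f).

```cuda
#include <cmath>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: out[i] = a[i] * b[i] + 1.5, computed by one PTX fma.
__global__ void fma_plus_1p5(const float* a, const float* b, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float r;
        // "=f" and "f" bind 32-bit float registers to the operands %0..%2.
        // 0f3FC00000 is the raw IEEE 754 bit pattern of 1.5f.
        asm("fma.rn.f32 %0, %1, %2, 0f3FC00000;"
            : "=f"(r)
            : "f"(a[i]), "f"(b[i]));
        out[i] = r;
    }
}

int main() {
    constexpr int n = 4;
    float h_a[n] = {1.0f, 2.0f, 0.5f, -3.0f};
    float h_b[n] = {2.0f, 2.0f, 4.0f, 1.0f};
    float h_out[n] = {0};

    float *d_a, *d_b, *d_out;
    cudaMalloc(&d_a, sizeof(h_a));
    cudaMalloc(&d_b, sizeof(h_b));
    cudaMalloc(&d_out, sizeof(h_out));
    cudaMemcpy(d_a, h_a, sizeof(h_a), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, sizeof(h_b), cudaMemcpyHostToDevice);

    fma_plus_1p5<<<1, n>>>(d_a, d_b, d_out, n);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);

    // Compare the device result against the host's fused multiply-add.
    for (int i = 0; i < n; ++i) {
        printf("device: %g  host fmaf: %g\n", h_out[i], fmaf(h_a[i], h_b[i], 1.5f));
    }

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_out);
    return 0;
}
```

Compiling this file with nvcc's -ptx option should emit PTX text in which the fma.rn.f32 line appears, with compiler-chosen register names.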
mov.u32 %r1, %ctaid.x;
mov.u32 %r2, %ntid.x;
mov.u32 %r3, %tid.x;
- move the x-axis values of the cooperative thread array index (ctaid), the cooperative thread array dimension (ntid, the number of threads per cooperative thread array along that axis), and the thread index (tid) into three u32 registers, r1 through r3.
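These special registers are what CUDA C++ exposes as blockIdx, blockDim, and threadIdx. The sketch below reads them directly through inline PTX and checks that they agree with the built-in variables. The kernel name and launch shape are hypothetical, and this is an illustration rather than a recommended way to obtain thread indices.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: reads %ctaid.x, %ntid.x, and %tid.x with inline PTX
// and counts threads where they disagree with the CUDA built-ins.
__global__ void read_special_registers(unsigned int* mismatches) {
    unsigned int ctaid, ntid, tid;
    // Inside inline PTX, a literal '%' is written as '%%'.
    // "=r" binds a 32-bit integer register to output operand %0.
    asm("mov.u32 %0, %%ctaid.x;" : "=r"(ctaid));
    asm("mov.u32 %0, %%ntid.x;" : "=r"(ntid));
    asm("mov.u32 %0, %%tid.x;" : "=r"(tid));

    if (ctaid != blockIdx.x || ntid != blockDim.x || tid != threadIdx.x) {
        atomicAdd(mismatches, 1u);
    }
    // The familiar global index is built from exactly these three values.
    unsigned int global_index = ctaid * ntid + tid;
    if (tid == 0) {
        printf("block %u (size %u) starts at global index %u\n",
               ctaid, ntid, global_index);
    }
}

int main() {
    unsigned int* d_mismatches;
    unsigned int h_mismatches = 0;
    cudaMalloc(&d_mismatches, sizeof(unsigned int));
    cudaMemset(d_mismatches, 0, sizeof(unsigned int));

    read_special_registers<<<2, 4>>>(d_mismatches);  // 2 blocks of 4 threads
    cudaDeviceSynchronize();  // wait so device-side printf output is flushed

    cudaMemcpy(&h_mismatches, d_mismatches, sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    printf("mismatches: %u\n", h_mismatches);

    cudaFree(d_mismatches);
    return 0;
}
```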
The PTX programming model exposes multiple levels of parallelism to the programmer. These levels map directly onto the hardware through the PTX machine model, diagrammed below.
Notably, in this machine model there is a single instruction unit for multiple processors. While each processor runs one thread, those threads must execute the same instructions; hence parallel thread execution, or PTX. They coordinate with each other through shared memory and effect different results by means of private registers.
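As an illustration of this model, here is a minimal sketch in CUDA C++ (which nvcc lowers to PTX) of threads in one block running the same instructions, keeping private values in registers, and coordinating through shared memory. The kernel name block_sum and the block size of 256 are assumptions made for the example. The reduction assumes the block size is a power of two.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kThreads = 256;  // hypothetical block size; must be a power of two

// Hypothetical kernel: every thread runs the same code, but each holds its
// own private value and cooperates with the others through shared memory.
__global__ void block_sum(const int* in, int* out) {
    __shared__ int scratch[kThreads];  // visible to all threads in the block
    unsigned int tid = threadIdx.x;

    int mine = in[blockIdx.x * blockDim.x + tid];  // private, lives in a register
    scratch[tid] = mine;
    __syncthreads();  // barrier: all writes to scratch are visible after this

    // Tree reduction: halve the number of active threads at each step.
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) {
            scratch[tid] += scratch[tid + stride];
        }
        __syncthreads();
    }
    if (tid == 0) {
        out[blockIdx.x] = scratch[0];
    }
}

int main() {
    int h_in[kThreads];
    for (int i = 0; i < kThreads; ++i) h_in[i] = i;
    int h_out = 0;

    int *d_in, *d_out;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    block_sum<<<1, kThreads>>>(d_in, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);

    // The sum of 0..kThreads-1 has a closed form to compare against.
    printf("block sum = %d (closed form: %d)\n", h_out,
           kThreads * (kThreads - 1) / 2);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```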
The documentation for the latest version of PTX is available from NVIDIA. The instruction sets of PTX are versioned with a number called the "compute capability", which is synonymous with "minimum supported Streaming Multiprocessor architecture version".
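The compute capability of an installed GPU can be queried through the CUDA runtime API. The sketch below prints it and applies the forward-compatibility rule described earlier to a hypothetical PTX target of 7.0. That target value is an assumption for illustration. The check is deliberately simplified and ignores architecture-specific targets such as sm_90a, which are not forward-compatible.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int device = 0;  // assumes at least one CUDA device; device 0 is queried
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("%s has compute capability %d.%d\n", prop.name, prop.major, prop.minor);

    // Hypothetical compute capability that some PTX was generated for.
    const int ptx_major = 7;
    const int ptx_minor = 0;

    // Simplified rule: the device must match or exceed the PTX target.
    bool can_jit = prop.major > ptx_major ||
                   (prop.major == ptx_major && prop.minor >= ptx_minor);
    printf("PTX targeting %d.%d %s be JIT-compiled for this device\n",
           ptx_major, ptx_minor, can_jit ? "can" : "cannot");
    return 0;
}
```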
Writing in-line PTX by hand is uncommon but not unheard of, similar to writing
in-line x86_64 assembly, as is done in high-performance vectorized query
operators in analytical databases and in performance-sensitive sections of
operating system kernels. At the time of writing in October of 2024, in-line PTX
is the only way to take advantage of some Hopper-specific hardware features like
the wgmma and tma instructions, as in Flash Attention 3 or in the Machete w4a16
kernels. Viewing CUDA C/C++, SASS, and PTX together is supported on Godbolt.
See the NVIDIA "Inline PTX Assembly in CUDA" guide for details.