What is Parallel Thread Execution?
Parallel Thread eXecution (PTX) is an intermediate representation (IR) for code that will run on a parallel processor (almost always an NVIDIA GPU). It is one of the formats output by nvcc, the NVIDIA CUDA Compiler Driver. It is pronounced "pee-tecks" by many NVIDIA engineers and "pee-tee-ecks" by everyone else.
NVIDIA documentation refers to PTX as both a "virtual machine" and an "instruction set architecture".
From the programmer's perspective, PTX is an instruction set for programming against a virtual machine model. Programmers or compilers producing PTX can be confident their program will run with the same semantics on many distinct physical machines, including machines that do not yet exist. In this way, it is also similar to CPU instruction set architectures like x86_64, aarch64, or SPARC.
Unlike those ISAs, PTX is very much an intermediate representation, like LLVM-IR. The PTX components of a CUDA binary will be just-in-time (JIT) compiled by the host CUDA driver into device-specific SASS for execution.
In the case of NVIDIA GPUs, PTX is forward-compatible: GPUs with a matching or higher compute capability version will be able to run the program, thanks to this mechanism of JIT compilation. In this way, PTX is a "narrow waist" that separates the worlds of hardware and software.
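To make that JIT mechanism concrete, the sketch below uses the CUDA driver API to load a hand-written PTX string at runtime and let the driver compile it to SASS for whatever GPU is present. It is a minimal sketch, assuming a single GPU and omitting error checking; the PTX string and the kernel name scale are illustrative, not taken from any NVIDIA source.

```cuda
// Minimal sketch: ship PTX text, let the driver JIT-compile it for the local GPU.
// Assumptions: a GPU at device 0, compute capability >= 7.0, no error checking.
#include <cuda.h>
#include <cstdio>

// Hand-written PTX (illustrative): scale(out, a) stores a * 2.0f + 1.5f into *out.
const char *ptx = R"(
.version 7.0
.target sm_70
.address_size 64

.visible .entry scale(.param .u64 p_out, .param .f32 p_a)
{
    .reg .f32 %f<4>;
    .reg .b64 %rd<3>;

    ld.param.u64  %rd1, [p_out];
    cvta.to.global.u64 %rd2, %rd1;
    ld.param.f32  %f1, [p_a];
    mov.f32       %f2, 0f40000000;            // 2.0f
    fma.rn.f32    %f3, %f1, %f2, 0f3FC00000;  // a * 2.0f + 1.5f
    st.global.f32 [%rd2], %f3;
    ret;
}
)";

int main() {
    cuInit(0);
    CUdevice dev;   cuDeviceGet(&dev, 0);
    CUcontext ctx;  cuCtxCreate(&ctx, 0, dev);

    CUmodule mod;   cuModuleLoadData(&mod, ptx);  // driver JIT-compiles the PTX to SASS here
    CUfunction fn;  cuModuleGetFunction(&fn, mod, "scale");

    CUdeviceptr d_out;  cuMemAlloc(&d_out, sizeof(float));
    float a = 3.0f;
    void *args[] = { &d_out, &a };
    cuLaunchKernel(fn, 1, 1, 1, 1, 1, 1, 0, nullptr, args, nullptr);

    float h_out = 0.0f;
    cuMemcpyDtoH(&h_out, d_out, sizeof(float));
    printf("%f\n", h_out);  // expect 7.500000
    return 0;
}
```

Built with something like nvcc jit_scale.cpp -lcuda, the same PTX string runs on any GPU with compute capability 7.0 or higher; the driver recompiles it for whichever architecture it finds.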
Some exemplary PTX:
.reg .f32 %f<7>;
- a compiler directive for the PTX-to-SASS compiler indicating that this kernel consumes seven 32-bit floating point registers. Registers are dynamically allocated to groups of threads (warps) from the SM's register file.
fma.rn.f32 %f5, %f4, %f3, 0f3FC00000;
- apply a fused multiply-add (fma) operation to multiply the contents of registers f3 and f4 and add the constant 0f3FC00000, storing the result in f5. All numbers are in 32-bit floating point representation. The rn suffix for the FMA operation sets the floating point rounding mode to IEEE 754 "round to nearest even" (the default).
mov.u32 %r1, %ctaid.x;
mov.u32 %r2, %ntid.x;
mov.u32 %r3, %tid.x;
- move the x-axis values of the cooperative thread array index (ctaid), the cooperative thread array dimensions (ntid), and the thread index (tid) into three u32 registers, r1 through r3.
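PTX like the lines above typically comes out of nvcc rather than being written by hand. As a hedged illustration, the CUDA C++ kernel below (the name fma_example and its signature are assumptions for this sketch) compiles with nvcc --ptx into PTX containing the same kinds of instructions, including an fma.rn.f32 whose constant operand 0f3FC00000 is the hexadecimal encoding of 1.5f.

```cuda
// Illustrative CUDA C++ kernel: compiling it with `nvcc --ptx` yields PTX containing
// mov.u32 reads of %ctaid.x, %ntid.x, and %tid.x for the index computation, and an
// fma.rn.f32 with the immediate 0f3FC00000 (the bit pattern of 1.5f) for the arithmetic.
__global__ void fma_example(const float *x, const float *y, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // %ctaid.x * %ntid.x + %tid.x
    if (i < n) {
        out[i] = fmaf(x[i], y[i], 1.5f);            // fma.rn.f32 ..., 0f3FC00000
    }
}
```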
The PTX programming model exposes multiple levels of parallelism to the programmer. These levels map directly onto the hardware through the PTX machine model, diagrammed below.
Notably, in this machine model there is a single instruction unit for multiple processors. While each processor runs one thread, those threads must execute the same instructions, hence parallel thread execution, or PTX. They coordinate with each other through shared memory and effect different results by means of private registers.
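As a concrete sketch of that model in CUDA C++ (the kernel name block_sum and the 256-thread block size are assumptions, not from this article): every thread in a block runs the same instructions, holds its own value in a private register, and the threads coordinate through shared memory to produce one sum per block.

```cuda
// Sketch: per-block sum reduction. Assumes the kernel is launched with exactly
// 256 threads per block, e.g. block_sum<<<num_blocks, 256>>>(x, block_sums, n).
__global__ void block_sum(const float *x, float *block_sums, int n) {
    __shared__ float smem[256];               // shared memory, one slot per thread in the block
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    float v = (i < n) ? x[i] : 0.0f;          // private register, different value per thread
    smem[tid] = v;
    __syncthreads();
    // Tree reduction in shared memory; all threads execute the same instructions.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) smem[tid] += smem[tid + stride];
        __syncthreads();
    }
    if (tid == 0) block_sums[blockIdx.x] = smem[0];  // one thread writes the block's result
}
```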
The documentation for the latest version of PTX is available from NVIDIA here. The instruction sets of PTX are versioned with a number called the "compute capability", which is synonymous with "minimum supported Streaming Multiprocessor architecture version".
Writing in-line PTX by hand is uncommon outside of the cutting edge of performance, similar to writing in-line x86_64 assembly, as is done in high-performance vectorized query operators in analytical databases and in performance-sensitive sections of operating system kernels. At time of writing in September of 2025, in-line PTX is the only way to take advantage of some Hopper-specific hardware features like the wgmma and tma instructions, as in Flash Attention 3 or in the Machete w4a16 kernels. Viewing CUDA C/C++, SASS, and PTX together is supported on Godbolt. See the NVIDIA "Inline PTX Assembly in CUDA" guide for details.
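As a small taste of what that guide covers, here is a hedged sketch of in-line PTX in CUDA C++ (the helper name fma_via_ptx is illustrative): it issues the same fma.rn.f32 instruction shown earlier, with the "f" constraints mapping the C++ variables onto 32-bit floating point registers.

```cuda
// Illustrative device helper using in-line PTX: compute d = a * b + c with a single
// fma.rn.f32, bypassing whatever instruction selection the compiler would otherwise do.
__device__ float fma_via_ptx(float a, float b, float c) {
    float d;
    asm("fma.rn.f32 %0, %1, %2, %3;" : "=f"(d) : "f"(a), "f"(b), "f"(c));
    return d;
}
```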