What is Parallel Thread Execution?
Parallel Thread eXecution (PTX) is an intermediate representation (IR) for code that will run on a parallel processor (almost always an NVIDIA GPU). It is one of the formats output by nvcc, the NVIDIA CUDA Compiler Driver. It is pronounced "pee-tecks" by many NVIDIA engineers and "pee-tee-ecks" by everyone else.
NVIDIA documentation refers to PTX as both a "virtual machine" and an "instruction set architecture".
From the programmer's perspective, PTX is an instruction set for programming against a virtual machine model. Programmers or compilers producing PTX can be confident their program will run with the same semantics on many distinct physical machines, including machines that do not yet exist. In this way, it is also similar to CPU instruction set architectures like x86_64, aarch64, or SPARC.
Unlike those ISAs, PTX is very much an intermediate representation, like LLVM-IR. The PTX components of a CUDA binary will be just-in-time (JIT) compiled by the host CUDA driver into device-specific SASS for execution.
In the case of NVIDIA GPUs, PTX is forward-compatible: GPUs with a matching or higher compute capability version will be able to run the program, thanks to this mechanism of JIT compilation. In this way, PTX is a "narrow waist" that separates the worlds of hardware and software.
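To make that JIT mechanism concrete, the sketch below uses the CUDA driver API to load a hand-written PTX string at runtime and let the driver compile it to SASS for whatever GPU is present. It is a minimal sketch, assuming a single GPU and omitting error checking; the PTX string and the kernel name scale are illustrative, not taken from any NVIDIA source.

```cuda
// Minimal sketch: ship PTX text, let the driver JIT-compile it for the local GPU.
// Assumptions: a GPU at device 0, compute capability >= 7.0, no error checking.
#include <cuda.h>
#include <cstdio>

// Hand-written PTX (illustrative): scale(out, a) stores a * 2.0f + 1.5f into *out.
const char *ptx = R"(
.version 7.0
.target sm_70
.address_size 64

.visible .entry scale(.param .u64 p_out, .param .f32 p_a)
{
    .reg .f32 %f<4>;
    .reg .b64 %rd<3>;

    ld.param.u64  %rd1, [p_out];
    cvta.to.global.u64 %rd2, %rd1;
    ld.param.f32  %f1, [p_a];
    mov.f32       %f2, 0f40000000;            // 2.0f
    fma.rn.f32    %f3, %f1, %f2, 0f3FC00000;  // a * 2.0f + 1.5f
    st.global.f32 [%rd2], %f3;
    ret;
}
)";

int main() {
    cuInit(0);
    CUdevice dev;   cuDeviceGet(&dev, 0);
    CUcontext ctx;  cuCtxCreate(&ctx, 0, dev);

    CUmodule mod;   cuModuleLoadData(&mod, ptx);  // driver JIT-compiles the PTX to SASS here
    CUfunction fn;  cuModuleGetFunction(&fn, mod, "scale");

    CUdeviceptr d_out;  cuMemAlloc(&d_out, sizeof(float));
    float a = 3.0f;
    void *args[] = { &d_out, &a };
    cuLaunchKernel(fn, 1, 1, 1, 1, 1, 1, 0, nullptr, args, nullptr);

    float h_out = 0.0f;
    cuMemcpyDtoH(&h_out, d_out, sizeof(float));
    printf("%f\n", h_out);  // expect 7.500000
    return 0;
}
```

Built with something like nvcc jit_scale.cpp -lcuda, the same PTX string runs on any GPU with compute capability 7.0 or higher; the driver recompiles it for whichever architecture it finds.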
Some exemplary PTX:
.reg .f32 %f<7>;
- a compiler directive for the PTX-to-SASS compiler indicating that this kernel consumes seven 32-bit floating point registers. Registers are dynamically allocated to groups of threads (warps) from the SM's register file.
fma.rn.f32 %f5, %f4, %f3, 0f3FC00000;
- apply a fused multiply-add (fma) operation to multiply the contents of registers f3 and f4 and add the constant 0f3FC00000, storing the result in f5. All numbers are in 32-bit floating point representation. The rn suffix for the FMA operation sets the floating point rounding mode to IEEE 754 "round to nearest even" (the default).
mov.u32 %r1, %ctaid.x;
mov.u32 %r2, %ntid.x;
mov.u32 %r3, %tid.x;
- move the x-axis values of the cooperative thread array index (ctaid), the cooperative thread array dimensions (ntid), and the thread index (tid) into three u32 registers, r1 through r3.
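PTX like the lines above typically comes out of nvcc rather than being written by hand. As a hedged illustration, the CUDA C++ kernel below (the name fma_example and its signature are assumptions for this sketch) compiles with nvcc --ptx into PTX containing the same kinds of instructions, including an fma.rn.f32 whose constant operand 0f3FC00000 is the hexadecimal encoding of 1.5f.

```cuda
// Illustrative CUDA C++ kernel: compiling it with `nvcc --ptx` yields PTX containing
// mov.u32 reads of %ctaid.x, %ntid.x, and %tid.x for the index computation, and an
// fma.rn.f32 with the immediate 0f3FC00000 (the bit pattern of 1.5f) for the arithmetic.
__global__ void fma_example(const float *x, const float *y, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // %ctaid.x * %ntid.x + %tid.x
    if (i < n) {
        out[i] = fmaf(x[i], y[i], 1.5f);            // fma.rn.f32 ..., 0f3FC00000
    }
}
```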
The PTX programming model exposes multiple levels of parallelism to the programmer. These levels map directly onto the hardware through the PTX machine model, diagrammed below.
Notably, in this machine model there is a single instruction unit for multiple processors. While each processor runs one thread, those threads must execute the same instructions, hence parallel thread execution, or PTX. They coordinate with each other through shared memory and effect different results by means of private registers.
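As a concrete sketch of that model in CUDA C++ (the kernel name block_sum and the 256-thread block size are assumptions, not from this article): every thread in a block runs the same instructions, holds its own value in a private register, and the threads coordinate through shared memory to produce one sum per block.

```cuda
// Sketch: per-block sum reduction. Assumes the kernel is launched with exactly
// 256 threads per block, e.g. block_sum<<<num_blocks, 256>>>(x, block_sums, n).
__global__ void block_sum(const float *x, float *block_sums, int n) {
    __shared__ float smem[256];               // shared memory, one slot per thread in the block
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    float v = (i < n) ? x[i] : 0.0f;          // private register, different value per thread
    smem[tid] = v;
    __syncthreads();
    // Tree reduction in shared memory; all threads execute the same instructions.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) smem[tid] += smem[tid + stride];
        __syncthreads();
    }
    if (tid == 0) block_sums[blockIdx.x] = smem[0];  // one thread writes the block's result
}
```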
The documentation for the latest version of PTX is available from NVIDIA here. The instruction sets of PTX are versioned with a number called the "compute capability", which is synonymous with "minimum supported Streaming Multiprocessor architecture version".
Writing in-line PTX by hand is uncommon outside of the cutting edge of performance, similar to writing in-line x86_64 assembly, as is done in high-performance vectorized query operators in analytical databases and in performance-sensitive sections of operating system kernels. At time of writing in September of 2025, in-line PTX is the only way to take advantage of some Hopper-specific hardware features like the wgmma and tma instructions, as in Flash Attention 3 or in the Machete w4a16 kernels. Viewing CUDA C/C++, SASS, and PTX together is supported on Godbolt. See the NVIDIA "Inline PTX Assembly in CUDA" guide for details.
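As a small taste of what that guide covers, here is a hedged sketch of in-line PTX in CUDA C++ (the helper name fma_via_ptx is illustrative): it issues the same fma.rn.f32 instruction shown earlier, with the "f" constraints mapping the C++ variables onto 32-bit floating point registers.

```cuda
// Illustrative device helper using in-line PTX: compute d = a * b + c with a single
// fma.rn.f32, bypassing whatever instruction selection the compiler would otherwise do.
__device__ float fma_via_ptx(float a, float b, float c) {
    float d;
    asm("fma.rn.f32 %0, %1, %2, %3;" : "=f"(d) : "f"(a), "f"(b), "f"(c));
    return d;
}
```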