Streaming ASSembler (SASS) is the assembly format for programs running on NVIDIA GPUs. It is the lowest-level format in which human-readable code can be written. It is one of the formats output by nvcc, the NVIDIA CUDA Compiler Driver, alongside PTX. It is converted to device-specific binary microcode during execution. Presumably, the "Streaming" in "Streaming Assembler" refers to the Streaming Multiprocessors that the assembly language programs.

SASS is versioned and tied to a specific NVIDIA GPU SM architecture. See also Compute Capability.

Some example instructions in SASS for the SM90a architecture of Hopper GPUs:

FFMA R0, R7, R0, 1.5 ; - perform a Fused Floating point Multiply Add that multiplies the contents of Register 7 and Register 0, adds 1.5, and stores the result in Register 0.
S2UR UR4, SR_CTAID.X ; - copy the X value of the Cooperative Thread Array's InDex from its Special Register to Uniform Register 4.
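
To make the arithmetic of the FFMA example concrete, here is a minimal host-side C++ sketch of what `FFMA R0, R7, R0, 1.5` computes. It is a model of the described semantics only: it is not SASS, not compiler output, and it does not run on a GPU. The function name `ffma_model` and its parameter names are hypothetical, chosen to mirror the registers in the example. It relies on `std::fma`, which computes a multiply-add with a single rounding step.

```cpp
#include <cmath>

// Hypothetical host-side model of `FFMA R0, R7, R0, 1.5`.
// Takes the current contents of R7 and R0 and returns the new value of R0.
float ffma_model(float r7, float r0) {
    // Fused multiply-add: r7 * r0 + 1.5f, rounded once at the end.
    return std::fma(r7, r0, 1.5f);
}
```

For instance, `ffma_model(2.0f, 3.0f)` evaluates to `7.5f`, since 2 × 3 + 1.5 = 7.5 is exactly representable in single precision.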

As with CPUs, writing this "GPU assembler" by hand is very uncommon. It is more common to view compiler-generated SASS while profiling and editing high-level CUDA C/C++ code or inline PTX, especially when producing the highest-performance kernels. Viewing CUDA C/C++, SASS, and PTX together is supported on Godbolt. For more detail on SASS with a focus on performance debugging workflows, see this talk from Arun Demeure.

SASS is very lightly documented: the instructions are listed in the documentation for NVIDIA's CUDA binary utilities, but their semantics are not defined. The mapping from ASCII assembler to binary opcodes and operands is entirely undocumented, but it has been reverse-engineered in certain cases (Maxwell, Lovelace).


