A scoreboard is a hardware structure that tracks which registers are waiting to be written to by an in-flight instruction. A scoreboard stall occurs when a warp's next instruction must wait on one of those pending writes; a warp cannot progress while it is in the stalled state.

Scoreboard stalls can be classified into two types: short scoreboard stalls and long scoreboard stalls.

A short scoreboard stall occurs when an instruction is waiting on the result of a variable-latency instruction that does not leave the Streaming Multiprocessor (SM). This includes slow math instructions on the Special Function Unit, like MUFU.EX2 and MUFU.SQRT, and matrix multiplications on the Tensor Core, like MMA. It also includes shared memory operations like LDS and STS.

A long scoreboard stall occurs when an instruction is waiting on the result of a memory operation that leaves the SM, such as global memory loads (LDG) or stores (STG). Long scoreboard stalls dominate memory-bound code.
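As a rough illustration of this split, the sketch below sorts SASS opcodes into the two stall types using only the opcode families named above. The function name `scoreboard_kind` is hypothetical, the lists are not exhaustive, and treating any opcode ending in `MMA` (such as `HMMA`) as a Tensor Core instruction is an assumption made for the example.

```python
# Opcode families taken from the examples in the text; not an exhaustive list.
SHORT_SCOREBOARD_PREFIXES = ("MUFU", "LDS", "STS")  # result stays on the SM
LONG_SCOREBOARD_PREFIXES = ("LDG", "STG")           # operation leaves the SM


def scoreboard_kind(opcode: str) -> str:
    """Return 'short', 'long', or 'unknown' for a SASS opcode such as 'LDG.E.SYS'."""
    base = opcode.split(".")[0].upper()  # drop modifiers like .E.SYS
    if base.startswith(LONG_SCOREBOARD_PREFIXES):
        return "long"
    # Assumption: Tensor Core opcodes end in "MMA" (MMA, HMMA, IMMA, ...).
    if base.startswith(SHORT_SCOREBOARD_PREFIXES) or base.endswith("MMA"):
        return "short"
    return "unknown"
```

Opcodes outside these families, such as `IMAD`, come back as `"unknown"` because the text does not assign them to either stall type.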

Each warp has six scoreboards, which the compiler uses to track data dependencies between instructions.

Some scoreboard information is legible in Streaming Assembler (SASS). For example, below is what you might see from cuobjdump with the --dump-sass flag:

```nasm
[barrier:  :  :  :  ]  /*line*/  INSTRUCTION Ri, [Rj] ; # Format: scoreboard info, line number, instruction, operands
[B------:R-:W2:-:S04]  /*00f0*/  LDG.E.SYS R0, [R2] ;   # Sets scoreboard 2
[B------:R-:W2:-:S01]  /*0100*/  LDG.E.SYS R5, [R4] ;   # `ptxas` intelligently reuses scoreboard 2
...
[B--2---:R-:W-:Y:S08]  /*0150*/  IMAD R0, R0, c[0x0][0x160], R5 ;  # Waits on scoreboard 2
```

We can see that our IMAD instruction has a barrier (B--2---) on scoreboard 2, indicating that it cannot issue until that scoreboard has been cleared. Both LDG instructions increment scoreboard 2 (W2) when they are issued, so the IMAD instruction will not execute until registers R0 and R5 hold the loaded values.
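To make the annotation format concrete, here is a minimal Python sketch that pulls the wait mask and the written scoreboard out of a line shaped like the ones in the listing. The function name `decode_control` and the regular expression are illustrative assumptions, not part of any NVIDIA tool. Only the `B` and `W` fields, which the text explains, are interpreted; the remaining fields are matched but ignored.

```python
import re

# Matches the bracketed prefix shown in the listing, e.g. "[B--2---:R-:W2:-:S04]".
# Group 1 is the wait mask after "B"; group 2 is the slot after "W".
CONTROL_RE = re.compile(r"\[B([0-9-]+):R[0-9-]:W([0-9-]):[^:\]]*:S\d+\]")


def decode_control(line: str) -> dict:
    """Extract the scoreboards an instruction waits on and the one it sets."""
    match = CONTROL_RE.search(line)
    if match is None:
        raise ValueError(f"no control field found in: {line!r}")
    wait_mask, write_slot = match.groups()
    return {
        # Every digit in the mask names a scoreboard that must be clear.
        "waits_on": sorted(int(ch) for ch in wait_mask if ch.isdigit()),
        # "-" means the instruction does not set a scoreboard.
        "sets": int(write_slot) if write_slot.isdigit() else None,
    }
```

Applied to the `IMAD` line from the listing, this returns `{"waits_on": [2], "sets": None}`; applied to either `LDG` line, it returns `{"waits_on": [], "sets": 2}`.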

An instruction may wait on multiple scoreboards at once: for example, B01--4- means wait until scoreboards 0, 1, and 4 are all cleared. When a data dependency has been satisfied, the respective scoreboard is decremented.
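The increment-on-issue, decrement-on-completion behavior can be sketched as a toy model. The `Scoreboards` class below is hypothetical and greatly simplified: it models the counters described in the text, not the actual hardware, and replays the two-load example from the listing.

```python
from typing import Optional


class Scoreboards:
    """Toy model of one warp's six scoreboard counters (not the real hardware)."""

    def __init__(self, count: int = 6):
        self.pending = [0] * count

    def issue(self, sets: Optional[int] = None) -> None:
        """An issued instruction with a write slot increments that scoreboard."""
        if sets is not None:
            self.pending[sets] += 1

    def complete(self, slot: int) -> None:
        """A satisfied data dependency decrements the scoreboard."""
        if self.pending[slot] == 0:
            raise RuntimeError(f"scoreboard {slot} has nothing in flight")
        self.pending[slot] -= 1

    def can_issue(self, waits_on: list) -> bool:
        """An instruction may issue only when every awaited scoreboard is clear."""
        return all(self.pending[slot] == 0 for slot in waits_on)


sb = Scoreboards()
sb.issue(sets=2)                             # first LDG sets scoreboard 2
sb.issue(sets=2)                             # second LDG reuses scoreboard 2
stalled_after_issue = not sb.can_issue([2])  # IMAD must wait
sb.complete(2)                               # first load returns
stalled_after_one = not sb.can_issue([2])    # still one load in flight
sb.complete(2)                               # second load returns
ready = sb.can_issue([2])                    # IMAD may now issue
```

In this model the dependent instruction stays stalled until both loads sharing scoreboard 2 have completed, which is why a single scoreboard can safely guard both `R0` and `R5`.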

Scoreboard reuse can mean that the stall classification from Nsight Compute is incorrect, as a long and short scoreboard stall may be conflated if they use the same scoreboard.
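One way to spot where this conflation could occur is to look for scoreboards that are set by both kinds of producer. The sketch below assumes you have already labelled each producing instruction by hand as "long" or "short" from a SASS listing; the function name `ambiguous_scoreboards` and the example data are hypothetical.

```python
from collections import defaultdict


def ambiguous_scoreboards(producers):
    """Return scoreboards written by both long- and short-latency producers.

    `producers` is an iterable of (scoreboard, kind) pairs, where kind is
    "long" or "short". The pairs are hypothetical, hand-labelled input.
    """
    kinds_by_slot = defaultdict(set)
    for slot, kind in producers:
        kinds_by_slot[slot].add(kind)
    # A slot is ambiguous when both kinds of producer have written to it.
    return sorted(
        slot for slot, kinds in kinds_by_slot.items()
        if {"long", "short"} <= kinds
    )


# Hypothetical listing: an LDG and an LDS both set scoreboard 0,
# while scoreboard 2 is only ever set by global loads.
example = [(0, "long"), (0, "short"), (2, "long"), (2, "long")]
shared = ambiguous_scoreboards(example)
```

Here `shared` contains only scoreboard 0, so a stall attributed to a wait on scoreboard 0 could belong to either category.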

Scoreboarding for dependency tracking in dynamic instruction scheduling dates back to the "first supercomputer", the Control Data Corporation 6600, one of which disproved Euler's sum of powers conjecture in 1966. Unlike in CPUs, scoreboarding in GPUs isn't used for out-of-order execution within threads (instruction-level parallelism), only across them (thread-level parallelism); see this NVIDIA patent.

For more details about scoreboard implementation on GPUs, see Professor Matthew D. Sinclair's slides.
