What is a scoreboard stall?
A scoreboard stall occurs when an instruction cannot be issued due to a dependency on the result of a prior instruction.
A scoreboard is a hardware structure that tracks which registers are waiting to be written to by an in-flight instruction. A warp cannot progress while it is in the stalled state.
Scoreboard stalls can be classified into two types: short scoreboard stalls and long scoreboard stalls.
A short scoreboard stall occurs when an instruction is waiting on the result of a variable-latency instruction that does not leave the Streaming Multiprocessor (SM). This includes slow math instructions on the Special Function Unit, like MUFU.EX2 and MUFU.SQRT, and matrix multiplications on the Tensor Core, like MMA. It also includes shared memory operations like LDS and STS.
A long scoreboard stall occurs when an instruction is waiting on the result of a memory operation that leaves the SM, such as global memory loads (LDG) or stores (STG). Long scoreboard stalls dominate memory-bound code.
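The two categories above can be summarized as a small lookup. This is a minimal sketch in Python that covers only the opcode families named in the text. The function name `producer_stall_kind` is hypothetical, and real SASS contains many opcodes and variants that this table does not list.

```python
# Opcode families named in the text; anything else is reported as "unknown".
SHORT_SCOREBOARD_OPS = {"MUFU", "MMA", "LDS", "STS"}  # result stays inside the SM
LONG_SCOREBOARD_OPS = {"LDG", "STG"}                  # memory operation leaves the SM


def producer_stall_kind(sass_opcode: str) -> str:
    """Classify the stall a consumer would see while waiting on this producer."""
    # Drop modifiers such as ".EX2" or ".E.SYS" and compare the base opcode only.
    base = sass_opcode.split(".")[0].upper()
    if base in SHORT_SCOREBOARD_OPS:
        return "short"
    if base in LONG_SCOREBOARD_OPS:
        return "long"
    return "unknown"


if __name__ == "__main__":
    for op in ["MUFU.EX2", "LDS", "LDG.E.SYS", "IMAD"]:
        print(op, "->", producer_stall_kind(op))
```

Opcodes outside the two sets fall through to "unknown" rather than being guessed at.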
A warp has 6 scoreboards which the compiler uses to track data dependencies between instructions.
Some scoreboard information is legible in Streaming Assembler (SASS). For example, below is what you might see from cuobjdump with the --dump-sass flag:
[barrier: : : : ] /*line*/ INSTRUCTION Ri, [Rj] ; # Format: scoreboard info, line number, instruction, operands
[B------:R-:W2:-:S04] /*00f0*/ LDG.E.SYS R0, [R2] ; # Sets scoreboard 2
[B------:R-:W2:-:S01] /*0100*/ LDG.E.SYS R5, [R4] ; # `ptxas` intelligently reuses scoreboard 2
...
[B--2---:R-:W-:Y:S08] /*0150*/ IMAD R0, R0, c[0x0][0x160], R5 ; # Waits on scoreboard 2
We can see that our IMAD instruction has a barrier (B--2---) on scoreboard 2, indicating that scoreboard 2 must be cleared before the instruction can issue. Both LDG instructions increment scoreboard 2 when they are issued (the W2 field), so our IMAD instruction will have the correct values in registers R0 and R5 before it executes.
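To make the annotation format concrete, here is a minimal Python sketch that pulls the wait mask (the B field) and the scoreboard being set (the W field) out of a line like those in the listing above. The field layout is inferred only from that listing, so treat the regular expression as an assumption. The names `ControlInfo` and `parse_control` are hypothetical, and the remaining fields are matched but not interpreted.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Layout assumed from the listing: [B<6 chars>:R<1 char>:W<1 char>:<1 char>:S<digits>]
CONTROL_RE = re.compile(
    r"\[B(?P<wait>[0-9-]{6}):R[0-9-]:W(?P<write>[0-9-]):[Y-]:S\d+\]"
)


@dataclass(frozen=True)
class ControlInfo:
    wait_on: frozenset   # scoreboards that must be clear before the instruction issues
    sets: Optional[int]  # scoreboard incremented when the instruction issues, if any


def parse_control(line: str) -> ControlInfo:
    """Extract the B (wait) and W (write) fields from an annotated SASS line."""
    match = CONTROL_RE.search(line)
    if match is None:
        raise ValueError(f"no control annotation found in: {line!r}")
    # Each digit in the mask names a scoreboard to wait on; '-' means "not waited on".
    wait_on = frozenset(int(ch) for ch in match.group("wait") if ch != "-")
    write = match.group("write")
    return ControlInfo(wait_on=wait_on, sets=None if write == "-" else int(write))


if __name__ == "__main__":
    print(parse_control("[B------:R-:W2:-:S04] /*00f0*/ LDG.E.SYS R0, [R2] ;"))
    print(parse_control("[B--2---:R-:W-:Y:S08] /*0150*/ IMAD R0, R0, c[0x0][0x160], R5 ;"))
```

For the two lines parsed at the bottom, the LDG sets scoreboard 2 and waits on nothing, while the IMAD sets nothing and waits on scoreboard 2.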
A barrier may name multiple scoreboards, such as B01--4-, which means wait until scoreboards 0, 1, and 4 are all cleared. When a data dependency has been satisfied, the respective scoreboard is decremented.
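The increment-on-issue, decrement-on-completion behavior can be sketched as a toy model in Python. This is an illustration of the bookkeeping described above, not a description of the actual hardware. The class `WarpScoreboards` and its methods are hypothetical names, and the count of six scoreboards per warp is taken from the text.

```python
from dataclasses import dataclass, field

NUM_SCOREBOARDS = 6  # per warp, as stated in the text


@dataclass
class WarpScoreboards:
    counters: list = field(default_factory=lambda: [0] * NUM_SCOREBOARDS)

    def issue(self, sets=None):
        """Issue an instruction; a producer increments the scoreboard it sets."""
        if sets is not None:
            self.counters[sets] += 1

    def complete(self, scoreboard):
        """A producer's result has arrived; decrement its scoreboard."""
        if self.counters[scoreboard] == 0:
            raise ValueError(f"scoreboard {scoreboard} has nothing outstanding")
        self.counters[scoreboard] -= 1

    def can_issue(self, wait_on):
        """A consumer may issue only when every scoreboard it waits on is clear."""
        return all(self.counters[sb] == 0 for sb in wait_on)


def demo():
    sb = WarpScoreboards()
    sb.issue(sets=2)                 # first LDG sets scoreboard 2
    sb.issue(sets=2)                 # second LDG reuses scoreboard 2
    trace = [sb.can_issue({2})]      # IMAD is stalled: two loads outstanding
    sb.complete(2)
    trace.append(sb.can_issue({2}))  # still stalled: one load outstanding
    sb.complete(2)
    trace.append(sb.can_issue({2}))  # both loads done, IMAD may issue
    return trace


if __name__ == "__main__":
    print(demo())
```

Because the scoreboard is modeled as a counter rather than a single flag, the consumer stays stalled until every producer that shares the scoreboard has completed.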
Scoreboard reuse can mean that the stall classification from Nsight Compute is incorrect, as a long and short scoreboard stall may be conflated if they use the same scoreboard.
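The ambiguity can be illustrated with a toy model. The Python sketch below is not how Nsight Compute attributes stalls. It only shows that when producers of different kinds share a scoreboard, the scoreboard by itself cannot say which kind a waiting instruction is blocked on. The function name `ambiguous_scoreboards` and the input layout are hypothetical.

```python
def ambiguous_scoreboards(outstanding):
    """Return the scoreboards whose outstanding producers mix stall kinds.

    `outstanding` maps a scoreboard index to a list of (opcode, kind) pairs,
    where kind is "long" or "short".
    """
    mixed = []
    for scoreboard, producers in sorted(outstanding.items()):
        kinds = {kind for _opcode, kind in producers}
        if len(kinds) > 1:
            mixed.append(scoreboard)
    return mixed


if __name__ == "__main__":
    outstanding = {
        0: [("LDG", "long")],                    # only global loads: unambiguous
        2: [("LDG", "long"), ("LDS", "short")],  # shared by a global and a shared-memory load
    }
    print(ambiguous_scoreboards(outstanding))
```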
Scoreboarding for dependency tracking in dynamic instruction scheduling dates back to the "first supercomputer", the Control Data Corporation 6600, one of which disproved Euler's sum of powers conjecture in 1966. Unlike in CPUs, scoreboarding in GPUs isn't used for out-of-order execution within threads (instruction-level parallelism), only across them (thread-level parallelism); see this NVIDIA patent.
For more details about scoreboard implementation on GPUs, see Professor Matthew D. Sinclair's slides.