What is warp divergence?
Warp divergence occurs when threads within a warp take different execution paths due to control flow statements.
For example, consider this kernel:
__global__ void divergent_kernel(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        if (data[idx] > 0.5f) {
            // A
            data[idx] = data[idx] * 4.0f;
        } else {
            // B
            data[idx] = data[idx] + 2.0f;
        }
        data[idx] = data[idx] * data[idx];
    }
}
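For context, here is a minimal host-side launcher for this kernel; the array size, initialization, and launch configuration are illustrative choices, not part of the original example:

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

__global__ void divergent_kernel(float* data, int n); // as defined above

int main() {
    const int n = 1024;  // illustrative size
    float* h = (float*)malloc(n * sizeof(float));
    // Mix of values above and below 0.5f, so warps will diverge.
    for (int i = 0; i < n; ++i) h[i] = (float)i / n;

    float* d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

    const int threads = 256;
    divergent_kernel<<<(n + threads - 1) / threads, threads>>>(d, n);
    cudaDeviceSynchronize();

    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("data[0] = %f, data[n-1] = %f\n", h[0], h[n - 1]);

    cudaFree(d);
    free(h);
    return 0;
}
```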
When the threads within a warp encounter the data-dependent conditional, some threads must execute block A while others must execute block B, depending on the value at data[idx]. Because of this data dependency and the structural constraints of the CUDA programming model and its implementation in the PTX machine model, there is no way for a programmer or a compiler to avoid this split in control flow inside of the warp.
Instead, the warp scheduler must handle concurrent execution of these divergent code paths, which it achieves by "masking" some threads so that they don't execute the instruction. This is achieved using predicate registers.
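The same masking idea can be expressed manually in source code. Here is a sketch of what the compiler is effectively doing, written as branch-free CUDA; the kernel name is ours, for illustration:

```cuda
__global__ void predicated_kernel(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        float x = data[idx];
        bool p = x > 0.5f;    // the "predicate", analogous to P0
        float a = x * 4.0f;   // branch A, computed by every thread
        float b = x + 2.0f;   // branch B, computed by every thread
        float y = p ? a : b;  // select, rather than branch
        data[idx] = y * y;
    }
}
```

Every thread computes both branches and a select chooses the result, so no thread ever needs to be masked off for a separate code path.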
Let's examine the generated SASS (Godbolt link ) to understand the execution flow:
LDG.E.SYS R4, [R2] // L1 load data[idx]
FSETP.GT.AND P0, PT, R4.reuse, 0.5, PT // L2 set P0 to data[idx] > 0.5
FADD R0, R4, 2 // L3 store 2 + data[idx] in R0
@P0 FMUL R0, R4, 4 // L4 in some threads, store 4 * data[idx] in R0
FMUL R5, R0, R0 // L5 store R0 * R0 in R5
STG.E.SYS [R2], R5 // L6 store R5 in data[idx]
After loading the data into R4 (L1), all 32 threads in the warp execute FSETP.GT.AND concurrently (L2), and each thread gets its own P0 value based on the data value in R4. Then, we have a bit of compiler cleverness: in L3, all threads execute the code in B, writing to R0. Only those for whom P0 is true then execute the code in A (L4), over-writing the value written to R0 in L3. On this instruction, the warp is said to be "divergent". On L5, all threads are back to executing the same code. Once the warp scheduler brings them back into alignment by issuing the same instruction on the same clock cycle, the warp has "converged".
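One way to observe divergence from inside a kernel is with warp-vote intrinsics. As an illustrative sketch (the kernel and the masks output array are ours, not from the original), __ballot_sync returns a 32-bit value with one bit per lane whose predicate was true:

```cuda
__global__ void inspect_divergence(const float* data, int n, unsigned* masks) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        // Vote among the currently active lanes of this warp:
        // bit i of the result is set iff lane i's predicate was true.
        unsigned active = __activemask();
        unsigned took_a = __ballot_sync(active, data[idx] > 0.5f);
        if (threadIdx.x % 32 == 0)     // one writer per warp
            masks[idx / 32] = took_a;  // equal to `active` or 0 => no divergence
    }
}
```

If the recorded mask is neither zero nor equal to the active mask, some lanes took block A while others took block B, i.e. the warp diverged at that conditional.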
This is presumably more efficient than the naïve encoding of the branch into SASS, which would instead predicate both lines L3 and L4 ("presumably" in that we can trust the compiler and in that, heuristically, we are trading use of cheap, plentiful CUDA Core computation for more expensive flow control). As often in GPU programming, it's better to waste compute (an unnecessary FADD for every execution of L4) than to add complexity, even if it's just a simple predication!
One reason compilers might aggressively avoid divergence is that in early (pre-Volta) GPUs, divergent warps were always fully serialized. While warp divergence still reduces efficiency, modern GPUs with independent thread scheduling don't necessarily experience the full serialization penalties.
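Independent thread scheduling also means diverged threads of a warp may interleave rather than strictly reconverge, so code that relies on warp-level lockstep after a branch should request reconvergence explicitly. A hedged sketch using __syncwarp (the kernel name is ours; the full-warp sync assumes n is a multiple of 32 so every lane reaches it):

```cuda
__global__ void reconverge_example(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        if (data[idx] > 0.5f)
            data[idx] *= 4.0f;  // A
        else
            data[idx] += 2.0f;  // B
        // On Volta and later, force the warp to reconverge here before any
        // code that assumes all lanes execute together (assumes all 32
        // lanes are active; otherwise pass an explicit lane mask).
        __syncwarp();
        data[idx] = data[idx] * data[idx];
    }
}
```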