CUDA Core
The CUDA Cores are GPU cores that execute scalar arithmetic instructions.
They are to be contrasted with the Tensor Cores, which execute matrix operations.
Unlike instructions issued to CPU cores, instructions issued to CUDA Cores are generally not independently scheduled. Instead, groups of cores are issued the same instruction simultaneously by the Warp Scheduler but apply it to different registers. Commonly, these groups are of size 32, the size of a warp, but on contemporary GPUs a group can contain as few as one active thread, at a cost to performance.
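To make this concrete, here is a minimal SAXPY kernel (the kernel name and launch configuration are illustrative, not taken from any particular whitepaper): every thread in a warp is issued the same fused multiply-add instruction by the Warp Scheduler, but each thread applies it to registers holding a different element of the arrays.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Every thread in a warp receives the same instruction from the Warp Scheduler,
// but applies it to its own registers (here, a different element i of x and y).
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        y[i] = a * x[i] + y[i];  // one fused multiply-add per thread, issued warp-wide
    }
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    // 256 threads per block = 8 warps of 32 threads each.
    saxpy<<<(n + 255) / 256, 256>>>(n, 3.0f, x, y);
    cudaDeviceSynchronize();

    printf("y[0] = %f\n", y[0]);  // expect 5.0
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```

When some threads in a warp take a branch that others do not, the warp's instruction stream still executes on the whole group, with inactive threads masked off, which is why running fewer than 32 threads' worth of useful work per warp costs performance.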
The term "CUDA Core" is slightly slippery: in different Streaming Multiprocessor architectures CUDA Cores can consist of different units -- a different mixture of 32 bit integer and 32 bit and 64 bit floating point units.
So, for example, the H100 whitepaper indicates that an H100 GPU's Streaming Multiprocessors (SMs) each have 128 "FP32 CUDA Cores", which accurately counts the number of 32-bit floating point units but is double the number of 32-bit integer or 64-bit floating point units (as evidenced by the diagram above). For estimating performance, it's best to look directly at the number of hardware units for a given operation.
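As a rough worked example of counting units, the sketch below multiplies SM count, FP32 units per SM, FLOPs per fused multiply-add, and clock rate to estimate peak FP32 throughput. The 128 FP32 units per SM and the ~1.98 GHz boost clock are assumptions taken from published H100 SXM5 figures rather than queried from the device; with 132 SMs they land near the quoted ~67 TFLOPS of peak FP32 throughput.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Rough peak FP32 throughput estimate built from hardware unit counts.
// The FP32-units-per-SM count and boost clock below are assumed values
// matching Hopper/H100 SXM5; substitute the figures for your own GPU.
int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    const int fp32_units_per_sm = 128;     // "FP32 CUDA Cores" per SM on Hopper (assumed)
    const double boost_clock_hz = 1.98e9;  // H100 SXM5 boost clock (assumed)
    const double flops_per_fma = 2.0;      // one fused multiply-add = 2 FLOPs

    double peak_tflops = prop.multiProcessorCount * fp32_units_per_sm *
                         flops_per_fma * boost_clock_hz / 1e12;

    // For an H100 SXM5 with 132 SMs: 132 * 128 * 2 * 1.98e9 ≈ 67 TFLOPS FP32.
    printf("%s: %d SMs -> ~%.1f TFLOPS peak FP32\n",
           prop.name, prop.multiProcessorCount, peak_tflops);
    return 0;
}
```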