Cooperative Thread Array
A cooperative thread array (CTA) is a collection of threads scheduled onto the same Streaming Multiprocessor (SM). CTAs are the PTX/SASS implementation of the CUDA programming model's thread blocks. CTAs are composed of one or more warps.
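The mapping from source to CTAs can be seen in an ordinary kernel launch. In this sketch (kernel name and sizes are illustrative), each thread block of the launched grid becomes one CTA; a 256-thread block consists of 256/32 = 8 warps.

```cuda
// Illustrative kernel: each launched thread block becomes one CTA in PTX/SASS.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) data[i] *= factor;
}

// Launch a grid of CTAs, each 256 threads (8 warps) wide.
// scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);
```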
Programmers can direct threads within a CTA to coordinate with each other. The programmer-managed shared memory, carved out of the L1 data cache of each SM, makes this coordination fast. Unlike threads within a CTA, threads in different CTAs cannot coordinate via barriers; instead they must coordinate through global memory, e.g. via atomic update instructions. Because the driver controls the scheduling of CTAs at runtime, CTA execution order is indeterminate, and blocking one CTA on another can easily lead to deadlock.
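Both coordination paths can be sketched in one kernel (the kernel name and block size are illustrative). Within a CTA, threads cooperate through shared memory and synchronize with `__syncthreads()`, which is a CTA-wide barrier only; across CTAs, each block's partial result is combined through a global-memory atomic.

```cuda
// Sketch: intra-CTA coordination via shared memory + barriers,
// inter-CTA coordination via a global-memory atomic.
__global__ void blockSum(const float *in, float *out, int n) {
    __shared__ float partial[256];      // programmer-managed shared memory (in L1)
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;
    partial[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                    // barrier: synchronizes this CTA only

    // Tree reduction within the CTA, one barrier per step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) partial[tid] += partial[tid + stride];
        __syncthreads();
    }
    // CTAs cannot barrier on each other; combine results via global memory.
    if (tid == 0) atomicAdd(out, partial[0]);
}
```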
The number of CTAs that can be scheduled onto a single SM depends on several factors. Fundamentally, the SM has a limited set of resources (lines in the register file, "slots" for warps, bytes of shared memory in the L1 data cache), and each CTA uses a certain amount of those resources, as calculated at compile time, when scheduled onto an SM.
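The CUDA runtime exposes this calculation through its occupancy API. A minimal sketch, assuming a kernel `blockSum` and a 256-thread CTA: the call below reports how many CTAs of that shape fit on one SM, given the kernel's per-CTA register, warp-slot, and shared-memory costs.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void blockSum(const float *in, float *out, int n) {
    // Body elided; resource usage is what matters for occupancy.
}

int main() {
    int ctasPerSM = 0;
    // Ask the runtime how many CTAs of this kernel, at 256 threads each
    // and no dynamic shared memory, can be resident on one SM.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &ctasPerSM, blockSum, /*blockSize=*/256, /*dynamicSMemSize=*/0);
    printf("CTAs per SM: %d\n", ctasPerSM);
    return 0;
}
```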