Cooperative Thread Array
A cooperative thread array (CTA) is a collection of threads scheduled onto the same Streaming Multiprocessor (SM). CTAs are the PTX/SASS implementation of the CUDA programming model's thread blocks. CTAs are composed of one or more warps.
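The mapping from source to CTAs can be seen in an ordinary kernel launch. In this sketch (kernel name and sizes are illustrative), each thread block of the launched grid becomes one CTA; a 256-thread block consists of 256/32 = 8 warps.

```cuda
// Illustrative kernel: each launched thread block becomes one CTA in PTX/SASS.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) data[i] *= factor;
}

// Launch a grid of CTAs, each 256 threads (8 warps) wide.
// scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);
```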
Programmers can direct threads within a CTA to coordinate with each other. The programmer-managed shared memory, carved out of the L1 data cache of each SM, makes this coordination fast. Unlike threads within a CTA, threads in different CTAs cannot coordinate via barriers; instead they must coordinate through global memory, e.g. via atomic update instructions. Because the driver controls the scheduling of CTAs at runtime, CTA execution order is indeterminate, and blocking one CTA on another can easily lead to deadlock.
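Both coordination paths can be sketched in one kernel (the kernel name and block size are illustrative). Within a CTA, threads cooperate through shared memory and synchronize with `__syncthreads()`, which is a CTA-wide barrier only; across CTAs, each block's partial result is combined through a global-memory atomic.

```cuda
// Sketch: intra-CTA coordination via shared memory + barriers,
// inter-CTA coordination via a global-memory atomic.
__global__ void blockSum(const float *in, float *out, int n) {
    __shared__ float partial[256];      // programmer-managed shared memory (in L1)
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;
    partial[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                    // barrier: synchronizes this CTA only

    // Tree reduction within the CTA, one barrier per step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) partial[tid] += partial[tid + stride];
        __syncthreads();
    }
    // CTAs cannot barrier on each other; combine results via global memory.
    if (tid == 0) atomicAdd(out, partial[0]);
}
```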
The number of CTAs that can be scheduled onto a single SM depends on several factors. Fundamentally, the SM has a limited set of resources (lines in the register file, "slots" for warps, bytes of shared memory in the L1 data cache), and each CTA uses a certain amount of those resources, as calculated at compile time, when scheduled onto an SM.
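The CUDA runtime exposes this calculation through its occupancy API. A minimal sketch, assuming a kernel `blockSum` and a 256-thread CTA: the call below reports how many CTAs of that shape fit on one SM, given the kernel's per-CTA register, warp-slot, and shared-memory costs.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void blockSum(const float *in, float *out, int n) {
    // Body elided; resource usage is what matters for occupancy.
}

int main() {
    int ctasPerSM = 0;
    // Ask the runtime how many CTAs of this kernel, at 256 threads each
    // and no dynamic shared memory, can be resident on one SM.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &ctasPerSM, blockSum, /*blockSize=*/256, /*dynamicSMemSize=*/0);
    printf("CTAs per SM: %d\n", ctasPerSM);
    return 0;
}
```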