
Cooperative Thread Array

Cooperative thread arrays correspond to the thread block level of the thread block hierarchy in the CUDA programming model. Modified from diagrams in NVIDIA's CUDA Refresher: The CUDA Programming Model and the NVIDIA CUDA C++ Programming Guide.

A cooperative thread array (CTA) is a collection of threads scheduled onto the same Streaming Multiprocessor (SM). CTAs are the PTX/SASS implementation of the CUDA programming model's thread blocks. CTAs are composed of one or more warps.
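As a concrete sketch of the mapping (the kernel and launch parameters below are illustrative, not from the glossary): launching a CUDA kernel with 128 threads per block produces CTAs of four 32-thread warps each.

```cpp
#include <cstdio>

// Illustrative kernel: the lead thread of each warp reports which
// CTA (thread block) and warp it belongs to.
__global__ void whoAmI() {
    int warpInCta = threadIdx.x / warpSize;  // warpSize is 32 on NVIDIA GPUs
    if (threadIdx.x % warpSize == 0) {
        printf("CTA %d, warp %d\n", blockIdx.x, warpInCta);
    }
}

int main() {
    // 2 CTAs of 128 threads = 4 warps each; the hardware assigns
    // each CTA to a single SM for its entire lifetime.
    whoAmI<<<2, 128>>>();
    cudaDeviceSynchronize();
    return 0;
}
```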

Programmers can direct threads within a CTA to coordinate with each other. The programmer-managed shared memory, located in the L1 data cache of the SM, makes this coordination fast. Threads in different CTAs cannot coordinate with each other via barriers, unlike threads within a CTA, and instead must coordinate via global memory, e.g. via atomic update instructions. Because the driver controls the scheduling of CTAs at runtime, CTA execution order is indeterminate, and blocking one CTA on another can easily lead to deadlock.
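Both coordination patterns can be sketched in one kernel (a minimal, illustrative reduction, not code from the glossary): threads within a CTA synchronize on shared memory with a barrier, while results are combined across CTAs with an atomic update to global memory.

```cpp
// Illustrative sketch: intra-CTA coordination via shared memory and
// __syncthreads(), inter-CTA coordination via an atomic on global memory.
__global__ void blockSum(const float* in, float* out, int n) {
    __shared__ float partial[128];          // fast, per-CTA shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    partial[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                        // barrier over this CTA's threads only

    // Tree reduction within the CTA.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }

    // CTAs cannot barrier on each other, so the per-CTA results are
    // combined through global memory with an atomic instead.
    if (threadIdx.x == 0)
        atomicAdd(out, partial[0]);
}
```

Note that the kernel never waits on another CTA; each CTA's only cross-block interaction is a single atomic, which is safe regardless of execution order.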

The number of CTAs that can be scheduled onto a single SM depends on a number of factors. Fundamentally, the SM has a limited set of resources — lines in the register file, "slots" for warps, bytes of shared memory in the L1 data cache — and each CTA uses a certain amount of those resources (as calculated at compile time) when scheduled onto an SM.
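One way to inspect this limit is the CUDA runtime's occupancy API, which reports how many CTAs of a given kernel fit on one SM. A minimal sketch (the kernel here is a placeholder):

```cpp
#include <cstdio>

__global__ void kernel() { /* placeholder */ }

int main() {
    // Ask the runtime how many CTAs of this kernel can be resident on
    // one SM, given its register, warp-slot, and shared-memory usage.
    int maxCtasPerSm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &maxCtasPerSm, kernel, /*blockSize=*/128, /*dynamicSMemBytes=*/0);
    printf("CTAs per SM at block size 128: %d\n", maxCtasPerSm);
    return 0;
}
```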
