The traditional CUDA programming model exposes a hierarchy of threads and a hierarchy of memories to user programs. Those programs receive pointers and execute concurrently, mutating memory relative to those pointers. The same instructions are issued to multiple threads in parallel, and so this programming model is a "single-instruction, multiple-thread" (SIMT) programming model. This is the programming model used in, for instance, CUDA C/C++ and the PTX IR used by pre-CUDA-Tile programs targeting NVIDIA GPUs.
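To make the SIMT idea concrete, here is a minimal pure-Python emulation of how a thread derives its own index from its position in the hierarchy and then mutates memory relative to a pointer. It is a sketch, not GPU code: the names `saxpy_thread` and `launch` are hypothetical, the lists stand in for device memory, and the "threads" run one after another rather than in parallel.

```python
# Pure-Python emulation of SIMT-style indexing; nothing here runs on a GPU.


def saxpy_thread(block_idx, thread_idx, block_dim, n, a, x, y):
    """Body executed by one thread; every thread runs these same instructions."""
    i = block_idx * block_dim + thread_idx  # this thread's global element index
    if i < n:  # guard threads that fall past the end of the array
        y[i] = a * x[i] + y[i]  # mutate memory relative to the "pointer" y


def launch(grid_dim, block_dim, kernel, *args):
    """Run the kernel body once per (block, thread) pair, sequentially here."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, thread_idx, block_dim, *args)


n = 10
x = [float(i) for i in range(n)]
y = [1.0] * n
block_dim = 4
grid_dim = (n + block_dim - 1) // block_dim  # ceil-divide so all n elements are covered
launch(grid_dim, block_dim, saxpy_thread, n, 2.0, x, y)
```

Note that the index arithmetic and the bounds check live inside the per-thread body: in this model, the program itself is responsible for mapping threads onto memory.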

This programming model is defined for a "unified" hardware substrate -- the "U" in "CUDA". That is, homogeneous Streaming Multiprocessors (SMs) with homogeneous CUDA Cores implement the majority of operations, rather than the device comprising specialized cores, programmed heterogeneously, as was generically the case in graphics programming before CUDA.

This programming model is a poor fit for GPUs of the latest SM architectures, where the vast majority of arithmetic bandwidth is in the Tensor Cores. The Tensor Cores can only perform matrix multiplications and must be programmed with warp-level instructions and asynchrony, rather than the thread-level instructions used to program the rest of the hardware.

In the CUDA Tile programming model, programs are expressed at the level of tile-kernels, which are instances of the program that run concurrently across a grid of tile blocks, each of which is a single thread of execution. Tile-kernels operate, in the happy path, on structured pointers, which combine a pointer with information about an array: its total extent (shape) and its access patterns (stride). Note the similarity to the CuTe type system for Layouts and Tensors.
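As a rough illustration of what a structured pointer carries, the sketch below bundles a base offset with a shape and strides and uses them to load one tile by its tile index. Everything here is an assumption made for illustration: `StructuredPointer` and `load_tile` are hypothetical names, not part of cuTile or Tile IR, the list stands in for device memory, and padding out-of-bounds elements with zero is just one possible policy.

```python
from dataclasses import dataclass


@dataclass
class StructuredPointer:
    """Hypothetical stand-in: a base pointer plus array metadata."""

    buffer: list    # stands in for the memory behind the base pointer
    offset: int     # base "pointer": buffer index of element (0, 0)
    shape: tuple    # total extent, as (rows, cols)
    strides: tuple  # elements to skip per step along each axis

    def load_tile(self, tile_row, tile_col, tile_shape):
        """Return the tile at (tile_row, tile_col) as a list of rows."""
        tile_h, tile_w = tile_shape
        row0, col0 = tile_row * tile_h, tile_col * tile_w
        tile = []
        for r in range(row0, row0 + tile_h):
            row = []
            for c in range(col0, col0 + tile_w):
                if r < self.shape[0] and c < self.shape[1]:
                    index = self.offset + r * self.strides[0] + c * self.strides[1]
                    row.append(self.buffer[index])
                else:
                    row.append(0)  # assumed policy: zero-pad past the array's edge
            tile.append(row)
        return tile


buffer = list(range(12))

# A 3x4 row-major array: stepping one row skips 4 elements, one column skips 1.
view = StructuredPointer(buffer, offset=0, shape=(3, 4), strides=(4, 1))
edge_tile = view.load_tile(1, 1, (2, 2))  # bottom-right tile, partly out of bounds

# The same memory viewed as its 4x3 transpose, just by swapping shape and strides.
transposed = StructuredPointer(buffer, offset=0, shape=(4, 3), strides=(1, 4))
corner_tile = transposed.load_tile(0, 0, (2, 2))
```

Because the shape and strides travel with the pointer, the caller only names a tile index. The address arithmetic and bounds handling that a SIMT kernel would write by hand per thread are derived from the metadata instead.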

As with traditional "CUDA SIMT" in CUDA C/C++ and PTX IR, this programming model is shared between high-level languages and an intermediate representation -- here, Tile IR.

At time of writing in mid-2026, the CUDA Tile programming model is new, and to what extent it will replace the existing "CUDA SIMT" programming model is as yet unclear. The CUDA Tile programming model is currently available via cuTile Python. It is also available, albeit in experimental form, via cuTile BASIC and cuTile Rust.
