CUDA Templates for Linear Algebra Subroutines and Solvers (CUTLASS) is a library of abstractions for implementing high-performance linear algebra in CUDA kernels.

Like cuBLAS, CUTLASS is named in reference to the Basic Linear Algebra Subprograms (BLAS) standard for low-level linear algebra routines. Unlike cuBLAS, CUTLASS is a toolkit for constructing kernels, rather than a library of ready-to-call routines. CUTLASS is primarily associated with the third level of the BLAS hierarchy, general matrix multiplications ("GEMMs").

As the name suggests, CUTLASS includes a collection of CUDA C++ template abstractions. Templates are the C++ implementation of parametric polymorphism, which you may have encountered in the form of generics in other languages. Polymorphic functions are written once but can operate on inputs with different types.
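
To make the idea concrete, here is a minimal sketch in plain C++ (no CUDA, and not CUTLASS code). The function name `axpy` and its signature are our own choices for illustration, borrowed from the BLAS operation of the same name. It is written once and instantiated by the compiler for each element type it is called with.

```cpp
#include <cstddef>
#include <vector>

// Computes y = alpha * x + y element-wise (the BLAS "axpy" operation).
// T is a template parameter: the compiler generates a separate, fully
// typed version of this function for each T it is called with.
// Illustrative only; this is not part of CUTLASS.
template <typename T>
void axpy(T alpha, const std::vector<T>& x, std::vector<T>& y) {
    // Stop at the shorter of the two vectors to stay in bounds.
    for (std::size_t i = 0; i < x.size() && i < y.size(); ++i) {
        y[i] = alpha * x[i] + y[i];
    }
}
```

Calling `axpy(2.0f, xf, yf)` on `std::vector<float>` arguments and `axpy(3, xi, yi)` on `std::vector<int>` arguments reuses the same source, with the type resolved at compile time.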

The core of modern CUTLASS is the CuTe library, which defines Layout and Tensor types for composably describing and manipulating tensors of data and threads. It is not to be confused with CuTe DSL, which exposes CuTe/CUTLASS templates via a Domain-Specific Language (DSL) in Python.
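
As a rough intuition for what a layout does, the sketch below pairs a shape with strides to map a logical coordinate to a linear offset. The struct `Layout2D` and the helpers `row_major` and `col_major` are hypothetical names invented for this example; they are not CuTe's actual types or API, which are considerably more general.

```cpp
#include <cstddef>

// Hypothetical stand-in for the idea behind a layout: a shape plus strides
// that together map a logical (row, col) coordinate to a linear offset.
// This is NOT the CuTe Layout type, only an illustration of the concept.
struct Layout2D {
    std::size_t rows, cols;              // shape
    std::size_t row_stride, col_stride;  // strides

    // Map coordinate (i, j) to an offset into a flat buffer.
    constexpr std::size_t operator()(std::size_t i, std::size_t j) const {
        return i * row_stride + j * col_stride;
    }
};

// Row-major: consecutive elements of a row are adjacent in memory.
constexpr Layout2D row_major(std::size_t rows, std::size_t cols) {
    return Layout2D{rows, cols, cols, 1};
}

// Column-major: consecutive elements of a column are adjacent in memory.
constexpr Layout2D col_major(std::size_t rows, std::size_t cols) {
    return Layout2D{rows, cols, 1, rows};
}
```

The same logical coordinate lands at different offsets depending on the layout, so code written against coordinates does not need to change when the memory arrangement does.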

Atop CuTe, CUTLASS exposes a header-only CUDA C++ library that operates at three levels: the whole device, a single kernel, or a collective of threads (typically a thread block). At the collective layer, matrix-matrix multiplications are typically split into "mainloops" and "epilogues". Mainloops express the core algorithm, like tiling strategies. Epilogues describe post-processing steps, like the application of scaling factors or scalar non-linearities (popular in neural networks).
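
The mainloop/epilogue split can be sketched on the CPU in plain C++. Everything here is an assumption made for illustration: the function `gemm`, the `TileK` parameter, and the `ScaleThenRelu` functor are invented names, not CUTLASS APIs, and a real kernel would distribute this work across threads and Tensor Cores rather than loop serially.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical epilogue: scale the accumulator, then apply ReLU.
struct ScaleThenRelu {
    float alpha;
    float operator()(float acc) const { return std::max(alpha * acc, 0.0f); }
};

// Computes D = epilogue(A * B) for row-major A (M x K), B (K x N), D (M x N).
// The epilogue is a template parameter, so swapping the post-processing step
// does not require touching the mainloop. Illustrative only; not CUTLASS code.
template <std::size_t TileK, typename Epilogue>
void gemm(std::size_t M, std::size_t N, std::size_t K,
          const std::vector<float>& A, const std::vector<float>& B,
          std::vector<float>& D, Epilogue epilogue) {
    for (std::size_t m = 0; m < M; ++m) {
        for (std::size_t n = 0; n < N; ++n) {
            float acc = 0.0f;
            // "Mainloop": walk the K dimension one tile at a time, accumulating.
            for (std::size_t k0 = 0; k0 < K; k0 += TileK) {
                const std::size_t k_end = std::min(k0 + TileK, K);
                for (std::size_t k = k0; k < k_end; ++k) {
                    acc += A[m * K + k] * B[k * N + n];
                }
            }
            // "Epilogue": post-process the accumulator before it is stored.
            D[m * N + n] = epilogue(acc);
        }
    }
}
```

A call such as `gemm<4>(M, N, K, A, B, D, ScaleThenRelu{0.5f})` fixes the tile size at compile time and fuses the scaling and non-linearity into the final store, avoiding a separate pass over `D`.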

CUTLASS is widely used to write some of the highest-performing kernels, especially matrix-matrix multiplications on hardware from more recent Streaming Multiprocessor architectures. These kernels require careful programming of the Tensor Cores to get anywhere near peak performance.

CUTLASS is open source and available on GitHub. The repository also includes many high-performance open-source kernels implemented with CUTLASS, which are regularly used as references elsewhere in open-source kernel development. We can highly recommend the popular tutorials by Jay Shah of Colfax International, which explain in detail how the key components of CUTLASS are used to achieve maximum performance. Note, however, that like most C++ template metaprogramming, CUTLASS is not for the faint of heart!
