GPU Glossary

What is cuDNN?

NVIDIA's cuDNN (CUDA Deep Neural Network) is a library of primitives for building GPU-accelerated deep neural networks.

cuDNN provides highly optimized kernels for operations that arise frequently in neural networks. These include convolution, matrix multiplication, self-attention (including scaled dot-product attention, aka "Flash Attention"), various normalizations, pooling, and more.

cuDNN is a key library at the application layer of the CUDA software platform, alongside its sibling library, cuBLAS. Deep learning frameworks like PyTorch typically leverage cuBLAS for general-purpose linear algebra, such as the matrix multiplications that form the core of dense (fully-connected) layers. They rely on cuDNN for more specialized primitives like convolutional layers, normalization routines, and attention mechanisms.
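As a concrete illustration, the sketch below builds a small PyTorch model whose layers map, roughly speaking, onto the two libraries: the convolution and batch normalization dispatch to cuDNN kernels, while the dense layer's matrix multiplication dispatches to cuBLAS. It assumes a CUDA-capable GPU and a cuDNN-enabled PyTorch build; no direct calls to either library are needed.

```python
import torch
import torch.nn as nn

# Confirm that this PyTorch build links against cuDNN (returns e.g. 90100 for 9.1).
print(torch.backends.cudnn.version())

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolution -> cuDNN
    nn.BatchNorm2d(16),                          # normalization -> cuDNN
    nn.ReLU(),                                   # simple elementwise kernel
    nn.Flatten(),
    nn.Linear(16 * 32 * 32, 10),                 # dense matmul -> cuBLAS
).cuda()

x = torch.randn(8, 3, 32, 32, device="cuda")
y = model(x)  # the forward pass runs cuDNN and cuBLAS kernels under the hood
```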

In modern cuDNN code, computations are expressed as operation graphs via the declarative Graph API, which can be constructed using the open source Python and C++ frontend APIs.

This API allows the developer to define a sequence of operations as a graph, which cuDNN can then analyze to perform optimizations, most importantly operation fusion. In operation fusion, a sequence of operations like Convolution + Bias + ReLU is merged ("fused") into a single operation that runs as a single kernel. Operation fusion reduces demand on memory bandwidth by keeping intermediate results in shared memory throughout the sequence.
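For illustration, here is a rough sketch of how a fused Convolution + Bias + ReLU graph might be expressed with the Python frontend. Method names and signatures follow the style of the frontend's published samples but vary across frontend versions, so treat this as an approximation rather than exact API documentation.

```python
import cudnn
import torch

N, C, H, W, K = 8, 64, 56, 56, 32

# Device buffers for inputs and output (the frontend interoperates with
# DLPack-compatible tensors such as PyTorch CUDA tensors).
x = torch.randn(N, C, H, W, device="cuda", dtype=torch.float16)
w = torch.randn(K, C, 3, 3, device="cuda", dtype=torch.float16)
b = torch.randn(1, K, 1, 1, device="cuda", dtype=torch.float16)
y = torch.empty(N, K, H, W, device="cuda", dtype=torch.float16)

graph = cudnn.pygraph(
    io_data_type=cudnn.data_type.HALF,
    intermediate_data_type=cudnn.data_type.FLOAT,
    compute_data_type=cudnn.data_type.FLOAT,
)

# Declare graph inputs, then chain operations; cuDNN may fuse the whole
# Conv + Bias + ReLU chain into a single kernel.
X = graph.tensor_like(x)
Wt = graph.tensor_like(w)
B = graph.tensor_like(b)

conv_out = graph.conv_fprop(image=X, weight=Wt, padding=[1, 1], stride=[1, 1], dilation=[1, 1])
bias_out = graph.bias(input=conv_out, bias=B)
relu_out = graph.relu(input=bias_out)
relu_out.set_output(True).set_data_type(cudnn.data_type.HALF)

# Lower the graph: validate it, query heuristics for an engine, build a plan.
graph.validate()
graph.build_operation_graph()
graph.create_execution_plans([cudnn.heur_mode.A, cudnn.heur_mode.FALLBACK])
graph.check_support()
graph.build_plans()

workspace = torch.empty(graph.get_workspace_size(), device="cuda", dtype=torch.uint8)
graph.execute({X: x, Wt: w, B: b, relu_out: y}, workspace)
```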

The frontends interact with a lower-level, closed source C backend, which exposes an API for legacy use cases or direct C FFI.

For any given operation, cuDNN maintains multiple underlying implementations and uses internal (undocumented) heuristics to select the most performant one for the target Streaming Multiprocessor (SM) architecture, data types, and input sizes.
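Frameworks expose some control over this selection. PyTorch, for instance, can ask cuDNN to benchmark its candidate implementations rather than rely on heuristics alone, which pays off when input shapes stay constant across iterations:

```python
import torch

# Time the candidate cuDNN algorithms on the first call for each input shape
# and cache the fastest one, instead of trusting the heuristics alone.
torch.backends.cudnn.benchmark = True

# Optionally restrict cuDNN to deterministic algorithms, trading speed
# for reproducibility.
torch.backends.cudnn.deterministic = False
```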

cuDNN's initial claim to fame was accelerating convolutional neural networks on NVIDIA GPUs. For Transformer neural networks on the Hopper and especially Blackwell SM architectures, NVIDIA has tended to place more emphasis on the CUTLASS library.

For more information on cuDNN, see the official cuDNN documentation and the open source frontend APIs.
