What is cuBLAS?
cuBLAS (CUDA Basic Linear Algebra Subroutines) is NVIDIA's high-performance implementation of the BLAS (Basic Linear Algebra Subprograms) standard. It is a proprietary software library that provides highly optimized kernels for common linear algebra operations.
Instead of writing and optimizing complex operations like matrix multiplication from scratch, developers can call cuBLAS functions from their host code. The library contains a wide array of kernels, each fine-tuned for specific data types (e.g. FP32, FP16), matrix sizes, and GPU architectures. At runtime, cuBLAS uses internal heuristics (which are not publicly documented) to select the most performant kernel and its launch parameters for the target hardware. As a result, cuBLAS is the foundation for much of the high-performance numerical computing on NVIDIA GPUs and is used extensively by deep learning frameworks like PyTorch to accelerate their core operations.
The single most common source of error when using cuBLAS is the matrix data layout. For historical reasons, and to maintain compatibility with the original BLAS standard (which was written in Fortran), cuBLAS expects matrices to be in column-major order. This is the opposite of the row-major order commonly used in C, C++, and Python.
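To make the difference concrete, here is the same 2 x 3 matrix laid out in memory under each convention (a trivial sketch):
// the 2 x 3 matrix
//   | 1 2 3 |
//   | 4 5 6 |
float row_major[6] = {1, 2, 3, 4, 5, 6}; // rows are contiguous (C/C++)
float col_major[6] = {1, 4, 2, 5, 3, 6}; // columns are contiguous (Fortran, cuBLAS)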
Furthermore, a BLAS function needs to know not just the size of the operation (e.g., M, N, K), but also how to find the start of each column in memory. This is specified by the leading dimension (e.g. lda), which is the stride between consecutive columns. When working with an entire allocated matrix, the leading dimension is simply the number of rows. However, when working with a submatrix, the leading dimension is the number of rows of the larger, parent matrix from which the submatrix is taken.
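The rule behind the leading dimension is simple: in column-major storage, element (i, j) of a matrix with leading dimension lda lives at offset j * lda + i. The following standalone sketch (plain C, with made-up sizes) shows how the same rule covers both a full matrix and a submatrix of it:
#include <stdio.h>

int main(void) {
    // a 4 x 3 column-major matrix; its leading dimension is lda = 4
    float A[12] = {1, 2,  3,  4,   // column 0
                   5, 6,  7,  8,   // column 1
                   9, 10, 11, 12}; // column 2
    int lda = 4;
    // element (i, j) lives at A[j * lda + i]
    printf("%g\n", A[2 * lda + 1]); // row 1, column 2 -> prints 10
    // a 2 x 2 submatrix whose top-left element is at row 1, column 1 of A
    // starts at &A[1 * lda + 1]; its leading dimension is still 4
    // (the parent's row count), not 2
    float *sub = &A[1 * lda + 1];
    printf("%g\n", sub[1 * lda + 0]); // row 0, column 1 of sub -> prints 10
    return 0;
}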
Fortunately, for computationally intensive kernels like GEMM, it is not necessary to reorder matrices from row-major to column-major. Instead, we can use the mathematical identity that if C = A @ B, then C^T = B^T @ A^T. The key insight is that a matrix stored in row-major order has the exact same memory layout as its transpose stored in column-major order. Therefore, if we provide our row-major matrices A and B to cuBLAS but swap their order in the function call (along with their dimensions), cuBLAS will compute C^T and output it in column-major order. This resulting block of memory, when interpreted in row-major, is exactly the matrix C that we want. This technique is demonstrated in the following function:
#include <cublas_v2.h>

// performs single-precision C = alpha * A @ B + beta * C
// on row-major matrices using cublasSgemm
void sgemm_row_major(cublasHandle_t handle, int M, int N, int K,
                     const float *alpha,
                     const float *A, const float *B,
                     const float *beta,
                     float *C) {
    // A is M x K (row-major), cuBLAS sees it as A^T (K x M, column-major);
    // the leading dimension of A^T is K
    // B is K x N (row-major), cuBLAS sees it as B^T (N x K, column-major);
    // the leading dimension of B^T is N
    // C is M x N (row-major), cuBLAS sees it as C^T (N x M, column-major);
    // the leading dimension of C^T is N
    // note the swapped A and B, and the swapped M and N
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                N, M, K,
                alpha,
                B, N, // leading dimension of B^T
                A, K, // leading dimension of A^T
                beta,
                C, N); // leading dimension of C^T
}
A complete, runnable version of this example is available on Godbolt.
The CUBLAS_OP_N flag instructs the kernel to use the matrices as provided (without an additional transpose operation from its perspective).
To use the cuBLAS library, it must be linked (e.g. using the flag -lcublas when compiling with nvcc). Its functions are exposed via the cublas_v2.h header.
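As a rough end-to-end sketch (assuming the sgemm_row_major function above is in the same file; the sizes and values are made up, and error checking is omitted for brevity), a host program using it could look like this:
#include <cstdio>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// compile a hypothetical example.cu with: nvcc example.cu -lcublas
int main() {
    const int M = 2, N = 2, K = 2;
    float hA[M * K] = {1, 2, 3, 4}; // 2 x 2, row-major
    float hB[K * N] = {5, 6, 7, 8}; // 2 x 2, row-major
    float hC[M * N] = {0, 0, 0, 0};

    float *dA, *dB, *dC;
    cudaMalloc(&dA, sizeof(hA));
    cudaMalloc(&dB, sizeof(hB));
    cudaMalloc(&dC, sizeof(hC));
    cudaMemcpy(dA, hA, sizeof(hA), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, sizeof(hB), cudaMemcpyHostToDevice);
    cudaMemcpy(dC, hC, sizeof(hC), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    sgemm_row_major(handle, M, N, K, &alpha, dA, dB, &beta, dC);

    cudaMemcpy(hC, dC, sizeof(hC), cudaMemcpyDeviceToHost);
    printf("%.0f %.0f\n%.0f %.0f\n", hC[0], hC[1], hC[2], hC[3]);
    // expected output:
    // 19 22
    // 43 50

    cublasDestroy(handle);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return 0;
}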
For more information on cuBLAS, see the official cuBLAS documentation.