Using CUDA on Modal

Modal makes it easy to accelerate your workloads with datacenter-grade NVIDIA GPUs.

To take advantage of the hardware, you need to use matching software: the CUDA stack. This guide explains the components of that stack and how to install them on Modal. For more on which GPUs are available on Modal and how to choose a GPU for your use case, see this guide.

Here’s the tl;dr: the NVIDIA drivers and the user-mode driver API are already installed on every Modal machine with GPU access, so for most workloads you can simply pip_install GPU-accelerated libraries like torch. If you need the rest of the CUDA Toolkit installed as system packages, build your Modal Image from an official nvidia/cuda base image.

What is CUDA?

When someone refers to “installing CUDA” or “using CUDA”, they are referring not to a library, but to a stack with multiple layers. Your application code (and its dependencies) can interact with the stack at different levels.

The CUDA stack

This layered structure leads to a lot of confusion. To help clear it up, the following sections explain each component in detail.

Level 0: Kernel-mode driver components

At the lowest level are the kernel-mode driver components. The Linux kernel is essentially a single program operating the entire machine and all of its hardware. To add hardware to the machine, this program is extended by loading new modules into it. These components communicate directly with hardware — in this case the GPU.

Because they are kernel modules, these driver components are tightly integrated with the host operating system that runs your containerized Modal Functions and are not something you can inspect or change yourself.

Level 1: User-mode driver API

All action in Linux that doesn’t occur in the kernel occurs in user space. To talk to the kernel drivers from our user space programs, we need user-mode driver components.

Most prominently, that includes:

- the CUDA Driver API, libcuda.so, the low-level user-space interface to the GPU
- the NVIDIA management library, libnvidia-ml.so, and its command-line interface, nvidia-smi

These components are installed on all Modal machines with access to GPUs. Because they are user-level components, you can use them directly:

import modal

app = modal.App()

@app.function(gpu="any")
def check_nvidia_smi():
    import subprocess
    # nvidia-smi reports the host's driver version and the maximum CUDA version it supports
    output = subprocess.check_output(["nvidia-smi"], text=True)
    assert "Driver Version: 550.90.07" in output
    assert "CUDA Version: 12.4" in output
    return output
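
You can trigger this Function from your local machine with a local entrypoint and modal run. The snippet below is a minimal sketch; the entrypoint name main is arbitrary:

@app.local_entrypoint()
def main():
    # executes check_nvidia_smi remotely on a GPU-attached Modal worker
    print(check_nvidia_smi.remote())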

Level 2: CUDA Toolkit

Wrapping the CUDA Driver API is the CUDA Runtime API, the libcudart.so shared library. This API includes functions like cudaLaunchKernel and is more commonly used in CUDA programs (see this HackerNews comment for color commentary on why). This shared library is not installed by default on Modal.

The CUDA Runtime API is generally installed as part of the larger NVIDIA CUDA Toolkit, which includes the NVIDIA CUDA compiler driver (nvcc) and its toolchain, as well as a number of useful goodies for writing and debugging CUDA programs (cuobjdump, cuda-gdb, profilers, etc.).

Contemporary GPU-accelerated machine learning workloads like LLM inference frequently make use of many components of the CUDA Toolkit, such as the run-time compilation library nvrtc.

So why aren’t these components installed along with the drivers? A compiled CUDA program can run without the CUDA Runtime API installed on the system by statically linking the Runtime API into its binary, though this is fairly uncommon for CUDA-accelerated Python programs. Additionally, some applications need older versions of these components, and some deployments even use several versions at once. Both patterns are compatible with the host machine drivers provided on Modal.
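
For example, a GPU-attached Modal Function running on the default image can see the user-mode driver components but not the Toolkit. Here is a minimal sketch that demonstrates this; the function name check_toolkit_absent is just for illustration:

@app.function(gpu="any")
def check_toolkit_absent():
    import shutil

    # the user-mode driver components are provided by Modal's host machines...
    assert shutil.which("nvidia-smi") is not None
    # ...but the CUDA Toolkit, including the nvcc compiler driver, is not installed by default
    assert shutil.which("nvcc") is None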

Install GPU-accelerated torch and transformers with pip_install

The components of the CUDA Toolkit can be installed via pip, using PyPI packages like nvidia-cuda-runtime-cu12 and nvidia-cuda-nvrtc-cu12. These components are listed as dependencies of some popular GPU-accelerated Python libraries, like torch.
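
If you only need specific Toolkit components, you can also install those packages directly. This is a sketch; the exact set of packages your application needs will vary:

image = modal.Image.debian_slim().pip_install(
    "nvidia-cuda-runtime-cu12",  # CUDA Runtime API (libcudart)
    "nvidia-cuda-nvrtc-cu12",  # run-time compilation library (nvrtc)
)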

Because Modal already includes the lower parts of the CUDA stack, you can install these libraries with the pip_install method of modal.Image, just like any other Python library:

image = modal.Image.debian_slim().pip_install("torch")


@app.function(gpu="any", image=image)
def run_torch():
    import torch
    has_cuda = torch.cuda.is_available()
    print(f"It is {has_cuda} that torch can access CUDA")
    return has_cuda

Many libraries for running open-weights models, like transformers and vllm, use torch under the hood and so can be installed in the same way:

image = modal.Image.debian_slim().pip_install("transformers[torch]")
image = image.apt_install("ffmpeg")  # for audio processing


@app.function(gpu="any", image=image)
def run_transformers():
    from transformers import pipeline
    transcriber = pipeline(model="openai/whisper-tiny.en", device="cuda")
    result = transcriber("https://modal-public-assets.s3.amazonaws.com/mlk.flac")
    print(result["text"])  # I have a dream that one day this nation will rise up live out the true meaning of its creed

For more complex setups, use an officially-supported CUDA image

The disadvantage of installing the CUDA stack via pip is that libraries which expect its components to be installed as normal system packages cannot find them.

For these cases, we recommend you use an image that already has the full CUDA stack installed as system packages and all environment variables set correctly, like the nvidia/cuda:*-devel-* images on Docker Hub.

One library that requires this more involved installation process is flash-attn, which was, for a time, by far the fastest implementation of Transformer multi-head attention:

cuda_version = "12.4.0"  # should be no greater than host CUDA version
flavor = "devel"  #  includes full CUDA toolkit
operating_sys = "ubuntu22.04"
tag = f"{cuda_version}-{flavor}-{operating_sys}"

image = (
    modal.Image.from_registry(f"nvidia/cuda:{tag}", add_python="3.11")
    .apt_install("git")
    .pip_install(  # required to build flash-attn
        "ninja",
        "packaging",
        "wheel",
        "torch",
    )
    .run_commands(  # add flash-attn
        "pip install flash-attn==2.5.8 --no-build-isolation"
    )
)


@app.function(gpu="a10g", image=image)
def run_flash_attn():
    import torch
    from flash_attn import flash_attn_func

    batch_size, seqlen, nheads, headdim, nheads_k = 2, 4, 3, 16, 3

    q = torch.randn(batch_size, seqlen, nheads, headdim, dtype=torch.float16).to("cuda")
    k = torch.randn(batch_size, seqlen, nheads_k, headdim, dtype=torch.float16).to("cuda")
    v = torch.randn(batch_size, seqlen, nheads_k, headdim, dtype=torch.float16).to("cuda")

    out = flash_attn_func(q, k, v)
    assert out.shape == (batch_size, seqlen, nheads, headdim)

Make sure to choose a version of CUDA that is no greater than 12.4, the version provided by the host machine. Older minor (12.*) versions are guaranteed to be compatible with the host machine’s driver, but older major (11.*, 10.*, etc.) versions may not be.
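
If you are unsure whether the Toolkit version in your image is compatible, you can compare it against the CUDA version reported by the host driver from inside a container. This sketch assumes the nvidia/cuda-based image defined above, which provides nvcc:

@app.function(gpu="any", image=image)
def check_cuda_versions():
    import subprocess

    # Toolkit version installed in the container image
    print(subprocess.check_output(["nvcc", "--version"], text=True))
    # driver version and maximum supported CUDA version reported by the host
    print(subprocess.check_output(["nvidia-smi"], text=True))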

What next?

For more on accessing and choosing GPUs on Modal, check out this guide.

To see these installation patterns in action, check out these examples: