GPU acceleration

Contemporary machine learning models are large linear algebra machines, and running them with reasonable latency and throughput requires specialized hardware for executing large linear algebra tasks. The weapon of choice here is the venerable Graphics Processing Unit, or GPU.

Modal is designed from the ground up to make running your ML-powered functions on GPUs as easy, cost-effective, and performant as possible. And Modal GPUs are great for graphics too!

This guide will walk you through all the options available for running your GPU-accelerated code on Modal and suggest techniques for choosing the right hardware for your problem. If you’re looking for information on how to install the CUDA stack, check out this guide.

If you have code or use libraries that benefit from GPUs, you can attach the first available GPU to your function by passing the gpu="any" argument to the @app.function decorator:

import modal

app = modal.App()

@app.function(gpu="any")
def render_toy_story():
    # code here will be executed on a machine with an available GPU
    ...

Specifying GPU type

When gpu="any" is specified, your function runs in a container with access to a GPU. Currently this GPU will be an NVIDIA T4, L4, or A10G, and pricing is based on which one you land on.

If you need more control, you can pick a specific GPU type by changing this argument:

@app.function(gpu="A10G")
def run_sdxl_turbo():
    ...

@app.function(gpu="A100")
def run_sdxl_batch():
    ...

@app.function(gpu="H100")
def finetune_sdxl():
    ...

You can also specify a list of GPU types if your function is compatible with multiple options.

@app.function(gpu=["H100", "A100-80GB"])
def finetune_sdxl():
    ...

For information on all valid values for the gpu parameter, see the reference docs.

For running, rather than training, neural networks, we recommend starting off with the A10Gs, which offer an excellent trade-off between cost and performance, plus 24 GB of GPU RAM for storing model weights. For historical reasons, Modal does not distinguish between A10G GPUs and A10s.
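
As a rough rule of thumb, you can check whether a model’s weights will fit by multiplying the parameter count by the bytes per parameter. A minimal sketch (the helper below is illustrative, not part of Modal’s API):

def weights_gib(params_billions: float, bytes_per_param: int = 2) -> float:
    """Approximate GPU RAM needed for the model weights alone, in GiB."""
    return params_billions * 1e9 * bytes_per_param / 2**30

# A 7B model at half precision (2 bytes per parameter) needs ~13 GiB,
# which fits comfortably in an A10G's 24 GB; a 34B model (~63 GiB) does not.
assert weights_gib(7) < 24
assert weights_gib(34) > 24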

For more on how to pick a GPU for use with neural networks like LLaMA or Stable Diffusion, and for tips on how to make that GPU go brrr, check out Tim Dettmers’ blog post or the Full Stack Deep Learning page on Cloud GPUs.

Specifying GPU count

The largest machine learning models are too large to fit in the memory of even the most capacious single GPU. Rather than off-loading from GPU memory to CPU memory or disk, which leads to punishing drops in latency and throughput, the usual tactic is to parallelize the model across several GPUs on the same machine, or even to distribute it across several machines, each with several GPUs.

You can run your function on a Modal machine with more than one GPU by changing the count argument in the object form of the gpu parameter:

@app.function(gpu=modal.gpu.H100(count=8))
def train_sdxl():
    ...

We also support an equivalent string-based shorthand for specifying the count:


@app.function(gpu="H100:8")
def train_sdxl():
    ...

Currently H100, A100, L4, and T4 instances support up to 8 GPUs (up to 640 GB GPU RAM), and A10G instances support up to 4 GPUs (up to 96 GB GPU RAM). Note that requesting more than 2 GPUs per container will usually result in larger wait times. These GPUs are always attached to the same physical machine.
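
To sanity-check that all the GPUs you requested are attached, you can list them from inside the function with nvidia-smi, which is available in Modal’s GPU containers. The function name here is illustrative:

@app.function(gpu="H100:8")
def check_gpus():
    import subprocess

    # Prints one line per GPU visible to the container; expect eight here.
    subprocess.run(["nvidia-smi", "-L"], check=True)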

H100 GPUs

Modal’s fastest GPUs are the H100s, NVIDIA’s flagship data center chip based on the Hopper architecture.

To request an H100, set the gpu argument to "H100":

@app.function(gpu="H100")
def run_mixtral():
    ...

Check out this example to see how you can run 7B parameter language models at thousands of tokens per second using an H100 on Modal.

Before you jump for the most powerful (and so most expensive) GPU, make sure you understand where the bottlenecks are in your computations. For example, running language models with small batch sizes (e.g. one prompt at a time) is bottlenecked on memory bandwidth, not arithmetic. Since arithmetic throughput has risen faster than memory throughput in recent hardware generations, speedups for memory-bound GPU jobs are not as extreme and may not be worth the extra cost.
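
To make that concrete, here is a back-of-the-envelope roofline estimate for single-stream decoding. The bandwidth figure is the published H100 SXM spec; effective numbers in practice will be lower:

# At batch size 1, generating each token requires streaming every weight
# from GPU RAM, so memory bandwidth caps throughput regardless of FLOP/s.
weight_bytes = 7e9 * 2     # 7B parameters at 2 bytes each (fp16/bf16)
hbm_bandwidth = 3.35e12    # ~3.35 TB/s on an H100 SXM (published spec)
print(f"{hbm_bandwidth / weight_bytes:.0f} tokens/s upper bound per stream")  # ≈ 239

Batching many prompts together amortizes those weight reads across tokens and shifts the workload back toward being compute-bound, which is where the H100’s arithmetic throughput pays off.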

A100 GPUs

A100s are the previous generation of top-of-the-line data center chip from NVIDIA, based on the Ampere architecture. Modal offers two versions of the A100: one with 40 GB of RAM and another with 80 GB of RAM.

To request an A100 with 40 GB of GPU memory, replace the gpu="any" argument with gpu="A100":

@app.function(gpu="A100")
def llama_7b():
    ...

At half precision, a 34B parameter language model like LLaMA 34B will require more than 40 GB of RAM (16 bits = 2 bytes per parameter, and 34 × 2 = 68 > 40). To request an 80 GB A100 that can run those models, use the string A100-80GB or the object form of the gpu argument:

@app.function(gpu=modal.gpu.A100(size="80GB"))
def llama_34b():
    ...
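
Equivalently, with the string form:

@app.function(gpu="A100-80GB")
def llama_34b():
    ...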

To run the largest useful open source models, or to finetune models of size 7B or larger, you may need multiple GPUs to get enough GPU RAM (off-loading weights to CPU RAM or disk generally leads to unacceptable latency penalties). Finetuning is particularly RAM-intensive because optimizing a neural network requires storing a lot of things in memory: not only the input data and weights, but also intermediate calculations, gradients, and optimizer state.
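
As a sketch of why finetuning is so RAM-hungry, consider full finetuning of a 7B model with the Adam optimizer in mixed precision. The byte counts below are the standard accounting and exclude activations, which add more on top:

params = 7e9
bytes_per_param = (
    2      # bf16 weights
    + 2    # bf16 gradients
    + 4    # fp32 master copy of the weights
    + 8    # Adam first and second moments, fp32
)
print(f"{params * bytes_per_param / 2**30:.0f} GiB before activations")  # ≈ 104

That already exceeds a single 80 GB A100 before activations are counted, which is why the finetuning examples below request multiple GPUs.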

To use more than one GPU, set the count argument to an integer value between 2 and 8:

@app.function(gpu=modal.gpu.A100(size="80GB", count=4))
def finetune_llama_70b():
    ...


@app.function(gpu=modal.gpu.H100(count=8))
def run_llama_405b_fp8():
    ...

Multi-GPU training

Modal currently supports multi-GPU training on a single machine, but not multi-node training (yet). Depending on which framework you are using, you may need to use different techniques to train on multiple GPUs.

If the framework re-executes the entrypoint of the Python process (as PyTorch Lightning does), you need to set the strategy to either ddp_spawn or ddp_notebook if you wish to invoke the training directly (a sketch of this follows the subprocess example below). Another option is to run the training script as a subprocess instead:

@app.function(gpu=modal.gpu.A100(count=2))
def run():
    import subprocess
    import sys

    # Run the training script in its own process so that a framework which
    # re-executes its entrypoint cannot re-run this Modal function.
    subprocess.run(
        ["python", "train.py"],
        stdout=sys.stdout, stderr=sys.stderr,
        check=True,
    )
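
For the strategy-based option, here is a minimal sketch with PyTorch Lightning. It assumes pytorch_lightning is installed in your image, and LitModel is a placeholder for your own LightningModule:

@app.function(gpu=modal.gpu.A100(count=2))
def train_lightning():
    import pytorch_lightning as pl

    model = LitModel()  # placeholder: your own LightningModule
    # ddp_spawn launches worker processes itself instead of re-executing
    # the entrypoint, so it is safe to call directly in a Modal function.
    trainer = pl.Trainer(accelerator="gpu", devices=2, strategy="ddp_spawn")
    trainer.fit(model)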

GPU fallbacks

Modal allows specifying a list of possible GPU types, suitable for functions that are compatible with multiple options. Modal respects the ordering of this list and will try to allocate the most preferred GPU type before falling back to less preferred ones.

@app.function(gpu=["h100", "a100-40gb:2"])
def run_on_80gb():
    ...

See this example for more detail.

Examples

Take a look at some of our examples that use GPUs: