GPU acceleration

Contemporary machine learning models are large linear algebra machines, and running them with reasonable latency and throughput requires specialized hardware for executing large linear algebra tasks. The weapon of choice here is the venerable Graphics Processing Unit, or GPU.

Modal is designed from the ground up to make running your ML-powered functions on GPUs as easy, cost-effective, and performant as possible. And Modal GPUs are great for graphics too!

This guide will walk you through all the options available for running your GPU-accelerated code on Modal and suggest techniques for choosing the right hardware for your problem. If you’re looking for information on how to install the CUDA stack, check out this guide.

If you have code or use libraries that benefit from GPUs, you can attach the first available GPU to your function by passing the gpu="any" argument to the @app.function decorator:

import modal

app = modal.App()

@app.function(gpu="any")
def render_toy_story():
    # code here will be executed on a machine with an available GPU
    ...

Specifying GPU type

When gpu="any" is specified, your function runs in a container with access to a GPU. Currently this GPU will be either an NVIDIA Tesla T4 or A10G instance, and pricing is based on which one you land on.
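Because gpu="any" can land on either card, it can be useful to log which one a given container actually received. The following is a minimal sketch, not part of Modal's API: the helper name describe_gpu is hypothetical, and it assumes the nvidia-smi command-line tool is on the container's PATH, which is typical for NVIDIA GPU machines but is an assumption here.

```python
import shutil
import subprocess


def describe_gpu() -> str:
    """Return the name and total memory of each visible GPU, one per line."""
    # Assumption: nvidia-smi is installed. Fall back gracefully if it is not.
    if shutil.which("nvidia-smi") is None:
        return "nvidia-smi not found"
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return "nvidia-smi failed"
    return result.stdout.strip()
```

Calling print(describe_gpu()) at the start of a function decorated with gpu="any" would then record the card type in that run's logs.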

If you need more control, you can pick a specific GPU type by changing this argument:

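# The GPU types below are illustrative; see the reference docs for all valid values.
@app.function(gpu="A10G")
def run_sdxl_turbo(): ...

@app.function(gpu="A100")
def run_sdxl_batch(): ...

@app.function(gpu="H100")
def finetune_sdxl(): ...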

For information on all valid values for the gpu parameter see the reference docs.

For running, rather than training, neural networks, we recommend starting off with the A10Gs, which offer an excellent trade-off of cost and performance and 24 GB of GPU RAM for storing model weights. For historical reasons, Modal does not distinguish between A10G GPUs and A10s.

For more on how to pick a GPU for use with neural networks like LLaMA or Stable Diffusion, and for tips on how to make that GPU go brrr, check out Tim Dettmers’ blog post or the Full Stack Deep Learning page on Cloud GPUs.

Specifying GPU count

The largest machine learning models are too large to fit in the memory of even the most capacious single GPU. Rather than off-loading from GPU memory to CPU memory or disk, which severely degrades both latency and throughput, the usual tactic is to parallelize the model across several GPUs on the same machine, or even to distribute it across several machines, each with several GPUs.

You can run your function on a Modal machine with more than one GPU by changing the count argument in the object form of the gpu parameter:

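# GPU type and count here are illustrative
@app.function(gpu=modal.gpu.A100(count=2))
def train_sdxl(): ...

We also support an equivalent string-based shorthand for specifying the count:

@app.function(gpu="A100:2")
def train_sdxl(): ...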

Currently H100, A100, and T4 instances support up to 8 GPUs (up to 640 GB GPU RAM), and A10G instances support up to 4 GPUs (up to 96 GB GPU RAM). Note that requesting more than 2 GPUs per container will usually result in larger wait times. These GPUs are always attached to the same physical machine.
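The limits above can be turned into a small sanity check before you request a machine. This is a sketch under stated assumptions, not a Modal API: the dictionary keys are hypothetical labels rather than valid gpu strings, and the T4's 16 GB capacity comes from NVIDIA's published specification, not from this guide.

```python
# Memory per GPU in GB. The T4 figure (16 GB) is an outside assumption;
# the others appear elsewhere in this guide.
GPU_MEMORY_GB = {"T4": 16, "A10G": 24, "A100-40GB": 40, "A100-80GB": 80, "H100": 80}

# Maximum GPUs per container, as quoted above.
MAX_GPUS = {"T4": 8, "A10G": 4, "A100-40GB": 8, "A100-80GB": 8, "H100": 8}


def total_gpu_memory_gb(gpu_type: str, count: int) -> int:
    """Return the combined GPU RAM for a request, or raise if it exceeds the limits."""
    if gpu_type not in GPU_MEMORY_GB:
        raise ValueError(f"unknown GPU type: {gpu_type}")
    if not 1 <= count <= MAX_GPUS[gpu_type]:
        raise ValueError(f"{gpu_type} supports 1 to {MAX_GPUS[gpu_type]} GPUs, got {count}")
    return GPU_MEMORY_GB[gpu_type] * count


print(total_gpu_memory_gb("H100", 8))  # 640
print(total_gpu_memory_gb("A10G", 4))  # 96
```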

H100 GPUs

Modal’s fastest GPUs are the H100s, NVIDIA’s flagship data center chip. Modal offers H100s in the SXM form factor, with Tensor Cores capable of nearly two petaFLOPS in 16-bit precision and 80 GB of on-device RAM accessed at more than 3 TB/s of memory bandwidth. Powerful!

To request an H100, set the gpu argument to "H100":

@app.function(gpu="H100")
def run_mixtral(): ...

Check out this example to see how you can run 7B parameter language models at thousands of tokens per second using an H100 on Modal.

Before you jump for the most powerful (and so most expensive) GPU, make sure you understand where the bottlenecks are in your computations. For example, running language models with small batch sizes (e.g. one prompt at a time) results in a bottleneck on memory, not arithmetic. Since arithmetic throughput has risen faster than memory throughput in recent hardware generations, speedups for memory-bound GPU jobs are not as extreme and may not be worth the extra cost.
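A rough back-of-the-envelope check can show when a job is memory-bound. The sketch below assumes that generating each token for a single sequence requires reading every model weight from GPU memory once, and it ignores the KV cache, activations, and kernel overheads, so treat the result as a loose upper bound. The function name is hypothetical; the inputs are the figures quoted in this guide (a 7B parameter model, 16-bit weights, roughly 3 TB/s of memory bandwidth).

```python
def max_decode_tokens_per_second(
    n_params: float, bytes_per_param: float, memory_bandwidth_bytes_per_s: float
) -> float:
    """Upper bound on batch-size-1 decoding speed when weight reads dominate."""
    weight_bytes = n_params * bytes_per_param
    # One full pass over the weights per generated token.
    return memory_bandwidth_bytes_per_s / weight_bytes


# 7B parameters at 2 bytes each, on a GPU with ~3 TB/s of memory bandwidth.
bound = max_decode_tokens_per_second(7e9, 2, 3e12)
print(f"{bound:.0f} tokens/s")  # 214 tokens/s
```

Under these assumptions, a single prompt tops out at a few hundred tokens per second no matter how much arithmetic throughput the GPU has. Larger batches reuse each weight read across many sequences, which is one way aggregate throughput can climb far higher.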

Additionally, it takes time for library providers to adapt to the latest hardware, so be on the lookout for sharp edges, like missing CUDA kernels, when using H100s.

A100 GPUs

A100s are the previous generation of top-of-the-line data center chip from NVIDIA. Modal offers two versions of the A100: one with 40 GB of RAM and another with 80 GB of RAM.

To request an A100 with 40 GB of GPU memory, replace the gpu="any" argument with gpu="A100":

@app.function(gpu="A100")
def llama_7b(): ...

At half precision, a 34B parameter language model like LLaMA 34B will require more than 40 GB of RAM (16 bits = 2 bytes and 34 × 2 > 40). To request an 80 GB A100 that can run those models, use the string a100-80gb or the object form of the gpu argument:

@app.function(gpu="a100-80gb")
def llama_34b(): ...
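The arithmetic behind that choice can be written out as a short helper. This is a sketch with hypothetical function names: it counts only the model weights, using 1 billion parameters × 1 byte = 1 GB, and leaves no allowance for activations or other runtime memory.

```python
from typing import Optional

A100_SIZES_GB = (40, 80)  # the two A100 variants described above


def weight_memory_gb(n_params_billions: float, bits_per_param: int = 16) -> float:
    """Memory needed just to hold the weights, in GB."""
    return n_params_billions * bits_per_param / 8


def smallest_a100_for(n_params_billions: float, bits_per_param: int = 16) -> Optional[int]:
    """Return the smallest A100 size (GB) whose RAM exceeds the weights, or None."""
    needed = weight_memory_gb(n_params_billions, bits_per_param)
    for size in A100_SIZES_GB:
        if needed < size:  # strict: the weights alone must leave some headroom
            return size
    return None


print(weight_memory_gb(34))   # 68.0
print(smallest_a100_for(7))   # 40
print(smallest_a100_for(34))  # 80
```

A return value of None means the weights do not fit on a single A100 of either size, which is where multiple GPUs come in.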

To run the largest useful open source models, or when finetuning models that are of size 7B or higher, you may need multiple A100s to have enough GPU RAM. Finetuning models can be particularly RAM intensive because optimizing neural networks requires storing a lot of things in memory: not only input data and weights, but also intermediate calculations, gradients, and optimizer parameters.
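To get a feel for the numbers, here is a rough estimate of the memory needed for full finetuning. It rests on an assumption this guide does not state: mixed-precision training with the Adam optimizer, a common setup that is often budgeted at about 16 bytes per parameter. It excludes input data and intermediate calculations, so it is a lower bound, and it assumes the state can be split evenly across GPUs. The names are hypothetical.

```python
import math

# Assumed per-parameter cost for mixed-precision Adam (a rule of thumb, not a Modal figure).
BYTES_PER_PARAM = {
    "half-precision weights": 2,
    "half-precision gradients": 2,
    "fp32 master weights": 4,
    "Adam first moment": 4,
    "Adam second moment": 4,
}


def training_state_gb(n_params_billions: float) -> float:
    """GB needed for weights, gradients, and optimizer state (no activations)."""
    return n_params_billions * sum(BYTES_PER_PARAM.values())


def min_gpus(n_params_billions: float, gpu_memory_gb: int = 80) -> int:
    """Fewest GPUs whose combined RAM covers the training state."""
    return math.ceil(training_state_gb(n_params_billions) / gpu_memory_gb)


print(training_state_gb(7))  # 112
print(min_gpus(7))           # 2
```

Under these assumptions, even a 7B model needs more than one 80 GB A100 for full finetuning. Parameter-efficient methods and quantization can change this picture considerably.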

To use more than one GPU, set the count argument to an integer value between 2 and 8.

@app.function(gpu=modal.gpu.A100(size="80GB", count=8))
def finetune_llama_70b(): ...


Take a look at some of our examples that use GPUs: