GPU acceleration
Contemporary machine learning models are large linear algebra machines, and running them with reasonable latency and throughput requires specialized hardware for executing large linear algebra tasks. The weapon of choice here is the venerable Graphics Processing Unit, or GPU.
Modal is designed from the ground up to make running your ML-powered functions on GPUs as easy, cost-effective, and performant as possible. And Modal GPUs are great for graphics too!
This guide will walk you through all the options available for running your GPU-accelerated code on Modal and suggest techniques for choosing the right hardware for your problem. If you’re looking for information on how to install the CUDA stack, check out this guide.
If you have code or use libraries that benefit from GPUs, you can attach the first available GPU to your function by passing the gpu="any" argument to the @app.function decorator:
import modal

app = modal.App()

@app.function(gpu="any")
def render_toy_story():
    # code here will be executed on a machine with an available GPU
    ...
Specifying GPU type
When gpu="any" is specified, your function runs in a container with access to a GPU. Currently this GPU will be either an NVIDIA Tesla T4 or an A10G, and pricing is based on which one you land on.
If you need more control, you can pick a specific GPU type by changing this argument:
@app.function(gpu="A10G")
def run_sdxl_turbo():
    ...

@app.function(gpu="A100")
def run_sdxl_batch():
    ...

@app.function(gpu="H100")
def finetune_sdxl():
    ...
For information on all valid values for the gpu parameter see the reference docs.
For running, rather than training, neural networks, we recommend starting off with the A10Gs, which offer an excellent trade-off of cost and performance and 24 GB of GPU RAM for storing model weights. For historical reasons, Modal does not distinguish between A10G GPUs and A10s.
For more on how to pick a GPU for use with neural networks like LLaMA or Stable Diffusion, and for tips on how to make that GPU go brrr, check out Tim Dettmers’ blog post or the Full Stack Deep Learning page on Cloud GPUs.
Specifying GPU count
The largest machine learning models are too large to fit in the memory of just one of even the most capacious GPUs. Rather than off-loading from GPU memory to CPU memory or disk, which leads to punishing latency and throughput penalties, the usual tactic is to parallelize the model across several GPUs on the same machine, or even to distribute it across several machines, each with several GPUs.
You can run your function on a Modal machine with more than one GPU by changing the count argument in the object form of the gpu parameter:
@app.function(gpu=modal.gpu.H100(count=8))
def train_sdxl():
    ...

We also support an equivalent string-based shorthand for specifying the count:

@app.function(gpu="H100:8")
def train_sdxl():
    ...
Currently H100, A100, and T4 instances support up to 8 GPUs (up to 640 GB GPU RAM), and A10G instances support up to 4 GPUs (up to 96 GB GPU RAM). Note that requesting more than 2 GPUs per container will usually result in longer wait times. These GPUs are always attached to the same physical machine.
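As a minimal sketch of the string shorthand, the helper below builds a "TYPE:COUNT" value and checks it against the per-container limits quoted above. The name gpu_spec and the limits table are hypothetical and are not part of Modal's API. The limits themselves are copied from this guide and may change.

```python
# Per-container GPU limits as stated in this guide; they may change over time.
MAX_GPUS_PER_CONTAINER = {"H100": 8, "A100": 8, "T4": 8, "A10G": 4}


def gpu_spec(gpu_type: str, count: int = 1) -> str:
    """Build a value for the gpu parameter (hypothetical helper, not a Modal API)."""
    limit = MAX_GPUS_PER_CONTAINER.get(gpu_type)
    if limit is None:
        raise ValueError(f"unknown GPU type: {gpu_type!r}")
    if not 1 <= count <= limit:
        raise ValueError(f"{gpu_type} supports 1 to {limit} GPUs, got {count}")
    # A single GPU needs no count suffix; otherwise use the "TYPE:COUNT" shorthand.
    return gpu_type if count == 1 else f"{gpu_type}:{count}"
```

The result could then be passed as @app.function(gpu=gpu_spec("H100", 8)), so an out-of-range request fails when the module is loaded instead of at scheduling time.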
H100 GPUs
Modal’s fastest GPUs are the H100s, which are NVIDIA’s flagship data center chip. Modal offers H100s in the SXM form factor, with Tensor Cores capable of nearly two petaFLOPS in 16-bit precision connected via >3 TB/s bandwidth connections to 80 GB of on-device RAM. Powerful!
To request an H100, set the gpu argument to "H100":

@app.function(gpu="H100")
def run_mixtral():
    ...
Check out this example to see how you can run 7B parameter language models at thousands of tokens per second using an H100 on Modal.
Before you jump for the most powerful (and so most expensive) GPU, make sure you understand where the bottlenecks are in your computations. For example, running language models with small batch sizes (e.g. one prompt at a time) results in a bottleneck on memory, not arithmetic. Since arithmetic throughput has risen faster than memory throughput in recent hardware generations, speedups for memory-bound GPU jobs are not as extreme and may not be worth the extra cost.
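The bottleneck argument can be made concrete with a back-of-the-envelope roofline calculation. The sketch below uses the rounded H100 figures from this guide (about 2 petaFLOPS and about 3 TB/s). It assumes each 16-bit weight is read once per decoding step and used for one multiply-add per sequence in the batch. It ignores activations, the KV cache, and kernel overheads, so treat the outputs as rough bounds, not measurements.

```python
# Rounded H100 figures from this guide; order-of-magnitude only.
PEAK_FLOPS = 2e15       # ~2 petaFLOPS at 16-bit precision
MEM_BANDWIDTH = 3e12    # ~3 TB/s to on-device RAM


def ridge_point(peak_flops: float, mem_bandwidth: float) -> float:
    """FLOPs per byte above which a job is limited by arithmetic, not memory."""
    return peak_flops / mem_bandwidth


def decode_intensity(batch_size: int, bytes_per_param: int = 2) -> float:
    """Approximate FLOPs per byte of weights read in one decoding step.

    Assumes one multiply-add (2 FLOPs) per parameter per sequence in the batch.
    """
    return 2 * batch_size / bytes_per_param


def is_memory_bound(batch_size: int) -> bool:
    """True if the decoding step sits below the ridge point."""
    return decode_intensity(batch_size) < ridge_point(PEAK_FLOPS, MEM_BANDWIDTH)


def max_decode_steps_per_second(
    n_params: float, mem_bandwidth: float = MEM_BANDWIDTH, bytes_per_param: int = 2
) -> float:
    """Upper bound on decoding steps per second if each step streams all weights once."""
    return mem_bandwidth / (n_params * bytes_per_param)
```

With one prompt at a time the intensity is about 1 FLOP per byte, far below a ridge point of several hundred, so the job is memory-bound. Under these assumptions a 7B parameter model tops out at roughly 200 decoding steps per second. Batching many prompts reuses each weight read, which is how aggregate throughput can climb much higher.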
Additionally, it takes time for library providers to adapt to the latest hardware, so be on the lookout for sharp edges, like missing CUDA kernels, when using H100s.
A100 GPUs
A100s are the previous generation of top-of-the-line data center chip from NVIDIA. Modal offers two versions of the A100: one with 40 GB of RAM and another with 80 GB of RAM.
To request an A100 with 40 GB of GPU memory, replace the gpu="any" argument with gpu="A100":

@app.function(gpu="A100")
def llama_7b():
    ...
At half precision, a 34B parameter language model like LLaMA 34B will require more than 40 GB of RAM for its weights alone (16 bits = 2 bytes per parameter, and 34 billion × 2 bytes = 68 GB > 40 GB). To request an 80 GB A100 that can run those models, use the string "a100-80gb" or the object form of the gpu argument:

@app.function(gpu=modal.gpu.A100(size="80GB"))
def llama_34b():
    ...
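The sizing arithmetic above can be sketched as a small helper. The function names are hypothetical, 1 GB is taken as 10^9 bytes, and only the weights are counted. Activations and any inference-time caches need extra headroom on top of this.

```python
def weight_memory_gb(params_billions: float, bits_per_param: int = 16) -> float:
    """GB needed to hold the model weights alone (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_param / 8


def fits_in_gpu(
    params_billions: float, gpu_memory_gb: float, bits_per_param: int = 16
) -> bool:
    """True if the weights alone fit; real workloads need headroom beyond this."""
    return weight_memory_gb(params_billions, bits_per_param) < gpu_memory_gb
```

For example, fits_in_gpu(34, 40) is False while fits_in_gpu(34, 80) is True, matching the choice of the 80 GB A100 for a 34B parameter model at half precision.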
To run the largest useful open source models, or when finetuning models that are of size 7B or higher, you may need multiple GPUs to have enough GPU RAM (off-loading weights to CPU RAM or disk generally leads to unacceptable latency penalties). Finetuning models can be particularly RAM intensive because optimizing neural networks requires storing a lot of things in memory: not only input data and weights, but also intermediate calculations, gradients, and optimizer parameters.
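To see why finetuning is so much hungrier than inference, here is a rough estimate of the per-parameter training state. It assumes full finetuning with mixed precision and the Adam optimizer, using a commonly cited breakdown of 16 bytes per parameter. That breakdown is an assumption and does not come from this guide. The estimate excludes input data and intermediate activations, and the GPU count assumes the state is sharded evenly across devices, which plain data parallelism does not do.

```python
import math

# Assumed bytes per parameter for mixed-precision training with Adam.
# This breakdown is an illustrative assumption, not a figure from this guide.
BYTES_PER_PARAM = {
    "fp16_weights": 2,
    "fp16_gradients": 2,
    "fp32_master_weights": 4,
    "fp32_adam_momentum": 4,
    "fp32_adam_variance": 4,
}


def training_state_gb(params_billions: float) -> float:
    """GB for weights, gradients, and optimizer state (activations excluded)."""
    return params_billions * sum(BYTES_PER_PARAM.values())


def min_gpus(params_billions: float, gpu_memory_gb: float) -> int:
    """Smallest GPU count whose combined memory holds the training state.

    Assumes the state is sharded evenly across GPUs.
    """
    return math.ceil(training_state_gb(params_billions) / gpu_memory_gb)
```

Under these assumptions a 7B parameter model carries about 112 GB of training state, so it already needs at least two 80 GB GPUs before counting activations. Parameter-efficient methods and quantization can reduce this substantially.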
To use more than one GPU, set the count argument to an integer value between 2 and 8.

@app.function(gpu=modal.gpu.A100(size="80GB", count=4))
def finetune_llama_70b():
    ...

@app.function(gpu=modal.gpu.H100(count=8))
def run_llama_405b():
    ...
Multi-GPU training
Modal currently supports multi-GPU training on a single machine, but not multi-node training (yet). Depending on which framework you are using, you may need to use different techniques to train on multiple GPUs.
If the framework re-executes the entrypoint of the Python process (like PyTorch Lightning), you need to set the strategy to either ddp_spawn or ddp_notebook if you wish to invoke the training directly. Another option is to run the training script as a subprocess instead.
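@app.function(gpu=modal.gpu.A100(count=2))
def run():
    import subprocess
    import sys

    # Run the training script in a child process so the framework can
    # re-execute its own entrypoint without re-running this function.
    subprocess.run(
        ["python", "train.py"],
        stdout=sys.stdout, stderr=sys.stderr,
        check=True,
    )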
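The check=True flag in the subprocess call matters: without it, a crashed training script would let the calling function return as if nothing had gone wrong. A minimal sketch of that behavior follows. The helper name run_script is hypothetical, sys.executable is used here in place of "python" so the child runs under the same interpreter, and the child inherits the parent's output streams by default.

```python
import subprocess
import sys


def run_script(argv: list[str]) -> int:
    """Run a command as a child process and return its exit code.

    With check=True, a non-zero exit raises subprocess.CalledProcessError,
    so a failed training run also fails the caller.
    """
    completed = subprocess.run(argv, check=True)
    return completed.returncode


if __name__ == "__main__":
    # A stand-in for ["python", "train.py"]; this child exits successfully.
    run_script([sys.executable, "-c", "print('training finished')"])
```

If the child exits with a non-zero status, the exception propagates out of run_script, and the surrounding function call is reported as failed.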
Examples
Take a look at some of our examples that use GPUs: