Top 5 serverless GPU providers
Serverless GPUs are a type of cloud computing service that lets you run GPU-accelerated workloads that automatically scale up and down from zero based on demand. You pay only for the compute time you use, and the management of the underlying hardware and software is offloaded to the provider. Serverless GPUs have grown in popularity in the last few years with the advent of generative AI: the expensive, complex, and compute-intensive nature of AI workloads has driven developers to search for new cloud computing paradigms that reduce cost and operational effort.
In the last few years, a number of new serverless GPU providers have emerged. This article will explain what differentiates them from one another.
What are serverless GPUs good for?
Serverless GPUs are a good fit for:
- Model Serving: Deploying and running AI models for inference.
- Model Fine-tuning: Fine-tuning AI models on custom datasets.
- Video and image processing: Speeding up video and image processing tasks.
- CI/CD: Running GPU-accelerated CI/CD pipelines.
Top serverless GPU providers
Modal
Modal is an AI infrastructure platform that offers serverless GPUs behind an ergonomic Python SDK.
For example, if you want to deploy a function for inference that requires torch and uses a GPU, you can define the following:
import modal

image = modal.Image.debian_slim().pip_install("torch")
app = modal.App("gpu-example")

@app.function(gpu="A100", image=image)
def inference_function():
    import torch
    return torch.cuda.get_device_name(0)

Notice that GPU requirements for the function are defined right inline with application code. There’s no need to manage a complex configuration surface for your serverless deployments.
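To try this out, you can add a local entrypoint that calls the GPU function in the cloud. Here is a minimal sketch, assuming the code above is saved as gpu_example.py (the entrypoint name is just an example):

@app.local_entrypoint()
def main():
    # Runs locally, but inference_function executes remotely on an A100
    print(inference_function.remote())

Running modal run gpu_example.py executes the entrypoint once, while modal deploy gpu_example.py deploys the function as a persistent app.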
Once your function is deployed, Modal automatically spins up GPUs as needed (up to thousands) to serve requests. Modal provisions GPU containers in less than a second, ensuring that you don’t waste money on excessive idle GPU capacity.
Modal is the most flexible of the new serverless GPU providers: it lets you run arbitrary Python code in the cloud, attaching GPUs if you want, making Modal suitable for a wide range of AI workloads. Modal is most commonly used to run and scale inference for custom models, since it balances flexibility with ease-of-use. It is also used by companies for fine-tuning, training, and other GPU-accelerated tasks.
For detailed examples and documentation, visit the Modal docs.
RunPod
RunPod’s serverless GPU offering is called RunPod Serverless.
RunPod Serverless lets you deploy custom endpoints with your choice of GPU via a couple different modalities:
- Quick Deploy: Pre-built custom endpoints for popular AI models.
- Handler Functions: Bring your own functions to run in the cloud (see the sketch after this list).
- vLLM Endpoint: Specify and run a Hugging Face model in the cloud.
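For the Handler Functions path, you write a function that receives a job payload and returns its result. Here is a minimal sketch, assuming the runpod Python package is installed; the input field below is illustrative and depends on your own schema:

import runpod

def handler(job):
    # job["input"] holds the JSON payload sent to the endpoint
    prompt = job["input"].get("prompt", "")
    return {"echo": prompt}

# Starts the serverless worker loop inside your Docker image
runpod.serverless.start({"handler": handler})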
RunPod Serverless allows you to deploy custom endpoints with GPU support through their web console. The process involves logging into the RunPod Serverless console, creating a new endpoint, and configuring various parameters such as the endpoint name, GPU specifications, worker count, and Docker image details. Optional features like FlashBoot can be enabled for faster startup times. Once configured, you can deploy your endpoint with a single click, making it ready for use in GPU-accelerated tasks.
Once deployed, you can interact with your RunPod Serverless endpoint using the provided Endpoint URL. This allows you to send requests to your deployed model or application for inference or other GPU-accelerated tasks.
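As a rough sketch, a synchronous request to a deployed endpoint might look like the following; the endpoint ID, API key, and input schema are placeholders, so check RunPod’s docs for the exact request format:

import requests

ENDPOINT_ID = "your-endpoint-id"   # shown in the RunPod console
API_KEY = "your-runpod-api-key"

response = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"input": {"prompt": "a photo of an astronaut riding a horse"}},
)
print(response.json())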
Note that RunPod also has a non-serverless GPU offering, called RunPod Pods, which are virtual machines with GPUs.
Baseten
Baseten is a serverless inference platform.
They offer their own framework called Truss, with an associated CLI, for configuring and deploying models. To deploy a GPU-backed model on Baseten, you specify the resources you need in a config.yaml file.
Here’s an example of how to ask for A10G GPUs in order to run Stable Diffusion XL.
resources:
  accelerator: A10G
  cpu: "4"
  memory: 16Gi
  use_gpu: true

You can also configure other aspects of the deployment, such as the number of replicas (for scaling) and whether you want Baseten to auto-scale.
You can then deploy your model with a truss push command. This creates a Docker image and pushes it to Baseten, where it can be deployed and run. It also automatically creates an API that you can use to send requests to your deployed model.
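As a hedged sketch, calling that generated API from Python might look like the example below; the model ID, API key, and request body are placeholders, and the exact URL format is documented in Baseten’s docs:

import requests

MODEL_ID = "your-model-id"        # shown in the Baseten dashboard
API_KEY = "your-baseten-api-key"

response = requests.post(
    f"https://model-{MODEL_ID}.api.baseten.co/production/predict",
    headers={"Authorization": f"Api-Key {API_KEY}"},
    json={"prompt": "A watercolor painting of a lighthouse at dawn"},
)
print(response.json())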
Fal
Fal is a newer player in the serverless GPU space. It focuses on out-of-the-box deployment and serving of media generation models like Flux and SDXL, and offers ready-made endpoints for the most popular models that users can call via API.
Fal offers private serverless models as an enterprise feature. To use them, you follow a workflow similar to Modal’s (sketched after this list):
- Decorate your Python code with Fal-specific decorators.
- Specify the GPU you want to use (e.g., “GPU-A100”) as a parameter to the decorator.
- Deploy your code using the Fal CLI.
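A minimal sketch of that workflow is below. The decorator name and parameters follow the pattern described above but are illustrative assumptions, so check Fal’s docs for the current API:

import fal

# Illustrative decorator and parameters; the exact Fal API may differ
@fal.function(machine_type="GPU-A100", requirements=["torch"])
def generate():
    import torch
    return torch.cuda.get_device_name(0)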
Replicate
Replicate offers serverless GPU-powered inference for a wide range of pre-trained models, as well as the ability to deploy custom models on GPUs behind a serverless endpoint.
Pre-trained Models
For most users, the main benefit of Replicate is the extensive library of pre-trained models that are ready to use. The details of which GPU resources are needed for each model are generally abstracted away from the user, who can simply specify the model name.
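For example, using the replicate Python client (with the REPLICATE_API_TOKEN environment variable set), running a public model is roughly a one-liner; the model slug below is just one example:

import replicate

# Any public model on Replicate can be referenced by its owner/name slug
output = replicate.run(
    "black-forest-labs/flux-schnell",
    input={"prompt": "an astronaut riding a horse, studio lighting"},
)
print(output)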
Custom Models
Replicate also allows users to deploy custom models. In the context of Replicate, a “model” refers to a trained, packaged, and published software program that accepts inputs and returns outputs.
To create and deploy a custom model on Replicate, you first create a model in the Replicate web UI and then train it using the Replicate training API. You can then create a deployment for the model, which provides a private, fixed API endpoint that you can configure to run on specific GPU hardware.
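As a hedged sketch, kicking off a training job from the Python client might look like the following; the base model version, training input, and destination are placeholders, and the exact client signature may vary by version:

import replicate

training = replicate.trainings.create(
    version="owner/base-model:version-id",                  # trainable base model
    input={"train_data": "https://example.com/data.zip"},   # illustrative input
    destination="your-username/your-custom-model",          # model created in the web UI
)
print(training.status)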