Top 5 serverless GPU providers
Serverless GPUs are a type of cloud computing service that lets you run GPU-accelerated workloads that automatically scale up and down from zero based on demand. You pay only for the compute time you use, and the management of the underlying hardware and software is offloaded to the provider. Serverless GPUs have grown in popularity in the last few years with the advent of generative AI: the expensive, complex, and compute-intensive nature of AI workloads has driven developers to search for new cloud computing paradigms that reduce cost and operational effort.
In the last few years, a number of new serverless GPU providers have emerged. This article will explain what differentiates them from one another.
What are serverless GPUs good for?
Serverless GPUs are a good fit for:
- Model Serving: Deploying and running AI models for inference.
- Model Fine-tuning: Fine-tuning AI models on custom datasets.
- Video and image processing: Speeding up video and image processing tasks.
- CI/CD: Running GPU-accelerated CI/CD pipelines.
Top serverless GPU providers
Modal
Modal is an AI infrastructure platform that offers serverless GPUs behind an ergonomic Python SDK.
For example, if you want to deploy a function for inference that requires torch and uses a GPU, you can define the following:
import modal

image = modal.Image.debian_slim().pip_install("torch")
app = modal.App("gpu-example")

@app.function(gpu="A100", image=image)
def inference_function():
    import torch
    return torch.cuda.get_device_name(0)

Notice that GPU requirements for the function are defined right inline with application code. There’s no need to manage a complex configuration surface for your serverless deployments.
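To try this out, you can add a local entrypoint that calls the GPU function in the cloud. Here is a minimal sketch, assuming the code above is saved as gpu_example.py (the entrypoint name is just an example):

@app.local_entrypoint()
def main():
    # Runs locally, but inference_function executes remotely on an A100
    print(inference_function.remote())

Running modal run gpu_example.py executes the entrypoint once, while modal deploy gpu_example.py deploys the function as a persistent app.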
Once your function is deployed, Modal automatically spins up GPUs as needed (up to thousands) to serve requests. Modal provisions GPU containers in less than a second, ensuring that you don’t waste money on excessive idle GPU capacity.
Modal is the most flexible of the new serverless GPU providers: it lets you run arbitrary Python code in the cloud, attaching GPUs if you want, making Modal suitable for a wide range of AI workloads. Modal is most commonly used to run and scale inference for custom models, since it balances flexibility with ease-of-use. It is also used by companies for fine-tuning, training, and other GPU-accelerated tasks.
For detailed examples and documentation, visit the Modal docs.
RunPod
RunPod’s serverless GPU offering is called RunPod Serverless.
RunPod Serverless lets you deploy custom endpoints with your choice of GPU via a couple different modalities:
- Quick Deploy: Pre-built custom endpoints for popular AI models.
- Handler Functions: Bring your own functions to run in the cloud (see the sketch after this list).
- vLLM Endpoint: Specify and run a Hugging Face model in the cloud.
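For the Handler Functions path, you write a function that receives a job payload and returns its result. Here is a minimal sketch, assuming the runpod Python package is installed; the input field below is illustrative and depends on your own schema:

import runpod

def handler(job):
    # job["input"] holds the JSON payload sent to the endpoint
    prompt = job["input"].get("prompt", "")
    return {"echo": prompt}

# Starts the serverless worker loop inside your Docker image
runpod.serverless.start({"handler": handler})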
RunPod Serverless allows you to deploy custom endpoints with GPU support through their web console. The process involves logging into the RunPod Serverless console, creating a new endpoint, and configuring various parameters such as the endpoint name, GPU specifications, worker count, and Docker image details. Optional features like FlashBoot can be enabled for faster startup times. Once configured, you can deploy your endpoint with a single click, making it ready for use in GPU-accelerated tasks.
Once deployed, you can interact with your RunPod Serverless endpoint using the provided Endpoint URL. This allows you to send requests to your deployed model or application for inference or other GPU-accelerated tasks.
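As a rough sketch, a synchronous request to a deployed endpoint might look like the following; the endpoint ID, API key, and input schema are placeholders, so check RunPod’s docs for the exact request format:

import requests

ENDPOINT_ID = "your-endpoint-id"   # shown in the RunPod console
API_KEY = "your-runpod-api-key"

response = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"input": {"prompt": "a photo of an astronaut riding a horse"}},
)
print(response.json())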
Note that RunPod also has a non-serverless GPU offering, called RunPod Pods, which are virtual machines with GPUs.
Baseten
Baseten is a serverless inference platform.
They offer their own framework called Truss, with an associated CLI, for configuring and deploying models. To deploy a GPU-backed model on Baseten, you specify the resources you need in a config.yaml file.
Here’s an example of how to ask for A10G GPUs in order to run Stable Diffusion XL.
resources:
  accelerator: A10G
  cpu: "4"
  memory: 16Gi
  use_gpu: true

You can also configure other aspects of the deployment, such as the number of replicas (for scaling) and whether you want Baseten to auto-scale.
You can then deploy your model with a truss push command. This creates a Docker image and pushes it to Baseten, where it can be deployed and run. It also automatically creates an API that you can use to send requests to your deployed model.
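As a hedged sketch, calling that generated API from Python might look like the example below; the model ID, API key, and request body are placeholders, and the exact URL format is documented in Baseten’s docs:

import requests

MODEL_ID = "your-model-id"        # shown in the Baseten dashboard
API_KEY = "your-baseten-api-key"

response = requests.post(
    f"https://model-{MODEL_ID}.api.baseten.co/production/predict",
    headers={"Authorization": f"Api-Key {API_KEY}"},
    json={"prompt": "A watercolor painting of a lighthouse at dawn"},
)
print(response.json())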
Fal
Fal is a newer player in the serverless GPU space. It focuses on out-of-the-box deployment and serving of media generation models like Flux and SDXL, and offers ready-made endpoints for the most popular models that users can call via API.
Fal offers private serverless models as an enterprise feature. To use them, you follow a workflow similar to Modal’s (sketched after this list):
- Decorate your Python code with Fal-specific decorators.
- Specify the GPU you want to use (e.g., “GPU-A100”) as a parameter to the decorator.
- Deploy your code using the Fal CLI.
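A minimal sketch of that workflow is below. The decorator name and parameters follow the pattern described above but are illustrative assumptions, so check Fal’s docs for the current API:

import fal

# Illustrative decorator and parameters; the exact Fal API may differ
@fal.function(machine_type="GPU-A100", requirements=["torch"])
def generate():
    import torch
    return torch.cuda.get_device_name(0)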
Replicate
Replicate offers serverless GPU-powered inference for a wide range of pre-trained models, as well as the ability to deploy custom models on GPUs behind a serverless endpoint.
Pre-trained Models
For most users, the main benefit of Replicate is the extensive library of pre-trained models that are ready to use. The details of which GPU resources are needed for each model are generally abstracted away from the user, who can simply specify the model name.
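For example, using the replicate Python client (with the REPLICATE_API_TOKEN environment variable set), running a public model is roughly a one-liner; the model slug below is just one example:

import replicate

# Any public model on Replicate can be referenced by its owner/name slug
output = replicate.run(
    "black-forest-labs/flux-schnell",
    input={"prompt": "an astronaut riding a horse, studio lighting"},
)
print(output)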
Custom Models
Replicate also allows users to deploy custom models. In the context of Replicate, a “model” refers to a trained, packaged, and published software program that accepts inputs and returns outputs.
To create and deploy a custom model on Replicate, you first create a model in the Replicate web UI and then train it using the Replicate training API. You can then create a deployment for the model, which provides a private, fixed API endpoint that you can configure to run on specific GPU hardware.
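As a hedged sketch, kicking off a training job from the Python client might look like the following; the base model version, training input, and destination are placeholders, and the exact client signature may vary by version:

import replicate

training = replicate.trainings.create(
    version="owner/base-model:version-id",                  # trainable base model
    input={"train_data": "https://example.com/data.zip"},   # illustrative input
    destination="your-username/your-custom-model",          # model created in the web UI
)
print(training.status)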