September 25, 2024 · 5 minute read
Best practices for serverless inference
Yiren Lu (@YirenLu)
Solutions Engineer

Serverless inference is a cloud computing model that allows you to deploy and serve machine learning models without managing the underlying infrastructure. Notable characteristics of a serverless model include:

  • No server management required
  • Automatic scaling to handle varying loads
  • Pay-per-use pricing model
  • Low operational overhead
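
To make this concrete, here is a minimal sketch of a serverless inference function using Modal's Python SDK (the model, app name, and GPU type are arbitrary placeholder choices; other providers offer similar abstractions):

```python
import modal

app = modal.App("sentiment-inference")

# Dependencies are declared in code; the provider builds and caches the image.
image = modal.Image.debian_slim().pip_install("transformers", "torch")

@app.function(image=image, gpu="T4")  # containers start on demand and scale to zero when idle
def classify(text: str) -> dict:
    from transformers import pipeline

    # For brevity the model is loaded inside the function body; the best
    # practices below show how to load it once per container instead.
    classifier = pipeline("sentiment-analysis", device=0)
    return classifier(text)[0]

@app.local_entrypoint()
def main():
    # .remote() runs the function in the cloud; you pay only while containers
    # are running, not for idle servers.
    print(classify.remote("Serverless inference keeps our GPU bill predictable."))
```

There are no instances to provision or tear down: the platform starts containers when requests arrive and scales back to zero when they stop.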

Why use serverless inference?

Serverless inference offers several advantages, particularly for deploying and managing expensive transformer-based models. Here’s why it’s beneficial:

  1. Cost-efficiency: Serverless inference eliminates idle GPU time costs. You only pay for the compute resources used during actual inference, making it ideal for models with variable or “bursty” traffic patterns.

  2. Scalability: It automatically scales to handle varying loads, from sporadic requests to sudden traffic spikes, without manual intervention.

  3. Reduced operational overhead: There’s no need to manage servers or worry about capacity planning. The cloud provider handles infrastructure management, allowing you to focus on model development and optimization.

  4. Flexibility: Serverless inference adapts to your needs, whether you’re serving a single model or multiple models with different resource requirements.

While serverless inference may appear more expensive on a “per-minute” basis compared to traditional server-based deployments, it eliminates the need to provision for maximum capacity scenarios. This can lead to significant cost savings, especially for workloads with variable demand.

It’s worth noting that even if you anticipate running GPUs around the clock, actual utilization rarely matches this expectation. Serverless inference helps optimize resource usage and costs in these scenarios.

Top serverless inference providers

In recent years, a number of companies have emerged to offer serverless capabilities for running inference workloads, alongside the serverless platforms from the major cloud providers.

Note that while GCP, Azure, and AWS each offer their own serverless cloud platforms, only GCP Cloud Run Functions supports running GPUs, and this is currently still in preview.

For more details on the available providers, check out our comparison article.

Best practices for serverless inference

To optimize your serverless inference deployments:

  1. Leverage GPU acceleration: For compute-intensive models, utilize GPU resources effectively (see the sketches after this list for concrete examples):

    • Choose the appropriate GPU type and memory for your model to ensure efficient resource utilization.
    • Consult your provider’s documentation on how to specify GPU requirements for your functions.
  2. Minimize cold starts: Cold starts (the time it takes to spin up a new container with your model in it) can significantly impact latency for serverless functions. Consider these techniques:

    • Maintain a pool of warm instances that are always up and running.
    • Adjust container idle timeouts to keep containers warm for longer periods, if supported.
  3. Optimize model loading and initialization (see the corresponding sketch after this list):

    • Utilize lifecycle methods or initialization hooks provided by your serverless platform to load models during container warm-up rather than on first invocation.
    • Move large file downloads (e.g. model weights) to the build or deployment phase when possible, so that they are downloaded only once.
    • Take advantage of pre-built images or layers which come with optimized dependencies for common ML frameworks.
    • Consider model quantization or pruning techniques to reduce the size of the model that needs to be loaded without significantly impacting performance.
    • Use persistent storage options to cache model weights, reducing load times on subsequent invocations.
  4. Implement efficient batching (see the batching sketch after this list):

    • Utilize batching mechanisms provided by your serverless platform to automatically batch incoming requests, improving throughput.
    • Implement custom batching logic within your inference function for fine-grained control over batch size and processing.
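
To make practices 1 and 2 concrete, here is a minimal sketch using Modal's Python SDK (other providers expose similar knobs under different names). The GPU type is an arbitrary placeholder, and parameter names such as `keep_warm` and `container_idle_timeout` may differ between SDK versions, so treat this as a sketch rather than copy-paste configuration:

```python
import modal

app = modal.App("warm-gpu-inference")
image = modal.Image.debian_slim().pip_install("transformers", "torch")

@app.function(
    image=image,
    gpu="A10G",                  # placeholder: pick a GPU with enough memory for your model
    keep_warm=1,                 # keep one warm instance running at all times to absorb bursts
    container_idle_timeout=300,  # keep idle containers alive for 5 minutes before scaling down
)
def classify(text: str) -> dict:
    from transformers import pipeline

    # Loading the model here keeps the example short; the model-loading sketch
    # below moves this into a container-startup hook instead.
    classifier = pipeline("sentiment-analysis", device=0)
    return classifier(text)[0]
```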
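
For practice 3, the sketch below (again using Modal's Python SDK; the model ID is an arbitrary placeholder) downloads weights at image-build time and loads the model in a container-startup hook rather than on the first request. A persistent volume (e.g. `modal.Volume`) is another option for caching weights across container starts if you'd rather keep them out of the image:

```python
import modal

MODEL_ID = "distilbert-base-uncased-finetuned-sst-2-english"  # placeholder model

def download_weights():
    # Runs once at image-build time, so the weights ship inside the image
    # instead of being downloaded on every cold start.
    from huggingface_hub import snapshot_download
    snapshot_download(MODEL_ID)

image = (
    modal.Image.debian_slim()
    .pip_install("transformers", "torch", "huggingface_hub")
    .run_function(download_weights)
)

app = modal.App("fast-loading-inference")

@app.cls(image=image, gpu="A10G")
class Model:
    @modal.enter()
    def load(self):
        # Lifecycle hook: runs once per container during warm-up,
        # so the first request doesn't pay the model-loading cost.
        from transformers import pipeline
        self.pipe = pipeline("sentiment-analysis", model=MODEL_ID, device=0)

    @modal.method()
    def predict(self, text: str) -> dict:
        return self.pipe(text)[0]
```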
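
For practice 4, some platforms can batch requests for you. The sketch below assumes Modal's dynamic-batching decorator, `@modal.batched`, with `max_batch_size` and `wait_ms` parameters (check your provider's current docs for the exact name and signature; the batched method must accept and return lists of the same length):

```python
import modal

app = modal.App("batched-inference")
image = modal.Image.debian_slim().pip_install("transformers", "torch")

@app.cls(image=image, gpu="A10G")
class BatchedClassifier:
    @modal.enter()
    def load(self):
        from transformers import pipeline
        self.pipe = pipeline("sentiment-analysis", device=0)

    # Individual calls are transparently grouped into batches of up to 8
    # inputs, or whatever has arrived within the 100 ms wait window.
    @modal.batched(max_batch_size=8, wait_ms=100)
    def predict(self, texts: list[str]) -> list[dict]:
        # Receives a list of inputs and must return one output per input.
        return self.pipe(texts, batch_size=len(texts))
```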

Conclusion

Serverless inference offers a powerful way to deploy machine learning models with minimal operational overhead. By understanding the concepts and following best practices, you can leverage serverless platforms to efficiently serve your AI models at scale.

To get started with serverless inference, check out the Modal documentation or explore other cloud providers’ offerings.
