September 25, 2024 · 5 minute read
Best practices for serverless inference
Yiren Lu (@YirenLu)
Solutions Engineer

Serverless inference is a cloud computing model that allows you to deploy and serve machine learning models without managing the underlying infrastructure. Notable characteristics of a serverless model include:

  • No server management required
  • Automatic scaling to handle varying loads
  • Pay-per-use pricing model
  • Low operational overhead
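
To make this concrete, here is a minimal sketch of a serverless inference function using Modal's Python SDK (the model, app name, and GPU type are arbitrary placeholder choices; other providers offer similar abstractions):

```python
import modal

app = modal.App("sentiment-inference")

# Dependencies are declared in code; the provider builds and caches the image.
image = modal.Image.debian_slim().pip_install("transformers", "torch")

@app.function(image=image, gpu="T4")  # containers start on demand and scale to zero when idle
def classify(text: str) -> dict:
    from transformers import pipeline

    # For brevity the model is loaded inside the function body; the best
    # practices below show how to load it once per container instead.
    classifier = pipeline("sentiment-analysis", device=0)
    return classifier(text)[0]

@app.local_entrypoint()
def main():
    # .remote() runs the function in the cloud; you pay only while containers
    # are running, not for idle servers.
    print(classify.remote("Serverless inference keeps our GPU bill predictable."))
```

There are no instances to provision or tear down: the platform starts containers when requests arrive and scales back to zero when they stop.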

Why use serverless inference?

Serverless inference offers several advantages, particularly for deploying and managing expensive transformer-based models. Here’s why it’s beneficial:

  1. Cost-efficiency: Serverless inference eliminates idle GPU time costs. You only pay for the compute resources used during actual inference, making it ideal for models with variable or “bursty” traffic patterns.

  2. Scalability: It automatically scales to handle varying loads, from sporadic requests to sudden traffic spikes, without manual intervention.

  3. Reduced operational overhead: There’s no need to manage servers or worry about capacity planning. The cloud provider handles infrastructure management, allowing you to focus on model development and optimization.

  4. Flexibility: Serverless inference adapts to your needs, whether you’re serving a single model or multiple models with different resource requirements.

While serverless inference may appear more expensive on a “per-minute” basis compared to traditional server-based deployments, it eliminates the need to provision for maximum capacity scenarios. This can lead to significant cost savings, especially for workloads with variable demand.

It’s worth noting that even if you anticipate running GPUs around the clock, actual utilization rarely matches this expectation. Serverless inference helps optimize resource usage and costs in these scenarios.

Top serverless inference providers

In recent years, a number of companies have emerged to offer serverless capabilities for running inference workloads, alongside the serverless platforms from the major cloud providers.

Note that while GCP, Azure, and AWS each offer their own serverless cloud platforms, only GCP Cloud Run Functions supports running GPUs, and this is currently still in preview.

For more details on the available providers, check out our comparison article.

Best practices for serverless inference

To optimize your serverless inference deployments:

  1. Leverage GPU acceleration: For compute-intensive models, utilize GPU resources effectively (see the sketches after this list for concrete examples):

    • Choose the appropriate GPU type and memory for your model to ensure efficient resource utilization.
    • Consult your provider’s documentation on how to specify GPU requirements for your functions.
  2. Minimize cold starts: Cold starts (the time it takes to spin up a new container with your model in it) can significantly impact latency for serverless functions. Consider these techniques:

    • Maintain a pool of warm instances that are always up and running.
    • Adjust container idle timeouts to keep containers warm for longer periods, if supported.
  3. Optimize model loading and initialization (see the corresponding sketch after this list):

    • Utilize lifecycle methods or initialization hooks provided by your serverless platform to load models during container warm-up rather than on first invocation.
    • Move large file downloads (e.g. model weights) to the build or deployment phase when possible, so that they are downloaded only once.
    • Take advantage of pre-built images or layers which come with optimized dependencies for common ML frameworks.
    • Consider model quantization or pruning techniques to reduce the size of the model that needs to be loaded without significantly impacting performance.
    • Use persistent storage options to cache model weights, reducing load times on subsequent invocations.
  4. Implement efficient batching (see the batching sketch after this list):

    • Utilize batching mechanisms provided by your serverless platform to automatically batch incoming requests, improving throughput.
    • Implement custom batching logic within your inference function for fine-grained control over batch size and processing.
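
To make practices 1 and 2 concrete, here is a minimal sketch using Modal's Python SDK (other providers expose similar knobs under different names). The GPU type is an arbitrary placeholder, and parameter names such as `keep_warm` and `container_idle_timeout` may differ between SDK versions, so treat this as a sketch rather than copy-paste configuration:

```python
import modal

app = modal.App("warm-gpu-inference")
image = modal.Image.debian_slim().pip_install("transformers", "torch")

@app.function(
    image=image,
    gpu="A10G",                  # placeholder: pick a GPU with enough memory for your model
    keep_warm=1,                 # keep one warm instance running at all times to absorb bursts
    container_idle_timeout=300,  # keep idle containers alive for 5 minutes before scaling down
)
def classify(text: str) -> dict:
    from transformers import pipeline

    # Loading the model here keeps the example short; the model-loading sketch
    # below moves this into a container-startup hook instead.
    classifier = pipeline("sentiment-analysis", device=0)
    return classifier(text)[0]
```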
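
For practice 3, the sketch below (again using Modal's Python SDK; the model ID is an arbitrary placeholder) downloads weights at image-build time and loads the model in a container-startup hook rather than on the first request. A persistent volume (e.g. `modal.Volume`) is another option for caching weights across container starts if you'd rather keep them out of the image:

```python
import modal

MODEL_ID = "distilbert-base-uncased-finetuned-sst-2-english"  # placeholder model

def download_weights():
    # Runs once at image-build time, so the weights ship inside the image
    # instead of being downloaded on every cold start.
    from huggingface_hub import snapshot_download
    snapshot_download(MODEL_ID)

image = (
    modal.Image.debian_slim()
    .pip_install("transformers", "torch", "huggingface_hub")
    .run_function(download_weights)
)

app = modal.App("fast-loading-inference")

@app.cls(image=image, gpu="A10G")
class Model:
    @modal.enter()
    def load(self):
        # Lifecycle hook: runs once per container during warm-up,
        # so the first request doesn't pay the model-loading cost.
        from transformers import pipeline
        self.pipe = pipeline("sentiment-analysis", model=MODEL_ID, device=0)

    @modal.method()
    def predict(self, text: str) -> dict:
        return self.pipe(text)[0]
```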
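
For practice 4, some platforms can batch requests for you. The sketch below assumes Modal's dynamic-batching decorator, `@modal.batched`, with `max_batch_size` and `wait_ms` parameters (check your provider's current docs for the exact name and signature; the batched method must accept and return lists of the same length):

```python
import modal

app = modal.App("batched-inference")
image = modal.Image.debian_slim().pip_install("transformers", "torch")

@app.cls(image=image, gpu="A10G")
class BatchedClassifier:
    @modal.enter()
    def load(self):
        from transformers import pipeline
        self.pipe = pipeline("sentiment-analysis", device=0)

    # Individual calls are transparently grouped into batches of up to 8
    # inputs, or whatever has arrived within the 100 ms wait window.
    @modal.batched(max_batch_size=8, wait_ms=100)
    def predict(self, texts: list[str]) -> list[dict]:
        # Receives a list of inputs and must return one output per input.
        return self.pipe(texts, batch_size=len(texts))
```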

Conclusion

Serverless inference offers a powerful way to deploy machine learning models with minimal operational overhead. By understanding the concepts and following best practices, you can leverage serverless platforms to efficiently serve your AI models at scale.

To get started with serverless inference, check out the Modal documentation or explore other cloud providers’ offerings.
