Cold start performance

Modal Functions are run in containers.

If a container is already ready to run your Function, it will be reused.

If not, Modal spins up a new container. This is known as a cold start, and it is often associated with higher latency.

There are two sources of increased latency during cold starts:

  1. Inputs may spend more time waiting in a queue for a container to become ready, or “warm”.
  2. When an input is handled by a container that has just started, there may be extra work that only needs to be done on the first invocation (“amortized work”).

This guide presents techniques and Modal features for reducing the impact of both queueing and amortized work on observed latencies.

If you are invoking Functions with no warm containers or if you otherwise see inputs spending too much time in the “pending” state, you should target queueing time for optimization.

If you see some Function invocations taking much longer than others, and those invocations are the first handled by a new container, you should target amortized work for optimization.

Reduce time spent queueing for warm containers

New containers are booted when there are not enough warm containers to handle the current number of inputs.

For example, the first time you send an input to a Function, there are zero warm containers and there is one input, so a single container must be booted up. The total latency for the input will include the time it takes to boot a container.

If you send another input right after the first one finishes, there will be one warm container and one pending input, and no new container will be booted.

Generalizing, there are two factors that affect the time inputs spend queueing: the time it takes for a container to boot and become warm, and the chance that a warm container is available to handle an input.

Warm up containers faster

The time taken for a container to become warm and ready for inputs can range from seconds to minutes.

Modal’s custom container stack has been heavily optimized to reduce this time. Containers boot in about one second.

But before a container is considered warm and ready to handle inputs, we need to execute any logic in your code’s global scope (such as imports) or in any modal.enter methods. So if your boots are slow, these are the first places to look when optimizing.

For example, you might be downloading a large model from a model server during the boot process. You can instead download the model during the build phase, which only runs when the container image changes, at most once per deployment.

For models in the tens of gigabytes, this can reduce boot times from minutes to seconds.
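
As a minimal sketch of this pattern (assuming the weights are hosted on the Hugging Face Hub; the model name and the download_model helper below are illustrative, not part of your app), the download can be attached to the image definition so that it runs at build time:

import modal

def download_model():
    # Runs while the image is being built, so the weights are baked into
    # the image instead of being fetched on every cold start.
    from huggingface_hub import snapshot_download

    snapshot_download("openai/clip-vit-base-patch32")

image = (
    modal.Image.debian_slim()
    .pip_install("huggingface_hub")
    .run_function(download_model)  # executed when the image is built
)

app = modal.App(image=image)

@app.function()
def predict(text: str):
    ...  # the weights are already on local disk when inputs arrive

Because the download now happens during the build phase, cold starts only pay the cost of reading the weights from disk rather than pulling them over the network.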

Run more warm containers

It is not always possible to speed up boots sufficiently. For example, seconds of added latency to load a model may not be acceptable in an interactive setting.

In this case, the only option is to have more warm containers running. This increases the chance that an input will be handled by a warm container, for example one that finishes its current input while another container is still booting.

Modal currently exposes two parameters to control how many containers will be warm: container_idle_timeout and keep_warm.

Keep containers warm for longer with container_idle_timeout

By default, Modal containers spin down after 60 seconds of inactivity. You can configure this time by setting the container_idle_timeout value on the @function decorator. The timeout is measured in seconds and can be set to any value between two seconds and twenty minutes.

import modal

app = modal.App()

@app.function(container_idle_timeout=300)
def my_idle_greeting():
    return {"hello": "world"}

Maintain a warm pool with keep_warm

Keeping already-warm containers around longer doesn’t help if there are no warm containers to begin with, as is the case when Functions scale up from zero.

To keep some containers warm and running at all times, set the keep_warm value on the @function decorator. This sets the minimum number of containers that will always be ready to run your Function. Modal will still scale up (and spin down) more containers if the demand for your Function exceeds the keep_warm value, as usual.

import modal

app = modal.App(image=modal.Image.debian_slim().pip_install("fastapi"))

@app.function(keep_warm=3)
@modal.web_endpoint()
def my_warm_greeting():
    return {"hello": "world"}

Adjust warm pools dynamically

You can also set the warm pool size for a deployed function dynamically with Function.keep_warm. This can be used with a Modal scheduled function to update the number of warm containers based on the time of day, for example:

import modal

app = modal.App()

@app.function()
def square(x):
    return x**2

@app.function(schedule=modal.Cron("0 * * * *"))  # run at the start of the hour
def update_keep_warm():
    from datetime import datetime, timezone

    peak_hours_start, peak_hours_end = 6, 18
    if peak_hours_start <= datetime.now(timezone.utc).hour < peak_hours_end:
        square.keep_warm(3)
    else:
        square.keep_warm(0)

Reduce time spent on amortized work

Some work is done only the first time a Function is invoked in a container, but its results are used on every subsequent invocation. This is amortized work.

For example, you may be using a large pre-trained model whose weights need to be loaded from disk to memory the first time it is used.

This results in longer latencies for the first invocation of a warm container, which shows up in the application as occasional slow calls (high tail latency).

Move amortized work to build or warm up

As with slow warm ups, some of the work done on the first invocation can be moved earlier, to build time or to the warm up period.

Any work that can be saved to disk, like downloading model weights, should be done during the build phase.

If you can move the logic for the amortized work out of the function body and into a container enter method, you can move work into the warm up period. Containers will not be considered warm until all enter methods have completed, so no inputs will have elevated latency inside the body of the function.
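
As a rough sketch (the Model class and its stand-in loading logic below are placeholders, not a specific recommendation), this means putting the slow work in a method decorated with modal.enter on a class-based Function:

import modal

app = modal.App()

@app.cls()
class Model:
    @modal.enter()
    def load(self):
        # Runs once per container during warm up, before any inputs are handled.
        # Replace this stand-in with your real loading logic.
        import time

        time.sleep(5)  # simulate reading large weights into memory
        self.model = str.upper  # stand-in for a real model

    @modal.method()
    def predict(self, prompt: str) -> str:
        # By the time inputs arrive, self.model is already in memory.
        return self.model(prompt)

Inputs are only routed to this container after load has completed, so predict calls never block on the loading step.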

For more on how to use enter with machine learning model weights, see this guide.

Note that enter doesn’t get rid of the latency — it just moves the latency to the warm up period, where it can be handled by running more warm containers.

Target amortized work for optimization

Sometimes, there is nothing to be done but to speed this work up.

Here, we share specific patterns that show up in optimizing amortized work in Modal functions.

Load multiple large files concurrently

Modal applications often need to read large files into memory (e.g. model weights) before they can process inputs. Where feasible, these large file reads should happen concurrently rather than sequentially. Concurrent IO takes full advantage of our platform’s high disk and network bandwidth, reducing latency.

One common example of slow sequential IO is loading multiple independent Hugging Face Transformers models in series.

from transformers import CLIPProcessor, CLIPModel, BlipProcessor, BlipForConditionalGeneration

model_a = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor_a = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model_b = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")
processor_b = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")

The above snippet does four .from_pretrained loads sequentially. None of the components depend on another already being loaded in memory, so they are safe to load concurrently.

They could be loaded concurrently using a helper function like this:

from concurrent.futures import ThreadPoolExecutor, as_completed
from transformers import CLIPProcessor, CLIPModel, BlipProcessor, BlipForConditionalGeneration

def load_models_concurrently(load_functions_map: dict) -> dict:
    """Run the loaders in `load_functions_map` concurrently and return the results keyed by model ID."""
    model_id_to_model = {}
    with ThreadPoolExecutor(max_workers=len(load_functions_map)) as executor:
        future_to_model_id = {
            executor.submit(load_fn): model_id
            for model_id, load_fn in load_functions_map.items()
        }
        for future in as_completed(future_to_model_id.keys()):
            model_id_to_model[future_to_model_id[future]] = future.result()
    return model_id_to_model

components = load_models_concurrently({
    "clip_model": lambda: CLIPModel.from_pretrained("openai/clip-vit-base-patch32"),
    "clip_processor": lambda: CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32"),
    "blip_model": lambda: BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large"),
    "blip_processor": lambda: BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large"),
})

If performing concurrent IO on large file reads does not speed up your cold starts, it’s possible that some part of your function’s code is holding the Python GIL and reducing the efficacy of the multi-threaded executor.