Cold start performance

For a deployed function or web endpoint, Modal spins up as many containers as needed to handle the current number of concurrent requests. Starting a container incurs a cold-start time of ~1s. Next, any logic in global scope (such as imports) and the container enter method are executed. If you load a large model from the image, this can take a few seconds depending on the size of the model, because the file is copied over the network to the worker running your job.

After the cold start, subsequent requests to the same container see lower response latency (~50-200ms) until the container is shut down after a period of inactivity. Modal currently exposes two parameters for controlling how many cold starts users experience: container_idle_timeout and keep_warm.

Container idle timeout

By default, Modal containers spin down after 60 seconds of inactivity. You can override this by setting container_idle_timeout on the @function decorator. It accepts any integer value between 2 and 1200, measured in seconds.

import modal

app = modal.App()  # Note: prior to April 2024, "app" was called "stub"

@app.function(container_idle_timeout=300)
def my_idle_f():
    return {"hello": "world"}

Warm pool

If you want some containers running at all times to mitigate the cold-start penalty, you can set keep_warm on the @function decorator. This configures the minimum number of containers that are always kept up for your function; Modal will still scale up (and later spin down) additional containers if demand exceeds the keep_warm value, as usual.

from modal import App, web_endpoint

app = App()  # Note: prior to April 2024, "app" was called "stub"

@app.function(keep_warm=3)
@web_endpoint()
def my_warm_f():
    return {"hello": "world"}

Functions with slow start-up and keep_warm

The guarantee keep_warm provides is that there are always at least n containers up that have finished starting. If your function performs expensive or slow initialization the first time it receives an input (e.g. a pre-trained model that must be loaded into memory on first use), those first calls will still be slow.

To avoid this, you can use a container enter method to perform the expensive initialization. This will ensure that the initialization is performed before the container is deemed ready for the warm pool.
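For example, here is a minimal sketch using a class-based function (the class name, model, and keep_warm value are illustrative): the method decorated with @modal.enter() runs once when the container starts, before the container counts towards the warm pool.

import modal

app = modal.App()

@app.cls(keep_warm=2)
class Model:
    @modal.enter()
    def load(self):
        # Expensive initialization happens here, once per container start,
        # so warm containers already have the model in memory.
        from transformers import CLIPModel

        self.model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

    @modal.method()
    def predict(self, text: str):
        ...  # the model is already loaded when inputs arrive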

Memory Snapshot

Modal snapshotting is a developer preview feature that can significantly reduce cold start times. Refer to the Memory Snapshot page for details.
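As a minimal sketch, assuming the enable_memory_snapshot flag documented on that page, snapshotting is enabled per function:

import modal

app = modal.App()

@app.function(enable_memory_snapshot=True)  # developer preview; see the Memory Snapshot page
def my_snapshot_f():
    return {"hello": "world"}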

Perform concurrent IO

Modal applications often need to read large files into memory (e.g. model checkpoints) before they can process inputs. Where feasible, these large file reads should happen concurrently rather than sequentially. Performing concurrent IO lets your application take full advantage of our platform’s high disk and network bandwidth and reduces the cold start penalty from IO latency.

One common example of slow sequential IO is loading multiple independent Hugging Face transformers models in series.

from transformers import CLIPProcessor, CLIPModel, BlipProcessor, BlipForConditionalGeneration

model_a = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor_a = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model_b = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")
processor_b = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")

The snippet above unnecessarily performs four .from_pretrained loads sequentially. None of the components depends on another already being loaded in memory, so they can be loaded concurrently.

They could instead be loaded concurrently using a function like this:

from concurrent.futures import ThreadPoolExecutor, as_completed
from transformers import CLIPProcessor, CLIPModel, BlipProcessor, BlipForConditionalGeneration

def load_models_concurrently(load_functions_map: dict) -> dict:
    model_id_to_model = {}
    with ThreadPoolExecutor(max_workers=len(load_functions_map)) as executor:
        future_to_model_id = {
            executor.submit(load_fn): model_id
            for model_id, load_fn in load_functions_map.items()
        }
        for future in as_completed(future_to_model_id.keys()):
            model_id_to_model[future_to_model_id[future]] = future.result()
    return model_id_to_model

components = load_models_concurrently({
    "clip_model": lambda: CLIPModel.from_pretrained("openai/clip-vit-base-patch32"),
    "clip_processor": lambda: CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32"),
    "blip_model": lambda: BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large"),
    "blip_processor": lambda: BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large"),
})

If performing concurrent IO on large file reads does not speed up your cold starts, it’s possible that some part of your Function’s code is holding the Python GIL and reducing the efficacy of the multi-threaded executor.
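A quick way to check is to time the concurrent load and compare it against the sequential version above (load_functions below is just an illustrative name for the dict of loaders passed to load_models_concurrently):

import time

# Time the concurrent load. If this is not meaningfully faster than the
# sequential snippet above, the threads are likely being serialized by the GIL.
start = time.perf_counter()
components = load_models_concurrently(load_functions)
print(f"concurrent load took {time.perf_counter() - start:.1f}s")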

In general, if your application is performing multiple large and independent reads on cold start, attempt to make those reads concurrent using threads.