Cold start performance
Modal Functions are run in containers.
If a container is already ready to run your Function, it will be reused.
If not, Modal spins up a new container. This is known as a cold start, and it is often associated with higher latency.
There are two sources of increased latency during cold starts:
- inputs may spend more time waiting in a queue for a container to become ready or “warm”.
- when an input is handled by the container that just started, there may be extra work that only needs to be done on the first invocation (“initialization”).
This guide presents techniques and Modal features for reducing the impact of both queueing and initialization on observed latencies.
If you are invoking Functions with no warm containers or if you otherwise see inputs spending too much time in the “pending” state, you should target queueing time for optimization.
If you see some Function invocations taking much longer than others, and those invocations are the first handled by a new container, you should target initialization for optimization.
Reduce time spent queueing for warm containers
New containers are booted when there are not enough warm containers to handle the current number of inputs.
For example, the first time you send an input to a Function, there are zero warm containers and there is one input, so a single container must be booted up. The total latency for the input will include the time it takes to boot a container.
If you send another input right after the first one finishes, there will be one warm container and one pending input, and no new container will be booted.
Generalizing, there are two factors that affect the time inputs spend queueing: the time it takes for a container to boot and become warm (which we solve by booting faster) and the time until a warm container is available to handle an input (which we solve by having more warm containers).
Warm up containers faster
The time taken for a container to become warm and ready for inputs can range from seconds to minutes.
Modal’s custom container stack has been heavily optimized to reduce this time. Containers boot in about one second.
But before a container is considered warm and ready to handle inputs, we need to execute any logic in your code's global scope (such as imports) and in any modal.enter methods. So if your boots are slow, these are the first places to optimize.
For example, you might be downloading a large model from a model server during the boot process. You can instead download the model ahead of time, so that it only needs to be downloaded once.
For models in the tens of gigabytes, this can reduce boot times from minutes to seconds.
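One way to do this on Modal, sketched below, is to run the download as an image build step with Image.run_function so the weights are baked into the container image. The use of huggingface_hub and the model ID here are illustrative assumptions, not a prescription from this guide.

import modal

def download_model():
    # Illustrative: fetch weights from the Hugging Face Hub at image build time,
    # so they are already on disk when containers boot.
    from huggingface_hub import snapshot_download

    snapshot_download("openai/clip-vit-base-patch32")

image = (
    modal.Image.debian_slim()
    .pip_install("huggingface_hub")
    .run_function(download_model)  # runs once, during the image build
)

app = modal.App(image=image)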
Run more warm containers
It is not always possible to speed up boots sufficiently. For example, seconds of added latency to load a model may not be acceptable in an interactive setting.
In this case, the only option is to have more warm containers running. This increases the chance that an input will be handled by a warm container, for example one that finishes an input while another container is booting.
Modal currently exposes two parameters to control how many containers will be warm: container_idle_timeout and keep_warm.
Keep containers warm for longer with container_idle_timeout
By default, Modal containers spin down after 60 seconds of inactivity.
You can configure this time by setting the container_idle_timeout value on the @function decorator. The timeout is measured in seconds and can be set to any value between two seconds and twenty minutes.
import modal

app = modal.App()

@app.function(container_idle_timeout=300)
def my_idle_greeting():
    return {"hello": "world"}
Maintain a warm pool with keep_warm
Keeping already warm containers around longer doesn’t help if there are no warm containers to begin with, as when Functions scale from zero.
To keep some containers warm and running at all times, set the keep_warm value on the @function decorator. This sets the minimum number of containers that will always be ready to run your Function. Modal will still scale up (and spin down) more containers if the demand for your Function exceeds the keep_warm value, as usual.
import modal

app = modal.App(image=modal.Image.debian_slim().pip_install("fastapi"))

@app.function(keep_warm=3)
@modal.web_endpoint()
def my_warm_greeting():
    return {"hello": "world"}
Adjust warm pools dynamically
You can also set the warm pool size for a deployed Function dynamically with Function.keep_warm.
This can be used with a Modal scheduled function to update the number of warm containers based on the time of day, for example:
import modal

app = modal.App()

@app.function()
def square(x):
    return x**2

@app.function(schedule=modal.Cron("0 * * * *"))  # run at the start of the hour
def update_keep_warm():
    from datetime import datetime, timezone

    peak_hours_start, peak_hours_end = 6, 18
    if peak_hours_start <= datetime.now(timezone.utc).hour < peak_hours_end:
        square.keep_warm(3)
    else:
        square.keep_warm(0)
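You can also adjust the warm pool of an already deployed Function from outside the app by looking up a handle to it first. A minimal sketch, assuming the app above was deployed under the name "example-app" (the name is illustrative):

import modal

# Look up the deployed Function and adjust its warm pool from any Python environment.
square = modal.Function.lookup("example-app", "square")
square.keep_warm(2)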
Reduce latency from initialization
Some work is done only the first time a Function is invoked, and its results are reused on every subsequent invocation. This is amortized work done at initialization.
For example, you may be using a large pre-trained model whose weights need to be loaded from disk to memory the first time it is used.
This results in longer latencies for the first invocation of a warm container, which shows up in the application as occasional slow calls (high tail latency).
Move initialization work to build or warm up
As with work at warm up time, some work done on the first invocation can be moved out to build time or to warm up time.
Any work that can be saved to disk, like downloading model weights, should be done as early as possible — e.g. when the container image is built.
If you can move the logic for initialization out of the function body and into a container enter method, you can move that work into the warm up period. Containers will not be considered warm until all enter methods have completed, so no inputs will be routed to containers that have yet to complete this initialization.
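A minimal sketch of the pattern, using a class-based Function with modal.enter; the load_weights helper here is hypothetical and stands in for whatever disk-to-memory loading your application does:

import modal

app = modal.App()

@app.cls()
class Model:
    @modal.enter()
    def load(self):
        # Runs once per container during warm up, before any inputs are routed to it.
        self.model = load_weights()  # hypothetical helper that reads weights from disk

    @modal.method()
    def predict(self, x):
        return self.model(x)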
For more on how to use enter with machine learning model weights, see this guide. Note that enter doesn't get rid of the latency; it just moves the latency to the warm up period, where it can be handled by running more warm containers.
Share initialization work across cold starts with memory snapshots (beta)
Cold starts can also be made faster by using memory snapshots.
Invocations of a function after the first are faster in part because the memory is already populated with values that otherwise need to be computed or read from disk, like the contents of imported libraries.
Memory snapshotting is a beta feature that captures the state of a container’s memory at user-controlled points after it has been warmed up and reuses that state in future boots.
Refer to the memory snapshot page for details.
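As a minimal sketch of opting in on a function (the enable_memory_snapshot flag is described on that page), where the global-scope import is the work the snapshot lets future boots skip:

import modal
import json  # global-scope work like this import is captured in the snapshot

app = modal.App()

@app.function(enable_memory_snapshot=True)
def parse(payload: str):
    return json.loads(payload)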
Optimize initialization code
Sometimes, there is nothing to be done but to speed this work up.
Here, we share specific patterns that show up in optimizing initialization in Modal Functions.
Load multiple large files concurrently
Often Modal applications need to read large files into memory (e.g. model weights) before they can process inputs. Where feasible, these large file reads should happen concurrently rather than sequentially. Concurrent IO takes full advantage of our platform's high disk and network bandwidth to reduce latency.
One common example of slow sequential IO is loading multiple independent Huggingface transformers models in series.
from transformers import CLIPProcessor, CLIPModel, BlipProcessor, BlipForConditionalGeneration
model_a = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor_a = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model_b = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")
processor_b = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
The above snippet does four .from_pretrained loads sequentially. None of the components depend on another being already loaded in memory, so they could instead be loaded concurrently, using a function like this:
from concurrent.futures import ThreadPoolExecutor, as_completed
from transformers import CLIPProcessor, CLIPModel, BlipProcessor, BlipForConditionalGeneration

def load_models_concurrently(load_functions_map: dict) -> dict:
    model_id_to_model = {}
    with ThreadPoolExecutor(max_workers=len(load_functions_map)) as executor:
        future_to_model_id = {
            executor.submit(load_fn): model_id
            for model_id, load_fn in load_functions_map.items()
        }
        for future in as_completed(future_to_model_id.keys()):
            model_id_to_model[future_to_model_id[future]] = future.result()
    return model_id_to_model

components = load_models_concurrently({
    "clip_model": lambda: CLIPModel.from_pretrained("openai/clip-vit-base-patch32"),
    "clip_processor": lambda: CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32"),
    "blip_model": lambda: BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large"),
    "blip_processor": lambda: BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large"),
})
If performing concurrent IO on large file reads does not speed up your cold starts, it’s possible that some part of your function’s code is holding the Python GIL and reducing the efficacy of the multi-threaded executor.