Cold start performance
Modal Functions are run in containers.
If a container is already ready to run your Function, it will be reused.
If not, Modal spins up a new container. This is known as a cold start, and it is often associated with higher latency.
There are two sources of increased latency during cold starts:
- inputs may spend more time waiting in a queue for a container to become ready or “warm”.
- when an input is handled by the container that just started, there may be extra work that only needs to be done on the first invocation (“initialization”).
This guide presents techniques and Modal features for reducing the impact of both queueing and initialization on observed latencies.
If you are invoking Functions with no warm containers or if you otherwise see inputs spending too much time in the “pending” state, you should target queueing time for optimization.
If you see some Function invocations taking much longer than others, and those invocations are the first handled by a new container, you should target initialization for optimization.
Reduce time spent queueing for warm containers
New containers are booted when there are not enough warm containers to handle the current number of inputs.
For example, the first time you send an input to a Function, there are zero warm containers and there is one input, so a single container must be booted up. The total latency for the input will include the time it takes to boot a container.
If you send another input right after the first one finishes, there will be one warm container and one pending input, and no new container will be booted.
Generalizing, two factors affect the time inputs spend queueing: the time it takes for a container to boot and become warm (which we address by booting faster) and the time until a warm container is available to handle an input (which we address by running more warm containers).
Warm up containers faster
The time taken for a container to become warm and ready for inputs can range from seconds to minutes.
Modal’s custom container stack has been heavily optimized to reduce this time. Containers boot in about one second.
But before a container is considered warm and ready to handle inputs, we need to execute any logic in your code’s global scope (such as imports) or in any modal.enter methods. So if your boots are slow, these are the first places to look when optimizing.
For example, you might be downloading a large model from a model server during the boot process. You can instead download the model ahead of time, so that it only needs to be downloaded once.
For models in the tens of gigabytes, this can reduce boot times from minutes to seconds.
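For instance, here is a minimal sketch of baking model weights into the container’s Image at build time so they are downloaded once rather than on every boot. It assumes the weights come from the Hugging Face Hub; the model ID and helper function are placeholders.

import modal

MODEL_ID = "your-org/your-model"  # placeholder: substitute your own model

def download_model():
    # Runs once, at image build time; the downloaded weights are cached in an image layer.
    from huggingface_hub import snapshot_download
    snapshot_download(MODEL_ID)

image = (
    modal.Image.debian_slim()
    .pip_install("huggingface_hub")
    .run_function(download_model)
)

app = modal.App(image=image)

@app.function()
def predict(prompt: str):
    ...  # the weights are already on the container's local disk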
Run more warm containers
It is not always possible to speed up boots sufficiently. For example, seconds of added latency to load a model may not be acceptable in an interactive setting.
In this case, the only option is to have more warm containers running. This increases the chance that an input will be handled by a warm container, for example one that finishes an input while another container is booting.
Modal currently exposes three parameters to control how many containers will be warm: scaledown_window, min_containers, and buffer_containers.
All of these strategies can increase the resources consumed by your Function and so introduce a trade-off between cold start latencies and cost.
Keep containers warm for longer with scaledown_window
Modal containers will remain idle for a short period before shutting down. By default, the maximum idle time is 60 seconds. You can configure this by setting the scaledown_window on the @function decorator. The value is measured in seconds, and it can be set anywhere between two seconds and twenty minutes.
import modal

app = modal.App()

@app.function(scaledown_window=300)
def my_idle_greeting():
    return {"hello": "world"}
Increasing the scaledown_window reduces the chance that subsequent requests will require a cold start, although you will be billed for any resources used while the container is idle (e.g., GPU reservation or residual memory occupancy). Note that containers will not necessarily remain alive for the entire window, as the autoscaler will scale down more aggressively when the Function is substantially over-provisioned.
Overprovision resources with min_containers and buffer_containers
Keeping already warm containers around longer doesn’t help if there are no warm containers to begin with, as when Functions scale from zero.
To keep some containers warm and running at all times, set the min_containers value on the @function decorator. This puts a floor on the number of containers so that the Function doesn’t scale to zero. Modal will still scale up and spin down more containers as the demand for your Function fluctuates above the min_containers value, as usual.
While min_containers overprovisions containers while the Function is idle, buffer_containers provisions extra containers while the Function is active. This “buffer” of extra containers will be idle and ready to handle inputs if the rate of requests increases. This parameter is particularly useful for bursty request patterns, where the arrival of one input predicts the arrival of more inputs, like when a new user or client starts hitting the Function.
import modal

app = modal.App(image=modal.Image.debian_slim().pip_install("fastapi"))

@app.function(min_containers=3, buffer_containers=3)
def my_warm_greeting():
    return "Hello, world!"
Adjust warm pools dynamically
You can also set the warm pool size for a deployed Function dynamically with Function.keep_warm.
This can be used with a Modal scheduled function to update the number of warm containers based on the time of day, for example:
import modal

app = modal.App()

@app.function()
def square(x):
    return x**2

@app.function(schedule=modal.Cron("0 * * * *"))  # run at the start of the hour
def update_keep_warm():
    from datetime import datetime, timezone

    peak_hours_start, peak_hours_end = 6, 18
    if peak_hours_start <= datetime.now(timezone.utc).hour < peak_hours_end:
        square.keep_warm(3)
    else:
        square.keep_warm(0)
Reduce latency from initialization
Some work is done the first time a Function is invoked, but its results can be reused on every subsequent invocation. This is amortized work done at initialization.
For example, you may be using a large pre-trained model whose weights need to be loaded from disk to memory the first time it is used.
This results in longer latencies for the first invocation of a warm container, which shows up in the application as occasional slow calls: high tail latency or elevated p9Xs.
Move initialization work out of the first invocation
Some work done on the first invocation can be moved up and completed ahead of time.
Any work that can be saved to disk, like downloading model weights, should be done as early as possible. The results can be included in the container’s Image or saved to a Modal Volume.
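As a complement to baking files into the Image (shown earlier), here is a minimal sketch of saving downloaded weights to a Modal Volume; the Volume name, mount path, and model ID are placeholders.

import modal

app = modal.App()
weights = modal.Volume.from_name("model-weights", create_if_missing=True)  # placeholder name

@app.function(
    image=modal.Image.debian_slim().pip_install("huggingface_hub"),
    volumes={"/weights": weights},
)
def download_weights():
    from huggingface_hub import snapshot_download

    # Download once and persist; later containers read the files from the Volume
    # instead of fetching them over the network.
    snapshot_download("your-org/your-model", local_dir="/weights/your-model")
    weights.commit()  # make the new files visible to other containers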
Some work is tricky to serialize, like spinning up a network connection or an inference server.
If you can move this initialization logic out of the function body and into the global scope or a container enter method, you can move this work into the warm up period. Containers will not be considered warm until all enter methods have completed, so no inputs will be routed to containers that have yet to complete this initialization.
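For illustration, here is a minimal sketch of loading a model inside an enter method; the class name, model, and image contents are placeholders rather than a prescribed setup.

import modal

app = modal.App()
image = modal.Image.debian_slim().pip_install("transformers", "torch")

@app.cls(image=image)
class Classifier:
    @modal.enter()
    def load(self):
        # Runs once per container, during the warm up period, before any inputs arrive.
        from transformers import pipeline
        self.pipe = pipeline("text-classification")

    @modal.method()
    def predict(self, text: str):
        # Inputs handled after warm up reuse the already-loaded pipeline.
        return self.pipe(text)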
For more on how to use enter with machine learning model weights, see this guide.
Note that enter doesn’t get rid of the latency; it just moves the latency to the warm up period, where it can be handled by running more warm containers.
Share initialization work across cold starts with memory snapshots
Cold starts can also be made faster by using memory snapshots.
Invocations of a Function after the first are faster in part because the memory is already populated with values that otherwise need to be computed or read from disk, like the contents of imported libraries.
Memory snapshotting captures the state of a container’s memory at user-controlled points after it has been warmed up and reuses that state in future boots, which can substantially reduce cold start latency penalties and warm up period duration.
Refer to the memory snapshot guide for details.
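As a minimal sketch, a Function can be opted into memory snapshots with the enable_memory_snapshot flag; see the memory snapshot guide for class-based Functions, GPU-specific behavior, and other caveats.

import modal

app = modal.App()

@app.function(enable_memory_snapshot=True)
def my_snapshot_friendly_func():
    # Expensive global imports and setup captured in the snapshot are restored
    # on later boots instead of being re-executed.
    return "hello"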
Optimize initialization code
Sometimes, there is nothing to be done but to speed this work up.
Here, we share specific patterns that show up in optimizing initialization in Modal Functions.
Load multiple large files concurrently
Modal applications often need to read large files into memory (e.g., model weights) before they can process inputs. Where feasible, these large file reads should happen concurrently rather than sequentially. Concurrent IO takes full advantage of our platform’s high disk and network bandwidth to reduce latency.
One common example of slow sequential IO is loading multiple independent Hugging Face transformers models in series.
from transformers import CLIPProcessor, CLIPModel, BlipProcessor, BlipForConditionalGeneration

model_a = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor_a = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model_b = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")
processor_b = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
The above snippet does four .from_pretrained loads sequentially. None of the components depends on another being already loaded in memory, so they can instead be loaded concurrently, using a function like this:
from concurrent.futures import ThreadPoolExecutor, as_completed
from transformers import CLIPProcessor, CLIPModel, BlipProcessor, BlipForConditionalGeneration

def load_models_concurrently(load_functions_map: dict) -> dict:
    model_id_to_model = {}
    with ThreadPoolExecutor(max_workers=len(load_functions_map)) as executor:
        future_to_model_id = {
            executor.submit(load_fn): model_id
            for model_id, load_fn in load_functions_map.items()
        }
        for future in as_completed(future_to_model_id.keys()):
            model_id_to_model[future_to_model_id[future]] = future.result()
    return model_id_to_model

components = load_models_concurrently({
    "clip_model": lambda: CLIPModel.from_pretrained("openai/clip-vit-base-patch32"),
    "clip_processor": lambda: CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32"),
    "blip_model": lambda: BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large"),
    "blip_processor": lambda: BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large"),
})
If performing concurrent IO on large file reads does not speed up your cold starts, it’s possible that some part of your function’s code is holding the Python GIL and reducing the efficacy of the multi-threaded executor.