Cold start performance
For a deployed function or a web endpoint, Modal will spin up as many containers as needed to handle the current number of concurrent requests. Starting up a container incurs a cold-start time of ~1s. Any logic in global scope (such as imports) and the container enter function is executed next. In the case of loading large models from the image, this can take a few seconds depending on the size of the model, because the file is copied over the network to the worker running your job.
After the cold start, subsequent requests to the same container see lower response latency (~50-200ms), until the container is shut down after a period of inactivity. Modal currently exposes two parameters to control how many cold starts users experience: container_idle_timeout and keep_warm.
Container idle timeout
By default, Modal containers spin down after 60 seconds of inactivity. This can be overridden explicitly by setting the container_idle_timeout value on the @function decorator. It accepts any integer value between 2 and 1200, measured in seconds.
import modal

app = modal.App()  # Note: prior to April 2024, "app" was called "stub"

@app.function(container_idle_timeout=300)
def my_idle_f():
    return {"hello": "world"}
Warm pool
If you want to have some containers running at all times to mitigate the cold-start penalty, you can set the keep_warm value on the @function decorator. This configures a minimum number of containers that will always be up for your function, but Modal will still scale up (and spin down) more containers if the demand for your function exceeds the keep_warm value, as usual.
from modal import App, web_endpoint

app = App()  # Note: prior to April 2024, "app" was called "stub"

@app.function(keep_warm=3)
@web_endpoint()
def my_warm_f():
    return {"hello": "world"}
Functions with slow start-up and keep_warm
The guarantee that keep_warm provides is that there are always at least n containers up that have finished starting up. If your function does expensive or slow initialization the first time it receives an input (e.g. if you use a pre-trained model that needs to be loaded into memory on first use), those function calls will still be slow.
To avoid this, you can use a container enter method to perform the expensive initialization. This will ensure that the initialization is performed before the container is deemed ready for the warm pool.
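A minimal sketch of this pattern uses Modal's class-based functions with a method decorated by @modal.enter(), which runs once when the container starts. The load_weights helper here is a hypothetical stand-in for your own initialization code:

```python
import modal

app = modal.App()

@app.cls(keep_warm=2)
class Model:
    @modal.enter()
    def setup(self):
        # Expensive one-time initialization (e.g. loading model weights)
        # runs here, so it completes before the container is considered
        # part of the warm pool.
        self.weights = load_weights()  # hypothetical helper

    @modal.method()
    def predict(self, x):
        # Requests served after setup() see no initialization latency.
        return self.weights
```

With this structure, the warm-pool containers have already paid the initialization cost by the time they receive their first input.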
Memory Snapshot
Modal snapshotting is a developer preview feature that can significantly reduce cold start times. Refer to the Memory Snapshot page for details.
Perform concurrent IO
Often Modal applications need to read large files into memory (e.g. model checkpoints) before they can process inputs. Where feasible, these large file reads should happen concurrently, not sequentially. By performing concurrent IO, your application takes advantage of our platform's high disk and network bandwidth, reducing the IO-latency portion of the cold-start penalty.
One common example of slow sequential IO is loading multiple independent Hugging Face transformers models in series.
from transformers import CLIPProcessor, CLIPModel, BlipProcessor, BlipForConditionalGeneration

model_a = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor_a = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model_b = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")
processor_b = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
The above snippet unnecessarily performs four .from_pretrained loads sequentially. None of the components depend on another being already loaded in memory, so they can instead be loaded concurrently using a function like this:
from concurrent.futures import ThreadPoolExecutor, as_completed
from transformers import CLIPProcessor, CLIPModel, BlipProcessor, BlipForConditionalGeneration

def load_models_concurrently(load_functions_map: dict) -> dict:
    model_id_to_model = {}
    with ThreadPoolExecutor(max_workers=len(load_functions_map)) as executor:
        future_to_model_id = {
            executor.submit(load_fn): model_id
            for model_id, load_fn in load_functions_map.items()
        }
        for future in as_completed(future_to_model_id.keys()):
            model_id_to_model[future_to_model_id[future]] = future.result()
    return model_id_to_model

components = load_models_concurrently({
    "clip_model": lambda: CLIPModel.from_pretrained("openai/clip-vit-base-patch32"),
    "clip_processor": lambda: CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32"),
    "blip_model": lambda: BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large"),
    "blip_processor": lambda: BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large"),
})
If performing concurrent IO on large file reads does not speed up your cold starts, it’s possible that some part of your Function’s code is holding the Python GIL and reducing the efficacy of the multi-threaded executor.
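To illustrate the GIL effect with a self-contained sketch (unrelated to any specific model-loading code): pure-Python CPU-bound work holds the GIL for its whole duration, so running it in a thread pool yields little or no speedup, unlike the IO-bound reads above.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def cpu_bound(n: int) -> int:
    # Pure-Python arithmetic holds the GIL for its whole duration.
    return sum(i * i for i in range(n))

with ThreadPoolExecutor(max_workers=4) as ex:
    start = time.monotonic()
    results = list(ex.map(cpu_bound, [200_000] * 4))
    threaded = time.monotonic() - start

start = time.monotonic()
sequential_results = [cpu_bound(200_000) for _ in range(4)]
sequential = time.monotonic() - start

# threaded is roughly equal to sequential here: the four tasks
# serialize on the GIL instead of overlapping.
```

If your "IO" code spends significant time in pure-Python work like this (e.g. deserialization in Python rather than in a C extension), threads alone will not hide it.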
In general, if your application is performing multiple large and independent reads on cold start, attempt to make those reads concurrent using threads.
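A minimal, self-contained illustration of that advice, simulating blocking reads with time.sleep (which releases the GIL, just as real file and socket IO does):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def read_checkpoint(name: str) -> str:
    # Simulate a blocking disk/network read; sleeping releases the GIL,
    # just as real file and socket IO does.
    time.sleep(0.2)
    return f"{name}-weights"

names = ["model_a", "model_b", "model_c"]

start = time.monotonic()
with ThreadPoolExecutor(max_workers=len(names)) as ex:
    checkpoints = dict(zip(names, ex.map(read_checkpoint, names)))
elapsed = time.monotonic() - start

# The three 0.2s reads overlap, so elapsed is ~0.2s rather than ~0.6s.
```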