Memory Snapshot (beta)

You can improve cold-start performance for some functions using the Memory Snapshot feature.

Snapshots happen after your function’s import sequence. During import, your app reads many files from the file system, which can be expensive. For instance, importing torch requires reading hundreds of MiB from disk.

Memory snapshots are created after a function is done importing packages but before it is called. We save that snapshot as a file. Then, every time your function is invoked, we restore memory from the snapshot. The result is increased cold boot performance: functions with memory snapshots enabled typically start 1.5-3x faster.

You don’t need to modify your function to take advantage of snapshotting in most cases (see below).

This is a beta feature. Let us know in Modal Slack if you find any issues.

Enabling automatic snapshots

Memory Snapshot is available as a flag on the function decorator. You can enable it as follows:

import modal

app = modal.App("example-memory-snapshot")  # Note: prior to April 2024, "app" was called "stub"


@app.function(enable_memory_snapshot=True)
def my_func():
    print("hello")

Then deploy the app with modal deploy.
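
Once deployed, you can call the function from other Python code. A minimal sketch, assuming the app name above (modal.Function.lookup resolves a deployed function by its app and function names):

import modal

# Resolve the deployed function by app name and function name,
# then invoke it remotely.
my_func = modal.Function.lookup("example-memory-snapshot", "my_func")
my_func.remote()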

Keep the following in mind when using Memory Snapshot:

  • Modal may take a snapshot only after your function runs the first few times, not necessarily on the first run (see the Snapshot compatibility section below).
  • Creating memory snapshots adds latency to a function’s start time, so expect your function to start more slowly during its first invocations. Subsequent runs should be faster.

Controlling snapshot memory

You can also snapshot programs that use a lot of memory via the class interface. Among other things, this lets you start programs that load large model weights faster.

With memory snapshots enabled, you can load model weights into CPU memory before creating the snapshot. On every subsequent cold boot, your function will resume from that point. Because we serialize the memory state—including the model weights—in an efficient format, it can start up in less time than it originally took to load the model.

This example loads BGE embeddings into CPU memory and creates a memory snapshot. When your Function cold boots, it moves those weights onto a GPU in the setup() method.

You can use this feature in classes by setting enable_memory_snapshot=True and marking the methods you want to run before the snapshot is saved with @enter(snap=True). Conversely, decorate methods with @enter(snap=False) when you want them to run on every cold boot, after the snapshot has been created or restored.

import modal

image = (
    modal.Image.debian_slim()
        .pip_install("sentence-transformers")
)
app = modal.App("sentence-transformers", image=image)  # Note: prior to April 2024, "app" was called "stub"


with image.imports():
    from sentence_transformers import SentenceTransformer


@app.cls(
    gpu=modal.gpu.A10G(),
    enable_memory_snapshot=True,
)
class Embedder:

    # model_id = "BAAI/bge-large-en-v1.5"
    model_id = "BAAI/bge-small-en-v1.5"

    @modal.build()
    def build(self):
        model = SentenceTransformer(self.model_id)
        model.save("/model.bge")

    @modal.enter(snap=True)
    def load(self):
        # Create a memory snapshot with the model loaded in CPU memory.
        self.model = SentenceTransformer("/model.bge", device="cpu")

    @modal.enter(snap=False)
    def setup(self):
        # Move the model to a GPU before doing any work.
        self.model.to("cuda")

    @modal.method()
    def run(self, sentences: list[str]):
        embeddings = self.model.encode(sentences, normalize_embeddings=True)
        print(embeddings)


@app.local_entrypoint()
def main():
    sentences = ["what is the meaning of life?"]
    Embedder().run.remote(sentences)


if __name__ == "__main__":
    cls = modal.Cls.lookup("sentence-transformers", "Embedder")

    sentences = ["what is the meaning of life?"]
    cls().run.remote(sentences)

This reduces the time it takes for our app to boot by about 3x, from ~14s to ~4.8s.

Snapshot compatibility

Modal will create memory snapshots for every new version of your function. Changing your function or updating its dependencies will trigger a new snapshotting operation when you run your function anew.

Additionally, you may observe in your application logs that your function is memory snapshotted multiple times during its first few invocations. This happens because Modal creates a memory snapshot for every CPU type and runtime version in our fleet. We typically need 3-5 snapshots to cover the entire fleet. The cold boot benefits should greatly outweigh the cost of creating multiple snapshots.

Known limitations

Memory Snapshot is still in beta. Please report any issues on our community Slack server.

No GPUs available during the snapshotting phase

It’s currently not possible to snapshot GPU memory. We avoid exposing GPU devices to your function during the snapshotting stage (i.e. in methods decorated with @enter(snap=True)): NVIDIA drivers are available, but no GPU devices are. This can be a problem if you need a GPU during that stage, for example to compile a package. We suggest using the @build decorator to store outputs on disk as part of your image. You can then load these into CPU memory in a @enter(snap=True) method and snapshot your function. When your function is invoked, you can move the loaded objects to GPU memory, as the Embedder example above does.

If your program checks whether GPUs are available during the snapshot stage and again after restore, it will get different results in each stage.

from modal import App, enter

app = App()

@app.cls(enable_memory_snapshot=True)
class GPUAvailability:

    @enter(snap=True)
    def no_gpus_available_during_snapshots(self):
        import torch
        print(f"GPUs available: {torch.cuda.is_available()}")

    @enter(snap=False)
    def gpus_available_during_restore(self):
        import torch
        print(f"GPUs available: {torch.cuda.is_available()}")

In the example above, GPUs are not available when no_gpus_available_during_snapshots() is called but are available when your app is restored and gpus_available_during_restore() is called.

Filesystem writes are not snapshotted

Currently, only the container’s memory is snapshotted. Your function may modify the filesystem during the snapshotting phase, but those writes are not captured in the snapshot, and losing them can break your function code on restore.

We are actively working to incorporate filesystem modifications into snapshots so that this failure case is removed.
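
As an illustration, a hypothetical class like the one below would break on restore: the file written in the @enter(snap=True) method is not part of the snapshot, so it is missing in restored containers.

import modal

app = modal.App("filesystem-writes-example")


@app.cls(enable_memory_snapshot=True)
class FilesystemWrites:

    @modal.enter(snap=True)
    def write_state(self):
        # This write happens during the snapshotting phase. Only memory
        # is snapshotted, so the file is lost when the snapshot is restored.
        with open("/tmp/state.txt", "w") as f:
            f.write("some state")

    @modal.method()
    def read_state(self):
        # Raises FileNotFoundError in restored containers.
        with open("/tmp/state.txt") as f:
            return f.read()

A workaround is to recreate any files you need in a @enter(snap=False) method, which runs on every cold boot after restore.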

Cached GPU device queries

PyTorch’s torch.cuda.device_count() function caches its result after the first call. This produces incorrect results with snapshotting, because GPU availability changes between the snapshot and restore stages (see the No GPUs available during the snapshotting phase section above).

A workaround is to patch torch to use a non-caching device count query function:

torch.cuda.device_count = torch.cuda._device_count_nvml
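
For instance, you might apply the patch before the snapshot is taken, so that all later calls use the non-caching query. A sketch; note that torch.cuda._device_count_nvml is a private PyTorch function and may change between releases:

import modal

image = modal.Image.debian_slim().pip_install("torch")
app = modal.App("device-count-example", image=image)


@app.cls(gpu=modal.gpu.A10G(), enable_memory_snapshot=True)
class DeviceCount:

    @modal.enter(snap=True)
    def patch(self):
        import torch
        # The patched module state is captured in the snapshot, so
        # post-restore calls query NVML instead of a stale cache.
        torch.cuda.device_count = torch.cuda._device_count_nvml

    @modal.enter(snap=False)
    def report(self):
        import torch
        print(f"GPU count after restore: {torch.cuda.device_count()}")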

Randomness and uniqueness

If your application depends on the uniqueness of some state, you must evaluate your function code and verify that it is resilient to snapshotting. For example, if a variable is randomly initialized and then snapshotted, that variable will be identical after every restore, possibly breaking uniqueness expectations in the subsequent function code.
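
For instance, in the hypothetical sketch below, every restored container shares the same snapshot_id, because it was generated before the snapshot was taken, while container_id, generated in a @enter(snap=False) method, is fresh on every cold boot:

import uuid

import modal

app = modal.App("uniqueness-example")


@app.cls(enable_memory_snapshot=True)
class Worker:

    @modal.enter(snap=True)
    def init(self):
        # Captured in the snapshot: identical in every restored container.
        self.snapshot_id = uuid.uuid4()

    @modal.enter(snap=False)
    def regenerate(self):
        # Runs after every restore: unique per container.
        self.container_id = uuid.uuid4()

    @modal.method()
    def ids(self):
        return str(self.snapshot_id), str(self.container_id)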