Checkpointing (beta)

Checkpointing improves cold-boot performance by creating a memory checkpoint of your function and then restoring the memory checkpoint when you need it.

Checkpointing happens after your function’s import sequence. During import, your function reads many files from the file system. Some of these files are very large; torch, for instance, is hundreds of MB. We create a memory checkpoint of your function after it has finished importing packages, but before it begins processing inputs. We save that checkpoint as a file. Then, every time your function starts, we restore its memory from the checkpoint. Your function skips all the file reads associated with its imports in favor of a single checkpoint file. The result is improved cold boot performance: functions with checkpointing enabled start 2-3x faster than functions without it.

In most cases, you don’t need to modify your function to take advantage of checkpointing (see the known limitations below). This is a beta feature. Let us know in Modal Slack if you find any issues.

Enabling checkpointing

Checkpointing is currently in beta and is available as a flag in the function decorator. You can enable it as follows:

# checkpoint_example.py
import modal

stub = modal.Stub("checkpointing-example-simple")

@stub.function(checkpointing_enabled=True)
def my_func():
    print("hello")

Then deploy the function with modal deploy.
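For example, assuming the file above is saved as checkpoint_example.py, you would run:

modal deploy checkpoint_example.py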

Keep the following in mind when using checkpointing:

  • Modal may take memory checkpoints during your function’s first few runs, not just on its first run (see the Checkpoint compatibility section below).
  • Creating a checkpoint adds latency to a function’s start time, so expect your function to start more slowly during its first invocations. Subsequent runs are much faster.

Checkpoint compatibility

Modal will create a memory checkpoint for every new version of your function. Deploying your function anew, even without code changes, will trigger a new checkpointing operation when you run your function.

Additionally, you may observe in your application logs that your function is checkpointed multiple times during its first few invocations. This happens because Modal creates a memory checkpoint for every CPU type and runtime version in our fleet. We typically need only 1-3 checkpoints to cover the entire fleet. The cold boot benefits should greatly outweigh the cost of creating multiple checkpoints.

Known limitations

Checkpointing is still in beta; please report any issues in Modal Slack.

No GPUs available during checkpointing

It’s currently not possible to checkpoint GPU memory, so we avoid exposing GPU devices to your function during the checkpointing stage. NVIDIA drivers are available, but no GPU devices are. This can be a problem if you need a GPU before the checkpoint is taken — for example, to compile a package. We suggest using the Image.run_function method and storing outputs on disk as part of your image. You can then load these outputs into CPU memory and successfully checkpoint your function. When your function is invoked, you can move the objects into GPU memory.
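As a rough sketch of this pattern, the example below downloads model files to disk at image build time with Image.run_function, loads them into CPU memory at import time so they are captured by the checkpoint, and only moves them to the GPU inside the function body. The model name, the download_model helper, and the modal.is_local() guard are illustrative assumptions; adapt them to your own setup and Modal client version.

# checkpoint_gpu_example.py
import modal

stub = modal.Stub("checkpointing-example-gpu")

def download_model():
    # Runs once at image build time; write whatever you need to disk.
    from huggingface_hub import snapshot_download
    snapshot_download("gpt2", local_dir="/model")

image = (
    modal.Image.debian_slim()
    .pip_install("torch", "transformers", "huggingface_hub")
    .run_function(download_model)
)

if not modal.is_local():
    # Import-time work runs before the checkpoint is taken, so this CPU copy
    # of the model is restored from the checkpoint on later cold boots.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained("/model")
    model = AutoModelForCausalLM.from_pretrained("/model")

@stub.function(image=image, gpu="any", checkpointing_enabled=True)
def generate(prompt: str) -> str:
    # GPUs are only available here, at invocation time, after restore.
    gpu_model = model.to("cuda")
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    output = gpu_model.generate(**inputs, max_new_tokens=20)
    return tokenizer.decode(output[0])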

Filesystem writes are not checkpointed

Currently, only the container’s memory is checkpointed. If your function modifies the filesystem during the checkpointing stage, those writes are lost, which can break your function code on restore.

We are actively working to incorporate filesystem modifications into checkpoints so that this failure case is removed.
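As an illustration of this failure mode, the sketch below writes a hypothetical cache file at import time (during the checkpointed phase) and guards the read at invocation time, since the file may be gone after restore:

import json
import os

CACHE_PATH = "/tmp/warmup_cache.json"

# Written during the checkpointed import phase: the in-memory state survives
# the checkpoint, but this file may not be present after restore.
with open(CACHE_PATH, "w") as f:
    json.dump({"warmed_up": True}, f)

def load_cache():
    # Defensive read: fall back to a default if the write was lost on restore.
    if os.path.exists(CACHE_PATH):
        with open(CACHE_PATH) as f:
            return json.load(f)
    return {"warmed_up": False}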

Cached GPU device queries

PyTorch’s torch.cuda.device_count() function caches its result after the first call. With checkpointing, this produces incorrect results because GPU availability changes between the checkpointing stage and restore (see the No GPUs available during checkpointing section above).

A workaround is to patch torch to use a non-caching device count query function:

import torch
torch.cuda.device_count = torch.cuda._device_count_nvml

Randomness and uniqueness

If your application depends on unique state, evaluate your function code and verify that it is resilient to checkpointing. For example, if a variable is randomly initialized during the checkpointing stage, it will have the same value after every restore, possibly breaking uniqueness assumptions in the subsequent function code.
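As a rough sketch of the distinction, using hypothetical names: randomness that must be unique per container or per call should be generated inside the function body (after restore) rather than at import time.

import uuid

# Generated during the checkpointed import phase: every restored container
# sees the same value, which may violate uniqueness assumptions.
CONTAINER_TOKEN = uuid.uuid4().hex

def handle_request():
    # Generated after restore, so it is unique per invocation.
    request_id = uuid.uuid4().hex
    return request_id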