Storing model weights on Modal

Efficiently managing the weights of large models is crucial for optimizing the build times and startup latency of ML and AI applications. This page discusses best practices for handling model weights with Modal, focusing on two key patterns:

  1. Storing weights in container images when they are built, with @build
  2. Storing weights in a distributed file system, with Modal Volumes

The first pattern leads to faster downloads and startup times, but it is only possible for weights that are known at build time, like the weights of pretrained models.

In both cases, you can further optimize latencies by loading weights into memory at container startup.

Pattern #1 - Storing weights in container images

Whenever possible, you should store weights in your image as it is built, just as you store your code dependencies. Modal’s custom container runtime stack is designed to make builds and loads of large images as fast as possible.

In the code below, we demonstrate this pattern. We define a Python method, download_model_to_folder, that downloads the weights of a model from Hugging Face. Notice that the method has been annotated with the @build decorator. Methods of a modal.Cls that are decorated with @build are run while your container image is being built, just like commands to install dependencies with .pip_install. You can also use the run_function method on the Image class for the same purpose.

import os

import modal
from modal import build

app = modal.App()

MODEL_DIR = "/model"

image = (  # start building the image
    modal.Image.debian_slim()
    .pip_install("huggingface_hub", "other_dependencies")
)

# ... other setup

@app.cls(gpu="any", image=image)
class Model:
    @build()  # add another step to the image build
    def download_model_to_folder(self):
        from huggingface_hub import snapshot_download

        os.makedirs(MODEL_DIR, exist_ok=True)
        snapshot_download("stabilityai/sdxl-turbo", local_dir=MODEL_DIR)

Pre-loading weights into memory with @enter

Because they are part of the container image, your model weights will be available as files when your functions start, just like your code dependencies. But model weights must still be loaded into memory before they can be used for inference. For models with billions of weights, that can still take several seconds.

To avoid spending that time on every input, you can load the weights into memory when your Modal containers start, but before they begin running your function, with another decorator: @enter. A method decorated with the @enter decorator will only run once at container startup.

    @enter()  # runs once, when the container starts
    def setup(self):
        self.pipe = AutoPipelineForImage2Image.from_pretrained("stabilityai/sdxl-turbo")

    @method()
    def inference(self, prompt):
        return self.pipe(prompt)
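The payoff of this pattern can be pictured in plain Python (WarmModel and its methods below are illustrative stand-ins, not Modal APIs): the expensive load runs once, and every subsequent call reuses the loaded object.

```python
class WarmModel:
    """Plain-Python sketch of the once-per-container pattern behind @enter."""

    load_count = 0  # counts how many times the expensive load runs

    def setup(self):
        # stand-in for the expensive weight load that @enter performs at startup
        WarmModel.load_count += 1
        self.pipe = lambda prompt: f"image for {prompt!r}"

    def inference(self, prompt):
        return self.pipe(prompt)  # every call reuses the already-loaded pipeline

model = WarmModel()
model.setup()  # runs once, at "container startup"
outputs = [model.inference(p) for p in ["a cat", "a dog", "a fish"]]
print(WarmModel.load_count)  # → 1: three inference calls, one load
```

Without the setup step, each inference call would pay the load cost again.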

You can also stack the @build and @enter decorators on the same method, so that the same code both bakes the weights into the image at build time and loads them into memory at container startup. For even further optimization of startup times with @enter, consider the (beta) memory snapshot feature.
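Stacking two decorators simply registers the same method for both phases. A plain-Python sketch of that idea (the registries here are illustrative, not Modal internals):

```python
# illustrative phase registries, not Modal internals
phases = {"build": [], "enter": []}

def build(fn):
    phases["build"].append(fn.__name__)  # register for the image-build phase
    return fn

def enter(fn):
    phases["enter"].append(fn.__name__)  # register for container startup
    return fn

class Model:
    @build
    @enter
    def setup(self):
        # one method, run in both phases: at build time it can bake
        # weights into the image, at startup it loads them into memory
        pass

print(phases)  # → {'build': ['setup'], 'enter': ['setup']}
```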

Pattern #2 - Storing weights in Volumes

Not all applications use model weights that are known when the app’s container image is built.

For example, you might be

  • serving models that are regularly fine-tuned
  • serving too many different large models from one app to store them in a single image
  • training models on the fly as your app runs

In each case, different components of your application will need to store, retrieve, and communicate weights over time. For this, we recommend Modal Volumes, which act as a distributed file system, a “shared disk” all of your Modal functions can access.

To store your model weights in a Volume, you need to make the Volume available to a function that creates or retrieves the model weights, as in the snippet below.

import modal

app = modal.App()

# create a Volume, or retrieve it if it exists
volume = modal.Volume.from_name("model-weights-vol", create_if_missing=True)
MODEL_DIR = "/vol/models"

@app.function(
    volumes={MODEL_DIR: volume},  # "mount" the Volume, sharing it with your function
    _allow_background_volume_commits=True,  # use this flag when writing large files like model weights
)
def run_training():
    model = train(...)
    save(MODEL_DIR, model)
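The save/load round-trip through the shared directory can be sketched with the standard library alone (train, save, and load_model here are stand-ins for your own training and serialization code, and a temporary directory stands in for the mounted Volume path):

```python
import os
import pickle
import tempfile

MODEL_DIR = tempfile.mkdtemp()  # stand-in for the mounted Volume path

def train():
    # stand-in for real training: "weights" are just a dict of floats
    return {"layer1": [0.1, 0.2], "layer2": [0.3]}

def save(model_dir, model):
    os.makedirs(model_dir, exist_ok=True)
    with open(os.path.join(model_dir, "weights.pkl"), "wb") as f:
        pickle.dump(model, f)

def load_model(model_dir):
    with open(os.path.join(model_dir, "weights.pkl"), "rb") as f:
        return pickle.load(f)

model = train()
save(MODEL_DIR, model)  # the training function writes to the shared directory
assert load_model(MODEL_DIR) == model  # any other function can read it back
```

On Modal, the only difference is that MODEL_DIR is backed by a Volume, so writes from one function become readable by every other function that attaches it.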

You can then read those weights from the Volume as you would normally read them from disk, so long as you attach the Volume to your function or class.

@app.cls(gpu="any", volumes={MODEL_DIR: volume})
class Model:
    @method()
    def inference(self, prompt):
        model = load_model(MODEL_DIR)
        return model(prompt)

In the above code sample, weights are loaded into memory each time the inference function is run. You can once again use @enter to load weights only once, at container boot.

    @enter()
    def setup(self):
        self.model = load_model(MODEL_DIR)

    @method()
    def inference(self, prompt):
        return self.model(prompt)

Pre-loading weights for multiple models dynamically with __init__ and @enter

Finally, you might be serving several different models from the same app and so need to dynamically determine which weights to load.

Even in this case, you can avoid loading the weights on every inference. Just define an __init__ method on the modal.Cls with arguments that identify which model to use, and then use the @enter method decorator to load those weights into memory:

@app.cls(gpu="any", volumes={MODEL_DIR: volume})
class Model:
    def __init__(self, model_id):
        self.model_id = model_id

    @enter()
    def setup(self):
        volume.reload()  # fetch latest changes to the Volume
        self.model = load_model(MODEL_DIR, self.model_id)

    @method()
    def inference(self, prompt):
        return self.model(prompt)
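To see why parametrizing with __init__ avoids repeated loads, here is a plain-Python sketch (ModelSketch and its registry are illustrative, not Modal APIs): each model_id gets its own instance that loads its weights once, and repeated calls against the same instance reuse them.

```python
class ModelSketch:
    """Plain-Python sketch: one load per parameterized instance, reused across calls."""

    loads = []  # records every simulated weight load

    def __init__(self, model_id):
        self.model_id = model_id

    def setup(self):
        # in Modal, @enter runs this once when the container
        # for this parameterization starts
        ModelSketch.loads.append(self.model_id)
        self.model = lambda prompt: f"{self.model_id}: {prompt}"

    def inference(self, prompt):
        return self.model(prompt)

turbo = ModelSketch("sdxl-turbo")
turbo.setup()
turbo.inference("a lighthouse")
turbo.inference("a forest")      # no reload for the second call

base = ModelSketch("sdxl-base")  # a different model gets its own instance
base.setup()
base.inference("a lighthouse")

print(ModelSketch.loads)  # → ['sdxl-turbo', 'sdxl-base']
```

On Modal, containers for different parameterizations are kept separate, so each one pays its load cost once and then serves its own model warm.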