Storing model weights on Modal

Efficiently managing the weights of large models is crucial for optimizing the build times and startup latency of many ML and AI applications.

Our recommended method for working with model weights is to store them in a Modal Volume, which acts as a distributed file system: a “shared disk” that all of your Modal Functions can access.

Storing weights in a Modal Volume

To store your model weights in a Volume, you need to either make the Volume available to a Modal Function that saves the model weights or upload the model weights into the Volume from a client.

Saving model weights into a Modal Volume from a Modal Function

If you’re already generating the weights on Modal, you just need to attach the Volume to your Modal Function, making it available for reading and writing:

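from pathlib import Path

import modal

app = modal.App("model-weights-example")  # example app name
volume = modal.Volume.from_name("model-weights-vol", create_if_missing=True)
MODEL_DIR = Path("/models")

@app.function(gpu="any", volumes={MODEL_DIR: volume})  # attach the Volume
def train_model(data, config):
    from run_training import run_training  # placeholder: your own training code

    model = run_training(config, data)
    model.save(config, MODEL_DIR)  # an ordinary write to the mounted path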

Volumes are attached by including them in a dictionary that maps a path on the remote machine to a modal.Volume object. They look just like a normal file system, so model weights can be saved to them without adding any special code.
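To make the "normal file system" point concrete, here is a minimal sketch that uses only the Python standard library. The helper names `save_weights` and `load_weights`, the file names, and the temporary directory standing in for the Volume mount point are all assumptions for illustration. Inside a Modal Function you would pass `MODEL_DIR` instead.

```python
import json
import tempfile
from pathlib import Path


def save_weights(model_dir: Path, name: str, weights: bytes, config: dict) -> Path:
    """Write weights and a small config file under model_dir / name."""
    target = model_dir / name
    target.mkdir(parents=True, exist_ok=True)
    (target / "weights.bin").write_bytes(weights)
    (target / "config.json").write_text(json.dumps(config))
    return target


def load_weights(model_dir: Path, name: str):
    """Read back what save_weights wrote, using ordinary file reads."""
    target = model_dir / name
    weights = (target / "weights.bin").read_bytes()
    config = json.loads((target / "config.json").read_text())
    return weights, config


# A temporary directory stands in for the Volume mount point here.
with tempfile.TemporaryDirectory() as tmp:
    save_weights(Path(tmp), "demo-model", b"\x00\x01\x02", {"layers": 2})
    roundtrip = load_weights(Path(tmp), "demo-model")
```

Nothing in these helpers is specific to Modal, which is the point: code that writes to a local directory should work unchanged when that directory is a mounted Volume.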

If the model weights are generated outside of Modal and made available over the Internet, for example by an open-weights model provider or your own training job on a dedicated cluster, you can also download them into a Volume from a Modal Function:

@app.function(volumes={MODEL_DIR: volume})
def download_model(model_id):
    import model_hub

    model_hub.download(model_id, local_dir=MODEL_DIR / model_id)

Add Modal Secrets to access weights that require authentication.
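Modal Secrets are exposed to a Function as environment variables, so the download code can read a token with `os.environ`. The sketch below is one way to do that defensively. The helper name `hub_auth_headers`, the variable name `HF_TOKEN`, and the Secret name in the comment are assumptions, so substitute whatever your Secret and model provider actually use.

```python
import os


def hub_auth_headers(env=None, var_name="HF_TOKEN"):
    """Build an Authorization header from a token stored in the environment.

    In a Modal Function, the variable would come from an attached Secret, e.g.
    @app.function(secrets=[modal.Secret.from_name("my-hub-secret")])
    where "my-hub-secret" is a hypothetical Secret name.
    """
    env = os.environ if env is None else env
    token = env.get(var_name)
    if not token:
        raise RuntimeError(
            f"{var_name} is not set; attach a Secret that defines it to this Function"
        )
    return {"Authorization": f"Bearer {token}"}


# Usage with an explicit mapping, so the example runs without a real token.
headers = hub_auth_headers({"HF_TOKEN": "example-token"})
```

Failing early with a clear message tends to be easier to debug than an authentication error raised deep inside a download library.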

See below for more on downloading from the popular Hugging Face Hub.

Uploading model weights into a Modal Volume

Instead of pulling weights into a Modal Volume from inside a Modal Function, you might wish to push weights into Modal from a client, like your laptop or a dedicated training cluster.

For that, you can use the batch_upload method of modal.Volume via the Modal Python client library:

volume = modal.Volume.from_name("model-weights-vol", create_if_missing=True)

@app.local_entrypoint()
def main(local_path: str, remote_path: str):
    with volume.batch_upload() as upload:
        upload.put_directory(local_path, remote_path)

Alternatively, you can upload model weights using the modal volume CLI command:

modal volume put model-weights-vol path/to/model path/on/volume

Mounting cloud buckets as Modal Volumes

If your model weights are already in cloud storage, for example in an S3 bucket, you can connect them to Modal Functions with a CloudBucketMount.

See the guide on cloud bucket mounts for details.

Reading model weights from a Modal Volume

You can read weights from a Volume as you would normally read them from disk, so long as you attach the Volume to your Function.

@app.function(gpu="any", volumes={MODEL_DIR: volume})
def inference(prompt, model_id):
    from load_model import load_model  # placeholder: your own model-loading code

    model = load_model(MODEL_DIR / model_id)
    return model.run(prompt)

Storing weights in the Modal Image

It is also possible to store weights in your Function’s Modal Image, the private file system state that a Function sees when it starts up. The weights might be downloaded via shell commands with Image.run_commands or downloaded using a Python function with Image.run_function.

We recommend storing model weights in a Modal Volume, as described above. Performance is similar for the two methods, but Volumes are more flexible. Images are rebuilt when their definition changes, starting from the changed layer. That increases reproducibility for some builds but leads to unnecessary extra downloads in most cases.
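The cost of layer-based rebuilds can be illustrated with a simplified model of layer caching. This is a sketch of the general idea, not Modal's actual implementation: each layer's cache key is assumed to depend on its own build step plus every step before it, and the function names and build steps are hypothetical.

```python
import hashlib


def layer_keys(steps):
    """Chain-hash build steps so each key depends on all earlier steps."""
    keys, parent = [], ""
    for step in steps:
        parent = hashlib.sha256((parent + "\n" + step).encode()).hexdigest()
        keys.append(parent)
    return keys


def layers_to_rebuild(old_steps, new_steps):
    """Return the steps of new_steps that miss the cache built from old_steps."""
    old, new = layer_keys(old_steps), layer_keys(new_steps)
    for i, key in enumerate(new):
        if i >= len(old) or old[i] != key:
            return new_steps[i:]  # everything from the first mismatch onward
    return []


old = ["pip install torch", "download weights", "pip install fastapi"]
changed_first = ["pip install torch==2.3", "download weights", "pip install fastapi"]
changed_last = ["pip install torch", "download weights", "pip install fastapi uvicorn"]

rebuild_after_early_change = layers_to_rebuild(old, changed_first)
rebuild_after_late_change = layers_to_rebuild(old, changed_last)
```

In this model, editing the first step forces the weight download to run again even though the weights are unchanged, while editing only the last step does not. With a Volume, the weights live outside the Image, so Image changes never trigger a re-download.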

Optimizing model weight reads with @enter

In the above code samples, weights are loaded from disk into memory each time the inference function is run. This isn’t so bad if inference is much slower than model loading (e.g. because it is run on very large datasets) or if the model loading logic is smart enough to skip reloading.

To guarantee a particular model’s weights are only loaded once, you can use the @enter container lifecycle hook to load the weights only when a new container starts.

MODEL_ID = "some-model-id"

@app.cls(gpu="any", volumes={MODEL_DIR: volume})
class Model:
    @modal.enter()
    def setup(self, model_id=MODEL_ID):
        from load_model import load_model  # placeholder: your own model-loading code

        self.model = load_model(MODEL_DIR / model_id)

    @modal.method()
    def inference(self, prompt):
        return self.model.run(prompt)

Note that methods decorated with @enter can’t be passed dynamic arguments.
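The difference between loading on every call and loading once per container can be simulated without Modal. In the sketch below, `CountingLoader` is a hypothetical stand-in for an expensive weight loader, and calling `setup()` by hand plays the role of the @enter hook running at container start.

```python
class CountingLoader:
    """Hypothetical stand-in for an expensive weight loader; counts its calls."""

    def __init__(self):
        self.calls = 0

    def load(self, path):
        self.calls += 1
        return lambda prompt: f"[{path}] {prompt}"


def inference_per_call(loader, prompt):
    # Mirrors the plain-function samples: weights are loaded on every call.
    model = loader.load("/models/demo")
    return model(prompt)


class LoadOnceModel:
    def __init__(self, loader):
        self.loader = loader

    def setup(self):
        # Plays the role of the @enter hook: runs once when the container starts.
        self.model = self.loader.load("/models/demo")

    def inference(self, prompt):
        return self.model(prompt)


prompts = ["a", "b", "c"]

per_call_loader = CountingLoader()
for p in prompts:
    inference_per_call(per_call_loader, p)

once_loader = CountingLoader()
container = LoadOnceModel(once_loader)
container.setup()  # simulated container start
for p in prompts:
    container.inference(p)

print(per_call_loader.calls, once_loader.calls)  # prints: 3 1
```

Three calls cost three loads in the per-call version and one load in the load-once version. The saving grows with the number of calls a container serves.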

If you need to load a single but possibly different model on each container start, you can parameterize your Modal Cls. Below, we use the modal.parameter syntax.

@app.cls(gpu="any", volumes={MODEL_DIR: volume})
class ParameterizedModel:
    model_id: str = modal.parameter()

    @modal.enter()
    def setup(self):
        from load_model import load_model  # placeholder: your own model-loading code

        self.model = load_model(MODEL_DIR / self.model_id)

    @modal.method()
    def inference(self, prompt):
        return self.model.run(prompt)

Storing weights from the Hugging Face Hub on Modal

The Hugging Face Hub has over 1,000,000 models with weights available for download.

The snippet below shows some additional tricks for downloading models from the Hugging Face Hub on Modal.

from pathlib import Path

import modal

app = modal.App("model-weights-example")  # example app name

# create a Volume, or retrieve it if it exists
volume = modal.Volume.from_name("model-weights-vol", create_if_missing=True)
MODEL_DIR = Path("/models")

# define dependencies for downloading model
download_image = (
    modal.Image.debian_slim()
    .pip_install("huggingface_hub[hf_transfer]")  # install fast Rust download client
    .env({"HF_HUB_ENABLE_HF_TRANSFER": "1"})  # and enable it
)

# define dependencies for running model
inference_image = modal.Image.debian_slim().pip_install("transformers")

@app.function(
    volumes={MODEL_DIR: volume},  # "mount" the Volume, sharing it with your function
    image=download_image,  # only download dependencies needed here
)
def download_model(
    repo_id: str = "hf-internal-testing/tiny-random-GPTNeoXForCausalLM",
    revision: str = None,  # include a revision to prevent surprises!
):
    from huggingface_hub import snapshot_download

    # pass the revision through so a pinned commit, tag, or branch is honored
    snapshot_download(repo_id=repo_id, revision=revision, local_dir=MODEL_DIR / repo_id)
    print(f"Model downloaded to {MODEL_DIR / repo_id}")