Volumes (beta)
The modal.Volume is a mutable volume built for high-performance file serving. Like the modal.NetworkFileSystem, these volumes can be simultaneously attached to multiple Modal functions, supporting concurrent reading and writing. But unlike the modal.NetworkFileSystem, the modal.Volume has been designed for fast reads and does not automatically synchronize writes between mounted volumes.
The modal.Volume works best with write-once, read-many I/O workloads.
Volumes work best when they contain fewer than 50,000 files and directories. The latency to attach or modify a volume scales linearly with the number of files in the volume, and past a few tens of thousands of files the linear component starts to dominate the fixed overhead.
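To check a directory tree against this guideline before copying it into a volume, a small standard-library helper is enough. This is a minimal sketch: the names count_entries, within_guideline, and ENTRY_GUIDELINE are hypothetical and are not part of the Modal API.

```python
import os

# Guideline from the text above; not a hard limit enforced by this code.
ENTRY_GUIDELINE = 50_000


def count_entries(root):
    """Count the files and directories below root (root itself is excluded)."""
    total = 0
    for _dirpath, dirnames, filenames in os.walk(root):
        total += len(dirnames) + len(filenames)
    return total


def within_guideline(root, limit=ENTRY_GUIDELINE):
    """Return True if the tree under root has fewer than limit entries."""
    return count_entries(root) < limit
```

Running within_guideline on the directory you plan to store gives an early warning that attach and modify latency may start to grow.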
Volume basics
Unlike a networked filesystem, you need to explicitly reload the volume to see changes made since it was first mounted. This reload is handled by invoking the .reload() method on a volume object.
Similarly, you need to explicitly commit any changes you make to the volume for the changes to become visible outside the current task. This is handled by invoking the .commit() method on a volume object.
At container creation time the latest state of an attached volume is mounted. If the volume is then subsequently modified by a commit operation in another running container, that volume modification won’t become available until the original container does a .reload().
Consider this example which demonstrates the effect of a reload:
import pathlib

import modal

stub = modal.Stub()
stub.volume = modal.Volume.new()

p = pathlib.Path("/root/foo/bar.txt")


@stub.function(volumes={"/root/foo": stub.volume})
def f():
    p.write_text("hello")
    print(f"Created {p=}")
    stub.volume.commit()  # Persist changes
    print(f"Committed {p=}")


@stub.function(volumes={"/root/foo": stub.volume})
def g(reload: bool = False):
    if reload:
        stub.volume.reload()  # Fetch latest changes
    if p.exists():
        print(f"{p=} contains '{p.read_text()}'")
    else:
        print(f"{p=} does not exist!")


@stub.local_entrypoint()
def main():
    g.call()  # 1. container for `g` starts
    f.call()  # 2. container for `f` starts, commits file
    g.call(reload=False)  # 3. reuses container for `g`, no reload
    g.call(reload=True)  # 4. reuses container, but reloads to see file.
This example produces the following output:
p=PosixPath('/root/foo/bar.txt') does not exist!
Created p=PosixPath('/root/foo/bar.txt')
Committed p=PosixPath('/root/foo/bar.txt')
p=PosixPath('/root/foo/bar.txt') does not exist!
p=PosixPath('/root/foo/bar.txt') contains 'hello'
This code runs two containers, one for f and one for g. Only the last function invocation reads the file created and committed by f because it was configured to reload.
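The same sequence can be traced locally with a toy in-memory model. This is only an illustration of the commit and reload behavior described above, not how Modal implements volumes; ToyStore and ToyMount are hypothetical names.

```python
class ToyStore:
    """Stands in for the committed state of a volume (hypothetical)."""

    def __init__(self):
        self.files = {}


class ToyMount:
    """A container's local view of the store, snapshotted at start."""

    def __init__(self, store):
        self.store = store
        self.view = dict(store.files)  # latest committed state at creation

    def write(self, path, text):
        self.view[path] = text  # visible only in this mount until commit()

    def commit(self):
        self.store.files.update(self.view)  # publish local changes

    def reload(self):
        self.view = dict(self.store.files)  # fetch latest committed state

    def exists(self, path):
        return path in self.view


store = ToyStore()
g_mount = ToyMount(store)  # 1. container for g starts
f_mount = ToyMount(store)  # 2. container for f starts
f_mount.write("bar.txt", "hello")
f_mount.commit()
seen_without_reload = g_mount.exists("bar.txt")  # 3. g keeps its old snapshot
g_mount.reload()
seen_after_reload = g_mount.exists("bar.txt")  # 4. g now sees the file
```

The toy model deliberately ignores the case where a mount has uncommitted local changes at reload time, since the text above does not describe it.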
Model serving
A single ML model can be served by simply baking it into a modal.Image at build time using run_function. But if you have dozens of models to serve, or otherwise need to decouple image builds from model storage and serving, use a modal.Volume.
Volumes can be used to save a large number of ML models and later serve any one of them at runtime with much better performance than can be achieved with a modal.NetworkFileSystem.
The snippet below shows the basic structure of the solution.
import modal

stub = modal.Stub()
stub.volume = modal.Volume.persisted("model-store")
model_store_path = "/vol/models"

# train, save, load_model, and NotFound are placeholders for your own code.


@stub.function(volumes={model_store_path: stub.volume}, gpu="any")
def run_training():
    model = train(...)
    save(model_store_path, model)
    stub.volume.commit()  # Persist changes


@stub.function(volumes={model_store_path: stub.volume})
def inference(model_id: str, request):
    try:
        model = load_model(model_store_path, model_id)
    except NotFound:
        stub.volume.reload()  # Fetch latest changes
        model = load_model(model_store_path, model_id)
    return model.run(request)
Model checkpointing
Checkpoints are snapshots of an ML model and can be configured by the callback functions of ML frameworks. You can use saved checkpoints to restart a training job from the last saved checkpoint. This is particularly helpful in managing preemption.
Huggingface transformers
Use this callback class to have your checkpoints committed to your volume:
from transformers import TrainerCallback


class CheckpointCallback(TrainerCallback):
    def __init__(self, volume):
        self.volume = volume

    def on_save(self, args, state, control, **kwargs):
        if state.is_world_process_zero:
            print("running commit on modal.Volume after model checkpoint")
            self.volume.commit()
This callback class should be included in the callbacks argument to a Huggingface transformers Trainer:
import pathlib

VOL_MOUNT_PATH = pathlib.Path("/vol")


@stub.function(
    gpu="A10g",
    timeout=7_200,
    volumes={VOL_MOUNT_PATH: stub.volume},
)
def finetune():
    from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

    ...
    training_args = Seq2SeqTrainingArguments(
        output_dir=str(VOL_MOUNT_PATH / "model"),
        ...
    )
    trainer = Seq2SeqTrainer(
        model=model,
        args=training_args,
        callbacks=[CheckpointCallback(stub.volume)],
        train_dataset=tokenized_xsum_train,
        eval_dataset=tokenized_xsum_test,
    )
As shown above, you must also set the Trainer’s output_dir to write into the volume’s mount location.
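To restart from the last saved checkpoint, the training function first needs to find it inside output_dir. The sketch below assumes the Trainer's default layout, where each checkpoint is saved in a directory named checkpoint-<step>; the helper name latest_checkpoint is hypothetical.

```python
import pathlib
import re

# Matches the Trainer's default checkpoint directory names, e.g. checkpoint-500.
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(output_dir):
    """Return the checkpoint directory with the highest step, or None."""
    output_dir = pathlib.Path(output_dir)
    if not output_dir.is_dir():
        return None
    best_step, best_path = -1, None
    for child in output_dir.iterdir():
        match = _CHECKPOINT_RE.match(child.name)
        if child.is_dir() and match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), child
    return best_path
```

Inside finetune, the result could be passed on as trainer.train(resume_from_checkpoint=str(path)) when a checkpoint is found, falling back to a plain trainer.train() when the helper returns None.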
Filesystem consistency
Concurrent modification
Concurrent modification is supported, but concurrent modifications of the same files should be avoided. If the same file is modified concurrently, the last write wins: any data the last writer didn’t have when committing changes will be lost!
As a result, volumes are typically not a good fit for use cases where you need to make concurrent modifications to the same file (nor is distributed file locking supported).
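One common way to sidestep the problem is to have every writer create its own file rather than update a shared one. The following is a minimal sketch using only the standard library; unique_output_path and write_result are hypothetical helper names, and the directory is assumed to live under the volume's mount path.

```python
import pathlib
import uuid


def unique_output_path(directory, stem, suffix=".json"):
    """Return a path under directory that no other writer will pick."""
    directory = pathlib.Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    # A random UUID makes collisions between concurrent writers negligible.
    return directory / f"{stem}-{uuid.uuid4().hex}{suffix}"


def write_result(directory, stem, text):
    """Write text to a fresh, uniquely named file and return its path."""
    path = unique_output_path(directory, stem)
    path.write_text(text)
    return path
```

Each task still has to commit the volume after writing, and readers still have to reload it, but no two writers ever touch the same file.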
Busy volume errors
Volume commits cannot be performed while volume files are still open for writing. The commit operation will fail with “volume busy”. The following is a simple example of how a “volume busy” error can occur:
stub.volume = modal.Volume.new()


@stub.function(volumes={"/vol": stub.volume})
def seed_volume():
    f = open("/vol/data.txt", "w")
    f.write("hello world")  # file not closed after writing
    stub.volume.commit()  # fails with "volume busy"
    f.close()  # closed file too late
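The error is avoided by closing the file before committing, and a with block does this reliably. In the sketch below, write_and_commit is a hypothetical helper, and the commit argument is any callable; in a Modal function it would be the volume's commit method.

```python
def write_and_commit(path, text, commit):
    """Write text to path, close the file, then call commit()."""
    with open(path, "w") as f:
        f.write(text)
    # The with block has closed the file, so no handle is open for writing
    # when the commit runs.
    commit()
```

Inside seed_volume this would be called as write_and_commit("/vol/data.txt", "hello world", stub.volume.commit).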
Persisting volumes
By default, a modal.Volume lives as long as the app it’s defined in, just like any other Modal object. However, in many situations you might want to persist file data between runs of the app. To do this, you can use the persisted method on the Volume object.