“The beauty of Modal is that you can scale inference to thousands of requests with just a few lines of Python.”
Stay in your application code. Modal handles scaling, serving, and infrastructure behind the scenes.
import modal

MODEL_NAME = "black-forest-labs/FLUX.1-schnell"

app = modal.App("flux-lora")

image = (
    modal.Image.from_registry("nvidia/cuda:12.4.0-devel-ubuntu22.04", add_python="3.11")
    .pip_install("torch", "transformers", "diffusers")  # plus the rest of your ML dependencies
)

volume = modal.Volume.from_name("flux-lora-models")

@app.cls(gpu="H100", image=image, volumes={"/loras": volume})
class FluxWithLoRA:
    @modal.enter()
    def setup(self):
        from diffusers import FluxPipeline

        self.pipeline = FluxPipeline.from_pretrained(MODEL_NAME).to("cuda")
        # Load the LoRA adapter stored in the mounted volume and fuse it into the base weights
        self.pipeline.load_lora_weights("/loras")
        self.pipeline.fuse_lora()

    @modal.method()
    def generate_image(self, prompt: str):
        return self.pipeline(prompt).images[0]

@app.local_entrypoint()
def main():
    flux = FluxWithLoRA()
    flux.generate_image.remote("a watercolor fox in a snowy forest")
Define your inference function with Modal’s SDK. Easily keep ML dependencies and GPU requirements in sync with application code.
image = (
    modal.Image.from_registry(f"nvidia/cuda:{tag}")  # tag: your CUDA base image tag, e.g. "12.4.0-devel-ubuntu22.04"
    .uv_pip_install("transformers", "accelerate")
)

@app.cls(
    gpu="H100",
    image=image,
    volumes={MODEL_CACHE: volume},  # MODEL_CACHE / volume: your mount path and modal.Volume
)
class Model:
    @modal.enter()
    def enter(self):
        # load_model is a placeholder for your own model-loading code
        self.model = load_model(MODEL_CACHE)

    @modal.method()
    def inference(self, prompt):
        image = self.model(prompt)
        return image
Modal’s container engine launches GPU containers in under a second when your inference function is called. Load a 7B model in seconds with our cutting-edge GPU snapshotting.
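As a rough sketch of what snapshotting can look like with Modal’s memory snapshot API (enable_memory_snapshot on the class, @modal.enter(snap=True) for snapshot-stage setup): the Model-style class and load_model placeholder mirror the example above, and the two-stage load-then-move-to-GPU pattern assumes a torch-style model, so adapt it to your own loading code.

@app.cls(
    gpu="H100",
    image=image,
    volumes={MODEL_CACHE: volume},
    enable_memory_snapshot=True,  # capture container memory after snapshot-stage setup
)
class SnapshottedModel:
    @modal.enter(snap=True)
    def load_weights(self):
        # Runs once when the snapshot is created; assumes load_model loads weights into CPU memory
        self.model = load_model(MODEL_CACHE)

    @modal.enter(snap=False)
    def move_to_gpu(self):
        # Runs after the snapshot is restored, once the GPU is attached
        self.model.to("cuda")

    @modal.method()
    def inference(self, prompt):
        return self.model(prompt)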
Instantly scale to 1,000+ GPUs during traffic spikes, then back down to zero when idle. No commitments, no waiting.
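For instance, fanning a large batch of work out over that fleet is a single .map call. This sketch assumes the FluxWithLoRA class defined above and an illustrative prompt list; Modal’s autoscaler spins containers up for the batch and back down to zero once it drains.

@app.local_entrypoint()
def batch():
    # Illustrative workload: Modal fans these calls out across many GPU containers in parallel
    prompts = [f"a city skyline at dusk, variation {i}" for i in range(1_000)]
    images = list(FluxWithLoRA().generate_image.map(prompts))
    print(f"generated {len(images)} images")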