“The beauty of Modal is that you can scale inference to thousands of requests with just a few lines of Python.”
“We use Modal to run edge inference with <10ms overhead and batch jobs at large scale. Our team loves the platform for the power and flexibility it gives us.”
Stay in your application code. Modal handles scaling, serving, and infrastructure behind the scenes.
import modal

app = modal.App("flux-lora-inference")
MODEL_NAME = "black-forest-labs/FLUX.1-schnell"

# CUDA base image with Python and the ML dependencies layered on top.
image = (
    modal.Image.from_registry("nvidia/cuda:12.4.0-devel-ubuntu22.04", add_python="3.11")
    .pip_install("torch", "transformers", "diffusers", ...)
)

# Persistent volume holding the fine-tuned LoRA weights.
volume = modal.Volume.from_name("flux-lora-models")

@app.cls(gpu="H100", image=image, volumes={"/loras": volume})
class FluxWithLoRA:
    @modal.enter()
    def setup(self):
        from diffusers import FluxPipeline  # heavy import stays inside the container

        # Load the base pipeline once per container, then fuse in the LoRA
        # weights stored on the mounted volume.
        self.pipeline = FluxPipeline.from_pretrained(MODEL_NAME).to("cuda")
        self.pipeline.load_lora_weights("/loras")
        self.pipeline.fuse_lora()

    @modal.method()
    def generate_image(self, prompt: str):
        return self.pipeline(prompt).images[0]

@app.local_entrypoint()
def main():
    flux = FluxWithLoRA()
    flux.generate_image.remote("")
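The same class fans out to a whole batch of prompts with one call. A minimal sketch (the prompt list and output filenames are illustrative) that uses .map() to run generate_image across parallel containers:

@app.local_entrypoint()
def batch():
    # Illustrative prompts; Modal scales containers up to match the input size.
    prompts = [f"a watercolor sketch of city #{i}" for i in range(1000)]
    flux = FluxWithLoRA()
    for idx, image in enumerate(flux.generate_image.map(prompts)):
        image.save(f"flux_{idx}.png")  # each result is a PIL image from the pipeline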
Define your inference function with Modal’s SDK. Easily keep ML dependencies and GPU requirements in sync with application code.
# `tag`, `MODEL_CACHE`, `volume`, and `load_model` are placeholders for your
# CUDA image tag, cache mount path, modal.Volume, and model-loading code.
image = (
    modal.Image.from_registry(f"nvidia/cuda:{tag}")
    .uv_pip_install("transformers", "accelerate")
)

@app.cls(
    gpu="H100",
    image=image,
    volumes={MODEL_CACHE: volume},
)
class Model:
    @modal.enter()
    def enter(self):
        # Runs once per container start: load weights from the cache volume.
        self.model = load_model(MODEL_CACHE)

    @modal.method()
    def inference(self, prompt):
        image = self.model(prompt)
        return image
Modal’s container engine launches GPUs in <1 s when your inference function is called. Load a 7B model in seconds with our cutting-edge GPU snapshotting.
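Snapshotting is opted into on the class itself. A minimal sketch using Modal's memory-snapshot decorators, reusing the placeholder names from the snippet above (flag names follow recent SDK releases; the GPU flavor builds on the same pattern):

@app.cls(gpu="H100", image=image, enable_memory_snapshot=True)
class SnapshotModel:
    @modal.enter(snap=True)
    def load(self):
        # Runs once while the snapshot is taken; later cold starts restore
        # this state instead of re-loading the model from scratch.
        self.model = load_model(MODEL_CACHE)

    @modal.method()
    def inference(self, prompt):
        return self.model(prompt)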
Instantly scale to 1000+ GPUs during traffic spikes, then back down to zero when idle. No commitments, no waiting.
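The scaling bounds are ordinary parameters on the same decorator. A sketch (values are illustrative; parameter names follow recent Modal SDK releases) that caps the autoscaler during spikes and scales to zero when idle:

@app.cls(
    gpu="H100",
    image=image,
    min_containers=0,       # scale all the way down when idle
    max_containers=1000,    # ceiling during traffic spikes
    scaledown_window=60,    # seconds an idle container lingers before stopping
)
class AutoscaledModel:
    @modal.enter()
    def enter(self):
        self.model = load_model(MODEL_CACHE)

    @modal.method()
    def inference(self, prompt):
        return self.model(prompt)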