Modal Inference

The fastest way to scale inference

Serve custom or open source AI models using Python syntax, with sub-second cold starts and access to all the latest GPUs.

“The beauty of Modal is that you can scale inference to thousands of requests with just a few lines of Python.”

Georg Kucsko, CTO & Co-founder, Suno

“We use Modal to run edge inference with <10ms overhead and batch jobs at large scale. Our team loves the platform for the power and flexibility it gives us.”

Brian Ichter, Co-founder, Physical Intelligence

Code-first inference

Stay in your application code. Modal handles scaling, serving, and infrastructure behind the scenes.

import modal

app = modal.App("flux-lora")  # the app object the @app.cls decorator below attaches to

MODEL_NAME = "black-forest-labs/FLUX.1-schnell"
image = (
    modal.Image.from_registry("nvidia/cuda:12.4.0-devel-ubuntu22.04", add_python="3.11")
    .pip_install("torch", "transformers", "diffusers", ...)
)
volume = modal.Volume.from_name("flux-lora-models")

@app.cls(gpu="H100", image=image, volumes={"/loras": volume})
class FluxWithLoRA:
    @modal.enter()
    def setup(self):
        # import inside the container, where the image's dependencies are installed
        from diffusers import FluxPipeline

        self.pipeline = FluxPipeline.from_pretrained(MODEL_NAME).to("cuda")
        self.pipeline.load_and_fuse_lora()  # apply LoRA weights from the mounted volume

    @modal.method()
    def generate_image(self, prompt: str):
        return self.pipeline(prompt).images[0]

@app.local_entrypoint()
def main():
    flux = FluxWithLoRA()
    flux.generate_image.remote("")
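
Once the app is deployed (for example with modal deploy), the class can be called from any other Python process. A minimal sketch; the app name "flux-lora" is the placeholder used above:

import modal

# look up the deployed class by app name and class name
FluxWithLoRA = modal.Cls.from_name("flux-lora", "FluxWithLoRA")

image = FluxWithLoRA().generate_image.remote("a watercolor of a harbor at dawn")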
Modal Inference

Ship inference for 1M+ users from day one

Defined in code

Define your inference function with Modal’s SDK. Easily keep ML dependencies and GPU requirements in sync with application code.

image = (
    modal.Image.from_registry(f"nvidia/cuda:{tag}")
    .uv_pip_install("transformers", "accelerate")
)

@app.cls(
    gpu="H100",
    image=image,
    volumes={MODEL_CACHE: volume},  # MODEL_CACHE, volume, load_model: defined elsewhere
)
class Model:
    @modal.enter()
    def enter(self):
        self.model = load_model(MODEL_CACHE)

    @modal.method()  # exposes inference as a remotely callable method
    def inference(self, prompt):
        image = self.model(prompt)
        return image

Low latency

Modal’s container engine launches GPUs in <1 s when your inference function is called. Load a 7B model in seconds with our cutting-edge GPU snapshotting.
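
Snapshotting is opt-in per class. A minimal sketch of the memory-snapshot API, reusing the load_model and MODEL_CACHE placeholders from above; the load-on-CPU-then-move-to-GPU split is our assumption about a typical setup:

@app.cls(gpu="H100", image=image, enable_memory_snapshot=True)
class SnapshottedModel:
    @modal.enter(snap=True)
    def load(self):
        # runs once at snapshot time; the resulting memory state is captured
        # so later cold starts restore it instead of re-running this code
        self.model = load_model(MODEL_CACHE)

    @modal.enter(snap=False)
    def to_gpu(self):
        # runs after each restore, once the GPU is attached
        self.model.to("cuda")

    @modal.method()
    def inference(self, prompt):
        return self.model(prompt)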

Elastic scale

Instantly scale to 1000+ GPUs during traffic spikes, then back down to 0 when idle. No commitments, no waiting.
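
The scaling bounds are parameters on the same decorator. A sketch using recent SDK parameter names (earlier releases spelled these keep_warm and container_idle_timeout):

@app.cls(
    gpu="H100",
    image=image,
    min_containers=0,      # scale all the way to zero when idle
    max_containers=1000,   # ceiling during traffic spikes
    scaledown_window=300,  # seconds an idle container stays warm
)
class AutoscaledModel:
    @modal.method()
    def inference(self, prompt):
        ...

# fan a batch out; the autoscaler provisions containers as needed
results = list(AutoscaledModel().inference.map(prompts))  # prompts: any iterable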

Infrastructure optimized for every deployment pattern
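
The same model class can sit behind a synchronous web endpoint, a batch fan-out, and a cron schedule. A sketch, assuming the Model class defined above and a hypothetical load_prompts helper:

@app.function(image=image.pip_install("fastapi[standard]"))
@modal.fastapi_endpoint(method="POST")
def generate(item: dict):
    # HTTP serving: forward the request to the autoscaled model class
    return Model().inference.remote(item["prompt"])

@app.function(schedule=modal.Period(days=1))
def nightly_batch():
    # scheduled batch job over the same deployment
    prompts = load_prompts()  # hypothetical helper
    list(Model().inference.map(prompts))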




Get clear insight into production deployments


Built with Modal

Ship your first app in minutes.

Get Started

$30/month of free compute