“The beauty of Modal is that you can scale inference to thousands of requests with just a few lines of Python.”
“We use Modal to run edge inference with <10ms overhead and batch jobs at large scale. Our team loves the platform for the power and flexibility it gives us.”
Stay in your application code. Modal handles scaling, serving, and infrastructure behind the scenes.
import modal

app = modal.App("flux-lora-inference")
MODEL_NAME = "black-forest-labs/FLUX.1-schnell"

# CUDA base image with Python and the ML dependencies layered on top.
image = (
    modal.Image.from_registry("nvidia/cuda:12.4.0-devel-ubuntu22.04", add_python="3.11")
    .pip_install("torch", "transformers", "diffusers", ...)
)

# Persistent volume holding the fine-tuned LoRA weights.
volume = modal.Volume.from_name("flux-lora-models")

@app.cls(gpu="H100", image=image, volumes={"/loras": volume})
class FluxWithLoRA:
    @modal.enter()
    def setup(self):
        from diffusers import FluxPipeline  # heavy import stays inside the container

        # Load the base pipeline once per container, then fuse in the LoRA
        # weights stored on the mounted volume.
        self.pipeline = FluxPipeline.from_pretrained(MODEL_NAME).to("cuda")
        self.pipeline.load_lora_weights("/loras")
        self.pipeline.fuse_lora()

    @modal.method()
    def generate_image(self, prompt: str):
        return self.pipeline(prompt).images[0]

@app.local_entrypoint()
def main():
    flux = FluxWithLoRA()
    flux.generate_image.remote("")
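The same class fans out to a whole batch of prompts with one call. A minimal sketch (the prompt list and output filenames are illustrative) that uses .map() to run generate_image across parallel containers:

@app.local_entrypoint()
def batch():
    # Illustrative prompts; Modal scales containers up to match the input size.
    prompts = [f"a watercolor sketch of city #{i}" for i in range(1000)]
    flux = FluxWithLoRA()
    for idx, image in enumerate(flux.generate_image.map(prompts)):
        image.save(f"flux_{idx}.png")  # each result is a PIL image from the pipeline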
Define your inference function with Modal’s SDK. Easily keep ML dependencies and GPU requirements in sync with application code.
# `tag`, `MODEL_CACHE`, `volume`, and `load_model` are placeholders for your
# CUDA image tag, cache mount path, modal.Volume, and model-loading code.
image = (
    modal.Image.from_registry(f"nvidia/cuda:{tag}")
    .uv_pip_install("transformers", "accelerate")
)

@app.cls(
    gpu="H100",
    image=image,
    volumes={MODEL_CACHE: volume},
)
class Model:
    @modal.enter()
    def enter(self):
        # Runs once per container start: load weights from the cache volume.
        self.model = load_model(MODEL_CACHE)

    @modal.method()
    def inference(self, prompt):
        image = self.model(prompt)
        return image
Modal’s container engine launches GPUs in <1 s when your inference function is called. Load a 7B model in seconds with our cutting-edge GPU snapshotting.
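Snapshotting is opted into on the class itself. A minimal sketch using Modal's memory-snapshot decorators, reusing the placeholder names from the snippet above (flag names follow recent SDK releases; the GPU flavor builds on the same pattern):

@app.cls(gpu="H100", image=image, enable_memory_snapshot=True)
class SnapshotModel:
    @modal.enter(snap=True)
    def load(self):
        # Runs once while the snapshot is taken; later cold starts restore
        # this state instead of re-loading the model from scratch.
        self.model = load_model(MODEL_CACHE)

    @modal.method()
    def inference(self, prompt):
        return self.model(prompt)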
Instantly scale to 1000+ GPUs during traffic spikes, then back down to zero when idle. No commitments, no waiting.
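The scaling bounds are ordinary parameters on the same decorator. A sketch (values are illustrative; parameter names follow recent Modal SDK releases) that caps the autoscaler during spikes and scales to zero when idle:

@app.cls(
    gpu="H100",
    image=image,
    min_containers=0,       # scale all the way down when idle
    max_containers=1000,    # ceiling during traffic spikes
    scaledown_window=60,    # seconds an idle container lingers before stopping
)
class AutoscaledModel:
    @modal.enter()
    def enter(self):
        self.model = load_model(MODEL_CACHE)

    @modal.method()
    def inference(self, prompt):
        return self.model(prompt)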