Modal has raised an $87M Series B led by Lux Capital. Read more

Deploy fine-tuned LLMs without compromising on control

Stay focused on optimizing your LLM and inference engine for your needs. We handle the compute infrastructure.
Get Started
customer logo

“Our ML engineers want to use Modal for everything. Modal helped reduce our VLM document parsing latency by 3x and allowed us to scale throughput to >100,000 pages per minute.”

Raunak Chowdhuri, Founder
customer logo

“Modal lets us deploy new ML models in hours rather than weeks. We use it across spam detection, recommendations, audio transcription, and video pipelines, and it’s helped us move faster with far less complexity.”

Mike Cohen, Head of AI & ML Engineering

Ship faster with Python-defined infrastructure

01
import modal
02
03
vllm_image = (
04
    modal.Image.from_registry(f"nvidia/{tag}", add_python="3.12")
05
    .uv_pip_install("vllm==0.10.2", "torch==2.8.0")
06
)
07
08
model_cache = modal.Volume.from_name("huggingface-cache", create_if_missing=True)
09
10
app = modal.App("vllm-inference")
11
12
@app.function(image=vllm_image, gpu="H100", volumes={"/root/.cache/huggingface": model_cache})
13
@modal.web_server(port=8000)
14
def serve():
15
    import subprocess
16
17
    cmd = "vllm serve Qwen/Qwen3-8B-FP8 --port 8000"
18
    subprocess.Popen(cmd)

Inference optimizations that you control

Inference optimizations that you control

Deploy any state-of-the-art or custom LLM using our flexible Python SDK.


Our in-house ML engineering team helps you implement inference optimizations specific to your workload.


You maintain full control of all code and deployments for instant iterations. No black boxes.

View Examples

Autoscale to thousands of GPUs without reservations


Modal’s Rust-based container stack spins up GPUs in < 1s.


Modal autoscales up and down for max cost efficiency.


Modal’s proprietary cloud capacity orchestrator guarantees high GPU availability.



Everything you need for production-grade deployments




Built with Modal

Ship your first app in minutes.

Get Started

$30 / month free compute