“Our ML engineers want to use Modal for everything. Modal helped reduce our VLM document parsing latency by 3x and allowed us to scale throughput to >100,000 pages per minute.”
import modal

# CUDA base image tag (example value); pick one that matches your target CUDA version.
tag = "cuda:12.8.1-devel-ubuntu22.04"

vllm_image = (
    modal.Image.from_registry(f"nvidia/{tag}", add_python="3.12")
    .uv_pip_install("vllm==0.10.2", "torch==2.8.0")
)

# Persist downloaded Hugging Face weights across container starts.
model_cache = modal.Volume.from_name("huggingface-cache", create_if_missing=True)

app = modal.App("vllm-inference")

@app.function(image=vllm_image, gpu="H100", volumes={"/root/.cache/huggingface": model_cache})
@modal.web_server(port=8000)
def serve():
    import subprocess

    # Start an OpenAI-compatible vLLM server inside the container.
    cmd = "vllm serve Qwen/Qwen3-8B-FP8 --port 8000"
    subprocess.Popen(cmd, shell=True)
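Once this app is deployed (for example with the modal deploy CLI command), the vLLM process exposes an OpenAI-compatible API at the URL Modal assigns to the web server. A minimal client sketch, assuming the openai Python package and a placeholder deployment URL:

# Query the deployed vLLM server through its OpenAI-compatible API.
# The base_url below is a placeholder; use the URL printed by `modal deploy`.
from openai import OpenAI

client = OpenAI(
    base_url="https://your-workspace--vllm-inference-serve.modal.run/v1",  # placeholder URL
    api_key="EMPTY",  # vLLM requires no key unless you configure one
)

response = client.chat.completions.create(
    model="Qwen/Qwen3-8B-FP8",
    messages=[{"role": "user", "content": "Summarize this invoice in one sentence."}],
)
print(response.choices[0].message.content)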
Deploy any state-of-the-art or custom LLM using our flexible Python SDK.
Our in-house ML engineering team helps you implement inference optimizations specific to your workload.
You maintain full control of all code and deployments, so you can iterate instantly. No black boxes.
Modal’s Rust-based container stack spins up GPUs in < 1s.
Modal autoscales up and down with your traffic for maximum cost efficiency; see the configuration sketch below.
Modal’s proprietary cloud capacity orchestrator guarantees high GPU availability.
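As an illustration of the autoscaling behavior above, here is a hedged sketch of the relevant knobs on a Modal function; the parameter names (min_containers, max_containers, scaledown_window) follow recent Modal SDK releases and may differ in older versions:

import modal

app = modal.App("autoscaling-example")

@app.function(
    gpu="H100",
    min_containers=0,      # scale to zero when idle, so nothing runs between bursts
    max_containers=20,     # cap the fleet size during traffic spikes
    scaledown_window=120,  # keep idle containers warm for 2 minutes before scaling down
)
def generate(prompt: str) -> str:
    # Placeholder workload; in practice this would run your model.
    return prompt.upper()

Setting min_containers to zero means nothing runs (and nothing bills) while idle, while a longer scaledown_window trades a little idle time for fewer cold starts.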