“Our ML engineers want to use Modal for everything. Modal helped reduce our VLM document parsing latency by 3x and allowed us to scale throughput to >100,000 pages per minute.”
import modal

# CUDA base image tag (example value); pick one that matches your target CUDA version.
tag = "cuda:12.8.1-devel-ubuntu22.04"

vllm_image = (
    modal.Image.from_registry(f"nvidia/{tag}", add_python="3.12")
    .uv_pip_install("vllm==0.10.2", "torch==2.8.0")
)

# Persist downloaded Hugging Face weights across container starts.
model_cache = modal.Volume.from_name("huggingface-cache", create_if_missing=True)

app = modal.App("vllm-inference")

@app.function(image=vllm_image, gpu="H100", volumes={"/root/.cache/huggingface": model_cache})
@modal.web_server(port=8000)
def serve():
    import subprocess

    # Start an OpenAI-compatible vLLM server inside the container.
    cmd = "vllm serve Qwen/Qwen3-8B-FP8 --port 8000"
    subprocess.Popen(cmd, shell=True)
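Once this app is deployed (for example with the modal deploy CLI command), the vLLM process exposes an OpenAI-compatible API at the URL Modal assigns to the web server. A minimal client sketch, assuming the openai Python package and a placeholder deployment URL:

# Query the deployed vLLM server through its OpenAI-compatible API.
# The base_url below is a placeholder; use the URL printed by `modal deploy`.
from openai import OpenAI

client = OpenAI(
    base_url="https://your-workspace--vllm-inference-serve.modal.run/v1",  # placeholder URL
    api_key="EMPTY",  # vLLM requires no key unless you configure one
)

response = client.chat.completions.create(
    model="Qwen/Qwen3-8B-FP8",
    messages=[{"role": "user", "content": "Summarize this invoice in one sentence."}],
)
print(response.choices[0].message.content)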
Deploy any state-of-the-art or custom LLM using our flexible Python SDK.
Our in-house ML engineering team helps you implement inference optimizations specific to your workload.
You maintain full control of all code and deployments, so you can iterate instantly. No black boxes.
Modal’s Rust-based container stack spins up GPUs in < 1s.
Modal autoscales up and down with your traffic for maximum cost efficiency; see the configuration sketch below.
Modal’s proprietary cloud capacity orchestrator guarantees high GPU availability.
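As an illustration of the autoscaling behavior above, here is a hedged sketch of the relevant knobs on a Modal function; the parameter names (min_containers, max_containers, scaledown_window) follow recent Modal SDK releases and may differ in older versions:

import modal

app = modal.App("autoscaling-example")

@app.function(
    gpu="H100",
    min_containers=0,      # scale to zero when idle, so nothing runs between bursts
    max_containers=20,     # cap the fleet size during traffic spikes
    scaledown_window=120,  # keep idle containers warm for 2 minutes before scaling down
)
def generate(prompt: str) -> str:
    # Placeholder workload; in practice this would run your model.
    return prompt.upper()

Setting min_containers to zero means nothing runs (and nothing bills) while idle, while a longer scaledown_window trades a little idle time for fewer cold starts.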