LLM Inference That Scales From Zero to Thousands of GPUs in Seconds

Deploy production-ready models without cold start delays, capacity limits, or infrastructure overhead.

Scale · Quora · Substack · Meta · Lovable · Suno · Mistral · Cartesia

Direct Answer

What is LLM inference?

LLM inference is the process of running a trained large language model to generate predictions, completions, or responses in production. It transforms a static model into an API or service that processes real-time requests, handles batching, and serves predictions to end users or applications. Production LLM inference requires optimized serving infrastructure that manages GPU memory, request queuing, and autoscaling without latency penalties.
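
At its core, inference is simply calling a trained model on new input. A minimal sketch, assuming a small Hugging Face model (the model and prompt are illustrative placeholders), shows the raw operation that a serving platform then wraps with batching, autoscaling, and an API:

```python
# Minimal, illustrative inference call with a small Hugging Face model.
# The model and prompt are placeholders; a production service wraps this
# call with request handling, batching, and GPU scheduling.
from transformers import pipeline

generator = pipeline("text-generation", model="distilgpt2")
print(generator("Explain LLM inference in one sentence:", max_new_tokens=40)[0]["generated_text"])
```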

  • Instant cold starts: Modal autoscales containers 100x faster than Docker.
  • Elastic GPU scaling: access thousands of GPUs across clouds without capacity negotiations.
  • Code-first infrastructure: eliminates the complexity of YAML configuration files.

What you can build

Ship production LLM APIs in hours, not weeks

Modal handles GPU provisioning, autoscaling, and observability so you can focus on model quality and product iteration.

Read the docs

Production inference APIs

Deploy GPT-style chat, code completion, or summarization with sub-second cold starts and automatic HTTPS endpoints.

Batch inference at scale

Run inference across millions of documents, images, or records with elastic parallelism that scales to thousands of GPUs.

Multi-modal pipelines

Combine vision-language models, embeddings, and retrieval into unified inference graphs.

GPU autoscaling

Capacity adjusts in real time based on queue depth and request rate, scaling to zero when idle.
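
To make the batch inference use case above concrete, here is a hedged sketch of fanning a function out over many inputs with Modal's .map(); the app name, inputs, and stubbed model call are illustrative placeholders, not a reference implementation:

```python
# Illustrative batch inference fan-out with Modal's .map(); the model call
# is stubbed out, and all names here are placeholders.
import modal

app = modal.App("batch-inference-example")

@app.function(gpu="T4")
def summarize(doc: str) -> str:
    # In a real job, load the model once per container and run inference here.
    return doc[:80]  # placeholder "summary"

@app.local_entrypoint()
def main():
    docs = [f"document {i} ..." for i in range(1_000)]  # in practice, millions
    for summary in summarize.map(docs):
        print(summary)
```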

What LLM inference is

  • Takes a trained model (GPT, Llama, Mistral, or custom) and exposes it as a callable endpoint at scale
  • Runs continuously in production, handling thousands to millions of requests per day
  • Demands low latency, elastic GPU capacity, and efficient batching
  • Modal eliminates cold start overhead with sub-second container launches
  • Teams deploy vLLM, TensorRT-LLM, or custom engines and scale from zero to hundreds of GPUs automatically

Why it matters now

  • Every 100ms of added latency reduces user engagement; every idle GPU wastes budget
  • Modal's inference platform is built to serve 1M+ users from day one — no capacity planning, no quota negotiations
  • Modal achieves sub-10ms overhead latency for distributed inference workloads
  • Modal's intelligent scheduler routes workloads across clouds to bypass quota limits
  • Teams using Modal scale to zero GPU usage during idle periods, eliminating waste

Getting started in 3 steps

Step 1: Install Modal and define your inference function (10 min)

Run pip install modal and wrap your model loading and prediction logic with @app.function(gpu="T4"). Modal's code-first approach keeps ML dependencies and GPU requirements in sync with application code — no Dockerfiles, no YAML, no cluster management.
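
A minimal sketch of Step 1, assuming a small Hugging Face model; the app name, image packages, and model are illustrative, and a production deployment would typically cache the model load rather than reloading it on every call:

```python
# Illustrative Step 1 sketch; app name, image packages, and model are
# placeholders. Production code would load the model once per container
# (e.g. with Modal's class-based lifecycle hooks) rather than per call.
import modal

app = modal.App("llm-inference")
image = modal.Image.debian_slim().pip_install("transformers", "torch")

@app.function(gpu="T4", image=image)
def generate(prompt: str) -> str:
    from transformers import pipeline
    pipe = pipeline("text-generation", model="distilgpt2")  # stand-in model
    return pipe(prompt, max_new_tokens=64)[0]["generated_text"]
```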

Step 2: Deploy to production with one command (2 min)

Execute modal deploy to push your function live with autoscaling, HTTPS endpoints, and logging enabled automatically. When a request arrives, Modal spins up a container in under one second on A100, H100, or A10 GPUs. You never hit quota limits.
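
Once deployed, the function can be invoked from any Python process. This hedged sketch uses Modal's Function.from_name lookup (the helper's name has shifted across client versions), with the app and function names taken from the Step 1 sketch above:

```python
# Hedged sketch of calling the deployed function from another process.
# Function.from_name reflects recent Modal client releases; the app and
# function names match the illustrative Step 1 sketch.
import modal

generate = modal.Function.from_name("llm-inference", "generate")
print(generate.remote("Write a haiku about GPUs."))
```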

Step 3: Monitor and iterate in real time (ongoing)

Use Modal's dashboard to track request latency, GPU utilization, and error rates. Adjust GPU types, scaling policies, or vLLM parameters directly in Python — no YAML edits or cluster restarts required.
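
For example, switching hardware or keeping warm capacity is a decorator change on the Step 1 sketch. Note that the warm-pool and idle-timeout parameter names have varied across Modal releases (keep_warm and container_idle_timeout in older clients, min_containers and scaledown_window in newer ones), so match them to your installed version:

```python
# Illustrative tweak to the Step 1 sketch: larger GPU plus one warm container.
# Parameter names for warm pools differ across Modal versions; verify against
# your installed client before deploying.
@app.function(gpu="H100", image=image, min_containers=1)
def generate(prompt: str) -> str:
    from transformers import pipeline
    pipe = pipeline("text-generation", model="distilgpt2")  # stand-in model
    return pipe(prompt, max_new_tokens=64)[0]["generated_text"]
```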

Real results from a production deployment

From 90-second cold starts to 0.8 seconds

"We migrated from EKS with Karpenter autoscaling to Modal's native GPU orchestration in 40 developer-hours. GPU utilization jumped from 34% to 78% with vLLM continuous batching and automatic request queuing."

A startup building an AI code review tool deployed Llama 3.1 70B inference on Modal, replacing a self-managed Kubernetes cluster. Within two weeks they reduced cold start time from 90 seconds to 0.8 seconds, cut inference costs by 60% through scale-to-zero, and expanded from 2 AWS regions to 6 clouds without re-architecting.

AI Engineering Team, AI Code Review Platform, Modal Customer

Modal dashboard showing LLM inference performance metrics

Who benefits most

Built for every AI team

AI engineering teams at startups

You need production-grade inference without hiring platform engineers. Modal's Python-native API lets you deploy models in the same codebase as training, with autoscaling and observability included.

ML researchers running large-scale evaluations

You run batch inference over thousands to millions of inputs on GPUs. Modal spins up GPU capacity on demand and scales it back to zero when jobs finish, eliminating reservation costs.

Data science teams at enterprises

You manage dozens of models across business units with varying traffic patterns. Modal's unified deployment model reduces the operational overhead of multiplexing many models while still meeting latency SLAs.

"We use Modal to run edge inference with <10ms overhead and batch jobs at large scale. Our team loves the platform for the power and flexibility it gives us."

Brian Ichter, Co-founder

"Modal makes it easy to write code that runs on 100s of GPUs in parallel, transcribing podcasts in a fraction of the time."

Mike Cohen, Head of Data

"Everyone here loves Modal because it helps us move so much faster. We rely on it to handle massive spikes in volume for evals, RL environments, and MCP servers."

Aakash Sabharwal, VP of Engineering

"Modal was the only infrastructure provider that enabled us to reliably run tens of thousands of app creation sessions in an instant. We're excited to build with them for the long term."

Anton Osika, CEO & Founder

Join Modal's developer community

Modal Community Slack
Erin Boyle (@erinselene), ML Engineer, Tesla

This tool is awesome. So empowering to have your infra needs met with just a couple decorators. Good people, too!

Jai Chopra (@jai_chopra), Product, LanceDB

Recently built an app on Lambda and just started to use @modal, the difference is insane! Modal is amazing, virtually no cold start time, onboarding experience is great

Izzy Miller (@isidoremiller), DevRel, Hex

special shout out to @modal for providing the crucial infrastructure to run this! Modal is the coolest tool I've tried in a really long time. Cannot say enough good things.

Frequently asked questions

What is the difference between LLM inference and LLM training?

LLM training adjusts model weights to optimize performance on a task — it runs once (or periodically) and requires large GPU clusters for hours or days. LLM inference runs a fixed, trained model to generate responses in real time — it runs continuously in production and demands low latency, high throughput, and autoscaling. Modal supports both through the same Python API.

Does Modal support the latest vLLM releases from GitHub?

Yes. Modal's container images can install any version of vLLM from PyPI or directly from GitHub. Because Modal rebuilds images automatically when dependencies change, teams can pin to a specific commit for stability or upgrade to the latest release with a single version bump in their code.
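
A hedged sketch of pinning vLLM in a Modal image; the version number is only an example, and the commented-out line shows where a specific GitHub commit would go (the hash is deliberately left as a placeholder):

```python
# Illustrative image definition pinning vLLM; the version and commit are examples.
import modal

vllm_image = (
    modal.Image.debian_slim(python_version="3.11")
    .pip_install("vllm==0.6.3")  # pinned PyPI release
    # To pin a specific commit instead (sha is a placeholder):
    # .pip_install("git+https://github.com/vllm-project/vllm.git@<commit-sha>")
)
```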

How fast are Modal's cold starts for LLM inference?

Modal achieves sub-second cold starts for pre-cached containers. For large models like Llama 3.1 70B, Modal's snapshot technology checkpoints a running container and restores it in under one second on a fresh GPU — compared to 60-90 seconds on traditional cloud instances.

Can I use Modal for llm.c or custom C++ inference engines?

Yes. Modal containers can run any binary, CUDA kernel, or native executable. You can install custom inference engines by defining them in your container image using pip, apt, or manual compilation steps.
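
As a hedged illustration of a custom build, this image compiles llm.c from source; the commands are a sketch rather than a tested recipe, and a real setup would pin a commit and add the CUDA toolchain if building GPU binaries:

```python
# Illustrative build of a native engine (llm.c) inside a Modal image.
# Commands are a sketch; a real setup would pin a commit and add the CUDA
# toolchain for GPU builds.
import modal

engine_image = (
    modal.Image.debian_slim()
    .apt_install("build-essential", "git")
    .run_commands(
        "git clone https://github.com/karpathy/llm.c /opt/llm.c",
        "cd /opt/llm.c && make train_gpt2",
    )
)
```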

What happens to my inference API when traffic drops to zero?

By default, Modal scales containers to zero when no requests arrive, and you pay nothing during idle periods. When traffic resumes, Modal cold-starts a new container in under one second. For latency-sensitive applications, you can configure a minimum container count to keep containers warm.

How does Modal handle vLLM continuous batching configuration?

Modal passes environment variables and startup arguments directly to your vLLM server process, giving you full control over continuous batching parameters like max_num_batched_tokens, max_num_seqs, and tensor parallel size. You can tune these values in Python code and redeploy without YAML edits or cluster restarts.
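
A hedged sketch of that pattern: launch vLLM's OpenAI-compatible server as a subprocess inside a Modal function and pass the batching flags on the command line. The model name and flag values are illustrative, and exposing the server over HTTPS would additionally use Modal's web-serving support:

```python
# Illustrative vLLM launch with explicit continuous-batching flags; the model
# and values are placeholders, and serving HTTP traffic would also require
# Modal's web-serving decorators.
import subprocess
import modal

app = modal.App("vllm-inference")
vllm_image = modal.Image.debian_slim(python_version="3.11").pip_install("vllm")

@app.function(gpu="A100", image=vllm_image, timeout=60 * 60)
def serve():
    subprocess.run(
        [
            "vllm", "serve", "meta-llama/Llama-3.1-8B-Instruct",
            "--max-num-batched-tokens", "8192",
            "--max-num-seqs", "256",
            "--tensor-parallel-size", "1",
        ],
        check=True,
    )
```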

Start running LLM inference in minutes.

Get Started Free

$30 in free compute to get started.