Run Flux Kontext on B200s. Try now

Language model inference

Run the latest open-source LLM and embedding models with Modal's serverless GPUs.

“Modal's user-friendly interface and efficient tools have truly empowered our team to navigate data-intensive tasks with ease, enabling us to achieve our project goals more efficiently.”

Karim Atiyeh, Co-Founder & CTO

“Switched to Modal for our LLM inference instead of Azure. 1/4 the price for GPUs and so much simpler to set up/scale. Big fan.”

Alex Reichenbach, CEO

“Using Modal for inference is like having an extra infra team - it’s reliable, scalable, and fast - meaning I can get back to training models”

Vik Paruchari, Founder

GPUs on demand

View Examples

Top of the line hardware

Access A100s and H100s to run the latest and largest models, like Llama3-405B.

Cheaper than running your own cluster

No more paying for idle GPUs.

Seamless autoscaling

When your app gets an influx of traffic, Modal scales with you.

View Examples

Blazing-fast performance

View Examples

Fast cold starts

Load gigabytes of weights in seconds with our optimized container file system and engine.

Support for inference engines

Easily run any framework or model on Modal (e.g. TensorRT and vLLM).

Dynamic batching

Use Modal's batching feature to process requests in dynamically-sized batches.

View Examples

Best-in-class developer experience

Metrics and observability

Visualize and debug failures.

Monitor resource utilization

Track your usage and spending in real-time.

Ready for production

Support for webhooks, batching, and token streaming.

Try it out

View all

Deploy an OpenAI-compatible LLM service

Run large language models with a drop-in replacement for the OpenAI API.

Run llama.cpp

Run DeepSeek-R1 and Phi-4 on llama.cpp

Serverless TensorRT-LLM (LLaMA 3 8B)

Run interactive language model applications.

RAG Chat with PDFs

Use ColBERT-style, multimodal embeddings with a Vision-Language Model to answer questions about documents.

Fine-tune FLAN-T5

Fine-tune a small language model for summarization.

Run vision-language models with SGLang

Ask questions about images and get back answers from a multimodal model.

Structured data extraction using Instructor

Extract structured data from unstructured text using LLMs.

Enforcing JSON outputs on LLMs

Guarantee that language model outputs match a JSON schema.

Ship your first app in minutes.

Get Started

$30 / month free compute