Benchmark an endpoint

Live metrics tell you how an endpoint behaves under whatever traffic it happens to be getting. A benchmark tells you how it behaves under a known, repeatable load — so you can compare models, regions, and configurations on an apples-to-apples basis.

Modal runs benchmarks for you: it drives a standard load generator against your live endpoint from a sandbox and reports the results. Start one from the Benchmark tab on the endpoint’s detail page. The resulting metrics appear in the dashboard once the run completes.

Workload patterns

A benchmark runs one of two built-in patterns, each shaped like a different real-world workload:

Pattern	Prompt shape	Workload it models
Real-time generation	~3,000 input tokens → ~100 output tokens, randomized prompts	Interactive chat / short-answer Q&A
Agentic multi-turn	~45,000-token shared system prefix + ~5,000-token question → ~200 output tokens	Agent / tool-use workloads with long, reused context

The agentic pattern reuses a long shared prefix across requests, so it also exercises prefix caching — a major factor in agent workload performance.
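To make the difference between the two shapes concrete, the sketch below works through the arithmetic using the approximate token counts from the table. The `WorkloadPattern` class and its field names are hypothetical and are not part of any Modal API. It also assumes that the real-time pattern shares no prefix at all, since its prompts are randomized. The shared fraction is only an upper bound on what a prefix cache could reuse; actual hit rates depend on the serving stack.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkloadPattern:
    """Approximate per-request token shape (hypothetical helper, not a Modal API)."""
    name: str
    shared_prefix_tokens: int  # input tokens identical across requests
    unique_input_tokens: int   # input tokens that differ per request
    output_tokens: int

    @property
    def input_tokens(self) -> int:
        return self.shared_prefix_tokens + self.unique_input_tokens

    @property
    def shared_fraction(self) -> float:
        """Upper bound on the share of input tokens a prefix cache could reuse."""
        return self.shared_prefix_tokens / self.input_tokens


# Token counts are the approximate figures from the table above.
# Assumption: randomized real-time prompts share no prefix.
REAL_TIME = WorkloadPattern("Real-time generation", 0, 3_000, 100)
AGENTIC = WorkloadPattern("Agentic multi-turn", 45_000, 5_000, 200)

for p in (REAL_TIME, AGENTIC):
    print(
        f"{p.name}: {p.input_tokens} input tokens, "
        f"{p.shared_fraction:.0%} shared prefix, "
        f"{p.input_tokens / p.output_tokens:.0f}:1 input-to-output ratio"
    )
```

Under these assumptions, about 90% of each agentic request's input is the shared prefix, and its input-to-output ratio is roughly 250:1 compared with 30:1 for the real-time pattern. This is why the agentic pattern is far more sensitive to prefix caching.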

Endpoint preview benchmarks

When you pick a model while creating an endpoint, you’ll also see precomputed benchmarks attached to the model’s recipe. These recipe benchmarks are reference numbers Modal measured on a known GPU configuration, so you can compare candidate models before deploying anything. They differ from the benchmarks you run yourself in two ways: Modal produces them ahead of time rather than running them against your endpoint, and they’re tied to the recipe rather than to your specific deployment. Use recipe benchmarks to choose a model; run your own benchmark to validate an endpoint in your region with your settings.

Caveats

Benchmarks send real traffic. Each run drives your live endpoint, triggers autoscaling, and incurs the usual compute cost for its duration.
Results are point-in-time. Numbers depend on the current fleet size, region, and any cold starts during the run. Compare runs taken under similar conditions, and let the endpoint warm up first for steady-state figures.
Pick the pattern that matches your use case. Real-time and agentic workloads stress very different parts of the serving stack; benchmarking the wrong shape can be misleading.