Endpoint metrics
Every endpoint reports live inference metrics so you can see how it’s performing under real traffic — latency, throughput, and how many requests are in flight. Open an endpoint from the Endpoints tab and go to the Activity view to see them.
There are two types of metrics available:
- Inference metrics — LLM engine-specific metrics that give you deeper visibility into inference performance.
- Server metrics — the standard Modal container health metrics.
What the metrics mean
Latency (reported as p50 / p95 / p99):
- Time to first token (TTFT) — how long after a request arrives before the first output token streams back. The number users feel first.
- Inter-token latency (ITL) — average gap between successive output tokens. Drives perceived “typing speed.”
- End-to-end latency (E2E) — total time to complete a request.
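As a rough illustration of how the three latency figures relate, here is a minimal Python sketch that derives them from per-token timestamps for a single request. `RequestTrace` and its fields are hypothetical names made up for this example, not a Modal data structure, and the sketch assumes a request counts as complete when its last output token is emitted. The dashboard then summarizes values like these across many requests as p50 / p95 / p99.

```python
from dataclasses import dataclass


@dataclass
class RequestTrace:
    """Hypothetical per-request record (an assumption for illustration)."""
    arrival: float            # seconds: when the request arrived
    token_times: list[float]  # seconds: when each output token was emitted


def ttft(trace: RequestTrace) -> float:
    """Time to first token: arrival until the first output token."""
    return trace.token_times[0] - trace.arrival


def itl(trace: RequestTrace) -> float:
    """Inter-token latency: average gap between successive output tokens."""
    times = trace.token_times
    if len(times) < 2:
        return 0.0  # a single token has no gaps to average
    return (times[-1] - times[0]) / (len(times) - 1)


def e2e(trace: RequestTrace) -> float:
    """End-to-end latency: arrival until the last output token (assumed)."""
    return trace.token_times[-1] - trace.arrival


# One request: first token after 250 ms, then one token every 50 ms.
trace = RequestTrace(arrival=0.0, token_times=[0.25, 0.30, 0.35, 0.40, 0.45])
print(f"TTFT={ttft(trace):.3f}s  ITL={itl(trace):.3f}s  E2E={e2e(trace):.3f}s")
```

For a streamed response, E2E is roughly TTFT plus ITL times the number of remaining tokens, which is why a long TTFT dominates short completions and ITL dominates long ones.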
Throughput:
- Requests per second (QPS, queries per second) — request arrival rate.
- Token throughput — tokens/second, split into prefill (processing the prompt, with a separate line for cache-hit tokens) and decode (generating output).
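The throughput figures above are all rates over a time window. The sketch below shows the arithmetic with made-up counters for one 60-second window. The function name, the parameter names, and the choice to count cache-hit tokens separately from other prefill tokens are assumptions for illustration; the source does not say how the dashboard aggregates them.

```python
def throughput(window_seconds: float, requests: int, prefill_tokens: int,
               cached_prefill_tokens: int, decode_tokens: int) -> dict[str, float]:
    """Convert raw counts observed in one window into per-second rates.

    All parameter names are hypothetical. Cache-hit prefill tokens are
    treated as their own counter here, which is an assumption.
    """
    return {
        "qps": requests / window_seconds,
        "prefill_tok_s": prefill_tokens / window_seconds,
        "cached_prefill_tok_s": cached_prefill_tokens / window_seconds,
        "decode_tok_s": decode_tokens / window_seconds,
    }


# Example window: 120 requests in 60 seconds.
rates = throughput(
    window_seconds=60.0,
    requests=120,
    prefill_tokens=96_000,
    cached_prefill_tokens=24_000,
    decode_tokens=30_000,
)
for name, value in rates.items():
    print(f"{name}: {value:.1f}")
```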
Request load:
- Request activity — the rate of requests arriving at and completing on the endpoint over time.
- Running — requests currently being processed.
- Queued — requests waiting for a free slot. Sustained queueing means the fleet is saturated and is still scaling up.
Speculative decoding (only for recipes that use it) — the average number of draft tokens accepted per step; higher means speculation is paying off.
Caveats
- Metrics need traffic. Latency and throughput are computed over recent rolling windows; an idle or scaled-to-zero endpoint shows no current data.
- Cold starts skew early numbers. The first requests after a scale-up include model load time. Look at steady-state windows when evaluating performance.
- Percentiles need volume. p95/p99 are only meaningful once enough requests have accumulated in the window.
- Endpoint metrics are available in the dashboard. To get repeatable performance numbers under a controlled load, run a benchmark.
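To see why the cold-start and percentile caveats matter together, consider this small Python sketch. It uses a nearest-rank percentile, chosen here for simplicity; the source does not say which percentile method the dashboard uses, and the latency values are invented. With only ten requests in the window, a single slow cold-start request becomes both the p95 and the p99. With a thousand requests, the same outlier no longer moves either figure.

```python
import math


def percentile(samples: list[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


COLD_START = 3.0  # seconds: one request that paid for model load (made up)
STEADY = 0.2      # seconds: typical steady-state latency (made up)

small_window = [STEADY] * 9 + [COLD_START]    # 10 requests
large_window = [STEADY] * 999 + [COLD_START]  # 1000 requests

for label, window in [("10 requests", small_window),
                      ("1000 requests", large_window)]:
    p50, p95, p99 = (percentile(window, p) for p in (50, 95, 99))
    print(f"{label}: p50={p50} p95={p95} p99={p99}")
# 10 requests: p50=0.2 p95=3.0 p99=3.0
# 1000 requests: p50=0.2 p95=0.2 p99=0.2
```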