Endpoint metrics

Every endpoint reports live inference metrics so you can see how it’s performing under real traffic — latency, throughput, and how many requests are in flight. Open an endpoint from the Endpoints tab and go to the Activity view to see them.

There are two types of metrics available:

  • Inference metrics — LLM engine-specific metrics designed to give you more performance observability.
  • Server metrics — the standard Modal container health metrics.

What the metrics mean 

Latency (reported as p50 / p95 / p99):

  • Time to first token (TTFT) — how long after a request arrives before the first output token streams back. The number users feel first.
  • Inter-token latency (ITL) — average gap between successive output tokens. Drives perceived “typing speed.”
  • End-to-end latency (E2E) — total time to complete a request.

Throughput:

  • Requests per second (QPS) — request arrival rate.
  • Token throughput — tokens/second, split into prefill (processing the prompt, with a separate line for cache-hit tokens) and decode (generating output).

Request load:

  • Request activity — the rate of requests arriving at and completing on the endpoint over time.
  • Running — requests currently being processed.
  • Queued — requests waiting for a free slot. Sustained queueing means the fleet is saturated and scaling up.

Speculative decoding (only for recipes that use it) — the average number of draft tokens accepted per step; higher means speculation is paying off.

Caveats 

  • Metrics need traffic. Latency and throughput are computed over recent rolling windows; an idle or scaled-to-zero endpoint shows no current data.
  • Cold starts skew early numbers. The first requests after a scale-up include model load time. Look at steady-state windows when evaluating performance.
  • Percentiles need volume. p95/p99 are only meaningful once enough requests have accumulated in the window.
  • Endpoint metrics are available in the dashboard. To get repeatable performance numbers under a controlled load, run a benchmark.