Frequently Asked Questions
What is this? How do I use it?
This interactive chart shows the per-replica throughput and client-side latency you can expect when running open-weights language models on open-source inference engines, in particular on Modal. Select a workload (model, tokens in and out), set a latency objective, and choose whether to see all configurations or only the ones that achieved the best throughput or the lowest latency. Select a line in the chart to see a working code snippet you can use to try out the LLM engine on Modal.
Results were measured for "out-of-the-box" configurations of the LLM engines, and so represent a lower, not an upper, bound on performance, especially for TensorRT-LLM, which benefits most from workload-specific tuning. For a deep dive on the methodology, see this page. For a high-level overview of the results, see our executive summary.
Click the dice to see a random result.
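For readers who have not deployed an LLM engine on Modal before, here is a minimal sketch of the kind of snippet the chart links to, assuming vLLM's OpenAI-compatible server; the image tag, model name, and parameters below are illustrative placeholders, and the snippets in the chart will differ.

```python
# Minimal sketch (not the exact snippet the chart generates): serve an
# OpenAI-compatible vLLM endpoint on a single H100 replica on Modal.
# The image tag and model name are placeholders.
import subprocess

import modal

image = modal.Image.from_registry(
    "vllm/vllm-openai:latest",  # placeholder tag; pin a specific version in practice
).entrypoint([])  # clear the image's default entrypoint so Modal controls startup

app = modal.App("llm-engine-sketch", image=image)


@app.function(gpu="H100", timeout=30 * 60)
@modal.web_server(port=8000, startup_timeout=10 * 60)
def serve():
    # Launch vLLM's OpenAI-compatible server; Modal exposes port 8000 as an HTTPS URL.
    subprocess.Popen(
        "vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000",
        shell=True,
    )
```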
Should I use vLLM, SGLang, or TensorRT-LLM?
Our results indicate that vLLM and SGLang achieve comparable performance out of the box, so the decision between those two frameworks should be made on other grounds, like their time-to-market on the features you care about. Our internal results and others' published results indicate that TensorRT-LLM can be faster when tuned for very specific workloads, but the engineering lift and churn should not be underestimated.
See our executive summary for details.
What data did you use?
We used the default dataset in guidellm, random chunks of Pride and Prejudice of varying lengths. This results in a low KV cache hit rate, and so is closer to the performance of a system handling independent requests, like a translation service, than to the performance of a system handling correlated requests, like a conversational chatbot.
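To make the workload shape concrete, here is an illustrative sketch (not guidellm's actual implementation) of sampling variable-length prompts from random offsets in Pride and Prejudice; because each prompt starts at a random position, prefixes rarely repeat and the engine's KV cache gets few hits.

```python
# Illustrative sketch of the workload shape, not guidellm's implementation:
# sample variable-length prompts from random offsets in Pride and Prejudice,
# so prompt prefixes rarely repeat and KV cache hit rates stay low.
import random
import urllib.request

GUTENBERG_URL = "https://www.gutenberg.org/cache/epub/1342/pg1342.txt"  # Pride and Prejudice
text = urllib.request.urlopen(GUTENBERG_URL).read().decode("utf-8")
words = text.split()

def sample_prompt(min_words: int = 128, max_words: int = 1024) -> str:
    length = random.randint(min_words, max_words)
    start = random.randrange(len(words) - length)
    return " ".join(words[start : start + length])

prompts = [sample_prompt() for _ in range(32)]
```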
What do I do if my LLM engine needs to serve hundreds of requests per second? I only see loads ranging from under one RPS to a few dozen.
Our results are measured for a single replica, one instance of the LLM inference engine. A high-throughput service is constructed by "scaling out" these replicas. If you're interested in running a high-throughput service that can handle variable load, consider Modal, the serverless platform used to measure these results. Services on Modal can scale from zero to thousands of replicas in minutes.
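As a sketch of what "scaling out" looks like in code, the snippet below adds autoscaling settings to a Modal function; the parameter names reflect one recent version of the Modal SDK, and the specific limits are placeholders rather than the values used in our benchmarks.

```python
# Hypothetical autoscaling configuration for a Modal-hosted LLM engine replica.
# Parameter names follow a recent Modal SDK; the limits below are placeholders.
import modal

app = modal.App("llm-engine-scale-out")


@app.function(
    gpu="H100",
    min_containers=0,      # scale to zero when idle
    max_containers=100,    # cap the number of replicas
    scaledown_window=300,  # seconds an idle replica stays warm
)
@modal.concurrent(max_inputs=32)  # requests served concurrently per replica
def serve():
    ...  # launch the LLM engine here, as in the serving sketch above
```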
In our results, a single replica runs on at most a single node, which may have up to eight GPUs. Contemporary deployments of large models often run single replicas that are sharded across many nodes ("distributed inference"), like the ~40 node-per-replica deployment described by the DeepSeek team. These configurations can achieve lower latencies for larger models and/or higher throughputs, including throughput per dollar, but are much more complex to deploy, maintain, and scale up and down. If and as this style of deployment becomes more common in open source LLM engine software, we plan to add it to our benchmarking (see NVIDIA Dynamo for one implementation).
Why is the minimum time-to-first-token (TTFT) around 200 ms, even for small models?
Our latencies are measured from a client, not the server, and include network and service delays of around 150 milliseconds (p95). These overheads come from network transmission and from the systems that enable auto-scaling, retries, request logging, and other features common to production deployments. They might be reduced to a few dozen milliseconds by replicating at the edge, at the cost of much greater engineering complexity, which we leave to future work.
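For reference, here is a sketch of how client-side TTFT can be measured against any OpenAI-compatible endpoint by streaming a completion and timing the first chunk; the endpoint URL, API key, and model name are placeholders.

```python
# Sketch: measure client-side time-to-first-token (TTFT) against an
# OpenAI-compatible endpoint. URL, API key, and model name are placeholders.
import time

from openai import OpenAI

client = OpenAI(base_url="https://example--vllm-serve.modal.run/v1", api_key="EMPTY")

start = time.perf_counter()
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Summarize Pride and Prejudice in one sentence."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        ttft = time.perf_counter() - start
        print(f"TTFT: {ttft * 1000:.0f} ms")  # includes network and platform overheads
        break
```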
I want to run language models on CPUs/TPUs/LPUs. Do you have results for that?
Currently, our benchmark only includes GPU-accelerated language model inference for models with over one billion parameters, which is the most common case we see on our platform. See the excellent CPU-centric benchmarking work from Spare Cores for results with the llama.cpp engine.
I want to know all the details about how you ran these benchmarks so that I can poke holes in your results. Where can I find them?
Great to hear! We've used our benchmarking system enough to know that it is useful, but we haven't done enough to make it bulletproof (and nothing is perfect). We've released the code as open source here. Let us know if you spot any issues.
We did the minimum configuration to get workloads to run, but the breadth of our benchmarking, across three frameworks on ten context lengths for over a dozen models, meant that we couldn't give any particular configuration the attention that an engineer focused on building a single service would. So we welcome contributions from the community, including teams building LLM engines, to our open repository of configs for this benchmark. We intend to keep this benchmark up-to-date as long as there are users who want to run LLM engines on our platform.
You can find a detailed walkthrough of our general approach to benchmarking LLM engines here.
But benchmarks measure a software and hardware system together, not separately. So below are some key technical details about the system we ran on, rented serverlessly on the Modal platform.
The benchmarking code ran on Oracle Cloud Infrastructure (OCI) machines in a variety of data centers in the United States (mostly in the Midwest and Mid-Atlantic). We did not observe meaningful differences across data centers. The machines all used AMD x86-64 CPUs and ran Oracle Linux.
All LLM engine serving machines used NVIDIA GPUs. Modal’s entire GPU fleet is actively and automatically monitored for GPU health issues, including heating issues. The H100 GPU cards used were all of the SXM form factor (data sheet here). Experiments were run with version 570.86.15 of the NVIDIA GPU Driver and CUDA Driver API version 12.8.
Version information for the LLM engine software is included with each result. We used the latest version of each framework available at the time of measurement. We used container images made publicly available by vLLM, SGLang, and NVIDIA (details in sample code). We retrieved model weights from the Hugging Face Hub.
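If you want to compare your own environment against ours, a quick way to record the same version information is to query the engine and GPU driver from inside the container; this sketch assumes vLLM is installed and nvidia-smi is on the PATH.

```python
# Sketch: record the engine and GPU driver versions used in a run.
# Assumes vLLM is installed and nvidia-smi is available on PATH.
import subprocess

import vllm

print("vLLM version:", vllm.__version__)
print(
    subprocess.run(
        ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    ).stdout.strip()
)
```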
The benchmarking clients and LLM engine servers ran inside the gVisor sandbox as part of the Modal container runtime. The guest OS was Debian Linux. CPU and RAM allocations were lightly tuned to avoid bottlenecks while maximizing bin-packing. LLM engine servers all exposed an OpenAI-compatible REST API, and clients communicated with them over HTTP/TCP/IP. These requests passed through the Modal input plane in the eastern United States, which handles routing and auto-scaling, as it would in a production deployment. Altogether, this stack adds about 100 ms of overhead on top of ~50 ms of network latency at the 95th percentile. That overhead could be reduced by an order of magnitude by peering clients and edge servers more directly, at the cost of increased engineering complexity (see this sample code for WebRTC on Modal, which achieves <25 ms peer-to-peer over RTP for users near the edge deployment).
We would like to thank Michael Goin of Red Hat AI, Moin Nadeem and Nikhil Murthy of Phonic, Ishan Dhanani of NVIDIA Dynamo, and Charles Pierse of Weaviate for feedback on early drafts of this interface.
