LLM Engines: An Executive Summary

Nearly every serious application of computer systems includes relational database management software, from local SQLite on phones to planet-scale Spanner in cloud data centers. These systems all trace their lineage back through IBM System R to the Information Management System built to send humans to the Moon — or at least track all the bills of material for the rockets that carried them. In place of behemoth proprietary software built for specific systems at great expense, there is now a legion of open source software.

Language models are still closer in their lifecycle to the Peterlee Relational Test Vehicle than to Postgres, but they hold the same promise: to be a part of nearly every application of computer systems. When these applications need to store and retrieve structured data, they currently deduce answers by following the relational algebra of Ted Codd; when they need to produce or process unstructured data, they will infer them with neural probabilistic models.

The weights for capable language models are now widely available under permissive or open source licenses — the Llama, Qwen, DeepSeek, Mistral, and Gemma model families, to name a few. Over the past year, these models have rapidly improved, reaching the baseline of quality required for inference to be useful. This is a key step enabling organizations to serve their own language model applications. Below, we walk through the cases we've seen where it makes sense to use these models in place of proprietary systems.

But just as data needs a software system, the RDBMS, to manage storage and execute queries, language models need a software system to drive their inference, managing the storage of query caches and scheduling large matrix multiplications on specialized hardware.

As with SQL database engines, there are open source "LLM engines" you can run yourself. And like the models, this software stack has rapidly improved in both usability and performance over the past year — reaching the baseline of quality required for self-serve inference to be economical.

These engines are the primary subject of this report.

It answers the most common and most critical questions we hear asked by technical leaders interested in running their own LLM engines. It is informed by discussion with those leaders, with LLM engineers building applications, and with the developers of LLM engines. It grounds its claims in the same benchmarking work that supports our LLM Engine Advisor, which indicates baseline performance for engines on specific workloads and provides starter code for running your own LLM engine on Modal's serverless cloud infrastructure.

When should I use open weights language models instead of proprietary services like OpenAI or Anthropic?

Many organizations are already building LLM applications based on proprietary models that can't be self-hosted. So the first question to consider is when and why they should switch away from those services.

The standard arguments for building technology in-house instead of buying apply here.

The most common concern is data governance, which frequently mandates tight control of the servers that process user data. Worries about model providers secretly training on data are overblown, but legitimate requirements remain, especially in regulated industries.

The other most commonly cited motivation is cost, familiar to anyone who has considered self-hosting anything. Competition (including from open models) has kept any one provider from charging too high a premium. But as LLM applications mature and their requirements become clearer, the capabilities of the systems provided by frontier labs focused on artificial general intelligence or superintelligence become unnecessary. Those capabilities come at a cost, relative to a smaller model tuned or prompted carefully. Think of it like rewriting code from JavaScript or Python to Go or Rust once the feature velocity decreases. For details, see this post from one of our customers, OpenPipe, which provides post-training as a service.

This is just one instance of the kind of customization that's not possible or not economical with proprietary models. Limits here are similar to limits on extending proprietary software and we expect them to be similarly durable. Models are, after all, valuable intellectual property, and exposing them too much to tinkering and development risks leaking that IP. We expect customization to only increase in importance over time, as it has in domains like image generation, where open weights models are more mature. Organizations that move now will be better prepared for this future.

Finally, there is a less technical reason to consider this switch: the movement of OpenAI and Anthropic into the application layer. Releases like Claude Code and OpenAI Codex represent large steps away from language modeling and towards applications of language models. And we've seen this before: OpenAI's move to Chat Completions APIs (i.e. those for instruction-tuned models designed for chat) represented a large step towards applications relative to the original Completions APIs (i.e. those for models trained just to predict text). OpenAI has further signaled that they plan to continue this trajectory with their even more abstract Responses API.
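To make that abstraction step concrete, here is a minimal sketch of the difference between the two API shapes using the OpenAI Python SDK. The model names are illustrative choices, not a recommendation, and the sketch assumes an API key is configured in the environment.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; model names are illustrative

# Original Completions API: the model simply continues a raw text prompt.
completion = client.completions.create(
    model="gpt-3.5-turbo-instruct",
    prompt="The capital of France is",
    max_tokens=5,
)

# Chat Completions API: the provider layers roles, turns, and chat formatting
# on top of the underlying text model -- a step toward the application layer.
chat = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)

print(completion.choices[0].text)
print(chat.choices[0].message.content)
```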

Use open weights models when there's such a thing as "smart enough".

Collaborative, open source solutions tend to win out over competitive, proprietary solutions when they pool together labor and resources to build non-differentiating capabilities — think programming languages, operating systems, and databases. These are needed by everyone, and almost no one gets enough competitive edge from building their own to make it worth the effort.

We see the same phenomenon with open weights* language models. Basic capabilities like code completion, assistant chat, and data extraction have been commoditized by open weights models. In each case, there's a bounded level of capability (or "intelligence") required to complete the task satisfactorily. Each such level has been reached by open weights models within about a year of the proprietary frontier, most recently by DeepSeek-R1-0528, which goes toe-to-toe with OpenAI's six-month-old o3 on a number of benchmarks.

There are other cases where the demand for intelligence, like the demand for RAM in computing systems, is effectively unbounded. These cases include:

  • zero-sum competitive settings (politics, markets, & other games) and
  • settings with high tail risk (like human-off-the-loop control of computers, where rm -rf ~/ is always only a few tokens away).

There, we see a continued role for proprietary language models, just as there is still a role for proprietary database systems like Oracle, Microsoft SQL Server, and IBM Db2 (all in the top ten on db-engines.com).

*We use the term "open weights" here instead of "open source", since in most cases the source code required to produce the weight binaries is not provided under an OSI-approved license (or at all). We expect this distinction to matter more, not less, in the future.

How do I make the build vs buy decision for LLM inference?

Open source databases like Postgres are often offered as managed services and used by everyone from startups to the Fortune 500. Open weights language models are no different. Startups like Together and hyperscalers like Amazon are already offering inference as a service. So why run it yourself?

You can readily beat language model API providers on price if you're running batch workloads on shorter contexts.

Chatting and code completion are the most popular applications of large language models, and they are both interactive. Engineering an interactive, streaming, and latency-sensitive language model application is challenging, just as it is for other computer systems (more on that below).

But LLMs can also perform other tasks that are less latency-sensitive, like extracting data from support chat logs or translating a large corpus of documents. There, throughput is the most important factor — the name of the game is queries per second. This setting is much easier to engineer and optimize, as described below, and so it's easier to beat managed services on price.
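To give a sense of how little orchestration such a batch job needs, here is a minimal sketch using vLLM's offline Python API. The model name, parallelism, and prompts are illustrative assumptions; an 8-GPU replica like the one in the experiments below would use tensor_parallel_size=8.

```python
from vllm import LLM, SamplingParams

# Offline, throughput-oriented batch inference: no streaming, no per-request
# latency handling. vLLM batches and schedules requests internally.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # assumes access to the weights
    tensor_parallel_size=8,
    quantization="fp8",
)

prompts = [f"Summarize support ticket #{i}: ..." for i in range(10_000)]
params = SamplingParams(max_tokens=128, temperature=0.0)

outputs = llm.generate(prompts, params)
for out in outputs[:3]:
    print(out.outputs[0].text)
```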

In one set of experiments, depicted below, we ran Meta's Llama 3.1 70B in 8-bit floating point (fp8) precision. The test data has more input tokens than output tokens, as is common in retrieval-augmented generation (RAG) or structured data extraction. In particular, the inputs have 1024 tokens (about a page of text) and the outputs have 128 tokens (about a paragraph).

Both vLLM and SGLang ran at ~17 QPS per 8xH100 replica without any tuning. The chart below shows the median latency to first token as we varied the request rate. Sacrificing interactivity bought roughly an 8x increase in throughput: the leftmost points correspond to ~200 ms end-to-end latency, the rightmost to ~4 s. Details of our method are here.

[Chart: median latency to first token vs. request rate for vLLM and SGLang]

Running this configuration on Modal's starter plan, which has purely usage-based pricing, you can set up a batch system that processes ~20k tok/s with Llama 3.1 70B fp8 at ~50¢ per million tokens. Modal's paid plans allow this to scale up to hundreds of replicas. This compares favorably with published rates from API providers. For performance data for other configurations, see the LLM Engine Advisor released along with this summary.
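For a sanity check on that rate, here is the arithmetic relating replica throughput, replica cost, and price per token. The throughput figure is the ~20k tok/s measured above; the hourly replica cost is an assumed, illustrative number, not a quote of anyone's published pricing.

```python
# Back-of-the-envelope cost per token for a batch "token factory".
throughput_tok_per_s = 20_000   # ~20k tok/s per 8xH100 replica (measured above)
replica_cost_per_hour = 36.0    # assumed all-in $/hr for the replica (illustrative)

tokens_per_hour = throughput_tok_per_s * 3600                    # 72M tokens/hr
cost_per_million_tokens = replica_cost_per_hour / (tokens_per_hour / 1e6)
print(f"${cost_per_million_tokens:.2f} per million tokens")      # ~$0.50
```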

Start by building your own batch "token factory" before you run a streaming token service.

Many of us were introduced to language models in an interactive system, like OpenAI ChatGPT or Anthropic Claude, that streams the outputs. These systems are harder to set up and to run economically than batch systems, so start with batch.

This is typically the case in computing, where batch (say, Spotify Discover Weekly) precedes streaming (say, TikTok feeds). Consider: early computers started out as batch-job machines, processing large batches of data like the U.S. Census or company payrolls. Users submitted very large jobs, via punch cards, and then waited for them to finish. The interactivity we are used to today, derived from time-sharing systems like Multics and UNIX, was only added later.

It is in general better to start with the easier, batch case and then to build the more difficult one afterwards, with the benefit of hard-won experience. This is true both for your organization's internal technical growth and for the development of the broader field of open source language model inference — i.e., outside of the handful of organizations that have driven the frontier so far.

But don't write your own LLM engine (unless you're betting the company).

If you're running the language model yourself, you need to think about the software used to pass inputs through those weights to create outputs — language model "inference" using an "LLM engine" or "LLM serving framework".

LLM engines are less complex than database management systems, but they are not simple software. Contemporary models are based on the Transformer architecture, which is optimized for efficiency during training. This is necessary to achieve the dizzying scale of frontier model training runs (estimated at 6 × 10^25, or ~100 mol, FLOPs). But it means that running this architecture efficiently when serving, say, chatbot requests is not as simple as writing a few lines of PyTorch.

One reason for the complication is the primacy of performance. Running large language models is expensive (often on the order of cents per thousand user queries), which incentivizes close attention to performance. Engineering for performance melts abstractions and reveals the thickets of complexity hidden underneath.

But it is nowhere near the complexity and expense of training your own model. If running LLMs is a key differentiating capability for your organization, building your own engine is worth considering. A small, talented team of engineers can start from published research and open source code and develop just the features you need within a few months. The primary technical risks are humdrum: maintenance, churn, and tooling compatibility in a rapidly changing field.

But the same arguments about differentiation that made the case above for using open weights also apply here. This is something many teams need to do, and now a number of them are collaborating to build it together — the topic of our next section. You can join them!

Which open source LLM engine should I choose?

There are three main open source LLM engines: vLLM, SGLang, and TensorRT-LLM.

vLLM and SGLang are open source, open governance projects in the same basic mold as Postgres — to the point of both also coming out of the University of California, Berkeley. Contributions to these projects come from the usual suspects in infrastructure: large organizations like Red Hat, late-stage startups like Anyscale, and leading teams serving proprietary models like xAI. Both build on Meta's PyTorch framework.

TensorRT-LLM is an open source but closed governance project by NVIDIA. It builds on top of NVIDIA's TensorRT framework.

We'll cover the differences between these projects and how to pick which one to use below.

All the engines stand on the shoulders of giants. All will get better as those giants get taller.

Because of the high cost of large language model inference, performance is the first factor used to evaluate LLM engines. There is less daylight here than you might expect from the intensity of benchmark wars on social media. That's because all of the engines are built with the same basic tools and under the same constraints.

First, the limits of performance are set by the hardware. Hardware engineers often refer to the "speed of light" — not literally, but as the maximum speed at which the hardware can run, set by clock speeds and bus widths. Almost all open source LLM inference is done on NVIDIA GPUs and so has the same speed of light.

Unlike typical CPU workloads, LLM inference on GPUs frequently runs at a high fraction of the speed of light, set by either the arithmetic bandwidth of the matrix multiplication hardware (Tensor Cores) or the memory bandwidth between GPU RAM and the registers of the Streaming Multiprocessors. This limits the room for speedups to ~2-3x at most, absent algorithmic differences, and those differences are generally small thanks to the rapid diffusion of innovations.
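To make the "speed of light" concrete, here is a rough, memory-bandwidth-bound estimate for decoding a single token. The hardware and model numbers are round, assumed figures (roughly a 70B fp8 model on an 8-GPU, H100-class replica), not measurements from our benchmarks.

```python
# At small batch sizes, every decoded token must stream all of the weights
# from GPU RAM, so HBM bandwidth, not Tensor Core throughput, sets the floor.
weights_bytes = 70e9            # ~70 GB of fp8 weights (one byte per parameter)
hbm_bandwidth_per_gpu = 3.3e12  # ~3.3 TB/s per GPU (assumed, H100-class)
num_gpus = 8                    # tensor-parallel replica size

aggregate_bandwidth = hbm_bandwidth_per_gpu * num_gpus

# Lower bound on inter-token latency when memory-bandwidth-bound; ignores
# KV cache reads, activations, and inter-GPU communication.
floor_seconds = weights_bytes / aggregate_bandwidth
print(f"inter-token latency floor: {floor_seconds * 1e3:.1f} ms")  # ~2.7 ms
```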

More deeply, the same basic stack is used by each of the engines — the CUDA software platform, the CUDA Basic Linear Algebra Subroutine (cuBLAS) library, and the CUDA Templates for Linear Algebra Subroutines (CUTLASS) kernel framework. In addition, vLLM and SGLang both use PyTorch. All of the engines stand to see their performance improve as new hardware is released and these bedrock libraries update to take advantage of it.

vLLM and SGLang achieve comparable out-of-the-box performance. Other factors should drive your decision.

We ran both vLLM and SGLang, out of the box, on dozens of LLM inference workloads. These workloads included models ranging in size from a few billion to nearly a trillion parameters and sequence lengths ranging from one thousand to ten thousand tokens. The results were strikingly similar across the two frameworks, especially for throughput during batch processing: see how closely the points hug the SGLang = vLLM line in the plot below. Read more about our methodology here.

That means you'll need to consider other factors in making your decision. Because vLLM has been around for longer and has historically been faster to market with new features, we have accumulated more experience with it. But SGLang's recent rapid development is promising, and we're looking closely at both. A little competition is good for everyone (else).

In our experience, vLLM is fastest to market with new features.

At time of completion of our experiments in late May 2025, TensorRT-LLM (0.20.0.rc3) did not support Gemma 3 or Qwen 3.

On SGLang (0.4.6-post5-cu124), we hit this issue (resolved but not released) when running DeepSeek-V3 in an INT4 quantization, and CUDA out-of-memory errors when running Qwen 3 235B A22B that we couldn't resolve in time for release.

We didn't find any workloads of interest that we couldn't run on vLLM (0.8.x and then 0.9.0). We did, however, discover that we couldn't independently toggle CUDA graph capture and Torch graph compilation (also now resolved but not released) as needed to match SGLang's behavior more closely. See the next section.

Finally, at time of writing in early June 2025, SGLang does not have accelerated kernels for Blackwell GPUs like the B200 in a stable release, as they are still on PyTorch 2.6. Kernels compiled for max performance on the Blackwell SM architecture were only added to PyTorch wheels in 2.7. Blackwell support, including partially-optimized kernels, was released for vLLM just as we wrapped up our work.

Startup times are slower with vLLM than with SGLang by default — mostly due to Torch compilation.

With the default settings, startup times for vLLM servers were much longer: around five minutes for 8B models, compared to SGLang's one minute. In both cases, model weights were loaded from our distributed model cache at about the same rate, ~1 GB/s.

The primary difference is in the out-of-the-box configuration. vLLM turns Torch graph compilation on by default. Compilation can improve performance, in particular for models that don't have custom fused kernels available, but it incurs a startup cost that is hard to manage with caching (docs).

Separately, both frameworks use CUDA graph capture, which also reduces latency, in particular at low (~3-5ms) inter-token latencies. CUDA graph capture is simpler and faster than Torch graph compilation — well under a minute for typical configurations.
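The two mechanisms are easy to confuse, so here is a toy sketch of the underlying PyTorch primitives, using a stand-in Linear module rather than a real LLM. This is not how vLLM or SGLang wire things up internally, just an illustration of why compilation is slow at startup while graph capture is cheap.

```python
import torch

# Toy stand-in for a model forward pass on a single GPU.
model = torch.nn.Linear(4096, 4096, device="cuda", dtype=torch.float16)
x = torch.randn(8, 4096, device="cuda", dtype=torch.float16)

with torch.inference_mode():
    # Torch graph compilation: the first call triggers tracing and code
    # generation, which is where the multi-minute startup cost comes from
    # on large models.
    compiled = torch.compile(model)
    _ = compiled(x)  # slow first call, fast on subsequent calls

    # CUDA graph capture: run once eagerly to warm up (lazy library init,
    # allocations), then record the kernel launches and replay them.
    static_x = x.clone()
    _ = model(static_x)  # warmup
    torch.cuda.synchronize()

    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        static_y = model(static_x)

    static_x.copy_(torch.randn_like(x))  # update the inputs in place...
    graph.replay()                       # ...and replay the captured kernels
```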

vLLM has recently bet heavily on Torch compilation, while SGLang seems to have done the opposite. Which choice turns out to be best will depend on the progress of that project.

TensorRT-LLM can provide big wins, especially at the lowest latencies — but don't underestimate the engineering cost.

While vLLM and SGLang are strikingly similar in many ways, TensorRT-LLM is quite different. Its Python interface is a thinner wrapper over the underlying CUDA C++ software than theirs. TensorRT-LLM also requires the LLM engine be compiled ahead of time per workload. These artifacts are stored on disk, which reduces startup times, but this extra manual build step adds complexity. Our sample code for running it on Modal is thus about three times longer, at 150 lines to 50 for the other two frameworks.

And out of the box, we observe worse performance with TensorRT-LLM than with vLLM or SGLang. This is to be expected, since the developers don't intend the default settings to be used in production serving, but it makes an apples-to-apples benchmark challenging.

Instead, TensorRT-LLM's build step exposes a bewildering array of un- or under-documented flags and parameters, like --reduce-fusion (presented in the docstring as a strict improvement to end-to-end performance, with caveats explained elsewhere). It is widely reported, and we have observed in a few cases, that the proper setting of these flags can make a substantial difference in performance, including going from much slower than the other engines to much faster.

This is a tough challenge for engineers and engineering leaders — is it worth it to pour a few weeks of very expensive engineering time into this tuning? That depends on what speedup is possible. Reported numbers help, but in our experience, the impact of configuration changes is very sensitive to surprising features of workloads, including features that might change during serving, so estimation is challenging and churn is high.

Another engineering challenge arises from the TensorRT-LLM development model. Until the most recent stable release, v0.19.0 in mid-May 2025, TensorRT-LLM's source code on GitHub was updated per release, many thousands of lines at a time. Seemingly, this was done to mirror, in bulk, a large number of changes made to an internal GitLab repo during actual development. This made it essentially impossible to connect changes in code to changes in behavior (using tools like git bisect, for instance). These releases also offered no backwards compatibility guarantee, so every update was a tedious, manual process requiring careful review of documentation, examples, and code. According to a recent announcement, they have officially adopted a "GitHub-first" development flow and are planning for a 1.0 release to improve stability and reduce churn. Very welcome!

Our current practice when we approach a new workload is to try vLLM or SGLang first, get benchmark numbers as quickly as possible, do some light tuning, and compare results to the latency objectives. If those frameworks meet the objective, we presume, until proven otherwise, that the performance benefits of TensorRT-LLM aren't worth the extra complexity, brittleness, and delay in time-to-market.

Meanwhile, we are slowly accumulating and sharing optimized TensorRT-LLM configurations as we discover them. We welcome contributions here. The biggest win we've seen so far came in a case where latency was at a premium but cost efficiency was not — perhaps unsurprising for software written by the hardware provider. That case is described in detail in our docs here.

What next?

If you either know what workload you want to run or are curious to see our results in more detail, check out the LLM Engine Advisor, which reports the latency numbers (time-to-first-token, time-to-last-token, and inter-token latency) we observed across a variety of request rate loads for popular open weights models run with vLLM, SGLang, and TensorRT-LLM. Code snippets are included.

If you'd like to know more about how to think about and benchmark LLM engine performance, check out our benchmarking guide.