Serve an interactive language model app with latency-optimized TensorRT-LLM (LLaMA 3 8B)

In this example, we demonstrate how to configure the TensorRT-LLM framework to serve Meta’s LLaMA 3 8B model at interactive latencies on Modal.

Many popular language model applications, like chatbots and code editing, put humans and models in direct interaction. According to an oft-cited if scientifically dubious rule of thumb, computer systems need to keep their response times under 400ms in order to match pace with their human users.

To hit this target, we use the TensorRT-LLM inference framework from NVIDIA. TensorRT-LLM is the Lamborghini of inference engines: it achieves seriously impressive latency, but only if you tune it carefully. With the out-of-the-box defaults we observe an unacceptable median time to last token of over a second, but with careful configuration, we’ll bring that down to under 250ms — over a 4x speed up! These latencies were measured on a single NVIDIA H100 GPU running LLaMA 3 8B on prompts and generations of a few dozen to a few hundred tokens.

Here’s what that looks like in a terminal chat interface:

Overview

This guide is intended to document two things:

the Python API for building and running TensorRT-LLM engines, and
how to use recommendations from the TensorRT-LLM performance guide to optimize the engine for low latency.

Be sure to check out TensorRT-LLM’s examples for sample code beyond what we cover here, like low-rank adapters (LoRAs).

What is a TRT-LLM engine?

The first step in running TensorRT-LLM is to build an “engine” from a model. Engines have a large number of parameters that must be tuned on a per-workload basis, so we carefully document the choices we made here and point you to additional resources that can help you optimize for your specific workload.

Historically, this process was done with a clunky command-line-interface (CLI), but things have changed for the better! 2025 is the year of CUDA Python, including a new-and-improved Python SDK for TensorRT-LLM, supporting all the same features as the CLI — quantization, speculative decoding, in-flight batching, and much more.

Installing TensorRT-LLM

To run TensorRT-LLM, we must first install it. Easier said than done!

To run code on Modal, we define container images. All Modal containers have access to GPU drivers via the underlying host environment, but we still need to install the software stack on top of the drivers, from the CUDA runtime up.

We start from an official nvidia/cuda container image, which includes the CUDA runtime & development libraries and the environment configuration necessary to run them.

On top of that, we add some system dependencies of TensorRT-LLM, including OpenMPI for distributed communication, some core software like git, and the tensorrt_llm package itself.

Note that we’re doing this by method-chaining a number of calls to methods on the modal.Image. If you’re familiar with Dockerfiles, you can think of this as a Pythonic interface to instructions like RUN and CMD.

End-to-end, this step takes about five minutes on first run. If you’re reading this from top to bottom, you might want to stop here and execute the example with modal run so that it runs in the background while you read the rest.

Downloading the model

Next, we’ll set up a few things to download the model to persistent storage and do it quickly — this is a latency-optimized example after all! For persistent, distributed storage, we use Modal Volumes, which can be accessed from any container with read speeds in excess of a gigabyte per second.

We also set the HF_HOME environment variable to point to the Volume so that the model is cached there. And we install hf-transfer to get maximum download throughput from the Hugging Face Hub, in the hundreds of megabytes per second.

Setting up the engine

Quantization

The amount of GPU RAM on a single card is a tight constraint for large models: RAM is measured in billions of bytes and large models have billions of parameters, each of which is two to four bytes. The performance cliff if you need to spill to CPU memory is steep, so all of those parameters must fit in the GPU memory, along with other things like the KV cache built up while processing prompts.

The simplest way to reduce LLM inference’s RAM requirements is to make the model’s parameters smaller, fitting their values in a smaller number of bits, like four or eight. This is known as quantization.

NVIDIA’s Ada Lovelace/Hopper chips, like the L40S and H100, are capable of native 8bit floating point calculations in their Tensor Cores, so we choose that as our quantization format. These GPUs are capable of twice as many floating point operations per second in 8bit as in 16bit — about two quadrillion per second on an H100 SXM.

Quantization buys us two things:

faster startup, since less data has to be moved over the network onto CPU and GPU RAM
faster inference, since we get twice the FLOP/s and less data has to be moved from GPU RAM into on-chip memory and registers with each computation

We’ll use TensorRT-LLM’s QuantConfig to specify that we want FP8 quantization. See their code for more options.

Quantization is a lossy compression technique. The impact on model quality can be minimized by tuning the quantization parameters on even a small dataset. Typically, we see less than 2% degradation in evaluation metrics when using fp8. We’ll use the CalibrationConfig class to specify the calibration dataset.

Configure plugins

TensorRT-LLM is an LLM inference framework built on top of NVIDIA’s TensorRT, which is a generic inference framework for neural networks.

TensorRT includes a “plugin” extension system that allows you to adjust behavior, like configuring the CUDA kernels used by the engine. The General Matrix Multiply (GEMM) plugin, for instance, adds heavily-optimized matrix multiplication kernels from NVIDIA’s cuBLAS library of linear algebra routines.

We’ll specify a number of plugins for our engine implementation. The first is multiple profiles, which configures TensorRT to prepare multiple kernels for each high-level operation, where different kernels are optimized for different input sizes. The second is paged_kv_cache which enables a paged attention algorithm for the key-value (KV) cache.

The last two parameters are GEMM plugins optimized specifically for low latency, rather than the more typical high arithmetic throughput, the low_latency plugins for gemm and gemm_swiglu.

The low_latency_gemm_swiglu_plugin plugin fuses the two matmul operations and non-linearity of the feedforward component of the Transformer block into a single kernel, reducing round trips between GPU cache memory and RAM. For details on kernel fusion, see this blog post by Horace He of Thinking Machines. Note that at the time of writing, this only works for FP8 on Hopper GPUs.

The low_latency_gemm_plugin is a variant of the GEMM plugin that brings in latency-optimized kernels from NVIDIA’s CUTLASS library.

Configure speculative decoding

Speculative decoding is a technique for generating multiple tokens per step, avoiding the auto-regressive bottleneck in the Transformer architecture. Generating multiple tokens in parallel exposes more parallelism to the GPU. It works best for text that has predicable patterns, like code, but it’s worth testing for any workload where latency is critical.

Speculative decoding can use any technique to guess tokens, including running another, smaller language model. Here, we’ll use a simple, but popular and effective speculative decoding strategy called “lookahead decoding”, which essentially guesses that token sequences from the past will occur again.

Set the build config

Finally, we’ll specify the overall build configuration for the engine. This includes more obvious parameters such as the maximum input length, the maximum number of tokens to process at once before queueing occurs, and the maximum number of sequences to process at once before queueing occurs.

To minimize latency, we set the maximum number of sequences (the “batch size”) to just one. We enforce this maximum by setting the number of inputs that the Modal Function is allowed to process at once — max_concurrent_inputs. The default is 1, so we don’t need to set it, but we are setting it explicitly here in case you want to run this code with a different balance of latency and throughput.

Serving inference under the Doherty Threshold

Now that we have written the code to compile the engine, we can serve it with Modal!

We start by creating an App.

Thanks to our custom container runtime system, even this large container boots in seconds.

On the first container start, we mount the Volume, download the model, and build the engine, which takes a few minutes. Subsequent starts will be much faster, as the engine is cached in the Volume and loaded in seconds.

Container starts are triggered when Modal scales up your Function, like the first time you run this code or the first time a request comes in after a period of inactivity. For details on optimizing container start latency, see this guide.

Container lifecycles in Modal are managed via our Cls interface, so we define one below to separate out the engine startup (enter) and engine execution (generate). For details, see this guide.

Calling our inference function

To run our Model’s .generate method from Python, we just need to call it — with .remote appended to run it on Modal.

We wrap that logic in a local_entrypoint so you can run it from the command line with

which will output something like:

Use --mode=slow to see model latency without optimizations.

which will output something like

For simplicity, we hard-code 10 questions to ask the model, then run them one by one while recording the latency of each call. But the code in the local_entrypoint is just regular Python code that runs on your machine — we wrap it in a CLI automatically — so feel free to customize it to your liking.

Once deployed with modal deploy, this Model.generate function can be called from other Python code. It can also be converted to an HTTP endpoint for invocation over the Internet by any client. For details, see this guide.

As a quick demo, we’ve included some sample chat client code in the Python main entrypoint below. To use it, first deploy with

and then run the client with