Serverless TensorRT-LLM (LLaMA 3 8B)
In this example, we demonstrate how to use the TensorRT-LLM framework to serve Meta’s LLaMA 3 8B model at very high throughput.
We achieve a total throughput of over 25,000 output tokens per second on a single NVIDIA H100 GPU. At Modal’s on-demand rate of ~$4/hr, that’s under $0.05 per million tokens — on auto-scaling infrastructure and served via a customizable API.
Overview
This guide is intended to document two things: the general process for building TensorRT-LLM on Modal and a specific configuration for serving the LLaMA 3 8B model.
Build process
Any given TensorRT-LLM service requires a multi-stage build process, starting from model weights and ending with a compiled engine. Because that process touches many sharp-edged high-performance components across the stack, it can easily go wrong in subtle and hard-to-debug ways that are idiosyncratic to specific systems. And debugging GPU workloads is expensive!
This example builds an entire service from scratch, from downloading weight tensors to responding to requests, and so serves as living, interactive documentation of a TensorRT-LLM build process that works on Modal.
Engine configuration
TensorRT-LLM is the Lamborghini of inference engines: it achieves seriously impressive performance, but only if you tune it carefully. We document the choices we made here and point to additional resources so you know where and how to adjust the parameters for your use case.
Installing TensorRT-LLM
To run TensorRT-LLM, we must first install it. Easier said than done!
In Modal, we define container images that run our serverless workloads. All Modal containers have access to GPU drivers via the underlying host environment, but we still need to install the software stack on top of the drivers, from the CUDA runtime up.
We start from an official nvidia/cuda image,
which includes the CUDA runtime & development libraries
and the environment configuration necessary to run them.
On top of that, we add some system dependencies of TensorRT-LLM,
including OpenMPI for distributed communication, some core software like git,
and the tensorrt_llm package itself.
Note that we’re doing this by chaining a number of method calls on the modal.Image. If you’re familiar with
Dockerfiles, you can think of this as a Pythonic interface to instructions like RUN and CMD.
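Concretely, the chained definition looks something like the sketch below. The CUDA base image tag, Python version, and the tensorrt_llm pin are illustrative assumptions, not the exact versions this example uses.

```python
import modal

tensorrt_image = (
    modal.Image.from_registry(
        # assumption: a recent CUDA devel image; the example pins a specific tag
        "nvidia/cuda:12.1.1-devel-ubuntu22.04",
        add_python="3.10",
    )
    .apt_install("openmpi-bin", "libopenmpi-dev", "git", "git-lfs", "wget")
    .pip_install(
        "tensorrt_llm",  # assumption: in practice, pin the version you tested
        pre=True,
        extra_index_url="https://pypi.nvidia.com",
    )
)
```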
End-to-end, this step takes five minutes.
If you’re reading this from top to bottom,
you might want to stop here and execute the example
with modal run trtllm_throughput.py so that it runs in the background while you read the rest.
Downloading the Model
Next, we download the model we want to serve. In this case, we’re using the instruction-tuned version of Meta’s LLaMA 3 8B model. We use the function below to download the model from the Hugging Face Hub.
Just defining that function doesn’t actually download the model, though.
We can run it by adding it to the image’s build process with run_function.
The download process has its own dependencies, which we add here.
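A minimal sketch of that step, continuing the image definition above; the repository id, directory, and secret name are assumptions, and the gated LLaMA 3 weights may require a Hugging Face token supplied via a Modal secret.

```python
MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumption: standard HF repo id
MODEL_DIR = "/root/model/model_input"  # assumption: where we stash the raw weights


def download_model():
    import os

    from huggingface_hub import snapshot_download

    os.makedirs(MODEL_DIR, exist_ok=True)
    snapshot_download(
        MODEL_ID,
        local_dir=MODEL_DIR,
        ignore_patterns=["*.pt", "*.bin"],  # the safetensors files are enough
    )


tensorrt_image = (
    tensorrt_image.pip_install("huggingface_hub", "hf-transfer")
    .env({"HF_HUB_ENABLE_HF_TRANSFER": "1"})  # faster parallel downloads
    .run_function(
        download_model,
        # assumption: a secret holding a token for the gated repo
        secrets=[modal.Secret.from_name("huggingface-secret")],
    )
)
```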
Quantization
The amount of GPU RAM on a single card is a tight constraint for most LLMs: RAM is measured in billions of bytes and models have billions of parameters. The performance cliff if you need to spill to CPU memory is steep, so all of those parameters must fit in the GPU memory, along with other things like the KV cache.
The simplest way to reduce LLM inference’s RAM requirements is to make the model’s parameters themselves smaller, storing each value in fewer bits, like eight or four instead of sixteen. This is known as quantization.
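A quick back-of-the-envelope calculation shows why this matters for an 8B-parameter model on an 80 GB H100:

```python
n_params = 8e9  # LLaMA 3 8B

print(f"fp16 weights: ~{n_params * 2 / 1e9:.0f} GB")  # ~16 GB
print(f"fp8 weights:  ~{n_params * 1 / 1e9:.0f} GB")  # ~8 GB, leaving more of the 80 GB for KV cache
```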
We use a quantization script provided by the TensorRT-LLM team. This script takes a few minutes to run.
NVIDIA’s Ada Lovelace/Hopper chips, like the 4090, L40S, and H100,
are capable of native calculations in 8bit floating point numbers, so we choose that as our quantization format (qformat).
These GPUs are capable of twice as many floating point operations per second in 8bit as in 16bit —
about two quadrillion per second on an H100 SXM.
Quantization is lossy, but the impact on model quality can be minimized by tuning the quantization parameters based on target outputs.
We put that all together with another invocation of .run_commands.
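Roughly, that invocation looks like the sketch below, assuming the quantize.py script from the TensorRT-LLM examples directory; the flag values and calibration size are illustrative.

```python
CKPT_DIR = "/root/model/model_ckpt"  # assumption: where the quantized checkpoint lands

tensorrt_image = tensorrt_image.run_commands(
    # assumption: in practice, fetch quantize.py from a pinned TensorRT-LLM commit, not main
    "wget https://raw.githubusercontent.com/NVIDIA/TensorRT-LLM/main/examples/quantization/quantize.py",
    f"python quantize.py --model_dir {MODEL_DIR} --dtype float16"
    f" --qformat fp8 --kv_cache_dtype fp8"
    f" --output_dir {CKPT_DIR} --calib_size 512",
    gpu="H100",  # calibration runs on the GPU we will serve on
)
```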
Compiling the engine
TensorRT-LLM achieves its high throughput primarily by compiling the model:
making concrete choices of CUDA kernels to execute for each operation.
These kernels are much more specific than matrix_multiply or softmax —
they have names like maxwell_scudnn_winograd_128x128_ldg1_ldg4_tile148t_nt.
They are optimized for the specific types and shapes of tensors that the model uses
and for the specific hardware that the model runs on.
That means we need to know all of that information a priori — more like the original TensorFlow, which defined static graphs, than like PyTorch, which builds up a graph of kernels dynamically at runtime.
This extra layer of constraint on our LLM service is an important part of what allows TensorRT-LLM to achieve its high throughput.
So we need to specify things like the maximum batch size and the lengths of inputs and outputs. The closer these are to the actual values we’ll use in production, the better the throughput we’ll get.
Since we want to maximize throughput, and assuming a steady stream of requests, we set the batch size to the largest value that fits in GPU RAM. Quantization helps us again here, since it allows us to fit more tokens in the same RAM.
There are many additional options you can pass to trtllm-build to tune the engine for your specific workload.
You can find the document we used for LLaMA here,
which you can use to adjust the arguments to fit your workloads,
e.g. adjusting rotary embeddings and block sizes for longer contexts.
For more performance tuning tips, check out NVIDIA’s official TensorRT-LLM performance guide.
To make best use of our 8bit floating point hardware, and the weights and KV cache we have quantized, we activate the 8bit floating point fused multi-head attention plugin.
Because we are targeting maximum throughput, we do not activate the low latency 8bit floating point matrix multiplication plugin
or the 8bit floating point matrix multiplication (gemm) plugin, both of which the documentation indicates are aimed at smaller batch sizes.
We put all of this together with another invocation of .run_commands.
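Continuing the sketch, a build command along these lines captures the choices above; the exact flag names vary somewhat between TensorRT-LLM releases, and the size limits are illustrative.

```python
ENGINE_DIR = "/root/model/model_output"  # assumption: where the compiled engine lands
MAX_BATCH_SIZE = 1024  # as large as fits alongside the KV cache in GPU RAM
MAX_INPUT_LEN = 256  # illustrative limits; match them to your workload
MAX_OUTPUT_LEN = 256

tensorrt_image = tensorrt_image.run_commands(
    f"trtllm-build --checkpoint_dir {CKPT_DIR} --output_dir {ENGINE_DIR}"
    f" --max_batch_size {MAX_BATCH_SIZE}"
    f" --max_input_len {MAX_INPUT_LEN} --max_output_len {MAX_OUTPUT_LEN}"
    " --use_fp8_context_fmha enable",  # the fp8 fused multi-head attention plugin
    # note: we simply leave the fp8 and low-latency gemm plugins unset
    gpu="H100",  # the engine is specialized to the GPU it is built on
)
```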
Serving inference at tens of thousands of tokens per second
Now that we have the engine compiled, we can serve it with Modal by creating an App.
Thanks to our custom container runtime system, even this large, many-gigabyte container boots in seconds.
At container start time, we boot up the engine, which completes in under 30 seconds. Container starts are triggered when Modal scales up your infrastructure, like the first time you run this code or the first time a request comes in after a period of inactivity.
Container lifecycles in Modal are managed via our Cls interface, so we define one below
to manage the engine and run inference.
For details, see this guide.
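A condensed sketch of that class follows. It assumes the tensorrt_llm runtime's ModelRunner and the paths defined in the build sketches above, and it elides the sampling and tokenization details the full example handles more carefully.

```python
app = modal.App("example-trtllm-llama3")  # assumption: app name


@app.cls(image=tensorrt_image, gpu="H100")
class Model:
    @modal.enter()
    def load_engine(self):
        # Runs once per container, at start time: load the tokenizer and boot the engine.
        from tensorrt_llm.runtime import ModelRunner
        from transformers import AutoTokenizer

        self.tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
        self.runner = ModelRunner.from_dir(ENGINE_DIR)

    @modal.method()
    def generate(self, prompts: list[str]) -> list[str]:
        import torch

        # Tokenize the whole batch, run one batched generation, then decode the results.
        batch = [
            torch.tensor(self.tokenizer.encode(p), dtype=torch.int32) for p in prompts
        ]
        outputs = self.runner.generate(
            batch,
            max_new_tokens=MAX_OUTPUT_LEN,
            end_id=self.tokenizer.eos_token_id,
            pad_id=self.tokenizer.eos_token_id,
        )
        return [
            self.tokenizer.decode(seq[0], skip_special_tokens=True) for seq in outputs
        ]
```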
Calling our inference function
Now, how do we actually run the model?
There are two basic methods: from Python via our SDK or from anywhere, by setting up an API.
Calling inference from Python
To run our Model’s .generate method from Python, we just need to call it —
with .remote appended to run it on Modal.
We wrap that logic in a local_entrypoint so you can run it from the command line with modal run trtllm_throughput.py.
For simplicity, we hard-code a batch of 128 questions to ask the model, then grow it to a batch size of 1024 by prepending seven distinct prefixes. These prefixes ensure KV cache misses for the remainder of each generation, keeping the benchmark closer to what can be expected in a real workload.
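Sketched out, the entrypoint is just a few lines; the questions and prefixes shown here are placeholders for the hard-coded lists in the full example.

```python
@app.local_entrypoint()
def main():
    questions = ["What is the origin of the name 'Python'?"] * 128  # placeholder questions
    prefixes = ["", "Hi! ", "Hello! ", "Hey! ", "Howdy! ", "Greetings! ", "Yo! ", "Salutations! "]
    prompts = [prefix + q for prefix in prefixes for q in questions]  # 8 x 128 = 1024
    outputs = Model().generate.remote(prompts)
    print(outputs[0])
```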
Calling inference via an API
We can use modal.fastapi_endpoint with app.function to turn any Python function into a web API.
This API wrapper doesn’t need all the dependencies of the core inference service,
so we switch images here to a basic Linux image, debian_slim, and add the FastAPI stack.
From there, we can take the same remote generation logic we used in main and serve it with only a few more lines of code.
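A sketch of that wrapper, assuming a simple JSON body containing a list of prompts; the endpoint name and request shape are illustrative.

```python
web_image = modal.Image.debian_slim(python_version="3.11").pip_install(
    "fastapi[standard]"
)


@app.function(image=web_image)
@modal.fastapi_endpoint(method="POST")
def generate_web(data: dict) -> list[str]:
    # Hand the prompts off to the same remote generation logic used in `main`.
    return Model().generate.remote(data["prompts"])
```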
To set our function up as a web endpoint, we need to run this file —
with modal serve to create a hot-reloading development server or modal deploy to deploy it to production.
The URL for the endpoint appears in the output of the modal serve or modal deploy command.
Add /docs to the end of this URL to see the interactive Swagger documentation for the endpoint.
You can also test the endpoint by sending a POST request with curl from another terminal, substituting in the URL printed by modal serve or modal deploy.
And now you have a high-throughput, low-latency, autoscaling API for serving LLM completions!
Footer
The rest of the code in this example is utility code.