High-throughput LLM inference with Tokasaurus (Llama 3.2 1B Instruct)
In this example, we demonstrate how to use Tokasaurus, an LLM inference framework designed for maximum throughput.
It maps the Large Language Monkeys GSM8K demo from the Tokasaurus release blog post onto Modal and replicates the core result: sustained throughput above 80,000 tokens per second, roughly 3x the numbers they report for vLLM and SGLang.
In the “Large Language Monkeys” inference-time compute scaling paradigm, also introduced by the same Stanford labs, many samples are drawn in parallel from a small model so that the overall system's response quality matches or exceeds that of a system built on a much larger model. Here, the pattern is applied to the Grade School Math (GSM8K) dataset.
For more on this LLM inference pattern (and an explainer on why it’s such a natural fit for current parallel computing systems) see our blog post reproducing and extending their results.
Set up the container image
Our first order of business is to define the environment our LLM engine will run in:
the container Image.
We translate the recipe the authors used to build their Tokasaurus environment into methods of modal.Image.
This requires, for instance, picking a base Image that includes the right version of the CUDA toolkit.
We also set an environment variable that directs Torch-based libraries to only compile kernels for the GPU SM architecture we are targeting, Hopper. This isn’t strictly necessary, but it silences some paranoid logs.
From there, Tokasaurus can be installed like any normal Python package, since Modal provides the host CUDA drivers.
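Put together, the image setup might look like the sketch below. The CUDA base tag, Python version, app name, and package pins are illustrative assumptions, not the exact recipe used in this example.

```python
import modal

# a minimal sketch of the image described above; the CUDA base tag, Python version,
# and package pins are assumptions for illustration, not the exact recipe used here
app = modal.App("tokasaurus-throughput")  # hypothetical app name

tokasaurus_image = (
    modal.Image.from_registry(
        "nvidia/cuda:12.4.1-devel-ubuntu22.04",  # base image with the CUDA toolkit
        add_python="3.12",
    )
    # compile kernels only for Hopper (sm_90); optional, but quiets build-time warnings
    .env({"TORCH_CUDA_ARCH_LIST": "9.0"})
    # Modal provides the host CUDA drivers, so this installs like any Python package
    .pip_install("tokasaurus", "huggingface_hub[hf_transfer]")
)
```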
Download the model weights
For this demo, we run Meta’s Llama 3.2 1B Instruct model, downloaded from Hugging Face. Since this is a gated model, you’ll need to accept the terms of use and create a Secret with your Hugging Face token to download the weights.
Although Tokasaurus will download weights from Hugging Face on-demand, we want to cache them so we don’t do it every time our server starts. We’ll use a Modal Volume for our cache. For more on storing model weights on Modal, see this guide.
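A minimal version of the weight-caching setup appears below, assuming a Volume named tokasaurus-hf-cache and a Secret named huggingface-secret holding your HF_TOKEN; both names and the cache path are placeholders.

```python
MODEL_NAME = "meta-llama/Llama-3.2-1B-Instruct"
MODEL_CACHE_PATH = "/root/models"  # where the Volume is mounted inside the container

hf_cache_volume = modal.Volume.from_name("tokasaurus-hf-cache", create_if_missing=True)
hf_secret = modal.Secret.from_name("huggingface-secret")  # must contain your HF_TOKEN


@app.function(  # `app` and `tokasaurus_image` come from the sketch above
    image=tokasaurus_image,
    volumes={MODEL_CACHE_PATH: hf_cache_volume},
    secrets=[hf_secret],
    timeout=30 * 60,  # give the download up to half an hour
)
def download_model():
    from huggingface_hub import snapshot_download

    # populate the Volume once so later server cold starts skip the download;
    # in practice you would also point HF_HOME at this path so Tokasaurus finds the cache
    snapshot_download(MODEL_NAME, cache_dir=MODEL_CACHE_PATH)
```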
Configure Tokasaurus for maximum throughput on this workload
On throughput-focused benchmarks with high prefix sharing workloads, Tokasaurus can outperform vLLM and SGLang nearly three-fold.
For small models like the one we are running, it reduces CPU overhead by maintaining a deep input queue and exposing shared prefixes to the GPU Tensor Cores with Hydragen.
We start by maximizing the number of tokens processed per forward pass by adjusting the following parameters:
- kv_cache_num_tokens: max tokens in the KV cache; higher values increase throughput but consume more GPU memory
- max_tokens_per_forward: max tokens processed per forward pass; higher values increase throughput but use more GPU memory
- max_seqs_per_forward: max sequences processed per forward pass; higher values increase batch size and throughput but require more GPU memory
We also set a few other parameters with less obvious impacts — the KV cache page size and the stop token behavior.
All values are derived from this version of the official benchmarking script,
except the KV_CACHE_NUM_TOKENS, which we increase to the maximum the GPU can handle.
The value in the script is (1024 + 512) * 1024, the maximum that the other engines can handle, which is lower than Tokasaurus's limit.
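As a sketch, these server-side settings might be collected into module-level constants like the following. Apart from the (1024 + 512) * 1024 figure quoted above, the numbers are placeholders to tune for your own GPU, not values taken from the benchmarking script.

```python
# only the (1024 + 512) * 1024 figure comes from the benchmarking script; the rest
# are placeholders to tune for your own GPU, not values from this example
KV_CACHE_NUM_TOKENS = (1024 + 512) * 1024  # the script's value; raise as far as your GPU's memory allows
MAX_TOKENS_PER_FORWARD = 32 * 1024  # assumption
MAX_SEQS_PER_FORWARD = 4096  # assumption
PAGE_SIZE = 16  # KV cache page size; hypothetical value
# the stop-token behavior is also configured, but we omit that flag here
```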
We could apply the Torch compiler to the model to make it faster and, via kernel fusion, reduce activation memory usage, leaving space for a larger KV cache. However, compilation dramatically increases the server's startup time, and we only see modest throughput improvements (about 20%, not 2x), so we don't use it here.
Lastly, we need to set a few of the parameters for the client requests, again based on the official benchmarking script.
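The client-side parameters look roughly like this; the specific values below are assumptions rather than the script's.

```python
# client-side request parameters; these particular values are assumptions,
# not the ones from the benchmarking script
SAMPLES_PER_PROBLEM = 128  # "Large Language Monkeys": many parallel samples per question
MAX_COMPLETION_TOKENS = 1024
TEMPERATURE = 0.6
```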
Serve Tokasaurus with an OpenAI-compatible API
The function below spawns a Tokasaurus instance listening at port 10210,
serving an OpenAI-compatible API.
We wrap it in the @modal.web_server decorator to connect it to the Internet.
The server runs in an independent process, via subprocess.Popen.
If it hasn’t started listening on the PORT within the startup_timeout,
the server start will be marked as failed.
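A sketch of that function is below. The GPU specification, concurrency limit, and especially the Tokasaurus launch command are assumptions; check the Tokasaurus documentation for the exact launcher name and flag syntax.

```python
import subprocess

PORT = 10210
MINUTES = 60  # seconds


@app.function(  # reuses the image, Volume, Secret, and constants from the sketches above
    image=tokasaurus_image,
    gpu="H100",
    volumes={MODEL_CACHE_PATH: hf_cache_volume},
    secrets=[hf_secret],
)
@modal.concurrent(max_inputs=1000)  # let many requests share one replica; the limit is an assumption
@modal.web_server(port=PORT, startup_timeout=30 * MINUTES)
def serve():
    # launch Tokasaurus in its own process; the launcher name and key=value flag
    # syntax below are assumptions, not copied from the Tokasaurus documentation
    cmd = " ".join(
        [
            "toka",  # hypothetical launcher name
            f"model={MODEL_NAME}",
            f"kv_cache_num_tokens={KV_CACHE_NUM_TOKENS}",
            f"max_tokens_per_forward={MAX_TOKENS_PER_FORWARD}",
            f"max_seqs_per_forward={MAX_SEQS_PER_FORWARD}",
            f"page_size={PAGE_SIZE}",  # hypothetical flag name
            f"port={PORT}",
        ]
    )
    subprocess.Popen(cmd, shell=True)
```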
The code we have so far is enough to deploy Tokasaurus on Modal: just run modal deploy on this file.
And you can hit the server with your favorite OpenAI-compatible API client,
like the openai Python SDK.
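For example, a chat completion request might look like this; the URL below is a placeholder for the deployment URL that modal deploy prints when it finishes.

```python
# querying the deployed server with the openai SDK; replace the placeholder URL
# with the one `modal deploy` prints for your workspace
from openai import OpenAI

client = OpenAI(
    base_url="https://your-workspace--your-app-serve.modal.run/v1",  # placeholder
    api_key="not-needed",  # the sketch above adds no authentication
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",
    messages=[{"role": "user", "content": "What is 7 * 6 + 3? Think step by step."}],
    max_tokens=256,
    temperature=0.6,
)
print(response.choices[0].message.content)
```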
Run the Large Language Monkeys GSM8K benchmark
To make it easier to check performance and to provide a quick test for use when setting up or configuring a Tokasaurus deployment, we include a simple benchmark function that acts as a local_entrypoint.
If you target this script with modal run, this code will execute,
spinning up a new replica and sending some test requests to it.
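A pared-down sketch of such an entrypoint is below. It assumes the standard OpenAI completions payload, the get_web_url method for retrieving the replica's URL, and a hypothetical load_gsm8k helper standing in for the dataset utilities in the addenda; the actual benchmark in this example is more elaborate.

```python
import time


@app.local_entrypoint()
def benchmark(num_problems: int = 8, samples_per_problem: int = 16):
    # a pared-down benchmark: many parallel samples per GSM8K problem, then a rough
    # throughput number. `load_gsm8k` is a hypothetical stand-in for the dataset
    # utilities in the addenda, and `get_web_url` is an assumption about the Modal API.
    import requests  # assumed to be installed locally

    url = serve.get_web_url()
    problems = load_gsm8k(num_problems)  # hypothetical helper

    start = time.monotonic()
    completions = []
    for problem in problems:
        payload = {  # standard OpenAI completions payload
            "model": MODEL_NAME,
            "prompt": problem["question"],
            "n": samples_per_problem,
            "max_tokens": MAX_COMPLETION_TOKENS,
            "temperature": TEMPERATURE,
        }
        result = requests.post(f"{url}/v1/completions", json=payload, timeout=60 * 60)
        completions += [choice["text"] for choice in result.json()["choices"]]
    duration = time.monotonic() - start

    total_tokens = sum(count_tokens.map(completions))  # the batched helper defined below
    print(f"{total_tokens / duration:.0f} completion tokens/s across {len(completions)} samples")
```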
Because the API responses don’t include token counts, we need a quick helper function to
calculate token counts from a prompt or completion.
We add automatic dynamic batching with modal.batched, so that we can send single strings but still take advantage
of batched encoding.
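A minimal version of that helper follows, assuming a transformers tokenizer is available in the image (add it with pip_install if it isn't already pulled in as a Tokasaurus dependency); the batching parameters are placeholders.

```python
@app.function(
    image=tokasaurus_image,
    volumes={MODEL_CACHE_PATH: hf_cache_volume},
    secrets=[hf_secret],  # the tokenizer for a gated model also needs the HF token
)
@modal.batched(max_batch_size=128, wait_ms=100)  # batching parameters are assumptions
def count_tokens(texts: list[str]) -> list[int]:
    # tokenize a batch of prompts or completions and return one count per input;
    # a production version would cache the tokenizer rather than reloading it per batch
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, cache_dir=MODEL_CACHE_PATH)
    return [len(ids) for ids in tokenizer(texts)["input_ids"]]
```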
You can run the benchmark with modal run on this file, or pass the --help flag to see options.
Addenda
The remaining code in this example is utility code, mostly for managing the GSM8K dataset. That code is slightly modified from the code in the Tokasaurus repo here.