Low latency Qwen 3 8B with SGLang and Modal

In this example, we show how to serve SGLang at low latency on Modal.

This example is intended to demonstrate everything required to run inference at the highest performance and with the lowest latency possible, and so it includes advanced features of both SGLang and Modal. For a simpler introduction to LLM serving, see this example.

To minimize routing overheads, we use @modal.experimental.http_server, which uses a new, low-latency routing service on Modal designed for latency-sensitive inference workloads. This gives us more control over routing, but with increased power comes increased responsibility.

Set up the container image 

Our first order of business is to define the environment our server will run in: the container Image.

We start from a container image provided by the SGLang team via Docker Hub.

While we’re at it, we import the dependencies we’ll need both remotely and locally (for deployment).

import asyncio
import json
import subprocess
import time

import aiohttp
import modal
import modal.experimental

MINUTES = 60  # seconds

sglang_image = (
    modal.Image.from_registry(
        "lmsysorg/sglang:v0.5.6.post2-cu129-amd64-runtime"
    ).entrypoint([])  # silence chatty logs on container start
)

We also choose a GPU to deploy our inference server onto. We pick the H100, which offers excellent price-performance and supports 8-bit floating point (FP8) operations, the lowest precision well-supported in the relevant GPU kernels across a variety of model architectures.

Below, we discuss the choice of GPU count.

GPU_TYPE, N_GPUS = "H100!", 2
GPU = f"{GPU_TYPE}:{N_GPUS}"

Loading and caching the model weights

We’ll serve Alibaba’s Qwen 3 LLM. For lower latency, we pick a smaller model (8B params) in a lower precision floating point format (FP8). This reduces the amount of data that needs to be loaded from GPU RAM into SM SRAM in each forward pass.

MODEL_NAME = "Qwen/Qwen3-8B-FP8"
MODEL_REVISION = (
    "220b46e3b2180893580a4454f21f22d3ebb187d3"  # latest commit as of 2026-01-01
)

We load the model from the Hugging Face Hub, so we’ll need their Python package.

sglang_image = sglang_image.uv_pip_install("huggingface-hub==0.36.0")

We don’t want to load the model from the Hub every time we start the server. We can load it much faster from a Modal Volume. Typical speeds are around one to two GB/s.

HF_CACHE_VOL = modal.Volume.from_name("huggingface-cache", create_if_missing=True)
HF_CACHE_PATH = "/root/.cache/huggingface"
MODEL_PATH = f"{HF_CACHE_PATH}/{MODEL_NAME}"

In addition to pointing the Hugging Face Hub at the path where we mount the Volume, we also turn on “high performance” downloads, which can fully saturate our network bandwidth.

sglang_image = sglang_image.env(
    {"HF_HUB_CACHE": HF_CACHE_PATH, "HF_XET_HIGH_PERFORMANCE": "1"}
)
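
The server will download and cache the weights on its first boot, and the DeepGEMM compilation step below loads the model, which should also populate this cache. If you prefer to warm the Volume explicitly, a small download function like the sketch below works; it is optional and not wired into this example.

def download_model():
    from huggingface_hub import snapshot_download

    # downloads into HF_HUB_CACHE, which points at the mounted Volume
    snapshot_download(MODEL_NAME, revision=MODEL_REVISION)


# optionally wire this into the image build, as we do for DeepGEMM compilation below:
# sglang_image = sglang_image.run_function(
#     download_model, volumes={HF_CACHE_PATH: HF_CACHE_VOL}
# )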

Caching compilation artifacts

Model weights aren’t the only thing we want to cache.

As a rule, LLM inference servers like SGLang don’t directly provide their own kernels. They draw high-performance kernels from a variety of sources.

As of version 0.5.6, SGLang’s default kernel backend for FP8 matrix multiplications (fp8-gemm-backend) on Hopper SM architecture GPUs like the H100 is DeepGEMM by DeepSeek.

The binaries of these kernels are not included in the SGLang Docker image and so must be JIT-compiled. We store these in a Modal Volume as well.

DG_CACHE_VOL = modal.Volume.from_name("deepgemm-cache", create_if_missing=True)
DG_CACHE_PATH = "/root/.cache/deepgemm"

JIT DeepGEMM kernels are on by default, but we explicitly enable them via an environment variable.

sglang_image = sglang_image.env({"SGLANG_ENABLE_JIT_DEEPGEMM": "1"})

We trigger the compilation by running sglang.compile_deep_gemm in a subprocess kicked off from a Python function.

def compile_deep_gemm():
    import os

    if int(os.environ.get("SGLANG_ENABLE_JIT_DEEPGEMM", "1")):
        subprocess.run(
            f"python3 -m sglang.compile_deep_gemm --model-path {MODEL_NAME} --revision {MODEL_REVISION} --tp {N_GPUS}",
            shell=True,
            check=True,  # fail the image build if kernel compilation fails
        )

We run this Python function on Modal as part of building the Image so that it has access to the appropriate GPU and the caches for our model and compilation artifacts.

sglang_image = sglang_image.run_function(
    compile_deep_gemm,
    volumes={DG_CACHE_PATH: DG_CACHE_VOL, HF_CACHE_PATH: HF_CACHE_VOL},
    gpu=GPU,
)

Configure SGLang for minimal latency 

LLM inference engines like SGLang come with a wide variety of “knobs” to tune performance.

To determine the appropriate configuration to hit latency and throughput service objectives, we recommend application-specific benchmarking guided by published generic benchmarks.
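
As a starting point, you can measure the numbers you care about directly against your deployed endpoint. The sketch below (measure_latency is just an illustrative helper, not used elsewhere in this example) reports time-to-first-token and total generation time for one request, treating the first streamed event as a rough proxy for the first token.

def measure_latency(base_url, prompt, max_tokens=256):
    import time

    import requests  # assumed available locally; not a dependency of this app

    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "stream": True,
    }
    start = time.monotonic()
    ttft = None
    with requests.post(
        f"{base_url}/v1/chat/completions", json=payload, stream=True
    ) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if line and ttft is None:  # first streamed event ~ first token
                ttft = time.monotonic() - start
    return ttft, time.monotonic() - start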

Here, we assume that the primary goal is to minimize per-request latency, with less regard for throughput (and so for cost), and walk through some of the key choices.

The primary contributor to per-request latency is the time to move all of the model’s weights (multiple gigabytes) from GPU RAM into SRAM in the Streaming Multiprocessors, which must be done at least once in the course of processing a request — naively, once per token per request. The time taken is limited by the memory bandwidth between those two stores, which is on the order of terabytes per second on modern data center GPUs. With models at the scale of gigabytes, a token will take milliseconds to generate — or whole seconds for the kilotoken responses users are accustomed to.
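
To make that concrete, here is the napkin math for our 8B-parameter FP8 model on a single H100 (the numbers are rough approximations, for intuition only).

WEIGHT_BYTES = 8e9  # ~8B parameters at one byte each in FP8
HBM_BANDWIDTH = 3.35e12  # bytes/s, approximate H100 SXM memory bandwidth

seconds_per_token = WEIGHT_BYTES / HBM_BANDWIDTH  # ~2.4 ms lower bound per token
seconds_per_kilotoken = 1000 * seconds_per_token  # ~2.4 s for a 1,000-token reply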

We use two strategies to cut latency in our memory-bound workload:

  • operate across multiple GPUs for more aggregate bandwidth and faster loads, with tensor parallelism

  • generate more tokens per load, with speculative decoding

Increasing effective memory bandwidth with tensor parallelism 

Running SGLang on two H100s will double our effective memory bandwidth during large matrix multiplications.

Matrices are also known as tensors, and so this strategy that takes advantage of the inherent parallelism within matrix multiplication is known as tensor parallelism.
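
Conceptually, tensor parallelism shards each large weight matrix across devices and combines the partial results. The toy function below (a hypothetical illustration, not used by the app) uses numpy in place of two GPUs to show that a column-sharded matrix multiplication gives the same answer while each device only touches half the weights.

def tensor_parallel_toy():
    import numpy as np  # numpy stands in for two GPUs here

    x = np.random.randn(1, 4096)  # activations for one token
    W = np.random.randn(4096, 11008)  # a large projection matrix

    W0, W1 = np.split(W, 2, axis=1)  # shard the columns across two "devices"
    y = np.concatenate([x @ W0, x @ W1], axis=1)  # each does half the work

    assert np.allclose(y, x @ W)  # same result, half the weight traffic per device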

Actual speedups are generally less than what you get from “napkin math” based on available bandwidths — we observed a speedup of about 30% moving from one to two H100s when developing this example, rather than 100%.

Parallelizing token generation with speculative decoding 

Transformer and recurrent language models generate text sequentially: the model’s output at step i is part of the input at step i+1. Per Amdahl’s Law, that sequential work becomes the bottleneck as other steps get faster from increased parallelism.

The solution is to generate more tokens on each step. The primary technique to do so without changing model behavior is known as speculative decoding, which “speculates” a number of draft tokens and verifies them in parallel with the primary model.
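
To make the idea concrete, here is a schematic sketch of one greedy speculative decoding step using toy stand-in “models”. It is not SGLang’s implementation; it only illustrates the draft-then-verify loop and why the output matches what the target model would have produced on its own.

def target_next(seq):  # toy stand-in for the large target model (greedy)
    return (sum(seq) * 7 + 3) % 100


def draft_next(seq):  # toy stand-in for the cheap draft model; sometimes wrong
    return target_next(seq) if len(seq) % 3 != 2 else (seq[-1] + 1) % 100


def speculative_step(tokens, k=4):
    # 1. the draft model proposes k tokens, one at a time (cheap)
    seq = list(tokens)
    for _ in range(k):
        seq.append(draft_next(seq))

    # 2. the target model checks every proposal in one parallel pass: its own
    #    next token after each prefix seq[:n], seq[:n+1], ..., seq[:n+k]
    n = len(tokens)
    verify = [target_next(seq[:i]) for i in range(n, n + k + 1)]

    # 3. accept proposals until the first disagreement; at that position, emit
    #    the target model's token instead, so behavior matches plain decoding
    accepted = []
    for proposed, expected in zip(seq[n:], verify):
        if proposed == expected:
            accepted.append(proposed)
        else:
            accepted.append(expected)
            break
    else:
        accepted.append(verify[-1])  # all k accepted: one bonus token for free
    return tokens + accepted


# one target-model pass emits between 1 and k + 1 tokens, and the result is
# identical to decoding with the target model alone
out = speculative_step([1, 2, 3])
plain = [1, 2, 3]
while len(plain) < len(out):
    plain.append(target_next(plain))
assert out == plain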

Speculative decoding techniques themselves have a number of parameters, the most important of which is the technique to use to generate draft tokens. Simple techniques based on n-grams are a good place to start. But in our experience, the EAGLE-3 technique gives enough of a performance boost to be worth the overhead of maintaining an extra model for speculation.

And for popular models, you can often find a high-quality EAGLE-3 draft model with open weights. For Qwen 3-8B, we like Tengyunw’s model.

speculative_config = {
    "speculative-algorithm": "EAGLE3",
    "speculative-draft-model-path": "Tengyunw/qwen3_8b_eagle3",
}

We adopt the default configuration for this model from the documentation. With these settings, we observed an ~30% boost in throughput for a single user during the development of this sample code.

speculative_config |= {
    "speculative-num-steps": 6,
    "speculative-eagle-topk": 10,
    "speculative-num-draft-tokens": 32,
}

Note that unlike tensor parallelism, speculative decoding is not good for compute-bound workloads, since it generally increases demand for arithmetic bandwidth. So for workloads that admit larger batch sizes for requests, on the scale of dozens to hundreds, speculative decoding is not recommended.
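
A rough roofline estimate shows why: decoding does about two floating point operations per parameter per token, while the weights are read from memory only once per batch, so arithmetic intensity grows with the batch size. With approximate H100 numbers, the crossover lands above a hundred concurrent requests.

FP8_FLOPS = 989e12  # H100 dense FP8 tensor core throughput, approximate
HBM_BANDWIDTH = 3.35e12  # bytes/s, approximate H100 memory bandwidth

balance_point = FP8_FLOPS / HBM_BANDWIDTH  # ~295 FLOPs per byte loaded
compute_bound_batch = balance_point / 2  # ~150 requests at ~2 FLOPs/param/token each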

Define the inference server and infrastructure 

Selecting infrastructure to minimize latency 

Minimizing latency requires geographic co-location of clients and servers.

So for low latency LLM inference services on Modal, you must select a cloud region both for the GPU-accelerated containers running inference and for the internal Modal proxies that forward requests to them, as part of defining a modal.experimental.http_server.

Here, we assume users are mostly in the northern half of the Americas and select the us-east cloud region to serve them. This should result in at most a few dozen milliseconds of round-trip time.

REGION = "us-east"

For production-scale LLM inference services, there are generally enough requests to justify keeping at least one replica running at all times. Having a “warm” or “live” replica reduces latency by skipping the slow initialization work that occurs when a new replica boots up (a “cold start”). For LLM inference servers, that latency runs from seconds to minutes.

To ensure at least one container is always available, we can set the min_containers of our Modal Function to 1 or more.

However, since this is documentation code, we’ll set it to 0 to avoid surprise bills during casual use.

MIN_CONTAINERS = 0  # set to 1 to ensure one replica is always ready

Finally, we need to decide how we will scale up and down replicas in response to load. Without autoscaling, users’ requests will queue when the server becomes overloaded. Even apart from queueing, responses generally become slower per user above a certain minimum number of concurrent requests.

So we set a target for the number of inputs to run on a single container with modal.concurrent. For details, see the guide.

TARGET_INPUTS = 10

Generally, this choice needs to be made as part of LLM inference engine benchmarking.

Controlling container lifecycles with modal.Cls 

We wrap up all of the choices we made about the infrastructure of our inference server into a number of Python decorators that we apply to a Python class that encapsulates the logic to run our server.

The key decorators are:

  • @app.cls to define the core of our service. We attach our Image, request a GPU, attach our cache Volumes, specify the region, and configure auto-scaling. See the reference documentation for details.

  • @modal.experimental.http_server to turn our Python code into an HTTP server (i.e. fronting all of our containers with a proxy that has a URL). The wrapped code needs to eventually listen for HTTP connections on the provided port.

  • @modal.concurrent to specify how many requests our server can handle before we need to scale up.

  • @modal.enter and @modal.exit to indicate which methods of the class should be run when starting the server and shutting it down.

Modal considers a new replica ready to receive inputs once the modal.enter methods have exited and the container accepts connections. To ensure that we actually finish setting up our server before we are marked ready for inputs, we define helper functions that check whether the server has finished starting up and send it a few test inputs.

We use the requests library to send ourselves these HTTP requests on localhost/127.0.0.1.

with sglang_image.imports():
    import requests


def wait_ready(process: subprocess.Popen, timeout: int = 5 * MINUTES):
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            check_running(process)
            requests.get(f"http://127.0.0.1:{PORT}/health").raise_for_status()
            return
        except (
            subprocess.CalledProcessError,
            requests.exceptions.ConnectionError,
            requests.exceptions.HTTPError,
        ):
            time.sleep(5)
    raise TimeoutError(f"SGLang server not ready within {timeout} seconds")


def check_running(p: subprocess.Popen):
    if (rc := p.poll()) is not None:
        raise subprocess.CalledProcessError(rc, cmd=p.args)


def warmup():
    payload = {
        "messages": [{"role": "user", "content": "Hello, how are you?"}],
        "max_tokens": 16,
    }
    for _ in range(3):
        requests.post(
            f"http://127.0.0.1:{PORT}/v1/chat/completions", json=payload, timeout=10
        ).raise_for_status()

With all this in place, we are ready to define our high-performance, low-latency LLM inference server.

app = modal.App(name="example-sglang-low-latency")
PORT = 8000


@app.cls(
    image=sglang_image,
    gpu=GPU,
    volumes={HF_CACHE_PATH: HF_CACHE_VOL, DG_CACHE_PATH: DG_CACHE_VOL},
    region=REGION,
    min_containers=MIN_CONTAINERS,
)
@modal.experimental.http_server(
    port=PORT,  # wrapped code must listen on this port
    proxy_regions=[REGION],  # location of proxies, should be same as Cls region
    exit_grace_period=5,  # seconds, time to finish up requests when closing down
)
@modal.concurrent(target_inputs=TARGET_INPUTS)
class SGLang:
    @modal.enter()
    def startup(self):
        """Start the SGLang server, block until it is healthy, then warm it up."""
        cmd = [
            "python",
            "-m",
            "sglang.launch_server",
            "--model-path",
            MODEL_NAME,
            "--revision",
            MODEL_REVISION,
            "--served-model-name",
            MODEL_NAME,
            "--host",
            "0.0.0.0",
            "--port",
            f"{PORT}",
            "--tp",  # use all GPUs to split up tensor-parallel operations
            f"{N_GPUS}",
            "--cuda-graph-max-bs",  # only capture CUDA graphs for batch sizes we're likely to observe
            f"{TARGET_INPUTS * 2}",
            "--enable-metrics",  # expose metrics endpoints for telemetry
            "--decode-log-interval",  # how often to log during decoding, in tokens
            "250",
            "--mem-fraction-static",  # leave space for the speculative draft model
            "0.8",
        ]

        cmd += [  # add speculative config
            item for k, v in speculative_config.items() for item in (f"--{k}", str(v))
        ]

        self.process = subprocess.Popen(cmd)
        wait_ready(self.process)
        warmup()

    @modal.exit()
    def stop(self):
        self.process.terminate()

Deploy the server 

To deploy the server on Modal, just run

modal deploy sglang_low_latency.py

This will create a new App on Modal and build the container image for it if it hasn’t been built yet.

Interact with the server 

Once it is deployed, you’ll see a URL appear in the command line, something like https://your-workspace-name--example-sglang-low-latency-sglang.us-east.modal.direct.

You can find interactive Swagger UI docs at the /docs route of that URL, i.e. https://your-workspace-name--example-sglang-low-latency-sglang.us-east.modal.direct/docs. These docs describe each route, indicate the expected inputs and outputs, and translate requests into curl commands. For simple routes, you can even send a request directly from the docs page.

Note: when no replicas are available, Modal will respond with the 503 Service Unavailable status. In your browser, you can just hit refresh until the docs page appears. You can see the status of the application and its containers on your Modal dashboard.
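
You can also hit the server’s OpenAI-compatible chat completions route from your own code. Below is a minimal sketch using the openai Python client (ask_server is just an illustrative helper, and the client library is not otherwise used in this example); pass it your deployment’s URL.

def ask_server(url, question):
    from openai import OpenAI  # assumed installed where you run this

    client = OpenAI(base_url=f"{url}/v1", api_key="EMPTY")  # no key needed by default
    response = client.chat.completions.create(
        model="Qwen/Qwen3-8B-FP8",  # must match --served-model-name
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content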

Test the server 

To make it easier to test the server setup, we also include a local_entrypoint that hits the server with a simple client.

If you execute the command

modal run sglang_low_latency.py

a fresh replica of the server will be spun up on Modal while the code below executes on your local machine.

Think of this like writing simple tests inside of the if __name__ == "__main__" block of a Python script, but for cloud deployments!

@app.local_entrypoint()
async def test(test_timeout=10 * MINUTES, prompt=None, twice=True):
    url = SGLang._experimental_get_flash_urls()[0]

    system_prompt = {
        "role": "system",
        "content": "You are a pirate who can't help but drop sly reminders that he went to Harvard.",
    }
    if prompt is None:
        prompt = "Explain the Singular Value Decomposition."

    content = [{"type": "text", "text": prompt}]

    messages = [  # OpenAI chat format
        system_prompt,
        {"role": "user", "content": content},
    ]

    await probe(url, messages, timeout=test_timeout)
    if twice:
        messages[0]["content"] = "You are Jar Jar Binks."
        print(f"Sending messages to {url}:", *messages, sep="\n\t")
        await probe(url, messages, timeout=1 * MINUTES)

This test relies on the two helper functions below, which ping the server and wait for a valid response to stream.

The probe helper function specifically ignores two types of errors that can occur while a replica is starting up: timeouts on the client and 503 Service Unavailable responses from the server, which Modal returns when an experimental.http_server has no live replicas.

async def probe(url, messages=None, timeout=5 * MINUTES):
    if messages is None:
        messages = [{"role": "user", "content": "Tell me a joke."}]

    deadline = time.time() + timeout
    async with aiohttp.ClientSession(base_url=url) as session:
        while time.time() < deadline:
            try:
                await _send_request_streaming(session, messages)
                return
            except asyncio.TimeoutError:
                await asyncio.sleep(1)
            except aiohttp.client_exceptions.ClientResponseError as e:
                if e.status == 503:
                    await asyncio.sleep(1)
                    continue
                raise e
    raise TimeoutError(f"No response from server within {timeout} seconds")


async def _send_request_streaming(
    session: aiohttp.ClientSession, messages: list, timeout: int | None = None
) -> None:
    payload = {"messages": messages, "stream": True}
    headers = {"Accept": "text/event-stream"}

    async with session.post(
        "/v1/chat/completions", json=payload, headers=headers, timeout=timeout
    ) as resp:
        resp.raise_for_status()
        full_text = ""

        async for raw in resp.content:
            line = raw.decode("utf-8", errors="ignore").strip()
            if not line:
                continue

            # Server-Sent Events format: "data: ...."
            if not line.startswith("data:"):
                continue

            data = line[len("data:") :].strip()
            if data == "[DONE]":
                break

            try:
                evt = json.loads(data)
            except json.JSONDecodeError:
                # ignore any non-JSON keepalive
                continue

            delta = (evt.get("choices") or [{}])[0].get("delta") or {}
            chunk = delta.get("content")

            if chunk:
                # flush at natural pauses (newlines, periods) so the stream renders smoothly
                print(chunk, end="", flush="\n" in chunk or "." in chunk)
                full_text += chunk
        print()  # newline after stream completes
        print(full_text)