Serverless Qwen 3-8B with SGLang and Modal Snapshots
In this example, we show how to serve SGLang on Modal with ~10x faster cold starts.
Fast cold starts are particularly useful for LLM inference applications that have highly “bursty” workloads, like document processing. See this guide for a breakdown of different LLM inference workloads and how to optimize them.
The key technique is CPU + GPU memory snapshotting, which saves and restores the SGLang server directly from its in-memory state.
This adds some complexity to the deployment. If you just want to get started running a basic LLM server on Modal, see this example.
Set up the container image
Our first order of business is to define the environment our server will run in:
the container Image.
We start from a container image provided by the SGLang team via Dockerhub.
While we’re at it, we import the dependencies we’ll need both remotely and locally (for deployment).
import asyncio
import subprocess
import time
import aiohttp
import modal
import modal.experimental
MINUTES = 60 # seconds
sglang_image = (
    modal.Image.from_registry(
        "lmsysorg/sglang:v0.5.6.post2-cu129-amd64-runtime"
    ).entrypoint([])  # silence chatty logs on container start
)
We also choose a GPU to deploy our inference server onto. We pick the H100, which offers excellent price-performance and supports 8-bit floating point (FP8) operations, the lowest precision that is well-supported by the relevant GPU kernels across a variety of model architectures.
N_GPUS = 1
GPU = f"H100!:{N_GPUS}"
Actual speedups are generally less than what you get from “napkin math” based on available bandwidths: we observed a speedup of about 30% moving from one to two H100s when developing this example. We recommend application-specific benchmarking guided by published generic benchmarks.
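To make that napkin math concrete, here is a rough sketch; the bandwidth and weight-size figures below are approximate assumptions rather than measurements. In the decode phase, each forward pass streams the full set of weights out of HBM, so aggregate memory bandwidth divided by weight size gives an upper bound on single-request decode throughput, and the naive math says a second GPU doubles it.
# Rough napkin math, for illustration only; the figures are assumptions, not measurements.
H100_HBM_GB_PER_S = 3350  # ~3.35 TB/s of HBM bandwidth per H100 SXM (approximate)
QWEN3_8B_FP8_WEIGHTS_GB = 8  # ~8B parameters at one byte each in FP8 (approximate)


def napkin_decode_tokens_per_s(n_gpus: int) -> float:
    # bandwidth-only upper bound on single-request decode throughput (~420 tokens/s on one H100)
    return n_gpus * H100_HBM_GB_PER_S / QWEN3_8B_FP8_WEIGHTS_GB


# This estimate predicts a clean 2x from a second GPU; communication overhead and
# non-GEMM work are part of why the measured speedup was closer to 30%.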
Loading and caching the model weights
We’ll serve Alibaba’s Qwen 3 LLM. For lower latency and faster cold starts, we pick a smaller model (8B params) in a lower precision floating point format (FP8). This reduces the amount of data that needs to be loaded from GPU RAM into SM SRAM in each forward pass.
MODEL_NAME = "Qwen/Qwen3-8B-FP8"
MODEL_REVISION = (
    "220b46e3b2180893580a4454f21f22d3ebb187d3"  # latest commit as of 2026-01
)
We load the model from the Hugging Face Hub, so we’ll need their Python package.
sglang_image = sglang_image.uv_pip_install("huggingface-hub==0.36.0")
We don’t want to load the model from the Hub every time we start the server. We can load it much faster from a Modal Volume. Typical speeds are around one to two GB/s.
HF_CACHE_VOL = modal.Volume.from_name("huggingface-cache", create_if_missing=True)
HF_CACHE_PATH = "/root/.cache/huggingface"
MODEL_PATH = f"{HF_CACHE_PATH}/{MODEL_NAME}"
In addition to pointing the Hugging Face Hub at the path where we mount the Volume, we also turn on “high performance” downloads, which can fully saturate our network bandwidth.
sglang_image = sglang_image.env(
    {"HF_HUB_CACHE": HF_CACHE_PATH, "HF_XET_HIGH_PERFORMANCE": "1"}
)
Caching compilation artifacts
Model weights aren’t the only thing we want to cache.
As a rule, LLM inference servers like SGLang don’t directly provide their own kernels. They draw high-performance kernels from a variety of sources.
As of version 0.5.6, SGLang’s default kernel backend
for FP8 matrix multiplications (fp8-gemm-backend)
on Hopper SM architecture GPUs like the H100 is DeepGEMM by DeepSeek.
The binaries of these kernels are not included in the SGLang Docker image and so must be JIT-compiled. We store these in a Modal Volume as well.
DG_CACHE_VOL = modal.Volume.from_name("deepgemm-cache", create_if_missing=True)
DG_CACHE_PATH = "/root/.cache/deepgemm"
JIT DeepGEMM kernels are on by default, but we explicitly enable them via an environment variable.
sglang_image = sglang_image.env({"SGLANG_ENABLE_JIT_DEEPGEMM": "1"})
We trigger the compilation by running sglang.compile_deep_gemm in a subprocess kicked off from a Python function.
def compile_deep_gemm():
    import os

    if int(os.environ.get("SGLANG_ENABLE_JIT_DEEPGEMM", "1")):
        subprocess.run(
            f"python3 -m sglang.compile_deep_gemm --model-path {MODEL_NAME} --revision {MODEL_REVISION} --tp {N_GPUS}",
            shell=True,
        )
We run this Python function on Modal as part of building the Image so that it has access to the appropriate GPU and the caches for our model and compilation artifacts.
sglang_image = sglang_image.run_function(
    compile_deep_gemm,
    volumes={DG_CACHE_PATH: DG_CACHE_VOL, HF_CACHE_PATH: HF_CACHE_VOL},
    gpu=GPU,
)
Speed up cold starts with GPU snapshotting
Modal is a serverless compute platform, so all of your inference services automatically scale up and down to handle variable load.
Scaling up a new replica requires quite a bit of work — loading up Python and system packages, loading model weights, setting up the inference engine, and so on.
We can skip over and speed up a bunch of this work when spinning up new replicas after the first by directly booting from a memory snapshot, which contains the exact in-memory representation of our server just before it begins taking requests.
Most applications can be snapshotted and see substantial speedups (2x to 10x; see our initial benchmarks here). However, it generally requires some extra work to adapt the application code.
For instance, here we set an environment variable that improves the compatibility of the Torch Inductor compiler with GPU snapshotting.
sglang_image = sglang_image.env({"TORCHINDUCTOR_COMPILE_THREADS": "1"})
Below, we walk through the additional steps required to make an SGLang server compatible with snapshots.
Sleeping and waking an SGLang server
We prepare our SGLang inference server for snapshotting by first sending
a few requests to “warm it up”, ensuring that it is fully ready to process requests.
Then we “put it to sleep”, moving non-essential data out of GPU memory,
with a request to /release_memory_occupation.
At this point, we can take a memory snapshot.
Upon snapshot restoration, we “wake up” the server
with a request to /resume_memory_occupation.
We use the requests library to send ourselves these HTTP requests on localhost/127.0.0.1.
with sglang_image.imports():
    import requests


def warmup():
    payload = {
        "messages": [{"role": "user", "content": "Hello, how are you?"}],
        "max_tokens": 16,
    }
    for _ in range(3):
        requests.post(
            f"http://127.0.0.1:{PORT}/v1/chat/completions", json=payload, timeout=10
        ).raise_for_status()


def sleep():
    requests.post(
        f"http://127.0.0.1:{PORT}/release_memory_occupation", json={}
    ).raise_for_status()


def wake_up():
    requests.post(
        f"http://127.0.0.1:{PORT}/resume_memory_occupation", json={}
    ).raise_for_status()
Define the inference server and infrastructure
We wrap up all of the choices we made about the infrastructure of our inference server into a number of Python decorators that we apply to a Python class that encapsulates the logic to run our server.
The key decorators are:
- @app.cls to define the core of our service. We attach our Image, request a GPU, attach our cache Volumes, and enable memory snapshotting. See the reference documentation for details.
- @modal.web_server to turn our Python code into an HTTP server. The wrapped code needs to eventually listen for HTTP connections on the provided port.
- @modal.concurrent to specify how many requests our server can handle before we need to scale up.
- @modal.enter and @modal.exit to indicate which methods of the class should be run when starting the server and shutting it down. The enter methods also define what code is run before memory snapshot creation (snap=True) and after memory snapshot restoration (snap=False).
The modal.concurrent decorator and the lifecycle management are particularly important
for bursty workloads and for snapshotting, respectively, so let’s discuss them in detail.
Determining autoscaling policy with @modal.concurrent
To handle bursty workloads, we need to decide how we will scale up and down replicas in response to load. Without autoscaling, users’ requests will queue when the server becomes overloaded.
We can set two values with the @modal.concurrent decorator. max_inputs should be set to the maximum number of inputs a replica can handle concurrently without internal queueing (the --max-running-requests value in SGLang). target_inputs can be left unset or, if per-request latency degrades too much when handling the maximum batch size, set to a lower value.
TARGET_INPUTS = 10
MAX_INPUTS = 1000
Generally, this choice needs to be made as part of LLM inference engine benchmarking in reference to a particular application’s latency and throughput targets.
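As a hypothetical sketch of what that benchmarking might look like, the function below sweeps the number of concurrent requests against a deployed endpoint and reports the median latency at each level; the request payload, concurrency levels, and latency target are placeholders, not part of this deployment. You would set target_inputs near the highest level that still meets your latency target.
# Hypothetical concurrency sweep for choosing TARGET_INPUTS; values are placeholders.
async def sweep_concurrency(url, levels=(1, 4, 16, 64), latency_target_s=2.0):
    payload = {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": "Say hello."}],
        "max_tokens": 32,
    }
    async with aiohttp.ClientSession(base_url=url) as session:

        async def timed_request():
            start = time.monotonic()
            async with session.post("/v1/chat/completions", json=payload) as resp:
                resp.raise_for_status()
                await resp.json()
            return time.monotonic() - start

        for level in levels:  # fire `level` requests at once and look at the median latency
            latencies = sorted(await asyncio.gather(*[timed_request() for _ in range(level)]))
            median = latencies[len(latencies) // 2]
            print(f"concurrency={level} median_latency={median:.2f}s meets_target={median <= latency_target_s}")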
Controlling container lifecycles with @modal.enter
Modal considers a new replica ready to receive inputs once the @modal.enter methods have exited
and the container accepts connections.
To ensure that we actually finish setting up our server before we are marked ready for inputs,
we define a helper function that polls the server until it reports itself healthy.
def wait_ready(process: subprocess.Popen, timeout: int = 5 * MINUTES):
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            check_running(process)
            requests.get(f"http://127.0.0.1:{PORT}/health").raise_for_status()
            return
        except (
            subprocess.CalledProcessError,
            requests.exceptions.ConnectionError,
            requests.exceptions.HTTPError,
        ):
            time.sleep(1)
    raise TimeoutError(f"SGLang server not ready within timeout of {timeout} seconds")


def check_running(p: subprocess.Popen):
    if (rc := p.poll()) is not None:
        raise subprocess.CalledProcessError(rc, cmd=p.args)
With all this in place, we are ready to define our high-performance, low-latency LLM inference server.
app = modal.App(name="example-sglang-snapshot")
PORT = 8000
@app.cls(
    image=sglang_image,
    gpu=GPU,
    volumes={HF_CACHE_PATH: HF_CACHE_VOL, DG_CACHE_PATH: DG_CACHE_VOL},
    enable_memory_snapshot=True,
    experimental_options={"enable_gpu_snapshot": True},
)
@modal.concurrent(target_inputs=TARGET_INPUTS, max_inputs=MAX_INPUTS)
class SGLang:
    @modal.enter(snap=True)
    def startup(self):
        """Start the SGLang server and block until it is healthy, then warm it up and put it to sleep."""
        cmd = [
            "python",
            "-m",
            "sglang.launch_server",
            "--model-path",
            MODEL_NAME,
            "--revision",
            MODEL_REVISION,
            "--served-model-name",
            MODEL_NAME,
            "--host",
            "0.0.0.0",
            "--port",
            f"{PORT}",
            "--tp",  # use all GPUs to split up tensor-parallel operations
            f"{N_GPUS}",
            "--cuda-graph-max-bs",  # capture CUDA graphs up to batch sizes we're likely to observe
            f"{MAX_INPUTS}",
            "--max-running-requests",
            f"{MAX_INPUTS}",
            "--enable-metrics",  # expose metrics endpoints for telemetry
            "--enable-memory-saver",  # enable offload, for snapshotting
            "--enable-weights-cpu-backup",  # enable offload, for snapshotting
        ]
        self.process = subprocess.Popen(cmd)
        wait_ready(self.process)
        warmup()  # for snapshotting
        sleep()

    @modal.enter(snap=False)
    def wake_up(self):
        wake_up()

    @modal.web_server(
        port=PORT,  # wrapped code must listen on this port
        startup_timeout=10 * MINUTES,  # how long can server startup take?
    )
    def serve(self):
        pass

    @modal.exit()
    def stop(self):
        self.process.terminate()
Deploy the server
To deploy the server on Modal, just run
modal deploy sglang_snapshot.py
This will create a new App on Modal and build the container image for it if it hasn’t been built yet.
Interact with the server
Once it is deployed, you’ll see a URL appear in the command line,
something like https://your-workspace-name--example-sglang-snapshot-sglang.modal.run.
You can find interactive Swagger UI docs at the /docs route of that URL, i.e. https://your-workspace-name--example-sglang-snapshot-sglang.modal.run/docs.
These docs describe each route and indicate the expected input and output
and translate requests into curl commands.
For simple routes, you can even send a request directly from the docs page.
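You can also hit the OpenAI-compatible chat completions route from a short standalone Python script, as in the sketch below. The URL is the placeholder from above, so substitute the one printed when you deploy, and the request body is just an example.
# Standalone client sketch; run it anywhere with the requests package installed.
import requests

url = "https://your-workspace-name--example-sglang-snapshot-sglang.modal.run"  # placeholder
resp = requests.post(
    url + "/v1/chat/completions",
    json={
        "model": "Qwen/Qwen3-8B-FP8",
        "messages": [{"role": "user", "content": "Why are cold starts slow?"}],
        "max_tokens": 64,
    },
    timeout=10 * 60,  # leave room for a cold start on the first request
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])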
Test the server
To make it easier to test the server setup, we also include a local_entrypoint that hits the server with a simple client.
If you execute the command
modal run sglang_snapshot.py
a fresh replica of the server will be spun up on Modal while the code below executes on your local machine.
Think of this like writing simple tests inside of the if __name__ == "__main__" block of a Python script, but for cloud deployments!
@app.local_entrypoint()
async def test(test_timeout=10 * MINUTES, prompt=None, twice=True):
    url = SGLang().serve.get_web_url()
    system_prompt = {
        "role": "system",
        "content": "You are a pirate who can't help but drop sly reminders that he went to Harvard.",
    }
    if prompt is None:
        prompt = "Explain the Singular Value Decomposition."
    content = [{"type": "text", "text": prompt}]
    messages = [  # OpenAI chat format
        system_prompt,
        {"role": "user", "content": content},
    ]
    await probe(url, messages, timeout=test_timeout)
    if twice:
        messages[0]["content"] = "You are Jar Jar Binks."
        print(f"Sending messages to {url}:", *messages, sep="\n\t")
        await probe(url, messages, timeout=1 * MINUTES)
This test relies on the two helper functions below, which ping the server and wait for a valid response.
async def probe(url, messages=None, timeout=5 * MINUTES):
    if messages is None:
        messages = [{"role": "user", "content": "Tell me a joke."}]
    deadline = time.time() + timeout
    async with aiohttp.ClientSession(base_url=url) as session:
        while time.time() < deadline:
            try:
                await _send_request(session, "llm", messages)
                return
            except asyncio.TimeoutError:
                await asyncio.sleep(1)
    raise TimeoutError(f"No response from server within {timeout} seconds")


async def _send_request(
    session: aiohttp.ClientSession,
    model: str,
    messages: list,
    timeout: int | None = None,
) -> None:
    async with session.post(
        "/v1/chat/completions",
        json={"messages": messages, "model": model},
        timeout=timeout,
    ) as resp:
        resp.raise_for_status()
        print((await resp.json())["choices"][0]["message"]["content"])
Test memory snapshotting
Using modal run creates an ephemeral Modal App, rather than a deployed Modal App.
Ephemeral Modal Apps are short-lived, so they turn off memory snapshotting.
To test the memory snapshot version of the server,
first deploy it with modal deploy and then hit it with a client.
You should observe startup improvements after a handful of cold starts (usually less than five). If you want to see the speedup during a test, we recommend heading to the deployed App in your Modal dashboard and manually stopping containers after they have served a request to ensure turnover.
You can use the client code below to test the endpoint.
if __name__ == "__main__":
    # after deployment, we can use the class from anywhere
    SGLang = modal.Cls.from_name("example-sglang-snapshot", "SGLang")

    print("calling inference server")
    try:
        asyncio.run(probe(SGLang().serve.get_web_url()))
    except modal.exception.NotFoundError as e:
        raise Exception(
            f"To take advantage of GPU snapshots, deploy first with modal deploy {__file__}"
        ) from e
It can be run with the command
python sglang_snapshot.py