Low Latency, Serverless LFM2 with vLLM and Modal

In this example, we show how to serve Liquid AI’s LFM2 models on Modal using vLLM, with low latency and fast cold starts.

The LFM2 models are not vanilla Transformers — they have a hybrid architecture, discovered via an architecture search that optimized for quality, latency, and memory footprint. Check out their technical report for more details.

Here, we run the 24B-A2B variant of LFM2, described here. This variant is designed for efficient inference and includes instruction tuning. It is released under the weights-available LFM 1.0 License, which restricts commercial use for entities with over $10M in revenue.

This example demonstrates techniques to run inference at high efficiency, including advanced features of both vLLM and Modal. For a simpler introduction to LLM serving, see this example.

To minimize routing overheads, we use @modal.experimental.http_server, which uses a new, low-latency routing service on Modal designed for latency-sensitive inference workloads. This gives us more control over routing, but with increased power comes increased responsibility.

We also include instructions for cutting cold start times using Modal’s CPU + GPU memory snapshots.

Fast cold starts are particularly useful for LLM inference applications that have highly “bursty” workloads, like document processing. See this guide for a breakdown of different LLM inference workloads and how to optimize them.

Set up the container image

Our first order of business is to define the environment our server will run in: the container Image. We’ll use the vLLM inference server.

While we’re at it, we import the dependencies we’ll need both remotely and locally (for deployment).

Selecting the GPU

We choose the H100 GPU, which offers excellent price-performance and has sufficient VRAM to store the models.

Loading and caching the model weights

We don’t want to load the model from the Hub every time we start the server. We can load it much faster from a Modal Volume. Typical speeds are around one to two GB/s.
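As a rough worked example of what that throughput means in practice, the sketch below estimates how long reading the weights would take. The 48 GB figure is an assumption (roughly 24 billion parameters at 2 bytes each), not a measured checkpoint size, and `estimate_load_seconds` is a hypothetical helper.

```python
def estimate_load_seconds(size_gb: float, throughput_gb_per_s: float) -> float:
    """Time to read `size_gb` gigabytes at a sustained throughput."""
    if throughput_gb_per_s <= 0:
        raise ValueError("throughput must be positive")
    return size_gb / throughput_gb_per_s


# Assumption: ~24e9 parameters stored as 16-bit values (2 bytes each).
ASSUMED_WEIGHTS_GB = 24e9 * 2 / 1e9

for rate in (1.0, 2.0):
    seconds = estimate_load_seconds(ASSUMED_WEIGHTS_GB, rate)
    print(f"{rate:.0f} GB/s -> about {seconds:.0f} s to read the weights")
```

Under those assumptions, reading the weights alone takes tens of seconds, which is why the later sections focus on skipping that work entirely.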

In addition to pointing the Hugging Face Hub at the path where we mount the Volume, we also turn on “high performance” downloads, which can fully saturate our network bandwidth, and provide an HF_TOKEN via a Modal Secret so that our downloads aren’t throttled. You’ll need to create a Secret named huggingface-secret with your token here.
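A minimal sketch of the environment settings this paragraph describes might look like the following. The mount path is a hypothetical placeholder, and the variable names `HF_HUB_CACHE` and `HF_XET_HIGH_PERFORMANCE` reflect our understanding of the Hugging Face Hub client; check them against the version you install. `HF_TOKEN` is deliberately absent because it comes from the Secret rather than from the image.

```python
def hf_cache_env(volume_mount_path: str) -> dict[str, str]:
    """Environment variables that route Hub downloads into a mounted cache."""
    return {
        # Store downloaded model files under the mounted Volume.
        "HF_HUB_CACHE": volume_mount_path,
        # Opt in to the high-throughput download mode.
        "HF_XET_HIGH_PERFORMANCE": "1",
    }


# Hypothetical mount path, chosen only for illustration.
env = hf_cache_env("/root/.cache/huggingface")
print(env)
```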

Caching compilation artifacts

Model weights aren’t the only thing we want to cache. vLLM also produces compilation artifacts that we want to persist across restarts.

Define the inference server and infrastructure

Selecting infrastructure to minimize latency

Minimizing latency requires geographic co-location of clients and servers.

So for low latency LLM inference services on Modal, you must select a cloud region for both the GPU-accelerated containers running inference and for the internal Modal proxies that forward requests to them as part of defining a modal.experimental.http_server.

Here, we assume users are mostly in the northern half of the Americas and select the us-east cloud region to serve them. This should result in at most a few dozen milliseconds of round-trip time.
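To see where a figure like "a few dozen milliseconds" comes from, the sketch below computes a physical lower bound on round-trip time from distance alone. It assumes signals travel through optical fiber at roughly two thirds of the speed of light (about 200,000 km/s). The distances are illustrative, not measurements, and real routes are longer and add switching delay.

```python
FIBER_KM_PER_S = 200_000  # approximate signal speed in optical fiber


def min_round_trip_ms(distance_km: float) -> float:
    """Lower bound on round-trip time over a fiber path of this length."""
    return 2 * distance_km / FIBER_KM_PER_S * 1000


# Illustrative client-to-server distances in kilometers.
for km in (500, 2000, 4000):
    print(f"{km} km -> at least {min_round_trip_ms(km):.0f} ms round trip")
```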

For production-scale LLM inference services, there are generally enough requests to justify keeping at least one replica running at all times. Having a “warm” or “live” replica reduces latency by skipping slow initialization work that occurs when a new replica boots up (a “cold start”). For LLM inference servers, that latency runs from seconds to minutes.

However, since this is documentation code, we’ll set the min_containers of our Modal Function to 0 to avoid surprise bills during casual use.

Finally, we need to decide how we will scale up and down replicas in response to load. Without autoscaling, users’ requests will queue when the server becomes overloaded. Even apart from queueing, responses generally become slower per user above a certain minimum number of concurrent requests.

So we set a target for the number of inputs to run on a single container with modal.concurrent. For details, see the guide.

Generally, this choice needs to be made as part of LLM inference engine benchmarking.
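As a back-of-the-envelope illustration of how a per-container concurrency target relates to replica count, consider the simplified model below. It is not a description of Modal's autoscaler. `replicas_needed` is a hypothetical helper and the numbers are made up.

```python
import math


def replicas_needed(concurrent_requests: int, target_inputs: int) -> int:
    """Replicas required so no replica exceeds its concurrency target."""
    if target_inputs < 1:
        raise ValueError("target_inputs must be at least 1")
    return math.ceil(concurrent_requests / target_inputs)


# With a target of 32 concurrent inputs per replica:
for load in (0, 32, 100):
    print(f"{load} concurrent requests -> {replicas_needed(load, 32)} replicas")
```

A lower target keeps per-user latency down at the cost of more replicas, which is why the right value comes out of benchmarking rather than a formula.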

Speed up cold starts with GPU snapshotting

Modal is a serverless compute platform, so all of your inference services automatically scale up and down to handle variable load.

Scaling up a new replica requires quite a bit of work — loading up Python and system packages, loading model weights, setting up the inference engine, and so on.

We can skip over and speed up a bunch of this work when spinning up new replicas after the first by directly booting from a memory snapshot, which contains the exact in-memory representation of our server just before it begins taking requests.

Most applications can be snapshotted and experience substantial speedups (2x to 10x, see our initial benchmarks here). However, it generally requires some extra work to adapt the application code.

vLLM supports a sleep mode that allows us to leverage Modal’s CPU + GPU memory snapshots for dramatically faster cold starts.

When enable_memory_snapshot=True and experimental_options={"enable_gpu_snapshot": True} are set on the class, Modal captures both CPU and GPU memory state. The @modal.enter(snap=True) method runs before the snapshot is taken: we start vLLM, wait for it to be ready, warm it up, then put it to sleep. The @modal.enter(snap=False) method runs after restoring from snapshot: we wake vLLM back up so it can serve requests immediately.

Sleeping and waking a vLLM server

We prepare our vLLM inference server for snapshotting by first sending a few requests to “warm it up”, ensuring that it is fully ready to process requests. Then we “put it to sleep”, moving non-essential data out of GPU memory, with a request to /sleep. At this point, we can take a memory snapshot. Upon snapshot restoration, we “wake up” the server with a request to /wake_up.

We use the requests library to send ourselves these HTTP requests on localhost/127.0.0.1.
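The sleep and wake requests described above can be sketched as follows. To stay dependency-free, this version swaps in the standard library's `urllib` in place of `requests`. The port `8000` and the helper names are assumptions for illustration; only the `/sleep` and `/wake_up` routes come from the text.

```python
import urllib.request


def sleep_url(base_url: str) -> str:
    """URL that asks the server to move non-essential data off the GPU."""
    return f"{base_url.rstrip('/')}/sleep"


def wake_url(base_url: str) -> str:
    """URL that asks the server to resume serving after a restore."""
    return f"{base_url.rstrip('/')}/wake_up"


def post(url: str, timeout_s: float = 60.0) -> int:
    """Send an empty POST request and return the HTTP status code."""
    request = urllib.request.Request(url, data=b"", method="POST")
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return response.status


# Hypothetical local address of the inference server.
BASE_URL = "http://127.0.0.1:8000"
print(sleep_url(BASE_URL), wake_url(BASE_URL))
```

In the lifecycle described earlier, `post(sleep_url(BASE_URL))` would run at the end of the pre-snapshot method and `post(wake_url(BASE_URL))` in the post-restore method.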

Controlling container lifecycles with `modal.Cls`

We wrap up all of the choices we made about the infrastructure of our inference server into a number of Python decorators that we apply to a Python class that encapsulates the logic to run our server.

The key decorators are:

@app.cls to define the core of our service. We attach our Image, request a GPU, attach our cache Volumes, specify the region, and configure auto-scaling. See the reference documentation for details.
@modal.experimental.http_server to turn our Python code into an HTTP server (i.e. fronting all of our containers with a proxy with a URL). The wrapped code needs to eventually listen for HTTP connections on the provided port.
@modal.concurrent to specify how many requests our server can handle before we need to scale up.
@modal.enter and @modal.exit to indicate which methods of the class should be run when starting the server and shutting it down. The snap=True/snap=False distinction controls which methods run before/after a memory snapshot.

Modal considers a new replica ready to receive inputs once the modal.enter methods have exited and the container accepts connections.
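Since readiness depends on the container accepting connections, a startup method typically blocks until the server's port is open. One way to do that, as a sketch with a hypothetical helper name and default timings, is to poll the port with the standard library:

```python
import socket
import time


def wait_for_port(
    host: str, port: int, timeout_s: float = 30.0, interval_s: float = 0.1
) -> bool:
    """Block until host:port accepts TCP connections, or raise TimeoutError."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                raise TimeoutError(
                    f"{host}:{port} did not accept connections in {timeout_s}s"
                )
            time.sleep(interval_s)
```

An open port only shows that the process is listening. A fuller readiness check would also request a health route before warming up the server.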

With all this in place, we are ready to define our high-performance, low-latency LFM2 inference server.

Deploy the server

To deploy the server on Modal, just run

This will create a new App on Modal and build the container image for it if it hasn’t been built yet.

Interact with the server

Once it is deployed, you’ll see a URL appear in the command line, something like https://your-workspace-name--example-lfm-snapshot-lfmvllminference.us-east.modal.direct.

You can find interactive Swagger UI docs at the /docs route of that URL, i.e. https://your-workspace-name--example-lfm-snapshot-lfmvllminference.us-east.modal.direct/docs. These docs describe each route and indicate the expected input and output and translate requests into curl commands. For simple routes, you can even send a request directly from the docs page.

Note: when no replicas are available, Modal will respond with the 503 Service Unavailable status. In your browser, you can just hit refresh until the docs page appears. You can see the status of the application and its containers on your Modal dashboard.

Test the server

To make it easier to test the server setup, we also include a local_entrypoint that hits the server with a simple client.

If you execute the command

a fresh replica of the server will be spun up on Modal while the code below executes on your local machine.

Think of this like writing simple tests inside of the if __name__ == "__main__" block of a Python script, but for cloud deployments!

This test relies on the probe helper function below, which pings the server and waits for a valid response to stream.

The probe helper function specifically ignores two types of errors that can occur while a replica is starting up — timeouts on the client and 5XX responses from the server. Modal returns the 503 Service Unavailable status when an experimental.http_server has no live replicas.
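The retry behavior described here can be illustrated with a simplified, dependency-free sketch. This is not the example's actual helper: it works on bare status codes rather than streamed responses, and `send` is a hypothetical callable that performs one request and returns the HTTP status or raises `TimeoutError`.

```python
import time
from typing import Callable


def probe(send: Callable[[], int], deadline_s: float = 300.0, pause_s: float = 1.0) -> int:
    """Retry `send` until it returns a non-5XX status or the deadline passes."""
    deadline = time.monotonic() + deadline_s
    while time.monotonic() < deadline:
        try:
            status = send()
        except TimeoutError:
            status = None  # client-side timeout: the replica may still be booting
        if status is not None and status < 500:
            return status
        time.sleep(pause_s)  # 5XX (e.g. 503) or timeout: wait and try again
    raise TimeoutError(f"no successful response within {deadline_s}s")
```

Errors other than client timeouts and 5XX responses are left to propagate, since they usually point to a real problem rather than a replica that is still starting.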

We include a header with each request — Modal-Session-ID. The value associated with this key is used to map requests onto containers such that while the set of containers is fixed, requests with the same value are sent to the same container. Set this to a different value per multi-turn interaction (prototypically, a user conversation thread with a chatbot) to improve KV cache hit rates. Note that this header is only compatible with Modal http_servers.
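One simple way to produce such a value is to generate a random identifier once per conversation and reuse it on every turn. The sketch below assumes a UUID is an acceptable value; only the header name comes from the text, and `new_session_headers` is a hypothetical helper.

```python
import uuid


def new_session_headers() -> dict[str, str]:
    """Headers for one conversation; reuse the same dict on every turn."""
    return {"Modal-Session-ID": str(uuid.uuid4())}


conversation_a = new_session_headers()
conversation_b = new_session_headers()
# Each conversation carries its own ID, so its turns route consistently.
print(conversation_a, conversation_b)
```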

Test memory snapshotting

Using modal run creates an ephemeral Modal App, rather than a deployed Modal App. Ephemeral Modal Apps are short-lived, so they turn off snapshotting.

To test the memory snapshot version of the server, first deploy it with modal deploy and then hit it with a client.

You should observe startup improvements after a handful of cold starts (usually fewer than five). If you want to see the speedup during a test, we recommend heading to the deployed App in your Modal dashboard and manually stopping containers after they have served a request.

You can use the client code below to test the endpoint. It can be run with the command