Serve very large language models (DeepSeek V3, Kimi-K2, GLM 4)
This example demonstrates the basic patterns for serving language models on Modal whose weights consume hundreds of gigabytes of storage.
In short:
- load weights into a Modal Volume ahead of server launch
- use random “dummy” weights when iteratively developing your server
- use two, four, or eight H200 or B200 GPUs
- use lower-precision weight formats (FP4 on Blackwell, FP8 on Hopper)
- default to using speculative decoding, especially if batches are in the few tens of sequences
For more tips on how to serve specific types of LLM inference at high performance, see this guide. For a gentler introduction to LLM serving, see this example.
import asyncio
import json
import os
import subprocess
import time
from pathlib import Path
import aiohttp
import modal
import modal.experimental
here = Path(__file__).parent
Set up the container image
We start by creating a Modal Image based on the Docker image
provided by the SGLang team.
This contains our Python and system dependencies.
Add more by chaining .apt_install, .uv_pip_install, or .pip_install method calls, as we do below with .entrypoint.
See the Modal Image guide for details.
image = modal.Image.from_registry("lmsysorg/sglang:v0.5.7").entrypoint(
[] # silence chatty logs on entry
)
Load model weights
Large model weights take a long time to move around. Model weight servers like Hugging Face will send weights at a few hundred megabytes per second. For large models, with weight sizes in the hundreds of gigabytes, that means thousands of seconds (tens of minutes) of model loading time.
After downloading them once, we cache the weights in a Modal Volume so that subsequent loads are about 10x faster, at roughly one to three gigabytes per second.
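As a rough back-of-envelope illustration (a sketch only, assuming ~350 GB of weights and throughputs in the ranges quoted above; real numbers vary):
weights_gb = 350  # approximate size of the FP8 model we serve below
hub_gbps, volume_gbps = 0.3, 2.0  # assumed throughput: Hugging Face Hub vs. Modal Volume
print(f"from the Hub: ~{weights_gb / hub_gbps / 60:.0f} minutes")  # ~19 minutes
print(f"from a Volume: ~{weights_gb / volume_gbps / 60:.0f} minutes")  # ~3 minutes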
hf_cache_vol = modal.Volume.from_name("huggingface-cache", create_if_missing=True)
That still means minutes of startup time. Both of these latencies kill productivity when you’re iterating on aspects besides model behavior, like server configuration.
For this reason, we recommend skipping model loading while you’re developing
a server or configuration — even when benchmarking, if you can!
You can still exercise the same code paths if you use the dummy model
loading format. In this sample code, we add an APP_USE_DUMMY_WEIGHTS environment variable
to control this behavior from the command line during iteration.
USE_DUMMY_WEIGHTS = os.environ.get("APP_USE_DUMMY_WEIGHTS", "0").lower() in (
"1",
"true",
)
image = image.env(
{
"HF_XET_HIGH_PERFORMANCE": "1", # faster downloads
"APP_USE_DUMMY_WEIGHTS": str(int(USE_DUMMY_WEIGHTS)),
}
)
We download the model weights from Hugging Face by running a Python function as part of the Modal Image build. Note that command-line logging will be somewhat limited.
def download_model(repo_id, revision=None):
from huggingface_hub import snapshot_download
snapshot_download(repo_id=repo_id, revision=revision)
To run the function, we need to pick a specific model to download. We’ll use Z.ai’s GLM 4.7 in eight-bit floating point quantization. This model takes about thirty minutes to an hour to download from Hugging Face.
REPO_ID = "zai-org/GLM-4.7-FP8"
if not USE_DUMMY_WEIGHTS: # skip download if we don't need real weights
image = image.run_function(
download_model,
volumes={"/root/.cache/huggingface": hf_cache_vol},
args=(REPO_ID,),
)
Configure the inference engine
Running large models efficiently requires specialized inference engines like SGLang. These engines are generally highly configurable.
For SGLang, there are three main sources of configuration values:
- Environment variables for the process running sglang.
- Command-line arguments for the command that launches the sglang process.
- Configuration files loaded by the sglang process.
For deployments, we prefer to put information in configuration files where possible. CLI arguments and configuration files can typically be interchanged. CLI arguments are convenient when iterating, but configuration files are easier to share. We use environment variables only as a last resort, typically to activate new or experimental features.
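As an example of that interchangeability: passing --max-running-requests 32 on the command line and setting max-running-requests: 32 in a YAML config file express the same setting. We use the latter below.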
Environment variables
SGLang environment variables are prefixed with SGL_ or SGLANG_.
The SGL_ prefix is deprecated.
The snippet below adds any such environment variables present during deployment to the Modal Image.
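For example, a deployment launched as SGLANG_SOME_EXPERIMENTAL_FLAG=1 modal deploy very_large_models.py (the variable name is an illustrative placeholder, not a real SGLang flag) would have that setting forwarded into the Image.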
def is_sglang_env_var(key):
return key.startswith("SGL_") or key.startswith("SGLANG_")
image = image.env(
{key: value for key, value in os.environ.items() if is_sglang_env_var(key)}
)
YAML
Configuration files can be passed in YAML format.
We include a default config in-line in the code here for ease of use. It’s designed to run GLM 4.7 FP8 at low to moderate concurrency. In particular, it uses that model’s built-in multi-token prediction speculative decoding to improve time per output token.
default_config = """\
# General Config
host: 0.0.0.0
log-level: debug # very noisy
# Model Config
tool-call-parser: glm47
reasoning-parser: glm45
trust-remote-code: true
# Memory
mem-fraction-static: 0.85
chunked-prefill-size: 32768
kv-cache-dtype: fp8_e4m3
# Observability
enable-metrics: true
collect-tokens-histogram: true
# Batching
max-running-requests: 32
cuda-graph-max-bs: 32
# SpecDec (speed up low/moderate concurrency)
speculative-algorithm: EAGLE # built into GLM 4.7, is just multi-token prediction
"""You’ll want to provide your own configuration file for other settings, in particular if you change the model.
We add an environment variable, APP_LOCAL_CONFIG_PATH,
to change the loaded configuration.
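For example, to deploy with your own file (the filename here is just an illustration):
APP_LOCAL_CONFIG_PATH=my_glm_config.yaml modal deploy very_large_models.py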
local_config_path = os.environ.get("APP_LOCAL_CONFIG_PATH")
if modal.is_local():
if local_config_path is None:
local_config_path = here / "config_very_large_models.yaml"
if not local_config_path.exists():
local_config_path.write_text(default_config)
print(
f"Using default config from {local_config_path.relative_to(here)}:",
default_config,
sep="\n",
)
image = image.add_local_file(local_config_path, "/root/config.yaml")
Command-line arguments
We launch our server by kicking off a subprocess. The convenience function below encapsulates the command and its arguments.
We pass a few key bits of configuration here on the command line, rather than in the configuration file, because they are also consumed by other code and the values need to stay in sync.
That includes:
- Model information, which is also used during weight caching
- GPU count, which is also used below when defining our Modal deployment
- The port to serve on, which is also used to connect up Modal networking
We also pass the HF_HUB_OFFLINE environment variable here,
so that our server will crash when trying to load the real model
if those weights are not in cache.
For smaller models, we can instead load weights dynamically on
server start (and cache them so later starts are faster).
But for large models, weight loading extends the first start latency
so much that downstream timeouts are triggered —
or need to be extended so much that they are no longer tight enough
on the happy path.
def _start_server() -> subprocess.Popen:
"""Start SGLang server in a subprocess"""
cmd = [
f"HF_HUB_OFFLINE={0 if USE_DUMMY_WEIGHTS else 1}",
"python",
"-m",
"sglang.launch_server",
"--host",
"0.0.0.0",
"--port",
str(SGLANG_PORT),
"--model-path",
REPO_ID,
"--tp-size",
str(GPU_COUNT),
"--config",
"/root/config.yaml",
]
if USE_DUMMY_WEIGHTS:
cmd.extend(["--load-format", "dummy"])
print("Starting SGLang server with command:")
print(*cmd)
return subprocess.Popen(" ".join(cmd), shell=True, start_new_session=True)
Lastly, we import the sglang library as part of loading the Image on Modal.
This is a minor optimization, but it can shave a few seconds off cold start latencies
by providing better prefetching hints, and every second counts!
with image.imports():
import sglang # noqa
Configure infrastructure
Now, we wrap our configured SGLang server for our large model in the infrastructure required to run and interact with it. Infrastructure in Modal is generally attached to an App. Here, we’ll attach our Modal Image as the default for Modal Functions that run in the App.
app = modal.App("example-serve-very-large-models", image=image)
Most importantly, we need to decide what hardware to run on. H200 and B200 GPUs have over 100 GB of GPU RAM — 141 GB and 180 GB, respectively. The model’s weights will be stored in this memory, and they consume several hundred gigabytes of space, so we will generally want several of these accelerators. We also need space for the model’s KV cache of activations on input sequences.
In eight-bit precision, GLM 4.7 consumes ~350 GB of space, so we use four H200s for 564 GB of RAM.
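As a rough sanity check (a sketch only; the engine also reserves memory for CUDA graphs, activations, and other overhead):
gpu_ram_gb, weights_gb = 141, 350  # H200 memory and approximate FP8 weight size
total_gb = 4 * gpu_ram_gb  # 564 GB across the tensor-parallel group
print(f"~{total_gb - weights_gb} GB left over for KV cache and overhead")  # ~214 GB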
GPU_TYPE = "H200"
GPU_COUNT = 4
We’ll use a Modal experimental.http_server to serve our model.
This reduces client latencies and provides for regionalized deployment.
You can read more about it in this example.
To configure it, we need to pass in region information for the GPU workers
and for the load-balancing proxy.
REGION = "us"
PROXY_REGIONS = ["us-east"]Lastly, we need to configure autoscaling parameters. By default, Modal is fully serverless, and applications scale to zero when there is no load. But booting up inference engines for large models takes minutes, which is generally longer than clients can tolerate waiting.
So a production deployment of large models that has clients with
per-request SLAs in the few or tens of seconds
generally needs to keep one replica up at all times.
In Modal, we achieve this with the min_containers parameter
of App.cls or App.function.
This can trigger substantial costs, so we leave the value at 0 in this sample code.
MIN_CONTAINERS = 0 # Set to 1 for production to keep a warm replica
Deployments of large models with a single node per replica can generally handle a few tens of requests
without queueing. When a particular replica has more requests than it can handle, we want to scale it up.
This behavior is configured by passing the target_inputs parameter to modal.concurrent.
TARGET_INPUTS = 10 # Concurrent requests per replica before scaling
Define the server
Now we’re ready to put all of our infrastructure configuration together into a Modal Cls.
The Modal Cls allows us to control container lifecycle.
In particular, it lets us define work that a replica should do before
and after it handles requests in methods decorated with modal.enter and modal.exit, respectively.
SGLANG_PORT = 8000
MINUTES = 60 # seconds
@app.cls(
image=image,
gpu=f"{GPU_TYPE}:{GPU_COUNT}",
scaledown_window=20 * MINUTES, # how long should we stay up with no requests?
timeout=30 * MINUTES, # how long should we wait for container start?
volumes={"/root/.cache/huggingface": hf_cache_vol},
region=REGION,
min_containers=MIN_CONTAINERS,
)
@modal.experimental.http_server(
port=SGLANG_PORT,
proxy_regions=["us-east"],
exit_grace_period=5 * MINUTES, # time to finish requests on shutdown
)
@modal.concurrent(target_inputs=TARGET_INPUTS)
class Server:
@modal.enter()
def start(self):
"""Start SGLang server process and wait for it to be ready"""
self.proc = _start_server()
wait_for_server_ready()
@modal.exit()
def stop(self):
"""Terminate the SGLang server process"""
self.proc.terminate()
self.proc.wait()
We called a wait_for_server_ready function in our modal.enter method.
That’s defined below. It pings the /health endpoint until the server responds.
def wait_for_server_ready():
"""Wait for SGLang server to be ready"""
import requests
url = f"http://localhost:{SGLANG_PORT}/health"
print(f"Waiting for server to be ready at {url}")
while True:
try:
resp = requests.get(url, timeout=5)
if resp.status_code == 200:
print("Server is ready!")
return
except requests.exceptions.RequestException:
pass
time.sleep(5)
Test the server
You can deploy a fresh replica and test it using the command
APP_USE_DUMMY_WEIGHTS=1 modal run very_large_models.py
which will create an ephemeral Modal App
and execute the local_entrypoint code below.
Because the weights are randomized, the outputs are also random.
Remove the APP_USE_DUMMY_WEIGHTS flag to test the trained model.
@app.local_entrypoint()
async def test(test_timeout=20 * MINUTES, content=None, twice=True):
"""Test the model serving endpoint"""
url = Server._experimental_get_flash_urls()[0]
if USE_DUMMY_WEIGHTS:
system_prompt = {"role": "system", "content": "This system produces gibberish."}
else:
system_prompt = {"role": "system", "content": "You are a helpful AI assistant."}
if content is None:
content = "Explain the transformer architecture in one paragraph."
messages = [system_prompt, {"role": "user", "content": content}]
print(f"Sending messages to {url}:", *messages, sep="\n\t")
await probe(url, messages, timeout=test_timeout)
if twice:
messages[1]["content"] = "What is the capital of France?"
print(f"Sending second request to {url}:", *messages, sep="\n\t")
await probe(url, messages, timeout=1 * MINUTES)
The unique client logic for Modal deployments is in the probe function below.
Specifically, when a Modal experimental.http_server is spinning up,
i.e. before the modal.enter finishes for at least one replica,
clients will see a 503 Service Unavailable status
and so should retry.
async def probe(url, messages, timeout=20 * MINUTES):
"""Send request with retry logic for startup delays"""
deadline = time.time() + timeout
async with aiohttp.ClientSession(base_url=url) as session:
while time.time() < deadline:
try:
await _send_request_streaming(session, messages)
return
except asyncio.TimeoutError:
await asyncio.sleep(1)
except aiohttp.client_exceptions.ClientResponseError as e:
if e.status == 503: # Service Unavailable during startup
await asyncio.sleep(1)
continue
raise e
raise TimeoutError(f"No response from server within {timeout} seconds")
Deploy the server
When you’re ready, you can create a persistent deployment with
APP_USE_DUMMY_WEIGHTS=0 modal deploy very_large_models.py
And hit it with any OpenAI API-compatible client!
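For instance, with the openai Python package you could write a small client script along these lines (a sketch: the base URL is a placeholder for the URL printed by modal deploy, and the api_key value is arbitrary since the server does not check it):
# client.py -- run separately from the deployment above
from openai import OpenAI

# Replace the placeholder URL below with the one printed when you deploy the app.
client = OpenAI(base_url="https://your-workspace--example.modal.run/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="zai-org/GLM-4.7-FP8",  # the REPO_ID served above
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)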
Addenda
The probe function above uses this helper function
to stream response tokens as they become available.
async def _send_request_streaming(
session: aiohttp.ClientSession, messages: list, timeout: int | None = None
):
"""Stream response from chat completions endpoint"""
payload = {
"messages": messages,
"stream": True,
"max_tokens": 1024 if USE_DUMMY_WEIGHTS else None,
}
headers = {"Accept": "text/event-stream"}
async with session.post(
"/v1/chat/completions", json=payload, headers=headers, timeout=timeout
) as resp:
resp.raise_for_status()
full_text = ""
async for raw in resp.content:
line = raw.decode("utf-8", errors="ignore").strip()
if not line:
continue
if not line.startswith("data:"):
continue
data = line[len("data:") :].strip()
if data == "[DONE]":
break
try:
evt = json.loads(data)
except json.JSONDecodeError:
continue
delta = (evt.get("choices") or [{}])[0].get("delta") or {}
chunk = delta.get("content") or delta.get("reasoning_content")
if chunk:
print(
chunk,
end="",
flush="\n" in chunk or "." in chunk or len(chunk) > 100,
)
full_text += chunk
print()