Run OpenAI’s gpt-oss model with vLLM
Background
gpt-oss is a reasoning model that comes in two flavors: gpt-oss-120b and gpt-oss-20b. Both are Mixture of Experts (MoE) models with a small number of active parameters, which lets them combine good world knowledge and capabilities with fast inference.
We describe a few of their notable features below.
MXFP4
OpenAI’s gpt-oss models use a fairly uncommon 4-bit floating point format, mxfp4, for the MoE layers. This “block” quantization format combines e2m1 floating point numbers with blockwise scaling factors. The attention operations are not quantized.
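To make the format concrete, here is a minimal sketch of mxfp4 dequantization. The helper name and pure-Python representation are ours, for illustration only; vLLM's kernels operate on packed tensors, but the arithmetic is the same: decode each 4-bit e2m1 code and multiply by the block's shared power-of-two scale (blocks are 32 elements wide in the MX specification).
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # values representable by 2 exponent + 1 mantissa bits


def dequantize_mxfp4_block(codes: list[int], block_scale: float) -> list[float]:
    """Illustrative only: map one block of 4-bit codes (sign bit + 3-bit magnitude) back to floats."""
    return [
        (-1.0 if code & 0b1000 else 1.0) * E2M1_MAGNITUDES[code & 0b0111] * block_scale
        for code in codes
    ]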
Attention Sinks
Attention sinks allow models to handle longer context lengths without sacrificing output quality. The vLLM team added attention sink support to Flash Attention 3 (FA3) in preparation for this release.
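One common formulation, sketched below, gives each attention head a learned "sink" logit that participates in the softmax normalization but contributes no value: the head can park probability mass on the sink instead of being forced to spread it over real tokens, which helps attention stay stable at long context. This is an illustration of the general idea, not vLLM's FA3 kernel code.
import math


def softmax_with_sink(scores: list[float], sink_logit: float) -> list[float]:
    """Illustrative only: attention weights where a learned sink logit joins the normalizer."""
    m = max(scores + [sink_logit])  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    denom = sum(exps) + math.exp(sink_logit - m)  # the sink adds to the denominator...
    return [e / denom for e in exps]  # ...but produces no output, so weights can sum to < 1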
Response Format
GPT-OSS is trained with the harmony response format, which lets the model write to multiple channels, separating chain-of-thought (CoT) reasoning and tool-calling preambles from the regular text response. We’ll stick to a simpler format here, but see this cookbook for details on the new format.
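Schematically, a harmony conversation tags every message with a role and, for assistant messages, a channel, using special tokens roughly like the sketch below. This is abridged and hand-written for illustration; the cookbook linked above documents the exact token sequences and rendering rules.
# Rough, abridged illustration of harmony's channel layout -- not an exact rendering.
HARMONY_SKETCH = (
    "<|start|>user<|message|>What is 2 + 2?<|end|>"
    "<|start|>assistant<|channel|>analysis<|message|>Simple arithmetic.<|end|>"
    "<|start|>assistant<|channel|>final<|message|>2 + 2 = 4<|return|>"
)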
Set up the container image
We’ll start by defining a custom container Image that
installs all the necessary dependencies to run vLLM and the model.
import json
import time
from datetime import datetime, timezone
from typing import Any
import aiohttp
import modal
vllm_image = (
modal.Image.from_registry(
"nvidia/cuda:12.8.1-devel-ubuntu22.04",
add_python="3.12",
)
.entrypoint([])
.uv_pip_install(
"vllm==0.13.0",
"huggingface_hub[hf_transfer]==0.36.0",
)
.env( # fast Blackwell-specific MoE kernels
{"VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8": "1"}
)
)
Download the model weights
We’ll be downloading OpenAI’s model from Hugging Face. We’re running the 20B-parameter model by default, but you can easily switch to the 120B model, which also fits on a single H100 or H200 GPU.
MODEL_NAME = "openai/gpt-oss-20b"
MODEL_REVISION = "d666cf3b67006cf8227666739edf25164aaffdeb"
Although vLLM will download weights from Hugging Face on-demand, we want to cache them so we don’t do it every time our server starts. We’ll use Modal Volumes for our cache. Modal Volumes are essentially a “shared disk” that all Modal Functions can access like it’s a regular disk. For more on storing model weights on Modal, see this guide.
hf_cache_vol = modal.Volume.from_name("huggingface-cache", create_if_missing=True)
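As an optional aside, you can pre-fill this cache so the very first cold start doesn't wait on the download. The helper below is our own sketch, not part of the deployed app: call it from any Modal Function that mounts hf_cache_vol at /root/.cache/huggingface (the default Hugging Face cache location).
def prefetch_weights():
    # Illustrative helper: download the weights into the mounted cache Volume ahead of time.
    from huggingface_hub import snapshot_download

    snapshot_download(MODEL_NAME, revision=MODEL_REVISION)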
The first time you run a new model or configuration with vLLM on a fresh machine, a number of artifacts are created. We also cache these artifacts.
vllm_cache_vol = modal.Volume.from_name("vllm-cache", create_if_missing=True)
flashinfer_cache_vol = modal.Volume.from_name(
"flashinfer-cache", create_if_missing=True
)
Configuring vLLM to serve GPT-OSS
The vLLM docs include an excellent resource on tuning GPT-OSS. We mostly use the configuration values reported there, but try to explain the reasoning as we go.
VLLM_CONFIG = { # return tokens in chunks of 20, save on host overhead
"stream-interval": 20
}
One of the most important choices is to use speculative decoding, which attempts to generate multiple tokens per forward pass by means of a separate “speculator” model. Here we use RedHatAI’s open-source, generic EAGLE3-based speculator for this model. We recommend using the EAGLE3 technique to train a custom speculator on your own traffic.
SPECULATIVE_CONFIG = {
"model": "RedHatAI/gpt-oss-20b-speculator.eagle3",
"num_speculative_tokens": 7,
"method": "eagle3",
}
Speculative decoding accelerates inference without changing model behavior. We can also accelerate inference by further quantizing the model. Here, we reduce the size of KV cache entries by quantizing them to FP8.
VLLM_CONFIG |= {"kv-cache-dtype": "fp8"}There are a number of compilation settings for vLLM. Compilation improves inference performance but incurs extra latency at engine start time. When iterating on and developing a server, we recommend turning compilation off to speed up development cycles, which we here control with a global variable.
FAST_BOOT = False
Otherwise, we use the values suggested in the recipe:
COMPILATION_CONFIG = {
"pass_config": {"fuse_allreduce_rms": True, "eliminate_noops": True}
}
As part of compilation, vLLM collects sequences (really, DAGs) of CUDA kernel launches into CUDA graphs. We set the maximum batch size for the CUDA graph capture step to the maximum number of inputs we want to handle per replica, which also shows up in our autoscaling configuration below.
MAX_INPUTS = 32 # how many requests can one replica handle? tune carefully!
VLLM_CONFIG |= {"max-cudagraph-capture-size": MAX_INPUTS}Lastly, there are a few knobs we can tune based on the typical lengths of sequences we expect to observe. For many agentic tasks to which this model is well-suited, those lengths can go into the tens of thousands of tokens. Let’s assume they’re never longer than 2 ^ 15 tokens.
VLLM_CONFIG |= {
"max-num-batched-tokens": 16384,
"max-model-len": 32768,
}
Build a vLLM engine and serve it
The function below spawns a vLLM instance listening at port 8000, serving requests to our model.
app = modal.App("example-gpt-oss-inference")
N_GPU = 1
MINUTES = 60 # seconds
VLLM_PORT = 8000
@app.function(
image=vllm_image,
gpu=f"B200:{N_GPU}",
scaledown_window=10 * MINUTES, # how long should we stay up with no requests?
timeout=30 * MINUTES, # how long should we wait for container start?
volumes={
"/root/.cache/huggingface": hf_cache_vol,
"/root/.cache/vllm": vllm_cache_vol,
"/root/.cache/flashinfer": flashinfer_cache_vol,
},
)
@modal.concurrent(max_inputs=MAX_INPUTS)
@modal.web_server(port=VLLM_PORT, startup_timeout=30 * MINUTES)
def serve():
import subprocess
cmd = [
"vllm",
"serve",
MODEL_NAME,
"--revision",
MODEL_REVISION,
"--served-model-name",
MODEL_NAME,
"llm",
"--host",
"0.0.0.0",
"--port",
str(VLLM_PORT),
"--async-scheduling", # reduces host overhead, but might not be compatible with all features
]
# enforce-eager disables both Torch compilation and CUDA graph capture
# default is no-enforce-eager. see the --compilation-config flag for tighter control
cmd += ["--enforce-eager" if FAST_BOOT else "--no-enforce-eager"]
# assume multiple GPUs are for splitting up large matrix multiplications
cmd += ["--tensor-parallel-size", str(N_GPU)]
# add complex configuration objects
cmd += ["--compilation-config", json.dumps(COMPILATION_CONFIG)]
cmd += ["--speculative-config", json.dumps(SPECULATIVE_CONFIG)]
cmd += [ # add assorted config
item for k, v in VLLM_CONFIG.items() for item in (f"--{k}", str(v))
]
print(*cmd)
subprocess.Popen(cmd)
Deploy the server
To deploy the API on Modal, just run
modal deploy gpt_oss_inference.py
This will create a new app on Modal, build the container image for it if it hasn’t been built yet, and deploy the app.
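Because vLLM exposes an OpenAI-compatible API, any OpenAI client can talk to the deployed server. Here's a minimal sketch using the official Python client; the base_url is a placeholder for the URL that modal deploy prints for the serve function, and the API key can be any non-empty string since we didn't configure authentication.
# Sketch: query the deployed server from any environment with the `openai` package installed.
from openai import OpenAI

client = OpenAI(
    base_url="https://<your-serve-function-url>/v1",  # placeholder: use the URL from `modal deploy`
    api_key="not-used",  # vLLM ignores this unless you pass --api-key
)
response = client.chat.completions.create(
    model="llm",  # one of the names we passed to --served-model-name
    messages=[{"role": "user", "content": "Give me a one-sentence summary of the SVD."}],
)
print(response.choices[0].message.content)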
Test the server
To make it easier to test the server setup, we also include a local_entrypoint that does a healthcheck and then hits the server.
If you execute the command
modal run gpt_oss_inference.py
a fresh replica of the server will be spun up on Modal while the code below executes on your local machine.
We set up the system prompt with low reasoning effort to run
inference a bit faster. For the best ergonomics we recommend using
the harmony API,
which can be installed with pip install openai-harmony.
@app.local_entrypoint()
async def test(test_timeout=30 * MINUTES, user_content=None, twice=True):
url = serve.get_web_url()
system_prompt = {
"role": "system",
"content": f"""You are ChatModal, a large language model trained by Modal.
Knowledge cutoff: 2024-06
Current date: {datetime.now(timezone.utc).date()}
Reasoning: low
\\# Valid channels: analysis, commentary, final. Channel must be included for every message.
Calls to these tools must go to the commentary channel: 'functions'.""",
}
if user_content is None:
user_content = "Explain what the Singular Value Decomposition is."
messages = [ # OpenAI chat format
system_prompt,
{"role": "user", "content": user_content},
]
async with aiohttp.ClientSession(base_url=url) as session:
print(f"Running health check for server at {url}")
async with session.get("/health", timeout=test_timeout - 1 * MINUTES) as resp:
up = resp.status == 200
assert up, f"Failed health check for server at {url}"
print(f"Successful health check for server at {url}")
print(f"Sending messages to {url}:", *messages, sep="\n\t")
await _send_request(session, "llm", messages)
if twice:
messages[0]["content"] += "\nTalk like a pirate, matey."
print(f"Re-sending messages to {url}:", *messages, sep="\n\t")
await _send_request(session, "llm", messages)
async def _send_request(
session: aiohttp.ClientSession, model: str, messages: list
) -> None:
# `stream=True` tells an OpenAI-compatible backend to stream chunks
payload: dict[str, Any] = {"messages": messages, "model": model, "stream": True}
headers = {"Content-Type": "application/json", "Accept": "text/event-stream"}
t = time.perf_counter()
async with session.post(
"/v1/chat/completions", json=payload, headers=headers, timeout=10 * MINUTES
) as resp:
resp.raise_for_status()  # fail fast on a non-2xx response before consuming the stream
async for raw in resp.content:
# extract new content and stream it
line = raw.decode().strip()
if not line or line == "data: [DONE]":
continue
if line.startswith("data: "): # SSE prefix
line = line[len("data: ") :]
chunk = json.loads(line)
assert (
chunk["object"] == "chat.completion.chunk"
) # or something went horribly wrong
delta = chunk["choices"][0]["delta"]
if "content" in delta:
print(delta["content"], end="") # print the content as it comes in
elif "reasoning_content" in delta:
print(delta["reasoning_content"], end="")
elif not delta:
print()
else:
raise ValueError(f"Unsupported response delta: {delta}")
print("")
print(f"Time to Last Token: {time.perf_counter() - t:.2f} seconds")