Run OpenAI’s gpt-oss model with vLLM

Background 

gpt-oss is a reasoning model that comes in two flavors: gpt-oss-120b and gpt-oss-20b. Both are Mixture of Experts (MoE) models with a small number of active parameters, so they combine broad world knowledge and strong capabilities with fast inference.

We describe a few of the models' notable features below.

MXFP4 

OpenAI’s gpt-oss models use a fairly uncommon 4-bit MXFP4 floating point format for the MoE layers. This “block” quantization format combines E2M1 floating point values with blockwise scaling factors, one shared scale per block of 32 values. The attention operations are not quantized.
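To make the format concrete, here is a toy sketch of blockwise E2M1 quantization in plain Python. This is purely illustrative: it is not how vLLM’s MXFP4 kernels work, and the scale-selection rule is a simplification.

```python
import math

# The eight non-negative magnitudes representable in E2M1
# (1 sign bit, 2 exponent bits, 1 mantissa bit).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block):
    """Quantize up to 32 floats to E2M1 plus one shared power-of-two scale."""
    max_abs = max(abs(x) for x in block) or 1.0
    # Pick a power-of-two scale so the largest magnitude lands within the grid.
    scale = 2.0 ** math.ceil(math.log2(max_abs / 6.0))
    quantized = [
        math.copysign(min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g)), x)
        for x in block
    ]
    return scale, quantized  # dequantize each entry q as scale * q

scale, q = quantize_block([0.03 * i - 0.4 for i in range(32)])
```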

Attention Sinks 

Attention sink models allow for longer context lengths without sacrificing output quality. The vLLM team added attention sink support for Flash Attention 3 (FA3) in preparation for this release.

Response Format 

gpt-oss is trained with the harmony response format, which enables the model to output to multiple channels: chain-of-thought (CoT) and tool-calling preambles alongside regular text responses. We’ll stick to a simpler format here, but see this cookbook for details on the new format.

Set up the container image 

We’ll start by defining a custom container Image that installs all the necessary dependencies to run vLLM and the model.
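A minimal sketch of what that Image definition might look like. The base image tag, Python version, and package choices below are assumptions; pick versions of vLLM and its dependencies that include gpt-oss and FA3 attention sink support.

```python
import modal

vllm_image = (
    modal.Image.from_registry(
        "nvidia/cuda:12.8.0-devel-ubuntu22.04",  # assumed CUDA base image
        add_python="3.12",
    )
    .entrypoint([])  # silence the base image's verbose entrypoint
    .pip_install(
        "vllm",                          # pin a release with gpt-oss support
        "huggingface_hub[hf_transfer]",  # fast weight downloads
    )
    .env({"HF_HUB_ENABLE_HF_TRANSFER": "1"})
)
```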

Download the model weights 

We’ll be downloading OpenAI’s model from Hugging Face. We’re running the 20B parameter model by default, but you can easily switch to the 120B model, which also fits on a single H100 or H200 GPU.

Although vLLM will download weights from Hugging Face on-demand, we want to cache them so we don’t do it every time our server starts. We’ll use Modal Volumes for our cache. A Modal Volume is a “shared disk” that any Modal Function can access as if it were a regular local disk. For more on storing model weights on Modal, see this guide.

The first time you run a new model or configuration with vLLM on a fresh machine, a number of artifacts are created. We also cache these artifacts.
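A sketch of the two caches as Modal Volumes. The volume names and the MODEL_NAME default are assumptions; both Volumes get mounted into the serving Function later, under the default cache paths (~/.cache/huggingface and ~/.cache/vllm).

```python
# One Volume for Hugging Face weights, one for vLLM's compile/config artifacts.
hf_cache_vol = modal.Volume.from_name("huggingface-cache", create_if_missing=True)
vllm_cache_vol = modal.Volume.from_name("vllm-cache", create_if_missing=True)

MODEL_NAME = "openai/gpt-oss-20b"  # swap in openai/gpt-oss-120b if you prefer
```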

Configuring vLLM to serve GPT-OSS 

The vLLM docs include an excellent resource on tuning GPT-OSS. We mostly use the configuration values reported there, but try to explain the reasoning as we go.

One of the most important choices is to use speculative decoding, which attempts to generate multiple tokens per forward pass by means of a separate “speculator” model. Here we use RedHatAI’s open source, generic EAGLE3-based speculator for this model. For best results, we recommend using the EAGLE3 technique to train a custom speculator on your own traffic.
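A sketch of how the speculator gets wired in via vLLM’s --speculative-config flag, which takes a JSON object. The speculator repo id below is a placeholder (look up RedHatAI’s speculator on the Hugging Face Hub), and the number of speculative tokens is an assumption to tune against your own traffic.

```python
import json

speculative_config = {
    "model": "RedHatAI/<gpt-oss-eagle3-speculator>",  # placeholder repo id
    "method": "eagle3",
    "num_speculative_tokens": 3,  # assumed; raise it if acceptance rates are high
}
speculation_args = ["--speculative-config", json.dumps(speculative_config)]
```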

Speculative decoding accelerates inference without changing model behavior. We can also accelerate inference by further quantizing the model. Here, we reduce the size of KV cache entries by quantizing them to FP8.
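In vLLM this is a single engine flag; a minimal sketch, assuming the default fp8 variant for the KV cache:

```python
kv_cache_args = ["--kv-cache-dtype", "fp8"]  # fp8 here is vLLM's default (e4m3) format
```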

vLLM exposes a number of compilation settings. Compilation improves inference performance but adds latency at engine start time. When iterating on and developing a server, we recommend turning compilation off to speed up development cycles, which we control here with a global variable.

Otherwise, we use the values suggested in the recipe, as in the sketch below.
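A sketch of that toggle. FAST_BOOT and the exact flag set are assumptions: --enforce-eager skips torch.compile and CUDA graph capture to cut startup time, while --async-scheduling is one of the throughput settings suggested in the recipe.

```python
FAST_BOOT = False  # flip to True while iterating on the server code

if FAST_BOOT:
    # Skip compilation and CUDA graph capture: slower steady state, much faster boot.
    compilation_args = ["--enforce-eager"]
else:
    # Values along the lines of the vLLM gpt-oss recipe.
    compilation_args = ["--async-scheduling"]
```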

As part of compilation, vLLM collects sequences (really, DAGs) of CUDA kernel launches into CUDA graphs. We set the maximum batch size for the CUDA graph capture step to the maximum number of inputs we want to handle per replica, which also appears in our autoscaling configuration below.
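A sketch, assuming recent vLLM releases expose this setting as --cuda-graph-sizes; MAX_INPUTS is our assumed per-replica concurrency target and reappears in the serving Function’s concurrency configuration.

```python
MAX_INPUTS = 32  # assumed maximum number of concurrent requests per replica

cuda_graph_args = ["--cuda-graph-sizes", str(MAX_INPUTS)]
```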

Lastly, there are a few knobs we can tune based on the typical lengths of sequences we expect to observe. For many agentic tasks to which this model is well-suited, those lengths can go into the tens of thousands of tokens. Let’s assume they’re never longer than 2^15 = 32,768 tokens.
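A sketch of those knobs, using vLLM’s --max-model-len and --max-num-batched-tokens flags with the 2^15 budget described above.

```python
MAX_SEQ_LEN = 1 << 15  # 32,768 tokens

length_args = [
    "--max-model-len", str(MAX_SEQ_LEN),
    "--max-num-batched-tokens", str(MAX_SEQ_LEN),
]
```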

Build a vLLM engine and serve it 

The function below spawns a vLLM instance listening at port 8000, serving requests to our model.
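A condensed sketch of that Function, reusing the names defined in the earlier sketches. The app name, GPU choice, timeouts, and autoscaling limits are assumptions; adjust them for your workload.

```python
import subprocess

app = modal.App("gpt-oss-vllm")  # assumed app name

MINUTES = 60  # seconds
VLLM_PORT = 8000

@app.function(
    image=vllm_image,
    gpu="H100",  # both the 20B and 120B models fit on a single H100
    volumes={
        "/root/.cache/huggingface": hf_cache_vol,
        "/root/.cache/vllm": vllm_cache_vol,
    },
    scaledown_window=15 * MINUTES,  # how long an idle replica lingers before scaling down
)
@modal.concurrent(max_inputs=MAX_INPUTS)  # matches the CUDA graph capture size above
@modal.web_server(port=VLLM_PORT, startup_timeout=10 * MINUTES)
def serve():
    cmd = [
        "vllm", "serve", MODEL_NAME,
        "--host", "0.0.0.0",
        "--port", str(VLLM_PORT),
        *compilation_args,
        *kv_cache_args,
        *cuda_graph_args,
        *length_args,
        *speculation_args,
    ]
    subprocess.Popen(cmd)
```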

Deploy the server 

To deploy the API on Modal, just run modal deploy on the file containing this app.

This will create a new app on Modal, build the container image for it if it hasn’t been built yet, and deploy the app.

Test the server 

To make it easier to test the server setup, we also include a local_entrypoint that does a healthcheck and then hits the server.

If you execute modal run on the same file, a fresh replica of the server will be spun up on Modal while the test code below executes on your local machine.

We set the system prompt to request low reasoning effort so that inference runs a bit faster. For the best ergonomics, we recommend using the harmony API, which can be installed with pip install openai-harmony.
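A condensed sketch of such an entrypoint, again reusing names from the sketches above and assuming a recent Modal client for get_web_url(). It polls vLLM’s /health endpoint, then sends one OpenAI-compatible chat completion whose system prompt requests low reasoning effort; the prompt text and timing constants are assumptions.

```python
import json
import time
import urllib.request

@app.local_entrypoint()
def test(prompt: str = "Briefly explain what an attention sink is."):
    url = serve.get_web_url()  # the web endpoint of our vLLM server

    # Healthcheck: wait for the engine to finish starting up.
    for _ in range(60):
        try:
            with urllib.request.urlopen(url + "/health") as response:
                if response.status == 200:
                    break
        except OSError:
            pass
        time.sleep(10)
    else:
        raise RuntimeError("server never became healthy")

    # A simple chat completion, with low reasoning effort set via the system prompt.
    payload = json.dumps({
        "model": MODEL_NAME,
        "messages": [
            {"role": "system", "content": "Reasoning: low"},
            {"role": "user", "content": prompt},
        ],
    }).encode()
    request = urllib.request.Request(
        url + "/v1/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        reply = json.loads(response.read())
    print(reply["choices"][0]["message"]["content"])
```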