Run OpenAI-compatible LLM inference with Gemma and vLLM
In this example, we show how to run a vLLM server in OpenAI-compatible mode on Modal.
LLMs do more than just model language: they chat, they produce JSON and XML, they run code, and more. This has complicated their interface far beyond “text-in, text-out”. OpenAI’s API has emerged as a standard for that interface, and it is supported by open source LLM serving frameworks like vLLM.
This example is intended to demonstrate the basics of deploying LLM inference on Modal. For more on how to optimize performance, see this guide and check out our LLM Engineer’s Almanac.
Our examples repository also includes scripts for running clients and load tests against OpenAI-compatible APIs here.
Set up the container image
Our first order of business is to define the environment our server will run in:
the container Image.
vLLM can be installed with uv pip, since Modal provides the CUDA drivers.
import json
from typing import Any
import aiohttp
import modal
vllm_image = (
    modal.Image.from_registry("nvidia/cuda:12.9.0-devel-ubuntu22.04", add_python="3.12")
    .entrypoint([])
    .uv_pip_install(
        "vllm==0.19.0",
    )
    .uv_pip_install(  # as of vllm 0.19.0, must install transformers separately to use Gemma 4
        "transformers==5.5.0",
    )
    .env({"HF_XET_HIGH_PERFORMANCE": "1"})  # faster model transfers
)
Download the model weights
We'll be running a pretrained foundation model: Google's Gemma 4. It can also take images, video, and audio as inputs, though we won't use those capabilities here.
We'll use the 26B-A4B variant, google/gemma-4-26B-A4B-it.
This variant is trained with reasoning capabilities, which allow it to enhance the quality of its generated responses.
It has 26 billion parameters, of which 4 billion are active in processing each token.
You can swap this model out for another by changing the strings below, though you may also need to adjust some of the server configuration. A single H200 GPU has enough VRAM to store this 26,000,000,000-parameter model along with a large KV cache.
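As a rough sanity check, assuming 16-bit weights (two bytes per parameter): 26 billion parameters × 2 bytes ≈ 52 GB, comfortably under the H200's 141 GB of VRAM, leaving on the order of 90 GB for the KV cache and activations.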
MODEL_NAME = "google/gemma-4-26B-A4B-it"
MODEL_REVISION = "47b6801b24d15ff9bcd8c96dfaea0be9ed3a0301"  # avoid nasty surprises when repos update!
Although vLLM will download weights from Hugging Face on-demand, we want to cache them so we don't do it every time our server starts. We'll use Modal Volumes for our cache. Modal Volumes are essentially a "shared disk" that all Modal Functions can access like it's a regular disk. For more on storing model weights on Modal, see this guide.
hf_cache_vol = modal.Volume.from_name("huggingface-cache", create_if_missing=True)
We'll also cache some of vLLM's JIT compilation artifacts in a Modal Volume.
vllm_cache_vol = modal.Volume.from_name("vllm-cache", create_if_missing=True)
Configuring vLLM
Trading off fast boots and token generation performance
vLLM has embraced dynamic and just-in-time compilation to eke out additional performance without having to write too many custom kernels,
e.g. via the Torch compiler and CUDA graph capture.
These compilation features incur extra latency at startup in exchange for lower latency and higher throughput during generation.
This startup latency is typically tens of seconds to a few minutes, reduced to about ten seconds when the compilation artifacts are loaded from the cache.
We make this trade-off controllable with the FAST_BOOT variable below.
FAST_BOOT = False
If you're running an LLM service that frequently scales from 0 (frequent "cold starts"),
you might want to set this to True, or consider GPU memory snapshots.
It’s also useful to set this when you’re iterating on the server configuration.
If you’re running an LLM service that usually has multiple replicas running, then set this to False for improved performance.
See the code below for details on the parameters that FAST_BOOT controls.
For more on the performance you can expect when serving your own LLMs, see our LLM engine performance benchmarks.
Build a vLLM engine and serve it
The function below spawns a vLLM instance listening at port 8000, serving requests to our model.
We wrap it in the @modal.web_server decorator to connect it to the Internet.
The server runs in an independent process, via subprocess.Popen, and only starts accepting requests
once the vLLM engine has spun up and is listening on the port.
app = modal.App("example-vllm-inference")
N_GPU = 1
MINUTES = 60 # seconds
VLLM_PORT = 8000
@app.function(
    image=vllm_image,
    gpu=f"H200:{N_GPU}",
    scaledown_window=15 * MINUTES,  # how long should we stay up with no requests?
    timeout=10 * MINUTES,  # how long should we wait for container start?
    volumes={
        "/root/.cache/huggingface": hf_cache_vol,
        "/root/.cache/vllm": vllm_cache_vol,
    },
)
@modal.concurrent(  # how many requests can one replica handle? tune carefully!
    max_inputs=100,
)
@modal.web_server(port=VLLM_PORT, startup_timeout=10 * MINUTES)
def serve():
    import json
    import subprocess

    cmd = [
        "vllm",
        "serve",
        MODEL_NAME,
        "--revision",
        MODEL_REVISION,
        "--served-model-name",
        MODEL_NAME,
        "llm",  # a short alias, used by the test client below
        "--host",
        "0.0.0.0",
        "--port",
        str(VLLM_PORT),
        "--uvicorn-log-level=info",
        "--async-scheduling",
    ]

    # enforce-eager disables both Torch compilation and CUDA graph capture
    # default is no-enforce-eager. see the --compilation-config flag for tighter control
    cmd += ["--enforce-eager" if FAST_BOOT else "--no-enforce-eager"]

    # assume multiple GPUs are for splitting up large matrix multiplications
    cmd += ["--tensor-parallel-size", str(N_GPU)]

    # add model-specific configuration
    cmd += [
        # skip multimedia support, just language
        "--limit-mm-per-prompt",
        f"'{json.dumps({'image': 0, 'video': 0, 'audio': 0})}'",
        # enable reasoning and tool use
        "--enable-auto-tool-choice",
        "--reasoning-parser gemma4",
        "--tool-call-parser gemma4",
    ]

    print(*cmd)

    subprocess.Popen(" ".join(cmd), shell=True)
Deploy the server
To deploy the API on Modal, just run
modal deploy vllm_inference.py
This will create a new app on Modal, build the container image for it if it hasn't been built yet, and deploy the app.
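If you're still iterating on the server configuration, you can instead run modal serve vllm_inference.py, which creates a temporary deployment that live-reloads as you edit the file.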
Interact with the server
Once it is deployed, you’ll see a URL appear in the command line,
something like https://your-workspace-name--example-vllm-inference-serve.modal.run.
You can find interactive Swagger UI docs at the /docs route of that URL, i.e. https://your-workspace-name--example-vllm-inference-serve.modal.run/docs.
These docs describe each route, indicate the expected inputs and outputs,
and translate requests into curl commands.
For simple routes like /health, which checks whether the server is responding,
you can even send a request directly from the docs.
To interact with the API programmatically in Python, we recommend the openai library.
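As a minimal sketch, assuming the placeholder URL below is swapped for your own deployment's URL and that you haven't configured an API key on the server, a streaming request looks roughly like this:
from openai import OpenAI

client = OpenAI(
    base_url="https://your-workspace-name--example-vllm-inference-serve.modal.run/v1",
    api_key="EMPTY",  # placeholder; vLLM doesn't check keys unless one is configured
)
response = client.chat.completions.create(
    model="llm",  # one of the names we passed to --served-model-name
    messages=[{"role": "user", "content": "Explain the singular value decomposition."}],
    stream=True,
)
for chunk in response:
    print(chunk.choices[0].delta.content or "", end="")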
See the client.py script in the examples repository here to take it for a spin:
# pip install openai==1.76.0
python openai_compatible/client.py
Testing the server
To make it easier to test the server setup, we also include a local_entrypoint that does a healthcheck and then hits the server.
If you execute the command
modal run vllm_inference.py
a fresh replica of the server will be spun up on Modal while the code below executes on your local machine.
Think of this like writing simple tests inside of the if __name__ == "__main__" block of a Python script, but for cloud deployments!
@app.local_entrypoint()
async def test(test_timeout=10 * MINUTES, content=None, twice=True):
    url = await serve.get_web_url.aio()

    system_prompt = {
        "role": "system",
        "content": "You are a pirate who can't help but drop sly reminders that he went to Harvard.",
    }
    if content is None:
        content = "Explain the singular value decomposition."

    messages = [  # OpenAI chat format
        system_prompt,
        {"role": "user", "content": content},
    ]

    async with aiohttp.ClientSession(base_url=url) as session:
        print(f"Running health check for server at {url}")
        async with session.get("/health", timeout=test_timeout - 1 * MINUTES) as resp:
            up = resp.status == 200
        assert up, f"Failed health check for server at {url}"
        print(f"Successful health check for server at {url}")

        print(f"Sending messages to {url}:", *messages, sep="\n\t")
        await _send_request(session, "llm", messages)
        if twice:
            messages[0]["content"] = "You are Jar Jar Binks."
            print(f"Sending messages to {url}:", *messages, sep="\n\t")
            await _send_request(session, "llm", messages)
async def _send_request(
    session: aiohttp.ClientSession, model: str, messages: list
) -> None:
    # `stream=True` tells an OpenAI-compatible backend to stream chunks
    payload: dict[str, Any] = {"messages": messages, "model": model, "stream": True}
    # explicitly enable thinking for this model
    payload["chat_template_kwargs"] = {"enable_thinking": True}

    headers = {"Content-Type": "application/json", "Accept": "text/event-stream"}

    async with session.post(
        "/v1/chat/completions", json=payload, headers=headers
    ) as resp:
        resp.raise_for_status()  # surface HTTP errors before we start streaming
        async for raw in resp.content:
            # extract new content and stream it
            line = raw.decode().strip()
            if not line or line == "data: [DONE]":
                continue
            if line.startswith("data: "):  # SSE prefix
                line = line[len("data: ") :]

            chunk = json.loads(line)
            assert (
                chunk["object"] == "chat.completion.chunk"
            )  # or something went horribly wrong

            delta = chunk["choices"][0]["delta"]
            content = (
                delta.get("content")
                or delta.get("reasoning")
                or delta.get("reasoning_content")
            )
            if content:
                print(content, end="")
            else:
                print("\n", chunk)

    print()
We also include a basic example of a load-testing setup using locust in the load_test.py script here:
modal run openai_compatible/load_test.py