
Run OpenAI-compatible LLM inference with LLaMA 3.1-8B and vLLM

LLMs do more than just model language: they chat, they produce JSON and XML, they run code, and more. This has complicated their interface far beyond “text-in, text-out”. OpenAI’s API has emerged as a standard for that interface, and it is supported by open source LLM serving frameworks like vLLM.
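
For a concrete sense of that interface, a chat completion request in the OpenAI format is a structured body of role-tagged messages plus a model name and optional sampling controls, rather than a bare prompt string. Purely for illustration, shown as the equivalent Python dict:

# an OpenAI-style chat completions request body, as a Python dict (illustrative only)
request_body = {
    "model": "neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16",
    "messages": [
        {"role": "system", "content": "You are a terse assistant."},
        {"role": "user", "content": "Reply with a one-word greeting."},
    ],
    "max_tokens": 16,  # sampling and length controls live alongside the messages
}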

In this example, we show how to run a vLLM server in OpenAI-compatible mode on Modal.

Our examples repository also includes scripts for running clients and load tests against OpenAI-compatible APIs here.

You can find a video walkthrough of this example and the related scripts on the Modal YouTube channel here.

Set up the container image

Our first order of business is to define the environment our server will run in: the container Image. vLLM can be installed with pip.

import modal

vllm_image = (
    modal.Image.debian_slim(python_version="3.12")
    .pip_install(
        "vllm==0.7.2",
        "huggingface_hub[hf_transfer]==0.26.2",
        "flashinfer-python==0.2.0.post2",  # pinning, very unstable
        extra_index_url="https://flashinfer.ai/whl/cu124/torch2.5",
    )
    .env({"HF_HUB_ENABLE_HF_TRANSFER": "1"})  # faster model transfers
)

In its 0.7 release, vLLM added a new version of its backend infrastructure, the V1 Engine. Using this new engine can lead to some impressive speedups, but as of version 0.7.2 it does not yet support every feature of the previous engine (including important performance optimizations like speculative decoding).

The features we use in this demo are supported, so we turn the engine on by setting an environment variable on the Modal Image.

vllm_image = vllm_image.env({"VLLM_USE_V1": "1"})

Download the model weights

We’ll be running a pretrained foundation model — Meta’s LLaMA 3.1 8B in the Instruct variant that’s trained to chat and follow instructions, quantized to 4-bit by Neural Magic and uploaded to Hugging Face.

You can read more about the w4a16 “Machete” weight layout and kernels here.

MODELS_DIR = "/llamas"
MODEL_NAME = "neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16"
MODEL_REVISION = "a7c09948d9a632c2c840722f519672cd94af885d"

Although vLLM will download weights on-demand, we want to cache them if possible. We’ll use Modal Volumes, which act as a “shared disk” that all Modal Functions can access, for our cache.

hf_cache_vol = modal.Volume.from_name(
    "huggingface-cache", create_if_missing=True
)
vllm_cache_vol = modal.Volume.from_name("vllm-cache", create_if_missing=True)
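
If you'd rather warm the cache ahead of time than pay the download cost on the first request after a deploy, you can run a one-off Function that mounts the same Volume. Here is a minimal sketch as a separate, hypothetical script (the names download_llama.py and download_model are placeholders, not part of this example):

# download_llama.py: hypothetical one-off script that pre-populates the cache Volume
import modal

download_image = (
    modal.Image.debian_slim(python_version="3.12")
    .pip_install("huggingface_hub[hf_transfer]==0.26.2")
    .env({"HF_HUB_ENABLE_HF_TRANSFER": "1"})
)

download_app = modal.App("example-llama-download", image=download_image)

hf_cache_vol = modal.Volume.from_name("huggingface-cache", create_if_missing=True)


@download_app.function(
    volumes={"/root/.cache/huggingface": hf_cache_vol}, timeout=30 * 60
)
def download_model():
    from huggingface_hub import snapshot_download

    # lands in the default Hugging Face cache, which the Volume backs,
    # so the vLLM server below finds the weights without re-downloading them
    snapshot_download(
        "neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16",
        revision="a7c09948d9a632c2c840722f519672cd94af885d",
    )

Running modal run download_llama.py::download_model once populates the huggingface-cache Volume, and the server defined below will pick the weights up from it on cold start.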

Build a vLLM engine and serve it

The function below spawns a vLLM instance listening at port 8000, serving requests to our model. vLLM will authenticate requests using the API key we provide it.

We wrap it in the @modal.web_server decorator to connect it to the Internet.

app = modal.App("example-vllm-openai-compatible")

N_GPU = 1  # tip: for best results, first upgrade to more powerful GPUs, and only then increase GPU count
API_KEY = "super-secret-key"  # api key, for auth. for production use, replace with a modal.Secret

MINUTES = 60  # seconds

VLLM_PORT = 8000


@app.function(
    image=vllm_image,
    gpu=f"H100:{N_GPU}",
    # how many requests can one replica handle? tune carefully!
    allow_concurrent_inputs=100,
    # how long should we stay up with no requests?
    container_idle_timeout=15 * MINUTES,
    volumes={
        "/root/.cache/huggingface": hf_cache_vol,
        "/root/.cache/vllm": vllm_cache_vol,
    },
)
@modal.web_server(port=VLLM_PORT, startup_timeout=5 * MINUTES)
def serve():
    import subprocess

    cmd = [
        "vllm",
        "serve",
        "--uvicorn-log-level=info",
        MODEL_NAME,
        "--revision",
        MODEL_REVISION,
        "--host",
        "0.0.0.0",
        "--port",
        str(VLLM_PORT),
        "--api-key",
        API_KEY,
    ]

    # start the vLLM server as a background process; @modal.web_server waits
    # (up to startup_timeout) for the port to accept connections before routing traffic
    subprocess.Popen(" ".join(cmd), shell=True)
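
The hard-coded API_KEY above is fine for a demo, but for production you'd store the key in a modal.Secret, as the comment suggests. Here is a minimal sketch of that variant, not part of this example file, assuming a Secret named vllm-api-key (a placeholder) that sets an API_KEY environment variable:

# hypothetical variant of `serve`: the Secret "vllm-api-key" injects API_KEY at runtime
@app.function(
    image=vllm_image,
    gpu=f"H100:{N_GPU}",
    secrets=[modal.Secret.from_name("vllm-api-key")],
    allow_concurrent_inputs=100,
    container_idle_timeout=15 * MINUTES,
    volumes={
        "/root/.cache/huggingface": hf_cache_vol,
        "/root/.cache/vllm": vllm_cache_vol,
    },
)
@modal.web_server(port=VLLM_PORT, startup_timeout=5 * MINUTES)
def serve_from_secret():
    import os
    import subprocess

    cmd = [
        "vllm", "serve", MODEL_NAME,
        "--revision", MODEL_REVISION,
        "--host", "0.0.0.0",
        "--port", str(VLLM_PORT),
        "--api-key", os.environ["API_KEY"],  # provided by the Secret as an env var
    ]
    subprocess.Popen(" ".join(cmd), shell=True)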

Deploy the server

To deploy the API on Modal, just run

modal deploy vllm_inference.py

This will create a new app on Modal, build the container image for it if it hasn’t been built yet, and deploy the app.

Interact with the server

Once it is deployed, you’ll see a URL appear in the command line, something like https://your-workspace-name--example-vllm-openai-compatible-serve.modal.run.

You can find interactive Swagger UI docs at the /docs route of that URL, i.e. https://your-workspace-name--example-vllm-openai-compatible-serve.modal.run/docs. These docs describe each route, show the expected inputs and outputs, and translate requests into curl commands.

For simple routes like /health, which checks whether the server is responding, you can even send a request directly from the docs.

To interact with the API programmatically in Python, we recommend the openai library.

See the client.py script in the examples repository here to take it for a spin:

# pip install openai==1.13.3
python openai_compatible/client.py
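
If you just want to see the shape of such a client, here is a minimal sketch (the base_url is a placeholder for your own deployment's URL; the key and model name match the values defined above):

# minimal openai-library client sketch; point base_url at your deployment
from openai import OpenAI

client = OpenAI(
    base_url="https://your-workspace-name--example-vllm-openai-compatible-serve.modal.run/v1",
    api_key="super-secret-key",  # must match API_KEY on the server
)

response = client.chat.completions.create(
    model="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16",
    messages=[{"role": "user", "content": "Testing! Is this thing on?"}],
)
print(response.choices[0].message.content)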

Testing the server

To make it easier to test the server setup, we also include a local_entrypoint that does a healthcheck and then hits the server.

If you execute the command

modal run vllm_inference.py

a fresh replica of the server will be spun up on Modal while the code below executes on your local machine.

Think of this like writing simple tests inside of the if __name__ == "__main__" block of a Python script, but for cloud deployments!

@app.local_entrypoint()
def test(test_timeout=5 * MINUTES):
    import json
    import time
    import urllib.request

    print(f"Running health check for server at {serve.web_url}")
    up, start, delay = False, time.time(), 10
    while not up:
        try:
            with urllib.request.urlopen(serve.web_url + "/health") as response:
                if response.getcode() == 200:
                    up = True
        except Exception:
            if time.time() - start > test_timeout:
                break
            time.sleep(delay)

    assert up, f"Failed health check for server at {serve.web_url}"

    print(f"Successful health check for server at {serve.web_url}")

    messages = [{"role": "user", "content": "Testing! Is this thing on?"}]
    print(f"Sending a sample message to {serve.web_url}", *messages, sep="\n")

    headers = {
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    }
    payload = json.dumps({"messages": messages, "model": MODEL_NAME})
    req = urllib.request.Request(
        serve.web_url + "/v1/chat/completions",
        data=payload.encode("utf-8"),
        headers=headers,
        method="POST",
    )
    with urllib.request.urlopen(req) as response:
        print(json.loads(response.read().decode()))

We also include a basic example of a load-testing setup using locust in the load_test.py script here:

modal run openai_compatible/load_test.py
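
For a sense of what such a load test contains, a locust script is built around an HttpUser subclass whose @task methods issue requests against the server. The following is a minimal sketch, separate from the repository's load_test.py and using placeholder values from this example:

# minimal locust sketch; run with `locust -f <this file> --host <your deployment URL>`
from locust import HttpUser, between, task


class ChatUser(HttpUser):
    wait_time = between(1, 5)  # seconds each simulated user waits between requests

    @task
    def chat_completion(self):
        # hits the OpenAI-compatible chat route on the server defined above
        self.client.post(
            "/v1/chat/completions",
            json={
                "model": "neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16",
                "messages": [{"role": "user", "content": "Say hello."}],
            },
            headers={"Authorization": "Bearer super-secret-key"},
        )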