November 3, 2025 · 6 minute read

How to deploy vLLM

Large language models power a growing share of modern applications, but serving them efficiently remains a challenge. vLLM, a high-performance open-source inference engine, was designed to solve this.

Whether you are building a chat interface, captioning system, or reasoning engine, vLLM gives you production-grade performance without closed-source dependencies. By the end of this tutorial, you will have a self-contained, serverless vLLM deployment ready to power your own AI applications.

What is vLLM?

vLLM is an open-source library for running large language models (LLMs) quickly and efficiently: it takes a trained model and makes it available to respond to requests. vLLM is one of the three most popular open-source LLM inference engines, alongside SGLang and TensorRT-LLM. All are built on top of CUDA. If you want to learn more about how to choose an LLM inference engine for your use case, check out our LLM Engineer’s Almanac.

What does the “v” in vLLM stand for?

Originally, the “v” stood for virtual, a nod to the virtual-memory-inspired PagedAttention technique the engine uses to manage GPU memory. Today, vLLM is increasingly associated with Vision + Language models — systems that can interpret both images and text (e.g., captioning tools or visual question-answering systems).

What do you need to deploy vLLM?

Deploying vLLM requires careful orchestration. You need a GPU-enabled environment (hardware that can accelerate large matrix operations), a container with all the right dependencies preinstalled, and a way to manage massive model weights so you’re not redownloading them for every request. On top of that, you need an API layer to accept user input, return responses, and handle errors gracefully. Most importantly, you must ensure that the system scales seamlessly while keeping startup and inference latency low.

Why use Modal for deployment?

Modal handles much of this heavy lifting for you. Modal consists of a Python SDK that wraps an ultra-fast container stack and multi-cloud GPU pool. It abstracts away container management, GPU scheduling, and autoscaling, letting you focus on customizing your model and serving logic.

In the next section, we will walk through how to deploy vLLM on Modal.

Quickstart

If you haven’t set up Modal already:

pip install modal
modal setup

Then, clone and run our examples repo:

git clone https://github.com/modal-labs/modal-examples
cd modal-examples
modal run 06_gpu_and_ml/llm-serving/vllm_inference.py

For a step-by-step walkthrough of this example, keep reading.

How to deploy vLLM and Qwen3-8B in minutes

Modal makes it easy to deploy open-source models on powerful GPUs. If you don’t have a Modal account, follow the two-line setup instructions in the Quickstart above. You get $30 of GPU credits every month, so following this tutorial will be entirely free.

Step 1: Create a Modal Image

To serve vLLM, you need a container image—a packaged environment that includes Python, CUDA (NVIDIA’s GPU acceleration toolkit), and all the model dependencies. Modal’s SDK makes image definition easy, since you can define all your requirements in-line with your application code.

This code pulls a base NVIDIA CUDA image and layers on the dependencies required for vLLM inference: PyTorch, FlashInfer, and Hugging Face Hub. Setting the HF_HUB_ENABLE_HF_TRANSFER environment variable speeds up model downloads via hf_transfer, Hugging Face’s optimized transfer utility.

import json
from typing import Any

import aiohttp
import modal

vllm_image = (
    modal.Image.from_registry("nvidia/cuda:12.8.0-devel-ubuntu22.04", add_python="3.12")
    .entrypoint([])
    .uv_pip_install(
        "vllm==0.10.2",
        "huggingface_hub[hf_transfer]==0.35.0",
        "flashinfer-python==0.3.1",
        "torch==2.8.0",
    )
    .env({"HF_HUB_ENABLE_HF_TRANSFER": "1"})  # faster model transfers
)

Step 2: Download and cache Qwen3-8B model weights

Next, let’s grab the LLM we want to serve. In this tutorial, we’ll use Qwen/Qwen3-8B-FP8, a quantized eight-billion-parameter model trained for reasoning and general text understanding. The “FP8” variant uses 8-bit floating-point precision—a compact numerical format that saves GPU memory without major accuracy loss.

MODEL_NAME = "Qwen/Qwen3-8B-FP8"
MODEL_REVISION = "220b46e3b2180893580a4454f21f22d3ebb187d3"  # avoid nasty surprises when repos update!
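
As a rough, illustrative back-of-the-envelope check on why the FP8 variant matters (weights only; activations and the KV cache are ignored):

# Approximate weight memory for an 8B-parameter model (illustrative, weights only)
n_params = 8e9
print(f"BF16 weights: ~{n_params * 2 / 1e9:.0f} GB")  # 2 bytes per parameter -> ~16 GB
print(f"FP8 weights:  ~{n_params * 1 / 1e9:.0f} GB")  # 1 byte per parameter -> ~8 GB

Everything you save on weights is GPU memory vLLM can spend on its KV cache, which lets a single H100 batch more concurrent requests.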

To reduce startup latency, we cache model weights and compiled artifacts. Modal provides Volumes, which are persistent network-attached filesystems. These caches prevent repeated downloads or recompilations whenever you rerun a function.

hf_cache_vol = modal.Volume.from_name("huggingface-cache", create_if_missing=True)
vllm_cache_vol = modal.Volume.from_name("vllm-cache", create_if_missing=True)

Step 3: Configure vLLM

vLLM supports JIT compilation (just-in-time kernel optimization) and CUDA graph capture, both of which speed up inference after startup. However, enabling them lengthens initialization, which means slower cold starts whenever a new replica spins up from zero. The FAST_BOOT flag controls this trade-off.

FAST_BOOT = True

If your service frequently scales from zero, keep this set to True for faster launches. If you expect consistent traffic and replicas remain warm, set it to False to unlock full performance.

Step 4: Define the vLLM inference function

Now let’s declare a Modal app that runs vLLM as a web server.

  • In the first decorator, we’re attaching the image we defined in Step 1, an H100 GPU, and the Volumes that hold our model weight and compilation caches.
  • In the second decorator, we’re configuring how many requests one replica can handle at once before Modal spins up additional replicas.
  • In the third decorator, we’re exposing a public HTTP endpoint for the function.

Here, subprocess.Popen launches vllm serve as a background process inside your container. When the model finishes loading, it begins accepting requests.

N_GPU = 1
MINUTES = 60  # seconds

app = modal.App("example-vllm-inference")

@app.function(
    image=vllm_image,
    gpu=f"H100:{N_GPU}",
    scaledown_window=900,  # how long should we stay up with no requests?
    timeout=10 * MINUTES,  # how long should we wait for container start?
    volumes={
        "/root/.cache/huggingface": hf_cache_vol,
        "/root/.cache/vllm": vllm_cache_vol,
    },
)
@modal.concurrent(  # how many requests can one replica handle? tune carefully!
    max_inputs=32
)
@modal.web_server(port=8000, startup_timeout=10 * MINUTES)
def serve():
    import subprocess

    cmd = [
        "vllm",
        "serve",
        "--uvicorn-log-level=info",
        MODEL_NAME,
        "--revision",
        MODEL_REVISION,
        "--served-model-name",
        MODEL_NAME,
        "llm",
        "--host",
        "0.0.0.0",
        "--port",
        "8000",
    ]

    # enforce-eager disables both Torch compilation and CUDA graph capture
    # default is no-enforce-eager. see the --compilation-config flag for tighter control
    cmd += ["--enforce-eager" if FAST_BOOT else "--no-enforce-eager"]

    # assume multiple GPUs are for splitting up large matrix multiplications
    cmd += ["--tensor-parallel-size", str(N_GPU)]

    print(cmd)

    subprocess.Popen(" ".join(cmd), shell=True)

Step 5: Deploy it & invoke it

Finally, deploy your app with one line in your terminal:

modal deploy vllm_inference.py

Modal builds the image the first time you deploy, and this image is cached for future deployments. Once the deployment is complete, you’ll see a live URL printed out that looks like this: https://your-workspace-name--example-vllm-inference-serve.modal.run.

You can test the endpoint directly in the browser at /docs, a built-in Swagger UI that lists available endpoints and lets you send sample requests: https://your-workspace-name--example-vllm-inference-serve.modal.run/docs
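
You can also exercise the server from code. The snippet below is a minimal sketch, not part of the example script: it assumes you have the requests package installed locally and that you swap in the URL printed by modal deploy. Because vllm serve exposes an OpenAI-compatible API, chat requests go to /v1/chat/completions, and the model field must match one of the --served-model-name values we passed above (the full model name or the llm alias).

import requests

# Replace with the URL printed by `modal deploy`
BASE_URL = "https://your-workspace-name--example-vllm-inference-serve.modal.run"

payload = {
    "model": "llm",  # the short alias we registered via --served-model-name
    "messages": [{"role": "user", "content": "Summarize what vLLM does in one sentence."}],
    "max_tokens": 128,
}

# The first request after scaling from zero waits for the container to boot and the model
# to load, so use a generous timeout.
response = requests.post(f"{BASE_URL}/v1/chat/completions", json=payload, timeout=600)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])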

That’s it! You now have a fully functioning vLLM inference server. It’s scalable, GPU-powered, and ready to integrate into any application.
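
Since the endpoint speaks the OpenAI API, most existing clients can point at it directly. Here is a sketch using the openai Python package (our illustration, not part of the example script); the API key can be any placeholder string because this deployment does not configure authentication:

from openai import OpenAI

client = OpenAI(
    base_url="https://your-workspace-name--example-vllm-inference-serve.modal.run/v1",
    api_key="not-needed",  # no --api-key was set on the server, so any string works
)

completion = client.chat.completions.create(
    model="Qwen/Qwen3-8B-FP8",  # or the "llm" alias
    messages=[{"role": "user", "content": "Write a haiku about fast inference."}],
)
print(completion.choices[0].message.content)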

Get started today

For a more detailed walkthrough of this example, including some code to test the server programmatically, check out the full writeup in our docs.

Ready to build with vLLM or deploy any AI model? Sign up for Modal and get $30 in free credits. Whether you’re running open-source models or your own custom models, Modal gives you instant access to thousands of GPUs, from T4s to B200s. No waiting for quota, configuring Kubernetes, or wasting money on idle costs—just fluid GPU compute you can attach to your inference code.
