Serve the Qwen 3.6 Vision-Language Model with SGLang

Vision-Language Models (VLMs) are like LLMs with eyes: they can generate text based not just on other text, but on images as well.

This example shows how to serve a VLM on Modal using the SGLang library with an OpenAI-compatible API server.

Setup and container image definition

First, we import our global dependencies and define constants.

import asyncio
import json
import subprocess
import time

import aiohttp
import modal

MINUTES = 60

To define the container Image with our server’s dependencies, we start from the official SGLang Docker image built for CUDA 13.

sglang_image = modal.Image.from_registry(
    "lmsysorg/sglang:v0.5.10.post1-cu130-runtime"
).entrypoint([])

Configure the model

Qwen3.6-35B-A3B-FP8 is a vision-language reasoning model with 35B total parameters, of which only about 3B are activated per token in each forward pass. We use the 8-bit floating point (FP8) quantized version of the model for faster cold starts and faster inference, with negligible differences in behavior.

MODEL_NAME = "Qwen/Qwen3.6-35B-A3B-FP8"
MODEL_REVISION = "95a723d08a9490559dae23d0cff1d9466213d989"

Configure GPU

We use a single H100 GPU. The ~35 GB of model weights fit comfortably in this GPU’s 80 GB of high-bandwidth memory.

GPU = "H100!:1"
N_GPUS = 1
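
As a back-of-the-envelope check on the memory figure above, weight memory is roughly the parameter count times the bytes per parameter. The sketch below takes "35B" at face value and ignores quantization scales, any layers kept in higher precision, activations, and the KV cache, so treat the results as rough estimates rather than measurements.

```python
# Rough weight-memory estimate; ignores scales, activations, and the KV cache.
TOTAL_PARAMS = 35e9  # assumption: "35B" taken at face value
GPU_MEMORY_GB = 80  # H100 high-bandwidth memory, as cited above


def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


fp8_gb = weight_memory_gb(TOTAL_PARAMS, 1)  # FP8 stores 1 byte per parameter
bf16_gb = weight_memory_gb(TOTAL_PARAMS, 2)  # BF16 stores 2 bytes per parameter

print(f"FP8 weights:  ~{fp8_gb:.0f} GB, leaving ~{GPU_MEMORY_GB - fp8_gb:.0f} GB")
print(f"BF16 weights: ~{bf16_gb:.0f} GB, leaving ~{GPU_MEMORY_GB - bf16_gb:.0f} GB")
```

Under these assumptions the FP8 weights leave more than half of the GPU free for the KV cache and activations, while 16-bit weights would leave very little headroom on a single 80 GB device.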

Modal Apps typically cache some artifacts in a Modal Volume for faster cold starts. Here, we cache the model weights and the JIT-compiled DeepGEMM kernels.

HF_CACHE_VOL = modal.Volume.from_name("huggingface-cache", create_if_missing=True)
HF_CACHE_PATH = "/root/.cache/huggingface"

DG_CACHE_VOL = modal.Volume.from_name("deepgemm-cache", create_if_missing=True)
DG_CACHE_PATH = "/root/.cache/deep_gemm"

We configure the behavior and performance of the weight and compilation caches via environment variables. We also set a few other useful performance flags for this model.

sglang_image = sglang_image.env(
    {
        "HF_HUB_CACHE": HF_CACHE_PATH,
        "HF_XET_HIGH_PERFORMANCE": "1",
        "SGLANG_ENABLE_JIT_DEEPGEMM": "1",
        "SGLANG_USE_CUDA_IPC_TRANSPORT": "1",
        "SGLANG_USE_IPC_POOL_HANDLE_CACHE": "1",
    }
)

We additionally compile the DeepGEMM kernels as part of building the container Image. This can take tens of minutes the first time, but only takes seconds when reading from cache.

def compile_deep_gemm():
    import os
    import subprocess

    if int(os.environ.get("SGLANG_ENABLE_JIT_DEEPGEMM", "1")):
        subprocess.run(
            f"python3 -m sglang.compile_deep_gemm --model-path {MODEL_NAME} --revision {MODEL_REVISION} --tp {N_GPUS}",
            shell=True,
            check=True,
        )


sglang_image = sglang_image.run_function(
    compile_deep_gemm,
    volumes={DG_CACHE_PATH: DG_CACHE_VOL, HF_CACHE_PATH: HF_CACHE_VOL},
    gpu=GPU,
)

Define the inference server

With environment setup out of the way, we’re ready to define our inference server. We use a Modal Cls so that container startup logic, which lives in modal.enter-decorated methods, stays separate from input processing. The server is exposed as a Modal HTTP Server backed by a proxy in us-east. We also handle clean teardown of the server in a modal.exit method.

ROUTING_REGION = "us-east"

PORT = 8000
TARGET_INPUTS = 10

app = modal.App(name="example-sglang-vlm")


@app.server(
    image=sglang_image,
    gpu=GPU,
    volumes={HF_CACHE_PATH: HF_CACHE_VOL, DG_CACHE_PATH: DG_CACHE_VOL},
    startup_timeout=15 * MINUTES,
    port=PORT,
    routing_region=ROUTING_REGION,
    target_concurrency=TARGET_INPUTS,
    unauthenticated=True,
)
class VlmServer:
    @modal.enter()
    def startup(self):
        self.process = _start_server()
        wait_ready(self.process)
        warmup()

    @modal.exit()
    def stop(self):
        self.process.terminate()
        self.process.wait()

Setting up the server

The server configuration is based on the information in the SGLang Cookbook. It includes speculative decoding via multi-token prediction for lower latency at low to moderate concurrency. For more on optimizing the performance of VLMs and LLMs, see this guide.

def _start_server() -> subprocess.Popen:
    """Start SGLang server in a subprocess"""
    cmd = [
        "python",
        "-m",
        "sglang.launch_server",
        "--model-path",
        MODEL_NAME,
        "--revision",
        MODEL_REVISION,
        "--served-model-name",
        MODEL_NAME,
        "--host",
        "0.0.0.0",
        "--port",
        f"{PORT}",
        "--tp",
        f"{N_GPUS}",
        "--cuda-graph-max-bs",
        f"{TARGET_INPUTS * 2}",
        "--enable-metrics",
        "--mem-fraction-static",
        "0.8",
        "--context-length",
        "131072",
        "--mamba-scheduler-strategy",
        "extra_buffer",
        "--reasoning-parser",
        "qwen3",
        "--tool-call-parser",
        "qwen3_coder",
        "--speculative-algo",
        "EAGLE",
        "--speculative-num-steps",
        "3",
        "--speculative-eagle-topk",
        "1",
        "--speculative-num-draft-tokens",
        "4",
    ]

    print("Starting SGLang server with command:")
    print(*cmd)

    # Pass the argument list directly so signals reach the server process, not a wrapping shell.
    return subprocess.Popen(cmd, start_new_session=True)
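
To build intuition for why speculative decoding lowers latency, here is a toy model. It is not SGLang's implementation. It assumes each of the three drafted tokens is accepted independently with probability `p` and that verification stops at the first rejection. Because the target model always contributes one token of its own per verification step, the expected number of tokens per step is 1 + p + p^2 + p^3. The acceptance rates in the loop are illustrative values, not measurements.

```python
def expected_tokens_per_step(p: float, num_draft_steps: int = 3) -> float:
    """Expected tokens emitted per target forward pass for a single draft chain.

    Toy model: the i-th drafted token survives only if it and every earlier
    draft were accepted (probability p**i). The i = 0 term is the one token
    the target model always produces itself.
    """
    return sum(p**i for i in range(num_draft_steps + 1))


for p in (0.0, 0.5, 0.8, 1.0):  # illustrative acceptance rates, not measured
    print(f"acceptance={p:.1f} -> {expected_tokens_per_step(p):.2f} tokens/step")
```

In this simplified model the speedup is bounded by the number of draft steps plus one, and it only pays off when the acceptance rate is high enough to offset the cost of drafting and verification.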

Before returning from our modal.enter method, we wait for the server to finish spinning up, which can take several minutes.

def wait_ready(process: subprocess.Popen, timeout: int = 10 * MINUTES):
    import requests

    deadline = time.time() + timeout
    while time.time() < deadline:
        # Fail fast if the server process has exited; a dead process will never become healthy.
        check_running(process)
        try:
            # Bound each probe so a hung connection cannot stall past the deadline.
            requests.get(f"http://127.0.0.1:{PORT}/health", timeout=5).raise_for_status()
            return
        except (
            requests.exceptions.ConnectionError,
            requests.exceptions.Timeout,
            requests.exceptions.HTTPError,
        ):
            time.sleep(5)
    raise TimeoutError(f"SGLang server not ready within {timeout} seconds")


def check_running(p: subprocess.Popen):
    if (rc := p.poll()) is not None:
        raise subprocess.CalledProcessError(rc, cmd=p.args)

We also send a few warmup requests to ensure that the server is fully ready to service requests. Otherwise, the first few requests to a new replica might be substantially slower.

SAMPLE_PAYLOAD = {
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://modal-cdn.com/golden-gate-bridge.jpg"
                    },
                },
                {"type": "text", "text": "What is this?"},
            ],
        }
    ],
    "max_tokens": 16,
}


def warmup():
    import requests

    for _ in range(2):
        requests.post(
            f"http://127.0.0.1:{PORT}/v1/chat/completions",
            json=SAMPLE_PAYLOAD,
            timeout=120,
        ).raise_for_status()
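
The sample payload references an image by public URL. The OpenAI-compatible chat format also allows images to be inlined as base64 `data:` URLs, which can be handy for local files. The helpers below are a hypothetical sketch: the function names and the example file path are not part of the original example, and the sketch assumes the server accepts data URLs in the `image_url` field.

```python
import base64


def image_bytes_to_data_url(data: bytes, mime_type: str = "image/jpeg") -> str:
    """Encode raw image bytes as a base64 data URL (hypothetical helper)."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def build_image_message(image_url: str, question: str) -> dict:
    """Build a user message with the same shape as the one in SAMPLE_PAYLOAD."""
    return {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": image_url}},
            {"type": "text", "text": question},
        ],
    }


# Usage with a local file (the path is a placeholder):
#   from pathlib import Path
#   url = image_bytes_to_data_url(Path("photo.jpg").read_bytes())
#   payload = {"messages": [build_image_message(url, "What is this?")], "max_tokens": 16}
```

Inlining trades a larger request body for independence from a publicly reachable image host.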

Test the server

We can test the entire server creation, from soup to nuts, by running the file with modal run. We just need to add a local_entrypoint that exercises the server.

@app.local_entrypoint()
async def main():
    url = await VlmServer.get_url.aio()

    messages = SAMPLE_PAYLOAD["messages"]
    print(f"Sending image at {messages[0]['content'][0]['image_url']['url']} to the server")

    await probe(url, messages, timeout=10 * MINUTES)

The client logic is normally handled by your preferred interface, such as a coding agent harness like OpenCode or a chat UI in the browser. Our server uses the standard OpenAI-compatible API format, so most of these clients should work out of the box. Below, we replicate only the minimum client functionality we need for a test.

Note that in the probe we include a Modal-Session-Id header for sticky routing between Modal HTTP Server replicas, and we retry on the 503s returned when no replicas are available.

async def probe(url: str, messages: list, timeout: int = 25 * MINUTES):
    headers = {"Modal-Session-Id": "test-session"}
    deadline = time.time() + timeout

    async with aiohttp.ClientSession(base_url=url, headers=headers) as session:
        while time.time() < deadline:
            try:
                await _send_request_streaming(session, messages)
                return
            except asyncio.TimeoutError:
                await asyncio.sleep(1)
            except aiohttp.client_exceptions.ClientResponseError as e:
                if e.status == 503:
                    await asyncio.sleep(1)
                    continue
                raise
    raise TimeoutError(f"No response from server within {timeout} seconds")


async def _send_request_streaming(
    session: aiohttp.ClientSession, messages: list, timeout: int | None = None
) -> str:
    payload = {
        "messages": messages,
        "stream": True,
        "top_k": 20,
    }
    headers = {"Accept": "text/event-stream"}

    async with session.post(
        "/v1/chat/completions", json=payload, headers=headers, timeout=timeout
    ) as resp:
        resp.raise_for_status()
        full_text = ""

        chunk = ""
        async for raw in resp.content:
            line = raw.decode("utf-8", errors="ignore").strip()
            if not line:
                continue

            if not line.startswith("data:"):
                continue

            data = line[len("data:") :].strip()
            if data == "[DONE]":
                break

            try:
                evt = json.loads(data)
            except json.JSONDecodeError:
                continue

            delta = (evt.get("choices") or [{}])[0].get("delta") or {}
            chunk += delta.get("content") or delta.get("reasoning_content") or ""

            if chunk and ("." in chunk or "\n" in chunk):
                print(chunk, end="", flush=True)
                full_text += chunk
                chunk = ""

        if chunk:
            print(chunk, end="", flush=True)
            full_text += chunk

        print()
        return full_text
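
The parsing loop above handles OpenAI-style server-sent events: each event arrives as a `data:` line carrying a JSON chunk, and the stream ends with `data: [DONE]`. As a sketch for offline testing, the same logic can be isolated into a pure function that needs no network access. The name `extract_text` and the sample lines are made up for illustration.

```python
import json
from typing import Iterable


def extract_text(lines: Iterable[str]) -> str:
    """Concatenate content and reasoning deltas from SSE lines (hypothetical helper)."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank lines and any non-data SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        try:
            event = json.loads(data)
        except json.JSONDecodeError:
            continue  # ignore malformed chunks, as the loop above does
        delta = (event.get("choices") or [{}])[0].get("delta") or {}
        parts.append(delta.get("content") or delta.get("reasoning_content") or "")
    return "".join(parts)


# Made-up sample stream for illustration only.
sample = [
    'data: {"choices": [{"delta": {"reasoning_content": "Looks like a bridge. "}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "The Golden Gate Bridge."}}]}',
    "data: [DONE]",
]
print(extract_text(sample))
```

Keeping the parser separate from the transport makes it straightforward to unit test edge cases such as empty deltas or malformed chunks.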

You can kick off a test run with the command

modal run sglang_vlm.py

Deploy the server

When you’re ready to deploy the server, replace modal run with modal deploy:

modal deploy sglang_vlm.py
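
Once deployed, the server can be called from any HTTP client. The sketch below uses only the Python standard library to build a non-streaming request. `BASE_URL` is a placeholder for whatever URL your deployment reports, the session ID value is arbitrary, and the `max_tokens` value is an illustrative choice. Nothing is sent unless you call `send`.

```python
import json
import urllib.request

BASE_URL = "https://your-deployment.example.com"  # placeholder, not a real endpoint


def build_chat_request(base_url: str, image_url: str, question: str) -> urllib.request.Request:
    """Build a POST to the OpenAI-compatible chat completions route."""
    payload = {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": question},
                ],
            }
        ],
        "max_tokens": 256,  # illustrative limit
    }
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Modal-Session-Id": "my-session",  # arbitrary value for sticky routing
        },
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 120.0) -> str:
    """Send the request and return the assistant's reply text (empty if none)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"].get("content") or ""
```

A call would look like `send(build_chat_request(BASE_URL, "https://modal-cdn.com/golden-gate-bridge.jpg", "What is this?"))`. Because this is a reasoning model, a small token limit may be spent on reasoning before any final answer is produced, in which case the returned content can be empty.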