Serve the Qwen3.5 Vision-Language Model with SGLang
Vision-Language Models (VLMs) are like LLMs with eyes: they can generate text based not just on other text, but on images as well.
This example shows how to serve a VLM on Modal using the SGLang library with an OpenAI-compatible API server.
Setup and container image definition
First, we import our global dependencies and define constants.
To define the container Image with our server’s dependencies, we build on the official SGLang Docker image, which ships with CUDA 12.9.
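A minimal sketch of that Image definition, assuming the `lmsysorg/sglang` registry image (the `latest` tag below is a placeholder; pin an exact tag built against CUDA 12.9 in practice):

```python
import modal

# Build on the official SGLang Docker image. The tag is a placeholder;
# pin a specific version from the lmsysorg/sglang registry in practice.
vlm_image = modal.Image.from_registry(
    "lmsysorg/sglang:latest"
).env(
    {"HF_HUB_ENABLE_HF_TRANSFER": "1"}  # faster weight downloads (assumption)
)
```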
Configure the model
Qwen3.5-35B-A3B-FP8 is a vision-language reasoning foundation model with 35B total parameters, of which only 3B are activated for each token in a forward pass. We use the 8-bit floating point (FP8) quantized version of the model for faster cold starts and faster inference, with negligible differences in behavior.
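As a sketch, the model configuration can be captured in a pair of constants. The Hugging Face repo id below is an assumption inferred from the model name; confirm it on the Hub before use:

```python
# Model configuration. The repo id is an assumption based on the model name;
# verify it on the Hugging Face Hub.
MODEL_PATH = "Qwen/Qwen3.5-35B-A3B-FP8"
MODEL_REVISION = "main"  # pin a specific commit hash for reproducible deploys
```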
Configure GPU
We use a single H100 GPU. The 35 GB of model weights fits comfortably in this GPU’s 80 GB of high-bandwidth memory.
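In code, that choice is just a couple of constants (names here are illustrative):

```python
# A single H100 suffices: ~35 GB of FP8 weights vs. 80 GB of HBM.
GPU_CONFIG = "H100"
N_GPU = 1
```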
Caching in Modal Volumes
Modal Apps typically cache some artifacts in a Modal Volume for faster cold starts. Here, we cache the model weights and the JIT-compiled DeepGEMM kernels.
We additionally compile the DeepGEMM kernels as part of building the container Image. This can take tens of minutes the first time, but only takes seconds when reading from cache.
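A sketch of the cache setup, with illustrative Volume names and mount paths (the DeepGEMM cache path in particular is an assumption):

```python
import modal

# Cache model weights and JIT-compiled DeepGEMM kernels across cold starts.
# Volume names and mount paths are illustrative.
hf_cache_vol = modal.Volume.from_name("huggingface-cache", create_if_missing=True)
kernel_cache_vol = modal.Volume.from_name("deepgemm-cache", create_if_missing=True)

volumes = {
    "/root/.cache/huggingface": hf_cache_vol,  # model weights
    "/root/.cache/deep_gemm": kernel_cache_vol,  # compiled kernels (path assumed)
}
```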
Define the inference server
With environment setup out of the way, we’re ready to define our inference server.
We use a Modal Cls to separate container startup logic (in modal.enter-decorated methods) from input processing.
We use a Modal HTTP Server to create a low-latency edge deployment in the US, served by a proxy in us-east.
We also handle clean teardown of the server in a modal.exit method.
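The overall shape of the class can be sketched as follows. This is a skeleton, not the full example: the app name, flags, and timeouts are illustrative, and we use the modal.web_server decorator here to stand in for the HTTP Server setup described above:

```python
import subprocess

import modal

app = modal.App("qwen-vlm-sglang")  # illustrative app name

@app.cls(gpu="H100")
class SGLangServer:
    @modal.enter()
    def start(self):
        # Launch SGLang's OpenAI-compatible server as a subprocess.
        # Flags here are the bare minimum and are illustrative.
        self.proc = subprocess.Popen(
            [
                "python", "-m", "sglang.launch_server",
                "--model-path", "Qwen/Qwen3.5-35B-A3B-FP8",
                "--port", "8000",
            ]
        )

    @modal.web_server(port=8000, startup_timeout=10 * 60)
    def serve(self):
        # Modal proxies HTTP traffic to the port the subprocess listens on.
        pass

    @modal.exit()
    def stop(self):
        # Clean teardown: stop the SGLang subprocess when the container exits.
        self.proc.terminate()
        self.proc.wait()
```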
Setting up the server
The server configuration is based on the information in the Hugging Face repo. It includes speculative decoding via multi-token prediction for improved performance at low to moderate concurrency. For more on optimizing the performance of VLMs and LLMs, see this guide.
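The launch command might look like the following. The speculative-decoding flag is an assumption about how SGLang exposes multi-token prediction; check the flags against the Hugging Face repo and SGLang’s server-argument documentation before use:

```shell
# Illustrative launch command; verify every flag against SGLang's docs
# and the model card before relying on it.
python -m sglang.launch_server \
  --model-path Qwen/Qwen3.5-35B-A3B-FP8 \
  --port 8000 \
  --tp 1 \
  --speculative-algorithm EAGLE  # assumption: MTP-style speculative decoding
```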
Before returning from our modal.enter method,
we wait for the server to finish spinning up, which can take several minutes.
We also send a few warmup requests to ensure that the server is fully ready to service requests — otherwise the first few requests to a new replica might be substantially slower.
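The wait step can be implemented as a simple polling helper, sketched below with the standard library; the health-check URL and timeouts are assumptions:

```python
import time
import urllib.error
import urllib.request


def wait_for_server(url: str, timeout: float = 600.0, poll: float = 1.0) -> None:
    """Block until a GET on `url` returns HTTP 200, or raise TimeoutError."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=poll) as resp:
                if resp.status == 200:
                    return
        except OSError:  # connection refused, timeout, URLError, etc.
            pass
        time.sleep(poll)
    raise TimeoutError(f"server at {url} not ready within {timeout}s")
```

In the modal.enter method, this would be called with the server’s local health endpoint (for example, `http://localhost:8000/health` on SGLang) before sending the warmup requests.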
Test the server
We can test the entire server creation, from soup to nuts,
by running the file with modal run.
We just need to add a local_entrypoint that exercises the server.
The client logic is normally handled by your preferred interface, such as a coding agent harness like OpenCode or a chat UI in the browser. Our server uses the standard OpenAI-compatible API format, so most such clients should work out of the box. Below, we replicate the minimal client functionality we need for a test.
Note that in the probe we include a Modal-Session-Id header for sticky routing
between Modal HTTP Server replicas and ignore 503s that occur
when no Modal HTTP Server replicas are available.
You can kick off a test run with the command
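(The filename below is illustrative; substitute the name of your script.)

```shell
modal run serve_vlm.py
```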
Deploy the server
When you’re ready to deploy the server,
replace modal run with modal deploy:
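(Again, the filename is illustrative.)

```shell
modal deploy serve_vlm.py
```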