Serverless Ministral 3 with vLLM and Modal

In this example, we show how to serve Mistral’s Ministral 3 vision-language models on Modal.

The Ministral 3 model series performs competitively with the Qwen 3-VL model series on benchmarks (see model cards for details).

We also include instructions for cutting cold start times for long-running deployments by an order of magnitude using Modal’s CPU + GPU memory snapshots.

Set up the container image 

Our first order of business is to define the environment our server will run in: the container Image. We’ll use the vLLM inference server. vLLM can be installed with uv pip, since Modal provides the CUDA drivers.
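A minimal sketch of such an image, assuming a recent Python and leaving the vLLM version unpinned (pin exact versions for real deployments):

```python
import modal

vllm_image = (
    modal.Image.debian_slim(python_version="3.12")
    # vLLM ships its own CUDA libraries as wheels; Modal supplies the NVIDIA drivers
    .uv_pip_install("vllm")
)
```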

Download the Ministral weights 

We also need to download the model weights. We’ll retrieve them from the Hugging Face Hub.

To speed up the model load, we’ll toggle the HIGH_PERFORMANCE flag for Hugging Face’s Xet backend.
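A hedged sketch, extending the image above. HF_XET_HIGH_PERFORMANCE is the environment variable read by the hf_xet backend; the extra and value here are assumptions about this example's exact setup:

```python
vllm_image = vllm_image.uv_pip_install("huggingface_hub[hf_xet]").env(
    {"HF_XET_HIGH_PERFORMANCE": "1"}  # faster parallel downloads from the Hub
)
```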

The Ministral 3 model series contains a variety of models:

  • 3B, 8B, and 14B sizes
  • base models and instruction & reasoning fine-tuned models
  • BF16 and FP8 weight formats

All are available under the Apache 2.0 open source license.

We’ll use the FP8 instruct variant of the 8B model:
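The exact repository name isn't pinned in this writeup, so the constant below is a placeholder; substitute the FP8 instruct 8B repository from the Hugging Face Hub:

```python
# Placeholder: fill in the exact Hub repository name for the 8B Instruct FP8 variant.
MODEL_NAME = "mistralai/..."
```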

Native hardware support for FP8 formats in Tensor Cores is limited to the latest Streaming Multiprocessor architectures, like those of Modal’s Hopper H100/H200 and Blackwell B200 GPUs.

With 80 GB of VRAM, a single H100 GPU has enough room to hold the 8B FP8 model weights (~8 GB) and a very large KV cache. A single H100 is also enough to serve the 14B model in full precision, though with less room for the KV cache (still enough to handle the full sequence length).

Cache with Modal Volumes 

Modal Functions are serverless: when they aren’t being used, their underlying containers spin down, and all ephemeral resources (GPUs, memory, network connections, and local disks) are released.

We can preserve saved files by mounting a Modal Volume — a persistent, remote filesystem.

We’ll use two Volumes: one for weights from Hugging Face and one for compilation artifacts from vLLM.
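A sketch with illustrative Volume names, mounted at the default cache locations used by Hugging Face and vLLM:

```python
hf_cache_vol = modal.Volume.from_name("huggingface-cache", create_if_missing=True)
vllm_cache_vol = modal.Volume.from_name("vllm-cache", create_if_missing=True)

# mount points inside the container (the libraries' default cache directories)
volumes = {
    "/root/.cache/huggingface": hf_cache_vol,
    "/root/.cache/vllm": vllm_cache_vol,
}
```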

Serve Ministral 3 with vLLM 

We serve Ministral 3 on Modal by spinning up a Modal Function that acts as a web_server and launches a vLLM server in a subprocess (via the vllm serve command).
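The web-serving Function itself is assembled in the snapshot section below; here is a hedged sketch of the subprocess launch it relies on (flags beyond the model name are illustrative):

```python
import subprocess


def launch_vllm(port: int = 8000, extra_args: list[str] | None = None) -> subprocess.Popen:
    # Start vLLM's OpenAI-compatible server; it listens on `port` once ready.
    cmd = ["vllm", "serve", MODEL_NAME, "--host", "0.0.0.0", "--port", str(port)]
    cmd += extra_args or []
    return subprocess.Popen(cmd)
```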

Improve cold start time with snapshots 

Starting up a vLLM server can be slow — tens of seconds to minutes. Much of that time is spent on JIT compilation of inference code.

We can skip most of that work and reduce startup times by a factor of 10 using Modal’s memory snapshots, which serialize the contents of CPU and GPU memory.

This adds a fair amount of complexity to the code. If you’re looking for a minimal example, see our vllm_inference example.

We’ll need to set a few extra configuration values:
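A hedged sketch of those values; the names and defaults are assumptions chosen to match the prose below:

```python
VLLM_PORT = 8000
DEV_MODE = True  # exposes vLLM's /sleep and /wake_up endpoints

if DEV_MODE:
    vllm_image = vllm_image.env({"VLLM_SERVER_DEV_MODE": "1"})

# passed to `vllm serve` so the engine can be put into sleep mode
EXTRA_VLLM_ARGS = ["--enable-sleep-mode"]
```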

Setting the DEV_MODE flag allows us to use the sleep/wake_up endpoints to toggle the server in and out of “sleep mode”.

Sleep Mode helps with memory snapshotting. When the server is asleep, model weights are offloaded to CPU memory and the KV cache is emptied. For details, see the vLLM docs.

We’ll also need two helper functions. The first, wait_ready, busy-polls the server until it is live.

Once the server is live, we warm up inference with a few requests. This is important for capturing non-serializable JIT compilation artifacts, like CUDA graphs and some Torch compilation outputs, in our snapshot.
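A sketch of both helpers, assuming the standard library's urllib is enough for these local requests and that a handful of small chat completions suffices as a warmup:

```python
import json
import time
import urllib.request


def wait_ready(url: str, timeout: float = 15 * 60) -> None:
    # Busy-poll the health endpoint until the server answers.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{url}/health") as resp:
                if resp.status == 200:
                    return
        except Exception:
            pass
        time.sleep(1)
    raise TimeoutError("vLLM server did not become ready in time")


def warmup(url: str, n: int = 4) -> None:
    # Send a few small requests so JIT artifacts (CUDA graphs, some torch.compile
    # outputs) exist before the snapshot is taken. The payload is illustrative.
    payload = json.dumps(
        {
            "model": MODEL_NAME,
            "messages": [{"role": "user", "content": "Warm up."}],
            "max_tokens": 8,
        }
    ).encode()
    for _ in range(n):
        req = urllib.request.Request(
            f"{url}/v1/chat/completions",
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req).read()
```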

Define the server 

We construct our web-serving Modal Function by decorating a regular Python class. The decorators include a number of configuration options for deployment, including resources like GPUs and Volumes and timeouts on container scaledown. You can read more about these options in the Modal docs.

We control memory snapshotting and container start behavior by decorating the methods of the class.

We start the server, warm it up, and then put it to sleep in the start method. This method has the modal.enter decorator to ensure it runs when a new container starts, and we pass snap=True to turn on memory snapshotting.

The following method, wake_up, calls the wake_up endpoint and then waits for the server to be ready. It is run after the start method because it is defined later in the code and also has the modal.enter decorator. It has snap=False so that it isn’t included in the snapshot.

Finally, we connect the vLLM server to the Internet using the modal.web_server decorator.
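Putting it together, a hedged sketch of the class. The app name is taken from the URL shown below; the GPU snapshot option is labeled experimental in Modal and may change, and the timeouts are illustrative:

```python
import urllib.request

app = modal.App("example-ministral3-inference")


@app.cls(
    image=vllm_image,
    gpu="H100",
    volumes=volumes,
    enable_memory_snapshot=True,  # snapshot CPU memory
    experimental_options={"enable_gpu_snapshot": True},  # snapshot GPU memory too
    scaledown_window=10 * 60,  # keep idle containers around for ten minutes
)
class Server:
    @modal.enter(snap=True)
    def start(self):
        # Runs on container start and is captured in the snapshot:
        # boot vLLM, warm it up, then put it to sleep before the snapshot is taken.
        self.url = f"http://127.0.0.1:{VLLM_PORT}"
        launch_vllm(VLLM_PORT, EXTRA_VLLM_ARGS)
        wait_ready(self.url)
        warmup(self.url)
        urllib.request.urlopen(
            urllib.request.Request(f"{self.url}/sleep?level=1", method="POST")
        )

    @modal.enter(snap=False)
    def wake_up(self):
        # Runs after the snapshot is restored: wake the engine back up.
        urllib.request.urlopen(
            urllib.request.Request(f"{self.url}/wake_up", method="POST")
        )
        wait_ready(self.url)

    @modal.web_server(port=VLLM_PORT, startup_timeout=15 * 60)
    def serve(self):
        # vLLM is already listening on VLLM_PORT; nothing else to start here.
        pass
```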

Deploy the server 

To deploy the API on Modal, just run
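assuming the code lives in a file named ministral3_inference.py (adjust the filename to match yours):

```shell
modal deploy ministral3_inference.py
```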

This will create a new app on Modal, build the container image for it if it hasn’t been built yet, and deploy the app.

Interact with the server 

Once it is deployed, you’ll see a URL appear in the command line, something like https://your-workspace-name--example-ministral3-inference-serve.modal.run.

You can find interactive Swagger UI docs at the /docs route of that URL, i.e. https://your-workspace-name--example-ministral3-inference-serve.modal.run/docs. These docs describe each route, indicate the expected inputs and outputs, and translate requests into curl commands.

For simple routes like /health, which checks whether the server is responding, you can even send a request directly from the docs.
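From the command line, the same healthcheck looks like this (substitute your deployment's URL):

```shell
curl https://your-workspace-name--example-ministral3-inference-serve.modal.run/health
```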

To interact with the API programmatically in Python, we recommend the openai library.
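A sketch of such a client, assuming the server is deployed at the URL above and does not check API keys:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://your-workspace-name--example-ministral3-inference-serve.modal.run/v1",
    api_key="not-needed",  # assumption: the example server does not validate API keys
)

response = client.chat.completions.create(
    model="mistralai/...",  # the Hub repository name the server was launched with
    messages=[{"role": "user", "content": "In one sentence, what can Ministral 3 do?"}],
)
print(response.choices[0].message.content)
```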

Test the server 

To make it easier to test the server setup, we also include a local_entrypoint that does a healthcheck and then hits the server.

If you execute the command
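(again assuming the filename ministral3_inference.py)

```shell
modal run ministral3_inference.py
```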

a fresh replica of the server will be spun up on Modal while the code below executes on your local machine.

Think of this like writing simple tests inside of the if __name__ == "__main__" block of a Python script, but for cloud deployments!
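A sketch of such a local_entrypoint, assuming the Server class above and that the method's URL can be looked up with get_web_url:

```python
@app.local_entrypoint()
def test():
    # Healthcheck, then one chat completion against the deployed endpoint.
    url = Server().serve.get_web_url()  # assumption: get_web_url is available on this method

    with urllib.request.urlopen(f"{url}/health") as resp:
        assert resp.status == 200, "healthcheck failed"

    payload = json.dumps(
        {
            "model": MODEL_NAME,
            "messages": [{"role": "user", "content": "Say hello in five words."}],
        }
    ).encode()
    req = urllib.request.Request(
        f"{url}/v1/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    print(json.load(urllib.request.urlopen(req)))
```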

Test memory snapshotting 

Using modal run creates an ephemeral Modal App, rather than a deployed Modal App. Ephemeral Modal Apps are short-lived, so they turn off snapshotting.

To test the memory snapshot version of the server, first deploy it with modal deploy and then hit it with a client.

You should observe startup improvements after a handful of cold starts (usually fewer than five). If you want to see the speedup during a test, we recommend heading to the deployed App in your Modal dashboard and manually stopping containers after they have served a request.

You can use the client code below to test the endpoint. It can be run with the command
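(the filename is an assumption; substitute the actual name of your client script)

```shell
python client.py
```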