Fast inference with vLLM (Mixtral 8x7B)

In this example, we show how to run basic inference, using vLLM to take advantage of PagedAttention, which speeds up sequential inferences with optimized key-value caching.
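
To give a rough intuition for the paging idea behind PagedAttention, the sketch below is a toy block allocator in plain Python. It is not vLLM's implementation; the class name, block size, and methods are all hypothetical. It only illustrates how a sequence's key-value cache can live in fixed-size blocks that are allocated on demand and tracked by a per-sequence block table, rather than being reserved as one contiguous maximum-length buffer.

```python
class ToyPagedKVCache:
    """Toy model of a paged KV cache: fixed-size blocks plus per-sequence block tables."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids not yet in use
        self.block_tables = {}  # sequence id -> list of physical block ids
        self.lengths = {}  # sequence id -> number of cached tokens

    def append_token(self, seq_id: str) -> None:
        """Record one more cached token, allocating a new block only when needed."""
        length = self.lengths.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % self.block_size == 0:  # first token, or the last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV cache blocks")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = ToyPagedKVCache(num_blocks=8, block_size=4)
for _ in range(5):
    cache.append_token("request-a")  # 5 tokens need 2 blocks of size 4
for _ in range(3):
    cache.append_token("request-b")  # 3 tokens fit in 1 block
blocks_in_use = sum(len(t) for t in cache.block_tables.values())
print(blocks_in_use, len(cache.free_blocks))
```

Because blocks are handed out only as sequences grow, short requests do not tie up memory sized for the longest possible output, which is what leaves room for larger batches.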

We are running the Mixtral 8x7B Instruct model here, which is a mixture-of-experts model finetuned for conversation. You can expect ~3 minute cold starts. For a single request, the throughput is over 50 tokens/second. The larger the batch of prompts, the higher the throughput (up to hundreds of tokens per second).


First we import the components we need from modal.

import os
import time

import modal

MODEL_DIR = "/model"
MODEL_NAME = "mistralai/Mixtral-8x7B-Instruct-v0.1"
GPU_CONFIG = modal.gpu.A100(memory=80, count=2)

Define a container image

We want to create a Modal image which has the model weights pre-saved to a directory. The benefit of this is that the container no longer has to re-download the model from Hugging Face. Instead, it will take advantage of Modal’s internal filesystem for faster cold starts.

Download the weights

We can download the model to a particular directory using the Hugging Face utility function snapshot_download.

Mixtral is beefy, at nearly 100 GB in safetensors format, so this can take some time — at least a few minutes.
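
As a rough sanity check on that figure, the arithmetic below assumes about 46.7 billion total parameters (the count Mistral AI reports for Mixtral 8x7B) stored as 16-bit values, and compares the result with the combined memory of the two 80 GB GPUs configured above. The variable names are illustrative, and the estimate ignores activations and the KV cache.

```python
TOTAL_PARAMS = 46.7e9  # assumed total parameter count for Mixtral 8x7B
BYTES_PER_PARAM = 2  # assumption: 16-bit (bf16/fp16) weights
GPU_MEMORY_GB = 80
GPU_COUNT = 2

weights_gb = TOTAL_PARAMS * BYTES_PER_PARAM / 1e9
total_gpu_gb = GPU_MEMORY_GB * GPU_COUNT
headroom_gb = total_gpu_gb - weights_gb  # what is left for KV cache and activations

print(f"weights: ~{weights_gb:.0f} GB")
print(f"GPU memory: {total_gpu_gb} GB, headroom: ~{headroom_gb:.0f} GB")
```

Under these assumptions the weights alone exceed a single 80 GB GPU, which is consistent with the example requesting two.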

Tip: avoid using global variables in this function. Changes to code outside this function will not be detected and the download step will not re-run.

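def download_model_to_image(model_dir, model_name):
    from huggingface_hub import snapshot_download
    from transformers.utils import move_cache

    os.makedirs(model_dir, exist_ok=True)

    # Fetch the model snapshot into model_dir, skipping non-safetensors weights
    snapshot_download(
        model_name,
        local_dir=model_dir,
        ignore_patterns=["*.pt", "*.bin"],  # Using safetensors
    )
    move_cache()  # migrate any old-format Hugging Face cache layout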
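
To see what the ignore_patterns filter is meant to do, here is a small stand-alone sketch that applies the same glob patterns to a made-up list of repository files using Python's fnmatch module. The file names are hypothetical, and the matching logic only approximates what huggingface_hub does internally.

```python
from fnmatch import fnmatch

IGNORE_PATTERNS = ["*.pt", "*.bin"]  # same patterns as in the download function

# Hypothetical repository listing, for illustration only
repo_files = [
    "config.json",
    "tokenizer.model",
    "model-00001-of-00002.safetensors",
    "model-00002-of-00002.safetensors",
    "pytorch_model-00001-of-00002.bin",
    "consolidated.00.pt",
]


def is_ignored(filename: str) -> bool:
    """Return True if the file matches any ignore pattern."""
    return any(fnmatch(filename, pattern) for pattern in IGNORE_PATTERNS)


kept = [f for f in repo_files if not is_ignored(f)]
print(kept)
```

Skipping the duplicate .pt and .bin copies of the weights matters here because each full copy of the model is close to 100 GB.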

Image definition

We’ll start from a Docker Hub image recommended by vLLM, and use run_function to run the function defined above to ensure the weights of the model are saved within the container image.

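vllm_image = (
    modal.Image.from_registry(...)  # base image tag and pip installs are elided in the source
    .env({"HF_HUB_ENABLE_HF_TRANSFER": "1"})
    .run_function(
        download_model_to_image,
        timeout=60 * 20,
        kwargs={"model_dir": MODEL_DIR, "model_name": MODEL_NAME},
    )
)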

stub = modal.Stub("example-vllm-mixtral")

The model class

The inference function is best represented with Modal’s class syntax and the @enter decorator. This enables us to load the model into memory just once every time a container starts up, and keep it cached on the GPU for each subsequent invocation of the function.
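
The lifecycle described above can be imitated in plain Python without Modal. In the sketch below, FakeContainer, load_model, and handle are hypothetical names; the point is only that the expensive setup runs once per container start, while every later call reuses the loaded state.

```python
import time


class FakeContainer:
    """Stand-in for a container that loads a model once and then serves many calls."""

    def __init__(self):
        self.load_count = 0
        self.model = None

    def load_model(self):
        """Expensive one-time setup (plays the role of an enter hook)."""
        time.sleep(0.01)  # pretend to load weights
        self.load_count += 1
        self.model = str.upper  # trivial "model": upper-cases its input

    def handle(self, prompt: str) -> str:
        """Per-request work; assumes load_model() already ran."""
        return self.model(prompt)


container = FakeContainer()
container.load_model()  # runs once when the container starts
replies = [container.handle(p) for p in ["hello", "world", "again"]]
print(replies, container.load_count)
```

Three requests are served but the load counter stays at one, which is the behavior the class syntax is meant to provide for the real engine.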

The vLLM library allows the code to remain quite clean. We do have to patch the multi-GPU setup due to issues with Ray.

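@stub.cls(
    gpu=GPU_CONFIG,
    image=vllm_image,
    timeout=60 * 10,
    container_idle_timeout=60 * 10,
)
class Model:
    @modal.enter()  # runs once when the container starts
    def start_engine(self):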
        from vllm.engine.arg_utils import AsyncEngineArgs
        from vllm.engine.async_llm_engine import AsyncLLMEngine

        print("🥶 cold starting inference")
        start = time.monotonic_ns()

        engine_args = AsyncEngineArgs(
            model=MODEL_DIR,  # weights baked into the image
            tensor_parallel_size=GPU_CONFIG.count,  # shard across all GPUs
            enforce_eager=False,  # capture the graph for faster inference, but slower cold starts
            disable_log_stats=True,  # disable logging so we can stream tokens
        )
        self.template = "<s> [INST] {user} [/INST] "

        # this can take some time!
        self.engine = AsyncLLMEngine.from_engine_args(engine_args)
        duration_s = (time.monotonic_ns() - start) / 1e9
        print(f"🏎️ engine started in {duration_s:.0f}s")

    @modal.method()
    async def completion_stream(self, user_question):
        from vllm import SamplingParams
        from vllm.utils import random_uuid

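        # Sampling settings are elided in the source; vLLM defaults are used here
        sampling_params = SamplingParams()

        request_id = random_uuid()
        result_generator = self.engine.generate(
            self.template.format(user=user_question),
            sampling_params,
            request_id,
        )
        index, num_tokens = 0, 0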
        start = time.monotonic_ns()
        async for output in result_generator:
            if (
                output.outputs[0].text
                and "\ufffd" == output.outputs[0].text[-1]
            ):
                # skip partial outputs that end in an incomplete character
                continue
            text_delta = output.outputs[0].text[index:]
            index = len(output.outputs[0].text)
            num_tokens = len(output.outputs[0].token_ids)

            yield text_delta
        duration_s = (time.monotonic_ns() - start) / 1e9

        yield (
            f"\n\tGenerated {num_tokens} tokens from {MODEL_NAME} in {duration_s:.1f}s,"
            f" throughput = {num_tokens / duration_s:.0f} tokens/second on {GPU_CONFIG}.\n"
        )

    @modal.exit()
    def stop_engine(self):
        if GPU_CONFIG.count > 1:
            import ray

            ray.shutdown()
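
The delta logic in completion_stream can be exercised without a GPU. The sketch below replaces the engine's output stream with a hand-written list of cumulative texts (an assumption made purely for illustration) and reproduces two ideas: skip any snapshot that ends in the Unicode replacement character, which signals a partially decoded multi-byte character, and otherwise yield only the text added since the last yield. The function name stream_deltas is hypothetical.

```python
def stream_deltas(cumulative_texts):
    """Yield only the new text from a sequence of growing snapshots."""
    index = 0
    for text in cumulative_texts:
        # Skip snapshots that end mid-character; a later one will be complete
        if text and text[-1] == "\ufffd":
            continue
        yield text[index:]
        index = len(text)


# Hypothetical snapshots: the third ends in a partially decoded character
snapshots = ["Hel", "Hello", "Hello \ufffd", "Hello 🌍", "Hello 🌍!"]
deltas = list(stream_deltas(snapshots))
print(deltas)
assert "".join(deltas) == snapshots[-1]
```

Because the index only advances when a snapshot is actually yielded, the skipped snapshot's text is picked up in full by the next delta, so nothing is lost or duplicated.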


Run the model

We define a local_entrypoint to call our remote function sequentially for a list of inputs. You can run this locally with modal run -q, passing the path to this file. The -q flag enables the text to stream in your local terminal.

@stub.local_entrypoint()
def main():
    questions = [
        "Implement a Python function to compute the Fibonacci numbers.",
        "What is the fable involving a fox and grapes?",
        "What were the major contributing factors to the fall of the Roman Empire?",
        "Describe the city of the future, considering advances in technology, environmental changes, and societal shifts.",
        "What is the product of 9 and 8?",
        "Who was Emperor Norton I, and what was his significance in San Francisco's history?",
    ]
    model = Model()
    for question in questions:
        print("Sending new request:", question, "\n\n")
        for text in model.completion_stream.remote_gen(question):
            print(text, end="", flush=text.endswith("\n"))

Deploy and invoke the model

Once we deploy this model with modal deploy, we can invoke inference from other apps, sharing the same pool of GPU containers with all other apps we might need.

$ python
>>> import modal
>>> f = modal.Function.lookup("example-vllm-mixtral", "Model.completion_stream")
>>> for text in f.remote_gen("What is the story about the fox and grapes?"):
...     print(text, end="")
The story about the fox and grapes ...

Coupling a frontend web application

We can stream inference from a FastAPI backend, also deployed on Modal.

You can try our deployment here.

from pathlib import Path

from modal import Mount, asgi_app

frontend_path = Path(__file__).parent.parent / "llm-frontend"

@stub.function(
    mounts=[Mount.from_local_dir(frontend_path, remote_path="/assets")],
    timeout=60 * 10,
)
@asgi_app()
def app():
    import json

    import fastapi
    import fastapi.staticfiles
    from fastapi.responses import StreamingResponse

    web_app = fastapi.FastAPI()

    @web_app.get("/stats")  # route path assumed; the decorator is missing from the source
    async def stats():
        stats = await Model().completion_stream.get_current_stats.aio()
        return {
            "backlog": stats.backlog,
            "num_total_runners": stats.num_total_runners,
            "model": MODEL_NAME + " (vLLM)",
        }

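    @web_app.get("/completion/{question}")  # route path assumed; the decorator is missing from the source
    async def completion(question: str):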
        from urllib.parse import unquote

        async def generate():
            async for text in Model().completion_stream.remote_gen.aio(
                unquote(question)  # decode the URL-encoded question
            ):
                yield f"data: {json.dumps(dict(text=text), ensure_ascii=False)}\n\n"

        return StreamingResponse(generate(), media_type="text/event-stream")

    # Serve the static frontend from the mounted assets directory
    web_app.mount(
        "/", fastapi.staticfiles.StaticFiles(directory="/assets", html=True)
    )
    return web_app
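
The streaming endpoint emits server-sent events: one data: line per chunk, followed by a blank line. The sketch below shows that wire format round-tripping in plain Python. format_event mirrors the f-string used in the endpoint, while parse_events is a hypothetical minimal client-side parser and is not part of the frontend referenced here.

```python
import json


def format_event(text: str) -> str:
    """Encode one text chunk as a server-sent event."""
    return f"data: {json.dumps(dict(text=text), ensure_ascii=False)}\n\n"


def parse_events(stream: str) -> list:
    """Decode a concatenated event stream back into text chunks."""
    chunks = []
    for event in stream.split("\n\n"):
        if event.startswith("data: "):
            chunks.append(json.loads(event[len("data: "):])["text"])
    return chunks


chunks = ["The fox", " saw the grapes\n", "and left."]
wire = "".join(format_event(c) for c in chunks)
print(wire)
assert parse_events(wire) == chunks
```

Wrapping each chunk in JSON keeps newlines inside the generated text escaped, so they cannot be mistaken for the blank line that terminates an event.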