January 31, 2025 · 5 minute read
How to run DeepSeek-R1 Distilled Qwen-32B with vLLM on Modal
Yiren Lu (@YirenLu), Solutions Engineer

What is DeepSeek-R1 Distilled Qwen-32B?

DeepSeek-R1 is an open-source, state-of-the-art “reasoning” LLM that is competitive with OpenAI’s o1 and other top closed LLMs.

There is a tremendous amount of excitement around DeepSeek-R1, and since its weights are available on Hugging Face, theoretically, anyone can run it.

However, the full, unquantized DeepSeek-R1 model, at 671B parameters, is far too large to run on consumer-grade GPUs.

To address this, DeepSeek has created a series of distilled models, including DeepSeek-R1 Distilled Qwen-32B.

What are distilled models?

Distilled models are smaller, more efficient AI models that are trained to replicate the behavior of larger “teacher” models while using fewer computational resources.

In particular, the DeepSeek-R1 distilled models were created by fine-tuning smaller base models (e.g., Alibaba’s Qwen, in the case of DeepSeek-R1 Distilled Qwen-32B) using samples of reasoning data generated by DeepSeek-R1.

On reasoning, math, and coding benchmarks, DeepSeek-R1 Distilled Qwen-32B performs on par with GPT-4o, o1-mini, and Claude 3.5 Sonnet.
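
To make the idea concrete, here is a very rough, hypothetical sketch of distillation-style supervised fine-tuning: a small “student” model is trained on reasoning samples that the larger “teacher” (DeepSeek-R1) has already generated. The student model, dataset file, and hyperparameters below are placeholders for illustration, not DeepSeek’s actual recipe.

from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

student = "Qwen/Qwen2.5-0.5B"  # tiny stand-in for the 32B base model
tokenizer = AutoTokenizer.from_pretrained(student)
model = AutoModelForCausalLM.from_pretrained(student)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Hypothetical JSONL file of {"prompt": ..., "teacher_response": ...} records
# sampled from the teacher model.
data = load_dataset("json", data_files="r1_reasoning_samples.jsonl")["train"]

def tokenize(example):
    return tokenizer(
        example["prompt"] + example["teacher_response"],
        truncation=True,
        max_length=2048,
    )

tokenized = data.map(tokenize, remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="distilled-student", per_device_train_batch_size=1),
    train_dataset=tokenized,
    # Causal-LM collator: pads each batch and copies input_ids into labels
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()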

What is vLLM?

vLLM is a popular open-source engine for high-throughput LLM inference and serving. If you are trying to serve DeepSeek-R1 in production, you will want to use an inference server like vLLM, TensorRT-LLM, or Triton for better latency and throughput.
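
To give a feel for what vLLM does, here is a minimal sketch of its offline Python API. This is separate from the Modal deployment below; the prompt and sampling settings are just examples, and the 32B model needs two large GPUs to load.

from vllm import LLM, SamplingParams

# Load the model, sharded across 2 GPUs with tensor parallelism
llm = LLM(model="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B", tensor_parallel_size=2)
sampling = SamplingParams(temperature=0.6, max_tokens=512)

# Generate a completion for a single prompt
outputs = llm.generate(["Explain why the sky is blue."], sampling)
print(outputs[0].outputs[0].text)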

Why should you run DeepSeek-R1 Distilled on Modal?

Running even the distilled DeepSeek-R1 models requires GPUs, and Modal is the easiest way to get a GPU.

Modal is a Python library that makes it painless for you to deploy your code in the cloud and scale it to millions of requests, with or without GPUs.

You just write your Python function, attach a Modal decorator, and Modal handles the rest.
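
For example, here is a minimal sketch of that workflow (the app name and GPU type are just illustrative). You would run it with modal run hello.py.

from modal import App

app = App("hello-modal")

@app.function(gpu="T4")  # request a GPU for this function
def gpu_check() -> str:
    import subprocess
    # Runs inside a Modal container in the cloud, on a machine with a GPU attached
    return subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout

@app.local_entrypoint()
def main():
    # Calling .remote() executes gpu_check in the cloud and returns the result here
    print(gpu_check.remote())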

Example code for running DeepSeek-R1 Distilled Qwen-32B on Modal

To run the following code, you will need to:

  1. Create an account at modal.com
  2. Run pip install modal to install the modal Python package
  3. Run modal setup to authenticate (if this doesn’t work, try python -m modal setup)
  4. Copy the code below into a file called app.py
  5. Run modal deploy app.py
import os
import subprocess


from modal import Image, App, Secret, gpu, web_server

MODEL_DIR = "/model"
BASE_MODEL = "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"


# ## Download the model weights
def download_model_to_folder():
    from huggingface_hub import snapshot_download
    from transformers.utils import move_cache

    os.makedirs(MODEL_DIR, exist_ok=True)

    snapshot_download(
        BASE_MODEL,
        local_dir=MODEL_DIR,
        ignore_patterns=["*.pt", "*.bin"],  # Using safetensors
    )
    move_cache()


# ### Image definition
# We'll start from a recommended Docker Hub image and install `vLLM`.
# Then we'll use `run_function` to run the function defined above to ensure the weights of
# the model are saved within the container image.
image = (
    Image.debian_slim(python_version="3.12")
    .pip_install("vllm==0.7.0", "fastapi[standard]==0.115.4")
    .pip_install(
        "huggingface_hub[hf_transfer]==0.26.2",
    )
    # Use the barebones hf-transfer package for maximum download speeds. No progress bar, but expect 700MB/s.
    .env({"HF_HUB_ENABLE_HF_TRANSFER": "1"})
    .run_function(
        download_model_to_folder,
        secrets=[Secret.from_name("huggingface")],
        timeout=60 * 20,
    )
)

app = App("vllm-inference-openai-compatible", image=image)


GPU_CONFIG = gpu.H100(count=2)  # 2x80GB H100


@app.function(
    allow_concurrent_inputs=100,
    gpu=GPU_CONFIG,
    container_idle_timeout=1200,
    keep_warm=1,
)
@web_server(8000, startup_timeout=1200)
def openai_compatible_server():
    target = BASE_MODEL
    cmd = (
        f"python -m vllm.entrypoints.openai.api_server "
        f"--model {target} "
        f"--host 0.0.0.0 "
        f"--port 8000 "
        f"--tensor-parallel-size 2 "  # Enable tensor parallelism across 2 GPUs
        f"--max-model-len 32768 "  # Set maximum sequence length
        f"--enforce-eager"
    )
    subprocess.Popen(cmd, shell=True)

This will start up an OpenAI-compatible vLLM server.

Interact with the server

Once it is deployed, you’ll see a URL appear in the command line, something like https://your-workspace-name--vllm-inference-openai-compatible-openai-compatible-server.modal.run.
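
As an optional sanity check, you can ask the server which models it is serving by hitting vLLM’s /v1/models endpoint (replace <your-modal-url> with the URL from your deployment):

import requests

# Lists the models exposed by the OpenAI-compatible vLLM server
resp = requests.get("<your-modal-url>/v1/models")
print(resp.json())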

You can then interact with the server using the Python openai library.

from openai import OpenAI

question = "What is the capital of the moon?"

client = OpenAI(
    base_url="<your-modal-url>/v1/",
    api_key="token-abc123",
)


response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
    messages=[
        {
            "role": "user",
            "content": question,
        }
    ],
)

# R1-style models include their chain of thought in <think>...</think> tags
# before the final answer.
print(response.choices[0].message.content)
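
Because R1-style models emit a long chain of thought before their final answer, it is often nicer to stream tokens as they arrive. The openai client supports this with stream=True:

stream = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
    messages=[{"role": "user", "content": question}],
    stream=True,
)

# Print each token as it is generated
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)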

