
What is DeepSeek-R1 Distilled Qwen-32B?
DeepSeek-R1 is a state-of-the-art open-source “reasoning” LLM that is competitive with OpenAI’s o1 and other top closed LLMs.
There is a tremendous amount of excitement around DeepSeek-R1, and since its weights are available on Hugging Face, theoretically, anyone can run it.
However, the full un-quantized DeepSeek-R1 model, at 671B parameters, is much too large to run on consumer-grade GPUs.
To address this, DeepSeek has created a series of distilled models, including DeepSeek-R1 Distilled Qwen-32B.
What are distilled models?
Distilled models are smaller, more efficient AI models that are trained to replicate the behavior of larger “teacher” models while using fewer computational resources.
In particular, the DeepSeek-R1 distilled models were created by fine-tuning smaller base models (e.g., Alibaba’s Qwen, in the case of DeepSeek-R1 Distilled Qwen-32B) using samples of reasoning data generated by DeepSeek-R1.
On reasoning, math, and coding benchmarks, DeepSeek-R1 Distilled Qwen-32B performs on par with gpt-4o, o1-mini, and claude-3.5-sonnet.
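To make this concrete, here is a minimal sketch of distillation-style supervised fine-tuning, assuming you already have (prompt, reasoning trace) pairs sampled from the teacher. The tiny Qwen student model and the single training example are illustrative placeholders, not DeepSeek’s actual training setup.

from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

student_name = "Qwen/Qwen2.5-0.5B"  # tiny stand-in for the real 32B Qwen base model
tokenizer = AutoTokenizer.from_pretrained(student_name)
student = AutoModelForCausalLM.from_pretrained(student_name)

# Teacher-generated sample: the prompt plus the teacher's full reasoning trace and answer.
pairs = [
    {
        "prompt": "What is 17 * 24?",
        "completion": "<think>17*24 = 17*20 + 17*4 = 340 + 68 = 408</think> 408",
    },
]

def tokenize(example):
    text = example["prompt"] + "\n" + example["completion"] + tokenizer.eos_token
    tokens = tokenizer(text, truncation=True, max_length=1024)
    tokens["labels"] = tokens["input_ids"].copy()  # ordinary next-token prediction on the teacher's output
    return tokens

dataset = Dataset.from_list(pairs).map(tokenize, remove_columns=["prompt", "completion"])

Trainer(
    model=student,
    args=TrainingArguments(output_dir="distilled-student", per_device_train_batch_size=1, num_train_epochs=1),
    train_dataset=dataset,
).train()

The key point is that the student is trained to imitate the teacher’s full reasoning traces with a standard language-modeling loss, which is far cheaper than reproducing the teacher’s reinforcement-learning pipeline.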
What is vLLM?
vLLM is a popular open-source inference engine that implements optimizations like PagedAttention and continuous batching and exposes an OpenAI-compatible API. If you are trying to serve DeepSeek-R1 in production, you will need an inference server like vLLM, TensorRT-LLM, or Triton Inference Server to get good latency and throughput.
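As a quick illustration of what vLLM provides, here is a minimal sketch of running it directly on a machine with a GPU. The small instruct model is a placeholder so the example fits on a modest GPU; you could swap in DeepSeek-R1-Distill-Qwen-32B if you have enough VRAM.

from vllm import LLM, SamplingParams

# Load a small placeholder model into the vLLM engine.
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")
params = SamplingParams(temperature=0.6, max_tokens=256)

# Generate a completion for a single prompt and print the text.
outputs = llm.generate(["Explain tensor parallelism in one sentence."], params)
print(outputs[0].outputs[0].text)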
Why should you run DeepSeek-R1 Distilled on Modal?
Running even the distilled DeepSeek-R1 models requires GPUs, and Modal is the easiest way to get a GPU.
Modal is a Python library that makes it painless for you to deploy your code in the cloud and scale it to millions of requests, with or without GPUs.
You just write your Python function, attach a Modal decorator, and Modal handles the rest.
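For example, here is a minimal sketch of that workflow; the app name and function are illustrative. A plain Python function, wrapped in a Modal decorator, runs on a cloud GPU when you call it remotely.

import subprocess

from modal import App

app = App("gpu-hello")  # illustrative app name

@app.function(gpu="any")  # request any available GPU type
def check_gpu():
    # Runs in Modal's cloud; nvidia-smi reports the GPU attached to the container.
    return subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout

@app.local_entrypoint()
def main():
    print(check_gpu.remote())  # executes remotely, prints the output locally

Saving this as its own file and running modal run on it spins up a container with a GPU, executes check_gpu there, and prints the result on your machine.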
Example code for running DeepSeek-R1 Distilled Qwen-32B on Modal
To run the following code, you will need to:
- Create an account at modal.com
- Run pip install modal to install the modal Python package
- Run modal setup to authenticate (if this doesn’t work, try python -m modal setup)
- Create a Modal Secret named huggingface containing your Hugging Face token, since the image build below references it with Secret.from_name("huggingface")
- Copy the code below into a file called app.py
- Run modal deploy app.py
import os
import subprocess

from modal import Image, App, Secret, gpu, web_server

MODEL_DIR = "/model"
BASE_MODEL = "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"


# ## Define a container image
def download_model_to_folder():
    from huggingface_hub import snapshot_download
    from transformers.utils import move_cache

    os.makedirs(MODEL_DIR, exist_ok=True)

    snapshot_download(
        BASE_MODEL,
        local_dir=MODEL_DIR,
        ignore_patterns=["*.pt", "*.bin"],  # Using safetensors
    )
    move_cache()


# ### Image definition
# We'll start from Modal's debian_slim base image and install `vLLM`.
# Then we'll use `run_function` to run the function defined above to ensure the weights of
# the model are saved within the container image.
image = (
    Image.debian_slim(python_version="3.12")
    .pip_install("vllm==0.7.0", "fastapi[standard]==0.115.4")
    .pip_install(
        "huggingface_hub[hf_transfer]==0.26.2",
    )
    # Use the barebones hf-transfer package for maximum download speeds. No progress bar, but expect 700MB/s.
    .env({"HF_HUB_ENABLE_HF_TRANSFER": "1"})
    .run_function(
        download_model_to_folder,
        secrets=[Secret.from_name("huggingface")],
        timeout=60 * 20,
    )
)

app = App("vllm-inference-openai-compatible", image=image)

GPU_CONFIG = gpu.H100(count=2)  # 2x 80GB H100


@app.function(
    allow_concurrent_inputs=100,
    gpu=GPU_CONFIG,
    container_idle_timeout=1200,
    keep_warm=1,
)
@web_server(8000, startup_timeout=1200)
def openai_compatible_server():
    target = BASE_MODEL
    cmd = (
        f"python -m vllm.entrypoints.openai.api_server "
        f"--model {target} "
        f"--host 0.0.0.0 "
        f"--port 8000 "
        f"--tensor-parallel-size 2 "  # Enable tensor parallelism across 2 GPUs
        f"--max-model-len 32768 "  # Set maximum sequence length
        f"--enforce-eager"
    )
    subprocess.Popen(cmd, shell=True)
This will start up an OpenAI-compatible vLLM server.
Interact with the server
Once it is deployed, you’ll see a URL appear in the command line, something like https://your-workspace-name--vllm-inference-openai-compatible-openai-compatible-server.modal.run.
You can then interact with the server using the Python openai library.
from openai import OpenAI

question = "What is the capital of the moon?"

client = OpenAI(
    base_url="<your-modal-url>/v1/",
    api_key="token-abc123",  # placeholder; the server above was not launched with an API key requirement
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
    messages=[
        {
            "role": "user",
            "content": question,
        }
    ],
)

print(response.choices[0].message.content)
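DeepSeek-R1 models emit their chain of thought before the final answer, so responses can be long. If you would rather see tokens as they are generated, you can stream the response; this is a minimal sketch using the same client and model as above.

stream = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
    messages=[{"role": "user", "content": question}],
    stream=True,
)

# Print each token delta as it arrives.
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)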