How to deploy DeepSeek-R1 with llama.cpp
Over the last several years, large language models (LLMs) have transformed the way that we learn, solve problems, and analyze data.
As the name humbly suggests, LLMs like DeepSeek-R1, GPT-4, and Claude are massive, often possessing billions of parameters and demanding huge amounts of memory and compute. While hosted services like Claude and ChatGPT have made LLMs more accessible, running them locally without sophisticated hardware remains challenging due to their sheer size. In this tutorial, we’re going to show you how to use llama.cpp and Modal to deploy DeepSeek-R1 inference on L40S GPUs in the cloud.
What is DeepSeek-R1?
DeepSeek-R1 is an open-weight “reasoning” LLM that was released by DeepSeek in early 2025. It was created to rival proprietary systems like OpenAI’s o1 while remaining inexpensive to run and modify. It emphasizes step-by-step reasoning (with chain-of-thought internally) and was post-trained primarily with large-scale reinforcement learning.
According to benchmarking providers like Artificial Analysis (as of November 2025), the DeepSeek family of models still ranks as some of the most “intelligent” open-source LLMs available today.
DeepSeek-R1 is very large. It has 671B total parameters and consumes over 100GB of storage, even when quantized down to one ternary digit (1.58 bits) per parameter.
What is llama.cpp?
Llama.cpp is an open-source library that makes efficient inference of LLMs possible on consumer hardware. It was developed by software engineer Georgi Gerganov as an implementation of Meta’s LLaMA (Large Language Model Meta AI) in efficient C/C++ with no dependencies. This lightweight yet powerful framework makes running models that “shouldn’t” fit on your hardware as simple as running a few commands in your terminal.
The magic behind llama.cpp is quantization, the process of reducing model precision (from, say, 16-bit to 4-bit) to shrink model size and make inference faster while maintaining reasonable quality. Essentially, models that typically require hundreds to thousands of gigabytes of RAM—like DeepSeek-R1—can run on much smaller systems.
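To put numbers on that, here’s a rough back-of-the-envelope sketch (our own illustration, not part of the tutorial’s deployment code) of the storage DeepSeek-R1’s 671B parameters would need at different precisions. Real GGUF files keep some tensors at higher precision and include metadata, so treat these as ballpark figures:

# Rough illustration only: approximate weight storage for a 671B-parameter model
# at a few different precisions (ignores mixed-precision tensors and file metadata).
TOTAL_PARAMS = 671e9  # DeepSeek-R1's total parameter count

for label, bits_per_param in [("FP16", 16), ("4-bit", 4), ("1.58-bit ternary", 1.58)]:
    gigabytes = TOTAL_PARAMS * bits_per_param / 8 / 1e9
    print(f"{label:>16}: ~{gigabytes:,.0f} GB")

At 16 bits per parameter the weights alone exceed a terabyte; at roughly 1.58 bits per parameter they shrink to about 130 GB, which lines up with the “over 100GB of storage” figure above.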
Why developers love llama.cpp
- It’s fast: its highly optimized C/C++ implementation squeezes maximum speed from CPUs/GPUs
- It’s portable: the framework compiles and runs on almost anything — Windows, macOS, Linux, and even Raspberry Pi
- It’s open-source: you can use it to self-deploy LLMs and modify it however you like
Quickstart
If you haven’t set up Modal already:
pip install modal
modal setup

Then, clone and run our examples repo:
git clone https://github.com/modal-labs/modal-examples
cd modal-examples
modal run 06_gpu_and_ml/llm-serving/llama_cpp.py --n-predict 1024

For a step-by-step walkthrough of this example, keep reading.
How to deploy DeepSeek-R1 in minutes
Modal makes it easy to deploy open-source models on powerful GPUs. Modal consists of a Python SDK that wraps an ultra-fast container stack and multi-cloud GPU pool. If you don’t have a Modal account, follow the two-line setup instructions in the Quickstart above. You get $30 of GPU credits every month, so following this tutorial will be entirely free.
1. Define a local function which will invoke our Modal inference function remotely
First, let’s create a Modal app and define a local function that we’ll use to call the Modal inference function (llama_cpp_inference), which we’ll define later. This local function:
- Calls another Modal Function to download and cache the model. We’re using a quantized version of DeepSeek-R1 released by Unsloth AI.
- Invokes our inference function, which runs on cloud GPUs thanks to Modal.
- Writes the results to a txt file.
Note the .local_entrypoint() Modal decorator, which allows us to invoke this function from the CLI.
from pathlib import Path
from typing import Optional

import modal

MINUTES = 60  # seconds, used for Function timeouts below

app = modal.App("example-llama-cpp")


@app.local_entrypoint()
def main(
    prompt: Optional[str] = None,
    n_predict: int = -1,  # max number of tokens to predict, -1 is infinite
    args: Optional[str] = None,  # string of arguments to pass to llama.cpp's cli
):
    import shlex

    org_name = "unsloth"
    model = "DeepSeek-R1"
    model_name = "DeepSeek-R1-GGUF"
    quant = "UD-IQ1_S"
    model_entrypoint_file = (
        f"{model}-{quant}/DeepSeek-R1-{quant}-00001-of-00003.gguf"
    )
    model_pattern = f"*{quant}*"
    revision = "02656f62d2aa9da4d3f0cdb34c341d30dd87c3b6"

    parsed_args = DEFAULT_DEEPSEEK_R1_ARGS if args is None else shlex.split(args)

    repo_id = f"{org_name}/{model_name}"
    download_model.remote(repo_id, [model_pattern], revision)

    # call out to a `.remote` Function on Modal for inference
    result = llama_cpp_inference.remote(
        model_entrypoint_file,
        prompt,
        n_predict,
        parsed_args,
    )

    output_path = Path("/tmp") / f"llama-cpp-{model}.txt"
    output_path.parent.mkdir(parents=True, exist_ok=True)
    print(f"🦙 writing response to {output_path}")
    output_path.write_text(result)

2. Define the model download function
Next, we download the model weights from Hugging Face. Modal is serverless, so disks are by default ephemeral. To persist the weights between runs, we store them in a Modal Volume.
Note the download_image Modal Image defined here, which ensures that the download_model Modal Function has all the dependencies it needs to download the model.
model_cache = modal.Volume.from_name("llamacpp-cache", create_if_missing=True)
cache_dir = "/root/.cache/llama.cpp"

download_image = (
    modal.Image.debian_slim(python_version="3.11")
    .pip_install("huggingface_hub[hf_transfer]==0.26.2")
    .env({"HF_HUB_ENABLE_HF_TRANSFER": "1"})
)


@app.function(
    image=download_image, volumes={cache_dir: model_cache}, timeout=30 * MINUTES
)
def download_model(repo_id, allow_patterns, revision: Optional[str] = None):
    from huggingface_hub import snapshot_download

    print(f"🦙 downloading model from {repo_id} if not present")
    snapshot_download(
        repo_id=repo_id,
        revision=revision,
        local_dir=cache_dir,
        allow_patterns=allow_patterns,
    )
    model_cache.commit()  # ensure other Modal Functions can see our writes before we quit
    print("🦙 model loaded")

3. Define the DeepSeek inference function
Let’s define the actual inference function now! It runs remotely when called with llama_cpp_inference.remote(...), so we need to define a Modal Image that gives the Function the environment it needs, including CUDA and a GPU-enabled build of llama.cpp.
image = (
    modal.Image.from_registry(
        "nvidia/cuda:12.4.0-devel-ubuntu22.04", add_python="3.12"
    )
    .apt_install("git", "build-essential", "cmake", "curl", "libcurl4-openssl-dev")
    .run_commands("git clone https://github.com/ggerganov/llama.cpp")
    .run_commands(
        "cmake llama.cpp -B llama.cpp/build "
        "-DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON "
    )
    .run_commands(  # this one takes a few minutes!
        "cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli"
    )
    .run_commands("cp llama.cpp/build/bin/llama-* llama.cpp")
    .entrypoint([])  # remove NVIDIA base container entrypoint
)

Next, we’ll define the inference function itself. At the top of our llama_cpp_inference function, we add an app.function decorator to attach all of our infrastructure:
- the image with the dependencies
- the volumes with the weights
- the gpu we want, in this case four L40S’s
Inside the function, we call the llama.cpp CLI with subprocess.Popen. This requires a bit of extra plumbing because we want to both stream the output as it is generated and collect it so we can return it to the local caller.
DEFAULT_DEEPSEEK_R1_ARGS = [  # good default llama.cpp cli args for deepseek-r1
    "--cache-type-k",
    "q4_0",
    "--threads",
    "12",
    "-no-cnv",
    "--prio",
    "2",
    "--temp",
    "0.6",
    "--ctx-size",
    "8192",
]


@app.function(
    image=image,
    volumes={cache_dir: model_cache},
    gpu="L40S:4",
    timeout=30 * MINUTES,
)
def llama_cpp_inference(
    model_entrypoint_file: str,
    prompt: Optional[str] = None,
    n_predict: int = -1,
    args: Optional[list[str]] = None,
):
    import subprocess

    if prompt is None:
        prompt = DEFAULT_PROMPT  # see end of file
    if "deepseek" in model_entrypoint_file.lower():
        prompt = "<|User|>" + prompt + "<think>"
    if args is None:
        args = []

    # set layers to "off-load to", aka run on, GPU
    n_gpu_layers = 9999  # all

    command = [
        "/llama.cpp/llama-cli",
        "--model",
        f"{cache_dir}/{model_entrypoint_file}",
        "--n-gpu-layers",
        str(n_gpu_layers),
        "--prompt",
        prompt,
        "--n-predict",
        str(n_predict),
    ] + args

    print("🦙 running command:", command, sep="\n\t")
    p = subprocess.Popen(
        command, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=False
    )
    stdout, stderr = collect_output(p)
    if p.returncode != 0:
        raise subprocess.CalledProcessError(p.returncode, command, stdout, stderr)
    return stdout

Helper functions
The remainder of this code is less interesting from the perspective of running LLM inference on Modal but is necessary for the code to run.
DEFAULT_PROMPT = "Write a Python function to calculate the factorial of a number."
def stream_output(stream, queue, write_stream):
    """Reads lines from a stream and writes to a queue and a write stream."""
    for line in iter(stream.readline, b""):
        line = line.decode("utf-8", errors="replace")
        write_stream.write(line)
        write_stream.flush()
        queue.put(line)
    stream.close()


def collect_output(process):
    """Collect up the stdout and stderr of a process while still streaming it out."""
    import sys
    from queue import Queue
    from threading import Thread

    stdout_queue = Queue()
    stderr_queue = Queue()

    stdout_thread = Thread(
        target=stream_output, args=(process.stdout, stdout_queue, sys.stdout)
    )
    stderr_thread = Thread(
        target=stream_output, args=(process.stderr, stderr_queue, sys.stderr)
    )
    stdout_thread.start()
    stderr_thread.start()
    stdout_thread.join()
    stderr_thread.join()
    process.wait()

    stdout_collected = "".join(stdout_queue.queue)
    stderr_collected = "".join(stderr_queue.queue)

    return stdout_collected, stderr_collected

4. Run it
To run inference, simply trigger this code from the command line with:
modal run llama_cpp.py

This runs the main function on your device. That function uses Modal’s .remote syntax to run the Functions defined in steps 2 and 3 remotely. That means powerful cloud GPUs for DeepSeek-R1, without the hassle of procuring and configuring hardware.
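You can also customize the run from the command line. Modal turns the local entrypoint’s parameters into CLI flags (as with --n-predict in the Quickstart above), so an invocation with your own prompt might look like this (the prompt here is just an illustration):

modal run llama_cpp.py --prompt "Explain quantization in one paragraph" --n-predict 512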
Get started today
That’s it! For a more detailed walkthrough of this example, check out the full writeup in our docs.
Ready to build with DeepSeek-R1 or any other AI model? Sign up for Modal and get $30 in free credits. Whether you’re running open-source models or your own custom models, Modal gives you instant access to thousands of GPUs, from T4s to B200s. No waiting for quota, configuring Kubernetes, or wasting money on idle costs—just fluid GPU compute you can attach to your inference code.
Deploy DeepSeek-R1