How to deploy DeepSeek-R1 with llama.cpp
Over the last several years, large language models (LLMs) have transformed the way that we learn, solve problems, and analyze data.
As the name humbly suggests, LLMs like DeepSeek-R1, GPT-4, and Claude are massive, often possessing billions of parameters and demanding huge amounts of memory and compute. While hosted services like Claude and ChatGPT have made LLMs more accessible, running them locally without sophisticated hardware remains challenging due to their sheer size. In this tutorial, we’re going to show you how to use llama.cpp and Modal to deploy DeepSeek-R1 inference on L40S GPUs in the cloud.
What is DeepSeek-R1?
DeepSeek-R1 is an open-weight “reasoning” LLM that was released by DeepSeek in early 2025. It was created to rival proprietary systems like OpenAI’s o1 while remaining inexpensive to run and modify. It emphasizes step-by-step reasoning (with chain-of-thought internally) and was post-trained primarily with large-scale reinforcement learning.
According to benchmarking providers like Artificial Analysis (as of November 2025), the DeepSeek family of models still ranks as some of the most “intelligent” open-source LLMs available today.
DeepSeek-R1 is very large. It has 671B total parameters and consumes over 100GB of storage, even when quantized down to one ternary digit (1.58 bits) per parameter.
What is llama.cpp?
Llama.cpp is an open-source library that makes efficient inference of LLMs possible on consumer hardware. It was developed by software engineer Georgi Gerganov as an implementation of Meta’s LLaMA (Large Language Model Meta AI) in efficient C/C++ with no dependencies. This lightweight yet powerful framework makes running models that “shouldn’t” fit on your hardware as simple as running a few commands in your terminal.
The magic behind llama.cpp is quantization, the process of reducing model precision (from, say, 16-bit to 4-bit) to shrink model size and make inference faster while maintaining reasonable quality. Essentially, models that typically require hundreds to thousands of gigabytes of RAM—like DeepSeek-R1—can run on much smaller systems.
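To put numbers on that, here’s a rough back-of-the-envelope sketch (our own illustration, not part of the tutorial’s deployment code) of the storage DeepSeek-R1’s 671B parameters would need at different precisions. Real GGUF files keep some tensors at higher precision and include metadata, so treat these as ballpark figures:

# Rough illustration only: approximate weight storage for a 671B-parameter model
# at a few different precisions (ignores mixed-precision tensors and file metadata).
TOTAL_PARAMS = 671e9  # DeepSeek-R1's total parameter count

for label, bits_per_param in [("FP16", 16), ("4-bit", 4), ("1.58-bit ternary", 1.58)]:
    gigabytes = TOTAL_PARAMS * bits_per_param / 8 / 1e9
    print(f"{label:>16}: ~{gigabytes:,.0f} GB")

At 16 bits per parameter the weights alone exceed a terabyte; at roughly 1.58 bits per parameter they shrink to about 130 GB, which lines up with the “over 100GB of storage” figure above.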
Why developers love llama.cpp
- It’s fast: its highly optimized C/C++ implementation squeezes maximum speed from CPUs/GPUs
- It’s portable: the framework compiles and runs on almost anything — Windows, macOS, Linux, and even Raspberry Pi
- It’s open-source: you can use it to self-deploy LLMs and modify it however you like
Quickstart
If you haven’t set up Modal already:
pip install modal
modal setup

Then, clone and run our examples repo:
git clone https://github.com/modal-labs/modal-examples
cd modal-examples
modal run 06_gpu_and_ml/llm-serving/llama_cpp.py --n-predict 1024

For a step-by-step walkthrough of this example, keep reading.
How to deploy DeepSeek-R1 in minutes
Modal makes it easy to deploy open-source models on powerful GPUs. Modal consists of a Python SDK that wraps an ultra-fast container stack and multi-cloud GPU pool. If you don’t have a Modal account, follow the two-line setup instructions in the Quickstart above. You get $30 of GPU credits every month, so following this tutorial will be entirely free.
1. Define a local function which will invoke our Modal inference function remotely
First, let’s create a Modal app and define a local function that we’ll use to call the Modal inference function (llama_cpp_inference), which we’ll define later. This local function:
- Calls another Modal Function to download and cache the model. We’re using a quantized version of DeepSeek-R1 released by Unsloth AI.
- Invokes our inference function, which runs on cloud GPUs thanks to Modal.
- Writes the results to a txt file.
Note the .local_entrypoint() Modal decorator, which allows us to invoke this function from the CLI.
from pathlib import Path
from typing import Optional

import modal

MINUTES = 60  # seconds, used for Function timeouts below

app = modal.App("example-llama-cpp")


@app.local_entrypoint()
def main(
    prompt: Optional[str] = None,
    n_predict: int = -1,  # max number of tokens to predict, -1 is infinite
    args: Optional[str] = None,  # string of arguments to pass to llama.cpp's cli
):
    import shlex

    org_name = "unsloth"
    model = "DeepSeek-R1"
    model_name = "DeepSeek-R1-GGUF"
    quant = "UD-IQ1_S"
    model_entrypoint_file = (
        f"{model}-{quant}/DeepSeek-R1-{quant}-00001-of-00003.gguf"
    )
    model_pattern = f"*{quant}*"
    revision = "02656f62d2aa9da4d3f0cdb34c341d30dd87c3b6"

    parsed_args = DEFAULT_DEEPSEEK_R1_ARGS if args is None else shlex.split(args)

    repo_id = f"{org_name}/{model_name}"
    download_model.remote(repo_id, [model_pattern], revision)

    # call out to a `.remote` Function on Modal for inference
    result = llama_cpp_inference.remote(
        model_entrypoint_file,
        prompt,
        n_predict,
        parsed_args,
    )

    output_path = Path("/tmp") / f"llama-cpp-{model}.txt"
    output_path.parent.mkdir(parents=True, exist_ok=True)
    print(f"🦙 writing response to {output_path}")
    output_path.write_text(result)

2. Define the model download function
Next, we download the model weights from Hugging Face. Modal is serverless, so disks are by default ephemeral. To persist the weights between runs, we store them in a Modal Volume.
Note the download_image Modal Image defined here, which ensures that the download_model Modal Function has all the dependencies it needs to download the model.
model_cache = modal.Volume.from_name("llamacpp-cache", create_if_missing=True)
cache_dir = "/root/.cache/llama.cpp"

download_image = (
    modal.Image.debian_slim(python_version="3.11")
    .pip_install("huggingface_hub[hf_transfer]==0.26.2")
    .env({"HF_HUB_ENABLE_HF_TRANSFER": "1"})
)


@app.function(
    image=download_image, volumes={cache_dir: model_cache}, timeout=30 * MINUTES
)
def download_model(repo_id, allow_patterns, revision: Optional[str] = None):
    from huggingface_hub import snapshot_download

    print(f"🦙 downloading model from {repo_id} if not present")
    snapshot_download(
        repo_id=repo_id,
        revision=revision,
        local_dir=cache_dir,
        allow_patterns=allow_patterns,
    )
    model_cache.commit()  # ensure other Modal Functions can see our writes before we quit
    print("🦙 model loaded")

3. Define the DeepSeek inference function
Let’s define the actual inference function now! It runs remotely when called with llama_cpp_inference.remote(...), so we need to define a Modal Image that gives the Function the environment it needs, including CUDA and a GPU-enabled build of llama.cpp.
image = (
    modal.Image.from_registry(
        "nvidia/cuda:12.4.0-devel-ubuntu22.04", add_python="3.12"
    )
    .apt_install("git", "build-essential", "cmake", "curl", "libcurl4-openssl-dev")
    .run_commands("git clone https://github.com/ggerganov/llama.cpp")
    .run_commands(
        "cmake llama.cpp -B llama.cpp/build "
        "-DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON "
    )
    .run_commands(  # this one takes a few minutes!
        "cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli"
    )
    .run_commands("cp llama.cpp/build/bin/llama-* llama.cpp")
    .entrypoint([])  # remove NVIDIA base container entrypoint
)

Next, we’ll define the inference function itself. At the top of our llama_cpp_inference function, we add an app.function decorator to attach all of our infrastructure:
- the image with the dependencies
- the volumes with the weights
- the gpu we want, in this case four L40S’s
Inside the function, we call the llama.cpp CLI with subprocess.Popen. This requires a bit of extra plumbing because we want to both stream the output as it is generated and collect it so we can return it to the local caller.
DEFAULT_DEEPSEEK_R1_ARGS = [  # good default llama.cpp cli args for deepseek-r1
    "--cache-type-k",
    "q4_0",
    "--threads",
    "12",
    "-no-cnv",
    "--prio",
    "2",
    "--temp",
    "0.6",
    "--ctx-size",
    "8192",
]


@app.function(
    image=image,
    volumes={cache_dir: model_cache},
    gpu="L40S:4",
    timeout=30 * MINUTES,
)
def llama_cpp_inference(
    model_entrypoint_file: str,
    prompt: Optional[str] = None,
    n_predict: int = -1,
    args: Optional[list[str]] = None,
):
    import subprocess

    if prompt is None:
        prompt = DEFAULT_PROMPT  # see end of file
    if "deepseek" in model_entrypoint_file.lower():
        prompt = "<|User|>" + prompt + "<think>"
    if args is None:
        args = []

    # set layers to "off-load to", aka run on, GPU
    n_gpu_layers = 9999  # all

    command = [
        "/llama.cpp/llama-cli",
        "--model",
        f"{cache_dir}/{model_entrypoint_file}",
        "--n-gpu-layers",
        str(n_gpu_layers),
        "--prompt",
        prompt,
        "--n-predict",
        str(n_predict),
    ] + args

    print("🦙 running command:", command, sep="\n\t")
    p = subprocess.Popen(
        command, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=False
    )
    stdout, stderr = collect_output(p)
    if p.returncode != 0:
        raise subprocess.CalledProcessError(p.returncode, command, stdout, stderr)
    return stdout

Helper functions
The remainder of this code is less interesting from the perspective of running LLM inference on Modal but is necessary for the code to run.
DEFAULT_PROMPT = "Write a Python function to calculate the factorial of a number."
def stream_output(stream, queue, write_stream):
    """Reads lines from a stream and writes to a queue and a write stream."""
    for line in iter(stream.readline, b""):
        line = line.decode("utf-8", errors="replace")
        write_stream.write(line)
        write_stream.flush()
        queue.put(line)
    stream.close()


def collect_output(process):
    """Collect up the stdout and stderr of a process while still streaming it out."""
    import sys
    from queue import Queue
    from threading import Thread

    stdout_queue = Queue()
    stderr_queue = Queue()

    stdout_thread = Thread(
        target=stream_output, args=(process.stdout, stdout_queue, sys.stdout)
    )
    stderr_thread = Thread(
        target=stream_output, args=(process.stderr, stderr_queue, sys.stderr)
    )
    stdout_thread.start()
    stderr_thread.start()
    stdout_thread.join()
    stderr_thread.join()
    process.wait()

    stdout_collected = "".join(stdout_queue.queue)
    stderr_collected = "".join(stderr_queue.queue)

    return stdout_collected, stderr_collected

4. Run it
To run inference, simply trigger this code from the command line with:
modal run llama_cpp.py

This runs the main function on your device. That function uses Modal’s .remote syntax to run the Functions defined in steps 2 and 3 remotely. That means powerful cloud GPUs for DeepSeek-R1, without the hassle of procuring and configuring hardware.
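You can also customize the run from the command line. Modal turns the local entrypoint’s parameters into CLI flags (as with --n-predict in the Quickstart above), so an invocation with your own prompt might look like this (the prompt here is just an illustration):

modal run llama_cpp.py --prompt "Explain quantization in one paragraph" --n-predict 512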
Get started today
That’s it! For a more detailed walkthrough of this example, check out the full writeup in our docs.
Ready to build with DeepSeek-R1 or any other AI model? Sign up for Modal and get $30 in free credits. Whether you’re running open-source models or your own custom models, Modal gives you instant access to thousands of GPUs, from T4s to B200s. No waiting for quota, configuring Kubernetes, or wasting money on idle costs—just fluid GPU compute you can attach to your inference code.
Deploy DeepSeek-R1