
Introduction to Llama 3.1 70B Instruct
The Meta Llama 3.1 collection of multilingual large language models (LLMs) includes the Llama 3.1 70B model. With 70 billion parameters, it is capable across a wide range of tasks while being considerably less demanding to run than larger models like the 405B variant.
The instruction-tuned model is a strong choice for tasks like tool use (for example, code execution and search) and complex reasoning. It also has a 128K-token context window, making it well suited to extended inputs such as lengthy documents or long-running conversations.
Because of its size, it requires two H100 GPUs to run: in bfloat16, the 70 billion parameters alone occupy roughly 140 GB of memory, more than a single 80 GB H100 can hold. For a more memory-efficient model from the same family, see the 8B variant.
Why should you run Llama 3.1 70B Instruct on Modal?
Modal is the easiest way to access a GPU for running machine learning workloads. With Modal, you can take a local Python function, wrap it with Modal's decorators, and send it off to run on a cloud GPU.
Modal also supports a range of configurations, so you can tailor the environment to your needs, for example by choosing the number of GPUs, attaching storage volumes, and setting timeouts for your applications.
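As a rough illustration of that workflow, here is a minimal sketch of a GPU-backed Modal function. The app name, function name, and the nvidia-smi call are placeholders for this sketch (nvidia-smi is assumed to be available in Modal's GPU containers); they are not part of the Llama example further down.
import modal

app = modal.App("example-gpu-check")

# Decorating an ordinary Python function turns it into a job that Modal runs
# in the cloud on the requested hardware.
@app.function(gpu="H100:2", timeout=60 * 60)
def check_gpus():
    import subprocess

    # List the GPUs visible inside the container.
    subprocess.run(["nvidia-smi"], check=True)

@app.local_entrypoint()
def main():
    check_gpus.remote()  # executes remotely on Modal, not on your machine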
Example code for running the Llama 3.1 70B Instruct LLM on Modal
To run the following code, you will need to:
- Create an account at modal.com
- Run pip install modal to install the Modal Python package
- Run modal setup to authenticate (if this doesn't work, try python -m modal setup)
- Copy the code below into a file called app.py
- Run modal run app.py
Please note that this code is not optimized for best performance. To run Llama 3.1 70B Instruct with an LLM serving framework like vLLM for better latency and throughput, refer to this more detailed example. (You can modify the code in that example to run the 70B version instead of the 8B version; a minimal sketch of the core vLLM call is also shown after the Modal example below.)
import modal

MODEL_ID = "NousResearch/Meta-Llama-3.1-70B-Instruct"
MODEL_REVISION = "d50656ee28e2c2906d317cbbb6fcb55eb4055a84"  # pin a specific commit for reproducibility

# Container image with the libraries needed to run the model with transformers.
image = modal.Image.debian_slim().pip_install("transformers", "torch", "accelerate")
app = modal.App("example-base-Meta-Llama-3-70B-Instruct", image=image)

GPU_CONFIG = "H100:2"  # the 70B weights in bfloat16 need two 80 GB H100s
CACHE_DIR = "/cache"
# Persist downloaded weights in a Volume so they are only fetched once.
cache_vol = modal.Volume.from_name("hf-hub-cache", create_if_missing=True)


@app.cls(
    gpu=GPU_CONFIG,
    volumes={CACHE_DIR: cache_vol},
    allow_concurrent_inputs=15,  # let one container serve several requests at once
    container_idle_timeout=60 * 10,  # keep the container warm for 10 minutes
    timeout=60 * 60,  # allow up to an hour per input
)
class Model:
    @modal.enter()  # runs once when the container starts, before any requests
    def setup(self):
        import torch
        import transformers

        self.pipeline = transformers.pipeline(
            "text-generation",
            model=MODEL_ID,
            revision=MODEL_REVISION,
            model_kwargs={
                "torch_dtype": torch.bfloat16,
                # store the downloaded weights in the mounted Volume
                "cache_dir": CACHE_DIR,
            },
            device_map="auto",  # shard the model across both GPUs
        )

    @modal.method()
    def generate(self, input: str):
        messages = [
            {
                "role": "system",
                "content": "You are a helpful assistant.",
            },
            {"role": "user", "content": input},
        ]

        outputs = self.pipeline(
            messages,
            max_new_tokens=256,
        )
        # The pipeline returns the full chat history; the last message is the
        # assistant's reply.
        return outputs[0]["generated_text"][-1]


# Run the model by executing `modal run app.py` locally.
@app.local_entrypoint()
def main(prompt: str = None):
    if prompt is None:
        prompt = "Please write a Python function to compute the Fibonacci numbers."
    print(Model().generate.remote(prompt))
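Because prompt is a parameter of the local entrypoint, Modal exposes it as a command-line flag, so you can pass your own input with something like modal run app.py --prompt "Explain the Fibonacci sequence in one paragraph." (the prompt text here is just an example).
As noted above, a serving framework like vLLM will give better latency and throughput than the plain transformers pipeline, and the linked example walks through a full vLLM server on Modal. As a rough, non-authoritative sketch of the core change, offline inference with vLLM might look like the following; tensor_parallel_size=2 is an assumption for splitting the 70B weights across the two H100s, and the raw string prompt skips the chat template for brevity.
from vllm import LLM, SamplingParams

# Load the model once, sharded across both GPUs.
llm = LLM(
    model="NousResearch/Meta-Llama-3.1-70B-Instruct",
    tensor_parallel_size=2,
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

# Generate a completion for a single prompt.
outputs = llm.generate(
    ["Please write a Python function to compute the Fibonacci numbers."],
    sampling_params,
)
print(outputs[0].outputs[0].text)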
Additional resources
- How to run Llama 3.1 8B Instruct on Modal
- How to run Llama 3.1 405B Instruct on Modal
- vLLM Documentation - Official documentation for vLLM, a framework for serving large language models.