December 16, 2024 · 5 minute read
How to deploy Llama 3.1 70B Instruct on Modal
Author: Yiren Lu (@YirenLu), Solutions Engineer

Introduction to Llama 3.1 70B Instruct

Meta's Llama 3.1 family is a collection of pretrained and instruction-tuned multilingual large language models (LLMs). The Llama 3.1 70B Instruct model is optimized for multilingual dialogue use cases.

Because the 70B model's weights are too large for a single GPU at bfloat16 precision, the example below runs it on 2 H100 GPUs. For a more memory-efficient model from the same family, see the 8B variant.
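
As a rough sanity check on the GPU count, the weight memory can be estimated from the parameter count. The sketch below assumes a round 70 billion parameters, 2 bytes per parameter (bfloat16, as in the example code), and the 80 GB H100 variant. It counts weights only, so the headroom it reports still has to cover activations and the KV cache.

```python
# Back-of-the-envelope check: do the bfloat16 weights fit on two H100s?
PARAMS = 70e9          # assumption: round parameter count for the 70B model
BYTES_PER_PARAM = 2    # bfloat16 stores each weight in 2 bytes
H100_MEMORY_GB = 80    # assumption: 80 GB H100 variant
NUM_GPUS = 2


def weight_memory_gb(params: float, bytes_per_param: int) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


weights_gb = weight_memory_gb(PARAMS, BYTES_PER_PARAM)
total_gb = H100_MEMORY_GB * NUM_GPUS
headroom_gb = total_gb - weights_gb

print(f"weights: {weights_gb:.0f} GB, available: {total_gb} GB, headroom: {headroom_gb:.0f} GB")
# prints "weights: 140 GB, available: 160 GB, headroom: 20 GB"
```

Under these assumptions the weights alone exceed one GPU's memory, which is why the example spreads the model across two GPUs.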

Example code for running the Llama 3.1 70B Instruct LLM on Modal

To run the following code, you will need to:

  1. Create an account at modal.com
  2. Run pip install modal to install the modal Python package
  3. Run modal setup to authenticate (if this doesn’t work, try python -m modal setup)
  4. Copy the code below into a file called app.py
  5. Run modal run app.py

Please note that this code is not optimized for performance. To run Llama 3.1 70B Instruct with an LLM serving framework like vLLM for better latency and throughput, refer to the more detailed vLLM example. (That example uses the 8B version; you can modify it to run the 70B version instead.)

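import modal

# Hugging Face model repo, pinned to a specific revision for reproducibility.
MODEL_ID = "NousResearch/Meta-Llama-3.1-70B-Instruct"
MODEL_REVISION = "d50656ee28e2c2906d317cbbb6fcb55eb4055a84"

# Container image with the libraries needed to load and run the model.
image = modal.Image.debian_slim().pip_install("transformers", "torch", "accelerate")
app = modal.App("example-base-Meta-Llama-3-70B-Instruct", image=image)

# The model is sharded across two H100 GPUs.
GPU_CONFIG = modal.gpu.H100(count=2)

# Persistent volume so downloaded weights are reused across container starts.
CACHE_DIR = "/cache"
cache_vol = modal.Volume.from_name("hf-hub-cache", create_if_missing=True)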


@app.cls(
    gpu=GPU_CONFIG,
    volumes={CACHE_DIR: cache_vol},
    allow_concurrent_inputs=15,
    container_idle_timeout=60 * 10,
    timeout=60 * 60,
)
class Model:
    @modal.enter()
    def setup(self):
        import torch
        import transformers

        self.pipeline = transformers.pipeline(
            "text-generation",
            model=MODEL_ID,
            revision=MODEL_REVISION,
            cache_dir=CACHE_DIR,
            model_kwargs={"torch_dtype": torch.bfloat16},
            device_map="auto",
        )

    @modal.method()
    def generate(self, input: str):
        messages = [
            {
                "role": "system",
                "content": "You are a helpful assistant.",
            },
            {"role": "user", "content": input},
        ]

        outputs = self.pipeline(
            messages,
            max_new_tokens=256,
        )

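        # For chat input, "generated_text" holds the full message list;
        # the last entry is the assistant's reply (a dict with "role" and "content").
        return outputs[0]["generated_text"][-1]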


# Run the model: `modal run app.py` calls this local entrypoint.
@app.local_entrypoint()
def main(prompt: str = None):
    if prompt is None:
        prompt = "Please write a Python function to compute the Fibonacci numbers."
    print(Model().generate.remote(prompt))
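
Note that generate returns the last chat message rather than a plain string. With chat-style input, the transformers text-generation pipeline returns the whole conversation under "generated_text", with the assistant's reply appended as a dict. The sketch below shows one way to pull out just the text. It uses a hand-written stand-in for the pipeline output so it runs without a GPU, and extract_reply is a hypothetical helper, not part of the example above.

```python
def extract_reply(outputs: list) -> str:
    """Return only the assistant's text from chat-style pipeline output.

    `extract_reply` is a hypothetical helper for illustration.
    """
    last_message = outputs[0]["generated_text"][-1]
    return last_message["content"]


# Hand-written stand-in with the same shape as the pipeline's chat output.
fake_outputs = [
    {
        "generated_text": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Say hello."},
            {"role": "assistant", "content": "Hello!"},
        ]
    }
]

reply = extract_reply(fake_outputs)
print(reply)  # prints "Hello!"
```

If you only want the text, the same indexing could be applied inside generate by returning the "content" field of the last message.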
