How to deploy Llama 3.1 8B Instruct on Modal

December 16, 2024 • 5 minute read

Yiren Lu (@YirenLu)

Solutions Engineer

Introduction to Llama 3.1 8B Instruct

Meta's Llama 3.1 family is a collection of pretrained and instruction-tuned multilingual large language models (LLMs). The Llama 3.1 8B Instruct model is optimized for multilingual dialogue, and Meta reports that it outperforms many open-source and closed chat models on common industry benchmarks.

To run the following code, you will need to:

1. Create an account at modal.com
2. Run pip install modal to install the modal Python package
3. Run modal setup to authenticate (if this doesn’t work, try python -m modal setup)
4. Copy the code below into a file called app.py
5. Run modal run app.py

Note that this code is not optimized for performance. To run Llama 3.1 8B Instruct with an LLM serving framework such as vLLM for better latency and throughput, refer to the more detailed vLLM example.

import modal

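# NOTE: this repo ID is the Llama 3 8B base checkpoint, not Llama 3.1 8B Instruct.
# To run the Instruct model, swap in its Hugging Face repo ID and a matching revision.
MODEL_ID = "NousResearch/Meta-Llama-3-8B"
# Pin a specific commit so every run downloads the same weights.
MODEL_REVISION = "315b20096dc791d381d514deb5f8bd9c8d6d3061"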

image = modal.Image.debian_slim().pip_install("transformers", "torch", "accelerate")
app = modal.App("example-base-Meta-Llama-3-8B", image=image)

GPU_CONFIG = modal.gpu.H100(count=2)

CACHE_DIR = "/cache"
cache_vol = modal.Volume.from_name("hf-hub-cache", create_if_missing=True)


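@app.cls(
    gpu=GPU_CONFIG,
    volumes={CACHE_DIR: cache_vol},  # mount the shared Volume at /cache
    allow_concurrent_inputs=15,  # each container handles up to 15 inputs at once
    container_idle_timeout=60 * 10,  # keep idle containers alive for 10 minutes
    timeout=60 * 60,  # allow each call up to 1 hour
)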
class Model:
    # Runs once when a container starts, so the model is loaded once and reused across calls.
    @modal.enter()
    def setup(self):
        import torch
        import transformers

        self.pipeline = transformers.pipeline(
            "text-generation",
            model=MODEL_ID,
            revision=MODEL_REVISION,
            cache_dir=CACHE_DIR,
            model_kwargs={"torch_dtype": torch.bfloat16},
            device_map="auto",
        )

    @modal.method()
    def generate(self, prompt: str):
        # Returns the pipeline's raw output: a list of dicts with a "generated_text" key.
        return self.pipeline(prompt)


# Run the model
@app.local_entrypoint()
def main(prompt: str = None):
    if prompt is None:
        prompt = "Please write a Python function to compute the Fibonacci numbers."
    print(Model().generate.remote(prompt))
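
The entrypoint above prints the pipeline's raw return value, which for a text-generation pipeline is a list of dictionaries rather than a plain string. As a minimal sketch, the helper below pulls out just the generated text. The name extract_generated_text and the sample data are hypothetical and not part of the original example; the helper also assumes that chat-style inputs come back as a list of messages with the reply last, which is how recent transformers versions behave.

```python
from typing import Any


def extract_generated_text(outputs: list[dict[str, Any]]) -> list[str]:
    """Return the generated strings from a text-generation pipeline result.

    Hypothetical helper: assumes each item carries a "generated_text" key.
    """
    texts = []
    for item in outputs:
        generated = item["generated_text"]
        if isinstance(generated, list):
            # Chat-style input: "generated_text" is a list of messages,
            # and the model's reply is the last one.
            generated = generated[-1]["content"]
        texts.append(generated)
    return texts


# Made-up sample in the same shape as the pipeline output (not real model output).
sample = [{"generated_text": "def fib(n): ..."}]
print(extract_generated_text(sample)[0])
```

In main, this would wrap the remote call, for example extract_generated_text(Model().generate.remote(prompt))[0]. By default the pipeline returns the prompt followed by the completion; passing return_full_text=False to the pipeline call returns only the newly generated text.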
