
Introduction to Llama 3.1 8B Instruct
Meta Llama 3.1 is a family of open-source LLMs available in three sizes: 8B, 70B, and 405B parameters. While larger models like the 405B variant deliver higher quality, they are also significantly more expensive to run. Llama 3.1 8B is a good choice for applications that need to balance quality and cost.
Licensing terms
When using the Llama 3.1 model, it’s important to be aware of the licensing terms set by Meta.
In particular, if you fine-tune your own model on top of Llama 3.1, you must prominently include “Built with Llama” on your website or documentation. The fine-tuned model’s name must also start with “Llama”.
GPU requirements
To run the Llama 3.1 model effectively, you will need access to a GPU due to its substantial computational requirements. The easiest way to access a GPU is through Modal, a cloud platform designed for running machine learning workloads. Modal simplifies the process of deploying AI models by automatically provisioning the necessary GPU resources, allowing you to focus on your application without the hassle of managing infrastructure.
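For example, attaching a GPU to a Modal function takes a single decorator argument. The sketch below is illustrative only; the app and function names are made up and are not part of the full example later on this page.

import modal

app = modal.App("gpu-hello")
image = modal.Image.debian_slim().pip_install("torch")

@app.function(gpu="H100", image=image)
def gpu_info():
    import torch
    # Runs in a cloud container that Modal provisions with an H100 attached
    print(torch.cuda.get_device_name(0))

@app.local_entrypoint()
def main():
    gpu_info.remote()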
Example code for running the Llama 3.1 8B Instruct LLM on Modal
To run the following code, you will need to:
- Create an account at modal.com
- Run `pip install modal` to install the Modal Python package
- Run `modal setup` to authenticate (if this doesn’t work, try `python -m modal setup`)
- Copy the code below into a file called `app.py`
- Run `modal run app.py`
Please note that this code is not optimized for best performance. To run Llama 3.1 8B Instruct with an LLM serving framework like vLLM for better latency and throughput, refer to this more detailed example.
import modal

MODEL_ID = "NousResearch/Meta-Llama-3-8B"
MODEL_REVISION = "315b20096dc791d381d514deb5f8bd9c8d6d3061"

# Container image with the libraries needed to download and run the model
image = modal.Image.debian_slim().pip_install("transformers", "torch", "accelerate")
app = modal.App("example-base-Meta-Llama-3-8B", image=image)

GPU_CONFIG = "H100:2"
CACHE_DIR = "/cache"
# Persistent Volume so the weights are downloaded once and reused across runs
cache_vol = modal.Volume.from_name("hf-hub-cache", create_if_missing=True)

@app.cls(
    gpu=GPU_CONFIG,
    volumes={CACHE_DIR: cache_vol},
    allow_concurrent_inputs=15,
    container_idle_timeout=60 * 10,
    timeout=60 * 60,
)
class Model:
    @modal.enter()
    def setup(self):
        # Runs once when a container starts: load the model onto the GPUs
        import torch
        import transformers

        self.pipeline = transformers.pipeline(
            "text-generation",
            model=MODEL_ID,
            revision=MODEL_REVISION,
            cache_dir=CACHE_DIR,
            model_kwargs={"torch_dtype": torch.bfloat16},
            device_map="auto",
        )

    @modal.method()
    def generate(self, prompt: str):
        return self.pipeline(prompt)

# Run the model
@app.local_entrypoint()
def main(prompt: str = None):
    if prompt is None:
        prompt = "Please write a Python function to compute the Fibonacci numbers."
    print(Model().generate.remote(prompt))
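The local entrypoint turns its parameters into CLI flags, so you can pass your own prompt when running the app (the prompt below is just an example):

modal run app.py --prompt "Write a haiku about GPUs."

The first run downloads the model weights into the hf-hub-cache Volume mounted at /cache; subsequent runs should start faster because the cache is reused.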
Additional resources
- How to run Llama 3.1 8B on Modal with TensorRT-LLM
- How to run Llama 3.1 70B on Modal
- Llama 3.1 launch post
- Hugging Face model card for Llama 3.1
- vLLM GitHub Repository - vLLM is a framework for fast serving of large language models, including the Llama family; a minimal usage sketch follows below.
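To give a flavor of the vLLM route mentioned above, here is a minimal offline-inference sketch. It is illustrative only, not part of the Modal example: it assumes a machine with a suitable GPU, a recent vLLM release, and access to the Llama 3.1 8B Instruct weights on Hugging Face (the official meta-llama repository is gated, so you may need to request access or point at a mirror).

from vllm import LLM, SamplingParams

# Load Llama 3.1 8B Instruct (assumes access to the gated meta-llama repo or a mirror)
llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=256)

# Use the chat API so the instruct model's prompt template is applied
messages = [{"role": "user", "content": "Please write a Python function to compute the Fibonacci numbers."}]
outputs = llm.chat(messages, sampling_params=params)
print(outputs[0].outputs[0].text)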