January 21, 2025 · 5 minute read
How to run Llama 3.1 8B Instruct on Modal
Yiren Lu (@YirenLu), Solutions Engineer

Introduction to Llama 3.1 8B Instruct

Meta Llama 3.1 is a family of open-source LLMs available in three sizes: 8B, 70B, and 405B parameters. While larger variants like the 405B deliver higher output quality, they are also significantly more expensive to run. Llama 3.1 8B is a good choice for applications that need to balance quality and cost.

Licensing terms

When using the Llama 3.1 model, it’s important to be aware of the licensing terms set by Meta.

In particular, if you fine-tune your own model on top of Llama 3.1, you must prominently include “Built with Llama” on your website or documentation. The fine-tuned model’s name must also start with “Llama”.

GPU requirements

To run the Llama 3.1 model effectively, you will need access to a GPU due to its substantial computational requirements. The easiest way to access a GPU is through Modal, a cloud platform designed for running machine learning workloads. Modal simplifies the process of deploying AI models by automatically provisioning the necessary GPU resources, allowing you to focus on your application without the hassle of managing infrastructure.
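A quick back-of-the-envelope calculation shows why a single modern datacenter GPU is enough for the 8B model. The sketch below is only a rough estimate of the memory needed for the weights themselves; inference also needs headroom for activations and the KV cache.

params = 8e9  # 8 billion parameters
bytes_per_param = 2  # bfloat16 stores each parameter in 2 bytes
weights_gb = params * bytes_per_param / 1e9
print(f"~{weights_gb:.0f} GB of weights")  # ~16 GB, well within one 80 GB H100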

Example code for running the Llama 3.1 8B Instruct LLM on Modal

To run the following code, you will need to:

  1. Create an account at modal.com
  2. Run pip install modal to install the modal Python package
  3. Run modal setup to authenticate (if this doesn’t work, try python -m modal setup)
  4. Copy the code below into a file called app.py
  5. Run modal run app.py

Please note that this code is not optimized for best performance. To serve Llama 3.1 8B Instruct with an LLM serving framework like vLLM for better latency and throughput, refer to this more detailed example.

import modal

MODEL_ID = "NousResearch/Meta-Llama-3-8B"
MODEL_REVISION = "315b20096dc791d381d514deb5f8bd9c8d6d3061"

image = modal.Image.debian_slim().pip_install("transformers", "torch", "accelerate")
app = modal.App("example-llama-3-1-8b-instruct", image=image)

# A single H100 (80 GB) comfortably holds the ~16 GB of bfloat16 weights.
GPU_CONFIG = "H100"

# Persist downloaded weights in a Volume so cold starts skip the download.
CACHE_DIR = "/cache"
cache_vol = modal.Volume.from_name("hf-hub-cache", create_if_missing=True)


@app.cls(
    gpu=GPU_CONFIG,
    volumes={CACHE_DIR: cache_vol},
    allow_concurrent_inputs=15,  # let one container serve up to 15 requests at once
    container_idle_timeout=60 * 10,  # keep warm containers around for 10 minutes
    timeout=60 * 60,  # allow up to an hour for any single input
)
class Model:
    @modal.enter()
    def setup(self):
        import torch
        import transformers

        # Load the model once per container, when it starts.
        self.pipeline = transformers.pipeline(
            "text-generation",
            model=MODEL_ID,
            revision=MODEL_REVISION,
            model_kwargs={
                "torch_dtype": torch.bfloat16,  # half-precision weights, ~16 GB
                "cache_dir": CACHE_DIR,  # download into the mounted Volume
            },
            device_map="auto",  # place the weights on the available GPU
        )

    @modal.method()
    def generate(self, prompt: str):
        # Instruct models expect chat-formatted input; the pipeline applies the
        # chat template and returns the conversation, so take the assistant's reply.
        messages = [{"role": "user", "content": prompt}]
        return self.pipeline(messages, max_new_tokens=512)[0]["generated_text"][-1]["content"]


# Run the model
@app.local_entrypoint()
def main(prompt: str = "Please write a Python function to compute the Fibonacci numbers."):
    print(Model().generate.remote(prompt))
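Running modal run app.py executes the entrypoint once and then spins everything down. If you instead deploy the app with modal deploy app.py, you can call the model from any Python process. Here is a minimal sketch, assuming the app name used above and a recent version of the modal client:

import modal

# Look up the deployed class by app name and class name.
Model = modal.Cls.from_name("example-llama-3-1-8b-instruct", "Model")
print(Model().generate.remote("Explain the difference between a list and a tuple in Python."))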
