
Introduction to Llama 3.1 8B Instruct
Meta Llama 3.1 is a family of open-source LLMs available in three sizes: 8B, 70B, and 405B parameters. While larger models like the 405B variant deliver higher quality, they are also significantly more expensive to run. Llama 3.1 8B is a good choice for applications that need to balance quality and cost.
Licensing terms
When using the Llama 3.1 model, it’s important to be aware of the licensing terms set by Meta.
In particular, if you fine-tune your own model on top of Llama 3.1, you must prominently include “Built with Llama” on your website or documentation. The fine-tuned model’s name must also start with “Llama”.
GPU requirements
To run the Llama 3.1 model effectively, you will need access to a GPU due to its substantial computational requirements. The easiest way to access a GPU is through Modal, a cloud platform designed for running machine learning workloads. Modal simplifies the process of deploying AI models by automatically provisioning the necessary GPU resources, allowing you to focus on your application without the hassle of managing infrastructure.
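For example, attaching a GPU to a Modal function takes a single decorator argument. The sketch below is illustrative only; the app and function names are made up and are not part of the full example later on this page.

import modal

app = modal.App("gpu-hello")
image = modal.Image.debian_slim().pip_install("torch")

@app.function(gpu="H100", image=image)
def gpu_info():
    import torch
    # Runs in a cloud container that Modal provisions with an H100 attached
    print(torch.cuda.get_device_name(0))

@app.local_entrypoint()
def main():
    gpu_info.remote()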
Example code for running the Llama 3.1 8B Instruct LLM on Modal
To run the following code, you will need to:
- Create an account at modal.com
- Run `pip install modal` to install the Modal Python package
- Run `modal setup` to authenticate (if this doesn’t work, try `python -m modal setup`)
- Copy the code below into a file called `app.py`
- Run `modal run app.py`
Please note that this code is not optimized for best performance. To run Llama 3.1 8B Instruct with an LLM serving framework like vLLM for better latency and throughput, refer to this more detailed example.
import modal

MODEL_ID = "NousResearch/Meta-Llama-3-8B"
MODEL_REVISION = "315b20096dc791d381d514deb5f8bd9c8d6d3061"

# Container image with the libraries needed to download and run the model
image = modal.Image.debian_slim().pip_install("transformers", "torch", "accelerate")
app = modal.App("example-base-Meta-Llama-3-8B", image=image)

GPU_CONFIG = "H100:2"
CACHE_DIR = "/cache"
# Persistent Volume so the weights are downloaded once and reused across runs
cache_vol = modal.Volume.from_name("hf-hub-cache", create_if_missing=True)

@app.cls(
    gpu=GPU_CONFIG,
    volumes={CACHE_DIR: cache_vol},
    allow_concurrent_inputs=15,
    container_idle_timeout=60 * 10,
    timeout=60 * 60,
)
class Model:
    @modal.enter()
    def setup(self):
        # Runs once when a container starts: load the model onto the GPUs
        import torch
        import transformers

        self.pipeline = transformers.pipeline(
            "text-generation",
            model=MODEL_ID,
            revision=MODEL_REVISION,
            cache_dir=CACHE_DIR,
            model_kwargs={"torch_dtype": torch.bfloat16},
            device_map="auto",
        )

    @modal.method()
    def generate(self, prompt: str):
        return self.pipeline(prompt)

# Run the model
@app.local_entrypoint()
def main(prompt: str = None):
    if prompt is None:
        prompt = "Please write a Python function to compute the Fibonacci numbers."
    print(Model().generate.remote(prompt))
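The local entrypoint turns its parameters into CLI flags, so you can pass your own prompt when running the app (the prompt below is just an example):

modal run app.py --prompt "Write a haiku about GPUs."

The first run downloads the model weights into the hf-hub-cache Volume mounted at /cache; subsequent runs should start faster because the cache is reused.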
Additional resources
- How to run Llama 3.1 8B on Modal with TensorRT-LLM
- How to run Llama 3.1 70B on Modal
- Llama 3.1 launch post
- Hugging Face model card for Llama 3.1
- vLLM GitHub Repository - vLLM is a framework for fast serving of large language models, including the Llama family; a minimal usage sketch follows below.
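To give a flavor of the vLLM route mentioned above, here is a minimal offline-inference sketch. It is illustrative only, not part of the Modal example: it assumes a machine with a suitable GPU, a recent vLLM release, and access to the Llama 3.1 8B Instruct weights on Hugging Face (the official meta-llama repository is gated, so you may need to request access or point at a mirror).

from vllm import LLM, SamplingParams

# Load Llama 3.1 8B Instruct (assumes access to the gated meta-llama repo or a mirror)
llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=256)

# Use the chat API so the instruct model's prompt template is applied
messages = [{"role": "user", "content": "Please write a Python function to compute the Fibonacci numbers."}]
outputs = llm.chat(messages, sampling_params=params)
print(outputs[0].outputs[0].text)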