Meta’s Llama3-405B represents a new frontier in open-source large language models, offering capabilities that rival top closed-source AI models. However, due to its size and computational requirements, it can be daunting to run.
This guide will walk you through the process of setting up and running Llama3-405B using vLLM on Modal, a serverless cloud computing platform. For the full code, you can view the gist.
Memory Requirements and Optimization
Running Llama3-405B is resource-intensive, but there are some optimizations to make it more accessible:
- VRAM Requirements: At standard half-precision (FP16), the model would require over 800GB of VRAM just to load.
- 8-bit Quantization: In our code, we use the 8-bit quantized model, which roughly halves the VRAM footprint compared to FP16.
- Multi-GPU Setup: In our example, we distribute the model across 8 A100 GPUs with 80GB each, totaling 640GB of VRAM.
- System Memory: The setup also requests 336GB of system memory to handle data loading and processing (both figures appear in the resource sketch below).
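To make these numbers concrete, the resource request in Modal can look roughly like the following. This is a minimal sketch under assumptions: the app name is illustrative, and the exact GPU spec string and decorator arguments used in the gist's api.py may differ.

```python
import modal

app = modal.App("llama3-405b-vllm")  # illustrative app name

# Request 8 x A100 80GB GPUs (640GB VRAM total) and 336GB of system memory.
# Modal's `memory` argument is given in MiB.
@app.function(gpu="A100-80GB:8", memory=336 * 1024, timeout=60 * 60)
def serve():
    ...  # the actual serving logic lives in api.py (see below)
```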
Prerequisites
Before we begin, ensure you have the following:
- Create an account at modal.com
- Install the Modal Python package by running:
pip install modal
- Authenticate your Modal account by running:
modal setup
If this doesn’t work, try:
python -m modal setup
Running Llama3-405B
To run Llama3-405B, you’ll need to use three separate files from the provided gist. Here’s how to use each one:
1. Downloading the Model (download.py)
First, you need to download the model weights to a Modal volume:
- Save the download.py script from the gist to your local directory.
- Run the command:
modal run download.py
This process may take about 30 minutes. It downloads the model weights and stores them in a Modal volume for faster access in subsequent runs.
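The gist’s download.py is the source of truth, but as a rough sketch it creates a Modal volume and pulls the weights from Hugging Face into it. The repo id, volume name, and secret name below are assumptions for illustration:

```python
import modal

# Illustrative names; the actual gist may use different identifiers.
MODEL_NAME = "meta-llama/Meta-Llama-3.1-405B-Instruct-FP8"  # assumed Hugging Face repo id
volume = modal.Volume.from_name("llama3-405b-weights", create_if_missing=True)
image = modal.Image.debian_slim().pip_install("huggingface_hub", "hf-transfer")
app = modal.App("llama3-405b-download", image=image)

@app.function(
    volumes={"/models": volume},
    secrets=[modal.Secret.from_name("huggingface-secret")],  # HF token for the gated weights
    timeout=2 * 60 * 60,  # generous headroom; the download typically takes ~30 minutes
)
def download_model():
    from huggingface_hub import snapshot_download

    # Download the weights into the mounted volume so later runs can reuse them.
    snapshot_download(MODEL_NAME, local_dir=f"/models/{MODEL_NAME}")
    volume.commit()  # persist the files to the volume
```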
2. Setting Up the vLLM Server (api.py)
Once the model is downloaded, you need to set up the vLLM server:
- Save the api.py script from the gist to your local directory.
- Run the command:
modal deploy api.py
This command deploys an OpenAI-compatible API server on Modal’s infrastructure. It sets up the necessary GPU resources and serves the model through an API.
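Roughly speaking, api.py reserves the GPUs, mounts the volume with the downloaded weights, and launches vLLM’s OpenAI-compatible server. The sketch below is illustrative only; the image contents, volume name, model path, and flags in the actual gist may differ:

```python
import subprocess

import modal

image = modal.Image.debian_slim().pip_install("vllm")  # illustrative; the gist pins exact versions
volume = modal.Volume.from_name("llama3-405b-weights")  # assumed volume name from the download step
app = modal.App("llama3-405b-api", image=image)

MODEL_PATH = "/models/meta-llama/Meta-Llama-3.1-405B-Instruct-FP8"  # assumed path on the volume

@app.function(gpu="A100-80GB:8", memory=336 * 1024, volumes={"/models": volume}, timeout=60 * 60)
@modal.web_server(port=8000, startup_timeout=10 * 60)
def serve():
    # Start vLLM's OpenAI-compatible server, sharding the model across all 8 GPUs.
    subprocess.Popen([
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", MODEL_PATH,
        "--tensor-parallel-size", "8",
        "--api-key", "super-secret-token",
    ])
```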
3. Interacting with the Model (client.py)
Finally, you can interact with the model using the provided client script:
- Save the client.py script from the gist to your local directory.
- Run the script with:
python client.py
This script allows you to send requests to the vLLM server and receive responses. It offers several options for customization:
- --model: Specify a model name (optional; defaults to the first available model)
- --api-key: Set the API key for authentication (default is “super-secret-token”)
- --max-tokens, --temperature, --top-p, etc.: Adjust various generation parameters
- --prompt: Provide a custom prompt (default is a limerick about baboons and raccoons)
- --system-prompt: Set a custom system prompt
- --no-stream: Disable streaming of response chunks
- --chat: Enable interactive chat mode
For example, to start an interactive chat session with a custom system prompt, you could use:
python client.py --chat --system-prompt "You are a helpful AI assistant."
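Under the hood, client.py talks to the deployed endpoint through the standard OpenAI Python SDK, which is why the options above mirror the usual chat-completion parameters. A minimal sketch of that interaction (the base URL is a placeholder for your own Modal deployment URL):

```python
from openai import OpenAI

# The base_url is a placeholder; use the URL printed by `modal deploy api.py`.
client = OpenAI(
    base_url="https://your-workspace--your-app.modal.run/v1",
    api_key="super-secret-token",  # must match the key the server was deployed with
)

# Default to the first model the server advertises, as client.py does.
model_name = client.models.list().data[0].id

response = client.chat.completions.create(
    model=model_name,
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Write a limerick about baboons and raccoons."},
    ],
    max_tokens=256,
    temperature=0.7,
)
print(response.choices[0].message.content)
```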