September 18, 2024 · 4 minute read
How to run Llama 3.1 as an API
Kenny Ning (@kenny_ning), Growth Engineer

Llama 3.1 is Meta’s latest family of large language models (LLMs), and it is quickly becoming the standard in the open-source LLM space. Llama 3.1 comes in three sizes (8B, 70B, 405B), each with a fine-tuned Instruct variant optimized for instruction following and dialogue.

Serving Llama 3.1 as an API requires significant compute, especially if you are using the 405B version. This guide will walk you through how to do this on Modal’s serverless compute platform, giving you access to the latest GPUs (like A100s and H100s) while only paying for what you use.

How to serve Llama 3.1 8B as an API

Check out our detailed example, which uses the open-source serving framework vLLM to serve Llama 3.1 8B in OpenAI-compatible mode.
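
At a high level, the example boils down to something like the sketch below. This is not the exact code in the repo; it assumes vLLM’s OpenAI-compatible entrypoint and a Hugging Face token stored as a Modal secret named "huggingface" for the gated weights.

```python
# Minimal sketch of serving Llama 3.1 8B with vLLM on Modal.
# Assumptions: vLLM's OpenAI-compatible entrypoint, and a Hugging Face
# token stored as a Modal secret named "huggingface" for the gated weights.
import subprocess

import modal

MODEL_NAME = "meta-llama/Meta-Llama-3.1-8B-Instruct"

image = modal.Image.debian_slim(python_version="3.11").pip_install(
    "vllm", "huggingface_hub"
)

app = modal.App("llama-3-1-8b-api", image=image)


@app.function(
    gpu="A100",  # a single A100 is plenty for the 8B weights
    secrets=[modal.Secret.from_name("huggingface")],
    timeout=60 * 60,
)
@modal.web_server(port=8000, startup_timeout=10 * 60)
def serve():
    # Launch vLLM's OpenAI-compatible HTTP server inside the container.
    subprocess.Popen(
        f"python -m vllm.entrypoints.openai.api_server "
        f"--model {MODEL_NAME} --host 0.0.0.0 --port 8000",
        shell=True,
    )
```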

To run this example, you first need to create a Modal account and clone our examples repo.
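
Once the example is deployed with modal deploy, the server speaks the OpenAI API, so you can query it with the standard openai Python client. The base_url below is a placeholder; use the URL Modal prints when you deploy.

```python
# Querying the deployed, OpenAI-compatible endpoint.
# The base_url is a placeholder; use the URL Modal prints on deploy.
from openai import OpenAI

client = OpenAI(
    base_url="https://your-workspace--llama-3-1-8b-api-serve.modal.run/v1",
    api_key="not-needed-unless-you-add-auth",
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Tell me one fun fact about llamas."}],
)
print(response.choices[0].message.content)
```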

How to serve Llama 3.1 70B as an API

You can edit the linked example above to download the 70B model instead of the 8B. You will also need more GPU memory; I’d recommend starting with two A100 80GB GPUs.
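
Roughly, the edits amount to swapping the model ID, requesting more GPU memory, and telling vLLM to shard the model across both GPUs. The variable names below are illustrative, not necessarily the exact ones in the example.

```python
# Illustrative edits for 70B: bigger model, two 80 GB A100s, tensor parallelism.
import modal

MODEL_NAME = "meta-llama/Meta-Llama-3.1-70B-Instruct"
N_GPU = 2
GPU_CONFIG = modal.gpu.A100(size="80GB", count=N_GPU)

# Then pass GPU_CONFIG to @app.function(gpu=GPU_CONFIG, ...) and add
# --tensor-parallel-size 2 to the vLLM server command so the model is
# sharded across both GPUs.
```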

How to serve Llama 3.1 405B as an API

This is the largest Llama 3.1 model and requires even more VRAM. Check out our full guide and corresponding gist.
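
As a rough sketch (the exact settings are in the linked guide, so treat these values as assumptions), one workable configuration is the official FP8 checkpoint sharded across eight H100s.

```python
# Assumed configuration for 405B, not the guide's exact settings:
# the official FP8 checkpoint sharded across eight H100s.
import modal

MODEL_NAME = "meta-llama/Meta-Llama-3.1-405B-Instruct-FP8"
N_GPU = 8
GPU_CONFIG = modal.gpu.H100(count=N_GPU)

# As with 70B, vLLM needs --tensor-parallel-size 8 so the weights
# span all eight GPUs.
```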

Pricing

Modal’s pricing is usage-based. For example, if you use two A100 80GB GPUs for 10 minutes at a rate of $4.75 per GPU-hour, that would cost you $4.75 × 2 GPUs × (10/60) hours ≈ $1.58.
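
Here’s that back-of-the-envelope calculation as a snippet, if you want to plug in your own numbers.

```python
# Back-of-the-envelope cost for the example above.
rate_per_gpu_hour = 4.75  # $/hr for an A100 80GB; check current pricing
n_gpus = 2
hours = 10 / 60  # 10 minutes

print(f"${rate_per_gpu_hour * n_gpus * hours:.2f}")  # $1.58
```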

In production, Modal containers automatically spin down when there is no usage and auto-scale up when there is, giving you a nice balance between cost and performance.
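
The main knobs for that behavior live on the function decorator. The parameter names below follow Modal’s API at the time of writing; check the docs for your version.

```python
# Scaling knobs on the serving function (parameter names follow Modal's
# API at the time of writing; check the docs for your version).
import modal

app = modal.App("llama-3-1-8b-api")


@app.function(
    gpu="A100",
    container_idle_timeout=5 * 60,  # spin down after 5 idle minutes
    keep_warm=0,  # set to 1+ to keep a warm replica (fewer cold starts, more cost)
    allow_concurrent_inputs=16,  # let each replica batch multiple requests
)
@modal.web_server(port=8000)
def serve():
    ...  # launch vLLM as in the 8B sketch above
```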

Fine-tuning

The biggest reason to choose Llama 3.1 as your LLM is its generous community license: you can fine-tune Llama 3.1 and serve your model as a commercial product. Modal’s general-purpose compute platform also makes it a great choice for fine-tuning LLMs, especially compared to API providers that often offer very limited fine-tuning options. Check out our LLM fine-tuning example to learn more.

Bottom line

For production-grade LLM inference, it’s hard to go wrong with Llama 3.1. Combined with the open-source serving framework vLLM and Modal’s serverless compute platform, you can easily build a Llama 3.1 API to serve your LLM use cases at scale, all at a cost-effective price point.
