September 1, 2024 · 3 minute read
How much VRAM do I need for LLM inference?
Yiren Lu (@YirenLu)
Solutions Engineer

The rule of thumb

A quick rule of thumb for LLM serving: a model loaded in “half precision” (16 bits, i.e. 2 bytes per parameter) needs approximately 2GB of GPU memory per 1B parameters.

Example

Let’s calculate for Llama3-70B loaded in 16-bit precision:

70B x 2GB/B = 140GB

A single A100 80GB wouldn’t be enough, but 2x A100 80GB should suffice.

Impact of quantization

You can decrease the amount of GPU memory needed by quantizing the model, i.e. reducing the precision of its weights. Common quantization levels include:

16-bit: Also called “half-precision”, often used as the default, balancing precision and memory usage.

8-bit: Generally achieves similar performance to 16-bit while halving memory requirements.

4-bit: Significantly reduces memory needs but may noticeably impact model performance.

You can load Hugging Face models at half, 8-bit, or 4-bit precision with a few parameter changes in the transformers library, as shown in the sketch below.
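Here is a minimal sketch of what those parameter changes look like, assuming transformers, accelerate, and bitsandbytes are installed (the model ID is just an example):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3-70B"  # example model ID

# Half precision (16-bit)
model_fp16 = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

# 8-bit quantization (requires bitsandbytes)
model_int8 = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)

# 4-bit quantization (requires bitsandbytes)
model_int4 = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)
```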

Precision matters

To calculate the memory needed for a model with quantization, you can use the following formula:

M = (P x (Q/8)) x 1.2

Where:

M: GPU memory (VRAM) expressed in gigabytes

P: The number of parameters in the model, in billions (e.g., 70 for a 70B model)

Q: The number of bits used for loading the model (e.g., 16, 8, or 4 bits)

1.2: Represents a 20% overhead for additional memory use such as the key-value cache, which stores self-attention tensors to speed up inference.

Example with quantization

Let’s consider 4-bit quantization of Llama3-70B:

70 x (4/8) x 1.2 = 42GB

This could run on 2x A10 24GB GPUs.
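If you want to sanity-check these estimates, the formula is easy to apply in code. Here is a minimal sketch (the function name is illustrative):

```python
def estimate_vram_gb(params_billion: float, bits: int, overhead: float = 1.2) -> float:
    """Estimate inference VRAM in GB: (billions of parameters x bytes per parameter) x overhead."""
    bytes_per_param = bits / 8
    return params_billion * bytes_per_param * overhead


# Llama3-70B at different precisions
print(estimate_vram_gb(70, 16))  # 168.0 GB
print(estimate_vram_gb(70, 8))   # 84.0 GB
print(estimate_vram_gb(70, 4))   # 42.0 GB (matches the example above)
```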
