The rule of thumb
A quick rule of thumb for serving LLMs loaded in “half precision” (i.e., 16 bits) is approximately 2GB of GPU memory per 1B parameters in the model.
Example
Let’s calculate for Llama3-70B loaded in 16-bit precision:
70B x 2GB/B = 140GB
A single A100 80GB wouldn’t be enough, but 2x A100 80GB should suffice.
Impact of quantization
You can decrease the amount of GPU memory needed by quantizing, essentially reducing the precision of the weights of the model. Common quantization levels include:
16-bit: Also called “half-precision”, often used as the default, balancing precision and memory usage.
8-bit: Generally achieves similar performance to 16-bit while halving memory requirements.
4-bit: Significantly reduces memory needs but may noticeably impact model performance.
You can load Hugging Face models at half, 8-bit, or 4-bit precision with simple parameter changes in the transformers library, as sketched below.
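A minimal sketch of what those parameter changes look like, assuming the transformers, accelerate, and bitsandbytes packages are installed; the model ID here is a placeholder you would swap for the checkpoint you actually serve:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3-70B"  # placeholder; use your own checkpoint

# 16-bit ("half precision"): roughly 2GB per 1B parameters
model_fp16 = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# 8-bit: roughly half the fp16 footprint (needs the bitsandbytes package)
model_int8 = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)

# 4-bit: roughly a quarter of the fp16 footprint
model_int4 = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)
```

device_map="auto" lets accelerate spread the weights across the available GPUs, which a 70B model will need in practice.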
Precision matters
To calculate the memory needed for a model with quantization, you can use the following formula:
M = (P x (Q/8)) x 1.2
Where:
M: GPU memory (VRAM) expressed in gigabytes
P: The number of parameters in the model (e.g., 70 for a 70B model)
Q: The number of bits used for loading the model (e.g., 16, 8, or 4 bits)
1.2: Represents a 20% overhead for additional tasks like key-value caching, where you cache self-attention tensors for faster inference.
Example with quantization
Let’s consider 4-bit quantization of Llama3-70B:
70 x (4/8) x 1.2 = 42GB
This could run on 2x A10 24GB GPUs.
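Reusing the estimate_vram_gb helper sketched above, you can check how many cards of a given size the estimate calls for; this covers weights plus the 20% overhead only, so treat it as a lower bound rather than a guarantee of fit:

```python
import math

required_gb = estimate_vram_gb(70, 4)        # 42 GB for 4-bit Llama3-70B
a10_vram_gb = 24                             # A10 card size from the example
print(math.ceil(required_gb / a10_vram_gb))  # 2 GPUs
```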