## The rule of thumb

A quick rule of thumb for serving LLMs loaded in “half precision”, i.e. 16 bits or 2 bytes per parameter, is approximately **2GB of GPU memory per 1B parameters in the model**.

## Example

Let’s calculate for Llama3-70B loaded in 16-bit precision:

`70B x 2GB/B = 140GB`

A single A100 80GB wouldn’t be enough, but 2x A100 80GB should suffice.
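
In Python, the rule of thumb amounts to a single multiplication (the 70B and 80GB figures are just the numbers from this example):

```python
params_billions = 70               # Llama3-70B
memory_gb = params_billions * 2    # ~2GB per 1B parameters at 16-bit
print(memory_gb)                   # 140 -> more than one 80GB A100, fits across two
```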

## Impact of quantization

You can decrease the amount of GPU memory needed by quantizing the model, i.e. reducing the precision of its weights. Common quantization levels include:

**16-bit:** Also called “half-precision”, often used as the default, balancing precision and memory usage.

**8-bit:** Generally achieves similar performance to 16-bit while halving memory requirements.

**4-bit:** Significantly reduces memory needs but may noticeably impact model performance.

You can load Hugging Face models at half, 8-bit, or 4-bit precision through simple parameter changes in the transformers library.
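
As a rough sketch, assuming the transformers, accelerate, and bitsandbytes packages are installed and using a Llama 3 70B checkpoint as a placeholder model id, those parameter changes look like this (in practice you would pick one of the three):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3-70B"  # placeholder; any causal LM id works

# 16-bit (half precision)
model_fp16 = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# 8-bit quantization via bitsandbytes
model_int8 = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)

# 4-bit quantization via bitsandbytes
model_int4 = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)
```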

## Precision matters

To estimate the GPU memory needed to serve a model at a given precision, you can use the following formula:

`M = (P x (Q/8)) x 1.2`

Where:

`M`: GPU memory (VRAM) expressed in gigabytes

`P`: The number of parameters in the model, in billions (e.g., 70 for a 70B model)

`Q`: The number of bits used for loading the model (e.g., 16, 8, or 4 bits)

`1.2`: A 20% overhead for additional memory use such as key-value caching, where self-attention tensors are cached for faster inference
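
Plugging the formula into a few lines of Python (a minimal sketch; `estimate_vram_gb` is just an illustrative name):

```python
def estimate_vram_gb(params_billions: float, bits: int, overhead: float = 1.2) -> float:
    """M = (P x (Q/8)) x 1.2 -- serving memory in GB, including the 20% overhead."""
    return params_billions * (bits / 8) * overhead

print(estimate_vram_gb(70, 16))  # 168.0 -> the 140GB rule of thumb plus the 20% overhead
print(estimate_vram_gb(70, 4))   # 42.0  -> the 4-bit example worked out below
```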

## Example with quantization

Let’s consider 4-bit quantization of Llama3-70B:

`70 x (4/8) x 1.2 = 42GB`

This could run on 2x A10 24GB GPUs.
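
To turn a memory estimate into a GPU count, divide by the per-GPU VRAM and round up; a rough sketch, assuming the model can be sharded evenly across identical GPUs:

```python
import math

def min_gpus(memory_gb: float, vram_per_gpu_gb: float) -> int:
    # Smallest number of identical GPUs whose combined VRAM covers the estimate
    return math.ceil(memory_gb / vram_per_gpu_gb)

print(min_gpus(42, 24))   # 2 -> two 24GB A10s for 4-bit Llama3-70B
print(min_gpus(140, 80))  # 2 -> two 80GB A100s for the 16-bit rule-of-thumb estimate
```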