Fine-tuning LLMs can be a computationally expensive and time-consuming process, especially for models with billions of parameters. However, newer techniques make it possible to fine-tune LLMs far more efficiently by reducing the number of parameters that need to be updated. In this blog post, we’ll explore two such techniques, LoRA and QLoRA, and discuss their differences, pros, and cons.
Table of contents
- Overview of LoRA and QLoRA
- The problems with full fine-tuning
- What is LoRA?
- What is QLoRA?
- Which one should you use?
Overview of LoRA and QLoRA
| | Full FT | LoRA | QLoRA |
|---|---|---|---|
| VRAM needed per 1GB of model weights | 16GB+ | 2GB+ | 0.5GB+ |
| % of params trained | ~100% | 0.5-5% | 0.5-5% |
| Speed | Slow | Fast | Slightly slower than LoRA |
| Quality | Can overfit | Stable and accurate | Can lose accuracy |
The problems with full fine-tuning
Before diving into the efficient techniques, let’s briefly review the challenges of traditional full fine-tuning:
- Updates every single parameter in the base model
- Requires significant computational resources because of the large number of parameters being updated: typically 60GB+ of VRAM for a 7B-parameter model
- Slow
- Prone to overfitting, especially when working with smaller datasets
What is LoRA?
LoRA, short for Low-Rank Adaptation, is a fine-tuning technique introduced by Microsoft researchers in their paper LoRA: Low-Rank Adaptation of Large Language Models.
The main idea behind LoRA is that instead of updating all the pre-trained weights, you freeze them and train smaller “adapter” matrices that represent the update to the base model.
In a standard neural network layer, we have:
Y = WX + b
Where:
- W is the weight matrix
- X is the input
- b is the bias
- Y is the output
LoRA modifies this as follows:
Y = (W + BA)X + b
Where:
- W is the frozen pre-trained weight matrix
- B and A are low-rank matrices: if W is d × k, then B is d × r and A is r × k, with a rank r much smaller than d and k, so together they hold far fewer parameters than W and can be stored much more efficiently
- BA is the product of these matrices, representing the update to W
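To make this concrete, here is a minimal PyTorch sketch of a LoRA-augmented linear layer. The rank (r=8) and scaling factor (alpha=16) are illustrative choices, not prescribed values:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update BA."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        # Freeze the pre-trained weight matrix W and bias b
        for p in self.base.parameters():
            p.requires_grad = False
        # A starts with small random values, B with zeros,
        # so BA is initially a no-op and training perturbs W gradually
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Y = (W + BA)X + b: the frozen path plus the low-rank path
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

# A 4096x4096 weight matrix has ~16.8M parameters; with r=8,
# A and B together contain only 2 * 8 * 4096 = 65,536 trainable ones.
layer = LoRALinear(nn.Linear(4096, 4096), r=8)
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # 65536
```

Zero-initializing B is what the original paper does: it guarantees the model’s behavior is unchanged at the start of training.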
Upshot
- Only a small number of parameters (in A and B) need to be trained
- Uses way less VRAM, and most of the VRAM requirement is for loading the base model, not for training
- Can result in less overfitting compared to full fine-tuning
- Can be applied selectively to certain layers or components of the model
- Multiple LoRA modules can be trained for different tasks and swapped out as needed
- Can use a higher learning rate due to the smaller number of parameters
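In practice you rarely write the adapter logic yourself. Hugging Face’s peft library attaches LoRA modules for you; here is a minimal sketch, where the model ID, rank, and target modules are illustrative placeholders:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Placeholder model ID; substitute the checkpoint you want to tune
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

config = LoraConfig(
    r=8,                                  # rank of the A and B matrices
    lora_alpha=16,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # adapters on attention projections only
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
# Prints the trainable vs. total parameter counts; for a config like this
# it comes out well under 1% of the model
model.print_trainable_parameters()
```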
What is QLoRA?
QLoRA, or Quantized LoRA, is an extension of the LoRA technique that further reduces the memory footprint of fine-tuning by quantizing the frozen base model. Introduced in the paper QLoRA: Efficient Finetuning of Quantized LLMs, QLoRA loads the pre-trained weights in a low-precision 4-bit format (NormalFloat, or NF4), while the low-rank A and B matrices are trained in higher precision.
By quantizing the base model from 16-bit down to 4-bit, QLoRA achieves roughly a 4x reduction in memory usage for the model weights compared to standard LoRA, making it possible to fine-tune even larger models on resource-constrained devices.
Upshot
- Further reduces the memory footprint of fine-tuning
- Can lead to a loss of knowledge and a lower-quality fine-tune, but not necessarily; sometimes the quantization actually reduces overfitting
- The loss of quality is also mitigated because the adapters themselves are generally not quantized: it’s the quantized base model that takes the performance hit
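Here is a minimal sketch of the QLoRA recipe with transformers, bitsandbytes, and peft: the frozen base model is loaded in 4-bit NF4, while the LoRA adapters train in 16-bit. The model ID and hyperparameters are illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Quantize the frozen base model to 4-bit NormalFloat (NF4)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls still run in 16-bit
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # placeholder model ID
    quantization_config=bnb_config,
)
model = prepare_model_for_kbit_training(model)

# The adapters themselves stay unquantized, as noted above
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
```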
Which one should you use?
- If you have access to hardware with enough VRAM, use LoRA. Refer to the table below for a rough estimate of the memory requirements for different model sizes.
| Method | Bits | 7B | 13B | 30B | 70B | 110B | 8x7B | 8x22B |
|---|---|---|---|---|---|---|---|---|
| Full | AMP | 120GB | 240GB | 600GB | 1200GB | 2000GB | 900GB | 2400GB |
| Full | 16 | 60GB | 120GB | 300GB | 600GB | 900GB | 400GB | 1200GB |
| LoRA | 16 | 16GB | 32GB | 64GB | 160GB | 240GB | 120GB | 320GB |
| QLoRA | 8 | 10GB | 20GB | 40GB | 80GB | 140GB | 60GB | 160GB |
| QLoRA | 4 | 6GB | 12GB | 24GB | 48GB | 72GB | 30GB | 96GB |
| QLoRA | 2 | 4GB | 8GB | 16GB | 24GB | 48GB | 18GB | 48GB |
- If you don’t have enough VRAM, for example, if you only have access to a free T4 (16GB) on Google Colab, try QLoRA.
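As a rough sanity check on the full fine-tuning numbers above: mixed-precision training with Adam typically needs about 16 bytes per parameter (2 for the 16-bit weights, 2 for gradients, 4 for an fp32 master copy of the weights, and 8 for Adam’s two fp32 moment estimates), so a 7B model needs roughly 112GB before activations, in line with the 120GB AMP figure. LoRA avoids most of this because gradients and optimizer states are kept only for the tiny A and B matrices.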