August 22, 2024 · 5 minute read
LoRA vs. QLoRA: Efficient fine-tuning techniques for LLMs
Yiren Lu (@YirenLu), Solutions Engineer

Fine-tuning LLMs can be a computationally expensive and time-consuming process, especially for models with billions of parameters. However, newer fine-tuning techniques make it possible to fine-tune LLMs far more efficiently by reducing the number of parameters that need to be updated. In this blog post, we’ll explore two such techniques, LoRA and QLoRA, and discuss their differences, pros, and cons.

Overview of LoRA and QLoRA

|  | Full FT | LoRA | QLoRA |
| --- | --- | --- | --- |
| GB VRAM* (memory needed per 1GB of model) | 16+ | 2+ | 0.5+ |
| % of params trained | ~100% | 0.5–5% | 0.5–5% |
| Speed | Slow | Fast | Slightly slower than LoRA |
| Quality | Can overfit | Stable and accurate | Can lose accuracy |

The problems with full fine-tuning

Before diving into the efficient techniques, let’s briefly review the challenges of traditional full fine-tuning:

  • Updates every single parameter in the base model
  • Because of the large number of parameters that need to be updated, requires significant computational resources, typically 60GB+ of VRAM for a 7B parameter model
  • Slow
  • Prone to overfitting, especially when working with smaller datasets

What is LoRA?

LoRA, short for Low-Rank Adaptation, is a fine-tuning technique introduced by Microsoft researchers in their paper LoRA: Low-Rank Adaptation of Large Language Models.

The main idea behind LoRA is that instead of updating all the pre-trained weights, you freeze them and train smaller “adapter” matrices that represent the update to the base model.

In a standard neural network layer, we have:

Y = WX + b

Where:

  • W is the weight matrix
  • X is the input
  • b is the bias
  • Y is the output

LoRA modifies this as follows:

Y = (W + BA)X + b

Where:

  • W is the frozen pre-trained weight matrix
  • B and A are low-rank matrices: if W is d × k, then B is d × r and A is r × k for some small rank r, so together they contain far fewer parameters than W and can be stored and trained much more efficiently
  • BA is the product of these matrices, representing the update to W
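The modified equation above can be sketched in a few lines of NumPy. The dimensions and rank below are illustrative choices, not values from the paper (in practice, ranks of roughly 4–64 are common). Initializing B to zero means the model starts out behaving exactly like the frozen base model:

```python
import numpy as np

rng = np.random.default_rng(0)

d, k, r = 1024, 1024, 8  # layer dimensions and LoRA rank (r << d, k)

W = rng.standard_normal((d, k))  # frozen pre-trained weight matrix
b = np.zeros(d)                  # frozen bias
B = np.zeros((d, r))             # trainable; zero-init so BA = 0 at the start
A = rng.standard_normal((r, k))  # trainable

x = rng.standard_normal(k)

# LoRA forward pass: Y = (W + BA)X + b
y = (W + B @ A) @ x + b

# Only A and B are trained
full_params = W.size           # 1,048,576
lora_params = A.size + B.size  # 16,384 -> about 1.6% of the full count
```

Note that because B starts at zero, `y` here is identical to the base model's output `W @ x + b`; training then moves the adapter away from zero.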

Upshot

  • Only a small number of parameters (in A and B) need to be trained
  • Uses way less VRAM, and most of the VRAM requirement is for loading the base model, not for training
  • Can result in less overfitting compared to full fine-tuning
  • Can be applied selectively to certain layers or components of the model
  • Multiple LoRA modules can be trained for different tasks and swapped out as needed
  • Can use a higher learning rate due to the smaller number of parameters
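The adapter-swapping point above can be made concrete with a toy sketch. The task names and random matrices below are stand-ins, not a real training setup; the point is that one frozen base W can be merged with different (B, A) pairs via W' = W + BA:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 512, 8
W = rng.standard_normal((d, d))  # one frozen base model, loaded once

# Two task-specific adapters, trained independently (random stand-ins here)
adapters = {
    "summarize": (rng.standard_normal((d, r)), rng.standard_normal((r, d))),
    "translate": (rng.standard_normal((d, r)), rng.standard_normal((r, d))),
}

def merged_weight(task):
    """Merge a task's adapter into the frozen base: W' = W + BA."""
    B, A = adapters[task]
    return W + B @ A

x = rng.standard_normal(d)
y_sum = merged_weight("summarize") @ x  # same base weights,
y_tr = merged_weight("translate") @ x   # different behavior per task
```

Each adapter is tiny relative to W (here 2 × 512 × 8 values vs. 512 × 512), which is why storing and swapping many task-specific adapters is cheap.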

What is QLoRA?

QLoRA, or Quantized LoRA, is an extension of the LoRA technique that further reduces the memory footprint of fine-tuning by quantizing the frozen base model. Introduced in the paper QLoRA: Efficient Finetuning of Quantized LLMs, QLoRA stores the pre-trained weights in a 4-bit data type (4-bit NormalFloat, or NF4) instead of 16- or 32-bit floating point, dequantizing them on the fly during the forward pass, while the trainable A and B adapter matrices remain in higher precision.

By quantizing the base model, QLoRA achieves roughly a 4x reduction in memory usage compared to 16-bit LoRA, making it possible to fine-tune even larger models on resource-constrained devices.
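To illustrate the idea, here is a deliberately simplified sketch using absmax int8 quantization rather than the paper's 4-bit NormalFloat (which requires a lookup table of quantile-based levels). The base weights are stored in 1 byte each instead of 4 and dequantized for the forward pass, while the adapters stay in full precision:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256)).astype(np.float32)  # base weights

# Simplified absmax int8 quantization (QLoRA itself uses 4-bit NF4):
# scale so the largest-magnitude weight maps to the int8 extreme, 127
scale = np.abs(W).max() / 127.0
W_q = np.round(W / scale).astype(np.int8)  # stored: 1 byte/weight vs. 4

# Dequantize for the forward pass; adapters remain float32
W_dq = W_q.astype(np.float32) * scale
B = np.zeros((256, 8), dtype=np.float32)
A = rng.standard_normal((8, 256)).astype(np.float32)

x = rng.standard_normal(256).astype(np.float32)
y = (W_dq + B @ A) @ x  # gradients only flow into A and B

mem_saving = W.nbytes / W_q.nbytes  # 4x: float32 -> int8
```

The rounding step is where accuracy can be lost: `W_dq` only approximates `W`, which is why the quality caveats below apply to the base model rather than the adapters.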

Upshot

  • Further reduces the memory footprint of fine-tuning
  • Can lead to a loss of knowledge and a lower-quality fine-tune, but not necessarily. Sometimes the quantization actually reduces overfitting.
  • The loss of knowledge is also mitigated because the adapters themselves are generally not quantized; it is the quantized base model whose reduced precision costs some performance

Which one should you use?

  • If you have access to hardware with enough VRAM, use LoRA. Refer to the table below for a rough estimate of the memory requirements for different model sizes.
| Method | Bits | 7B | 13B | 30B | 70B | 110B | 8x7B | 8x22B |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Full | AMP | 120GB | 240GB | 600GB | 1200GB | 2000GB | 900GB | 2400GB |
| Full | 16 | 60GB | 120GB | 300GB | 600GB | 900GB | 400GB | 1200GB |
| LoRA | 16 | 16GB | 32GB | 64GB | 160GB | 240GB | 120GB | 320GB |
| QLoRA | 8 | 10GB | 20GB | 40GB | 80GB | 140GB | 60GB | 160GB |
| QLoRA | 4 | 6GB | 12GB | 24GB | 48GB | 72GB | 30GB | 96GB |
| QLoRA | 2 | 4GB | 8GB | 16GB | 24GB | 48GB | 18GB | 48GB |
  • If you don’t have enough VRAM — for example, if you only have access to a free T4 on Google Colab — try QLoRA.
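The per-parameter rates implied by the 7B column of the table above can be turned into a rough estimator. This is a rule of thumb derived from those table entries, not an exact requirement (real usage varies with sequence length, batch size, and optimizer):

```python
# GB of VRAM per billion parameters, derived from the 7B column above:
# full 16-bit = 60/7, LoRA 16-bit = 16/7, QLoRA 4-bit = 6/7
GB_PER_BILLION = {"full16": 60 / 7, "lora16": 16 / 7, "qlora4": 6 / 7}

def estimate_vram_gb(params_billions: float, method: str) -> float:
    """Rough VRAM estimate for fine-tuning, per the table's rule of thumb."""
    return params_billions * GB_PER_BILLION[method]

# e.g. a 13B model with 4-bit QLoRA:
print(round(estimate_vram_gb(13, "qlora4")))  # ~11 GB (table says 12GB)
```

A free Colab T4 has 16GB of VRAM, so by this estimate it can handle 4-bit QLoRA for models up to roughly the 13B range, but not 16-bit LoRA on a 7B model.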