August 22, 2024 · 5 minute read
LoRA vs. QLoRA: Efficient fine-tuning techniques for LLMs
Author: Yiren Lu (@YirenLu), Solutions Engineer

Fine-tuning LLMs can be computationally expensive and time-consuming, especially for models with billions of parameters. Newer fine-tuning techniques make the process more efficient by reducing the number of parameters that need to be updated. In this blog post, we’ll explore two such techniques, LoRA and QLoRA, and discuss their differences, pros, and cons.

Overview of LoRA and QLoRA

                                       | Full FT     | LoRA                | QLoRA
GB VRAM* (memory needed per 1GB model) | 16+         | 2+                  | 0.5+
% params trained                       | ~100%       | 0.5-5%              | 0.5-5%
Speed                                  | Slow        | Fast                | Slightly slower than LoRA
Quality                                | Can overfit | Stable and accurate | Can lose accuracy

The problems with full fine-tuning

Before diving into the efficient techniques, let’s briefly review the challenges of traditional full fine-tuning:

  • Updates every single parameter in the base model
  • Requires significant computational resources because so many parameters are updated, typically 60GB+ of VRAM for a 7B parameter model
  • Slow
  • Prone to overfitting, especially when working with smaller datasets

What is LoRA?

LoRA, short for Low-Rank Adaptation, is a fine-tuning technique introduced by Microsoft researchers in their paper LoRA: Low-Rank Adaptation of Large Language Models.

The main idea behind LoRA is that instead of updating all the pre-trained weights, you freeze them and train smaller “adapter” matrices that represent the update to the base model.

In a standard neural network layer, we have:

Y = WX + b

Where:

  • W is the weight matrix
  • X is the input
  • b is the bias
  • Y is the output

LoRA modifies this as follows:

Y = (W + BA)X + b

Where:

  • W is the frozen pre-trained weight matrix
  • B and A are low-rank matrices: if W is d × k, then B is d × r and A is r × k, with the rank r much smaller than d and k, so together they hold far fewer parameters than W and can be stored more efficiently
  • BA is the product of these matrices, representing the update to W
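
The update rule above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the paper's implementation: the helper names (matmul, add, lora_forward), the tiny 2 x 2 shapes, and the rank r = 1 are all assumptions chosen for readability, and the bias term and the scaling factor that LoRA implementations usually apply to BA are left out. It shows that running the frozen path and the adapter path separately gives the same output as first merging them into W + BA.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two same-shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def lora_forward(W, B, A, X):
    """Compute Y = WX + B(AX): the frozen path plus the low-rank adapter path."""
    return add(matmul(W, X), matmul(B, matmul(A, X)))


# Frozen pre-trained weight (d x k), here 2 x 2. Never updated during training.
W = [[1.0, 2.0],
     [3.0, 4.0]]
# Trainable adapter with rank r = 1: B is d x r, A is r x k.
B = [[1.0],
     [2.0]]
A = [[0.5, 1.0]]
# One input column vector (k x 1).
X = [[2.0],
     [1.0]]

Y_adapter = lora_forward(W, B, A, X)

# Merging the adapter into the base weight gives the same result,
# so a trained adapter adds no extra matrix multiply at inference time.
W_merged = add(W, matmul(B, A))
Y_merged = matmul(W_merged, X)

print(Y_adapter)  # [[6.0], [14.0]]
print(Y_merged)   # [[6.0], [14.0]]
```

In a real fine-tune you would use a tensor library and let the optimizer update only A and B; the list-based helpers here just keep the sketch dependency-free.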

Upshot

  • Only a small number of parameters (in A and B) need to be trained
  • Uses far less VRAM than full fine-tuning; most of the remaining VRAM requirement comes from loading the base model, not from training
  • Can result in less overfitting compared to full fine-tuning
  • Can be applied selectively to certain layers or components of the model
  • Multiple LoRA modules can be trained for different tasks and swapped out as needed
  • Can use a higher learning rate due to the smaller number of parameters
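
To make the "small number of parameters" point concrete, here is a rough count for a single layer. The layer size (4096 x 4096) and the rank (8) are hypothetical values picked for illustration, and the function names are not from any library.

```python
def full_params(d, k):
    """Parameters in the full d x k weight matrix."""
    return d * k


def lora_trainable_params(d, k, r):
    """Parameters in B (d x r) plus A (r x k)."""
    return d * r + r * k


d, k, r = 4096, 4096, 8  # hypothetical layer size and rank
full = full_params(d, k)
lora = lora_trainable_params(d, k, r)
fraction = lora / full

print(f"full: {full:,}  lora: {lora:,}  fraction: {fraction:.2%}")
# full: 16,777,216  lora: 65,536  fraction: 0.39%
```

Across a whole model, the share of trainable parameters depends on the rank and on which layers receive adapters, so it will differ from this single-layer figure.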

What is QLoRA?

QLoRA, or Quantized LoRA, is an extension of the LoRA technique that further reduces the memory footprint of fine-tuning by quantizing the frozen base model. Introduced in the paper QLoRA: Efficient Finetuning of Quantized LLMs, QLoRA stores the pre-trained weights in a low-precision format (4-bit in the paper) and trains LoRA adapters in higher precision on top of them. During training, gradients are backpropagated through the quantized weights into the A and B matrices, which are the only parameters that get updated.

Because the base model accounts for most of the memory in LoRA fine-tuning, storing its weights in 4 bits instead of 16 cuts the weight memory by roughly 4x, making it possible to fine-tune even larger models on resource-constrained devices.
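
To make quantization itself concrete, the sketch below rounds one small block of weights to 4-bit signed integers and back. Note that this uses plain absmax integer quantization as a simpler stand-in; the QLoRA paper uses a dedicated 4-bit NormalFloat (NF4) data type, which is not reproduced here. The function names and the sample weights are hypothetical.

```python
def quantize_absmax(block, bits=4):
    """Map floats to signed integer codes in [-(2**(bits-1) - 1), 2**(bits-1) - 1]."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4 bits
    absmax = max(abs(v) for v in block)
    scale = absmax / qmax if absmax > 0 else 1.0
    codes = [round(v / scale) for v in block]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes and the block scale."""
    return [c * scale for c in codes]


weights = [0.7, -0.3, 0.1, 0.0, -0.7, 0.24]  # one hypothetical block of weights
codes, scale = quantize_absmax(weights, bits=4)
restored = dequantize(codes, scale)

# Rounding error per weight is at most half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(codes)  # [7, -3, 1, 0, -7, 2]
```

Only the integer codes and one scale per block need to be stored, which is where the memory saving comes from; the rounding error is the price paid for it.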

Upshot

  • Further reduces the memory footprint of fine-tuning
  • Can lead to a loss of knowledge and a lower-quality fine-tune, but not necessarily. Sometimes the quantization actually reduces overfitting.
  • The loss of knowledge is partly mitigated because the adapters are generally not quantized; only the frozen base model loses precision
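
The memory saving from storing the base model at lower precision is mostly simple arithmetic. The sketch below estimates the memory taken by the weights alone for a 7B parameter model; the function name is hypothetical, and the figures deliberately exclude activations, adapter gradients, optimizer state, and framework overhead, so real requirements are higher.

```python
def weight_memory_gb(num_params, bits):
    """Memory for the weights alone, in GB (10**9 bytes)."""
    return num_params * bits / 8 / 1e9


params_7b = 7e9  # 7 billion parameters
for bits in (16, 8, 4):
    print(f"{bits}-bit: {weight_memory_gb(params_7b, bits):.1f} GB")
# 16-bit: 14.0 GB
# 8-bit: 7.0 GB
# 4-bit: 3.5 GB
```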

Which one should you use?

  • If you have access to hardware with enough memory, use LoRA. Refer to the table below for a rough estimate of the memory requirements for different model sizes.
Method | Bits | 7B    | 13B   | 30B   | 70B    | 110B   | 8x7B  | 8x22B
Full   | AMP  | 120GB | 240GB | 600GB | 1200GB | 2000GB | 900GB | 2400GB
Full   | 16   | 60GB  | 120GB | 300GB | 600GB  | 900GB  | 400GB | 1200GB
LoRA   | 16   | 16GB  | 32GB  | 64GB  | 160GB  | 240GB  | 120GB | 320GB
QLoRA  | 8    | 10GB  | 20GB  | 40GB  | 80GB   | 140GB  | 60GB  | 160GB
QLoRA  | 4    | 6GB   | 12GB  | 24GB  | 48GB   | 72GB   | 30GB  | 96GB
QLoRA  | 2    | 4GB   | 8GB   | 16GB  | 24GB   | 48GB   | 18GB  | 48GB
  • If you don’t have enough memory, for example if you only have access to a free T4 on Google Colab, try QLoRA.
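
The rule of thumb above can be written as a small lookup over the LoRA and QLoRA rows of the table. This is only a sketch: pick_method is a hypothetical helper, the numbers are the rough estimates copied from the table, and real usage also depends on batch size, sequence length, and adapter rank.

```python
# Rough VRAM estimates in GB, copied from the LoRA and QLoRA rows of the table.
SIZES = ["7B", "13B", "30B", "70B", "110B", "8x7B", "8x22B"]
VRAM_GB = {
    ("LoRA", "16"): [16, 32, 64, 160, 240, 120, 320],
    ("QLoRA", "8"): [10, 20, 40, 80, 140, 60, 160],
    ("QLoRA", "4"): [6, 12, 24, 48, 72, 30, 96],
    ("QLoRA", "2"): [4, 8, 16, 24, 48, 18, 48],
}


def pick_method(model_size, available_gb):
    """Return the highest-precision (method, bits) row that fits, or None."""
    col = SIZES.index(model_size)
    # Dicts keep insertion order, so rows are tried from most to least precise.
    for (method, bits), row in VRAM_GB.items():
        if row[col] <= available_gb:
            return method, bits
    return None


print(pick_method("7B", 24))   # ('LoRA', '16')
print(pick_method("13B", 16))  # ('QLoRA', '4')
print(pick_method("70B", 16))  # None
```

Trying the rows in order of decreasing precision mirrors the advice in the list: prefer LoRA when it fits, and fall back to progressively more aggressive quantization only when it does not.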
