August 22, 2024 · 5 minute read
LoRA vs. QLoRA: Efficient fine-tuning techniques for LLMs
Yiren Lu (@YirenLu), Solutions Engineer

Fine-tuning LLMs can be a computationally expensive and time-consuming process, especially for models with billions of parameters. However, newer fine-tuning techniques make it possible to fine-tune LLMs far more efficiently by reducing the number of parameters that need to be updated. In this blog post, we’ll explore two such techniques, LoRA and QLoRA, and discuss their differences, pros, and cons.

Overview of LoRA and QLoRA

|  | Full FT | LoRA | QLoRA |
| --- | --- | --- | --- |
| GB VRAM* (memory needed per 1GB of model) | 16+ | 2+ | 0.5+ |
| % of params trained | ~100% | 0.5–5% | 0.5–5% |
| Speed | Slow | Fast | Slightly slower than LoRA |
| Quality | Can overfit | Stable and accurate | Can lose accuracy |

The problems with full fine-tuning

Before diving into the efficient techniques, let’s briefly review the challenges of traditional full fine-tuning:

  • Updates every single parameter in the base model
  • Because of the large number of parameters that need to be updated, requires significant computational resources, typically 60GB+ of VRAM for a 7B parameter model
  • Slow
  • Prone to overfitting, especially when working with smaller datasets

What is LoRA?

LoRA, short for Low-Rank Adaptation, is a fine-tuning technique introduced by Microsoft researchers in their paper LoRA: Low-Rank Adaptation of Large Language Models.

The main idea behind LoRA is that instead of updating all the pre-trained weights, you freeze them and train smaller “adapter” matrices that represent the update to the base model.

In a standard neural network layer, we have:

Y = WX + b

Where:

  • W is the weight matrix
  • X is the input
  • b is the bias
  • Y is the output

LoRA modifies this as follows:

Y = (W + BA)X + b

Where:

  • W is the frozen pre-trained weight matrix
  • B and A are low-rank matrices: if W is d × k, then B is d × r and A is r × k for some small rank r, so together they contain far fewer parameters than W and can be stored and trained much more efficiently
  • BA is the product of these matrices, representing the update to W
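The modified equation above can be sketched in a few lines of NumPy. The dimensions and rank below are illustrative choices, not values from the paper (in practice, ranks of roughly 4–64 are common). Initializing B to zero means the model starts out behaving exactly like the frozen base model:

```python
import numpy as np

rng = np.random.default_rng(0)

d, k, r = 1024, 1024, 8  # layer dimensions and LoRA rank (r << d, k)

W = rng.standard_normal((d, k))  # frozen pre-trained weight matrix
b = np.zeros(d)                  # frozen bias
B = np.zeros((d, r))             # trainable; zero-init so BA = 0 at the start
A = rng.standard_normal((r, k))  # trainable

x = rng.standard_normal(k)

# LoRA forward pass: Y = (W + BA)X + b
y = (W + B @ A) @ x + b

# Only A and B are trained
full_params = W.size           # 1,048,576
lora_params = A.size + B.size  # 16,384 -> about 1.6% of the full count
```

Note that because B starts at zero, `y` here is identical to the base model's output `W @ x + b`; training then moves the adapter away from zero.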

Upshot

  • Only a small number of parameters (in A and B) need to be trained
  • Uses way less VRAM, and most of the VRAM requirement is for loading the base model, not for training
  • Can result in less overfitting compared to full fine-tuning
  • Can be applied selectively to certain layers or components of the model
  • Multiple LoRA modules can be trained for different tasks and swapped out as needed
  • Can use a higher learning rate due to the smaller number of parameters
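The adapter-swapping point above can be made concrete with a toy sketch. The task names and random matrices below are stand-ins, not a real training setup; the point is that one frozen base W can be merged with different (B, A) pairs via W' = W + BA:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 512, 8
W = rng.standard_normal((d, d))  # one frozen base model, loaded once

# Two task-specific adapters, trained independently (random stand-ins here)
adapters = {
    "summarize": (rng.standard_normal((d, r)), rng.standard_normal((r, d))),
    "translate": (rng.standard_normal((d, r)), rng.standard_normal((r, d))),
}

def merged_weight(task):
    """Merge a task's adapter into the frozen base: W' = W + BA."""
    B, A = adapters[task]
    return W + B @ A

x = rng.standard_normal(d)
y_sum = merged_weight("summarize") @ x  # same base weights,
y_tr = merged_weight("translate") @ x   # different behavior per task
```

Each adapter is tiny relative to W (here 2 × 512 × 8 values vs. 512 × 512), which is why storing and swapping many task-specific adapters is cheap.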

What is QLoRA?

QLoRA, or Quantized LoRA, is an extension of the LoRA technique that further reduces the memory footprint of fine-tuning by quantizing the frozen base model. Introduced in the paper QLoRA: Efficient Finetuning of Quantized LLMs, QLoRA stores the pre-trained weights in a 4-bit data type (4-bit NormalFloat, or NF4) instead of 16- or 32-bit floating point, dequantizing them on the fly during the forward pass, while the trainable A and B adapter matrices remain in higher precision.

By quantizing the base model, QLoRA achieves roughly a 4x reduction in memory usage compared to 16-bit LoRA, making it possible to fine-tune even larger models on resource-constrained devices.
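To illustrate the idea, here is a deliberately simplified sketch using absmax int8 quantization rather than the paper's 4-bit NormalFloat (which requires a lookup table of quantile-based levels). The base weights are stored in 1 byte each instead of 4 and dequantized for the forward pass, while the adapters stay in full precision:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256)).astype(np.float32)  # base weights

# Simplified absmax int8 quantization (QLoRA itself uses 4-bit NF4):
# scale so the largest-magnitude weight maps to the int8 extreme, 127
scale = np.abs(W).max() / 127.0
W_q = np.round(W / scale).astype(np.int8)  # stored: 1 byte/weight vs. 4

# Dequantize for the forward pass; adapters remain float32
W_dq = W_q.astype(np.float32) * scale
B = np.zeros((256, 8), dtype=np.float32)
A = rng.standard_normal((8, 256)).astype(np.float32)

x = rng.standard_normal(256).astype(np.float32)
y = (W_dq + B @ A) @ x  # gradients only flow into A and B

mem_saving = W.nbytes / W_q.nbytes  # 4x: float32 -> int8
```

The rounding step is where accuracy can be lost: `W_dq` only approximates `W`, which is why the quality caveats below apply to the base model rather than the adapters.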

Upshot

  • Further reduces the memory footprint of fine-tuning
  • Can lead to a loss of knowledge and a lower-quality fine-tune, but not necessarily. Sometimes the quantization actually reduces overfitting.
  • The loss of knowledge is also mitigated because the adapters themselves are generally not quantized; it is the quantized base model whose reduced precision costs some performance

Which one should you use?

  • If you have access to hardware with enough VRAM, use LoRA. Refer to the table below for a rough estimate of the memory requirements for different model sizes.
| Method | Bits | 7B | 13B | 30B | 70B | 110B | 8x7B | 8x22B |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Full | AMP | 120GB | 240GB | 600GB | 1200GB | 2000GB | 900GB | 2400GB |
| Full | 16 | 60GB | 120GB | 300GB | 600GB | 900GB | 400GB | 1200GB |
| LoRA | 16 | 16GB | 32GB | 64GB | 160GB | 240GB | 120GB | 320GB |
| QLoRA | 8 | 10GB | 20GB | 40GB | 80GB | 140GB | 60GB | 160GB |
| QLoRA | 4 | 6GB | 12GB | 24GB | 48GB | 72GB | 30GB | 96GB |
| QLoRA | 2 | 4GB | 8GB | 16GB | 24GB | 48GB | 18GB | 48GB |
  • If you don’t have enough VRAM — for example, if you only have access to a free T4 on Google Colab — try QLoRA.
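The per-parameter rates implied by the 7B column of the table above can be turned into a rough estimator. This is a rule of thumb derived from those table entries, not an exact requirement (real usage varies with sequence length, batch size, and optimizer):

```python
# GB of VRAM per billion parameters, derived from the 7B column above:
# full 16-bit = 60/7, LoRA 16-bit = 16/7, QLoRA 4-bit = 6/7
GB_PER_BILLION = {"full16": 60 / 7, "lora16": 16 / 7, "qlora4": 6 / 7}

def estimate_vram_gb(params_billions: float, method: str) -> float:
    """Rough VRAM estimate for fine-tuning, per the table's rule of thumb."""
    return params_billions * GB_PER_BILLION[method]

# e.g. a 13B model with 4-bit QLoRA:
print(round(estimate_vram_gb(13, "qlora4")))  # ~11 GB (table says 12GB)
```

A free Colab T4 has 16GB of VRAM, so by this estimate it can handle 4-bit QLoRA for models up to roughly the 13B range, but not 16-bit LoRA on a 7B model.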