October 15, 2024 · 10 minute read
Fine-tuning vs. RAG
Yiren Lu (@YirenLu), Solutions Engineer

If you’re looking to use LLMs to build personalized chatbots or other AI applications, you’ve probably heard of fine-tuning and Retrieval Augmented Generation (RAG). These approaches allow organizations to tailor LLMs to specific domains or tasks, improving accuracy and relevance.

But when should you use fine-tuning versus RAG? This article explores the key differences and use cases for each method.

Fine-tuning LLMs

What is fine-tuning?

Fine-tuning involves taking a pre-trained LLM and further training it on a smaller, specialized dataset. For example, you might take Mixtral and fine-tune it on a dataset of medical articles to improve its understanding of medical terminology. This process adjusts the model’s parameters to better suit a specific task or domain.

When to use fine-tuning

Fine-tuning is particularly useful in the following scenarios:

  1. Domain-specific tasks: When you need the model to understand and generate content in a specialized field, such as legal, medical, or technical writing.

  2. Setting the style, tone, format, or other qualitative aspects: For applications requiring a specific writing style or brand voice, fine-tuning can help maintain consistency.

  3. Improving reliability at producing a desired output: Fine-tuning on many examples can make the model produce the output you want more consistently, and it can significantly enhance performance on subjects not well-represented in the original training data.

  4. Handling many edge cases in specific ways: Fine-tuning allows the model to learn from specific examples and adapt to unique scenarios.

  5. Cost sensitivity: Fine-tuning requires an upfront investment, but it can pay off in the long run. Once a model has been fine-tuned, you may not need to provide as many examples in the prompt, which reduces per-request cost and latency.

How to fine-tune

Step 1: Prepare your custom data

The first step is to gather a dataset that is relevant to your specific task or domain. This dataset should be large enough to provide meaningful training data for the model. Ensure that the data is high-quality, diverse, and representative of the task you want the model to perform. This includes considering the form of the data; for example, if you want the eventual fine-tuned outputs to be long, then your training data examples should be long. You may need to preprocess the data by tokenizing it, removing stop words, or performing other necessary steps to prepare it for fine-tuning. Additionally, the data will likely need to be prepared in a specific format, such as JSONL.
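For concreteness, here is a minimal sketch of preparing chat-style training data as JSONL. The exact schema (field names like `messages`, `role`, and `content`) depends on the fine-tuning framework you choose, and the example rows are purely illustrative.

```python
import json

# Illustrative chat-style examples; adapt the schema to your fine-tuning framework.
examples = [
    {
        "messages": [
            {"role": "user", "content": "What does 'tachycardia' mean?"},
            {"role": "assistant", "content": "Tachycardia is an abnormally fast resting heart rate, typically over 100 beats per minute."},
        ]
    },
    # ... more examples
]

# Write one JSON object per line (JSONL), the format most fine-tuning tools accept.
with open("train.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```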

Step 2: Select a base model

Choose a pre-trained LLM that is suitable for your task. Consider factors such as the model’s architecture, size, and performance on similar tasks. Popular choices include models like Llama 3 and Mistral. You can find pre-trained models on platforms like the Hugging Face Hub.
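As a rough illustration, loading a base model and tokenizer from the Hugging Face Hub typically looks like the following; the model name is just an example, and some models require you to accept a license before downloading.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Example base model; any causal LM on the Hugging Face Hub works similarly.
model_name = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
```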

Step 3: Determine the required VRAM

Estimate the amount of Video Random Access Memory (VRAM) needed to fine-tune your model. This is crucial to ensure that your computing resources can handle the model’s requirements. You can refer to our article on how much VRAM you need to fine-tune an LLM for guidance on estimating VRAM requirements.
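As a very rough back-of-the-envelope estimate (assuming full fine-tuning with the Adam optimizer in mixed precision; parameter-efficient methods like LoRA or QLoRA need far less):

```python
# Rough rule of thumb: ~16 bytes per parameter for full fine-tuning with Adam
# (2 bytes fp16 weights + 2 bytes gradients + 12 bytes fp32 optimizer states),
# before activations and other overhead.
params = 7e9                # e.g. a 7B-parameter model
bytes_per_param = 16        # assumption; LoRA/QLoRA reduce this dramatically
vram_gb = params * bytes_per_param / 1024**3
print(f"~{vram_gb:.0f} GB of VRAM")  # prints "~104 GB of VRAM"
```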

Step 4: Select a fine-tuning framework or library

A number of libraries and frameworks can help you fine-tune an LLM by abstracting away some of the low-level details and providing built-in optimizations. Popular options include Hugging Face’s PEFT and TRL, Axolotl, and torchtune.
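For example, a LoRA-style parameter-efficient fine-tune with Hugging Face’s peft library can be set up in a few lines; the target modules and hyperparameters below are illustrative and depend on the base model’s architecture.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

# Illustrative LoRA hyperparameters; tune these for your model and task.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small adapter matrices are trained
```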

Step 5: Select a serverless GPU provider for fine-tuning

Serverless GPU providers like Modal offer scalable GPU resources that simplify the fine-tuning process. You only pay for the GPUs while your fine-tuning jobs are actually running, and you don’t need to manage infrastructure.
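A minimal sketch of what a fine-tuning job on Modal can look like is shown below; the function body is left as a stub, and details like the GPU type and image contents are assumptions you would adapt to your own setup.

```python
import modal

image = modal.Image.debian_slim().pip_install("transformers", "peft", "datasets")
app = modal.App("finetune-example", image=image)

@app.function(gpu="A100", timeout=60 * 60)
def finetune():
    # Load the base model, apply LoRA, and run your training loop here.
    ...

@app.local_entrypoint()
def main():
    finetune.remote()  # runs on a cloud GPU; you only pay while it executes
```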

Retrieval Augmented Generation (RAG)

What is RAG?

RAG, or Retrieval Augmented Generation, is a technique that enhances an LLM’s responses by incorporating external knowledge sources as part of the prompt context. Instead of relying solely on the model’s pre-trained knowledge, RAG allows the model to query and integrate information from external databases or documents in real-time.

How does RAG work?

Step 1: Document Chunking

The first step is to break down your text documents into smaller, manageable chunks. This process is called chunking. The goal is to divide the documents into sections that are meaningful and coherent, representing units of relevant context. For example, you might choose to chunk a lengthy article into paragraphs or sections based on thematic content. This ensures that each chunk retains enough context to be useful during the retrieval process, while also being small enough to allow for efficient searching and embedding.
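A minimal paragraph-based chunker might look like this (a simple character-count heuristic; real systems often chunk by tokens or semantic boundaries instead):

```python
def chunk_by_paragraph(text: str, max_chars: int = 1000) -> list[str]:
    """Split text on blank lines, then merge paragraphs up to a size limit."""
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        if current and len(current) + len(paragraph) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += paragraph + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks
```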

Step 2: Embedding and Database Storage

Once the documents are chunked, the next step is to embed each chunk using an embedding model. An embedding model is a type of neural network that converts text into a dense vector representation, allowing the model to capture the semantic meaning of the text. The embeddings are then stored in a database, which can be queried later to retrieve relevant information.
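Continuing the sketch, here is one way to embed the chunks with the sentence-transformers library and keep the vectors in a plain NumPy array; the model name and file paths are examples, and in production you would typically use a dedicated vector database instead.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # example open-source embedding model
chunks = chunk_by_paragraph(open("docs.txt").read())  # "docs.txt" is a placeholder
chunk_embeddings = embedder.encode(chunks, normalize_embeddings=True)  # (n_chunks, dim)

np.save("chunk_embeddings.npy", chunk_embeddings)  # stand-in for a real vector database
```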

Step 3: Query Embedding and Similarity Search

When a user submits a query to the LLM, the query is also embedded using the same embedding model as before. This embedded query is then used to search the database for the most similar embeddings. The similarity search is typically done using a vector similarity metric, such as cosine similarity or dot product. The goal is to find the embeddings that are closest to the query embedding, indicating that they contain relevant information.
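With normalized embeddings, cosine similarity reduces to a dot product, so the search step of the sketch is just a matrix-vector multiply and a sort:

```python
query = "What are the side effects of this medication?"  # example user query
query_embedding = embedder.encode([query], normalize_embeddings=True)[0]

scores = chunk_embeddings @ query_embedding        # cosine similarity per chunk
top_k = scores.argsort()[::-1][:3]                 # indices of the 3 best matches
retrieved_chunks = [chunks[i] for i in top_k]
```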

Step 4: Context Addition and LLM Query

The final step in the RAG process is to take the text of the chunks whose embeddings are most similar to the query and add it as context to the original query. This enriched prompt is then passed to the LLM, which generates a response based on the original query and the additional context. The context provided by RAG helps the LLM better understand the query and generate more accurate and informative responses.
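The assembly step might then look like the following; the prompt template is illustrative, and the LLM call is left as a placeholder since it depends on the provider you use.

```python
context = "\n\n".join(retrieved_chunks)
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {query}"
)
# response = llm_client.generate(prompt)  # hypothetical client; substitute your LLM API
```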

When to use RAG

RAG is particularly beneficial in these situations:

  1. Up-to-date information: When your application requires access to the latest information that may not be present in the model’s training data.

  2. Factual accuracy: RAG can improve the model’s ability to provide accurate, verifiable information by referencing external sources.

  3. Customizable knowledge base: RAG allows you to easily update or modify the external knowledge source without retraining the entire model.

For instance, a customer support chatbot for a tech company could be fine-tuned on the company’s documentation and support tickets to understand product-specific terminology and common issues. RAG could then be used to incorporate the latest product updates or known issues in real-time.

RAG performance

In order to get RAG to perform the way you want, there are a number of parameters that you will likely need to experiment with, including:

  1. Chunking strategy: You may choose to chunk based on sentence boundaries, paragraphs, or even semantic meaning. Experimenting with different chunk sizes can help you find the optimal balance between context richness and retrieval efficiency. Smaller chunks may provide more precise information but can lead to a loss of context, while larger chunks may retain context but could introduce irrelevant information.

  2. Embedding model: The choice of embedding model is crucial, as it determines how well the text is represented in vector space. Different models may capture different aspects of the text, so it’s important to select one that aligns with your specific use case. You can start by looking through the MTEB leaderboard for the top embedding models, but remember that just because a model tops the leaderboard doesn’t necessarily mean that it will work best for your use case.

  3. Similarity metric: The method you use to measure similarity between embeddings can greatly affect the retrieval results. Common metrics include cosine similarity, which measures the angle between two vectors, and dot product, which assesses the magnitude of the vectors. Depending on your application, you may want to experiment with different metrics to see which yields the best results for your queries. Additionally, consider implementing a hybrid approach that combines multiple metrics for improved accuracy.

  4. Retrieval threshold: Setting a threshold for how similar an embedding must be to be considered relevant can help filter out noise. A lower threshold may retrieve more results, but could include less relevant information, while a higher threshold may yield fewer, but more accurate results. Tuning this parameter based on your specific needs can enhance the overall effectiveness of the RAG system (see the sketch after this list).

  5. Context length: The amount of context you provide to the LLM can also influence its performance. Too little context may lead to vague or irrelevant responses, while too much can overwhelm the model. Finding the right balance is key, and you may need to adjust the context length based on the complexity of the queries and the nature of the information being retrieved.
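To make the retrieval threshold concrete, here is how it could slot into the similarity-search sketch from earlier; the threshold value and result limit are illustrative and should be tuned empirically.

```python
THRESHOLD = 0.3  # illustrative; tune based on your data and embedding model

filtered = [
    (chunks[i], float(scores[i]))
    for i in scores.argsort()[::-1]
    if scores[i] >= THRESHOLD
][:5]  # keep at most 5 chunks that clear the threshold
```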

Conclusion

Both fine-tuning and RAG offer powerful ways to enhance LLM performance for specific use cases. Fine-tuning excels in creating models with deep domain expertise and consistent output, while RAG provides flexibility and up-to-date information access. By understanding the strengths of each approach, you can choose the most appropriate method—or combination of methods—for your specific needs.
