Fine-tune an LLM in minutes (ft. Llama 2, CodeLlama, Mistral, etc.)

Tired of prompt engineering? Fine-tuning helps you get more out of a pretrained LLM by adjusting the model weights to better fit a specific task. This operational guide will help you take a base model and fine-tune it on your own dataset (API docs, conversation transcripts, etc.) in a matter of minutes.

The repository comes ready to use as-is, with all the recommended state-of-the-art optimizations for fast training results.

The heavy lifting is done by the axolotl library. For the purposes of this guide, we’ll fine-tune CodeLlama 7B to generate SQL queries, but the code is easy to tweak for many base models, datasets, and training configurations.

Best of all, using Modal for training means you never have to worry about infrastructure headaches like building images, provisioning GPUs, and managing cloud storage. If a training script runs on Modal, it’s repeatable and scalable enough to ship to production right away.


To follow along, make sure that you have completed the following:

  1. Set up your Modal account:

    pip install modal
    python3 -m modal setup
  2. Create a HuggingFace secret in your workspace (only HF_TOKEN is needed, which you can find in your Hugging Face settings under API tokens)

  3. Clone the repository and navigate to the directory:

    git clone
    cd llm-finetuning

Some models, like Llama 2, also require that you apply for access, which you can do on their Hugging Face page (access is granted instantly).


We created a simple GUI using Gradio that makes it easy to train and test models using this repository with a single click. All you need to do for customization is paste your desired training config (YAML) and dataset (JSONL) as plain text into the UI.

See Using the GUI for details on getting started with the Gradio interface.

Code overview

The source directory contains a training script to launch a training job in the cloud with your config/dataset (config.yml and my_data.jsonl, unless otherwise specified), as well as an inference engine for testing your training results.

We use Modal’s built-in cloud storage system to share data across all functions in the app. In particular, we mount one persistent volume at /pretrained inside the container to store our pretrained models (so we only need to download them once) and another persistent volume at /runs to store the training config, dataset, and results for each run (for easier reproducibility and management).
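The wiring can be sketched roughly like this (a simplified illustration, not the repository's exact code; the volume names are made up, and the exact volume API depends on your modal version):

```python
import modal

stub = modal.Stub("example-finetune")

# Two persisted volumes: one for base model weights, one for run artifacts.
# Names here are illustrative.
pretrained_volume = modal.Volume.persisted("example-pretrained-vol")
runs_volume = modal.Volume.persisted("example-runs-vol")

@stub.function(volumes={"/pretrained": pretrained_volume, "/runs": runs_volume})
def train():
    # Anything written under /pretrained or /runs persists across runs,
    # so base models are downloaded once and every run's config, data,
    # and checkpoints stay inspectable afterwards.
    ...
```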

There are two main ways to train: from the command line, or with a single click through the Gradio GUI (see Using the GUI).


The training script contains three Modal functions that run in the cloud:

  • launch prepares a new folder in the /runs volume with the training config and data for a new training job. It also ensures the base model is downloaded from HuggingFace.
  • train takes a prepared run folder in the volume and performs the training job using the config and data.
  • merge merges the trained adapter with the base model (as a CPU job).
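The bookkeeping that launch performs can be sketched as follows (a simplified, local-filesystem version; the function name and details are illustrative, not the repository's exact code):

```python
import secrets
import shutil
from datetime import datetime, timezone
from pathlib import Path

def prepare_run(runs_root: Path, config: Path, dataset: Path) -> Path:
    """Create a uniquely named run folder and snapshot its inputs."""
    # Timestamp plus a short random suffix, e.g. axo-2023-11-24-17-26-66e8,
    # so concurrent launches never collide.
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d-%H-%M")
    run_dir = runs_root / f"axo-{stamp}-{secrets.token_hex(2)}"
    run_dir.mkdir(parents=True)
    # Copy the exact config and dataset into the run folder so every
    # run is reproducible from its own snapshot.
    shutil.copy(config, run_dir / "config.yml")
    shutil.copy(dataset, run_dir / "my_data.jsonl")
    return run_dir
```

Snapshotting the inputs per run is what makes each run folder self-contained: you can always re-run or inspect a job from exactly the config and data it used.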

By default, when you make local changes to either config.yml or my_data.jsonl, they will be used for your next training run. You can also specify which local config and data files to use with the --config and --dataset flags. See Making it your own for more details on customizing your dataset and config.

To kickstart a training job with the CLI, use:

modal run --detach src.train

--detach lets the app continue running even if your client disconnects.

The training run folder name will be in the command output (e.g. axo-2023-11-24-17-26-66e8). You can check if your fine-tuned model is stored properly in this folder using modal volume ls or the File Explorer in our Gradio GUI.

Serving your fine-tuned model

Once a training run has completed, run inference to compare the model before/after training.

  • Inference.completion can spawn a vLLM inference container for any pre-trained or fine-tuned model from a previous training job.

You can serve a model for inference using the following command, specifying which training run folder to load the model from with the --run-folder flag (the run folder name is in the training log output):

modal run -q src.inference --run-folder /runs/axo-2023-11-24-17-26-66e8

We use vLLM to speed up inference by up to 24x.

Using the GUI

The Gradio GUI makes it easy to run each of the cloud functions with a single click. The interface is also helpful for viewing the files in your mounted volume.

To use the GUI, first deploy the training backend with all the business logic (the launch, train, and completion functions):

modal deploy src

If you would like to change the number or memory size of the GPUs used for training (the default is two 80 GB A100s), set them via environment variables during this deployment step. For example, to use four 40 GB A100s instead:

N_GPUS=4 GPU_MEM=40 modal deploy src

Then run the GUI as an ephemeral app:

modal run src.gui

Within a couple of seconds of running the app, you should see a URL in the command output. Open it in your browser to use the app.

In the Gradio app, you should see two tabs for launching training runs and testing out trained models:


  1. Train

    Paste your desired config and dataset directly into the UI as text (formatted as YAML for the config and JSONL for the dataset; see the training section for more details), then click “Launch training job.” You can inspect the files stored in your runs volume, including your training checkpoints and results, using the File Explorer.

  2. Inference

    After a training run is completed, you can test out your fine-tuned model. Use the dropdown to switch between various training run results.

Making it your own

Training on your own dataset, using a different base model, or activating another SOTA technique is as easy as modifying a couple of files.


Bringing your own dataset is as simple as creating a JSONL file — Axolotl supports many dataset formats (see more).

We recommend adding your custom dataset as a JSONL file in the src directory (or pasting the text directly into the my_data.jsonl box if using the GUI) and making the appropriate modifications to your config, as explained below.
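In completion format, each line of the JSONL file is a standalone JSON object with a single text field. A quick way to write and sanity-check such a file (the SQL examples below are made up):

```python
import json

# Two made-up training examples in completion format.
lines = [
    {"text": "[INST] List all users [/INST] SELECT * FROM users;"},
    {"text": "[INST] Count the orders [/INST] SELECT COUNT(*) FROM orders;"},
]

with open("my_data.jsonl", "w") as f:
    for row in lines:
        f.write(json.dumps(row) + "\n")

# Every line must parse as JSON and contain a "text" key.
with open("my_data.jsonl") as f:
    for line in f:
        assert "text" in json.loads(line)
```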


All of your training parameters and options are customizable in a single config file. We recommend duplicating one of the example_configs to src/config.yml (or directly pasting into the UI if using the GUI) and modifying as you need. See an overview of Axolotl’s config options here.

The most important options to consider are:

  • Model

    base_model: codellama/CodeLlama-7b-Instruct-hf
  • Dataset (by default we upload a local .jsonl file from the src folder in completion format, but you can see all dataset options here)

    - path: my_data.jsonl
      ds_type: json
      type: completion
  • LoRA

    adapter: lora # or qlora, or leave blank for a full fine-tune
    lora_r: 8
    lora_alpha: 16
    lora_dropout: 0.05
    lora_target_modules:
      - q_proj
      - v_proj
  • Multi-GPU training

    We recommend DeepSpeed for multi-GPU training, which is easy to set up. Axolotl provides several default deepspeed JSON configurations and Modal makes it easy to attach multiple GPUs of any type in code, so all you need to do is specify which of these configs you’d like to use.

    In your config.yml:

    deepspeed: /root/axolotl/deepspeed/zero3.json


    And in the training code, where the GPU configuration is read from environment variables:

    import os
    N_GPUS = int(os.environ.get("N_GPUS", 2))
    GPU_MEM = int(os.environ.get("GPU_MEM", 80))
    GPU_CONFIG = modal.gpu.A100(count=N_GPUS, memory=GPU_MEM)  # you can also change this in code to use A10Gs or T4s
  • Logging with Weights and Biases

    To track your training runs with Weights and Biases:

    1. Create a Weights and Biases secret in your Modal dashboard, if you have not already (only WANDB_API_KEY is needed, which you can get by logging into your Weights and Biases account and going to the Authorize page)
    2. Add the Weights and Biases secret to your app, so that initializing your stub looks like:
    from modal import Stub, Secret
    stub = Stub("my_app", secrets=[Secret.from_name("huggingface"), Secret.from_name("my-wandb-secret")])
    3. Add your wandb config to your config.yml:
    wandb_project: mistral-7b-samsum
    wandb_watch: gradients
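
Putting those pieces together, a minimal config might look like this (values are illustrative; consult the Axolotl config reference for the full schema):

```yaml
base_model: codellama/CodeLlama-7b-Instruct-hf

datasets:
  - path: my_data.jsonl
    ds_type: json
    type: completion

adapter: lora
lora_r: 8
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - v_proj

deepspeed: /root/axolotl/deepspeed/zero3.json
```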

Once you have your trained model, you can easily deploy it to production for serverless inference via Modal’s web endpoint feature (see example here). Modal will handle all the auto-scaling for you, so that you only pay for the compute you use!