Efficient LLM Finetuning with Unsloth

Training large language models is an incredibly compute-hungry process. Open-source LLMs often require many GBs (or, in extreme cases, over a TB!) of VRAM just to fit in memory. Finetuning requires even more; a common estimate for naive finetuning puts the VRAM requirement at roughly 4.2x the size of the model weights: 1x for the weights themselves + 1x for gradients + 2x for optimizer state + 20% for activations. Parameter-efficient methods like LoRA improve matters significantly, since the gradient and optimizer-state terms then apply only to the small LoRA adapter weights rather than the entire model’s (the frozen base weights still cost their 1x). Further gains can be made by quantizing each of the components mentioned above, but doing so requires quantization-aware training, which can be tricky to combine with methods like LoRA.
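
As a quick back-of-the-envelope check of that figure (a sketch, not a measurement; the 20% activation term in particular depends on batch size and sequence length), here is the estimate applied to a 32B-parameter model in 16-bit precision:

# naive full-finetuning VRAM estimate (illustrative numbers only)
n_params = 32e9  # 32B parameters
bytes_per_param = 2  # bf16/fp16

weights = n_params * bytes_per_param  # 1x: model weights
gradients = weights  # 1x: one gradient per weight
optimizer_state = 2 * weights  # 2x: Adam's two moment buffers
activations = 0.2 * weights  # ~20%: rough activation overhead

total = weights + gradients + optimizer_state + activations
print(f"~{total / 1e9:.0f} GB of VRAM")  # ~269 GB, far more than a single GPU offers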

Unsloth provides optimized methods for LLM finetuning with LoRA and quantization, typically yielding 2x faster training with 70% less memory usage. This example demonstrates using Unsloth on Modal to finetune Qwen3-32B (the default model configured below) on the FineTome-100k dataset using only a single GPU!

import pathlib
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

import modal

We create a Modal App to organize our functions and shared infrastructure like container images and volumes.

app = modal.App("example-unsloth-finetune")

Container Image Configuration 

We build a custom container image with Unsloth and all necessary dependencies, pinning the newest Unsloth release as of writing, which includes optimizations for recent model architectures. Once the image is defined, we specify the imports we’ll need for the rest of our training code. Importantly, we import unsloth before everything else so that Unsloth’s patches are applied to packages like transformers, peft, and trl.

train_image = (
    modal.Image.debian_slim(python_version="3.11")
    .uv_pip_install(
        "accelerate==1.9.0",
        "datasets==3.6.0",
        "hf-transfer==0.1.9",
        "huggingface_hub==0.34.2",
        "peft==0.16.0",
        "transformers==4.54.0",
        "trl==0.19.1",
        "unsloth[cu128-torch270]==2025.7.8",
        "unsloth_zoo==2025.7.10",
        "wandb==0.21.0",
    )
    .env({"HF_HOME": "/model_cache"})
)
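
Setting HF_HOME to /model_cache points Hugging Face’s download cache at the model weights Volume we mount below, so pretrained weights persist across training runs instead of living on the container’s ephemeral disk.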

with train_image.imports():
    # unsloth must be first!
    import unsloth  # noqa: F401,I001
    import datasets
    import torch
    import wandb
    from transformers import TrainingArguments
    from trl import SFTTrainer
    from unsloth import FastLanguageModel
    from unsloth.chat_templates import standardize_sharegpt

Volume Configuration 

Modal Volumes provide storage that persists between function invocations. We use separate volumes for different types of data to enable efficient caching and sharing:

  • A cache for pretrained model weights - reused across all experiments
  • A cache for processed datasets - reused when using the same dataset
  • Storage for training checkpoints and final models
model_cache_volume = modal.Volume.from_name(
    "unsloth-model-cache", create_if_missing=True
)
dataset_cache_volume = modal.Volume.from_name(
    "unsloth-dataset-cache", create_if_missing=True
)
checkpoint_volume = modal.Volume.from_name(
    "unsloth-checkpoints", create_if_missing=True
)
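
You can peek into any of these Volumes from your terminal with the Modal CLI; for example, to list saved experiments:

modal volume ls unsloth-checkpoints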

Picking a GPU 

We use an L40S for its healthy balance of VRAM, CUDA cores, and clock speed. The timeout provides an upper bound on our training time; if our training run finishes faster, we won’t end up using the full 6 hours. We also specify 3 retries, which will be useful in case our training function gets preempted.

GPU_TYPE = "L40S"
TIMEOUT_HOURS = 6
MAX_RETRIES = 3

Data Processing 

We’ll be finetuning our model on the FineTome-100k dataset, which is a subset of The Tome curated with the fineweb-edu-classifier. Below we define some helpers for processing this dataset.

CONVERSATION_COLUMN = "conversations"  # ShareGPT format column name
TEXT_COLUMN = "text"  # Output column for formatted text
TRAIN_SPLIT_RATIO = 0.9  # 90% train, 10% eval split
PREPROCESSING_WORKERS = 2  # Number of workers for dataset processing


def format_chat_template(examples, tokenizer):
    texts = []
    for conversation in examples[CONVERSATION_COLUMN]:
        formatted_text = tokenizer.apply_chat_template(
            conversation, tokenize=False, add_generation_prompt=False
        )
        texts.append(formatted_text)
    return {TEXT_COLUMN: texts}
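
For intuition, here’s roughly what a single record looks like after standardize_sharegpt (applied in load_or_cache_dataset below); the rendered string, including special tokens, depends on the model’s chat template:

# a sketch of one standardized record (illustrative values)
example = {
    "conversations": [
        {"role": "user", "content": "What is 2 + 2?"},
        {"role": "assistant", "content": "2 + 2 equals 4."},
    ]
}
# format_chat_template renders each conversation into a single training
# string via the tokenizer's chat template, stored under the "text" column.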


def load_or_cache_dataset(config: "TrainingConfig", paths: dict, tokenizer):
    dataset_cache_path = paths["dataset_cache"]

    if dataset_cache_path.exists():
        print(f"Loading cached dataset from {dataset_cache_path}")
        train_dataset = datasets.load_from_disk(dataset_cache_path / "train")
        eval_dataset = datasets.load_from_disk(dataset_cache_path / "eval")
    else:
        print(f"Downloading and processing dataset: {config.dataset_name}")

        # Load and standardize the dataset format
        dataset = datasets.load_dataset(config.dataset_name, split="train")
        dataset = standardize_sharegpt(dataset)

        # Split into training and evaluation sets with fixed seed for reproducibility
        dataset = dataset.train_test_split(
            test_size=1.0 - TRAIN_SPLIT_RATIO, seed=config.seed
        )
        train_dataset = dataset["train"]
        eval_dataset = dataset["test"]

        # Apply chat template formatting to convert conversations to text
        print("Formatting datasets with chat template...")
        train_dataset = train_dataset.map(
            lambda examples: format_chat_template(examples, tokenizer),
            batched=True,
            num_proc=PREPROCESSING_WORKERS,
            remove_columns=train_dataset.column_names,
        )

        eval_dataset = eval_dataset.map(
            lambda examples: format_chat_template(examples, tokenizer),
            batched=True,
            num_proc=PREPROCESSING_WORKERS,
            remove_columns=eval_dataset.column_names,
        )

        # Cache the processed datasets for future runs
        print(f"Caching processed datasets to {dataset_cache_path}")
        dataset_cache_path.mkdir(parents=True, exist_ok=True)
        train_dataset.save_to_disk(dataset_cache_path / "train")
        eval_dataset.save_to_disk(dataset_cache_path / "eval")

        # Commit the dataset cache to the volume
        dataset_cache_volume.commit()

    return train_dataset, eval_dataset

Loading the Pretrained Model 

We can’t finetune without a pretrained model! Since these models are fairly large, we don’t want to download them anew for each training run. We solve this by caching the weights in a Volume on first download, and then loading them from the Volume on subsequent runs.

def load_or_cache_model(config: "TrainingConfig", paths: dict):
    # HF_HOME points at the model cache Volume, so weights are downloaded
    # once and then loaded from the Volume on subsequent runs.
    print(f"Loading model (downloading if not cached): {config.model_name}")
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=config.model_name,
        max_seq_length=config.max_seq_length,
        dtype=None,  # let Unsloth auto-select the dtype (bf16 where supported)
        load_in_4bit=config.load_in_4bit,
        load_in_8bit=config.load_in_8bit,
    )

    return model, tokenizer

Training Configuration 

First we’ll define which layers our LoRA modules should target. Generally, it’s advisable to LoRA-finetune every linear layer in the model, so we target every projection matrix in both the attention and MLP blocks.

LORA_TARGET_MODULES = [
    "q_proj",
    "k_proj",
    "v_proj",
    "o_proj",
    "gate_proj",
    "up_proj",
    "down_proj",
]
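
To get a feel for how little this adds, here’s a rough count of LoRA’s trainable parameters under hypothetical dimensions (a sketch; real architectures use grouped-query attention and per-model widths, so exact numbers differ):

# each adapted matrix W (d_out x d_in) gains two low-rank factors:
# A (r x d_in) and B (d_out x r), i.e. r * (d_in + d_out) parameters
r = 16
hidden = 5120  # hypothetical hidden size
mlp_width = 25600  # hypothetical intermediate size
n_layers = 64  # hypothetical layer count

attn = 4 * r * (hidden + hidden)  # q/k/v/o, treated as square for simplicity
mlp = 2 * r * (hidden + mlp_width) + r * (mlp_width + hidden)  # gate, up, down
total = n_layers * (attn + mlp)
print(f"~{total / 1e6:.0f}M trainable parameters")  # ~136M, well under 1% of a ~30B model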

We want to expose the different hyperparameters and optimizations that Unsloth supports, so we wrap them into a TrainingConfig class. Later, we’ll populate this config with arguments from the command line.

@dataclass
class TrainingConfig:
    # Model and dataset selection
    model_name: str
    dataset_name: str
    max_seq_length: int
    load_in_4bit: bool
    load_in_8bit: bool

    # LoRA configuration for efficient finetuning
    lora_r: int
    lora_alpha: int
    lora_dropout: float
    lora_bias: str
    use_rslora: bool

    # Training hyperparameters
    optim: str
    batch_size: int
    gradient_accumulation_steps: int
    packing: bool
    use_gradient_checkpointing: str
    learning_rate: float
    lr_scheduler_type: str
    warmup_ratio: float
    weight_decay: float
    max_steps: int
    save_steps: int
    eval_steps: int
    logging_steps: int

    # Experiment management
    seed: int
    experiment_name: Optional[str] = None
    enable_wandb: bool = True

    # For testing purposes
    skip_eval: bool = False

    def __post_init__(self):
        # Generate a unique experiment name if not provided
        if self.experiment_name is None:
            timestamp = datetime.now().strftime("%Y%m%d-%H%M%S")
            model_short = self.model_name.split("/")[-1]
            self.experiment_name = f"{model_short}-r{self.lora_r}-{timestamp}"
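
With the defaults used later in this example, this yields names like Qwen3-32B-r16-20250728-153000.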

Main Training Function 

This function orchestrates the entire training process, from model loading to final model saving. It’s decorated with Modal function configuration that specifies the compute resources, the volumes needed, and execution details like timeout and retries.

@app.function(
    image=train_image,
    gpu=GPU_TYPE,
    volumes={
        "/model_cache": model_cache_volume,
        "/dataset_cache": dataset_cache_volume,
        "/checkpoints": checkpoint_volume,
    },
    secrets=[modal.Secret.from_name("wandb-secret")],
    timeout=TIMEOUT_HOURS * 60 * 60,
    retries=modal.Retries(initial_delay=0.0, max_retries=MAX_RETRIES),
    max_inputs=1,  # Ensure we get a fresh container on retry
)
def finetune(config: TrainingConfig):
    # Get structured paths for organized file storage
    paths = get_structured_paths(config)

    # Initialize Weights & Biases for experiment tracking if enabled
    if config.enable_wandb:
        wandb.init(
            project="unsloth-finetune",
            name=config.experiment_name,
            config=config.__dict__,
        )

    # Load or cache model and datasets with progress indicators
    print("Setting up model and data...")
    model, tokenizer = load_or_cache_model(config, paths)
    train_dataset, eval_dataset = load_or_cache_dataset(config, paths, tokenizer)

    # Configure the model for LoRA training
    model = setup_model_for_training(model, config)

    # Prepare checkpoint directory and check for existing checkpoints
    checkpoint_path = paths["checkpoints"]
    checkpoint_path.mkdir(parents=True, exist_ok=True)
    resume_from_checkpoint = check_for_existing_checkpoint(paths)

    # Create training configuration
    training_args = create_training_arguments(config, str(checkpoint_path))

    # Initialize the supervised finetuning trainer
    print("Initializing SFTTrainer...")
    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        dataset_text_field=TEXT_COLUMN,
        max_seq_length=config.max_seq_length,
        dataset_num_proc=PREPROCESSING_WORKERS,
        packing=config.packing,  # Sequence packing for efficiency
        args=training_args,
    )

    # Display training information for transparency
    print(f"Training dataset size: {len(train_dataset):,}")
    print(f"Evaluation dataset size: {len(eval_dataset):,}")
    print(f"Total parameters: {sum(p.numel() for p in model.parameters()):,}")
    print(
        f"Trainable parameters: {sum(p.numel() for p in model.parameters() if p.requires_grad):,}"
    )
    print(f"Experiment: {config.experiment_name}")

    # Start training or resume from checkpoint
    if resume_from_checkpoint:
        print(f"Resuming training from {resume_from_checkpoint}")
        trainer.train(resume_from_checkpoint=resume_from_checkpoint)
    else:
        print("Starting training from scratch...")
        trainer.train()

    # Save the final trained model and tokenizer
    print("Saving final model...")
    final_model_path = checkpoint_path / "final_model"
    model.save_pretrained(final_model_path)
    tokenizer.save_pretrained(final_model_path)

    # Clean up experiment tracking
    if config.enable_wandb:
        wandb.finish()

    print(f"Training completed! Model saved to: {final_model_path}")
    return config.experiment_name
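
After a run completes, the adapter and tokenizer files sit in the checkpoints Volume; one way to pull them down locally is the Modal CLI (substituting your experiment name):

modal volume get unsloth-checkpoints experiments/<experiment-name>/final_model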

Finally, we invoke our training function from an App.local_entrypoint. Arguments to this function automatically get converted into CLI flags that can be specified when we modal run our code. This allows us to do things like tweak hyperparameters directly from the command line without modifying our source code.

To try this example, check out the examples repo, install the Modal client, and run:

modal run 06_gpu_and_ml/unsloth_finetune.py

You can also customize the training process by tweaking hyperparameters with command line flags, e.g.

modal run 06_gpu_and_ml/unsloth_finetune.py --max-steps 10000

@app.local_entrypoint()
def main(
    # Model and dataset configuration
    model_name: str = "unsloth/Qwen3-32B",
    dataset_name: str = "mlabonne/FineTome-100k",
    max_seq_length: int = 32768,
    load_in_4bit: bool = True,  # unsloth: use 4bit quant for frozen model weights
    load_in_8bit: bool = False,  # unsloth: use 8bit quant for frozen model weights
    # LoRA hyperparameters for finetuning efficiency
    lora_r: int = 16,
    lora_alpha: int = 16,
    lora_dropout: float = 0.0,
    lora_bias: str = "none",  # unsloth: optimized lora kernel
    use_rslora: bool = False,
    # Training hyperparameters for optimization
    optim: str = "adamw_8bit",  # unsloth: 8bit optimizer
    batch_size: int = 16,
    gradient_accumulation_steps: int = 1,
    packing: bool = False,
    use_gradient_checkpointing: str = "unsloth",  # unsloth: optimized gradient offloading
    learning_rate: float = 2e-4,
    lr_scheduler_type: str = "cosine",
    warmup_ratio: float = 0.06,
    weight_decay: float = 0.01,
    max_steps: int = 5,  # increase!
    save_steps: int = 2,  # increase!
    eval_steps: int = 2,  # increase!
    logging_steps: int = 1,  # increase!
    # Optional experiment configuration
    seed: int = 105,
    experiment_name: Optional[str] = None,
    disable_wandb: bool = True,
    skip_eval: bool = False,
):
    # Create configuration object from command line arguments
    config = TrainingConfig(
        model_name=model_name,
        dataset_name=dataset_name,
        max_seq_length=max_seq_length,
        load_in_4bit=load_in_4bit,
        load_in_8bit=load_in_8bit,
        lora_r=lora_r,
        lora_alpha=lora_alpha,
        lora_bias=lora_bias,
        lora_dropout=lora_dropout,
        use_rslora=use_rslora,
        optim=optim,
        batch_size=batch_size,
        gradient_accumulation_steps=gradient_accumulation_steps,
        packing=packing,
        use_gradient_checkpointing=use_gradient_checkpointing,
        learning_rate=learning_rate,
        max_steps=max_steps,
        lr_scheduler_type=lr_scheduler_type,
        warmup_ratio=warmup_ratio,
        weight_decay=weight_decay,
        save_steps=save_steps,
        eval_steps=eval_steps,
        logging_steps=logging_steps,
        seed=seed,
        experiment_name=experiment_name,
        enable_wandb=not disable_wandb,
        skip_eval=skip_eval,
    )

    # Display experiment configuration for user verification
    print(f"Starting finetuning experiment: {config.experiment_name}")
    print(f"Model: {config.model_name}")
    print(f"Dataset: {config.dataset_name}")
    print(f"LoRA configuration: rank={config.lora_r}, alpha={config.lora_alpha}")
    print(
        f"Effective batch size: {config.batch_size * config.gradient_accumulation_steps}"
    )
    print(f"Training steps: {config.max_steps}")

    # Launch the training job on Modal infrastructure
    experiment_name = finetune.remote(config)
    print(f"Training completed successfully: {experiment_name}")

Utility Functions 

These functions handle the core logic for model loading, dataset processing, and training setup. They’re designed to be hackable for new use cases.

def get_structured_paths(config: TrainingConfig):
    """
    Create structured paths within the mounted volumes for organized storage.

    This function maps the configuration to specific directory paths that allow
    multiple models, datasets, and experiments to coexist without conflicts.
    """
    # Replace forward slashes in names to create valid directory names
    dataset_cache_path = (
        pathlib.Path("/dataset_cache")
        / "datasets"
        / config.dataset_name.replace("/", "--")
    )
    checkpoint_path = (
        pathlib.Path("/checkpoints") / "experiments" / config.experiment_name
    )

    return {
        "dataset_cache": dataset_cache_path,
        "checkpoints": checkpoint_path,
    }


def setup_model_for_training(model, config: TrainingConfig):
    """
    Configure the model with LoRA adapters for efficient finetuning.

    LoRA (Low-Rank Adaptation) allows us to finetune large models efficiently
    by only training a small number of additional parameters. This significantly
    reduces memory usage and training time.
    """
    print("Configuring LoRA for training...")
    model = FastLanguageModel.get_peft_model(
        model,
        r=config.lora_r,  # LoRA rank - higher values = more parameters
        target_modules=LORA_TARGET_MODULES,  # Which layers to apply LoRA to
        lora_alpha=config.lora_alpha,  # LoRA scaling parameter
        lora_dropout=config.lora_dropout,  # Dropout for LoRA layers
        bias=config.lora_bias,  # Bias configuration
        use_gradient_checkpointing=config.use_gradient_checkpointing,  # Memory optimization
        random_state=config.seed,  # Fixed seed for reproducibility
        use_rslora=config.use_rslora,  # Rank-stabilized LoRA
        loftq_config=None,  # LoFTQ quantization config
    )
    return model


def create_training_arguments(config: TrainingConfig, output_dir: str):
    """
    Create training arguments for the SFTTrainer.

    These arguments control the training process, including optimization settings,
    evaluation frequency, and checkpointing behavior.
    """
    print("SKIP_EVAL", config.skip_eval)
    return TrainingArguments(
        # Core training configuration
        per_device_train_batch_size=config.batch_size,
        gradient_accumulation_steps=config.gradient_accumulation_steps,
        learning_rate=config.learning_rate,
        max_steps=config.max_steps,
        warmup_ratio=config.warmup_ratio,
        # Evaluation and checkpointing
        eval_steps=config.eval_steps,
        save_steps=config.save_steps,
        eval_strategy="no" if config.skip_eval else "steps",
        save_strategy="steps",
        do_eval=not config.skip_eval,
        # Optimization settings based on hardware capabilities
        fp16=not torch.cuda.is_bf16_supported(),  # Use fp16 if bf16 not available
        bf16=torch.cuda.is_bf16_supported(),  # Prefer bf16 when available
        optim=config.optim,
        weight_decay=config.weight_decay,
        lr_scheduler_type=config.lr_scheduler_type,
        # Logging and output configuration
        logging_steps=config.logging_steps,
        output_dir=output_dir,
        report_to="wandb" if config.enable_wandb else None,
        seed=config.seed,
    )


def check_for_existing_checkpoint(paths: dict):
    """
    Check if there's an existing checkpoint to resume training from.

    This enables resumable training, which is crucial for long-running experiments
    that might be interrupted by infrastructure issues or resource limits.
    """
    checkpoint_dir = paths["checkpoints"]
    if not checkpoint_dir.exists():
        return None

    # Look for the most recent checkpoint directory
    checkpoints = list(checkpoint_dir.glob("checkpoint-*"))
    if checkpoints:
        latest_checkpoint = max(checkpoints, key=lambda p: int(p.name.split("-")[1]))
        print(f"Found existing checkpoint: {latest_checkpoint}")
        return str(latest_checkpoint)

    return None
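
Note that the Hugging Face Trainer names checkpoint directories checkpoint-<global_step>, which is why parsing the integer suffix reliably identifies the most recent one.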