December 20, 2024 · 15 minute read
What we learned fine-tuning a FLUX.1-dev style LoRA
Author: Yiren Lu (@YirenLu), Solutions Engineer

FLUX.1-dev, a 12B parameter model developed by Black Forest Labs, is one of the hottest open-source text-to-image AI models on the market today.

But what if you want to use Flux to generate images that adhere to a specific artistic style or theme? Can you fine-tune Flux?

The answer is yes!

In a previous blog post, we fine-tuned Stable Diffusion 1.5 on Heroicons, a set of freely-available icons from the makers of Tailwind CSS. Our results were decent, but a bit noisy and more detailed and complex than the very clean, abstract Heroicon style. The training also took on the order of hours, because we did a full fine-tune.

In this follow-up blog post, we switched to FLUX.1-dev and got better results. We fine-tuned it on Heroicons using the Dreambooth with LoRA technique. The Flux models are well-known for their superior performance in generating clean lines and text, which aligns with our results. We end up with a fine-tuned Flux style LoRA that allows us to generate an infinite icon library.

Here are some example output images:

heroicon-fine-tuned

From top left: Donald Trump, Beyonce, Barack Obama, Hillary Clinton, cocktail, castle, phone, tent, golden retriever, heart, Mt. Everest, dress

You can play around with the LoRA yourself here.

We’ll cover everything from curating the dataset to the GPU and fine-tuning technique used. Fine-tuning is run on Modal, a scalable, serverless cloud computing platform that abstracts away the complexities of infrastructure management.


Fine-tuning technique

For our experiment, we use a fine-tuning technique called Dreambooth, which teaches Flux a new concept (e.g. a style or a character) by associating a special word with the example images. In particular, we use a LoRA implementation of Dreambooth, which gets close to full fine-tuning performance with much less memory.

In LoRA fine-tuning, instead of updating all the parameters of a model during training, you introduce low-rank matrices that capture the essential changes needed for adaptation. During the fine-tuning process, only these low-rank “adapters” are updated. At inference time, you load the unchanged base model and then apply the LoRA adapters on top of it. Compared to full fine-tuning, this approach offers faster training times and lower memory usage.
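
To make the idea concrete, here is a minimal sketch of a single LoRA-adapted linear layer in plain Python. It is an illustration of the general technique, not the code Diffusers runs: the helper names, the `alpha / rank` scaling, the zero initialization of `B`, and the layer width of 3072 are all assumptions chosen for the example.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_forward(x, W, A, B, alpha, rank):
    """Compute x @ W + (alpha / rank) * (x @ A @ B) without changing W."""
    base = matmul(x, W)               # frozen base-model path
    update = matmul(matmul(x, A), B)  # trainable low-rank adapter path
    scale = alpha / rank
    return [[b + scale * u for b, u in zip(b_row, u_row)]
            for b_row, u_row in zip(base, update)]


def trainable_params(d_in, d_out, rank):
    """Number of adapter parameters: A is d_in x rank, B is rank x d_out."""
    return rank * (d_in + d_out)


d_in, d_out, rank, alpha = 8, 8, 2, 2
rng = random.Random(0)
W = [[rng.gauss(0, 1) for _ in range(d_out)] for _ in range(d_in)]     # frozen
A = [[rng.gauss(0, 0.01) for _ in range(rank)] for _ in range(d_in)]   # trainable
B = [[0.0] * d_out for _ in range(rank)]                               # trainable, zero init

x = [[1.0] * d_in]
# With B initialized to zero, the adapter is a no-op before training starts.
starts_as_noop = lora_forward(x, W, A, B, alpha, rank) == matmul(x, W)

# Illustrative layer width only; compares adapter size to the full weight matrix.
full_params = 3072 * 3072
lora_params = trainable_params(3072, 3072, rank=16)
print(starts_as_noop, full_params, lora_params)
```

For a square layer of this illustrative size, a rank-16 adapter trains about 1% as many parameters as the full weight matrix, which is where the memory savings come from.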

At this point, you might be wondering, how do you choose between full fine-tuning and LoRA fine-tuning?

Generally speaking, the best practice is to start with LoRA fine-tuning, and then, if the results are not adequate, move on to full fine-tuning.

You might have also heard of other optimization techniques like qLoRA, where the frozen base model is quantized (typically to 4 bits) to cut down on memory usage, while the LoRA adapters themselves are trained in higher precision.

Should you use qLoRA to fine-tune Flux?

The answer is that it’s probably not necessary. You can run a LoRA fine-tune on a single A100 40GB GPU without needing to quantize down to 8 bits.
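
A back-of-the-envelope estimate shows why. The sketch below only counts the memory needed to hold the 12B transformer weights at different precisions. It assumes 16-bit weights for the unquantized case (the post does not state the precision used) and ignores the text encoders, the VAE, activations, and the adapter's gradients and optimizer state, so real usage will be higher.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Memory needed just to hold the weights, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


NUM_PARAMS = 12e9  # FLUX.1-dev parameter count quoted in this post

for label, bits in [("bf16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>5}: {weight_memory_gib(NUM_PARAMS, bits):.1f} GiB")
```

Under these assumptions the weights alone come to roughly 22.4 GiB at 16 bits, 11.2 GiB at 8 bits, and 5.6 GiB at 4 bits, so the unquantized weights leave some headroom on a 40GB card.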

Preparing the Dataset

heroicon-training-data

Some examples of our training data

In our previous attempt at fine-tuning on Heroicons, we used all 300+ publicly available Heroicons as our training set.

With LoRA, we can fine-tune with fewer images. Instead of the hundreds we would need for full fine-tuning, we only need around 20-25 images, so we curate 25 images. Some things to note about the training data:

  • Variety of images: For LoRA, more important than the number of images is the variety and representativeness of the images. For example, we deliberately choose icons that represent both tangible objects and more abstract concepts, as well as icons that contain both primarily straight lines/angles as well as curves.
  • Captioning: To train a style LoRA, it’s very helpful to provide individual captions for the images. The nice thing about Heroicons is that each Heroicon image file is already named, so it’s easy to convert those names into appropriate captions. Our captions take the form "an HCON, a black and white minimalist icon of a <object>." Note that this means that at inference time, we will need to prompt with something similar to the caption, e.g. "an HCON, a black and white minimalist icon of Barack Obama", in order to trigger the style.
  • Dreambooth keyword: As previously mentioned, Dreambooth is a technique that teaches Flux a new concept (e.g. a style or a character) by associating a special word with the example images. In our case, that special word is HCON. At inference time, HCON must be present in the prompt.
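
As a rough sketch of the captioning step, the function below turns an icon file name into a caption in the format described above. The helper name, the assumption that files are named with hyphens (for example `academic-cap.svg`), and the simple a/an rule are ours for illustration; the post does not show the actual conversion code.

```python
from pathlib import Path

TRIGGER_WORD = "HCON"  # the Dreambooth keyword used in this post


def caption_from_filename(filename: str) -> str:
    """Turn a name like 'academic-cap.svg' into a training caption."""
    subject = Path(filename).stem.replace("-", " ").replace("_", " ")
    # Naive article choice; good enough for short icon names.
    article = "an" if subject[0].lower() in "aeiou" else "a"
    return (
        f"an {TRIGGER_WORD}, a black and white minimalist icon of "
        f"{article} {subject}"
    )


# Hypothetical file names, used only to show the output format.
for name in ["bell.svg", "academic-cap.svg", "shopping-cart.svg"]:
    print(name, "->", caption_from_filename(name))
```

Keeping the trigger word and the fixed style phrase identical across all captions means the only part that varies is the object, which is what we swap out at inference time.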

The final dataset used for our LoRA fine-tuning is here.

Training on Modal

To fine-tune on Modal, we can adapt this Dreambooth example, which shows you how to run a Diffusers fine-tuning script on Modal.

Diffusers is a HuggingFace-produced library that provides a set of easy-to-use scripts for fine-tuning Diffusion models on custom datasets. You can see an up-to-date list of all their scripts in their examples subdirectory.

For this fine-tuning task, we will be using the train_dreambooth_lora_flux.py script. This script does a Dreambooth fine-tune with LoRA.

Setting up accounts

If you’re following along or using this blog post as a template for your own fine-tuning experiments, make sure you have the following set up before you use the scripts above:

  • A HuggingFace account (sign up here if you don’t have one).
  • A Modal account (sign up here if you don’t have one).

Hyperparameter optimization

Fine-tuning a model is highly sensitive to the selection of hyperparameters. These parameters significantly influence the training process and the final performance of the model. Key hyperparameters to consider include the number of training steps, the learning rate, and, specifically for LoRA fine-tuning, the rank. In our experiments, we explored a range of values around commonly used configurations, varying the following parameters:

  • Rank: The higher the rank chosen, the closer it approximates full fine-tuning. On the other hand, the higher the rank chosen, the more memory and time it takes to train. In general, the LoRA rank chosen should correspond to the complexity of the style. For simpler styles, you can probably get away with ranks like 4, 8, or 16. For more complex styles, you will probably need rank 32 or 64. In general, training a style LoRA requires a higher rank than training a character LoRA. This makes intuitive sense - style LoRAs need to capture more nuanced details.

  • Learning Rate: The standard learning rate for full fine-tuning of diffusion models is typically around 1e-6. LoRA fine-tuning, however, generally allows for higher learning rates, because it trains only a small number of adapter parameters while the base weights stay frozen, making it less prone to overfitting.

  • Max training steps: This parameter defines the total number of training iterations the model will undergo. A full fine-tune of a diffusion model will often require 10,000 steps, but you can generally get pretty good LoRA results in fewer than 5,000 steps.

In addition to these primary hyperparameters, we also utilized the following hyperparameters:

resolution: int = 512
train_batch_size: int = 1
gradient_accumulation_steps: int = 1
lr_scheduler: str = "constant"
lr_warmup_steps: int = 0
seed: int = 0
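
One way these settings might be passed to the training script is as command-line flags. The sketch below assumes each field name matches a flag of the same name in train_dreambooth_lora_flux.py (worth confirming with the script's `--help`), and the `TrainArgs` class and `to_cli_flags` helper are hypothetical names for illustration. It deliberately leaves out the model, dataset, and output flags, and the swept values shown are just one combination from our sweep.

```python
from dataclasses import asdict, dataclass


@dataclass
class TrainArgs:
    # Fixed values listed above
    resolution: int = 512
    train_batch_size: int = 1
    gradient_accumulation_steps: int = 1
    lr_scheduler: str = "constant"
    lr_warmup_steps: int = 0
    seed: int = 0
    # Swept values; one example combination
    learning_rate: float = 2e-4
    max_train_steps: int = 4000
    rank: int = 16


def to_cli_flags(args: TrainArgs) -> list[str]:
    """Render each field as --name=value (assumes flag names match field names)."""
    return [f"--{name}={value}" for name, value in asdict(args).items()]


command = [
    "accelerate", "launch", "train_dreambooth_lora_flux.py",
    *to_cli_flags(TrainArgs()),
]
print(" ".join(command))
```

Keeping the arguments in a dataclass makes it straightforward to override a few fields per run, which is what the sweep does.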

Performing hyperparameter search with Modal

Modal makes it easy to scale up our training by running tens or hundreds of copies of the training script at once, each with different hyperparameters.

To do this, we first set up a Python class with the different hyperparameter values we want to search through.

from dataclasses import dataclass


@dataclass
class SweepConfig(TrainConfig):
    """Configuration for hyperparameter sweep"""

    # Sweep parameters
    learning_rates = [8e-5, 2e-4]
    train_steps = [1000, 1500, 3000, 4000]
    ranks = [4, 8, 16]

Next, we write a function that generates all possible combinations of the hyperparameters:

import itertools
from pathlib import Path


def generate_sweep_configs(sweep_config: SweepConfig):
    """Generate all combinations of hyperparameters"""
    param_combinations = itertools.product(
        sweep_config.learning_rates,
        sweep_config.train_steps,
        sweep_config.ranks,
    )

    # MODEL_DIR is defined elsewhere in the app
    return [
        {
            "learning_rate": lr,
            "max_train_steps": steps,
            "rank": rank,
            # store the different LoRAs in different directories within the same volume
            "output_dir": Path(MODEL_DIR) / f"lr_{lr}_steps_{steps}_rank_{rank}",
        }
        for lr, steps, rank in param_combinations
    ]
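
As a quick sanity check on the size of the sweep, the standalone snippet below enumerates the same grid with the values from our sweep configuration and shows what the per-run directory names look like. The variable names are ours for illustration.

```python
import itertools

learning_rates = [8e-5, 2e-4]
train_steps = [1000, 1500, 3000, 4000]
ranks = [4, 8, 16]

combos = list(itertools.product(learning_rates, train_steps, ranks))
dir_names = [f"lr_{lr}_steps_{steps}_rank_{rank}" for lr, steps, rank in combos]

print(len(combos))    # 24 training runs: 2 learning rates x 4 step counts x 3 ranks
print(dir_names[0])   # lr_8e-05_steps_1000_rank_4
print(dir_names[-1])  # lr_0.0002_steps_4000_rank_16
```

Note that the directory names follow Python's default float formatting, so 8e-5 appears as `8e-05` and 2e-4 as `0.0002`.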

Finally, we use the local entrypoint to orchestrate the hyperparameter sweep using .map():

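from collections import defaultdict


@app.local_entrypoint()
def run():
    import wandb

    sweep_config = SweepConfig()
    app_config = AppConfig()
    configs = generate_sweep_configs(sweep_config)

    # results_by_rank[rank][prompt][steps] -> wandb.Image
    # Note: the keys omit the learning rate, so runs that differ only in
    # learning rate overwrite each other.
    results_by_rank = defaultdict(lambda: defaultdict(dict))

    # Log results to wandb
    with wandb.init(
        project="flux-lora-sweep-heroicons",
        name="hyperparameter_sweep",
    ) as wandb_run:
        # Run one training job per config in parallel; each call is expected
        # to return the config it trained with.
        for config in train.map(configs):
            learning_rate = config["learning_rate"]
            rank = config["rank"]
            max_train_steps = config["max_train_steps"]

            # Generate test images with the freshly trained LoRA
            for image, prompt in Model(
                learning_rate, rank, max_train_steps
            ).inference.starmap(
                [(x, app_config) for x in sweep_config.heroicon_test_prompts]
            ):
                results_by_rank[rank][prompt][max_train_steps] = wandb.Image(image)

        # Log results to wandb. The original snippet does not show the payload;
        # one flat key per image is an assumption.
        wandb_run.log(
            {
                f"rank_{rank}/{prompt}/steps_{steps}": img
                for rank, by_prompt in results_by_rank.items()
                for prompt, by_steps in by_prompt.items()
                for steps, img in by_steps.items()
            }
        )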
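
If you want to get a feel for this fan-out pattern without deploying anything, here is a local analogy that uses Python's standard-library thread pool. This is not Modal's API: `fake_train` is a hypothetical stand-in for the remote training function, and the configs are a small made-up subset.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_train(config: dict) -> dict:
    """Stand-in for the remote training function; returns its config when done."""
    return config


configs = [
    {"rank": rank, "max_train_steps": steps}
    for rank in (4, 8)
    for steps in (1000, 3000)
]

# Fan the configs out to workers; Executor.map yields results in input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    finished = list(pool.map(fake_train, configs))

# Group the finished runs by rank, mirroring how the sweep organizes results.
steps_by_rank = {}
for cfg in finished:
    steps_by_rank.setdefault(cfg["rank"], []).append(cfg["max_train_steps"])

print(steps_by_rank)
```

The shape is the same as in the sweep: map a function over a list of configs, then collect and group whatever each call returns.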

Results

Learning rate 8e-5

Below, we see the results across rank 4, 8, and 16 for a progressive number of training steps.

lr_8e-5_rank_4

lr_8e-5_rank_8

lr_8e-5_rank_16

Learning rate 2e-4

Below, we see the results across rank 4, 8, and 16 for a progressive number of training steps.

lr_2e-4_rank_4

lr_2e-4_rank_8

lr_2e-4_rank_16

There’s obviously a lot of subjectivity when it comes to deciding which hyperparameter combination gives the best results, but to our eyes, at least, it appears that the LoRA with rank 16, trained for 4000 steps at a learning rate of 2e-4 (the last image), gives the best results in terms of being clean, well-structured, and appropriately representing the prompt concept.

Some further observations:

  • There’s some overfitting and graininess with the lower learning rate (8e-5), particularly when trained for an extended number of steps.
  • Although the general guideline suggests that simpler styles require lower ranks, our findings indicate that the lower ranks produced results that were unexpectedly more complex and noisy. It seems that a higher rank was necessary to effectively capture the true “style” of Heroicons, which are inherently abstract and conceptual icons.
  • The base FLUX.1-dev model was initially trained on 1024x1024 images, while our dataset consisted of lower-resolution 512x512 images. As a potential improvement, we could consider resizing our fine-tuning dataset to 1024x1024 to evaluate whether the outputs improve.
