Fine-tune Flux on your pet using LoRA
This example finetunes the Flux.1-dev model on images of a pet (by default, a puppy named Qwerty) using a technique called textual inversion from the “Dreambooth” paper. Effectively, it teaches a general image generation model a new “proper noun”, allowing for the personalized generation of art and photos. We supplement textual inversion with low-rank adaptation (LoRA) for increased efficiency during training.
It then makes the model shareable with others, without costing $25/day for a GPU server, by hosting a Gradio app on Modal.
It demonstrates a simple, productive, and cost-effective pathway to building on large pretrained models using Modal’s building blocks, like GPU-accelerated Modal Functions and Clses for compute-intensive work, Volumes for storage, and web endpoints for serving.
And with some light customization, you can use it to generate images of your pet!
You can find a video walkthrough of this example on the Modal YouTube channel here.
Imports and setup
We start by importing the necessary libraries and setting up the environment.
Building up the environment
Machine learning environments are complex, and the dependencies can be hard to manage. Modal makes creating and working with environments easy via containers and container images.
We start from a base image and specify all of our dependencies. We’ll call out the interesting ones as they come up below. Note that these dependencies are not installed locally — they are only installed in the remote environment where our Modal App runs.
Downloading scripts and installing a git repo with run_commands
We’ll use an example script from the diffusers library to train the model.
We acquire it from GitHub and install it in our environment with a series of commands.
The container environments Modal Functions run in are highly flexible —
see the docs for more details.
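As a rough sketch of what such a series of commands could look like, the helper below builds shell strings that fetch a single pinned ref of the diffusers repository and install it in editable mode. The function name, the /root working directory, and the git_ref argument are assumptions made for illustration; on Modal, a list like this would be handed to the image's run_commands step.

```python
def diffusers_install_commands(git_ref: str, workdir: str = "/root") -> list[str]:
    """Build shell commands that fetch one ref of diffusers and install it.

    `git_ref` is a placeholder: substitute the commit hash or tag you want to pin.
    """
    repo_url = "https://github.com/huggingface/diffusers"
    return [
        # A shallow fetch of a single ref keeps the image build fast.
        f"cd {workdir} && git init .",
        f"cd {workdir} && git remote add origin {repo_url}",
        f"cd {workdir} && git fetch --depth=1 origin {git_ref} && git checkout FETCH_HEAD",
        # Editable install, so the example training scripts stay on disk.
        f"cd {workdir} && pip install -e .",
    ]
```

Pinning to a specific ref rather than the default branch keeps the training script from changing underneath us between image builds.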
Configuration with dataclasses
Machine learning apps often have a lot of configuration information. We collect up all of our configuration into dataclasses to avoid scattering special/magic values throughout code.
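A minimal sketch of this pattern follows. Only instance_example_urls_file and max_train_steps are names taken from this example; the other fields, the class split, and every default value are assumptions chosen for illustration, not the settings used by the app.

```python
from dataclasses import dataclass


@dataclass
class SharedConfig:
    """Settings needed by both training and inference (illustrative values)."""

    instance_name: str = "Qwerty"  # the new "proper noun" we teach the model
    class_name: str = "Golden Retriever"  # assumed coarse class of the subject
    model_name: str = "black-forest-labs/FLUX.1-dev"


@dataclass
class TrainConfig(SharedConfig):
    """Training-only settings; the hyperparameter defaults are placeholders."""

    prefix: str = "a photo of"
    instance_example_urls_file: str = "instance_example_urls.txt"
    max_train_steps: int = 500
    learning_rate: float = 4e-4
    rank: int = 16

    @property
    def instance_prompt(self) -> str:
        # The text the model learns to associate with the example images.
        return f"{self.prefix} {self.instance_name} the {self.class_name}"
```

Overriding a value is then explicit at the call site, for example TrainConfig(max_train_steps=250), rather than hidden in the body of a function.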
Storing data created by our app with modal.Volume
The tools we’ve used so far work well for fetching external information,
which defines the environment our app runs in,
but what about data that we create or modify during the app’s execution?
A persisted modal.Volume can store and share data across Modal Apps and Functions.
We’ll use one to store both the original and fine-tuned weights we create during training and then load them back in for inference. For more on storing model weights on Modal, see this guide.
Note that access to the Flux.1-dev model on Hugging Face is gated by a license agreement which
you must agree to here.
After you have accepted the license, create a Modal Secret with the name huggingface-secret following the instructions in the template.
Load fine-tuning dataset
Part of the magic of low-rank fine-tuning is that we only need 3-10 images. So we can fetch just a few images, stored on consumer platforms like Imgur or Google Drive, whenever we need them, with no need for expensive, hard-to-maintain data pipelines.
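One way to load such a small dataset, sketched with the standard library only: read one URL per line from a text file, then download each one into a local folder. The function names, the one-URL-per-line file format, and the .jpg fallback suffix are assumptions; a fuller version would likely decode and re-save each image with an imaging library.

```python
from pathlib import Path
from typing import Callable
from urllib.parse import urlparse
from urllib.request import urlopen


def read_image_urls(urls_file: str) -> list[str]:
    """Return the non-empty, non-comment lines of a text file as URLs."""
    lines = (ln.strip() for ln in Path(urls_file).read_text().splitlines())
    return [ln for ln in lines if ln and not ln.startswith("#")]


def fetch_bytes(url: str) -> bytes:
    """Download one URL (requires network access)."""
    with urlopen(url, timeout=30) as response:
        return response.read()


def download_images(
    urls: list[str], dest_dir: str, fetch: Callable[[str], bytes] = fetch_bytes
) -> list[Path]:
    """Save each URL's raw bytes as <index><suffix> inside dest_dir."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, url in enumerate(urls):
        # Reuse the suffix from the URL; fall back to .jpg as a guess.
        suffix = Path(urlparse(url).path).suffix or ".jpg"
        path = dest / f"{i}{suffix}"
        path.write_bytes(fetch(url))
        paths.append(path)
    return paths
```

The fetch parameter exists so the download step can be swapped out, for instance with a stub during testing.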
Low-Rank Adaptation (LoRA) fine-tuning for a text-to-image model
The base model we start from is trained to do a sort of “reverse ekphrasis”: it attempts to recreate a visual work of art or image from only its description.
We can use the model to synthesize wholly new images by combining the concepts it has learned from the training data.
We use a pretrained model, the Flux model from Black Forest Labs. In this example, we “finetune” Flux, making only small adjustments to the weights. Furthermore, we don’t change all the weights in the model. Instead, using a technique called low-rank adaptation, we change a much smaller matrix that works “alongside” the existing weights, nudging the model in the direction we want.
We can get away with such a small and simple training process because we’re just teaching the model the meaning of a single new word: the name of our pet.
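To make the idea of a small matrix working alongside the frozen weights concrete, here is a toy sketch in pure Python. It uses the standard LoRA formulation, y = W x + (alpha / r) B (A x), where W stays fixed and only the thin matrices A and B are trained. The function names and the layer sizes are illustrative assumptions, not taken from the training script.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha=1.0):
    """Compute y = W x + (alpha / r) * B (A x).

    W is d_out x d_in and frozen; A is r x d_in and B is d_out x r.
    """
    r = len(A)  # the rank is the number of rows of A
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]


def param_counts(d_out, d_in, r):
    """Return (parameters in W, parameters in the LoRA pair A and B)."""
    return d_out * d_in, r * (d_in + d_out)
```

For a hypothetical 3072 x 3072 layer with rank 16, param_counts(3072, 3072, 16) gives 9,437,184 frozen weights against 98,304 trainable ones, roughly 1% of the layer. When B starts at zero, as is conventional, the adapted layer initially behaves exactly like the original.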
The result is a model that can generate novel images of our pet: as an astronaut in space, as painted by Van Gogh or Basquiat, etc.
Finetuning with Hugging Face 🧨 Diffusers and Accelerate
The model weights, training libraries, and training script are all provided by 🤗 Hugging Face.
You can kick off a training job with the command modal run diffusers_lora_finetune.py::app.train.
It should take about ten minutes.
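Inside the training function, the work largely comes down to launching the diffusers example script with Accelerate. The sketch below only builds the argument list; it does not run anything. The helper name and all default values are assumptions, and the flag names follow the diffusers DreamBooth LoRA script for Flux as we understand it, so check them against the version you have installed.

```python
def build_train_command(
    model_dir: str,
    instance_data_dir: str,
    output_dir: str,
    instance_prompt: str,
    max_train_steps: int = 500,
    learning_rate: float = 4e-4,
    rank: int = 16,
    resolution: int = 512,
    use_wandb: bool = False,
) -> list[str]:
    """Build the `accelerate launch` argument list for LoRA fine-tuning."""
    cmd = [
        "accelerate",
        "launch",
        # Path relative to a checkout of the diffusers repository.
        "examples/dreambooth/train_dreambooth_lora_flux.py",
        "--mixed_precision=bf16",
        f"--pretrained_model_name_or_path={model_dir}",
        f"--instance_data_dir={instance_data_dir}",
        f"--output_dir={output_dir}",
        f"--instance_prompt={instance_prompt}",
        f"--resolution={resolution}",
        f"--learning_rate={learning_rate}",
        f"--rank={rank}",
        f"--max_train_steps={max_train_steps}",
    ]
    if use_wandb:
        # Optional experiment tracking; needs a W&B API key in the environment.
        cmd.append("--report_to=wandb")
    return cmd
```

Passing the list to subprocess.run(cmd, check=True) from the directory containing the diffusers checkout would start the job. Keeping the command as a list avoids shell-quoting problems with prompts that contain spaces.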
Training machine learning models takes time and produces a lot of metadata: metrics for performance and resource utilization, metrics for model quality and training stability, and model inputs and outputs like images and text. Keeping track of all of it is especially important if you’re fiddling around with the configuration parameters.
This example can optionally use Weights & Biases to track all of this training information. Just sign up for an account, switch the flag below, and add your API key as a Modal Secret.
You can see an example W&B dashboard here. Check out this run, which despite having high GPU utilization suffered from numerical instability during training and produced only black images — hard to debug without experiment management logs!
You can read more about how the values in TrainConfig are chosen and adjusted in this blog post on Hugging Face.
To run training on images of your own pet, upload the images to separate URLs and edit the contents of the file at TrainConfig.instance_example_urls_file to point to them.
Tip: if the results you’re seeing don’t match the prompt too well, and instead produce an image
of your subject without taking the prompt into account, the model has likely overfit. In this case, repeat training with a lower
value of max_train_steps. If you used W&B, look back at results earlier in training to determine where to stop.
On the other hand, if the results don’t look like your subject, you might need to increase max_train_steps.
Running our model
To generate images from prompts using our fine-tuned model, we define a Modal Function called inference.
Naively, this would seem to be a bad fit for the flexible, serverless infrastructure of Modal: wouldn’t you need to include the steps to load the model and spin it up in every function call?
In order to initialize the model just once on container startup,
we use Modal’s container lifecycle features, which require the function to be part
of a class. Note that the modal.Volume we saved the model to is mounted here as well,
so that the fine-tuned model created by train is available to us.
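The load-once pattern can be sketched in plain Python, without Modal, to show why a class is needed: the expensive setup lives in one method and the per-request work in another. Here load_pipeline is a hypothetical callable standing in for loading the fine-tuned weights from the Volume, and the class is an illustration rather than the app's actual code. On Modal, the setup method would carry the @modal.enter() decorator and the class would be registered with the App.

```python
class Model:
    """Plain-Python stand-in for a Modal class with a container lifecycle hook."""

    def __init__(self, load_pipeline):
        # `load_pipeline` is a hypothetical callable that builds and returns
        # the (slow to construct) fine-tuned pipeline.
        self._load_pipeline = load_pipeline
        self.pipe = None

    def enter(self):
        # On Modal, a method marked with `@modal.enter()` runs once when the
        # container starts, before any requests are handled.
        self.pipe = self._load_pipeline()

    def inference(self, prompt: str):
        # Every call reuses the pipeline that `enter` already loaded.
        return self.pipe(prompt)
```

However many prompts a warm container serves, the loader runs only once, so the cost of reading the weights is paid per container rather than per call.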
Wrap the trained model in a Gradio web UI
Gradio makes it super easy to expose a model’s functionality in an easy-to-use, responsive web interface.
This model is a text-to-image generator, so we set up an interface that includes a user-entry text box and a frame for displaying images.
We also provide some example text inputs to help guide users and to kick-start their creative juices.
And we couldn’t resist adding some Modal style to it as well!
You can deploy the app on Modal with the command modal deploy diffusers_lora_finetune.py.
You’ll be able to come back days, weeks, or months later and find it still ready to go,
even though you don’t have to pay for a server to run while you’re not using it.
Running your fine-tuned model from the command line
You can use the modal command-line interface to set up, customize, and deploy this app:
- modal run diffusers_lora_finetune.py will train the model. Change the instance_example_urls_file to point to your own pet’s images.
- modal serve diffusers_lora_finetune.py will serve the Gradio interface at a temporary location. Great for iterating on code!
- modal shell diffusers_lora_finetune.py is a convenient helper to open a bash shell in our image. Great for debugging environment issues.
Remember, once you’ve trained your own fine-tuned model, you can deploy it permanently — for no cost when it is not being used! —
using modal deploy diffusers_lora_finetune.py.
If you just want to try the app out, you can find our deployment here.