Create an infinite icon library by fine-tuning Stable Diffusion
For part 2 of this blog post, on how we fine-tuned Flux.1-dev with the same dataset, see here.
Icon libraries provide a clean, consistent look for web interfaces. Here at Modal, we mostly use Lucide. We also like Heroicons, a set of freely-available icons from the makers of Tailwind CSS, another open source library we use.
calendar-days, film, and users.

These icon libraries are incredibly useful.
But like libraries of books, icon libraries are limited.
If our app needs an icon for golden-retrievers or barack-obama,
we’re just out of luck.
But what if icon libraries were more like Borges’ Biblioteca de Babel: an endless collection of everything we could possibly need?
Generative models like Stable Diffusion hold this exact promise: once they have seen enough examples of some kind of data, they learn to simulate the process by which that data is generated, and can then generate more, endlessly.
So as an experiment, we took a Stable Diffusion model and fine-tuned it on the Heroicons library.
Here’s an example icon it generated for barack-obama:
You can play around with the fine-tuned model yourself here.
We were able to create a number of delightful new black-and-white line icons, all in a rough imitation of the Heroicons style:
Top row: apple-computer, bmw, castle. Middle row: ebike, future-of-ai, golden-retriever. Bottom row: jail, piano, snowflake.

The entire application, from downloading a pretrained model through fine-tuning and up to serving an interactive web UI, runs on Modal.
Modal is a scalable, serverless cloud computing platform that abstracts away the complexities of infrastructure management.
With Modal, we can easily spin up powerful GPU instances, run the fine-tuning training script, and deploy the fine-tuned model as an interactive web app, all with just a few lines of code.
In this blog post, we’ll show you how.
Table of contents
- Choosing a fine-tuning technique
- Setting up accounts
- Preparing the dataset
- Training on Modal
- Serving the fine-tuned model
- Wrapping inference in a Gradio UI
- Parting thoughts
Choosing a fine-tuning technique
Your first choice when fine-tuning a model is how you’re going to do it.
In full fine-tuning, the entire model is updated during training. This is the most computationally expensive method. It is particularly costly in terms of memory, because optimizer state and gradients, which together can be several times the size of the model, must be kept in memory.
In sequential adapter fine-tuning, new layers are appended to the model and trained. This requires much less memory than full fine-tuning, because the number of new layers is usually small — even just one. However, it is unable to adjust the earliest layers of the model, where critical aspects of the representation are formed, and it increases the time required for inference.
In parallel adapter fine-tuning, new layers are inserted “alongside” the existing layers of the model, and their outputs superimposed on the outputs of the existing layers. This approach takes excellent advantage of the parallel processing capabilities of GPUs and the natural parallelism of linear algebra, and it has become especially popular in the last few years, in the form of techniques like LoRA (Low Rank Adaptation).
HuggingFace has pretty comprehensive documentation on all these techniques here.
For our use-case, we found that full fine-tuning worked best. But parallel adapter fine-tuning methods, like LoRA, can also work well, especially if you have a small dataset and want to fine-tune quickly.
Setting up accounts
If you’re following along or using this blog post as a template for your own fine-tuning experiments, make sure you have the following set up before continuing:
- A HuggingFace account (sign up here if you don’t have one).
- A Modal account (sign up here if you don’t have one).
Preparing the dataset
The first step in fine-tuning Stable Diffusion for style is to prepare the dataset.
Most blog posts skip over this part or give only a cursory overview, which leaves the false impression that dataset preparation is trivial and that the models, optimization algorithms, and infrastructure are what matter most.
We found that handling the data was actually the most important and most difficult part of fine-tuning — and just about all machine learning practitioners will tell you the same.
The Heroicons set consists of around 300 SVG icons. To use it for fine-tuning, we need to:

1. Download the Heroicons from the GitHub repo.
2. Convert the SVGs to PNGs. Image models are trained on rasterized graphics, so we need to convert the icons.
3. Add white backgrounds to the PNGs. This may seem trivial, but it is critically important: many models are incapable of outputting images with transparency.
4. Generate captions for each image and create a `metadata.csv` file. Since the Heroicon filenames match the concepts they represent, we can parse them into captions. We also add a prefix to each caption: “an icon of a <object>.” We then create a `metadata.csv` file, where each row pairs an image file name with its caption. The `metadata.csv` file should be placed in the same directory as the training images and contain the header row `file_name,text`.
5. Upload the dataset to the HuggingFace Hub.
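The last few steps can be sketched in a few lines of Python. This is a minimal illustration, not our exact pipeline: it assumes the SVGs have already been rasterized to RGBA PNGs (e.g. with a tool like cairosvg), and the function name is our own.

```python
import csv
from pathlib import Path

from PIL import Image  # Pillow


def prepare_dataset(icon_dir: Path) -> None:
    """Flatten RGBA icons onto white backgrounds and write metadata.csv."""
    rows = []
    for png in sorted(icon_dir.glob("*.png")):
        icon = Image.open(png).convert("RGBA")
        # Composite onto an opaque white background: many models cannot
        # produce transparency, so we bake the background in here.
        background = Image.new("RGBA", icon.size, "white")
        Image.alpha_composite(background, icon).convert("RGB").save(png)
        # Filename "golden-retriever.png" -> caption "an icon of a golden retriever"
        concept = png.stem.replace("-", " ")
        rows.append({"file_name": png.name, "text": f"an icon of a {concept}"})
    with open(icon_dir / "metadata.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["file_name", "text"])
        writer.writeheader()
        writer.writerows(rows)
```

The resulting directory of PNGs plus `metadata.csv` is exactly the layout HuggingFace’s `datasets` library expects for an image-folder dataset.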
You can see the post-processed dataset here.
Training on Modal
Setting up Diffusers dependencies on Modal
To fine-tune Stable Diffusion for style, we used the Diffusers library by HuggingFace. Diffusers provides a set of easy-to-use scripts for fine-tuning these models on custom datasets.
You can see an up-to-date list of all their scripts in their examples subdirectory.
For this fine-tuning task, we will be using the train_text_to_image.py script. This script does full fine-tuning.
When you run your code on Modal, it executes in a containerized environment in the cloud, not on your machine. This means that you need to set up any dependencies in that environment.
Modal provides a Pythonic API to define containerized environments — the same power and flexibility as a Dockerfile, but without all the tears.
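A container image for this job might be defined like so. This is a sketch: the app name and package list are illustrative, not the exact ones we used.

```python
import modal

# Everything the Diffusers training script needs, declared in Python.
# The package list is illustrative; pin versions in a real run.
image = modal.Image.debian_slim().pip_install(
    "torch",
    "diffusers",
    "transformers",
    "accelerate",
    "datasets",
)

app = modal.App("icon-fine-tune", image=image)
```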
Setting up Volume for cloud storage of weights
Modal provides network file systems, Volumes, for writing information persistently from those cloud containers.
We use one to store the weights after we’re done training. We then read the weights from it when it’s time to run inference and generate new icons.
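Creating one is a single call; the Volume name and mount path below are illustrative.

```python
import modal

# A named Volume persists files across containers and across app runs.
volume = modal.Volume.from_name("model-weights", create_if_missing=True)

# It is attached to a container at a path of our choosing, e.g.:
#   @app.function(volumes={"/model": volume})
```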
Setting up hyperparameter configs
We fine-tuned from the Stable Diffusion v1.5 model, but you can just as easily fine-tune other Stable Diffusion versions by changing the config below. We used 4000 training steps, a learning rate of 1e-5, and a batch size of 1.
We set up one dataclass, TrainConfig, to hold all the training hyperparameters,
and another, AppConfig, to store all the inference hyperparameters.
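Sketched as dataclasses, using the hyperparameters mentioned above (the field names, the exact model identifier, and the inference defaults are assumptions):

```python
from dataclasses import dataclass


@dataclass
class TrainConfig:
    # Training hyperparameters from the run described above
    model_name: str = "runwayml/stable-diffusion-v1-5"  # assumed HF model id
    resolution: int = 512
    max_train_steps: int = 4000
    learning_rate: float = 1e-5
    train_batch_size: int = 1


@dataclass
class AppConfig:
    # Inference-time settings (illustrative defaults)
    num_inference_steps: int = 50
    guidance_scale: float = 7.5
```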
Running fine-tuning
Now, finally, we’re ready to fine-tune.
We first need to decorate the train function with @app.function,
which tells Modal that the function should be launched in a cloud container on Modal.
Functions on Modal combine code and the infrastructure required to run it.
So the @app.function decorator takes several arguments that let us specify
the type of GPU we want to use for training,
the Modal Volumes we want to mount to the container,
and any secret values (like the HuggingFace API key) that we want to pass to the container.
This training function does a bunch of preparatory things,
but the core of it is the notebook_launcher call that launches the actual Diffusers training script as a subprocess.
In particular, we launch the script using Accelerate’s notebook_launcher utility.
Accelerate is a Python library that makes it easy to leverage multiple GPUs for accelerated model training.
The training script saves checkpoint files every 1000 steps.
To make sure that those checkpoints are persisted,
we need to set _allow_background_volume_commits=True in the @app.function decorator.
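Put together, the scaffolding around the training function might look like this. It is a sketch: the GPU type, timeout, secret name, and the elided body are assumptions, not our exact code.

```python
import modal

@app.function(
    gpu="A100",  # which GPU type to train on (illustrative choice)
    volumes={"/model": volume},  # attach the Volume for checkpoints
    secrets=[modal.Secret.from_name("huggingface")],  # HF API key
    timeout=2 * 60 * 60,  # training runs long; raise the default limit
    _allow_background_volume_commits=True,  # persist checkpoints as they land
)
def train(config):
    from accelerate.utils import notebook_launcher

    # ...download the dataset, build the argument list for
    # train_text_to_image.py, then hand off to Accelerate
    # (training_loop is a stand-in for that prepared entrypoint):
    notebook_launcher(training_loop, num_processes=1)
```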
With that all in place, we can kick off a training run on Modal from anywhere with a simple command:
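Assuming the app lives in fine-tune-stable-diffusion.py, that command is:

```shell
modal run fine-tune-stable-diffusion.py
```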
Serving the fine-tuned model
Once fine-tune-stable-diffusion.py has finished its training run, the fine-tuned model will be saved in the Volume.
We can then mount the Volume to a new Modal inference function,
which we can then invoke from any Python code running anywhere.
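The inference side might be sketched like this: a Modal class that loads the fine-tuned weights from the Volume once per container, then serves generations. The class and method names, prompt prefix, and pipeline details are assumptions.

```python
import io

import modal

@app.cls(gpu="A100", volumes={"/model": volume})
class Model:
    @modal.enter()
    def load(self):
        # Load the fine-tuned weights from the Volume, once per container
        from diffusers import StableDiffusionPipeline

        self.pipe = StableDiffusionPipeline.from_pretrained("/model").to("cuda")

    @modal.method()
    def inference(self, prompt: str) -> bytes:
        # Reuse the caption prefix from training for best results
        image = self.pipe(f"an icon of a {prompt}").images[0]
        buf = io.BytesIO()
        image.save(buf, format="PNG")
        return buf.getvalue()
```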
Wrapping inference in a Gradio UI
Finally, we set up a Gradio UI that will allow us to interact with our icon generator. That lets us build this entire app, from data prep to browser app, in Python.
Our Gradio app calls the Model.inference function we defined above.
We can do this from any Python code we want, but we choose to also make this part of our Modal app, because Modal makes it easy to host Python web apps.
Deployment on Modal is as simple as running one command:
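Assuming the same fine-tune-stable-diffusion.py file, that command is:

```shell
modal deploy fine-tune-stable-diffusion.py
```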
Parting thoughts
How does our fine-tuned model do as an infinite icon library?
Top row: camera, chemistry, fountain-pen. Middle row: german-shepherd, international-monetary-system, library. Bottom row: skiing, snowman, water-bottle.

It’s certainly not perfect:
- The model sometimes outputs multiple objects when prompted for one (water-bottle, fountain-pen).
- Some icons have visual artifacts or strange shapes (snowman).
- The outputs aren’t as simple as the real Heroicons (camera, german-shepherd).
Fine-tuning can be sensitive to the hyperparameters used, including dataset size, number of training steps, learning rates, and resolution.
Because we defined our training to run on Modal, we can immediately scale it up into a massive grid search — running tens or hundreds or thousands of copies of the training script at once, each with different hyperparameters.
And it only takes a few lines of code to set up a grid search. It might look like this:
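For instance, building the grid itself is a few lines of standard Python. The ranges below are invented for illustration, and `train` stands for the Modal training function defined earlier:

```python
import itertools

# Hypothetical hyperparameter ranges to sweep -- values are illustrative
learning_rates = [1e-6, 1e-5, 1e-4]
train_steps = [2000, 4000, 6000]
resolutions = [256, 512]

# One argument tuple per combination: 3 * 3 * 2 = 18 runs
grid = list(itertools.product(learning_rates, train_steps, resolutions))

# With `train` as the Modal function defined earlier, taking
# (learning_rate, steps, resolution), starmap fans out one cloud
# container per combination and runs them in parallel:
# results = list(train.starmap(grid))
```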
Evaluation of which hyperparameter combinations are best will probably have to be done manually, given how subjective style can be.
But that’s what makes machine learning hard fun!