May 21, 2024 · 15 minute read
Create an infinite icon library by fine-tuning Stable Diffusion
Yiren Lu (@YirenLu), Solutions Engineer

Icon libraries provide a clean, consistent look for web interfaces. Here at Modal, we mostly use Lucide. We also like Heroicons, a set of freely-available icons from the makers of Tailwind CSS, another open source library we use.

Some example original Heroicons

Some example icons from Heroicons: calendar-days, film, and users.

These icon libraries are incredibly useful. But like libraries of books, icon libraries are limited. If our app needs an icon for golden-retrievers or barack-obama, we’re just out of luck.

But what if icon libraries were more like Borges’ Biblioteca de Babel: an endless collection of everything we could possibly need?

Generative models like Stable Diffusion hold this exact promise: once they have seen enough examples of some kind of data, they learn to simulate the process by which that data is generated, and can then generate more, endlessly.

So as an experiment, we took a Stable Diffusion model and fine-tuned it on the Heroicons library.

Here’s an example icon it generated for barack-obama:

An icon of Barack Obama's head

Yes, we can fine-tune our own models.

You can play around with the fine-tuned model yourself here.

We were able to create a number of delightful new black-and-white line icons, all in a rough imitation of the Heroicons style:

Some example custom Heroicons

Top row: apple-computer, bmw, castle.
Middle row: ebike, future-of-ai, golden-retriever.
Bottom row: jail, piano, snowflake.

The entire application, from downloading a pretrained model through fine-tuning and up to serving an interactive web UI, is run on Modal.

Modal is a scalable, serverless cloud computing platform that abstracts away the complexities of infrastructure management.

With Modal, we can easily spin up powerful GPU instances, run the fine-tuning training script, and deploy the fine-tuned model as an interactive web app, all with just a few lines of code.

In this blog post, we’ll show you how.


Choosing a fine-tuning technique

Your first choice when fine-tuning a model is how you’re going to do it.

In full fine-tuning, the entire model is updated during training. This is the most computationally expensive method. It is particularly costly in terms of memory, because gradients and optimizer state, which together can be several times the size of the model itself, need to be kept in memory.

In sequential adapter fine-tuning, new layers are appended to the model and trained. This requires much less memory than full fine-tuning, because the number of new layers is usually small — even just one. However, it is unable to adjust the earliest layers of the model, where critical aspects of the representation are formed, and it increases the time required for inference.

In parallel adapter fine-tuning, new layers are inserted “alongside” the existing layers of the model, and their outputs superimposed on the outputs of the existing layers. This approach takes excellent advantage of the parallel processing capabilities of GPUs and the natural parallelism of linear algebra, and it has become especially popular in the last few years, in the form of techniques like LoRA (Low Rank Adaptation).

HuggingFace has pretty comprehensive documentation on all these techniques here.

For our use-case, we found that full fine-tuning worked best. But parallel adapter fine-tuning methods, like LoRA, can also work well, especially if you have a small dataset and want to fine-tune quickly.
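To make the adapter option concrete, here is a rough sketch of what attaching LoRA adapters to Stable Diffusion's UNet could look like with HuggingFace's peft library. This is not the path we took (the rest of this post does full fine-tuning), and the rank, target modules, and reliance on diffusers' built-in PEFT integration are illustrative assumptions:

# lora_sketch.py -- illustrative only; the rest of this post does full fine-tuning
from diffusers import StableDiffusionPipeline
from peft import LoraConfig

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
unet = pipe.unet

# freeze the base weights; only the low-rank adapter matrices will train
unet.requires_grad_(False)

lora_config = LoraConfig(
    r=8,  # rank of the low-rank update matrices
    lora_alpha=8,  # scaling factor applied to the adapter output
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],  # attention projections
)
unet.add_adapter(lora_config)  # requires a diffusers version with PEFT integration

trainable = sum(p.numel() for p in unet.parameters() if p.requires_grad)
total = sum(p.numel() for p in unet.parameters())
print(f"training {trainable / total:.2%} of the UNet's parameters")

Because only those small adapter matrices receive gradients, LoRA fits on much smaller GPUs than full fine-tuning does.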

Setting up accounts

If you’re following along or using this blog post as a template for your own fine-tuning experiments, make sure you have the following set up before continuing:

  • A HuggingFace account (sign up here if you don’t have one).
  • A Modal account (sign up here if you don’t have one).

Preparing the Dataset

The first step in fine-tuning Stable Diffusion for style is to prepare the dataset.

Most blog posts skip over this part, or give only a cursory overview. This gives the false impression that dataset preparation is trivial and that the models, optimization algorithms, and infrastructure are what matter most.

We found that handling the data was actually the most important and most difficult part of fine-tuning — and just about all machine learning practitioners will tell you the same.

The Heroicons set consists of around 300 SVG icons. To use it for fine-tuning, we need to:

  1. Download the Heroicons from the GitHub repo

  2. Convert the SVGs to PNGs

    Image models like Stable Diffusion are trained on rasterized graphics, so we need to convert the vector icons to bitmaps.

  3. Add white backgrounds to the PNGs

    This may seem trivial, but it is critically important: many image-generation models cannot produce transparency, so the training images need opaque white backgrounds. (A code sketch of this preprocessing appears after the list below.)

  4. Generate captions for each image and create a metadata.csv file

    Since the Heroicon filenames match the concept they represent, we can parse them into captions. We also add a prefix to each caption: “an icon of a <object>.”

    We then create a metadata.csv file, where each row is an image file name with the associated caption. The metadata.csv file should be placed in the same directory as all the training images and contain a header row with the string file_name,text

    # tree heroicons_training_dir
    heroicons_training_dir/
     ├── arrow.png
     ├── bike.png
     ├── cruiseShip.png
     └── metadata.csv
    # metadata.csv
    
    file_name,text
    arrow.png,"an icon of a arrow"
    bike.png,"an icon of a bike"
    cruiseShip.png,"an icon of a cruise ship"
  5. Upload the dataset to the HuggingFace Hub

    import os
    from datasets import load_dataset
    import huggingface_hub
    
    # login to huggingface
    hf_key = os.environ["HUGGINGFACE_TOKEN"]
    huggingface_hub.login(hf_key)
    
    dataset = load_dataset("imagefolder", data_dir="/lg_white_bg_heroicon_png_img", split="train")
    
    dataset.push_to_hub("yirenlu/heroicons", private=True)
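For reference, here is a rough sketch of what steps 2 through 4 could look like. The use of cairosvg and Pillow, and the directory paths, are our own illustrative choices rather than part of the original pipeline:

# preprocess_heroicons.py -- illustrative sketch of steps 2-4 above
import csv
from pathlib import Path

import cairosvg  # rasterizes SVGs to PNGs
from PIL import Image  # flattens transparency onto a white background

SVG_DIR = Path("heroicons/optimized/24/outline")  # assumed checkout location
OUT_DIR = Path("heroicons_training_dir")
OUT_DIR.mkdir(exist_ok=True)

rows = []
for svg_path in sorted(SVG_DIR.glob("*.svg")):
    png_path = OUT_DIR / f"{svg_path.stem}.png"

    # 2. rasterize the vector icon to a PNG
    cairosvg.svg2png(
        url=str(svg_path), write_to=str(png_path), output_width=512, output_height=512
    )

    # 3. composite the icon onto an opaque white background
    icon = Image.open(png_path).convert("RGBA")
    white = Image.new("RGBA", icon.size, (255, 255, 255, 255))
    Image.alpha_composite(white, icon).convert("RGB").save(png_path)

    # 4. derive a caption from the filename, e.g. "cruise-ship" -> "an icon of a cruise ship"
    concept = svg_path.stem.replace("-", " ")
    rows.append((png_path.name, f"an icon of a {concept}"))

# write metadata.csv alongside the images
with open(OUT_DIR / "metadata.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["file_name", "text"])
    writer.writerows(rows)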

You can see the post-processed dataset here.

Training on Modal

Setting up Diffusers dependencies on Modal

To fine-tune Stable Diffusion for style, we used the Diffusers library by HuggingFace. Diffusers provides a set of easy-to-use scripts for fine-tuning these models on custom datasets.

You can see an up-to-date list of all their scripts in their examples subdirectory.

For this fine-tuning task, we will be using the train_text_to_image.py script. This script does full fine-tuning.

When you run your code on Modal, it executes in a containerized environment in the cloud, not on your machine. This means that you need to set up any dependencies in that environment.

Modal provides a Pythonic API to define containerized environments — the same power and flexibility as a Dockerfile, but without all the tears.

# fine-tune-stable-diffusion.py
import os
import sys
from dataclasses import dataclass
from pathlib import Path

from fastapi import FastAPI
from modal import Image, App, Volume, gpu, Secret, asgi_app, enter, method

GIT_SHA = "abd922bd0c43a504e47eca2ed354c3634bd00834"  # specify the commit to fetch

image = (
    Image.debian_slim(python_version="3.10")
    .pip_install(
        "accelerate==0.27.2",
        "datasets~=2.19.1",
        "ftfy~=6.1.1",
        "gradio~=3.50.2",
        "smart_open~=6.4.0",
        "transformers~=4.38.1",
        "torch~=2.2.0",
        "torchvision~=0.16",
        "triton~=2.2.0",
        "peft==0.7.0",
        "wandb==0.16.3",
    )
    .apt_install("git")
    # Perform a shallow fetch of just the target `diffusers` commit, checking out
    # the commit in the container's current working directory, /root.
    .run_commands(
        "cd /root && git init .",
        "cd /root && git remote add origin https://github.com/huggingface/diffusers",
        f"cd /root && git fetch --depth=1 origin {GIT_SHA} && git checkout {GIT_SHA}",
        "cd /root && pip install -e .",
    )
)

Setting up Volume for cloud storage of weights

Modal provides Volumes, persistent storage that can be written to and read from those cloud containers.

We use one to store the weights after we’re done training. We then read the weights from it when it’s time to run inference and generate new icons.

# fine-tune-stable-diffusion.py

web_app = FastAPI()
app = App(name="example-diffusers-app")

MODEL_DIR = Path("/model")
model_volume = Volume.from_name("diffusers-model-volume", create_if_missing=True)

VOLUME_CONFIG = {
    "/model": model_volume,
}

Setting up hyperparameter configs

We fine-tuned from the Stable Diffusion v1.5 model, but you can just as easily fine-tune from other Stable Diffusion versions by changing the config below. We used 4000 training steps, a learning rate of 1e-5, and a batch size of 1.

We set up one dataclass, TrainConfig, to hold all the training hyperparameters, and another, AppConfig, to store all the inference hyperparameters.

# fine-tune-stable-diffusion.py

@dataclass
class TrainConfig:
    """Configuration for the finetuning training."""

    # identifier for pretrained model on Hugging Face
    model_name: str = "runwayml/stable-diffusion-v1-5"

    resume_from_checkpoint: str = "latest"
    # HuggingFace Hub dataset
    dataset_name = "yirenlu/heroicons"

    # Hyperparameters/constants from some of the Diffusers examples
    # If you switch to a different training script, adjust these to match its arguments.
    mixed_precision: str = "fp16"  # set the precision of floats during training, fp16 or less needs to be mixed with fp32 under the hood
    resolution: int = 128
    max_train_steps: int = (
        4000  # number of times to apply a gradient update during training
    )
    checkpointing_steps: int = (
        1000  # number of steps between model checkpoints, for resuming training
    )
    train_batch_size: int = (
        1  # how many images to process at once, limited by GPU VRAM
    )
    gradient_accumulation_steps: int = 1  # how many batches to process before updating the model, stabilizes training with large batch sizes
    learning_rate: float = 1e-05  # scaling factor on gradient updates, make this proportional to the batch size * accumulation steps
    lr_scheduler: str = (
        "constant"  # dynamic schedule for changes to the base learning_rate
    )
    lr_warmup_steps: int = 0  # number of steps spent ramping the learning rate up from zero at the start of training
    max_grad_norm: int = 1  # value above which to clip gradients, stabilizes training
    caption_column: str = "text"  # name of the column in the dataset that contains the captions of the images
    validation_prompt: str = "an icon of a dragon creature"


@dataclass
class AppConfig:
    """Configuration information for inference."""

    num_inference_steps: int = 50  # how many denoising steps to run at inference time; more steps generally means higher quality
    guidance_scale: float = 20  # how strongly the output should adhere to the text prompt

Running fine-tuning

Now, finally, we’re ready to fine-tune.

We first need to decorate the train function with @app.function, which tells Modal that the function should be launched in a cloud container on Modal.

Functions on Modal combine code and the infrastructure required to run it. So the @app.function decorator takes several arguments that let us specify the type of GPU we want to use for training, the Modal Volumes we want to mount to the container, and any secret values (like the HuggingFace API key) that we want to pass to the container.

This training function does a bunch of preparatory things, but the core of it is the notebook_launcher call, which runs the Diffusers training script's main function under Accelerate. Accelerate is a Python library from HuggingFace that makes it easy to leverage multiple GPUs for accelerated model training; notebook_launcher lets us launch training programmatically rather than through the accelerate CLI.

The training script saves checkpoint files every 1000 steps. To make sure that those checkpoints are persisted, we need to set _allow_background_volume_commits=True in the @app.function decorator.

# fine-tune-stable-diffusion.py

@app.function(
    image=image,
    gpu=gpu.A100(
        size="80GB"
    ),  # finetuning is VRAM hungry, so this should be an A100 or H100
    volumes=VOLUME_CONFIG,
    timeout=3600 * 2,  # multiple hours
    secrets=[Secret.from_name("huggingface-secret")],
    _allow_background_volume_commits=True
)
def train():
    import huggingface_hub
    from accelerate import notebook_launcher
    from accelerate.utils import write_basic_config

    # change this line to import the training script we want to use
    from examples.text_to_image.train_text_to_image import main
    from transformers import CLIPTokenizer

    # set up TrainConfig
    config = TrainConfig()

    # set up runner-local image and shared model weight directories
    os.makedirs(MODEL_DIR, exist_ok=True)

    # set up hugging face accelerate library for fast training
    write_basic_config(mixed_precision="fp16")

    # authenticate to hugging face so we can download the model weights
    hf_key = os.environ["HF_TOKEN"]
    huggingface_hub.login(hf_key)

    # check whether we can access the model repo
    try:
        CLIPTokenizer.from_pretrained(config.model_name, subfolder="tokenizer")
    except OSError as e:  # handle error raised when license is not accepted
        license_error_msg = f"Unable to load tokenizer. Access to this model requires acceptance of the license on Hugging Face here: https://huggingface.co/{config.model_name}."
        raise Exception(license_error_msg) from e

    def launch_training():
        sys.argv = [
            "examples/text_to_image/train_text_to_image.py",  # potentially modify
            f"--pretrained_model_name_or_path={config.model_name}",
            f"--dataset_name={config.dataset_name}",
            "--use_ema",
            f"--output_dir={MODEL_DIR}",
            f"--resolution={config.resolution}",
            "--center_crop",
            "--random_flip",
            f"--gradient_accumulation_steps={config.gradient_accumulation_steps}",
            "--gradient_checkpointing",
            f"--train_batch_size={config.train_batch_size}",
            f"--learning_rate={config.learning_rate}",
            f"--lr_scheduler={config.lr_scheduler}",
            f"--max_train_steps={config.max_train_steps}",
            f"--lr_warmup_steps={config.lr_warmup_steps}",
            f"--checkpointing_steps={config.checkpointing_steps}",
        ]

        main()

    # run training -- see huggingface accelerate docs for details
    print("launching fine-tuning training script")

    notebook_launcher(launch_training, num_processes=1)


@app.local_entrypoint()
def run():
    train.remote()

With that all in place, we can kick off a training run on Modal from anywhere with a simple command:

modal run fine-tune-stable-diffusion.py

Serving the fine-tuned model

Once fine-tune-stable-diffusion.py has finished its training run, the fine-tuned model will be saved in the Volume. We can then mount the Volume to a new Modal inference function and invoke that function from any Python code running anywhere; a sketch of such a remote call follows the class definition below.

# fine-tune-stable-diffusion.py

@app.cls(
    image=image,
    gpu="A10G", # inference requires less VRAM than training, so we can use a cheaper GPU
    volumes=VOLUME_CONFIG, # mount the location where your model weights were saved to
)
class Model:
    @enter()
    def load_model(self):

        import torch
        from diffusers import StableDiffusionPipeline

        # Reload the modal.Volume to ensure the latest state is accessible.
        model_volume.reload()

        # set up a hugging face inference pipeline using our model
        # potentially use different pipeline
        pipe = StableDiffusionPipeline.from_pretrained(
            MODEL_DIR,
            torch_dtype=torch.float16,
        ).to("cuda")

        pipe.enable_xformers_memory_efficient_attention()
        self.pipe = pipe

    @method()
    def inference(self, text, config):

        image = self.pipe(
            text,
            num_inference_steps=config.num_inference_steps,
            guidance_scale=config.guidance_scale,
        ).images[0]

        return image
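For example, once the app is deployed, a separate Python script can look up the class by name and call it remotely. A minimal sketch, assuming Modal's Cls.lookup interface and a stand-in config object (since AppConfig lives inside the Modal script):

# call_icon_model.py -- a sketch of calling the deployed model from other Python code
from types import SimpleNamespace

import modal

# look up the deployed class by its app name and class name
Model = modal.Cls.lookup("example-diffusers-app", "Model")

# a stand-in for AppConfig: anything with these two attributes works
config = SimpleNamespace(num_inference_steps=50, guidance_scale=20)

# run inference on Modal's GPUs and save the returned PIL image locally
image = Model().inference.remote("an icon of a golden retriever", config)
image.save("golden-retriever.png")

This is the same pattern our Gradio app uses in the next section, just from outside the Modal app.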

Wrapping inference in a Gradio UI

Finally, we set up a Gradio UI that will allow us to interact with our icon generator. That lets us build this entire app, from data prep to browser app, in Python.

Our Gradio app calls the Model.inference function we defined above.

We can do this from any Python code we want, but we choose to also make this part of our Modal app, because Modal makes it easy to host Python web apps.

# fine-tune-stable-diffusion.py

@app.function(
    image=image,
    concurrency_limit=3,
)
@asgi_app()
def fastapi_app():
    import gradio as gr
    from gradio.routes import mount_gradio_app

    # set up AppConfig
    config = AppConfig()

    # Call the GPU inference function on Modal.
    def go(text):
        return Model().inference.remote(text, config)

    prefix = "an icon of"

    example_prompts = [
        f"{prefix} a movie ticket",
        f"{prefix} campfire",
        f"{prefix} a castle",
        f"{prefix} a German Shepherd",
    ]

    description = f"""Describe a concept that you would like drawn as a [Heroicon](https://heroicons.com/). Try the examples below for inspiration.
    """

    # add a gradio UI around inference
    interface = gr.Interface(
        fn=go,
        inputs="text",
        outputs=gr.Image(shape=(512, 512)),
        title="Generate custom heroicons",
        examples=example_prompts,
        description=description,
        css="/assets/index.css",
        allow_flagging="never",
    )

    # mount for execution on Modal
    return mount_gradio_app(
        app=web_app,
        blocks=interface,
        path="/",
    )

Deployment on Modal is as simple as running one command:

modal deploy fine-tune-stable-diffusion.py

Parting thoughts

How does our fine-tuned model do as an infinite icon library?

More generated Heroicons

Top row: camera, chemistry, fountain-pen.
Middle row: german-shepherd, international-monetary-system, library.
Bottom row: skiing, snowman, water-bottle.

It’s certainly not perfect:

  • The model sometimes outputs multiple objects when prompted for one (water-bottle, fountain-pen).
  • Some icons have visual artifacts or strange shapes (snowman).
  • The outputs aren’t as simple as the real Heroicons (camera, german-shepherd).

Fine-tuning can be sensitive to the hyperparameters used, including dataset size, number of training steps, learning rates, and resolution.

Because we defined our training to run on Modal, we can immediately scale it up into a massive grid search — running tens or hundreds or thousands of copies of the training script at once, each with different hyperparameters.

And it only takes a few lines of code to set up a grid search. It might look like this:


RESOLUTIONS = [128, 512]
LEARNING_RATES = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1]
LEARNING_RATE_SCHEDULERS = ["constant", "cosine"]


@app.local_entrypoint()
def run():
    from uuid import uuid4

    # Note: this assumes `train` has been modified to accept a config dict
    # and to write its outputs to the given output_dir on the shared Volume.
    jobs = []
    for resolution in RESOLUTIONS:
        for learning_rate in LEARNING_RATES:
            for learning_rate_scheduler in LEARNING_RATE_SCHEDULERS:
                jobs.append(
                    train.spawn(
                        {
                            "resolution": resolution,
                            "learning_rate": learning_rate,
                            "learning_rate_scheduler": learning_rate_scheduler,
                            "output_dir": str(uuid4()),
                        }
                    )
                )

    # block until every run has finished
    for job in jobs:
        job.get()

Evaluation of which hyperparameter combinations are best will probably have to be done manually, given how subjective style can be.
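One low-tech way to do that is to render a fixed set of prompts from a fine-tuned model and compare the images side by side. Here is a minimal sketch using the Model class above; for a real grid search you would also need to parameterize which run's weights get loaded, which this sketch glosses over:

EVAL_PROMPTS = [
    "an icon of a camera",
    "an icon of a german shepherd",
    "an icon of a snowman",
]


@app.local_entrypoint()
def evaluate():
    from pathlib import Path

    out_dir = Path("eval_outputs")
    out_dir.mkdir(exist_ok=True)

    config = AppConfig()
    for prompt in EVAL_PROMPTS:
        # run inference remotely on Modal; save the returned PIL image locally for review
        image = Model().inference.remote(prompt, config)
        image.save(out_dir / f"{prompt.replace(' ', '_')}.png")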

But that’s what makes machine learning hard fun!
