
Hundreds of companies come to Modal for help deploying diffusion models. Some are a great fit for serverless infrastructure; others aren’t. In this blog post, we’ll share a framework for how we help teams select the right infrastructure for their needs.
If you’re building with diffusion models, your options range from using inference API providers to self-deploying on generic cloud platforms. Below is a summary of tradeoffs that correspond to different ends of this spectrum:
The most important factor for your decision is how much fine-grained control you need over the outputs of the model. The more control you need, the more customization is required.
Why customize at all?
Diffusion models are used to generate creative assets—anything from video avatars to 3D jewelry designs to personalized ads. If you don’t need a high level of control over the outputs, prompt engineering alone is sufficient. This is often the case when the outputs are used for a company’s free-tier offering or are auxiliary to the main product.
But if your core competency lies in the creation of unique, high-quality assets, customization is required to achieve objectives like:
- Applying unique styles to outputs in a consistent way (e.g. professional headshots with a cool-tone color palette, photorealistic facial details, and diffuse lighting)
- Controlling objects in the output in order to maintain a certain structure in the image or include/exclude certain elements (e.g. include accurate XYZ brand logo on clothing, exclude NSFW elements)
- Offering users a slate of editing features to tweak outputs (e.g. background replacement, inserting objects into an image with the right shadows)
Dimensions of customization
There are 3 general ways to customize diffusion models, which can be used together or individually.
- Full-parameter fine-tune: produces outputs that generalize well for a specific domain
- Training/using inference-time adapters (e.g. LoRA, ControlNet, IP adapter): produces outputs that adhere to specific stylistic/compositional elements that you feed it
- Assembling multi-component pipelines: produces outputs that satisfy multiple aesthetic requirements by using multiple models, adapters, and image processing steps
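For example, here’s a minimal sketch of the second approach using Hugging Face diffusers: layering a style LoRA on top of FLUX.1-dev at inference time. The LoRA repository name is a placeholder; any compatible adapter works the same way.

```python
import torch
from diffusers import FluxPipeline

# Load an off-the-shelf base model.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

# Apply a style LoRA at inference time; the adapter weights stay separate
# from the base model. (Placeholder repository name.)
pipe.load_lora_weights("your-org/headshot-style-lora")

image = pipe(
    "professional headshot, cool-tone color palette, diffuse lighting",
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]
image.save("headshot.png")
```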
Customization is more accessible than ever
There are not only internal business drivers for customization; there are also external factors that make it practical to do so.
- Open-source models like FLUX.1-dev are state-of-the-art. Users can typically achieve the best outputs for their objectives by customizing open-source models rather than using closed-source ones.
- Diffusion model use cases have relatively forgiving latency requirements, and these models (at least for image generation) can be loaded into memory in less than a minute. This means self-deployment can be economical even at smaller scales.
- There’s a robust ecosystem of practitioners who have open-sourced a variety of diffusion model checkpoints, adapters, and workflows. Communities like Civitai, the StableDiffusion subreddit, and HuggingFace’s discord have hundreds of thousands of members.
Infrastructure for low customization needs
We define low customization as using an off-the-shelf diffusion model by itself or with 1-2 inference-time adapters. If this is you, you should default to the far left of the spectrum above and use an inference API provider like Replicate or Fal. Why?
[Ease of use] There’s no need to write or optimize inference code.
[Performance] There’s a finite set of popular diffusion models, so these providers always have warm GPUs with these models pre-loaded, minimizing cold start time.
[Cost] You pay per output rather than having to optimize resource utilization.
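To make the contrast concrete, the low-customization path can be a single API call. Here’s a rough sketch using Replicate’s Python client; the model slug and input parameters are illustrative, so check the provider’s model schema before relying on them.

```python
import replicate  # requires REPLICATE_API_TOKEN in your environment

# One call per output; the provider keeps the model loaded on warm GPUs.
output = replicate.run(
    "black-forest-labs/flux-dev",  # illustrative model slug
    input={
        "prompt": "a watercolor illustration of a lighthouse at dusk",
        "num_outputs": 1,
    },
)
print(output)  # typically a list of URLs or file objects for the generated images
```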
Note that at scale, you may eventually want to switch to self-deployment for these reasons:
[Performance] If you have large bursts of traffic, you’ll run up against rate limits on API platforms.
[Cost] If you have sustained, steady traffic, your cost per output will be cheaper if you self-deploy. This is because you’ll be able to amortize cold start latency/cost across a larger number of outputs.
[Control] API platforms may make updates to the versions of underlying models, adapters, and ComfyUI nodes, or change inference-time configurations. Self-deployment gives you guaranteed reproducibility.
Infrastructure for high customization needs
Your decision will look quite different if you need a high degree of control over generated outputs. Your process of building a custom diffusion pipeline will have a few stages: exploration, customization, and productionization. Each stage of this journey warrants a new decision point on the appropriate infrastructure to use (summarized below).
Stage 1: Exploration
You’ll typically start by exploring the capabilities of state-of-the-art models. This phase will center around prompt engineering, generating many samples, and comparing the outputs of different base models. Note that this isn’t a one-and-done phase, since new models and techniques are constantly being announced.
What’s the right infra for this?
Inference API platforms are the easiest way to try out newly released models. These providers have interactive playgrounds where you can prompt models and observe the effects of different parameters. When you begin prototyping your application, you can simply plug in various model endpoints to your architecture to test end-to-end workflows.
Stage 2: Customization
As we outlined earlier, there are a few different ways to customize a diffusion pipeline. You may need to train new weights if you’re fine-tuning the base model or creating custom adapters. You may also need to assemble complex diffusion pipelines.
What’s the right infra for training new weights?
Some inference API providers offer simple training services, but you’ll likely need code-level control over the training process. This allows you to preprocess your dataset, directly leverage ML libraries like accelerate, use arbitrary base models and training techniques, experiment with hyperparameters efficiently, and overlay ML observability frameworks to guide your process.
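As a rough illustration of what that control looks like, here’s a deliberately toy accelerate training loop. The model, data, and loss below are stand-ins for a real LoRA or fine-tuning setup; the point is the structure, where preprocessing, hyperparameters, and logging all live in your own code.

```python
import torch
from accelerate import Accelerator
from torch.utils.data import DataLoader

accelerator = Accelerator()  # handles device placement, mixed precision, multi-GPU

# Stand-ins: in a real fine-tune these would be LoRA layers inside a diffusion
# model and your own preprocessed (image, caption) dataset.
model = torch.nn.Linear(64, 64)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # your hyperparameters
dataloader = DataLoader(torch.randn(256, 64), batch_size=32)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for step, batch in enumerate(dataloader):
    loss = (model(batch) - batch).pow(2).mean()  # placeholder for the diffusion loss
    accelerator.backward(loss)  # same call on one GPU or many
    optimizer.step()
    optimizer.zero_grad()
    if accelerator.is_main_process:
        print(f"step {step}: loss {loss.item():.4f}")  # swap in your observability framework
```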
For GPU compute, you can choose between serverless platforms like Modal, managed ML platforms like SageMaker, or rolling your own infrastructure on a traditional cloud platform like GCP.
We recommend serverless compute for these benefits:
[Speed of development]
- You can instantly run any job on H100s and A100s without having to wait on availability, provisioning, or long container starts.
- You can easily parallelize data preprocessing or experiments like hyperparameter sweeps (see the sketch after this list). These platforms automatically spin up as many GPUs as you need.
- You spend much less time waiting around for ML dependencies to be installed on your containers. These platforms have optimized systems for caching image builds.
[Cost]
- You can get single GPUs rather than overcommitting to large configurations.
- You’re only charged for usage, and you don’t need to manually turn resources on/off.
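Here’s the hyperparameter-sweep sketch mentioned above, written against Modal’s Python SDK. The image contents, GPU type, and training stub are illustrative placeholders.

```python
import modal

image = modal.Image.debian_slim().pip_install("torch", "diffusers", "accelerate")
app = modal.App("lora-sweep", image=image)

@app.function(gpu="A100", timeout=60 * 60)
def train_lora(learning_rate: float, rank: int) -> float:
    # Placeholder: run your fine-tuning job here and return a validation metric.
    print(f"training with lr={learning_rate}, rank={rank}")
    return 0.0

@app.local_entrypoint()
def main():
    configs = [(lr, rank) for lr in (1e-4, 5e-5) for rank in (8, 16, 32)]
    # starmap fans each config out to its own GPU container, in parallel.
    for metric in train_lora.starmap(configs):
        print(metric)
```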
When might serverless platforms not be a good fit?
- If you need a one-stop-shop ML platform with features like native notebook interfaces and model versioning, you might still choose a managed ML platform.
- If you need complete infrastructure control (e.g. to configure multi-node networking for distributed training), you might still choose a traditional cloud platform. This is likely only relevant if you’re undertaking a full-parameter fine-tune of a large diffusion model.
What’s the right infra for assembling custom pipelines?
Workflow orchestration tools or even simple chained function calling can be used to build out pipelines. More commonly, though, we’re seeing companies use ComfyUI to assemble and execute advanced diffusion pipelines that have many components. You can use hosted versions of ComfyUI (e.g. RunComfy, Comfy Deploy, Fal) for simplicity. Note that you’ll be limited to the nodes (i.e. model checkpoints, adapters, or processing steps) that are available on those platforms.
If you’re using less popular nodes or custom nodes, you’ll have to run ComfyUI yourself. Similar to deciding on a compute platform for training weights, you can either use a serverless compute platform or deploy on a traditional cloud setup. Many of the same benefits of serverless apply here—specifically, you can get a ComfyUI server up and running in seconds, and you only pay for GPU time when the server is running.
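As a hedged sketch, here’s roughly what that looks like on Modal using the comfy-cli installer; a real setup would also bake model checkpoints and custom nodes into the image, and the exact flags may differ in your environment.

```python
import subprocess
import modal

image = (
    modal.Image.debian_slim(python_version="3.11")
    .pip_install("comfy-cli")  # ComfyUI's command-line installer
    .run_commands("comfy --skip-prompt install --nvidia")  # bake ComfyUI into the image
    # In practice, also download model checkpoints and custom nodes here.
)
app = modal.App("comfyui", image=image)

@app.function(gpu="A100")
@modal.web_server(8188, startup_timeout=60)
def ui():
    # Start the ComfyUI server; the platform exposes the port at a URL and
    # releases the GPU container when the server is idle.
    subprocess.Popen("comfy launch -- --listen 0.0.0.0 --port 8188", shell=True)
```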
Stage 3: Productionization
Once your customized pipeline reliably produces the outputs you want, the final stage is serving it to users.
What’s the right infra for this?
For highly customized pipelines, you’ll need the control of self-deployment. Some API platforms allow you to deploy custom models and ComfyUI workflows, but we wouldn’t recommend this. These platforms focus on making inference fast for popular models by ensuring there are always warm containers available. They are not optimized, however, for fast cold starts and autoscaling of arbitrary containers and code.
This leaves you with the choice of serverless platforms, managed ML platforms, or traditional cloud platforms. We would recommend serverless platforms for these reasons:
[Performance/cost tradeoff] Diffusion models require powerful GPUs to run, which means cost control is likely a high priority for you.
- With cloud hyperscalers, you have to make reservations of a year or more to access H100s/A100s at the best price points. Not only is this a large commitment, but the static nature of reservations means you’re either underutilizing what you pay for or unable to meet peak customer demand. For variable workloads like real-time inference, we’ve seen fast autoscaling with a serverless platform increase resource utilization from 40% to over 80%.
- Alternative clouds like Lambda Labs offer better prices for on-demand H100s/A100s (sometimes 75% cheaper!), but availability and reliability will vary. At face value their GPU/hr prices might sometimes be lower than serverless providers’, but keep in mind that on-demand instances still have minimum usage times, take minutes to spin up/down, and don’t autoscale.
[Speed of development]
- Traditional ML and cloud platforms have convoluted interfaces. Modern serverless platforms are much simpler to set up, obviating the need to manage extensive configuration surface areas.
- Serverless platforms let you define different dependencies and hardware per function, making it easy to use heterogeneous instance types. This is useful if your diffusion pipeline consists of multiple components that each have different needs (see the sketch below).
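To make that last point concrete, here’s a hedged sketch of a two-component pipeline on Modal: a CPU step for prompt preprocessing and a GPU step for diffusion, each declaring its own image and hardware. The function bodies and model choice are illustrative, not a production recipe.

```python
import io
import modal

cpu_image = modal.Image.debian_slim().pip_install("pillow")
gpu_image = modal.Image.debian_slim().pip_install(
    "torch", "diffusers", "transformers", "accelerate"
)

app = modal.App("diffusion-pipeline")

@app.function(image=cpu_image)  # lightweight CPU container for preprocessing
def expand_prompt(prompt: str) -> str:
    # Placeholder: prompt rewriting, safety filtering, template insertion, etc.
    return f"{prompt}, professional headshot, cool-tone palette, diffuse lighting"

@app.cls(image=gpu_image, gpu="A100")
class Diffusion:
    @modal.enter()  # runs once per container, so the model stays loaded between requests
    def load(self):
        import torch
        from diffusers import FluxPipeline
        self.pipe = FluxPipeline.from_pretrained(
            "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
        ).to("cuda")

    @modal.method()
    def generate(self, prompt: str) -> bytes:
        image = self.pipe(prompt, num_inference_steps=28).images[0]
        buf = io.BytesIO()
        image.save(buf, format="PNG")
        return buf.getvalue()

@app.local_entrypoint()
def main(prompt: str = "a software engineer"):
    full_prompt = expand_prompt.remote(prompt)            # runs in the CPU container
    png_bytes = Diffusion().generate.remote(full_prompt)  # runs in the GPU container
    open("output.png", "wb").write(png_bytes)
```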
When might serverless platforms not be a good fit?
- If you need a one-stop-shop ML platform with features like A/B testing and model monitoring, you might still choose a managed ML platform.
- If you need complete infrastructure control (e.g. custom scaling logic, unusual hardware requirements), you might still choose a traditional cloud platform.
- If your inference endpoints are very latency sensitive (e.g. 50ms overhead would be too high) or receive stable, constant traffic, you might still choose a traditional platform and keep a static number of GPUs warm at all times.
While these capabilities may not exist in serverless platforms today, they are actively being worked on. Here at Modal, we’re expanding our ML feature set and constantly shaving milliseconds off our overhead latency.
Conclusion
There’s no one-size-fits-all solution! You’ll be solving for different needs at different stages of development; you may also use multiple solutions concurrently for different parts of your application. Customers like Suno, OpenArt, and Eden use a mix of inference API providers and self-deployment for different features. At the end of the day, your decisions will be primarily informed by how much model customization is required to generate the outputs you need.
If you’re building with diffusion models and just getting started, we’re happy to be your spiritual guides. Find us in #diffusion-models in our community Slack.