August 21, 2024 · 7 minute read
Scaling ComfyUI
Kenny Ning (@kenny_ning)
Growth Engineer

In our ComfyUI example, we demonstrate how to run a ComfyUI workflow with arbitrary custom models and nodes as an API.

But does it scale?

Generally, any code run on Modal leverages our serverless autoscaling behavior:

  • One container per input (default behavior), i.e. if a live container is busy processing an input, a new container will spin up
  • Option to have one container handle more than one concurrent input with allow_concurrent_inputs
  • Option to keep some containers always live with keep_warm

In this post, we’ll load test our ComfyUI API endpoint across these options and compare how these different approaches affect latency and cost.
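
For reference, all three of these knobs live on the Modal class decorator. Here’s a rough sketch with placeholder values, not a recommended configuration:

@app.cls(
    gpu="A10G",
    allow_concurrent_inputs=10,  # let one container work on up to 10 inputs at once
    keep_warm=2,                 # always keep at least 2 containers live
)
class ComfyUI:
    ...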

Load testing with Locust

First, we’ll serve the existing ComfyUI example on an A10G:

@app.cls(gpu="A10G")

Then we’ll use Python load testing library Locust to simulate web traffic to our hosted ComfyUI endpoint.

# locustfile.py
import locust
from locust import HttpUser, task


class FakeUser(HttpUser):
    wait_time = locust.between(1, 5)

    prompts = [
        "a regal phoenix with feathers that ignite with rebirth",
        "a majestic peacock with secret patterns in its feathers",
        "a majestic bald eagle with hidden constellations in its plumage",
        "a graceful swan with feathers that purify_water"
        # 100 other AI-generated prompts to choose from random
    ]

    @task
    def generate_image(self):
        import random

        self.client.post(
            "/",
            json={"prompt": random.choice(self.prompts)},
        )

Running locust in the same directory as the above locustfile.py opens an interactive UI, where we can specify how many concurrent users we want to simulate. Each “fake” user will select a random prompt from our list and submit a POST request to our ComfyUI API endpoint, then wait 1-5 seconds between subsequent requests.

Each request triggers our ComfyUI workflow to draw the given prompt into a background image, like so:

[Image: collage of generated images]

Option 1: Run one container per input

  • Median response time: 4.4s
  • Estimated cost per minute: $0.18 (10 A10Gs at $1.10/h)

By default, a Modal web endpoint will spin up a new container per input unless one is already sitting idle and ready. Let’s simulate 10 concurrent users using the Locust web UI:

[Image: Locust web UI configured for 10 concurrent users]

We can see in our Modal dashboard that a new container spins up to process each individual request:

[Image: Modal dashboard showing 10 containers]

This is a good default autoscaling option. However, the first few inputs need to wait for a container to start from scratch — a “cold start”. This cost is noticeable in the first data point of the yellow line in the Locust report.

In our ComfyUI application, this time is mostly spent launching the ComfyUI server in the background, as indicated in the @enter function of our ComfyUI class:

class ComfyUI:
    @modal.enter()
    def launch_comfy_background(self):
        cmd = "comfy launch --background"
        subprocess.run(cmd, shell=True, check=True)

Note that the cold start here does not include downloading models or custom nodes; all of that is done once at image build time. However, loading those models and custom nodes into memory happens at container start time, and long start times are most often driven by workflows with many custom nodes.
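
To make the build-time vs. start-time split concrete, here’s a minimal sketch of baking a model download into the Modal image so it never lands on the cold-start path. The repo ID, filename, download_models helper, and checkpoint path below are illustrative assumptions, not values from the example:

import modal


def download_models():
    # Runs once at image build time, so the weights are baked into the image
    # rather than downloaded when a container starts.
    from huggingface_hub import hf_hub_download

    hf_hub_download(
        repo_id="some-org/some-checkpoint",  # hypothetical model repo
        filename="model.safetensors",        # hypothetical weights file
        local_dir="/root/comfy/ComfyUI/models/checkpoints",  # assumes a comfy-cli style install
    )


image = (
    modal.Image.debian_slim(python_version="3.11")
    .pip_install("comfy-cli", "huggingface_hub")
    .run_function(download_models)  # executes at build time, not at container start
)

With this split, the only start-time work left is loading those weights into memory and launching the ComfyUI server.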

Option 2: Run multiple inputs on one container

  • Median response time: 32s
  • Estimated cost per minute: $0.02 (1 A10G)

One downside of Option 1 is cost; you have to pay for usage on each of the 10 GPUs you launched. Let’s try running all requests on a single container/GPU with allow_concurrent_inputs.

@app.cls(allow_concurrent_inputs=10, gpu="A10G")

When we run the same Locust load test, only one container is provisioned:

[Image: Modal dashboard showing a single container]

This setting is 10x cheaper, but nearly 10x slower. This is because ComfyUI is single threaded and can only process one input at a time. The chart below shows the elevated response time in Option 2 (right side), compared to Option 1 (left side).

[Image: Locust response time charts for Option 1 (left) and Option 2 (right)]

Option 3: Maintain a warm pool with keep_warm

Option 1 eventually stabilizes to ~4s response time, but the first request can be upwards of ~20s because it has to wait for the container to warm up. To drive these first request response times down, we can specify a minimum number of containers to have always running with keep_warm. To demonstrate, let’s set up a warm pool of 5 containers on our endpoint:

@app.cls(keep_warm=5, gpu="A10G")

After a few seconds, we have 5 containers ready to accept inputs:

[Image: Modal dashboard showing 5 warm containers]

Let’s run the Option 1 load test again (10 concurrent users, 1 container per input), once without keep_warm (left) and once with keep_warm (right):

[Image: Locust load test results without keep_warm (left) and with keep_warm (right)]

This reduced the first request response time by ~10s (compare the first point of the yellow line in each chart). However, unlike the previous options, you will always have a minimum of 5 containers live, costing you $0.09 per minute (5 A10Gs at $1.10/h). That’s equivalent to ~$130 per day, so be sure to use this option with caution.
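
The cost math behind those figures, spelled out (using the same $1.10/h A10G rate as above):

# warm pool of 5 A10Gs, always on
cost_per_minute = 5 * 1.10 / 60           # ~$0.09 per minute
cost_per_day = cost_per_minute * 60 * 24  # ~$132 per day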

A more economical complement to keep_warm is container_idle_timeout, which specifies how many seconds a container should wait after processing a request before spinning down. By default, it is one minute. By extending this timeout to, say, five minutes, we increase the chance we can re-use this container for the next request and save some cold start time.
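
For example, both knobs can sit on the same class decorator. A quick sketch, with illustrative values rather than recommendations:

@app.cls(
    gpu="A10G",
    keep_warm=2,                 # always keep 2 containers live
    container_idle_timeout=300,  # let idle containers linger for 5 minutes before spinning down
)
class ComfyUI:
    ...

This keeps a small always-on floor while letting extra containers stick around just long enough to absorb follow-up requests.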

Scaling to 100 concurrent users

Now let’s run another load test with our Option 1 settings (one container per input, no keep_warm), starting with 10 concurrent users and then scaling up to 100; a sketch for scripting this ramp in Locust follows the summary below. This graph tells a good story of Modal’s container lifecycle:

[Image: Locust chart of response times as concurrent users scale from 10 to 100]

At a high level:

  • The first few requests will take some time while containers start up (~20s)
  • System eventually stabilizes at ~5s per request (mostly raw ComfyUI workflow execution time)
  • As users scale to 100, response time increases temporarily while new containers spin up to help work through the demand spike
  • System goes back to normal after ~1 minute
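
If you’d rather script this ramp than click through the Locust web UI, Locust’s LoadTestShape hook is one way to do it. A minimal sketch; the durations and user counts below are illustrative, not the exact settings we used:

# locustfile.py (continued)
from locust import LoadTestShape


class RampShape(LoadTestShape):
    """Hold 10 users for two minutes, then jump to 100."""

    def tick(self):
        run_time = self.get_run_time()
        if run_time < 120:
            return (10, 10)   # (user count, spawn rate)
        if run_time < 360:
            return (100, 10)
        return None           # returning None stops the test

Locust automatically uses a LoadTestShape subclass defined in the locustfile, so running locust against this file reproduces the 10-to-100 ramp without any manual input in the UI.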

At peak load, Modal scaled up to 62 concurrent GPUs:

[Image: Modal dashboard showing GPU usage spiking to 62 concurrent GPUs]

Note that only Enterprise customers can scale this high. Team workspaces are limited to 30 concurrent GPUs and Starter workspaces are limited to 10.

Conclusion

Yes, ComfyUI as an API does scale well with serverless! However, you need to think about how to balance inference speed against cost, and much of that balance depends on your specific ComfyUI application.

  1. Have a lot of custom nodes? Your container cold start might be longer than in this experiment, and you might see better performance running concurrent requests on fewer containers, i.e. raising allow_concurrent_inputs
  2. “Bursty” traffic? If you expect clients to use the API multiple times in short succession, increase container_idle_timeout and/or set a small warm pool to increase the chance of re-using live containers across sessions.

The only way to know the right balance is to run similar experiments on your deployment and see what works best for you.

Coda: Deploying with Comfy Deploy

We’re proud to be the underlying serverless provider of Comfy Deploy, the easiest way to take a local workflow and deploy it to production with a rich UI, team collaboration features, and development environments. Because Comfy Deploy uses Modal under the hood, the same scaling principles mentioned here also apply to workflows deployed with them.

Thanks to the team at Comfy Deploy for inspiring this blog post and providing feedback!
