February 28, 2025 · 5 minute read
Cold start ComfyUI in less than 3 seconds with memory snapshotting
Kenny Ning (@kenny_ning), Growth Engineer
Tolga Oguz, Software Engineer

Thanks to community member Tolga Oguz for getting memory snapshots to work with ComfyUI. Check out the repo: modal-comfy-worker.

In this post, we show how Modal’s memory snapshotting feature can drive ComfyUI cold starts down from 10-15 seconds to less than 3 seconds on average.

Figure: ComfyUI cold start (seconds)

The problem: high ComfyUI cold starts

Cold start is the time it takes for a container to start up.

In the context of ComfyUI on Modal, cold start is the time it takes for the ComfyUI server to launch (i.e. running ComfyUI main.py).
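To make that concrete, here's a rough way to measure it yourself against a local ComfyUI install (the install path is an assumption; 8188 is ComfyUI's default port):

# Rough way to measure cold start: time from launching main.py until the
# ComfyUI server answers HTTP. Install path is an assumption.
import subprocess
import time
import urllib.request

start = time.perf_counter()
proc = subprocess.Popen(
    ["python", "main.py", "--port", "8188"],
    cwd="/root/comfy/ComfyUI",
)
while True:
    try:
        urllib.request.urlopen("http://127.0.0.1:8188/", timeout=1)
        break
    except OSError:  # connection refused until the server is up
        time.sleep(0.1)
print(f"cold start: {time.perf_counter() - start:.1f}s")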

Let’s break down the cold start for our ComfyUI example.

Figure: ComfyUI cold start breakdown

Note that loading a checkpoint (e.g. Flux, Stable Diffusion) does not happen at cold start, but rather on the first workflow execution.

Custom node prestartup scripts

Feb 13 13:33:10.476 Launching ComfyUI from: /root/comfy/ComfyUI
Feb 13 13:33:11.355 [START] Security scan
[DONE] Security scan
## ComfyUI-Manager: installing dependencies done.
** ComfyUI startup time: 2025-02-13 18:33:11.350
** Platform: Linux
** Python version: 3.11.10 (main, Dec 3 2024, 02:25:00) [GCC 12.2.0]
** Python executable: /usr/local/bin/python
** ComfyUI Path: /root/comfy/ComfyUI
** User directory: /root/comfy/ComfyUI/user
** ComfyUI-Manager config path: /root/comfy/ComfyUI/user/default/ComfyUI-Manager/config.ini
** Log path: /root/comfy/ComfyUI/user/comfyui.log
Feb 13 13:33:11.975 Prestartup times for custom nodes:
Feb 13 13:33:11.980   1.4 seconds: /root/comfy/ComfyUI/custom_nodes/ComfyUI-Manager

First, we run prestartup scripts for custom nodes; here, that's ComfyUI-Manager's, which runs a security scan (to make sure none of our custom nodes are malicious), imports standard libraries, and ensures we have the necessary dependencies.
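Roughly, ComfyUI does this by looking for a prestartup_script.py in each custom node directory and executing it before any heavy imports happen. A simplified sketch of that mechanism (the real execute_prestartup_script in ComfyUI's main.py also records per-node timings and handles errors):

# Simplified sketch of ComfyUI's prestartup step
import os
import importlib.util

def run_prestartup_scripts(custom_nodes="/root/comfy/ComfyUI/custom_nodes"):
    for node in os.listdir(custom_nodes):
        script = os.path.join(custom_nodes, node, "prestartup_script.py")
        if os.path.exists(script):
            # ComfyUI-Manager's prestartup script is where the security scan
            # and dependency checks in the logs above come from
            spec = importlib.util.spec_from_file_location("prestartup", script)
            module = importlib.util.module_from_spec(spec)
            spec.loader.exec_module(module)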

Import torch

Feb 13 13:33:14.292 Total VRAM 45589 MB, total RAM 928032 MB
pytorch version: 2.5.1+cu124
Feb 13 13:33:14.297 Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA L40S : cudaMallocAsync
Feb 13 13:33:16.253 Using pytorch attention
Feb 13 13:33:18.170 [Prompt Server] web root: /root/comfy/ComfyUI/web

Importing torch is by far the heaviest part of starting up the ComfyUI server; the import alone executes some 26,000 syscalls.
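You can measure this cost in isolation; timings will vary with hardware and whether the filesystem cache is warm:

# Time the torch import by itself; on a cold container this dominates startup
import time

start = time.perf_counter()
import torch  # deliberately imported late so we can time it

print(f"import torch: {time.perf_counter() - start:.2f}s")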

Initialize custom nodes


Feb 13 13:33:19.068 WAS Node Suite: Created default conf file at `/root/comfy/ComfyUI/custom_nodes/pr-was-node-suite-comfyui-47064894/was_suite_config.json`.
Feb 13 13:33:20.119 WAS Node Suite: OpenCV Python FFMPEG support is enabled
WAS Node Suite Warning: `ffmpeg_bin_path` is not set in `/root/comfy/ComfyUI/custom_nodes/pr-was-node-suite-comfyui-47064894/was_suite_config.json` config file. Will attempt to use system ffmpeg binaries if available.
Feb 13 13:33:20.418 WAS Node Suite: Finished. Loaded 218 nodes successfully.
"Art is the voice of the soul, expressing what words cannot." - Unknown
Feb 13 13:33:20.435 ### Loading: ComfyUI-Manager (V3.7.6)
Feb 13 13:33:20.498 ### ComfyUI Revision: 2980 [ee9547ba] *DETACHED | Released on '2024-12-26'
Feb 13 13:33:20.512 Import times for custom nodes:
Feb 13 13:33:20.517   0.0 seconds: /root/comfy/ComfyUI/custom_nodes/websocket_image_save.py
  0.1 seconds: /root/comfy/ComfyUI/custom_nodes/ComfyUI-Manager
  1.7 seconds: /root/comfy/ComfyUI/custom_nodes/pr-was-node-suite-comfyui-47064894
Feb 13 13:33:20.520 Starting server

Lastly, we initialize our custom nodes. Specifically, this is the time it takes for init_external_custom_nodes() to run, which loops through each custom node directory and imports the relevant modules. For customers running workflows with many custom nodes, this step is often the biggest cold start driver.
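Conceptually, that loop looks something like this simplified sketch (the real init_external_custom_nodes also tracks per-node import times, skips disabled nodes, and catches errors):

# Simplified sketch of ComfyUI's custom node initialization
import os
import sys
import importlib

NODE_CLASS_MAPPINGS = {}

def init_custom_nodes(custom_nodes="/root/comfy/ComfyUI/custom_nodes"):
    sys.path.append(custom_nodes)
    for entry in os.listdir(custom_nodes):
        name = entry[:-3] if entry.endswith(".py") else entry
        module = importlib.import_module(name)  # this import is the slow part
        # each node package registers its nodes via a NODE_CLASS_MAPPINGS dict
        NODE_CLASS_MAPPINGS.update(getattr(module, "NODE_CLASS_MAPPINGS", {}))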

Solution: memory snapshotting

Modal offers an advanced feature called memory snapshots that lets you snapshot your container's state right after startup. Subsequent containers can then restore from this snapshot, improving cold start times by 1.5-3x.
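Enabling it is a one-flag change. A minimal sketch (the app and function names here are ours):

import modal

app = modal.App("snapshot-demo")

@app.function(enable_memory_snapshot=True)
def handler():
    # Global imports and pre-snapshot setup run once; their memory state is
    # captured in the snapshot and restored directly in later containers.
    ...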

One complication of using memory snapshots is that you can only snapshot CPU memory. If you try to snapshot the ComfyUI server start step with a GPU attached, snapshotting will fail.

Luckily, community member Tolga Oguz came up with a great solution: override core ComfyUI to skip the CUDA device checks on server start. We don't need the GPU at this stage, when we're effectively just running a lot of imports:

# https://github.com/tolgaouz/modal-comfy-worker/blob/main/comfy/experimental_server.py
import logging
from contextlib import contextmanager
from typing import List

logger = logging.getLogger(__name__)  # stand-in for the repo's own logger

class ExperimentalComfyServer:
    """Experimental ComfyUI server that runs workflows in the main thread.

    Features:
    - Executes workflows without starting separate server process
    - Allows model preloading to CPU for faster cold starts
    - Uses modified ComfyUI components for direct execution
    - Maintains similar interface to regular ComfyServer
    """

    MSG_TYPES_TO_PROCESS = [
        "executing",
        "execution_cached",
        "execution_complete",
        "execution_start",
        "progress",
        "status",
        "completed",
    ]

    @contextmanager
    def force_cpu_during_snapshot(self):
        """Monkeypatch Torch CUDA checks during model loading/snapshotting"""
        import torch
        original_is_available = torch.cuda.is_available
        original_current_device = torch.cuda.current_device

        # Force Torch to report no CUDA devices
        torch.cuda.is_available = lambda: False
        torch.cuda.current_device = lambda: torch.device("cpu")

        try:
            yield
        finally:
            # Restore original implementations
            torch.cuda.is_available = original_is_available
            torch.cuda.current_device = original_current_device

    def __init__(self, config=None, preload_models: List[str] = []):
        """Initialize experimental server.

        Args:
            config: Compatibility with regular server (not used)
            preload_models: List of model paths to preload to CPU
        """
        with self.force_cpu_during_snapshot():
            logger.info("Initializing experimental server")
            self.preload_models = preload_models
            self.initialized = False
            self.model_cache = {}
            self.executor = None

            # Set up ComfyUI environment overrides
            self._override_comfy(preload_models)

Then we wrap the ComfyUI server initialization steps in a method decorated with @enter(snap=True) to enable snapshotting; memory snapshots must also be enabled on the class itself via enable_memory_snapshot=True:

@app.cls(gpu="L40S", enable_memory_snapshot=True)  # assumes `app = modal.App(...)` is defined above
class ComfyWorkflow:
    @enter(snap=True)
    def run_this_on_container_startup(self):
        self.web_app = FastAPI()
        self.server = ExperimentalComfyServer()
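
Because the snapshot only captures CPU state, any GPU setup has to happen after restore. Modal runs methods decorated with @enter(snap=False) after each restore, once the GPU is attached. A sketch of how you might use this, assuming models were preloaded to CPU RAM in the server's model_cache (the method name and cache handling here are our own):

    # Inside the same ComfyWorkflow class:
    @enter(snap=False)
    def restore_gpu_state(self):
        # Runs on every restore, after the GPU is reattached. Hypothetical:
        # move any weights preloaded to CPU RAM onto the GPU here.
        import torch

        for model in self.server.model_cache.values():
            if hasattr(model, "to"):
                model.to(torch.device("cuda"))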

Deploy your ComfyUI app and the first few runs will create a memory snapshot. These first few cold starts will be longer, but subsequent cold starts restore from the snapshot. In our experiment, 90% of containers started in less than 3 seconds, several times faster than the baseline of 10 to sometimes 20 second cold starts.

Conclusion

For customers running ComfyUI at scale and spinning up containers frequently, memory snapshots are a game changer. Containers that previously took tens of seconds to start now start in single-digit seconds.

The tradeoff of this approach is that the code is relatively untested and may require some effort to get working, especially for complex workflows. Reach out on Slack or open an issue on modal-comfy-worker if you have any questions.
