May 12, 202620 minute read

How to achieve truly serverless GPUs

author
Charles Frye@charles_irl
Member of Technical Staff
author
Jonathan Belotti@jonobelotti_IO
Member of Technical Staff
author
Akshat Bubna@akshat_b
CTO and Founder

Diagram indicating latency to scale up an inference server in a baseline cloud system and on Modal

We are in the age of inference. Billion- to trillion-parameter neural networks are run on specialized accelerators at quadrillions of operations per second to generate media, author software, and fold proteins at massive scale.

Inference workloads are more variable and less predictable than the training workloads that previously dominated. That makes them a natural fit for serverless computing, where applications are defined at a level above the (virtual) machine so that they can be more readily scaled up and down to handle variable load.

But serverless computing only works if new replicas can be spun up quickly — as fast as demand changes, which can be at the scale of seconds. Naïvely spinning up a new instance of, say, SGLang serving a billion-parameter LLM on a B200 can take tens of minutes or stall for hours on GPU availability.

At Modal, we’ve done deep engineering work over the last five years to solve this problem. In this blog post, we walk through what we did.

There are four key ingredients:

  • Cloud buffers: maintain a small buffer of healthy, idle GPUs to take on new load
  • Custom filesystem: serve container images lazily out of a content-addressed, multi-tier cloud-native cache
  • Checkpoint/restore: fast-forward through CPU-side initialization by directly restoring processes into memory
  • CUDA checkpoint/restore: fast-forward through GPU-side initialization by directly restoring CUDA contexts into memory

Together, they take AI inference server replica scaling from multiple kiloseconds to just tens of seconds.

We’ve shared bits and pieces of this work along the way, because we believe that secrecy is a bad moat. And if more people learn how to use GPUs efficiently, there will be more available in the market for us!

But this blog post represents the first time we’ve put the entire story together in one place. We hope it convinces you that our system is worth buying into — or joining us to build it.

Why care about serverless GPUs? To maximize GPU Allocation Utilization for inference workloads.

First, let’s frame the problem clearly. GPUs are expensive and scarce, so we want to maximize their utilization, where “utilization” is the following unitless quantity:

Utilization := Output achieved ÷ Capacity paid for

There are many ways to measure utilization — to define output and capacity. The most sophisticated and most stringent here is probably “Model FLOP/s Utilization”, which divides raw algorithmic operation requirements by aggregate arithmetic bandwidth.

This is catnip for engineers. It’s also especially critical for “hero run” large-scale training, so it draws a lot of investment and attention, e.g. recently as everyone dunked on xAI’s ~10% MFU.

But at the other end of the stack, there’s a more basic form of utilization that wrecks the relationship between achieved output and allocated capacity for inference workloads, GPU Allocation Utilization:

GPU Allocation Utilization := GPU-seconds running application code ÷ GPU-seconds paid for

Aside on "GPU Utilization" terminology
The "GPU utilization" reported by nvidia-smi and similar tools is in between these two extremes. It reports the fraction of the time that kernel code is running on the GPU — literally, the fraction of time there is a CUDA stream running on the GPU. Read more here.

Inference applications have highly variable scale. Unlike training, the demand for capacity is outside the direct control and management of the engineering organization. Instead, it is driven by external user behavior — by markets or social media algorithms or product teams.

Here’s a sample trace of requests per minute from a time-varying Poisson process we use to model inference applications. Notice not only the seasonal variation (daily cycles) but also the long-term trend of increasing variability in demand as the average demand increases.

Diagram of simulated inference traffic from a time-varying Poisson process with seasonal variation and spikes

Spiky demand raises serious engineering problems. To borrow from Marc Brooker of AWS: “the cost of a system scales with its (short-term) peak traffic, but for most applications the value the system generates scales with the (long-term) average traffic.” Spiky demand means high peak-to-average ratios, which challenge system economics.

Concretely, imagine the capacity planning for such an application. You might have demand (measured in GPUs required to service requests within latency targets) that looks like this:

With a fixed, over-provisioned GPU allocation, utilization is low

To properly service your anticipated load, you allocate (rack-and-stack, rent on a hyperscaler) 140 GPUs. But most of those GPUs sit idle most of the time — the GPU Allocation Utilization is low.

You might accuse us of talking our book here. But we aren’t the only ones to call this out! See the excellent blog post by Hebbia. And we have data, not just vibes: according to the State of AI Infrastructure at Scale report in 2024, the majority of organizations achieve less than 70% GPU Allocation Utilization when running at peak demand. Actual GPU Allocation Utilizations are commonly often closer to 10-20%.

With fixed allocations, demand can also exceed supply during unanticipated spikes. Trying to anticipate them just increases cost further — more than it increases revenue.

What’s so hard about serverless GPUs? Startup latency.

The immediate solution is to provision auto-scaling capacity: when demand increases, increase your supply.

Done naïvely, this actually worsens the problem:

If allocation is slow, utilization and QoS suffer

Without optimization, going from hyperscaler API request to a running service replica can take tens of minutes.

You need to do the following:

  • spin up a new instance and health-check it (minutes to tens of minutes)
  • load application program and filesystem state (minutes)
  • start the application program on the host, ready it to service requests (tens of seconds)
  • start the application program on the device, ready it to service requests (minutes to tens of minutes)

During all of this time, load is in excess of capacity, and QoS typically degrades (absorbed into higher concurrency or queues and thus inflated tail latencies or, worse, 503s). That means angry users. If the capacity takes too long to come online, it can even miss a transient spike. But given the unpredictability of demand and the difficulty of allocation, that capacity typically sticks around, under-utilized, for an extended period.

At Modal, we’ve optimized the spin-up of inference applications on GPUs from many tens of minutes down to a few seconds or tens of seconds. With these optimizations, a wide variety of inference applications of GPUs can run “truly serverlessly”: with provisioned supply tightly matched to system demand.

With fast, automatic allocation, utilization and QoS can both be high

In the rest of this document, we will explain the engineering approach we took and the performance optimizations we implemented for each of the four steps above, which span the stack from cloud storage systems and machine management to local disks, CPUs, and, of course, GPUs.

Together, these optimizations allow inference on Modal to spin up 40x faster: 50 seconds instead of 2k.

Diagram indicating latency to scale up an inference server in a baseline cloud system and on Modal
Inference servers that take upwards of 2 kiloseconds to boot naïvely boot in ~50 seconds on Modal. The key architectural optimizations that achieve this speedup are indicated and color-coded by the key system component that they target -- GPUs and GPU RAM, CPUs and CPU RAM, local solid state disk (SSD), or machine/instance management. We use this color scheme and schematic throughout the post.

You can remove tens of minutes of latency by taking instance allocation and health checks out of the hot path.

Consider the first step in replica spin-up:

  • spin up a new instance and health-check it (minutes to tens of minutes)

We can remove this from the hot path by doing it ahead of time: running a buffer of idle, healthy GPUs, shared by many applications, scheduling new replicas onto those units, and spinning up new devices into the buffer asynchronously. We can also de-allocate units when the buffer grows too large, as replicas spin down. Managing this buffer is a fun linear programming problem, as we’ve written elsewhere.

Maintaining a small buffer of ready-to-use, but unused, machines on top of allocated machines allows for new replicas to be quickly scheduled onto empty machines (indicated by bright colors).

Servicing requests from the buffer removes tens of minutes of latency from replica spin-up.

Diagram indicating latency reduction of tens of minutes for inference startup with a cloud instance buffer.

Aside on system- and application-level buffers
If you're running a single workload, rather than a multi-workload system, you might ask why we are only moving instance allocation out of the "hot path" and into the buffer. Can't we move more of our setup work into the buffer? And you can! Modal users can maintain an application layer buffer of replicas ready to service requests with buffer_containers. But even then, the size of the buffer you need to absorb spikes of a given magnitude scales with the speed you can create new replicas, and so the optimizations described below are still important for single-workload systems.

Running a buffer limits the peak allocation utilization below 100%. This is a reasonable trade-off to make, since 100% utilization is generally a mirage. Consider that it is common practice to spin up new replicas and even page engineers when utilization of other resources, like CPU or IOPS, gets too high!

This is important for robustness. A 100% utilized system has no margin for error, and so faults routinely become failures. We can personally recommend adding more buffers to your life — keep an extra toothbrush in your bathroom; keep a charger for your critical devices at home, the office, and on your person.

This buffer is especially useful for accommodating a wider variety of workloads on a single system. At Modal, we’ve leaned into supporting a variety of “development” workloads, not just production serving, because we can quickly create a new development environment. As an extra win, these environments are reproducible-by-default and on prod-ready infra. Closing the gap with production infrastructure also improves development velocity.

The devil is, of course, in the details. One key piece: health checks are critical for GPUs, which fail at a much higher rate than other hardware, including notoriously finicky components like spinning disks. We wrote about our GPU health-checking system in detail here. The tl;dr is that, in our experience, you need to run a short active health check on boot and monitor for health issues that arise later, but you can defer more intense checks (like dcgmi diag) to a slower cadence (for us, weekly).

Critical level Xid errors per hour per GPU, grouped by (anonymized) cloud. Failure rates are far from negligible!

You can cut container start from minutes to seconds by serving files lazily out of a content-addressed cache.

Now let’s consider the next step:

  • load application program and filesystem state (minutes)

In contemporary practice, this generally means booting up one or more containers or VMs.

Roughly, a container is a root filesystem backing a process with limited permissions. For distributed deployments of many containers, performance is bottlenecked by the construction of the root filesystem on the worker instance.

Root filesystems of operating system distros are thicc — tens of thousands of files, gigabytes in size. Naïvely, with a command like docker run, you need to load that whole thing, typically at the few GB/s supported by cloud Ethernet. Even worse, the container image is split into several layers, which must be applied sequentially.

The solution is to disaggregate the container launcher (runc for Docker, runsc for gVisor) from the container image delivery. We use a custom filesystem we call ImageFS, built with libfuse, that combines lazy loading with a multi-tiered, content-addressed cache designed to match cloud provider affordances.

The key “trick” for fast container start that we implement in our custom filesystem is to be lazy, judiciously (as all good engineers do). Container images contain many files, like timezone and locale information for the entire world, that will never be read by most applications. You can skip loading the entire filesystem before the container starts and instead just block start on loading metadata (an index). The metadata is only a few megabytes, so it can be loaded in 100ms or less, along with everything else needed to start a container.

Diagram showing lazy loading of container filesystem contents concurrently with spinup

The remainder can be loaded concurrently with other work — or not at all! The majority of the files will not be read, as reported in Figure 5 of the Slacker paper from USENIX FAST ‘16, reproduced below.

We currently implement this filesystem with libfuse. It’s a library for writing Linux filesystems in userspace. The kernel intermediates between one userspace program using standard system calls on files and another implementing a new filesystem (with, in the end, its own syscalls, one of which eventually returns to the original program by way of the kernel). This is much simpler to build and distribute than a custom filesystem in a kernel module.

Diagram showing lazy loading of container filesystem contents concurrently with spinup

There’s a price: you have double the context switches between userspace and kernelspace. This can be painful for latency-dominated workloads, like reading from a terminal character device, but has less of an impact on throughput-dominated workloads, where we use it. For a useful detailed breakdown of the performance implications, see To FUSE or Not to FUSE from USENIX FAST ‘17.

But there’s no free lunch: all the data that is accessed by the container still needs to be loaded. If you naïvely fetched each of the thousands of files accessed by a typical container from object storage, getting from “container start” to torch.cuda.is_available would take hours. So the other key component is a tiered, content-addressed cache that we fill eagerly (but asynchronously).

We use a content-addressed cache because the overlap in image contents across containers is huge. Many inference applications use the same software (e.g. Python, PyTorch, CUDA stack).

But path-based cacheing and Docker’s layerwise cacheing leave performance on the table. For instance, shared bytes aren’t guaranteed to be in the exact same container image layer.

Diagram depicting multiple container filesystems with overlapping content stored in a content-addressed cache

The cache is tiered, just like the caches inside CPUs and GPUs, to map onto the hierarchy of storage available on cloud providers. The diagram and table below list the key components and their throughputs, latencies, & costs.

SystemRead Latency (us)Read Throughput (GiB/s)
Page Cache0.001 - 0.110-40
SSD1004
AZ Cache Server100010
Regional CDN100,0003-10
Blob Storage200,0003-10

The key breakpoints are:

  • Memory: Linux page cache. This is the primary in-memory target. It's your only option for microsecond latencies and has high throughput, but capacity is limited (RAM is expensive, and getting dearer).
  • Disk: Local solid state. SSDs are a much gentler step down from memory than spinning disks, but still noticeable. We grab big drives and rapidly fill them up with the most commonly-used content.
  • Over-the-network: Blob/object storage. This has essentially infinite capacity — you will run out of money pushing bytes before the hyperscalers run out of money storing them. But it has much higher latency. That latency penalty is punishing, but note that the peak bandwidth can be higher than disk in many cloud configurations!

To really make this rip, you might build more layers between SSD and object storage, like an RDMA layer or within-AZ peer-to-peer sharing. Both are compelling on the numbers, but add a lot of engineering complexity, so we haven’t added them — yet.

Altogether, we’ve cut container start times down by a minute. For simple applications whose spin-up takes only a few seconds, this is an absolute gamechanger. For heavier applications like an LLM inference server, it removes about a minute from a several-minute start.

Diagram indicating latency reduction of 33% for inference startup with the custom container filesystem.

The big architectural moves that give integral factors of speedup are followed by a grind of percentage points. We detailed that grind in this blogpost. Here’s a few highlights.

First, libfuse exposes some knobs for performance. We found the most juice from tuning the read_ahead_kb, which directs the kernel to read that many kilobytes ahead of each request (as depicted below). We increased the value from the default 128 to 32 * 1024. The bigger value is nice for the large-read-heaviness of container image loading. Much higher values (in the gigabytes) caused gnarly thrashing.

Diagram showing requested blocks and read blocks with different levels of readahead.

Second, we skip gzip de/compression of container image layers. DEFLATE is inherently single-threaded (LZ77, Huffman), which limits you to ~100 MB/s, much lower than the throughput of any of the cache layers. If you have full control over container image creation, you might choose to use zstd and pay an upfront cost in image compression to save bandwidth during transfer and avoid bottlenecking the network during decompression. But we also try to keep image creation fast, especially to better support dynamic agent workloads.

You can fast-forward through tens of seconds of application host-side startup with CPU memory snapshotting.

Now, let’s consider our third step:

  • start the application program on the host, ready it to service requests (tens of seconds)

This encompasses all of the work that needs to be done to go from the point where the application’s container process is started to the point where the first “useful” work is started — processing the first request.

Consider, for instance, executing the Python statement import torch. This kicks off thousands of lines of Python code that, among other things, executes tens of thousands of syscalls to load a variety of files and interact with drivers. Repeat that for all of the libraries used in a typical inference application, and you’ve got many seconds of work to do between “process start” and “request in-flight”.

The key insight here is that any running process is a heap, some threads, and a file descriptor table. Something like this:

Diagram showing the internals of a Linux process as a data structure - memory mappings, thread state, file descriptor table.

If you can recreate that state (”create a checkpoint”), you can recreate the running process (”restore from checkpoint”). Package that state right, and you can recreate a process from storage faster than you can recreate it by executing a fresh copy.

Diagram demonstrating the 'skipped work' when restoring a process from a checkpoint.

That’s the core idea behind transparent checkpoint/restore interfaces — transparent because user programs do not need to be aware that they are being checkpointed and restored. The implementation of transparent checkpoint/restore in Linux is called Checkpoint/Restore In Userspace (CRIU). Transparent C/R is useful for, among other things, migration of live programs onto different machines. Because our primary goal is to reduce cold start time, transparent checkpoint/restore is not a requirement in our system.

So we expose the C/R interface to users via our container lifecycle management interface. It looks something like this:

We don’t use Linux C/R currently. We run user containers with gVisor’s runsc, which effectively emulates (a subset of) the Linux kernel in userspace. This limited surface area provides automatic protection from exploits like the recently-identified CVE-2026-31431, bka “CopyFail”. We’re particularly interested in (and contribute to!) the nvproxy sub-component, which communicates with the GPU’s kernel-mode drivers.

Because applications only interface with this emulated kernel, they can be checkpointed and restored by it without cooperation from the host kernel. Checkpoint/restore is actually especially easy for runsc. A container in the runsc runtime is straightforwardly a state machine. That is, the runtime is architected (in Go) as a collection of tasks with cooperative preemption, as in most other systems with async/await-style concurrency. The system is already being interrupted and then continued at every await point, so it’s “only” a matter of serializing that state into a checkpoint.

More precisely, the runsc checkpoint command stops a container and produces state on disk that can be used to restart the container with runsc restore (docs here). By default, it’s a zipped archive, but we generate without compression (remember the gzip bottleneck!). The key file is pages.img, which contains the raw page data. It is at least 100 MB but can be many GB (though generally not larger than system memory).

Diagram of the gVisor checkpoint/restore process, including nvproxy.

There’s a lot of other moving pieces, but checkpoint restoration performance is won and lost on how quickly this can be brought into the host page cache. We use the same custom filesystem machinery described above to deliver the checkpoint files.

The result is about a 10x reduction in time to load the host-side components of a new replica. You can read more about our memory snapshotting system in this blog post.

import torch

Stable Diffusion

There are some limitations here. Checkpoints are very sensitive to the details of the host environment. For example, the AWS g6.12xlarge instance type does not support the pclmulqdq Perform a Carry-Less Multiplication of Quadword instruction and so it cannot accept any snapshot created on a host which does — that instruction might be hard-coded into a code region page, and issuing it would lead to an illegal instruction fault (or worse). So on a heterogeneous cloud platform like Modal, which aggregates capacity from many providers to ensure low costs and high availability, a single inference server deployment needs more than one snapshot.

Second, this system only considers the host-side state of the program. But among the file descriptors in a process’s state might be a handle on the GPU via the drivers, and the program state properly includes whatever state is on the device. This is where the lion’s share of time is spent when starting up an inference server, so the wins from pure host-side checkpoint/restore are limited. But they unlock device snapshotting, where we turn next.

You can fast-forward through minutes of application device startup with GPU memory snapshotting.

Now, let’s consider the final step in creating a new inference server replica:

  • start the application program on the device, ready it to service requests (minutes to tens of minutes)

There are two distinct pieces of work that can both take many minutes for a contemporary inference workload. First, the neural network weights need to be loaded from storage into the GPU memory. Second, the code around the neural network (the “inference engine”) needs to do some setup work.

Frontier LLM weights these days run from the billions to trillions of bytes (aka, GB to TB). We use the same basic storage affordances as in the custom filesystem to load model weights, and so model weights are loaded at a few gigabytes per second. That’s a total latency between a few seconds and a few hundred seconds. Checkpoint/restore doesn’t help you here, since you are bottlenecked on throughput for large reads, not work that is readily skippable via snapshotting.

Aside on weight loading
It's tough to elevate this bottleneck when you load across clouds or cloud regions like we do. You might >3x this speed by running a weight server within an AZ or >10x it by running an RDMA weight server over RoCE/InfiniBand. This is is doable, but expensive and gnarly, especially when scaling to many models and dynamic worker pools. We're working on it!

But weight loading is not the only device-dependent slow step. Inference engine setup involves multiple compute-heavy, device-dependent steps that produce small in-memory artifacts that are tricky to cache. For instance, the vLLM inference server captures CUDA graphs and runs the Torch compiler. Each of these steps takes between tens of seconds and minutes. CUDA graphs are made up of pointers to tensors and kernels and have no native serialization option. The Torch compiler can produce serializable artifacts. But in our experience, validating cache hits on these artifacts is often slow (seconds to tens of seconds), limiting the benefit of cacheing.

These small, in-memory runtime artifacts that are expensive to produce make inference setup a prime candidate for checkpoint/restore. However, checkpointing based on runsc or CRIU only operates on host memory, not device memory. Even though most of the inference engine artifacts are host-side, the setup process creates device resources that must be checkpointed and restored.

Nvidia’s recent driver versions include an elegant solution to this problem. In short, the driver checkpoints device memory in host memory so that it can be checkpointed to disk by a host-side checkpointing system. Then, once the host-side system has restored the host memory, including the device checkpoint, the driver restores the device memory.

The device checkpoint/restore builds on the host checkpoint/restore, which builds on the underlying filesystem. Infrastructure compounds.

The results are striking: typical speedups are between 4-10x, reducing container start time from several minutes to tens of seconds.

Diagram indicating a reduction in start latency from 300s to 50s for inference applications with snapshots.

There are a few caveats to this, which we document here. First, snapshotting of multi-GPU programs is tricky, since nccl programs are not designed for pauses and frequently deadlock when one peer goes quiet. Second, while our experience has been that almost any application can be snapshot, most applications require some minor adjustment first. For instance, snapshotting vLLM or SGLang works best with weight offloading — moving the weights back onto the host before checkpointing. Additionally, the engines generally eagerly create an (empty) KV cache, which can be recreated much more quickly than it can be restored from a checkpoint.

With these adjustments in place, LLM inference server replicas can boot up nearly an order of magnitude faster with snapshots. Below, we report the latency cumulative distribution functions for fresh replicas of vLLM or SGLang serving a ~1 GiB language model (Qwen 3 0.6B in bf16). This includes all latency to create new replicas on Modal — any queueing for machines, container start, host- and device-side prep. Across over ten thousand cold starts, we observed superior latency at every quantile for the snapshot deployments.

Boot 1 GiB model in vLLM

Boot 1 GiB model in SGLang

Averages improve as well.

Snapshot OFFSnapshot ON
vLLM boot latency (mean)95,679 ms13,797 ms
SGLang boot latency (mean)83,713 ms17,486 ms

You can read more about GPU snapshotting, including additional benchmark results, in this blog post.

We have run this stack at the scale of tens of millions of replicas across many use cases.

Usage numbers for the last three months, February - April of 2026, appear below.

Replicas restoredHours of executionDistinct snapshots created
CPU Snapshots~35,000,000>5,000,000~1,000,000
CPU+GPU Snapshots~15,000,000>2,000,000~700,000

CPU and GPU snapshots are used by a few hundred distinct organizations across a variety of use cases.

CPU snapshots are most commonly used in data pipelines or job queue systems, but they are also frequently used across a variety of use cases to speed up initialization, especially Python imports.

GPU snapshots are also used across a variety of domains. Because of the current restriction to a single GPU, they are most commonly used for models with sizes in the few to few tens of gigabytes. That includes the smaller end of large language/vision-language models, used for structured data extraction or for tasks with low cognitive complexity. We have found that GPU snapshots are also popular for audio/voice use cases, both for speech-to-text/automated speech recognition and text-to-speech/voice generation.

Spotlight Use Case: Reducto scales document processing seamlessly up to thousands of GPUs.

Reducto is a document processing platform that uses vision-language foundation models to extract structured data from unstructured documents. You may have read about their contributions to the JMail project, for which they indexed the correspondence of international criminal Jeffrey Epstein.

One of the primary constraints on Reducto’s document processing system is a high peak-to-average ratio. A customer may appear at any moment with a request to process, say, an entire enterprise’s collection of Notion documents. These jobs have deadlines on the scale of tens of minutes, which can only be met by horizontally scaling across hundreds or thousands of GPUs.

Fast container starts allow Reducto to hit these deadlines without maintaining idle capacity — to run a kilo-GPU inference workload “truly serverlessly”.

In particular, the addition of GPU memory snapshotting pushed down cold starts about six-fold, from ~70s to ~12s.

You can read more about Reducto’s use of serverless GPUs for inference on our blog.

Coda

The existing container stack and cloud services built with it are designed for a previous generation of workloads, like hosting websites and interacting with databases. They aren’t designed for training or inference, and this impedance mismatch is a tax on cost, efficiency, and performance for teams building inference applications.

We built Modal to bring cloud infrastructure up to speed (literally) with the demands of artificial intelligence. We’re sharing how we did it for a few reasons.

First, we hope you’re as excited about building applications on this infrastructure as our customers, like Physical Intelligence, Runway, Ramp, Zencastr, Lovable, Substack, and Suno. You can start using it, including $30 of free usage credits per month, right now.

Second, we built this system on top of and with reference to a number of excellent open source or well-documented systems, from Linux CRIU to AWS Lambda. We want to pay that forward. If you take inspiration from our work, we’d love to hear about it.

Finally, we’ve still got a lot of work to do — those RDMA networks don’t configure themselves! We hope that sharing what that work looks like will get more of the best engineers interested in doing it with us. “Building a boat isn’t about weaving canvas, forging nails, or reading the sky. It’s about giving a shared taste for the sea.” Shipbuilders and sea-yearners, inquire within and share your favorite story of working for hours to save a millisecond, one trillion times.

The authors would like to thank Vikram Mailthody, Steven Gurfinkel, Stephen Jones, Radostin Stoyanov, Rodrigo Bruno, and Jordan Sassoon for their input.

Ship your first app in minutes.

Get Started

$30 / month free compute