Concurrent inputs on a single container (beta)

This guide explores why and how to configure containers to process multiple inputs simultaneously.

Default parallelism

Modal offers beautifully simple parallelism: when there is a large backlog of inputs enqueued, the number of containers scales up automatically. This is the ideal source of parallelism in the majority of cases.

When to use concurrent inputs

There are, however, a few cases where it is ideal to run multiple inputs on each container concurrently.

One use case is hosting web applications whose endpoints are not CPU-bound, for example, endpoints that make an asynchronous request to a deployed Modal function or query a database. With concurrent inputs enabled, a handful of containers can handle hundreds of simultaneous requests for such applications.
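To see why, here is a plain asyncio sketch, independent of Modal, of an I/O-bound handler: a hundred concurrent requests that each wait 100 ms finish together in roughly 100 ms rather than 10 s, because none of them holds the CPU while waiting. The `handle_request` name is illustrative, not part of any API.

```python
import asyncio
import time

async def handle_request(i: int) -> int:
    # Simulate a non-CPU-bound endpoint: the handler spends its time
    # waiting on I/O (e.g. a database query), not computing.
    await asyncio.sleep(0.1)
    return i

async def main() -> float:
    start = time.monotonic()
    # 100 concurrent "requests" complete in roughly the time of one,
    # because they all wait in parallel on the same event loop.
    results = await asyncio.gather(*(handle_request(i) for i in range(100)))
    assert results == list(range(100))
    return time.monotonic() - start

elapsed = asyncio.run(main())
# elapsed is close to 0.1 s, not 100 * 0.1 s
```

A CPU-bound handler would see no such benefit: the event loop can only interleave work while handlers are awaiting.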

Another use case is supporting continuous batching on GPU-accelerated containers. Frameworks such as vLLM push token throughput higher by maximizing the compute performed in each forward pass. For LLMs, this means each GPU step can generate tokens for multiple user queries; for diffusion models, multiple images can be denoised concurrently. To take full advantage of this, containers need to be processing multiple inputs concurrently.
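As a toy illustration of the scheduling idea behind continuous batching (not vLLM's actual implementation), the sketch below admits waiting queries into the batch as soon as earlier ones finish, so each step serves as full a batch as possible:

```python
from collections import deque

def continuous_batching(queries: dict[str, int], max_batch: int) -> list[list[str]]:
    """Toy scheduler: each step generates one token for every active
    query; finished queries leave the batch and waiting ones join
    immediately, so the batch stays full instead of draining."""
    waiting = deque(queries)       # query ids awaiting admission
    remaining = dict(queries)      # tokens left to generate per query
    active: list[str] = []
    steps: list[list[str]] = []
    while waiting or active:
        # Admit waiting queries up to the batch limit (the "continuous" part)
        while waiting and len(active) < max_batch:
            active.append(waiting.popleft())
        steps.append(list(active))  # one forward pass serves all of these
        for q in list(active):
            remaining[q] -= 1
            if remaining[q] == 0:
                active.remove(q)
    return steps

# Three queries needing 1, 3, and 2 tokens, batch size 2: when "a"
# finishes after one step, "c" immediately takes its slot.
steps = continuous_batching({"a": 1, "b": 3, "c": 2}, max_batch=2)
# steps == [["a", "b"], ["b", "c"], ["b", "c"]]
```

The container-side requirement is simply that multiple user queries are in flight at once; the batching framework then decides how to pack them into each GPU step.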

Configuring concurrent inputs

To configure functions to allow each container to process n inputs concurrently, we set allow_concurrent_inputs=n on the function decorator.

If the function is synchronous, the Modal container will execute concurrent inputs on separate threads. As such, one must take care that the function implementation itself is thread-safe.
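For example, a handler that updates shared state must guard the read-modify-write with a lock. This is a generic Python sketch of the pattern, not Modal-specific; the `Counter` class is a hypothetical example:

```python
import threading

class Counter:
    """Shared state touched by concurrent inputs must be guarded:
    without the lock, the read-modify-write in increment() could
    lose updates when inputs run on separate threads."""
    def __init__(self) -> None:
        self.value = 0
        self._lock = threading.Lock()

    def increment(self, n: int) -> None:
        with self._lock:
            for _ in range(n):
                self.value += 1

counter = Counter()
threads = [threading.Thread(target=counter.increment, args=(10_000,))
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# With the lock held, the final count is exactly 8 * 10_000
```

Stateless handlers, or handlers that only touch per-input local variables, are thread-safe without any extra work.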

Similarly, if the function is asynchronous, the Modal container will execute the concurrent inputs on separate asyncio tasks.

import asyncio
import time

import modal

app = modal.App()

# Each container executes up to 10 inputs in separate threads
@app.function(allow_concurrent_inputs=10)
def sleep_sync():
    # Function implementation must be thread-safe
    time.sleep(1)

# Each container executes up to 10 inputs in separate async tasks
@app.function(allow_concurrent_inputs=10)
async def sleep_async():
    await asyncio.sleep(1)