Concurrent inputs on a single container (beta)
This guide explores why and how to configure containers to process multiple inputs simultaneously.
Modal offers beautifully simple parallelism: when there is a large backlog of inputs enqueued, the number of containers scales up automatically. This is the ideal source of parallelism in the majority of cases.
When to use concurrent inputs
There are, however, a few cases where it is useful to run multiple inputs on each container concurrently.
One use case is hosting web applications whose endpoints are not CPU-bound, such as making an asynchronous request to a deployed Modal function or querying a database. With concurrent inputs enabled, a handful of containers can handle hundreds of simultaneous requests for such applications.
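To see why a single worker can absorb many non-CPU-bound requests, here is a plain-Python sketch (no Modal involved) in which one event loop multiplexes 100 simulated I/O-bound requests. The handler and the 0.1-second "I/O wait" are illustrative assumptions:

```python
import asyncio
import time

async def handle_request() -> str:
    # Simulate a non-CPU-bound endpoint: the work is waiting on I/O
    # (a database query, a call to another deployed function, etc.).
    await asyncio.sleep(0.1)
    return "ok"

async def main() -> float:
    start = time.monotonic()
    # 100 "simultaneous requests" multiplexed onto one event loop.
    results = await asyncio.gather(*(handle_request() for _ in range(100)))
    assert all(r == "ok" for r in results)
    return time.monotonic() - start

elapsed = asyncio.run(main())
# Wall time is close to one request's latency, not 100x it:
# while one request waits on I/O, the others make progress.
print(f"handled 100 requests in {elapsed:.2f}s")
```

The same reasoning applies inside a Modal container: if inputs spend most of their time waiting rather than computing, one container can serve many of them at once.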
Another use case is supporting continuous batching on GPU-accelerated containers. Frameworks such as vLLM achieve higher token throughput by maximizing the compute performed in each forward pass: for LLMs, each GPU step generates tokens for multiple user queries; for diffusion models, multiple images are denoised concurrently. To take full advantage of this, containers need to be processing multiple inputs concurrently.
Configuring concurrent inputs
To configure functions to allow each container to process `n` inputs concurrently, we set `allow_concurrent_inputs=n` on the function decorator.
If the function is synchronous, the Modal container will execute concurrent inputs on separate threads. As such, you must take care that the function implementation itself is thread-safe.
Similarly, if the function is asynchronous, the Modal container will execute the concurrent inputs on separate async tasks.

```python
@app.function(allow_concurrent_inputs=10)
def sleep_sync():
    # Each container executes up to 10 inputs in separate threads.
    # The function implementation must be thread-safe.
    time.sleep(1)

@app.function(allow_concurrent_inputs=10)
async def sleep_async():
    # Each container executes up to 10 inputs in separate async tasks.
    await asyncio.sleep(1)
```
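The thread-safety caveat for synchronous functions matters whenever inputs share state. As a minimal illustration in plain Python (no Modal; the `Counter` class and the thread counts are invented for the example), a lock serializes the read-modify-write so concurrent increments don't race:

```python
import threading

class Counter:
    """Shared state touched by concurrently executing inputs must be
    synchronized: a bare `self._value += n` is a read-modify-write
    that can lose updates when run from multiple threads."""
    def __init__(self) -> None:
        self._value = 0
        self._lock = threading.Lock()

    def increment(self, n: int = 1) -> None:
        with self._lock:  # serialize the read-modify-write
            self._value += n

    @property
    def value(self) -> int:
        return self._value

counter = Counter()
threads = [
    threading.Thread(target=lambda: [counter.increment() for _ in range(10_000)])
    for _ in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
# With the lock, the final count is exact: 8 threads * 10,000 increments.
```

Asynchronous functions avoid this particular hazard, since their inputs run as tasks on a single event loop rather than on separate threads, though state shared across `await` points still needs care.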