Streaming endpoints

Modal web endpoints support streaming responses using FastAPI’s StreamingResponse class. This class accepts asynchronous generators, synchronous generators, or any Python object that implements the iterator protocol, and can be used with Modal Functions!
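As a plain-Python illustration (no Modal or FastAPI required), here is a sketch of the three kinds of iterables mentioned above; the names and chunk contents are made up for the example:

```python
import asyncio


def sync_gen():
    # Synchronous generator: yields byte chunks one at a time.
    for i in range(3):
        yield f"chunk {i}\n".encode()


async def async_gen():
    # Asynchronous generator: yields chunks without blocking the event loop.
    for i in range(3):
        await asyncio.sleep(0)
        yield f"chunk {i}\n".encode()


class ChunkIterator:
    # Any object implementing __iter__/__next__ also works.
    def __init__(self, n):
        self.i, self.n = 0, n

    def __iter__(self):
        return self

    def __next__(self):
        if self.i >= self.n:
            raise StopIteration
        self.i += 1
        return f"chunk {self.i - 1}\n".encode()


async def collect():
    return [c async for c in async_gen()]


print(list(sync_gen()))        # three byte chunks
print(list(ChunkIterator(3)))  # the same three chunks
print(asyncio.run(collect()))  # the same chunks, produced asynchronously
```

Any of these three objects could be passed as the first argument to StreamingResponse.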

Simple example

This simple example combines Modal’s @web_endpoint decorator with the FastAPI documentation’s streaming response example:

import time
from fastapi.responses import StreamingResponse

from modal import Stub, web_endpoint

stub = Stub()

def fake_video_streamer():
    for i in range(10):
        yield f"frame {i}: some data\n".encode()
        time.sleep(0.5)


@stub.function()
@web_endpoint()
def stream_me():
    return StreamingResponse(
        fake_video_streamer(), media_type="text/event-stream"
    )

If you serve this web endpoint and hit it with curl, you will see the ten fake video frames progressively appear in your terminal over a ~5 second period.

curl --no-buffer https://modal-labs--example-streaming-stream-me.modal.run

Streaming responses with .remote

A Modal Function wrapping a generator function body can have its response passed directly into a StreamingResponse. This is particularly useful if you want to do some GPU processing in one Modal Function that is called by a CPU-based web endpoint Modal Function.

import time

from fastapi.responses import StreamingResponse

from modal import Stub, web_endpoint

stub = Stub()

@stub.function(gpu="any")
def fake_video_render():
    for i in range(10):
        yield f"frame {i}: some fake data from GPU\n".encode()
        time.sleep(1)


@stub.function()
@web_endpoint()
def hook():
    return StreamingResponse(
        fake_video_render.remote(), media_type="application/octet-stream"
    )

Streaming responses with .map and .starmap

You can also combine Modal Function parallelization with streaming responses, enabling applications to service a request by farming out to dozens of containers and iteratively returning result chunks to the client.

import time

from fastapi.responses import StreamingResponse

from modal import Stub, web_endpoint

stub = Stub()

@stub.function()
def map_me(i):
    time.sleep(i)  # stagger the results for demo purposes
    return f"hello from {i}\n"


@stub.function()
@web_endpoint()
def mapped():
    return StreamingResponse(
        map_me.map(range(10)), media_type="text/event-stream"
    )

This snippet spreads the ten map_me(i) executions across containers and returns each string response part as it completes. By default the results are ordered; if ordering isn’t necessary, pass order_outputs=False as a keyword argument to the .map call.
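As a rough local analogy for ordered versus unordered results, the same distinction can be seen with Python’s standard-library concurrent.futures (this is not Modal’s .map, just an illustration of the two delivery orders):

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def work(i):
    # Stagger so that later inputs finish first.
    time.sleep((3 - i) * 0.05)
    return f"hello from {i}\n"


with ThreadPoolExecutor() as pool:
    # Ordered: results come back in input order,
    # analogous to .map's default behavior.
    ordered = list(pool.map(work, range(3)))

    # Unordered: results arrive as they complete,
    # analogous to .map(..., order_outputs=False).
    futures = [pool.submit(work, i) for i in range(3)]
    unordered = [f.result() for f in as_completed(futures)]

print(ordered)    # input order preserved
print(unordered)  # typically completion order (2, 1, 0 with these sleeps)
```

For streaming endpoints, unordered delivery lets the client start receiving the fastest chunks sooner, at the cost of losing positional order.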

Cooperative yielding

In asynchronous applications, a loop over a .map or .starmap call can block the event loop, preventing the StreamingResponse from returning response parts to the client as they arrive.

To avoid this, it’s important to do some ‘cooperative yielding’ inside the loop. For example:

import asyncio

@stub.function(gpu="any")
def transcribe_video(segment):
    ...

# Notice that this is an `async` function.
async def stream_response_wrapper(request):
    segments = split_video(request)
    for partial_result in transcribe_video.map(segments):
        await asyncio.sleep(0.5)  # Cooperatively yield here by sleeping.
        yield partial_result
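A minimal, Modal-free sketch of why that await matters: without a cooperative yield, a coroutine looping over a blocking iterator never hands control back to the event loop, so concurrent tasks (such as the one flushing response chunks) are starved. All names here are invented for the illustration:

```python
import asyncio
import time


def blocking_results():
    # Stands in for a synchronous iterator like a .map call.
    for i in range(3):
        time.sleep(0.05)  # blocking work
        yield i


async def cooperative_consumer(log):
    for item in blocking_results():
        await asyncio.sleep(0)  # cooperatively yield to the event loop
        log.append(f"consumed {item}")


async def other_task(log):
    # Stands in for any concurrent work, e.g. sending response chunks.
    for _ in range(3):
        log.append("heartbeat")
        await asyncio.sleep(0.04)


async def main():
    log = []
    await asyncio.gather(cooperative_consumer(log), other_task(log))
    return log


log = asyncio.run(main())
print(log)  # heartbeats interleave with consumed items
```

Dropping the `await asyncio.sleep(0)` would let the blocking loop run to completion before the other task gets a turn.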

Further examples