# Streaming endpoints
Modal web endpoints support streaming responses using FastAPI’s `StreamingResponse` class. This class accepts asynchronous generators, synchronous generators, or any Python object that implements the iterator protocol, and can be used with Modal Functions!
## Simple example
This simple example combines Modal’s `@web_endpoint` decorator with the streaming response example from the FastAPI documentation:
```python
import time

from fastapi.responses import StreamingResponse
from modal import Stub, web_endpoint

stub = Stub()

def fake_video_streamer():
    for i in range(10):
        yield f"frame {i}: some data\n".encode()
        time.sleep(0.5)

@stub.function()
@web_endpoint()
def stream_me():
    return StreamingResponse(
        fake_video_streamer(), media_type="text/event-stream"
    )
```
If you serve this web endpoint and hit it with `curl`, you will see the ten fake video frames progressively appear in your terminal over a ~5 second period.

```shell
curl --no-buffer https://modal-labs--example-streaming-stream-me.modal.run
```
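Because the endpoint simply streams whatever bytes the generator yields, you can sanity-check the generator itself locally without deploying anything. A minimal sketch (the shortened sleep is an adjustment for a quick local run, not part of the original example):

```python
import time

def fake_video_streamer():
    for i in range(10):
        yield f"frame {i}: some data\n".encode()
        time.sleep(0.01)  # shortened from 0.5s for a quick local run

# Consuming the generator yields ten byte chunks, exactly what the
# endpoint would stream to the client one piece at a time.
chunks = list(fake_video_streamer())
print(len(chunks))  # 10
print(chunks[0])    # b'frame 0: some data\n'
```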
## Streaming responses with .remote
A Modal Function wrapping a generator function body can have its response passed directly into a `StreamingResponse`. This is particularly useful if you want to do some GPU processing in one Modal Function that is called by a CPU-based web endpoint Modal Function.
```python
import time

from fastapi.responses import StreamingResponse
from modal import Stub, web_endpoint

stub = Stub()

@stub.function(gpu="any")
def fake_video_render():
    for i in range(10):
        yield f"frame {i}: some fake data from GPU\n".encode()
        time.sleep(1)

@stub.function()
@web_endpoint()
def hook():
    return StreamingResponse(
        fake_video_render.remote(), media_type="application/octet-stream"
    )
```
## Streaming responses with .map and .starmap
You can also combine Modal Function parallelization with streaming responses, enabling applications to service a request by farming out to dozens of containers and iteratively returning result chunks to the client.
```python
import time

from fastapi.responses import StreamingResponse
from modal import Stub, web_endpoint

stub = Stub()

@stub.function()
def map_me(i):
    time.sleep(i)  # stagger the results for demo purposes
    return f"hello from {i}\n"

@stub.function()
@web_endpoint()
def mapped():
    return StreamingResponse(
        map_me.map(range(10)), media_type="text/event-stream"
    )
```
This snippet will spread the ten `map_me(i)` executions across containers and return each string response part as it completes. By default the results will be ordered, but if this isn’t necessary, pass `order_outputs=False` as a keyword argument to the `.map` call.
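The difference between ordered and unordered delivery can be illustrated with a plain-Python analogue. This is a sketch using `concurrent.futures` rather than Modal’s own scheduler: collecting results in submission order mirrors the default behavior, while `as_completed` mirrors `order_outputs=False`.

```python
import concurrent.futures as cf
import time

def map_me(i):
    time.sleep((5 - i) * 0.02)  # later inputs finish first
    return f"hello from {i}\n"

with cf.ThreadPoolExecutor(max_workers=5) as pool:
    futures = [pool.submit(map_me, i) for i in range(5)]

    # Submission order, like .map(...)'s default (order_outputs=True):
    ordered = [f.result() for f in futures]

    # Completion order, like .map(..., order_outputs=False) -- chunks
    # reach the client as soon as any container finishes:
    unordered = [f.result() for f in cf.as_completed(futures)]
```

With unordered delivery the client starts receiving chunks sooner, at the cost of result order.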
## Cooperative yielding
In asynchronous applications, a loop over a `.map` or `.starmap` call can block the main thread, preventing the `StreamingResponse` from returning response parts iteratively to the client.
To avoid this, it’s important to do some ‘cooperative yielding’ inside the loop. For example:
```python
import asyncio

@stub.function(gpu="any")
def transcribe_video(segment):
    ...

# Notice that this is an `async` function.
async def stream_response_wrapper(request):
    segments = split_video(request)
    for partial_result in transcribe_video.map(segments):
        await asyncio.sleep(0.5)  # Cooperatively yield here by sleeping.
        yield partial_result
```
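The same pattern can be exercised locally by wrapping any synchronous iterator. In this minimal sketch, `blocking_results` is a hypothetical stand-in for the `.map(...)` loop, and `asyncio.sleep(0)` is the cheapest way to hand control back to the event loop between items:

```python
import asyncio

def blocking_results():
    # Hypothetical stand-in for iterating over a .map(...) call.
    for i in range(3):
        yield f"part {i}\n"

async def stream_response_wrapper():
    parts = []
    for partial_result in blocking_results():
        await asyncio.sleep(0)  # cooperative yield: let the event loop run
        parts.append(partial_result)
    return parts

parts = asyncio.run(stream_response_wrapper())
```

Without the `await`, the loop would monopolize the event loop until every item had been consumed, defeating the purpose of streaming.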
## Further examples
- Complete code for the simple examples given above is available in our modal-examples GitHub repository.
- An example of streaming ChatGPT responses over HTTP
- An end-to-end example of streaming YouTube video transcriptions with OpenAI’s Whisper model.