Streaming Speaker Diarization with Sortformer2.1

In this example, we show how to deploy a streaming speaker diarization service with NVIDIA’s Sortformer2.1 on Modal. Sortformer2.1 is a state-of-the-art speaker diarization model designed to operate directly on streams of audio.

Try it yourself! Click the “View on GitHub” button to see the code. And sign up for a Modal account if you haven’t already.

Setup 

We start by importing some basic packages and the Modal SDK, and then set up our Modal App, Volume, and Image.
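That setup might be sketched as follows. The app name, volume name, and pinned dependencies below are illustrative assumptions, not the exact values used in this example:

```python
import modal

# Persist downloaded model weights across container starts
# (the volume name here is illustrative).
model_cache = modal.Volume.from_name(
    "sortformer-model-cache", create_if_missing=True
)

# Build an image with the dependencies the model needs; NeMo is assumed
# here since Sortformer models ship with NVIDIA's NeMo toolkit.
image = modal.Image.debian_slim(python_version="3.12").pip_install(
    "nemo_toolkit[asr]",
    "fastapi",
)

app = modal.App("sortformer-streaming-diarization", image=image)
```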

Run Sortformer2.1 speaker diarization 

Now we’re ready to add the code that runs the Sortformer2.1 speaker diarization model.

We use a Modal Cls so that we can separate the model loading and setup code from the inference code. For more on lifecycle management with Clses and reducing cold start penalties on Modal, see this guide. In particular, the Sortformer2.1 model is amenable to GPU snapshots, which can significantly reduce cold start times.

We also include two configurations: a low-latency configuration for real-time diarization, and a high-latency configuration for non-real-time diarization with higher accuracy.
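The two configurations trade latency for accuracy mainly through how much audio the model consumes per streaming step and how much lookahead it is allowed. The field names and numbers below are illustrative assumptions, not the exact parameters used by Sortformer2.1:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiarizationConfig:
    """Streaming parameters (field names and values are illustrative)."""

    chunk_secs: float          # audio consumed per streaming update
    left_context_secs: float   # past audio the model attends to
    right_context_secs: float  # lookahead: adds latency, improves accuracy


# Low latency: small chunks, no lookahead, suited to real-time use.
LOW_LATENCY = DiarizationConfig(
    chunk_secs=0.32, left_context_secs=10.0, right_context_secs=0.0
)

# High latency: bigger chunks plus lookahead for higher accuracy.
HIGH_LATENCY = DiarizationConfig(
    chunk_secs=2.56, left_context_secs=60.0, right_context_secs=2.56
)


def worst_case_latency(cfg: DiarizationConfig) -> float:
    # Each update can only be emitted once the chunk and any lookahead
    # audio have arrived, so that sum bounds the per-update latency.
    return cfg.chunk_secs + cfg.right_context_secs
```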

Using WebSockets to stream audio and diarization results 

We use a Modal ASGI app to serve the diarization results over WebSockets, which lets us stream results to the client in real time.

We use a simple queue-based architecture to handle the audio and diarization results.

Audio received from the client over WebSockets is added to an input queue; an inference task consumes it, runs diarization, and pushes the results to an output queue, from which they are streamed back to the client over the same WebSocket.
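This queue-based flow can be sketched with asyncio alone. Here `run_diarization` is a hypothetical stand-in for the actual Sortformer2.1 inference call, and the WebSocket receive/send loops are simulated by plain queue puts and gets:

```python
import asyncio


def run_diarization(chunk: bytes) -> dict:
    # Placeholder for the real model call: tag each chunk with a dummy
    # speaker label and its size.
    return {"speaker": "speaker_0", "num_bytes": len(chunk)}


async def diarization_pipeline(
    audio_in: asyncio.Queue, results_out: asyncio.Queue
) -> None:
    """Consume audio chunks from one queue, push results to another.

    A `None` chunk signals end-of-stream.
    """
    while True:
        chunk = await audio_in.get()
        if chunk is None:  # client closed the stream
            await results_out.put(None)
            break
        await results_out.put(run_diarization(chunk))


async def main() -> list:
    audio_in: asyncio.Queue = asyncio.Queue()
    results_out: asyncio.Queue = asyncio.Queue()
    worker = asyncio.create_task(diarization_pipeline(audio_in, results_out))

    # In the real service, these puts come from the WebSocket receive loop.
    for chunk in (b"\x00" * 320, b"\x00" * 640, None):
        await audio_in.put(chunk)

    # And these gets feed the WebSocket send loop.
    results = []
    while (result := await results_out.get()) is not None:
        results.append(result)
    await worker
    return results


results = asyncio.run(main())
```

Separating the receive and send paths with queues keeps slow inference from blocking the WebSocket read loop, so incoming audio is never dropped while a chunk is being processed.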

Serving the diarization results to a frontend 

We use a simple HTML frontend to display the diarization results.