Streaming audio transcription using Parakeet
This example demonstrates the use of Parakeet ASR models for streaming speech-to-text on Modal.
Parakeet is the name of a family of ASR models built using NVIDIA’s NeMo Framework. We’ll show you how to use Parakeet for streaming audio transcription on Modal GPUs, with simple Python and browser clients.
This example uses the nvidia/parakeet-tdt-0.6b-v2 model which, as of June 2025, sits at the
top of Hugging Face’s Open ASR leaderboard.
To try out transcription from your terminal,
provide a URL for a .wav file to modal run:
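For instance, a command along these lines should work — note that the script name and the `--audio-url` option are illustrative here, not taken from this example's source; `modal run` exposes a `local_entrypoint`'s arguments as CLI options:

```shell
# Hypothetical file name and flag; substitute the ones from the example source.
modal run parakeet_streaming.py --audio-url="https://example.com/sample.wav"
```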
You should see output like the following:
Running a web service you can hit from any browser isn’t any harder — Modal handles the deployment of both the frontend and backend in a single App! Just run
and go to the link printed in your terminal.
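For development, `modal serve` hot-reloads the app on file changes and prints a temporary URL in the terminal (the script name below is a placeholder):

```shell
# Hypothetical file name; `modal serve` watches the file and prints the web URL.
modal serve parakeet_streaming.py
```

For a persistent deployment, swap `serve` for `deploy`.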
The full frontend code can be found here.
Setup
Volume for caching model weights
We use a Modal Volume to cache the model weights. This allows us to avoid downloading the model weights every time we start a new instance.
For more on storing models on Modal, see this guide.
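A minimal sketch of this caching pattern — the volume name and mount path here are illustrative, not taken from this example's source:

```python
import modal

# Illustrative names; the real volume name and mount path are defined
# in this example's source.
model_cache = modal.Volume.from_name("parakeet-model-cache", create_if_missing=True)
MODEL_DIR = "/cache"

# Attach the Volume wherever the model is loaded, e.g.
#   @app.cls(volumes={MODEL_DIR: model_cache})
# and point the downloader (e.g. via HF_HOME) at MODEL_DIR so weights
# persist across container starts.
```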
Configuring dependencies
The model runs remotely inside a container on Modal. We can define the environment
and install our Python dependencies in that container’s Image.
For finicky setups like NeMo’s, we recommend using the official NVIDIA CUDA Docker images from Docker Hub.
You’ll need to install Python and pip with the add_python option because the image
doesn’t have these by default.
Additionally, we install ffmpeg for handling audio data and fastapi to create a web
server for our WebSocket.
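Putting those pieces together, the image definition looks roughly like this — the CUDA tag, Python version, and package pins are assumptions, not taken from this example's source:

```python
import modal

# A sketch of the container image; tag and pins are illustrative.
image = (
    modal.Image.from_registry(
        "nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04",
        add_python="3.12",  # the CUDA image ships without Python or pip
    )
    .apt_install("ffmpeg")  # audio decoding and resampling
    .pip_install("nemo_toolkit[asr]", "fastapi", "pydub")
)
```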
Implementing streaming audio transcription on Modal
Now we’re ready to implement transcription. We wrap inference in a modal.Cls that
ensures models are loaded and then moved to the GPU once when a new container starts.
A couple of notes about this code:
- The `transcribe` method takes bytes of audio data and returns the transcribed text.
- The `web` method creates a FastAPI app using `modal.asgi_app` that serves a WebSocket endpoint for streaming audio transcription and a browser frontend for transcribing audio from your microphone.
- The `run_with_queue` method takes a `modal.Queue` and passes audio data and transcriptions between our local machine and the GPU container.
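Condensed, the class's structure looks roughly like the sketch below. The model name comes from the text above, but the app name, GPU choice, and method bodies are placeholders; the `modal.enter` decorator is what ensures the load happens once per container:

```python
import modal

app = modal.App("parakeet-streaming")  # illustrative app name

@app.cls(gpu="A10G")  # GPU choice is illustrative; the image and Volume attach here too
class Parakeet:
    @modal.enter()
    def load(self):
        # Runs once when a new container starts: load the model onto the GPU.
        import nemo.collections.asr as nemo_asr

        self.model = nemo_asr.models.ASRModel.from_pretrained(
            model_name="nvidia/parakeet-tdt-0.6b-v2"
        )

    @modal.method()
    def transcribe(self, audio_bytes: bytes) -> str:
        ...  # decode the PCM bytes, run inference, return the text
```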
Parakeet tries really hard to transcribe everything to English!
Hence it tends to output utterances like “Yeah” or “Mm-hmm” when it runs on silent audio.
We pre-process the incoming audio in the server using pydub’s silence detection,
ensuring that we don’t pass silence into our model.
Running transcription from a local Python client
Next, let’s test the model with a local_entrypoint that streams audio data to the server and prints
out the transcriptions to our terminal as they arrive.
Instead of using the WebSocket endpoint like the browser frontend,
we’ll use a modal.Queue to pass audio data and transcriptions between our local machine and the GPU container.
Below are the two functions that coordinate streaming audio and receiving transcriptions.
`send_audio` transmits chunks of audio data with a slight delay, as though it were being streamed from a live source like a microphone. `receive_text` waits for transcribed text to arrive and prints it.
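A sketch of these two client-side functions — the chunk size, pacing, sentinel value, and partition names are illustrative assumptions; in the real app, `q` is a `modal.Queue` shared with the GPU container:

```python
import time

CHUNK_SIZE = 16_000  # ~0.5 s of 16 kHz, 16-bit mono audio (assumed format)
END_OF_STREAM = b""  # sentinel telling the server the stream is done

def send_audio(q, audio_bytes: bytes) -> None:
    """Feed audio chunks at roughly real-time pace, as a microphone would."""
    for i in range(0, len(audio_bytes), CHUNK_SIZE):
        q.put(audio_bytes[i : i + CHUNK_SIZE], partition="audio")
        time.sleep(0.5)  # simulate a live source
    q.put(END_OF_STREAM, partition="audio")

def receive_text(q) -> None:
    """Print transcriptions as the GPU container pushes them onto the queue."""
    while True:
        text = q.get(partition="text")
        if text is None:  # assume the server signals completion with None
            break
        print(text)
```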
Addenda
The remainder of the code in this example is boilerplate, mostly for handling Parakeet’s input format.