Streaming audio transcription using Parakeet

This example demonstrates how to use Parakeet ASR models for streaming speech-to-text on Modal.

Parakeet is the name of a family of ASR models built using NVIDIA’s NeMo Framework. We’ll show you how to use Parakeet for streaming audio transcription on Modal GPUs, with simple Python and browser clients.

This example uses the nvidia/parakeet-tdt-0.6b-v2 model which, as of June 2025, sits at the top of Hugging Face’s Open ASR leaderboard.

To try out transcription from your terminal, provide a URL for a .wav file to modal run:
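Assuming the example script is saved as parakeet.py and exposes an --audio-url CLI option (both names are illustrative, not necessarily the example's actual ones), the invocation would look something like:

```shell
modal run parakeet.py --audio-url="https://example.com/sample.wav"
```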

You should see output like the following:

Running a web service you can hit from any browser isn’t any harder — Modal handles the deployment of both the frontend and backend in a single App! Just run
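With the same hypothetical filename as above, serving the app for development is a single command; modal serve hot-reloads the app and prints a temporary URL:

```shell
modal serve parakeet.py
```

For a permanent deployment, modal deploy parakeet.py works the same way.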

and go to the link printed in your terminal.

The full frontend code can be found here.

Setup 

Volume for caching model weights 

We use a Modal Volume to cache the model weights. This allows us to avoid downloading the model weights every time we start a new instance.
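A minimal sketch of such a cache (the Volume name and mount path here are assumptions, not the example's actual values):

```python
import modal

# Persist model weights across container starts. With create_if_missing=True,
# the Volume is created on first use and reused afterwards.
model_cache = modal.Volume.from_name(
    "parakeet-model-cache", create_if_missing=True
)

# Path where the Volume will be mounted inside the container, so that
# model downloads land in the Volume and survive container restarts.
MODEL_CACHE_PATH = "/cache"
```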

For more on storing models on Modal, see this guide.

Configuring dependencies 

The model runs remotely inside a container on Modal. We can define the environment and install our Python dependencies in that container’s Image.

For finicky setups like NeMo’s, we recommend using the official NVIDIA CUDA Docker images from Docker Hub. You’ll need to install Python and pip with the add_python option because the image doesn’t include them by default.

Additionally, we install ffmpeg for handling audio data and fastapi to create a web server for our WebSocket.
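Putting those pieces together, the Image definition looks roughly like this (the CUDA tag, Python version, and package pins are illustrative, not the example's exact ones):

```python
import modal

image = (
    # Start from NVIDIA's official CUDA image; add Python, since the
    # base image does not ship with Python or pip.
    modal.Image.from_registry(
        "nvidia/cuda:12.4.1-devel-ubuntu22.04", add_python="3.12"
    )
    .apt_install("ffmpeg")  # decoding and resampling incoming audio
    .pip_install("nemo_toolkit[asr]", "fastapi", "pydub")
)
```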

Implementing streaming audio transcription on Modal 

Now we’re ready to implement transcription. We wrap inference in a modal.Cls so that the model is loaded and moved to the GPU once, when a new container starts.

A couple of notes about this code:

  • The transcribe method takes bytes of audio data and returns the transcribed text.
  • The web method creates a FastAPI app using modal.asgi_app that serves a WebSocket endpoint for streaming audio transcription and a browser frontend for transcribing audio from your microphone.
  • The run_with_queue method takes a modal.Queue and passes audio data and transcriptions between our local machine and the GPU container.
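Put together, the shape of the class looks roughly like this. Method bodies, decorator arguments, the app name, and the GPU type are simplified sketches, not the example's full code:

```python
import modal

app = modal.App("streaming-parakeet")  # app name is an assumption


@app.cls(gpu="a10g")  # GPU type is illustrative
class Transcriber:
    @modal.enter()
    def load_model(self):
        # Runs once per container: load the NeMo model and move it to the GPU.
        import nemo.collections.asr as nemo_asr

        self.model = nemo_asr.models.ASRModel.from_pretrained(
            model_name="nvidia/parakeet-tdt-0.6b-v2"
        )

    @modal.method()
    def transcribe(self, audio_bytes: bytes) -> str:
        # Decode raw audio bytes and run a single inference pass.
        ...

    @modal.asgi_app()
    def web(self):
        # FastAPI app serving the browser frontend and a WebSocket endpoint.
        from fastapi import FastAPI

        web_app = FastAPI()
        ...
        return web_app

    @modal.method()
    def run_with_queue(self, q: modal.Queue):
        # Pull audio chunks from the queue, push transcriptions back.
        ...
```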

Parakeet tries really hard to transcribe everything to English! As a result, it tends to output utterances like “Yeah” or “Mm-hmm” when run on silent audio. We pre-process the incoming audio on the server using pydub’s silence detection, ensuring that we don’t pass silence into our model.
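The example uses pydub for this; the underlying idea can be illustrated with a dependency-free energy check (a simplified stand-in for pydub's silence detection, not the example's actual code — the threshold is arbitrary):

```python
import math
import struct


def is_silent(pcm_bytes: bytes, threshold_rms: int = 500) -> bool:
    """Return True if a chunk of 16-bit mono PCM audio is below an RMS threshold."""
    if not pcm_bytes:
        return True
    samples = struct.unpack(f"<{len(pcm_bytes) // 2}h", pcm_bytes)
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return rms < threshold_rms


# A flat chunk of zeros is silent; a loud square wave is not.
quiet = struct.pack("<4h", 0, 0, 0, 0)
loud = struct.pack("<4h", 20000, -20000, 20000, -20000)
```

Chunks flagged as silent are simply dropped before reaching the model.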

Running transcription from a local Python client 

Next, let’s test the model with a local_entrypoint that streams audio data to the server and prints out the transcriptions to our terminal as they arrive.

Instead of using the WebSocket endpoint like the browser frontend, we’ll use a modal.Queue to pass audio data and transcriptions between our local machine and the GPU container.

Below are the two functions that coordinate streaming audio and receiving transcriptions.

send_audio transmits chunks of audio data with a slight delay, as though it were being streamed from a live source, like a microphone. receive_text waits for transcribed text to arrive and prints it.
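The coordination pattern can be sketched with asyncio, using a plain asyncio.Queue as a local stand-in for modal.Queue (the chunk size, delay, and end-of-stream sentinel are assumptions for this sketch):

```python
import asyncio

CHUNK_SIZE = 4  # bytes per chunk; tiny here for demonstration
END_OF_STREAM = None  # sentinel marking the end of the audio


async def send_audio(q: asyncio.Queue, audio: bytes) -> None:
    # Push fixed-size chunks with a short delay, mimicking a live source.
    for i in range(0, len(audio), CHUNK_SIZE):
        await q.put(audio[i : i + CHUNK_SIZE])
        await asyncio.sleep(0.01)
    await q.put(END_OF_STREAM)


async def receive_text(q: asyncio.Queue, out: list) -> None:
    # Drain results (here, the raw chunks) until the sentinel arrives.
    while (chunk := await q.get()) is not END_OF_STREAM:
        out.append(chunk)


async def main() -> list:
    q: asyncio.Queue = asyncio.Queue()
    out: list = []
    await asyncio.gather(send_audio(q, b"abcdefgh"), receive_text(q, out))
    return out


chunks = asyncio.run(main())
```

In the real example, the GPU container sits between the two ends, consuming audio from one side of the modal.Queue and producing transcriptions on the other.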

Addenda 

The remainder of the code in this example is boilerplate, mostly for handling Parakeet’s input format.