“Modal has been a really nice, scalable solution for audio transcription. We don’t have to worry about pre-allocating GPUs weeks ahead of time – we can spin up 1500 GPUs in minutes and it just works.”
import modal

app = modal.App("parakeet-transcription")

MODEL_NAME = "nvidia/parakeet-tdt-0.6b-v2"
CUDA_TAG = "12.4.1-cudnn-devel-ubuntu22.04"  # any recent CUDA devel tag works here

image = (
    modal.Image.from_registry(f"nvidia/cuda:{CUDA_TAG}", add_python="3.12")
    .pip_install("nemo_toolkit[asr]", "torchaudio", "soundfile")  # plus any extra dependencies
)

@app.cls(gpu="L40S", image=image)
class BatchTranscription:
    @modal.enter()
    def setup(self):
        # Import inside the container, where nemo_toolkit is installed,
        # and load the model once per container rather than once per call.
        import nemo.collections.asr as nemo_asr

        self.asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name=MODEL_NAME)

    @modal.method()
    def run_inference(self, audio_filepaths):
        transcriptions = self.asr_model.transcribe(
            audio_filepaths, batch_size=128, num_workers=1
        )
        return transcriptions

@app.local_entrypoint()
def main():
    transcriber = BatchTranscription()
    # filepath_batches: an iterable of lists of audio file paths.
    # Each batch fans out to its own GPU container.
    for result in transcriber.run_inference.map(filepath_batches):
        ...
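The entrypoint above leaves filepath_batches undefined. A minimal way to build it, assuming all_audio_files is a flat list of paths you have already collected (that name is an assumption, not part of the example above):

def chunked(paths, size=128):
    # Yield fixed-size lists of paths; each list becomes one .map input.
    for i in range(0, len(paths), size):
        yield paths[i : i + size]

filepath_batches = list(chunked(all_audio_files))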
Get higher accuracy and up to 100x cheaper, faster transcription when you deploy the latest open-source ASR models on Modal.
Process 1M transcription jobs with Modal Batch. Leave the distributed systems to us.
Instant scaling. Modal’s Rust-based container stack provisions GPUs in less than a second.
Guaranteed GPU capacity. Modal’s cloud capacity orchestrator is robust to demand spikes.
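At a million inputs, some calls will inevitably fail, so it helps that .map can stream results back unordered and surface errors as values rather than crashing the run. A sketch, reusing the run_inference method from the example above:

results, failures = [], []
for out in transcriber.run_inference.map(
    filepath_batches,
    order_outputs=False,     # stream results as containers finish
    return_exceptions=True,  # one bad file doesn't kill the whole run
):
    (failures if isinstance(out, Exception) else results).append(out)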
Achieve 100ms latency with WebRTC on Modal. Get GPUs wherever your users are.
Integrate with your favorite voice AI frameworks like Pipecat and LiveKit.
Serve best-in-class models for real-time ASR, like Kyutai or RealtimeSTT.
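To make the real-time path concrete, here is a minimal sketch of a streaming transcription endpoint on Modal, using a plain WebSocket rather than full WebRTC for brevity. The load_streaming_model and transcribe_chunk helpers are assumptions standing in for whichever streaming ASR model you serve:

import modal
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = modal.App("realtime-asr")
image = modal.Image.debian_slim().pip_install("fastapi[standard]")

@app.cls(gpu="L40S", image=image)
class StreamingASR:
    @modal.enter()
    def setup(self):
        self.model = load_streaming_model()  # assumption: your streaming ASR model

    @modal.asgi_app()
    def web(self):
        api = FastAPI()

        @api.websocket("/asr")
        async def asr(ws: WebSocket):
            await ws.accept()
            try:
                while True:
                    chunk = await ws.receive_bytes()  # raw audio from the client
                    # transcribe_chunk is a hypothetical incremental-decode helper
                    await ws.send_text(transcribe_chunk(self.model, chunk))
            except WebSocketDisconnect:
                pass

        return api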
Achieve lower word error rates for your specific domain.
Iterate quickly on training code without managing cloud environments.
Fan out experiments on Modal’s autoscaling container infra.
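A sketch of that fan-out, assuming a hypothetical finetune function that trains one configuration and returns its word error rate:

import modal

app = modal.App("asr-finetune-sweep")

@app.function(gpu="A100", timeout=6 * 3600)
def finetune(learning_rate: float, num_epochs: int) -> float:
    # Assumption: your training loop goes here; return the final word error rate.
    ...

@app.local_entrypoint()
def sweep():
    grid = [(lr, n) for lr in (1e-4, 3e-4, 1e-3) for n in (5, 10)]
    # starmap unpacks each tuple into finetune's arguments and runs
    # every experiment in its own autoscaled GPU container, in parallel.
    for (lr, n), wer in zip(grid, finetune.starmap(grid)):
        print(f"lr={lr} epochs={n} -> WER {wer}")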