
Batch and real-time ASR with scalability built in

Transcribe 2.7 hours of audio in a second on Modal’s autoscaling infrastructure

“Modal has been a really nice, scalable solution for audio transcription. We don’t have to worry about pre-allocating GPUs weeks ahead of time – we can spin up 1500 GPUs in minutes and it just works.”

Alex Cannan, ML Team Lead

“Modal makes it easy to write code that runs on 100s of GPUs in parallel, transcribing podcasts in a fraction of the time.”

Mike Cohen, Head of AI & ML Engineering

Deploy state-of-the-art ASR models in minutes

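import modal

app = modal.App("batch-transcription")  # illustrative app name

MODEL_NAME = "nvidia/parakeet-tdt-0.6b-v2"
tag = "..."  # nvidia/cuda image tag; not specified in this snippet

image = (
    modal.Image.from_registry(f"nvidia/cuda:{tag}", add_python="3.12")
    # Package list is abbreviated; add whatever else the model needs.
    .pip_install("nemo_toolkit[asr]", "torchaudio", "soundfile")
)


@app.cls(gpu="L40S", image=image)
class BatchTranscription:
    @modal.enter()
    def setup(self):
        # Import inside the container, where NeMo is installed.
        import nemo.collections.asr as nemo_asr

        # Load the model once per container start, not once per call.
        self.asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name=MODEL_NAME)

    @modal.method()
    async def run_inference(self, audio_filepaths):
        transcriptions = self.asr_model.transcribe(
            audio_filepaths, batch_size=128, num_workers=1
        )
        return transcriptions


@app.local_entrypoint()
def main():
    transcriber = BatchTranscription()
    # filepath_batches: an iterable of lists of audio file paths, defined elsewhere.
    for result in transcriber.run_inference.map(filepath_batches):
        ...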
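
The snippet maps over `filepath_batches` without showing how that value is built. Here is a minimal sketch, assuming the inputs start as a flat list of audio file paths and each `run_inference` call should receive one list. The helper name `make_batches`, the file names, and the batch size are illustrative and not part of Modal's API.

```python
from typing import Iterator, Sequence


def make_batches(paths: Sequence[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of `paths`, each at most `batch_size` long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(paths), batch_size):
        yield list(paths[start:start + batch_size])


# Hypothetical file names, for illustration only.
audio_paths = [f"episode_{i:03d}.wav" for i in range(10)]
filepath_batches = list(make_batches(audio_paths, batch_size=4))
```

Each element of `filepath_batches` is then one input to `.map`, so the number of batches sets how many calls can run in parallel.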

Outperform proprietary ASR

Get higher accuracy, with transcription that is 100x cheaper and faster, when you deploy the latest open-source ASR models on Modal.

Batch and real-time audio

Transcribe millions of hours of audio


Process 1M transcription jobs with Modal Batch. Leave the distributed systems to us.


Instant scaling. Modal’s Rust-based container stack provisions GPUs in less than a second.


Guaranteed GPU capacity. Modal’s cloud capacity orchestrator is robust to demand spikes.
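
For a rough sense of scale, the headline figure of 2.7 hours of audio transcribed per second can be turned into a wall-clock estimate for a large backlog. This back-of-the-envelope sketch assumes that throughput holds constant for the whole job, which is an assumption rather than a figure from this page. The function name is illustrative.

```python
HOURS_PER_SECOND = 2.7  # throughput quoted in the headline of this page


def estimated_wall_clock_hours(
    audio_hours: float, hours_per_second: float = HOURS_PER_SECOND
) -> float:
    """Wall-clock hours needed to transcribe `audio_hours` at a fixed rate."""
    if hours_per_second <= 0:
        raise ValueError("hours_per_second must be positive")
    seconds = audio_hours / hours_per_second
    return seconds / 3600


backlog_hours = 1_000_000  # one million hours of audio
print(f"{estimated_wall_clock_hours(backlog_hours):.1f} hours")
```

Under that assumption, a million hours of audio works out to a little over four days of wall-clock time.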

Deploy low-latency transcription for real-time voice agents


Achieve 100ms latency with WebRTC on Modal. Get GPUs wherever your users are.


Integrate with your favorite voice AI frameworks, like Pipecat and LiveKit.


Serve best-in-class models for real-time ASR, like Kyutai or RealtimeSTT.
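
Real-time pipelines typically feed the model short, fixed-duration audio frames rather than whole files, and the frame length sets a floor on added latency. Here is a minimal sketch of that framing step, assuming 16 kHz, 16-bit mono PCM and 20 ms frames. These values and the function name are illustrative assumptions, not settings taken from Modal or any particular framework.

```python
def split_into_frames(
    pcm: bytes,
    sample_rate: int = 16_000,
    frame_ms: int = 20,
    bytes_per_sample: int = 2,
) -> list[bytes]:
    """Split raw mono PCM into fixed-duration frames, dropping a trailing partial frame."""
    frame_bytes = sample_rate * frame_ms // 1000 * bytes_per_sample
    if frame_bytes <= 0:
        raise ValueError("frame size must be positive")
    usable = len(pcm) - len(pcm) % frame_bytes
    return [pcm[i:i + frame_bytes] for i in range(0, usable, frame_bytes)]


# One second of silence at 16 kHz, 16-bit mono.
one_second_of_silence = bytes(16_000 * 2)
frames = split_into_frames(one_second_of_silence)
```

Shorter frames reduce buffering delay but increase the number of messages sent per second, so the frame length is a trade-off to tune per application.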

Fine-tune ASR models without the black box

Achieve lower word error rates for your specific domain.


Iterate quickly on training code without managing cloud environments.


Fan out experiments on Modal’s autoscaling container infra.
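
Word error rate (WER) is the number of word substitutions, deletions, and insertions needed to turn the model's output into the reference transcript, divided by the number of reference words. Here is a minimal sketch for comparing a base and a fine-tuned model, assuming lowercased, whitespace-split text as the only normalization. Production evaluations usually normalize punctuation and numerals as well.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


reference = "the cat sat on the mat"
hypothesis = "the cat sit on mat"  # one substitution, one deletion
wer = word_error_rate(reference, hypothesis)
```

Running the same function over a held-out, in-domain test set before and after fine-tuning gives a direct measure of whether the domain-specific error rate went down.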

Built with Modal

Ship your first app in minutes.

$30/month of free compute