Audio inference

Use Modal to build flexible pipelines to transcribe audio, generate voice or synthesize high-fidelity music.

Podcast transcription

Run Whisper on hardware of your choice, customized with pre-processing (such as ffmpeg) to your liking. Use Modal’s map capability to spin up hundreds of containers to transcribe a single audio track in parallel.
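The fan-out pattern described above can be sketched locally. This is a minimal illustration, not Modal's API: it uses Python's standard-library thread pool as a stand-in for Modal's map capability, and `split_into_chunks`, `transcribe_chunk`, and `transcribe_track` are hypothetical names introduced here. The placeholder worker only returns a label for its time range; a real worker would cut that segment out of the file (for example with ffmpeg) and run Whisper on it.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(total_seconds: float, chunk_seconds: float) -> list[tuple[float, float]]:
    """Return (start, end) offsets in seconds that together cover the whole track."""
    if total_seconds < 0 or chunk_seconds <= 0:
        raise ValueError("total_seconds must be >= 0 and chunk_seconds must be > 0")
    total = float(total_seconds)
    chunks = []
    start = 0.0
    while start < total:
        end = min(start + chunk_seconds, total)  # last chunk may be shorter
        chunks.append((start, end))
        start = end
    return chunks


def transcribe_chunk(chunk: tuple[float, float]) -> str:
    """Hypothetical placeholder for a per-chunk Whisper call (assumption)."""
    start, end = chunk
    return f"[{start:.0f}s-{end:.0f}s]"


def transcribe_track(total_seconds: float, chunk_seconds: float = 30.0, workers: int = 8) -> str:
    """Split one track into chunks, process them in parallel, and join the results."""
    chunks = split_into_chunks(total_seconds, chunk_seconds)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in input order, so the transcript
        # stays in sequence even if chunks finish out of order.
        pieces = list(pool.map(transcribe_chunk, chunks))
    return " ".join(pieces)
```

On Modal, the local thread pool would be replaced by a map over a remote function, so that each chunk runs in its own container. The chunk length is a tuning choice: shorter chunks give more parallelism but more boundaries where words can be cut.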

Music generation

Host diffusion models to create high-quality music samples from text or audio input. Serve your model in any form, including a serverless Discord bot.

Voice generation

Generate realistic human voice in real time using open-source models. Easily combine voice generation with LLM text generation in a single app.




Mike Cohen
Head of Data

Substack recently launched a feature for AI-powered audio transcriptions. The data team picked Modal because it makes it easy to write code that runs on 100s of GPUs in parallel, transcribing podcasts in a fraction of the time.

Georg Kucsko
Co-Founder

Suno has developed proprietary state-of-the-art models that generate music and speech using AI. We chose Modal as our infrastructure provider for inference and parallel data processing. Modal's superb developer experience enables our team to ship new models to production quickly, and with confidence we'll scale to thousands of simultaneous users.

Ship your first app in minutes

with $30 / month free compute