
How to use Whisper
Whisper is a state-of-the-art open-source speech-to-text model from OpenAI that turns audio into accurate text transcripts.
To get started with Whisper, you have two primary options:
- OpenAI API: Access Whisper’s capabilities through the OpenAI API.
- Self-hosted deployment: Run the open-source Whisper library on your own hardware or on a cloud platform such as Modal, keeping full control over your transcription pipeline. This option lets you use Whisper as:
- A command-line tool for quick and straightforward transcription tasks.
- A Python library for more complex integrations and custom applications (both are sketched below).
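Once installed (`pip install openai-whisper`, plus `ffmpeg` on your system), both interfaces are quick to try. A minimal sketch, with a placeholder audio file name:

```python
# Command-line usage (run in a shell):
#   whisper audio.mp3 --model base

# Python library usage:
import whisper

model = whisper.load_model("base")      # downloaded on first use
result = model.transcribe("audio.mp3")  # placeholder file path
print(result["text"])
```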
Your Whisper implementation is too slow. Now what?
Let’s say you choose to self-host Whisper and find that, for whatever reason, your implementation is too slow for your needs. Here are some strategies to speed it up:
1. Leverage GPU Acceleration 🚀
The single most effective way to speed up Whisper is to run it on a GPU. By offloading computations from CPU to GPU, you can achieve dramatically faster inference times, especially for larger versions of Whisper.
To run Whisper on a GPU, first make sure the CUDA drivers are installed; installing `torch` with CUDA support usually takes care of this. Then make sure Whisper actually uses the GPU by setting the `device` argument to `"cuda"`:
```python
import whisper

# Any model size works here; see the size/speed tradeoffs below.
model = whisper.load_model("base", device="cuda")
```
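If your code might also run on machines without a GPU, a common pattern is to pick the device at runtime rather than hard-coding it:

```python
import torch
import whisper

# Use the GPU when one is visible; otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = whisper.load_model("base", device=device)
```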
2. Choose Smaller Models 🎯
Whisper offers multiple model sizes, each with different speed-accuracy tradeoffs:
- `tiny`: Fastest but least accurate
- `base`: Good balance for many use cases
- `small`: More accurate than base, still reasonably fast
- `medium`: Better accuracy, slower processing
- `large`: Most accurate, but slowest
If speed is crucial, consider using the `base` or `small` models. They often provide sufficient accuracy while processing audio significantly faster than larger models.
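A rough way to judge the tradeoff on your own data is to time the same clip across sizes. A minimal sketch, with a placeholder file path:

```python
import time
import whisper

# Time the same clip with two model sizes to compare speed vs. accuracy.
for size in ("base", "small"):
    model = whisper.load_model(size)
    start = time.perf_counter()
    result = model.transcribe("sample.mp3")  # placeholder path
    print(f"{size}: {time.perf_counter() - start:.1f}s "
          f"-> {result['text'][:60]}...")
```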
3. Process Audio Chunks in Parallel ⚡
For long audio files like podcasts or meeting recordings, parallel processing can dramatically reduce total transcription time. Here’s how:
- Split your audio into smaller chunks (e.g., 30-second segments)
- Process multiple chunks simultaneously
- Combine the results
If you are self-hosting Whisper on a platform like Modal, you can use Modal’s `.map` feature to process audio chunks in parallel.
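Here is a minimal sketch of that pattern using Modal’s Python SDK. The app name, GPU type, and pre-split `chunks/` directory are all illustrative, and in a real pipeline you’d load the model once per container rather than once per call:

```python
# Sketch: fan out Whisper transcription over audio chunks with Modal's .map.
# Assumes the audio was already split into 30-second WAV files locally
# (e.g. with pydub); names, GPU type, and paths are illustrative.
import modal

app = modal.App("whisper-parallel")
image = (
    modal.Image.debian_slim()
    .apt_install("ffmpeg")
    .pip_install("openai-whisper")
)

@app.function(gpu="T4", image=image)
def transcribe_chunk(chunk: bytes) -> str:
    import tempfile
    import whisper

    model = whisper.load_model("base")
    with tempfile.NamedTemporaryFile(suffix=".wav") as f:
        f.write(chunk)
        f.flush()
        return model.transcribe(f.name)["text"]

@app.local_entrypoint()
def main():
    from pathlib import Path

    # .map runs the chunks across containers in parallel, preserving order.
    chunks = [p.read_bytes() for p in sorted(Path("chunks").glob("*.wav"))]
    print(" ".join(transcribe_chunk.map(chunks)))
```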
4. Implement Real-Time Streaming 🔄
If you need real-time transcription, the standard open-source Whisper library (which processes 30-second chunks) won’t cut it. Instead, use Whisper Streaming, which enables:
- Live audio processing
- Immediate transcription output
- Lower latency for interactive applications
For optimal streaming performance, pair it with Faster-Whisper as the backend.
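The whisper_streaming project (ufal/whisper_streaming) exposes an online processor that you feed audio incrementally. The loop below follows the pattern in its README, with a chunked local file standing in for a live audio source; exact class names may differ between versions:

```python
# Sketch of incremental transcription with whisper_streaming, following
# its README; a local 16 kHz mono file stands in for a live audio feed.
import soundfile as sf
from whisper_online import FasterWhisperASR, OnlineASRProcessor

asr = FasterWhisperASR("en", "large-v2")  # Faster-Whisper as the backend
online = OnlineASRProcessor(asr)

audio, sr = sf.read("meeting.wav", dtype="float32")  # placeholder file
for i in range(0, len(audio), sr):            # feed ~1 second at a time
    online.insert_audio_chunk(audio[i : i + sr])
    print(online.process_iter(), flush=True)  # newly confirmed text, if any

print(online.finish())  # flush whatever is still buffered
```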
5. Try Optimized Whisper Variants 🔧
Several optimized versions of Whisper offer significant speed improvements:
- WhisperX: Batched inference plus accurate word-level timestamps via forced alignment
- Faster-Whisper: A CTranslate2 reimplementation that runs several times faster than the original with lower memory use
- Whisper.cpp: A plain C/C++ port that runs efficiently on CPUs, including Apple Silicon
These variants can provide substantial performance gains while maintaining accuracy.
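Faster-Whisper is often the easiest drop-in, since its API closely mirrors the original library’s. A minimal sketch, with a placeholder file path:

```python
from faster_whisper import WhisperModel

# CTranslate2-backed model; float16 keeps GPU memory use low.
model = WhisperModel("base", device="cuda", compute_type="float16")

segments, info = model.transcribe("audio.mp3", beam_size=5)  # placeholder path
for seg in segments:  # segments are generated lazily as decoding proceeds
    print(f"[{seg.start:6.2f} -> {seg.end:6.2f}] {seg.text}")
```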
Deploy Fast Whisper on Modal
Want to implement these optimizations without managing infrastructure? Modal offers serverless GPU-powered compute that makes it easy to:
- ✅ Run Whisper on powerful GPUs
- ✅ Scale automatically with demand
- ✅ Pay only for actual usage
- ✅ Focus on building, not infrastructure
Start deploying high-performance Whisper on Modal today!