Why can’t you just use Whisper?
When OpenAI open-sourced Whisper, it gave the world a great speech-to-text model. Whisper is accurate, reasonably efficient, and free (even for commercial use cases). But it’s missing some key features that developers are often looking for, including:
- Speaker Diarization: Distinguishing between different speakers in an audio file.
- Word-level timestamps: Whisper has segment-level timestamps but not word-level timestamps.
- Streaming: Whisper processes audio in chunks of 30 seconds and does not support real-time or streaming speech-to-text conversion.
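To make the 30-second constraint concrete, here is a minimal sketch of how a longer recording breaks down into windows. The function name `chunk_boundaries` is our own, and the fixed 30-second step is a simplification: the reference implementation shifts each window based on the timestamps it predicts, so real boundaries will not always land on exact multiples of 30.

```python
WINDOW_SECONDS = 30.0  # length of the audio window Whisper consumes at a time


def chunk_boundaries(duration_s, window_s=WINDOW_SECONDS):
    """Return (start, end) pairs that cover the audio in consecutive windows.

    Simplified illustration: assumes a fixed step equal to the window length.
    """
    if duration_s <= 0:
        return []
    bounds = []
    start = 0.0
    while start < duration_s:
        # The final window is cut short at the end of the audio.
        end = min(start + window_s, duration_s)
        bounds.append((start, end))
        start += window_s
    return bounds


print(chunk_boundaries(75.0))
# [(0.0, 30.0), (30.0, 60.0), (60.0, 75.0)]
```

Because nothing is emitted until a window has been processed, this design is a poor fit for live audio, which is the gap the streaming variants aim to fill.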
Whisper variations
This means that if you are building a performant app that needs transcription, you shouldn’t use the Whisper library directly. Instead, you should use a variant that fills in some of the missing functionality, or provides speedups.
In this article, we’ll explore the various open-source Whisper variations and their unique features, and help you decide which one best suits your specific needs.
Table of contents
- Takeaways
- WhisperX
- Whisper JAX
- Whisper.cpp
- Distil-Whisper
- Whisper Streaming
- Running Whisper on Modal
Takeaways
- Overall best: WhisperX
- If you need to recognize multiple speakers: WhisperX
- If you need real-time transcription: Whisper Streaming
- If you want word-level timestamps: WhisperX
- If you need to run it on a phone or laptop: Whisper.cpp
WhisperX
WhisperX stands out as the most versatile and feature-rich Whisper variation. Here’s why it’s our top pick:
- Word-level timestamps and speaker diarization: WhisperX adds both to Whisper’s output, making it ideal for multi-speaker transcriptions.
- Speed: It uses Faster-Whisper under the hood, providing up to a 4x speed increase compared to the original Whisper.
- Language support: While not universal, WhisperX supports a wide range of languages including English, Spanish, French, German, Italian, Portuguese, Dutch, Russian, Mandarin Chinese, and Japanese.
If your project requires accurate transcription with speaker identification and precise timing, WhisperX is likely your best choice.
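To show why word-level timestamps and diarization work well together, here is a simplified sketch of labeling each word with a speaker. It is not WhisperX's actual code: the function names, the dictionary layout, and the "largest time overlap wins" rule are our own assumptions for illustration.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(words, turns):
    """Label each word with the speaker turn it overlaps the most.

    words: dicts with "word", "start", "end" (word-level timestamps)
    turns: dicts with "speaker", "start", "end" (diarization output)
    Words that overlap no turn get speaker None.
    """
    labeled = []
    for word in words:
        best_speaker, best_overlap = None, 0.0
        for turn in turns:
            shared = overlap(word["start"], word["end"], turn["start"], turn["end"])
            if shared > best_overlap:
                best_speaker, best_overlap = turn["speaker"], shared
        labeled.append({**word, "speaker": best_speaker})
    return labeled


# Hypothetical inputs for illustration only.
words = [
    {"word": "Hello", "start": 0.2, "end": 0.6},
    {"word": "there", "start": 0.7, "end": 1.0},
    {"word": "Hi", "start": 1.4, "end": 1.6},
]
turns = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 1.2},
    {"speaker": "SPEAKER_01", "start": 1.2, "end": 2.0},
]

for item in assign_speakers(words, turns):
    print(item["speaker"], item["word"])
```

With only segment-level timestamps, a segment that spans a change of speaker could not be split cleanly, which is why the two features are usually wanted together.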
Whisper JAX
Whisper JAX offers unparalleled speed for those with access to TPU v4 hardware:
- JAX implementation: Built on the Hugging Face implementation of Whisper but written in JAX instead of PyTorch.
- Extreme speed-up: Achieves a 10-15x speed increase on TPU v4 hardware, with the potential to run up to 70-100x faster than the original OpenAI implementation.
- Scalability: Particularly effective with large batch sizes for long audio files.
If you have access to TPU v4 and need to process large volumes of audio quickly, Whisper JAX is the winner.
Whisper.cpp
Whisper.cpp brings Whisper’s capabilities to edge devices:
- C++ implementation: Super lightweight implementation in pure C/C++ that allows for on-device usage, including laptops and phones.
- Quick startup: Fast boot-up time makes it ideal for applications requiring immediate transcription.
- Efficient processing: Can transcribe 1 hour of audio in approximately 8.5 minutes on standard hardware.
Choose Whisper.cpp when you need offline processing or want to run transcriptions directly on user devices.
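To put the throughput figure above in perspective, it helps to convert it into a real-time factor (processing time divided by audio duration). The helper name below is our own, and the only inputs are the numbers quoted above.

```python
def real_time_factor(processing_minutes, audio_minutes):
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    return processing_minutes / audio_minutes


rtf = real_time_factor(8.5, 60.0)  # 8.5 minutes to transcribe 1 hour of audio
speedup = 1 / rtf

print(f"RTF: {rtf:.3f}, speed-up over real time: {speedup:.1f}x")
# RTF: 0.142, speed-up over real time: 7.1x
```

In other words, the quoted figure works out to roughly seven times faster than real time.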
Distil-Whisper
Distil-Whisper, from the Hugging Face team, is a lightweight and efficient Whisper variation:
- Smaller and faster: A distilled version of Whisper that is 49% smaller and 6x faster than the original model.
Whisper Streaming
Whisper Streaming is a real-time Whisper implementation that supports streaming audio:
- Real-time transcription: Provides streaming capabilities for live audio transcription.
- Support for different backends: Several alternative backends are integrated. The most recommended one is faster-whisper with GPU support.
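One common way to get stable output from a model that only sees a growing audio buffer is to re-transcribe the buffer repeatedly and commit only the words on which consecutive passes agree. The sketch below illustrates that idea in plain Python. We are assuming this is close in spirit to the agreement policy Whisper Streaming describes, but the class and function names are hypothetical and this is not the project's code.

```python
def common_prefix(previous, current):
    """Longest run of leading words shared by two hypotheses."""
    prefix = []
    for a, b in zip(previous, current):
        if a != b:
            break
        prefix.append(a)
    return prefix


class StreamingCommitter:
    """Commit a word only once two consecutive hypotheses agree on it."""

    def __init__(self):
        self.committed = []  # words already emitted to the user
        self.previous = []   # hypothesis from the previous pass

    def update(self, hypothesis):
        """Take the latest full transcript of the buffer; return newly confirmed words."""
        agreed = common_prefix(self.previous, hypothesis)
        # Only words beyond what was already committed are new.
        new_words = agreed[len(self.committed):]
        self.committed.extend(new_words)
        self.previous = hypothesis
        return new_words


committer = StreamingCommitter()
print(committer.update(["the", "quick"]))                           # []
print(committer.update(["the", "quick", "brown", "fax"]))           # ['the', 'quick']
print(committer.update(["the", "quick", "brown", "fox", "jumps"]))  # ['brown']
```

The trade-off is latency: a word is shown only after a second pass confirms it, but the unstable tail of each hypothesis (like "fax" above) never reaches the user.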
Running Whisper on Modal
Once you’ve picked your Whisper model, you will need somewhere to deploy it. Modal offers a serverless compute platform for AI and data teams. When you deploy Whisper (or a variant of it) on Modal, you are only charged when the model is running, and you don’t have to worry about infrastructure management.
We have a guide for how to get this set up here.
Conclusion
In the vast majority of use cases, our recommendation is to use WhisperX (preferably deployed on Modal) for the best balance of ease of use, performance, and feature completeness.