August 5, 2025 · 10 minute read

The Top Open Source Speech-to-Text (STT) Models in 2025

The world of speech-to-text (STT) is rapidly evolving, with new state-of-the-art models launching every month. Many of these models are open-source and are fairly popular. This article explores some of the top open-source STT models, based on Hugging Face’s trending models and performance benchmarks from the Open ASR Leaderboard.

| Model | Parameters | Word Error Rate | RTFx | Created By | Released | License |
| --- | --- | --- | --- | --- | --- | --- |
| Canary Qwen 2.5B | 2.5B | 5.63% | 418 | Nvidia | 2025 | Apache 2.0 |
| Granite Speech 3.3 | 8B | 5.85% | 31 | IBM | April 2025 | Apache 2.0 |
| Parakeet TDT 0.6B V2 | 600M | 6.05% | 3386 | Nvidia | May 2025 | CC-BY-4.0 |
| Whisper Large V3 Turbo (deploy on Modal) | 809M | 10%-12% | 216 | OpenAI | October 2024 | MIT |
| Kyutai 2.6B (deploy on Modal) | 2.6B | 6.4% | 88 | Kyutai | September 2024 | CC-BY-4.0 |

How should you think about rankings?

While the Hugging Face trending models leaderboard is a rough indication of popularity, it does not prove that a certain model is better than others for your use case. For speech-to-text models, you should consider multiple factors:

  1. Accuracy. Some models have better accuracy (represented as a lower Word Error Rate, or WER for short). These models might be preferred for applications that have less tolerance for errors, such as phone tree systems, doctor-patient transcription applications, or court documentation tools.
  2. Language Needs. Many models are designed for English or other major languages like Mandarin or Spanish. If you need a speech-to-text model for a less common language, there might be a language-specific model that is more performant.
  3. Costs. Some models cost more than others to run. Speech can be expensive to process given its rich data format. Larger models might be more accurate, but could incur higher costs.
  4. Throughput. Throughput matters for applications that process recorded audio, where lengthy recordings can take considerably longer on slower models than on faster ones. Speed is measured by a model’s RTFx metric, which is discussed in detail below.
  5. Latency. Latency measures how long it takes for a model to finish producing output tokens after the last chunk of audio is submitted. If your task requires near real-time speech-to-text transcription, such as phone tree software, then a model with low latency might be preferred over a model with higher accuracy.
  6. Developer. Many models come from developers that also build other AI models, such as LLMs. For example, OpenAI develops not only Whisper but also GPT-4o, a leading LLM. For some organizations, choosing a model from the same developer can mean better-aligned integrations, pricing (when using managed solutions), and support.

A major benefit of open-source models is that they’re free to try out (minus compute costs). The best way to make an informed decision is to evaluate your use case with different models. These test cases should use real-world or realistic data. This will provide more accurate results—for example, the vocabulary of your average users might impact the word error rate for different models differently.
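
As a concrete starting point, a head-to-head evaluation on your own data can be a short script. The sketch below is illustrative only: it assumes the `transformers`, `torch`, and `jiwer` packages, uses two Whisper checkpoints from the table above (NeMo-based models like Canary and Parakeet load differently; see the Parakeet section), and the `samples` list is hypothetical and should be replaced with real audio files and reference transcripts from your product.

```python
# A minimal model-comparison harness, assuming `pip install transformers torch jiwer`.
# `samples` is hypothetical: replace with real (audio file, reference transcript) pairs.
from transformers import pipeline
import jiwer

samples = [
    ("calls/call_001.wav", "thanks for calling how can i help you today"),
    ("calls/call_002.wav", "i would like to reset my password"),
]

# Two Whisper checkpoints that load through the transformers ASR pipeline.
candidate_models = ["openai/whisper-large-v3-turbo", "openai/whisper-large-v3"]

for model_id in candidate_models:
    asr = pipeline("automatic-speech-recognition", model=model_id)
    references, hypotheses = [], []
    for path, reference in samples:
        hypotheses.append(asr(path)["text"].lower().strip())
        references.append(reference)
    print(f"{model_id}: WER = {jiwer.wer(references, hypotheses):.2%}")
```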

What are common metrics relevant to comparing STT models?

There are three common statistics that are used to measure STT models. Understanding these metrics will enable you to better understand the spec sheets of each model.

What is Word Error Rate (WER)?

Word Error Rate, or WER, is a measure of accuracy. Specifically, WER measures the percentage of words that the STT model transcribed incorrectly: the number of substitutions (wrong word used), deletions (word missed), and insertions (extra word added) needed to turn the model’s output into a perfectly accurate transcript, divided by the number of words in the reference.

Notably, WER is a crude measure of accuracy. For example, mistaking “their” and “there” might be considered trivial in many applications where the user might be less discerning of incorrect homonyms. However, mistaking “thirteen” and “thirty” could be considered a serious mistake, especially in applications where precision matters (e.g., medical transcription services). Some models will rate themselves using a weighted WER, where different mistakes are assigned different weights. Additionally, some models might make substitutions, deletions, or insertions at varying rates. However, WER is a good way of comparing models in a vacuum with no specific use case in mind.
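
To make the definition concrete, here is a minimal, illustrative WER implementation built on word-level edit distance. In practice you would typically reach for a library such as `jiwer` rather than rolling your own; the example sentences are made up.

```python
# WER from first principles: align reference and hypothesis word sequences with
# edit distance, then divide (substitutions + deletions + insertions) by the
# number of reference words.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                dp[i][j] = 1 + min(
                    dp[i - 1][j - 1],  # substitution
                    dp[i - 1][j],      # deletion (reference word missed)
                    dp[i][j - 1],      # insertion (extra word added)
                )
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# "thirty" misheard for "thirteen": 1 substitution over 4 reference words = 25% WER
print(word_error_rate("take thirteen milligrams daily", "take thirty milligrams daily"))
```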

What is the Inverse Real-Time Factor (RTFx)?

RTFx, the inverse real-time factor, is a measure of throughput: it measures how many seconds of audio an STT system can process per second of compute time. Mathematically, RTFx = audio duration / processing time. A lower RTFx means slower processing; an RTFx above 1.0 means the model transcribes faster than real time.

Notably, RTFx is not a measure of latency. This is best illustrated by how RTFx affects applications processing recorded speech versus real-time speech. For recorded speech, RTFx matters because it determines how long the system needs to work to produce a transcription; for example, a 20-minute recorded call might be transcribed in 5 seconds or take 50 minutes, depending on the model’s RTFx. For real-time audio, such as live call transcription, effective throughput is capped at 1.0 because audio only arrives at the rate at which it is produced. For streamed audio, RTFx doesn’t matter, but latency does.
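
As a quick illustration of what these throughput numbers mean in practice, the sketch below converts the RTFx figures from the table above into approximate batch-processing times for a 20-minute recording.

```python
# Converting RTFx into wall-clock batch-processing time for a 20-minute recording.
# processing_time = audio_duration / RTFx (throughput only; this says nothing about latency).
audio_seconds = 20 * 60
rtfx_scores = {
    "Parakeet TDT 0.6B V2": 3386,
    "Canary Qwen 2.5B": 418,
    "Whisper Large V3 Turbo": 216,
    "Granite Speech 3.3": 31,
}

for model, rtfx in rtfx_scores.items():
    print(f"{model}: ~{audio_seconds / rtfx:.1f}s to transcribe 20 minutes of audio")
```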

What is Latency?

Latency is a measure of how quickly a model finishes outputting text after it receives the last chunk of audio. For recorded speech, latency often doesn’t matter: the entire file is available to the model up front, and what dominates is throughput, where models can differ by 10-100x (as measured by RTFx). For real-time speech, latency is the only speed metric that matters, because effective throughput is capped at 1.0 and only latency determines how quickly the transcription becomes available.

Latency is particularly important for applications that pass a transcription into an LLM, where the LLM needs the full transcription before it can begin processing. However, latency isn’t typically emphasized by most models for two reasons: (i) research papers evaluate on popular datasets like AMI or Earnings22, where audio isn’t streamed and only RTFx is measured, and (ii) in many real-time audio scenarios, another AI component, such as a large language model (LLM) or text-to-speech (TTS) model, is the latency bottleneck rather than the STT model.

What are Parameters?

Parameters, often listed in the billions in a model’s name (e.g. Parakeet TDT 0.6B), are the internal numerical values (weights and biases) that the model learns during training to make predictions. More parameters can capture more complex patterns, but take longer to train and often require more compute and memory at runtime. Applications might opt for a model with fewer parameters if they run on a memory- and compute-constrained device, such as a wearable or sensor.
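
As a rough rule of thumb (an estimate, not a published spec), the memory needed just to hold the weights is roughly the parameter count times the bytes per parameter, before activations and framework overhead are added:

```python
# Back-of-the-envelope memory footprint of the weights alone.
# 2 bytes per parameter at fp16/bf16; 1 byte at int8; add headroom for activations.
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    return num_params * bytes_per_param / 1e9

print(weight_memory_gb(0.6e9))  # Parakeet TDT 0.6B  -> ~1.2 GB
print(weight_memory_gb(2.5e9))  # Canary Qwen 2.5B   -> ~5.0 GB
print(weight_memory_gb(8.0e9))  # Granite Speech 3.3 -> ~16.0 GB
```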

The Best Speech-to-Text Models in 2025

Now, let’s visit the best speech-to-text (STT) models in 2025.

Canary Qwen 2.5B

Canary Qwen 2.5B is an STT model developed by Nvidia. Canary Qwen is considered a state-of-the-art model for its low error rate, blazing-fast speed, massive training corpus, and hybrid architecture.

Canary Qwen 2.5B currently tops the Hugging Face Open ASR leaderboard with a 5.63% word error rate. What sets Canary apart is its new hybrid architecture that combines automatic speech recognition (ASR) with large language model (LLM) capabilities. This makes Canary Qwen 2.5B the first open-source Speech-Augmented Language Model (SALM). At 234,000 hours of English speech training data, this model was trained on almost three times more data than its predecessor, Canary 1B.

Canary Qwen has an RTFx score of 418, meaning that it can process audio 418 times faster than real time. This is reasonably fast for most industry use cases, but other models, such as Parakeet TDT, reach RTFx scores nearly 10x higher. Accordingly, Canary Qwen could be considered an accuracy-focused model with acceptable speed.

Canary Qwen 2.5B is the latest in a family of Nvidia Canary models. It’s also worth taking a look at the previous two versions.

Original Canary-1B (April 2024): This multilingual model supports English, German, French, and Spanish with bidirectional translation capabilities. It was trained on 85,000 hours of speech data and achieved a 6.67% word error rate on the HuggingFace Open ASR Leaderboard. This means it outperformed similarly sized models, such as Whisper-large-v3, despite using significantly less training data.

Canary-1B-Flash (March 2025): This is an optimization-focused variant of Canary-1B featuring 32 encoder layers and 4 decoder layers that prioritizes inference speed while maintaining exceptional accuracy. Most impressively, it delivers over 1000 RTFx performance on the Open ASR Leaderboard datasets while maintaining competitive accuracy across the four languages.

What products is Canary Qwen designed for?

Canary Qwen is ideal for applications that need a low error rate. This might include telecommunication systems where miscommunication with users creates real hazards (e.g., bank or airline support systems), medical products where transcription mistakes could lead to incorrect prescriptions (e.g., 15 mg vs. 50 mg), or financial audio recorders used to enforce trading desk compliance.

Whisper Large V3 Turbo

Whisper Large V3 Turbo is the latest iteration of OpenAI’s flagship speech-to-text model, which debuted in 2022. Whisper is an incredibly popular model with abundant forks and tooling built around it by the community.

Whisper Large V3 Turbo is a significant upgrade from its predecessors. By reducing the number of decoder layers from 32 to 4, OpenAI achieved a 5.4x speedup in processing time while maintaining accuracy similar to the original Whisper Large V2 model.

The turbo variant performs exceptionally well across multiple languages, though it shows slightly larger accuracy degradation in languages such as Thai. For English and major European languages, the quality remains excellent.

When it comes to speed, Whisper Large V3 Turbo has an RTFx of 216, which is still plenty of throughput for real-time speech-to-text. This model is ideal when you need the multilingual capabilities that Whisper is known for without the compute price tag of the full model. V3 Turbo continues Whisper’s tradition of robust performance across various accents, background noise, and technical language, making it a reliable choice for diverse audio content.
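
If you just want a transcript with minimal setup, a minimal sketch using the reference `openai-whisper` package might look like the following. It assumes a recent version of the package (which exposes this checkpoint under the alias "turbo") and `ffmpeg` on your PATH; the audio file name is hypothetical.

```python
# Minimal transcription with the reference openai-whisper package
# (assumes `pip install -U openai-whisper` and ffmpeg installed).
import whisper

model = whisper.load_model("turbo")  # alias for large-v3-turbo in recent releases
result = model.transcribe("meeting.mp3", language="en")  # language is optional; Whisper can auto-detect
print(result["text"])
```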

What were the previous iterations in this Whisper family?

Initial Whisper Release (September 2022): This was OpenAI’s first open-source speech recognition model. It featured an encoder-decoder transformer architecture with five model sizes from tiny to large and was trained on 680,000 hours of multilingual data collected from the web, enabling transcription and translation across 99 languages.

Whisper Large V2 (December 2022): This second version of the Whisper family improved on training techniques and refined data processing. This led to a 10-15% improvement in accuracy over the previous model, which was particularly apparent in challenging audio conditions and audio containing background noise.

Whisper Large V3 (November 2023): This next iteration was released during OpenAI’s Dev Day. It was trained on 1 million hours of weakly labeled audio and 4 million hours of audio that was labeled by Whisper Large V2. Large V3 also introduced several architectural improvements, like increasing the Mel frequency bins from 80 to 128 and adding a new language token for Cantonese. This resulted in a 20-30% improvement in non-English languages and introduced much-improved code-switching capabilities.

WhisperX (Community Project): This is a third-party transcription pipeline designed specifically to work with the Whisper family of models. It adds precise word-level timestamps, speaker diarization, and much better handling of long audio files through intelligent chunking. You should use WhisperX if you want to boost the usability and transcription quality of any of the underlying Whisper models, especially when working in multi-speaker contexts or with extended audio recordings.
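
A hedged sketch of the WhisperX flow, based on the project’s documented API: batched transcription followed by forced alignment for word-level timestamps. The audio file name is hypothetical, and diarization (which requires an additional Hugging Face token) is omitted here.

```python
# WhisperX: batched transcription plus forced alignment for word-level timestamps.
# Assumes `pip install whisperx` and a CUDA GPU; "panel_discussion.wav" is hypothetical.
import whisperx

device = "cuda"
audio = whisperx.load_audio("panel_discussion.wav")

# 1. Transcribe with a batched Whisper backend
model = whisperx.load_model("large-v3", device, compute_type="float16")
result = model.transcribe(audio, batch_size=16)

# 2. Align the output against the audio to recover word-level timestamps
align_model, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

for segment in result["segments"]:
    print(segment["start"], segment["end"], segment["text"])
```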

What products is Whisper designed for?

Whisper’s biggest strength is the massive amount of community projects built around the model. Accordingly, Whisper is ideal for engineering teams that like to use scaffolded code or need a head start in implementing an STT model in a common context. For example, there’s a React hook for implementing Whisper with hundreds of stars.

Granite Speech 3.3

Granite Speech 3.3 is an STT model developed by IBM, targeting enterprises.

With 8 billion parameters, it’s the largest model on our list and one of the largest open-source STT models available. Interestingly, despite having a larger parameter count than Nvidia’s Canary model, it posts a slightly higher word error rate of 5.85%, putting it at #2 on the Hugging Face Open ASR Leaderboard.

This is still a great model, though. Its substantial size provides really great language understanding and robust performance across diverse audio conditions. The 8B parameters do make Granite rather compute hungry, and an RTFx of 31 makes it 13 times slower than Nvidia’s Canary. At the end of the day, though, this is still a great model that delivers the kind of accuracy that IBM’s enterprise customers demand.

What products is Granite Speech 3.3 designed for?

Granite Speech 3.3 is designed around popular languages used by businesses. Granite Speech 3.3 is optimized for English, French, German, and Spanish, and also excels at English-to-Japanese and English-to-Mandarin with built-in configurable speech translations. This makes Granite Speech 3.3 ideal for applications that require multi-lingual and translation support. For example, you can easily configure Granite Speech 3.3 to transcribe an English instructional video into Mandarin characters.

Parakeet TDT 0.6B V2

Parakeet TDT 0.6B V2 is an STT model developed by Nvidia using their NeMo framework. It currently ranks #3 on the Open ASR leaderboard.

Parakeet has only 600M parameters, but is fairly accurate with a WER of 6.05% while being blazing-fast with an RTFx of 3386. This means you could transcribe an entire hour of audio in one second using Parakeet.

Parakeet excels at English transcription with automatic punctuation, capitalization, and accurate word-level timestamps. It’s especially good at handling spoken numbers and song lyrics, making it versatile for various audio content types.
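
Parakeet is distributed through NVIDIA’s NeMo toolkit rather than the transformers pipeline. Based on the model card, a minimal transcription sketch looks roughly like this; the audio file name is hypothetical.

```python
# Loading Parakeet through NVIDIA's NeMo toolkit, following the model card
# (assumes `pip install -U "nemo_toolkit[asr]"`; "court_hearing.wav" is hypothetical).
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="nvidia/parakeet-tdt-0.6b-v2")
output = asr_model.transcribe(["court_hearing.wav"])
print(output[0].text)  # punctuated, capitalized transcript
```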

What products is Parakeet TDT 0.6B V2 designed for?

Parakeet TDT 0.6B V2 is ideal for applications that need to process long audio recordings, such as software that processes legal proceedings (where audio isn’t available until the proceeding is finished) or film captioning software.

Kyutai 2.6B

Kyutai 2.6B is an STT model developed by Kyutai, the AI research lab behind the Moshi conversational assistant, and it is designed for real-time applications with low latency.

Kyutai boasts a strong word error rate of 6.4% but a relatively slow RTFx of just 88. However, Kyutai’s low RTFx is irrelevant to its intended use case: real-time streaming. Kyutai 2.6B starts producing a transcription just 2.5s after the first audio chunk is streamed into the model. And while Kyutai 2.6B is the variant ranked on the Open ASR leaderboard, the smaller Kyutai 1B has a latency of just 1s after the first audio chunk is streamed.

However, Kyutai only supports English and French, with a slightly higher word error rate for French.

What products is Kyutai 2.6B designed for?

Kyutai is designed for real-time audio use cases. This includes real-time telecommunication software, such as phone trees, conversation simulators like sales role-play call training software, and voice interfaces for cars or other devices.

Kyutai is only suitable for English and French applications, given its limited language support.

Conclusion

Speech-to-Text is an exploding use case for many AI-first companies, and the quality of open-source models continues to improve every month. Deploying open-source STT on Modal could give you the best of both worlds: higher quality models at a fraction of the cost compared to other closed-source providers.

To get started, check out our Whisper example.
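
For a flavor of what that deployment looks like, here is a heavily simplified sketch of a GPU-backed Whisper function on Modal; the function, app, and file names are illustrative, and the official example linked above adds caching, batching, and other production details.

```python
# A stripped-down Modal app that runs Whisper on a GPU.
import modal

image = (
    modal.Image.debian_slim()
    .apt_install("ffmpeg")
    .pip_install("openai-whisper")
)
app = modal.App("whisper-stt", image=image)


@app.function(gpu="a10g")
def transcribe(audio_bytes: bytes) -> str:
    import tempfile

    import whisper

    model = whisper.load_model("turbo")  # downloaded on cold start in this simplified version
    with tempfile.NamedTemporaryFile(suffix=".mp3") as f:
        f.write(audio_bytes)
        f.flush()
        return model.transcribe(f.name)["text"]


@app.local_entrypoint()
def main(path: str):
    print(transcribe.remote(open(path, "rb").read()))
```

If this is saved as app.py, running `modal run app.py --path meeting.mp3` ships the audio to the GPU function and prints the transcript.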
