How to use Whisper
Whisper is a state-of-the-art open-source speech-to-text model developed by OpenAI, designed to convert audio into accurate text.
To get started with Whisper, you have two primary options:
- OpenAI API: Access Whisper’s capabilities through the OpenAI API.
- Self-hosted deployment: Deploy the open-source Whisper library on your own infrastructure, or on a serverless platform such as Modal, to maintain control over your transcription pipeline. This option allows you to use Whisper as:
- A command-line tool for quick and straightforward transcription tasks.
- A Python library for more complex integrations and custom applications (both usage modes are sketched just below).
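For reference, basic usage of both modes looks roughly like this; the file name and model size are placeholders:

```python
# Command-line usage (run in a shell):
#   whisper meeting.mp3 --model base
#
# Python library usage:
import whisper

model = whisper.load_model("base")
result = model.transcribe("meeting.mp3")  # placeholder audio file
print(result["text"])
```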
Your Whisper implementation is too slow. Now what?
Let’s say that you choose to self-host Whisper. If, for whatever reason, you find that your Whisper implementation is too slow for your needs, here are some strategies to speed it up:
1. Utilize GPU acceleration
Leveraging a GPU can significantly enhance the performance of Whisper. By offloading computations to the GPU, you can achieve faster inference times, especially for larger models. Ensure that your environment is set up with the appropriate CUDA drivers and that Whisper is configured to utilize the GPU.
To run Whisper on a GPU, first make sure that you have the CUDA drivers installed. You can usually do this by installing `torch` with CUDA support. Then ensure that your GPU is being used by Whisper by setting the `device` argument to `cuda`:
```python
import whisper

# "base" is a placeholder model size; device="cuda" runs inference on the GPU
model = whisper.load_model("base", device="cuda")
```
2. Opt for smaller Whisper models
There are several different versions of the Whisper model available for open use. If speed is a priority, using a smaller Whisper model can drastically reduce inference time. While larger models may offer better accuracy, smaller models like `base` or `small` can provide sufficient performance for many applications while processing audio more quickly.
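A quick way to see this trade-off on your own hardware is to time the same file across model sizes; a minimal sketch, where the file name is a placeholder and the actual numbers depend entirely on your GPU and audio length:

```python
import time

import whisper

for size in ("tiny", "base", "small", "medium"):
    model = whisper.load_model(size, device="cuda")
    start = time.time()
    result = model.transcribe("meeting.mp3")  # placeholder audio file
    print(f"{size:>6}: {time.time() - start:5.1f}s  {result['text'][:60]}...")
```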
3. Chunk the audio file and process in parallel
If you are trying to transcribe a long audio file (e.g. a podcast or meeting recording) and your use case is not real-time, you can chunk the audio file and process each of the chunks in parallel. Modal’s `.map` feature makes this easy to set up, as sketched below.
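Here is a minimal sketch of that pattern, assuming the audio has already been split into chunks locally (for example with ffmpeg or pydub); the app name, GPU type, model size, and chunk file names are all illustrative:

```python
import modal

app = modal.App("whisper-parallel-transcription")

# container image with ffmpeg (needed by Whisper to decode audio) and the whisper package
image = modal.Image.debian_slim().apt_install("ffmpeg").pip_install("openai-whisper")

@app.function(gpu="any", image=image)
def transcribe_chunk(chunk: bytes) -> str:
    import tempfile

    import whisper

    model = whisper.load_model("base", device="cuda")
    with tempfile.NamedTemporaryFile(suffix=".mp3") as f:
        f.write(chunk)
        f.flush()
        return model.transcribe(f.name)["text"]

@app.local_entrypoint()
def main():
    # assumed to come from an earlier splitting step
    chunk_paths = ["chunk_00.mp3", "chunk_01.mp3", "chunk_02.mp3"]
    chunks = [open(path, "rb").read() for path in chunk_paths]

    # .map fans the chunks out across parallel containers
    transcripts = list(transcribe_chunk.map(chunks))
    print(" ".join(transcripts))
```

In a real deployment you would likely cache the model load (for example with a Modal class) so each container loads the weights only once; the sketch above just shows the fan-out.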
4. Implement real-time streaming with Whisper
The base open-source Whisper library processes audio in 30-second chunks, so it does not support real-time transcription out of the box. The Whisper Streaming implementation adds real-time transcription, making it ideal for applications such as live captioning or interactive voice assistants.
Whisper Streaming supports various backends, with Faster-Whisper being the most recommended option. This variant is optimized for GPU support, offering significant speed improvements for high-demand transcription tasks.
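A rough sketch of how the Whisper Streaming (whisper_online) interface is typically used, following the project’s README; the file name is a placeholder standing in for a live audio source, and the exact return format of `process_iter` may vary by version:

```python
import librosa
from whisper_online import FasterWhisperASR, OnlineASRProcessor

asr = FasterWhisperASR("en", "large-v2")   # source language, model size
online = OnlineASRProcessor(asr)

# stand-in for a live source: replay a local file in 1-second chunks at 16 kHz
audio, sr = librosa.load("meeting.wav", sr=16000)  # placeholder file

for start in range(0, len(audio), sr):
    online.insert_audio_chunk(audio[start:start + sr])
    print(online.process_iter())   # newly committed (timestamped) text, if any

print(online.finish())             # flush whatever remains at the end of the stream
```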
5. Explore faster variants of Whisper
If the tricks above don’t meet your needs, consider using alternatives like WhisperX or Faster-Whisper. These variations are designed to enhance speed and efficiency, making them suitable for high-demand transcription tasks.
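For example, Faster-Whisper (a CTranslate2 reimplementation of Whisper) exposes an interface very similar to the original library; a minimal sketch, with the file name and beam size as placeholders:

```python
from faster_whisper import WhisperModel

# float16 compute on a GPU is typically several times faster than the reference implementation
model = WhisperModel("small", device="cuda", compute_type="float16")

segments, info = model.transcribe("meeting.mp3", beam_size=5)
print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
for segment in segments:
    print(f"[{segment.start:6.2f}s -> {segment.end:6.2f}s] {segment.text}")
```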
Deploy Whisper, fast, on Modal GPUs
Modal offers a serverless compute platform for AI and data teams. When you deploy Whisper (or a variant of it) on Modal, you are only charged when the model is running, and you don’t have to worry about infrastructure management. This means you can focus on building and scaling your applications without the overhead of managing servers.
With Modal, you can easily spin up powerful GPU instances to handle your transcription tasks efficiently. Whether you’re processing large volumes of audio data or need real-time transcription capabilities, Modal provides the flexibility and performance you need. Start deploying Whisper on Modal today and experience the benefits of a streamlined, cost-effective solution for your audio processing needs!