December 16, 2024 · 5 minute read
WhisperX: Run a faster version of Whisper on Modal
Yiren Lu (@YirenLu)
Solutions Engineer

Introduction to WhisperX

WhisperX is a transcription library built on top of OpenAI’s Whisper that adds features such as word-level timestamps and speaker diarization. It enables ⚡️ 70x-realtime transcription with the Whisper large-v2 model while requiring under 8 GB of GPU memory with beam_size=5.
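
The core WhisperX workflow, per the upstream README, is: transcribe, then align to get word-level timestamps, then optionally diarize. Here is a minimal local sketch of that pipeline (the file name audio.wav and the Hugging Face token placeholder are stand-ins, not part of the Modal example below):

import whisperx

device = "cuda"
audio = whisperx.load_audio("audio.wav")

# 1. Transcribe with batched inference (faster-whisper under the hood)
model = whisperx.load_model("large-v2", device, compute_type="float16")
result = model.transcribe(audio, batch_size=16)

# 2. Align the output to get word-level timestamps
model_a, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device
)
result = whisperx.align(result["segments"], model_a, metadata, audio, device)

# 3. (Optional) label speakers; requires a Hugging Face token for pyannote
# diarize_model = whisperx.DiarizationPipeline(use_auth_token="<HF_TOKEN>", device=device)
# result = whisperx.assign_word_speakers(diarize_model(audio), result)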

For more details about why you might choose WhisperX over Whisper, or one of the other Whisper variants, see our comparison blog post.

Example code for running the WhisperX speech recognition model on Modal

To run the following code, you will need to:

  1. Create an account at modal.com
  2. Run pip install modal to install the modal Python package
  3. Run modal setup to authenticate (if this doesn’t work, try python -m modal setup)
  4. Copy the code below into a file called app.py
  5. Run modal run app.py
import modal

cuda_version = "12.4.0"  # should be no greater than host CUDA version
flavor = "devel"  #  includes full CUDA toolkit
operating_sys = "ubuntu22.04"
tag = f"{cuda_version}-{flavor}-{operating_sys}"

image = (
    modal.Image.from_registry(f"nvidia/cuda:{tag}", add_python="3.11")
    .apt_install(
        "git",
        "ffmpeg",
    )
    .pip_install(
        "torch==2.0.0",
        "torchaudio==2.0.0",
        "numpy<2.0",
        index_url="https://download.pytorch.org/whl/cu118",
    )
    .pip_install(
        "git+https://github.com/Hasan-Naseer/whisperX.git@release/latest-faster-whisper-version",
        "ffmpeg-python",
        "ctranslate2==4.4.0",
    )
)
app = modal.App("example-base-whisperx", image=image)

GPU_CONFIG = modal.gpu.H100(count=1)


@app.cls(
    gpu=GPU_CONFIG,
    allow_concurrent_inputs=15,
    container_idle_timeout=60 * 10,
    timeout=60 * 60,
)
class Model:
    @modal.build()
    @modal.enter()
    def setup(self):
        import whisperx

        device = "cuda"
        compute_type = (
            "float16"  # change to "int8" if low on GPU mem (may reduce accuracy)
        )

        # Load the WhisperX model once per container (wraps faster-whisper)
        self.model = whisperx.load_model("large-v2", device, compute_type=compute_type)

    @modal.method()
    def transcribe(self, audio_url: str):
        import requests
        import whisperx

        batch_size = 16  # reduce if low on GPU mem

        response = requests.get(audio_url)
        response.raise_for_status()  # fail fast on a bad download
        # Save the audio file locally
        with open("downloaded_audio.wav", "wb") as audio_file:
            audio_file.write(response.content)

        audio = whisperx.load_audio("downloaded_audio.wav")

        result = self.model.transcribe(audio, batch_size=batch_size)
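        # Optional (a sketch, per the upstream WhisperX README): align the
        # segments to get word-level timestamps before returning.
        # model_a, metadata = whisperx.load_align_model(
        #     language_code=result["language"], device="cuda"
        # )
        # result = whisperx.align(result["segments"], model_a, metadata, audio, "cuda")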
        return result["segments"]


# Run the model
@app.local_entrypoint()
def main():
    url = "https://pub-ebe9e51393584bf5b5bea84a67b343c2.r2.dev/examples_english_english.wav"

    print(Model().transcribe.remote(url))
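
Each entry in the returned segments list is a dictionary with start and end timestamps (in seconds) and the transcribed text, roughly like this (illustrative output, not the actual transcript of the sample file):

[{'start': 0.03, 'end': 2.5, 'text': ' Hello, and welcome to this example.'}]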
