January 21, 2025 · 5 minute read
How to run WhisperX on Modal
Yiren Lu (@YirenLu), Solutions Engineer

What is WhisperX?

WhisperX extends OpenAI’s open-source Whisper model with speaker diarization and more accurate word-level timestamp alignment. It uses faster-whisper under the hood, which provides roughly a 4x speedup over the original Whisper implementation. This guide shows you how to deploy WhisperX on Modal for production-ready audio transcription.
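
For reference, this is roughly what a transcription-plus-alignment pass looks like with the WhisperX Python API when run directly on a GPU machine (a minimal sketch based on the WhisperX README; the model size, batch size, and audio path are illustrative):

import whisperx

device = "cuda"
audio = whisperx.load_audio("audio.wav")  # illustrative local path

# 1. Transcribe with the batched faster-whisper backend
model = whisperx.load_model("large-v2", device, compute_type="float16")
result = model.transcribe(audio, batch_size=16)

# 2. Align the transcript for more accurate word-level timestamps
align_model, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device
)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)
print(result["segments"])  # segments now include word-level timings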

For more information on why you might want to use WhisperX, see our blog post on all the Whisper variants.

Why should you run WhisperX on Modal?

Modal is the best and easiest way to access GPU resources for running advanced transformer models like WhisperX.

With Modal, you simply write a Python function, apply a decorator, and your model is ready to run in the cloud on a GPU.
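
For example, a GPU-backed "hello world" on Modal looks something like this (a toy sketch separate from the WhisperX app below; the GPU type is just an example):

import modal

app = modal.App("gpu-hello")

@app.function(gpu="A10G")
def check_gpu():
    import subprocess

    # Runs inside a cloud container with an attached GPU
    subprocess.run(["nvidia-smi"], check=True)

@app.local_entrypoint()
def main():
    check_gpu.remote()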

Modal also allows you to run WhisperX in parallel across multiple containers, which is useful for batch processing large audio datasets. This lets you transcribe hundreds of audio files simultaneously; see the batch-processing sketch after the full example below.

Example code for running the WhisperX speech recognition model on Modal

To run the following code, you will need to:

  1. Create an account at modal.com
  2. Run pip install modal to install the modal Python package
  3. Run modal setup to authenticate (if this doesn’t work, try python -m modal setup)
  4. Copy the code below into a file called app.py
  5. Run modal run app.py
import modal

cuda_version = "12.4.0"  # should be no greater than host CUDA version
flavor = "devel"  #  includes full CUDA toolkit
operating_sys = "ubuntu22.04"
tag = f"{cuda_version}-{flavor}-{operating_sys}"

image = (
    modal.Image.from_registry(f"nvidia/cuda:{tag}", add_python="3.11")
    .apt_install(
        "git",
        "ffmpeg",
    )
    .pip_install(
        "torch==2.0.0",
        "torchaudio==2.0.0",
        "numpy<2.0",
        index_url="https://download.pytorch.org/whl/cu118",
    )
    .pip_install(
        "git+https://github.com/Hasan-Naseer/whisperX.git@release/latest-faster-whisper-version",
        "ffmpeg-python",
        "ctranslate2==4.4.0",
    )
)
app = modal.App("example-base-whisperx", image=image)

GPU_CONFIG = modal.gpu.H100(count=1)

CACHE_DIR = "/cache"
# Cache model weights in a persistent Modal Volume so cold starts skip the download
cache_vol = modal.Volume.from_name("whisper-cache", create_if_missing=True)

@app.cls(
    gpu=GPU_CONFIG,
    volumes={CACHE_DIR: cache_vol},
    allow_concurrent_inputs=15,
    container_idle_timeout=60 * 10,
    timeout=60 * 60,
)
class Model:
    @modal.enter()
    def setup(self):
        import whisperx

        device = "cuda"
        compute_type = (
            "float16"  # change to "int8" if low on GPU mem (may reduce accuracy)
        )

        # Load the WhisperX (faster-whisper) model once per container start
        self.model = whisperx.load_model("large-v2", device, compute_type=compute_type, download_root=CACHE_DIR)

    @modal.method()
    def transcribe(self, audio_url: str):
        import tempfile

        import requests
        import whisperx

        batch_size = 16  # reduce if low on GPU mem

        # Download the audio to a unique temp file so concurrent requests
        # in the same container don't overwrite each other's input
        response = requests.get(audio_url)
        response.raise_for_status()
        with tempfile.NamedTemporaryFile(suffix=".wav") as audio_file:
            audio_file.write(response.content)
            audio_file.flush()
            audio = whisperx.load_audio(audio_file.name)

        result = self.model.transcribe(audio, batch_size=batch_size)
        return result["segments"]


# ## Run the model
@app.local_entrypoint()
def main():
    url = "https://pub-ebe9e51393584bf5b5bea84a67b343c2.r2.dev/examples_english_english.wav"

    print(Model().transcribe.remote(url))
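
To batch process many files, you could swap in an entrypoint that fans the work out across containers with Modal's .map (a sketch; the URLs are placeholders):

@app.local_entrypoint()
def batch():
    urls = [
        "https://example.com/episode-1.wav",  # placeholder URLs
        "https://example.com/episode-2.wav",
    ]
    # Each input can be picked up by a separate container, so files are
    # transcribed concurrently rather than one after another.
    for segments in Model().transcribe.map(urls):
        print(segments)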
