# How to Run XTTS: A Step-by-Step Guide
XTTS is one of the best open-source text-to-speech models available today. It offers high-quality, multilingual speech synthesis capabilities. In this guide, we’ll walk you through the process of running XTTS using Modal, a serverless cloud computing platform.
## Prerequisites

Before we begin, make sure you have the following:

- An account at modal.com
- The Modal Python package installed:

  ```shell
  pip install modal
  ```

- Your Modal account authenticated:

  ```shell
  modal setup
  ```

  (If this doesn't work, try `python -m modal setup`.)
## Setting Up the XTTS Environment
We’ll be using a single Python file to set up and run XTTS. Let’s break down the code and explain each part:
First, we import the necessary libraries and set up the Modal app:
```python
import io

import modal

app = modal.App(name="xtts")
```
Next, we define the image that will be used to run our XTTS model:
```python
tts_image = (
    modal.Image.debian_slim(python_version="3.11.9")
    .apt_install("git")
    .run_commands(
        "pip install git+https://github.com/coqui-ai/TTS@8c20a599d8d4eac32db2f7b8cd9f9b3d1190b73a"
    )
    .env({"COQUI_TOS_AGREED": "1"})
)
```
This image is based on Debian Slim, installs Git, and sets up the TTS package from a pinned commit of the Coqui repository. Note that we're agreeing to Coqui's terms of service by setting the `COQUI_TOS_AGREED` environment variable.
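The same flag works outside Modal as well: Coqui's model manager checks this variable before downloading the XTTS weights and, when it is set to `"1"`, skips the interactive license prompt. A minimal local sketch:

```python
import os

# Equivalent of the image's .env({"COQUI_TOS_AGREED": "1"}) step:
# Coqui's downloader reads this variable before fetching the XTTS
# weights and skips the interactive license prompt when it is "1".
os.environ["COQUI_TOS_AGREED"] = "1"
agreed = os.environ.get("COQUI_TOS_AGREED") == "1"
```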
## Implementing the XTTS Class

Now, let's create the `XTTS` class that will handle the text-to-speech conversion:
```python
with tts_image.imports():
    import torch
    from TTS.api import TTS


@app.cls(
    image=tts_image,
    gpu="A10G",
    container_idle_timeout=300,
    timeout=180,
)
class XTTS:
    def __init__(self):
        pass

    @modal.build()
    @modal.enter()
    def load_model(self):
        # Load XTTS-v2 once per container, on GPU if available
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(self.device)
        print("Model loaded")
        speakers = self.model.synthesizer.tts_model.speaker_manager.speakers.keys()
        print(f"Supported speakers: {speakers}")

    @modal.method()
    def speak(self, text, speaker="Kazuhiko Atallah", language="en"):
        # Synthesize into an in-memory buffer and return it to the caller
        wav_file = io.BytesIO()
        self.model.tts_to_file(
            text=text,
            file_path=wav_file,
            speaker=speaker,
            language=language,
        )
        return wav_file
```
This class does the following:

- Loads the XTTS-v2 model when the container starts.
- Provides a `speak` method that converts text to speech.
## Running XTTS
Finally, we define an entrypoint to run our XTTS model:
```python
@app.local_entrypoint()
def tts_entrypoint(text: str):
    tts = XTTS()
    wav = tts.speak.remote(text)

    with open("output.wav", "wb") as f:
        f.write(wav.getvalue())
```
This entrypoint function takes a text input, runs the XTTS model, and saves the output as a WAV file.
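Because `speak` runs in a remote container, it hands back an in-memory `io.BytesIO` buffer rather than a file path. A minimal stand-in (no Modal required, `fake_speak` is a hypothetical placeholder for the remote call) shows why `getvalue()` recovers the full contents even though the buffer's position is at the end after writing:

```python
import io

def fake_speak(text: str) -> io.BytesIO:
    # Stand-in for XTTS.speak: writes bytes into an in-memory buffer,
    # leaving the position at the end, just as tts_to_file would.
    buf = io.BytesIO()
    buf.write(text.encode("utf-8"))
    return buf

wav = fake_speak("hello")
data = wav.getvalue()  # returns the entire buffer, independent of the current position
```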
## How to Use the XTTS Script

To use this script:

- Save the entire code into a file, for example, `xtts_modal.py`.
- Run the script using Modal:

  ```shell
  modal run xtts_modal.py --text "Your text to be converted to speech"
  ```
This will generate an `output.wav` file in your current directory containing the synthesized speech.
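You can sanity-check the result with the standard-library `wave` module. The sketch below runs against a synthetic one-second silent clip (since real model output can't be bundled here); XTTS-v2 produces mono audio at a 24 kHz sample rate, which is what the header check would show for `output.wav` as well:

```python
import io
import wave

# Build a tiny in-memory WAV as a stand-in for output.wav:
# mono, 16-bit samples, 24 kHz, one second of silence.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(24000)
    w.writeframes(b"\x00\x00" * 24000)

# Read the header back, as you might with the generated output.wav
buf.seek(0)
with wave.open(buf, "rb") as w:
    channels, rate, frames = w.getnchannels(), w.getframerate(), w.getnframes()

print(channels, rate, frames)  # 1 24000 24000
```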
## Conclusion
By following this guide, you’ve learned how to run XTTS using Modal. This setup allows you to leverage powerful GPU resources in the cloud for high-quality text-to-speech conversion. You can easily modify the script to support different languages or speakers.
For the full code and more details, you can check out the complete gist here.