How to Run XTTS: A Step-by-Step Guide
XTTS is an open-source text-to-speech model from Coqui that offers high-quality, multilingual speech synthesis. In this guide, we’ll walk you through the process of running XTTS using Modal, a serverless cloud computing platform.
Prerequisites
Before we begin, complete the following steps:
- Create an account at modal.com
- Install the Modal Python package:
pip install modal
- Authenticate your Modal account:
modal setup
If the modal command is not found, run it as a Python module instead:
python -m modal setup
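As a quick sanity check on the steps above, the following minimal sketch reports whether the modal package and the modal command are visible on your machine. The function name check_modal_install is hypothetical and not part of Modal. The check uses only the Python standard library and does not verify that authentication succeeded.

```python
import importlib.util
import shutil


def check_modal_install() -> dict:
    """Report whether the `modal` package and CLI are visible locally.

    Hypothetical helper for illustration; it does not contact Modal
    and cannot tell whether `modal setup` has been completed.
    """
    return {
        # True if Python can locate the `modal` package (without importing it).
        "package_found": importlib.util.find_spec("modal") is not None,
        # True if a `modal` executable is on PATH.
        "cli_found": shutil.which("modal") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_modal_install().items():
        print(f"{name}: {'yes' if ok else 'no'}")
```

If package_found is true but cli_found is false, the python -m modal setup form is the one to try.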
Setting Up the XTTS Environment
We’ll be using a single Python file to set up and run XTTS. Let’s break down the code and explain each part:
First, we import the necessary libraries and set up the Modal app:
import io
import modal
app = modal.App(name="xtts")
Next, we define the image that will be used to run our XTTS model:
tts_image = (
    modal.Image.debian_slim(python_version="3.11.9")
    .apt_install("git")
    .run_commands("pip install git+https://github.com/coqui-ai/TTS@8c20a599d8d4eac32db2f7b8cd9f9b3d1190b73a")
    .env({"COQUI_TOS_AGREED": "1", "TTS_HOME": "/tts"})
)
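This image is based on Debian Slim, installs Git, and installs the TTS package from a pinned commit of the Coqui repository. Note that we’re agreeing to Coqui’s terms of service by setting the COQUI_TOS_AGREED environment variable. The TTS_HOME variable points the library’s model cache at /tts, the path where a persistent volume is mounted in the next step.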
Implementing the XTTS Class
Now, let’s create the XTTS class that will handle the text-to-speech conversion:
with tts_image.imports():
    # These imports only need to resolve inside the container image.
    from TTS.api import TTS
    import torch

@app.cls(
    image=tts_image,
    # Persist downloaded model weights across runs (same path as TTS_HOME).
    volumes={"/tts": modal.Volume.from_name("tts-cache", create_if_missing=True)},
    gpu="A10G",
    # Keep an idle container warm for 5 minutes; recent Modal releases
    # rename this parameter to scaledown_window.
    container_idle_timeout=300,
    timeout=180,
)
class XTTS:
    def __init__(self):
        pass

    @modal.enter()
    def load_model(self):
        # Runs once per container start, so the model is loaded only once.
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(self.device)
        print("Model Loaded")
        speakers = self.model.synthesizer.tts_model.speaker_manager.speakers.keys()
        print(f"Supported speakers: {speakers}")

    @modal.method()
    def speak(self, text, speaker="Kazuhiko Atallah", language="en"):
        # Write the WAV data into an in-memory buffer instead of a file on disk.
        wav_file = io.BytesIO()
        self.model.tts_to_file(
            text=text,
            file_path=wav_file,
            speaker=speaker,
            language=language,
        )
        return wav_file
This class does the following:
- Loads the XTTS-v2 model when the container starts.
- Provides a speak method that converts text to speech.
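The speak method writes audio into an io.BytesIO buffer rather than a file on disk. To illustrate that pattern without a GPU or the TTS package, here is a minimal standard-library sketch that writes a short sine tone into an in-memory WAV buffer. The function tone_to_buffer and its parameters are hypothetical stand-ins for the model call and say nothing about the sample rate or format XTTS itself produces.

```python
import io
import math
import struct
import wave


def tone_to_buffer(freq_hz: float = 440.0, seconds: float = 0.5, rate: int = 16000) -> io.BytesIO:
    """Write a mono 16-bit sine tone into an in-memory WAV buffer.

    Hypothetical stand-in for the model's output, for illustration only.
    """
    n_frames = int(seconds * rate)
    # Pack each sample as a little-endian signed 16-bit integer.
    frames = b"".join(
        struct.pack("<h", int(32767 * 0.3 * math.sin(2 * math.pi * freq_hz * i / rate)))
        for i in range(n_frames)
    )
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)   # mono
        w.setsampwidth(2)   # 2 bytes per sample = 16-bit
        w.setframerate(rate)
        w.writeframes(frames)
    # The wave module leaves a caller-supplied buffer open, so it is still usable.
    return buf


if __name__ == "__main__":
    wav_bytes = tone_to_buffer().getvalue()
    print(f"{len(wav_bytes)} bytes, header starts with {wav_bytes[:4]!r}")
```

Calling getvalue() on the returned buffer yields the raw WAV bytes, which is the same call the local entrypoint in the next section uses before writing the file.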
Running XTTS
Finally, we define an entrypoint to run our XTTS model:
@app.local_entrypoint()
def tts_entrypoint(text: str):
    tts = XTTS()
    # Run speak on Modal and receive the in-memory WAV buffer back locally.
    wav = tts.speak.remote(text)
    with open("output.wav", "wb") as f:
        f.write(wav.getvalue())
This entrypoint function takes a text input, runs the XTTS model, and saves the output as a WAV file.
How to Use the XTTS Script
To use this script:
- Save the entire code into a file, for example, xtts_modal.py.
- Run the script using Modal:
modal run xtts_modal.py --text "Your text to be converted to speech"
This will generate an output.wav file in your current directory containing the synthesized speech.
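To confirm that the generated file is a valid WAV, you can inspect it with Python's built-in wave module. The sketch below is a minimal example that assumes the output is uncompressed PCM, which is the only format the wave module reads. The helpers describe_wav and make_silent_wav are hypothetical names; make_silent_wav exists only so the example runs without a real output.wav.

```python
import io
import wave


def describe_wav(source) -> dict:
    """Return basic properties of a PCM WAV file (path or file-like object)."""
    with wave.open(source, "rb") as w:
        frames = w.getnframes()
        rate = w.getframerate()
        return {
            "channels": w.getnchannels(),
            "sample_width_bytes": w.getsampwidth(),
            "sample_rate_hz": rate,
            "duration_s": frames / rate,
        }


def make_silent_wav(seconds: float = 1.0, rate: int = 16000) -> io.BytesIO:
    """Build a silent mono 16-bit WAV in memory (stand-in for output.wav)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate)
        w.writeframes(b"\x00\x00" * int(seconds * rate))
    buf.seek(0)  # rewind so the buffer can be read back from the start
    return buf


if __name__ == "__main__":
    # Replace make_silent_wav() with "output.wav" to inspect the real file.
    print(describe_wav(make_silent_wav()))
```

A duration of zero, or an error from wave.open, would suggest the remote call did not return usable audio.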
Conclusion
By following this guide, you’ve learned how to run XTTS using Modal. This setup allows you to leverage powerful GPU resources in the cloud for high-quality text-to-speech conversion. You can easily modify the script to support different languages or speakers.
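As a starting point for switching languages, the sketch below validates a language code before it is sent to the remote speak method. The set of codes is an assumption based on the languages listed on the XTTS-v2 model card at the time of writing, so verify it against the model version you deploy. The names XTTS_V2_LANGUAGES and normalize_language are hypothetical and not part of the TTS package.

```python
# Assumption: language codes listed on the XTTS-v2 model card; verify before relying on them.
XTTS_V2_LANGUAGES = frozenset({
    "en", "es", "fr", "de", "it", "pt", "pl", "tr", "ru",
    "nl", "cs", "ar", "zh-cn", "ja", "hu", "ko", "hi",
})


def normalize_language(code: str) -> str:
    """Lowercase and trim a language code, rejecting codes not in the assumed set."""
    cleaned = code.strip().lower()
    if cleaned not in XTTS_V2_LANGUAGES:
        raise ValueError(
            f"Unsupported language {code!r}; expected one of {sorted(XTTS_V2_LANGUAGES)}"
        )
    return cleaned


if __name__ == "__main__":
    print(normalize_language(" FR "))
```

The validated code can then be passed through the existing keyword arguments, for example tts.speak.remote(text, speaker="Kazuhiko Atallah", language=normalize_language("fr")). The speaker names available in your deployment are printed by load_model when the container starts.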
For the full code and more details, you can check out the complete gist here.