ChatTTS is a popular open-source text-to-speech library that produces high-quality, natural-sounding speech, which makes it useful for developers who want to add AI voice capabilities to their applications. In this guide, we’ll walk through running ChatTTS on Modal, a serverless cloud computing platform.
Prerequisites
Before we begin, complete the following steps:
- Create an account at modal.com
- Install the Modal Python package by running:
pip install modal
- Authenticate your Modal account by running:
modal setup
If the modal command is not found, run it as a Python module instead:
python -m modal setup
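Before moving on, it can help to confirm that the package is visible to the Python interpreter you plan to use. The snippet below is a minimal sketch that uses only the standard library; the helper name package_status is made up for this example and is not part of Modal.

```python
import importlib.util
from importlib import metadata


def package_status(name: str) -> str:
    """Return a short message saying whether `name` can be imported."""
    # find_spec returns None when no top-level module with this name exists.
    if importlib.util.find_spec(name) is None:
        return f"{name}: not installed"
    try:
        return f"{name}: installed (version {metadata.version(name)})"
    except metadata.PackageNotFoundError:
        # Importable, but not installed as a distribution (e.g. stdlib modules).
        return f"{name}: importable, but no distribution metadata found"


print(package_status("modal"))
```

If this reports that modal is not installed, the pip install step most likely ran against a different Python environment.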
Setting Up the Environment
We’ll be using a single Python file to run ChatTTS. Let’s call it chattts_modal.py. This file will contain all the necessary code to set up and run the text-to-speech service.
First, let’s import the required libraries and set up the Modal app:
import io
import modal
app = modal.App(name="tts")
Configuring the Image
Next, we’ll configure an image with all the necessary dependencies:
tts_image = (
    modal.Image.debian_slim()
    .apt_install("git")
    .workdir("/app")
    .pip_install("git+https://github.com/2noise/ChatTTS.git@51ec0c784c2795b257d7a6b64274e7a36186b731")
    .pip_install("soundfile")
    .env({"TTS_HOME": "/tts"})
)

with tts_image.imports():
    import torch
    import torchaudio
    import ChatTTS
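This image is based on Debian Slim and includes Git, the ChatTTS library (pinned to a specific commit so builds are reproducible), and the SoundFile library. The with tts_image.imports() block makes those imports run only inside containers that use this image, so the script still loads on a local machine that does not have torch installed.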
Creating the TTS Class
Now, let’s create a TTS class that will handle the text-to-speech conversion:
@app.cls(
    image=tts_image,
    # Mount a persistent volume at /tts (the path TTS_HOME points to).
    volumes={"/tts": modal.Volume.from_name("tts-cache", create_if_missing=True)},
    gpu="A10G",
    container_idle_timeout=300,  # seconds an idle container stays alive
    timeout=180,  # per-call time limit in seconds
)
class TTS:
    def __init__(self, voice="male"):
        # Each seed yields a different randomly sampled speaker.
        voice_seeds = {
            "female": 28,
            "male": 34,
            "male_alt_1": 43,
        }
        print(f"Using voice {voice} with seed {voice_seeds[voice]}")
        self.voice_seed = voice_seeds[voice]

    @modal.enter()
    def load_model(self):
        # Runs when a container starts, so the model is loaded once per container.
        import ChatTTS
        self.chat = ChatTTS.Chat()
        self.chat.load(compile=False)
        # Seed before sampling so the same speaker is drawn every time.
        torch.manual_seed(self.voice_seed)
        self.rand_spk = self.chat.sample_random_speaker()

    @modal.method()
    def speak(self, text, temperature=0.18, top_p=0.9, top_k=20):
        text = text.strip()
        if not text:
            return None  # nothing to synthesize
        params_infer_code = ChatTTS.Chat.InferCodeParams(
            spk_emb=self.rand_spk,
            temperature=temperature,
            top_P=top_p,
            top_K=top_k,
        )
        # Not applied while skip_refine_text=True below; kept for when refinement is enabled.
        params_refine_text = ChatTTS.Chat.RefineTextParams(
            prompt='[oral_8][laugh_2][break_2]',
        )
        wavs = self.chat.infer(
            text,
            skip_refine_text=True,
            params_infer_code=params_infer_code,
            params_refine_text=params_refine_text,
        )
        # Write the first result to an in-memory WAV file at 24 kHz.
        wav_file = io.BytesIO()
        torchaudio.save(wav_file, torch.from_numpy(wavs[0]).unsqueeze(0), 24000, format="wav", backend="soundfile")
        return wav_file
This class picks a voice by seed, loads the ChatTTS model once when the container starts, and provides a speak method that converts text to speech and returns the audio as an in-memory WAV file.
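The voice seeds work because seeding the random number generator right before sampling makes the draw repeatable. The sketch below illustrates that idea with the standard library only. It is not how ChatTTS samples speakers internally: sample_speaker and its dim parameter are hypothetical stand-ins for torch.manual_seed followed by sample_random_speaker.

```python
import random


def sample_speaker(seed: int, dim: int = 8) -> list[float]:
    """Hypothetical stand-in for drawing a random speaker embedding."""
    # A generator seeded with the same value always produces the same sequence.
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


male_first = sample_speaker(34)
male_again = sample_speaker(34)
female = sample_speaker(28)

print(male_first == male_again)  # same seed, same "voice"
print(male_first == female)      # different seed, different "voice"
```

This is why a fixed seed per voice name behaves like a stable voice across container restarts, as long as the model weights and sampling code stay the same.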
Running the Text-to-Speech Conversion
Finally, let’s add a local entrypoint to run the text-to-speech conversion:
@app.local_entrypoint()
def tts_entrypoint(text: str):
    tts = TTS()
    wav = tts.speak.remote(text)
    with open("output.wav", "wb") as f:
        f.write(wav.getvalue())
This entrypoint creates a TTS instance, calls the speak method remotely, and saves the resulting audio as a WAV file.
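One edge case is worth noting: speak returns nothing when the text is empty or only whitespace, so calling getvalue() on the result would fail. The helper below is one possible way to guard against that. The name save_audio is an assumption made for this example and does not appear in the original code.

```python
import io
from pathlib import Path
from typing import Optional


def save_audio(wav: Optional[io.BytesIO], path: str = "output.wav") -> bool:
    """Write the WAV bytes to `path`; return False if no audio was produced."""
    if wav is None:
        # speak() returns None for empty or whitespace-only input.
        print("No audio returned; was the input text empty?")
        return False
    Path(path).write_bytes(wav.getvalue())
    return True
```

In the entrypoint, the open/write pair could then be replaced with a single call such as save_audio(tts.speak.remote(text)).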
Running the Script
To run the script, save all the code above in a file named chattts_modal.py. Then run it with Modal:
modal run chattts_modal.py --text "Hello, this is a test of ChatTTS running on Modal."
This command will generate an output.wav file in your current directory with the synthesized speech.
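To sanity-check the result without an audio player, you can inspect the file header. The sketch below parses the RIFF/WAVE header by hand with the struct module, so it does not depend on how the samples are encoded. The function name read_wav_format is made up for this example.

```python
import struct


def read_wav_format(data: bytes) -> dict:
    """Return the format tag, channel count, and sample rate of a WAV file."""
    if len(data) < 12 or data[0:4] != b"RIFF" or data[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    offset = 12
    # Walk the chunk list until the "fmt " chunk is found.
    while offset + 8 <= len(data):
        chunk_id, chunk_size = struct.unpack_from("<4sI", data, offset)
        if chunk_id == b"fmt ":
            fmt_tag, channels, sample_rate = struct.unpack_from("<HHI", data, offset + 8)
            return {
                "format_tag": fmt_tag,  # 1 = integer PCM, 3 = IEEE float
                "channels": channels,
                "sample_rate": sample_rate,
            }
        # Chunks are word-aligned: an odd-sized chunk is followed by a pad byte.
        offset += 8 + chunk_size + (chunk_size & 1)
    raise ValueError("no fmt chunk found")
```

Calling read_wav_format(open("output.wav", "rb").read()) on the generated file should report one channel and a sample rate of 24000, matching the values passed to torchaudio.save above.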
Conclusion
You’ve now learned how to run ChatTTS using Modal. This setup allows you to leverage the power of serverless computing for your text-to-speech needs, making it easy to scale and integrate into various applications.
For the full code and more details, you can check out the complete gist here.