September 15, 2024 · 5 minute read
ChatTTS: Running an open source text-to-speech model
Yiren Lu (@YirenLu)
Solutions Engineer

ChatTTS is an open-source text-to-speech library that offers high-quality voice synthesis, which makes it useful for developers who want to add AI voice capabilities to their applications. This guide walks through running ChatTTS on Modal, a serverless cloud computing platform.

Prerequisites

Before we begin, complete the following steps:

  1. Create an account at modal.com
  2. Install the Modal Python package by running:
    pip install modal
  3. Authenticate your Modal account by running:
    modal setup
    If this doesn’t work, try:
    python -m modal setup
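
If you want to confirm the install from Python before going further, a minimal sketch like the one below reports which version of the modal package is present. The helper name installed_version is hypothetical; it relies only on the standard library's importlib.metadata and does not check whether you are authenticated.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("modal")
    if version is None:
        print("modal is not installed; run `pip install modal` first")
    else:
        print(f"modal {version} is installed")
```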

Setting Up the Environment

We’ll be using a single Python file to run ChatTTS. Let’s call it chattts_modal.py. This file will contain all the necessary code to set up and run the text-to-speech service.

First, let’s import the required libraries and set up the Modal app:

import io
import modal

app = modal.App(name="tts")

Configuring the Image

Next, we’ll configure an image with all the necessary dependencies:

tts_image = (
    modal.Image.debian_slim()
    .apt_install("git")
    .workdir("/app")
    .pip_install("git+https://github.com/2noise/ChatTTS.git@51ec0c784c2795b257d7a6b64274e7a36186b731")
    .pip_install("soundfile")
)

with tts_image.imports():
    import torch
    import torchaudio
    import ChatTTS

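This image is based on Debian Slim and includes Git (needed to pip-install directly from GitHub), the ChatTTS library pinned to a specific commit, and the SoundFile library. The imports inside the tts_image.imports() block run only in containers that use this image, so torch, torchaudio, and ChatTTS do not need to be installed on your local machine.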
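
The ChatTTS requirement above points at one exact commit, so every image build installs the same code. As a small illustration of that requirement format, the sketch below assembles the string from a repository URL and a commit SHA and rejects anything that is not a full 40-character hash. The helper pinned_git_requirement is a hypothetical name used for illustration; it is not part of Modal or pip.

```python
import re


def pinned_git_requirement(repo_url: str, commit: str) -> str:
    """Build a pip requirement string that installs a repo at one exact commit."""
    # A branch name such as "main" can move over time; a full SHA cannot.
    if not re.fullmatch(r"[0-9a-f]{40}", commit):
        raise ValueError(f"expected a full 40-character commit SHA, got {commit!r}")
    return f"git+{repo_url}@{commit}"


# The same repository and commit used in the image definition above.
CHATTTS_REQUIREMENT = pinned_git_requirement(
    "https://github.com/2noise/ChatTTS.git",
    "51ec0c784c2795b257d7a6b64274e7a36186b731",
)
print(CHATTTS_REQUIREMENT)
```

The resulting string is what gets passed to .pip_install(...) in the image definition.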

Creating the TTS Class

Now, let’s create a TTS class that will handle the text-to-speech conversion:

@app.cls(
    image=tts_image,
    gpu="A10G",
    container_idle_timeout=300,
    timeout=180,
)
class TTS:
    def __init__(self, voice = "male"):
        voice_seeds = {
            "female": 28,
            "male": 34,
            "male_alt_1": 43,
        }
        print(f"Using voice {voice} with seed {voice_seeds[voice]}")
        self.voice_seed = voice_seeds[voice]

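    # build() runs load_model while the image is built; enter() runs it again when each container starts
    @modal.build()
    @modal.enter()
    def load_model(self):
        import ChatTTS

        self.chat = ChatTTS.Chat()
        self.chat.load(compile=False)

        # Seeding before sampling makes the "random" speaker the same on every start
        torch.manual_seed(self.voice_seed)
        self.rand_spk = self.chat.sample_random_speaker()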

    @modal.method()
    def speak(self, text, temperature=0.18, top_p=0.9, top_k=20):
        text = text.strip()
        if not text:
            # Nothing to synthesize; the caller receives None
            return

        # Sampling settings for audio generation, using the speaker chosen in load_model
        params_infer_code = ChatTTS.Chat.InferCodeParams(
            spk_emb=self.rand_spk,
            temperature=temperature,
            top_P=top_p,
            top_K=top_k,
        )

        # Prompt for the text-refinement step, which skip_refine_text=True below turns off
        params_refine_text = ChatTTS.Chat.RefineTextParams(
            prompt='[oral_8][laugh_2][break_2]',
        )

        wavs = self.chat.infer(
            text,
            skip_refine_text=True,
            params_infer_code=params_infer_code,
            params_refine_text=params_refine_text,
        )

        # Encode the first waveform as a single-channel 24 kHz WAV held in memory
        wav_file = io.BytesIO()
        torchaudio.save(
            wav_file,
            torch.from_numpy(wavs[0]).unsqueeze(0),
            24000,
            format="wav",
            backend="soundfile",
        )

        return wav_file

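This class maps the chosen voice name to a fixed seed, loads the ChatTTS model into memory when the container starts, samples a speaker using that seed, and provides a speak method that converts text to speech and returns the audio as an in-memory WAV buffer.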
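
The voice selection works because a fixed seed makes a random draw repeatable. The sketch below shows that idea with Python's standard random module as a stand-in: it is an analogy, not ChatTTS code, and sample_speaker and its tiny 4-number vector are hypothetical placeholders for the real speaker embedding that torch.manual_seed and sample_random_speaker produce in the class above.

```python
import random

# Same voice-to-seed table as the TTS class.
VOICE_SEEDS = {"female": 28, "male": 34, "male_alt_1": 43}


def sample_speaker(seed: int, dim: int = 4) -> list:
    """Stand-in for speaker sampling: draw a small random vector from a seeded RNG."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


male_a = sample_speaker(VOICE_SEEDS["male"])
male_b = sample_speaker(VOICE_SEEDS["male"])
female = sample_speaker(VOICE_SEEDS["female"])

# The same seed reproduces the same vector; a different seed gives a different one.
print("male repeatable:", male_a == male_b)
print("male equals female:", male_a == female)
```

To try other voices, you would add more name-to-seed entries to voice_seeds and listen to the results; which seed sounds like which kind of voice is found by experiment.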

Running the Text-to-Speech Conversion

Finally, let’s add a local entrypoint to run the text-to-speech conversion:

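@app.local_entrypoint()
def tts_entrypoint(text: str):
    tts = TTS()
    wav = tts.speak.remote(text)
    # speak() returns None when the text is empty or only whitespace
    if wav is None:
        print("No text to synthesize; nothing written.")
        return
    with open("output.wav", "wb") as f:
        f.write(wav.getvalue())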

This entrypoint creates a TTS instance, calls the speak method remotely, and saves the resulting audio as a WAV file.

Running the Script

To run the script, save all the code above in a file named chattts_modal.py. Then, you can run it using Modal:

modal run chattts_modal.py --text "Hello, this is a test of ChatTTS running on Modal."

This command will generate an output.wav file in your current directory with the synthesized speech.
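
To sanity-check the result without playing it, you can read the file's header. The sketch below is one way to do that under the assumption that output.wav uses the standard RIFF/WAVE chunk layout; describe_wav is a hypothetical helper written for this guide. It parses the header by hand rather than using the standard wave module, because that module handles integer PCM data and the file written here may be stored as floating-point samples.

```python
import struct


def describe_wav(data: bytes) -> dict:
    """Read format fields from the RIFF/WAVE header of an in-memory WAV file."""
    if data[:4] != b"RIFF" or data[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")

    info = {}
    offset = 12  # the first chunk starts right after the 12-byte RIFF header
    while offset + 8 <= len(data):
        chunk_id, chunk_size = struct.unpack_from("<4sI", data, offset)
        body = offset + 8
        if chunk_id == b"fmt ":
            fmt_tag, channels, sample_rate, byte_rate, _align, bits = (
                struct.unpack_from("<HHIIHH", data, body)
            )
            info.update(
                format_tag=fmt_tag,  # 1 = integer PCM, 3 = IEEE float
                channels=channels,
                sample_rate=sample_rate,
                byte_rate=byte_rate,
                bits_per_sample=bits,
            )
        elif chunk_id == b"data":
            info["data_bytes"] = chunk_size
        # Chunks are padded to an even number of bytes.
        offset = body + chunk_size + (chunk_size & 1)

    if info.get("byte_rate") and "data_bytes" in info:
        info["duration_seconds"] = info["data_bytes"] / info["byte_rate"]
    return info
```

Reading output.wav in binary mode and passing its bytes to describe_wav should report a sample rate of 24000 and a single channel, matching the values passed to torchaudio.save in the speak method.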

Conclusion

You’ve now learned how to run ChatTTS using Modal. Because the model runs in a serverless GPU container, the same class can scale with demand and be called from other applications that need text-to-speech.

For the full code and more details, you can check out the complete gist here.
