August 16, 2024 · 5 minute read
Top open-source text-to-speech libraries in 2024
Author: Yiren Lu (@YirenLu), Solutions Engineer

The text-to-speech (TTS) landscape has evolved significantly in recent years, and open-source solutions now rival proprietary systems in quality and versatility. As we move into 2025, developers and businesses alike are looking for powerful, flexible, and cost-effective TTS options. This article covers the top open-source TTS libraries available this year, highlighting their distinguishing features and potential applications.

Takeaways

  • If you need real-time: Ultravox
  • If you only need English: StyleTTS
  • If you need it to run on-device: VITS

Ultravox

Ultravox is a new, fast open-source TTS. It converts audio input directly into input for Meta’s Llama 3 model, without a separate speech-to-text transcription step upfront.

Pros:

  • Speed: Ultravox is fast enough for real-time AI conversations. The current version of Ultravox (v0.3), when invoked with audio content, has a time-to-first-token (TTFT) of approximately 150 ms.

Cons:

  • No voice cloning: Currently does not support voice cloning.
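
Time-to-first-token is the delay between sending a request and receiving the first streamed token. A minimal sketch of how one might measure it is below. Note that `fake_token_stream` is a hypothetical stand-in for a streaming model client, not Ultravox's API, and its 150 ms startup delay is simulated with `time.sleep`.

```python
import time
from typing import Iterable, Iterator, Tuple


def fake_token_stream(first_token_delay_s: float = 0.15) -> Iterator[str]:
    """Hypothetical stand-in for a streaming model response."""
    time.sleep(first_token_delay_s)  # simulated latency before the first token
    yield "Hello"
    for token in (",", " world"):
        time.sleep(0.01)  # simulated gap between later tokens
        yield token


def measure_ttft(stream: Iterable[str]) -> Tuple[float, str]:
    """Return (seconds until the first token arrived, full concatenated text)."""
    start = time.perf_counter()
    ttft = None
    tokens = []
    for token in stream:
        if ttft is None:
            # Record the elapsed time only once, when the first token shows up.
            ttft = time.perf_counter() - start
        tokens.append(token)
    if ttft is None:
        raise ValueError("stream produced no tokens")
    return ttft, "".join(tokens)


if __name__ == "__main__":
    ttft, text = measure_ttft(fake_token_stream())
    print(f"TTFT: {ttft * 1000:.0f} ms, text: {text!r}")
```

To benchmark a real model, one would pass its streaming response iterator to `measure_ttft` in place of the fake stream.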

TortoiseTTS

TortoiseTTS, created by a single developer who now works at OpenAI, has quickly become a frontrunner in the open-source TTS space.

Pros:

  • Great audio quality: TortoiseTTS produces some of the most natural-sounding speech among open-source options.
  • Multiple voices: The library excels at generating multiple distinct voices.
  • C++ version available: For those prioritizing speed, a C++ implementation (tortoise.cpp) provides faster processing times.

Cons:

  • Text adherence: It occasionally deviates from the exact input text, which may be problematic for certain use cases.
  • Slow: The original Python version is slow, making it less suitable for real-time applications.

xtts-v2

xtts-v2, part of the Coqui-AI TTS project, is built on top of TortoiseTTS (see here for a deeper read on the architecture) but offers some additional functionality:

Pros:

  • Language support: XTTS supports 13 languages, making it an excellent choice for multilingual projects.
  • Voice cloning: One notable feature is that XTTS can clone voices with just a 3-second audio clip, and it can do so across different languages.
  • Expressive: The library allows for expressive speech synthesis, including emotion and style cloning.
  • Fast: Can get close to real-time with an Nvidia GPU.

Cons:

  • Maintenance: The company behind the library, Coqui-AI, shut down in January 2024 and the founder is now working elsewhere, which means the library is no longer being actively maintained.
  • License: The current license does not permit commercial use.

To run xtts-v2 on Modal, you can follow the snippet here.
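
Independently of Modal, a minimal local sketch of voice cloning with the Coqui `TTS` Python package might look like the following. It assumes the `TTS` and `torch` packages are installed; the file names are hypothetical, and the 3-second check simply mirrors the clip length mentioned above rather than any limit enforced by the library.

```python
import wave

MIN_REFERENCE_SECONDS = 3.0  # clip length mentioned above; treated here as a rough floor


def wav_duration_seconds(path: str) -> float:
    """Duration of a PCM WAV file, computed from its header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def clone_voice(text: str, reference_wav: str, output_wav: str, language: str = "en") -> None:
    """Synthesize `text` in the voice of `reference_wav` using XTTS-v2."""
    if wav_duration_seconds(reference_wav) < MIN_REFERENCE_SECONDS:
        raise ValueError("reference clip is shorter than 3 seconds")

    # Imported lazily so the helper above works without the heavy dependencies.
    import torch
    from TTS.api import TTS

    device = "cuda" if torch.cuda.is_available() else "cpu"
    tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)
    tts.tts_to_file(
        text=text,
        speaker_wav=reference_wav,
        language=language,
        file_path=output_wav,
    )


if __name__ == "__main__":
    # Hypothetical file names; the first run downloads the model weights.
    clone_voice("Hello from a cloned voice.", "reference.wav", "output.wav")
```

Passing a different `language` code while keeping the same reference clip is how the cross-language cloning described above would be exercised.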

StyleTTS

StyleTTS is an open-source text-to-speech library that produces exceptionally natural-sounding English speech.

Pros:

  • High-quality audio: StyleTTS is known for its realistic and natural-sounding speech output.
  • Fast: StyleTTS is fast enough to be suitable for real-time applications.
  • Permissive license: StyleTTS is licensed under the MIT license, allowing for commercial use.

Cons:

  • Limited language support: StyleTTS is primarily designed for English speech synthesis and may not be suitable for multilingual projects.

MeloTTS

MeloTTS is a high-quality multi-lingual text-to-speech (TTS) library created by MyShell.ai.

Pros:

  • Speed: MeloTTS is fast enough for real-time CPU inference.
  • Language support: Supports a diverse range of languages, including American, British, Indian, and Australian English, as well as Spanish, French, Chinese, Japanese, and Korean.
  • Mixed Chinese-English support: Supports mixed Chinese and English speech. This feature can be useful for applications where a combination of the two languages is required, such as in international business settings or multilingual media production.
  • MIT license: MeloTTS is licensed under the MIT license, allowing for commercial use.

Cons:

  • No voice cloning: MeloTTS does not support voice cloning.
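
"Real-time" claims like the one above are commonly quantified with the real-time factor (RTF): synthesis time divided by the duration of the generated audio, where a value below 1 means speech is produced faster than it plays back. The sketch below shows the calculation. `fake_synthesize` is a hypothetical placeholder with made-up timings, not MeloTTS's API.

```python
import time
from typing import Callable


def real_time_factor(synthesize: Callable[[str], float], text: str) -> float:
    """RTF = wall-clock synthesis time / duration of the generated audio.

    `synthesize` is any callable that generates speech for `text` and
    returns the audio duration in seconds.
    """
    start = time.perf_counter()
    audio_seconds = synthesize(text)
    elapsed = time.perf_counter() - start
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return elapsed / audio_seconds


def fake_synthesize(text: str) -> float:
    """Hypothetical engine: takes about 50 ms to 'produce' 2 seconds of audio."""
    time.sleep(0.05)
    return 2.0


if __name__ == "__main__":
    rtf = real_time_factor(fake_synthesize, "Hello, world")
    verdict = "faster" if rtf < 1 else "slower"
    print(f"RTF: {rtf:.3f} ({verdict} than real time)")
```

Wrapping a real library call so that it returns the output duration lets the same function compare engines on the same hardware.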

OpenVoice v2

Built on top of MeloTTS by the same team, OpenVoice v2 offers an additional feature: voice cloning.

Pros:

  • Instant voice cloning: Quickly adapt to new voices without extensive training. OpenVoice v2 combines the speed of MeloTTS with advanced voice cloning capabilities.

Cons:

  • Compared to MeloTTS, OpenVoice supports fewer languages and sounds less natural.

VITS

VITS is probably the best choice if you are looking for something that will run on-device, for use cases like article reading or language practice.

Pros:

  • Lightweight: With only 40M parameters and a 150 MB size, VITS is lightweight and can run on CPU.

Cons:

  • Audio quality: While VITS is efficient, its audio quality may not match that of larger models.
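
The parameter count and file size quoted above are roughly consistent if the weights are stored as 32-bit floats. That precision is an assumption for illustration, since the figures above do not state it. A quick back-of-the-envelope check:

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def model_size_mib(num_params: int, dtype: str = "float32") -> float:
    """Approximate size of the raw weights in MiB (ignores file-format overhead)."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**20


if __name__ == "__main__":
    # 40M parameters, the figure quoted for VITS above.
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: {model_size_mib(40_000_000, dtype):.1f} MiB")
```

At 4 bytes per parameter, 40M parameters come to about 153 MiB, in line with the 150 MB figure. Lower-precision formats would shrink that further, which matters for on-device use.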

Running open-source TTS libraries on Modal

For an example of how to use Modal in conjunction with an open-source TTS library, check out this snippet.
