The text-to-speech (TTS) landscape has evolved significantly in recent years, with open-source solutions now rivaling proprietary systems in quality and versatility. As we move into 2025, developers and businesses alike are looking for powerful, flexible, and cost-effective TTS options. This article explores the top open-source TTS libraries available this year, highlighting their unique features and potential applications.
Takeaways
- If you need real-time: Ultravox
- If you only need English: StyleTTS
- If you need it to run on-device: VITS
Ultravox
Ultravox is a new, fast open-source TTS. It converts audio directly into input for Meta’s Llama 3 model, without a separate speech-to-text step up front.
Pros:
- Speed: Ultravox is fast enough for real-time AI conversations. The current version of Ultravox (v0.3), when invoked with audio content, has a time-to-first-token (TTFT) of approximately 150ms.
Cons:
- No voice cloning: Currently does not support voice cloning.
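The time-to-first-token figure quoted above can be checked against any streaming client. The following is a minimal sketch, assuming the model's response is exposed as a Python iterator of chunks; `fake_stream` is a hypothetical stand-in and not part of Ultravox.

```python
import time
from typing import Iterable, Iterator, Tuple


def time_to_first_token(stream: Iterable[str]) -> Tuple[float, str]:
    """Return (seconds until the first chunk arrived, that first chunk)."""
    start = time.perf_counter()
    iterator = iter(stream)
    first = next(iterator)  # blocks until the first chunk is produced
    return time.perf_counter() - start, first


def fake_stream(delay_s: float = 0.05) -> Iterator[str]:
    """Hypothetical stand-in for a streaming model response."""
    time.sleep(delay_s)  # simulated latency before the first token
    yield "Hello"
    yield " world"


ttft_s, first_chunk = time_to_first_token(fake_stream())
print(f"TTFT: {ttft_s * 1000:.0f} ms, first chunk: {first_chunk!r}")
```

To measure a real deployment, `fake_stream()` would be replaced with whatever iterator your client library returns for a streaming request.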
TortoiseTTS
TortoiseTTS, created by a single developer who now works at OpenAI, has quickly become a frontrunner in the open-source TTS space.
Pros:
- Great audio quality: TortoiseTTS produces some of the most natural-sounding speech among open-source options.
- Multiple voices: The library excels at generating multiple distinct voices.
- C++ version available: For those prioritizing speed, a C++ implementation (tortoise.cpp) provides faster processing times.
Cons:
- Text adherence: It occasionally deviates from the exact input text, which may be problematic for certain use cases.
- Slow: The original Python version is slow, making it less suitable for real-time applications.
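One way to catch the text-adherence problem is to transcribe the generated audio with a speech-to-text model and compare the transcript with the input text. The sketch below covers only the comparison step, using word error rate (WER). The transcript is a hard-coded hypothetical, and the transcription itself is assumed to happen elsewhere.

```python
import re
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase, drop punctuation, and split into words (ASCII English only)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference text must contain at least one word")
    # prev[j]: edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1] / len(ref)


input_text = "The quick brown fox jumps over the lazy dog."
transcript = "the quick brown fox jumped over a lazy dog"  # hypothetical ASR output
print(f"WER: {word_error_rate(input_text, transcript):.2f}")
```

A WER of 0 means the transcript matches the input word for word. What threshold counts as an unacceptable deviation depends on the use case and on the accuracy of the speech-to-text model itself.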
xtts-v2
xtts-v2, part of the Coqui-AI TTS project, is built on top of TortoiseTTS (see here for a deeper read on the architecture) but offers some additional functionality:
Pros:
- Language support: XTTS-v2 supports 17 languages, making it an excellent choice for multilingual projects.
- Voice cloning: One notable feature is that XTTS can clone voices with just a 3-second audio clip, and it can do so across different languages.
- Expressive: The library allows for expressive speech synthesis, including emotion and style cloning.
- Fast: Can get close to real-time with an Nvidia GPU.
Cons:
- Maintenance: The company behind the library, Coqui-AI, shut down in January 2024 and the founder is now working elsewhere, which means the library is no longer being actively maintained.
- License: The current license does not permit commercial use.
To run xtts-v2 on Modal, you can follow the snippet here.
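For local experimentation outside Modal, a minimal sketch using the Coqui TTS Python package (installed with `pip install TTS`) might look like the following. The file names are placeholders, and the 3-second minimum comes from the claim above rather than from anything the library enforces. The first run downloads the model and may ask you to accept its license terms.

```python
import wave

MIN_REFERENCE_SECONDS = 3.0  # assumption: based on the 3-second figure quoted above


def wav_duration_seconds(path: str) -> float:
    """Duration of a PCM WAV file, computed from its header."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def clone_and_speak(text: str, speaker_wav: str, out_path: str,
                    language: str = "en") -> str:
    """Synthesize `text` in the voice of `speaker_wav` with XTTS-v2."""
    if wav_duration_seconds(speaker_wav) < MIN_REFERENCE_SECONDS:
        raise ValueError("reference clip is shorter than 3 seconds")

    # Imported lazily so the duration check works without the heavy dependencies.
    import torch
    from TTS.api import TTS

    device = "cuda" if torch.cuda.is_available() else "cpu"
    tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)
    tts.tts_to_file(text=text, speaker_wav=speaker_wav,
                    language=language, file_path=out_path)
    return out_path
```

A call such as `clone_and_speak("Hello from XTTS.", "reference.wav", "output.wav")` would then write the cloned speech to `output.wav`. Passing a different `language` code with the same reference clip exercises the cross-language cloning described above.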
StyleTTS
StyleTTS is an open-source text-to-speech library that produces exceptionally natural-sounding English speech.
Pros:
- High-quality audio: StyleTTS is known for its realistic and natural-sounding speech output.
- Fast: StyleTTS is fast enough to be suitable for real-time applications.
- Permissive license: StyleTTS is licensed under the MIT license, allowing for commercial use.
Cons:
- Limited language support: StyleTTS is primarily designed for English speech synthesis and may not be suitable for multilingual projects.
MeloTTS
MeloTTS is a high-quality multi-lingual text-to-speech (TTS) library created by MyShell.ai.
Pros:
- Speed: MeloTTS is fast enough for real-time CPU inference.
- Language support: Supports a diverse range of languages, including American, British, Indian, and Australian English, as well as Spanish, French, Chinese, Japanese, and Korean.
- Mixed Chinese-English support: Supports mixed Chinese and English speech. This feature can be useful for applications where a combination of the two languages is required, such as in international business settings or multilingual media production.
- MIT license: MeloTTS is licensed under the MIT license, allowing for commercial use.
Cons:
- No voice cloning: MeloTTS does not support voice cloning.
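A claim like "fast enough for real-time CPU inference" can be checked on your own hardware by measuring the real-time factor (RTF): synthesis time divided by the duration of the audio produced. A value below 1 means speech is generated faster than it plays back. The sketch below assumes a `synthesize` callable that returns raw samples; `fake_synthesize` is a hypothetical stand-in for a real model call, not the MeloTTS API.

```python
import time
from typing import Callable, List


def real_time_factor(synthesize: Callable[[str], List[float]],
                     text: str, sample_rate: int) -> float:
    """Wall-clock synthesis time divided by the duration of the audio produced."""
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start
    audio_seconds = len(samples) / sample_rate
    if audio_seconds == 0:
        raise ValueError("synthesize() returned no audio")
    return elapsed / audio_seconds


def fake_synthesize(text: str) -> List[float]:
    """Hypothetical stand-in: one second of silence at 22,050 Hz after 0.1 s of work."""
    time.sleep(0.1)
    return [0.0] * 22050


rtf = real_time_factor(fake_synthesize, "Hello there.", sample_rate=22050)
print(f"RTF: {rtf:.2f} ({'faster' if rtf < 1 else 'slower'} than real time)")
```

Measuring over several sentences of varying length gives a more representative figure than a single call, since the first invocation of a model often includes one-time warm-up cost.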
OpenVoice v2
Built on top of MeloTTS by the same team, OpenVoice v2 offers an additional feature: voice cloning.
Pros:
- Instant voice cloning: Quickly adapt to new voices without extensive training. OpenVoice v2 combines the speed of MeloTTS with advanced voice cloning capabilities.
Cons:
- Fewer languages, less natural: Compared to MeloTTS, OpenVoice supports fewer languages and sounds less natural.
VITS
VITS is probably the best choice if you are looking for something that will run on-device, for use cases like article reading or language practice.
Pros:
- Lightweight: With only 40M parameters and a 150 MB size, VITS is lightweight and can run on CPU.
Cons:
- Audio quality: While VITS is efficient, its audio quality may not match that of larger models.
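The two size figures above are consistent with each other: 40M parameters stored as 32-bit floats take about 160 MB (roughly 153 MiB). The arithmetic is sketched below, assuming 4 bytes per parameter and ignoring any non-weight data in the checkpoint. The 16-bit and 8-bit rows are illustrative only and are not figures from any VITS release.

```python
def model_size_mb(num_params: int, bytes_per_param: int) -> float:
    """Approximate weight size in megabytes (1 MB = 10**6 bytes)."""
    return num_params * bytes_per_param / 1e6


VITS_PARAMS = 40_000_000  # parameter count quoted above

for label, width in [("float32", 4), ("float16", 2), ("int8", 1)]:
    print(f"{label}: {model_size_mb(VITS_PARAMS, width):.0f} MB")
```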
Running open-source TTS libraries on Modal
For an example of how to use Modal in conjunction with an open-source TTS library, check out this snippet.