March 10, 2025 · 5 minute read
Top open-source text-to-speech libraries in 2025
Author: Yiren Lu (@YirenLu)
Solutions Engineer

The text-to-speech (TTS) landscape is changing rapidly, with new state-of-the-art models launching every month, many of them open-source.

As we move into 2025, developers and businesses alike are seeking powerful, flexible, and cost-effective TTS options. This article explores the top open-source TTS libraries available this year, highlighting their unique features and potential applications.

Takeaways

  • If you need real-time: Kokoro
  • If you want to customize the voice: Spark-TTS
  • If you need it to run on-device: VITS

Spark-TTS

Spark-TTS is a new 500-million-parameter TTS model. It is built on Qwen2.5 and reconstructs audio directly from the codes predicted by the LLM.

It supports zero-shot voice cloning and bilingual speech synthesis, and it lets you adjust gender, pitch, and speaking rate from text prompts.

Pros:

  • Customizable Voice Creation: You can generate highly customizable voices, surpassing the limitations of simply cloning existing ones. Imagine creating entirely new voices with specific characteristics.

  • Comprehensive Voice Control: You have control over various voice attributes, both in a general (coarse-grained) and detailed (fine-grained) manner. This includes things like:

      • Gender: Make the voice sound male or female.

      • Speaking Style: Adjust the overall style of speaking.

      • Pitch: Precisely control the pitch of the voice.

      • Speaking Rate: Adjust how fast or slow the voice speaks.

Cons:

  • Language Support: Currently only supports Chinese and English.

Kokoro

Kokoro is a new, open-source, super-small TTS model. At 82M parameters, it’s roughly 6x smaller than Spark-TTS (500M parameters). This makes it much faster and cheaper to run, while still delivering high quality.

Pros:

  • Cheap to deploy: Kokoro’s small size makes it much cheaper to deploy than larger models. It can run on both CPU and GPU.
  • Fast: Kokoro is fast enough for real-time applications, even on CPU.
  • Permissive license: Kokoro is licensed under the Apache 2.0 license, which allows for commercial use.

Cons:

  • Language Support: Currently only supports English.
  • Tone: Kokoro is less expressive and less natural-sounding than other models. Some users describe it as Siri-like.
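
A common way to put a number on "fast enough for real-time" is the real-time factor (RTF): synthesis time divided by the duration of the audio produced. An RTF below 1.0 means audio is generated faster than it plays back. The sketch below is not tied to Kokoro or any other library. `synthesize` is a hypothetical callable that returns the number of samples it generated, `fake_synthesize` is a stand-in for a real model, and the 24 kHz sample rate is an illustrative value.

```python
import time
from typing import Callable


def real_time_factor(synthesize: Callable[[str], int], text: str, sample_rate: int) -> float:
    """Return synthesis time divided by the duration of the generated audio.

    `synthesize` is a hypothetical callable that takes text and returns the
    number of audio samples it produced. Values below 1.0 mean the model
    generates audio faster than it takes to play it back.
    """
    start = time.perf_counter()
    num_samples = synthesize(text)
    elapsed = time.perf_counter() - start

    audio_seconds = num_samples / sample_rate
    if audio_seconds <= 0:
        raise ValueError("synthesize() returned no audio")
    return elapsed / audio_seconds


def fake_synthesize(text: str) -> int:
    """Stand-in for a real model: pretends each character yields 0.05 s of audio."""
    time.sleep(0.01)  # simulated compute time
    return int(len(text) * 0.05 * 24_000)


rtf = real_time_factor(fake_synthesize, "Hello from a text-to-speech model.", sample_rate=24_000)
print(f"real-time factor: {rtf:.3f}")
```

To benchmark a real model, replace `fake_synthesize` with a wrapper around its inference call, and average over several runs, since the first call often includes model loading.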

Fish Speech v1.5

Fish Speech v1.5 is a new open-source TTS model from the team at Fish. It features zero-shot and few-shot TTS capabilities, allowing users to input a 10 to 30-second vocal sample to generate high-quality TTS output.
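
Since the reference sample is expected to fall within a 10 to 30-second window, it can help to validate clip length before sending it to the model. The sketch below uses only Python's standard `wave` module and assumes an uncompressed PCM WAV file. The helper names are illustrative and are not part of Fish Speech.

```python
import os
import tempfile
import wave


def clip_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_usable_reference(path: str, min_s: float = 10.0, max_s: float = 30.0) -> bool:
    """Check that a clip falls inside the 10 to 30 second window quoted above."""
    return min_s <= clip_duration_seconds(path) <= max_s


def write_silence(path: str, seconds: float, sample_rate: int = 16_000) -> None:
    """Write a mono 16-bit silent WAV file, used here only as demo input."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(seconds * sample_rate))


with tempfile.TemporaryDirectory() as tmp:
    short_clip = os.path.join(tmp, "short.wav")
    good_clip = os.path.join(tmp, "good.wav")
    write_silence(short_clip, 3)
    write_silence(good_clip, 12)
    short_ok = is_usable_reference(short_clip)
    good_ok = is_usable_reference(good_clip)

print(short_ok, good_ok)  # False True
```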

Pros:

  • Low CER/WER: Fish Speech v1.5 achieves a low Character Error Rate (CER) and Word Error Rate (WER) of around 2% for 5-minute English texts, ensuring high accuracy.
  • Fast: Fish Speech v1.5 claims a latency of less than 150 ms.
  • WebSocket reuse: Fish Speech can reuse an open WebSocket connection across requests, which avoids repeated connection setup.
  • Tunable: Fish Speech supports volume, speed, and phonetic tuning capabilities.
  • Supports multiple languages: Fish Speech supports languages including English, Japanese, Korean, Chinese, French, German, Arabic, and Spanish.

Cons:

  • License: Fish Speech is licensed under the CC BY-NC-SA 4.0 license, which does not permit commercial use.
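
For readers unfamiliar with the CER and WER figures mentioned above: both are edit-distance metrics. For TTS they are typically computed by transcribing the generated audio with a speech recognizer and comparing the transcript to the input text. The following is a minimal sketch of the metric itself, not the evaluation pipeline used by Fish Speech. It skips text normalization such as lowercasing and punctuation stripping, which real evaluations usually apply.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


reference = "the quick brown fox jumps over the lazy dog"
transcript = "the quick brown fox jumped over the lazy dog"
print(f"WER: {wer(reference, transcript):.3f}")  # 1 wrong word out of 9
print(f"CER: {cer(reference, transcript):.3f}")  # 2 character edits out of 43
```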

xtts-v2

xtts-v2, part of the Coqui-AI TTS project, is built on top of TortoiseTTS (see here for a deeper read on the architecture) but offers some additional functionality:

Pros:

  • Language support: XTTS-v2 supports 17 languages, making it an excellent choice for multilingual projects.
  • Voice cloning: One notable feature is that XTTS-v2 can clone voices from just a 6-second audio clip, and it can do so across different languages.
  • Expressive: The library allows for expressive speech synthesis, including emotion and style cloning.
  • Fast: XTTS can get close to real-time on an NVIDIA GPU.

Cons:

  • Maintenance: The company behind the library, Coqui-AI, shut down in January 2024 and the founder is now working elsewhere, which means the library is no longer being actively maintained.
  • License: The current license does not permit commercial use.

To run xtts-v2 on Modal, you can follow the snippet here.

StyleTTS

StyleTTS is an open-source text-to-speech library that produces exceptionally natural-sounding English speech.

Pros:

  • High-quality audio: StyleTTS is known for its realistic and natural-sounding speech output.
  • Fast: StyleTTS is fast enough to be suitable for real-time applications.
  • Permissive license: StyleTTS is licensed under the MIT license, allowing for commercial use.

Cons:

  • Limited language support: StyleTTS is primarily designed for English speech synthesis and may not be suitable for multilingual projects.
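
Because licensing is often the deciding factor for production use, the license notes from the sections above can be collected into a small lookup table. The sketch below records only what this article states. The `ModelLicense` class and `commercially_usable` helper are illustrative names, and a license is left as `None` where the article does not name it. Always confirm the current terms in each project's repository before shipping.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ModelLicense:
    name: str
    license: Optional[str]  # None where the license is not named in this article
    commercial_use: bool


# Values transcribed from the Pros/Cons lists above.
CATALOG = [
    ModelLicense("Kokoro", "Apache 2.0", True),
    ModelLicense("Fish Speech v1.5", "CC BY-NC-SA 4.0", False),
    ModelLicense("xtts-v2", None, False),
    ModelLicense("StyleTTS", "MIT", True),
]


def commercially_usable(catalog: List[ModelLicense]) -> List[str]:
    """Return the names of models whose license permits commercial use."""
    return [model.name for model in catalog if model.commercial_use]


print(commercially_usable(CATALOG))  # ['Kokoro', 'StyleTTS']
```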

OpenVoice v2

Built on top of MeloTTS by the same team, OpenVoice v2 offers an additional feature: voice cloning.

Pros:

  • Instant voice cloning: Quickly adapt to new voices without extensive training. OpenVoice v2 combines the speed of MeloTTS with advanced voice cloning capabilities.

Cons:

  • Compared to MeloTTS, OpenVoice supports fewer languages and sounds less natural.

VITS

VITS is probably the best choice if you are looking for something that will run on-device, for use cases like article reading or language practice.

Pros:

  • Lightweight: With only 40M parameters and a 150 MB size, VITS is lightweight and can run on CPU.

Cons:

  • Audio quality: While VITS is efficient, its audio quality may not match that of larger models.
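
The link between parameter count and model size is simple arithmetic: the raw weights take roughly the number of parameters times the bytes per parameter. The sketch below applies that to the parameter counts quoted in this article. It is a rough estimate only, assuming uniform precision and ignoring file-format overhead, activations, and any auxiliary components a model ships with.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

# Parameter counts as quoted in this article.
PARAM_COUNTS = {
    "VITS": 40_000_000,
    "Kokoro": 82_000_000,
    "Spark-TTS": 500_000_000,
}


def weight_size_mb(num_params: int, dtype: str = "fp32") -> float:
    """Approximate size of the raw weights in megabytes (1 MB = 10**6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6


for name, params in PARAM_COUNTS.items():
    sizes = ", ".join(
        f"{dtype}: {weight_size_mb(params, dtype):.0f} MB" for dtype in BYTES_PER_PARAM
    )
    print(f"{name:<10} {sizes}")
```

Under the fp32 assumption, 40M parameters comes to about 160 MB, which is in line with the roughly 150 MB quoted for VITS.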

Running open-source TTS libraries on Modal

For an example of how to use Modal in conjunction with an open-source TTS library, check out this snippet.
