March 10, 2025 · 4 minute read

Top open-source text-to-speech models in 2025

Kenny Ning (@kenny_ning)
Growth Engineer

Updated: 2025-05-29

The text-to-speech (TTS) landscape is changing rapidly, with new state-of-the-art models launching every month, many of them open-source.

This article explores the top open-source TTS models, based on Hugging Face’s trending models and insights from our developer community.


| Model | Parameters | Created by | Released | License |
| --- | --- | --- | --- | --- |
| Chatterbox (deploy on Modal) | Not specified | Resemble AI | May 28, 2025 | MIT |
| Dia (deploy on Modal) | 1.6B | Nari Labs | Apr 21, 2025 | Apache 2.0 |
| Kokoro | 82M | Hexgrad | Jan 10, 2025 | Apache 2.0 |
| Sesame CSM | 1B | Sesame | Feb 26, 2025 | Apache 2.0 |
| Orpheus | 3B/1B/400M/150M | Canopy Labs | Mar 7, 2025 | Apache 2.0 |

Chatterbox

Chatterbox is a small, fast, and easy-to-use TTS model built on a 0.5B-parameter Llama backbone. At the time of this writing, it’s the #1 trending TTS model on Hugging Face.

Here’s Chatterbox generated audio for “Have you heard about Modal Labs? They’re transforming cloud computing for AI and machine learning workloads.”

If you’re just getting started with open-source TTS models, we recommend Chatterbox. You can deploy it on Modal today using our example.

Dia

Dia is a 1.6B parameter TTS model that generates highly realistic sounding dialogue. At this time, Dia only supports English.

This generated audio sounds quite human, albeit a little manic (not to mention the creepy laughter it inserts everywhere).

The creators of Dia, Nari Labs, are still in the early stages of development. However, we’ve seen a lot of excitement around Dia within our community, including a user-submitted repo showing how to deploy Dia on Modal.

Kokoro

Kokoro is an 82M parameter TTS model, less than 10% the size of Dia’s 1.6B. This makes it much faster and cheaper to run, though arguably at the cost of some quality.
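A quick back-of-envelope check of that size gap, using the parameter counts from the table above:

```python
kokoro_params = 82e6  # 82M, from the comparison table
dia_params = 1.6e9    # 1.6B

ratio = kokoro_params / dia_params
print(f"Kokoro is {ratio:.1%} the size of Dia")  # Kokoro is 5.1% the size of Dia
```

Since model weights dominate GPU memory at inference time, a ~5% parameter count roughly translates into fitting on much cheaper hardware, which is where most of the cost savings come from.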

This generated audio sounds more artificial and Siri-like than other models, but it’s probably the cleanest output among all our examples, especially its pronunciation of “Modal Labs”.

Sesame CSM

Sesame CSM (Conversational Speech Model) is a 1B parameter TTS model built on Llama. It’s particularly well-suited for conversational use cases where you have two different speakers.

This generated audio sounds the most unnatural of our examples, though there are parameters you can tweak to improve quality, such as adding conversational context.

Orpheus

Orpheus is a Llama-based TTS model that comes with 3B, 1B, 400M, and 150M parameter versions. It was trained on over 100k hours of English speech data.

It’s optimized for natural, human-like speech and also supports zero-shot voice cloning, guided emotion, and real-time streaming. Canopy Labs also offers a family of multilingual models covering Chinese, Hindi, Korean, and Spanish, as well as fine-tuning scripts.

While the quality of the demos is impressive, we had trouble getting the model to run (including with their own examples), so proceed with caution if you try to deploy it yourself.

Conclusion

Text-to-speech is an exploding use case for many AI-first companies, and the quality of open-source models continues to improve every month. Deploying open-source TTS on Modal can give you the best of both worlds: high-quality models at a fraction of the cost of closed-source providers.

To get started, check out our Chatterbox TTS example.
