
Updated: 2025-05-29
The text-to-speech (TTS) landscape is changing rapidly, with new state-of-the-art models launching every month, many of them open-source.
This article explores the top open-source TTS models, based on Hugging Face’s trending models and insights from our developer community.
Model | Parameters | Created by | Released | License |
---|---|---|---|---|
Chatterbox (deploy on Modal) | Not specified | Resemble AI | May 28 2025 | MIT |
Dia (deploy on Modal) | 1.6B | Nari Labs | Apr 21 2025 | Apache 2.0 |
Kokoro | 82M | Hexgrad | Jan 10 2025 | Apache 2.0 |
Sesame CSM | 1B | Sesame | Feb 26 2025 | Apache 2.0 |
Orpheus | 3B/1B/400M/150M | Canopy Labs | Mar 7 2025 | Apache 2.0 |
Chatterbox
Chatterbox is a small, fast, and easy-to-use TTS model built on a 0.5B-parameter Llama backbone. At the time of writing, it’s the #1 trending TTS model on Hugging Face.
Here’s Chatterbox-generated audio for the prompt “Have you heard about Modal Labs? They’re transforming cloud computing for AI and machine learning workloads.”
If you’re just getting started with open-source TTS models, we recommend Chatterbox. You can deploy it on Modal today using our example.
Dia
Dia is a 1.6B-parameter TTS model that generates highly realistic-sounding dialogue. Dia currently supports English only.
This generated audio sounds quite human, albeit a little manic (not to mention the creepy laughter the model inserts everywhere).
Nari Labs, the creators of Dia, are still in the early stages of development. However, we’ve seen lots of excitement about Dia within our community, including a user-submitted repo that shows how to deploy Dia on Modal.
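Because Dia generates dialogue rather than single-voice narration, its input is a transcript with speaker turns marked. As we understand the Dia README, turns are tagged with `[S1]` and `[S2]`. The sketch below assumes that convention and only builds the transcript string; the helper name `to_dia_script` is hypothetical and not part of Dia, and the sketch does not load or call the model.

```python
from typing import Iterable, Tuple


def to_dia_script(turns: Iterable[Tuple[int, str]]) -> str:
    """Join (speaker, text) turns into one transcript with [S1]/[S2] tags.

    Assumption: speaker turns are marked with [S1] and [S2], as in the
    Dia README examples. Only those two speakers are accepted here.
    """
    parts = []
    for speaker, text in turns:
        if speaker not in (1, 2):
            raise ValueError(f"expected speaker 1 or 2, got {speaker!r}")
        parts.append(f"[S{speaker}] {text.strip()}")
    return " ".join(parts)


if __name__ == "__main__":
    script = to_dia_script([
        (1, "Have you heard about Modal Labs?"),
        (2, "They're transforming cloud computing for AI and machine learning workloads."),
    ])
    print(script)
```

The resulting string is what you would pass to the model as its text input; check the Dia repository for the current generation API before wiring it in.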
Kokoro
Kokoro is an 82M-parameter TTS model, which makes it roughly 5% the size of Dia (82M vs. 1.6B parameters). As a result, it’s much faster and cheaper to run, though arguably at the cost of some quality.
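To make the size gap concrete, here is a rough back-of-the-envelope sketch using the parameter counts from the table above. It assumes weights are stored at 16-bit precision (2 bytes per parameter) and counts weights only, so real memory use at inference time will be higher. The function names are ours, for illustration.

```python
# Parameter counts taken from the comparison table in this article.
PARAMS = {
    "Dia": 1.6e9,
    "Kokoro": 82e6,
    "Sesame CSM": 1e9,
    "Orpheus (3B)": 3e9,
}

# Assumption: weights are stored in 16-bit precision (fp16 or bf16).
BYTES_PER_PARAM_FP16 = 2


def relative_size(model: str, baseline: str) -> float:
    """Return the model's parameter count as a fraction of the baseline's."""
    return PARAMS[model] / PARAMS[baseline]


def weight_memory_gb(model: str, bytes_per_param: int = BYTES_PER_PARAM_FP16) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes).

    Ignores activations, caches, and framework overhead.
    """
    return PARAMS[model] * bytes_per_param / 1e9


if __name__ == "__main__":
    print(f"Kokoro is {relative_size('Kokoro', 'Dia'):.1%} the size of Dia")
    for name in PARAMS:
        print(f"{name}: ~{weight_memory_gb(name):.2f} GB of weights at 16-bit")
```

Under these assumptions, Kokoro’s weights come to well under 1 GB while Dia’s are a few GB, which is one reason the smaller model is cheaper to serve.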
This generated audio sounds more artificial and Siri-like than the other models’ output, but it’s probably the cleanest among all our examples, especially in its pronunciation of “Modal Labs”.
Sesame CSM
Sesame CSM (Conversational Speech Model) is a 1B parameter TTS model built on Llama. It’s particularly well-suited for conversational use cases where you have two different speakers.
This generated audio sounds the least natural of our examples, though there are parameters you can tweak to improve quality, such as adding context.
Orpheus
Orpheus is a Llama-based TTS model that comes with 3B, 1B, 400M, and 150M parameter versions. It was trained on over 100k hours of English speech data.
It’s optimized for natural, human-like speech and also supports zero-shot voice cloning, guided emotion, and real-time streaming. Canopy Labs also offers a family of multilingual models covering Chinese, Hindi, Korean, and Spanish, as well as fine-tuning scripts.
While the quality of the demos is impressive, we had trouble getting the model to run (including the official examples), so proceed with caution if you try to deploy it yourself.
Conclusion
Text-to-speech is an exploding use case for many AI-first companies, and the quality of open-source models continues to improve every month. Deploying open-source TTS on Modal could give you the best of both worlds: high-quality models at a fraction of the cost of closed-source providers.
To get started, check out our Chatterbox TTS example.