The Top Open-Source Text to Speech (TTS) Models
The text-to-speech (TTS) landscape is rapidly changing, with new state-of-the-art models launching every month. Many of these models are open-source.
This article explores the top open-source TTS models, based on Hugging Face’s trending models and insights from our developer community.
| Model | Parameters | Created by | Released | License |
|---|---|---|---|---|
| Higgs Audio V2 | 5.77B | Boson AI | July 2025 | Apache 2.0 |
| Kokoro v1.0 | 82M | Hexgrad | Jan 27, 2025 | Apache 2.0 |
| Dia (deploy on Modal) | 1.6B | Nari Labs | Apr 21, 2025 | Apache 2.0 |
| Chatterbox (deploy on Modal) | Not specified | Resemble AI | May 28, 2025 | MIT |
| Orpheus | 3B / 1B / 400M / 150M | Canopy Labs | Mar 7, 2025 | Apache 2.0 |
| Sesame CSM | 1B | Sesame | Feb 26, 2025 | Apache 2.0 |
How should you think about rankings?
While the Hugging Face trending models leaderboard can give you a rough idea of where each model stands, you shouldn’t take it as a direct indication of which model will be best for your use case. Rather, to sharpen your search, consider how the five axes of TTS models matter to your use case: naturalness, voice cloning capability, word error rate (WER), latency, and parameter count.
1. Naturalness
Some models produce more human-like, natural-sounding speech. These models might be preferred for applications where the user’s perception of the narration matters. For example, users might want to feel like they’re talking to a human when calling an automated customer service line. Or, they might prefer realistic voices when listening to an audiobook narration.
Naturalness isn’t an easy thing to measure with computers, but community voting platforms like TTS Arena provide a human-driven score. TTS Arena operates like chess rating systems, with win rates and Elo scores based on thousands of side-by-side comparisons. For example, Kokoro v1.0 currently has a 44% win rate on TTS Arena V2, meaning it wins against other models in 44% of head-to-head comparisons.
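For intuition, here is a minimal sketch of how pairwise votes turn into Elo ratings. The ratings and K-factor below are made up for illustration; this is the standard Elo update rule, not TTS Arena’s actual implementation.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a: float, rating_b: float, a_won: bool, k: float = 32) -> tuple[float, float]:
    """Update both ratings after one side-by-side listener vote."""
    expected_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1 - score_a) - (1 - expected_a))
    return new_a, new_b

# Example: a 1200-rated model upsets a 1300-rated model in one vote.
print(update_elo(1200, 1300, a_won=True))  # A gains roughly 20 points, B loses the same
```

Aggregated over thousands of such votes, these updates produce the win rates and Elo scores you see on the leaderboard.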
2. Voice Cloning Capability
Many modern TTS models offer zero-shot voice cloning: the model replicates a specific voice from just a few seconds of reference audio.
Voice cloning matters for a few practical scenarios. For example, developers might need to create a consistent brand voice from reference audio. Alternatively, they might be building an app with personalized experiences based on the user’s own voice. Video game developers might use voice cloning to generate new lines or scene variations introduced after a voice actor’s recording session has wrapped.
To test voice cloning, models are evaluated on speaker verification tasks, where automatic speaker verification systems check whether the generated speech matches the reference audio. The resulting metric is known as speaker similarity, measuring how closely the generated voice matches the reference speaker.
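As a rough illustration, you can approximate a speaker-similarity check with an off-the-shelf speaker-embedding model. The sketch below assumes the `speechbrain` package (the import path differs slightly between SpeechBrain versions) and hypothetical file paths `reference.wav` and `generated.wav`; published evaluations use the same idea with standardized models and thresholds.

```python
# pip install speechbrain
from speechbrain.inference.speaker import SpeakerRecognition

# Load a pretrained ECAPA-TDNN speaker-verification model from Hugging Face.
verifier = SpeakerRecognition.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="pretrained_models/spkrec-ecapa-voxceleb",
)

# Compare the original reference clip to the TTS model's cloned output.
# `score` is a cosine similarity (higher = more similar); `prediction`
# indicates whether the verifier judges them to be the same speaker.
score, prediction = verifier.verify_files("reference.wav", "generated.wav")
print(f"speaker similarity: {score.item():.3f}, same speaker: {bool(prediction)}")
```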
Another aspect of voice cloning is the model’s capacity to produce cross-lingual speech in that same voice. Because some languages are closely tied to an accent, while voices are independent of accents, models can struggle with cross-lingual cloning. Cross-lingual support matters for tooling that serves bilingual speakers or works with multilingual content, such as an international play script.
A related capability is multi-speaker support: a model’s ability to generate dialogue from a transcript that assigns lines to multiple voice-cloned speakers. Multi-speaker support has niche use cases in the film and audiobook industries.
3. Word Error Rate
Word error rate (WER) measures how accurately a speech-to-text (STT) model can transcribe a TTS model’s synthesized speech back into the original text.
In other words, WER uses a second model to score the TTS model. By transcribing every model’s output with the same industry-leading STT model, you can compare how reliably each one produces speech that can be re-transcribed to text.
Word error rate is a crude metric of accuracy. Hypothetically, models could be scored by humans for accuracy, but unlike naturalness, it’s too time-consuming for humans to cross-check synthesized speech against the original text. Accordingly, WER delegates the job to an STT model, which provides a rough measure of how intelligible the synthesized speech is.
Notably, a model could have poor naturalness but exceptional WER. For example, if a model produces a Siri-like output, it sounds incredibly robotic and unnatural, but is simultaneously understandable to both humans and computers.
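Concretely, WER is computed as (substitutions + deletions + insertions) divided by the number of words in the reference text. The sketch below assumes the `jiwer` package and made-up strings; a real evaluation would first run the synthesized audio through an STT model to obtain the hypothesis.

```python
# pip install jiwer
import jiwer

# The text that was fed to the TTS model.
reference = "the quick brown fox jumps over the lazy dog"
# What an STT model transcribed from the TTS model's audio output.
hypothesis = "the quick brown fox jumped over the lazy dog"

# WER = (substitutions + deletions + insertions) / words in the reference
wer = jiwer.wer(reference, hypothesis)
print(f"WER: {wer:.2%}")  # one substitution over nine words -> ~11.11%
```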
4. Latency
Latency is a measure of how long it takes for a model to return audio after text is submitted. There are a few ways to measure it. For production systems, the recommended approach is end-to-end latency, which includes time to first byte or TTFB (time for the initial API request to return data), audio synthesis latency (time for the model to produce the complete output), and network latency (time for the network to deliver the final packets).
However, when comparing models in a vacuum, network latency can be ignored. Additionally, most practical use cases depend on either TTFB or audio synthesis latency, not both. For use cases where audio is played as soon as it is generated (such as a customer service bot), only TTFB matters, because playback proceeds at a natural speaking rate that is slower than synthesis. Conversely, for use cases where an audio file is generated for later use (such as an e-book app), only audio synthesis latency matters. This is often expressed as RTFx (real-time factor), the ratio of the audio’s playback length to the time it took to generate; an RTFx of 10 means the model produces ten seconds of audio per second of compute.
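To make the distinction concrete, here is a minimal timing sketch. `synthesize_stream` is a hypothetical client function that yields audio chunks as they are produced; swap in whichever streaming TTS API you are benchmarking.

```python
import time
from typing import Iterator

def synthesize_stream(text: str) -> Iterator[bytes]:
    """Hypothetical streaming TTS client: yields audio chunks as they arrive."""
    raise NotImplementedError("replace with your TTS provider's streaming call")

def benchmark(text: str, audio_duration_s: float) -> None:
    start = time.perf_counter()
    ttfb = None
    chunks = []
    for chunk in synthesize_stream(text):
        if ttfb is None:
            ttfb = time.perf_counter() - start  # time to first audio byte
        chunks.append(chunk)
    total = time.perf_counter() - start  # full audio synthesis latency

    # RTFx: seconds of audio produced per second of compute (higher is faster).
    rtfx = audio_duration_s / total
    print(f"TTFB: {ttfb:.3f}s, synthesis: {total:.3f}s, RTFx: {rtfx:.1f}x")
```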
5. Parameters
Parameters are internal variables that are tuned during a model’s training process for it to produce more accurate outputs. Generally speaking, higher parameter count models are more expensive to run on hardware, costing more memory and CPU/GPU cycles, while lower parameter count models are cheaper to run. This especially matters to use cases where the model needs to run on a local device, like a smart device or smartphone, or for companies producing speech at scale where costs might overwhelm other metrics of realism.
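As a rough back-of-the-envelope, the weights alone need about parameter count × bytes per parameter of memory; activations, caches, and any audio decoder add overhead on top. A quick sketch, using the parameter counts from the table above:

```python
def approx_weight_memory_gb(params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for model weights alone (fp16/bf16 = 2 bytes/param)."""
    return params * bytes_per_param / 1e9

# Rough weight footprints in fp16 for a few of the models discussed here.
for name, params in [("Higgs Audio V2", 5.77e9), ("Dia", 1.6e9), ("Kokoro", 82e6)]:
    print(f"{name}: ~{approx_weight_memory_gb(params):.1f} GB")
# Higgs Audio V2: ~11.5 GB, Dia: ~3.2 GB, Kokoro: ~0.2 GB
```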
New to Text-to-Speech? The Best Pick
If you are building a text-to-speech powered application for the first time, we highly recommend starting with Chatterbox. Developed by Resemble AI, Chatterbox is easy to use, produces expressive and natural speech, and provides a crash course in the typical configuration options of TTS models.
To make things easier, you can deploy Chatterbox on Modal in just a few steps. Check out our dedicated guide to get started.
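For a feel of what that looks like, here is a minimal sketch of a Modal function wrapping Chatterbox. It follows the chatterbox-tts package’s published quick-start and Modal’s standard app/function pattern, but the exact image contents, GPU choice, and API details may differ from the dedicated guide, so treat it as an outline rather than the guide itself.

```python
# pip install modal  (the remote image installs chatterbox-tts)
import modal

app = modal.App("chatterbox-tts-demo")

image = modal.Image.debian_slim().pip_install("chatterbox-tts", "torchaudio")

@app.function(gpu="a10g", image=image)
def synthesize(text: str) -> bytes:
    """Run Chatterbox on a GPU container and return WAV bytes."""
    import io
    import torchaudio as ta
    from chatterbox.tts import ChatterboxTTS

    model = ChatterboxTTS.from_pretrained(device="cuda")
    wav = model.generate(text)  # optionally pass audio_prompt_path= for voice cloning
    buffer = io.BytesIO()
    ta.save(buffer, wav, model.sr, format="wav")
    return buffer.getvalue()

@app.local_entrypoint()
def main():
    audio = synthesize.remote("Hello from Chatterbox running on Modal!")
    with open("output.wav", "wb") as f:
        f.write(audio)
```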
The Best Text-to-Speech Models in 2025
Keeping these metrics in mind, let’s visit today’s leading text-to-speech models.
Higgs Audio V2
Higgs Audio V2 is a massive model developed by Boson AI, and it’s currently the top trending text-to-speech model on Hugging Face. It’s an open-source model built on top of Llama 3.2 3B and pre-trained on over 10 million hours of audio data. The model provides industry-leading expressive audio generation and multilingual voice cloning; for example, Higgs Audio V2 leads listener evaluations on conveying emotion and asking questions.
Boson AI achieved these results by incorporating a Dual-FFN architecture that acts as an audio-specific expert, boosting the LLM’s performance with minimal computational overhead. The model’s tokenizer also captures discrete representations of both the semantic and acoustic aspects of audio, enabling it to produce more realistic speech.
Who was Higgs Audio V2 produced by?
Higgs Audio V2 is produced by Boson AI and is released under an Apache 2.0 license.
What metrics does Higgs Audio V2 excel at?
Higgs Audio V2 has incredible naturalness, especially when conveying emotion, has robust voice cloning capabilities, and can produce realistic multi-speaker dialogue. Additionally, it has a relatively low word error rate.
Kokoro v1.0
Kokoro is an indie-developed TTS model with just 82M parameters. Kokoro’s tiny parameter count makes it significantly more economical to run on hardware. However, it does not support voice cloning and has poorer naturalness than other models.
Who was Kokoro produced by?
Contrary to other models on this list, Kokoro was not developed by a large, funded company, but instead an indie developer named Hexgrad. Hexgrad’s profile picture is a kitten fishing beside a pail.
What metrics does Kokoro v1.0 excel at?
Kokoro is primarily a low-footprint model, with a low parameter count and therefore minimal compute needs.
Dia
Dia is a 1.6B parameter TTS model developed by Nari Labs. Dia can generate highly realistic dialogue, but exclusively in English.
Despite being English-only, Dia supports multi-speaker speech generation and can add nonverbal audio with tags like (laughs), (coughs), and (gasps). This makes Dia an exceptional candidate for audiobook speech generation.
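To illustrate, a Dia prompt interleaves speaker tags with nonverbal cues. The snippet below follows the pattern in Nari Labs’ published quick-start ([S1]/[S2] speaker tags plus parenthesized nonverbal tags); check the repo for the current API, since argument names and defaults may have changed.

```python
# pip install git+https://github.com/nari-labs/dia.git soundfile
import soundfile as sf
from dia.model import Dia

model = Dia.from_pretrained("nari-labs/Dia-1.6B")

# [S1]/[S2] mark alternating speakers; parenthesized tags add nonverbal audio.
script = (
    "[S1] Did you finish the audiobook chapter? "
    "[S2] I did, and the ending got me. (gasps) "
    "[S1] Told you it was worth it. (laughs)"
)

audio = model.generate(script)
sf.write("dialogue.wav", audio, 44100)
```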
Dia has generated a fair amount of buzz for its Apache 2.0 license and early promise. A user has even published a repo showing how to deploy Dia on Modal.
Who was Dia produced by?
Dia was published by Nari Labs and is entirely free. Nari Labs seems to be generating some meager revenue from Google Ads plastered across its website, but is otherwise not presently monetizing the model.
What metrics does Dia excel at?
Dia is fantastic at producing natural audio (especially audio with nonverbal sounds and emotion) and supporting multiple speakers.
Chatterbox
Chatterbox is a small, fast, and easy-to-use TTS model developed by Resemble AI, built atop a 0.5B-parameter Llama backbone. Until recently, it was the #1 trending TTS model on Hugging Face. Chatterbox supports AI voice cloning, is incredibly natural, and allows configurable expressiveness. Check out this side-by-side comparison with ElevenLabs.
Who was Chatterbox produced by?
Chatterbox was produced by Resemble AI. Resemble AI builds various models, including TTS and STT models, for enterprises.
What metrics does Chatterbox excel at?
Chatterbox is excellent at producing configurable, natural audio with strong voice cloning support. It has low WER, is inexpensive with just 0.5B parameters, and could be framed as an all-rounder model. It also has incredible community adoption.
Orpheus
Orpheus is a Llama-based TTS model developed by Canopy Labs. Orpheus comes in 3B, 1B, 400M, and 150M parameter versions, making it easy for organizations to deploy models from a single stack across different settings. It was trained on over 100k hours of English speech data.
It’s optimized for natural, human-like speech and also supports zero-shot voice cloning, guided emotion, and real-time streaming. Canopy Labs also publishes a family of multilingual models (including Chinese, Hindi, Korean, and Spanish) along with fine-tuning scripts.
Who was Orpheus produced by?
Orpheus was produced by Canopy Labs, whose primary goal is creating realistic digital humans.
What metrics does Orpheus excel at?
Orpheus is excellent at producing natural audio with multilingual support. It offers robust voice cloning and, given its range of sizes, is apt for use cases that necessitate a low parameter count.
Sesame CSM
Sesame CSM (Conversational Speech Model) is a 1B parameter TTS model produced by Sesame. CSM is built on Llama and is particularly well-suited for conversational use cases with two different speakers.
Who was Sesame CSM produced by?
Sesame CSM was produced by Sesame, a company focused on creating realistic, personal digital companions that emulate humans.
What metrics does Sesame CSM excel at?
Sesame CSM is strong for multi-speaker set-ups, but doesn’t produce the most natural speech.
Conclusion
Text-to-speech is an exploding use case for many AI-first companies, and the quality of open-source models continues to improve every month. Deploying open-source TTS on Modal can give you the best of both worlds: high-quality models at a fraction of the cost of closed-source providers.
To get started, check out our Chatterbox TTS example.