Top embedding models on the MTEB leaderboard
The Hugging Face MTEB leaderboard has become a standard way to compare embedding models. But the rankings are volatile: new submissions constantly reshuffle the order, and the overall score often hides which models are actually strongest for a given task (e.g., classification or semantic similarity).
As a team building scalable and serverless AI infrastructure, we wanted to create a guide that helps cut through that noise. We’ll break down how to read MTEB scores, highlight which open-weight models stand out today, and show where domain-specific models like those tuned for finance or law deliver better results than general-purpose ones.
What is the MTEB leaderboard?
The Massive Text Embedding Benchmark (MTEB) evaluates models across eight categories:
- Classification
- Clustering
- Pair classification
- Reranking
- Retrieval
- Semantic textual similarity (STS)
- Summarization
- Bitext mining
Each model gets a score for every category plus an overall average. The overall score is a useful headline number, but it doesn't tell the whole story.
For example, a model tuned for retrieval and semantic textual similarity (the two categories most correlated with production performance in RAG and search) may underperform on clustering or classification, which brings down the average. Conversely, a model that performs steadily across all tasks (but not exceptionally) can end up with a higher average while being weaker at retrieval.
In other words, the best overall model is not always the top choice for your workload. A team that builds retrieval pipelines should focus on retrieval and semantic textual similarity over the global average.
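For example, if you pull the leaderboard into a DataFrame (say, from a CSV export; the file and column names below are illustrative and depend on how you export it), you can rank models by the task scores you actually care about rather than the overall mean:

```python
import pandas as pd

# Assumes you've exported the leaderboard table to a local CSV;
# file and column names are illustrative and depend on your export.
df = pd.read_csv("mteb_leaderboard.csv")

# Rank by the task scores that matter for a RAG/search workload
# instead of the overall mean.
ranked = df.sort_values(["Retrieval", "STS"], ascending=False)
print(ranked[["Model", "Retrieval", "STS", "Mean (Task)"]].head(10))
```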
The next section goes over the factors that matter most when choosing a model.
How to Choose a Model
Selecting an embedding model should be driven by context. The following factors matter far more than what the leaderboard highlights.
Task Relevance
Not all models are trained with the same end use case in mind. A model that clusters documents cleanly might fail miserably at ranking passages for retrieval, and a model that excels at pair classification could still struggle when used for unsupervised grouping.
The danger of relying on the overall score is that it blends these strengths and weaknesses, which conceals the trade-offs that matter in practice. By aligning model choice with the category that matches your workload, you give the embeddings the best chance of capturing the relevant information.
The table below shows which factors are most important across different MTEB task categories and how they map to common use cases.
| Task Category | What Matters Most for Models | Example Use Cases |
|---|---|---|
| Classification | Discriminative features, stable margins between classes | Sentiment analysis, spam detection, topic tagging |
| Clustering | Global structure, ability to group related items without labels | Customer segmentation, document deduplication, theme discovery |
| Pair classification | Capturing entailment and contradiction relationships | Duplicate bug reports, Q&A pair matching, natural language inference (NLI) |
| Reranking | Sensitivity to fine differences in candidate orderings | Search pipelines with candidate filtering before cross-encoder re-ranking |
| Retrieval | Fine-grained semantic similarity, context handling, ranking consistency | RAG, semantic search, question answering |
| Semantic textual similarity (STS) | Sentence-level meaning preservation, robust to phrasing variations | Duplicate detection, paraphrase identification |
| Summarization | Preserving long-form semantics, capturing discourse-level relationships | Abstractive QA pipelines, document summarization, report generation |
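To go one step further, you can evaluate a candidate model on just the category that matches your workload. Here's a minimal sketch using the open-source `mteb` package with a small sentence-transformers model; the exact API, result objects, and task names may differ between `mteb` versions:

```python
# A minimal sketch using the open-source `mteb` package; the exact API
# and result objects may differ between versions.
import mteb
from sentence_transformers import SentenceTransformer

# Any embedding model with an .encode() method works here; this small
# model keeps the example cheap to run.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Evaluate only the category that matches your workload (here, a small
# English retrieval task) instead of the full benchmark.
tasks = mteb.get_tasks(tasks=["SciFact"])
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="mteb_results")

for result in results:
    print(result)
```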
Computational Requirements
Larger models often produce higher-quality embeddings. However, that quality comes at a cost: more GPU memory, slower inference, and higher infrastructure expenses.
Smaller models, especially those with compression techniques like Matryoshka representation learning, can offer a better balance. By producing embeddings at multiple dimensionalities, they allow teams to trade vector size for speed and memory savings (without having to retrain from scratch).
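Here's a rough sketch of that trade-off: truncating embeddings to a smaller dimensionality and re-normalizing them for cosine search. This only works well if the model was actually trained with Matryoshka representation learning; the model name and dimensions below are placeholders:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Placeholder model: substitute a checkpoint that was actually trained
# with Matryoshka representation learning, or truncation will hurt quality.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

full = model.encode(["How do I rotate my API keys?"])  # e.g. shape (1, 384)

# Keep only the leading dimensions, then re-normalize for cosine search.
small = full[:, :128]
small = small / np.linalg.norm(small, axis=1, keepdims=True)

print(full.shape, small.shape)
```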
For workloads constrained by hardware or operating over very large indexes, it’s worth weighing model size and throughput first, before worrying about a gap on the leaderboard.
Domain Relevance
When the workload involves specialized languages like biomedical text, financial filings, or source code, domain-specific models almost always outperform general-purpose ones.
The reason is simple: they are fine-tuned on terminology, structures, and conventions that generalist models can't capture. A biomedical model will understand MeSH terms and clinical shorthand. A code model will encode programming-language syntax into its embeddings.
General-purpose embeddings are strong baselines, but when accuracy in a specific domain is the goal, in-domain training is often the only way to deliver truly relevant results.
Licensing and Deployment
Not all open-weight models can be used commercially. Some are released under research-only licenses (e.g., CC BY-NC 4.0), while others are available under permissive licenses (e.g., Apache 2.0).
Deployment constraints vary as well. Some models are designed for GPU inference at scale, while others are optimized for CPU environments or variable embedding dimensions to fit tighter hardware budgets.
Before embedding sensitive data, verify both the license terms and the deployment requirements of the model you’re choosing.
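One quick way to sanity-check licensing is to read the license metadata the Hugging Face Hub exposes as repo tags; the repo IDs below are the models' Hub names, and you should always confirm against the model card itself before relying on the result:

```python
from huggingface_hub import model_info

# The Hub exposes license metadata as repo tags like "license:apache-2.0";
# always confirm against the model card before deploying.
for repo_id in ["Qwen/Qwen3-Embedding-8B", "BAAI/bge-m3"]:
    tags = model_info(repo_id).tags
    print(repo_id, [t for t in tags if t.startswith("license:")])
```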
Evaluation on Your Data
No benchmark can fully capture the nuances of a specific dataset. Document style, query phrasing, and domain vocabulary all interact in ways that shape retrieval quality. The MTEB leaderboard is a solid starting point, but the decisive factor should always be how a model performs on your corpus.
Running small-scale evaluations with metrics like Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (NDCG) provides real data on how a model fits your use case. These metrics reveal not only whether a model retrieves the right documents, but also how consistently it does so.
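Both metrics are easy to compute yourself on a small labeled set. The sketch below assumes you already have ranked document IDs from your own retriever and relevance labels for each query:

```python
import numpy as np

def mrr(ranked_ids, relevant_id):
    """Reciprocal rank of the first relevant document (0 if never retrieved)."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, relevance, k=10):
    """NDCG@k given graded relevance labels keyed by document id."""
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    dcg = sum(g / np.log2(i + 2) for i, g in enumerate(gains))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / np.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Toy example: document ids returned by your retriever for one query.
ranked = ["d7", "d2", "d9", "d4"]
print("MRR:", mrr(ranked, relevant_id="d2"))                 # 0.5
print("NDCG@4:", ndcg_at_k(ranked, {"d2": 3, "d4": 1}, k=4))
```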
With these factors in mind, we can now look at models that have consistently performed well on the MTEB leaderboard as of 2025.
Top 5 Models on the MTEB Leaderboard
As of 2025, here are some of the top open-weight models on the MTEB leaderboard and their backgrounds.
1. Qwen3-Embedding-8B
What is it?
The largest model in a new family of embedding models built on top of Qwen3, also available in 4B and 0.6B sizes. It outperforms the previous generation of Qwen embedding models (e.g., gte-Qwen2-7B-instruct) on MTEB benchmarks and ranks high on both the multilingual and English-only MTEB leaderboards.
What is its license?
Apache-2.0 (commercial use permitted).
Who should use it?
Teams that require strong multilingual support or state-of-the-art performance on retrieval, classification, or semantic similarity tasks. The model is particularly strong at long-text understanding.
What are the trade-offs?
VRAM-heavy at 8B parameters, and therefore costlier and slower to run. However, smaller 4B and 0.6B versions exist.
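As a rough idea of usage, the family plugs into sentence-transformers. The sketch below uses the smallest 0.6B checkpoint to keep memory modest; the "query" prompt name follows the model card's convention for asymmetric retrieval, so check the card for current details:

```python
from sentence_transformers import SentenceTransformer

# Smallest checkpoint in the family to keep memory modest; the "query"
# prompt name follows the model card's convention for asymmetric retrieval.
model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")

queries = ["What is the capital of France?"]
documents = [
    "Paris is the capital and most populous city of France.",
    "Mount Everest is Earth's highest mountain above sea level.",
]

query_emb = model.encode(queries, prompt_name="query")
doc_emb = model.encode(documents)

# Cosine similarity between each query and each document.
print(model.similarity(query_emb, doc_emb))
```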
2. llama-embed-nemotron-8b
What is it?
Released in October 2025, this is the latest embedding model from NVIDIA. It is fine-tuned from Llama-3.1-8B and is particularly powerful at understanding multilingual text.
What is its license?
Customized-nscl-v1 (non-commercial only).
Who should use it?
Researchers building applications that need text understanding, especially multilingual RAG systems.
What are the trade-offs?
It cannot be used commercially.
3. bge-m3
What is it?
A versatile text embedding model released in 2024 by the Beijing Academy of Artificial Intelligence (BAAI). It is multilingual, supports long inputs, and supports multiple retrieval methods (dense, sparse, multi-vector).
What is its license?
MIT License (permissive; commercial use permitted).
Who should use it?
Teams that want a production-ready, open-weight model for retrieval/search pipelines.
What are the trade-offs?
Since it was released in 2024, newer models may outperform it on your tasks.
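As a rough usage sketch, BAAI documents the model through its FlagEmbedding package; the arguments and output keys below follow the model card and may shift between versions:

```python
from FlagEmbedding import BGEM3FlagModel

# Argument and output key names follow the bge-m3 model card and may
# shift between FlagEmbedding versions.
model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

sentences = [
    "What is BGE-M3?",
    "BGE-M3 supports dense, sparse, and multi-vector retrieval.",
]

out = model.encode(
    sentences,
    return_dense=True,        # one dense vector per sentence
    return_sparse=True,       # lexical weights for sparse retrieval
    return_colbert_vecs=False,
)

print(out["dense_vecs"].shape)    # dense embeddings, 1024-d for bge-m3
print(out["lexical_weights"][0])  # token -> weight mapping for sentence 0
```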
4. stella_en_1.5B_v5
What is it?
A compact, English-only embedding model (~1.5B parameters) built on top of the Alibaba-NLP/gte-large-en-v1.5 and Alibaba-NLP/gte-Qwen2-1.5B-instruct models. It produces 1,024-d embeddings by default, but supports Matryoshka for other dimensions.
What is its license?
MIT License.
Who should use it?
Teams with limited GPU resources or CPU-only environments that still need strong English retrieval.
What are the trade-offs?
It is English-only, and its smaller capacity means it lags behind 7B-scale models in raw accuracy.
5. embeddinggemma-300m
What is it?
A 300M-parameter open embedding model from Google, built on Gemma 3 and T5Gemma, designed for search, retrieval, and semantic similarity across 100+ languages. Its smaller size makes it suitable for resource-limited hardware like phones or laptops.
What is its license?
Apache-2.0.
Who should use it?
Teams that want to balance cost and performance, including deployments on resource-limited hardware.
What are the trade-offs?
The smaller size of this model means that it will be less accurate than the largest state-of-the-art embedding models.
Domain-Specific Embedding Models
While the MTEB leaderboard is dominated by general-purpose models, specialized domains benefit from embeddings tuned on in-domain corpora. Here are some of the top ones.
- Medicine: PubMedBERT is trained on biomedical literature, making it well-suited for tasks in healthcare and biomedical research. BioLORD is another model tailored for similar applications.
- Finance: Finance Embeddings from Investopedia, Voyage Finance, and BGE Base Financial Matryoshka are examples of models fine-tuned on financial datasets, offering improved performance for tasks such as sentiment analysis of financial news or SEC filings.
- Law: For legal applications, see Domain-Specific Embeddings and Retrieval: Legal Edition, which covers models fine-tuned on legal documents for legal research, contract analysis, and other law-related NLP tasks.
- Code: CodeBERT and GraphCodeBERT are designed specifically for programming language understanding, making them useful for code search, code completion, and bug detection tasks.
- Math: Math Similarity Model is tailored for formula-aware embeddings and captures mathematical structure in LaTeX (or other symbolic formats). This can be useful for research search engines and technical Q&A systems.
- Language-Specific: Beyond English, strong monolingual models exist, such as RoSEtta-base-ja (Japanese), KoSimCSE-roberta (Korean), GTE-Qwen2-7B-instruct (Chinese), Sentence-Camembert-large (French), and Arabic-STS-Matryoshka (Arabic).
Closing Thoughts
The MTEB leaderboard has grown into the most comprehensive benchmark for embedding models, covering classification, clustering, retrieval, and more. Its overall score is a useful signal, but production systems succeed when engineers look deeper. Which tasks matter? What cost constraints do you have? Does it make sense to use a domain-specific model? The right approach is to use MTEB to narrow your options, and then benchmark on your own dataset.
Looking for an easy way to deploy open-source text embedding models on GPUs? Check out Modal’s text embedding tutorial here.