Top embedding models on the MTEB leaderboard
The Hugging Face MTEB leaderboard has become a standard way to compare embedding models. But the rankings are volatile: new submissions constantly reshuffle the order, and the overall score often hides which models are actually strongest for a given task (e.g., classification or semantic similarity).
As a team building scalable and serverless AI infrastructure, we wanted to create a guide that helps cut through that noise. We’ll break down how to read MTEB scores, highlight which open-weight models stand out today, and show where domain-specific models like those tuned for finance or law deliver better results than general-purpose ones.
What is the MTEB leaderboard?
The Massive Text Embedding Benchmark (MTEB) evaluates models across eight categories:
- Classification
- Clustering
- Pair classification
- Reranking
- Retrieval
- Semantic textual similarity (STS)
- Summarization
- Bitext mining
Each model gets a score for every category plus an overall average. The overall score is a useful headline number, but it doesn't tell the whole story.
For example, a model tuned for retrieval and semantic textual similarity (the two categories most correlated with production performance in RAG and search) may underperform on clustering or classification, which brings down the average. Conversely, a model that performs steadily across all tasks (but not exceptionally) can end up with a higher average while being weaker at retrieval.
In other words, the best overall model is not always the top choice for your workload. A team that builds retrieval pipelines should focus on retrieval and semantic textual similarity over the global average.
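For example, if you pull the leaderboard into a DataFrame (say, from a CSV export; the file and column names below are illustrative and depend on how you export it), you can rank models by the task scores you actually care about rather than the overall mean:

```python
import pandas as pd

# Assumes you've exported the leaderboard table to a local CSV;
# file and column names are illustrative and depend on your export.
df = pd.read_csv("mteb_leaderboard.csv")

# Rank by the task scores that matter for a RAG/search workload
# instead of the overall mean.
ranked = df.sort_values(["Retrieval", "STS"], ascending=False)
print(ranked[["Model", "Retrieval", "STS", "Mean (Task)"]].head(10))
```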
The next section goes over the factors that matter most when choosing a model.
How to Choose a Model
Selecting an embedding model should be driven by context. The following factors matter far more than what the leaderboard highlights.
Task Relevance
Not all models are trained with the same end use case in mind. A model that clusters documents cleanly might fail miserably at ranking passages for retrieval, and a model that excels at pair classification could still struggle when used for unsupervised grouping.
The danger of relying on the overall score is that it blends these strengths and weaknesses, which conceals the trade-offs that matter in practice. By aligning model choice with the category that matches your workload, you give the embeddings the best chance of capturing the relevant information.
The table below shows which factors are most important across different MTEB task categories and how they map to common use cases.
| Task Category | What Matters Most for Models | Example Use Cases |
|---|---|---|
| Classification | Discriminative features, stable margins between classes | Sentiment analysis, spam detection, topic tagging |
| Clustering | Global structure, ability to group related items without labels | Customer segmentation, document deduplication, theme discovery |
| Pair classification | Capturing entailment and contradiction relationships | Duplicate bug reports, Q&A pair matching, natural language inference (NLI) |
| Reranking | Sensitivity to fine differences in candidate orderings | Search pipelines with candidate filtering before cross-encoder re-ranking |
| Retrieval | Fine-grained semantic similarity, context handling, ranking consistency | RAG, semantic search, question answering |
| Semantic textual similarity (STS) | Sentence-level meaning preservation, robust to phrasing variations | Duplicate detection, paraphrase identification |
| Summarization | Preserving long-form semantics, capturing discourse-level relationships | Abstractive QA pipelines, document summarization, report generation |
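To go one step further, you can evaluate a candidate model on just the category that matches your workload. Here's a minimal sketch using the open-source `mteb` package with a small sentence-transformers model; the exact API, result objects, and task names may differ between `mteb` versions:

```python
# A minimal sketch using the open-source `mteb` package; the exact API
# and result objects may differ between versions.
import mteb
from sentence_transformers import SentenceTransformer

# Any embedding model with an .encode() method works here; this small
# model keeps the example cheap to run.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Evaluate only the category that matches your workload (here, a small
# English retrieval task) instead of the full benchmark.
tasks = mteb.get_tasks(tasks=["SciFact"])
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="mteb_results")

for result in results:
    print(result)
```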
Computational Requirements
Larger models often produce higher-quality embeddings. However, that quality comes at a cost: more GPU memory, slower inference, and higher infrastructure expenses.
Smaller models, especially those with compression techniques like Matryoshka representation learning, can offer a better balance. By producing embeddings at multiple dimensionalities, they allow teams to trade vector size for speed and memory savings (without having to retrain from scratch).
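Here's a rough sketch of that trade-off: truncating embeddings to a smaller dimensionality and re-normalizing them for cosine search. This only works well if the model was actually trained with Matryoshka representation learning; the model name and dimensions below are placeholders:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Placeholder model: substitute a checkpoint that was actually trained
# with Matryoshka representation learning, or truncation will hurt quality.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

full = model.encode(["How do I rotate my API keys?"])  # e.g. shape (1, 384)

# Keep only the leading dimensions, then re-normalize for cosine search.
small = full[:, :128]
small = small / np.linalg.norm(small, axis=1, keepdims=True)

print(full.shape, small.shape)
```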
For workloads constrained by hardware or operating over very large indexes, it’s worth weighing model size and throughput first, before worrying about a gap on the leaderboard.
Domain Relevance
When the workload involves specialized languages like biomedical text, financial filings, or source code, domain-specific models almost always outperform general-purpose ones.
The reason is simple: they are fine-tuned on terminology, structures, and conventions that generalist models can't capture. A biomedical model will understand MeSH terms and clinical shorthand. A code model will encode programming-language syntax into its embeddings.
General-purpose embeddings are strong baselines, but when accuracy in a specific domain is the goal, in-domain training is often the only way to deliver truly relevant results.
Licensing and Deployment
Not all open-weight models can be used commercially. Some are released under research-only licenses (e.g., CC BY-NC 4.0), while others are available under permissive licenses (e.g., Apache 2.0).
Deployment constraints vary as well. Some models are designed for GPU inference at scale, while others are optimized for CPU environments or variable embedding dimensions to fit tighter hardware budgets.
Before embedding sensitive data, verify both the license terms and the deployment requirements of the model you’re choosing.
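One quick way to sanity-check licensing is to read the license metadata the Hugging Face Hub exposes as repo tags; the repo IDs below are the models' Hub names, and you should always confirm against the model card itself before relying on the result:

```python
from huggingface_hub import model_info

# The Hub exposes license metadata as repo tags like "license:apache-2.0";
# always confirm against the model card before deploying.
for repo_id in ["Qwen/Qwen3-Embedding-8B", "BAAI/bge-m3"]:
    tags = model_info(repo_id).tags
    print(repo_id, [t for t in tags if t.startswith("license:")])
```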
Evaluation on Your Data
No benchmark can fully capture the nuances of a specific dataset. Document style, query phrasing, and domain vocabulary all interact in ways that shape retrieval quality. The MTEB leaderboard is a solid starting point, but the decisive factor should always be how a model performs on your corpus.
Running small-scale evaluations with metrics like Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (NDCG) provides real data on how a model fits your use case. These metrics reveal not only whether a model retrieves the right documents, but also how consistently it does so.
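Both metrics are easy to compute yourself on a small labeled set. The sketch below assumes you already have ranked document IDs from your own retriever and relevance labels for each query:

```python
import numpy as np

def mrr(ranked_ids, relevant_id):
    """Reciprocal rank of the first relevant document (0 if never retrieved)."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, relevance, k=10):
    """NDCG@k given graded relevance labels keyed by document id."""
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    dcg = sum(g / np.log2(i + 2) for i, g in enumerate(gains))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / np.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Toy example: document ids returned by your retriever for one query.
ranked = ["d7", "d2", "d9", "d4"]
print("MRR:", mrr(ranked, relevant_id="d2"))                 # 0.5
print("NDCG@4:", ndcg_at_k(ranked, {"d2": 3, "d4": 1}, k=4))
```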
With these factors in mind, we can now look at models that have consistently performed well on the MTEB leaderboard as of 2025.
Top 5 Models on the MTEB Leaderboard
As of 2025, here are some of the top open-weight models on the MTEB leaderboard and their backgrounds.
1. Qwen3-Embedding-8B
What is it?
The largest model in a new family of embedding models built on top of Qwen3, also available in 4B and 0.6B sizes. It outperforms the previous generation of Qwen embedding models (e.g., gte-Qwen2-7B-instruct) on MTEB benchmarks and ranks high on both the multilingual and English-only MTEB leaderboards.
What is its license?
Apache-2.0 (commercial use permitted).
Who should use it?
Teams that require strong multilingual support or state-of-the-art performance on retrieval, classification, or semantic similarity tasks. The model is particularly strong at long-text understanding.
What are the trade-offs?
VRAM-heavy at 8B parameters, and therefore costlier and slower to run. However, smaller 4B and 0.6B versions exist.
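As a rough idea of usage, the family plugs into sentence-transformers. The sketch below uses the smallest 0.6B checkpoint to keep memory modest; the "query" prompt name follows the model card's convention for asymmetric retrieval, so check the card for current details:

```python
from sentence_transformers import SentenceTransformer

# Smallest checkpoint in the family to keep memory modest; the "query"
# prompt name follows the model card's convention for asymmetric retrieval.
model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")

queries = ["What is the capital of France?"]
documents = [
    "Paris is the capital and most populous city of France.",
    "Mount Everest is Earth's highest mountain above sea level.",
]

query_emb = model.encode(queries, prompt_name="query")
doc_emb = model.encode(documents)

# Cosine similarity between each query and each document.
print(model.similarity(query_emb, doc_emb))
```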
2. llama-embed-nemotron-8b
What is it?
Released in October 2025, this is the latest embedding model from NVIDIA. It is fine-tuned from Llama-3.1-8B and is particularly powerful at understanding multilingual text.
What is its license?
Customized-nscl-v1 (non-commercial only).
Who should use it?
Researchers building applications that need text understanding, especially multilingual RAG systems.
What are the trade-offs?
It cannot be used commercially.
3. bge-m3
What is it?
A versatile text embedding model released in 2024 by the Beijing Academy of Artificial Intelligence (BAAI). It is multilingual, supports long inputs, and supports multiple retrieval methods (dense, sparse, multi-vector).
What is its license?
MIT License (permissive; commercial use permitted).
Who should use it?
Teams that want a production-ready, open-weight model for retrieval/search pipelines.
What are the trade-offs?
Since it was released in 2024, newer models may outperform it on your tasks.
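As a rough usage sketch, BAAI documents the model through its FlagEmbedding package; the arguments and output keys below follow the model card and may shift between versions:

```python
from FlagEmbedding import BGEM3FlagModel

# Argument and output key names follow the bge-m3 model card and may
# shift between FlagEmbedding versions.
model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

sentences = [
    "What is BGE-M3?",
    "BGE-M3 supports dense, sparse, and multi-vector retrieval.",
]

out = model.encode(
    sentences,
    return_dense=True,        # one dense vector per sentence
    return_sparse=True,       # lexical weights for sparse retrieval
    return_colbert_vecs=False,
)

print(out["dense_vecs"].shape)    # dense embeddings, 1024-d for bge-m3
print(out["lexical_weights"][0])  # token -> weight mapping for sentence 0
```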
4. stella_en_1.5B_v5
What is it?
A compact, English-only embedding model (~1.5B parameters) built on top of the Alibaba-NLP/gte-large-en-v1.5 and Alibaba-NLP/gte-Qwen2-1.5B-instruct models. It produces 1,024-d embeddings by default, but supports Matryoshka for other dimensions.
What is its license?
MIT License.
Who should use it?
Teams with limited GPU resources or CPU-only environments that still need strong English retrieval.
What are the trade-offs?
It is English-only, and its smaller capacity means it lags behind 7B-scale models in raw accuracy.
5. embeddinggemma-300m
What is it?
A 300M-parameter open embedding model from Google, built on Gemma 3 and T5Gemma, designed for search, retrieval, and semantic similarity across 100+ languages. Its smaller size makes it suitable for resource-limited hardware like phones or laptops.
What is its license?
Apache-2.0.
Who should use it?
Teams that want to balance cost and performance, including deployments on resource-limited hardware.
What are the trade-offs?
The smaller size of this model means that it will be less accurate than the largest state-of-the-art embedding models.
Domain-Specific Embedding Models
While the MTEB leaderboard is dominated by general-purpose models, specialized domains benefit from embeddings tuned on in-domain corpora. Here are some of the top ones.
- Medicine: PubMedBERT is trained on biomedical literature, making it well-suited for tasks in healthcare and biomedical research. BioLORD is another model tailored for similar applications.
- Finance: Finance Embeddings from Investopedia, Voyage Finance, and BGE Base Financial Matryoshka are examples of models fine-tuned on financial datasets, offering improved performance for tasks such as sentiment analysis of financial news or SEC filings.
- Law: For legal applications, see Domain-Specific Embeddings and Retrieval: Legal Edition, which covers models fine-tuned on legal documents for legal research, contract analysis, and other law-related NLP tasks.
- Code: CodeBERT and GraphCodeBERT are designed specifically for programming language understanding, making them useful for code search, code completion, and bug detection tasks.
- Math: Math Similarity Model is tailored for formula-aware embeddings and captures mathematical structure in LaTeX (or other symbolic formats). This can be useful for research search engines and technical Q&A systems.
- Language-Specific: Beyond English, strong monolingual models exist, such as RoSEtta-base-ja (Japanese), KoSimCSE-roberta (Korean), GTE-Qwen2-7B-instruct (Chinese), Sentence-Camembert-large (French), and Arabic-STS-Matryoshka (Arabic).
Closing Thoughts
The MTEB leaderboard has grown into the most comprehensive benchmark for embedding models, covering classification, clustering, retrieval, and more. Its overall score is a useful signal, but production systems succeed when engineers look deeper. Which tasks matter? What cost constraints do you have? Does it make sense to use a domain-specific model? The right approach is to use MTEB to narrow your options, and then benchmark on your own dataset.
Looking for an easy way to deploy open-source text embedding models on GPUs? Check out Modal’s text embedding tutorial here.