October 30, 2025 · 5 minute read

Top embedding models on the MTEB leaderboard

Yiren Lu (@YirenLu), Solutions Engineer

The Hugging Face MTEB leaderboard has become a standard way to compare embedding models. But the rankings are volatile. New submissions constantly reshuffle the order, and the overall score often hides which models are actually strongest for a given task (e.g., classification, semantic similarity).

As a team building scalable and serverless AI infrastructure, we wanted to create a guide that helps cut through that noise. We’ll break down how to read MTEB scores, highlight which open-weight models stand out today, and show where domain-specific models like those tuned for finance or law deliver better results than general-purpose ones.

What is the MTEB leaderboard?

The Massive Text Embedding Benchmark (MTEB) evaluates models across eight categories:

  1. Classification
  2. Clustering
  3. Pair classification
  4. Reranking
  5. Retrieval
  6. Semantic textual similarity (STS)
  7. Summarization
  8. Bitext Mining

Each model gets a score for every category plus an overall average. The overall score is a useful headline number, but it doesn't tell the whole story.

For example, a model tuned for retrieval and semantic textual similarity (the two categories most correlated with production performance in RAG and search) may underperform on clustering or classification, which brings down the average. Conversely, a model that performs steadily across all tasks (but not exceptionally) can end up with a higher average while being weaker at retrieval.

In other words, the best overall model is not always the top choice for your workload. A team that builds retrieval pipelines should focus on retrieval and semantic textual similarity over the global average.
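
If you want the per-task numbers for a candidate model rather than the blended average, you can run it against only the categories you care about. Below is a minimal sketch using the open-source `mteb` package with `sentence-transformers`; the model and task names are illustrative, and the exact result-object fields may vary between `mteb` versions.

```python
# Evaluate an embedding model on specific retrieval / STS tasks,
# instead of relying on the blended leaderboard average.
import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")  # any SentenceTransformer-compatible model

# Pick the tasks that match your workload (names here are illustrative).
tasks = mteb.get_tasks(tasks=["NFCorpus", "STSBenchmark"])

evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results/bge-m3")

for res in results:
    # Each result holds per-split scores for one task.
    print(res.task_name, res.scores)
```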

The next section goes over the factors that matter most when choosing a model.

How to Choose a Model

Selecting an embedding model should be driven by context. The following factors matter far more than a model's exact position on the leaderboard.

Task Relevance

Not all models are trained with the same end use case in mind. A model that clusters documents cleanly might fail miserably at ranking passages for retrieval, and a model that excels at pair classification could still struggle when used for unsupervised grouping.

The danger of relying on the overall score is that it blends these strengths and weaknesses, which conceals the trade-offs that matter in practice. By aligning model choice with the category that matches your workload, you give the embeddings the best chance of capturing the relevant information.

The table below shows which factors are most important across different MTEB task categories and how they map to common use cases.

| Task Category | What Matters Most for Models | Example Use Cases |
| --- | --- | --- |
| Classification | Discriminative features, stable margins between classes | Sentiment analysis, spam detection, topic tagging |
| Clustering | Global structure, ability to group related items without labels | Customer segmentation, document deduplication, theme discovery |
| Pair classification | Capturing entailment and contradiction relationships | Duplicate bug reports, Q&A pair matching, natural language inference (NLI) |
| Reranking | Sensitivity to fine differences in candidate orderings | Search pipelines with candidate filtering before cross-encoder re-ranking |
| Retrieval | Fine-grained semantic similarity, context handling, ranking consistency | RAG, semantic search, question answering |
| Semantic textual similarity (STS) | Sentence-level meaning preservation, robustness to phrasing variations | Duplicate detection, paraphrase identification |
| Summarization | Preserving long-form semantics, capturing discourse-level relationships | Abstractive QA pipelines, document summarization, report generation |
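
To ground the retrieval row, here is a minimal semantic-search sketch using `sentence-transformers`; the model, documents, and query below are placeholders rather than anything taken from the leaderboard.

```python
# Minimal retrieval sketch: embed documents once, then rank them for a query.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("BAAI/bge-m3")  # placeholder model

docs = [
    "How to rotate API keys safely",
    "Quarterly revenue grew 12% year over year",
    "Steps to reproduce the null-pointer crash",
]
doc_embeddings = model.encode(docs, convert_to_tensor=True, normalize_embeddings=True)

query = "our sales increased this quarter"
query_embedding = model.encode(query, convert_to_tensor=True, normalize_embeddings=True)

# Cosine-similarity search over the document embeddings.
hits = util.semantic_search(query_embedding, doc_embeddings, top_k=2)[0]
for hit in hits:
    print(round(hit["score"], 3), docs[hit["corpus_id"]])
```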

Computational Requirements

Larger models often produce higher-quality embeddings. However, that quality comes at a cost: more GPU memory, slower inference, and higher infrastructure expenses.

Smaller models, especially those with compression techniques like Matryoshka representation learning, can offer a better balance. By producing embeddings at multiple dimensionalities, they allow teams to trade vector size for speed and memory savings (without having to retrain from scratch).

For workloads constrained by hardware or operating over very large indexes, it’s worth weighing model size and throughput first, before worrying about a gap on the leaderboard.
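
As a rough sketch of that trade-off, newer versions of `sentence-transformers` can truncate a Matryoshka-trained model's vectors at load time. The model shown here (nomic-ai/nomic-embed-text-v1.5) is just one commonly cited MRL-trained example and is not drawn from the leaderboard discussion above; substitute whichever model you're evaluating.

```python
# Sketch: trade embedding width for memory and search speed by truncating
# a Matryoshka-trained model's vectors at load time.
from sentence_transformers import SentenceTransformer

MODEL = "nomic-ai/nomic-embed-text-v1.5"  # example MRL-trained model

full = SentenceTransformer(MODEL, trust_remote_code=True)
small = SentenceTransformer(MODEL, trust_remote_code=True, truncate_dim=256)  # needs sentence-transformers >= 2.7

texts = ["vector stores charge by stored dimensions"]
print(full.encode(texts).shape)   # full-width embeddings, e.g. (1, 768)
print(small.encode(texts).shape)  # truncated embeddings, (1, 256)
```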

Domain Relevance


When the workload involves specialized languages like biomedical text, financial filings, or source code, domain-specific models almost always outperform general-purpose ones.

The reason is simple. They are fine-tuned on terminology, structures, and conventions that generalist models can’t capture. A biomedical model will understand MeSH terms and clinical shorthand. A code model will capture syntax and structure in its embeddings.

General-purpose embeddings are strong baselines, but when accuracy in a specific domain is a goal, in-domain training is the only way to deliver truly relevant results.

Licensing and Deployment

Not all open-weight models can be used commercially. Some are released under research-only licenses (e.g., CC BY-NC 4.0), while others are available under permissive licenses (e.g., Apache 2.0).

Deployment constraints also vary. Some models are designed for GPU inference at scale, while others are optimized for CPU environments or variable embedding dimensions to fit tighter hardware budgets.

Before embedding sensitive data, verify both the license terms and the deployment requirements of the model you’re choosing.

Evaluation on Your Own Data

No benchmark can fully capture the nuances of a specific dataset. Document style, query phrasing, and domain vocabulary all interact in ways that shape retrieval quality. The MTEB leaderboards are a solid starting point, but the decisive factor should always be how a model performs on your corpus.

Running small-scale evaluations with metrics like Mean Reciprocal Rank (MRR) and NDCG provides real data on how a model fits your use case. These metrics reveal not only whether a model retrieves the right documents, but also whether it does so consistently.
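
Here is a toy sketch of such an evaluation, computing MRR and NDCG@k by hand for a tiny labeled set; the model, corpus, queries, and relevance labels are all placeholders for your own data.

```python
# Toy evaluation: for each query we know the one relevant document, and we
# measure where the model ranks it (MRR) and how well it orders results (NDCG@k).
import numpy as np
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("BAAI/bge-m3")  # placeholder model

corpus = ["reset your password", "invoice for march", "gpu out of memory error"]
queries = ["how do I change my login credentials", "why did my job crash on the gpu"]
relevant = [0, 2]  # index of the gold document for each query

doc_emb = model.encode(corpus, convert_to_tensor=True, normalize_embeddings=True)
q_emb = model.encode(queries, convert_to_tensor=True, normalize_embeddings=True)

def ndcg_at_k(ranked_ids, gold_id, k=3):
    # Binary relevance with a single gold document: the ideal DCG is 1.0,
    # so DCG equals NDCG here.
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id == gold_id:
            return 1.0 / np.log2(rank + 1)
    return 0.0

mrr, ndcg = [], []
for i, hits in enumerate(util.semantic_search(q_emb, doc_emb, top_k=len(corpus))):
    ranked = [h["corpus_id"] for h in hits]
    rank = ranked.index(relevant[i]) + 1
    mrr.append(1.0 / rank)
    ndcg.append(ndcg_at_k(ranked, relevant[i]))

print(f"MRR: {np.mean(mrr):.3f}  NDCG@3: {np.mean(ndcg):.3f}")
```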

With these factors in mind, let’s look at the models that have consistently performed well on the MTEB leaderboard as of 2025.

Top 5 Models on the MTEB Leaderboard

As of 2025, here are some of the top open-weight models on the MTEB leaderboard and their backgrounds.

1. Qwen3-Embedding-8B

What is it?

The largest of a family of new embedding models built on top of Qwen3, also available in 4B and 0.6B sizes. It outperforms the previous generation of Qwen embedding models (e.g., gte-Qwen2-7B-instruct) on MTEB benchmarks and ranks high on both the multilingual and English-only MTEB leaderboards.

What is its license?

Apache-2.0 (commercial use permitted).

Who should use it?

Teams that require strong multilingual support or state-of-the-art performance on retrieval, classification, or semantic similarity tasks. The model is particularly strong at long-text understanding.

What are the trade-offs?

VRAM-heavy at 8B parameters, and therefore more costly and slower to run. Smaller 4B and 0.6B versions exist if that’s a constraint.
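
A minimal usage sketch with `sentence-transformers`, following the pattern shown on the model card (the query prompt name is taken from that card and may change, so double-check it before relying on it):

```python
# Sketch: encode queries and documents with Qwen3-Embedding-8B.
# The 8B checkpoint needs a large GPU; swap in Qwen/Qwen3-Embedding-0.6B to test locally.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qwen/Qwen3-Embedding-8B")

queries = ["What is the capital of France?"]
documents = ["Paris is the capital and largest city of France."]

# The model card recommends an instruction-style prompt for queries only.
query_emb = model.encode(queries, prompt_name="query")
doc_emb = model.encode(documents)

print(model.similarity(query_emb, doc_emb))
```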

2. llama-embed-nemotron-8b

What is it?

Released in October 2025, this is the latest embedding model from NVIDIA. It is fine-tuned from Llama-3.1-8B and is particularly powerful at understanding multilingual text.

What is its license?

Customized-nscl-v1 (non-commercial only).

Who should use it?

Researchers building applications that need text understanding, especially multilingual RAG systems.

What are the trade-offs?

It cannot be used commercially.

3. bge-m3

What is it?

A versatile text embedding model released in 2024 by the Beijing Academy of Artificial Intelligence (BAAI). It is multilingual, supports long inputs, and supports multiple retrieval methods (dense, sparse, multi-vector).

What is its license?

MIT License (also permissive for commercial use).

Who should use it?

Teams that want a production-ready, open-weight model for retrieval/search pipelines.

What are the trade-offs?

Since it was released in 2024, newer models may outperform it on your specific tasks.
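
A sketch of its multi-mode output using BAAI’s `FlagEmbedding` package; the call below follows the model card and may shift between library versions:

```python
# Sketch: bge-m3 can return dense, sparse (lexical), and multi-vector
# (ColBERT-style) representations from a single encode call.
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

sentences = ["What is BGE M3?", "BM25 is a bag-of-words retrieval function."]
out = model.encode(
    sentences,
    return_dense=True,
    return_sparse=True,
    return_colbert_vecs=True,
)

print(out["dense_vecs"].shape)       # dense embeddings, e.g. (2, 1024)
print(out["lexical_weights"][0])     # sparse token -> weight mapping
print(out["colbert_vecs"][0].shape)  # per-token multi-vector representation
```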

4. stella_en_1.5B_v5

What is it?

A compact, English-only embedding model (~1.5B parameters) built on top of the Alibaba-NLP/gte-large-en-v1.5 and Alibaba-NLP/gte-Qwen2-1.5B-instruct models. It produces 1,024-dimensional embeddings by default but supports Matryoshka-style truncation to other dimensions.

What is its license?

MIT License.

Who should use it?

Teams with limited GPU resources or CPU-only environments that still need strong English retrieval.

What are the trade-offs?

It is English-only, and its smaller capacity means it lags behind 7B-class models in raw accuracy.

5. embeddinggemma-300m

What is it?

A 300M-parameter open embedding model from Google, built on Gemma 3 and T5Gemma, designed for search, retrieval, and semantic similarity across 100+ languages. Its smaller size makes it suitable for resource-limited hardware like phones or laptops.

What is its license?

Apache-2.0.

Who should use it?

Teams that want to balance cost and performance, including on-device or CPU-only deployments.

What are the trade-offs?

Its smaller size means it will be less accurate than the largest state-of-the-art embedding models.
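
A hedged sketch of running it on CPU with `sentence-transformers` (the model id follows the Hugging Face release; you may need to accept the Gemma license on Hugging Face before downloading):

```python
# Sketch: EmbeddingGemma is small enough to run on CPU-only machines.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("google/embeddinggemma-300m", device="cpu")

emb = model.encode(["¿Dónde está la biblioteca?", "Where is the library?"])
print(emb.shape)                           # e.g. (2, 768)
print(model.similarity(emb[:1], emb[1:]))  # cross-lingual similarity
```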

Domain-Specific Embedding Models

While the MTEB leaderboard is dominated by general-purpose models, specialized domains benefit from embeddings tuned on in-domain corpora, such as models fine-tuned for biomedical literature, financial filings, legal text, or source code.

Closing Thoughts

The MTEB leaderboard has grown into the most comprehensive benchmark for embedding models, covering classification, clustering, retrieval, and more. Its overall score is a useful signal, but production systems succeed when engineers look deeper. Which tasks matter? What cost constraints do you have? Does it make sense to look into a domain-specific model? The right approach is to use MTEB to narrow your options, and then benchmark on your own dataset.

Looking for an easy way to deploy open-source text embedding models on GPUs? Check out Modal’s text embedding tutorial here.
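
As a rough illustration (not the tutorial’s exact code), a GPU-backed embedding function on Modal can be sketched like this; the model and GPU type are placeholders:

```python
# Hypothetical sketch of serving an embedding model on Modal.
import modal

app = modal.App("embedding-sketch")

image = modal.Image.debian_slim().pip_install("sentence-transformers")

@app.function(image=image, gpu="A10G")
def embed(texts: list[str]) -> list[list[float]]:
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("BAAI/bge-m3")  # placeholder model
    return model.encode(texts).tolist()

@app.local_entrypoint()
def main():
    vectors = embed.remote(["Serverless GPUs for embeddings"])
    print(len(vectors[0]), "dimensions")
```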
