@inproceedings{muennighoff-etal-2023-mteb,
    title = "{MTEB}: Massive Text Embedding Benchmark",
    author = "Muennighoff, Niklas and
      Tazi, Nouamane and
      Magne, Loic and
      Reimers, Nils",
    editor = "Vlachos, Andreas and
      Augenstein, Isabelle",
    booktitle = "Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics",
    month = may,
    year = "2023",
    address = "Dubrovnik, Croatia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.eacl-main.148/",
    doi = "10.18653/v1/2023.eacl-main.148",
    pages = "2014--2037",
    abstract = "Text embeddings are commonly evaluated on a small set of datasets from a single task, not covering their possible applications to other tasks. It is unclear whether state-of-the-art embeddings on semantic textual similarity (STS) can be equally well applied to other tasks like clustering or reranking. This makes progress in the field difficult to track, as various models are constantly being proposed without proper evaluation. To solve this problem, we introduce the Massive Text Embedding Benchmark (MTEB). MTEB spans 8 embedding tasks covering a total of 58 datasets and 112 languages. Through the benchmarking of 33 models on MTEB, we establish the most comprehensive benchmark of text embeddings to date. We find that no particular text embedding method dominates across all tasks. This suggests that the field has yet to converge on a universal text embedding method and scale it up sufficiently to provide state-of-the-art results on all embedding tasks. MTEB comes with open-source code and a public leaderboard at \url{https://github.com/embeddings-benchmark/mteb}."
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
  <mods ID="muennighoff-etal-2023-mteb">
    <titleInfo>
      <title>MTEB: Massive Text Embedding Benchmark</title>
    </titleInfo>
    <name type="personal">
      <namePart type="given">Niklas</namePart>
      <namePart type="family">Muennighoff</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Nouamane</namePart>
      <namePart type="family">Tazi</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Loic</namePart>
      <namePart type="family">Magne</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Nils</namePart>
      <namePart type="family">Reimers</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <originInfo>
      <dateIssued>2023-05</dateIssued>
    </originInfo>
    <typeOfResource>text</typeOfResource>
    <relatedItem type="host">
      <titleInfo>
        <title>Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics</title>
      </titleInfo>
      <name type="personal">
        <namePart type="given">Andreas</namePart>
        <namePart type="family">Vlachos</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Isabelle</namePart>
        <namePart type="family">Augenstein</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <originInfo>
        <publisher>Association for Computational Linguistics</publisher>
        <place>
          <placeTerm type="text">Dubrovnik, Croatia</placeTerm>
        </place>
      </originInfo>
      <genre authority="marcgt">conference publication</genre>
    </relatedItem>
    <abstract>Text embeddings are commonly evaluated on a small set of datasets from a single task, not covering their possible applications to other tasks. It is unclear whether state-of-the-art embeddings on semantic textual similarity (STS) can be equally well applied to other tasks like clustering or reranking. This makes progress in the field difficult to track, as various models are constantly being proposed without proper evaluation. To solve this problem, we introduce the Massive Text Embedding Benchmark (MTEB). MTEB spans 8 embedding tasks covering a total of 58 datasets and 112 languages. Through the benchmarking of 33 models on MTEB, we establish the most comprehensive benchmark of text embeddings to date. We find that no particular text embedding method dominates across all tasks. This suggests that the field has yet to converge on a universal text embedding method and scale it up sufficiently to provide state-of-the-art results on all embedding tasks. MTEB comes with open-source code and a public leaderboard at https://github.com/embeddings-benchmark/mteb.</abstract>
    <identifier type="citekey">muennighoff-etal-2023-mteb</identifier>
    <identifier type="doi">10.18653/v1/2023.eacl-main.148</identifier>
    <location>
      <url>https://aclanthology.org/2023.eacl-main.148/</url>
    </location>
    <part>
      <date>2023-05</date>
      <extent unit="page">
        <start>2014</start>
        <end>2037</end>
      </extent>
    </part>
  </mods>
</modsCollection>
%0 Conference Proceedings
%T MTEB: Massive Text Embedding Benchmark
%A Muennighoff, Niklas
%A Tazi, Nouamane
%A Magne, Loic
%A Reimers, Nils
%Y Vlachos, Andreas
%Y Augenstein, Isabelle
%S Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics
%D 2023
%8 May
%I Association for Computational Linguistics
%C Dubrovnik, Croatia
%F muennighoff-etal-2023-mteb
%X Text embeddings are commonly evaluated on a small set of datasets from a single task, not covering their possible applications to other tasks. It is unclear whether state-of-the-art embeddings on semantic textual similarity (STS) can be equally well applied to other tasks like clustering or reranking. This makes progress in the field difficult to track, as various models are constantly being proposed without proper evaluation. To solve this problem, we introduce the Massive Text Embedding Benchmark (MTEB). MTEB spans 8 embedding tasks covering a total of 58 datasets and 112 languages. Through the benchmarking of 33 models on MTEB, we establish the most comprehensive benchmark of text embeddings to date. We find that no particular text embedding method dominates across all tasks. This suggests that the field has yet to converge on a universal text embedding method and scale it up sufficiently to provide state-of-the-art results on all embedding tasks. MTEB comes with open-source code and a public leaderboard at https://github.com/embeddings-benchmark/mteb.
%R 10.18653/v1/2023.eacl-main.148
%U https://aclanthology.org/2023.eacl-main.148/
%U https://doi.org/10.18653/v1/2023.eacl-main.148
%P 2014-2037
Markdown (Informal)
[MTEB: Massive Text Embedding Benchmark](https://aclanthology.org/2023.eacl-main.148/) (Muennighoff et al., EACL 2023)

ACL
Niklas Muennighoff, Nouamane Tazi, Loic Magne, and Nils Reimers. 2023. MTEB: Massive Text Embedding Benchmark. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 2014–2037, Dubrovnik, Croatia. Association for Computational Linguistics.
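
The abstract points to the benchmark's open-source code at https://github.com/embeddings-benchmark/mteb. As a minimal sketch of what an evaluation run looks like with the `mteb` Python package (assuming the README-style `MTEB` class and `run` method; the model and task names below are illustrative examples, and the exact API may differ across package versions):

# Minimal sketch: evaluate a sentence-transformers embedding model on one MTEB task.
# Assumes `pip install mteb sentence-transformers`; names below are example choices.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model_name = "average_word_embeddings_komninos"  # example model; any embedding model works
model = SentenceTransformer(model_name)

# Restrict the run to a single classification task; omitting `tasks`
# runs the full benchmark (58 datasets across 8 task types, per the paper).
evaluation = MTEB(tasks=["Banking77Classification"])

# Writes per-task JSON result files that can be submitted to the public leaderboard.
evaluation.run(model, output_folder=f"results/{model_name}")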