Evaluating Embedding APIs for Information Retrieval

Ehsan Kamalloo, Xinyu Zhang, Odunayo Ogundepo, Nandan Thakur, David Alfonso-hermelo, Mehdi Rezagholizadeh, Jimmy Lin


Abstract
The ever-increasing size of language models curtails their widespread access to the community, thereby galvanizing many companies and startups into offering access to large language models through APIs. One particular API, suitable for dense retrieval, is the semantic embedding API that builds vector representations of a given text. With a growing number of APIs at our disposal, in this paper, our goal is to analyze semantic embedding APIs in realistic retrieval scenarios in order to assist practitioners and researchers in finding suitable services according to their needs. Specifically, we wish to investigate the capabilities of existing APIs on domain generalization and multilingual retrieval. For this purpose, we evaluate the embedding APIs on two standard benchmarks, BEIR and MIRACL. We find that re-ranking BM25 results using the APIs is a budget-friendly approach and is most effective on English, in contrast to the standard practice, i.e., employing them as first-stage retrievers. For non-English retrieval, re-ranking still improves the results, but a hybrid model with BM25 works best, albeit at a higher cost. We hope our work lays the groundwork for thoroughly evaluating APIs that are critical in search and, more broadly, in information retrieval.
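The two strategies named in the abstract, re-ranking BM25 candidates with embeddings and combining BM25 with dense scores in a hybrid, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: `toy_embed` is a hypothetical stand-in for a call to an embedding API, cosine similarity is assumed as the scoring function, and the hybrid is assumed to be a linear interpolation of min-max normalized scores with a weight `alpha`. The fusion method actually used by the authors is not stated in the abstract.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rerank(query: str, candidates: Dict[str, str],
           embed: Callable[[str], Vector]) -> Dict[str, float]:
    """Score only the BM25 candidates, so just these texts need embedding."""
    q = embed(query)
    return {doc_id: cosine(q, embed(text)) for doc_id, text in candidates.items()}


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so BM25 and dense scores are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def hybrid(bm25: Dict[str, float], dense: Dict[str, float],
           alpha: float = 0.5) -> Dict[str, float]:
    """Assumed fusion: alpha * BM25 + (1 - alpha) * dense on normalized scores.

    A document missing from one list contributes 0.0 for that component.
    """
    b, d = min_max(bm25), min_max(dense)
    return {doc_id: alpha * b.get(doc_id, 0.0) + (1 - alpha) * d.get(doc_id, 0.0)
            for doc_id in set(b) | set(d)}


def toy_embed(text: str) -> List[float]:
    """Hypothetical stand-in for an embedding API: a letter-count vector."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec
```

In this sketch, re-ranking embeds only the query and the candidates returned by BM25, which is consistent with the abstract calling it budget-friendly. A hybrid as described in the abstract would additionally need dense first-stage retrieval over the whole corpus, which the sketch does not implement; `hybrid` only fuses two score lists that are already available.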
Anthology ID:
2023.acl-industry.50
Volume:
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 5: Industry Track)
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Sunayana Sitaram, Beata Beigman Klebanov, Jason D Williams
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
518–526
URL:
https://aclanthology.org/2023.acl-industry.50
DOI:
10.18653/v1/2023.acl-industry.50
Cite (ACL):
Ehsan Kamalloo, Xinyu Zhang, Odunayo Ogundepo, Nandan Thakur, David Alfonso-hermelo, Mehdi Rezagholizadeh, and Jimmy Lin. 2023. Evaluating Embedding APIs for Information Retrieval. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 5: Industry Track), pages 518–526, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
Evaluating Embedding APIs for Information Retrieval (Kamalloo et al., ACL 2023)
PDF:
https://aclanthology.org/2023.acl-industry.50.pdf