Evaluating Embedding APIs for Information Retrieval

The ever-increasing size of language models limits the community's widespread access to them, galvanizing many companies and startups into offering access to large language models through APIs. One particular API, suitable for dense retrieval, is the semantic embedding API, which builds vector representations of a given text. With a growing number of APIs at our disposal, our goal in this paper is to analyze semantic embedding APIs in realistic retrieval scenarios in order to assist practitioners and researchers in finding services suitable for their needs. Specifically, we investigate the capabilities of existing APIs in domain generalization and multilingual retrieval. For this purpose, we evaluate the embedding APIs on two standard benchmarks, BEIR and MIRACL. We find that re-ranking BM25 results using the APIs is a budget-friendly approach and is most effective in English, in contrast to the standard practice of employing them as first-stage retrievers. For non-English retrieval, re-ranking still improves the results, but a hybrid model with BM25 works best, albeit at a higher cost. We hope our work lays the groundwork for thoroughly evaluating APIs that are critical in search and, more broadly, in information retrieval.


Introduction
Language models (LMs), pre-trained on massive amounts of text, have empowered dense retrieval models in ad hoc retrieval (Lin et al., 2021b). Dense retrievers (Lee et al. 2019; Karpukhin et al. 2020; Xiong et al. 2021; Khattab and Zaharia 2020; Hofstätter et al. 2021; Izacard et al. 2022; inter alia) essentially measure relevance via the similarity between the representations of documents and queries. As LMs rapidly scale up to gigantic models (Radford et al. 2019; Brown et al. 2020; Lieber et al. 2021; Chowdhery et al. 2022; Smith et al. 2022; inter alia), their use as the backbone of dense retrieval models has become limited, primarily because large language models (LLMs) are computationally expensive and deploying them on commodity hardware is cumbersome and often impossible. To alleviate this problem, many companies, e.g., OpenAI and Cohere, have set out to offer access to their proprietary LLMs through families of APIs. For dense retrieval, semantic embedding APIs are designed to provide LLM representations for queries as well as documents. These APIs are especially appealing in the IR ecosystem because they afford practitioners and researchers the benefits of scale and allow for a wider outreach of LLMs in IR. However, although the surge of companies offering such APIs with various model sizes has given us more options, the lack of a thorough analysis of these APIs makes it difficult to determine one's best option for a given use case. Moreover, LLM-based APIs are often expensive, and experimenting with all of them to find the most suitable one is not reasonable.
In this paper, we aim to analyze embedding APIs for various realistic scenarios in ad hoc retrieval. To this end, we select three embedding APIs available on the market, i.e., OpenAI, Cohere, and Aleph-Alpha, and assess their usability and effectiveness in two crucial directions that stand at the core of most IR applications.
First, we study domain generalization, where retrieval is conducted over collections drawn from a broad range of domains. Understanding for which domains embedding APIs work well or poorly elucidates their limitations while setting the stage for their wide adoption in the domains where they succeed. We leverage the widely adopted BEIR benchmark (Thakur et al., 2021) for this purpose. On BEIR, we resort to using the APIs as re-rankers on top of BM25 retrieved documents because the large size of the document collections in BEIR makes full ranking via the APIs impractical. Our results suggest that embedding APIs are reasonably apt re-rankers in most domains, highlighting that re-ranking is not only budget-friendly but also effective. However, we find that they struggle on datasets that were collected based on lexical matching. In particular, BM25 outperforms the full-fledged embedding APIs on BioASQ (bio-medical retrieval) and Signal-1M (tweet retrieval).
We also explore the capabilities of embedding APIs in multilingual retrieval, where they are tested against several non-English languages, ranging from low-resource to high-resource. More precisely, we use MIRACL, a large-scale multilingual retrieval benchmark that spans 18 diverse languages. The manageable size of its corpora allows us to evaluate the APIs as full-rankers as well as re-rankers. We find that, unlike retrieval on English documents, the winning recipe for non-English retrieval is not re-ranking; instead, building hybrid models with BM25 yields the best results. Our results also indicate that the APIs are powerful on low-resource languages, whereas on high-resource languages, open-source models work better.
Overall, our findings offer insights into using embedding APIs in real-world scenarios through two crucial aspects of IR systems. We hope our work lays the groundwork for thoroughly evaluating APIs that are critical in search and, more broadly, in IR. In summary, our key contributions are:
• We extensively review the usability of commercial embedding APIs for realistic IR applications involving domain generalization and multilingual retrieval; and
• We provide insights on how to effectively use these APIs in practice.
Related Work

Dense Retrieval. While the paradigm has been around for a long time (Yih et al., 2011), the emergence of pre-trained LMs brought dense retrieval (Lee et al., 2019; Karpukhin et al., 2020) into the mainstream of IR. Recent dense retrieval models adopt a bi-encoder architecture and generally use contrastive learning to distinguish relevant documents from non-relevant ones (Lin et al., 2021b), similar to sentence embedding models. LMs have been shown to be an effective source from which to extract representations (Karpukhin et al., 2020; Xiong et al., 2021; Hofstätter et al., 2021; Khattab and Zaharia, 2020; Izacard and Grave, 2021; Izacard et al., 2022). This essentially means that, with LMs as the backbone and analogous objectives, dense retrievers and sentence embedders have become indistinguishable.

APIs
Semantic embedding APIs are generally based on a siamese architecture, commonly known as bi-encoders, in which queries and documents are fed to a fine-tuned LM in parallel (Seo et al., 2018; Reimers and Gurevych, 2019). The key ingredient of bi-encoders is contrastive learning, whose objective is to enable models to distinguish relevant documents from irrelevant ones. In our experiments, we adopt the following semantic embedding APIs:

Aleph-Alpha: The company has trained a family of multilingual LMs, named luminous, in three sizes: base (13B), extended (30B), and supreme (70B). luminous supports five high-resource languages: English, French, German, Italian, and Spanish. However, no information is available about the data on which these LMs were trained. We used luminous base, which projects text into 5120-dimension embedding vectors.
Cohere: The company offers LMs for semantic representations in two sizes, small (410M) and large (13.1B), generating 1024-dimension and 4096-dimension embedding vectors, respectively. The models are accompanied by model cards (Mitchell et al., 2019) at https://docs.cohere.ai/docs/representation-card. Cohere also provides a multilingual model, multilingual-22-12 (https://txt.cohere.ai/multilingual/), trained on a large multilingual collection comprising 100+ languages. The data consists of 1.4 billion question/answer pairs mined from the Web. The multilingual model maps text into 768-dimension embedding vectors.
OpenAI: The pioneering company behind GPT models (Radford et al., 2019; Brown et al., 2020; Ouyang et al., 2022) also offers an embedding service. We use the recommended second-generation model, text-embedding-ada-002 (350M; Neelakantan et al. 2022), which embeds text into a vector of 1536 dimensions. The model, initialized from a pre-trained GPT model, is fine-tuned on naturally occurring paired data with no explicit labels, mainly scraped from the Web, using contrastive learning with in-batch negatives.

These APIs all use Transformer-based language models (Vaswani et al., 2017) but differ from each other in various ways:
• Model architecture: The companies built their models in different sizes, denoting differences in the number of hidden layers, the number of attention heads, the dimension of the output layers, etc. Other subtle differences in the Transformer architecture are also likely, e.g., where layer normalization is applied in a Transformer layer (Xiong et al., 2020). Further differences lie in the vocabulary, owing to different tokenization methods, and in the pre-trained LM used to initialize these embedding models for fine-tuning.
• Training: While contrastive learning is at the core of these models, they may vary substantially in the details, e.g., the contrastive learning objective and negative sampling strategies. The choice of hyper-parameters, such as the number of training steps, learning rate, and optimizer, is another key differentiating factor.
• Data: Chief among the differences is the data on which the embedding models are trained. The data, as OpenAI and Cohere state in their documentation, is mostly mined from the Web, but the details of the data curation process remain largely unknown. In addition, considering that each company has its own LLM, the difference in pre-training corpora is yet another key difference in the ingredients of the complex process of building such APIs.
These differences may potentially lead to substantial differences in the overall effectiveness of the embedding APIs. Nevertheless, due to the non-disclosure of several details by API providers, it remains challenging to identify the specific factors that contribute to the strengths and weaknesses of the embedding models. Yet, as the number of such APIs continues to grow, we believe that high-level comparisons on standard benchmarks can provide valuable insights into how well these models operate under various practical scenarios. For practitioners building systems on top of these APIs, this comparative analysis is useful, as they are primarily interested in the end-to-end efficacy of the APIs and are often not concerned with their minutiae.

Usability
One of the vital advantages of embedding APIs is their ease of use. For IR applications, even running an LLM to encode large document collections requires hefty resources, let alone training a retrieval model. Thus, the emergence of such APIs makes LLMs more accessible to the community and paves the way for faster development of IR systems. However, these advantages rest on the usability of the APIs. In this section, we briefly review the usability of the embedding APIs.
Setup. Basic information on how to set up one's environment is the first step toward using an embedding API. All three companies provide detailed introductory documentation for this purpose. The procedure is identical for all three: one creates an account and generates an API key for authentication. The companies also furnish a web interface that enables users to monitor their usage history and available credit, in addition to configuring limits to prevent unintended charges. One notable difference is Cohere's text truncation feature for input that exceeds the length limit. OpenAI and Aleph-Alpha raise an error in this case, meaning that API users need to implement additional checks to avoid such exceptions. Cohere's API, on the other hand, truncates text from the left or right, and can also provide an average embedding for long texts of up to 4096 tokens by averaging over 512-token spans.
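As a minimal sketch, the snippet below authenticates against two of the services and handles over-long input; the environment-variable names, model choice, and error handling are illustrative assumptions based on the Python client libraries as of early 2023, not prescriptive usage (Aleph-Alpha is omitted for brevity).

```python
import os

import cohere
import openai

# Keys are generated in each provider's web dashboard (illustrative env names).
openai.api_key = os.environ["OPENAI_API_KEY"]
co = cohere.Client(os.environ["COHERE_API_KEY"])

text = "a passage that may exceed the model's input length limit ..."

# Cohere can truncate over-long inputs server-side (from the left or right).
cohere_emb = co.embed(texts=[text], truncate="RIGHT").embeddings[0]

# OpenAI raises an error on over-long inputs, so the caller must guard the call.
try:
    response = openai.Embedding.create(model="text-embedding-ada-002", input=[text])
    openai_emb = response["data"][0]["embedding"]
except openai.error.InvalidRequestError:
    openai_emb = None  # e.g., shorten the text ourselves and retry
```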
Documentation. All three companies provide a technical API reference explaining the inputs, responses, and errors of their APIs. All three also provide tutorials and blog posts with examples of how to use their client libraries.
Latency. The APIs are all offered with liberal rate limits, e.g., 3K requests per minute for OpenAI and 10K requests per minute for Cohere (we were not able to find the rate limits for Aleph-Alpha). We find that API calls are mostly reliable and that request service errors were sporadic. Each API call takes up to roughly 400ms, consistent across all three companies. However, latency presumably depends on the server workload, as we observed that API calls sometimes became slower. We also found that the call time depends on the input length: embedding queries runs faster than embedding documents. Moreover, Cohere's embed API supports bulk calls of up to 96 texts per call, whereas for OpenAI and Aleph-Alpha, only one text can be passed per API call. This bulk call feature considerably speeds up encoding document collections.
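To illustrate, a corpus can be encoded with Cohere's bulk endpoint as sketched below; the 96-text cap comes from the limit noted above, and the client usage reflects the early-2023 Python SDK (an assumption, not an official recipe).

```python
import cohere

co = cohere.Client("<api-key>")

def embed_corpus(texts, batch_size=96):
    """Embed a list of texts in batches of up to 96 texts per API call."""
    embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i : i + batch_size]
        embeddings.extend(co.embed(texts=batch, truncate="RIGHT").embeddings)
    return embeddings
```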
Cost. OpenAI and Aleph-Alpha charge based on the number of tokens and the model size: ada2 and luminous base cost $0.0004 and €0.078 ≈ $0.086 per 1,000 tokens, respectively (fees as of Feb 01, 2023, at €1.0 = $1.1). Cohere, on the other hand, follows a simplified plan, charging merely based on the number of API calls, i.e., $1.0 per 1,000 calls. Our re-ranking experiments on BEIR cost around $170 on OpenAI, whereas they would cost roughly $2,500 on Cohere. Also, the cost of our re-ranking experiments on MIRACL for three languages (German, Spanish, and French) hovers around €116 ≈ $128 using Aleph-Alpha and Cohere. Cohere also offers free-tier access with a restricted rate limit of 100 calls per minute, which we opted for, albeit at the expense of speed.
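To make the two pricing schemes concrete, here is a back-of-the-envelope comparison using the fees quoted above; the workload numbers are hypothetical, and we assume one text per Cohere call (bulk calls may be billed differently).

```python
# Rough cost comparison under the two pricing schemes (Feb 2023 fees).
def openai_cost(num_texts, avg_tokens, usd_per_1k_tokens=0.0004):
    # Token-based pricing: total tokens / 1,000 * price per 1,000 tokens.
    return num_texts * avg_tokens / 1000 * usd_per_1k_tokens

def cohere_cost(num_calls, usd_per_1k_calls=1.0):
    # Call-based pricing: total calls / 1,000 * price per 1,000 calls.
    return num_calls / 1000 * usd_per_1k_calls

num_texts = 10_000 * 100  # e.g., re-ranking 10K queries x 100 passages each
print(f"OpenAI ada2: ${openai_cost(num_texts, avg_tokens=250):,.2f}")  # $100.00
print(f"Cohere:      ${cohere_cost(num_texts):,.2f}")                  # $1,000.00
```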

Experiments
In this section, our main goal is to evaluate embedding APIs in two real-world scenarios that often arise in IR applications: domain generalization and multilingual retrieval.

BEIR
We first evaluate the generalization capabilities of embedding APIs across a variety of domains. To this end, we measure their effectiveness on BEIR (Thakur et al., 2021), a heterogeneous evaluation benchmark intended to gauge the domain generalization of retrieval models. It consists of 18 retrieval datasets across 9 domains. Thakur et al. (2021) showed that BM25 is a strong baseline on BEIR surpassing most dense retrieval models.
We adopt the embedding APIs as a re-ranking component on top of BM25 retrieved results. Re-ranking is a more realistic scenario than full ranking because the number of documents to encode in re-ranking is commensurate with the number of test queries, which is orders of magnitude smaller than the collection size, usually comprising millions of documents. Thus, re-ranking is more efficient and cheaper than full ranking.
For BM25 retrieval, we use Anserini to index the corpora in the BEIR collection and retrieve the top-100 passages for each dataset. Then, queries and the retrieved passages are encoded using the embedding APIs, and we reorder the retrieved passages based on the similarity between the query embeddings and those of the passages, as sketched below.
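The re-ranking step itself is straightforward; the sketch below reorders a BM25 candidate list by cosine similarity between embeddings, where `embed` is a stand-in for any of the embedding APIs above and the candidate format is an illustrative assumption.

```python
import numpy as np

def rerank(query_text, candidates, embed):
    """Reorder BM25 candidates, given as (doc_id, text) pairs, by cosine
    similarity between the query embedding and each passage embedding."""
    q = np.asarray(embed([query_text])[0])
    d = np.asarray(embed([text for _, text in candidates]))
    # Cosine similarity between the query and each candidate passage.
    scores = d @ q / (np.linalg.norm(d, axis=1) * np.linalg.norm(q) + 1e-9)
    order = np.argsort(-scores)
    return [(candidates[i][0], float(scores[i])) for i in order]
```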
The results are presented in Table 1; we did not test Aleph-Alpha's luminous on BEIR due to budget constraints.

[Table 1: results on BEIR, grouped by task and domain. Columns: full-ranking with BM25, TASB, and cpt-S; BM25 top-100 re-ranking with TASB, Cohere large, Cohere small, and OpenAI ada2.]

TASB re-ranking results show a +4% increase over TASB full-ranking on average, denoting that re-ranking via bi-encoder models is indeed a viable method. Moreover, OpenAI's ada2 is the best-performing model, surpassing TASB and Cohere large by +4.7% and +2.4% on average, respectively. However, Cohere large outperforms ada2 on 5 tasks. Specifically, Cohere large scores the highest nDCG@10 on NQ (question answering), SCIDOCS (citation prediction), Climate-FEVER (fact verification), and both duplicate question retrieval tasks, i.e., CQADupStack and Quora. Also, Cohere small trails Cohere large by 2.6% on average and is nearly on par with TASB. Another interesting observation is that BM25 leads all other models on 3 tasks, i.e., BioASQ, Signal-1M, and Touché-2020, which are datasets collected based on lexical matching, highlighting that embedding APIs struggle in finding lexical overlaps.

Multilingual Retrieval: MIRACL
We further assess the embedding APIs in the multilingual retrieval setting. Multilingual retrieval aims at building retrieval models that can operate in several languages while maintaining their effectiveness across languages. For this purpose, we use a large-scale multilingual retrieval benchmark, known as MIRACL, spanning 18 languages with more than 725K relevance judgments collected from native speakers. We test Cohere's multilingual model as well as Aleph-Alpha's luminous on MIRACL. OpenAI does not recommend using its embedding service for non-English documents, and thus their API was omitted from this experiment. Analogous to the previous experiment, we adopt a re-ranking strategy on the top-100 passages retrieved by BM25. For Cohere, we also carry out full-ranking retrieval to draw a comparison with first-stage retrieval models, and we construct a hybrid model out of BM25 and Cohere by interpolating their normalized retrieval scores. The baselines are BM25, mDPR, and the hybrid model mDPR+BM25; we reuse the indices provided in Pyserini (Lin et al., 2021a) to generate the baseline runs. For all models, we measure nDCG@10 and Recall@100.
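As an illustration of the hybrid model, the following sketch interpolates normalized BM25 and dense scores per query; min-max normalization and the equal weighting `alpha=0.5` are our own illustrative choices, since the text above only specifies that normalized scores are interpolated.

```python
def normalize(run):
    """Min-max normalize a {doc_id: score} run to the [0, 1] range."""
    lo, hi = min(run.values()), max(run.values())
    return {doc: (s - lo) / (hi - lo + 1e-9) for doc, s in run.items()}

def hybrid(bm25_run, dense_run, alpha=0.5):
    """Interpolate normalized BM25 and dense scores for each document."""
    b, d = normalize(bm25_run), normalize(dense_run)
    return {doc: alpha * b.get(doc, 0.0) + (1 - alpha) * d.get(doc, 0.0)
            for doc in set(b) | set(d)}
```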
The results on the MIRACL dev set are reported in Table 2. Re-ranking BM25 via Cohere yields better overall results (0.542) than full-ranking (0.512), consistent with our observations on BEIR. However, while the two re-ranking models, i.e., luminous and Cohere, surpass BM25 on all languages, they lag behind the full-ranking hybrid models. The results showcase that the winning recipe here is to build a hybrid model: first perform retrieval on the entire corpus and then combine the results with BM25. In particular, Cohere+BM25 achieves the highest average nDCG@10, outperforming the other models on 7 languages. The second-best model overall is the other hybrid model, mDPR+BM25, trailing Cohere+BM25 by 1.2%.
We further investigate how the models perform on low- and high-resource languages. To this end, we group the languages into low-resource, medium-resource, and high-resource (>600K) categories, and measure the average nDCG@10 and Recall@100 for each group. The results are visualized in Figure 1. The effectiveness of BM25 on low-resource languages is nearly on par with its effectiveness on high-resource languages. On the other hand, Cohere's hybrid and standalone models excel on low-resource languages and are competitive with mDPR+BM25 on medium-resource languages. However, on high-resource languages, mDPR+BM25 outperforms Cohere's hybrid model, likely due to the prevalence of text in these languages during mBERT pre-training (Wu and Dredze, 2020).

Conclusion
The incredible capabilities of language model behemoths have attracted a handful of companies and startups to offer access to their proprietary LLMs via APIs. In this paper, we qualitatively and quantitatively analyzed semantic embedding APIs that can be used for information retrieval. Our primary focus was to assess existing APIs for domain generalization and multilingual retrieval. Our findings suggest that re-ranking BM25 results is a suitable and cost-effective option for English; on the BEIR benchmark, OpenAI's ada2 performs the best on average. In multilingual settings, while re-ranking remains a viable technique, a hybrid approach produces the most favorable results. We hope that our insights aid practitioners and researchers in selecting appropriate APIs based on their needs in this rapidly growing market.

Limitations
Similar to other commercial products, embedding APIs are subject to change, which could impact their effectiveness, pricing, and usability. Thus, it is important to note that our findings are specific to the APIs as accessed in January and February 2023. Nevertheless, we believe our evaluation framework can serve as a basis for thoroughly assessing future releases of these APIs. Moreover, we limit our focus to the effectiveness and robustness of semantic embedding APIs. Nonetheless, the safe deployment of retrieval systems in real-world applications necessitates evaluating their fairness. Despite their scale, language models have been found to learn, and sometimes perpetuate, societal biases and harmful stereotypes ingrained in their training corpora (Bender et al., 2021). Consequently, it is crucial to assess the fairness of embedding APIs in relation to protected groups. This paper does not delve into this aspect of API evaluation; further research is required to examine the extent to which these APIs exhibit fairness in real-world applications.