Moving Beyond Downstream Task Accuracy for Information Retrieval Benchmarking

Neural information retrieval (IR) systems have progressed rapidly in recent years, in large part due to the release of publicly available benchmarking tasks. Unfortunately, some dimensions of this progress are illusory: the majority of the popular IR benchmarks today focus exclusively on downstream task accuracy and thus conceal the costs incurred by systems that trade away efficiency for quality. Latency, hardware cost, and other efficiency considerations are paramount to the deployment of IR systems in user-facing settings. We propose that IR benchmarks structure their evaluation methodology to include not only metrics of accuracy, but also efficiency considerations such as query latency and the corresponding cost budget for a reproducible hardware setting. For the popular IR benchmarks MS MARCO and XOR-TyDi, we show how the best choice of IR system varies according to how these efficiency considerations are chosen and weighed. We hope that future benchmarks will adopt these guidelines toward more holistic IR evaluation.


Introduction
Benchmark datasets have helped to drive rapid progress in neural information retrieval (IR). When the MS MARCO (Nguyen et al., 2016) Passage Ranking leaderboard began in 2018, the best-performing systems had MRR@10 scores around 0.20; the latest entries have since increased accuracy past 0.44. Similarly, the XOR TyDi multilingual question answering (QA) dataset (Asai et al., 2020) was released in 2021 and has seen improvements in recall scores from 0.45 to well past 0.70.
The leaderboards for these datasets are defined by a particular set of accuracy-based metrics, and progress on these metrics can easily become synonymous in people's minds with progress in general. However, IR and QA systems deployed in production environments must not only deliver high accuracy but also operate within strict resource requirements, including tight bounds on per-query latency, constraints on disk and RAM capacity, and fixed cost budgets for hardware. Within the boundaries of these constraints, the optimal solution for a downstream task may no longer be the system which simply achieves the highest task accuracy.
Figure 1 shows how significant these trade-offs can be. The figure tracks a selection of MS MARCO Passage Ranking submissions, with cost on the x-axis and accuracy (MRR@10) on the y-axis. At one extreme, the BM25 model costs just US$0.04 per million queries,1 but it is far behind the other models in accuracy. For very similar costs to BM25, one can use BT-SPLADE-S and achieve much better performance. On the other hand, the SPLADE-v2-distil model outperforms BT-SPLADE-S by about 1 point, but at a substantially higher cost. Unfortunately, these trade-offs would not be reflected on the MS MARCO leaderboard. Similarly, the top two systems on the XOR TyDi leaderboard as of October 2022 were separated by only 0.1 points in Recall@5000 tokens, but the gap in resource efficiency between these two approaches is entirely unclear.

(Note on hardware specifications in Table 1: we show the precise resources given as the running environment in each paper, even if not all resources were available to the model or the resources were over-provisioned for the particular task. Table 2 provides our estimates of minimum hardware requirements for a subset of these systems. The first PLAID ColBERTv2 result listed was run on a server that includes 4 GPUs, but no GPU was actually used for measurement, resulting in a higher latency than the second listed result, which does measure GPU execution.)
In this work, we contribute to the growing literature advocating for multidimensional leaderboards that can inform different values and goals (Coleman et al., 2017; Mattson et al., 2020a,b; Baidu Research, 2016; Ma et al., 2021; Liu et al., 2021a; Liang et al., 2022). Our proposal is that researchers should report orthogonal dimensions of performance, such as query latency and overall cost, in addition to accuracy-based metrics. Our argument has two main parts.
In part 1 (§2), we create a post-hoc MS MARCO leaderboard from published papers (Table 1). This reveals that systems with similar accuracy often differ substantially along other dimensions, and also that techniques for improving latency and reducing memory and hardware costs are currently being explored only very sporadically. However, a few of the contributions (Santhanam et al., 2022a; Lassance and Clinchant, 2022; Engels et al., 2022; Li et al., 2022) exemplify the kind of thorough investigation of accuracy and efficiency that we are advocating for, and we believe that improved multidimensional leaderboards could spur further innovation in these areas.
We close by discussing practical considerations relating to the multidimensional leaderboards that the field requires. Here, we argue that the Dynascore metric developed by Ma et al. (2021) is a promising basis for leaderboards that aim to (1) measure systems along multiple dimensions and (2) provide a single full ranking of systems. Dynascores allow the leaderboard creator to weight different assessment dimensions (e.g., to make cost more important than latency). These weightings transparently reflect a particular set of values, and we show that they give rise to leaderboards that are likely to incentivize different research questions and system development choices than current leaderboards do.

A Post-hoc Leaderboard
While existing IR benchmarks facilitate progress on accuracy metrics, the lack of a unified methodology for measuring latency, memory usage, and hardware cost makes it challenging to understand the trade-offs between systems. To illustrate this challenge, we constructed a post-hoc leaderboard for the MS MARCO Passage Ranking benchmark (Table 1). We include the MRR@10 values reported in prior work and, when available, copy the average per-query latency, index size, and hardware configurations reported in the respective papers.2 We highlight the following key takeaways.

Hardware Provisioning
The hardware configurations in Table 1 are the specific compute environments listed in the corresponding papers rather than the minimum viable hardware necessary to achieve the reported latency.
In Table 2, we have sought to specify the minimal configuration that would be needed to run each system. (This may result in an overly optimistic assessment of latency; see §3.) The hardware differences between Table 1 and Table 2 reveal that researchers are often using vastly over-provisioned hardware for their experiments. Our proposed leaderboards would create pressure to be more deliberate about the costs of the hardware used when reporting efficiency metrics.
2 We plan to expand our analysis to include the recently released CITADEL model (Li et al., 2022), first uploaded to arXiv on 11/18/22.

Variation in Methodology
Table 1 shows that both the quality metrics and the hardware used for evaluation vary significantly across different models. Many papers exclusively report accuracy, which precludes any quantitative understanding of efficiency implications (Ren et al., 2021; Gao and Callan, 2021; Wu et al., 2022). For papers that do report efficiency-oriented metrics, the evaluation environment and methodology are often different; for example, the results from Mackenzie et al. (2021) and Lassance and Clinchant (2022) are measured on a single CPU thread, whereas Khattab and Zaharia (2020) and Santhanam et al. (2022a) leverage multiple CPU threads for intra-query parallelism, and even a GPU for certain settings. We also observe performance variability even for the same model, with Mackenzie et al. (2021) (220 ms) and Lassance and Clinchant (2022) (691 ms) reporting SPLADEv2 latency numbers that are 3× apart. Similarly, the BM25 latencies reported by these papers differ by a factor of 2×.

Multidimensional Evaluation Criteria
The optimal model choice for MS MARCO is heavily dependent on how we weight the different evaluation metrics. Based purely on accuracy, CoT-MAE and PLAID ColBERTv2 are the top performers in Table 1, with an MRR@10 score of 39.4 for both. However, we do not have all the information we need to compare them along other dimensions. On the other hand, BM25 is the fastest model, with a per-query latency of only 4 ms as measured by Lassance and Clinchant (2022), and its space footprint is also small. The trade-off is that it has the lowest accuracy in the cohort. Compared to BM25, one of the highly optimized BT-SPLADE models may be a better choice. Figure 1 begins to suggest how we might reason about these often opposing pressures.

Experiments with Representative Retrievers
As Table 1 makes clear, the existing literature does not include systematic, multidimensional comparisons of models. In this section, we report on experiments that allow us to make these comparisons. We focus on four models:

BM25 (Robertson et al., 1995). A sparse, term-based IR model. BM25 remains a strong baseline in many IR contexts and is notable for its low latency and low costs. We assess a basic implementation. More sophisticated versions may achieve better accuracy (Berger and Lafferty, 1999; Boytsov, 2020), though often with trade-offs along other dimensions (Lin et al., 2016). For evidence that simple BM25 models often perform best in their class, see Thakur et al. (2021).

DPR (Karpukhin et al., 2020). A dense single-vector neural IR model. DPR separately encodes queries and documents into vectors and scores them using fast dot-product-based comparisons.

BT-SPLADE-L (Lassance and Clinchant, 2022). SPLADE (Formal et al., 2021) is a sparse neural model. The BT-SPLADE variants are highly optimized versions of this model designed to achieve low latency and reduce the overall computational demands of the original model. To the best of our knowledge, only the Large configuration, BT-SPLADE-L, is publicly available.

PLAID ColBERTv2 (Santhanam et al., 2022a). The ColBERT retrieval model (Khattab and Zaharia, 2020) encodes queries and documents into sequences of output states, one per input token, and scoring is done based on the maximum similarity values obtained for each query token. ColBERTv2 (Santhanam et al., 2022b) improves supervision and reduces the space footprint of the index, and the PLAID engine focuses on achieving low latency. The parameter k to the model dictates the number of initial candidate passages that are scored by the model. Larger k thus leads to higher latency but generally more accurate search. In our initial experiments, we noticed that higher k led to better out-of-domain performance, and thus we evaluated the recommended settings from Santhanam et al. (2022a), namely k ∈ {10, 100, 1000}. To distinguish these configurations from the number of passages evaluated by the MRR or Success metric (also referred to as k), we refer to these configurations as the '-S', '-M', and '-L' variants of ColBERTv2, respectively.
We chose these models as representatives of key IR model archetypes: lexical models (BM25), dense single-vector models (DPR), sparse neural models (SPLADE), and late-interaction models (ColBERT). The three ColBERT variants provide a glimpse of how model configuration choices can interact with our metrics.
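To make the lexical archetype concrete, the classic Okapi BM25 scoring function can be sketched in a few lines. This is a minimal illustration, not the implementation evaluated in our experiments; the k1 and b defaults below are common toolkit choices rather than values taken from this paper.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=0.9, b=0.4):
    """Score each tokenized document in `docs` against `query_terms`.

    Classic BM25: an IDF term weighted by a saturated, length-normalized
    term frequency. `docs` is a list of token lists.
    """
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # Document frequency for each distinct query term.
    df = {t: sum(1 for d in docs if t in d) for t in set(query_terms)}
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            s += idf * tf[t] * (k1 + 1) / norm
        scores.append(s)
    return scores
```

A production system would, of course, use an inverted index rather than scoring every document, which is where BM25's low latency comes from.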
We use two retrieval datasets: MS MARCO (Nguyen et al., 2016) and XOR-TyDi (Asai et al., 2020). All neural models in our analysis are trained on MS MARCO data. We evaluate on XOR-TyDi without further fine-tuning to test out-of-domain performance (see Appendix A for more details).
Our goal is to understand how the relative performance of these models changes depending on the available resources and evaluation criteria. Our approach differs from the post-hoc leaderboard detailed in §2 in two key ways: (1) we fix the underlying hardware platform across all models, and (2) we evaluate each model across a broad range of hardware configurations (AWS instance types), ensuring that we capture an extensive space of compute environments. Furthermore, in addition to quality, we also report the average per-query latency and the corresponding cost of running 1 million queries given the latency and the choice of instance type. This approach therefore enables a more principled and holistic comparison between the models.
We use the open-source PrimeQA framework,3 which provides a uniform interface to implementations of BM25, DPR, and PLAID ColBERTv2. For SPLADE, we use the open-source implementation maintained by the paper authors.4 For each model, we retrieve the top 10 most relevant passages. We report the average latency of running a fixed sample of 1000 queries from each dataset, as measured across 5 trials. See Appendix A for more details about the evaluation environments and model configurations.
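The measurement protocol above (a fixed query sample, averaged across repeated trials) can be sketched as a small harness. This is an illustrative sketch, not our actual benchmarking code; `search_fn` stands in for any retriever's query method.

```python
import statistics
import time

def measure_latency(search_fn, queries, trials=5):
    """Return average per-query latency in milliseconds.

    Runs the full query sample `trials` times and averages the
    per-trial mean latencies, mirroring the protocol in the text.
    """
    trial_means = []
    for _ in range(trials):
        start = time.perf_counter()
        for q in queries:
            search_fn(q)  # retrieve top-k passages for query q
        elapsed = time.perf_counter() - start
        trial_means.append(1000.0 * elapsed / len(queries))
    return statistics.mean(trial_means)
```

In practice one would also add warm-up runs and report variance across trials, since cold caches can dominate the first trial.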
Table 3 summarizes our experiments. Tables 3a and 3b report efficiency numbers, with costs estimated according to the same hardware pricing used for Table 2. Table 3c gives accuracy results (MRR@10 and Success@10).
Overall, BM25 is the least expensive model when selecting the minimum viable instance type: only BM25 is able to run with 4 GB memory. However, its accuracy scores are low enough to essentially remove it from contention.
On both datasets, we find that BT-SPLADE-L and the PLAID ColBERTv2 variants are the most accurate models, by considerable margins. On MS MARCO, all the ColBERTv2 variants outperform BT-SPLADE-L in both MRR@10 and Success@10, while BT-SPLADE-L offers faster and cheaper configurations than ColBERTv2 for applications that permit a moderate loss in quality.
In the out-of-domain XOR-TyDi evaluation, BT-SPLADE-L outperforms the ColBERTv2-S variant, which sets k = 10 (the least computationally intensive configuration). We hypothesize that this loss in quality is an artifact of the approximations employed by the default configuration. Hence, we also test the more computationally intensive configurations mentioned above: ColBERTv2-M (k = 100) and ColBERTv2-L (k = 1000). These tests reveal that ColBERTv2-L solidly outperforms BT-SPLADE-L in MRR@10 and Success@10, while allowing BT-SPLADE-L to expand its edge in latency and cost.
Interestingly, selecting a more expensive instance can actually reduce overall cost, depending on the model. For example, the c7g.4xlarge instance is 3.5× more expensive than x2gd.large, but ColBERTv2-S runs 4× faster with 16 CPU threads and is therefore cheaper to execute on the c7g.4xlarge. These findings further reveal the rich space of trade-offs when it comes to model configurations, efficiency, and accuracy.
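The arithmetic behind this observation is simple: sequential cost per query scales with the product of hourly price and per-query latency, so a price increase is worthwhile whenever the speedup is larger. Using only the relative numbers quoted above:

```python
# Cost per query ~ (hourly price) x (per-query latency).
# From the text: c7g.4xlarge is 3.5x the price of x2gd.large,
# but ColBERTv2-S runs 4x faster on it.
relative_price = 3.5
relative_speedup = 4.0

# Cost on the pricier instance relative to the cheaper one.
relative_cost = relative_price / relative_speedup  # 0.875

# The pricier instance is ~12.5% cheaper per query here.
assert relative_cost < 1.0
```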

Discussion and Recommendations
In this section, we highlight several considerations for future IR leaderboards and offer recommendations for key design decisions.

Evaluation Platform
A critical design goal for IR leaderboards should be to encourage transparent, reproducible submissions. However, as we see in Table 1, many existing submissions are performed using custom, and likely private, hardware configurations and are therefore difficult to replicate.
Instead, we strongly recommend that all submissions be tied to a particular public cloud instance type.5 In particular, leaderboards should require that the specific evaluation environment associated with each submission (at inference time) can be easily reproduced. This encourages submissions to find realistic and transparent ways to use public cloud resources that minimize the cost of their submissions in practice, subject to their own goals for latency and quality. We note that our inclusion of "cost" subsumes many individual trade-offs that systems may consider, like the amount of RAM (or, in principle, storage) required by the index and model, or the number of CPUs, GPUs, or TPUs.
In principle, leaderboards could report the constituent resources instead of reporting a specific reproducible hardware platform. For example, a leaderboard could simply report the number of CPU threads and GPUs per submission. This offers the benefit of decoupling submissions from the offerings available on public cloud providers. However, this approach fails to account for the ever-growing space of hardware resources or their variable (and changing) pricing. For instance, it is likely unrealistic to expect leaderboard builders to quantify the difference in cost between a V100 and a more recent A100 GPU (or newer generations, like the H100), let alone FPGAs or other heterogeneous choices. We argue that allowing submissions to select their own public cloud instance (including its capabilities and pricing) reflects a realistic, market-driven, up-to-date strategy for estimating dollar costs. In practice, leaderboard creators need to set a policy for dealing with changing prices over time. They may, for instance, opt to use the latest pricing at all times. This may lead to shifts in the leaderboard rankings over time, reflecting the changing trade-offs between cost and the other dimensions evaluated.

Scoring
Efficiency-aware IR leaderboards have several options for scoring and ranking submissions. We enumerate three such strategies here:

1. Fix a latency or cost threshold (for example) and rank eligible systems by accuracy. Many different thresholds could be chosen to facilitate competition in different resource regimes (e.g., mobile phones vs. data centers).

2. Fix an accuracy threshold and rank eligible systems by latency or cost (or other aspects). The accuracy threshold could be set to the state-of-the-art result from prior years.

3. Weight the different assessment dimensions and distill them into a single score, possibly after filtering systems based on thresholds on accuracy, latency, and/or cost.

5 Any major public cloud provider (e.g., AWS, Google Cloud, or Azure) is acceptable as long as they offer a transparent way to estimate costs.
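The first two strategies amount to a filter followed by a sort, and can be sketched as follows. The dictionary schema here ('name', 'accuracy', 'cost_per_1m') is purely illustrative, not a format defined in this paper; strategy 3 is the Dynascore-style aggregation discussed below.

```python
def rank_by_accuracy_under_budget(systems, max_cost):
    """Strategy 1: fix a cost threshold; rank eligible systems by accuracy.

    `systems` is a list of dicts with illustrative keys
    'name', 'accuracy', and 'cost_per_1m' (USD per 1M queries).
    """
    eligible = [s for s in systems if s["cost_per_1m"] <= max_cost]
    return sorted(eligible, key=lambda s: -s["accuracy"])

def rank_by_cost_above_accuracy(systems, min_accuracy):
    """Strategy 2: fix an accuracy threshold; rank eligible systems by cost."""
    eligible = [s for s in systems if s["accuracy"] >= min_accuracy]
    return sorted(eligible, key=lambda s: s["cost_per_1m"])
```

Either function yields a partial leaderboard: systems outside the threshold are simply not ranked, which is why only a weighted aggregate (strategy 3) gives a complete ordering.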
Of these approaches, the third is the most flexible and is the only one that can provide a complete ranking of systems. The Dynascores of Ma et al. (2021) seem particularly well-suited to IR leaderboards, since they allow the leaderboard creator to assign weights to each of the dimensions included in the assessment, reflecting the relative importance assigned to each. The Dynascore itself is a utility-theoretic aggregation of all the measurements and yields a ranking of the systems under consideration.
Following Ma et al., we define Dynascores as follows. For a set of models M = {M_1, ..., M_N} and assessment metrics μ = {μ_1, ..., μ_k}, the Dynascore for a model M_i ∈ M is defined as

  \mathrm{Dynascore}(M_i) = \sum_{j=1}^{k} w_{\mu_j} \cdot \frac{\mu_j(M_i)}{\mathrm{AMRS}(\mu_j, \mathrm{acc})}

where w_{μ_j} is the weight assigned to μ_j (we ensure that the sum of all the weights is equal to 1), and acc is an appropriate notion of accuracy (e.g., MRR@10). The AMRS (average marginal rate of substitution) is defined as

  \mathrm{AMRS}(\mu, \mathrm{acc}) = \frac{1}{N-1} \sum_{i=2}^{N} \left| \frac{\mu(M_i) - \mu(M_{i-1})}{\mathrm{acc}(M_i) - \mathrm{acc}(M_{i-1})} \right|

for models M_1, ..., M_N organized from worst to best performing according to acc. In our experiments, we use the negative of Cost and Latency, so that all the metrics are oriented in such a way that larger values are better. If a model cannot be run for a given hardware configuration, it is excluded. For a default weighting, Ma et al. suggest assigning half of the weight to the performance metric and spreading the other half evenly over the other metrics. For our experiments, this leads to {MRR@10: 0.5, Cost: 0.25, Latency: 0.25}; Table 4 reports Dynascores under this default weighting scheme.
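These definitions translate directly into code. The sketch below is a minimal reading of the formulas as described here, not Ma et al.'s reference implementation; it assumes metrics are already oriented so that larger is better (e.g., negated cost and latency) and that accuracy values are distinct.

```python
def amrs(metric_vals, acc_vals):
    """Average marginal rate of substitution of a metric w.r.t. accuracy.

    Both lists must be ordered from worst to best accuracy.
    """
    n = len(acc_vals)
    total = 0.0
    for i in range(1, n):
        total += abs((metric_vals[i] - metric_vals[i - 1])
                     / (acc_vals[i] - acc_vals[i - 1]))
    return total / (n - 1)

def dynascores(models, weights):
    """Dynascore for each model in `models`.

    `models`: list of dicts mapping metric name -> value, including 'acc'.
    `weights`: dict mapping metric name -> weight (weights sum to 1).
    Note AMRS('acc', 'acc') = 1, so accuracy enters at its raw value.
    """
    ordered = sorted(models, key=lambda m: m["acc"])
    scores = []
    for m in models:
        s = 0.0
        for metric, w in weights.items():
            rate = amrs([x[metric] for x in ordered],
                        [x["acc"] for x in ordered])
            s += w * m[metric] / rate
        scores.append(s)
    return scores
```

The AMRS normalization is what makes the weighted sum meaningful: each metric is rescaled into "units of accuracy" before the weights are applied.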
However, this weighting scheme is not the only reasonable choice one could make. Appendix B presents a range of different leaderboards capturing different relative values. Here, we mention a few highlights. First, if accuracy is very important (e.g., MRR@10: 0.9), then all the ColBERTv2 systems dominate all the others. Second, if we are very cost-sensitive, then we could use a weighting {MRR@10: 0.4, Cost: 0.4, Latency: 0.2}. In this setting, ColBERTv2-S rises to the top of the leaderboard for MS MARCO and BT-SPLADE-L is more of a contender. Third, on the other hand, if money is no object, we could use a weighting like {MRR@10: 0.75, Cost: 0.01, Latency: 0.24}. This setting justifies using a GPU with ColBERTv2, whereas most other settings do not justify the expense of a GPU for this system. In contrast, a GPU is never justified for BT-SPLADE-L.
To get a holistic picture of how different weightings affect these leaderboards, we conducted a systematic exploration of different weighting vectors. Figure 2a summarizes these findings in terms of the winning system for each setting. The plots depict the Latency weight on the x-axis and the Accuracy weight on the y-axis. The three weights always sum to 1 (Dynascores are normalized), so the Cost weight is determined by the other two, as 1.0 − Accuracy − Latency.
The overall picture is clear. For MS MARCO, a ColBERTv2-M or ColBERTv2-S system is generally the best choice overall when Accuracy is the most important value, and ColBERTv2-L is never a winner. In contrast, a BT-SPLADE-L system is generally the best choice where Cost and Latency are much more important than Accuracy. DPR is a winner only where Accuracy is relatively unimportant, and BM25 is a winner only where Accuracy is assigned essentially zero importance. For the out-of-domain XOR-TyDi test, the picture is somewhat different: now ColBERTv2-L is the dominant system, followed by BT-SPLADE-L.
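The weighting sweep behind Figure 2a can be sketched as a grid walk over the simplex of weight vectors. This is an illustrative sketch (the grid step and metric names are ours); `scorer` stands in for any Dynascore-style function mapping (models, weights) to per-model scores.

```python
def sweep_winners(models, scorer, step=0.1):
    """Enumerate weightings with w_acc + w_lat + w_cost = 1 and record
    the winning model name at each grid point.

    `models`: list of dicts with a 'name' key plus metric values.
    Returns {(w_acc, w_lat): winner_name}; w_cost is implied.
    """
    winners = {}
    steps = int(round(1 / step))
    for i in range(steps + 1):
        for j in range(steps + 1 - i):
            w_acc, w_lat = i * step, j * step
            w_cost = 1.0 - w_acc - w_lat
            weights = {"acc": w_acc, "neg_latency": w_lat,
                       "neg_cost": w_cost}
            scores = scorer(models, weights)
            best = max(range(len(models)), key=lambda k: scores[k])
            winners[(round(w_acc, 2), round(w_lat, 2))] = models[best]["name"]
    return winners
```

Plotting the winner at each grid point, with marker size proportional to the Cost weight, reproduces the kind of summary shown in Figure 2.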

Metrics
Here we briefly explore various metrics and their potential role in leaderboard design, beginning with the two that we focused on in our experiments.

Latency. Latency measures the time for a single query to be executed and a result to be returned to the user. Some existing work has measured latency on a single CPU thread to isolate the system performance from potential noise (Mackenzie et al., 2021; Lassance and Clinchant, 2022). While this approach ensures a level playing field for different systems, it fails to reward systems which do benefit from accelerated computation (e.g., on GPUs) or intra-query parallelism, such as DPR and PLAID ColBERTv2. Therefore, for leaderboards with raw latency as a primary objective, we recommend allowing flexibility in the evaluation hardware to enable the fastest possible submissions. Such flexibility is then subsumed in the dollar cost below.
Dollar cost. Measuring the financial overhead of deploying IR systems is key for production settings. One way to measure cost is to select a particular public cloud instance type and simply multiply the instance rental rate by the time to execute some fixed number of queries, as in Table 2.
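That estimate can be sketched in a few lines. This is a simplifying sketch of the calculation described above (one query at a time on a single rented instance, ignoring batching and parallelism); the function name is ours.

```python
def cost_per_million_queries(hourly_rate_usd, per_query_latency_ms):
    """Estimated cost (USD) of serving 1M queries sequentially.

    Multiplies the instance rental rate by total execution time,
    as in the Table 2 estimates described in the text.
    """
    total_hours = per_query_latency_ms * 1_000_000 / (1000 * 3600)
    return hourly_rate_usd * total_hours
```

For example, an instance billed at $3.60/hour running a system with 1 ms per-query latency works out to about $1 per million queries.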
Throughput. Throughput measures the total number of queries which can be executed over a fixed time period. Maximizing throughput could entail compromising the average per-query latency in favor of completing a larger volume of queries concurrently. It is important that leaderboards explicitly define the methodology for measuring latency and/or throughput in practice (e.g., in terms of average time to complete one query at a time or average time to complete a batch of 16 queries).
FLOPs. The number of floating point operations (FLOPs) executed by a particular model gives a hardware-agnostic metric for assessing computational complexity. While this metric is meaningful in the context of compute-bound operations such as language modeling (Liu et al., 2021b), IR systems are often composed of heterogeneous pipelines where the bottleneck operation may instead be bandwidth-bound (Santhanam et al., 2022a). Therefore, we discourage FLOPs as a metric to compete on for IR leaderboards.
Memory usage. IR systems often pre-compute large indexes and load them into memory (Johnson et al., 2019; Khattab and Zaharia, 2020), meaning memory usage is an important consideration for determining the minimal hardware necessary to run a given system. In particular, we recommend that leaderboard submissions report the index size at minimum, as well as the dynamic peak memory usage if possible. Reporting the dollar cost of each system (which accounts for the total RAM made available to each system) allows us to quantify the effect of this dimension in practice.

Related Work
Many benchmarks holistically evaluate the accuracy of IR systems on dimensions such as out-of-domain robustness (Thakur et al., 2021; Santhanam et al., 2022b) and multilingual capabilities (Zhang et al., 2021, 2022). While these benchmarks are key for measuring retrieval effectiveness, they do not incorporate analysis of resource efficiency or cost. The MLPerf benchmark does include such analysis but is focused on vision and NLP tasks rather than retrieval (Mattson et al., 2020a). Several retrieval papers offer exemplary efficiency studies (Mackenzie et al., 2021; Santhanam et al., 2022a; Engels et al., 2022; Li et al., 2022); we advocate in this work for more widespread adoption, as well as standardization of the evaluation procedure.

Conclusion
We argued that current benchmarks for information retrieval should adopt multidimensional leaderboards that assess systems based on latency and cost as well as standard accuracy-style metrics. Such leaderboards would likely spur innovation and lead to more thorough experimentation and more detailed reporting of results in the literature. As a proof of concept, we conducted experiments with four representative IR systems, measuring latency, cost, and accuracy, and showed that this reveals important differences between these systems that are hidden if only accuracy is reported. Finally, we tentatively proposed Dynascoring as a simple, flexible method for creating multidimensional leaderboards in this space.

Limitations
We identify two sources of limitations in our work: the range of metrics we consider, and the range of models we explore in our experiments.
Our paper advocates for multidimensional leaderboards. In the interest of concision, we focused on cost and latency as well as system quality. These choices reflect a particular set of values when it comes to developing retrieval models. In §4.3, we briefly consider a wider range of metrics and highlight some of the values they encode. Even this list is not exhaustive, however. In general, we hope that our work leads to more discussion of the values that should be captured in the leaderboards in this space, and so we do not intend our choices to limit exploration here.
For our post-hoc leaderboard (Table 1), we surveyed the literature to find representative systems. We cannot claim that we have exhaustively listed all systems, and any omissions should count as limitations of our work. In particular, we note that we did not consider any re-ranking models, which would consume the top-k results from any of the retrievers we test and produce a re-arranged list. Such models would only add weight to our argument about diverse cost-quality trade-offs, as re-ranking systems must determine which retriever to re-rank, how many passages to re-rank per query (i.e., setting k), and what hardware to use for the re-ranking models, which are typically especially accelerator-intensive (i.e., require GPUs or TPUs).
For our experimental comparisons, we chose four models that we take to be representative of broad approaches in this area. However, different choices from within the space of all possibilities might have led to different conclusions. In addition, our experimental protocols may interact with our model choices in important ways. For example, the literature on SPLADE suggests that it may be able to fit its index on machines with 8 or 16 GB of RAM, but our experiments used 32 GB of RAM.
Our hope is merely that our results help encourage the development of leaderboards that offer numerous, fine-grained comparisons from many members of the scientific community, and that these leaderboards come to reflect different values for scoring and ranking such systems as well.

Figure 1: Selected MS MARCO Passage Ranking submissions assessed on both cost and accuracy, with the Pareto frontier marked by a dotted line. The trade-offs evident here are common in real-world applications of IR technologies. These submissions do not represent "optimal" implementations of each respective approach, but rather reflect existing reported implementations and hardware configurations in the literature. Including cost and other efficiency considerations on our leaderboards would lead to more thorough exploration of possible system designs and, in turn, to more meaningful progress.

Figure 2: Exploration of Dynascore weighting schemes. Marker sizes are proportional to Cost weights (larger dots represent more cost-sensitive weightings, and thus the most expensive systems are along the diagonal).

Table 1: Post-hoc leaderboard of MS MARCO v1 dev performance, using results reported in the corresponding papers.

Table 3: Experimental results. Latency is average per-query latency (ms), and Cost is per 1M queries.

Table 7: Dynascores for MS MARCO, for different weightings of the metrics.