Evaluation for Change

Evaluation is the central means for assessing, understanding, and communicating about NLP models. In this position paper, we argue evaluation should be more than that: it is a force for driving change, carrying a sociological and political character beyond its technical dimensions. As a force, evaluation's power arises from its adoption: under our view, evaluation succeeds when it achieves the desired change in the field. Further, by framing evaluation as a force, we consider how it competes with other forces. Under our analysis, we conjecture that the current trajectory of NLP suggests evaluation's power is waning, in spite of its potential for realizing more pluralistic ambitions in the field. We conclude by discussing the legitimacy of this power, who acquires this power and how it distributes. Ultimately, we hope the research community will more aggressively harness evaluation for change.


Introduction
Evaluation plays a defining role in NLP research; in fact, evaluation has a very rich history. While this genealogy can be traced in many ways, since this piece (roughly) coincides with the 5th anniversary of the passing of one of NLP's beloved pioneers and the first recipient of the ACL Lifetime Achievement Award, we look to Aravind Joshi's legacy. Best known for grammar formalism and discourse (see Webber, 2018), his research journey reflects broader field-wide trends towards evaluation. In early works (e.g. Joshi, 1969; Joshi et al., 1972; Grosz et al., 1983), evaluation went entirely unmentioned. Yet, over time, Aravind's work involved more evaluation (e.g. Joshi and Schabes, 1989), implicitly building new norms for evaluation in grammar formalism and discourse (Miltsakaki et al., 2004; Prasad et al., 2008, 2014). Liberman (2005) cites Joshi's standards for evaluation in conveying Joshi's signature belief in multidisciplinary approaches to human language.
Joshi's life and 5 decades of scholarship teach us evaluation is important, but how should we reason about evaluation? Here, we present two perspectives that frame evaluation in considerably different ways. Under the first account, evaluation is technical in nature, functioning as a lens to study models. The motivation for this lens may depend on the specific evaluation, stakeholder, or both: evaluation may allow us to derive scientific insight, or it may transparently document technology for broader audiences (e.g. practitioners, colleagues in other fields, policymakers, the public). Regardless, to determine if an evaluation is successful, under this account, the lens must yield the desired understanding about models.
In this work, we argue for a second perspective, which we believe is partially acknowledged but considerably less salient than the first perspective. Under our second account, evaluation is political in nature, functioning as a force to drive change. In contrast to the first account, this means evaluation pushes the research community in some direction, possibly referring to a specific social or scientific objective, with the emphasis being on future model development more so than existing models. Critically, under this account, to determine if an evaluation is successful, the force must yield the desired change in the community. By separating these two accounts, our goal is neither to suggest they are at odds nor that they are meaningfully separable, but to provide conceptual clarity on the merits of power-centric analysis.
In pushing for this position of viewing evaluation as a force, we explore what this force influences, what other forces it competes with, how it accrues power, whether its power is legitimate, and who it empowers. Motivated by the growing impact of language technology and our field, the abundant discord on the status quo, and the uncertainty on what lies ahead, we believe evaluation's potential for change presents a vital path forward.

Evaluation as a Force
If evaluation is a force, what domain does it act upon? And where does its power come from?
Domain. We will restrict our scope to how evaluation influences NLP research. Specifically, evaluation concretizes desired behavior for systems, thereby communicating an objective for model design. This allows for the community to coordinate on goals for modeling research. For this goal-setting to succeed, future research should then go on to make progress on the proposed evaluation. That is, successful evaluation requires that the evaluation be prioritized, redistributing research attention such that it is allocated towards making progress on the evaluation.
Adoption constructs power. As this suggests, the adoption of an evaluation (by others) generates its power and determines its success. It is in this sense that our account for evaluation success deviates from a purely technical/intrinsic characterization. Most evaluations are concrete instantiations of a broader agenda: for these evaluations to be effective, they must shift power, namely towards addressing this agenda and materially making progress. In spite of this, we generally find that evaluations in NLP research do not even mention how adoption will arise, or whether evaluation creators will take any overt actions to accelerate adoption.
Accelerating adoption. If the power of evaluations comes from adoption, and evaluation creators are incentivized to accrue such power to advance their broader agenda, are there ways to accelerate adoption? We observe at least two such approaches, though they have not been considered in this way to our knowledge. As a softer means for acquiring adoption/power, evaluations may be used as shared tasks (e.g. SemEval; see Parra Escartín et al., 2017; Nissim et al., 2017) or be built as part of workshops/conferences (e.g. BIG-bench; see WELM, 2021; Srivastava et al., 2022), which leans into the relationship between coordinating research and convening researchers. More aggressively, explicit competitions with prizes or other stronger incentives can more directly drive adoption, perhaps most famously in the Netflix Prize, which remarkably accelerated and shifted research on recommender systems (see Hallinan and Striphas, 2016).
Authority as a standard. As evaluations accrue influence, they eventually become reified as high-status standards like ImageNet, WMT, and SQuAD (Dotan and Milli, 2020; Dehghani et al., 2021). While it is difficult to directly assess the power these evaluations have (e.g. how would research have changed counterfactually in their absence; see Liu et al., 2021), strong norms emerge for modeling work to evaluate on these standards. And, consequently, improvements on these evaluations function as stand-ins for more fundamental progress (Porter, 1995; Liao et al., 2021; Raji et al., 2021). In fact, their authority is made clear in how serious improvements were seen as watershed moments, ushering in new paradigms. Famous examples include the performance of AlexNet (Krizhevsky et al., 2012) on ImageNet, which initiated the deep learning revolution, and Transformers (Vaswani et al., 2017) on WMT, which, by outperforming specialized machine translation approaches by a considerable margin, marked the dawn of the current dominance of Transformers.
Related work. This work is not the first to bring questions of power, values, reflection, and change to the fore in relation to evaluation/benchmarking (Spärck Jones and Galliers, 1995; Welty et al., 2019; Dotan and Milli, 2020; Ethayarajh and Jurafsky, 2020; Linzen, 2020; Scheuerman et al., 2021; Kiela et al., 2021; Dehghani et al., 2021; Bowman and Dahl, 2021; Raji et al., 2021; Koch et al., 2021; Denton et al., 2021; Paullada et al., 2021; Liu et al., 2021; Hutchinson et al., 2021; Jacobs and Wallach, 2021; Birhane et al., 2022; Liang et al., 2022b, inter alia). Prior work establishes that evaluations embed values, carry influence, and encode broader power structures, and that the nature of evaluation as ranking aligns with broader themes of hierarchy. These works make clear how other disciplines can provide guidance on what we see in NLP, but also how our evaluation practices are distinctive (e.g. competitive tendencies in benchmarking, differences in standards for measure validity).
While we draw significant inspiration from these works, our work also significantly diverges in its objective. Rather than trying to make visible the tacit assumptions, norms, and infrastructure that animate evaluation's power, we instead set our sights on how evaluation's power can animate change. In this regard, our work more closely mirrors the aesthetic of Abebe et al. (2020), as can be seen in the similar titles.

Competing Forces
Having argued for where evaluation draws its power from, how powerful is it? While this is difficult to state in absolute terms, we can instead consider what other forces are in play and how they interact/compete.
Coexisting forces. NLP research is a fabric stitched through myriad social interactions: conversations with colleagues, talks at conferences, academic Twitter, scholarship from adjacent disciplines, and much more. Most of these interactions are poorly conceptualized as forces: while they exert influence, they are generally diffuse rather than concentrated and lack strong directionality. For this short-form analysis, we juxtapose evaluation with the force of resources. By resources, we refer to assets like money, compute, and engineering support, choosing to treat them as monolithic (rather than disaggregating) for brevity.
Language models. Given the central position language models occupy in modern NLP, we consider language models as a case study to relate evaluation and resources. Our thesis is that resources, to a far greater extent than evaluation, dictate research on language models, which more broadly influences NLP research given the pervasive dependence on language models. Influential language models have near-exclusively been developed by resource-rich institutions. Further, we argue a resource-allocation mindset drives decision-making in their development. Namely, the use of scaling laws (e.g. Kaplan et al., 2020; Hoffmann et al., 2022) indicates development is framed as an efficient resource allocation problem. Evaluation does play a small role here: scaling laws relate resources (x-axis) with evaluated model performance (y-axis). But the evaluation scope is narrow: scaling laws generally center accuracy for a single task (generally upstream language model perplexity/loss), with predictability of this relationship being the principal concern (Ganguli et al., 2022; cf. Wei et al., 2022).
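To make concrete what it means for a scaling law to relate resources (x-axis) to an evaluated quantity (y-axis), the following is a minimal sketch. It assumes a pure power-law form, loss = a * compute^(-b), fit by least squares in log-log space; this is a simplification chosen for illustration, not the parametric form of any cited work. The function name `fit_power_law` and all data points are hypothetical and synthetic.

```python
import math


def fit_power_law(compute, loss):
    """Fit loss = a * compute**(-b) via least squares on log(loss) vs. log(compute).

    Assumes a pure power law with no irreducible-loss term (an illustrative
    simplification). Returns the pair (a, b).
    """
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Ordinary least squares for the line y = intercept + slope * x.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic, noise-free data generated from a = 5.0, b = 0.05 (hypothetical values).
compute = [1e18, 1e19, 1e20, 1e21]
loss = [5.0 * c ** -0.05 for c in compute]

a, b = fit_power_law(compute, loss)
print(f"a = {a:.3f}, b = {b:.3f}")
```

Nothing in the fitting procedure is specific to upstream loss: the y-axis could in principle be any evaluated quantity for which a power-law trend is a reasonable assumption.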
In contrast, evaluation currently does not exert similar influence over language model development. Namely, while influential language models are similar in that they were developed by resource-rich institutions, they strikingly differ in the benchmarks they are evaluated on. Across all datasets evaluated on in language modeling works at the time, Liang et al. (2022b) find that RTE is the only dataset evaluated on in more than 50% of the 32 language modeling works they consider (e.g. GPT-3, GPT-NeoX, BLOOM, PaLM, Gopher, OPT, GLM), with some works sharing no evaluation datasets in common. Given this status quo, evaluations currently fail to achieve the widespread adoption required to drive change.
Contrasting properties. Which forces orient NLP research is consequential: different forces profile differently. Resources are distributed very unevenly, so resources orienting progress implies a small subset of the community exerts outsized influence in shaping the field's trajectory. Further, by the nature of how these resource disparities came to be, these resource-rich actors tend to have specific incentives (e.g. commercial interest) and demographics (e.g. poor diversity), potentially causing them to advance particular agendas (see Rogaway, 2015). In contrast, we believe evaluation structurally is better equipped to enable broader participation (e.g. BIG-bench) and, critically, pluralism. Different values can be simultaneously foregrounded in evaluations (e.g. HELM (Liang et al., 2022b) highlights values/desiderata such as accuracy, robustness, fairness, uncertainty, and efficiency). For example, insofar as scaling laws drive language model development, greater pluralism would be achieved if scaling laws were studied, fit, and applied for a broader array of evaluation targets than just upstream accuracy/perplexity.
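The adoption statistic discussed in this section (the share of language modeling works that evaluate on a given dataset, and whether two works share any dataset at all) can be computed along the following lines. This is a minimal sketch: the helper names `adoption_share` and `shared_datasets` are hypothetical, and the work and dataset names are placeholder toy data, not the data analyzed by Liang et al. (2022b).

```python
from collections import Counter


def adoption_share(evaluations):
    """Map each dataset to the fraction of works that evaluate on it."""
    counts = Counter(
        dataset
        for datasets in evaluations.values()
        for dataset in set(datasets)  # count each dataset once per work
    )
    n_works = len(evaluations)
    return {dataset: count / n_works for dataset, count in counts.items()}


def shared_datasets(evaluations, work_a, work_b):
    """Return the datasets that both works evaluate on."""
    return set(evaluations[work_a]) & set(evaluations[work_b])


# Placeholder toy data: which (hypothetical) works evaluate on which datasets.
evaluations = {
    "work_1": ["dataset_x", "dataset_y"],
    "work_2": ["dataset_x"],
    "work_3": ["dataset_x", "dataset_z"],
    "work_4": ["dataset_w"],
}

shares = adoption_share(evaluations)
# Datasets adopted by more than half of the works.
widely_adopted = [d for d, share in shares.items() if share > 0.5]
# Works can share no evaluation dataset in common.
overlap = shared_datasets(evaluations, "work_2", "work_4")
```

In this toy setting only `dataset_x` clears the 50% threshold, and `work_2` and `work_4` have an empty overlap, mirroring in miniature the fragmentation described above.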

Legitimacy
Since evaluation accrues power, is this power legitimate? And who does this power distribute to?
Legitimacy. Evaluations are generally built by a small number of researchers, but could orient work across the broader research community. Consequently, in arguing for the greater use of evaluation as a means for shifting power, we should question whether this implicitly recommends value imposition: imposing values of the few onto the many. However, recall that evaluation's power derives not from its creation but its adoption. Consequently, the emergence of this power requires the consensual action of the early adopters, who choose to use the evaluation. To an extent, this (voluntary) choice suggests that the power of evaluation is generally, and at least initially, legitimate.
If the power of evaluation is legitimate, then what does this imply when evaluations are shown to have issues with respect to their validity, reliability, relevance, or appropriateness (Gururangan et al., 2018; Kaushik and Lipton, 2018; Ethayarajh, 2020; Blodgett et al., 2021; Aribandi et al., 2021; Birhane and Prabhu, 2021, inter alia)? Here, we recognize that while the initial adoption of an evaluation is in most cases clearly legitimate, the subsequent sustained adoption can be more complicated.
In particular, we emphasize that evaluations tend to exhibit inertia: once an evaluation is widely adopted, it is hard for the evaluation to lose this status or for other evaluations to eclipse it (e.g. due to reviewing norms; Dehghani et al., 2021), even when there are strong reasons to demote or deprecate the evaluation (Peng et al., 2021). Most directly, we point to the strong norms of comparison in NLP, whereby model developers are expected to compare their models to prior models in head-to-head comparisons. While generally a useful norm, this does promote a certain conservatism. Notably, when prior models (i.e. those that are to be compared to) are not public (Bommasani et al., 2023) or are laborious to re-evaluate on new datasets, developers of new models can most easily compare to old models on the evaluations used in prior work. In this regard, paradigms where evaluations are continuously updated and refreshed (e.g. the evaluation rounds in ANLI (Nie et al., 2020) and versions in HELM (Liang et al., 2022b); inherently dynamic evaluations like DynaBench (Kiela et al., 2021)) more directly ensure the sustained power of specific evaluations is legitimate.
Distribution of power. Even if an evaluation's power is acquired legitimately, we should further question how the power distributes over different members of the community, particularly as other forces (especially resources) are inequitably distributed. Koch et al. (2021) show the distribution of evaluation developers is also uneven, aligning strongly with institutional privilege (e.g. elite academic institutions like Stanford and Princeton, massive commercial organizations like Microsoft and Google). In part, this is likely a byproduct of the fact that evaluations themselves can be quite resource-intensive, especially when this scale is a virtue: ImageNet (Deng et al., 2009), especially for its time, was exceedingly costly in both money and time; large-scale model evaluation on HELM costs almost 40k USD in addition to 20k A100 GPU hours (Liang et al., 2022b).
With that said, we have significant optimism that evaluation can realize more pluralistic visions. Specifically, (i) the rise of foundation models in NLP has shifted the field towards few-shot evaluations (Brown et al., 2020; Bragg et al., 2021), which means evaluations need not include large-scale training subsets, which constituted much of the cost for evaluations historically (e.g. 80%, or 80k+, of the examples in SQuAD (Rajpurkar et al., 2016) were allocated for training). This suggests that their development should be more broadly accessible (Bommasani et al., 2021, §4.4), though the dynamics of their adoption are less clear. Further, (ii) the practice of community-driven evaluation design has been successfully implemented in several instances: the EleutherAI LM Harness (Gao et al., 2021), GEM (Gehrmann et al., 2021), GEMv2 (Gehrmann et al., 2022), BIG-Bench (Srivastava et al., 2022), and the Hugging Face Evaluate library (von Werra et al., 2022), with examples like Universal Dependencies (UD; Nivre et al., 2016; de Marneffe et al., 2021) even pre-dating them by many years. In most cases, these efforts did not push a very clear directional change/agenda in research priorities (UD as a partial exception), but we believe future efforts could more explicitly exert power while learning from these prior efforts. Finally, (iii) the community has grown to more properly recognize and value evaluation-type contributions (e.g. the NeurIPS datasets and benchmarks track, cf. Rogers (2020)). That is, while we argue evaluation's power is currently waning relative to resources, suggesting a trend towards less pluralism, we simultaneously believe the conditions are ripe for renewed commitment to evaluation to reverse this trajectory.

Conclusion
Evaluation wields power: we believe the community is largely aware of this, yet we foreground this power to understand how evaluation drives change. This perspective leads us to three conclusions: (i) adoption imbues evaluation with its power, (ii) evaluation's power relative to other competing social forces appears to be diminishing, and yet (iii) evaluation has attractive qualities, especially under current conditions, as a force for change relative to other forces with growing power. Overall, we hope the community reflects on the mantra "evaluation for change".

Limitations
This work puts forth a position: by the nature of a position paper, the work is deliberately intended to be evocative and opinionated, in some places not having unequivocal evidence for certain claims. This presents a clear limitation: the analysis presented may diverge from the realities of NLP at present or in the future, namely if the assumptions/conditions presented themselves prove to be untrue in practice. Nonetheless, we believe centering power and change, and understanding evaluation as a political and sociological phenomenon, is likely to be useful under all conditions.
Further, in understanding the qualities of evaluation relative to other social forces, we directly suggest that evaluation is more readily operationalized in more pluralistic ways than other key forces (primarily resources). While initial efforts indicate the potential for such holistic approaches that reflect many different desiderata (Liang et al., 2022b) as well as participatory approaches that permit contribution from different entities (e.g. Srivastava et al., 2022), it is still unclear how much adoption such approaches will get, and therefore how much power they will acquire. That is, the extent to which evaluation can realize this pluralistic vision still largely remains more an unresolved aspiration than a readily realizable certainty. And, conversely, we do note that while current practices potentially put pluralism and resources at odds, they may be mutually compatible in other regimes (e.g. decentralized training through the pooling of shared/volunteered compute (Yuan et al., 2022), open-source software development (Wolf et al., 2020; Gao et al., 2021; von Werra et al., 2022)).
Finally, we do not discuss other forces that we believe have not exhibited strong influence on NLP research thus far, in favor of allocating focus to evaluation and resources, which have had clear influence. To enumerate some of these other (potential) forces, we specifically note (i) research norms, (ii) policy and regulation, and (iii) auditing/advocacy. For (i), we note that while the NLP research community has many established norms (e.g. reproducibility checklists, peer review guidelines, conference organization structure, policies on respectful conduct), most of these do not directly/significantly influence what research topics different researchers work on. We do note that it is possible in the future that certain norms (e.g. the access to training data or model checkpoints; Liang et al., 2022a) would influence what research is conducted (e.g. we may not have seen as much work on the learning dynamics of language models and/or memorization of training data due to the relative inaccessibility of intermediary checkpoints and training data until recently). For (ii), we note that policy and regulatory efforts have had little to no salient impact on the deployment of most language technologies, let alone NLP research, to our knowledge. With that said, much as efforts like GDPR and privacy legislation have impacted scientific research on privacy (e.g. work that operationalizes the right to be forgotten as in Ginart et al., 2019), similar trends could occur in NLP research (e.g. in response to the EU AI Act). Akin to (ii), for (iii), we also have seen fairly little impact from auditing/advocacy work on NLP research to our knowledge. But, much as work on auditing/advocacy around face recognition (Buolamwini and Gebru, 2018; Raji and Buolamwini, 2019; Raji et al., 2020, inter alia) influenced research in the computer vision community, we could see similar trends in NLP (e.g. in response to auditing/advocacy intervention around language models).

Ethics Statement
We do not find serious risks or ethical concerns with this work. We do note this work advances a specific position, which we clearly identify. It should not be assumed there is consensus in the community (or beyond) on any account for evaluation, let alone the account on power that we espouse. In this regard, we actively solicit response and interrogation of the positions presented in this work, especially given myriad relevant analyses of evaluation/measurement/benchmarking exist in other parts of AI, computer science, linguistics, and other disciplines.
Deb Raji, Emily Denton, Henrik Kugelberg, Jacob Andreas, Jacob Steinhardt, John Hewitt, Judy Shen, Kawin Ethayarajh, Nelson Liu, Rediet Abebe, Rohan Taori, Sam Bowman, Stella Biderman, Tal Linzen, Yann Dubois, and Yoav Goldberg for being specific inspirations, whose writings and thoughts helped develop my current position on evaluation, with special thanks to Percy Liang. Thanks to Jason Wei for feedback on the initial version of this work. I would like to thank the CRFM community; the experience of designing and building HELM (Liang et al., 2022b) in particular helped sharpen my belief in this philosophy towards evaluation. I am supported by the NSF Graduate Research Fellowship Program under grant number DGE-1655618.