Reliability Testing for Natural Language Processing Systems

Questions of fairness, robustness, and transparency are paramount to address before deploying NLP systems. Central to these concerns is the question of reliability: Can NLP systems reliably treat different demographics fairly and function correctly in diverse and noisy environments? To address this, we argue for the need for reliability testing and contextualize it among existing work on improving accountability. We show how adversarial attacks can be reframed for this goal, via a framework for developing reliability tests. We argue that reliability testing — with an emphasis on interdisciplinary collaboration — will enable rigorous and targeted testing, and aid in the enactment and enforcement of industry standards.


Introduction
Rigorous testing is critical to ensuring a program works as intended (functionality) when used under real-world conditions (reliability). Hence, it is troubling that while natural language technologies are becoming increasingly pervasive in our everyday lives, there is little assurance that these NLP systems will not fail catastrophically or amplify discrimination against minority demographics when exposed to input from outside the training distribution. Recent examples include GPT-3 (Brown et al., 2020) agreeing with suggested suicide (Rousseau et al., 2020), the mistranslation of an innocuous social media post resulting in a minority's arrest (Hern, 2017), and biased grading algorithms that can negatively impact a minority student's future (Feathers, 2019). Additionally, a lack of rigorous testing, coupled with machine learning's (ML) implicit assumption of identical training and testing distributions, may inadvertently result in systems that discriminate against minorities, who are often underrepresented in the training data. This can take the form of misrepresentation of or poorer performance for people with disabilities, specific gender, ethnic, age, or linguistic groups (Hovy and Spruit, 2016; Crawford, 2017).
Figure 1: How DOCTOR can integrate with existing system development workflows. Test (left) and system development (right) take place in parallel, separate teams. Reliability tests can thus be constructed independent of the system development team, either by an internal "red team" or by independent auditors.
* Correspondence to: samson.tan@salesforce.com
Amongst claims of NLP systems achieving human parity in challenging tasks such as question answering (Yu et al., 2018), machine translation (Hassan et al., 2018), and commonsense inference (Devlin et al., 2019), research has demonstrated these systems' fragility to natural and adversarial noise (Goodfellow et al., 2015; Belinkov and Bisk, 2018) and out-of-distribution data (Fisch et al., 2019).
It is also still common practice to equate "testing" with "measuring held-out accuracy", even as datasets are revealed to be harmfully biased (Wagner et al., 2015;Geva et al., 2019;Sap et al., 2019).
Many potential harms can be mitigated by detecting them early and preventing the offending model from being put into production. Hence, in addition to being mindful of the biases in the NLP pipeline (Bender and Friedman, 2018; Mitchell et al., 2019; Waseem et al., 2021) and holding creators accountable via audits (Raji et al., 2020; Brundage et al., 2020), we argue for the need to evaluate an NLP system's reliability in diverse operating conditions. Initial research on evaluating out-of-distribution generalization involved manually-designed challenge sets (Jia and Liang, 2017; Nie et al., 2020; Gardner et al., 2020), counterfactuals (Kaushik et al., 2019; Khashabi et al., 2020), biased sampling (Søgaard et al., 2021) or toolkits for testing if a system has specific capabilities (Ribeiro et al., 2020) or robustness to distribution shifts (Goel et al., 2021). However, most of these approaches inevitably overestimate a given system's worst-case performance since they do not mimic the NLP system's adversarial distribution.¹
A promising technique for evaluating worst-case performance is the adversarial attack. However, although some adversarial attacks explicitly focus on specific linguistic levels of analysis (Belinkov and Bisk, 2018;Tan et al., 2020;Eger and Benz, 2020), many often simply rely on word embeddings or language models for perturbation proposal (see §4). While the latter may be useful to evaluate a system's robustness to malicious actors, they are less useful for dimension-specific testing (e.g., reliability when encountering grammatical variation). This is because they often perturb the input across multiple dimensions at once, which may make the resulting adversaries unnatural.
Hence, in this paper targeted at NLP researchers, practitioners, and policymakers, we make the case for reliability testing and reformulate adversarial attacks as dimension-specific, worst-case tests that can be used to approximate real-world variation. We contribute a reliability testing framework, DOCTOR, that translates safety and fairness concerns around NLP systems into quantitative tests. We demonstrate how testing dimensions for DOCTOR can be drafted for a specific use case. Finally, we discuss the policy implications, challenges, and directions for future research on reliability testing.

Terminology Definitions
We first define the key terms used in our discussion.
NLP system. The entire text processing pipeline built to solve a specific task, taking raw text as input and producing predictions in the form of labels (classification) or text (generation). We exclude raw language models from the discussion since it is unclear how performance, and hence worst-case performance, should be evaluated. We do include NLP systems that use language models internally (e.g., BERT-based classifiers (Devlin et al., 2019)).
¹ The distribution of adversarial cases or failure profile.
Reliability. Defined by IEEE (2017) as the "degree to which a system, product or component performs specified functions under specified conditions for a specified period of time". We prefer this term over robustness to challenge the NLP community's common framing of inputs from outside the training distribution as "noisy". The notion of reliability requires us to explicitly consider the specific, diverse environments (i.e., communities) a system will operate in. This is crucial to reducing NLP's negative impact on the underrepresented.
Dimension. An axis along which variation can occur in the real world, similar to Plank (2016)'s variety space. A taxonomy of possible dimensions can be found in Table 1 (Appendix).
Adversarial attack. A method of perturbing the input to degrade a target model's accuracy (Goodfellow et al., 2015). In computer vision, this is achieved by adding adversarial noise to the image, optimized to be maximally damaging to the model. §4 describes how this is done in the NLP context.

Stakeholder. A person who is (in-)directly impacted by the NLP system's predictions.
Actor. Someone who has influence over a) the design of an NLP system and its reliability testing regime; b) whether the system is deployed; and c) who it can interact with. Within the context of our discussion, actors are likely to be regulators, experts, and stakeholder advocates.
Expert. An actor who has specialized knowledge, such as ethicists, linguists, domain experts, social scientists, or NLP practitioners.

The Case for Reliability Testing in NLP
The accelerating interest in building NLP-based products that impact many lives has led to urgent questions of fairness, safety, and accountability (Hovy and Spruit, 2016;Bender et al., 2021), prompting research into algorithmic bias (Bolukbasi et al., 2016;Blodgett et al., 2020), explainability (Ribeiro et al., 2016;Danilevsky et al., 2020), robustness (Jia and Liang, 2017), etc. Research is also emerging on best practices for productizing ML: from detailed dataset documentation (Bender and Friedman, 2018;Gebru et al., 2018), model documentation for highlighting important but often unreported details such as its training data, intended use, and caveats (Mitchell et al., 2019), and documentation best practices (Partnership on AI, 2019), to institutional mechanisms such as auditing (Raji et al., 2020) to enforce accountability and red-teaming (Brundage et al., 2020) to address developer blind spots, not to mention studies on the impact of organizational structures on responsible AI initiatives (Rakova et al., 2020).
Calls for increased accountability and transparency are gaining traction among governments (116th U.S. Congress, NIST, 2019; European Commission, 2020; Smith, 2020; California State Legislature, 2020; FDA, 2021), and customers increasingly cite ethical concerns as a reason for not engaging AI service providers (EIU, 2020).
While there has been significant discussion around best practices for dataset and model creation, work to ensure NLP systems are evaluated in a manner representative of their operational conditions has only just begun. Initial work in constructing representative tests focuses on enabling development teams to easily evaluate their models' linguistic capabilities (Ribeiro et al., 2020) and accuracy on subpopulations and distribution shifts (Goel et al., 2021). However, there is a clear need for a paradigm that allows experts and stakeholder advocates to collaboratively develop tests that are representative of the practical and ethical concerns of an NLP system's target demographic. We argue that reliability testing, by reframing the concept of adversarial attacks, has the potential to fill this gap.

What is reliability testing?
Despite the recent advances in neural architectures resulting in breakthrough performance on benchmark datasets, research into adversarial examples and out-of-distribution generalization has found ML systems to be particularly vulnerable to slight perturbations in the input (Goodfellow et al., 2015) and natural distribution shifts (Fisch et al., 2019). While these perturbations are often chosen to maximize model failure, they highlight serious reliability issues for putting ML models into production since they show that these models could fail catastrophically in naturally noisy, diverse, real-world environments (Saria and Subbaswamy, 2019). Additionally, bias can seep into the system at multiple stages of the NLP lifecycle (Shah et al., 2020), resulting in discrimination against minority groups (O'Neil, 2016). The good news, however, is that rigorous testing can help to highlight potential issues before the systems are deployed.
The need for rigorous testing in NLP is reflected in ACL 2020 giving the Best Paper Award to CheckList (Ribeiro et al., 2020), which applied the idea of behavior testing from software engineering to testing NLP systems. While invaluable as a first step towards the development of comprehensive testing methodology, the current implementation of CheckList may still overestimate the reliability of NLP systems since the individual test examples are largely manually constructed. Importantly, with the complexity and scale of current models, humans cannot accurately determine a model's adversarial distribution (i.e., the examples that cause model failure). Consequently, the test examples they construct are unlikely to be the worst-case examples for the model. Automated assistance is needed. Therefore, we propose to perform reliability testing, which can be thought of as one component of behavior testing. We categorize reliability tests as average-case tests or worst-case tests. As their names suggest, average-case and worst-case tests estimate the expected and lower-bound performance, respectively, when the NLP system is exposed to the phenomena modeled by the tests. Average-case tests are conceptually similar to the counterfactuals proposed in contemporaneous work, while worst-case tests are most similar to adversarial attacks (§4).
Our approach parallels boundary value testing in software engineering: In boundary value testing, tests evaluate a program's ability to handle edge cases using test examples drawn from the extremes of the ranges the program is expected to handle. Similarly, reliability testing aims to quantify the system's reliability under diverse and potentially extreme conditions. This allows teams to perform better quality control of their NLP systems and introduce more nuance into discussions of why and when models fail (§5). Finally, we note that reliability testing and standards are established practices in engineering industries (e.g., aerospace (Nelson, 2003; Wilkinson et al., 2016)) and advocate for NL engineering to be at parity with these fields.

Evaluating worst-case performance in a label-scarce world
A proposed approach for testing robustness to natural and adverse distribution shifts is to construct test sets using data from different domains or writing styles (Miller et al., 2020; Hendrycks et al., 2020), or to use a human vs. model method of constructing challenge sets (Nie et al., 2020; Zhang et al., 2019b). While they are the gold standard, such datasets are expensive to construct, making it infeasible to manually create worst-case test examples for each NLP system being evaluated. Consequently, these challenge sets necessarily overestimate each system's worst-case performance when the inference distribution differs from the training one. Additionally, due to their crowdsourced nature, these challenge sets inevitably introduce distribution shifts across multiple dimensions at once, and even their own biases (Geva et al., 2019), unless explicitly controlled for. Building individual challenge sets for each dimension would be prohibitively expensive due to combinatorial explosion, even before having to account for concept drift (Widmer and Kubat, 1996). This coupling complicates efforts to design a nuanced and comprehensive testing regime. Hence, simulating variation in a controlled manner via reliability tests can be a complementary method of evaluating the system's out-of-distribution generalization ability.

Adversarial Attacks as Reliability Tests
We first give a brief introduction to adversarial attacks in NLP before showing how they can be used for reliability testing. We refer the reader to Zhang et al. (2020b) for a comprehensive survey.
Early work did not place any constraints on the attacks and merely used the degradation to a target model's accuracy as the measure of success. However, this often resulted in the semantics and expected prediction changing, leading to an overestimation of the attack's success. Recent attacks aim to preserve the original input's semantics. A popular approach has been to substitute words with their synonyms using word embeddings or a language model as a measure of semantic similarity (Alzantot et al., 2018; Michel et al., 2019; Ren et al., 2019; Zhang et al., 2019a; Li et al., 2019; Garg and Ramakrishnan, 2020; Li et al., 2020a).

Algorithm 1 General Reliability Test
Require: …
 1–3: …
 4:     switch TestType do
 5:         case AverageCaseTest
 6:             s ← MEAN(S(y′, M(C)))
 7:             X′ ← X′ ∪ C
 8:         case WorstCaseTest
 9:             x′, s ← arg min_{x_c ∈ C} S(y′, M(x_c))
10:             X′ ← X′ ∪ {x′}
11:     r ← r + s
12: end for
13: r ← r / |X|
14: return X′, r
Focusing on maximally degrading model accuracy overlooks the key feature of adversarial attacks: the ability to find the worst-case example for a model from an arbitrary distribution. Many recent attacks perturb the input across multiple dimensions at once, which may make the result unnatural. If we instead constrain the sampled perturbations to a distribution modeling a specific dimension of interest, the performance on the generated adversaries is a valid lower bound on performance for that dimension. Said another way, adversarial attacks can be reframed as interpretable reliability tests if we constrain them to meaningful distributions. This is the key element of our approach as detailed in Alg. 1. We specify either an average-case test (Lines 5-7) or a worst-case test (Lines 8-10), conditioned on the data distribution D that models a particular dimension of interest d. The resultant reliability score gauges real-world performance, and the worst-case variant returns the adversarial examples that cause worst-case performance. When invariance to input variation is expected, y′ is equivalent to the source label y. Note that by ignoring the average-case test logic and removing d, we recover the general adversarial attack algorithm.
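As an illustration, the average-case and worst-case branches of Alg. 1 might be sketched in Python as follows. This is a minimal sketch and not the authors' implementation: the function and parameter names (`reliability_test`, `sample_candidates`, `model`, `score`) are hypothetical, the initialization and candidate-sampling steps are assumed, and invariance is assumed so that the expected label y′ equals the source label y.

```python
from statistics import mean
from typing import Callable, Iterable, List, Sequence, Tuple


def reliability_test(
    dataset: Iterable[Tuple[str, str]],
    sample_candidates: Callable[[str], Sequence[str]],
    model: Callable[[str], str],
    score: Callable[[str, str], float],
    worst_case: bool = True,
) -> Tuple[List[str], float]:
    """Return the perturbed examples and the mean per-example score."""
    perturbed: List[str] = []
    total, n = 0.0, 0
    for x, y in dataset:
        # Candidates come from a distribution modeling one dimension
        # (assumed step); fall back to the source input if none exist.
        candidates = list(sample_candidates(x)) or [x]
        # Assumption: invariance is expected, so y' is the source label y.
        scores = [score(y, model(c)) for c in candidates]
        if worst_case:
            # Keep only the candidate with the lowest score.
            idx = min(range(len(candidates)), key=scores.__getitem__)
            perturbed.append(candidates[idx])
            s = scores[idx]
        else:
            # Keep all candidates and average their scores.
            perturbed.extend(candidates)
            s = mean(scores)
        total += s
        n += 1
    return perturbed, (total / n if n else 0.0)


# Toy, hypothetical components for demonstration only.
def toy_model(text: str) -> str:
    return "pos" if "good" in text else "neg"


def toy_candidates(text: str) -> List[str]:
    return [text, text.replace("good", "gud")]


def exact_match(expected: str, predicted: str) -> float:
    return 1.0 if expected == predicted else 0.0


data = [("good film", "pos"), ("bad film", "neg")]
worst_examples, worst_score = reliability_test(
    data, toy_candidates, toy_model, exact_match, worst_case=True
)
avg_examples, avg_score = reliability_test(
    data, toy_candidates, toy_model, exact_match, worst_case=False
)
```

In this toy setup the worst-case score is lower than the average-case score because the search keeps the misspelled candidate that flips the prediction.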
However, the key difference between an adversarial robustness mindset and a testing one is the latter's emphasis on identifying ways in which natural phenomena or ethical concerns can be operationalized as reliability tests. This change in perspective opens up new avenues for interdisciplinary research that will allow researchers and practitioners to have more nuanced discussions about model reliability and can be used to design comprehensive reliability testing regimes. We describe such a framework for interdisciplinary collaboration next.

A Framework for Reliability Testing
We introduce and then describe our general framework, DOCTOR, for testing the reliability of NLP systems. DOCTOR comprises six steps:
1. Define reliability requirements
2. Operationalize dimensions as distributions
3. Construct tests
4. Test system and report results
5. Observe deployed system's behavior
6. Refine reliability requirements and tests
Defining reliability requirements. Before any tests are constructed, experts and stakeholder advocates should work together to understand the demographics and values of the communities the NLP system will interact with (Friedman and Hendry, 2019) and the system's impact on their lives. The latter is also known as algorithmic risk assessment (Ada Lovelace Institute and DataKind UK, 2021). There are three critical questions to address: 1) Along what dimensions should the model be tested? 2) What metrics should be used to measure system performance? 3) What are acceptable performance thresholds for each dimension?
Question 1 can be further broken down into: a) general linguistic phenomena, such as alternative spellings or code-mixing; b) task-specific quirks, e.g., an essay grading system should not use text length to predict score; c) sensitive attributes, such as gender, ethnicity, sexual orientation, age, or disability status. This presents an opportunity for interdisciplinary expert collaboration: Linguists are best equipped to contribute to discussions around (a), domain experts to (b), and ethicists and social scientists to (c). However, we recognize that such collaboration may not be feasible for every NLP system being tested. It is more realistic to expect ethicists to be involved when applying DOCTOR at the company and industry levels, and ethics-trained NLP practitioners to answer these questions within the development team. We provide a taxonomy of potential dimensions in Table 1 (Appendix).
Since it is likely infeasible to test every possible dimension, stakeholder advocates should be involved to ensure their values and interests are accurately represented and prioritized (Hagerty and Rubinov, 2019), while experts should ensure the dimensions identified can be feasibly tested. A similar approach to that of community juries may be taken. We recommend using this question to evaluate the feasibility of operationalizing potential dimensions: "What is the system's performance when exposed to variation along dimension d?". For example, rather than simply "gender", a better-defined dimension would be "gender pronouns". With this understanding, experts and policymakers can then create a set of reliability requirements, comprising the testing dimensions, performance metric(s), and passing thresholds.
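One way to make a set of reliability requirements machine-checkable is to record each one as a (dimension, metric, threshold) triple. The sketch below is only an illustration under assumptions: the class and function names are hypothetical, the thresholds and scores are placeholder values, and a score equal to the threshold is assumed to pass.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ReliabilityRequirement:
    """One requirement: a testing dimension, a metric, and a passing threshold."""
    dimension: str
    metric: str
    threshold: float


def failing_dimensions(
    requirements: List[ReliabilityRequirement], scores: Dict[str, float]
) -> List[str]:
    """List dimensions whose score is missing or below the threshold."""
    failures = []
    for req in requirements:
        score = scores.get(req.dimension)
        if score is None or score < req.threshold:
            failures.append(req.dimension)
    return failures


# Placeholder requirements and scores, for illustration only.
requirements = [
    ReliabilityRequirement("gender pronouns", "accuracy", 0.90),
    ReliabilityRequirement("alternative spellings", "accuracy", 0.85),
]
scores = {"gender pronouns": 0.93, "alternative spellings": 0.80}
failed = failing_dimensions(requirements, scores)
```

Treating an untested dimension as a failure is a deliberate choice in this sketch, so that a requirement cannot pass simply because no test was run.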
Next, we recommend using the same metrics for held-out, average-case, and worst-case performance for easy comparison. These often vary from task to task and are still a subject of active research (Novikova et al., 2017;Reiter, 2018;Kryscinski et al., 2019), hence the question of the right metric to use is beyond the scope of this paper. Finally, ethicists, in consultation with the other aforementioned experts and stakeholders, will determine acceptable thresholds for worst-case performance.
The system under test must perform above said thresholds when exposed to variation along those dimensions in order to pass. For worst-case performance, we recommend reporting thresholds as relative differences (δ) between the average-case and worst-case performance. These questions may help in applying this step and deciding if specific NLP solutions should even exist (Leins et al., 2020):
• Who will interact with the NLP system, in what context, and using which language varieties?
• What are the distinguishing features of these varieties compared to those used for training?
• What is the (short- and long-term) impact on the community's most underrepresented members if the system performs more poorly for them?
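The relative-difference threshold (δ) mentioned above can be computed in a few lines. The text does not give an explicit formula, so this sketch assumes δ = (average-case − worst-case) / average-case; the function names and the example values are hypothetical.

```python
def relative_drop(average_case: float, worst_case: float) -> float:
    """Assumed definition: delta = (average - worst) / average."""
    if average_case <= 0:
        raise ValueError("average-case score must be positive")
    return (average_case - worst_case) / average_case


def passes(average_case: float, worst_case: float, max_delta: float) -> bool:
    """A system passes if its relative drop does not exceed max_delta."""
    return relative_drop(average_case, worst_case) <= max_delta


# Placeholder scores: average-case 0.80, worst-case 0.60.
delta = relative_drop(0.80, 0.60)
lenient = passes(0.80, 0.60, max_delta=0.30)
strict = passes(0.80, 0.60, max_delta=0.10)
```

Expressing the threshold relative to average-case performance keeps it comparable across tasks whose metrics sit on different scales.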
We note that our framework is general enough to be applied at various levels of organization: within the development team, within the company (compliance team, internal auditor), and within the industry (self-regulation or independent regulator). However, we expect the exact set of dimensions, metrics and acceptable thresholds defined in Step 1 to vary depending on the reliability concerns of the actors at each level. For example, independent regulators will be most concerned with establishing minimum safety and fairness standards that all NLP systems used in their industries must meet, while compliance teams may wish to have stricter and more comprehensive standards for brand reasons. Developers can use DOCTOR to meet the other two levels of requirements and understand their system's behaviour better with targeted testing.
Operationalizing dimensions. While the abstractness of dimensions allows people who are not NLP practitioners to participate in drafting the set of reliability requirements, there is no way to test NLP systems using fuzzy concepts. Therefore, every dimension the system is to be tested along must be operationalizable as a distribution from which perturbed examples can be sampled in order for NLP practitioners to realize them as tests.
Since average-case tests attempt to estimate a system's expected performance in its deployed environment, the availability of datasets that reflect real-world distributions is paramount to ensure that the tests themselves are unbiased. This is less of an issue for worst-case tests; these tests only need to know which perturbations are possible, but not how frequently they occur in the real world. Figuring out key dimensions for different classes of NLP tasks and exploring ways of operationalizing them as reliability tests are also promising directions for future research. Such research would help NLP practitioners and policymakers define reliability requirements that can be feasibly implemented.
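To make the distinction concrete, the sketch below operationalizes a hypothetical "spelling variation" dimension in two ways: enumerating every possible perturbation (all a worst-case test needs) and sampling perturbations at an assumed real-world rate (what an average-case test needs). The variant table, function names, and `swap_prob` parameter are illustrative assumptions, not part of DOCTOR.

```python
import itertools
import random
from typing import Dict, List

# Hypothetical variant table for a "spelling variation" dimension.
SPELLING_VARIANTS: Dict[str, str] = {"colour": "color", "organise": "organize"}


def possible_perturbations(text: str, variants: Dict[str, str]) -> List[str]:
    """Enumerate every text reachable by swapping any subset of known variants."""
    tokens = text.split()
    options = [[t, variants[t]] if t in variants else [t] for t in tokens]
    return [" ".join(choice) for choice in itertools.product(*options)]


def sample_perturbation(
    text: str, variants: Dict[str, str], swap_prob: float, rng: random.Random
) -> str:
    """Swap each known variant with probability swap_prob (assumed frequency)."""
    tokens = text.split()
    swapped = [
        variants[t] if t in variants and rng.random() < swap_prob else t
        for t in tokens
    ]
    return " ".join(swapped)


source = "we organise colour charts"
support = possible_perturbations(source, SPELLING_VARIANTS)
sampled = sample_perturbation(source, SPELLING_VARIANTS, 0.5, random.Random(0))
```

The enumeration grows exponentially with the number of swappable tokens, so a practical worst-case test would likely search this space rather than list it in full.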
Constructing tests. Next, average- and worst-case tests are constructed (Alg. 1). Average-case tests can be data-driven and could take the form of manually curated datasets or model-based perturbation generation (e.g., PolyJuice), while worst-case tests can be rule-based (e.g., Morpheus (Tan et al., 2020)) or model-based (e.g., BERT-Attack (Li et al., 2020a)). We recommend constructing tests that do not require access to the NLP model's parameters (black-box assumption); this not only yields more system-agnostic tests, but also allows for (some) tests to be created independently from the system development team. If the black-box assumption proves limiting, the community can establish a standard set of items an NLP system should export for testing purposes, e.g., network gradients if the system uses a neural model. Regardless of assumption, keeping the regulators' test implementations separate and hidden from the system developers is critical for stakeholders and regulators to trust the results. This separation also reduces overfitting to the test suite.
Testing systems. A possible model for test ownership is to have independently implemented tests at the three levels of organization described above (team, company, industry). At the development team level, reliability tests can be used to diagnose weaknesses with the goal of improving the NLP system for a specific use case and set of target users. Compared to unconstrained adversarial examples, contrasting worst-case examples that have been constrained along specific dimensions with non-worst-case examples will likely yield greater intuition into the model's inner workings. Studying how modifications (to the architecture, training data and process) affect the system's reliability on each dimension will also give engineers insight into the factors affecting system reliability. These tests should be executed and updated regularly during development, according to software engineering best practices such as Agile (Beck et al., 2001).
Red teams are company-internal teams tasked with finding security vulnerabilities in their developed software or systems. Brundage et al. (2020) propose to apply the concept of red teaming to surface flaws in an AI system's safety and security. In companies that maintain multiple NLP systems, we propose employing similar, specialized teams composed of NLP experts to build and maintain reliability tests that ensure their NLP systems adhere to company-level reliability standards. These tests will likely be less task-/domain-specific than those developed by engineering teams due to their wider scope, while the reliability standards may be created and maintained by compliance teams or the red teams themselves. Making these standards available for public scrutiny and ensuring their products meet them will enable companies to build trust with their users. To ensure all NLP systems meet the company's reliability standards, these reliability tests should be executed as a part of regular internal audits (Raji et al., 2020), investigative audits after incidents, and before major releases (especially if it is the system's first release or if it received a major update). They may also be regularly executed on randomly chosen production systems and trigger an alert upon failure.
At the independent regulator level, reliability tests would likely be carried out during product certification (e.g., ANSI/ISO certification) and external audits. These industry-level reliability standards and tests may be developed in a similar manner to the company-level ones. However, we expect them to be more general and less comprehensive than the latter, analogous to minimum safety standards such as IEC 60335-1 (IEC, 2020). Naturally, high risk applications and NLP systems used in regulated industries should comply with more stringent requirements (European Commission, 2021).
Our proposed framework is also highly compatible with the use of model cards (Mitchell et al., 2019) for auditing and transparent reporting (Raji et al., 2020). In addition to performance on task-related metrics, model cards surface information and assumptions about a machine learning system and training process that may not be readily available otherwise. When a system has passed all tests and is ready to be deployed, its average- and worst-case performance on all tested dimensions can be included as an extra section on the accompanying model card. In addition, the perturbed examples generated during testing and their labels (x′, y′) can be stored for audit purposes or examined to ensure that the tests are performing as expected.
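As a rough illustration of such an extra model card section, test results could be serialized into a small structured record. The field names below are hypothetical and are not part of any model card specification; the scores are placeholder values.

```python
import json
from typing import Dict, Tuple


def reliability_section(
    results: Dict[str, Tuple[float, float]], metric: str
) -> dict:
    """Build a model card section from {dimension: (average_case, worst_case)}."""
    rows = [
        {"dimension": dim, "average_case": avg, "worst_case": worst}
        for dim, (avg, worst) in sorted(results.items())
    ]
    return {"metric": metric, "reliability_tests": rows}


# Placeholder results, for illustration only.
section = reliability_section(
    {"spelling variation": (0.91, 0.78), "answer length": (0.93, 0.85)},
    metric="accuracy",
)
print(json.dumps(section, indent=2))
```

Sorting by dimension name keeps the section stable across runs, which makes successive model card versions easier to compare.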
Observing and Refining requirements. It is crucial to regularly monitor the systems' impact post-launch and add, update, or re-prioritize dimensions and thresholds accordingly. Monitoring large-scale deployments can be done via community juries, in which stakeholders who will likely be impacted (or their advocates) give feedback on their pain points and raise concerns about potential negative effects. Smaller teams without the resources to organize community juries can set up avenues (e.g., online forms) for affected stakeholders to give feedback, raise concerns, and seek remediation.

From Concerns to Dimensions
We now illustrate how reliability concerns can be converted into concrete testing dimensions (Step 1) by considering the scenario of applying automated text scoring to short answers and essays from students in the multilingual population of Singapore. We study a second scenario in Appendix A.
Automated Text Scoring (ATS) systems are increasingly used to grade tests and essays (Markoff, 2013; Feathers, 2019). While they can provide instant feedback and help teachers and test agencies cope with large loads, studies have shown that they often exhibit demographic and language biases, such as scoring African- and Indian-American males lower on the GRE Argument task compared to human graders (Bridgeman et al., 2012; Ramineni and Williamson, 2018). Since the results of some tests will affect the futures of the test takers (Salaky, 2018), the scoring algorithms used must be sufficiently reliable. Hence, let us imagine that Singapore's education ministry has decided to create a standard set of reliability requirements that all ATS systems used in education must adhere to.
Linguistic landscape. A mix of language varieties are used in Singapore: a prestige English variety, a colloquial English variety, three other official languages (Chinese, Malay, and Tamil), and a large number of other languages. English is the lingua franca, with fluency in the prestige variety correlating with socioeconomic status (Vaish and Tan, 2008). A significant portion of the population does not speak English at home. Subjects other than languages are taught in English.

Stakeholder impact. The key stakeholders affected by ATS systems would be students in schools and universities. The consequences of lower scores could be life-altering for the student who is unable to enroll in the major of their choice. At the population level, biases in an ATS system trained on normally sampled data would unfairly discriminate against already underrepresented groups. Additionally, biases against disfluent or ungrammatical text when they are not the tested attributes would result in discrimination against students with a lower socioeconomic status or for whom English is a second language.
Finally, NLP systems have also been known to be overly sensitive to alternative spellings (Belinkov and Bisk, 2018). When used to score subject tests, this could result in the ATS system unfairly penalizing dyslexic students (Coleman et al., 2009). Since education is often credited with enabling social mobility, unfair grading may perpetuate systemic discrimination and increase social inequality.
Dimension. We can generally categorize written tests into those that test for content correctness (e.g., essay questions in a history test), and those that test for language skills (e.g., proper use of grammar). While there are tests that simultaneously assess both aspects, modern ATS systems often grade them separately (Ke and Ng, 2019). We treat each aspect as a separate test here.
When grading students on content correctness, we would expect the ATS system to ignore linguistic variation and sensitive attributes as long as they do not affect the answer's validity. Hence, we would expect variation in these dimensions to have no effect on scores: answer length, language/vocabulary simplicity, alternative spellings/misspellings of non-keywords, grammatical variation, syntactic variation (especially those resembling transfer from a first language), and proxies for sensitive attributes.
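The invariance expectation above can be phrased as a simple check: a content score should not move when an answer is perturbed along one of these dimensions. The sketch below uses two toy scorers and hand-written perturbations, all hypothetical, to contrast a scorer that is invariant to non-keyword misspellings and answer length with one that is biased towards longer answers.

```python
from typing import Callable, Iterable, Tuple


def keyword_scorer(answer: str, keywords: Tuple[str, ...] = ("treaty", "1919")) -> float:
    """Toy content scorer: fraction of required keywords present."""
    tokens = set(answer.lower().split())
    return sum(k in tokens for k in keywords) / len(keywords)


def length_biased_scorer(answer: str) -> float:
    """Toy biased scorer: rewards longer answers regardless of content."""
    return min(len(answer.split()) / 6, 1.0)


def is_invariant(
    scorer: Callable[[str], float],
    original: str,
    perturbed_answers: Iterable[str],
    tolerance: float = 0.0,
) -> bool:
    """True if every perturbed answer scores within tolerance of the original."""
    base = scorer(original)
    return all(abs(scorer(p) - base) <= tolerance for p in perturbed_answers)


original = "the treaty was signed in 1919"
perturbed = [
    "teh treaty was signd in 1919",  # misspellings of non-keywords
    "treaty signed 1919",            # shorter answer, same content
]
content_ok = is_invariant(keyword_scorer, original, perturbed)
length_ok = is_invariant(length_biased_scorer, original, perturbed)
```

Here the keyword scorer passes the check while the length-biased scorer fails it on the shortened answer, which is the kind of task-specific quirk a reliability test along the answer-length dimension is meant to expose.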
On the other hand, the system should be able to differentiate proper answers from those aimed at gaming the test (Chin, 2020;Ding et al., 2020).
When grading students on language skills, however, we would expect ATS systems to be only sensitive to the relevant skill. For example, when assessing grammar use, we would expect the system to be sensitive to grammatical errors (from the perspective of the language variety the student is expected to use), but not to the other dimensions mentioned above (e.g., misspellings).
Actors. Relevant experts include teachers of the subjects where the ATS systems will be deployed, linguists, and computer scientists. The stakeholders (students) may be represented by student unions (at the university level) or focus groups comprising a representative sample of the student population.

Implications for Policy
There is a mounting effort to increase accountability and transparency around the development and use of NLP systems to prevent them from amplifying societal biases. DOCTOR is highly complementary to the model card approach increasingly adopted 6 to surface oft hidden details about NLP models: Developers simply need to list the tested dimensions, metrics, and score on each dimension in the model card. Crucially, reliability tests can be used to highlight fairness issues in NLP systems by including sensitive attributes for the target population, but it is paramount these requirements reflect local concerns rather than any prescriptivist perspective (Sambasivan et al., 2021).
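As one way to picture the model card entry described above, the sketch below records each tested dimension with its metric and score and serializes the result. The field names, the system name, and the numeric values are all assumptions made for illustration. They are placeholders, not reported results or an established model card schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class ReliabilityResult:
    dimension: str  # what was varied, e.g. "misspellings of non-keywords"
    metric: str     # how the score is computed
    score: float    # value obtained on the reliability test


@dataclass
class ModelCardReliability:
    system_name: str
    results: List[ReliabilityResult] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize the reliability section for inclusion in a model card."""
        return json.dumps(asdict(self), indent=2)


card = ModelCardReliability(
    system_name="example-ats-system",  # hypothetical system
    results=[
        # Placeholder numbers, not measurements.
        ReliabilityResult("misspellings of non-keywords", "worst-case score drop", 0.125),
        ReliabilityResult("grammatical variation", "worst-case score drop", 0.0),
    ],
)
card_json = card.to_json()
```

Keeping the dimension and metric next to each score lets a reader of the card see what was tested and how, not only the outcome.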
At the same time, the ability to conduct quantitative, targeted reliability testing along specifiable dimensions paves the way for reliability standards to be established, with varying levels of stringency and rigor for different use cases and industries. We envision minimum safety and fairness standards being established for applications that are non-sensitive, not safety-critical, and used in unregulated industries, analogous to standards for household appliances. Naturally, applications at greater risk (Li et al., 2020b) of causing harm upon failure should be held to stricter standards.
Policymakers are starting to propose and implement regulations to enforce transparency and accountability in the use of AI systems. For example, the European Union's General Data Protection Regulation grants data subjects the right to obtain "meaningful information about the logic involved" in automated decision systems (EU, 2016). The EU is developing AI-specific regulation (European Commission, 2020): e.g., requiring developers of high-risk AI systems to report their "capabilities and limitations, ... [and] the conditions under which they can be expected to function as intended". In the U.S., a proposed bill of the state of Washington will require public agencies to report "any potential impacts of the automated decision system on civil rights and liberties and potential disparate impacts on marginalized communities" before using automated decision systems (Washington State Legislature, 2021).
One may note that language in the proposed regulation is intentionally vague. There are many ways to measure bias and fairness, depending on the type of model, context of use, and goal of the system. Today, companies developing AI systems employ the definitions they believe most reasonable (or perhaps easiest to implement), but regulation will need to be more specific for there to be meaningful compliance. DOCTOR's requirement to explicitly define specific dimensions instead of a vague notion of reliability will help policymakers in this regard, and can inform the ongoing development of national (NIST, 2019) and international standards 7.
6 blog.einstein.ai/model-cards-for-ai-model-transparency
While external algorithm audits are becoming popular, testing remains a challenge since companies wishing to protect their intellectual property may be resistant to sharing their code (Johnson, 2021), and implementing custom tests for each system is unscalable. Our approach to reliability testing offers a potential solution to this conundrum by treating NLP systems as black boxes. If reliability tests become a legal requirement, regulatory authorities will be able to mandate independently conducted reliability tests for transparency. Such standards, combined with certification programs (e.g., IEEE's Ethics Certification Program for Autonomous and Intelligent Systems 8 ), will further incentivize the development of responsible NLP, as the companies purchasing NLP systems will insist on certified systems to protect them from both legal and brand risk. To avoid confusion, we expect certification to occur for individual NLP systems (e.g., an end-to-end question answering system for customer enquiries), rather than for general purpose language models that will be further trained to perform some specific NLP task. While concrete standards and certification programs that can serve this purpose do not yet exist, we believe that they eventually will and hope our paper will inform their development. This multi-pronged approach can help to mitigate NLP's potential harms while increasing public trust in language technology.

Challenges and Future Directions
While DOCTOR is a useful starting point to implement reliability testing for NLP systems, we observe key challenges to its widespread adoption. The first is identifying and prioritizing the dimensions that can attest to a system's reliability and fairness. The former is relatively straightforward and can be achieved via collaboration with experts (e.g., as part of the U.S. NIST's future AI standards (NIST, 2019)). The latter, however, is a question of values and power (Noble, 2018; Mohamed et al., 2020; Leins et al., 2020), and should be addressed via a code of ethics and ensuring that all stakeholders are adequately represented at the decision table.
Second, our proposed method of reliability testing may suffer from issues similar to those plaguing automatic evaluation metrics for natural language generation (Novikova et al., 2017; Reiter, 2018; Kryscinski et al., 2019): due to the tests' synthetic nature, they may not fully capture the nuances of reality. For example, if a test's objective were to test an NLP system's reliability when interacting with African American English (AAE) speakers, would it be possible to guarantee (in practice) that all generated examples fall within the distribution of AAE texts? Potential research directions would be to design adversary generation techniques that can offer such guarantees or incorporate human feedback (Nguyen et al., 2017; Kreutzer et al., 2018; Stiennon et al., 2020).
7 ethicsstandards.org/p7000
8 standards.ieee.org/industry-connections/ecpais.html

Conclusion
Once language technologies leave the lab and start impacting real lives, concerns around safety, fairness, and accountability cease to be thought experiments. While it is clear that NLP can have a positive impact on our lives, from typing autocompletion to revitalizing endangered languages (Zhang et al., 2020a), it also has the potential to perpetuate harmful stereotypes (Bolukbasi et al., 2016; Sap et al., 2019), perform disproportionately poorly for underrepresented groups (Hern, 2017; Bridgeman et al., 2012), and even erase already marginalized communities (Bender et al., 2021).
Trust in our tools stems from an assurance that stakeholders will remain unharmed, even in the worst-case scenario. In many mature industries, this takes the form of reliability standards. However, for standards to be enacted and enforced, we must first operationalize "reliability". Hence, we argued for the need for reliability testing (especially worst-case testing) in NLP by contextualizing it among existing work on promoting accountability and improving generalization beyond the training distribution. Next, we showed how adversarial attacks can be reframed as worst-case tests. Finally, we proposed a possible paradigm, DOCTOR, for how reliability concerns can be realized as quantitative tests, and discussed how this framework can be used at different levels of organization or industry.

Broader Impact
Much like how we expect to not be exposed to harmful electric shocks when using electrical appliances, we should expect some minimum levels of safety and fairness for the NLP systems we interact with in our everyday lives. As mentioned in §1, §3, and §7, standards and regulations for AI systems are in the process of being developed for this purpose, especially for applications deemed "high-risk", e.g., healthcare (European Commission, 2020). Reliability testing, and our proposed framework, is one way to approach the problem of enacting enforceable standards and regulations.
However, the flip side of heavily regulating every single application of NLP is that it may slow down innovation. Therefore, it is important that the level of regulation for a particular application is proportionate to its potential for harm (Daten Ethik Kommission, 2019). Our framework can be adapted to different levels of risk by scaling down the implementation of some steps (e.g., the method and depth in which stakeholder consultation happens or the comprehensiveness of the set of testing dimensions) for low-risk applications.
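To illustrate what proportionate requirements might look like in practice, the sketch below maps risk tiers to the dimensions that must be tested and the largest tolerated worst-case score drop. Everything here is hypothetical: the tier names, the dimension lists, the thresholds, and the measured values are placeholders chosen for the example, not proposed standards or results.

```python
from typing import Dict, List, NamedTuple


class Standard(NamedTuple):
    required_dimensions: List[str]  # dimensions that must be tested
    max_score_drop: float           # largest tolerated worst-case drop


# Hypothetical tiers and thresholds, for illustration only.
STANDARDS: Dict[str, Standard] = {
    "low_risk": Standard(["misspellings"], 0.20),
    "high_risk": Standard(
        ["misspellings", "grammatical variation", "sensitive-attribute proxies"],
        0.05,
    ),
}


def unmet_requirements(tier: str, measured: Dict[str, float]) -> List[str]:
    """Return required dimensions that are untested or exceed the tolerated drop."""
    standard = STANDARDS[tier]
    return [
        dim for dim in standard.required_dimensions
        if dim not in measured or measured[dim] > standard.max_score_drop
    ]


# Placeholder measurements for one system.
MEASURED = {"misspellings": 0.125, "grammatical variation": 0.0}

low_risk_gaps = unmet_requirements("low_risk", MEASURED)
high_risk_gaps = unmet_requirements("high_risk", MEASURED)
```

With these placeholder numbers the same system would satisfy the low-risk tier but leave two gaps under the high-risk tier, which is the kind of graded outcome a proportionate regime would need to express.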
Finally, it is important to ensure that any tests, standards, or regulations developed adequately represent the needs of the most vulnerable stakeholders, instead of constructing them in a prescriptivist manner (Hagerty and Rubinov, 2019). Hence, DOCTOR places a strong emphasis on involving stakeholder advocates and analyzing the impact of an application of NLP on the target community.