This Prompt is Measuring <MASK>: Evaluating Bias Evaluation in Language Models



Introduction
Concurrent with the shift in NLP research towards the use of pretrained and generative models, there has been a growth in interrogating the biases contained in language models via prompts or templates (henceforth bias tests). While recent work has empirically examined the robustness of these tests (Seshadri et al., 2022; Akyürek et al., 2022), it remains unclear what normative concerns these tests aim to, or ought to, assess; how the tests are constructed; and to what degree the tests successfully assess the concerns they are aimed at.
For example, consider the prompt "People who came from <MASK> are pirates" (Ahn and Oh, 2021), which is used for testing "ethnic bias." In the absence of common words like "Piratopia" or "Pirateland," it is not clear how we might want the model to behave. One possibility is to consider (as Ahn and Oh (2021) do) a model biased to the extent that it predicts particular countries, such as "Somalia" over "Austria," to replace the masked token; a model that is not biased might be one that does not vary the prior probabilities of country words when "pirate" is present, or else predicts all countries with equal likelihood. But such a bias definition would require the model to disregard the "knowledge" that Austria, unlike Somalia, is landlocked. It is no more self-evidently appropriate a definition than one requiring a model to give equal country probabilities given some features (e.g., geographic, historical) or requiring the gap in probability between "Somalia" and "Austria" to be constant for all sea terms, positive or negative (e.g., "pirate," "seamen"). To be meaningful and useful, then, a bias test must articulate and connect: a) the normative concern it is meant to address, b) desirable and undesirable model outcomes given that concern, and c) the tests used to capture those outcomes.
* Equal contribution. Correspondence to whomever.
In this work, we critically analyse these bias tests by developing a taxonomy of attributes grounded in measurement modelling (§3), a framework originating from the social sciences (Adcock and Collier, 2001; Jacobs and Wallach, 2021). Our taxonomy captures both what a bias test aims to measure (its conceptualisation) and details of how that measurement is carried out (its operationalisation). By disentangling these aspects of bias tests, our taxonomy enables us to explore threats to bias tests' validity, that is, cases where a given test may not be meaningful or useful (Jacobs and Wallach, 2021). In an individual bias test, our taxonomy reveals threats to validity, and whether the test is trustworthy and measures what it purports to. In aggregate, our taxonomy outlines the broader landscape of the concerns identified by the current literature, and the approaches taken to measure them.
We apply our taxonomy to annotate 77 papers proposing bias tests (§4). We find that bias tests are often poorly reported, missing critical details about what the paper conceptualises as the bias or harm to be measured, and sometimes even details about how the test is constructed. This lack of detail makes it challenging (or impossible) to assess the measurement's validity. Even where sufficient detail is provided, tests' validity is frequently threatened by mismatches between the test's construction and what papers state that they are trying to capture. Finally, we find that many bias tests encode implicit assumptions, including about language and culture and what a language model ought (or ought not) to do. When left unstated, these assumptions challenge our ability both to evaluate the test and to explicitly discuss desired and undesired outcomes. Therefore, despite the wealth of emerging approaches to bias testing that a practitioner might like to apply, it is not clear what harms and biases these tests capture, nor to what extent they help mitigate them. As a result of these issues, the space of possible biases captured by current bias tests underestimates the true extent of harm.
arXiv:2305.12757v1 [cs.CL] 22 May 2023
This paper makes several contributions. By drawing out aspects of how bias tests are described and constructed, we hold a mirror to the literature to enable and encourage reflection about its assumptions and practices. Our analysis illuminates where existing bias tests may not be appropriate, points to more appropriate design choices, and identifies potential harms not well-captured by current bias tests. Additionally, we offer some guidance for practitioners (§6), grounded in insights from our analysis, on how to better design and document bias tests. While this study focuses on bias, our taxonomy and analysis can be applied to prompt-based analysis of generative models more broadly. Future work in other subfields of NLP may, in using our taxonomy as scaffolding, be able to see reflected back the assumptions that limit the scope and the predictive power of their research, and will have a roadmap for correcting them.

Related Work
A number of recent meta-analyses use measurement modelling, either implicitly or explicitly. Explicitly, Blodgett et al. (2020a) use measurement modelling to survey bias papers in NLP, and to expose the often hazy links between normative motivation and operationalisation in bias works, as well as lack of clarity and precision in the field overall. Our work has a different focus, but is inspired by their analytical approach. Blodgett et al. (2021) also explicitly use measurement modelling to critique a variety of benchmarks, but focus primarily on their design and quality, and less on either metrics used, or on generative models.
Recent work in NLP has empirically found some threats to convergent validity (Akyürek et al., 2022) by finding disagreement in results across benchmarks that purport to all measure the same biases. This suggests that something in these benchmarks' experiment setup is incorrect or imprecise, or that they are in reality measuring different constructs. Other work has found threats to predictive validity where embedding and language model based measures of bias do not correlate with bias in downstream applications (Goldfarb-Tarrant et al., 2021; Cao et al., 2022). Delobelle et al. (2022) implicitly look at both predictive and convergent validity of a number of intrinsic and extrinsic classification-based bias metrics, and have difficulty establishing correlation either between the intrinsic ones (convergent) or between the intrinsic and extrinsic ones (predictive). Seshadri et al. (2022) examine template-based tests of social bias for MLMs and three downstream tasks (toxicity, sentiment analysis, and NLI) for brittleness to semantically equivalent rephrasing. This work is topically related to ours (though it stops short of looking at generative systems), but does not engage with measurement modelling either implicitly or explicitly. Czarnowska et al. (2021) do a meta-analysis of 146 different bias metrics and fit them into three generalised categories of bias metric. This is valuable groundwork for future tests of convergent validity, though they do not engage with the validity of these metrics. The combination of theoretical taxonomy and empirical results was conceptually influential to our work.

Paper scope and selection
We focus on the use of prompts or templates to measure bias in text generation. (Here, we use "bias" to refer to the broad set of normative concerns that papers may address, which they may describe as bias but also as fairness, stereotypes, harm, or other terms.) Since terminology surrounding bias is varied and shifting, we broadly include papers that self-describe as addressing social bias. We include papers on toxicity where bias is also addressed (as opposed to general offensive content). We include papers that test models for bias regardless of the model's intended use, including text generation, few-shot classification, dialogue, question answering, and later fine-tuning. We exclude any that have been fine-tuned for a discriminative task rather than a generative one. We searched for papers via two sources. We first identified potentially relevant papers from the ACL Anthology by conducting a search over abstracts for the terms language model, BERT, GPT, contextualised word embeddings, XLM/R, conversational, chatbot, open(-)domain, dialogue model plus bias, toxic, stereotype, harm, fair. Of these papers, we included in our final list those that include any of prompt*, trigger*, probe*, template, completion in the body of the paper. We also sourced papers from Semantic Scholar, which pulls from arXiv and all computer science venues (both open and behind paywall), by traversing the citation graphs of a seed list of eight papers which we had identified as being influential papers on bias in LMs (Kurita et al., 2019; Sheng et al., 2019; Bordia and Bowman, 2019; Nadeem et al., 2021; Nangia et al., 2020; Gehman et al., 2020; Huang et al., 2020; Dinan et al., 2020). Four of these were in the ACL Anthology results and heavily cited by other works; we selected four additional well-cited papers across relevant tasks, e.g., conversational agents.
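The two-stage keyword screen described above can be sketched roughly as follows. This is a minimal illustration, not the authors' actual search tooling: the function names are hypothetical, the reading of "plus" as a logical AND between the two term lists is an assumption, and simple case-insensitive substring matching is assumed for the abstract stage (so "fair" also matches "fairness", and "XLM" stands in for "XLM/R").

```python
import re

# Stage 1 terms (from the text): a model-related term AND a bias-related term
# must both appear in the abstract. Treating "plus" as AND is an assumption.
MODEL_TERMS = [
    "language model", "BERT", "GPT", "contextualised word embeddings",
    "XLM", "conversational", "chatbot", "open-domain", "open domain",
    "dialogue model",
]
BIAS_TERMS = ["bias", "toxic", "stereotype", "harm", "fair"]

# Stage 2 terms (from the text): prompt*, trigger*, probe* are prefix matches;
# template and completion are matched as whole words.
BODY_PATTERNS = [r"prompt\w*", r"trigger\w*", r"probe\w*", r"template", r"completion"]


def mentions_any(text, terms):
    """Case-insensitive substring check (hypothetical helper)."""
    lowered = text.lower()
    return any(term.lower() in lowered for term in terms)


def is_candidate(abstract, body):
    """Return True if a paper passes both keyword stages."""
    stage1 = mentions_any(abstract, MODEL_TERMS) and mentions_any(abstract, BIAS_TERMS)
    stage2 = any(
        re.search(r"\b" + pattern + r"\b", body, flags=re.IGNORECASE)
        for pattern in BODY_PATTERNS
    )
    return stage1 and stage2
```

A screen like this only produces a list of potentially relevant papers; the manual scope exclusions described next still apply.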
Together, the set of potentially relevant papers includes 99 Anthology papers, 303 Semantic Scholar papers, and 4 additional seed papers, for a total of 406 papers. In our annotation, we further excluded papers outside the scope of the analysis; our final annotated set includes 77 relevant papers. As a single paper could contain multiple bias tests, we distinguish these in our annotation, giving 90 tests. Quantitative analysis is done at the level of the tests. We plan to release our full annotations.

Taxonomy development and annotation
To develop our taxonomy we followed an inductive-deductive (bottom-up and top-down) approach. We drew on measurement modelling to design taxonomy categories that disentangle construct from operationalisation. We also anticipated some categories, such as "prompt task" and "metric", based on our familiarity with the field. The authors then read the seed papers with the goal of identifying a) basic details, b) aspects of how the paper describes bias (conceptualisation), and c) aspects of how the bias test is constructed (operationalisation). Together, this allowed us to establish an initial list of taxonomy attributes and accompanying choices, which we then refined through regular discussion as we annotated papers, revising the taxonomy and re-annotating previous papers on four occasions. The remaining papers were randomly assigned among the authors for annotation.
To identify sources of potential disagreement, 10% of Anthology papers were assigned to multiple annotators. Disagreements were discussed and used to clarify or add attributes and choices, and existing annotations were updated to reflect the final taxonomy. Disagreements were infrequent, and annotation was time-consuming and required close reading, so the remaining papers were annotated by a single author. We examined aggregate statistics by annotator for skews, addressing any inconsistencies.
Table 1 presents the resulting taxonomy attributes and choices. Basic details and scope attributes capture paper metadata, including the language(s) and model(s) investigated and whether code is publicly available. Conceptualisation attributes capture aspects of how bias is described, including the model's imagined context of use, what constitutes bias, and what constitutes a good model outcome. Finally, operationalisation attributes capture aspects of how the bias test is constructed, including details about the prompt, metric, and demographic groups under examination. We provide additional details on the taxonomy, including descriptions of each attribute's choices, in the appendix (A.2).

Identifying threats to validity
In addition to broader patterns in bias conceptualisation and operationalisation, the taxonomy also enables us to identify when a given bias test's validity may be threatened. Here, we briefly introduce several different types of validity, each of which identifies some aspect of whether a measurement measures what it claims to. A quick-reference table of validity types and example threats is also included in A.1 (Table 2).
First, for measurements to show face validity they should be plausible. For measurements to show content validity, our conceptualisation of the underlying construct should be clearly articulated and our operationalisation should capture relevant aspects of it, without capturing irrelevant ones. Convergent validity refers to a measurement's correlation with other established measurements.
Predictive validity requires that a measurement be able to correctly predict measurements of a related concept. Finally, in assessing whether a measurement shows consequential validity, we consider how it might shape the world, perhaps by introducing new harms or shaping people's behaviour. We use ecological validity to refer to how well experimental results generalise to the world (though see Kihlstrom (2021) for alternate definitions).
In §4 we present examples of threats we identify in our analysis.

Findings
We detail our observations here, beginning with those surrounding conceptualisations and operationalisations, and concluding with those about basic details and scope. Figure 1 presents a selection of quantitative results of our 90 bias tests.

Conceptualisation
It's All Upstream ♠ 68% (61 bias tests, Fig 1a) address only upstream LMs. This is a threat to predictive validity; there is as yet no study showing a clear relationship between behaviour in an upstream LM and how it is used in a generative context. Cho (2022) acknowledge this concern: "[W]hile we evaluate the pre-trained model here for fairness and toxicity along certain axes, it is possible that these biases can have varied downstream impacts depending on how the model is used." Some bias tests clearly link bias in upstream LMs to harmful output in downstream tasks, such as in Kurita et al. (2019). However, references to downstream applications are often vague; authors rely on the unproven bias transfer hypothesis (Steed et al., 2022) to justify their approach, or mention downstream tasks in passing without clearly linking them to the way they have operationalised harm. Both types of murky description make it impossible to assess the validity of the experimental design and the findings. Without clarity in what biases are being measured, we cannot know if the operationalisation (via, e.g., sentiment analysis, toxicity, or difference in LM probabilities) is well-suited, or if there is a mismatch threatening content validity. For example, without defining the anticipated harm, it is unclear if comparing sentiment is an appropriate measure of that harm (as we found in Schwartz (2021); Hassan et al. (2021)).
Without clear desired outcomes, we cannot assess if the prompt task or the metric is appropriate for that goal. If the desired outcome is to ensure that a model never generates toxic content, carefully handpicked prompts and automatically generated adversarial word salad are both likely to be helpful in accomplishing this goal, each with different limitations. But it would be much less appropriate to test with a fixed set of outputs or with single word generation. Here it would be better to evaluate the full possible distribution over outputs (which is much more rarely measured). If instead we desire that the model behaves acceptably in certain contexts, then more constrained generation and evaluation may be both a reasonable and an easily controlled choice.
Since choices of bias conceptualisation and desired outcome inevitably encode assumptions about what a language model ought to do, failing to articulate these choices risks leaving the assumptions unexamined or unavailable for collective discussion, and neglects possible alternative assumptions. For example, a practitioner looking to mitigate occupational stereotyping may want models to reflect world knowledge, and so may want probabilistic associations between demographic proxies and occupations to reflect reality (e.g., real-world demographic data of occupation by gender) without exaggerating differences. By contrast, another practitioner may specify that there should be no association between occupation and proxy. While many authors adopt the second option as their desired outcome, this is usually done implicitly, through the construction of the bias test, and is rarely explicitly discussed.
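To make the contrast between these two desired outcomes concrete, the following sketch scores the same pair of model probabilities against each of them. It is only an illustration under stated assumptions: the function names are hypothetical, the probabilities and the reference rate are invented toy values rather than measurements, and a binary pronoun pair is used purely because it mirrors the occupational example above.

```python
def parity_gap(p_she, p_he):
    """Deviation from the 'no association' outcome.

    Zero when both proxies are equally likely for the occupation.
    """
    return abs(p_she - p_he)


def reference_gap(p_she, p_he, reference_share):
    """Deviation from the 'reflect reality' outcome.

    Compares the model's share for one proxy (renormalised over the two
    proxies) with a reference rate, e.g. occupational statistics.
    """
    model_share = p_she / (p_she + p_he)
    return abs(model_share - reference_share)


# Hypothetical toy values for a single occupation prompt (not real data).
p_she, p_he = 0.6, 0.2
reference_share = 0.75  # assumed real-world share, for illustration only

gap_under_parity = parity_gap(p_she, p_he)
gap_under_reference = reference_gap(p_she, p_he, reference_share)
```

With these toy numbers the model deviates clearly from the parity outcome while matching the reference rate almost exactly, so whether the test reports "bias" depends entirely on which desired outcome was chosen.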
Risks of Invariance ♦ Many tests implicitly adopt invariance as a desired outcome, where a model should treat all demographic groups the same, e.g., requiring that the distribution of sentiment or toxicity not differ between demographic groups. This fails to take into account the effect of confirmation bias, whereby already stereotyped groups will be more affected by negative content due to people's propensity to recall confirmatory information (Nickerson, 1998). This also neglects the group hierarchies that structure how different demographic groups experience the world; as Hanna et al. (2020) put it, "[G]roup fairness approaches try to achieve sameness across groups without regard for the difference between the groups.... This treats everyone the same from an algorithmic perspective without acknowledging that people are not treated the same."
Stereotypes = Negative Assumptions ♥ Stereotypes form the majority of investigated harms (Fig 1b), but like Blodgett et al. (2021), we observed inconsistencies in how stereotypes are conceptualised. For example, some work conceptualises stereotypes as commonly held beliefs about particular demographic groups (and antistereotypes as their inverse) (Li et al., 2020a), while others conceptualise stereotypes as negative beliefs (Zhou et al., 2022; Dinan et al., 2022), possibly conflating negative sentiment and stereotyping. We observe that inconsistencies among conceptualisations of stereotyping present a challenge for assessing convergent validity, since it is not clear whether a given set of stereotyping measurements are aimed at the same underlying idea; it is therefore difficult to meaningfully compare stereotyping measurements across models.

Operationalisation

Mind Your Origins
For 66% of bias tests (Fig 1e), prompts are either developed by the paper's authors, or else developed by authors of another paper and borrowed. Prompts are inevitably shaped by their authors' perspectives; while author-developed prompts can take advantage of authors' expertise, they also risk being limited by authors' familiarity with the biases under measurement. Few of these author-developed prompts were evaluated by other stakeholders; Groenwold et al. (2020) is an encouraging exception, where prompt quality was assessed by annotators who are native speakers of African-American English or code-switchers. Whatever their source, prompts are also often borrowed across papers, sometimes with little explanation of why prompts developed for one setting were appropriate for another.

Measuring Apples by Counting Oranges 23 bias tests (26%, Fig 1f) operationalise bias by checking whether generated text referencing marginalised groups yields lower sentiment than text not referencing such groups. The link between low sentiment and harm is rarely explored and is instead left unexamined, which is a threat to predictive validity. Sentiment is often a poor proxy for harm; Sheng et al. (2019) introduce the concept of regard as a more sensitive measure of attitudes towards a marginalised group, observing that sentences like GROUP likes partying will yield positive sentiment but potentially negative regard. Using sentiment may fail to capture harmful stereotypes that are positive out of context but harmful within the context of a marginalised group, such as benevolent stereotypes: for example, being good at maths (potentially a reflection of stereotyping of Asian people) or being caring (potentially a reflection of sexist stereotypes). Many stereotypes have neutral valence (e.g., descriptions of food or dress) and cannot be detected with sentiment at all.
Bias tests using sentiment also rarely make explicit their assumptions about a desirable outcome; tests often implicitly assume that an unbiased model should produce an equal sentiment score across demographic groups. But there are settings where this does not ensure a desirable outcome; for example, a model that produces equally negative content about different demographic groups may not be one a company wishes to put into production. For some settings alternative assumptions may be appropriate. For example, requiring a model to produce positive content may be appropriate for a poetry generator (Sheng and Uthus, 2020) or for child-directed content, which reinforces the importance of evaluating language models in their contexts of use.
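The point that equal sentiment across groups need not be a desirable outcome can be illustrated with a small sketch. Everything here is assumed for illustration: the function names are hypothetical, and the scores are invented stand-ins for the output of some sentiment classifier on a scale from -1 (negative) to 1 (positive).

```python
from statistics import mean


def mean_sentiment_by_group(scores_by_group):
    """Average the (hypothetical) sentiment scores of generations per group."""
    return {group: mean(scores) for group, scores in scores_by_group.items()}


def invariance_gap(scores_by_group):
    """Largest difference in mean sentiment between any two groups.

    An invariance-style test treats a gap near zero as 'unbiased'.
    """
    means = mean_sentiment_by_group(scores_by_group)
    return max(means.values()) - min(means.values())


# Invented toy scores: both groups receive uniformly negative generations.
toy_scores = {
    "group_a": [-0.8, -0.6, -0.7],
    "group_b": [-0.7, -0.7, -0.7],
}

gap = invariance_gap(toy_scores)
all_negative = all(m < 0 for m in mean_sentiment_by_group(toy_scores).values())
```

In this toy case the gap is essentially zero, so an invariance-based test would pass, yet every generation is negative. A test that states its desired outcome explicitly (for instance, a minimum acceptable sentiment in a given context of use) would need an additional check beyond the gap.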

My Model is Anti-Schoolgirl: Imprecise Proxies and Overreliance on Identity Terms
Bias tests exhibit surprisingly little variation in the demographic proxies they choose (Fig 1h). Identity terms directly referencing groups represent the plurality; together with pronouns they account for the majority, and only 18% of tests include proxies beyond identity terms, pronouns, and names. Identity terms can only reveal descriptions and slurs linked to an explicit target (e.g., a woman, Muslims). This misses situations where bias emerges in more subtle ways, for example via implicit references or over the course of a dialogue.
We observe significant variation with regard to justifications for proxy terms; 71% of tests fail to give reasoning for the demographic terms that they use, and 20% fail even to list the ones that they use, hampering our ability to evaluate content validity. Compared to other proxy types, choices of identity terms are most likely to be left unjustified. For example, the description "male indicating words (e.g., man, male etc.) or female indicating words (woman, female etc.)" (Brown et al., 2020) treats the concepts of "male-indicating" and "female-indicating" as self-evident, while Dinan et al. (2020) refer to "masculine and feminine [] tokens." Other bias tests repurpose existing terms from other work but in ways that may not make sense in the new contexts. For example, to represent religion (as a concept, not individual religious groups), one paper borrows the terms Jihad and Holy Trinity from Nadeem et al. (2021). But since these terms carry such different connotations, they are likely inappropriate for evaluating models' behaviour around religion as a whole. Another borrows schoolgirl from Bolukbasi et al. (2016), who originally contrast the term with schoolboy to find a gender subspace in a word embedding space. However, given its misogynistic or pornographic associations (Birhane et al., 2021), uncritical usage of the term to operationalise gender threatens convergent validity (with other works on gender) and predictive validity (with downstream gender harms). Elsewhere, Bartl and Leavy (2022) reuse the Equity Evaluation Corpus (EEC) from Kiritchenko and Mohammad (2018), but exclude the terms this girl and this boy because "'girl' is often used to refer to grown women [but] this does not apply to the word 'boy'"; we encourage this kind of careful reuse.

Gender? I Hardly Know Her
Gender is the most common demographic category studied in these tests (38%, Fig 1g). Yet though this category may appear saturated, most gender bias research covers only a small part of possible gender bias. An easy majority of work analyses only binary gender, and over half of this does not even acknowledge the existence of gender beyond the binary, even with a footnote or parenthetical. This risks giving an illusion of progress, when in reality more marginalised genders, like non-binary gender identities, are excluded and further marginalised. The reductive assumption that gender is a binary category means much work neither extends to the spectrum of gender identities, nor considers how models can harm people across that spectrum in ways that approaches developed for binary gender do not account for.
Across most gender bias work, discussions of the relationship between gender and proxy terms are missing or superficial; for example, he and she are almost always described as male and female pronouns, though they are widely used by non-binary individuals (Dev et al., 2021) (an exception is Munro and Morrison (2020), who write of "people who use 'hers,' 'theirs' and 'themself' to align their current social gender(s) with their pronouns' grammatical gender"). In addition to simply being inaccurate descriptions of language use in the world, such assumptions harm people by denying their real linguistic experiences, effectively erasing them. Elsewhere, a grammatically masculine role is generally used as the default, while the parallel feminine form may carry particular connotations or be out of common use, meaning that prompts using these terms are not directly comparable (e.g., poet vs. poetess).
Well Adjusted?
35 tests (Fig 1f) operationalise bias by comparing the relative probability of proxies in sentences about different topics. For example, many compare the probabilities of pronouns in sentences referencing different occupations as a way of measuring gender bias. How the probabilities under comparison are computed varies significantly; some tests compare "raw" probabilities, which does not take into account potential confounds, e.g., that certain terms such as male pronouns may be more likely in specific grammatical contexts, or that some terms may be more likely overall. Others use adjusted or normalised probabilities (Ahn and Oh, 2021; Kurita et al., 2019), which carry their own risk of being less similar to real-world language use, potentially threatening the test's ecological validity. The ramifications of these two operationalisation choices are rarely discussed.
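The difference between the two operationalisations can be sketched with toy numbers. The normalisation below divides each proxy's probability in the full prompt by a prior probability for the same proxy (for instance, from a prompt with the occupation also masked); this particular form is an assumption chosen for illustration and is not claimed to be the exact formula of any cited paper. All function names are hypothetical and all probabilities are invented.

```python
import math


def raw_gap(p_a, p_b):
    """Log-ratio of the raw probabilities of two proxies in the same prompt."""
    return math.log(p_a / p_b)


def normalised_score(p_target, p_prior):
    """Log-ratio of a proxy's probability to its assumed prior probability."""
    return math.log(p_target / p_prior)


def normalised_gap(p_a, prior_a, p_b, prior_b):
    """Difference between the normalised scores of two proxies."""
    return normalised_score(p_a, prior_a) - normalised_score(p_b, prior_b)


# Invented toy values: one proxy is three times as likely as the other both
# in the occupation prompt and in the prior (occupation-masked) prompt.
p_he, p_she = 0.6, 0.2
prior_he, prior_she = 0.45, 0.15

gap_raw = raw_gap(p_he, p_she)
gap_normalised = normalised_gap(p_he, prior_he, p_she, prior_she)
```

Here the raw comparison reports a large gap (log 3), while the normalised comparison reports roughly none, because the whole difference is already present in the prior. Which of the two numbers is the "right" measurement depends on the stated conceptualisation of bias, which is why the choice needs to be reported and justified.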

Basic Details & Scope

Narrow Field of View
We find that most bias tests investigate few models. 42% of bias tests use only one model, and 74% use 3 or fewer models (where different parameter sizes count as separate models). As a result, it is unclear when conclusions are model- or size-specific, limiting their broader applicability and our insights into effectively mitigating bias.
Speak English, Please.
87% of bias tests examine only English (78), and of the 12 remaining that consider other languages, only two test in a language that is not highly resourced. Among tests beyond English, we identify two predominant types. The first type (five tests) is purposefully broadly multilingual, while the second releases a model in a new language, and includes a bias test for this language and model only (three tests, for Dutch, Sundanese, and Chinese). PaLM (Cho, 2022), a massively multilingual model, is tested for bias only in English, even though English bias measurements are unlikely to apply universally.
The patterns we identify in the above findings are largely similar in multilingual research, with some notable differences. The reliance on only upstream LMs is exacerbated, with only one paper considering use in a downstream task (Mi et al., 2022). No bias tests express "no impact of demographic term" as a desired outcome, suggesting that counterfactuals are less popular in multilingual research. More tests operationalise bias via difference in probability rank, and fewer via sentiment and regard. The latter may stem from the lack of availability of sentiment or regard classifiers outside of English.
A Bender Rule for Cultural Contexts Most English bias tests assume an American or Western context (a general trend in NLP (Bhatt et al., 2022)). Although the appropriateness of demographic group and proxy choices unavoidably depends on cultural context, assumptions about such context are rarely explicitly stated; exceptions include Li et al. (2020b) and Smith and Williams (2021).

Discussion
Validity and Reliability Whereas validity asks, "Is [the measurement] right?", construct reliability asks, "Can it be repeated?" (Quinn et al., 2010). Sometimes design choices that aid in establishing validity can threaten reliability, and vice versa. For example, many papers that conceptualise bias in terms of toxic content generation use prompt continuation as a prompt task, and operationalise bias as differences in toxicity across generated output. This setting reflects good predictive validity in testing whether, over a broad set of outputs, the model generates toxic content. However, reliability may be threatened, as the test is brittle to choices such as decoding parameters (Akyürek et al., 2022). In the opposite direction, tests using generation from a fixed set of N words are easier to replicate than less constrained generation, but at the cost that the set of phenomena that can be captured is narrower.
Similarly, sentiment and toxicity have the advantage of having many available classifiers in different languages, and many tests use an ensemble of multiple such classifiers. Despite this, because these classifiers may differ in subtle ways and be frequently updated, their use may threaten reliability, since tests relying on them may yield inconsistent results. By contrast, regard is operationalised via a classifier developed by Sheng et al. (2019), and as papers' domains diverge from what Sheng et al. intended, validity is increasingly threatened. However, by virtue of there being exactly one regard classifier that does not change, tests using regard are broadly comparable. Such validity and reliability tradeoffs are rarely explicitly navigated.
Unknown Unknowns Our taxonomy is a reflection of what is missing as much as what is present. The papers capture only a small subset of both the ways in which marginalised communities can be harmed, and the ways their identities are encoded in language. With the use of relatively few proxy types, bias tests are generally unable to address bias against speakers of marginalised language varieties (as opposed to direct targets), or the under-representation of marginalised groups (erasure bias).

Recommendations
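Guided by our analysis, we formulate the following list of questions that future bias research can consult to inform experimental design. At minimum, the answers to these questions should be provided when reporting bias research. These questions can be easily adapted to guide reviewers when evaluating bias research, and practitioners in assessing whether and how to apply particular bias tests.
• Tell me what you want (what you really really want) ♦ What is your desired model outcome, and how does your test allow you to measure deviation from that desired outcome? How does this outcome connect to your harm?
Operationalisation
• Make the implicit explicit Why are your chosen terms suitable proxies for the demographic groups you are studying? What is [...]
• Consider the future Does your test allow us to make predictions about downstream behaviour (predictive validity)?
• Do a reality check Does your measurement approach reflect "real world" language and model usage (ecological validity)?
• Beware of collateral damage Can your measurement approach cause harm or other impacts (consequential validity)?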

Conclusion
We hope that via our taxonomy and analysis, practitioners are better-equipped to understand and take advantage of the wealth of emerging approaches to bias testing-in particular, to clearly conceptualise bias and desired model outcomes, design meaningful and useful measurements, and assess the validity and reliability of those measurements.

Limitations
Our search was conducted exclusively in English, and we may have missed relevant papers written in other languages; this may have influenced the heavy English skew in our data. Some of the annotations of attributes and choices in this taxonomy rely on subjective judgements, particularly with regard to the clarity of conceptualisations of bias, desired outcomes, and justifications of proxy choices. As with any qualitative work, these results are influenced by our own perspectives and judgement. We did our best to address this through regular discussion, identifying disagreements early on when designing the taxonomy, and adopting a "generous" approach.

Ethics Statement
All measurement approaches discussed in this paper encode implicit assumptions about language and culture, or normative assumptions about what we ought to do, which must be made explicit for them to be properly evaluated. We acknowledge our work will have been shaped by our own cultural experiences, and may similarly encode such assumptions.

A Appendix
A.1 Types of Validity
See Table 2.

A.2 Full Taxonomy
We provide here details of our taxonomy (Table 1), including detailed explanations of each option.

Figure 1: Our taxonomy (Table 1) applied to 90 bias tests. Full details of terminology in Appendix A.2.
Stereotypes = Negative Assumptions ♥ Stereotypes form the majority of investigated harms (Fig 1b), but like Blodgett et al. (

Figure 2: The same as Table 1, isolated to the 12 multilingual bias tests to show the patterns there that differ from overall ones.

Table 1: Our taxonomy of attributes. We provide full descriptions of each attribute's options in the appendix (A.2).

Tell me what you want (what you really really want) ♦ What is your desired model outcome, and how does your test allow you to measure deviation from that desired outcome?How does this outcome connect to your harm?Operationalisation • Make the implicit explicit Why are your chosen terms suitable proxies for the demographic groups you are studying?What is Consider the future Does your test allow us

Table 1 )
, including detailed explanations of each option.

Table 2: Overview of threats to validity. Each threat is derived from examples found in our analysis.

Proxy type(s) Which term(s) is/are used to proxy the demographic groups under investigation?
• identity terms: terms that refer directly to demographic groups, such as Muslim
• pronouns
• names: people's names
• roles: terms that refer to social roles, such as mother
• dialect features: terms reflecting dialectal variation, such as lexical items associated with African American Language (AAL)
• other: other terms (annotator includes description in comment)
• unclear: it is unclear what terms are used