Choose Your Lenses: Flaws in Gender Bias Evaluation

Considerable efforts to measure and mitigate gender bias in recent years have led to the introduction of an abundance of tasks, datasets, and metrics used to this end. In this position paper, we assess the current paradigm of gender bias evaluation and identify several flaws in it. First, we highlight the importance of extrinsic bias metrics, which measure how a model's performance on some task is affected by gender, as opposed to intrinsic evaluations of model representations, which are less strongly connected to specific harms to people interacting with systems. We find that only a few extrinsic metrics are measured in most studies, although more could be measured. Second, we find that datasets and metrics are often coupled; we discuss how their coupling hinders the ability to obtain reliable conclusions, and how the two may be decoupled. We then investigate how the choice of dataset and its composition, as well as the choice of metric, affect bias measurement, finding significant variation across each of them. Finally, we propose several guidelines for more reliable gender bias evaluation.


Introduction
A large body of work has been devoted to measuring and mitigating social biases in natural language processing (NLP), with a particular focus on gender bias (Sun et al., 2019; Blodgett et al., 2020; Garrido-Muñoz et al., 2021; Stanczak and Augenstein, 2021). These considerable efforts have been accompanied by various tasks, datasets, and metrics for evaluating and mitigating gender bias in NLP models. In this position paper, we critically assess the predominant evaluation paradigm and identify several flaws in it. These flaws hinder progress in the field, since they make it difficult to ascertain whether progress has actually been made.
Gender bias metrics can be divided into two groups: extrinsic metrics, such as performance difference across genders, measure gender bias with respect to a specific downstream task, while intrinsic metrics, such as WEAT (Caliskan et al., 2017), are based on the internal representations of the language model. We argue that measuring extrinsic metrics is crucial for building confidence in proposed metrics, defining the harms caused by the biases found, and justifying the motivation for debiasing a model and for using the suggested metrics as a measure of success. However, we find that many studies on gender bias measure only intrinsic metrics. As a result, it is difficult to determine what harm the presumably found bias may be causing. When it comes to gender bias mitigation efforts, improving intrinsic metrics may produce an illusion of greater success than reality, since their correlation with downstream tasks is questionable (Goldfarb-Tarrant et al., 2021; Cao et al., 2022). In the minority of cases where extrinsic metrics are reported, only a few metrics are measured, although it is possible and sometimes crucial to measure more.
Additionally, gender bias measures are often applied as a dataset coupled with a measurement technique (a.k.a. metric), but we show that they can be separated. A single gender bias metric can be measured using a wide range of datasets, and a single dataset can be applied to a wide variety of metrics. We then demonstrate how the choice of gender bias metric and the choice of dataset can each affect the resulting measures significantly. For example, measuring the same metric on the same model with an imbalanced versus a balanced dataset can yield very different results. It is thus difficult to compare newly proposed metrics and debiasing methods with previous ones, hindering progress in the field.
To summarize, our contributions are:
• We argue that extrinsic metrics are important for defining harms (§2), but researchers do not use them enough even though they can (§5).
• We demonstrate the coupling of datasets with metrics and the feasibility of other combinations ( §3).
• Observing that a specific metric can be measured on many possible datasets and vice versa, we demonstrate how the choice and composition of a dataset (§4), as well as the choice of bias metric to measure (§5), can strongly influence the measured results.
• We provide guidelines for researchers on how to evaluate gender bias correctly (§6).
Bias Statement This paper examines metrics and datasets that are used to measure gender bias, and discusses several pitfalls in the current paradigm. As a result of the observations and proposed guidelines in this work, we hope that future results and conclusions will become clearer and more reliable.
The definition of gender bias in this paper is given through the discussed metrics, as each metric reflects a different definition. Some of the examined metrics are measured on concrete downstream tasks (extrinsic metrics), while others are measured on internal model representations (intrinsic metrics). The definitions of intrinsic and extrinsic metrics do not align perfectly with the definitions of allocational and representational harms (Kate Crawford, 2017). In the case of allocational harm, resources or opportunities are unfairly allocated due to bias. Representational harm, on the other hand, occurs when a certain group is negatively represented or ignored by the system. Extrinsic metrics can be used to quantify both allocational and representational harms, while intrinsic metrics can only quantify representational harms, and only in some cases.
There are also other important pitfalls that are not discussed in this paper, such as the focus on high-resource languages like English and the binary treatment of gender (Sun et al., 2019; Stanczak and Augenstein, 2021; Dev et al., 2021). Inclusive research on non-binary genders would require a new set of methods, which could benefit from the observations in this work.

The Importance of Extrinsic Metrics in Defining Harms
In this paper, we divide gender bias metrics into three groups:
• Extrinsic performance: measures how a model's performance is affected by gender, and is calculated with respect to particular gold labels. For example, the True Positive Rate (TPR) gap between female and male examples.
• Extrinsic prediction: measures the model's predictions, such as the output probabilities, but the bias is not calculated with respect to gold labels. Instead, the bias is measured by the effect of gender or stereotypes on model predictions. For example, the probability gap can be measured for a language model queried on two sentences, one pro-stereotypical ("he is an engineer") and one anti-stereotypical ("she is an engineer").
• Intrinsic: measures bias in internal model representations, and is not directly related to any downstream task. For example, WEAT.
It is crucial to define how measured bias harms those interacting with the biased systems (Barocas et al., 2017; Kate Crawford, 2017; Blodgett et al., 2020; Bommasani et al., 2021). Extrinsic metrics are important for motivating bias mitigation and for accurately defining "why the system behaviors that are described as 'bias' are harmful, in what ways, and to whom" (Blodgett et al., 2020), since they clearly demonstrate the performance disparity between protected groups.
For example, in a theoretical CV-filtering system, one can measure the TPR gap between female and male candidates. A gap in TPR favoring men means that, given a set of valid candidates, the system picks valid male candidates more often than valid female candidates. The impact of this gap is clear: qualified women are overlooked because of bias. In contrast, consider an intrinsic metric such as WEAT (Caliskan et al., 2017), which is derived from the proximity (in vector space) of words like "career" or "family" to "male" or "female" names. If one finds that male names relate more to career and female names relate more to family, the consequences are unclear. In fact, Goldfarb-Tarrant et al. (2021) found that WEAT does not correlate with other extrinsic metrics. However, many studies report only intrinsic metrics (a third of the papers we reviewed, §5).

Figure 1: "The developer argued with the designer because she did not like the design." / "The developer argued with the designer because he did not like the design." Developers are stereotyped to be male.
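The TPR gap from the CV-filtering example can be sketched in a few lines. The data layout here (parallel lists of gold labels, predictions, and a binary gender attribute) is an illustrative assumption of ours, not a prescribed format:

```python
def tpr(labels, preds):
    # True positive rate: share of gold-positive examples predicted positive.
    positives = [p for l, p in zip(labels, preds) if l == 1]
    return sum(positives) / len(positives)

def tpr_gap(labels, preds, genders, group_a="F", group_b="M"):
    # TPR of group_a minus TPR of group_b; a negative value means valid
    # group_a candidates are picked less often than group_b ones.
    def select(g):
        return ([l for l, x in zip(labels, genders) if x == g],
                [p for p, x in zip(preds, genders) if x == g])
    la, pa = select(group_a)
    lb, pb = select(group_b)
    return tpr(la, pa) - tpr(lb, pb)
```

A gap of, say, −0.5 would mean the system accepts valid male candidates at a rate 50 points higher than valid female candidates.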

Coupling of Datasets and Metrics
In this section, we discuss how datasets and metrics for gender bias evaluation are typically coupled, how they may be decoupled, and why this is important.We begin with a representative test case, followed by a discussion of the general phenomenon.

Case study: Winobias
Coreference resolution aims to find all textual expressions that refer to the same real-world entity. A popular dataset for evaluating gender bias in coreference resolution systems is Winobias (Zhao et al., 2018a). It consists of Winograd schema (Levesque et al., 2012) instances: two sentences that differ only by one or two words, but contain ambiguities that are resolved differently in the two sentences based on world knowledge and reasoning. Winobias sentences come in anti- and pro-stereotypical pairs, as shown in Figure 1. Coreference systems should be able to resolve both sentences correctly, but most perform poorly on the anti-stereotypical ones (Zhao et al., 2018a, 2019; de Vassimon Manela et al., 2021; Orgad et al., 2022).
Winobias was originally proposed as an extrinsic evaluation dataset, with a reported metric of anti- and pro-stereotypical performance disparity. However, other metrics can also be measured, both intrinsic and extrinsic, as shown in several studies (Zhao et al., 2019; Nangia et al., 2020b; Orgad et al., 2022). For example, one can measure how many stereotypical choices the model preferred over anti-stereotypical choices (an extrinsic performance measure), as done on Winogender (Rudinger et al., 2018), a similar dataset. Winobias sentences can also be used to evaluate language models (LMs), by testing whether an LM assigns higher probabilities to pro-stereotypical sentences (Nangia et al., 2020b) (an extrinsic prediction measure). Winobias can also be used for intrinsic metrics, for example as a template for SEAT (May et al., 2019a) and CEAT (Guo and Caliskan, 2021) (contextual extensions of WEAT). Each of these metrics reveals a different facet of gender bias in a model. An explicit measure of how many pro-stereotypical choices were preferred over anti-stereotypical choices has a different meaning than a performance metric gap between two genders. Additionally, measuring an intrinsic metric on Winobias may help tie the results to the model's behavior on the same dataset in the downstream coreference resolution task.
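The stereotype-preference measure just described can be sketched as follows. Given Winobias-style (pro, anti) sentence pairs and any sentence scorer — e.g., a language model's log-probability; here an arbitrary callable, which is our assumption — we count how often the pro-stereotypical variant is preferred:

```python
def stereotype_preference_rate(pairs, score):
    # Fraction of (pro, anti) pairs where the model scores the
    # pro-stereotypical sentence higher; 0.5 would indicate no preference.
    preferred = sum(1 for pro, anti in pairs if score(pro) > score(anti))
    return preferred / len(pairs)
```

In practice `score` would wrap a real LM; the dictionary-based scorer in any quick test is purely illustrative.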

Many possible combinations for datasets and metrics
Winobias is one example out of many. In fact, benchmarks for gender bias evaluation are typically proposed as a package of two components:

1. A dataset on which the benchmark task is performed.

2. A metric, the particular method used to calculate the bias of a model on the dataset.
Usually, these benchmarks are considered as a bundle; however, they can often be decoupled, mixed, and matched, as discussed in the Winobias case study above. The work by Delobelle et al. (2021) is an exception: they gathered a set of templates from diverse studies and tested them all using the same metric.
In Table 1, we present possible combinations of datasets (rows) and metrics (columns) from the gender bias literature. The metrics are partitioned according to the three classes defined in Section 2. We present only metrics valid for assessing bias in contextualized LMs (rather than static word embeddings), since these are the common practice nowadays. The table does not claim to be exhaustive; rather, it illustrates how metrics and datasets can be repurposed in many different ways. The metrics are described in Appendix A, but the categories are very general, and even a single column like "Gap (Label)" represents a wide variety of metrics that can be measured.
Table 1: A check mark marks the original metric used on the dataset, and (aug) marks metrics that can be measured after augmenting the dataset such that every example is matched with a counterexample of another gender. Extrinsic performance metrics depend on gold labels, while extrinsic prediction metrics do not. A full description of the metrics is given in Appendix A.
Several datasets allow many metrics to be measured (many check marks in the same row). Some datasets, such as Bias in Bios (De-Arteaga et al., 2019), are compatible with numerous metrics, while others have fewer, but still multiple, compatible metrics. Bias in Bios is compatible with many metrics because it contains the information needed to calculate them: in addition to gold labels, it has gender labels and clear stereotype definitions derived from the labels, which are professions. Text corpora and template data, which do not address a specific task (bottom seven rows), are mostly compatible with intrinsic metrics. The compatibility of intrinsic metrics with many datasets may explain why papers report intrinsic metrics more often (§5). Additionally, Table 1 indicates that not many datasets can be used to measure extrinsic metrics, particularly extrinsic performance metrics that require gold labels. On the other hand, measuring LM predictions on target words, which we consider extrinsic, can be done on many datasets. This is useful for analyzing bias in LMs, and can be done by computing bias metrics from the LM's output predictions, such as the mean probability gap when predicting the word "he" versus "she" in specific contexts. Also, some templates are valid for measuring extrinsic prediction metrics, especially stereotype-related metrics, as they were developed with explicit stereotypes in mind (such as profession-related stereotypes).

From Table 1, it is clear that there are many possible ways to measure gender bias in the literature, yet they all fall under the vague category of "gender bias". Each of the possible combinations gives a different definition, or interpretation, of gender bias. The large number of different metrics makes it difficult or even impossible to compare different studies, including proposed gender bias mitigation methods. This raises questions about the validity of results derived from specific combinations of measurements. In the next two sections, we demonstrate how the choices of dataset and metric can affect the bias measurement.

Effect of Dataset on Measured Results
The choice of data used to measure bias has an impact on the calculated bias. Many researchers have used sentence templates that are "semantically bleached" (e.g., "This is <word>.", "<person> studied <profession> at college.") to adapt metrics developed for static word embeddings to contextualized representations (May et al., 2019b; Kurita et al., 2019; Webster et al., 2020; Bartl et al., 2020). Delobelle et al. (2021) found that the choice of templates significantly affected the results, with little correlation between different templates. Additionally, May et al. (2019b) reported that templates are not as semantically bleached as expected.
Another common feature of bias metrics is the use of hand-curated word lexicons by almost every bias metric in the literature. Antoniak and Mimno (2021) reported that the choice of lexicon can greatly affect bias measurement, leading to differing conclusions between lexicons.

The statistical fairness metrics (bottom block in Table 2) show a significant difference in the measured bias across different test set balancings. Oversampling shows less bias than measured on the original test set, while subsampling yields mixed results: it decreases one metric while increasing another.
What is the "correct" test set? Since metrics are defined over the entire dataset, they are sensitive to its composition. To measure bias in a model, the dataset used should be as unbiased as possible; thus, balanced datasets are preferable.
If we were only concerned with measuring one of the reduced metrics on a non-balanced test set, we could misrepresent the fairness of the model. Indeed, it is common practice to measure only a small portion of the metrics that could be measured, as we show in Section 5, which makes us vulnerable to misinterpretations.

Figure 3: Correlation between an intrinsic metric (compression) and an extrinsic metric (sum of sufficiency gaps), for various models trained on the occupation prediction task. "None" was trained on the original dataset, "Oversampling" on an oversampled dataset, "Subsampling" on a subsampled dataset, and "Scrubbing" on a scrubbed dataset (explicit gender words like "he" and "she" were removed).

Case study: measuring intrinsic bias on two different datasets
It is critical to consider the impact of the data used when measuring intrinsic bias metrics on a language model. Previous work (Goldfarb-Tarrant et al., 2021; Cao et al., 2022; Orgad et al., 2022) inspected the correlations between extrinsic and intrinsic gender bias metrics. Some did not find correlations, while others did in some cases. However, correlations depend not only on the model being measured, but also on the dataset used to measure the intrinsic metric.
Our experiment analyzes the behavior of the same metric on different datasets. We again follow Orgad et al. (2022), who probed for the amount of gender information extractable from the model's internal representations. This is quantified by compression (Voita and Titov, 2020), where higher compression indicates greater bias extractability. Orgad et al. found that this metric correlates strongly with various extrinsic metrics. An example of this correlation is shown in Figure 3a for the Bias in Bios task, with models debiased using various strategies. The correlation is high (r² = 0.567).
In their experiment, the intrinsic metric was measured on the same dataset as the extrinsic one. We repeat the correlation tests, but this time measure the intrinsic metric on a different dataset, Winobias. The results (Figure 3b) clearly show that in this case there is no correlation between the extrinsic and intrinsic metrics (r² = 0.025).
Hence, we conclude that the dataset used to measure intrinsic bias impacts the results significantly.
To reliably reflect the biases that the model has acquired, the dataset used for intrinsic measurement should be closely related to the task the model was trained on. In our experiment, when intrinsic and extrinsic metrics were not measured on the same dataset, no correlation was detected. This is the case for all metrics on this task from Orgad et al. (2022); see Appendix 3. As discussed in §3, the same intrinsic metrics can be evaluated across a variety of datasets. Even so, some intrinsic metrics were originally defined to be measured on datasets other than the task dataset, such as those defined on templates (Table 1).

Different Metrics Cover Different Aspects of Bias
In this section, we explore how the choice of bias metrics influences results. Although extrinsic bias metrics are useful for defining the harms caused by a gender-biased system, we find that most studies on gender bias use only intrinsic metrics to support their claims. We surveyed a representative list of papers presenting bias mitigation techniques that appeared in the survey by Stanczak and Augenstein (2021), as well as recent papers from the past year.
In total, we examined 36 papers. Many papers do not measure extrinsic metrics at all. Even when downstream tasks are measured, only a very small subset of metrics (three or fewer) is typically measured, as shown in Figure 4. Furthermore, in these studies, typically no explanation is provided for choosing a particular metric. The exceptions are de Vassimon Manela et al. (2021) and Orgad et al. (2022), who measured six and nine or ten metrics on downstream tasks, respectively. Orgad et al. showed that different extrinsic metrics behave differently under various debiasing methods. Additionally, in §4 we saw that subsampling the test set increased one bias metric and decreased others, which would not have been evident had we only measured a small number of metrics. Measuring multiple metrics is also important for evaluating debiasing: when Kaneko and Bollegala compared their proposed debiasing method to that of Dev et al. (2020a), the new method outperformed the old one on two of the three metrics.
As the examples above illustrate, different extrinsic metrics are not necessarily consistent with one another. Furthermore, it is possible to measure more extrinsic metrics, although this is rarely done. When it is not feasible to measure multiple metrics, one should at least justify why a particular metric was chosen. In a CV-filtering system, for example, one might be more forgiving of FPR gaps than of TPR gaps, as the latter leaves out valid candidates of one gender more than the other. However, more extrinsic metrics are likely to provide a more reliable picture of a model's bias.

Conclusion and Proposed Guidelines
The issues described in this paper concern the instability and vagueness of gender bias metrics in NLP. Since bias measurements are integral to bias research, this instability limits progress. We now provide several guidelines for improving the reliability of gender bias research in NLP.
Focus on downstream tasks and extrinsic metrics. Extrinsic metrics are helpful for motivating bias mitigation (§2). However, few datasets can be used to quantify extrinsic metrics, especially extrinsic performance metrics, which require gold labels (§3). More effort should be devoted to collecting datasets that support extrinsic bias assessment, from more diverse domains and downstream tasks.
Stabilize the metric or the dataset. Both the metric and the dataset can have significant effects on the results: the same dataset can be used to measure many metrics and yield different conclusions (§5), and the same metric can be measured on different datasets that lead to different results and different conclusions (§4). If one wishes to measure gender bias in an NLP system, it is better to hold one of these variables fixed: for example, to focus on a single metric and measure it on a set of datasets. Of course, this can be repeated for other metrics as well. This will produce much richer, more consistent, and more convincing results.
Neutralize dataset noise. Altering a dataset's composition produced very different results (§4). This is caused by the way various fairness metrics are defined and computed over the entire dataset. To ensure a more reliable evaluation, we recommend normalizing a dataset when using it for evaluation. In the case of occupation prediction, normalization can be achieved by balancing the test set. In other cases, it could be achieved by anonymizing the test set, removing harmful words, etc., depending on the specific scenario.
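Balancing a test set by subsampling can be sketched as below, assuming examples carry (label, gender) annotations as in occupation prediction; the record layout and names are illustrative assumptions of ours:

```python
import random
from collections import defaultdict

def balance_by_gender(examples, seed=0):
    # Subsample so that, within every label, the two gender groups are
    # equally represented. `examples` is a list of (label, gender) records.
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for ex in examples:
        buckets[(ex[0], ex[1])].append(ex)
    balanced = []
    for label in {lab for lab, _ in examples}:
        f, m = buckets[(label, "F")], buckets[(label, "M")]
        n = min(len(f), len(m))  # subsample the larger group down
        balanced += rng.sample(f, n) + rng.sample(m, n)
    return balanced
```

Oversampling would instead duplicate examples of the smaller group up to the larger group's size; the choice between the two is part of the dataset-composition decision discussed above.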
Motivate the choice of specific metrics, or measure many. Most work measures only a few metrics (§5). A comprehensive experiment, such as one demonstrating the efficacy of a new debiasing method, is more reliable if many metrics are measured. In some situations, a particular metric may be of interest; in this case, one should carefully justify the choice of metric and define the harm that is caused when the metric indicates bias. The motivation for debiasing this metric then follows naturally.
Define the motivation for debiasing through bias metrics. Blodgett et al. (2020) found that papers' motivations are "often vague, inconsistent, and lacking in normative reasoning". We propose describing motivations through the gender bias metrics chosen for the study: define what harm is measured by a specific metric, what the behavior of a desired versus a biased system is, and how the metric measures it. This is where extrinsic metrics are particularly useful.
We believe that following these guidelines will enhance clarity and comparability of results, contributing to the advancement of the field.

A List of gender bias metrics, as presented in Table 1
Many of the items in this list do not aim to describe a specific metric, but rather describe a family of metrics with similar characteristics and requirements.

A.1 Extrinsic Performance
This class of extrinsic metrics measures how a model's performance is affected by gender. It is computed with respect to particular gold labels, and there is a clear definition of harm derived from the specific performance metric measured, for instance F1, True Positive Rate (TPR), False Positive Rate (FPR), BLEU score for translation tasks, etc.
1. Gap (Label): Measures the difference in some performance metric between female and male examples, within a specific class. The performance gap can be computed as the difference or the quotient between the performance metrics of the two protected groups. For example, in Bias in Bios (De-Arteaga et al., 2019) one can measure the TPR gap between female teachers and male teachers. The gaps per class can be summed, or their correlation with the percentage of women in each class can be measured.

2. Gap (Stereo): Measures the difference in some performance metric between pro-stereotypical (and/or non-stereotypical) and anti-stereotypical (and/or non-stereotypical) instances. A biased model will have better performance on pro-stereotypical instances. This can be measured across the whole dataset or per gender/class.
3. Gap (Gender): Measures the difference in some performance metric between male and female examples, across the entire dataset. For non-binary gender datasets (Cao and Daumé III, 2021), the gap can be calculated between text that is trans-inclusive and text that is trans-exclusive. Another option is to measure the difference in performance before and after removing various aspects of gender from the text.

A.2 Extrinsic Prediction
This class is also extrinsic, as it measures model predictions, but the bias is not computed with respect to gold labels. Instead, the bias is measured by the effect of gender on the predictions of the model.

1. % or # of answer changes: The number or percentage of predictions that change when the gender of the example is changed. To measure this, each example must have a counterpart example of the opposite gender. This difference can also be measured in relation to the number of females or males in the specific label, for instance with respect to occupation statistics.
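This count can be sketched as below, assuming each original example has a gender-swapped counterpart at the same index and `predict` is any classifier callable (both layout and names are our assumptions):

```python
def answer_change_rate(predict, originals, counterfactuals):
    # Percentage of examples whose prediction flips when only the
    # gender of the example is changed.
    changed = sum(1 for o, c in zip(originals, counterfactuals)
                  if predict(o) != predict(c))
    return 100.0 * changed / len(originals)
```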

2. % or # that model prefers stereotype: Quantifies how often the model chooses the stereotypical option, for instance predicting that a "she" pronoun refers to a nurse in a coreference resolution task. This can also be measured as a correlation with the number of females or males in the label, which can be thought of as the "strength" of the stereotype.
3. Pred gap: The raw output probabilities, or some function of them, are measured, and bias is computed as the prediction gap between male and female predictions. This can be measured across the whole dataset or per label.
4. LM prediction on target words: This metric relates to the specific predictions of a pretrained LM, such as a masked LM. The LM's prediction is computed for a specific text or for a specific target word of interest, and these probabilities are then used to measure the bias of the model. For example, for the masked sentence "The programmer said that <mask> would finish the work tomorrow", we might measure the relation between p(<mask> = he | sentence) and p(<mask> = she | sentence). Although similar in idea to the "Pred gap" metric above, it is presented as a separate category since it can be computed on a much wider range of datasets. The strategy for deriving a number that quantifies bias from the raw probabilities varies across papers; for example, Kurita et al. (2019), Nangia et al. (2020a), Bordia and Bowman (2019), and Nadeem et al. (2021) all use different formulations.
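One simple formulation, the mean he-versus-she probability gap, can be sketched as follows. `fill_probs` is a stand-in we assume for a masked LM's output: a callable returning a token-to-probability mapping for the `<mask>` position:

```python
def he_she_gap(sentences, fill_probs):
    # Mean of p(he) - p(she) at the masked position over a set of
    # template sentences; positive values indicate a male skew.
    gaps = [fill_probs(s).get("he", 0.0) - fill_probs(s).get("she", 0.0)
            for s in sentences]
    return sum(gaps) / len(gaps)
```

With a real model, `fill_probs` would wrap a masked-LM forward pass and softmax over the vocabulary at the mask position.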

A.3 Intrinsic
This class measures bias in the hidden representations of the model, and is not directly related to any downstream task.
1. WEAT: The Word Embedding Association Test (Caliskan et al., 2017) was proposed as a way to quantify bias in static word embeddings. While we consider only bias metrics that can be applied in contextualized settings, we describe WEAT here because it is popular and has been adapted to contextualized settings.
To compute WEAT, one defines two sets of target words X, Y (e.g., programmer, engineer, scientist, etc., and nurse, teacher, librarian, etc.) and two sets of attribute words A, B (e.g., man, male, etc., and woman, female, etc.). The null hypothesis is that the two sets of target words do not differ in their similarity to the two sets of attribute words. The null hypothesis is tested using a permutation test on the word embeddings, and the resulting effect size is used to quantify how different the two sets are.
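The effect-size part of this computation can be sketched with NumPy as below (the permutation test is omitted for brevity, and any toy vectors used to exercise it are illustrative):

```python
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def weat_effect_size(X, Y, A, B):
    # Standardized difference between the mean association of target set X
    # with attributes A vs. B and that of target set Y. Inputs are arrays
    # whose rows are word vectors.
    def assoc(w):
        return (np.mean([cosine(w, a) for a in A])
                - np.mean([cosine(w, b) for b in B]))
    sx = np.array([assoc(x) for x in X])
    sy = np.array([assoc(y) for y in Y])
    pooled = np.concatenate([sx, sy])
    return (sx.mean() - sy.mean()) / pooled.std(ddof=1)
```

A large positive effect size indicates that X-words sit closer to A-words (and Y-words to B-words) in the embedding space.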
2. SEAT: The Sentence Encoder Association Test (May et al., 2019a) was proposed as a contextual version of WEAT. Since WEAT was computed on static word embeddings, SEAT uses "semantically bleached" templates such as "This is [target]", in which the target word of interest is planted, to obtain its word embedding from a contextual language model. Thus, we only consider semantically bleached templates to be appropriate as a dataset for SEAT.
3. CEAT: The Contextualized Embedding Association Test (Guo and Caliskan, 2021) was proposed as another contextual alternative to WEAT. Here, instead of using templates, a large number of embeddings is collected for each word from a corpus of text in which the word appears many times. WEAT's effect size is then computed many times, with different embeddings each time, and a combined effect size is calculated. As the original authors note, even with only 2 contextual embeddings collected per word in the WEAT stimuli, and with each of the sets X, Y, A, B containing only 5 stimuli, 2^(5·4) possible combinations can be used to compute effect sizes.
4. Probe: The entire example, or a specific word in the text, is probed for gender. A classifier is trained to predict gender from the representation of the word or text as extracted from the model. This can be done on examples with gender labels (for instance, the gender of the person discussed in a biography) or on text containing target words with gendered context, such as "nurse" for female and "doctor" for male. Usually, the word probe refers to a classifier from the family of multilayer perceptrons, linear classifiers included. The accuracy achieved by the probe is often used as a measure of how much gender information is embedded in the representations. However, probing accuracy has weaknesses, such as memorization and other issues (Hewitt and Liang, 2019; Belinkov, 2021), so MDL probing (Voita and Titov, 2020) has been proposed as an alternative; its metric is the compression rate, where higher compression indicates more gender information in the representation.
5. Cluster: It is possible to cluster the word embeddings or representations of the examples and analyze the clusters using the gender labels, as in probing.
6. Nearest Neighbors: As with probing, the example or word representations can be classified using a nearest neighbor model, or an analysis can be performed on the nearest neighbors of word embeddings, as done by Gonen and Goldberg (2019).
7. Gender Space: In the static embeddings regime, Bolukbasi et al. (2016) proposed to identify gender bias in word representations by computing the direction between representations of male and female word pairs such as "he" and "she", using PCA to find the gender direction. Basta et al. (2021) extended the idea to contextual embeddings by using multiple representations for each word, sampled from sentences in a large corpus that contain these words; Zhao et al. (2019) applied the same technique to a different dataset. The percentage of variance explained by the first principal component then serves as a bias metric. The principal components can also be used for a qualitative visual analysis by projecting word embeddings onto the component space.
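A sketch of the gender-space construction: PCA (via SVD) over difference vectors of gendered word pairs, returning the first principal component and the fraction of variance it explains. The centering step and any toy pairs used to exercise it are our assumptions, not the exact recipe of the cited papers:

```python
import numpy as np

def gender_direction(pairs):
    # `pairs` is a list of (male_vec, female_vec) arrays. We take the
    # difference vectors, center them, and return the first principal
    # component plus the fraction of variance it explains.
    diffs = np.array([m - f for m, f in pairs])
    diffs = diffs - diffs.mean(axis=0)
    _, s, vt = np.linalg.svd(diffs, full_matrices=False)
    explained = (s ** 2) / (s ** 2).sum()
    return vt[0], float(explained[0])
```

An explained-variance fraction near 1 suggests a single dominant gender direction in the representation space.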
8. Cos: In static word embeddings (Bolukbasi et al., 2016), this was computed as the mean cosine similarity between the gender space and neutral words that may carry stereotypes, such as "doctor" or "nurse". Basta et al. (2021) computed it on profession words, using embeddings extracted from a large corpus.
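As an illustration of the gender-space and cosine approaches (items 7 and 8), the following is a minimal sketch. The `vecs` dictionary stands in for real static embeddings; here it holds toy random vectors, so the resulting scores are meaningless noise, and all names are illustrative rather than taken from any released implementation.

```python
import numpy as np

# Toy stand-in for real static word embeddings.
rng = np.random.default_rng(0)
vecs = {w: rng.normal(size=50)
        for w in ["he", "she", "man", "woman", "doctor", "nurse"]}

def gender_direction(vecs, pairs):
    """First principal component of male-minus-female difference vectors."""
    diffs = np.stack([vecs[m] - vecs[f] for m, f in pairs])
    diffs -= diffs.mean(axis=0)  # PCA centers the data first
    # The top right-singular vector of the centered matrix is the first PC.
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[0]

g = gender_direction(vecs, [("he", "she"), ("man", "woman")])

def cos_bias(word, direction=g):
    """Cosine similarity between a (neutral) word and the gender direction."""
    v = vecs[word]
    return float(v @ direction / (np.linalg.norm(v) * np.linalg.norm(direction)))
```

With real embeddings, `cos_bias("doctor")` and `cos_bias("nurse")` would be compared (or averaged over a word list) to quantify stereotypical associations.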

B Statistical Fairness Metrics
This section describes statistical metrics that are representative of many other fairness metrics proposed in the field. Separation and sufficiency fall under the definition of "extrinsic performance", specifically "gap (Gender)", while independence falls under the definition of "extrinsic prediction", specifically "pred gap". Each of these metrics produces several numbers describing differences between two distributions, measured by the Kullback-Leibler divergence; we sum all the numbers to quantify bias in a single number.
Let R be the model's prediction, G the protected attribute of gender, and Y the gold labels.
Independence requires that the model's predictions be independent of gender. In what follows, we describe the metrics that were measured in the experiments on Bias in Bios, following Orgad et al. (2022).
Performance gap metrics. The standard measure for this task (De-Arteaga et al., 2019) is the True Positive Rate (TPR) gap between female and male examples for each profession p: GAP_p = TPR_{p,F} − TPR_{p,M}. One then computes the Pearson correlation between each GAP_p and the percentage of females with profession p in the training set. The result is a single number in the range of 0 to 1, with a higher value indicating greater bias. We measure the Pearson correlations of the TPR gaps, as well as of the False Positive Rate (FPR) and precision gaps. In addition, we sum all the gaps over the profession set P, thereby quantifying the absolute bias and not only the correlations; for the TPR gaps, for example: Σ_{p∈P} GAP_p.
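A minimal sketch of this gap-and-correlation computation, assuming hard (discrete) predictions and a dictionary mapping each profession to its training-set female percentage; the function and variable names are illustrative, not from the paper's code:

```python
import numpy as np

def tpr(y_true, y_pred, label):
    """TPR for one profession: fraction of its true examples predicted correctly."""
    mask = y_true == label
    return float((y_pred[mask] == label).mean()) if mask.any() else 0.0

def tpr_gap_correlation(y_true, y_pred, gender, pct_female, professions):
    """Per-profession female-minus-male TPR gaps, and their Pearson
    correlation with the training-set percentage of females per profession."""
    y_true, y_pred, gender = map(np.asarray, (y_true, y_pred, gender))
    f_mask, m_mask = gender == "F", gender == "M"
    gaps = [tpr(y_true[f_mask], y_pred[f_mask], p)
            - tpr(y_true[m_mask], y_pred[m_mask], p)
            for p in professions]
    corr = float(np.corrcoef(gaps, [pct_female[p] for p in professions])[0, 1])
    return gaps, corr
```

Summing the (absolute) entries of `gaps` yields the aggregate gap metric described above, while `corr` is the correlation-based metric.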
Statistical fairness metrics. We also measured three statistical metrics (Barocas et al., 2019), relating to several bias concepts: separation, sufficiency, and independence. A greater value means more bias. Detailed information on these metrics can be found in Appendix B.

C.2 Correlations between extrinsic and intrinsic metrics when measured on different datasets
Table 3 presents the full results of our correlation tests when the intrinsic metric was measured on a different dataset (Winobias) than the extrinsic metrics (Bias in Bios). No metric shows a correlation in this setting, although many of the extrinsic metrics did correlate with the intrinsic metric when both were measured on the same dataset, as originally done in Orgad et al. (2022).

C.3 Statistics of the Dataset Before Balancing
Table 4 presents how the professions in the Bias in Bios dataset (De-Arteaga et al., 2019) are distributed per gender. Gender was induced from the pronouns used to describe the person in the biography, and is thus likely the self-identified gender of the person described in it.

Figure 1: Coreference resolution example from Winobias: a pair of anti-stereotypical (top) and pro-stereotypical (bottom) examples. Developers are stereotypically assumed to be male.

Figure 2: Percentage of females in the training set versus the resulting precision gap, per profession. The trend is opposite on different test sets.
(a) The intrinsic metric was measured on the test set of occupation prediction; figure reproduced from Orgad et al. (2022). (b) The intrinsic metric was measured on Winobias (Zhao et al., 2018a).

Figure 4: The number of extrinsic metrics measured in the papers we reviewed.
Formally, independence requires:

P(R = r | G = F) = P(R = r | G = M)

It is measured by the distributional difference between P(R = r) and P(R = r | G = g) for each g ∈ {M, F}.

Separation requires that the model's predictions are independent of gender given the label. Formally:

P(R = r | Y = y, G = F) = P(R = r | Y = y, G = M)  ∀y ∈ Y

It is measured by the distributional difference between P(R = r | Y = y, G = g) and P(R = r | Y = y) for all y ∈ Y, g ∈ {M, F}.

Sufficiency requires that the distribution of the gold labels is independent of the model's predictions given the gender. Formally:

P(Y = y | R = r, G = F) = P(Y = y | R = r, G = M)

It is measured by the distributional difference between P(Y = y | R = r, G = g) and P(Y = y | R = r) for all y ∈ Y, g ∈ {M, F}.

C Bias in Bios experiments

C.1 Implementation details
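The three statistical criteria can be sketched as follows, assuming discrete labels and hard predictions, with each criterion's violation measured as a sum of per-condition KL divergences (the exact estimator here is an illustrative assumption, not the paper's released implementation):

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence between two discrete distributions (smoothed)."""
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def dist(values, support):
    """Empirical distribution of `values` over a discrete `support`."""
    values = np.asarray(values)
    return np.array([(values == v).mean() for v in support])

def fairness_gaps(y, r, g, labels):
    """Summed KL gaps for independence, separation, and sufficiency.

    y: gold labels, r: predictions, g: gender in {"M", "F"} (1-D sequences).
    """
    y, r, g = map(np.asarray, (y, r, g))
    # Independence: P(R | G=g) vs P(R)
    indep = sum(kl(dist(r[g == z], labels), dist(r, labels))
                for z in ("M", "F"))
    # Separation: P(R | Y=y, G=g) vs P(R | Y=y)
    sep = sum(kl(dist(r[(g == z) & (y == yy)], labels),
                 dist(r[y == yy], labels))
              for z in ("M", "F") for yy in labels
              if ((g == z) & (y == yy)).any())
    # Sufficiency: P(Y | R=r, G=g) vs P(Y | R=r)
    suff = sum(kl(dist(y[(g == z) & (r == rr)], labels),
                  dist(y[r == rr], labels))
               for z in ("M", "F") for rr in labels
               if ((g == z) & (r == rr)).any())
    return indep, sep, suff
```

A model whose predictions are identical across genders yields gaps near zero; gendered predictions inflate the independence term.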

Table 2: Bias metrics of a RoBERTa-based (Liu et al., 2019) classifier on Bias in Bios, separated into performance gap metrics (above the line) and statistical fairness metrics (below the line). Metrics are measured on the original test split, and on subsampled and oversampled versions of it. Boldface marks a significant difference in a metric compared to the baseline (Original), using Pitman's permutation test (p < 0.05).

4.1 Case study: balancing the test data

Another important variable in gender bias evaluation, often overlooked in the literature, is the composition of the test dataset. Here, we demonstrate this by comparing metrics on different test sets, which come from the same dataset but have a different balance of examples. Bias in Bios (De-Arteaga et al., 2019) involves predicting an occupation from a biography text. These occupations are not balanced across genders; for example, over 90% of the nurses in the dataset identify as female. Our case study extends the experiments done by Orgad et al. (2022). In their work, they tested a RoBERTa-based (Liu et al., 2019) classifier finetuned on Bias in Bios. The model was trained and evaluated on a training/test split of the dataset using numerous extrinsic bias metrics. Here we train the same model on the same training set, but evaluate it on three types of test sets: the original test set alongside two balanced versions of it, which have equal numbers of females and males in every profession, obtained by either subsampling or oversampling.2 We follow Orgad et al. and report nine different metrics on this task, measuring either some notion of performance gap or a statistical metric from the fairness literature. For details on the metrics measured in these experiments, see Appendix C.1. As the results in Table 2 show, although many of the gap metrics (top block) are unaffected by the balancing of the test dataset, the absolute sum

2 Subsampling is the process of removing examples from the dataset such that the resulting dataset contains the same number of male and female examples for each label. Oversampling achieves this by repeating examples.
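The two balancing strategies can be sketched as follows, assuming examples are (label, gender) pairs; the helper names are illustrative, not from the paper's code:

```python
import random
from collections import defaultdict

def balance(examples, mode="subsample", seed=0):
    """Equalize male/female counts within each label.

    `subsample` drops examples down to the minority-gender count;
    `oversample` repeats examples up to the majority-gender count.
    examples: list of (label, gender) pairs, gender in {"M", "F"}.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for label, gender in examples:
        groups[(label, gender)].append((label, gender))

    def pad(group, n):
        # repeat randomly chosen examples until the group reaches size n
        return group + rng.choices(group, k=n - len(group))

    balanced = []
    for label in {lab for lab, _ in examples}:
        m, f = groups[(label, "M")], groups[(label, "F")]
        if not m or not f:
            continue  # a single-gender label cannot be balanced
        if mode == "subsample":
            n = min(len(m), len(f))
            balanced += rng.sample(m, n) + rng.sample(f, n)
        else:  # oversample
            n = max(len(m), len(f))
            balanced += pad(m, n) + pad(f, n)
    return balanced
```

Note the trade-off the section highlights: subsampling shrinks the test set (discarding data), while oversampling duplicates biographies, which can overweight a few individuals.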