Intrinsic Bias Metrics Do Not Correlate with Application Bias

Natural Language Processing (NLP) systems learn harmful societal biases that cause them to amplify inequality as they are deployed in more and more situations. To guide efforts at debiasing these systems, the NLP community relies on a variety of metrics that quantify bias in models. Some of these metrics are intrinsic, measuring bias in word embedding spaces, and some are extrinsic, measuring bias in downstream tasks that the word embeddings enable. Do these intrinsic and extrinsic metrics correlate with each other? We compare intrinsic and extrinsic metrics across hundreds of trained models covering different tasks and experimental conditions. Our results show no reliable correlation between these metrics that holds in all scenarios across tasks and languages. We urge researchers working on debiasing to focus on extrinsic measures of bias, and to make using these measures more feasible via creation of new challenge sets and annotated test data. To aid this effort, we release code, a new intrinsic metric, and an annotated test set focused on gender bias in hate speech.


Introduction
Awareness of bias in Natural Language Processing (NLP) systems has rapidly increased as more and more systems are discovered to perpetuate societal unfairness at massive scales. This awareness has prompted a surge of research into measuring and mitigating bias, but this research suffers from a lack of consistent metrics that discover and measure bias. Instead, work on bias is "rife with unstated assumptions" (Blodgett et al., 2020) and relies on metrics that are easy to measure rather than metrics that meaningfully detect bias in applications.
(a) Intrinsic metrics summarize biases in the geometry of embeddings. For example, in this embedding space, male words are closer to words about career and about math & science, whereas female words are closer to words about family.
(b) Extrinsic bias metrics summarize disparities in application performance across populations, such as rates of false negatives between different gender groups. For example, a coreference system may make more errors on an anti-stereotypical career coreferent (red arc) than on a pro-stereotypical one (green arc).

Figure 1: The relationship between intrinsic bias metrics (a) and extrinsic bias metrics (b) has been assumed, but not confirmed.
A recent comprehensive survey of bias in NLP (Blodgett et al., 2020) found that one third of all research papers focused on bias in word embeddings. This makes embeddings the most common topic in studies of bias, over twice as common as any other topic related to bias in NLP. As is visualised in Figure 1a, bias in embedding spaces is measured with intrinsic metrics, most commonly with the Word Embedding Association Test (WEAT) (Caliskan et al., 2017), which relates bias to the geometry of the embedding space. Once embeddings are incorporated into an application, bias can be measured via extrinsic metrics (Figure 1b) that test whether the application performs differently on language related to different populations. Hence, research on debiasing embeddings relies crucially on a hypothesis that doing so will remove or reduce bias in downstream applications. However, we are aware of no prior research that confirms this hypothesis.
This untested assumption leaves NLP bias research in a precarious position. Research into the semantics of word embeddings has already shown that intrinsic metrics (e.g. using analogies and semantic similarity, as in Hill et al., 2015) do not correlate well with extrinsic metrics (Faruqui et al., 2016). Research into the bias of word embeddings lacks the same type of systematic study, and thus as a field we are exposed to three large risks: 1) making misleading claims about the fairness of our systems, 2) concentrating our efforts on the wrong problem, and most importantly, 3) feeling a false sense of security that we are making more progress on the problem than we are. Our bias research can be rigorous and innovative, but unless we understand the limitations of metrics we use to evaluate it, it might have no impact.
In this paper, we ask: Does the commonly used intrinsic metric for embeddings (WEAT) correlate with extrinsic metrics of application bias? To answer this question, we analyse the relationship between intrinsic and extrinsic bias. Our study considers two languages (English and Spanish), two common embedding algorithms (word2vec and fastText) and two downstream tasks (coreference resolution and hatespeech detection).
While we find a moderately high correlation between these metrics in a handful of conditions, we find no correlation, or even negative correlation, in most conditions. Therefore, we recommend that the ethical scientist or engineer not rely on intrinsic metrics when attempting to mitigate bias, but instead focus on the harms of specific applications and test for bias directly.
As additional contributions to these findings, we release new WEAT metrics for Spanish, and a new gender-annotated test set for hatespeech detection for English, both of which we created in the course of this research.

Bias Metrics
In all of our experiments, we compute correlations between commonly-used metrics, both intrinsic and extrinsic.

Intrinsic bias metrics
Intrinsic bias metrics are applied directly to word embeddings, formulating bias in terms of geometric relationships between concepts such as male, female, career, or family. Each concept is in turn represented by curated wordlists. For example, the concept male is represented by words like brother, father, grandfather, etc. while the concept math & science is represented by words like programmer, engineer, etc.
The most commonly used metric is WEAT (Caliskan et al., 2017), which measures the difference in mean cosine similarity between two target concepts X and Y and two attribute concepts A and B. This difference represents the imbalance in associations between concepts. Using $\vec{w}$ to represent the embedding of word $w$, we have a test statistic:

$$s(X, Y, A, B) = \sum_{x \in X} s(x, A, B) - \sum_{y \in Y} s(y, A, B)$$

where the per-word association is

$$s(w, A, B) = \mathrm{mean}_{a \in A} \cos(\vec{w}, \vec{a}) - \mathrm{mean}_{b \in B} \cos(\vec{w}, \vec{b})$$

This is normalised by the standard deviation to get the effect size, which we use in our experiments:

$$d = \frac{\mathrm{mean}_{x \in X}\, s(x, A, B) - \mathrm{mean}_{y \in Y}\, s(y, A, B)}{\mathrm{std}_{w \in X \cup Y}\, s(w, A, B)}$$
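For illustration, the effect size can be computed directly from word vectors; the sketch below (using numpy, with toy vectors standing in for real embeddings) follows the definitions above:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def association(w, A, B):
    """s(w, A, B): mean cosine similarity of w to attribute set A minus to set B."""
    return np.mean([cosine(w, a) for a in A]) - np.mean([cosine(w, b) for b in B])

def weat_effect_size(X, Y, A, B):
    """WEAT effect size: difference in mean association between target sets
    X and Y, normalised by the standard deviation over all target words."""
    assoc_X = [association(x, A, B) for x in X]
    assoc_Y = [association(y, A, B) for y in Y]
    return (np.mean(assoc_X) - np.mean(assoc_Y)) / np.std(assoc_X + assoc_Y, ddof=1)
```

With target words whose vectors lean toward attribute A and targets leaning toward B, the effect size is positive; swapping X and Y flips its sign.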
WEAT was initially developed as an indicator of bias, to show that the Implicit Association Test (IAT) from the field of psychology (Greenwald et al., 1998) can be replicated via word embeddings measurements. There are thus 10 original tests chosen to replicate the tests presented to human subjects in IAT. The tests measure different kinds of biased associations, such as African-American names vs. White names with pleasant vs. unpleasant terms, and female terms vs. male terms with career vs. family words.
WEAT was later repurposed as a predictor of bias in embedding spaces, via a somewhat muddy logical journey. It has since been translated into 6 other languages (XWEAT; Lauscher and Glavas, 2019), and extended to operate on full sentences (May et al., 2019) and on contextual language models (Kurita et al., 2019). When WEAT is used as a metric, papers report the effect size of the subset of tests relevant to the task at hand, each separately.
There are known issues with WEAT, such as sensitivity to corpus word frequency, and sensitivity to target and attribute wordlists, as found by Sedoc and Ungar (2019) and Ethayarajh et al. (2019). The latter propose an alternative, more theoretically robust metric, the relational inner product association (RIPA), which uses the principal component of a gender subspace (determined via the method of Bolukbasi et al. (2016)) to directly measure how "gendered" a word is. We have chosen the standard version of WEAT for this first empirical study, since it is the most widely used. It would be interesting to test RIPA in the same way, if it were extended to more types of bias and more languages. But we note that all intrinsic metrics are sensitive to the chosen wordlists, so this must be done carefully, especially across languages, a topic we will return to in Section 4.3.

Extrinsic bias metrics
Extrinsic bias metrics measure bias in applications, via some variant of performance disparity, or performance gap between groups. For instance, a speech recognition system is unfair if it has higher error rates for African-American dialects (Tatman, 2017), meaning that systems perform less well for those speakers. A hiring classification system is unfair if it has more false negatives for women than for men, meaning that more qualified women are accidentally rejected than are qualified men. There are two commonly used metrics to quantify this possible performance disparity: Predictive Parity (Hutchinson and Mitchell, 2019), which measures the difference in precision between a privileged and a non-privileged group, and Equality of Opportunity (Hardt et al., 2016), which measures the difference in recall between those groups (see Appendix A for formal definitions).
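To make the two metrics concrete, here is a minimal sketch (not the paper's evaluation code) that computes both gaps from per-example labels, predictions, and group membership:

```python
from collections import defaultdict

def group_scores(y_true, y_pred, groups):
    """Per-group precision and recall for the positive class."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for t, p, g in zip(y_true, y_pred, groups):
        c = counts[g]                      # touch the group even on true negatives
        if p == 1 and t == 1:
            c["tp"] += 1
        elif p == 1:
            c["fp"] += 1
        elif t == 1:
            c["fn"] += 1
    scores = {}
    for g, c in counts.items():
        prec = c["tp"] / (c["tp"] + c["fp"]) if c["tp"] + c["fp"] else 0.0
        rec = c["tp"] / (c["tp"] + c["fn"]) if c["tp"] + c["fn"] else 0.0
        scores[g] = (prec, rec)
    return scores

def bias_gaps(y_true, y_pred, groups, privileged, underprivileged):
    """Predictive Parity gap (precision) and Equality of Opportunity gap (recall).
    Positive values indicate the system favours the privileged group."""
    s = group_scores(y_true, y_pred, groups)
    pp = s[privileged][0] - s[underprivileged][0]
    eo = s[privileged][1] - s[underprivileged][1]
    return pp, eo
```

A system can be unfair under one metric and fair, or unfair in the opposite direction, under the other, which is why both are reported separately.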
The metric that best identifies bias in a system varies based on the task. For some applications false negatives may be more harmful; for others, false positives may be. For our first task of coreference (Figure 1b), false negatives, where the system fails to identify anti-stereotypical coreference chains (e.g. women as farmers or as CEOs), are more harmful to the underprivileged class than false positives. For our second task, hate speech detection (Figure 2), both can be harmful, for different reasons. False positives for one group can systematically censor certain content, as has been found for hate speech detection applied to African-American Vernacular English (AAVE) (Sap et al., 2019).

Methodology
Each of our experiments measures the correlation between a specific instance of WEAT and a specific extrinsic bias metric. In each experiment, we train an embedding, measure the bias according to WEAT, and measure the bias in the downstream task that uses that embedding. We then modify the embeddings by applying an algorithm to either debias them, or, by inverting the algorithm's behavior, to overbias them. Again we measure WEAT on the modified embedding and also the downstream bias in the target task. We repeat this process until we reach a stopping condition (detailed below), then compute the correlation between the two metrics (via Pearson correlation and analysis with scatterplots).
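The experimental loop can be sketched as follows; every callable here is a hypothetical placeholder for the components described above (embedding training, bias modification, WEAT, and the downstream bias measurement):

```python
import numpy as np

def pearson(xs, ys):
    """Pearson correlation coefficient between two paired samples."""
    return float(np.corrcoef(xs, ys)[0, 1])

def run_experiment(train_embedding, modify_bias, weat, downstream_gap, strengths):
    """Sketch of one experiment: train a baseline embedding, then debias or
    overbias it at each strength, recording paired (intrinsic, extrinsic)
    measurements, and finally correlate the two series."""
    base = train_embedding()
    intrinsic, extrinsic = [], []
    for s in strengths:                       # e.g. negative = debias, positive = overbias
        emb = modify_bias(base, s)
        intrinsic.append(weat(emb))           # intrinsic metric on modified embedding
        extrinsic.append(downstream_gap(emb)) # extrinsic metric on the downstream task
    return pearson(intrinsic, extrinsic)
```

If downstream bias really tracked WEAT, this correlation would be consistently high; the paper's finding is that across conditions it is not.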
Rather than draw conclusions from a single experiment, we attempt to draw more robust conclusions by running many experiments, which vary along several dimensions. We consider two common embedding algorithms, two tasks, and two languages. A full table of experiment conditions can be found in Table 1.

Debiasing and Overbiasing
To measure the relationship between intrinsic and extrinsic metrics as bias changes, we must generate many data points for each experiment. Previous work on bias in embeddings studies methods to reduce embedding bias. To generate enough data points, we take the novel approach of both decreasing and increasing bias in the embeddings. We measure the baseline bias level, via WEAT, for each embedding trained normally on the original corpus. We then adjust the bias up or down, remeasure WEAT, and measure the change in the downstream task. Each experiment consists of a task, an embedding method (either word2vec or fastText), an intrinsic metric (one experiment for each listed), and an extrinsic metric (either Predictive Parity or Equality of Opportunity). We run an experiment for all possible combinations. To produce data points for each experiment, we use preprocessing and postprocessing methods to debias and overbias the input word embeddings.
We choose two methods from previous work that are capable of both debiasing and overbiasing: the first is a preprocessing method that operates on the training data before training; the second is a postprocessing method that operates on the embedding space once it has been trained. This is important since both kinds of methods may be used in practice: a large company with proprietary data will train embeddings from scratch, and thus may use a preprocessing method; whereas a small company may rely on publicly available pretrained embeddings, and thus use a postprocessing method. 4 For preprocessing, we use dataset balancing (Dixon et al., 2018), which consists of subsampling the training data to be more equal with respect to some attributes. For instance, if we are adjusting gender bias, we identify pro-stereotypical sentences, such as 'She was a talented housekeeper', vs. anti-stereotypical sentences, such as 'He was a talented housekeeper' or 'She was a talented analyst'. We sub-sample and reduce the frequency of the pro-stereotypical collocations to debias, and sub-sample the anti-stereotypical sentences to overbias.
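A minimal sketch of this balancing step follows; the predicate, the removal cap, and the random subsampling are illustrative stand-ins for the paper's actual pipeline, which identifies stereotypical collocations in the corpus:

```python
import random

def balance_corpus(sentences, is_pro_stereotypical, mode="debias",
                   max_removed_frac=0.05, seed=0):
    """Sub-sample a corpus to shift gender bias: removing pro-stereotypical
    sentences debiases, removing anti-stereotypical sentences overbiases.
    Removal is capped at a fraction of the corpus, mirroring the paper's
    stopping condition of removing less than five percent of the data."""
    rng = random.Random(seed)
    target = (is_pro_stereotypical if mode == "debias"
              else lambda s: not is_pro_stereotypical(s))
    candidates = [i for i, s in enumerate(sentences) if target(s)]
    budget = int(max_removed_frac * len(sentences))
    to_remove = set(rng.sample(candidates, min(budget, len(candidates))))
    return [s for i, s in enumerate(sentences) if i not in to_remove]
```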
As a postprocessing method for already trained embeddings, we use the Attract-Repel algorithm (Mrksic et al., 2017). This algorithm was developed to use dictionary wordlists (synonyms, antonyms) to refine semantic spaces. It aims to move similar words (synonyms) closer to each other and dissimilar words (antonyms) farther from each other, while using a regularisation term to preserve the original semantics as much as possible. Lauscher et al. (2020) used an approach inspired by Attract-Repel for debiasing, though with constraints implemented somewhat differently. We use the same pro- and anti-stereotypical wordlists as in dataset balancing. For debiasing, we use the algorithm to increase the distance between pro-stereotypical combinations (she, housekeeper) and decrease the distance between anti-stereotypical combinations (she, analyst or he, housekeeper). For overbiasing we do the reverse. 6 As the stopping condition for preprocessing, we constrain the sub-sampling so that it does not substantially change the dataset size, by limiting it to removing less than five percent of the original data. For postprocessing we limit the algorithm to a maximum of 5 iterations.

4 There are additional embedding-based debiasing methods used in practice, based on identifying and removing a gender subspace during training or as postprocessing (Bolukbasi et al., 2016; Zhao et al., 2018b). However, these methods do not change a word's nearest-neighbour clusters (Gonen and Goldberg, 2019), and so we would expect these debiasing methods to show superficial bias changes in WEAT without changing downstream bias. Both methods that we select modify the underlying word distribution and move many words in relation to each other. We verified this with tSNE visualisation as in Figure 1a, following Gonen and Goldberg (2019), and found that our bias modification methods do change word clusters.
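The following is a deliberately simplified, gradient-free caricature of an attract/repel-style update, not the actual Attract-Repel algorithm (which optimises a margin-based objective); it only illustrates the direction in which word pairs are moved and the role of the regularisation pull toward the original vectors:

```python
import numpy as np

def adjust_pairs(emb, attract, repel, step=0.1, reg=0.5):
    """One simplified attract/repel-style iteration over a dict of word
    vectors: nudge the words in each `attract` pair toward each other and
    the words in each `repel` pair apart, then pull every moved vector back
    toward its original position (a crude stand-in for Attract-Repel's
    semantic-preservation regularisation term)."""
    orig = {w: emb[w].copy() for pair in attract + repel for w in pair}
    for a, b in attract:
        delta = emb[b] - emb[a]
        emb[a] = emb[a] + step * delta   # move a toward b
        emb[b] = emb[b] - step * delta   # move b toward a
    for a, b in repel:
        delta = emb[b] - emb[a]
        emb[a] = emb[a] - step * delta   # move a away from b
        emb[b] = emb[b] + step * delta   # move b away from a
    for w, v in orig.items():
        emb[w] = (1 - reg) * emb[w] + reg * v   # regularise toward original
    return emb
```

For debiasing, anti-stereotypical pairs like (she, analyst) would go in `attract` and pro-stereotypical pairs like (she, housekeeper) in `repel`; for overbiasing the lists are swapped.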

Embedding Algorithms
We use two common word embedding algorithms: fastText (Bojanowski et al., 2017) and Skip-gram word2vec (Mikolov et al., 2013). Word embeddings in fastText are composed from embeddings of both the word and its subwords in the form of character n-grams. Lauscher and Glavas (2019) suggest that this difference may cause bias to be acquired and encoded differently in fastText and word2vec (we discuss this in more detail in Section 5).
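For intuition, the subword inventory that fastText associates with a word can be enumerated as follows (the boundary markers and the 3-to-6 n-gram range follow the defaults described by Bojanowski et al. (2017)):

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams as used by fastText: the word is wrapped in
    boundary markers < and >, and the word's vector is the sum of the
    vectors of these n-grams plus the full word itself."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(wrapped[i:i + n] for i in range(len(wrapped) - n + 1))
    return grams
```

Because any word containing, say, the trigram "she" shares a parameter with the pronoun she, modifying one word's representation can ripple through many others, which is relevant to the debiasing discussion in Section 5.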
Despite recent widespread interest in contextual embeddings (e.g. BERT; Devlin et al., 2019), our experiments use these simpler contextless embeddings because they are widely available in many toolkits and used in many real-world applications. Their design simplifies our experiments, whereas contextual embeddings would add significant complexity. However, we know that bias is still a problem for large contextual embeddings (Gehman et al., 2020; Sheng et al., 2019), so our work remains important. If intrinsic and extrinsic measures do not correlate with simple embeddings, this result is unlikely to be changed by adding more architectural layers and configurable hyperparameters.

Downstream tasks
We use three tasks that appear often in the bias literature: coreference resolution for English, hate speech detection for English, and hate speech detection for Spanish. To make the scenarios as realistic as possible, we use a common, easy-to-implement, and high-performing architecture for each task: the end-to-end coreference system of Lee et al. (2017) and the CNN of Kim (2014), which has been used in high-scoring systems in recent hate speech detection shared tasks (Basile et al., 2019). For each task, we feed frozen pretrained embeddings to the model, and then train the model using the standard hyperparameters published for each model and task.

Languages
We experiment on both English and Spanish. It is important to take a language with pervasive gender-marking (Spanish) into account, as previous work has shown that grammatical gender-marking has a strong effect on gender bias in embeddings (McCurdy and Serbetci, 2017; Gonen et al., 2019; Zhou et al., 2019). We use Spanish only for hate speech detection, because gender marking makes a challenge-set-style coreference evaluation trivial to resolve, and thus not a candidate for detection of gender bias.

Datasets
To train embeddings, we use domain-matched data for each downstream task. For coreference we train on Wikipedia data, and for hatespeech detection we train on English tweets or Spanish tweets, consistent with the task. Our English coreference system is trained on OntoNotes (Weischedel et al., 2017) and evaluated on Winobias (Zhao et al., 2018a), a Winograd-schema style challenge set designed to measure gender bias in coreference resolution. English hate speech detection uses the abusive tweets dataset of Founta et al. (2018), and is evaluated on a test set of ten thousand tweets which we have hand-labelled as male-targeted, female-targeted, or neutral (we release this labelled test set for future work). Spanish hate speech detection uses the data from the shared task of Basile et al. (2019), which contains labels for comments directed at women and directed at migrants.

WEAT & Bias modification wordlists
Both WEAT and our bias modification methods depend on seed wordlists. These wordlists are closely related to each other, and we match them by type of bias, such that we measure WEAT tests for gender bias with embeddings modified via gender bias wordlists (themselves derived from WEAT lists, as detailed below), and WEAT tests for migrant bias with embeddings modified for migrant bias.
WEAT wordlists are standardised, and for English we use the three WEAT test wordlists (numbers 6, 7, 8) for gender. To generate bias modification wordlists we follow the approach of Lauscher et al. (2020) and use a pretrained set of embeddings (from spacy.io) to expand the set of WEAT words to their 100 unique nearest neighbours. For all experiments, we take the union of all WEAT terms, expand them, and use this expanded set for both dataset balancing and for Attract-Repel. For gender bias in coreference and hate speech, we use terms that are male vs. female and are career, math, science vs. family, art. For gender bias and migrant bias in Spanish hate speech, we compare male/female identity or migrant/non-migrant identity with pleasant-unpleasant term expansions. 12
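A sketch of this expansion step, assuming a generic vocabulary list aligned with an embedding matrix rather than spacy's actual API:

```python
import numpy as np

def expand_wordlist(seeds, vocab, vectors, k=100):
    """Expand seed words to the union of their k nearest neighbours by
    cosine similarity. `vocab` is a list of words aligned with the rows of
    `vectors`; in the paper, neighbours come from pretrained spacy embeddings."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    index = {w: i for i, w in enumerate(vocab)}
    expanded = set(seeds)
    for seed in seeds:
        if seed not in index:
            continue
        sims = unit @ unit[index[seed]]
        for i in np.argsort(-sims)[:k + 1]:   # +1 because the seed is its own neighbour
            expanded.add(vocab[i])
    return expanded
```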

New Spanish WEAT
We substantially modified Spanish WEAT (known as XWEAT for non-English WEATs) and added entirely new terms. The reason for this is that the original XWEAT was translated from English very literally, which causes two problems.
The first problem with XWEAT is that many of the terms do not make sense in a Spanish-speaking community. Names included in the original, like Amy, are names in Spanish and thus were left untranslated, but they are uncommon and carry upper-class connotations not intended in the original test. Another example is firearms translated as arma de fuego, which, while technically a correct literal translation, is not commonly used to describe weapons. 13 The second problem with XWEAT is that the nouns on the wordlists for both abstract math and science concepts and abstract art concepts are almost entirely grammatically feminine. For instance, ciencia (science) and geometría (geometry) are grammatically feminine, as are escultura (sculpture) and novela (novel). It is well established that for languages with grammatical gender, words that share a grammatical gender have embeddings that are closer together than words that do not (Gonen et al., 2019; McCurdy and Serbetci, 2017). So, when WEAT in English was translated into XWEAT in Spanish (Glavas et al., 2019), the terms were imbalanced with regard to grammatical gender, which makes the results misleading. We balance the lists, often replacing abstract nouns with corresponding adjectives which can take masculine or feminine form, e.g. científico and científica (scientific, masculine and feminine), such that we can use both versions to account for the effect of grammatical gender.
Finally, we needed a metric to examine bias against migrants. Metrics for intrinsic bias must be targeted to the type of harm expected in the downstream application, and there is not an out-of-the-box WEAT test for this. So we create a new WEAT test for bias against migrants in Spanish. Following the setup of tests for racial bias in the original WEAT, which are based on American racial biases in English, we have lists of names associated with migrants vs. non-migrants, and compare them with lists of pleasant and unpleasant terms. The names are based on the work of Salamanca and Pereira (2013), who studied ranking names as lower vs. upper class; class status is closely correlated with whether a person is a migrant. We select a subset of names for which the majority in the study agree on the class. Pleasant and unpleasant terms exist in WEAT and XWEAT, but we again modify them to balance grammatical gender.

12 We also tried using the original (unexpanded) WEAT terms for debiasing, and found the trends to be similar but of smaller magnitude, so we settled on expanded lists as a more realistic scenario.

13 The standard term would be armas. arma de fuego is also composed of three words, and so will not appear in any vocabulary.


Results
Figure 3 displays data for all tasks: one scatterplot per triple of experimental variables: an intrinsic metric, an extrinsic metric, and an embedding algorithm. If we want to be able to broadly use WEAT metrics for any given bias research, these graphs should each show a clear and positive correlation. None of them do. There are no trends in correlation between the metrics that hold in all cases regardless of experimental detail, for any of the tasks. We have additionally examined whether there are correlations within one bias modification method (pre- or postprocessing), in case a difference in the way these methods modify embeddings causes differences in trends. In most cases this breakout tells the same story. The select cases where positive (and negative) correlations are present are discussed below. Further breakout graphs and combinations are included in Appendix D.
Coreference (en): Gender The coreference task (Figure 3, rows 1-3) does not display a clear correlation in all cases, and yet it has the clearest relationship of all three tasks, with a significant moderate positive correlation for both Predictive Parity (precision) and Equality of Opportunity (recall) for word2vec (columns 3 & 4). The overall trends are muddied by the data for fastText, which does not have a significant correlation under any conditions. Both findings are expected: that coreference would display the strongest trends, and that fastText would display more unpredictable or weaker trends. The Winobias coreference task is as directly matched to the WEAT tests as it is possible to be, since both use common career words to measure bias. So the relationship between the two metrics is clearest here: moving female terms closer to certain career terms most directly helps a system resolve anti-stereotypical coreference chains. However, we still only see a correlation in word2vec, not fastText. fastText may behave less predictably because of its use of subwords; when subwords are used, word representations are more interconnected. We can debias with regard to a specific word, but that word's embedding will still be influenced by all other words that share its character n-grams. It is difficult to predict how changing the composition of a training corpus will affect all words that contain a certain n-gram (e.g. ch) in them. For this reason, fastText may be initially more resistant to encoding biases than word2vec, as was found in Lauscher and Glavas (2019), but may also be more complex to debias. This has implications for extending this work to contextual models, which always use some form of subword unit.
Hatespeech (en): Gender Hatespeech (en) has fewer and more restricted correlations than coreference, as can be seen in Figure 3, rows 4-6. These plots show no relationship at all between intrinsic and extrinsic metrics. When data is broken out by bias modification method (see Figure 4b in Appendix D), it becomes clear that there is a moderate positive correlation for postprocessing for recall, and the aggregate appears this way because there is a moderate negative correlation for preprocessing. This holds for both embedding algorithms, though both positive and negative correlations are stronger for fastText. Precision displays no correlation. Note that the absolute variance in recall is much smaller than for precision, but the correlation is still significant for each embedding algorithm individually and for both grouped together. Of interest for future bias research is that the baseline level of bias (pre-modification, from raw twitter data) in English hatespeech differs by embedding type, but only for precision. Initial models (with unmodified embeddings) using fastText have 10 additional points of precision for male-targeted hatespeech than for female-targeted. However, initial models using word2vec have the opposite bias, with 4 fewer points of precision for male-targeted than female-targeted hatespeech. For recall, the two embedding algorithms are equivalent, with 6 fewer points for male-targeted hatespeech. In fact, in the recall metric there is an early indication of the unreliability of the relationship we are examining between WEAT and extrinsic bias, because there is a spread of different WEAT results that map to nearly the same difference in recall.

Hatespeech (es): Gender and Migrant
For hatespeech in Spanish, we examine two kinds of bias separately -gender bias and bias against migrants, in Figure 3, rows 7,8. Neither gender bias nor migrant bias show positive correlations in any experimental conditions.
Gender bias in our models is, in an absolute sense, never present, since in Spanish, hatespeech targeted against women is easier to identify than hatespeech against others (with F1 in the high 80s). But there are no overall trends when this bias is modified to be more or less extreme, and there are no positive correlations in any conditions. There is a moderate negative correlation for precision only when looking at fastText embeddings.
Migrant bias similarly has no trends save in very restricted conditions broken out by bias modification type. In contrast to the gender case, hatespeech against migrants is clearly challenging to identify, with much lower F1 in the low 60s. There is a positive correlation between migrant bias and performance gap for recall with preprocessing in fastText only. This fits the expectation that fastText may be more sensitive to preprocessing than postprocessing due to subwords, as discussed above, though in the gender bias case with negative correlation it is equally sensitive to both, so it is hard to draw conclusions. Given the smaller number of datapoints for Spanish (discussed below) this is likely just noise. To confuse the situation further, the only trends in precision are present in word2vec, and are negative correlations.
Note that all graphs for Spanish display central clusters, because it was more difficult to get an even spread of bias measures, and because Spanish has fewer data points than English. This is for a number of reasons that compound and underscore the difficulty of expanding supposedly language-agnostic techniques beyond English, even to high-resource languages like Spanish. We have only one WEAT test for each type of bias, since we made our own that carefully balanced grammatical gender, after rectifying the issues with the existing translated versions (see Section 4.3). Bias modification is also more difficult: the richer agreement system in Spanish means that there are more surface forms of what would be one word in English. In addition, the language model used for nearest-neighbour expansion of wordlists (see Section 4.2) produces predominantly formal-register words from news or scientific articles, due to a less varied makeup of its training data than the English model. This makes the wordlists less well suited to debiasing twitter data specifically, and there were no readily available models with a more casual register. For bias against migrants, there is the additional challenge that wordlists are predominantly based on proper names, which are much rarer in twitter (which tends to use @ mentions instead) than in other media.

Discussion
The broad result of this research is that changes in WEAT do not correlate with changes in application bias, and therefore that WEAT should not be used to measure progress in debiasing algorithms. We have found that even when we maximally target bias modification of an embedding, we cannot produce a reliable change in bias direction downstream. There was no pattern or correlation between tasks, for the same task in different languages, or even in most cases within one task. And we have chosen one of the simplest possible setups, with full-word embeddings and a single type of bias at a time. Real-world scenarios can easily be more complicated and involve multiple types of bias or subword embeddings. Our findings also indicate that additional complexity may muddy the relationship further. For example, fastText behaved less predictably than word2vec across experiments, suggesting that if we were to expand to larger models that are fully reliant on subwords, the patterns may become even less clear.
The implication of this finding is that an NLP scientist or engineer has limited options when investigating and mitigating bias. They must a) find the specific set of wordlists, embedding algorithms, downstream tasks, and bias modification methods that are together predictive of bias for the given task, language, and model or b) implement full systems to test application bias directly, even if their work focuses on embeddings.
While the latter may seem onerous, it may not be more so than exhaustively searching for a configuration where intrinsic bias metrics are predictive.
This underscores the importance of making good downstream bias measures available, as either approach will require them. More of the datasets that are collected need to be annotated with subgroup demographic and identity information; very few such datasets are available. More research needs to focus on creating good challenge sets to measure application bias. Additional research on broader usage of unsupervised methods would also be valuable, though those methods would also benefit from subgroup identity annotation to make their results more interpretable.
It is only when more of these things are readily available that we can see the true measure of the efficacy of our debiasing efforts.
We do note a limitation of this study in that all downstream tasks are discriminative classification tasks. Bias in classification is more straightforward to measure, with well-established metrics, but covers allocational harms (performance disparity), whereas the inclusion of generative models could better cover representational harms (misleading or harmful representations/portrayals) (Blodgett et al., 2020; Crawford, 2017). Concurrent research on causal mediation analysis for bias has shown that the embedding layer in open-domain generation has the strongest effect on gender bias (as compared to other layers of the network) (Vig et al., 2020). Further work could investigate whether generation tasks display the same or a different relationship to intrinsic metrics.

Conclusion
We have examined the relationship of the intrinsic bias metric WEAT to the extrinsic bias metrics of Equality of Opportunity and Predictive Parity, for multiple tasks and languages, and determined that positive correlations between them exist only in very restricted settings; in many cases there is either negative correlation or none at all. While intrinsic metrics such as WEAT remain good descriptive metrics for computational social science and for examining bias in human texts, we advise that the NLP community not rely on them for measuring model bias. We instead advise focusing on careful consideration of downstream applications and on the creation of datasets and challenge sets that enable measurement at that stage.

A Bias Metric Definitions & Formulas
Performance Gap metrics measure the difference in performance across demographic splits of the data; in our case (and most commonly) they are applied to classification tasks.
Where $A$ is a demographic variable (race, gender, etc.), $Y$ is the true label, and $\hat{Y}$ is the predicted label, a fair system will satisfy:
$$P(\hat{Y} = 1 \mid A = x, Y = 1) = P(\hat{Y} = 1 \mid A = y, Y = 1)$$
where $x$ and $y$ are demographic values, usually of a privileged and an underprivileged group. This expresses that the probability of a given test sample being correctly identified as a true positive should be equal regardless of group, and is known as Equality of Opportunity (Hardt et al., 2016).
A fair system will also satisfy:
$$P(Y = 0 \mid \hat{Y} = 1, A = x) = P(Y = 0 \mid \hat{Y} = 1, A = y)$$
which expresses that the probability of a given test sample being incorrectly identified as positive is equal regardless of group. This is known as Predictive Parity, and when combined with Equality of Opportunity is known as Equalized Odds.
These are easily measured in most NLP systems. The former is captured by measuring the recall gap: if $x$ is the privileged group and $y$ the underprivileged, unfairness is captured by $\mathrm{Recall}_x - \mathrm{Recall}_y$, where any positive value is unfair. The latter is captured by $\mathrm{Precision}_x - \mathrm{Precision}_y$, again where positive values are unfair. We modify WEAT 6 to use the gender terms from WEAT 7/8 in place of its proper names, but otherwise leave the terms as is.
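The recall-gap and precision-gap computations above are straightforward to implement. The sketch below is illustrative only; the function names and group encoding are ours, not from the released code:

```python
def recall(y_true, y_pred):
    # True positive rate: fraction of actual positives predicted positive.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    pos = sum(y_true)
    return tp / pos if pos else 0.0

def precision(y_true, y_pred):
    # Fraction of predicted positives that are actually positive.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    pred_pos = sum(y_pred)
    return tp / pred_pos if pred_pos else 0.0

def performance_gaps(y_true, y_pred, groups, privileged, underprivileged):
    """Recall gap (Equality of Opportunity) and precision gap (Predictive
    Parity) between two demographic groups; positive values are unfair."""
    def subset(group):
        idx = [i for i, g in enumerate(groups) if g == group]
        return [y_true[i] for i in idx], [y_pred[i] for i in idx]
    yt_x, yp_x = subset(privileged)
    yt_y, yp_y = subset(underprivileged)
    return (recall(yt_x, yp_x) - recall(yt_y, yp_y),
            precision(yt_x, yp_x) - precision(yt_y, yp_y))
```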

B WEAT Formula and Wordlists
WEAT 6 (career/family vs. male/female) uses proper names as gender terms, whereas the other two tests use more standard gender terms (she, her, he, him, mother, father). This is an artifact of replicating the IAT, and it introduces a confound in their comparability: if the WEAT tests have different patterns of correlation, we do not know whether this is because gender bias patterns differently for career/family than for arts/science, or because of the difference between proper names and gender terms. This is exacerbated in our case, where proper names are treated even more differently than usual both on Twitter (where @mentions stand in for proper names) and in the Winobias metric that we use (where professions are used instead of proper names precisely because names contain gender information and the challenge set intends to be ambiguous).
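For reference, each of these tests instantiates the standard WEAT effect size of Caliskan et al. (2017) with different wordlists. A minimal pure-Python sketch (our own illustrative implementation, which may differ in detail from the released code):

```python
import math

def cos(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def assoc(w, A, B):
    # s(w, A, B): mean similarity to attribute set A minus to set B.
    return (sum(cos(w, a) for a in A) / len(A)
            - sum(cos(w, b) for b in B) / len(B))

def weat_effect_size(X, Y, A, B):
    """WEAT effect size d over lists of embedding vectors: X and Y are
    target sets (e.g. career vs. family words), A and B are attribute
    sets (e.g. male vs. female terms)."""
    s = [assoc(w, A, B) for w in X + Y]
    mean_x = sum(s[:len(X)]) / len(X)
    mean_y = sum(s[len(X):]) / len(Y)
    mean_all = sum(s) / len(s)
    # Sample standard deviation over all target-word associations.
    std = math.sqrt(sum((v - mean_all) ** 2 for v in s) / (len(s) - 1))
    return (mean_x - mean_y) / std
```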

C Training Data and Preprocessing
This details the data for training embeddings. For data used in training the final models, see relevant papers cited in Section 4.1.

C.1 Wikipedia
Wikipedia data is downloaded from the latest Wikipedia article dump and tokenized with NLTK (https://www.nltk.org/); all words appearing fewer than 10 times are replaced with <unk>. The final dataset has 439,935,872 words.
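The rare-word replacement step can be sketched as follows; `replace_rare` and `min_count` are illustrative names, not from the released code:

```python
from collections import Counter

def replace_rare(tokenized_corpus, min_count=10, unk="<unk>"):
    """Replace tokens appearing fewer than min_count times with the
    <unk> symbol, as in the Wikipedia preprocessing described above."""
    counts = Counter(tok for sent in tokenized_corpus for tok in sent)
    return [[tok if counts[tok] >= min_count else unk for tok in sent]
            for sent in tokenized_corpus]
```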

C.2 Twitter
Twitter data is from 2019 and is downloaded from the Internet Archive (https://archive.org/details/twitterstream). Retweets are removed, and the data is lowercased and tokenized with the NLTK TweetTokenizer; hashtags and @mentions are replaced with <HASH> and <MENTION> respectively, and all words appearing fewer than 10 times are replaced with <unk>. The English Twitter data comprises 3,641,306 tweets (38,376,060 words); the Spanish data comprises 10,683,846 tweets (142,715,339 words).
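The hashtag and @mention replacement can be sketched as below. This assumes tokens are already split (the paper uses the NLTK TweetTokenizer; the helper name here is illustrative):

```python
def normalize_tweet(tokens):
    """Lowercase tokens and map hashtags and @mentions to placeholder
    symbols, mirroring the Twitter preprocessing described above."""
    out = []
    for tok in tokens:
        if tok.startswith("#") and len(tok) > 1:
            out.append("<HASH>")
        elif tok.startswith("@") and len(tok) > 1:
            out.append("<MENTION>")
        else:
            out.append(tok.lower())
    return out
```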

D Further Results Graphs
Below are breakouts of graphs by bias modification method, as well as full graphs with metric scales and legends. Figure 4 breaks out all tasks by bias modification method (pre- vs. post-processing). The most notable pattern is for hate speech in English: from the spread of data points, it is easy to see that modifying embeddings has more overall effect on the precision gap, whereas the recall gap occupies a narrower band over a wide spread of WEAT scores. Yet recall is the only metric with a positive correlation with WEAT, and then only in the postprocessing condition. It is also visible that bias in Spanish is much more difficult to modify with preprocessing than with postprocessing. Figure 5 shows one graph for each task and bias type combination, in full, in order to show the effect of not controlling for experimental variables; it also shows the scale of the spread of data points.
Finally, for interest, we also include Figure 6, which displays the correlation broken out by type of Winobias test (the types differ in difficulty: Type 1 offers only semantic cues, Type 2 also syntactic ones).

In each plot, the x-axis represents WEAT and the y-axis shows the performance gap between groups (male-female, female-other, migrant-other); original embeddings (before modification) are shown in black. No correlation holds independently of experimental conditions (embedding type, bias modification method, WEAT test).

Figure 6: Coreference (en) results broken out by type of Winobias challenge. Type 1 is more difficult, as there are only semantic cues to correct coreference; Type 2 also has syntactic cues.