Second Order WinoBias (SoWinoBias) Test Set for Latent Gender Bias Detection in Coreference Resolution

We observe an instance of gender-induced bias in a downstream application, despite the absence of explicit gender words in the test cases. We provide a test set, SoWinoBias, for the purpose of measuring such latent gender bias in coreference resolution systems. We evaluate the performance of current debiasing methods on the SoWinoBias test set, with attention to each method's design and the properties of the altered embedding space. See https://github.com/hillary-dawkins/SoWinoBias.


Introduction
Explicit (or first-order) gender bias was observed in coreference resolution systems by Zhao et al. (2018a) by considering contrasting cases: in pro-stereotypical examples, gender words align with a socially-held stereotype regarding the occupations, while in anti-stereotypical examples the correct coreference resolution contradicts a stereotype. It was observed that systems performed better on pro cases than on anti cases, and the WinoBias test set was developed to quantify this disparity.
Here we make a new observation of gender-induced (or second-order) bias in coreference resolution systems, and provide the corresponding test set SoWinoBias. Consider cases such as "The doctor liked the nurse because they were beautiful" (where "they" refers to the nurse) versus "The nurse liked the doctor because they were beautiful" (where "they" refers to the doctor). These examples do not contain any explicit gender cues at all, and yet the first aligns with a gender-induced social stereotype while the second opposes it. The induction occurs because "nurse" is a female-coded occupation (Bolukbasi et al., 2016; Zhao et al., 2018b), and women are also more likely to be described based on physical appearance (Hoyle et al., 2019; Williams and Bennett, 1975). A coreference resolution system is gender-biased if correct predictions on pro-stereotypical sentences are more likely than on anti-stereotypical ones.

The difference between first-order and second-order gender bias in a downstream application is especially interesting given current trends in debiasing static word embeddings. Early methods (Bolukbasi et al., 2016; Zhao et al., 2018b) focused on eliminating direct bias from the embedding space, quantified as associations between gender-neutral words and an explicit gender vocabulary. In response to an influential critique paper by Gonen and Goldberg (2019), the current trend is to focus on eliminating indirect bias from the embedding space, quantified either by gender-induced proximity among embeddings (Kumar et al., 2020) or by residual gender cues that could be learned by a classifier (Ravfogel et al., 2020; Davis et al., 2020).
Indirect bias in the embedding space was viewed as an undesirable property a priori, but we do not yet have a good understanding of its effect on downstream applications. Here we test debiasing methods from both camps on SoWinoBias, and make a series of observations on sufficient and necessary conditions for mitigating latent gender-biased coreference resolution.
Additionally, we consider the case in which our coreference resolution model employs both static and contextual word embeddings, but debiasing methods are applied to the static word embeddings only. Post-processing debiasing techniques applied to static word embeddings are computationally inexpensive, easy to combine, and have a longer development history. However, contemporary models for downstream applications are likely to use some form of contextual embeddings as well. Therefore we might wonder whether previous work in debiasing static word embeddings remains relevant in this setting. The WinoBias test set, for instance, was developed and tested using the "end-to-end" coreference resolution model (Lee et al., 2017), a state-of-the-art model at that time using only static word embeddings. Subsequent debiasing schemes reported results on WinoBias using the same model, just plugging in different debiased embeddings, for the sake of fair comparison. However, this is becoming increasingly outdated given the progress in coreference resolution systems. A contribution of this work is to report WinoBias results for previous debiasing techniques using an updated model, one that makes use of unaltered contextual embeddings in addition to the debiased static embeddings.
The remainder of the paper is organized as follows. In section 2, we further define the type of bias measured by the SoWinoBias test set and discuss some limitations. In section 3, we review the four word-embedding debiasing methods that we analyze, in the context of how each method aims to alter the word embedding space. In section 4, we provide details of the experimental setup and report results on both coreference resolution test sets, the original WinoBias and the newly constructed SoWinoBias. In section 5, we discuss the results with respect to the geometric properties of the altered embedding spaces. In particular, we review whether mitigation of intrinsic measures of bias on the embedding space, quantified as direct and indirect bias under various definitions, is related to mitigation of the latent bias in a downstream application.

Bias Statement
Within the scope of this paper, bias is defined and quantified as the difference in performance of a coreference resolution system on test cases aligning with a socially-held stereotype vs. test cases opposing a socially-held stereotype. We observe that gender-biased systems perform significantly better in pro-stereotypical situations. Such difference in performance creates representational harm by implying (for example) that occupations typically associated with one gender cannot have attributes typically associated with another.
Throughout this paper, the term "second-order" is used interchangeably with "latent". Characterizing the observed bias as "second-order" follows from the observation of a gender-induced bias in the absence of gender-definitional vocabulary, resting on the definition of "they" as a gender-neutral pronoun.
Therefore, a limitation in the test set construction is the possible semantic overloading of "they". As discussed, the intention throughout this paper is to use the singular "they" as a pronoun that does not carry any gender information (and could refer to someone of any gender). However, different contexts may choose to treat "they" exclusively as a non-binary gender pronoun.
The gender stereotypes used throughout this paper are sourced from peer-reviewed academic journals written in English, which draw from the US Labor Force Statistics, as well as US-based crowd workers. Therefore a limitation may be that stereotypes used here are not common to all languages or cultures.
Debiasing methods

Neutralization of static word embeddings

Methods addressing direct bias
The first attempts to debias word embeddings focused on the mitigation of direct bias (Bolukbasi et al., 2016). The definition of direct bias assumes the presence of a "gender direction" g: a subspace that mostly encodes the difference between the binary genders. A non-zero projection of a word w onto g implies that w is more similar to one gender than the other. In the case of ideally gender-neutral words, this is an undesirable property. Direct bias quantifies the extent of this uneven similarity over the set N of gender-neutral words:

DirectBias_c = (1 / |N|) Σ_{w ∈ N} |cos(w, g)|^c,

where c is a parameter controlling the strictness of the measure. The Hard Debias method (Bolukbasi et al., 2016) is a post-processing technique that projects all gender-neutral words into the nullspace of g, so the direct bias is zero by definition. We measure the performance of Hard-GloVe on the coreference resolution tasks.
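Both the direct-bias measure and the Hard Debias projection are straightforward to compute. The following is a minimal NumPy sketch, assuming gender-neutral word vectors stacked as rows of a matrix `N` and a gender direction `g`; the full Hard Debias pipeline also equalizes explicitly gendered word pairs, which is omitted here:

```python
import numpy as np

def direct_bias(N, g, c=1.0):
    """Mean |cos(w, g)|^c over the rows of N (gender-neutral word vectors)."""
    g = g / np.linalg.norm(g)
    cos_sims = (N @ g) / np.linalg.norm(N, axis=1)
    return np.mean(np.abs(cos_sims) ** c)

def hard_debias(N, g):
    """Project every row of N onto the nullspace of the gender direction g."""
    g = g / np.linalg.norm(g)
    return N - np.outer(N @ g, g)
```

By construction, `direct_bias(hard_debias(N, g), g)` is zero up to floating-point error, which is exactly the sense in which Hard Debias removes direct bias "by definition".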
A related retraining method used a modified version of GloVe's original objective function with additional incentives to reduce the direct bias for gender-neutral words, resulting in the GN-GloVe embeddings (Zhao et al., 2018b). Rather than allowing gender information to be distributed across the entire embedding space, the method explicitly sequesters the protected gender attribute to the final component. Therefore the first d − 1 components are taken as the gender-neutral embeddings, denoted GN-GloVe(w_a).

Methods addressing indirect bias
Indirect bias is less well defined, and loosely refers to gender-induced similarity between gender-neutral words. For instance, semantically unrelated words such as "sweetheart" and "nurse" may appear quantitatively similar due to a shared gender association.
One definition, first given by Bolukbasi et al. (2016), measures the relative change in similarity after removing direct gender associations:

β(w, v) = ( w·v − (w⊥·v⊥) / (‖w⊥‖‖v⊥‖) ) / (w·v), where w⊥ = w − (w·g)g,

however this relies on a limited definition of the original gender association. The Repulse-Attract-Neutralize (RAN) debiasing method attempts to repel undue gender proximities among gender-neutral words, while keeping word embeddings close to their original learned representations (Kumar et al., 2020). This method quantifies indirect bias by incorporating β into a graph-weighted holistic view of the embedding space (more on this later). In this paper, we will measure the performance of RAN-GloVe on the coreference resolution tasks.
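The β measure above can be computed directly from the vectors. Here is a minimal sketch, assuming unit-normalized embeddings; β(w, v) = 1 means the similarity w·v is entirely gender-induced, while β(w, v) = 0 means none of it is:

```python
import numpy as np

def beta(w, v, g):
    """Bolukbasi et al.'s indirect bias: the share of the similarity w.v
    attributable to the gender direction g (w, v assumed unit-norm)."""
    g = g / np.linalg.norm(g)
    w_p = w - (w @ g) * g   # gender-neutralized component of w
    v_p = v - (v @ g) * g   # gender-neutralized component of v
    residual = (w_p @ v_p) / (np.linalg.norm(w_p) * np.linalg.norm(v_p))
    return (w @ v - residual) / (w @ v)
```

Note the measure is undefined when w·v = 0 (or when a word lies entirely on the gender direction), which is one reason the paper calls this a limited definition.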
A related notion of indirect bias is to measure whether gender associations can be predicted from the word representation. The Iterative Nullspace Projection method (INLP) achieves linear guarding of the gender attribute by iteratively learning the most informative gender subspace for a classification task and projecting all words onto its orthogonal nullspace (Ravfogel et al., 2020). After sufficiently many iterations, gender information cannot be recovered by a linear classifier. We will measure the performance of INLP-GloVe.
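The INLP loop can be sketched in a few lines. This is a simplified illustration using scikit-learn's `LogisticRegression` as the linear probe; the original method supports several linear classifier families and additional details (minibatching, held-out evaluation) that are omitted here:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def inlp(X, y, n_iter=10):
    """Iteratively learn the most gender-informative direction from (X, y)
    and project it out, so a linear classifier can no longer recover y."""
    X = X.copy()
    P = np.eye(X.shape[1])  # accumulated projection matrix
    for _ in range(n_iter):
        clf = LogisticRegression(max_iter=1000).fit(X, y)
        w = clf.coef_[0]
        w = w / np.linalg.norm(w)
        P_i = np.eye(len(w)) - np.outer(w, w)  # nullspace projection of w
        X = X @ P_i
        P = P_i @ P
    return X, P
```

On synthetic data where the protected attribute lives in one direction, a probe trained after a few iterations drops to near-chance accuracy.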

Data augmentation
In addition to debiasing methods applied to word embeddings, we measure the effect of simple data augmentation applied to the training data for our coreference resolution system. The goal is to determine whether data augmentation can complement the debiased word embeddings on this particular test set. The training data is augmented using a simple gender-swapping protocol, such that binary gender words are replaced by their equivalent form of the opposite gender (e.g. "he" ↔ "she", etc.).
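The gender-swapping protocol amounts to a token-level substitution. A minimal sketch follows; the swap list here is a tiny illustrative subset, and a real protocol must handle ambiguous forms (e.g. "her" corresponds to both "him" and "his" depending on part of speech), typically using POS tags:

```python
# Illustrative subset of a binary gender-swap dictionary.
# NOTE: "her" -> "him" is lossy; real protocols disambiguate with POS tags.
SWAP = {
    "he": "she", "she": "he",
    "him": "her", "her": "him",
    "his": "her",
    "man": "woman", "woman": "man",
}

def gender_swap(tokens):
    """Replace each binary gender word with its opposite-gender form."""
    return [SWAP.get(t, t) for t in tokens]
```

Training on the union of original and swapped sentences gives the model pro- and anti-stereotypical evidence in equal measure.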

Detection of gender bias in coreference resolution: Experimental setup
All systems were built using the "Higher-order coreference resolution with coarse-to-fine inference" model (Lee et al., 2018). It is important to keep in mind that this model uses both static word embeddings and contextual word embeddings (specifically ELMo embeddings (Peters et al., 2018)). Our experimental debiasing methods were applied to the static word embeddings only; the contextual embeddings are left unaltered in all cases. All systems were trained on the OntoNotes 5.0 train and development sets, using the default hyperparameters, for approximately 350,000 steps until convergence. Baseline performance was tested using the OntoNotes 5.0 test set (results shown in Table 1). Baseline performance is largely consistent across all models, indicating that neither debiased word embeddings nor gender-swapped training data significantly degrades the overall performance of the system.

WinoBias
The WinoBias test set was created by Zhao et al. (2018a), and measures the performance of coreference systems on test cases containing explicit binary gender words. In particular, pro-stereotypical sentences contain coreferents where an explicit gender word (e.g. he, she) is paired with an occupation matching a socially held gender stereotype. Anti-stereotypical sentences use the same formulation but swap the explicit gender words such that coreferents now oppose a socially held gender stereotype. Gender bias is measured as the difference in performance on the pro versus anti test sets, each containing n = 396 sentences.

Table 1: Results on coreference resolution test sets. OntoNotes (F1) performance provides a baseline for "vanilla" coreference resolution (n = 348). WinoBias (F1) measures explicit gender bias, observable as the difference between pro (n = 396) and anti (n = 396) test sets. SoWinoBias (% accuracy) measures second-order gender bias, likewise observable as the difference between pro (n = 4096) and anti (n = 4096) test sets. Note: accuracy, rather than F1, is the relevant metric on the SoWinoBias test set, due to our assertion that "they" is not a new entity mention.
Recall that here we are reporting WinoBias results using a system incorporating unaltered contextual embeddings, in addition to the debiased static embeddings. Previously reported results on the "end-to-end" coreference model (Lee et al., 2017), using only debiased static word embeddings, are compiled in the Appendix for reference.
In this setting, we observe that debiasing methods addressing direct bias are more successful than those addressing indirect bias. In particular, without the additional resource of data augmentation, RAN-GloVe struggles to reduce the difference between the pro and anti test sets (in contrast to RAN-GloVe's great success in the end-to-end model setting, as reported by Kumar et al. (2020)). Data augmentation is found to be a complementary resource, providing further gains in most cases. Overall, Hard-GloVe with simple data augmentation successfully reduces the difference in F1 from 29% to 2.1%, while not significantly degrading the average performance on WinoBias or the baseline performance on OntoNotes. This suggests that debiasing the contextual word embeddings is not needed to mitigate explicit gender bias in coreference resolution, as measured by this particular test set.

SoWinoBias
The SoWinoBias test set measures second-order, or latent, gender associations in the absence of explicit gender words. At present, we measure associations of male- and female-stereotyped occupations with female-stereotyped adjectives, although this could easily be extended in the future. Adjectives with positive and negative polarities are represented evenly in the test set. We denote the vocabularies of interest as M_occ (male-stereotyped occupations), F_occ (female-stereotyped occupations), F+_adj (positive-polarity female-stereotyped adjectives), and F−_adj (negative-polarity female-stereotyped adjectives), where |M_occ| = |F_occ| = |F+_adj| = |F−_adj| = 16; the full sets can be found in the appendix. Stereotypical occupations were sourced from the original WinoBias vocabulary (drawing from US labor occupational statistics), as well as the SemBias (Zhao et al., 2018b) and Hard Debias analogy test sets (drawing from human-annotated judgements). Stereotypical adjectives with polarity were sourced from the latent gendered-language model of Hoyle et al. (2019), which was found to be consistent with the human-annotated corpus of Williams and Bennett (1975).
SoWinoBias test sentences are constructed as "The [occ1] (dis)liked the [occ2] because they were [adj]", where "(dis)liked" is matched appropriately to the adjective polarity, such that "they" always refers to [occ2]. Each sentence selects one occupation from M_occ and the other from F_occ. In pro-stereotypical sentences, occ2 ∈ F_occ, such that the adjective describing the (they, occ2) entity matches a social stereotype. In anti-stereotypical sentences, occ2 ∈ M_occ, such that the adjective describing the (they, occ2) entity contradicts a social stereotype. Example sentences in the test set include:

1. The doctor liked the nurse because they were beautiful. (pro)
2. The nurse liked the doctor because they were beautiful. (anti)
3. The ceo disliked the maid because they were unmarried. (pro)
4. The maid disliked the lawyer because they were unmarried. (anti)

In total, there are n = 4096 sentences in each of the pro and anti test sets. Due to the simplicity of our constructed sentences, and our desire to measure gendered associations, we further assert that "they" must refer to one of the two potential occupations (i.e. "they" cannot be predicted as a new entity mention). As with WinoBias, gender bias is observed as the difference in performance between the anti and pro test sets.

Firstly, we observe that the second-order gender bias is more difficult to correct than the explicit bias, given access to the debiased embeddings alone. Methods that made good progress in reducing the WinoBias difference make little to no progress on the SoWinoBias difference. However, even simple data augmentation was found to be a valuable resource. When combined with GN-GloVe(w_a), the difference is reduced to 2.4% while average performance increases significantly. Again, we observe that good bias reduction can be achieved even before incorporating methods to debias the contextual word embeddings.
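The template construction above can be sketched as follows. The vocabularies here are tiny illustrative stand-ins (the real sets contain 16 items each, listed in the appendix), and the adjective-to-verb polarity pairing is hard-coded for the two sample adjectives:

```python
from itertools import product

M_OCC = ["doctor", "ceo"]    # male-stereotyped occupations (illustrative subset)
F_OCC = ["nurse", "maid"]    # female-stereotyped occupations (illustrative subset)
F_ADJ = {"beautiful": "liked", "unmarried": "disliked"}  # adjective -> verb by polarity

def sowinobias(pro=True):
    """Generate pro- or anti-stereotypical template sentences.
    'they' always refers to occ2; pro means occ2 is female-stereotyped."""
    sents = []
    for m, f, (adj, verb) in product(M_OCC, F_OCC, F_ADJ.items()):
        occ1, occ2 = (m, f) if pro else (f, m)
        sents.append(f"The {occ1} {verb} the {occ2} because they were {adj}.")
    return sents
```

With the full 16-word vocabularies and even polarity coverage, this enumeration yields the n = 4096 sentences per test set reported above.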
It is interesting that debiasing methods explicitly designed to address indirect bias in the embedding space do not do better at mitigating second-order bias in a downstream task. Further discussion in relation to the embedding space properties is provided in the following section.

Relationship to embedding space properties

Single-attribute WEAT
The Word Embedding Association Test (WEAT) measures the association strength between two concepts of interest (e.g. arts vs. science) relative to two defined attribute groups (e.g. female vs. male) (Caliskan et al., 2017). It was popularized as a means for detecting gender bias in word embeddings by showing that (arts, science), (arts, math), and (family, careers) produced significantly different association strengths relative to gender.
Here we adapt the original WEAT to measure relative association across genders given a single concept of interest. This provides a means to measure whether the set of female-stereotyped adjectives F_adj is quantitatively gender-marked in the embedding space.
The relative association of a single word t across attribute sets A1, A2 is given by

s(t, A1, A2) = mean_{a ∈ A1} cos(t, a) − mean_{a ∈ A2} cos(t, a),

where s(t, A1, A2) > 0 indicates that t is more closely related to attribute A1 than A2. The average relative association of a concept T is then

S(T, A1, A2) = mean_{t ∈ T} s(t, A1, A2).

The significance of a non-zero association strength can be assessed by a partition test. We randomly sample alternate attribute sets A*1 and A*2 of equal size from the union of the original attribute sets. The significance p is defined as the proportion of samples that produce S(T, A*1, A*2) > S(T, A1, A2). Small p values indicate that the defined grouping of the attribute sets (here defined by gender) is meaningful compared to random groupings.

Table 2 shows the results of the single-attribute WEAT. We measure the association strength of the female adjectives relative to gender in two ways: (i) gender is defined using a "definitional" vocabulary (A1 = F_def = {she, her, woman, ...}, A2 = M_def = {he, him, man, ...}), and (ii) gender is defined using a latent vocabulary, the stereotypical occupations (A1 = F_occ, A2 = M_occ).
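The single-attribute WEAT and its partition test can be implemented in a few lines. The following is a minimal NumPy sketch operating on lists of word vectors (a sampled, rather than exhaustive, partition test, as in the description above):

```python
import numpy as np

def cos(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def s(t, A1, A2):
    """Relative association of word vector t across attribute sets A1, A2."""
    return np.mean([cos(t, a) for a in A1]) - np.mean([cos(t, a) for a in A2])

def weat_single(T, A1, A2, n_perm=1000, seed=0):
    """Association strength S(T, A1, A2) and partition-test p-value."""
    rng = np.random.default_rng(seed)
    S = np.mean([s(t, A1, A2) for t in T])
    pool = list(A1) + list(A2)
    count = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pool))
        A1s = [pool[i] for i in idx[:len(A1)]]
        A2s = [pool[i] for i in idx[len(A1):]]
        if np.mean([s(t, A1s, A2s) for t in T]) > S:
            count += 1
    return S, count / n_perm
```

For the latent test (ii), one would pass the occupation vectors as the attribute sets and the adjective vectors as T.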
As shown, the F_adj embeddings are strongly associated with the explicit gender vocabulary in the original GloVe space. In contrast, the F_adj embeddings are just as strongly associated with the latent gender vocabulary in the original GloVe space, but this association is not undone by any of the debiasing methods. This is a somewhat unexpected result in the case of the RAN and INLP debiasing methods, as they promised to go beyond direct bias mitigation.
The INLP method makes the most progress in reducing the implicit association strength, however a significant non-zero association remains. Combined with the SoWinoBias test results, we can observe that the WEAT reduction achieved by INLP is not a sufficient condition for mitigating latent gender-biased coreference resolution. Inversely, we observe that reduction of the WEAT measure is not a necessary condition for mitigation when debiased embeddings are combined with data augmentation (demonstrated by GN-GloVe(w_a)).

Clustering and Recoverability
Clustering and recoverability (C&R) (Gonen and Goldberg, 2019) refer to a specific observation on the embedding space post debiasing; namely, that gender labels of words (assigned according to direct bias in the original embedding space) can be classified with a high degree of accuracy given only the debiased representations. Here we follow the same experimental setup, and report results on an expanded set of embeddings (see Table 3).
In agreement with Gonen and Goldberg (2019), we find that the Hard-GloVe and GN-GloVe embeddings retain nearly perfect recoverability of the original gender labels, indicating high levels of residual bias by this definition.
The INLP method was designed to guard against linear recoverability, and indeed we find that both clustering and linear-SVM recoverability are reduced to near-random performance. Recoverability by an SVM with a non-linear (rbf) kernel achieves 75% accuracy: much reduced compared to the other debiasing methods, but still above the 50% baseline. This result is consistent with Ravfogel et al. (2020). Of particular interest are the results obtained for the RAN-GloVe embeddings, which have not previously been reported. RAN was designed to mitigate undue proximity bias, conceptually similar to clustering. Despite this, clustering and recoverability are still possible with high accuracy given RAN-debiased embeddings. Given RAN's success on various gender bias assessment tasks (SemBias, and WinoBias using the end-to-end coreference model), this suggests that complete suppression of clustering and recoverability is unnecessary for many practical applications. Conversely, it may indicate that we have not yet developed assessment tasks that probe the effect of indirect bias.
In reference to the SoWinoBias results, we can observe that linear attribute guarding (achieved by INLP) is not a sufficient condition for mitigating latent gender-biased coreference resolution. However, even linear guarding is not a necessary condition for mitigating SoWinoBias when retraining with data augmentation is available.

Gender-based Illicit Proximity Bias
The gender-based illicit proximity bias (GIPE) was proposed by Kumar et al. (2020) as a means to capture indirect bias in the embedding space as a well-defined metric, as opposed to the loosely defined idea of clustering and recoverability. Firstly, the gender-based proximity bias of a single word w, denoted η(w), is defined as the proportion of its N nearest neighbours {n_i} with indirect bias β(n_i, w) above some threshold θ. Intuitively, this is the proportion of words that are close by solely due to a shared gender association. The GIPE extends this word-level measure to a vocabulary-level measure using a weighted average over η(w).

Table 3: Clustering (reported as accuracy and v-measure (Rosenberg and Hirschberg, 2007)) is performed by taking the n = 1500 most biased words in the original embedding space (excluding definitional gender words) and performing k-means clustering (k = 2) on the same words in the debiased space. Recoverability (reported as accuracy) is performed by taking the n = 5000 most biased words in the original embedding space and training a classifier (linear SVM or rbf-kernel SVM) on the same words in the debiased space. Smaller values are better (indicating fewer residual cues that can be used to classify gender-neutral words). GIPE: smaller values are better (indicating less undue proximity bias in the embedding space).

Table 3 also shows the GIPE measure on the entire gender-neutral vocabulary V_d, the gender-neutral vocabulary used to construct SoWinoBias, V_So = F_occ ∪ M_occ ∪ F_adj, and the simple (unweighted) average η(w_So) on the SoWinoBias vocabulary.
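The word-level measure η(w) can be sketched directly from the definitions above. A minimal NumPy version, assuming word vectors as rows of `X` and reusing Bolukbasi et al.'s β as the indirect-bias score (the full GIPE then takes a weighted average of η over the vocabulary):

```python
import numpy as np

def eta(w_idx, X, g, n_neighbours=100, theta=0.05):
    """Gender-based proximity bias of one word: the fraction of its nearest
    neighbours whose similarity is 'illicit', i.e. indirect bias beta
    above threshold theta (Kumar et al., 2020)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    gh = g / np.linalg.norm(g)
    w = Xn[w_idx]
    sims = Xn @ w
    sims[w_idx] = -np.inf                      # exclude the word itself
    nbrs = np.argsort(-sims)[:n_neighbours]    # N nearest neighbours

    def beta(u, v):
        u_p, v_p = u - (u @ gh) * gh, v - (v @ gh) * gh
        resid = (u_p @ v_p) / (np.linalg.norm(u_p) * np.linalg.norm(v_p))
        return (u @ v - resid) / (u @ v)

    return np.mean([beta(Xn[i], w) > theta for i in nbrs])
```

In the toy check below, a word whose neighbours are close to it only through a shared gender component receives η = 1, i.e. all of its proximity is illicit.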
The RAN method mitigates indirect bias as measured by GIPE by design, and therefore achieves the lowest GIPE values as expected (followed, somewhat unexpectedly, by Hard-GloVe). However, non-zero proximity bias persists, more so on the stereotyped sub-vocabulary than on the total vocabulary. Without extra help from data augmentation, RAN-GloVe achieves the best performance on SoWinoBias (followed by Hard-GloVe). Therefore, further reduction of GIPE may enable further mitigation of latent gender-biased coreference resolution; GIPE reduction cannot be ruled out as a sufficient condition at this time. However, RAN-GloVe does not benefit from the addition of data augmentation, unlike the majority of debiasing methods. Further investigation is needed to determine which embedding-space properties allow data augmentation to be complementary.

Conclusion
In this paper, we demonstrate the existence of observable latent gender bias in a downstream application, coreference resolution. We provide the first gender bias assessment test set that does not contain any explicit gender-definitional vocabulary. Although the present study is limited to binary gender, this construction should allow us to assess gender bias (or other demographic biases) in cases where explicit defining vocabulary is limited or unavailable. However, the construction does depend on knowledge of expected relationships or stereotypes (here, occupations and adjectives). Therefore, interdisciplinary work drawing from the social sciences is encouraged as a future direction.
Our observations indicate that mitigation of indirect bias in the embedding space, according to our current understanding of that notion, does not reduce the latent associations in the embedding space (as measured by WEAT), nor does it mitigate the downstream latent bias (as measured by SoWinoBias). Future work could seek bias assessment tasks in downstream applications that do depend on the reduction of gender-based proximity bias or non-linear recoverability. Currently the motivation for such reduction is unknown, despite being an active direction of debiasing research.
Finally, we do observe that an early debiasing method, GN-GloVe, combined with simple data augmentation, can mitigate the latent gender biased coreference resolution, even when contextual embeddings in the system remain unaltered. Future work could extend the idea of the SoWinoBias test set to more complicated sentences representative of real "in the wild" cases, in order to determine if this result holds.
The SoWinoBias test set, all trained models presented in this paper, and code for reproducing the results are available at https://github.com/hillary-dawkins/SoWinoBias.