Marked Attribute Bias in Natural Language Inference

Reporting and providing test sets for harmful bias in NLP applications is essential for building a robust understanding of the current problem. We present a new observation of gender bias in a downstream NLP application: marked attribute bias in natural language inference. Bias in downstream applications can stem from training data, from word embeddings, or be amplified by the model in use; however, focusing on biased word embeddings is potentially the most impactful first step due to their universal nature. Here we seek to understand how the intrinsic properties of word embeddings contribute to this observed marked attribute effect, and whether current post-processing methods address the bias successfully. An investigation of the current debiasing landscape reveals two open problems: none of the current debiased embeddings mitigate the marked attribute error, and none of the intrinsic bias measures are predictive of the marked attribute effect. Noticing that a new type of intrinsic bias measure correlates meaningfully with the marked attribute effect, we propose a new post-processing debiasing scheme for static word embeddings. The proposed method, applied to existing embeddings, achieves new best results on the marked attribute bias test set. See https://github.com/hillary-dawkins/MAB.


Introduction
Pre-trained distributed representations of words (a.k.a. word embeddings) are ubiquitous tools in natural language processing (NLP). Their utility owes to their remarkable success in mapping semantic and syntactic relationships among words to linear relationships among real-valued vectors. For instance, analogy generation using vector addition on word embeddings (e.g. Tokyo is to Japan as Paris is to France) was taken to be an early measure of word embedding quality. Across many related tasks, the vector space is known to encode semantic meaning surprisingly well (Pennington et al., 2014; Mikolov et al., 2013b,c). However, harmful gender-biased properties of word embeddings are also known to exist. It was later observed that the same analogy generation property that produced the celebrated "man is to king as woman is to queen" analogy would also predict "man is to programmer as woman is to homemaker" (Bolukbasi et al., 2016). This observation sparked interest in developing debiased word embeddings.
Post-processing debiasing schemes are usually motivated by recognizing some intrinsic measure of bias in the embedding space, and then attempting to reduce that intrinsic bias. Early work (2016-2017) focused on the idea of a "gender direction" vector within the embedding space, loosely defined as the difference vector between female and male attribute words. It was noted that any non-zero projection of a word onto the gender direction (termed direct bias) implied that the word was more related to one gender than another. In the case of ideally gender-neutral words (e.g. doctor, nurse, programmer, homemaker), this was viewed as an undesirable property. The first debiasing methods, Hard Debias (Bolukbasi et al., 2016) and Gender-Neutral GloVe (Zhao et al., 2018b), worked to minimize or eliminate the direct bias, and were shown to be successful in mitigating harmful analogies generated by word embeddings in relation to gender-stereotyped occupations.
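For concreteness, the direct bias of a word is its (absolute) cosine projection onto the gender direction, averaged over ideally gender-neutral words. A minimal sketch of this quantity follows; the function name and word list are illustrative, not the released code of any of the cited works.

import numpy as np

def direct_bias(emb, neutral_words, gender_direction):
    """Average absolute cosine of each (ideally gender-neutral) word with the gender direction."""
    g = gender_direction / np.linalg.norm(gender_direction)
    cosines = [abs(np.dot(emb[w], g) / np.linalg.norm(emb[w])) for w in neutral_words]
    return float(np.mean(cosines))

# Illustrative usage (emb maps words to vectors; the word list is hypothetical):
# db = direct_bias(emb, ["doctor", "nurse", "programmer", "homemaker"], emb["she"] - emb["he"])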
An influential critique paper by Gonen and Goldberg (2019) demonstrated that minimizing direct bias did not eliminate bias in the vector space entirely. Rather, words that tended to cluster together due to gender bias (e.g. nurse, teacher, secretary, etc.) would still cluster together in the nullspace of the gender direction. Furthermore, the original bias could be recovered by classification techniques using only the debiased word embeddings as input. These observations were termed cluster bias and recoverability bias, respectively.
The next wave of debiasing methods (2019-present) focused on reducing cluster and recoverability bias while proposing new metrics to systematically quantify the indirect bias of the embedding space (e.g. the Gender-based Illicit Proximity Estimate, introduced by Kumar et al. (2020)). While these new debiasing schemes do reduce indirect bias in multiple ways, there is a general lack of connection to downstream applications such as coreference resolution, natural language inference (NLI), and sentiment analysis.
Current gender-bias evaluation tests (GBETs) in widespread use include the WinoBias test set (Zhao et al., 2018a), designed to measure bias in coreference resolution systems using stereotypical occupations as a probe, and the NLI test set (Dev et al., 2020a), designed to measure stereotypical inferences, again using occupations as the concept of interest. More commonly used evaluations include the Word Embedding Association Test (WEAT) (Caliskan et al., 2017) and the analogy generation test SemBias (Zhao et al., 2018b). However, these tests solely evaluate the vector properties of the word embeddings, without any connection to downstream applications. Adding to the library of downstream GBETs is essential in building a robust understanding of gender bias in NLP applications (Sun et al., 2019).
Here we introduce a new observation of gender-biased predictions in a downstream task, namely "marked attribute bias" in natural language inference, and develop corresponding GBETs. Marked attribute bias refers to a language model's tendency to predict that "person" implies "man" (the default attribute), while simultaneously understanding that "person" does not necessarily imply "woman" (the marked attribute). Marked attribute bias was found to exist on explicitly defined gender words (e.g. man, woman, etc.), and to persist on implicit gender words (e.g. names) as well as latent gender carriers (e.g. stereotypical occupations).
An analysis of the currently available debiased embeddings reveals that none are able to successfully mitigate marked attribute bias. Furthermore, none of the currently proposed measures of intrinsic bias on the embedding space are predictive of the marked attribute effect. We define a new measure of intrinsic bias that was found to correlate with the marked attribute effect better than any currently available metric. Using this insight, we introduce a new debiasing scheme: Multi-dimensional Information-weighted Soft Projection (MISP). Applying MISP to an existing debiased embedding achieves the lowest observed marked attribute bias error.

To see how marked attribute bias differs from occupation-based bias probes, consider first the occupation setting. That inference task essentially asks the question: is "doctor" a subset of man/woman? That is, if someone is a doctor, must they be a man? While both hypothesis sentences should receive a neutral prediction (as "doctor" does not imply any specific gender), the masculine hypothesis will more likely receive an entailment, while the feminine hypothesis will more likely receive a contradiction, given biased word embeddings. The corresponding GBET was published by Dev et al. (2020a).

The marked attribute bias test instead pairs a "person" premise with gendered hypotheses. Due to the language model's tendency to predict that "person" implies a male (default) attribute, the masculine hypothesis sentence will have a prediction probability vector shifted towards Entail. However, the same language model would tend towards a Neutral prediction for the feminine hypothesis, recognizing that "person" does not necessarily imply female (the marked attribute). To put it another way, this inference task essentially asks the question: is "person" a subset of man/woman? When presented with a masculine form, the model answers: yes (entailment), a person must be a man. When presented with a feminine form, the model answers: not necessarily (neutral), a female has an attribute (gender) that not all persons have. The name "marked attribute bias" therefore derives from the observation that masculine forms are unmarked with respect to gender, whereas feminine forms carry a marked gender attribute.
Note that although the MAB test construction appears similar to that of Dev et al. (2020a), it actually measures quite a distinct effect. The Dev et al. (2020a) test set measures associations between gender and some concept of interest (occupations). The MAB test set measures something more general and pervasive: it measures how gender words themselves carry meaning, independent of any concept of interest.
Achieving the correct prediction probability of (N, E, C) = (1, 0, 0) on both sentences is difficult because it requires the language model to be attribute-aware (in this case gender-aware) while not using the gender attribute to alter predictions when it would be inappropriate to do so.

Analysis of the current situation
In order to investigate the presence of systematic marked attribute bias in natural language inference, we construct three types of tests: bias on explicit gender words, implicit gender carriers, and latent gender carriers. We wish to understand the depth and persistence of the marked attribute effect, as well as how it is handled by current debiasing methods. Firstly we provide a brief description of the current debiasing methods to be analyzed. Next we provide details of the test sets and report results.

Debiased embeddings
Within the scope of this paper, we focus on post-processing techniques applied to static word embeddings. These methods are computationally inexpensive, easy to concatenate, and independent of the base embedding. In addition, we include GN-GloVe, one of the most highly cited retraining methods. Notationally, we specify embeddings as (base embedding).method. Where available, we use the published debiased embeddings made available by the original authors of the corresponding method; otherwise, we apply the method to the base GloVe embeddings. The methods we analyze are as follows.

Hard Debias (GloVe.HD) (Bolukbasi et al., 2016): The subset of gender-neutral words is projected onto the nullspace of the gender direction g. Gender-neutral words are made equidistant to pairs of words in a defined equalization set.

Gender-Neutral GloVe (GN-GloVe) (Zhao et al., 2018b): Similar to Hard Debias, this method seeks to eliminate the direct bias. The embeddings are retrained from scratch using a modified version of GloVe's original objective function, and the gender information is sequestered to the final component of the word embedding. The gender-neutral portion of the word embedding is then defined as the first d − 1 = 299 components, denoted GN-GloVe(w_a).

Gender-Preserving (GloVe.GP) (Kaneko and Bollegala, 2019): This method seeks to eliminate harmful gender bias while retaining as much useful semantic gender information as possible.

Double Hard Debias (GloVe.DHD) (Wang et al., 2020): An extended version of the Hard Debias algorithm, based on the observation that frequency information encoded in the word embeddings confounds the definition of the gender direction. Correctional pre-processing is applied prior to hard debiasing.

Bias Alignment Model (GloVe.BAM) (Lauscher et al., 2019): Gender subspace matrices are defined by stacking explicit gender words. The projection that maps the embedding space to itself while approximately aligning the gender subspaces is learned and applied to all words. After alignment, gender information is not retained.

Orthogonal Subspace Correction and Rectification (GloVe.OSCaR) (Dev et al., 2020b): The rationale is that linear projective methods are too aggressive in modifying the entire embedding space. OSCaR instead rectifies two concepts of interest (gender and occupations) such that their subspaces are orthogonal in the debiased space.

Iterative Nullspace Linear Projection (GloVe.INLP) (Ravfogel et al., 2020): Rather than defining a gender direction, INLP learns the most informative decision boundary for classifying gendered and gender-neutral words. All words are projected to the nullspace of the gender subspace, and the process proceeds iteratively until gender information is sufficiently erased. A closely related method is the D4 algorithm (Davis et al., 2020).

Repulse Attract Neutralize Debias (GloVe.RAN) (Kumar et al., 2020): Motivated by the persistence of implicit bias after debiasing through projective methods (observed as clustering and recoverability), RAN-debias attempts to address both direct bias and gender-based proximity bias.

Explicit gender words test set and error definitions
Firstly, we construct a test set where every sentence pair is of the form [A person verb object] → [(A) gender word verb object]; the correct inference is always neutral, since a person can be of any gender. Verbs (n = 27) and objects (n = 184) are paired to create n = 1968 unique premise sentences. Gender words are taken to be {man, woman, guy, girl, gentleman, lady, He, She}, following Dev et al. (2020a) with the addition of the pronouns, for a total test set S of |S| = 15744 sentence pairs in which hypotheses represent the binary genders evenly.

For every hypothesis sentence in the test set, the ideal prediction probability vector is (N, E, C) = (1, 0, 0). We could define the error on the test set as the average Euclidean distance from the ideal distribution:

E = \frac{1}{|S|} \sum_{s \in S} \left\lVert \vec{p}_s - (1, 0, 0) \right\rVert \qquad (1)

where \vec{p}_s is the (N, E, C) prediction vector for sentence pair s. This task, test set, and error definition are simple, and yet they encapsulate the central challenge of the debiasing field: to create attribute-aware (required to obtain the Neutral prediction) but attribute-unbiased embeddings.

A weaker, but still potentially desirable, condition might be to minimize the effect of gender while not requiring that the model be gender-aware. Typically, this means that all hypotheses tend towards an Entail prediction, regardless of gender. We could define the error as the average distance between prediction probability vectors across genders:

d = \frac{1}{|P|} \sum_{(s_m, s_f) \in P} \left\lVert \vec{p}_{s_m} - \vec{p}_{s_f} \right\rVert \qquad (2)

where P is the set of matched masculine/feminine hypothesis pairs sharing a premise. A gender-agnostic model could achieve zero error by this definition even with an accuracy of zero on the test set. (The projection matrix computed for our base GloVe embeddings is available at https://github.com/hillary-dawkins/MAB.)

Table 1 shows the results for this test set on all the embeddings of interest. None of the debiased embeddings successfully mitigate the marked attribute error. A similar test set shows that the effect persists on implicit gender words (e.g. names); results are shown in the appendix.
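For concreteness, a minimal sketch of how the two error quantities in eqs. (1) and (2) could be computed from a model's (N, E, C) prediction vectors follows; the array layouts and function names are illustrative, not the released evaluation code.

import numpy as np

IDEAL = np.array([1.0, 0.0, 0.0])  # target (Neutral, Entail, Contradict) distribution

def marked_attribute_error(probs):
    """Eq. (1): mean Euclidean distance of each (N, E, C) prediction vector from (1, 0, 0)."""
    probs = np.asarray(probs)                      # shape (num_sentence_pairs, 3)
    return float(np.linalg.norm(probs - IDEAL, axis=1).mean())

def gender_distance(probs_masc, probs_fem):
    """Eq. (2): mean distance between matched masculine/feminine prediction vectors."""
    probs_masc, probs_fem = np.asarray(probs_masc), np.asarray(probs_fem)
    return float(np.linalg.norm(probs_masc - probs_fem, axis=1).mean())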

Latent gender carriers: Stereotyped occupations
Next, we would like to check whether the gender-induced marked attribute bias can affect entities which should be gender-neutral, but turn out to be hidden carriers of a gender attribute (e.g. stereotypical occupations). Results are shown in Table 2. A permutation test is used to check whether dividing the occupations into groups according to gender stereotypes produces a significant difference in the probability vectors (rather than dividing them randomly); a sketch of this test is given below. As shown, the marked attribute effect persists on stereotypical occupations, especially on the original embeddings. This is an important result because it highlights that unintended behaviour can appear in unexpected places due to a latent attribute. Previously, GBETs have focused on how explicit gender words are treated under biased models. To our knowledge, this is the first GBET designed to analyze unintended behaviour on a latent attribute carrier.
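A sketch of the permutation test, under the assumption that each occupation is summarized by one averaged (N, E, C) prediction vector and a stereotype label; the variable names are illustrative.

import numpy as np

def permutation_significance(occ_probs, stereotype_labels, n_samples=10_000, seed=0):
    """Proportion of random occupation partitions whose group distance d* exceeds the
    distance d from the stereotype-based partition (smaller values = stronger evidence
    that the gender-stereotype grouping is non-random)."""
    rng = np.random.default_rng(seed)
    probs = np.asarray(occ_probs)            # shape (num_occupations, 3): averaged (N, E, C)
    labels = np.asarray(stereotype_labels)   # boolean: True = stereotypically female

    def group_distance(mask):
        return np.linalg.norm(probs[mask].mean(axis=0) - probs[~mask].mean(axis=0))

    d = group_distance(labels)
    exceed = sum(group_distance(rng.permutation(labels)) > d for _ in range(n_samples))
    return d, exceed / n_samples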
Note that this task is easier to correct than the explicit gender words test because occupation words have defining characteristics beyond gender. (The exact word set used to produce these results is available at https://github.com/hillary-dawkins/MAB.) That is, a debiasing method such as Iterative Nullspace Projection can perform well by removing gender information entirely. This does not mean that the challenge of having a gender-aware but gender-unbiased embedding is solved, but it does provide evidence that latent gender effects can be mitigated using linear projective methods. The full extent of latent biased-attribute effects and possible mitigation strategies should be investigated further.

Intrinsic bias measures
How to define bias on an embedding space remains an active area of study. In general, we seek to understand how the intrinsic or geometric properties of an embedding space translate to real, observable bias in downstream tasks. Intrinsic properties are quick and easy to compute, whereas measuring performance on downstream tasks requires training new models for every case. Understanding the correlation between the two gives insight into how word embeddings should be debiased.
As a case study, let us focus on the marked attribute error E on the explicit gender words (shown in Table 1). Recall that this measure of bias is of interest because zero error corresponds to the gold standard: having an attribute-aware model, while simultaneously not using the gender attribute to make inappropriate inferences. In this section, we look at 5 existing intrinsic bias measures: Direct Bias, Clustering, Recoverability, Gender-based Illicit Proximity Estimate (GIPE), and SemBias. We will investigate whether any of these measures are predictive of the marked attribute effect.
Recall that direct bias was the first measure to be proposed; it simply measures the average projection of word vectors onto a predefined gender direction. Early methods (i.e. Hard Debias and GN-GloVe) defined bias in the embedding space entirely as direct bias. Clustering and recoverability refer to a classifier's ability to correctly reassign gender labels to words, even after debiasing methods have been applied. Gonen and Goldberg (2019)'s observation of clustering and recoverability sparked new interest in defining metrics for indirect bias on the embedding space. Although clustering and recoverability do not provide well-defined measures of bias given an embedding space (as they depend on the training implementation, though they could be said to provide a lower bound), many new debiasing proposals cite reductions in these measures as evidence of debiasing.

Table 1: Results of the marked attribute test set on explicit gender words. Due to varying results on gender nouns vs. pronouns, results are shown separately for each case (M and F represent averages across the gender nouns). Some debiased embeddings are able to eliminate the distance across pronouns (essentially by definition, since she ≈ he in these cases), but none are able to significantly reduce the differences between the gender nouns. Even when differences between genders are minimized, the distance from the ideal distribution (error E) remains or increases. This highlights the challenge of creating gender-aware but not gender-biased embeddings.

Table 2: Results of the marked attribute test set on stereotypical occupations. Each (N, E, C) probability vector is averaged over the 1968 unique premise sentences and the gender attribute words from each category (M or F) (n = 31,488 sentences for each gender). Smaller distances between the M and F vectors indicate less gender bias. The significance of the difference was evaluated using a permutation test; the alternate distance d* is computed for 10,000 randomly sampled partitions of the occupations into two groups. The significance value is the proportion of these samples that generate a distance d* > d. This gives us an idea of whether the defined partition, based on gender, is a meaningful grouping. Smaller significance values indicate that the defined partition is non-random with respect to the distance.

Implementation details for each measure, as well as the experimental set of embeddings (n = 16), are given in the appendix. The average Direct Bias on the embedding space was found to have a Pearson correlation coefficient of 0.104 with the marked attribute error. The Clustering v-measure (Rosenberg and Hirschberg, 2007) achieved a correlation coefficient of 0.184. Recoverability was attempted using an SVM with a linear decision boundary, an SVM with a non-linear (radial basis function) kernel, logistic regression, and a simple 1-hidden-layer fully-connected network. All recoverability correlation results were comparable, but the best coefficient of 0.223 was achieved by logistic regression. The GIPE had a correlation coefficient of 0.432. The SemBias test set had a correlation coefficient of 0.091. The full correlation matrix between all intrinsic bias measures can be found in the appendix. These results suggest that the marked attribute effect is not well correlated with any existing notion of intrinsic bias; therefore, we do not have a good understanding of how word embedding properties contribute to this type of observable bias.
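Once the intrinsic score and the marked attribute error have been collected for each embedding space, correlations of this kind reduce to a single SciPy call; a minimal sketch (the list names are illustrative).

from scipy.stats import pearsonr

def correlation_with_mab(intrinsic_scores, mab_errors):
    """Pearson correlation between an intrinsic bias measure and the marked attribute error,
    with one value per embedding space (n = 16 in our experiments)."""
    r, p_value = pearsonr(intrinsic_scores, mab_errors)
    return r, p_value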
In seeking a potential solution, we make note of a new intrinsic bias measure, the multi-dimensional information-weighted direct bias (MIDB), found to have a more meaningful correlation of 0.667 with the marked attribute error. We define the MIDB of a particular word x to be a weighted average of inner products with the basis vectors of a multi-dimensional gender subspace:

\mathrm{MIDB}(x) = \sum_{i=1}^{d} a_i \left| \langle x, g_i \rangle \right| \qquad (3)

where {g_i} form an orthonormal basis for the gender subspace, here defined as the first d principal components summarizing the difference vectors {δ_jk}. The difference vectors are taken as all pairwise differences between vectors in defined gender sets (here, common names were used): δ_jk = f_j − m_k, with f_j ∈ F_names, m_k ∈ M_names (|M_names| = |F_names| = 100). The weighting a_i is the proportion of variance explained by the i-th principal component, and d is a hyperparameter controlling the number of dimensions to keep.

New proposals for defining a gender direction or subspace potentially have far-reaching consequences in the landscape of intrinsic bias measures and their related debiasing schemes. In fact, all of Clustering, Recoverability, GIPE, and SemBias use the classic uni-dimensional gender direction g within their definitions. The weak observed correlation between DB and MIDB suggests that these subspaces are independent. Swapping a uniquely informative gender subspace into the existing indirect measures would produce a new family of intrinsic bias measures. The observed utility of names in defining a meaningful gender subspace is encouraging because it opens an obvious avenue for this method to be applied to attributes of interest beyond gender (e.g. race or ethnicity).
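A minimal sketch of how the gender subspace {g_i}, the weights {a_i}, and the MIDB score could be computed with an off-the-shelf PCA; the helper names are illustrative, and the exact vocabulary handling in the released code may differ.

import numpy as np
from sklearn.decomposition import PCA

def gender_subspace(emb, female_names, male_names, d=4):
    """Basis {g_i} and weights {a_i}: top-d principal components of all pairwise
    female-minus-male name difference vectors, weighted by variance explained."""
    diffs = np.array([emb[f] - emb[m] for f in female_names for m in male_names])
    pca = PCA(n_components=d).fit(diffs)
    return pca.components_, pca.explained_variance_ratio_   # rows g_1..g_d, weights a_1..a_d

def midb(word_vec, basis, weights):
    """Eq. (3): information-weighted sum of |<x, g_i>| over the gender-subspace basis."""
    return float(np.sum(weights * np.abs(basis @ np.asarray(word_vec))))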

Multi-dimensional information-weighted soft projection
In this section we motivate the above search for an informative intrinsic bias measure: as discussed, a greater understanding of how embedding properties influence observed bias can inform new debiasing techniques. Translating the idea of MIDB into a debiasing scheme yields Multi-dimensional Information-weighted Soft Projection (MISP). In this debiasing procedure, we project all words towards the nullspace of the multi-dimensional gender subspace, proportional to our belief that each dimension actually encodes the latent idea of gender:

w_{deb} = w - \sum_{i=1}^{d} a_i \langle w, g_i \rangle \, g_i \qquad (4)

where w is the input embedding, w_deb is the debiased output embedding, and all other quantities are defined as in eq. (3).

As shown in Table 1, the GN-GloVe(w_a) embeddings are currently the top performers on the explicit gender words test set, as measured by either the error E = 0.149 or the distance d = 0.115. Applying MISP to the GN-GloVe(w_a) embeddings (denoted GN-GloVe(w_a).MISP), we achieve an error on the explicit gender words test set of E = 0.1107, a 26% error reduction over the previous best. The distance d between genders is reduced to d = 0.08744, a 21% reduction over the previous best. Successful concatenation suggests that this technique is distinct from, and independently useful alongside, techniques that seek to minimize the traditional direct bias (including GN-GloVe). This observation is consistent with the weak observed correlation between direct bias and MIDB on the experimental set of embeddings.
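A minimal sketch of the soft projection in eq. (4), reusing the basis and weights from the MIDB sketch above; again illustrative rather than the released implementation.

import numpy as np

def misp_debias(word_vec, basis, weights):
    """Eq. (4): subtract each gender-subspace component of w, attenuated by its weight a_i."""
    w = np.asarray(word_vec, dtype=float)
    correction = sum(a_i * np.dot(w, g_i) * g_i for g_i, a_i in zip(basis, weights))
    return w - correction

# Setting every a_i to 1 recovers the multi-dimensional hard debias (MHD) variant discussed
# below; using the PCA variance ratios as weights gives the soft (MISP) projection.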
Computing the intrinsic bias measures Clustering, Recoverability, GIPE, and SemBias on the newly created embedding space GN-GloVe(w_a).MISP (compared to the base GN-GloVe(w_a)), we observe a clustering v-score of 0.498 (previously 0.497; clustering size n = 1500), a recoverability accuracy of 0.992 (previously 0.993; the highest accuracy achieved by any of the four classification methods tested, with implementation details in the appendix), a GIPE of 0.1169 (previously 0.1173; computed with indirect bias threshold θ = 0.03 and number of nearest neighbours n = 100), and a SemBias score of 0.938 (previously 0.938; reported as the proportion of samples in the full test set to return the definitional analogy, where higher scores are better). The MISP method did not reduce bias by any of these measures, although this is not particularly surprising as it was designed to address the marked attribute effect (through MIDB). It is encouraging, however, that none of these bias measures increased. In other words, there is no expected trade-off between the reduced marked attribute error and any previous debiasing work that relied on these measures. The SemBias result informs us, for example, that MISP did not reintroduce any harmful biased analogies.
For reference, if we apply the analogous multi-dimensional hard debias method (i.e. equation (4) with all weights a_i set to 1), the output embeddings GN-GloVe(w_a).MHD do not successfully mitigate the marked attribute effect (E = 0.1501, d = 0.1603). This suggests that the soft nature of the projection is a key ingredient.
Furthermore, we provide some evidence that the information weighting of the soft projection specifically is a good ingredient, as follows. Recall that we are attenuating the components of each basis vector according to our belief in that vector as a good gender direction. The basis vectors are defined as the first d principal components, weighted by their corresponding variance explained; the first basis vector therefore receives the greatest weight, and so on. To test the significance of this decision, we define alternative debiased embeddings by applying MISP where the weights are reassigned to the "wrong" vectors (for d = 4, there are 23 alternative pairings; a sketch of their enumeration is given below). We observe that none of the 23 alternatives obtain an error E less than the "true" implementation of MISP. This suggests that weighting the components in order of information is a good ingredient. Values of E for the alternate embeddings can be found in the appendix. Model parameters for each case are made available in order to reproduce this argument on any extended version of the MAB test set.
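The enumeration of alternative weight-to-basis pairings can be written directly; a sketch (for d = 4, this yields the 23 non-identity permutations).

from itertools import permutations

def alternate_weightings(weights):
    """Yield every non-identity reassignment of weights a_1..a_d to basis vectors g_1..g_d,
    e.g. the permutation 1243 applies a_4 to g_3 and a_3 to g_4."""
    d = len(weights)
    identity = tuple(range(d))
    for perm in permutations(range(d)):
        if perm != identity:
            # position i (basis vector g_{i+1}) receives weight a_{perm[i]+1}
            yield perm, [weights[j] for j in perm]

# Each permuted weight list would be passed to misp_debias in place of the true weights,
# and the error E recomputed on the explicit gender words test set.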
Information weighting is an interesting idea because it could be applied to defined and learned gender subspaces alike. For instance, if the basis vectors of a gender subspace are taken as the iteratively learned linear decision boundaries (as in INLP), we could weight each dimension by the accuracy acc_i of classification on each iteration, as a_i = 2 acc_i − 1. In this way, dimensions receive weights proportional to their ability to predict gender information. When the accuracy reaches 0.5, no gender information remains, the learned decision boundary is meaningless, and the basis vector receives zero weight.
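A sketch of this accuracy-based weighting; it is trivially small, but it makes the intended sign convention explicit (chance-level accuracy maps to zero weight).

def accuracy_weights(accuracies):
    """Map per-iteration classifier accuracies to dimension weights: 0.5 (chance) -> 0, 1.0 -> 1."""
    return [2.0 * acc - 1.0 for acc in accuracies]

# e.g. accuracy_weights([1.0, 0.75, 0.5]) == [1.0, 0.5, 0.0]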
Finally, as with any debiasing method, we wish to verify that application of the method has not damaged the overall embedding quality. We assess the MISP embeddings on a handful of classic analogy and word semantic similarity benchmarks. The word similarity benchmarks measure how closely the word embeddings capture similarity between words compared to human annotation. We use datasets including RG (Rubenstein and Goodenough, 1965).
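Word similarity benchmarks of this kind are typically scored as the Spearman correlation between human ratings and cosine similarities of the word pairs; a minimal sketch (dataset loading is omitted and the function name is illustrative).

import numpy as np
from scipy.stats import spearmanr

def similarity_benchmark(emb, word_pairs, human_scores):
    """Spearman correlation between embedding cosine similarities and human similarity ratings."""
    cosines = []
    for w1, w2 in word_pairs:
        v1, v2 = emb[w1], emb[w2]
        cosines.append(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
    rho, _ = spearmanr(cosines, human_scores)
    return rho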

Conclusion
This paper highlights a new observation of gender bias in a downstream setting: marked attribute bias in natural language inference. Current models infer that "person" implies male, while "person" does not imply female; consequently, this inference is being baked into our models of natural language understanding. The effect was shown to persist on explicitly defined gender words and on latent gender-attribute carriers. Based on an assessment of the current debiasing landscape, none of the current debiasing methods satisfactorily mitigate the marked attribute error, and furthermore, none of the intrinsic bias measures are useful at predicting the marked attribute effect.
By noticing a more meaningful correlation with a newly identified intrinsic bias measure, we propose a new debiasing scheme: multi-dimensional information-weighted soft projection (MISP). This method introduces several concepts, including the use of a multi-dimensional defined gender subspace. Previously, the concept of a defined gender subspace always appeared as a single dimension. The iterative nullspace projection method implicitly uses higher learned dimensions; however, this requires learning a new decision boundary at every iteration, subject to the implementation of a training procedure. Furthermore, those learned dimensions were not used to define any bias metric; they were used strictly operationally within the debiasing procedure. MISP also introduces the idea of a soft or partial projection, where weights are informed by some measure of each dimension's ability to capture the intended latent concept of a gender direction. Both of these ideas could be further explored and extended to create new notions of indirect bias, which in turn could inform more sophisticated debiasing procedures.
Multi-dimensional information-weighted soft projection applied to GN-GloVe(w_a) produces new debiased embeddings that achieve the lowest error on the marked attribute bias test set, a 26% reduction over the previous best and a 45% reduction over the original undebiased embeddings. Error reduction on this test set is thought to encapsulate the overall goal of producing gender-aware but gender-unbiased embeddings. Therefore, this method and its composite ingredients warrant further investigation. Each of the marked attribute bias test sets is made available for further exploration and iteration on these ideas.


A Implicit gender carriers: Names

Please refer to Table 4. In short, the same effect is observed on names, especially on the original embeddings. A permutation test was used to check whether the stratification of names by gender was a non-random division according to the observed bias.

B Intrinsic bias measures and correlations
Please refer to Tables 5 and 6.

C Alternate weighted embeddings
As discussed in the main text, we compute the error E on the explicit gender words test set for alternate soft-weighted embeddings. The alternate embeddings are created by permuting the weights to be matched with the incorrect basis vectors. For example, the permutation denoted 1243 means that weight a_1 is applied to basis vector g_1, a_2 to g_2, a_4 to g_3, and a_3 to g_4. Results for the alternate permutations and their errors include: (permutation = 1243, E = 0.1574), (1324, E = 0.2331). (The exact word sets used to produce these results are available at https://github.com/hillary-dawkins/MAB; common names were taken from https://www.ssa.gov/oact/babynames/.)

Table 6: Intrinsic bias measures of interest on the experimental set of embeddings. There are two base (undebiased) embeddings, word2vec and GloVe. All other embedding spaces are obtained by applying a debiasing method, where each method found here is described in the main text.

Implementation notes:

DB and MIDB: The direct bias (DB) and the new multi-dimensional information-weighted direct bias (MIDB) are average measures over a specific (ideally gender-neutral) vocabulary V_t. V_t (n = 46960) is defined by taking the 50,000 most frequent words in the common vocabulary between word2vec and GloVe, filtering out punctuation and numbers, and removing the gender-specific word set V_s (n = 1622), defined as the union of the gender-specific word sets used in previous works (Bolukbasi et al., 2016; Zhao et al., 2018b). DB is defined as the projection onto a gender direction, here taken to be the she − he direction. For debiasing methods that promote she ≈ he, the DB is not well defined (although it can be computed numerically, it is unstable); we leave these cases as NA rather than report a spurious numerical value.

Clustering: The clustering experiment follows Gonen and Goldberg (2019) in taking the n ∈ [500, 1500] "most biased" words in the original embedding space (according to their projection on the she − he axis), and then applying k-means (k = 2) clustering to the words in the debiased embedding space. Bias is reported as either the clustering accuracy or the v-measure (only n = 1500 with v-measure is shown here).

Recoverability: Similarly, the dataset (n = 5000) is taken to be the most biased words in the original embedding space, where bias labels are assigned according to the projection on the gender direction (n = 2500 taken from each class). Several classifiers (SVM with a linear decision boundary, SVM with an RBF kernel, logistic regression, and a simple fully-connected 1-hidden-layer network) were trained on 20% of the dataset with balanced classes. Recoverability bias is reported as the accuracy of classification on the remaining test set (only logistic regression shown here).

SemBias: The SemBias analogy test set is available from Zhao et al. (2018b). The set contains n = 440 tuples of possible analogies (a, b): 1 definitional analogy (e.g. king, queen), 1 stereotypical analogy (e.g. doctor, nurse), and 2 other analogies (e.g. cup, plate). For every sample, the best analogy is selected as the one that maximizes cos(he − she, a − b). Bias is reported as the proportion of samples that return a definitional analogy, a stereotypical analogy, and an "other" analogy (only definitional and stereotypical shown here).

GIPE: The gender-based illicit proximity estimate (GIPE) (see Kumar et al., 2020).
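For reference, a sketch of the clustering and recoverability evaluations described above, using scikit-learn; the word selection and split details follow the description only approximately.

from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import v_measure_score
from sklearn.model_selection import train_test_split

def clustering_bias(debiased_vectors, original_gender_labels, seed=0):
    """k-means (k = 2) over the debiased vectors of the most-biased words;
    bias is the v-measure against the original gender labels."""
    clusters = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(debiased_vectors)
    return v_measure_score(original_gender_labels, clusters)

def recoverability_bias(debiased_vectors, original_gender_labels, seed=0):
    """Train a classifier on 20% of the most-biased words (balanced classes);
    bias is the accuracy of gender recovery on the remaining 80%."""
    X_train, X_test, y_train, y_test = train_test_split(
        debiased_vectors, original_gender_labels,
        train_size=0.2, stratify=original_gender_labels, random_state=seed)
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return clf.score(X_test, y_test)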