OSCaR: Orthogonal Subspace Correction and Rectification of Biases in Word Embeddings

Language representations are known to carry stereotypical biases and, as a result, lead to biased predictions in downstream tasks. While existing methods are effective at mitigating biases by linear projection, such methods are too aggressive: they not only remove bias, but also erase valuable information from word embeddings. We develop new measures for evaluating specific information retention that demonstrate the tradeoff between bias removal and information retention. To address this challenge, we propose OSCaR (Orthogonal Subspace Correction and Rectification), a bias-mitigating method that focuses on disentangling biased associations between concepts instead of removing concepts wholesale. Our experiments on gender biases show that OSCaR is a well-balanced approach that ensures that semantic information is retained in the embeddings and bias is also effectively mitigated.


Introduction
Word embeddings are efficient tools used extensively across natural language processing (NLP). These low-dimensional representations of words succinctly capture the semantic structure and syntax [12,14] of languages, and more recently, also the polysemous nature of words [15,7]. As such, word embeddings are essential to state-of-the-art results in NLP tasks. But they are also known to capture a significant amount of societal biases [1,6,5,17,21], such as biases related to gender, race, nationality, or religion, which get expressed not just intrinsically but also downstream in the tasks that they are used for [5,18,20]. Biases expressed in embeddings, based on these protected attributes, can lead to biased and incorrect decisions, so such biases should be mitigated before deployment. Existing methods either require expensive retraining of vectors [11], making them impractical, or remove information contained along an entire subspace representing a concept (such as gender or race) in groups of words [1] or all words [6,5,10] in the embedding space. Removing a (part of a) subspace can still leave residual bias [10], or, in the case of gender, undesirably alter the association of the word 'pregnant' with the words 'female' or 'mother'. Aggressive debiasing can thus erase valid gendered associations.
Biases in word embeddings are associations that are stereotypical and untrue. Hence, our goal is to carefully uncouple these (and only these) associations without affecting the rest of the embedding space. The residual associations need to be more carefully measured and preserved.
To address this issue, we need to balance information retention and bias removal. We propose a method which decouples specific biased associations in embeddings while not perturbing valid associations within each space. The aim is to develop a method which is low cost and usable, performs at least as well as the existing linear projection based methods [10], and also performs better on retaining the information that is relevant and not related to the bias. For concepts (captured by subspaces) identified to be 'wrongly' associated with biased connotations, our proposal is to orthogonalize or rectify them in space, thus reducing their explicit associations, with a technique we call OSCaR (orthogonal subspace correction and rectification). Word vectors outside these two directions or subspaces (but in their span) are stretched in a graded manner, and the components outside their span are untouched. We describe this operation in detail in Section 4.
For measuring bias, we use existing intrinsic [3,6] and extrinsic measures [5]. The extrinsic measure of NLI, as a probe of bias [5], demonstrates how bias percolates into downstream tasks, both static and contextualized. For instance, among the sentences: Premise: The doctor bought a bagel. Hypothesis 1: The woman bought a bagel. Hypothesis 2: The man bought a bagel.
Both hypotheses are neutral with respect to the premise. However, GloVe, using the decomposable attention model [13], deems that the premise entails hypothesis 1 with probability 84% and contradicts hypothesis 2 with probability 91%. Contextualized methods (e.g., ELMo, BERT, and RoBERTa) demonstrate similar patterns and perpetuate similar gender biases. It has also been demonstrated that this bias percolation can be mitigated by projecting word vectors along the subspace of bias [5]; these incorrect inferences see a significant reduction, implying a reduction in bias expressed.
But what happens if there is proper gender information being relayed by a sentence pair? The existing tasks and datasets for embedding quality do not directly evaluate that. For instance: Premise: The gentleman drove a car. Hypothesis 1: The woman drove a car. Hypothesis 2: The man drove a car.
Here the premise should contradict the first hypothesis and entail the second. We thus expand the use of the NLI task as a probe not just to measure the amount of bias expressed but also the amount of correct gender information (or other relevant attributes) expressed. This measurement allows us to balance bias retained against information lost in an explicit manner.
We demonstrate in Section 2 and Tables 3 and 4 that the ability to convey correctly gendered information is compromised when using projection-based methods for bias reduction, meaning that useful gender information is lost. This motivates the development of more refined geometric operations that achieve bias mitigation without entirely erasing features.
Our Contributions: (i) We develop OSCaR, a method based on orthogonalization of subspaces deemed to have no interdependence by social norms, such that there is minimal change to the embedding and loss of features is prevented. This method performs similarly to, and in some cases better than, the existing best-performing approach to bias mitigation.
(ii) We demonstrate how OSCaR is applicable in bias mitigation in both context free embeddings (GloVe) and contextualized embeddings (RoBERTa).
(iii) Further, we develop a combination of tests based on the task of natural language inference which help ascertain that significant loss of a feature has not been incurred at the cost of bias mitigation.

Considerations in Debiasing Word Embeddings
Maximizing bias removal. It has been established that social biases creep into word embeddings and affect the tasks they are put towards. A few different approaches try to mitigate these biases. One class of techniques, exemplified by Zhao et al. [22], retrains the embedding from scratch with information meant to remove bias. These can be computationally expensive, and are mostly orthogonal to the techniques we focus on. In particular, our main focus will be on techniques [1,6,16,5] which directly modify the word vector embeddings without retraining them; we detail them in Section 5. These mostly build around identifying a bias subspace, and projecting all words along that subspace to remove its effect. They have mainly been evaluated on removing bias, measured structurally [3] or on downstream tasks [5].
Retaining embedding information. Debiasing should not strip away the ability to correctly distinguish between concepts in the embedding space. Correctly gendered words such as 'man', 'woman', 'he' and 'she' encode information which enriches both the intrinsic quality of word embeddings (via similarity-based measures) and extrinsic task performance (such as natural language inference or pronoun resolution). Further, there are other atypically gendered words such as 'pregnant' or 'testosterone' which are more strongly associated with one gender in a meaningful way. Retaining such associations enriches language understanding and is essential. While there are tasks which evaluate the amount of bias reduced [3,5], and numerous tasks which evaluate performance using word embeddings as a general measure of semantic information contained, no tasks specifically evaluate retention of a specific concept like gender.
Differentiability. While some debiasing efforts have focused on non-contextualized embeddings (e.g., GloVe, Word2Vec), many natural language processing tasks have moved to contextualized ones (e.g., ELMo, BERT, RoBERTa). Recent methods [5] have shown how debiasing can be applied to both scenarios. The key insight is that contextualized embeddings, although context-dependent throughout the network, are context-independent at the input layer. Thus, debiasing can be effective if it is (a) applied to this first layer, and (b) maintained in the downstream training step where the embeddings are subject to gradient updates. Note that maintaining debiasing during training requires it to be differentiable (as linear projection [6] is).

New Measures of Information Preservation
We provide two new approaches to evaluate specific information retained after a modification (e.g., debiasing) on embeddings. One is intrinsic which measures structure in the embedding itself; the other is extrinsic, which measures the effectiveness on tasks which use the modified embeddings.
New intrinsic measure (WEAT*). This extends Caliskan et al.'s [3] WEAT. Both use gendered word sets as a baseline for the embedding's representation of gender, X : { man, male, boy, brother, him, his, son } and Y : { woman, female, girl, sister, her, hers, daughter }, two more sets A, B, and a comparison function s(X, Y, A, B) = Σ_{x∈X} s(x, A, B) − Σ_{y∈Y} s(y, A, B), where s(w, A, B) = mean_{a∈A} cos(a, w) − mean_{b∈B} cos(b, w), and cos(a, b) is the cosine similarity between vectors a and b. This score is normalized by stddev_{w∈X∪Y} s(w, A, B). In WEAT, A and B are words that should not be gendered but stereotypically are (e.g., A contains male-biased occupations like doctor and lawyer, and B female-associated ones like nurse and secretary), and the closer s(X, Y, A, B) is to 0 the better.
In WEAT*, A and B are definitionally gendered (A male and B female), so we want the score s(X, Y, A, B) to be large. In Section 6 we use as A, B: he-she; definitionally gendered words (e.g., father, actor and mother, actress); and gendered names (e.g., james, ryan and emma, sophia); all listed in Supplement C.7.
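The scoring function above can be sketched directly in numpy. This is an illustrative implementation of the formulas as stated in the text, with word sets represented as lists of vectors; it is not the authors' released code.

```python
import numpy as np

def cos_sim(a, b):
    # cosine similarity between vectors a and b
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def s_word(w, A, B):
    # s(w, A, B) = mean_{a in A} cos(a, w) - mean_{b in B} cos(b, w)
    return (np.mean([cos_sim(a, w) for a in A])
            - np.mean([cos_sim(b, w) for b in B]))

def weat(X, Y, A, B):
    # s(X, Y, A, B) = sum_x s(x, A, B) - sum_y s(y, A, B),
    # normalized by the stddev of s(w, A, B) over w in X union Y
    sx = sum(s_word(x, A, B) for x in X)
    sy = sum(s_word(y, A, B) for y in Y)
    norm = np.std([s_word(w, A, B) for w in X + Y])
    return (sx - sy) / norm
```

For WEAT the score should be near 0; for WEAT* (definitionally gendered A, B) it should be large.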
New extrinsic measure (SIRT). Maintaining high performance on SNLI test sets after debiasing need not imply the retention of useful and relevant gender information: only a tiny fraction of sentences in SNLI actually evaluate coherent gender relations in the sentence pairs. Similarly, while the task of coreference resolution can involve resolving gendered information, it is more complex, with many other factors involved. For instance, there can be multiple people of the same gender in a paragraph, with the task being to identify the person referred to by a given pronoun or noun. This is a much more complex evaluation than checking that correct gender is propagated. We propose here a simpler task which eliminates noise from other factors and only measures whether correctly gendered information is passed on.
We extend the textual entailment bias evaluations [5] toward evaluating correctly gendered information; we call this SIRT: sentence inference retention test. These tasks have the advantage of being sentence based and thus use context much more than word-based tests such as WEAT, enabling us to evaluate contextualized embeddings such as RoBERTa. Unlike the original templates [5], where the ideal label is neutral, these sentences are constructed so that the correct predictions are entailment or contradiction.
For instance, in an entailment task a sample sentence pair that should be entailed would have the subject as words denoting the same gender in both the premise and the hypothesis: Premise: The lady bought a bagel. Hypothesis: The woman bought a bagel.
We should note here that not all same-gendered words can be used interchangeably in the premise and hypothesis. For instance, the word 'mother' in the premise and 'woman' in the hypothesis should be entailed, but the opposite should not be, and would thus not be in our set of sentence pairs for this task. We use 12 words (6 male and 6 female) in the premise and 4 ('man', 'male', 'woman' and 'female') in the hypothesis, always using a hypothesis word of the same gender as the premise word, along with 27 verbs and 184 objects. The verbs are sorted into categories (e.g., commerce verbs like bought or sold; interaction verbs like spoke) to appropriately match objects and maximize coherent templates (avoiding templates like 'The man ate a car').
In a contradiction task a sentence pair that should be contradicted would have opposite gendered words in the premise and hypothesis: Premise: The lady bought a bagel. Hypothesis: The man bought a bagel.
Unlike the entailment task, here all words of the opposite gender can be used interchangeably in the premise and hypothesis sentences, as all combinations should be contradicted. We use 16 words (8 male and 8 female) in the premise and any one from the opposite category in the hypothesis. We use the same 27 verbs and 184 objects as above, matched coherently, to form a list of sentence pairs whose correct prediction in a textual entailment task is contradiction. Across both types of tests we have over 400,000 sentence pairs.
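The template construction above can be sketched as follows. The word, verb, and object lists here are tiny illustrative stand-ins for the full lists in the supplement, and the pairing rules are simplified.

```python
from itertools import product

# Illustrative subsets only; the paper uses 12/16 premise words,
# 27 verbs, and 184 objects (listed in its supplement).
ENTAIL_PAIRS = [("lady", "woman"), ("gentleman", "man")]  # premise -> same-gender hypothesis
MALE = ["man", "gentleman"]
FEMALE = ["woman", "lady"]
VERBS_OBJECTS = [("bought", "bagel"), ("sold", "car")]    # verbs matched to coherent objects

def sent(subject, verb, obj):
    return f"The {subject} {verb} a {obj}."

def entailment_pairs():
    # premise/hypothesis pairs whose gold label should be 'entail'
    return [(sent(p, v, o), sent(h, v, o))
            for (p, h), (v, o) in product(ENTAIL_PAIRS, VERBS_OBJECTS)]

def contradiction_pairs():
    # opposite-gender subjects in either order: gold label 'contradict'
    pairs = [(m, f) for m, f in product(MALE, FEMALE)]
    pairs += [(f, m) for f, m in product(FEMALE, MALE)]
    return [(sent(p, v, o), sent(h, v, o))
            for (p, h), (v, o) in product(pairs, VERBS_OBJECTS)]
```

With the full lists and verb-object category matching, this cross product yields the 400,000+ sentence pairs reported.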
On each task, with N being the total number of sentence pairs per task, let P_e, P_c, and P_n be the predicted probabilities of the entailment, contradiction, and neutral classifications, respectively. Following Dev and Phillips [5], we define the following two metrics for the amount of valid gendered information contained: Net Entail (or Contradict) = (1/N) Σ P_e (or (1/N) Σ P_c), and Fraction Entail (or Contradict) = (number of entail (or contradict) classifications) / N. The higher the values on each of these metrics, the more valid gendered information is contained by the embedding. As we would expect, GloVe does well on our SIRT test, with net entail and fraction entail at 0.810 and 0.967, and net contradict and fraction contradict at 0.840 and 0.888. Debiasing the embedding by linear projection reduces these scores by about 10% on each metric, as we see in Table 4.
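Given model softmax outputs over the three labels, the two metrics reduce to a mean probability and an argmax fraction. The label ordering and array layout below are assumptions for illustration.

```python
import numpy as np

LABELS = ("entail", "contradict", "neutral")  # assumed ordering

def sirt_scores(probs, gold="entail"):
    """probs: (N, 3) softmax outputs, columns ordered as LABELS.
    Returns (net, fraction) for the gold label:
    net = average gold-label probability over the N pairs;
    fraction = share of pairs argmax-classified as the gold label."""
    probs = np.asarray(probs)
    g = LABELS.index(gold)
    net = float(probs[:, g].mean())
    fraction = float((probs.argmax(axis=1) == g).mean())
    return net, fraction
```

Higher values of both scores indicate more valid gendered information retained.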

Orthogonal Subspace Correction and Rectification (OSCaR)
We describe a new geometric operation that is an alternative to linear projection-based ones. This operator applies a graded rotation on the embedding space; it rectifies two identified directions (e.g., gender and occupations) which should ideally be independent of each other, so they become orthogonal, and the remaining parts of their span are rotated at a varying rate so that the operation is differentiable.
This method requires us to first identify two subspaces that should not have interdependence in the embedding space. This can be task specific; e.g., for resume sorting, the subspaces of gender and occupation should be independent of each other. Since this interdependence has been observed [1,3,6,5] in word embeddings, we use these as an exemplar in this paper. We determine 1-dimensional subspaces capturing gender (V1) and occupations (V2) from words listed in the Supplement. Given the two subspaces V1 and V2 which we seek to make orthogonal, we identify the two directions v1 ∈ V1 and v2 ∈ V2, and restrict to the subspace S = span(v1, v2). In particular, we can identify a basis for S using v1 and v2' = (v2 − v1⟨v1, v2⟩) / ‖v2 − v1⟨v1, v2⟩‖ (these two, v1 and v2', are by construction orthogonal). We can then restrict any vector x ∈ R^d to S as two coordinates π_S(x) = (⟨v1, x⟩, ⟨v2', x⟩). We adjust these two coordinates, and leave all d − 2 other orthogonal components fixed.
We now restrict our attention to within the subspace S. Algorithmically we do this by defining a d × d rotation matrix U. The first two rows are v1 and v2'. The next d − 2 rows are any set u3, u4, ..., u_d which completes the orthogonal basis with v1 and v2'. We rotate each data vector x by U (as Ux), manipulate the first 2 coordinates (x1, x2) to f(x1, x2), described below, reattach the last d − 2 coordinates, and rotate the space back by U^T. Next we devise the function f which is applied to each data vector x ∈ S (we can now assume x is two dimensional). See the illustration in Figure 1.
The function f should be the identity map for v1, so f(v1) = v1, and should rotate v2 to the orthogonal direction, so f(v2) = v2'. For every other vector, it should provide a smooth partial application of this rotation so that f is continuous.
In particular, for each data point x ∈ S we determine an angle θ_x and apply the rotation matrix R(θ_x) = [cos θ_x, −sin θ_x; sin θ_x, cos θ_x]. Towards defining θ_x, we calculate two measurements: φ_1 = arccos⟨v1, x/‖x‖⟩ and d_2 = ⟨v2', x/‖x‖⟩. A case analysis on φ_1 and d_2 then determines θ_x: points along v2 receive the full rotation taking v2 to v2', points along v1 receive none, and points in between receive a graded fraction. This effectively orthogonalizes the components of all points along the subspaces V1 and V2. Points lying in or near V2 are moved the most, while the rest of the embedding space undergoes a graded stretching. The information contained outside the subspace S remains the same, thus preserving most of the inherent structure and content of the original embeddings.
Since this is a differentiable operation applied to all points in the space, we can extend the method to contextualized embeddings: it can be part of the model specification and integrated with the gradient-based fine-tuning step. Further, as a post-processing step applied to an embedding space, its computational cost is relatively low, and it is easily adapted to a given task for which specific subspaces are desired to be independent.
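The operation above can be sketched in numpy, restricted to the plane S. The grading function `graded_theta` here is one plausible linear interpolation of the angle (identity at v1, full correction at v2, tapering back to zero at −v1); the paper's exact case analysis for θ_x may differ in its details, and the function names are illustrative.

```python
import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

def graded_theta(v1, v2):
    """One plausible grading of the rotation angle: zero at v1, the
    full correction (rotating v2 onto v2') at v2, zero again at -v1."""
    theta2 = np.arccos(np.clip(np.dot(unit(v1), unit(v2)), -1.0, 1.0))
    delta = np.pi / 2 - theta2              # extra rotation needed at v2
    def theta_fn(phi):
        phi = abs(phi)
        if phi <= theta2:
            return delta * phi / theta2
        return delta * (np.pi - phi) / (np.pi - theta2)
    return theta_fn

def oscar_correct(x, v1, v2, theta_fn):
    """Apply the graded rotation to x within S = span(v1, v2);
    components of x outside S are left untouched."""
    v1 = unit(v1)
    v2p = unit(v2 - v1 * np.dot(v1, v2))      # Gram-Schmidt: v2' orthogonal to v1
    x1, x2 = np.dot(v1, x), np.dot(v2p, x)    # coordinates of x in S
    rest = x - x1 * v1 - x2 * v2p             # part of x outside S, kept fixed
    t = theta_fn(np.arctan2(x2, x1))          # angle of x from v1 picks the rotation
    c, s = np.cos(t), np.sin(t)
    y1, y2 = c * x1 - s * x2, s * x1 + c * x2 # rotate the in-plane coordinates
    return rest + y1 * v1 + y2 * v2p
```

As a sanity check, applying this to v2 itself yields a vector orthogonal to v1, while v1 is left unchanged.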

Experimental Methods
Debiasing methods. Gender bias and its reduction have been studied on GloVe embeddings [1,6,5,10] using different metrics [3,10,6,5]. Each of these methods is projection based, and begins by identifying a subspace represented as a vector v; in all of our experiments we determine v using the vector between the words 'he' and 'she'. Some methods use an auxiliary set of definitionally gendered words G (see Supplement C.3 and C.4) which are treated separately.
Linear Projection (LP): This is the simplest method [6]. For every embedded word vector w, it projects w along v to remove that component, as w' = w − v⟨w, v⟩ (for unit v). Afterwards the d-dimensional vectors lie in a (d − 1)-dimensional subspace, but retain their d coordinates; the subspace spanned by v is removed. Lauscher et al. [10] show this method [6] reduces the most bias and has the least residual bias.
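The projection step is a one-liner in numpy; a minimal sketch:

```python
import numpy as np

def linear_project(w, v):
    """Remove from w its component along the bias direction v:
    w' = w - v <w, v>, with v normalized to unit length."""
    v = v / np.linalg.norm(v)
    return w - v * np.dot(w, v)
```

The result is always orthogonal to v, so no component along the bias direction survives.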
Hard Debiasing (HD): This original method by Bolukbasi et al. [1] begins with the above projection operation, first applying it to all words w not in G. Then, using an identified subset G' ⊂ G whose words come in pairs (e.g., man, woman), it projects these words as well, but then performs an "equalize" operation. This operation ensures that after projection the words in each pair are the same distance apart as before projection, but no longer within the subspace defined by v. As we will observe, this equalization retains certain gender information in the embedding (compared to projection), but has trouble generalizing to words that carry gender connotation outside of G (such as names). The final location of other words can also retain residual bias [9,10].
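A sketch of the equalize step, following Bolukbasi et al.'s construction for a single definitional pair and assuming roughly unit-norm embeddings; handling of larger equality sets and other details are simplified here.

```python
import numpy as np

def project_on(w, v):
    # component of w along the (normalized) direction v
    v = v / np.linalg.norm(v)
    return v * np.dot(w, v)

def equalize(w1, w2, v):
    """Equalize a definitional pair (e.g., man/woman) about the bias
    direction v: both words keep the shared gender-neutral part nu and
    are placed symmetrically along v, preserving their separation."""
    mu = (w1 + w2) / 2
    nu = mu - project_on(mu, v)                 # shared component off the bias direction
    scale = np.sqrt(max(0.0, 1 - np.dot(nu, nu)))
    out = []
    for w in (w1, w2):
        wb = project_on(w, v) - project_on(mu, v)
        out.append(nu + scale * wb / np.linalg.norm(wb))
    return out
```

After equalizing, the two words are unit norm, share the same off-bias component, and sit symmetrically along v.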

Iterative Nullspace Projection (INLP): This method [16] begins with LP using v on all words except the set G. It then automatically identifies a second set B of most biased words: the most extreme words along the direction v (or −v) [16]. After the operation, it identifies residual bias [16] by building a linear classifier on B. The normal of this classifier is then chosen as the next direction v1 on which to apply the LP operation, removing another subspace. It continues for 35 iterations, finding v2 and so on, until it cannot identify significant residual bias.
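The repeated projection at INLP's core can be sketched as below, with the classifier training omitted: here the directions are given as inputs, whereas INLP learns each one as the normal of a classifier fit on the residual bias.

```python
import numpy as np

def nullspace_projection(W, directions):
    """Repeatedly project the rows of W (an n x d matrix of word
    vectors) onto the nullspace of each direction in turn. In INLP,
    each direction would be the normal of a linear bias classifier
    trained on the residual bias after the previous projections."""
    for v in directions:
        v = v / np.linalg.norm(v)
        P = np.eye(W.shape[1]) - np.outer(v, v)  # projector onto v's nullspace
        W = W @ P
    return W
```

Each iteration removes one more dimension, which is why repeated projections can erase more information than a single LP step.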
OSCaR: We also apply OSCaR, using 'he-she' as v1, and the subspace defined by an occupation list (see Supplement C.2) as v2. This occupation subspace is determined by the first principal component of the word vectors in the list. Our code for reproducing experiments will be released upon publication.
Debiasing contextualized embeddings. The operations above are described for a non-contextualized embedding; we use one of the largest such embeddings, GloVe (trained on the 840B-token Common Crawl). They can be applied to contextualized embeddings as well; we use RoBERTa (the base version released by April 2020) [19], a widely adopted state-of-the-art architecture. As advocated by Dev et al. [5], for RoBERTa we only apply the operation to the first layer, which carries no context. Technically this is a subword embedding, but 'he', 'she', and all words in G are stored as full subwords, so there is no ambiguity. LP and OSCaR are differentiable, so these modifications can be maintained under the gradient-based fine-tuning adjustment. HD and INLP are not applied to all words and are more intricate, so we do not have a clearly defined gradient; for these, we apply the operation once and do not repeat it in the fine-tuning step.
Extrinsic measurement through textual entailment. For training the embeddings on the task of textual entailment, we use the SNLI [2] dataset. While MultiNLI contains more complex and diverse sentences than SNLI, making it a good contender for training entailment tasks, we observed that MultiNLI also carries significantly more gender bias. When trained on MultiNLI, our model returns a significantly higher proportion of biased outcomes, as seen in the Supplement (Table S1). Over 90% of sentence pairs that should have neutral associations are classified incorrectly using both GloVe and RoBERTa when trained on MultiNLI, whereas SNLI yields less than 70% biased and incorrect classifications on the same dataset. Since we focus on representational bias in word embeddings and not on bias in the datasets used in different machine learning tasks, using MultiNLI here would interfere with our focus. Moreover, MultiNLI has more complex sentences than SNLI, which means a larger probability of confounding noise; this would in turn weaken the implication that an incorrect inference expresses bias.
Keeping Test/Train separate. Since two of the methods below (Hard Debiasing and Iterative Nullspace Projection) use generated lists of gendered and biased words as an essential step, we filter words carefully to avoid test-train overlap. For words in Supplement C.6 used in the WEAT test, we filter out all words from the training step involved in HD and INLP. We also filter out words used for template generation in our NLI tests, listed in Supplement C.5. A larger list in Supplement C.7 has 59 words of each gender, and removing all of them would hamper training. We use the templates proposed for using textual entailment as a bias measure [5], restructuring the sentence pairs for SIRT. Since we use only the words 'he' and 'she' to determine the gender subspace, we can fairly use all other gendered words listed in the Supplement to generate templates. For occupations, since we use a subset of occupation words to determine the occupation subspace for the OSCaR operation (Supplement C.2), we keep disjoint lists for testing with WEAT (Supplement C.6) and NLI templates (Supplement C.5).

Experimental Results
Intrinsic Measures for Measurement of Bias. Table 1 shows the results on WEAT [3] between groups of attribute words (we use the he-she vector) and target words (we use 3 sets of stereotypical types of words: occupations, work vs. home, and math vs. art). It also shows results for the Embedding Coherence Test (ECT) [6], showing the association of vectors from X (gendered words) with a list of attribute neutral words Y (occupation) using the Spearman Coefficient. The score ranges in [−1, 1] with 1 being ideal. OSCaR performs similarly to the other debiasing methods in removing bias, always significantly reducing the measured intrinsic bias.
Extrinsic Measures for Measurement of Bias. Table 2 shows the textual entailment scores for GloVe and RoBERTa before and after they have been debiased by the various approaches. All debiasing methods, including OSCaR, increase the neutrality score (recall these sentences should be objectively neutral) without significantly decreasing the performance (Dev/Test scores); HD on GloVe is the only one where these scores drop more than 0.015. While INLP appears the most successful at debiasing GloVe, OSCaR and LP are clearly the most successful on RoBERTa (with LP slightly better); this is likely because they are differentiable and can be integrated into fine-tuning.
Intrinsic Metric for Evaluating Gendered Information Preserved. Table 3 demonstrates, using WEAT*, how much correctly gendered information is retained by an embedding after debiasing by the different methods; larger scores are better. To calibrate, instead of probing with gendered words or names, we first used random word sets: over 1000 trials, the average WEAT or WEAT* score was 0.001 with a standard deviation of 0.33. So almost all of the reported results are significant; the exception is LP with he-she, since LP essentially aligns these two words exactly, and the resulting vector after re-normalizing likely acts randomly.
The first column uses sets A = {he} and B ={she} to represent gender, the same words used to define the debias direction. The second column uses two sets of 59 definitionally gendered words as A and B, and the third uses 700 gendered names for each set A and B (both in Supplement C.7). The baseline row (row 1, not debiased GloVe) shows these methods provide similar WEAT * scores.
The LP method retains the least information on tests based on he-she or other gendered words, followed closely by INLP. The single projection of LP retains more information when compared against names, whereas INLP's multiple projections appear to remove some of this information as well. HD also performs well for he-she and other gendered words; we suspect the equalize step allows the information to be retained for these types of words. However, since names are not pre-specified to be equalized, HD loses much more information when compared against gendered names.
Finally, OSCaR performs at or near the best in all evaluations. HD (with its equalize step) performs better for he-she and other gendered words, but this does not generalize to words not specifically set to be adjusted, as observed in the gendered names evaluation. In fact, since names do not come in explicit pairs (like uncle-aunt), this equalize step is not even possible for them.
Extrinsic Metric for Evaluating Gendered Information Preserved. For the new SIRT task we observe the performance using GloVe and RoBERTa in Table 4. GloVe without any debiasing, which is our baseline, preserves significant correctly gendered information, as seen in the first row. The fraction of correct entails (male premise : male hypothesis, female premise : female hypothesis) and contradicts (male premise : female hypothesis, female premise : male hypothesis) are both high. We see a fall in these scores for all projection-based methods (LP, HD and INLP), with the uniform projection step (LP) doing best among the three. OSCaR does better than all three methods on all four scores measuring valid entailments and contradictions.
Using RoBERTa, we see a similar pattern: OSCaR outperforms the projection-based methods LP and HD in the retention of valid gendered information. It also does better than INLP on the entailment tasks; however, INLP does better on the contradiction tasks than OSCaR. But recall that INLP showed almost no debiasing effect on RoBERTa (Table 2); we suspect this is because most of the effect of the projection is erased in the fine-tuning step (INLP is not obviously differentiable, so it is applied only once before fine-tuning), and overall INLP has minimal effect when used with contextualized embeddings.

Broader Impacts
Biases in language representations are pernicious, with wide-ranging detrimental effects in machine learning applications. The propagation of undesirable and stereotypical associations learned from data into decisions made by language models maintains the vicious cycle of biases. Combating biases before deploying representations is thus vital, but it poses its own challenges. Word embeddings capture a lot of information implicitly in relatively few dimensions; these implicit associations are what make them state of the art at tackling different language modeling tasks. Breaking down these associations for bias rectification thus has to be done in a calculated manner, so as not to break down the structure of the embeddings. OSCaR's surgical manner of rectifying associations integrates both these aspects, giving a more holistic approach to making word embeddings usable. Moreover, it is computationally lightweight and differentiable, making it simple to apply adaptively at the time of analysis without extensive retraining.
We envision that this method can be used for many different types of biased associations (age - ability, religion - virtue perception, etc.). Since it only decouples specific associations, few other components of these features are changed or lost.
Moreover, we believe these techniques should extend in a straight-forward way to other distributed, vectorized representations which can exhibit biases, such as in images, graphs, or spatial analysis.

A Discussion
Debiasing word embeddings is a nuanced requirement for making embeddings suitable for different tasks. No single operation works on all embeddings and identifies and significantly reduces all the different biases they contain. More importantly, this ability is not uniformly beneficial: not all tasks require bias mitigation to the same degree, and debiasing need not apply equally to all word groups. Disassociating only the biased associations, in a continuous manner, is what needs to be achieved. Hence, the ability to debias embeddings specifically for a given scenario or task, with respect to specific biases, is extremely advantageous.
Our method of orthogonal correction is easy to adapt to different types of biased associations, such as the good-bad notions attached to different races [3,4] or religions [5]. Creating metrics is harder in these settings, with fewer words from which to build templates or tests, making comprehensive evaluation of bias reduction or information retention more difficult for these types of biases. We leave that for future exploration.

B.1 SNLI versus MultiNLI
We compare here the amount of gender bias contained by the templates in SNLI and MultiNLI. While MultiNLI has more complex sentences, it also contains more bias, as seen in Table S1, using the standard metrics for neutrality defined by an earlier paper [5]. Since our work attempts to understand and mitigate the bias in language representations and not the bias in data used in various tasks, we restrict our experiments to SNLI, which expresses significantly less bias than MultiNLI.

B.2 Standard Tests for Word Embedding Quality
Non-contextual word embeddings like GloVe or word2vec are characterized by their ability to capture semantic information, reflected in valid similarities between word pairs and the ability to complete analogies. These properties reflect the quality of the word embedding and should not be diminished after debiasing. We ascertain this in Table S2, using a standard word similarity test [8] and an analogy test [12] which measure these properties across our baseline GloVe model and all the debiased GloVe models. All of them perform similarly to the baseline GloVe model, indicating that the structure of the embedding has been preserved. While this helps as a preliminary examination of information retention in the embeddings, these tests do not contain a large number of sensitive gender comparisons or word pairs. It is thus not sufficient to claim from them that bias has been removed or that valid gender associations have been retained, motivating the methods described in this paper and in other work [6,5,1,3].
Table S2: Intrinsic Standard Information Tests: These standard tests evaluate the amount of overall coherent associations in word embeddings. WSim is a word similarity test and Google Analogy is a set of analogy tests.

B.3 Gendered Names and Debiasing
All these methods primarily use the words 'he' and 'she' to determine the gender subspace, though both hard debiasing and INLP also use other pre-determined sets of gendered words to guide both debiasing and the retention of some gendered information.
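The shared subspace-determination and projection step can be sketched as follows (our illustration under simplifying assumptions; the actual methods differ in how they select word pairs, how many directions they remove, and which words they debias):

```python
import numpy as np

def gender_direction(emb, pairs=(("he", "she"),)):
    # Estimate a one-dimensional gender subspace as the mean of
    # normalized difference vectors over gendered word pairs
    diffs = [emb[m] - emb[f] for m, f in pairs]
    diffs = [d / np.linalg.norm(d) for d in diffs]
    v = np.mean(diffs, axis=0)
    return v / np.linalg.norm(v)

def project_out(vec, direction):
    # Remove the component of vec along the unit-norm bias direction
    return vec - (vec @ direction) * direction
```

After `project_out`, a word vector has zero component along the estimated direction, which is exactly why projection-based methods also erase valid gendered associations unless retention is handled separately.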
Gendered names have also been shown to help capture the gender subspace [6]. In Table S3 we compare the retention of correctly gendered information when debiasing is done using projection versus correction. We represent simple projection, hard debiasing, and INLP by simple projection, since it is the core debiasing step used by all three. Both rows have been debiased using the gender subspace determined from the most common gendered names in Wikipedia (listed in the Supplement); correction additionally uses the same occupation subspace as in Table 3.
Each value is again a WEAT* calculation where the two sets of words (X and Y) being compared against are kept the same as in Table 3. The first column of this table thus represents the association with the subspace determined by 'he - she', and correction results in a higher association, implying that more correctly gendered information is retained. We see a similar pattern in columns 2 and 3, which represent other gendered words and gendered names excluding the names used to determine the debiasing subspace. That correction fares significantly better even among other gendered names reflects its higher precision in removing and retaining information.

Table S3: Correctly gendered information contained by embeddings. Larger scores are better, as they imply more correctly gendered information is expressed.
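For reference, the standard WEAT effect size (the WEAT* used in our tables is a variant; the sketch below is the standard formulation, with our own function names) compares the mean association of two target word sets X and Y with two attribute sets A and B:

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def association(w, A, B, emb):
    # s(w, A, B): mean cosine with attribute set A minus mean cosine with B
    return (np.mean([cosine(emb[w], emb[a]) for a in A])
            - np.mean([cosine(emb[w], emb[b]) for b in B]))

def weat_effect_size(X, Y, A, B, emb):
    # Normalized difference of mean associations between target sets X and Y
    sx = [association(x, A, B, emb) for x in X]
    sy = [association(y, A, B, emb) for y in Y]
    return (np.mean(sx) - np.mean(sy)) / np.std(sx + sy, ddof=1)
```

A positive effect size means X is more associated with A than Y is; when measuring retention rather than bias, higher association with the correct attribute set is the desired outcome.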

B.4 TPR-Gap-RMS
In Table S4, we show the TPR-Gap-RMS metric as used in [16], an aggregate measure of gender bias across professions; lower scores imply less gender bias. We refer readers to [16] for the detailed definition. We follow the same experimental steps, except that we apply the different debiasing algorithms to the input word embeddings (instead of the CLS token). This allows us to compare debiasing methods on static use of contextualized embeddings (i.e., without fine-tuning). We see that RoBERTa and RoBERTa HD perform on par, while linear projection, iterative projection, and our method OSCaR perform close to each other.

Equalized Words
These gendered words are paired (one male, one female) and are equalized by the operation. Here is the filtered version of the list used in our experiments, as per our description in Section 5 of the test/train word list split. Each pair in this list is "equalized".

C.4 Word Lists for INLP
The Gendered Word List (G) for INLP consists of 1425 words found under https://github.com/Shaul1321/nullspace_projection/blob/master/data/lists/ as the list gender_specific_full.json. This list has been filtered of the words used for generating templates (Supplement C.5) and for WEAT (Supplement C.6).
More details about their word lists and code is available at: https://github.com/Shaul1321/nullspace_projection.

C.5 Words Lists for Template Generation
We keep most word lists for template generation the same as in the paper. For occupations, we remove the words that we use for determining the occupation subspace for OSCaR, which creates a test/train split for our experiments. We also modify the gendered word lists to create the word lists used in the premise/hypothesis for the entail and contradict templates for SIRT.