Don’t Forget About Pronouns: Removing Gender Bias in Language Models Without Losing Factual Gender Information

The representations in large language models contain multiple types of gender information. We focus on two types of such signals in English texts: factual gender information, which is a grammatical or semantic property, and gender bias, which is the correlation between a word and a specific gender. With probing, we can disentangle the model's embeddings and identify the components encoding each type of information. We aim to diminish the stereotypical bias in the representations while preserving the factual gender signal. Our filtering method shows that it is possible to decrease the bias of gender-neutral profession names without significant deterioration of language modeling capabilities. The findings can be applied to language generation to mitigate reliance on stereotypes while preserving gender agreement in coreferences.


Introduction
Neural networks are successfully applied in natural language processing. While they achieve state-of-the-art results on various tasks, their decision process is not yet fully explained (Lipton, 2018). It is often the case that neural networks base their predictions on spurious correlations learned from large uncurated datasets. An example of such a spurious tendency is gender bias. Even the state-of-the-art models tend to counterfactually associate some words with a specific gender (Zhao et al., 2018a; Stanovsky et al., 2019). The representations of profession names tend to be closely connected with the stereotypical gender of their holders. When the model encounters the word "nurse", it will tend to use female pronouns ("she", "her") when referring to this person in the generated text. This tendency is reversed for words such as "doctor", "professor", or "programmer", which are male-biased.

Figure 1: A schema presenting the distinction between the gender bias of nouns and the factual (i.e., grammatical) gender of pronouns. We want to transform the representations to mitigate the former and preserve the latter.
This means that the neural model is not reliable enough to be applied in high-stakes language processing tasks such as connecting job offers to applicants' CVs (De-Arteaga et al., 2019). If the underlying model were biased, high-paying jobs, which are stereotypically associated with men, could become inaccessible to female candidates. When we decide to use language models for that purpose, the key challenge is to ensure that their predictions are fair.
Recent works on the topic aimed to diminish the role of gender bias by training the network on examples of unbiased text (de Vassimon Manela et al., 2021) or by transforming the representations of the neural networks post-hoc, i.e., without additional training (Bolukbasi et al., 2016). However, those works relied on the notion that to de-bias a representation, most of the gender signal needs to be eliminated. This is not always desirable: pronouns and a few other words (e.g., "king"–"queen", "boy"–"girl") carry factual information about gender. A few works identified gendered words and exempted them from de-biasing (Zhao et al., 2018b; Kaneko and Bollegala, 2019). In contrast to these approaches, we focus on contextual word embeddings. In contextual representations, we want to preserve the factual gender information for gender-neutral words when it is indicated by context, e.g., by a personal pronoun. This sort of information needs to be maintained in the representations. In language modeling, the network needs to be consistent about the gender of a person if it was revealed earlier in the text. The model's ability to encode factual gender information is crucial for that purpose.
We propose a method for disentangling the factual gender information and the gender bias encoded in the representations. We hypothesise that semantic gender information (from pronouns) is encoded in the network distinctly from the stereotypical bias of gender-neutral words (Figure 1). We apply an orthogonal probe, which has proved useful, e.g., in separating lexical and syntactic information encoded in a neural model (Limisiewicz and Mareček, 2021). Then we filter out the bias subspace from the embedding space and keep the subspace encoding factual gender information. We show that this method performs well on both desired properties: decreasing the network's reliance on bias while retaining knowledge about factual gender.

Terminology
We consider two types of gender information encoded in text:
• Factual gender is a grammatical (pronouns "he", "she", "her", etc.) or semantic ("boy", "girl", etc.) feature of a specific word. It can also be indicated by a coreference link. We will call words with factual gender gendered, in contrast to gender-neutral words.
• Gender bias is the connection between a word and the specific gender with which it is usually associated, regardless of the factual premise. We will refer to words with gender bias as biased, in contrast to non-biased words.
Please note that these definitions do not preclude the existence of words that are biased and at the same time gender-neutral. In that case, we consider the bias stereotypical and aim to mitigate it with our method. On the other hand, we want to preserve bias in gendered words.

Methods
We aim to remove the influence of gender-biased words while keeping the information about factual gender in the sentence given by pronouns. We focus on the interactions of gender bias and factual gender information in coreference cues of the following form:

[NOUN] examined the farmer for injuries because [PRONOUN] was caring.
In English, we can expect to obtain the factual gender from the pronoun. Revealing one of the words in the coreference link should impact the prediction of the other. Therefore, we can name two causal associations:

C_I: bias_noun → f. gender_pronoun
C_II: f. gender_pronoun → bias_noun

In our method, we will primarily focus on these two ways bias and factual gender interact. For gender-neutral nouns (in association C_I), the effect on predicting masked pronouns is primarily correlated with their gender bias. At the same time, the second association is desirable, as it reveals factual gender information and can improve the masked token prediction of a gendered word. We define two conditional probability distributions corresponding to those causal associations:

P_I(y_pronoun | X, b)   (1)
P_II(y_noun | X, f)   (2)

where y is a token predicted in the position of the pronoun and the noun, respectively; X is the context for masked language modeling; b and f are the bias and factual gender factors, respectively. We model the bias factor by using a gender-neutral biased noun. Below we present examples introducing female and male bias:

Example 1: The [nurse/doctor] examined the farmer for injuries because [MASK] was caring.

Similarly, the factual gender factor is modeled by introducing a pronoun with a specific gender in the sentence:

Example 2: The [MASK] examined the farmer for injuries because [she/he] was caring.

We aim to diminish the role of bias in the prediction of pronouns of a specific gender. On the other hand, the gender indicated in pronouns can be useful in the prediction of a gendered noun. Mathematically speaking, we want to drop the conditionality on the bias factor in P_I from eq. (1), while keeping the conditionality on the gender factor in P_II.
To decrease the effect of the gender signal from words other than the pronoun and the noun, we introduce a baseline where both the pronoun and noun tokens are masked:

Example 3: The [MASK] examined the farmer for injuries because [MASK] was caring.
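As a minimal sketch, the three masking configurations above can be generated from the coreference template; the exact surface form of the prompts and the choice of biased nouns here are our illustrative assumptions, not the paper's exact prompt set.

```python
# Sketch: building the three masking configurations from the coreference
# template.  The template string and noun/pronoun choices are illustrative.
TEMPLATE = "{noun} examined the farmer for injuries because {pronoun} was caring."

def example_1(biased_noun):
    # Example 1: a biased gender-neutral noun is revealed, the pronoun is
    # masked (association C_I: bias of the noun -> gender of the pronoun).
    return TEMPLATE.format(noun="The " + biased_noun, pronoun="[MASK]")

def example_2(pronoun):
    # Example 2: a gendered pronoun is revealed, the noun is masked
    # (association C_II: gender of the pronoun -> bias of the noun).
    return TEMPLATE.format(noun="The [MASK]", pronoun=pronoun)

def example_3():
    # Example 3: baseline with both the noun and the pronoun masked.
    return TEMPLATE.format(noun="The [MASK]", pronoun="[MASK]")
```

For instance, `example_1("nurse")` yields a prompt where only the pronoun remains for the masked language model to fill in.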

Evaluation of Bias
Manifestation of gender bias may vary significantly from model to model and can be attributed mainly to the choice of the pre-training corpora as well as the training regime. We define the gender preference in a sentence as the ratio between the probabilities of predicting male and female pronouns:

GP(X) = P(y = "he" | X) / P(y = "she" | X)   (3)

To estimate the gender bias of a profession name, we compare the gender preference in a sentence where the profession word is masked (example 3 from the previous paragraph) and not masked (example 1). We define the relative gender preference:

RGP_noun = log(GP(X_noun)) − log(GP(X_∅))   (4)

X_noun denotes contexts in which the noun is revealed (example 1), and X_∅ corresponds to example 3, where we mask both the noun and the pronoun. Our approach focuses on the bias introduced by a noun, especially a profession name. We subtract log(GP(X_∅)) to single out the bias contribution coming from the noun. We use the logarithm, so results around zero mean that revealing the noun does not affect gender preference.
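The two quantities can be sketched as follows, assuming the pronoun probabilities P("he" | X) and P("she" | X) have already been obtained from a masked language model; the function names are ours.

```python
import math

def gender_preference(p_he, p_she):
    # GP(X): ratio of the model's probability for the male pronoun over the
    # probability for the female pronoun at the masked position.
    return p_he / p_she

def relative_gender_preference(gp_noun, gp_baseline):
    # RGP_noun = log GP(X_noun) - log GP(X_0): the shift in (log) gender
    # preference caused by revealing the noun, relative to the baseline
    # where both noun and pronoun are masked.
    return math.log(gp_noun) - math.log(gp_baseline)
```

A value of zero means revealing the noun does not change the model's gender preference; positive values indicate male bias and negative values female bias.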

Disentangling Gender Signals with Orthogonal Probe
To mitigate the influence of bias on the predictions (eq. (2)), we focus on the internal representations of the language model. We aim to inspect contextual representations of words and identify the parts that encode the causal associations C_I and C_II. For that purpose, we utilize the orthogonal structural probes proposed by Limisiewicz and Mareček (2021).
In structural probing, the embedding vectors are transformed so that distances between pairs of the projected embeddings approximate a linguistic feature, e.g., distance in a dependency tree (Hewitt and Manning, 2019). In our case, we want to approximate the gender information introduced by a gendered pronoun f (factual) and by a gender-neutral noun b (bias). The factor f takes the value −1 for female pronouns, 1 for male ones, and 0 for the gender-neutral "they". The factor b is the relative gender preference (eq. (4)) for a specific noun (b ≡ RGP_noun).
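The factual gender factor f can be sketched as a simple lookup; the exact pronoun inventory below is our assumption, the target values (−1, 1, 0) follow the definition above.

```python
# Sketch: encoding the factual gender factor f for pronouns.
# The pronoun lists are illustrative; the values follow the paper's
# definition: -1 female, 1 male, 0 gender-neutral.
FEMALE_PRONOUNS = {"she", "her", "hers", "herself"}
MALE_PRONOUNS = {"he", "him", "his", "himself"}

def factual_gender_factor(pronoun):
    p = pronoun.lower()
    if p in FEMALE_PRONOUNS:
        return -1
    if p in MALE_PRONOUNS:
        return 1
    return 0  # gender-neutral, e.g. "they"
```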
Our orthogonal probe consists of three trainable components:
• O: an orthogonal transformation, mapping representations to a new coordinate system.
• SV: a scaling vector, scaling the dimensions of the new coordinate system element-wise. We assume that dimensions that store the probed information are associated with large scaling coefficients.
• i: an intercept, shifting the representation.
O is a tunable orthogonal matrix of size d_emb × d_emb; SV and i are tunable vectors of length d_emb, where d_emb is the dimensionality of the model's embeddings. The probing losses match the target factors with the scaled distances between pairs of representations:

L_I = | b − d±(SV ⊙ O(h_b,P), SV ⊙ O(h_∅,P)) |,   L_II = | f − d±(SV ⊙ O(h_f,N), SV ⊙ O(h_∅,N)) |   (5)

where h_b,P is the vector representation of the masked pronoun in example 1; h_f,N is the vector representation of the masked noun in example 2; the vectors h_∅,P and h_∅,N are the representations of the masked pronoun and noun, respectively, in the baseline example 3.
To account for negative values of the target factors (b and f) in eq. (5), we generalize the distance metric to negative values in the following way:

d±(u, v) = sgn( Σ_i (u_i − v_i) ) · ||u − v||   (6)

We jointly probe for both objectives (the orthogonal transformation is shared). Limisiewicz and Mareček (2021) observed that the resulting scaling vector after optimization tends to be sparse, which allows finding the subspace of the embedding space that encodes particular information.
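A minimal sketch of the scaled, signed probe distance, assuming the generalized metric takes the form of a sign-carrying Euclidean distance (the paper's exact formulation of eq. (6) may differ):

```python
import numpy as np

def signed_distance(u, v):
    # Euclidean distance generalized with a sign so it can approximate
    # negative target factors (b, f < 0 for the female gender).
    diff = u - v
    return np.sign(diff.sum()) * np.linalg.norm(diff)

def probe_distance(h1, h2, O, sv):
    # Distance between two representations in the probe's coordinate system:
    # rotate with the orthogonal matrix O, then scale element-wise by sv.
    return signed_distance(sv * (O @ h1), sv * (O @ h2))
```

With O the identity and sv all ones, `probe_distance` reduces to the signed Euclidean distance between the raw representations.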

Filtering Algorithm
In our algorithm, we aim to filter out the latent vector's dimensions that encode bias. In particular, we assume that when ||h_b,P − h_∅,P|| → 0, then P_I(y_pronoun | X, b) → P_I(y_pronoun | X). We can diminish the information by masking the dimensions with a corresponding scaling vector coefficient larger than a small ϵ. The bias filter is defined as:

F_−b = 1[ abs(SV_b) < ϵ ]   (7)

where abs(·) is the element-wise absolute value and 1[·] is the element-wise indicator. We apply this vector to the representations of the hidden layers:

h̃ = O^T (F_−b ⊙ O h)   (8)

To preserve factual gender information, we propose an alternative version of the filter. A dimension is kept when its importance (measured by the absolute value of the scaling vector coefficient) is higher in probing for factual gender than in probing for bias. We define the factual gender preserving filter as:

F_−b,+f = 1[ abs(SV_b) < ϵ ∨ abs(SV_f) > abs(SV_b) ]   (9)

The filtering is performed as in eq. (8). We analyze the number of overlapping dimensions in the two scaling vectors in Section 3.2.
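The two filters and their application to a hidden representation can be sketched as follows; the threshold, dimensionality, and function names are illustrative assumptions.

```python
import numpy as np

def bias_filter(sv_b, eps=1e-12):
    # F_{-b}: keep only dimensions whose bias scaling coefficient is below
    # eps (i.e. dimensions that do not encode bias); zero out the rest.
    return (np.abs(sv_b) < eps).astype(float)

def gender_preserving_filter(sv_b, sv_f, eps=1e-12):
    # F_{-b,+f}: additionally keep dimensions whose factual-gender
    # importance exceeds their bias importance.
    keep = (np.abs(sv_b) < eps) | (np.abs(sv_f) > np.abs(sv_b))
    return keep.astype(float)

def apply_filter(h, O, F):
    # Rotate into the probe's coordinate system, mask the filtered
    # dimensions, and rotate back (O is orthogonal, so O.T is its inverse).
    return O.T @ (F * (O @ h))
```

In the second filter, a dimension survives either because it carries no bias at all or because its factual gender signal dominates the bias signal.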

Experiments and Results
We examine the representations of two BERT models (base-cased: 12 layers, 768 embedding size; large-cased: 24 layers, 1024 embedding size; Devlin et al., 2019) and of ELECTRA (base-generator: 12 layers, 256 embedding size; Clark et al., 2020). All the models are Transformer encoders trained on the masked language modeling objective.

Evaluation of Gender Bias in Language Models
Before constructing a de-biasing algorithm, we evaluate the bias in the predictions of three language models. We evaluate the gender bias in language models on 104 gender-neutral profession words from the WinoBias dataset (Zhao et al., 2018a). The authors analyzed the data from the US Labor Force Statistics. They annotated the 20 professions with the highest share of women as stereotypically female and the 20 professions with the highest share of men as stereotypically male.
We run the inference on the prompts in five formats presented in Table 1 and estimate RGP with eq. (4). To obtain a word's bias in the model, we take the mean RGP_noun computed over all prompts.

Results
We compare our results with the list of stereotypical words from the annotation of Zhao et al. (2018a). Similarly, we pick up to 20 nouns with the highest positive RGP as male-biased and up to 20 nouns with the lowest negative RGP as female-biased. These lists differ across models. Table 2 presents the most biased words according to the three models. Noticeably, there are differences between empirical and annotated bias. In particular, the word "salesperson", considered male-biased based on job market data, was among the most skewed toward the female gender in 2 out of 3 models. The full results of the evaluation can be found in Appendix D.

Probing for Gender Bias and Factual Gender Information
We optimize the joint probe, where the orthogonal transformation is shared, while the scaling vectors and intercepts are task-specific. The probing objectives are to approximate: C_I) the gender bias of gender-neutral nouns (b ≡ RGP_noun); and C_II) the factual gender information of pronouns (f ≡ f. gender_pronoun). We use the WinoMT dataset (Stanovsky et al., 2019), which is a derivative of the WinoBias dataset (Zhao et al., 2018a). Examples in this dataset are more challenging to solve than our evaluation prompts (Table 1). Each sentence contains two potential antecedents. We use WinoMT for probing because we want to separate probe optimization and evaluation data. Moreover, we want to identify the encoding of gender bias and factual gender information in more diverse contexts.
We split the dataset into train, development, and test sets with non-overlapping nouns, mainly profession names. They contain 62, 21, and 21 unique nouns, corresponding to 2474, 856, and 546 sentences. The splits are designed to balance male and female-biased words in each of them.

Results
The probes on the models' top layer give a good approximation of factual gender: the Pearson correlation between predicted and gold values ranges from 0.928 to 0.946. The Pearson correlation for bias was high for BERT base (0.876) and BERT large (0.946), and lower for ELECTRA (0.451). We have identified the dimensions encoding the conditionalities C_I and C_II. In Figure 2, we present the number of dimensions selected for each objective and their overlap. We see that bias is encoded sparsely, in 18 to 80 dimensions. (The WinoMT dataset was originally introduced to evaluate gender bias in machine translation.)

Filtering Gender Bias
The primary purpose of probing is to construct bias filters based on the values of the scaling vectors: F_−b and F_−b,+f. Subsequently, we perform our de-biasing transformation (eq. (7)) on the last layers of the model. The probes on top of each layer are optimized separately.
After filtering, we again compute RGP for all professions. We monitor the following metrics to measure the overall improvement of the de-biasing algorithm on the set of 104 gender-neutral nouns S_GN: Mean squared error (MSE_GN) shows how far from zero RGP is. The advantage of this metric is that the bias of some words cannot be compensated by the opposite bias of others. The main objective of de-biasing is to minimize the mean squared error.
Mean shows whether the model is skewed toward predicting a specific gender. In cases when the mean is close to zero but MSE is high, we can tell that there is no general preference of the model toward one gender, but the individual words are biased.
Variance is a measure similar to MSE. It is useful for showing the spread of RGP when the mean is non-zero.
Additionally, we introduce a set of 26 gendered nouns (S_G) for which we expect to observe non-zero RGP. We monitor MSE_G to diagnose whether semantic gender information is preserved in de-biasing.
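The three monitored statistics can be sketched as a single helper; the function name is ours.

```python
import numpy as np

def debias_metrics(rgp_values):
    # MSE: distance of RGP from the unbiased value 0 (opposite biases of
    # different words cannot cancel out).  Mean: global skew toward one
    # gender.  Variance: spread of RGP, informative when the mean is
    # non-zero.
    rgp = np.asarray(rgp_values, dtype=float)
    return {
        "mse": float(np.mean(rgp ** 2)),
        "mean": float(np.mean(rgp)),
        "var": float(np.var(rgp)),
    }
```

For RGP values [1, −1] the mean is zero while MSE is high: no global preference, but the individual words are biased, which is exactly the case the MSE metric is meant to catch.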

Results
Filtering the top layer decreases MSE_GN for the analyzed models; filtering more than one layer usually brings this metric further down.
It is important to note that the original models differ in the extent to which their predictions are biased. The mean squared error is the lowest for BERT large (0.099); noticeably, it is lower than in the other analyzed models even after de-biasing (except for ELECTRA after 2-layer filtering: 0.073).
The predictions of all the models are skewed toward predicting the male pronoun when the noun is revealed. Most of the nouns used in the evaluation were profession names. Therefore, we think that this result is a manifestation of the stereotype that career-related words tend to be associated with men.
After filtering, BERT base becomes slightly skewed toward female pronouns (MEAN_GN < 0).

Table 4: Top-1 accuracy for all tokens in EWT UD (Silveira et al., 2014). FT is the number of the model's top layers for which filtering was performed.
For the two remaining models, we observe that keeping the factual gender signal performs well in decreasing MEAN_GN. Another advantage of keeping the factual gender representation is the preservation of bias in semantically gendered nouns, i.e., a higher MSE_G.

How Does Bias Filtering Affect Masked Language Modeling?
We examine whether filtering affects the model's performance on the original task. For that purpose, we evaluate top-1 prediction accuracy for the masked tokens in the test set from English Web Treebank UD (Silveira et al., 2014) with 2077 sentences. We also evaluate the capability of the model to infer the personal pronoun based on the context. We use the GAP Coreference Dataset (Webster et al., 2018) with 8908 paragraphs. In each test case, we mask a pronoun referring to a person usually mentioned by their name. In the sentences, gender can be easily inferred from the name. In some cases, the texts also contain other (un-masked) gender pronouns.

Results: All Tokens
The results in Table 4 show that filtering out bias dimensions moderately decreases MLM accuracy: by up to 0.037 for BERT large, 0.052 for BERT base, and 0.07 for ELECTRA. In most cases, exempting factual gender information from filtering reduces the drop in results.

Results: Personal Pronouns in GAP
We observe a more significant drop in results on the GAP dataset after de-biasing. One possible explanation is that filtering also removes confounding information from stereotypically biased words that affects the prediction of the correct gender. In this experiment, we also examine the filter which removes all factual-gender dimensions. Expectedly, such a transformation significantly decreases the accuracy. However, we still obtain relatively good results, i.e., accuracy higher than 0.5, which is the expected accuracy of choosing the gender at random. Thus, we conjecture that a gender signal is still left in the model despite filtering.

Table 5: Top-1 accuracy for masked pronouns in the GAP dataset (Webster et al., 2018). FT is the number of the model's top layers for which filtering was performed.
Summary of the Results: We observe that the optimal de-biasing setting is the factual gender preserving filtering (F_−b,+f). This approach diminishes stereotypical bias in nouns while preserving gender information for gendered nouns (Section 3.3). Moreover, it performs better in the masked language modeling task (Section 3.4).

Related Work
In recent years, much focus has been put on evaluating and countering bias in language representations and word embeddings. Bolukbasi et al. (2016) observed that the distribution of Word2Vec embeddings (Mikolov et al., 2013) encodes gender bias. They tried to diminish its role by projecting the embeddings along the so-called gender direction, which separates gendered words such as "he" and "she". They measured bias as the cosine similarity between an embedding and the gender direction.
Zhao et al. (2018b) proposed a method to diminish the differentiation of word representations in the gender dimension during training of the GloVe embeddings (Pennington et al., 2014). Nevertheless, the subsequent analysis of Gonen and Goldberg (2019) argued that these approaches remove bias only partially and showed that bias is encoded in a multidimensional subspace of the embedding space. The issue can be resolved by projecting along multiple dimensions to further nullify the role of gender in the representations (Ravfogel et al., 2020). However, dropping all gender-related information, e.g., the distinction between feminine and masculine pronouns, can be detrimental to gender-sensitive applications. Kaneko and Bollegala (2019) proposed a de-biasing algorithm that preserves gender information in gendered words.
Unlike the approaches above, we work with contextual embeddings of language models. Vig et al. (2020) investigated bias in the representations of a contextual model (GPT-2; Radford et al., 2019). They used causal mediation analysis to identify the components of the model responsible for encoding bias. Nadeem et al. (2021) and Nangia et al. (2020) proposed methods of evaluating bias (including gender bias) with counterfactual test examples, to some extent similar to our prompts. Qian et al. (2019) and Liang et al. (2020) employed prompts similar to ours to evaluate the gender bias of profession words in language models. The latter work also aims to identify and remove a gender subspace in the model. In contrast to our approach, they do not guard the factual gender signal.
Recently, Stanczak and Augenstein (2021) summarized the research on the evaluation and mitigation of gender bias in a survey of 304 papers.

Bias Statement
We define bias as the connection between a word and the specific gender it is usually associated with. The association usually stems from an imbalanced number of corpus mentions of the word in male and female contexts. This work focuses on the stereotypical bias of nouns that do not otherwise denote gender (semantically or grammatically). We consider such a denotation factual gender and want to guard it in the models' representations.
Our method is applied to language models, hence we recognize potential application in language generation. We envision the case where the language model is applied to complete the text about a person, where we don't have implicit information about their gender. In this scenario, the model should not be compelled by stereotypical bias to assign a specific gender to a person. On the other hand, when the implicit information about a person's gender is provided in the context, the generated text should be consistent.
Language generation is becoming ubiquitous in everyday NLP applications (e.g., chat-bots, auto-completion; Dale, 2020). Therefore, it is important to ensure that language models do not propagate sex-based discrimination.
The proposed method can also be implemented in deep models for other tasks, e.g., machine translation systems. In machine translation, bias is especially harmful when translating from English into languages that widely denote gender grammatically. In translation into such languages, the generation of gendered nouns tends to be based on stereotypical gender roles instead of the factual gender information provided in the source language (Stanovsky et al., 2019).

Limitations
It is important to note that our filtering method does not remove all gender information. Therefore, a downstream classifier could still easily retrieve the factual gender of a person mentioned in a text, e.g., in their CV.
This aspect makes our method not applicable to downstream tasks that use gender-biased data. For instance, in the task of predicting a profession based on a person's biography (De-Arteaga et al., 2019), there are different proportions of men and women among holders of specific professions. A classifier trained on de-biased but not de-gendered embeddings would learn to rely on gender property in its predictions.
Admittedly, in our results, we see that the proposed method based on orthogonal probes does not fully remove gender bias from the representations (Section 3.3). Even though our method typically identifies multiple dimensions encoding bias and factual gender information, there is no guarantee that all such dimensions will be filtered. Noticeably, the de-biased BERT base still underperforms the off-the-shelf BERT large in terms of MSE_GN. The reason for choosing this particular method was its ability to disentangle the representation of two language signals, in our case gender bias and factual gender information.
Lastly, the probe can only learn a linear transformation, while in a non-linear system such as the Transformer, the signal can be encoded non-linearly. Therefore, even when we remove the whole bias subspace, the information can be recovered in the next layer of the model (Ravfogel et al., 2020). This is also the reason why we decided to focus on the top layers of the models.

Conclusions
We propose a new insight into gender information in contextual language representations. In de-biasing, we focus on the trade-off between removing stereotypical bias and preserving the semantic and grammatical information about the gender of a word from its context. Our evaluation of gender bias showed that the three analyzed masked language models (BERT large, BERT base, and ELECTRA) are biased and skewed toward predicting the male gender for profession names. To mitigate this issue, we disentangle stereotypical bias from factual gender information. Our filtering method can remove the former to some extent and preserve the latter. As a result, we decrease the bias in the predictions of language models without significant deterioration of their performance in the masked language modeling task.

A Technical Details
We use batches of size 10. Optimization is conducted with Adam (Kingma and Ba, 2015) with an initial learning rate of 0.02 and hyperparameters β_1 = 0.9, β_2 = 0.999, and ϵ = 10^−8. We use learning rate decay with a decay factor of 10 and an early-stopping mechanism: training is stopped after three consecutive epochs without improvement of the validation loss. We clip each gradient's norm at c = 1.0. The orthogonal penalty was set to λ_O = 0.1. We implemented the network in TensorFlow 2 (Abadi et al., 2015). The code will be available on GitHub.

A.1 Computing Infrastructure
We optimized probes on a GPU core GeForce GTX 1080 Ti. Training a probe on top of one layer of BERT large takes about 5 minutes.

A.2 Number of Parameters in the Probe
The number of parameters in the probe depends on the model's embedding size d_emb. The orthogonal transformation matrix consists of d_emb² parameters; each intercept and scaling vector has d_emb parameters. With two probing tasks, each with its own scaling vector and intercept, the size of the probe equals d_emb² + 4 · d_emb.
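The count can be verified with a one-line sketch; the per-task breakdown of the 4 · d_emb term is our reading of the joint two-task probe setup.

```python
def probe_parameter_count(d_emb, n_tasks=2):
    # d_emb^2 parameters for the shared orthogonal matrix, plus one scaling
    # vector and one intercept (d_emb parameters each) per probing task;
    # with n_tasks = 2 this gives d_emb^2 + 4 * d_emb.
    return d_emb ** 2 + 2 * n_tasks * d_emb
```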

B Details about Datasets
WinoMT is distributed under MIT license; EWT UD under Creative Commons 4.0 license; GAP under Apache 2.0 license.

C Results for Different Filtering Thresholds
We present the results for different filtering thresholds for BERT large. We decided to pick the threshold equal to 10^−12, as lowering it brought only minor improvement in MSE_GN.

D Evaluation of Bias in Language Models
We present the list of 26 gendered words and their empirical bias in the accompanying table.