Debiasing Pre-trained Contextualised Embeddings

In comparison to the numerous debiasing methods proposed for static non-contextualised word embeddings, the discriminative biases in contextualised embeddings have received relatively little attention. We propose a fine-tuning method that can be applied at the token- or sentence-level to debias pre-trained contextualised embeddings. Our proposed method can be applied to any pre-trained contextualised embedding model, without requiring those models to be retrained. Using gender bias as an illustrative example, we then conduct a systematic study using several state-of-the-art (SoTA) contextualised representations on multiple benchmark datasets to evaluate the level of biases encoded in different contextualised embeddings before and after debiasing using the proposed method. We find that applying token-level debiasing for all tokens and across all layers of a contextualised embedding model produces the best performance. Interestingly, we observe that there is a trade-off between creating an accurate vs. unbiased contextualised embedding model, and different contextualised embedding models respond differently to this trade-off.


Introduction
Contextualised word embeddings have significantly improved performance in numerous natural language processing (NLP) applications (Devlin et al., 2019; Liu et al., 2019; Clark et al., 2020) and have established themselves as the de facto standard for input text representations. Compared to static word embeddings (Pennington et al., 2014; Mikolov et al., 2013), which represent a word by a single vector in all contexts in which it occurs, contextualised embeddings use dynamic, context-dependent vectors to represent a word in a specific context. Unfortunately, however, it has been shown that, similar to their non-contextualised counterparts, contextualised text embeddings also encode various types of unfair biases (Zhao et al., 2019; May et al., 2019; Tan and Celis, 2019; Bommasani et al., 2020; Kurita et al., 2019). This is a worrying situation because such biases can easily propagate to the downstream NLP applications that use contextualised text embeddings.

* Danushka Bollegala holds concurrent appointments as a Professor at University of Liverpool and as an Amazon Scholar. This paper describes work performed at the University of Liverpool and is not associated with Amazon.
First, compared to static word embedding models, where the semantic representation of a word is limited to a single vector, contextualised embedding models have a significantly larger number of parameters related in complex ways. For example, the BERT-large model (Devlin et al., 2019) contains 24 layers, 16 attention heads and 340M parameters. Therefore, it is not obvious which parameters are responsible for the unfair biases related to a particular word. For this reason, projection-based methods, popularly used for debiasing pre-trained static word embeddings, cannot be directly applied to debias pre-trained contextualised word embeddings.
Second, in the case of contextualised embeddings, the biases associated with a particular word's representation are a function of both the target word itself and the context in which it occurs. Therefore, the same word can show unfair biases in some contexts and not in others. It is important to consider the words that co-occur with the target word in different contexts when debiasing a contextualised embedding model.
Third, pre-training large-scale contextualised embeddings from scratch is time consuming and requires specialised hardware such as GPU/TPU clusters. On the other hand, fine-tuning a pre-trained contextualised embedding model for a particular task (possibly using labelled data for the target task) is relatively less expensive. Consequently, the standard practice in the NLP community has been to share pre-trained contextualised embedding models and fine-tune them as needed. Therefore, it is desirable that a debiasing method proposed for contextualised embedding models can be applied as a fine-tuning method. In this view, counterfactual data augmentation methods (Zmigrod et al., 2019; Hall Maudslay et al., 2019; Zhao et al., 2019) that swap gender pronouns in the training corpus to create a gender-balanced version of the training data are less attractive when debiasing contextualised embeddings, because we must retrain those models on the balanced corpora, which is more expensive compared to fine-tuning.
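The augmentation step of counterfactual data augmentation is itself cheap; the expense lies in the subsequent retraining. A minimal sketch of the swapping step (the pronoun pair list is illustrative, not the one used in the cited work, and casing and referential ambiguities such as her/him are ignored here):

```python
# Illustrative gendered word pairs; real CDA implementations use larger,
# carefully curated lists and handle grammatical case (her -> his/him).
GENDER_PAIRS = {"he": "she", "she": "he", "his": "her", "her": "his",
                "him": "her", "man": "woman", "woman": "man"}

def swap_gender(sentence: str) -> str:
    """Return the sentence with gendered tokens swapped (lowercased lookup)."""
    tokens = sentence.split()
    return " ".join(GENDER_PAIRS.get(t.lower(), t) for t in tokens)

def augment(corpus):
    """Original sentences plus their gender-swapped counterfactuals."""
    return corpus + [swap_gender(s) for s in corpus]
```

The doubled corpus must then be used to pre-train the embedding model from scratch, which is what makes this approach costly relative to fine-tuning.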
Using gender bias as a running example, we address the above-mentioned challenges by proposing a debiasing method that fine-tunes pre-trained contextualised word embeddings. Our proposed method retains the semantic information learnt by the contextualised embedding model with respect to gender-related words, while simultaneously removing any stereotypical biases in the pre-trained model. In particular, our proposed method is agnostic to the internal architecture of the contextualised embedding model, and we apply it to debias different pre-trained embeddings such as BERT, RoBERTa (Liu et al., 2019), ALBERT (Lan et al., 2020), DistilBERT (Sanh et al., 2019) and ELECTRA (Clark et al., 2020). Moreover, our proposed method can be applied at the token-level or at the sentence-level, enabling us to debias at different granularities and on different layers in the pre-trained contextualised embedding model.

1 https://huggingface.co/transformers/pretrained_models.html
2 Code and debiased embeddings: https://github.com/kanekomasahiro/context-debias
Following prior work, we compare the proposed debiasing method in two sentence-level tasks: Sentence Encoder Association Test (SEAT; May et al., 2019) and Multi-genre co-reference-based Natural Language Inference (MNLI; Dev et al., 2020). Experimental results show that the proposed method not only debiases all contextualised word embedding models compared, but also preserves useful semantic information for solving downstream tasks such as sentiment classification (Socher et al., 2013), paraphrase detection (Dolan and Brockett, 2005), semantic textual similarity measurement (Cer et al., 2017), natural language inference (Dagan et al., 2005;Bar-Haim et al., 2006) and solving Winograd schema (Levesque et al., 2012). We consider gender bias as a running example throughout this paper and evaluate the proposed method with respect to its ability to overcome gender bias in contextualised word embeddings, and defer extensions to other types of biases to future work.

Related Work
Prior work on debiasing word embeddings can be broadly categorised into two groups depending on whether they consider static or contextualised word embeddings. Although we focus on contextualised embeddings in this paper, we first briefly describe prior work on debiasing static embeddings for completeness of the discussion.
Bias in Static Word Embeddings: Bolukbasi et al. (2016) proposed a post-processing approach that projects gender-neutral words into a subspace, which is orthogonal to the gender direction defined by a list of gender-definitional words. However, their method ignores gender-definitional words during the subsequent debiasing process, and focuses only on words that are not predicted as gender-definitional by a classifier. Therefore, if the classifier erroneously predicts a stereotypical word as gender-definitional, it would not get debiased. Zhao et al. (2018b) modified the original GloVe (Pennington et al., 2014) objective to learn gender-neutral word embeddings (GN-GloVe) from a given corpus. Unlike the above-mentioned methods, Kaneko and Bollegala (2019) proposed GP-GloVe, a post-processing method that preserves gender-related information with an autoencoder (Kaneko and Bollegala, 2020), while removing discriminatory biases from stereotypical cases.
Adversarial learning approaches to debiasing (Xie et al., 2017; Elazar and Goldberg, 2018; Li et al., 2018) first encode the inputs and then jointly train two classifiers: one predicting the target task (for which we must ensure high prediction accuracy) and the other predicting protected attributes (which must not be easily predictable). Elazar and Goldberg (2018) showed that although it is possible to obtain chance-level development-set accuracy for the protected attributes during training, a post-hoc classifier trained on the encoded inputs can still reach substantially high accuracies for the protected attributes. They conclude that adversarial learning alone does not guarantee invariant representations for the protected attributes. Ravfogel et al. (2020) found that iteratively projecting word embeddings to the null space of the gender direction further improves debiasing performance.

Benchmarks for biases in Static Embeddings:
Word Embedding Association Test (WEAT; Caliskan et al., 2017) quantifies various biases (e.g. gender, race and age) using semantic similarities between word embeddings. Word Association Test (WAT) measures gender bias over a large set of words by calculating a gender information vector for each word in a word association graph created in the Small World of Words project (SWOWEN; Deyne et al., 2019), propagating masculine and feminine words via a random walk (Zhou et al., 2003). The SemBias dataset (Zhao et al., 2018b) contains three types of word pairs: (a) Definition, a gender-definition word pair (e.g. hero - heroine), (b) Stereotype, a gender-stereotype word pair (e.g. manager - secretary) and (c) None, two other word pairs with similar meanings unrelated to gender (e.g. jazz - blues, pencil - pen). It uses the cosine similarity between the gender directional vector (v_he − v_she) and the offset vector (a − b) for each word pair (a, b) in each set to measure gender bias. WinoBias (Zhao et al., 2018a) uses the ability to predict gender pronouns with equal probabilities for gender-neutral nouns such as occupations as a test for the gender bias in embeddings.
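The benchmarks above are all built from cosine similarities between embeddings. As a minimal sketch, the WEAT association score and effect size could be computed as follows (the vector inputs are illustrative toy embeddings, not the published word lists):

```python
import numpy as np

def cos(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def association(w, A, B):
    """s(w, A, B): mean similarity to attribute set A minus to attribute set B."""
    return np.mean([cos(w, a) for a in A]) - np.mean([cos(w, b) for b in B])

def weat_effect_size(X, Y, A, B):
    """Normalised difference of association scores between target sets X and Y."""
    s_x = [association(x, A, B) for x in X]
    s_y = [association(y, A, B) for y in Y]
    return (np.mean(s_x) - np.mean(s_y)) / np.std(s_x + s_y, ddof=1)
```

A positive effect size indicates that the target set X is more strongly associated with attribute set A than Y is, which is how WEAT operationalises, for instance, a male skew of occupation words.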

Bias in Contextualised Word Embeddings:
May et al. (2019) extended WEAT using templates to create a sentence-level benchmark for evaluating bias called SEAT. In addition to the attributes proposed in WEAT, they proposed two additional bias types: angry black woman and double bind (where a woman performing a role typically performed by a man is seen as arrogant). They show that, compared to static embeddings, contextualised embeddings such as BERT, GPT and ELMo are less biased. However, similar to WEAT, SEAT also only has positive predictive ability and cannot detect the absence of a bias. Bommasani et al. (2020) evaluated the bias in contextualised embeddings by first distilling static embeddings from contextualised embeddings and then using WEAT tests for different types of biases such as gender (male, female), racial (White, Hispanic, Asian) and religion (Christianity, Islam). They found that aggregating the contextualised embedding of a particular word in different contexts via averaging is the best method for creating a static embedding from a contextualised embedding. Zhao et al. (2019) showed that contextualised ELMo embeddings also learn gender biases present in the training corpus. Moreover, these biases propagate to a downstream coreference resolution task. They showed that data augmentation by swapping gender helps more than neutralisation by a projection. They obtain the embeddings of two input sentences with reversed gender from ELMo, and obtain the debiased embedding by averaging them. Their method can only be applied to feature-based embeddings, so it cannot be applied to fine-tuning-based embeddings such as BERT. We directly debias the contextual embeddings. Additionally, data augmentation requires re-training of the embeddings, which is often costly compared to fine-tuning. Kurita et al. (2019) created masked templates such as "[MASK] is a nurse" and used BERT to predict the masked gender pronouns.
They used the log-odds between male and female pronoun predictions as an evaluation measure and showed BERT to be biased according to it. Karve et al. (2019) learnt conceptor matrices using class definitions in WEAT and used the negated conceptors to debias ELMo and BERT. Although their method was effective for ELMo, the results on BERT were mixed. This method can only be applied to context-independent vectors, and it requires the creation of static embeddings from BERT and ELMo as a pre-processing step for debiasing the context-dependent vectors. Therefore, we do not compare against this method in the present study, where we evaluate on context-dependent vectors. Dev et al. (2020) used natural language inference (NLI) as a bias evaluation task, where the goal is to ascertain if one sentence (i.e. premise) entails or contradicts another (i.e. hypothesis), or if neither conclusion holds (i.e. neutral). The premise-hypothesis pairs are constructed to elicit various types of discriminative biases. They showed that orthogonal projection to the gender direction (Dev and Phillips, 2019) can be used to debias contextualised embeddings as well. However, their method can be applied only to the non-contextualised layers (ELMo's Layer 1 and BERT's subtoken layer). In contrast, our proposed method can be applied to all layers in a contextualised embedding and outperforms their method on the same NLI task. Moreover, our debiasing approach does not require task-dependent data.

Debiasing Contextualised Embeddings
We propose a method for debiasing pre-trained contextualised word embeddings in a fine-tuning setting that simultaneously (a) preserves the semantic information in the pre-trained contextualised word embedding model, and (b) removes discriminative gender-related biases via an orthogonal projection in the intermediate (hidden) layers, operating at the token- or sentence-level. Fine-tuning allows debiasing to be carried out without requiring large amounts of training data or computational resources. Our debiasing method is independent of model architectures or their pre-training methods, and can be adapted to a wide range of contextualised embeddings as shown in § 4.3.
Let us define two types of words: attribute words (V_a) and target words (V_t). For example, in the case of gender bias, attribute words consist of multiple word sets such as feminine (e.g. she, woman, her) and masculine (e.g. he, man, him) words, whereas target words can be occupations (e.g. doctor, nurse, professor), which we expect to be gender neutral. We then extract sentences that contain an attribute or a target word. Sentences that contain more than one attribute (or target) word are excluded to avoid ambiguities. Let us denote the set of sentences extracted for an attribute or a target word w by Ω(w). Moreover, let A = ∪_{w ∈ V_a} Ω(w) and T = ∪_{w ∈ V_t} Ω(w) be the sets of sentences containing, respectively, all of the attribute and target words. We require that the debiased contextualised word embeddings preserve semantic information w.r.t. the sentences in A, and remove any discriminative biases w.r.t. the sentences in T.
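The data preparation described above can be sketched as follows; the word lists and the three-sentence corpus are illustrative stand-ins for the published resources:

```python
from collections import defaultdict

V_a = {"she", "he", "woman", "man"}      # attribute words (gendered)
V_t = {"doctor", "nurse", "professor"}   # target words (expected neutral)

def extract(corpus, vocab):
    """Omega(w): sentences containing exactly one word from `vocab`;
    sentences with more than one hit are excluded to avoid ambiguity."""
    omega = defaultdict(list)
    for sent in corpus:
        hits = [t for t in sent.lower().split() if t in vocab]
        if len(hits) == 1:
            omega[hits[0]].append(sent)
    return omega

corpus = ["She met a friend .", "The doctor arrived .", "He saw a nurse ."]
# A and T are the unions of Omega(w) over attribute and target words.
A = [s for sents in extract(corpus, V_a).values() for s in sents]
T = [s for sents in extract(corpus, V_t).values() for s in sents]
```

Here "He saw a nurse ." contributes to both A (via "he") and T (via "nurse"), since the single-occurrence check is applied per word type.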
Let us consider a contextualised word embedding model E with pre-trained model parameters θ_e. For an input sentence x, let us denote the embedding of token w in the i-th layer of E by E_i(w, x; θ_e). Moreover, let the total number of layers in E be N. In our experiments, we consider different types of encoder models as E. To formalise the requirement that the debiased word embedding E_i(t, x; θ_e) of a target word t ∈ V_t must not contain any information related to a protected attribute a, we consider the squared inner-product between the non-contextualised embedding v_i(a) of a and E_i(t, x; θ_e) as a loss L_i given by (1):

L_i = Σ_{t ∈ V_t} Σ_{x ∈ Ω(t)} Σ_{a ∈ V_a} (v_i(a)^T E_i(t, x; θ_e))²    (1)
Here, v_i(a) is computed by averaging the contextualised embedding of a in the i-th layer of E over all sentences in Ω(a), following Bommasani et al. (2020), and is given by (2):

v_i(a) = (1 / |Ω(a)|) Σ_{x ∈ Ω(a)} E_i(a, x; θ_pre)    (2)
Here, |Ω(a)| denotes the total number of sentences in Ω(a). If a word is split into multiple sub-tokens, we compute the contextualised embedding of the word by averaging the contextualised embeddings of its constituent sub-tokens. Minimising the loss L i defined by (1) with respect to θ e forces the hidden states of E to be orthogonal to the protected attributes such as gender.
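The computation of v_i(a), including the sub-token averaging described above, could be sketched as follows; `encoder` is a hypothetical stand-in returning per-layer hidden-state matrices, and the sub-token positions of the attribute word in each sentence are assumed to be known:

```python
import numpy as np

def static_attribute_vector(encoder, layer_i, sentences, attr_positions):
    """v_i(a): average over Omega(a) of the layer-i contextualised embedding
    of the attribute word a. If a is split into sub-tokens, their embeddings
    are averaged first. attr_positions[j] lists the sub-token indices of a
    in sentences[j]."""
    vecs = []
    for sent, positions in zip(sentences, attr_positions):
        hidden = encoder(sent)[layer_i]              # (seq_len, dim)
        vecs.append(hidden[positions].mean(axis=0))  # average sub-tokens
    return np.stack(vecs).mean(axis=0)               # average over Omega(a)
```

Note that v_i(a) is computed once from the frozen pre-trained model (θ_pre) and held fixed while θ_e is fine-tuned.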
Although removing discriminative biases from E is our main objective, we must simultaneously preserve as much of the useful information encoded in the pre-trained model as possible for the downstream tasks. We model this as a regulariser that measures the squared ℓ2 distance between the contextualised word embedding of a word w in the i-th layer of the original model, parametrised by θ_pre, and that of the debiased model, as in (3):

L_reg = Σ_{x ∈ A} Σ_{w ∈ x} ||E_i(w, x; θ_e) − E_i(w, x; θ_pre)||²    (3)
The overall training objective is then given by (4) as the linearly weighted sum of the two losses defined by (1) and (3):

L = α L_i + β L_reg    (4)
Here, the coefficients α, β ∈ [0, 1] satisfy α + β = 1. As shown in Figure 1, a contextualised word embedding model typically contains multiple layers. It is not obvious which hidden states of E are best for calculating L_i for the purpose of debiasing. Therefore, we compute L_i for different layers in a particular contextualised word embedding model in our experiments. Specifically, we consider three settings: debiasing only the first layer, only the last layer, or all layers. Moreover, L_i can be computed only for the target words in a sentence x as in (1), or can be summed over all words w ∈ x (i.e. Σ_{t ∈ V_t} Σ_{x ∈ Ω(t)} Σ_{w ∈ x} Σ_{a ∈ V_a} (v_i(a)^T E_i(w, x; θ_e))²). We refer to the former as token-level debiasing and the latter as sentence-level debiasing. Collectively, this gives us six different settings for the proposed debiasing method, which we evaluate experimentally in § 4.3.
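The training objective described by (1), (3) and (4) can be sketched with embeddings represented as plain vectors; this is a simplification of the actual fine-tuning setup, in which the embeddings are produced by the encoder being updated:

```python
import numpy as np

def debias_loss(target_embeds, attr_vecs):
    """Eq. (1)-style loss: squared inner products pushing each target-word
    embedding E_i(t, x) to be orthogonal to every attribute vector v_i(a)."""
    return sum(np.dot(v, e) ** 2 for e in target_embeds for v in attr_vecs)

def reg_loss(new_embeds, orig_embeds):
    """Eq. (3)-style regulariser: squared L2 distance to the frozen
    pre-trained embeddings, preserving pre-trained semantic information."""
    return sum(np.sum((e - o) ** 2) for e, o in zip(new_embeds, orig_embeds))

def total_loss(target_embeds, attr_vecs, new_embeds, orig_embeds,
               alpha=0.5, beta=0.5):
    """Eq. (4): linearly weighted combination with alpha + beta = 1."""
    return alpha * debias_loss(target_embeds, attr_vecs) \
        + beta * reg_loss(new_embeds, orig_embeds)
```

In token-level debiasing, `target_embeds` contains only the target-word embeddings; in sentence-level debiasing it would contain the embeddings of all words in each extracted sentence.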

Datasets
We used SEAT (May et al., 2019) tests 6, 7 and 8 to evaluate gender bias. We use NLI as a downstream evaluation task and use the Multi-Genre Natural Language Inference data (MNLI; Williams et al., 2018) for training and development, following Dev et al. (2020). In NLI, the task is to classify a given hypothesis and premise sentence pair as entailing, contradicting, or neutral. We programmatically generated the evaluation set following Dev et al. (2020) by filling occupation words and gender words into template sentences. The templates take the form "The subject verb a/an object." and the created sentence pairs are assumed to be neutral. We used the word lists created by Zhao et al. (2018b) as the attribute lists of feminine and masculine words. As the stereotype word list for target words, we use the list created by Kaneko and Bollegala (2019). Using the News-commentary-v15 corpus, we extracted 11,023, 42,489 and 34,148 sentences respectively for feminine, masculine and stereotype words. We excluded sentences with more than 128 tokens from the training data. We randomly sampled 1,000 sentences from each type of extracted sentence as development data.
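The template-based generation of the NLI evaluation set can be sketched as follows; the word lists below are illustrative stand-ins, not the published lists of Dev et al. (2020):

```python
OCCUPATIONS = ["doctor", "nurse"]               # target (stereotype) words
GENDER_WORDS = ["man", "woman"]                 # gendered subject words
VERBS_OBJECTS = [("bought", "a coat"), ("ate", "an apple")]

def make_pairs():
    """Premise uses an occupation subject, hypothesis swaps in a gendered
    subject; an unbiased model should label every pair as neutral."""
    pairs = []
    for occ in OCCUPATIONS:
        for gen in GENDER_WORDS:
            for verb, obj in VERBS_OBJECTS:
                premise = f"The {occ} {verb} {obj}."
                hypothesis = f"The {gen} {verb} {obj}."
                pairs.append((premise, hypothesis, "neutral"))
    return pairs
```

Since nothing in "The doctor bought a coat." entails or contradicts "The woman bought a coat.", any systematic deviation from the neutral label signals a gender bias in the underlying embeddings.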

Hyperparameters
We used BERT (bert-base-uncased; Devlin et al., 2019), RoBERTa (roberta-base; Liu et al., 2019), ALBERT (albert-base-v2; Lan et al., 2020), DistilBERT (distilbert-base-uncased; Sanh et al., 2019) and ELECTRA (electra-small-discriminator; Clark et al., 2020) in our experiments. DistilBERT has 6 layers and the others 12. We used the development data in SEAT-6 for hyperparameter tuning. The hyperparameters of the models, except the learning rate and batch size, are set to their default values as in run_glue.py. Using greedy search, the learning rate was set to 5e-5 and the batch size was tuned on the same development data.

Results

Table 1 shows the results on SEAT and GLUE, where original denotes the pre-trained contextualised models prior to debiasing. We see that all original models other than ELECTRA contain significant levels of gender bias. Overall, the all-token method, which conducts token-level debiasing across all layers, performs best. Prior work has shown that biases are learned at every layer (Bommasani et al., 2020), so it is important to debias all layers. Moreover, we see that debiasing at the token-level is more effective than at the sentence-level. This is because in token-level debiasing the loss is computed only on the target word, providing a more direct debiasing update for the target word than sentence-level debiasing, which sums the losses over all tokens in a sentence.

Debiasing vs. Preserving Information
To test the importance of carefully selecting the target words according to the types of biases we want to remove from the embeddings, we implement a random baseline where we randomly select target and attribute words from V_a ∪ V_t and perform all-token debiasing. We see that random debiases BERT to some extent but is not effective on the other models. This result shows that the proposed debiasing method is not merely a regularisation technique that imposes constraints on an arbitrary set of words; it is essential to carefully select the target words used for debiasing. The results on GLUE show that for BERT, DistilBERT and ELECTRA, the debiased embeddings report performance comparable to the original embeddings in most settings. This confirms that the proposed debiasing method preserves sufficient semantic information from the original embeddings to learn accurate prediction models for the downstream NLP tasks. However, the performance of RoBERTa and ALBERT decreases significantly compared to their original versions after debiasing. We suspect that these models are more sensitive to fine-tuning and hence lose their pre-trained information during the debiasing process. We defer the development of techniques to address this issue to future research.

Measuring Bias with Inference
Following Dev et al. (2020), we use the multi-genre co-reference-based natural language inference (MNLI) dataset for evaluating gender bias. This dataset contains sentence pairs where a premise must be neutral in entailment w.r.t. its hypothesis. If the predictions made by a classifier that uses word embeddings as features deviate from neutrality, it is considered biased. Given a set containing M test instances, let the entailment predictor's probabilities for the m-th instance for the entail, neutral and contradiction labels be respectively e_m, n_m and c_m. Dev et al. (2020) proposed the following measures to quantify bias: Net Neutral (NN), the average probability of the neutral label over all instances; Fraction Neutral (FN), the fraction of instances for which neutral is the most probable label; and Threshold:τ (T:τ), the fraction of instances whose neutral probability exceeds τ. For an ideal (bias-free) embedding, all three measures would be 1.

Table 3: Averaged scores over all layers in an embedding debiased at token-level, measured on SEAT tests.
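Assuming the standard definitions from Dev et al. (2020) (NN and T:0.7 are the ones reported in Table 2), these measures could be computed from the per-instance probability triples (e_m, n_m, c_m) as follows:

```python
def net_neutral(preds):
    """NN: mean probability of the neutral label over all M instances."""
    return sum(n for _, n, _ in preds) / len(preds)

def fraction_neutral(preds):
    """FN: fraction of instances where neutral is the most probable label."""
    return sum(1 for e, n, c in preds if n >= max(e, c)) / len(preds)

def threshold(preds, tau):
    """T:tau: fraction of instances whose neutral probability exceeds tau."""
    return sum(1 for _, n, _ in preds if n > tau) / len(preds)
```

All three approach 1 as the classifier's predictions approach perfect neutrality on the generated sentence pairs.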
In Table 2, we compare our proposed method against the non-contextualised debiasing method proposed by Dev et al. (2020), where they debias Layer 1 of the BERT-large model using an orthogonal projection to the gender direction during training and evaluation. In addition to the above-mentioned measures, we also report in Table 2 the entailment accuracy on the matched (in-domain) and mismatched (cross-domain) settings, denoted respectively by MNLI-m and MNLI-mm, to evaluate the semantic information preserved in the embeddings after debiasing.
We see that the proposed method outperforms non-contextualised debiasing (Dev et al., 2020) on NN and T:0.7, and its performance on the MNLI task is comparable to the original embeddings. This result further confirms that the proposed method not only debiases well but also preserves the pre-trained information. Moreover, it is consistent with the results reported in Table 1 and shows that debiasing all layers is more effective than debiasing only the first layer, as done by Dev et al. (2020).

The Importance of Debiasing All Layers
In Table 1, we investigated the bias at the final layer, but it is known that biases in contextualised embeddings are learned at each layer (Bommasani et al., 2020). Therefore, to investigate whether debiasing individual layers removes the biases in the entire contextualised embedding, we evaluate the debiased embeddings at each layer on the SEAT 6, 7 and 8 datasets and report the averaged metrics for the all-token, first-token and last-token methods in Table 3. We see that, on average, the first-token and last-token methods retain more bias than all-token. Therefore, we conclude that it is not enough to debias only the first and last layers, even in DistilBERT, which has a small number of layers. These results show that biases in the entire contextualised embedding cannot be reliably removed by debiasing only selected layers; rather, all layers must be debiased consistently.

Visualizing Debiasing Results
To further illustrate the effect of debiasing using the proposed all-token method, we visualise the similarity scores of a stereotypical word with the feminine and masculine dimensions as follows. First, for each target word t, its hidden state E_i(t, x) in the i-th layer of the model E for a sentence x is computed. Next, we average those hidden states across all sentences in the dataset that contain t to obtain Ê_i(t) = (1 / |T|) Σ_{x ∈ T} E_i(t, x). Likewise, we compute Ê_i(f) and Ê_i(m) respectively for each feminine (f) and masculine (m) word. Next, we compute s^f_i, the cosine similarity between each Ê_i(f) and the feminine vector v_i(f), and s^m_i, the cosine similarity between each Ê_i(m) and the masculine vector v_i(m). s^f_i and s^m_i are averaged over all layers in a contextualised embedding model to obtain s^f_Avg and s^m_Avg, which represent how much gender information each gender word contains on average.
We then compute the cosine similarity s^{t,f}_i between each stereotype word's averaged embedding Ê_i(t) and the feminine vector v_i(f). Similarly, we compute the cosine similarity s^{t,m}_i between each stereotype word's averaged embedding Ê_i(t) and the masculine vector v_i(m). We then average s^{t,f}_i and s^{t,m}_i over the layers in E, respectively, to compute s^{t,f}_Avg and s^{t,m}_Avg, which represent how much gender information each stereotype word contains on average. Finally, we visualise the normalised female and male gender scores, given respectively by s^{t,f}_Avg / s^f_Avg and s^{t,m}_Avg / s^m_Avg. For example, a zero s^{t,f}_Avg / s^f_Avg value indicates that t contains no female gender-related information, whereas a value of one indicates that it contains as much female gender information as a feminine word does on average. Figure 2 shows each stereotype word with its normalised female and male gender scores on the x and y axes respectively. For each word, a yellow circle denotes its original embedding, and a blue triangle denotes the result of debiasing using the all-token method.
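The normalised gender score computation can be sketched as follows, using one toy vector per layer; the inputs stand in for the per-layer averaged embeddings Ê_i and gender vectors v_i defined above:

```python
import numpy as np

def cos(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def normalised_gender_score(stereo_layer_embeds, gender_layer_embeds,
                            gender_vecs):
    """Average over layers i of cos(E_hat_i(t), v_i(gender)), divided by the
    same quantity computed for the gender words themselves, i.e.
    s^{t,g}_Avg / s^g_Avg."""
    s_t = np.mean([cos(t, v) for t, v in zip(stereo_layer_embeds, gender_vecs)])
    s_g = np.mean([cos(g, v) for g, v in zip(gender_layer_embeds, gender_vecs)])
    return s_t / s_g
```

By construction, a stereotype word whose layer-wise embeddings coincide with those of the gender words scores exactly one, while an embedding orthogonal to the gender vectors scores zero.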
We see that with the original embeddings, stereotypical words are distributed close to one, indicating that they are highly gender-specific. The debiased BERT, DistilBERT and ELECTRA embeddings have word distributions similar to their respective original embeddings, with an overall movement towards zero. For RoBERTa, on the other hand, the debiased embeddings remain spread from zero to around one. Moreover, for ALBERT, the debiased embeddings are close to zero but, unlike the original distribution, are tightly clustered around zero. This shows that RoBERTa and ALBERT do not retain the structure of the original distribution after debiasing: while ALBERT over-debiases the pre-trained embeddings of stereotypical words, RoBERTa under-debiases them. This trend is consistent with the results of the downstream evaluation tasks reported in Table 1.

Conclusion
We proposed a debiasing method for pre-trained contextualised word embeddings, operating at the token- or sentence-level. Our experimental results showed that the proposed method effectively removes discriminative gender-related biases, while preserving useful semantic information in the pre-trained embeddings. The results also showed that our method debiases more effectively than previous methods on the downstream evaluation task.