On Evaluating and Mitigating Gender Biases in Multilingual Settings

While understanding and removing gender biases in language models has been a long-standing problem in Natural Language Processing, prior research has primarily been limited to English. In this work, we investigate some of the challenges of evaluating and mitigating biases in multilingual settings, which stem from a lack of existing benchmarks and resources for bias evaluation beyond English, especially for non-Western contexts. We first create a benchmark for evaluating gender biases in pre-trained masked language models by extending DisCo to different Indian languages using human annotations. We then extend various debiasing methods to work beyond English and evaluate their effectiveness for SOTA massively multilingual models on our proposed metric. Overall, our work highlights the challenges that arise while studying social biases in multilingual settings and provides resources as well as mitigation techniques to take a step toward scaling to more languages.


Introduction
Large Language Models (LLMs) (Devlin et al., 2019; Brown et al., 2020; Raffel et al., 2020) have obtained impressive performance on a wide range of NLP tasks, showing great potential in several downstream applications for real-world impact. However, these models have been shown to be prone to picking up unwanted correlations and stereotypes from the pre-training data (Sheng et al., 2019; Kurita et al., 2019; Hutchinson et al., 2020), which can perpetuate harmful biases against people belonging to marginalized groups. While there has been a great deal of interest in understanding and mitigating such biases in LLMs (Nadeem et al., 2021; Schick et al., 2021; Meade et al., 2022), the focus of such studies has primarily been on English.
While Massively Multilingual Language Models (Devlin et al., 2019; Conneau et al., 2020; Xue et al., 2021) have shown impressive performance across a wide range of languages, especially with their surprising effectiveness at zero-shot cross-lingual transfer, there is still a lack of focused research on evaluating and mitigating the biases that exist in these models. This can lead to a lack of inclusive and responsible technologies for groups whose native language is not English, and can also lead to the dissemination of stereotypes and the widening of existing cultural gaps.
Past work on evaluating and mitigating biases in multilingual models has mostly been concerned with gender bias in cross-lingual word embeddings (Zhao et al., 2020; Bansal et al., 2021), which fail to account for contextual information (Kurita et al., 2019; Delobelle et al., 2022), making them unreliable for LLMs. Other methods for estimating biases in contextualized representations include Multilingual Bias Evaluation (MBE; Kaneko et al., 2022), which utilizes parallel translation corpora in different languages that might lack non-Western cultural contexts (Talat et al., 2022). For debiasing LLMs, Lauscher et al. (2021) proposed an adapter-based (Houlsby et al., 2019) approach. However, the biases are measured in the word representations and only English data was used for debiasing, missing out on cultural context for other languages.
To address these concerns, we make the following key contributions in our work. First, we extend the DisCo metric (Webster et al., 2020) by creating human-corrected templates for 6 Indian languages. DisCo takes sentence-level context into account while measuring bias, and our templates are largely culturally agnostic, making them more generally applicable. Second, we extend existing debiasing strategies like Counterfactual Data Augmentation (Zhao et al., 2018) and Self-Debiasing (Schick et al., 2021) to mitigate gender biases across languages in Masked Language Models (MLMs).
Finally, we also evaluate the transferability of debiasing MLMs from one source language to other target languages and observe limited transfer from English to languages lacking Western context. However, we do observe that typologically and culturally similar languages aid each other in reducing gender bias. While there have been multiple studies on measuring biases in multilingual models, previous work has not explored mitigating gender biases in these models for multiple languages or studied the transferability of debiasing across different languages. This is especially true for non-embedding-based approaches to evaluation and debiasing. To the best of our knowledge, ours is the first work to debias multilingual LLMs for different languages and measure the cross-lingual transfer of gender bias mitigation. To encourage future research in this area, we will release our code and datasets publicly (footnote 1).

Measuring Bias in Multilingual Models
In this section, we describe the benchmarks used to evaluate biases in MLMs across different languages. Since most existing benchmarks for bias evaluation in contextualized representations are designed for English, we discuss our multilingual variant of DisCo and the recently proposed MBE metric.

Multilingual DisCo
Discovery of Correlations (DisCo) is a template-based metric that measures unfair or biased associations of the predictions of an MLM with a particular gender. It follows a slot-filling procedure where, for each template, predictions are made for a masked token and then evaluated to assess whether there is a statistically significant difference in the top predictions across male and female genders. To calculate the bias score using DisCo, a χ² test is performed to reject the null hypothesis (at a significance level of 0.05) that the model has the same prediction rate in both male and female contexts. We use the modified version of the metric from Delobelle et al. (2022), which measures the fraction of slot-fills containing predictions with gendered associations (a fully biased model gets a score of 1, and a fully unbiased model gets a score of 0).
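As a rough illustration of the scoring step, the sketch below applies a one-degree-of-freedom χ² test to each candidate fill and reports the fraction of fills that are significantly associated with one gender. It is a simplified sketch and not the reference implementation: the function names (`chi2_pvalue_1dof`, `disco_score`), the pooling of fills into two flat lists, and the equal-expected-count null hypothesis are assumptions made for this example.

```python
import math
from collections import Counter


def chi2_pvalue_1dof(male_count, female_count):
    """p-value of a chi-square goodness-of-fit test against equal rates (1 dof)."""
    total = male_count + female_count
    if total == 0:
        return 1.0
    expected = total / 2.0
    stat = ((male_count - expected) ** 2 + (female_count - expected) ** 2) / expected
    # Survival function of the chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(stat / 2.0))


def disco_score(male_fills, female_fills, alpha=0.05):
    """Fraction of candidate fills whose prediction rate differs across genders.

    male_fills / female_fills: predicted words collected over templates filled
    with male and female names (hypothetical input format).
    """
    male_counts = Counter(male_fills)
    female_counts = Counter(female_fills)
    candidates = set(male_counts) | set(female_counts)
    if not candidates:
        return 0.0
    biased = sum(
        1
        for word in candidates
        if chi2_pvalue_1dof(male_counts[word], female_counts[word]) < alpha
    )
    return biased / len(candidates)


if __name__ == "__main__":
    male = ["engineer"] * 20 + ["cook"] * 10
    female = ["cook"] * 10 + ["nurse"] * 20
    print(disco_score(male, female))
```

In this toy input, "engineer" and "nurse" are each predicted for only one gender while "cook" is predicted equally often for both, so two of the three candidate fills count as gendered.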
We extend the Names variant of DisCo, as personal names can act as representatives for various socio-demographic attributes to capture cultural context (Sambasivan et al., 2021). Especially in India, surnames are a strong cultural identifier: the majority of Indian surnames indicate belonging to a particular caste, religion and culture. We use surnames from the specific cultures that speak the languages for which we prepare the name pairs. We further use these surnames to filter personal first names for both males and females from an open-source list containing a large number of popular Indian names (details in Appendix A.1), and transliterate the names from English into the corresponding languages for use in slot-filling. Further, unlike nouns and pronouns, which might be gender-neutral in some languages, names are indicative of gender to a large extent across cultures.
Footnote 1: https://aka.ms/multilingual-bias
Dataset Construction: We start with the 14 templates provided in Webster et al. (2020) and translate them using the Bing translation API into 6 Indian languages of varying resources. We use the class taxonomy from Joshi et al. (2020) to characterize language resources, where Class 5 represents the highest-resource and Class 0 the lowest-resource languages. Our set of Indian languages contains the Class 4 language Hindi (hi); the Class 3 language Bengali (bn); the Class 2 languages Marathi (mr) and Punjabi (pa); and the Class 1 language Gujarati (gu). A challenge while transferring templates from English to these languages is that, unlike in English, a common template might not be applicable to both genders. For example, the template "{PERSON} likes to {BLANK}" will have different translations in Hindi depending upon the gender of the slot fill for {PERSON}, as Hindi has gendered verbs. Hence, during translation we first filled the {PERSON} slot with a male and a female name to obtain two templates corresponding to each gender (see Figure 1). All the translated templates in our dataset were then thoroughly reviewed and corrected by human annotators who are native speakers of the languages (details in Appendix A.1).
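To make the gendered-template idea concrete, the following sketch stores one template as a male/female pair and fills both variants with a name pair and a mask token. The Hindi strings, the names, the `[MASK]` token, and the helper `fill_template_pair` are all illustrative assumptions for this example; they are not taken from the released dataset.

```python
MASK = "[MASK]"  # assumed mask token; the real one depends on the model's tokenizer

# Illustrative Hindi pair for "{PERSON} likes to {BLANK}": the verb ending
# agrees with the gender of {PERSON}. Example strings only, not dataset entries.
TEMPLATE_PAIR = {
    "male": "{PERSON} {BLANK} पसंद करता है",
    "female": "{PERSON} {BLANK} पसंद करती है",
}


def fill_template_pair(template_pair, male_name, female_name, mask=MASK):
    """Fill the {PERSON} slot with each name and replace {BLANK} with the mask."""
    names = {"male": male_name, "female": female_name}
    return {
        gender: template.replace("{PERSON}", names[gender]).replace("{BLANK}", mask)
        for gender, template in template_pair.items()
    }


if __name__ == "__main__":
    # Hypothetical name pair, already transliterated to Devanagari.
    filled = fill_template_pair(TEMPLATE_PAIR, "राहुल", "प्रिया")
    for gender, sentence in filled.items():
        print(gender, sentence)
```

Keeping the two variants side by side makes it easy to check that they differ only in the gendered terms, which is the property the annotators were asked to preserve.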

Multilingual Bias Evaluation (MBE)
We also evaluate MLMs with the MBE score proposed by Kaneko et al. (2022), which provides datasets for bias evaluation in 8 high-resource languages: German (de), Japanese (ja), Arabic (ar), Spanish (es), and Mandarin (zh) belonging to Class 5; Portuguese (pt) and Russian (ru) in Class 4; and Indonesian (id) in Class 3. For evaluation, it first considers parallel corpora from English to different languages and extracts the set of sentences containing male and female words. Next, the likelihood of each sentence is evaluated with the MLM, and the bias score is measured as the percentage of total pairs for which a male sentence gets a higher likelihood than a female sentence. Hence, a value close to 50 for an MLM indicates no bias towards either group, while greater or smaller values indicate a bias towards males and females respectively. For better interpretability of metrics, we report |50 − MBE| in our results.
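The scoring rule described above can be sketched in a few lines. The function names and the input format (two parallel lists of sentence log-likelihoods) are assumptions for illustration; how the likelihoods are obtained from the MLM is outside the scope of this sketch.

```python
def mbe_score(male_loglik, female_loglik):
    """Percentage of pairs in which the male sentence is more likely."""
    if len(male_loglik) != len(female_loglik) or not male_loglik:
        raise ValueError("expected two non-empty lists of equal length")
    preferred_male = sum(1 for m, f in zip(male_loglik, female_loglik) if m > f)
    return 100.0 * preferred_male / len(male_loglik)


def mbe_deviation(score):
    """Distance from the unbiased value of 50, i.e. |50 - MBE|."""
    return abs(50.0 - score)


if __name__ == "__main__":
    male = [-1.0, -2.0, -3.0, -4.0]
    female = [-2.0, -1.0, -4.0, -5.0]
    score = mbe_score(male, female)
    print(score, mbe_deviation(score))
```

Under this definition, a score above 50 means the male sentence was preferred in more than half of the pairs.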

Mitigating Bias in Multilingual Models
We next discuss how we extend bias mitigation techniques to work beyond English along with different fine-tuning and prompting strategies that we deploy in our experiments.

Counterfactual Data Augmentation (CDA)
CDA (Zhao et al., 2018) is an effective method for reducing biases picked up by language models during pre-training. It operates by augmenting an unlabeled text corpus with counterfactuals generated for each sentence based on a specific dimension like gender. As an example, the counterfactual for a sentence s = "The doctor went to his home" will be ŝ = "The doctor went to her home". The model is then fine-tuned on the augmented data, which helps balance out any spurious correlations that existed in the pre-training dataset.
To generate counterfactuals in English, we do word replacements on Wikipedia data using 193 gendered term pairs (e.g. {he, she}, {actor, actress}, etc.) following Lauscher et al. (2021). However, generating counterfactuals for languages other than English can be challenging, as acquiring term pairs requires recruiting annotators, which can be expensive for low-resource languages. Further, word replacement can prove unreliable for languages with grammatical gender agreement (like Hindi), producing ungrammatical sentences (Zmigrod et al., 2019).
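A minimal sketch of two-way word replacement for English is shown below. The term pairs are a small illustrative subset rather than the 193-pair list, and the helper names are hypothetical. The sketch also ignores known ambiguities (for example, "her" can correspond to either "his" or "him"), which a real pipeline would need to handle.

```python
import re

# Small illustrative subset of gendered term pairs (not the full 193-pair list).
TERM_PAIRS = [("he", "she"), ("his", "her"), ("actor", "actress"), ("father", "mother")]


def build_swap_table(pairs):
    """Map every term to its counterpart in both directions."""
    table = {}
    for male, female in pairs:
        table[male] = female
        table[female] = male
    return table


def counterfactual(sentence, table):
    """Swap every gendered term in the sentence, preserving initial capitals."""
    def swap(match):
        word = match.group(0)
        replacement = table.get(word.lower())
        if replacement is None:
            return word
        return replacement.capitalize() if word[0].isupper() else replacement

    return re.sub(r"[A-Za-z]+", swap, sentence)


def augment(corpus, table):
    """Return the corpus plus a counterfactual for each sentence that changes."""
    augmented = []
    for sentence in corpus:
        augmented.append(sentence)
        swapped = counterfactual(sentence, table)
        if swapped != sentence:
            augmented.append(swapped)
    return augmented


if __name__ == "__main__":
    table = build_swap_table(TERM_PAIRS)
    print(augment(["The doctor went to his home", "The sky is blue"], table))
```

Sentences without any gendered term are kept once, so the augmented corpus only grows where a counterfactual actually differs from the original.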

Generating Multilingual Counterfactuals:
We use a translation-based approach to obtain counterfactually augmented examples in different languages. We first select the sentences in the English Wikipedia corpus that contain India-related keywords. These keywords were extracted using ConceptNet (Speer et al., 2017) and relate to Indian food, locations, languages, religions, etc. Using these keywords, we select a set of 20K sentences to avoid under-representation of context specific to Indian culture. Moreover, generating counterfactuals for the whole corpus and fine-tuning MLMs for each of the languages would require substantial energy consumption (Strubell et al., 2019), so we use only the filtered set of 20K sentences for debiasing the MLMs. Further, we augment the list of 193 term pairs with pairs of Indian personal names. We align the male and female names through a greedy search that selects pairs with minimum edit distance. Finally, using the augmented term pairs list and the filtered data with Indian context, we generate counterfactuals using word replacements and translate the obtained data into the 6 Indian languages.
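The name-alignment step can be sketched as follows. This is one plausible reading of "greedy search with minimum edit distance", namely repeatedly taking the closest still-unpaired male/female names; the exact procedure, the function names, and the example names are assumptions for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def greedy_align(male_names, female_names):
    """Pair names greedily, closest pair first; each name is used at most once."""
    candidates = sorted(
        (edit_distance(m, f), m, f) for m in male_names for f in female_names
    )
    used_male, used_female, pairs = set(), set(), []
    for _, male, female in candidates:
        if male in used_male or female in used_female:
            continue
        pairs.append((male, female))
        used_male.add(male)
        used_female.add(female)
    return pairs


if __name__ == "__main__":
    # Hypothetical names in Latin script.
    print(greedy_align(["Amit", "Rajesh"], ["Rajeshwari", "Amita"]))
```

A greedy pass is not guaranteed to minimize the total distance over all pairs, but it is simple and keeps the most similar names together.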
Once we have obtained CDA data in different languages, we can utilize it to debias the model. We define CDA-S as a fine-tuning setup where the MLM is debiased using CDA data for languages belonging to the set S ⊂ L, where L = {en, hi, pa, bn, ta, gu, mr}. In particular, we explore the following classes of fine-tuning setups:
1. CDA-{en}: Fine-tune the model with English CDA data only (zero-shot debiasing).
2. CDA-{l}: Fine-tune the model with language l specific CDA data (monolingual debiasing).
3. CDA-{l, en}: Fine-tune the model with English and language l's CDA data (few-shot debiasing).
4. CDA-L \ {en}: Fine-tune the model with CDA data in all non-English languages (multilingual debiasing).
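For a given target language l, the four setups above correspond to four language sets S. The helper below spells them out; the function name and the dictionary keys are hypothetical and only mirror the notation used in the text.

```python
LANGUAGES = ("en", "hi", "pa", "bn", "ta", "gu", "mr")  # the set L from the text


def cda_language_sets(target, languages=LANGUAGES):
    """Return the language set S used by each CDA-S setup for one target language."""
    if target == "en" or target not in languages:
        raise ValueError("target must be a non-English language in L")
    return {
        "CDA-{en}": {"en"},                          # zero-shot debiasing
        "CDA-{l}": {target},                         # monolingual debiasing
        "CDA-{l, en}": {target, "en"},               # few-shot debiasing
        "CDA-L\\{en}": set(languages) - {"en"},      # multilingual debiasing
    }


if __name__ == "__main__":
    for name, subset in cda_language_sets("hi").items():
        print(name, sorted(subset))
```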

Self-Debiasing
Self-Debiasing (Schick et al., 2021) is a post-hoc method to reduce corpus-based biases in language models. It is based on the observation that pre-trained language models can recognize biases in text data fairly well, and it prepends the input text with prompts encouraging the model to exhibit undesired behavior. Using this, it identifies the undesirable predictions of the model as the ones whose likelihood increases when the prompt is provided, and suppresses them in the final predictions. We translate the English prompt "The following text discriminates against people because of their gender" into different languages and use the translations for bias mitigation (SD-l). We also experiment with using the English prompt for other languages (SD-en).
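The suppression step can be illustrated with a small sketch over toy probability distributions. It follows one reading of the rescaling rule in Schick et al. (2021): a token whose probability rises under the biased prompt is scaled by exp(λ·Δ), where Δ is the plain probability minus the prompted one. Treating ϵ as a lower bound on that scaling factor is an assumption of this sketch, as are the function name and the dictionary-based interface.

```python
import math


def self_debias(p_plain, p_prompted, decay=50.0, eps=0.01):
    """Rescale a token distribution using the distribution under a biased prompt.

    p_plain:    token -> probability given the original input
    p_prompted: token -> probability given the self-debiasing prompt + input
    decay:      decay constant (lambda); eps: assumed floor on the scaling factor
    """
    scaled = {}
    for token, prob in p_plain.items():
        delta = prob - p_prompted.get(token, 0.0)
        # Tokens that become MORE likely under the biased prompt (delta < 0)
        # are treated as undesirable and scaled down exponentially.
        alpha = 1.0 if delta >= 0 else max(math.exp(decay * delta), eps)
        scaled[token] = alpha * prob
    total = sum(scaled.values())
    return {token: value / total for token, value in scaled.items()}


if __name__ == "__main__":
    plain = {"nurse": 0.5, "doctor": 0.5}
    prompted = {"nurse": 0.7, "doctor": 0.3}
    print(self_debias(plain, prompted))
```

In the toy example, "nurse" gains probability under the biased prompt, so its share of the final distribution shrinks while "doctor" absorbs the remaining mass.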

Results
We evaluate the Out Of Box (OOB) biases as well as the effect of applying the aforementioned debiasing techniques in multilingual MLMs like XLMR-base (Conneau et al., 2020), IndicBERT (Kakwani et al., 2020), and mBERT (cased) (Devlin et al., 2019) using our multilingual DisCo metric. Additionally, we evaluate language-specific monolingual models (see Table 3 in the appendix) and XLMR on the MBE score.
Comparison Between Different Fine-tuning Setups for CDA: We first compare the results of bias mitigation across all 4 classes of fine-tuning setups for CDA to understand the effect each has on the final bias reduction. As can be seen in Table 1, even though zero-shot transfer from English (CDA-{en}) results in some reduction in biases compared to the models without any debiasing (OOB), most of the other fine-tuning setups that use language-specific counterfactuals yield larger drops in the DisCo score. Specifically, few-shot debiasing (CDA-{l, en}) and multilingual debiasing (CDA-L \ {en}) consistently perform the best for both models, with CDA-L \ {en} performing slightly better for XLMR and substantially better for IndicBERT. This shows that even though the language-specific counterfactuals were obtained through translation, using them to debias the models led to considerable bias reduction. We also observe that monolingual debiasing (CDA-{l}) leads to a drop similar to that of CDA-{en}, and we conjecture that this might be attributed to the small amount of data we have for debiasing in languages other than English. Further, the dominant performance of CDA-L \ {en} highlights that languages from a similar culture can collectively help reduce biases in such models. We observe similar results for mBERT, which are provided in Table 4 in the appendix.
Comparison Between CDA and Self-Debiasing: In contrast to CDA, Self-Debiasing shows different bias mitigation trends for Indian languages. Table 1 shows that for both multilingual MLMs, the overall bias ends up increasing when Self-Debiasing is applied, and by a considerable amount for IndicBERT. This seems to be in contrast to past work (Meade et al., 2022) that shows Self-Debiasing to be the strongest debiasing technique. However, we will see next the cases where it can indeed be effective in reducing biases.
Figure 2: MBE scores for monolingual and multilingual models and the impact of debiasing across languages.

Evaluation on MBE Metric:
We first investigate the effect of Self-Debiasing on monolingual models when evaluated on the MBE metric. As can be observed in Figure 2a, for most languages (except Russian and Spanish), both variants of Self-Debiasing manage to reduce the biases substantially. However, when we compare the results on a multilingual model, i.e. XLMR, in Figure 2b, we again observe the same phenomenon as for multilingual DisCo, where the biases tend to increase upon applying Self-Debiasing. Figure 2a shows that SD-en and SD-l have similar debiasing performance for monolingual models. It is intriguing that monolingual models are able to debias so well based on English prompts. This similarity in results with non-English and English prompts could possibly be explained by contamination in the monolingual pre-training data (Blevins and Zettlemoyer, 2022).
We also compare the effect of CDA-{en} on reducing the biases and observe that it is more successful in most languages (except Spanish and Japanese). Even though MBE and Multilingual DisCo have different experimental setups, obtaining consistent results with the two metrics, such as English-only debiasing being insufficient to reduce biases in other languages and Self-Debiasing being ineffective for mitigating biases in multilingual models, strengthens the applicability of our results. Our results indicate that Self-Debiasing might be limited for multilingual models, and we leave the investigation of this phenomenon to future work.

Conclusion
In this work, we investigated gender biases in multilingual settings by proposing a bias evaluation dataset in 6 Indian languages. We further extended debiasing approaches like CDA and Self-Debiasing to work for languages beyond English and evaluated their effectiveness in removing biases across languages in MLMs. One of our key findings is that debiasing with English data might provide only a limited bias reduction in other languages, and that even collecting a limited amount of counterfactual data through translation can lead to substantial improvements when jointly trained with such data from similar languages. Finally, we showed that despite being effective on monolingual models, Self-Debiasing is limited in reducing biases in multilingual models, often resulting in an increase in overall bias. We hope that our work will act as a useful resource for the community to build more inclusive technologies for all cultures.

Limitations
The present study is limited to exploring biases in MLMs for the gender dimension only. In future work, other important dimensions can be explored, especially for non-Western contexts, such as caste, ethnicity, etc. (Ahn and Oh, 2021; Bhatt et al., 2022).
We also used machine translation on English counterfactuals to obtain CDA data in each language in our dataset. Translations are prone to errors and issues like translationese (Gellerstam, 1986), especially for the lower-resource languages, and can therefore make the quality of the generated counterfactuals unreliable. In the future, we would like to explore learning generative (Wu et al., 2021) or editing models (Malmi et al., 2022) for automatically generating gender counterfactuals given text data in different languages. This can help us scale our counterfactual generation process to a much higher number of samples while also avoiding any losses in quality that may arise due to machine translation. Our multilingual DisCo metric is currently limited to 6 Indian languages, and we hope our work will inspire further extension to cover different language families, improving the focus on multilingual bias evaluation.

A Appendix
A.1 Dataset Construction Details
Scraping Language-Specific Personal Names: We curated a list of personal names corresponding to the cultures for each language by scraping the popular surnames associated with each culture from Wikipedia. We then obtained open-source lists of Indian male and female names, and segmented the names into different languages by referring to our culture-specific surnames list. The names obtained this way are in Latin script, so we transliterate them to the corresponding languages using the Bing Translator API.
Annotator Details: To verify the templates obtained using machine translation, we asked human annotators to correct them. Our annotators were colleagues working at our research lab; all of them were of South Asian (Indian) descent, native to different parts of India, and each had one of the six Indian languages that we consider as their L1. They all identify as male and are in their mid-20s. The annotators were provided the original English templates along with the translated ones in their native language and were asked to verify that they were grammatically correct and conveyed the exact same meaning as the original base template. Further, they were asked to make corrections to ensure that the templates in a pair were as close to each other as possible except for modifications in the gendered terms, like verbs in the case of Hindi (Figure 1).
Dataset Statistics: Our dataset consists of 14 templates in each language, and the number of name pairs for each language is given in Table 2.

Table 2: Total number of gendered name pairs for each language used in Multilingual DisCo.
Language | Number of Name Pairs
Hindi | 164
Punjabi | 50
Bengali | 33
Gujarati | 51
Tamil | 19
Marathi | 49

A.2 Experimental Setup
We performed all our experiments on a single A100 GPU. For the fine-tuning setup CDA-{en}, we trained for 50K steps using a batch size of 32, a learning rate of 2e-5, and a weight decay of 0.01. We follow the same hyperparameters for the other fine-tuning setups as well, but instead of fine-tuning for 50K steps, we train for 1 epoch following (Lauscher et al., 2020), as the amount of data is limited in other languages. For Self-Debiasing, we used the default hyperparameters, i.e. the decay constant λ = 50 and ϵ = 0.01. For all of our experiments, we used the pre-trained models provided with HuggingFace's transformers library (Wolf et al., 2020). The details of all the pre-trained models that we use in the paper are provided in Table 3.
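For convenience, the hyperparameters stated above can be collected in a small configuration object. The class and field names below are hypothetical and are not tied to any particular training library; only the numeric values come from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CDAFineTuneConfig:
    """Hyperparameters for CDA fine-tuning (field names are illustrative)."""
    batch_size: int = 32
    learning_rate: float = 2e-5
    weight_decay: float = 0.01
    max_steps: Optional[int] = None   # set for the English-only setup
    num_epochs: Optional[int] = None  # set for all other setups


# CDA-{en}: trained for a fixed number of steps.
ENGLISH_ONLY = CDAFineTuneConfig(max_steps=50_000)
# Other CDA setups: one epoch, since less data is available.
OTHER_SETUPS = CDAFineTuneConfig(num_epochs=1)
# Self-Debiasing defaults reported in the text.
SELF_DEBIASING = {"decay_constant": 50, "epsilon": 0.01}

if __name__ == "__main__":
    print(ENGLISH_ONLY)
    print(OTHER_SETUPS)
    print(SELF_DEBIASING)
```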

Figure 1: Example template translation for "{PERSON} likes to {BLANK}" in Hindi for creation of our multilingual dataset.

Table 1: Multilingual DisCo metric results (score of 1 being fully biased and 0 being fully unbiased) of debiasing using CDA and Self-Debiasing using various fine-tuning settings on different languages. Refer to Table 4 for the full version of the results.