Mitigating Language-Dependent Ethnic Bias in BERT

In this paper, we study ethnic bias and how it varies across languages by analyzing and mitigating ethnic bias in monolingual BERT for English, German, Spanish, Korean, Turkish, and Chinese. To observe and quantify ethnic bias, we develop a novel metric called the Categorical Bias score. We then propose two methods for mitigation: first, using a multilingual model, and second, using contextual word alignment of two monolingual models. We compare our proposed methods with monolingual BERT and show that these methods effectively alleviate ethnic bias. Which of the two methods works better depends on the amount of NLP resources available for the language. We additionally experiment with Arabic and Greek to verify that our proposed methods work for a wider variety of languages.


Introduction
Ethnic (or national) bias, an over-generalized association of an ethnic group with particular, often negative attributes (Brigham, 1971; Ghavami and Peplau, 2013), is one of the most prevalent social stereotypes. Compared to gender and racial bias, ethnic bias tends to depend more on the cultural context (Cuddy et al., 2009; Fiske, 2017), as anyone could step outside of their ethnic background (e.g., by moving to a different country) and suddenly belong to a minority group. In studying various aspects of large-scale language models (LMs), there are many studies on gender and racial bias (Bolukbasi et al., 2016; Caliskan et al., 2017; Garg et al., 2018; May et al., 2019; Manzini et al., 2019; Kurita et al., 2019), but relatively few on ethnic bias. Figure 1 shows examples of how monolingual BERT models in English, German, and Korean fill in masked ethnicity words. The different predictions by BERT imply that the ethnic bias in these three BERT models reflects the historical and social context of the countries in which they are used. For example, the current political climate in Germany and the US is hostile toward Iraq, and Korea was occupied and ruled by Japan in recent history, and those negative contexts are reflected as ethnic biases in German (DE-1), English (EN-1), and Korean (KO-1). There are also instances of ethnic bias shared across languages; we can see an example in EN-2, DE-2, and KO-2, where Somalia and Cuba appear within the top three in all three languages. In addition to these three, we study three more languages: Spanish, Turkish, and Chinese.
To quantify and mitigate ethnic bias, we propose a scoring metric called the Categorical Bias (CB) score and two mitigation methods: 1) using a multilingual model and 2) aligning two monolingual models. We suggest two separate solutions because of the relatively poor performance of the multilingual model on low-resource languages (Wu and Dredze, 2020). The first solution, using the multilingual BERT model, works well for Chinese, English, German, and Spanish, languages that are resource-abundant. The alternative solution leverages alignment with the English embedding space, and it reduces the bias score for Korean and Turkish, relatively low-resource languages.
Extensive experiments with six languages (English, German, Spanish, Korean, Turkish, and Chinese) demonstrate that our proposed solutions work well for mitigation. We conduct an ablation study to find out which part of the treatment contributes most significantly to bias mitigation. Moreover, we demonstrate that the bias mitigation methods do not result in a performance drop on downstream tasks. Finally, we validate the mitigation techniques with two additional languages (Arabic and Greek).
Our contributions can be summarized as follows:
• We suggest the CB score, a multi-class bias measure based on log probability, to quantify the degree of ethnic bias in language models.
• We reveal the language-dependent nature of ethnic bias.
• We present two simple and effective bias mitigation methods: one with the multilingual model, and the other with contextual word alignment and fine-tuning. (Our code and data are available at https://github.com/jaimeenahn/ethnic_bias.)

Ethnic Bias
Defining ethnic bias and differentiating it from national bias is very difficult, and in the language models that we look at, it is only possible to lump together ethnic bias and national bias. Furthermore, it is difficult to work with fine-grained ethnicity (e.g., "Navajo nation") that is not well represented in the large-scale text corpora used to train LMs in various languages, so we limit the scope of our research to coarse-grained ethnic groups. We note that this ambiguous and limited definition of ethnic bias is not ideal, but it is common practice in the social science literature (Brigham, 1971; Bar-Tal, 1997; Madon et al., 2001; Kite and Whitley Jr, 2012). We look deeply into ethnic bias because it is prevalent in datasets that consist of everyday language (Kite and Whitley Jr, 2012), and we conjecture that the bias in the datasets results in similar bias in the models trained on those datasets.

Table 1 shows how the training data and the model's predictions are biased in a toxicity classification dataset. The training set contains higher proportions of sentences labeled as toxic among those containing the words "Afghanistan," "Iraq," or "Iran," almost twice the proportion of those containing "France," "Ireland," or "Italy." For the test set, we run a basic BERT classifier with high accuracy based on publicly available code, and the result shows that the model predicts non-toxic comments as toxic when they contain Middle Eastern country names. We can clearly see that the false positive rates (FPR: the percentage of sentences predicted as toxic when the ground truth is not toxic) are much higher for sentences containing "Afghanistan," "Iraq," or "Iran." These results illustrate that significant ethnic bias exists in both the datasets and commonly used language models.
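For concreteness, the FPR used here can be computed as in the following minimal sketch; the variable names are ours, not the classifier code used in the paper.

```python
def false_positive_rate(preds, labels):
    """FPR: fraction of non-toxic sentences (label 0) that the
    classifier predicts as toxic (pred 1)."""
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    negatives = sum(1 for y in labels if y == 0)
    return fp / negatives if negatives else 0.0

# e.g., compute this over the subset of test sentences containing "Iraq"
# and compare against the subset containing "France".
```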

Measuring Ethnic Bias
We define ethnic bias in BERT as the degree of variance of the probability of a country name given an attribute in a sentence without any relevant clues. For example, given the sentence template "People from [MASK] are [attribute]," the probabilities of various ethnicity words replacing [MASK] should follow the prior probabilities of those words and not vary significantly depending on the attribute.

Normalized probability
Given the conceptual description above, we formally define the normalized probability used in our ethnic bias metric.

Figure 2: The bias metrics for two target groups (a) and three or more groups (b). For both metrics, the bias metric is based on the normalized probabilities of the target terms replacing the mask token. The difference is that when there are two target groups, the score is the difference of the normalized probabilities, and when there are more than two target groups, the score is the variance of the normalized probabilities.
The metric is based on the change of probability of the target words given the presence or absence of an attribute word, expressed as the normalized probability P = p_tgt / p_prior. Let us illustrate with an example of measuring gender bias with the sentence "[MASK] is a nurse," from which we can draw the probabilities of the target words (p_tgt(he) and p_tgt(she)) in the place of the mask token. The attribute word is also masked to produce "[MASK] is a [MASK]," from which p_prior(he) and p_prior(she) are drawn. Even if p_tgt(he) and p_tgt(she) are similar, if p_prior(he) is higher, then she is more strongly associated with the attribute nurse. The difference in this normalized probability can be used to measure bias as an effect size, the Cohen's d between the two target groups (X, Y) computed on the log of P. Again, this normalized probability does not measure the probability of a word occurring, but rather indirectly measures the association between the target and the attribute.
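As a concrete illustration, the normalized probability can be computed with a masked LM from the transformers library. This is a minimal sketch: the checkpoint choice and the helper mask_prob are ours, not the paper's released code.

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def mask_prob(sentence, word):
    """Probability of `word` filling the first [MASK] in `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    # position of the first [MASK] token
    mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = logits[0, mask_pos].softmax(dim=-1)
    return probs[tokenizer.convert_tokens_to_ids(word)].item()

# p_tgt: attribute present; p_prior: attribute masked as well.
p_tgt = mask_prob("[MASK] is a nurse.", "she")
p_prior = mask_prob("[MASK] is a [MASK].", "she")
normalized_p = p_tgt / p_prior
```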

Categorical Bias Score
We generalize the metric above to multi-class targets and propose the Categorical Bias (CB) score, defined as the variance of the log normalized probabilities (see Figure 2b). With the set of templates T = {t_1, t_2, ..., t_m}, the set of ethnicity words N = {n_1, n_2, ..., n_n}, and the set of attribute words A = {a_1, a_2, ..., a_o}, we define

CB = (1/|T|) (1/|A|) Σ_{t∈T} Σ_{a∈A} Var_{n∈N} (log P(n; t, a)),

where P(n; t, a) is the normalized probability of ethnicity word n in template t with attribute a. Note that the CB score with |N| = 2 is equivalent to the bias metric of Kurita et al. (2019). We add another step to the CB score by adopting the whole word masking strategy (Cui et al., 2019) for cases when a word is divided into several tokens: we add as many mask tokens as the number of WordPiece tokens and aggregate each token's probability by multiplying, so the probability of a word is the product of the probabilities of its W subword tokens.
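As a sketch, the CB score can be computed from any function that returns the normalized probability (e.g., one built on the mask_prob sketch above); norm_prob is our placeholder name.

```python
import numpy as np

def cb_score(norm_prob, templates, ethnicities, attributes):
    """Categorical Bias: variance over ethnicity words of the log
    normalized probability, averaged over templates and attributes.
    For multi-token ethnicity words, `norm_prob` is assumed to insert
    one [MASK] per WordPiece token and multiply the per-token
    probabilities (whole word masking)."""
    per_cell = []
    for t in templates:
        for a in attributes:
            log_p = [np.log(norm_prob(t, n, a)) for n in ethnicities]
            per_cell.append(np.var(log_p))
    return float(np.mean(per_cell))
```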
The CB score is based on the assumption that no ethnicity word has a remarkably different normalized probability compared to the others. Hence, if the model assigns uniform normalized probabilities to all target groups, the CB score is 0. On the contrary, a model with high ethnic bias assigns a significantly higher normalized probability to a particular ethnicity word, and the CB score is correspondingly high.

Mitigation
As ethnic bias varies across languages, we try to find a general mitigation technique that can be used in various languages. We propose two solutions: multilingual BERT (M-BERT) and contextual word alignment.

Method 1: Multilingual BERT
We suggest M-BERT as the first mitigation method for ethnic bias. The intuition is that the minority ethnic groups subject to bias vary across languages, and training M-BERT on multiple languages in one embedding space may have the effect of counterbalancing the ethnic bias in each monolingual BERT. One concern is that M-BERT is known for performance degradation on relatively low-resource languages such as Korean and Turkish, whose Wikipedias are about 10% the size of the German Wikipedia and 3% of the English Wikipedia (Wu and Dredze, 2020).

Method 2: Contextual Word Alignment
We propose a second approach for languages that are relatively low-resource: contextual word alignment of two monolingual BERTs (Conneau et al., 2020). Based on the findings of Lauscher and Glavaš (2019), the amount and targets of bias vary depending on the corresponding monolingual word embedding space, so we expect that alignment to a language with less bias (i.e., a low CB score) helps to alleviate the bias.
Following previous methods (Conneau et al., 2020), we compute the alignment matrix from anchor words. First, we compute the anchor points using fast_align (Dyer et al., 2013) and a parallel corpus. Then, with the contextual representation of each anchor token in the two languages, we compute the mapping with the Procrustes approach (Smith et al., 2017). Lastly, we compute the orthogonal transformation matrix between X, the contextual representations from the source language, and Y, those from the language with a low CB score, as follows:

W* = argmin_{W ∈ O_d} ||WX - Y||_F = UV^T, where UΣV^T = SVD(YX^T).

A major difference from Conneau et al. (2020) is that the aligned model still needs a fine-tuning stage. The original contextual word alignment uses a task-specific layer of the target language, but in this work, we merely move the source embeddings into the embedding space of the target language. That is, we still use the MLM head of the source language on top of embeddings in the target space. As a consequence, we must fine-tune the MLM head using an additional corpus in the source language to fit it to the target embedding space. To preserve the alignment, we freeze BERT and the alignment matrix W during fine-tuning.
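A minimal sketch of the closed-form Procrustes step, assuming X and Y are row-wise matrices of anchor-token representations already paired by fast_align (the function name is ours; this is the row-vector equivalent of the WX ≈ Y formulation above):

```python
import torch

def procrustes_align(X, Y):
    """Orthogonal Procrustes (Smith et al., 2017): find orthogonal W
    minimizing ||X W - Y||_F, where rows of X are source-language
    anchor representations and rows of Y are the aligned anchors from
    the low-CB-score language (English in our setting)."""
    U, _, Vt = torch.linalg.svd(X.T @ Y)   # SVD of X^T Y (d x d)
    return U @ Vt                          # X @ W approximates Y

# During the subsequent MLM fine-tuning, BERT's parameters and W are
# frozen (requires_grad = False); only the MLM head is updated.
```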

Experiments
We employ a template-based approach that assesses the association between pre-defined ethnicities and social positions (May et al., 2019; Kurita et al., 2019). We generate ten semantically equivalent sentence templates, five singular and five plural, designed not to contain any clues for inferring the ethnicity. We make a set of thirty ethnicities and seventy social positions, such as occupations (e.g., computer programmer, professor) (He et al., 2019) and legal statuses (e.g., immigrant, refugee). The templates, ethnicities, and attributes are machine-translated into five languages and revised by professional translators: Korean (KO), German (DE), Chinese (ZH), Spanish (ES), and Turkish (TR). If a language has no structural difference between the singular and plural forms, as in Chinese, some translated templates may be identical; in those cases, we exclude the redundant templates. We list the templates, ethnicities, and attributes in Appendix A.
We use various BERT models and datasets in the experiments. The baseline models are six monolingual base-uncased BERT models available in the transformers library. We verify our mitigation methods on six languages: English, German, Spanish, Korean, Turkish, and Chinese. We use XNLI (Conneau et al., 2018) and KorNLI (Ham et al., 2020), a Korean translation of XNLI, as anchor points for the alignment matrix W. The corpora used in fine-tuning vary depending on the language and are listed in Appendix B. Following the masking strategy in previous work (Devlin et al., 2019), we set the maximum sequence length to 128, the batch size to 16, and the learning rate to 1e-4, and use the Adam optimizer (Kingma and Ba, 2015). We freeze BERT and the alignment matrix W and fine-tune for two epochs, by which point the loss no longer drops drastically.
As a baseline, we experiment with Counterfactual Data Augmentation (CDA) (Lu et al., 2020; Zhao et al., 2018), which balances the training data by augmentation. We use one-sided CDA and replace the ethnicity terms in the training data (e.g., replace "Mexico" with "China," "France," "Egypt," etc.) so that we balance the number of ethnicity terms in the training data for fine-tuning. Other than CDA, we cannot compare with bias mitigation approaches that work in the embedding space because our measurement is based on probability.
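As an illustration of one-sided CDA, the following sketch duplicates each training sentence with the ethnicity term swapped; the abbreviated ethnicity list is hypothetical.

```python
ETHNICITIES = ["Mexico", "China", "France", "Egypt"]  # abbreviated, illustrative

def one_sided_cda(sentences):
    """For every sentence containing an ethnicity term, add copies with
    the term replaced by each other ethnicity, so all terms appear
    equally often in the fine-tuning corpus."""
    augmented = list(sentences)
    for s in sentences:
        for e in ETHNICITIES:
            if e in s:
                augmented += [s.replace(e, other)
                              for other in ETHNICITIES if other != e]
    return augmented
```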
For downstream tasks, we check the performance of the proposed method on named-entity recognition (NER) in each language. We generally follow the suggested settings and hyperparameters for a fair comparison. More information about the corpora, models, and fine-tuning is available in Appendix B.

Results & Discussion
In this section, we describe the results and discuss them. First, we show the presence of ethnic bias and its variation across monolingual BERT models. Next, we quantify and inspect the effectiveness of the two mitigation methods using the CB score. We verify the efficacy of alignment with a downstream task and an ablation study and show the effect of mitigation. Finally, we show the benefits of the mitigation techniques in two additional languages.

Language Dependency
Result Figure 3 shows that the normalized probability distributions of ethnicity words associated with the attribute word "enemy" differ depending on the language. In English, America shows up with the highest probability, followed by Iraq, Syria, and Russia. The result for German is similar to English, in the order of America, Vietnam, Iraq, and China. A common result in English, German, and Spanish is that Middle Eastern nations always rank high, especially Iraq, which is always one of the top-ranked candidates.
The distributions for languages that are relatively distant from English are significantly different. For example, in Korean, the highest-probability word is Japan, followed by Israel, Vietnam, and China. Likewise, in Turkish and Chinese, the two languages point to each other. Overall, the results show that ethnic bias in monolingual BERT varies across languages, in general agreement with the findings in social science that ethnic bias is culture-specific (Fiske, 2017).

We also quantify how much ethnic bias varies across languages for monolingual BERT and multilingual BERT. Given the templates and predefined attributes, we measure the Jensen-Shannon Divergence (JSD) between the normalized probability distributions of ethnicity words in the LMs of the six languages. The results in Table 3 reveal that in both monolingual BERT and M-BERT, there are significant differences in ethnic bias across the LMs, noting that 0 ≤ JSD ≤ ln(2). A simple example is the comparison between the two pairs English-German and English-Korean: for both the monolingual and multilingual models, the JSD of English-German is much lower than the JSD of English-Korean.
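The JSD here is computed with the natural logarithm, which gives the ln(2) upper bound; a minimal sketch:

```python
import numpy as np
from scipy.stats import entropy  # KL divergence, natural log

def jsd(p, q):
    """Jensen-Shannon Divergence between two normalized-probability
    distributions over the same ethnicity words; 0 <= JSD <= ln(2)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    return 0.5 * entropy(p, m) + 0.5 * entropy(q, m)
```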

Discussion
We have shown that ethnic bias varies across the six languages we studied. This may be due to differences in cultural context in the language corpora, as language and culture are entangled (Hovy and Yang, 2021). English is a special case, as it is spoken in many countries with different cultures (Crystal, 2018). Moreover, the sources of the datasets for training English LMs are not restricted to those countries. Thus, the results produced by each monolingual model may be affected by many cultures, and it is very difficult to observe culture-specific bias. Nevertheless, we still showed empirical evidence of the language-dependent nature of ethnic bias.

Mitigation Result
The results of the two mitigation techniques and the ablation study are summarized in Table 2.
Method 1: Multilingual BERT We measure the CB score on the original monolingual models and the multilingual model without fine-tuning. Table 4 shows that the original M-BERT helps to greatly reduce ethnic bias for English, German, and Spanish. For Korean, Turkish, and Chinese, we see an increase in the CB scores. This result confirms the findings in Wu and Dredze (2020) about the limitation of the multilingual model on languages with insufficient corpora. Although Chinese is one of the resource-rich languages, ethnic bias is not mitigated with M-BERT. But in the end, Table 2 shows that M-BERT with fine-tuning generally performs well.

Method 2: Contextual Word Alignment We evaluate contextual word alignment, which mainly targets languages such as Korean and Turkish for which the M-BERT mitigation is not very effective. Next, we verify the effects of alignment and fine-tuning with an ablation study. Table 2 presents results showing that both fine-tuning and alignment with English contribute to bias mitigation. Models with proper alignment are mostly better than models with randomly initialized alignment or no alignment. Together, these results verify that contextual word alignment can be used as an effective solution to mitigate bias for all monolingual models.
We also try alignment in the opposite direction, aligning English to each of the other languages. Table 5 shows that aligning with a higher-bias language increases the CB scores, and that training M-BERT with an additional corpus is the best option among the model variants.
To test whether alignment degrades the quality of the BERT models, we conduct downstream tasks for each language. Table 6 shows that even with alignment, the downstream task performance is comparable to the original BERT under the same conditions. In all five languages, the performance is lower than the best performance, but when the BERT model is frozen, the difference in performance between the aligned and unaligned models is insignificant.
In the absence of previous work on ethnic bias, we use a manually crafted list of targets and attributes, naturally leaving out some ethnicities and attributes. We seek to verify the generalizability of our method with respect to the list of targets and attributes by adding five targets and five attributes. Table 7 shows that the overall CB score changes only slightly with the additional targets or attributes, but we observe the same pattern that the CB score decreases with the alignment method. In future research, we will experiment with larger and more systematically constructed lists of targets and attributes.
Case study: After Alignment We show the results of mitigation by comparing the distributions with the examples in Figure 3. Figure 4 shows the changed distributions of all five languages after applying the contextual word alignment approach with English. Overall, except for Chinese, the association of the top-ranked ethnicity is significantly reduced, and the distribution becomes more uniform than before. The distribution of normalized probability and the mitigated result for another example in Figure 3 are available in Appendix E.

Discussion
We measure ethnic bias based on how sensitively the probability of an ethnicity changes depending on the presence or absence of attribute words, and introduce methods to mitigate these variations. In this process, we found that English had the lowest CB score. This can be explained in two ways: (1) English is used in many different cultures, and (2) English has been established as a common language, so there is sufficient data from various cultures. After mitigation, the resulting distribution becomes more uniform, so the overall CB score decreases. However, just like the limitation of previous research on gender bias (Liang et al., 2020b; Cheng et al., 2021), there may be cases in which the ethnicity with the highest probability changes, for example from Japan to China in Figure 4.

Additional Languages
We also experiment with Arabic (AR) and Greek (EL) to validate the mitigation techniques in more languages. Unlike the previous languages, we translate the templates and the lists of targets and attributes into Arabic and Greek with Google Translate only, without human revision. Table 8 shows that, even in Arabic and Greek, BERT with contextual word alignment performs best in terms of CB score. In both languages, the multilingual model scores a much higher CB score than the monolingual models.

Related Work

Another way of mitigation is data augmentation (Zhao et al., 2018; Park et al., 2018; Dinan et al., 2020), for example by gender swapping in the coreference resolution task (Zhao et al., 2018). A third method is re-training with constraints that can mitigate bias (Zhao et al., 2017; Zhang et al., 2018; Jia et al., 2020; Liu et al., 2020), but this comes with the difficulty of re-training. Transfer learning is another option: Liang et al. (2020c) fine-tune a multilingual LM on English and show its efficacy on Chinese as well. Zhao et al. (2020) reveal the presence of gender bias in multilingual word embeddings and propose a mitigation method using alignment. We propose two bias mitigation methods that are shown to be effective for multiple languages. The first is fine-tuning a multilingual LM, which works well for high-resource languages. The second is aligning a monolingual LM with another monolingual LM that has a lower level of bias, which works well for relatively low-resource languages. It is important to develop mitigation approaches like these that can be applied to a wide variety of languages.

Conclusion & Future Work
In this paper, we studied language-dependent ethnic bias in BERT. To quantify ethnic bias, we introduced the Categorical Bias (CB) score. We showed the language-dependent nature of ethnic bias and proposed two mitigation strategies: the multilingual model, and contextual word alignment with English, which has the lowest CB score. For resource-rich languages, the multilingual model alone can mitigate the bias, or fine-tuning the multilingual model can effectively decrease the bias. For all languages, the alignment approach reduces bias, and it is the better solution for low-resource languages.
Most of the research on bias is limited to English, and our work contributes to studying bias in multiple languages, including relatively low-resource languages. Our study shows the variation of ethnic bias across languages with the same set of templates and attributes translated into multiple languages. One limitation of our study is that we did not include all languages and all fine-grained ethnicities. As our study focuses on the language-dependent characteristics of ethnic bias and depends on publicly available monolingual language models, we are unable to employ a fine-grained scope of ethnicity, which may be under-represented. Hence, we leave it as future work to use templates, attributes, and ethnic groups that are more suitable for each language so that we can conduct in-depth studies on bias in many languages, especially low-resource languages.

Acknowledgement
This work has been financially supported by the Engineering Research Center Program through the National Research Foundation of Korea (NRF) funded by the Korean Government MSIT (NRF-2018R1A5A1059921).

Ethical Considerations
In this paper, we empirically show that BERT contains significant ethnic bias and that our proposed methods mitigate some amount of this bias. Our proposed methods might help to alleviate ethnic bias in language models in real-world applications. However, there are four ethical issues that we want to state explicitly. First, a monolingual model does not represent all the people and ethnic groups speaking that language. Even though we revealed the ethnic stereotypical behavior of each monolingual model in six languages, it does not mean that the languages and the people using them are also biased. Similarly, depending on the language, the number of ethnic groups speaking it varies significantly. Moreover, since the data used in training language models are mainly based on texts from the Internet, language models are more likely to represent and reflect only a skewed population of the language users (Bender et al., 2021).
The next problem may be raised by our scope of ethnicity. As a broad, nation-level sense of ethnic group is used in this paper, it may be too broad to capture distinct peoples' cultural backgrounds. It might pose a problem of under-representing minorities in nations where many cultures coexist or have been forcibly incorporated into the nation. Nevertheless, the reason we use this broader scope of ethnicity is that it was inevitable in order to set up a range of ethnic groups that could occur in all languages and to show its characteristics to warn about the ethnic stereotypical behavior of pretrained models. Thus, a future direction of this research should be a deep analysis of ethnic bias with a narrow scope of ethnicity in a specific language.

Third, there is a possibility of a side effect of the proposed methods. Our goal is to minimize the overall CB score. In trying to achieve this goal, the ethnicity with the highest probability may change, for example from Japan to Germany in Korean (KO) in Figure 4. This side effect occurs in other "debiasing" techniques as well. For example, previous research related to gender bias (Liang et al., 2020a; Cheng et al., 2021) alleviates the overall SEAT score, but it sometimes results in a sign inversion of the effect size, specifically when a positive effect size becomes negative. This means that a male-dominant association changes to a female-dominant association. This is not ideal but an unavoidable effect of reducing the overall bias score.
Lastly, our measurement and mitigation cannot detect and remove all bias. We tried to include diverse languages and measure the bias for several ethnic groups, but due to time and resource constraints, we were only able to experiment with a handful of languages, ethnicities, and attributes. Language model deployment in the real world must be done carefully, as ours and other works studying various social biases are far from done.

A.1 Sentence Templates

The sentence templates are generated based on previous work (Kurita et al., 2019). The sentence templates are constructed so that the ethnic group cannot be inferred from the template itself. See below for the templates we used.
Among crowdsourcing and templates, the main methods for LM bias research, the advantage of crowdsourcing is that "it may reflect better ecological validity" (Blodgett et al., 2021), but as this reference points out, it is difficult to find experts for the six languages. We chose the more practical template-based method and obtained meaningful results. The logical next step in future work is to conduct an in-depth analysis of each language.

A.2 Targets and Attributes
When it comes to the words that can occur as targets and attributes, we conduct experiments with thirty target terms and seventy attribute terms. All terms are nouns and are translated into each language, and our study is based on masculine word forms in gender-rich languages.
Here are the lists of targets and attributes in English:

B Hyperparameters for Fine-tuning

We set the learning rate to 1e-4 with 1,000 warmup steps. The models are trained for two epochs, by which point the loss no longer drops significantly, and gradients are clipped at 1. The Adam optimizer (Kingma and Ba, 2015) is employed with an epsilon value of 1e-8. We mostly follow the masking strategies of Devlin et al. (2019) when fine-tuning the masked language model head (MLM head). Lastly, we do not manually fix the random seed because we want to show the method's effectiveness under any seed.
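A sketch of the optimizer setup described above; the linear warmup schedule and the placeholder module are our assumptions, not confirmed details of the released code.

```python
import torch
from transformers import get_linear_schedule_with_warmup

mlm_head = torch.nn.Linear(768, 30522)  # placeholder for the unfrozen MLM head
num_training_steps = 10000              # illustrative

optimizer = torch.optim.Adam(mlm_head.parameters(), lr=1e-4, eps=1e-8)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=1000, num_training_steps=num_training_steps)

# Inside the training loop, gradients are clipped at 1:
# torch.nn.utils.clip_grad_norm_(mlm_head.parameters(), 1.0)
```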

Hyperparameters for Downstream Tasks
We generally follow the suggested hyperparameters provided on the dataset homepage for a fair comparison.
German We conduct Named-Entity Recognition (NER) on GermEval 2014, based on the example in the transformers library. The batch size is set to 32, and the unfrozen model is trained for three epochs. On the other hand, the frozen models are trained for 100 epochs because of the freezing.
Spanish We also conduct NER on CoNLL-2002 (Tjong Kim Sang, 2002). Similarly, the batch size is set to 32, and the unfrozen model is trained for three epochs, but the frozen models are trained for 300 epochs.

Turkish The downstream dataset used in Turkish is a split version of WikiANN (Rahimi et al., 2019). The detailed hyperparameters are the same as above.
Korean We use a split version of the Korean NER dataset from the Naver NLP Challenge 2018, which uses Korean comments on movie reviews from the Korean portal Naver. Notably, unlike the other NER tasks, this dataset has 29 labels containing several distinct entity types, including date, time, and number. We expect that this caused the performance degradation when BERT is frozen, compared to the other languages.
Chinese The MSRA dataset is a simplified-Chinese version of the Microsoft NER dataset. The detailed hyperparameters are the same as above.

Environment and Runtime
The experiments are conducted on a GeForce RTX 2080 Ti (10GB) with CUDA version 10.2. Depending on the model and language, a single experiment takes from one hour to 25 hours. Fine-tuning a multilingual model usually takes longer than a monolingual model because of the size of the vocabulary. We report the mean score of 5 runs.

C Another model type: XLM
We study ethnic bias in BERT, arguably the most widely used LM. This is consistent with recent studies of bias in LMs (Liang et al., 2020a; Cheng et al., 2021).
Other than BERT, one model we tried is XLM (Lample and Conneau, 2019), for which the CB scores are (en) 8.95, (de) 12.72, (es) 9.97, (ko) 30.25, (tr) 42.11, and (zh) 12.40. In all languages except Chinese, the CB score is higher (i.e., the LM is more biased) than with our proposed mitigation methods in Tables 2 and 5. Note that the XLM variant that covers all six languages is RoBERTa-based, so for a fair comparison, we only report the results of BERT variants.

D Efficacy in terms of distance
In addition to the efficacy shown in the experiments section, we evaluate our model in terms of distance, namely the Jensen-Shannon Divergence (JSD). The left half of Table 9 shows how the alignment brings each distribution closer to the target distribution, which is English. Compared to before alignment, the alignment actually reduces the JSD score in all five languages.
Conversely, the right half of Table 9 shows how the alignment moves the distribution further from the source distribution, which is English. In this case as well, the distance to the original English distribution increases, which means that alignment to other languages forces the English monolingual model away from the original English monolingual model.
To sum up, the contextual alignment does not just reduce the bias score; it achieves this by moving the distribution of each language's embedding space toward the target embedding space.

Table 9: Jensen-Shannon Divergence between the monolingual models and the English monolingual model. For a fair comparison, the "No Alignment" case in EN → X uses the distribution after fine-tuning with the additional corpus, just like the other aligned variants.

E Another Case Study
In this section, we provide more results of the case study shown in Figure 1.
When it comes to the word pirate (Figure 5), Somalia ranks first in four of the six languages; in Korean and Spanish in particular, its normalized probability is over 40%. Even though other countries rank first in Turkish and Chinese, this example shows that this particular bias does not vary much depending on the language.
After mitigation, most of the peaky distributions become more uniform, except for Chinese. The change is especially striking in Turkish. The Chinese case shows the side effect that the highest normalized probability moves to another ethnicity.

Figure 6: Distributions after aligning to English, testing the association with the word pirate.