Language-Agnostic Bias Detection in Language Models with Bias Probing

Pretrained language models (PLMs) are key components in NLP, but they contain strong social biases. Quantifying these biases is challenging because current methods focusing on fill-the-mask objectives are sensitive to slight changes in input. To address this, we propose a bias probing technique called LABDet for evaluating social bias in PLMs with a robust and language-agnostic method. Using nationality as a case study, we show that LABDet "surfaces" nationality bias by training a classifier on top of a frozen PLM on non-nationality sentiment detection. We find consistent patterns of nationality bias across monolingual PLMs in six languages that align with historical and political context. We also show for English BERT that bias surfaced by LABDet correlates well with bias in the pretraining data; thus, our work is one of the few studies that directly links pretraining data to PLM behavior. Finally, we verify LABDet's reliability and applicability to different templates and languages through an extensive set of robustness checks. We publicly share our code and dataset at https://github.com/akoksal/LABDet.


Introduction
Pretrained language models (PLMs) have gained widespread popularity due to their ability to achieve high performance on a wide range of tasks (Devlin et al., 2019). Smaller PLMs, in particular, have become increasingly popular for their ease of deployment and finetuning for various applications, such as text classification (Wang et al., 2018), extractive text summarization (Liu and Lapata, 2019), and even non-autoregressive text generation (Su et al., 2021). Despite their success, it is established that these models exhibit strong biases, such as those related to gender, occupation, and nationality (Kurita et al., 2019; Tan and Celis, 2019). However, quantifying intrinsic biases of PLMs remains a challenging task (Delobelle et al., 2022).
Figure 1: Our bias probing method surfaces nationality bias by computing relative sentiment change: subtracting the absolute positive sentiment of a neutral example without nationality (i.e., with the [MASK] token) from that of an example with a nationality. Thus, the Turkish PLM exhibits a positive interpretation for Turks and a negative interpretation for Greeks in the same context. Conversely, the Dutch PLM demonstrates the opposite trend, with a positive sentiment towards Greeks and a negative sentiment towards Turks.
Recent work on social bias detection in PLMs has mainly focused on English. Most of those approaches have limited capabilities in terms of stability and data quality (Antoniak and Mimno, 2021; Blodgett et al., 2021). Therefore, we propose a robust 'Language-Agnostic Bias Detection' method called LABDet for nationality as a case study and analyze intrinsic bias in monolingual PLMs in Arabic, Dutch, English, French, German, and Turkish with bias probing. LABDet addresses the limitations of prior work by training a sentiment classifier on top of PLMs, using templates containing positive/negative adjectives, without any nationality information. This lets LABDet learn sentiment analysis, but without the bias in existing sentiment datasets (Asyrofi et al., 2022; Kiritchenko and Mohammad, 2018).
The second key idea of bias probing is to surface bias by using templates and corpus examples with a nationality slot for which we compare substitutions, e.g., "Turkish" vs. "Greek" in Turkish and Dutch PLMs as illustrated in Figure 1. When analyzing the template "This [Nationality] person is neutral." in Turkish and Dutch, we find that the Turkish PLM, BERTurk (Schweter, 2020), gives a relative sentiment score of +0.16 for the Turkish nationality and -0.17 for the Greek nationality, while the Dutch PLM, BERTje (de Vries et al., 2019), gives a relative sentiment score of -0.40 for the Turkish nationality and +0.40 for the Greek nationality. The relative sentiment score surfaces the effect of nationality on the sentiment by subtracting the absolute sentiment score of a neutral example without nationality, e.g., "This [MASK] person is neutral.", from that of the nationality example. This difference in relative sentiment scores between the two models aligns with historical and political context: Turkish-Greek conflicts, the Turkish minority in the Netherlands, etc. These patterns are examined across various templates and corpus examples to identify consistent preferences exhibited by PLMs. We then show the links between biases extracted with our method and bias present in the pretraining data of examined PLMs. We provide a comprehensive analysis of bias in the pretraining data of BERT: we examine the context positivity rate of sentences containing nationality information and also investigate the nationalities of authors in the Wikipedia part of the pretraining data. Furthermore, we present connections between biases present in the real world and biases extracted via LABDet, particularly in relation to minority groups, geopolitics, and historical relations. Finally, the consistency of these patterns across different templates enhances the robustness and validity of our findings, which have been rigorously confirmed through an extensive testing process in six languages.
Our paper makes the following contributions: (i) Pretraining Data Bias: We quantify the nationality bias in BERT's pretraining data and show that LABDet detects BERT's bias with a significant correlation, thus strongly suggesting a causal relationship between pretraining data and PLM bias. (ii) Linkage to Real-world Biases: We apply LABDet to six languages and demonstrate the relationship between real-world bias about minorities, geopolitics, and historical relations and intrinsic bias of monolingual PLMs identified by LABDet, finding support in the relevant political science literature.
(iii) Robustness: We propose LABDet, a novel bias probing method that detects intrinsic bias in PLMs across languages. Through robustness checks, we confirm LABDet's reliability and applicability across different variables such as languages, PLMs, and templates, thus improving over existing work.

Related Work
Measuring Social Bias: One approach identifies associations between stereotypical attributes and target groups (May et al., 2019) by analyzing their embeddings. This approach utilizes single tokens and "semantically bleached" (i.e., "sentiment-less") templates, which limits its applicability (Delobelle et al., 2022). Another approach (CrowS-Pairs (Nangia et al., 2020), StereoSet (Nadeem et al., 2021)) compares the mask probability in datasets of stereotypes. This is prone to instability and data quality issues (Antoniak and Mimno, 2021; Blodgett et al., 2021) and difficult to adapt across different languages. Additionally, PLMs are sensitive to templates, which can result in large changes in the masked token prediction (Jiang et al., 2020; Delobelle et al., 2022).

Bias Detection in Non-English Languages: Many studies on bias detection for non-English PLMs primarily focus on developing language-specific data and methods. For instance, Névéol et al. (2022) adapt the CrowS-Pairs dataset for the French language, while another recent approach (Kurpicz-Briki, 2020) extends the bias detection method in word embeddings, WEAT (Caliskan et al., 2017), to the French and German languages. Chávez Mulsa and Spanakis (2020) also expand WEAT to Dutch and analyze bias at the word embedding level. However, these language-specific methods face similar challenges regarding robustness and reliability.

Pretraining Data Analysis: Recent work examines the relationship between PLM predictions and their pretraining data, particularly in the context of fact generation (Akyurek et al., 2022) and prompting for sentiment analysis and textual entailment (Han and Tsvetkov, 2022). Other work focuses on determining the amount of pretraining data necessary for PLMs to perform specific tasks, such as syntax (Pérez-Mayos et al., 2021) or natural language understanding (NLU) (Zhang et al., 2021). To the best of our knowledge, our work is the first to establish a connection between pretraining data and PLM behavior for intrinsic social bias.

Dataset
Our bias probing method, LABDet, includes two steps for detecting and quantifying social bias in pretrained language models (PLMs). The first concerns sentiment training: we train a classifier on top of a frozen PLM with generic sentiment data without nationality information. This step aims to map contextual embeddings to positive/negative sentiments without changing any underlying information about nationality. In the second step, we create minimal pairs for bias quantification via sentiment surfacing. We provide minimal pairs with different nationalities (e.g., "Turk" vs. "Greek") and see how different nationalities surface different sentiments. Our dataset covers six languages: Arabic, Dutch, English, French, German, and Turkish.

Sentiment Training Dataset
We carefully design a novel sentiment dataset to map contextual embeddings of PLMs to sentiments. Our goal is not to include any bias about nationalities (which a pair like ("Turkish people are nice.", positive) would introduce), so as to keep the sentiment towards nationalities in PLMs unchanged. We do not take advantage of existing sentiment analysis datasets as they contain bias in different forms (Asyrofi et al., 2022). For example, the YELP dataset (Zhang et al., 2015) contains negative reviews towards cuisines, which may be interpreted towards nationalities by PLMs, as illustrated in this YELP example: "Worst mexican ever!!!!!! Don't go there!!!".
Therefore, we propose a template-based approach with careful design. We select six languages with diverse linguistic features based on the linguistic capabilities of the authors and conduct experiments in those languages: Arabic, Dutch, English, French, German, and Turkish. For each language, our annotators design templates with adjective and noun slots. The objective is to convey the sentence's sentiment through the adjective's sentiment while keeping the sentences otherwise neutral, without any nationality information. The adjective slots can be filled with positive and negative adjectives selected from a pool of ≈25 adjectives, determining the final sentiment. Additionally, we create ≈20 nouns for each language. Finally, with ≈10 templates, we generate over 3,500 training examples for each language. We illustrate one template per language, two nouns, and positive/negative adjectives in Table 1 (top).
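As a sketch, this template expansion can be implemented in a few lines. The templates and word lists below are invented English stand-ins, not the actual LABDet resources (those are available in the repository):

```python
from itertools import product

# Invented stand-ins; the real resources (~10 templates, ~25 adjectives,
# ~20 nouns per language) are in the LABDet repository.
templates = ["The [Adj] [Noun] is here.", "Everyone mentioned the [Adj] [Noun]."]
nouns = ["day", "book"]
adjectives = {"positive": ["wonderful", "pleasant"],
              "negative": ["terrible", "awful"]}

def expand(templates, nouns, adjectives):
    """Fill the Adj/Noun slots; the adjective's polarity gives the label."""
    examples = []
    for template, noun in product(templates, nouns):
        for label, adjs in adjectives.items():
            for adj in adjs:
                text = template.replace("[Adj]", adj).replace("[Noun]", noun)
                examples.append((text, label))
    return examples

train = expand(templates, nouns, adjectives)
print(len(train))  # 16 examples: 2 templates x 2 nouns x 4 adjectives
```

With the paper's resource sizes (≈10 templates, ≈25 adjectives, ≈20 nouns), the same cross-product yields the reported 3,500+ examples per language.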
Template-based approaches are prone to syntax and semantics issues. For example, we see that there are gender agreement issues or meaningless pairs (e.g., "insufficient day"). While this is one limitation of our sentiment dataset, we believe that training the model on these ungrammatical or less meaningful sentences does not impact the overall goal of this step: surfacing sentiment from contextual embeddings. We design experiments to verify our method by comparing the correlation between bias present in the pretraining data of PLMs and bias extracted via LABDet.

Minimal Pairs for Sentiment Surfacing
In the second step, we create a second dataset of minimal pairs to analyze the effect of nationality on the sentiment results and thus quantify bias. However, as the choice of templates plays a big role here, we curate templates from different sources and verify the effectiveness of our method, LABDet.

Template Pairs: We carefully design templates in different languages to create minimal pairs. These minimal pairs are designed to have a neutral context for different nationalities. Our annotators create templates with [Nationality] and [Adjective] tags, and this time they propose a neutral set of adjectives. Therefore, we aim to investigate the effect of a nationality change on positive/negative sentiment surfacing. As illustrated in Table 1 (bottom, "Sentiment Surfacing"), we create sentences such as "This Syrian person is neutral." with ≈15 neutral adjectives for each language.
As an alternative template approach, we modify the templates proposed by Kiritchenko and Mohammad (2018), the Equity Evaluation Corpus (EEC), which include both negative and positive adjectives, contrary to our neutral examples. Since we track changes in the positive sentiment score in LABDet, even the same positive context with different nationalities could have varying degrees of positive sentiment scores, which would indicate bias toward nationalities. Instead of using nouns in the source, we utilize [Nationality] tags as shown in Table 4 in the Appendix. Since the source corpus is proposed only for the English language, we use EEC for the verification of our method in English.

Corpus Pairs: Additionally, we present templates generated from corpus sentences. For six languages, we create minimal pairs from the mC4 (Raffel et al., 2022) and Wikipedia corpora; Table 2 shows example pairs.

Nationality/Ethnicity: To demonstrate bias against nationalities, we select a diverse set of nationalities using a few criteria as guidelines: large minority groups in countries where the language is widely spoken and nationalities with which those countries have alliances or geopolitical conflicts. Therefore, we target around 15 different nationalities and ethnicities for each language for the bias detection and quantification part of our work. See Figure 2 for the selected nationalities and ethnicities for each language.
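The corpus-pair construction (find a sentence mentioning a nationality, swap in a placeholder) can be sketched as follows; the sentences and the nationality list here are illustrative only:

```python
import re

def make_corpus_pairs(sentences, nationalities, limit=10000):
    """Replace the first nationality mention in each matching corpus
    sentence with a [Nationality] placeholder."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, nationalities)) + r")\b")
    pairs = []
    for sent in sentences:
        if len(pairs) >= limit:  # the paper caps at 10,000 sentences per corpus
            break
        if pattern.search(sent):
            pairs.append(pattern.sub("[Nationality]", sent, count=1))
    return pairs

# Invented sentences standing in for segmented mC4/Wikipedia text:
sents = ["A Turk wrote this novel.",
         "The weather was cold.",
         "Many Greek ships arrived."]
print(make_corpus_pairs(sents, ["Turk", "Greek"]))
# ['A [Nationality] wrote this novel.', 'Many [Nationality] ships arrived.']
```

Each resulting template can then be instantiated with every target nationality to form minimal pairs.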

Bias Probing
We propose a robust and language-agnostic bias probing method to quantify intrinsic bias in PLMs.
To extend and improve prior work that mainly focuses on the English language or large language models with prompting, we propose bias probing with sentiment surfacing. First, we train a classifier such as an SVM or MLP on top of the frozen PLMs to find a mapping between contextual embeddings and sentiments. For this, we utilize our sentiment training dataset created via templates in order to prevent possible leakage of nationality information to the classifier. This helps to extract positive and negative sentiment information present in the pretrained language models.
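The sentiment-training step can be illustrated with a toy stand-in: random vectors play the role of the frozen contextual embeddings, and a minimal logistic-regression head (substituting for the paper's SVM/MLP) learns the embedding-to-sentiment mapping without ever updating the "PLM":

```python
import numpy as np

rng = np.random.default_rng(0)

# Random vectors stand in for frozen PLM embeddings of the sentiment
# templates; in LABDet these come from the PLM and are never updated.
pos_emb = rng.normal(loc=+0.5, size=(200, 16))  # embeddings of positive examples
neg_emb = rng.normal(loc=-0.5, size=(200, 16))  # embeddings of negative examples
X = np.vstack([pos_emb, neg_emb])
y = np.array([1] * 200 + [0] * 200)

# Minimal logistic-regression head (the paper trains an SVM or MLP instead).
w, b = np.zeros(16), 0.0
for _ in range(200):
    p = 1 / (1 + np.exp(-(X @ w + b)))  # predicted P(positive)
    w -= 0.1 * X.T @ (p - y) / len(y)   # gradient step on weights
    b -= 0.1 * float(np.mean(p - y))    # gradient step on bias

acc = float(np.mean(((X @ w + b) > 0) == y))  # training accuracy of the head
```

Only `w` and `b` are trained; the "embeddings" stay fixed, mirroring how the frozen PLM preserves its (possibly biased) representation of nationalities.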
In the second step, we propose the sentiment surfacing method by computing the relative sentiment change. Absolute sentiment values vary across models, languages, and contexts such as templates' sentiment. In the relative approach, the placeholders in two sentences with the same context are filled with a nationality term and a neutral word, [MASK]. As illustrated in Figure 1, we compare the relative sentiment change of the "This [Nationality] person is neutral." template in Turkish with the Turkish nationality and the [MASK] token. This change shows that the "Turkish" nationality surfaces positive sentiment with a score of +0.16 while the "Greek" nationality surfaces negative sentiment with a score of -0.17. Then, we compare these changes across different nationalities and templates and evaluate whether there is a consistent negative bias towards specific nationalities.
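A minimal sketch of the relative sentiment computation, with invented classifier scores standing in for the frozen-PLM pipeline's outputs (chosen to mirror the +0.16/-0.17 example above):

```python
def relative_sentiment(p_positive, template, nationality):
    """Positive-sentiment score with the nationality filled in, minus the
    score of the same template with the neutral [MASK] token."""
    with_nat = template.replace("[Nationality]", nationality)
    neutral = template.replace("[Nationality]", "[MASK]")
    return p_positive(with_nat) - p_positive(neutral)

# Invented probabilities; in LABDet these come from the classifier trained
# on top of the frozen PLM.
scores = {
    "This Turkish person is neutral.": 0.66,
    "This Greek person is neutral.": 0.33,
    "This [MASK] person is neutral.": 0.50,
}
template = "This [Nationality] person is neutral."
rel_turkish = relative_sentiment(scores.get, template, "Turkish")  # ≈ +0.16
rel_greek = relative_sentiment(scores.get, template, "Greek")      # ≈ -0.17
```

Because the neutral [MASK] baseline is subtracted per template, scores remain comparable across templates and models with different absolute sentiment levels.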
To surface sentiment change, we utilize the three different sources of minimal pairs presented in §3.2: one coming from template pairs we curated and two coming from examples in the mC4 (Raffel et al., 2022) and Wikipedia corpora for six languages. Additionally, we also modify and use the previously proposed EEC templates (Kiritchenko and Mohammad, 2018) for English to show the robustness of our approach to different template sources.
For sentiment training, we use SVM and MLP classifiers. Next, we quantify bias using both a template-based approach (ours and, only for English, EEC) and a corpus-based approach (mC4 and Wikipedia) via sentiment surfacing.
We propose three distinct analyses. First, we compare the bias extracted via LABDet with the bias of the English BERT base pretraining data. This evaluation helps to assess the effectiveness of our method and explore the connection between pretraining data and PLM behavior. In the second analysis, we show the relative sentiment change for each nationality across six languages. We conduct a qualitative analysis of these results and examine their link to real-world bias within the historical and political context. For the first and second analyses, we employ the SVM classifier and our template-based approach. However, to demonstrate the robustness of our findings, we compare our results from different approaches (template vs. corpus), sources (mC4 vs. Wikipedia), and classifiers (SVM vs. MLP) in the third analysis. We use Pearson's r to measure the strength of the correlation between positive sentiment scores of nationalities obtained from different sources.

Pretraining Data Bias
We demonstrate the effectiveness of LABDet for detecting and quantifying bias by evaluating its performance on bias present in the pretraining data, a novel contribution compared to prior work. This approach allows us to obtain evidence for a causal relationship between pretraining data and model bias. Specifically, we analyze the context positivity of different nationalities in the English BERT pretraining data (i.e., English Wikipedia and BooksCorpus (Zhu et al., 2015)) by extracting all sentences containing a nationality/ethnicity from a predefined set. We then measure the context positivity by calculating the average positive sentiment score of sentences for each nationality. We use RoBERTa base (Liu et al., 2019), finetuned on SST-2 (Socher et al., 2013), for sentiment analysis. We eliminate nationality bias in the sentiment analysis model by replacing each nationality with a mask token in the pretraining data. For a more confident analysis, we also increase the number of nationalities from 15 to 30 for this part.
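The context positivity computation can be sketched as follows; the keyword-based sentiment function stands in for the SST-2-finetuned RoBERTa model, and the sentences are invented:

```python
import re
from statistics import mean

def context_positivity(sentences, nationality, sentiment_score):
    """Average positive sentiment of a nationality's contexts, with the
    nationality word replaced by [MASK] so the sentiment model cannot
    react to the nationality itself."""
    pat = re.compile(rf"\b{re.escape(nationality)}\b")
    masked = [pat.sub("[MASK]", s) for s in sentences if pat.search(s)]
    return mean(sentiment_score(s) for s in masked)

# Toy stand-ins: two invented sentences and a keyword scorer instead of the
# SST-2-finetuned RoBERTa model used in the paper.
toy_score = lambda s: 0.9 if "celebrated" in s else 0.2
sents = ["The Irish author was celebrated.", "The Irish team lost badly."]
print(context_positivity(sents, "Irish", toy_score))  # 0.55
```

Masking before scoring is the key design choice: it prevents the sentiment model's own nationality bias from contaminating the measurement of the surrounding context.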
We present pretraining data bias and relative sentiment scores with our method for all nationalities in Table 3. We observe meaningful patterns in the context positivity scores of English BERT's pretraining data. The connection with the English language, historical developments, and content production can explain why some countries receive higher context positivity scores while others remain on the lower end.
For instance, American has the highest context positivity, which could be attributed to the fact that English is widely spoken in the country and the majority of the content in the pretraining data is produced by Americans (Callahan and Herring, 2011). Similarly, nationalities with large numbers of English speakers like Indian and Nigerian (and by extension, Asian and African) as well as Irish also have high context positivity scores, which could be explained by the fact that these countries also produce content in English. For example, most active editors of English Wikipedia are from the United States of America (21K), followed by the United Kingdom (6K) and India (4K). Indeed, among the six countries with the highest context positivity in Table 3, all except one are among the top content producer countries (Philippines 1K; Italy 950; Brazil 880).

On the negative end of the context positivity scale, we observe that groups that have minority status in English-speaking countries or those that are associated with conflict and tension have lower context positivity scores. Nationalities such as Syrian, Afghan, Israeli, and Iranian are associated with conflict and violence in the past decades. Similarly, Vietnamese has one of the lowest context positivity scores, most likely reflecting the bulk of content related to the Vietnam War. That Japanese and German have lower context positivity scores may seem puzzling at first; yet, this is likely due to the historical context of World War II and their portrayal in the pretraining data.
To verify the effectiveness of LABDet, we compute the correlation between the context positivity scores of the pretraining data and the relative sentiment scores from our method using Pearson's r. We observe a significant correlation with an r score of 0.59 (p < 0.01). This indicates that LABDet is able to detect bias in PLMs with a high correlation to the bias present in the pretraining data. We also observe significant linear correlations using different approaches such as templates and corpus examples or SVM and MLP classifiers. This shows LABDet's robustness.
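The verification step reduces to a plain Pearson correlation over per-nationality score lists. A self-contained sketch with invented numbers (five hypothetical nationalities, not the paper's data):

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented per-nationality scores for illustration:
ctx_pos = [0.42, 0.35, 0.51, 0.30, 0.47]     # context positivity in pretraining data
rel_sent = [0.10, -0.05, 0.20, -0.12, 0.15]  # relative sentiment via LABDet
print(round(pearson_r(ctx_pos, rel_sent), 2))  # 0.99
```

A high r here means nationalities treated more positively in the pretraining data also surface more positive relative sentiment, which is the causal link the paper tests.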

Linkage to Real-world Biases
We compare the bias of six monolingual PLMs identified by LABDet to real-world bias, consulting the political science literature. We report the relative sentiment score changes with reference to neutral examples where [Nationality] tags are replaced by a mask token. Using the Wilcoxon signed-rank test, we determine which nationalities have consistently lower, higher, or similar predictions compared to the neutral examples. Our results are presented in Figure 2 with the relative sentiment change, bias direction (red: negative, black: no bias, green: positive), and confidence intervals.
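The per-nationality significance test can be sketched as below. This is an illustrative reimplementation using the normal approximation to the Wilcoxon signed-rank statistic (assuming no zero or tied differences), not the paper's exact procedure; in practice one would call scipy.stats.wilcoxon:

```python
from statistics import NormalDist

def wilcoxon_p(diffs):
    """Two-sided Wilcoxon signed-rank p-value via the normal approximation.
    Assumes no zero or exactly tied differences (fine for a sketch)."""
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    w_plus = sum(rank + 1 for rank, i in enumerate(order) if diffs[i] > 0)
    mu = n * (n + 1) / 4
    sigma = (n * (n + 1) * (2 * n + 1) / 24) ** 0.5
    z = (w_plus - mu) / sigma
    return 2 * (1 - NormalDist().cdf(abs(z)))

def bias_direction(rel_scores, alpha=1e-4):
    """Classify a nationality as negative/neutral/positive from its relative
    sentiment scores across templates, as in Figure 2's color coding."""
    if wilcoxon_p(rel_scores) >= alpha:
        return "neutral"
    return "positive" if sum(rel_scores) > 0 else "negative"

# Invented relative-sentiment samples across 25 templates:
print(bias_direction([0.05 + 0.01 * i for i in range(25)]))                # positive
print(bias_direction([(-1) ** i * (0.05 + 0.01 * i) for i in range(25)]))  # neutral
```

Consistently signed scores across templates yield a significant test and a colored (biased) label; mixed signs yield "neutral" (black in Figure 2).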
Our findings indicate that all monolingual PLMs exhibit bias in favor of (i.e., green) and against (i.e., red) certain nationalities. In each model, we are able to get diverging relative sentiment scores just due to the change in the nationality mentioned across our templates. Some nationalities consistently rank low on different PLMs. For instance, Syrian, Israeli, and Afghan rank on the lower end of most PLMs in our analysis. Similarly, Ukrainian ranks fairly low in the European language PLMs. On the other hand, some nationalities, such as American and Indian, rank consistently high in relative sentiment scores. Apart from consistent rankings across PLMs, there are three context-specific patterns we can observe from the relative sentiment scores. First and foremost, immigrant/minority populations are important predictors of relative sentiment scores. This manifests in two opposing trends. On the one hand, immigrant populations such as Syrians in Turkey (Getmansky et al., 2018; Alakoc et al., 2021) and the European countries (Poushter, 2016; Secen, 2022), Ukrainians (Düvell, 2007) in European countries (note that the PLMs were trained prior to the recent refugee wave), and Indonesians/Moroccans in the Netherlands are known to be stigmatized (Solodoch, 2021). In accordance with that, these nationalities have negative and some of the lowest relative sentiment scores in the corresponding Turkish, Dutch/German/French, and Dutch PLMs, respectively. Similarly, minorities such as Arab, Syrian, Chinese, and Mexican rank lower in the English PLM.
On the other hand, while there is some evidence of bias against minorities, it seems that large minority populations who reside in a country and/or produce language content that might have made it into the pretraining data may be helping to mitigate some of the impacts of that bias. For example, Moroccan and Senegalese are associated with high relative sentiment scores in the French PLM. The same is true for Indian and Nigerian in the English PLM. This is despite the evidence of significant discrimination against these minorities in their respective countries (Thijssen et al., 2021; Silberman et al., 2007). The fact that English has official language status in India and Nigeria, and French in Senegal, might also be a contributing factor. These two trends regarding minorities are likely driven by the history and size of minorities in these countries. While there is bias against newer and relatively smaller minorities, the older and larger minorities who likely produce content receive higher relative sentiment scores. Reflecting these trends, the German PLM is a case in point. The nationalities that rank among the lowest in the German PLM are either recent immigrants (Syrian) or smaller minority groups (Ukrainian before the recent refugee wave, Moroccan, and Nigerian). On the opposite end, Turk ranks among the highest in the German PLM. As Turkish immigrants constitute the largest and one of the oldest immigrant populations in Germany (Destatis, 2023), it is likely that the content they produce leads to a positive bias toward Turks.
Second, negative bias seems to stem not just from attitudes toward minorities but also from geopolitics and conflict. This might be in the form of geopolitical tension between a country where the language of the model is predominantly spoken and a country that we consider as a nationality. This is consistent with the evidence in the literature that geopolitical tensions stoke discriminatory attitudes (Saavedra, 2021). For example, tensions between the US and countries like Iran and China are likely driving lower scores for Iranian and Chinese in the English PLM (Lee, 2022; Sadeghi, 2016). Similarly, regional tensions in the Middle East are reflected in Arabic and Turkish PLMs where Israelis ranked among the lowest in terms of relative sentiment scores (Kosebalaban, 2010; Robbins, 2022).
Geopolitical tensions and a conflict environment can also affect attitudes through negative news stories, even when there is no direct conflict between countries. The fact that nationalities of conflict-ridden countries such as Syrian, Israeli, and Afghan have consistently negative sentiment scores shows how political conflict may affect attitudes toward groups in different parts of the world.
Finally, historical affinity and rivalry seem to play a significant role. Historical allies tend to have higher relative sentiment scores, as seen in the case of American, which is ranked high in the Dutch, French, and Arabic PLMs. Yet, historical rivals tend to be ranked rather negatively. The fact that German has negative relative sentiment scores in the three other European PLMs (Dutch, English, and French) is likely related to Germany's role in the world wars (Reeve, 2017). Similarly, Ukrainian consistently ranking lower in the European PLMs might be a by-product of the Cold War context, where Ukraine was part of the USSR in rivalry with the Western Bloc.
Examining the relative sentiment scores in the Turkish PLM is also helpful to explore how historical relations shape both negative and positive biases. The historical negative attitudes between Turkey and Armenia (Phillips, 2013) are reflected in the chart as Armenian is ranked lowest in the Turkish PLM. This sentiment likely arises from the long-standing tension and conflict between the two nations going back to World War I. On the other hand, Korean has the most positive sentiment score in the Turkish PLM, a result that may seem puzzling at first, considering the geographical and political distance between the two countries. However, digging into the historical connection between the two countries helps us explain this score, as Turkey provided military support during the Korean War (Lippe, 2000), which evolved into an affinity between the two nations that has even become a subject of popular culture through literature and movies (Betul, 2017).
As the examination of these three patterns (i.e., minorities, geopolitics, and historical relations) demonstrates, the relative sentiment scores in Figure 2 highlight the importance of considering historical and contemporary real-world context in analyzing the biases present in PLMs. Understanding the real-world biases provides valuable insights into the underlying factors that contribute to the biases in PLMs. Furthermore, this illustrates how cultural and historical ties between nations can have a lasting impact on language usage, which is evident in the pretraining data and subsequently reflected in PLMs.

Robustness Evaluation
Robustness is a crucial aspect of bias detection in PLMs, and many existing methods have limitations in this regard (Delobelle et al., 2022). We compare the robustness of LABDet across different setups by assessing the similarity in predictions via Pearson's r (p < 0.01) across languages.

Classifier: We compare SVM and MLP classifiers on six language models and four template sources. For each experiment, we observe a significant correlation with an average r of 0.94.

Template Source: To demonstrate that our results are not specific to the design of our templates with neutral adjectives, we compare them to modified EEC (Kiritchenko and Mohammad, 2018) templates with positive and negative adjectives (see Table 4). As EEC templates are in English, we only compare English PLMs (but extending to four BERT and two RoBERTa variants) and two different classifiers. We observe a significant linear correlation for each setup with an average r of 0.83.

Template vs. Corpus Examples: We compare our template approach to mC4 examples. For PLMs in six languages and two classifiers, we observe a significant correlation with an average r of 0.89, except for Arabic, where there is a significant difference between corpus examples and templates.

Corpus Source: We investigate the importance of the corpus source by comparing Wikipedia and mC4 examples in six monolingual PLMs and two classifiers. We observe significant correlations for each combination, with an average r of 0.98.

Conclusion
Our bias probing method, LABDet, allows for the analysis of intrinsic bias in monolingual PLMs and is easily adaptable to various languages. Through rigorous testing and qualitative analysis, we have demonstrated the effectiveness of LABDet, such as finding a strong correlation between bias in the pretraining data of English BERT and our results. We also identified consistent patterns of bias towards minority groups or nationalities associated with conflict and tension across different languages. Additionally, we found that large minority groups who produce content in the target language tend to have more positive sentiment, such as Turks in German PLMs and Senegalese/Moroccans in French PLMs. Finally, we show that our findings are statistically consistent across template and corpus examples, different sources, and languages.

A Appendix

Figure 2: Relative sentiment score computed by LABDet for each nationality in six monolingual PLMs. Error bars indicate the 95% confidence interval. Colors in each PLM indicate negative (red), neutral (black), and positive (green) groups. Negative and positive groups are statistically different from the neutral examples, i.e., from examples with the [MASK] token (Wilcoxon signed-rank test, p < 1e-4).

Table 1: Bias probing templates in LABDet. Slots (for Adj, Noun, Nationality) are indicated by []. Sentiment Training: LABDet is trained on non-nationality Adj-Noun pairs. Sentiment Surfacing: LABDet uses neutral adjectives to surface positive/negative sentiment about a nationality for bias detection. The full list of templates is available at https://github.com/akoksal/LABDet.
Table 2: An example of minimal pair templates extracted from the mC4 and Wikipedia corpora. We first find sentences that contain specific nationalities, then replace them with the [Nationality] placeholder. This enables a more diverse and larger set of minimal pairs.

To construct the corpus pairs from mC4 (Raffel et al., 2022) and Wikipedia, we first segment sentences in the corpora with spaCy (Honnibal et al., 2020). Then, we extract 10,000 sentences that contain a selected nationality as a word in each target corpus, separately (e.g., Arab in Arabic, Turk in Turkish, etc.). Then, we replace those nationalities with the [Nationality] placeholder. These corpus examples cover a wide range of topics that we then use to diagnose positive/negative sentiments about nationalities; manually designed templates would be narrower.