Probing Toxic Content in Large Pre-Trained Language Models

Large pre-trained language models (PTLMs) have been shown to carry biases towards different social groups, which leads to the reproduction of stereotypical and toxic content by major NLP systems. We propose a method based on logistic regression classifiers to probe English, French, and Arabic PTLMs and quantify the potentially harmful content that they convey with respect to a set of templates. The templates are prompted by the name of a social group followed by a cause-effect relation. We use PTLMs to predict masked tokens at the end of a sentence in order to examine how likely they enable toxicity towards specific communities. We shed light on how such negative content can be triggered within unrelated and benign contexts based on evidence from a large-scale study, then we explain how to take advantage of our methodology to assess and mitigate the toxicity transmitted by PTLMs.


Introduction
The recent gain in size of pre-trained language models (PTLMs) has had a large impact on state-of-the-art NLP models. Although their efficiency and usefulness in different NLP tasks is incontestable, their shortcomings, such as their learning and reproduction of harmful biases, cannot be overlooked and ought to be addressed. Present work on evaluating the sensitivity of language models towards stereotypical content involves the construction of assessment benchmarks (Nadeem et al., 2020; Tay et al., 2020; Gehman et al., 2020) in addition to the study of the potential risks associated with the use and deployment of PTLMs (Bender et al., 2021). Previous work on probing PTLMs focuses on their syntactic and semantic limitations (Hewitt and Manning, 2019; Marvin and Linzen, 2018), lack of domain-specific knowledge (Jin et al., 2019), and absence of commonsense (Petroni et al., 2019). However, except for a recent evaluation process of hurtful sentence completion (Nozza et al., 2021), we notice a lack of large-scale probing experiments for quantifying toxic content in PTLMs or systemic methodologies to measure the extent to which they generate harmful content about different social groups.
In this paper, we present an extensive study which examines the generation of harmful content by PTLMs. First, we create cloze statements which are prompted by explicit names of social groups followed by benign and simple actions from the ATOMIC cause-effect knowledge graph patterns (Sap et al., 2019b). Then, we use a PTLM to predict possible reasons for these actions. We look into how BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), and GPT-2 (Radford et al., 2019) associate unrelated and detrimental causes to basic everyday actions and examine how frequently the predicted words relate to specific social groups. Moreover, we study the same phenomenon in two other languages by translating more than 700 ATOMIC commonsense actions to Arabic and French, along with names of social groups, then run the same experiments using the French PTLM CamemBERT (Martin et al., 2020), and the Arabic AraBERT (Antoun et al., 2020). We find that, overall, the predicted content can also be irrelevant and offensive especially when the subject of the sentence is part of a marginalized community in the predominant culture of the language.
In order to gauge the toxicity generated by different language models, we train simple toxicity classifiers based on logistic regression using available hate speech and offensive language datasets. We reduce the classification bias using a two-step approach: first, we filter out examples with identity words which typically lead classifiers to predict a toxic label, then we perform a second classification step on the remaining examples. Our main contributions can be summarized in the following.
• We perform a large-scale extensible study on toxic content in PTLMs without relying on datasets which are specific to such a task.
• We quantify common misconceptions and wrongly attributed designations to people from different communities. This assessment can be taken into account when using a PTLM for toxic language classification, and when adopting a mitigation strategy in NLP experiments.
• We develop a large dataset based on structured patterns that can later be used for the evaluation of toxic language classification and harmful content within PTLMs. We make our data resources publicly available to the community.¹
The rest of the paper is organized as follows. We first introduce our methodology in Section 2. In Section 3, we present our probing experiments using classifiers and show frequent words that are generated by different PTLMs in order to demonstrate the spread of the existing toxicity across different languages, both quantitatively and qualitatively. Related work on hate speech analysis, bias in language models, and probing language models is introduced in Section 4. Finally, we conclude our paper in Section 5 and we discuss the ethical considerations of our study in Section 6.
Note on pronouns: given the nature of PTLMs and for the sake of our multilingual study, we use the pronouns he and she even for the non-gendered PersonX. ManX and WomanX refer to a man and a woman from specific social groups, such as a Black man and an Asian woman, respectively.

Methodology
We adopt a rule-based methodology based on Masked Language Modeling (MLM) in order to probe the toxicity of the content generated by different PTLMs.
As shown in Figure 1, we use a PTLM on a one-token masked cloze statement which starts with the name of a social group, followed by an everyday action, and ends with a predicted reason for the action. Our goal is to provide a set of tests and a process to assess toxicity in PTLMs with regard to various social groups.

Probing Patterns
We use the ATOMIC atlas of everyday commonsense reasoning based on if-then relations (Sap et al., 2019b) to create cloze statements to fill in. Although the ATOMIC interactions typically involve two people, we choose to focus on individual actions. Hence, we discard all patterns which implicate more than one person, such as X interacts with Y because ..., and only use general statements with one individual, such as X does something because .... We prompt the statements with the name of a social group and use gendered pronouns to evoke the effect of the action. For the sake of normalizing the English, French, and Arabic patterns, we do not consider the pronoun they.

Table 2: Examples of social groups we use in our experiments. Race refers to different racial groups; Rel. to different (non)religious affiliations; Gen. to different genders and sexual orientations; Politics to various political views; Intersect. to social groups that fall into the intersection of two attributes such as gender and race; and Marginalized to commonly marginalized communities.
As shown in Table 1, we adapt X to be either a person, a man, or a woman. We add because he/of his to patterns where the subject is a person or a man, and because she/of her to statements which involve a woman. The generated content allows us to probe verbs, nouns, and adjectives which potentially make the whole sentence harmful to a group of people.
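The pattern construction above can be sketched as a small template-expansion step. This is a minimal illustration, not the paper's exact code: the action heads, subject phrases, and the "because {pronoun} is [MASK]" template form are assumptions standing in for the ATOMIC-derived patterns and the because he/of his variants described in Table 1.

```python
# Sketch of cloze-statement construction from ATOMIC-style patterns.
# Actions, subjects, and the template form are illustrative placeholders.
MASK = "[MASK]"  # BERT-style mask token

# Hypothetical one-person ATOMIC-style action heads
actions = ["prepares dinner", "joins the basketball team", "goes hiking"]

# Subject phrases paired with the gendered pronoun used in the causal clause
subjects = [
    ("A Black man", "he"),
    ("An Asian woman", "she"),
    ("A refugee", "he"),
]

def make_cloze(subject, pronoun, action):
    # e.g. "A Black man prepares dinner because he is [MASK]."
    return f"{subject} {action} because {pronoun} is {MASK}."

statements = [make_cloze(s, p, a) for s, p in subjects for a in actions]
print(statements[0])
```

Each statement is then handed to a PTLM to fill in the masked reason; the subject list can be extended with any of the social groups in Table 2 without changing the expansion logic.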

Lists of Social Groups
The PersonX and PersonY placeholders contained in the original ATOMIC patterns are insufficient to probe a PTLM with respect to present social entities and constructs. Slightly modified patterns such as ManX or WomanX give us an idea about the disparities between men and women only. Therefore, in order to look into additional variations in detail, we propose to include social groups in our evaluation by substituting PersonX, ManX, and WomanX in a way that involves different subgroups such as "Black men" or "Asian women". The subgroups share a general social attribute or a value system. Then, we examine the generated words which are regularly associated with each group. Table 2 contains examples of these subgroups.

The Generated Data
We use a total of 1,000 ATOMIC heads per language, from which we derive 6,000 patterns each for English and French, and 4,000 patterns for Arabic. We generate 378,000 English, 198,300 French, and 160,552 Arabic sentences using the presented patterns. We notice in the examples shown in Table 3 that, when using a PTLM to reason about the possible intentions related to basic actions, stereotypical, confusing, and harmful content can easily be generated.
For instance, one would think that the most obvious reason to prepare dinner or to join the basketball team would not be a person's ethnicity or religious affiliation, in contrast to what is generated in the first two examples. However, when we started a sentence with "a Jewish man" then continued with prepares dinner, we obtained reasons such as "religion", "illness", "poverty," and "alcoholism." Then, when substituting the subject of a sentence with "an Arab" and the action being that he is on the basketball team, we obtained reasons such as "race" and "faith," even before "height". The case of a refugee woman going hiking is even worse, since most of the generated content is related to death and diseases, and the PTLM produces syntactically incoherent sentences where nouns such as tuberculosis and asthma appear after the pronoun she.
Given the frequency of the observed incoherent and harmful content, we come up with a way to quantify how often such content tends to be generated.

Probing Classifiers
We propose to use simple toxic language classifiers despite their bias towards slurs and identity words (Sap et al., 2019a; Park et al., 2018; Ousidhoum et al., 2020). Due to the trade-off between explainability and performance, we train simple logistic regression (LR) models rather than deep learning ones.
We train an LR classifier on four relatively different English datasets (Davidson et al., 2017; Founta et al., 2018; Ousidhoum et al., 2019; Zampieri et al., 2019), four others in Arabic (Ousidhoum et al., 2020; Albadi et al., 2018; Mulki et al., 2019; Zampieri et al., 2020), and the only one we know about in French (Ousidhoum et al., 2019). Table 4 shows the performance of the LR classifiers on the test splits of these datasets, respectively. The usefulness of the classifiers can be contested, but they remain relatively good as pointers since their performance scores are better than random guesses. We use the three classifiers in order to assess different PTLMs and compare the extent to which toxicity can be generated despite the benign commonsense actions and simple patterns we make use of.
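A probe of this kind can be sketched with scikit-learn. The tiny inline corpus below is a made-up stand-in for the cited hate speech datasets, and TF-IDF unigrams/bigrams are an assumed feature choice (the paper does not specify its features); the point is only the LR pipeline shape.

```python
# Minimal sketch of a logistic-regression toxicity classifier.
# The toy corpus and TF-IDF features are illustrative assumptions,
# not the datasets or features used in the paper.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labeled data: 1 = toxic, 0 = non-toxic (illustrative only)
texts = [
    "I hate all of them, they are vermin",
    "those people are disgusting and stupid",
    "what a lovely day for a walk",
    "she prepares dinner for her friends",
]
labels = [1, 1, 0, 0]

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),   # word unigrams and bigrams
    LogisticRegression(max_iter=1000),     # simple, explainable linear model
)
clf.fit(texts, labels)

pred = clf.predict(["they are vermin"])[0]
```

Because the model is linear, the learned coefficients over n-gram features can later be inspected directly, which is the explainability advantage the paper trades performance for.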

Bias in Toxic Language Classifiers
Toxic language classifiers show an inherent bias towards certain terms such as the names of some social groups which are part of our patterns (Sap et al., 2019a;Park et al., 2018;Hutchinson et al., 2020). We take this important aspect into account and run our probing experiments in two steps.
In the first step, we run the LR classifier on cloze statements which contain patterns based on different social groups and actions, without using the generated content. Then, we remove all the patterns which have been classified as toxic. In the second step, we run our classifier over the full generated sentences built only from the patterns which were not labeled toxic. In this case, we consider the toxicity of a sentence given the newly PTLM-introduced content. Finally, we compare counts of potentially incoherent associations produced by various PTLMs in English, French, and Arabic.
[Table fragment: CamemBERT 23.38% 20.30% 17.69%; AraBERT 3.34% 6.59% 5.82%]
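The two-step procedure can be outlined as follows. The `toxic` keyword stub below is a hypothetical placeholder for the trained LR classifier, and the patterns and completion are invented examples; only the filter-then-score control flow reflects the method described above.

```python
# Sketch of the two-step probing procedure: (1) drop pattern-only statements
# that the classifier already flags (identity-term bias), (2) score only the
# PTLM-completed sentences built from the surviving patterns.
def toxic(text):
    # Placeholder for clf.predict([text])[0] == 1; a keyword stub for the demo.
    return "vermin" in text

patterns = [
    "An Asian woman goes hiking because she is [MASK].",
    "A man is called vermin because he is [MASK].",  # flagged at step 1
]
# Hypothetical PTLM completions of the masked token
completions = {p: p.replace("[MASK]", "sick") for p in patterns}

# Step 1: remove patterns the classifier labels toxic on their own
clean_patterns = [p for p in patterns if not toxic(p)]

# Step 2: classify only the completed sentences from the clean patterns
toxic_rate = sum(toxic(completions[p]) for p in clean_patterns) / len(clean_patterns)
```

Any toxicity measured at step 2 is then attributable to the PTLM-introduced token rather than to identity words already present in the pattern.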

Experiments
We use the HuggingFace Transformers library (Wolf et al., 2020) to implement our pipeline which, given a PTLM, outputs a list of candidate words and their probabilities. The PTLMs we use are BERT, RoBERTa, GPT-2, CamemBERT, and AraBERT.
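A minimal sketch of such a pipeline with HuggingFace Transformers is shown below. The specific checkpoint and prompt are illustrative (any masked-LM checkpoint and cloze statement would do), and running it requires downloading the model weights.

```python
# Sketch: given a masked cloze statement, return the top candidate tokens
# and their probabilities from a masked language model.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
preds = fill("A refugee woman goes hiking because she is [MASK].", top_k=5)

for p in preds:
    # Each prediction carries the filled token and its softmax probability.
    print(p["token_str"], round(p["score"], 4))
```

The `top_k` candidates for each statement are what the toxicity classifier then scores in the second step of the procedure; for causal models such as GPT-2, next-token generation plays the analogous role.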

Main Results
We present the main results based on the proportions of toxic statements generated by different PTLMs in Table 5. In the first step, 9.55%, 83.55%, and 18.25% of the English, French, and Arabic sentences to be probed were filtered out by the toxic language classifiers. As we only have one relatively small dataset on which we train our French LR classifier, this classifier shows more bias and is more sensitive to the existence of keywords indicating social groups. The English and Arabic classifiers were found to be less sensitive to the keywords and actions present in the patterns.
After filtering out the toxic patterns that our classifier labeled as offensive, we fed the sentences generated from the remaining patterns to the toxic language classifiers. The overall results for the three PTLMs in English and the Arabic and French PTLMs are shown in Table 5. The large-scale study of these five popular pre-trained language models demonstrates that a substantial proportion of the content generated given a subject from specific social groups can be regarded as toxic. In particular, we found that for English, BERT tends to generate more toxic content than GPT-2 and RoBERTa, which may also be due to the fact that GPT-2 generated a large number of stop words. Although the French PTLM CamemBERT seems to produce more toxic content than the Arabic and English PTLMs, this may only be due to the fact that we are assessing fewer samples in French after the first filtering step. Hence, we need additional evidence to be more assertive.
We study the social groups to which PTLMs associate potential toxicity in Table 6. The outcome is consistent with the overall results in Table 5. For instance, the statistics show that refugees and disabled people are often linked to toxic statements in BERT, and people with Down Syndrome and African people are commonly associated with toxicity in French, while we observe a difference in scale due to AraBERT often predicting stopwords and Arabic pronouns. Women appear in more toxic statements in both English and French, while men are associated with a larger proportion of toxic statements in Arabic. Despite the possibility of false positives and false negatives, the statistics show that there is a significant amount of toxic content generated by widely used PTLMs that needs to be examined.
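A per-group tally of this kind reduces to counting toxic labels over completed sentences grouped by subject. The group names and labels below are fabricated for illustration; only the aggregation logic reflects the analysis behind Table 6.

```python
# Sketch of the per-group toxicity tally: for each social group mentioned
# in a completed sentence, count how often the sentence was labeled toxic.
from collections import Counter

# (group mentioned in the sentence, classifier label) -- illustrative data
labeled = [
    ("refugee", True), ("refugee", True), ("refugee", False),
    ("disabled", True), ("disabled", False),
    ("Asian woman", False), ("Asian woman", False),
]

totals, toxic_counts = Counter(), Counter()
for group, is_toxic in labeled:
    totals[group] += 1
    toxic_counts[group] += is_toxic  # bool counts as 0/1

rates = {g: toxic_counts[g] / totals[g] for g in totals}
# e.g. rates["refugee"] is 2/3 on this toy data
```

Comparing such rates across groups and PTLMs is what surfaces the disparities reported above, subject to the caveat about classifier false positives and negatives.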

Human Evaluation
We randomly sample 100 generated sentences and ask five domain experts to annotate them as insulting, stereotypical, confusing, or normal.

Labels
We ask the annotators to assign one of the following four labels to each generated statement.
Stereotypical A stereotype is an overgeneralized belief about a particular social group. An example of stereotypical content can be observed when beauty is associated with women from a certain ethnicity.
Insulting A generated insulting statement can consist of a direct insult regardless of the context, such as names of animals associated with social groups (X is a dog). Other indirect insulting statements depend on the context of the statement, such as saying that someone received a job offer because of their ethnicity, religion, or gender and not due to their abilities.
Confusing A statement is labeled confusing when annotators cannot decide on whether the statement is problematic or not due to a lack of information. For instance, one can annotate X prepares dinner for his friends because of his religion as confusing since this can lack commonsense or may occur because of X's dietary restrictions; without further context, the annotator cannot decide. Other confusing cases happen when the generated token is not related to the cloze statement.
Normal A statement is labeled normal when the generated content sounds unproblematic.

Discussion
As shown in Table 7, many instances were considered toxic by the annotators. Some examples of English negative sentences that the five annotators agreed on are a black woman makes a snack because of her poverty and a disabled man tells mom because of his blindness. This further demonstrates that there can be more toxic instances in PTLM-generated texts, even if our two-step approach for automatic evaluation tries to filter out patterns that are considered toxic by the classifiers. Despite prompting the generation task with simple statements, the relative bias of toxic language classifiers can still be observed.
In addition, breaking each generated sentence into a subject, an action, and a reason (the latter corresponding to the unmasked token) to guide the classification process allowed us to counter a considerable portion of false positives. This may later help us define a trust value, or how much each part of the sentence contributes to the toxicity score, and make this process explainable. In fact, an explainable toxic language detection process could speed up the human annotation, since the annotators would be pointed to the part of the sentence that may have misled the classifier.

Frequent Content in English
We show examples of potentially harmful yet relatively informative descriptive nouns and adjectives which appear as Top-1 predictions in Table 8. We observe a large portion of (a) stereotypical content, such as refugees being depicted as hungry by BERT and afraid by GPT-2, (b) biased content, such as pregnant being commonly associated with actions performed by (1) Hispanic women and (2) women in general, and (c) harmful content, such as race, religion, and faith attributed as intentions to racialized and gendered social groups even when they perform basic actions. This confirms that PTLM-generated content can be strongly associated with words biased towards social groups, which can also help with an explainability component for toxic language analysis in PTLMs.
In fact, we can also use these top generated words and their strongly attached groups as anchors to further probe other data collections, or to evaluate selection bias in existing toxic content analysis datasets (Ousidhoum et al., 2020).
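Assembling such Top-1 frequency tables amounts to counting (group, predicted word) pairs and keeping the most frequent words per group. The records below are invented; only the counting pattern reflects how Tables 8 and 9 can be built from the predictions.

```python
# Sketch: tally Top-1 predicted words per social group and keep the most
# frequent ones, as in the frequency tables. Records are illustrative.
from collections import Counter

# (social group in the prompt, Top-1 predicted word) -- made-up examples
records = [
    ("refugees", "hungry"), ("refugees", "hungry"), ("refugees", "afraid"),
    ("Hispanic women", "pregnant"), ("Hispanic women", "pregnant"),
]

by_group = {}
for group, word in records:
    by_group.setdefault(group, Counter())[word] += 1

# Two most frequent predicted words per group
top_words = {g: c.most_common(2) for g, c in by_group.items()}
```

The resulting (word, frequency) pairs are exactly the anchors proposed above for probing other collections or auditing dataset selection bias.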

Frequent Content in French and Arabic
Similarly to Table 8, Table 9 shows biased content generated by the Arabic and French PTLMs. We observe similarly biased content about women, with the common word pregnant in both French and Arabic, in addition to other stereotypical associations such as gay and Asian men being frequently depicted as drunk in Arabic, and Chinese and Russian men as rich in French. This confirms our previous findings in multilingual settings.

Table 8: Examples of relatively informative descriptive nouns and adjectives which appear as Top-1 predictions. We show the two main social groups that are associated with them. We look at different nuances of potentially harmful associations, especially with respect to minority groups. We show their frequencies as first predictions in order to later analyze these associations.

A Case Study on Offensive Content Generated by PTLMs
When generating Arabic data, in addition to stereotypical, biased, and generally harmful content, we have observed a significant number of names of animals, often seen in sentences where the subject is a member of a commonly marginalized social group in the Arabic-speaking world, such as foreign migrants.³ Table 10 shows names of animals that usually carry a bad connotation in the Arabic language.
Besides showing a blatant lack of commonsense in Arabic cause-effect associations, we observe that such content is mainly coupled with groups involving people from East-Africa, South-East Asia, and the Asian Pacific region. Such harmful biases have to be addressed early on and taken into account when using and deploying AraBERT.

Word      | Associated social groups (frequency)
dog       | Japanese (2,085), Indian (2,025), Chinese (1,949), Russian (1,924), Asian (1,890)
pig       | Hindu (947), Muslim (393), Buddhist (313), Jewish (298), Hindu women (183)
donkey    | Indian (472), Pakistani (472), Brown (436), Arab (375), African (316)
snake     | Indian (1,116), Chinese (831), Hindu (818), Asian (713), Pakistani (682)
crocodile | African (525), Indian (267), Black (210), Chinese (209), Asian (123)

Table 10: Frequency of social groups associated with names of animals in the predictions. The words are sometimes brought up as a reason (e.g., A man finds a new job because of a dog), as part of implausible cause-effect sentences. Yet, sometimes they are used as direct insults (e.g., because he is a dog). The last statement is insulting in Arabic.

Related Work
The large and incontestable success of BERT (Devlin et al., 2019) revolutionized the design and performance of NLP applications. However, we are still investigating the reasons behind this success on the experimental side (Prasanna et al., 2020). Classification models are typically fine-tuned using PTLMs to boost their performance, including hate speech and offensive language classifiers (Aluru et al., 2020; Ranasinghe and Zampieri, 2020). PTLMs have even been used as label generation components in tasks such as entity type prediction (Choi et al., 2018). This work aims to assess toxic content in large PTLMs in order to help with the examination of elements which ought to be taken into account when adapting the formerly stated strategies during the fine-tuning process.
Similarly to how long-existing stereotypes are deep-rooted in word embeddings (Papakyriakopoulos et al., 2020; Garg et al., 2018), PTLMs have also been shown to recreate stereotypical content due to the nature of their training data (Sheng et al., 2019). Different probing experiments have been proposed to study the drawbacks of PTLMs in areas such as the biomedical domain (Jin et al., 2019), syntax (Hewitt and Manning, 2019; Marvin and Linzen, 2018), semantic and syntactic sentence structures (Tenney et al., 2019), pronominal anaphora (Sorodoc et al., 2020), commonsense (Petroni et al., 2019), gender bias (Kurita et al., 2019), and typicality in judgement (Misra et al., 2021). Except for Hutchinson et al. (2020), who examine what words BERT generates in some fill-in-the-blank experiments with regard to people with disabilities, and more recently Nozza et al. (2021), who assess hurtful auto-completion by multilingual PTLMs, we are not aware of other strategies designed to estimate toxic content in PTLMs with regard to several social groups. In this work, we are interested in assessing how PTLMs encode bias towards different communities.
Bias in social data is a broad concept which involves several issues and formalisms (Kiritchenko and Mohammad, 2018; Olteanu et al., 2019; Papakyriakopoulos et al., 2020; Blodgett et al., 2020). For instance, Shah et al. (2020) present a framework to predict the origin of different types of bias, including label bias (Sap et al., 2019a), selection bias (Garimella et al., 2019; Ousidhoum et al., 2020), model overamplification (Zhao et al., 2017), and semantic bias (Garg et al., 2018). Other work investigates the effect of data splits (Gorman and Bedrick, 2019) and mitigation strategies (Dixon et al., 2018; Sun et al., 2019). Bias in toxic language classification has been addressed through mitigation methods which focus on false positives caused by identity words and lack of context (Park et al., 2018; Davidson et al., 2019; Sap et al., 2019a). We take this issue into account in our experiments by looking at different parts of the generated statements.
Consequently, there has been an increasing amount of work on explainability for toxic language classifiers (Aluru et al., 2020; Mathew et al., 2021). For instance, Aluru et al. (2020) use LIME (Ribeiro et al., 2016) to extract explanations when detecting hateful content. Akin to Ribeiro et al. (2016), more recent work on explainability by Ribeiro et al. (2020) provides a methodology for testing NLP models based on a matrix of general linguistic capabilities named CheckList. Similarly, we present a set of steps in order to probe for toxicity in large PTLMs.

Conclusion
In this paper, we present a methodology to probe toxic content in pre-trained language models using commonsense patterns. Our large-scale study presents evidence that PTLMs tend to generate harmful biases towards minorities due to their spread within the pre-trained models. We have observed several stereotypical and harmful associations across languages with regard to a diverse set of social groups. We believe that the patterns we generated, along with the predicted content, can be adopted to build toxic language lexicons from the harmful associations observed within PTLMs, and that these associations can be used to mitigate implicit biases in order to build more robust systems. Furthermore, our methodology and predictions can help us define toxicity anchors that can be utilized to improve toxic language classification. The generated words can also be used to study socio-linguistic variations across languages by comparing stereotypical content with respect to professions, genders, religious groups, marginalized communities, and various demographics. In the future, we plan to revise our data by adding actions, more fluent and complex patterns, and longer generated statements which involve human interactions between people within the same social group and people who belong to different ones.

Ethical Considerations
Our research addresses the limitations of large pretrained language models which, despite their undeniable usefulness, are commonly used without further investigation on their impact on different communities around the world. One way to mitigate this would be to use manual annotations, but due to the fast growth of current and future NLP systems, such a method is not sustainable in the long run. Therefore, as shown in our paper, classifiers can be used to point us to potentially problematic statements.
We acknowledge the lack of naturalness and fluency in some of our generated sentences as well as the reliance of our approach on biased content which exists in toxic language classifiers. Hence, we join other researchers in calling for and working toward building better toxic language datasets and detection systems. Moreover, we did not consider all possible communities around the world, nationalities, and culture-specific ethnic groups. Extensions of our work should take this shortcoming into account and consider probing content with regard to more communities, religions and ideologies, as well as non-binary people as previously expressed by Mohammad (2020) and Nozza et al. (2021).
Finally, we mitigated the risk of biased annotations by working with annotators who come from different backgrounds, to whom we showed the original statements along with professional translations of the French and the Arabic statements. The annotators were able to get in touch with a native speaker at any time during the labeling process and were paid above the local minimum wage. We do not share personal information about the annotators and do not release sensitive content that can be harmful to any individual or community. All our experiments can be replicated.
ORIGINAL LANGUAGE | EXAMPLE
ENGLISH | An immigrant woman plays ice hockey because she is pregnant.
FRENCH  | A Hispanic person explains the situation because he is Jewish.
ARABIC  | A Buddhist seeks an ambulance because he is Muslim.
FRENCH  | An Asian woman takes the picture because she is ugly.
ARABIC  | A European decides to rob a bank because he is French.
ARABIC  | An African man starts to speak because he is a n*g*o.