Stepmothers are mean and academics are pretentious: What do pretrained language models learn about you?

In this paper, we investigate what types of stereotypical information are captured by pretrained language models. We present the first dataset comprising stereotypical attributes of a range of social groups and propose a method to elicit stereotypes encoded by pretrained language models in an unsupervised fashion. Moreover, we link the emergent stereotypes to their manifestation as basic emotions as a means to study their emotional effects in a more generalized manner. To demonstrate how our methods can be used to analyze emotion and stereotype shifts due to linguistic experience, we use fine-tuning on news sources as a case study. Our experiments expose how attitudes towards different social groups vary across models and how quickly emotions and stereotypes can shift at the fine-tuning stage.


Introduction
Pretraining strategies for large-scale language models (LMs) require unsupervised training on large amounts of human-generated text data. While highly successful, these methods come at the cost of interpretability, as it has become increasingly unclear what relationships they capture. Yet, as their presence in society increases, so does the importance of recognising the role they play in perpetuating social biases. In this regard, Bolukbasi et al. (2016) first discovered that word embeddings reflect gender biases captured in the training data. What followed was a suite of studies that aimed to quantify and mitigate the effect of harmful social biases in word (Caliskan et al., 2017) and sentence encoders (May et al., 2019). Despite these studies, it has remained difficult to define what constitutes "bias", with most work focusing on "gender bias" (Manela et al., 2021; Sun et al., 2019) or "racial bias" (Davidson et al., 2019; Sap et al., 2019). More broadly, biases in the models can comprise a wide range of harmful behaviors that may affect different social groups for various reasons (Blodgett et al., 2020).
In this work, we take a different focus and study stereotypes that emerge within pretrained LMs instead. While bias is a personal preference that can be harmful when the tendency interferes with the ability to be impartial, a stereotype can be defined as a preconceived idea that (incorrectly) attributes general characteristics to all members of a group. While the two concepts are closely related, i.e., stereotypes can evoke new biases or reinforce existing ones, stereotypical thinking appears to be a crucial part of human cognition that often emerges implicitly (Hinton, 2017). Hinton (2017) argued that implicit stereotypical associations are established through Bayesian principles, where the experience of their prevalence in the world of the perceiver causes the association. Thus, as stereotypical associations are not solely reflections of cognitive bias but also stem from real data, we suspect that our models, like human individuals, pick up on these associations. This is particularly true given that their knowledge is largely considered to be a reflection of the data they are trained on. Yet, while we consider stereotypical thinking to be a natural side effect of learning, it is still important to be aware of the stereotypes that models encode. Psychology studies show that beliefs about social groups are transmitted and shaped through language (Maass, 1999; Beukeboom and Burgers, 2019). Thus, specific lexical choices in downstream applications not only reflect the model's attitude towards groups but may also influence the audience's reaction to it, thereby inadvertently propagating the stereotypes they capture (Park et al., 2020).
Studies focused on measuring stereotypes in pretrained models have thus far taken supervised approaches, relying on human knowledge of common stereotypes about (a smaller set of) social groups (Nadeem et al., 2020; Nangia et al., 2020). This, however, bears a few disadvantages: (1) due to the implicit nature of stereotypes, human-defined examples can only expose a subset of popular stereotypes, but will omit those that human annotators are unaware of (e.g. models might encode stereotypes that are not as prevalent in the real world); (2) stereotypes vary considerably across cultures (Dong et al., 2019), meaning that the stereotypes tested for will heavily depend on the annotator's cultural frame of reference; (3) stereotypes constantly evolve, making supervised methods difficult to maintain in practice. Therefore, similar to Field and Tsvetkov (2020), we advocate the need for implicit approaches to expose and quantify bias and stereotypes in pretrained models.
We present the first dataset of stereotypical attributes of a wide range of social groups, comprising ∼2K attributes in total. Furthermore, we propose a stereotype elicitation method that enables the retrieval of salient attributes of social groups encoded by state-of-the-art LMs in an unsupervised manner. We use this method to test the extent to which models encode the human stereotypes captured in our dataset. Moreover, we are the first to demonstrate how training data at the fine-tuning stage can directly affect stereotypical associations within the models. In addition, we propose a complementary method to study stereotypes in a more generalized way through the use of emotion profiles, and systematically compare the emerging emotion profiles for different social groups across models. We find that all models vary considerably in the information they encode, with some models being overall more negatively biased while others are mostly positive instead. Yet, in contrast to previous work, this study is not meant to advocate the need for debiasing. Instead, it is meant to expose varying implicit stereotypes that different models incorporate and to bring awareness to how quickly attitudes towards groups change based on contextual differences in the training data used both at the pretraining and fine-tuning stage.

Related work
Previous work on stereotypes. While studies that explicitly focus on stereotypes have remained limited in NLP, several works on bias touch upon this topic (Blodgett et al., 2020). This includes, for instance, studying specific phenomena such as the infamous 'Angry Black Woman' stereotype and the 'double bind' theory (Heilman et al., 2004) (Kiritchenko and Mohammad, 2018; May et al., 2019; Tan and Celis, 2019), or relating model predictions to gender stereotype lexicons (Field and Tsvetkov, 2020). To the best of our knowledge, Nadeem et al. (2020), Nangia et al. (2020) and Manela et al. (2021) are the first to explicitly study stereotypes in pretrained sentence encoders. While Manela et al. (2021) focus on gender stereotypes using the WinoBias dataset (Zhao et al., 2018), the other works propose new crowdsourced datasets (i.e. StereoSet and CrowS-Pairs) with stereotypes that cover a wide range of social groups. All datasets, however, have a similar set-up: they contain pairs of sentences of which one is more stereotypical than the other. Working in the language modeling framework, they evaluate whether the model "prefers" the stereotypical sentence over the anti-stereotypical one. In contrast, we propose a different experimental setup and introduce a new dataset that leverages search engines' autocomplete suggestions for the acquisition of explicit stereotypical attributes. Instead of indirectly uncovering stereotypes through comparison, our elicitation method directly retrieves salient attributes encoded in the models. Our technique is inspired by Kurita et al. (2019), but while they measure the LM probability for completing sentences with the pronouns she and he specifically, we study the top k salient attributes without posing any restrictions on what these could be. Moreover, we are the first to include both monolingual and multilingual models in our analysis.
Stereotype-driven emotions. Stereotypes are constantly changing, and identifying negative ones in particular is an inherently normative process. While some stereotypes clearly imply disrespect (e.g., women are incompetent), others emerge from excessive competence instead (e.g., Asians are good at math). Moreover, stereotypical content is heavily influenced by the social pressures of society at the time. Cuddy et al. (2009) argue that no stereotype remains stable and predictable from theoretical principles. Hence, many social psychologists have abandoned the study of stereotype content to focus on systematic principles that generalize across different specific instances of stereotypes instead, presumably making them more stable over time and place (Cuddy et al., 2009; Mackie et al., 2000; Weiner, 1993). Similarly, we explore a more robust approach to uncovering stereotypes in pretrained LMs by studying how stereotypes are more generally manifested as varying emotion profiles in the models. Previous works show that groups evoke different emotional profiles (Cottrell and Neuberg, 2005; Tapias et al., 2007; Mackie et al., 2000), and a variety of theories link particular intergroup relations to distinct stereotype-driven emotions such as disgust and anger (Harris and Fiske, 2006, 2009).

Stereotypes from search engines
Retrieving human stereotypes in an implicit manner can be useful, as people are likely to give more politically correct answers when asked for stereotypes explicitly. The questions we ask search engines are often posed in the comfort of our own homes, making them likely to reflect true stereotypes that are out there in the real world (Stephens-Davidowitz, 2018). We therefore collect autocomplete suggestions for simple query templates such as 'Why are [TGT] so [ATTR]?', where [TGT] are social groups for which we search stereotypes and [ATTR] is the salient attribute with which the search engine completes the sequence. We tested other (longer and more elaborate) templates, but we found that they did not produce many autocomplete suggestions. In fact, we believe that such simple queries are so successful precisely because of their simplicity, given that people are likely to keep search queries concise.
Search engines. Due to Google's hate speech filtering system, the autocompletion feature is disabled for frequently targeted groups, e.g. black people, Jewish people and members of the LGBTQ+ community. Thus, we retrieve autocomplete suggestions from 3 search engines: Google, Yahoo and DuckDuckGo. In many cases, identical completions were given by multiple search engines. We sort these duplicate samples under the category 'multiple engines'. We find that most negative (offensive) stereotypes are retrieved from Yahoo.
Pre-processing. We clean up the dataset manually, using the following procedure:
1. Remove noisy completions that do not result in a grammatically correct sentence, e.g. non-adjectives.
2. Remove specific trend-sensitive references, e.g. to video games: 'why are asians so good at league of legends'.
3. Remove neutral statements not indicative of stereotypes, e.g. 'why are [TGT] so called'.
4. Filter out completions consisting of multiple words (see footnote 1). Yet, when possible, the input is altered such that only the key term has to be predicted by the model, e.g., 'Why are russians so x', where x = good at playing chess → 'Why are russians so good at x', x = chess.
In Table 1 we provide some examples from the dataset. See Appendix B for more details on the data acquisition and search engines. The full code and dataset are publicly available (see footnote 2).

Correlating human stereotypes with salient attributes in pretrained models
To test for human stereotypes, we propose a stereotype elicitation method that is inspired by cloze testing, a technique that stems from psycholinguistics. Using our method, we retrieve salient attributes from the model in an unsupervised manner and compute recall scores over the stereotypes captured in our search engine dataset.
Pretrained models. We study different types of pretrained LMs, of which 3 are monolingual and 2 multilingual: BERT (Devlin et al., 2019) uncased, trained on the BooksCorpus dataset (Zhu et al., 2015) and English Wikipedia; RoBERTa (Liu et al., 2019), the optimized version of BERT that is in addition trained on data from CommonCrawl News (Nagel, 2016), OpenWebTextCorpus (Gokaslan and Cohen, 2019) and STORIES (Trinh and Le, 2018); and BART, a denoising autoencoder (Lewis et al., 2020) that, while using a different architecture and pretraining strategy from RoBERTa, uses the same training data. Moreover, we use mBERT, which, apart from being trained on Wikipedia in multiple languages, is identical to BERT. We use the uncased version that supports 102 languages. Similarly, XLM-R is the multilingual variant of RoBERTa (Conneau et al., 2020) that is trained on cleaned CommonCrawl data (Wenzek et al., 2020).
Footnote 1: Although incompatible with our set-up, we do not remove multi-word completions from the dataset as they can be valuable in future studies.
Footnote 2: https://github.com/RochelleChoenni/stereotypes_in_lms
Stereotype elicitation method. For each sample in our dataset, we feed the model the template sentence and replace [ATTR] with the [MASK] token. We then retrieve the top k = 200 model predictions for the [MASK] token, and test how many of the stereotypes found by the search engines are also encoded in the LMs. We adapt the method from Kurita et al. (2019) to rank the top k returned model outputs based on their typicality for the respective social group. We quantify typicality by computing the log of the model probability for the predicted completion, corrected for by the prior probability of the completion, e.g.:
$P_{post}(y = \text{strict} \mid \text{Why are parents so } y\text{?})$ (1)
$P_{prior}(y = \text{strict} \mid \text{Why are [MASK] so } y\text{?})$ (2)
$p = \log(P_{post} / P_{prior})$ (3)
i.e., measuring the association between the words by computing the chance of completing the template with 'strict' given 'parents', corrected by the prior chance of 'strict' given any other group. Note that Eq. 3 has been well-established as a measure for stereotypicality in research from both social psychology (McCauley et al., 1980) and economics (Bordalo et al., 2016). After re-ranking by typicality, we evaluate how many of the stereotypes are correctly retrieved by the model through recall@k for each of the 8 target categories.
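The re-ranking and recall@k steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the released implementation: it assumes the typicality score is the log ratio of the group-conditioned probability to the prior probability, the function names (`typicality`, `rerank_by_typicality`, `recall_at_k`) are hypothetical, and the probabilities are invented toy numbers rather than model outputs.

```python
import math

def typicality(p_post, p_prior):
    """Log ratio of the group-conditioned probability to the prior probability."""
    return math.log(p_post / p_prior)

def rerank_by_typicality(post_probs, prior_probs, k):
    """Return the k candidate attributes with the highest typicality, best first.

    post_probs:  {attribute: P(attribute | template with the group filled in)}
    prior_probs: {attribute: P(attribute | template with the group masked)}
    """
    scored = {
        attr: typicality(p, prior_probs[attr])
        for attr, p in post_probs.items()
        if p > 0.0 and prior_probs.get(attr, 0.0) > 0.0
    }
    return sorted(scored, key=scored.get, reverse=True)[:k]

def recall_at_k(ranked, gold, k):
    """Fraction of the human stereotypes in `gold` found in the top k of `ranked`."""
    if not gold:
        return 0.0
    return len(set(ranked[:k]) & set(gold)) / len(gold)

# Toy probabilities (invented for illustration only).
post = {"strict": 0.08, "good": 0.10, "annoying": 0.05, "tall": 0.01}
prior = {"strict": 0.01, "good": 0.10, "annoying": 0.02, "tall": 0.02}

ranked = rerank_by_typicality(post, prior, k=3)
gold = {"strict", "annoying", "overprotective"}  # hypothetical search-engine stereotypes
score = recall_at_k(ranked, gold, k=2)
```

In this toy case 'good' has the highest raw probability but a typicality of zero, because it is equally likely for any group; the correction by the prior is what moves group-specific attributes such as 'strict' to the top.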
Results. Figure 1 shows the recall@k scores per model separated by category, showcasing the ability to directly retrieve stereotypical attributes of social groups using our elicitation method. While models capture the human stereotypes to similar extents, results vary when comparing across categories, with most models obtaining the highest recall for country stereotypes. Multilingual models obtain relatively low scores when recalling stereotypical attributes pertaining to age, gender and political groups. Yet, XLMR-L scores relatively high on stereotypical profession and race attributes.
Figure 1: Recall@k scores for recalling the human-defined stereotypes captured in our dataset using our stereotype elicitation method on various pretrained LMs.
The suboptimal performance of multilingual models could be explained in different ways. For instance, as multilingual models are known to suffer from negative interference (Wang et al., 2020), their quality on individual languages is lower compared to monolingual models, due to limited model capacity. This could result in a loss of stereotypical information. Alternatively, multilingual models are trained on more culturally diverse data; thus, conflicting information could counteract within the model, with stereotypes from different languages dampening each other's effect. Cultural differences might also be more pronounced when it comes to e.g. age and gender, whilst profession and race stereotypes might be established more universally.

Quantifying emotion towards different social groups
To study stereotypes through emotion, we draw inspiration from psychology studies showing that stereotypes evoke distinct emotions based on different types of perceived threats (Cottrell and Neuberg, 2005) or perceived social status and competitiveness of the targeted group (Fiske, 1998). For instance, Cottrell and Neuberg (2005) show that both feminists and African Americans elicit anger, but while the former group is perceived as a threat to social values, the latter is perceived as a threat to property instead. Thus, the stereotypes that underlie the emotion are likely different. Whilst strong emotions are not evidence of stereotypes per se, they do suggest the powerful effects of subtle biases captured in the model. Thus, the study into emotion profiles provides us with a good starting point to identify which stereotypes associated with the social groups evoke those emotions. To this end, we (1) build emotion profiles for social groups in the models and (2) retrieve stereotypes about the groups that most strongly elicit emotions.

Model predictions
To measure the emotions encoded by the model, we feed the model the 5 stereotype-eliciting templates for each social group and retrieve the top 200 predictions for the [MASK] token (1000 in total). Among the 1000 salient attributes retrieved from the 5 templates, many predictions overlap, leaving approximately 300-350 unique attributes per social group. This indicates that the returned model predictions are robust with regard to the different templates.
Emotion scoring. For each group, we score the predicted set of stereotypical attributes $W_{TGT}$ using the NRC emotion lexicon (Mohammad and Turney, 2013) that contains ∼14K English words that are manually annotated with Ekman's eight basic emotions (fear, joy, anticipation, trust, surprise, sadness, anger, and disgust) (Ekman, 1999) and two sentiments (negative and positive). These emotions are considered basic as they are thought to be shaped by natural selection to address survival-related problems, which is often denoted as a driving factor for stereotyping (Cottrell and Neuberg, 2005). We use the annotations that consist of a binary value (i.e. 0 or 1) for each of the emotion categories; words can have multiple underlying emotions (e.g. selfish is annotated with 'negative', 'anger' and 'disgust') or none at all (e.g. vocal scores 0 on all categories). We find that the coverage for the salient attributes in the NRC lexicon is ≈ 70-75% per group.
We score groups by counting the frequencies with which the predicted attributes $W_{TGT}$ are associated with the emotions and sentiments. For each group, we remove attributes from $W_{TGT}$ that are not covered in the lexicon. As a result, we do not extract emotion scores for the exact same number of attributes per group (the number of unique attributes and the coverage in the lexicon vary). We therefore normalize scores per group by the number of words for which we are able to retrieve emotion scores (≈ 210-250 per group). The score of an emotion-group pair is computed as follows:
$s_{emo}(TGT) = \frac{1}{|W_{TGT}|} \sum_{w \in W_{TGT}} \mathrm{NRC}_{emo}(w)$
where $\mathrm{NRC}_{emo}(w) \in \{0, 1\}$ is the lexicon annotation of attribute $w$ for the emotion $emo$. We then define emotion vectors $v \in \mathbb{R}^{10}$ for each group $TGT$: $v_{TGT} = [s_{fear}, s_{joy}, s_{sadness}, s_{trust}, s_{surprise}, s_{anticipation}, s_{disgust}, s_{anger}, s_{negative}, s_{positive}]$, which we use as a representation for the emotion profiles within the model.
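As a rough illustration of the scoring step, the sketch below builds a 10-dimensional emotion vector from a set of predicted attributes. It assumes a lexicon represented as a mapping from words to sets of labels; `toy_lexicon` is a hypothetical stand-in for the NRC lexicon, and only the entries for 'selfish' and 'vocal' follow the examples in the text (the other annotations are invented), as is the function name `emotion_profile`.

```python
EMOTIONS = ["fear", "joy", "sadness", "trust", "surprise",
            "anticipation", "disgust", "anger", "negative", "positive"]

def emotion_profile(attributes, lexicon):
    """Return a 10-dimensional emotion vector for one social group.

    attributes: predicted attributes for the group (duplicates across templates allowed)
    lexicon:    {word: set of emotion/sentiment labels}, a stand-in for the NRC lexicon
    """
    # Keep unique attributes only, and drop those the lexicon does not cover.
    covered = [w for w in set(attributes) if w in lexicon]
    if not covered:
        return [0.0] * len(EMOTIONS)
    # Count how often each emotion is annotated, normalized by the covered attributes.
    return [sum(emo in lexicon[w] for w in covered) / len(covered) for emo in EMOTIONS]

# Hypothetical toy lexicon; 'reliable' and 'dirty' annotations are invented.
toy_lexicon = {
    "selfish": {"negative", "anger", "disgust"},
    "vocal": set(),
    "reliable": {"positive", "trust"},
    "dirty": {"negative", "disgust"},
}

predictions = ["selfish", "vocal", "reliable", "dirty", "selfish", "unlisted"]
profile = dict(zip(EMOTIONS, emotion_profile(predictions, toy_lexicon)))
```

Note that a word such as 'vocal', which is in the lexicon but carries no emotion, still counts towards the normalization, whereas 'unlisted' is dropped because the lexicon does not cover it.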
Analysis. Figure 2 provides examples of the emotion profiles encoded for a diverse set of social groups to demonstrate how these profiles allow us to expose stereotypes. For instance, we see that in RoBERTa-B religious people and liberals are primarily associated with attributes that underlie anger. Towards homosexuals, the same amount of anger is accompanied by disgust and fear as well. As a result, we can detect distinct salient attributes that contribute to these emotions, e.g.: Christians are intense, misguided and perverse, liberals are phony, mad and rabid, whilst homosexuals are dirty, bad, filthy, appalling, gross and indecent. The finding that homosexuals elicit relatively high levels of disgust can be confirmed by studies on humans as well (Cottrell and Neuberg, 2005). Similarly, we find that Greece and Puerto Rico elicit relatively high levels of fear and sadness in RoBERTa-B. Whereas Puerto Rico is turbulent, battered, armed, precarious and haunted, for Greece we find attributes such as failing, crumbling, inefficient, stagnant and paralyzed.
Emotion profiles elicited in BART-B differ considerably, showcasing how vastly sentiments vary across models. In particular, we see that overall the evoked emotion responses are weaker. Moreover, we detect relative differences such as liberals being more negatively associated than homosexuals, encoding attributes such as cowardly, greedy and hypocritical. We also find that BART-B encodes more positive associations, e.g., committed, reliable, noble and responsible contributing to trust for husbands. Interestingly, all multilingual models encode vastly more positive attributes for all social groups (see Appendix D). We expect that this might be an artefact of the training data, but leave further investigation of this for future work.
Comparison across models. We systematically compare the emotion profiles elicited by the social groups across different models by adapting the Representational Similarity Analysis (RSA) from Kriegeskorte et al. (2008). We opted for this method as it takes the relative relations between groups within the same model into account. This is particularly important as we have seen that some models are overall more negatively or positively biased. Yet, when it comes to bias and stereotypicality, we are less interested in absolute differences across models, but rather in how emotions differ towards groups in relation to the other groups. First, the representational similarity within each model is defined using a similarity measure to construct a representational similarity matrix (RSM). We define a similarity vector $\hat{w}_{TGT}$ for a social group such that every element $\hat{w}_{ij}$ of the vector is determined by the cosine similarity between $v_i$, where $i = TGT$, and the vector $v_j$ for the $j$-th group in the list. The RSM is then defined as the symmetric matrix consisting of all similarity vectors. The resulting matrices are then compared across models by computing the Spearman correlation (ρ) between the similarity vectors corresponding to the emotion profiles for a group in a model a and b. To express the similarity between the two models, we take the mean correlation over all social groups in our list.
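The RSA comparison can be sketched with the standard library alone. The code below is an illustrative reading of the procedure, not the authors' implementation: it assumes that each row of the RSM (including the diagonal entry) serves as a group's similarity vector, that Spearman's ρ is computed as the Pearson correlation of average ranks, and that the emotion vectors are toy 3-dimensional values; all function names are hypothetical.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rsm(profiles):
    """Symmetric matrix of cosine similarities between all group emotion vectors."""
    return [[cosine(u, v) for v in profiles] for u in profiles]

def ranks(values):
    """1-based average ranks, so tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for t in range(i, j + 1):
            out[order[t]] = (i + j) / 2 + 1
        i = j + 1
    return out

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def spearman(x, y):
    return pearson(ranks(x), ranks(y))

def rsa(profiles_a, profiles_b):
    """Mean Spearman correlation between matching RSM rows of two models."""
    rsm_a, rsm_b = rsm(profiles_a), rsm(profiles_b)
    rhos = [spearman(row_a, row_b) for row_a, row_b in zip(rsm_a, rsm_b)]
    return sum(rhos) / len(rhos)

# Toy emotion vectors for four groups in two hypothetical models.
model_a = [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0], [0.0, 1.0, 0.0], [0.0, 0.5, 0.5]]
model_b = [[0.5 * x for x in v] for v in model_a]  # uniformly weaker emotions

similarity = rsa(model_a, model_b)
```

Because cosine similarity ignores vector length, a model whose emotional responses are uniformly weaker (`model_b`) still yields a correlation of 1 with `model_a`, which illustrates why the comparison captures relative rather than absolute differences between groups.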

Results
Computing RSA over all categories combined shows that RoBERTa-B and BART-B obtain the highest correlation (ρ = 0.44). While using different architectures and pretraining strategies, the models rely on the same training data. Yet, we included base and large versions of models in our study and find that these models show little to no correlation (see Appendix E, Fig. 10). This is surprising, as the large versions are pretrained on the same data and tasks as their base versions (but contain more model parameters, e.g. through additional layers). This shows how complex the process is in which associations are established and provides strong evidence that other modelling decisions, apart from training data, contribute to what models learn about groups. Thus, carefully controlling training content cannot fully eliminate the need to analyze models w.r.t. the stereotypes that they might propagate.

Stereotype shifts during fine-tuning
Many debiasing studies intervene at the data level, e.g., by augmenting imbalanced datasets (Manela et al., 2021; Webster et al., 2018; Dixon et al., 2018; Zhao et al., 2018) or reducing annotator bias (Sap et al., 2019). These methods are, however, dependent on the dataset, domain, or task, making new mitigation needed when transferring to a new setup (Jin et al., 2020). This raises the question of how emotion profiles and stereotypes are established through language use, and how they might shift due to new linguistic experience at the fine-tuning stage. We take U.S. news sources from across the political spectrum as a case study, as media outlets are known to be biased (Baron, 2006). By revealing stereotypes learned as an effect of fine-tuning on a specific source, we can trace the newly learned stereotypes back to the respective source. We rely on the political bias categorisation of news sources from the AllSides media bias rating website. These ratings are retrieved using multiple methods, including editorial reviews, blind bias surveys, and third-party research. Based on these ratings we select the following sources: New Yorker (far left), The Guardian (left), Reuters (center), FOX News (right) and Breitbart (far right). From each news source we take 4354 articles from the All-The-News dataset, which contains articles from 27 American publications collected between 2013 and early 2020. We fine-tune the 5 base models on these news sources using the MLM objective for only 1 training epoch with a learning rate of 5e-5 and a batch size of 8 using the HuggingFace library (Wolf et al., 2020). We then quantify the emotion shift after fine-tuning using RSA.
Results. We find that fine-tuning on news sources can directly alter the encoded stereotypes. For instance, for k = 25, fine-tuning BERT-B on Reuters informs the model that Croatia is good at sports and Russia is good at hacking. At the same time, associations such as Pakistan is bad at football, Romania is good at gymnastics and South Africa at rugby are lost. Moreover, from fine-tuning on both Breitbart and FOX news the association emerges that black women are violent, while this is not the case when fine-tuning on the other sources.
In fact, Guardian and Breitbart are the only news sources that result in the encoding of the salient attribute racist for White Americans. We find that such shifts are already visible after training on as little as 25% of the original data (∼1K articles). When comparing to human stereotypes, we find that fine-tuning on Reuters decreases the overall recall scores (see Figure 4). Although New Yorker exhibits a similar trend, fine-tuning on the other sources has little effect on the number of stereotypes recalled from the dataset. As Reuters has a center bias rating, i.e., it does not predictably favor either end of the political spectrum, we speculate that large amounts of more nuanced data help transmit fewer stereotypes.
Figure 5 shows the decrease in correlation between the emotion profiles from pretrained BERT-B and BERT-B fine-tuned on different proportions of the data. Interestingly, fine-tuning on fewer articles does not automatically result in smaller changes to the models. In fact, in many cases, the amount of relative change in emotion profiles is heavily dependent on the social category, as indicated by the error bars. This is not unexpected, as news sources might propagate stronger opinions about specific categories. Moreover, we find that emotions towards different social categories cannot always be distinguished by the political bias of the source. Figure 3 shows how news sources compare to each other w.r.t. different social categories, exposing that e.g. Guardian and FOX news show lower correlation on gender than on age.
Computing correlation between all pretrained and fine-tuned models, we find that emotion profiles are prone to change irrespective of model or news source (see Appendix E). In Figure 6, we showcase the effect of fine-tuning from the model that exhibits the lowest change in correlation, i.e. RoBERTa-B, to highlight how quickly emotions shift. We find that while Reuters results in weaker emotional responses, Guardian elicits stronger negative emotions than FOX news, e.g. towards conservatives and academics. Yet, while both sources result in anger towards similar groups, for FOX news anger is more often accompanied by fear, while for Guardian it seems to stem more strongly from disgust (e.g. see Christians and Iraq).
Lastly, Figure 7 shows specific stereotype shifts found on the top 15 predictions per template. We illustrate the salient attributes that were removed, added or remained constant after fine-tuning. For instance, the role of news media in shaping public opinion about police has received much attention in the wake of the growing polarization over high-profile incidents (Intravia et al., 2018; Graziano, 2019). We find clear evidence of this polarization: fine-tuning on New Yorker results in attributes such as cold, unreliable, deadly and inept, whereas fine-tuning on FOX news yields positive associations such as polite, loyal, cautious and exceptional. In addition, we find evidence for other stark contrasts, such as the model picking up on sexist (e.g. women are not interesting and equal but late, insecure and entitled) and racist stereotypes (e.g. black people are not misunderstood and powerful, but bitter, rude and stubborn) after fine-tuning on FOX news.

Conclusion
We present the first dataset containing stereotypical attributes of a range of social groups. Importantly, our data acquisition technique enables the inexpensive retrieval of similar datasets in the future, enabling comparative analysis on stereotype shifts over time. Additionally, our proposed methods could inspire future work on analyzing the effect of training data content, and simultaneously contribute to the field of social psychology by providing a testbed for studies on how stereotypes emerge from linguistic experience. To this end, we have shown that our methods can be used to identify stereotypes evoked during fine-tuning by taking news sources as a case study. Moreover, we have exposed how quickly stereotypes and emotions shift based on training data content, and linked stereotypes to their manifestations as emotions to quantify and compare attitudes towards groups within LMs. We plan to extend our approach to more languages in future work to collect different, more culturally dependent, stereotypes as well.

Ethical consideration
The examples given in the paper can be considered offensive but are in no way a reflection of the authors' own values and beliefs and should not be taken as such. Moreover, it is important to note that for the fine-tuning experiments only a few interesting examples were studied and showcased. Hence, more thorough research should be conducted before drawing any hard conclusions about the newspapers and the stereotypes they propagate. In addition, our data acquisition process is completely automated and did not require the help of human subjects. While the stereotypes we retrieve stem from real humans, the data we collect is publicly available and completely anonymous, as the specific stereotypical attributes and/or search queries cannot be traced back to individual users.

Figure 2: Examples of emotion profiles for a diverse set of social groups from RoBERTa-B and BART-B.

Figure 3: Correlations in emotion profiles for gender and age groups across news sources (BERT-B).

Figure 6: A few interesting examples of emotion profiles for a diverse set of social groups after fine-tuning RoBERTa-B for only 1 training epoch on articles from Guardian, Reuters and FOX news respectively.

Figure 7: Stereotypical attribute shifts when fine-tuning RoBERTa-B on New Yorker (left) and FOX news (right). Removed attributes are red and those added green. Attributes that persisted are grey.

Figure 9: Examples of emotion profiles for the multilingual models. It showcases that these models are much more positive about all social groups in comparison to the monolingual models. Whereas we observed that monolingual models primarily encode negative associations for most groups, associations encoded within the multilingual models are more balanced between positive and negative sentiments.

Table 2: Ranking: 'why are old people so bad with'.