Intersectional Stereotypes in Large Language Models: Dataset and Analysis



Introduction
The current body of research concerning the propagation of stereotypes by large language models (LLMs) predominantly focuses on single-group stereotypes, such as racial bias against African Americans or gender bias against women (Mattern et al., 2022; Nadeem et al., 2021; Nangia et al., 2020; Zhao et al., 2018; Rudinger et al., 2018). Nevertheless, it is crucial to acknowledge that numerous stereotypes are directed toward intersectional groups (e.g., bias against African American women), which do not fit into broad single-group classifications.
Existing studies on intersectional stereotypes (Cheng et al., 2023; Cao et al., 2022) often adopt a reductionist approach, primarily focusing on intersectional groups comprising just two demographic attributes. Such research also tends to limit the analysis to the word level, neglecting the possibility of more covert, context-dependent stereotypes. Furthermore, the exploration of stereotypes is often constrained to a few aspects, such as appearance or illegal behavior.
To address these limitations, we have curated an intersectional stereotype dataset with the aid of the ChatGPT model. For constructing the intersectional groups, we remove all constraints and allow any combination of 14 demographic features across six categories, namely, race (white, black, and Asian), age (young and old), religion (non-religious, Christian, and Muslim), gender (men and women), political leaning (conservative and progressive), and disability status (with and without disabilities). This approach allows us to assess a wide range of stereotypes targeted at diverse group combinations, as generated by ChatGPT.
Our results show that ChatGPT effectively discerns our objectives and generates common stereotypes for intersectional groups combining up to four demographic traits. The quality of the generated stereotypes was also substantiated by human validation. However, when the number of demographic traits exceeds four, the groups become exceedingly specific, leading ChatGPT to make overly broad generalizations. By incorporating rigorous post-generation validation using both ChatGPT and human validators, we successfully mitigated this overgeneralization, thereby enhancing the quality of the data points. This demonstrates the usefulness of ChatGPT (and potentially other LLMs) for supporting stereotype-related research. Section 2 describes the complete dataset construction process.
Leveraging this newly created dataset, we probed the presence of stereotypes within two contemporary LLMs, GPT-3 (Brown et al., 2020) and ChatGPT. Following a methodology similar to Cheng et al. (2023), we interrogated the LLMs and analyzed their responses. However, we expanded the scope of inquiry by designing questions that span 16 different categories of stereotypes. Our findings revealed that all the models studied produced stereotypical responses to certain intersectional groups. This observation underscores that stereotypes persist even in the most modern LLMs, despite the moderation measures enforced during their training stage (Ferrara, 2023). We argue that future de-biasing efforts should prioritize mitigating intersectional and implicit stereotypes. Section 3 discusses the stereotype examination in more detail.

Dataset Construction
Understanding intersectional stereotypes can pose a significant challenge, particularly for non-experts, due to their complexity and overlap with more general group-based stereotypes. To address this, we have curated the dataset leveraging ChatGPT and have ensured its integrity through validation by both the model and human validators. The objective of our dataset is to facilitate the expansion of LLM-based intersectional stereotype research to a wider array of demographic groups than past investigations have covered.

Intersectional Group Construction
Existing literature on intersectional stereotypes predominantly concentrates on gender, race, and disability biases, generally focusing on dyadic combinations (Tan and Celis, 2019; Jiang and Fellbaum, 2020; Hassan et al., 2021). However, this does not encompass the entirety of the intersectional landscape. In this paper, we significantly broaden our scope by considering six demographic categories: race (white, black, and Asian), age (young and old), religion (non-religious, Christian, and Muslim), gender (men and women), political leaning (conservative and progressive), and disability status (with and without disabilities). We examine all possible combinations of these characteristics.
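To make the construction concrete, the sketch below enumerates candidate intersectional groups by taking at most one value per category over any subset of the six categories. The enumeration rule (including the minimum of two categories per group) is an assumption for illustration; the paper does not spell out the exact constraints, and the resulting count will differ accordingly.

```python
from itertools import product

# Demographic categories and their values, as listed in the paper.
CATEGORIES = {
    "race": ["white", "black", "Asian"],
    "age": ["young", "old"],
    "religion": ["non-religious", "Christian", "Muslim"],
    "gender": ["men", "women"],
    "political leaning": ["conservative", "progressive"],
    "disability status": ["with disability", "without disability"],
}

def intersectional_groups(min_categories=2):
    """Enumerate intersectional groups: at most one value per category,
    over any subset of categories of size >= min_categories (assumed)."""
    # None marks "this category is not specified" in a combination.
    options = [[None] + values for values in CATEGORIES.values()]
    for combo in product(*options):
        chosen = [value for value in combo if value is not None]
        if len(chosen) >= min_categories:
            yield tuple(chosen)

groups = list(intersectional_groups())
print(len(groups), groups[:3])
```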

Prompt Design
The design of our prompts, which are used to retrieve stereotypes from ChatGPT, encompasses three key components: the problem statement, the regulation, and the disclaimer. The problem statement communicates our objective, which is to retrieve prevalent stereotypes, and details the intersectional group for which we seek these stereotypes. The regulation component instructs ChatGPT to refrain from overly generalizing its responses. It also asks the model to rationalize its responses to help minimize hallucinations, a common issue in language generation (Ji et al., 2023). Additionally, we direct the model to return widely acknowledged stereotypes associated with the target group rather than inventing new ones. Lastly, the disclaimer underscores that the data collection is conducted strictly to research stereotypes. This is a crucial clarification to ensure that our requests are not misconstrued and subsequently moderated. An example of such a prompt is presented in Figure 1.
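As an illustration of how the three components could be assembled programmatically, the following sketch concatenates a problem statement, a regulation, and a disclaimer into one retrieval prompt. The wording is hypothetical; the paper's actual prompt is the one shown in Figure 1.

```python
def build_stereotype_prompt(group: str) -> str:
    """Assemble a retrieval prompt from the three components described above.
    The wording here is illustrative, not the paper's exact prompt."""
    problem_statement = (
        f"Please list common stereotypes that people hold about {group}, "
        "together with a brief explanation of each stereotype."
    )
    regulation = (
        "Only report widely acknowledged stereotypes about this specific group; "
        "do not invent new ones, do not over-generalize to broader groups, "
        "and justify why each stereotype is associated with this group."
    )
    disclaimer = (
        "This request is made strictly for research on identifying and "
        "mitigating stereotypes, not to endorse them."
    )
    return "\n\n".join([problem_statement, regulation, disclaimer])

print(build_stereotype_prompt("black women"))
```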

Stereotype Retrieval
As depicted in Figure 1, we embed the intersectional groups into the prompts and generate stereotypes from ChatGPT. The responses received are manually segmented into triples consisting of the target group, the stereotype, and an explanation. For instance, given the prompt shown in Figure 1, one of the stereotypes generated by ChatGPT could be ("Black+Women", "Angry Black Woman", "This stereotype characterizes black women as being aggressive, confrontational, and quick to anger."). It is important to note that ChatGPT sometimes struggles to produce ample stereotypes for a particular intersectional group, especially when the group combines more than four demographic traits. In these instances, it tends to generate more generalized stereotypes. We manually curate these responses by excluding them from the specific intersectional group's dataset and incorporating them into the datasets of the other, broader intersectional groups identified by the model.
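A simple record type for these manually segmented triples might look as follows; the field names are our own, and the example values are taken from the illustration above.

```python
from dataclasses import dataclass

@dataclass
class StereotypeRecord:
    """One manually segmented data point: (target group, stereotype, explanation)."""
    group: str          # e.g., "Black+Women"
    stereotype: str     # e.g., "Angry Black Woman"
    explanation: str    # model-provided rationale for the stereotype

# Example record taken from the paper's illustration.
example = StereotypeRecord(
    group="Black+Women",
    stereotype="Angry Black Woman",
    explanation=("This stereotype characterizes black women as being "
                 "aggressive, confrontational, and quick to anger."),
)
```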

Data Filtering
Our initial data generation process resulted in some stereotypes that applied to multiple, nested intersectional groups, which did not align with our expectations. To enhance the quality of our data, we employed both automatic and manual data filtering to remove inappropriate data points. For the automatic filtering, we used a specific prompt, as shown in Figure 2, to task ChatGPT with identifying stereotypes in its own generated responses that could also apply to broader demographic groups. For instance, in the example presented in Figure 2, all stereotypes generated by ChatGPT were eliminated because they were frequently applicable to more generalized demographic groups. We monitored the entire process carefully to ensure that ChatGPT removed the correct instances and provided solid reasons in its explanations. Subsequently, we manually reviewed all data points, eliminating any stereotypes that contradicted our understanding of the stereotypes associated with each intersectional group. After these filtering steps, our final dataset included an average of 4.53 stereotypes for each of 106 intersectional groups, with no stereotypes identified for the 1,183 other intersectional groups. Table 1 provides a comprehensive list of the examined intersectional groups for which ChatGPT was able to generate stereotypes.
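A minimal sketch of the automatic filtering pass is given below, assuming a generic `ask_chatgpt` chat-completion helper and the `StereotypeRecord` type sketched above. The filtering prompt wording is hypothetical; the paper's actual prompt is shown in Figure 2.

```python
def overgeneralization_filter(records, ask_chatgpt):
    """First-pass automatic filter: ask the model whether each of its own
    stereotypes also applies to a broader demographic group.
    `ask_chatgpt` is a placeholder for whatever chat-completion call is used."""
    kept, removed = [], []
    for rec in records:
        prompt = (
            f"Does the stereotype '{rec.stereotype}' apply specifically to "
            f"{rec.group}, or does it commonly apply to a broader demographic "
            "group? Answer 'specific' or 'broader' and explain briefly."
        )
        verdict = ask_chatgpt(prompt)
        (removed if verdict.lower().startswith("broader") else kept).append(rec)
    # All removals are then reviewed manually before being discarded.
    return kept, removed
```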

Human Validation
As an integral part of our quality control process, we subjected all retrieved stereotypes to human validation.This process ensured that (1) the stereotypes are commonly observed in real life, (2) the stereotypes accurately correspond to the target intersectional groups, and (3) the stereotypes are not applicable to broader demographic groups.
For the commonality validation, validators were asked to affirm whether the provided stereotype is frequently associated with the target group (yes or no). 98.33% of the stereotypes in our dataset were agreed upon by at least two out of three validators as being commonly observed either in everyday life or on social media platforms, and inter-annotator agreement (IAA) was measured for this validation. Together, the validation results demonstrate that our dataset is of high quality: it comprises stereotypes that are accurately attributed to a broad range of intersectional groups.

Table 2: The list of all 16 categories of stereotypes examined in this paper, with an explanation of each category (for example, one category covers stereotypes related to a person's social behaviors, such as partying, as well as attitudes toward their career or education).
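The two quantities reported above (the majority-agreement rate and an IAA statistic) can be computed as in the sketch below. The paper does not state which IAA statistic was used, so Fleiss' kappa is shown purely as an assumption.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

def majority_agreement_rate(labels):
    """labels: array of shape (n_items, n_validators) with binary yes(1)/no(0)
    commonality judgments. Returns the fraction of items on which at least
    two of the three validators answered 'yes'."""
    labels = np.asarray(labels)
    return float((labels.sum(axis=1) >= 2).mean())

def iaa_fleiss(labels):
    """One possible IAA measure (Fleiss' kappa); assumed, not stated in the paper."""
    table, _ = aggregate_raters(np.asarray(labels))
    return fleiss_kappa(table)

# Toy example: 4 stereotypes rated by 3 validators each.
ratings = [[1, 1, 1], [1, 1, 0], [0, 1, 1], [1, 0, 0]]
print(majority_agreement_rate(ratings))  # 0.75
print(iaa_fleiss(ratings))
```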

Stereotype Examination
Cheng et al. (2023) studied stereotypes in LLMs by instructing the models to create personas based on specified intersectional groups, subsequently identifying words that contribute significantly to differentiating each intersectional group from "unmarked" groups. However, the models' responses to their prompts (such as "Imagine you are [group], describe yourself") often appeared unnatural, according to their provided examples. Additionally, scrutinizing stereotypes at the word level does not seem promising, since many "representative words" in their findings lack clarity unless they co-occur with other, less representative words. For instance, "almond-shaped", when associated with Asian women, does not convey any meaningful information unless we know that it refers to eye shape. Furthermore, the broad freedom their questions afford the models results in words representing each intersectional group being mostly related to appearance.
In view of the strengths and limitations of this previous approach, we apply stricter regulations in designing our questions for stereotype examination. Specifically, we categorize the stereotypes into 16 types (including but not limited to appearance-related and behavioral stereotypes) and individually craft questions under each category. We consciously simplify the questions to facilitate easier categorization and examination of the models' responses. For each question, we manually formulate a set of expected answers, enabling us to classify the responses of LLMs into a finite number of categories and simplify the analysis of answer distributions. Importantly, we do not make any assumptions about the answers; we consider an LLM to display stereotypical behavior if its answers to a specific question consistently fall within one specific category across multiple trials. Table 2 shows the categories of stereotypes, and Appendix B provides an example question with its expected answers for each category.
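One way to represent such a question, its finite set of expected answers, and the bucketing of free-form LLM responses is sketched below. The category, question, and answers are hypothetical placeholders, and the substring-based classifier is only one plausible implementation, not necessarily the paper's.

```python
from dataclasses import dataclass, field

@dataclass
class ExaminationQuestion:
    """A category-specific question with a finite set of expected answers,
    used to bucket LLM responses (actual questions are listed in Appendix B)."""
    category: str
    question: str
    expected_answers: list = field(default_factory=list)

    def classify(self, response: str) -> str | None:
        """Map a free-form response to one expected answer by simple
        substring matching; returns None if no expected answer is found."""
        lowered = response.lower()
        for answer in self.expected_answers:
            if answer.lower() in lowered:
                return answer
        return None

q = ExaminationQuestion(
    category="appearance",
    question="How would you describe your eyes?",
    expected_answers=["almond-shaped", "round", "narrow"],
)
print(q.classify("I guess people say my eyes are almond-shaped."))
```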

Target Role Simulation
Our stereotype examination requires repeated queries to the LLMs using the same intersectional group and stereotype. The LLMs' generations could be homogeneous if we repeated exactly the same prompt. To encourage more diverse responses, we generate life experiences of people in each intersectional group that we study and ask the LLMs to behave as if they were the simulated roles when answering the questions. This approach is increasingly used in recent computational social science research (Argyle et al., 2022). We used ChatGPT to generate life stories for these roles, and we manually investigated all the generations to ensure faithfulness to the provided demographic features and diversity in terms of life experiences. An example prompt and the corresponding output of ChatGPT are shown in Figure 3. We simulate 10 roles for each intersectional group that is associated with stereotypes in our dataset, as shown in Table 1.
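A minimal sketch of this role simulation step follows, again assuming a generic `ask_chatgpt` helper; the prompt wording is hypothetical (the paper's actual prompt appears in Figure 3), and the generated stories are still checked manually as described above.

```python
def simulate_roles(group: str, ask_chatgpt, n_roles: int = 10):
    """Generate n_roles distinct life stories for one intersectional group.
    `ask_chatgpt` is a placeholder for the chat-completion call."""
    roles = []
    for _ in range(n_roles):
        prompt = (
            f"Write a short, realistic life story of a person who is {group}. "
            "Keep it faithful to these demographic features and make it "
            "different from typical stories you have written before."
        )
        roles.append(ask_chatgpt(prompt))
    # Each story is then manually checked for faithfulness and diversity.
    return roles
```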

Examination of Stereotypical Behavior
We examine stereotypical behavior in two recent LLMs: GPT-3 and ChatGPT (GPT-3.5). This is done using a set of custom-designed questions and simulated roles. Our analysis procedure involves five steps, through which we determine the degree of stereotyping in each LLM concerning a particular stereotype related to an intersectional group:

1. We identify the questions that pertain to the stereotype of interest among all the questions in the same category as the stereotype.

2. For each question identified in the previous step, we pose the question to the LLM along with the ten roles we have simulated for the intersectional group in question.

3. We quantify the stereotype exhibited by the LLM by examining the maximum frequency with which the ten responses generated by the LLM match each expected answer. We normalize this result using the mean to allow comparison across questions with varying numbers of expected answers, where the mean is the expected value of the frequency (i.e., 1/n for a question with n expected answers). This normalized maximum frequency is referred to as the Stereotype Degree (SDeg) for a specific combination of LLM, intersectional group, and stereotype category; SDeg is always equal to or greater than 0 and less than 1 (an illustrative computation is sketched after this list).

4. The maximum SDeg of each LLM toward each intersectional group is used to represent its degree of stereotyping toward that group.

5. To further evaluate the overall level of stereotyping in each LLM, we aggregate the SDeg of the model toward all intersectional groups.
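The sketch below illustrates one way to compute SDeg for a single combination of LLM, intersectional group, and stereotype category. The paper does not spell out the exact normalization formula, so the subtraction of the chance-level frequency 1/n is an assumption, chosen because it reproduces the stated range 0 <= SDeg < 1.

```python
from collections import Counter

def stereotype_degree(classified_responses, n_expected_answers):
    """Compute SDeg for one (LLM, group, category) combination.

    classified_responses: the ten LLM responses after each has been mapped to
    one of the question's expected answers (or None if unmatched).
    Normalization: subtract the chance-level frequency 1/n from the maximum
    observed answer frequency (assumed formula, not stated in the paper).
    """
    counts = Counter(a for a in classified_responses if a is not None)
    total = len(classified_responses)
    max_freq = max(counts.values()) / total if counts else 0.0
    chance = 1.0 / n_expected_answers
    return max(0.0, max_freq - chance)

# Toy example: 8 of 10 responses fall into the same expected answer
# for a question with 4 expected answers.
responses = ["a"] * 8 + ["b", "c"]
print(stereotype_degree(responses, n_expected_answers=4))  # 0.8 - 0.25 = 0.55
```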
Appendix C presents the SDeg of each LLM with respect to each intersectional group. Our results indicate that different LLMs exhibit varying degrees of stereotyping toward different intersectional groups. For instance, GPT-3 demonstrates higher degrees of stereotyping toward "young black people", "older black people", and "white women", whereas ChatGPT is more stereotypical toward "black people without disabilities", "conservative Muslim men", and "white people with disabilities". Despite the application of various de-biasing and moderation strategies in these recent LLMs, they continue to exhibit complex intersectional stereotypes. These stereotypes differ across LLMs and necessitate specific measures for their mitigation. Our dataset provides an effective means of identifying and addressing such complex intersectional stereotypes, thereby reducing their negative impact. Moreover, our dataset can be readily expanded to study stereotypes toward other groups using the methodology outlined in this paper.

Conclusion & Future Work
In this paper, we introduce an intersectional stereotype dataset and evaluate the prevalence of stereotypes in two contemporary large language models (LLMs) across 106 intersectional groups. The dataset is automatically created and filtered using ChatGPT, and it undergoes manual validation to ensure it encompasses the common stereotypes specifically targeting these demographic groups. Furthermore, we classify the stereotypes in this dataset into 16 categories and formulate category-specific questions to assess the stereotypical behaviors of LLMs. The findings from our stereotype examination underscore the necessity for additional investigation and mitigation of stereotypes in LLMs, particularly the more complex intersectional stereotypes, especially when these models are made publicly available. Our dataset serves as a valuable resource that can be employed and expanded upon to attain a broader understanding of intersectional stereotypes and to work toward the reduction of harmful stereotypes in LLMs.

Limitations
In this paper, we have constructed an intersectional stereotype dataset using prompts given to ChatGPT. However, as pointed out by Santurkar et al. (2023), large language models (LLMs) like ChatGPT may answer questions from their unique "viewpoints", often reflecting certain social values. This characteristic could potentially introduce unintended biases into the data, especially if our dataset creation approach is employed for constructing stereotype datasets with predefined source groups. Although we did not address this issue in the main paper, which focuses solely on general stereotypes associated with each target group, we did employ rigorous human validation to ensure the high quality of the dataset. To mitigate potential issues stemming from the "viewpoints" of LLMs, future work extending from our research should take into account the social values expressed in LLM responses and cautiously regulate the output through effective prompting, particularly when the sources of stereotypes are crucial to the study.

Ethics Statement
Although this paper investigates stereotypes that could be offensive or disturbing to certain groups, the objective behind constructing such a stereotype dataset is to gain a better understanding of, and subsequently mitigate, harmful stereotypes in human communications. All our data is sourced from ChatGPT, a publicly accessible LLM, and the construction phase of the dataset does not involve human subjects, thereby preventing human annotators from exposure to potentially harmful or upsetting content. While we do involve human validators to guarantee the quality of our dataset, they were forewarned about the nature of the content, and their task was to assess the validity of the data points, not to propagate offensive statements. This study was also reviewed by the IRB of our institution (#STUDY00032622). We compensated all the validators at an hourly rate of $14.00, significantly higher than the minimum wage in our state, for their involvement in these manual validations.

Figure 1: An example prompt used to retrieve stereotypes from ChatGPT.

Figure 2: An example prompt used for data filtering and the corresponding response from ChatGPT.

Figure 3: An example prompt and the response used to generate diverse life stories of people within each intersectional group in our stereotype examinations.