CoMPosT: Characterizing and Evaluating Caricature in LLM Simulations

Recent work has aimed to capture nuances of human behavior by using LLMs to simulate responses from particular demographics in settings like social science experiments and public opinion surveys. However, there are currently no established ways to discuss or evaluate the quality of such LLM simulations. Moreover, there is growing concern that these LLM simulations are flattened caricatures of the personas that they aim to simulate, failing to capture the multidimensionality of people and perpetuating stereotypes. To bridge these gaps, we present CoMPosT, a framework to characterize LLM simulations using four dimensions: Context, Model, Persona, and Topic. We use this framework to measure open-ended LLM simulations' susceptibility to caricature, defined via two criteria: individuation and exaggeration. We evaluate the level of caricature in scenarios from existing work on LLM simulations. We find that for GPT-4, simulations of certain demographics (political and marginalized groups) and topics (general, uncontroversial) are highly susceptible to caricature.


Introduction
Large language models (LLMs) have shown promise in capturing social nuances and human behavior. For instance, researchers have reproduced results from social science experiments and public opinion surveys using LLMs (Argyle et al., 2023; Aher et al., 2023; Santurkar et al., 2023, inter alia). More broadly, interest in LLM simulation is rapidly growing, and the possibility of using LLMs to simulate human behaviors has far-reaching applications in fields like education (Markel et al., 2023), product design (Park et al., 2022, 2023a), psychology (Binz and Schulz, 2023), healthcare (Weizenbaum, 1966; Bassett, 2019), skill training (Hollan et al., 1984; Jones et al., 1999), and law (Hamilton, 2023). These simulations are a sort of digital compost: any new insight into human behavior that they provide draws upon the organic material (human data) used to train LLMs.

The CoMPosT Framework

Context: Where and when does the simulated scenario occur?
Model: What LLM is used?
Persona: Whose opinion/action is simulated?
Topic: What is the simulation about?

Table 1: Dimensions of the CoMPosT framework. We use these dimensions to characterize LLM simulations and measure their susceptibility to caricature.

Such applications currently have little to no mechanisms for comprehensive evaluation or careful deployment. Evaluation of such simulations has been limited to either (1) replicating existing results or (2) assessing believability. Both paradigms have drawbacks: (1) replication limits us to only reproducing already-known behavior and does not support the validation or evaluation of any simulation behaviors beyond those highly correlated with existing results from human studies. Also, existing results are typically quantified as categorical distributions across multiple-choice answers, so there is no way to directly evaluate open-ended generations, and such results may have been "memorized" from the LLMs' training data (Lewis et al., 2021; Elangovan et al., 2021). Thus, replication does not facilitate new insight into human behavior. Furthermore, while (2) believability is useful in certain settings, such as entertainment (Bates et al., 1994), it is susceptible to the biases and fallacies of human judgment: psychology literature shows that people are more likely to believe stereotypes about groups with which they have less personal experience (Plous, 2003; Bar-Tal et al., 2013), and beliefs are easily influenced (Blair et al., 2001; Jussim et al., 2016). Recent work has also demonstrated that human judgment is insufficient for assessing AI (Schneider et al., 2020; Peng et al., 2022; Vodrahalli et al., 2022; Veselovsky et al., 2023).
Toward clearer documentation of this emerging line of work, we first present a descriptive framework that taxonomizes LLM simulations using four dimensions: Context, Model, Persona, and Topic (CoMPosT) (Table 1). Our framework facilitates comparison across existing work on LLM simulations (Figure 1).
Next, we introduce a new evaluation metric that addresses growing concerns of modal responses and essentializing narratives in LLM outputs (Santurkar et al., 2023; Cheng et al., 2023b; Shumailov et al., 2023). Our metric focuses on a simulation's susceptibility to caricature: an exaggerated narrative of the persona (the demographic that we aim to simulate) rather than a meaningful response to the topic (Figure 2). Caricatured simulations are concerning because they (a) fail to capture the real nuances of human behavior, thus limiting the usefulness of simulations, and (b) perpetuate misleading descriptions, stereotypes, and essentializing narratives about demographic groups. We define caricature using two criteria: individuation and exaggeration (Section 3). To measure individuation, we assess whether outputs from the given simulation can be differentiated from the default response for the topic. To measure exaggeration, we use a "contextualized semantic axis" whose two poles are the defining characteristics of the persona and topic dimensions respectively.
We evaluate the level of caricature in scenarios from existing work on LLM simulations. We find that for GPT-4, simulations of certain demographics (political and marginalized race/ethnicity groups) and topics (general, uncontroversial) are more susceptible to caricature. Our main contributions are: (1) CoMPosT, a framework for characterizing the dimensions of LLM simulations of human behavior (Section 2), (2) a novel method that relies on the persona and topic dimensions of CoMPosT to measure simulations' susceptibility to caricature (Section 4), and (3) experiments in different contexts (Section 5) toward an analysis of the dimensions that are most susceptible to caricature (Section 6). We conclude with actionable recommendations and considerations for those interested in LLM simulation (Section 7).

Figure 2: For the simulation of a nonbinary person's response, the generations are focused on identity-related issues, while the simulation of a person's response is topical. The former constructs a homogeneous narrative that defines nonbinary people only by LGBTQ+ activism. We provide more qualitative examples in Appendix A.

CoMPosT: Taxonomizing Simulations
We introduce CoMPosT, a descriptive framework with four dimensions to characterize LLM simulations: Context, Model, Persona, and Topic. Inspired by existing descriptive frameworks for AI fairness (Tubella et al., 2023), our framework provides a shared language to understand and articulate similarities and differences across LLM simulations. Context, persona, and topic are specified in the prompt, while the model is determined externally. We map existing work on LLM simulations using these dimensions (Figure 1 and Table A4).

Context
The output from a simulation is necessarily affected by the context of where and when the imagined situation takes place. For instance, a formal interview response differs drastically from a user's comment on Twitter or Reddit. The context includes relevant structural factors and embeds information about the norms of the situation. Each context has its own unique set of norms, which may be explicit, as in the case of online communities with written-down rules, or implicit (Chandrasekharan et al., 2018; Ziems et al., 2023). Context also includes the phrasing of the prompt itself, which affects the output: LLMs are notoriously sensitive to prompt engineering (Zamfirescu-Pereira et al., 2023). The desired granularity of the outcome also arises from the phrasing of the prompt and thus is embedded in the context: a simulation scenario may be worded to ask for a choice among binary or multiple-choice options (as in many social science experiments and public opinion polls) or for an open-ended output. Previous work on evaluating simulations has largely focused on using LLMs to reproduce scenarios in which humans are asked to choose among a fixed number of options without specifying the context, as it is more challenging to evaluate the quality of open-ended responses. We bridge this gap by offering a metric for the latter.

Model
The LLM used to produce the simulation affects the quality and other characteristics of the simulation. Differences may arise from variations in models' training data and processes, including instruction-tuning, fine-tuning, and/or value alignment efforts (Solaiman and Dennison, 2021; Ouyang et al., 2022; Bakker et al., 2022).

Persona
The persona refers to the entity whose opinions/actions the simulation aims to study and replicate. This persona may include attributes that are relatively static (e.g., race/ethnicity), that change slowly over time (e.g., age), or that are temporary and of-the-moment (e.g., emotional status) (Yang, 2019). It may also refer to a specific individual.

Topic
The topic of the simulation may be a particular subject of discussion, question, or other event to which a response is desired. Topics vary in specificity, from very general (such as a single word that captures a broad conversation category) to very specific (such as a situational question in a psychology experiment).
These four dimensions capture a wide range of possible simulation scenarios, many of which have not yet been well-explored. Across existing work, we find that researchers typically use a state-of-the-art model and choose a particular context while varying the more salient dimensions of persona and topic. We denote the simulation scenario as S_{p,t,c}, as it is associated with a prompt containing persona p, topic t, and context c (Table A4). Our evaluation methodology and results use these dimensions of CoMPosT to understand how different simulations may result in caricatures. Specifically, we explore how the relationship between the persona and topic dimensions helps characterize the extent of caricature in simulations.

Definition of Caricature
Building upon Lynch (1927)'s discussion of how caricatures are misrepresentations that retain some sense of truth to the subject by reflecting "salient peculiarities," Perkins (1975) defines caricature as "a symbol that exaggerates measurements relative to individuating norms." In so doing, Perkins identifies two key characteristics of caricature: exaggeration and individuation. A caricature is a depiction that not only exaggerates particular features of the subject but also exaggerates in a manner that meaningfully differentiates the subject from others. The exaggeration individuates by remaining faithful to the properties that distinguish the subject from others (thus, a complete distortion is not a caricature). This inspires our definition of caricature in the LLM simulation context: given that a subject has some defining characteristics, a caricature exaggerates these characteristics in a way that amplifies the ability to identify (i.e., individuate) the subject from the caricature. In CoMPosT terms, a simulation's level of caricature is the degree to which it exaggerates the individuating characteristics that are emblematic of the persona beyond a meaningful, topical response to the scenario.
Note that in some cases, it may be acceptable for the persona to influence the simulation; i.e., individuation alone does not entail caricature. For example, opinions on some topics differ greatly across demographics. A caricature occurs when the simulation both individuates and exaggerates the defining characteristics of the imagined generic responses of that persona. Previous work has documented how such imagined personas reflect stereotypes, both inside and outside the LLM context (Marsden and Haag, 2016; Cheng et al., 2023b). Thus, caricatures not only fail to capture the diversity of human behavior but also may rely on stereotypes.

Implications of Caricature
Stereotypes We rely on psychology literature that broadly defines stereotypes as generalizations about the characteristics of a social category, such as associating a social category with a particular role or trait (Heilman, 2001; Fiske et al., 2002; Cao et al., 2022; Kambhatla et al., 2022). The normative value of stereotypes is context-dependent; for instance, stereotypes can help foster a sense of authenticity (Marsden and Haag, 2016), while even seemingly-positive stereotypes can have harmful implications (Fiske et al., 2002; Czopp et al., 2015).
Stereotypes and caricature, while closely related, are distinct in that a caricature may be a specific depiction of a stereotype: scholars have documented caricatures of stereotypes in various domains and how they facilitate misogyny, racism, and other forms of oppression (Slavney, 1984; Brown, 2010; Gottschalk and Greenberg, 2011; Takayama, 2017; Bow, 2019). Such caricatures have historically been used in literature and media to justify slavery, imperialism, and war (Demm, 1993; Kriz, 2008). But even when caricatures do not contain stereotypes, they carry the concerning implication of homogeneous narratives.
Misleading Homogeneity Caricatures foster homogeneous narratives that do not reflect the full diversity of the personas they aim to simulate, which limits the utility of the simulation. This concern builds upon previous work: Grudin (2006) discusses how personas can result in systematic errors in understanding human behavior, Cheng et al. (2023b) characterize the harms of LLMs reflecting essentializing narratives about demographic groups, and Santurkar et al. (2023) show that certain instruction-tuned LLMs tend to generate modal responses. Others have explored the linguistic nuances within complex social categories and the ramifications of ignoring heterogeneity within social groups (Bamman et al., 2014; Hanna et al., 2020; Cheng et al., 2023a).
These harms of caricature are also articulated by feminist scholars who have discussed how "women in the Two-Thirds World...are constructed as one-dimensional, oppressed caricatures without an understanding of their real experiences, agency, and struggles" (Mohanty, 1988; Aneja, 1993; Kumar et al., 2019). This literature reveals that even when caricatures are not overtly negative, such one-dimensional depictions are still damaging and harmful. Overlooking diversity within demographic groups has been connected to real-world harms including misprediction and medical misdiagnosis (McCracken et al., 2007; Borrell et al., 2021; Read et al., 2021; Wang et al., 2022).

Caricature Detection Method
The two key aspects of caricature are individuation and exaggeration. To measure the amount of caricature in a given simulation S_{p,t,c}, our method has three steps, each of which relies on the persona and topic dimensions of CoMPosT (Figure 3): (1) defining defaults, (2) measuring individuation, and (3) measuring exaggeration. Note that this framework is sequential, as (3) is only necessary if the simulations can be individuated. Otherwise, we can halt after step (2), since individuation is a necessary criterion for caricature.
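The sequential gating above can be sketched as follows. This is a minimal control-flow sketch, not our implementation: `individuation_fn` and `exaggeration_fn` are hypothetical stand-ins for the measures defined in the following subsections, and the 0.5 threshold corresponds to chance-level classifier accuracy.

```python
def detect_caricature(target_outputs, default_persona_outputs,
                      individuation_fn, exaggeration_fn,
                      individuation_threshold=0.5):
    """Return (individuation, exaggeration).

    Exaggeration is None when the simulation cannot be individuated
    from the default-persona simulation, since individuation is a
    necessary criterion for caricature: we halt after step (2).
    """
    individuation = individuation_fn(target_outputs, default_persona_outputs)
    if individuation <= individuation_threshold:  # no better than chance
        return individuation, None
    return individuation, exaggeration_fn(target_outputs)
```

Only when both scores are high is the simulation flagged as susceptible to caricature.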

Defining Defaults
A simulation S_{p,t,c} is a caricature if it has more of the defining characteristics associated with the persona p and fewer of the defining characteristics associated with the topic t. We first identify these defining characteristics using the default-persona simulation S_{_,t,c} (a simulation that does not mention any specific persona) and the default-topic simulation S_{p,_,c} (a simulation that does not mention any specific topic). Note that this default does not reflect a universal default but rather a model- and context-specific default: previous work has shown that LLMs implicitly default to a particular set of perspectives (Western, white, masculine, etc.) (Santy et al., 2023). We use these defaults as a comparison point for caricature to isolate the defining characteristics of particular dimensions.
For the default-persona simulation, we use a prompt where, in lieu of a specific persona, we use an unmarked default term like "person" or "user." Thus, the outputs reflect the topic and context rather than any particular persona. (Again, such words are not true defaults and are inextricably tied to societal norms: in English, the word "person" is often conflated with "man," a phenomenon also present in web data and language models (Bailey et al., 2022; Wolfe and Caliskan, 2022).) For the default-topic simulation, we use a prompt where no topic is specified. Thus, the outputs from these prompts reflect the particular persona rather than a response to the topic. It is well-documented that this type of prompt results in outputs that reflect stereotypes, both when posed to humans and to LLMs (Kambhatla et al., 2022; Cheng et al., 2023b). Note that even if we expect the output to change based on the persona, the response should still be distinct and not defined by the same distinguishing characteristics as the default-topic simulation for a given persona.

Measuring Individuation
We operationalize the desideratum of individuation as differentiability from the default: we examine whether the given simulation S_{p,t,c} is differentiable from the default-persona simulation S_{_,t,c}. If not, then S_{p,t,c} cannot be a caricature. We use a binary classifier (specifically, a random forest classifier implemented using scikit-learn) to differentiate between outputs from the target simulation of interest S_{p,t,c} and those from the default-persona simulation S_{_,t,c} based on the outputs' contextualized embeddings. We compute contextualized embeddings using the pre-trained Sentence-BERT model all-mpnet-base-v2 (Reimers and Gurevych, 2019). To create the training and test datasets, we use a stratified 80/20 split on S_{p,t,c} and S_{_,t,c} to preserve the balance between the classes. We report the accuracy of the classifier on the test dataset as our measure of individuation. Note that this measure is agnostic to the particular choice of differentiator and contextualized embedding model, and we show results with other choices in Appendix B.
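A dependency-free sketch of this step is below. It deliberately swaps in lighter stand-ins so it runs anywhere: bag-of-words counts in place of Sentence-BERT embeddings, and a nearest-centroid rule in place of the random forest; only the overall shape (embed outputs, 80/20 split per class, report test accuracy) mirrors our method.

```python
import math
import random
from collections import Counter


def embed(text):
    """Toy bag-of-words stand-in for the contextualized embeddings."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def individuation_score(target_outputs, default_outputs, seed=0):
    """Test accuracy of a nearest-centroid classifier (stand-in for the
    random forest) on an 80/20 split of each class."""
    rng = random.Random(seed)

    def split(outputs):
        outputs = outputs[:]
        rng.shuffle(outputs)
        k = int(0.8 * len(outputs))
        return outputs[:k], outputs[k:]

    (tr_t, te_t), (tr_d, te_d) = split(target_outputs), split(default_outputs)
    centroid_t = sum((embed(o) for o in tr_t), Counter())
    centroid_d = sum((embed(o) for o in tr_d), Counter())
    test = [(o, 1) for o in te_t] + [(o, 0) for o in te_d]
    correct = sum(
        (cosine(embed(o), centroid_t) > cosine(embed(o), centroid_d)) == bool(y)
        for o, y in test)
    return correct / len(test)
```

An accuracy near 0.5 means the target simulation is indistinguishable from the default-persona simulation, so it cannot be a caricature.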
This score alone is necessary but insufficient for identifying caricature, as a caricature must also exaggerate, which we measure next.

Measuring Exaggeration
We define a caricature as text having more of (and exaggerating) the defining characteristics associated with the persona and fewer of those associated with the topic. Unlike individuation, exaggeration requires a more nuanced measure than differentiation from the default-topic simulation: if an output mentions the topic frequently, it can easily be differentiated, but it may still be a caricature. (Note that it is acceptable for a simulation to have many topic-related words.) Instead, we measure the extent to which the defining characteristics of the persona are exaggerated in the target simulation via persona-topic semantic axes.
Specifically, we construct contextualized semantic axes, a method introduced by Lucy et al. (2022), to capture whether S_{p,t,c} is more similar to the defining characteristics of the persona p or the topic t. Our semantic axes have two poles, P_p and P_t, reflecting the persona p and the topic t. To construct the set of seed words, we use the Fightin' Words method (Monroe et al., 2008) to identify the words that statistically distinguish S_{p,_,c} from S_{_,t,c}. We first compute the weighted log-odds ratios of the words between S_{_,t,c} and S_{p,_,c}. To control for variance in words' frequencies, we use the following prior distribution: other texts where the personas/topics are either p/t respectively or _ (i.e., the default). Then, we take the words that are statistically significant (z-score > 1.96) as the sets of seed words W_p and W_t for the corresponding poles P_p and P_t (Table A3).
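The seed-word selection can be sketched as a pure-Python implementation of the weighted log-odds with an informative Dirichlet prior (Monroe et al., 2008). The whitespace tokenizer and the small floor for words absent from the prior are illustrative assumptions, not details of our pipeline.

```python
import math
from collections import Counter


def fightin_words(corpus_a, corpus_b, prior_corpus, z_threshold=1.96):
    """Weighted log-odds with an informative Dirichlet prior.

    Returns (seeds_a, seeds_b): words whose z-scored log-odds delta is
    significantly associated with corpus_a or corpus_b respectively.
    """
    tokenize = lambda docs: Counter(w for d in docs for w in d.lower().split())
    ya, yb, alpha = tokenize(corpus_a), tokenize(corpus_b), tokenize(prior_corpus)
    na, nb, a0 = sum(ya.values()), sum(yb.values()), sum(alpha.values())
    scores = {}
    for w in set(ya) | set(yb):
        aw = alpha.get(w, 0.01)  # small floor for words unseen in the prior
        # log-odds of w in each corpus, smoothed by the prior counts
        delta = (math.log((ya[w] + aw) / (na + a0 - ya[w] - aw))
                 - math.log((yb[w] + aw) / (nb + a0 - yb[w] - aw)))
        var = 1.0 / (ya[w] + aw) + 1.0 / (yb[w] + aw)
        scores[w] = delta / math.sqrt(var)
    seeds_a = {w for w, z in scores.items() if z > z_threshold}
    seeds_b = {w for w, z in scores.items() if z < -z_threshold}
    return seeds_a, seeds_b
```

In our setting, `corpus_a` would be outputs of S_{p,_,c}, `corpus_b` outputs of S_{_,t,c}, and the prior the pooled related texts described above, yielding W_p and W_t.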
We represent each word w ∈ W as the mean of the contextualized embeddings of the sentences containing w across S_{_,t,c} and S_{p,_,c}. We then represent each pole P_p/P_t as the mean of the embeddings of the words in W_p/W_t respectively, and define the semantic axis as their difference:

V_{p,t} = (1/|W_p|) Σ_{p_i ∈ W_p} emb(p_i) − (1/|W_t|) Σ_{t_j ∈ W_t} emb(t_j),   (1)

where p_i/t_j is a word in W_p/W_t respectively. This subtraction-based axis allows for scaling relative to how closely related the topic and persona are.
To evaluate exaggeration, we compute the average cosine similarity of a given simulation's contextualized embeddings to this axis:

sim(S_{p,t,c}) = (1/n) Σ_{i=1}^{n} cos(emb(S^i_{p,t,c}), V_{p,t}),   (2)

where S^i_{p,t,c} refers to individual outputs, i.e., i = 1, 2, ..., n for n outputs from the same simulation S_{p,t,c}.

The final value we report as the measure of exaggeration is this value normalized to lie between 0 and 1 (by scaling it relative to the cosine similarity of the default-persona and default-topic simulations with the axis):

exag(S_{p,t,c}) = (sim(S_{p,t,c}) − sim(S_{_,t,c})) / (sim(S_{p,_,c}) − sim(S_{_,t,c})).   (3)

We perform internal and external validation of these persona-topic semantic axes (Appendix C).
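A minimal numerical sketch of the axis construction and the normalization described above, assuming pre-computed embeddings are passed in as plain lists of floats (in practice these would come from a sentence-embedding model):

```python
import math


def mean_vec(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def cos(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


def exaggeration_score(target_embs, persona_pole_embs, topic_pole_embs,
                       default_topic_embs, default_persona_embs):
    """Normalized similarity to the persona-topic axis.

    The axis is the persona-pole mean minus the topic-pole mean; a
    simulation's raw score is its mean cosine similarity to that axis,
    min-max scaled between the raw scores of the default-persona and
    default-topic simulations.
    """
    axis = [p - t for p, t in zip(mean_vec(persona_pole_embs),
                                  mean_vec(topic_pole_embs))]

    def raw(embs):
        return sum(cos(e, axis) for e in embs) / len(embs)

    lo, hi = raw(default_persona_embs), raw(default_topic_embs)
    return (raw(target_embs) - lo) / (hi - lo)
```

A score near 1 means the simulation sits as close to the persona pole as the default-topic simulation does, i.e., it is dominated by persona characteristics rather than a topical response.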

Experiments
We use our method to evaluate simulations in various contexts that have been used in previous work to demonstrate the capabilities of LLM simulation (Park et al., 2022; Santurkar et al., 2023; Jiang et al., 2022). Our experiments focus on the two most widely-used contexts: (1) an online forum setting and (2) a question-answering interview setting. We also evaluate the Twitter context as an additional robustness study (Appendix G). These choices are based on our survey of the literature on LLM simulations (Figure 1, Table A6): among 15 papers in this area, we found that six use the context of a virtual community or society and four rely on an open-ended interview or survey context. The remaining five are in various question-answering contexts, so conclusions about (2) can also provide insight into these types of simulations. We use the state-of-the-art GPT-4 model for all experiments (OpenAI, 2023); like others (Dubois et al., 2023), we find that open-source LLMs and older models are worse at simulation tasks, yielding outputs that are unrealistic and significantly lower in quality. For each simulation setting S_{p,t,c}, we generate 100 outputs and average all results across them. See Appendix E for a power analysis of this sample size. The full details for each setting, including topic lists, persona lists, and default-topic/persona prompts, are in Appendix D.

Online Forum
Park et al. (2022) demonstrate the believability of LLM simulations of users in online forums. Following their prompting format, we use the prompt: "A [persona] posted the following comment on [topic] to an online forum:" We explore such simulations using 15 different personas (5 race/ethnicities, 3 genders, 3 political ideologies, 3 ages, and the neutral "person") and 30 pairs of topics.
For topics, we aim to cover a wide range of common topics that vary in (1) specificity (e.g., overcoming fear of driving is much more specific than cars) and (2) level of controversy (e.g., abortion is much more controversial than health). To cover both dimensions, we use topics from WikiHow, a knowledge base with a wide range of topics (Koupaee and Wang, 2018), and from ProCon.org, a website that lists popular controversial topics and has been used in NLP to study stance and argumentation (Misra et al., 2016; Hosseinia et al., 2020). We use the first 15 categories from WikiHow's "popular categories" webpage and randomly sample an associated specific "how-to" for each category. From ProCon.org, we randomly sample 15 topics from its "debate topics" webpage. Each topic on the page is listed in both more general and more specific wording, e.g., abortion and should abortion be legal? Thus, for each sampled subject, we use both the general and specific versions as topics.

Interview
Various previous works simulate opinions of different demographics using an interview-style prompt (Argyle et al., 2023; Santurkar et al., 2023; Hämäläinen et al., 2023, inter alia). We reproduce the public opinion survey simulation context from Santurkar et al. (2023), using the prompt: "Below you will be asked to provide a short description of your identity and then answer some questions. Description: I am [persona].

Question: [topic]
Answer:"

For topics, we randomly sample 30 questions from Pew Research's American Trends Panel survey questions that Santurkar et al. (2023) identify as "most contentious" in their OpinionQA dataset. We convert the multiple-choice questions into open-ended ones by removing the multiple-choice answer options. For personas, we use the same 15 personas as described in Section 5.1.

Results and Discussion
We apply our caricature detection method to evaluate the simulations produced in these different contexts. We further operationalize the CoMPosT framework by aggregating the individuation and exaggeration scores across the dimensions of topic and persona. This enables us to analyze the topics and personas that lead to the most caricature across different contexts. For instance, to examine which personas lead to the most caricature in a particular context, we compute the mean score for each persona across all simulations (varying in topic) for that persona and then compare these scores. We also report results from additional experiments that explore the influence of the context dimension in Appendix F.

Simulations of all personas can be individuated from the default-persona
The mean individuation score and 95% confidence interval for every persona are > 0.5, i.e., each persona can be meaningfully individuated from the default-persona at a rate better than random chance. Mean individuation scores across the online forum and interview contexts are in Figure 4. We see that the woman and man personas have lower mean scores in the online forum context, while in the interview context, the mean score is > 0.95 for every persona; thus, this score is not informative for comparing caricature across personas. The difference in score between contexts arises from differences in the sample distributions: compared to the online forum context, the interview context simulations have lower variability, so they are easier to individuate.

Exaggeration scores reveal the personas and topics most susceptible to caricature
Next, we examine the exaggeration scores, i.e., the similarities of the simulations to their corresponding persona-topic axes (Figure 5). Since almost all the simulations can be individuated, we use the exaggeration score as a proxy for caricature, i.e., a given simulation is highly susceptible to caricature if it has a high exaggeration score.
Caricature ↑: Topic specificity ↓

In the online forum context, we find that more general topics result in higher exaggeration scores, and thus higher rates of caricature, while more specific topics have much lower rates of caricature (Figure 5). In particular, the general, uncontroversial topics have the highest exaggeration scores. The top five topics with the highest mean rates of caricature are Health, Philosophy and Religion, Education and Communications, Relationships, and Finance and Business.
To explore this pattern further, we experiment on a fine-grained range of topic specificity: for the topic with the highest exaggeration score (Health), we generate simulations for a range of related topics with 5 levels of specificity (Appendix D.1.1). We find that this pattern holds: the level of caricature decreases as the specificity of the topic increases (Figure A2). We otherwise find no correlation between topic length and caricature.
In the interview context, the exaggeration scores are broadly comparable to the scores for the more specific topics in the online forum context (Figure A5). After controlling for context (Appendix F), we hypothesize that this is because they are similar in specificity. We also observe this inverse relationship between topic specificity and susceptibility to caricature in the Twitter context (Appendix G).

Figure 4: The standard error for each point is < 0.02 and thus not visible on the plot. Based on a classifier between the default-persona and the target persona, many of the personas are easy to individuate (have high accuracy scores). The only personas that are slightly challenging to differentiate are the gender groups woman and man in the online forum context (blue circles). All personas have mean score > 0.95 in the interview context (green stars).

Figure 5: We measure exaggeration as normalized cosine similarity to the persona-topic axis. The more general topics (purple, larger marker) have higher rates of exaggeration, and thus caricature, than the specific topics (orange, smaller marker). The uncontroversial (WikiHow, squares) topics have higher rates of caricature than the controversial (ProCon.org, stars) topics. Personas related to political leanings, race/ethnicity, and nonbinary gender broadly have the highest rates of caricature.

Caricature ↑: Politicized and marginalized personas

Previous work has found that LLMs best reflect the perspectives of liberal, white, and younger populations, while the perspectives of non-binary people are poorly represented by these systems (Santy et al., 2023; Santurkar et al., 2023). It is surprising that although Asian and woman reflect marginalized groups, they have relatively low rates of caricature; this may further reflect implicit defaults in LLM outputs. Certain personas and topics have more nuanced relationships, e.g., conservative and liberal personas in the online forum context have the widest gap between the scores of the uncontroversial and controversial topics. The Twitter context also reveals that, beyond analyzing the persona and topic dimensions separately, some persona-topic combinations are particularly susceptible to caricature (Figure A4).

Stereotypes
We note that our framework is not an exhaustive test for bias or failure modes, but rather a measure of one way in which simulations may fail. Thus, simulations that seem caricature-free may still contain stereotypes, as our method captures how much a simulation exaggerates the persona in a particular setting, which is not an all-encompassing catalog of stereotypes. We see this in the simulated "woman" responses in the online forum context: the default-topic generations contain specific stereotypes, e.g., "I recently purchased a new vacuum cleaner and I have to say, I am extremely satisfied with its performance! It has made my cleaning routine so much easier and efficient." This response reflects gender bias in that it focuses on cleaning and other domestic tasks, while simulations of other personas do not. Although the simulated "woman" responses contain various other gender stereotypes/biases beyond association with domestic tasks, they have low caricature scores.

Recommendations
We conclude with several recommendations for those interested in LLM simulations.
Mitigating Caricature Researchers should use our method to test their simulation in their particular context and critically examine whether a simulation helps illuminate the desired phenomenon.
While the relationship between topic, persona, and context in causing caricature is nuanced, we generally encourage researchers and practitioners to use more specific topics to mitigate caricature. Any attempt to simulate a group, especially a politicized or marginalized group, ought to be done with particular care and attention.
Documenting Positionality Research on LLM simulations faces the well-documented challenges of human-focused, value-laden interdisciplinary work (Marsden and Haag, 2016). For instance, researchers themselves may be subject to the outgroup homogeneity effect, i.e., the tendency to rely on stereotypes and generalizations for groups to which they do not belong (Plous, 2003). Following work on model, dataset, and system documentation (Bender and Friedman, 2018; Mitchell et al., 2019; Gebru et al., 2021; Adkins et al., 2022), we call for increased transparency and documentation for simulations, including the dimensions of CoMPosT and less-visible aspects such as the creators' positions, motivations, and process. Drawing upon HCI work on reflexivity and positionality (Keyes et al., 2020; Liang et al., 2021; Bowman et al., 2023), we encourage researchers to report how their identity and perspectives may influence their work.
Understanding Difference Although some applications of LLM simulations focus on aggregates rather than individuals, it is critical to understand the landscape of individuals from which these groupings arise, and it is often necessary to use more subtle forms of aggregation. Otherwise, minority opinions and subgroup disparities may be overlooked (Herring and Paolillo, 2006; Hanna et al., 2020; Wang et al., 2022). Takayama (2017) suggests countering caricature "by providing fully contextualized, balanced, and nuanced description," and in HCI, Marsden and Pröbster (2019) explore how to explicitly capture users' multidimensional identities. Drawing inspiration from these works, one future direction is injecting variation into simulations and using multifaceted personas. Our goal in avoiding caricature is not to erase difference but rather the opposite: capturing relevant differences that reflect meaningful insights rather than shallow, misleading generalizations.

Positionality
The perspectives introduced in this paper have undeniably been shaped and influenced by our positionality. Myra Cheng identifies as a Chinese-American woman. The authors are a PhD student, postdoctoral scholar, and professor respectively in the Stanford University Computer Science department, which is predominantly male and white/Asian.

Ethical Considerations
From impersonation to pornography, LLM simulations can have deeply problematic applications. We are strongly against such applications, and we also do not condone research and development that may enable such applications by bad actors without guardrails in place. Our CoMPosT framework offers a shared language to meaningfully critique such work. For instance, one might imagine coming to a consensus to avoid simulating certain topics, personas, and contexts entirely. Introducing a method to measure caricature offers a way to make this concerning limitation known. Lack of caricature based on our measure does not mean that a simulation is necessarily acceptable or high quality (see Section 10).

Implicit Defaults
The least caricatured personas are also those that others have found to be implicit defaults in LLMs. Implicit defaults in LLMs may shift depending on the prompt, context, etc., and may include aspects of identity and social factors that are invisible or underrepresented in existing empirical data and surveys. Given the increasing proliferation of generated content and a limited quantity of human-written text (Xue et al., 2023; Shumailov et al., 2023), caricatures become only more relevant with the prospect of future LLMs that are trained on generated data: what will their defaults be, and how might they further amplify caricature?

Limitations
While we fill a critical gap, as there is no existing work on systematically detecting stereotypes or caricatures in simulations or evaluating simulations in this manner at all, our measure is limited in scope: it is not a comprehensive evaluation of the quality of a simulation. We quantify susceptibility to caricature, which is a particular failure case of a simulation. Our method may yield false positives (simulations that seem acceptable and caricature-free based on our method but have other problems).
Avoiding caricature is a necessary but insufficient criterion for simulation quality; our metric should be used in tandem with other evaluations. As a pilot study for a recently emerging direction of work, we hope to lay the groundwork for a more comprehensive evaluation of simulations in the future, perhaps in tandem with human evaluation.
As we provide a first step toward characterizing and evaluating LLM simulations, an area which currently lacks a shared language for discussion and comparison, we focus only on simulations of personas that reflect subpopulations such as social groups and on one-round response formats. However, our framework easily extends to other more complex or open-ended settings. For instance, for a multi-round simulation, one could apply our framework by using the full text of the simulation across the rounds. Depending on the length and structure of the simulation, one could also split the simulation into multiple parts and characterize each part's propensity to caricature.
Also, note that in the semantic axes, P_p does not necessarily reflect a universal notion of the model's representation or description of that particular demographic group. It merely characterizes the words that distinguish a simulation of that persona, given the particular context c, such as an opinion from that demographic's perspective. This enables us to measure if and when a simulation is dominated by language that is a caricature of the persona, but our work is not a comprehensive evaluation of stereotypes or representations of demographic groups.
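To make the axis-similarity computation concrete, here is a minimal numpy sketch. The axis here is simply the normalized difference between centroid embeddings of persona-specific and default outputs; this is a simplified stand-in for our persona-topic axes P_p, and the array names (`persona_embs`, `default_embs`) are hypothetical placeholders for real sentence embeddings.

```python
import numpy as np

def semantic_axis(pole_a: np.ndarray, pole_b: np.ndarray) -> np.ndarray:
    """Unit vector pointing from the centroid of pole_b toward the centroid of pole_a."""
    axis = pole_a.mean(axis=0) - pole_b.mean(axis=0)
    return axis / np.linalg.norm(axis)

def axis_similarity(emb: np.ndarray, axis: np.ndarray) -> float:
    """Cosine similarity of a single output embedding to the (unit-norm) axis."""
    return float(emb @ axis / np.linalg.norm(emb))

# Toy 2-D example: persona-associated embeddings vs. default embeddings
persona_embs = np.array([[1.0, 0.0], [1.0, 0.2]])
default_embs = np.array([[0.0, 1.0], [0.2, 1.0]])
axis = semantic_axis(persona_embs, default_embs)
# A high score means the output's language leans toward the persona pole
score = axis_similarity(np.array([1.0, 0.0]), axis)
```

In practice the poles would be built from contextualized embeddings of seed words rather than raw output centroids, but the similarity computation is the same.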

A Examples of Caricatures in Simulation
While our paper focuses on how to discover problematic patterns of caricature at the larger scale used in LLM simulations, these patterns are clearly evident at the individual example level. For example, for the "nonbinary person" persona in the online forum setting, which has some of the highest exaggeration scores, some randomly-sampled default-topic responses are: "As a nonbinary individual, I often feel like I don't fit into the traditional boxes society has created for gender. This can be difficult when navigating spaces that are dominated by cisnormativity. However, I have found solace and empowerment in connecting with other nonbinary people and advocating for inclusivity and understanding. It is important to remember that gender is a social construct, and it is up to us to define our own identities and push for a more accepting world." "As a nonbinary individual, I often find myself in situations where people are not aware of or don't understand my gender identity. While it can be frustrating at times, I choose to take it as an opportunity to educate others and promote a more inclusive society. It's important to recognize and respect everyone's experiences and identities, even if they differ from our own." These descriptions focus solely on general statements regarding gender identity and inclusion. We find the same phenomenon in simulations of a nonbinary person's perspective on other topics. For the topic "how to read guitar music," some randomly-sampled simulated nonbinary person's responses are: "Hey everyone! I just wanted to share my experience as a nonbinary individual learning how to read guitar music. It's been a really empowering journey for me, as it allows me to express myself in a way that transcends the limits of gender. Moreover, I've found that the guitar community is incredibly accepting and supportive, which has been a huge help in overcoming some of the challenges that I've faced as a nonbinary person. Keep rocking out, friends!"
"As a nonbinary guitarist, I wanted to share some advice and encouragement for others who may be navigating the world of guitar music. Remember that music is a universal language and it does not adhere to any specific gender norms or societal expectations. Allow yourself the freedom to explore different styles, techniques, and genres of guitar music that resonate with you. Embrace your unique perspective and use it to create music that reflects your authentic self. And most importantly, don't forget to have fun and enjoy the journey of learning and growing as a musician."

Rather than actually discussing "how to read guitar music," these outputs disproportionately focus on generic statements about gender, authenticity, and inclusion (bolded). These responses exaggerate this particular imagined aspect of a nonbinary person's perspective. Similarly, for the topic "cars and vehicles," some randomly-sampled simulated nonbinary person's responses are:

"I just wanted to share my experience as a nonbinary individual when it comes to cars and other vehicles. For me, it's important that my mode of transportation reflects my identity and personal aesthetics. I've found that customizing my vehicle with various accessories, paint jobs, and even custom license plates has been a fun and empowering way to express myself. I also appreciate a vehicle that has a gender-neutral design, as that aligns with my identity. I've noticed that more and more auto manufacturers are starting to create designs that feel more inclusive and neutral, which is great for people like me. Ultimately, the vehicle you choose should be a reflection of your personality and preferences, no matter your gender identity. Happy driving, everyone!"

"Hey everyone! As a nonbinary individual, I wanted to share my perspective on car preferences [...]"

In these outputs, the bolded phrases also present unrealistic caricatures that exaggerate specific aspects of the persona rather than reflecting the full range of possibilities of what a nonbinary person might say about these topics. In stark contrast, for the "man" persona, the default-topic responses are not identity-related, and responses for particular topics are much more topical and do not exaggerate any aspect of the "man" persona (Table A1).

Table A1 content (outputs for the "Man" persona):

Default: "I recently purchased a new smartphone and I'm really impressed with its features and performance. The camera quality is amazing, and the battery life lasts all day without any issues. The interface is easy to use, and the overall design is sleek and modern. I would highly recommend this phone to anyone in the market for a new device." "I recently purchased a new laptop and I'm absolutely loving it so far. The performance is top-notch, the design is sleek, and the battery life has been really impressive. The only issue I've come across is a minor software bug, but that was easily fixed with an update. Overall, I highly recommend this model to anyone in the market for a reliable and high-performing laptop."

How to read guitar music: "I recently started learning how to play the guitar and I found this guide on How to Read Guitar Music extremely helpful. The step-by-step explanation, along with images and examples, made it easy to understand and follow. I'm now more confident in reading sheet music and applying it to my guitar practice. Highly recommend this guide for any beginner guitarist looking to learn how to read music. Kudos to the author for making it so simple and easy to follow!" "I recently started learning how to read guitar music and found this article very helpful. It provided clear explanations and easy-to-follow steps that have greatly improved my understanding of how to read sheet music for the guitar. The tips and examples given were a great reference for me as a beginner, and I feel much more confident in my ability now. Thanks for sharing this informative guide!"

Cars and vehicles: "I recently purchased a 2020 Ford Mustang GT and I couldn't be happier with my decision. The drive is smooth, the car is powerful, and the interior is incredibly comfortable and stylish. I have always loved the look of the Ford Mustang and this new model definitely lives up to my expectations. The sound system is fantastic as well. I highly recommend this car to anyone considering a sporty, powerful, and fun vehicle." "I recently bought a 2015 Honda Civic and I'm absolutely loving it! The fuel efficiency is amazing and the ride is smooth. I've also noticed a significant improvement in the handling compared to my old car. The interior is quite spacious and comfortable too. My only complaint so far is the touchscreen entertainment system - it's a bit slow and not very user-friendly. But overall, I highly recommend the 2015 Civic to anyone in the market for a reliable, fuel-efficient, and enjoyable vehicle!"

B Robustness of Individuation Measure
The notion of measuring individuation is broadly agnostic to the choice of differentiator and the model used to embed the texts. In Figure A1, we show individuation results from (a) using the unsupervised V-measure to measure differentiation and (b) using the all-distilroberta-v1 model to compute embeddings instead of all-mpnet-base-v2 (all-distilroberta-v1 is the next highest-performing model for general-use purposes). For (a), we first use K-Means to cluster the embeddings into two clusters, and then report the V-measure score. We find that the scores are overall slightly lower than using a binary classifier, which makes sense since the unsupervised method is less powerful in this context, but most of the personas can still be differentiated at a rate higher than random chance. We use the supervised binary classifier in the main results since the purpose of this metric is to reflect a reader's capability to differentiate between the two categories, which is more accurately reflected by the more powerful classifier. The broad patterns remain the same as the findings reported in the paper: all the personas can be individuated in the Interview context, and in the Online Forum context, marginalized personas are the most easily individuated, while the personas of man and woman are challenging to individuate at all.
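The two differentiators compared in this appendix can be sketched as follows. This is a self-contained illustration, not our exact pipeline: the synthetic Gaussian vectors stand in for sentence-transformer embeddings of persona and default-persona outputs.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import v_measure_score
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Synthetic stand-ins for embeddings of 100 persona and 100 default outputs
persona = rng.normal(loc=0.5, scale=1.0, size=(100, 16))
default = rng.normal(loc=-0.5, scale=1.0, size=(100, 16))
X = np.vstack([persona, default])
y = np.array([1] * 100 + [0] * 100)

# Supervised individuation: held-out accuracy of a binary classifier
acc = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()

# Unsupervised alternative: cluster into two groups, score with V-measure
preds = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
v = v_measure_score(y, preds)
```

Here `acc` near 0.5 means the persona cannot be individuated from the default, while values near 1 mean it is easily individuated; the V-measure plays the analogous role for the unsupervised variant.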

C Internal and External Validation of Semantic Axes
For internal validation, for each seed word w, we construct P′_1, a version of the axis pole that does not include w. Then, we measure the cosine similarity of w's contextualized embedding to P′_1 and to P_2. We found that the former is larger for all of the persona-topic semantic axes that we constructed.
For external validation, we manually inspect the sets of seed words and find that for each axis, it is easy to differentiate which set is associated with the corresponding persona versus topic. We find that for race/ethnicity and gender personas, the sets of top words reflect similar stereotypes as reported by Cheng et al. (2023b). See Table A2.

D.1 Online Forum Context

The general prompt for simulation with persona p and topic t is: "A(n) p posted the following comment on t to an online forum." For persona p, the default-topic prompt is: "A(n) p posted the following comment to an online forum." For topic t, the default-persona prompt is: "A person posted the following comment on t to an online forum."
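The three prompt variants differ only in whether the persona and topic slots are filled, so they can be generated from a single template. The helper below is a hypothetical paraphrase for illustration, not code from our pipeline:

```python
from typing import Optional

def forum_prompt(persona: Optional[str] = None, topic: Optional[str] = None) -> str:
    """Online-forum prompt: full (persona + topic), default-topic (no topic),
    or default-persona (no persona)."""
    who = f"A(n) {persona}" if persona else "A person"
    about = f" on {topic}" if topic else ""
    return f"{who} posted the following comment{about} to an online forum."

# Full, default-topic, and default-persona variants:
print(forum_prompt("nonbinary person", "cars and vehicles"))
print(forum_prompt("nonbinary person"))
print(forum_prompt(topic="cars and vehicles"))
```

Keeping all three variants in one template guarantees that the default-persona and default-topic prompts differ from the full prompt only in the omitted slot, which is what the individuation and exaggeration comparisons require.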

D.1.1 Fine-Grained Specificity Experiment
The topics for the fine-grained specificity experiment are:
• Specificity Level 1: health
The Level 1 topic is the topic with the highest rate of caricature. The Level 5 topics are from the top search results for "health" in the subreddit community AskReddit. The authors constructed the intermediate specificity levels by interpolating specificity between these. The resulting exaggeration scores are in Figure A2. We see that specificity and exaggeration are negatively correlated.

D.2 Interview Context
The full list of Pew opinion survey questions that we use as topics in the interview setting is in topics/pewtopics.txt in the supplementary material. The default-topic and default-persona prompts are in Table A5.

E Power Analysis
To justify 100 examples per simulation setting, we provide the following power analysis. Note that for individuation, a simulation that cannot be individuated would have score 0.5 (random chance). Across personas and topics, the lowest mean score was 0.65, and the highest standard deviation was 0.2. A power analysis using a t-test for two independent samples reveals that the necessary sample size is 28 given the effect size (0.65 − 0.5)/0.2 = 0.75, alpha = 0.05, and desired power = 0.8. Similarly, for exaggeration, a simulation with no exaggeration of a persona would have score 0. We found that simulations with less-specific topics and personas of many political ideology, race, and marginalized groups result in high exaggeration scores. Among these simulations, the lowest mean score was 0.23, and the highest standard deviation was 0.37. Again using a power analysis, the necessary sample size is 41 using effect size (0.23 − 0)/0.37 = 0.62, alpha = 0.05, and desired power = 0.8. Thus, our choice of obtaining 100 samples per simulation is more than sufficient to achieve the desired power for both the individuation and exaggeration metrics.
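These sample sizes can be reproduced with the standard normal-approximation formula for a two-sample test, n = 2((z_{1−α/2} + z_{1−β}) / d)^2 per group, which closely approximates the t-test power analysis; a minimal stdlib-only sketch:

```python
from math import ceil
from statistics import NormalDist

def n_per_group(effect_size: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Per-group sample size for a two-sample test (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = z.inv_cdf(power)           # quantile for desired power
    return ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

# Individuation: lowest mean 0.65 vs. chance 0.5, sd 0.2 -> d = 0.75
print(n_per_group((0.65 - 0.5) / 0.2))  # 28
# Exaggeration: lowest mean 0.23 vs. 0, sd 0.37 -> d ~= 0.62
print(n_per_group((0.23 - 0.0) / 0.37))  # 41
```

Both required sizes (28 and 41) are well below the 100 samples collected per simulation setting.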

F Influence of the Context Dimension
In this section, we explore the effect of the context dimension. To verify that the trends we observe are not due to the difference in context alone, we also experiment with swapping the contexts and topics, i.e., 1) simulations with the online forum topics in the interview context, and 2) simulations with the interview topics in the online forum context. Figure A3 reveals that the trends in caricature rates persist for the same topics in different contexts rather than being based on context alone. Interestingly, the exaggeration scores are overall slightly higher in these swapped contexts than in the original contexts. This may be impacted by memorization dynamics (Elangovan et al., 2021; Tirumala et al., 2022; Carlini et al., 2023), a relationship to explore in future work (e.g., are memorized topics less susceptible to caricature?).

G Twitter Context
We additionally conduct and analyze experiments in the Twitter context introduced by Jiang et al. (2022), and we find similar trends that corroborate the results of the main paper.

Figure 1: Mapping Existing Work Using CoMPosT. Existing work on LLM simulations can be compared using our framework. MC and O denote multiple-choice and open-response respectively. More examples are in Table A6.
Simulation Topic: Computers and Electronics

Generated nonbinary person responses:
"I'm interested in getting some recommendations for any cool devices that might particularly appeal to nonbinary individuals or help increase our visibility and representation."
"As a nonbinary individual, I want to create an inclusive and comfortable gaming/streaming space for myself, as well as others in the LGBTQ+ community."
"I recently upgraded my desktop PC with a new graphics card and SSD, and I'm really impressed with the performance boost I got from these upgrades."
"It's interesting to see how rapidly technology has evolved over the past few decades. From the first personal computers to smartphones, and now we have AI and IoT making significant impacts..."

Generated person responses:

Figure 2: Examples of GPT-4 generated responses for simulations with the topic Computers and Electronics. For the simulation of a nonbinary person's response, the generations are focused on identity-related issues, while the simulation of a person's response is topical. The former constructs a homogenous narrative that defines nonbinary people only by LGBTQ+ activism. We provide more qualitative examples in Appendix A.

Figure 3: Our method to measure caricature in LLM simulations. We rely on comparing the defining characteristics of the persona and topic dimensions to measure individuation and exaggeration.

Figure 4: Mean individuation scores (differentiability from default). The standard error for each point is < 0.02 and thus not visible on the plot. Based on a classifier between the default-persona and the target persona, many of the personas are easy to individuate (have high accuracy scores). The only personas that are slightly challenging to differentiate are the gender groups woman and man in the online forum context (blue circles). All personas have mean score > 0.95 in the interview context (green stars).

Figure 5: Mean exaggeration scores ± standard error in the online forum context. We measure exaggeration as normalized cosine similarity to the persona-topic axis. The more general topics (purple, larger marker) have higher rates of exaggeration, and thus caricature, than the specific topics (orange, smaller marker). The uncontroversial (WikiHow, squares) topics have higher rates of caricature than the controversial (ProCon.org, stars) topics. Personas related to political leanings, race/ethnicity, and nonbinary gender broadly have the highest rates of caricature.

Figure A1: Top: Using the unsupervised V-measure to measure differentiation results in similar patterns as our main result in Figure 4. Bottom: Using an alternative pre-trained model to encode the outputs also results in similar patterns.

Table A1: Examples of Simulated "Man" Responses.

Figure A2: Exaggeration scores in the online forum context for topics varying in specificity. Across different personas (x-axis), exaggeration score (y-axis) is negatively correlated with the specificity of the topic (marker size).

Jiang et al. (2022) demonstrate how LLM simulations of Republican and Democrat Twitter users result in opinions about public figures and groups that correspond to the outcomes of the Ameri-[...]

Table A6: Additional Examples Mapping Existing Work Using the CoMPosT Framework. Extended from Figure 1. MC and O mean multiple-choice and open-response respectively. Note that all works use versions of GPT for the model dimension.