COBRA Frames: Contextual Reasoning about Effects and Harms of Offensive Statements

Warning: This paper contains content that may be offensive or upsetting.
Understanding the harms and offensiveness of statements requires reasoning about the social and situational context in which statements are made. For example, the utterance "your English is very good" may implicitly signal an insult when uttered by a white man to a non-white colleague, but uttered by an ESL teacher to their student would be interpreted as a genuine compliment. Such contextual factors have been largely ignored by previous approaches to toxic language detection. We introduce COBRA frames, the first context-aware formalism for explaining the intents, reactions, and harms of offensive or biased statements grounded in their social and situational context. We create COBRACORPUS, a dataset of 33k potentially offensive statements paired with machine-generated contexts and free-text explanations of offensiveness, implied biases, speaker intents, and listener reactions. To study the contextual dynamics of offensiveness, we train models to generate COBRA explanations, with and without access to the context. We find that explanations by context-agnostic models are significantly worse than by context-aware ones, especially in situations where the context inverts the statement's offensiveness (29% accuracy drop). Our work highlights the importance and feasibility of contextualized NLP by modeling social factors.


Introduction
Humans judge the offensiveness and harms of a statement by reasoning about its pragmatic implications with respect to the social and interactional context (Cowan and Hodge, 1996; Cowan and Mettrick, 2002; Nieto and Boyer, 2006; Khurana et al., 2022). For example, when someone says "I'm impressed that your English is so good!", they likely intended "to give a compliment", but the implications and effects could vary drastically depending on the context. A white person saying this to a non-white person is considered a microaggression (Kohli et al., 2018), because it implies that "non-white people are not native English speakers" (Figure 1). Unfortunately, most NLP work has simplified toxic language understanding into a classification problem (e.g., Founta et al., 2018; Jiang et al., 2021), ignoring context and the different pragmatic implications, which has resulted in non-explainable methods that can backfire by discriminating against minority populations (Sap et al., 2019b; Davidson et al., 2019).
Figure 1: Pragmatic reasoning about the offensiveness and harms of statements requires taking interactional context into account. We introduce COBRA, a formalism to distill seven types of pragmatic implications of possibly offensive statements grounded in the situational and social context. As illustrated here, COBRA enables counterfactual reasoning about contexts that invert the statements' offensiveness.
We introduce COBRA Frames (COntextual Bias fRAmes), a formalism to capture and explain the nuanced context-dependent pragmatic implications of offensive language, inspired by frame semantics (Fillmore, 1976) and the recently introduced Social Bias Frames (Sap et al., 2020). As shown in Figure 1, a COBRA frame considers a statement, along with free-text descriptions of its context (social roles, situational context; Figure 1, left). Given the context and statement, COBRA distills free-text explanations of the implications of offensiveness along seven different dimensions (Figure 1) inspired by theories from social science and pragmatics of language (e.g., speaker intent, targeted group, reactions; Grice, 1975; Nieto and Boyer, 2006; Dynel, 2015; Goodman and Frank, 2016).
Table 1: Examples of statements with GPT-3.5-generated contexts and explanations along different dimensions (see §2), as well as human verification ratings and suggestions. The rating indicates how many annotators (out of three) think the explanation is likely; if deemed unlikely, annotators could provide suggested corrections. Suggested corrections: (i) insult (to) Asian women; (ii) implies that Chinese immigrants move to the US only because of multi-culture; (iii) US has many radical feminism supporters.
Our formalism and its free-text representations have several advantages over previous approaches to detecting offensive language. First, our free-text descriptions allow for rich representations of the relevant aspects of context (e.g., situational roles, social power dynamics, etc.), in contrast to modeling specific contextual features alone (e.g., user network features, race or dialect, conversational history; Ribeiro et al., 2017; Sap et al., 2019b; Zhou et al., 2021; Vidgen et al., 2021a; Zhou et al., 2022). Second, dimensions with free-text representations can capture rich types of social knowledge (social commonsense, social norms; Sap et al., 2019a; Forbes et al., 2020), beyond what purely symbolic formalisms alone can (Choi, 2022). Finally, as content moderators have called for more explanation-focused AI solutions (Gillespie et al., 2020; Bunde, 2021), our free-text explanations offer an alternative to categorical flagging of toxicity (e.g., Waseem et al., 2017; Founta et al., 2018, etc.) or highlighting spans in input statements (Lai et al., 2022) that is more useful for nuanced offensiveness (Wiegreffe et al., 2021) and more interpretable to humans (Miller, 2019).
To study the influence of contexts on the understanding of offensive statements, we create COBRACORPUS, containing 32k COBRA context-statement-explanation frames, generated with a large language model (GPT-3.5; Ouyang et al., 2022) with the help of human annotators (Table 1). Following recent successes in high-quality machine dataset creation (West et al., 2022; Kim et al., 2022a; Liu et al., 2022), we opt for machine generations for both the likely contexts for statements (as no corpora of context-statement pairs exist) and explanations, as relying solely on humans for explanations is costly and time-consuming. To explore the limits of context-aware reasoning, we also generate a challenge set of counterfactual contexts (COBRACORPUS-CF) that invert the offensiveness of statements (Fig. 1).
Figure 2 (panel labels): Statements from Toxigen; Three plausible contexts generated by GPT-3.5; Full COBRA frames generated by GPT-3.5; Less toxic context; Counterfactual contexts generated by GPT-3.5.
To examine how context can be leveraged for explaining offensiveness, we train CHARM, a Context-aware Harm Reasoning Model, using COBRACORPUS. Through context-aware and context-agnostic model ablations, we show performance improvements with the use of context when generating COBRA explanations, as measured by automatic and human evaluations. Surprisingly, on the challenging counterfactual contexts (COBRACORPUS-CF), CHARM surpasses the performance of GPT-3.5, which provided CHARM's training data, at identifying offensiveness. Our formalism and models show the promise and importance of modeling contextual factors of statements for pragmatic understanding, especially for socially relevant tasks such as explaining the offensiveness of statements.

COBRA Frames
We draw inspiration from "interactional frames" as described by Fillmore (1976), as well as more recent work on "social bias frames" (Sap et al., 2020) to understand how context affects the interpretation of the offensiveness and harms of statements. We design COBRA frames (S, C, E), an approach that takes into account a Statement in Context (§2.1) and models the harms, implications, etc. (§2.2) with free-text Explanations.

Contextual Dimensions
There are many aspects of context that influence how someone interprets a statement linguistically and semantically (Bender and Friedman, 2018; Hovy and Yang, 2021). Drawing inspiration from sociolinguistics on registers (Gregory, 1967) and the rational speech act model (Monroe and Potts, 2015), we define Context to include the situation, speaker identity, and listener identity for statements. The situation is a short (2-8 words) free-text description of the situation in which the statement could likely be uttered (e.g., "Debate about defunding police", "online conversation in a forum about feminism"). The speaker identity and listener identity capture likely social roles of the statement's speaker and the listener (e.g., "a teacher", "a doctor") or their demographic identities (e.g., "queer man", "black woman"), in free-text descriptions.

Explanation Dimensions
We consider seven explanation dimensions based on theories of pragmatics and implicature (Grice, 1975; Perez Gomez, 2020) and social psychology of bias and inequality (Nieto and Boyer, 2006; Nadal et al., 2014), expanding the reasoning dimensions substantially over prior work, which only captures the targeted group and biased implication (Sap et al., 2020; ElSherief et al., 2021). We represent all explanations as free text, which is crucial to capture the nuances of offensiveness, increase the trust in models' predictions, and assist content moderators (Sap et al., 2020; Gabriel et al., 2022; Miller, 2019).
Intent (Int.) captures the underlying communicative intent behind a statement (e.g., "to give a compliment"). Prior work has shown that intent can influence pragmatic implications related to biases and harms (Kasper, 1990;Dynel, 2015) and aid in hate speech detection (Holgate et al., 2018).
Target Group (TG) describes the social or demographic group referenced or targeted by the post (e.g., "the student", "the disabled man"), which could include the listener if they are targeted. This dimension has been the focus of several prior works (Zampieri et al., 2019; Sap et al., 2020; Vidgen et al., 2021b), as it is crucial towards understanding the offensiveness and harms of the statement.
Power (Pow.) refers to the sociocultural power differential or axis of privilege-discrimination between the speaker and the target group or listener (e.g., "gender differential", "racial power differential"). As described by Nieto and Boyer (2006), individuals have different levels of power and privilege depending on which identity axis is considered, which can strongly influence the pragmatic interpretations of statements (e.g., gay men tend to hold more privilege along the gender privilege spectrum, but less along the sexuality one).
Implication (Imp.) explains the biased, prejudiced, or stereotypical meaning implied by the statement, similar to Sap et al. (2020). This implication is very closely related to the received meaning from the listener's or targeted group's perspective and may differ from the speaker's intended meaning (e.g., for microaggressions; Sue, 2010).
Emotional and Cognitive Reactions (Emo. & Cog.) capture the possible negative effects and harms that the statement and its implied meaning could have on the targeted group. There is an increasing push to develop content moderation from the perspective of the harms that content engenders (Keller and Leerssen, 2020; Vaccaro et al., 2020). As such, we draw from Nadal et al. (2014) and consider the perceived emotional and cognitive reactions of the target group or listener. The emotional reactions capture the short-term emotional effects or reactions to a statement (e.g., "anger and annoyance", "worthlessness"). The cognitive reactions, on the other hand, focus on the lessons someone could draw, the subsequent actions someone could take, or the long-term harms that repeated exposure to such statements could have. Examples include "not wanting to come into work anymore," "avoiding a particular teacher," etc.
Offensiveness (Off.) captures, in 1-3 words, the type or degree of offensiveness of the statement (e.g., "sexism", "offensive generalization"). We avoid imposing a categorization or cutoff between offensive and harmless statements and instead leave this dimension as free text, to preserve nuanced interpretations of statements and capture the full spectrum of offensiveness types (Jurgens et al., 2019).
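To make the (S, C, E) structure concrete, the frame can be sketched as a small set of record types. This is only an illustrative sketch: the class and field names are assumptions (the `implication` field follows the `<implication>` tag used in the linearized targets), and the example values are assembled from phrases quoted in the text rather than taken from a corpus entry.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Context:
    """Free-text context C of a statement."""
    situation: str          # short (2-8 words) description of the situation
    speaker_identity: str   # social role or demographic identity of the speaker
    listener_identity: str  # social role or demographic identity of the listener


@dataclass(frozen=True)
class Explanations:
    """Free-text explanations E along the seven dimensions."""
    intent: str
    target_group: str
    power_dynamics: str
    implication: str
    emotional_reaction: str
    cognitive_reaction: str
    offensiveness: str      # 1-3 words, left as free text rather than a label


@dataclass(frozen=True)
class CobraFrame:
    """One (statement, context, explanations) frame."""
    statement: str
    context: Context
    explanations: Explanations


# Names of the explanation dimensions, in declaration order.
EXPLANATION_DIMENSIONS = tuple(f.name for f in fields(Explanations))

# Illustrative values only (hypothetical frame, not a dataset row).
example_frame = CobraFrame(
    statement="I'm impressed that your English is so good!",
    context=Context(
        situation="conversation at the workplace",
        speaker_identity="a white person",
        listener_identity="a non-white person",
    ),
    explanations=Explanations(
        intent="to give a compliment",
        target_group="non-white people",
        power_dynamics="racial power differential",
        implication="non-white people are not native English speakers",
        emotional_reaction="anger and annoyance",
        cognitive_reaction="not wanting to come into work anymore",
        offensiveness="microaggression",
    ),
)
```

Keeping every dimension as a plain string mirrors the choice to avoid a fixed label set, at the cost of needing text-based post-processing whenever a categorical decision is required.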

Collecting COBRACORPUS
To study the contextual dynamics of the offensiveness of statements at scale, we create COBRACORPUS using a three-stage data generation pipeline with human verification, shown in Figure 2. Given that no available corpus contains statements with their contexts and explanations, we prompt a large language model (GPT-3.5; Ouyang et al., 2022) to generate contexts and explanations, following Hartvigsen et al. (2022), West et al. (2022), and Kim et al. (2022a,b). Specifically, we first generate multiple plausible contexts for statements, then generate the explanations for each context separately, using GPT-3.5 with in-context examples. Please refer to Appendix C for examples of our prompts.
To ensure data quality, we design a set of crowdsourcing tasks to verify the generated contexts and explanations and collect suggestions. For all tasks, we pre-select crowd workers based on a qualification task that judged their understanding of each dimension. Please refer to Appendix A for the details of all crowd-sourcing experiments.

Collecting Statements
We draw our statements from Toxigen (Hartvigsen et al., 2022), a dataset of GPT-3-generated statements that are subtly or implicitly toxic, offensive, prejudiced, or biased against various demographic groups. Specifically, since we focus on the dynamics of offensiveness, we analyze a sample of 13,000 Toxigen statements tagged as "offensive".

Generating Likely Contexts
Following work demonstrating that LLMs can generate realistic social situations related to majority and minority groups (Park et al., 2022), we use GPT-3.5 to construct plausible or likely contexts (i.e., situation, speaker identity, listener identity) in which a statement could be made. Specifically, we manually curate fifty statement-context pairs, out of which we sample five for each statement as in-context examples. Conditioned on the in-context examples, we then sample three contexts from GPT-3.5 for each statement. The examples of prompts for plausible context generation are presented in Appendix C.
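The sampling of in-context examples described above can be sketched as follows. This is a minimal sketch under stated assumptions: the prompt layout, the field names, and the `build_context_prompt` helper are hypothetical (the actual prompts are in Appendix C, which is not reproduced here), and the pool is filled with placeholders rather than the fifty curated pairs. No model call is shown.

```python
import random


def build_context_prompt(statement, example_pool, k=5, rng=None):
    """Sample k curated statement-context pairs and format a few-shot prompt.

    The textual layout below is an assumption made for illustration only.
    """
    rng = rng or random.Random()
    shots = rng.sample(example_pool, k)  # sampling without replacement
    lines = []
    for ex in shots:
        lines.append(f"Statement: {ex['statement']}")
        lines.append(f"Situation: {ex['situation']}")
        lines.append(f"Speaker: {ex['speaker']}")
        lines.append(f"Listener: {ex['listener']}")
        lines.append("")
    # The query statement comes last; the model is expected to continue
    # with a situation, a speaker identity, and a listener identity.
    lines.append(f"Statement: {statement}")
    lines.append("Situation:")
    return "\n".join(lines)


# Placeholder pool standing in for the fifty manually curated pairs.
example_pool = [
    {
        "statement": f"placeholder statement {i}",
        "situation": f"placeholder situation {i}",
        "speaker": f"placeholder speaker {i}",
        "listener": f"placeholder listener {i}",
    }
    for i in range(50)
]

prompt = build_context_prompt(
    "your English is very good", example_pool, k=5, rng=random.Random(0)
)
```

Under this sketch, obtaining three contexts per statement would amount to requesting three sampled completions for the same prompt.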
Verifying Contexts We randomly sample 500 statement-context pairs and ask three annotators to rate the plausibility of the contexts (see Appendix A.2 for the exact questions). Of the 500 pairs, only 1% were marked as completely implausible or gibberish. 92% of the scenarios were marked as plausible by at least two workers, and some were marked as unlikely but technically plausible (e.g., a mayor in public saying "Black people are not humans."). We retain these contexts since such rare situations could still happen, making them helpful for our analyses and modeling experiments.

Generating COBRA Explanations
Similar to context generation, we make use of GPT-3.5's ability to produce rich explanations of social commonsense (West et al., 2022) to generate explanations along our seven dimensions. For each context-statement pair, we generate one full COBRA frame, using three randomly sampled in-context examples from our pool of six manually curated prompts. As shown in Table 2, this process yields COBRACORPUS, containing 32k full (context-statement-explanation) COBRA frames.

Verifying Explanations
To ensure data quality, we randomly sampled 567 (statement, context, explanations) triples and asked three annotators to rate how likely the explanations are to fit the statements in context. Inspired by prior work (Aguinis et al., 2021; Clark et al., 2021; Liu et al., 2022), we also asked annotators to provide corrections or suggestions for those they consider unlikely. 97% of explanations were marked as likely by at least two annotators (majority vote) and 84% were marked as likely by all three annotators (unanimous). As illustrated in Table 1, humans tend to have better explanations of the implications of statements, whereas machines sometimes re-use words from the statement. This might explain the gap between the majority vote and unanimously approved examples, as the annotators might have different standards for what constitutes a good explanation.
Table 3: Percentage of contexts occurring under each category/scenario in COBRACORPUS-CF. The "more off." row covers contexts that make statements more offensive; the "less off." row covers contexts that make them less offensive.
            Friends   Strangers   Workplace   Family   Other
more off.    5.28%     43.09%      27.54%      2.85%   21.24%
less off.   60.06%     16.6%        5.79%     11.38%    6.17%
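The majority-vote and unanimous acceptance rates used in the explanation verification can be computed from per-item annotator ratings as in the sketch below. The function name and the boolean encoding of ratings are assumptions for illustration, and the ratings shown are toy values, not the collected annotations.

```python
def agreement_rates(ratings):
    """Return (majority_rate, unanimous_rate).

    `ratings` is a list with one entry per item; each entry is a list of
    booleans, one per annotator, where True means "rated as likely".
    """
    n_items = len(ratings)
    # Majority: strictly more than half of the annotators said "likely".
    majority = sum(1 for r in ratings if 2 * sum(r) > len(r)) / n_items
    # Unanimous: every annotator said "likely".
    unanimous = sum(1 for r in ratings if all(r)) / n_items
    return majority, unanimous


# Toy ratings for four items with three annotators each.
toy_ratings = [
    [True, True, True],
    [True, True, False],
    [True, False, False],
    [True, True, True],
]
majority_rate, unanimous_rate = agreement_rates(toy_ratings)
```

By construction the unanimous rate can never exceed the majority rate, which is why the two reported figures bracket the share of items on which annotators partially disagree.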
Analyzing COBRACORPUS We present some basic statistics of COBRACORPUS in Table 2. The average length illustrates the level of nuance in some of the explanations (e.g., 22 words for cognitive reaction). Additionally, we analyze the distribution of target groups, finding that minority or marginalized groups like LGBTQIA+, people with disabilities, and women are among the most frequently targeted groups (see Figure 3a). Analyzing the distribution of the free-text offensiveness types, we find that microaggressions are the most frequent type of offensiveness (see Figure 3b).

COBRACORPUS-CF: Generating Counterfactual Contexts
To examine the limits of context-aware explanations of offensiveness, we generate COBRACORPUS-CF, a challenge set of counterfactual context pairs that invert the offensiveness of statements, inspired by adversarial and counterfactual test sets in NLP (Gardner et al., 2020; Li et al., 2020; Chang et al., 2021). As illustrated in Figure 1, our motivating question is: how does the toxicity of a statement change with a different context?
Creating COBRACORPUS-CF One of the difficulties of collecting counterfactual data is finding statements that are contextually ambiguous and can be interpreted in different ways. Statements such as microaggressions, compliments, criticism, and offers for advice are well-suited for this, as their interpretation can be highly contextual (Sue, 2010; Nadal et al., 2014). We scraped 1000 statements from a crowdsourced corpus of microaggressions (https://www.microaggressions.com/), including many contextually ambiguous statements. Following a similar strategy as in §3.2, we manually craft 50 (statement, offensive context, harmless context) triples to use as in-context examples for generating counterfactual contexts. Then, for each microaggression in the corpus, we generated both a harmless and an offensive context with GPT-3.5, prompted with five randomly sampled triples as in-context examples. This process yields 982 triples, as GPT-3.5 failed to generate a harmless context for 18 statements.
Figure 4 (panel labels: CobraCorpus-Cf Source, CobraCorpus Source, CobraCorpus Target). Example:
CobraCorpus Source: This is conversation between a man and a woman in an initial interaction in workspace: Excuse me miss, are any of your bosses here?
CobraCorpus Target: <intent> The speaker is trying to find someone in a position of authority </intent> <targetGroup> Listener, women </targetGroup> <implication> It implies that women are not capable of being in a position of authority, and that a man is needed to speak to someone in charge </implication> <powerDynamics> gender power differential between speaker and listener </powerDynamics> <emotionalReaction> listener and women could feel angry, feel condescended to, angry that they didn't think a woman was boss </emotionalReaction> <cognitiveReaction> might lead women bosses to change their behavior to appear more boss-like, might become more confrontation- …
Human Verification We then verify that the counterfactual contexts invert the offensiveness of the statements. Presented with both contexts, the annotators (1) rate the offensiveness of the statement under each context (Individual) and (2) choose the context that makes the statement more offensive (Forced Choice). We annotate all of the 982 triples in this manner. When we evaluate models' performance on COBRACORPUS-CF (§5.2), we use the Individual ratings. In our experiments, we use the 344 (statement, context) pairs where all three annotators agreed on the offensiveness, to ensure the contrastiveness of the contexts.
Analyzing Counterfactual Contexts To compare with our likely contexts, we examine the types of situations that changed perceptions of toxicity using our human-verified offensive and harmless counterfactual contexts. We use the aforementioned Forced Choice ratings here. We detect and classify the category of the situation in the counterfactual context pairs as conversations occurring between friends, among strangers in public, at a workplace, and between members of a family, using keyword matching.
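The keyword-matching step for situation categories can be sketched as below. The keyword lists are hypothetical, since the actual keywords are not given in the text, and the first-match-wins ordering is likewise an assumption; only the category names come from the analysis.

```python
from collections import Counter

# Hypothetical keyword lists; the keywords actually used are not specified.
CATEGORY_KEYWORDS = {
    "Friends": ("friend",),
    "Strangers": ("stranger", "public"),
    "Workplace": ("work", "office", "colleague", "coworker", "boss"),
    "Family": ("family", "parent", "mother", "father", "sibling"),
}


def categorize_situation(context_text):
    """Return the first category whose keyword occurs in the text, else 'Other'.

    Matching is by case-insensitive substring, so "work" also matches
    "workspace"; a stricter variant could match whole words only.
    """
    text = context_text.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return category
    return "Other"


def category_shares(contexts):
    """Percentage of contexts that fall into each category."""
    counts = Counter(categorize_situation(c) for c in contexts)
    total = sum(counts.values())
    return {category: 100.0 * n / total for category, n in counts.items()}


toy_contexts = [
    "an initial interaction in workspace",
    "two friends chatting over dinner",
    "a conversation between strangers on a bus",
    "a debate about defunding police",
]
shares = category_shares(toy_contexts)
```

Substring matching of this kind is cheap but coarse: a context mentioning both a friend and an office would be assigned to whichever category is checked first, so the resulting percentages are best read as rough indicators.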
We observe that contexts involving conversations among strangers in public and at the workplace are perceived as more offensive than those between friends (see Table 3). This aligns with previous literature showing that offensive, familiar, or impolite language might be considered more acceptable in environments where people are more familiar with each other (Jay and Janschewitz, 2008; Dynel, 2015; Kasper, 1990). Ethnographic research shows how crude language, including the use of offensive stereotypes and slurs, is often encouraged in informal settings like sports (Fine, 1979) or social clubs (Eliasoph and Lichterman, 2003). But such speech is generally considered less acceptable in the broader public sphere, including in public and at the workplace.
Table 4: Performance of different model sizes measured with automatic evaluation metrics, broken down by explanation dimension. The best result is bolded. We also calculate the BERTScore (Zhang et al., 2020) for each model size, which shows similar trends (see Appendix B.2). Takeaway: unsurprisingly, the best-performing model is often CHARM (XXL), but XL follows closely behind.

Experiments
We investigate the role that context plays when training models to explain offensive language on both COBRACORPUS and COBRACORPUS-CF. Although GPT-3.5's COBRA explanations are highly rated by human annotators (§3.3), generating them is a costly process from both a monetary (see footnote 9) and an energy consumption perspective (Strubell et al., 2019; Taddeo et al., 2021; Dodge et al., 2022). Therefore, we also investigate whether such high-quality generations can come from more efficient neural models. We train CHARM (§5.1), with which we first empirically evaluate the general performance of our models in generating COBRA explanations. We then investigate the need for context in generating COBRA explanations. Finally, we evaluate both GPT-3.5 and our model on the challenging COBRACORPUS-CF context-statement pairs.

COBRA Model: CHARM
We introduce CHARM, a FLAN-T5 model (Chung et al., 2022) finetuned on COBRACORPUS for predicting COBRA frames. Given a context-statement pair (C, S), CHARM is trained to generate a set of explanations E along all seven COBRA dimensions. Note that while there is a range of valid model choices when it comes to modeling COBRA, we choose FLAN-T5 based on its strong reasoning abilities in many language generation tasks.
As illustrated in Fig. 4, both the source and the target are linearized sequences of COBRA frame elements. The source sequence concatenates the situation, speaker, listener, and statement in the following format: "This is a conversation between [speaker] and [listener] in [situation]: [statement]". The target sequence is a concatenation of tagged explanation dimensions, e.g., "<intent> [intent] </intent>", "<targetGroup> [targetGroup] </targetGroup>". We train the model with the standard cross-entropy loss.
Footnote 9: Each COBRA explanation costs approximately $0.01 when using …
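The source and target linearization can be sketched in a few lines. The source template and the first six tag names follow the format and the Figure 4 example; the `offensiveness` tag name, the tag order, the helper names, and the two example values marked in comments are assumptions, since that part of the example did not survive extraction. A small parser is included to show how the tagged target can be mapped back to dimensions.

```python
import re

# Tag names as they appear in the Figure 4 example; "offensiveness" is assumed.
TAGS = (
    "intent",
    "targetGroup",
    "implication",
    "powerDynamics",
    "emotionalReaction",
    "cognitiveReaction",
    "offensiveness",
)


def linearize_source(speaker, listener, situation, statement):
    """Build the context-aware source sequence."""
    return (
        f"This is a conversation between {speaker} and {listener} "
        f"in {situation}: {statement}"
    )


def linearize_target(explanations):
    """Concatenate the tagged explanation dimensions into one target string."""
    return " ".join(f"<{tag}> {explanations[tag]} </{tag}>" for tag in TAGS)


def parse_target(text):
    """Recover a {tag: explanation} mapping from a tagged sequence."""
    pairs = re.findall(r"<(\w+)>\s*(.*?)\s*</\1>", text, flags=re.S)
    return dict(pairs)


source = linearize_source(
    "a man",
    "a woman",
    "an initial interaction in workspace",
    "Excuse me miss, are any of your bosses here?",
)

example_explanations = {
    "intent": "The speaker is trying to find someone in a position of authority",
    "targetGroup": "Listener, women",
    "implication": "It implies that women are not capable of being in a position of authority",
    "powerDynamics": "gender power differential between speaker and listener",
    "emotionalReaction": "listener and women could feel angry, feel condescended to",
    "cognitiveReaction": "not wanting to come into work anymore",  # illustrative value
    "offensiveness": "sexism",  # illustrative value
}
target = linearize_target(example_explanations)
```

Parsing with a back-referenced closing tag means a malformed generation (for example, a missing closing tag) simply drops that dimension instead of raising, which is convenient when scoring model outputs dimension by dimension.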
We randomly split COBRACORPUS into training (31k) and evaluation (1k) sets, ensuring that no statement is present in multiple splits, with COBRACORPUS-CF serving as an additional evaluation set (we use the small-scale, highly curated 172 statement-context pairs in §4).
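A statement-disjoint split of this kind can be obtained by sampling held-out statements rather than held-out frames, as in the sketch below. The function name, the dictionary layout of a frame, and the choice to parameterize by number of statements are assumptions; the split sizes in the text are counted in frames.

```python
import random


def split_by_statement(frames, n_eval_statements, seed=0):
    """Split frames so that no statement appears in both splits.

    Each frame is a dict with at least a "statement" key. All frames that
    share a held-out statement go to the evaluation split together.
    """
    statements = sorted({frame["statement"] for frame in frames})
    rng = random.Random(seed)
    rng.shuffle(statements)
    eval_statements = set(statements[:n_eval_statements])
    train = [f for f in frames if f["statement"] not in eval_statements]
    evaluation = [f for f in frames if f["statement"] in eval_statements]
    return train, evaluation


# Toy corpus: 10 statements with 3 contexts each.
toy_frames = [
    {"statement": f"statement {i}", "context": f"context {j}"}
    for i in range(10)
    for j in range(3)
]
train_split, eval_split = split_by_statement(toy_frames, n_eval_statements=2)
```

Splitting at the statement level prevents a model from being evaluated on a statement it has already seen under a different context, which would otherwise inflate the overlap-based metrics.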
We train different variants of CHARM, namely Small (80M), Base (250M), Large (780M), XL (3B), and XXL (11B), to explore how the model's explanation generation abilities differ across sizes. We use the same hyperparameters across the different model variants. Unless otherwise mentioned, CHARM refers to XL, which we use as our default based on the combination of competitive performance and efficiency. During inference, we use beam search decoding with beam_size=4. Additional experimental details are provided in Appendix B.1.

Evaluation
We evaluate our models in the following ways. For automatic evaluation of explanation generation, we use BLEU-2 and ROUGE-L to capture the word overlap between the generations and references (Hashimoto et al., 2019). For human evaluation, we use the same acceptability task as in §3.3, using the unanimous setting (i.e., rated likely by all three annotators). For the counterfactual automatic evaluation, we convert the offensiveness dimension into a binary label based on the existence of certain phrases (e.g., "not offensive", "none", "harmless").
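The phrase-based binarization of the offensiveness dimension, and the accuracy derived from it, can be sketched as follows. The phrase list contains only the three examples named in the text (the full list is not given), and the function names and toy inputs are assumptions.

```python
# Example phrases from the text; the complete list is not specified.
HARMLESS_PHRASES = ("not offensive", "none", "harmless")


def is_offensive(offensiveness_text):
    """Map a free-text offensiveness explanation to a binary label.

    The text counts as harmless if it contains any harmless phrase
    (case-insensitive substring match) and as offensive otherwise.
    """
    text = offensiveness_text.lower()
    return not any(phrase in text for phrase in HARMLESS_PHRASES)


def accuracy(predicted_texts, gold_labels):
    """Fraction of items whose binarized prediction equals the gold label."""
    predicted = [is_offensive(text) for text in predicted_texts]
    correct = sum(p == g for p, g in zip(predicted, gold_labels))
    return correct / len(gold_labels)


toy_predictions = ["sexism", "harmless", "none"]
toy_gold = [True, False, True]  # True means annotators judged it offensive
toy_accuracy = accuracy(toy_predictions, toy_gold)
```

Because matching is by substring, a phrase such as "none" would also fire inside longer words; a production version would likely need word-boundary matching or a curated phrase list.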
How good are different CHARM models? As shown in Table 4 …
How important is context for CHARM? We examine how context influences CHARM's ability to generate explanations. In context-agnostic model setups, the source sequence is formatted as "This is a statement: [statement]", omitting the speaker, listener, and situation. As shown in Table 6, incorporating context at training and inference time improves CHARM's performance across the automatic and human evaluations. This is consistent with our hypothesis that context is important for understanding the toxicity of statements.
How well do models adapt to counterfactual contexts? We then investigate how well our model, as well as GPT-3.5 (text-davinci-003, Jan 13th 2022), identifies the offensiveness of statements when the context drastically alters the implications. We compare different models' ability to classify whether the statement is offensive or not given the counterfactual context in COBRACORPUS-CF. Surprisingly, although our model is only trained on the GPT-3.5-generated COBRACORPUS, it outperforms GPT-3.5 (in a few-shot setting as described in §3.3) on COBRACORPUS-CF (Table 7). Table 5 shows some example predictions on the counterfactual context pairs. GPT-3.5 tends to "over-interpret" the statement, possibly due to the information in the prompts. For example, for the last statement in Table 5, GPT-3.5 infers the implication as "It implies that people of color are not typically articulate", although the statement-context pair contains no information about people of color. In general, counterfactual contexts are still challenging even for our best-performing models.
Table 7: Accuracy, derived from binarizing the "offensiveness" explanation, for different models on COBRACORPUS-CF (WoC means Without Context). All Toxic means predicting every statement as toxic. Takeaway: CHARM adapts to counterfactual contexts better than GPT-3.5 (text-davinci-003 Jan 13th 2022).

Conclusion & Discussion
We introduce COBRA frames, a formalism to distill the context-dependent implications, effects, and harms of toxic language. COBRA captures seven explanation dimensions, inspired by frame semantics (Fillmore, 1976), social bias frames (Sap et al., 2020), and psychology and sociolinguistics literature on social biases and prejudice (Nieto and Boyer, 2006;Nadal et al., 2014). As a step towards addressing the importance of context in content moderation, we create COBRACORPUS, a novel dataset of toxic comments populated with contextual factors as well as explanations. We also build COBRACORPUS-CF, a small-scale, curated dataset of toxic comments paired with counterfactual contexts that significantly alter the toxicity and implication of statements.
We contribute CHARM, a new model trained with COBRACORPUS for producing explanations of toxic statements given the statement and its social context. We show that modeling without contextual factors is insufficient for explaining toxicity. CHARM also outperforms GPT-3.5 on COBRACORPUS-CF, even though it is trained on data generated by GPT-3.5.
We view COBRA as a vital step towards addressing the importance of context in content moderation and many other social NLP tasks. Potential future applications of COBRA include automatic categorization of different types of offensiveness, such as hate speech, harassment, and microaggressions, as well as the development of more robust and fair content moderation systems. Furthermore, our approach has the potential to assist content moderators by providing free-text explanations. These explanations can help moderators better understand the rationale behind models' predictions, allowing them to make more informed decisions when reviewing flagged content (Zhang et al., 2023). This is particularly important given the growing calls for transparency and accountability in content moderation processes (Bunde, 2023).
Besides content moderation, COBRA also has the potential to test linguistic and psychological theories about offensive statements. While we made some preliminary attempts in this direction in §3 and §4, more work is needed to fully realize this potential. For example, future studies could investigate the differences in in-group and out-group interpretations of offensive statements, as well as the role of power dynamics, cultural background, and individual sensitivities in shaping perceptions of offensiveness.

Limitations & Ethical and Societal Considerations
We consider the following limitations and societal considerations of our work.
Machine-generated Data Our analysis is based on GPT-3 generated data. Though not perfectly aligned with real-world scenarios, as demonstrated in Park et al. (2022), such analysis can provide insights into the nature of social interactions. However, this could induce specific biases, such as skewing towards interpretations of words aligned with GPT-3.5's training domains and potentially overlooking more specialized domains or minority speech (Bender et al., 2021; Bommasani et al., 2021). The pervasive issue of bias in offensive language detection and in LLMs more generally requires exercising extra caution. We deliberately generate multiple contexts for every statement as an indirect means of managing the biases. Nevertheless, it is a compelling direction for future research to investigate the nature of biases latent in distilled contexts for harmful speech and further investigate their potential impact. For example, it would be valuable to collect human-annotated data on COBRA to compare with the machine-generated data. However, we must also recognize that humans are not immune to biases (Sap et al., 2019b, 2022), and therefore, such investigations should be carefully designed.
Limited Contextual Variables Although COBRACORPUS has rich contexts, capturing the full context of statements is challenging. Future work should explore incorporating more quantitative features (e.g., the number of followers of the speaker) to supplement contextual variables such as social role and power dynamics. In this work, we focus on the immediate context of a toxic statement. However, we recognize that the context of a toxic statement can be much longer. We have observed significant effects even in relatively brief contexts, indicating the potential for improved performance when more extended contexts are present. We believe that future research could explore the influence of richer contexts by including other modalities (e.g., images, videos, etc.).

Limited Identity Descriptions
Our work focused on distilling the most salient identity characteristics that could affect the toxicity implications of statements. This often resulted in generic identity labels such as "a white person" or "A Black woman" being generated without social roles. This risks essentialism, i.e., the assumption that all members of a demographic group have inherent qualities and experiences, which can be harmful and perpetuate stereotypical thinking (Chen and Ratliff, 2018; Mandalaywala et al., 2018; Kurzwelly et al., 2020). Future work should explore incorporating more specific identity descriptions that circumvent the risk of essentializing groups.

English Only
We consider only English-language statements from a US-centric perspective in our investigation. Online hate and abuse manifest in many languages (Arango Monnar et al., 2022), so we hope future work will adapt our frames to different languages and cultures.

Subjectivity in Offensiveness
Not everyone agrees on what is offensive, and interpretations of offensiveness vary with people's own backgrounds and beliefs (Sap et al., 2022). Our in-context prompts and qualification tests likely make both our machine-generated explanations and human annotations prescriptive (Röttger et al., 2021), in contrast to a more descriptive approach in which we would examine different interpretations. We leave that to future work.

Dual Use
We aim to combat the negative effects and harms of discriminatory language on already marginalized people (Sap et al., 2019b; Davidson et al., 2019). It is possible, however, that our frames, dataset, and models could be used to perpetuate harm against those very people. We do not endorse the use of our data for those purposes.

Risk of Suppressing Speech
Our frames, dataset, and models are built with content moderation in mind, as online spaces are increasingly riddled with hate and abuse and content moderators are struggling to sift through all of the content. We hope future work will examine frameworks for using our frames to help content moderators. We do not endorse the use of our system to suppress speech without human oversight, and we encourage practitioners to take non-censorship-oriented approaches to content moderation (e.g., counterspeech; Tekiroglu et al., 2022).

Harms of Exposing Workers to Toxic Content
The verification of COBRACORPUS and COBRACORPUS-CF is performed by human annotators. Exposure to such offensive content can be harmful to the annotators (Liu et al., 2016). We mitigated this risk by keeping the annotation workload minimal, paying workers above minimum wage ($7-12), and providing them with crisis management resources. Our annotation work is also supervised by an Institutional Review Board (IRB).

A Crowd-sourcing on MTurk
In this paper, human annotation is widely used in §3.2, §3.3, §4, and §5.2. We restrict our worker candidates' location to the U.S. and Canada and ask the workers to optionally provide coarse-grained demographic information. Among 300 candidates, 109 workers passed the qualification tests. Note that we not only score the workers on their accuracy in our tests, but also manually verify the suggestions for explanations that they provide. Annotators are compensated $12.80 per hour on average. The data collection procedure was approved by our institution's IRB.

A.1 Annotator demographics
Due to the subjective nature of toxic language (Sap et al., 2022), we aim to recruit a diverse pool of annotators. In our final pool of 109 annotators, the average age is 36 (ranging from 18 to 65). For political orientation, 64/21/24 annotators identify as liberal/conservative/neutral, respectively. For gender identity, 61/46/2 annotators identify as man/woman/non-binary, respectively. In addition, 40 annotators self-identify as being part of a minority group.

A.2 Annotation interface and instructions
As recommended by Aguinis et al. (2021),

With HuggingFace's Transformers library, we fine-tune five FLAN-T5 variants (small, base, large, XL, and XXL) on the COBRA training set for two epochs using the AdamW optimizer with a learning rate of 1e-4 and a batch size of 16. We use beam search as the decoding algorithm, and all reported results are based on a single run. We also train an XL model using the same architecture and hyperparameters but without the context information. CHARM models range in size from 80M to 11B parameters; the largest takes 10 hours to train in FP32 on 5 A6000 GPUs with NVLink and can run inference in FP16 on a single A6000 GPU. We use the HuggingFace evaluate package to compute the BLEU-2 and ROUGE-L scores.
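To make the two automatic metrics concrete, the following is a minimal sketch of sentence-level BLEU-2 and ROUGE-L written in plain Python. It is not the HuggingFace evaluate implementation used for the reported scores: it assumes whitespace tokenization, a single reference, and no smoothing or stemming, so its values will generally differ from the library's. The function names (`bleu2`, `rouge_l_f1`) and the example sentences are hypothetical and serve only as illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu2(candidate, reference):
    """Sentence-level BLEU over 1- and 2-grams (uniform weights, one reference, no smoothing)."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in (1, 2):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: penalize candidates that are not longer than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / 2)


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L as the F1 of LCS-based precision and recall."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Illustrative strings only; they are not taken from COBRACORPUS.
    generated = "the speaker implies the listener is foreign"
    gold = "the speaker implies that the listener is foreign"
    print(f"BLEU-2:  {bleu2(generated, gold):.3f}")
    print(f"ROUGE-L: {rouge_l_f1(generated, gold):.3f}")
```

BLEU-2 rewards exact unigram and bigram overlap, whereas ROUGE-L rewards in-order overlap through the longest common subsequence. The two can therefore diverge on free-text explanations that paraphrase the reference.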

B.2 Evaluation details
See Table 8 for the BERTScore metrics across different model sizes.

C GPT-3 prompts used in this paper
The example prompts for generating likely contexts are in Figure 8. The example prompts for generating adversarial contexts are in Figure 9. The example prompts for generating the likely explanations are in Figure 10.