Latent Hatred: A Benchmark for Understanding Implicit Hate Speech

Hate speech has grown significantly on social media, causing serious consequences for victims of all demographics. Despite much attention being paid to characterize and detect discriminatory speech, most work has focused on explicit or overt hate speech, failing to address a more pervasive form based on coded or indirect language. To fill this gap, this work introduces a theoretically-justified taxonomy of implicit hate speech and a benchmark corpus with fine-grained labels for each message and its implication. We present systematic analyses of our dataset using contemporary baselines to detect and explain implicit hate speech, and we discuss key features that challenge existing models. This dataset will continue to serve as a useful benchmark for understanding this multifaceted issue.


Introduction
Hate speech is pervasive in social media. Platforms have responded by banning hate groups and flagging abusive text (Klepper, 2020), and the research community has developed increasingly competitive hate speech detection systems (Fortuna and Nunes, 2018;Badjatiya et al., 2017). While prior efforts have focused extensively on overt abuse or explicit hate speech (Schmidt and Wiegand, 2017), recent works have started to highlight the diverse range of implicitly hateful messages that have previously gone unnoticed by moderators and researchers alike (Jurgens et al., 2019;Waseem et al., 2017;Qian et al., 2019). Figure 1 provides an example from each hate speech type (explicit vs. implicit).
Implicit hate speech is defined by coded or indirect language that disparages a person or group on the basis of protected characteristics like race, gender, and cultural identity (Nockleby, 2000). Extremist groups have used this coded language to mobilize acts of aggression (Gubler and Kalmoe, 2015) and domestic terrorism (Piazza, 2020) while also maintaining plausible deniability for their actions (Dénigot and Burnett, 2020). Because this speech lacks clear lexical signals, hate groups can evade keyword-based detection systems (Waseem et al., 2017;Wiegand et al., 2019), and even the most advanced architectures may suffer if they have not been trained on implicitly abusive messages (Caselli et al., 2020).
Figure 1: Sample posts from our dataset outlining the differences between explicit and implicit hate speech. Explicit hate is direct and leverages specific keywords while implicit hate is more abstract. Explicit text has been modified to include a star (*).
The primary challenge for statistical and neural classifiers is the linguistic nuance and diversity of the implicit hate class, which includes indirect sarcasm and humor (Waseem and Hovy, 2016;Fortuna and Nunes, 2018), euphemisms (Magu and Luo, 2018), circumlocution (Gao and Huang, 2017), and other symbolic or metaphorical language (Qian et al., 2019). The type of implicit hate speech also varies, from dehumanizing comparisons (Leader Maynard and Benesch, 2016) and stereotypes (Warner and Hirschberg, 2012), to threats, intimidation, and incitement to violence (Sanguinetti et al., 2018;Fortuna and Nunes, 2018). Importantly, the field lacks a theoretically grounded framework and a large-scale dataset to help inform a more empirical understanding of implicit hate in all of its diverse manifestations.
To fill this gap, we establish new resources to sustain research and facilitate both fine-grained classification and generative intervention strategies. Specifically, we develop a 6-class taxonomy of implicit hate speech that is grounded in the social science literature. We use this taxonomy to annotate a new Twitter dataset with broad coverage of the most prevalent hate groups in the United States. This dataset makes three original contributions: (1) it is a large and representative sample of implicit hate speech with (2) fine-grained implicit hate labels and (3) natural language descriptions of the implied aspects for each hateful message. Finally, we train competitive baseline classifiers to detect implicit hate speech and generate its implied statements. While state-of-the-art neural models are effective at high-level hate speech classification, they are not effective at spelling out more fine-grained categories with detailed explanations of the implied message. The results suggest our dataset can serve as a useful benchmark for understanding implicit hate speech.

Related Work
Numerous hate speech datasets exist, and we summarize them in Table 1. The majority are skewed towards explicitly abusive text since they were originally seeded with hate lexicons (Basile et al., 2019;Founta et al., 2018;Davidson et al., 2017;Waseem and Hovy, 2016), racial identifiers (Warner and Hirschberg, 2012), or explicitly hateful phrases such as "I hate <target>" (Silva et al., 2016). Because of a heavy reliance on overt lexical signals, explicit hate speech datasets have known racial biases. Among public datasets, all but one have near or above a 20% concentration of profanity in the hate class (Table 1).
A few neutrally-seeded datasets also exist (Burnap and Williams, 2014;de Gibert et al., 2018;Warner and Hirschberg, 2012). Although some may contain implicit hate speech, there are no implicit hate labels and thus the distribution is unknown. Furthermore, these datasets tend to focus more on controversial events (e.g. the Lee Rigby murder; Burnap and Williams) or specific hate targets (e.g. immigrants; Basile et al.), which may introduce topic bias and artificially inflate model performance on implicit examples (Wiegand et al., 2019). Consider Sap et al. (2020) for example: 31% of posts take the form of the question leading up to a mean joke. There is still a need for a representative and syntactically diverse implicit hate benchmark.
Our contribution is similar to the Gab Hate Corpus of Kennedy et al. (2018), which provides both explicit and implicit hate and target labels for a random sample of 27K Gab messages. We extend this work with a theoretically-grounded taxonomy and fine-grained labels for implicit hate speech beyond the umbrella categories, Assault on Human Dignity (HD) and Call for Violence (CV). Following the work of Sap et al. (2020), we provide free-text annotations to capture messages' pragmatic implications. However, we are the first to take this framework, which was originally applied to stereotype bias, and extend it to implicit hate speech more broadly. Implicitly stereotypical language is just a subset of the implicit hate we cover, since we also include other forms of sarcasm, intimidation or incitement to violence, hidden threats, white grievance, and subtle forms of misinformation. Our work also complements recent efforts to capture and understand microaggressions (Breitfeller et al., 2019), a similarly elusive class that draws on subtle and unconscious linguistic reflections of social bias, prejudice and inequality (Sue, 2010). Similar to Breitfeller et al. (2019), we provide a representative and domain-general typology and dataset, but ours are more representative of active hate groups in the United States, and our definitions extend to intentionally veiled acts of intimidation, threats, and abuse.

Taxonomy of Implicit Hate Speech
Implicit hate speech is a subclass of hate speech defined by the use of coded or indirect language such as sarcasm, metaphor and circumlocution to disparage a protected group or individual, or to convey prejudicial and harmful views about them (Gao et al., 2017;Waseem et al., 2017). The NLP community has not yet confronted, in a consistent and unified manner, the multiplicity of subtle challenges that implicit hate presents for online communities. To this end, we introduce a new typology for characterizing and detecting different forms of implicit hate, based on social science and relevant NLP literature. Our categories are not necessarily mutually exclusive, but they represent principal axes of implicit hate, and while they may not be collectively exhaustive, we find they cover 98.6% of implicit hate in a representative sample of the most prevalent hate ideologies in the U.S.
White Grievance includes frustration over a minority group's perceived privilege and casting majority groups as the real victims of racism (Berbrier, 2000;Bloch et al., 2020). This language is linked to extremist behavior and support for violence (Miller-Idriss, 2020). An example is Black lives matter and white lives don't? Sounds racist.
Incitement to Violence includes flaunting ingroup unity and power or elevating known hate groups and ideologies (Somerville, 2011). Phrases like white brotherhood operate in the former manner, while statements like Hitler was Germany - Germans shall rise again! operate in the latter, elevating nationalism and Nazism. Article 20 of the UN International Covenant on Civil and Political Rights (Assembly, 1966) states that speech which incites violence shall be prohibited by law.
Inferiority Language implies one group or individual is inferior to another (Nielsen, 2002), and it can include dehumanization (denial of a person's humanity), and toxification (language that compares the target with disease, insects, animals), both of which are early warning signs of genocide (Leader Maynard and Benesch, 2016;Neilsen, 2015). Inferiority language is also related to assaults on human dignity (Kennedy et al., 2018), dominance (Saha et al., 2018), and declarations of superiority of the in-group (Fortuna and Nunes, 2018). For example, It's not a coincidence the best places to live are majority white.
Irony refers to the use of sarcasm (Waseem and Hovy, 2016;Justo et al., 2014), humor (Fortuna and Nunes, 2018), and satire (Sanguinetti et al., 2018) to attack or demean a protected class or individual. For example, in the context of one hate group, the tweet Horrors... Disney will be forced into hiring Americans works to discredit Disney for allegedly hiring only non-citizens or, really, nonwhites. Irony is not exempt from our hate speech typology, since it is commonly used by modern online hate groups to mask their hatred and extremism (Dreisbach, 2021).
Stereotypes and Misinformation associate a protected class with negative attributes such as crime or terrorism (Warner and Hirschberg, 2012;Sanguinetti et al., 2018) as in the rhetorical question, Can someone tell the black people in Chicago to stop killing one another before it becomes Detroit? This class also includes misinformation that feeds stereotypes and vice versa, like Holocaust denial and other forms of historical negationism (Belavusau, 2017;Cohen-Almagor, 2009).
Threatening and Intimidation convey a speaker's commitment to a target's pain, injury, damage, loss, or violation of rights. While explicitly violent threats are well-recognized in the hate speech literature (Sanguinetti et al., 2018), here we highlight threats related to implicit violation of rights and freedoms, removal of opportunities, and more subtle forms of intimidation, such as All immigration of non-whites should be ended.

Data Collection and Annotation
We collect and annotate a benchmark dataset for implicit hate language using our taxonomy. Our main source of data is content published by online hate groups and their followers on Twitter, for two reasons. First, as modern hate groups have become more active online, they provide an increasingly vivid picture of the more subtle and coded forms of hate that we are interested in. Second, the problem of hateful misinformation is compounded on social media platforms like Twitter where around 3 out of 4 users get their news (Shearer and Gottfried, 2017). This motivates a representative sample of online communication exchanged on Twitter between members of the most prominent U.S. hate groups.

Data Collection and Filtering
We matched all SPLC hate groups with their corresponding Twitter accounts using the account names and bios. Then, for each ideological cluster above, we selected the three hate group accounts with the most followers, since these were likely to be the most visible and engaged. We collected all tweets, retweets, and replies from the timelines of our selected hate groups between January 1, 2015 and December 31, 2017, for a total of 4,748,226 tweets, giving us a broad sample of hate group activity before many accounts were banned.
Hateful content is semantically diverse, with different hate groups motivated by different ideologies. Seeking a representative sample, we identified group-specific salient content from each ideology by performing part of speech (POS) tagging on each tweet. Then we computed the log odds ratio with informative Dirichlet prior (Monroe et al., 2008) for each noun, hashtag, and adjective to identify the top 25 words per ideology. After filtering for tweets that contained one of the salient keywords, we ran the 3-way HateSonar classifier of Davidson et al. (2017) to remove content that was likely to be explicitly hateful. Specifically, we removed all tweets that were classified as offensive, and then ran a final sweep over the neutral and hate categories, removing tweets that contained any explicit keyword found in NoSwear (Jones, 2020) or Hatebase (Hatebase, 2020).
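The keyword-selection step can be illustrated with a minimal sketch of the z-scored log odds ratio with an informative Dirichlet prior, following the general formulation of Monroe et al. (2008). The function names, the choice of pooled counts as the prior, and the toy counts are assumptions made for illustration; the source does not specify the prior or the comparison corpus.

```python
import math
from collections import Counter

def log_odds_dirichlet(counts_a, counts_b, prior):
    """z-scored log odds ratio of each word in corpus A versus corpus B.

    counts_a, counts_b: word -> count for the two corpora being compared.
    prior: word -> pseudo-count (the informative Dirichlet prior).
    Assumes the vocabulary holds at least two distinct words.
    """
    n_a = sum(counts_a.values())
    n_b = sum(counts_b.values())
    a0 = sum(prior.values())
    z = {}
    for w, a_w in prior.items():
        y_a = counts_a.get(w, 0)
        y_b = counts_b.get(w, 0)
        delta = (math.log((y_a + a_w) / (n_a + a0 - y_a - a_w))
                 - math.log((y_b + a_w) / (n_b + a0 - y_b - a_w)))
        variance = 1.0 / (y_a + a_w) + 1.0 / (y_b + a_w)
        z[w] = delta / math.sqrt(variance)
    return z

def top_k_words(counts_a, counts_b, k=25):
    """Top-k words most characteristic of corpus A.

    Assumption: the prior is the pooled count of both corpora.
    """
    prior = Counter(counts_a) + Counter(counts_b)
    z = log_odds_dirichlet(counts_a, counts_b, prior)
    return sorted(z, key=z.get, reverse=True)[:k]
```

In this sketch, `counts_a` would hold the noun, hashtag, and adjective counts for one ideology and `counts_b` the counts for the remaining tweets; `k=25` mirrors the top 25 words per ideology mentioned above.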

Crowdsourcing and Expert Annotation
To acquire implicit hate speech labels with two different resolutions, we ran two stages of annotation. First, we collected high-level labels: explicit hate, implicit hate, or not hate. Then, we took a second pass through the implicit hate tweets with expert annotation over the fine-grained implicit hate taxonomy from Section 3.

Stage 1: High Level Categorization
Amazon Mechanical Turk (MTurk) annotators completed our high-level labeling task. We provided them with a definition of hate speech (Twitter, 2021) and examples of explicit, implicit, and non-hateful content (see Appendix A), and required them to pass a short five-question qualification check for understanding with a score of at least 90% in accordance with crowdsourcing standards (Sheehan, 2018). We paid annotators a fair wage above the federal minimum. Three workers labeled each tweet, and they reached majority agreement for 95.3% of tweets, with perfect agreement on 45.6% of the data. The Intraclass Correlation for one-way random effects between k = 118 raters was ICC(1, k) = 0.616, which indicates moderate inter-rater agreement. Using the majority vote, we obtained consensus labels for 19,112 labeled tweets in total: 933 explicit hate, 4,909 implicit hate, and 13,291 not hateful tweets.
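As a rough illustration of how consensus labels and the two agreement rates above could be derived from three votes per tweet, consider the following sketch. The function names and the strict-majority rule are assumptions; the source only states that the majority vote was used.

```python
from collections import Counter

def majority_label(votes):
    """Return the label chosen by a strict majority of annotators, else None."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None

def agreement_rates(all_votes):
    """Fraction of items with a majority label and with unanimous labels."""
    n = len(all_votes)
    majority = sum(majority_label(v) is not None for v in all_votes)
    perfect = sum(len(set(v)) == 1 for v in all_votes)
    return majority / n, perfect / n
```

Tweets for which `majority_label` returns `None` would have no consensus label and would be excluded under this reading.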

Stage 2: Fine-Grained Implicit Hate
To promote a more nuanced understanding of our 4,909 implicit hate tweets, we labeled them using our fine-grained category definitions in Section 3, adding other and not hate to take care of any other situations. Since these fine-grained categories were too subtle for MTurk workers (footnote 2), we hired three research assistants to be our expert annotators. We trained them over multiple sessions by walking them through seven small pilot batches and resolving disagreements after each test until they reached moderate agreement. On the next round of 150 tweets, their independent annotations reached a Fleiss' Kappa of 0.61. Each annotator then continued labeling an independent partition of the data. Halfway through this process, we ran another attention check with 150 tweets and found that agreement remained consistent with a Fleiss' Kappa of 0.55. Finally, after filtering out tweets marked as not hate, there were 4,153 labeled implicit hate tweets remaining. The per-category statistics are summarized in the # Tweets Pre Expn. column of Table 2.
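For readers unfamiliar with the agreement statistic, the sketch below computes Fleiss' Kappa from per-item label lists, using the standard definition (mean observed pairwise agreement corrected by chance agreement from the marginal category proportions). The function name and input layout are illustrative assumptions, not the authors' tooling.

```python
from collections import Counter

def fleiss_kappa(item_labels):
    """Fleiss' Kappa for a list of items, each a list of rater labels.

    Every item must have the same number of raters (at least two).
    Undefined (division by zero) if all ratings fall in a single category.
    """
    n_items = len(item_labels)
    n_raters = len(item_labels[0])
    totals = Counter()
    p_bar = 0.0
    for labels in item_labels:
        if len(labels) != n_raters:
            raise ValueError("all items need the same number of raters")
        counts = Counter(labels)
        totals.update(counts)
        # Observed agreement for this item: share of agreeing rater pairs.
        agree = sum(c * c for c in counts.values()) - n_raters
        p_bar += agree / (n_raters * (n_raters - 1))
    p_bar /= n_items
    # Chance agreement from the overall category proportions.
    p_e = sum((c / (n_items * n_raters)) ** 2 for c in totals.values())
    return (p_bar - p_e) / (1 - p_e)
```

With three expert annotators and 150 tweets, `item_labels` would be a list of 150 three-element lists of fine-grained category names.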

Corpus Expansion
Extreme class imbalance may challenge implicit hate classifiers. To address this disparity, we expand the minority classes, both with bootstrapping and out-of-domain samples. For bootstrapping, we trained a 6-way BERT classifier on the 4,153 implicit hate labels in the manner of Section 5.1 and ran it on 364,300 unlabeled tweets from our corpus. Then we randomly sampled 1,800 tweets for each of the three minority classes according to the classifications inferiority, irony, and threatening. Finally, we augmented this expansion with out-of-domain (OOD) samples from Kennedy et al. (2018) and Sap et al. (2020). By drawing both from OOD and bootstrapped in-domain samples, we sought to balance two key limitations: (1) bootstrapped samples may be inherently easier, while (2) OOD samples contain artifacts that allow models to benefit from spurious correlations. Our expert annotators labeled this data, and by adding the minority labels from this process, we improved the class balance for a total of 6,346 implicit tweets shown in the # Tweets Post Expn. column of Table 2.
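The sampling part of the bootstrapping step might look like the sketch below: given classifier predictions on unlabeled tweets, draw a fixed number of candidates per minority class for expert annotation. The function name, the `(tweet_id, predicted_label)` input format, and the fixed seed are assumptions for illustration.

```python
import random

def sample_per_class(predictions, minority_classes, k, seed=0):
    """Randomly pick up to k tweet ids for each predicted minority class.

    predictions: iterable of (tweet_id, predicted_label) pairs.
    """
    rng = random.Random(seed)
    by_class = {c: [] for c in minority_classes}
    for tweet_id, label in predictions:
        if label in by_class:
            by_class[label].append(tweet_id)
    return {c: rng.sample(ids, min(k, len(ids))) for c, ids in by_class.items()}
```

Under the setup described above, `minority_classes` would be inferiority, irony, and threatening, and `k` would be 1,800; the sampled tweets still require expert labels, since the classifier's predictions are only used to select candidates.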

Hate Targets and Implied Statement
For each of the 6,346 implicit hate tweets, two separate annotators provided us with the message's target demographic group and its implied statement in free-text format. Implied statements were
Footnote 2: We saw less than 30% agreement when we ran this task over three batches of around 200 tweets each on MTurk.

Implicit Hate Speech Classification
We experiment with two classification tasks: (1) distinguishing implicit hate speech from non-hate, and (2) categorizing implicit hate speech using one of the 6 classes in our fine-grained taxonomy.

Experimental Setup
Using a 60-20-20 split for each task, we trained, validated, and tested SVM and BERT baselines. We tried standard unigrams, TF-IDF, and GloVe embedding (Pennington et al., 2014) [...] we concatenated the 768-dimensional BERT final layer with the 200-dimensional Wikidata (or 300-dimensional ConceptNet) embeddings, and fed this representation into an MLP with two hidden layers of dimension 100 and ReLU activation between them, using categorical Cross Entropy loss.
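The knowledge-augmented classifier head can be pictured with the following dependency-free sketch of a single forward pass. It is only one reading of the description: ReLU is applied after the first hidden layer only ("between them"), the weight initialisation is arbitrary, and all function names are hypothetical. A real implementation would use a deep learning framework and train the weights.

```python
import math
import random

def init_layer(n_in, n_out, rng):
    """Random weights (illustrative initialisation) and zero biases."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

def build_mlp(n_in, n_classes, hidden=100, seed=0):
    """Two hidden layers of size `hidden` plus an output projection."""
    rng = random.Random(seed)
    return [init_layer(n_in, hidden, rng),
            init_layer(hidden, hidden, rng),
            init_layer(hidden, n_classes, rng)]

def linear(x, layer):
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def forward(bert_vec, kg_vec, layers):
    """Concatenate the two embeddings and return class probabilities."""
    x = list(bert_vec) + list(kg_vec)
    hidden1 = [max(0.0, h) for h in linear(x, layers[0])]  # ReLU
    hidden2 = linear(hidden1, layers[1])  # assumption: no second ReLU
    return softmax(linear(hidden2, layers[2]))

def cross_entropy(probs, gold_index):
    """Categorical cross-entropy for a single example."""
    return -math.log(probs[gold_index])
```

With a 768-dimensional BERT vector and a 200-dimensional Wikidata vector, `build_mlp(968, n_classes)` gives the input size implied by the concatenation; for ConceptNet the input size would be 1,068.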

Implicit Hate Classification Results
In binary implicit hate speech classification on the left side of Table 3, baseline SVM models offer competitive performance with F1 scores up to 64.4, while the fine-tuned neural models gain up to 6 additional points. The BERT-base model achieves significantly better macro precision than the linear SVMs (72.1 vs. at most 61.4), demonstrating a compositional understanding beyond simple keyword-matching. When we look at our best BERT + Aug model, the implicit category most confused with non-hate was Incitement (36.3% of testing examples were classified as not hate), followed by White Grievance (29.6%), Stereotypical (23.3%), Inferiority (12.3%), Irony (9.3%), and Threatening (5.5%). In our 6-way classification task on the right of Table 3, we find that the BERT-base models again outperform the linear models. Augmentation does not significantly improve performance in either task since our data is already well-balanced and representative. Interestingly, integrating Wikidata and ConceptNet did not lead to any performance boost either. This suggests detecting implicit hate speech might require more compositional reasoning over the involved entities and we urge future work to investigate this. For additional comparisons, we consider a zero-shot setting where we test Google's Perspective API and the HateSonar classifier of Davidson et al. (2017). Our fine-tuned baselines significantly outperform both zero-shot baselines, which were trained on explicit hate.

Challenges in Detecting Implicit Hate
To further understand the challenges of implicit hate detection and promising directions for future work, we investigated 100 randomly sampled false negative errors from our best model in the binary task (BERT+Aug) and found a set of linguistic classes it struggles with. (1) Coded hate symbols (Qian et al., 2019) such as #WPWW (white pride world wide), #NationalSocialism (Nazism), and (((they))) (an anti-Semitic symbol) are contained in 15% of instances, and our models fail to grasp their semantics. While individual sentences appear harmless, implicit hate can occur in (2) discourse relations [...] understanding of (4) commonsense (11%) surrounding social norms (e.g. a dependant is inferior to a supplier) (Forbes et al., 2020). Other challenge cases contain highly (5) metaphorical language (7%), like the animal metaphor in a world without white people : a visual look at a mongrel future. (6) Colloquial or idiomatic speech (17%) appears in subtle phrases like infrastructure is the white man's game, and (7) Irony (15%) detection (Waseem and Hovy, 2016) may require pragmatic reasoning and understanding, such as in the phrase hey kids, wanna replace white people. When we sample false positives, we find our models are prone to (8) identity term bias (Dixon et al., 2018). Given the high density of identity terms like Jew and Black in hateful contexts, our models overclassified tweets with these terms as hateful, and particularly as stereotypical speech. In a similar manner, our model also incorrectly associated white grievance with all diversity-related discourse, incitement with controversial topics like war and race, and inferiority language with value-laden terms like valid and wealth.
To sum up, our dataset contains rich linguistic phenomena and an array of subtleties that challenge current state-of-the-art baselines. It can therefore serve as a useful benchmark and offers multiple new directions for future work.

Explaining Implicit Hate Speech
This section presents our generation results for natural language explanations of both (1) who is being targeted and (2) what the implied message is for each implicitly hateful tweet. Generating such explanations can help content moderators better understand the severity and nature of automatically-flagged messages. Additionally, we echo efforts from social media companies (e.g., Instagram (Bryant, 2019)) where the application alerts the user when the post is flagged "offensive," and asks them if they really want to post it. This strategy has proven successful in deterring hurtful comments. Our work could inspire a similar strategy for implicit hate speech. By showing the user the implied meaning of their post before it is posted, we would enable them to recognize the severity of their words and possibly reconsider their decision to post.

Task Formulation
Our goal is to develop a natural language system that, given a hateful post, generates its intended target and hidden implied meanings. Therefore, we formulate the problem as a conditional generation task (i.e., conditioned on the post content). During training, the generation model takes as input a sequence of tokens consisting of the start token [STR], the tweet tokens $t_1 : t_n$, the target group tokens $t_{[G_i]}$, and the implied statement tokens $t_{[S_i]}$, and minimizes the cross-entropy loss $-\sum_{l} \log P(t_l \mid t_{<l})$.
During inference, our goal is to mimic real-world scenarios when only the post is available. Therefore, the input to the model only contains the post tokens $t_1 : t_n$, and we experiment with multiple decoding strategies: greedy search (gdy), beam search, and top-p (nucleus) sampling to generate the explanations $t_{[G_i]}$ and $t_{[S_i]}$.
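To make the contrast between greedy decoding and top-p (nucleus) sampling concrete, the sketch below picks a single next token from a toy probability distribution. The function names and the dictionary representation of the distribution are assumptions; an actual system would apply the same idea to the language model's output at each generation step.

```python
import random

def greedy_pick(probs):
    """Greedy decoding: always take the most probable token."""
    return max(probs, key=probs.get)

def nucleus_filter(probs, p=0.92):
    """Keep the smallest set of top tokens whose total mass reaches p,
    then renormalize the kept probabilities so they sum to 1."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, mass = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        mass += prob
        if mass >= p:
            break
    return {token: prob / mass for token, prob in kept}

def top_p_pick(probs, p=0.92, rng=random):
    """Top-p sampling: sample one token from the filtered distribution."""
    filtered = nucleus_filter(probs, p)
    tokens = list(filtered)
    weights = [filtered[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]
```

Beam search is omitted here because it tracks several partial sequences at once rather than making a per-token choice.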

Experiment Setup
Our ground-truth comes from the free-text target demographic and implied statement annotations that we collected for all 6,346 implicit hate tweets in Section 4.2.4, with 75% for training, 12.5% for validation, and 12.5% for testing. Since we collect multiple annotations for each post (2 per tweet), we ensure that each post and its corresponding annotations belong to only one split.
Table 5: Example posts from our dataset along with their implicit category labels, the GPT-2 generated target and implied statements (first row of each block), and the ground truth target and implied statements (final row of each block, in italics). Generated implied statements are semantically similar to the ground truth statements.
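One way to guarantee that both annotations of a post land in the same split is to assign splits at the post level and then route every annotation record by its post. The sketch below assumes each record is a dictionary with a `post_id` key; that layout, the function name, and the seed are illustrative choices, not the authors' code.

```python
import random

def split_by_post(records, fractions=(0.75, 0.125, 0.125), seed=0):
    """Split annotation records into train/dev/test without post leakage."""
    post_ids = sorted({r["post_id"] for r in records})
    random.Random(seed).shuffle(post_ids)
    n_train = int(len(post_ids) * fractions[0])
    n_dev = int(len(post_ids) * fractions[1])
    assignment = {}
    for i, post_id in enumerate(post_ids):
        if i < n_train:
            assignment[post_id] = "train"
        elif i < n_train + n_dev:
            assignment[post_id] = "dev"
        else:
            assignment[post_id] = "test"
    splits = {"train": [], "dev": [], "test": []}
    for record in records:
        splits[assignment[record["post_id"]]].append(record)
    return splits
```

Splitting at the record level instead would risk placing one annotation of a tweet in training and the other in the test set.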
We pick BLEU since it is standard for evaluating machine translation models and ROUGE since it is used in summarization contexts; both have been adopted extensively in prior literature. These automatic metrics indicate the quality of the generated target group and implied statement compared to our annotated ground-truth in terms of n-gram overlap and longest common subsequence overlap. Since there are two ground truth annotations per tweet, we measure both the averaged metrics across both references, and the maximum metrics (BLEU* and ROUGE-L*).
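The averaged-versus-maximum scoring over two references can be sketched with a simplified ROUGE-L. This version uses whitespace tokenization and a plain F1 over longest-common-subsequence precision and recall, which is an assumption; standard ROUGE toolkits differ in tokenization and in how precision and recall are weighted.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def rouge_l_f1(candidate, reference):
    """Simplified ROUGE-L: F1 of LCS-based precision and recall."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

def average_and_max(candidate, references):
    """Score against every reference; return (average, maximum)."""
    scores = [rouge_l_f1(candidate, ref) for ref in references]
    return sum(scores) / len(scores), max(scores)
```

The maximum variant rewards a generation that matches either annotator closely, while the average penalizes it when the two references are worded differently.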
We tuned hyperparameters and selected the best models based on their performance on the development set, and we report evaluation results on the test set (footnote 7). For decoding, we generate one frame for greedy decoding and three hypotheses for beam search and top-p (nucleus) sampling with p = 0.92 and choose the highest scoring frame.
Footnote 7: We fine-tune for e ∈ {1, 2, 3, 5} epochs with a batch size of 2 and learning rate of $5 \times 10^{-5}$ with linear warm up

Generation Results
In Table 4, we find that GPT-2 outperforms GPT in both target group and implied statement generation. This difference is likely because GPT-2 was trained on English web text while GPT was trained on fiction books, and web text is more similar to our domain. The BLEU and ROUGE-L scores are higher for the target group (e.g., 83.9 BLEU) than for the implied statement (e.g., 75.3 BLEU), consistently across both averaged and maximum scores. This is likely because the implied statement is longer, more nuanced, and less likely to be contained in the text itself. Additionally, beam search achieves the highest performance for both GPT and GPT-2, followed by top-p. This is not surprising since both decoding strategies consider multiple hypotheses. Since BLEU and ROUGE-L measure word overlap and not semantics, it is possible that the results in Table 4 are overly pessimistic. The GPT-2 generated implied statements in Table 5 actually describe the complement (a,d), generalization (b), extrapolation (c), or paraphrase (e,f) of the ground truth, and are thus aligned, despite differences in word choice. Overall, our generation results are promising. Transformer-based models may play a key role in explaining the severity and nature of online implicit hate.

Conclusion
In this work, we introduce a theoretical taxonomy of implicit hate speech and a large-scale benchmark corpus with fine-grained labels for each message and its implication. As an initial effort, our work enables the NLP community to better understand and model implicit hate speech at scale. We also provide several state-of-the-art baselines for detecting and explaining implicit hate speech. Experimental results show these neural models can effectively categorize hate speech, spell out more fine-grained implicit hate speech, and explain these hateful messages.
Additionally, we identified eight challenges in implicit hate speech detection: coded hate symbols, discourse relations, entity framing, commonsense, metaphorical language, colloquial speech, irony, and identity term bias. To mitigate these challenges, future work could explore deciphering models for coded language (Kambhatla et al., 2018;Qian et al., 2019), lifelong learning of hateful language (Qian et al., 2021), contextualized sarcasm detection, and bias mitigation for named entities in hate speech detection systems (Xia et al., 2020) and their connection with our dataset.
We demonstrate that our corpus can serve as a useful research benchmark for understanding implicit hate speech online. Our work also has implications towards the emerging directions of counter-

Ethical Considerations
This study has been approved by the Institutional Review Board (IRB) at the researchers' institution. For the annotation process, we included a warning in the instructions that the content might be offensive or upsetting. Annotators were also encouraged to stop the labeling process if they were overwhelmed. We also acknowledge the risk associated with releasing an implicit hate dataset. However, we believe that the benefit of shedding light on the implicit hate phenomenon outweighs any risks associated with the dataset release.

A Data Collection Details
In our first annotation stage (Section 4.2.1), we provide a broad definition of hate speech grounded in Twitter's hateful conduct policy (Twitter, 2021), and detailed definitions for what constitutes explicit hate, implicit hate, and non-hateful content with examples from each class. We explain that explicit hate speech contains explicit keywords directed towards a protected entity. We define implicit hate speech as outlined in the paper and ground this definition in a quote from Lee Atwater on how discourse can appeal to racists without sounding racist: "You start out in 1954 by saying, 'N*gger, n*gger, n*gger.' By 1968 you can't say 'n*gger'-that hurts you, backfires. So you say stuff like, uh, forced busing, states' rights, and all that stuff, and you're getting so abstract". To ensure quality, we chose only AMT Master workers who (1) have an approval rate >98% and more than 5000 HITs approved, and (2) scored ≥ 90% on our five-question qualification test where they must (a) identify the differences between explicit and implicit hate speech and (b) identify the hate target even if the target is not explicitly mentioned. Figures 2 and 4 depict snippets of the first stage annotation task and the instructions provided to guide the annotators, respectively. For the second-stage annotation (Section 4.2.2), we observed the following per-category kappa scores at the beginning/middle: (threatening, 1.00/0.66), (stereotypical, 0.67/0.55), (grievance, 0.61/0.63), (incitement, 0.63/0.53), (not hate, 0.55/0.54), (inferiority, 0.47/0.41), and (irony, 0.40/0.31). Even in the worst case, there was fair to moderate agreement. We will add these metrics to the Appendix. The total annotation cost for Stage 1 and 2 was $15k. Limited by our budget, we chose to employ expert annotators to label independent portions of the data once we observed fair to substantial agreement among them.
Figure 3 depicts a snippet of the hate target and implied statement data collection for each implicit hate speech post.
Table 6: Fine-grained implicit hate classification performance, averaged across five random seeds. Macro scores are further broken down into category-level scores for each of the six main implicit categories, and we omit scores for other. Again, the BERT-based models beat the linear SVMs on F1 performance across all categories. Generally, augmentation improves recall, especially for two of the minority classes, inferiority and threatening, as expected. Knowledge graph integration (Wikidata, ConceptNet) does not appear to improve the performance.