HateCheck: Functional Tests for Hate Speech Detection Models

Detecting online hate is a difficult task that even state-of-the-art models struggle with. Typically, hate speech detection models are evaluated by measuring their performance on held-out test data using metrics such as accuracy and F1 score. However, this approach makes it difficult to identify specific model weak points. It also risks overestimating generalisable model performance due to increasingly well-evidenced systematic gaps and biases in hate speech datasets. To enable more targeted diagnostic insights, we introduce HateCheck, a suite of functional tests for hate speech detection models. We specify 29 model functionalities motivated by a review of previous research and a series of interviews with civil society stakeholders. We craft test cases for each functionality and validate their quality through a structured annotation process. To illustrate HateCheck’s utility, we test near-state-of-the-art transformer models as well as two popular commercial models, revealing critical model weaknesses.


Introduction
Hate speech detection models play an important role in online content moderation and enable scientific analyses of online hate more generally. This has motivated much research in NLP and the social sciences. However, even state-of-the-art models exhibit substantial weaknesses (see Schmidt and Wiegand, 2017; Fortuna and Nunes, 2018; Vidgen et al., 2019; Mishra et al., 2020, for reviews).
So far, hate speech detection models have primarily been evaluated by measuring held-out performance on a small set of widely-used hate speech datasets (particularly Waseem and Hovy, 2016; Founta et al., 2018), but recent work has highlighted the limitations of this evaluation paradigm. Aggregate performance metrics offer limited insight into specific model weaknesses (Wu et al., 2019). Further, if there are systematic gaps and biases in training data, models may perform deceptively well on corresponding held-out test sets by learning simple decision rules rather than encoding a more generalisable understanding of the task (e.g. Niven and Kao, 2019; Geva et al., 2019; Shah et al., 2020). The latter issue is particularly relevant to hate speech detection since current hate speech datasets vary in data source, sampling strategy and annotation process (Vidgen and Derczynski, 2020; Poletto et al., 2020), and are known to exhibit annotator biases (Waseem, 2016; Waseem et al., 2018; Sap et al., 2019) as well as topic and author biases (Wiegand et al., 2019; Nejadgholi and Kiritchenko, 2020). Correspondingly, models trained on such datasets have been shown to be overly sensitive to lexical features such as group identifiers (Park et al., 2018; Dixon et al., 2018; Kennedy et al., 2020), and to generalise poorly to other datasets (Nejadgholi and Kiritchenko, 2020; Samory et al., 2020). Therefore, held-out performance on current hate speech datasets is an incomplete and potentially misleading measure of model quality.
To enable more targeted diagnostic insights, we introduce HATECHECK, a suite of functional tests for hate speech detection models. Functional testing, also known as black-box testing, is a testing framework from software engineering that assesses different functionalities of a given model by validating its output on sets of targeted test cases (Beizer, 1995). Ribeiro et al. (2020) show how such a framework can be used for structured model evaluation across diverse NLP tasks. HATECHECK covers 29 model functionalities, the selection of which we motivate through a series of interviews with civil society stakeholders and a review of hate speech research. Each functionality is tested by a separate functional test. We create 18 functional tests corresponding to distinct expressions of hate. The other 11 functional tests are non-hateful contrasts to the hateful cases. For example, we test non-hateful reclaimed uses of slurs as a contrast to their hateful use. Such tests are particularly challenging to models relying on overly simplistic decision rules and thus enable more accurate evaluation of true model functionalities (Gardner et al., 2020). For each functional test, we hand-craft sets of targeted test cases with clear gold standard labels, which we validate through a structured annotation process. 1 HATECHECK is broadly applicable across English-language hate speech detection models. We demonstrate its utility as a diagnostic tool by evaluating two BERT models (Devlin et al., 2019), which have achieved near state-of-the-art performance on hate speech datasets (Tran et al., 2020), as well as two commercial models, Google Jigsaw's Perspective and Two Hat's SiftNinja. 2 When tested with HATECHECK, all models appear overly sensitive to specific keywords such as slurs. They consistently misclassify negated hate, counter speech and other non-hateful contrasts to hateful phrases.
Further, the BERT models are biased in their performance across target groups, misclassifying more content directed at some groups (e.g. women) than at others. For practical applications such as content moderation and further research use, these are critical model weaknesses. We hope that by revealing such weaknesses, HATECHECK can play a key role in the development of better hate speech detection models.

Definition of Hate Speech
We draw on previous definitions of hate speech (e.g. Warner and Hirschberg, 2012) as well as recent typologies of abusive content to define hate speech as abuse that is targeted at a protected group or at its members for being a part of that group. We define protected groups based on age, disability, gender identity, familial status, pregnancy, race, national or ethnic origins, religion, sex or sexual orientation, which broadly reflects international legal consensus (particularly the UK's 2010 Equality Act, the US 1964 Civil Rights Act and the EU's Charter of Fundamental Rights). Based on these definitions, we approach hate speech detection as the binary classification of content as either hateful or non-hateful. Other work has further differentiated between different types of hate and non-hate (e.g. Founta et al., 2018; Salminen et al., 2018), but such taxonomies can be collapsed into a binary distinction and are thus compatible with HATECHECK.

1 All HATECHECK test cases and annotations are available at https://github.com/paul-rottger/hatecheck-data. 2 www.perspectiveapi.com and www.siftninja.com
Content Warning This article contains examples of hateful and abusive language. All examples are taken from HATECHECK to illustrate its composition. Examples are quoted verbatim, except for hateful slurs and profanity, for which the first vowel is replaced with an asterisk.

Defining Model Functionalities
In software engineering, a program has a certain functionality if it meets a specified input/output behaviour (ISO/IEC/IEEE 24765:2017, E). Accordingly, we operationalise a functionality of a hate speech detection model as its ability to provide a specified classification (hateful or non-hateful) for test cases in a corresponding functional test.
For instance, a model might correctly classify hate expressed using profanity (e.g. "F*ck all black people") but misclassify non-hateful uses of profanity (e.g. "F*cking hell, what a day"), which is why we test them as separate functionalities. Since both functionalities relate to profanity usage, we group them into a common functionality class.
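This operationalisation maps directly onto a simple testing interface. The following is a minimal sketch of our reading of it, not HATECHECK's actual tooling; `run_functional_test` and the toy keyword-based model are illustrative names of our own.

```python
# A functional test checks a specified input/output behaviour: every case in
# the test shares one gold standard label, and we measure how often the model
# returns that label.

def run_functional_test(model, cases, gold_label):
    """Return the share of test cases for which the model gives the gold label."""
    correct = sum(1 for case in cases if model(case) == gold_label)
    return correct / len(cases)

# A deliberately simplistic keyword model, purely to illustrate the interface.
def toy_model(text):
    return "hateful" if "hate" in text.lower() else "non-hateful"

acc = run_functional_test(
    toy_model,
    ["I hate immigrants.", "I despise women."],
    "hateful",
)
# The toy model passes the first case but misses the second, illustrating how
# keyword-based decision rules fail on hate expressed without the keyword.
```
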

Selecting Functionalities for Testing
To generate an initial list of 59 functionalities, we reviewed previous hate speech detection research and interviewed civil society stakeholders.

Review of Previous Research
We identified different types of hate in taxonomies of abusive content. We also identified likely model weaknesses based on error analyses as well as review articles and commentaries (e.g. Schmidt and Wiegand, 2017; Fortuna and Nunes, 2018; Vidgen et al., 2019). For example, hate speech detection models have been shown to struggle with correctly classifying negated phrases such as "I don't hate trans people" (Dinan et al., 2019). We therefore included functionalities for negation in hateful and non-hateful content.
Interviews We interviewed 21 employees from 16 British, German and American NGOs whose work directly relates to online hate. Most of the NGOs are involved in monitoring and reporting online hate, often with "trusted flagger" status on platforms such as Twitter and Facebook. Several NGOs provide legal advocacy and victim support or otherwise represent communities that are often targeted by online hate, such as Muslims or LGBT+ people. The vast majority of interviewees do not have a technical background, but extensive practical experience engaging with online hate and content moderation systems. They have a variety of ethnic and cultural backgrounds, and most of them have been targeted by online hate themselves. The interviews were semi-structured. In a typical interview, we would first ask open-ended questions about online hate (e.g. "What do you think are the biggest challenges in tackling online hate?") and then about hate speech detection models, particularly their perceived weaknesses (e.g. "What sort of content have you seen moderation systems get wrong?") and potential improvements, unbounded by technical feasibility (e.g. "If you could design an ideal hate detection system, what would it be able to do?"). Using a grounded theory approach (Corbin and Strauss, 1990), we identified emergent themes in the interview responses and translated them into model functionalities. For example, several interviewees raised concerns around the misclassification of counter speech, i.e. direct responses to hateful content (e.g. I4: "people will be quoting someone, calling that person out [...] but that will get picked up by the system"). We therefore included functionalities for counter speech that quotes or references hate.
Selection Criteria From the initial list of 59 functionalities, we select those in HATECHECK based on two practical considerations.
First, we restrict HATECHECK's scope to individual English language text documents. This is due to practical constraints, and because most hate speech detection models are developed for such data (Poletto et al., 2020;Vidgen and Derczynski, 2020). Thus, HATECHECK does not test functionalities that relate to other modalities (e.g. images) or languages, or that require context (e.g. conversational or social) beyond individual documents.
Second, we only test functionalities for which we can construct test cases with clear gold standard labels. Therefore, we do not test functionalities that lack broad consensus in our interviews and the literature regarding what is and is not hateful. The use of humour, for instance, has been highlighted as an important challenge for hate speech research (e.g. Qian et al., 2018). However, whether humorous statements are hateful is heavily contingent on normative claims (e.g. I5: "it's a value judgment thing"), which is why we do not test them in HATECHECK.

Functional Tests in HATECHECK
HATECHECK comprises 29 functional tests grouped into 11 classes. Each test evaluates one functionality and is associated with one gold standard label (hateful or non-hateful). Each functional test has a set of corresponding test cases.
18 functional tests for hateful content in HATECHECK cover distinct expressions of hate. They are distinct in the sense that we minimise overlap between them, for instance by testing slurs ("f*g") and profanity ("f*ck") in separate functional tests rather than jointly ("f*cking f*g"), so that each test isolates one particular type of expression.
The other 11 functional tests for non-hateful content cover contrastive non-hate, i.e. content which shares linguistic features with hateful expressions. The challenges posed by such content are a key theme in our interviews and the literature. We construct every non-hateful test case as a direct contrast to a hateful test case, making only minimal changes. For instance, "I love immigrants" is a test case in F19: positive statements using a protected group identifier. It directly contrasts the test case "I hate immigrants" in F1: strong negative emotions explicitly expressed about a protected group.
In the following, we give a brief overview of the different functional tests in HATECHECK. Table 1 provides corresponding example test cases. Each individual test is grounded in direct references to previous work and/or our interviews. These references are detailed in Appendix B.

Contrastive Non-Hate
HATECHECK tests non-hateful contrasts for slurs, particularly slur homonyms and reclaimed slurs (F8/9), as well as for profanity (F11). It tests non-hateful contrasts that use negation, i.e. negated hate (F15). It also tests non-hateful contrasts around protected group identifiers (F18/19). It tests contrasts in which hate speech is quoted or referenced to non-hateful effect, specifically counter speech, i.e. direct responses to hate speech which seek to act against it (F20/21). Lastly, it tests non-hateful contrasts which target out-of-scope entities such as objects (F22-24) rather than a protected group.

Generating Test Cases
For each functionality in HATECHECK, we hand-craft sets of test cases: short English-language text documents that clearly correspond to just one gold standard label. Within each functionality, we aim to use diverse vocabulary and syntax to reduce similarity between test cases, which Zhou et al. (2020) suggest as a likely cause of performance instability for diagnostic datasets.
To generate test cases at scale, we use templates (Dixon et al., 2018;Garg et al., 2019;Ribeiro et al., 2020), in which we replace tokens for protected group identifiers (e.g. "I hate [IDENTITY].") and slurs (e.g. "You are just a [SLUR] to me."). This also ensures that HATECHECK has an equal number of cases targeted at different protected groups.
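Template expansion of this kind can be sketched in a few lines. This is an illustrative reconstruction using the seven group identifiers HATECHECK covers, not the authors' actual generation code; the `expand` helper is our own name.

```python
# Each template contains a placeholder token (e.g. [IDENTITY]) that is filled
# with every protected group identifier in turn, yielding one test case per
# group and thus balanced target coverage across templates.

IDENTITIES = ["women", "trans people", "gay people", "black people",
              "disabled people", "Muslims", "immigrants"]

def expand(template, slot="[IDENTITY]", fillers=IDENTITIES):
    """Replace the placeholder token with each filler to yield one case per group."""
    return [template.replace(slot, f) for f in fillers]

cases = expand("I hate [IDENTITY].")
# One template yields seven cases, one per protected group.
```
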
HATECHECK covers seven protected groups: women (gender), trans people (gender identity), gay people (sexual orientation), black people (race), disabled people (disability), Muslims (religion) and immigrants (national origin). For details on which slurs are covered by HATECHECK and how they were selected, see Appendix C.
In total, we generate 3,901 cases, 3,495 of which come from 460 templates. The other 406 cases do not use template tokens (e.g. "Sh*t, I forgot my keys") and are thus crafted individually. The average length of cases is 8.87 words (std. dev. = 3.33) or 48.26 characters (std. dev. = 16.88). 2,659 of the 3,901 cases (68.2%) are hateful and 1,242 (31.8%) are non-hateful.
Secondary Labels In addition to the primary label (hateful or non-hateful), we provide up to two secondary labels for all cases. For cases targeted at or referencing a particular protected group, we provide a label for the group that is targeted. For hateful cases, we also label whether they are targeted at a group in general or at individuals, which is a common distinction in taxonomies of abuse.

Validating Test Cases
To validate gold standard primary labels of test cases in HATECHECK, we recruited and trained ten annotators. In addition to the binary annotation task, we also gave annotators the option to flag cases as unrealistic (e.g. nonsensical) to further confirm data quality. Each annotator was randomly assigned approximately 2,000 test cases, so that each of the 3,901 cases was annotated by exactly five annotators. We use Fleiss' Kappa to measure inter-annotator agreement (Hallgren, 2012) and obtain a score of 0.93, which indicates "almost perfect" agreement (Landis and Koch, 1977).
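Fleiss' Kappa can be computed directly from the label counts per item. The following is a minimal stdlib-only sketch of the standard formula, assuming five labels per case as in our setup; it is not the authors' analysis code.

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of categorical labels
    from the same number of raters (here: five labels per test case)."""
    n = len(ratings[0])  # raters per item
    P_bar = 0.0          # mean per-item agreement
    totals = Counter()   # label counts pooled over all items
    for item in ratings:
        counts = Counter(item)
        totals.update(counts)
        # Agreement for one item: pairs of raters that agree, normalised.
        P_bar += (sum(c * c for c in counts.values()) - n) / (n * (n - 1))
    P_bar /= len(ratings)
    N = len(ratings) * n
    # Chance agreement from the pooled label distribution.
    P_e = sum((t / N) ** 2 for t in totals.values())
    return (P_bar - P_e) / (1 - P_e)
```

A score of 0.93, as obtained for HATECHECK, falls in the "almost perfect" band (above 0.81) of the Landis and Koch (1977) scale.
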
For 3,879 (99.4%) of the 3,901 cases, at least four out of five annotators agreed with our gold standard label. For 22 cases, agreement was less than four out of five. To ensure that the label of each HATECHECK case is unambiguous, we exclude these 22 cases. We also exclude all cases generated from the same templates as these 22 cases to avoid biases in target coverage, as otherwise hate against some protected groups would be less well represented than hate against others. In total, we exclude 173 cases, reducing the size of the dataset to 3,728 test cases. Only 23 cases were flagged as unrealistic by one annotator, and none were flagged by more than one annotator. Thus, we do not exclude any test cases for being unrealistic.
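The exclusion step removes both low-agreement cases and all sibling cases from their templates. A minimal sketch of that logic, assuming each case is a dict with an illustrative `template` field (`None` for individually crafted cases) and an `agree` count out of five; field names are ours:

```python
# Drop cases where fewer than four of five annotators agreed with the gold
# label, and drop every other case generated from the same template, so that
# target coverage stays balanced across protected groups.

def filter_cases(cases, min_agree=4):
    bad_templates = {c["template"] for c in cases
                     if c["agree"] < min_agree and c["template"] is not None}
    return [c for c in cases
            if c["agree"] >= min_agree and c["template"] not in bad_templates]

kept = filter_cases([
    {"id": 1, "template": "T1", "agree": 5},
    {"id": 2, "template": "T1", "agree": 3},   # low agreement: removes all of T1
    {"id": 3, "template": "T2", "agree": 5},
    {"id": 4, "template": None, "agree": 5},   # individually crafted case
])
# Only the T2 case and the individually crafted case survive.
```
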

Model Setup
As a suite of black-box tests, HATECHECK is broadly applicable across English-language hate speech detection models. Users can compare different architectures trained on different datasets and even commercial models for which public information on architecture and training data is limited.

Pre-trained Transformer Models We fine-tune an uncased BERT model separately on two widely-used hate speech datasets, from Davidson et al. (2017) and Founta et al. (2018). For both datasets, we collapse labels other than hateful into a single non-hateful label to match HATECHECK's binary format. This is aligned with the original multi-label setup of the two datasets: Davidson et al. (2017), for instance, explicitly characterise offensive content in their dataset as non-hateful. Hateful cases make up 5.8% and 5.0% of the two datasets, respectively. Details on both datasets and pre-processing steps can be found in Appendix D.

In the following, we denote BERT fine-tuned on the binary Davidson et al. (2017) data as B-D, and BERT fine-tuned on the binary Founta et al. (2018) data as B-F.

Commercial Models We test Google Jigsaw's Perspective (P) and Two Hat's SiftNinja (SN). Both are popular models for content moderation developed by major tech companies that can be accessed by registered users via an API.
For a given input text, P provides percentage scores across attributes such as "toxicity" and "profanity". We use "identity attack", which aims at identifying "negative or hateful comments targeting someone because of their identity" and thus aligns closely with our definition of hate speech (§1). We convert the percentage score to a binary label using a cutoff of 50%. We tested P in December 2020.
For SN, we use its 'hate speech' attribute ("attacks [on] a person or group on the basis of personal attributes or identities"), which distinguishes between 'mild', 'bad', 'severe' and 'no' hate. We mark all but 'no' hate as 'hateful' to obtain binary labels. We tested SN in January 2021.
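Both commercial outputs are binarised before evaluation. A minimal sketch of the two mappings as we understand them (function names are ours; API request handling is omitted, and the behaviour exactly at the 50% boundary is our assumption):

```python
# Perspective returns a probability-like score for its "identity attack"
# attribute; we binarise it with a 50% cutoff.
def perspective_to_binary(identity_attack_score, cutoff=0.5):
    return "hateful" if identity_attack_score >= cutoff else "non-hateful"

# SiftNinja rates its "hate speech" attribute as 'no', 'mild', 'bad' or
# 'severe'; everything but 'no' hate is marked hateful.
def siftninja_to_binary(hate_level):
    return "non-hateful" if hate_level == "no" else "hateful"
```
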

Results
We assess model performance on HATECHECK using accuracy, i.e. the proportion of correctly classified test cases. When reporting accuracy in tables, we bold the best performance across models and highlight performance below a random-choice baseline, i.e. 50% for our binary task, in italic red.
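The per-test evaluation loop is simple to state. A minimal sketch, assuming aligned lists of predictions and gold labels per functional test; `evaluate` and the data layout are illustrative, not HATECHECK's tooling:

```python
# Accuracy per functional test, flagging tests below the random-choice
# baseline of 50% for the binary task.

def accuracy(preds, golds):
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def evaluate(results, baseline=0.5):
    """results: {test_name: (predictions, gold_labels)}.
    Returns {test_name: (accuracy, below_baseline_flag)}."""
    report = {}
    for name, (preds, golds) in results.items():
        acc = accuracy(preds, golds)
        report[name] = (acc, acc < baseline)
    return report
```
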
Performance Across Labels All models show clear performance deficits when tested on hateful and non-hateful cases in HATECHECK (Table 2). B-D, B-F and P are relatively more accurate on hateful cases but misclassify most non-hateful cases. Overall, P performs best. SN performs worst: it is strongly biased towards classifying all cases as non-hateful, which makes it highly accurate on non-hateful cases but causes it to misclassify most hateful cases.

Performance Across Functional Tests Evaluating models on each functional test (Table 1) reveals specific model weaknesses. B-D and B-F are less than 50% accurate on 8 and 4, respectively, of the 11 functional tests for non-hate in HATECHECK. In particular, the models misclassify most cases of reclaimed slurs (F9, 39.5% and 33.3% correct), negated hate (F15, 12.8% and 12.0% correct) and counter speech (F20/21, 26.6%/29.1% and 32.9%/29.8% correct). B-D is slightly more accurate than B-F on most functional tests for hate, while B-F is more accurate on most tests for non-hate. Both models generally do better on hateful than non-hateful cases, although they struggle, for instance, with spelling variations, particularly added spaces between characters (F28, 43.9% and 37.6% correct) and leet speak spellings (F29, 48.0% and 43.9% correct).

P performs better than B-D and B-F on most functional tests. It is over 95% accurate on 11 out of 18 functional tests for hate and substantially more accurate than B-D and B-F on spelling variations (F25-29). However, it performs even worse than B-D and B-F on the non-hateful functional tests for reclaimed slurs (F9, 28.4% correct), negated hate (F15, 3.8% correct) and counter speech (F20/21, 15.6%/18.4% correct).

Performance on Individual Functional Tests
Individual functional tests can be investigated further to show more granular model weaknesses. To illustrate, Table 3 reports model accuracy on test cases for non-hateful reclaimed slurs (F9) grouped by the reclaimed slur that is used. Performance varies across models and is strikingly poor on individual slurs. B-D misclassifies all instances of "f*g", "f*ggot" and "q*eer". B-F and P perform better on "q*eer" but fail on "n*gga". SN fails on all cases except reclaimed uses of "b*tch".
Performance Across Target Groups HATECHECK can test whether models exhibit 'unintended biases' (Dixon et al., 2018) by comparing their performance on cases which target different groups. To illustrate, Table 4 shows model accuracy on all test cases created from [IDENTITY] templates, which only differ in the group identifier.
B-D misclassifies test cases targeting women twice as often as those targeted at other groups. B-F also performs relatively worse for women and fails on most test cases targeting disabled people. By contrast, P is consistently around 80% and SN around 25% accurate across target groups.
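Because every [IDENTITY] template produces one case per group, a target-group breakdown is a straightforward group-by. A minimal sketch, assuming each case carries its secondary target label plus a prediction and gold label; the field names are illustrative:

```python
from collections import defaultdict

# Accuracy per target group, computed over cases that only differ in the
# group identifier, so differences reflect bias rather than content.
def accuracy_by_target(cases):
    hits = defaultdict(int)
    totals = defaultdict(int)
    for c in cases:
        totals[c["target"]] += 1
        hits[c["target"]] += (c["pred"] == c["gold"])
    return {t: hits[t] / totals[t] for t in totals}

breakdown = accuracy_by_target([
    {"target": "women", "pred": "hateful", "gold": "hateful"},
    {"target": "women", "pred": "non-hateful", "gold": "hateful"},
    {"target": "Muslims", "pred": "hateful", "gold": "hateful"},
])
```
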

Discussion
HATECHECK reveals functional weaknesses in all four models that we test. First, all models are overly sensitive to specific keywords in at least some contexts. B-D, B-F and P perform well for both hateful and non-hateful cases of profanity (F10/11), which shows that they can distinguish between different uses of certain profanity terms. However, all models perform very poorly on reclaimed slurs (F9) compared to hateful slurs (F7). Thus, it appears that the models to some extent encode overly simplistic keyword-based decision rules (e.g. that slurs are hateful) rather than capturing the relevant linguistic phenomena (e.g. that slurs can have non-hateful reclaimed uses).
Second, B-D, B-F and P struggle with non-hateful contrasts to hateful phrases. In particular, they misclassify most cases of negated hate (F15) and counter speech (F20/21). Thus, they appear not to sufficiently register linguistic signals that reframe hateful phrases into clearly non-hateful ones (e.g. "No Muslim deserves to die").
Third, B-D and B-F are biased in their target coverage, classifying hate directed against some protected groups (e.g. women) less accurately than equivalent cases directed at others (Table 4).
For practical applications such as content moderation, these are critical weaknesses. Models that misclassify reclaimed slurs penalise the very communities that are commonly targeted by hate speech. Models that misclassify counter speech undermine positive efforts to fight hate speech. Models that are biased in their target coverage are likely to create and entrench biases in the protections afforded to different groups.
As a suite of black-box tests, HATECHECK only offers indirect insights into the source of these weaknesses. Poor performance on functional tests can be a consequence of systematic gaps and biases in model training data. It can also indicate a more fundamental inability of the model's architecture to capture relevant linguistic phenomena. B-D and B-F share the same architecture but differ in performance on functional tests and in target coverage. This reflects the importance of training data composition, which previous hate speech research has emphasised (Wiegand et al., 2019;Nejadgholi and Kiritchenko, 2020). Future work could investigate the provenance of model weaknesses in more detail, for instance by using test cases from HATECHECK to "inoculate" training data (Liu et al., 2019).
If poor model performance does stem from biased training data, models could be improved through targeted data augmentation (Gardner et al., 2020). HATECHECK users could, for instance, sample or construct additional training cases to resemble test cases from functional tests that their model was inaccurate on, bearing in mind that this additional data might introduce other unforeseen biases. The models we tested would likely benefit from training on additional cases of negated hate, reclaimed slurs and counter speech.
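The augmentation strategy described here can be sketched as a filter over a candidate pool. This is our illustrative reading, assuming a per-test accuracy report and candidate training cases tagged with the functionality they resemble; all names are hypothetical:

```python
# Targeted data augmentation: add candidate training cases that resemble the
# functional tests a model performed poorly on. Note the caveat from the text:
# such additions may themselves introduce unforeseen biases.

def augment(train_set, candidate_pool, report, threshold=0.5):
    weak = {name for name, acc in report.items() if acc < threshold}
    extra = [c for c in candidate_pool if c["functionality"] in weak]
    return train_set + extra

augmented = augment(
    train_set=[],
    candidate_pool=[{"functionality": "F15", "text": "I would never hurt an immigrant."},
                    {"functionality": "F1", "text": "I hate [IDENTITY]."}],
    report={"F15": 0.12, "F1": 0.91},  # F15 (negated hate) is a weak spot
)
```
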

Negative Predictive Power
Good performance on a functional test in HATECHECK only reveals the absence of a particular weakness, rather than necessarily characterising a generalisable model strength. This negative predictive power (Gardner et al., 2020) is common, to some extent, to all finite test sets. Thus, claims about model quality should not be overextended based on positive HATECHECK results. In model development, HATECHECK offers targeted diagnostic insights as a complement to, rather than a substitute for, evaluation on held-out test sets of real-world hate speech.

Out-Of-Scope Functionalities
Each test case in HATECHECK is a separate English-language text document. Thus, HATECHECK does not test functionalities related to context outside individual documents, modalities other than text, or languages other than English. Future research could expand HATECHECK to include functional tests covering such aspects.
Functional tests in HATECHECK cover distinct expressions of hate and non-hate. Future work could test more complex compound statements, such as cases combining slurs and profanity.
Further, HATECHECK is static and thus does not test functionalities related to language change. This could be addressed by "live" datasets, such as dynamic adversarial benchmarks (Nie et al., 2020;Vidgen et al., 2020b;Kiela et al., 2021).

Limited Coverage
Future research could expand HATECHECK to cover additional protected groups. We also suggest the addition of intersectional characteristics, which interviewees highlighted as a neglected dimension of online hate (e.g. I17: "As a black woman, I receive abuse that is racialised and gendered").
Similarly, future research could include hateful slurs beyond those covered by HATECHECK.
Lastly, future research could craft test cases using more platform- or community-specific language than HATECHECK's more general test cases. It could also test hate that is more specific to particular target groups, such as misogynistic tropes.

Related Work

Previous diagnostic datasets for hate speech detection select test cases from other datasets sampled from social media, which introduces substantial disagreement between annotators on the labels in their data. Dixon et al. (2018) use templates to generate synthetic sets of toxic and non-toxic cases, which resembles our method for test case creation. However, they focus primarily on evaluating biases around the use of group identifiers and do not validate the labels in their dataset. Compared to both approaches, HATECHECK covers a much larger range of model functionalities, and all of its test cases, which we generated specifically to fit a given functionality, have clear gold standard labels validated by near-perfect agreement between annotators.

In its use of contrastive cases for model evaluation, HATECHECK builds on work such as Gardner et al. (2020), which proposes augmenting NLP datasets with contrastive cases to train more generalisable models and enable more meaningful evaluation. We build on this approach to generate non-hateful contrast cases in our test suite, which is the first application of this kind to hate speech detection. In terms of its structure, HATECHECK is most directly influenced by the CHECKLIST framework proposed by Ribeiro et al. (2020). However, while they focus on demonstrating its general applicability across NLP tasks, we put more emphasis on motivating the selection of functional tests and on constructing and validating targeted test cases specifically for the task of hate speech detection.

Conclusion
In this article, we introduced HATECHECK, a suite of functional tests for hate speech detection models. We motivated the selection of functional tests through interviews with civil society stakeholders and a review of previous hate speech research, which grounds our approach in both practical and academic applications of hate speech detection models. We designed the functional tests to offer contrasts between hateful and non-hateful content that are challenging to detection models, which enables more accurate evaluation of their true functionalities. For each functional test, we crafted sets of targeted test cases with clear gold standard labels, which we validated through a structured annotation process.
We demonstrated the utility of HATECHECK as a diagnostic tool by testing near-state-of-the-art transformer models as well as two commercial models for hate speech detection. HATECHECK showed critical weaknesses for all models. Specifically, models appeared overly sensitive to particular keywords and phrases, as evidenced by poor performance on tests for reclaimed slurs, counter speech and negated hate. The transformer models also exhibited strong biases in target coverage.
Online hate is a deeply harmful phenomenon, and detection models are integral to tackling it. Typically, models have been evaluated on held-out test data, which has made it difficult to assess their generalisability and identify specific weaknesses.
We hope that HATECHECK's targeted diagnostic insights help address this issue by contributing to our understanding of models' limitations, thus aiding the development of better models in the future.

Acknowledgments
We thank all interviewees for their participation. We also thank reviewers for their constructive feedback. Paul Röttger was funded by the German Academic Scholarship Foundation.

Impact Statement
This supplementary section addresses relevant ethical considerations that were not explicitly discussed in the main body of our article.
Interview Participant Rights All interviewees gave explicit consent for their participation after being informed in detail about the research use of their responses. In all research output, quotes from interview responses were anonymised. We also did not reveal specific participant demographics or affiliations. Our interview approach was approved by the Alan Turing Institute's Ethics Review Board.
Intellectual Property Rights The test cases in HATECHECK were crafted by the authors. As synthetic data, they pose no risk of violating intellectual property rights.

Annotator Compensation
We employed a team of ten annotators to validate the quality of the HATECHECK dataset. Annotators were compensated at a rate of £16 per hour. The rate was set approximately 50% above the local living wage (£10.85 per hour), even though all work was completed remotely. All training time and meetings were paid.
Intended Use HATECHECK's intended use is as an evaluative tool for hate speech detection models, providing structured and targeted diagnostic insights into model functionalities. We demonstrated this use of HATECHECK in §3. We also briefly discussed alternative uses of HATECHECK, e.g. as a starting point for data augmentation. These uses aim at aiding the development of better hate speech detection models.
Potential Misuse Researchers might overextend claims about the functionalities of their models based on their test performance, which we would consider a misuse of HATECHECK. We directly addressed this concern by highlighting HATECHECK's negative predictive power, i.e. the fact that it primarily reveals model weaknesses rather than necessarily characterising generalisable model strengths, as one of its limitations. For the same reason, we emphasised the limits to HATECHECK's coverage, e.g. in terms of slurs and identity terms.

A Data Statement
Following Bender and Friedman (2018), we provide a data statement, which documents the generation and provenance of test cases in HATECHECK.
A. CURATION RATIONALE In order to construct HATECHECK, a first suite of functional tests for hate speech detection models, we generated 3,901 short English-language text documents by hand and by using simple templates for group identifiers and slurs (§2.4). Each document corresponds to one functional test and a binary gold standard label (hateful or non-hateful). To validate the gold standard labels, we trained a team of ten annotators, assigned five of them to each document, and asked them to provide independent labels (§2.5). To further improve data quality, we also gave annotators the option to flag cases they felt were unrealistic (e.g. nonsensical); no case was flagged by more than one annotator.
B. LANGUAGE VARIETY HATECHECK only covers English-language text documents. We opted for English since this maximises HATECHECK's relevance to previous and current work in hate speech detection, which is mostly concerned with English-language data. Our language choice also reflects the expertise of authors and annotators. We discuss the lack of language variety as a limitation of HATECHECK in §4.2 and suggest expansion to other languages as a priority for future research.
C. SPEAKER DEMOGRAPHICS Since all test cases in HATECHECK were hand-crafted, the speakers are the same as the authors. Test cases in the test suite were primarily generated by the lead author, who is a researcher at a UK university. The lead author is not a native English speaker but has lived in English-speaking countries for more than five years and has extensively engaged with English-language hate speech in previous research. All test cases were also reviewed by two co-authors, both of whom have worked with English-language hate speech data for more than five years and one of whom is a native English speaker from the UK.
D. ANNOTATOR DEMOGRAPHICS We recruited a team of ten annotators to work for two weeks. 30% were male and 70% were female. 60% were 18-29 and 40% were 30-39. 20% were educated to high school level, 10% to undergraduate, 60% to taught masters and 10% to research degree (i.e. PhD). 70% were native English speakers and 30% were non-native but fluent. Annotators had a range of nationalities: 60% were British and 10% each were Polish, Spanish, Argentinian and Irish. Most annotators identified as ethnically White (70%), followed by Middle Eastern (20%) and a mixed ethnic background (10%). Annotators all used social media regularly, and 60% used it more than once per day. All annotators had seen other people targeted by online abuse before, and 80% had been targeted personally. All annotators had previously completed annotation work on at least one other hate speech dataset.

In the first week, we introduced the binary annotation task to them in an onboarding session and tested their understanding on a set of 100 cases, which we then provided individual feedback on. In the second week, we asked each annotator to annotate around 2,000 test cases so that each case in our test suite was annotated by varied sets of exactly five annotators. Throughout the process, we communicated with annotators in real-time over a messaging platform. We also followed guidance for protecting and monitoring annotator well-being provided by Vidgen et al. (2019).
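One way to realise the assignment described above, where each case receives exactly five of the ten annotators while the five-person sets vary across cases, is a round-robin over all five-person subsets. The scheme below is an assumption for illustration; the authors do not specify their assignment procedure.

```python
# Sketch of one possible assignment scheme: cycle through all
# 5-of-10 annotator subsets so that the annotator sets vary
# across cases. This round-robin design is an assumption; the
# paper does not describe the actual assignment procedure.
from itertools import combinations, cycle


def assign_annotators(n_cases, annotators, per_case=5):
    """Assign each case to the next 5-person subset in a round-robin."""
    subsets = cycle(combinations(annotators, per_case))
    return [next(subsets) for _ in range(n_cases)]


annotators = [f"A{i}" for i in range(10)]
assignment = assign_annotators(3901, annotators)
# Every case gets exactly 5 annotators; with 3,901 cases each
# annotator handles roughly 3,901 * 5 / 10, i.e. about 2,000 cases.
```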
E. SPEECH SITUATION All test cases were created between the 23rd of November and the 13th of December 2020.
F. TEXT CHARACTERISTICS The composition of the dataset, including primary label and secondary labels, is described in detail in §2.3 and §2.4 of the article.

B References for Functional Tests

C Hateful Slurs in HATECHECK
For each of the seven protected groups covered by HATECHECK, we searched hatebase.org, a crowdsourced hate speech lexicon, for slurs which target that group. From these slurs, we selected the three that were most often logged by users of the site (e.g. "wh*re", "b*tch" and "sl*t" for women), except for when the third-most logged slur was logged substantially less often than the second, in which case we selected the top two (e.g. "tr*nny" and "sh*male" for trans people). For immigration status, which is not a target category on hatebase.org, we chose "r*pefugee", a slur for refugees used by the European far right, and "w*tback", a slur for Mexican immigrants to the US, which was logged about as often as the other slurs in HATECHECK.
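The selection rule described above can be sketched as follows. The "substantially less often" threshold (here: less than half as often as the second-most logged slur) is an illustrative assumption; the authors do not give a numeric cut-off.

```python
# Sketch of the slur selection rule: take the three most-logged
# slurs per group, but fall back to two when the third trails the
# second by too much. The drop-off ratio of 0.5 is an assumed
# threshold, not a figure reported by the authors.

def select_slurs(logged_counts, drop_off_ratio=0.5):
    """logged_counts: list of (slur, times_logged) for one group."""
    ranked = sorted(logged_counts, key=lambda x: x[1], reverse=True)
    top = ranked[:3]
    if len(top) == 3 and top[2][1] < drop_off_ratio * top[1][1]:
        top = top[:2]  # third slur is logged far less often
    return [slur for slur, _ in top]


# Hypothetical counts: the third entry trails far behind the second.
select_slurs([("slur_a", 900), ("slur_b", 850), ("slur_c", 100)])
# -> ["slur_a", "slur_b"]
```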

Protected group     Slurs
Women               b*tch, sl*t, wh*re
Trans people        tr*nny, sh*male
Gay people          f*ggot, f*g, q*eer
Black people        n*gger, c*on
Disabled people     r*tard, cr*pple, m*ng
Muslims             m*zzie, J*hadi, camel f*cker
Immigrants          w*tback, r*pefugee

For reclaimed slurs (F9), we focus on slurs reclaimed by black communities (particularly "n*gga"), gay communities ("f*g", "f*ggot", "q*eer") and by women ("b*tch"), reflecting the concerns highlighted by our interview participants (e.g. I4: "n*gga would often get [wrongly] picked up by [moderation] systems"). Ahead of the structured annotation process (§2.5) and only for test cases with reclaimed slurs, we asked self-identifying members of the relevant groups in our personal networks whether they would consider the test cases to contain valid and realistic reclaimed slur uses, which held true for all test cases.

D.1 Davidson et al. (2017) Data

Sampling Davidson et al. (2017) searched Twitter for tweets containing keywords from a list they compiled from hatebase.org, which yielded a sample of tweets from 33,458 users. They then randomly sampled 25,000 tweets from all tweets of these users.
Annotation The authors hired crowd workers from CrowdFlower to annotate each tweet as hateful, offensive or neither. 92.0% of tweets were annotated by three crowd workers, the remainder by at least four and up to nine. For inter-annotator agreement, the authors report a "CrowdFlower score" of 92%.
Definition of Hate Speech "Language that is used to express hatred towards a targeted group or is intended to be derogatory, to humiliate, or to insult the members of the group".

D.2 Founta et al. (2018) Data
Sampling Founta et al. (2018) initially collected a random set of 32 million tweets from Twitter. They then used a boosted random sampling procedure based on negative sentiment and occurrence of offensive words as selected from hatebase.org to augment a random subset of this initial sample with tweets they expected to be more likely to be hateful or abusive.
Annotation The authors hired crowd workers from CrowdFlower to annotate each tweet as hateful, abusive, spam or normal. All tweets were annotated by five crowd workers. For inter-annotator agreement, the authors report that 55.9% of tweets had four out of five annotators agreeing on a label.
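The agreement statistic reported above, i.e. the share of tweets on which at least four of five annotators chose the same label, can be computed as follows. The label values in the usage example are illustrative.

```python
# Sketch of the agreement statistic reported by Founta et al. (2018):
# the fraction of items on which at least `threshold` of the
# annotators agree on a single label.
from collections import Counter


def majority_agreement_rate(label_sets, threshold=4):
    """label_sets: one list of annotator labels per item."""
    agreeing = sum(
        1 for labels in label_sets
        if Counter(labels).most_common(1)[0][1] >= threshold
    )
    return agreeing / len(label_sets)


majority_agreement_rate([
    ["hateful"] * 5,                                       # 5/5 agree
    ["hateful"] * 4 + ["normal"],                          # 4/5 agree
    ["hateful", "abusive", "spam", "normal", "hateful"],   # only 2/5
])
# -> 2/3 (two of three items reach the 4-of-5 threshold)
```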
Definition of Hate Speech "Language used to express hatred towards a targeted individual or