Trigger Warnings: Bootstrapping a Violence Detector for FanFiction

We present the first dataset and evaluation results on a newly defined computational task of trigger warning assignment. Labeled corpus data has been compiled from narrative works hosted on Archive of Our Own (AO3), a well-known fanfiction site. In this paper, we focus on the most frequently assigned trigger type, violence, and define a document-level binary classification task of whether or not to assign a violence trigger warning to a fanfiction, exploiting warning labels provided by AO3 authors. SVM and BERT models trained in four evaluation setups on the corpora we compiled yield F1 results ranging from 0.585 to 0.798, proving violence trigger warning assignment to be a feasible, yet non-trivial task.


Introduction
"[The witch] crept up and thrust her head into the oven. Then Grethel gave her a push that drove her far into it, and shut the iron door, and fastened the bolt. Oh! then she began to howl quite horribly, but Grethel ran away, and the godless witch was miserably burnt to death." Hansel and Gretel, a fairy tale (translated by Margaret Hunt)

Violence and cruelty are commonplace in literature. Folk tales, especially fairy tales, but also children's and youth literature are full of dark, horrific images, such as burning a human being alive in an oven, as in the fairy tale by the Brothers Grimm quoted above. And even if most people will not be deeply shaken by such content, some readers may mentally relive past traumas evoked by the imagery. To proactively alert readers that a text they are about to read contains potentially disturbing material, so-called "trigger warnings" have been proposed.
Trigger warnings (also referred to as content warnings, notifications, or alerts) emerged in online communities (e.g., on Tumblr and LiveJournal) in the early 2000s (Knox, 2017). They are usually presented as short phrases or keywords preceding a text and warn of potentially disturbing content. While there are no universally accepted trigger warnings (anything can be a trigger), many universities have meanwhile published guidelines (see, e.g., the lists published by the Universities of Reading and Michigan (UR list; UM list)). They include largely overlapping lists of triggers referring to health (eating disorders, mental illness), sexuality (sexual assault, pornography), verbal violence (hate speech, racial slurs), and physical violence (animal cruelty, blood, suicide), among others.
Surprisingly, assigning trigger warnings is still considered a manual task, and, to our knowledge, there is no work in computer science in general, or in natural language processing in particular, that addresses content warnings. We lay the foundation to close this gap by introducing the new NLP task of trigger warning assignment, formulated as follows: given a text and a trigger label, assign a warning to the text if it contains a corresponding trigger. When multiple trigger labels are predefined, this task can be extended from a binary classification problem to a multi-class or multi-label problem and solved by, for example, a set of binary classifiers, one for each trigger. However, the first step is to investigate the feasibility of automatic trigger warning assignment, and for this purpose we create the first trigger warning corpus from narratives with and without triggers, using the trigger warnings supplied by the works' authors.
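The multi-label extension sketched above amounts to running one binary classifier per predefined trigger and collecting all warnings that fire. A minimal, hypothetical sketch (the keyword-based classifiers below are toy stand-ins for real trained models, not anything evaluated in this paper):

```python
# Hypothetical sketch of the multi-label extension: one binary
# classifier per predefined trigger. The keyword matchers are toy
# stand-ins for trained models, used only to show the decomposition.
TRIGGER_CLASSIFIERS = {
    "violence": lambda text: int(any(w in text.lower()
                                     for w in ("blood", "fight", "gore"))),
    "death":    lambda text: int("death" in text.lower()),
}

def assign_warnings(text):
    # A text receives every warning whose binary classifier fires.
    return [trigger for trigger, clf in TRIGGER_CLASSIFIERS.items()
            if clf(text) == 1]
```

The decomposition keeps each trigger's decision independent, so new trigger types can be added without retraining the others.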
Our contributions are the following: we (1) introduce the new task of automatic trigger warning assignment, (2) introduce the first corpus compiled from a public archive of fan fiction marked with a trigger warning for violence (Section 3), and (3) evaluate models for assigning trigger warnings and analyze their effectiveness (Section 4).

Related Work
Constructs related to "trigger warnings" have been investigated using computational approaches under different terms and have spanned a broad range of phenomena. Recent research employs terms such as "objectionable content", "objectionable material", "harmful content", and "harmful text" (Banko et al., 2020; Solorio et al., 2021; Kirk et al., 2022) as broad terms covering diverse types of content that can potentially evoke negative emotions in the recipient of the material (be it verbal or visual), i.e., cause emotional harm at different degrees of severity. The type of content often subsumed under these terms includes violence, sexual content, misguided messages, misinformation, verbal aggression, malice, callousness, and social aggression, among others. And while there is also a clear link to sentiment analysis, the phenomena subsumed under "objectionable/harmful content" lie only on one end of the sentiment scale (that of negative sentiment), yet have a finer granularity (cf. the range of specific types of harm-evoking content mentioned above).

The notion of "triggering" is equally underspecified (open-ended), but even broader. While most of the objectionable types are indeed unobjectionably harmful, in that they can be linked to an intention to harm, there may exist concept associations that are triggering to some individuals which, objectively speaking, have little to no link to an intention to harm; consider, for instance, that a mention of a thunderstorm may be triggering to a victim of a severe lightning injury. Thus, triggering also covers concepts that would normally be understood to lie at the positive end of a sentiment scale but which can evoke negative associations in some individuals due to specific traumatic past experiences related to the concept. A "trigger warning" simply gives a nominal label to the signal that is considered triggering. While we are not aware of prior work on automatic trigger warning assignment, nor specifically on violence warning assignment, below we outline prior work in NLP and computer science that covers the most closely related topics.
Identifying Causes of Emotions While affect and emotion recognition in non-fiction text, and sentiment analysis more generally, has long been studied in NLP (Alswaidan and Menai, 2020), research into the interactions between emotions and their triggering cause events was introduced only about a decade ago (Lee et al., 2010). Cause events here refer to (verb) arguments or events in the text that are highly correlated with a certain emotion, positive or negative. The goal of the emotion cause extraction task is to identify the emotion's stimulus, and the computational methods range from rule-based lexico-syntactic approaches through traditional classifiers to, more recently, deep learning; see Khunteta and Singh (2021) for an overview of the emotion cause extraction area. By contrast, the trigger warning assignment task is about identifying potentially triggering content which may evoke strongly negative emotions in readers.
Identifying Verbal Violence Interest in broadly understood verbal violence, although not explicitly referred to as such, has a long history in the NLP community. Waseem et al. (2017) and Kogilavani et al. (2021) propose taxonomies of abusive and offensive language, respectively; Kogilavani et al. also survey techniques for offensive language detection. Fortuna and Nunes (2018) and Schmidt and Wiegand (2019) provide an overview of hate speech detection, and Mishra et al. (2019) more generally of abuse detection methods, with "abuse" defined as "any expression that is meant to denigrate or offend a particular person or group". While not considered from the point of view of triggering, this definition fits the category "Hateful language" listed in the institutional guidelines. While most work on verbal violence has been carried out in the context of social media (with methods ranging from feature engineering to neural networks), it would be useful to extend those systems to cover a broader range of verbal violence, e.g., literary dialogue, in the context of the trigger warning assignment task.

Identifying Health-related Triggering Content
Closest to our research, however focused on a different trigger type, is the work of De Choudhury (2015) investigating behavioral characteristics of the anorexia-affected population on Tumblr. An analysis of several thousand posts showed that the platform contains vast amounts of triggering content which may prompt and/or reinforce anorexia-oriented lifestyle choices. Two sub-groups of the anorexia community were identified, pro-anorexia and pro-recovery, with distinguishing affective, social, cognitive, and linguistic properties. Predictive models based on language features extracted from the posts were able to detect anorexia content at 80% accuracy. Like De Choudhury, we focus on a single trigger type, but in fiction texts and with warnings assigned by the authors.

[Table 1: Descriptive statistics of corpus and sample datasets. Shown are the number of works and median numbers of words, kudos, hits, and freeform tags (FF). The median is reported due to the long-tailed nature of the measures; the mean is ca. 2-4 times higher.]

The Violence Trigger Warnings Corpus
As data source, we used Archive of Our Own (AO3), a public online anthology of fan fiction, i.e., amateur writings inspired by existing works of fiction, e.g., novels, cartoons, manga. At the time of corpus creation, AO3 hosted about 8 million works. Aside from basic meta-data, such as title, author, language, statistics (number of words, chapters, etc.), reader reactions, ratings, fandoms (original source(s)/inspiration), and relationships (characters involved in romantic/platonic relationship(s)), and crucially for this research, works are labeled with Archive Warnings and Additional Tags.
Archive Warnings AO3 defines a set of six content warnings: (1) Graphic Depictions of Violence, (2) Major Character Death, (3) Rape/Non-Con, (4) Underage, (5) Creator Chose Not To Use Archive Warnings, to avoid spoilers, and (6) No Archive Warnings Apply, if the work has no triggering content. Authors must actively assign at least one to each of their works.

Dataset Sampling Because AO3 works do not include any annotations below document level, that is, we do not know the extent of violent content nor where in the text it can be found, our goal was to build a corpus with high-confidence examples of texts with and without violence. We apply three sampling strategies with varying reliability criteria: random sampling to represent the corpus, fame-based sampling to exclude low-effort works, and tag-based sampling to exclude works that are not thoroughly tagged, such that their Archive Warnings might be less reliable. Table 1 gives an overview of the corpus and the three sampled datasets.
All sampling strategies randomly select 10,000 violent works (tagged with Graphic Depictions of Violence) and 10,000 non-violent works (tagged with No Archive Warnings Apply but not with Graphic Depictions of Violence). Before selecting the examples, we discarded all works with fewer than 100 words and all non-English works. The random sample then draws the examples uniformly at random. The fame-based sample first discards all works with <1,000 hits and <100 kudos and then draws uniformly at random. The tag-based sample discards all works with <10 Additional Tags (including characters and relationships) and then draws uniformly at random.
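The filtering and sampling steps above can be summarized as follows; this is an illustrative sketch, and the metadata field names (words, language, hits, kudos, n_tags, violent) are our assumptions about the record format, not AO3's actual schema:

```python
# Illustrative sketch of the three sampling strategies. The metadata
# field names are assumptions about the record format, not AO3's schema.
import random

def eligible(work):
    # Applied before any sampling: at least 100 words, English only.
    return work["words"] >= 100 and work["language"] == "en"

def sample(works, strategy, n, seed=0):
    pool = [w for w in works if eligible(w)]
    if strategy == "fame":
        # Fame-based: discard low-effort works (few hits/kudos).
        pool = [w for w in pool if w["hits"] >= 1000 and w["kudos"] >= 100]
    elif strategy == "tags":
        # Tag-based: discard works with fewer than 10 Additional Tags.
        pool = [w for w in pool if w["n_tags"] >= 10]
    violent = [w for w in pool if w["violent"]]
    non_violent = [w for w in pool if not w["violent"]]
    rng = random.Random(seed)
    return (rng.sample(violent, min(n, len(violent))),
            rng.sample(non_violent, min(n, len(non_violent))))
```

Each strategy applies its reliability filter first and only then draws uniformly at random, so the class balance (10,000 vs. 10,000) is preserved across strategies while the pool composition changes.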
Table 1 shows the meta-data of the entire corpus and the three samples, extended by Table 4 in Appendix A. The random and tag-based samples are highly similar to the overall corpus; the fame-based sample diverges by having longer (especially violent) documents with more freeform tags.

Assigning Violence Trigger Warnings
We evaluate the four labeled datasets in a text classification setting by building classification models to assign trigger warnings at the document level.
Models We use three long-document classification baselines for our experiments: SVM, BERT, and Longformer. First, we use support vector machines (SVM) (Joachims, 1998), since they are often used for text classification, are easily interpretable, and are not limited by the input sequence length. Second, we use a BERT transformer (Devlin et al., 2019) as the go-to classification baseline; we used the pretrained bert-base-uncased checkpoint with 12 layers and 110M parameters, fine-tuned on our classification task. Third, we use a sparse-attention Longformer (Beltagy et al., 2020) as the state of the art in many long-document classification tasks (Park et al., 2022). We used the allenai/longformer-base-4096 pretrained checkpoint, fine-tuned on our classification task.
Text Preprocessing For the SVM, we remove HTML tags, URLs, emojis, numbers, punctuation, and special characters, and apply the Porter stemmer (Porter, 1980). For BERT and Longformer, we only remove HTML tags, URLs, numbers, and special characters, while punctuation is retained.
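A minimal sketch of such a two-track cleaning pipeline is shown below; the exact regular expressions are our illustrative assumptions, not the authors' implementation:

```python
# Illustrative preprocessing sketch. The regular expressions are
# assumptions about one plausible implementation, not the paper's code.
import re

def clean_for_neural(text):
    # For BERT/Longformer: strip HTML tags, URLs, numbers, and
    # special characters, but retain punctuation.
    text = re.sub(r"<[^>]+>", " ", text)             # HTML tags
    text = re.sub(r"https?://\S+", " ", text)        # URLs
    text = re.sub(r"\d+", " ", text)                 # numbers
    text = re.sub(r"[^\w\s.,!?;:'\"-]", " ", text)   # special characters
    return re.sub(r"\s+", " ", text).strip()

def clean_for_svm(text):
    # For the SVM: additionally drop punctuation and lowercase;
    # Porter stemming (e.g., via NLTK) would follow this step.
    text = re.sub(r"[^\w\s]", " ", clean_for_neural(text))
    return re.sub(r"\s+", " ", text).strip().lower()
```

Keeping punctuation for the transformer models matters because their subword tokenizers were pretrained on punctuated text, whereas the bag-of-words SVM gains nothing from it.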
For both neural models, the inputs are truncated at (and padded to) the maximum sequence length.

Results
For each sample and model, we train on the training set and evaluate on the test set; the results are reported in Table 2. It can be seen that the SVM reaches the overall best scores except for recall. Across the three sample datasets, the models achieve the best F1 on the fame-based sample, followed by the random and the tag-based samples. Recall is higher than precision for most neural models, and vice versa for the SVM.
Figure 1 shows the effectiveness of the models on subsets of documents grouped by length. On documents shorter than the models' maximum input length, the SVM almost always performs worse (in terms of F1) than the neural models, and vice versa.

Discussion and Limitations
The final result (the SVM beats both neural models) is unexpected and can be (partially) explained by the influence of document length and topic.

Document Length
Although the SVM has no contextual semantic information, it covers the tokens of the whole document through the bag-of-words representation, while BERT and Longformer are limited to a fixed input sequence (512 and 4,096 tokens, respectively), which is only a fraction of the documents (cf. Table 1). Our analysis of the relation between text length and effectiveness (cf. Figure 1) reveals that the neural models perform better than the SVM on documents shorter than their input limit; on longer documents, the violence might not have been part of the truncated input.

Topic Another possible explanation for the SVM's effectiveness is that the classes are separable by topic words (characters, fandom concepts) due to their co-occurrence with (non-)violent documents; hence the classifier may not have learned the more complex concept of violence. Our analysis (cf. Figure 2) shows that some fandoms are more violent than others (between 5-30% of works) and that about 5% of tagged characters and 2% of freeform tags are strongly associated with violent documents (strongly non-violent ones are rare).
Conversely, the top SVM features (cf. Table 3) contain hardly any topic words, but mostly words clearly associated with violence. We hypothesize that topic impacts our violence classifier, but the evidence is not conclusive, warranting deeper analysis.

Class Distribution
We see that the classification seems to be effective, with F1 scores ranging from 0.837 to 0.939. While these results are promising, the task is far from solved. Due to the skewed class distribution in the fan fiction corpus (ca. 13% of works are violent; likely more extreme for other genres), high precision is crucial for a model to be transferable to real-world applications.
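To illustrate why precision matters under this skew, consider a hypothetical classifier with a fixed recall and false-positive rate; the numbers below are illustrative, not measured results:

```python
# Illustrative arithmetic (hypothetical numbers): with a base rate of
# ca. 13% violent works, even a modest false-positive rate erodes
# precision, because false positives are drawn from the large majority class.
def precision(prevalence, recall, fpr):
    tp = prevalence * recall       # true positive mass
    fp = (1 - prevalence) * fpr    # false positive mass
    return tp / (tp + fp)
```

For example, at 13% prevalence, 90% recall, and a 10% false-positive rate, precision is only about 0.57, even though overall accuracy would still look high.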

Limitations
We believe to have posed a challenging task which cannot be trivially solved using transformer models due to their length limitation; the proposed corpus contributes to both the experimental analysis and the detection of violence in long documents. We want to outline some known limitations, lest people prematurely consider the problem "solved" when observing our results. First, we only consider Graphic Depictions of Violence, whereas AO3 includes other warnings, e.g., Major Character Death. The large set of freeform tags suggests potential for more trigger warnings, but this would require annotations external to AO3. Second, although trigger warnings are usually assigned to whole documents, it would be interesting to pinpoint the potentially triggering content exactly within a document, i.e., using fine-grained annotations of a defined "violence" construct at the sentence or paragraph level. Third, the trigger warnings in our corpus were assigned by fan fiction authors and not via principled annotation. While the authors' assessment of their content and warning assignment can certainly be considered ground truth, the AO3 definition of violence, "[t]he content contains gory, graphic, explicitly described violence", leaves room for interpretation. Lastly, it is unclear whether our negative class indeed never includes violence-related triggers (cf. our Curation Rationale in Appendix B.1). With a working trigger detection approach, relabeling the data by experts will become feasible.

Impact Statement
Note that any automation of trigger warning assignment can be abused to achieve the opposite of the intended effect of trigger warnings, that is, to identify documents with specific triggering content with the goal of targeting vulnerable individuals. We refrain from directly publishing the corpus since we do not have explicit permission from the AO3 authors to republish their work. However, since AO3 is publicly accessible, we will release a file with the IDs of the works included in our experimental setup, so the splits can be reproduced.

[Table 4: Differences in meta-data frequency between violent and non-violent documents. Shown are the ∆_i as described in Appendix A, as well as the absolute distance for the example tags, split by ratings, characters (as indicators of fandom and plot points), and freeform tags as content descriptors.]

A Figures and Tables
Meta-data (Tag) Differences Between Classes Table 4 shows the effect of topic on classification effectiveness. We list the relative count difference ∆_i between all works D_i with an Additional Tag i (rating, freeform, characters) between violent (v) and non-violent (nv) documents, defined as ∆_i = (|D_i^v| - |D_i^nv|) / (|D_i^v| + |D_i^nv|), so that ∆_i = 1 indicates that all occurrences of the tag were assigned to violent documents and ∆_i = -1 indicates the opposite.
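In code, the score described above reduces to a normalized count difference; a minimal sketch (the function name is ours):

```python
# Tag-association score Delta_i as described above: the normalized
# difference between a tag's occurrence counts in violent (v) and
# non-violent (nv) works, ranging from -1 (tag occurs only in
# non-violent works) to 1 (tag occurs only in violent works).
def delta(count_v, count_nv):
    return (count_v - count_nv) / (count_v + count_nv)
```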

B Data Statement
Following Bender and Friedman (2018), we provide a data statement to document the construction of the violence trigger warnings corpus.

B.1 Curation Rationale
Our goal was to extract a trigger warning corpus from an existing resource with imperfect labels. In the original data, we are dealing with false negative, false positive, and even contradictory labels, where a work is labeled as both "Graphic Depictions of Violence" and "No Archive Warnings Apply." However, the corpus should be clearly separable into positive and negative examples.
To address this situation, we relied on the existing labels, but filtered the positive and negative classes using a co-occurrence analysis between each tag and "Graphic Depictions of Violence."

B.2 Language Variety
While Archive of Our Own (AO3) includes fan fiction in many languages, we discarded all non-English documents. For language detection we used Resiliparse. This language constraint is only for the purpose of this study; the remaining documents are of course relevant for future research.

B.3 Speaker Demographic
AO3 hosts fan fiction works from a variety of authors whose demographics are unknown. The only information available to date is a census taken in 2013, when a survey was conducted (Archive of Our Own, 2013) to which 10,005 users (not necessarily authors, though an overlap is possible) replied. In summary, the average user age at that time was 25 years. Most users identified themselves as Female (80%), with Genderqueer second (6%) and Male third (4%); other options were Transgender, Agender, Androgynous, Trans, Neutrois, and Other (2% or less each).
Regarding ethnicity, the majority of users identified as White (78%), followed by Asian (7%), Hispanic (5%), Mixed/Multiple (5%), Black (2%), Native American (1%), Pacific Islander (1%), and Other (1%). Only 6% of users stated that they used AO3 for languages other than English. The AO3 census evaluation states that this survey is not representative and has its limitations, but also that "[these limitations] do not make the survey useless". There has not been another census since then.

B.4 Annotator Demographic
We used pre-existing labels from AO3 for this corpus. Trigger warnings are assigned by the authors of the respective works. We do not have any additional information about these groups.

B.5 Speech Situation
All of the texts are written works that are or were available online. Each work has a publication date, which might reflect the upload date rather than the date of writing, since some works were posted on other sites first; however, backdating is possible.

B.6 Text Characteristics
Almost all texts in our corpus belong to the fan fiction genre. Many fan fiction works revolve (non-exhaustively) around fictional characters from books, cartoons, anime, manga, music, and movies, or non-fictional characters such as celebrities. Aside from that, AO3 includes meta posts (such as the previously mentioned AO3 census, or placeholders which link to other works); these have been removed by our tag-based filtering.

C Classification Setup and Ablation
All document vectors are normalized using the L2 norm. The SVM cost parameter C is set to 0.5 and is weighted for each class inversely proportionally to its frequency in the training set. For BERT, we use a maximum sequence length of 512 and fine-tune for 10 epochs with a learning rate of 2e-5 and batches of size 32. For Longformer, we use a maximum sequence length of 4,096 and fine-tune for 20 epochs with a learning rate of 2e-5 and batches of size 4. Hyperparameters were optimized for all models via an exhaustive search, evaluating possible combinations using cross-validation on the training set.
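In scikit-learn terms, the SVM configuration described above could be sketched as follows; the TF-IDF feature extraction is our assumption, since this paragraph does not specify the document representation:

```python
# Sketch of the SVM setup described above using scikit-learn; the TF-IDF
# features are an assumption. class_weight="balanced" weights each class
# inversely proportionally to its training-set frequency, and norm="l2"
# applies the L2 normalization of the document vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

clf = make_pipeline(
    TfidfVectorizer(norm="l2"),
    LinearSVC(C=0.5, class_weight="balanced"),
)
```

The "balanced" class weighting compensates for the skewed violent/non-violent distribution without resampling the training data.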

Figure 1: Classification effectiveness in terms of F1 on the sample datasets over intervals of the number of tokens.
Table 2: Classification effectiveness on the test set for all sample datasets; reported are F1 score, precision (P), recall (R), and accuracy (Acc.); bold = best result.


Table 3: Most discriminative SVM features for both classes and all three sample datasets. The upper row group also lists the first topic (fandom-specific) feature, its score, and its position in the list (rank). It should be noted that there are almost no topic features among the top 1,000 features, which we inspected manually.