Andrew Piper

2026

NarraBench: A Comprehensive Framework for Narrative Benchmarking
Sil Hamilton | Matthew Wilkens | Andrew Piper
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

We present NarraBench, a theory-informed taxonomy of narrative-understanding tasks, as well as an associated survey of 78 existing benchmarks in the area. We find significant need for new evaluations covering aspects of narrative understanding that are either overlooked in current work or are poorly aligned with existing metrics. Specifically, we estimate that only 27% of narrative tasks are well captured by existing benchmarks, and we note that some areas – including narrative events, style, perspective, and revelation – are nearly absent from current evaluations. We also note the need for increased development of benchmarks capable of assessing constitutively subjective and perspectival aspects of narrative, that is, aspects for which there is generally no single correct answer. Our taxonomy, survey, and methodology are of value to NLP researchers seeking to test LLM narrative understanding.

2025

pdf bib abs

CR4-NarrEmote: An Open Vocabulary Dataset of Narrative Emotions Derived Using Citizen Science
Andrew Piper | Robert Budac
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

We introduce “Citizen Readers for Narrative Emotions” (CR4-NarrEmote), a large-scale, open-vocabulary dataset of narrative emotions derived through a citizen science initiative. Over a four-month period, 3,738 volunteers contributed more than 200,000 emotion annotations across 43,000 passages from long-form fiction and non-fiction, spanning 150 years, twelve genres, and multiple Anglophone cultural contexts. To facilitate model training and comparability, we provide mappings to both dimensional (Valence-Arousal-Dominance) and categorical (NRC Emotion) frameworks. We evaluate annotation reliability using lexical, categorical, and semantic agreement measures, and find substantial alignment between citizen science annotations and expert-generated labels. As the first open-vocabulary resource focused on narrative emotions at scale, CR4-NarrEmote provides an important foundation for affective computing and narrative understanding.

pdf bib abs

People worldwide use language in subtle and complex ways to express emotions. Although emotion recognition–an umbrella term for several NLP tasks–impacts various applications within NLP and beyond, most work in this area has focused on high-resource languages. This has led to significant disparities in research efforts and proposed solutions, particularly for under-resourced languages, which often lack high-quality annotated datasets.In this paper, we present BRIGHTER–a collection of multi-labeled, emotion-annotated datasets in 28 different languages and across several domains. BRIGHTER primarily covers low-resource languages from Africa, Asia, Eastern Europe, and Latin America, with instances labeled by fluent speakers. We highlight the challenges related to the data collection and annotation processes, and then report experimental results for monolingual and crosslingual multi-label emotion identification, as well as emotion intensity recognition. We analyse the variability in performance across languages and text domains, both with and without the use of LLMs, and show that the BRIGHTER datasets represent a meaningful step towards addressing the gap in text-based emotion recognition.

pdf bib abs

A Theoretical Framework for Evaluating Narrative Surprise in Large Language Models
Annaliese Bissell | Ella Paulin | Andrew Piper
Proceedings of the The 7th Workshop on Narrative Understanding

Narrative surprise is a core element of storytelling for engaging audiences, and yet it remains underexplored in the context of large language models (LLMs) and narrative generation. While surprise arises from events that deviate from expectations while maintaining retrospective coherence, current computational approaches lack comprehensive frameworks to evaluate this phenomenon. This paper presents a novel framework for assessing narrative surprise, drawing on psychological theories of narrative comprehension and surprise intensity. We operationalize six criteria—initiatoriness, immutability violation, predictability, post-dictability, importance, and valence—to measure narrative surprise in story endings. Our study evaluates 120 story endings, generated by both human authors and LLMs, across 30 mystery narratives. Through a ranked-choice voting methodology, we identify significant correlations between reader preferences and four of the six criteria. Results underscore the continuing advantage of human-authored endings in achieving compelling narrative surprise, while also revealing significant progress in LLM-generated narratives.

pdf bib abs

Evaluating Taxonomy Free Character Role Labeling (TF-CRL) in News Stories using Large Language Models
David G Hobson | Derek Ruths | Andrew Piper
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

We introduce Taxonomy-Free Character Role Labeling (TF-CRL); a novel task that assigns open-ended narrative role labels to characters in news stories based on their functional role in the narrative. Unlike fixed taxonomies, TF-CRL enables more nuanced and comparative analysis by generating compositional labels (e.g., Resilient Leader, Scapegoated Visionary). We evaluate several large language models (LLMs) on this task using human preference rankings and ratings across four criteria: faithfulness, relevance, informativeness, and generalizability. LLMs almost uniformly outperform human annotators across all dimensions. We further show how TF-CRL supports rich narrative analysis by revealing novel latent taxonomies and enabling cross-domain narrative comparisons. Our approach offers new tools for studying media portrayals, character framing, and the socio-political impacts of narrative roles at-scale.

pdf bib abs

NarraDetect: An annotated dataset for the task of narrative detection
Andrew Piper | Sunyam Bagga
Proceedings of the The 7th Workshop on Narrative Understanding

Narrative detection is an important task across diverse research domains where storytelling serves as a key mechanism for explaining human beliefs and behavior. However, the task faces three significant challenges: (1) inter-narrative heterogeneity, or the variation in narrative communication across social contexts; (2) intra-narrative heterogeneity, or the dynamic variation of narrative features within a single text over time; and (3) the lack of theoretical consensus regarding the concept of narrative. This paper introduces the NarraDetect dataset, a comprehensive resource comprising over 13,000 passages from 18 distinct narrative and non-narrative genres. Through a manually annotated subset of ~400 passages, we also introduce a novel theoretical framework for annotating for a scalar concept of “narrativity.” Our findings indicate that while supervised models outperform large language models (LLMs) on this dataset, LLMs exhibit stronger generalization and alignment with the scalar concept of narrativity.

pdf bib abs

Evaluating Large Language Models for Narrative Topic Labeling
Andrew Piper | Sophie Wu
Proceedings of the 5th International Conference on Natural Language Processing for Digital Humanities

This paper evaluates the effectiveness of large language models (LLMs) for labeling topics in narrative texts, comparing performance across fiction and news genres. Building on prior studies in factual documents, we extend the evaluation to narrative contexts where story content is central. Using a ranked voting system with 200 crowdworkers, we assess participants’ preferences of topic labels by comparing multiple LLM outputs with human annotations. Our findings indicate minimal inter-model variation, with LLMs performing on par with human readers in news and outperforming humans in fiction. We conclude with a case study using a set of 25,000 narrative passages from novels illustrating the analytical value of LLM topic labels compared to traditional methods. The results highlight the significant promise of LLMs for topic labeling of narrative texts.

pdf bib abs

Probing Narrative Morals: A New Character-Focused MFT Framework for Use with Large Language Models
Luca Mitran | Sophie Wu | Andrew Piper
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Moral Foundations Theory (MFT) provides a framework for categorizing different forms of moral reasoning, but its application to computational narrative analysis remains limited. We propose a novel character-centric method to quantify moral foundations in storytelling, using large language models (LLMs) and a novel Moral Foundations Character Action Questionnaire (MFCAQ) to evaluate the moral foundations supported by the behaviour of characters in stories. We validate our approach against human annotations and then apply it to a study of 2,697 folktales from 55 countries. Our findings reveal: (1) broad distribution of moral foundations across cultures, (2) significant cross-cultural consistency with some key regional differences, and (3) a more balanced distribution of positive and negative moral content than suggested by prior work. This work connects MFT and computational narrative analysis, demonstrating LLMs’ potential for scalable moral reasoning in narratives.

2024

pdf bib abs

Using Large Language Models for Understanding Narrative Discourse
Andrew Piper | Sunyam Bagga
Proceedings of the 6th Workshop on Narrative Understanding

In this study, we explore the application of large language models (LLMs) to analyze narrative discourse within the framework established by the field of narratology. We develop a set of elementary narrative features derived from prior theoretical work that focus on core dimensions of narrative, including time, setting, and perspective. Through experiments with GPT-4 and fine-tuned open-source models like Llama3, we demonstrate the models’ ability to annotate narrative passages with reasonable levels of agreement with human annotators. Leveraging a dataset of human-annotated passages spanning 18 distinct narrative and non-narrative genres, our work provides empirical support for the deictic theory of narrative communication. This theory posits that a fundamental function of storytelling is the focalization of attention on distant human experiences to facilitate social coordination. We conclude with a discussion of the possibilities for LLM-driven narrative discourse understanding.

pdf bib abs

The Empirical Variability of Narrative Perceptions of Social Media Texts
Joel Mire | Maria Antoniak | Elliott Ash | Andrew Piper | Maarten Sap
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Most NLP work on narrative detection has focused on prescriptive definitions of stories crafted by researchers, leaving open the questions: how do crowd workers perceive texts to be a story, and why? We investigate this by building StoryPerceptions, a dataset of 2,496 perceptions of storytelling in 502 social media texts from 255 crowd workers, including categorical labels along with free-text storytelling rationales, authorial intent, and more. We construct a fine-grained bottom-up taxonomy of crowd workers’ varied and nuanced perceptions of storytelling by open-coding their free-text rationales. Through comparative analyses at the label and code level, we illuminate patterns of disagreement among crowd workers and across other annotation contexts, including prescriptive labeling from researchers and LLM-based predictions. Notably, plot complexity, references to generalized or abstract actions, and holistic aesthetic judgments (such as a sense of cohesion) are especially important in disagreements. Our empirical findings broaden understanding of the types, relative importance, and contentiousness of features relevant to narrative detection, highlighting opportunities for future work on reader-contextualized models of narrative reception.

pdf bib abs

Where Do People Tell Stories Online? Story Detection Across Online Communities
Maria Antoniak | Joel Mire | Maarten Sap | Elliott Ash | Andrew Piper
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Story detection in online communities is a challenging task as stories are scattered across communities and interwoven with non-storytelling spans within a single text. We address this challenge by building and releasing the StorySeeker toolkit, including a richly annotated dataset of 502 Reddit posts and comments, a detailed codebook adapted to the social media context, and models to predict storytelling at the document and span levels. Our dataset is sampled from hundreds of popular English-language Reddit communities ranging across 33 topic categories, and it contains fine-grained expert annotations, including binary story labels, story spans, and event spans. We evaluate a range of detection methods using our data, and we identify the distinctive textual features of online storytelling, focusing on storytelling spans, which we introduce as a new task. We illuminate distributional characteristics of storytelling on a large community-centric social media platform, and we also conduct a case study on r/ChangeMyView, where storytelling is used as one of many persuasive strategies, illustrating that our data and models can be used for both inter- and intra-community research. Finally, we discuss implications of our tools and analyses for narratology and the study of online communities.

pdf bib abs

Story Morals: Surfacing value-driven narrative schemas using large language models
David G Hobson | Haiqi Zhou | Derek Ruths | Andrew Piper
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Stories are not only designed to entertain but encode lessons reflecting their authors’ beliefs about the world. In this paper, we propose a new task of narrative schema labelling based on the concept of “story morals” to identify the values and lessons conveyed in stories. Using large language models (LLMs) such as GPT-4, we develop methods to automatically extract and validate story morals across a diverse set of narrative genres, including folktales, novels, movies and TV, personal stories from social media and the news. Our approach involves a multi-step prompting sequence to derive morals and validate them through both automated metrics and human assessments. The findings suggest that LLMs can effectively approximate human story moral interpretations and offer a new avenue for computational narrative understanding. By clustering the extracted morals on a sample dataset of folktales from around the world, we highlight the commonalities and distinctiveness of narrative values, providing preliminary insights into the distribution of values across cultures. This work opens up new possibilities for studying narrative schemas and their role in shaping human beliefs and behaviors.

pdf bib abs

The Social Lives of Literary Characters: Combining citizen science and language models to understand narrative social networks
Andrew Piper | Michael Xu | Derek Ruths
Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities

Characters and their interactions are central to the fabric of narratives, playing a crucial role in developing readers’ social cognition. In this paper, we introduce a novel annotation framework that distinguishes between five types of character interactions, including bilateral and unilateral classifications. Leveraging the crowd-sourcing framework of citizen science, we collect a large dataset of manual annotations (N=13,395). Using this data, we explore how genre and audience factors influence social network structures in a sample of contemporary books. Our findings demonstrate that fictional narratives tend to favor more embodied interactions and exhibit denser and less modular social networks. Our work not only enhances the understanding of narrative social networks but also showcases the potential of integrating citizen science with NLP methodologies for large-scale narrative analysis.

pdf bib abs

Large Scale Narrative Messaging around Climate Change: A Cross-Cultural Comparison
Haiqi Zhou | David G Hobson | Derek Ruths | Andrew Piper
Proceedings of the 1st Workshop on Natural Language Processing Meets Climate Change (ClimateNLP 2024)

In this study, we explore the use of Large Language Models (LLMs) such as GPT-4 to extract and analyze the latent narrative messaging in climate change-related news articles from North American and Chinese media. By defining “narrative messaging” as the intrinsic moral or lesson of a story, we apply our model to a dataset of approximately 15,000 news articles in English and Mandarin, categorized by climate-related topics and ideological groupings. Our findings reveal distinct differences in the narrative values emphasized by different cultural and ideological contexts, with North American sources often focusing on individualistic and crisis-driven themes, while Chinese sources emphasize developmental and cooperative narratives. This work demonstrates the potential of LLMs in understanding and influencing climate communication, offering new insights into the collective belief systems that shape public discourse on climate change across different cultures.

2023

pdf bib abs

Computational Narrative Understanding: A Big Picture Analysis
Andrew Piper
Proceedings of the Big Picture Workshop

This paper provides an overview of outstanding major research goals for the field of computational narrative understanding. Storytelling is an essential human practice, one that provides a sense of personal meaning, shared sense of community, and individual enjoyment. A number of research domains have increasingly focused on storytelling as a key mechanism for explaining human behavior. Now is an opportune moment to provide a vision of the contributions that computational narrative understanding can make towards this collective endeavor and the challenges facing the field. In addition to providing an overview of the elements of narrative, this paper outlines three major lines of inquiry: understanding the multi-modality of narrative; the temporal patterning of narrative (narrative “shape”); and socio-cultural narrative schemas, i.e. collective narratives. The paper concludes with a call for more inter-disciplinary working groups and deeper investment in building cross-cultural and multi-modal narrative datasets.

2022

pdf bib abs

The predictability of literary translation
Andrew Piper | Matt Erlin
Proceedings of the 2nd International Workshop on Natural Language Processing for Digital Humanities

Research has shown that the practice of translation exhibits predictable linguistic cues that make translated texts detectable from original-language texts (a phenomenon known as “translationese”). In this paper, we test the extent to which literary translations are subject to the same effects and whether they also exhibit meaningful differences at the level of content. Research into the function of translations within national literary markets using smaller case studies has suggested that translations play a cultural role that is distinct from that of original-language literature, i.e. their differences reside not only at the level of translationese but at the level of content. Using a dataset consisting of original-language fiction in English and translations into English from 120 languages (N=21,302), we find that one of the principal functions of literary translation is to convey predictable geographic identities to local readers that nevertheless extend well beyond the foreignness of persons and places.

pdf bib abs

The COVID That Wasn’t: Counterfactual Journalism Using GPT
Sil Hamilton | Andrew Piper
Proceedings of the 6th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

In this paper, we explore the use of large language models to assess human interpretations of real world events. To do so, we use a language model trained prior to 2020 to artificially generate news articles concerning COVID-19 given the headlines of actual articles written during the pandemic. We then compare stylistic qualities of our artificially generated corpus with a news corpus, in this case 5,082 articles produced by CBC News between January 23 and May 5, 2020. We find our artificially generated articles exhibits a considerably more negative attitude towards COVID and a significantly lower reliance on geopolitical framing. Our methods and results hold importance for researchers seeking to simulate large scale cultural processes via recent breakthroughs in text generation.

2021

pdf bib abs

“Are you kidding me?”: Detecting Unpalatable Questions on Reddit
Sunyam Bagga | Andrew Piper | Derek Ruths
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Abusive language in online discourse negatively affects a large number of social media users. Many computational methods have been proposed to address this issue of online abuse. The existing work, however, tends to focus on detecting the more explicit forms of abuse leaving the subtler forms of abuse largely untouched. Our work addresses this gap by making three core contributions. First, inspired by the theory of impoliteness, we propose a novel task of detecting a subtler form of abuse, namely unpalatable questions. Second, we publish a context-aware dataset for the task using data from a diverse set of Reddit communities. Third, we implement a wide array of learning models and also investigate the benefits of incorporating conversational context into computational models. Our results show that modeling subtle abuse is feasible but difficult due to the language involved being highly nuanced and context-sensitive. We hope that future research in the field will address such subtle forms of abuse since their harm currently passes unnoticed through existing detection systems.

pdf bib abs

Narrative Theory for Computational Narrative Understanding
Andrew Piper | Richard Jean So | David Bamman
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Over the past decade, the field of natural language processing has developed a wide array of computational methods for reasoning about narrative, including summarization, commonsense inference, and event detection. While this work has brought an important empirical lens for examining narrative, it is by and large divorced from the large body of theoretical work on narrative within the humanities, social and cognitive sciences. In this position paper, we introduce the dominant theoretical frameworks to the NLP community, situate current research in NLP within distinct narratological traditions, and argue that linking computational work in NLP to theory opens up a range of new empirical questions that would both help advance our understanding of narrative and open up new practical applications.

2020

pdf bib abs

Measuring the Effects of Bias in Training Data for Literary Classification
Sunyam Bagga | Andrew Piper
Proceedings of the 4th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

Downstream effects of biased training data have become a major concern of the NLP community. How this may impact the automated curation and annotation of cultural heritage material is currently not well known. In this work, we create an experimental framework to measure the effects of different types of stylistic and social bias within training data for the purposes of literary classification, as one important subclass of cultural material. Because historical collections are often sparsely annotated, much like our knowledge of history is incomplete, researchers often cannot know the underlying distributions of different document types and their various sub-classes. This means that bias is likely to be an intrinsic feature of training data when it comes to cultural heritage material. Our aim in this study is to investigate which classification methods may help mitigate the effects of different types of bias within curated samples of training data. We find that machine learning techniques such as BERT or SVM are robust against reproducing the different kinds of bias within our test data, except in the most extreme cases. We hope that this work will spur further research into the potential effects of bias within training data for other cultural heritage material beyond the study of literature.

2019

pdf bib abs

The Scientization of Literary Study
Stefania Degaetano-Ortlieb | Andrew Piper
Proceedings of the 3rd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

Scholarly practices within the humanities have historically been perceived as distinct from the natural sciences. We look at literary studies, a discipline strongly anchored in the humanities, and hypothesize that over the past half-century literary studies has instead undergone a process of “scientization”, adopting linguistic behavior similar to the sciences. We test this using methods based on information theory, comparing a corpus of literary studies articles (around 63,400) with a corpus of standard English and scientific English respectively. We show evidence for “scientization” effects in literary studies, though at a more muted level than scientific English, suggesting that literary studies occupies a middle ground with respect to standard English in the larger space of academic disciplines. More generally, our methodology can be applied to investigate the social positioning and development of language use across different domains (e.g. scientific disciplines, language varieties, registers).

2016

pdf bib abs

Annotating Characters in Literary Corpora: A Scheme, the CHARLES Tool, and an Annotated Novel
Hardik Vala | Stefan Dimitrov | David Jurgens | Andrew Piper | Derek Ruths
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Characters form the focus of various studies of literary works, including social network analysis, archetype induction, and plot comparison. The recent rise in the computational modelling of literary works has produced a proportional rise in the demand for character-annotated literary corpora. However, automatically identifying characters is an open problem and there is low availability of literary texts with manually labelled characters. To address the latter problem, this work presents three contributions: (1) a comprehensive scheme for manually resolving mentions to characters in texts. (2) A novel collaborative annotation tool, CHARLES (CHAracter Resolution Label-Entry System) for character annotation and similiar cross-document tagging tasks. (3) The character annotations resulting from a pilot study on the novel Pride and Prejudice, demonstrating the scheme and tool facilitate the efficient production of high-quality annotations. We expect this work to motivate the further production of annotated literary corpora to help meet the demand of the community.

pdf bib

The More Antecedents, the Merrier: Resolving Multi-Antecedent Anaphors
Hardik Vala | Andrew Piper | Derek Ruths
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)