NarraSum: A Large-Scale Dataset for Abstractive Narrative Summarization

Narrative summarization aims to produce a distilled version of a narrative to describe its most salient events and characters. Summarizing a narrative is challenging as it requires an understanding of event causality and character behaviors. To encourage research in this direction, we propose NarraSum, a large-scale narrative summarization dataset. It contains 122K narrative documents, which are collected from plot descriptions of movies and TV episodes with diverse genres, and their corresponding abstractive summaries. Experiments show that there is a large performance gap between humans and the state-of-the-art summarization models on NarraSum. We hope that this dataset will promote future research in summarization, as well as broader studies of natural language understanding and generation. The dataset is available at https://github.com/zhaochaocs/narrasum.


Introduction
A narrative is a story (e.g., a novel or a movie) composed of events and characters (Prince, 1973). Narrative summarization aims to produce a distilled version of a narrative, either extractively or abstractively, to contain its most salient events and major characters (Lehnert, 1981). This ability is especially crucial for the understanding of narratives and, in general, the understanding of human behaviors and beliefs (Piper et al., 2021). Practically, a summary of a narrative can enable a reader to quickly discern the key points, which is useful in real-world scenarios such as content recommendations and advertisements.
While text summarization has been explored for decades, most existing studies focus on summarizing news (Consortium and Company, 2008; Nallapati et al., 2016; Narayan et al., 2018a) or structured documents (e.g., scientific papers (Gidiotis and Tsoumakas, 2019; Cohan et al., 2018)). These documents have specific writing styles. For instance, news is organized such that the first few sentences convey the most important information (Hicks et al., 2016). Scientific papers usually follow a standard structure with a few sections contributing the most to the summary (Gidiotis and Tsoumakas, 2020). It has been demonstrated that many summarization models, including recent ones, heavily rely on these structural clues (Kedzie et al., 2018; Zhong et al., 2019; Zhao et al., 2022a). However, a typical narrative does not contain such structural cues. This suggests that a narrative summarization model has to understand the entire narrative to identify the salient events and characters. While some recent summarization tasks also require understanding an entire document, they focus on conversational domains such as dialogues (Gliwa et al., 2019), emails (Zhang et al., 2021a), and meetings (Zhong et al., 2021). Narratives are different from those genres in nature and are understudied.

Figure 1: The input is a narrative text (denoted by "Document"; pictures are not included), and the output is a summary containing its salient events and characters.

Document: (https://bigbangtheory.fandom.com/wiki/The_Big_Bran_Hypothesis) Setting out their dinner of Thai food, Sheldon gives the group a lecture on the use of the fork in Thai history. A little later, Penny talks with Leonard in the hallway about her work at The Cheesecake Factory. She then asks Leonard to sign for a piece of furniture while she is out. […] It turns out the furniture is bigger than they had expected. The delivery man does not help them, so Leonard and Sheldon are forced to carry it up the stairs themselves since the elevator doesn't work. Sheldon's only idea involves using a Green Lantern power ring. Finally, they succeed in getting it up the stairs to her apartment. While there, Sheldon sees that Penny's apartment is a complete mess and insists on tidying up. […] Leonard gets up the next morning and Sheldon tells him that he slept well. Leonard remarks that a well-known folk cure for insomnia is to break into your neighbor's apartment and clean. Sheldon asks if that was sarcasm. Penny awakens to find her apartment in a well-ordered state and screams about those geeky bastards. Penny charges into Sheldon and Leonard's apartment in a fit of rage about them coming into her place while she was sleeping. She demands her key back. […] Later, Penny runs into Raj in the hallway and talks to him about being upset over what happened (although he doesn't reply, as he has selective mutism). Penny decides to forgive them while Raj is thinking: "Boy, her hair smells nice" and "Maybe my mother was right. Maybe I should marry an Indian girl. We would have the same cultural background and she could sing the same lullabies my mother sang to me". Penny then hugs Raj, much to his surprise. […]

Summary: (https://en.wikipedia.org/wiki/The_Big_Bang_Theory_(season_1)#ep2) When Sheldon and Leonard drop off a box of flat-pack furniture that came for Penny, Sheldon is deeply disturbed at how messy and disorganized her apartment is. Later that night, while Penny sleeps, the obsessive-compulsive Sheldon, unable to sleep, sneaks into her apartment to organize and clean it. Leonard finds out and reluctantly helps him. The next morning, Penny is furious to discover they had been in her apartment. Sheldon tries to apologize to Penny but fails by remarking that Leonard is a "gentle and thorough lover". Later, Penny encounters Raj in the hallway. Though he cannot talk to Penny, she calms down whilst telling him about the issue, reasoning the guys were just trying to help her, and hugs Raj. Then Leonard apologizes, prompting Penny to forgive and hug him.
Understanding an entire narrative faces unique challenges. A narrative organizes the story into a sequence of events (i.e., plot) in a chronological and causal order (Forster, 1985). Events unfold due to the actions of characters and other event participants, or external forces in stories (Mani, 2012). To identify the salient events, a model needs to understand both plot and characters. From the plot's perspective, the model needs to understand the causal and temporal relationships between events, as well as how the plot develops from the beginning to the end (Freytag, 1908). From the character's perspective, the model needs to understand the characters' profiles (e.g., personalities, roles, and interpersonal relationships), and how their desires and actions drive the story forward.
Figure 1 illustrates the importance of understanding the entire narrative for summarization. In this example, the main event is "Sheldon cleans Penny's apartment and gets Leonard in trouble", which is included in the summary. The side event "Penny speaks to Raj and forgives Leonard" is also included, since it is the consequence and ending of the main event. In contrast, "Sheldon gives a lecture on the fork" is not included, as it does not impact the development of the plot. Besides the main events, the summary also explains Sheldon's motivation for cleaning the apartment.
A large-scale, high-quality dataset is essential to promote research on this topic. Unfortunately, unlike in other domains such as news and scientific papers, where the document and summary can be found in the same data source, narrative documents and their corresponding summaries are usually spread across separate sources. Previous studies collect narrative document-summary pairs either by creating summaries manually (Ouyang et al., 2017) or by matching titles between documents and summaries followed by a manual inspection (Ladhak et al., 2020; Kryściński et al., 2021), making it challenging to enlarge the resulting datasets.
In this work, we propose an automatic data construction framework to build a narrative summarization dataset with both large scale and high quality. Specifically, we first collect narratives from plot descriptions of movies or TV episodes through online resources. We choose the plot description because it describes the overall narrative of the movie or TV episode, including the story arcs and major characters. This source is also widely used in narrative-related studies (Linebarger and Piotrowski, 2009; Bamman et al., 2013; Papalampidi et al., 2019; Xiong et al., 2019). After data collection, we build an align-and-verify pipeline to automatically align plot descriptions of the same movie or TV episode from different sources. Finally, we construct document-summary pairs by treating the longer plot description as the document to be summarized and the shorter one (of the same movie or TV episode) as the corresponding summary. After filtering out low-quality document-summary pairs, we obtain NARRASUM, a large-scale dataset that contains around 122K narrative document-summary pairs in English. Our data construction framework is generic and thus can potentially be applied to other languages as well.
To gauge the feasibility of NARRASUM for the narrative summarization task, we explore different characteristics of this dataset. We observe that, compared with other summarization datasets, the narratives in NARRASUM are of diverse genres, and the summaries are more abstractive and of varying lengths. Furthermore, rather than focusing on a particular part of the document (as in other summarization datasets), the summaries in NARRASUM are designed to cover the entire narrative. This brings new challenges to current summarization methods.
We investigate the performance of several strong baselines and state-of-the-art summarization models on NARRASUM. Results show that there is a large gap between human and machine performance along various dimensions, demonstrating that narrative summarization is a challenging task.
The contributions of this paper are four-fold:
• We propose an automatic data construction framework to build a large-scale, high-quality narrative summarization dataset.
• We release the largest narrative summarization dataset to date, named NARRASUM, with detailed data analysis.
• We investigate the performance of recent summarization models on NARRASUM.
• We perform a thorough analysis of the models to point out the challenges and several promising directions.

Data Construction
We propose an automatic data construction framework to create a narrative summarization dataset.
To this end, we first collect plot descriptions of movies and TV episodes from multiple resources as narratives (Section 2.1). We then align plot descriptions in these resources that refer to the same movie or TV episode (Section 2.2). Finally, we filter the aligned data to construct high-quality document-summary pairs (Section 2.3). We describe the details of each step as follows.

Data Collection
We collect plot descriptions of movies and TV episodes from various movie websites and online encyclopedias such as Wikipedia, Fandom, IMDB, TVDB, and TMDB. Note that while we use movie/TV plot descriptions as a source of narrative text, our goal is not to summarize movies and TV episodes themselves but rather to study the task of narrative summarization in a broader sense. Tasks of movie/TV summarization have been addressed by other datasets such as Scriptbase (Gorinski and Lapata, 2015), Screenplay (Papalampidi et al., 2020), and SummScreen (Chen et al., 2022). Those works focus more on summarizing screenplays, which describe the movements, actions, expressions, and dialogue of the characters in a specific structure and format. Compared with general narrative summarization, screenplay summarization presents a different set of challenges such as scene understanding and dialog parsing.
Plot descriptions, on the other hand, describe the movie stories from a third-person point of view and present a different set of challenges as we described in Section 1.
To collect plot descriptions, we parse web pages of movies or TV episodes based on HTML tags and use heuristics to match keywords (e.g., Synopsis, Summary, and Plot) that are related to the plot. We then extract the text under these sections as the plot descriptions of the corresponding movies or TV episodes. Besides the plot descriptions, we also collect the meta information of movies or TV episodes, such as their title, air date, director(s), and writer(s), which is used for data alignment.

Data Alignment
After data collection, we align the web pages that are from different websites but refer to the same movie or TV episode. This is challenging due to ambiguity in natural language. For example, a single movie may have different surface forms of its title (e.g., Avengers 4 and Avengers: Endgame), while movies with the same title may be distinct (e.g., Bad Company may refer to fourteen different movies). Similar ambiguity issues arise when aligning air dates or names of crew members. Also, meta information might be missing or incorrect due to editing or parsing mistakes on web pages. To address these challenges, we propose an align-and-verify pipeline. It first aligns movies or TV episodes via fuzzy meta-information matching, which encourages high recall. Then, we use a verifier with high precision to re-check the aligned pairs and filter out the pairs with low confidence. We describe the details of this pipeline as follows.
During the alignment stage, we apply several heuristics for fuzzy meta-information matching.
To align movies, we first normalize movie titles by removing non-alphanumeric characters, stopwords, and subtitles. We then collect the movie pairs where the Levenshtein distance between the normalized titles is less than a threshold. Besides the title match, we also require the two movies to have the same air date or a partial overlap in directors or writers when such information is available. The ambiguity in titles of TV episodes is more severe than that of movies. To align TV episodes, we apply similar heuristics and further require the two episodes to belong to the same TV show.
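As a rough sketch, the fuzzy title matching above might look like the following. The stopword list, the subtitle convention (dropping text after a colon), and the distance threshold of 2 are illustrative assumptions; the paper does not specify the exact normalization rules or threshold.

```python
import re

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

# Illustrative stopword subset; the real list is not given in the paper.
STOPWORDS = {"the", "a", "an", "of", "and"}

def normalize_title(title: str) -> str:
    """Drop the subtitle (text after ':'), non-alphanumerics, and stopwords."""
    title = title.split(":")[0].lower()
    tokens = re.findall(r"[a-z0-9]+", title)
    return " ".join(t for t in tokens if t not in STOPWORDS)

def titles_match(t1: str, t2: str, threshold: int = 2) -> bool:
    """Fuzzy match: edit distance between normalized titles below a threshold."""
    return levenshtein(normalize_title(t1), normalize_title(t2)) <= threshold
```

In practice this check would be combined with the air-date and crew-overlap requirements described above before a pair is accepted.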
During the verification stage, we improve the precision of alignment by comparing the aligned plot descriptions. Specifically, we train a classifier that takes as input the concatenation of two plot descriptions and predicts whether they should be aligned. To train such a classifier, we first build a dataset with balanced positive aligned pairs and negative pairs. The positive pairs are a subset of heuristically aligned pairs where there is a link on one web page (e.g., "External links" in Wikipedia) pointing to the web page of the same movie or TV episode on the other website. Such links are edited by humans and are commonly used in entity linking (Shen et al., 2014). Negative pairs are randomly sampled from different movies of the same movie series or different episodes of the same TV show. Negative pairs sampled by this strategy usually share a similar set of characters and background settings, preventing the model from relying on surface-level cues to solve the task.
Based on this data sampling method, we collect a large-scale balanced dataset with 60K positive pairs and 60K negative pairs. We then split the dataset into train/validation/test subsets with a ratio of 80%/10%/10%. We train a RoBERTa-base (Liu et al., 2019) classifier on this dataset, and it achieves an accuracy of 97.13% on the test set, indicating that this model can serve as a reliable verifier to improve the precision of data alignment. We employ this classifier to further verify the heuristically aligned plot descriptions and filter out those where the predicted log-odds is smaller than 1. Finally, we obtain 2.6 million aligned plot description pairs.
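The verification filter can be sketched as follows. The function names and probability inputs are hypothetical; the only detail taken from the text is the log-odds cutoff of 1, which corresponds to a positive-class probability of roughly 0.73.

```python
import math

def log_odds(p: float) -> float:
    """Convert the verifier's positive-class probability to log-odds."""
    return math.log(p / (1.0 - p))

def verify_pairs(pairs_with_probs, threshold: float = 1.0):
    """Keep heuristically aligned pairs whose predicted log-odds >= threshold.

    pairs_with_probs: iterable of (pair, probability) where probability is the
    classifier's confidence that the two plot descriptions refer to the same title.
    """
    return [pair for pair, p in pairs_with_probs if log_odds(p) >= threshold]
```

For example, pairs scored at 0.9 and 0.8 survive the cutoff (log-odds 2.20 and 1.39) while a pair at 0.6 (log-odds 0.41) is discarded.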

Document-Summary Pairing
After obtaining the aligned plot description pairs, we regard the longer plot description as the document and the shorter one as the corresponding summary. However, not all pairs are of good quality for summarization. We identify three major issues compromising the quality and remove the relatively low-quality pairs from the final dataset.
First, the summary may contain hallucinated content that is not included in the document. Similar to Ladhak et al. (2020), we observe that hallucination is less common in plot description pairs with a noticeable difference in length. We therefore require the length of the summary to be shorter than half of that of the document to be summarized. We also calculate a semantic matching score between each summary and document, and then remove the pairs with low scores. We adopt two scores. The first is the ROUGE-1 Precision between the summary and the document. The second is the entailment probability between the document and the summary obtained from DocNLI (Yin et al., 2021), a document-level NLI model. We add up the two scores, rank the instances accordingly, and remove the 3% of document-summary pairs with the lowest scores.
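A minimal sketch of this first filter, assuming a stand-in `entail_prob` callable in place of the DocNLI model (which is not reimplemented here):

```python
from collections import Counter

def rouge1_precision(summary_tokens, doc_tokens):
    """Fraction of summary unigrams that also appear in the document (clipped counts)."""
    s, d = Counter(summary_tokens), Counter(doc_tokens)
    overlap = sum(min(c, d[t]) for t, c in s.items())
    return overlap / max(1, len(summary_tokens))

def filter_hallucinated(pairs, entail_prob, drop_frac=0.03):
    """Drop the lowest-scoring fraction of (doc_tokens, summary_tokens) pairs.

    entail_prob: callable returning a document-level entailment probability;
    a stand-in for a DocNLI-style model. The combined score is the sum of
    ROUGE-1 precision and the entailment probability, as in the text.
    """
    scored = [(rouge1_precision(s, d) + entail_prob(d, s), (d, s)) for d, s in pairs]
    scored.sort(key=lambda x: x[0])
    keep_from = int(len(scored) * drop_frac)
    return [pair for _, pair in scored[keep_from:]]
```

With `drop_frac=0.03` this removes the bottom 3% of pairs by combined score, mirroring the filtering described above.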
Second, sometimes the content of the shorter plot description is directly copied from the longer one. To create an abstractive summarization dataset, we use the ROUGE-2 Precision (Lin, 2004) between the document and the summary to reflect whether the content of the summary is copied from the document, and remove the pairs where the ROUGE-2 Precision is larger than 0.5. Third, a plot description may describe only part of the entire narrative (e.g., a trailer) rather than summarizing the narrative. To filter out these cases, we set a minimum length for documents and summaries to make sure that they contain enough information. We also extract oracle extractive summaries from the original document using the method proposed by Liu and Lapata (2019). We remove the instances where less than 30% of the oracle extractive summary's content comes from either the first half or the second half of the document.
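The copy filter reduces to a bigram precision check. This sketch assumes pre-tokenized input and uses the 0.5 threshold from the text:

```python
from collections import Counter

def bigrams(tokens):
    return list(zip(tokens, tokens[1:]))

def rouge2_precision(summary_tokens, doc_tokens):
    """Fraction of summary bigrams that also occur in the document (clipped counts)."""
    s, d = Counter(bigrams(summary_tokens)), Counter(bigrams(doc_tokens))
    overlap = sum(min(c, d[b]) for b, c in s.items())
    return overlap / max(1, len(summary_tokens) - 1)

def is_extractive_copy(summary_tokens, doc_tokens, threshold=0.5):
    """Flag pairs whose summary looks copied; such pairs are removed."""
    return rouge2_precision(summary_tokens, doc_tokens) > threshold
```

A summary that reuses long spans of the document verbatim scores near 1.0 and is flagged, while a paraphrased summary with few shared bigrams passes.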
After applying these filtering strategies, we obtain the final version of NARRASUM. It contains 122K aligned document-summary pairs, a high-quality subset (3.8%) of the original aligned pairs. We split the dataset into training (90%), validation (5%), and testing (5%) sets at the title level in order to avoid data leakage and undesirable overlap between the training and validation or test sets.

Data Analysis
This section provides basic statistics of NARRASUM. We then analyze the dataset in terms of the distribution of salient information and the abstractiveness of summaries. Finally, we conduct a human assessment to evaluate the quality of NARRASUM.

Data Statistics
We compare NARRASUM with six datasets from different domains such as news, scientific papers, and narratives. These include CNN/DailyMail (CNNDM) (See et al., 2017), XSum (Narayan et al., 2018b), ArXiv (Cohan et al., 2018), PubMed (Cohan et al., 2018), NovelChapter (Ladhak et al., 2020), and BookSum (Kryściński et al., 2021). NARRASUM contains 122K instances from 22.8K unique movies and 28.5K unique TV episodes, which is ten times larger than the previous largest narrative summarization dataset. We provide the distribution of production years and genres of these movies and TV series in Figure 2, which illustrates that NARRASUM spans a wide time period and contains a broad range of genres. The average lengths of documents and summaries are 785.97 and 147.06 tokens, respectively, and the average compression ratio is 5.34. Most of the documents in NARRASUM are longer than 512 tokens, which is the maximum input length of many pre-trained language models. However, the average length of documents in NARRASUM is still shorter than that of a typical novel chapter (∼5K). This requires models to process long, but not prohibitively long, inputs while exposing them to the challenges of narrative summarization.

Summary Characteristics
Different from news articles, salient information in a narrative is spread across the entire text. To verify whether NARRASUM's summaries have this property, we first check the distribution of salient information in the documents. Similar to Kim et al. (2019), we use bi-grams of the summary text to represent the salient content of the narrative and then obtain their normalized positions in the documents. Figure 3(a) shows the probability density distribution of the positions of the salient information. We compare the distribution of NARRASUM with CNNDM, XSum, and PubMed. Figure 3(a) indicates that while the salient information of CNNDM and PubMed is concentrated in certain parts of the document, the salient information of NARRASUM is more uniformly distributed over the entire document. This supports our claim that summarization on NARRASUM requires an understanding of the entire document. There is no lead bias in XSum because the first sentence of each document is removed and regarded as the summary, which further demonstrates that the first sentence of a news document is enough to summarize the entire document. The section-wise bias in scientific papers is discussed by Gidiotis and Tsoumakas (2020).
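The position analysis can be sketched as follows. Matching summary bigrams against document bigrams and normalizing their indices to [0, 1] is our reading of the Kim et al. (2019) procedure, so treat the details as an approximation:

```python
def salient_positions(doc_tokens, summary_tokens):
    """Normalized document positions (0 = start, 1 = end) of summary bigrams
    that occur in the document; their density reflects where salient content lies."""
    doc_bigrams = list(zip(doc_tokens, doc_tokens[1:]))
    n = max(1, len(doc_bigrams) - 1)
    positions = []
    for bg in zip(summary_tokens, summary_tokens[1:]):
        for i, db in enumerate(doc_bigrams):
            if bg == db:
                positions.append(i / n)  # every occurrence contributes
    return positions
```

Pooling these positions over a whole dataset and plotting their density gives a curve like Figure 3(a): a lead-biased corpus piles up near 0, while NARRASUM is closer to uniform.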
Next, we measure the abstractiveness of the summaries in NARRASUM. To this end, we calculate the Coverage and Density of each summary as suggested by Grusky et al. (2018). Lower Coverage and Density scores indicate that the summary is more abstractive. The distribution is shown in Figure 3(b). The comparison shows that the summaries of NARRASUM are more abstractive than those of CNNDM and PubMed, while being similar to XSum, the most abstractive dataset for news summarization.
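Coverage and Density follow the greedy extractive-fragment procedure of Grusky et al. (2018). The sketch below implements that greedy matching over token lists; whitespace tokenization in the usage example is an illustrative simplification:

```python
def extractive_fragments(doc, summ):
    """Greedy longest-match shared fragments (lengths only), following the
    F(A, S) procedure of Grusky et al. (2018)."""
    frags, i = [], 0
    while i < len(summ):
        best, j = 0, 0
        while j < len(doc):
            if summ[i] == doc[j]:
                k = 0
                while (i + k < len(summ) and j + k < len(doc)
                       and summ[i + k] == doc[j + k]):
                    k += 1
                best = max(best, k)
                j += k
            else:
                j += 1
        if best:
            frags.append(best)
            i += best  # skip past the matched fragment
        else:
            i += 1
    return frags

def coverage_density(doc, summ):
    """Coverage: fraction of summary tokens copied from the document.
    Density: same sum but squared, rewarding long copied spans."""
    if not summ:
        return 0.0, 0.0
    frags = extractive_fragments(doc, summ)
    return sum(frags) / len(summ), sum(f * f for f in frags) / len(summ)
```

For a document "a b c d e" and summary "a b x c d", the fragments are "a b" and "c d", giving Coverage 0.8 and Density 1.6; a fully novel summary scores 0 on both.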
We also report the percentage of novel n-grams that are included in the summary but not in the document. A higher percentage of novel n-grams implies a more abstractive summary. As shown in Table 2, the percentage of novel n-grams in NARRASUM is higher than in CNNDM and PubMed, and is similar to XSum. This is in line with our observation from the Coverage-Density plot (Figure 3(b)). The difference is that XSum is a news summarization dataset with short, one-sentence summaries, whereas NARRASUM is a narrative summarization dataset, where the summaries are of varying length.
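The novel n-gram percentage is straightforward to compute. This sketch counts distinct n-grams, which is one of several reasonable conventions:

```python
def ngrams(tokens, n):
    """Set of distinct n-grams in a token list."""
    return set(zip(*(tokens[i:] for i in range(n))))

def novel_ngram_ratio(summary_tokens, doc_tokens, n=2):
    """Fraction of summary n-grams absent from the document; higher = more abstractive."""
    summ = ngrams(summary_tokens, n)
    if not summ:
        return 0.0
    return len(summ - ngrams(doc_tokens, n)) / len(summ)
```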

Quality Assessment
We further conduct a human evaluation to better assess the quality of NARRASUM. We randomly select 100 instances from the test set. For each instance, we ask three workers on Amazon Mechanical Turk to evaluate the summary in terms of faithfulness and informativeness. For faithfulness, we show annotators each summary sentence and ask them to evaluate how much of the information in the sentence is present in the document. This is a precision-oriented measure and is commonly used for summary evaluation (Lu et al., 2020). For informativeness, we ask annotators to first identify the most salient events and major characters in the document and then evaluate how much of that content is covered by the summary. This is a recall-oriented measure. Both human evaluations are collected on a Likert scale of 1-5 (1 means "none", and 5 means "almost all").
To control the annotation quality, we require human judges to be located in the United States and to have more than 1,000 approved HITs with an approval rate higher than 98%. We randomly check the annotation results and block the human judges who continually provide low-quality annotations. Human judges were paid a wage rate of $12 per hour, which is higher than the local minimum wage.
Figure 4 shows the distributions of the human evaluation results. It shows that 80% of the content in the summaries is faithful to the documents. For informativeness, 83% and 89% of summaries cover most of the salient events and characters, respectively. This demonstrates that NARRASUM is of high quality in both faithfulness and informativeness, and can foster further research on narrative summarization.

Baseline Models
We investigate the performance of several baselines and state-of-the-art neural summarization models on NARRASUM. We include both extractive and abstractive models. For extractive models, we use the following methods. RANDOM selects n sentences from the document randomly. LEAD selects the top n sentences of the document to compose the summary; this is a strong baseline for news summarization. TEXTRANK (Mihalcea and Tarau, 2004) is a graph-based extractive summarization model based on PageRank (Brin and Page, 1998) over a graph representation of sentences. LEXRANK (Erkan and Radev, 2004) is another graph-based extractive summarization model, based on eigenvector centrality. HSG (Wang et al., 2020) is a heterogeneous graph-based neural extractive summarization model that uses word co-occurrence to enhance sentence contextual representations. PRESUMM (Liu and Lapata, 2019) relies on a pre-trained language model to enhance the sentence representations during text encoding and extractive summarization. We choose BERT (Devlin et al., 2019), ROBERTA (Liu et al., 2019), and LONGFORMER (Beltagy et al., 2020) as the pre-trained models. BERT and RoBERTa limit the input length to 512 tokens, while Longformer can accept up to 4,096 tokens.
For abstractive models, we use the following pre-trained sequence-to-sequence models: BART (Lewis et al., 2020), T5 (Raffel et al., 2020), PEGASUS (Zhang et al., 2020), and LED (Beltagy et al., 2020). The input length of the first three models is limited to 512 (base version) or 1,024 (large version) tokens. LED uses Longformer as the encoder and can therefore accept up to 4,096 tokens as input.

Settings
We conduct experiments with the models described in Section 4 to evaluate their performance on NARRASUM. For extractive models, we follow the hyper-parameters of the original implementations. For abstractive models, we implement them using the Transformers library (Wolf et al., 2020). We fine-tune each model on the training set of NARRASUM with the AdamW optimizer (Loshchilov and Hutter, 2019) and a batch size of 64. We conduct a simple hyper-parameter search for the learning rate over {3e-4, 1e-4, 3e-5} based on the validation loss. We also adopt early stopping based on the validation loss to avoid overfitting. During inference, we use beam search with beam size 5. Each model was trained on a single Quadro RTX 5000 GPU in up to 34 hours, depending on the model size.
Evaluation. We evaluate the generated summaries using the ROUGE F1 score. We further include SummaC (Laban et al., 2022), an automatic measure of summary faithfulness. It achieves state-of-the-art performance on the benchmark of summary inconsistency detection, and it can be applied to long inputs and outputs.

Automatic Results
Table 3 shows the results on NARRASUM using extractive and abstractive summarization approaches.
Extractive Models. The supervised extractive methods outperform the unsupervised extractive methods (the first four models) on all measures by a large margin, indicating that NARRASUM can provide a strong supervision signal for identifying the salient information and creating the summary accordingly. The PreSumm-BERT and PreSumm-RoBERTa models underperform HSG because these models have a maximum input length of 512 tokens, whereas HSG can accept inputs of arbitrary length. Longformer achieves the best performance on extractive summarization by combining the advantages of pre-training and long-document processing. However, there is still a large gap between Longformer's performance and the oracle upper bound, indicating the challenges in narrative summarization.
Abstractive Models. Among these models, no particular model consistently outperforms the others on all subsets. Larger models consistently outperform smaller models, which is in line with previous research. T5 outperforms BART on most ROUGE scores, as they adopt summarization-specific pre-training objectives. LED outperforms the other models on ROUGE due to its ability to encode longer documents, which is consistent with the results for extractive summarization. However, LED performs worst on the SummaC-based faithfulness evaluation. This indicates that although the model can process longer documents, understanding and faithfully summarizing lengthy texts is still challenging.

Compression Degree. To better understand the models' capability under different degrees of compression, we split the test set into three similar-sized subsets based on the compression ratio of the summary. We then re-evaluate the models on each subset separately. We provide details of the data split and model performance in Appendix A.1. The results show that it is more challenging to create a short summary than a long one. The other observations made on the entire test set still hold across subsets with different levels of compression.

Human Evaluation
We further conduct a human evaluation on Amazon Mechanical Turk to better understand the models' behaviors and the challenges of this task. We randomly sample 100 instances from the test set and then evaluate the outputs of the two best systems (T5-Large and LED-Large) along the following four dimensions.
• Fluency: whether the summary is grammatically correct and free of repetition;
• Faithfulness: whether the summary is faithful to the original document;
• Coherence: whether the plot of the narrative summary is logically coherent;
• Informativeness: whether the summary reflects the salient events and characters in the original document.
For each instance, we show annotators the original document and the generated summaries. We ask annotators to rate the summaries using a 5-point Likert scale and report the average score over all instances. As shown in Table 4, while the pre-trained abstractive models are good at Fluency, they still struggle with the other dimensions: Faithfulness, Coherence, and Informativeness. This further indicates that narrative summarization is a challenging task for current models. In general, the summaries created by T5 are more fluent and faithful, while those created by LED are more coherent and informative. In Appendix A.2, we provide examples of summaries generated by various systems.

Analysis
We perform a series of analyses of the summary position and character consistency. For a fair comparison among models, we only choose test instances where the length of the document is shorter than the maximum input length of these models (1,024 tokens).

Analysis of Summary Position
A good narrative summary should preserve the original narrative structure, which contains the start, middle, and ending of the narrative. To investigate this, we adopt the method of Kim et al. (2019) to analyze the normalized position of summary bi-grams in the document, where 0 and 1 represent the start and ending of the document, respectively.
Figure 5 shows that while the relative positions of n-grams in the gold summaries are close to uniformly distributed (Figure 3(a)), the generated summaries are still biased towards the beginning of the original document. This indicates that current models have difficulty understanding the entire document and preserving the narrative structure.

Character-Wise Analysis
Characters are essential to narratives. Since characters are not considered in ROUGE scores, here we propose to measure character consistency by examining whether the major characters in the document are also mentioned in the summary. We assume that major characters appear more frequently in the narrative text. By comparing the distance between the frequency distributions of characters in the document and in the summary, we can understand how well the summary includes the major characters of the document.
To this end, we first identify the characters in the narrative. We run a coreference resolution model to extract clusters of entity mentions, and we keep only person entities to obtain clusters of characters. We regard each cluster's size as the frequency of the corresponding character and then normalize it into a probability. We measure character inconsistency as the cross-entropy (CE) between the two frequency distributions of characters. A higher CE implies a higher character inconsistency.
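One way to realize this character-consistency measure, with add-epsilon smoothing so that characters absent from the summary do not yield infinite cross-entropy. The smoothing constant and the direction of the CE (document distribution measured against the summary distribution) are our assumptions, as the paper does not pin them down:

```python
import math

def char_distribution(mention_counts, smoothing=1e-9, vocab=None):
    """Normalize character mention counts (coreference cluster sizes) into a
    smoothed probability distribution over a shared character vocabulary."""
    vocab = vocab or set(mention_counts)
    total = sum(mention_counts.get(c, 0) + smoothing for c in vocab)
    return {c: (mention_counts.get(c, 0) + smoothing) / total for c in vocab}

def character_cross_entropy(doc_counts, summary_counts):
    """CE between the document's and summary's character frequency distributions;
    higher values indicate the summary drops or misweights major characters."""
    vocab = set(doc_counts) | set(summary_counts)
    p = char_distribution(doc_counts, vocab=vocab)
    q = char_distribution(summary_counts, vocab=vocab)
    return -sum(p[c] * math.log(q[c]) for c in vocab)
```

A summary mentioning both leads in the same proportions as the document scores near the document's own entropy, while one that drops a major character is heavily penalized.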
In Figure 6, we group the test instances of NARRASUM by the number of distinct characters and show the cross-entropy for the gold summary and the generated summaries. Compared with the gold summaries, the generated summaries are less consistent with the document at the character level. In general, the difference in cross-entropy between the gold summary and the generated summaries increases with the number of characters, indicating that it is harder for the summarizer to maintain character-level consistency when the document describes more characters.

Application to Other Tasks
Besides presenting NARRASUM as a benchmark for narrative summarization, we further explore the broader benefits of this dataset for narrative-related tasks. We first investigate whether pre-training on NARRASUM can improve performance on other narrative summarization tasks. To this end, we pre-train a BART-Large model on NARRASUM and then finetune it on Novel Chapter and BookSum-Paragraph. We compare with finetuned models without pre-training on NARRASUM. As shown in Table 6, pre-training on NARRASUM improves model performance on both datasets, indicating that NARRASUM is beneficial to other narrative summarization tasks.
We then investigate whether NARRASUM can help the model learn general knowledge of narrative understanding and summarization. For this, we first pre-train a BART-Large model on NARRASUM and then apply it to several downstream tasks in a zero-shot manner. We choose five tasks that are designed for narrative understanding, i.e., MCTest (Richardson et al., 2013), MovieQA (Tapaswi et al., 2016), LiSCU (Brahman et al., 2021), CBT (Hill et al., 2016), and QuAIL (Rogers et al., 2020), and one task for narrative summarization, i.e., Reddit TIFU (Kim et al., 2019). For each task, we provide the corresponding task description, method, and evaluation measure in Appendix A.3.
We use models trained on the summarization task to solve these tasks in a zero-shot manner. In other words, we do not use any training data from these tasks. For discriminative tasks, we first convert the (question, answer) pair into a statement using a T5 model (Chen et al., 2021), and then evaluate the probability of generating each statement conditioned on the document (Zhao et al., 2022b). We choose the candidate with the highest generation probability as the predicted answer. Models are evaluated using accuracy. For the summarization task, we directly apply the trained model to create the summary. Models are evaluated using the ROUGE-1 F measure.
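The zero-shot selection and evaluation logic can be sketched as follows. Here `overlap_score` is a toy stand-in for the model's conditional generation probability (the paper scores candidates with a trained summarizer, not token overlap), and `rouge1_f` is a simplified unigram ROUGE-1 F without the stemming or preprocessing of the official toolkit.

```python
from collections import Counter

def pick_answer(document, statements, score_fn):
    """Zero-shot discriminative answering: score each candidate statement
    conditioned on the document and return the highest-scoring one. In the
    paper, score_fn would be the model's log P(statement | document)."""
    return max(statements, key=lambda s: score_fn(document, s))

def overlap_score(document, statement):
    """Toy stand-in scorer (fraction of statement tokens found in the
    document); used here only to exercise the selection logic."""
    doc_tokens = set(document.lower().split())
    stmt_tokens = statement.lower().split()
    return sum(tok in doc_tokens for tok in stmt_tokens) / len(stmt_tokens)

def rouge1_f(reference, hypothesis):
    """Simplified ROUGE-1 F: harmonic mean of unigram precision and recall."""
    ref = Counter(reference.lower().split())
    hyp = Counter(hypothesis.lower().split())
    overlap = sum((ref & hyp).values())
    if overlap == 0:
        return 0.0
    p, r = overlap / sum(hyp.values()), overlap / sum(ref.values())
    return 2 * p * r / (p + r)

doc = "Penny forgives Leonard and Sheldon for cleaning her apartment"
candidates = [
    "Penny forgives Leonard and Sheldon",
    "Raj marries an Indian girl",
]
best = pick_answer(doc, candidates, overlap_score)
```

Swapping `overlap_score` for a model-based log-likelihood reproduces the paper's setup for the discriminative tasks.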
We compare the model pre-trained on NARRASUM with those pre-trained on other narrative summarization datasets such as Novel Chapter and BookSum. As shown in Table 5, the model pre-trained on NARRASUM achieves better performance on all narrative-related downstream tasks than those pre-trained on other datasets. This indicates that NARRASUM contains high-quality knowledge about narrative understanding and summarization, which can be beneficial to general narrative-related tasks as well.

Conclusion
We present NARRASUM, a large-scale narrative summarization dataset that contains plot descriptions of movies and TV episodes and their corresponding summaries. Narratives in NARRASUM are of diverse genres, and the summaries are highly abstractive and of varying lengths. Summarizing the narratives in NARRASUM requires narrative-level understanding, which poses new challenges to current summarization methods. Experiments show that current models struggle with creating high-quality narrative summaries. We hope that NARRASUM will promote future research in text summarization, as well as broader NLP studies such as machine reading comprehension, narrative understanding, and creative writing.

Limitations
One limitation of NARRASUM, shared with other automatically constructed datasets, is that we cannot guarantee complete faithfulness of the summary to the document. To alleviate this issue, we first collect a large-scale dataset and then apply strict rules to select a high-quality subset. The human evaluation and the comparison with other datasets demonstrate that this trade-off is worthwhile. Another limitation is that NARRASUM does not cover all narrative types, such as books, scripts, and personal stories. For those purposes, we suggest readers explore other summarization datasets (Gorinski and Lapata, 2015; Ouyang et al., 2017; Kim et al., 2019; Ladhak et al., 2020; Papalampidi et al., 2020; Kryściński et al., 2021; Chen et al., 2022).

Broader Impact
Besides its contribution to the research field of text summarization, this dataset may spark interest in the broader NLP community. For example, in machine reading comprehension, our paired plot descriptions with low lexical overlap can improve a model's capacity for complex reasoning and understanding (Saha et al., 2018). In narrative understanding, a summary of the narrative can help identify the salient events (Zhang et al., 2021b) as well as the causal, temporal, and hierarchical relationships of events (Hidey and McKeown, 2016; Yao et al., 2020). In creative writing and storytelling, this dataset can support research on expanding a short story outline into a more detailed story (Ammanabrolu et al., 2020).
We collect and use publicly available resources for research purposes only, which falls under fair use. This dataset should not be deployed in the real world as anything other than a research prototype, especially commercially.
There is the possibility of (potentially harmful) social biases existing in the movies or TV episodes and therefore being introduced into the dataset. While such biases should have a limited impact on summarization systems (e.g., introducing harmful biases into the summary when there are no such biases in the document), we suggest that users evaluate these biases and their impact on downstream tasks such as creative writing and storytelling, and modify either the dataset or their models accordingly to avoid such biases.
MCTest (Richardson et al., 2013) is a dataset designed for open-domain reading comprehension. The dataset contains 500 fictional stories, with four multiple-choice questions per story.
CBT (Hill et al., 2016) is also a dataset designed for open-domain reading comprehension. The dataset builds question-answer pairs from 108 children's books with clear narrative structure.
MovieQA (Tapaswi et al., 2016) aims to evaluate models' ability at automatic story comprehension. The dataset consists of 14,944 multiple-choice questions sourced from 408 movies. Each question has five options. We use the movie summaries as input to answer these questions.
LiSCU (Brahman et al., 2021) is a character-centric narrative understanding task that tests model performance from the perspective of characters. The dataset contains 1,708 literature summaries and 9,499 character descriptions. Given the literature summary, the model needs to identify the character's name from an anonymized character description and a list of character candidates.
QuAIL (Rogers et al., 2020) is a machine reading comprehension benchmark with varying types of reasoning. Solving this challenge requires an understanding of not only the text-based information from the document but also world knowledge and commonsense knowledge. Documents in QuAIL are collected from fiction, user stories, and so on. Each question has four options.
Reddit TIFU (Kim et al., 2019) is an abstractive summarization dataset. It consists of 120K crowd-generated posts from the online discussion forum Reddit, together with their corresponding summaries. Unlike the other narrative summarization datasets discussed in this paper, narratives in Reddit TIFU are mostly written in informal, conversational text, and each story is about the poster doing something wrong or messing everything up. These features make Reddit TIFU good out-of-domain test data for evaluating models' generalization power for narrative summarization.

Document: It is the girls' second year at PCA. Dana has been accepted to a European exchange program, so Zoey and Nicole think they will have their dorm room to themselves. However, Coco informs them they will be getting a new roommate. Zoey and Nicole go to the housing office hoping to choose a roommate, but their request is denied (after Nicole accidentally destroys the secretary's perfume collection). When they return to their dorm, they find their new roommate, Lola Martinez, a goth-punk girl who disturbs them by drinking raw eggs and claiming to talk to the dead. As Lola's behavior grows more erratic, Zoey and Nicole stay out of the dorm, but they later discover that Lola is not a goth-punk but an aspiring actress who managed to fool them with her acting skills. Meanwhile, Michael and Chase have to deal with Logan, whose father has given him a huge entertainment center for their dorm room, which turns out to be a huge headache.
Gold Summary: Zoey and Nicole return to PCA for their second year. But they soon find out Dana will not be returning, and a very strange roommate, Lola, has come to take her place. The year gets off to a bad start for Chase and Michael because of Logan's very loud gifts from his dad.
Bart-large: [Missing: Zoey returns to PCA and Dana will not return] Zoey and Nicole get a new roommate, Lola Martinez, a goth-punk girl who disturbs them by drinking raw eggs and claiming to talk to the dead. Meanwhile, Michael and Chase have to deal with Logan, whose father has given him a huge entertainment center for their dorm room, which turns out to be a huge headache.
T5-Large: Zoey and Nicole get a new roommate, Lola Martinez, who turns out to be a goth-punk. Meanwhile, Michael and Chase have to deal with Logan, whose father has given him a huge entertainment center for their dorm room. [Fails to follow the causal and temporal relationships] Meanwhile, Dana is accepted to a European exchange program, so the girls think they'll have their own room, but Coco tells them that they'll be getting a roommate.
Pegasus-Large: Zoey and Nicole's first year at PCA is coming to an end, and they're looking forward to living in their own dorm room. However, Coco informs them that Dana has been accepted to a European exchange program, so Zoey and Nicole think [Not logically coherent] they will have their own room to themselves. Zoey and Nicole go to the housing office hoping to choose a roommate, but their request is denied (after Nicole accidentally destroys the secretary's perfume collection).
LED-Large: Zoey and Nicole are thrilled when Dana gets accepted to a European exchange program at PCA, [Not fluent] but when they find out they have to share a dorm room with Lola Martinez, a goth-punk girl who drinks raw eggs and talks to the dead. Meanwhile, Michael and Chase have to deal with Logan, whose father has given him a huge entertainment center for their dorm room, which turns out to be a huge headache.
Table 9: Sample summaries generated by baseline models. We show the original document, the gold summary, and the output summaries from four large models. We highlight the typical errors of each output summary.

Figure 1 :
Figure 1: Example of the narrative summarization task. The input is a narrative text (denoted by "Document"; pictures are not included), and the output is a summary containing its salient events and characters.

Figure 3 :
Figure 3: The upper figures show the relative positions of bi-grams of the gold summary in the document. The summary content of NARRASUM is more uniformly distributed over the entire document. The lower figures show the Coverage-Density plots. Compared with CNNDM and PubMed, the summary abstractiveness of NARRASUM is closer to that of XSum.

Figure 4 :
Figure 4: Human assessment results of the quality of NARRASUM.

Figure 5 :
Figure 5: The relative positions of bi-grams of the predicted summaries in the document.

Figure 6 :
Figure 6: Character inconsistency between documents and summaries w.r.t. the number of characters in the document.

Table 2 :
Comparison of novel n-grams between NARRASUM and other summarization datasets. (The comparison of dataset statistics is shown in Table 1.)

Table 3 :
Performance of baseline models evaluated on the test set of NARRASUM over ROUGE-1 (R-1), ROUGE-2 (R-2), ROUGE-L (R-L), and SummaC (SC). SC is only used to evaluate abstractive summaries, as extractive summaries are faithful by design. We highlight the best scores separately for extractive and abstractive systems. * indicates a statistically significant difference compared with the second-best score (bootstrap resampling, p < 0.05 (Koehn and Monz, 2006)).

Table 4 :
Human evaluation of the generated summaries.

Table 5 :
Zero-shot performance (Accuracy or Rouge-1) of the model trained on NarraSum and those on other summarization datasets.