Salience-Aware Event Chain Modeling for Narrative Understanding

Storytelling, whether via fables, news reports, documentaries, or memoirs, can be thought of as the communication of interesting and related events that, taken together, form a concrete process. It is desirable to extract the event chains that represent such processes. However, this extraction remains a challenging problem. We posit that this is due to the nature of the texts from which chains are discovered. Natural language text interleaves a narrative of concrete, salient events with background information, contextualization, opinion, and other elements that are important for a variety of necessary discourse and pragmatics acts but are not part of the principal chain of events being communicated. We introduce methods for extracting this principal chain from natural language text, by filtering away non-salient events and supportive sentences. We demonstrate the effectiveness of our methods at isolating critical event chains by comparing their effect on downstream tasks. We show that by pre-training large language models on our extracted chains, we obtain improvements in two tasks that benefit from a clear understanding of event chains: narrative prediction and event-based temporal question answering. The demonstrated improvements and ablative studies confirm that our extraction method isolates critical event chains.


Introduction
Human languages always communicate about evolving events. Hence, it is important for NLP systems to understand how events are procedurally described in text. In this context, identifying patterns of event chains is important but challenging, as it requires knowledge of inter-event relations such as temporal or causal relations. Modeling high-quality event chain patterns from text would be a first step toward the more general goal of event schema induction, which involves generating high-level representations of event relations and structures. The extracted chain patterns can in turn represent useful information for core natural language tasks such as question answering (Reddy et al., 2019), semantic role labeling (Cheng and Erk, 2018), story generation (Yao et al., 2019), and reading comprehension (Ostermann et al., 2019).
Figure 1: An illustration of event chains and salience-aware and discourse-aware filtering on an example text. The words in italic are the events and the arrows between them show the temporal relation given by TEAR (pointing from previous to next). Salient events are shown in bold; the blue sentence is in the Main Event category whereas the red sentence is in the Evaluation category.
Generally speaking, "events" correspond to what we perceive as happening around us. According to the theory of embodied cognition (Wilson, 2002), our understanding of the world is shaped by various aspects of our entire bodies, involving also our language comprehension. However, it is currently difficult to gather enough data from other modalities to model real world "events," and written text, especially in the news domain, seems to be our best option.
Previous attempts have been made to generate event chains by modeling narratives in news, stories and documentaries (Chambers and Jurafsky, 2008; Weber et al., 2018b; Li et al., 2020). The problem with such data is that the articles are usually a mixture of the true narrative flow with other content, which serves to explain the context or provide side information. Most prior approaches do not take into account the centrality or salience of events or the discourse structures that describe those events. Accordingly, overlooking such important discourse properties when choosing the events of a sequence introduces noise in the modeling of event chains, and causes trivial or irrelevant content to be captured and inferred in narrative understanding tasks.
In this work, we explore the use of salience identification (Liu et al., 2018; Jindal et al., 2020) and discourse profiling (Choubey et al., 2020) to help isolate the main event chains from other distracting events, and show the effect on two recent datasets related to narrative understanding and temporal understanding. More specifically, we obtain event chains from documents and perform different methods of filtering, then build event language models, which we use to predict story ending events from the ROCStories dataset (Mostafazadeh et al., 2016) and answer temporal event questions from the TORQUE dataset (Ning et al., 2020). By comparing the use of event language models built on differently filtered event sources, we show that filtering out distracting narrative information can indeed benefit the modeling of event relations, thus leading to more reliable extraction of useful event chain patterns.
The main contributions in this work are threefold. (1) To support narrative understanding tasks with high-quality event chains, we develop a new event chain extraction method that is discourse-aware, and particularly, salience-aware.
(2) We show the effectiveness of salience-based and discourse-aware filtering of event chains in improving narrative prediction, which leads to improvement on the Story Cloze Test. (3) We further demonstrate that the discourse patterns captured by the filtered event chains enhance language models on temporal understanding of events, leading to state-of-the-art performance on answering temporal ordering questions.

Event Chains and Narratives
Information extraction techniques have evolved to extract event mentions as well as their ordering information (Han et al., 2019), hence enabling the automated induction of raw event chains from text. Tools such as TEAR (Han et al., 2019) have been developed, which we can use to extract both the events and temporal ordering from text documents, like those shown in Figure 1. These sequences of events, connected in a temporal order, form linear event chains, which can be viewed as a form of linear representation of the progress of described events in a narrative.
However, events that co-occur or are described together in an article are often not equally important. Some events are salient: they are relevant to the main topic of a text context, or play essential roles with regard to the central goal of an event chain (Liu et al., 2018). For example, from the paragraph in Figure 1, the events detained, alerted and questioned are salient components that constitute the progress of a described story. Other events may describe how the salient events came to be known or involve other events that happened alongside salient events but are not critical to understanding the core actions of the story. In Figure 1, said and establish are not salient. The goal of event salience detection systems (Liu et al., 2018; Jindal et al., 2020) is to identify events that are essential components of the narrative, which can help us filter out trivial or distracting event mentions from event chains.
From another related perspective, modeling discourse structures of the article is also helpful for understanding which parts of the text directly report on main events and which parts serve other supportive roles. Discourse profiling (Choubey et al., 2020) seeks to analyze the functional role of each sentence, which is different from the event-level prediction of salience. van Dijk (1988)'s theory of news discourse outlines an ontology of sentence types that are seen in most news articles. According to the theory, the first sentence in Figure 1 is a Main Event sentence, while the second is an Evaluation sentence. We may classify the members of van Dijk's ontology into types that are core to understanding the event sequence of a story (e.g. Main Event) and types that are not (e.g. Evaluation), and further refine extracted event sequences.
Though salience and discourse structures represent different perspectives of narrative analysis, they can be concurrently modeled in an event chain extraction system, leading to more effective filtering of mined event chain representations.

Method
To obtain the interesting event chains from a document, we first use the TEAR tool by Han et al. (2019) to generate a temporal relation graph. Then we apply different levels of filtering on the extracted events, which correspond to the nodes in the event graph structure. Finally, we extract linear chains of events from the filtered graph structure by following along the directed edges. In our evaluation, we will use these linear chains instead of raw text to pre-train language models such as RoBERTa (Liu et al., 2019). Our overall pipeline is shown in Figure 2. Details of each step in the pipeline are described as follows.
Figure 2: System diagram of our approach along with an example. Solid lines indicate inference and dotted lines indicate training. Colors separate different data streams. Events in temporal order are extracted from the news article from Figure 1. Salience detection, trained on (article, abstract) pairs, filters out unimportant individual events from the example as well as labeled news discourse training data. Our salience-aware discourse parsing model removes sentences that do not contribute to the direct story line. The important event chain is used to fine-tune a masked language model, which is used to predict story completion. A similar pipeline is used for the TORQUE task.

Event Chain Extraction
We adopt the joint structured event-relation extraction model from Han et al. (2019). It uses pretrained BERT (Devlin et al., 2019) embedding vectors of the input text, which are further fed into an RNN-based scoring function for both event and relation extraction. During the end-to-end training, a MAP inference framework sets up a global objective function using local scoring functions to get the optimal assignment of event and relation labels. To ensure that we obtain globally coherent assignments, several logical constraints are specified including event-relation consistency and symmetry/transitivity consistency, so that the output event graph is logically consistent. This end-to-end model extracts both events and event relations simultaneously.
We restrict the event relation labels to BEFORE and NONE. We can regard each BEFORE relation as a directed edge in the temporal relation graph that points from the previous event to the next. Finally, to extract linear chains from this directed graph, we repeatedly choose the starting node in a topological order, and start a walk along the directed edge to obtain a maximum chain.
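The chain extraction step can be illustrated with a minimal sketch. The text does not specify how a walk chooses among several outgoing edges or whether events may be reused across chains, so the code below assumes that each event appears in exactly one chain and that a walk always moves to the unvisited successor that comes earliest in the topological order. The function names and the toy BEFORE edges are hypothetical, not taken from TEAR.

```python
from collections import defaultdict, deque


def topological_order(nodes, edges):
    """Kahn's algorithm; `edges` is a list of (earlier, later) BEFORE pairs."""
    successors = defaultdict(list)
    indegree = {node: 0 for node in nodes}
    for earlier, later in edges:
        successors[earlier].append(later)
        indegree[later] += 1
    queue = deque(node for node in nodes if indegree[node] == 0)
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)
    if len(order) != len(nodes):
        raise ValueError("BEFORE graph contains a cycle")
    return order, successors


def extract_chains(nodes, edges):
    """Greedy walks from unvisited nodes, taken in topological order.

    Assumption: every event is used in exactly one chain.
    """
    order, successors = topological_order(nodes, edges)
    rank = {node: i for i, node in enumerate(order)}
    visited = set()
    chains = []
    for start in order:
        if start in visited:
            continue
        chain = [start]
        visited.add(start)
        current = start
        while True:
            candidates = [n for n in successors[current] if n not in visited]
            if not candidates:
                break
            # Assumption: prefer the successor earliest in topological order.
            current = min(candidates, key=rank.__getitem__)
            chain.append(current)
            visited.add(current)
        chains.append(chain)
    return chains


# Toy graph: the event words come from Figure 1, the edges are invented.
events = ["alerted", "detained", "questioned", "said"]
before = [("alerted", "detained"), ("detained", "questioned"),
          ("alerted", "questioned")]
chains = extract_chains(events, before)
print(chains)
```

In this toy graph the event said has no BEFORE edges, so it ends up as a single-event chain; whether such singleton chains are kept or discarded is not stated in the text.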

Event Salience Filtering
For the event salience detection task, we follow the Kernel-based Centrality Estimation model by Liu et al. (2018). This model combines the various event salience features such as frequency, sentence location and average cosine similarity with other events or entities, with Gaussian kernels that model event-event and entity-event interactions. For the final salience score we apply an additional sigmoid function and use binary cross-entropy loss for training. In this way, we formulate the task as event-level binary classification.
We train the event salience model on the New York Times (NYT) Annotated Corpus (Sandhaus, 2008), a collection of news articles with expert-written abstracts. During training, we use the frame-based event mention annotation by Liu et al. (2018), and for inference on new articles, we extract the event mentions using TEAR. Each event mention is labeled as salient if its lemma appears in the associated abstract. The trained salience model can then assign a salience score between 0 and 1 to each event extracted from a document, and we use a threshold of 0.5 to perform the final filtering of the events.
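The labeling rule and the threshold-based filter can be sketched as follows. This is only an illustration: the function names, the toy lemmas, and the scores are hypothetical, lemmatization is assumed to have been done upstream, and treating a score of exactly 0.5 as salient is an assumption, since the text does not say whether the comparison is strict.

```python
def label_salient(event_lemmas, abstract_lemmas):
    """Training labels: an event is salient if its lemma occurs in the abstract."""
    abstract = set(abstract_lemmas)
    return [(lemma, lemma in abstract) for lemma in event_lemmas]


def filter_by_salience(scored_events, threshold=0.5):
    """Keep events whose predicted salience score reaches the threshold.

    Assumption: the comparison is inclusive (score >= threshold).
    """
    return [event for event, score in scored_events if score >= threshold]


# Hypothetical lemmas and model scores, for illustration only.
labels = label_salient(["detain", "say", "question"],
                       ["police", "detain", "question", "suspect"])
kept = filter_by_salience([("detained", 0.91), ("said", 0.12),
                           ("questioned", 0.50)])
print(labels)
print(kept)
```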

Discourse Salience Filtering
For the discourse parsing model, we follow the hierarchical architecture by Choubey et al. (2020) to assign each sentence one of the eight fine-grained content labels modified from van Dijk's theory of news discourse (1988). The sentences that are labeled as either Main Event, Consequence, Previous Event or Current Context are considered relevant to the main sequence, and kept in the sentence-level filtering process.
Figure 3: Architecture of salience-aware discourse parsing model. The addition of the E_t encoding for each sentence, which is an average of the salient event encodings, provides additional event-level salience information not seen in the original discourse parsing model.
One straightforward way to incorporate information from event salience in the filtering procedure is to first apply discourse filtering, and then keep only salient events from the discourse-filtered sentences, but this may propagate errors in the process and filter away too many events. We instead use event salience information to improve the performance on the discourse parsing task. As shown in Figure 3, we modify the hierarchical bi-LSTM model from the work by Choubey et al. (2020). The input is a document consisting of sentences s_1, s_2, ..., s_n. Each sentence s_t, which is a sequence of tokens (w_{t1}, w_{t2}, ..., w_{tk}), is first transformed to a sequence of ELMo (Peters et al., 2018) word representations, and then to hidden vectors (h_{t1}, h_{t2}, ..., h_{tk}) in the word-level bi-LSTM layer. Then, we concatenate the original intermediate sentence encoding S_t, obtained from a soft-attention-weighted sum of the hidden vectors, with E_t, which is an average over the hidden vectors of salient events within the sentence. These concatenated vectors are fed into the sentence-level bi-LSTM layer to generate sentence encodings H_1, ..., H_t, ..., H_n, on top of which the final prediction layer and softmax are stacked, following the original model architecture by Choubey et al. (2020). Specifically, we compute the document encoding D as a soft-attention-weighted sum over the sentence encodings, and the final sentence representation of sentence t is the concatenation of H_t together with the element-wise product and difference between H_t and D. The final sentence representation is used in a two-layer feed-forward neural network to make the final prediction of the sentence's news discourse label.
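The two concatenation steps in this architecture can be made concrete with a small sketch that uses plain Python lists in place of tensors. It leaves out the ELMo embeddings, both bi-LSTM layers and the attention weights, and only shows how E_t is averaged and how the final sentence representation is assembled. The function names are hypothetical, and using a zero vector for E_t when a sentence has no salient event is an assumption that the text does not confirm.

```python
def mean_vector(vectors, dim):
    """Average a list of equal-length vectors (zero vector if the list is empty)."""
    if not vectors:
        # Assumption: sentences without salient events get a zero E_t.
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def sentence_input(S_t, salient_hidden, dim):
    """Concatenate the attention-pooled encoding S_t with E_t."""
    E_t = mean_vector(salient_hidden, dim)
    return S_t + E_t


def final_representation(H_t, D):
    """Concatenate H_t with its element-wise product and difference with D."""
    product = [h * d for h, d in zip(H_t, D)]
    difference = [h - d for h, d in zip(H_t, D)]
    return H_t + product + difference


# Toy two-dimensional vectors, for illustration only.
x_t = sentence_input([1.0, 2.0], [[2.0, 4.0], [4.0, 8.0]], dim=2)
r_t = final_representation([1.0, 2.0], [3.0, 5.0])
print(x_t)
print(r_t)
```

With hidden size d, the input to the sentence-level layer has size 2d in this sketch, and the final representation has three times the size of H_t.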
We train the described model on the News-Discourse corpus with annotated content labels (Choubey et al., 2020), following their training setup. After we use our trained discourse parsing model on an input document, we filter the document down to only the salience-aware discourse-filtered sentences, which are sentences classified as one of Main Event, Consequence, Previous Event and Current Context, and keep only events from these sentences. We also try keeping only salient events from the filtered sentences, but it leads to worse performance as shown in our experiments.
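The sentence-level filter described above reduces to a set-membership test once every sentence has a predicted label. The sketch below assumes a hypothetical input format (one dict per sentence with a `label` and its `events`); the four kept label names come from the text, while the assignment of events to sentences in the example is invented.

```python
KEPT_LABELS = {"Main Event", "Consequence", "Previous Event", "Current Context"}


def discourse_filter(sentences):
    """Keep all events from sentences whose predicted label is in KEPT_LABELS."""
    kept_events = []
    for sentence in sentences:
        if sentence["label"] in KEPT_LABELS:
            kept_events.extend(sentence["events"])
    return kept_events


# Hypothetical parsed document, for illustration only.
document = [
    {"label": "Main Event", "events": ["alerted", "detained", "questioned"]},
    {"label": "Evaluation", "events": ["said", "establish"]},
]
filtered = discourse_filter(document)
print(filtered)
```

Note that this keeps every event in a retained sentence, salient or not, which matches the better-performing variant reported in the text.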

Event Language Models
Once we obtain the linear event chains after performing the extraction and filtering from a text dataset, we treat the sequence of event mentions in each chain as a sequential context that would be used for training or fine-tuning a language model. For example, we can fine-tune a pre-trained Transformer language model based on the event chains (Section 4.1), or capture the sequences by training an RNN language model from scratch (Section 4.2).
Once we obtain such a language model, we seek to use it to help with narrative prediction by predicting the continuation of an event chain. We can also leverage the fine-tuned model to support question answering regarding temporal orders of events.

Experiments
It is difficult to directly evaluate the quality of event chains in an intrinsic way. Some works on event schemas ask human annotators to rate qualities of generated chains (Balasubramanian et al., 2013;Weber et al., 2018a,b), which can be subjective. We instead turn to extrinsic evaluation tasks that depend on implicitly understanding typical sequences of meaningful events in order to be completed usefully.
TORQUE (Ning et al., 2020), short for Temporal ORdering QUEstions, is a machine reading comprehension benchmark that requests a model to answer questions regarding temporally specified events (e.g., "What happened before the snow started?") in a reference article. We hypothesize that models trained on unfiltered event chains are less likely to focus on the relevant events requested by the question, and that training on our filtered event chains can improve on this.
ROCStories (Mostafazadeh et al., 2016) is a narrative prediction dataset consisting of five-sentence short stories, where each sentence of a story contains a core event. A test set included with ROCStories contains two candidate endings to each partially complete story, where one of them is plausible. While prior work has successfully leveraged event chains to infer the endings (Chaturvedi et al., 2017), we believe a model trained on relevant chains of events should be able to better distinguish the relevant ending from the irrelevant one than a model trained on event chains that contain irrelevant events. Other works such as Sun et al. (2019) have directly tried to maximize performance on this corpus; we do not seek to directly compete with that work here, but rather use ROCStories as a means of establishing that our isolation of important events at the event language model level does positively influence event ending prediction, indicating that the sequences we find do indeed contain more relevant events.
In this section we verify the value of our event-filtering model on these tasks. We also evaluate our proposed discourse parsing model to show the usefulness of combining salience information at multiple levels.

Answering Temporal Ordering Questions
Dataset The TORQUE dataset has 3.2k news snippets and 21.2k user-provided questions. We follow the original data split given by Ning et al. (2020), using the training set for fine-tuning the language model for reading comprehension, and the dev set for evaluation. Each question asks about specific temporal relationships of the events in a news passage, and requests a sequence of event mentions from the passage that answer the question.

Evaluation Details
We follow Ning et al. (2020) to fine-tune a RoBERTa-large (Liu et al., 2019) model on the training set of TORQUE, which has a perceptron layer that classifies whether each token in the passage is in the answer or not. The input to the model is the question followed by the passage, separated by the separator token. We follow the same approach, but further fine-tune the model on inputs containing the extracted (and possibly filtered) event chain, rather than the entire passage. For training, we use the same batch size and learning rate as the baseline approach by Ning et al. (2020) in each experiment. We evaluate on the two standard metrics used in question answering, namely macro F1 and exact-match (EM), on the development set, comparing between models finetuned on filtered event chains and unfiltered chains, and, in Table 1, report the average performance over the 3 training runs started from different random seeds. As a comparison, we also fine-tune a GPT-2-based model (Radford et al., 2019), which leads to better results as shown in the rightmost column in Table 1.
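For readers unfamiliar with the two metrics, the sketch below shows one common way to compute a per-question F1 (then averaged over questions) and exact match when each answer is a set of event mentions. This is an illustrative approximation and not the official TORQUE scorer, whose token-level details may differ; the function names and the toy predictions are hypothetical, and scoring an empty prediction against an empty gold answer as 1.0 is an assumption.

```python
def question_f1(predicted, gold):
    """F1 between predicted and gold answer sets for a single question."""
    predicted, gold = set(predicted), set(gold)
    if not predicted and not gold:
        # Assumption: correctly predicting "no answer" counts as a full match.
        return 1.0
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def evaluate(predictions, golds):
    """Return (macro F1, exact match), both averaged over questions."""
    f1_scores = [question_f1(p, g) for p, g in zip(predictions, golds)]
    exact = [float(set(p) == set(g)) for p, g in zip(predictions, golds)]
    return sum(f1_scores) / len(f1_scores), sum(exact) / len(exact)


# Hypothetical predictions and gold answers for three questions.
predictions = [["detained", "questioned"], ["said"], []]
golds = [["detained", "questioned"], ["alerted", "said"], []]
macro_f1, exact_match = evaluate(predictions, golds)
print(round(macro_f1, 4), round(exact_match, 4))
```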

Results and Analysis
As we see in Table 1, we improve the performance on predictions of events over the baseline by fine-tuning on event chains constructed in various ways. Also, using filtered event chains gives better results than using unfiltered chains, regardless of the method of filtering. The model achieves the highest score when we use our salience-aware discourse filtering method on event chains, which shows the effectiveness of our approach of combining salience and discourse information. The last two rows of Table 1 show that keeping only salient events from the discourse-filtered sentences leads to worse results than keeping all events from the discourse-filtered sentences. We see the same trend in the last column, which suggests that the improvements from salience and discourse hold on GPT-2 as well as on RoBERTa fine-tuning. Comparing between the numbers from the two columns of "textual order" and "TEAR order" in the table, we see the order of event chains does not seem to affect the performance much, probably because the number of events per document is not very large.
Table 1 (columns): Training Setting; Textual order (F1, EM); TEAR order (F1, EM); TEAR (GPT-2) (F1, EM). The first row is Baseline (Ning et al., 2020).
To further illustrate the effect of the various filtering methods on temporal understanding, we perform a breakdown of the questions in the dev set into different types, using prefix matching as shown in Figure 4. We define "standard" questions as those directly asking about Before, After or Co-occurring relations, without the additional challenges from the fuzziness of time scopes or non-factual modality, as mentioned by Ning et al. (2020). The default questions ask about which events had already happened, were ongoing, or were still in the future, which we regard as not in the "standard" category. After examining the F1 scores on these "standard" categories in Table 2, we see that the model achieves a larger improvement with our filtering methods on these questions than on all questions, even though these are categories that already have a higher baseline performance.
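The prefix-based breakdown of question types can be sketched as a simple lookup. Only the prefix "What happened before" is given in the text; the other prefixes in the table below are hypothetical placeholders chosen for illustration, and the actual prefixes used for the After and Co-occurring categories may differ.

```python
# Only "what happened before" is confirmed by the text; the remaining
# prefixes are assumptions made for this illustration.
PREFIXES = {
    "Before": ("what happened before",),
    "After": ("what happened after",),
    "Co-occurring": ("what happened during", "what happened while"),
}


def question_type(question):
    """Return the relation queried by a question, or "Other" if no prefix matches."""
    normalized = question.strip().lower()
    for relation, prefixes in PREFIXES.items():
        if normalized.startswith(prefixes):
            return relation
    return "Other"


print(question_type("What happened before the snow started?"))
print(question_type("What had already happened?"))
```

As the text notes for its own categorization, matching on prefixes is only a rough approximation of the question type.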

Narrative Prediction
Dataset We evaluate the effectiveness of our extracted event chains on narrative prediction using the ROCStories dataset (Mostafazadeh et al., 2016). The training set contains 98,161 five-sentence stories, and the development and test sets each consist of 1,871 instances of four-sentence stories together with a correct and an incorrect ending. The goal is to predict the correct ending sentence given the two candidate choices. We use the development set but not the training set for supervised evaluation, following previous approaches (Chaturvedi et al., 2017;Srinivasan et al., 2018), as detailed below.
Figure 4: Distribution of questions in the TORQUE dev set, showing only the most frequent prefixes related to Before, After or Co-occurring relations. For example, a total of 229 questions start with "What happened before", which query the Before relation. 191+44 questions query the After relation, and 39+19 questions query the Co-occurring relation. We define these three categories of questions as "standard" type. This prefix matching is only a rough categorization of the question type.
Evaluation Details Since each of the five sentences in a ROCStories article contains a core event, we convert the task to the prediction of the next event given the current sequence of events. We use the NYT corpus following Chaturvedi et al. (2017), and run our event extraction and filtering methods to obtain event chains from the news articles. We then break the obtained event chains into sequences of five events' length, and train a simple bi-LSTM masked language model on these sequences. This allows a comparison with the best performing method using the same type of features (Chaturvedi et al., 2017). For evaluation on the Story Cloze Test, we choose the ending that contains the event with higher probability of occurring next given the four previous ones, according to the model. As a comparison, we also use the event chains to fine-tune a RoBERTa-large model. In this way we are performing an unsupervised evaluation in the sense that we are not learning directly from labeled sentences in ROCStories, and we are comparing the performance of models trained/fine-tuned on event chains obtained from different filtering methods. We also present results using the standard supervised setting (Chaturvedi et al., 2017; Srinivasan et al., 2018), i.e., on top of the language model, we train a binary classifier using the correct and incorrect endings from the development set as positive and negative examples respectively.
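The unsupervised evaluation procedure can be sketched in two parts: cutting chains into five-event sequences, and picking the ending whose event the model scores higher. In the sketch below, a smoothed bigram counter stands in for the bi-LSTM masked language model purely so that the example runs; it is not the model used in the experiments. The sliding-window segmentation with stride 1 is also an assumption (the text does not say whether windows overlap), and the function names and toy chain are hypothetical.

```python
from collections import Counter


def five_event_windows(chain, size=5):
    """Cut a chain into windows of `size` events (assumed: sliding, stride 1)."""
    return [chain[i:i + size] for i in range(len(chain) - size + 1)]


def make_bigram_scorer(windows):
    """Toy stand-in for the event language model: add-one smoothed bigrams."""
    pair_counts = Counter()
    prev_counts = Counter()
    vocabulary = set()
    for window in windows:
        vocabulary.update(window)
        for prev, nxt in zip(window, window[1:]):
            pair_counts[(prev, nxt)] += 1
            prev_counts[prev] += 1

    def next_event_prob(context, event):
        last = context[-1]
        return (pair_counts[(last, event)] + 1) / (prev_counts[last] + len(vocabulary) + 1)

    return next_event_prob


def choose_ending(context, candidate_events, next_event_prob):
    """Pick the candidate ending event with the higher next-event probability."""
    return max(candidate_events, key=lambda event: next_event_prob(context, event))


# Hypothetical chain and candidate endings, for illustration only.
chain = ["alerted", "arrived", "detained", "questioned", "charged", "released"]
windows = five_event_windows(chain)
scorer = make_bigram_scorer(windows)
ending = choose_ending(["alerted", "arrived", "detained", "questioned"],
                       ["charged", "said"], scorer)
print(len(windows), ending)
```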
From Table 3, we can see the improvement of prediction accuracy when we feed filtered event chains to our model compared with using unfiltered chains. Our salience-aware discourse-filtered event chains, compared among all event chain-based training settings, provide the best results in both the unsupervised and supervised settings, regardless of the order of extracting the raw event chains. Under the unsupervised setting in which the model has not leveraged any supervision from the development set, and with event chains extracted in textual order, we see that event salience filtering and discourse filtering provide a 2.8% and 3.5% improvement on their own, respectively, and the largest improvement of 4.9% occurs when we use our proposed salience-aware discourse filtering. We see a similar trend under other evaluation settings, that combining event-level and discourse-level salience filtering leads to better performance, though the improvements are not as significant. This confirms that even without additional supervision from labeled data, we can utilize salience-aware and discourse-aware filtering to improve the relevance of event chains for better narrative prediction results. From the last row of Table 3, we see that our RoBERTa-based masked language model trained on the best produced event chains is able to further improve the performance in the supervised evaluation.

Salience-Aware Discourse Parsing
As a case study of our proposed discourse parsing model from Section 3.3, we also compare our performance on the discourse type classification task with the baseline model from Choubey et al. (2020), as shown in Table 4. We see that our model surpasses the baseline model in both macro and micro F1 scores. Looking at the F1 score of each specific discourse type, we achieve the greatest improvement in the classification of type M1 (Main Event) and C2 (Current Context), with an increase of 5.5% and 6.3% respectively. This suggests that introducing salience awareness in the discourse parsing model indeed leads to better prediction accuracy, especially in categories that we expect to be most relevant to the topic of a document.

Related Work
We discuss three relevant research topics.
Event Chains Much research effort has been made to extract and process event chains. Chambers and Jurafsky (2008) pioneered the modern interest and modeled the co-occurrence of events in narrative chains based on their PMI. Radinsky et al. (2012) and Radinsky and Horvitz (2013) extended such unsupervised event chain modeling to cross-document scenarios and used the technology for news prediction and timeline construction. Berant et al. (2014) extracted relations among events and entities in biological processes to help solve a biological reading comprehension task using a structure matching method. More recent work attempted to infer the type of action and object associated with an event chain, which required recognition of the goal or intention extracted from the chain. Other work used a probabilistic graphical model to capture common patterns from analogous event chains, and induced new chains from those patterns. These works do not explore salience or discourse structures for event chains.
... (van Dijk, 1988). For example, Yarlott et al. (2018) annotated a dataset of discourse structures, which viewed paragraphs as units of annotations, and developed models to predict discourse labels. Choubey et al. (2020) instead created a sentence-level discourse structure corpus spanning different domains and sources, which was more helpful for a fine-grained identification of sentences relevant to the main topic. Insights in those studies constitute the two aspects of salience awareness that are characterized in our method, i.e., event-level and discourse-level salience.
Event-Centric Language Models Another line of research focuses on language modeling for capturing sequences of events. Many works in this line extract raw event chains and directly train a neural language model for narrative prediction (Chaturvedi et al., 2017) or script generation (Rudinger et al., 2015; Pichotta and Mooney, 2016; Weber et al., 2018b). Besides directly training, Zheng et al. (2020) proposed a unified fine-tuning architecture, where a masked language model was explicitly fine-tuned on the representations of event chains to model event elements. Li et al. (2020) proposed to learn to induce event schemas using an auto-regressive language model trained on salient paths on an event-event relation graph, which was also an attempt to reduce noise in constructing event schema. While that work uses a different kind of data structure, our works are in agreement on the importance of incorporating salience when fine-tuning an event-centric language model. Specifically, in comparison to prior works that used the language model for narrative prediction tasks (Chaturvedi et al., 2017), we show that salience awareness is an important factor in tackling such tasks.

Conclusion
We propose an event chain extraction pipeline, which leverages both salience identification and discourse profiling to filter out distracting events. We demonstrate the effectiveness of the approach by using the produced event chains to train/fine-tune language models, which leads to improved performance on temporal understanding of events and narrative prediction. Our case study on salience-aware discourse parsing shows the advantage of combining event-level and sentence-level salience information. We plan to use these event chain patterns on other narrative understanding and generation tasks, such as constrained story generation (Peng et al., 2018), event script generation (Lyu et al., 2021), and implicit event prediction (Lin et al., 2021; Zhou et al., 2021).

Ethical Considerations
This work does not present any direct societal consequence. The proposed method aims at providing high-quality extraction of event chains from documents with awareness of salience and discourse structures, but is only evaluated on English data and uses western notions of both salience and news discourse; event sequence extraction using data from other languages and cultures may not benefit from the methods shown here. The extracted event chain representations benefit narrative understanding and temporal understanding of events. Yet, real-world open source articles may include societal biases. Extracting event chains from articles with such biases may potentially propagate the bias into acquired knowledge representation. While not specifically addressed in this work, the ability to incorporate salience and discourse-awareness could be one way to mitigate bias.