ESTER: A Machine Reading Comprehension Dataset for Reasoning about Event Semantic Relations

Understanding how events are semantically related to each other is the essence of reading comprehension. Recent event-centric reading comprehension datasets focus mostly on event arguments or temporal relations. While these tasks partially evaluate machines’ ability of narrative understanding, human-like reading comprehension requires the capability to process event-based information beyond arguments and temporal reasoning. For example, to understand causality between events, we need to infer motivation or purpose; to establish event hierarchy, we need to understand the composition of events. To facilitate these tasks, we introduce **ESTER**, a comprehensive machine reading comprehension (MRC) dataset for Event Semantic Relation Reasoning. The dataset leverages natural language queries to reason about the five most common event semantic relations, provides more than 6K questions, and captures 10.1K event relation pairs. Experimental results show that the current SOTA systems achieve 22.1%, 63.3% and 83.5% for token-based exact-match (**EM**), **F1** and event-based **HIT@1** scores, which are all significantly below human performances (36.0%, 79.6%, 100% respectively), highlighting our dataset as a challenging benchmark.


Introduction
Narratives such as stories and news articles are often written based on events. Understanding how events are logically connected is essential for reading comprehension (Caselli and Vossen, 2017; Mostafazadeh et al., 2016b). For example, Figure 1 illustrates different pairwise relations for the most important events in the given passage: "the deal" can be considered the same event as "Paramount purchased DreamWorks." It is also a complex event that contains "assumed debt," "gives access" and "takes over projects" as its sub-events. The event "sought after" is conditional on "created features."

Figure 1: A graph illustration of event semantic relations in narratives. We use trigger words to represent events in this graph.

By capturing these semantic relations for crucial events in the text, people can often grasp the gist of a story. Therefore, for machines to achieve human-level narrative understanding, we need to test and ensure models' capability to reason over these event relations.
In this work, we study five types of event semantic relations: CAUSAL, SUB-EVENT, CO-REFERENCE, CONDITIONAL and COUNTERFACTUAL. Though previous works study relations such as SUB-EVENT (Glavaš et al., 2014; Yao et al., 2020), CAUSAL and CONDITIONAL (O'Gorman et al., 2016), most of them adopt the pairwise relation extraction (RE) formalism by constructing samples as (event trigger, event trigger, relation) triplets. This RE formalism has two shortcomings: 1) an event trigger word is used to represent the entire event, which can cause ambiguity; 2) relations are rigidly defined as class labels based on expert knowledge.
In this work, we instead propose to use natural language queries to reason about event semantic relations. The flexibility of natural language queries allows us to define events using both triggers and arguments such as subject, object, time, and location. For example, we can easily distinguish "Paramount purchases DreamWorks" from "NBC Universal purchases DreamWorks" based on the different subjects of the same trigger word "purchases." Similarly, other event arguments can be naturally incorporated in questions to help disambiguate events. This improvement is particularly helpful when multiple identical or similar triggers with different arguments exist in the text.
Moreover, natural language queries ease the annotation effort of the RE formalism by supplementing expert-defined relations with textual prompts. In Figure 2, we show several examples. For instance, to reason about a CAUSAL relation, we can ask "what leads to Event A?" Or, for a SUB-EVENT relation, we can ask "what are included in Event B?" Text cues such as "lead to" and "are included" can help models better understand which relation is being queried.
Our question-answering task also poses unique challenges for reasoning about event semantic relations. First, the correct answers can change completely with slight changes to the query. In Figure 2, if we modify the third question to be "What would happen if Europe supported Albania?" then "oust President Sali" becomes an invalid answer. This allows us to test whether models possess robust reasoning skills or simply conduct pattern matching. Second, answers must be in the form of complete and meaningful text spans. For the example COUNTERFACTUAL QA in Figure 2, a random text span "President Sali Berisha" is not a meaningful answer, and a shortened answer "oust" is not complete. To get correct answers, models need to simultaneously detect event triggers and their associated event arguments. This task is more challenging than multi-class classification in the traditional RE tasks.
A few notable event-centric MRC datasets have been proposed recently. Du and Cardie (2020) and others use a natural language question format to detect event triggers and event arguments such as subject, object, time, and location in a given passage. However, knowing event triggers and arguments is not sufficient to solve all event semantic relation tasks. For example, in Figure 1, after detecting the event triggers and arguments for "DreamWorks created successful features" and "sought after by NBC Universal," a model needs to further infer that the former event facilitates the latter in order to understand the semantic relation between these two events.
TORQUE (Ning et al., 2020b) and MCTACO (Zhou et al., 2019) are two recent MRC datasets that study event temporal relations and event temporal commonsense such as duration, time and frequency. However, knowing the temporal aspect of events cannot resolve many important event semantic relations. For example, in Figure 1, to understand that "assumed debt," "gives access" and "takes over projects" are sub-events of "the deal," a model needs to know not only that all four events have overlapping time intervals but also that they share the associated subjects and objects (event arguments), so that "the deal" can contain the other three.
To address the shortcomings of the previous event-centric work in semantic reasoning, we propose ESTER, the first comprehensive MRC / QA dataset for event semantic relations. We adopt natural language queries as questions and require complete and meaningful spans in the passage as answers. Our experimental results reveal SOTA models' deficiencies in our target task, which demonstrates that ESTER is a challenging dataset that can facilitate more robust MRC for event semantic relations.

Definitions
As Figure 2 shows, annotators need to first identify significant events in the form of trigger words before providing questions and answers. In this section, we describe definitions of our significant events and five event semantic relations.

Significant Events
In a given passage, we define significant events as events that are crucial for narrative understanding and whose event arguments (subject, object, time and location) are either explicitly shown in or can be implicitly inferred from the context. For example, in Figure 2, "getting" is a significant event as its subject is "Europe," its object is "Albania," its time is approximately when the document was written, and its location is "Europe." As the core event in the first sentence, "getting" is certainly crucial for understanding the narrative. Similarly, the core event in the last sentence, "dispatch," is a significant event with "Europe," "multinational force," soon after document writing (inferred), and "Balkan" as its subject, object, time and location respectively.

Event Semantic Relations
Next, we define five types of event semantic relations that are consistent with previous studies. For example, CAUSAL and CONDITIONAL have been studied in Wolff (2007). COUNTERFACTUAL is the only exception that, to the best of our knowledge, has not been widely studied. The examples we use below are all presented in Figure 2.
Causal: A pair of events $(e_i, e_j)$ exhibits a CAUSAL relation if, when $e_i$ happens, $e_j$ will definitely happen according to the given passage. For example, the passage explicitly says that the "meeting" happens "in return" for "Europe planned for getting stricken Albanian back." Therefore, the CAUSAL relation in the example can be established because if "Europe planned for getting stricken Albanian back" happens, the "meeting" will definitely happen in this context.

Conditional: A pair of events $(e_i, e_j)$ exhibits a CONDITIONAL relation if $e_i$ facilitates, but may not necessarily lead to, $e_j$ according to the given passage. For example, the expectation of "the dispatch of a multinational force" is to "pull Albania back from the brink"; in other words, the former event can help but does not guarantee the occurrence of the latter one. Therefore, the relation between this pair of events is CONDITIONAL.
Counterfactual: $e_j$ may happen if $e_i$ does not happen; in other words, if the negation of $e_i$ facilitates $e_j$, then $(e_i, e_j)$ has a COUNTERFACTUAL relation. In our example, if "Europe didn't support Albania," which is a negation of what is happening in the passage, then there is a higher chance that "oust President Sali" by the "armed rebels" would happen.

Sub-event: A complex event $e_k$ contains a set of sub-events $e_{k,1}, e_{k,2}, \ldots, e_{k,n}$. In SUB-EVENT relations, we can infer from the passage that the event arguments of each sub-event $e_{k,i}$ are either identical to or contained in the associated event arguments of $e_k$. For example, in the complex event "efforts to pull Albania back," the object is the entire country Albania, whereas in the sub-event "the dispatch of a multinational force" the object can be considered as Albania's "security." Similarly, in another sub-event "aid is brought into the chaotic Balkan state," the location can be inferred to be a part of the entire country where the complex event "efforts" are made.
Coreference: $e_i$ co-refers to $e_j$ when the two events are mutually replaceable. This requires that 1) their event triggers are semantically the same and 2) their event arguments are identical. In our example, the event triggers in the question "pull (back from the brink)" and in the answer "getting (back on to its feet)" are semantically the same. They also share the same subject (Europe) and object (Albania). Their time and location can be inferred from the passage to be the same. Therefore, these two events form a CO-REFERENCE relation.

Data Collection
In this section, we show how our data collection interface works and describe the details of the quality control process, including qualification exams, worker validation, and training.

Interface
The design of our data collection interface consists of two components: event selection and QA annotations.
Event Selections. As Figure 9 in the appendix shows, users are presented with a passage and they need to identify significant events per our definition in Section 2 by highlighting event trigger words. Our focus is question-answering, and thus we do not require workers to identify all event triggers. Also, there are already several dense event datasets such as Chambers et al. (2014) and Ning et al. (2020b), so we do not repeat the work. Rather, the event selection process serves as a warm-up step for the QA annotations below by 1) helping workers locate where the significant events are and 2) ensuring all questions and answers include a significant event in the passage.
QA Annotations. As Figure 10 shows, users must ask natural language questions that contain a highlighted significant event trigger. In order to make questions natural, we allow workers to use different textual forms of an event in the questions, such as "teach" vs. "taught" and "meeting" vs. "meet." The only question type that is not required to contain a highlighted event is SUB-EVENT, where we allow participants to ask a question using an imaginary, but concrete and complex, event trigger word. This exception enables easier composition of SUB-EVENT QAs, as we found in our pilot studies that complex events do not always exist in a given passage despite the abundance of sub-events.
After writing a question, users need to pick the QA type and finally select the answer spans from the passage. If there are multiple answers, we instruct users to select all of them. All answers must include an exact highlighted event, and we prohibit answers with more than 12 words to ensure conciseness. In a typical assignment, users need to ask at least five questions in total using two provided passages.
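The answer constraints described above can be sketched as a small validity check. This is only an illustration under stated assumptions: the function name `is_valid_answer`, whitespace tokenization, and the punctuation-stripping rule are hypothetical and are not taken from the actual annotation interface.

```python
from typing import List

MAX_ANSWER_WORDS = 12  # upper limit on answer length stated in the text


def is_valid_answer(answer: str, passage: str, highlighted_triggers: List[str]) -> bool:
    """Check one candidate answer against the constraints described above.

    Assumptions (not from the paper): words are split on whitespace and
    surrounding punctuation is ignored when matching triggers.
    """
    words = answer.split()
    if not 1 <= len(words) <= MAX_ANSWER_WORDS:
        return False
    # Answers are spans selected from the passage.
    if answer not in passage:
        return False
    # Every answer must include an exact highlighted event trigger.
    cleaned = [w.strip(".,;:!?\"'()") for w in words]
    return any(trigger in cleaned for trigger in highlighted_triggers)
```

For instance, with "dispatch" highlighted, the span "the dispatch of a multinational force" would pass this check, while "a multinational force" would not because it contains no highlighted trigger.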

Quality Control
Qualification. The initial worker qualification was conducted via an examination in the format of multiple-choice questions hosted on the CROWDAQ platform (Ning et al., 2020a). We created a set of questions where a passage and a QA pair are provided, and participants need to judge the type of this QA. Possible choices include the five previously defined categories of event semantic relations and an invalid option. This examination is a simplified task that tests workers' ability to 1) distinguish valid QAs from invalid ones based on our definitions and 2) judge the differences across event semantic relations.
We set the basic qualifications on Amazon Mechanical Turk to be 1) at least 1K HITs approved and 2) at least 97% approval rate. A single qualification exam consists of 10 multiple-choice questions, and participants are given 3 attempts to pass with a >= 0.6 score. We found this qualification examination effectively reduces the rate of spammers to nearly 0.
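The qualification thresholds above amount to a simple rule, sketched below. The function names and the representation of an attempt as a count of correct answers are assumptions made for illustration; the numeric thresholds are the ones stated in the text.

```python
from typing import List

MIN_APPROVED_HITS = 1000   # at least 1K HITs approved
MIN_APPROVAL_RATE = 0.97   # at least 97% approval rate
NUM_EXAM_QUESTIONS = 10    # multiple-choice questions per exam
MAX_ATTEMPTS = 3           # attempts allowed per participant
PASS_SCORE = 0.6           # minimum score needed to pass


def meets_basic_requirements(approved_hits: int, approval_rate: float) -> bool:
    """Check the basic Mechanical Turk requirements."""
    return approved_hits >= MIN_APPROVED_HITS and approval_rate >= MIN_APPROVAL_RATE


def passes_exam(correct_per_attempt: List[int]) -> bool:
    """Return True if any of the first MAX_ATTEMPTS attempts reaches PASS_SCORE."""
    for correct in correct_per_attempt[:MAX_ATTEMPTS]:
        if correct / NUM_EXAM_QUESTIONS >= PASS_SCORE:
            return True
    return False
```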
Worker Validation and Training. Since the real task is much more challenging than the qualification exams, we adopted a meticulous five-stage worker validation and training process to ensure data quality. As Figure 3 shows, workers who passed the qualification exams go through repeated validation and training steps until they reach the final large tasks.
In each validation and training step, two of our co-authors independently judge workers' annotations to determine 1) whether a provided QA is valid per our definitions and 2) whether the answers provided are complete. For workers who showed a serious misunderstanding of our tasks, we disqualify them; otherwise, for each error made, we write a training message to workers and invite them to the next task. We also added missing answers as a part of the validation process and reserved the validated annotations as our evaluation data.
There are 1, 2, 3, 10, and 25 HITs in Tasks 1-4 and the Large Task respectively. For Tasks 1-3, we validate all HITs, and for Task 4, we randomly select 20%, i.e., 2 HITs per worker, to validate. We further request at least 1 co-author to validate all questions whose passages overlap with the validated data. These two parts make up our final evaluation data (dev + test) in the following section.

Data Statistics
We further split our evaluation data into dev and test sets based on passages. The remaining data are used as the training set. A summary of data statistics is shown in Table 1.

Type Distribution
As we can observe in Table 1 and Figures 7-8 in the appendix, CAUSAL and CONDITIONAL are the two most dominant types in ESTER. In Figure 4, we further show the type disagreements using data validated by at least two co-authors. The rows indicate workers' original annotations and the columns are the majority votes between the annotators and co-authors. As we can observe, the matrix is dominated by its diagonal entries. Some noticeable disagreements are 1) between CAUSAL and CONDITIONAL, where people have different opinions on the degree of causality between events; 2) between COUNTERFACTUAL and CONDITIONAL, as some COUNTERFACTUAL questions with double negations are merely CONDITIONAL; 3) between CO-REFERENCE and SUB-EVENT, where annotated co-referred events do not have identical event arguments according to co-authors' judgements. These results align with our intuition that some event semantic relations are inherently hard to distinguish. The IAA score is 85.71% when calculated using pairwise micro F1 scores, and 0.794 per Fleiss's κ. The IAA scores are calculated using the same data reported in Figure 4. The high IAA scores demonstrate strong alignment between annotators and co-authors in judging event semantic relations.
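For readers unfamiliar with Fleiss's κ, the following is a minimal sketch of the standard formula applied to relation-type labels. It assumes every item is labeled by the same number of raters; how raters and items were actually arranged for the reported score is not specified here, so the example data are purely illustrative.

```python
from collections import Counter
from typing import Sequence


def fleiss_kappa(ratings: Sequence[Sequence[str]]) -> float:
    """Fleiss's kappa for items that each received the same number of labels.

    ratings[i] holds the labels assigned to item i by all raters.
    Undefined (division by zero) if every label falls into a single category.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if any(len(item) != n_raters for item in ratings):
        raise ValueError("every item must have the same number of ratings")

    category_totals: Counter = Counter()
    agreement_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Proportion of agreeing rater pairs for this item.
        agreement_sum += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1)
        )

    observed = agreement_sum / n_items
    total = n_items * n_raters
    expected = sum((c / total) ** 2 for c in category_totals.values())
    return (observed - expected) / (1 - expected)
```

A value of 1 corresponds to perfect agreement, and values at or below 0 indicate agreement no better than chance.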

Worker Distribution
We had 70 workers in total who passed our qualification exam and completed at least 1 assignment in our project. Due to our rigorous validation process, only 27 were able to make it into Task 4 and the Large Task, which consist of a large number of assignments. This is illustrated by Figure 5, where the curve for the train set deviates further from the equality baseline than the curve for the evaluation data. It also shows that our evaluation set is relatively well distributed, which reflects our validation process: workers who failed our validation tasks and were disqualified still provided some good-quality QAs, which we use as evaluation samples upon co-authors' validation.

Other Statistics
Unique N-grams in Questions. Table 2 shows the unique and frequent (ranked top 5) unigrams and bigrams in each type of question. These n-grams can be considered semantic cues in the questions for reasoning about particular semantic relations. Frequent n-grams can be found in Table 7 in the appendix.
Number of Answers. Table 3 shows the average number of answers for each semantic type. We observe that SUB-EVENT contains the most answers. This result also aligns with our intuition that a complex event in the passage often contains multiple sub-events. The evaluation sets contain about 0.5 more answers per question than the training set. As we mentioned in Section 3, co-authors added the missing answers as part of the integrated validation process.
Number of Tokens. Table 4 shows the average number of tokens in questions and answers. The COUNTERFACTUAL questions contain the most tokens, as additional words are often needed to specify the negation reasoning. The average number of tokens is around 6.5 across all 5 types of answers. This is the midpoint of our answer length limits, where we set the minimum and maximum numbers of words to be 1 and 12 respectively. The average number of tokens in the passages is 128.1, with the longest passage containing 196 tokens.

Table 4: Average number of tokens in a question and in a single answer by semantic types.
Type             Question   Answer
CAUSAL           10.2       6.6
CONDITIONAL      12.1       6.4
SUB-EVENT         9.6       6.6
COUNTERFACTUAL   13.9       6.3
CO-REFERENCE      8.9       6.6

Experiments
We experimented with two tasks enabled by ESTER: answer generation and conditional question generation. We will describe our experimental design as well as the evaluation metrics below.

Answer Generation
Given a question $q_i$ and a passage $P_i = \{x_1, x_2, \ldots, x_j, \ldots, x_n\}$, where $x_j$ represents a token in the passage, the answer generation task requires the model to generate natural language answers $\hat{A}_i = \{\hat{a}_{i,1}, \ldots, \hat{a}_{i,k}\}$. For the gold answers $A_i = \{a_{i,1}, \ldots, a_{i,k}\}$, each answer span $a_{i,k} \in P_i$. We follow the input format of UnifiedQA (Khashabi et al., 2020) by concatenating $q_i$ and $P_i$ with a "\n" token. For training labels, we concatenate multiple answers with a ";" token.
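The input and label format described above can be sketched as follows. This is an illustration, not the released preprocessing code: the helper names are hypothetical, and the choice to write the separator as the literal two-character string "\n" surrounded by spaces is an assumption about the convention being followed.

```python
from typing import List, Tuple

SEPARATOR = "\\n"      # assumption: literal backslash-n used as the separator token
ANSWER_DELIMITER = ";"  # delimiter between multiple answers, as stated in the text


def build_example(question: str, passage: str, answers: List[str]) -> Tuple[str, str]:
    """Return (model input, training label) for one question."""
    source = f"{question} {SEPARATOR} {passage}"
    target = ANSWER_DELIMITER.join(answers)
    return source, target


def split_answers(generated: str) -> List[str]:
    """Recover individual answers from a generated string."""
    return [part.strip() for part in generated.split(ANSWER_DELIMITER) if part.strip()]
```

Joining all answers into one target string lets a sequence-to-sequence model emit a variable number of answers in a single decoding pass; `split_answers` reverses the step at evaluation time.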
Evaluation Metrics. We evaluate the quality of generated answers with three metrics.
• $F_1^E$ is calculated similarly to the token-based $F_1$ score ($F_1^T$), except that we replace $U_i$, $\hat{U}_i$ with $E_i$, $\hat{E}_i$, which denote the event triggers in the gold answers $A_i$ and the generated answers $\hat{A}_i$ respectively.
• HIT@1 equals 1 if the top generated answer, i.e. $\hat{a}_{i,1}$, contains a correct event trigger; otherwise it is 0.
Model Baselines. We fine-tuned three pretrained language models on ESTER. Both BART (Lewis et al., 2020) and T5 (Raffel et al., 2020) are pretrained on large sequence-to-sequence datasets, and thus are suitable architectures for our answer generation tasks. UnifiedQA, mentioned above, is based on BART and T5, but is pretrained with a variety of QA tasks including multiple-choice, span detection and answer generation. UnifiedQA also demonstrates powerful zero-shot learning capabilities on unknown QA tasks, which we tested on ESTER too.
Human Baselines. To show the human performance on the task, we randomly select 10 questions for each semantic type from the test set. Two co-authors provide answers for these questions, which they had never seen before, and we compare the mutually agreed answers with the validated annotations. $F_1^E$, $F_1^T$ and HIT@1 scores are calculated as the human baselines.
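The answer generation metrics can be sketched roughly as below. The exact definition of the token-based score is not reproduced in this section, so the code assumes a standard bag-of-items F1 (as commonly used in span-based QA) and applies the same function to unigrams for the token-based score and to event triggers for the event-based score. Lowercasing and whitespace tokenization are also assumptions.

```python
from collections import Counter
from typing import List


def f1_overlap(predicted: List[str], gold: List[str]) -> float:
    """Multiset F1 between predicted and gold items (tokens or event triggers)."""
    pred_counts, gold_counts = Counter(predicted), Counter(gold)
    overlap = sum((pred_counts & gold_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(gold_counts.values())
    return 2 * precision * recall / (precision + recall)


def unigrams(answers: List[str]) -> List[str]:
    """Flatten a list of answer strings into lowercase whitespace tokens."""
    return [tok for answer in answers for tok in answer.lower().split()]


def hit_at_1(predicted_answers: List[str], gold_triggers: List[str]) -> int:
    """1 if the top predicted answer contains a correct event trigger, else 0."""
    if not predicted_answers:
        return 0
    top_tokens = predicted_answers[0].lower().split()
    return int(any(trigger.lower() in top_tokens for trigger in gold_triggers))
```

The token-based score would then be `f1_overlap(unigrams(predicted), unigrams(gold))`, and the event-based score the same call on the two lists of event triggers.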

Conditional Question Generation
The secondary task we propose is to automatically generate questions based on relation types and event trigger words. Given a QA type $t_i \in T$, an event trigger word $e_i$ and a passage $P_i$, the conditional question generation task requires a model to generate a reasonable question $q_i$. Since $e_i \in q_i$ is not provided in the original annotations, we leverage simple but effective heuristic rules to parse out the core event in $q_i$. The event triggers in the vast majority of questions can be found via either exact or lemmatized matches with the annotated significant events in the passage. For the remaining questions, we simply detect verbs or nouns that can be converted to verbs, and use them as pseudo event triggers. We concatenate $t_i$, $e_i$, $P_i$ with "\n" tokens as inputs and train models using $q_i$ as outputs.
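A rough sketch of the trigger-recovery heuristic and the input construction is given below. It is not the authors' implementation: the crude suffix-stripping function stands in for a real lemmatizer (it will miss irregular forms such as "taught"), the part-of-speech fallback for pseudo triggers is omitted, and the separator spelling is an assumption.

```python
from typing import List, Optional

SEPARATOR = "\\n"  # assumption: literal backslash-n used as the separator token


def crude_stem(word: str) -> str:
    """Very rough stand-in for lemmatization: strip one common suffix."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def find_trigger(question: str, annotated_events: List[str]) -> Optional[str]:
    """Find the annotated event a question refers to, by exact then stemmed match."""
    tokens = [t.strip(".,;:!?\"'()").lower() for t in question.split()]
    for event in annotated_events:          # pass 1: exact match
        if event.lower() in tokens:
            return event
    stems = {crude_stem(t) for t in tokens}
    for event in annotated_events:          # pass 2: match after stemming
        if crude_stem(event.lower()) in stems:
            return event
    return None  # the text falls back to verb/noun detection here


def build_qg_input(qa_type: str, trigger: str, passage: str) -> str:
    """Concatenate relation type, trigger and passage into one model input."""
    return f" {SEPARATOR} ".join([qa_type, trigger, passage])
```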
We think this task can also be a good baseline for two reasons: 1) it simulates the first step of our QA annotations, where workers need to use events to produce meaningful semantic relation questions. Therefore, examining a model's ability to write reasonable questions is also a good test of machines' semantic reasoning skills. 2) An automatic question generation system can facilitate self-labeling, as we can leverage generated questions to annotate answers in new passages.
Evaluation Metrics. We evaluate the quality of generated questions with three metrics, perplexity and BLEU scores (equal weights up to 4-grams).
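For reference, a minimal single-reference BLEU with equal weights up to 4-grams and a perplexity helper might look as follows. The tokenization, the absence of smoothing, and the use of per-token natural-log probabilities are assumptions; the exact implementation used for the reported scores is not specified in this section.

```python
import math
from collections import Counter
from typing import List


def ngram_counts(tokens: List[str], n: int) -> Counter:
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1))


def bleu4(candidate: str, reference: str) -> float:
    """Sentence-level BLEU, weights (0.25, 0.25, 0.25, 0.25), no smoothing."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precision = 0.0
    for n in range(1, 5):
        cand_ngrams, ref_ngrams = ngram_counts(cand, n), ngram_counts(ref, n)
        total = sum(cand_ngrams.values())
        matched = sum((cand_ngrams & ref_ngrams).values())  # clipped counts
        if total == 0 or matched == 0:
            return 0.0  # without smoothing, any zero precision gives BLEU 0
        log_precision += 0.25 * math.log(matched / total)
    # Brevity penalty for candidates shorter than the reference.
    brevity = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(log_precision)


def perplexity(token_log_probs: List[float]) -> float:
    """Perplexity from per-token natural-log probabilities of a question."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))
```

Because unsmoothed BLEU is zero whenever a short question shares no 4-gram with its reference, corpus-level aggregation or a smoothing method is usually preferred in practice.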

Results and Analysis
In this section, we present and analyze results for the experiments described in Section 5.

Answer Generation
As Table 5 shows, among the three baseline models we experimented with, UnifiedQA achieves the best performance across the board, with 60.5%, 57.8% and 76.3% for $F_1^E$, $F_1^T$ and HIT@1 scores on the test set respectively. All three numbers are more than 23% lower than the human performances.
We further show the breakdown performance for each semantic type. SUB-EVENT and CO-REFERENCE relations have the lowest scores, which can be attributed to two reasons: 1) these two categories have small percentages of data, 11.8%, and 10.9% respectively; 2) understanding these two relations requires complicated reasoning skills from models to capture not only the hierarchical relations for event triggers but also for their associated arguments.
Interestingly, though the COUNTERFACTUAL relation also has a small number of questions and requires more complex reasoning, it appears our models can learn this relation relatively well. This could be attributed to the similarity between COUNTERFACTUAL and CONDITIONAL relations, and to the negation in COUNTERFACTUAL questions being well detected through textual cues during model training.
Zero-shot and Few-shot Learning. UnifiedQA has demonstrated powerful zero-shot and few-shot learning capabilities in a variety of QA tasks. We observe similar patterns: zero-shot learning from UnifiedQA drastically improves performance compared to its T5 counterpart in Table 5. For few-shot learning, we show in Figure 6 that a model fine-tuned with only 500 examples achieves results quite comparable to full training, and that model performance levels off after using 2-3K training examples, suggesting that the benefits of more data diminish drastically.

Conditional Question Generation
In Table 6, we observe that CO-REFERENCE questions are the easiest to match by BLEU scores, whereas COUNTERFACTUAL questions are the hardest for the model to get right. These results align with our statistics in Table 4, where CO-REFERENCE appears to have the simplest structure by tokens and COUNTERFACTUAL seems to have the most complicated structure. Surprisingly, SUB-EVENT achieves the best perplexity, but the lowest BLEU score. This indicates that SUB-EVENT may be the most diverse type, with multiple good ways to ask equally reasonable questions.

Discussion
Learning with Partial Signals. In Table 3, we show that the validated data contain about 0.5 more answers per question. Proximity and saliency are the two reasons we observe that contribute most to this discrepancy. Our input data include long passages with an average of 128 tokens. Even a well-trained worker can overlook relations for event pairs that are physically distant from each other. Moreover, long-distance relations are often less salient. For non-salient event relations, expert or external knowledge may be needed to disambiguate. We found workers tend to be conservative by avoiding using these non-salient relations as answers.
However, when presented with enough examples comprising both complete and partial answers, human beings can learn, generalize and thus find complete answers for unseen questions. Learning from partial annotations has been widely explored in several tasks such as named-entity recognition (Huang et al., 2019; Shang et al., 2018), word segmentation, and part-of-speech tagging (Tsuboi et al., 2008; Yang and Vozila, 2014). Therefore, an encouraging future research direction for creatively using our data would be either leveraging or advancing partial signal learning methods in order to close the gap between SOTA models and human performances shown in Table 5.

Related Work
Event Semantic Relations. There are several studies of event semantic relations, and most of them leverage the relation extraction formalism for annotations. Causality is one of the most widely studied event semantic relations. Prior studies follow the CAUSE, ENABLE and PREVENT schema proposed by Wolff (2007), where the former two align with our CAUSAL and CONDITIONAL relations respectively. Do et al. (2011) adopted a minimally supervised method and measured event causality based on pointwise mutual information (PMI) of event predicates and arguments, which resulted in denser causal annotations than previous works.
HIEVE (Glavaš et al., 2014) defines the pairwise SUB-EVENT relation as spatiotemporal containment, which is less rigorous than our definition, as we require containment for all event arguments, i.e. subject, object, time and location. Our definition of CO-REFERENCE is nearly identical to that of HIEVE, where two co-referred events denote the same real-world event. Yao et al. (2020) utilized a weakly-supervised method to extract large-scale SUB-EVENT pairs, but the extraction rules can result in noisy relations.
RED (O'Gorman et al., 2016) proposed to annotate event temporal and semantic relations (CAUSAL, SUB-EVENT) jointly. However, due to the complexity of the annotation schema, the data available for semantic relations are relatively sparse. Mostafazadeh et al. (2016b) and Caselli and Vossen (2017) annotate both event temporal and semantic relations in ROCStories (Mostafazadeh et al., 2016a) and Event StoryLine Corpus (Caselli and Vossen, 2017) respectively. ESTER differs from these works by disentangling temporal and semantic relations and designing an MRC dataset to capture comprehensive event semantic relations.
Event-centric MRC. Leveraging natural language queries for event-centric machine reading comprehension has been proposed recently (Zhou et al., 2019; Ning et al., 2020b; Du and Cardie, 2020). However, these datasets focus on either event arguments or event temporal commonsense, whereas ESTER studies event semantic relations.

Conclusions
We propose ESTER, the first comprehensive MRC dataset for event semantic reasoning. We adopt meticulous data quality control to ensure annotation accuracy. The ESTER dataset enables question answering tasks over event semantic relations, which can be more challenging than traditional relation extraction work. The difficulty of the proposed data is also manifested by the significant gap between machine and human performances. We thus believe that ESTER will be a challenging benchmark for event-centric research in the future.