Probing into the Root: A Dataset for Reason Extraction of Structural Events from Financial Documents

This paper proposes a new task: event reason extraction from document-level texts. Unlike previous causality detection tasks, we do not mark target events in the text but only provide structural event descriptions, a setting that better matches practical scenarios. Moreover, we annotate a large dataset, FinReason, for evaluation, which provides Reason annotations for Financial events in company announcements. The task is challenging because it includes multiple-event, multiple-reason, and implicit-reason cases. In total, FinReason contains 8,794 documents, 12,861 financial events, and 11,006 reason spans. We also report the performance of canonical event extraction and machine reading comprehension methods on this task. The results show a 7 percentage point F1 gap between the best model and human performance, indicating that existing methods are far from solving this problem.


Introduction
Why did an event happen? People are always eager to find the reasons behind events. Automatically extracting the causal explanations of given events from texts is useful and important for both ordinary users and downstream applications. For example, in the financial domain, returning the reasons for a financial event of interest in an information retrieval system can free analysts from reading an enormous volume of company announcements and help investors make financial decisions.
Previous work on event causality (Do et al., 2011; Riaz and Girju, 2013; Mirza and Tonelli, 2014; Caselli and Vossen, 2017) mainly focuses on identifying causal relations between two given events that are usually presented as event trigger words. In reality, however, users may only know that a particular event happened, without knowing its mention or trigger in the documents, and simply wonder why it occurred. Therefore, we propose a new task aimed at extracting the causal explanations of given, structurally presented events from document-level texts. Specifically, a Structural Event as defined here is a structural description that contains all necessary roles for an event type; such a description can completely represent a specific event that occurred in reality. For example, in Figure 1, the PLEDGE event has four predefined roles, NAME, ORG, NUM, and BEG, which represent an occurred PLEDGE event. Our task is then to extract the reasons for the structural events as textual spans from the document.
To investigate solutions for this challenging task, we construct a large-scale Chinese dataset, FinReason. Specifically, we automatically collect formal financial documents together with their corresponding structural events, following Yang et al. (2018). Crowd workers are then employed to annotate the reasons in the documents for each structural event. To guarantee annotation quality and high inter-annotator agreement (IAA), we set several annotation principles and define 3 types of possible causal explanations (MOTIVATION, CAUSE, ENABLE) as Reasons for the events in company announcements to guide annotators. In total, the dataset contains 8,794 documents, 12,861 collected financial events, and 11,006 reason spans. The Cohen's kappa of the annotations is 83.87%.
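As a side note, agreement on span annotations like these could be measured by reducing each annotator's spans to per-character binary labels and computing Cohen's kappa over them. The sketch below illustrates this; the character-level reduction and the sample offsets are our assumptions, since the paper does not specify the unit of agreement.

# A hedged illustration of span-level IAA via Cohen's kappa. The
# character-level unit and the offsets are assumptions, not paper details.
from sklearn.metrics import cohen_kappa_score

def char_labels(doc_len, spans):
    """Mark each character as inside (1) or outside (0) a reason span."""
    labels = [0] * doc_len
    for start, end in spans:          # spans given as [start, end) offsets
        for i in range(start, end):
            labels[i] = 1
    return labels

# Hypothetical annotations from two annotators over a 100-character document.
annotator_1 = char_labels(100, [(10, 35)])
annotator_2 = char_labels(100, [(12, 35)])
print(cohen_kappa_score(annotator_1, annotator_2))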
Moreover, to understand the task's difficulties, we cast it as an Event Extraction (EE) or a Machine Reading Comprehension (MRC) task. We run canonical models, such as BiLSTM-CRF (Ma and Hovy, 2016) and BERT (Devlin et al., 2019), on this task to set benchmarks. Empirical results show that the task is challenging: there is still an overall gap of 7pp (percentage points) in F1 score between the best model and human performance.

Related Work
Much NLP research has focused on identifying causal relations in text, including knowledge bases (WordNet (Miller, 1998), FrameNet (Baker et al., 1998), and ConceptNet (Speer et al., 2017)), semantics-related evaluations (SemEval-2007 Task 04 (Girju et al., 2007), COPA (Roemmele et al., 2011), RED (Ikuta et al., 2014)), and event-related systems (Beamer and Girju, 2009; Do et al., 2011; Riaz and Girju, 2013; Hu and Walker, 2017; Caselli and Vossen, 2017). These works tried to identify real-world causality in lexicons or texts from different aspects. However, they found it difficult to agree on whether a causal relationship exists in reality, owing to the ambiguity of the definition of causality. Our dataset mitigates this problem by identifying only contextual causality, without checking against reality.
In addition, plenty of work identifies causal relationships only at the context level, such as the general causality detection tasks PDTB (Prasad et al., 2007) and BECauSE 2.0 (Dunietz et al., 2017), and the emotion causality detection task ECA (Lee et al., 2010). Some work (Radinsky et al., 2012; Mirza and Tonelli, 2014; Zhao et al., 2017) also tries to identify causal relations between events at the contextual level. However, our task is different because we focus on extracting the reasons for well-defined structural events, which is closer to practical scenarios.

Task Description and Data Collection
Task Description

Our task is to extract the corresponding causal explanations for given structural events in a document. The input is a document together with the structural events described in it. The outputs are the causal text spans for the given events. For a given event in a document, there may be zero, one, or multiple causal explanations to identify. To construct the dataset, we first collect a corpus of structural events with their corresponding documents, following Yang et al. (2018). The collected documents are restricted to company financial announcements, which are relatively formal; this setting improves annotation IAA thanks to their logical consistency and clarity. Specifically, we crawl public company financial announcements as documents from sohu.com and the structural events from eastmoney.com. Since the documents are not pre-aligned with their corresponding structural events, we match key event items (see details in Appendix B) to align them. As in Yang et al. (2018), we assume that if the key event items of a structural event appear in a document, the document mentions that structural event. This alignment method has a high precision of 94.5%, as evaluated by Yang et al. (2018). A minimal sketch of this matching heuristic is given below.
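In the sketch, a structural event is assumed to be a dict mapping role names to strings; the KEY_ITEMS table is a placeholder, since the actual key event items per type are listed in Appendix B.

# A minimal sketch of the key-event-item alignment heuristic. KEY_ITEMS is
# a placeholder; the real key items per event type are given in Appendix B.
KEY_ITEMS = {
    "PLEDGE": ["NAME", "ORG", "NUM"],        # hypothetical choice of roles
    "LAWSUIT": ["PLAINTIFF", "DEFENDANT"],
}

def mentions_event(document, event):
    """A document is assumed to mention the event if every key item of the
    structural event appears verbatim in its text (Yang et al., 2018)."""
    return all(event[role] in document for role in KEY_ITEMS[event["type"]])

def align(documents, events):
    """Pair each document with every structural event it mentions."""
    return [(doc, ev) for doc in documents for ev in events
            if mentions_event(doc, ev)]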

Data Collection
In total, as shown in Table 5, we align 8,794 documents with 12,861 corresponding structural events of 3 types in the financial domain, namely Pledge of Shares (Pledge), Overweight and Underweight of Shares (O/U), and Lawsuit and Arbitration (Lawsuit). To the best of our knowledge, this is the largest dataset for the event reason extraction task.

Event Reason Annotation
Annotation Principles: To construct a dataset with high IAA, we follow two principles during annotation. First, we annotate event reasons according to the contextual expressions. We do not check them against reality, even when a statement is obviously false (e.g., the stock market falls because of intense sunspot activity). Second, following previous work (Trabasso et al., 1989; Van den Broek, 1990; Dunietz et al., 2015), we define 3 types of possible causal explanations as reasons for financial events in announcements: MOTIVATION, CAUSE, and ENABLE (see details in Appendix A). This gives annotators a clear guideline for deciding what to annotate and what not to. Because differentiating these reason types from text is itself ambiguous (Dunietz et al., 2015), we do not require annotators to distinguish them, only to confirm that each annotated reason belongs to at least one of the 3 types.
Quality Control: Besides the aforementioned 2 principles, we adopt several more rules to control data quality, as follows.
(1) Each annotator should find as many reasons for a target event as possible.
(2) Each annotated reason should be as short as possible while remaining fully expressive.
(3) When explicit causal terms such as 因为 (because) or 为了 (in order to) are mentioned, they should be included in the annotated reasons.
(4) For each annotated reason, the annotator should confirm it with a why test (Grivaz, 2010), i.e., the reason should answer the question of why the event happened.
We then employ crowdsourcing to annotate the reasons for each event. Specifically, 9 workers are divided into 3 teams, one per event type. Each team is trained to acquire domain knowledge of its target event type so that it can identify the possible reasons for the events. Within each team, 2 members annotate the reasons independently, and the 3rd member makes a judgment when the 2 annotations are inconsistent. Because the alignment in the first step may not be perfectly accurate, annotators are also responsible for removing wrongly aligned cases to maintain data quality. Finally, as shown in Table 5, there are 8,794 documents, 12,861 collected financial events, and 11,006 annotated event reasons in total, and approximately 73.58% of the documents are annotated with at least one event reason. The Cohen's kappa of IAA is 83.87%.

From the annotation results, we can conclude that extracting the reasons for given structural events in a document is not an easy task. First, a document may mention multiple events, like the 2 events in the example of Figure 1. As shown in Table 5, approximately 20.67% of the documents mention more than one event; without event mention assignment, discriminating the corresponding reasons for different events within the same document is difficult. Second, about 13.25% of the documents mention multiple reasons for an event, and finding all of them is also nontrivial. Third, 71.74% of the documents mention the reasons for events implicitly; only 28.26% of the reasons are marked with explicit modifiers such as 因为 (because), 由于 (since), 原因 (cause), 为 (in order to), and 目的 (aims to). Such implicitly mentioned reasons are harder to identify because they carry no syntactic clue and require deep reasoning.

Task Challenges
We regard the average performance of the two annotators with respect to the final gold standard on the test set as human performance. As Table 2 shows, human performance on the test set is in line with intuition: compared with the simple cases (Single-Event, Single-Reason, Explicit-Reason), identifying reasons in multiple-event, multiple-reason, and implicit-reason cases is more challenging.

Evaluation Criterion
To evaluate solutions on this task, we follow a paradigm similar to SQuAD 2.0 (Rajpurkar et al., 2018), with several differences. In general, we compute precision/recall/F1 scores for every event in the test set and take the macro-average over all events as the overall performance. However, FinReason contains multiple-reason cases, and we want to evaluate the ability to identify multiple reasons. We therefore do not fully follow the SQuAD-style evaluation of selecting only the best prediction; instead, we consider all predictions, which prevents systems from cheating by predicting every possible causal expression in the text. For each case, we compute the scores as follows.
1) When no reason is annotated for an event, the prediction should be the Null string, in which case precision/recall/F1 are all 1; otherwise, all scores are 0.
2) When exactly one reason is annotated for an event, we calculate precision/recall/F1 based on the overlap between the predicted and ground-truth strings.
3) When there are multiple reasons for a target event, we first calculate each reason's scores against the corresponding predictions as in case 2) and then take the macro-average over all reasons as the final scores for the event. A minimal sketch of this scoring scheme is given below.
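As one concrete reading of these rules, the sketch below scores a single event; bag-of-characters overlap and pairing each gold reason with its best-scoring prediction are our assumptions, since the paper does not specify the overlap unit or the prediction-to-reason matching.

# A minimal sketch of the per-event scoring scheme. Character-level overlap
# and best-match pairing of predictions to gold reasons are assumptions.
from collections import Counter

def span_prf(pred, gold):
    """Bag-of-characters precision/recall/F1 between two strings."""
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    p, r = overlap / len(pred), overlap / len(gold)
    return p, r, 2 * p * r / (p + r)

def event_prf(preds, golds):
    """Score one event following cases 1)-3) above."""
    if not golds:                      # case 1: no annotated reason
        ok = 1.0 if not preds else 0.0
        return ok, ok, ok
    scores = []                        # cases 2 and 3
    for gold in golds:
        best = max((span_prf(p, gold) for p in preds),
                   key=lambda s: s[2], default=(0.0, 0.0, 0.0))
        scores.append(best)
    return tuple(sum(s[i] for s in scores) / len(scores) for i in range(3))

# Overall performance is then the macro-average of event_prf over all events.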

Baselines
FinReason is a new task, but it resembles several existing tasks, such as event extraction (EE) and machine reading comprehension (MRC). We therefore apply existing canonical methods for those tasks to FinReason as benchmarks for future research. The selected baselines are as follows (see details in Appendix C):
Regular Expressions (RegExp): In this setting, we treat FinReason as a causal sentence detection problem and employ ad-hoc regular expressions to solve it. Specifically, we use five modifiers (因为 (because), 由于 (since), 原因 (cause), 为 (in order to), 目的 (aims to)) as causal cues and return any sentence containing them as a reason for the event. A rough sketch follows.
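In this sketch, sentence segmentation on Chinese terminal punctuation is an assumption; the paper does not describe its segmentation.

# A rough sketch of the RegExp baseline: return every sentence containing an
# explicit causal cue. Segmentation on terminal punctuation is assumed.
import re

CAUSAL_CUES = ["因为", "由于", "原因", "为", "目的"]
CUE_PATTERN = re.compile("|".join(map(re.escape, CAUSAL_CUES)))

def regexp_baseline(document):
    """Detect candidate reason sentences via the five causal cues."""
    sentences = re.split(r"(?<=[。！？])", document)
    return [s for s in sentences if CUE_PATTERN.search(s)]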
BiLSTM-CRF (BiLSTM): We can treat the reasons as part of the event description and regard the task as an EE task. Similar to Yang et al. (2018), we employ a BiLSTM-CRF (Ma and Hovy, 2016) to predict the start and end positions of each reason. Specifically, we locate the event participants in the documents via string matching between the documents and the given structural events, and use this information as features in a BIO tagging format.
BERT-QA: We can treat the task as an MRC problem by regarding the structural event as a query and the target reason as the answer. In particular, we use templates to turn each structural event into a why-question and employ BERT-QA (Devlin et al., 2019) to extract the reason spans as answers. A hypothetical illustration of the templating step follows.
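The actual templates are not published, so both the template wording and the sample event below are invented for illustration.

# A hypothetical illustration of turning a structural event into a
# why-question for BERT-QA. Template wording and sample values are invented.
TEMPLATES = {
    # Gloss: "Why did {NAME} pledge the {ORG} shares it holds?"
    "PLEDGE": "为什么{NAME}质押其持有的{ORG}股份？",
}

def to_why_question(event):
    """Fill the event's roles into the why-question template for its type."""
    return TEMPLATES[event["type"]].format(**event)

query = to_why_question({"type": "PLEDGE", "NAME": "张三", "ORG": "某公司"})
# The (query, document) pair is then fed to the BERT-QA span extractor.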

Results
We split the dataset into train/dev/test sets with a ratio of 8:1:1 for our experiments. From the results in Table 2, we can see that there is still an average F1 gap of 7pp (82% vs. 89%) between the best model (BiLSTM) and human performance. Human performance is relatively low for Lawsuit because the reason for a lawsuit usually lies in a whole story between the plaintiff and the defendant, making it hard to agree on span boundaries. Furthermore, the BiLSTM model generally outperforms BERT-QA. The likely reason is that the BiLSTM model knows the positions of event mentions through the event BIO features, whereas BERT-QA only uses the structural event as the query, so it is easier for BiLSTM to locate the correct reasons. We also evaluate the 3 challenges on the whole test set. As Table 2 shows, the average F1 gaps between the best model and humans on the 3 challenges are 12pp, 16pp, and 28pp, respectively, much larger than the overall average gap of 7pp. This demonstrates that these challenges, especially the implicit cases, are the bottlenecks preventing models from reaching human performance.

Conclusion
In this work, we propose FinReason, a dataset for a new event reason extraction task. Our experiments show that the task remains challenging for current models. Future work may focus on tackling the challenging cases (multiple events, multiple reasons, and implicit reasons) to achieve more satisfying performance.