Knowledge-Enriched Event Causality Identification via Latent Structure Induction Networks

Identifying causal relations between events is an important task in natural language processing. However, the task is very challenging, because event causality is usually expressed in diverse forms that often lack explicit causal clues. Existing methods cannot handle the problem well, especially when training data is scarce. Nonetheless, humans can make a correct judgement based on their background knowledge, including descriptive knowledge and relational knowledge. Inspired by this, we propose a novel Latent Structure Induction Network (LSIN) to incorporate external structural knowledge into this task. Specifically, to make use of descriptive knowledge, we devise a Descriptive Graph Induction module to obtain and encode the graph-structured descriptive knowledge. To leverage relational knowledge, we propose a Relational Graph Induction module which is able to automatically learn a reasoning structure for event causality reasoning. Experimental results on two widely used datasets indicate that our approach significantly outperforms previous state-of-the-art methods.


Introduction
Event causality identification (ECI) aims to identify causal relations between events in texts. For example, in the sentence "The earthquake generated a tsunami.", an ECI model should be able to identify the causal relationship that holds between the two mentioned events, i.e., earthquake −cause→ tsunami. ECI is an important task in natural language processing (NLP) and can support many NLP applications, such as machine reading comprehension (Berant et al., 2014), process extraction (Thalappillil Scaria et al., 2013) and future event prediction (Radinsky et al., 2012; Hashimoto et al., 2014).
Identifying event causal relations is inherently challenging, because event causality is usually expressed in diverse forms that often lack explicit clues indicating its existence. For example, in Figure 1 the sentence has no explicit clue indicating the causal relation between "global warming" and "tsunami". In this scenario, models could resort to a large amount of labeled data to learn diverse causal expressions. However, existing ECI datasets are very small. For example, the largest dataset, EventStoryLine (Caselli and Vossen, 2017), only contains 258 documents, which is not sufficient to train neural network models. Consequently, models cannot thoroughly understand the text and may make wrong predictions. Nonetheless, humans can make a correct judgement, because they have background knowledge about the two events: they not only know what the two events are, but also know the connection between them. Fortunately, existing knowledge bases (KBs) usually contain the Descriptive Knowledge of events and the Relational Knowledge between events, which can serve as background knowledge to enhance ECI models. In this paper, we focus on how to incorporate these two kinds of external knowledge into the task.
Descriptive Knowledge: The external knowledge base contains descriptive or explanatory information about events, which we call the descriptive knowledge of events. It usually consists of the one-hop neighbors of events. This kind of knowledge helps the model better understand what the mentioned event is. For example in Figure 1, the descriptive knowledge associated with "global warming" includes (global warming, IsA, temperature change), (global warming, CreatedBy, greenhouse gas) and so on. If the model can make use of such knowledge, it can understand the meaning of the event itself much better than by using only the given text. Therefore, incorporating descriptive knowledge is very helpful for this task. However, when leveraging this kind of knowledge, we face two critical challenges: (1) As shown in Figure 1, the descriptive knowledge forms a sub-graph, and effectively encoding this graph-structured knowledge is a challenging problem; (2) The knowledge base is incomplete (Wang et al., 2020), so the descriptive knowledge of some events cannot be retrieved from the KB. Thus, the model should be able to obtain and encode such knowledge even when it does not exist in the KB.

Relational Knowledge: The external knowledge base also contains connections between events, which we refer to as the relational knowledge between events. It is usually defined by the multi-hop path between two events. This kind of knowledge can provide useful information for event causality reasoning, especially when the text lacks causal clues. For example in Figure 1, the relational knowledge between the two events is the multi-hop path from "global warming" to "tsunami" through intermediate concepts such as "sea-level rising". Apparently, compared with using only the text, the relational knowledge can provide ample evidence for the model to judge the causality between "global warming" and "tsunami".
However, two challenges exist when using the relational knowledge: (1) The multi-hop path may miss some potentially useful relations. For example in Figure 1, the fact (sea-level rising, Causes, tsunami) is described on the Wikipedia page of "sea-level rising" (https://en.wikipedia.org/wiki/Sea_level_rise), but it is not annotated in the KB; (2) Not all the knowledge on the path is related to causality, such as (sea-level rising, AtLocation, ocean). Therefore, directly reasoning along the multi-hop path structure may not be optimal. The model should be able to learn a more reasonable structure to capture potentially useful information and reduce the impact of irrelevant knowledge.
In this paper, we propose a novel method termed as Latent Structure Induction Network (LSIN) to overcome aforementioned challenges. Specifically, we devise a Descriptive Graph Induction module to make use of the descriptive knowledge. The module first adopts a hybrid method of retrieval and generation to obtain the descriptive knowledge, and then utilizes the information aggregation technique to encode the graph-structured knowledge. Meanwhile, we propose a Relational Graph Induction module to leverage the relational knowledge. The module first treats the reasoning structure as a latent variable and learns it in an end-to-end fashion. Then, the module performs event causality reasoning based on the induced structure. Experimental results on two widely used datasets demonstrate that our model substantially outperforms previous state-of-the-art methods.
Our contributions are summarized as follows:
• We propose a novel Latent Structure Induction Network (LSIN) to leverage external structural knowledge. To our knowledge, we are the first to use both descriptive knowledge and relational knowledge for this task.
• To exploit the descriptive knowledge, we devise a descriptive graph induction module. To utilize the relational knowledge, we propose a relational graph induction module.
• Experimental results on two widely used datasets indicate that our proposed approach significantly outperforms previous state-of-the-art methods.

Related Work
Event causality identification (ECI) is an important task in natural language processing that has attracted extensive attention in the past few years. Early studies are feature-based methods which utilize lexical and syntactic features (Riaz and Girju, 2013; Gao et al., 2019), explicit causal patterns (Beamer and Girju, 2009; Do et al., 2011), and statistical causal associations (Riaz and Girju, 2014; Hashimoto et al., 2014; Hashimoto, 2019). Recently, neural network-based methods have been proposed for the task and have achieved state-of-the-art performance (Kruengkrai et al., 2017; Kadowaki et al., 2019; Zuo et al., 2020). Liu et al. (2020) propose a mention masking generalization method and also consider external structural knowledge. The very recent work of Zuo et al. (2020) proposes a data augmentation method to alleviate the data-lacking problem of the task. Regarding dataset construction, Mirza (2014) annotates the Causal-TimeBank dataset of event causal relations in the TempEval-3 corpus, and Caselli and Vossen (2017) construct a dataset called EventStoryLine for event causality identification. Despite many efforts on this task, most existing methods train models solely on manually labeled data and rarely consider external structural knowledge. As a result, these methods cannot handle well the cases where there is no explicit causal clue.
Although Liu et al. (2020) leverage the descriptive knowledge to enrich event representations, they directly retrieve the descriptive knowledge from the KB. Therefore, their method cannot handle the cases where there is no knowledge about the event in the KB. In addition, they ignore the relational knowledge between events. By contrast, our method can not only generate the descriptive knowledge when it cannot be retrieved from the KB, but also leverage the relational knowledge. To our knowledge, we are the first to simultaneously make use of the descriptive knowledge and relational knowledge for this task.

Methodology
Following previous works (Ning et al., 2018), we formulate ECI as a binary classification problem: for every pair of events in a sentence, we predict whether a causal relation holds. Figure 2 schematically visualizes our approach, which consists of three major components: (1) Context Encoding (§3.1), which encodes the input sentence and outputs contextualized representations; (2) Descriptive Graph Induction (§3.2), which first obtains the corresponding descriptive knowledge for each event and then encodes the graph-structured knowledge; (3) Relational Graph Induction (§3.3), which automatically induces a reasoning structure and performs causality reasoning on the induced structure. We illustrate each component in detail below.

Context Encoding
Given a sentence with a pair of events (denoted as e_1 and e_2), the context encoding module aims to extract context features: it takes the sentence as input and outputs context representations. Our context encoder is based on the Transformer architecture (Vaswani et al., 2017). We adopt BERT (Devlin et al., 2019), which has achieved state-of-the-art performance on the ECI task (Zuo et al., 2020), to encode the input sentence. After using the BERT encoder to compute contextual representations of the entire sentence, we concatenate the representations of [CLS], e_1 and e_2 as the context representation of the event pair (e_1, e_2), namely h_{e_1,e_2} = h_{[CLS]} ⊕ h_{e_1} ⊕ h_{e_2}, where ⊕ indicates the concatenation operation, h_{[CLS]}, h_{e_1}, h_{e_2} ∈ R^d are the representations of [CLS], e_1 and e_2, respectively, and d is the output hidden size of the BERT model.

Knowledge Obtaining
Given e_1 and e_2, we adopt a hybrid method of retrieval and generation to obtain their descriptive knowledge. The descriptive knowledge forms a sub-graph which we call the Descriptive Graph (denoted as G_d). In this paper, we adopt CONCEPTNET (Speer et al., 2017) as the external KB, which contains abundant semantic knowledge about concepts. We take e_1 as an example to illustrate the knowledge obtaining procedure: (1) If the descriptive knowledge can be retrieved from the KB, we adopt the retrieval method. Our method first grounds e_1 to a concept by matching the event mention against the tokens of concepts in CONCEPTNET. We enhance the matching approach with some rules, such as soft matching with lemmatization and filtering of stop words. The grounded concept is called the zero-hop concept. Then, our method grows the zero-hop concept with its one-hop concepts. The zero-hop concept, the one-hop concepts and all relations between them form the descriptive graph for e_1 (denoted as G_{d_1}).
(2) If the descriptive knowledge cannot be retrieved from the KB, we adopt the generation method. Our method employs the pre-trained model COMET (Bosselut et al., 2019), which was originally proposed for knowledge base completion. Specifically, COMET is obtained by fine-tuning GPT (Radford et al., 2018) on CONCEPTNET. The input of COMET is a head event and a candidate relation, and the output is a tail event. The relation types are the same as those used in Bosselut et al. (2019). By leveraging COMET, we can generate the descriptive graph G_{d_1} for e_1.
In the same way, we can also construct the descriptive graph G_{d_2} for e_2.
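The retrieve-or-generate procedure above can be sketched as follows. This is a minimal illustration, not the authors' code: the ConceptNet index and the COMET generator are stubbed out with toy stand-ins, and the function names, the relation list and the matching rule are hypothetical simplifications (the paper's grounding additionally uses lemmatization and stop-word filtering).

```python
# Toy stand-in for a ConceptNet index: concept -> list of (relation, neighbor).
CONCEPTNET = {
    "global warming": [("IsA", "temperature change"),
                       ("CreatedBy", "greenhouse gas")],
}

# Illustrative subset of candidate relation types for generation.
RELATIONS = ["IsA", "Causes", "CreatedBy", "HasSubevent"]

def ground(mention, kb):
    """Match an event mention against KB concepts (rules omitted for brevity)."""
    key = mention.lower().strip()
    return key if key in kb else None

def generate_with_comet(mention, relations):
    """Placeholder for COMET: given (head, relation), produce a tail concept."""
    return [(r, f"<generated tail for '{mention}' under {r}>") for r in relations]

def descriptive_graph(mention, kb=CONCEPTNET):
    """Return the event's one-hop descriptive graph as (head, rel, tail) triples."""
    concept = ground(mention, kb)
    if concept is not None:                      # (1) retrieval path
        return [(concept, r, t) for r, t in kb[concept]]
    return [(mention, r, t)                      # (2) generation fallback
            for r, t in generate_with_comet(mention, RELATIONS)]

print(descriptive_graph("global warming")[0])   # retrieved triple
print(len(descriptive_graph("tsunami")))        # generated: one tail per relation
```

The point of the hybrid design is visible in the last two calls: a concept present in the KB is served by retrieval, while an out-of-KB event still receives a (generated) descriptive graph rather than none.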

Knowledge Encoding
Graph neural networks have been widely used to encode graph-structured data (Yang et al., 2019), as they are able to effectively collect relevant evidence based on an information aggregation scheme. In addition, many works show that relational graph convolutional networks (R-GCNs) (Schlichtkrull et al., 2018) tend to overparameterize the model and cannot effectively utilize multi-hop relational information (Zhang et al., 2018). We thus apply GCNs (Kipf and Welling, 2017) to encode the descriptive knowledge related to e_1 and e_2.
Formally, a descriptive graph G_d (i.e., G_{d_1} or G_{d_2}) with n_d nodes (i.e., concepts) can be represented with an n_d × n_d adjacency matrix A^d, where A^d_{ij} is set to 1 if there is a connection between node i and node j. For node i at the l-th layer, the convolution computation is defined as follows:

u_i^{(l)} = ρ( Σ_{j=1}^{n_d} A^d_{ij} W^{(l)} u_j^{(l-1)} + b^{(l)} ),

where W^{(l)} and b^{(l)} are the weight matrix and bias vector for the l-th layer, respectively, and ρ is an activation function (e.g., ReLU). u_i^{(0)} ∈ R^d is the initial representation of the i-th node, obtained from the pre-trained model (i.e., BERT). To consider context information when encoding the descriptive knowledge, we use the h_{e_1} and h_{e_2} obtained in Section 3.1 as the initial representations of the events.
After the knowledge encoding, the representations of e_1 and e_2 in the descriptive graphs are denoted as u_{e_1} and u_{e_2}, respectively. We concatenate them as the descriptive knowledge representation: u_{e_1,e_2} = u_{e_1} ⊕ u_{e_2}.
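A minimal numpy sketch of the GCN aggregation above, assuming a toy graph, random initial features in place of BERT outputs, and a shared weight per layer; all shapes and the self-loop convention are illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

n_d, d = 4, 8                        # 4 concepts, hidden size 8
A_d = np.zeros((n_d, n_d))
edges = [(0, 1), (0, 2), (2, 3)]     # undirected descriptive edges
for i, j in edges:
    A_d[i, j] = A_d[j, i] = 1.0
np.fill_diagonal(A_d, 1.0)           # self-loops so a node keeps its own features

U = rng.normal(size=(n_d, d))        # u^(0): initial node representations
W = rng.normal(size=(d, d)) * 0.1    # layer weight matrix
b = np.zeros(d)                      # layer bias

def gcn_layer(A, U, W, b):
    # u_i^(l) = ReLU( sum_j A_ij W u_j^(l-1) + b ), vectorized over nodes
    return np.maximum(A @ U @ W.T + b, 0.0)

U1 = gcn_layer(A_d, U, W, b)
U2 = gcn_layer(A_d, U1, W, b)        # stacking layers aggregates multi-hop info
u_pair = np.concatenate([U2[0], U2[1]])  # u_{e1} ⊕ u_{e2}
print(U2.shape, u_pair.shape)
```

Stacking two layers lets a node absorb information from its two-hop neighborhood, which is why graph encoding can beat flattening the sub-graph into a sequence.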

Multi-Hop Path Obtaining
Given e_1 and e_2, our model first retrieves the multi-hop path between the two events from CONCEPTNET. We refer to this path as the Relational Path. Since shorter connections between two concepts can indicate stronger relevance (Lin et al., 2019), our model uses the shortest path between the two events as the relational path. We represent CONCEPTNET as a graph and then use the NetworkX toolkit to get the shortest path between the two events. When there are multiple shortest paths, we randomly select one to avoid information redundancy.
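The retrieval step can be sketched with a breadth-first search over a toy triple set. The paper uses the NetworkX toolkit on the full ConceptNet graph; this self-contained BFS is equivalent for unweighted shortest paths. The triples mirror the Figure 1 example but are illustrative.

```python
from collections import deque

TRIPLES = [
    ("global warming", "Causes", "sea-level rising"),
    ("sea-level rising", "AtLocation", "ocean"),
    ("sea-level rising", "Causes", "tsunami"),
    ("ocean", "RelatedTo", "tsunami"),
]

def shortest_path(src, dst, triples):
    """BFS shortest path over triples, treating edges as undirected."""
    adj = {}
    for h, _, t in triples:
        adj.setdefault(h, []).append(t)
        adj.setdefault(t, []).append(h)
    prev, queue = {src: None}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:                     # reconstruct path back to src
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for nxt in adj.get(node, []):
            if nxt not in prev:
                prev[nxt] = node
                queue.append(nxt)
    return None                             # no connection in the KB

print(shortest_path("global warming", "tsunami", TRIPLES))
# ['global warming', 'sea-level rising', 'tsunami']
```

Note that the returned path keeps "sea-level rising" but drops the side branch through "ocean", matching the intuition that the shortest path carries the most relevant concepts.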

Structure Induction
To capture potentially useful information and reduce the impact of irrelevant knowledge on the relational path, our model treats the reasoning structure as a latent variable and induces it from the relational path, as shown in Figure 2. We call the induced reasoning structure the Relational Graph (denoted as G_r). The structure induction module is built on structured attention (Kim et al., 2017). We use a variant of Kirchhoff's Matrix-Tree Theorem (Koo et al., 2007; Nan et al., 2020) to learn the graph structure.
Formally, the nodes of the relational graph are the concepts on the relational path. The initial representation of each node is obtained via the pre-trained model (i.e., BERT); the representation of the i-th node is denoted as m_i ∈ R^d. We first calculate the pair-wise unnormalized attention score s_{ij} between the i-th node and the j-th node:

s_{ij} = (tanh(W_p m_i))^T W_b (tanh(W_c m_j)),

where W_p and W_c are weight matrices and W_b is the weight matrix of the bilinear transformation. Next, we compute the root score s_i^r, which represents the unnormalized probability of the i-th node being selected as the root node of the structure:

s_i^r = W_r m_i,

where W_r ∈ R^{1×d} is the weight of the linear transformation. Suppose the graph G_r has n_r nodes. We first assign non-negative weights P ∈ R^{n_r×n_r} to the edges of the induced relational graph:

P_{ij} = 0 if i = j, otherwise P_{ij} = exp(s_{ij}),

where P_{ij} is the weight of the edge between the i-th and the j-th node. Then, following Koo et al. (2007), we define the Laplacian matrix L ∈ R^{n_r×n_r} of G_r and its variant L̄ ∈ R^{n_r×n_r}, respectively:

L_{ij} = Σ_{i'=1}^{n_r} P_{i'j} if i = j, otherwise L_{ij} = −P_{ij};
L̄_{ij} = exp(s_j^r) if i = 1, otherwise L̄_{ij} = L_{ij}.

We use A^r_{ij} to denote the marginal probability of the edge between the i-th node and the j-th node, which can be computed as follows:

A^r_{ij} = (1 − δ(1, j)) P_{ij} [L̄^{−1}]_{jj} − (1 − δ(i, 1)) P_{ij} [L̄^{−1}]_{ji},

where δ is the Kronecker delta (Koo et al., 2007) and ·^{−1} denotes matrix inversion. A^r can be regarded as a weighted adjacency matrix of the graph G_r. Finally, A^r is fed into the iterative refinement module for event causality reasoning.
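A numerical sketch of the Matrix-Tree computation: starting from (random, stand-in) pairwise scores s_{ij} and root scores s^r_i instead of the model's bilinear attention, it derives the marginal edge probabilities A^r that serve as the weighted adjacency matrix of the latent relational graph. The index convention follows Koo et al. (2007) shifted to 0-based arrays.

```python
import numpy as np

rng = np.random.default_rng(0)
n_r = 5                                   # number of concepts on the path

s = rng.normal(size=(n_r, n_r))           # pairwise scores s_ij (stand-in)
s_root = rng.normal(size=n_r)             # root scores s^r_i (stand-in)

P = np.exp(s)
np.fill_diagonal(P, 0.0)                  # P_ij = 0 when i = j

L = np.diag(P.sum(axis=0)) - P            # Laplacian: L_jj = sum_i' P_i'j
L_bar = L.copy()
L_bar[0, :] = np.exp(s_root)              # variant: first row holds root scores

Linv = np.linalg.inv(L_bar)
i, j = np.meshgrid(np.arange(n_r), np.arange(n_r), indexing="ij")
A_r = ((j != 0) * P * Linv[j, j]          # (1 - delta(1, j)) P_ij [Lbar^-1]_jj
       - (i != 0) * P * Linv[j, i])       # (1 - delta(i, 1)) P_ij [Lbar^-1]_ji

root_marg = np.exp(s_root) * Linv[:, 0]   # marginal of node j being the root

# Sanity check: in a dependency-tree distribution every node has exactly one
# parent (another node or the root), so each column of A_r plus that node's
# root marginal sums to 1.
print(np.allclose(A_r.sum(axis=0) + root_marg, 1.0))
```

The check at the end is a useful unit test when implementing this module: if the columns do not sum to one, the Laplacian or the index convention is wrong.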

Iterative Refinement
After obtaining the relational graph structure, we perform event causality reasoning on the induced structure. To better capture potential reasoning clues, we adopt densely connected graph convolutional networks (DCGCNs) (Guo et al., 2019), which allow training a deeper reasoning model. The convolution computation of each layer is:

h_i^{(l)} = ρ( Σ_{j=1}^{n_r} A^r_{ij} W^{(l)} g_j^{(l)} + b^{(l)} ),

where g_j^{(l)} = [m_j; h_j^{(1)}; ...; h_j^{(l−1)}] is the concatenation of the initial node representation and the node representations produced in layers 1, ..., l − 1. A structure induced in a single pass is relatively shallow (Liu et al., 2019; Nan et al., 2020) and may not be optimal for causality reasoning. Therefore, we iteratively refine the induced structure to learn a more informative one. We stack N blocks of this module (each block consisting of structure induction and DCGCN reasoning) to induce the structure N times. Intuitively, as the structure gets more refined, it becomes more reasonable.
After the iterative refinement, the representations of e_1 and e_2 are denoted as v_{e_1} and v_{e_2}, respectively. We concatenate them as the relational knowledge representation: v_{e_1,e_2} = v_{e_1} ⊕ v_{e_2}.
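The dense connectivity of a DCGCN block can be sketched in a few lines of numpy: the input to layer l is the concatenation of the initial node features and all previous layers' outputs. Layer count, dimensions and the random weights are illustrative, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_r, d = 5, 8
A_r = rng.random((n_r, n_r))              # induced weighted adjacency (stand-in)
M = rng.normal(size=(n_r, d))             # initial node representations

def dcgcn_block(A, M, num_layers=3, d_out=8):
    outputs = [M]
    for _ in range(num_layers):
        G = np.concatenate(outputs, axis=1)        # g^(l): dense connectivity
        W = rng.normal(size=(d_out, G.shape[1])) * 0.1
        outputs.append(np.maximum(A @ G @ W.T, 0)) # h^(l) = ReLU(A g W^T)
    return outputs[-1]

H = dcgcn_block(A_r, M)
print(H.shape)          # refined node representations, one row per concept
```

Because every layer sees all earlier outputs, gradients have short paths to early layers, which is what makes it practical to stack deeper reasoning layers than with a plain GCN.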

Model Prediction and Training
We concatenate the context representation, the descriptive knowledge representation and the relational knowledge representation as the final representation: F_{e_1,e_2} = h_{e_1,e_2} ⊕ u_{e_1,e_2} ⊕ v_{e_1,e_2}. To make the final prediction, we perform a binary classification taking F_{e_1,e_2} as input: p_{e_1,e_2} = softmax(W_s F_{e_1,e_2} + b_s).
For training, we adopt cross entropy as the loss function:

L(Θ) = − Σ_{s∈D} Σ_{e_i,e_j∈E_s} y_{e_i,e_j} · log p_{e_i,e_j},

where Θ denotes the model parameters, s denotes a sentence in the training set D, E_s is the set of events in sentence s, and y_{e_i,e_j} is a one-hot vector representing the gold label between e_i and e_j.
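For a single event pair, the prediction head and loss reduce to a few lines. This is a toy numpy sketch with random vectors standing in for the learned representations; the hidden size and the class order {no-causal, causal} are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 8
h = rng.normal(size=3 * d)          # context representation h_{e1,e2}
u = rng.normal(size=2 * d)          # descriptive knowledge representation
v = rng.normal(size=2 * d)          # relational knowledge representation

F = np.concatenate([h, u, v])       # final representation F_{e1,e2}
W_s = rng.normal(size=(2, F.size)) * 0.1
b_s = np.zeros(2)

logits = W_s @ F + b_s
p = np.exp(logits - logits.max())   # numerically stable softmax
p /= p.sum()

y = np.array([0.0, 1.0])            # one-hot gold label: causal
loss = -(y * np.log(p)).sum()       # cross-entropy for this event pair
print(p.sum(), loss > 0)
```

Summing this per-pair loss over all event pairs in all training sentences gives the objective L(Θ) above.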

Datasets and Evaluation Metrics
We evaluate our proposed method on two widely used datasets: EventStoryLine (Caselli and Vossen, 2017) and Causal-TimeBank (Mirza, 2014). EventStoryLine contains 258 documents and 5,334 events in total; 1,770 of 7,805 event pairs are causally related. Causal-TimeBank contains 184 documents and 6,813 events; 318 of 7,608 event pairs are causally related. Following previous methods, to ensure fairness we conduct 5-fold cross-validation on EventStoryLine and 10-fold cross-validation on Causal-TimeBank. Following previous works (Choubey and Huang, 2017; Gao et al., 2019), we adopt Precision (P), Recall (R) and F1-score (F1) as evaluation metrics.

Parameter Settings
In our implementation, we use the HuggingFace Transformers library to implement the uncased BERT-base model, which has 12 layers, 768 hidden units, and 12 attention heads. The learning rate is initialized to 2e-5 with a linear decay. We use the Adam algorithm (Kingma and Ba, 2015) to optimize model parameters. The batch size is set to 20. The number of induction blocks (i.e., N) is set to 2. The dropout of the GCN is set to 0.3. Due to the sparseness of positive examples, we adopt a negative sampling strategy for training. The negative sampling rate is 0.6 and 0.7 for EventStoryLine and Causal-TimeBank, respectively. We utilize CONCEPTNET 5.0 as the external knowledge base.

Baselines
We compare the proposed approach LSIN with previous state-of-the-art methods. Feature-based methods: (1) Mirza and Tonelli (2014), which proposes a data-driven method with causal signals for the task; (2) Mirza (2014), which employs a verb-rule-based model with data filtering and causal signal enhancement; (3) Choubey and Huang (2017), which proposes a sequence model exploring lexical and syntactic features for the task; (4) Gao et al. (2019), which utilizes a logistic regression classifier with integer linear programming to model the causal structure of the task. Neural network-based methods: (1) Cheng and Miyao (2017), which proposes a dependency-path-based bidirectional long short-term memory network (BiLSTM) that models the context between two event mentions for causal relation identification; (2) KMMG (Liu et al., 2020), which proposes a mention masking generalization method and also utilizes external knowledge; (3) KnowDis (Zuo et al., 2020), which proposes a knowledge-enhanced distant data augmentation method to alleviate the data-lacking problem.

Overall Results
Since some baselines are evaluated on only one of the two datasets, the baselines used for EventStoryLine and Causal-TimeBank differ. Table 1 and Table 2 show the results on EventStoryLine and Causal-TimeBank, respectively. From the tables, we can observe that: (1) Our method outperforms all the baselines by a large margin on both datasets. For example, compared with the state-of-the-art model KnowDis (Zuo et al., 2020), our method achieves 2.8% and 3.9% improvements in F1-score on EventStoryLine and Causal-TimeBank, respectively. This indicates that our proposed method is very effective for this task.

Table 3: Experimental results using different kinds of knowledge on the EventStoryLine dataset. "DK" and "RK" refer to "descriptive knowledge" and "relational knowledge", respectively.
(2) Compared with the state-of-the-art model KMMG (Liu et al., 2020), our method achieves 6.0% improvements in Precision on EventStoryLine. The reason may be that our method utilizes the relational knowledge between events for causality reasoning, which can improve the confidence of event causality prediction.
(3) Our method improves upon the BERT model by 8.0% and 12.4% in terms of F1-score on the two datasets, respectively. This suggests that only using the annotated training data is not enough to tackle the task. Moreover, it also indicates that our method is able to effectively leverage the external structural knowledge for ECI task.
(4) The BERT model achieves comparable performance with complex feature-based methods such as Gao et al. (2019) on the EventStoryLine dataset, which indicates that the BERT is able to extract useful text features for the task.

Effectiveness of External Structural Knowledge
We validate the effectiveness of external structural knowledge for this task. Based on the BERT model, we leverage the descriptive knowledge via descriptive graph induction module, and the relational knowledge via relational graph induction module.
The results are shown in Table 3. We have two important observations: (1) Based on the BERT model, incorporating these two kinds of knowledge can both improve performance. Moreover, simultaneously using these two kinds of knowledge can further improve the performance. It indicates that the external structural knowledge is very effective for this task.
(2) The performance improvement from using the relational knowledge is more obvious than that from using the descriptive knowledge, achieving 4.0% improvements in F1-score. We conjecture that the relational knowledge provides more clues for event causality reasoning.

Table 4: Comparison between different methods for using the descriptive knowledge on the EventStoryLine dataset. "DGI" refers to "descriptive graph induction".

Effectiveness of Descriptive Graph Induction
To verify the effectiveness of the descriptive graph induction module, we compare our method with the state-of-the-art model of Liu et al. (2020), which first retrieves the descriptive knowledge, then transfers the knowledge into a sequence, and finally adopts BERT to encode it. The results are listed in Table 4. In the table, "DGI-Retrieval", "DGI-Generation" and "DGI-Hybrid" denote obtaining the descriptive knowledge via the retrieval, generation and hybrid method, respectively. Overall, we can observe that: (1) The DGI-Hybrid model significantly outperforms Liu et al. (2020), achieving 4.5% improvements in F1-score. Moreover, even when using the same retrieval method as Liu et al. (2020), our model still achieves a better result. This indicates that the descriptive graph induction module can better take advantage of the descriptive knowledge.
(2) Compared with Liu et al. (2020), the DGI-Hybrid model achieves great improvements in Recall (i.e., 12.6%). The reason is that our method can automatically generate the descriptive knowledge when it cannot be retrieved from the KB.

Effectiveness of Relational Graph Induction
To validate the effectiveness of the relational graph induction module, we compare our method with three other baselines: (1) LSTM-based Reasoning, which regards the relational path as a sequence and employs an LSTM to encode it; (2) Fixed Graph-based Reasoning, which regards the relational path as a graph whose nodes are the concepts on the path, with edges only between adjacent concepts; (3) Attention-based Reasoning, which uses self-attention to encode the relational path, modeling the dependencies between arbitrary pairs of concepts. The results are shown in Table 5. From the results, we can observe that: (1) Our method LSIN outperforms the three methods by a large margin. For example, compared with the LSTM-based reasoning method, our method achieves 4.4% improvements in F1-score. This empirically confirms that using the induced relational graph structure is more effective than directly using the relational path for causality reasoning.
(2) Compared with Fixed Graph-based reasoning method, our method achieves 3.6% improvements of F1-score. It indicates that our method is able to effectively capture the potentially useful information and reduce the impact of irrelevant knowledge on the relational path.
(b) The fights erupted in Flatbush, and 46 were arrested at Wednesday . . .

Table 6: Results of the case study, where bold denotes the event pair; ✓ and ✗ denote a correct and an incorrect prediction, respectively.

Impact of the Number of Refinements
We investigate the effect of refinement on overall performance. Figure 3 plots the overall F1-score as the number of refinements varies. From the figure, we can observe that: (1) Our method LSIN yields the best performance at the second refinement. Compared with the first induction, the second refinement achieves 1.1% improvements in F1-score on the EventStoryLine dataset. This indicates that the proposed LSIN is able to induce more reasonable reasoning structures through iterative refinement.
(2) When the number of refinements is too large, the performance on the two datasets stops increasing or even decreases due to over-fitting.

Case Study
We conduct a case study to further verify the effectiveness of our method. Table 6 shows several cases with the outputs of BERT and our method LSIN. From the results, we can observe that the BERT model cannot handle the cases where there is no causal clue. By contrast, our method makes correct predictions by leveraging the external structural knowledge. For the second example in Table 6, although the text has no clue indicating the existence of causality between "fights" and "arrested", the KB contains relational knowledge connecting the two events, namely a relational path from "fight" to "arrested".

Conclusion
In this paper, we propose a novel latent structure induction network (LSIN) to leverage the external structural knowledge for ECI task. To make use of the descriptive knowledge, we devise a descrip-tive graph induction module to obtain and encode the graph-structured descriptive knowledge. To utilize the relational knowledge, we propose a relational graph induction module to induce a more reasonable reasoning structure for causality reasoning. Experimental results on two widely used datasets indicate that our approach substantially outperforms previous state-of-the-art methods.