Enhancing Event Causality Identification with Event Causal Label and Event Pair Interaction Graph

Most existing event causality identification (ECI) methods rarely consider the event causal label information and the interaction information between event pairs. In this paper, we propose a framework to enrich the representation of event pairs by introducing the event causal label information and the event pair interaction information. In particular, 1) we design an event-causal-label-aware module to model the event causal label information, in which we design the event causal label prediction task as an auxiliary task of ECI, aiming to predict which events are involved in a causal relationship (we call them causality-related events) by mining the dependencies between events. 2) We further design an event pair interaction graph module to model the interaction information between event pairs, in which we construct the interaction graph with event pairs as nodes and leverage a graph attention mechanism to model the degree of dependency between event pairs. The experimental results show that our approach outperforms previous state-of-the-art methods on two benchmark datasets, EventStoryLine and Causal-TimeBank.


Introduction
Event causality identification (ECI) aims to identify the causal relationship between pairs of events in a text. As shown in Figure 1(a), given a text and its events as inputs, the ECI model needs to identify three causal relationships: <e7, cause, e5>, <e5, cause, e2>, and <e7, cause, e2>. As a semantic relationship, causality is important for semantic understanding and discourse analysis. Moreover, causal knowledge identified from text can be useful for many natural language processing tasks (Fei et al., 2020a,b; Wang et al., 2020; Zhou et al., 2021; Dalal et al., 2021).
Figure 1: (a) An example of leveraging the event causal label information and the interaction information between event pairs for ECI: "Meanwhile, ISPs who use (e1) bandwidth on the SEACOM cable have been scrambling (e2) to implement (e3) contingency (e4) plans (e5) to keep (e6) their customers connected (e7)." (b) The event pairs composed of the seven events in (a), where the dark area denotes the event pairs composed of causality-related events.

Various approaches have been proposed for ECI, from early feature-based methods (Mirza, 2014; Caselli and Vossen, 2017; Gao et al., 2019) to current neural-network-based methods (Liu et al., 2020; Zuo et al., 2020, 2021a,b; Cao et al., 2021; Phu and Nguyen, 2021). The existing methods have achieved impressive performance. However, as far as we know, they usually improve the performance of ECI models by introducing commonsense knowledge or generating additional training data through data augmentation, with less focus on the inherent characteristics of the data. Specifically: 1) Minimal use of event causal label information. In the ECI task, a sentence usually contains multiple events. However, not all events are involved in causality.
For instance, the sentence shown in Figure 1(a) has seven events, among which e2, e5, and e7 are causality-related events, whereas e1, e3, e4, and e6 are causality-unrelated events. In general, there is a strong semantic correlation between events that have a causal relationship. If the model can first identify e2, e5, and e7 as causality-related events according to the dependencies between events and the contextual semantics, it can limit the identification scope of causality to some extent, as shown in Figure 1(b), thereby reducing the interference of unrelated events. Furthermore, by performing a statistical analysis on two benchmark datasets, EventStoryLine and Causal-TimeBank, we find that 55.13% and 38.63% of the sentences with events in the two datasets contain four or more events, where the average number of different candidate event pairs is 13.12 and 10.64, and the average number of event pairs with a causal relationship is 1.99 and 0.30, respectively. If we can first predict which events are causality-related according to the dependencies between them, the average number of candidate event pairs composed of causality-related events is 3.68 and 0.37, respectively, thereby narrowing the range of causality from 13.12 and 10.64 candidate pairs down to 3.68 and 0.37. Therefore, it is necessary to introduce the event causal label information to assist the ECI task. 2) Little focus on the interaction between event pairs. The existing method (Phu and Nguyen, 2021) uses events as nodes and leverages toolkits to obtain information (such as dependency parse trees and coreference) to construct an event graph. However, because such graphs take events as nodes, they cannot directly learn the interaction between event pairs. As shown in Figure 1(a), e2 and e7 are far from each other in the text, so directly identifying the causal relationship between them is difficult for the model. If <e7, cause, e5> and <e5, cause, e2> are known, it is easier to infer the causal relationship between e2 and e7 according to the transitivity of the causal relationship (Hall, 2000). Therefore, it is essential to introduce the interaction information between event pairs to learn their dependencies.

To address the above limitations, we propose a framework called ECLEP, which introduces Event Causal Label information and Event Pair interaction information to enrich the representation of event pairs. In particular, 1) we design an event-causal-label-aware module to model the event causal label information, in which we introduce the event causal label prediction task as an auxiliary task of ECI to mine the dependencies between events. 2) We further design an event pair interaction graph module to model the interaction information between event pairs. In particular, we construct the interaction graph with event pairs as nodes and adopt a graph attention mechanism to model the degree of dependency between event pairs. The experimental results on two benchmark datasets show that our overall framework outperforms previous state-of-the-art methods.

Methodology
In ECI, the goal is to predict whether a causal relationship exists for each event pair $(e_i, e_j)$ ($i \neq j$), given as inputs a text $S = (w_1, w_2, ..., w_n)$ and the event set $E = (e_1, e_2, ..., e_m)$ contained in $S$, where $w_i$ denotes the $i$-th word in $S$ and $e_i$ denotes the $i$-th event in $E$. The overall framework of the proposed method is shown in Figure 2, which mainly includes four modules; we illustrate each component in detail in the following.
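To make the candidate space concrete, the pair enumeration described above can be sketched as follows (a minimal illustration assuming ordered pairs, i.e. $(e_i, e_j)$ and $(e_j, e_i)$ are distinct candidates, as the $i \neq j$ formulation suggests):

```python
from itertools import permutations

def candidate_pairs(events):
    """Enumerate every ordered candidate event pair (e_i, e_j) with i != j."""
    return list(permutations(events, 2))

# Four events yield 4 * 3 = 12 ordered candidate pairs.
pairs = candidate_pairs(["e1", "e2", "e5", "e7"])
```

For the seven events of Figure 1(a), this yields 42 ordered candidates, which is why narrowing the search to causality-related events matters.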

Encoding Layer
Given an input sentence $S$, we adopt the pre-trained language model BERT (Devlin et al., 2019) as the sentence encoder to extract the hidden contextual representations $H_S = (h_{w_1}, h_{w_2}, ..., h_{w_n})$, where $h_{w_i}$ denotes the hidden representation of the $i$-th token in $S$. Then, the event representations $H_E = (h_{e_1}, h_{e_2}, ..., h_{e_m})$ can be obtained by summing the representations of the tokens contained in each event.
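The summation step can be sketched as follows; the toy vectors stand in for BERT hidden states, and the span indices are hypothetical:

```python
def event_representation(token_vecs, event_span):
    """Sum the hidden vectors of the tokens inside an event span.

    token_vecs: per-token hidden vectors (lists of floats), standing in
    for the BERT outputs H_S; event_span: (start, end) token indices,
    end-exclusive (the span boundaries here are hypothetical).
    """
    start, end = event_span
    dim = len(token_vecs[0])
    h_e = [0.0] * dim
    for vec in token_vecs[start:end]:
        for k in range(dim):
            h_e[k] += vec[k]
    return h_e
```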

Event-Causal-Label-Aware Module
A sentence usually contains multiple events; however, not all events are causality-related events. We design an event-causal-label-aware module to mine the dependencies between events and help the model pay attention to extracting causal relationships from the event pairs composed of causality-related events. In general, there is a strong semantic correlation between events with causal relationships, so we first adopt the Transformer mechanism (Vaswani et al., 2017) to capture the dependencies between events and obtain the updated event representations via Eq. (1):

$\tilde{H}_E = \mathrm{Transformer}(H_E)$ (1)

Then, we perform a binary classification to predict the probability $p^{L}_{e_i}$ of each event $e_i$ in $E$ being a causality-related event via Eq. (2):

$p^{L}_{e_i} = \sigma(W^{L} \tilde{h}_{e_i} + b^{L})$ (2)
where $W^{L}$ and $b^{L}$ are learnable parameters and $\sigma$ denotes the sigmoid function. To incorporate the event causal label into the event pair, we introduce a learnable label vector set $L = \{l_{ij}\}$ ($l_{ij} \in \{cr, cu\}$), where $cr$ denotes the event pairs composed of causality-related events and $cu$ denotes the event pairs that have at least one causality-unrelated event. In particular, the embedding vectors in $L$ are randomly initialized by sampling from a uniform distribution and are learned together with the model during training. For the event pair $(e_i, e_j)$, if $e_i$ and $e_j$ are both causality-related events (i.e., $p^{L}_{e_i} \geq 0.5$ and $p^{L}_{e_j} \geq 0.5$), the label of $(e_i, e_j)$ is $l_{ij} = cr$; otherwise, $l_{ij} = cu$.
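The prediction and thresholding rule can be sketched as follows (a simplified illustration: `w` and `b` stand in for the learned classifier parameters, and the 0.5 threshold follows the text):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def causality_related_prob(h_e, w, b):
    """p^L_{e_i}: probability that an event is causality-related,
    from a linear layer plus sigmoid over its representation."""
    return sigmoid(sum(wk * hk for wk, hk in zip(w, h_e)) + b)

def pair_label(p_i, p_j, threshold=0.5):
    """'cr' if both events are predicted causality-related, else 'cu'."""
    return "cr" if p_i >= threshold and p_j >= threshold else "cu"
```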

Event Pair Interaction Graph Module
To capture the interaction information between event pairs, we construct the interaction graph with event pairs as nodes and adopt the graph attention mechanism to adaptively fuse the information of neighbor nodes.
• Event Pair Interaction Graph Construction. Given the event set $E$, the events are combined in pairs to form the set of candidate causal event pairs $EP = \{ep_{ij}\}$ ($0 < i, j \leq |E|$, $i \neq j$) as nodes. In $EP$, each node is initialized as $ep_{ij} = [h_{e_i}; h_{e_j}; r_{ij}]$, where $h_{e_i}, h_{e_j} \in H_E$ and $r_{ij}$ denotes the relative position embedding between events $e_i$ and $e_j$ in the text.
For the edges of the interaction graph, we consider that event pairs in the same row or column are strongly associated with the current event pair (Ding et al., 2020). As shown in Figure 1(b), the relationship of the event pair <e7, e2> can be transmitted through the same-row event pair <e7, e5> and the same-column event pair <e5, e2>. Thus, we connect edges between event pairs in the same row or column and add self-loop edges so that each node also fuses its own information.
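Under these construction rules, the adjacency structure can be sketched as follows (indices stand for events; sharing the first index corresponds to the same row and sharing the second to the same column in the pair grid of Figure 1(b)):

```python
def build_pair_graph(num_events):
    """Nodes: ordered event pairs (i, j), i != j.  Edges: pairs sharing
    the first event (same row), pairs sharing the second event (same
    column), plus a self-loop on every node."""
    nodes = [(i, j) for i in range(num_events)
             for j in range(num_events) if i != j]
    adj = {n: set() for n in nodes}
    for a in nodes:
        for b in nodes:
            if a == b or a[0] == b[0] or a[1] == b[1]:
                adj[a].add(b)
    return adj
```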
• Event Pair Interaction Graph Update. Considering that different neighbor nodes have different importance for each event pair, we leverage Graph Attention Networks (GAT) (Veličković et al., 2018) to model the degree of dependency between event pairs. GAT propagates information among nodes by stacking multiple graph attention layers. At the $t$-th graph attention layer, the representation of each node is updated via Eq. (3):

$ep^{t}_{ij} = \mathrm{ReLU}\big( \sum_{uv \in N(ij)} \alpha^{t}_{ij,uv} W^{t} ep^{t-1}_{uv} \big)$ (3)

where $N(ij)$ denotes the directly neighboring nodes of $ep_{ij}$, and the attention weight $\alpha^{t}_{ij,uv}$, which reflects the strength of the aggregation between nodes $ep^{t-1}_{ij}$ and $ep^{t-1}_{uv}$, is learned via Eq. (4):

$\alpha^{t}_{ij,uv} = \mathrm{softmax}_{uv \in N(ij)}\big( \mathrm{LeakyReLU}( w^{t\top} [W^{t}_{ij} ep^{t-1}_{ij}; W^{t}_{uv} ep^{t-1}_{uv}] + b^{t} ) \big)$ (4)

where $W^{t}$, $W^{t}_{ij}$, $W^{t}_{uv}$, $w^{t}$, and $b^{t}$ are learnable parameters.
Therefore, the updated node representations $EP^{I} = \{ep^{I}_{ij}\}$ can be obtained by stacking $T$ layers to model the inter-node relationships.
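A stripped-down version of one graph attention layer can be sketched as follows; the `score` function is a caller-supplied stand-in for the parameterised attention scorer, and the learnable linear transformations are omitted for brevity:

```python
import math

def gat_layer(node_vecs, adj, score):
    """One simplified graph-attention update: every node aggregates its
    neighbours' vectors, weighted by softmax-normalised attention.
    `score(u, v)` is any compatibility function over two node vectors."""
    updated = {}
    for n in node_vecs:
        neigh = list(adj[n])
        raw = [score(node_vecs[n], node_vecs[m]) for m in neigh]
        mx = max(raw)
        exps = [math.exp(r - mx) for r in raw]   # numerically stable softmax
        z = sum(exps)
        dim = len(node_vecs[n])
        out = [0.0] * dim
        for e, m in zip(exps, neigh):
            for k in range(dim):
                out[k] += (e / z) * node_vecs[m][k]
        updated[n] = out
    return updated
```

With a constant score, the layer reduces to neighbourhood averaging, which makes the aggregation behaviour easy to check.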

Prediction and Training
The predicted probability $p_{ij}$ of the event pair $(e_i, e_j)$ being a causal event pair is obtained by performing a binary classification with $ep^{I}_{ij}$ as input via Eq. (5):

$p_{ij} = \sigma(W^{I} ep^{I}_{ij} + b^{I})$ (5)

where $W^{I}$ and $b^{I}$ are learnable parameters. For training, we utilize the cross-entropy function to supervise the causal event pair prediction via Eq. (6):

$L_{ep} = -\sum_{s \in D} \sum_{e_i, e_j \in E_s, i \neq j} \big[ y_{ij} \log p_{ij} + (1 - y_{ij}) \log(1 - p_{ij}) \big]$ (6)
where $D$ denotes the training set, $s$ denotes a sentence in $D$, $E_s$ denotes the event set in $s$, and $y_{ij}$ denotes the ground-truth label of the event pair $(e_i, e_j)$.
We also add auxiliary supervision for the event causal label prediction task via Eq. (7):

$L_{e} = -\sum_{s \in D} \sum_{e_i \in E_s} \big[ y_{e_i} \log p^{L}_{e_i} + (1 - y_{e_i}) \log(1 - p^{L}_{e_i}) \big]$ (7)
where $p^{L}_{e_i}$ denotes the predicted probability of event $e_i$ being a causality-related event, and $y_{e_i}$ denotes the ground-truth label of event $e_i$, which can be obtained automatically from the labels of the event pairs without any manual annotation.
The final loss function is a weighted sum of the aforementioned terms, $L = L_{ep} + \lambda_e L_e$, where $\lambda_e \in (0, 1)$.
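The combined objective can be sketched as follows (a minimal illustration with scalar probabilities; the default $\lambda_e = 0.2$ follows the setting reported below):

```python
import math

def bce(p, y):
    """Binary cross-entropy for one prediction p against gold label y."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def total_loss(pair_preds, pair_golds, event_preds, event_golds, lambda_e=0.2):
    """L = L_ep + lambda_e * L_e: causal-pair loss plus the weighted
    auxiliary event-label loss."""
    l_ep = sum(bce(p, y) for p, y in zip(pair_preds, pair_golds))
    l_e = sum(bce(p, y) for p, y in zip(event_preds, event_golds))
    return l_ep + lambda_e * l_e
```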

Parameter Settings
We use HuggingFace's Transformers library to implement the uncased BERT-base model. The Adam algorithm (Kingma and Ba, 2015) is used as the optimizer, the learning rate is initialized to 2e-5, the batch size is set to 5, the number of GAT layers is set to 2, the dropout rate of the GAT is set to 0.3, the dimensions of the event causal label embedding and the relative position embedding are set to 80 and 40, respectively, and the weight $\lambda_e$ is set to 0.2. Moreover, since positive samples in the datasets are sparse, we adopt a negative sampling rate of 0.5 for training.
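One plausible reading of the negative sampling step can be sketched as follows (an assumption, not the paper's exact procedure: each negative pair is kept independently with probability equal to the rate, while all positive pairs are kept):

```python
import random

def negative_sample(pairs, labels, rate=0.5, seed=0):
    """Keep every positive pair (label 1); keep each negative pair
    independently with probability `rate`."""
    rng = random.Random(seed)
    return [(p, y) for p, y in zip(pairs, labels)
            if y == 1 or rng.random() < rate]
```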

Overall Results
The experimental results are shown in Table 1, from which we can observe that our proposed method ECLEP outperforms all the baselines on the two datasets. In particular, on EventStoryLine, compared with the current best method RichGCN, our method ECLEP achieves a 1.9% improvement in F1-score; on Causal-TimeBank, compared with the current best method CauSeRL, our method ECLEP achieves a 3.1% improvement in F1-score. This finding indicates that our proposed method is effective for the ECI task. In addition, we observe that baseline ECI methods often require external knowledge resources or toolkits to improve performance, whereas our approach achieves the best performance by mining the inherent characteristics of the data.

Ablation Study
This section analyzes the contribution of each part of our model through ablation experiments, as shown in Table 2. In particular, we examine the following ablated models: 1) -ECL and -EPI denote the removal of the event causal label information and the event pair interaction information, respectively. We note that removing either of them degrades ECI performance, indicating that both types of information we introduce are effective. 2) -L_e denotes the removal of the auxiliary supervision. We note that removing it degrades overall performance, because it helps the model learn the representation of the label vector more sufficiently. 3) -pos denotes the removal of the relative position embedding. The experimental results show that relative position information is helpful for ECI, because the probability of a causal relationship between events that are closer together is higher than between those that are farther apart.

Visualization Analysis
We visualize the distribution of each module to further explore the effectiveness of our model. The following can be observed from Figure 3: 1) The event causal label information and the event pair interaction information focus on different aspects of the features to identify causal relationships, and their effects are complementary. This finding also explains the good performance of our full model ECLEP.
2) The event causal label can limit the identification scope of causality to some extent and help the model pay attention to extracting causal relationships from the event pairs composed of causality-related events.

Conclusion
In this paper, we propose a framework to enrich the representation of event pairs by introducing event causal label information and event pair interaction information. The experimental results on two widely used datasets indicate that our approach is effective for the ECI task. In the future, we aim to mine other potential causal features for this task and to apply our model to other types of relation extraction tasks, such as temporal relation extraction.

Limitations
In this paper, we focus only on whether a causal relationship exists between the given events and do not discriminate which event is the cause and which is the effect. In addition, we only conduct research on sentence-level ECI, whereas document-level ECI often presents more challenges. These are the focus of our future research.

Figure 2: The framework of our proposed method.
Figure 3: (a) Distribution of the label $L$ in the event-causal-label-aware module, where the shaded areas denote the event pairs composed of causality-related events. (b) Distribution of the full model ECLEP. (c) Distribution when using only the event causal label information. (d) Distribution when using only the event pair interaction information. In (b), (c), and (d), the shaded area denotes the probability that each event pair is predicted as a causal event pair. In this example, the ground-truth labels are (e1, e2), (e1, e4), and (e1, e5).

Table 1: Experimental results on the EventStoryLine and Causal-TimeBank datasets.

Table 2: Ablation results on the EventStoryLine dataset.