Improving Event Causality Identification via Self-Supervised Representation Learning on External Causal Statement

Current models for event causality identification (ECI) mainly adopt a supervised framework, which relies heavily on labeled training data. Unfortunately, current annotated datasets are relatively small and cannot provide sufficient support for models to capture useful indicators from causal statements, especially when handling new, unseen cases. To alleviate this problem, we propose a novel approach, named CauSeRL, which leverages external causal statements for event causality identification. First, we design a self-supervised framework to learn context-specific causal patterns from external causal statements. Then, we adopt a contrastive transfer strategy to incorporate the learned context-specific causal patterns into the target ECI model. Experimental results show that our method significantly outperforms previous methods on EventStoryLine and Causal-TimeBank (+2.0 and +3.4 F1 points, respectively).


Introduction
Event causality identification (ECI) aims to identify causal relations between events in texts, which can provide crucial clues for deep textual understanding (Girju, 2003; Oh et al., 2013, 2017). For example, in Figure 1, an ECI system should identify two causal relations in S1 between the mentioned events: noticed_E1 --cause--> alerted_E3 and alerted_E3 --cause--> ran_E2. To date, most existing methods regard this task as a classification problem and train ECI models on annotated data (Hashimoto et al., 2014; Riaz and Girju, 2014b; Mirza and Tonelli, 2016; Hu and Walker, 2017b; Gao et al., 2019). However, current annotated datasets are relatively small: even the largest dataset so far, EventStoryLine (Caselli and Vossen, 2017), contains only a limited number of annotated causal event pairs. As a result, with so few annotated examples, existing ECI models cannot easily capture useful indicators from causal statements, especially when handling new, unseen cases.
Figure 1: S1 is a labeled example that contains causal events and a statement unseen during training; S2 is an external causal statement. The bottom part illustrates how the context-specific causal pattern in S2 could help identify the causality of the unseen events in S1.
To address this problem, prior work employed external event-related knowledge bases (KBs) to enhance causality inference, where the KBs store inherent causal relations between given events. For events and causalities that are unseen or unlabeled in the KBs, a mention-mask based reasoner was proposed to enhance the causal statement representation. However, such a mention-mask based reasoner is still trained solely on human-annotated examples, so it still suffers from data limitations and cannot handle unseen contexts. Moreover, Zuo et al. (2020) improved the performance of ECI with distantly supervised training data. However, their models are still limited by the unsatisfactory quality of the automatically generated data.
To address the shortage of annotated examples, we employ a large number of external causal statements (Sap et al., 2018; Mostafazadeh et al., 2020) that supply adequate evidence of context-specific causal patterns for understanding event causalities. For example, in Figure 1, the context-specific causal pattern supported by the external causal statement S2 helps identify the causality between the events noticed_E1 and alerted_E3 in S1, which is unseen when training only on labeled data. However, unlike the annotated examples for the ECI task, the external causal statements carry no event annotations. As a result, it is difficult for models to learn context-specific causal patterns from them for identifying event causalities. To resolve this issue, inspired by Grill et al. (2020), we design a self-supervised representation learning framework to learn enhanced causal representations from external causal statements. Specifically, we iteratively sample two external causal statements and take each of them in turn as a target to learn the commonalities between them. Intuitively, the commonalities learned across different causal statements through self-supervision reflect the context-specific causal patterns that help identify event causalities in unseen cases.
Moreover, to incorporate the context-specific causal patterns learned from external causal statements into the target ECI model, we employ a contrastive transfer strategy. Specifically, we regard the self-supervised representation learning module as a teacher model that has mastered abundant external causal statements, and the target ECI model as a student model. Methodologically, we pull the representation of causal events encoded by the student model close to the causal representation grasped by the teacher model, and push the representation of non-causal events away from it. In this way, the mutual information between the teacher and student models can be maximized (Tian et al., 2020). The learned context-specific causal patterns can then be transferred naturally into the ECI model, improving its generalization.
In experiments, we evaluate our model on two benchmarks. The results show that our model achieves state-of-the-art performance. Further analyses demonstrate the effectiveness of our self-supervised, contrast-based framework for learning and transferring context-specific causal patterns.
In summary, our contributions are as follows:
• We propose a novel approach, named CauSeRL, which leverages external causal statements to identify the causalities between events.
• We design a self-supervised framework to learn context-specific causal patterns from external causal statements. Then, we adopt a contrastive transfer strategy to incorporate the learned context-specific causal patterns into the target ECI model for identification.
• Experimental results on two benchmarks show that our model achieves the best performance.

Related Work
Event Causality Identification Identifying the causality implied in text has attracted increasing attention (Hu and Walker, 2017a; Riaz and Girju, 2014b; Hashimoto et al., 2014; Riaz and Girju, 2014a, 2010; Do et al., 2011; Hidey and McKeown, 2016; Beamer and Girju, 2009; Hu et al., 2017; Hu and Walker, 2017b). Recently, several benchmarks on event causality have been released. Mirza and Tonelli (2016) extracted causal relations between events with a rule-based multi-sieve approach that incorporates event temporal relations. The Causal-TimeBank corpus was annotated with event causal relations. Caselli and Vossen (2017) annotated the EventStoryLine Corpus for event causality identification in 320 short stories, based on a dataset annotated with temporal and causal relations (Mostafazadeh et al., 2016). Dunietz et al. (2017) presented BECauSE 2.0, a new version of BECauSE (Dunietz et al., 2015) that covers causal relations and seven other relations. Based on the above benchmarks, Gao et al. (2019) modeled document-level structures to identify the causalities of events. Event causalities have also been identified with mention-masking generalization and external KBs. Zuo et al. (2020) improved the performance of ECI with distantly, automatically labeled training data. However, these methods rely only on a small amount of labeled data. In this paper, we introduce external causal statements to help identify event causalities.

Self-Supervised Representation Learning
Self-supervised representation learning aims to produce good features that are generally helpful for many tasks (Weng, 2019). Wu et al. (2018)  look-up. SimCLR learns representations for visual inputs by maximizing the agreement between differently augmented views of the same sample via a contrastive loss. Grill et al. (2020) proposed BYOL, a representation learning framework that relies on two neural networks and uses no negative samples. CURL (Srinivas et al., 2020) applies the above ideas to reinforcement learning. Inspired by these works, we design a self-supervised framework to learn context-specific causal patterns from external causal statements and adopt a contrastive transfer strategy to incorporate them into the target ECI model.

Methodology
As shown in Figure 2, the whole pipeline process of CauSeRL is divided into two major stages.
• Self-supervised causal representation learning (SelfRL, Sec. 3.1). In this stage, we design a self-supervised representation learning module to learn enhanced causal representations by iteratively sampling two external causal statements, taking each of them as a target to learn their commonalities which reflect context-specific causal patterns.
• Contrastive representation transfer (ConRT, Sec. 3.2). In this stage, we employ a contrastive transfer module to transfer the learned context-specific causal patterns into the ECI target model, the event causality identifier, via incorporating the enhanced causal representations from SelfRL.

Self-Supervised Causal Representation Learning (SelfRL)
SelfRL aims to train a module that masters context-specific causal patterns from external causal statements by learning their enhanced causal representations within a self-supervised framework.

Self-Supervised Representation Learning Module
We design a self-supervised module to capture the context-specific causal patterns in external causal statements by learning their enhanced causal representations. However, the external causal statements carry no ECI-specific event annotations, so they cannot be used directly as training data for the ECI model. To handle this problem, inspired by Grill et al. (2020), we iteratively sample two external causal statements and take each of them in turn as a target to learn their commonalities, that is, the causal representations that reflect context-specific causal patterns.
Specifically, as shown in Figure 2, we configure two networks for SelfRL: an online network and a target network. The target network provides regression targets for training the online network, which leads the online network to learn the commonalities between the two input causal statements, that is, the causal representations reflecting different context-specific causal patterns. Structurally, the online network is defined by a set of weights θ and comprises three submodules: an encoder Enc_θ, a projector Proj_θ, and a predictor Pred_θ. The target network has the same architecture as the online network except that it has no predictor, and it uses a different set of weights δ.
Specifically, we iteratively sample two external causal statements, initially encode them with BERT (Devlin et al., 2019), and feed them into the two networks respectively. After encoding and projection, the online network and the target network output the projections z_θ and z_δ respectively. The online network then outputs a prediction y_θ, and takes the mean squared error L_{θ,δ} between the ℓ2-normalized y_θ and z_δ as the training objective for learning the commonalities of the two causal statements, which are regarded as the context-specific causal patterns.
To reduce the bias, we symmetrize L_{θ,δ} by swapping the input causal statements of the online and target networks to compute its swapped counterpart L̃_{θ,δ}.
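The objective described in this subsection can be written out as follows. This is a sketch reconstructed from the prose and from the BYOL formulation of Grill et al. (2020) that SelfRL follows, not a verbatim copy of the paper's equations; the bar notation for ℓ2-normalized vectors, the tilde for the swapped loss, and the unweighted sum are assumptions.

```latex
\mathcal{L}_{\theta,\delta}
  = \left\lVert \bar{y}_\theta - \bar{z}_\delta \right\rVert_2^2
  = 2 - 2\,\frac{\langle y_\theta,\, z_\delta \rangle}
               {\lVert y_\theta \rVert_2 \,\lVert z_\delta \rVert_2},
\qquad
\mathcal{L}^{tea}_{\theta,\delta}
  = \mathcal{L}_{\theta,\delta} + \tilde{\mathcal{L}}_{\theta,\delta}
```

The second equality holds because both vectors have unit length after normalization, so the loss depends only on the angle between the online prediction and the target projection.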
Learning of SelfRL As shown in Algorithm 1, at each step we minimize L^tea_{θ,δ} by stochastic gradient descent with respect to the online parameters θ only, using the learning rate η^tea of the online network. The parameters δ of the target network are an exponential moving average of the online parameters θ (Lillicrap et al., 2016), δ ← τδ + (1 − τ)θ, where τ ∈ [0, 1] is the decay rate that determines how far δ moves toward θ. As shown in Figure 2, during learning BERT is only used to provide an initial representation of the input statements, and its parameters are not updated. According to the theoretical analysis by Grill et al. (2020), adding a predictor to the online network and using a slow-moving average of the online parameters as the target network encourage SelfRL to encode a more informative causal representation of the commonalities within the online projection and to avoid collapsed solutions.
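The SelfRL update can be illustrated with a minimal, dependency-free sketch. It assumes that the online prediction and the target projection are already available as plain lists of floats (in the paper they come from Pred_θ and Proj_δ), and that the moving average takes the usual BYOL form δ ← τδ + (1 − τ)θ. All function names are hypothetical and chosen only for this illustration.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def regression_loss(y_online, z_target):
    """Squared error between the l2-normalized prediction and projection."""
    y_bar = l2_normalize(y_online)
    z_bar = l2_normalize(z_target)
    return sum((a - b) ** 2 for a, b in zip(y_bar, z_bar))


def symmetrized_loss(y_ab, z_ab, y_ba, z_ba):
    """Sum of the loss and its counterpart with the two statements swapped."""
    return regression_loss(y_ab, z_ab) + regression_loss(y_ba, z_ba)


def ema_update(target_params, online_params, tau):
    """Slow-moving update of the target weights (tau is the decay rate)."""
    return [tau * d + (1.0 - tau) * t
            for d, t in zip(target_params, online_params)]
```

In a real training loop only the online weights would receive gradients from the symmetrized loss; `ema_update` would then be called once per step to refresh the target weights.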

Contrastive Representation Transfer (ConRT)
ConRT aims to incorporate the context-specific causal patterns learned in SelfRL from external causal statements into the identifier. As mentioned above, the goal of SelfRL is to learn the commonalities among different external causal statements, which does not directly give the representation learning module the ability to distinguish causal from non-causal statements. Therefore, we employ a contrastive transfer module to teach the learned context-specific causal patterns to the event causality identifier during training.

Algorithm 1: Two-stage training of CauSeRL.
Require: External causal statements C for the teacher model and event pairs with statements P for the student model.
Training:
1: Stage: CAUSAL REPRESENTATION LEARNING
2: for each batch C_bat ∈ C do    ▷ Learning of SelfRL
3:   for any two causal statements ∈ C_bat do
4:     One for online, another for target;
5:     Get y_θ from Pred_θ in online network;
6:     Get z_δ from Proj_δ in target network;
7:     Swap two statements into two networks;
8:     Get symmetrical y_θ and z_δ;
9:     Compute L_{θ,δ} and L̃_{θ,δ};
10:   end for
11:   Compute batch L^tea_{θ,δ} in equation (3);
12:   Stochastic gradient update θ in equation (4);
13:   Slow-moving update δ in equation (5);
18: for each batch P_bat ∈ P do    ▷ Learning of identifier
19:   for any event pair with statement ∈ P_bat do
20:     Get r_event and r_event_state from BertEnc_λ;
21:     Predict the causality of two events in one pair;
22:   end for
23:   Compute batch L^stu_λ in equation (6);
24:   Sample C_bat ∈ C;
25:   Get r_external of c ∈ C_bat from learned Enc_θ;
26:   Get r+_event_state, r−_event_state from r_event_state;
27:   Get mapped r^{p+}_{e_s}, r^{p−}_{e_s} and r_ext;
28:   Compute L_λ = L^stu_λ + L^con_λ in equation (8);
29:   Stochastic gradient update λ in equation (9);
30: end for
31: end Stage
Event Causality Identifier Event causality identification is formulated as a sentence-level binary classification problem. Specifically, we build our identifier as a classifier based on BERT (Devlin et al., 2019). The input is an event pair and its statement. As shown in Figure 2, we take the representation of the events r_event and the representation of their contextual statement r_event_state, both encoded by BertEnc_λ, as the input of the top MLP predictor. The output is a binary vector that indicates the causal relation between the two input events as expressed by their statement. The parameters of the identifier are denoted by λ, and the optimization objective is the classification cross-entropy loss L^stu_λ.

Table 1 (excerpt). DISTANT: Fisk was shot to death by his mistress's new lover and Fisk's ex-business partner.

Contrastive Transfer Module As mentioned above, inspired by Tian et al. (2020), we employ a contrastive transfer strategy to transfer the "knowledge" mastered by the teacher (the self-supervised representation learning module), that is, the context-specific causal patterns, to the student (the event causality identifier), which helps the latter identify event causalities. The key idea of contrastive transfer is intuitive: maximize the mutual information between the teacher and the student (Tian et al., 2020). Methodologically, we pull the representation of the statements of causal events encoded by the student model close to the causal representation grasped by the teacher model. By contrast, we keep the representation of the statements of non-causal events away from it. As shown in Figure 2, at each training step of the identifier, we feed a sampled batch of external causal statements into the learned Enc_θ of the online network to obtain their causal representation r_external for teaching. At the same time, we feed a sampled batch of event pairs with their statements into the BertEnc_λ of the identifier to obtain the statement representation r_event_state of each event pair. Within one batch, r_event_state consists of r+_event_state for the causal event pairs and r−_event_state for the non-causal event pairs. After mapping r_external, r+_event_state, and r−_event_state into the same space, we obtain r_ext, r^{p+}_{e_s}, and r^{p−}_{e_s} respectively. We then pull r^{p+}_{e_s} close to r_ext through the contrastive loss L^con_λ. In this loss, P+ and P denote the causal event pairs and all event pairs in one batch respectively, T is a temperature that adjusts the concentration level, and D is the ℓ2-distance function that measures the distance between two representations.
Learning of Event Causality Identifier To train the event causality identifier, we add the contrastive loss to the basic classification loss, which guides the identifier to learn the context-specific causal patterns implied in the enhanced causal representations from SelfRL. As shown in Algorithm 1, we minimize L_λ = L^stu_λ + L^con_λ and update λ by stochastic gradient descent, where η^stu is the learning rate of the identifier. For evaluation, we predict the causality of an input event pair without the contrastive transfer module. Additionally, the temperature T in L^con_λ indirectly adjusts the relative influence of L^stu_λ and L^con_λ. For teaching, we take the learned Enc_θ of the online network as the encoder and freeze its parameters, so that it provides the enhanced causal representations of the external causal statements for contrastive representation transfer.
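The identifier objective can be illustrated with a short sketch. The exact form of the contrastive loss is not reproduced in this text, so the code below is only one plausible instance built from the symbols defined above (P+, P, T, and the ℓ2 distance D): it uses the negative distance as a similarity score, so that minimizing the loss pulls causal statements toward r_ext. That sign convention, the use of a single teacher vector r_ext, the averaging of the cross-entropy over the batch, and all function names are assumptions made for illustration.

```python
import math


def l2_distance(u, v):
    """Euclidean distance D between two representations."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cross_entropy(prob_causal, label):
    """Binary cross-entropy for one event pair (label 1 = causal)."""
    p = prob_causal if label == 1 else 1.0 - prob_causal
    return -math.log(p)


def contrastive_loss(student_reps, labels, r_ext, temperature=0.1):
    """-log of the score mass of causal pairs (P+) over all pairs (P).

    The score of a pair is exp(-D(r_pair, r_ext) / T). The batch must
    contain at least one causal pair, otherwise the log is undefined.
    """
    scores = [math.exp(-l2_distance(r, r_ext) / temperature)
              for r in student_reps]
    positive = sum(s for s, y in zip(scores, labels) if y == 1)
    return -math.log(positive / sum(scores))


def total_loss(probs, labels, student_reps, r_ext, temperature=0.1):
    """Classification loss (batch mean) plus the contrastive loss."""
    ce = sum(cross_entropy(p, y) for p, y in zip(probs, labels)) / len(labels)
    return ce + contrastive_loss(student_reps, labels, r_ext, temperature)
```

Under this form, the loss is small when the causal pairs lie closer to r_ext than the non-causal pairs do, and a smaller temperature sharpens that contrast, which is one way to read the remark that T indirectly balances the two loss terms.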

Experimental Setup
Dataset and Evaluation Metrics for ECI Our experiments are conducted on two main benchmarks: (1) EventStoryLine v0.9 (ESC) (Caselli and Vossen, 2017), described above; and (2) Causal-TimeBank (CTB), which contains 184 documents, 6813 events, and 318 causal event pairs. Following previous methods, we use the last two topics of ESC as the development set for both datasets. For evaluation, we adopt Precision (P), Recall (R), and F1-score (F1) as evaluation metrics. We conduct 5-fold and 10-fold cross-validation on ESC and CTB respectively, as in previous methods. All results are averaged over three independent runs.
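As a reminder of how the three metrics relate, the following minimal sketch computes them for binary predictions. It assumes that the causal class is treated as the positive class and that gold and predicted labels are given as parallel lists of 0/1 values; the function name is hypothetical.

```python
def precision_recall_f1(gold, predicted):
    """Precision, recall, and F1 for the causal class (label 1)."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)  # true positives
    fp = sum(1 for g, p in pairs if g == 0 and p == 1)  # false positives
    fn = sum(1 for g, p in pairs if g == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

In a cross-validation setting, such a function would be applied to each held-out fold before the scores are aggregated.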
Data Preparation for Self-Supervised Causal Representation Learning We take four types of external causal statements from three resources. Table 1 illustrates the original form and the converted input form of SelfRL (Sec. 3.1) of the causal statements from three different resources.
• GLUCOSE (Mostafazadeh et al., 2020): a large-scale dataset of implicit commonsense knowledge, encoded as causal explanatory mini-theories inspired by cognitive psychology. Each GLUCOSE explanation is stated both as a specific statement (grounded in a given context, GLU-SPE in Table 1) and a corresponding general rule (applicable to other contexts, GLU-GEN in Table 1).
• ATOMIC (Sap et al., 2018): an atlas of machine commonsense, as a step toward addressing the rich spectrum of inferential knowledge that is crucial for commonsense reasoning.
• DISTANT (Zuo et al., 2020): the automatically labeled training data for ECI via distant supervision that expresses the causal semantics between events.

Parameters Settings
In our implementation, all BERT modules use the BERT-Base architecture, which has 12 layers, 768 hidden units, and 12 attention heads. We employ a one-layer BiLSTM (Hochreiter and Schmidhuber, 1997) as Enc_θ and Enc_δ. We set the learning rates of SelfRL (η^tea) and of the identifier (η^stu) to 1e-5 and 2e-5 respectively. The size of the space in the contrastive transfer module and the hidden size of the BiLSTM are both set to 50. The decay rate τ of the moving average in SelfRL and the temperature of the contrastive loss L^con_λ are set to 0.996 and 0.1 respectively, tuned on the development set. The batch sizes of SelfRL and of the identifier are likewise tuned on the development set and set to 48 and 16 respectively. We apply early stopping and the AdamW optimizer to train all models. We also adopt a negative sampling rate of 0.6 for training the identifier, owing to the sparseness of positive examples in the ECI datasets.
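For convenience, the settings listed above can be collected in a single configuration object. The sketch below only restates the reported values; the class name and field names are hypothetical and do not come from any released code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CauSeRLConfig:
    """Hyperparameters as reported in the text (field names are illustrative)."""
    bert_layers: int = 12
    bert_hidden_size: int = 768
    bert_attention_heads: int = 12
    bilstm_layers: int = 1              # encoder of the online/target networks
    bilstm_hidden_size: int = 50
    transfer_space_size: int = 50       # shared space of the contrastive module
    lr_selfrl: float = 1e-5             # eta_tea
    lr_identifier: float = 2e-5         # eta_stu
    ema_decay_tau: float = 0.996        # moving-average decay in SelfRL
    contrastive_temperature: float = 0.1
    batch_size_selfrl: int = 48
    batch_size_identifier: int = 16
    negative_sampling_rate: float = 0.6
    optimizer: str = "AdamW"


config = CauSeRLConfig()
```

Freezing the dataclass keeps the reported values from being modified by accident during an experiment.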
To make a fair comparison, we also apply CauSeRL to retrain MasG and KnowDis, which illustrates the effectiveness of our approach when combined with other ECI methods. Specifically:
1) MasG+CauSeRL: we retrain MasG with L^con_λ based on GLU-SPE. To be consistent with the other BERT-based compared models, we re-implement MasG on BERT-Base rather than the BERT-Large used in the original MasG.
2) KnowDis+CauSeRL: we regard the automatically, distantly labeled causal sentences generated by KnowDis as causal statements to learn from in SelfRL, and transfer the learned patterns to KnowDis.
CauSeRL_External-Statement: To further illustrate the ability of CauSeRL to learn context-specific causal patterns for the ECI task, we make CauSeRL learn from the four types of external causal statements shown in Table 1 for identifying the causalities between events. External-Statement denotes the type of external causal statement used.

Table notes (results on EventStoryLine): * denotes statistical significance at the 0.05 level; ∇ marks the points lower than CauSeRL or higher than BERT in the upper and lower parts respectively; Enc_θ−init + ConRT denotes a variant of CauSeRL that removes SelfRL, directly employs an initial Enc_θ of the online network to encode external causal statements for ConRT, and trains it at the same time; BertEnc_λ−init + ConRT denotes a variant of CauSeRL that removes SelfRL, directly employs the same initial BertEnc_λ as the identifier to encode external causal statements for ConRT, and trains it at the same time; BERT+SelfRL_finetune denotes a variant of CauSeRL that removes ConRT (Sec. 3.2) and takes the learned Enc_θ of the online network as the initial encoder of the identifier on the BERT baseline model.

Our Method vs. State-of-the-art Methods
Table 2 shows the results of ECI on EventStoryLine and Causal-TimeBank. From the results: 1) CauSeRL outperforms all baseline methods and achieves the best F1, 52.1% on ESC and 53.2% on CTB. Specifically, CauSeRL outperforms the non-BERT baselines (ILP/VerR-C) and the BERT-based baselines (MasG/KnowDis) by margins of 7.4%/10.0% and 2.0%/3.4% on the two benchmarks respectively. This illustrates that the context-specific causal patterns from external causal statements are effective for ECI.
2) Comparing MasG+CauSeRL with MasG, we note that even with BERT-Base, the performance of MasG+CauSeRL is significantly higher than that of MasG based on BERT-Large. This shows that the context-specific causal patterns learned by CauSeRL from external causal statements can effectively alleviate the limitation of mask generalization, which relies only on limited labeled causal context.
3) Comparing KnowDis+CauSeRL with KnowDis, we find that CauSeRL makes more efficient use of the automatically labeled causal statements: it learns their context-specific causal patterns to further enhance the ability of models to identify the causalities between events.
4) Comparing different external causal statements. a) GLU-SPE brings the most significant improvement, because the specific causal statements from GLU-SPE have complete text structures that are more similar to the ECI labeled data and are easier for models to learn from. Therefore, all the ablation experiments are conducted on GLU-SPE. b) The effects of GLU-GEN and ATOMIC are similar, because these two types of statements are abstract causal structures. Although they resemble the context-specific causal patterns, they are relatively difficult to understand directly. c) The improvement brought by DISTANT is relatively small because of the noise in the distantly labeled data.
5) Comparing CauSeRL with MasG+CauSeRL, we notice that after removing the ConceptNet knowledge enhancement employed by MasG, the external causal statements are better learned and transferred. This is because MasG directly flattens the event concept knowledge into the statement sequence, which disrupts the statement structure and affects the understanding of the statement.
6) It is worth noting that the improvement on CTB is higher than that on ESC, because CTB has less labeled data and therefore benefits more from external causal statements. Moreover, compared with traditional methods based on features or rules, all BERT-based methods achieve high recall, which benefits from more training data, knowledge, and causal statements.
Figure 3: Results of event causality identification on EventStoryLine when external causal statements are used directly as training data for the ECI task.

Effect of Self-Supervised Causal Representation Learning
We analyze the effect of the self-supervised causal representation learning (SelfRL, Sec. 3.1). As shown in Table 3: 1) After removing SelfRL, the performance of ECI decreases significantly. This illustrates that the context-specific causal patterns learned by SelfRL are important for the ECI model to understand causality. 2) Comparing BERT+SelfRL_finetune with BERT, the Enc_θ that has learned from external causal statements improves the performance of ECI to a certain extent. This illustrates that SelfRL can effectively capture the context-specific causal patterns in the statements for identification. 3) Comparing Enc_θ−init + ConRT and BertEnc_λ−init + ConRT, after representation learning, the fine-tuned Enc_θ further improves the performance of ECI. This indirectly shows that the context-specific causal patterns learned in SelfRL generalize.

Effect of Contrastive Representation Transfer
We analyze the effect of the contrastive representation transfer (ConRT, Sec. 3.2). As shown in Table 4: 1) After removing ConRT, the performance of ECI also decreases significantly. This illustrates that the causal representations learned from external statements are not suitable for direct application to ECI and need to be transferred effectively, which is what ConRT focuses on.
2) Comparing BERT + ConRT Enc_θ with BERT, even if causal representation learning is not carried out in advance, adopting the contrastive strategy to directly transfer the context-specific causal patterns also helps the inference of event causality to a certain extent. 3) Comparing Enc_θ−freeze + SelfRL with Enc_θ−finetune + SelfRL, we find that the causal representations encoded by pre-trained BERT and by the BiLSTM have similar effects. As mentioned above, to avoid collapsed solutions (Sec. 3.1), we choose the BiLSTM as the encoder in SelfRL because it can be initialized completely independently.

Effect of the Utilization of External Causal Statement
As shown in Figure 3, we regard external causal statements as positive training data for ECI and use them directly to train the BERT baseline model. Specifically, we treat the two words that play a predicate role in the syntactic structure of each statement as events. The results show that CauSeRL makes more effective use of causal statements to help understand the causalities of events. In contrast, using them directly as training data is not effective.

Case Study
As shown in Figure 4, with limited labeled data, the model cannot understand the causal relation between the events noticed and alerted. With the support of the context-specific causal pattern from GLU-SPE in Table 1, the prediction is corrected. Moreover, the original model trained only on limited labeled data is uncertain about the causal relation between the events alerted and ran. Influenced by causal statements similar to the ATOMIC example in Table 1, the prediction confidence is improved.

Conclusion
We propose a novel approach, CauSeRL, which leverages external causal statements to identify the causalities of events. First, we design a self-supervised framework to learn context-specific causal patterns from external causal statements. Then, we adopt a contrastive transfer strategy to incorporate the learned context-specific causal patterns into the target ECI model for identification. Experimental results on two benchmarks show that our model achieves the best performance.