Event Causality Extraction via Implicit Cause-Effect Interactions



Introduction
Event Causality Extraction (ECE) is an emerging yet challenging natural language processing (NLP) task, which requires extracting the whole event structure of the cause and effect events, including event types and event arguments. As shown in Figure 1, given the input text, an ECE system is expected to extract the cause event and the effect event, together with their event types and event arguments.

Figure 1: An instance of event causality extraction. The cause event is marked in yellow and the effect event is marked in green. The cause-effect interactions include intra- and inter-event interactions. Solid lines indicate intra-event interactions between the event type and the corresponding arguments, while the dashed line indicates the inter-event interaction between the cause and effect events.
Existing studies mainly regard this task as a classification-based problem. Du and Cardie (2020) and Wang et al. (2019) extracted the events and classified the causal relation in a pipelined manner. Wang et al. (2020) leveraged grid tagging to simultaneously extract the events and causality pairs. Cui et al. (2022) also designed a dual grid tagging scheme, which aims at modeling the correlations between different event arguments. Compared to classification-based methods, generation-based models can take full advantage of Pre-trained Language Models (PLMs) by designing flexible prompt templates (Ma et al., 2022; Liu et al., 2022). As a result, there is a trend to cast the extraction task as a conditional generation problem.
Although generation-based approaches have achieved remarkable success, the limited interactions between the cause and effect events impede the model's ability to reason effectively between events. Intuitively, privileged information (Xu et al., 2020), which stands for the ground-truth information of event types or event arguments, could provide valuable knowledge for inferring causal clues. Take Figure 1 as an example: if the model already knows that the cause event is Price Increase, and that the Product and Region of this event are oil and worldwide, the inter-event interactions could benefit the extraction of the effect event. Similarly, if the model knows during training that Price Increase causes Demand Increase, the intra-event interactions may help the model capture event arguments more accurately. By incorporating different kinds of privileged information, the model can make full use of the implicit interactions and be guided to extract causal clues. However, since such privileged information is unavailable in practice, incorporating it naively results in inconsistencies between the training and test phases and may hurt model performance. Several methods in other NLP fields (Liu et al., 2017; Wei et al., 2021) have provided solutions to this problem, but they are not applicable when ECE is modeled as a generative problem. Moreover, generation-based methods are typically trained via maximum likelihood estimation (MLE) (Salakhutdinov, 2015), which maximizes the likelihood of the next word conditioned on its previous ground-truth words and leverages a cross-entropy loss to measure the difference at each position of the target sequence. Nevertheless, since MLE only emphasizes strict word-level alignment, it struggles to consider semantic information from the perspective of event types or event arguments. For instance, as shown in Figure 1, when training on the word new from the effect event argument new energy, MLE ignores that new energy is a whole unit as the Industry, which results in a partial loss of semantics. The event type and event arguments of the cause-effect event pairs can also be regarded as different wholes, and by interacting with each other, the model could implicitly incorporate such event-level semantic information.
In this paper, we propose an Implicit Cause-Effect interaction (ICE) framework for ECE to address the above issues. Specifically, we formulate the ECE task as a template-based conditional generation problem, which takes the context and a prompt template as input, and decodes event causality and event structure from the generated sequence. To capture the implicit intra- and inter-event interactions, we feed different kinds of privileged information into the input template and train two well-informed teacher models. A student model is then driven to imitate the behaviors of the teachers through knowledge distillation (Hinton et al., 2015), narrowing the input difference between the training and test phases. Furthermore, to facilitate knowledge transfer and strengthen the interactions between cause and effect events, we design a Cause-Effect Optimal Transport (CEOT) mechanism that treats the event type and event arguments as modeling units, which implicitly incorporates event-level semantic information. In summary, the contributions of this paper are as follows:

1) This work proposes an ICE framework, which models event causality extraction in a generative paradigm and incorporates privileged knowledge for reasoning.

2) The proposed method implicitly captures the intra- and inter-event interactions through knowledge distillation, and employs a CEOT strategy to strengthen the semantic interactions of cause and effect events.
3) Experimental results show that our model achieves state-of-the-art performance, improving the F1-score by 8.39% on the ECE evaluation benchmark.

Related Work
Event Causality Extraction. ECE is derived from the earlier task of event causality identification (ECI), which aims to recognize the causal relations between given events in text (Zuo et al., 2021; Phu and Nguyen, 2021). Early methods mainly focus on syntactic and lexical features (Gao et al., 2019), causality patterns (Hidey and McKeown, 2016), and statistical causal clues (Hu and Walker, 2017). Recent works seek to employ external knowledge (Liu et al., 2020; Cao et al., 2021) or prompt-based models (Shen et al., 2022; Liu et al., 2023) for this task. However, these methods only identify the causality of events expressed by a word or phrase, without considering the event type and event arguments. Cui et al. (2022) first proposed the ECE task and exploited argument correlations to extract event causality and event structure. Some variant methods from relation extraction have also been applied to this task (Wang et al., 2020; Wei et al., 2020).
However, these works fail to model the implicit cause-effect interactions, making it difficult to extract causal clues.

Knowledge Distillation. This technique was first proposed by Hinton et al. (2015) and aims to transfer knowledge from a well-trained teacher to a student model. Jiao et al. (2020) and Sanh et al. (2019) used knowledge distillation to compress large-scale pre-trained language models. Wu et al. (2021) and Li et al. (2022) adopted multiple-teacher knowledge distillation to improve the effectiveness of distillation. Wei et al. (2023a) proposed to incorporate related-argument knowledge through knowledge distillation for event argument extraction. Nevertheless, these approaches are designed for classification-based methods and struggle to migrate to the ECE task under the generative paradigm.

Optimal Transport. OT has a wide range of applications in NLP (Chen et al., 2019; Xu et al., 2021; Wei et al., 2023b). Li et al. (2020) proposed using optimal transport to tackle the exposure bias issue when training generative models by maximum likelihood estimation. Zhou et al. (2022) modeled events in a sequence as units and adopted optimal transport to explicitly extract event semantics for generating temporally-ordered event sequences. In this paper, we employ optimal transport to improve the semantic interactions of event types and event arguments for cause and effect events.

Methodology
Our ICE framework formulates ECE as a template-based generation problem and implicitly incorporates privileged information for reasoning. Under this paradigm, we train two well-informed teacher models by incorporating different kinds of privileged information into the model inputs. We then adopt a knowledge distillation mechanism to drive a student model to capture the implicit cause-effect interactions, which alleviates the problem of privileged information being unavailable at test time. During the training phase, a CEOT strategy is adopted to improve the semantic interactions of cause-effect events and promote the training of the student. The overview of ICE is shown in Figure 2.

Task Formulation
The goal of ECE is to extract event causality and event structure from text. Formally, given a context, ECE aims to extract a set of cause-effect event pairs {(E_i^ca, E_i^ef)}, where E_i^ca and E_i^ef denote the cause and effect event of the i-th pair, respectively. The event structure E = (t, A) contains the event type t and the argument set A, where each argument in A corresponds to a role.

Generative Template-based ECE Model
Template Creation. At the input stage, we first construct a task-specific template for ECE. Following Ma et al. (2022) and Liu et al. (2021), we design a soft template that contains learnable pseudo tokens and slots for all components we need to extract. Figure 2(c) shows the ECE template in our model, where <Cause>, <Effect>, <type>, </type>, etc. are specific learnable pseudo tokens. We then concatenate the context and template, and feed them into a Transformer-based model to generate the output sequence, in which the slots of the template are filled with the concrete event types and event arguments of the cause and effect events.

Target Output Sequence. For the cause-effect event pair in the context, we construct the target output sequence Y for conditional generation by filling the ground-truth cause and effect events into the template. Note that when more than one argument corresponds to a role, the arguments are concatenated by a special token <and>. If some roles of an event have no arguments, the corresponding positions in the target output sequence are filled with <None>.
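As an illustration, the target-sequence construction described above can be sketched as follows. The slot marker, role names, and dictionary format here are simplified assumptions for this sketch; in the real model, pseudo tokens such as <Cause> and <type> correspond to learnable embeddings rather than plain strings.

```python
# Sketch of filling the ECE template to build a target output sequence.
# "[slot]" is an illustrative placeholder, not the paper's exact token.
RAW_TEMPLATE = (
    "<Cause> <type> [slot] </type> Region [slot] Industry [slot] Product [slot] "
    "<Effect> <type> [slot] </type> Region [slot] Industry [slot] Product [slot]"
)

def fill_template(template, cause, effect):
    """Fill slots with event type and role arguments; multiple arguments
    for one role are joined by <and>, empty roles become <None>."""
    values = []
    for event in (cause, effect):
        values.append(event["type"])
        for role in ("Region", "Industry", "Product"):
            args = event["args"].get(role, [])
            values.append(" <and> ".join(args) if args else "<None>")
    out = template
    for v in values:
        out = out.replace("[slot]", v, 1)  # fill slots left to right
    return out

# Toy cause-effect pair from the Figure 1 example
cause = {"type": "Price Increase",
         "args": {"Region": ["worldwide"], "Product": ["oil"]}}
effect = {"type": "Demand Increase",
          "args": {"Industry": ["new energy"], "Product": ["Ammonia fuel"]}}
target = fill_template(RAW_TEMPLATE, cause, effect)
```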
Training. In the training process, we adopt the Transformer-based pre-trained language model BART (Lewis et al., 2020) as our basic model architecture, which consists of an encoder and a decoder. The encoder maps the input to hidden states:

H_e = Encoder(X),

where X denotes the concatenation of the context and template. The training target is to maximize the likelihood of the next token conditioned on the previous ground-truth tokens in the sequence:

L_gen = − Σ_{l=1}^{L} log p(y_l | y_{<l}, X),

where L is the length of the target output sequence.
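As a rough, self-contained illustration of this objective (NumPy with toy shapes; in practice it is the standard cross-entropy over the decoder's vocabulary logits under teacher forcing):

```python
import numpy as np

def mle_loss(logits, target_ids):
    """Token-level negative log-likelihood with teacher forcing:
    -sum_l log p(y_l | y_<l, X), the standard generation objective."""
    logits = logits - logits.max(-1, keepdims=True)   # numerical stability
    log_p = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
    return float(-log_p[np.arange(len(target_ids)), target_ids].sum())

rng = np.random.default_rng(0)
L_len, V = 5, 12                              # toy sequence length, vocab size
logits = rng.normal(size=(L_len, V))          # decoder outputs per position
target = rng.integers(0, V, size=L_len)       # ground-truth token ids
loss = mle_loss(logits, target)
```

A perfectly confident correct prediction drives this loss to (nearly) zero, while any word-level mismatch is penalized regardless of event-level semantics, which is the limitation the CEOT mechanism later addresses.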
Inference. After obtaining the generated sequence, we decode the event types and event arguments of the cause-effect events from the corresponding slots with a rule-based matching algorithm. We then check whether each event type is in the predefined event type set and whether each event argument is a span of the context.
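A minimal sketch of such rule-based decoding, assuming a simplified output format and a toy event-type set (the paper does not specify its exact matching rules, so the regular expressions and checks here are illustrative):

```python
import re

# Hypothetical subset of the predefined event type set
EVENT_TYPES = {"Price Increase", "Demand Increase", "Supply Decrease"}

def decode_event(segment, context):
    """Parse one event segment of the generated sequence and validate it:
    the type must be in the predefined set, and each argument must appear
    verbatim as a span of the input context."""
    m = re.search(r"<type>\s*(.*?)\s*</type>", segment)
    etype = m.group(1) if m else None
    if etype not in EVENT_TYPES:              # reject unknown event types
        return None
    args = {}
    pattern = r"(Region|Industry|Product)\s+(.*?)(?=\s+(?:Region|Industry|Product)\b|$)"
    for role, text in re.findall(pattern, segment):
        spans = [] if text == "<None>" else text.split(" <and> ")
        args[role] = [s for s in spans if s in context]  # span check
    return {"type": etype, "args": args}

context = "The worldwide rise of oil prices stimulates the demand for new energy."
seg = "<type> Price Increase </type> Region worldwide Industry <None> Product oil"
event = decode_event(seg, context)
```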

Teacher-Student Distillation Learning
Based on the generative paradigm, we adopt a teacher-student distillation learning framework to capture the implicit intra- and inter-event interactions, which implicitly incorporates the privileged information for event causality reasoning and narrows the inconsistencies between the training and test phases. Specifically, we train two well-informed teachers: an event argument extractor and an effect event extractor. The event argument extractor extracts event arguments given the event types of the cause and effect events, while the effect event extractor recognizes the event type and event arguments of the effect event given the information of the cause event. With the template-based generation formulation, we can introduce the privileged information into the model in a flexible way. As shown in Figure 2, for the event argument extractor, we construct a knowledge-enriched template by filling the ground-truth event types of the cause and effect events into the raw template. Likewise, for the effect event extractor, we fill in the ground-truth event type and arguments of the cause event to form a new knowledge-enriched template. We give an example of the construction of knowledge-enriched templates in Appendix B. Next, we concatenate the context and the knowledge-enriched templates as inputs to train the teachers.
In the knowledge distillation stage, let H_{d_i}^S denote the hidden states of the i-th decoder layer of the student model, and H_{d_i}^T those of the teacher. We adopt the mean squared error (MSE) to encourage the student model to match the hidden states of the corresponding decoder layers of the teacher:

L_mse = (1/N) Σ_{i=1}^{N} MSE(H_{d_i}^S, H_{d_i}^T),

where N is the number of decoder layers. We also employ the KL-divergence to encourage the student to match the probability distribution of the teacher over the next possible token at each position:

L_kl = Σ_{l=1}^{L} KL(p_l^T || p_l^S),

where p_l^S and p_l^T are the probability distributions of the student and teacher over the next possible token at position l. Please note that the teachers and the student use the same model architecture and training objectives, but do not share model parameters, and the parameters of the teachers are fixed during the training of the student. The overall loss for training the student model with a single teacher is:

L_single = L_gen + α L_mse + β L_kl,

where α and β are weight coefficients.
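The two distillation terms can be sketched as follows (NumPy, with random toy tensors standing in for the decoder hidden states and next-token distributions; in practice these would be PyTorch tensors inside the training loop):

```python
import numpy as np

def mse_hidden_loss(student_states, teacher_states):
    """Mean squared error between matching decoder layers (the L_mse term)."""
    return float(np.mean([np.mean((s - t) ** 2)
                          for s, t in zip(student_states, teacher_states)]))

def kl_token_loss(p_student, p_teacher, eps=1e-12):
    """KL(p_T || p_S) summed over target positions (the L_kl term)."""
    p_s = np.clip(p_student, eps, 1.0)
    p_t = np.clip(p_teacher, eps, 1.0)
    return float(np.sum(p_t * np.log(p_t / p_s)))

rng = np.random.default_rng(0)
N, L_len, d, V = 3, 4, 8, 10              # decoder layers, length, hidden, vocab
student_h = [rng.normal(size=(L_len, d)) for _ in range(N)]
teacher_h = [rng.normal(size=(L_len, d)) for _ in range(N)]
logits = rng.normal(size=(L_len, V))
p_t = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)

alpha, beta = 1e-3, 0.5                   # coefficients from the paper's setup
loss = alpha * mse_hidden_loss(student_h, teacher_h) + beta * kl_token_loss(p_t, p_t)
```

When the student's distribution matches the teacher's exactly, the KL term vanishes, so the gradient signal comes only from the hidden-state alignment.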

Cause-Effect Optimal Transport
To improve knowledge transfer, we seek to model semantic information from the perspective of event type or event arguments and introduce a CEOT strategy to promote interactions of cause and effect events, which is achieved by event-level alignment of teacher and student representations.
Optimal transport defines a distance metric between two probability measures on a domain. Given two discrete probability measures μ = Σ_{i=1}^{n} u_i δ_{x_i} and ν = Σ_{j=1}^{m} v_j δ_{y_j}, where δ_x is the Dirac function centered on x and the weight vectors u and v lie on the n- and m-dimensional probability simplexes, the OT distance is formalized as the following problem (Luise et al., 2018):

D(μ, ν) = min_{M ∈ Π(u, v)} ⟨M, C⟩,  with  Π(u, v) = {M ∈ R_+^{n×m} : M 1_m = u, M^⊤ 1_n = v},

where 1_n denotes an n-dimensional all-one vector, C is the cost matrix defined as C_ij = c(x_i, y_j), M is the transportation plan, and ⟨M, C⟩ = Tr(M^⊤ C) denotes the Frobenius inner product. Xie et al. (2019) proposed an approximate algorithm, IPOT, to solve this problem, which is illustrated in Appendix A. After solving for M, we use the OT distance as a loss to update the model parameters.
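A compact NumPy sketch of the IPOT iteration, following the structure of Xie et al. (2019); the proximal coefficient, iteration counts, and toy inputs are illustrative choices rather than the paper's tuned values:

```python
import numpy as np

def ipot(C, u, v, beta=0.5, n_iter=100, inner=1):
    """Inexact proximal point iteration for OT (after Xie et al., 2019).
    C: (n, m) cost matrix; u, v: marginal weights on the simplex.
    Returns the transport plan M and the OT distance <M, C>."""
    n, m = C.shape
    A = np.exp(-C / beta)                     # proximal kernel
    M = np.ones((n, m)) / (n * m)
    sigma = np.ones(m) / m
    for _ in range(n_iter):
        Q = A * M
        for _ in range(inner):                # Sinkhorn-style scaling
            delta = u / (Q @ sigma)
            sigma = v / (Q.T @ delta)
        M = np.diag(delta) @ Q @ np.diag(sigma)
    return M, float(np.sum(M * C))

# Toy example: two groups of unit-norm vectors with a cosine-distance cost,
# mirroring how the grouped template representations would be compared.
rng = np.random.default_rng(1)
X = rng.normal(size=(4, 8)); X /= np.linalg.norm(X, axis=1, keepdims=True)
Y = rng.normal(size=(3, 8)); Y /= np.linalg.norm(Y, axis=1, keepdims=True)
C = 1.0 - X @ Y.T                             # cosine distance
u = np.ones(4) / 4; v = np.ones(3) / 3
M, dist = ipot(C, u, v)
```

The returned plan approximately satisfies both marginal constraints, and the scalar ⟨M, C⟩ is what would be backpropagated as the OT loss.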
Specifically, the representation of the template for the student, denoted as H_e^S, is obtained from the last hidden state of the student's BART encoder corresponding to the template. We first partition H_e^S into n groups, where each group corresponds to the representation of a slot together with the specific tokens before and after it (e.g., <type> TypeOfCause </type> belongs to one group). We then average the representations of the tokens in each group to obtain the sequence K_e^S = {h_{e_i}^S}_{i=1}^{n}, where h_{e_i} denotes the representation of the i-th group and n is the number of groups. Similarly, we obtain the sequence K_e^T = {h_{e_i}^T}_{i=1}^{m} by grouping and averaging the template representations of the teacher. We use the cosine distance as the cost function and adopt the IPOT algorithm to compute the OT loss:

L_eot = OT(K_e^S, K_e^T).
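The group-and-average step can be sketched as follows; the group boundaries here are hypothetical index ranges, whereas in the model they come from the slot and pseudo-token positions in the template:

```python
import numpy as np

def group_average(hidden, groups):
    """Average the token representations inside each slot group to get one
    vector per event-level unit (e.g. '<type> TypeOfCause </type>')."""
    return np.stack([hidden[list(idx)].mean(axis=0) for idx in groups])

rng = np.random.default_rng(2)
hidden = rng.normal(size=(12, 8))             # toy template token states
groups = [range(0, 3), range(3, 6), range(6, 9), range(9, 12)]
K = group_average(hidden, groups)             # (4, 8): one vector per group
```

These grouped vectors are exactly the units between which the cosine-distance cost matrix of the OT problem is computed.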

Meanwhile, the last hidden states of the BART decoder for the teacher and student are denoted as H_d^T and H_d^S. We first use a linear layer followed by an argmax function to decode the output sequence. The representations are then divided into several groups based on the event type or event argument together with the specific tokens before and after it. Note that when some specific tokens are missing in the output, we remove the corresponding groups. Likewise, we obtain two sequences K_d^S and K_d^T via an average operation and compute the OT loss:

L_dot = OT(K_d^S, K_d^T).

The overall loss of the model is defined as:

L = L_single + λ L_eot + η L_dot,

where L_single denotes the single-teacher training loss defined above, and λ and η are weight coefficients.

Training Objectives
Since the two teachers exhibit different difficulties in extracting the same sample, motivated by Zhang et al. (2022), we use adaptive weights w_k to control the importance of the different teachers for a specific sample, where the weight of the k-th teacher is computed from its prediction loss L_gen^{T_k}. The total loss under the multi-teacher knowledge distillation framework is the w_k-weighted combination of the per-teacher losses.

Experiments

Dataset. Our experiments are conducted on the ECE-CCKS dataset, derived from the corpus released by Tianchi (2021).
The dataset is annotated with 39 event types and 3 event roles; its statistics are listed in Table 1.

Evaluation Metric. We use precision (P), recall (R), and F1-score (F1) as evaluation metrics. A predicted cause-effect event pair is considered correct only when the event types and event arguments of both the cause and effect events are correctly extracted. To provide a fair comparison with previous methods (Cui et al., 2022), we also report results on the following two subtasks: Event Argument Extraction (EAE), which measures the model's ability to extract event arguments, and Cause-Effect Type extraction (CET), which measures whether the predicted event types of the cause and effect events are correct.

Implementation Details. All experiments are conducted on an NVIDIA Tesla V100 GPU with the PyTorch framework. We use the pre-trained BART-base from Hugging Face's Transformers library (Wolf et al., 2020) as the encoder-decoder language model. The model is optimized by AdamW with weight decay and a learning rate of 3e-5. The coefficients λ and η are both set to 0.1, and α and β are set to 1e-3 and 0.5, respectively. The model is trained for 60 epochs with a batch size of 16.

Baseline Methods
Classification-based Method.
(1) Novel-tagging is introduced to ECE by combining causality, event types, event roles, and argument spans into a unified label space (Zheng et al., 2017). (2) CasECE is a pipelined method inspired by Wei et al. (2020), which first extracts the cause event and then recognizes the effect event conditioned on the former prediction. (3) Pair-linking is a grid tagging method based on Wang et al. (2020), which uses event-type-level pair linking as conditional information for token-pair linking to extract event arguments. (4) DualCor (Cui et al., 2022) is a dual grid tagging method that models the correlations between different event arguments.

Generation-based Method. (4) CE-Pipeline is a pipelined generation model, which first extracts the cause event and then predicts the effect event based on the cause. (5) TA-Pipeline is also a pipelined generation model, which first identifies the event types of the cause and effect events and then detects the event arguments conditioned on them.

Overall Performance
The experimental results are reported in Table 2, from which we observe the following. (3) Compared with Student, our method performs better on all three tasks. We attribute this to the fact that the student model alone has difficulty extracting implicit causal clues, while ICE equips the model with more implicit event causality reasoning knowledge via the multi-teacher knowledge distillation framework, thus boosting performance. (4) With the same model architecture, ICE outperforms MKD. The improvements indicate that the privileged information helps to train well-informed teachers, which guide the student to capture the intra- and inter-event interactions.
(5) CE-Pipeline and TA-Pipeline perform poorly among the generation-based methods. The results illustrate that pipelined methods suffer from error accumulation, while ICE is an end-to-end method with privileged information as supervision, which avoids introducing noise or irrelevant information.

Further Discussion
Ablation Study. To evaluate the contribution of each component, we conduct ablation studies by removing the event argument extractor (w/o EAE), the effect event extractor (w/o EEE), the Cause-Effect Optimal Transport (w/o CEOT), and the adaptive weights (w/o WA). The experimental results are shown in Table 3. We observe that: (1) Removing the event argument extractor or the effect event extractor results in performance decay, demonstrating that both extractors are beneficial for event causality extraction. This is because they capture the implicit intra- and inter-event interactions to perform event causality reasoning, and this knowledge is transferred to the student via the teacher-student distillation framework.

Example1
The growth of {global}RegionOfCause {gasoline}ProductOfCause demand is sluggish, and the combination of excess supply and unsatisfactory demand has led to a continuous decline in {gasoline}ProductOfEffect cracking spreads this year.

Example2
The amount of {silicon}ProductOfCause used in June increased compared to May, and the increase came from polysilicon, while the supply of {monosilicon}ProductOfCause decreased. Therefore, it is expected that the price of {monosilicon}ProductOfEffect will rise after July. BART-ECE: {TypeOfCause: Supply Decrease, TypeOfEffect: Price Increase} ICE: {TypeOfCause: Supply Decrease, TypeOfEffect: Price Increase}

Template Analysis. The results of using different types of templates are shown in Table 5. We find that the soft template is superior to the manual and concatenation templates, which illustrates the effectiveness of our template design. Although the manual template can elicit pre-training knowledge in a cloze formulation, it is labor-intensive and hard to make optimal. The soft template avoids this laborious process and takes the best advantage of the PLMs.
Case Study.We present case studies to further illustrate the performance of the proposed method.
As shown in Table 6, for Example 1, BART-ECE gives wrong predictions for the event types of the cause and effect events. The reason may be that BART-ECE has difficulty capturing implicit cause-effect interactions, so it fails to recognize the causality between Demand Drop and Price Drop. For Example 2, both methods produce the correct cause-effect event types, but BART-ECE fails to predict the event argument monosilicon of the cause event. We attribute this to the fact that our model leverages the event argument extractor and effect event extractor trained with privileged information to guide the training of the student, thus achieving better performance in extracting event causality and event structure.

Conclusions
In this paper, we propose an Implicit Cause-Effect interaction (ICE) framework to improve the reasoning ability of the model, which tackles ECE in a generative manner. The proposed method incorporates privileged information for reasoning to capture implicit intra- and inter-event interactions, and utilizes a teacher-student learning framework to bridge the gap between the training and test stages. Besides, we introduce a Cause-Effect Optimal Transport (CEOT) strategy to improve the event-level semantic interactions of cause and effect events.
Experimental results indicate that ICE outperforms all the baselines on the ECE-CCKS dataset, demonstrating the effectiveness of this work.

Limitations
The multi-teacher knowledge distillation mechanism utilized in the ICE framework may increase the computational time during the training process.

A The Details of the IPOT Algorithm

The details of the IPOT algorithm are shown in Algorithm 1.
Algorithm 1 IPOT
Require: cost matrix C, marginals u and v, proximal coefficient β
1: σ ← (1/m) 1_m, M^(1) ← 1_n 1_m^⊤
2: A ← exp(−C/β)
3: for t = 1, 2, ... do
4:   Q ← A ⊙ M^(t)
5:   for k = 1, ..., K do
6:     δ ← u ⊘ (Q σ), σ ← v ⊘ (Q^⊤ δ)
7:   end for
8:   M^(t+1) = diag(δ) Q diag(σ)
9: end for
10: return ⟨M, C⟩

B An Example of the Construction of Knowledge-Enriched Templates

We show an example of the construction of knowledge-enriched templates for EAE and EEE in Table 7. The knowledge-enriched template for EAE is constructed by filling the ground-truth event types of the cause and effect events into the raw template; the knowledge-enriched template for EEE is constructed by filling the ground-truth event type and arguments of the cause event into the raw template.

C The creation of different types of templates
We construct three types of templates for comparison: (1) Concatenation Template, where all slots for event types and event arguments are concatenated; (2) Manual Template, where event types and event arguments are integrated into templates in natural-language form; (3) Soft Template, which is used in our method. Detailed information about the templates is shown in Table 8.

D Compare with ChatGPT
In this section, we conduct experiments to evaluate the performance of ChatGPT on the ECE task. A well-designed prompt template is as follows: Suppose you are now an event causality extraction model. Given a sentence, please give the cause event and result event respectively, where the event contains the event type and the arguments corresponding to each role. The list of event types is: ['Typhoon', 'Demand Increase', 'Price Decrease', 'Cold Wave', 'Price Increase', 'Other Natural Disasters', 'Supply Decrease', 'Supply Increase', 'Sales Decrease', 'Demand Drop', 'Import Decrease', 'Flood', 'Other Trade Frictions', 'Negative Impact', 'Swine Fever', 'Sales Increase', 'Limited Production', 'Operating Costs Increased', 'Other Livestock Epidemics', 'Positive Impact', 'Drought', 'Operating Cost Decrease', 'Export Decrease', 'Frost', 'Other or Unclear', 'Import Increase', 'Bird Flu', 'Earthquake', 'Anti-dumping Against China', 'Exports Increase', 'Add Tariffs to China', 'Decrease in Product Profits', 'Increase in Product Profits', 'Foot-and-mouth Disease of Pigs', 'Anti-dumping Against Other Countries', 'Unsalable', 'Cattle Foot and Mouth Disease', 'Flash Flood', 'Hail']. The list of event argument roles is: ['Region', 'Industry', 'Product']. Given a sentence: "The worldwide rise of oil prices stimulates the demand for new energy such as Ammonia fuel.", please extract the event type and arguments corresponding to the cause and effect event. If no argument corresponds to a role, the argument content returns "None". If multiple arguments correspond to a role, the arguments are connected with "and".
As shown in Table 9, ICE outperforms ChatGPT by a large margin on the EAE, CET, and ECE tasks. The results indicate that ChatGPT has difficulty solving such a complex event causality extraction task without any fine-tuning or training to update parameters.

Figure 3: Experimental results (F1-score) under different ratios of training data on the three tasks.
eot  Figure 2: (a) The overview of the ICE learning framework.(b)Theprocess of Cause-Effect Optimal Transport (CEOT).(c)The creation of the template.The cause event is marked in orange and the effect event is marked in green.Underlined tokens indicate slots for event types and event arguments of cause-effect events.And tokens with <angle brackets> denote specific pseudo tokens.

Table 2: Overall performance compared to state-of-the-art methods on the test set. P, R, and F1 denote precision (%), recall (%), and F1-score (%). The best results are denoted in bold.

Table 4: Experimental results under different numbers of cause-effect event pairs in a sample on the ECE task.

Table 5: Experimental results (F1-score) of using different types of templates. CA: Concatenation Template, MA: Manual Template, SF: Soft Template.

Table 6: Case study on the test set. Correctly predicted event types and event arguments are marked in teal, while wrong predictions by BART-ECE are marked in red.
However, only the student model is used during the test process, so the inference time is identical to that of regular generation-based models. The relatively long training time can be mitigated by strategies such as GPU parallelization. Considering the significant improvements brought by the ICE framework, we believe this cost is acceptable.