LearnDA: Learnable Knowledge-Guided Data Augmentation for Event Causality Identification

Modern models for event causality identification (ECI) are mainly based on supervised learning, which makes them prone to the problem of data scarcity. Unfortunately, existing NLP-related augmentation methods cannot directly produce the data required for this task. To solve the data scarcity problem, we introduce a new approach to augment training data for event causality identification, by iteratively generating new examples and classifying event causality in a dual learning framework. On the one hand, our approach is knowledge guided: it can leverage existing knowledge bases to generate well-formed new sentences. On the other hand, our approach employs a dual mechanism, i.e., a learnable augmentation framework, and can interactively adjust the generation process to produce task-related sentences. Experimental results on two benchmarks, EventStoryLine and Causal-TimeBank, show that 1) our method can augment suitable task-related training data for ECI; 2) our method outperforms previous methods on EventStoryLine and Causal-TimeBank (+2.5 and +2.1 points on F1 respectively).


Introduction
Event causality identification (ECI) aims to identify causal relations between events in texts, which can provide crucial clues for NLP tasks such as logical reasoning and question answering (Girju, 2003; Oh et al., 2013, 2017). This task is usually modeled as a classification problem, i.e., determining whether there is a causal relation between two events in a sentence. For example, in Figure 1, an ECI system should identify two causal relations: (1) attack --cause--> killed in S1; (2) statement --cause--> protests in S2.

S1: Kimani Gray, a young man who likes football, was killed in a police attack shortly after a tight match.
S2: In the week following the fatal violence, several protests have erupted because of the official statement.
S3: Kimani Gray, a young man who likes football, was killed in a police attack shortly after a tight match.

Figure 1: S1 and S2 are causal sentences that contain causal events. S3 is produced by EDA based on S1. The dotted line indicates the causal relation.

Most existing methods for ECI heavily rely on annotated training data (Mirza and Tonelli, 2016; Riaz and Girju, 2014b; Hashimoto et al., 2014; Hu and Walker, 2017; Gao et al., 2019). However, existing datasets are relatively small, which impedes the training of high-performance event causality reasoning models. According to our statistics, the largest widely used dataset, the EventStoryLine Corpus (Caselli and Vossen, 2017), contains only 258 documents, 4,316 sentences, and 1,770 causal event pairs. Therefore, data scarcity is an essential problem that urgently needs to be addressed for ECI.
Up to now, data augmentation has been one of the most effective ways to solve the data scarcity problem. However, most NLP-related augmentation methods are task-independent frameworks that produce new data in one pass (Zhang et al., 2015; Guo et al., 2019; Xie et al., 2019b). In these frameworks, data augmentation and the target task are modeled independently. This often leads to a lack of task-related characteristics in the generated data, such as task-related linguistic expressions and knowledge. For example, easy data augmentation (EDA) (Wei and Zou, 2019) is the most representative method, relying on lexical substitution, deletion, swapping, and insertion to produce new data. However, solely relying on such word operations often generates new data that fails to satisfy task-related qualities. As shown in Figure 1, S3 is produced by EDA; it lacks a linguistic expression of the causal semantics between killed and attack. Therefore, how to interactively model data augmentation and the target task so as to generate new data with task-related characteristics is a challenging problem for ECI.
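To make the contrast concrete, a word-level substitution in the style of EDA can be sketched as follows. This is a minimal illustrative sketch: the toy synonym table is ours, whereas the actual EDA method draws synonyms from WordNet. It shows how purely lexical operations preserve the label but may disturb the cohesive words that actually express the causal semantics.

```python
import random

# Toy synonym table standing in for WordNet lookups (illustrative only).
SYNONYMS = {
    "killed": ["murdered", "slain"],
    "attack": ["onrush", "assault"],
    "police": ["constabulary"],
}

def eda_synonym_replace(sentence, n_swaps=2, seed=0):
    """Replace up to n_swaps words with a random synonym.

    The class label (causal/non-causal) is kept, but nothing constrains
    the result to still express the causal semantics between the events.
    """
    rng = random.Random(seed)
    words = sentence.split()
    candidates = [i for i, w in enumerate(words) if w in SYNONYMS]
    rng.shuffle(candidates)
    for i in candidates[:n_swaps]:
        words[i] = rng.choice(SYNONYMS[words[i]])
    return " ".join(words)

print(eda_synonym_replace("Gray was killed in a police attack"))
```

With a fixed seed the substitution is deterministic, which makes the behavior easy to inspect; the real EDA additionally applies deletion, swapping, and insertion.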
Specific to ECI, we argue that an ideal task-related generated causal sentence needs to possess two characteristics. (1) The two events in the causal sentence need to have a causal relation. We call this property Causality. For example, there is usually a causal relation between an attack event and a kill event, while there is nearly no causal relation between an attack event and a born event.
(2) The linguistic expressions of the causal sentence need to be well-formed to express the causal semantics of the events. We call this property Well-formedness, which consists of a) canonical sentence grammar, b) event-related entities with semantic roles (e.g., the attack was carried out by the police in S1), and c) cohesive words that express complete causal semantics (e.g., in a and the other words except for events and entities in S1).
To this end, we propose a learnable data augmentation framework for ECI, dubbed Learnable Knowledge-Guided Data Augmentation (LearnDA). This framework regards sentence-to-relation mapping (the target task, ECI) and relation-to-sentence mapping (the augmentation task, sentence generation) as dual tasks and models the mutual relation between them via dual learning. Specifically, LearnDA uses this duality to generate task-related new sentences by learning from identification, and identifies causal semantics more accurately by learning from generation. On the one hand, LearnDA is knowledge guided. It introduces diverse causal event pairs from KBs to initialize the dual generation, which ensures the causality of the generated causal sentences. For example, the knowledge of judgment --cause--> demonstration from KBs can be used to construct a novel causal sentence, which is also helpful for understanding the causal semantics of statement --cause--> protests. On the other hand, LearnDA is learnable. It employs a constrained generative architecture that generates well-formed linguistic expressions, which express the causal semantics between the given events, via iterative learning in the dual interaction. Methodologically, it gradually fills in the missing cohesive words of a complete sentence under the constraint of the given events and related entities.
In experiments, we evaluate our model on two benchmarks. We first consider the standard evaluation and show that our model achieves state-of-the-art performance on ECI. Then we evaluate the main components of LearnDA. Finally, our learnable augmentation framework demonstrates clear advantages over other augmentation methods in generating task-related data for ECI.
In summary, our contributions are as follows: • We propose a new learnable data augmentation framework to solve the data scarcity problem of ECI. Our framework leverages the duality between identification and generation via dual learning, and thus learns to generate task-related sentences for ECI.
• Our framework is knowledge guided and learnable. Specifically, we introduce causal event pairs from KBs to initialize the dual generation, which ensures the causality of the generated causal sentences. We also employ a constrained generative architecture to gradually generate well-formed causal linguistic expressions for the generated causal sentences via iterative learning in the dual interaction.
• Experimental results on two benchmarks show that our model achieves the best performance on ECI. Moreover, it also shows definite advantages over previous data augmentation methods.

Related Work
To date, many studies attempt to identify causality with linguistic patterns or statistical features. For example, some methods rely on syntactic and lexical features (Riaz and Girju, 2013, 2014b). Some focus on explicit causal textual patterns (Hashimoto et al., 2014; Riaz and Girju, 2010, 2014a; Do et al., 2011; Hidey and McKeown, 2016). And some others pay attention to statistical causal associations and cues (Beamer and Girju, 2009; Hu et al., 2017; Hu and Walker, 2017).
Recently, more attention has been paid to the causality between events. Mirza and Tonelli (2014) annotated Causal-TimeBank with event-causal relations based on the TempEval-3 corpus. Mirza et al. (2014) and Mirza and Tonelli (2016) extracted event-causal relations with a rule-based multi-sieve approach and improved the performance by incorporating event temporal relations. Mostafazadeh et al. (2016) annotated both temporal and causal relations in 320 short stories. Caselli and Vossen (2017) annotated the EventStoryLine Corpus described in the Introduction. Unlike computer vision, the augmentation of text data in NLP is pretty rare (Chaudhary, 2020). Zuo et al. (2020) addressed the data scarcity problem of ECI with distantly supervised labeled training data. However, including distant supervision, most existing data augmentation methods for NLP tasks are task-independent frameworks (related work on data augmentation and dual learning is detailed in Appendix B). Inspired by generative methods that try to generate additional training data while preserving the class label (Anaby-Tavor et al.; Yang et al., 2019; Papanikolaou and Pierleoni, 2020), we introduce a new learnable framework for augmenting task-related training data for ECI via dual learning enhanced with external knowledge.

Methodology
As shown in Figure 2, LearnDA jointly models a knowledge-guided sentence generator (input: an event pair and its causal/non-causal relation; output: a causal/non-causal sentence) and an event causality identifier (input: an event pair and its sentence; output: a causal/non-causal relation) with dual learning. LearnDA iteratively optimizes the identifier and generator to generate task-related training data, and then utilizes the new data to further train the identifier. We first present the main idea of dual learning, i.e., the architecture of learnable dual augmentation, including its states, actions, policies, and rewards. Then, we briefly introduce the knowledge-guided sentence generator, especially the processes of knowledge guiding and constrained sentence generation. Finally, we describe the event causality identifier and the training processes of LearnDA.

Figure 3: The architecture of learnable dual augmentation. Causal and NCausal represent the causal and non-causal sentence generators respectively. Red parts show the process <event pair, relation> → sentence → relation (primal cycle), while blue parts show the process <event pair, sentence> → relation → sentence (dual cycle). Solid and dashed lines denote the main process and the reward feedback direction respectively.

Architecture of Learnable Dual Augmentation
The architecture of learnable dual augmentation is shown in Figure 3. Specifically, I denotes the event causality identifier, and G denotes the sentence generator, which consists of two independent generators that produce causal and non-causal sentences according to the relation c of the input event pair ep. Generally, G generates a sentence s' that expresses the causal or non-causal relation c of the input event pair ep. It then receives a reward R that consists of a semantic alignment reward R_s from itself and a causality reward R_c from I (primal cycle). Similarly, I identifies the causal or non-causal relation c' of the input event pair ep given its sentence s. It then receives a reward R that consists of a causality reward R_c from itself and a semantic alignment reward R_s from G (dual cycle).
I and G are optimized interactively with dual reinforcement learning. Specifically, for G, an action is the generation from relation to sentence, a state is denoted by the representation of the input event pair and its relation, and a policy is defined by the parameters of the generator. For I, an action is the identification from sentence to relation, a state is denoted by the representation of the input event pair and its sentence, and a policy is defined by the parameters of the identifier. Inspired by Shen and Feng (2020), we use a probability distribution over actions given states to represent the policies, i.e., the probability distributions of the generation of G and the identification of I. As aforementioned, we introduce two rewards, the causality reward (R_c) and the semantic alignment reward (R_s), which encourage G to generate task-related sentences with feedback from the identifier, while further optimizing I with feedback from the generator. The definitions are as follows.

Causality Reward (R_c). If the relation of the input event pair is clearly expressed by the generated sentence, it will be easier for the identifier to understand. Therefore, we use the causal relation classification accuracy as the causality reward to evaluate the causality of generated sentences, while tuning and optimizing the identifier itself:

R_c(ep, s) = p(c' | s; θ_I),

where θ_I is the parameter of I, p(c' | s; θ_I) denotes the probability of the relation classification, s denotes the input sentence, and c' is the classified relation.
Semantic Alignment Reward (R_s). We hope that the semantics of the generated sentence is consistent with the relation of the input event pair. Additionally, if the relation of the input event pair can be more accurately classified, the semantics of the newly generated sentence can be considered more consistent with it. Therefore, we measure the semantic alignment by the probability of constructing a sentence whose semantics is similar to the input relation, and the reward is:

R_s(ep, c) = p(s' | c; θ_G) = (1 / |T_s'|) Σ_{t ∈ T_s'} p(t | c; θ_G),

where θ_G is the parameter of G, c is the input relation, t is one of the generated tokens T_s' of the generated sentence s', and p(t | c; θ_G) is the generation probability of t. Specifically, there are two independent generators G with different θ_G: θ^c_G is employed to generate a causal sentence when the input c is the causal relation, and a non-causal sentence is generated via θ^nc_G when c is the non-causal relation.
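The two rewards and their interpolation can be sketched as follows. This is a minimal sketch with toy probabilities; the function and variable names are ours, and λ = 0.5 follows the hyper-parameter settings reported later in the paper.

```python
def causality_reward(p_relation_given_sentence):
    """R_c: the identifier's probability for the relation given the
    generated sentence -- higher when the causal semantics are easy
    to identify."""
    return p_relation_given_sentence

def semantic_alignment_reward(token_probs):
    """R_s: average generation probability of the tokens of the new
    sentence given the input relation, i.e. (1/|T_s'|) * sum p(t|c)."""
    return sum(token_probs) / len(token_probs)

def total_reward(r_s, r_c, lam=0.5):
    """Primal-cycle reward: R = lambda * R_s + (1 - lambda) * R_c."""
    return lam * r_s + (1 - lam) * r_c

r_s = semantic_alignment_reward([0.9, 0.7, 0.8])  # average of token probs
r_c = causality_reward(0.6)
print(total_reward(r_s, r_c))  # ≈ 0.7
```

In the dual cycle the same interpolation is applied from the identifier's side, with the semantic alignment reward coming from the generator.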

Knowledge Guided Sentence Generator
As shown in Figure 4, the knowledge-guided sentence generator (KSG) first introduces diverse causal and non-causal event pairs from KBs for causality. Then, given an event pair and its causal or non-causal relation, it employs a constrained generative architecture to generate new well-formed causal/non-causal sentences that contain them.

Figure 4: Flow diagram of the knowledge-guided sentence generator (KSG). We take causal sentence generation via lexical knowledge expanding as an example.
Knowledge Guiding KSG introduces event pairs that are probabilistic causal or non-causal from multiple knowledge bases in two ways.
(1) Lexical knowledge expanding: expanding the annotated event pairs via lexical knowledge (synonyms, hypernyms, and same-class words) from WordNet and VerbNet. (2) Connective knowledge introducing: introducing event pairs from external event-annotated documents (the KBP corpus) assisted with FrameNet (Baker et al., 1998) and the Penn Discourse Treebank (PDTB2) (Group et al., 2008). As shown in Table 1, we illustrate how to extract event pairs from multiple knowledge bases. Then, inspired by Bordes et al. (2013), we filter the extracted event pairs by converting them into triples <e_i, causal/non-causal, e_j> and computing the causal distance by maximizing L in a causal representation space:

L = Σ_{<e_i, c, e_j> ∈ T} Σ_{<e_i', c', e_j'> ∈ T'} ( ||e_i' − e_j'|| − ||e_i − e_j|| ),

where T and T' are the causal and non-causal triple sets respectively, and e is the representation of an event. After training, the higher the probability of a causal relation, the shorter the distance between two events, and we sort event pairs in ascending order of their distances. Finally, we keep the top and bottom α% of the sorted event pairs to obtain the causal and non-causal event pair sets for generation.
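The selection step can be sketched as follows. This is an illustrative sketch only: toy 2-d embeddings and Euclidean distance stand in for the learned causal representation space, and the α = 30% setting mirrors the hyper-parameters reported later.

```python
import math

def distance(e1, e2):
    """Euclidean distance between two event embeddings; in this sketch
    a shorter distance means a higher probability of a causal relation."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(e1, e2)))

def select_pairs(pairs, embeddings, alpha=0.3):
    """Sort candidate event pairs by causal distance (ascending) and keep
    the top alpha% as causal and the bottom alpha% as non-causal pairs."""
    ranked = sorted(pairs, key=lambda p: distance(embeddings[p[0]], embeddings[p[1]]))
    k = max(1, int(len(ranked) * alpha))
    return ranked[:k], ranked[-k:]  # (causal candidates, non-causal candidates)

# Toy embeddings: attack/kill and statement/protest sit close together,
# attack/born sit far apart.
emb = {"attack": (0.0, 0.0), "kill": (0.1, 0.0),
       "born": (5.0, 5.0), "statement": (0.0, 0.2), "protest": (0.3, 0.2)}
pairs = [("attack", "kill"), ("attack", "born"), ("statement", "protest")]
causal, non_causal = select_pairs(pairs, emb, alpha=0.34)
print(causal)      # the closest pair(s)
print(non_causal)  # the farthest pair(s)
```

The real system learns the event representations from the triples first; only the ranking-and-thresholding step is shown here.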
Constrained Sentence Generator Given an event pair, the constrained sentence generator produces a well-formed sentence that expresses its causal or non-causal relation in three stages: (1) assigning event-related entities ensures the logic of the semantic roles of events, (2) completing sentences ensures the completeness of the causal or non-causal semantic expression, (3) filtering sentences ensures the quality and diversity of the generated sentences.

Table 1 (excerpt): How event pairs are extracted from multiple knowledge bases, and why they are causal or non-causal.
- WordNet (lexical knowledge expanding). How: 1) extract the synonyms and hypernyms from WordNet for each event in ep; 2) assemble items from the two groups of the two events to generate causal/non-causal event pairs. Why: items in each group are synonyms and hypernyms of the annotated causal/non-causal event pairs.
- VerbNet (lexical knowledge expanding). How: 1) extract the words from VerbNet under the same class as each event in ep; 2) assemble items from the two groups of the two events to generate causal/non-causal event pairs. Why: items in each group are in the same class as the annotated causal/non-causal event pairs.
Assigning Event-related Entities. Event-related entities play different semantic roles of events in sentences, which is an important part of event-semantic expression. Hence, as shown in Figure 4, given an event pair, we first assign logical entities for the input events to guarantee the logic of the semantic roles in the new sentence; for instance, gang is a logical entity as the agent of the event onrush. Logically, entities of the same type play the same semantic roles in similar events. Moreover, as shown in Table 1, there is a corresponding original sentence for each extracted event pair. Therefore, in the new sentence, we assign the most similar entity of the same type from the candidate set for each entity in the original sentence. For example, we assign gang to onrush in the new sentence, which is similar to the police related to attack in the original sentence. Specifically, we put the candidate entities in the same position in the original sentence to obtain their BERT embeddings, and then select entities via the cosine similarity between their embeddings:

ent* = argmax_{ent} cos(E(ent), E(ent_orig)),

where ent is a candidate entity and E(·) is the BERT embedding of an entity.
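The selection by embedding similarity can be sketched as follows, using toy 3-d vectors in place of the contextual BERT embeddings (names and values are illustrative, not from the released implementation):

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def assign_entity(original_entity_emb, candidates):
    """Pick the candidate entity whose embedding is most similar to the
    entity in the original sentence: argmax_ent cos(E(ent), E(ent_orig))."""
    return max(candidates, key=lambda name: cosine(candidates[name], original_entity_emb))

# Toy embeddings: "gang" is close to "police", "river" is not.
candidates = {"gang": (0.9, 0.1, 0.0), "river": (0.0, 0.0, 1.0)}
police = (1.0, 0.2, 0.1)
print(assign_entity(police, candidates))  # gang
```

In the actual method, each candidate is placed in the original entity's position so that BERT produces a position-aware contextual embedding before the comparison.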
Completing Sentences. A well-formed sentence requires a complete linguistic expression to express the causal or non-causal semantics. Therefore, we complete sentences by filling the cohesive words between the given events and the assigned entities with a masked BERT (Devlin et al., 2019). All words except events and entities are regarded as cohesive words. Specifically, we insert a certain number of the special token [MASK] between events and entities, and then predict the [MASK] tokens as new words. As shown in Figure 4, we fill the cohesive tokens via two independent generators to express causal and non-causal semantics according to the relation of the given events. For example, the cohesive words in a, which guide a causal semantics, are filled by the causal generator. (Candidate entities are collected from the annotated data and the KBP corpus.)
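The construction of the masked template can be sketched as follows. The anchors and the number of [MASK] tokens per gap are illustrative assumptions; the real system then lets a masked BERT predict the masked positions.

```python
def build_masked_template(anchors, n_masks=2, mask="[MASK]"):
    """Join the kept anchors (events and their assigned entities) with
    runs of [MASK] tokens, which a masked LM later fills with cohesive
    words expressing the causal or non-causal semantics."""
    gap = " ".join([mask] * n_masks)
    return f"{gap} " + f" {gap} ".join(anchors) + f" {gap}"

# Anchors = assigned entity + the two events (toy example).
template = build_masked_template(["gang", "onrush", "killed"], n_masks=1)
print(template)
# [MASK] gang [MASK] onrush [MASK] killed [MASK]
```

Feeding such a template to the causal generator versus the non-causal generator yields different cohesive words for the same anchors, which is how the relation label constrains the generation.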
Filtering Sentences. Inspired by Yang et al. (2019), we design a filter to select new sentences that balance high quality and high diversity with two key factors. 1) Perplexity (PPL): we take the average probability of the filled cohesive words in the new sentence s' as its perplexity:

PPL(s') = (1 / |T'|) Σ_{t ∈ T'} p(t),

where T' is the set of filled cohesive words. 2) Distance (DIS): we compute the distance of the generated sentence s' from the annotated data D_m via cosine similarity:

DIS(s') = 1 − (1 / m) Σ_{d ∈ D_m} cos(E(s'), E(d)),

where D_m is a set of m randomly selected annotated sentences and E is the BERT sentence representation of the [CLS] token. A new sentence should have both an appropriately high PPL, which indicates the quality of the generation, and an appropriately high DIS, which indicates the difference from the original sentences. Therefore, we select the top β% of the newly generated sentences according to the following Score for the further training of the identifier:

Score(s') = µ · PPL(s') + (1 − µ) · DIS(s'),

where µ is a hyper-parameter.
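The scoring and selection can be sketched as follows (toy probabilities and similarities; the exact functional form of DIS is our reading of the description, and µ = 0.2 and β = 50% follow the reported hyper-parameters):

```python
def ppl_score(token_probs):
    """PPL: average probability of the filled cohesive words (a proxy
    for generation quality in this sketch)."""
    return sum(token_probs) / len(token_probs)

def dis_score(sims_to_annotated):
    """DIS: one minus the average cosine similarity to m sampled
    annotated sentences -- larger means more different from them."""
    return 1.0 - sum(sims_to_annotated) / len(sims_to_annotated)

def score(ppl, dis, mu=0.2):
    """Score = mu * PPL + (1 - mu) * DIS."""
    return mu * ppl + (1 - mu) * dis

def keep_top(sentences, scores, beta=0.5):
    """Keep the top beta% of generated sentences by Score."""
    ranked = sorted(zip(scores, sentences), reverse=True)
    k = max(1, int(len(ranked) * beta))
    return [s for _, s in ranked[:k]]

# s1: confidently generated and fairly novel; s2: low-confidence and derivative.
scores = [score(ppl_score(p), dis_score(d))
          for p, d in [([0.9, 0.8], [0.2, 0.4]), ([0.3, 0.2], [0.9, 0.9])]]
print(keep_top(["s1", "s2"], scores, beta=0.5))  # ['s1']
```

The trade-off the paper describes is visible here: a sentence is kept only when it is both plausible to the generator and sufficiently different from the annotated data.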

Training of LearnDA for ECI
We briefly describe the training processes of LearnDA for ECI, including the pre-training of the generator and identifier, the dual reinforcement training, and the further training of the identifier.

Algorithm 1 (excerpt): the generator generates the sentence s'_i of ep_i; the identifier re-predicts the causality c*_i of ep_i; the primal-cycle reward is computed as R_primal = λ R_s(ep_i, c_i) + (1 − λ) R_c(ep_i, s'_i); the stochastic gradient of θ_I is then computed.

Event Causality Identifier First of all, we formulate event causality identification as a sentence-level binary classification problem. Specifically, we design a classifier based on BERT (Devlin et al., 2019) to build our identifier. The input of the identifier is the event pair ep and its sentence s. Next, we take the concatenation of manually designed features (the same lexical, causal potential, and syntactic features as Gao et al. (2019)) and the two event representations as the input of a top MLP classifier. Finally, the output is a binary vector predicting the causal/non-causal relation of the input event pair ep.
Pre-training We pre-train the identifier and generator on labeled data before dual reinforcement training. On the one hand, we train identifier via the cross-entropy objective function of the relation classification. On the other hand, for generators, we keep the events and entities in the input sentences, replace the remaining tokens with a special token [MASK], and then train it via the cross-entropy objective function to re-predict the masked tokens. Specifically, causal generator and non-causal generator are pre-trained on causal and non-causal labeled sentences respectively.
Dual Reinforcement Training As shown in Algorithm 1, we interactively optimize the generator and identifier by dual reinforcement learning. Specifically, we maximize the following objective functions:

L_G(ep, c) = p(s' | c; θ_G) = (1 / |T_s'|) Σ_{t ∈ T_s'} p(t | c; θ_G),
L_G(ep, c) = p(s' | c; θ_NG) = (1 / |T_s'|) Σ_{t ∈ T_s'} p(t | c; θ_NG),
L_I(ep, s) = p(c' | s; θ_I),

where θ_G and θ_NG are the parameters of the causal and non-causal sentence generators respectively, and T_s' is the set of masked tokens. Finally, after the dual data augmentation, we utilize the generated sentences to further train the dual-trained identifier via the cross-entropy objective function of relation classification.
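One primal-cycle step can be sketched as follows. The `generate` and `identify` callables are stand-ins for the BERT-based generator and identifier; the names, the toy models, and the batching are our assumptions, and λ = 0.5 follows the reported hyper-parameters.

```python
def generator_objective(token_probs):
    """L_G(ep, c) = p(s'|c; theta_G), approximated by the average
    probability of the generated (previously masked) tokens."""
    return sum(token_probs) / len(token_probs)

def identifier_objective(p_relation):
    """L_I(ep, s) = p(c'|s; theta_I)."""
    return p_relation

def dual_step(batch, generate, identify, lam=0.5):
    """One primal-cycle step: generate a sentence from <event pair,
    relation>, re-identify its relation, and mix the two rewards; the
    returned mean reward would scale the policy gradients."""
    rewards = []
    for ep, c in batch:
        sentence, token_probs = generate(ep, c)
        p_c = identify(ep, sentence)
        r = lam * generator_objective(token_probs) + (1 - lam) * identifier_objective(p_c)
        rewards.append(r)
    return sum(rewards) / len(rewards)

# Toy stand-ins for the two models:
fake_gen = lambda ep, c: ("a causal sentence", [0.8, 0.8])
fake_ident = lambda ep, s: 0.5
print(dual_step([(("attack", "kill"), "causal")], fake_gen, fake_ident))
```

The dual cycle mirrors this with the roles exchanged: the identifier's prediction is fed back to the generator, which scores its semantic alignment.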

Experimental Setup
Dataset and Evaluation Metrics Our experiments are conducted on two main benchmark datasets: (1) EventStoryLine v0.9 (ESC) (Caselli and Vossen, 2017), described above; and (2) Causal-TimeBank (Causal-TB) (Mirza and Tonelli, 2014), which contains 184 documents, 6,813 events, and 318 causal event pairs. As in previous methods, we use the last two topics of ESC as the development set for both datasets. For evaluation, we adopt Precision (P), Recall (R), and F1-score (F1) as evaluation metrics. We conduct 5-fold and 10-fold cross-validation on ESC and Causal-TB respectively, the same as previous methods, to ensure comparability. All results are the average of three independent experiments.
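The metrics are the standard ones, computed over causal event pairs (a true positive is a correctly identified causal pair); a minimal reference implementation:

```python
def precision_recall_f1(tp, fp, fn):
    """Standard P/R/F1 from true-positive, false-positive, and
    false-negative counts of causal event pairs."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Toy counts for illustration only.
print(precision_recall_f1(50, 30, 40))  # P = 0.625, R ≈ 0.556, F1 ≈ 0.588
```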
Parameters Settings In implementation, both the identifier and the generators are built on the BERT-Base architecture, which has 12 layers, 768 hidden units, and 12 heads. We set the learning rates of generator pre-training, identifier pre-training/further training, and dual reinforcement training to 1e-5, 1e-5, and 1e-7 respectively. We set the ratio of augmented data used for training to labeled data, α, β, µ, λ, and γ to 1:2, 30%, 50%, 0.2, 0.5, and 0.5 respectively, tuned on the development set. We apply early stopping and the SGD gradient strategy to optimize all models, and also adopt a negative sampling strategy.

Compared Methods For ESC, we compare with ILP (Gao et al., 2019); such document-level models adopt document structures for ECI. For Causal-TB, we prefer: 1) RB, a rule-based system; 2) DD, a data-driven machine learning based system; 3) VR-C, a verb rule based model with data filtering and gold causal signal enhancement. These models are designed by Mirza and Tonelli (2014) and Mirza (2014) for ECI. Since our method is constructed on BERT, we also build BERT-based methods: 1) BERT, a BERT-based baseline, our basic proposed event causality identifier; 2) MM (Liu et al., 2020), the BERT-based SOTA method with mention masking generalization; 3) MM+Aug, MM further re-trained with our dual augmented data; 4) KnowDis (Zuo et al., 2020), which improved the performance of ECI with distantly labeled training data; we compare with it to illustrate the quality of our generated ECI-related training data; 5) MM+ConceptAug, where, for a fair comparison, we introduce causal-related events from ConceptNet as employed by MM, and generate new sentences via KnowDis and LearnDA to further re-train MM (see Appendix C for details). Finally, LearnDA_Full indicates our full model, i.e., the dual-trained identifier further trained on the dual augmented data.

Table 2 shows the results of ECI on EventStoryLine and Causal-TimeBank.
Our Method vs. State-of-the-art Methods

From the results: 1) Our LearnDA_Full outperforms all baselines and achieves the best performance (52.6%/51.9% F1), outperforming the non-BERT (ILP/VR-C) and BERT-based (MM/KnowDis) state-of-the-art methods by margins of 7.9%/8.7% and 2.5%/2.1% respectively, which justifies its effectiveness. Moreover, BERT-based methods demonstrate high recall, which benefits from more training data and their event-related guiding knowledge.

2) Comparing KnowDis with LearnDA_Full, we note that the training data generated by LearnDA is more helpful to ECI than distant supervision with external knowledge (+2.9%/+2.1%). This shows that LearnDA can generate more ECI-related data.
3) Comparing MM+ConceptAug with MM, with the same knowledge base, our dual augmented data can further improve the performance (+0.8%/+2.8%), which illustrates that LearnDA makes more effective use of external knowledge by generating task-related training data. 4) Comparing MM+Aug with MM, we note that training with our dual augmented data improves the performance by 1.4%/3.9%, even though MM is built on BERT-Large (LearnDA is built on BERT-Base) and also introduces external knowledge. This indicates that the augmented data generated by our LearnDA can effectively alleviate the problem of data scarcity for ECI.

Effect of Learnable Dual Augmentation
We analyze the effect of the learnable dual augmentation for event causality identification. 1) For the identifier: comparing LearnDA_Dual with BERT in Table 3, we note that the performance of the proposed identifier improves (+2.6%) after the dual training with the same labeled data alone. This indicates that the identifier can learn more informative expressions of causal semantics from generation with dual learning. 2) For the generator: comparing BERT_DualAug with BERT_Aug in Table 3, we note that the dual augmented data is of high quality and more helpful to ECI (+2.6%). This indicates that the generator can generate more ECI task-related data by learning from the identifier with dual learning.

Figure 5 illustrates the learnability of our LearnDA: as the number of training rounds of dual learning increases, the generated data gradually acquires task-related information, further improving the performance accordingly. In each round, we generate new training data with the generator at the current round; the performance is obtained by further training the identifier at the current round with this newly generated data.

Table 5: Manual (4-score rating (0, 1, 2, 3)) and automatic (BLEU score) evaluation of the generated sentences via different methods with respect to causality, well-formedness, and diversity. Causality and well-formedness are assessed manually, while diversity is assessed both manually and automatically.

Effect of Knowledge Guiding
causal-related knowledge is better.

Our Augmentation vs. Other NLP Augmentations
In this section, we conduct a comparison between our augmentation framework and other NLP-related augmentation methods to further illustrate the effectiveness of LearnDA.

Effectiveness of Our Augmentation
We train our identifier with augmented data produced by different NLP-related augmentation methods. As shown in Table 4, the augmented data generated by our LearnDA is the most effective for ECI, which is consistent with the previous analysis: LearnDA can generate well-formed, task-related new sentences that contain more event-causal knowledge. Specifically, 1) text surface transformation brings only slight changes to the labeled data, so it has relatively little impact on ECI; 2) back translation introduces limited new causal expressions, so it only slightly increases the recall on ECI; 3) EDA can introduce new expressions via substitution, but the augmented data is not canonical and cannot accurately express the causality, so its impact on ECI is also limited.

Quantitative Evaluation of Task-relevance
We select five Ph.D. students majoring in NLP to manually score 100 randomly selected augmented sentences, given their corresponding original sentences as reference (Cohen's kappa = 0.85). Furthermore, we calculate the BLEU (Papineni et al., 2002) value to further evaluate the diversity. As aforementioned, the task-relevance of new sentences on ECI is manifested in causality and well-formedness, while diversity indicates the degree of generalization. As shown in Table 5, we note that the sentences generated by LearnDA possess all three properties to a degree close to the labeled sentences. Specifically, the sentences produced by EDA have a certain degree of causality and diversity due to the lexical substitution assisted by external knowledge; however, they cannot well express the causality due to grammatical irregularities. Correspondingly, new sentences generated via back translation are very similar to the original sentences, so their diversity is poor.

Figure 6: The modification process of dual learning.

Case Study
We conduct a case study to further investigate the effectiveness of our LearnDA. Figure 6 illustrates the modification process of dual learning. As shown in a), given two causal events, the generator is expected to generate a causal sentence. However, the generator without dual learning produces a non-causal sentence. With dual learning, the identifier judges the generated sentence as non-causal and, via its feedback, guides the generator to produce a causal sentence. Similarly, as shown in b), given a causal sentence, the identifier is expected to output a causal relation, but the identifier without dual training fails to do so. Correspondingly, the generator constructs feedback of low confidence to guide the identifier to output a causal relation.

Conclusion
This paper proposes a new learnable knowledge-guided data augmentation framework (LearnDA) to solve the data scarcity problem of ECI. Our framework leverages the duality between generation and identification via dual learning to generate task-related sentences for ECI. Moreover, our framework is knowledge guided and learnable. Our method achieves state-of-the-art performance on the EventStoryLine and Causal-TimeBank datasets.

As shown in Table 6, our dual augmented data is significantly larger in quantity than the labeled data. Specifically, the causal event pairs are increased by 3.1 times, the causal sentences are increased by 5.9 times, and the average number of causal sentences corresponding to each causal event pair is also increased.

We change the quantity of dual augmented data used for training to explore the influence of the augmentation ratio on ECI. As shown in Table 7, when the ratio is 1:2, the effective knowledge brought by the dual augmented data is maximized. As the ratio increases, the dual augmented data brings noise, which obstructs the model from identifying event causality and may shift the data distribution away from the original data (Xie et al., 2019a). This suggests that more augmented data is not always better and that there is a trade-off between introducing knowledge and reducing noise.

Table 9: Performance of the identifier (BERT) trained with new generated sentences filtered at different β. * denotes a significance test at the 0.05 level.

Table 9 shows the effectiveness of the generated sentences under different filtering ratios. As the ratio of retained generated sentences increases, the contribution of the filtered generated sentences for ECI decreases gradually. This proves the effectiveness of the filtering, which balances the overall quality of the sentences against their diversity.

B.1 Dual Learning
For many Natural Language Processing (NLP) tasks, there exist primal and dual tasks, such as open information narration (OIN) and open information extraction (OIE) (Sun et al., 2018), natural language understanding (NLU) and natural language generation (NLG), semantic parsing and natural language generation (Ye et al., 2019; Cao et al., 2019, 2020), link prediction and entailment graph induction (Cao et al., 2019), and query-to-response and response-to-query generation (Shen and Feng, 2020). The duality between the primal task and the dual task is considered as a constraint that both problems must mutually share the same joint probability. Recently, inspired by Xia et al. (2017), who implemented the duality in a neural-based dual learning system, the above primal-dual tasks have been implemented in two different ways: 1) providing additional labeled samples via bootstrapping, and 2) adding rewards at the training stage for each agent. We observe that event causality identification and sentence generation are dual to each other. Therefore, we apply a dual learning framework in the second way to optimize identification and generation interactively for generating ECI-related data.

B.2 Data Augmentation for NLP
The scarcity of annotated data is a thorny problem in machine learning. Unlike in computer vision, data augmentation for text in NLP is relatively rare. Existing text data augmentation methods for NLP tasks are mostly task-independent and can be roughly grouped into the following categories (Chaudhary, 2020): (1) lexical substitution replaces words without changing the meaning (Zhang et al., 2015; Wei and Zou, 2019; Wang and Yang, 2015; Xie et al., 2019b); (2) back translation paraphrases a text while retaining its meaning (Xie et al., 2019b); (3) text surface transformation applies pattern-based transformations using regexes (Coulombe, 2018); (4) random noise injection injects noise into the text to make the model more robust (Wei and Zou, 2019); (5) generative methods produce additional training data while preserving the class label (Anaby-Tavor et al.; Yang et al., 2019); (6) distant supervision and self-supervision introduce new training data from unlabeled text (Chen et al., 2017; Ruiter et al., 2019). As aforementioned, these frameworks cannot directly produce suitable new task-related examples for ECI. Specifically, (1), (3), and (4) cannot guarantee the causality and well-formedness of new examples for ECI; (2) and (5) cannot easily use external knowledge bases to generalize event-related causal commonsense; and (6) requires designing proprietary processing methods to generate ECI task-related training data. Zuo et al. (2020) addressed the data lacking problem of ECI with distantly supervised labeled training data. However, including distant supervision, most existing text data augmentation methods for NLP tasks are task-independent frameworks. Therefore, we introduce a new learnable framework for augmenting task-related training data for ECI via dual learning enhanced with external knowledge.
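To make the limitation of category (1) concrete, here is a toy lexical-substitution augmenter in the style of EDA (Wei and Zou, 2019). The synonym table is our own illustrative stand-in for WordNet or embedding neighbors; note that swapping an event trigger like "killed" may well break the causal semantics the ECI label depends on, which is exactly why such task-independent methods fall short here:

```python
import random

# Illustrative synonym table; real systems use WordNet or embeddings.
SYNONYMS = {
    "killed": ["slain", "murdered"],
    "attack": ["assault", "raid"],
}

def lexical_substitute(tokens, n=1, seed=7):
    """Replace up to n tokens that have entries in SYNONYMS."""
    rng = random.Random(seed)
    out = list(tokens)
    candidates = [i for i, t in enumerate(out) if t in SYNONYMS]
    for i in rng.sample(candidates, min(n, len(candidates))):
        out[i] = rng.choice(SYNONYMS[out[i]])
    return out
```

The augmenter preserves sentence length and label by construction, but nothing checks that the substituted trigger still participates in the original causal relation.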

C Generation with ConceptNet
To make a fair comparison, we introduce causal-related events from ConceptNet based on causal-related concepts, and obtain causal sentences via the method in KnowDis (Zuo et al., 2020) to further re-train MM. Specifically, we first obtain triples based on cause-related semantic relations from ConceptNet, such as the Causes, HasSubevent, HasFirstSubevent, HasLastSubevent, MotivatedByGoal, and CausesDesire relations. Secondly, we assemble any two events from the obtained causal triples to generate a set of causal event pairs and filter them via the filter of KnowDis. Next, we employ the filtered causal event pairs to collect preliminary noisy labeled sentences from external documents via the DistantAnnotator of KnowDis. Then, we use the CommonFilter of KnowDis, assisted with causal commonsense knowledge, to pick out labeled sentences that express causal semantics between events. Finally, the refined causal sentences are input into LearnDA to generate ECI-related dual augmented training data and further train MM to obtain MM+ConceptAug.
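The first two steps of this pipeline can be sketched as below. For simplicity the sketch pairs only the head and tail of each triple rather than assembling all event combinations, and the `keep` callback is a stand-in for the KnowDis filter, which we do not reimplement here:

```python
# Cause-related ConceptNet relations listed above.
CAUSAL_RELATIONS = {
    "Causes", "HasSubevent", "HasFirstSubevent",
    "HasLastSubevent", "MotivatedByGoal", "CausesDesire",
}

def causal_event_pairs(triples, keep=lambda pair: True):
    """Extract causal event pairs from (head, relation, tail) triples.

    `keep` is a placeholder for an external filter such as KnowDis's.
    """
    pairs = set()
    for head, rel, tail in triples:
        if rel in CAUSAL_RELATIONS:
            pairs.add((head, tail))
    return sorted(p for p in pairs if keep(p))
```

The resulting pairs would then be passed to a distant-annotation step to collect candidate sentences from external documents, as described above.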

D Main Experimental Environments and Other Parameter Settings

D.1 Experimental Environments
We deploy all models on a server with 250GB of memory and 4 TITAN Xp GPUs. The server runs Ubuntu 16.04, and our framework mainly depends on Python 3.6.0 and PyTorch 1.0.

D.2 Other Parameter Settings
All final hyper-parameters for evaluation are averaged over 3 independent tunings on the development set. Moreover, the whole dual learning framework, which includes the event causality identifier and the knowledge-guided sentence generator, takes approximately 5 minutes per epoch during training. Due to the early-stopping strategy, the number of training rounds differs across folds, at about 20-30 rounds.