Effect Generation Based on Causal Reasoning

Causal reasoning aims to predict the future scenarios that may be caused by an observed action. However, existing causal reasoning methods deal with causalities at the word level. In this paper, we propose a novel event-level causal reasoning method and demonstrate its use in the task of effect generation. In particular, we structuralize the observed cause-effect event pairs into an event causality network, which describes causality dependencies. Given an input cause sentence, a causal subgraph is retrieved from the event causality network and encoded with the graph attention mechanism, in order to support better reasoning about the potential effects. The most probable effect event is then selected from the causal subgraph and used as guidance to generate an effect sentence. Experiments show that our method generates more reasonable effect sentences than various well-designed competitors.


Introduction
Causal reasoning is the process of observing an action and reasoning about the future scenarios that it may potentially cause (Radinsky et al., 2012). Earlier causal reasoning methods (Roemmele et al., 2011; Luo et al., 2016) collect causally related word pairs (e.g., earthquake→tsunami) to build statistical models of causality, and then predict effect words for given cause words. More recently, Xie and Mu (2019) used causal embeddings to predict possible effect words for input causes, and Li et al. (2020) proposed a lexically-constrained beam search to generate possible effects given provided word guidance. However, all these methods reason about causalities at the word level.
Causalities between word pairs are not always self-contained (i.e., intelligible) when they are extracted without context (Hashimoto et al., 2014). For example, "quarrel→break" is not self-contained, since it is unintelligible without the context "They always quarrel→They break up". A word-level causal reasoning method may only predict the unintelligible effect "break" conditioned on "quarrel". Considering this deficiency, a better way is to enhance causal reasoning with causal events (Radinsky et al., 2012; Zhao et al., 2017; Martin et al., 2018; Ammanabrolu et al., 2020). However, an observed causal event is very likely to appear only once, which makes causalities extremely sparse and event-level causal reasoning difficult. To solve this problem, we structuralize the observed causal events into an event causality network, where similar events are clustered together. Given an input cause sentence, a causal subgraph is retrieved and encoded with the graph attention mechanism, in order to support better effect reasoning. As such, we are able to predict the most reasonable effect event based on the event causality network. The predicted effect event contains only skeleton information, since the detailed context is discarded in the event extraction process. So we further rewrite the predicted effect event into an effect sentence in order to fill in the missing information.
The contributions of this paper are twofold: i) we devise an effect generation method based on causal event reasoning (EGCER) to generate effect sentences for given input cause sentences; ii) experiments demonstrate that our model achieves better performance than various well-designed baselines.

Event Causality Network Construction
In this paper, we use causal events to bridge the causalities between input sequences and generated sequences. Hence, we must first collect sufficient cause-effect sentence pairs so that a cause-effect event pair can be eventified from each sentence pair. Then we construct an event causality network based on the extracted causal event pairs. The construction process includes two steps: 1) Event Eventification, and 2) Events Structuralization.
Event Eventification: Following (Do et al., 2011; Asghar, 2016; Luo et al., 2016; Hassanzadeh et al., 2019), we make use of a few high-precision causal connectives, such as 'because' and 'as a result', to extract cause-effect sentence pairs. Then we extract causal event pairs from the causal sentence pairs based on dependency analysis. We adopt the commonly used 4-tuple event representation (s, v, o, m) (Pichotta and Mooney, 2016), where v denotes the verb, s denotes the head noun of the subject, o denotes the head noun of the direct object or the adjective, and m denotes the head noun of the prepositional or indirect object.
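The eventification step can be sketched as follows. The sketch assumes a pre-parsed dependency structure in place of a real parser, with labels following Universal Dependencies conventions; the `eventify` helper is illustrative, not the paper's exact extractor:

```python
# Sketch of 4-tuple eventification from a dependency parse.
# Token triples (word, dep_label, head_index) stand in for a real
# parser's output (e.g. spaCy or Stanford CoreNLP).

def eventify(tokens):
    """Extract an (s, v, o, m) event tuple from one parsed clause.

    tokens: list of (word, dep, head_idx); the root verb has dep 'ROOT'.
    Returns (subject, verb, object, modifier), with '' for absent slots.
    """
    s = v = o = m = ""
    root = next(i for i, (_, dep, _) in enumerate(tokens) if dep == "ROOT")
    v = tokens[root][0]
    for word, dep, head in tokens:
        if head != root:
            continue
        if dep == "nsubj":
            s = word                      # head noun of the subject
        elif dep in ("dobj", "obj", "acomp"):
            o = word                      # direct object or adjective
        elif dep in ("iobj", "pobj", "obl"):
            m = word                      # prepositional / indirect object
    return (s, v, o, m)

# "They always quarrel" -> ('They', 'quarrel', '', '')
parse = [("They", "nsubj", 2), ("always", "advmod", 2), ("quarrel", "ROOT", 2)]
print(eventify(parse))
```

In practice the head nouns would come from the parser's noun-phrase heads rather than single tokens, but the slot-filling logic is the same.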
Events Structuralization: We structuralize the extracted causal event pairs into an event causality network, in which semantically similar events are clustered together. We use event abstractions to judge whether two events are semantically similar. The abstraction of an event is obtained by generalizing its components to their categories in linguistic resources. Specifically, the verb in each event is generalized to its class in VerbNet (Schuler, 2005). The other components are generalized to the WordNet (Miller, 1995) synset two levels up in the inherited hypernym hierarchy. In addition, we explicitly apply a semantic-similarity based inference rule: if event A has the same abstraction as event B, and a causal relation holds from A to C, then it is likely that a causal relation also holds from B to C. This manipulation significantly reduces the sparsity of causalities in the event causality network, and hence supports better reasoning about the effect events. The weight of an edge in our event causality network is derived by the following rules: 1) if the edge between the event pair (e_i, e_j) is extracted from the dataset, its weight is w_ij = 1; 2) if the edge (e_i, e_j) is inferred from the semantic similarity between (e_i, e_k) and the causal relation between (e_k, e_j), then w_ij = sim(e_i, e_k), where sim(e_i, e_k) is the semantic-similarity score between e_i and e_k, calculated by the path-similarity measure in WordNet.

Formally, given a cause sentence X = {x_1, x_2, ..., x_m} and a causal subgraph CG = {e_1, e_2, ..., e_{N_CG}}, which consists of a set of events e_j = (s_j, v_j, o_j, m_j) (j = 1, ..., N_CG) as nodes, our model first predicts an effect event e_Y from CG according to X, and then rewrites e_Y into an effect sentence Y = {y_1, y_2, ..., y_n}.
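The abstraction-based inference rule above can be sketched with toy stand-ins for the VerbNet/WordNet lookups; the `GENERALIZE` table and the fixed similarity value are illustrative assumptions, not real resource entries:

```python
# Toy sketch of event abstraction and the similarity-based edge
# inference rule. Real lookups would go through VerbNet classes and
# WordNet hypernyms; the tables below are illustrative stand-ins.

GENERALIZE = {                        # component -> abstract category
    "quarrel": "argue-37.6", "argue": "argue-37.6",    # VerbNet-style class
    "meeting": "gathering", "conference": "gathering"  # WordNet-style hypernym
}

def abstraction(event):
    """Map each component of an (s, v, o, m) tuple to its category."""
    return tuple(GENERALIZE.get(c, c) for c in event)

def infer_weight(e_i, e_k, sim):
    """If e_i has the same abstraction as e_k and (e_k -> e_j) is an
    observed edge, the inferred edge (e_i -> e_j) gets weight
    w_ij = sim(e_i, e_k); observed edges keep weight 1.0."""
    return sim(e_i, e_k) if abstraction(e_i) == abstraction(e_k) else 0.0

e1 = ("they", "quarrel", "", "")
e2 = ("they", "argue", "", "")
sim = lambda a, b: 0.8   # stand-in for the WordNet path-similarity measure
print(infer_weight(e1, e2, sim))   # 0.8
```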
Figure 1 gives an overview of the proposed EGCER, which consists of two modules: 1) an Effect Event Predictor, and 2) an Effect Event Rewriter.
Effect Event Predictor: Given the cause sentence X, a bidirectional GRU reads the sequence X from both directions and computes hidden states {h_{x_1}, ..., h_{x_m}}. We then eventify the cause event from X and match its abstraction against the event causality network. Once the abstraction is matched, an L-hop causal subgraph CG is preserved. The neighborhood information in CG represents causality tendencies, which are useful for reasoning about the most plausible effect event. We use a simple graph neural network (GNN) (Kipf and Welling, 2016; Veličković et al., 2017) to capture the neighborhood information. Specifically, the l-th layer vectors of e_i and its neighbors are pooled to obtain the vector of e_i at the (l+1)-th layer with an activation function σ (ReLU by default):

e_i^{l+1} = σ(W^l · pool({e_i^l} ∪ {w_ij · e_j^l | e_j ∈ N(e_i)})),

where W^l is a parameter matrix, w_ij is the weight of the edge (e_i, e_j), e_i^l is the vector of e_i at the l-th layer, and e_i^0 is the concatenated word embedding of all components of e_i.
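A single propagation layer of this kind can be sketched in numpy. Mean pooling and the toy dimensions are assumptions for illustration, not the paper's exact configuration:

```python
# Minimal numpy sketch of one weighted-pooling GNN layer over the
# causal subgraph: pool each event with its neighbors (weighted by the
# edge weights), apply a learned transform, then ReLU.
import numpy as np

def gnn_layer(E, W_edge, W_param):
    """E: (N, d) event vectors at layer l; W_edge: (N, N) edge weights
    including self-loops; W_param: (d, d) learned matrix. Returns the
    layer l+1 vectors after ReLU."""
    pooled = W_edge @ E / W_edge.sum(axis=1, keepdims=True)  # weighted mean pooling
    return np.maximum(0.0, pooled @ W_param)                 # sigma = ReLU

rng = np.random.default_rng(0)
E0 = rng.normal(size=(4, 8))            # 4 events, e_i^0 from word embeddings
A = np.eye(4) + 0.5 * (np.ones((4, 4)) - np.eye(4))  # toy edge weights w_ij
W = rng.normal(size=(8, 8))
E1 = gnn_layer(E0, A, W)
print(E1.shape)   # (4, 8)
```

Stacking two such layers corresponds to the 2-layer GNN used in the experiments, letting each event aggregate information from its 2-hop neighborhood.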
The final hidden vectors e_i^L (i = 1, ..., N_CG) of the events are used to select the guided effect event via e_Y = argmax_i cs_i, where cs_i = (e_i^L)^T · h_X is the causal score between each candidate event e_i and X, and h_X = (1/m) Σ_{k=1}^{m} h_{x_k} is the mean-pooling representation of X.
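The selection step can be sketched as follows; the toy vectors and the `select_effect` helper are illustrative:

```python
# Sketch of effect-event selection: mean-pool the GRU hidden states,
# score each candidate by an inner product, take the argmax.
import numpy as np

def select_effect(H, E_L):
    """H: (m, d) encoder hidden states h_{x_k}; E_L: (N, d) final event
    vectors e_i^L. Returns (index of e_Y, causal scores cs)."""
    h_X = H.mean(axis=0)        # h_X = (1/m) sum_k h_{x_k}
    cs = E_L @ h_X              # cs_i = (e_i^L)^T h_X
    return int(np.argmax(cs)), cs

H = np.array([[1.0, 0.0], [0.0, 1.0]])              # toy hidden states
E_L = np.array([[1.0, 1.0], [-1.0, 0.0], [0.2, 0.2]])
idx, cs = select_effect(H, E_L)
print(idx)   # 0 -- the first candidate has the highest causal score
```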
Effect Event Rewriter: The predicted e_Y contains only skeleton information. We retain all tokens of e_Y when generating the effect sentence, to prevent the causal information carried by e_Y from degrading to the word level. Inspired by (Mou et al., 2016; Martin et al., 2018), we rewrite e_Y = (s, v, o, m) into an effect sentence conforming to the format [_ s] [_ v] [_ o] [_ m], where the blanks indicate places where words should be added to make the sentence richer in content. We use a decoder with an attention mechanism to generate words in each blank until the "<eos>" token is produced.
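The template mechanics can be illustrated with a minimal sketch; the `fills` strings stand in for what the decoder would actually generate per blank:

```python
# Sketch of the rewriting template: every event token is kept, and a
# blank before each token is filled by the decoder, so the causal
# information cannot degrade to single words.

def build_template(event):
    s, v, o, m = event
    # "[_ s] [_ v] [_ o] [_ m]": a blank precedes each retained token
    return [tok for tok in (s, v, o, m) if tok]

def rewrite(event, fills):
    """fills: the words generated for each blank (assumption: one
    string per slot; the real decoder emits tokens until '<eos>')."""
    out = []
    for tok, fill in zip(build_template(event), fills):
        if fill:
            out.append(fill)
        out.append(tok)
    return " ".join(out)

print(rewrite(("he", "missed", "meeting", ""), ["", "", "the important"]))
# -> he missed the important meeting
```

Because the event tokens themselves are copied verbatim into the output, the skeleton of the predicted effect event is guaranteed to survive generation.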

Datasets
English Wikipedia (Enwiki): We extract cause-effect sentence pairs from the English Wikipedia corpus, resulting in about 80K pairs. We split all pairs into training/validation/test sets with the ratio 8:1:1, and tune parameters on the validation data. The training data is used to construct the event causality network. We retrieve 2-hop causal subgraphs for the input cause sentences, as this is the most commonly used setting. The percentage of test samples whose gold effect events exist in the retrieved causal subgraphs is 70.8%.
COPA Benchmark: The Choice of Plausible Alternatives (COPA) (Roemmele et al., 2011) dataset consists of 1,000 multiple-choice questions (500 for validation and 500 for testing) requiring causal reasoning to answer correctly. Each question is composed of a premise and two alternatives, and the task is to select the more plausible alternative as a cause (or an effect) of the premise. We use the more plausible alternative and its premise to collect cause-effect sentence pairs. The COPA causes are used to retrieve causal subgraphs from our event causality network, yielding 186 COPA pairs with corresponding causal subgraphs. The percentage of samples whose gold effect events exist in the causal subgraphs is 11.2%. Because there is no released training data for the COPA task, we train all models on Enwiki and evaluate them on COPA.
Metrics: For automatic evaluation, we use BLEU-4 (Papineni et al., 2002) and Distinct-n (Li et al., 2015) to evaluate the generated effect sentences. Abstraction-Matching (AbsMat) measures the percentage of generated effect sequences that have the same abstraction as the corresponding gold effect sequences.
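A minimal sketch of how AbsMat could be computed, with a toy abstraction function standing in for the VerbNet/WordNet generalization described earlier:

```python
# Sketch of the Abstraction-Matching (AbsMat) metric: the share of
# generated effects whose event abstraction equals the gold one.
# `abstract` is an illustrative stand-in for the real generalization.

def abs_mat(generated, gold, abstract):
    matches = sum(abstract(g) == abstract(r) for g, r in zip(generated, gold))
    return matches / len(gold)

abstract = lambda e: tuple(c[:4] for c in e)   # toy abstraction: 4-char prefix
gen = [("he", "missed", "meeting", ""), ("he", "was", "late", "")]
ref = [("he", "missed", "conference", ""), ("he", "was", "late", "")]
print(abs_mat(gen, ref, abstract))   # 0.5
```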
For the manual evaluation, we examine whether the generated sequence is a plausible effect of the input, which is denoted as plausibility (Plau). Details can be seen in Appendix B.
Results: The results are shown in Table 1, where EGCER achieves the best scores. BART performs better than GPT2 due to its encoder-decoder architecture. Based on the event skeletons provided by the effect event predictor, CopyNet and EGCER are aware of the topic that should be generated, and hence perform better than BART and GPT2. CopyNet performs worse than EGCER because it cannot cover all tokens of the retrieved event; as a result, the causal information in the generated sequence is incomplete. We also find that CopyNet tends to copy an event token repeatedly. CausalBERT performs worse than EGCER because it is based on word-level causal analysis, as can also be seen in Section 4.3. Given the effect event, EGCER sees a more complete scenario and hence generates a more reasonable effect sentence.
The results of the manual evaluation are also shown in Table 1. As for EGCER, we find that it may sometimes generate negation expressions or grammatical errors; as a result, the generated sequence is not a plausible effect even if the retrieved event is plausible. The proportion of generated sequences in this case is about 21%. We speculate that errors in data preprocessing and an insufficiently powerful generator are the possible reasons. In the future, we will further improve the generator in order to produce higher-quality effect sentences. It can also be found that EGCER performs far worse on COPA than on Enwiki, because a great gap exists between the two datasets. However, EGCER is still superior to every other model, which demonstrates that event-level causal reasoning contributes to effect sentence generation. Appendix C presents a case with the generations of different models. CausalBERT generates "missing bus" given "missing" as guidance. However, from the input we can see that this person may be in a car; therefore, the generated sequence is not an effect. That is, CausalBERT, which is based on word-level analysis, generates a causally inconsistent sequence. In contrast, our method successfully predicts the expected effect event "(he, missed, meeting)" and generates the correct effect sentence.

Visualization
We extract a part of CG according to the input cause, and visualize the causal scores cs using the event vectors on the first and second layers of the GNN, as shown in Figures 2a and 2b. In Figure 2a, "(was, late, work)" receives the highest score, followed by "(he, encountered, jam)" and "(was, late, meeting)" in one-hop reasoning, while "(leader, scolded, him)" receives the lowest score. Note that "(he, encountered, jam)" is actually not an effect event. In Figure 2b, however, "(he, missed, meeting)" receives the highest score, followed by "(was, late, work)", "(was, late, meeting)" and "(leader, scolded, him)" in two-hop reasoning. "(he, encountered, jam)" and "(rain, is, heavy)" receive lower scores, which makes sense because they are not effect events at all. This shows that the multi-layer GNN can capture multi-hop causal relationships well and is thus able to select plausible effect events.

Ablation Study
To understand the importance of the key components of our approach, we perform an ablation study by training multiple ablated versions of our model: one without edge weights in the retrieved causal subgraph, one without the second GNN layer, and one without the GNN. The results are provided in Table 2. As the GNN module is gradually ablated, the performance of the model gradually degrades. This demonstrates that all modules of our multi-layer GNN effectively contribute to effect sentence generation.

Conclusion and Future Work
We present an event-level causal reasoning based effect generation method that generates plausible effect sentences for input cause sentences. Experiments show that our method captures the causal semantics to be generated better than its competitors. In the future, we would like to develop more effective approaches to enhance effect event reasoning, and more powerful generators to produce effect sentences of higher quality.

Acknowledgements
The work described in this paper was supported by the Research Grants Council of Hong Kong (PolyU/5210919, PolyU/15207920, PolyU/15207821), the National Natural Science Foundation of China (61672445, 62076212, 62076072), and PolyU internal grants (ZVQ0). We are grateful to the anonymous reviewers for their valuable comments.

A Experiment Setting
We concatenate cause-effect sentence pairs and fine-tune GPT2 (117M) in a language-model setting. BART is fine-tuned in the encoder-decoder setting. Both GPT2 and BART are implemented with the transformers library (https://huggingface.co/). CopyNet employs the copy mechanism, which either copies tokens from the retrieved event or generates words from the vocabulary. CausalBERT employs lexically-constrained beam search to generate possible effects for provided word guidance. ConceptNet (Speer and Havasi, 2012) is used to retrieve causally relevant constraints for CausalBERT.
Our effect event predictor consists of a 2-layer bidirectional GRU for encoding input sequences and a 2-layer GNN for updating event representations. Our event rewriter is a GRU decoder. The predictor and the rewriter do not share parameters, and their hidden sizes are set to 512. The word embedding size is 300. We use the Adam optimizer with a mini-batch size of 96. The learning rate is 0.001.
We use the gold effect event to supervise our event predictor; its objective is to maximize the estimated probability of the gold effect event:

J_1 = log P(e_Y | X, CG). (2)

For our event rewriter, the objective is to maximize the estimated probability of the gold effect sequence:

J_2 = Σ_{t=1}^{n} log P(y_t | y_{<t}, e_Y, X). (3)

The final objective is the combination of the above two:

J = J_1 + J_2. (4)

B Details for Manual Evaluation
We randomly select 100 samples from the Wikipedia test set and from COPA, respectively, and distribute them to two graduate students from the NLP field. Each student is asked to give a score from {0, 0.5, 1} for each (input, generation) pair, following these guidelines: assign 0 if the generation can never be considered a possible effect of the input; assign 0.5 if the generation is a possible effect of the input but has certain grammatical errors; and assign 1 if the generation is a possible effect of the input and has no grammatical errors. We average scores over the two annotators. The Cohen's kappa scores on Enwiki and COPA are 0.65 and 0.63, respectively.
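The agreement statistic can be sketched as follows; the annotator scores below are illustrative, not the actual annotations:

```python
# Sketch of the agreement computation: average the two annotators'
# {0, 0.5, 1} plausibility scores and report Cohen's kappa.
from collections import Counter

def cohens_kappa(a, b):
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n       # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in ca) / n ** 2     # chance agreement
    return (p_o - p_e) / (1 - p_e)

ann1 = [1, 1, 0.5, 0, 1, 0]
ann2 = [1, 0.5, 0.5, 0, 1, 1]
avg = [(x + y) / 2 for x, y in zip(ann1, ann2)]       # reported Plau scores
print(round(cohens_kappa(ann1, ann2), 2))             # 0.48
```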

C Generation Example
Input cause: he encountered a heavy traffic jam.
GPT2: the lighthouse was closed over three weeks.
BART: he was delayed for over an hour.
CopyNet: he missed missed the meeting.
CausalBERT: causing him to miss bus.
EGCER: he missed the important meeting.

Given the input cause, CausalBERT generates an unexpected sequence by using "missing" as a constraint, which demonstrates that word-level causal analysis is not always self-contained. CopyNet repeatedly generates the "missed" token. EGCER rewrites the predicted effect event "(he, missed, meeting)" into a reasonable effect sentence.