Learning Event Graph Knowledge for Abductive Reasoning

Abductive reasoning aims at inferring the most plausible explanation for observed events, and plays a critical role in various NLP applications such as reading comprehension and question answering. To facilitate this task, a narrative-text-based abductive reasoning task, αNLI, has been proposed, together with explorations of building reasoning frameworks upon pretrained language models. However, abundant event commonsense knowledge is not well exploited for this task. To fill this gap, we propose a variational-autoencoder-based model, ege-RoBERTa, which employs a latent variable to capture the necessary commonsense knowledge from an event graph for guiding abductive reasoning. Experimental results show that by learning external event graph knowledge, our approach outperforms the baseline methods on the αNLI task.


Introduction
Abductive reasoning aims at seeking the best explanation for incomplete observations. For example, given the observations Forgot to close window when leaving home and The room was in a mess, human beings can generate a reasonable hypothesis for explaining the observations, such as A thief entered the room, based on the commonsense knowledge in their mind. However, due to the lack of commonsense knowledge and effective reasoning mechanisms, this is still a challenging problem for today's cognitive intelligent systems (Charniak and Shimony, 1990; Oh et al., 2013; Kruengkrai et al., 2017).
Most previous works focus on conducting abductive reasoning based on formal logic (Eshghi et al., 1988; Levesque, 1989; Ng et al., 1990; Paul, 1993). However, the rigidity of formal logic limits the application of abductive reasoning in NLP tasks, as it is hard to express the complex semantics of natural language in a formal logic system. To facilitate this, a natural language based abductive reasoning task, αNLI, was proposed. As shown in Figure 1 (a), given two observed events O_1 and O_2, the αNLI task requires the prediction model to choose the more reasonable explanation from two candidate hypothesis events H_1 and H_2. Both observed events and hypothesis events are daily-life events described in natural language. Together with the αNLI task, conducting such reasoning with pretrained language models such as BERT (Devlin et al., 2019) and RoBERTa was also explored.

Figure 1: (a) An example of abductive reasoning. (b) Additional commonsense knowledge (such as events I_1 and I_2) is necessary for inferring the correct hypothesis. Such knowledge could be described using an event graph. (c) A latent variable z is employed to learn the commonsense knowledge from the event graph.
However, although pretrained language models can capture rich linguistic knowledge that benefits the understanding of event semantics, additional commonsense knowledge is still necessary for abductive reasoning. For example, as illustrated in Figure 1 (b), given observations O_1 and O_2, to choose the more likely explanation H_1: A thief entered the room and exclude H_2: A breeze blew in the window, the prediction model should have the commonsense knowledge that it is hardly possible for a breeze to mess up the room, whereas a thief may enter the room from the open window (I_1), then rummage through the room (I_2) and leave it in a mess. These intermediary events (I_1 and I_2) can serve as necessary commonsense knowledge for understanding the relationship between observed events and hypothesis events.
We notice that the observed events, hypothesis events, intermediary events and their relationships could be described using an event graph, which can be constructed based on an auxiliary dataset. The challenge is how to learn such commonsense knowledge from the constructed event graph.
To address this issue, we propose an Event Graph Enhanced RoBERTa (ege-RoBERTa) model, together with a two-stage training procedure. Specifically, as shown in Figure 1 (c), on the basis of the RoBERTa framework, we additionally introduce a latent variable z to model the information about the intermediary events. In the pretraining stage, ege-RoBERTa is trained on an event-graph-based pseudo instance set to capture the commonsense knowledge using the latent variable z. In the finetuning stage, the model adapts the commonsense knowledge captured by z to conduct abductive reasoning.
Experimental results show that ege-RoBERTa could effectively learn the commonsense knowledge from a well-designed event graph, and improve the model performance on the αNLI task compared to the baseline methods. The code is released at https://github.com/sjcfr/ege-RoBERTa.

Problem Formalization
As shown in Figure 1 (a), αNLI can be defined as a multiple-choice task. Given two observed events O_1 and O_2 that happened in a sequential order, one needs to choose the more reasonable hypothesis event from two candidates H_1 and H_2 for explaining the observations. Therefore, we formalize the abductive reasoning task as estimating a conditional distribution p(Y|O_1, H_i, O_2), where Y is a relatedness score measuring how well hypothesis H_i explains the observations. Taking the event order into consideration, we further characterize the abductive reasoning task as p(Y|X), where X = (O_1, H_i, O_2) denotes the event sequence.
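To make this formalization concrete, here is a minimal Python sketch of how the multiple-choice decision reduces to comparing the conditional scores of the two candidate event sequences; the `score` callable is a hypothetical stand-in for the model's estimate of p(Y|X), not part of the paper:

```python
from typing import Callable, Tuple

def choose_hypothesis(
    o1: str, o2: str, h1: str, h2: str,
    score: Callable[[Tuple[str, str, str]], float],
) -> str:
    """Pick the hypothesis whose ordered event sequence X = (O1, Hi, O2)
    receives the higher relatedness score Y."""
    x1 = (o1, h1, o2)   # candidate sequence with hypothesis H1
    x2 = (o1, h2, o2)   # candidate sequence with hypothesis H2
    return h1 if score(x1) >= score(x2) else h2
```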

Event Graph
Formally, an event graph can be denoted as G = {V, R}, where V is the node set and R is the edge set. Each node V_i ∈ V corresponds to an event, while R_ij ∈ R is a directed edge V_i → V_j with a weight W_ij, which denotes the probability that V_j is the subsequent event of V_i.
Given the observed events and a certain hypothesis event, from the event graph we can acquire additional commonsense knowledge about: (1) the intermediary events, and (2) the relationships between events. As Figure 1 (b) shows, the observed events, the hypothesis event and the intermediary events compose another event sequence (O_1, I_1, H_i, I_2, O_2). For clarity, we define this event sequence as the posterior event sequence X′, where X′ = (O_1, I_1, H_i, I_2, O_2). The relationship between events within X′ can be described by an adjacency matrix A ∈ R^{5×5}, with each element initialized using the corresponding edge weight in the event graph. The matrix A thus describes the adjacency relationship between any two events in X′.
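As an illustration of how A could be initialized from the event graph, here is a small sketch; the `edge_weight` lookup is a hypothetical interface to the graph's edge weights W_ij, not a function defined in the paper:

```python
import numpy as np

def build_adjacency(posterior_events, edge_weight):
    """Initialize A in R^{5x5} for X' = (O1, I1, Hi, I2, O2).

    `edge_weight(vi, vj)` is a hypothetical lookup returning the event-graph
    weight W_ij, i.e. the estimated probability that vj follows vi
    (0.0 if no edge exists).
    """
    n = len(posterior_events)              # 5 for a posterior event sequence
    A = np.zeros((n, n))
    for i, vi in enumerate(posterior_events):
        for j, vj in enumerate(posterior_events):
            A[i, j] = edge_weight(vi, vj)
    return A
```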

Ege-RoBERTa as a Conditional Variational Autoencoder Based Reasoning Framework

In this paper, rather than directly predicting the relatedness score Y based on the event sequence X, we propose to predict Y based on both X and additional commonsense knowledge (i.e., the posterior event sequence X′ and the adjacency matrix A). To this end, we introduce a latent variable z to learn such knowledge from an event graph through a two-stage training procedure. To effectively capture the event graph knowledge through z and conduct the abductive reasoning task based on z, we frame the ege-RoBERTa model as a conditional variational autoencoder (CVAE) (Sohn et al., 2015). Specifically, with regard to the latent variable z, ege-RoBERTa characterizes the conditional distribution P(Y|X) using three neural networks: a prior network p_θ(z|X), a recognition network q_φ(z|X′, A) and a neural likelihood p_θ(Y|X, z), where θ and φ denote the parameters of the networks. Moreover, instead of directly maximizing P(Y|X), following CVAE (Sohn et al., 2015), ege-RoBERTa maximizes the evidence lower bound (ELBO) of P(Y|X):

$$\log P(Y|X) \geq \mathbb{E}_{q_{\phi}(z|X',A)}\left[\log p_{\theta}(Y|X,z)\right] - \mathrm{KL}\left(q_{\phi}(z|X',A)\,\|\,p_{\theta}(z|X)\right).$$

Note that, in the recognition network, the latent variable z is directly conditioned on X′ and A, where X′ = (O_1, I_1, H_i, I_2, O_2) is the posterior event sequence and A is the adjacency matrix describing the relationships between events within X′. This enables z to capture the event graph knowledge from X′ and A. By minimizing the KL term of the ELBO, we teach the prior network p_θ(z|X) to learn the event graph knowledge from the recognition network as much as possible. Then, in the neural likelihood p_θ(Y|X, z), the relatedness score Y can be predicted based on X and z, which captures the event graph knowledge.

Figure 2: Illustration of the pretraining, finetuning and prediction process of ege-RoBERTa. The grey color in a circle denotes the availability of the corresponding information. For example, in the pretraining stage conducted on the pseudo instance set, X, Y and the additional commonsense knowledge X′ and A are available, while in the finetuning stage on αNLI, X′ and A are absent.
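For concreteness, here is a sketch of how this ELBO could be computed, assuming (as for the recognition and prior networks below) that both distributions are Gaussians with identity covariance, so the KL term reduces to half the squared distance between the two means; the function names are illustrative:

```python
import torch

def kl_identity_cov(mu_q: torch.Tensor, mu_p: torch.Tensor) -> torch.Tensor:
    """KL( N(mu_q, I) || N(mu_p, I) ) = 0.5 * ||mu_q - mu_p||^2."""
    return 0.5 * ((mu_q - mu_p) ** 2).sum(dim=-1)

def elbo(log_likelihood: torch.Tensor, mu_q: torch.Tensor, mu_p: torch.Tensor) -> torch.Tensor:
    """ELBO = E_q[log p(Y|X,z)] - KL(q(z|X',A) || p(z|X)).

    `log_likelihood` is log p(Y|X,z) evaluated at a single z sampled from the
    recognition network; `mu_q` / `mu_p` are the recognition / prior means.
    """
    return log_likelihood - kl_identity_cov(mu_q, mu_p)
```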
However, the event graph knowledge is absent from the αNLI dataset. To learn such knowledge, we design the following two-stage training procedure.

Pre-training Stage: Learning Event Graph Knowledge from a Pseudo Instance Set
In this stage, ege-RoBERTa is pretrained on a prebuilt event-graph-based pseudo instance set, which contains rich information about the intermediary events and the event relationships. As shown in Figure 2 (a), the latent variable z is directly conditioned on X′ and A. Therefore, z can be employed to learn the event graph knowledge.
Finetuning Stage: Adapting Event Graph Knowledge to the Abductive Reasoning Task
As Figure 2 (b) shows, at the finetuning stage, ege-RoBERTa is trained on the αNLI dataset without the additional information X′ and A. In this stage, the model learns to adapt the captured event graph knowledge to the abductive reasoning task. Then, as Figure 2 (c) shows, after the two-stage training process, ege-RoBERTa predicts the relatedness score Y based on the latent variable z.

Architecture of ege-RoBERTa
We now introduce the specific implementation of ege-RoBERTa. As illustrated in Figure 3, ege-RoBERTa introduces four modules in addition to the RoBERTa framework: (1) an aggregator providing representations for the events within X and X′; (2) an attention-based prior network for modeling p_θ(z|X); (3) a graph-neural-network-based recognition network for modeling q_φ(z|X′, A); (4) a merger that merges the latent variable z into the RoBERTa framework for the downstream abductive reasoning task.

Event Representation Aggregator
The event representation aggregator provides distributed representations for the events in both the event sequence X and the posterior event sequence X′. To this end, the aggregator employs an attention mechanism to aggregate token representations of the event sequence from the hidden states of RoBERTa.
Given an event sequence X composed of the tokens [CLS], (x_1^1, ..., x_{l_1}^1), ..., (x_1^3, ..., x_{l_3}^3), where l_j is the length of the j-th event, we take the hidden states H^{(M)} of the M-th transformer layer of RoBERTa as token representations. For the j-th event, its distributed representation is initialized as the average of its token representations:

$$q_j = \frac{1}{l_j}\sum_{k=1}^{l_j} h_k^j,$$

where h_k^j denotes the hidden state of the k-th token of the j-th event. A multi-head attention mechanism (MultiAttn) (Vaswani et al., 2017) is then employed to softly select information from H^{(M)} and obtain the representation of each event:

$$e_j = \mathrm{MultiAttn}\big(q_j, H^{(M)}\big).$$

For brevity, we denote the vector representations of all events in X using a matrix E_X, where E_X = {e_1, e_2, e_3} ∈ R^{3×d}. Note that, through the embedding layer of RoBERTa, position information has been injected into the token representations. Therefore, E_X derived from the token representations carries event order information. In addition, since E_X is obtained from the hidden states of RoBERTa, the rich linguistic knowledge within RoBERTa can be utilized to enhance the comprehension of event semantics. In the same way, the representations of the events within X′ can be calculated, which we denote as E_{X′}.
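A PyTorch-style sketch of the aggregator described above (the module name and interface are assumptions, not the released implementation; layer sizes follow the base configuration):

```python
import torch
import torch.nn as nn

class EventAggregator(nn.Module):
    """Average an event's token states as the query, then attend over all
    hidden states H^(M) to build the event representation e_j."""

    def __init__(self, d: int = 768, heads: int = 12):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, hidden: torch.Tensor, spans):
        # hidden: (1, seq_len, d) hidden states of the M-th transformer layer
        # spans:  list of (start, end) token indices, one pair per event
        queries = torch.stack(
            [hidden[0, s:e].mean(dim=0) for s, e in spans]
        ).unsqueeze(0)                       # (1, num_events, d) = q_j vectors
        events, _ = self.attn(queries, hidden, hidden)
        return events.squeeze(0)             # (num_events, d) = rows of E_X
```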

Recognition Network
The recognition network models q_φ(z|X′, A) based on E_{X′} and A, where E_{X′} contains the representations of the events within X′. Following the traditional VAE, q_φ(z|X′, A) is assumed to be a multivariate Gaussian distribution:

$$q_{\phi}(z|X', A) = \mathcal{N}\big(z;\, \mu(X', A),\, D\big),$$

where D denotes the identity matrix.
To obtain µ(X′, A), we first combine E_{X′} and the adjacency matrix A using a GNN (Kipf et al., 2016):

$$E^{(U)} = \sigma\big(A\, E_{X'}\, W^{(u)}\big),$$

where σ(·) is the sigmoid function, W^{(u)} ∈ R^{d×d} is a weight matrix, and E^{(U)} are the relational-information-updated event representations. Then a multi-head self-attention operation is performed over E^{(U)} to promote the fusion of event semantic information and relational information. Finally, to estimate µ(X′, A), we aggregate the information within E^{(U)} using a readout function g(·): µ(X′, A) = g(E^{(U)}).
Following Zhong et al. (2019), we set g(·) to be a mean-pooling operation.
Hence, by estimating µ based on the relational-information-updated event representations E^{(U)}, the event graph knowledge about X′ and A is incorporated into the latent variable z.
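A sketch of the recognition network under the same assumptions: a single GCN-style propagation over A, one self-attention layer, and a mean-pooling readout that yields the mean of q_φ(z|X′, A); this is an illustrative implementation, not the authors' code:

```python
import torch
import torch.nn as nn

class RecognitionNetwork(nn.Module):
    """Sketch of q_phi(z|X', A): GCN-style propagation over A, then
    self-attention, then mean-pooling to the Gaussian mean mu(X', A)."""

    def __init__(self, d: int = 768, heads: int = 12):
        super().__init__()
        self.w_u = nn.Linear(d, d, bias=False)           # weight matrix W^(u)
        self.self_attn = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, e_post: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # e_post: (5, d) representations of events in X'; adj: (5, 5) matrix A
        e_rel = torch.sigmoid(adj @ self.w_u(e_post))    # relation-updated E^(U)
        e_rel = e_rel.unsqueeze(0)
        e_fused, _ = self.self_attn(e_rel, e_rel, e_rel) # fuse semantics + relations
        mu = e_fused.squeeze(0).mean(dim=0)              # readout g(.): mean-pooling
        return mu                                        # mean of q_phi(z|X', A)
```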

Prior Network
The prior network models p_θ(z|X) based on E_X, where E_X is the representation matrix of the events in X. As with the recognition network, p_θ(z|X) also follows a multivariate Gaussian distribution, though with different parameters:

$$p_{\theta}(z|X) = \mathcal{N}\big(z;\, \mu(X),\, D\big),$$

where D denotes the identity matrix.
To obtain µ(X), the prior network, unlike the recognition network, starts by updating E_X with a multi-head self-attention operation:

$$E^{(U')} = \mathrm{MultiAttn}\big(E_X, E_X\big).$$

Then an additional multi-head self-attention operation is performed to obtain deeper representations:

$$E^{(U)} = \mathrm{MultiAttn}\big(E^{(U')}, E^{(U')}\big).$$

Finally, µ(X) is estimated by aggregating information from E^{(U)}: µ(X) = g(E^{(U)}), where g(·) is a mean-pooling operation.
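Analogously, a sketch of the prior network, which omits the GNN step and relies on two stacked self-attention updates before mean-pooling (again an illustrative implementation):

```python
import torch
import torch.nn as nn

class PriorNetwork(nn.Module):
    """Sketch of p_theta(z|X): two stacked self-attention updates over the
    event representations E_X, followed by mean-pooling to mu(X)."""

    def __init__(self, d: int = 768, heads: int = 12):
        super().__init__()
        self.attn1 = nn.MultiheadAttention(d, heads, batch_first=True)
        self.attn2 = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, e_x: torch.Tensor) -> torch.Tensor:
        # e_x: (3, d) representations of the events (O1, Hi, O2)
        h = e_x.unsqueeze(0)
        h, _ = self.attn1(h, h, h)       # first self-attention update
        h, _ = self.attn2(h, h, h)       # deeper representations E^(U)
        return h.squeeze(0).mean(dim=0)  # mu(X) via mean-pooling
```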

Merger
The merger module merges the latent variable z, as well as the updated event representations, into the N-th transformer layer of the RoBERTa framework for predicting the relatedness score. To this end, we employ a multi-head attention mechanism to softly select relevant information from z and E^{(U)}, and then update the hidden states of the N-th transformer layer of RoBERTa.
Specifically, in the pretraining stage:

$$H^{(N)*} = \mathrm{MultiAttn}\big(H^{(N)}, [z;\, E^{(U)}]\big),$$

where H^{(N)} denotes the hidden states of the N-th transformer layer of RoBERTa, [·;·] denotes concatenation, and H^{(N)*} denotes the event-graph-information-updated hidden states. In the finetuning and prediction stages, z is replaced by the prior mean µ(X):

$$H^{(N)*} = \mathrm{MultiAttn}\big(H^{(N)}, [\mu(X);\, E^{(U)}]\big).$$

Note that, given X, p_θ(z|X) achieves its maximum when z = µ(X). Hence, making predictions based on µ(X) can be regarded as finding the best explanation under the most likely commonsense situation. Through integrating the latent variable z, H^{(N)*} contains the event graph knowledge. By taking H^{(N)*} as the input of the subsequent transformer layers of RoBERTa (from the (N+1)-th layer onward) for predicting the relatedness score, the abductive reasoning task is conducted with the additional event graph knowledge.
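A sketch of the merger, assuming the hidden states of the N-th layer attend over the concatenation of z and the updated event representations (the exact fusion used in the released model may differ):

```python
import torch
import torch.nn as nn

class Merger(nn.Module):
    """Sketch of the merger: the N-th layer hidden states attend over the
    latent variable z and the updated event representations E^(U)."""

    def __init__(self, d: int = 768, heads: int = 12):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, hidden_n: torch.Tensor, z: torch.Tensor,
                e_updated: torch.Tensor) -> torch.Tensor:
        # hidden_n: (1, seq_len, d) states H^(N); z: (d,); e_updated: (k, d)
        memory = torch.cat([z.unsqueeze(0), e_updated], dim=0).unsqueeze(0)
        updated, _ = self.attn(hidden_n, memory, memory)   # H^(N)*
        return updated   # fed into the remaining transformer layers
```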

Optimizing
The αNLI task requires the model to choose the more likely hypothesis event from two candidates. However, in the pre-training stage, negative examples are absent from the pseudo instances. To address this issue, following prior work, in the pre-training stage ege-RoBERTa is trained to predict the masked tokens in the event sequence X rather than the relatedness score. In addition, in order to balance the masked token prediction loss with the KL term, we introduce an additional hyperparameter λ. Hence, the objective function in the pretraining stage is defined as follows:

$$\mathcal{L}_{\text{pretrain}} = \log L_{\text{MLM}}(X, z; \theta) - \lambda\, \mathrm{KL}\big(q_{\phi}(z|X', A)\,\|\,p_{\theta}(z|X)\big),$$

where log L_MLM(X, z; θ) is the masked token prediction term. Intuitively, by minimizing the KL term, we aim to transmit the event graph knowledge from the recognition network to the prior network.
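A sketch of the resulting pretraining loss, written as a quantity to minimize (i.e., the negative of the objective above), again assuming identity-covariance Gaussians so the KL term is closed-form; all names are illustrative:

```python
import torch

def pretraining_loss(mlm_loss: torch.Tensor,
                     mu_q: torch.Tensor, mu_p: torch.Tensor,
                     lam: float = 0.01) -> torch.Tensor:
    """Masked-token prediction loss plus a lambda-weighted KL term that pulls
    the prior mean mu_p toward the recognition mean mu_q."""
    kl = 0.5 * ((mu_q - mu_p) ** 2).sum(dim=-1)   # KL for identity covariances
    return mlm_loss + lam * kl.mean()
```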
In the finetuning stage, ege-RoBERTa is trained to adapt the learned event graph knowledge to the abductive reasoning task. Without the recognition network, we formulate the objective function as the likelihood term alone, evaluated at the prior mean:

$$\mathcal{L}_{\text{finetune}} = \log p_{\theta}\big(Y \mid X, z = \mu(X)\big).$$

Training Details
We implement two different sizes of the ege-RoBERTa model (i.e., ege-RoBERTa-base and ege-RoBERTa-large) based on the RoBERTa-base and RoBERTa-large frameworks, respectively. For the ege-RoBERTa-base model, in the aggregator, the prior network, the recognition network and the merger, the dimension of the attention mechanism d is set to 768, and all multi-head attention layers contain 12 heads. For the ege-RoBERTa-large model, d equals 1024 and all multi-head attention layers contain 16 heads. In the ege-RoBERTa-base model, token representations are aggregated from the 7th transformer layer of RoBERTa, and the latent variable is merged into the 10th transformer layer of RoBERTa. For the ege-RoBERTa-large model, the aggregator and merger layers are set to the 14th and 20th layers, respectively. The balance coefficient λ equals 0.01. More details are provided in the Appendix.
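The hyperparameters above can be summarized as follows; the dictionary layout is purely illustrative:

```python
# Hyperparameters as stated in the text (the dictionary itself is illustrative).
EGE_ROBERTA_CONFIGS = {
    "base": {
        "hidden_dim": 768,        # attention dimension d
        "num_heads": 12,          # heads in all multi-head attention layers
        "aggregator_layer": 7,    # layer whose hidden states feed the aggregator
        "merger_layer": 10,       # layer into which z is merged
        "lambda_balance": 0.01,   # balance coefficient for the KL term
    },
    "large": {
        "hidden_dim": 1024,
        "num_heads": 16,
        "aggregator_layer": 14,
        "merger_layer": 20,
        "lambda_balance": 0.01,
    },
}
```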

αNLI Dataset
The αNLI dataset consists of (O_1, O_2, H_1, H_2) quadruples, split into training, development and test sets. The observation events are collected from a short story corpus, ROCstory, while all hypothesis events are independently generated through crowdsourcing.

Construction of Event Graph
The event graph serves as an external knowledge base that provides information about the relationships between observation events and intermediary events. To this end, we build the event graph based on an auxiliary dataset, which is composed of two short story corpora independent of αNLI: VIST (Huang et al., 2016) and TimeTravel (Qin et al., 2019). Both VIST and TimeTravel consist of five-sentence short stories. In total, there are 121,326 stories in the auxiliary dataset.
To construct the event graph, we define each sentence in the auxiliary dataset as a node in the event graph. To get the edge weight W_ij between two nodes V_i and V_j (i.e., the probability that V_j is the subsequent event of V_i), we finetune a RoBERTa-large model on a next sentence prediction task. Specifically, we define adjacent sentence pairs in the story text (for example, the [1st, 2nd] or [4th, 5th] sentences of a story) as positive instances, and define non-adjacent sentence pairs or sentence pairs in reverse order (such as the [1st, 3rd] or [5th, 4th] sentences of a story) as negative instances. We then sample 300,000 positive and 300,000 negative instances from the auxiliary dataset. Given an event pair (V_i, V_j), the finetuned RoBERTa-large model is then able to predict the probability that V_j is the subsequent event of V_i.
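A sketch of how positive and negative next-sentence-prediction instances could be sampled from one five-sentence story; the per-story sampling ratio and the function name are assumptions:

```python
import random

def sample_pairs(story, n_neg: int = 2):
    """Build next-sentence-prediction instances from one five-sentence story.

    Adjacent pairs in story order are positives (label 1); non-adjacent or
    reversed pairs are negatives (label 0).
    """
    positives = [(story[i], story[i + 1], 1) for i in range(len(story) - 1)]
    negatives = []
    while len(negatives) < n_neg:
        i, j = random.sample(range(len(story)), 2)
        if j != i + 1:                       # not an adjacent, in-order pair
            negatives.append((story[i], story[j], 0))
    return positives + negatives
```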
Event Graph Based Pseudo Instance Set for Pretraining ege-RoBERTa

To effectively utilize the event graph knowledge, we induce a set of pseudo instances for pretraining the ege-RoBERTa model. Specifically, given a five-sentence story within the auxiliary dataset, as Table 1 shows, we define the 1st and 5th sentences of the story as the two observed events, the 3rd sentence as the hypothesis event, and the 2nd and 4th sentences as intermediary events, respectively. In this way, the posterior event sequence X′ and the event sequence X of a pseudo instance are obtained. In addition, given X′, we initialize the elements of the adjacency matrix A using the edge weights of the event graph, and scale A so that each row sums to 1. After these operations, each pseudo instance is composed of an event sequence X, a posterior event sequence X′ which contains intermediary event information, and an adjacency matrix A which describes the relationships between events within X′.
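A sketch of turning one five-sentence story into a pseudo instance, reusing the hypothetical `edge_weight` lookup introduced earlier and row-normalizing A:

```python
import numpy as np

def build_pseudo_instance(story, edge_weight):
    """Turn one five-sentence story into a pretraining pseudo instance.

    `edge_weight(vi, vj)` is the hypothetical event-graph lookup used above;
    rows of A are scaled to sum to 1.
    """
    o1, i1, h, i2, o2 = story
    x = (o1, h, o2)                       # event sequence X
    x_post = (o1, i1, h, i2, o2)          # posterior event sequence X'
    A = np.array([[edge_weight(a, b) for b in x_post] for a in x_post])
    A = A / np.clip(A.sum(axis=1, keepdims=True), 1e-8, None)  # row-normalize
    return x, x_post, A
```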

Baselines
We compare ege-RoBERTa with the following baselines:
• SVM uses features about length, overlap and sentiment to predict the more likely hypothesis event.
• ege-RoBERTa_u refers to the ege-RoBERTa model without the event-graph-based pretraining stage.
• ege-RoBERTa_λ=0 refers to setting the balance coefficient λ to 0 in the pretraining stage.
Note that all pretrained-language-model-based baselines (i.e., GPT, BERT and RoBERTa) are finetuned on the αNLI dataset to adapt to the abductive reasoning task.
In addition, we also list two concurrent works: (i) L2R (Zhu et al., 2020) learns to rank the candidate hypotheses with a novel scoring function; (ii) RoBERTa-GPT-MHKA (Paul et al., 2020) enhances a pretrained language model with social and causal commonsense knowledge for the αNLI task.

Quantitative Analysis
We list the prediction accuracy (%) in Table 2, and observe that: (1) Compared with SVM and Infersent, the pretrained-language-model-based methods GPT, BERT, RoBERTa and ege-RoBERTa show significantly better performance on the abductive reasoning task. This is because, through pre-training on large-scale corpora, these models capture rich linguistic knowledge that helps in understanding the semantics of events.
(2) The comparison between ege-RoBERTa-large_u and ege-RoBERTa-large shows that the pretraining process increases the accuracy of abductive reasoning. In addition, the comparison between ege-RoBERTa-large_λ=0 and ege-RoBERTa-large indicates that in the pre-training process, ege-RoBERTa captures the event graph knowledge through the latent variable to enhance abductive reasoning. Furthermore, the relatively close performance of ege-RoBERTa-large_u and ege-RoBERTa-large_λ=0 suggests that the main improvement in performance is brought by the event graph knowledge.
(3) Compared to RoBERTa, ege-RoBERTa achieves higher prediction accuracy for both the base-sized and large-sized models. This result confirms our motivation that learning event graph knowledge is helpful for the abductive reasoning task.
(4) Human performance on the test set of αNLI is 91.4%, while the RoBERTa-large model achieves an accuracy of 83.9%. Therefore, further improvements over RoBERTa-large are challenging. By learning the event graph knowledge, our proposed ege-RoBERTa nevertheless further improves the accuracy.
(5) Our approach achieves performance comparable to the SOTA concurrent work, which combines RoBERTa with GPT and incorporates social and causal commonsense knowledge into the model. Combining both methods could further increase model performance.

Ablation Study
All ablation studies are conducted on the development set of αNLI using the ege-RoBERTa-base model.

Influence of the Balance Coefficient
In the pretraining stage, the balance coefficient λ controls the trade-off between event graph knowledge learning and abductive reasoning. To investigate its specific influence, we compare the performance of ege-RoBERTa models pretrained with different λ. As shown in Figure 4, the prediction accuracy continues to increase as λ increases from 0 to 0.01, because adequate event graph knowledge offers guidance for the abductive reasoning task. However, when λ exceeds 0.05, the accuracy starts to decrease, as over-emphasizing event graph knowledge learning in turn undermines model performance.
Influence of the External Commonsense Knowledge

We study the specific effects of the event relational information and the intermediary event information by controlling the generation of pseudo instances. Specifically, we eliminate the influence of the adjacency matrix A by replacing A with a randomly initialized matrix Ã. Similarly, the influence of the intermediary events I_1 and I_2 is eliminated by substituting them with two randomly sampled events Ĩ_1 and Ĩ_2. As Table 3 shows, both the replacement of A and the replacement of {I_1, I_2} lead to an obvious decrease in model performance. This demonstrates that ege-RoBERTa can use both kinds of event graph knowledge to enhance the abductive reasoning task.

Sensitivity Analysis
To find out whether the improvement of ege-RoBERTa is brought by a particular dataset, and how model performance relates to the number of pseudo instances, we conduct the following experiments: (1) excluding a certain dataset when inducing pseudo instances; (2) pretraining the ege-RoBERTa-base model with different numbers of pseudo instances. The corresponding results on the dev set of αNLI are shown in Table 4. We find that excluding either dataset leads to a decrease in model performance. This suggests that the ege-RoBERTa model can capture relevant event graph knowledge from both datasets. Moreover, the prediction accuracy continues to increase with the number of pseudo instances used for pretraining, because the accumulation of commonsense knowledge is helpful for the abductive reasoning task. It also indicates that model performance could be further improved if the auxiliary dataset were enlarged further.

Table 5 provides an example of model prediction results. Given two observed events O_1 "hates Fall" and O_2 "didn't have to experience Fall in Guam", the hypothesis event H_1 "moved to Guam" is more likely to explain the two observations. However, H_1 implicitly relies on the precondition that Fall can be avoided in Guam. Correspondingly, the auxiliary dataset contains information supporting the hypothesis event H_1, namely that there is no Fall in Guam. In this case, ege-RoBERTa chooses the hypothesis event H_1, whereas RoBERTa chooses the wrong hypothesis event H_2. This indicates that ege-RoBERTa learns event graph knowledge in the pretraining process that improves reasoning performance.