Generating Hypothetical Events for Abductive Inference

Abductive reasoning starts from some observations and aims at finding the most plausible explanation for these observations. To perform abduction, humans often make use of temporal and causal inferences, and of knowledge about how some hypothetical situation can result in different outcomes. This work offers the first study of how such knowledge impacts the Abductive NLI task, which consists in choosing the more likely explanation for given observations. We train a specialized language model LM_I that is tasked to generate what could happen next from a hypothetical scenario that evolves from a given event. We then propose a multi-task model MTL to solve the Abductive NLI task, which predicts a plausible explanation by a) considering different possible events emerging from candidate hypotheses (events generated by LM_I), and b) selecting the one that is most similar to the observed outcome. We show that our MTL model improves over prior vanilla pre-trained LMs fine-tuned on Abductive NLI. Our manual evaluation and analysis suggest that learning about possible next events from different hypothetical scenarios supports abductive inference.


Introduction
Abductive reasoning (AR) is inference to the best explanation. It typically starts from an incomplete set of observations about everyday situations and comes up with what can be considered the most likely possible explanation given these observations (Pople, 1973; Douven, 2017). One of the key characteristics that makes abductive reasoning more challenging and distinct from other types of reasoning is its non-monotonic character (Strasser and Antonelli, 2019), i.e., even the most likely explanations are not necessarily correct. For example, in Figure 1, the most likely explanation for Observation 1: "wet grass outside my house" is that "it has been raining". However, when a new piece of information (observation or evidence) becomes available, the explanation may have to be retracted, showing the defeasible character of abduction. With the new observation ("the sprinkler was switched on"), the most plausible explanation changes to "the sprinkler caused the grass to be wet". In such situations, humans can induce or validate such abductive inferences by performing hypothetical reasoning (such as "What would happen if the sprinkler was switched on?") to arrive at a plausible explanation for "wet grass outside my house".
In this work, we focus on the αNLI task (Bhagavatula et al., 2020), where given two observations (O_1 at time t_1, O_2 at time t_2, with t_1 < t_2) as an incomplete context, the task is to predict which of two given hypothesized events (H_1 or H_2) is more plausible to have happened between O_1 and O_2. Figure 2 illustrates this with an example: given observations O_1: "Priya decided to try a new restaurant." and O_2: "Priya thought her food was delicious.", the task is to predict whether H_1 or H_2 is the more plausible explanation given observations O_1 and O_2. Both H_1 and H_2 are different plausible hypothetical situations that can evolve from the same observation (premise) O_1.
In this paper, we hypothesize that learning how different hypothetical scenarios (H_1 and H_2) can result in different outcomes (e.g., O_2^{H_j}, Fig. 2) can help in performing abductive inference. In order to decide which H_i is more plausible given the observations, we assume each H_i to be true and generate a possible next event O_2^{H_i} for each of them independently (e.g.: what will happen if Priya's ordered food was microwaved and precooked?). We then compare the generated sentences (O_2^{H_1}, O_2^{H_2} in Fig. 2) to what has been observed (O_2) and choose as the most plausible hypothesis the one whose implication is closest to observation O_2.
We design a language model (LM_I) which, given the observations and a hypothesis, generates a possible event that could happen next under that hypothesis. In order to train this language model, we use the TIMETRAVEL (TT) corpus (Qin et al., 2019), a subpart of the ROCStories corpus (we ensure that αNLI test instances are held out). We utilize the LM_I model to generate a possible next event for each hypothesis, given the observations. We then propose a multi-task learning model MTL that jointly chooses from the generated possible next events (O_2^{H_1} or O_2^{H_2}) the one most similar to the observation O_2, and predicts the most plausible hypothesis (H_1 or H_2).
Our contributions are: i) to the best of our knowledge, we are the first to demonstrate that a model that learns to perform hypothetical reasoning can support and improve abductive tasks such as αNLI; ii) we show that for αNLI our multi-task model outperforms a strong BERT baseline (Bhagavatula et al., 2020).
Our code is publicly available at https://github.com/Heidelberg-NLP/HYPEVENTS.

Learning about Counterfactual Scenarios
The main idea is to learn to generate assumptions, in a given situation, about "What could have happened (next) if we had done X?" or "What could happen (next) if we do X?" (Bhatt and Flanagan, 2010). Figure 3(a) depicts the αNLI task framework. We hypothesize that knowing what will happen (next) if either of two hypotheses occurs will help us verify which of them is more plausible (see Fig. 3(c)). Therefore, we encourage the model to learn how different hypothetical events (including counterfactual events) evolving from the same premise (s_1) can lead to different or similar outcomes (see Fig. 3(b)). Accordingly, we teach a pre-trained GPT-2 (Radford et al., 2019) language model to generate a sequence of possible subsequent events given different hypothetical situations in a narrative setting. Training such a model on narrative texts encourages it to learn causal and temporal relations between events. We train a conditional language model, LM_I, which generates a possible event that could happen next, given some counterfactual scenario for a given story. We train this model on the TIMETRAVEL (TT) dataset (Qin et al., 2019), fine-tuning GPT-2 to learn about possible next events emerging from a situation in a story, given some alternative, counterfactual event. The TT dataset consists of five-sentence instances S = (s_1, s_2, ..., s_5) from the ROCStories corpus, plus additional crowd-sourced sentences s'_{2:5}, where s'_2 is counterfactual to s_2 from the original story (i.e., s'_2 states something that is contrary to s_2; during our experiments we treat them as two separate instances, S_1 = (s_{1:5}) and S_2 = (s_1, s'_{2:5})). There are two reasons for using the TT dataset for our purposes: a) the domains on which GPT-2 was pretrained (the WebText corpus) are broad and different from the domain of ROCStories; b) the model can see how alternative situations can occur starting from the same premise s_1, resulting in similar or different outcomes. Note that, although intermediate situations may be counterfactual to each other, the future outcome can still be similar to the original ending due to causal invariance, i.e., future events that are invariant under the counterfactual conditions (Qin et al., 2019).
Concretely, the language model LM_I reads the premise (s_1) and the alternative event (s_2 or s'_2), a masked token serving as a placeholder for the missing possible next event(s) (s_{3:i} or s'_{3:i}), then the rest of the story (s_{i+1:5} or s'_{i+1:5}), and again the premise (s_1). We train the model to maximize the log-likelihood of the missing ground-truth sentence(s) (s_{3:i}).
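To make this input format concrete, the following is a minimal sketch of how such an infilling-style training sequence could be assembled; the special tokens (<mask>, <sep>, <eos>) and helper names are illustrative assumptions, not necessarily the exact format of our implementation.

```python
# Minimal sketch: assembling one LM_I training sequence from a
# TIMETRAVEL-style instance. Special tokens and names are assumptions.

def build_training_sequence(premise, alt_event, rest_of_story, target):
    """premise       -- s_1
    alt_event     -- s_2 or the counterfactual s'_2
    rest_of_story -- the remaining sentences (s_{i+1:5} or s'_{i+1:5})
    target        -- the ground-truth missing next event(s) (s_{3:i})
    """
    # The model reads: premise, alternative event, a placeholder for the
    # missing next event(s), the rest of the story, and the premise again.
    context = f"{premise} {alt_event} <mask> {rest_of_story} {premise}"
    # Standard causal-LM fine-tuning then maximizes the log-likelihood
    # of the ground-truth sentence(s) that follow the separator.
    return f"{context} <sep> {target} <eos>"

# Illustrative call (story content invented for demonstration):
seq = build_training_sequence(
    "Priya decided to try a new restaurant.",      # s_1
    "Priya's food was microwaved and precooked.",  # s'_2 (counterfactual)
    "She never went back.",                        # s'_{i+1:5}
    "Her food arrived lukewarm and bland.",        # s'_{3:i}
)
```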

Generating Hypothetical Events to Support the αNLI Task
In this paper, we aim to investigate whether models perform better on the αNLI task when explicitly learning about events that could follow other events in a hypothetical scenario. We do so by introducing two methods, LM_I + BERTScore and LM_I + MTL, for the unsupervised and supervised settings, respectively.
We first apply the trained LM_I model to the αNLI task, where the given observations O_1 and O_2 and the alternative hypotheses H_j are fed to the model as shown in Eq. (2). We generate a possible next event for each hypothetical event H_j, i.e., O_2^{H_1} and O_2^{H_2} (or: what will happen if hypothesis H_j occurs, given the observations), where j ∈ {1, 2}. Table 1 illustrates an example where different O_2^{H_j} are generated using LM_I. One of the challenges when generating subsequent events given a hypothetical situation is that there can be an unbounded number of possible next events. Therefore, to constrain this range, we give the future observation (O_2) as input, such that the model generates subsequent events in a constrained context.
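For illustration, generating a candidate next event O_2^{H_j} with the fine-tuned model might look roughly as follows; this sketch assumes the HuggingFace Transformers API, and the checkpoint path, prompt layout and decoding parameters are our own assumptions.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Illustrative path to the LM_I weights obtained by fine-tuning GPT-2
# on TIMETRAVEL (see the previous section).
tokenizer = GPT2Tokenizer.from_pretrained("path/to/lm_i-finetuned")
model = GPT2LMHeadModel.from_pretrained("path/to/lm_i-finetuned")

def generate_next_event(o1, hypothesis, o2, max_new_tokens=30):
    # O_2 is part of the prompt so that the model generates a next event
    # in a constrained context rather than an unbounded continuation.
    prompt = f"{o1} {hypothesis} <mask> {o2} {o1} <sep>"
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        top_p=0.9,
    )
    # Keep only the newly generated tokens: the candidate O_2^{H_j}.
    new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```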

Unsupervised Setting
In this setting, we do not train any supervised model to explicitly predict which hypothesis is more plausible given the observations. Instead, we apply the fine-tuned LM_I model to the αNLI data, generate possible next events O_2^{H_j} given O_1 and H_j, as described above, and measure the similarity between these possible next events (O_2^{H_j}) and the observation (O_2) in an unsupervised way, using BERTScore (BS) (Zhang et al., 2020). We evaluate our hypothesis that the possible next event O_2^{H_j} generated for the more plausible hypothesis H_j should be more similar to observation O_2. Table 1 illustrates an example where H_2 is the more plausible hypothesis. We impose the constraint that, for a correctly predicted instance, BS(O_2^{H+}, O_2) > BS(O_2^{H-}, O_2), where H+ and H- are the more plausible and the implausible hypothesis, respectively.
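The resulting decision rule is simple; a small sketch using the public bert_score package (the wrapper function is our own) could look like this:

```python
from bert_score import score

def choose_hypothesis(candidate_events, o2):
    """candidate_events -- [O_2^{H_1}, O_2^{H_2}] generated by LM_I
    o2 -- the observed outcome O_2
    Returns the 1-based index of the predicted hypothesis."""
    # BERTScore matches candidate and reference tokens via contextual
    # embeddings and cosine similarity; we use the F1 score.
    _, _, f1 = score(candidate_events, [o2] * len(candidate_events), lang="en")
    return int(f1.argmax()) + 1
```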

Supervised Setting
In this setting, displayed in Figure 4, we explore the benefits of training a multi-task MTL model that predicts i) the most plausible hypothesis and ii) which possible next event (O_2^{H_j}) is more similar to the observation (O_2). Multi-task learning aims to improve the performance of a model on a task by utilizing the knowledge acquired by learning related tasks (Ruder, 2019). We hypothesize that a) the possible next event O_2^{H_j} of the more plausible hypothesis H_j should be most similar to observation O_2, and that b) learning which possible next event is more similar supports the model in the αNLI task (inductive transfer).
The architecture of the LM_I + MTL model is shown in Figure 4. The component marked (a) in Figure 4 depicts the LM_I model as described in §3. The outputs of the LM_I model, which we get from Eq. (2) for both hypotheses, are incorporated as input to the MTL model. Concretely, we feed the MTL classifier a sequence of tokens as stated in part (b) of Figure 4, and compute their contextualized representations using pre-trained BERT. The input format is described in Table 3. Similar to Devlin et al. (2019), two additional tokens are added: [CLS] at the start of each sequence input and [SEP] at the end of each sentence. In the shared layers (see Fig. 4(b)), the model first transforms the input sequence into a sequence of embedding vectors. Then it applies an attention mechanism that learns contextual relations between words (or sub-words) in the input sequence.
For each instance we get four [CLS] embeddings (for H_j and O_2^{H_j}, j ∈ {1, 2}), which are then passed through two linear layers, one for the αNLI (main) task and another for predicting the similarity between O_2^{H_j} and O_2. We compute the joint loss function L = L_αNLI + w * L_similarity, where w is a trainable parameter and L_αNLI and L_similarity are the loss functions for the αNLI task and the auxiliary task, respectively.
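A minimal PyTorch sketch of this joint objective is shown below; the module layout and the grouping of the four [CLS] embeddings into the two heads are our own assumptions, and the shared BERT encoder is omitted.

```python
import torch
import torch.nn as nn

class MTLHeads(nn.Module):
    """Sketch of the two task-specific linear layers over [CLS] embeddings."""

    def __init__(self, hidden_size=768):
        super().__init__()
        # Main task: choose between H_1 and H_2.
        self.alpha_nli_head = nn.Linear(2 * hidden_size, 2)
        # Auxiliary task: which O_2^{H_j} is more similar to O_2.
        self.similarity_head = nn.Linear(2 * hidden_size, 2)
        # Trainable weight on the auxiliary loss, initialized to 1.
        self.w = nn.Parameter(torch.tensor(1.0))
        self.ce = nn.CrossEntropyLoss()

    def forward(self, cls_hyp, cls_gen, y_hyp, y_sim):
        # cls_hyp: [CLS] embeddings for the two hypothesis inputs, (B, 2, H)
        # cls_gen: [CLS] embeddings for the two generated next events, (B, 2, H)
        logits_hyp = self.alpha_nli_head(cls_hyp.flatten(1))
        logits_sim = self.similarity_head(cls_gen.flatten(1))
        # Joint loss: L = L_alphaNLI + w * L_similarity
        loss = self.ce(logits_hyp, y_hyp) + self.w * self.ce(logits_sim, y_sim)
        return loss, logits_hyp
```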

Experimental Setup
Data. We conduct experiments on the ART (Bhagavatula et al., 2020) dataset. Data statistics are given in Table 2. For evaluation, we measure accuracy for αNLI.
Hyperparameters. To train the LM_I model we use a learning rate of 5e-5, decayed linearly until the end of training, and a batch size of 12. In the supervised setting for the αNLI task, we use the following hyperparameters for our MTL model with integrated LM_I model (LM_I + MTL): batch size: {8, 16}; epochs: {3, 5}; learning rate: {2e-5, 5e-6}. We use the Adam optimizer and a dropout rate of 0.1. We ran experiments on GPUs with 11GB and 24GB of memory. Training is performed using cross-entropy loss; the joint loss function is L = L_αNLI + w * L_similarity, where w is a trainable parameter that we initialize to w = 1. The input format is depicted in Table 3. We report performance averaged over 5 different seeds, along with the variance.
Baselines. We compare to the following baseline models that we apply to the αNLI task, training them on the training portion of the ART dataset (cf. Table 2).
• ESIM + ELMo is based on the ESIM model previously used for NLI (Chen et al., 2017). We use (a) ELMo to encode the observations and hypothesis, followed by (b) an attention layer.
• BERT (Devlin et al., 2019) is a LM trained with a masked-language modeling (MLM) and next sentence prediction objective.
As baselines for the MTL model, we replace LM_I with alternative generative LMs:
• GPT-2 + MTL. In this setup, we directly use the pretrained GPT-2 model and task it to generate a next sentence conditioned on each hypothesis (O_2^{H_j}), without fine-tuning it on the TIMETRAVEL data. We then use the supervised MTL model to predict the most plausible hypothesis and which of the generated observations is more similar to O_2.
• COMET + MTL. In this setting, we make use of inferential if-then knowledge from ATOMIC (Sap et al., 2019a) as background knowledge. Specifically, we use COMET to generate objects of Effect relations ("as a result, PersonX feels"; "as a result, PersonX wants"; "PersonX then") for each hypothesis as a textual phrase.

Results
In Table 4, we compare our models LM_I + BERTScore and LM_I + MTL against the models proposed in Bhagavatula et al. (2020): a majority baseline, supervised models (Infersent and ESIM+ELMo), as well as BERT-Large. Bhagavatula et al. (2020) re-train the ESIM+ELMo and Infersent models on the ART dataset, fine-tune the BERT model on the αNLI task, and report the results.
We find that our unsupervised model with BERTScore (LM_I + BERTScore) outperforms the strong ESIM+ELMo and Infersent baseline models (by +9.28 pp. and +1.28 pp., respectively). Table 5 shows some example generations of our model LM_I along with the obtained BERTScores.
Unlike the unsupervised LM_I + BERTScore model, our supervised LM_I + MTL model also improves over the BERT-Large baseline, by +3.3 pp. We attribute this improvement to the model having been jointly trained to assess the similarity and dissimilarity of possible next events O_2^{H_j} and observations (O_2) along with the αNLI task. One advantage of training our proposed multi-task learning (MTL) model, instead of directly feeding the possible next events O_2^{H_j} as knowledge inputs, is that it adds an explainable component to the model: one can view the generated next events O_2^{H_j} as natural language rationales, and our multi-task model explicitly chooses one of them. Hence, the multi-task framework makes the model more expressive. Finally, we compare, for the MTL model, our embedded generation model LM_I to pre-trained GPT-2 and COMET. Table 4 shows that LM_I + MTL yields better performance than both COMET + MTL (+3.1 pp.) and GPT-2 + MTL (+3.4 pp.), the intuitive reason being that the next events generated by LM_I are more helpful than events generated using pretrained GPT-2 and objects generated by COMET. Table 5 illustrates some examples where our MTL model not only chooses the correct hypothesis, but also a likely possible next event that is similar to the observation O_2. Interestingly, we initialize w = 1 during training of MTL, and after training we found that the w value had been adjusted to a range between 0.75 and 0.85, which intuitively shows both the effectiveness of our LM_I-generated possible next events and their similarity to the given observations O_2. Table 5 shows: (i) examples (a), (b) and (d), depicting the scenario where possible next event and observation pairs correctly achieve higher BERTScores (BERTScore matches words in candidate and reference sentences by cosine similarity); (ii) example (c), depicting the scenario where an incorrect possible next event and observation pair achieves a higher BERTScore than the correct one. Intuitive reasons for these scenarios are, for example (a), a higher word overlap and semantic similarity between the correct next event and observation O_2; for example (b), a higher semantic similarity; whereas for example (c), although there is a higher semantic dissimilarity, the word overlap between the wrong possible next event ("She started to feel sick.") and the observation ("She felt much better afterwards.") is much higher.

Manual Evaluation
Since the automatic scores only account for word-level similarity between observations and generated possible next events, we conduct a manual evaluation study to assess the quality of sentences generated by our LM_I model.
Annotation Study on LM_I generations. The annotation was performed by three annotators with a computational linguistics background. We provide each of the three annotators with the observations, hypotheses and sentences produced by our LM_I model for 50 randomly chosen instances from the αNLI task. They obtain i) the generated sentences for a possible next event for the correct and the incorrect hypothesis, as well as ii) the sentence stating observation O_2.
We ask each annotator to rate the sentences according to four quality aspects as stated below.
• Grammaticality: the sentence is i) grammatical, ii) not entirely grammatical but understandable, or iii) completely not understandable;
• Redundancy: the sentence contains redundant or repeated information;
• Contradiction: the sentence contains pieces of information that contradict the given observation O_2, or not;
• Relevance: the possible next event is i) relevant, ii) partially relevant, or iii) not relevant.
For each aspect, annotators are asked to judge the sentence generated for the correct hypothesis. Only for Contradiction are they asked to judge both sentences, for the correct and the incorrect hypothesis.
Results and Discussion. Figures 5, 7, and 6 present the results of the manual evaluation of generation quality, according to the criteria described above.

Figure 5: Human evaluation of the grammaticality of generated sentences: ratio of i) grammatical, ii) not entirely grammatical but understandable, iii) completely not understandable sentences.

For measuring inter-annotator agreement, we computed Krippendorff's α (Hayes and Krippendorff, 2007) for Grammaticality and Relevance, as it is suited for ordinal values, and Cohen's κ for Redundancy and Contradiction. We found α values of 0.587 and 0.462 for Grammaticality and Relevance, respectively (moderate agreement), and κ values of 0.61 and 0.74 for Redundancy and Contradiction, respectively (substantial agreement). We aggregated the annotations from the three annotators using majority vote. Figure 5 shows that the majority of sentences (96%) are grammatical or understandable. Figure 7 shows that most sentences for correct labels are non-redundant (84%) and non-contradictory (88%), whereas for incorrect labels, 39 of the 50 instances (78%) are found to contradict the observation O_2. The manual evaluation supports our hypothesis that the sentences generated for correct labels should be more similar to (less contradictory with) O_2 than the sentences generated for incorrect labels. Figure 6 shows the ratio of sentences considered by humans as relevant, partially relevant, and irrelevant. The results show that 46% of cases are relevant (based on majority agreement) and 24% of cases are partially relevant. This means the generated sentences are (partially) relevant in most cases and thus should support abduction for both the unsupervised (LM_I + BERTScore) and supervised (LM_I + MTL) models.
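For reference, agreement statistics of this kind can be computed with standard libraries; the following small sketch assumes the krippendorff package and scikit-learn, and the ratings shown are made up for illustration.

```python
import krippendorff
from sklearn.metrics import cohen_kappa_score

# Illustrative ratings only: rows = annotators, columns = instances.
# Grammaticality is ordinal: 1 = grammatical, 2 = understandable, 3 = gibberish.
grammaticality = [
    [1, 1, 2, 1, 3],
    [1, 2, 2, 1, 3],
    [1, 1, 2, 2, 3],
]
alpha = krippendorff.alpha(reliability_data=grammaticality,
                           level_of_measurement="ordinal")

# Contradiction is binary; Cohen's kappa compares one annotator pair.
ann1 = [0, 1, 0, 0, 1]
ann2 = [0, 1, 1, 0, 1]
kappa = cohen_kappa_score(ann1, ann2)
print(f"Krippendorff's alpha: {alpha:.3f}, Cohen's kappa: {kappa:.3f}")
```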
Impact of Reasoning Types. Finally, to better assess the performance of our model, we determine what types of reasoning underlie the abductive reasoning tasks in our data, and examine to what extent our models capture these reasoning types. We consider again the 50 instances that were annotated by our previous annotators, and ask a new annotator to classify them into reasoning types. We broadly divide the data into 6 categories: (i) Motivation, (ii) Spatial-Temporal, (iii) Emotional, (iv) Negation, (v) Reaction, (vi) Situational Fact. The most frequent type was Emotional (10); the most infrequent was Spatial-Temporal (7).
Considering the Relevance and Contradiction categories from the previous annotations, we determine that for Negation (8), Emotional (10), and Reaction (8), all generated events for correct labels are partially or fully relevant and non-contradictory. An intuitive reason can be that we train our LM_I model to learn how different counterfactual hypothetical events emerging from a single premise can lead to the same or different outcomes through a series of events. Some counterfactual events (s'_2) are negations of the original event (s_2) in the TIMETRAVEL dataset, which may support the reasoning class Negation. For the other categories, Motivation, Spatial-Temporal, and Situational Fact, we detect errors regarding (missing) Relevance in 21%, 14% and 28% of cases, respectively. Table 6 illustrates an example from the class Situational Fact, where our generated next event is irrelevant and redundant.

Related Work
Commonsense Reasoning. There is growing interest in this research field, which has led to the creation of several new resources on commonsense reasoning, in the form of both datasets, such as SocialIQA (Sap et al., 2019b), CommonsenseQA (Talmor et al., 2019) and CosmosQA (Huang et al., 2019), and knowledge bases, e.g. ConceptNet (Speer et al., 2017), ATOMIC (Sap et al., 2019a), or Event2Mind (Rashkin et al., 2018). Recently, many works proposed to utilize external static knowledge graphs (KGs) to address the bottleneck of obtaining relevant commonsense knowledge. Lin et al. (2019) proposed to utilize knowledge graph embeddings to rank and select relevant knowledge triples or paths. Paul and Frank (2019) proposed to extract subgraphs from KGs using graph-based ranking methods; follow-up work adopted this graph-based ranking method and proposed to dynamically extend the KG to combat sparsity. In concurrent work, Paul and Frank (2021) introduced a method to dynamically generate contextually relevant knowledge that guides a model while performing the narrative story completion task. Both hypothetical reasoning and abductive reasoning are understudied problems in NLP. Recently, Tandon et al. (2019) proposed a first large-scale dataset of "What if..." questions over procedural text, introduced to study the effect of perturbations in procedural text. Related to our work, Qin et al. (2019) investigated the capabilities of state-of-the-art LMs to rewrite stories with counterfactual reasoning; in our work we utilize their dataset to model how to generate possible next events emerging from different hypothetical and counterfactual events. Mostafazadeh et al. (2016) designed the narrative cloze task, a task to choose the correct ending of a story. Conversely, Bhagavatula et al. (2020) proposed a task that requires reasoning about plausible explanations for narrative omissions. Our research touches on the issue of hypothetical reasoning about alternative situations. We found that making language models learn how different hypothetical events can evolve from a premise and result in similar or different future events helps models to perform better in abduction.

Explainability.
Despite the success of large pretrained language models, recent studies have raised some critical points, such as: high accuracy scores do not necessarily reflect understanding (Min et al., 2019), and large pretrained models may exploit superficial clues and annotation artifacts (Gururangan et al., 2018; Kavumba et al., 2019). Therefore, the ability of models to generate explanations has become desirable, as this enhances interpretability. Recently, there has been substantial effort to build datasets with natural language explanations (Camburu et al., 2018; Park et al., 2018; Thayaparan et al., 2020). There have also been numerous research works proposing models that are interpretable or explainable (Rajani et al., 2019; Atanasova et al., 2020; Latcinnik and Berant, 2020; Wiegreffe and Marasović, 2021). Our work contributes to this direction, as our MTL model not only predicts the plausible hypothesis H_j but also generates possible next events O_2^{H_j} and chooses the one that is closer to the given context, thereby making our model more expressive.
Abductive Reasoning. There has been longstanding work on theories of abductive reasoning (Peirce, 1903, 1965a; Kuipers, 1992, 2013). Researchers have applied various frameworks, some focused on purely logical frameworks (Pople, 1973; Kakas et al., 1992), some on probabilistic frameworks (Pearl, 1988), and others on Markov Logics (Singla and Mooney, 2011). Recently, moving away from logic-based abductive reasoning, Bhagavatula et al. (2020) proposed to study language-based abductive reasoning. They introduced two tasks: Abductive Natural Language Inference (αNLI) and Generation (αNLG). They establish baseline performance based on state-of-the-art language models and make use of inferential structured knowledge from ATOMIC (Sap et al., 2019a) as background knowledge. Zhu et al. (2020) proposed to use a learning-to-rank framework to address the abductive reasoning task. Ji et al. (2020) proposed a model, GRF, that enables pre-trained models (GPT-2) to perform dynamic multi-hop reasoning on multi-relational paths extracted from the external ConceptNet commonsense knowledge graph for the αNLG task. In previous work, we proposed a multi-head knowledge attention method to incorporate commonsense knowledge to tackle the αNLI task. Unlike that work, which focused on leveraging structured knowledge, in this work we focus on learning about what will happen next from different counterfactual situations in a story context, through language model fine-tuning. Specifically, we study the impact of such forward inference on the αNLI task in a multi-task learning framework and show how it can improve performance over a strong BERT model.

Conclusion
We have introduced a novel method for addressing the abductive reasoning task by explicitly learning what events could follow other events in a hypothetical scenario, and learning to generate such events, conditioned on a premise or hypothesis. We show how a language model, fine-tuned for this capability on a suitable narrative dataset, can be leveraged to support abductive reasoning in the αNLI task, in two settings: an unsupervised setting, in combination with BERTScore, to select the proper hypothesis, and a supervised multi-task (MTL) setting.
The relatively strong performance of our proposed models demonstrates that learning to choose, from generated hypothetical next events, the one that is most similar to the observation supports the prediction of the most plausible hypothesis. Our experiments show that our unsupervised LM_I + BERTScore model outperforms some of the strong supervised baseline systems on αNLI. Our research thus offers new perspectives for training generative models in different ways for various complex reasoning tasks.