Think about it! Improving defeasible reasoning by first modeling the question scenario

Defeasible reasoning is the mode of reasoning where conclusions can be overturned by taking into account new evidence. Existing cognitive science literature on defeasible reasoning suggests that a person forms a mental model of the problem scenario before answering questions. We ask whether neural models can similarly benefit from envisioning the question scenario before answering a defeasible query. Our approach is, given a question, to have a model first create a graph of relevant influences, and then leverage that graph as an additional input when answering the question. Our system, CURIOUS, achieves a new state-of-the-art on three different defeasible reasoning datasets. This result is significant as it illustrates that performance can be improved by guiding a system to "think about" a question and explicitly model the scenario, rather than answering reflexively. Code, data, and pre-trained models are available at https://github.com/madaan/thinkaboutit.


Introduction
Defeasible inference is a mode of reasoning in which additional information can modify conclusions (Koons, 2017). Here we consider the specific formulation and challenge of Rudinger et al. (2020): given that some premise P plausibly implies a hypothesis H, does new information S weaken or strengthen the conclusion H? For example, consider the premise "The drinking glass fell" with the possible implication "The glass broke". The new information "The glass fell on a pillow" weakens the implication.
We borrow ideas from the cognitive science literature that supports defeasible reasoning for humans with an inference graph (Pollock, 1987, 2009). The inference graph formulation of Madaan et al. (2021), which we use in this paper, draws connections between P, H, and S through mediating events. This can be seen as forming a mental model of the question scenario before answering (Johnson-Laird, 1983). This paper asks the natural question: can modeling the question scenario with inference graphs help machines in defeasible reasoning?
Our approach is as follows. First, given a question, generate an inference graph describing important influences between question elements. Then, use that graph as an additional input when answering the defeasible reasoning query. Our proposed system, CURIOUS, comprises a graph generation module and a graph encoding module that leverages the generated graph when answering the query (Figure 2).
To generate inference graphs, we build upon past work that uses a sequence-to-sequence approach (Madaan et al., 2021). However, our analysis revealed that the graphs can often be erroneous, and CURIOUS therefore also includes an error correction module to generate higher-quality inference graphs. This was important because we found that better graphs are more helpful in the downstream QA task.
The generated inference graph is then used for the QA task on three existing defeasible inference datasets from diverse domains, viz., δ-SNLI (natural language inference) (Bowman et al., 2015), δ-SOCIAL (reasoning about social norms) (Forbes et al., 2020), and δ-ATOMIC (commonsense reasoning) (Sap et al., 2019). We show that the way the graph is encoded as input is important. If we simply augment the question with the generated graph, there are some gains on all datasets. However, accuracy improves substantially across all datasets with a more judicious encoding of the graph-augmented question that accounts for interactions between the graph nodes. To achieve this, we include mixture-of-experts layers (Jacobs et al., 1991) during encoding, enabling the model to selectively attend to specific nodes while capturing their interactions.
In summary, our contribution is in drawing on the idea of an inference graph from cognitive science to show benefits in a defeasible inference QA task. Using an error correction module in the graph generation process, and a judicious encoding of the graph-augmented question, CURIOUS achieves a new state-of-the-art over three defeasible datasets. This result is significant also because our work illustrates that guiding a system to "think about" a question before answering can improve performance.

Task
We use the defeasible inference task and datasets defined in Rudinger et al. (2020): given an input x = (P, H, S), predict the output y ∈ {strengthens, weakens}, where P, H, and S are sentences describing a premise, hypothesis, and scenario respectively, and y denotes whether S strengthens or weakens the plausible conclusion that H follows from P, as described in Section 1.
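To make the format concrete, the following is a minimal sketch of one example, using the glass example from Section 1; the field names are illustrative and not the datasets' actual schema.

```python
# A minimal sketch of one defeasible-inference example.
# Field names are illustrative; the released dataset files may use different keys.
example = {
    "premise":    "The drinking glass fell",        # P
    "hypothesis": "The glass broke",                # H
    "update":     "The glass fell on a pillow",     # S, the new information
    "label":      "weakens",                        # y in {strengthens, weakens}
}

def label_to_id(label: str) -> int:
    """Map a defeasible label to a binary class id."""
    return {"strengthens": 1, "weakens": 0}[label]

assert label_to_id(example["label"]) == 0
```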

Approach
Inspired by past results (Madaan et al., 2021) that humans found inference graphs useful for defeasible inference, we investigate whether neural models can benefit from envisioning the question scenario using an inference graph before answering a defeasible inference query.
Inference graphs As inference graphs are central to our work, we give a brief description of their structure next. Inference graphs were introduced in philosophy by Pollock (2009) to aid defeasible reasoning for humans, and in NLP by Tandon et al. (2019) for a counterfactual reasoning task. We interpret the inference graphs as having four kinds of nodes (Pollock, 2009; Madaan et al., 2021): i. Contextualizers (C-, C+): these nodes set the context around a situation and connect to the premise P.
ii. Situations (S, S-): these nodes are new situations that emerge which might overturn an inference.
iii. Hypothesis (H-, H+): Hypothesis nodes describe the outcome/conclusion of the situation.
iv. Mediators (M-, M+): Mediators are nodes that help bridge the knowledge gap between a situation node and a hypothesis node by explaining their connection explicitly. These nodes can act as either a weakener or a strengthener.
Each node in an influence graph is labeled with an event (a sentence or a phrase). The signs - and + capture the nature of the influence of the event node. Concrete examples are given in Figures 1 and 4, and in Appendix §D.

Overview of CURIOUS
Our system, CURIOUS, comprises three components: (i) a graph generator GEN init, (ii) a graph corrector GEN corr, and (iii) a graph encoder (Figure 1). GEN init generates an inference graph from the input x. We borrow the sequence-to-sequence approach of GEN init from Madaan et al. (2021) without any architectural changes. However, we found that the resulting graphs can often be erroneous (which hurts task performance), so CURIOUS includes an error correction module GEN corr to generate higher-quality inference graphs that are then judiciously encoded using the graph encoder. This encoded representation is then passed through a classifier to generate an end-task label. The overall architecture is shown in Figure 2.

Figure 3: An overview of our method to perform graph-augmented defeasible reasoning using a hierarchical mixture of experts. First, MOE-V selectively pools the node representations to generate a representation h G of the inference graph. Then, MOE-GX pools the query representation h x and the graph representation generated by MOE-V to pass to the upstream classifier.

As the initial graph generator, we use the method described in Madaan et al. (2021) (GEN init) to generate inference graphs for defeasible reasoning; we use their publicly available code and data. Their approach involves first training a graph-generation module, and then performing zero-shot inference on a defeasible query to obtain an inference graph. The training data for the graph-generation module comes from the WIQA dataset (Tandon et al., 2019), a dataset of 2107 (T i, G i) tuples, where T i is a passage describing a process (e.g., waves hitting a beach) and G i is the corresponding influence graph.

Graph corrector
We found that 70% of a random sample of 100 graphs produced by GEN init (undesirably) had repeated nodes (an example is in Figure 4). Repeated nodes introduce noise because they violate the semantic structure of a graph; e.g., in Figure 4, nodes C+ and C- are repeated, although they are expected to have opposite semantics. Higher graph quality yields better end-task performance when using inference graphs (as we will show in §4.3.1). To repair such problems, we train a graph corrector, GEN corr, that takes a graph G as input and outputs a graph G* with the repetitions fixed. To train the model, we require (G, G*) examples, which we generate using a data augmentation technique described in Appendix §A. Because the nodes in the graph come from an open vocabulary, we train a T5 sequence-to-sequence model (Raffel et al., 2020) with input G and output G*. In summary, given a defeasible query (P, H, S), we generate a potentially erroneous initial graph using GEN init and then feed it to GEN corr to obtain an improved graph, which is the graph used by the graph encoder.
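At inference time, the corrector is simply a T5 model mapping one linearized graph to another. A minimal sketch is given below; the node-tag linearization ("C+: ... | C-: ...") and the checkpoint name are assumptions for illustration, not the exact format or model released with CURIOUS.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Hypothetical checkpoint; in practice this would be the fine-tuned GEN_corr model.
MODEL_NAME = "t5-base"

tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)

def correct_graph(linearized_graph: str, max_length: int = 256) -> str:
    """Map a (possibly erroneous) linearized graph G to a corrected graph G*."""
    inputs = tokenizer(linearized_graph, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, max_length=max_length, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Assumed linearization of an initial graph with a repeated contextualizer.
initial = "C+: glass is fragile | C-: glass is fragile | S-: glass falls on a pillow | ..."
print(correct_graph(initial))
```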

Graph Encoder
For each defeasible query (P, H, S), we add the inference graph G produced by CURIOUS (the corrected graph from §3.3) to provide additional context for the query, as we now describe.
We concatenate the components (P, H, S) of the defeasible query into a single sequence of tokens x = (P ∥ H ∥ S), where ∥ denotes concatenation. Thus, each sample of our graph-augmented binary-classification task takes the form ((x, G), y), where y ∈ {strengthens, weakens}. Following Madaan et al. (2021), we do not use edge labels and treat all graphs as undirected.
Overview: We first use a language model L to obtain a dense representation h x for the defeasible query x, and a dense representation h v for each node v ∈ G. The node representations h v are then pooled using a hierarchical mixture of experts (MoE) to obtain a graph representation h G . The query representation h x and the graph representation h G are combined to solve the defeasible task. We now provide details on obtaining h x , h v , h G .

Encoding the query and nodes
Let L be a pre-trained language model (in our case, RoBERTa). We write L(t) ∈ R d for the dense representation of a token sequence t returned by L; specifically, we use the pooled representation of the beginning-of-sequence token <s> as the sequence representation.
We encode the defeasible query x and the nodes of the graph using L. The query representation is computed as h x = L(x). Similarly, we encode each node v i of G to obtain h v i = L(v i ) ∈ R d, and stack these vectors into the matrix of node representations h V ∈ R |V|×d.
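A minimal sketch of this encoding step with Hugging Face Transformers is shown below. Pooling the <s> (first) token follows the description above; the specific checkpoint (roberta-base) and the separator used to join P, H, and S are assumptions for illustration.

```python
import torch
from transformers import RobertaModel, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")   # checkpoint is illustrative
encoder = RobertaModel.from_pretrained("roberta-base")

def encode(texts):
    """Return one d-dimensional vector per string, taken from the <s> (first) token."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = encoder(**batch)
    return out.last_hidden_state[:, 0, :]

# Query representation h_x and node representation matrix h_V (|V| x d).
query = "The drinking glass fell </s> The glass broke </s> The glass fell on a pillow"
nodes = ["the glass lands on something soft", "the glass lands on a hard surface"]  # illustrative
h_x = encode([query])   # shape (1, d)
h_V = encode(nodes)     # shape (|V|, d)
```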

Graph representations using MoE
Recently, mixture-of-experts (Jacobs et al., 1991; Shazeer et al., 2017; Fedus et al., 2021) has emerged as a promising method for combining multiple feature types. Mixture-of-experts (MoE) is especially useful when the input consists of multiple facets, each with a specific semantic meaning. Previously, Gu et al. (2018) and Chen et al. (2019) used the ability of MoE to pool disparate features for low-resource and cross-lingual language tasks. Since each node in our inference graphs plays a specific role in defeasible reasoning (contextualizer, situation, or mediator), we take inspiration from these works and design a hierarchical MoE model (Jordan and Xu, 1995) to pool the node representations h V into a graph representation h G.
An MoE consists of n expert networks E 1, E 2, ..., E n and a gating network M. Given an input x ∈ R d, each expert network E i : R d → R k learns a transform over the input. The gating network M : R d → Δ n produces the weights p = {p 1, p 2, ..., p n} used to combine the expert outputs for input x. Finally, the output y is returned as a convex combination of the expert outputs:

y = Σ_{i=1}^{n} p_i E_i(x).    (2)

The output y can either be the logits for an end task (Shazeer et al., 2017; Fedus et al., 2021) or pooled features that are passed to a downstream learner (Chen et al., 2019; Gu et al., 2018). The gating network M and the expert networks E 1, E 2, ..., E n are trained end-to-end. During learning, the gradients to M train it to generate a distribution over the experts that favors the best expert for a given input. Appendix §B presents a further discussion of our MoE formulation and an analysis of the gradients.
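A minimal PyTorch sketch of such an MoE layer, with single-layer MLP experts and gate as described in Appendix §C, is shown below; it is an illustrative implementation of Equation 2 rather than the exact CURIOUS code.

```python
import torch
import torch.nn as nn

class MoE(nn.Module):
    """Mixture of experts: y = sum_i p_i * E_i(x), with p produced by a gating network."""

    def __init__(self, n_experts: int, dim: int):
        super().__init__()
        # Single-layer experts with equal input and output size.
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_experts)])
        self.gate = nn.Linear(dim, n_experts)

    def forward(self, x: torch.Tensor):
        p = torch.softmax(self.gate(x), dim=-1)                        # (batch, n_experts)
        expert_out = torch.stack([E(x) for E in self.experts], dim=1)  # (batch, n_experts, dim)
        y = (p.unsqueeze(-1) * expert_out).sum(dim=1)                  # convex combination
        return y, p

# Toy usage: pool a batch of 768-dimensional features with 5 experts.
moe = MoE(n_experts=5, dim=768)
y, p = moe(torch.randn(2, 768))
print(y.shape, p.shape)   # torch.Size([2, 768]) torch.Size([2, 5])
```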
Hierarchical MoE for defeasible reasoning Different parts of an inference graph might help answer a query to different degrees. Further, for certain queries the graph might not be helpful at all (and could even be distracting), and the model should be able to rely primarily on the input query alone. This motivates a two-level architecture that can: (i) select a subset of the nodes in the graph, and (ii) selectively reason across the query and the graph to varying degrees.
Given these requirements, a hierarchical MoE (Jordan and Jacobs, 1994) model presents itself as a natural choice to model this task. The first MoE (MOE-V) creates a graph representation by taking a convex combination of the node representations. The second MoE (MOE-GX) then takes a convex-combination of the graph representation returned by MOE-V and query representation and passes it to an MLP for the downstream task.
• MOE-V consists of five node experts and a gating network that selectively pool the node representations h V into the graph representation h G:

h_G = Σ_{i=1}^{|V|} p_i E_i(h_{v_i}).

• MOE-GX contains two experts (a graph expert E G and a question expert E Q) and a gating network that combine the graph representation h G returned by MOE-V and the query representation h x:

h_y = p_G E_G(h_G) + p_Q E_Q(h_x).

h y is then passed to a 1-layer MLP to perform classification. The gates and the experts in our MoE model are single-layer MLPs, with equal input and output dimensions for the experts.
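A sketch of this two-level arrangement might look as follows. The gating inputs (MOE-V gated on the mean node representation, MOE-GX gated on the query) and the one-expert-per-node assignment are our assumptions for illustration; the actual implementation may wire the gates differently.

```python
import torch
import torch.nn as nn

class HierarchicalMoE(nn.Module):
    """Sketch of MOE-V (pooling node representations) and MOE-GX (graph vs. query)."""

    def __init__(self, dim: int, n_nodes: int = 5, n_classes: int = 2):
        super().__init__()
        self.node_experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_nodes)])
        self.node_gate = nn.Linear(dim, n_nodes)
        self.graph_expert = nn.Linear(dim, dim)
        self.query_expert = nn.Linear(dim, dim)
        self.gx_gate = nn.Linear(dim, 2)
        self.classifier = nn.Linear(dim, n_classes)   # 1-layer MLP head

    def forward(self, h_x: torch.Tensor, h_V: torch.Tensor):
        # h_x: (batch, dim) query representation; h_V: (batch, n_nodes, dim) node reps.
        p_v = torch.softmax(self.node_gate(h_V.mean(dim=1)), dim=-1)           # (batch, n_nodes)
        node_out = torch.stack(
            [E(h_V[:, i]) for i, E in enumerate(self.node_experts)], dim=1)    # (batch, n_nodes, dim)
        h_G = (p_v.unsqueeze(-1) * node_out).sum(dim=1)                        # MOE-V output

        p_gx = torch.softmax(self.gx_gate(h_x), dim=-1)                        # (batch, 2)
        h_y = p_gx[:, :1] * self.graph_expert(h_G) + p_gx[:, 1:] * self.query_expert(h_x)
        return self.classifier(h_y), p_v, p_gx

model = HierarchicalMoE(dim=768)
logits, p_v, p_gx = model(torch.randn(2, 768), torch.randn(2, 5, 768))
```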

Experiments
In this section, we empirically investigate whether CURIOUS can improve defeasible inference by first modeling the question scenario using inference graphs, and we study the reasons for any improvements. Datasets Our end-task performance is measured on the three existing defeasible inference datasets provided by Rudinger et al. (2020): δ-ATOMIC, δ-SNLI, and δ-SOCIAL (Table 1). These datasets exhibit substantial diversity in their domains: δ-SNLI (natural language inference), δ-SOCIAL (reasoning about social norms), and δ-ATOMIC (commonsense reasoning). Performing well across all of them therefore requires a general model.

Baselines and setup
The previous state-of-the-art (SOTA) is the RoBERTa model presented in Rudinger et al. (2020); we report the published numbers for this baseline. For an additional baseline, we directly use the initial inference graph G generated by GEN init and provide it to the model simply as a string (i.e., a sequence of tokens; a simple, often-used approach). We call this baseline E2E-STR. We use the same hyperparameters as Rudinger et al. (2020) and give a detailed description of the hyperparameters in Appendix §C. For all QA experiments, we report accuracy on the test set using the checkpoint with the highest accuracy on the development set. We use McNemar's test (McNemar, 1947; Dror et al., 2018) with p < 0.05 as the threshold for statistical significance. All p-values are reported in Appendix §G.
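For reference, McNemar's test on paired predictions from two models can be computed as in the sketch below (using statsmodels); the toy labels and predictions are made up for illustration.

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def mcnemar_pvalue(gold, preds_a, preds_b):
    """McNemar's test on the 2x2 table of (model A correct, model B correct) counts."""
    a_ok = np.array(preds_a) == np.array(gold)
    b_ok = np.array(preds_b) == np.array(gold)
    table = [
        [np.sum(a_ok & b_ok),  np.sum(a_ok & ~b_ok)],
        [np.sum(~a_ok & b_ok), np.sum(~a_ok & ~b_ok)],
    ]
    return mcnemar(table, exact=False, correction=True).pvalue

# Toy illustration with made-up labels and predictions.
gold = [1, 0, 1, 1, 0, 1, 0, 1]
print(mcnemar_pvalue(gold, [1, 0, 1, 0, 0, 1, 0, 1], [1, 1, 1, 1, 0, 0, 1, 1]))
```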

Results
Table 2 compares QA accuracy on these datasets without and with modeling the question scenario.
The results show consistent gains across all datasets, with δ-SNLI gaining about 4 points. CURIOUS achieves a new state-of-the-art on all three datasets, while also producing inference graphs as justifications for its answers.

Understanding CURIOUS gains
In this section, we study the contribution of the main components of the CURIOUS pipeline.

Impact of graph corrector
We ablate the graph corrector module GEN corr in CURIOUS by directly supplying the output of GEN init to the graph encoder. Table 3 shows that this ablation consistently hurts performance across all the datasets; GEN corr provides a gain of about 2 points. This indicates that better graphs lead to better task performance, assuming that GEN corr actually reduces the noise. We therefore next investigate whether GEN corr produces more informative graphs. Table 4 shows that GEN corr does reduce repetitions, by approximately 40% per graph (from 2.11 to 1.25) across all datasets, and also reduces the fraction of graphs with at least one repetition by 25.7%.
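The repetition statistics reported in Table 4 can be reproduced with a simple check over node texts; the sketch below uses exact string matching, whereas the actual analysis may also count near-exact duplicates.

```python
from collections import Counter

def repetition_stats(graphs):
    """graphs: list of graphs, each a dict mapping a node name (e.g. 'C+') to its text.

    Returns (average repeated nodes per graph, fraction of graphs with >= 1 repetition).
    """
    repeats, with_repeat = [], 0
    for g in graphs:
        counts = Counter(text.strip().lower() for text in g.values())
        n_rep = sum(c - 1 for c in counts.values() if c > 1)
        repeats.append(n_rep)
        with_repeat += int(n_rep > 0)
    return sum(repeats) / len(graphs), with_repeat / len(graphs)

# Toy example: one graph with a repeated node, one without.
graphs = [
    {"C+": "glass is fragile", "C-": "glass is fragile", "S-": "it falls on a pillow"},
    {"C+": "glass is fragile", "C-": "glass is sturdy", "S-": "it falls on a pillow"},
]
print(repetition_stats(graphs))   # -> (0.5, 0.5)
```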

Impact of graph encoder
To compare with our MoE approach, we experiment with two alternative graph encoders, using the graphs generated by GEN corr:

1. Graph convolutional networks: We follow the approach of Lv et al. (2020), who use a GCN (Kipf and Welling, 2017) to learn rich node representations from graphs. Broadly, node representations are initialized by L and then refined using a GCN. Finally, multi-headed attention (Vaswani et al., 2017) between the question representation h x and the node representations is used to yield h G. We add a detailed description of this method in Appendix §H.

2. String-based representation: Another popular approach (Sakaguchi et al., 2021) is to concatenate the string representations of the nodes and then use L to obtain the graph representation h G = L(v 1 ∥ v 2 ∥ ...), where ∥ denotes string concatenation.

Table 5 shows that the MoE graph encoder significantly improves end-task performance compared to these alternatives. In the following analysis, we study the reasons for these gains in depth.
We hypothesize that the GCN is less resistant to noise than the MoE in our setting, causing its lower performance. The graphs augmenting each query are not human-curated; they are generated by a language model in a zero-shot inference setting. GCN-style message passing might therefore amplify the noise in the graph representations. In contrast, MOE-V first selects the most useful nodes for answering the query to form the graph representation h G. Further, MOE-GX can decide to discard the graph representation completely, as it does in many cases where the true answer for the defeasible query is weakens.
To further establish that message passing can hamper downstream task performance, we experiment with a GCN-MoE hybrid, wherein we first refine the node representations using a 2-layer GCN as in Lv et al. (2020) and then pool the node representations using an MoE. The results are about the same as those obtained with the GCN (third row of Table 5), indicating that poor node representations are indeed the root cause of the GCN's lower performance. This is also supported by Shi et al. (2019), who found that noise propagation directly deteriorates network embeddings and that GCNs are sensitive to noise.
Interestingly, graphs help the end-task even when encoded using a relatively simple STR based encoding scheme, further establishing their utility.

Detailed MoE analysis
We now analyze the two MoEs used in CURIOUS: (i) the MoE over the nodes (MOE-V), and (ii) the MoE over G and the input x (MOE-GX). MOE-GX relies more on the graph for y = strengthens: Figure 5 shows that the graph makes a stronger contribution than the input when the label is strengthens. For instances where the label is weakens, the gate of MOE-GX gives a higher weight to the question. This trend holds across all the datasets. We conjecture that this happens because language models are tuned to generate events that happen rather than events that do not: in the case of a weakener, the nodes must be of the type "event1 leads to less of event2", whereas language models are naturally trained for "event1 leads to event2". Understanding this in depth requires further investigation in the future.
MOE-V relies more on specific nodes: We study the distribution over the types of nodes and their contribution to MOE-V. Recall from Figure 3 that the C- and C+ nodes are contextualizers that provide more background context to the question, the S- node is typically an inverse situation (i.e., the inverse of S), and M- and M+ are the mediator nodes leading to the hypothesis. Figure 6 shows that the situation node S- was the most important, followed by the contextualizers and the mediators. Notably, our analysis shows that mediators are less important for machines than they were for humans in the experiments conducted by Madaan et al. (2021). This is probably because humans and machines use different pieces of information: as our error analysis shows in §5, the mediators can be redundant given the query x. Humans might have used this redundancy to reinforce their beliefs, whereas machines leverage the unique signals present in S- and the contextualizers.

MOE-V, MOE-GX have a peaky distribution:
A peaky distribution over the gate values implies that the network is judiciously selecting the right expert for a given input. We compute the average entropy of the gate distributions of MOE-V and MOE-GX and find the values to be 0.52 (max 1.61) for MOE-V and 0.08 (max 0.69) for MOE-GX. The distribution of the gate values of MOE-V is thus relatively flat, indicating that the specialization of the node experts might have some room for improvement (additional discussion in Appendix §B). Analogous to scene-graph-based explanations in visual QA (Ghosh et al., 2019), peaky distributions over nodes can be seen as an explanation in terms of supporting nodes.
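The entropy numbers above can be computed directly from the gate distributions, e.g. with the small helper below (entropies in nats, so the maxima are log 5 ≈ 1.61 for MOE-V and log 2 ≈ 0.69 for MOE-GX).

```python
import torch

def mean_gate_entropy(gate_probs: torch.Tensor) -> float:
    """Average entropy (in nats) of gate distributions of shape (num_examples, num_experts)."""
    eps = 1e-12
    entropy = -(gate_probs * (gate_probs + eps).log()).sum(dim=-1)
    return entropy.mean().item()

# Toy example: a fairly peaky two-expert gate.
print(mean_gate_entropy(torch.tensor([[0.98, 0.02], [0.95, 0.05]])))
```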
MOE-V learns the node semantics: The network learned the semantic grouping of the nodes (contextualizers, situation, mediators), which became evident when plotting the correlation between the gate weights. As Figure 7 shows, there is a strong negative correlation between the situation nodes and the context nodes, indicating that only one of them is activated at a time.

Error analysis
Table 6 shows that CURIOUS is able to correct several examples that were previously answered incorrectly. When CURIOUS corrects previously failing cases, MOE-V relies more on mediators: the average mediator probability rises from 0.09 to 0.13, averaged over the datasets. CURIOUS still fails on some examples, and more concerning are the cases where previously successful examples now fail. To study this, we annotate 50 random dev samples over the three datasets (26/24 examples with the weakens/strengthens label). For each sample, a human annotated whether the graph had errors. We observe the following error categories (concrete examples are in Appendix §F):

• All nodes off-topic (4%): The graph nodes were not on topic. This (rarely) happens when CURIOUS cannot distinguish the sense of a word in the input question. For instance, for S = "there is a water fountain in the center", CURIOUS generated a graph based on the incorrect word sense of a natural water spring.
• Repeated nodes (20%): These may be exact or near-exact matches. Node pairs with similar effects tend to be repeated in some samples; e.g., the S- node is often repeated with the contextualizer C-, perhaps because these nodes affect the rest of the graph in a similar way.

• Mediators are uninformative (34%): The mediating nodes are not correct or not informative. One source of these errors is when H and S are nearly connected by a single hop, e.g., H = "personX pees" and S = "personX drank a lot of water previously".

• Good graphs are ineffective (42%): These graphs contained the information required to answer the question, but the gating MoE mostly ignored the graph. This can be attributed in part to the observation from the histogram in Figure 5 that samples with the weakens label disproportionately ignore the graph.
In accordance with the findings of Rudinger et al. (2020), the maximum percentage of errors was in δ-ATOMIC, in part due to low question quality.

Explainability
In this section, we analyze the explainability of the CURIOUS model. Jacovi and Goldberg (2020) note that an explanation should aim towards two complementary goals: (i) Plausibility: provide an interpretation of system outputs that is convincing for humans, and (ii) Faithfulness: capture the actual reasoning process of a model. We discuss how our approach takes a step towards addressing both goals.
Plausibility In prior work, Madaan et al. (2021) show that human annotators selectively picked parts of the graph that explained a model decision and helped them improve on the task of defeasible reasoning. As we show in §4.3.3, the MoE gate values give insight into the part of the graph (contextualizer, mediator, situation node) that the model leveraged to answer a query. Our model thus produces a reasoning chain similar to the explanations that humans understand, providing a step towards inherently plausible models while also achieving better performance.
Measuring faithfulness w.r.t. graphs Since faithfulness is a widely debated term, we restrict its definition to faithfulness w.r.t. the reasoning graph. This can be measured by the correlation between model performance and graph correctness: a high correlation implies that the model uses both the graph and the query to generate an answer, and is thus faithful to the stated reasoning mechanism (i.e., graphs used to answer a question). Our analysis reveals this to be the case: when the model answers incorrectly, only 42% of the graphs were entirely correct (§5), whereas when the model answers correctly, 82% of the graphs are correct. In summary, we hope that CURIOUS serves as a step towards building reasoning models that are both plausible and faithful.

Related work
Mental Models Cognitive science has long promoted mental models - coherent, constructed representations of the world - as central to understanding, communication, and problem-solving (Johnson-Laird, 1983; Gentner and Stevens, 1983; Hilton, 1996). Our work draws on these ideas, using inference graphs to represent the machine's "mental model" of the problem at hand. Building the inference graph can be viewed as first asking clarification questions about the context before answering. This is similar to self-talk (Shwartz et al., 2020), but directed towards eliciting chains of influence.
Injecting Commonsense Knowledge Many prior systems use commonsense knowledge to aid question answering, e.g., using sentences retrieved from a corpus (Yang et al., 2019; Guu et al., 2020) or knowledge generated from a separate source (Shwartz et al., 2020; Bosselut et al., 2019), injected either as extra sentences fed directly to the model (Clark et al., 2020), via the loss function (Tandon et al., 2018), or via attention (Ma et al., 2019). Unlike prior work, we use conditional language generation techniques to create graphs that are relevant to answering a question.
Encoding Graph Representations Several existing methods use graphs as an additional input for commonsense reasoning (Sun et al., 2018; Lin et al., 2019; Lv et al., 2020; Feng et al., 2020; Bosselut et al., 2021; Ma et al., 2021; Kapanipathi et al., 2020). These methods first retrieve a graph relevant to a question using information retrieval techniques and then encode it using graph representation techniques such as GCNs (Kipf and Welling, 2017) and graph attention (Velickovic et al., 2018). Different from these works, we use a graph generated from the query itself to answer the commonsense question. The graphs consumed by these works contain entities grounded in knowledge graphs like ConceptNet (Speer et al., 2017), whereas we reason over event inference graphs in which each node describes an event. Our best model uses a mixture-of-experts (MoE) (Jacobs et al., 1991) model to pool multi-faceted input. Prior work has shown the effectiveness of MoE for graph classification (Zhou and Luo, 2019; Hu et al., 2021), cross-lingual language learning (Chen et al., 2019; Gu et al., 2018), and model ensemble learning (Fedus et al., 2021; Shazeer et al., 2017). To the best of our knowledge, we are the first to use MoE for learning and pooling graph representations for a QA task.

Summary and Conclusion
Cognitive science suggests that people form "mental models" of a situation to answer questions about it. Drawing on those ideas, we have presented a simple instantiation in which the situational model is an inference graph. Different from the GCN-based models popular in graph learning, we use a mixture-of-experts to pool graph representations. Our experiments show that MoE-based pooling can be a strong alternative to GCNs, in terms of both performance and explainability, for graph-based learning on reasoning tasks. Our method establishes a new state-of-the-art on three defeasible reasoning datasets. Overall, our method shows that performance can be improved by guiding a system to "think about" a question and explicitly model the scenario, rather than answering reflexively.

A Data augmentation for training GEN corr
We refer to this data generator as GEN * init, and the graphs produced by it as G *. However, we do not have access to y at test time, and thus GEN * init cannot be used directly to produce G * for defeasible queries. We circumvent this by using GEN * init to train a graph-to-graph generation model that takes G as input and generates G * as output (G → G *). We call this system GEN corr. We give an overview of the process in Figure 8. In Figure 9, we give examples of an initial graph produced by GEN init, the corresponding graph produced by GEN * init, and the graph produced by GEN corr.

B MoE gradient analysis
We restate Equation 2 for quick reference, changing the notation slightly to use o as the MoE output instead of y and abbreviating E_i(x) as E_i:

o = Σ_{i=1}^{n} p_i E_i,  so that  o_j = Σ_{i=1}^{n} p_i E_{ij}.

We present the analysis for a generic multi-class classification setting with k classes, with training done using a cross-entropy loss L (Figure 10). Let ŷ_c be the normalized probability of the correct class c, calculated using a softmax:

ŷ_c = exp(o_c) / Σ_{j=1}^{k} exp(o_j),

and let L be the cross-entropy loss:

L = -log ŷ_c.

The derivative w.r.t. the m-th expert gate probability p_m is given by:

∂L/∂p_m = Σ_{j=1}^{k} (ŷ_j - y_j) E_{mj} = (ŷ_c - 1) E_{mc} + Σ_{j≠c} ŷ_j E_{mj},    (5)

and the derivative w.r.t. the logit E_{mc} (the logit for the correct class from the m-th expert) is given by:

∂L/∂E_{mc} = p_m (ŷ_c - 1).    (6)

Equations 5 and 6 have natural interpretations: the gradient on both the mixture probability p_m and the logit E_{mc} will be 0 when the network makes perfect predictions (ŷ_c = 1; note that for Equation 5, y_c = 1 implies y_j = 0 for j ≠ c). As noted by Jacobs et al. (1991) (Section 1), this might cause the network to specialize slower, as the gradient will be small for experts that are helping in making the correct prediction. They suggest a different loss function that promotes faster specialization by redefining the error function in terms of a mixture distribution, with the mixture weights supplied by the p_i terms. Analyzing the effect of the loss function for applications where the MoE is used to pool representations remains an interesting direction for future work.
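The two derivatives above can be verified numerically against PyTorch autograd; the small check below is illustrative and not part of the original analysis.

```python
import torch

torch.manual_seed(0)
n_experts, n_classes, c = 3, 4, 2            # c is the index of the correct class

E = torch.randn(n_experts, n_classes, requires_grad=True)                        # expert logits E_{ij}
p = torch.softmax(torch.randn(n_experts), dim=0).detach().requires_grad_(True)   # gate probabilities

o = (p.unsqueeze(-1) * E).sum(dim=0)         # o_j = sum_i p_i E_{ij}
loss = torch.nn.functional.cross_entropy(o.unsqueeze(0), torch.tensor([c]))
loss.backward()

with torch.no_grad():
    y_hat = torch.softmax(o, dim=0)
    y = torch.zeros(n_classes)
    y[c] = 1.0
    # dL/dp_m = sum_j (y_hat_j - y_j) E_{mj}  and  dL/dE_{mc} = p_m (y_hat_c - 1)
    assert torch.allclose(p.grad, (E * (y_hat - y)).sum(dim=1), atol=1e-6)
    assert torch.allclose(E.grad[:, c], p * (y_hat[c] - 1.0), atol=1e-6)
print("analytic gradients match autograd")
```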

C Hyperparameters
Training details All of our experiments were done on a single Nvidia GeForce RTX 2080 Ti. We base our implementation on PyTorch (Paszke et al., 2017) and also use PyTorch Lightning (Falcon, 2019) and Huggingface (Wolf et al., 2019). The gates and the experts in our MoE model are single-layer MLPs; for the experts, the input size is set to be the same as the output size. Table 7 shows the hyperparameters shared by all the methods, and Table 8 shows the hyperparameters applicable to the GCN encoder.

D Influence graph skeleton
Figure 11 shows the skeleton of an influence graph.

E Runtime Analysis
Finally, we discuss the cost-performance trade-offs of the various encoding mechanisms (Table 9). As Table 9 shows, both GCN and MoE use about 7% more parameters than the STR encoding scheme and have about 2x its runtime. Further, as we use one expert per node, the number of parameters scales linearly with the number of nodes. While this is not prohibitive in our setting (each graph has a small number of nodes), our analysis shows that the behavior of nodes with similar semantics is correlated, indicating that the experts for those nodes could share parameters. Alternatively, an MoE with more than two layers (Jordan and Xu, 1995) can help the number of parameters scale only logarithmically with the number of nodes.
Method    STR    GCN    MoE
#Params   124M   131M   133M
Runtime   0.17   0.47   0.40

Table 9: Number of parameters in the different encoding methods. Runtime is the number of seconds taken to process one training example.
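The linear growth in parameters with the number of node experts is easy to see from a back-of-the-envelope count; the helper below assumes single-layer linear experts and gate with a hidden size of 768 (RoBERTa-base), purely for illustration.

```python
def moe_param_count(n_experts: int, dim: int = 768) -> int:
    """Toy parameter count for an MoE with single-layer linear experts and a linear gate."""
    experts = n_experts * (dim * dim + dim)   # weight + bias per expert
    gate = dim * n_experts + n_experts        # gate weight + bias
    return experts + gate

for n in (5, 10, 20):
    print(n, moe_param_count(n))              # grows linearly in the number of experts
```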

F Error Analysis Examples
We show three examples with different types of errors. These examples are taken from the dev sets, for cases where CURIOUS produced a wrong answer while the baseline (without the graph) answered correctly.
• Figure 12 shows a failure case where a good graph is unused (example from the δ-ATOMIC dev set).
• Figure 13 shows a failure case where an off-topic graph is produced due to confusion about the sense of "water fountain" (example from the δ-SNLI dev set).
• Figure 14 shows a failure case where the mediator is wrong (example from the δ-SOCIAL dev set).

G Significance Tests
We perform two statistical tests to verify our results: i) the micro-sign test (s-test) (Yang and Liu, 1999), and ii) McNemar's test (McNemar, 1947), as described in the experimental setup.