Explainable Inference Over Grounding-Abstract Chains for Science Questions

We propose an explainable inference approach for science questions by reasoning on grounding and abstract inference chains. This paper frames question answering as a natural language abductive reasoning problem, constructing plausible explanations for each candidate answer and then selecting the candidate with the best explanation as the final answer. Our method, ExplanationLP, elicits explanations by constructing a weighted graph of relevant facts for each candidate answer and employs a linear programming formalism designed to select the optimal subgraph of explanatory facts. The graphs' weighting function is composed of a set of parameters targeting relevance, cohesion and diversity, which we fine-tune for answer selection via Bayesian Optimisation. We carry out our experiments on the WorldTree and ARC-Challenge datasets to empirically demonstrate the following contributions: (1) ExplanationLP obtains strong performance when compared to transformer-based and multi-hop approaches despite having a significantly lower number of parameters; (2) We show that our model is able to generate plausible explanations for answer prediction; (3) Our model demonstrates better robustness towards semantic drift when compared to transformer-based and multi-hop approaches.


Introduction
Answering science questions remains a fundamental challenge in Natural Language Processing and AI as it requires complex forms of inference, including causal, model-based and example-based reasoning (Clark et al., 2018; Jansen et al., 2016; Clark et al., 2013). Current state-of-the-art (SOTA) approaches for answering questions in the science domain are dominated by transformer-based models (Devlin et al., 2019; Sun et al., 2019). Despite remarkable performance on answer prediction, these approaches are black-box by nature, lacking the capability of providing explanations for their predictions (Miller, 2019; Biran and Cotton, 2017; Jansen et al., 2016).
Explainable Science Question Answering (XSQA) is often framed as a natural language abductive reasoning problem (Khashabi et al., 2018; Jansen et al., 2017). Abductive reasoning represents a distinct inference process, known as inference to the best explanation (Peirce, 1960; Lipton, 2017), which starts from a set of complete or incomplete observations to find the hypothesis, from a set of plausible alternatives, that best explains the observations. Several approaches (Khashabi et al., 2018; Jansen et al., 2017; Khot et al., 2017a) employ this form of reasoning for multiple-choice science questions to build a set of plausible explanations for each candidate answer and select the one with the best explanation as the final answer. XSQA solvers typically treat explanation generation as a multi-hop graph traversal problem. Here, the solver attempts to compose multiple facts that connect the question to a candidate answer. These multi-hop approaches have shown diminishing returns with an increasing number of hops. Fried et al. (2015) conclude that this phenomenon is due to semantic drift, i.e., as the number of aggregated facts increases, so does the probability of drifting out of context. Khashabi et al. (2019) propose a theoretical framework, empirically supported by Fried et al. (2015), attesting that ongoing efforts with very long multi-hop reasoning chains are unlikely to succeed, emphasising the need for a richer representation with fewer hops and higher importance to abstraction and grounding mechanisms.
Consider the example in Figure 1A, where the central concept the question examines is the understanding of friction.

[Figure 1: Overview of our approach: (A) Depicts a question, answer and formulated hypothesis along with the set of facts retrieved from a fact retrieval approach. (B) Illustrates the optimisation process behind extracting explanatory facts for the provided hypothesis and facts. (C) Details the end-to-end architecture diagram.]

Here, an inference solver's
challenge is to identify the core scientific facts (Abstract Facts) that best explain the answer. To achieve this goal, a QA solver should first be able to go from force to friction, stick to object and rubbing together to move against. These are the Grounding Facts that link generic or abstract concepts in a core scientific statement to specific terms occurring in the question and candidate answer. The grounding process is followed by the identification of the abstract facts about friction. A complete explanation for this question would require the composition of five facts to derive the correct answer successfully. However, it is possible to reduce the global reasoning to two hops, modelling it with grounding and abstract facts. In line with these observations, this work presents a novel approach that explicitly models abstract and grounding mechanisms. The contributions of the paper are:
1. We present a novel approach that performs natural language abductive reasoning via grounding-abstract chains combining Linear Programming with Bayesian optimisation for science question answering (Section 2).
2. We obtain performance comparable to transformers, multi-hop approaches and previous Linear Programming models despite having a significantly lower number of parameters (Section 3.1).
3. We demonstrate that our model can generate plausible explanations for answer prediction (Section 3.2) and validate the importance of grounding-abstract chains via ablation analysis (Section 3.3).

ExplanationLP: Abductive Reasoning with Linear Programming
ExplanationLP answers and explains multiple-choice science questions via abductive natural language reasoning. Specifically, the task of answering multiple-choice science questions is reformulated as the problem of finding the candidate answer that is supported by the best explanation. For each question Q and candidate answer c i ∈ C, ExplanationLP converts the pair into a hypothesis h i and attempts to construct a plausible explanation. Figure 1C illustrates the end-to-end framework. From an initial set of facts selected using a retrieval model, ExplanationLP constructs a fact graph where each node is a fact, and the nodes and edges are scored according to three properties: relevance, cohesion and diversity. Subsequently, an optimal subgraph is extracted using Linear Programming, whose role is to select the best subset of facts while preserving structural constraints imposed via grounding-abstract chains. The subgraphs' global scores, computed by summing up the node and edge scores, are adopted to select the final answer. Since the subgraph scores depend on the sum of node and edge scores, each property is multiplied by a learnable weight which is optimised via Bayesian Optimisation to obtain the combination with the highest accuracy for answer selection. To the best of our knowledge, we are the first to combine a parameter optimisation method with Linear Programming for inference. The rest of this section describes the model in detail.
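The end-to-end selection loop can be sketched as follows. This is an illustrative Python skeleton, not the authors' code: stub functions such as `retrieve_facts`, `build_fact_graph` and `solve_lp` are hypothetical stand-ins for the retrieval model, the graph scorer and the LP solver.

```python
# Illustrative sketch of the ExplanationLP answer-selection loop.
# All function names below are hypothetical stand-ins, not the paper's API.

def retrieve_facts(hypothesis):
    # Placeholder for the upstream retrieval model (e.g. BM25):
    # returns (fact, type, relevance-score) triples.
    return [("friction is a force", "abstract", 0.9),
            ("a stick is an object", "grounding", 0.6)]

def build_fact_graph(hypothesis, facts, weights):
    # Score each node; in the full model, relevance, cohesion and diversity
    # scores are each multiplied by a learnable weight.
    return {fact: weights["relevance"] * score for fact, _, score in facts}

def solve_lp(graph, k=2):
    # Stand-in for the LP solver: keep the k highest-weighted facts.
    return dict(sorted(graph.items(), key=lambda kv: -kv[1])[:k])

def answer(question, candidates, weights):
    # Select the candidate whose optimal explanation subgraph scores highest.
    scores = {}
    for cand in candidates:
        hypothesis = f"{question} -> {cand}"   # hypothesis conversion (simplified)
        graph = build_fact_graph(hypothesis, retrieve_facts(hypothesis), weights)
        scores[cand] = sum(solve_lp(graph).values())
    return max(scores, key=scores.get)
```

In the full model the weights passed to `build_fact_graph` are exactly the parameters tuned by Bayesian Optimisation.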

Relevant facts retrieval
Given a question (Q) and candidate answers C = {c 1 , c 2 , c 3 , ..., c n } we convert them to hypotheses {h 1 , h 2 , h 3 , ..., h n } using the approach proposed by Demszky et al. (2018). For each hypothesis h i we adopt fact retrieval approaches (e.g., BM25, Unification retrieval (Valentino et al., 2021)) to select the top m relevant abstract facts {f 1 , ..., f m } from a knowledge base containing abstract facts (Abstract Facts KB), and the top l relevant grounding facts {f 1 , ..., f l } from a knowledge base containing grounding facts (Grounding Facts KB), where each grounding fact connects at least one abstract fact with the hypothesis.
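A simplified lexical retrieval step in the spirit of BM25 can be sketched as below. This is a toy, dependency-free scorer for illustration, not the exact implementation used in the paper: facts sharing more rare terms with the hypothesis receive higher scores.

```python
import math
from collections import Counter

# Toy BM25-style scorer (a sketch, not the paper's retrieval model).
def bm25_scores(hypothesis, facts, k1=1.5, b=0.75):
    tokenized = [f.lower().split() for f in facts]
    n = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n
    # Document frequency of each term across the fact collection.
    df = Counter(term for t in tokenized for term in set(t))
    scores = []
    for terms in tokenized:
        tf = Counter(terms)
        score = 0.0
        for q in hypothesis.lower().split():
            if q not in tf:
                continue
            idf = math.log(1 + (n - df[q] + 0.5) / (df[q] + 0.5))
            score += idf * tf[q] * (k1 + 1) / (
                tf[q] + k1 * (1 - b + b * len(terms) / avgdl))
        scores.append(score)
    return scores

facts = ["friction is a force",
         "a stick is an object",
         "a pull is a force"]
scores = bm25_scores("force producing heat between sticks", facts)
```

The top m abstract and top l grounding facts would then simply be the highest-scoring entries from each knowledge base.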

Fact graph construction
For each hypothesis h i we build a weighted undirected graph G = (V, E), where each node and edge weight incorporates a learnable parameter optimised via Bayesian optimisation. The model scores the nodes and edges based on the following three properties (see Figure 1B): (1) Relevance: We promote the inclusion of highly relevant facts in the explanations by encouraging the selection of sentences with higher lexical relevance and semantic similarity with the hypothesis. We use the following scores to measure the relevance and the semantic similarity of the facts: Lexical Relevance score (L): Obtained from the upstream facts retrieval model (e.g., BM25 score/Unification score (Valentino et al., 2021)). Semantic Similarity score (S): Cosine similarity obtained from neural sentence representation models. For our experiments, we adopt Sentence-BERT (Reimers et al., 2019) since it shows state-of-the-art performance in semantic textual similarity tasks.
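The semantic similarity score is a cosine similarity over sentence embeddings. As a dependency-free stand-in for Sentence-BERT, the sketch below computes cosine similarity over bag-of-words vectors; swapping the `Counter` vectors for neural sentence embeddings yields the score used in the paper.

```python
import math
from collections import Counter

# Cosine similarity over sparse vectors; the paper uses Sentence-BERT
# embeddings instead of these bag-of-words counts.
def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def semantic_similarity(a, b):
    return cosine(Counter(a.lower().split()), Counter(b.lower().split()))
```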
(2) Cohesion: Explanations should be cohesive, implying that grounding-abstract chains should remain within the same context. To achieve cohesion, we encourage a high degree of overlap between different hops (e.g. hypothesis-grounding, grounding-abstract, hypothesis-abstract) to prevent the inference chains from drifting away from the original context. The overlap across two hops is quantified using the following scoring function: Cohesion score (C): We denote the set of unique terms of a given fact f j as t(f j ), after lemmatisation and stopword removal. The cohesion score of two facts f j and f k is computed from their shared terms t(f j ) ∩ t(f k ). Therefore, the higher the number of term overlaps, the higher the cohesion score.
(3) Diversity: While maximizing relevance and cohesion between different hops, we encourage diversity between facts of the same type (e.g. abstract-abstract, grounding-grounding) to address different parts of the hypothesis and promote completeness in the explanations. We measure diversity via the following function: Diversity score (D): We denote the overlap between hypothesis h i and a fact f j as t(h i ) ∩ t(f j ). The diversity score of two facts f j and f k rewards overlaps with different parts of the hypothesis. The goal is to maximise diversity and avoid redundant facts in the explanations. Therefore, if two facts overlap with different parts of the hypothesis, they will have a higher diversity score compared to two facts that overlap with the same part.
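One plausible instantiation of the cohesion and diversity scores is sketched below. The exact normalisation used in the paper is not reproduced here, and the term extractor approximates lemmatisation with a tiny hand-picked stopword list; both are assumptions for illustration.

```python
# Plausible instantiation of cohesion/diversity scoring (assumed formulas,
# not the paper's exact normalisation).
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "means"}

def t(fact):
    # Approximation of t(f): unique terms minus stopwords (no lemmatizer).
    return {w for w in fact.lower().split() if w not in STOPWORDS}

def cohesion(f_j, f_k):
    # More shared terms across hops -> higher cohesion.
    shared = t(f_j) & t(f_k)
    return len(shared) / max(len(t(f_j)), len(t(f_k)), 1)

def diversity(hypothesis, f_j, f_k):
    # Facts overlapping *different* parts of the hypothesis are rewarded.
    o_j = t(hypothesis) & t(f_j)
    o_k = t(hypothesis) & t(f_k)
    union = o_j | o_k
    return 1 - len(o_j & o_k) / len(union) if union else 0.0
```

For example, two facts that both only touch the term "force" in the hypothesis receive diversity 0, while facts covering "force" and "heat" respectively receive diversity 1.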

Subgraph extraction with Linear Programming (LP) optimisation
The construction of the explanation graph has to be optimised for the downstream answer selection task. Specifically, from the whole set of facts retrieved by the upstream retrieval models, we need to select the optimal subgraph that maximises the performance of answer prediction. To achieve this goal, we adopt a Linear Programming approach. The selection of the explanation graph is framed as a rooted maximum-weight connected subgraph problem with a maximum number of K vertices (R-MWCS K ). This formalism is derived from the generalized maximum-weight connected subgraph problem (Loboda et al., 2016). R-MWCS K has two parts: an objective function to be maximized and constraints to build a connected subgraph of explanatory facts. The formal definition of the objective function is as follows: Definition 1. Given a connected undirected graph G = (V, E) with edge-weight function ω e : E → IR, node-weight function ω v : V → IR, root vertex r ∈ V and expected number of vertices K, the rooted maximum-weight connected subgraph problem with K vertices (R-MWCS K ) is the problem of finding the connected subgraph Ĝ = (V̂, Ê) such that r ∈ V̂, |V̂| ≤ K and the total weight Σ v∈V̂ ω v (v) + Σ e∈Ê ω e (e) is maximal, where the weights are parameterised by θ vw , θ ew ∈ [0, 1], learnable parameters optimized via Bayesian optimisation. The LP solver will seek to extract the optimal subgraph with the highest possible sum of node and edge weights. Since the solver seeks to obtain the highest possible score, it will avoid negative edges and will prioritise high-value positive edges, resulting in higher diversity, cohesion and relevance. We adopt the following binary variables to represent the presence of nodes and edges in the subgraph: 1. Binary variable y v takes the value of 1 iff v ∈ V h i belongs to the subgraph.
2. Binary variable z e takes the value of 1 iff e ∈ E h i belongs to the subgraph.
In order to emulate the grounding-abstract inference chains and obtain a valid subgraph, we impose the set constraints described in Table 1 for the LP solver.
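The effect of the R-MWCS K objective and constraints can be illustrated on a toy graph. The sketch below brute-forces the search instead of solving an LP (which is only viable at toy scale), and implements just two of the constraints: the subgraph must contain the hypothesis root and be connected (chaining), and contain at most K abstract facts (abstract fact limit); the grounding neighbor constraint is omitted for brevity.

```python
from itertools import combinations

# Brute-force stand-in for the R-MWCS_K optimisation on a small graph.
# The paper solves this with an LP formulation; this exhaustive search is
# only an illustration of what the solver optimises.

def connected(nodes, edges):
    # Undirected reachability check over the induced edge set.
    if not nodes:
        return False
    seen, stack = set(), [next(iter(nodes))]
    while stack:
        v = stack.pop()
        if v in seen:
            continue
        seen.add(v)
        stack.extend(w for (u, w) in edges if u == v and w in nodes)
        stack.extend(u for (u, w) in edges if w == v and u in nodes)
    return seen == set(nodes)

def best_subgraph(node_w, edge_w, root, kinds, K):
    others = [v for v in node_w if v != root]
    best, best_score = {root}, node_w[root]
    for r in range(1, len(others) + 1):
        for combo in combinations(others, r):
            nodes = {root, *combo}
            if sum(kinds[v] == "abstract" for v in nodes) > K:
                continue  # abstract fact limit constraint
            edges = [e for e in edge_w if e[0] in nodes and e[1] in nodes]
            if not connected(nodes, edges):
                continue  # chaining constraint (rooted, connected)
            score = sum(node_w[v] for v in nodes) + sum(edge_w[e] for e in edges)
            if score > best_score:
                best, best_score = nodes, score
    return best, best_score
```

Note how a negative edge (e.g. a high-overlap but off-topic link) discourages the solver from including its endpoints, exactly as described above.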

Bayesian Optimisation for Answer Selection
Given question Q and choices C = {c 1 , c 2 , c 3 , ..., c n } we extract the optimal explanation graphs Ĝ Q = {Ĝ c 1 , Ĝ c 2 , Ĝ c 3 , ..., Ĝ c n } for each choice. We consider the hypothesis with the highest relevance, cohesion and diversity to be the correct answer. Based on this premise, we define the correct answer as the candidate whose optimal explanation graph obtains the highest global score. In order to automatically optimize the Linear Programming model (i.e., θ 1 , θ 2 , θ 3 ) we use Bayesian optimisation with a Gaussian Process (GP) surrogate over the answer-selection accuracy of the Linear Programming (LP) module.
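The parameter-tuning loop can be sketched as follows. For simplicity this sketch substitutes plain random search for the GP-based acquisition step (an assumption; the paper uses a Gaussian Process surrogate), but it keeps the same interface: propose θ, evaluate answer accuracy via the LP module, keep the best configuration.

```python
import random

# Stand-in for GP-based Bayesian optimisation of theta = (θ1, θ2, θ3).
# `evaluate_accuracy` is a hypothetical callback that runs the LP answer
# selector with the proposed parameters and returns dev-set accuracy.
def optimise_theta(evaluate_accuracy, n_trials=50, seed=0):
    rng = random.Random(seed)
    best_theta, best_acc = None, -1.0
    for _ in range(n_trials):
        theta = [rng.random() for _ in range(3)]  # sample θ1, θ2, θ3 in [0, 1]
        acc = evaluate_accuracy(theta)
        if acc > best_acc:
            best_theta, best_acc = theta, acc
    return best_theta, best_acc
```

A GP surrogate would replace the uniform sampling with an acquisition function that trades off exploration and exploitation, which matters because each accuracy evaluation requires solving one LP per candidate answer.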

Empirical Evaluation
Background Knowledge: We construct the required knowledge bases using the following sources.
(1) Abstract KB: Our Abstract knowledge base is constructed from the WorldTree Tablestore corpus (Xie et al., 2020). The Tablestore corpus contains a set of common sense and scientific facts adopted to create explanations for multiple-choice science questions. The corpus is built for answering elementary science questions, encouraging possible knowledge reuse to elicit explanatory patterns. We extract the core scientific facts to build the Abstract KB. Core scientific facts are independent from the specific questions and represent general scientific and commonsense knowledge, such as Actions (friction occurs when two objects' surfaces move against each other) or Affordances (friction causes the temperature of an object to increase).

[Table 1 constraints:] Chaining constraint: Equation 1 states that the subgraph should always contain the hypothesis node. Inequality 2 states that if a vertex is to be part of the subgraph, then at least one of its neighbors with a lexical overlap should also be part of the subgraph. Equation 1 and Inequality 2 restrict the LP method to construct explanations that originate from the hypothesis and perform multi-hop aggregation based on the existence of lexical overlap. Inequalities 3, 4 and 5 state that if two vertices are in the subgraph then the edges connecting the vertices should also be in the subgraph. These inequality constraints force the LP method to avoid grounding nodes with high overlap regardless of their relevance. Abstract fact limit constraint: Equation 6 limits the total number of abstract facts to K. Instead of limiting the total number of selected nodes to K, by limiting the abstract facts we dictate the need for grounding facts based on the number of terms present in the hypothesis and in the abstract facts. Grounding neighbor constraint: Inequality 7 states that if a grounding fact is selected, then at least two of its neighbors should be either both abstract facts or a hypothesis and an abstract fact. This constraint ensures that grounding facts play the linking role connecting hypothesis and abstract facts.
(2) Grounding KB: The grounding knowledge base consists of definitional knowledge (e.g., synonymy and taxonomy) that can account for the lexical variability of questions and help link them to abstract facts. To achieve this goal, we select the is-a and synonymy facts from ConceptNet (Speer et al., 2017) as our grounding facts. ConceptNet has high coverage and precision, enabling us to answer a wide variety of questions.
Question Sets: We use the following question sets to evaluate ExplanationLP's performance and compare it against other explainable approaches: (1) WorldTree Corpus: The 2,290 questions in the WorldTree corpus are split into three different subsets: train-set (987), dev-set (226) and test-set (1,077). We use the dev-set to assess the explainability performance and robustness analysis since the explanations for test-set are not publicly available.
(2) ARC-Challenge Corpus: ARC-Challenge is a multiple-choice question dataset which consists of questions from science exams from grade 3 to grade 9 (Clark et al., 2018). We only consider the Challenge set of questions. These questions have proven to be challenging to answer for other LP-based question answering and neural approaches. ExplanationLP relies only on the train-set (1,119) and is tested on the test-set (1,172). ExplanationLP does not require a dev-set, since the possibility of over-fitting is non-existent with only ten parameters.

Relevant Facts Retrieval (FR):
We experiment with two different fact retrieval scores. The first model, BM25 Retrieval, adopts a BM25 vector representation for the hypothesis and explanation facts. We apply this retrieval for both Grounding and Abstract retrieval. We use the IDF score from BM25 as our downstream model's relevance score. The second approach, Unification Retrieval (UR), represents the BM25 implementation of the Unification-based Reconstruction framework described in Valentino et al. (2021). The unification score for a given fact depends on how often the same fact appears in explanations for similar questions.
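The idea behind the unification score can be sketched as follows. This is a naive illustration of the principle from Valentino et al. (2021), not their implementation: question similarity is approximated here by raw term overlap, and the neighbourhood size `top_n` is an assumed parameter.

```python
from collections import Counter

# Sketch of the Unification-score idea: a fact scores higher the more often
# it appears in explanations of training questions *similar* to the input.
def unification_scores(question, train_set, top_n=2):
    def overlap(a, b):
        return len(set(a.lower().split()) & set(b.lower().split()))
    # Rank training questions by (naive) similarity to the input question...
    similar = sorted(train_set,
                     key=lambda qa: -overlap(question, qa["question"]))[:top_n]
    # ...then count how often each fact explains those neighbours.
    return Counter(f for qa in similar for f in qa["explanation"])
```

Facts that recur across the explanations of many similar questions (typically abstract scientific facts) thus dominate the ranking.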
Baselines: The following baselines are replicated on the WorldTree corpus to compare against ExplanationLP: (1) BERT-based models: We compare ExplanationLP's performance against a set of BERT baselines. The first baseline, BERT Base /BERT Large , is a standard BERT language model (Devlin et al., 2019) fine-tuned for multiple-choice question answering. Specifically, the model is trained for binary classification on each question-candidate answer pair to maximize the correct choice (i.e., predict 1) and minimize the wrong choices (i.e., predict 0). During inference, we select the choice with the highest prediction score as the correct answer. BERT baselines are further enhanced with explanatory facts retrieved by the retrieval models: BERT + BM25 and BERT + UR are fine-tuned for binary classification by complementing the question-answer pair with grounding and abstract facts selected by BM25 and Unification retrieval, respectively.
(2) PathNet (Kundu et al., 2019): PathNet is a neural approach that constructs a single linear path composed of two facts connected via entity pairs for reasoning. PathNet can also explain its reasoning via explicit reasoning paths, and has exhibited strong performance for multiple-choice science questions by composing two facts. Similar to the BERT-based models, we employ PathNet with the top k facts retrieved using Unification (PathNet + UR) and BM25 (PathNet + BM25) retrieval. We concatenate the facts retrieved for each candidate answer and provide them as supporting facts.
Further details regarding the hyperparameters and code used for each model, along with information concerning the knowledge base construction and dataset information, can be found in the Supplementary Materials.

Answer Selection
WorldTree Corpus: We retrieve the top l relevant grounding facts from the Grounding KB and the top m relevant abstract facts from the Abstract KB such that l + m = k and l = m. To ensure fairness across the approaches, the same amount of facts is presented to each model. We experimented with k = {10, 20, 30, 40, 50} and report the accuracy across the Easy and Challenge splits of the best performing setting in Table 2. We draw the following conclusions: (1) Despite having a smaller number of parameters to train (BERT Base : 110M parameters, BERT Large : 340M parameters, ExplanationLP: 9 parameters), the best performing ExplanationLP (#10) overall outperforms all the BERT Base and BERT Large models on both the Challenge and Easy splits. We outperform the best performing BERT model with facts (BERT Large (#6)) by 7.74% in Easy and 6.43% in Challenge. We also outperform the best performing BERT without facts (BERT Large (#2)) by 11.66% in Easy and 20.76% in Challenge.
(2) BERT is inherently a black-box model, making it impossible to fully explain its predictions. By contrast, ExplanationLP is fully explainable and produces a complete explanatory graph.
(3) Similar to ExplanationLP, PathNet is also explainable and demonstrates robustness to noise.
[Table 4: Representative explanation cases.]
Case I: All the selected facts are in the gold explanation (Frequency: 33%). Question: A company wants to make a game that uses a magnet that sticks to a board. Which material should it use for the board? Answer: steel. Explanations: (1) steel is a metal (Grounding); (2) if a magnet is attracted to a metal then that magnet will stick to that metal (Abstract); (3) a magnet attracts magnetic metals through magnetism (Abstract).
Case II: At least one selected fact is in the gold explanation (Frequency: 58%). Question: A large piece of ice is placed on the sidewalk on a warm day. What will happen to the ice? Answer: It will melt to form liquid water. Explanations: (1) drop is liquid small amount (Grounding); (2) forming something is change (Grounding); (3) ice wedging is mechanical weathering (Grounding); (4) melting means changing from a solid into a liquid by adding heat energy (Abstract); (5) weathering means breaking down surface materials from larger whole into smaller pieces by weather (Abstract).
Case III: No retrieved fact is in the gold explanation (Frequency: 9%). Question: Wind is a natural resource that benefits the southeastern shore of the Chesapeake Bay. How could these winds best benefit humans? Answer: The winds could be converted to electrical energy. Explanations: (1) renewable resource is natural resource (Grounding); (2) wind is a renewable resource (Abstract); (3) electrical devices convert electricity into other forms of energy (Abstract).

ExplanationLP also outperforms PathNet's best performing setting (#8) by 18.59% in Easy and 16.60% in Challenge.
(4) ExplanationLP consistently exhibits better scores with both BM25 and UR than BERT and PathNet, demonstrating that its performance does not depend on a specific upstream retrieval model.

ARC-Challenge:
We also evaluated our model on the ARC-Challenge corpus (Clark et al., 2018) to test ExplanationLP on a more extensive general question set and to compare it against contemporary approaches that provide explanations for their inference and have been trained only on the ARC corpus. Table 3 reports the results on the test-set. We compare ExplanationLP against published approaches that are fully or partly explainable. Here, explainability indicates whether the model produces an explanation/evidence for the predicted answer. A subset of the approaches produces evidence for the answer but remains intrinsically black-box; these models are marked as Partial.
As depicted in Table 3, we outperform the best performing fully explainable model (#4, TableILP) by 13.28%. We also outperform specific neural approaches with larger parameter sets (#5-#9) that provide explanations for their inference, as well as BERT (#1). Despite having a smaller number of training parameters, we also exhibit competitive performance with a state-of-the-art BERT-based approach (#10) that does not use external resources to train the QA system.

Table 5 shows the Precision, Recall and F1 (macro) scores for explanation retrieval for PathNet and ExplanationLP. These scores are computed using gold abstract explanations from the WorldTree corpus. We outperform PathNet by a significant margin across all metrics. Table 4 reports three representative cases that show how explanation generation relates to correct answer prediction. The first example (Case I) represents the situation in which all the selected sentences are annotated as gold explanations in the WorldTree corpus (dev-set). The second example (Case II) shows the case in which at least one sentence in the explanation is labelled as gold. Finally, the third example (Case III) represents the case in which the explanation generated by the method does not contain any gold fact. We observe that Case I and Case II cover over 91% of the questions, demonstrating that the correct answers are mostly derived from plausible explanations.

Ablation Study
In order to understand the contribution lent by different components, we choose the best setting (WorldTree: ExplanationLP + UR (k=30) and ARC: ExplanationLP + BM25 (k=40)) and drop different components to perform an ablation analysis. We retain the rest of the configuration after removing each component. The results are summarized in Table 6.
(1) The grounding-abstract chains (#2) play a significant role, particularly in the reasoning mechanism on a challenging question set like ARC-Challenge.
(2) As observed in #3 and #4, removing node weights and edge weights leads to a dramatic drop in performance. This drop indicates that both are fundamental for the final prediction, highlighting the role of the graph structure in explainable inference.
(3) The importance of cohesion varies across different types of facts. We observe that Hypothesis-Abstract cohesion (#5) is significantly more important than the others. We attribute this to the fact that without Hypothesis-Abstract cohesion, multi-hop inference can quickly go out of context. (4) From the ablation analysis, we can see how lexical relevance and semantic similarity (#10, #11) complement each other towards the final prediction. For the WorldTree corpus, the relevance score has a higher learned weight, translating into a higher impact, and vice-versa for ARC.
(5) Diversity plays a smaller role when compared to cohesion and relevance. The impact of diversity in ARC is higher than that of WorldTree.
Semantic Drift: To validate the performance across an increasing number of hops, we plot the accuracy against explanation length, as illustrated in Figure 2. As demonstrated in explanation regeneration (Valentino et al., 2021; Jansen and Ustalov, 2019), the complexity of a science question is directly correlated with the explanation length, i.e. the number of facts required in the gold explanation. Unlike BERT, PathNet and ExplanationLP use external background knowledge, addressing the multi-hop process in two main reasoning steps. However, in contrast to ExplanationLP, PathNet combines only two explanatory facts to answer a given question. This assumption has a negative impact on answering complex questions requiring long explanations. This is evident in the graph, where we observe a sharp decrease in PathNet's accuracy with increasing explanation length. Comparatively, ExplanationLP achieves more stable performance, showing lower degradation with an increasing number of explanation sentences. These results crucially demonstrate the positive impact of grounding-abstract mechanisms on semantic drift. We also exhibit consistently better performance than BERT.

Related Work
Our approach broadly falls into the class of Linear Programming based approaches for science question answering. LP-based approaches perform inference over either semi-structured tables or structural representations extracted from text (Khashabi et al., 2018; Khot et al., 2017a). These approaches treat all facts homogeneously and attempt to connect the question with the correct answer through long hops. While they have exhibited good performance with no supervision, the performance tends to be lower when answering complex questions requiring long explanatory chains. In contrast, our approach performs inference over unstructured text by imposing structural constraints via grounding-abstract chains, lowering the number of hops, and combines parametric optimisation to extract the best performing model. The other class of approaches that provide explanations are graph-based approaches. Graph-based approaches have been successfully applied for open-domain question answering (Fang et al., 2020; Qiu et al., 2019; Thayaparan et al., 2019) where the question requires only two hops. PathNet (Kundu et al., 2019) operates within the same design principles and has been applied to the OpenbookQA science dataset. As indicated in the empirical evaluation, it struggles with long-chain explanations since it relies only on two facts. Graph-based approaches have also been employed for mathematical reasoning (Ferreira and Freitas, 2020a,b) and textual entailment (Silva et al., 2018, 2019).
The third category of partially explainable approaches employs black-box neural models in combination with a retrieval approach. The SOTA model for science question answering (Khashabi et al., 2020) is pretrained across multiple datasets and is not explainable. The current partially explainable SOTA approach that does not rely on external resources (Yadav et al., 2019b) employs a BERT model with a large number of parameters for question answering. In contrast, with a low number of parameters, we have introduced a model that demonstrates competitive performance and leaves a smaller carbon footprint in terms of energy consumption (Henderson et al., 2020). Other methods construct explanation chains by leveraging explanatory patterns emerging in a corpus of scientific explanations (Valentino et al., 2021).

Conclusion
This paper presented a robust, explainable and efficient science question answering model that performs abductive natural language inference. We also presented a systematic evaluation demonstrating the impact of the various design principles via an in-depth ablation analysis. Despite having a significantly lower number of parameters, the model demonstrates competitive performance compared with contemporary explainable approaches while also showcasing robustness, explainability and interpretability.