Unification-based Reconstruction of Multi-hop Explanations for Science Questions

This paper presents a novel framework for reconstructing multi-hop explanations in science Question Answering (QA). While existing approaches for multi-hop reasoning build explanations considering each question in isolation, we propose a method to leverage explanatory patterns emerging in a corpus of scientific explanations. Specifically, the framework ranks a set of atomic facts by integrating lexical relevance with the notion of unification power, estimated by analysing explanations for similar questions in the corpus. An extensive evaluation is performed on the Worldtree corpus, integrating k-NN clustering and Information Retrieval (IR) techniques. We present the following conclusions: (1) the proposed method achieves results competitive with Transformers, yet being orders of magnitude faster, a feature that makes it scalable to large explanatory corpora; (2) the unification-based mechanism has a key role in reducing semantic drift, contributing to the reconstruction of many-hops explanations (6 or more facts) and the ranking of complex inference facts (+12.0 Mean Average Precision); (3) crucially, the constructed explanations can support downstream QA models, improving the accuracy of BERT by up to 10% overall.


Introduction
Answering multiple-choice science questions has become an established benchmark for testing natural language understanding and complex reasoning in Question Answering (QA) (Mihaylov et al., 2018). In parallel with other NLP research areas, a crucial requirement emerging in recent years is explainability (Thayaparan et al., 2020; Miller, 2019; Biran and Cotton, 2017; Ribeiro et al., 2016). To boost automatic methods of inference, it is necessary not only to measure the performance on answer prediction, but also the ability of a QA system to provide explanations for the underlying reasoning process.
The need for explainability and a quantitative methodology for its evaluation have led to the creation of shared tasks on explanation reconstruction (Jansen and Ustalov, 2019) using corpora of explanations such as Worldtree (Jansen et al., 2016, 2018). Given a science question, explanation reconstruction consists of regenerating the gold explanation that supports the correct answer through the combination of a series of atomic facts. While most of the existing benchmarks for multi-hop QA require the composition of only 2 supporting sentences or paragraphs (e.g. QASC, HotpotQA (Yang et al., 2018)), the explanation reconstruction task requires the aggregation of an average of 6 facts (and as many as ≈20), making it particularly hard for multi-hop reasoning models. Moreover, the structure of the explanations affects the complexity of the reconstruction task. Explanations for science questions are typically composed of two main parts: a grounding part, containing knowledge about concrete concepts in the question, and a core scientific part, including general scientific statements and laws.
Consider the following question and answer pair from Worldtree (Jansen et al., 2018):
• q: what is an example of a force producing heat? a: two sticks getting warm when rubbed together.
An explanation that justifies a is composed using the following sentences from the corpus: (f 1 ) a stick is a kind of object; (f 2 ) to rub together means to move against; (f 3 ) friction is a kind of force; (f 4 ) friction occurs when two objects' surfaces move against each other; (f 5 ) friction causes the temperature of an object to increase. The explanation contains a set of concrete sentences that are conceptually connected with q and a (f 1 , f 2 and f 3 ), along with a set of abstract facts that require multi-hop inference (f 4 and f 5 ). Previous work has shown that constructing long explanations is challenging due to semantic drift - i.e. the tendency of composing out-of-context inference chains as the number of hops increases (Khashabi et al., 2019; Fried et al., 2015). While existing approaches build explanations considering each question in isolation (Khashabi et al., 2018; Khot et al., 2017), we hypothesise that semantic drift can be tackled by leveraging explanatory patterns emerging in clusters of similar questions.
In Science, a given statement is considered explanatory to the extent it performs unification (Friedman, 1974; Kitcher, 1981, 1989), that is, showing how a set of initially disconnected phenomena are the expression of the same regularity. An example of unification is Newton's law of universal gravitation, which unifies the motion of planets and falling bodies on Earth by showing that all bodies with mass obey the same law. Since the explanatory power of a given statement depends on the number of unified phenomena, highly explanatory facts tend to create unification patterns - i.e. similar phenomena require similar explanations. Coming back to our example, we hypothesise that the relevance of abstract statements requiring multi-hop inference, such as f 4 ("friction occurs when two objects' surfaces move against each other"), can be estimated by taking into account the unification power.
Following these observations, we present a framework that ranks atomic facts through the combination of two scoring functions:
• A Relevance Score (RS) that represents the lexical relevance of a given fact.
• A Unification Score (US) that models the explanatory power of a fact according to its frequency in explanations for similar questions.
An extensive evaluation is performed on the Worldtree corpus (Jansen et al., 2018; Jansen and Ustalov, 2019), adopting a combination of k-NN clustering and Information Retrieval (IR) techniques. We present the following conclusions:
1. Despite its simplicity, the proposed method achieves results competitive with Transformers (Das et al., 2019; Chia et al., 2019), yet being orders of magnitude faster, a feature that makes it scalable to large explanatory corpora.
2. We empirically demonstrate the key role of the unification-based mechanism in the reconstruction of many-hops explanations (6 or more facts) and explanations requiring complex inference (+12.0 Mean Average Precision).
3. Crucially, the constructed explanations can support downstream question answering models, improving the accuracy of BERT (Devlin et al., 2019) by up to 10% overall.
To the best of our knowledge, we are the first to propose a method that leverages unification patterns for the reconstruction of multi-hop explanations, and empirically demonstrate their impact on semantic drift and downstream question answering.

Related Work
Explanations for Science Questions. Reconstructing explanations for science questions can be reduced to a multi-hop inference problem, where multiple pieces of evidence have to be aggregated to arrive at the final answer (Thayaparan et al., 2020; Khashabi et al., 2018; Khot et al., 2017; Jansen et al., 2017). Aggregation methods based on lexical overlaps and explicit constraints suffer from semantic drift (Khashabi et al., 2019; Fried et al., 2015) - i.e. the tendency of composing spurious inference chains leading to wrong conclusions.
One way to contain semantic drift is to leverage common explanatory patterns in explanation-centred corpora (Jansen et al., 2018). Transformers (Das et al., 2019; Chia et al., 2019) represent the state-of-the-art for explanation reconstruction in this setting (Jansen and Ustalov, 2019). However, these models require high computational resources that prevent their applicability to large corpora. On the other hand, approaches based on IR techniques are readily scalable. The approach described in this paper preserves the scalability of IR methods, obtaining, at the same time, performance competitive with Transformers. Thanks to this feature, the framework can be flexibly applied in combination with downstream question answering models.
Our findings are in line with previous work in different QA settings (Rajani et al., 2019; Yadav et al., 2019), which highlights the positive impact of explanations and supporting facts on the final answer prediction task.
In parallel with Science QA, the development of models for explanation generation is being explored in different NLP tasks, ranging from open domain question answering (Yang et al., 2018; Thayaparan et al., 2019), to textual entailment (Camburu et al., 2018) and natural language premise selection (Ferreira and Freitas, 2020a,b).
Scientific Explanation and AI. The field of Artificial Intelligence has been historically inspired by models of explanation in Philosophy of Science (Thagard and Litt, 2008). The deductive-nomological model proposed by Hempel (1965) constitutes the philosophical foundation for explainable models based on logical deduction, such as Expert Systems (Lacave and Diez, 2004; Wick and Thompson, 1992) and Explanation-based Learning (Mitchell et al., 1986). Similarly, the inherent relation between explanation and causality (Woodward, 2005; Salmon, 1984) has inspired computational models of causal inference (Pearl, 2009). The view of explanation as unification (Friedman, 1974; Kitcher, 1981, 1989) is closely related to Case-based reasoning (Kolodner, 2014; Sørmo et al., 2005; De Mantaras et al., 2005). In this context, analogical reasoning plays a key role in the process of reusing abstract patterns for explaining new phenomena (Thagard, 1992). Similarly to our approach, Case-based reasoning applies this insight to construct solutions for novel problems by retrieving, reusing and adapting explanations for known cases solved in the past.

Explanation Reconstruction as a Ranking Problem
A multiple-choice science question Q = {q, C} is a tuple composed of a question q and a set of candidate answers C = {c_1, c_2, . . . , c_n}. Given a hypothesis h_j defined as the concatenation of q with a candidate answer c_j ∈ C, the task of explanation reconstruction consists of selecting a set of atomic facts E_j = {f_1, f_2, . . . , f_n} from a knowledge base that support and justify h_j. In this paper, we adopt a methodology that relies on the existence of a corpus of explanations. A corpus of explanations is composed of two distinct knowledge sources:
• A primary knowledge base, Facts KB (F_kb), defined as a collection of sentences F_kb = {f_1, f_2, . . . , f_n} encoding the general world knowledge necessary to answer and explain science questions. A fundamental and desirable characteristic of F_kb is reusability - i.e. each of its facts f_i can be potentially reused to compose explanations for multiple questions.
• A secondary knowledge base, Explanations KB (E_kb), defined as a collection of pairs (h_z, E_z), where each explained hypothesis h_z is associated with its gold explanation E_z, composed of facts from F_kb.
In this setting, the explanation reconstruction task for an unseen hypothesis h_j can be modelled as a ranking problem (Jansen and Ustalov, 2019). Specifically, given a hypothesis h_j, the algorithm to solve the task is divided into three macro steps:

1. Computing the explanatory score e(h_j, f_i) for each fact f_i ∈ F_kb with respect to h_j;
2. Sorting the facts in F_kb in descending order of explanatory score, obtaining a ranked list Rank(h_j);
3. Selecting the top k elements belonging to Rank(h_j) and interpreting them as an explanation for h_j: E_j = topK(Rank(h_j)).
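The ranking procedure described above can be sketched in a few lines; this is a minimal illustration rather than the authors' implementation, and the word-overlap score used here is only a stand-in for the full explanatory score e(h_j, f_i):

```python
from typing import Callable, List

def reconstruct_explanation(
    hypothesis: str,
    facts_kb: List[str],
    score_fn: Callable[[str, str], float],
    k: int = 10,
) -> List[str]:
    """Rank every fact in the Facts KB by its explanatory score for the
    hypothesis and return the top-k facts as the reconstructed explanation."""
    ranked = sorted(facts_kb, key=lambda f: score_fn(hypothesis, f), reverse=True)
    return ranked[:k]

# Toy word-overlap score standing in for e(h_j, f_i):
def overlap_score(h: str, f: str) -> float:
    return len(set(h.lower().split()) & set(f.lower().split()))

facts = ["friction is a kind of force",
         "a stick is a kind of object",
         "the sun is a kind of star"]
h = ("what is an example of a force producing heat "
     "two sticks getting warm when rubbed together")
print(reconstruct_explanation(h, facts, overlap_score, k=2))
```

Any scoring function with the signature score_fn(hypothesis, fact) can be plugged in, which is how the Relevance and Unification Scores are combined in the next section.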

Modelling Explanatory Relevance
We present an approach for modelling e(h_j, f_i) that is guided by the following research hypotheses:
• RH1: Scientific explanations are composed of a set of concrete facts connected to the question, and a set of abstract statements expressing general scientific laws and regularities.
• RH2: Concrete facts tend to share key concepts with the question and can therefore be effectively ranked by IR techniques based on lexical relevance.
• RH3: General scientific statements tend to be abstract and therefore difficult to rank by means of shared concepts. However, due to explanatory unification, core scientific facts tend to be frequently reused across similar questions. We hypothesise that the explanatory power of a fact f_i for a given hypothesis h_j is proportional to the number of times f_i explains similar hypotheses.
To formalise these research hypotheses, we model the explanatory scoring function e(h_j, f_i) as a combination of two components:

e(h_j, f_i) = λ_1 · rs(h_j, f_i) + (1 − λ_1) · us(h_j, f_i)    (1)

Here, rs(h_j, f_i) represents a lexical Relevance Score (RS) assigned to f_i ∈ F_kb with respect to h_j, while us(h_j, f_i) represents the Unification Score (US) of f_i, computed over E_kb as follows:

us(h_j, f_i) = Σ_{h_z ∈ kNN(h_j)} sim(h_j, h_z) · in(f_i, E_z)    (2)

where kNN(h_j) is the set of the k hypotheses in E_kb most similar to h_j, sim(h_j, h_z) is the similarity between h_j and h_z, and in(f_i, E_z) is an indicator function whose value depends on whether the fact f_i belongs to the explanation E_z for the hypothesis h_z.
In the formulation of Equation 2 we aim to capture two main aspects related to our research hypotheses: 1. The more a fact f i is reused for explanations in E kb , the higher its explanatory power and therefore its Unification Score; 2. The Unification Score of a fact f i is proportional to the similarity between the hypotheses in E kb that are explained by f i and the unseen hypothesis (h j ) we want to explain. Figure 1 shows a schematic representation of the Unification-based framework.
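As a concrete illustration of Equation 2, the following sketch computes a Unification Score over a toy Explanations KB. The bag-of-words cosine similarity is only a hypothetical stand-in for the sim(h_j, h_z) function adopted in the paper (TF-IDF/BM25):

```python
import math
from collections import Counter
from typing import List, Set, Tuple

def cosine_sim(a: str, b: str) -> float:
    """Bag-of-words cosine similarity, standing in for sim(h_j, h_z)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def unification_score(h_j: str, fact: str,
                      explanations_kb: List[Tuple[str, Set[str]]],
                      k: int = 2) -> float:
    """us(h_j, f_i): sum of sim(h_j, h_z) over the k hypotheses most
    similar to h_j whose gold explanation E_z contains the fact."""
    knn = sorted(explanations_kb,
                 key=lambda hz: cosine_sim(h_j, hz[0]), reverse=True)[:k]
    return sum(cosine_sim(h_j, h_z) for h_z, e_z in knn if fact in e_z)

def explanatory_score(h_j, fact, rs, explanations_kb, lam=0.5, k=2):
    """e(h_j, f_i) = lambda * rs + (1 - lambda) * us  (Equation 1)."""
    return lam * rs(h_j, fact) + \
        (1 - lam) * unification_score(h_j, fact, explanations_kb, k)
```

The interpolation weight λ_1 controls the trade-off between lexical relevance and unification power; its value in the experiments is left to the supplementary material.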

Empirical Evaluation
We carried out an empirical evaluation on the Worldtree corpus (Jansen et al., 2018), a subset of the ARC dataset in which explanations are annotated with their explanatory roles. Considering the example in Section 1, given q and c_j the system has to retrieve the scientific facts describing how friction occurs and produces heat across objects. The corpus classifies these facts (f 4 , f 5 ) as central. Grounding explanations like "a stick is a kind of object" (f 1 ) link question and answer to the central explanations. Lexical glues such as "to rub together means to move against" (f 2 ) are used to fill lexical gaps between sentences. Additionally, the corpus divides the facts belonging to F_kb into three inference categories: retrieval type, inference supporting type, and complex inference type. Taxonomic knowledge and properties such as "a stick is a kind of object" (f 1 ) and "friction is a kind of force" (f 3 ) are classified as retrieval type. Facts describing actions, affordances, and requirements such as "friction occurs when two objects' surfaces move against each other" (f 4 ) are grouped under the inference supporting type. Knowledge about causality, descriptions of processes and if-then conditions such as "friction causes the temperature of an object to increase" (f 5 ) is classified as complex inference.
The RS model uses TF-IDF/BM25 to compute the relevance function for each fact in F_kb (i.e. the rs(h_j, f_i) function in Equation 1), while the US model adopts TF-IDF/BM25 to assign similarity scores to the hypotheses in E_kb (i.e. the sim(h_j, h_z) function in Equation 2). For reproducibility, the code is available at the following URL: https://github.com/ai-systems/unification_reconstruction_explanations.
Additional details can be found in the supplementary material.
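For concreteness, a minimal BM25 scorer of the kind used for the rs(h_j, f_i) component can be implemented as follows. This is a sketch of the standard Okapi BM25 formula (Robertson et al., 2009), not the exact implementation released with the paper, and the tokenisation (lowercased whitespace splitting) is an assumption:

```python
import math
from collections import Counter
from typing import List

class BM25:
    """Minimal Okapi BM25 ranking function over a corpus of facts."""

    def __init__(self, corpus: List[str], k1: float = 1.5, b: float = 0.75):
        self.docs = [doc.lower().split() for doc in corpus]
        self.k1, self.b = k1, b
        self.avgdl = sum(len(d) for d in self.docs) / len(self.docs)
        self.tfs = [Counter(d) for d in self.docs]
        df = Counter(t for d in self.docs for t in set(d))
        n = len(self.docs)
        # Smoothed inverse document frequency for each term in the corpus.
        self.idf = {t: math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
                    for t in df}

    def score(self, query: str, index: int) -> float:
        """BM25 score of the document at `index` for the given query."""
        tf, dl = self.tfs[index], len(self.docs[index])
        s = 0.0
        for t in query.lower().split():
            if t not in tf:
                continue
            num = tf[t] * (self.k1 + 1)
            den = tf[t] + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            s += self.idf.get(t, 0.0) * num / den
        return s
```

Using the hypothesis as the query and each fact in F_kb as a document, this scorer yields a one-step IR baseline of the kind compared against in Section 4.1.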

Explanation Reconstruction
In line with the shared task (Jansen and Ustalov, 2019), the performance of the models is evaluated via the Mean Average Precision (MAP) of the explanation ranking produced for a given question q_j and its correct answer a_j. Table 1 illustrates the score achieved by our best implementation compared to state-of-the-art approaches in the literature. Previous approaches are grouped into four categories: Transformers, Information Retrieval with re-ranking, One-step Information Retrieval, and Feature-based models.
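To make the evaluation protocol concrete, the MAP metric used throughout this section can be computed as follows (a standard implementation, shown only for illustration):

```python
from typing import List, Set

def average_precision(ranked_facts: List[str], gold: Set[str]) -> float:
    """Average precision of a fact ranking against the gold explanation."""
    hits, total = 0, 0.0
    for i, fact in enumerate(ranked_facts, start=1):
        if fact in gold:
            hits += 1
            total += hits / i  # precision at each relevant rank
    return total / len(gold) if gold else 0.0

def mean_average_precision(rankings: List[List[str]],
                           golds: List[Set[str]]) -> float:
    """MAP over a collection of questions."""
    return sum(average_precision(r, g)
               for r, g in zip(rankings, golds)) / len(golds)
```

A perfect ranking places every gold fact at the top and yields a MAP of 1.0; gold facts buried deep in the ranking drive the score towards 0.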
Transformers. This class of approaches employs the gold explanations in the corpus to train a BERT language model (Devlin et al., 2019). The best-performing system (Das et al., 2019) adopts a multi-step retrieval strategy. In the first step, it returns the top K sentences ranked by a TF-IDF model. In the second step, BERT is used to re-rank the paths composed of all the facts that are within 1 hop of the first retrieved set. Similarly, other approaches adopt BERT to re-rank each fact individually (Banerjee, 2019; Chia et al., 2019).
Although the best model achieves state-of-the-art results in explanation reconstruction, these approaches are computationally expensive and require a pre-filtering step to constrain the space of candidate facts. Consequently, these systems do not scale with the size of the corpus. We estimated that the best-performing model (Das et al., 2019) takes ≈ 10 hours to run on the whole test set (1240 questions) using 1 Tesla 16GB V100 GPU. Comparatively, our model constructs explanations for all the questions in the test set in ≈ 30 seconds, without requiring the use of GPUs (< 1 second per question). This feature makes the Unification-based Reconstruction suitable for large corpora and downstream question answering models (as shown in Section 4.4). Moreover, our approach does not require any explicit training session on the explanation regeneration task, with a significantly reduced number of parameters to tune. Along with scalability, the proposed approach achieves nearly state-of-the-art results (50.8/54.5 MAP). Although we observe lower performance when compared to the best-performing approach (-5.5/-4.0 MAP), the joint RS + US model outperforms two BERT-based models (Chia et al., 2019; Banerjee, 2019) on both test and dev set by 3.1/3.6 and 9.5/12.2 MAP respectively.
Information Retrieval with re-ranking. Chia et al. (2019) describe a multi-step, iterative re-ranking model based on BM25. The first step consists of retrieving the explanation sentence that is most similar to the question adopting BM25 vectors. During the second step, the BM25 vector of the question is updated by aggregating it with the retrieved explanation sentence vector through a max operation. The first and second steps are repeated K times. Although this approach uses scalable IR techniques, it relies on a multi-step retrieval strategy. Moreover, the RS + US model outperforms this approach on both test and dev set by 5.0/4.8 MAP respectively.
One-step Information Retrieval. We compare the RS + US model with two IR baselines. The baselines adopt TF-IDF and BM25 to compute the Relevance Score only - i.e. the us(h_j, f_i) term in Equation 1 is set to 0 for each fact f_i ∈ F_kb. In line with previous IR literature (Robertson et al., 2009), BM25 leads to better performance than TF-IDF. While these baselines adopt the same lexical models as the RS component, the combined RS + US model outperforms both RS BM25 and RS TF-IDF on test and dev set by 7.8/8.4 and 11.4/11.7 MAP. Moreover, the joint RS + US model improves the performance of the US model alone by 27.9/32.6 MAP. These results outline the complementary aspects of Relevance and Unification Score. We provide a detailed analysis by performing an ablation study on the dev set (Section 4.2).
Feature-based models. D'Souza et al. (2019) propose an approach based on a learning-to-rank paradigm. The model extracts a set of features based on overlaps and coherence metrics between questions and explanation sentences. These features are then given as input to an SVM ranker module. While this approach scales to the whole corpus without requiring any pre-filtering step, it is significantly outperformed by the RS + US model on both test and dev set by 16.7/17.4 MAP respectively.

Explanation Analysis
We present an ablation study with the aim of understanding the contribution of each sub-component to the general performance of the joint RS + US model (see Table 1). To this end, a detailed evaluation on the development set of the Worldtree corpus is carried out, analysing the performance in reconstructing explanations of different types and complexity. We compare the joint model (RS + US) with each individual sub-component (RS and US alone). In addition, a set of qualitative examples are analysed to provide additional insights on the complementary aspects captured by Relevance and Unification Score.
Explanatory categories. Given a question q_j and its correct answer a_j, we classify a fact f_i belonging to the gold explanation E_j according to its explanatory role (central, grounding, lexical glue) and inference type (retrieval, inference-supporting and complex inference). In addition, three new categories are derived from the number of overlaps between f_i and the concatenation of q_j with a_j (h_j), computed by considering nouns, verbs, adjectives and adverbs (1+ overlaps, 1 overlap, 0 overlaps). Table 2 reports the MAP score for each of the described categories. Overall, the best results are obtained by the BM25 implementation of the joint model (RS BM25 + US BM25) with a MAP score of 54.5. Specifically, RS BM25 + US BM25 achieves a significant improvement over both RS BM25 (+8.5 MAP) and US BM25 (+32.6 MAP) baselines. Regarding the explanatory roles (Table 2a), the joint TF-IDF implementation shows the best performance in the reconstruction of grounding explanations (32.7 MAP). On the other hand, a significant improvement over the RS baseline is obtained by RS BM25 + US BM25 on both lexical glues and central explanation sentences (+6.0 and +5.6 MAP over RS BM25).
Regarding the lexical overlaps categories (Table 2b), we observe a steady improvement for all the combined RS + US models over the respective RS baselines. Notably, the US models achieve the best performance on the 0 overlaps category, which includes the most challenging facts for the RS models. The improved ability to rank abstract facts is also reflected in the inference types categories (Table 2c). Crucially, the largest improvement is observed for complex inference sentences, where RS BM25 + US BM25 outperforms RS BM25 by 12.0 MAP, confirming the decisive contribution of the Unification Score to the ranking of complex scientific facts.
Semantic drift. Science questions in the Worldtree corpus require an average of six facts in their explanations (Jansen et al., 2016). Long explanations typically include sentences that share few terms with question and answer, increasing the probability of semantic drift. Therefore, to test the impact of the Unification Score on the robustness of the model, we measure the performance in the reconstruction of many-hops explanations. Figure 2a shows the change in MAP score for the RS + US, RS and US models (BM25) with increasing explanation length. The fast drop in performance for the Relevance Score reflects the complexity of the task. This drop occurs because the RS model is not able to rank abstract explanatory facts. Conversely, the US model exhibits an inverse trend, with performance increasing as explanations grow longer. Short explanations, indeed, tend to include question-specific facts with low explanatory power. On the other hand, the longer the explanation, the higher the number of core scientific facts. Therefore, the decrease in MAP observed for the RS model is compensated by the Unification Score, since core scientific facts tend to form unification patterns across similar questions. These results demonstrate that the Unification Score has a crucial role in alleviating the semantic drift for the joint model (RS + US), resulting in a larger improvement on many-hops explanations (6+ facts).
Similarly, Figure 2b illustrates the Precision@K. As shown in the graph, the precision of the US model exhibits the slowest degradation. Similarly to what is observed for many-hops explanations, the US score contributes to the robustness of the RS + US model, making it able to reconstruct more precise explanations. As discussed in Section 4.4, this feature has a positive impact on question answering.
k-NN clustering. We investigate the impact of the k-NN clustering on the explanation reconstruction task. Figure 3 shows the MAP score obtained by the joint RS + US model (BM25) with different numbers k of nearest hypotheses considered for the Unification Score. The graph highlights the improvement in MAP achieved with increasing values of k. Specifically, we observe that the best MAP is obtained with k = 100. These results confirm that the explanatory power can be effectively estimated using clusters of similar hypotheses, and that the unification-based mechanism has a crucial role in improving the performance of the relevance model.

Qualitative analysis.
To provide additional insights on the complementary aspects of Unification and Relevance Score, we present a set of qualitative examples from the dev-set. Table 3 illustrates the ranking assigned by RS and RS + US models to scientific sentences of increasing complexity. The words in bold indicate lexical overlaps between question, answer and explanation sentence. In the first example, the sentence "gravity; gravitational force causes objects that have mass; substances to be pulled down; to fall on a planet" shares key terms with question and candidate answer and is therefore relatively easy to rank for the RS model (#36). Nevertheless, the RS + US model is able to improve the ranking by 34 positions (#2), as the gravitational law represents a scientific pattern with high explanatory unification, frequently reused across similar questions. The impact of the Unification Score is more evident when considering abstract explanatory facts. Coming back to our original example (i.e. "What is an example of a force producing heat?"), the fact "friction causes the temperature of an object to increase" has no significant overlaps with question and answer. Thus, the RS model ranks the gold explanation sentence in a low position (#1472). However, the Unification Score (US) is able to capture the explanatory power of the fact from similar hypotheses in E kb , pushing the RS + US ranking up to position #21 (+1451).

Question Answering
To understand whether the constructed explanations can support question answering, we compare the performance of BERT for multiple-choice QA (Devlin et al., 2019) without explanations with the performance of BERT provided with the top K explanation sentences retrieved by the RS and RS + US models (BM25). BERT without explanations operates on question and candidate answer only. On the other hand, BERT with explanations receives the following input: the question (q), a candidate answer (c_i) and the explanation for c_i (E_i). In this setting, the model is fine-tuned for binary classification (bert_b), predicting a probability score p_i for each candidate answer in C = {c_1, c_2, ..., c_n}. The binary classifier operates on the final hidden state corresponding to the [CLS] token. To answer the question q, the model selects the candidate answer c_a such that a = argmax_i p_i. Table 4 reports the accuracy with and without explanations on the Worldtree test set for easy and challenge questions. Notably, a significant improvement in accuracy can be observed when BERT is provided with explanations retrieved by the reconstruction modules (+9.84% accuracy with the RS BM25 + US BM25 model). The improvement is consistent on the easy split (+6.92%) and particularly significant for challenge questions (+15.69%). Overall, we observe a correlation between more precise explanations and accuracy in answer prediction, with BERT + RS being outperformed by BERT + RS + US for each value of K. The decrease in accuracy occurring with increasing values of K is coherent with the drop in precision for the models observed in Figure 2b. Moreover, steadier results adopting the RS + US model suggest a positive contribution from abstract explanatory facts. Additional investigation of this aspect will be a focus for future work.
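The answer selection mechanism described above can be sketched as follows. The classifier here is a generic stand-in for the fine-tuned bert_b model, and the [SEP]-joined input format is an assumption for illustration only:

```python
from typing import Callable, List

def answer_question(
    question: str,
    candidates: List[str],
    explain: Callable[[str, str], List[str]],
    classifier: Callable[[str], float],
) -> str:
    """Select the candidate whose (question, candidate, explanation) input
    receives the highest probability from a binary classifier."""
    scores = []
    for c in candidates:
        explanation = " ".join(explain(question, c))
        # Hypothetical input format: question, candidate and explanation
        # joined by separator tokens, mirroring BERT's sentence-pair input.
        text = f"{question} [SEP] {c} [SEP] {explanation}"
        scores.append(classifier(text))
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best]
```

In the actual pipeline, explain would be the RS + US reconstruction module and classifier the fine-tuned BERT binary head; here both can be replaced by simple stubs to exercise the selection logic.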

Conclusion
This paper proposed a novel framework for multi-hop explanation reconstruction based on explanatory unification. An extensive evaluation on the Worldtree corpus led to the following conclusions: (1) The approach is competitive with state-of-the-art Transformers, yet being significantly faster and inherently scalable; (2) The unification-based mechanism supports the construction of complex and many-hops explanations; (3) The constructed explanations improve the accuracy of BERT for question answering by up to 10% overall. As future work, we plan to extend the framework adopting neural embeddings for sentence representation.
The code to reproduce the experiments described in the paper is available at the following URL: https://github.com/ai-systems/ unification_reconstruction_explanations