Reasoning over Hierarchical Question Decomposition Tree for Explainable Question Answering

Explainable question answering (XQA) aims to answer a given question and provide an explanation of why the answer is selected. Existing XQA methods focus on reasoning over a single knowledge source, e.g., structured knowledge bases, unstructured corpora, etc. However, integrating information from heterogeneous knowledge sources is essential for answering complex questions. In this paper, we propose to leverage question decomposing for heterogeneous knowledge integration, by breaking down a complex question into simpler ones and selecting the appropriate knowledge source for each sub-question. To facilitate reasoning, we propose a novel two-stage XQA framework, Reasoning over Hierarchical Question Decomposition Tree (RoHT). First, we build the Hierarchical Question Decomposition Tree (HQDT) to understand the semantics of a complex question; then, we conduct probabilistic reasoning over the HQDT from root to leaves recursively, to aggregate heterogeneous knowledge at different tree levels and search for the best solution considering the decomposing and answering probabilities. Experiments on the complex QA datasets KQA Pro and Musique show that our framework significantly outperforms SOTA methods, demonstrating the effectiveness of leveraging question decomposing for knowledge integration and of our RoHT framework.


Introduction
Explainable question answering (XQA) is the task of (i) answering a question and (ii) providing an explanation that enables the user to understand why the answer is selected (Neches et al., 1985; Schuff et al., 2020). It provides a principled way to test the reasoning ability and interpretability of intelligent systems, and plays an important role in artificial intelligence (Lu et al., 2022).
(Figure 1: An example HQDT. The root $q_0$ "Which is higher, the highest mountain in North America or the highest mountain in Africa?" is decomposed into $q_1$ "How high is the highest mountain in North America?" and $q_2$ "How high is the highest mountain in Africa?", with leaf questions such as $q_4$ "Which mountain is the highest in North America?", $q_5$ "How high is #4?", $q_6$ "Which mountain is the highest in Africa?", and $q_7$ "How high is #6?".)
Although achieving significant results, both directions, i.e., neuro-symbolic methods and decompose-based methods, have key limitations. For neuro-symbolic methods, the formal representation can only be executed on KBs; however, even the largest KBs are incomplete, which limits the recall of such models. Decompose-based methods employ free-text corpora as the knowledge source, and the diversity of natural language makes XQA difficult. In fact, integrating knowledge from heterogeneous sources is of great importance to QA (Wolfson et al., 2020), especially for answering complex questions. Several attempts have been made at knowledge integration (e.g., KBs, text corpora) (Sun et al., 2018, 2019; Shi et al., 2021). Although promising, these graph-based methods either lack explainability or are constrained to limited reasoning capability.
Intuitively, leveraging question decomposing to integrate heterogeneous knowledge sources is a promising direction, since we can flexibly select the appropriate knowledge source for each sub-question. The challenges lie in: 1) how to determine the granularity of question decomposing, since certain complex questions can be directly answered with a knowledge source, and further decomposition increases the possibility of error. For example, in Figure 1, $q_1$ can be answered with the Wikipedia corpus without further decomposition.
2) How to find the optimal solution among various possible ones, since question decomposing and answering are both uncertain. For example, $q_0$ can also be decomposed as "Which mountains are in North America or Africa?", "What's the height of #1?", "[SelectAmong] [largest] #2".
To this end, we propose a novel two-stage XQA framework, Reasoning over Hierarchical Question Decomposition Tree, dubbed RoHT. First, we propose to understand the complex question by building its hierarchical question decomposition tree (HQDT). In this tree, the root node is the original complex question, and each non-root node is a sub-question of its parent. The leaf nodes are atomic questions that cannot be further decomposed. Compared with existing representations that directly decompose a question into atomic ones, e.g., QDMR (Wolfson et al., 2020), our tree structure provides the flexibility to decide whether to solve a question by answering it directly or by decomposing it further. Second, we propose probabilistic reasoning over the HQDT, to fuse knowledge from KB and text at different levels of the tree while taking into consideration the probability scores of both tree generation and answering. The reasoning process is recursive, from the root to the leaves, and constitutes three steps: 1) a scheduler determines the appropriate knowledge sources for a particular question (the KB, text, or solving its children sequentially); 2) the corresponding executors output answers with probabilities; 3) an aggregator aggregates the candidate answers from all knowledge sources and outputs the best ones.
For evaluation, we instantiate our RoHT framework on two complex QA datasets: KQA Pro (Cao et al., 2022a), where we remove half of the triples in its KB and supplement it with a Wikipedia corpus, and Musique (Trivedi et al., 2022), where we take Wikidata (Vrandecic and Krötzsch, 2014) as an additional KB besides the given text paragraphs. Experimental results show that RoHT improves performance significantly under the KB+Text setting, by 29.7% and 45.8% EM score on KQA Pro and Musique respectively, compared with the existing SOTA model. In addition, compared with decompose-based methods, RoHT improves the SOTA by 11.3% F1 score on Musique.
Our contributions include: 1) proposing to leverage question decomposing to integrate heterogeneous knowledge sources for the first time; 2) designing a novel two-stage XQA framework, RoHT, which first builds the HQDT and then reasons over it; 3) demonstrating the effectiveness of the RoHT framework through extensive experiments and careful ablation studies on two benchmark datasets.
Related Work

QA over Text and KB

Over time, the QA task has evolved into two main streams: 1) QA over unstructured data (e.g., free-text corpora like Wikipedia); 2) QA over structured data (e.g., large structured KBs like DBpedia (Lehmann et al., 2015) and Wikidata (Vrandecic and Krötzsch, 2014)). As structured and unstructured data are intuitively complementary information sources (Oguz et al., 2022), several attempts have been made to combine the best of both worlds.
An early approach, IBM Watson (Ferrucci, 2012), combines multiple expert systems and re-ranks their outputs to produce the answer. Xu et al. (2016) map relational phrases to KB and text simultaneously, and use an integer linear programming model to provide a globally optimal solution. The universal schema based method (Das et al., 2017) reasons over both KBs and text by aligning them in a common embedding space. GraftNet (Sun et al., 2018) and its successor PullNet (Sun et al., 2019) incorporate free text into graph nodes to make text amenable to KBQA methods. TransferNet (Shi et al., 2021) proposes a relation graph to uniformly model label-form relations from KBs and text-form relations from corpora.
Although achieving promising results, these methods lack interpretability or are constrained to limited question types. For example, TransferNet offers interpretability through transparent step transferring, but it can only answer multi-hop questions and cannot deal with questions that require attribute comparison or value verification. In contrast, our proposed framework shows strong interpretability with HQDT and covers more question types.

Question Decomposing
For datasets, KQA Pro (Cao et al., 2022a) proposes to decompose a complex question into a multi-step program, KoPL, which can be executed on KBs. BREAK (Wolfson et al., 2020) proposes to decompose questions into QDMR, an ordered list of steps expressed in natural language. Musique (Trivedi et al., 2022) is a QA dataset constructed by composing single-hop questions obtained from existing datasets, and thus naturally provides question decompositions.
For models, several attempts have been made at learning to decompose with weak supervision, such as the span prediction based method (Min et al., 2019), the unsupervised sequence transduction method ONUS (Perez et al., 2020), and the AMR-based method QDAMR (Deng et al., 2022). Another line of work employs large language models with in-context learning, such as Least-to-Most Prompting (Zhou et al., 2022), Decomposed Prompting (Khot et al., 2022), and Successive Prompting (Dua et al., 2022).
Compared with existing works, we are the first to design a hierarchical question decomposition tree for integrating information from multiple knowledge sources.

Definition of HQDT
Formally, given a complex question, its HQDT is a tree $T$. Each node $q_i \in T$ represents a question: the root node represents the given complex question, and each non-root node represents a sub-question of its parent node. The leaf nodes are simple ("atomic") questions that cannot be further decomposed. Note that HQDT is a 3-ary ordered tree. As shown in Figure 1, we enumerate the nodes of $T$ in BFS order, so $q_0$ is the root question.
A question $q_i = \langle w_1, \cdots, w_j, \cdots, w_{|q_i|} \rangle$ can be categorized into one of three types according to its token vocabulary: 1) natural language question (e.g., $q_4$: "Which mountain is the highest in North America?"), where $w_j \in \mathcal{V}$ and $\mathcal{V}$ is the word vocabulary; 2) bridge question (e.g., $q_5$: "How high is #4?"), where $w_j \in \mathcal{V} \cup \mathcal{R}$ and $\mathcal{R}$ is the reference token vocabulary; in this question, "#4" refers to the answer of $q_4$, which is the sibling question of $q_5$; 3) symbolic operation question (e.g., $q_3$: "[SelectBetween][greater] #1 #2"), where $w_j \in \mathcal{V} \cup \mathcal{R} \cup \mathcal{O}$ and $\mathcal{O}$ is the vocabulary of pre-defined symbolic operations, which are designed to support various reasoning capacities (e.g., attribute comparison and set operations) and are detailed in Appendix A. Note that all bridge questions and symbolic operation questions are atomic questions and can only appear in leaf nodes.
For every non-leaf question $q_i$, we define two ordered lists:
• $q_i.children = \langle q_{st_i}, \cdots, q_{ed_i} \rangle$, which is the ordered list of child questions of $q_i$ in $T$;
• $q_i.atoms = \langle a^i_1, \cdots, a^i_{n_i} \rangle$, which is a list of atomic questions deduced from the $n_i$ leaf nodes of the sub-tree rooted at $q_i$, by rearranging the reference tokens. For example, for $q_0$ in Figure 1, its leaf nodes are $q_4, q_5, q_6, q_7, q_3$, and correspondingly, $q_0.atoms$ is $\langle q_4, \tilde{q}_5, q_6, \tilde{q}_7, \tilde{q}_3 \rangle$, with $\tilde{q}_5$ as "How high is #1?", $\tilde{q}_7$ as "How high is #3?", and $\tilde{q}_3$ as "[SelectBetween][greater] #2 #4". The detailed deduction algorithm is given in Appendix B. We also call $q_i.atoms$ the atomic representation of $q_i$.
Specially, among $q_i.children$, $q_{st_i}, \ldots, q_{ed_i-1}$ are all natural language questions, and $q_{ed_i}$ is either a bridge question or a symbolic operation question. Answering $q_i$ is semantically equivalent to answering the sub-questions in $q_i.children$ or in $q_i.atoms$ sequentially; the last question in $q_i.children$ or $q_i.atoms$ returns the answer of $q_i$.
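To make the definition concrete, here is a minimal sketch of an HQDT node in Python; the names (QType, HQDTNode, classify) are illustrative, not from the paper's implementation.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional

class QType(Enum):
    NATURAL = "natural"    # plain natural-language question, tokens from V
    BRIDGE = "bridge"      # contains reference tokens like "#4", tokens from V ∪ R
    SYMBOLIC = "symbolic"  # operation token like "[SelectBetween]", tokens from V ∪ R ∪ O

@dataclass
class HQDTNode:
    qid: int                                     # BFS index; q_0 is the root
    text: str                                    # the question string
    qtype: QType
    p_g: float = 1.0                             # generation certainty score
    children: List["HQDTNode"] = field(default_factory=list)
    atoms: Optional[List[str]] = None            # atomic representation of the sub-tree

def classify(text: str) -> QType:
    """Token-vocabulary test from the definition above."""
    if text.startswith("["):
        return QType.SYMBOLIC
    if any(tok.rstrip("?.,").startswith("#") for tok in text.split()):
        return QType.BRIDGE
    return QType.NATURAL
```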

Methodology
Our framework RoHT is composed of two stages: 1) Building HQDT. We understand the hierarchical compositional structure of a complex question $q_0$ by generating its HQDT $T$ with probabilities, where each question $q_i \in T$ has a score $p^i_g$ that represents the certainty of its generation.
2) Probabilistic Reasoning over HQDT. We conduct recursive probabilistic reasoning over the HQDT from root to leaves to solve $q_0$. For each question $q_i$, we utilize KBs, text, and its child questions together to obtain a list $R_i$, which contains answers of $q_i$ with probabilistic scores. Finally, the answer with the highest score in $R_0$ is picked as the final answer of $q_0$.
The details are introduced as follows.
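Putting the two stages together, the overall control flow can be sketched as follows; build_hqdt and solve stand for the procedures detailed in the next two subsections, so this is pseudocode-level scaffolding rather than the actual implementation.

```python
def answer_complex_question(q0_text: str, kb, corpus, k: int = 5):
    # Stage 1: build the HQDT, attaching a certainty score p_g to every node.
    root = build_hqdt(q0_text)
    # Stage 2: recursive probabilistic reasoning; returns [(answer, score), ...].
    answers = solve(root, kb, corpus, k)
    # The highest-scored answer in R_0 is the final answer of q_0.
    return max(answers, key=lambda x: x[1])[0] if answers else None
```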

Building HQDT
To build the HQDT for a complex question, we first generate its atomic representation, which corresponds to the leaf nodes of the HQDT, and then generate every non-leaf node based on this atomic representation. We compute the certainty score of each node based on the likelihood of each step of generation.
Building Leaf Nodes Given a complex question $q_0$, we first use a BART (Lewis et al., 2020)-based question decomposer $M_\theta$ to generate its atomic representation and output the likelihood of generation. Here, $L^0 = a^0_1 \langle sep \rangle \, a^0_2 \langle sep \rangle \ldots \langle sep \rangle \, a^0_{n_0}$ is the serialization of $q_0.atoms$, where $\langle sep \rangle$ is a separating token, and $l_d = \Pr(L^0 \mid q_0; \theta)$ is the likelihood of generation. Since $q_0$ is the root of $T$, each atomic question in $q_0.atoms$ corresponds to a leaf node in $T$ (via the deterministic algorithm in Appendix C), and the certainty score of each leaf node in $T$ is $l_d$.
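A sketch of leaf construction with Hugging Face Transformers is shown below; it assumes $M_\theta$ is a fine-tuned BART checkpoint whose output uses a literal "<sep>" separator, and it scores $l_d$ with a teacher-forced forward pass, which may differ from the authors' exact scoring.

```python
import torch
from transformers import BartForConditionalGeneration, BartTokenizer

tok = BartTokenizer.from_pretrained("facebook/bart-base")
decomposer = BartForConditionalGeneration.from_pretrained("facebook/bart-base")  # stands in for fine-tuned M_theta

def build_leaves(q0: str):
    inputs = tok(q0, return_tensors="pt")
    out_ids = decomposer.generate(**inputs, max_length=128)
    serialized = tok.decode(out_ids[0], skip_special_tokens=True)  # L^0 = a_1 <sep> ... <sep> a_n
    # l_d = Pr(L^0 | q_0; theta), scored by teacher forcing the generated sequence.
    labels = tok(serialized, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = decomposer(**inputs, labels=labels).logits
    token_logp = torch.log_softmax(logits, dim=-1).gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    l_d = token_logp.sum().exp().item()  # includes special tokens; acceptable for a sketch
    atoms = [a.strip() for a in serialized.split("<sep>")]
    return atoms, l_d
```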
Building Non-leaf Nodes Based on $q_0.atoms$, we can generate all the non-leaf questions in the HQDT. The root question is just $q_0$ and thus has certainty score $p^0_g = 1$. For every other non-leaf question $q_i$, its atomic representation $q_i.atoms = \langle a^i_1, \ldots, a^i_{n_i} \rangle$ can be translated from a specific subset of $q_0.atoms$ by rearranging the reference tokens. The subset can be determined by considering the reference relations between a bridge or symbolic operation question $a^0_j \in q_0.atoms$, which corresponds to the leaf node $q_{ed_i}$, and the other questions in $q_0.atoms$; we show the details in Appendix C. For example, $q_2.atoms$ in Figure 1 is ("Which mountain is the highest in Africa?", "How high is #1?"), and it can be obtained from $(a^0_3, a^0_4)$ in $q_0.atoms$. Then we use a BART-based question generator $M_\phi$ to generate $q_i$ from $q_i.atoms$, where $L^i = a^i_1 \langle sep \rangle \, a^i_2 \langle sep \rangle \ldots \langle sep \rangle \, a^i_{n_i}$ is the serialized $q_i.atoms$, and $l^i_g = \Pr(q_i \mid L^i; \phi)$ is the likelihood of $q_i$ given $L^i$. The certainty score $p^i_g$ of $q_i$ is then computed from the decomposition likelihood $l_d$ and the generation likelihood $l^i_g$.

Learning of Question Decomposer and Generator The question decomposer $M_\theta$ can be trained with paired $(q_0, q_0.atoms)$ data, where the atomic representation comes from either given annotations or unsupervised construction. The question generator $M_\phi$ can be trained with the same data by exchanging the input and output. The details are given in Section 5.2.
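Non-leaf construction mirrors the decomposition step in reverse. The sketch below reuses tok from the previous sketch, assumes a fine-tuned generator $M_\phi$, and combines the two likelihoods by a product; the product is our assumption, since the paper's combination formula is omitted here.

```python
generator = BartForConditionalGeneration.from_pretrained("facebook/bart-base")  # stands in for fine-tuned M_phi

def build_nonleaf(sub_atoms, l_d: float):
    serialized = " <sep> ".join(sub_atoms)  # L^i, the serialized q_i.atoms
    inputs = tok(serialized, return_tensors="pt")
    out = generator.generate(**inputs, max_length=64,
                             return_dict_in_generate=True, output_scores=True)
    q_i = tok.decode(out.sequences[0], skip_special_tokens=True)
    # l_g^i = Pr(q_i | L^i; phi): per-token log-probs of the greedy sequence.
    tscores = generator.compute_transition_scores(out.sequences, out.scores,
                                                  normalize_logits=True)
    l_g = tscores.sum().exp().item()
    p_g = l_d * l_g  # assumed combination of decomposition and generation likelihoods
    return q_i, p_g
```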

Probabilistic Reasoning over HQDT

To solve $q_0$, we define a recursive reasoning function $f$: given a question $q_i$, the KB $G$, and the text corpus $C$, $f$ returns a list $R_i = \langle (ans^i_1, p^i_1), \ldots \rangle$, where $ans^i_j$ is an answer of $q_i$ and the score $p^i_j$ represents the certainty of $ans^i_j$. As shown in Figure 3, the implementation of $f$ contains three steps: 1) a scheduler determines the suitable knowledge sources for a particular question, i.e., whether the question can be answered from the KB, from text, or by solving its child questions sequentially; 2) according to the suitable sources output by the scheduler, executors get the answers with probabilities by executing on the KB (KB executor), retrieving from text (text executor), or answering the child questions (calling $f$ recursively); 3) an aggregator aggregates the candidate answers from all knowledge sources and outputs the top-k answers according to their probabilities. In the following, we introduce the details of each step when answering $q_i$.
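The three-step recursion can be summarized as below; scheduler, kb_executor, text_executor, and f_ref are the components sketched in the following paragraphs, and the deduplicating top-k selection at the end plays the role of the aggregator.

```python
def f(node, kb, corpus, k: int = 5):
    """Solve question `node`, returning up to k (answer, score) pairs."""
    # Step 1: the scheduler picks the suitable knowledge sources.
    suit_kb, suit_text, suit_child = scheduler(node, kb, corpus)
    candidates = []
    # Step 2: executors produce scored answers from each suitable source.
    if suit_kb:
        candidates += kb_executor(node, kb)
    if suit_text:
        candidates += text_executor(node, corpus)
    if suit_child:
        # Children q_st..q_{ed-1} are natural-language questions; the last
        # child is a bridge/symbolic question handled by f_ref.
        sibling_results = [f(c, kb, corpus, k) for c in node.children[:-1]]
        candidates += f_ref(node.children[-1], sibling_results, kb, corpus, k)
    # Step 3: the aggregator keeps the best score per surface form, then top-k.
    best = {}
    for ans, p in candidates:
        if ans not in best or p > best[ans]:
            best[ans] = p
    return sorted(best.items(), key=lambda x: x[1], reverse=True)[:k]
```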
Scheduler We formalize the scheduler as a function that takes the question $q_i$, the KB $G$, and the corpus $C$, and outputs $suit_{kb}$, $suit_{text}$, and $suit_{child}$: 0/1 variables respectively representing whether the answers of $q_i$ are suitable to obtain from the KB $G$, from the corpus $C$, or by solving $q_i.children$ sequentially.
Specifically, to check whether $G$ is suitable, the scheduler employs a semantic parser $M_{sp}$ (Cao et al., 2022a) to parse $q_i$ into a program $K$ with probability $p_{parse}$. It then classifies the type of $q_i$ according to the function skeleton of $K$. For example, the function skeleton of $K$ in Figure 2 is "Find-Relate-FilterConcept-SelectAmong". If the precision of $G$ on questions that have the same function skeleton as $K$ is larger than a predefined threshold $\gamma$, the scheduler sets $suit_{kb}$ to 1. To check whether the corpus $C$ is suitable, the scheduler tries to find a set of evidence paragraphs for $q_i$. If $C$ is too large, the scheduler first uses BM25 (Robertson and Zaragoza, 2009) to recall dozens of the most relevant paragraphs. For each paragraph, a RoBERTa (Liu et al., 2019)-based selector $M_{sl}$, which we train, classifies whether it is an evidence paragraph for $q_i$. If the set of selected evidence paragraphs $C_e$ is not empty, the scheduler sets $suit_{text}$ to 1.
To make the best use of knowledge from all levels, the scheduler simply sets $suit_{child}$ to 1 if $q_i$ is a non-leaf question, and 0 otherwise.
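A hedged sketch of the scheduler follows: semantic_parser, skeleton, skeleton_precision, and selector_is_evidence are assumed helpers (the parser and selector wrap the $M_{sp}$ and $M_{sl}$ models described above), and the rank_bm25 package is used as a stand-in BM25 implementation.

```python
from rank_bm25 import BM25Okapi

def scheduler(node, kb, corpus, gamma: float = 0.7, top_m: int = 10):
    # KB suitability: parse to a program, then check the historical precision
    # of its function skeleton against the threshold gamma.
    program, p_parse = semantic_parser(node.text)          # assumed M_sp wrapper
    suit_kb = int(skeleton_precision.get(skeleton(program), 0.0) > gamma)
    # Text suitability: BM25 recall, then the RoBERTa selector keeps evidence.
    bm25 = BM25Okapi([p.split() for p in corpus])
    recalled = bm25.get_top_n(node.text.split(), corpus, n=top_m)
    node.evidence = [p for p in recalled if selector_is_evidence(node.text, p)]
    suit_text = int(len(node.evidence) > 0)
    # Child suitability: 1 exactly when the question is non-leaf.
    suit_child = int(len(node.children) > 0)
    return suit_kb, suit_text, suit_child
```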
Executors The KB executor executes the program $K$ on the KB $G$ to get the answers, and uses the parsing score $p_{parse}$ to calculate the probability score of each answer. The text executor takes the selected paragraph set $C_e$ described above and employs a Transformer-based reading comprehension model $M_{rc}$ to extract answers from $C_e$, where the extraction probability $p^i_{ex,j}$ of answer $ans^i_{text,j}$ is given by $M_{rc}$. To solve $q_i$ through its children, $f$ recursively calls itself to solve $q_{st_i}, \ldots, q_{ed_i}$ in order.
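The two executors might look as follows; the question-answering pipeline is a stand-in for $M_{rc}$, kb.execute is an assumed KoPL-style program interpreter, and averaging each source score with the node's certainty $p_g$ is our assumption, in the spirit of the score averaging described below for $f_{ref}$.

```python
from transformers import pipeline

rc = pipeline("question-answering", model="deepset/roberta-base-squad2")  # stand-in for M_rc

def text_executor(node, corpus):
    answers = []
    for para in getattr(node, "evidence", []):       # C_e, selected by the scheduler
        out = rc(question=node.text, context=para)
        score = (out["score"] + node.p_g) / 2        # assumed combination with p_g
        answers.append((out["answer"], score))
    return answers

def kb_executor(node, kb):
    program, p_parse = semantic_parser(node.text)    # assumed M_sp wrapper
    results = kb.execute(program)                    # assumed program interpreter
    return [(r, (p_parse + node.p_g) / 2) for r in results]
```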
Here, $f_{ref}$ is a variant of $f$ used to solve bridge and symbolic operation questions, which refer to the answers of their sibling questions. Suppose $q_{ed_i}$ refers to the answers of its siblings $q_{r_1}, \ldots, q_{r_{h_i}}$ in order. If $q_{ed_i}$ is a bridge question, $f_{ref}$ will 1) convert $q_{ed_i}$ into several possible natural language questions $q^1_{nl}, \ldots, q^K_{nl}$ by replacing the reference tokens with every combination $((x^k_1, v^k_1), \ldots, (x^k_{h_i}, v^k_{h_i})) \in R_{r_1} \times \cdots \times R_{r_{h_i}}$, 2) call $f$ to solve each $q^k_{nl}$, and 3) fuse the answers from each $R^k_{nl}$ and select the top-k answers with the highest scores. Note that the score of answer $ans^k_{nl,j}$ is computed by averaging $p^k_{nl,j}$ and $v^k_1, \ldots, v^k_{h_i}$, instead of multiplying them, to avoid exponential shrinking during recursion. If $q_{ed_i}$ is a symbolic operation question with operation $op$ and arguments, $f_{ref}$ executes a simple program to apply the operation $op$ over $R_{r_1}, \ldots, R_{r_{h_i}}$ to get $R_{ed_i}$. The score of each answer $ans^{ed_i}_j$ is computed as the average of $p^{ed_i}_g$ and the scores of the answers in $R_{r_1}, \ldots, R_{r_{h_i}}$ used by the program to get $ans^{ed_i}_j$.

Aggregator The aggregator fuses $R^i_{kb}$, $R^i_{text}$ and $R^i_{child}$ by selecting the top-k answers with the highest scores from them. If several answers have the same surface form, only the one with the highest score is preserved.

Datasets

KQA Pro (Cao et al., 2022a) is a complex QA dataset whose KB is a subset of Wikidata (Vrandecic and Krötzsch, 2014), consisting of 16k entities, 363 predicates, 794 concepts and 890k triple facts. For each question, KQA Pro also provides the corresponding KoPL program. To simulate the realistic case where the KB is incomplete, following (Sun et al., 2019; Shi et al., 2021), we randomly discard 50% of the triples in the KB and take Wikipedia as a supplementary text corpus.
Musique (Trivedi et al., 2022) is a multi-hop QA dataset over text, including 25k 2-4 hop questions. We evaluate our framework under the Musique-Ans setting, where all questions are answerable. Its questions are carefully constructed from several single-hop QA datasets via manual composition and paraphrasing, and are hard to cheat via reasoning shortcuts. For each complex question, Musique gives 20 paragraphs (including annotated evidence paragraphs and distractor paragraphs) as the corpus. Specifically, for each question in the training set, Musique also provides a golden atomic representation, together with the answer and the evidence paragraph of each atomic question. In addition to the given paragraphs, we choose Wikidata as the KB to acquire additional knowledge.

Implementations
KQA Pro For the experiments on KQA Pro, a key challenge is that there are no annotations of atomic representations, which are required for training the question decomposer and generator in RoHT. Because the KoPL program of a complex question follows a context-free grammar, every atomic question corresponds to a specific span of the program. Therefore, we first split the KoPL program into sub-programs according to the grammar, then use each sub-program to generate an atomic question by applying a BART model fine-tuned with the (KoPL, question) pairs from the original dataset. To obtain the answers for each atomic question, we execute the corresponding sub-programs on the KB. Using these constructed atomic representations, we train two BART-base models as the question decomposer and generator, respectively.
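The weak-supervision construction just described can be summarized as the following sketch, where split_kopl and program_to_question are assumed helpers (the latter wrapping the BART model fine-tuned on (KoPL, question) pairs).

```python
def build_atomic_representation(kopl_program, kb):
    sub_programs = split_kopl(kopl_program)       # split by the context-free grammar
    atoms, answers = [], []
    for sp in sub_programs:
        atoms.append(program_to_question(sp))     # sub-program -> atomic question
        answers.append(kb.execute(sp))            # execute sub-program for its answer
    return atoms, answers
```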
For the scheduler, we directly use the semantic parser trained by Cao et al. (2022a) on KQA Pro, and set the precision threshold $\gamma$ to 0.7. We train a RoBERTa-large evidence selector in a weakly supervised way: for each question in the training set and its constructed atomic representation, we first use BM25 to recall 10 related paragraphs from Wikipedia, then take the paragraphs that contain the answer as positive samples and the other recalled paragraphs as negative samples. For the text executor, we also train a BART-large reading comprehension model on these positive samples.
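The distant-supervision labeling for the selector amounts to the following sketch, under the stated assumption that containing the answer string is the positive criterion.

```python
from rank_bm25 import BM25Okapi

def make_selector_samples(question: str, answer: str, corpus, n_recall: int = 10):
    bm25 = BM25Okapi([p.split() for p in corpus])
    recalled = bm25.get_top_n(question.split(), corpus, n=n_recall)
    positives = [p for p in recalled if answer in p]      # contain the gold answer
    negatives = [p for p in recalled if answer not in p]  # other recalled paragraphs
    return positives, negatives
```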
Musique Since Musique provides a golden atomic representation for every complex question in the training set, we directly use them to train BART-base models as the question decomposer and generator. For the scheduler, we adapt the semantic parser trained by Cao et al. (2022a) to Wikidata. The KB precision threshold $\gamma$ is set to 0.4, which is determined by the top-10 types of questions with the highest precision. We train the RoBERTa selector model on complex and atomic questions in the training set together, taking annotated evidence paragraphs as positive samples and distractor paragraphs as negative samples. For the text executor, we pre-train a Longformer-large (Beltagy et al., 2020) reading comprehension model on SQuAD (Rajpurkar et al., 2016), then fine-tune it on complex and atomic questions of Musique.

Baselines

SA (Trivedi et al., 2022) is a two-stage model that first uses a RoBERTa-large selector to rank and select the K paragraphs most relevant to the question, and then uses a Longformer-large answerer to predict the answer based on the selected paragraphs. EX(SA) (Trivedi et al., 2022) is the state-of-the-art model on Musique. It first explicitly decomposes the complex question into its atomic representation and then calls the SA model repeatedly to answer each atomic question in order.
TransferNet (Shi et al., 2021) iteratively transfers entity scores via activated paths on a relation graph that consists of both text-form and KB-form relations. It is the existing state-of-the-art model that utilizes both KBs and text as knowledge sources, and nearly solves MetaQA. We reimplement it on both KQA Pro and Musique; the details are given in Appendix D.
RoHT: RoHT_KB, RoHT_text and RoHT_mix denote the RoHT models that use only the KB, only the text, and both the KB and text, respectively.

Results on KQA Pro
The experimental results on KQA Pro are shown in Table 1. When using only the incomplete KB, the RoHT_KB model improves EM by 21.22, 4.17, and 0.90 compared to KVMemNN, RGCN, and BART KoPL, respectively, showing the benefit of integrating the answers of sub-questions at different levels. After adding Wikipedia as a supplementary text corpus, RoHT_mix yields a substantial improvement over RoHT_KB (7.51 EM), demonstrating the effectiveness of utilizing knowledge from the KB and text together. RoHT_mix also outperforms TransferNet, which is trained end-to-end with a mixed relation graph, by a large margin (29.65 EM). This is because, unlike graph-based methods, RoHT explicitly exposes the compositional structure of a complex question in natural language form via HQDT generation, and thus can retrieve answers from the KB and text with more advanced and flexible sub-modules (e.g., a semantic parser and a reading comprehension model). Moreover, the designed atomic operations in the HQDT enable RoHT to solve a wide variety of complex questions: RoHT_mix achieves the best results on 6 of 7 question types, showing comprehensive reasoning capacity.

Results on Musique

On Musique, we can also see benefits from supplementing the text information with KB information, though the improvement is smaller than that of supplementing the KB with text on KQA Pro, because KBs have lower coverage than text and the semantic parser is not specifically fine-tuned for Musique questions. We submit the predictions of RoHT_mix on the test set and achieve a 63.6 F1 score, which significantly outperforms the best public result of 52.3.

Effect of Scheduler
To show the effect of the scheduler module, we remove it from the RoHT_mix model, i.e., we assume by default that the KB and the recalled/given text paragraphs are suitable for all questions in the HQDT, and evaluate the performance again on the dev sets of KQA Pro and Musique. The results are shown in Table 3. After discarding the scheduler, the EM performance on KQA Pro and Musique drops by 5.8 and 7.4, respectively. It is therefore important to use the scheduler to select suitable knowledge sources for each question.

Effect of Hierarchical Decomposition
Many existing methods generate a non-hierarchical decomposition of a complex question, similar to the atomic representation, to assist reasoning (Min et al., 2019; Wolfson et al., 2020; Deng et al., 2022). To demonstrate the superiority of hierarchical decomposition, we compare our RoHT_mix model with a RoAT_mix model, which uses the same scheduler, executors, and aggregator as RoHT_mix, but solves the complex question by directly answering the atomic questions in its atomic representation in order. As shown in Table 3, RoHT_mix outperforms RoAT_mix by a large margin on both KQA Pro and Musique. This is because the hierarchical structure of HQDT enables the RoHT model to fuse knowledge from KBs and text at different question levels, and to discard wrong answers by comparing the probabilistic scores of answers.

(Figure: Case study on the Musique question "Why did Roncalli leave the city where the painter of Venus with a Mirror died?", showing its HQDT, with sub-questions such as "The Venus with a Mirror was made by whom?", "Where did #3 die?", and "Where did the creator of The Venus with a Mirror die?", alongside its flat atomic representation.)
To further understand the reason, we show a case from Musique in Figure 3. Both RoHT_mix and RoAT_mix fail to answer the question "Where did (Titian) die?" ($q_4$ in the HQDT, $a^0_2$ in the atomic representation). However, RoHT_mix directly extracts the correct answer of $q_1$ from text and finally obtains the correct answer of $q_0$ with the highest score, while RoAT_mix fails to solve $a^0_3$ because it must rely on the wrong answer of $a^0_2$.

Conclusion
In this paper, we propose RoHT, an understanding-reasoning XQA framework that uses both a KB and a text corpus to derive the answers of complex questions. RoHT first builds the HQDT for a complex question to understand its hierarchical compositional structure, then conducts recursive probabilistic reasoning over the HQDT to solve the question, integrating answers from the KB, text, and sub-questions. Experiments show that RoHT significantly outperforms previous methods. We also demonstrate the superiority of HQDT compared with non-hierarchical decomposition.

Limitation
Currently, the RoHT framework is restricted to incorporating KBs and text. However, since RoHT retrieves answers from each knowledge source separately, it could in principle utilize knowledge from more heterogeneous sources such as tables, and we will study this in future work. In addition, a device with large storage space and memory is needed to store and use Wikipedia and Wikidata.

Ethics Statement
The data used in this paper are drawn from publicly published datasets, encyclopedias and knowledge bases. Most of them do not involve sensitive data.

Acknowledgement
This work is supported by grants from the Institute for Guo Qiang, Tsinghua University (2019GQB0003) and Cloud BU, Huawei Technologies.
