HPE: Answering Complex Questions over Text by Hybrid Question Parsing and Execution



Introduction
End-to-end neural models that transductively learn to map questions to their answers have been the dominant paradigm for textual question answering (Raffel et al., 2020; Yasunaga et al., 2021), owing to their flexibility and solid performance. However, they often suffer from a lack of interpretability and generalizability. Symbolic reasoning, on the other hand, relies on producing intermediate explicit representations such as logical forms or programs, which can then be executed against a structured knowledge base (e.g., relational database, knowledge graph, etc.) to answer questions (Gu et al., 2022; Baig et al., 2022). These methods naturally offer better interpretability and precision thanks to the intermediate symbolic representations and their deterministic execution. However, they might be limited in expressing a broad range of questions in the wild, depending on the semantic coverage of the underlying symbolic language and grammar.
Neural Module Networks (Andreas et al., 2016; Gupta et al., 2019; Khot et al., 2020) have been proposed to combine the neural and symbolic modalities. However, they require a symbolic language and a corresponding model that only cover limited scenarios in a specific task or domain. To apply this approach to new tasks or domains, new languages and neural modules have to be introduced. Our goal is therefore a generalizable framework that pairs a high-coverage symbolic expression with a flexible neural network that can be used versatilely in various scenarios.
Recent chain-of-thought prompting works on large language models (Wei et al., 2022; Zhou et al., 2022; Drozdov et al., 2022) share the insight of solving a complex question by iteratively decomposing it into simpler sub-questions that can be solved. Inspired by them, we adopt decomposition in our framework to make the model generalize to complex questions.
In this work, we propose a Hybrid question Parsing and Execution framework, named HPE, for textual question answering, which combines neural and symbolic reasoning. To this end, we introduce the H-expression as a simple explicit representation of the original complex question, which contains only primitives and operations, in the fashion of (Liu et al., 2022). As shown in Figure 1, we define a primitive in the H-expression as a single-hop question, and an operation expresses the connection between primitives. H-parser, a semantic parsing process, converts questions into H-expressions. To execute H-expressions, we design a hybrid executor (H-executor), which not only utilizes a neural network to answer each single-hop question but also applies deterministic rules to combine the answers of single-hop questions into the final answer. The reader in H-executor is detachable and can be replaced by any reader model (Izacard and Grave, 2020; Beltagy et al., 2020), and this network can be trained globally on massive single-hop question answering data.
Our contributions can be summarized as follows: • Architecture: We propose to combine the advantages of both symbolic and neural reasoning paradigms by parsing questions into hybrid intermediate expressions that can be iteratively executed against the text to produce the final answer. Extensive experiments on MuSiQue (Trivedi et al., 2022b) and 2WikiMultiHopQA (Ho et al., 2020) show that our proposed approach achieves state-of-the-art performance.
• Generalizability: End-to-end neural approaches are data hungry and may suffer significantly from poor generalization to unseen data, especially in limited-resource scenarios. Our design, on the other hand, naturally splits the reasoning process into H-parser and H-executor, disentangling learning to parse complex questions structurally from learning to resolve the simple questions therein. Our few-shot experiments on MuSiQue and 2WikiMultiHopQA and zero-shot experiments on HotpotQA (Yang et al., 2018) and NQ (Kwiatkowski et al., 2019) suggest that even with less training data, our approach generalizes better to unseen domains.
• Interpretability: The execution process of our model is the same as its reasoning process. The transparency of our approach facilitates spotting and fixing erroneous cases.

Approach
We formulate textual question answering as the task of answering a question q given the textual evidence provided by a passage set P. We assume access to a dataset of tuples {(q_i, a_i, P_i)}, where a_i is a text string that defines the correct answer to question q_i. Previous works take this tuple as input and directly generate the predicted answer.
In this work, we cast this task as question parsing with hybrid execution. Given a question q_i, a question parser is tasked to generate the corresponding H-expression l_i. The generated H-expression l_i and the supporting passage set P_i are given to the execution model to generate the predicted answer.
Our framework consists of two modules, namely H-parser and H-executor. In Section 2.1, we first define the H-expression as a hybrid expression consisting of primitives and operations, which serves as an intermediate representation connecting H-parser and H-executor. H-parser, a Seq2Seq model, takes the question as input and outputs an H-expression as an explicit representation of the question. In Section 2.2, H-executor first compiles the H-expression into a tree structure by symbolic rules. Then H-executor delegates a single-hop reader for simple question answering and executes the symbolic operations over the answers in a bottom-up manner to get the final prediction.

H-parser
To improve the compositional generalization, we follow (Liu et al., 2022) to define H-expression as the composition of primitives and operations.
Grammars and rules of H-expression In this framework, we define a primitive as a single-hop question, the atomic element constituting the complex question. We use operations to represent the relations between primitives. H-expressions contain seven types of operations: JOIN, AND, COMPARE_=, COMPARE_>, COMPARE_<, MINUS and ADDITION. Each operation is a binary function that takes two primitives q2 and q1 as input, written as OP[q2, q1], where OP ∈ {JOIN, AND, COMPARE_=, COMPARE_>, COMPARE_<, MINUS, ADDITION} and q1, q2 are format-free single-hop questions. In the execution step, q1 is executed first, then q2. These operations can be combined into more complex H-expressions, for example, JOIN[q3, JOIN[q2, q1]] or JOIN[q3, AND[q2, q1]]. For a single-hop question, its H-expression is itself.
In Table 1, we list the operation definitions, which are executed in the H-executor. More specifically, the JOIN operation is used for linear-chain reasoning. q1 is a complete question that can be answered directly, while q2 is an incomplete question with a placeholder (A1) inside. This operation is executed serially: q1 is executed first, and its answer replaces the placeholder in q2. The AND operation is used for intersection reasoning and returns the intersection of the answers of q2 and q1. COMPARE_= determines whether the answers of q2 and q1 are equal; the return value is "Yes" or "No". The COMPARE_< and COMPARE_> operations select the question entity corresponding to the smaller or the bigger answer of q2 and q1, respectively. The MINUS and ADDITION operations handle subtractions and additions involving the answers of q2 and q1.
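To make the operation semantics concrete, the following is a minimal Python sketch of the seven operations as deterministic functions over sub-question answers. It is an illustration of the definitions in Table 1 under our own naming and normalization assumptions, not the authors' released code.

```python
# Minimal sketch of the seven H-expression operations (Table 1).
# Function names, types, and answer normalization are illustrative
# assumptions; only the operation semantics follow the paper.

def op_join(q2: str, a1: str) -> str:
    # Linear-chain reasoning: fill q2's placeholder with q1's answer,
    # yielding a new single-hop question to be answered next.
    return q2.replace("A1", a1)

def op_and(a2: list, a1: list) -> list:
    # Intersection reasoning: answers common to both sub-questions.
    return [a for a in a2 if a in a1]

def op_compare_eq(a2: str, a1: str) -> str:
    # Equality check; returns "Yes" or "No".
    return "Yes" if a2.strip().lower() == a1.strip().lower() else "No"

def op_compare_lt(entity2: str, a2: float, entity1: str, a1: float) -> str:
    # Select the question entity whose answer is smaller.
    return entity2 if a2 < a1 else entity1

def op_compare_gt(entity2: str, a2: float, entity1: str, a1: float) -> str:
    # Select the question entity whose answer is bigger.
    return entity2 if a2 > a1 else entity1

def op_minus(a2: float, a1: float) -> float:
    return a2 - a1

def op_addition(a2: float, a1: float) -> float:
    return a2 + a1
```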

H-expression Generation
The semantic parsing process on KBs and DBs usually needs to interact with the background context to match natural questions to logical forms with the specified schema, which is a necessary condition for execution on a knowledge base (Ye et al., 2021) or table (Lin et al., 2020). In textual question answering, however, the question parsing process is context-independent, because we want the meaning of the original question and the H-expression to be equivalent without any additional information from the context.
Our parser is a Seq2Seq model that takes a natural question q as input and generates the H-expression l as output. We delegate a T5 model (Raffel et al., 2020) as the basis of our question parser, as it demonstrates strong performance on various text generation tasks. We train the model by teacher forcing: the target H-expression is generated token by token, and the model is optimized using cross-entropy loss. At inference time, we use beam search to decode the top-k target H-expressions in an auto-regressive manner.
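As a concrete illustration, H-parser inference can be assembled from off-the-shelf Hugging Face components. The sketch below assumes a T5 checkpoint already fine-tuned on (question, H-expression) pairs; the checkpoint name and decoding hyperparameters are placeholders, not values from the paper.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Assumed to be a checkpoint fine-tuned on (question, H-expression) pairs.
tokenizer = T5Tokenizer.from_pretrained("t5-large")
model = T5ForConditionalGeneration.from_pretrained("t5-large")

question = ("Which team is Bernard Lowe was a member of "
            "beat the winner of the 1894-95 FA Cup")
inputs = tokenizer(question, return_tensors="pt")

# Beam search decodes the top-k candidate H-expressions auto-regressively.
candidates = model.generate(
    **inputs, num_beams=5, num_return_sequences=5, max_length=128
)
for seq in candidates:
    print(tokenizer.decode(seq, skip_special_tokens=True))
```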

H-executor
Unlike execution on DBs and KBs, which is fully program-based, our execution has both a neural and a symbolic component. The advantage of the symbolic representation is that the resulting execution process is deterministic and robust.
H-expression to Tree Structure H-expressions are nested, which a linear sequence cannot represent well. H-executor first interprets the linear H-expression into a binary tree structure, where primitives are leaf nodes and operations are non-leaf nodes. All primitives are executed by the neural network, and the non-leaf nodes are executed by deterministic symbolic rules to generate new primitives or answers.
The interpreter of H-executor traverses from the rightmost primitive, followed by its parent node and the left branch recursively, which is similar to in-order traversal with the opposite leaf order. As shown in Figure 2, the question who is winner of 1894-95 FA Cup is the first primitive to be executed: the single-hop reader is called to answer this question and gives the answer Aston Villa. Then the non-leaf operation AND is visited, which stores A1 as Aston Villa. The left primitive what is member of sports team of Duane Courtney is visited next; the single-hop reader answers this question and returns Birmingham City, which is stored as A2. Then the operation JOIN is visited, which replaces the placeholders A1 and A2 with the stored answers and produces a new primitive When was the last time Birmingham City beat Aston Villa. This primitive is answered by the single-hop reader, which predicts the final answer 1 December 2010.
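The compile-and-traverse logic can be sketched as follows. The bracket-depth parser and the recursive executor below are a minimal illustration of the procedure described above, not the authors' code; single_hop_reader stands in for the neural reader, and the AND case follows Table 1's intersection semantics rather than the answer-storing variant shown in Figure 2.

```python
# Sketch: compile a linear H-expression into a binary tree and execute it
# bottom-up, right branch first. All names are illustrative assumptions.

OPS = ("COMPARE_=", "COMPARE_>", "COMPARE_<", "ADDITION", "MINUS", "JOIN", "AND")

def parse(expr):
    expr = expr.strip()
    for op in OPS:
        rest = expr[len(op):].lstrip() if expr.startswith(op) else ""
        if rest.startswith("[") and rest.endswith("]"):
            body, depth = rest[1:-1], 0
            for i, ch in enumerate(body):          # locate the top-level comma
                depth += (ch == "[") - (ch == "]")
                if ch == "," and depth == 0:
                    return (op, parse(body[:i]), parse(body[i + 1:]))
    return expr                                    # leaf: a single-hop question

def execute(node, single_hop_reader, passages):
    if isinstance(node, str):                      # leaf: call the neural reader
        return single_hop_reader(node, passages)
    op, q2, q1 = node
    a1 = execute(q1, single_hop_reader, passages)  # rightmost primitive first
    if op == "JOIN":
        # q2 is a primitive holding the placeholder in the paper's examples.
        return execute(q2.replace("A1", str(a1)), single_hop_reader, passages)
    a2 = execute(q2, single_hop_reader, passages)
    if op == "AND":
        return [a for a in a2 if a in a1]          # intersection; assumes list answers
    if op == "COMPARE_=":
        return "Yes" if a2 == a1 else "No"
    if op == "MINUS":
        return float(a2) - float(a1)
    if op == "ADDITION":
        return float(a2) + float(a1)
    raise NotImplementedError(op)                  # COMPARE_< / COMPARE_> also need entities

# Example (bridge type from Table 2):
# tree = parse("JOIN[Where is A1's place of death, Who is director of The Iron Man]")
# answer = execute(tree, single_hop_reader, passages)
```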
Training of Reader Network We use FiD (Izacard and Grave, 2020) as our reader network, which is a generative encoder-decoder model. Each supporting passage is concatenated with the question and processed independently of the other passages by the encoder. The decoder attends over the concatenation of all resulting representations from the encoder. To distinguish the different components, we add the special tokens question:, title: and context: before the question, title and text of each passage. Note that the reader network is detachable and may be replaced by any generative or extractive reader model of choice.
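For instance, the per-passage encoder inputs could be assembled as follows; the marker tokens match the format described above, while the function and field names are illustrative.

```python
def build_fid_inputs(question, passages):
    # One encoder input per passage: the question is repeated, and each
    # passage is processed independently; markers separate the components.
    return [
        f"question: {question} title: {p['title']} context: {p['text']}"
        for p in passages
    ]

inputs = build_fid_inputs(
    "who is winner of 1894-95 FA Cup",
    [{"title": "1895 FA Cup Final", "text": "Aston Villa won the 1894-95 FA Cup ..."}],
)
# -> ['question: who is winner of 1894-95 FA Cup title: 1895 FA Cup Final context: ...']
```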
We assume that single-hop questions are much easier to answer, and that it is feasible to have a global single-hop reader that can be adapted to any unseen dataset in the Wikipedia domain. To get a global single-hop reader, we leverage the large-scale QA pairs from Probably Asked Questions (PAQ) (Lewis et al., 2021). To reduce the training cost, we first pre-train the model with few passages to get a reasonable checkpoint and then fine-tune using all supporting passages. In detail, we first train the Seq2Seq model T5-large in the reading comprehension setting (with one positive passage) using PAQ data. Then we use the trained T5-large from PAQ to initialize the FiD model and further train it on the training sets of TriviaQA (Joshi et al., 2017), SQuAD (Rajpurkar et al., 2016) and BoolQ (Clark et al., 2019) in a multiple-passage setting (with one positive passage and nineteen negative passages). The global reader can be used zero-shot on unseen questions and also boosts performance in the fine-tuning setting when used as pre-trained weights.

Experiments
We conduct experiments on two multi-hop multi-passage textual QA datasets, MuSiQue and 2WikiMultiHopQA, which contain complex questions and corresponding decomposed simple questions for the supervised setting. We also test models' generalization in the few-shot setting using 5-20% of the training data. In real scenarios, neither decomposed questions nor the complexity of questions is known. Therefore, we also investigate our models under the zero-shot setting on both complex (HotpotQA) and simple (NQ) QA datasets. In the end, we carry out a case study to show the interpretability of our framework.

Datasets
We describe each dataset and then explain how to convert the original data into the training format for both question parsing and execution. MuSiQue (Trivedi et al., 2022b) contains multi-hop reasoning questions with a mixed number of hops and question entities, asked over 20 supporting passages. It contains 19,938/2,417/2,459 examples in the train/dev/test sets, with 2hop1 (questions with 2 hops and 1 entity), 3hop1, 4hop1, 3hop2, 4hop2 and 4hop3 reasoning types. 2WikiMultiHopQA (2WikiQA) (Ho et al., 2020) requires models to read and perform multi-hop reasoning over 10 passages. Three types of reasoning are included, namely comparison, bridge, and bridge-comparison. It contains 167,454/12,576/12,576 examples in the train/dev/test sets.

Reconstruction MuSiQue contains complex questions, decomposed single questions with answers, and the reasoning type for each complex question. We use the JOIN operation to combine linear-chain type questions and the AND operation to combine intersection type questions. In 2WikiQA, we use the evidence (in the form of triples <subject, relation, object>) and the reasoning type to create the H-expression. In detail, we first convert the subject and relation into a natural question using templates, with the object as the answer of this natural question. Then, we use operations to combine those single-hop questions into an H-expression based on their reasoning type. Table 2 shows a few examples of questions and their corresponding H-expressions.

Baselines Self-Ask and IRCoT are few-shot prompting methods based on large language models like GPT-3 (Brown et al., 2020). They iteratively generate an answerable question, use retrieval to get supporting passages, and answer the question based on the retrieved passages. SA (Trivedi et al., 2022b) is the state-of-the-art model on the MuSiQue dataset, which first uses a RoBERTa-based (Liu et al., 2019) ranking model to rank supporting passages and then uses an End2End reader model to answer complex questions using the top-k ranked passages. EX(SA) (Trivedi et al., 2022b) decomposes a complex question into single-hop questions and builds a directed acyclic graph (DAG) for the single-hop reader (SA) to memorize the answer flow. NA-Reviewer (Fu et al., 2022) proposes a reviewer model that can fix erroneous predictions from incorrect evidence. We include FiD (Izacard and Grave, 2020) as the baseline End2End reader model. The original FiD takes the question as well as the supporting passages as input and generates the answer as a sequence of tokens. Moreover, we propose two variants of FiD to compare the influence of using H-expressions: one uses H-expressions as input, instead of the original questions, to generate answers (referred to as FiD LF→Ans), and the other uses questions as input to generate both H-expressions and answers (referred to as FiD CQ→LF+Ans).

Implementation Details
We describe fine-tuning details for question parsing and single-hop reader models in Appendix B.

Pre-training (PT)
To pre-train the single-hop reader, we use a subset of PAQ (Lewis et al., 2021) consisting of 20M pairs, which are generated based on named entities and the greedily decoded top-1 sequence with a beam size of 4. We train a T5-large model for 400k steps, with one gold passage, a maximum length of 256 and a batch size of 64. Then we initialize FiD with the PAQ pre-trained model and further train it for 40k steps, with a batch size of 8 and 20 supporting passages, on the combined training sets of TriviaQA (Joshi et al., 2017), SQuAD (Rajpurkar et al., 2016) and BoolQ (Clark et al., 2019). Our code is based on Huggingface Transformers (Wolf et al., 2019). All experiments are conducted on a cloud instance with eight NVIDIA A100 GPUs (40GB).

Fine-tuning Results
We present our main results on MuSiQue and 2WikiQA in Table 3. We observe that Self-Ask and IRCoT, which are based on large language models and search engines, underperform most supervised models. This indicates that multi-hop multi-paragraph question answering is a difficult task, and there is still an evident gap between supervised small models and large models with few-shot or zero-shot prompting. Moreover, our framework outperforms the previous SOTA methods on both datasets. We notice that the baseline EX(SA) underperforms SA by a large margin, while our HPE outperforms FiD by 5.3% on MuSiQue EM. This shows the difficulty of building a good H-expression and executor. Moreover, EX(SA) performs poorly on 2WikiQA, which shows that using a DAG to represent the logical relationship between sub-questions does not adapt to every reasoning type. Compared with the End2End baseline (FiD) that our model is built on, our framework with an explicit representation performs much better. As for FiD LF→Ans and FiD CQ→LF+Ans, using the H-expression as the input or output, in the expectation that this helps the model capture the decomposition and reasoning path implicitly, does not help. This suggests that only the proposed execution method can help the model capture the logical reasoning represented in the H-expression.

Few-shot Results
To illustrate the generalization ability of our framework, we analyze our method under the few-shot setting in Table 4. We run three experiments, randomly sampling 5%, 10%, and 20% of the training data. We use the End2End FiD model as the baseline, which takes complex questions as input and generates the answers as token sequences.

Zero-shot Results
We expect the H-parser to work well on questions of varying levels of complexity. To verify this, we test the models on two benchmarks, HotpotQA and NQ, without any tuning. The former does not contain any decomposed questions, and the latter contains common questions from the real world.

Dataset
HotpotQA We use the distractor setting (Yang et al., 2018), in which a model needs to answer each question given 10 passages.
To produce the correct answer for a question, the dataset requires the model to reason across two passages. Note that the two main reasoning types in HotpotQA, bridge and comparison, are included in MuSiQue and 2WikiQA. NQ (Kwiatkowski et al., 2019) contains open-domain questions collected from Google search queries. NQ is usually treated as a simple-question dataset, and previous works typically use an End2End multi-passage reader like FiD. However, we argue that certain questions in NQ involve multi-hop reasoning, and model performance can be improved by decomposing them into single-hop questions.

Global Question Parser
To seamlessly generate H-expressions for unseen questions, we need a global question parser. This parser must understand the complexity of a question: it decomposes a complex question into several simple questions and keeps a simple question as is. To get a global question parser, we train a pre-trained generative model, T5 (Raffel et al., 2020), to convert questions to H-expressions using the MuSiQue and 2WikiMultiHopQA datasets. As the two datasets differ in size, we categorize the complex questions by their reasoning type and sample the same amount of data for each category. To endow the model with the ability to understand question complexity, we also use the simple questions in those datasets (the H-expression of a simple question is itself). Moreover, we decouple the composition of complex H-expressions into a few simple H-expressions to ensure coverage of all levels of complexity.
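The balanced sampling over reasoning types can be sketched as follows; the field names and sampling interface are illustrative assumptions, not the paper's data pipeline.

```python
import random
from collections import defaultdict

def balanced_sample(examples, per_type, seed=0):
    # Group (question, H-expression) pairs by reasoning type, then draw
    # the same number from each group so no dataset or type dominates.
    by_type = defaultdict(list)
    for ex in examples:
        by_type[ex["reasoning_type"]].append(ex)
    rng = random.Random(seed)
    return [
        ex
        for group in by_type.values()
        for ex in rng.sample(group, min(per_type, len(group)))
    ]
```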

Zero-Shot Results on HotpotQA
We show the HotpotQA results in Table 5. We use FiD pre-trained on PAQ and further trained on TriviaQA, SQuAD and BoolQ as our zero-shot reader. Our framework outperforms both Standard and CoT, which use prompt-based large language models. This shows that with the hybrid question parsing and execution framework, a small language model generalizes to unseen questions. Compared with FiD (PT), we get comparable performance. However, checking the union of HPE and FiD, which takes the correct predictions from both methods, we find that a 15% absolute gain can be obtained. This shows that HPE correctly answers around 15% of the questions that FiD predicts incorrectly, with the help of question decomposition and symbolic operations. On the other hand, we conjecture that the reason HPE wrongly predicts some questions is that the global question parser fails to generate the H-expression correctly. Hence, it is worth exploring how to design a generalizable global question parser in future work.

Results on NQ
We use the global question parser to decompose NQ questions in a zero-shot manner. If a question is recognized as single-hop reasoning and cannot be further decomposed, the parser keeps the question unchanged. We use the DPR model (Karpukhin et al., 2020) to retrieve the top-20 documents from Wikipedia as the supporting documents. Among the 8k dev set examples, 32 questions have been decomposed into single-hop questions with logical operations, and the rest are left as is. For example, the question "when did the last survivor of the titanic die" is converted into the H-expression "JOIN [when did A1 die, who was the last person to survive the titanic]". The result in Table 6 shows that HPE can handle questions of different complexity levels and does not degenerate on simple questions.

Ablation Study
Impact of H-parser We show the performance of different H-parsers. Table 7 shows that using T5-large rather than T5-base yields around 2 to 4 percent performance improvement on both datasets. Compared to the result using gold H-expressions, there is more room for improvement on the MuSiQue dataset. This is likely because the questions in MuSiQue are generally more complex than those in 2WikiQA.
Impact of H-executor Our hybrid executor combines symbolic operations with a replaceable reader network. We analyze the influence of different reader networks on the final performance and experiment with different versions of FiD. SupportFiD generates both answers and the supporting document titles. SelectFiD is a two-step method that first uses a RoBERTa-based (Liu et al., 2019) ranking model to predict the top-5 relevant documents and then feeds them into FiD to generate the answer. From the results in Table 8, we can see that a better single-hop reader produces better performance on MuSiQue. The improvement on the single-hop reader translates into a significant performance boost on complex questions.

Case Study
In this section, we analyze the error cases. Moreover, we show the performance under each reasoning type on MuSiQue and 2WikiQA in Appendix C. In the end, we show a case of how our model reasons over a complex question in Appendix D.
Error Analysis There are two types of errors in our model predictions. One is errors from the semantic parsing of the H-expression; the other is errors from single-hop question answering. On the MuSiQue dataset, the first type accounts for 67% of errors and the second type for 33%.
When the number of hops gets larger, our model can suffer from exposure bias (Bengio et al., 2015). Due to chain reasoning, the next-step question depends on the previous answers. This problem becomes acute if the model predicts a bad output at a certain step, which in turn affects the final answer. However, an additional advantage of our work is that once we know where the error comes from, we can fix the issue and get the correct final answer. To fix a wrong prediction, we can check whether the generated H-expression is correct. If it is problematic, for instance a bridge-type H-expression generated for a comparison-type complex question, we can replace it with the correct one. Otherwise, if we find that a single-step answer is wrongly predicted, we can correct that single-hop answer. Moreover, exposure bias can be mitigated by beam search (Wiseman and Rush, 2016): rather than generating one answer at each step, we can generate multiple answers, and the final answer is the highest-scoring one.
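A step-wise beam search over a reasoning chain could look like the sketch below; reader_topk is a hypothetical interface returning (answer, log-score) pairs, and the chain is simplified to JOIN-style placeholder substitution.

```python
def beam_execute(chain, reader_topk, passages, k=5):
    # Keep the top-k answers at every step instead of committing to one;
    # `chain` is a list of question templates, e.g.
    # ["Who is director of The Iron Man?", "Where is A1's place of death?"].
    beams = [("", 0.0)]                       # (latest answer, cumulative log-score)
    for template in chain:
        scored = []
        for answer, score in beams:
            question = template.replace("A1", answer) if answer else template
            for cand, s in reader_topk(question, passages, k):
                scored.append((cand, score + s))
        beams = sorted(scored, key=lambda b: b[1], reverse=True)[:k]
    return beams[0][0]                        # highest-scoring final answer
```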
Related Work

Neural Symbolic Systems

Gupta et al. (2019) introduce a neural module network (NMN), which solves QA with different modules performing text span extraction and arithmetic operations. Khot et al. (2020) introduce an NMN variant that decomposes a question into a sequence of simpler ones answerable by different task-specific models. Systematic question decomposition is also explored in (Talmor and Berant, 2018; Min et al., 2019; Wolfson et al., 2020). Although our framework shares some similarities with this line of work, there is an essential difference: we keep the symbolic and neural representations consistent with each other, whereas they delegate neural models to replace the non-differentiable symbolic representation in order to train the model end-to-end.

Explainable QA
A series of recent works focuses on generating explanations, which can be viewed as reasoning chains. (Yavuz et al., 2022; Latcinnik and Berant, 2020; Jhamtani and Clark, 2020) formulate multi-hop question answering as single-sequence generation, where the sequence contains an answer along with its reasoning path. Even though the generated reasoning path may provide some explanation of how the question is solved, there is no guarantee that the answer is indeed generated by the predicted reasoning path.
Recently, large language models (LLMs) have shown the capability to answer complex questions by producing step-by-step reasoning chains (chains-of-thought, or CoT) when prompted with a few examples (Wang et al., 2022; Zhou et al., 2022; Wei et al., 2022) or even without any examples (Kojima et al., 2022; Kadavath et al., 2022).

Conclusion
We propose HPE for answering complex questions over text, which combines the strengths of neural network approaches and symbolic approaches. We parse the question into H-expressions, followed by hybrid execution to get the final answer. Our extensive empirical results demonstrate that HPE achieves strong performance on various datasets under supervised, few-shot, and zero-shot settings. Moreover, our model has strong interpretability, exposing its underlying reasoning process, which facilitates understanding and possibly fixing its errors. By replacing our text reader with a KB- or Table-based neural network, our framework can be extended to solve KB and Table QA.

Limitations
We acknowledge that our work could have the following limitations: • Even though the defined H-expression covers various reasoning types and different textual question answering datasets, it is not yet mature enough to handle every type of reasoning. When a new reasoning type appears, we need to retrain the question parser. To handle new reasoning types, we plan to take advantage of in-context learning with a large language model to generate H-expressions as future work. It is worth mentioning that our executor can be easily adapted to new reasoning types by adding new symbolic rules, and the reader network does not need to be retrained.
• As mentioned in the error analysis section, the bottom-up question answering process can suffer from exposure bias, since the next-step question answering may depend on previously predicted answers. To deal with this limitation, we anticipate that generating multiple answers using beam search at each step may largely mitigate this issue. Since candidates predicted by current reader models have strong lexical overlap, general beam search needs to be revised to provide sufficient coverage of semantic meanings. We leave this for future work.

A H-expression Examples of MuSiQue and 2WikiQA
In Tables 9 and 10, we show the H-expression examples and parse trees under each reasoning type of MuSiQue and 2WikiQA.

B Supervised Training Details
To train the question parser, we initialize H-parser with the T5-large model. We trained it with a batch size of 32 and a learning rate of 3e-5 for 20 epochs on both MuSiQue and 2WikiQA. We selected the model weights based on H-expression exact match. We base our reader network FiD on T5-large. We use 20 passages with a maximum length of 256 tokens for input blocks on the MuSiQue dataset, and 10 passages with a text length of 356 tokens on the 2WikiQA dataset. We trained the reader model with a batch size of 8 and a learning rate of 5e-4 for 40k steps.

C Performance under Different Reasoning Types
We report the Answer F1 performance under different reasoning types on both MuSiQue and 2WikiQA in Figure 4. Our hybrid question parsing and execution model performs significantly better than the direct answer-generation model on both QA datasets, showing the advantage of delegating semantic parsing to solve complex textual questions. On MuSiQue, for the relatively simple reasoning types (2hop, 3hop1), our model outperforms FiD by a great margin. For complex reasoning types (3hop2, 4hop1, 4hop2 and 4hop3), our model gets lower performance compared with the simple reasoning types because the exposure bias issue worsens as the number of reasoning steps increases; still, it performs equivalently or better compared with End-to-End FiD. On 2WikiQA, our model performs best on all four reasoning types. Especially on the most complex type, bridge-comparison, our framework greatly outperforms FiD, which shows that using a deterministic symbolic representation is more robust for producing a correct answer.

D A Case Study of How HPE Reasons
In Figure 3, we show an example where FiD predicts a wrong answer but our model predicts correctly. Given a complex question, our framework first parses it into an H-expression. Then the hybrid executor converts the H-expression into a binary tree, with operations and natural sub-questions as its nodes. H-executor traverses the binary tree from the rightmost leaf node toward the left and upper layers, taking the operations into account.
At each question node, the reader neural network takes the sub-question and multiple paragraphs as input to generate the sub-answer. We store the sub-answer in memory for later substitution of the placeholder. For example, Q3 is rewritten by replacing A1 with the answer of Q1 (the Republicans) and A2 with the answer of Q2 (Senate), producing the new question Q3' "when did Senate take control of the Republicans". The final answer is obtained by answering Q3'.

Figure 2 :
Figure 2: An overview of the HPE pipeline with two stages: H-parser and H-executor. (1) H-parser first maps the input question to the H-expression. (2) Then H-executor uses the reader network to obtain an answer for each question, and the deterministic symbolic interpreter executes the expression to derive the final answer.

Figure 3 :
Figure 3: A 3hop2-type MuSiQue question example and how our framework finds the final answer.

Figure 4 :
Figure 4: Answer F1 score on each reasoning type on MuSiQue and 2WikiQA.

Table 1 :
Operations defined in our H-expressions; q2 and q1 are single-hop natural questions.
Type | Question | H-expression
Bridge | Where was the place of film The Iron Man director | JOIN [Where is A1's place of death, Who is director of The Iron Man]
Intersection | Which team is Bernard Lowe was a member of beat the winner of the 1894-95 FA Cup | AND [What is member of sports team of Bernard Lowe, Who won the 1894-95 FA Cup]
Comparison | Did Lenny Mchallister and Ken Xie have the same nationality | COMPARE_= [What is country of citizenship of Lenny Mchallister, What is the country of citizenship of Ken Xie]

Table 2 :
Examples of questions and corresponding H-expressions under three basic reasoning types. Questions and H-expressions for more complex reasoning types are shown in Appendix A.

Table 3 :
Answer Exact match (EM) and F1 scores on the dev/test splits of MuSiQue and 2WikiQA. PT represents pre-training of the reader network. The methods under Large LM and SOTA are reported from previous work. The methods under End2End are implemented by us following the training details in the paper.

Table 4 :
Few-shot setting: Exact match (EM) and F1 scores on the test/dev splits of MuSiQue and 2WikiQA.

Table 5 :
Zero-shot performance on HotpotQA. Standard and CoT are prompting methods using large language models like GPT-3 (Brown et al., 2020).

Table 8 :
EM and F1 scores of Answer and Support Passage on MuSiQue using different reader models. SQ represents simple question and CQ represents complex question.