Analyzing the Effectiveness of the Underlying Reasoning Tasks in Multi-hop Question Answering

To explain the predicted answers and evaluate the reasoning abilities of models, several studies have utilized underlying reasoning (UR) tasks in multi-hop question answering (QA) datasets. However, it remains an open question how effective UR tasks are for the QA task when models are trained on both tasks in an end-to-end manner. In this study, we address this question by analyzing the effectiveness of UR tasks (including both sentence-level and entity-level tasks) in three aspects: (1) QA performance, (2) reasoning shortcuts, and (3) robustness. While previous models have not been explicitly trained on an entity-level reasoning prediction task, we build a multi-task model that performs three tasks together: sentence-level supporting facts prediction, entity-level reasoning prediction, and answer prediction. Experimental results on the 2WikiMultiHopQA and HotpotQA-small datasets reveal that (1) UR tasks can improve QA performance. Using four newly created debiased datasets, we demonstrate that (2) UR tasks are helpful in preventing reasoning shortcuts in the multi-hop QA task. However, we find that (3) UR tasks do not contribute to improving the robustness of the model on adversarial questions, such as sub-questions and inverted questions. We encourage future studies to investigate the effectiveness of entity-level reasoning in the form of natural language questions (e.g., sub-question forms).


Introduction
The task of multi-hop question answering (QA) requires a model to read and aggregate information from multiple paragraphs to answer a given question (Figure 1a). Several multi-hop QA datasets have been proposed, such as QAngaroo (Welbl et al., 2018), HotpotQA (Yang et al., 2018), and MuSiQue (Trivedi et al., 2022). In HotpotQA, the authors provide sentence-level supporting facts (SFs) to test the reasoning ability and explainability of the models. However, owing to the design of the sentence-level SFs task (binary classification) and the redundant information in the sentences, Inoue et al. (2020) and Ho et al. (2020) show that the sentence-level SFs are insufficient to explain and evaluate multi-hop models in detail. To address this issue, the R4C (Inoue et al., 2020) and 2WikiMultiHopQA (2Wiki; Ho et al., 2020) datasets provide an entity-level reasoning prediction task to explain and evaluate the process of answering questions. Entity-level reasoning information is defined as a set of triples that describes the reasoning path from question to answer (Figure 1b).
Several previous studies (Chen et al., 2019; Fu et al., 2021a) utilize sentence-level SFs and/or entity-level reasoning information to build explainable models by using question decomposition (Min et al., 2019b; Perez et al., 2020) or by predicting sentence-level SFs. The advantages of these pipeline models are that they can exploit the underlying reasoning (UR) process in QA and that their predicted answers are more interpretable. However, the question remains as to how effective training on UR tasks is for the QA task in an end-to-end manner. Although a few end-to-end models have also been introduced (Qiu et al., 2019; Fang et al., 2020), these models are not explicitly trained on entity-level and answer prediction tasks.
In addition to the triple form, the sub-question form is another way to utilize entity-level reasoning information. Specifically, Tang et al. (2021) utilize question decomposition as an additional sub-question evaluation for bridge questions (there are two types of questions: bridge and comparison) in HotpotQA. They only use sub-questions for evaluation and do not fine-tune the models on them. In addition, Ho et al. (2022) use sub-questions for both evaluation and training. However, they only focus on comparison questions for date information. In contrast, we focus on the triple form of the entity-level information and conduct experiments using two datasets, 2Wiki and HotpotQA-small (obtained by combining HotpotQA and R4C), which include both types of questions.
In this study, we analyze the effectiveness of UR tasks (including both sentence-level and entity-level) in three aspects: (1) QA performance, (2) reasoning shortcuts, and (3) robustness. First, QA performance is the final objective of the QA task. We aim to answer the following question: (RQ1) Can the UR tasks improve QA performance? For the second aspect, previous studies (Chen and Durrett, 2019; Jiang and Bansal, 2019a; Min et al., 2019a; Trivedi et al., 2020) demonstrate that many questions in the multi-hop QA task contain biases and reasoning shortcuts (Geirhos et al., 2020), where the models can answer the questions by using heuristics. Therefore, we ask: (RQ2) Can the UR tasks prevent reasoning shortcuts? For the final aspect, robustness is one of the key issues for the safe development of NLP models and has attracted a tremendous amount of research (Wang et al., 2022). In this study, we test the robustness of the model by asking modified versions of questions, such as sub-questions and inverted questions. Our question is: (RQ3) Do the UR tasks make the models more robust?
There are no existing end-to-end models that can perform the three tasks simultaneously (sentence-level SFs prediction, entity-level prediction, and answer prediction); therefore, we first build a multi-hop BigBird-base model (Zaheer et al., 2020) that performs these three tasks simultaneously. We then evaluate our model on two multi-hop datasets: 2Wiki and HotpotQA-small. To investigate the effectiveness of the UR tasks, for each dataset, we conduct three additional experiments in which the model is trained on: (1) the answer prediction task only, (2) the answer prediction and sentence-level prediction tasks, and (3) the answer prediction and entity-level prediction tasks. We also create four debiased sets (Figure 1c) for 2Wiki and HotpotQA-small for RQ2, and we create and reuse adversarial questions for 2Wiki and HotpotQA-small for RQ3.
The experimental results indicate that the UR tasks can improve QA performance, from 77.9 to 79.4 F1 for 2Wiki and from 66.4 to 69.4 F1 for HotpotQA-small (RQ1). The results of the models on the four debiased sets reveal that the UR tasks can be used to reduce reasoning shortcuts (RQ2). Specifically, when the model is trained on both the answer prediction and UR tasks, the performance drop on the debiased sets is lower than when the model is trained only on answer prediction (e.g., 8.9% vs. 13.4% EM). The results also suggest that the UR tasks do not make the model more robust on adversarial questions, such as sub-questions and inverted questions (RQ3). Our analysis shows that correct reconstruction of the entity-level reasoning task contributes to finding the correct answer in only 37.5% of cases. This implies that using entity-level reasoning information in the form of triples does not help answer adversarial questions, in this case, the sub-questions. We encourage future work to investigate the effectiveness of the entity-level reasoning task in the form of sub-questions, which have the same form as multi-hop QA questions.

Background
Reasoning Tasks in Multi-hop QA In this study, we consider UR tasks in multi-hop QA at two levels: sentence-level and entity-level. The sentence-level SFs prediction task was first introduced by Yang et al. (2018). This task requires a model to predict the set of sentences that is necessary to answer a question (Figure 1).
To evaluate the UR process of the models, derivation and evidence information were introduced in R4C and 2Wiki, respectively. Both derivation and evidence are sets of triples that represent the reasoning path from question to answer. The difference is the form: derivation in R4C uses a semi-structured natural language form, whereas evidence in 2Wiki uses a structured form. We conduct experiments with both R4C (HotpotQA-small) and 2Wiki. For consistency, we use the term entity-level reasoning prediction task to denote the derivation task in R4C and the evidence task in 2Wiki.

Reasoning Shortcuts and Biases
In this study, we treat reasoning shortcuts and biases as similar. Both are spurious correlations in the dataset that allow a model to answer a question correctly without performing the expected reasoning skills, such as comparison and multi-hop reasoning. Following previous studies (Jiang and Bansal, 2019a; Ko et al., 2020), we use the terms word overlap shortcut and position bias.
To check whether the UR tasks can prevent reasoning shortcuts, we first identify the types of shortcuts that exist in HotpotQA-small and 2Wiki. We use heuristics to identify the word overlap shortcut (Appendix A). We find that the word overlap shortcut is common in HotpotQA-small but not in 2Wiki. However, the small sample size of HotpotQA-small (Section 4) increases the uncertainty of the obtained results. Therefore, within the scope of this study, we mainly experiment with position bias.
We observe that many examples in 2Wiki contain answers in the first sentence. Therefore, we divide every sentence-level SF in each gold paragraph into two levels: the first sentence (position_0) and the remaining sentences (position_other). Subsequently, we obtain the percentage of each level by dividing the total number of SFs at that level (e.g., position_0) by the total number of SFs. Figure 2 illustrates the position of sentence-level SFs in the dev. sets of the three datasets. We find that all three datasets have a bias toward the first sentence. We also find that 2Wiki has more position bias than HotpotQA and HotpotQA-small.
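The position statistics above can be computed directly from the supporting-fact annotations. A minimal sketch, assuming the HotpotQA-style `(paragraph_title, sentence_index)` pair format for each SF:

```python
from collections import Counter

def position_ratios(supporting_facts):
    """Share of sentence-level SFs at position 0 vs. other positions.

    Each SF is assumed to be a (paragraph_title, sentence_index) pair,
    as in the HotpotQA annotation format.
    """
    counts = Counter(
        "position_0" if idx == 0 else "position_other"
        for _, idx in supporting_facts
    )
    total = sum(counts.values())
    return {level: counts[level] / total
            for level in ("position_0", "position_other")}

sfs = [("Joan of Valois", 0), ("Charles of Valois", 0), ("Philip III of France", 2)]
print(position_ratios(sfs))  # position_0 dominates: 2 of 3 SFs
```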

Our Multi-task Model
To investigate the usefulness of the UR tasks for the QA task, we jointly train the corresponding tasks: sentence-level SFs prediction, entity-level prediction, and answer prediction. Figure 3 illustrates our model. To handle long texts, we use the BigBird model (Zaheer et al., 2020), which is available in Hugging Face's transformers repository. Our model comprises three main steps: (1) paragraph selection, (2) context encoding, and (3) multi-task prediction. We use the named entity recognition (NER) models of spaCy and Flair (Akbik et al., 2019) to extract all entities in the context and use them for the entity-level prediction task.
Paragraph Selection Following previous models (Qiu et al., 2019; Fang et al., 2020; Tu et al., 2020), instead of using all the provided paragraphs, we first filter out answer-unrelated paragraphs. We follow the paragraph selection process described in Fang et al. (2020). First, we retrieve first-hop paragraphs by using title matching or entity matching. We then retrieve second-hop paragraphs using the hyperlink information available in Wikipedia.

When we retrieve paragraphs, we reuse the paragraph ranker model from the hierarchical graph network (HGN) model (Fang et al., 2020) to rank input paragraphs by the probability that they contain sentence-level SFs.
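The two-hop selection procedure can be sketched as follows. This is a simplified illustration, not the exact HGN pipeline: `hyperlinks` and `ranker_score` are hypothetical stand-ins for the Wikipedia link structure and the HGN paragraph ranker, and title matching is reduced to substring matching:

```python
def select_paragraphs(question, title_to_para, hyperlinks, ranker_score, k=4):
    """Two-hop paragraph selection sketch.

    title_to_para: title -> paragraph text; hyperlinks: title -> linked titles.
    ranker_score stands in for the probability from the HGN paragraph ranker.
    """
    # First hop: paragraphs whose title appears in the question (simplified).
    first_hop = [t for t in title_to_para if t.lower() in question.lower()]
    # Second hop: paragraphs reachable via hyperlinks from first-hop paragraphs.
    second_hop = [t2 for t in first_hop for t2 in hyperlinks.get(t, [])
                  if t2 in title_to_para]
    candidates = list(dict.fromkeys(first_hop + second_hop))  # dedupe, keep order
    # Rank the candidates and keep the top k.
    return sorted(candidates, key=ranker_score, reverse=True)[:k]

titles = {"Joan of Valois": "...", "Charles of Valois": "...", "Paris": "..."}
links = {"Joan of Valois": ["Charles of Valois"]}
score = {"Joan of Valois": 0.9, "Charles of Valois": 0.8}
print(select_paragraphs("Who is the father of Joan of Valois?",
                        titles, links, lambda t: score.get(t, 0.0)))
```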
Context Encoding To obtain vector representations for sentences and entities, we first combine all the selected paragraphs into one long paragraph and then concatenate it with the question to form a context C = [q; p], where m and n are the lengths of the question q and the combined paragraph p (all selected paragraphs), respectively. The context C is then tokenized into l sub-words and fed into BigBird to obtain the contextual representation of the sub-words, C' = BigBird(C) ∈ R^{l×h}, where h is the hidden size of the BigBird model. Next, we obtain the representation s_i ∈ R^{2h} of the i-th sentence and the representation e_j ∈ R^{4h+d_t} of the j-th entity by concatenating the start and end sub-word representations of the i-th sentence and of the j-th entity, respectively. We enrich the entity embedding e_j by concatenating it with a d_t-dimensional type embedding t_j and a sentence embedding s_k, where k is the index of the sentence containing the j-th entity.
We also leverage the entity information to improve the contextual representation of the sub-words C', as it is mainly used for the answer prediction task, which is described in the next section. Thus, the enhanced representation C''_i of the i-th sub-word is calculated as C''_i = [C'_i; e_k], where e_k is the embedding of the k-th entity if that entity contains the i-th sub-word; otherwise, e_k is a null vector with the same dimension.
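The sentence, entity, and enhanced sub-word representations can be sketched with NumPy. The exact composition below (concatenating start and end sub-word states) is an assumption, chosen to be consistent with the stated dimensions s_i ∈ R^{2h} and e_j ∈ R^{4h+d_t}:

```python
import numpy as np

h, d_t = 4, 2   # tiny sizes for illustration; the real model uses h=768, d_t=50

def sentence_repr(C1, start, end):
    # s_i in R^{2h}: start and end sub-word states of the i-th sentence
    return np.concatenate([C1[start], C1[end]])

def entity_repr(C1, start, end, t_j, s_k):
    # e_j in R^{4h + d_t}: entity span states + containing sentence s_k + type t_j
    return np.concatenate([C1[start], C1[end], s_k, t_j])

def enhance(C1, ent_of_subword, entity_vecs):
    # C''_i = [C'_i; e_k], with a null vector when no entity covers sub-word i
    null = np.zeros(4 * h + d_t)
    return np.stack([
        np.concatenate([C1[i], entity_vecs.get(ent_of_subword.get(i), null)])
        for i in range(len(C1))
    ])

C1 = np.ones((6, h))                              # l=6 sub-word states from BigBird
s_k = sentence_repr(C1, 0, 5)                     # shape (2h,) = (8,)
e_0 = entity_repr(C1, 1, 2, np.zeros(d_t), s_k)   # shape (4h+d_t,) = (18,)
C2 = enhance(C1, {1: 0, 2: 0}, {0: e_0})          # shape (6, h + 4h + d_t) = (6, 22)
```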
Multi-task Prediction After context encoding, we train our model on three main tasks together: (1) sentence-level prediction, (2) entity-level prediction, and (3) answer prediction. We split the answer prediction task into two sub-tasks, similar to previous studies (Yang et al., 2018; Fang et al., 2020): answer type prediction and answer span prediction. We train our model by minimizing the joint loss for all tasks, L = λ_sent L_sent + λ_ent L_ent + λ_ans L_ans, where λ_sent, λ_ent, and λ_ans are the hyperparameter weights for the three tasks: sentence-level prediction, entity-level prediction, and answer prediction (details are given in Appendix B.1).
For the sentence-level prediction task, we use a binary classifier to predict whether a sentence is a supporting fact. For the answer type prediction task, we use a 4-way classifier to predict the probabilities of yes, no, span, and no answer. Two linear classifiers are used for the answer span prediction task to independently predict the start and end tokens of the answer span.
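The prediction heads can be sketched as plain linear scorers over the encoded representations. The weights here are random and purely illustrative (no training), and the dimensions are toy sizes:

```python
import numpy as np

h = 8
rng = np.random.default_rng(0)

W_sent = rng.normal(size=(2 * h, 2))   # binary: supporting fact or not (on s_i)
W_type = rng.normal(size=(h, 4))       # 4-way: yes / no / span / no answer
W_start = rng.normal(size=(h,))        # answer-span start score per token
W_end = rng.normal(size=(h,))          # answer-span end score per token

def predict_span(C2):
    """Independently pick the highest-scoring start and end tokens."""
    return int((C2 @ W_start).argmax()), int((C2 @ W_end).argmax())

C2 = rng.normal(size=(32, h))          # toy token representations
start, end = predict_span(C2)
```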
Different from existing end-to-end models (Qiu et al., 2019; Fang et al., 2020), our model is explicitly trained on the entity-level prediction task. We formalize the entity-level reasoning prediction task as a relation extraction task (Zhang and Wang, 2015): the input is a pair of entities, and the output is the relationship between the two entities. From all named entities obtained by the NER models, we generate a set of entity pairs; given N entities, we obtain N × (N − 1) pairs. For each pair, we predict a relationship from a set of predefined relationships obtained from the training set. We then use cross-entropy as the learning objective.
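Candidate generation for this formulation can be sketched as ordered pairs over the detected entities, with a "no relation" label for pairs outside the gold reasoning chain (the label names and gold-triple layout here are illustrative):

```python
from itertools import permutations

def candidate_pairs(entities):
    # Ordered (head, tail) pairs: N entities yield N * (N - 1) candidates.
    return list(permutations(entities, 2))

def label_pairs(pairs, gold_triples):
    # gold_triples: {(head, tail): relation}; everything else is "no_relation".
    return [(h, t, gold_triples.get((h, t), "no_relation")) for h, t in pairs]

ents = ["Joan of Valois", "Charles of Valois", "Philip III of France"]
gold = {("Joan of Valois", "Charles of Valois"): "father",
        ("Charles of Valois", "Philip III of France"): "father"}
labeled = label_pairs(candidate_pairs(ents), gold)
print(len(labeled))  # 3 * 2 = 6 candidate pairs
```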

Datasets and Evaluation Metrics
We mainly experiment with 2Wiki and HotpotQA-small. We also train and evaluate our model on the full version of HotpotQA. We reuse and create debiased and adversarial sets for the evaluation. Table 1 presents the statistics for 2Wiki, HotpotQA-small, and the additional evaluation sets. The details of HotpotQA and 2Wiki are presented in Appendix B.2. It should be noted that all datasets are in English.
HotpotQA-small
R4C (Inoue et al., 2020) provides entity-level reasoning annotations for a subset of the examples in HotpotQA. We obtain HotpotQA-small by combining HotpotQA (Yang et al., 2018) with R4C. HotpotQA-small comprises three tasks, as in 2Wiki: (1) sentence-level SFs prediction, (2) entity-level prediction, and (3) answer prediction. First, we re-split the ratio between the training and dev. sets; the new sizes are 3,671 and 917 for the training and dev. sets, respectively (the original sizes are 2,379 and 2,209, respectively). In R4C, there are three gold annotations for the entity-level prediction task; in 2Wiki, there is only one gold annotation. For consistency in the evaluation and analysis, we randomly choose one of the three annotations for every sample in R4C. The entity-level reasoning in R4C is created by crowdsourcing. We observe that there are many similar relations in the triples in R4C, and these relations can be grouped into one. For example, is in, is located in, is in the, and is located in the all indicate a location relation. We also group relations by removing the context information in them; for example, is a 2015 book by and is the second book by are considered similar to the relation is a book by. After grouping, the number of relations in R4C is 2,526 (4,791 before grouping).

Debiased Dataset
The objective of our debiased dataset is to introduce a small perturbation in each paragraph to mitigate a specific type of bias, in our case, the position bias shown in Figure 2. For both 2Wiki and HotpotQA-small, we use the same method to generate four debiased sets: ADDUNRELATED, ADDRELATED, ADD2, and ADD2SWAP. The differences between these four sets are whether the added sentence is related or unrelated to the paragraph and whether we add one or two sentences to the paragraph. The details of each set are as follows.
ADDUNRELATED: One sentence unrelated to the paragraph is added. In our experiment, we use a list of sentences from the sentence-level revision dataset (Tan and Lee, 2014). We randomly choose one sentence whose number of tokens is greater than eleven but less than twenty-one.
ADDRELATED: One sentence that does not affect the meaning or flow of the paragraph is added. In our experiment, we write multiple templates for each entity type (e.g., for a film entity, "#Name is a nice film", where #Name is the title of the paragraph), randomly choose one template, and add it to the paragraph. To detect the type of the paragraph, we use the question type information in 2Wiki and HotpotQA-small, the results of the NER model, and the important keywords in the question (e.g., who, magazine, album, and film).
ADD2: ADDRELATED and ADDUNRELATED are combined in order.
ADD2SWAP: The order of ADDRELATED and ADDUNRELATED in ADD2 is swapped.
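The four perturbations can be sketched as follows. Inserting at the front of the paragraph is an assumption, chosen because the goal is to push gold answers out of position 0:

```python
def make_debiased(paragraph_sents, related_sent, unrelated_sent, variant):
    """Return a perturbed copy of a gold paragraph (list of sentences)."""
    extra = {
        "ADDUNRELATED": [unrelated_sent],
        "ADDRELATED": [related_sent],
        "ADD2": [related_sent, unrelated_sent],       # related first, in order
        "ADD2SWAP": [unrelated_sent, related_sent],   # swapped order
    }[variant]
    return extra + list(paragraph_sents)

para = ["Lion is a 2016 film.", "It was directed by Garth Davis."]
print(make_debiased(para, "Lion is a nice film.", "The weather was mild.", "ADD2"))
```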

Adversarial Dataset
The objective of our adversarial dataset is to check the robustness of the model by asking modified versions of questions. For HotpotQA-small, we reuse two versions of the adversarial examples in Geva et al. (2022). The first is automatically generated by the 'Break, Perturb, Build' (BPB) framework in Geva et al. (2022). The BPB framework performs three main steps: (1) breaking a question into multiple reasoning steps, (2) perturbing the reasoning steps by using a list of defined rules, and (3) building new QA samples from the perturbations in step #2. The second version is a subset of the first and is validated by crowd workers. We only use the examples in these two versions whose original examples appear in HotpotQA-small. For 2Wiki, no adversarial dataset is available. Based on the idea of the BPB framework in Geva et al. (2022), we apply two main rules from BPB to 2Wiki: (1) replace the comparison operation for comparison questions, and (2) use the prune step for bridge questions. For the first rule, we replace the operation in the comparison questions (e.g., "Who was born first, A or B?" is converted to "Who was born later, A or B?"). For the second rule, we use a sub-question in the QA process as the main question (e.g., for Figure 1, we ask, "Who is the father of Joan of Valois?").
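The two rules for 2Wiki can be sketched as follows. The swap table and the sub-question template are illustrative simplifications, not the full rule list from BPB:

```python
COMPARISON_SWAPS = {"first": "later", "later": "first",
                    "older": "younger", "longer": "shorter"}

def invert_comparison(question):
    """Rule 1: flip the comparison operation in a comparison question."""
    for op, inv in COMPARISON_SWAPS.items():
        if f" {op}" in question:
            return question.replace(op, inv, 1)
    return question

def sub_question(triple):
    """Rule 2 (sketch): turn one reasoning step into a stand-alone question.

    Template-based; the real prune step operates on decomposed questions.
    """
    head, relation, _tail = triple
    return f"Who is the {relation} of {head}?"

print(invert_comparison("Who was born first, A or B?"))
print(sub_question(("Joan of Valois", "father", "Charles of Valois")))
```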

Evaluation Metrics
Each task in HotpotQA and 2Wiki is evaluated by using two metrics: exact match (EM) and F1 score.
Following the evaluation script in HotpotQA and 2Wiki, we use joint EM and joint F1 to evaluate the entire capacity of the model. For HotpotQA, they are the products of the scores of two tasks: sentence-level prediction and answer prediction. For 2Wiki and HotpotQA-small, they are the products of the scores of three tasks: sentence-level prediction, entity-level prediction, and answer prediction.
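Per that definition, the joint scores are per-example products of the per-task scores; a minimal sketch:

```python
def joint_scores(task_scores):
    """Joint EM / joint F1 for one example: product over tasks.

    task_scores: list of (em, f1) pairs, one per task (two tasks for
    HotpotQA; three for 2Wiki and HotpotQA-small).
    """
    em, f1 = 1.0, 1.0
    for t_em, t_f1 in task_scores:
        em *= t_em
        f1 *= t_f1
    return em, f1

# Sentence-level, entity-level, and answer scores for one 2Wiki example:
em, f1 = joint_scores([(1.0, 0.9), (0.0, 0.7), (1.0, 1.0)])
print(em, round(f1, 2))  # one failed task drives joint EM to 0
```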

Results
Currently, there are no existing end-to-end models that are explicitly trained on all three tasks together; therefore, in this study, we use our proposed model for the analysis. We also compare our model with previous models on the HotpotQA and 2Wiki datasets. In general, the experimental results indicate that our model is comparable to previous models and can be used for further analyses. Because we focus on the analysis, the detailed comparison results are presented in Appendix B.3.

Effectiveness of the UR Tasks
To investigate the effectiveness of the UR tasks, we train the model in four settings: (1) answer prediction only, (2) answer prediction and sentence-level SFs prediction, (3) answer prediction and entity-level prediction, and (4) all three tasks together.
QA Performance (RQ1) Our first research question is whether the UR tasks can improve QA performance. To answer this question, we compare the results of the different task settings described above.
The results are presented in Table 2. For 2Wiki, when using the sentence-level and entity-level tasks separately (settings #2 and #3), the QA performance does not change significantly. The improvement is significant when we combine both the sentence-level and entity-level tasks (setting #4). Specifically, the scores when the model is trained on the answer prediction task only (setting #1) and on both the answer prediction task and the UR tasks (setting #4) are 77.9 and 79.4 F1, respectively. In contrast to 2Wiki, using the sentence-level and entity-level tasks separately yields a larger QA performance improvement on HotpotQA-small. Specifically, the F1 scores of settings #2 and #3 are 69.0 and 69.1, respectively, whereas the F1 score of the first setting is 66.4. Similar to 2Wiki, there is a large gap between settings #1 and #4 (66.4 F1 and 69.4 F1, respectively). In summary, these results indicate that both the sentence-level and entity-level prediction tasks contribute to improving QA performance. These results align with the findings in Yang et al. (2018), which show that incorporating the sentence-level SFs prediction task can improve QA performance. We also find that when combining both the sentence-level and entity-level prediction tasks, the scores of the answer prediction task are the highest.
Reasoning Shortcuts (RQ2) To investigate whether explicitly optimizing the model on the UR tasks can prevent reasoning shortcuts, we evaluate the four settings of the model on the four debiased sets of 2Wiki and HotpotQA-small. The generation of the debiased sets includes stochastic steps. To minimize the impact of randomness on our reported results, we generate the debiased sets five times and report the average evaluation scores. The average performance drops are presented in Table 3 (detailed scores are given in Appendix B.4).
Overall, for 2Wiki, when the model is trained on only one task (#1), the drop is the largest (except for ADDRELATED, where it is the second largest). When the model is trained only on the answer prediction task, the drops are always higher than when the model is trained on the three tasks. Specifically, the gaps between the two settings, #1 (answer task only) and #4 (all three tasks), are 4.5%, 0.4%, 3.2%, and 4.5% (EM score) for ADDUNRELATED, ADDRELATED, ADD2, and ADD2SWAP, respectively. These scores indicate that the sentence-level and entity-level tasks positively affect the answer prediction task when the model is trained on the three tasks simultaneously.
For HotpotQA-small, we observe that the effectiveness of the UR tasks is inconsistent. For example, for ADDUNRELATED, when training the model on the three tasks (setting #4), the reduction is larger than when training on the answer task only (setting #1) (5.1 vs. 3.0 EM). However, for ADDRELATED, the reduction in setting #4 is smaller than that in setting #1 (1.3 vs. 4.0 EM). One possible reason is that the performance of the entity-level task is not good (6.4 EM), which affects the answer prediction task when the model is trained on the three tasks together. Another possible reason is that the position bias in HotpotQA-small is not sufficiently large. We present a detailed analysis in Section 5.2 to explain this case.

Robustness (RQ3) We evaluate the four settings of the model on the adversarial sets. For 2Wiki, the results are presented in Table 4. The scores for all four settings decrease significantly on the adversarial set. The reduction is the smallest when the model is trained on the answer task only; the UR tasks do not make the model more robust on this adversarial set. For HotpotQA-small, we observe the same behavior: when the model is trained on the answer task only, the reduction is the smallest. All results are presented in Table 5. These results indicate that neither the sentence-level nor the entity-level prediction task contributes to improving the robustness of the models on adversarial questions, such as sub-questions and inverted questions. We analyze the results in Section 5.2.

Analyses
Details of RQ2 To investigate the results concerning RQ2 in more depth, we first analyze the position biases of different types of questions in 2Wiki and HotpotQA-small. We find that comparison questions have more position biases than bridge questions in both 2Wiki and HotpotQA-small (Appendix B.5). To evaluate the effect of position bias on each type of question, we evaluate the four settings of the model on the four debiased sets for each question type in both datasets. All results are presented in Appendix B.5. For 2Wiki, we find that most of the answers to comparison questions are in the first sentences. This large bias is the main reason for the significant reduction in the scores on comparison questions. Comparison questions make up 46.0% of 2Wiki, so their reduction contributes substantially to the reduction on the entire dataset; in other words, the results on 2Wiki are driven by those on the comparison questions. HotpotQA-small has only 22.0% comparison questions, and the position bias in its comparison questions is not sufficiently large. Therefore, the position bias does not have a significant impact on the main QA task, and in turn the UR tasks do not have a significant effect.

Details of RQ3
The adversarial questions used in RQ3 are the sub-questions in the QA process for bridge questions and the inverted questions for comparison questions. In principle, the triple in the entity-level task should be helpful in answering the sub-questions. For example, the triple is (Charles of Valois, father, Philip III of France), and the sub-question is "Who is the father of Charles of Valois?". To understand the behavior of the model in more detail, we analyze the results on 2Wiki in two settings: (3) Ans + Ent and (4) Ans + Sent + Ent. Table 6 presents the detailed results for these two settings. We find that correct reconstruction of the entity-level reasoning task contributes to finding the correct answer in only 32.8% of cases in setting #3 and only 37.5% of cases in setting #4. Entity-level reasoning in the form of triples thus has no significant effect on the main QA process. Several examples are presented in Appendix B.5.
We conjecture that there are three possible reasons why the UR tasks do not help on the adversarial dataset. The first is the difference in the form and design of the tasks. Specifically, the entity-level reasoning task is formulated as a relation extraction task: the input is a pair of entities, and the output is a relation label. Meanwhile, the adversarial dataset is formulated as a QA task: the input is a natural language question, and the output is an answer. The second reason is the incompleteness of the entity-level reasoning information. As discussed in Ho et al. (2022), the entity-level reasoning in the comparison questions does not describe the full path from question to answer, and other reasoning operations are required to obtain the answer. The final reason is the manner in which we utilize the entity-level reasoning information. Our model does not consider the order of the triples in the reasoning chain; for example, we do not consider the order of the two steps in Figure 1b. We hope that our research will inspire future studies to investigate the effectiveness of the UR tasks in the form of natural language questions, which have the same form as multi-hop QA questions.

Related Work
Multi-hop Datasets and Analyses To test the reasoning abilities of the models, many multi-hop QA datasets (Welbl et al., 2018; Talmor and Berant, 2018; Yang et al., 2018) have been proposed. Recently, Trivedi et al. (2022) introduced MuSiQue, a multi-hop dataset constructed from compositions of single-hop questions. The reason we do not conduct experiments on MuSiQue is explained in the Limitations section.
In addition to Tang et al. (2021) and Ho et al. (2022), the studies most similar to ours (mentioned in the Introduction), there are other existing studies (Chen and Durrett, 2019; Jiang and Bansal, 2019a; Min et al., 2019a; Trivedi et al., 2020) that analyze and investigate multi-hop datasets and models. However, most of them do not utilize the internal reasoning information when answering questions.
Other QA Reasoning Datasets In addition to multi-hop reasoning datasets, several other datasets also aim to evaluate the reasoning abilities of models, including DROP (Dua et al., 2019) for numerical reasoning; CLUTRR (Sinha et al., 2019), ReClor (Yu et al., 2020), and LogiQA (Liu et al., 2020) for logical reasoning; Quoref (Dasigi et al., 2019) for coreference reasoning; and CommonsenseQA (Talmor et al., 2019), MCScript2.0 (Ostermann et al., 2019), and CosmosQA (Huang et al., 2019) for commonsense reasoning. Many of these datasets consist of only a single paragraph in the input or lack explanation information that describes the reasoning process from question to answer. In contrast, our focus is on multi-hop reasoning datasets that contain multiple paragraphs in the input and provide explanatory information for the QA process.

Conclusion
We analyze the effectiveness of the underlying reasoning tasks using two multi-hop datasets: 2Wiki and HotpotQA-small. The results reveal that the underlying reasoning tasks can improve QA performance. Using four debiased sets, we demonstrate that the underlying reasoning tasks can reduce reasoning shortcuts in the QA task. The results also reveal that the underlying reasoning tasks do not make the models more robust on adversarial examples, such as sub-questions and inverted questions. We encourage future studies to investigate the effectiveness of the entity-level reasoning task in the form of sub-questions.

Limitations
Our study has two main limitations. The first is the small size of HotpotQA-small. Currently, there are no other multi-hop datasets that contain a large number of examples with the entity-level reasoning prediction task. MuSiQue is the most promising option; its entity-level reasoning information includes two formats: a triple format and a natural language question format. However, we do not experiment with MuSiQue because the number of examples that have entity-level reasoning information in the form of a triple is small: 2,253 out of 19,938 in the training set and 212 out of 2,417 in the dev. set.
The second limitation is that our model does not consider the order of the triples in the entity-level reasoning prediction task. As shown in Figure 1b, the two triples are ordered. However, our model formulates the entity-level prediction task as a relation extraction task: we predict a relation given two entities detected by the NER models. Therefore, the order of the triples is not considered. We conjecture that this may be one of the reasons why the entity-level reasoning prediction task (e.g., a triple (Film A, director, D)) does not support the model when answering sub-questions (e.g., "Who is the director of Film A?") that use the same information.
A Word Overlap Shortcut

Based on this finding, we automatically calculate the word overlap shortcut for 2Wiki and HotpotQA-small. We observe that the word overlap shortcut is common in bridge questions; therefore, we only calculate it for bridge questions in 2Wiki and HotpotQA-small. To check whether a sample contains the word overlap shortcut, we perform the following steps:

• Obtain a set of surrounding words S by taking the five words immediately to the left and right of the answer span, then remove the stopwords in S.

• Obtain the set of overlapping words O between S and the question.

• Consider a sample to contain the word overlap shortcut if there are at least two words in O and |O|/|S| ≥ 0.65. These thresholds are chosen based on the evaluation of 40 examples manually annotated by the authors.
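The steps above can be sketched as follows; the stopword list and tokenization are simplified placeholders:

```python
STOPWORDS = {"the", "a", "an", "of", "is", "was", "in", "to", "and", "by", "who"}

def has_word_overlap_shortcut(context_tokens, answer_span, question_tokens,
                              window=5, threshold=0.65):
    """Check the word overlap heuristic for one (context, answer, question)."""
    start, end = answer_span  # token indices, end exclusive
    surrounding = (context_tokens[max(0, start - window):start]
                   + context_tokens[end:end + window])
    S = {w.lower() for w in surrounding} - STOPWORDS
    O = S & {w.lower() for w in question_tokens}
    # Note: O is a subset of S, so the division is safe whenever len(O) >= 2.
    return len(O) >= 2 and len(O) / len(S) >= threshold

ctx = ["directed", "by", "John", "Smith", "film", "ANSWER",
       "released", "John", "Smith", "film", "again"]
q = ["Who", "directed", "the", "film", "John", "Smith", "released"]
print(has_word_overlap_shortcut(ctx, (5, 6), q))  # True
```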
We find that 56 out of 5,791 examples in 2Wiki and 151 out of 715 examples in HotpotQA-small contain the word overlap shortcut (5,791 and 715 are the numbers of bridge questions in the dev. sets of 2Wiki and HotpotQA-small, respectively).
It is noted that there is another type of shortcut, namely, the entity-type matching shortcut. Based on the experimental results and human performance, Min et al. (2019a) reveal that examples in HotpotQA contain the entity-type matching shortcut, where the models can answer the questions by using only the first five tokens of the questions; meanwhile, humans can answer the questions by using the entity types of the paragraphs. Currently, there is no dataset that can prevent the entity-type shortcut; therefore, we do not use this type of shortcut in our experiments.

B Experimental Details

B.1 Implementation Details
We use PyTorch (Paszke et al., 2017) and Hugging Face when building our model. For the context encoding step, we use a pre-trained BigBird model as the encoder; the hidden dimension is 768. For the entity-level reasoning prediction task, we obtain 33 relations for 2Wiki and 2,526 relations for R4C from all triples in the training set, including a non-relation type. We use an entity type embedding d_t of size 50. We fine-tuned our model with a total batch size of 32 on a single GPU (NVIDIA A100 80GB) using mixed precision and a gradient accumulation step of 8. Following the hyperparameters of the BERT model (Devlin et al., 2019), for optimization, we use the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 3e-5, weight decay of 0.01, learning rate warmup over the first 10% of the total number of training steps, and linear decay of the learning rate. We also use a dropout probability of 0.1 on all layers.
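The warmup-then-linear-decay schedule described above can be sketched as follows; this is a minimal illustration of the schedule's shape, and the function name and arguments are ours, not part of the released code:

```python
def lr_at_step(step, total_steps, base_lr=3e-5, warmup_frac=0.1):
    """Learning rate with linear warmup over the first 10% of training steps,
    followed by linear decay to zero, as in BERT-style fine-tuning."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear warmup: 0 -> base_lr over the first warmup_frac of training.
        return base_lr * step / warmup_steps
    # Linear decay: base_lr -> 0 over the remaining steps.
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))
```

In practice this shape corresponds to a standard linear schedule with warmup, as provided by common fine-tuning libraries.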
For multi-task prediction, we use λ_sent = 4, λ_ent = 15, and λ_ans = 1 for 2Wiki and HotpotQA-small; we use λ_sent = 7 and λ_ans = 1 for HotpotQA. We do not run all experiments with different values of λ_sent, λ_ent, and λ_ans; instead, we run several experiments and then adjust the parameters based on the results. We find that λ_sent = 4 for 2Wiki and 7 for HotpotQA, λ_ent = 15, and λ_ans = 1 produce the best results. We fix the random seed for the reproducibility of the results. We observe that the final epoch often produces the best scores, and its scores are stable on the adversarial datasets; therefore, we choose the final epoch for all settings in our experiments.
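A minimal sketch of how the three task losses are combined with these weights (the helper below is our illustration, assuming each task contributes a scalar cross-entropy loss):

```python
def multi_task_loss(loss_ans, loss_sent, loss_ent=None,
                    lam_ans=1.0, lam_sent=4.0, lam_ent=15.0):
    """Weighted sum of the answer, sentence-level, and (optional)
    entity-level losses; loss_ent is omitted for plain HotpotQA,
    which has no entity-level reasoning task."""
    total = lam_ans * loss_ans + lam_sent * loss_sent
    if loss_ent is not None:
        total += lam_ent * loss_ent
    return total
```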

B.2 Datasets
HotpotQA HotpotQA was created by crowdsourcing. Due to the design of the dataset, there are only two tasks in HotpotQA: sentence-level SFs prediction and answer prediction. R4C was created based on HotpotQA and contains 4,588 questions. The dataset requires systems to provide an answer and a derivation in a semi-structured natural language form. There are two types of questions in HotpotQA: bridge and comparison.

2Wiki 2Wiki was constructed by utilizing a knowledge base and Wikipedia, and the questions were created using templates. There are three different tasks in the dataset: (1) sentence-level SFs prediction, (2) evidence generation (for consistency, we use the term entity-level prediction), and (3) answer prediction. The context consists of ten paragraphs, including two or four gold paragraphs and eight or six distractor paragraphs. The gold paragraphs contain the information required to find the answer, whereas the purpose of the distractor paragraphs is to distract the models. There are four different types of questions in the dataset: comparison, inference, compositional, and bridge-comparison. Inference and compositional questions are two sub-types of the bridge question. For convenience of analysis, we treat comparison and bridge-comparison questions together as comparison questions. Figure 4 presents an example of a comparison question from the 2Wiki dataset.
2Wiki was designed to focus on the entire reasoning process from question to answer. The overall capability of the model is evaluated using two metrics: joint EM and joint F1. To obtain the joint F1 score, we first calculate the joint precision and joint recall as follows: P_joint = P_ans · P_ent · P_sent and R_joint = R_ans · R_ent · R_sent, where (P_ans, R_ans), (P_ent, R_ent), and (P_sent, R_sent) represent the precision and recall for the three tasks: answer prediction, entity-level reasoning prediction, and sentence-level SFs prediction. The joint EM is 1 when all three tasks achieve an exact match and 0 otherwise.
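The joint metrics above can be transcribed directly into code (the function names are ours; the formulas follow the definitions in the text):

```python
def joint_scores(p_ans, r_ans, p_ent, r_ent, p_sent, r_sent):
    """Joint precision/recall are the products of the per-task scores;
    joint F1 is their harmonic mean."""
    p_joint = p_ans * p_ent * p_sent
    r_joint = r_ans * r_ent * r_sent
    f1 = 0.0 if p_joint + r_joint == 0 else 2 * p_joint * r_joint / (p_joint + r_joint)
    return p_joint, r_joint, f1

def joint_em(em_ans, em_ent, em_sent):
    """Joint EM is 1 only when all three tasks achieve an exact match."""
    return int(em_ans and em_ent and em_sent)
```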

B.3 Results Comparison
We compare our results with three previous models: BiDAF, CRERC, and NA-Reviewer. BiDAF is a baseline model in Ho et al. (2020). CRERC (Fu et al., 2021a) is a pipeline model that includes three modules: a relation extractor, a reader, and a comparator. NA-Reviewer (Fu et al., 2021b) is an improved version of CRERC that addresses the error accumulation issue. It is noted that both CRERC and NA-Reviewer are evaluated only on 2Wiki.
Table 7 presents the results of our model and previous models on the dev set of HotpotQA and on the test set of 2Wiki. It also shows the performance of our model on the dev set of HotpotQA-small and the human performance reported in Ho et al. (2020).
Results on HotpotQA Our score is comparable to the BERT-base versions of two strong models, SAE (Tu et al., 2020) and HGN (Fang et al., 2020), on the dev set of the distractor setting in HotpotQA. Specifically, our joint F1 is 67.8, while it is 66.5 for SAE-BERT and 66.9 for HGN-BERT. However, our score is lower than those of the RoBERTa-base versions of SAE and HGN, which are 72.8 and 74.4 joint F1, respectively. It is noted that we use the BigBird-ITC version in our model. Although the BigBird-ETC version performs better than the BigBird-ITC version, it is not available on Hugging Face. We do not use SAE and HGN for our analyses because these models are not designed to be trained on the entity-level reasoning prediction task.
Results on HotpotQA-small The scores on HotpotQA-small are lower than those on HotpotQA for the answer prediction task. This may be explained by the fact that the training set of HotpotQA-small is smaller than that of HotpotQA (3,671 vs. 90,564). Due to the small size, we only use the gold paragraphs for the experiments, which is why the scores on HotpotQA-small are higher than those on HotpotQA for the sentence-level task. For the entity-level task, the EM score is quite low (6.4 EM). A possible reason is that HotpotQA-small has many relations (2,526), whereas 2Wiki has only 33. We observe that the F1 score (31.1) is much better than the EM score. Therefore, we keep using HotpotQA-small for the analyses.
Results on 2Wiki Our model significantly outperforms BiDAF in all tasks. Our results are comparable to those of CRERC. The EM score of our model on the entity-level task is lower than that of CRERC. A possible explanation is that the relation extractor module in CRERC is fine-tuned on 2Wiki; therefore, it can extract entities better than the NER models from spaCy and Flair used in our model. However, the F1 score of our model on the entity-level task is higher than that of CRERC. This indicates that our model can correctly obtain a few triples in a set of gold triples for many samples. All our scores (except the F1 score of the entity-level task) are lower than those of NA-Reviewer. Our target is to analyze the UR tasks in an end-to-end model. Although the pipeline models (CRERC and NA-Reviewer) are easy to interpret, we cannot determine how the UR tasks affect answer prediction in an end-to-end model. Therefore, we use the design of our model to perform the analyses in this study.

B.4 Effectiveness of the UR Tasks
Reasoning Shortcuts (RQ2) Table 8 presents the performance drop (smaller is better) over five runs of the four settings of the model on the four debiased sets of 2Wiki and HotpotQA-small. As depicted in the table, for 2Wiki, the gap between the two settings #1 (answer prediction task only) and #4 (all three tasks) is consistent across all five runs. Meanwhile, for HotpotQA-small, the gap between settings #1 and #4 is inconsistent across the five runs. This observation supports our explanation in Section 5.2 that the position bias in HotpotQA-small does not have a large impact on the main QA task.

B.5 Analyses
Details of RQ2 Figure 5 illustrates the positions of the sentence-level SFs of comparison and bridge questions in the dev sets of the two datasets: 2Wiki and HotpotQA-small. As shown in the figure, the comparison questions exhibit more position bias than the bridge questions in both 2Wiki and HotpotQA-small. Furthermore, we observe that the position bias in the comparison questions in HotpotQA-small is smaller than that in 2Wiki. Table 9 presents the performance drop for the two types of questions, comparison and bridge, in 2Wiki and HotpotQA-small. For both 2Wiki and HotpotQA-small, we choose the results from the first run to perform the analysis.

Table 9: Performance drop (smaller is better) for two types of questions (comparison and bridge) of the four settings of the model on the four debiased sets of 2Wiki and HotpotQA-small. The best and worst scores are boldfaced and underlined, respectively.

Figure 1 :
Figure 1: Example of (a) a standard multi-hop question, (b) the two underlying reasoning tasks in the QA process and the three aspects of our analysis, where '+' and '-' indicate that the UR tasks have a positive and a negative impact, respectively, and (c) the debiased and adversarial examples used in our study.

Figure 2 :
Figure 2: Positions of the sentence-level SFs in the dev sets of the three datasets.

Figure 3 :
Figure 3: Our model has three main steps: paragraph selection, context encoding, and multi-task prediction.

Question:
Who was born first, Albert Einstein or Abraham Lincoln?

Figure 4 :
Figure 4: Example of a comparison question from the 2Wiki dataset.

Figure 5 :
Figure 5: Positions of the sentence-level SFs of comparison and bridge questions in the dev sets of the two datasets: 2Wiki and HotpotQA-small.

Table 1 :
Statistics for 2Wiki and HotpotQA-small; HotpotQA-small is created by adding entity-level reasoning information to the samples. There are four debiased sets in each of 2Wiki and HotpotQA-small. There is one adversarial set in 2Wiki and two adversarial sets in HotpotQA-small.

Table 2 :
Ablation study results (%) of our model on the dev sets of 2Wiki and HotpotQA-small. Ans, Sent, and Ent represent the answer prediction task, sentence-level SFs prediction task, and entity-level prediction task, respectively. 'Task Setting' represents the tasks the model is trained on; '-' indicates the tasks the model is not trained on.

Table 3 :
Average performance drop over five runs (smaller is better) of the four settings on the four debiased sets of 2Wiki and HotpotQA-small. The best and worst scores are boldfaced and underlined, respectively.

Table 4 :
Results of our model on the dev-adversarial set of 2Wiki and the corresponding performance drop, used to test whether the UR tasks can help to improve the robustness of the model.

Table 5 :
Results of our model on the dev set and the two dev-adversarial sets of HotpotQA-small. 'Adver' denotes adversarial, and 'Adver-val' denotes the adversarial set that was validated by crowd workers.

Table 6 :
Number of correctly predicted answers, number of correctly predicted entity-level reasoning chains, and number of examples with both a correctly predicted answer and correctly predicted entity-level reasoning.

Table 7 :
Results (%) of our model and previous models on the dev set of HotpotQA and on the test set of 2Wiki. We also show the performance of our model on the dev set of HotpotQA-small. Answer, Sentence-level, and Entity-level represent the answer prediction task, sentence-level prediction task, and entity-level prediction task, respectively. For HGN-BERT, the scores that we obtained (from left to right: 58.93, 73.18, 54.64, 85.34, 35.11, 64.24) are lower than the reported scores in HGN (Fang et al., 2020); therefore, we show the reported F1 scores of HGN.

Table 8 :
Performance drop (smaller is better) over five runs of the four settings of the model on the four debiased sets of 2Wiki and HotpotQA-small. The best and worst scores are boldfaced and underlined, respectively. The debiased datasets are newly created for each run.

Table 10 presents examples of the outputs predicted by our model, which is trained on the three tasks simultaneously.