Weakly Supervised Pre-Training for Multi-Hop Retriever

In multi-hop QA, answering complex questions entails iterative document retrieval for finding the missing entity of the question. The main steps of this process are sub-question detection, document retrieval for the sub-question, and generation of a new query for the final document retrieval. However, building a dataset that contains complex questions with sub-questions and their corresponding documents requires costly human annotation. To address this issue, we propose a new method for weakly supervised multi-hop retriever pre-training without human effort. Our method includes 1) a pre-training task for generating vector representations of complex questions, 2) a scalable data generation method that produces the nested structure of question and sub-question as weak supervision for pre-training, and 3) a pre-training model structure based on dense encoders. We conduct experiments to compare the performance of our pre-trained retriever with several state-of-the-art models on end-to-end multi-hop QA as well as document retrieval. The experimental results show that our pre-trained retriever is effective and also robust under limited data and computational resources.


Introduction
Multi-hop QA is the task of answering complex questions that require reasoning across multiple documents (Nogueira and Cho, 2017; Nie et al., 2019; Sun et al., 2019; Fang et al., 2020; Zhao et al., 2020). The core components of multi-hop reasoning are identifying the missing entity in the question and generating a new query with the missing entity. Figure 1 shows an example of the reasoning process in multi-hop QA. In the example, the missing entity of the question, which we call the bridge entity, is "Jupiter". To answer the question, the correct document for the sub-question "the largest planet in the Solar System" should be retrieved. Supervised training of multi-hop QA models for these intermediate reasoning steps requires a dataset of complex questions, sub-questions, and their corresponding documents. However, building such a dataset requires costly human annotation and cannot be done at scale (Min et al., 2019; Wolfson et al., 2020).
When annotated supervision is limited, weakly supervised pre-training can be a solution (Devlin et al., 2019; Liu et al., 2019), and it has been shown to be effective in open-domain QA (Guu et al., 2020). Unlike in open-domain QA, applying a pre-training method to multi-hop QA is not trivial because generating weak supervision data is more complex. In open-domain QA, weak supervision is generated by selecting a document from a corpus and extracting a sentence from the document. The sentence becomes a pseudo question, and the document becomes a pseudo supporting document to be predicted by retrievers. This two-step process cannot be directly applied to multi-hop QA since each multi-hop question refers to multiple documents.
In this paper, we propose a novel weakly supervised pre-training method for multi-hop retrievers, LOUVRE (Learning frOm mUlti-hop Variation of document RElations). Our method contains three core elements: 1) a pre-training task, 2) a scalable method to generate pre-training data with weak supervision, and 3) a model to pre-train a retriever for multi-hop QA. Specifically, we define a task for pre-training, "Next Document Prediction" (NDP), which is to retrieve documents for sub-questions. We then propose "Bridge Entity Re-Phrasing" to generate the pre-training data. "Bridge Entity Re-Phrasing" generates complex questions that contain sub-questions of the bridge entities and their corresponding documents. To generate a complex question using a bridge entity without human effort, we use two documents connected by Wikipedia hyperlinks. The hyperlinked entity becomes the bridge entity, and the introductory phrase of the entity becomes the sub-question in the complex question. This approach enables our weak supervision data generation to be scalable, as shown in Figure 2.
Figure 1: An example of chain reasoning in multi-hop QA. Question: "What is the name of the biggest moon of the largest planet in the Solar System?" Document A (Jupiter): "Jupiter is the fifth planet from the Sun and the largest in the Solar System." Document B (Ganymede (moon)): "Ganymede is the largest and most massive of Jupiter's moons." Answering the question requires finding "the largest planet in the Solar System", which is the bridge entity "Jupiter". With the retrieved document A, the question has enough information to retrieve the correct answer in document B.
We use a dense retriever consisting of a question encoder and a document encoder for the pre-trained model structure (Karpukhin et al., 2020). The two encoders calculate vector representations of questions and documents. Document retrieval is performed by comparing the vectors with MIPS (maximum inner product search).
Pre-training a multi-hop retriever with our weak supervision method brings three benefits: significant performance improvement, robustness in few-shot settings, and computational efficiency. We evaluate our weakly supervised pre-trained retriever with two types of experiments on the HOTPOTQA dataset: supporting document prediction and end-to-end multi-hop QA. In both experiments, LOUVRE outperforms previous multi-hop retrievers. Also, we fine-tune LOUVRE on 1% of the training data and show that the performance of LOUVRE is comparable to the baselines. Finally, we evaluate the computational efficiency of LOUVRE; the results show that LOUVRE requires less inference time than the baselines.
The contributions of this paper are as follows: 1) we propose a novel scalable weakly supervised pre-training method for multi-hop retrievers, 2) we make the implementation of LOUVRE and the pre-trained checkpoint publicly available (footnote 1), and 3) we show the effect of our pre-training method on multi-hop QA with various experimental results.

Related Work
Distant Supervision in Open-Domain QA: Many open-domain QA datasets only provide question-answer pairs; some also provide weakly annotated supporting documents, but these are predicted by simple heuristics (Joshi et al., 2017; Berant et al., 2013). Hence, document retrieval has suffered from a lack of strong supervision. To resolve this issue, Karpukhin et al. (2020) use a document retrieved by TF-IDF as the supporting document of the given question. Weak supervision is also an effective method in the distant supervision setting of open-domain QA. Prior work uses ICT (inverse cloze task) to generate pseudo question-document pairs and pre-train a retriever: documents are selected from Wikipedia, sentences are extracted from them, and the selected sentence-document pairs become pseudo question-document pairs. Guu et al. (2020) propose a pre-training method for a language model that uses a knowledge retriever (document retriever). They train the knowledge retriever only with the language modeling loss, without using any supervision signal of supporting documents. Although pre-training methods are effective in open-domain QA, they are limited to single-hop retrievers.
1 https://github.com/yeonsw/LOUVRE
Multi-Hop QA: To overcome the lack of supervision signal in multi-hop QA, weak supervision methods have been proposed. Qi et al. (2019) propose a sub-question generation method. They use heuristically generated pseudo-questions as supervision for the question generation model. Perez et al. (2020) generate weak supervision for question decomposition by mapping a complex question to multiple single-hop questions in existing QA datasets. They use complex questions in HOTPOTQA (Yang et al., 2018) and single-hop questions in SQuAD 2.0 (Rajpurkar et al., 2018). Another method to train multi-hop QA models without human-annotated datasets is to take two simple questions and generate a complex question (Pan et al., 2020). They generate complex questions with GPT-2 (Radford et al., 2019) fine-tuned on SQuAD1.1. Our work improves upon previous research by providing a more general method that leverages a large open corpus with retriever pre-training.
Figure 2: Proposed pre-training data generation process. Two documents connected by a Wikipedia hyperlink are selected. In the "Bridge Entity Re-Phrasing" process, document B, which describes the entity "subatomic particles", is used to re-phrase the entity in document A. After replacing the answer entity, "Quantum mechanics", the complex question and its corresponding supporting documents are generated. Document A: "Quantum mechanics is a fundamental theory in physics that provides a description of the physical properties of nature at the scale of subatomic particles". Document B: "Subatomic particles are particles that are smaller than the atom". Question: "What is a fundamental theory in physics that provides a description of the physical properties of nature at the scale of are particles that are smaller than the atom?" Supporting documents in order: Document B -> Document A. Answer: Quantum mechanics.

Method
We propose an effective and scalable pre-training method that provides weak supervision of the complex questions with sub-questions and their corresponding supporting documents.

Next Document Prediction
We propose the "Next Document Prediction" (NDP) task for pre-training. NDP refers to the process of recurrent document retrieval used in prior work (Qi et al., 2019; Asai et al., 2020), and we apply the common definitions from these studies to our task. We define NDP as the task of predicting the documents in the reasoning sequence $[d_1, \dots, d_n]$ recurrently:
$d_k = \operatorname*{argmax}_{d} P(d \mid q, D_{k-1})$ (1)
where $q$ is a question, $d_k$ is the predicted document at step $k$, and $D_{k-1} = \{d_1, \dots, d_{k-1}\}$ is the set of documents retrieved in the previous steps.
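The recurrent prediction in NDP can be illustrated with a greedy loop. The following is a minimal sketch under stated assumptions, not the authors' implementation: `embed` is a hypothetical bag-of-words stand-in for a learned encoder, the query at each step is assumed to be the question concatenated with the documents retrieved so far, and the three-document corpus is a toy example.

```python
# Toy corpus; document ids and texts are illustrative only.
CORPUS = {
    "Jupiter": "jupiter is the largest planet in the solar system",
    "Ganymede": "ganymede is the biggest moon of jupiter",
    "Mars": "mars is the fourth planet from the sun",
}

# Hypothetical stopword list so that function words do not dominate scores.
STOPWORDS = {"is", "the", "of", "in", "from", "what"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [w for w in text.lower().split() if w not in STOPWORDS]


VOCAB = sorted({w for text in CORPUS.values() for w in tokenize(text)})


def embed(text):
    """Hypothetical encoder: word counts over the toy vocabulary."""
    words = tokenize(text)
    return [float(words.count(w)) for w in VOCAB]


def inner_product(u, v):
    return sum(a * b for a, b in zip(u, v))


def next_document_prediction(question, n_hops=2):
    """Greedy NDP: at step k, pick the best unseen document given q and D_{k-1}."""
    retrieved = []
    for _ in range(n_hops):
        # Query for this step: the question plus the documents retrieved so far.
        query = " ".join([question] + [CORPUS[d] for d in retrieved])
        query_vec = embed(query)
        candidates = [d for d in CORPUS if d not in retrieved]
        best = max(candidates,
                   key=lambda d: inner_product(query_vec, embed(CORPUS[d])))
        retrieved.append(best)
    return retrieved


chain = next_document_prediction(
    "what is the biggest moon of the largest planet in the solar system")
print(chain)  # ['Jupiter', 'Ganymede']
```

In this toy setting the first hop retrieves the bridge document and the second hop, conditioned on it, retrieves the answer document. A real retriever would replace `embed` with trained encoders and usually keeps several candidates per step (beam search) rather than a single greedy choice.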

Bridge Entity Re-Phrasing
Our pre-training requires a dataset of questions, sub-questions, and their corresponding reasoning chains (i.e., sequences of documents). We propose "Bridge Entity Re-Phrasing" for generating this pre-training dataset. "Bridge Entity Re-Phrasing" takes two steps: entity selection and re-phrasing. Figure 2 provides an overview of our data generation process. We describe "Bridge Entity Re-Phrasing" in detail in the following paragraphs.
The "Bridge Entity Re-Phrasing" process requires informative entities and descriptions of those entities. We assume that an entity with a Wikipedia hyperlink is an informative entity. Also, hyperlinked entities often have Wikipedia articles describing them. The hyperlinked entity becomes the bridge entity. In Figure 2, document A and document B are connected by the bridge entity "subatomic particles". We re-phrase the selected entity with the first line of its document. In Figure 2, "subatomic particles" is re-phrased with the first line of document B. When the bridge entity appears in the question, multi-hop retrievers can easily find the bridge document from that word alone. To prevent this, we remove the bridge entity from the text taken from document B. The resulting merged document is then used for question-answer pair generation. Generating a question-answer pair from a single document has been studied in pre-training research for open-domain QA (Guu et al., 2020). We extend this work to generate questions, reasoning chains, and answers. We randomly select an entity from the merged document and replace the entity with the word "what". In Figure 2, "Quantum mechanics" is the selected entity and is replaced with "what". The new sentence becomes a pseudo-question, and the replaced entity becomes the answer. Since document B contains the bridge entity, the reasoning chain of the pseudo-question becomes [document B, document A].
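The data generation steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the released implementation: the function name `bridge_entity_rephrase` and its parameters are hypothetical, the answer entity is passed in explicitly instead of being sampled at random from recognized entities, and string replacement stands in for proper entity span handling.

```python
import re


def _replace_first(text, entity, replacement):
    """Replace the first case-insensitive occurrence of `entity` in `text`."""
    pattern = re.compile(re.escape(entity), flags=re.IGNORECASE)
    # A callable replacement keeps backslashes in `replacement` literal.
    return pattern.sub(lambda _match: replacement, text, count=1)


def bridge_entity_rephrase(title_a, sentence_a, title_b, first_line_b,
                           bridge_entity, answer_entity):
    """Build one pseudo question with its answer and reasoning chain."""
    # Remove the bridge entity from document B's first line so that the
    # question does not leak the entity name.
    description = _replace_first(first_line_b, bridge_entity, "").strip()
    # Re-phrase the bridge entity in document A with that description.
    merged = _replace_first(sentence_a, bridge_entity, description)
    # Replace the answer entity with "what" to form the pseudo-question.
    question = _replace_first(merged, answer_entity, "what").rstrip(". ")
    question = question[0].upper() + question[1:] + "?"
    return {
        "question": question,
        "answer": answer_entity,
        # Document B describes the bridge entity, so it comes first.
        "reasoning_chain": [title_b, title_a],
    }


example = bridge_entity_rephrase(
    title_a="Document A",
    sentence_a=("Quantum mechanics is a fundamental theory in physics that "
                "provides a description of the physical properties of nature "
                "at the scale of subatomic particles."),
    title_b="Document B",
    first_line_b="Subatomic particles are particles that are smaller than the atom.",
    bridge_entity="subatomic particles",
    answer_entity="Quantum mechanics",
)
print(example["question"])
```

With the Figure 2 inputs, the sketch reproduces the question shown in the figure, including its ungrammatical seam ("at the scale of are particles ..."), which is a natural side effect of plain substitution.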

Model Architecture
The model structure for our pre-training is subject to two requirements: a general model structure and recurrent retrieval. We use the multi-hop dense retriever (MDR), which meets both requirements and is based on DPR (dense passage retriever) (Karpukhin et al., 2020). DPR consists of a question encoder $E_Q$ and a document encoder $E_D$, both of which are based on RoBERTa-base. Documents are retrieved by MIPS (maximum inner product search) using the similarity between the question vector and the document vectors:
$\mathrm{sim}(q, d) = E_Q(q)^\top E_D(d)$ (2)
MDR retrieves documents recurrently by taking the previously retrieved documents as input. MDR concatenates the question $q$ and the retrieved documents $\{d_1, \dots, d_{k-1}\}$ and calculates the question vector for the $k$-th step:
$\mathrm{sim}(q_k, d) = E_Q(q_k)^\top E_D(d), \quad q_k = [q; d_1; \dots; d_{k-1}]$ (3)
where $d$ is a document in the corpus. We train the dense encoders to assign the highest probability to the ground-truth document among the documents in the corpus. The loss function for our pre-training is:
$\mathcal{L} = -\log \frac{\exp(\mathrm{sim}(q_k, d_k))}{\exp(\mathrm{sim}(q_k, d_k)) + \sum_{d' \in \mathrm{neg}(d_k)} \exp(\mathrm{sim}(q_k, d'))}$ (4)
where $q_k$ is the concatenation of $q$ and $D_{k-1}$, and $\mathrm{neg}(d_k)$ is a set of documents excluding $d_k$. Since computing the softmax over the whole corpus is computationally expensive, we use in-batch negatives for $\mathrm{neg}(d_k)$ (Karpukhin et al., 2020).
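The training objective with in-batch negatives can be sketched as follows. This is an illustrative pure-Python version under stated assumptions: the vectors are given directly instead of being produced by encoders, `in_batch_nll` is a hypothetical name, and the positive document for the i-th query is assumed to sit at index i of the batch, with every other document in the batch acting as a negative.

```python
import math


def inner_product(u, v):
    return sum(a * b for a, b in zip(u, v))


def in_batch_nll(query_vecs, doc_vecs):
    """Mean negative log-likelihood with in-batch negatives.

    doc_vecs[i] is the positive document for query_vecs[i]; all other
    documents in the batch serve as negatives for that query.
    """
    losses = []
    for i, q in enumerate(query_vecs):
        scores = [inner_product(q, d) for d in doc_vecs]
        # Log-sum-exp with max subtraction for numerical stability.
        top = max(scores)
        log_z = top + math.log(sum(math.exp(s - top) for s in scores))
        losses.append(log_z - scores[i])
    return sum(losses) / len(losses)


queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_docs = [[1.0, 0.0], [0.0, 1.0]]
swapped_docs = [[0.0, 1.0], [1.0, 0.0]]

print(in_batch_nll(queries, aligned_docs))  # lower: positives score highest
print(in_batch_nll(queries, swapped_docs))  # higher: positives score lowest
```

The design choice here is that a batch of B question-document pairs yields B - 1 negatives per question at no extra encoding cost, which is why the softmax over the full corpus can be avoided during training.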

Pre-Training Details
We generate our pre-training data from 5,233,329 Wikipedia articles provided by Yang et al. (2018). We select all sentences that contain at least one hyperlinked entity to generate pseudo questions and randomly select "answer" entities from the sentences, using spaCy for entity recognition. Our data generation process builds 13.9 million question-document-answer triples. We pre-train our dense retriever with a batch size of 256 for 200K+ steps. We use Adam with a warmup ratio of 0.1 and set the learning rate to $2 \times 10^{-5}$. We use a machine with eight V100 (32G) GPUs.
Table 1: The five variations of LOUVRE (LOUVRE, LOUVRE-eff, LOUVRE-Wiki, LOUVRE-reranking, and LOUVRE-reranking-Wiki). The columns indicate whether each variation uses sparse retrieval (TF-IDF), Wikipedia hyperlinks (Wiki), reranking (RR), and efficient fine-tuning (Eff). "Eff" represents whether the model uses the efficient hyper-parameter setting: a batch size of 32 and 15 epochs.

Fine-Tuning Details
We use TF-IDF negatives in addition to in-batch negatives for fine-tuning, as in Karpukhin et al. (2020). We set the number of TF-IDF negatives to 2. We use the Adam optimizer with a warm-up ratio of 0.1 and set the learning rate to $2 \times 10^{-5}$. We set the batch size to 32 and the number of epochs to 15. To achieve better performance, we additionally fine-tune our model with another hyper-parameter setting: a batch size of 150 and a longer training time of 50 epochs.

Tasks
Supporting Document Prediction: In this task, retrievers and rerankers predict supporting documents for each question in the HOTPOTQA dataset (Yang et al., 2018). The models predict possible combinations of supporting documents. Formally, given a question $q_i$ as input, the models yield a ranked list of $K$ sets of documents, $[P_1, \dots, P_K]$, where each set $\{d_a, d_b\}$ is a pair of candidate supporting documents. In HOTPOTQA, the number of supporting documents is fixed to 2. We use the 5 million Wikipedia articles as the knowledge source.
End-to-End Multi-Hop QA: We evaluate the supporting facts prediction performance and the answer prediction performance of LOUVRE on the HOTPOTQA full wiki setting (Yang et al., 2018).

Multi-Hop Retrieval Strategy
We propose five variations of LOUVRE based on existing multi-hop retrieval strategies. Multi-hop document retrievers leverage three strategies for performance improvement and computational efficiency: sparse retrieval methods such as TF-IDF (Nie et al., 2019), Wikipedia hyperlinks (Asai et al., 2020), and reranking. Sparse retrieval methods select a small number of candidate documents relevant to the given question and are used to narrow down the search space of dense retrievers. We use TF-IDF and keyword matching, as in Nie et al. (2019), to retrieve 200 candidate documents. Existing multi-hop retrievers select reasoning paths (document chains) from documents connected by Wikipedia hyperlinks. We iteratively select the next-hop documents from the documents connected with the previously retrieved documents. Rerankers take the candidate reasoning paths (pairs of documents) from the retriever and predict the most probable reasoning path. We use the same reranker as MDR. Table 1 shows the detailed information of these five variations.
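The hyperlink-constrained selection of next-hop documents can be sketched as a small beam search. This is an illustrative example under stated assumptions: `select_reasoning_paths` is a hypothetical helper, the first-hop scores are assumed to come from an earlier retrieval stage, and a path is scored by the sum of its two hop scores, which is a simplification chosen for the sketch.

```python
def select_reasoning_paths(first_hop_scores, hyperlinks, second_hop_score,
                           beam_size=2):
    """Rank two-hop paths whose second document is hyperlinked from the first."""
    # Keep the top `beam_size` first-hop candidates.
    first_hops = sorted(first_hop_scores, key=first_hop_scores.get,
                        reverse=True)[:beam_size]
    paths = []
    for d1 in first_hops:
        # Only documents linked from d1 are considered for the next hop.
        for d2 in hyperlinks.get(d1, []):
            score = first_hop_scores[d1] + second_hop_score(d1, d2)
            paths.append(((d1, d2), score))
    paths.sort(key=lambda item: item[1], reverse=True)
    return paths


# Toy inputs; scores and links are made up for illustration.
first_hop_scores = {"Jupiter": 0.9, "Mars": 0.4, "Sun": 0.1}
hyperlinks = {"Jupiter": ["Ganymede", "Sun"], "Mars": ["Phobos"]}
toy_scores = {"Ganymede": 0.8, "Sun": 0.2, "Phobos": 0.3}

ranked = select_reasoning_paths(
    first_hop_scores, hyperlinks,
    second_hop_score=lambda d1, d2: toy_scores[d2])
print([path for path, _ in ranked])
```

Restricting the second hop to hyperlinked neighbours shrinks the search space from the whole corpus to a handful of documents per beam entry, at the cost of missing supporting documents that are not linked.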

Metric
We use five evaluation metrics: EM, F1, R@K, PathR@K, and AR@K. EM and F1 measure the answer prediction and supporting fact prediction performance of multi-hop QA models (Yang et al., 2018). In addition to R@K, which measures the performance of supporting document prediction, we use another metric, PathR@K, to evaluate how well the retriever predicts the entire set of supporting documents. Since readers predict answers by reading each path, PathR@K is a more appropriate estimate of answer prediction performance. R@K and PathR@K are defined in terms of $G = \{g_i, g_j\}$, the set of ground truth supporting documents; $D$, the set of retrieved documents; and $P_i = \{d_a, d_b\}$, the reasoning path ranked at $i$.
In our experimental setting, $D$ is set to all documents in $\bigcup_{i=1}^{K/2} P_i$. AR@K measures the percentage of predictions in which at least one passage in the top $K$ predicted paths contains the answer text.
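The retrieval metrics can be read in more than one way; the sketch below shows one plausible per-question reading and should be treated as an assumption rather than the paper's exact formulas. Here R@K is taken as the fraction of ground truth documents found in the top K/2 paths, PathR@K as whether a single path among the top K covers all ground truth documents, and AR@K as whether any passage in the top K paths contains the answer string. All function names are hypothetical.

```python
def recall_at_k(gold, ranked_paths, k):
    """Fraction of gold documents found in D, the union of the top k/2 paths."""
    retrieved = set()
    for path in ranked_paths[: k // 2]:
        retrieved.update(path)
    return len(set(gold) & retrieved) / len(gold)


def path_recall_at_k(gold, ranked_paths, k):
    """1.0 if some path among the top k covers every gold document."""
    return float(any(set(gold) <= set(path) for path in ranked_paths[:k]))


def answer_recall_at_k(answer, ranked_paths, k, passages):
    """1.0 if any passage in the top k paths contains the answer text."""
    return float(any(answer in passages[doc]
                     for path in ranked_paths[:k] for doc in path))


# Toy example with made-up document ids and passages.
gold = {"A", "B"}
ranked_paths = [("A", "C"), ("B", "A"), ("D", "E")]
passages = {"A": "Jupiter is a planet", "B": "Ganymede orbits Jupiter",
            "C": "x", "D": "y", "E": "z"}

print(recall_at_k(gold, ranked_paths, 2))       # 0.5
print(path_recall_at_k(gold, ranked_paths, 2))  # 1.0
```

Dataset-level numbers would then be averages of these per-question values. The example shows why the two metrics differ: both gold documents can appear in the retrieved set without any single path containing both.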

Results & Discussion
LOUVRE overcomes limited supervision in multi-hop QA with our weak supervision data. Training with additional data brings progress in three ways: 1) overall retrieval performance improvement, 2) robustness in a few-shot setting, and 3) overall improvement in end-to-end multi-hop QA. We verify these improvements on the two tasks described in section 4: supporting document prediction and end-to-end multi-hop QA.

Supporting Document Prediction
In this experiment, we demonstrate the efficacy of LOUVRE with document retrieval experiments. First, we show the performance gain that comes from using our pre-trained model. Then, we show that the gain becomes more significant in few-shot settings.
Effect of Our Pre-Training: We compare LOUVRE with MDR, which is a multi-hop retriever fine-tuned from RoBERTa-base. We use the same fine-tuning method as MDR but initialize the parameters with LOUVRE. Table 2 shows the results. LOUVRE achieves a 1.1 point absolute improvement in PathR@1 over RoBERTa initialization (67.0 vs. 65.9). LOUVRE also outperforms MDR on the other evaluation metrics. In the reranking experiments, we use the same reranking model as MDR-rerank. As in the retriever experiment, the only difference between LOUVRE-rerank and MDR-rerank is the parameter initialization in the fine-tuning step. The results show that our pre-training method is effective even after reranking; PathR@1 of LOUVRE-rerank is 83.2, and PathR@1 of MDR-rerank is 81.2.
Weak Supervision and Training Time: LOUVRE's pre-training method uses additional training with the multi-hop weak supervision dataset, which results in the performance improvement shown above. To verify that the performance gap between RoBERTa and LOUVRE does not come from the additional training time that LOUVRE uses in pre-training, we train RoBERTa with a much longer training time, 50 epochs, and compare it with LOUVRE.
In Figure 3, we show the performance of RoBERTa fine-tuned for {2, 5, 15, 50} epochs and the performance of LOUVRE fine-tuned for 15 epochs. The performance of RoBERTa stabilizes at approximately 80% in terms of PathR@20 after 15 epochs. This result shows that the performance improvement from our pre-training method comes not merely from longer training time but from the information specific to multi-hop retrieval provided by our weak supervision.
Retrieval Performance of Variations of LOUVRE-eff: Table 2 shows the effect of using LOUVRE. Similar results are observed for LOUVRE-eff. Table 3 shows the same experiments as Table 2 but in the efficient fine-tuning setting, with a small batch size and a short training time. We compare the retrieval performance of LOUVRE-eff with that of LOUVRE-eff without our pre-training method, which is fine-tuned from RoBERTa. Applying our pre-training method increases R@10 by 4.7 percentage points; R@10 of LOUVRE-eff is 80.4, and R@10 of LOUVRE-eff without pre-training is 75.7. Comparing the results in Table 2 (where the gain from our method with a large batch size and more training epochs is 1.1) with the results in Table 3, we see that the performance gain increases as computation time becomes more limited.
Robustness on Few-shot Settings: Pre-training alleviates the model's drastic performance drop when the amount of training data is insufficient. We demonstrate the robustness of LOUVRE in few-shot settings with different sizes of training data. We fine-tune LOUVRE and MDR on portions of the HOTPOTQA train set ranging from 0.1% to 100%. Figure 4 shows that the performance gap between LOUVRE and MDR increases as the size of the training data decreases. When we use 0.1% of the HOTPOTQA train data, almost 30% of LOUVRE's predictions contain correct supporting documents; the performance of MDR with the same amount of training data is close to 0.
We conduct the same experiment with LOUVRE-Wiki and verify that using Wikipedia hyperlinks improves the robustness in a few-shot setting by 10.5 percentage points in terms of PathR@5 when only 0.1% of the training data is available.
We conduct the same experiment with a small batch size of 32 and 10 epochs. Figure 4 illustrates the retrieval performance of LOUVRE-eff and LOUVRE-eff without our pre-training (RoBERTa) depending on the proportion of the data used for fine-tuning. It is worth noting that LOUVRE-eff fine-tuned with 10% of the data outperforms RoBERTa with 100% and shows little performance degradation compared to fine-tuning with 100%. We report the detailed results of LOUVRE-eff (1%) in Table 2. Table 2 shows that LOUVRE-eff (1%) achieves performance comparable to MDR trained on the full data with a larger batch size and a longer training time; R@10 of LOUVRE-eff (1%) is 2.0 percentage points lower than R@10 of MDR, which is 97% of the performance of MDR.
Furthermore, we evaluate the zero-shot performance of LOUVRE-eff and compare it to DPR not fine-tuned on HOTPOTQA. To adapt DPR to the multi-hop retrieval task, we replace the encoders in LOUVRE-eff with DPR encoders. We report this result in Table 3: LOUVRE-zeroshot achieves higher performance (R@10: 44.8 and R@20: 55.8) than DPR-zeroshot (R@10: 39.0 and R@20: 51.6).
Computational Efficiency: We compare the inference time of the baselines and LOUVRE in terms of the number of BERT-base executions needed for each question. We exclude the inference time for document indexing, which can be done a priori. The number of BERT executions for each baseline is derived from each paper and its implementation. We measure the inference time of LOUVRE, PathRetriever, and MDR in various hyper-parameter settings by adjusting the beam size and the number of documents retrieved by the sparse retriever, TF-IDF. The number of BERT executions of MDR and LOUVRE-eff with a beam size of $b$ is #BERT = 1 (question encoding) + $b$ (question-passage encoding). The inference time of MDR-rerank, LOUVRE-rerank, and LOUVRE-rerank-wiki involves another factor, the input size of the reranker. The number of BERT executions of these reranking models with a beam size of $b$ and an input size of $r$ becomes #BERT = $1 + b + r$. For PathRetriever, we vary the number of documents retrieved by the sparse retriever, TF-IDF. Figure 5 illustrates that LOUVRE is more effective and efficient than the baselines because it yields better retrieval performance with a much smaller number of BERT executions.
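The cost model above is simple arithmetic, and a short helper makes the comparisons in the end-to-end experiments easy to check. The function name `bert_executions` is hypothetical; the beam sizes and reranker input sizes in the example are the settings reported for the end-to-end experiments in this paper.

```python
def bert_executions(beam_size, reranker_input_size=0):
    """#BERT = 1 (question encoding) + b (question-passage encodings) + r (reranker inputs).

    With reranker_input_size=0 this reduces to the retriever-only count 1 + b.
    """
    return 1 + beam_size + reranker_input_size


# Matched-cost comparison: LOUVRE-Wiki (b=100, r=350) vs. MDR (b=200, r=250).
print(bert_executions(100, 350))  # 451
print(bert_executions(200, 250))  # 451

# Low-budget setting with a beam size of 2 and 2 output paths.
print(bert_executions(2, 2))      # 5
```

The two matched-cost configurations give the same count, which is what makes a comparison "using the same inference time" meaningful under this cost model.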

End-to-End Multi-Hop QA
In this section, we demonstrate that the end-to-end multi-hop QA pipeline using LOUVRE retains the three benefits of LOUVRE: overall performance improvement, robustness in a few-shot setting, and fast inference. We use the multi-hop QA pipelines of MDR and PathRetriever. All the components of the multi-hop QA pipelines except the retriever are fixed. We plug each baseline retriever and LOUVRE into the pipeline and evaluate the end-to-end performance of each model.
End-to-End Performance: Table 4 shows the end-to-end multi-hop QA performance of the baselines and LOUVRE. In this experiment, we replace the retriever of MDR's pipeline with LOUVRE-rerank. We set the beam size to 30 and the input size of the reranker to 900. LOUVRE outperforms the baselines with a Joint F1 of 67.08. Table 5 shows the performance of LOUVRE-Wiki using the same inference time as MDR. In this experiment, we use a beam size of 100 and a reranker input size of 350. MDR uses a beam size of 200 and a reranker input size of 250. LOUVRE-Wiki outperforms MDR by 0.32 percentage points in terms of Joint F1. We conduct the same experiment with LOUVRE-eff and PathRetriever. Table 5 shows the results. LOUVRE-eff in Table 5 represents the end-to-end pipeline of PathRetriever-eff using LOUVRE-eff, instead of TF-IDF, as the initial candidate document retriever. We set the number of initial candidate documents of LOUVRE-eff and PathRetriever-eff to 50. LOUVRE-eff outperforms PathRetriever-eff by 3.38 percentage points without any loss of computational efficiency. We provide the detailed experimental results of LOUVRE-eff in Appendix B.
End-to-End Performance of Variations of LOUVRE-eff: Table 6 shows the end-to-end performance of each model under different inference times, training data sizes, and pre-training settings. We evaluate two types of LOUVRE-eff. LOUVRE-eff (RR) represents the same pipeline used in [...]. LOUVRE-eff (100+) shows that applying LOUVRE-eff to PathRetriever increases Joint F1 by 1 with 5 times faster inference speed. We conduct the same experiment by reducing the number of documents retrieved by LOUVRE-eff and achieve 0.6 percentage points higher Joint F1 than PathRetriever with 10 times faster inference speed.
We conduct ablation studies with two factors of retrievers: 1) the size of the training data and 2) pre-training. We fix the reader to BERT-wwm fully fine-tuned on the HOTPOTQA train set. Table 6 shows the results. When we train LOUVRE-eff with only 1% of the train set, the end-to-end performance drops by 0.8 percentage points. However, using RoBERTa with the whole train set decreases the performance by 3.6 percentage points. This result indicates that our pre-training method brings more significant improvement to the end-to-end multi-hop QA pipeline when the size of the training data is small.
Decreasing the search space of multi-hop retrievers increases the retriever's computational efficiency but results in a significant performance drop. We demonstrate the robustness of LOUVRE-eff when the computation time is limited. We decrease the beam size of LOUVRE-eff to 2 and the number of output paths to 2; the total number of BERT executions of this model is #BERT = 1 + 2 (beam size) + 2 (input size). Table 6 shows that LOUVRE-eff achieves 89% of PathRetriever's performance with inference almost 100 times faster than PathRetriever. We adjust the inference speed of PathRetriever and compare it with LOUVRE-eff (w/o RR). LOUVRE-eff (w/o RR) outperforms PathRetriever (50+) by 1.2 percentage points with less computation time. LOUVRE-eff (w/o RR) trained with only 1% of the training data even outperforms PathRetriever (50+).

Conclusion
Answering complex questions involves reasoning across multiple documents. Recent studies have found that this reasoning requires learning sub-question detection and relevant document retrieval to predict a correct answer with supporting facts. However, building datasets with such supervision requires costly human annotation and has limited scalability. To address this issue, we proposed a weakly supervised pre-training method for multi-hop retrievers, LOUVRE. Our pre-training method contains three elements: the "Next Document Prediction" task, "Bridge Entity Re-Phrasing", and a model. We demonstrated the efficacy of LOUVRE and its robustness in few-shot settings with extensive experiments on the supporting document retrieval task and the end-to-end multi-hop QA task. We also showed that our method performs very well at a much lower inference cost.

Rerankers calculate the score of each path or document by jointly encoding each document with the given question. As a result, reranking takes a large portion of the computation time of the end-to-end multi-hop QA pipeline. SemanticRetrievalMRS (Nie et al., 2019) proposes a document reranking model that takes the output of sparse retrievers such as TF-IDF. Since the model outputs documents rather than a list of supporting documents, we use the same document rearranging method as for TF-IDF above. PathRetriever (Asai et al., 2020) and HopRetriever (Li et al., 2020) are reasoning path prediction models. These models use TF-IDF and BERT to retrieve and rerank the candidate documents. They use Wikipedia hyperlinks for candidate document selection, as described in section 4.4, and beam search with a beam size of 8 to rank the predicted supporting documents. MDR provides a reranking model as well as a retriever. We report the published performance of MDR-reranking.

B Appendix: End-to-End Performance of LOUVRE-eff
Table 7 shows additional results for Table 6. We evaluate LOUVRE-eff on other evaluation metrics used in the HOTPOTQA benchmark and verify the efficacy of LOUVRE-eff. We report the detailed results of Table 5 in Table 8. The results show the end-to-end performance of LOUVRE-eff and PathRetriever on the HOTPOTQA test set.