Hybrid Hierarchical Retrieval for Open-Domain Question Answering

Retrieval accuracy is crucial to the performance of open-domain question answering (ODQA) systems. Recent work has demonstrated that dense hierarchical retrieval (DHR), which retrieves document candidates first and then relevant passages from the refined document set, can significantly outperform the single-stage dense passage retriever (DPR). While effective, this approach requires document structure information to learn document representations and is hard to adapt to other domains without this information. Additionally, dense retrievers tend to generalize poorly to out-of-domain data compared with sparse retrievers such as BM25. In this paper, we propose Hybrid Hierarchical Retrieval (HHR) to address these limitations. Instead of relying solely on dense retrievers, we can apply a sparse retriever, a dense retriever, or a combination of the two in both the document and passage retrieval stages. We perform extensive experiments on ODQA benchmarks and observe that our framework not only brings in-domain gains, but also generalizes better to the zero-shot TriviaQA and Web Questions datasets, with an average improvement of 4.69% in recall@100 over DHR. We also offer practical insights into the tradeoff between retrieval accuracy, latency, and storage cost. The code is available on GitHub*.


Introduction
Open-domain question answering (ODQA) (Voorhees, 1999) aims to answer questions based on a large corpus without pre-specified context, and enjoys a broad scope of real-world applications such as chatbots, virtual assistants, and search engines. Recent ODQA systems often follow a two-stage retrieve-then-read architecture (Zhu et al., 2021; Chen et al., 2017; Lee et al., 2019). Given a question, a retriever module first selects a candidate set of relevant contexts from a diversified large corpus such as Wikipedia; afterwards, a reader module consumes the retrieved evidence to predict an answer. Here, retrieval performance is crucial to the accuracy of the QA system, as it determines whether the correct context to answer the question can be presented to the reader. While most work in information retrieval focuses on document retrieval (Nguyen et al., 2016; Thakur et al., 2021), existing work in ODQA often splits documents into short passages and directly retrieves passages for the reader (Karpukhin et al., 2020; Izacard and Grave, 2020) to accommodate reader models that handle shorter sequences most effectively. A drawback of such single-stage passage retrieval approaches is that they tend to be susceptible to distracting passages that contain seemingly relevant local context but not the correct answer, since they cannot incorporate information from other parts of the document (see Figure 1). Further, the large number of passage candidates also hurts system throughput. To mitigate these issues, Liu et al. (2021) recently proposed a two-stage hierarchical retrieval framework in which the retriever first retrieves relevant documents and then discerns relevant passages within the retrieved documents. This helps prune passages that look relevant but come from irrelevant documents, improving answer accuracy, while greatly reducing the candidate set for passage retrieval and improving the inference speed of ODQA systems.

Question: What is the lightest metal under standard conditions?
Answer: Lithium
DPR Passage, Electrochemical fatigue crack sensor: Stainless steel has a density of 8000 kg/m³ and aluminum alloy has a density of 2700 kg/m³. EFS detects growing cracks in steel, aluminum, titanium alloys …
DHR Passage, Iron: Because of the softness of iron, it is easier to work with than its heavier congeners. The form of iron that is stable under standard conditions …
HHR Passage, Lithium: Lithium is a chemical element with symbol Li. It is a soft alkali metal. Under standard conditions, it is the lightest metal and the lightest solid …
Figure 1: An example showing the top-1 retrieved passage of DPR, DHR, and HHR for a TriviaQA query. The title of the source document is underlined. DPR finds a passage that mentions different metals from an irrelevant document. DHR retrieves a better passage, yet it focuses on a single metal, iron. HHR finds the ground-truth passage.

Figure 2: An overview of HHR. It consists of a document retrieval stage followed by a passage retrieval stage. Each stage uses one of three types of retrievers (sparse, dense, or combined), leading to a total of 9 configurations.

+ The work was done during an internship at AWS AI Lab.
† These authors contributed equally to this work.
* https://github.com/ghuhan17/HybridHierarchicalRetrieval.
Despite its success, Liu et al.'s (2021) approach (dense hierarchical retrieval, DHR) relies on dense neural retrievers (Lee et al., 2019; Karpukhin et al., 2020) for both document retrieval and passage retrieval, and therefore suffers from two key weaknesses. First, for effectiveness and efficiency, neural encoders used in retrieval are often limited to a context length that is too short to encompass most documents. As a result, DHR needs to make use of the structure of Wikipedia documents and represents each document succinctly by its title, abstract, and table of contents, which are not always available for non-Wikipedia text. Second, dense retrievers have been shown to suffer from poor generalization on out-of-domain data (Thakur et al., 2021), whereas sparse retrievers like BM25 (Robertson et al., 2009) excel with lexical matches (Sciavolino et al., 2021).
In this work, we propose a hybrid hierarchical retrieval (HHR) framework to alleviate these issues. Specifically, we investigate the tradeoffs and complementary strengths of sparse retrievers and dense retrievers at both the document retrieval and passage retrieval stages for ODQA (see Figure 2). We find, among other things, that sparse retrievers can complement dense retrievers at both retrieval stages with a simple approach to aggregate results from both. Besides in-domain evaluation on the dataset that neural models are trained on, we also perform zero-shot evaluation on unseen datasets to compare the generalization of these retriever architectures. We find that sparse retrievers can help HHR generalize better to unseen data and potentially replace dense retrievers in document retrieval. In addition, we also study the accuracy, storage cost, and latency tradeoff for these architectures under the HHR framework, and offer practical insights to real-world ODQA systems that often need to factor these into consideration.
Our main contributions are as follows. First, we propose a hybrid hierarchical retrieval framework for ODQA, and extensively study the tradeoffs and complementary strengths of sparse and dense retrievers in both document and passage retrieval. Second, we perform both in-domain and out-of-domain evaluation to provide insight into the generalization performance of different model choices. Finally, we present the accuracy-storage-latency landscape for HHR architectures and offer practical insights for real-world applications.

Background & Related Work
Open-domain question answering (ODQA). ODQA is a task that takes a question, such as "Who got the first Nobel prize in physics?", and aims to find the answer from a large corpus. ODQA systems often rely on efficient and accurate retrievers to find relevant context to answer questions (Chen et al., 2017), where retrieval performance is usually critical to QA accuracy (Karpukhin et al., 2020).
Passage retrieval. Since most reader models in ODQA systems struggle to effectively handle long contexts, ODQA retrieval is often performed at the passage level (usually around 100 words long). Earlier work (Chen et al., 2017; Yang et al., 2019) relied on bag-of-words-based sparse retrievers such as BM25 (Robertson et al., 2009). More recent work showed that neural retrievers can generate effective dense representations for retrieval when trained on ODQA (Lee et al., 2019; Karpukhin et al., 2020; Liu et al., 2021). Sciavolino et al. (2021) showed, however, that these dense retrievers tend to generalize worse to entities unseen during training, since they lack the capacity for lexical matching, which is a strong suit of sparse retrievers and important for out-of-domain generalization.
Hierarchical retrieval. Passage retrievers are limited by the context available in each passage, and can retrieve spurious passages that hurt answer performance. A remedy is to incorporate document-level relevancy during passage retrieval. Qi et al. (2021) explored combining document and passage relevancy scores in a BM25 retriever for ODQA. Liu et al. (2021) applied this idea to dense retrievers with a hierarchical retrieval framework (DHR), where a document retriever first retrieves documents of high relevancy, followed by a passage retriever that reranks passages within those documents. Our work extends this approach.

Methodology
Our hybrid hierarchical retrieval (HHR) framework extends DHR, a hierarchical retriever built on dense retrievers that first retrieves the top-k_d documents and then the top-k_p passages from those documents. We follow DHR to build the dense retrievers in HHR, and expand both the document and passage retrievers to work with sparse retrievers to address limitations of the DHR approach (Figure 2). Specifically, in DHR, to make documents amenable to neural encoders with limited context length, the authors proposed to leverage the document structure of Wikipedia articles to construct a document summary that contains the document abstract and table of contents. While effective, this also potentially limits the applicability of this approach to corpora where this information is not available. In contrast, a sparse retriever can easily and efficiently handle documents of arbitrary length without the need for structure information. Besides, dense retrievers tend to generalize poorly to out-of-domain data. We extend each of the document retrieval and passage retrieval stages with the option to use sparse retrievers to help alleviate this issue, and to help us understand the tradeoff between the two.
Besides switching between sparse and dense retrievers in HHR, we also introduce a simple heuristic that combines results from both retrievers at the same stage by interleaving their top-k/2 results for top-k retrieval, which lets us better understand the complementary strengths of sparse and dense retrievers. With three retriever options (sparse, dense, and combined) at each of the two stages, this yields a total of 9 possible configurations of HHR for our studies. Finally, for both sparse and dense passage retrievers in HHR, we implement on-the-fly passage reranking for all passages in the top retrieved documents with pre-computed passage representations. This helps reduce the latency of the passage retrievers in our implementation and provides more realistic insights into the accuracy-storage-latency tradeoff of different HHR settings in real-world systems.
Training and evaluation. We followed the same configuration as DHR to train the document and passage encoders for up to 40 epochs on 8 V100 Tensor Core GPUs. To measure the retriever framework in both in-domain and zero-shot settings, the dense encoders are trained only on the NQ dataset and tested on all three datasets.
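The interleaving heuristic and the resulting nine configurations described in this section can be illustrated with a minimal Python sketch. It is not the released implementation: the function names, the choice to place the sparse result first in each pair, and the de-duplication without backfilling are assumptions made for illustration.

```python
from itertools import product, zip_longest
from typing import List, Sequence

# The three retriever options available at each HHR stage.
RETRIEVER_TYPES = ("Sparse", "Dense", "Combined")


def interleave_top_k(sparse_ranked: Sequence[str],
                     dense_ranked: Sequence[str],
                     k: int) -> List[str]:
    """Combine two ranked lists by interleaving the top-k/2 items of each.

    Items returned by both retrievers are kept once, so the merged list
    can contain fewer than k items. Putting the sparse result first in
    each pair is an assumption of this sketch.
    """
    half = k // 2
    merged: List[str] = []
    seen = set()
    for pair in zip_longest(sparse_ranked[:half], dense_ranked[:half]):
        for item in pair:
            if item is not None and item not in seen:
                seen.add(item)
                merged.append(item)
    return merged


def hhr_configurations() -> List[str]:
    """Enumerate document+passage retriever pairings (3 x 3 = 9)."""
    return [f"{doc}+{psg}" for doc, psg in product(RETRIEVER_TYPES, repeat=2)]
```

For example, merging the sparse ranking `["a", "b", "c"]` with the dense ranking `["b", "d", "e"]` for k = 4 yields three items rather than four, because `"b"` appears in both top-2 lists.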

Main Results
We present the in-domain and zero-shot evaluation results for all datasets in Table 1. Following DHR, we retrieve 100, 500, and 500 documents in the first stage for NQ, WebQ, and TriviaQA, respectively. We compare HHR against DHR and single-stage sparse and dense passage retrievers. In addition, we also study the effect of retrieving varying numbers of documents in HHR (see Figure 3). Our findings are as follows.
First, dense retrievers are crucial for in-domain performance, yet sparse retrievers bring gains with complementary strengths. Replacing either or both components of the DHR baseline (Dense+Dense) with a sparse retriever leads to noticeable drops in recall@100 on NQ: 1.7% when replacing the document retriever, 2.3% when replacing the passage retriever, and 7.2% when replacing both. However, adding a sparse retriever to the document retriever, the passage retriever, or both brings gains of 2.0%, 0.9%, and 2.4%, respectively.
Second, sparse retrievers significantly improve HHR's generalization on zero-shot datasets. We corroborate the finding of previous work (Sciavolino et al., 2021) that dense retrievers struggle to generalize in zero-shot settings. Likewise, DHR underperforms the optimal setting by 3.79% on WebQ, and leads to the worst performance on TriviaQA in terms of recall@100. By adding sparse retrievers to both stages, Combined+Combined brings an average recall@100 improvement of 4.69% over DHR on WebQ and TriviaQA.
Third, a sparse document retriever can completely replace the dense document retriever. While this leads to a performance drop in-domain, we see that dense passage retrievers can help make up for the performance gap, and that the gap diminishes as more documents are retrieved in the first stage (Figure 3). Further, in the zero-shot setting, replacing dense document retrievers with sparse ones can actually lead to gains (of 1.26% and 3.15% recall@100 on WebQ and TriviaQA, respectively). This not only helps HHR generalize better to out-of-domain data through lexical matching, but also removes the requirement for the document structure information needed by the dense document retriever.
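For readers unfamiliar with the metric reported above, the following minimal sketch shows one common way to compute recall@k for retrieval in ODQA: the fraction of questions for which at least one of the top-k retrieved passages contains a gold answer. The case-insensitive substring match used here is an assumption of this sketch; the exact answer-matching procedure used in the experiments is not specified in this section.

```python
from typing import List, Sequence


def recall_at_k(retrieved: Sequence[Sequence[str]],
                answers: Sequence[Sequence[str]],
                k: int) -> float:
    """Fraction of questions whose top-k passages contain a gold answer.

    retrieved[i] is the ranked list of passage texts for question i and
    answers[i] is the list of acceptable answer strings. Matching is a
    case-insensitive substring test (an assumption of this sketch).
    """
    if not retrieved:
        return 0.0
    hits = 0
    for passages, golds in zip(retrieved, answers):
        top_k: List[str] = [p.lower() for p in passages[:k]]
        if any(g.lower() in p for p in top_k for g in golds):
            hits += 1
    return hits / len(retrieved)
```

With two toy questions whose gold answer is "Lithium", where the first ranking places the lithium passage first and the second places it second, this gives a recall@1 of 0.5 and a recall@2 of 1.0.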

Accuracy-Storage-Latency Landscape
We report the tradeoff between retrieval accuracy, latency, and storage cost for different HHR configurations based on inference on a single CPU core. Figure 4 presents the accuracy-storage-latency plot of different retriever configurations for the NQ, WebQ, and TriviaQA datasets. We find that in the in-domain case of NQ, all settings involving dense document retrievers are Pareto-efficient on accuracy and latency, while the passage retriever presents a tradeoff among accuracy, latency, and storage. In contrast, in zero-shot settings the sparse document retrievers can be Pareto-efficient when complemented with dense passage retrievers. Table 2 presents the storage cost and retrieval latency for the four essential components of the HHR framework, namely, the sparse and dense retrievers at the document and passage retrieval stages. Sparse retrievers are more storage-efficient than dense retrievers at the same level because they represent the text with an inverted index, whereas dense retrievers store dense embeddings. We also observe that PyLucene's sparse document retriever is slower than the FAISS dense document retriever, whereas our implementation of the on-the-fly sparse passage retriever is faster than its dense counterpart. Dense passage retrieval incurs the highest storage cost of the four components because it must store the embedding dictionary.
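To make the notion of Pareto efficiency used above concrete, the sketch below marks a configuration as Pareto-efficient on accuracy and latency when no other configuration is at least as accurate and at least as fast while being strictly better on one of the two. The function name and the configuration labels and numbers in the usage note are hypothetical placeholders, not measurements from our experiments.

```python
from typing import Dict, List, Tuple


def pareto_efficient(points: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return names of configurations not dominated by any other.

    points maps a configuration name to (accuracy, latency), where higher
    accuracy and lower latency are better.
    """
    efficient: List[str] = []
    for name, (acc, lat) in points.items():
        dominated = any(
            other_acc >= acc and other_lat <= lat
            and (other_acc > acc or other_lat < lat)
            for other, (other_acc, other_lat) in points.items()
            if other != name
        )
        if not dominated:
            efficient.append(name)
    return efficient
```

For instance, with the placeholder values `{"A": (0.85, 10.0), "B": (0.80, 5.0), "C": (0.78, 8.0)}`, configuration C is dominated by B (B is both more accurate and faster), so only A and B are returned. Storage cost could be added as a third coordinate in the same way.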

Conclusion
In this paper, we study a Hybrid Hierarchical Retrieval (HHR) framework for ODQA that integrates sparse and dense retrievers through a simple aggregation strategy in the document and passage retrievers. We demonstrate that sparse retrievers complement dense retrievers in-domain and greatly improve the poor generalization of dense encoders to zero-shot data. Our framework addresses the limitation of DHR by achieving better zero-shot performance without relying on document structure information.
We also study the accuracy-storage-latency landscape. We believe these findings are critical to the real-world adoption of HHR in ODQA systems.

Limitations
Our HHR framework uses a simple combination strategy that takes the top-k/2 documents (or passages) from each of the dense and sparse retrievers for top-k retrieval in the combined setting. Because the dense and sparse results can overlap, fewer than k results may be passed to the next stage. Future work might improve the aggregation strategy by 1) developing a more advanced combination strategy that is based not solely on rank but also on retrieval scores, and 2) accounting for document retriever scores in the passage retrieval stage to rerank passages according to both global document relevancy and local passage relevancy. Furthermore, we only evaluate on Wikipedia-based ODQA datasets because Wikipedia provides the document structure information used by the dense document retriever. Future work can extend the evaluation to other ODQA corpora.

A Implementation details of sparse and dense retrievers in HHR
Sparse Retrieval. For the document-level retriever, we used PyLucene to build an inverted document index offline and perform BM25-based retrieval. Compared with the dense document retriever, it can run efficiently without document structure information. For the passage-level retriever, the TF and IDF of all passage tokens are computed and pre-stored in a dictionary offline. At inference time, the BM25 score is computed on the fly. Because the passage candidates are restricted to the retrieved documents, the passage retriever does not require construction of a huge passage index and can run faster than single-stage passage retrievers.
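The on-the-fly sparse passage scoring described above can be sketched as follows. This is an illustrative sketch rather than the released code: the function names, the whitespace-free pre-tokenized input, the Lucene-style IDF variant, and the parameter values k1 = 1.2 and b = 0.75 are all assumptions, since the exact BM25 settings are not stated here.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Stats = Tuple[Dict[str, Counter], Dict[str, float], Dict[str, int], float]


def build_passage_stats(passages: Dict[str, List[str]]) -> Stats:
    """Offline step: pre-compute TF per passage and IDF per term."""
    n = len(passages)
    tf = {pid: Counter(tokens) for pid, tokens in passages.items()}
    df = Counter(term for counts in tf.values() for term in counts)
    # Lucene-style BM25 IDF (an assumption of this sketch).
    idf = {t: math.log(1 + (n - d + 0.5) / (d + 0.5)) for t, d in df.items()}
    lengths = {pid: len(tokens) for pid, tokens in passages.items()}
    avg_len = sum(lengths.values()) / n
    return tf, idf, lengths, avg_len


def bm25_score(query: Sequence[str], pid: str, stats: Stats,
               k1: float = 1.2, b: float = 0.75) -> float:
    """Online step: BM25 score of one passage from the cached statistics."""
    tf, idf, lengths, avg_len = stats
    score = 0.0
    for term in query:
        freq = tf[pid].get(term, 0)
        if freq == 0:
            continue
        norm = freq + k1 * (1 - b + b * lengths[pid] / avg_len)
        score += idf[term] * freq * (k1 + 1) / norm
    return score


def rerank_sparse(query: Sequence[str], candidates: Sequence[str],
                  stats: Stats, k: int) -> List[str]:
    """Score only passages from the retrieved documents and keep the top k."""
    ranked = sorted(candidates,
                    key=lambda pid: bm25_score(query, pid, stats),
                    reverse=True)
    return ranked[:k]
```

Because only the passages of the top retrieved documents are scored, the per-query cost scales with the size of that candidate set rather than with the full passage collection, which is the design choice motivating the on-the-fly computation.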

Dense Retrieval. We follow the same dense retrieval setup as DHR with a small change. While DHR applies an iterative training strategy, performing a second training step with hard negatives obtained from the first-step retriever, we train both the document-level and passage-level retrievers without any iteration. The document embeddings are indexed using FAISS (Johnson et al., 2017) to perform document retrieval, whereas passage retrieval is performed on the fly with the cached passage embedding dictionary.
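The on-the-fly dense passage retrieval over a cached embedding dictionary can be sketched in a few lines. The sketch assumes inner-product similarity between the question and passage embeddings and uses plain Python lists in place of real encoder outputs; the function names and toy vectors are hypothetical.

```python
from typing import Dict, List, Sequence


def inner_product(u: Sequence[float], v: Sequence[float]) -> float:
    """Similarity between two embeddings (inner product is assumed here)."""
    return sum(a * b for a, b in zip(u, v))


def rerank_dense(query_vec: Sequence[float],
                 candidates: Sequence[str],
                 passage_embeddings: Dict[str, Sequence[float]],
                 k: int) -> List[str]:
    """Rank candidate passages by similarity using cached embeddings.

    Only passages from the retrieved documents are looked up, so no
    passage-level index is needed at this stage.
    """
    scored = [(inner_product(query_vec, passage_embeddings[pid]), pid)
              for pid in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [pid for _, pid in scored[:k]]
```

In a real system the question vector would come from the question encoder and the dictionary would hold pre-computed passage encoder outputs; the storage cost of that dictionary is what makes the dense passage retriever the most storage-intensive component.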

B Similarities and Differences between different HHR configurations
In this section, we analyze pairs of HHR configurations to understand their similarities and differences. Table 3 shows, for each of the NQ, WebQ, and TriviaQA datasets, the percentage of queries for which the correct evidence is retrieved by the row HHR configuration but not by the column HHR configuration. A higher percentage for a given pair indicates that the row configuration is more exclusive and is able to answer a different set of queries compared to the column configuration. We observe the following high-level takeaways: 1) The Sparse+Sparse row in all three tables indicates that sparse retrievers are more effective in zero-shot settings, exclusively retrieving correct evidence for more queries, as the percentages are higher for WebQ and TriviaQA than for NQ. 2) The Combined+Combined column shows that only a small percentage of queries can be answered by other HHR configurations but not by Combined+Combined. This is expected, as Combined+Combined leverages both sparse and dense retrievers in both stages. Among the other HHR configurations, Sparse+Dense seems to have the highest percentage over Combined+Combined. Ensembling the results of these two configurations might improve performance over Combined+Combined and is an interesting direction for future work.
Table 4 shows, for a given pair of HHR configurations, the percentage of queries, out of all queries for which the row configuration retrieves the correct evidence, for which the column configuration also retrieves the correct evidence. A higher percentage indicates that the column HHR configuration can also retrieve correct evidence for the queries handled by the row configuration. On the NQ dataset, we observe that configurations with dense retrievers in both stages are able to correctly retrieve evidence for most of the queries of the other configurations. However, this percentage decreases on the WebQ and TriviaQA datasets.
Table 4: The percentage of queries, out of all queries for which the row configuration retrieves correct evidence, for which the column configuration also retrieves correct evidence, for a given pair of HHR configurations. For all the numbers shown, we retrieve 100, 500, and 500 documents for NQ, WebQ, and TriviaQA, respectively, in the first stage and 100 passages in the second stage for all datasets.
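The two pairwise statistics described in this section can be expressed with simple set operations. In the sketch below, each configuration is represented by the set of query identifiers for which it retrieves correct evidence; the function names and the toy sets in the usage note are hypothetical.

```python
from typing import Set


def exclusive_percentage(row_hits: Set[int], col_hits: Set[int],
                         num_queries: int) -> float:
    """Table 3 style: % of all queries answered by the row but not the column."""
    return 100.0 * len(row_hits - col_hits) / num_queries


def overlap_percentage(row_hits: Set[int], col_hits: Set[int]) -> float:
    """Table 4 style: % of the row's correct queries the column also gets."""
    if not row_hits:
        return 0.0
    return 100.0 * len(row_hits & col_hits) / len(row_hits)
```

For example, with 10 queries in total, a row configuration that is correct on queries {1, 2, 3, 4}, and a column configuration that is correct on {3, 4, 5}, the exclusive percentage is 20.0 and the overlap percentage is 50.0. Note that the first statistic is normalized by the dataset size and the second by the number of queries the row configuration gets right.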