Reader-Guided Passage Reranking for Open-Domain Question Answering

Current open-domain question answering (QA) systems often follow a Retriever-Reader (R2) architecture, where the retriever first retrieves relevant passages and the reader then reads the retrieved passages to form an answer. In this paper, we propose a simple and effective passage reranking method, Reader-guIDEd Reranker (RIDER), which does not involve any training and reranks the retrieved passages solely based on the top predictions of the reader before reranking. We show that RIDER, despite its simplicity, achieves 10 to 20 absolute gains in top-1 retrieval accuracy and 1 to 4 Exact Match (EM) score gains without refining the retriever or reader. In particular, RIDER achieves 48.3 EM on the Natural Questions dataset and 66.4 on the TriviaQA dataset when only 1,024 tokens (7.8 passages on average) are used as the reader input.


Introduction
Current open-domain question answering (QA) systems often follow a Retriever-Reader (R2) architecture, where the retriever first retrieves relevant passages and the reader then reads the retrieved passages to form an answer. Since the retriever retrieves passages from a large candidate pool (e.g., millions of Wikipedia passages), it often fails to rank the most relevant passages at the very top. One line of work (Mao et al., 2020; Karpukhin et al., 2020) aims to improve the quality of passage retrieval and shows that significantly better retrieval accuracy as well as QA performance can be achieved when the retriever is improved.
An alternative solution is to rerank the initial retrieval results via a reranker, which is widely used in information retrieval (Nogueira and Cho, 2019; Qiao et al., 2019) and was explored in early open-domain QA systems (Wang et al., 2018a; Lee et al., 2018). However, current state-of-the-art open-domain QA systems (Karpukhin et al., 2020; Izacard and Grave, 2020b; Lewis et al., 2020) do not distinguish the order of the retrieved passages and instead consider a large number of retrieved passages (e.g., 100) equally, which can be computationally prohibitive as the model size of the readers becomes larger (Izacard and Grave, 2020b).
We argue that a Retriever-Reranker-Reader (R3) architecture is beneficial in terms of both model effectiveness and efficiency: passage reranking improves the retrieval accuracy of the retriever at top positions and allows the reader to achieve comparable performance with fewer passages as input. However, one bottleneck of R3 is that the reranker, previously based on BiLSTM (Wang et al., 2018a; Lee et al., 2018) and nowadays typically a BERT-based cross-encoder (Nogueira and Cho, 2019; Qiao et al., 2019), is costly to train and slows down the whole pipeline as well.
Can we achieve better performance without the cost of training an expensive reranker or refining the retriever (reader)? In this paper, we propose a simple and effective passage reranking method, named Reader-guIDEd Reranker (RIDER), which does not require any training and reranks the retrieved passages solely based on their lexical overlap with the top predicted answers of the reader before reranking. Intuitively, the top predictions of the reader are closely related to the ground-truth answer, and even if the predicted answers are partially correct (or incorrect), they may still provide useful signals suggesting which passages may contain the correct answer (Mao et al., 2020).
We conduct experiments on the Natural Questions (NQ) (Kwiatkowski et al., 2019) and TriviaQA (Trivia) (Joshi et al., 2017) datasets. We demonstrate that the passages reranked by RIDER achieve significantly better retrieval accuracy and consequently lead to better QA performance without refining the retriever or reader. Notably, R3 with RIDER as the reranker achieves comparable or better performance than state-of-the-art methods on the two benchmark datasets when only 1,024 tokens are used as the reader input.

Task Formulation
We assume that an open-domain QA system with an R2 architecture is provided. We denote the initially retrieved passage list of the retriever as R, and the top-N predictions of the reader on the top-k passages of R (denoted as R[:k]) as A[:N]. The goal of RIDER is to rerank R to R′ using A[:N] such that the retrieval accuracy is improved and better end-to-end QA results are achieved when R′[:k] is used as the reader input instead of R[:k].

Passage Reranking
Given an initially retrieved passage list R and the top-N predictions of the reader A[:N], RIDER forms a reranked passage list R′ as follows. RIDER scans R from the beginning and appends to R′ every passage p ∈ R that contains any reader prediction a ∈ A[:N]. Then, the remaining passages are appended to R′ in their original order. Despite its simplicity, we observe that RIDER leads to consistent gains in terms of both retrieval accuracy and QA performance without refining the retriever (reader) or requiring any training itself.
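The reranking step above can be sketched in a few lines of Python. This is a minimal illustration of the described procedure, not the authors' implementation; the function and variable names are ours, and we use simple case-insensitive substring matching as the lexical-overlap test.

```python
from typing import List


def rider_rerank(passages: List[str], predictions: List[str]) -> List[str]:
    """Rerank passages by lexical overlap with the reader's top predictions.

    Passages containing any predicted answer string are promoted to the
    front (keeping their original retrieval order); the rest follow,
    also in their original order.
    """
    preds = [a.lower() for a in predictions]
    hits, rest = [], []
    for passage in passages:
        text = passage.lower()
        if any(a in text for a in preds):
            hits.append(passage)
        else:
            rest.append(passage)
    return hits + rest
```

Because the relative order within each group is preserved, a reranked list degrades gracefully: if no prediction matches, the original retrieval order is returned unchanged.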

Passage Reading
We consider a scenario where the number of passages that can be used as the reader input is limited, which is common in practice due to efficiency considerations (e.g., real-time response) or model capacity (e.g., input length limits). We use a generative reader initialized by BART-large (Lewis et al., 2019), which concatenates the question and the top-10 retrieved passages (without reranking during training), trims them to 1,024 tokens (7.8 passages are left on average) as the input, and learns to generate the answer in a Seq2Seq manner (Mao et al., 2020; Min et al., 2020). We further add a shuffle strategy during the training of the reader by randomly shuffling the top retrieved passages before concatenation. In this way, the reader becomes more robust to the reranked passages during inference and achieves better performance.
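The input construction just described (concatenate, optionally shuffle during training, trim to a token budget) can be sketched as follows. This is an illustrative approximation: a whitespace tokenizer stands in for the reader's real subword tokenizer, and the function name and signature are ours.

```python
import random
from typing import List, Optional


def build_reader_input(question: str, passages: List[str],
                       max_tokens: int = 1024, shuffle: bool = False,
                       seed: Optional[int] = None) -> str:
    """Concatenate question + passages, optionally shuffled, then trim."""
    if shuffle:
        passages = passages[:]                 # avoid mutating the caller's list
        random.Random(seed).shuffle(passages)  # training-time shuffle strategy
    tokens = question.split()
    for passage in passages:
        tokens.extend(passage.split())
    return " ".join(tokens[:max_tokens])       # enforce the input budget
```

At inference time `shuffle` would be off and the (reranked) passage order is kept, so passages promoted by the reranker are less likely to be trimmed away.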

Experimental Setup
Datasets. We conduct experiments on the open-domain version of two widely used QA benchmarks, Natural Questions (NQ) (Kwiatkowski et al., 2019) and TriviaQA (Trivia) (Joshi et al., 2017), whose statistics are listed in Table 1. Evaluation Metrics. Following prior studies (Mao et al., 2020; Karpukhin et al., 2020), we use top-k retrieval accuracy to evaluate the performance of the retriever and Exact Match (EM) to measure the performance of the reader. Top-k retrieval accuracy is the proportion of questions for which the top-k retrieved passages contain (at least) one answer span. It can be used as an upper bound of how many questions are answerable by an extractive reader. Exact Match (EM) is the proportion of predicted answer spans that are exactly the same as one of the ground-truth answers, after string normalization such as article and punctuation removal. Source of R. We take the top retrieved passages of GAR (Mao et al., 2020) on Trivia and its combination with DPR (Karpukhin et al., 2020) on NQ as the initial retrieval results R for reranking. Source of A[:N]. To obtain the top-N predicted answers for RIDER, we first use the generative reader G in Sec. 2.3 that is trained on the passages without reranking and used for final passage reading in R3, which represents an apples-to-apples comparison to the R2 architecture with no additional information but higher-quality input. We also experiment with an extractive reader E that has access to more passages, where the goal is to study whether we can rerank passages via other signals and further improve G such that it outperforms both G and E when they are in R2. We use the extractive reader in Mao et al. (2020) with BERT-base (Devlin et al., 2019) representations and span voting. For the generative reader, we either take its top-1 prediction with greedy decoding or sample 10 answers with appropriate decoding parameters (e.g., temperature). 1For the extractive reader, the top predictions are the text spans with the highest scores and we set N = 1, 5, 10.
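The EM metric with the string normalization mentioned above can be sketched as follows. This follows the common SQuAD-style normalization recipe (lowercasing, punctuation and article removal, whitespace collapsing); the exact normalization used in the paper may differ in details.

```python
import re
import string
from typing import List


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, drop articles, collapse whitespace."""
    text = text.lower()
    text = "".join(c for c in text if c not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: List[str]) -> bool:
    """True if the normalized prediction equals any normalized gold answer."""
    return any(normalize(prediction) == normalize(g) for g in gold_answers)
```

For example, "The Eiffel Tower!" and "eiffel tower" count as an exact match after normalization.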

Quality of Reranking Signals
We first analyze the EM of the top-N reader predictions A[:N] used for reranking, i.e., we consider a question correctly answered as long as one of the top-N predictions matches the answer. The standard EM is a special case with N = 1. As listed in Table 3, the reader EM can be improved by up to 24 points on NQ and 15.8 points on Trivia if we consider the top-10 predictions instead of only the first, suggesting that there is significant potential in leveraging multiple answer candidates for reranking.
1 There are duplicate samples and on average N = 6.
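The top-N EM used in this analysis can be sketched as below. The function name is ours, and a simplified normalization (lowercasing and whitespace collapsing) stands in for the full EM normalization described in the experimental setup.

```python
from typing import List


def top_n_em(all_preds: List[List[str]], all_golds: List[List[str]],
             n: int) -> float:
    """Fraction of questions where ANY of the top-n predictions matches
    a gold answer (simplified normalization for illustration)."""
    def norm(s: str) -> str:
        return " ".join(s.lower().split())

    hits = sum(
        any(norm(p) == norm(g) for p in preds[:n] for g in golds)
        for preds, golds in zip(all_preds, all_golds)
    )
    return hits / len(all_preds)
```

Standard EM is recovered with `n=1`; the gap between `n=1` and `n=10` is exactly the headroom the reranking signal exploits.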

Passage Retrieval
We list the top-k retrieval accuracy before and after passage reranking in Table 2. RIDER significantly improves the retrieval accuracy at top positions (especially top-1) without refining the retriever. In particular, we observe that when taking more reader predictions (i.e., a larger N), the top-k retrieval accuracy tends to improve more at a larger k and less at a smaller k. For example, an improvement of about 3 points is achieved for top-5 and top-10 accuracy when increasing N from 1 to 5 on NQ for reader E, but the top-1 retrieval accuracy also drops significantly (although it remains better than without reranking), which shows that there is a trade-off between answer coverage and noise. Note that the top-100 retrieval accuracy is unchanged after reranking since we rerank the top-100 passages.
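Top-k retrieval accuracy, as defined in the experimental setup, can be computed with a sketch like the following (names are ours; a case-insensitive substring check stands in for answer-span matching).

```python
from typing import List


def top_k_accuracy(ranked_passages: List[List[str]],
                   gold_answers: List[List[str]], k: int) -> float:
    """Fraction of questions whose top-k (re)ranked passages contain
    at least one gold answer string."""
    hits = 0
    for passages, answers in zip(ranked_passages, gold_answers):
        window = " ".join(passages[:k]).lower()
        if any(a.lower() in window for a in answers):
            hits += 1
    return hits / len(ranked_passages)
```

Running this on the same ranked list with and without reranking reproduces the before/after comparison in Table 2; since only the top-100 passages are permuted, the k = 100 value is invariant by construction.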

Passage Reading
Comparison w. the state-of-the-art. We show the QA performance comparison between RIDER and state-of-the-art methods in Table 4. We observe that RIDER improves GAR (or GAR+DPR) on both datasets by a large margin, even though they use the same generative reader and no further model training is conducted. These results indicate that RIDER provides higher-quality input for the reader, and better performance can be achieved with the same input length. Moreover, the results of RIDER are better than most of the methods that take many more passages as input, except for FID-large (Izacard and Grave, 2020b), which takes 100 passages as input and has more model parameters.
Ablation Study. A detailed analysis of RIDER with different reranking signals is shown in Table 5. We can see that by reranking based on the predictions of the generative reader G (with input R[:k]), RIDER achieves around 1 to 2 gains in EM, which shows that RIDER can improve end-to-end QA performance without any additional information. By iterative reranking (R′′) using the reader predictions obtained after the first reranking, the performance of RIDER is further improved. RIDER achieves even better performance when using the predictions of the extractive reader E (with input R) for reranking, which is consistent with the better retrieval accuracy after reranking with E. It is also interesting that RIDER outperforms the extractive reader that has access to more passages.

Related Work
Reranking for Open-domain QA. Reranking has been widely used in information retrieval to refine initial retrieval results. Early efforts on passage reranking for open-domain QA use supervised (Lee et al., 2018) or reinforcement learning (Wang et al., 2018a) based on BiLSTM. More recently, BERT-based rerankers that treat the query and passage as a sentence pair (i.e., cross-encoders) achieve superior performance (Nogueira and Cho, 2019; Qiao et al., 2019). However, the training of cross-encoders is rather costly. Moreover, the representations of cross-encoders cannot be precomputed and matched via Maximum Inner Product Search (MIPS) as in bi-encoders (Karpukhin et al., 2020), but must be measured online between the query and each passage, which results in slower inference as well. Another line of work (Das et al., 2018; Qi et al., 2020) reranks the passages by updating the query and often involves a complicated learning process such as R2 interactions. Alternatively, some prior studies (Wang et al., 2018b; Iyer et al., 2020) directly rerank the top predicted answers instead of the passages, using either simple heuristics or additional training. In contrast, RIDER utilizes downstream signals (i.e., the predictions of a reader) to rerank the passages based on the surface forms of the texts, without any training.
Recent studies (Izacard and Grave, 2020a; Yang and Seo, 2020) show that distillation from the preference of the reader can improve retriever performance, where the reader preference is measured by the attention scores of the reader over different passages and the retriever is refined by learning to approximate these scores. RIDER can, to some extent, also be seen as a way to distill the reader, but it is much simpler in that no further training is involved for either the retriever or the reader, and explicit reader predictions, rather than latent attention scores, are leveraged to improve the retrieval results directly.

Conclusion
In this work, we propose RIDER, a simple and effective passage reranking method for open-domain QA that does not involve any training or computationally expensive inference. RIDER can be easily integrated into existing R2 systems for further improvements. Without fine-tuning the retriever or reader, RIDER significantly improves the retrieval accuracy and the QA results on two benchmark datasets. In particular, RIDER achieves comparable or better performance than state-of-the-art methods with less reader input and allows for more efficient open-domain QA systems.

Table 1 :
Dataset statistics that show the number of samples, the average question (answer) length, and the average number of answers for each question.

Table 2 :
Top-k retrieval accuracy on the test sets before and after reranking. G and E denote generative and extractive readers, respectively, whose top predictions are used for reranking.

Table 3 :
EM of top-N predictions of the reader. Results are mostly on reader E. Only top-1 and top-10 EM are shown (in brackets) for reader G, as its 10 predictions are sampled without a particular order.

Table 4 :
End-to-end QA comparison of state-of-the-art methods in EM.

Table 5 :
Performance comparison of RIDER in EM when different reranking signals are used. The numbers in brackets represent the performance of the reader used for reranking and the relative gains.