Joint Passage Ranking for Diverse Multi-Answer Retrieval

We study multi-answer retrieval, an under-explored problem that requires retrieving passages to cover multiple distinct answers for a given question. This task requires joint modeling of retrieved passages, as models should not repeatedly retrieve passages containing the same answer at the cost of missing a different valid answer. Prior work focusing on single-answer retrieval is limited as it cannot reason about the set of passages jointly. In this paper, we introduce JPR, a joint passage retrieval model focusing on reranking. To model the joint probability of the retrieved passages, JPR makes use of an autoregressive reranker that selects a sequence of passages, equipped with novel training and decoding algorithms. Compared to prior approaches, JPR achieves significantly better answer coverage on three multi-answer datasets. When combined with downstream question answering, the improved retrieval enables larger answer generation models since they need to consider fewer passages, establishing a new state-of-the-art.


Introduction
Passage retrieval is the problem of retrieving a set of passages relevant to a natural language question from a large text corpus. Most prior work focuses on single-answer retrieval, which scores passages independently from each other according to their relevance to the given question, assuming there is a single answer (Voorhees et al., 1999; Chen et al., 2017; Lee et al., 2019). However, questions posed by humans are often open-ended and ambiguous, leading to multiple valid answers. For example, for the question in Figure 1, "What was Eli Whitney's job?", an ideal retrieval system should provide passages covering all professions of Eli Whitney. This introduces the problem of multi-answer retrieval, i.e., retrieval of multiple passages with maximal coverage of all distinct answers, which is a challenging yet understudied problem.

* Work done while interning at Google.

Figure 1: The problem of multi-answer retrieval. [Figure content: the question "What was Eli Whitney's job?" is given to a retrieval system; an example retrieved passage reads "Eli Whitney was an American inventor, widely known for inventing the cotton gin … Whitney worked as a farm laborer and school teacher to save money."] A retrieval system must retrieve a set of $k$ passages ($k = 5$ in the figure) which has maximal coverage of diverse answers to the input question: inventor, farm laborer and school teacher in this example. This requires modeling the joint probability of the passages in the output set: $P(p_1 \ldots p_k \mid q)$. Our proposed model JPR achieves this by employing an autoregressive model.
Multi-answer retrieval poses two challenges that are not well represented in single-answer retrieval. First, the task requires scoring passages jointly to optimize for retrieving multiple relevant-yet-complementary passages. Second, the model needs to balance between two different goals: retrieving passages dissimilar to each other to increase the recall, and keeping passages relevant to the question.
In this work, we introduce Joint Passage Retrieval (JPR), a new model that addresses these challenges. To jointly score passages, JPR employs an encoder-decoder reranker and autoregressively generates passage references by modeling the probability of each passage as a function of previously retrieved passages. Since there is no ground truth ordering of passages, we employ a new training method that dynamically forms supervision to drive the model to prefer passages with answers not already covered by previously selected passages. Furthermore, we introduce a new tree-decoding algorithm to allow flexibility in the degree of diversity.
In a set of experiments on three multi-answer datasets, namely WEBQSP (Yih et al., 2016), AMBIGQA (Min et al., 2019) and TREC (Baudiš and Šedivỳ, 2015), JPR achieves significantly improved recall over both a dense retrieval baseline (Guu et al., 2020; Karpukhin et al., 2020) and a state-of-the-art reranker that independently scores each passage (Nogueira et al., 2020). Improvements are particularly significant on questions with more than one answer, outperforming dense retrieval by up to 12% absolute and an independent reranker by up to 6% absolute.
We also evaluate the impact of JPR in downstream question answering, where an answer generation model takes the retrieved passages as input and generates short answers. Improved reranking leads to improved answer accuracy because we can supply fewer, higher-quality passages to a larger answer generation model that fits on the same hardware. This practice leads to a new state-of-the-art on three multi-answer QA datasets and NQ (Kwiatkowski et al., 2019). To summarize, our contributions are as follows:
1. We study multi-answer retrieval, an underexplored problem that requires the top $k$ passages to maximally cover the set of distinct answers to a natural language question.
2. We propose JPR, a joint passage retrieval model that integrates dependencies among selected passages, along with new training and decoding algorithms.
3. On three multi-answer QA datasets, JPR significantly outperforms a range of baselines with independent scoring of passages, both in retrieval recall and answer accuracy.

Review: Single-Answer Retrieval
In a typical single-answer retrieval problem, a model is given a natural language question $q$ and retrieves $k$ passages $\{p_1 \ldots p_k\}$ from a large text corpus $C$ (Voorhees et al., 1999; Ramos et al., 2003; Robertson and Zaragoza, 2009; Chen et al., 2017; Lee et al., 2019; Karpukhin et al., 2020; Luan et al., 2020). The goal is to retrieve at least one passage that contains the answer to $q$. During training, question-answer pairs $(q, a)$ are given to the model.
Evaluation Intrinsic evaluation directly evaluates the retrieved passages. The most commonly used metric is RECALL @ k, which considers retrieval successful if the answer $a$ is included in $\{p_1 \ldots p_k\}$. Extrinsic evaluation uses the retrieved passages as input to an answer generation model such as the model in Izacard and Grave (2021) and evaluates final question answering performance.

Table 1: A comparison of single-answer and multi-answer retrieval tasks. Previous work has used independent ranking models $P(p_i \mid q)$ for multi-answer retrieval because the inference-time inputs and outputs are the same. We propose JPR as an instance of $P(p_1 \ldots p_k \mid q)$.
Reranking Much prior work (Liu, 2011; Asadi and Lin, 2013; Nogueira et al., 2020) found an effective strategy in using a two-step approach of (1) retrieving a set of candidate passages $B$ from the corpus $C$ ($k < |B| \ll |C|$) and (2) using another model to rerank the passages, obtaining a final top $k$. A reranker can be more expressive than the first-stage model (e.g., by using cross-attention), as it needs to process far fewer candidates. Most prior work in reranking, including the current state-of-the-art (Nogueira et al., 2020), scores each passage independently, modeling $P(p \mid q)$.

Multi-Answer Retrieval
We now formally define the task of multi-answer retrieval. A model is given a natural language question $q$ and needs to find $k$ passages $\{p_1 \ldots p_k\}$ from $C$ that contain all distinct answers to $q$. Unlike in single-answer retrieval, question-and-answer-set pairs $(q, \{a_1 \ldots a_n\})$ are given during training.
Evaluation Similar to single-answer retrieval, the intrinsic evaluation directly evaluates a set of $k$ passages. As the problem is underexplored, metrics for it are less studied. We propose to use MRECALL @ k, a new metric which considers retrieval to be successful if all answers, or at least $k$ answers, in the answer set $\{a_1 \ldots a_n\}$ are recovered by $\{p_1 \ldots p_k\}$. Intuitively, MRECALL is an extension of RECALL that considers the completeness of the retrieval; the model must retrieve all $n$ answers when $n \le k$, or at least $k$ answers when $n > k$. The extrinsic evaluation inputs the retrieved passages into an answer generation module that is designed for multiple answers, and measures multi-answer accuracy using an appropriate metric.

Figure 2: An overview of JPR. [Figure content: a T5 decoder that, given a prefix, is trained towards the positive passages not in the prefix.] We focus on reranking and propose: autoregressive generation of passage references (Section 3.1), training with dynamic oracle (Section 3.2), and inference with TREEDECODE (Section 3.3).
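As a hedged illustration of the MRECALL @ k definition above, the following Python sketch checks whether a ranked list of passages covers all answers, or at least $k$ of them when there are more than $k$ answers. The function name `mrecall_at_k` and the case-insensitive substring matching are assumptions made for illustration; an actual evaluation would likely normalize text more carefully.

```python
from typing import Iterable, Sequence


def mrecall_at_k(passages: Sequence[str], answers: Iterable[str], k: int) -> bool:
    """Return True if the top-k passages recover all answers, or at least k of them."""
    top_k = [p.lower() for p in passages[:k]]
    distinct = {a.lower() for a in answers}
    # Assumption: an answer counts as recovered if it occurs as a substring
    # of any of the top-k passages.
    covered = {a for a in distinct if any(a in p for p in top_k)}
    # All n answers are required when n <= k, otherwise k answers suffice.
    required = min(len(distinct), k)
    return len(covered) >= required
```

Averaging this boolean over all questions in a dataset would give the reported metric under these assumptions.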
Comparing to single-answer retrieval We compare single-answer retrieval and multi-answer retrieval in Table 1. Prior work makes no distinction between these two problems, since they share the same interface during inference. However, while independently scoring each passage ($P(p_i \mid q)$) may be sufficient for single-answer retrieval, multi-answer retrieval inherently requires joint passage scoring $P(p_1 \ldots p_k \mid q)$. For example, models should not repeatedly retrieve the same answer at the cost of missing other valid answers, which can only be done by a joint model.
Choice of k for downstream QA Previous state-of-the-art models typically input a large number ($k \ge 100$) of passages to the answer generation model. For instance, Izacard and Grave (2021) claim the importance of using a larger value of $k$ to improve QA accuracy. In this paper, we argue that with reranking, using a smaller value of $k$ (5 or 10) and instead employing a larger answer generation model is advantageous given a fixed hardware budget. We show in Section 5 that, as retrieval performance improves, memory is better spent on larger answer generators rather than on more passages, ultimately leading to higher QA accuracy.

JPR targets multi-answer retrieval. It uses an approach consisting of first-stage retrieval followed by reranking: the first-stage retrieval obtains candidate passages $B$ from the corpus $C$, and a reranker processes $B$ to output $\{p_1 \ldots p_k\} \subset B$. We refer to Section 4.2 for the first-stage retrieval, and focus on the reranking component of the model, which allows (1) efficiently modeling the joint probability $P(p_1 \ldots p_k \mid q)$, and (2) processing candidate passages with a more expressive model.
The overview of JPR is illustrated in Figure 2. The reranker of JPR leverages the encoder-decoder architecture for an autoregressive generation of passage references (Section 3.1). Unlike typical use cases of the encoder-decoder, (1) the ordering of passages to retrieve is not given as supervision, and (2) it is important to balance between exploring passages about new answers and finding passages that may cover previously selected answers. To this end, we introduce a new training method (Section 3.2) and a tree-based decoding algorithm (Section 3.3).

Autoregressive generation of passage references
JPR makes use of the encoder-decoder architecture, where the encoder processes candidate passages and the decoder autoregressively generates a sequence of k passage references (indexes). Intuitively, dependencies between passages can be modeled by the autoregressive architecture.
We extend the architecture from Izacard and Grave (2021); we reuse the encoder but modify the decoder. Each candidate passage $p_i$ is concatenated with the question $q$ and the number $i$ (namely, its index). It is fed into the encoder and transformed into $\mathbf{p}_i \in \mathbb{R}^{L \times h}$, where $L$ is the length of the input text and $h$ is the hidden size. Next, $\mathbf{p}_1 \ldots \mathbf{p}_{|B|}$ are concatenated to form $\bar{\mathbf{p}} \in \mathbb{R}^{L|B| \times h}$, which is then fed into the decoder. The decoder is trained to autoregressively output a sequence of indexes $i_1 \ldots i_k$, representing a sequence of passages $p_1 \ldots p_k$. As the generation at step $t$ ($1 \le t \le k$) depends on the generation at steps $1 \ldots t-1$, it can naturally capture dependencies between selected passages. As each index occupies one token, the length of the decoded sequence is $k$.

Figure 3: An illustration of TREEDECODE, where passages that are chosen and passages that are being considered are indicated in orange and blue, respectively. [Figure content: Step 1: O={a}, S={empty, a}; Step 2: O={a, b}, S={empty, a, a::b}; Step 3: O={a, b, e}, S={empty, a, a::b, a::e}; Step 4: O={a, b, e, d}, S={empty, a, a::b, a::e, a::b::d}.] See Section 3.3 and Algorithm 1 for details.

Training with Dynamic Oracle
A standard way of training the encoder-decoder is teacher forcing, which assumes a single correct sequence. However, in our task, a set of answers can be retrieved through many possible sequences of passages, and it is unknown which sequence is the best. To this end, we dynamically form the supervision data, which pushes the model to assign high probability to a dynamic oracle: any positive passage covering a correct answer that is not in the prefix, i.e., the previously selected passages. We first precompute a set of positive passages $\tilde{O}$ and a prefix $\tilde{p}_1 \ldots \tilde{p}_k$. The set of positive passages $\tilde{O}$ includes up to $k$ passages with maximal coverage of the distinct answers.³ The prefix $\tilde{p}_1 \ldots \tilde{p}_k$ is a simulated prediction of the model, consisting of $\tilde{O}$ and $k - |\tilde{O}|$ sampled negatives. At each step $t$ ($1 \le t \le k$), given the set of positive passages $\tilde{O}$ and a prefix $\tilde{p}_1 \ldots \tilde{p}_t$ (denoted as $P_t$), JPR is trained to assign high probabilities to the dynamic oracle $\tilde{O} - P_t$. The objective is defined as follows:
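A minimal sketch of how an objective of this kind could be computed, assuming it sums, over decoding steps, the negative log-likelihood of every passage in the dynamic oracle. The function names, the `log_prob` callback, and the convention that each step conditions on the prefix passages chosen before it are all assumptions made for illustration, not the paper's implementation.

```python
import math
from typing import Callable, Sequence, Set


def dynamic_oracle_loss(
    prefix: Sequence[int],
    positives: Set[int],
    log_prob: Callable[[int, Sequence[int]], float],
) -> float:
    """Sum of -log P(o | history) over steps and over oracle passages o."""
    loss = 0.0
    for t in range(len(prefix)):
        history = prefix[:t]                 # passages selected before this step
        oracle = positives - set(history)    # positives not yet in the prefix
        for passage in oracle:
            loss -= log_prob(passage, history)
    return loss


def uniform_log_prob(passage: int, history: Sequence[int], num_candidates: int = 4) -> float:
    """Hypothetical stand-in for the reranker: uniform over unselected candidates."""
    return -math.log(num_candidates - len(history))
```

Note that negatives in the prefix still shape the loss: they change the history the model conditions on without removing any passage from the oracle.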

Inference with TREEDECODE
A typical autoregressive decoder makes the top-1 prediction at each step to decode a sequence of length $k$ (SEQDECODE in Algorithm 1),⁴ which, based on our training scheme, asks the decoder to find a new answer at every step. However, when $k$ is larger than the number of correct answers, it would be counter-productive to ask for $k$ passages that each cover a distinct answer. Instead, we want the flexibility of decoding fewer timesteps and taking multiple predictions from each timestep. In this context, we introduce a new decoding algorithm, TREEDECODE, which decodes a tree from an autoregressive model. TREEDECODE iteratively chooses between the depth-wise and the width-wise expansion of the tree (going forward to the next step and taking the next best passage within the same step, respectively) until it reaches $k$ passages (Figure 3). Intuitively, if the model believes that there are many distinct answers covered by different passages, it will choose to take the next step, being closer to SEQDECODE. On the other hand, if the model believes that there are very few distinct answers, it will choose to take more predictions from the same step, resulting in behavior closer to independent scoring.

³ $|\tilde{O}| < k$ if fewer than $k$ passages are sufficient to cover all distinct answers; $|\tilde{O}| = k$ otherwise.
⁴ We explored beam search decoding but it gives results that are the same as or marginally different from SEQDECODE.

Algorithm 1 Decoding algorithms for JPR.
    …
    return Set(O)
procedure TREEDECODE(k, P(p | p_1 ... p_n), l)
    O ← ∅ // a set of selected passages
    S ← [Empty] // a tree
    while |O| < k do
        P(p | s) ← P(p | s) I[s :: p ∉ S]
        (ŝ, p̂) ← argmax_{p ∈ B, s ∈ S} l(|s| + 1) log P(p | s)
        O ← O ∪ {p̂}, S ← S.append(ŝ :: p̂)
    return O
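The formal algorithm is as follows. We represent a tree $S$ as a list of ordered lists $[s_1 \ldots s_T]$, where $s_1$ is an empty list and each $s_i$ is one element appended to any of $s_1 \ldots s_{i-1}$. The corresponding set $O$ is $\bigcup_{s \in S} \mathrm{Set}(s)$. We define the score of a tree $S$ as

$$f(S) = \sum_{s \in S,\ s \neq s_1} l(|s|) \log P(s[-1] \mid s[:-1]),$$

where $s[-1]$ is the last element of $s$ and $s[:-1]$ is the list of elements preceding it, so that appending $\hat{s} :: \hat{p}$ to $S$ increases $f(S)$ by $l(|\hat{s}| + 1) \log P(\hat{p} \mid \hat{s})$. We form $S$ and $O$ through an iterative process by (1) starting from $O = \emptyset$ and $S = [\mathrm{Empty}]$, and (2) updating $O$ and $S$ by finding the best addition of an element that maximizes the gain in $f(S)$, until $|O| = k$, as described in Algorithm 1.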
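The TREEDECODE procedure can be sketched in Python as follows. This is an illustrative reading of Algorithm 1 rather than the authors' code: `next_probs` is a hypothetical callback standing in for the reranker's distribution over passages given a path, and, as a simplifying assumption that guarantees termination, passages already selected are never proposed again (which also implies the mask on paths already in the tree).

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Path = Tuple[int, ...]


def length_penalty(y: int, beta: float) -> float:
    """Length penalty l(y) = ((5 + y) / (5 + 1)) ** beta."""
    return ((5 + y) / (5 + 1)) ** beta


def tree_decode(
    k: int,
    candidates: Sequence[int],
    next_probs: Callable[[Path], Dict[int, float]],
    beta: float = 1.0,
) -> List[int]:
    selected: List[int] = []   # O, kept in selection order
    tree: List[Path] = [()]    # S, starting from the empty path
    while len(selected) < k:
        best_score, best_child = -math.inf, None
        for path in tree:
            probs = next_probs(path)
            for p in candidates:
                prob = probs.get(p, 0.0)
                # Assumption: skip passages that are already selected
                # or that have zero probability under this path.
                if p in selected or prob <= 0.0:
                    continue
                score = length_penalty(len(path) + 1, beta) * math.log(prob)
                if score > best_score:
                    best_score, best_child = score, path + (p,)
        if best_child is None:  # no candidate left to add
            break
        tree.append(best_child)
        selected.append(best_child[-1])
    return selected
```

Expanding from a deep path corresponds to the depth-wise move, while expanding again from a shallow path (for example the empty path) corresponds to the width-wise move; larger values of `beta` penalize deep expansions more strongly because log-probabilities are negative.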

Experimental Setup
We compare JPR with multiple baselines in a range of multi-answer QA datasets. We first present an intrinsic evaluation of passage retrieval by reporting MRECALL based on answer coverage in the retrieved passages (Section 5.1). We then present an extrinsic evaluation through experiments in downstream question answering (Section 5.2).

Datasets
We train and evaluate on three datasets that provide a set of distinct answers for each question. Statistics of each dataset are provided in Table 2. WEBQSP (Yih et al., 2016) consists of questions from Google Suggest API, originally from Berant et al. (2013). The answer is a set of distinct entities in Freebase; we recast this problem as textual question answering based on Wikipedia. AMBIGQA  consists of questions mined from Google search queries, originally from NQ (Kwiatkowski et al., 2019). Each question is paired with an annotated set of distinct answers that are equally valid based on Wikipedia. TREC (Baudiš and Šedivỳ, 2015) contains questions curated from TREC QA tracks, along with regular expressions as answers. Prior work uses this data as a task of finding a single answer (where retrieving any of the correct answers is sufficient), but we recast the problem as a task of finding all answers, and approximate a set of distinct answers. Details are described in Appendix B.1.

First-stage Retrieval
JPR can obtain candidate passages $B$ from any first-stage retrieval model. In this paper, we use DPR+, our own improved version of DPR (Karpukhin et al., 2020) combined with REALM (Guu et al., 2020). DPR and REALM are dual encoders with a supervised objective and an unsupervised, language modeling objective, respectively. We initialize the dual encoder with REALM and train on supervised datasets using the objective from DPR. More details are provided in Appendix A.

Baselines
We compare JPR with three baselines, all of which are published models or enhanced versions of them. All baselines independently score each passage.
DPR+ only uses DPR+ without a reranker.
DPR+ + Nogueira et al. (2020) uses DPR+ followed by Nogueira et al. (2020), the state-of-the-art document ranker. It processes each passage $p_i$ in $B$ independently and is trained to output yes if $p_i$ contains any valid answer to $q$, otherwise no. At inference, the probability for each $p_i$ is computed by taking a softmax over the logits of yes and no. The top $k$ passages are chosen based on the probabilities assigned to yes.
INDEPPR is our own baseline that is a strict non-autoregressive version of JPR in which prediction of a passage is independent from other passages in the retrieved set. It obtains candidate passages $B$ through DPR+, and the encoder of the reranker processes $q$ and $B$, as JPR does. Different from JPR, the decoder is trained to output a single token $i$ ($1 \le i \le |B|$) rather than a sequence. The objective is the sum of $-\log P(p \mid q, B)$ over the passages including any valid answer to $q$. At inference, INDEPPR outputs the top $k$ passages based on the logit values of the passage indices. We compare mainly to INDEPPR because it is the strict non-autoregressive version of JPR, and is empirically better than or comparable to Nogueira et al. (2020) (Section 5.1).
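To make the independent yes/no scoring of the Nogueira et al. (2020) baseline concrete, the sketch below turns a pair of logits into a probability of yes and ranks passages by it. The function names and the plain-float inputs are assumptions for illustration; in the actual baseline the logits would come from the T5 decoder.

```python
import math
from typing import List, Sequence, Tuple


def yes_probability(yes_logit: float, no_logit: float) -> float:
    """Softmax over the two logits, returning the mass assigned to 'yes'."""
    m = max(yes_logit, no_logit)  # subtract the max for numerical stability
    e_yes = math.exp(yes_logit - m)
    e_no = math.exp(no_logit - m)
    return e_yes / (e_yes + e_no)


def top_k_by_yes(logits: Sequence[Tuple[float, float]], k: int) -> List[int]:
    """Rank passage indices by P(yes), scoring each passage independently."""
    scores = [yes_probability(y, n) for y, n in logits]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]
```

Because each passage is scored in isolation, nothing in this ranking discourages selecting several passages that support the same answer.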

Implementation Details
We use the English Wikipedia from 12/20/2018 as the retrieval corpus $C$, where each article is split into passages with up to 288 wordpieces. All rerankers are based on T5 (Raffel et al., 2020), a pretrained encoder-decoder model; T5-base is used unless otherwise specified. We use $|B| = 100$ and $k \in \{5, 10\}$. Models are first trained on NQ (Kwiatkowski et al., 2019) and then finetuned on multi-answer datasets, which we find helpful since all multi-answer datasets are relatively small. During dynamic oracle training, $k - |\tilde{O}|$ negatives are sampled from $B - \tilde{O}$ based on $s(p_i) + \gamma g_i$, where $s(p_i)$ is a prior logit value from INDEPPR, $g_i \sim \mathrm{Gumbel}(0, 1)$ and $\gamma$ is a hyperparameter. In TREEDECODE, to control the trade-off between the depth and the width of the tree, we use a length penalty function $l(y) = \left(\frac{5+y}{5+1}\right)^{\beta}$, where $\beta$ is a hyperparameter, following Wu et al. (2016). More details are in Appendix B.2.

Table 3: Results on passage retrieval in MRECALL. The two numbers in each cell indicate performance on all questions and on questions with more than one answer, respectively. Test-set metrics on AMBIGQA are not available as its test set is hidden, but we report the test results on question answering in Section 5.2. Note: it is possible to have higher MRECALL @ 5 than MRECALL @ 10 based on our definition of MRECALL (Section 2.2).

Table 3 reports MRECALL on all questions and on questions with more than one answer.
Independent vs. joint ranking JPR consistently outperforms both DPR+ + Nogueira et al. (2020) and INDEPPR on all datasets and all values of $k$. Gains are especially significant on questions with more than one answer, outperforming the two reranking baselines by up to 11% absolute and up to 6% absolute, respectively. WEBQSP sees the largest gains out of the three datasets, likely because its average number of answers is the largest.
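The Gumbel-perturbed negative sampling described in the implementation details can be sketched as below. It assumes that "sampled based on $s(p_i) + \gamma g_i$" means taking the highest perturbed scores among non-positive candidates; the function name `sample_negatives` and the dictionary-based interface are hypothetical.

```python
import math
import random
from typing import Dict, List, Set


def sample_negatives(
    prior_logits: Dict[int, float],
    positives: Set[int],
    num_negatives: int,
    gamma: float,
    rng: random.Random,
) -> List[int]:
    """Pick negatives by prior logit plus gamma-scaled Gumbel(0, 1) noise."""
    perturbed: Dict[int, float] = {}
    for pid, logit in prior_logits.items():
        if pid in positives:
            continue  # negatives are drawn only from passages outside the positives
        u = max(rng.random(), 1e-12)         # keep u strictly inside (0, 1)
        gumbel = -math.log(-math.log(u))     # inverse-CDF sample from Gumbel(0, 1)
        perturbed[pid] = logit + gamma * gumbel
    ranked = sorted(perturbed, key=perturbed.get, reverse=True)
    return ranked[:num_negatives]
```

With `gamma = 0` the selection reduces to the hardest negatives under the prior, while larger values of `gamma` make the choice more random.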

Ablations & Analysis
Training methods Table 4 compares dynamic oracle training with alternatives. 'Dynamic oracle w/o negatives' is the same as dynamic oracle training except that the prefix only has positive passages. 'Teacher forcing' is a standard method in training an autoregressive model: given a target sequence $o_1 \ldots o_k$, the model is trained to maximize $\prod_{1 \le t \le k} P(o_t \mid o_1 \ldots o_{t-1})$. We form a target sequence using the set of positive passages $\tilde{O}$, where the order is determined by following the ranking from INDEPPR. Table 4 shows that our dynamic oracle training, which uses both positives and negatives, significantly outperforms the other methods.

Impact of TREEDECODE Table 5 compares JPR with SEQDECODE and with TREEDECODE. We find that TREEDECODE consistently improves the performance on both WEBQSP and AMBIGQA, with both $k = 5$ and $k = 10$. Gains are especially significant on AMBIGQA, since the choice of whether to increase diversity is more challenging on AMBIGQA, where questions are more specific and have fewer distinct answers. The average depth of the tree is larger on WEBQSP, likely because its average number of distinct answers is larger and thus requires more diversity.

An example prediction Table 6 shows predictions from INDEPPR and JPR given an example question from AMBIGQA, "Who plays Mark on the TV show Roseanne?" One answer, Glenn Quinn, is easy to retrieve because there are many passages in Wikipedia providing evidence, while the other answer, Ames McNamara, is harder to find. While INDEPPR repeatedly retrieves passages that mention Glenn Quinn and fails to cover Ames McNamara, JPR successfully retrieves both answers. More analysis can be found in Appendix C.

QA Experiments
This section discusses experiments on downstream question answering: given a question and a set of passages from retrieval, the model outputs all valid answers to the question. We aim to answer two research questions: (1) whether the improvements in passage retrieval are transferred to improvements in downstream question answering, and (2) whether using a smaller number of passages through reranking is better than using the largest possible number of passages given fixed hardware memory.
We use an answer generation model based on Izacard and Grave (2021), which we train to generate a sequence of answers, separated by a [SEP] token, given a set of retrieved passages. Our main model uses JPR to obtain the passages fed into the answer generation model. The baselines obtain passages from either DPR+ only or INDEPPR, described in Section 4.3.
We compare different models that fit on the same hardware by varying the size of T5 (base, large, 3B) and using the maximum number of passages ($k$) for each size. This results in three settings: {k = 140, base}, {k = 40, large} and {k = 10, 3B}. Table 7 reports the performance on three multi-answer datasets in F1.

Main Result
Impact of reranking With {k = 10, 3B}, JPR outperforms both baselines, indicating that the improvements in retrieval are successfully transferred to improvements in QA performance. We however find that our sequence-to-sequence answer generation model tends to undergenerate answers, presumably due to high variance in the length of the output sequence. This indicates the model is not fully benefiting from retrieval of many answers, and we expect more improvements when combined with an answer generation model that is capable of generating many answers.
More passages vs. bigger model With fixed memory during training, using fewer passages equipped with a larger answer generation model outperforms using more passages. This is only true when reranking is used; otherwise, using more passages is often better or comparable. This demonstrates that, as retrieval improves, memory is better spent on larger answer generators rather than more passages, leading to the best performance.
Finally, JPR establishes a new state-of-the-art, outperforming the previous state-of-the-art on AMBIGQA (Gao et al., 2021), which uses extensive reranking and an answer generation model trained using 3x more resources than ours.

Single-answer QA result
While our main contributions are in multi-answer retrieval, we experiment on NQ to demonstrate that the value of good reranking extends to the single-answer scenario. Table 8 indicates two observations consistent with the findings from multi-answer retrieval: (1) when compared within the same setting (same T5 and $k$), INDEPPR always outperforms DPR+ only, and (2) with reranking, {k = 10, 3B} outperforms {k = 40, large}. Finally, our best model outperforms the previous state-of-the-art (Izacard and Grave, 2021), which uses 5x more training resources. Altogether, this result (1) justifies our choice of focusing on reranking, and (2) shows that INDEPPR is very competitive and thus our JPR results in multi-answer retrieval are very strong.

Table 7: Question Answering results on multi-answer datasets. The two values in each cell indicate F1 on all questions and F1 on questions with multiple answers only, respectively. Mem compares the required hardware memory during training. Note that Gao et al. (2021) reranks 1000 passages instead of 100, and trains an answer generation model using 3x more memory than ours. Better retrieval enables using larger answer generation models on fewer retrieved passages.

Table 8: Question Answering results on NQ. We report Exact Match (EM) accuracy. The first five rows are from our own experiments, which all use the same hardware resources for training. The last row is the previous state-of-the-art, which requires 5x more resources than ours to train the model.

Related Work
We refer to Section 2 for related work focusing on single-answer retrieval.
Diverse retrieval Studies on diverse retrieval in the context of information retrieval (IR) require finding documents covering many different sub-topics of a query topic (Zhai et al., 2003; Clarke et al., 2008). Questions are typically underspecified, and many documents (e.g., up to 56 in Zhai et al. (2003)) are considered relevant. In their problem space, effective models increase the distances between output passages post hoc during inference (Zhai et al., 2003; Abdool et al., 2020). Our problem is closely related to diverse retrieval in IR, with two important differences. First, since questions represent more specific information needs, controlling the trade-off between relevance and diversity is harder, and simply increasing the distances between retrieved passages does not help. Second, multi-answer retrieval uses a clear notion of "answers"; "sub-topics" in diverse IR are more subjective and hard to enumerate fully.
Multi-hop passage retrieval Recent work studies multi-hop passage retrieval, where a passage containing the answer is the destination of a chain of multiple hops (Asai et al., 2020;Xiong et al., 2021;Khattab et al., 2021). This is a difficult problem as passages in a chain are dissimilar to each other, but existing datasets often suffer from annotation artifacts (Chen and Durrett, 2019;Min et al., 2019), resulting in strong lexical cues for each hop. We study an orthogonal problem of finding multiple answers, where the challenge is in controlling the trade-off between relevance and diversity.

Conclusion
We introduce JPR, an autoregressive passage reranker designed to address the multi-answer retrieval problem. On three multi-answer datasets, JPR significantly outperforms a range of baselines in both retrieval recall and downstream QA accuracy, establishing a new state-of-the-art. Future work could extend the scope of the problem to other tasks that exhibit specific information needs while requiring diversity.

A Details of DPR+
We use a pretrained dual encoder model from REALM (Guu et al., 2020) and further finetune it on the QA datasets using the objective from DPR (Karpukhin et al., 2020):

$$\mathcal{L} = -\log \frac{\exp\left(f_q(q)^\top f_p(p^+)\right)}{\exp\left(f_q(q)^\top f_p(p^+)\right) + \sum_{p^- \in B^-} \exp\left(f_q(q)^\top f_p(p^-)\right)},$$

where $f_q$ and $f_p$ are trainable encoders for the questions and passages, respectively, $p^+$ is a positive passage (i.e., a passage containing the answer), and $B^-$ is a set of negative passages (i.e., passages without the answer). As shown in Karpukhin et al. (2020), the choice of $B^-$ is significant for the performance. We explore two methods:
Distant negatives follows DPR (Karpukhin et al., 2020) in using distantly obtained negative passages as $B^-$. We obtain two distant negative passages per question: one hard negative, a top prediction from REALM without finetuning, and one random negative, drawn from a uniform distribution, both not containing the answer.
Full negatives considers all passages in Wikipedia except $p^+$ as $B^-$, and instead freezes the passage encoder $f_p$ and only finetunes the question encoder $f_q$. This is appealing because (a) the number and the quality of the negatives, which are both significant factors for training, are at the strict maximum, and (b) $f_p$ from REALM is already good, producing high-quality passage representations without finetuning. Implementation of this method is feasible by exploiting extensive model parallelism.
We use distant negatives for multi-answer datasets and full negatives for NQ, as this combination gave the best result.

B Experiment Details

B.1 Data processing for TREC
TREC from Baudiš and Šedivỳ (2015) contains regular expressions as the answers. We approximate a set of semantically distinct answers as follows. We first run the regular expressions over Wikipedia to detect valid answer text. If no valid answer is found in Wikipedia, or there are more than 100 valid answers, we discard the question. We then only keep the answers with up to five tokens, following the notion of short answers from Lee et al. (2019). Finally, we group the answers that are the same after normalization and white space removal. We find that this gives a reasonable approximation of a set of semantically distinct answers. Note that the data we use is a subset of the original data because we discarded a few questions. Statistics are reported in Section 4.1.
Here is an example: a regular expression from the original data is Long Island|New\s?York|Roosevelt Field. All matching answers over Wikipedia include roosevelt field, new york, new\xa0york, new\nyork, newyork, long island.
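The grouping step in this example can be sketched as follows. The sketch assumes that "normalization" means lowercasing, that tokens are whitespace-separated, and that matching is case-insensitive; the function name `group_answers` is hypothetical, and the check against the 100-answer limit is omitted.

```python
import re
from collections import defaultdict
from typing import Dict, Set


def group_answers(pattern: str, text: str, max_tokens: int = 5) -> Dict[str, Set[str]]:
    """Group regex matches that are identical after lowercasing and whitespace removal."""
    groups: Dict[str, Set[str]] = defaultdict(set)
    for match in re.finditer(pattern, text, flags=re.IGNORECASE):
        surface = match.group(0)
        if len(surface.split()) > max_tokens:
            continue  # keep only short answers
        key = re.sub(r"\s+", "", surface.lower())  # normalized grouping key
        groups[key].add(surface)
    return dict(groups)
```

Each key of the returned dictionary then stands for one approximate distinct answer, with its surface variants as the value.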

B.2 Details of reranker training
All implementations are based on Tensorflow (Abadi et al., 2015) and Mesh Tensorflow (Shazeer et al., 2018). All experiments are done on Google Cloud TPU. We use the maximum batch size that fits on one instance of TPU v3-32 (for WEBQSP and AMBIGQA) or TPU v3-8 (TREC). We use the same batch size for INDEPPR; for Nogueira et al. (2020), we use a batch size of 1024. We use an encoder length of 360 and a decoder length of $k$ (JPR) or 1 (all others). We use $k \in \{5, 10\}$ for all experiments. We train JPR with $\gamma \in \{0, 0.5, 1.0, 1.5\}$ and choose the one with the best accuracy on the development data. We use a flat learning rate of $1 \times 10^{-3}$ with warm-up for the first 500 steps. Full hyperparameters are reported in Table 9.
For training INDEPPR and JPR, instead of using all $|B|$ passages, we use $|B|/4$ passages by sampling $k$ positive passages and $|B|/4 - k$ negative passages. We find that this trick allows a larger batch size when using the same hardware, ultimately leading to substantial performance gains. We also find that assigning indexes of the passages based on a prior, e.g., the ranking from dense retrieval, leads to significant bias; e.g., in 50% of the cases, the top-1 passage from dense retrieval contains a correct answer. We therefore randomly assign the indexes, and find this gives significantly better performance. Algorithm 2 describes how the set of positive passages $\tilde{O}$ used in Section 3.2 is computed during preprocessing.

Algorithm 2 An algorithm to obtain Õ from the answer set and B.
    …
    A_left ← A_left − answers in b
    if |Õ| == k then
        break
    return Õ.toSet()
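Only the tail of Algorithm 2 survives above, so the following is a hedged sketch of one way such a procedure could look: a single greedy pass over the candidates that keeps a passage whenever it covers an answer not yet covered. The iteration order, the substring-based answer matching, and the name `select_positives` are all assumptions for illustration.

```python
from typing import Iterable, List, Sequence


def select_positives(candidates: Sequence[str], answers: Iterable[str], k: int) -> List[int]:
    """Greedily pick up to k candidate indices that cover not-yet-covered answers."""
    remaining = {a.lower() for a in answers}   # answers still uncovered
    selected: List[int] = []
    for idx, passage in enumerate(candidates):
        if len(selected) == k or not remaining:
            break
        text = passage.lower()
        found = {a for a in remaining if a in text}
        if not found:
            continue                           # passage adds no new answer
        selected.append(idx)
        remaining -= found                     # mark these answers as covered
    return selected
```

Under this reading, fewer than $k$ passages are returned whenever they already cover every answer, which matches the footnote on $|\tilde{O}|$ in Section 3.2.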

B.3 Details of answer generation training
We train the models using a batch size of 32. We use a decoder length of 20 and 40 for NQ and multi-answer datasets, respectively. We decode answers only when they appear in the retrieved passages, as we want the generated answers to be grounded by Wikipedia passages. Answers in the output sequence follow the order they appear in the passages, except on WEBQSP, where shuffling the order of the answers improves the accuracy. All other training details are the same as details of reranker training.

C Additional Results
We additionally report retrieval performance in α-NDCG @ k, one of the metrics for diverse retrieval in IR (Clarke et al., 2008;Sakai and Zeng, 2019). It is a variant of NDCG (Järvelin and Kekäläinen, 2002), but penalizes retrieval of the same answer. We refer to Clarke et al. (2008) for a complete definition. We use α = 0.9.
Results are reported in Table 10. JPR consistently outperforms INDEPPR across all datasets, although the gains are less significant than the gains in MRECALL. We note that we report α-NDCG following the IR literature, but we treat MRECALL as the priority, because α-NDCG does not use an explicit notion of completeness of retrieval of all answers. It is also a less strict measure than recall because it gives partial credit for retrieving a subset of the answers.
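To illustrate how this family of metrics penalizes repeated answers, the sketch below computes the unnormalized α-DCG in the spirit of Clarke et al. (2008), where each repeated occurrence of an answer is discounted by a factor of $1 - \alpha$. The normalization by an ideal ranking that turns α-DCG into α-NDCG is omitted, and the representation of each passage as the set of answers it contains is an assumption for illustration.

```python
import math
from typing import Dict, Sequence, Set


def alpha_dcg(ranking: Sequence[Set[str]], alpha: float, k: int) -> float:
    """Unnormalized alpha-DCG at k; ranking[i] is the set of answers in passage i."""
    seen: Dict[str, int] = {}   # how often each answer appeared at earlier ranks
    total = 0.0
    for rank, answers in enumerate(ranking[:k], start=1):
        # A repeated answer contributes (1 - alpha) ** (number of earlier occurrences).
        gain = sum((1.0 - alpha) ** seen.get(a, 0) for a in answers)
        total += gain / math.log2(rank + 1)
        for a in answers:
            seen[a] = seen.get(a, 0) + 1
    return total
```

With α = 0.9, a ranking that surfaces a second distinct answer early scores higher than one that repeats the first answer, yet both receive partial credit, which is why this measure is less strict than MRECALL.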
Gains with respect to the number of answers Figure 4 shows gains over INDEPPR on three datasets with respect to the number of answers. Overall, gains are larger when the number of answers is larger, especially for WEBQSP and TREC. For AMBIGQA, the largest gains are when the number of answers is 2, which is responsible for over half of multi-answer questions.