Answering Ambiguous Questions via Iterative Prompting

In open-domain question answering, due to the ambiguity of questions, multiple plausible answers may exist. To provide feasible answers to an ambiguous question, one approach is to directly predict all valid answers, but this can struggle with balancing relevance and diversity. An alternative is to gather candidate answers and aggregate them, but this method can be computationally costly and may neglect dependencies among answers. In this paper, we present AmbigPrompt to address the imperfections of existing approaches to answering ambiguous questions. Specifically, we integrate an answering model with a prompting model in an iterative manner. The prompting model adaptively tracks the reading process and progressively triggers the answering model to compose distinct and relevant answers. Additionally, we develop a task-specific post-pretraining approach for both the answering model and the prompting model, which greatly improves the performance of our framework. Empirical studies on two commonly-used open benchmarks show that AmbigPrompt achieves state-of-the-art or competitive results while using less memory and having a lower inference latency than competing approaches. AmbigPrompt also performs well in low-resource settings.


Introduction
Recent years have witnessed substantial advances in open-domain question answering (QA) systems (Karpukhin et al., 2021; Lewis et al., 2020; Izacard and Grave, 2021b), which aim to find the answer for the given question from a large knowledge corpus (Chen et al., 2017). While a dominating scenario is the single-answer QA setting, i.e., only one exact answer is required for a given question (Karpukhin et al., 2021), this work focuses on the more realistic scenario of multi-answer QA, where multiple plausible answers are associated with a user-issued question (Min et al., 2020), given that questions posed by humans are often open-ended and ambiguous.
Figure 1 (example). Question: Which movie was both directed and screen written by Kamal Haasan? Passages: (1) Vishwaroopam (titled Vishwaroop in Hindi) is a 2013 Indian espionage action thriller film written, directed and produced by Kamal Haasan, who also enacts the lead role. (2) Vishwaroopam II, or Vishwaroop II, is a 2018 Indian espionage action thriller film written and directed by Kamal Haasan; it is the sequel to "Vishwaroopam" (2013) and features himself alongside Rahul Bose, Shekhar Kapur, Pooja Kumar and Andrea Jeremiah, reprising their roles. Answers: Vishwaroopam; Vishwaroopam II; Sabaash Naidu; Virumaandi.
A natural approach for answering ambiguous open-domain questions would be to fine-tune a pretrained answer generation model, e.g., T5 (Raffel et al., 2020), using supervised data of the form (evidential passages, question, all plausible answers) (Min et al., 2020, 2021). However, this approach often leads to sub-optimal solutions since it requires the model to balance the relevance and diversity of the generated multiple answers within a single-round decoding procedure, which is non-trivial. To manage the relevance-diversity trade-off, another approach is to decompose multi-answer QA into candidate answer prediction and answer post-processing. This typically requires a high-capacity model with billions of parameters to construct candidate answers and sophisticated answer aggregation pipelines to obtain the final results (Shao and Huang, 2022; Gao et al., 2021b), incurring high computational costs. In addition, this approach suffers from the dilemma of having to predict diverse candidate answers before knowing which answer has been predicted, which is unnatural and intricate. For example, in Figure 1, given the question "Which movie was both directed and screenwritten by Kamal Haasan?", with the existence of the answer Vishwaroopam, the model excludes its eponymous translation version Vishwaroop and deduces that Vishwaroopam II is another potential answer.
When facing an ambiguous question, people are capable of providing multiple valid answers by introspectively composing new content on the basis of what has already been devised, usually in an iterative manner. Inspired by this observation, in this paper, we conceptualize AmbigPrompt as an approach to mimic this mechanism by iteratively guiding the answering model with a lightweight prompting model. As shown in Figure 2, this prompting model steers the answering model to progressively generate valid answers whose content the prompting model will then condition on for the next-round prompt construction. Essentially, our proposed framework comprises two key components: (i) an encoder-decoder answering model and (ii) an interleaving answer-conditional prompting model. By conditioning on preceding generated contents, the proposed framework introspectively perceives which answer has been predicted before updating the hidden activation for the generation of subsequent answers. Furthermore, we devise a task-adaptive post-pretraining strategy, in which pseudo multi-QA training instances are constructed to facilitate the training of the proposed framework.
We carry out extensive experiments on the AmbigQA (Min et al., 2020) and WebQSP (Yih et al., 2016) datasets. The results demonstrate that AmbigPrompt attains superior performance despite having a significantly smaller parameter scale, with 14 times fewer parameters than state-of-the-art models. Furthermore, as a lightweight approach, AmbigPrompt improves the answer relevance and diversity with a tiny fraction of the memory footprint and inference latency of competing approaches. Notably, AmbigPrompt achieves the best performance in the low-resource setting. The effectiveness of the proposed method is also verified by ablation experiments and analytical experiments. In summary, this paper makes the following contributions: (i) We propose AmbigPrompt, which tackles ambiguous question answering by iterative prompting. (ii) We propose an interleaving answer-conditional prompting model to generate meaningful continuous prompts. (iii) Experiments on multi-QA datasets verify the effectiveness of the proposed approach.
Figure 2: Given the retrieved passages, AmbigPrompt alternates between (2) generating prompts based on previous answers, (3) generating a new answer using a question-answering model, and (4) appending the new answer to the answers set. Note that steps (2) and (3) operate in an interleaving way.

Problem formalization
Formally, given an open-domain question $q$, a multi-answer question answering (QA) model is required to make use of (multiple pieces of) evidence from a large-scale text corpus $\Omega$ (e.g., Wikipedia) to find multiple plausible answers $A = \{a_1, a_2, \ldots, a_n\}$, where $a_i$ denotes one answer and we suppose there are $n$ answers. The QA model aims to infer $p(A \mid q, \Omega)$. In open-domain QA, the QA model typically follows a two-step pipeline, comprising passage retrieval and answer generation. In the passage retrieval step, a retrieval model $p(C \mid q, \Omega)$ retrieves $m$ evidence passages $C = \{c_1, c_2, \ldots, c_m\}$ from $\Omega$ according to the question $q$. In the answer generation step, an answering model $p(A \mid q, C)$ reads the evidential passages and finds the answers to the question.
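The two-step pipeline described above can be sketched as follows. This is only an illustration under stated assumptions: `retrieve` uses a toy word-overlap score in place of a trained retrieval model, and `generate_answers` is a hypothetical callable standing in for the answering model; neither name comes from the paper.

```python
from typing import Callable, List


def retrieve(question: str, corpus: List[str], m: int) -> List[str]:
    """Toy retriever (assumption): rank passages by word overlap with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(
        corpus,
        key=lambda passage: len(q_words & set(passage.lower().split())),
        reverse=True,
    )
    return ranked[:m]  # the m evidence passages C


def answer_question(
    question: str,
    corpus: List[str],
    m: int,
    generate_answers: Callable[[str, List[str]], List[str]],
) -> List[str]:
    """Two-step pipeline: passage retrieval, then answer generation over C."""
    passages = retrieve(question, corpus, m)
    return generate_answers(question, passages)
```

Any retriever and answering model with these call shapes could be swapped in; the point is only that the answering step sees the question together with the retrieved passages, not the whole corpus.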

Answering model
We use Fusion-in-Decoder (FiD) as a basic single-answer answering model (Izacard and Grave, 2021b). In particular, FiD has an encoder-decoder architecture. FiD first concatenates each retrieved passage with the question with a [SEP] token:
$$X = \{x_1, x_2, \ldots, x_m\}, \quad x_i = q \ \text{[SEP]} \ c_i \tag{1}$$
where we use $X$ to denote the concatenated sequences. Then, for each $x_i$, the FiD encoder Enc encodes it to $\mathbf{x}_i$:
$$\mathbf{X} = \mathrm{Cat}(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_m), \quad \mathbf{x}_i = \mathrm{Enc}(x_i) \tag{2}$$
where Cat denotes a concatenation function. Finally, the decoder Dec attends to the representations of all passages and generates an answer $a$:
$$p(a \mid q, C) = \mathrm{Dec}(\mathbf{X}) \tag{3}$$
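The FiD input construction and fusion step described above can be illustrated with a short sketch. The function names are hypothetical, and `toy_encode` is only a placeholder for the real encoder (it emits one small vector per whitespace token); the sketch shows the data flow, not the FiD implementation.

```python
from typing import List

Vector = List[float]


def build_fid_inputs(question: str, passages: List[str], sep: str = "[SEP]") -> List[str]:
    """Pair the question with each passage separately, joined by a separator token."""
    return [f"{question} {sep} {passage}" for passage in passages]


def toy_encode(text: str) -> List[Vector]:
    """Placeholder for Enc (assumption): one 2-d vector per whitespace token."""
    return [[float(len(token)), float(index)] for index, token in enumerate(text.split())]


def fuse(encoded_passages: List[List[Vector]]) -> List[Vector]:
    """Cat: concatenate the per-passage token representations along the sequence axis."""
    fused: List[Vector] = []
    for token_vectors in encoded_passages:
        fused.extend(token_vectors)
    return fused
```

Each passage is encoded independently, so encoder cost grows linearly with the number of passages, while the decoder attends over the single fused sequence.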

Prompt-tuning
Prompt-tuning adapts pre-trained transformer models to downstream tasks by optimizing continuous prompting vectors (Li and Liang, 2021; Liu et al., 2022). Suppose $x$ is the input sequence of the model; we denote by $Q(x)_j$, $K(x)_j$, and $V(x)_j$ the query, key, and value representations of $x$ in the $j$-th attention layer of the transformer encoder. Prompt-tuning prepends learnable prompting vectors $E_j$ to $K(x)_j$ and $V(x)_j$ to modify the attention distribution as well as the output $\mathbf{x}_j$ of the $j$-th layer as follows:
$$\mathbf{x}_j = \mathrm{Attn}\big(Q(x)_j,\ \mathrm{Cat}(E_j, K(x)_j),\ \mathrm{Cat}(E_j, V(x)_j)\big) \tag{4}$$
where $\mathbf{x}_j$ denotes the output of layer $j$, $\mathrm{Attn}(\cdot)$ represents the attention operation in the transformer, and $\mathrm{Cat}(\cdot)$ is the concatenation function.
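To make the effect of prepending prompt vectors concrete, here is a minimal single-head sketch in plain Python. It assumes scaled dot-product attention with no learned projections, and it prepends the same prompt vectors to both keys and values, as the description above states; the function names are illustrative only.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries: List[Vector], keys: List[Vector], values: List[Vector]) -> List[Vector]:
    """Scaled dot-product attention for a single head."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[c] for w, v in zip(weights, values)) for c in range(len(values[0]))]
        )
    return outputs


def prefix_attention(
    queries: List[Vector], keys: List[Vector], values: List[Vector], prompt: List[Vector]
) -> List[Vector]:
    """Attention with prompt vectors prepended to keys and values; queries are unchanged."""
    return attention(queries, prompt + keys, prompt + values)
```

The output keeps one row per query, so the prompt changes what each token attends to without changing the sequence length seen by later layers.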

AmbigPrompt
Conventionally, the question answering model generates the desired answer given the input context in a single pass (Izacard and Grave, 2021b). While it suffices to tackle the single-answer QA scenario, managing ambiguous questions with multiple answers can be more nuanced: the answering model is required to balance the relevance and diversity of the generated answers in a single pass, and precisely modeling dependencies among the answers can be non-trivial. In this paper, we propose AmbigPrompt, a question-answering model that answers ambiguous questions via iterative prompting, inferring more accurate answers progressively. Figure 2 gives an overview of the proposed method.
Overall, AmbigPrompt decomposes the generation of answers $A$ into multiple steps instead of one single pass, i.e.,
$$p(A \mid q, C) = \prod_{t=1}^{n} p(a_t \mid \phi(a_{<t}), q, C) \tag{5}$$
where $a_{<t}$ denotes the set of answers that have been generated before step $t$, and $\phi(\cdot)$ denotes a prompting model that generates prompt vectors for answer generation at the $t$-th step. The prompting model shares parameters with the answering model, allowing for seamless integration. AmbigPrompt iteratively composes a new answer $a_t$ conditioned on the prompt built from previous answers, i.e., $\phi(a_{<t})$, and appends $a_t$ to the answers set, until all feasible answers are found. The proposed framework is optimized in a two-stage manner: task-adaptive post-pretraining and prompt-based tuning. In the former stage, the model is trained on a large synthesized multi-answer QA dataset, while in the latter stage, the model is tuned on the annotated multi-answer QA dataset. We first detail the prompting model (§3.1) and the iterative question answering procedure (§3.2), and then introduce the optimization scheme (§3.3).

Retrospective prompting mechanism for answer generation
To capture intricate dependencies among answers, we devise an interleaving answer-conditional prompting model $\phi(a_{<t})$, which generates the prompt vector $E = \phi(a_{<t})$ conditioned on antecedent generated answers $a_{<t}$, as depicted in Figure 3. Specifically, the prompting model $\phi$ is a transformer encoder that shares the same parameters with the encoder of the answering model. $\phi$ processes $a_{<t}$ in three steps:
(1) Templating answers. First, $a_{<t}$ is transformed into a text sequence $e = T(a_{<t})$ using a template $T$. Here we use semicolons to splice answers.
(2) Generating prompts. Then, given the answer sequence $e$ and context $X$ (i.e., the concatenated question and passages in Eq. 1), the prompting model $\phi$ computes the hidden activations $E_j$ of each layer $j$ via cross-attending the contextual representation $\mathbf{X}_{j-1}$:
$$E_j = \mathrm{Attn}\big(Q(e)_j,\ \mathrm{Cat}(K(e)_j, \mathbf{X}_{j-1}),\ \mathrm{Cat}(V(e)_j, \mathbf{X}_{j-1})\big) \tag{6}$$
where $Q(e)_j$, $K(e)_j$, and $V(e)_j$ denote the query, key, and value representations of $e$ in the $j$-th attention layer in the prompting model; $\mathbf{X}_{j-1}$ denotes the concatenated context representations of the $(j-1)$-th layer in the answering model. We write $E$ for the last layer output of the prompting model.
(3) Prompting answering model. Finally, the generated prompt $E_j$ is prepended to the attention layer of the encoder Enc of the answering model as in Eq. 4. Meanwhile, the decoder Dec of the answering model attends to $\mathrm{Cat}(E, \mathbf{X})$ and generates the target answer $a_t$:
$$p(a_t \mid \phi(a_{<t}), q, C) = \mathrm{Dec}(\mathrm{Cat}(E, \mathbf{X})) \tag{7}$$
Capturing long-range dependencies among derived answers via a retrospective prompting mechanism enables the answering model to compose new contents grounded on what has already been devised, and thus the model is able to strike a good relevance-diversity balance for answering ambiguous questions.
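The first two steps above can be sketched in plain Python. This is a simplified illustration, not the authors' code: the template joins answers with "; " (the exact separator spacing is an assumption), and `prompt_layer` uses identity projections for queries, keys, and values so that only the attention pattern is shown, with the answer tokens attending to themselves and to the context representations of the previous layer.

```python
import math
from typing import List

Vector = List[float]


def template_answers(previous_answers: List[str], sep: str = "; ") -> str:
    """Step (1): splice the previously generated answers into one text sequence."""
    return sep.join(previous_answers)


def attention(queries: List[Vector], keys: List[Vector], values: List[Vector]) -> List[Vector]:
    """Single-head scaled dot-product attention."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        outputs.append(
            [sum((e / total) * v[c] for e, v in zip(exps, values)) for c in range(len(values[0]))]
        )
    return outputs


def prompt_layer(answer_states: List[Vector], context_states: List[Vector]) -> List[Vector]:
    """Step (2), one layer: answer tokens attend over Cat(answer states, context states).

    Identity Q/K/V projections are an assumption made to keep the sketch short.
    """
    memory = answer_states + context_states
    return attention(answer_states, memory, memory)
```

The returned vectors play the role of the prompt for that layer: there is one vector per answer token, so the prompt grows as more answers are generated.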

Answering ambiguous questions via iterative prompting
Given the input context, i.e., the question and retrieved evidential passages, AmbigPrompt iteratively performs attention operations over the input context and the generated answers, addressing the answer generation and prompt construction interactively. The key is to pass the attention activations between the prompting model and answering model so that they can inspect each other's internal states and make harmonious predictions. Specifically, we start from an empty answer set and progressively append newly generated answers to it. As depicted in Figure 2, in each iteration, we first use the previously generated answer sequence to obtain the introspective prompts, and then interweave the resultant prompting vectors into the answering model to predict the next answer. Our algorithm terminates when the model generates the [EOI] token.
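The iteration described above can be summarized as a short loop. The sketch below is hedged: `next_answer` is a hypothetical callable standing in for the prompted answering model, the answers are templated with "; ", and `max_steps` is a safety cap added for the sketch rather than something stated in the paper. `fake_model` exists only to make the example runnable.

```python
from typing import Callable, List

EOI = "[EOI]"  # end-of-iteration token


def iterative_answering(
    question: str,
    passages: List[str],
    next_answer: Callable[[str, List[str], str], str],
    max_steps: int = 10,  # safety cap (assumption, not from the paper)
) -> List[str]:
    """Start from an empty answer set and append one answer per iteration."""
    answers: List[str] = []
    for _ in range(max_steps):
        prompt_text = "; ".join(answers)  # templated previous answers
        candidate = next_answer(question, passages, prompt_text)
        if candidate == EOI:  # the model signals that no answers remain
            break
        answers.append(candidate)
    return answers


def fake_model(question: str, passages: List[str], prompt_text: str) -> str:
    """Hypothetical stand-in: returns the first known title not yet in the prompt."""
    known = set(prompt_text.split("; ")) if prompt_text else set()
    for title in ["Vishwaroopam", "Vishwaroopam II"]:
        if title not in known:
            return title
    return EOI
```

Because each call sees the answers produced so far, the model can avoid repeating them and decide when to stop, which a single-pass decoder has to do implicitly.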

Optimization
To adapt the pre-trained model to multi-answer QA, one straightforward approach is to leverage a question-answering dataset such as NQ (Kwiatkowski et al., 2019) for domain-adaptive pre-training (Min et al., 2021). However, the effectiveness of such a simple approach is limited by the inherent defect of the one-pass prediction process; that is, it cannot model the interactions between answer generation and answer perception, which is critical to achieving superior performance in multi-QA scenarios. To explicitly align the pre-training objective with task-specific preferences, we further propose to conduct task-adaptive post-pretraining on a pseudo multi-answer QA dataset, and then fine-tune the proposed model using the task data.
Task-adaptive post-pretraining. We first pretrain the model on NQ, in which only one answer $A = \{a_1\}$ is labeled for each question $q$. To explicitly characterize the pretraining stage as the effort of finding which part of the preceding answers to interact with regarding the input context, we construct the pseudo multi-answer dataset $\hat{A}$ for post-pretraining the proposed framework to mimic the iterative question answering process. Specifically, we first train an auxiliary reader $g(a \mid q, c_i)$, which learns to find an answer from the passage $c_i$ given a question $q$. Then, we use this auxiliary reader to generate a pseudo answer for each retrieved passage in $C$:
$$\hat{A} = \{\hat{a}_i\}_{i=1}^{m}, \quad \hat{a}_i \sim g(a \mid q, c_i) \tag{8}$$
where $\hat{A}$ denotes the pseudo-answer set of $q$.
Then, we aggregate the generated answers to construct the previously known answers $a_{<t}$ in Eq. 5. In particular, we randomly sample $t$ answers from $\hat{A}$ and filter out those that are equivalent to the ground-truth answer $a_1$; we denote the sampled set as $\hat{a}_{<t}$. With the pseudo answers, we define the post-pretraining objective as:
$$\mathcal{L}_{Pre} = -\log p(a_1 \mid \phi(\hat{a}_{<t}), q, C) \tag{9}$$
where the number of answers in $\hat{a}_{<t}$, i.e., $t$, is sampled from a Bernoulli distribution.
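The construction of one pseudo multi-answer training instance can be sketched as follows. The sketch makes several assumptions that are not from the paper: `reader` is a hypothetical callable standing in for the auxiliary reader, answer equivalence is approximated by case-insensitive string equality, and `t` is supplied by the caller rather than drawn from the distribution mentioned above.

```python
import random
from typing import Callable, List, Tuple


def same_answer(a: str, b: str) -> bool:
    """Equivalence check (assumption): case-insensitive match after trimming."""
    return a.strip().lower() == b.strip().lower()


def build_pseudo_instance(
    question: str,
    passages: List[str],
    gold_answer: str,
    reader: Callable[[str, str], str],
    t: int,
    rng: random.Random,
) -> Tuple[List[str], str]:
    """Return (pseudo previous answers, target answer) for one question."""
    # One pseudo answer per retrieved passage.
    pseudo_answers = [reader(question, passage) for passage in passages]
    # Sample t of them, then drop any that are equivalent to the gold answer.
    sampled = rng.sample(pseudo_answers, min(t, len(pseudo_answers)))
    previous = [a for a in sampled if not same_answer(a, gold_answer)]
    # The model is trained to generate the gold answer given the pseudo previous answers.
    return previous, gold_answer
```

Filtering out the gold answer matters: if it stayed among the previous answers, the training target would already be present in the prompt.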
Prompt-based fine-tuning. We fine-tune the pretrained model on downstream multi-answer QA datasets. Specifically, in multi-answer QA, $n$ answers $A = \{a_1, a_2, \ldots, a_n\}$ corresponding to a question $q$ are provided. The model is tuned by the following objective:
$$\mathcal{L}_{PT} = -\log p(a_t \mid \phi(a_{<t}), q, C) \tag{10}$$
where $t \in [1, n]$ is sampled from a Bernoulli distribution. Since $A$ is unordered, we shuffle $A$ when constructing $a_{<t}$ and $a_t$ to improve the robustness. Besides, we explicitly optimize the model to generate [EOI] to stop the iteration. Specifically, we define a parameter $\alpha \sim U(0, 1)$ and a threshold $\lambda$, which controls the propensity of generating
NQ-Open (Kwiatkowski et al., 2019), and asks annotators to search for, navigate and read multiple Wikipedia pages to find as many answers as possible.
WebQSP: WebQSP consists of questions from Google Suggest API, originally from Berant et al. (2013). The answer is a set of distinct entities in Freebase; we use the modified version by Min et al. (2021), which recasts WebQSP as textual question answering based on Wikipedia.
The statistical details of these two datasets and NQ are shown in Table 1.

Evaluation metrics
Following previous studies (Min et al., 2020), we adopt F1 as the evaluation metric, which measures the precision and recall between the ground-truth answers and the predicted answers. The test set is further divided into two subsets: full and multi. The full subset evaluates the model on all the questions in the test set, while the multi subset evaluates the model on the questions with multiple answers (i.e., n > 1). To assess the computational efficiency of various approaches, we also report the number of parameters, average latency, and peak memory usage during model inference. All the models are tested on the same device. We estimate the latency and memory usage of those baselines without public code using randomly initialized models since these metrics are independent of their parameters given a fixed number of encoded tokens and decoding length.
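As an illustration of the metric, the following sketch computes a set-level F1 between predicted and ground-truth answers. It is a simplification under stated assumptions: answers are matched by exact string equality after a common normalization (lowercasing, removing punctuation and English articles), and each ground-truth answer has a single reference form. The official evaluation scripts may differ in these details.

```python
import re
import string
from typing import List


def normalize(answer: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = answer.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def answer_f1(predicted: List[str], gold: List[str]) -> float:
    """Harmonic mean of precision and recall over normalized answer sets."""
    predicted_set = {normalize(a) for a in predicted}
    gold_set = {normalize(a) for a in gold}
    correct = len(predicted_set & gold_set)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted_set)
    recall = correct / len(gold_set)
    return 2 * precision * recall / (precision + recall)
```

Precision penalizes spurious answers and recall penalizes missing ones, so a model that emits many guesses or only a single safe answer both score poorly on multi-answer questions.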

Baselines
The following models are adopted as baselines:
DPR (Karpukhin et al., 2021): A dual-encoder is trained using contrastive loss for passage retrieval, and a BERT-based reader is used for answer extraction.
SpanSeqGen (Min et al., 2020): DPR reranks the passages, and a BART-based generator is used for answer generation.
FiD (Izacard and Grave, 2021b): The retrieved passages are encoded by a T5 encoder independently, and the representations are then concatenated and fed into the T5 decoder to generate answers.
Refuel (Gao et al., 2021b): A question disambiguation module is proposed to generate disambiguated questions. The disambiguated questions are then used to find more answers.
JPR (Min et al., 2021): JPR is a passage reranker that reranks the passages using an autoregressive model. With the additional reranking stage, JPR selects ten diverse passages from 100 retrieved passages and uses a T5-3B FiD answering model to compose answers in one pass.
RECTIFY (Shao and Huang, 2022): RECTIFY proposes the recall-then-verify framework, which separates the reasoning process of each answer. An answering model operates on each passage to recall surplus answers. Then, a sophisticated verifier based on T5-3B FiD verifies each answer with an aggregation module.
We divide the baseline models into two categories depending on the number of parameters of the models: (i) high-capacity baselines that use large models with billions of parameters and require more computational resources and memory; (ii) comparable low-capacity baselines that use models with a similar number of parameters and computational effort as AmbigPrompt, which can be reasonably compared with AmbigPrompt.

Implementation details
We choose T5-Base (Raffel et al., 2020) as the backbone of the answering model. Regarding the passage retrieval model, we fine-tune the pre-trained model from Gao and Callan (2021) on the NQ dataset (see Appendix C for details). The retrieval corpus is the English Wikipedia from 12/20/2018, and the documents are split into chunks of 100 words following Karpukhin et al. (2021). We set m=100, λ=0.5, and the batch size to 32, and train the model using the AdamW optimizer (Loshchilov and Hutter, 2017) with a constant learning rate of 5e−5. We train the model for up to 5k steps on 4 V100-16G GPUs and choose the hyperparameters and checkpoints on the validation set.

Experimental Results

Main results
Table 2 reports the evaluation results on AmbigQA and WebQSP. Based on the results, we have three main observations. First, AmbigPrompt achieves comparable performance to the state-of-the-art. Specifically, AmbigPrompt obtains 48.7 F1 on the full test set and 38.8 F1 on the multi test set, which exceeds all baselines except RECTIFY. The improvements are particularly significant on the multi test set; AmbigPrompt improves 1.2% over JPR and 1.5% over Refuel. Besides, compared with FiD, which concatenates all the answers in A with [SEP] and generates them in one pass, the proposed method, which benefits from the iterative design and answer-conditional prompting mechanism, achieves 3% and 5% improvements on the full and multi subsets of AmbigQA, respectively. Similar results can also be observed on WebQSP.
Second, AmbigPrompt uses fewer resources compared to previous high-capacity models. AmbigPrompt uses a lightweight model with 220M parameters. Still, AmbigPrompt achieves superior performance compared to the high-capacity models, e.g., JPR, that use 3B parameters. The state-of-the-art model RECTIFY uses 6B parameters (3B for the answering model and 3B for the verifier), which is 27× as many as ours, significantly increasing the training and inference overhead. Similar results are witnessed in terms of latency. In particular, RECTIFY is 29× slower than our model due to the heavy design of the answering model and verifier. Refuel's iterative passage retrieval and clarifying question generation procedure results in 32.6× the latency of our approach. Finally, the comparison of peak memory usage also confirms our approach's lightweight nature. The lightweight design allows our approach to be adapted to academically accessible devices and reduces the carbon footprint of model training and deployment.
Third, we find that AmbigPrompt achieves a better resource-performance balance. In Figure 4 (a), we display the existing methods under the speed-performance coordinate system. Note that we place RECTIFY with different sizes (i.e., latency) on the diagram according to Shao and Huang (2022). AmbigPrompt improves the optimal latency-performance curve (the dashed lines), especially on the multi-answer test set, demonstrating the effectiveness of our approach in answering ambiguous questions.

Low-resource setting
Figure 4 (b) shows the results under different training data sizes to investigate the effectiveness of the proposed method in the low-resource setting. The proposed method achieves favorable results for different data sizes. Remarkably, AmbigPrompt achieves promising performance with little data, surpassing the fully supervised high-capacity model JPR on the multi-answer test set. This result suggests that the proposed prompting mechanism can better elicit the capabilities of the pre-trained model and effectively adapt the model trained on single-answer QA data to multi-answer scenarios.

Ablation study
To understand the contribution of each component of AmbigPrompt, we conduct an ablation study. The results are listed in Table 3. The compared variants and the findings are:
W/o task-adaptive pre-training. The models are trained only on multi-QA data with $\mathcal{L}_{PT}$. A notable performance decline can be seen. This observation suggests that task-adaptive pre-training is an important contributor to the model's performance since the size of multi-answer QA data is small.
W/o prompting model. We remove the prompting model in this variant and instantiate the learnable prompt vector for each step t separately, like Liu et al. (2021a). The performance drops by about 3% and 4% on the two datasets, respectively. The results verify the effectiveness of the proposed answer-conditional prompting mechanism.
W/o interleaving prompting. We remove the interaction mechanism between the prompting model and answering model, i.e., the FiD encoder encodes e and X independently without cross-attention. The results drop by about 2% on both datasets, which reveals that enabling the answering model to generate new answers conditioned on the introspective prompts effectively improves the model's performance.

Analytical experiments
Conceptually, our proposed framework AmbigPrompt equips the FiD model with the ability to progressively compose the answers using retrospective prompts, i.e., iterative prompt learning.
To further analyze the capability of such an iterative prompt learning approach in managing the relevance-diversity trade-off, we present the F1, precision, recall, and average answer numbers of AmbigPrompt and FiD model variants in Figure 5.
In particular, FiD-multi denotes a variant of FiD in which we reduce the generation probability of the end-of-sequence token </s> to ensure that the number of generated answers is approximately the same as AmbigPrompt. We see that FiD-multi obtains comparable recall but gets significantly lower precision. In contrast, AmbigPrompt generates more answers than FiD without sacrificing precision, indicating that the designed iterative prompting mechanism endows the model with a superior ability to manage the trade-off between relevance and diversity for ambiguous question answering.

Related work

Ambiguous question answering
In open-domain QA, given a question about any topic, the model finds the answer from a large knowledge corpus (Chen et al., 2017). Typically, a retrieval model and an answering model are employed. The two modules can be trained separately (Karpukhin et al., 2021; Izacard and Grave, 2021b; Qu et al., 2021) or jointly (Lee et al., 2022; Lewis et al., 2020; Izacard and Grave, 2021a). Ambiguity is inherent to open-domain QA; especially when exploring new topics, it can be difficult to ask questions that have a single, unambiguous answer (Min et al., 2020; Rubin et al., 2022). Min et al. (2020) identify the challenge of multi-answer QA and collect the dataset AmbigQA. Based on that, Min et al. (2021) propose an autoregressive passage reranking model JPR, which reranks the top-retrieved passages and improves their diversity. Gao et al. (2021b) propose a round-trip prediction approach, where clarification questions are generated and fed back into the model to find more answers. Shao and Huang (2022) propose a recall-then-verify framework, where surplus answers are generated first, and a verifier model then verifies each candidate answer. Compared with existing methods, we propose a lightweight yet effective approach to answering ambiguous questions by iterative prompting.

Prompt-based learning
Prompt-based learning has received much attention recently (Liu et al., 2021a). Existing studies on prompt-based learning mainly focus on discrete and continuous prompts. The former designs text-based prompts (Jiang et al., 2020; Gao et al., 2021a; Schick and Schütze, 2021), while the latter prepends a learnable prompt vector to word embeddings (Lester et al., 2021; Liu et al., 2021b) or attention layers (Li and Liang, 2021; Liu et al., 2022). Prompt-based learning has demonstrated advantages in low-parameter tuning (He et al., 2022) and few-shot/zero-shot performance (Brown et al., 2020; Wei et al., 2022a). We propose an iterative prompting method for multi-answer QA based on answer-conditional continuous prompts.

Iterative generation
Iterative generation (a.k.a. progressive generation) aims to decompose a challenging generation task into multiple steps and progressively produce the target sequence. Iterative generation has been applied to the tasks of machine translation (Lee et al., 2018), controllable text generation (Casas et al., 2020; Zhang et al., 2020), storytelling (Hua and Wang, 2020; Tan et al., 2021), data-to-text (Kasner and Dusek, 2020), etc. Recently, Wang et al. (2022) introduced an iterative prompting framework to progressively elicit knowledge from language models for commonsense reasoning and multi-hop question answering tasks (Qi et al., 2019; Xiong et al., 2021). Compared to existing work, we propose an answer-conditional prompting model and an effective task-specific pre-training scheme for multi-answer QA.

Conclusion
In this paper, we have proposed AmbigPrompt for multi-answer QA. AmbigPrompt is a simple yet effective model that answers ambiguous questions by iterative prompting. We have proposed an answer-conditional prompting model for prompt generation, and a task-adaptive post-pretraining scheme for model training. Extensive experiments suggest that AmbigPrompt achieves performance comparable to high-capacity models and achieves the best results in a low-resource setting.

Limitations
The limitations of this paper include the absence of experiments on large language models. Previous studies have shown that using high-capacity pre-trained language models can significantly improve the accuracy of answers but also entails an increase in computational overhead. Due to (academic) limitations of computational resources, this paper employs a low-capacity T5 model for experiments. Our experiments have suggested that the proposed iterative prompting method, working with the low-capacity model, can achieve results comparable to baseline methods equipped with large models.
In future work, we would like to scale up the proposed model to improve the model's performance. Recent research on large language models (LLMs) has shown that they can learn from few examples and reason well. We believe that it is worth exploring ways to enhance the prompting of LLMs to improve their completeness when responding to ambiguous questions and reduce model hallucination in generation (OpenAI, 2023; Zhao et al., 2023; Sun et al., 2023b,a). Another direction worth exploring in the future is the application in low-resource scenarios, such as low-resource languages. The low-resource setting in our study is characterized by limited multi-answer QA annotations and aims to examine how data size impacts model performance. Low-resource languages may behave differently with less training data and large models (Xue et al., 2020; Sun et al., 2021). Besides, we would like to explore more effective prompting methods, such as chain-of-thought prompting (Wei et al., 2022b).

Ethics Statement
The paper has proposed a question-answering model, which is intended to answer factoid open-domain questions. The model-predicted answers still contain a considerable amount of misinformation. Besides, the proposed models rely on pre-trained question-answering models, which are trained on large-scale web data that is known to contain biased or discriminatory content.

Table 4 lists the exact match (EM) scores of the baselines and AmbigPrompt on the single-answer QA benchmark NQ-Open (test set). We see that the high-capacity models (e.g., JPR), which benefit from large language models like T5-3B, achieve better EM scores. However, in the multi-answer QA task, the models need to focus not only on the precision of answers, but also on the diversity of answers (i.e., recall rate). In AmbigQA, we can see that the proposed model outperforms JPR, indicating its superior ability to recall multiple feasible answers.

B Zero-shot evaluation on AmbigQA
We also test the proposed model and baselines on AmbigQA in the zero-shot setting following Min et al. (2020). In zero-shot evaluation, the models are trained using partial supervision only (i.e., single-answer NQ-Open (Kwiatkowski et al., 2019)), and are evaluated on the multi-answer data of AmbigQA. This setting reflects a practical scenario where only single-answer datasets are available. Note that the zero-shot evaluation on AmbigQA allows the model to tune some hyper-parameters (e.g., threshold of generation probability (Min et al., 2020)) using development data, which may make the setting not zero-shot in the strictest sense. The compared models are:
(1) DPR and SpanSeqGen, in which the models trained on NQ-Open are adopted to predict multiple answers via a thresholding strategy (Min et al., 2020).
(2) FiD with various decoding methods, in which FiD trained on NQ-Open produces multiple answers through (a) Nucleus sampling with {p=0.8, t=0.8}; (b) Top-k sampling with {k=40, t=0.8}; and (c) Diverse beam search with {b=3, t=0.8, diversity_penalty=0.5}. We also evaluate FiD with greedy decoding that generates one answer for each question as the default setting of FiD.
(3) AmbigPrompt, in which the FiD answering model prompted by our proposed answer-conditional prompting model is trained on NQ-Open with our task-adaptive post-pretraining method and produces multiple answers through iterative prompting.
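For readers unfamiliar with the sampling-based decoding methods listed in (2), the sketch below shows how temperature scaling, top-k truncation, and nucleus (top-p) truncation reshape a next-token distribution before sampling. It is a generic illustration with hypothetical function names, not the decoding code used in the experiments, and diverse beam search is not covered.

```python
import math
from typing import Dict


def apply_temperature(logits: Dict[str, float], t: float) -> Dict[str, float]:
    """Softmax over logits divided by temperature t (t < 1 sharpens the distribution)."""
    scaled = {token: value / t for token, value in logits.items()}
    peak = max(scaled.values())
    exps = {token: math.exp(value - peak) for token, value in scaled.items()}
    total = sum(exps.values())
    return {token: e / total for token, e in exps.items()}


def top_k_filter(probs: Dict[str, float], k: int) -> Dict[str, float]:
    """Keep the k most probable tokens and renormalize."""
    kept = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


def nucleus_filter(probs: Dict[str, float], p: float) -> Dict[str, float]:
    """Keep the smallest set of top tokens whose cumulative mass reaches p, then renormalize."""
    kept, cumulative = [], 0.0
    for token, prob in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    return {token: prob / cumulative for token, prob in kept}
```

Sampling repeatedly from the filtered distribution yields different answers across runs, which is how a single-answer model can be made to propose several candidates.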
The results are listed in Table 5. The FiD variants outperform DPR and SpanSeqGen as they utilize more passages that potentially cover more feasible answers. FiD with nucleus sampling obtains the best results among different decoding methods. AmbigPrompt achieves the best zero-shot performance on AmbigQA and also outperforms the high-capacity supervised baseline JPR on the multi-answer subset.

C Retrieval results
We train the dense retrieval model on NQ-Open using in-batch negatives with batch size 64. The retrieval model is initialized from CoCondenser (Gao and Callan, 2021). Our retrieval corpus is the English Wikipedia from 12/20/2018. Table 6 lists the retrieval results on NQ-Open and AmbigQA.
In NQ-Open, we use Recall@k (R@k for short) as the metric, which considers retrieval to be successful if at least one answer is included in the top-k ranked passages. In AmbigQA, we use MRecall@k (MR@k for short) as the metric, which considers retrieval to be successful if all answers or at least k answers in the answer set A are covered by the top-k ranked passages. From the results, we see that our retrieval model achieves comparable results against baseline retrieval models, but underperforms reranking models such as KPR and MonoT5.
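The two retrieval metrics defined above can be written down directly. The sketch assumes that a passage "covers" an answer when the answer string occurs in the passage, case-insensitively; real evaluation scripts typically apply more careful normalization and alias handling.

```python
from typing import List


def covers(passage: str, answer: str) -> bool:
    """Coverage test (assumption): case-insensitive substring match."""
    return answer.lower() in passage.lower()


def recall_at_k(ranked_passages: List[str], answers: List[str], k: int) -> bool:
    """Success if at least one answer appears in the top-k passages."""
    top = ranked_passages[:k]
    return any(covers(passage, answer) for passage in top for answer in answers)


def mrecall_at_k(ranked_passages: List[str], answers: List[str], k: int) -> bool:
    """Success if all answers, or at least k of them, appear in the top-k passages."""
    top = ranked_passages[:k]
    covered = sum(1 for answer in answers if any(covers(p, answer) for p in top))
    return covered >= min(k, len(answers))
```

Averaging these per-question outcomes over a dataset gives R@k and MR@k. MRecall is the stricter of the two, since one lucky passage is not enough when a question has several answers.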

D Case study
We present some examples in Table 7 and Table 8.

Table 7: An example on AmbigQA dev shows that the proposed method AmbigPrompt finds all valid answers.
Question: Who was the bond girl in you only live twice?

Passages
Severine | She had also categorized Aki and Kissy Suzuki, both from "You Only Live Twice" (1967), as falling into this trope. She supported this assessment by pointing to the character's lack of agency and impact on "Skyfall"'s main narrative, and summed up Sévérine as "one of the most disempowered, pitiful, and tragic women in the Bond film franchise".