ASQA: Factoid Questions Meet Long-Form Answers

Recent progress on open-domain factoid question answering (QA) does not easily transfer to the task of long-form QA, where the goal is to answer questions that require in-depth explanations. The hurdles include a lack of high-quality data and the absence of a well-defined notion of an answer's quality. In this work, we address these problems by releasing a novel dataset and task that we call ASQA (Answer Summaries for Questions which are Ambiguous), and by proposing a reliable metric for measuring performance on ASQA. Our task focuses on ambiguous factoid questions which have different correct answers depending on the interpretation. Answers to ambiguous questions should combine factual information from multiple sources into a coherent long-form summary that resolves the ambiguity. In contrast to existing long-form QA tasks (such as ELI5), ASQA admits a clear notion of correctness: a user faced with a good summary should be able to answer different interpretations of the original ambiguous question. Our analysis demonstrates an agreement between this metric and human judgments, and reveals a considerable gap between human performance and strong baselines.


Introduction
In the last few years, the factoid question answering (QA) task of extracting short answers to factoid questions has witnessed significant progress (Lee et al., 2019; Guu et al., 2020; Karpukhin et al., 2020; Lewis et al., 2020; Izacard and Grave, 2021). The progress was achieved in large part thanks to (i) the availability of high-quality datasets (Voorhees and Tice, 2000; Joshi et al., 2017; Yang et al., 2018; Abujabal et al., 2019; Kwiatkowski et al., 2019), and (ii) a well-defined notion of correctness. A key challenge for ongoing research now lies in long-form question answering, where the goal is to generate detailed explanations in response to questions that require elaborate and in-depth answers.
There is much less data available for the task of long-form QA. One of the primary data sources is the ELI5 dataset (Fan et al., 2019) that pairs open-ended questions with paragraph-long answers written by users of the "Explain Like I'm Five" Reddit forum. However, questions in ELI5 are very general (e.g., "How can different animals perceive different colors?") and can be answered in myriad different ways, making it hard to define objective criteria for a good answer. As a result, Krishna et al. (2021) identify several hurdles in using this data toward meaningful modeling progress, including a lack of reliable evaluation metrics.
In this work, we address the lack of data sources and the unreliability of evaluations by constructing a long-form QA dataset for factoid questions. Our paper is motivated by the work of Min et al. (2020), who observe that more than half of the factoid questions that occur naturally are ambiguous. For example, a seemingly simple question, "Who was the ruler of France in 1830?", is ambiguous because there were two rulers of France in 1830. Min et al. (2020) collected the AMBIGQA dataset that connects ambiguous factoid questions with disambiguations: pairs of disambiguated questions and unique short answers to these questions (see the example on the right side of Figure 1).
We note, however, that ambiguous questions often arise when a user lacks background knowledge about why there might be multiple answers to their question, and how those answers relate to each other. Thus, the list of disambiguations may not be satisfactory for the user. For example, the fact that in 1830 the ruler of France changed due to the revolution is highly salient but is not captured in the list of disambiguations.

Figure 1: The input questions in ASQA are sourced from AMBIGQA. Long-form answers must be sufficient to answer disambiguated questions from AMBIGQA (short answers are marked in blue and green), and should introduce additional knowledge from Wikipedia (highlighted in red) to resolve ambiguity and clarify the relationship between different short answers. The DR score we propose combines ROUGE and Disambiguation-accuracy (Disambig-Acc) metrics, overcoming the issues with long-form QA evaluation outlined by Krishna et al. (2021).
In this paper, we argue for the importance of generating long-form answers to ambiguous factoid questions. To that end, we present ASQA (Answer Summaries for Questions which are Ambiguous), a novel dataset that pairs each ambiguous question from AMBIGQA with a crowdsourced long-form answer.¹ The answers we collect aim to (i) explain the source of ambiguity in the question, and (ii) connect all the valid short answers into a coherent passage. An example ASQA instance is shown in Figure 1.

¹ Data, evaluation scripts, and other supplementary materials are provided on the project's GitHub repository: https://github.com/google-research/language/tree/master/language/asqa
The main feature of ASQA is a combination of (i) a well-defined notion of correctness pertinent to factoid QA and (ii) the complexity of long-form QA. First, observe that a good answer to an ambiguous question should be sufficient for the user to answer different interpretations of the question. This observation induces a notion of correctness that is conceptually similar to the conventional accuracy in factoid QA. Second, to answer an ambiguous question, a system needs to retrieve a diverse set of documents that address different interpretations of the question and synthesize this information into a coherent summary. Thus, the key challenges of long-form QA, precise retrieval and high-quality summarization, are present in ASQA.
Contributions Overall, our work makes several contributions:
• First, we carefully develop a crowdsourcing pipeline and collect ASQA, a dataset of high-quality long-form answers to 6,316 ambiguous factoid questions.
• Second, we design principled evaluation procedures for ASQA: (i) we propose a novel automated evaluation metric (DR) that combines the correctness aspect of factoid QA and the fluency aspect of long-form QA; (ii) we develop and release a convenient interface for human evaluations; (iii) we conduct a small-scale human study that shows a high agreement between our automated metric DR and human judgments.
• Third, we establish strong baselines for our task by combining joint passage retrieval (Min et al., 2021) and T5-large (Raffel et al., 2019). Our extensive evaluations demonstrate that there is a large gap between the baselines and human performance. Additionally, we highlight areas of improvement for future research on ASQA.

Related Work
In this section, we describe relevant works that propose new tasks, datasets, and methods for QA and summarization problems.
Extractive QA Much of the existing work on question answering, including reading comprehension (Rajpurkar et al., 2016, 2018; Trischler et al., 2017; Yang et al., 2018), open-domain QA (Kwiatkowski et al., 2019; Joshi et al., 2017) and dialog-based QA (Choi et al., 2018), assumes that questions have unique answers. Min et al. (2020) relax this assumption and propose a task that aims at identifying all possible short answers to the ambiguous subset of the open-domain version of the NQ dataset, denoted NQ-OPEN (Kwiatkowski et al., 2019; Lee et al., 2019). The AMBIGQA dataset constructed by Min et al. (2020) serves as a building block of the present work, and we provide more details on this dataset in Section 3. Another related effort is the CONDITIONALQA task (Sun et al., 2021) that requires systems to identify conditions under which the extracted answers are valid. Unlike the ASQA task, the answers in CONDITIONALQA come from a document provided in advance and do not need to be summarized into a single response.
Generative QA Extractive models achieve good results when the answer to the question is readily available on the web. However, in many settings, including ambiguous factoid questions, a system needs to combine information from many (unknown) sources to present the answer to the user in a convenient way. Hence, in this work, we focus on the generative QA setting where a model needs to generate a textual answer rather than extract it. Datasets for generative QA include NARRATIVEQA (Kočiský et al., 2018) and COQA (Reddy et al., 2019), but the average answer length in these datasets is small: 4.7 and 2.7 tokens, respectively. The MS MARCO Natural Language Generation (MS-NLG) dataset by Nguyen et al. (2016) combines both extractive and generative tasks and contains slightly longer human-generated answers (usually a sentence long) that can be read by a smart assistant. Fan et al. (2019) proposed a more challenging task of answering open-ended (e.g., "why?") questions. They scraped the "Explain Like I'm Five" Reddit forum and released a dataset of ∼272K questions, where each question is supplied with several paragraph-long answers generated by the Reddit users. We overview the differences between ASQA, ELI5 and MS-NLG in Section 3.3.
Recently, large language models such as GPT-3 (Brown et al., 2020) have been successfully applied to the task of long-form QA using the ELI5 dataset (Nakano et al., 2021). This involved a two-step human-in-the-loop approach: first, demonstrations of annotators navigating the web to write answers were collected; second, a reward model (Stiennon et al., 2020) was trained by manual pairwise comparisons of answers. In ASQA, relevant passages for the answer are already provided by the annotators, and we show that the proposed DR score correlates well with the human judgment of answer quality. Using this automated metric in place of the reward model in the approach of Nakano et al. (2021) is a potential direction for future work.
Summarization Given a set of documents relevant to the question (either ground truth or obtained using retrieval), the problem of generating a long-form answer reduces to query-based multi-document summarization. A small-scale dataset for this task was introduced as part of the DUC tasks (Dang, 2005). Recent work on building large-scale datasets has instead focused either on query-based summarization from a single document (Nema et al., 2017; Zhong et al., 2021) or on multi-document summarization without queries (Liu et al., 2018; Fabbri et al., 2019). In addition to the QA task, the ASQA dataset is suitable for evaluating systems' accuracy in the summarization setting, where the ground-truth passages containing the relevant information are assumed to be given.

QA-Based Evaluation
Prior work has looked at using question answering techniques to evaluate factual consistency in summarization (Wang et al., 2020; Durmus et al., 2020) and dialogue (Honovich et al., 2021). These works automatically generate questions from the system-produced text and search for answers in some reference text (e.g., the input being summarized) to evaluate the quality of the output. Instead, to evaluate generated long-form answers to ambiguous questions, in ASQA we use questions created by AMBIGQA annotators.

ASQA Task and Data
In this section, we introduce the ASQA task and the underlying data-collection process. The ASQA task is illustrated in Figure 1. The goal of the task is to write a comprehensive paragraph-long answer â to a given ambiguous question q.
Source Data We build ASQA on top of the subset of ambiguous questions identified in the AMBIGQA dataset. Out of a total of 14,042 AMBIGQA questions, 7,207 are identified as ambiguous by at least one AMBIGQA annotator. Each of these ambiguous questions q is paired with a list of n disambiguations $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i$ denotes a disambiguated question and $y_i$ denotes a unique short answer to $x_i$. The number of disambiguations ranges from 2 to 46 per ambiguous question. To ensure that it is feasible to put all this information into a coherent story, we remove 417 questions with more than six disambiguations from consideration, thereby focusing on 6,790 AMBIGQA instances that we use as a starting point for building our task.
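As a rough sketch, the source data and this filtering step can be represented as follows; the type and field names (AmbigInstance, question, disambiguations) are illustrative and do not reflect the schema of the released files.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class AmbigInstance:
    question: str                           # ambiguous question q
    disambiguations: List[Tuple[str, str]]  # pairs (x_i, y_i): disambiguated question and short answer

def select_instances(instances: List[AmbigInstance], max_disambiguations: int = 6) -> List[AmbigInstance]:
    """Keep ambiguous questions with between 2 and max_disambiguations interpretations."""
    return [
        inst for inst in instances
        if 2 <= len(inst.disambiguations) <= max_disambiguations
    ]
```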

ASQA Annotation Objectives
At a high level, the goal of the annotation process is to obtain high-quality long answers to ambiguous questions. We begin with a formulation of criteria for what counts as a good long answer to an ambiguous question:
• Completeness The long answer should contain all valid short answers $y_1, \ldots, y_n$ to the disambiguated questions $x_1, \ldots, x_n$ in an appropriate context.
• Comprehensiveness The long answer should provide enough details for the user to (i) understand the source of ambiguity in the original question and (ii) understand the relationship between different short answers.
• Fluency The long answer should be coherent and fluent.
• Attributability The long answer should be grounded in an underlying source of information (in our case, Wikipedia).

ASQA Annotation Process
To ensure that annotations satisfy the aforementioned objectives, we develop a custom annotation interface (Figure 2) and recruit native English speakers to perform our task. We then collect long-form answers for each target instance of AMBIGQA using a commercial crowdsourcing platform where it is possible to interact with the annotators on an ongoing basis. Let us now discuss the key components of our annotation pipeline.

Input to Annotators
The left side of Figure 2 illustrates the input to our annotation procedure. Annotators are given relevant aspects of the target AMBIGQA instance: the ambiguous question q, the list of disambiguations $\{(x_i, y_i)\}_{i=1}^{n}$, and the Wikipedia pages W visited by AMBIGQA annotators. Additionally, to help annotators understand the context behind the disambiguations without reading full Wikipedia articles, for each disambiguation i we provide a (possibly empty) Wikipedia passage $C_i$ with information relevant to the disambiguation. Details on the procedure used to find these context passages $\{C_i\}_{i=1}^{n}$ are given in Appendix A.

Output of Annotation
The key output of annotation is a long-form answer a to a given ambiguous question q. Additional elements of the output are introduced to facilitate the requirement of attributability. To that end, we require annotators to provide the source Wikipedia passage e for each piece of additional information they bring to their answer. Our interface has designated fields for additional knowledge (see Figure 2), and annotators can add as many of these fields as they need to include any number m of evidence passages $\{e_j\}_{j=1}^{m}$.

Instructions, Training and Quality Control We carefully design instructions, a training procedure, and quality control tools to minimize the amount of noise in annotations. Details on these aspects of the annotation pipeline are provided in Appendix A.

ASQA Dataset
By following the procedure outlined above, we annotated the train, dev, and test splits of the AMBIGQA dataset. Each question in the train split was annotated by a single annotator, while the dev and test splits have two annotations per question.
For 474 questions, our annotators raised concerns regarding the validity of the AMBIGQA disambiguations. Not all of these concerns necessarily indicate errors in the AMBIGQA dataset, as some of them could be due to misinterpretation on the annotators' side. Nevertheless, to maintain data fidelity, we exclude the corresponding instances from the resulting dataset. Table 1 displays the final breakdown of the ASQA dataset.
Table 2 compares ASQA to other open-domain QA datasets: ELI5, MS-NLG, AMBIGQA, and NQ-OPEN. We observe that ASQA requires long answers with an average length of 64.8 words (vs. 103.0 for ELI5 and 14.6 for MS-NLG), and is the only dataset that admits evaluations in terms of both ROUGE, which is typically used for long-form QA, and accuracy, which is typically used for factoid QA. This makes ASQA an appealing dataset as it enables researchers to work on long-form QA while retaining the benefits of reliable objective evaluation typical in factoid QA.
Additional Comparison to ELI5 ELI5 is the closest existing long-form QA dataset. We now provide additional comparison of ASQA and ELI5.

Support Documents First, both ASQA and ELI5 supplement annotations with relevant information retrieved from Wikipedia (ASQA) or the whole Internet (ELI5). For ELI5, support documents are retrieved automatically and independently of the annotation process. The resulting documents contain, on average, 858 words. Manual analysis conducted by Fan et al. (2019) reveals that support documents are sufficient to answer 65% of the questions and have information relevant to 92% of the questions.
In ASQA, support documents are constructed as a part of the annotation process. For each annotation, the support document contains disambiguations from AMBIGQA, context paragraphs, and additional knowledge provided by the corresponding annotator (see Section 3.2 for details). On average, support documents contain 225 words, being much shorter than those for ELI5. By design of our annotation procedure, support documents should be sufficient to write long-form answers to ambiguous questions. Indeed, we observe that 92% of the annotations' tokens are present in the corresponding support documents.² If we exclude AMBIGQA disambiguations from the support documents, their average length reduces to 172 words, but 78% of tokens from the answers remain captured therein. These observations demonstrate that ASQA satisfies the requirement of attributability (Section 3.1).
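One simple way to compute this kind of token coverage is sketched below; the lowercasing and whitespace tokenization are our assumptions rather than the exact procedure behind the reported numbers.

```python
def token_coverage(answer: str, support_document: str) -> float:
    """Fraction of (lowercased, whitespace-tokenized) answer tokens
    that also appear in the support document."""
    answer_tokens = answer.lower().split()
    support_tokens = set(support_document.lower().split())
    if not answer_tokens:
        return 0.0
    covered = sum(tok in support_tokens for tok in answer_tokens)
    return covered / len(answer_tokens)
```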
Inter-Annotator Agreement Second, we compare the inter-annotator agreement in ELI5 and ASQA, which we measure as the mean ROUGE-L F1 score between each pair of annotations for the same question. Our analysis reveals that ASQA has a much higher level of inter-annotator agreement: 49.6 vs. 16.9 for ELI5. Thus, ASQA admits a better-defined notion of ground truth than ELI5.
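A minimal sketch of this agreement computation, assuming the rouge_score package and that all annotations for a question are grouped together (stemming and other ROUGE settings are assumptions):

```python
from itertools import combinations
from rouge_score import rouge_scorer

def mean_pairwise_rouge_l(annotations_per_question):
    """annotations_per_question: list of lists of answer strings (one inner list per question)."""
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    scores = []
    for answers in annotations_per_question:
        for ref, hyp in combinations(answers, 2):
            scores.append(scorer.score(ref, hyp)["rougeL"].fmeasure)
    return 100 * sum(scores) / len(scores)  # reported as a percentage, e.g. 49.6 for ASQA
```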
Note that answers in ELI5 are written by Reddit users. Thus, they are inherently subjective and are not supposed to follow any predefined criteria. This diversity and subjectiveness could make human evaluation of the ELI5 answers challenging. In contrast, ASQA annotators follow common annotation guidelines and undergo a thorough training procedure, thereby aiming at generating answers that satisfy a set of well-defined criteria for human evaluation (Section 3.1).
Overall, compared to other datasets, ASQA has some novel features that may be useful for future QA research. Its benefits, however, come at the cost of a much smaller sample size than that of MS-NLG and ELI5. Thus, we believe MS-NLG and ELI5 may be useful counterparts for ASQA as they can be used for pre-training (that said, we leave this exploration to future work).

ASQA Metrics
In this section, we introduce metrics that we propose to evaluate performance on the ASQA task.

Automated Evaluation
We evaluate performance on the ASQA task along the following two aspects.
ROUGE Following the conventional approach for measuring the quality of generated text, we report the ROUGE-L score (Lin, 2004) in a multi-reference setup.³ Given that each example in the development and test sets is annotated by two annotators, we compare predictions against both answers and take the maximum of these two scores to be the score of the prediction.
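A sketch of this multi-reference scoring, again assuming the rouge_score package; the exact ROUGE-L variant and preprocessing used for the official numbers may differ.

```python
from rouge_score import rouge_scorer

_scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def multi_reference_rouge_l(prediction: str, references: list) -> float:
    """Score a prediction against each reference answer and keep the best match."""
    return max(
        _scorer.score(ref, prediction)["rougeL"].fmeasure for ref in references
    )
```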
Disambiguation Metrics A good long-form answer to an ambiguous question should contain short answers to all disambiguated questions as well as the context necessary to understand the source of ambiguity and the relationship between the short answers. However, ROUGE-L is not well suited for evaluating these aspects as it may fail to distinguish between two fluent and stylistically similar answers which provide considerably different information. Therefore, we complement ROUGE-L with two metrics that are specifically designed to capture the completeness and comprehensiveness aspects of our task:
• STR-EM (String Exact Match) The fraction of disambiguations for which the corresponding short answer is present in the long answer (exact match). The fraction is computed within each question and then averaged across all questions.
• Disambig-F1 We follow the reading comprehension literature (Rajpurkar et al., 2016, 2018) and use RoBERTa (Liu et al., 2019) trained on SQUADV2 to evaluate the fraction of disambiguated questions that can be answered from the predicted long answers.⁴ For each disambiguation $x_i^{(k)}$ in the $k$-th example, we apply the SQUADV2 model on the generated long-form answer $\hat{a}^{(k)}$ to predict a short answer $\hat{y}_i^{(k)}$ after normalizing answer strings in the manner done for SQUADV2 evaluations. Then the Disambig-F1 score is given by
$$\text{Disambig-F1} = \frac{1}{N} \sum_{k=1}^{N} \frac{1}{n^{(k)}} \sum_{i=1}^{n^{(k)}} \mathrm{F1}\!\left(\hat{y}_i^{(k)}, y_i^{(k)}\right),$$
where N indicates the total number of instances being evaluated, and $n^{(k)}$ indicates the number of disambiguations for the $k$-th instance. (A sketch of both disambiguation metrics is given after the description of the DR score below.)
Overall: DR Score Both ROUGE-L and the disambiguation metrics are crucial for our task. Hence, we propose an overall DR (Disambiguation-Rouge) score that combines the two metrics as follows:
$$\mathrm{DR} = \sqrt{\text{ROUGE-L} \times \text{Disambig-F1}}.$$
We choose the geometric mean for aggregation to penalize methods that maximize one metric at the cost of the other. Note that STR-EM and Disambig-F1 aim at measuring the same aspect, so we include only one of these metrics in the DR score.
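The sketch below puts the disambiguation metrics and the DR score together. The SQuAD-style normalization and token-level F1 follow the standard SQuAD evaluation recipe, and the publicly available deepset/roberta-base-squad2 checkpoint is used only as a stand-in for the RoBERTa SQUADV2 model mentioned above; the authors' exact checkpoint, answer aliases, and aggregation details may differ.

```python
import math
import re
import string
from collections import Counter

from transformers import pipeline

# Stand-in reading-comprehension model; the exact SQuAD v2 checkpoint used by the authors may differ.
qa_model = pipeline("question-answering", model="deepset/roberta-base-squad2")

def normalize(text: str) -> str:
    """SQuAD-style normalization: lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted and a reference short answer."""
    pred, ref = normalize(prediction).split(), normalize(reference).split()
    common = Counter(pred) & Counter(ref)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def str_em(long_answer: str, short_answers: list) -> float:
    """STR-EM for one question: fraction of short answers contained verbatim in the long answer."""
    return sum(normalize(a) in normalize(long_answer) for a in short_answers) / len(short_answers)

def disambig_f1(long_answer: str, disambiguations: list) -> float:
    """Disambig-F1 for one question; disambiguations is a list of (question, short answer) pairs."""
    scores = []
    for question, reference in disambiguations:
        predicted = qa_model(question=question, context=long_answer)["answer"]
        scores.append(token_f1(predicted, reference))
    return sum(scores) / len(scores)

def dr_score(rouge_l: float, disambig: float) -> float:
    """DR: geometric mean of corpus-level ROUGE-L and Disambig-F1 (both on the same scale)."""
    return math.sqrt(rouge_l * disambig)
```

Corpus-level Disambig-F1 is then the mean of disambig_f1 over all evaluated questions, matching the formula above.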

Human Evaluation
We also design an interface for human evaluations for the ASQA task with the following metrics.
• Disambiguation Accuracy For each long-form answer, we ask human annotators to verify whether each disambiguated question from the AMBIGQA dataset can be correctly answered using the provided information. We then report the average number of disambiguations that are captured in the long-form answers (ACC).
• Pairwise Comparisons We propose a pairwise evaluation scheme where annotators need to compare two long-form answers to the same question.
We ask annotators to choose the better answer in terms of each of the three criteria: Comprehensiveness (COMP), Fluency (FLUE), and Human Overall impression (HO). In each pairwise comparison, an answer is given one point for a victory and half a point for a tie. We then normalize model scores into percentages by dividing the total number of points a model receives by the number of pairwise comparisons.
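A small sketch of this scoring scheme; the data layout (one record per pairwise comparison) is illustrative.

```python
from collections import defaultdict

def pairwise_scores(comparisons):
    """comparisons: list of (system_a, system_b, winner) tuples, where winner is
    system_a, system_b, or None for a tie. Returns per-system scores in percent."""
    points = defaultdict(float)
    counts = defaultdict(int)
    for a, b, winner in comparisons:
        counts[a] += 1
        counts[b] += 1
        if winner is None:      # tie: half a point each
            points[a] += 0.5
            points[b] += 0.5
        else:                   # victory: one point to the winner
            points[winner] += 1.0
    return {system: 100 * points[system] / counts[system] for system in counts}
```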

Experimental Setup
We now describe the baseline models and human answers used in our experiments.

Models
We include the following models for comparison.
Naïve The naïve model (denoted as QUESTION) repeats the ambiguous question eight times.

Retrieval-Only
The retrieval-only models retrieve a Wikipedia passage as the answer:
• DPR@1. DPR (Karpukhin et al., 2020) is a BERT-based dual encoder trained on NQ.
• JPR@1. JPR (Min et al., 2021) trains a reranker on top of DPR for questions with multiple answers in AMBIGQA. The JPR model is the state-of-the-art retriever for AMBIGQA.
Generative We also evaluate T5-large based generative models (Raffel et al., 2019) in two regimes:
• T5 Closed Book (T5-C). We train T5 to answer ambiguous questions without providing any additional passages from Wikipedia. The model relies only on its pretrained knowledge to answer the question (Roberts et al., 2020).
• T5 Open Book (T5-O). The T5 model is additionally provided with context paragraphs retrieved by JPR. We vary the number of top-K retrieved paragraphs used as input to T5, denoting the corresponding model as T5-O-K.
Oracle To investigate the headroom in retrieval systems, we experiment with an ORACLE system: T5-large provided with the gold supporting documents. The input to ORACLE includes all the disambiguations $\{(x_i, y_i)\}_{i=1}^{n}$ and contexts $\{C_i\}_{i=1}^{n}$ shown to the annotators (left half of Figure 2), as well as the additional knowledge pieces $\{e_j\}_{j=1}^{m}$ identified by one of the two annotators (the one with the longest answer). This system can be thought of as a generative model that has access to a perfect retriever. In evaluations, we compute ROUGE-L by comparing the answer predicted by ORACLE against the answer of the annotator whose additional knowledge pieces were not in the input of ORACLE (instead of the usual comparison against two references).
Appendix B provides more details on the modeling aspects of our evaluations.

Human Performance
We also evaluate two sets of human answers:
• Human performance with context (HP-W/-C). We use reference ASQA answers in our comparisons. Recall that the ASQA annotators were provided with context: disambiguations from AMBIGQA $\{(x_i, y_i)\}_{i=1}^{n}$ and the context paragraphs we retrieved $\{C_i\}_{i=1}^{n}$. We consider performance in this setup as an upper bound on human performance. In evaluations of ROUGE-L, we compute the score of HP-W/-C by comparing the answers from the two annotators against each other (instead of the usual comparison against two references).
• Human performance without context (HP-W/O-C). To establish a conservative lower bound on human performance, we additionally annotate 200 questions from the ASQA dev set (one annotation per question) in the "no context" regime.
Annotators in this regime are only given ambiguous questions as input (no disambiguations or context paragraphs) and need to search for disambiguations and the required additional information on their own.

Results
We evaluate all models introduced above in the automated evaluations. Additionally, we conduct a small-scale human study involving a subset of models to provide some verification of the automated evaluation results. Specifically, our human study involves four model outputs (JPR@1, T5-C, T5-O-1, T5-O-5) and two sets of human-generated answers (HP-W/O-C, HP-W/-C), which are juxtaposed on a subset of 45 randomly chosen questions from the development set of ASQA. For each of these questions, the six target answers are split into three pairs, and pairwise comparisons are conducted by the authors of this paper in a blind manner.

Importance of Retrieval Models that take the output of a retrieval system (T5-O-1/3/5) perform much stronger than the closed-book model (T5-C) on both the automated metrics and the human evaluation. T5-O-1 outperforms T5-C by 20.0 points on human evaluation (HO) and by 12.8 points on DR; T5-O-5 outperforms T5-C by 15.6 points on HO and by 17.0 points on DR. Following Krishna et al. (2021), we also experimented with a random retrieval baseline where, during inference, the model was provided with randomly selected passages from the training set. This baseline achieves a DR of only 7.8, further confirming that, in contrast to ELI5, retrieval is very important for ASQA.
Importance of Summarization Retrieval is very important for ASQA, but just using the top retrieved passage from a strong system (JPR@1) is not sufficient. Even though the STR-EM and Disambig-F1 metrics of JPR@1 are considerably higher than those of T5-O-1 (by 11.4 and 4.6 points, respectively), the human overall impression score HO and the DR score are similar across these models. This discrepancy is observed because the disambiguation metrics do not evaluate the conciseness of the answers, and the advantage of JPR@1 on these metrics is gained at the cost of increased answer length (196.8 words). In contrast, T5 models tend to generate shorter answers whose length is much closer to the average length of human references (65 words). Hence, in addition to including the correct information, answers in ASQA must be concise, which highlights the importance of summarization.
Correlation with Human Judgments Table 5 reports Pearson correlations between different automated metrics and the human judgments, enabling us to study the validity of the automated metrics. First, we observe that Disambig-F1 is better correlated with the human evaluations than ROUGE-L. That said, we note that ROUGE-L is an important metric as it enforces concise answers.
Second, observe that Disambig-F1 scores (Table 3) underestimate the human evaluations of ACC (Table 4). This discrepancy is likely due to: (i) a distribution shift between ASQA and SQUADV2; and (ii) the presence of distracting answers from the other disambiguated questions in the long answers, which are known to degrade QA models' accuracy (Jia and Liang, 2017). However, the almost perfect correlation between Disambig-F1 and ACC (99.3) implies that this discrepancy does not impact the ordering of the different systems, thereby enabling us to meaningfully evaluate the relative differences in performance. Additionally, the presence of strong distractors ensures that the Disambig-F1 metric cannot be easily gamed by mentioning all the short answers without appropriate context. Finally, we note that the DR score has the highest correlation with the overall human judgment HO among all automated metrics. While the difference with Disambig-F1 is not statistically significant, this observation hints at the importance of combining ROUGE-L and Disambig-F1 in the overall metric to take a holistic view of the model performance.
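For reference, the system-level correlation analysis can be reproduced along these lines, assuming one automated score and one human score per evaluated system (scipy provides the Pearson coefficient):

```python
from scipy.stats import pearsonr

def metric_human_correlation(automated_scores, human_scores):
    """Both arguments: lists of per-system scores, aligned by system.
    Returns the Pearson correlation coefficient in percent."""
    r, _p_value = pearsonr(automated_scores, human_scores)
    return 100 * r
```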
Remaining Headroom Both the upper bound (61.8 DR and 88.9 HO) and the lower bound (40.6 DR and 74.4 HO) on human performance significantly exceed the best model performance (T5-O-5 with 32.1 DR and 36.7 HO). Hence, there is a lot of headroom for the community to explore in ASQA. We report some additional insights that may be helpful for future work in Section 7.

Analysis
We now conduct additional analysis that provides insights on the ASQA task.

Headroom in Summarization
As shown in Figure 3, the Disambig-F1 score of retrieval-based methods increases considerably as the number of retrieved passages increases. However, there is a big gap between T5 and JPR, even though T5 takes the output passages from JPR as an input. This indicates that T5 tends to either lose information while summarizing the passages or produce outputs that are inconsistent with its input. Moreover, the Disambig-F1 of JPR@5 already exceeds the lower bound on human performance. Thus, progress in summarization alone may be sufficient to raise the overall level of performance on ASQA to this lower bound.
To provide further insight into the summarization aspect of our task, we conduct a manual analysis of the answers generated by the open-book T5-O-5 model. Our analysis identifies several characteristic mistakes (hallucination, question misunderstanding, and repetitions) that need to be addressed to improve performance on the ASQA task. More details on this evaluation are provided in Appendix C.

Headroom in Retrieval

Figure 3 compares models by Disambig-F1; a higher score means that the passage generated by a model provides answers to more disambiguated questions. We observe that the best-performing retrieval system, JPR@5, lags behind the output of the ORACLE model by 14.4 points and the human upper bound by 32.6 points. Hence, improving the retrieval step for ASQA is also important.

Conclusion
In contrast to existing datasets for long-form QA, ASQA admits a clear notion of correctness that we use to define an overall metric of performance (DR). Our empirical evaluations demonstrate that DR correlates well with human judgment, and that there is a large gap between human performance and the strong baselines. Thus, we believe that ASQA is an appealing task for the QA community. Our analysis suggests that strong performance on ASQA is contingent upon both high-quality retrieval and summarization. These aspects constitute important directions for future work on ASQA.

Limitations
We now make two remarks that we urge the reader to consider when interpreting the results of this work.
Inter-Annotator Agreement In Section 3.3, we observed that inter-annotator agreement in ASQA is higher than in ELI5. We note, however, that the high inter-annotator agreement in ASQA is contingent upon the high inter-annotator agreement in the AMBIGQA dataset. Indeed, AMBIGQA disambiguations serve as a shared source of information between the two ASQA annotators working on the same instance, potentially inflating the level of agreement.
That said, Min et al. (2020) observe that human annotators have a decent level of agreement in constructing the disambiguations in AMBIGQA, thereby supporting the observation that ASQA is more objective than ELI5.
Evaluation Metrics Second, we caveat that our accuracy metrics (STR-EM and Disambig-F1) only measure the recall of the required information in the long answers. In cases where the long answer hallucinates incorrect disambiguations or facts, the accuracy metrics may still be high as long as the correct disambiguations are included. We note, however, that this unnecessary extra information may still be penalized by the ROUGE-L metric. Moreover, in the presence of distractors, we also expect the accuracy of the RoBERTa model used for reading comprehension to degrade, thereby effectively penalizing low precision.
On a separate note, the Disambig-F1 metric requires a high-accuracy QA system. Hence, for domains that are significantly different from Wikipedia, fine-tuning the RoBERTa SQUADV2 model on the task might be important to ensure the effectiveness of the Disambig-F1 metric.
and receive an answer within a day. To support this mechanism, we allowed annotators to "park" an annotation task they were unsure about and return to it once their concerns had been resolved.
Annotators' Well-Being For this study, we recruited annotators who were fully dedicated to our task (8 hours a day, 5 days a week). To reduce the pressure on annotators and allow them to work at a comfortable pace, we gave annotators one hour to answer each question and recommended answering ten or more questions per day. On average, it took annotators 15 minutes to answer each question, with the time consumption slightly decreasing as annotators got familiar with the task. The compensation rate for the task was set to $17.8/hour, which is higher than the minimum hourly wage in the US.

B Additional Details on Modeling
In this section, we provide additional details on the modeling aspect of our evaluations.
Input Format Figures 4 and 5 provide schematic representations of inputs to the T5-O-K and ORACLE models, respectively. Bold black text represents tags that separate conceptually different parts of the input; text in blue is replaced with the instance-specific content in the actual training and evaluation data.
The input to T5-O-K is simpler and consists of two parts separated by the context tag: an ambiguous question and K retrieved passages. Each retrieved passage consists of the info field that contains the retrieved passage and the wikipage field that displays the title of the source Wikipedia page. Retrieved passages are separated with the pipe symbol "|".
The input to the ORACLE model is more complex and has five parts, including:
• An ambiguous question q
• Additional knowledge pieces provided by the annotator $\{e_j\}_{j=1}^{m}$ (context2)
Similarly to the T5-O model, context paragraphs and additional knowledge pieces have info and wikipage fields, and the pipe symbol "|" is used to separate elements in the list.
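To make the format concrete, here is a sketch of how a T5-O-K input might be assembled from the description above; the literal tag strings ("question:", "context:", "info:", "wikipage:") are guesses based on the description of Figure 4 and may not match the exact templates.

```python
def build_t5_open_book_input(question: str, passages: list) -> str:
    """passages: list of (passage_text, wikipedia_page_title) pairs, e.g. the top-K JPR results.
    Retrieved passages are separated with the pipe symbol '|'."""
    rendered = " | ".join(
        f"info: {text} wikipage: {title}" for text, title in passages
    )
    return f"question: {question} context: {rendered}"
```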

C Qualitative Analysis
To provide further insight into the importance of the generation aspect of our task, we conduct a manual analysis of the answers generated by the T5 open-book model. Our main observation is that even if the knowledge necessary to answer an ambiguous question is successfully retrieved, T5 often struggles to provide a high-quality answer. Table 6 demonstrates several characteristic mistakes that we identify.
Hallucination The first two rows of Table 6 demonstrate examples of hallucination in the T5-generated answers. In the first example, T5 hallucinates facts about the 2016 elections (there were no elections in 2016) and about the winner of the 2017 elections (Rick Baker did not win the elections). In the second example, T5 starts with a wrong disambiguation (dragons do not marry people) and then mixes up facts about two characters from different books (Daenerys Targaryen and Elizabeth/Liz Pennykettle) into one.
Question Misunderstanding Another issue we observe in the T5-generated answers is that sometimes the answers provide a coherent story that is relevant to the question but does not answer it.This problem is illustrated in the third row of Table 6 where the question "When was «under God» added to the Pledge of Allegiance?" is answered with a dragons are often married to multiple people in a song of ice and fire storyline.in a song of ice and fire, the mother of dragons is known as elizabeth/liz pennykettle, a woman probably in her thirties who makes dragons out of clay and sells them at pottery fairs.she is the mother of lucy pennykettle, wife of the blinded exmonk arthur, and landlady of david rain.elizabeth/liz pennykettle -a woman probably in her thirties who makes dragons out of clay and sells them at pottery fairs.she is the mother of lucy pennykettle, wife of the blinded ex-monk arthur, and landlady of david rain.
A Song of Ice and Fire: . . .Daenerys is married off by her elder brother Viserys Targaryen to . . .Daenerys Targaryen: Viserys marries Daenerys to the powerful Dothraki warlord Khal Drogo in exchange for his military support . . .The Last Dragon Chronicles: Elizabeth/Liz Pennykettle -A woman probably in her thirties who makes dragons out of clay and sells them at pottery fairs.
In the first sentence of the answer, T5 hallucinates that dragons are married to humans in the A Song of Ice and Fire storyline.
In the second sentence, T5 suddenly jumps from a discussion of A Song of Ice and Fire to The Last Dragon Chronicles, thereby misleading the reader.
The blue text highlights repeated content.
Q: When was under god added to the pledge of alligence?Disambig-Q1: When was under god officially added to the pledge of alligence?June 14, 1954 Disambig-Q2: When did Bowman popularly introduce the pledge of alligence with under god added to a society meeting in Chicago?February 12, 1948 Disambig-Q3: When was the bill signed into law that under god was added to the pledge of Allegiance?Flag Day | June 14, 1954 Disambig-Q4: When was under god added to the pledge of Allegiance for the first time?February 12, 1948 Disambig-Q5: When did the board of directors of the Knights of Columbus adopt a resolution to add under god to the pledge of Allegiance?April 30, 1951 the pledge of allegiance of the united states is an expression of allegiance to the flag of the united states and the republic of the united states of america. it was originally composed by captain george thatcher balch, a union army officer during the civil war and later a teacher of patriotism in new york city schools.the form of the pledge used today was largely devised by francis bellamy in 1892, and formally adopted by congress as the pledge in 1942.the most recent alteration of its wording came on flag day, june 14, 1954.[Missing mention of "under god" words] Pledge of Allegiance Louis Albert Bowman, an attorney from Illinois, was the first to suggest the addition of "under God" to the pledge . . .At a meeting on February 12, 1948, he   history of the Pledge of Allegiance but does not mention the target phrase («under God»).
Repetitions Finally, we observe a somewhat technical issue of repetitions in the generated answers, as shown in the second row of Table 6.

Figure 2: Schematic representation of the annotation interface.

Figure 3: Disambig-F1 of different methods with a varying number of retrieved passages. Marker sizes are proportional to the answer lengths. The T5-O-K score increases with K, but there is also an increasing gap between T5-O-K and JPR@K; passages from the latter are used as input for the former.

Figure 5: Input to the ORACLE model.

Table 1: Summary statistics of the ASQA dataset.

Table 2: Comparison of ASQA with existing open-domain QA datasets. ASQA is the only QA dataset that allows for both ROUGE and accuracy evaluations. † Standard accuracy for non-ambiguous questions.

Table 3: Evaluation of baselines on the dev set of the ASQA task. T5 models with passages retrieved by JPR are the best models, but there is a large gap between human performance and model performance on all metrics. * As explained in Section 5, for ORACLE and HP-W/-C we only use one of the references to compute ROUGE-L.

Table 4: Results of human evaluations executed on a set of 45 questions from the development set of ASQA. The scores are percentages, and larger values are better. All metrics are specified in Section 4.2.

Table 5: Correlation between human and automated metrics. DR has the highest correlation with the overall human score HO among all automated metrics.

Table 6: Error analysis for T5-O-5. The colored text highlights problematic parts of the T5 output.