How Many Answers Should I Give? An Empirical Study of Multi-Answer Reading Comprehension

The multi-answer phenomenon, where a question may have multiple answers scattered in the document, is handled well by humans but remains challenging for machine reading comprehension (MRC) systems. Despite recent progress in multi-answer MRC, a systematic analysis of how this phenomenon arises and how to better address it is still lacking. In this work, we design a taxonomy to categorize commonly seen multi-answer MRC instances, with which we inspect three multi-answer datasets and analyze where the multi-answer challenge comes from. We further analyze how well different paradigms of current multi-answer MRC models deal with different types of multi-answer instances. We find that some paradigms capture well the key information in the questions while others better model the relationship between questions and contexts. We thus explore strategies to make the best of the strengths of different paradigms. Experiments show that generation models can be a promising platform to incorporate different paradigms. Our annotations and code are released for further research.


Introduction
In the typical setting of machine reading comprehension, such as SQuAD (Rajpurkar et al., 2016), the system is expected to extract a single answer from the passage for a given question. However, in many scenarios, questions may have multiple answers scattered in the passages, and all the answers should be found to completely answer the questions, such as the examples illustrated in Figure 1. Recently, a series of MRC benchmarks featuring multi-answer instances have been constructed, including DROP (Dua et al., 2019), Quoref (Dasigi et al., 2019), and MultiSpanQA (Li et al., 2022). Previous works categorize multi-answer questions according to the relations between answers (Li et al., 2022; Ju et al., 2022). Yet, they did not holistically consider the interaction between questions and contexts. We observe that in some cases the number of answers is indicated in the question itself (two players in Example A of Figure 1), while in others we have no idea until we read the documents carefully (Example B of Figure 1).
To better understand this challenge, we develop a taxonomy for the multi-answer phenomenon, based on how the number of answers is determined: the question itself suffices, or both the question and the passage should be taken into consideration. We annotate 6,857 instances from DROP, Quoref, and MultiSpanQA based on our taxonomy and find that the procedure of dataset construction has a large influence on the expressions in the questions. Most questions in crowdsourced datasets contain certain clues indicating the number of answers. By contrast, real-world information-seeking questions are less likely to specify the number of answers, which is usually dependent on the passages.
We further use our annotations to examine the performance of current MRC solutions regarding the multi-answer challenge (Hu et al., 2019; Segal et al., 2020; Li et al., 2022), which can be categorized into four paradigms, i.e., TAGGING, NUMPRED, ITERATIVE, and GENERATION. We analyze their strengths and weaknesses and find that some paradigms, e.g., NUMPRED, are good at capturing the key information in the questions, while others, e.g., ITERATIVE, can better model the relation between questions and contexts. This motivates us to investigate better ways to benefit from different paradigms.
Given the complementary nature of these paradigms, we wonder whether a combination of paradigms improves performance on multi-answer MRC. We explore two strategies, early fusion and late ensemble, to benefit from different paradigms. With a generation model as the backbone, we attempt to integrate the paradigms NUMPRED and ITERATIVE in a lightweight Chain-of-Thought style (Wei et al., 2022). Experiments show that the integration remarkably improves the performance of generation models, demonstrating that GENERATION is a promising platform for paradigm fusion.
Our contributions are summarized as follows: (1) We design a taxonomy for multi-answer MRC instances according to how the number of answers can be determined. It considers both questions and contexts simultaneously, shedding light on where the multi-answer challenge comes from. (2) We annotate 6,857 instances from 3 datasets with our taxonomy, which enables us to examine 4 paradigms for multi-answer MRC in terms of their strengths and weaknesses. (3) We explore various strategies to benefit from different paradigms. Experiments show that generation models are promising backbones for paradigm fusion.

Task Formulation
In multi-answer MRC, given a question Q and a passage P, a model should extract several spans, A = {a_1, a_2, ..., a_n} (n ≥ 1), from P to answer Q. Each span a_i ∈ A corresponds to a partial answer to Q, and the answer set A as a whole answers Q completely. These spans can be contiguous or discontiguous in the passage.
We distinguish between two terms, multi-answer and multi-span, which are often confused in previous works. Multi-answer indicates that a question should be answered with a complete set of entities or utterances. Multi-span is a definition from the perspective of answer annotations. In certain cases, the answer annotation of a question can be either single-span or multi-span, as explained in the next paragraph. Ideally, we expect the answers to a multi-answer question to be annotated as multi-span in the passage, where each answer is grounded to a single span, although some of them may be contiguous in the passage.
Q0: What's Canada's official language?
P: [...] English and French, are the official languages of the Government of Canada. [...]

For example, in Q0, there are two answers, English and French, to the given question. According to the annotation guidelines of SQuAD, one might annotate this instance with a single continuous span English and French. Yet, this form of annotation is not preferred in the multi-answer MRC setting. It blurs the boundaries between different answers and fails to explicitly denote the number of expected answers. Thus, it is suboptimal for comprehensive model evaluation. Instead, we suggest denoting each answer with a distinct span, say, annotating this instance with two spans, English and French. With this criterion, we can encourage models to disentangle different answers. With fine-grained answer annotations, we can also assess how sufficiently and precisely a model answers a question.
This annotation criterion generally conforms to the annotation guidelines of existing multi-answer datasets, e.g., DROP, Quoref, and MultiSpanQA. The few instances violating the criterion are considered bad annotations, as discussed in Section 4.2. See more remarks on the task formulation in Appendix A.

Taxonomy of Multi-Answer MRC
To better understand the challenge of multi-answer, we first design a taxonomy to categorize various multi-answer MRC instances. It assesses how the number of answers relates to the question or passage provided. Different from previous works that classify questions according to the distances or relations between multiple answers (Li et al., 2022; Ju et al., 2022), our taxonomy, taking both questions and passages into consideration, focuses on how the number of answers is determined. This enables us to analyze multi-answer questions and single-answer questions in a unified way. We illustrate our taxonomy in Figure 2 and elaborate on each category as follows.
Question-Dependent If one can infer the exact number of answers from the question without referring to the passage, the instance belongs to the question-dependent category. According to whether there are clue words that directly indicate the number of answers, this type is further divided into two sub-categories: (a) In a with-clue-words question, one can find a few words that indicate the number of answers. In Q1, the word two in the question indicates that two answers are expected.
Q1: What are the two official languages of Puerto Rico?
P: [...] English is an official language of the Government of Puerto Rico. [...] As another official language, Spanish is widely used in Puerto Rico. [...]

We group the clue words into five types: cardinal, ordinal, comparative/superlative, alternative, and other lexical semantics, as illustrated in Table 1.
(b) In a without-clue-words question, although we cannot locate obvious clue words, we can infer the number of answers from sentence semantics or commonsense knowledge. In Q2, we can determine that there is only one conversion result for the question based on sentence semantics rather than any single word.
Q2: 1 light year equal to how many km?
P: [...] The light-year is a unit of length used to express astronomical distances. It is about 9.5 trillion kilometres or 5.9 trillion miles. [...]

In Q3, we can infer that the question has only one answer, based on the commonsense that there is only one winner of a given Super Bowl.

Passage-Dependent In a passage-dependent instance, the question itself is not adequate to infer the number of answers. One needs to rely on the provided passage to decide how many answers are needed to answer the question. In Q4, we have no idea of the number of answers solely based on the question. If we refer to the passage, we will find ten answers to the question.

Datasets
We annotate the validation sets of three widely used multi-answer MRC datasets, i.e., DROP (Dua et al., 2019), Quoref (Dasigi et al., 2019), and MultiSpanQA (Li et al., 2022). The number of annotated questions is listed in Table 2 and more statistics are in Appendix B.
DROP is a crowdsourced MRC dataset for evaluating discrete reasoning ability. The annotators are encouraged to devise questions that require discrete reasoning such as arithmetic. DROP has four answer types: numbers, dates, single spans, and sets of spans. Since the first two types of answers are not always exact spans in the passages, we only consider the instances whose answers are single spans or sets of spans.
Quoref focuses on coreferential phenomena. The questions are designed to require resolving coreference among entities. 10% of its instances require multiple answer spans.
MultiSpanQA is a dataset specialized for multi-span reading comprehension. The questions are extracted from NaturalQuestions (Kwiatkowski et al., 2019) and are real queries from the Google search engine.

Annotation
Annotation Process Our annotation process has two stages: we first automatically identify some question-dependent instances and then recruit annotators to classify the remaining ones.
In the first stage, we automatically identify the questions containing certain common clue words such as numerals (full list in Appendix B) to reduce the workload of whole-process annotation. Afterward, the annotators manually check whether each instance is question-dependent. Out of the 4,594 recalled instances, 3,727 are identified as question-dependent.
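The first-stage prefilter can be sketched as a simple lexical match over the question. The word lists below are illustrative stand-ins chosen by us, not the paper's exact lexicon, which is given in Appendix B.

```python
import re

# Illustrative clue-word patterns, one per type from Table 1.
# These specific words are our assumption, not the authors' full list.
CLUE_PATTERNS = [
    r"\b(one|two|three|four|five|six|seven|eight|nine|ten)\b",  # cardinal
    r"\b(first|second|third|last)\b",                           # ordinal
    r"\b(more|most|longer|longest|shorter|shortest)\b",         # comparative/superlative
    r"\b(or)\b",                                                # alternative
    r"\b(both|either|between)\b",                               # other lexical semantics
]

def has_clue_words(question: str) -> bool:
    """First-stage prefilter: recall questions containing a candidate clue word."""
    q = question.lower()
    return any(re.search(p, q) for p in CLUE_PATTERNS)
```

Questions recalled by such a filter are then manually verified, since a matched word does not always indicate the answer count.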
In the second stage, we recruit annotators to annotate the remaining 3,130 instances. For each instance, given both the question and the answers, the annotators should first check whether the form of the answers is correct and mark incorrect cases as bad-annotation. We show examples of common bad-annotation cases in Table 10. After filtering out the bad-annotation ones, the annotators are presented with the question only and should decide whether they can determine the number of answers solely based on the question. If so, the instance is annotated as question-dependent; otherwise, passage-dependent.
For a question-dependent instance, the annotators are further asked to extract the clue words, if any, from the question, which determines whether the instance is with-clue-words or without-clue-words.
Quality Control Six annotators participated in the annotation after qualification. Each instance is annotated by two annotators. In case of any conflict, a third annotator resolves it. An instance is classified as bad-annotation if any annotator labels it as such. Cohen's Kappa between the two initial annotators is 0.70, indicating substantial agreement. See more details in Appendix B.
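For reference, the reported agreement score can be computed from the two annotators' label sequences with standard Cohen's kappa; a minimal self-contained sketch:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators' parallel label sequences."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of instances with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement, from each annotator's marginal label distribution.
    ca, cb = Counter(labels_a), Counter(labels_b)
    expected = sum(ca[l] * cb[l] for l in set(ca) | set(cb)) / (n * n)
    return (observed - expected) / (1 - expected)
```

A kappa of 1.0 means perfect agreement and 0.0 means agreement no better than chance; values around 0.61–0.80 are conventionally read as substantial agreement.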

Analyses of Annotation Results
With our annotated data, we study how the multi-answer instances differ across datasets under our taxonomy. We find that the distributions of instance types are closely related to how the datasets are constructed.
Instance Types The distributions of instance types in different datasets are shown in Table 3. Question-dependent instances prevail in DROP and Quoref, making up over 70% of the two datasets. In contrast, most instances in MultiSpanQA are passage-dependent. This difference stems from how the questions are collected. DROP and Quoref use crowdsourcing to collect questions with specific challenges. Given a passage, the annotators know the answers in advance and produce questions that can only be answered through certain reasoning skills. These artificial questions are more likely to contain clues to the number of answers, such as the question with an ordinal in Table 1. By contrast, the questions in MultiSpanQA are collected from search engine queries. Users generally have no idea of the answers to the queries. The number of answers, as a result, is more often determined by the passages. We provide more analyses of how the instance types are distributed with respect to the specific number of answers in Appendix C.

Existing Multi-Answer MRC Models
Based on our categorization of the multi-answer instances, we continue to investigate how existing multi-answer MRC models perform on various types of multi-answer instances. We summarize current solutions into four paradigms according to how they obtain multiple answers, as illustrated in Figure 3.

ITERATIVE Searching for evidence iteratively is widely adopted in many QA tasks (Xu et al., 2019; Zhao et al., 2021; Zhang et al., 2021), but it has not been explored in multi-answer MRC. We adapt this idea to extract multiple answers iteratively. In each iteration, we append the previously extracted answers to the question, with the word except in between, and then feed the updated question to a single-answer MRC model. The iterative process terminates when the model predicts no more answers.
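The ITERATIVE procedure can be sketched as a loop around any single-answer extractor. Here `extract_one` is a hypothetical stand-in for the underlying MRC model, and the query template and cap on iterations are our assumptions:

```python
def iterative_extract(question, passage, extract_one, max_answers=8):
    """ITERATIVE paradigm sketch: repeatedly query a single-answer MRC model,
    appending already-found answers to the question with 'except' in between.
    `extract_one(question, passage)` returns an answer span, or None when the
    model predicts no further answer."""
    answers = []
    while len(answers) < max_answers:
        if answers:
            query = f"{question} except {'; '.join(answers)}"
        else:
            query = question
        span = extract_one(query, passage)
        if span is None or span in answers:
            break
        answers.append(span)
    return answers
```

The loop also stops if the model repeats a span, a simple guard against non-terminating extraction that we add for robustness.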
GENERATION Generation has been adopted as a uniform paradigm for many QA tasks (Khashabi et al., 2020, 2022), but it is less explored on multi-answer MRC. For GENERATION, we concatenate all answers, with semicolons as separators, to form an output sequence, and finetune the model to generate it conditioned on the question and passage.
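The GENERATION input/output format amounts to a simple serialization of the answer set; a minimal sketch (the separator follows the description above, the helper names are ours):

```python
def answers_to_target(answers):
    """Serialize the gold answer set into a generation target, semicolon-separated."""
    return "; ".join(answers)

def target_to_answers(output):
    """Parse a generated sequence back into an answer list, dropping empty pieces."""
    return [a.strip() for a in output.split(";") if a.strip()]
```

At evaluation time the generated sequence is parsed back into discrete answers, so malformed separators simply yield fewer or merged spans rather than a crash.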

Experimental Setup
Implementation Details We use RoBERTa-base (Liu et al., 2019) for the three extractive paradigms and BART-base (Lewis et al., 2020) for GENERATION. We train models on the training set of each dataset and evaluate them on the corresponding validation sets with our instance type annotations. See more details in Appendix D.1.
Metrics We adopt the official metrics of MultiSpanQA (Li et al., 2022), including precision (P), recall (R), and F1 in terms of exact match (EM) and partial match (PM). See Appendix D.2 for details.

Results and Analyses
We report the overall performance in Table 5 and the performance on different instance types in Table 6. We observe that each of these paradigms has its own strengths and weaknesses.
TAGGING outperforms other paradigms on DROP and Quoref, whose dominating instance type is question-dependent. Although TAGGING has no explicit answer number prediction step, it can still exploit this information implicitly because it takes the question into account during the sequential processing of every token. Besides, TAGGING, as a common practice for entity recognition, is good at capturing the boundaries of entities. Thus, it is not surprising that it performs best on DROP and Quoref, most of whose answers are short entities.
ITERATIVE achieves the best overall performance on MultiSpanQA, whose prevailing instance type is passage-dependent. This paradigm does not directly exploit the information about the number of answers given in the question. Rather, it encourages adequate interactions between questions and passages, performing single-answer extraction at each step. As a result, ITERATIVE does well on questions whose number of answers heavily depends on the given context.
As for NUMPRED, although we expect high performance on question-dependent instances, it lags behind TAGGING by approximately 2% in PM F1 on DROP and Quoref. This might result from the gap between training and inference. The model treats answer number prediction and answer span extraction as two separate tasks during training, with limited interaction. Yet during inference, the predicted number of answers is used as a hard restriction on multi-span selection. Different from its decent performance on DROP and Quoref, NUMPRED performs worst among the four paradigms on MultiSpanQA, because it is difficult for models to accurately predict the number of answers for a long input text that requires thorough understanding.
Among all paradigms, GENERATION generally performs the worst. Under the same parameter scale, extractive models seem to be the better choice for tasks whose outputs are exact entity spans from the input, while generation models do well on slightly longer answers. This also explains the smaller gap between GENERATION and the extractive paradigms on MultiSpanQA compared to that on DROP and Quoref: MultiSpanQA has many descriptive long answers instead of short entities only.

Fusion of Different Paradigms
From the above analysis, we can see that extractive methods can better locate exact short spans in the passage, and NUMPRED can provide potential guidance on the number of answers. Meanwhile, generation models can better handle longer answers and are more adaptable to different forms of inputs and outputs. An interesting question is thus how to combine different paradigms to get the best of both worlds.
We explore two strategies for combining different paradigms: early fusion and late ensemble. The former mixes multiple paradigms in terms of model architectures, while the latter ensembles the predictions of different models. We discuss our exploration of late ensemble in Appendix E.1, since model ensemble is a well-explored technique. Here we primarily elaborate on early fusion. We carry out a series of pilot studies to demonstrate the potential of paradigm fusion.
Previous works attempt to fuse two extractive paradigms, TAGGING and NUMPRED (Segal et al., 2020; Li et al., 2022). However, they only lead to marginal improvements, probably because TAGGING can already implicitly determine answer numbers well and the help of NUMPRED is thus limited.
Although the performance of base-size generation models on multi-answer MRC is inferior to that of extractive ones, generation models of larger sizes show great potential with more parameters and larger pre-training corpora (Khashabi et al., 2020, 2022). More importantly, GENERATION can easily adapt to various forms of inputs and outputs. We carry out pilot studies using a generation model as the backbone while benefiting from the ideas of other paradigms. We propose several lightweight methods to combine GENERATION with NUMPRED and ITERATIVE, as illustrated in Figure 4.
GENERATION + NUMPRED Inspired by recent works on Chain-of-Thought (Wei et al., 2022), we guide the model with prompts indicating the number of answers. We introduce a NUMPRED prompt sentence (NPS) in the form of There are {2, 3, ...} answers/There is only one answer. We experiment with two variants, multitask and pipeline. In the multitask variant, the model outputs an NPS before enumerating all the answers. In the pipeline variant, we predict the number of answers with a separate classifier and then append the NPS to the question as extra guidance.
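The NPS construction for both variants is lightweight string manipulation; a sketch following the wording described above (the helper names and the exact concatenation are our assumptions):

```python
def numpred_prompt(n: int) -> str:
    """NumPred prompt sentence (NPS), in the surface form described above."""
    return "There is only one answer." if n == 1 else f"There are {n} answers."

def multitask_target(answers):
    """Multitask variant: the model is trained to emit the NPS before the answers."""
    return numpred_prompt(len(answers)) + " " + "; ".join(answers)

def pipeline_question(question, predicted_n):
    """Pipeline variant: an NPS from a separate classifier is appended to the question."""
    return question + " " + numpred_prompt(predicted_n)
```

In the multitask variant the answer count comes from the gold annotation at training time, while in the pipeline variant it comes from a separately trained classifier at inference time, which is where error propagation can enter.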

GENERATION + ITERATIVE
We substitute the original extractor of ITERATIVE with a generator. The iterative process terminates when the model outputs the string No answer. Besides this normal setting, we experiment with another variant that additionally outputs an NPS in the form of The number of remaining answers is {1, 2, 3, ...}.

Results
Our main experiments are conducted with BART-base and BART-large due to our limited computational budget. For the pipeline variant of GENERATION + NUMPRED, we use RoBERTa-base as the answer number classifier. The overall experiment results are reported in Table 7 and the results on different question types are reported in Appendix E.2.
When GENERATION is multitasked with NUMPRED, it consistently outperforms the vanilla model. The NPS in the output provides a soft but useful hint for the succeeding answer generation, improving the accuracy of answer number prediction by 1.7% on average for BART-base. The pipeline variant is often inferior to the multitask one due to error propagation. In particular, its performance drops considerably on MultiSpanQA, whose instances are mostly passage-dependent. The accuracy of the answer number classifier on MultiSpanQA lags behind that on the other two datasets by more than 12%. Thus the NPS in the input, carrying an unreliably predicted answer number, is more likely to mislead the subsequent answer span generation.
The combination of GENERATION and ITERATIVE does not always lead to improvement. This might be because the answer generation process of GENERATION is already iterative in style: in the output sequence, each answer is generated conditioned on the previously generated ones. The incorporation of ITERATIVE thus does not lead to further improvement. When we further introduce an NPS with the number of remaining answers, the performance generally surpasses the normal setting. This shows that GENERATION, as a backbone, is easy to integrate with various hints.
Pilot Study on GPT-3.5 To investigate whether these fusion strategies work on larger models, we conduct a pilot study on GPT-3.5. We use the 653 multi-answer instances in the validation set of MultiSpanQA for experiments. The prompts are listed in Appendix E.2. The experiment results are shown in Table 8. When given only one example for in-context learning, GPT-3.5 can already achieve 79.27% PM F1 on the multi-answer instances, with only a small gap from BART trained on the full data. Its EM F1 score is low because GPT-3.5 cannot handle the boundaries of answer spans well. This is not surprising, since one example is not sufficient for GPT-3.5 to learn the annotation preference of span boundaries in MultiSpanQA. If we ask GPT-3.5 to predict the number of answers before giving all the answers, we observe an improvement of 10.1% EM F1 and 3.1% PM F1. This demonstrates the effectiveness of fusing NUMPRED with larger generation models. As evidenced by the above trials, it is promising to fuse different paradigms. We hope that our exploration will inspire future works adopting larger generation models for multi-answer MRC.

Related Works
Compared to the vast number of single-answer MRC datasets, the resources for multi-answer MRC are limited. Aside from the datasets in Section 4.1, MASH-QA (Zhu et al., 2020) focuses on the healthcare domain, with 27% of the questions having multiple long answers, ranging from phrases to sentences. CMQA (Ju et al., 2022) is another multi-answer dataset in Chinese, featuring answers with conditions or different granularities. For our analysis, we select two commonly used datasets, DROP and Quoref, as well as a newly released dataset, MultiSpanQA.
Current models addressing multi-answer MRC generally fall into two paradigms: TAGGING (Segal et al., 2020) and NUMPRED (Hu et al., 2019), as explained in Section 5. ITERATIVE (Xu et al., 2019; Zhao et al., 2021; Zhang et al., 2021; Gao et al., 2021) and GENERATION (Khashabi et al., 2020, 2022) have been adopted for many types of QA tasks, including knowledge base QA, multiple-choice QA, and open-domain QA. Nevertheless, their performance on multi-answer MRC is less explored. In our paper, we also study how to adapt these paradigms for multi-answer MRC. Apart from the exploration of model architectures for multi-answer MRC, Lee et al. (2023) attempt to generate multi-answer questions as data augmentation.
Previous works have made preliminary attempts at fusing two extractive paradigms. Segal et al. (2020) adopt a single-span extraction model for single-answer questions and TAGGING for multi-answer questions; Li et al. (2022) add a NUMPRED head to the TAGGING framework, where the predicted number of answers is used to adjust the tagging results. Both strategies lead to marginal improvements over the baselines. We instead resort to GENERATION for paradigm fusion, considering its potential at larger sizes and its flexibility in inputs and outputs.

Conclusion
In this paper, we conduct a systematic analysis of multi-answer MRC. We design a new taxonomy for multi-answer instances based on how the number of answers is determined. We annotate three datasets with the taxonomy and find that multi-answer is not merely a linguistic phenomenon; rather, many factors contribute to it, especially the process of data collection. With the annotations, we further investigate the performance of four paradigms for multi-answer MRC and identify their strengths and weaknesses. This motivates us to explore various strategies of paradigm fusion to boost performance. We believe that our taxonomy can help determine what types of questions are desirable in the annotation process and aid in designing more practical annotation guidelines. We hope that our annotations can be used for more fine-grained diagnoses of MRC systems and encourage more robust MRC models.

Limitations
First, our taxonomy of multi-answer MRC instances only considers whether we know the exact number of answers from the questions. In some cases, one might have an imprecise estimate of the answer number from the question. For example, for the question Who are Barcelona's active players?, one might estimate that there are dozens of active players for this football club. Yet, these estimations are sometimes subjective and difficult to quantify. Therefore, this instance is classified as passage-dependent according to our current taxonomy. We will consider refining our taxonomy to deal with these cases in the future.
Second, we did not conduct many experiments with pre-trained models larger than the large-size ones due to limited computational budgets. Generation models of larger sizes show great potential with more parameters and larger pre-training corpora. We encourage more efforts to deal with multi-answer MRC with much larger models, such as GPT-3.5.

A Additional Remarks on Task Formulation
As discussed in Section 2, multi-answer and multi-span are two orthogonal concepts. We have already shown an example (Q0 in Section 2) where a multi-answer question can be annotated as single-span under certain annotation guidelines. Here is another example demonstrating the difference between multi-answer and multi-span.
Q: Which offer of Triangle-Transit is most used by students?
P: [...] Triangle-Transit offers scheduled, fixed-route regional and commuter bus service. The first is most used by students.
This is an example where a single-answer question can be annotated as multi-span. A single answer, scheduled bus service, will be annotated as multiple spans, i.e., scheduled and bus service, in the passage.
Considering the differences between multi-answer and multi-span, we suggest carefully distinguishing between these two terms in the future.

B Annotation Details
Dataset Statistics We report more statistics of the annotated datasets in Table 9. MultiSpanQA has the largest average number of answers since it is a dataset designed especially for multi-answer questions. The answers in MultiSpanQA are generally longer than those in DROP and Quoref because many of them are long descriptive phrases or clauses instead of short entities. For all three datasets, the distances between answers are large. This indicates that the answers to a large proportion of the questions are discontiguous in the passages, demonstrating the difficulty of multi-answer MRC.

C Additional Analyses on Annotation Results
We report more statistics of the annotation results in Table 11 and Table 12, and conduct additional analyses from the perspective of the number of answers.
For multi-answer instances, passage-dependent questions account for the largest proportion, followed by with-clue-words ones. As for the single-answer instances, in DROP and Quoref they tend to be question-dependent, while in MultiSpanQA most of them are passage-dependent. In terms of the clue words in the with-clue-words questions, cardinal numbers are more common in multi-answer questions, while other types of clue words are more likely to appear in single-answer questions.

D.1 Implementation Details
We use base-size models for our main experiments for the sake of energy savings. Since T5-base has twice as many parameters as RoBERTa-base and BART-base, we did not use it, to ensure fair comparisons. We carefully tune each model on the training set and report its best performance on the validation set. We use an NVIDIA A40 GPU for experiments. A training step takes approximately 0.5s for RoBERTa-base and 0.2s for BART-base. We describe the implementation details of different models here.
TAGGING We use the implementation by Segal et al. (2020). We use the IO tagging variant, which achieves the best overall performance according to the original paper, and adopt the best-performing hyperparameters provided by the original paper.

D.2 Evaluation Metrics
Here, we describe the evaluation metrics used in our experiments, which are the official ones used by MultiSpanQA (Li et al., 2022). The metrics consist of two parts: exact match and partial match.
Exact Match An exact match occurs when a prediction fully matches one of the ground-truth answers.We use micro-averaged precision, recall, and F1 score for evaluation.
Partial Match For each pair of prediction p_i and ground-truth answer t_j, the partial retrieved score s^ret_ij and partial relevant score s^rel_ij are calculated as the length of the longest common substring (LCS) between p_i and t_j, divided by the length of p_i and t_j respectively:

s^ret_ij = len(LCS(p_i, t_j)) / len(p_i),  s^rel_ij = len(LCS(p_i, t_j)) / len(t_j)

Suppose there are n predictions and m ground-truth answers for a question. We compute the partial retrieved score between a prediction and all answers and keep the highest one as the retrieved score of that prediction. Similarly, for each ground-truth answer, the relevant score is the highest one between it and all predictions. The precision, recall, and F1 are finally defined as:

P = (1/n) Σ_i max_j s^ret_ij,  R = (1/m) Σ_j max_i s^rel_ij,  F1 = 2PR / (P + R)
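A minimal sketch of the partial-match computation described above, using Python's SequenceMatcher to find the longest common substring. This is an illustrative re-implementation of the description, not the official MultiSpanQA evaluation script:

```python
from difflib import SequenceMatcher

def lcs_len(a: str, b: str) -> int:
    """Length of the longest common (contiguous) substring of a and b."""
    m = SequenceMatcher(None, a, b, autojunk=False).find_longest_match(0, len(a), 0, len(b))
    return m.size

def partial_match_f1(predictions, golds):
    """Partial-match P/R/F1: each prediction keeps its best retrieved score over
    golds, each gold keeps its best relevant score over predictions, and the
    kept scores are averaged into precision and recall."""
    if not predictions or not golds:
        return 0.0, 0.0, 0.0
    p = sum(max(lcs_len(pr, g) / len(pr) for g in golds) for pr in predictions) / len(predictions)
    r = sum(max(lcs_len(pr, g) / len(g) for pr in predictions) for g in golds) / len(golds)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

Note how a merged prediction like "English and French" against golds "English" and "French" scores full recall but reduced precision, which is exactly the behavior the multi-span annotation criterion in Section 2 aims to measure.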

E.1 Late Ensemble
By late ensemble, we aggregate the outputs of models from different paradigms to boost performance. We experiment with a simple voting strategy. If a span is predicted as an answer by more than one model, we add it to the final prediction set. If a span is part of another span, we consider them equivalent and take the longer one. In rare cases where the four models predict totally different answers, we add them all to the final prediction set.
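The voting strategy can be sketched as follows. This is an illustrative re-implementation of the rules above; tie-breaking and output ordering are our assumptions:

```python
def vote_ensemble(predictions_per_model):
    """Late-ensemble voting sketch: keep spans supported by more than one model;
    spans where one contains the other count as equivalent, and the longer one
    survives; if all models disagree entirely, keep everything."""
    spans = [s for preds in predictions_per_model for s in preds]
    kept = []
    for s in set(spans):
        # A model supports span s if it predicted s or an overlapping containment.
        votes = sum(any(s in p or p in s for p in preds) for preds in predictions_per_model)
        if votes > 1:
            kept.append(s)
    if not kept:  # rare case: models predicted totally different answers
        kept = sorted(set(spans))
    # Collapse spans that are substrings of a longer kept span (take the longer one).
    final = [s for s in kept if not any(s != t and s in t for t in kept)]
    return sorted(final)
```

Substring containment is a crude proxy for span overlap in the passage; a position-based implementation would compare character offsets instead.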
Our voting strategy leads to improvements of 1.0%, 1.2%, and 1.3% in PM F1 on DROP, Quoref, and MultiSpanQA, respectively, over the best-performing models in Table 5. Yet, this strategy might discard many correct answers. In the future, we can explore more sophisticated strategies. For example, similar to the idea of Mixture of Experts (Jacobs et al., 1991), the system could estimate the probability that an instance belongs to a certain category and then weight each model according to its capabilities in that category.

E.2 Early Fusion
In Table 13, we report the performance of different strategies for early fusion on different types of instances. In Table 14, we list the prompts used for our pilot study on GPT-3.5.

F Licenses of Scientific Artifacts
The license for Quoref and DROP is CC BY 4.0. The license for HuggingFace Transformers is Apache License 2.0. Other datasets and models provide no licenses.

Figure 2 :
Figure 2: Illustration of our taxonomy for multi-answer MRC instances.

TAGGING
Segal et al. (2020) cast the multi-answer MRC task as a sequence tagging problem, similar to named entity recognition (NER), so that the model can extract multiple non-contiguous spans from the context.

NUMPRED (Number Prediction) Hu et al. (2019) first predict the number of answers k as an auxiliary task and then select the top k non-overlapping spans from the output candidates.
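The NUMPRED decoding step, selecting the top k non-overlapping candidate spans, can be sketched as a greedy procedure (a simplification, not the original implementation; candidate scoring is assumed to come from the span-extraction head):

```python
def select_top_k_spans(candidates, k):
    """NumPred-style decoding sketch: given scored candidate spans as
    (start, end, score) tuples and a predicted answer count k, greedily keep
    the k highest-scoring spans that do not overlap an already-kept span."""
    kept = []
    for start, end, score in sorted(candidates, key=lambda c: -c[2]):
        if len(kept) == k:
            break
        # Keep the span only if it is disjoint from every span kept so far.
        if all(end <= s or start >= e for s, e, _ in kept):
            kept.append((start, end, score))
    return sorted((s, e) for s, e, _ in kept)
```

The predicted k acts as a hard restriction here, which is precisely the training/inference gap discussed in Section 7: the span scores are never trained to agree with the count prediction.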

Figure 3 :
Figure 3: An illustration of four paradigms for multi-answer MRC.

Figure 4 :
Figure 4: An illustration of different strategies for early fusion of paradigms.

Table 1 :
Examples of various types of clue words.

Table 2 :
The number of instances for human annotation in the validation set of each dataset.

Table 3 :
Distribution of instance types in three datasets.

Table 4 :
Distribution of clue word types in three datasets.A question may contain multiple types of clue words.
Clue Words Since a large portion (57.8%) of the annotated instances belong to the with-clue-words type, we further investigate the distribution of clue words in different datasets, shown in Table 4. On the one hand, the questions contain a large variety of clue words, demonstrating the complexity of multi-answer MRC. On the other hand, the prevailing type of clue words differs across datasets, reflecting the preferences in dataset construction.

Table 5 :
Performance of four paradigms on three datasets.

Table 6 :
The performance (PM F1) of four paradigms on different types of instances.

Table 7 :
The performance (EM F1 and PM F1) of different strategies for early fusion of paradigms.

Table 8 :
The performance of BART and GPT-3.5 on the multi-answer instances of MultiSpanQA.

Table 10 :
Examples and explanations of bad-annotation cases.

Table 12 :
Distribution of clue word types in three datasets according to the number of answers.

NUMPRED Because the implementation by the original paper (Hu et al., 2019) does not support RoBERTa, we re-implement the model with Huggingface Transformers (Wolf et al., 2020). We use the representation of the first token in the input sequence for answer number classification. The maximum number of answers for the classifier is 8. The batch size is 12. The number of training epochs is 10. The learning rate is 3e-5. The maximum sequence length is 512.

ITERATIVE Our implementation is based on the MRC scripts provided by Huggingface. During training, the order of answers for each iteration is determined by their positions in the passage. The batch size is 8. The number of training epochs is 8. The learning rate is 3e-5. The maximum sequence length is 384. During inference, the beam size is set to 3 and the length penalty is set to 0.7. The maximum length of answers is 10.

GENERATION Our implementation is based on the sequence generation scripts provided by Huggingface. The batch size is 12. The learning rate is 3e-5. The number of training epochs is 10. The maximum input length is 384. The maximum output length is 60.