A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers

Readers of academic research papers often read with the goal of answering specific questions. Question Answering systems that can answer those questions can make consumption of the content much more efficient. However, building such tools requires data that reflect the difficulty of the task arising from complex reasoning about claims made in multiple parts of a paper. In contrast, existing information-seeking question answering datasets usually contain questions about generic factoid-type information. We therefore present Qasper, a dataset of 5049 questions over 1585 Natural Language Processing papers. Each question is written by an NLP practitioner who read only the title and abstract of the corresponding paper, and the question seeks information present in the full text. The questions are then answered by a separate set of NLP practitioners who also provide supporting evidence to answers. We find that existing models that do well on other QA tasks do not perform well on answering these questions, underperforming humans by at least 27 F1 points when answering them from entire papers, motivating further research in document-grounded, information-seeking QA, which our dataset is designed to facilitate.


Introduction
Machines built to assist humans who engage with texts to seek information ought to be designed with an awareness of the information need. Abstractly, the human's need should define the lens through which the system views the text in order to find desired information. Existing information-seeking machine reading datasets (e.g., Kwiatkowski et al., 2019; Clark et al., 2020) have led to significant progress in reading at scale (e.g., Asai et al., 2020; Guu et al., 2020; Liu et al., 2020). However, most of those benchmarks focus on an "open domain" setting where the questions are not anchored in any particular user context. The result is an emphasis on generic factoid questions, rather than the full range of information needs people have.
[Figure 1 content, with panels "Title and Abstract" and "Question and Answer". Excerpts from the example paper: (Section 3, Dataset Construction) "Each dataset consists of a collection of records with one QA problem per record. For each record, we include some question text, a context document relevant to the question, a set of candidate solutions, and the correct solution." (Context Retrieval) "The context document for each record consists of a list of ranked and scored pseudodocuments relevant to the question." (Results) "Several baselines rely on the retrieved context to extract the answer to a question. For these, we refer to the fraction of instances for which the correct answer is present in the context as Search Accuracy. The performance of the baseline among these instances is referred to as the Reading Accuracy." Q: Which retrieval system was used for the baselines? A: The dataset comes with a ranked set of relevant documents. Hence the baselines do not use a retrieval system.]
We present QASPER, an information-seeking question answering (QA) dataset over academic research papers. Each question is written as a follow-up to the title and abstract of a particular paper, and the answer, if present, is identified in the rest of the paper, along with the evidence required to arrive at it. This setup results in questions requiring more complex document-level reasoning than prior datasets, because (i) abstracts provide rich prompts for questions that can be asked as follow-ups and (ii) academic research papers naturally trigger questions by their target readers that require supporting or refuting claims. This evidence may be spread across the paper, including tables and figures, often resulting in complex entailment problems. The example in Figure 1 illustrates one such case where we need to retrieve information from paragraphs in three different sections to answer the question.
QASPER contains 5,049 questions over 1,585 natural language processing (NLP) papers, asked by regular readers of NLP papers, and answered by a separate set of NLP practitioners. Each paper has an average of 3.2 questions, up to a maximum of 12 questions for a single paper. In addition to providing answers when the questions are answerable, the annotators were asked to select text, tables, or figures as evidence required for answering the questions. 55.5% of the questions require evidence from multiple paragraphs in the paper and 13% require tables or figures. To the best of our knowledge, QASPER is the first QA dataset in the academic research domain focusing on entire papers, and not just abstracts.
To quantify the difficulty of the tasks in QASPER, we apply state-of-the-art document-level Transformer (Vaswani et al., 2017) models to the tasks of selecting evidence and generating answers, and show that the best model performance lags behind humans by 27 F1 points at answering questions from entire papers, and 32 F1 points at selecting the paragraphs that provide evidence to answer the questions, indicating that these are both unsolved problems. Additionally, we experiment with oracles that answer questions from gold evidence and find that better pretraining and domain adaptation might be helpful.

Building the QASPER Dataset
We now describe our process for constructing the dataset. We began with a set of open-access NLP papers, recruited NLP practitioners who are regular readers of research papers, and designed two different data collection interfaces: one for collecting follow-up questions given titles and abstracts, and another for obtaining evidence and answers to those questions.

Papers
We filtered S2ORC (Lo et al., 2020), a collection of machine-readable full text for open-access papers, to papers that (i) come from arXiv with an associated LaTeX source file and (ii) are in the computational linguistics domain. We limited our domain to computational linguistics to ensure high quality, as we have access to realistic users through our research network; broader domain collection is left to future work and should be enabled by the proof-of-concept of our protocols given in this paper. We used the S2ORC parser (which normalizes multi-file LaTeX sources and resolves comments and macros) to convert LaTeX markup to full text while preserving section and paragraph breaks and math equations. We supplemented the paper text with extracted images of figures and tables associated with their captions; these were crawled from Semantic Scholar. The result of this process was a collection of 18K full text papers for annotation.

Decoupled Data Collection
To ensure that our questions are realistic, we decoupled the question-writing and question-answering phases. For both tasks we recruited graduate students studying NLP and freelancers practicing NLP through professional networks and Upwork. All the workers were regular readers of NLP papers, and were paid US$25 per hour on average ($20-$40 based on experience). We paid them on a per-hour basis and not a per-question basis to prioritize data quality over quantity. A total of 25 workers wrote questions while 51 answered them.
Questions. To ensure that annotators were actually interested in the papers they were reading, we provided them with a lightweight search interface over the aforementioned collection so that they could focus on papers of interest. The interface supports manual queries; examples of the queries annotators used include general (e.g., "computer vision") or specific (e.g., "question answering", "information extraction") areas of study, specific tasks (e.g., "language identification"), entities (e.g., "bert", "transformers") or concepts (e.g., "commonsense", "interpretability"), or domain specifications (e.g., "medical", "wikipedia"). Annotators also had the option to not enter any search queries; in this case, they were shown random papers. Annotators were shown only the titles and abstracts of relevant papers and asked to write any number of questions they had about each paper. Annotators were instructed to only write questions that are not answerable from the title and abstract but are expected to be answered somewhere in the paper. Annotators also provided basic information about their expertise in NLP and how familiar they already were with the paper for which they asked questions. Most workers (about 70%) had some experience in NLP, with 20% having more than five years of experience. A vast majority (94%) of the abstracts were seen by the question writers for the first time.
Answers. Annotators were randomly assigned papers with all the corresponding questions written for that paper. They were shown the paper title, abstract, question, full text, and all associated figures and tables to answer the questions. After reading these, annotators were asked to:
• Make a binary decision as to whether the question is answerable given the paper.
• If the question is answerable, select the minimal set of evidence snippets that contains the answer to the question. This could be (possibly discontiguous) paragraphs from the text and/or figures or tables. Annotators were asked to prioritize text over figures and tables, unless the information required was present only in figures or tables. When multiple paragraphs could serve as evidence, annotators were asked to first prioritize evidence that adequately answered the question, and then paragraphs that occurred earlier in the text.
• If the question is answerable, also provide a concise answer to the question. Annotators were also asked to indicate whether their concise answer was (i) extracted from the evidence, (ii) "yes" or "no", or (iii) abstractively written.
Annotators were allowed to skip any questions they did not feel comfortable answering. Since the answering task is significantly more complex than the question-writing task, we designed interactive tutorials and qualification exams for the workers for this task using CrowdAQ (Ning et al., 2020). Workers who scored well were invited to work on the task. If the test performance indicated that the workers did not have sufficient NLP knowledge, or were not used to reading papers, we did not let them work on the task. In cases where the workers misunderstood the task, but had sufficient background knowledge, we provided additional training before letting them work on the task.

QASPER Analysis
Table 1 provides representative examples from QASPER categorized by question, answer, and evidence types, which we describe here in greater detail.

Question types
We first analyze whether our annotation setup results in questions that are anchored in the context of the papers. To answer this question, we manually categorized a set of 200 questions as being applicable to most papers in the domain (general) vs. being applicable only to the paper that the question is written about (specific).
Table 1 shows that most of the questions (67%) are specific to the papers they are written about. This result indicates the advantage of viewing the QASPER task as a question answering problem, instead of an information extraction problem, since a fixed schema would not be able to handle the long tail of paper-specific information needs.
Answer types. As shown in Table 1, most of the answers in the dataset are extractive. The average length of the extractive answers is 14.4 words (including all spans), and that of abstractive spans is 15.6 words.
Evidence types. Evidence can include one or more paragraphs from the paper, a figure, or a table, or a combination of these. Table 1 shows the distribution of these types. Among the answerable questions with text-only evidence, 55.5% of the answers have multi-paragraph evidence (Figure 1 is one example). Unanswerable questions do not have any evidence. Among the answerable ones, 3.0% have no evidence: in these cases the answer is No, and the evidence is the lack of a mention of something specific. The last question in Table 4 is one example of such a case.

Distribution of evidence paragraphs
We perform an analysis to identify the main sections of a paper that contain textual evidence. We assign each evidence paragraph to its containing top-level section, and perform some section name normalization. We find that among the frequently used section names such as "Experiments" and "Introduction," there was not a single section name that contained a majority of evidence spans, indicating that the distribution of evidence over sections in the paper was more or less uniform.
Inter-annotator agreement. 44% of the questions in QASPER have multiple annotated answers. On average, each question is answered by 1.6 annotators (up to a maximum of 6 annotators for the same question). Using these multiple annotations, we compute some measures of agreement between annotators. First, we find that there is a high level of agreement (90%) regarding answerability of questions. Second, we find that annotators agreed on the type of the evidence (text vs. figure) in 84.0% of the cases. Papers often provide the same information both in tables and text, and agreement over the evidence types could be a consequence of our clear annotation guidelines regarding selecting evidence.
Correctness. To estimate the correctness of the answer annotations in QASPER, we manually analyzed 100 randomly sampled questions with multiple answer annotations (averaging 2.73 answers per question). We found that 207 (75.8%) of the answers were correct. 98% of the questions had at least one correct answer, and 77% had most of the answers correct.

Modeling QASPER
This section explains the task, evaluation metrics, and a model addressing QASPER tasks.

Task Setup
We formally define the QASPER tasks as follows: given a paper and a question about it, the primary task is to determine if the question is answerable and to output a predicted answer, which is one or more spans in the full text of the paper, yes, no, or other free-form text. A system built for this will be evaluated based on the correctness of the predicted answer measured against the reference answers.
Since QASPER also provides labeled evidence for all questions, the system may also use auxiliary supervision provided by the evidence.
One such auxiliary task is to predict the evidence required for the question. The inputs are the same as those of the primary task, but the outputs are expected to be one or more paragraphs in the full text, figures, or tables, and they will be evaluated against labeled evidence spans.
Evaluation metrics. As an automatic proxy for the measure of correctness of all types of answers, we use the span-level F1 measure proposed by Rajpurkar et al. (2016). We convert answers that are multiple selected spans into single comma-separated strings. For questions with multiple reference answers, we compute the max span-F1 of the predictions over all the references. We evaluate the performance of a system over the auxiliary task by computing an F1 score over the set of paragraphs, figures, and tables chosen by the system against the reference evidence, considering a max when there are multiple references. We refer to these metrics as Answer-F1 and Evidence-F1, respectively.
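The two metrics can be illustrated with a minimal sketch. It assumes the token-overlap F1 and the answer normalization (lowercasing, stripping punctuation and English articles) commonly associated with Rajpurkar et al. (2016); the function names, the treatment of empty inputs, and the use of string identifiers for evidence are assumptions made for illustration, not the official evaluation script.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def token_f1(prediction, reference):
    """Token-overlap F1 between one predicted and one reference answer string."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    if not pred_tokens or not ref_tokens:
        # Assumption: two empty strings match; one empty string is a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def answer_f1(prediction, references):
    """Answer-F1: best token F1 of the prediction over all reference answers."""
    return max(token_f1(prediction, ref) for ref in references)


def set_f1(predicted, gold):
    """F1 between two sets of evidence identifiers (paragraphs, figures, tables)."""
    predicted, gold = set(predicted), set(gold)
    if not predicted and not gold:
        return 1.0  # Assumption: predicting no evidence when none is labeled is correct.
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def evidence_f1(predicted, references):
    """Evidence-F1: best set F1 of the prediction over all reference evidence sets."""
    return max(set_f1(predicted, ref) for ref in references)
```

For example, `answer_f1("the BERT model", ["BERT", "a BERT model"])` takes the better of the two references, and a multi-span answer would first be joined with `", ".join(spans)` before scoring.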

Data splits
We split the dataset into train, validation, and test sets, so that each paper appears in only one of them. Our analysis of correctness of annotations presented in Section 3 indicates a high likelihood (98%) of evaluating against a correct reference when evaluation is aggregated over multiple references. Hence we ensure that most of the questions in validation and test sets have multiple references (98% in test, and 74% in validation). This resulted in 2,593, 1,005, and 1,451 questions in the three sets, respectively.
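A paper-level split of this kind can be sketched as follows. This is only an illustration of keeping every paper in exactly one set: the `paper_id` key, the split fractions, and the random assignment are all hypothetical, and the sketch does not reproduce the additional constraint on multiple references described above.

```python
import random


def split_by_paper(questions, fractions=(0.6, 0.2, 0.2), seed=0):
    """Assign whole papers to train/validation/test so no paper is shared.

    `questions` is a list of dicts with a hypothetical "paper_id" key.
    `fractions` are illustrative, not the proportions used for QASPER.
    """
    paper_ids = sorted({q["paper_id"] for q in questions})
    random.Random(seed).shuffle(paper_ids)
    n_train = int(len(paper_ids) * fractions[0])
    n_val = int(len(paper_ids) * fractions[1])

    # Map each paper to a single split based on its shuffled position.
    assignment = {}
    for index, paper_id in enumerate(paper_ids):
        if index < n_train:
            assignment[paper_id] = "train"
        elif index < n_train + n_val:
            assignment[paper_id] = "validation"
        else:
            assignment[paper_id] = "test"

    # Every question follows its paper into that split.
    splits = {"train": [], "validation": [], "test": []}
    for question in questions:
        splits[assignment[question["paper_id"]]].append(question)
    return splits
```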

Estimating human performance
To estimate an upper bound on model performance given our data splits and metrics, we assess the performance of the workers when evaluated against each other using the same metrics on a sample of the test set. Since model performance is evaluated by aggregating over multiple references, we consider a subset of the test set containing questions with at least three references (40% of the test set), evaluate each reference against the remaining, and compute an average over all such combinations. This procedure estimates the human performance to be 60.9 Answer-F1, and 71.6 Evidence-F1. Note that given the disagreements among the workers estimated in Section 3, this is a lower bound on human performance for two reasons: first, because only two annotations are used to compute the metric, while systems are evaluated against all three; and second, because the annotators are NLP practitioners, not expert researchers, and it is likely that an expert would score higher. Hence we report these numbers, along with a breakdown over answer types in Table 2 and Table 3, as human performance lower bounds.
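The leave-one-out procedure can be sketched as below. The sketch takes the scoring function as a parameter and pools all held-out scores into a single average; whether the original averaging is pooled or per-question is an assumption here, and `exact_match_max` is a hypothetical stand-in metric used only to keep the example short.

```python
def leave_one_out_scores(annotations, metric):
    """Score each annotation as if it were a prediction against the others."""
    scores = []
    for i, held_out in enumerate(annotations):
        others = annotations[:i] + annotations[i + 1:]
        scores.append(metric(held_out, others))
    return scores


def human_performance(questions, metric, min_references=3):
    """Average leave-one-out score over questions with enough references.

    `questions` is a list where each item is the list of annotations
    for one question. Scores are pooled across questions (an assumption).
    """
    pooled = []
    for annotations in questions:
        if len(annotations) >= min_references:
            pooled.extend(leave_one_out_scores(annotations, metric))
    if not pooled:
        raise ValueError("no question has enough references")
    return sum(pooled) / len(pooled)


def exact_match_max(prediction, references):
    """Hypothetical stand-in metric: 1.0 if any reference equals the prediction."""
    return max(float(prediction == ref) for ref in references)
```

With three references per question, each held-out annotation is compared against only two others, which is one of the reasons the text gives for treating the estimate as a lower bound.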

QASPER Model
We base our model on pretrained Transformer (Vaswani et al., 2017) models, which currently produce state-of-the-art results on a majority of QA tasks. Recall that QASPER introduces two main modeling challenges: different answer types and long input documents.
First, QASPER includes a variety of answer types, including extractive, abstractive, yes/no, and unanswerable questions, which means a typical span-selection BERT-based QA model (Devlin et al., 2019) is not sufficient to support all these answer types. We address this by converting all answer types into a single task: generating answer text (Raffel et al., 2020; Khashabi et al., 2020). This is a sequence-to-sequence formulation that requires an encoder-decoder Transformer model where the encoder reads the question and the document and the decoder generates the answer text.
Second, research papers are much longer than the typical 512 or 1024 token limit of most BERT-like models, so we need a Transformer model that can process long inputs. We use the Longformer-Encoder-Decoder (LED; Beltagy et al., 2020), an encoder-decoder Transformer model that can efficiently process input sequences thousands of tokens long. With LED's support for input sequence length of 16K tokens, we can encode 99% of the paper full texts in the QASPER dataset without truncation.
Longformer-Encoder-Decoder (LED). LED (Beltagy et al., 2020) is a variant of the original Transformer encoder-decoder model that replaces the Transformer's full self-attention in the encoder with the efficient local+global attention pattern of Longformer. This allows each token to attend to only its local window and a pre-specified set of global locations of interest, thereby scaling self-attention computation linearly with the input size (as opposed to quadratically with full context self-attention). LED has a similar architecture to BART (Lewis et al., 2020) in terms of number of layers and hidden state sizes, with the distinction that it has a larger position embeddings matrix, allowing it to process inputs of up to 16K tokens long (up from 1K tokens in the original BART model). In practice, LED's parameters are initialized from a pretrained BART model, and LED copies BART's position embeddings 16 times to fill the entire 16K position embeddings matrix. For all experiments we use the LED-base sized model, which uses BART-base weights.
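To make the scaling argument concrete, the following sketch counts attended positions under a local+global pattern and tiles a short position-embedding table. It is a toy illustration in plain Python, not LED's implementation: the function names are hypothetical, the window is treated as symmetric around each token, and global tokens are assumed to attend to every position.

```python
def attended_positions(seq_len, window, global_positions):
    """For each token index, the set of positions it may attend to."""
    half = window // 2
    global_set = set(global_positions)
    pattern = []
    for i in range(seq_len):
        if i in global_set:
            # Assumption: a global token attends to the whole sequence.
            allowed = set(range(seq_len))
        else:
            local = range(max(0, i - half), min(seq_len, i + half + 1))
            allowed = set(local) | global_set
        pattern.append(allowed)
    return pattern


def attention_cost(seq_len, window, global_positions):
    """Total number of (query, key) pairs computed under the pattern."""
    return sum(len(a) for a in attended_positions(seq_len, window, global_positions))


def tile_position_embeddings(base_embeddings, copies):
    """Repeat a short position-embedding table to cover a longer input."""
    return [list(row) for _ in range(copies) for row in base_embeddings]
```

With a fixed window and a fixed number of global positions, doubling `seq_len` roughly doubles `attention_cost`, whereas full self-attention would cost `seq_len ** 2` pairs. Tiling a 1K-row table 16 times yields the 16K rows described above.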
Input and Output Encoding. For the input, we follow the Longformer QA models (Beltagy et al., 2020) and encode the question and context in one concatenated string with "global attention" over all the question tokens. For the output, all answer types are encoded as single strings. The string is the text of the abstractive answer, a comma-separated concatenation of the extractive spans, "Yes", "No", or "Unanswerable".
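The encoding described above could look roughly like the sketch below. The answer dictionary layout (`type`, `content`) is hypothetical, and whitespace splitting stands in for the model's real subword tokenizer, so the token counts are illustrative only.

```python
def encode_target(answer):
    """Turn any answer type into the single string the decoder should generate.

    `answer` is a dict with hypothetical keys "type" and "content".
    """
    kind = answer["type"]
    if kind == "unanswerable":
        return "Unanswerable"
    if kind == "yes_no":
        return "Yes" if answer["content"] else "No"
    if kind == "extractive":
        # Multiple selected spans become one comma-separated string.
        return ", ".join(answer["content"])
    if kind == "abstractive":
        return answer["content"]
    raise ValueError("unknown answer type: %r" % kind)


def encode_input(question, context):
    """Concatenate question and context; mark question tokens as global.

    Whitespace tokenization is an assumption standing in for a real tokenizer.
    """
    question_tokens = question.split()
    context_tokens = context.split()
    tokens = question_tokens + context_tokens
    global_attention_mask = [1] * len(question_tokens) + [0] * len(context_tokens)
    return tokens, global_attention_mask
```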
Evidence extraction. To support extracting evidence paragraphs, we prepend each paragraph with a </s> token and add a classification head over these tokens on LED's encoder side. We also add Longformer's global attention over these tokens to facilitate direct information flow across the paragraphs. We then train LED using both loss functions (teacher-forced text generation and paragraph classification) in a multi-task training setup. For the answer generation, we use a cross-entropy loss function over the vocabulary. For the evidence paragraph extraction, we use a cross-entropy loss function with binary 0 or 1 gold labels for evidence/non-evidence paragraph. To account for class imbalance, we use loss scaling with weights proportional to the ratio of positive to negative gold paragraphs in the batch, which we found to be crucial for the model to train. One benefit of multi-task training of evidence extraction along with answer selection is that tasks can benefit each other (see Section 5.2).
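One way to realize class-balanced loss scaling for the paragraph classifier is sketched below. The text does not give the exact weighting formula, so this sketch assumes a common choice: the positive (evidence) class is up-weighted by the ratio of negative to positive paragraphs in the batch, and the negative class keeps weight 1. The function name and the use of plain probabilities instead of logits are also assumptions.

```python
import math


def weighted_evidence_loss(probabilities, labels, eps=1e-12):
    """Class-weighted binary cross-entropy over paragraphs in a batch.

    `probabilities[i]` is the predicted chance that paragraph i is evidence;
    `labels[i]` is the 0/1 gold label. The weighting scheme is an assumption.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Up-weight the rare evidence class; fall back to 1.0 if a class is absent.
    pos_weight = n_neg / n_pos if n_pos and n_neg else 1.0

    total = 0.0
    for p, y in zip(probabilities, labels):
        if y == 1:
            total += -pos_weight * math.log(max(p, eps))
        else:
            total += -math.log(max(1.0 - p, eps))
    return total / len(labels)
```

In a multi-task setup this term would be added to the answer generation loss; how the two are weighted against each other is not specified in the text.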

Experiments
We evaluate model performance on question answering and evidence selection tasks, and compare them to estimated lower bounds on human performance. These human performance estimates are calculated by comparing the answers of questions for which we have multiple human annotations. For each question, we choose one annotation as if it were a prediction, and evaluate it against the rest of the annotations, and consider as human performance the average over all annotations chosen as predictions. We restrict our experiments to the subset of questions in QASPER that can be answered from text in the paper, ignoring those that require figures or tables as evidence (13% of the dataset; see Section 3) to avoid having to deal with multimodal inputs. We leave multimodal question answering to future work.

Training Details
We train all models using the Adam optimizer (Kingma and Ba, 2014) and a triangular learning rate scheduler (Howard and Ruder, 2018) with 10% warmup. To determine the number of epochs, peak learning rate, and batch size, we performed manual hyperparameter search on a subset of the training data. We searched over {1, 3, 5} epochs with learning rates {1e-5, 3e-5, 5e-5, 9e-5}, and found that smaller batch sizes generally work better than larger ones. Our final configuration was 10 epochs, peak learning rate of 5e-5, and batch size of 2, which we used for all reported experimental settings. When handling full text, we use gradient checkpointing (Chen et al., 2016) to reduce memory consumption. We run our experiments on a single RTX 8000 GPU, and each experiment takes 30-60 minutes per epoch.
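A triangular schedule with 10% warmup can be sketched as a linear ramp to the peak followed by a linear decay to zero. This is a simplified reading of the schedule, not necessarily the exact variant of Howard and Ruder (2018); the function name and the decay-to-zero endpoint are assumptions.

```python
def triangular_lr(step, total_steps, peak_lr, warmup_fraction=0.1):
    """Learning rate at `step`: linear warmup to `peak_lr`, then linear decay."""
    warmup_steps = max(1, int(total_steps * warmup_fraction))
    if step < warmup_steps:
        # Ramp up from 0 to the peak over the warmup steps.
        return peak_lr * step / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    # Ramp down from the peak to 0 over the remaining steps.
    return peak_lr * max(0.0, (total_steps - step) / decay_steps)
```

For instance, with `total_steps=100` and `peak_lr=5e-5`, the rate rises over the first 10 steps, peaks at step 10, and is back to zero at step 100.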

Results
Question answering. Table 2 shows the overall performance of the LED-base model on question answering, as well as the performance breakdown on the different answer types. The table also compares LED-base variants when the input is heuristically limited to smaller parts of the paper (i.e., no context, abstract, introduction). We generally observe that, by using more context, the performance improves. Specifically, as we observe in row 5, encoding the entire context results in significant overall performance improvement (∆ = +9.5) over the best heuristic ("introduction"). This signifies the importance of encoding the entire paper. Comparing rows 4 and 5, we observe that using the evidence prediction as a multi-task scaffolding objective helps, improving the results by ∆ = +0.8 points.
Evidence selection. Table 3 illustrates the evidence selection performance of the LED-large and LED-base models compared with simpler baselines. We observe that LED variants outperform the simple TF-IDF baseline but there still remains a large gap to human performance.
Varying amounts of training. Figure 2 shows the learning curve that measures the validation Answer-F1 and Evidence-F1 of the LED-base variants based on training data size. The learning curve suggests that performance has not reached a plateau, and future data collection could be useful.
Answer prediction from gold evidence. To better isolate the question answering (as opposed to evidence selection) task performance, we perform oracle experiments where models are given the gold evidence. For these experiments, we are able to use larger (T5-large; Raffel et al., 2020) or better task-adapted pretrained models (UnifiedQA-large; Khashabi et al., 2020), which perform significantly better in the oracle setting. We did not use them in the non-oracle setting, however, as Longformer versions of these models are not available, and LED's ability to handle the full document without the need for a pipelined retrieval system was more important. These experiments show that (1) the human lower bound is in fact a lower bound, as large models exceed it for span answers in this setting; (2) the majority of the large headroom in the non-oracle setting can be closed with better evidence selection; and (3) research into making large pretrained models able to better scale to long documents would be beneficial.
Error analysis. To gain insight into the model's errors, we sample 67 test questions with predicted Answer-F1 scores below 0.10 from the LED model trained with evidence prediction scaffolding. We remove four cases in which the predicted answers are actually correct. Examining gold answers of the remaining 63, we find 31 are extractive, 24 are abstractive, 3 are "yes", 3 are "no," and 2 are unanswerable. We observe that LED often predicts shorter spans than the gold answers (9.5 words shorter than gold counterparts, on average). Focusing only on the 55 questions with either extractive or abstractive gold answers, we manually categorize error types in Table 5.
Most datasets for QA on academic research papers also fall within the information-verifying paradigm as they automatically construct QA examples using extracted entities and relations and structured knowledge resources, like DrugBank. Some examples include emrQA (Pampari et al., 2018), BioRead (Pappas et al., 2018), BioMRC (Pappas et al., 2020), and MedHop (Welbl et al., 2018). While these datasets enabled significant progress in machine comprehension, they include biases in questions that may not reflect real-world settings (Kwiatkowski et al., 2019).
Information-Seeking QA in General Domain. Recognizing this challenge, others have followed an information-seeking paradigm where the writer of questions is genuinely interested in finding the answer to the question, or at least does not have access to the answer. Examples of such datasets include WikiQA (Yang et al., 2015), NewsQA (Trischler et al., 2017), MsMarco (Campos et al., 2016), QuAC (Choi et al., 2018), Natural Questions (Kwiatkowski et al., 2019), TyDiQA (Clark et al., 2020), and IIRC (Ferguson et al., 2020). Unlike QASPER, Natural Questions and TyDiQA questions are not grounded in any contexts, and the associated documents are linked to the questions after they are written. In contrast, QASPER's questions are real follow-up questions about a paper that a reader of appropriate domain expertise would have after reading the title and the abstract. The priming lets the readers ask detailed questions that are specific to the papers in context, those that require a deeper understanding of the contexts, like those shown in Figure 1 and Table 1. QuAC used a similar data collection method but with a focus on entities, which QASPER does not impose.
Domain-Specific Information-seeking QA. Some work has been done on information-seeking QA on academic research papers. PubmedQA (Jin et al., 2019) derives Yes/No/Maybe questions from PubMed paper titles answered from the conclusion sections of the corresponding abstracts. BioAsq benchmarks (Balikas et al., 2013; Nentidis et al., 2018; Krallinger et al., 2020) focus on open-domain QA over PubMed abstracts. Like QASPER, BioAsq answers can take different forms (e.g., yes/no, extracted span(s)). QASPER differs from BioAsq in that questions are grounded in a single paper of interest. Furthermore, QASPER uses the paper full text, not just the abstract. To the best of our knowledge, QASPER is the first information-seeking QA dataset in a computer science domain, while most prior work using academic research papers has been in biomedicine. Furthermore, with over 5K annotated questions, QASPER is also larger than other comparable human-annotated QA datasets: PubmedQA and BioAsq contain 1K and 3.2K questions, respectively. Finally, QASPER poses a challenging full document-level task while other related datasets are abstract-level. Beyond the domain of academic research, realistic QA datasets have also been built in the privacy policy domain (Ravichander et al., 2019; Ahmad et al., 2020). These tasks are similar to our evidence selection task.

Conclusion
We presented QASPER, an information-seeking QA dataset over NLP research papers.With natural questions asked as follow-up to titles and abstracts, the task presented by QASPER requires evidence

Figure 1: An example instance taken from QASPER. A question about the paper is written after reading only the title and the abstract. To arrive at the answer, one finds relevant evidence, which can be spread across multiple paragraphs. In this example, to answer the question about "baselines", the reader must realize from evidence from Sections 3 and 4 that "context documents" come pre-ranked in the dataset and the paper's "baselines" select from these "context documents."

Figure 2: Learning curves showing Answer-F1 and Evidence-F1 on the dev. set while varying training data size.

Table 2: LED-base and lower-bound human performance on answering questions in QASPER, measured in Answer-F1. The top three rows are heuristic baselines that try to predict answers without encoding entire papers. "w/ scaff." refers to the inclusion of the evidence selection scaffold during training.

Table 3: Model and lower-bound human performance on selecting evidence for questions in QASPER.

Table 4: Model performance on the QASPER test set on answering questions given gold evidence. We do not show performance on Yes/No and Unanswerable types because they can be trivially predicted to a large extent from the absence of gold evidence.