English Machine Reading Comprehension Datasets: A Survey

This paper surveys 60 English Machine Reading Comprehension datasets, with a view to providing a convenient resource for other researchers interested in this problem. We categorize the datasets according to their question and answer form and compare them across various dimensions including size, vocabulary, data source, method of creation, human performance level, and first question word. Our analysis reveals that Wikipedia is by far the most common data source and that there is a relative lack of why, when, and where questions across datasets.


Introduction
Reading comprehension is often tested by measuring a person or system's ability to answer questions about a given text. Machine reading comprehension (MRC) datasets have proliferated in recent years, particularly for the English language (see Fig. 1). The aim of this paper is to make sense of this landscape by providing as extensive a survey of English MRC datasets as possible. The survey has been carried out with the following audience in mind: (1) those who are new to the field and would like a concise yet informative overview of English MRC datasets; (2) those who are planning to create a new MRC dataset; (3) MRC system developers interested in designing the appropriate architecture for a particular dataset, choosing appropriate datasets for a particular architecture, or finding compatible datasets for use in transfer or joint learning.
Our survey takes a mostly structured form, with the following information presented for each dataset: size, data source, creation method, human performance level, whether the dataset has been "solved", availability of a leaderboard, the most frequent first question token, and whether the dataset is publicly available. We also categorise each dataset by its question/answer type.
Our study contributes to the field as follows: 1. it describes and teases apart the ways in which MRC datasets can vary according to their question and answer types; 2. it provides analysis in table and figure format to facilitate easy comparison between datasets; 3. by providing a systematic comparison, and by reporting the "solvedness" status of a dataset, it brings the attention of the community to less popular and relatively understudied datasets; 4. it contains per-dataset statistics such as number of instances, average question/passage/answer length, vocabulary size and text domain which can be used to estimate the computational requirements for training an MRC system.

Question, Answer, and Passage Types
All MRC datasets in this survey have three components: passage, question, and answer. 1 We begin with a categorisation based on the types of answers and the way the question is formulated. We divide questions into three main categories: Statement, Query, and Question. Answers are divided into the following categories: Cloze, Multiple Choice, Boolean, Extractive, Generative. The relationships between question and answer types are illustrated in Fig. 2. In what follows, we briefly describe each question and answer category, followed by a discussion of passage types, and dialog-based datasets.

Answer Type
Cloze The question is formulated as a sentence with a missing word or phrase which should be inserted into, or should complete, the sentence. The answer candidates may be included, as in (1) from RecipeQA (Yagcioglu et al., 2018), or they may not, as in (2). We distinguish cloze multiple-choice datasets from other multiple-choice datasets by the form of the question: in cloze datasets, the answer is a missing part of the question context and, combined, they form a grammatically correct sentence, whereas in other multiple-choice datasets the question has no missing words.
Boolean A Yes/No answer is expected, e.g. (4) from the BoolQ dataset (Clark et al., 2019). Some datasets which we include here have a third "Cannot be answered" or "Maybe" option, e.g. (5) from PubMedQA (Jin et al., 2019).
(4) P: The series is filmed partially in Prince Edward Island as well as locations in ... Q: Is anne with an e filmed on pei? A: Yes
(5) P: ... Young adults whose families were abstainers in 2000 drank substantially less across quintiles in 2010 than offspring of nonabstaining families. The difference, however, was not statistically significant between quintiles of the conditional distribution. Actual drinking levels in drinking families were not at all or weakly associated with drinking in offspring. ... Q: Does the familial transmission of drinking patterns persist into young adulthood? A: Maybe

Extractive or Span Extractive The answer is a substring of the passage. In other words, the task is to determine the answer's character start and end index in the original passage, as shown in (6).

Generative or Free Form Answer The answer must be generated based on information presented in the passage. Although the answer might appear in the text, as illustrated in (7) from NarrativeQA (Kočiský et al., 2018), no passage index connections are provided.
(7) P: ...Mark decides to broadcast his final message as himself. They finally drive up to the crowd of protesting students, .... The police step in and arrest Mark and Nora.... Q: What are the students doing when Mark and Nora drive up? A: Protesting.

Question Type
Statement The question is a declarative sentence, used in cloze questions, e.g. (1-2), and quiz questions, e.g. (8) from SearchQA (Dunn et al., 2017).

Query The question is formulated to obtain a property of an object. It is similar to a knowledge graph query: in order to be answered, additional sources such as a knowledge graph may need to be consulted alongside the passage, or the dataset may have been created using a knowledge graph, e.g. (9) from WikiReading (Hewlett et al., 2016).
(9) P: Cecily Bulstrode (1584-4 August 1609), was a courtier and ... She was the daughter ... Q: sex or gender A: female

We put datasets with more than one type of question into a separate Mixed category.

Passage Type
Passages can take the form of a single-document or multi-document passage. They can also be categorised by the type of reasoning required to answer a question: Simple Evidence, where the answer to a question is clearly presented in the passage, e.g. (3) and (6); Multihop Reasoning, with questions requiring that several facts from different parts of the passage or from different documents be combined to obtain the answer, e.g. (10) from HotpotQA (Yang et al., 2018); and Extended Reasoning, where general knowledge or common sense reasoning is required, e.g. (11).

Datasets
All datasets and their properties of interest are listed in Table 1. We present the number of questions per dataset (size), the text sources, the method of creation, whether a leaderboard and publicly available data exist, and whether the dataset is solved, i.e. the performance of an MRC system exceeds the reported human performance (also shown). We will discuss each of these aspects.

Table 1 abbreviations: PAD - publicly available data; k/M - thousands/millions; CRW - crowdsourcing; AG - automatically generated; KG - knowledge graph; ER - educational resources; UG - user generated; HG - human generated (UG + annotators, CRW, experts); L/S - long/short answer; ✓ - available/"solved"; ✗ - unavailable/not "solved"; 8 - the leaderboard is hosted at https://paperswithcode.com/; u - the dataset is available by request. This information was last verified in September 2021.

Some datasets share not only text sources but actual samples. SQuAD2.0 extends SQuAD with unanswerable questions. AmazonQA and AmazonYesNo overlap in questions and passages with slightly different processing. SubjQA also uses a subset of the same reviews (Movies, Books, Electronics and Grocery). BoolQ shares 3k questions and passages with the NaturalQuestions dataset. The R3 dataset is fully based on DROP, with a focus on reasoning.

Dataset Creation
Rule-based approaches have been used to automatically obtain questions and passages for the MRC task, by generating the sentences (e.g. bAbI) or, in the case of cloze-type questions, by removing a word from the context. We call these methods automatically generated (AG). Most dataset creation, however, involves a human in the loop. We distinguish three types of people: experts, who are professionals in a specific domain; crowdworkers (CRW), casual workers who normally meet certain criteria (for example a particular level of proficiency in the dataset language) but are not experts in the subject area; and users, who voluntarily create content based on their personal needs and interests.
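The cloze-style AG approach can be sketched in a few lines: pick a word from a sentence, blank it out, and keep it as the gold answer. The selection heuristic below (blank the longest word) is our own illustrative assumption, not the method of any particular dataset.

```python
def make_cloze(sentence, min_len=4):
    """Turn a sentence into a cloze question by blanking out one word.

    Heuristic (illustrative only): blank the longest word of at least
    min_len characters, which tends to select a content word.
    Returns (cloze_sentence, answer), or None if no suitable word exists.
    """
    words = sentence.split()
    # Rank candidates by length, ignoring trailing punctuation.
    length, idx = max((len(w.strip(".,!?")), i) for i, w in enumerate(words))
    if length < min_len:
        return None  # nothing worth blanking
    answer = words[idx].strip(".,!?")
    words[idx] = words[idx].replace(answer, "_____")
    return " ".join(words), answer

q, a = make_cloze("The mitochondria are the powerhouse of the cell.")
```

Real AG pipelines typically restrict candidates to named entities or content words identified by a tagger rather than relying on word length.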
More than half of the datasets (37 out of 60) were created using crowdworkers. In one scenario, crowdworkers have access to the passage and must formulate questions based on it; MovieQA, ShARC, SQuAD, and SQuAD2.0 were created in this way. Another scenario involves finding a passage containing the answer to a given question. This works well for datasets where questions are taken from already existing resources such as trivia and quiz questions (TriviaQA, Quasar-T, and SearchQA), or where web search queries and results from Google and Bing are used as a source of questions and passages (BoolQ, NaturalQuestions, MS MARCO).
In an attempt to avoid word repetition between passages and questions, some datasets used different texts about the same topic as a passage and a source of questions. For example, DuoRC takes descriptions of the same movie from Wikipedia and IMDb. One description is used as a passage while another is used for creating the questions. NewsQA uses only a title and a short news article summary as a source for questions while the whole text becomes the passage. Similarly, in NarrativeQA, only the abstracts of the story were used for question creation. For MCScript and MCScript 2.0, questions and passages were created by different sets of crowdworkers given the same script.
Question/Passage/Answer Length The graphs in Fig. 4 provide more insight into the differences between the datasets in terms of answer, question, and passage length, as well as vocabulary size. The outliers are highlighted. 11 The majority of datasets have a passage length under 1500 tokens, with a median of 329 tokens, but due to seven outliers the average is 1250 tokens (Fig. 4 (a)). Some datasets (MS MARCO, SearchQA, AmazonYesNo, AmazonQA, MedQA) have a collection of documents as a passage, while others contain just a few sentences. 12 The number of tokens in a question lies mostly between 5 and 20. Two datasets, ChildrenBookTest and WhoDidWhat, average more than 30 tokens per question, while WikiReading, QAngaroo MedHop, and WikiHop average only 2-3.5 tokens per question (Fig. 4 (b)). The majority of datasets contain fewer than 8 tokens per answer, with an average of 3.5. NaturalQuestions is an outlier with an average of 164 tokens per answer 13 (Fig. 4 (c)).
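Per-field length statistics of this kind can be reproduced with a single pass over the data. The sketch below uses whitespace tokenization as a stand-in for the spaCy tokenizer used in the paper, so absolute counts will differ slightly; the dataset structure (a list of dicts) is an assumption for illustration.

```python
from statistics import mean, median

def length_stats(samples, field):
    """Mean and median token count for one field across samples.

    Whitespace tokenization is a rough stand-in for spaCy's tokenizer.
    """
    lengths = [len(s[field].split()) for s in samples]
    return {"mean": mean(lengths), "median": median(lengths)}

# Toy data (ours) in the assumed question/answer dict format.
toy = [
    {"question": "Who wrote Hamlet ?", "answer": "William Shakespeare"},
    {"question": "When was the play first performed ?", "answer": "around 1600"},
]
stats = length_stats(toy, "question")
```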
Vocabulary Size To obtain the vocabulary size we count the number of unique lower-cased token lemmas. The vocabulary size distribution is presented in Fig. 4 (d). There is a moderate correlation 14 between the number of questions in a dataset and its vocabulary size (see Fig. 5). 15 WikiReading has the largest number of questions as well as the richest vocabulary. bAbI is a synthetic dataset with 40k questions but only 152 lemmas in its vocabulary.
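The vocabulary computation can be approximated without a lemmatizer by counting unique lower-cased surface tokens, which over-counts inflected forms (e.g. "cat" and "cats" are two entries rather than one lemma) but illustrates the procedure:

```python
import re

def vocab_size(texts):
    """Approximate vocabulary size: unique lower-cased word tokens.

    The paper counts unique lemmas (via spaCy); counting surface
    forms is a simpler proxy that slightly over-counts.
    """
    vocab = set()
    for text in texts:
        # Keep alphabetic tokens, allowing internal apostrophes.
        vocab.update(re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower()))
    return len(vocab)

n = vocab_size(["The cat sat.", "The cats sat on the mat."])
```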
Language Detection We ran a language detector over all datasets, using the pyenchant (for American and British English) and langid libraries. 16 In 37 of the 60 datasets, more than 10% of the words are reported to be non-English. 17 We inspected 200 randomly chosen samples from a subset of these. For Wikipedia datasets (HotpotQA, QAngaroo WikiHop), around 70-75% of those words are named entities; 10-12% are specific terms borrowed from other languages, such as names of plants, animals, etc.; another 8-10% are foreign words, e.g. the word "dialetto" from the HotpotQA passage "Bari dialect (dialetto barese) is a dialect of Neapolitan ..."; about 1.5-3% are misspelled words and tokenization errors. In contrast, for the user-generated dataset AmazonQA, 67% are tokenization and spelling errors. This aspect of a dataset's vocabulary is useful to bear in mind when, for example, fine-tuning a pre-trained language model which has been trained on less noisy text.

12 We consider facts (sentences) from the ARC and QASC corpora as different passages. 13 We focus on short answers, considering long ones only if the short answer is not available. 14 As the data has a non-normal distribution, we calculated the Spearman correlation: coefficient=0.58, p-value=1.3e-05. 15 The values for BookTest (Bajgar et al., 2017) and WhoDidWhat (Onishi et al., 2016) are taken from the respective papers.
First Question Word A number of datasets come with a breakdown of question types based on the first token (Nguyen et al., 2016; Ostermann et al., 2018, 2019; Kočiský et al., 2018; Clark et al., 2019; Xiong et al., 2019; Jing et al., 2019; Bjerva et al., 2020). We inspect the most frequent first word in a dataset's questions, excluding cloze-style questions. Table 1 shows the most frequent first word per dataset and Table 2
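Computing such a breakdown is a one-liner with collections.Counter; the toy question list is ours.

```python
from collections import Counter

def first_word_distribution(questions):
    """Count the first (lower-cased) token of each question."""
    return Counter(q.split()[0].lower() for q in questions if q.split())

dist = first_word_distribution([
    "What is the capital of France?",
    "What year did the war end?",
    "Who wrote the report?",
])
top_word, top_count = dist.most_common(1)[0]
```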

Evaluation Metrics
Accuracy is defined as the ratio of correctly answered questions to all questions. For datasets where the answer should be found or generated (extractive or generative tasks), accuracy is the same as Exact Match (EM), meaning the system answer is exactly the same as the gold answer.
In contrast with selective and boolean tasks, extractive or generative tasks can have ambiguous, incomplete, or redundant answers. In order to assign credit when the system answer does not exactly match the gold answer, Precision and Recall, and their harmonic mean F1, can be calculated over words or characters. Accuracy is used for all boolean, multiple choice, and cloze datasets. 19 For extractive and generative tasks it is common to report both EM (accuracy) and F1. For cloze datasets, the metric depends on the form of the answer: if answer options are available, accuracy can be calculated; if words have to be generated, the F1 measure can also be applied.
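EM and token-level F1 are typically computed after light answer normalization (lower-casing, removing punctuation and English articles), in the style of the SQuAD evaluation script. A self-contained sketch:

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lower-case, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold):
    """EM: normalized prediction equals normalized gold answer."""
    return normalize(prediction) == normalize(gold)

def token_f1(prediction, gold):
    """F1 over normalized tokens: harmonic mean of precision and recall."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, against the gold answer "Protesting." from example (7), the prediction "protesting" scores EM = 1 and F1 = 1.0 after normalization.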
One can view the MRC task from the perspective of Information Retrieval, providing a ranked list of answers instead of one definitive answer. In this case, Mean Reciprocal Rank (MRR) (Craswell, 2009) and Mean Average Precision (MAP) can be used, as well as the accuracy of the top hit over all possible answers (Hits@1). 20 All the metrics mentioned above work well for well-defined answers but might not reflect performance on generative datasets, as there can be several alternative ways to answer the same question. Some datasets provide more than one gold answer. A number of automatic metrics used in language generation evaluation are also applied: Bilingual Evaluation Understudy (BLEU) (Papineni et al., 2002), Recall-Oriented Understudy for Gisting Evaluation (ROUGE-L) (Lin, 2004), and Metric for Evaluation of Translation with Explicit ORdering (METEOR) (Lavie and Agarwal, 2007). MSMARCO, NarrativeQA, and TweetQA are generative datasets which use these metrics.

Choi et al. (2018) introduced the human equivalence score (HEQ). It measures the percentage of examples where the system F1 matches or exceeds human F1, implying the system's output is as good as that of an average human. There are two variants: HEQ-Q, computed over questions, and HEQ-D, computed over dialogues.

19 Except MultiRC, as there are multiple correct answers and all of them should be found, and CliCR and ReCoRD, which use exact match and F1: even though the task is cloze, the answer should be generated (CliCR) or extracted (ReCoRD). 20 MRR and MAP are used only by Yang et al. (2015) in the WikiQA dataset, along with precision, recall and F1; Miller et al. (2016) used the accuracy of the top hit (Hits@1) in the WikiMovies dataset.
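The ranking metrics and HEQ-Q are equally short to implement. The sketch below, with illustrative data of our own, compares candidates and gold answers by exact string equality for simplicity; a real evaluation would normalize answers first.

```python
def mean_reciprocal_rank(ranked_lists, golds):
    """MRR: average of 1/rank of the first correct answer (0 if absent)."""
    total = 0.0
    for candidates, gold in zip(ranked_lists, golds):
        for rank, cand in enumerate(candidates, start=1):
            if cand == gold:
                total += 1.0 / rank
                break
    return total / len(golds)

def hits_at_1(ranked_lists, golds):
    """Fraction of questions whose top-ranked candidate is correct."""
    return sum(c[0] == g for c, g in zip(ranked_lists, golds)) / len(golds)

def heq_q(system_f1s, human_f1s):
    """HEQ-Q: share of questions where system F1 >= human F1."""
    return sum(s >= h for s, h in zip(system_f1s, human_f1s)) / len(system_f1s)

# Toy ranked outputs for three questions (ours).
ranked = [["paris", "lyon"], ["rome", "madrid"], ["berlin"]]
gold = ["paris", "madrid", "vienna"]
```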

Human Performance
Human performance figures have been reported for some datasets (see Table 1). This is useful in two ways. Firstly, it gives some indication of the difficulty of the questions in the dataset: contrast, for example, the low human performance scores reported for the Quasar and CliCR datasets with the very high scores for DREAM, DROP, and MCScript. Secondly, it provides a comparison point for automatic systems, which may serve to direct researchers to under-studied datasets where the gap between state-of-the-art machine performance and human performance is large, e.g. CliCR (33.9 vs. 53.7), RecipeQA (29.07 vs. 73.63), ShARC (81.2 vs. 93.9) and HotpotQA (83.61 vs. 96.37).
Although useful, the notion of human performance is problematic and has to be interpreted with caution. It is usually an average over individuals, whose reading comprehension abilities will vary depending on age, ability to concentrate, and interest in and knowledge of the subject area. Some datasets (CliCR, QUASAR) take the latter into account by distinguishing between expert and non-expert human performance, while RACE distinguishes between crowdworker and author annotations. The authors of MedQA, which is based on medical examinations, use a passing mark (of 60%) as a proxy for human performance. It is important to know this when looking at its "solved" status, since state-of-the-art accuracy on this dataset is only 75.3% (Zhang et al., 2018b).
Finally, Dunietz et al. (2020) call into question the importance of comparing human and machine performance on the MRC task and argue that the questions that MRC systems need to be able to answer are not necessarily the questions that people find difficult to answer.

Related Work
Other MRC surveys have been carried out (Liu et al., 2019;Zhang et al., 2019;Wang, 2020;Qiu et al., 2019;Baradaran et al., 2020;Zeng et al., 2020;Rogers et al., 2021). There is room for more than one survey since each survey approaches the field from a different perspective, varying in the criteria used to decide what datasets to include and what dataset properties to describe. Some focus on MRC systems rather than on MRC datasets.
The surveys closest to ours in coverage are Zeng et al. (2020) and Rogers et al. (2021). The survey of Rogers et al. (2021) has a wider scope, including datasets for any language, multimodal datasets and datasets which do not fall under the canonical MRC passage/question/answer template.
An emphasis in our survey has been on relatively fine-grained dataset processing. For example, instead of reporting only train/dev/test sizes for each dataset, we calculate the length of each question, passage and answer. We examine the vocabulary of each dataset, looking at question words and applying language identification tools, in the process discovering question imbalance and noise (HTML tags, misspellings, etc.). We also report data re-use across datasets.

Concluding Remarks and Recommendations
This paper provides an overview of English MRC datasets released up to and including 2020. We compare the datasets by question and answer type, size, data source, creation method, vocabulary, question type, "solvedness", and human performance level. We observe a tendency to move from smaller datasets towards large collections of questions, and from synthetically generated data, through crowdsourcing, towards spontaneously created data. We also observe a scarcity of why, when, and where questions. Gathering and processing the data for this survey was a painstaking task, 21 from which we emerge with some very practical recommendations for future MRC dataset creators. In order to 1) enable comparison with existing datasets, 2) highlight possible limitations for applicable methods, and 3) indicate the computational resources required to process the data: basic statistics such as average passage/question/answer length, vocabulary size and frequency of question words should be reported; the data itself should be stored in a consistent, easy-to-process fashion, ideally with an API provided; any data overlap with existing datasets should be reported; human performance on the dataset should be measured, and what it means clearly explained; and finally, if the dataset is for the English language and its design does not differ radically from those surveyed here, it is crucial to explain why the new dataset is needed.
For any future datasets, we suggest a move away from Wikipedia, given the volume of existing datasets based on it and its use in pre-trained language models such as BERT (Devlin et al., 2019). As shown by Petroni et al. (2019), its use in both MRC datasets and pre-training data brings with it the problem that we cannot always determine whether a system's ability to answer a question comes from its comprehension of the relevant passage or from the underlying language model. The medical domain is well represented among English MRC datasets, indicating a demand for understanding this type of text. Datasets may be required for other domains, such as retail, law and government.
Some datasets are designed to test the ability of systems to tell if a question cannot be answered, by including a "no answer" label. Building upon this, we suggest that datasets be created for the more complex task of providing qualified answers based on different interpretations of the question.

A Other Datasets
There are a number of datasets we did not include in our analysis, as we focus on the question answering form of Machine Reading Comprehension. In this section we mention some of these and explain why they are excluded. CLOTH (Xie et al., 2018) and the Story Cloze Test (Mostafazadeh et al., 2016, 2017) are cloze-style datasets with a word missing from the context and without a specific query. In contrast, the cloze question answering datasets considered in this work have a passage and a separate sentence which can be treated as a question with a missing word. We also did not include a number of story-completion datasets such as ROCStories (Mostafazadeh et al., 2016), CODAH (Chen et al., 2019), SWAG (Zellers et al., 2018), and HellaSWAG (Zellers et al., 2019), because they contain no questions.
QBLink (Elgohary et al., 2018) is technically an MRC QA dataset, but for every question only the name of a wiki page is available. This "lead in" information is not enough to answer the question without additional resources. In other words, QBLink is a more general QA dataset, like CommonSenseQA (Talmor et al., 2019). Another general dataset is MKQA (Longpre et al., 2020), which contains 260k question-answer pairs in 26 typologically diverse languages.
Textbook Question Answering (TQA) (Kembhavi et al., 2017) is a multi-modal dataset requiring not only text understanding but also picture processing. MCQA is a Multiple Choice Question Answering dataset in English and Chinese based on examination questions, introduced as a shared task at IJCNLP 2017 by Shangmin et al. (2017). The authors do not provide any supporting documents which could be considered as a passage, so it is not a reading comprehension task.

A.1 Non-English Datasets
While we focus on English datasets, there is a growing number of MRC datasets in other languages, and in this section we briefly mention some of them; please see Rogers et al. (2021) for a fuller treatment. TyDi (Clark et al., 2020) is a question answering corpus of 11 typologically diverse languages (Arabic, Bengali, Korean, Russian, Telugu, Thai, Finnish, Indonesian, Kiswahili, Japanese, and English).

As mentioned in Section 3.3, we processed all datasets in the same way with spaCy. If the data is already tokenized or split into sentences, we simply join it back using Python's " ".join(tokens/sentences_list). Based on the spaCy implementation, 23 we would not expect significant differences between the originally provided tokenization and the results of spaCy tokenization of the joined tokens. This ensures consistent processing of all datasets.

B Extra Features and Statistics
There are a few dataset peculiarities worth mentioning. ShARC: there are several scenarios for the same snippet; we consider an instance to be a tree_id, and the concatenation of the snippet, scenario, and follow-up questions with answers to be a passage. HotpotQA: we consider a passage to be the concatenation of all supporting facts, and an instance is identified by the title of a supporting fact; in this case, there are multiple instances for the same question. TyDi: we calculated the statistics on the joined data from the English subset for both the Minimal answer span task and the Gold passage task, for the training and development sets.
WikiQA: we consider a passage to be the concatenation of all sentences; we did the calculations based on the publicly available data and code from the GitHub page. 24 TriviaQA: to obtain the data we modified the script 25 provided by the authors. MSMARCO: we consider every passage separately, so in this case there are multiple passages for one question.
Who Did What: we looked into the relaxed setting. As we do not have a licence for the Gigaword data, we calculated only the average length of the questions and answers. The vocabulary size is provided by the original paper (Onishi et al., 2016).