Do Question Answering Modeling Improvements Hold Across Benchmarks?

Do question answering (QA) modeling improvements (e.g., choice of architecture and training procedure) hold consistently across the diverse landscape of QA benchmarks? To study this question, we introduce the notion of concurrence—two benchmarks have high concurrence on a set of modeling approaches if they rank the modeling approaches similarly. We measure the concurrence between 32 QA benchmarks on a set of 20 diverse modeling approaches and find that human-constructed benchmarks have high concurrence amongst themselves, even if their passage and question distributions are very different. Surprisingly, even downsampled human-constructed benchmarks (i.e., collecting less data) and programmatically-generated benchmarks (e.g., cloze-formatted examples) have high concurrence with human-constructed benchmarks. These results indicate that, despite years of intense community focus on a small number of benchmarks, the modeling improvements studied hold broadly.


Introduction
The NLP community has created a diverse landscape of extractive question answering (QA) benchmarks: their context passages may come from different sources, their questions may focus on different phenomena or be written by different populations, and other aspects of the data collection process may differ. Driven to improve benchmark performance, researchers have proposed a variety of QA modeling approaches. However, not all benchmarks receive equal attention from the community (Koch et al., 2021); many QA modeling approaches are developed on a small handful of benchmarks, especially those with popular leaderboards (e.g., SQuAD; Rajpurkar et al., 2016). As a result, it is conceivable that some modeling improvements may not hold because they are (perhaps inadvertently) benchmark-specific, while others (e.g., pre-training on more data) hold more broadly.

In this work, we evaluate whether improvements from modeling approaches (e.g., choices in model architecture or training procedure) hold: if a particular modeling approach improves performance when trained and evaluated on one benchmark, does it also improve performance on others? Although much existing work studies whether systems generalize (i.e., a model with a particular set of parameters; Jia and Liang, 2017; Talmor and Berant, 2019; Miller et al., 2020), research value often comes not from the systems themselves (e.g., model weights), but from the underlying ideas, techniques, and approaches. We study the comparatively under-investigated question of whether such modeling approaches generalize.

To study whether modeling improvements hold across benchmarks, we introduce the notion of concurrence. We say that two benchmarks have high concurrence on a set of modeling approaches if they rank the modeling approaches similarly. To assess whether modeling improvements hold across the space of QA benchmarks, we measure the concurrence between 32 diverse QA benchmarks on a testbed of 20 representative modeling approaches introduced between 2016 and 2020.

Surprisingly, we find that human-constructed benchmarks (e.g., SQuAD, NaturalQuestions) have high concurrence with other human-constructed benchmarks, downsampled human-constructed benchmarks, and even programmatically-generated cloze benchmarks (e.g., the Children's Book Test; CBT). In addition, we are able to construct synthetic benchmarks that have high concurrence with human-constructed benchmarks despite lacking natural language passages or questions.
Overall, we find that benchmarks that differ substantially still often have high concurrence. Human-constructed benchmarks (e.g., SQuAD and MRQA NaturalQuestions) have high concurrence with each other, despite differences in crowdsourcing setups, passage and question distributions, and even linguistic phenomena of focus (§3).
How different can a benchmark be, while still maintaining high concurrence with human-constructed benchmarks? In §4.1, we investigate the role of training dataset size by measuring concurrence with downsampled training datasets (e.g., using 20K SQuAD training examples rather than the full 88K). We find that downsampled training datasets are sufficient for high concurrence with other human-constructed benchmarks. In §4.2, we measure concurrence between human-constructed and programmatically-generated benchmarks (e.g., cloze-formatted or synthetic) to better understand the importance of human-written questions and passages. We find that cloze-formatted benchmarks have high concurrence with human-constructed benchmarks, so human-written questions and passages are not strictly necessary for concurrence. However, programmatically-generated synthetic benchmarks (e.g., the bAbI task suite) have low concurrence. Having found this breaking point of low concurrence, we construct two minimal synthetic benchmarks that achieve high concurrence with human-constructed benchmarks, despite lacking linguistic structure. Intuitively, the benchmarks that concur with human-constructed benchmarks are those that require model capabilities that are also useful for better performance on human-constructed benchmarks (e.g., identifying paraphrases and lexical overlap; §4.3–4.5).
Our results have several implications for the future development of benchmarks and modeling approaches. To summarize:

1. Human-constructed benchmarks have high concurrence with each other on our testbed of 20 modeling approaches. The modeling approaches studied are not particularly benchmark-specific, and their modeling improvements largely hold across different benchmarks, despite intense community focus on a small number of benchmarks. This is especially true of recent modeling improvements driven by better pre-training, which is largely agnostic to the downstream benchmark.
2. Many benchmarks require reasoning over predicate-argument structure (e.g., SQuAD, NewsQA, NaturalQuestions), and improvements on these benchmarks also transfer to more specialized benchmarks (e.g., HotpotQA or MRQA DROP) because (1) almost all benchmarks involve reasoning over predicate-argument structure and/or (2) better reasoning over predicate-argument structure is correlated with improvements on other phenomena.
3. Human-constructed benchmarks are not strictly necessary for improving performance on other human-constructed benchmarks. Synthetic benchmarks may be useful tools for isolating, understanding, and improving on particular model capabilities.

4. Downsampling benchmarks to as few as 10K training examples does not significantly affect concurrence, especially since recent pre-trained modeling approaches have greater sample efficiency. We recommend the community build benchmarks that are smaller but more challenging (e.g., harder/more expensive to label per-example).
5. Since human-constructed benchmarks have high concurrence amongst themselves, we encourage researchers to seek diversity and build benchmarks that explore qualitatively different modeling capabilities that push research in new directions.

Measuring Concurrence
Informally, we say that two benchmarks have high concurrence on a set of modeling approaches if the two benchmarks rank the modeling approaches similarly. We compare the performance of a modeling approach when trained and tested on one benchmark with its performance when trained and tested on another benchmark; we use each benchmark's original i.i.d. train-test split, so all evaluation is in-domain. Repeating this process for many modeling approaches, we can assess whether performance gains between modeling approaches are generally preserved when moving between benchmarks.

Formally, define a benchmark B as a pair of datasets (D_train, D_test), where D_train ⊆ X × Y and D_test ⊆ X × Y for an input space X and an output space Y. A system is a function s : X → Y (i.e., a trained model with a particular set of parameters). In contrast, a modeling approach (i.e., a neural architecture coupled with a training procedure) is a function a that takes in a training dataset D_train and outputs a system. Let EVAL denote an evaluation function, where EVAL(a, B) returns the performance (under a given evaluation metric, e.g., exact match) of a modeling approach a when trained on the train split of B and tested on the test split of B. Finally, CONCUR(B_1, B_2; A, EVAL) is the concurrence between the benchmarks B_1 and B_2 with respect to a set of modeling approaches A and the evaluation function EVAL. Let a ∼ uniform(A), where uniform(A) denotes the uniform distribution over the set of modeling approaches A. Defining the random variables P_1 = EVAL(a, B_1) and P_2 = EVAL(a, B_2), we finally define

CONCUR(B_1, B_2; A, EVAL) = CORR(P_1, P_2),

where CORR is some correlation function.
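As a concrete illustration, the following minimal sketch (not the authors' released code) computes concurrence over a finite set of modeling approaches by correlating the two score vectors; the approach names and scores are placeholders, and SciPy's correlation functions stand in for CORR.

```python
from scipy.stats import kendalltau, pearsonr

def concur(scores_b1, scores_b2, approaches, corr):
    """CONCUR(B1, B2; A, EVAL) for a finite approach set A.

    scores_b1[a] and scores_b2[a] hold EVAL(a, B1) and EVAL(a, B2): the EM of
    approach a trained and tested on each benchmark's own i.i.d. split.
    corr is the correlation function CORR (e.g., Pearson r or Kendall tau).
    """
    p1 = [scores_b1[a] for a in approaches]
    p2 = [scores_b2[a] for a in approaches]
    return corr(p1, p2)[0]  # SciPy correlation functions return (statistic, p-value)

# Placeholder EM scores (not real results), just to show the call pattern.
approaches = ["approach_1", "approach_2", "approach_3", "approach_4"]
em_b1 = {"approach_1": 60.0, "approach_2": 70.0, "approach_3": 75.0, "approach_4": 85.0}
em_b2 = {"approach_1": 40.0, "approach_2": 55.0, "approach_3": 52.0, "approach_4": 66.0}
r = concur(em_b1, em_b2, approaches, pearsonr)
tau = concur(em_b1, em_b2, approaches, kendalltau)
```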
We use the SQuAD exact match (EM) metric as our evaluation function EVAL, and we consider the Pearson correlation coefficient (r) and the Kendall rank correlation coefficient (τ) as our correlation functions CORR. The former measures whether the relationship between model performance on the two benchmarks is approximately linear, whereas the latter measures whether pairwise rank comparisons between models are preserved between benchmarks. As a rough guideline, we consider τ > 0.8 to be high concurrence, though interpreting concurrence often requires more than comparing overall correlation.

Extractive QA modeling approaches. To assess concurrence, we use a representative set A of 20 diverse modeling approaches introduced between 2016 and 2020, ranging from non-pretrained architectures to pre-trained Transformers such as SpanBERT (Joshi et al., 2020); the full list and implementation details appear in Appendix A (Table 5). 10 of our 20 modeling approaches are non-pretrained. These approaches generally propose (1) better sequence encoders for passages and questions (e.g., Lee et al., 2016; Yang et al., 2017; Yu et al., 2018) and/or (2) improved attention mechanisms for question-passage interactions (e.g., Seo et al., 2017; Wang et al., 2017; Huang et al., 2018).
In contrast, the other 10 of our 20 modeling approaches are pre-trained; these modeling approaches all use the Transformer architecture (Vaswani et al., 2017), but improve performance by proposing better pre-training procedures and objectives. These pre-trained modeling approaches are generally evaluated on a suite of downstream tasks, in contrast to non-pretrained modeling approaches, which are generally evaluated on a single benchmark.
All of these modeling approaches were originally evaluated on SQuAD, though several (e.g., SpanBERT) were also evaluated on other QA benchmarks. We evaluate each modeling approach on each benchmark with the same training hyperparameters used for SQuAD, as well as 5 additional randomly sampled hyperparameter settings.
Extractive QA benchmarks. In this work, we study concurrence between three broad classes of extractive QA benchmarks: (i) human-constructed, (ii) cloze, and (iii) synthetic. Human-constructed benchmarks contain human-written natural language questions and passages; examples include SQuAD, NewsQA (Trischler et al., 2017), and NaturalQuestions (Kwiatkowski et al., 2019). On the other hand, cloze benchmarks (e.g., the Children's Book Test or CNN; Hill et al., 2016; Hermann et al., 2015) contain cloze questions, which are "fill-in-the-blank" statements with masked answers. These questions are usually automatically generated from human-written natural language passages. Finally, synthetic benchmarks contain programmatically-generated questions and passages (e.g., the bAbI task suite; Weston et al., 2016).

Results
Human-constructed benchmarks have high concurrence amongst themselves. Despite differences in benchmark crowdsourcing setups, passage and question distributions, and even linguistic phenomena of interest, modeling improvements generally hold across human-constructed benchmarks (Table 1). Furthermore, concurrence is high over both non-pretrained and pre-trained modeling approaches (Figure 2). For example, SQuAD, NewsQA, and NaturalQuestions differ in their passage-question joint relationship. In SQuAD, crowdworkers are employed to write questions given Wikipedia passages, but this results in questions with high lexical overlap with salient passage sentences. To minimize such overlap in NewsQA, crowdworkers write questions given only bullet-point summaries of the passages, rather than the passages themselves. Finally, questions in NaturalQuestions are written independently of their provided passage. These different crowdsourcing protocols drastically affect the ease and cost of benchmark construction, but SQuAD, NewsQA, and NaturalQuestions have high concurrence despite these differences.

Concurrence is high even when benchmarks focus on different phenomena. We also see that MRQA DROP and MRQA HotpotQA have surprisingly high concurrence with other human-constructed benchmarks (e.g., SQuAD and NaturalQuestions), despite their relatively specialized focus on particular linguistic phenomena (numerical and multi-hop reasoning, respectively). This suggests that modeling improvements on benchmarks that target general reasoning over predicate-argument structure also improve performance on benchmarks that focus on different phenomena. We hypothesize this occurs because benchmarks are more similar than we would otherwise expect (e.g., due to reasoning shortcuts; Min et al., 2019), and better reasoning over predicate-argument structure may be generally useful for other phenomena of interest.

Exploring the Limits of Concurrence
Our results in §3 indicate that human-constructed benchmarks have high concurrence with each other. How different can a benchmark be while still maintaining high concurrence? In §4.1, we measure concurrence with downsampled human-constructed benchmarks, and in §4.2 we measure concurrence with programmatically-generated cloze benchmarks. To take this to an extreme, §4.3 evaluates concurrence between programmatically-generated synthetic benchmarks (the bAbI task suite) and human-constructed benchmarks. Our results show that the bAbI tasks have low concurrence with human-constructed benchmarks. Having found this breaking point, we work backwards to build a minimal benchmark with high concurrence, which enables us to better understand sufficient conditions for concurrence. In §4.4, we construct a benchmark that has no linguistic structure or complex reasoning but still has high concurrence with human-constructed benchmarks over non-pretrained models. Finally, §4.5 shows that a synthetic benchmark that requires richer reasoning between question and passage tokens can achieve high concurrence with human-constructed benchmarks on both pre-trained and non-pretrained modeling approaches.

Downsampled Human-Constructed Benchmarks

Downsampled human-constructed benchmarks retain high concurrence with other human-constructed benchmarks (Table 3). Concurrence is high on both non-pretrained and pre-trained modeling approaches (Figure 3). Downsampling to 10K examples slightly reduces concurrence with non-pretrained modeling approaches. Concurrence with pre-trained models only begins to degrade when using 1K training examples, indicating that few-shot settings are likely categorically different and worth studying separately.
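As a minimal sketch of the downsampling setup (the file format and output naming below are illustrative assumptions; the experiments use each benchmark's own released format), a fixed-seed random subset of the training examples is drawn at each target size:

```python
import json
import random

def downsample(examples, n, seed=0):
    """Return a fixed-size, fixed-seed random subset of training examples."""
    return random.Random(seed).sample(examples, n)

# Illustrative usage: one JSON-encoded training example per line.
with open("train.jsonl") as f:
    examples = [json.loads(line) for line in f]

for size in (20_000, 10_000, 1_000):  # sizes studied include 20K, 10K, and 1K
    subset = downsample(examples, size)
    with open(f"train_{size}.jsonl", "w") as out:
        out.writelines(json.dumps(ex) + "\n" for ex in subset)
```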

Cloze Benchmarks
To better understand the importance of human-written questions and passages, we measure concurrence between human-constructed benchmarks and cloze benchmarks. Cloze extractive question answering benchmarks contain cloze questions, which are "fill-in-the-blank" statements with masked answers. Large cloze benchmarks are cheap to construct because examples can be automatically generated by eliding spans from naturally-occurring text. Although the passages in cloze benchmarks are natural language, their fill-in-the-blank questions often require guessing from context, rather than the answer deduction typically found in human-constructed benchmarks.
Setup. We study the Children's Book Test (CBT), CNN, LAMBADA, and ReCoRD; Appendix B.2 describes how we convert each benchmark to an extractive format.

Results. Despite using programmatically-generated cloze questions, cloze benchmarks (e.g., CBT and LAMBADA) can have high concurrence with human-constructed benchmarks (Table 4).

Table 4: Concurrence between programmatically-generated cloze benchmarks and human-constructed benchmarks can be high (e.g., CBT and LAMBADA), but not always (CNN and ReCoRD).
On the other hand, CNN and ReCoRD have lower concurrence with human-constructed benchmarks, especially on non-pretrained modeling approaches; the performance improvements between pre-trained modeling approaches are still largely preserved (Figure 4).
Concurrence on CNN is lower due to a pair of outlier modeling approaches: DocumentReader, with and without external linguistic features. We hypothesize that these models do poorly on CNN because some aspects of their preprocessing are SQuAD-specific; this may have also influenced architecture design. ReCoRD's low overall concurrence comes from the poor performance of non-pretrained modeling approaches. This may be due to ReCoRD's construction procedure, since a filtering step removed all examples that were correctly answered by a strong non-pretrained modeling approach (SAN, with SQuAD dev. EM of 76.24; Liu et al., 2018). ReCoRD has low concurrence with SQuAD on modeling approaches that are weaker than SAN, and high concurrence on modeling approaches that outperform SAN.

High Concurrence Is Not Universal: Improvements Do Not Hold On bAbI
Having established that human-written passages are not necessary for high concurrence with human-constructed benchmarks (§4.2), we take this to an extreme by evaluating concurrence between human-constructed benchmarks and synthetic extractive question answering benchmarks, which contain questions and passages that are programmatically generated (and possibly not even natural language). The bAbI task suite contains 20 synthetic question-answering benchmarks, each of which focuses on a particular skill required by a competent dialogue system (e.g., fact retrieval, subject-object relations, counting). The textual data is generated from a simulated toy environment.
Setup. We consider the 11 bAbI tasks that can be losslessly converted to an extractive format (Tasks 1, 2, 3, 4, 5, 11, 12, 13, 14, 15, and 16; see Appendix B.3).

Results and Discussion. The bAbI tasks have low concurrence with human-constructed benchmarks: high concurrence is not universal. Modeling approaches often have either near-perfect or near-random performance (Figure 5).

What is Sufficient for Concurrence on Non-Pretrained Modeling Approaches?
To better understand the sufficient conditions for concurrence with human-constructed benchmarks, we are interested in constructing a minimal synthetic benchmark with high concurrence. Human-written passages and questions are not necessary for high concurrence with human-constructed benchmarks (§4.2), but the programmatically-generated bAbI synthetic benchmarks have low concurrence (§4.3). We therefore design a minimal synthetic benchmark that has high concurrence with human-constructed benchmarks over non-pretrained modeling approaches.

Setup. Questions in extractive QA benchmarks can often be answered by exploiting lexical overlap between question and passage tokens (Weissenborn et al., 2017; Krishna et al., 2020). To better understand the limits of concurrence, we build a minimal synthetic cloze benchmark (FuzzySyntheticQA) that explicitly targets this fuzzy pattern-matching and find that it has high concurrence with SQuAD on non-pretrained modeling approaches. Figure 6 shows a sample passage and question-answer pairs. We use 10,000 questions for training and 10,000 questions for evaluation. See Appendix E for further details about FuzzySyntheticQA's construction.
Passage Generation. We generate the passage by randomly sampling 150 tokens from the uniform distribution over a token vocabulary. The token vocabulary is taken from the WikiText-2 training set (Merity et al., 2017) and has 68,429 types.

Answer Generation. The answer token is randomly selected from the generated passage.

Cloze Question Generation. To generate the cloze question, we first extract the answer token's local context (up to 10 tokens) and mask out the answer token. Then, we corrupt the cloze question by (1) randomly replacing its tokens with related tokens (the 100 approximate nearest neighbor tokens in the vocabulary, measured by vector distance in the pre-trained English FastText embeddings), (2) locally permuting its tokens (within 3 positions), and (3) applying word dropout (with rate 0.2).

Results and Discussion. FuzzySyntheticQA has high concurrence with human-constructed benchmarks, but only on non-pretrained modeling approaches; concurrence on pre-trained modeling approaches is much lower (Figure 7). Even benchmarks that lack much linguistic structure can have high concurrence with human-constructed benchmarks, as long as they require similar phenomena (in this case, fuzzy lexical matching between the question and passage). Why do improvements in pre-training not hold on FuzzySyntheticQA? One potential reason is that passages in FuzzySyntheticQA lack linguistic structure. To evaluate this hypothesis, we generate FuzzySyntheticQA questions from English Wikipedia passages, rather than sampling from the uniform distribution over tokens, but this still results in low concurrence with human-constructed benchmarks on pre-trained modeling approaches (r = −0.49, τ = −0.19), indicating that the low concurrence comes from more than just a lack of natural language passages (Appendix F).

Figure 7: Left: FuzzySyntheticQA has high concurrence with SQuAD on non-pretrained modeling approaches, but pre-training does not increase performance, leading to low overall concurrence. Right: Despite lacking natural language structure, WikidataSyntheticQA has high concurrence with SQuAD.
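The generation procedure above can be sketched as follows (a simplified illustration, not the released pipeline): it assumes a vocabulary list and a precomputed map from each token to its nearest-neighbor replacements, and the replacement probability and exact context-window size are assumed values not specified in the text.

```python
import random

def make_fuzzy_example(vocab, neighbors, rng, passage_len=150, window=5,
                       replace_prob=0.3, max_shift=3, dropout=0.2, mask="XXXXX"):
    """Generate one FuzzySyntheticQA-style (passage, cloze question, answer).

    vocab: token types to sample passages from (the paper uses the WikiText-2 vocab).
    neighbors: token -> list of related tokens (the paper uses each token's 100
    approximate nearest neighbors in pre-trained FastText space).
    """
    # Passage: i.i.d. uniform samples from the vocabulary.
    passage = [rng.choice(vocab) for _ in range(passage_len)]
    # Answer: a randomly chosen passage token.
    ans_idx = rng.randrange(passage_len)
    answer = passage[ans_idx]
    # Cloze question: the answer's local context, with the answer masked out.
    lo, hi = max(0, ans_idx - window), min(passage_len, ans_idx + window)
    question = [mask if i == ans_idx else passage[i] for i in range(lo, hi)]
    # (1) Randomly replace question tokens with related (nearest-neighbor) tokens.
    question = [rng.choice(neighbors.get(t, [t]))
                if t != mask and rng.random() < replace_prob else t
                for t in question]
    # (2) Locally permute tokens: each token moves at most ~max_shift positions.
    keys = [i + rng.uniform(-max_shift, max_shift) for i in range(len(question))]
    question = [t for _, t in sorted(zip(keys, question))]
    # (3) Word dropout (never dropping the mask token itself).
    question = [t for t in question if t == mask or rng.random() >= dropout]
    return " ".join(passage), " ".join(question), answer
```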

What is Sufficient for Concurrence on Pre-Trained and Non-Pretrained Modeling Approaches?
Having found a minimal synthetic benchmark that achieves high concurrence with human-constructed benchmarks on non-pretrained modeling approaches (§4.4), we show that a synthetic benchmark that requires richer reasoning between question and passage tokens is sufficient for high concurrence on both non-pretrained and pre-trained modeling approaches.
Setup. We construct WikidataSyntheticQA, a benchmark derived from Wikidata triples; Figure 8 shows a sample passage and question-answer pairs. Knowledge graphs like Wikidata are rich sources of complex relations between entities, which enables us to increase the complexity of question-passage token relations beyond the simple noising and corruptions of FuzzySyntheticQA. We use 10,000 questions for training and 9,835 question-answer pairs for evaluation. See Appendix G for further details about WikidataSyntheticQA's construction.
Wikidata Background. Wikidata is a knowledge graph connecting entities via relations. Wikidata entities and relations include a label, the most common name that an entity is known by, and aliases, alternative names for entities. For example, the entity Mae_C._Jemison has the label "Mae C. Jemison", with aliases "Mae Jemison" and "Mae Carol Jemison". We treat labels and aliases as potential surface realizations of entities and relations.
Generation Preliminaries. Generating a passage requires a set of Wikidata triples. To select these triples, we first randomly choose a seed entity from the 10,000 Wikidata entities with the highest PageRank score (Page et al., 1999). We then extract the triples from the seed entity and all entities connected to the seed entity. Finally, we randomly sample 50 triples for use in generation.

Passage Generation. Given the set of 50 Wikidata triples, we realize triples into textual surface forms by selecting a random Wikidata label or alias for each triple element. The final passage is formed by concatenating the realizations of all triples and adding a delimiter token between them to mimic sentential structure.

Answer Generation. We generate an answer span by selecting a random triple used in the passage generation process, and then choosing a random element of that triple. The passage realization of this random element is the answer span.

Cloze Question Generation. To generate the cloze question, we take the triple used for answer generation and mask out the particular element marked as the answer. We realize the non-answer triple elements into textual forms by selecting a random Wikidata label or alias for each triple element. Then, we optionally and randomly replace the predicate with its inverse (if one exists), reversing the subject and the object to maintain consistency. We also optionally and randomly replace the remaining unmasked entity (i.e., the triple subject or object that was not masked) with one of its hypernyms, challenging models' knowledge of such relations.

Results and Discussion. As Figure 7 shows, WikidataSyntheticQA has high concurrence with human-constructed benchmarks, despite its lack of natural language passages or questions. We hypothesize that WikidataSyntheticQA has higher concurrence with human-constructed benchmarks than FuzzySyntheticQA because correctly answering its examples often requires reasoning about hypernymy relations between entities and inverse relations between predicates; it is conceivable that pre-trained modeling approaches are better equipped to handle and use these lexical relations. In addition, the Wikidata aliases provide sufficient lexical variation such that the benchmark is not trivially solvable through string pattern-matching (removing aliases from the generation procedure results in near-perfect performance from all modeling approaches). In contrast, high performance on FuzzySyntheticQA simply requires matching similar tokens in the passage and question; models can achieve high performance by simply learning the similarity relationships in the FastText vector space.
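The following is a simplified sketch of the generation procedure described above (the triples and alias tables are toy placeholders standing in for Wikidata, and the inverse-predicate and hypernym-replacement steps are omitted):

```python
import random

# Toy stand-ins for Wikidata items and their labels/aliases (illustrative only).
ALIASES = {
    "Q_mae_jemison": ["Mae C. Jemison", "Mae Jemison", "Mae Carol Jemison"],
    "Q_astronaut": ["astronaut"],
    "P_occupation": ["occupation", "profession"],
}
TRIPLES = [("Q_mae_jemison", "P_occupation", "Q_astronaut")]

def realize(item, rng):
    """Pick a random surface form (label or alias) for an entity or relation."""
    return rng.choice(ALIASES[item])

def make_wikidata_example(triples, rng, mask="XXXXX", delim="//"):
    # Passage: concatenate realized triples, separated by a delimiter token.
    realized = [[realize(element, rng) for element in triple] for triple in triples]
    passage = f" {delim} ".join(" ".join(parts) for parts in realized)
    # Answer: a random element of a random triple; its passage realization is the span.
    t_idx = rng.randrange(len(triples))
    e_idx = rng.randrange(3)
    answer = realized[t_idx][e_idx]
    # Cloze question: re-realize the non-answer elements and mask the answer element.
    question = " ".join(mask if i == e_idx else realize(element, rng)
                        for i, element in enumerate(triples[t_idx]))
    return passage, question, answer

rng = random.Random(0)
print(make_wikidata_example(TRIPLES, rng))
```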

Related Work
A recent line of work examines whether systems have overfit to particular test sets by taking existing systems and evaluating them on newly-constructed test sets (Recht et al., 2019;Yadav and Bottou, 2019;Miller et al., 2020). Recent work has also studied whether higher-performing systems are more robust by studying the correlation between in-domain and out-of-domain improvements (Taori et al., 2020;Djolonga et al., 2020).
In contrast, this work examines whether improvements from modeling approaches hold across benchmarks. We train and test modeling approaches on a variety of existing and newly-constructed benchmarks. In this regard, our work is similar to the study of Kornblith et al. (2019), who find that performance improvements on ImageNet are well-correlated with performance improvements on other benchmarks.

Conclusion
This work studies whether QA modeling improvements hold across the diverse landscape of QA benchmarks. We develop the notion of concurrence, which quantifies the similarity between benchmarks' rankings of modeling approaches. Experiments with 32 QA benchmarks and 20 diverse modeling approaches indicate that human-constructed benchmarks largely have high concurrence amongst themselves, even when their passage and question distributions or linguistic phenomena of focus are very different. To better understand how different benchmark attributes affect concurrence, we explore downsampled benchmarks and various programmatically-generated benchmarks; the latter have high concurrence only when they require capabilities that are also useful for better performance on human-constructed benchmarks (e.g., identifying paraphrases and lexical overlap). Our results indicate that the modeling improvements studied hold broadly, despite years of intense community focus on a small number of benchmarks.

Limitations
While we conducted an extensive set of experiments to gain a broad picture of whether modeling improvements hold between benchmarks, it is always possible to investigate more settings. Although our study covers a representative set of 20 non-pretrained and pre-trained modeling approaches, it is conceivable that evaluating more modeling approaches (or a different set of modeling approaches) on additional benchmarks (or a different set of benchmarks) would have led to different results.
Furthermore, although we evaluate each modeling approach on each benchmark with the same training hyperparameters used for SQuAD, as well as 5 additional randomly sampled hyperparameter settings (20 × 32 × 6 = 3840 experiments in total), it is possible that the SQuAD hyperparameters for some modeling approaches happen to be more general than other modeling approaches. Ideally, each modeling approach would be individually tuned to maximize performance on every benchmark, but doing so requires prohibitive amounts of compute and researcher effort-we believe that our experiments have enough coverage with respect to hyperparameter optimization.

Appendices

A Implementation Details of Modeling Approaches Evaluated
We evaluated a representative subset of 20 extractive question answering modeling approaches published between 2016 and 2020 (Table 5). Below, we describe implementation details for the modeling approaches evaluated.

RaSoR. We reimplement the RaSoR model of Lee et al. (2016) with PyTorch in the AllenNLP (Gardner et al., 2018) framework, following the original paper as closely as possible. While the authors released an implementation of their method (github.com/shimisalant/rasor), the codebase is in Theano and inexplicably fails on passages that are significantly longer than those found in SQuAD (e.g., those found in the CNN benchmark).

DocumentReader. The DocumentReader approach uses external features from a part-of-speech tagger and named entity recognition system. To fairly compare to systems that do not use such external resources, we also run the models without these features. We keep the hand-crafted term-frequency and token exact match features defined in the DocumentReader paper. We also make some changes to the DocumentReader preprocessing code. In particular, the original implementation (github.com/facebookresearch/DrQA) of these two modeling approaches (intended for training and evaluation on SQuAD) replaces all tokens without a pre-trained GloVe embedding (trained on 840B tokens from the Common Crawl) with a special unknown token; the reimplementation we use adopts the same practice. This preprocessing assumption works well for SQuAD, since the vast majority of SQuAD tokens also appear in the GloVe vocabulary. However, it does not apply to CNN: many of the special @entityN and @placeholder markers, which anonymize entities to prevent models from deriving answers from world knowledge, are not in the GloVe vocabulary. As a result, the original DocumentReader implementation maps them all to a single unknown token, effectively preventing the model from telling valid answer choices apart and yielding a model that performs no better than the majority baseline. Keeping these special tokens in the model's vocabulary enables differentiating between different entities in a passage, which naturally improves performance (these are the numbers we report); however, the modeling approaches' improvements on SQuAD still do not transfer to CNN.

FusionNet. Drawing inspiration from DocumentReader, the FusionNet approach also uses external features from a part-of-speech tagger and named entity recognition system. As a result, we also run the models without these features to fairly compare to systems that do not use such external resources. We keep the hand-crafted term-frequency and token exact match features originally used in the FusionNet paper.

B Preprocessing Existing Benchmarks

B.1 Existing Human-Constructed Benchmarks
We use the MRQA NewsQA, MRQA DROP, and MRQA HotpotQA benchmarks exactly as released by the MRQA 2019 shared task (Fisch et al., 2019). The passages in MRQA NaturalQuestions contain HTML entities (e.g., <P> and </P>). The tokenizers used in non-pretrained models frequently split these entities into separate tokens; for example, <P> may become <, P, and >. This is problematic because the entities are quite common in passages, and expanding them during tokenization drastically increases passage lengths, which some non-pretrained modeling approaches cannot handle due to GPU memory limits. HTML entities are tokenized like this because they contain non-alphanumeric characters. As a result, we normalize HTML entities by replacing their non-alphanumeric characters: for example, <P> becomes BPB, and </P> becomes EEPE. These normalized tokens are kept intact by the tokenizers. It is possible that modeling approaches that use subword information will perform worse with these normalized HTML entities, but we empirically observe that this normalization does not have a measurable impact on model performance.
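A minimal sketch of this normalization (the regular expression and the handling of tags other than <P> and </P> are assumptions; the text only specifies the <P>/</P> examples):

```python
import re

def normalize_html_entities(passage: str) -> str:
    """Rewrite HTML entity tokens so whitespace tokenizers keep them intact,
    e.g., <P> -> BPB and </P> -> EEPE."""
    def repl(match):
        tag = match.group(1).upper()
        return f"EE{tag}E" if match.group(0).startswith("</") else f"B{tag}B"
    return re.sub(r"</?([A-Za-z0-9]+)>", repl, passage)

print(normalize_html_entities("<P> A shooting schedule is a project plan . </P>"))
# -> "BPB A shooting schedule is a project plan . EEPE"
```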
QAMR questions were originally collected at the sentence level, but we concatenate these sentences to reconstruct the original passages they were sourced from. We then pair these reconstructed passages with the original QAMR questions. It is possible for questions to become unanswerable at the passage level. One case of this happens when two sentences yield the same question; we filter out questions that are asked for multiple sentences in a reconstructed passage. Questions can also become unanswerable if relations between entities change between sentences. For example, given the passage "Bill lived in California in 1920. Bill lived in Washington in 1921.", the question "Where did Bill live?" is answerable within the context of a particular sentence, but not in the context of the entire passage. Manual examination of reconstructed QAMR passages and questions suggests that this case is rather uncommon, but it may still introduce a small amount of noise into the benchmark.
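A sketch of the duplicate-question filter described above (the record structure is an assumption; only the filtering logic is illustrated, not the passage reconstruction itself):

```python
from collections import Counter, defaultdict

def filter_duplicate_questions(sentence_level_qas):
    """sentence_level_qas: iterable of (passage_id, question, answer) records,
    where each question was originally written for a single sentence.
    Drops any question string asked for more than one sentence of the same
    reconstructed passage, since it may be ambiguous at the passage level."""
    per_passage = defaultdict(list)
    for passage_id, question, answer in sentence_level_qas:
        per_passage[passage_id].append((question, answer))

    kept = {}
    for passage_id, qas in per_passage.items():
        counts = Counter(question for question, _ in qas)
        kept[passage_id] = [(q, a) for q, a in qas if counts[q] == 1]
    return kept
```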

B.2 Existing Cloze Benchmarks
To convert the CBT and CNN benchmarks to an extractive format, we take the passages and questions as-is. The answer span is designated as the first occurrence of the answer token in the passage. To convert LAMBADA into an extractive format, we follow the setup of Cheng and Erk (2020). The ReCoRD benchmark is used as-is, since it includes span-level annotations of answer tokens in passages.
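A sketch of this conversion (assuming whitespace-tokenized passages and character offsets into the space-joined passage):

```python
def first_occurrence_span(passage_tokens, answer_token):
    """Designate the answer span as the first occurrence of the answer token,
    returned as character offsets into the whitespace-joined passage."""
    start = 0
    for token in passage_tokens:
        if token == answer_token:
            return start, start + len(token)
        start += len(token) + 1  # +1 for the joining space
    return None  # the answer token does not appear in the passage

passage = "she filled a cup and gave to him".split()
print(first_occurrence_span(passage, "cup"))  # (13, 16)
```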

B.3 Existing Synthetic Benchmarks
We consider Tasks 1, 2, 3, 4, 5, 11, 12, 13, 14, 15, and 16. The other tasks cannot be converted to an extractive format (e.g., they require "yes"/"no" answers that do not appear in passages). To convert the bAbI tasks to an extractive format, we take the passages and questions as-is. While the bAbI benchmark does not provide character-level span annotations for answers, questions come with "supporting facts": sentences in the passage that contain the answer. Thus, we choose the first occurrence of the answer token in the supporting fact sentence as our answer span.
Some of the bAbI tasks, while usable in an extractive format in theory, cannot be trivially converted via the procedure above because the released benchmark's annotations do not appear in the passage. For instance, consider Figure 9, which shows an example drawn from the training set of Task 15. The answer provided in the benchmark is "cat", although this token never appears in the passage; instead, "cats" does. In cases where the originally-labeled answer cannot be found in the supporting fact, but its pluralization is present, we use the pluralized answer as our answer span.
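A sketch of the bAbI conversion, including the pluralization fallback described above (the sentences are an illustrative Task-15-style example, and the naive "+s" pluralization is an assumption that covers cases like "cat"/"cats"):

```python
def babi_answer_span(passage_sentences, supporting_idx, answer):
    """Locate the answer inside the supporting-fact sentence, falling back to a
    naive pluralization when the labeled answer itself does not appear there.
    Returns (matched text, start, end) as offsets into the space-joined passage."""
    support = passage_sentences[supporting_idx].split()
    for candidate in (answer, answer + "s"):  # e.g., "cat" -> "cats"
        if candidate in support:
            offset = sum(len(s) + 1 for s in passage_sentences[:supporting_idx])
            start = offset + len(" ".join(support[:support.index(candidate)]))
            if support.index(candidate) > 0:
                start += 1  # the space before the matched token
            return candidate, start, start + len(candidate)
    return None

sentences = ["Mice are afraid of cats .", "Gertrude is a mouse ."]
print(babi_answer_span(sentences, 0, "cat"))  # matches the pluralized "cats"
```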

C.1 Examples From Existing Human-Constructed Benchmarks

Table 6 shows examples from the existing human-constructed benchmarks we study.

MRQA NewsQA (CNET)
Passage: When Facebook Chief Executive Mark Zuckerberg recently announced a "Like" button that publishers could place on their Web pages, he predicted it would make the Web smarter and "more social". What Zuckerberg didn't point out is that widespread use of the Like button allows Facebook to track people as they switch from CNN.com to Yelp.com to ESPN.com, all of which are sites that have said they will implement the feature...
Question: What does the like button allow?
Answer: Facebook to track people

MRQA NaturalQuestions
Passage: BPB A shooting schedule is a project plan of each day 's shooting for a film production . It is normally created and managed by the assistant director , who reports to the production manager managing the production schedule . Both schedules represent a timeline stating where and when production resources are used . EEPE

QAMR
Passage: An additional problem to face the empire came as a result of the involvement of Emperor Maurice -LRB- r. 582 -602 -RRB- in Persian politics when he intervened in a succession dispute . This led to a period of peace , but when Maurice was overthrown , the Persians invaded and during the reign of Emperor Heraclius -LRB- r. 610 -641 -RRB- controlled large chunks of the empire , including Egypt , Syria , and Anatolia until Heraclius ' successful counterattack . In 628 the empire secured a peace treaty and recovered all of its lost territories .
Question: Whose politics did the empire get involved with?
Answer: Persian

Table 6: Example passages, questions, and answers from the existing human-constructed benchmarks we study.

C.2 Examples From Existing Cloze Benchmarks

Table 7 shows examples from the existing cloze benchmarks we study.

Children's Book Test (Common Nouns)
Passage (with cloze question): ... Lady Latifa argued and urged her wishes , but in vain ; the prince was not to be moved . Then she called to the cupbearers for new wine , for she thought that when his head was hot with it he might consent to stay . The pure , clear wine was brought ; she filled a cup and gave to him . He said : ' O most enchanting sweetheart ! it is the rule for the host to drink first and then the guest . ' So to make him lose his head , she drained the XXXXX ; then filled it again and gave him .
Answer: cup

Children's Book Test (Named Entities)
Passage (with cloze question): ... At last , however , the Sunball became aware how sad Letiko was . ... Then he sent them away , and called two hares to him , and said : ' Will you take Letiko home to her mother ? ' ' Yes , why not ? ' ' What will you eat and drink if you should become hungry and thirsty by the way ? ' ' We will eat grass and drink from streamlets . ' ' Then take her , and bring her home . ' Then the hares set out , taking XXXXX with them , and because it was a long way to her home they became hungry by the way .
Answer: Letiko

LAMBADA
Passage (with cloze question): sorry 's not going to win me my game tomorrow . my racket is . i ca n't believe i let you take it out of here in the first place ! " " but , dad , i 'm sure you made mistakes when you were a hippie teenager ! " " and i paid for them ! like you 're going to pay for my
Answer: racket

CNN
Passage (with cloze question): ( @entity0 ) you 'll see some familiar faces in the @entity1 . @entity2 beat @entity3 66 -52 on sunday , giving @entity4 ' coach @entity5 his 12th trip to the semifinals of the @entity6 men 's basketball tournament . @entity7 and @entity8 each scored 16 to help @entity2 win the @entity9 . @entity3 , led by 16 points from @entity10 , was hoping to earn its first trip to the @entity1 . here 's how the @entity1 , to be played in @entity11 , has shaped up : next saturday , @entity2 will face @entity12 in the first semifinal . in the next game , top seed @entity13 will battle @entity14 . ... the @entity1 matchups : @placeholder vs. @entity12 and @entity13 vs. @entity14
Answer: @entity2

ReCoRD
Passage (with cloze question): Secretary of State Hillary Clinton on Monday tried to douse a political firestorm over the deadly assault on a U.S. diplomatic mission in Libya, saying she's responsible for the security of American diplomatic outposts. "I take responsibility," Clinton told CNN in an interview while on a visit to Peru. "I'm in charge of the State Department's 60,000-plus people all over the world, 275 posts. The president and the vice president wouldn't be knowledgeable about specific decisions that are made by security professionals. They're the ones who weigh all of the threats and the risks and the needs and make a considered decision." @highlight "What I want to avoid is some kind of political gotcha or blame game," Clinton says @highlight "I take this very personally," she says @highlight Diplomats need security but "can't hang out behind walls," she adds Clinton also described a desperate scene in the @placeholder during the hours of the attack, as staff tried to find out what had happened.
Answer: State Department

Table 7: Example passages, questions, and answers from the existing cloze benchmarks we study.

C.3 Examples From Existing Synthetic Benchmarks
Passage: Lily is a swan. Lily is white. Bernhard is green. Greg is a swan.
Question: What color is Greg?
Answer: white

To efficiently replace tokens with related tokens in FuzzySyntheticQA, we consider each token's 100 approximate nearest neighbors as replacement candidates. In particular, we use Annoy (Bernhardsson and the Annoy development team, 2020) to perform the approximate nearest neighbor look-ups. Similarities are derived from the Euclidean distance between the normalized vectors of two tokens.

Figure 12 shows that changing the passage generation method in FuzzySyntheticQA has a minimal effect on concurrence. We experiment with generating passages from a 3-gram language model, a probabilistic context-free grammar, a large neural language model (GPT-2 1.5B; Radford et al., 2019), and by taking real Wikipedia paragraphs.
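A sketch of the Annoy-based lookup described above (the embedding matrix is a random placeholder and the number of trees is an assumed parameter; Annoy's "angular" metric corresponds to Euclidean distance between normalized vectors):

```python
import numpy as np
from annoy import AnnoyIndex

def build_neighbor_index(vectors, n_trees=50):
    """Build an Annoy index over token embeddings (e.g., pre-trained FastText)."""
    index = AnnoyIndex(vectors.shape[1], "angular")
    for i, vector in enumerate(vectors):
        index.add_item(i, vector)
    index.build(n_trees)
    return index

vectors = np.random.rand(1000, 300).astype("float32")  # placeholder embeddings
index = build_neighbor_index(vectors)
# 100 approximate nearest neighbors of token id 0, excluding the token itself.
neighbor_ids = index.get_nns_by_item(0, 101)[1:]
```

Mapping the returned ids back to token strings yields the kind of neighbors table assumed by the FuzzySyntheticQA sketch given earlier.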

D.3 Full Results on Existing Synthetic Benchmarks
The 3-gram language model is trained with maximum likelihood estimation on WikiText-103 (Merity et al., 2017). The PCFG is trained with maximum likelihood estimation on the Penn Treebank (Marcus et al., 1993). Lastly, we take GPT-2 1.5B generations from the officially-released output samples (github.com/openai/gpt-2-output-dataset; generated with top-k truncated sampling with k = 40).

Table 19 and Table 20 show the performance of each modeling approach on each of our constructed synthetic fuzzy pattern-matching benchmarks.

Figure 12: Even with progressively more natural passages, FuzzySyntheticQA continues to have low overall concurrence with SQuAD. This low concurrence is not trivially caused by the lack of natural passages, and simply making our passages more closely resemble natural language will not yield high concurrence.

Figure 13 summarizes the data generation procedure for WikidataSyntheticQA.

Inverses of Properties. Some of our generated questions use the inverse relationship between two properties. To obtain the inverse relationship for a given property, we first retrieve its list of property constraints using Wikidata property P2302 (property constraint). If Q21510855 (inverse constraint) is present, we then retrieve the corresponding property of this inverse relationship. If the inverse constraint is not present, we check the corresponding property of P7087 (inverse label item), which outputs the item with a label of the inverse relationship of the property.

Entity Hypernyms. Some of our generated questions replace entities with their hypernyms. To obtain the hypernyms for a given entity, we retrieve the object entities of its P31 (instance of) and P279 (subclass of) properties.

Table 22 and Table 23 show the performance of each modeling approach on subsamples of the SQuAD benchmark.