Question and Answer Test-Train Overlap in Open-Domain Question Answering Datasets

Ideally, Open-Domain Question Answering models should exhibit a number of competencies, ranging from simply memorizing questions seen at training time, to answering novel question formulations with answers seen during training, to generalizing to completely novel questions with novel answers. However, single aggregated test set scores do not show the full picture of what capabilities models truly have. In this work, we perform a detailed study of the test sets of three popular open-domain benchmark datasets with respect to these competencies. We find that 60-70% of test-time answers are also present somewhere in the training sets. We also find that 30% of test-set questions have a near-duplicate paraphrase in their corresponding training sets. Using these findings, we evaluate a variety of popular open-domain models to obtain greater insight into the extent to which they can actually generalize, and what drives their overall performance. We find that all models perform dramatically worse on questions that cannot be memorized from training sets, with a mean absolute performance difference of 63% between repeated and non-repeated data. Finally, we show that simple nearest-neighbor models outperform a BART closed-book QA model, further highlighting the role that training set memorization plays in these benchmarks.


Introduction
Open-domain Question Answering (ODQA) is a task examining the ability of models to produce answers to natural language factoid questions drawn from an open set of domains. ODQA has received significant attention for its potential practical applications, and more recently as a popular method to analyse how well NLP systems can capture and recall factual knowledge. This interest in ODQA as a challenging "knowledge-intensive" task has led to a flurry of recent works that have driven test-set performance on standard ODQA datasets to new heights (Lee et al., 2019; Guu et al., 2020; Karpukhin et al., 2020; Lewis et al., 2020; Izacard and Grave, 2020, inter alia). However, a deeper understanding of what kinds of questions our models can answer well has been less forthcoming. Whilst there have been several works examining other kinds of QA datasets (Manjunatha et al., 2018; Kaushik and Lipton, 2018; Sugawara et al., 2018, 2020), we know comparatively little about how the questions and answers are distributed in these ODQA benchmarks, making it hard to understand and contextualize the results we are observing. In this work, we address these issues via an analysis of the test sets of three popular ODQA datasets, namely WebQuestions (Berant et al., 2013), TriviaQA (Joshi et al., 2017) and Open Natural Questions (Kwiatkowski et al., 2019; Lee et al., 2019).
We identify three classes of question that a trained ODQA system should be able to answer, in increasing order of difficulty: 1) the most basic behaviour is to reliably recall the answer to a question that the model has seen at training time; 2) a model should be able to answer novel questions at test time and choose an answer from the set of answers it has seen during training; and 3) a strong system should be able to answer novel questions whose answers are not contained in the training data. It is not clear to what extent our current ODQA datasets measure each of these three behaviours. To address this, we stratify the test sets of these datasets. Firstly, we split the test data by whether answers in the test set also appear somewhere in the training sets. We find that 58-71% of test answers also occur somewhere in the training data, demonstrating that the majority of the test data does not probe for answer generalization.
Secondly, we annotate 1,000 question, answer pairs from each test set for repeated questions in their respective training sets. We find that a surprisingly high 28-34% have paraphrased questions in the training data, the vast majority of which are near-duplicates differing by one or two words. This result implies that roughly 30% of these test sets only probe for how well models can memorize question, answer pairs seen at training time.
Equipped with these insights, we compute the performance of several recently proposed ODQA models on our test subsets.
We test both open-book approaches, which leverage retrieval from a large corpus of documents, and closed-book approaches, which focus on training large parametric models with no external knowledge source. We find that test data which overlap with the training data contribute the bulk of the overall performance of all the models studied.
These issues seem to be more acute for closed-book models. Strikingly, we find that a closed-book BART-based model (Lewis et al., 2019) is incapable of producing answers not observed at training time, and achieves very low scores on non-overlapping questions, suggesting this model is only capable of memorizing question, answer pairs seen at training time. With this in mind, we build simple nearest-neighbor models which outperform this BART model, despite having virtually no capacity to generalize beyond training data.
To summarize, we make the following contributions: 1) We provide insights into how answer entities are distributed between dataset splits for ODQA datasets. 2) We provide annotated subsets of ODQA test sets indicating whether test-time questions are duplicates of training-time questions. 3) We evaluate a variety of models on our dataset splits, and derive insights into what kinds of question answering behaviour different models achieve.

Datasets
In our analysis, we consider three widely used Open-domain QA datasets, WebQuestions (Berant et al., 2013), TriviaQA (Joshi et al., 2017), and Open Natural Questions, a subset of Natural Questions (Kwiatkowski et al., 2019) introduced by Lee et al. (2019). All three datasets consist of factual natural language questions and short multi-token answers, but differ slightly in the style of questions and format of answers.
WebQuestions WebQuestions is a dataset of 3,778 train and 2,032 test question, answer pairs. Questions were obtained by mining a search engine, and answers are Freebase entities (Bollacker et al., 2008) annotated by crowdworkers. The ODQA task consists of predicting the name of the Freebase entity. We use the standard train/test splits from Berant et al. (2013). We use the development split used in Karpukhin et al. (2020), which was randomly split from the train set.
TriviaQA TriviaQA is a dataset of 78,785 train, 8,837 development and 11,313 test question, answer pairs obtained by scraping trivia websites. Answers consist of Wikipedia entities, and any alias for the answer entity is considered a correct answer. We use the open-domain train/test splits, which correspond to the unfiltered-train and unfiltered-dev reading comprehension splits (Lee et al., 2019; Min et al., 2019; Karpukhin et al., 2020).

For all three datasets, the canonical train, development and test splits were obtained by randomly splitting the question, answer pairs, and there are no exact duplicate questions in any dataset. We exclude development data from our overlap analyses, and focus purely on train-test overlap to explicitly assess the effects of training memorization.

Open Natural Questions

Test-Train Overlap

We explore two ways of examining the test sets based on overlaps between training and test data. Consider a question, answer pair (q, a) from the test set D_test, where the answer consists of at least one answer reference a = {s_1, ..., s_n}. We can consider answer overlap, where there exists at least one (q′, a′) ∈ D_train which shares at least one answer reference with (q, a). We can also consider question overlap, where there exists some (q′′, a′′) ∈ D_train such that q′′ is a duplicate of q, i.e. q and q′′ are paraphrases and have the same answer.
Answer Overlap Following Rajpurkar et al. (2016), we apply answer normalization (lower-casing, stripping punctuation, removing articles and normalizing whitespace) to answer references before searching for overlapping answer references for all (q, a) pairs in the test set (see Table 1). We find that 58% of test (q, a) pairs in WebQuestions have answer overlaps, with 63.6% and 71.7% for Natural Questions and TriviaQA respectively. We would naturally expect TriviaQA to have higher answer overlap as it has more answer references per question on average (13.7 references on average compared to 1.2 for Natural Questions and 2.4 for WebQuestions). Examples of answer overlaps are shown in Table 2.
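As a concrete illustration, the following is a minimal Python sketch of this kind of answer normalization and the answer-overlap check; the dictionary fields ("question", "answers") are hypothetical and not the datasets' actual schema.

```python
import re
import string


def normalize_answer(s: str) -> str:
    """Lower-case, strip punctuation, remove articles and normalize whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())


def has_answer_overlap(test_pair, train_pairs) -> bool:
    """True if any normalized test answer reference also appears in the training set."""
    train_refs = {normalize_answer(ref)
                  for pair in train_pairs
                  for ref in pair["answers"]}
    return any(normalize_answer(ref) in train_refs for ref in test_pair["answers"])
```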
Question-Overlap Unlike answer overlap, question overlap cannot be easily computed automatically, as searching for duplicates via rules or paraphrase classifiers may lead to both false positives and false negatives. Thus, we turn to manual annotation to investigate question overlap. To obtain a representative sample for each dataset, we annotate a random subset of 1,000 (q, a) pairs from each test set. Annotators are shown a list of up to 50 training questions which have a similar answer reference. Training questions are selected for annotation if they share an answer reference with the test question, or if one answer reference is a sub-sequence of the other; if there are more than 50 such questions, the top 50 are chosen by the highest degree of word overlap with the test question. This answer similarity function is designed for high recall, to obtain a tight lower bound on question overlap. If there were no questions with similar answers in the training set, the question was automatically annotated as not overlapping. Three expert annotators looked through these similar questions and indicated whether any were paraphrases of the test question with the same answer.
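A sketch of this high-recall candidate selection, reusing normalize_answer from the sketch above and the same hypothetical "question"/"answers" fields; the sub-sequence test is approximated here with a substring check.

```python
def candidate_training_questions(test_pair, train_pairs, k=50):
    """Return up to k training questions whose answers are similar to the test answer."""
    test_refs = {normalize_answer(ref) for ref in test_pair["answers"]}

    def answers_similar(train_pair):
        for train_ref in (normalize_answer(r) for r in train_pair["answers"]):
            for test_ref in test_refs:
                if test_ref == train_ref or test_ref in train_ref or train_ref in test_ref:
                    return True
        return False

    candidates = [p for p in train_pairs if answers_similar(p)]
    test_words = set(normalize_answer(test_pair["question"]).split())
    # Rank candidates by word overlap with the test question and keep the top k.
    candidates.sort(
        key=lambda p: len(test_words & set(normalize_answer(p["question"]).split())),
        reverse=True,
    )
    return candidates[:k]
```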
The results of this annotation are shown in Table 1, with examples of overlapping questions in Table 3. A sample of 100 doubly-annotated examples indicated 93% agreement, corresponding to a Cohen's Kappa of 0.85 (Cohen, 1960). We observe a high degree of question overlap, with between 27.5% and 33.6% of the 1,000 annotated test questions having a duplicate in the training set. It is also common to see several duplicates per test question, with an average of 2.8 duplicate questions per overlapping test question in Natural Questions.
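For intuition on how the reported agreement and Kappa relate, the implied chance-agreement rate can be backed out from the two reported numbers (a quick sanity check, not a statistic reported by the annotation itself):

```python
# Cohen's kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
# and p_e is the agreement expected by chance.
p_o, kappa = 0.93, 0.85
p_e = (p_o - kappa) / (1 - kappa)  # solve for the implied chance agreement
print(round(p_e, 2))               # ~0.53
```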

Implications for Modelling
Given these findings, we turn our attention to how well ODQA models perform with respect to train-test overlap. Earlier, we identified three classes of answering behaviours: 1) questions that can be memorized at training time, 2) novel questions that can be answered with answers memorized at training time, and 3) novel questions with novel answers. We refer to these behaviours as Question memorization, Answer classification and QA generalization respectively.
Question memorization To perform well on the question overlap subset, a model would only need to memorize (q, a) pairs at training time, then recognize which training question matches a test-time question. The reasoning required ranges from trivial duplicate detection for very similar questions such as "who played pink in pink floyd the wall" and "who played pink in the movie the wall", to more challenging inference problems for more subtle duplicates such as "On which island in the North Sea did both St Aidan and St Cuthbert live?" and "irish born missionary saint aidan founded a monastery in 653 on which english island which is also the name of a 1970s uk folk-rock band?". A manual annotation of 100 question-overlap pairs indicated that 81% were simple duplicates differing by one or two words, 14% required some paraphrase-recognition capability, and 5% required more sophisticated natural language understanding. To measure performance on question memorization, we build a test subset comprised of (q, a) pairs which have question overlap with the training set.

Table 3: Examples of overlapping questions and their answers.
Answer | Question | Overlapping question
Jason Marsden | who plays max voice in a goofy movie | who does max voice in a goofy movie
January 23 2018 | when will the 2018 oscar nominations be announced | when are the oscar nominations for 2018 announced
Alan Shearer | who has scored more goals in the premier league | most goals scored by a premier league player
retina | where are the cones in the eye located | where are cone cells located in the eye
francisco pizarro | who led the conquest of the incas in south america | conquistador who defeated the incan empire in peru

Answer Classification In order to tackle the answer-overlap subset, a multi-class classifier over training set answers would be sufficient, as no answers appear at test time that do not also appear at training time. We build a test subset of (q, a) pairs which have answer overlap but do not have question overlap. Question-overlap pairs are excluded to isolate performance on answer classification, since question-overlap questions are significantly easier to answer and would inflate scores.

QA Generalization
In this regime, models cannot rely on memorizing their training data. To measure performance on this most challenging split, we build a test subset of (q, a) pairs which do not have answer overlap with the training set. We further note that higher-frequency answers, such as countries, integers and public figures, would naturally be expected to appear less often in this test subset. As such, models that perform well on the head of the answer distribution may struggle in this setting, despite being capable of some generalization at test time.
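Putting the three subsets together, a minimal sketch of the stratification is shown below, reusing normalize_answer from above. The question-overlap labels come from the manual annotation, assumed here to be available as a set of test-question IDs, and the "id"/"answers" fields are hypothetical.

```python
def stratify_test_set(test_pairs, train_pairs, question_overlap_ids):
    """Split test pairs into the three behaviour-probing subsets described above."""
    train_refs = {normalize_answer(ref)
                  for pair in train_pairs
                  for ref in pair["answers"]}
    subsets = {"question_overlap": [], "answer_overlap_only": [], "no_overlap": []}
    for pair in test_pairs:
        answer_overlap = any(normalize_answer(r) in train_refs for r in pair["answers"])
        if pair["id"] in question_overlap_ids:
            subsets["question_overlap"].append(pair)     # question memorization
        elif answer_overlap:
            subsets["answer_overlap_only"].append(pair)  # answer classification
        else:
            subsets["no_overlap"].append(pair)           # QA generalization
    return subsets
```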
In the following, we briefly describe the models included in our analysis. For published models, we obtain test set predictions directly from the authors.

Open-Book Models
Open-book models are ODQA models which first retrieve relevant documents from Wikipedia and then either extract or generate answers conditioned on those documents. We consider the Dense Passage Retrieval (DPR) model (Karpukhin et al., 2020), a pipeline model which retrieves documents based on dense embeddings, before feeding them into a conventional reader-reranker which extracts spans of text as answers. We also include Retrieval-Augmented Generation (Lewis et al., 2020), a recent model that jointly learns to retrieve and generate answers in a seq2seq framework, based on dense retrieval and BART (Lewis et al., 2019). Finally, we include the state-of-the-art Fusion-in-Decoder (FID) (Izacard and Grave, 2020), a pipeline model based on T5-large which retrieves 100 documents and fuses them so that the decoder can attend to all documents at once. We do not include FID results on WebQuestions as the authors did not use it in their original work.
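As an illustration of the open-book setup, a short sketch of running one of these models (RAG) through the Hugging Face Transformers interface; the checkpoint name and arguments follow the library's documented RAG example (with a dummy retrieval index for brevity), not the exact evaluation setup used here, and may differ across library versions.

```python
from transformers import RagTokenizer, RagRetriever, RagSequenceForGeneration

tokenizer = RagTokenizer.from_pretrained("facebook/rag-sequence-nq")
# A dummy index keeps the example lightweight; real evaluation retrieves from full Wikipedia.
retriever = RagRetriever.from_pretrained(
    "facebook/rag-sequence-nq", index_name="exact", use_dummy_dataset=True
)
model = RagSequenceForGeneration.from_pretrained("facebook/rag-sequence-nq", retriever=retriever)

inputs = tokenizer("who played pink in pink floyd the wall", return_tensors="pt")
generated = model.generate(input_ids=inputs["input_ids"])
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```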

Closed-Book Models
Closed-book models store the knowledge required to answer questions entirely within the parameters of the model itself, rather than in an external corpus. Typically these models are seq2seq transformers which are directly fine-tuned on (q, a) pairs. In our analysis, we train a BART-large closed-book QA model, which takes questions as input and generates answers as output. Checkpoints are selected by Exact Match score on a development set. We also include a much more powerful T5-11B model, using the variant pretrained with the special "Salient Span Masking" objective (Guu et al., 2020), designed to improve downstream ODQA performance.

Table 4: Exact Match scores for several recent models on our dataset splits. The "Total" column is the overall performance on the dataset. "Question Overlap" refers to the test subset with train-test question overlap, and probes for simple question memorization. "Answer Overlap Only" refers to the test subset without train-test question overlap but with train-test answer overlap, which probes for answer classification. "No Overlap" refers to the test subset with no train-test answer overlap, and probes for QA generalization. The T5-11B model was trained on both train and development portions of the data, and has thus seen ∼10% more training data than the other models. As we did not include development data in our overlap analysis, a small amount of unaccounted-for overlap is possible for this model. We do not include TriviaQA results for the T5 model since it was trained using a different TriviaQA data-splitting scheme.
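Returning to the closed-book BART model described above, the following is a minimal sketch of this kind of closed-book fine-tuning with Hugging Face Transformers, not the exact training code used here; the single (q, a) pair is illustrative, and the `text_target` argument requires a reasonably recent Transformers version.

```python
import torch
from transformers import BartForConditionalGeneration, BartTokenizerFast

tokenizer = BartTokenizerFast.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

questions = ["who played pink in pink floyd the wall"]
answers = ["Bob Geldof"]

# Questions are the encoder input; answers are the generation targets.
batch = tokenizer(questions, padding=True, truncation=True, return_tensors="pt")
labels = tokenizer(text_target=answers, padding=True, truncation=True, return_tensors="pt").input_ids
labels[labels == tokenizer.pad_token_id] = -100  # ignore padding positions in the loss

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
model.train()
loss = model(input_ids=batch.input_ids, attention_mask=batch.attention_mask, labels=labels).loss
loss.backward()
optimizer.step()
```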

Nearest-Neighbor Models
Given the high levels of train-test overlap in these datasets, we also experiment with some simple nearest-neighbor models. Here, we simply retrieve the (q, a) pair from the training set whose question is most similar to the test question, and return its answer. We experiment with two models, one using TF-IDF and the other using the dot-product similarity of question embeddings from the DPR retriever. These models cannot generalize to non-overlapping answers, and have limited capacity to answer non-overlapping questions. However, they are attractive from the perspective of model size and efficiency, and there has recently been a push towards more space- and memory-efficient QA systems. Lightweight retrievers coupled with a database of carefully selected (q, a) pairs would represent a very space-efficient solution compared to open-book models, which must retrieve from large textual corpora, or closed-book models with large parameter counts.
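A minimal sketch of the TF-IDF nearest-neighbor approach using scikit-learn, not the exact implementation evaluated here; the two training pairs are illustrative stand-ins for a full ODQA training set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative stand-ins for the real training set.
train_questions = ["who played pink in pink floyd the wall",
                   "when was the eiffel tower built"]
train_answers = ["Bob Geldof", "1889"]

vectorizer = TfidfVectorizer()
train_matrix = vectorizer.fit_transform(train_questions)

def predict(test_question: str) -> str:
    """Return the answer of the most TF-IDF-similar training question."""
    sims = cosine_similarity(vectorizer.transform([test_question]), train_matrix)
    return train_answers[sims.argmax()]

print(predict("who played pink in the movie the wall"))  # -> Bob Geldof
```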

Results
Table 4 shows our results; in this section we unpack some of the findings.

Question Memorization Earlier, we found that ∼30% of test set questions overlap with the training set. The "Question Overlap" column in Table 4 shows performance on question memorization. Comparing this column with the total performance column shows that all models perform significantly better on memorizable questions. This finding is not surprising, but it is worth highlighting that a significant proportion of overall performance is driven by question memorization. This effect is most pronounced for closed-book models. The T5-11B model performs especially well on question memorization for both Natural Questions and WebQuestions, suggesting that its very large capacity, coupled with more powerful question understanding, may allow it to store, recognise and recall training questions more effectively than other models.

Answer Classification
The "Answer overlap only" column in Table 4 shows performance on answer classification. Answer classification has a large drop in performance compared to question memorization, dropping by an average of 45% Exact Match score. Open-book models handle this setting better than closed book models. The BART model in particular struggles here, only managing 10.2% accuracy on this set.

QA Generalization
The "No overlap" column in Table 4 shows performance on QA generalization. All models suffer significant performance degradation on QA generalization, highlighting the shortcomings of the overall performance metric. For example, we may expect the FID stateof-the model to answer half of Natural Questionsstyle questions correctly, but once we have accounted for repeated questions and answers, it can only answer about one third of questions correctly. This difference is even more pronounced for other models, with an average absolute drop of 25% with respect to overall performance.
Nearest-Neighbor Models The bottom two rows of Table 4 show the performance of the nearest-neighbor models. Our dense nearest-neighbor model consists of a single BERT-base checkpoint, outperforms the BART-large closed-book model, and could be compressed further using quantization and distillation techniques (Sanh et al., 2020; Jiao et al., 2019). The TF-IDF model is even smaller and could be implemented extremely efficiently with a negligible memory footprint.

Related Work
The widespread adoption of deep learning in the last few years has been accompanied by an increase in dataset sizes and a proliferation of dataset construction methodologies. Examining what kinds of behaviours are learnt by models has received attention in natural language understanding tasks, such as the GLUE benchmark (Wang et al., 2018), which includes a diagnostic test set probing for different reasoning types. Various works have also performed critical and careful analysis of question answering systems and datasets. Chen et al. (2016) closely examine the difficulty of the CNN-DM dataset (Hermann et al., 2015), Sugawara et al. (2020) perform an analysis of machine comprehension dataset difficulty, Kaushik and Lipton (2018) analyse the difficulty of various machine reading datasets, and Manjunatha et al. (2018) show that visual question answering models memorize common question-answer relationships present in training data. Févry et al. (2020) perform an analysis of various closed-book models' TriviaQA predictions, based on entity mentions. Kwiatkowski et al. (2019) note that the machine reading Natural Questions dataset has substantial train-test overlap of Wikipedia titles, and provide some baselines for "long-answer" QA. Closest to our work, Verga et al. (2020) observe similar answer overlap in knowledge-base QA, and explore results on non-overlapping subsets.

Conclusion
In this work, we performed a novel analysis of popular open-domain question answering datasets. We found that 60% of test set answers overlap with the training set and, more surprisingly, 30% of test set questions have at least one duplicate in the train set. Following these observations, we contextualized the performance of seven ODQA models, stratifying by different amounts of training set overlap and gaining insight into the extent to which these models generalize or simply memorize their training data. It is clear that performance on these datasets cannot be properly understood from overall QA accuracy alone; we suggest that, in future, a greater emphasis should be placed on behaviour-driven evaluation rather than on pursuing single-number overall accuracy figures.