Comparing Test Sets with Item Response Theory

Recent years have seen numerous NLP datasets introduced to evaluate the performance of fine-tuned models on natural language understanding tasks. Recent results from large pretrained models, though, show that many of these datasets are largely saturated and unlikely to detect further progress. What kinds of datasets are still effective at discriminating among strong models, and what kinds of datasets should we expect to be able to detect future improvements? To measure this uniformly across datasets, we draw on Item Response Theory and evaluate 29 datasets using predictions from 18 pretrained Transformer models on individual test examples. We find that Quoref, HellaSwag, and MC-TACO are best suited for distinguishing among state-of-the-art models, while SNLI, MNLI, and CommitmentBank seem to be saturated for current strong models. We also observe that the span selection task format, which is used for QA datasets like QAMR or SQuAD2.0, is effective in differentiating between strong and weak models.


Introduction
Many datasets have been created to evaluate various aspects of natural language understanding (NLU) in English. These datasets are useful to measure progress; however, it is evident from various leaderboards (Wang et al., 2018, 2019b; Rajpurkar et al., 2016; Zellers et al., 2018) that many of them are no longer challenging or discriminative enough to differentiate strong models such as those based on Transformers (Vaswani et al., 2017).[1] Even if these benchmarks are sound tests of important (and potentially unsolved) tasks, their usefulness is limited if they cannot measure further progress. In this paper, we ask: Which datasets are best at distinguishing current and possible future strong models?

* Equal contribution. † Work done while at New York University.
[1] For example, the recent DeBERTa model (He et al., 2020) achieves parity with human annotators on the SuperGLUE benchmark score: https://super.gluebenchmark.com/leaderboard.
We aim to compare datasets using a single metric that accounts for their effectiveness in separating current stronger and weaker models. To that end, we use Item Response Theory (IRT; Baker and Kim, 1993), a statistical framework from psychometrics that is widely used for the evaluation of test items in educational assessment. IRT assumes that the probability that a model will correctly handle an example in a test set depends on the model's latent ability parameter and three example-specific parameters, typically measuring example difficulty (how strong does a model have to be to get it right), discrimination (how effective the example is for differentiating between similar models), and guessing (how likely a weak model is to get the example right for spurious reasons).
This paper presents a large-scale IRT analysis of existing English NLU datasets. Unlike previous work which focuses on example-level analysis within individual datasets (Lalor et al., 2016, 2018), we analyze example characteristics from a broader perspective by comparing individual examples across datasets. We evaluate test sets from 29 datasets in different formats: classification, multiple-choice QA, and span-selection QA. As responses, we use model predictions from 18 Transformer-based models, including some limited-capacity models chosen to better expose each dataset's ability to discriminate weaker from stronger predictors. We then fit a single IRT model on these responses using a variational inference method.

Figure 1: Distribution of test examples according to our proposed locally estimated headroom (LEH) scores (§ 4.1.1), which measure the local slope of the Item Characteristic Curve (ICC) for an example at the ability level corresponding to the best model, and thus reflect the effectiveness of that single example at distinguishing between near-state-of-the-art models. Datasets are grouped by task format: classification (green), sentence-level multiple-choice (blue), paragraph-level multiple-choice (red), and span selection (grey). Within each format, the datasets are sorted by their release date. More details on the datasets are given in Table 1.

We find:
• Quoref, HellaSwag, and MC-TACO contain the highest number of examples that can differentiate between near-state-of-the-art models, making them very likely to be effective at tracking near-future progress on the skills that they actually test (Figure 1).
• Span-based QA is an effective task format for discriminating between strong and weak models.
• CosmosQA, MC-TACO, Winogrande, and ARC-Challenge consist mostly of hard examples, while for most datasets, the example difficulty levels are more widely distributed.

Item Response Theory
Baker and Kim (1993) introduce Item Response Theory (IRT), a statistical framework to measure the probability of a responder (human or AI system) predicting a correct answer for a given item (test example). The probability of a responder i answering an item j correctly is estimated as a function of the responder's latent ability θ_i and the item characteristics; this function is referred to as the item characteristic curve (ICC). We use the 3-parameter (3PL) IRT model, where item behavior is governed by discrimination, difficulty, and guessing parameters. The discrimination parameter (α) defines how effective an item is for distinguishing predictors along the ability axis. The difficulty parameter (β) defines a minimum level of ability at which we expect to see high responder performance. The guessing parameter (γ) defines the probability of correctly answering an item by random guessing. Figure 2 shows example ICCs with different parameter values.

Figure 2: An example of item characteristic curves (ICCs) with different values for the discrimination (α), difficulty (β), and guessing (γ) parameters. p(θ) is the probability of a correct answer for a given θ. θ measures a model's ability level (higher is better). α governs the steepness of the function, β determines the θ value at which the curve is the steepest, and γ defines the baseline likelihood that an arbitrarily weak model can guess correctly.
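The 3PL curve described above can be written as p(θ) = γ + (1 − γ)·σ(α(θ − β)), where σ is the logistic sigmoid. A minimal sketch of this curve (the parameter values here are illustrative, not fitted):

```python
import math

def icc(theta, alpha, beta, gamma):
    """3PL item characteristic curve: probability that a responder
    with ability `theta` answers the item correctly.
    - alpha: discrimination (steepness of the curve)
    - beta:  difficulty (ability at which the curve is steepest)
    - gamma: guessing (lower asymptote for arbitrarily weak responders)
    """
    return gamma + (1.0 - gamma) / (1.0 + math.exp(-alpha * (theta - beta)))

# A weak responder on a guessable item stays near the guessing floor,
# while a strong responder approaches certainty.
p_weak = icc(theta=-5.0, alpha=2.0, beta=0.0, gamma=0.25)   # ~0.25
p_strong = icc(theta=5.0, alpha=2.0, beta=0.0, gamma=0.25)  # ~1.0
```

At θ = β the curve sits exactly halfway between the guessing floor γ and 1, which is where its slope, and hence its discriminative power, is largest.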

IRT with Variational Inference
We use variational inference to infer IRT parameters from model response patterns using Pyro (Ranganath et al., 2014; Bingham et al., 2019). Lalor et al. (2019) found this method effective when fitting IRT models to responses on SNLI. Let m be the number of responders and n the number of items. The response pattern matrix is Y ∈ {0, 1}^(m×n), where the i-th row corresponds to responder i and the j-th column corresponds to item j. We define y_ij as the response of model i to item j, where y_ij = 1 indicates a correct response and y_ij = 0 an incorrect one. We approximate the joint posterior over the parameters π(θ, α, β, γ | Y) with a fully factorized (mean-field) variational posterior:

    q(θ, α, β, γ) = ∏_i π_θi(θ_i) ∏_j π_αj(α_j) π_βj(β_j) π_γj(γ_j),

where π_ρ(·) denotes the variational density for parameter ρ.
We fit the posterior parameters by minimizing the evidence lower bound (ELBO). When calculating the ELBO, we weight the log-likelihood of each item's parameters by the inverse of the item's dataset size to control for differences in test set size. Following Lalor et al. (2019), we use a prior of N(0, 1) for θ, β, and sigmoid⁻¹(γ). While Lalor et al. (2019) use N(0, 10³) for the item parameter priors, we encountered degenerate runs with that choice and instead use N(0, 1). For log α, we use N(0, σ_α²), where we set σ_α by searching over [0.25, 0.5] in increments of 0.05 and choosing the value that yields the highest ELBO after excluding degenerate runs. We use a sigmoid transformation for γ to constrain the guessing probability to (0, 1).
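The priors and parameter transformations described above can be sketched as follows; `sigma_alpha` here is one illustrative value from the searched grid, not the value our tuning actually selected:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Priors: N(0, 1) for theta, beta, and sigmoid^-1(gamma);
# N(0, sigma_alpha^2) for log alpha. The exp/sigmoid transforms keep
# discrimination positive and the guessing probability in (0, 1).
sigma_alpha = 0.3  # illustrative; searched over [0.25, 0.5] in the paper
n_items = 1000

theta = rng.normal(0.0, 1.0)                           # responder ability
beta = rng.normal(0.0, 1.0, size=n_items)              # item difficulty
alpha = np.exp(rng.normal(0.0, sigma_alpha, n_items))  # discrimination > 0
gamma = sigmoid(rng.normal(0.0, 1.0, n_items))         # guessing in (0, 1)
```

Sampling from these priors confirms that the transformed parameters always satisfy their constraints, which is what makes unconstrained Gaussian variational families usable for α and γ.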

Datasets
Our goal is to perform a fine-grained evaluation of English NLU datasets that appear to discriminate among widely used Transformer-based models. To that end, we choose datasets based on the following criteria:
• They are plausibly unsolved, in that the best-reported model performance does not exceed estimated human performance (if available) by more than three metric points.
• They are relatively easy to use with current large pretrained models; in particular, their inputs fit within a typical pretrained Transformer's 512-token limit. (This rules out tasks with full-document contexts or retrieval components.)
• They are evaluated at the example level, i.e., we focus our analysis on QA and other classification datasets, where each example corresponds to one item in the IRT model. (This rules out structured prediction and sequence tagging tasks.)
• They have simple and reliable automatic metrics at the example level. (This rules out generation-based tasks.)

Models
We aim to understand how examples from different datasets contribute to the evaluations of models with near-state-of-the-art abilities, so we include several pretrained Transformer-based models to approximate this. However, using only high-performing models could result in a poor IRT model fit, so we also include weaker models such as the MiniBERTas (Zhang et al., 2021b).[4] For each of the 18 Transformer-based models, we evaluate five different checkpoints: at 1%, 10%, 25%, and 50% of the maximum training steps (Section 3.3), as well as the best checkpoint on the validation set, which need not be one of the other four. This yields a total of 90 model predictions for each test example.

[4] The MiniBERTas are RoBERTa models pretrained on 1M, 10M, 100M, or 1B words of raw text, varying slightly in model size. There are three pretrained models for each pretraining data quantity, pretrained using different near-optimal hyperparameter values. We use all three variants in producing responses for IRT.
We only perform hyperparameter tuning with the RoBERTa-Large model and apply the best configuration to train all the other Transformer models. We use NVIDIA V100 Tensor Core GPUs for our experiments. On average, it takes approximately four hours to train RoBERTa on small datasets (< 3k training examples), one day for medium-sized datasets (< 10k), and four days for large datasets (> 10k). Figure 3 shows the performance of RoBERTa-Large, ALBERT-XXL-v2, and one of the low-performing MiniBERTas (RoBERTa-Med-Small-1M-2) on all validation sets. Unsurprisingly, ALBERT-XXL-v2 and RoBERTa-Large are the best-performing models, while the small MiniBERTa model achieves much lower performance. Full results using all 18 models can be found in the Appendix (Table 3).

Item Characteristics
Metric As our primary metric, we introduce the Locally Estimated Headroom (LEH) score, which measures the ability of each test example to contribute to the evaluation of near-future progress. We calculate it as the derivative of the example's ICC (Figure 2) with respect to ability, evaluated at the highest latent ability score among our models, which corresponds to ALBERT-XXL-v2. A high LEH score indicates that the best-performing model is still far from the example's saturation points, the flat sections of the ICC inferred by our model: there is enough room along the curve that the IRT model expects the example to be able to differentiate future state-of-the-art models. Typically, different near-state-of-the-art models both succeed and fail on this kind of example, while weaker models mostly fail. A high LEH score thus implies that there is still enough room for potentially stronger models to perform better on such examples.
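Under the 3PL parameterization, this derivative has a closed form: the slope of the ICC at θ_best is (1 − γ)·α·σ(α(θ_best − β))·(1 − σ(α(θ_best − β))). A small sketch with illustrative (not fitted) parameter values:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def icc(theta, alpha, beta, gamma):
    """3PL item characteristic curve."""
    return gamma + (1.0 - gamma) * sigmoid(alpha * (theta - beta))

def leh(theta_best, alpha, beta, gamma):
    """Locally Estimated Headroom: slope of the ICC at the ability level
    of the best current model. A near-zero slope means the example sits
    on a flat (saturated) section of the curve."""
    s = sigmoid(alpha * (theta_best - beta))
    return (1.0 - gamma) * alpha * s * (1.0 - s)

# An item whose difficulty sits at the best model's ability level is still
# informative; one far below that level is saturated.
informative = leh(theta_best=2.0, alpha=1.5, beta=2.0, gamma=0.1)
saturated = leh(theta_best=2.0, alpha=1.5, beta=-3.0, gamma=0.1)
```

The slope peaks when θ_best = β, so the items with the most headroom are those whose difficulty is matched to the best model's ability rather than far above or below it.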
To validate the use of LEH scores for detecting near-future improvements, we compare two IRT models. The first is fitted using responses from all models, while the second is fitted using responses from BERT and other weaker models (excluding RoBERTa-Large, RoBERTa-Base, XLM-R, and ALBERT-XXL-v2). We then compute the correlation between the two sets of LEH scores, focusing on the 75th percentile for each dataset. The Pearson correlation is 95.5%, with a median absolute difference of 0.007 and a standard deviation of 0.011. Out of the 29 datasets, only SQuAD2.0, CommonsenseQA, MuTual, Quoref, and HellaSwag have more than 0.02 absolute difference in LEH scores. This strong correlation suggests that our ICC fits are not overly sensitive to the exact characteristics of current state-of-the-art models.

Figure 1 shows the distribution of test examples for each dataset based on their LEH scores. For our analysis, we focus on the 75th percentile of examples in each dataset as a rough proxy for how likely a dataset is to have a significant number of examples that are difficult or discriminative for near-future models.
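The validation step above can be sketched as follows, using hypothetical dataset-level LEH scores rather than our fitted values:

```python
import numpy as np

# Hypothetical 75th-percentile LEH scores per dataset under two IRT fits:
# one using responses from all models, one using only the weaker models.
leh_all_models = np.array([0.31, 0.05, 0.22, 0.12, 0.27])
leh_weak_models = np.array([0.30, 0.06, 0.20, 0.13, 0.26])

# Pearson correlation between the two sets of dataset-level scores.
r = np.corrcoef(leh_all_models, leh_weak_models)[0, 1]

# Datasets whose scores shift by more than 0.02 between the two fits.
shifted = np.abs(leh_all_models - leh_weak_models) > 0.02
```

A high correlation with few large per-dataset shifts is the signal that the fitted curves do not hinge on the strongest models being in the response pool.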

Analysis by LEH Scores
We observe that Quoref, HellaSwag, and MC-TACO have examples with the highest LEH scores, suggesting sufficient headroom for future state-of-the-art models with higher ability to achieve better performance on these datasets. SNLI, CommitmentBank, and MNLI have relatively low LEH scores, indicating that performance on these datasets is largely saturated. Additionally, we measure how the 75th percentile LEH scores correlate with the human–RoBERTa performance gap. Using the 22 datasets that have human performance numbers (Table 1), we find that the Pearson correlation between the two is weakly positive (0.21).
Analysis by Item Parameters Next, we analyze the distribution of test examples according to their discrimination and difficulty parameters (Figure 4). We observe that datasets with the span selection format (QAMR, NewsQA, SQuAD, MRQA-NQ, and Quoref) have higher discrimination scores than the other datasets, highlighting span selection as an effective task format for discriminating between strong and weak models. However, this might be because this task format typically features a much larger space of possible model outputs than the other formats we consider. It does not necessarily mean that span selection is the most suitable format for testing models' ability to understand language. Because the span-based format restricts answers to text spans in the given passage, there are concerns that it rarely requires reasoning whose answer is not mentioned in the passage, and thus does not fully reflect human comprehension ability (Lai et al., 2017; Sugawara et al., 2018).
For the difficulty parameter, we do not observe a single task format that is superior to the others. However, we notice that the highest difficulty scores are obtained by QA datasets such as SQuAD2.0, NewsQA, QuAIL, ARC-Challenge, and MC-TACO. ANLI, which is created with adversarial model-in-the-loop crowdsourcing, also has many hard examples. Impressionistically, training set size and creation date do not seem to correlate with either the difficulty or the discrimination parameters. Figure 5 shows the joint distribution of examples according to their difficulty and log discrimination parameters. We notice a half-moon shaped pattern in most datasets, which indicates that most of the discriminative examples are either very easy or very difficult. Referring to the ICC (Figure 2), this indicates high agreement among strong models or among weak models, corresponding to one of the saturation points (upper or lower) of the curve. The only dataset that does not show this pattern is Winogrande, which is difficult for all models.
ARC-Challenge, QuAIL, HellaSwag, CommonsenseQA, and MC-TACO show clusters with high density in the top right regions, indicating a large number of examples with high discrimination and difficulty scores. Other datasets have more scattered distributions. SNLI, MNLI, and MCScript show higher density in the bottom right regions, while NewsQA, SQuAD2.0, and MRQA-NQ show higher density in both the top and bottom right regions. Further analysis of the guessing parameters can be found in Appendix A.

Examples with Unanimous Responses
When fitting an ICC on examples that receive only correct responses or only incorrect responses, the discrimination parameter is unconstrained. We find that these examples make up 4% of our data; 13 of the 29 datasets contain at least one such example. Roughly 16% of NewsQA examples are incorrectly answered by all models, while the remaining 12 datasets have less than 10% of such all-correct or all-incorrect examples. To study the effect of these examples, we fit an IRT model on responses excluding them and compare the resulting parameters against those from the full set of responses. We find that the Pearson correlation for the discrimination parameter at the 75th percentile is 97.2%, with a median absolute difference of 0.016 and a standard deviation of 0.015. MC-TACO, CommitmentBank, and WSC differ by more than 0.04. Further, the Pearson correlation for the LEH score at the 75th percentile is 98.9%, with a median absolute difference of 0.006 and a standard deviation of 0.005. RTE, WiC, WinoGrande, QAMR, NewsQA, MRQA-NQ, MC-TACO, and BoolQ differ by 0.01. Given these high correlations, we do not exclude these examples when reporting our main results.
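Identifying these unconstrained items reduces to finding columns of the response matrix that are all ones or all zeros. A toy sketch with a hypothetical response matrix:

```python
import numpy as np

# Toy response matrix: rows are responders (models), columns are items.
Y = np.array([
    [1, 1, 0, 1, 0],
    [1, 0, 0, 1, 1],
    [1, 1, 0, 0, 1],
])

# An item answered correctly by every model (or by none) leaves its
# discrimination parameter unconstrained, so such items can be dropped
# before fitting the IRT model.
item_means = Y.mean(axis=0)
unanimous = (item_means == 0.0) | (item_means == 1.0)
Y_filtered = Y[:, ~unanimous]
```

Here the first and third items are unanimous, so the filtered matrix keeps only the three items on which the models disagree.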

Analysis by Task Group
Next, we analyze each task-type group in more detail, focusing on example scores around the 75th percentile.
Classification We observe that all datasets have moderate discrimination scores. Most ANLI examples have relatively high difficulty scores, while SNLI, MNLI, and CommitmentBank have the lowest difficulty scores.
Sentence-Level Multiple Choice All of the datasets in this group have relatively low discrimination scores compared to span selection datasets. Figure 5 shows that MC-TACO, Winogrande, and CommonsenseQA all have a higher density of difficult examples, while for other datasets the distribution is more spread out.

Span Selection We observe that span selection datasets are the most discriminative. However, in terms of difficulty, only SQuAD2.0 and NewsQA are among the top five.

Analysis on Model Ability
For a sanity check, we further analyze how each model scores according to our fitted IRT parameters. We observe a positive correlation between ability and average model accuracy (Appendix B). Generally, within a model, the best validation checkpoint obtains the highest average model accuracy and/or ability score. Across models, ALBERT-XXL-v2 typically performs best.

Table: Example test items with their estimated difficulty (β) values.

MNLI (β = 3.27)
Premise: And, you know, with this, you know, it wasn't many opportunities for kids to be special, because kids weren't, you know, you were pushed out of adult conversation, and just really pushed to the side.
Hypothesis: Children were pushed out of adult conversation, and really just pushed to the side in general.
Label: entailment

MNLI (β = -1.87)
Premise: Look, it's your skin, but you're going to be in trouble if you don't get busy.
Hypothesis: The boss will fire you if he sees you slacking off.
Label: neutral

MC-TACO (β = -1.67)
Passage: Because then they feel like they are forced to stay in that situation.
Question: On average, how often do they feel stuck in the situation?
Choices: (1) 54 months (2) 6 centuries (3) once every 6 years (4) every few seconds (5) once every 2 seconds (6) once every 18 years

Qualitative Analysis
To better understand what kinds of examples are difficult or discriminating, we analyze the 20 examples with the lowest and highest scores for the discrimination and difficulty parameters from five datasets: SQuAD2.0, MC-TACO, QuAIL, MNLI, and BoolQ. The first three are datasets with high discrimination and/or difficulty scores. MNLI and BoolQ have moderate discrimination and difficulty scores and low label entropy (three-class classification for MNLI and binary choice for BoolQ).

Related Work
Prior work on using IRT to evaluate NLP systems mostly relies on human responses. Hopkins and May (2013) use IRT to estimate the relative ability of a set of machine translation systems using responses from pairwise comparisons of system outputs by human judges. Otani et al. (2016) take a similar approach. Their experiments demonstrate that IRT can produce a more reliable ranking of models than traditional metrics. They also show that IRT is not only useful for better understanding individual examples in a dataset and task, but also effective in identifying annotation errors.
For other dataset evaluations, in addition to providing a benchmark, the SuperGLUE paper also compares a set of candidate datasets using a fixed pool of machine learning models and human annotators (Nangia and Bowman, 2019). Wang et al. (2019a) investigate pretraining tasks and paradigms for effective transfer learning methods. Pruksachatkun et al. (2020a) study when and why intermediate-task training is useful for a given target task. Vu et al. (2020) introduce task embeddings to predict the most beneficial source task for a given target task. Schlegel et al. (2020) propose an evaluation framework for machine reading comprehension (MRC) datasets and reveal some concerns regarding factual correctness and the presence of linguistic cues in existing MRC gold datasets.

Conclusion
Given the large number of NLU datasets introduced in recent years, what kinds of datasets are effective for measuring near-future progress? Our analysis of 29 test sets using IRT gives us reason to believe that, among the datasets we evaluate, Quoref, HellaSwag, and MC-TACO are best able to discriminate among current (and likely future) strong models. Meanwhile, SNLI, MNLI, and CommitmentBank seem to be saturated and ineffective for measuring future progress.
Our analysis of examples' difficulty and discrimination parameters shows that datasets with many hard examples do not always contain examples that can discriminate between strong and weak models. We find that QA datasets are more difficult than other datasets. We also find that span selection is the most effective task format for discriminating between strong and weak models.
According to our LEH scores, datasets that seem to be solved are unlikely to see improvements from future pretrained models. Therefore, the skills they intend to test are either largely solved, to the extent that they are solvable, or not well isolated (e.g., due to data artifacts). Measuring further progress on the skills that these solved test sets were originally designed to evaluate would most likely require new datasets that better isolate the reasoning abilities of interest.
On the other hand, datasets that perform well according to our LEH metric show the best signs of being amenable to future hill-climbing. This does not entail that we should focus future research on these benchmarks, since we do not evaluate whether they test the skills they mean to test, or whether these skills are important for scientific or practical progress on natural language understanding. Finally, we argue that this evaluation should be done periodically, as datasets and models improve over time.
For future work, one can study multidimensional variables for both model ability and item parameters, which could reveal a factorization of datasets by skills. Other potential directions include expanding our analysis to a broader range of tasks and analyzing the relationship between the estimated IRT parameters and the human-model gap.

Acknowledgments
We thank John Lalor, João Sedoc, Nikita Nangia, Sebastian Schuster, Iacer Calixto, and the anonymous reviewers for feedback. This work has benefited from financial support to SB by Eric and Wendy Schmidt (made by recommendation of the Schmidt Futures program), Samsung Research (under the project Improving Deep Learning using Latent Structure), Apple, and Intuit, and from in-kind support by the NYU High-Performance Computing Center and by NVIDIA Corporation (with the donation of a Titan V GPU). This material is based upon work supported by the National Science Foundation under Grant No. 1922658. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Ethical Considerations
We present an objective approach for comparing the difficulty of test set examples across datasets and demonstrate it on a large set of established datasets. We expect this to contribute to the development of more challenging benchmarks for NLP and, potentially, to the development of stronger models. One concern worth noting is that most of the evaluation datasets we study are crowdsourced or drawn from naturally occurring data. Thus, they likely demonstrate harmful stereotypes to some degree, or even score models more highly for demonstrating them. In general, models that perform well on these datasets should not be deployed directly without additional measures to measure and mitigate any harms that stereotypes like these could cause in the target application settings.

A Analysis of the Guessing Parameter

In addition to the analysis of discrimination versus difficulty parameters, we also look at the distribution of the guessing (γ) parameters. From Figure 6, we observe that all QA datasets with the span selection format generally have low guessing parameters, meaning that they are difficult to predict correctly by random guessing. This makes sense, as span selection has higher label entropy than classification or multiple-choice tasks.

[Figure: model ability scores. Different colors mark different models (e.g., dark blue for ALBERT-XXL-v2), and different shapes mark different checkpoints.]

B Additional Analysis on Model Ability
Since we only perform tuning on RoBERTa Large , some of these models might have worse performance than if they were individually tuned.

C Task Descriptions
In this section, we provide a short description for each dataset.
RTE The series of Recognizing Textual Entailment datasets (Dagan et al., 2005;Haim et al., 2006;Giampiccolo et al., 2007;Bentivogli et al., 2009) correspond to a two-class textual entailment classification task. Given a premise sentence and a hypothesis sentence, the task is to decide whether the premise entails the hypothesis.
SNLI The Stanford Natural Language Inference corpus (Bowman et al., 2015) is a textual entailment dataset, formulated as a three-class classification task. Given a premise sentence and a hypothesis sentence, the task is to determine if the premise entails the hypothesis, contradicts it, or neither. The SNLI dataset is created using premises taken from image captions.
MNLI The Multi-Genre Natural Language Inference corpus (Williams et al., 2018) is also a textual entailment dataset, similar to that of SNLI. The MNLI dataset is built to cover a broad range of genres, including written and spoken text. Half of its test set is created from text that is out of domain relative to the training set.
CommitmentBank CommitmentBank (De Marneffe et al., 2019) is a dataset formulated as a three-class textual entailment classification task. Given a piece of text and an embedded clause, models must decide whether the embedded clause is entailed by the text.
ARCT The Argument Reasoning Comprehension Task (Habernal et al., 2018) is a multiple-choice question answering dataset. Given an argument, a claim, and a premise, the task is to select the correct implicit warrant (which explains why the premise implies the claim) from two choices.

ARC-Easy ARC (Clark et al., 2018) is a multiple-choice QA dataset composed of real multiple-choice science questions from grade schools. ARC-Easy is composed of the easier questions that do not satisfy the criteria used to build ARC-Challenge (described below).

ARC-Challenge ARC-Challenge (Clark et al., 2018) is the subset of ARC that contains questions that are incorrectly answered by both a retrieval-based algorithm and a word co-occurrence algorithm.
MCScript MCScript (Ostermann et al., 2018) is a QA dataset in multiple-choice format. The dataset tests models' commonsense knowledge, in particular script knowledge, which corresponds to the sequences of actions people perform in particular situations.
Cosmos QA Cosmos QA (Huang et al., 2019) is a multiple-choice reading comprehension dataset intended to require extensive abstractive commonsense reasoning. Unlike CommonsenseQA, Cosmos QA requires comprehension of an auxiliary article, rather than simply responding to a free-standing question.
HellaSwag HellaSwag (Zellers et al., 2019) is a commonsense reasoning multiple-choice dataset. It is built using adversarial filtering with BERT. Given a story, the task is to select the most plausible continuation.
BoolQ BoolQ (Clark et al., 2019) is a boolean (yes/no) reading comprehension QA dataset built from naturally occurring questions.

MuTual MuTual (Cui et al., 2020) is a multiple-choice QA dataset for multi-turn dialogue reasoning. The dataset is created from Chinese students' English listening comprehension exams, and it is intended to require a variety of commonsense reasoning skills.
MuTual-Plus MuTual-Plus (Cui et al., 2020) is a variant of MuTual, in which one of the choices in each set of answers is replaced by a safe response (i.e., "could you repeat that"). If all other choices are incorrect, then the model is supposed to select the safe response. This variant of MuTual is built so that we can evaluate if the model can select the safe response when all other options are incorrect.
QuAIL QuAIL (Rogers et al., 2020) is a reading comprehension dataset formulated as a multiple choice task. One feature of QuAIL is that it combines "commonsense, text-based, and unanswerable questions." It is also designed such that it has a balanced distribution of genres and reasoning types.
COPA Choice of Plausible Alternatives (Roemmele et al., 2011) is a dataset for sentence-level multiple-choice task. Given a premise and a question that asks for the cause or effect of the premise, the task is to choose the most plausible hypothesis from two options.
WSC The Winograd Schema Challenge (Levesque et al., 2012) is a sentence-level multiple-choice commonsense reasoning dataset. Given a piece of text, a pronoun, and a list of possible noun phrases, the model must choose the correct referent of the pronoun. The dataset is designed such that world knowledge is required to make the correct choice. We use the SuperGLUE (Wang et al., 2019b) version of the dataset.
CommonsenseQA CommonsenseQA (Talmor et al., 2019) is a multiple-choice QA dataset which is designed to test a range of commonsense knowledge.
SocialIQA SocialIQA (Sap et al., 2019) is a dataset specifically designed to test a model's capabilities related to emotional and social intelligence in everyday situations.
MC-TACO MC-TACO (Zhou et al., 2019) is a multiple-choice QA dataset that is designed to test temporal commonsense reasoning, in particular: duration, temporal ordering, typical time, frequency, and stationarity. Each question consists of a varying number of choices, and for each answer choice, a model needs to predict whether the answer is correct or incorrect.
WiC The Word-in-Context dataset (Pilehvar and Camacho-Collados, 2019) is designed to test a model's word sense disambiguation skills. Given two pieces of text (a phrase or a sentence) that both contain a polysemous word, a model needs to predict whether the word is used in the same sense in both.
PIQA The Physical Interaction Question Answering dataset (Bisk et al., 2020) is a multiple-choice QA dataset designed to test physical commonsense reasoning skills. Given a physical task expressed in text, a model needs to select the most sensible solution.
WinoGrande The WinoGrande dataset (Sakaguchi et al., 2020) is built through a crowdsourcing procedure that incorporates adversarial filtering. Given a sentence with a blank (where the blank corresponds to a noun phrase), the task is to select the correct filler. The dataset is designed to test the commonsense reasoning skill.
Abductive NLI The Abductive Natural Language Inference dataset (Bhagavatula et al., 2020) is a multiple-choice dataset. Given a premise, the task is to select the most likely explanation from the given hypotheses.

NewsQA NewsQA (Trischler et al., 2017) is a QA dataset formulated as a span selection task. The dataset is built by crowdworkers using passages taken from CNN news articles.
SQuAD2.0 SQuAD2.0 (Rajpurkar et al., 2018) is a QA dataset that combines the span-selection reading-comprehension questions in SQuAD 1.1 (Rajpurkar et al., 2016) with over 50,000 unanswerable questions. The unanswerable questions were written by crowdworkers to look like the answerable ones. A model must either select an answer span or decline to answer.
Quoref Quoref (Dasigi et al., 2019) is a QA dataset that is designed to test coreferential reasoning ability. The dataset is formulated as a span selection QA task.