Evaluation Examples are not Equally Informative: How should that change NLP Leaderboards?

Leaderboards are widely used in NLP and push the field forward. While leaderboards are a straightforward ranking of NLP models, this simplicity can mask nuances in evaluation items (examples) and subjects (NLP models). Rather than replace leaderboards, we advocate a re-imagining so that they better highlight if and where progress is made. Building on educational testing, we create a Bayesian leaderboard model where latent subject skill and latent item difficulty predict correct responses. Using this model, we analyze the ranking reliability of leaderboards. Afterwards, we show the model can guide what to annotate, identify annotation errors, detect overfitting, and identify informative examples. We conclude with recommendations for future benchmark tasks.


Leaderboards are Shiny
Leaderboard evaluations, for better or worse, are the de facto standard for measuring progress in question answering (Rajpurkar et al., 2016) and in many NLP tasks (Wang et al., 2019a). An unfortunate side effect of leaderboard popularity is SOTA-chasing, often at the expense of carefully inspecting data and models (Linzen, 2020). For example, the same "super-human" models that top question answering leaderboards (Najberg, 2018) often fail spectacularly (Feng et al., 2018; Wallace et al., 2019a) by learning non-generalizable statistical patterns (McCoy et al., 2019; Niven and Kao, 2019). Finally, focusing solely on metrics conflates progress on a specific task with progress on the real-world NLP problems behind the task (Bender and Koller, 2020). Plainly, focusing on headline SOTA numbers "provide(s) limited value for scientific progress absent insight into what drives them" and where they fail (Lipton and Steinhardt, 2019).

* Work completed at University of Maryland.

Figure 1: Negative discriminability suggests an annotation error; for example, the question with most negative discriminability asks "Why did demand for rentals decrease?" when the answer is "demand for higher quality housing increased."

In this work we take leaderboards "as they are" and imagine how they might better support research. Leaderboards establish differences between models on a fixed task. Hence, leaderboards should enable and encourage the comparison of models and the inspection of examples. Leaderboards should also signal when they have outlived their usefulness (Boyd-Graber and Börschinger, 2020).

How to Direct Leaderboards' Light
To help focus attention on examples and models of interest, we propose Difficulty and Ability Discriminating (DAD) leaderboards that explicitly model both task and submissions jointly, rather than either in isolation.¹ DAD's underlying model is based on Item Response Theory (Lord et al., 1968; Baker, 2001, IRT, reviewed in §2), a widely used (van Rijn et al., 2016) alternative in educational testing to simple summary statistics (Edgeworth, 1888). DAD can explicitly identify the difficulty and discriminability of items (Figure 1),² which in turn can lead to a more nuanced ranking of models, identification of poor items, and better understanding of a dataset and task. Throughout the paper, we use the question answering (QA) benchmark SQuAD 2.0 (Rajpurkar et al., 2018). For example, DAD can identify questions that are challenging to models and questions that are wrong (incorrectly annotated). In addition to better understanding datasets, DAD is also helpful for efficiently selecting evaluation items to annotate. We conclude with recommendations for future leaderboards (§7) and discuss where IRT in NLP can go next (§8).

Figure 2: A DAD leaderboard uses IRT to jointly infer item difficulty β_i, discriminability γ_i, feasibility λ_i, and subject skill θ_j. These predict the likelihood p_ij(r_ij = 1) of a correct response r_ij.

A Generative Story for Leaderboards
Leaderboards are a product of the metrics, evaluation data, and subjects (machine or human) who answer items (Figure 2). For concreteness, let's assume that we have a question-answering task and two subjects: Ken, who is good at trivia, and Burt, who is not. In the simplest IRT models, each subject j has a random variable θ_j corresponding to their skill: Ken's is big, Burt's is small.
But you cannot know that until you start asking them questions of varying difficulty β_i. Harder questions have a higher difficulty ("what is the airspeed of an unladen swallow") than easy ones ("who is buried in Grant's tomb"). The bigger the margin between a subject's skill θ_j and an item's difficulty β_i, θ_j − β_i, the more likely that subject j responds correctly, p_ij(r_ij = 1). This is the simplest IRT model, which we call IRT-base. Generally, given n test items X = (X_1, ..., X_n) and m subjects S = (S_1, ..., S_m), where each subject answers every item, we want to estimate subject skills and item difficulties. To discover the random variables that best explain the data, we turn to probabilistic inference (Pearl, 1988).
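As a minimal sketch (not the paper's code), the IRT-base response likelihood is the logistic function of the skill-difficulty margin:

```python
import math

def p_correct(theta: float, beta: float) -> float:
    """IRT-base: probability that a subject with skill theta answers an
    item of difficulty beta correctly, sigmoid(theta - beta)."""
    return 1.0 / (1.0 + math.exp(-(theta - beta)))
```

When skill equals difficulty the subject has a 50% chance of a correct response, and the probability rises as the margin θ_j − β_i grows.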
Two additional random variables further improve DAD: discriminability γ_i and feasibility λ_i. We first consider discriminability and the margin between a question's difficulty β_i and a subject's skill θ_j. A discriminative question is challenging but can still be answered correctly by a strong subject. If Ken's ability is higher than most items' difficulty (θ_j − β_i is large), item discriminability multiplies this gap by γ_i in a model called IRT-disc. Questions with low γ_i are low quality: they have annotation error or do not make sense.
Another way of capturing poor quality questions is the feasibility λ_i. For example, if the question "who was the first president" has the answer Rajendra Prasad, the question has an unstated implicit assumption that subjects must guess what country or company the question is about. In the model IRT-feas, if a large fraction of subjects all get an item wrong, everyone's probability of getting the item right is capped at λ_i. In NLP terms, 1 − λ_i corresponds to the prevalence of annotation errors that lead to unsolvable items.
Having introduced all of the constituent elements of the model, we can now present the full generative model, whose response likelihood is

p_ij(r_ij = 1) = λ_i / (1 + e^{−γ_i(θ_j − β_i)}).   (1)

For IRT-base, γ_i and λ_i are fixed to 1.0, while for IRT-disc, only λ_i is fixed.³ Means µ_θ, µ_β, µ_γ are drawn from N(0, 10⁶) and τ_θ, τ_β, τ_γ from a Γ(1, 1) prior, as in Lalor et al. (2019) and recommended by Natesan et al. (2016).⁴ Because it is difficult to completely codify skill and difficulty into a single number, we can rewrite the exponent in Equation 1 as a sum over dimensions, where each dimension captures the interaction between an item's difficulty and a subject's skill. For example, perhaps Burt could better exploit artifacts in one dimension (their skill for θ_{j,k=5} is high but everywhere else is low) while Ken might not know much about a particular topic like potent potables (θ_{j,k=2} is low but everywhere else is high). We call this model IRT-vec. Multidimensional IRT models (Reckase, 2009) could, in addition to better modeling difficulty, also cluster items for interpretation; we briefly experiment with this (Appendix F), but leave more to future work (§8).
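As an illustrative sketch of the likelihood above (not the paper's implementation), the three scalar item parameters compose as follows, with IRT-base and IRT-disc recovered as special cases:

```python
import math

def p_correct(theta: float, beta: float, gamma: float = 1.0, lam: float = 1.0) -> float:
    """Full model: feasibility lam caps the probability of a correct response,
    and discriminability gamma scales the skill-difficulty margin.
    gamma = lam = 1 recovers IRT-base; lam = 1 alone recovers IRT-disc."""
    return lam / (1.0 + math.exp(-gamma * (theta - beta)))
```

Note that a negative γ_i flips the curve: less skilled subjects become *more* likely to respond correctly, which is why negative discriminability flags suspicious items later in the paper.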

Examples are Not Equally Useful
IRT's fundamental assumption is that not all items and subjects are equal. This explains why leaderboards can fail while having "normal looking" accuracies. As a thought experiment, consider a dataset that is one third easy (β_i ∈ [0, 1]), one third medium difficulty (β_i ∈ [2, 3]), and one third hard (β_i ∈ [6, 7]). Suppose that Ken has skill θ_k = 4 while Burt has skill θ_b = 2. A standard leaderboard would say that Ken has higher accuracy than Burt. But suppose there is a new subject that wants to challenge Ken; they are not going to reliably dethrone Ken until their skill θ_c is greater than six. This is a more mathematical formulation of the "easy" and "hard" dataset splits in question answering (Sugawara et al., 2018; Rondeau and Hazen, 2018; Sen and Saffari, 2020). In IRT-feas, this recapitulates the observation of Boyd-Graber and Börschinger (2020) that annotation error can hinder effective leaderboards. DAD helps systematize these observations and diagnose dataset issues.
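The thought experiment can be simulated directly. The difficulties below are illustrative picks from the stated ranges, and "correct iff skill exceeds difficulty" is a deterministic simplification of the probabilistic IRT response model:

```python
# One third easy, one third medium, one third hard (illustrative difficulties).
EASY, MEDIUM, HARD = [0.2, 0.5, 0.8], [2.2, 2.5, 2.8], [6.2, 6.5, 6.8]
ITEMS = EASY + MEDIUM + HARD

def accuracy(theta, difficulties=ITEMS):
    """Fraction of items a subject with skill theta answers correctly,
    under the deterministic rule: correct iff theta > beta."""
    return sum(theta > beta for beta in difficulties) / len(difficulties)
```

Ken (θ = 4) beats Burt (θ = 2), but a challenger with θ = 5 merely ties Ken: accuracy stays at 6/9 until a subject's skill clears the hard band above six.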

Inference
To estimate the latent parameters of our model, we use mean-field variational inference (Jordan et al., 1999). In variational inference, we propose a distribution over the latent variables, q ϕ (·), that approximates the true but intractable posterior p(·). We then minimize the KL-divergence between these distributions, equivalent to maximizing the evidence lower-bound (ELBO) with respect to the variational parameters.
In our case, q_φ(·) is a mean-field distribution, which means it factorizes over each of the latent variables (while the response likelihood factorizes over the n × m subject-item pairs):

q_φ(θ, β, γ) = Π_j q(θ_j) Π_i q(β_i) q(γ_i).

Specifically, for our key latent variables z ∈ {θ, β, γ}, the associated variational distributions are of the form q(z) = N(u_z, t_z⁻¹). Recall that in the generative distribution, each latent z is drawn from a N(µ_z, τ_z⁻¹) whose parameters are also latent variables; for these variables, we use the variational distributions q(µ_z) = N(u_{µ_z}, t_{µ_z}⁻¹) and q(τ_z) = Γ(a_{τ_z}, b_{τ_z}). We optimize the ELBO with respect to the variational parameters for all z using ADAM (Kingma and Ba, 2015). With DAD's leaderboard IRT model introduced, we next discuss how leaderboard subjects are statistically compared and alternative methods, such as using IRT parameters, to evaluate whether two models are truly different.
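The paper optimizes the ELBO with ADAM; as a much smaller illustrative stand-in (not the paper's inference), one can point-estimate the IRT-base parameters by gradient ascent on the Bernoulli log-likelihood, with a weak L2 pull standing in for the Gaussian priors:

```python
import math

def fit_1pl(responses, n_subjects, n_items, lr=0.1, epochs=200):
    """Gradient-ascent point estimation for the 1PL (IRT-base) model.
    responses: list of (subject_index, item_index, correct) triples."""
    theta = [0.0] * n_subjects  # subject skills
    beta = [0.0] * n_items      # item difficulties
    for _ in range(epochs):
        g_theta = [0.0] * n_subjects
        g_beta = [0.0] * n_items
        for j, i, r in responses:
            p = 1.0 / (1.0 + math.exp(-(theta[j] - beta[i])))
            g_theta[j] += r - p  # d log-likelihood / d theta_j
            g_beta[i] -= r - p   # d log-likelihood / d beta_i
        # 0.01 * parameter is a weak L2 penalty standing in for the priors.
        theta = [t + lr * (g - 0.01 * t) for t, g in zip(theta, g_theta)]
        beta = [b + lr * (g - 0.01 * b) for b, g in zip(beta, g_beta)]
    return theta, beta
```

On a toy response matrix, the item everyone misses gets a higher difficulty than the item everyone answers, and the subject with more correct responses gets the higher skill.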

Ranking and Comparing Subjects
Fundamentally, the objective of comparative evaluations like leaderboards is to decide whether model A is better than model B. A thread of NLP has rightfully advocated for adding rigor to these decisions using statistics (Traub, 1997, Classical Testing Theory), where the objective is to infer a true score T from the observed test score X = T + E given a measurement error E, uniform across subjects. However, in educational testing, a field measuring skill and knowledge in humans, IRT is a primary measurement instrument (Hambleton, 1991, p. 2). A major motivation for IRT is that subjects of different skill have different errors. IRT explicitly accounts for the bandwidth-fidelity dilemma (McBride, 1976): items can either accurately measure a narrow ability range (fidelity) or inaccurately measure large ability ranges (bandwidth).⁶ This section and the next contrast methods for identifying the best model and advocate for IRT. Implicit in nearly all leaderboard evaluations is ranking models by a statistic such as the average accuracy. As we show in §4, naïve rankings are noisier than IRT rankings.

IRT for Leaderboards
Leaderboards should: (1) reliably and efficiently rank better models ahead of worse models (Tague-Sutcliffe, 1992; Voorhees, 2003) and (2) guide inspection of items and subjects (§5). The first ameliorates the unavoidable randomness of finite evaluations while the second enables error analysis (Wu et al., 2019) and model probing (Belinkov and Glass, 2019; Zhang et al., 2019). First, we verify that IRT models accurately predict the responses of subjects (§4.2). Next, a ranking stability analysis shows that IRT has modestly better reliability than classical rankings (§4.2.3). Lastly, using IRT to actively sample items for annotation yields rankings with better correlation to complete test data (§4.4).

Why a Linear Model Baseline
At first blush, the differences between IRT and logistic regression are minimal, but we include the comparison to address natural questions from the NLP community: (1) do the idiosyncrasies of the IRT formulation hurt accuracy? (2) should we add features to better understand phenomena in the questions? (3) why not use deep models?
The next section argues that both IRT and logistic regression are accurate even without laboriously engineered task-specific features. Adding obvious features such as item words (e.g., questions) only minimally improves the accuracy. We explicitly omit less interpretable deep models since our goal is to make leaderboards more interpretable.

Response Prediction is Accurate
Just as educational testing researchers validate IRT models by seeing if they predict subject responses correctly (American Educational Research Association, 2014), we validate how well DAD predicts whether SQuAD models get questions right.
We compare against a logistic regression linear model (LM) implemented with Vowpal Wabbit (Agarwal et al., 2014). Since integrating handcrafted features is easy, we incorporate features derived from subject IDs; item IDs; functions of the SQuAD question, answer, and title; and IRT parameters (details in Appendix B). As in IRT, logistic regression predicts whether a subject correctly responds to an item. Later, we discuss ways to integrate more features into IRT ( §8).

SQuAD Leaderboard Data
Experiments are on the SQuAD 2.0 leaderboard. Development data are publicly available, and organizers provide test set responses. There are 161 development subjects, 115 test subjects, and 11,873 items (1.9 million total pairs). Experiments that do not need test responses use all development subjects; those that do use the smaller test subset.

Evaluation Scheme
Following prior work (Wu et al., 2020), we evaluate IRT and linear models by holding out 10% of responses and computing classification metrics.⁷ In SQuAD, predicting whether a response is correct is an imbalanced classification problem (80.4% of responses in the development set are correct). Thus, we use ROC AUC, macro F1, and accuracy.

IRT Response Prediction is Accurate
IRT models that incorporate more priors into the generative story should be better, but are they? We compare four IRT models: IRT-base, IRT-disc, IRT-feas, and IRT-vec (§2). The more sophisticated models are better and all improve over the LM (Figure 3) and correlate well with each other (Appendix C). To be clear, while higher accuracy than LM is good, our goal is to validate that IRT models are accurate; later, we inspect model errors and identify annotation errors (§5).

What Model Features are Predictive?
Integrating additional features into Bayesian models is not trivial, so we instead use the flexibility of linear models to identify useful features. Our leave-one-in ablation compares features (Figure 3): the top ablations both use IRT features, further validating IRT parameters. The subject and item identifier features are also strongly predictive, but item is the stronger of the two. Text-based features are weaker, but this suggests future work to better integrate them into IRT models (§8).

Ranking with IRT
Leaderboards should produce reliable subject rankings: can DAD rank systems even with a tiny test set? To find out, we compare rankings, both by traditional average accuracy (§3) and by IRT ability, on the whole test set against rankings by the same metric on a smaller test set. Our first experiment (§4.3.1) examines the stability of existing items and subjects while the second (§4.4) investigates stability of "new" evaluation data using sampling strategies.

Figure 3: We compare each IRT and linear model (LM) by how well they predict subject responses. We focus on ROC AUC since predicting responses is an imbalanced classification problem (most subjects are correct). Under that metric, all IRT models improve over the best LM, and the strongest LM ablation only uses IRT features. That textual features are predictive in the LM suggests they could improve future models.

IRT Rankings Have Better Reliability
Rankings should be reliable within the same dataset (e.g., on dev set) and generalize to similar datasets (e.g., with a test dataset). To test the first, we measure the ranking stability of mutually exclusive samples of the development data (Buckley and Voorhees, 2000). To test the second, we measure the correlation between development set sample rankings to test set rankings (Voorhees, 1998).
Specifically, for a range of sample sizes,⁸ we (1) sample two partitions of the data, (2) compute the classical ranking⁹ and the IRT ranking from a refit IRT-feas model, then (3) compute Kendall's correlation (Kendall, 1938) between the samples for each ranking (details in Appendix D). In both cases IRT rankings have higher correlation than classical rankings (Figure 4, left). Since the benefit is strongest at low sample sizes, IRT can improve the reliability of small-scale evaluations.
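Kendall's rank correlation, used throughout these experiments, counts concordant versus discordant subject pairs. A minimal pair-counting sketch of the tau-a variant (tied pairs contribute zero; tau-b, commonly reported by statistics libraries, adds a tie correction):

```python
def kendall_tau(a, b):
    """Kendall's tau-a between two score lists over the same subjects:
    (concordant pairs - discordant pairs) / total pairs."""
    assert len(a) == len(b) and len(a) > 1
    sign = lambda v: (v > 0) - (v < 0)
    n, s = len(a), 0
    for i in range(n):
        for j in range(i + 1, n):
            # +1 if the pair is ordered the same way in both lists, -1 otherwise.
            s += sign(a[i] - a[j]) * sign(b[i] - b[j])
    return 2 * s / (n * (n - 1))
```

Identical orderings give 1.0, reversed orderings give -1.0, and a single swapped pair among three subjects gives 1/3.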
The second experiment examines ranking generalization: IRT yields more reliable measures of subject skill, implying a greater consistency in subject rankings across evaluation settings. Figure 4 compares the development set sample rankings computed above to rankings obtained using subjects' test set responses (with the same IRT model).
Across all sample sizes, subjects' IRT ability estimated on the development set correlates well with test set ability. Crucially, this is better than the corresponding classical metrics like accuracy (Appendix D quantifies the statistical significance of the difference), supporting our original motivation for using IRT.¹⁰

⁸ The sample size must be less than half the size of the development data so that we can obtain two samples.
⁹ For SQuAD, ordering by mean exact match score.
¹⁰ Since the maximum trial size was limited, we train one final model with the full data; see Table 3 in Appendix D.

IRT Improves Cold Start Reliability
IRT can also guide the construction of tests. Just as IRT practitioners prepare tests for humans, we too construct tests for machines. In educational testing, collecting responses from humans is expensive; likewise, although questions are cheap in search-based QA tasks (Nguyen et al., 2016; Kwiatkowski et al., 2019), annotating answers is expensive. Similarly, "grading" machine dialog responses is expensive, and IRT helps (Sedoc and Ungar, 2020). To emulate this setting, we use computerized adaptive testing (Weiss and Kingsbury, 1984) to iteratively select SQuAD items to "annotate." As in human test preparation, we use existing annotations to infer item parameters and iteratively infer the ability of new subjects. This experiment splits m subjects into a training group (80%) and a testing group (20%). The training group represents subjects for which we have full item predictions and annotations; the testing group represents a new group of subjects that we need to rank. To efficiently rank, we should iteratively choose the items to annotate that yield the most information about the ranking if all the data were annotated.
This experiment compares how well several item selection strategies work. For each selection method, we (1) choose a sample size, (2) sample from the development set, (3) compute the ranking of subjects, and (4) compute Kendall's rank correlation (Figure 5).¹¹ Which item selection strategies should we compare? As a baseline, we use naïve random sampling. Like prior work, we compare selecting items with the highest difficulty and the highest discriminability (Lalor et al., 2019) as well as the sum of the two.¹²

Figure 4: Compared to the final ranking over a large test set, how well does a small test set correlate? The left pane shows correlation between mutually exclusive development set samples and the right between development samples and the full test set. In both experiments (panes), ranking systems by IRT ability is more stable, across all sample sizes, than mean accuracy and thus more reliable (Kendall's rank correlation is higher). Bands show 95% confidence intervals of rank correlations across ten trials per sample size.

Figure 5: Suppose we need to cold start and collect annotations for a new subject: what sampling method (random, high difficulty, high discrimination, high discrimination + difficulty, or high information) would most rapidly increase correlation to the full test data? As we expect, the correlations eventually converge, but with little data, IRT has better correlation than other methods. We suspect that the IRT information underperforms early on when the subject ability estimate is unstable.

We propose that items should be selected according to their Fisher information content (Weiss, 1982) as derived by Lord et al. (1968, p. 70). Intuitively, if we do not yet know the true skill θ_j, we should pick items whose expected response we are most uncertain about. Our uncertainty (entropy) is maximized when the likelihood of a correct response p_ij is the same as the likelihood of an incorrect response 1 − p_ij, which corresponds to the maximal value of I_i(θ_j); it is also sensible that this value increases as discriminability γ_i increases.

¹² We train an IRT-disc model to simplify sampling (e.g., avoiding a tradeoff between feasibility and discriminability).
To infer the maximally informative items, we estimate the ability θ_j of each subject using the currently selected items, use the ability to compute the information of each yet-to-be-annotated item for each subject, and then aggregate the informativeness by item i, summed over subjects j. This approach is similar to uncertainty sampling and reduces to it for the IRT-base model (Lewis and Gale, 1994). We initially seed with the twenty-five most discriminative items (details in Appendix D). As in computerized adaptive testing (Moreno et al., 1984), Figure 5 shows that at lower sample sizes three of the IRT sampling methods are better than random sampling, while difficulty does worse. The other IRT methods have comparable correlation. Thus, by using IRT, DAD can both improve rankings and guide annotation.
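The selection rule can be sketched as follows, assuming the standard 2PL item information I_i(θ) = γ_i² p(1 − p); `items` is a hypothetical list of (difficulty, discriminability) pairs, not the paper's data structure:

```python
import math

def item_information(theta, beta, gamma):
    """Fisher information of a 2PL item at ability theta: gamma^2 * p * (1 - p).
    Maximized when p = 0.5 (theta == beta); grows with gamma^2."""
    p = 1.0 / (1.0 + math.exp(-gamma * (theta - beta)))
    return gamma ** 2 * p * (1.0 - p)

def next_item(thetas, items, annotated):
    """Pick the unannotated item with the highest information summed over
    the current ability estimates of all subjects."""
    candidates = (i for i in range(len(items)) if i not in annotated)
    return max(candidates,
               key=lambda i: sum(item_information(t, *items[i]) for t in thetas))
```

Repeatedly calling `next_item`, re-estimating abilities after each batch of annotations, mirrors the adaptive loop described above.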

Qualitative Insights on Leaderboards
DAD also helps qualitative analysis of items and subjects. First, IRT identifies overfitting and generalizes partitioning datasets by difficulty. Then we show that, as in educational testing, IRT identifies good and bad items.

Figure 6: We partition evaluation data by IRT difficulty and discriminability with accuracy in each quartile. Most improvements in high-accuracy systems come from getting high-difficulty questions right. Items with low discriminability (and thus prone to annotation errors) are difficult for all subjects except the overfit ARGS-BERT model. We include top-performing SQuAD subjects, several notable subjects (systems), and a pair from the bottom of the leaderboard.

Guiding Analysis with IRT
Several works curate easy and hard QA subsets based on how many models answer correctly (Rondeau and Hazen, 2018) or heuristics (Sugawara et al., 2018). IRT can create similar subsets using IRT-feas, the best 1D model. Difficulty finds where subjects improve, while discriminability and feasibility can surface items that may be invalid. For example, one low-feasibility question (Figure 9) asks "what are two examples of types of Turing machines?" which has two problems: (1) the answer omits five types and (2) span-based evaluation precludes selecting non-contiguous types. After excluding items with negative discriminability (they are likely erroneous), we sort items into bins. We break both difficulty and discriminability into four bins, taking the 25th, 50th, and 75th percentiles, creating eight total bins. Then we select representative SQuAD subjects with their exact match scores (Figure 6). Let's examine a feasible item with positive difficulty and discriminability like "what reform was attempted following the Nice treaty?"¹³ In this case, the annotator's span is too long, resulting in almost no correct answers and a low fuzzy match (token F1). In contrast, one highly discriminative question succeeds because there are multiple plausible guesses to "who did the Normans team up with in Anatolia?"¹⁴ While both the Armenian state and Turkish forces are superficially plausible answers, only Turkish forces is correct; nonetheless, some models are fooled. Using IRT to guide subject analysis is helpful; next, we test how efficient it is at identifying annotation error.

Identifying Annotation Error
To test if IRT can identify annotation error, we inspect sixty SQuAD development set items. We select ten items from each of these groups: the most negative discriminability, discriminability nearest to zero, the highest discriminability, the least difficult, the most difficult, and IRT model errors. For each, we annotate whether the item was correct, was "correct" yet flawed in some way, or simply wrong (Figure 7).¹⁵ Inter-annotator agreement between three authors on this three-way annotation with Krippendorff's α (Krippendorff, 2004; Artstein and Poesio, 2008) is 0.344. Despite only modest agreement, just as in the development of educational tests, negative discriminability is predictive of bad items. When discriminability is negative, the probability of getting the answer right is higher when ability is lower, which is undesirable: Ken consistently loses to Burt on those items. This could identify bad items in evaluation sets for removal.

Related Work
DAD draws together two primary threads: we use IRT to understand datasets, which has been applied to other NLP tasks, and apply it to improving leaderboards. Finally, we explore how the insights of IRT can improve not just the analysis of test sets but also their construction.

Figure 7: We annotate SQuAD items by discriminability, difficulty, and IRT prediction errors. For example, one question with negative discriminability was classified as "Wrong" with the explanation that the annotated answer indicates it is not answerable, but the question actually is answerable. Items with negative discriminability or where IRT's prediction is wrong have a much higher rate of annotation error ("Flawed" or "Wrong"). Using similar methodology, errors in datasets could be more rapidly identified.

(Kiela et al., 2021). Ideally, new data should challenge models through adversarial collection (Wallace et al., 2019b; Nie et al., 2020) and related methods (Gardner et al., 2020). However, if making an easy leaderboard more difficult is not possible, the leaderboard has outlived its helpfulness and should be retired (Voorhees, 2000).
Part of our work centers on alternate task efficacy rankings, but this naïvely assumes that task efficacy is the sole use case of leaderboards. Indeed, focusing solely on these factors can mislead the public (Paullada et al., 2020) and may not reflect human language capabilities (Schlangen, 2020). Leaderboards are also well positioned to provide incentive structures for participants to prioritize fairness (Bender and Friedman, 2018) and efficiency (Strubell et al., 2019; Schwartz et al., 2020; Min et al., 2021) or to incorporate testing of specific capabilities (Ribeiro et al., 2020; Dunietz et al., 2020). To enable these more nuanced analyses, leaderboards should accept runnable models rather than static predictions (Ma et al., 2021).
Active Learning Beyond IRT, the analysis of training dynamics and active learning (Settles, 2009) is helpful for actively sampling specific items or identifying low-quality items (Brodley and Friedl, 1999). For example, Swayamdipta et al. (2020) and Pleiss et al. (2020) propose alternative training-dynamics-based methods for identifying difficult items as well as annotation errors. Even closer to our goals, Rahman et al. (2020) use active learning to build a test collection. Explicitly measuring how effectively examples separate the best subject from the rest allows test set curators to "focus on the bubble" (Boyd-Graber and Börschinger, 2020), prioritizing examples most likely to reveal interesting distinctions between submitted systems.
Alternate Formulations IRT is an example of convergent evolution of models that predict subject action given an item. Ideal point models (Poole and Rosenthal, 2017) consider how a legislator (subject) will vote on a bill (item) and use a similar mathematical formulation. The venerable Elo model (Glickman and Jones, 1999) and modern extensions (Herbrich et al., 2007) predict whether a player (subject) will defeat an opponent (item) with, again, a similar mathematical model. Certain IRT models can also be formulated as nonlinear mixed models (Rijmen et al., 2003), where the item parameters are fixed effects and the latent subject parameters are random effects. This allows for comparisons between IRT models and other mixed effects models under a consistent framework. IRT-base and IRT-disc can be formulated as nonlinear mixed models, and IRT-feas can be formulated as a discrete mixture model over items. As we discuss further in the next section, DAD's application of IRT can be further improved by adopting interpretable extensions of these models.

Conclusion
This paper advocates incorporating decades of research in crafting educational tests to improve how we evaluate the capabilities of NLP models. We propose and validate an alternate IRT ranking method for leaderboard evaluations, and show that it can guide annotation, detect annotation error, and naturally partition evaluation data. Just as educators moved from classical testing to IRT, the NLP community should consider future evaluations with IRT.

Limitations
Although there is much to gain through IRT evaluation, there are limitations which make it hard to implement. First, it requires access to item-level responses for all examples for all subjects which are often only available to organizers. Second, Urbano (2016) notes that sampling mutually exclusive subsets has drawbacks-samples are not entirely independent. Lastly, our work is a proof of concept using SQuAD 2.0 as a test bed and our results may not generalize.

Future Work
We see a few directions for future work. First, this paper is intended to validate IRT and its usefulness as an active part of the leaderboard lifecycle; the natural next step is to implement it in a leaderboard. Second, our IRT models do not incorporate the item content (e.g., example text) to predict responses, but in principle could; Bayesian models with metadata (Card et al., 2018) offer one possible route.

On the same page, we provide a web interface for inspecting the parameters of the IRT models. Figure 12 shows the feasibility distribution corresponding to Figure 1.

B Logistic Regression Features
The linear model ( §4.2) includes features based on item IDs, subject IDs, textual features of the question, context, and answer, and topic model features. Table 1 lists the feature names from Figure 3 with descriptions of each. When IRT features or the statistics features are used, they include interaction terms with themselves.

C IRT Model Type Correlation
Although each IRT model differs in expressiveness, they should-in general-produce similar results. This is confirmed by computing the Kendall's rank correlation between the subject abilities and item difficulties (Table 2).

D Ranking Stability Experiments
Here we provide further details for the ranking stability experiments (§4.2.3). First, we filter from the 161 subjects that have development set scores to the 115 that also have test set scores.¹⁶ In our simulation, we run 10 trials for every sample size; sample sizes begin at 100 and increase in steps of 100. In addition to these, we also run trials for sample sizes 25, 50, and 75. Since each sample can be no larger than half the dataset, we stop at half the dataset. Table 3 uses an IRT-disc model since we noticed that, in comparison, IRT-feas overfit the data, yielding worse results. The correlations with the full data are all strong, but not the same. We conclude that, at least on SQuAD, IRT rankings are modestly more reliable than classical rankings.

D.2 Statistical Significance of Difference in Kendall Tau Coefficients
While Figure 4 shows a consistent difference in correlation between ranking methods, it is unclear whether this difference is statistically significant. We estimate the statistical significance of the difference through bootstrap sampling (Efron, 1994). Since the null case is no difference in correlation coefficients, we seek a symmetric sampling distribution centered at zero that represents a realistic density function. Each ranking stability experiment17 trial results in two lists of number pairs. The lists correspond to subject scores on two datasets;18 each number pair is the subject's accuracy and IRT score. To create the bootstrap distribution, we (1) sample pairs with replacement from one list, (2) compute the correlation between the resampled ranking and the unused ranking when using accuracy versus IRT score, and (3) compute and store the IRT correlation score minus the accuracy correlation score. We repeat this process 1,000 times for each of the 10 trials in the original experiment and aggregate all the differences to build the bootstrap distribution. For each sample size, we compute the empirical p-value on each trial, which we show in box-and-whisker plots (Figure 13).

Figure 8: The example from SQuAD with the lowest discriminability. Surprisingly, it had a negative discriminability, implying that the less skilled a subject is, the more likely its response is to be correct. Discriminability: -9.63. Difficulty: -0.479. Feasibility: 0.614. Mean Exact Match: 0.472. Wikipedia Page: Economic inequality. Question ID: 572a1c943f37b319004786e3. Question: Why did the demand for rentals decrease? Official Answer: demand for higher quality housing. Context: A number of researchers (David Rodda, Jacob Vigdor, and Janna Matlack) argue that a shortage of affordable housing, at least in the US, is caused in part by income inequality. David Rodda noted that from 1984 to 1991, the number of quality rental units decreased as the demand for higher quality housing increased (Rhoda 1994:148). Through gentrification of older neighbourhoods, for example, in East New York, rental prices increased rapidly as landlords found new residents willing to pay higher market rates for housing and left lower income families without rental units. The ad valorem property tax policy combined with rising prices made it difficult or impossible for low income residents to keep pace.

Figure 9: This question is regarded as infeasible by the IRT model. Upon further inspection, the official answer omits five acceptable answers and, more importantly, does not permit all combinations of Turing machines. Official Answer: probabilistic Turing machines, non-deterministic Turing machines. Context: Many types of Turing machines are used to define complexity classes, such as deterministic Turing machines, probabilistic Turing machines, non-deterministic Turing machines, quantum Turing machines, symmetric Turing machines and alternating Turing machines. They are all equally powerful in principle, but when resources (such as time or space) are bounded, some of these may be more powerful than others.
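The bootstrap procedure above can be sketched in a few lines; the subject scores here are synthetic stand-ins (not the paper's data), and we use Kendall's tau as the rank correlation:

```python
import numpy as np
from scipy.stats import kendalltau


def bootstrap_corr_diff(acc_irt_pairs, heldout_scores, n_boot=1000, seed=0):
    """Bootstrap distribution of (IRT tau - accuracy tau).

    acc_irt_pairs: (n_subjects, 2) array of (accuracy, IRT score) on one dataset.
    heldout_scores: (n_subjects,) subject scores on the other dataset.
    """
    rng = np.random.default_rng(seed)
    n = len(acc_irt_pairs)
    diffs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)        # (1) resample subjects with replacement
        acc, irt = acc_irt_pairs[idx, 0], acc_irt_pairs[idx, 1]
        held = heldout_scores[idx]
        tau_irt, _ = kendalltau(irt, held)      # (2) correlation using IRT score
        tau_acc, _ = kendalltau(acc, held)      #     versus using accuracy
        diffs.append(tau_irt - tau_acc)         # (3) store the difference
    return np.array(diffs)


# Synthetic subjects: a latent skill observed with noise under each scoring method.
rng = np.random.default_rng(1)
skill = rng.normal(size=50)
pairs = np.column_stack([skill + rng.normal(scale=0.5, size=50),   # accuracy
                         skill + rng.normal(scale=0.3, size=50)])  # IRT score
held = skill + rng.normal(scale=0.4, size=50)
diffs = bootstrap_corr_diff(pairs, held)
# One simple empirical p-value: fraction of resamples where IRT does not
# out-correlate accuracy (differences at or below the null value of zero).
p_value = float(np.mean(diffs <= 0.0))
```

This illustrates the three enumerated steps for a single trial; the paper aggregates such differences over all 10 trials before reading off p-values.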

E The IRT Statistical Test
The IRT test differs in two substantial ways from other tests: (1) it does not assume that items are equally informative, and (2) it does assume that the informativeness of items is a function of the subject's skill θ_j. In the literature, this is closely connected to reliability (Tague-Sutcliffe, 1992): each item provides information about the location of θ_j, and as we accumulate more evidence for the location of θ_j, the error of estimation decreases. It is a well-known result in IRT that the standard error of estimate (SEE) σ(θ̂ | θ) varies with respect to the subject location parameter θ (De Ayala, 2013, p. 30) and is connected to the Fisher information of each item. For a 2PL model, an item's information is maximized when p_i = 1 − p_i, that is, when p_i = 0.5. Since Fisher information is additive, the information of the evaluation set is maximal when items have a 50% chance of being responded to correctly. As derived by De Ayala (2013, p. 102), the standard error of estimation is computed by accumulating the information gained from each item. Given two subjects X and Y, one can use the probability distribution of score differences to compute the probability that the difference in skill is greater than two standard errors, which corresponds to an α ≤ .05 significance level.
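For a 2PL item with discrimination a_i and difficulty b_i (our notation), the item information at skill θ is I_i(θ) = a_i² p_i(θ)(1 − p_i(θ)), and the SEE is 1/√(Σ_i I_i(θ)). A minimal numerical sketch:

```python
import numpy as np


def p_correct(theta, a, b):
    """2PL response probability: sigmoid of a * (theta - b)."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))


def item_information(theta, a, b):
    """Fisher information of a 2PL item at skill theta: a^2 * p * (1 - p)."""
    p = p_correct(theta, a, b)
    return a ** 2 * p * (1.0 - p)


def standard_error(theta, a, b):
    """SEE: information is additive over items, so SEE = 1 / sqrt(total info)."""
    return 1.0 / np.sqrt(item_information(theta, a, b).sum())


# 100 items, all with discrimination 1 and difficulty 0.
a = np.ones(100)
b = np.zeros(100)
# Information peaks where p = 0.5, i.e. at theta = b for a 2PL item.
assert item_information(0.0, 1.0, 0.0) >= item_information(1.0, 1.0, 0.0)
see_at_peak = standard_error(0.0, a, b)  # SEE is smallest where items are 50/50
```

With a = 1 and θ = b, each item contributes information 0.25, so 100 such items give SEE = 1/√25 = 0.2; moving θ away from the difficulties shrinks the information and inflates the SEE.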

F Multidimensional IRT Clustering
While we achieve strong held-out accuracy with 10-dimensional IRT (IRT-vec), we had limited success in interpreting its parameters. We use TSNE19 plots overlaid with features such as item accuracy, the question's Wikipedia page, whether the question was answerable, question length, and topic model weights. Of these, item accuracy and answerability showed the most obvious patterns (Figure 14).

Figure 10: This example shows that the answer span is likely too large, causing models to fail on both SQuAD's exact match and F1 metrics. Official Answer: an attempt to reform the constitutional law of the European Union and make it more transparent. Context: Following the Nice Treaty, there was an attempt to reform the constitutional law of the European Union and make it more transparent; this would have also produced a single constitutional document. However, as a result of the referendum in France and the referendum in the Netherlands, the 2004 Treaty establishing a Constitution for Europe never came into force. Instead, the Lisbon Treaty was enacted. Its substance was very similar to the proposed constitutional treaty, but it was formally an amending treaty, and, though it significantly altered the existing treaties, it did not completely replace them.

Figure 12: The feasibility parameter λ of our IRT model represents the probability that an example is unsolvable. For example, annotation error could lead to an example always being scored incorrectly, regardless of how good the model is. In SQuAD 2.0, λ < .434 at the 5th percentile, λ < .698 at the 7.5th, and λ < .931 at the 10th percentile.
We repeated this approach with the multi-task question answering shared task MRQA (Fisch et al., 2019). However, instead of using 10 dimensions, we use 6 to match the number of development set tasks in MRQA. Although questions from NarrativeQA stand out (Figure 15), there is no discernible pattern amongst the other tasks. We leave more sophisticated methods for making multidimensional IRT models interpretable to future work.
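The overlay procedure above can be sketched as follows, assuming scikit-learn's TSNE and random item vectors in place of fitted IRT parameters:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
n_items, dims = 200, 10
difficulty = rng.normal(size=(n_items, dims))        # multidimensional difficulty
discriminability = rng.normal(size=(n_items, dims))  # multidimensional discriminability
item_accuracy = rng.uniform(size=n_items)            # an interpretable overlay feature

# Embed the concatenated per-item parameters into 2D, then scatter-plot the
# embedding colored by a feature (accuracy, answerability, topic weight, ...).
features = np.hstack([difficulty, discriminability])
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)
# e.g. plt.scatter(embedding[:, 0], embedding[:, 1], c=item_accuracy)
```

With real parameters, clusters in the embedding that align with the overlay color are the "obvious patterns" referred to above; random vectors like these should show none.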

G Reproducibility Checklist
Here we provide reproducibility details to complement our source code (https://irt.pedro.ai).

G.1 Software and Parameters
All IRT models are implemented in PyTorch (Paszke et al., 2019) and Pyro (Bingham et al., 2018). Linear models are trained with Vowpal Wabbit (Agarwal et al., 2014). The topic model that generates features for the linear model uses Mallet (McCallum, 2002).
The number of IRT model parameters is proportional to the number of subjects m and the number of items n. IRT-base has one parameter per subject and one per item. IRT-disc has one parameter per subject and two per item. IRT-feas has one parameter per subject and three per item. IRT-vec has ten parameters per subject and thirty per item.
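As a concrete reading of the IRT-feas counts, the response probability can be sketched as a feasibility-scaled 2PL: one skill θ_j per subject and a difficulty b_i, discriminability a_i, and feasibility λ_i per item. This is our paraphrase of one plausible parameterization (here λ_i acts as a ceiling on the probability of a correct response), not necessarily the exact convention used in py-irt:

```python
import math


def irt_feas_prob(theta_j, b_i, a_i, lam_i):
    """Feasibility-augmented 2PL probability of a correct response (a sketch).

    theta_j: subject skill (one parameter per subject);
    b_i, a_i, lam_i: item difficulty, discriminability, and feasibility
    (three parameters per item, matching the IRT-feas counts above).
    """
    return lam_i / (1.0 + math.exp(-a_i * (theta_j - b_i)))


# A skilled subject on an easy but imperfect item: the feasibility ceiling
# lam_i = 0.9 caps the probability even as skill grows.
p = irt_feas_prob(theta_j=1.0, b_i=0.0, a_i=1.0, lam_i=0.9)
```

Dropping λ_i recovers IRT-disc (two parameters per item), and additionally fixing a_i = 1 recovers IRT-base (one parameter per item).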

G.2 Hyperparameters
We did not invest significant effort in hyperparameter tuning the IRT models and instead used the defaults in the py-irt software20 provided by Lalor et al. (2019). The IRT-base, IRT-disc, and IRT-feas models were trained for 1,000 epochs with no early stopping conditions and a learning rate of 0.1 with ADAM (Kingma and Ba, 2015). The IRT-vec model was trained for 2,500 epochs and used 10 dimensions.

Figure 13 (panels: Dev Sample to Dev Sample; Dev Sample to Test): P-values of the rank correlation difference for each sample size and trial in Figure 4. The inherent noise in dev set sampling makes inferring significance difficult (left); test-set-driven results (right) are more significant.

Figure 14: In SQuAD, TSNE shows a relationship between mean exact match (item accuracy) and answerability with respect to multidimensional difficulty and discriminability.
In the linear model, we used a Hyperopt-based (Bergstra et al., 2013) tool provided by Vowpal Wabbit21 for hyperparameter search. For each linear model, the tool spent 20 iterations optimizing the learning rate, L2 regularization, and number of bits against the logistic loss function. The learning rate was searched from 0.001 to 10 with log-uniform sampling, L2 regularization from 1e-8 to 1, and bits from 20 to 23 as categorical variables.
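The search space can be sketched with plain random sampling; this is a stdlib stand-in for illustration, not Vowpal Wabbit's actual Hyperopt tuner:

```python
import math
import random


def sample_config(rng):
    """One draw from the search space described above (a sketch, not VW's tuner)."""
    return {
        # learning rate: log-uniform on [0.001, 10]
        "learning_rate": math.exp(rng.uniform(math.log(0.001), math.log(10.0))),
        # L2 regularization: log-uniform on [1e-8, 1]
        "l2": math.exp(rng.uniform(math.log(1e-8), math.log(1.0))),
        # hash-table bits treated as a categorical variable
        "bits": rng.choice([20, 21, 22, 23]),
    }


rng = random.Random(0)
configs = [sample_config(rng) for _ in range(20)]  # 20 iterations, as in the search
```

Hyperopt replaces this uniform draw with tree-structured Parzen estimation, but the bounds and variable types are the same.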
The topic model that generated features for the linear model used Mallet, and we followed the software's recommendations to set hyperparameters.22 Specifically, we used an optimization interval of 10, removed stop words, trained for 1,000 iterations, and used a document-topic threshold of 0.05. Each document comprised the Wikipedia page title and the question text.

21 github.com/VowpalWabbit/vowpal_wabbit

Figure 15: In MRQA, TSNE shows a relationship between whether the task is NarrativeQA with respect to multidimensional difficulty and discriminability.

G.3 Computational Resources
The majority of experiments were conducted on a single workstation with an Intel i7-7700K CPU, 47GB of RAM, and an Nvidia 1080Ti. The average runtime for the IRT-feas model on CPU is 113 seconds with a standard deviation of 2.31 over 5 trials. The average runtime of the IRT-vec model on GPU is 110 seconds with a standard deviation of 0.5 over 5 trials.
Since each ranking stability experiment (§4.3.1) required re-training an IRT-feas model on each subset, we parallelized this experiment on a CPU cluster where each trial received two CPU cores and 16GB of RAM. In total, this included 520 trials, which corresponds to twice that many trained IRT models since one model is trained on each subset of the data.