Cross-replication Reliability - An Empirical Approach to Interpreting Inter-rater Reliability

When collecting annotations and labeled data from humans, a standard practice is to use inter-rater reliability (IRR) as a measure of data goodness (Hallgren, 2012). Metrics such as Krippendorff’s alpha or Cohen’s kappa are typically required to be above a threshold of 0.6 (Landis and Koch, 1977). These absolute thresholds are unreasonable for crowdsourced data from annotators with high cultural and training variances, especially on subjective topics. We present a new alternative to interpreting IRR that is more empirical and contextualized. It is based upon benchmarking IRR against baseline measures in a replication, one of which is a novel cross-replication reliability (xRR) measure based on Cohen’s (1960) kappa. We call this approach the xRR framework. We open-source a replication dataset of 4 million human judgements of facial expressions and analyze it with the proposed framework. We argue this framework can be used to measure the quality of crowdsourced datasets.


Introduction
Much content analysis and linguistics research is based on data generated by human beings (henceforth, annotators or raters) asked to make some kind of judgment. These judgments involve systematic interpretation of textual, visual, or audible matter (e.g. newspaper articles, television programs, advertisements, public speeches, and other multimodal data). When relying on human observers, researchers must worry about the quality of the data, specifically their reliability (Krippendorff, 2004). Are the annotations collected reproducible, or are they the result of human idiosyncrasies?
Respectable scholarly journals typically require reporting quantitative evidence for the inter-rater reliability (IRR) of the data (Hallgren, 2012). Cohen's kappa (Cohen, 1960) or Krippendorff's alpha (Hayes and Krippendorff, 2007) is expected to be above a certain threshold to be worthy of publication, typically 0.6 (Landis and Koch, 1977). Similar IRR requirements for human annotations data have been followed across many fields. In this paper we refer to this absolute interpretation of IRR as the Landis-Koch approach (Fig. 1).
This approach has been foundational in guiding the development of widely used and shared datasets and resources. Meanwhile, the landscape of human annotations collection has witnessed a tectonic shift in recent years. Driven by the data-hungry success of machine learning (LeCun et al., 2015; Schaekermann et al., 2020), there has been an explosive growth in the use of crowdsourcing for building datasets and benchmarks (Snow et al., 2008; Kochhar et al., 2010). We identify three paradigm shifts in the scope of and methodologies for data collection that make the Landis-Koch approach not as useful in today's settings.
A rise in annotator diversity. In the pre-crowdsourcing era lab settings, data were typically annotated by two graduate students following detailed guidelines and working with balanced corpora. Over the past two decades, however, the bulk of data have been annotated by crowd workers with high cultural and training variances.
A rise in task diversity. There has been an increasing number of subjective tasks with genuine ambiguity: judging toxicity of online discussions (Aroyo et al., 2019), in which the IRR values range between 0.2 and 0.4; judging emotions expressed by faces (Cowen and Keltner, 2017), in which more than 80% of the IRR values are below 0.6; and A/B testing of user satisfaction or preference evaluations (Kohavi and Longbotham, 2017), where IRR values are typically between 0.3 and 0.5.
A rise in imbalanced datasets. Datasets are no longer balanced intentionally. Many high-stakes human judgements concern rare events with substantial tail risks: event security, disease diagnostics, financial fraud, etc. In all of these cases, a single rare event can be the source of considerable cost. High class imbalance has led to many complaints of IRR interpretability (Byrt et al., 1993).
Each of these changes individually has a profound impact on data reliability. Together, they have caused a shift from data-from-the-lab to data-from-the-wild, for which the Landis-Koch approach to interpreting IRR is admittedly too rigid and too stringent.
Meanwhile, we have seen a drop in the reliance on reliability. Machine learning, crowdsourcing, and data research papers and tracks have abandoned the use and reporting of IRR for human labeled data, despite calls for it (Paritosh, 2012). The most cited recent datasets and benchmarks used by the community, such as SQuAD (Rajpurkar et al., 2016), ImageNet (Deng et al., 2009), and Freebase (Bollacker et al., 2008), have never reported IRR values. This would have been unthinkable twenty years ago. More importantly, this is happening against the backdrop of a reproducibility crisis in artificial intelligence (Hutson, 2018).
With the decline of the usage of IRR, we have seen a rise of ad hoc, misguided quality metrics that took its place, including 1) agreement-%, 2) accuracy relative to consensus, 3) accuracy relative to "ground truth." This is dangerous, as IRR is still our best bet for ensuring data reliability. How can we ensure its continued importance in this new era of data collection?
This paper is an attempt to address this problem by proposing an empirical alternative to interpreting IRR. Instead of relying on an absolute scale, we benchmark an experiment's IRR against two baseline measures, to be found in a replication. Replication here is defined as re-annotating the same set of items with a slight change in the experimental setup, e.g., annotator population, annotation guidelines, etc. By fixing the underlying corpus, we can ensure the baseline measures are sensitive to the experiment on hand. The first baseline measure is the annotator reliability in the replication. The second measure is the annotator reliability between the replications. In Section 3, we present a novel way of measuring this. We call it cross-kappa (κ x ). It is an extension of Cohen's (1960) kappa and is designed to measure annotator agreement between two replications in a chance-corrected manner.
We present in Appendix A the International Replication (IRep) dataset, a large-scale crowdsourced dataset of four million judgements of human facial expressions in videos. The dataset consists of three replications in Mexico City, Budapest, and Kuala Lumpur. Our analysis in Section 4 shows this empirical approach enables meaningful interpretation of IRR. In Section 5, we argue xRR is a sensible way of measuring the goodness of crowdsourced datasets, where high reliability is unattainable. While we only illustrate comparing annotator populations in this paper, the methodology behind the xRR framework is general and can apply to similarly replicated datasets, e.g., via change of annotation guidelines.

Related Work
To position our research, we present a brief summary of the literature in two areas: metrics for measuring annotator agreement and their shortcomings (Section 2.1), and comparing replications of an experiment (Section 2.2).

Annotator Agreement
Artstein and Poesio (2008) present a comprehensive survey of the literature on IRR metrics used in linguistics. Popping (1988) compares an astounding 43 measures for nominal data (mostly applicable to reliability of data generated by only two observers). Since then, Cohen's (1960) kappa and its variants (Carletta et al., 1997; Cohen, 1968) have become the de facto standard for measuring agreement in computational linguistics.
One of the strongest criticisms of kappa is its lack of interpretability when facing class imbalance. This problem is known as the kappa paradox (Byrt et al., 1993; Warrens, 2010), or the 'base rates' problem (Uebersax, 1987). Bruckner and Yoder (2006) show class imbalance imposes practical limits on kappa and suggest interpreting kappa in relation to the class imbalance of the underlying data. Others have proposed measures that are more robust against class imbalance (Gwet, 2008; Spitznagel and Helzer, 1985; Stewart and Rey, 1988). Pontius Jr and Millones (2011) even suggest abandoning the use of kappa altogether.

Agreement Between Replications
Replications are often compared, but typically only at the level of per-item mean scores. Cowen and Keltner (2017) measure the correlation between the mean scores of two geographical rater pools. They use Spearman's (1904) correction for attenuation (discussed later in this paper) with split-half reliability. Snow et al. (2008) measure the Pearson correlations between the score of a single expert and the mean score of a group of non-experts, and vice versa. In this comparison the authors do not correct for correlation attenuation, hence the reported correlations may be strongly biased. Bias aside, correlation is not suitable for tasks with non-interval data or tasks with missing data. In this paper, we propose a general methodology for measuring rater agreement between replications with the same kind of generality, flexibility, and ease of use as IRR.

Cross-replication Reliability (xRR)
Data reliability can be assessed when a set of items are annotated multiple times. When this is done by a single rater, intra-rater reliability assesses a person's agreement with oneself. When this is done by two or more raters, inter-rater reliability (IRR) assesses the agreement between raters in an experiment. We propose to extend IRR to measure a similar notion of rater-rater agreement, but where the raters are taken from two different experiments. We call it cross-replication reliability (xRR). These replications can be a result of re-labeling the same items with a different rater pool, annotation template, or on a different platform, etc.
We begin with a general definition of Cohen's (1960) kappa. We extend it to cross-kappa (κ x ) to measure cross-replication reliability. We then use this foundation to define normalized κ x to measure similarity between two replications.

Kappa and Its Generalizations
The class of IRR measures is quite diverse, covering many different experimental scenarios, e.g., different numbers of raters, rating scales, agreement definitions, assumptions about rater interchangeability, etc. Out of all such coefficients, Cohen's (1960) kappa has a distinct property that makes it most suitable for the task at hand. Unlike Scott's pi (Scott, 1955), Fleiss's kappa (Fleiss, 1971), Krippendorff's alpha (Krippendorff, 2004), and many others, Cohen's (1960) kappa allows for two different marginal distributions. This stems from Cohen's belief that two raters do not necessarily share the same marginal distribution, hence they should not be treated interchangeably. When we compare replications, e.g., two rater populations, we are deliberately changing some underlying conditions of the experiment, hence it is safer to assume the marginal distributions will not be the same. Within either replication, however, we rely on the rater interchangeability assumption. We think this more accurately reflects the current practice in crowdsourcing, where each rater contributes a limited number of responses in an experiment, and hence raters are operationally interchangeable.
Cohen's (1960) kappa was invented to compare two raters classifying n items into a fixed number of categories. Since its publication, it has been generalized to accommodate multiple raters (Light, 1971; Berry and Mielke Jr, 1988), and to cover different types of annotation scales: ordinal (Cohen, 1968), interval (Berry and Mielke Jr, 1988; Janson and Olsson, 2001), multivariate (Berry and Mielke Jr, 1988), and any arbitrary distance function (Artstein and Poesio, 2008). In this paper we focus on Janson and Olsson's (2001) generalization, which the authors denote with the lowercase Greek letter iota (ι).
It extends kappa to accommodate interval data with multiple raters, and is expressed in terms of pairwise disagreement:
\[ \iota = 1 - \frac{d_o}{d_e} \tag{1} \]
d_o in this formula represents the observed portion of disagreement and is defined as:
\[ d_o = \left[ n \binom{b}{2} \right]^{-1} \sum_{r<s} \sum_{i=1}^{n} D(x_{ri}, x_{si}) \tag{2} \]
where n is the number of items, b the number of annotators, i the item index, r and s the annotator indexes, and x_{ri} the annotation of annotator r on item i; \sum_{r<s} is the sum over all r and s such that 1 <= r < s <= b. D() is a pairwise disagreement defined as:
\[ D(x_{ri}, x_{sj}) = (x_{ri} - x_{sj})^2 \tag{3} \]
for interval data, and
\[ D(x_{ri}, x_{sj}) = \begin{cases} 0 & \text{if } x_{ri} = x_{sj} \\ 1 & \text{otherwise} \end{cases} \tag{4} \]
for categorical data. Note we are dropping Janson and Olsson's multivariate reference in D() and focusing on the univariate case. d_e in the denominator represents the expected portion of disagreement and is defined as:
\[ d_e = \left[ n^2 \binom{b}{2} \right]^{-1} \sum_{r<s} \sum_{i=1}^{n} \sum_{j=1}^{n} D(x_{ri}, x_{sj}) \tag{5} \]
Janson and Olsson's expression in Eq. 1 is based on Berry and Mielke Jr (1988). While the latter use absolute distance for interval data, the former use squared distance instead. We follow Janson and Olsson's approach because squared distance leads to desirable properties and familiar interpretation of coefficients (Fleiss and Cohen, 1973; Krippendorff, 1970). Squared distance is also used in alpha (Krippendorff, 2004). Berry and Mielke Jr (1988) show if b = 2 and the scale is categorical, ι in Eq. 1 reduces to Cohen's (1960) kappa. For other rating scales such as ratio or rank, readers should refer to Krippendorff (2004) for additional distance functions. The equations for d_o and d_e are unaffected by the choice of D().
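The definition of ι can be made concrete with a minimal sketch for complete data (every rater annotates every item). The function and variable names (`iota`, `nominal`, `squared`, `ratings`) are illustrative assumptions rather than part of the original formulation, and the brute-force loops favour readability over speed.

```python
from itertools import combinations

def squared(a, b):
    """Interval-scale disagreement: squared distance."""
    return (a - b) ** 2

def nominal(a, b):
    """Categorical disagreement: 1 if the labels differ, else 0."""
    return 0.0 if a == b else 1.0

def iota(ratings, dist=nominal):
    """Generalized kappa (iota) for complete data.

    ratings[r][i] is the annotation of rater r on item i.
    """
    b = len(ratings)                          # number of raters
    n = len(ratings[0])                       # number of items
    pairs = list(combinations(range(b), 2))   # all rater pairs with r < s
    # Observed disagreement: two raters compared on the same item.
    d_o = sum(dist(ratings[r][i], ratings[s][i])
              for r, s in pairs for i in range(n)) / (n * len(pairs))
    # Expected disagreement: two raters compared across all item pairs.
    d_e = sum(dist(ratings[r][i], ratings[s][j])
              for r, s in pairs
              for i in range(n) for j in range(n)) / (n * n * len(pairs))
    return 1.0 - d_o / d_e

# Toy data (hypothetical): two raters, eight binary items.
rater_a = [1, 1, 0, 0, 1, 0, 1, 1]
rater_b = [1, 0, 0, 0, 1, 0, 1, 0]
print(iota([rater_a, rater_b]))
```

With two raters and categorical labels, as in the toy data, the value coincides with Cohen's kappa computed from the usual observed and chance agreement (here 9/17).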

Definition of κ x
Here we present κ x as a novel reliability coefficient for measuring the rater agreement between two replications. In Janson and Olsson's generalized kappa above, the disagreement is measured within pairs of annotations taken from the same experiment. In order to extend it to measure cross-replication agreement, we construct annotation pairs such that the two annotations are taken from different replications. We do not consider annotation pairs from the same replication. We define cross-kappa, κ x (X, Y), as a reliability coefficient between replications X and Y:
\[ \kappa_x(X, Y) = 1 - \frac{d_o(X, Y)}{d_e(X, Y)} \tag{6} \]
where
\[ d_o(X, Y) = \frac{1}{nRS} \sum_{i=1}^{n} \sum_{r=1}^{R} \sum_{s=1}^{S} D(x_{ri}, y_{si}) \tag{7} \]
and
\[ d_e(X, Y) = \frac{1}{n^2 RS} \sum_{i=1}^{n} \sum_{j=1}^{n} \sum_{r=1}^{R} \sum_{s=1}^{S} D(x_{ri}, y_{sj}) \tag{8} \]
where x and y denote annotations from replications X and Y respectively, n is the number of items, R and S the numbers of annotations per item in replications X and Y respectively. In this definition, the observed disagreement is obtained by averaging disagreement observed in nRS pairs of annotations, where each pair contains two annotations on the same item taken from two different replications. Expected disagreement is obtained by averaging over all possible n^2 RS cross-replication annotation pairs. When each replication has only 1 annotation per item, and the data is categorical, it is easy to show κ x reduces to Cohen's (1960) kappa.
κ x is a kappa-like measure, and will have properties similar to kappa's. κ x is bounded between 0 and 1 in theory, though in practice it may be slightly negative for small sample sizes. κ x = 0 means there is no discernible agreement between raters from two replications, beyond what would be expected by chance. κ x = 1 means all raters between two replications are in perfect agreement with each other, which also implies perfect agreement within either replication.
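As a rough illustration of the definition above, the following sketch computes κ x for complete data, i.e., a fixed number of annotations per item within each replication. The names `cross_kappa`, `X`, and `Y` and the toy data are assumptions made for this example only.

```python
def nominal(a, b):
    """Categorical disagreement: 1 if the labels differ, else 0."""
    return 0.0 if a == b else 1.0

def cross_kappa(X, Y, dist=nominal):
    """Cross-kappa between two replications with complete data.

    X[i] lists the R annotations of item i in replication X;
    Y[i] lists the S annotations of the same item in replication Y.
    """
    n, R, S = len(X), len(X[0]), len(Y[0])
    # Observed disagreement: cross-replication pairs on the same item.
    d_o = sum(dist(x, y)
              for i in range(n) for x in X[i] for y in Y[i]) / (n * R * S)
    # Expected disagreement: cross-replication pairs over all item pairs.
    d_e = sum(dist(x, y)
              for xi in X for yj in Y
              for x in xi for y in yj) / (n * n * R * S)
    return 1.0 - d_o / d_e

# Toy data (hypothetical): 4 items, 2 annotations each in X, 3 each in Y.
X = [[1, 1], [0, 0], [1, 0], [0, 0]]
Y = [[1, 1, 1], [0, 0, 1], [1, 1, 0], [0, 0, 0]]
print(cross_kappa(X, Y))
```

Only pairs that straddle the two replications enter either sum, so the internal agreement of X or Y affects κ x only indirectly.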

κ x with Missing Data
As presented, the two replications can have different numbers of annotations per item. However, within either replication, the number of annotations per item is assumed to be fixed. We recognize this may not always be the case. In practice, items within an experiment can receive varying numbers of annotations (i.e., missing data). We now show how to calculate κ x with missing data.
When computing IRR with missing data, weights can be used to account for varying numbers of annotations within each item. Janson and Olsson (2004) propose a weighting scheme for iota in Eq. 1. Instead, we follow the tradition of Krippendorff (2004) in weighting each annotation equally in computing d_o and d_e. That amounts to the following scheme. In d_o, we first normalize within each item, then we take a weighted average over all items, with weights proportional to the combined numbers of annotations per item. In d_e, no weighting is required.
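Since R and S can now vary from item to item, we index them using R(*) and S(*) to denote that they are functions of the underlying items. We rewrite d_o and d_e as:
\[ d_o(X, Y) = \frac{1}{\bar{R} + \bar{S}} \sum_{i=1}^{n} \frac{R(i) + S(i)}{R(i)\, S(i)} \sum_{r=1}^{R(i)} \sum_{s=1}^{S(i)} D(x_{ri}, y_{si}) \tag{9} \]
and
\[ d_e(X, Y) = \frac{1}{\bar{R}\, \bar{S}} \sum_{i=1}^{n} \sum_{j=1}^{n} \sum_{r=1}^{R(i)} \sum_{s=1}^{S(j)} D(x_{ri}, y_{sj}) \tag{10} \]
where \bar{R} is the total number of annotations in replication X, R(i) the number of annotations on item i in replication X, r = 1, 2, ..., R(i) (on item i in replication X); and similarly for \bar{S}, S(j), and s with respect to replication Y.
The sums over r and s in Eq. 9 and 10 are inner summations whose upper limits depend on the indexes i and j of the outer summations. Without missing data, R(i) = R for all i, and S(j) = S for all j, so that \bar{R} = nR and \bar{S} = nS, reducing Eq. 9 and 10 to Eq. 7 and 8.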
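The weighting scheme described above can be sketched as follows. This is a minimal illustration, assuming every item has at least one annotation in each replication; the name `cross_kappa_missing` and the toy data are hypothetical.

```python
def nominal(a, b):
    """Categorical disagreement: 1 if the labels differ, else 0."""
    return 0.0 if a == b else 1.0

def cross_kappa_missing(X, Y, dist=nominal):
    """Cross-kappa when the number of annotations varies per item.

    X[i] and Y[i] are the (non-empty) annotation lists of item i.
    """
    if any(len(xi) == 0 or len(yi) == 0 for xi, yi in zip(X, Y)):
        raise ValueError("each item needs an annotation in both replications")
    total_x = sum(len(xi) for xi in X)   # total annotations in X
    total_y = sum(len(yi) for yi in Y)   # total annotations in Y
    # Observed disagreement: normalize within each item, then weight the
    # item by its combined number of annotations.
    d_o = 0.0
    for xi, yi in zip(X, Y):
        within = sum(dist(x, y) for x in xi for y in yi) / (len(xi) * len(yi))
        d_o += (len(xi) + len(yi)) * within
    d_o /= total_x + total_y
    # Expected disagreement: plain average over all cross-replication pairs.
    d_e = sum(dist(x, y)
              for xi in X for yj in Y
              for x in xi for y in yj) / (total_x * total_y)
    return 1.0 - d_o / d_e

# Toy data (hypothetical): item 0 has 2 + 1 annotations, item 1 has 1 + 2.
X = [[1, 1], [0]]
Y = [[1], [0, 1]]
print(cross_kappa_missing(X, Y))
```

When every item carries the same number of annotations within a replication, the item weights all equal 1/n and the function returns the same value as the complete-data definition.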

Normalization of κ x
xRR is modeled closely after IRR in order to serve as its baseline. As IRR measures the agreement between raters, so does xRR. In other words, κ x is really a measure of rater agreement, not a measure of experimental similarity per se. This distinction is important. If we want to measure how well we replicate an experiment, we need to measure its disagreement with the replication in relationship to their own internal disagreements. The departure between inter-experiment and intra-experiment disagreements is important in measuring experimental similarity.
This calls for a normalization that considers κ x in relation to IRR. First, we take inspiration from Spearman's correction for attenuation (Spearman, 1904):
\[ \rho_{xy} = \frac{r_{xy}}{\sqrt{\text{reliability}_x \cdot \text{reliability}_y}} \tag{12} \]
where r xy is the observed Pearson product-moment correlation between x and y (variables observed with measurement errors), ρ xy is an estimate of their true, unobserved correlation (in the absence of measurement errors), and reliability_x and reliability_y are the reliabilities of x and y respectively. Eq. 12 is Spearman's attempt to correct for the negative bias in r xy caused by the observation errors in x and y. Eq. 12 is relevant here because of the close connection between Cohen's (1960) kappa and the Pearson correlation, r xy . In the dichotomous case, if the two marginal distributions are the same, Cohen's (1960) kappa is equivalent to the Pearson correlation (Cohen, 1960, 1968). In the multicategory case, Cohen (1968) generalizes this equivalence to weighted kappa, under the conditions of equal marginals and a specific quadratic weighting scheme.
Based on this strong connection, we propose replacing r xy in Eq. 12 with κ x and define normalized κ x as:
\[ \text{normalized } \kappa_x = \frac{\kappa_x(X, Y)}{\sqrt{\mathrm{IRR}_X \cdot \mathrm{IRR}_Y}} \tag{13} \]
Defined this way, one would expect normalized κ x to behave like ρ xy . That is indeed the case. When we apply both measures to the IRep dataset, we obtain a Pearson correlation of 0.99 between them (see Section 4.5). This leads to two insights. First, we can interpret normalized κ x like a disattenuated correlation, ρ xy (see Muchinsky (1996) for a rigorous interpretation). Second, normalized κ x approximates the true correlation between two experiments' item-level mean scores. Despite their affinity, ρ xy is not a substitute for normalized κ x for measuring experimental similarity. Normalized κ x is more general as it can accommodate non-interval scales and missing data.
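The normalization is a one-line computation once κ x and the two IRRs are available. The sketch below uses a hypothetical function name and made-up input values purely for illustration; the guard against non-positive IRRs is an assumption of this sketch, since the square root is otherwise undefined or the ratio meaningless.

```python
import math

def normalized_cross_kappa(kappa_x, irr_x, irr_y):
    """Divide kappa_x by the geometric mean of the two IRRs."""
    if irr_x <= 0 or irr_y <= 0:
        # Assumption of this sketch: the normalization is only attempted
        # when both replications show above-chance internal agreement.
        raise ValueError("both IRRs must be positive")
    return kappa_x / math.sqrt(irr_x * irr_y)

# Hypothetical values: kappa_x = 0.3 with IRRs of 0.4 and 0.9.
print(round(normalized_cross_kappa(0.3, 0.4, 0.9), 3))  # 0.5
```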

Connection between xRR and IRR
By connecting normalized κ x to ρ xy , we can also learn a lot about κ x itself. To the extent that normalized κ x approximates ρ xy , we can rewrite Eq. 13 as:
\[ \kappa_x(X, Y) \approx \rho_{xy} \sqrt{\mathrm{IRR}_X \cdot \mathrm{IRR}_Y} \]
This formulation shows κ x behaves like a product of ρ xy and the geometric mean of the two IRRs.
This has important consequences, as we can deduce the following. 1) Holding constant the mean scores, and hence ρ xy , the lower the IRRs, the lower the κ x . Intra-experiment disagreement inflates inter-experiment disagreement. 2) In theory ρ xy ≤ 1.0, hence κ x is capped by the greater of the two IRRs, i.e., intra-experiment agreement presents a ceiling to inter-experiment agreement. 3) If x and y are identically distributed, e.g., in a perfect replication, ρ xy = 1 and κ x (X, Y) = IRR_X = IRR_Y. Thus, when a low reliability experiment is replicated perfectly, κ x will be as low, whereas normalized κ x will be 1. This explains why normalized κ x is more suitable for measuring experimental similarity.
In this section, we propose κ x as a measure of rater agreement between two replications, and normalized κ x as an experimental similarity metric. In the next section, we apply them in conjunction with IRR to illustrate how we can gain deeper insights into experiment reliabilities by triangulating these measures.
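The ceiling in consequence 2) follows in two short steps, written out here as a sketch under the assumptions that ρ xy ≤ 1 and that both IRRs are non-negative:

```latex
\[
\kappa_x(X, Y) \;\approx\; \rho_{xy}\sqrt{\mathrm{IRR}_X \cdot \mathrm{IRR}_Y}
\;\le\; \sqrt{\mathrm{IRR}_X \cdot \mathrm{IRR}_Y}
\;\le\; \max(\mathrm{IRR}_X, \mathrm{IRR}_Y)
\]
```

The first inequality uses ρ xy ≤ 1, and the second holds because a geometric mean never exceeds the larger of its two arguments. Setting ρ xy = 1 and IRR_X = IRR_Y turns both inequalities into equalities, which is the perfect-replication case in consequence 3).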

Applying xRR to the IRep Dataset
As a standalone measure, IRR captures the reliability of an experiment by encapsulating many of its facets: class imbalance, item difficulty, guideline clarity, rater qualification, task ambiguity, etc. As such, it is difficult to compare the IRR of different experiments, or to interpret their individual values, because IRR is tangled with all the aforementioned design parameters. For example, we cannot attribute a low IRR to rater qualification without first isolating other design parameters. This is the problem we try to solve with xRR by contextualizing IRR with meaningful baselines via a replication. We will demonstrate this by applying this technique to the IRep Dataset (Appendix A). We focus on a subset of 5 emotions for illustration purposes, with the rest of the reliability values provided in Appendix B. In our analysis, IRR is measured with Cohen's (1960) kappa and xRR with κ x . We will refer to them interchangeably.

IRR Variability Across Emotions
First we illustrate in Fig. 2 that different emotions within the same city can have very different IRR. For instance, the labels awe and love in Mexico City have an IRR of 0.1208 and 0.597 respectively (Table 1). Awe and love are completely different emotions with different levels of class imbalance and ambiguity, and without controlling for these differences, the gap in their reliabilities is not unexpected. That is exactly the problem with comparing IRRs: such comparisons are not meaningful. We need something directly comparable to awe in order to interpret its low IRR. If we do not compare emotions, and just consider awe using the Landis-Koch scale, that would not be helpful either. We would not be able to tell if its low IRR is a result of poor guidelines, general ambiguity in emotion detection, or ambiguity specific to awe. It's more meaningful to compare replications of awe itself.
Figure 2: The x-axis denotes buckets of IRR values. The y-axis denotes the number of emotion labels in each of those buckets. There is a lot of variation between emotion labels within each city.

IRR Variability Across Replications
While the aforementioned variation in IRR between emotions is expected, IRR of the same emotion can vary greatly between replications as well. Fig. 3 shows two contrasting examples. On the one hand, the IRR of love is consistent across replications. On the other hand, the IRR of contemplation varies a lot. We know the IRR variation in contemplation is strictly attributed to rater pool differences because the samples, platforms and annotation templates are the same across experiments. Such variation in IRR will be missed entirely by sampling-based approaches for error bars (e.g. standard error, bootstrap), which assume a fixed rater population.

Cross-replication Rater Agreement
As shown, replication can facilitate comparisons of IRR by producing meaningful baselines. However, IRR is an internal property of a dataset; it does not allow us to compare two datasets directly. To that end, we can apply κ x to quantify the rater agreement between two datasets, as IRR quantifies the rater agreement within a dataset. Interestingly, not only is κ x useful for comparing two datasets, but it also serves as another baseline for interpreting their IRRs.
IRR is a step toward ensuring reproducibility, so naturally we wonder how much of the observed IRR is tied to the specific experiment and how much of it generalizes. This is of particular concern when raters are sampled in a clustered manner, e.g., crowd workers from the same geographical region, grad students sharing the same office. We rarely make sure raters are diverse and representative of the larger population. High IRR can be the result of a homogeneous rater group, limiting the generality of the results. In the context of the IRep dataset, the fact that two cities have similar IRRs does not imply their raters agree with each other at a comparable level, or at all. We will demonstrate this with two contrasting examples.
Mexico City and Budapest both have a moderate IRR for sadness, 0.5147 and 0.5175 respectively, and their κ x is nearly the same at 0.4709 (Fig. 4). This gives us confidence that the IRR of sadness generalizes beyond the specific rater pools. In contrast, on contentment Mexico City and Kuala Lumpur have comparable levels of IRR, 0.4494 and 0.6363 respectively, but their κ x is an abysmal -0.0344 (Fig. 5). In other words, the rater agreement on contentment is limited to within-pool observations only. This serves as an important reminder that IRR is a property of a specific experimental setup and may or may not generalize beyond that. κ x allows us to ensure the internal agreement has external validity.

Replication Similarity
κ x is a step towards comparing two replications, but it is not a good standalone measure of replication similarity. To do that, we must also account for both replications' internal agreements, e.g., via normalized κ x in Eq. 13. Fig. 6 shows an example. Mexico City and Budapest have a low κ x of 0.0817 on awe. On the surface, this low agreement may seem attributable to differences between the rater pools. However, there is a similarly low IRR in either city: 0.1208 in Mexico City, and 0.117 in Budapest. After accounting for IRR, normalized κ x is much higher at 0.6872 (Table 2), indicating a decent replication similarity between the two cities.
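The arithmetic behind the awe example can be traced directly from the reported values, with intermediate results rounded for display:

```latex
\[
\text{normalized } \kappa_x
= \frac{0.0817}{\sqrt{0.1208 \times 0.117}}
= \frac{0.0817}{\sqrt{0.0141336}}
\approx \frac{0.0817}{0.1189}
\approx 0.687
\]
```

This agrees with the 0.6872 reported in Table 2 up to rounding of the inputs.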

Connection to ρ xy
We apply Spearman's correction for attenuation in Eq. 12 to all 31 emotion labels in 3 replication pairs. The resulting ρ xy is plotted against the corresponding normalized κ x in Fig. 7. Both measures are strongly correlated with a Pearson correlation of 0.99. This justifies interpreting normalized κ x as a disattenuated correlation like ρ xy .

Measuring the Quality of a Crowdsourced Dataset
The IRep dataset is replicated and is conducive to xRR analysis. However, in practice most datasets are not replicated. Is xRR still useful? We present a specific use case of xRR in this section and argue that it is worth replicating a crowdsourced dataset in order to evaluate its quality.

Data Target
Given a set of items, it is possible that annotations of the highest attainable quality still fail to meet the Landis-Koch requirements. Task subjectivity and class imbalance together impose a practical limit on kappa (Bruckner and Yoder, 2006). In these situations, the experimenter can forgo a data collection effort for reliability reasons. Alternatively, the experimenter may believe that data of sufficiently high quality can still have scientific merits, regardless of reliability. If so, what guidance can we use to ensure the highest quality data, especially when collecting data via crowdsourcing? This paper is heavily motivated by this question.
xRR allows us to interpret IRR not on an absolute scale, but against a replication, a reference of sorts. By judging a crowdsourced dataset against a reference, we can decide if it meets a certain quality bar, albeit a relative one. In the IRep dataset, all replications are of equal importance. However, in practice, we can often define a trusted source as our target. This trusted source can consist of linguists, medical experts, calibrated crowd workers, or the experimenters themselves. They should have enough expert knowledge and an adequate understanding of the task. The critical criterion in choosing a target is its ability to remove common quality concerns such as rater qualification and guideline effectiveness.

Internal Agreements
By replicating a random subset of a crowdsourced dataset with trusted annotators, one can compare the two IRRs and make sure they are at a similar level. If the crowd IRR is much higher, that may be an indication of collusion, or a set of overly simplistic guidelines that have deviated from the experiment fidelity (Sameki et al., 2015). If the crowd IRR is much lower, it may just be a reflection of annotator diversity, or it can mean under-defined guidelines, unequal annotator qualifications, etc. Further investigation is needed to ensure the discrepancy is reasonable and appropriate.

Mutual Agreement
Even if the two IRRs are similar, that is not to say that the two datasets are similar. Both groups of annotators can have high internal agreement amongst themselves, but the two groups can agree on different sets of items. If our goal is to collect crowdsourced data that closely mirror the target, then we have to measure their mutual agreement, in addition to comparing their internal agreements. Recall from Section 3.5 that if an experiment is replicated perfectly, κ x should be identical to the two IRRs. Or more concisely, normalized κ x should be equal to 1. Thus a high normalized κ x can assure us that the crowdsourced annotators are functioning as an extension of the trusted annotators, based on which we form our expectations.

Relation to Gold Data
At a glance, this approach seems similar to the common practice of measuring the accuracy of crowdsourced data against the ground truth (Resnik et al., 2006; Hripcsak and Wilcox, 2002). However, they are actually fundamentally different approaches. κ x is rooted in the reliability literature that does not rely on the existence of a correct answer. The authors argue this is an unrealistic assumption for many crowdsourcing tasks, where the input involves some subjective judgement. Accuracy itself is also a flawed metric for annotations data due to its inability to handle data uncertainty. For instance, when the reliability of the gold data is less than perfect, accuracy can never reach 1.0. Furthermore, accuracy is not chance-corrected, so it tends to inflate with class imbalance.
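The last point can be seen in a small constructed example. In the sketch below, two hypothetical raters each flag 5 of 100 items as positive but never flag the same item; the data and function names are assumptions made for illustration. Raw agreement looks high because both raters mostly say "no", while the chance-corrected coefficient shows no agreement beyond chance.

```python
def percent_agreement(a, b):
    """Share of items on which the two raters give the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohen_kappa(a, b):
    """Cohen's kappa for two raters and categorical labels."""
    n = len(a)
    p_o = percent_agreement(a, b)
    # Chance agreement from the two raters' own marginal distributions.
    p_e = sum((a.count(c) / n) * (b.count(c) / n) for c in set(a) | set(b))
    return (p_o - p_e) / (1 - p_e)

# Constructed data: 5% positives per rater, on disjoint items.
rater_a = [1] * 5 + [0] * 95
rater_b = [0] * 5 + [1] * 5 + [0] * 90

print(percent_agreement(rater_a, rater_b))      # 0.9
print(round(cohen_kappa(rater_a, rater_b), 3))  # -0.053
```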

Extending an Existing Dataset
The aforementioned technique can also measure the quality of a dataset extension. The main challenge in extending an existing dataset is to ensure the new data is consistent with the old. The state-of-the-art method in computer vision is frequency matching, i.e., ensuring the same proportion of yes/no votes in each image class. Recht et al. (2019) extended ImageNet (http://www.image-net.org/) using this technique, concluding there is an 11%-14% drop in accuracy across a broad range of models. While frequency matching controls the distribution of some statistics, the impact of the new platform is not controlled for. Engstrom et al. (2020) pointed out a bias in this sampling technique. Overall, it is difficult to assess how well we are extending a dataset. To that end, xRR can be of help. A high normalized κ x and a comparable IRR in the new data can give us confidence in the uniformity and continuity in the data collection.

Discussion
There has been a tectonic shift in the scope of and methodologies for annotations data collection due to the rise of crowdsourcing and machine learning. In many of these tasks, a high reliability is often difficult to attain, even under favorable circumstances. The rigid Landis-Koch scale has resulted in a decrease in the usage and reporting of IRR in most widely used datasets and benchmarks. Instead of abandoning IRR, we should adapt it to new ways of measuring data quality. The xRR framework presents a first-principled way of doing so. It is a more empirical approach that utilizes a replication as a reference point. It is based on two metrics: κ x for measuring cross-replication rater agreement, and normalized κ x for measuring replication similarity.
We open-source a large-scale replication dataset of facial expression judgements analyzed with the proposed framework. We show this framework can be used to guide our crowdsourcing data collection efforts. This is the beginning of a long line of inquiry. We outline future work and limitations below:
Confidence intervals for κ x. Confidence intervals for κ x and normalized κ x are required for hypothesis testing. Though one can use the block bootstrap for an empirical estimate, large sample behavior of these metrics needs to be studied.
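One way such an empirical estimate could look is sketched below. It is an assumption of this sketch, not a procedure specified in the paper, that each item (with all of its annotations from both replications) is treated as one block, that items are resampled with replacement, and that a simple percentile interval is reported. All names and the toy data are hypothetical.

```python
import random

def nominal(a, b):
    """Categorical disagreement: 1 if the labels differ, else 0."""
    return 0.0 if a == b else 1.0

def cross_kappa(X, Y, dist=nominal):
    """Cross-kappa for complete data; X[i], Y[i] hold item i's annotations."""
    n, R, S = len(X), len(X[0]), len(Y[0])
    d_o = sum(dist(x, y)
              for i in range(n) for x in X[i] for y in Y[i]) / (n * R * S)
    d_e = sum(dist(x, y)
              for xi in X for yj in Y
              for x in xi for y in yj) / (n * n * R * S)
    return 1.0 - d_o / d_e

def bootstrap_ci(X, Y, n_boot=500, alpha=0.05, seed=0):
    """Percentile interval for cross-kappa, resampling whole items."""
    rng = random.Random(seed)
    n = len(X)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]   # items with replacement
        try:
            stats.append(cross_kappa([X[i] for i in idx], [Y[i] for i in idx]))
        except ZeroDivisionError:
            continue   # degenerate resample: expected disagreement is zero
    stats.sort()
    lo = stats[int(alpha / 2 * len(stats))]
    hi = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lo, hi

# Toy data (hypothetical): 8 items, 2 annotations per item per replication.
X = [[1, 1], [0, 0], [1, 0], [0, 0], [1, 1], [0, 1], [1, 1], [0, 0]]
Y = [[1, 1], [0, 0], [1, 1], [0, 1], [1, 1], [0, 0], [1, 0], [0, 0]]
low, high = bootstrap_ci(X, Y)
print(f"kappa_x = {cross_kappa(X, Y):.3f}, 95% interval = ({low:.3f}, {high:.3f})")
```

Resampling items rather than individual annotations keeps the pairing between the two replications intact, which is what the coefficient depends on. It does not capture variation in the rater pool itself.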
Sensitivity of κ x with high class-imbalance. The xRR framework sidesteps the effect of class-imbalance by comparing replications on the same item set. Further analysis needs to confirm the sensitivity of κ x metrics in high class-imbalance.
Optimization of κ x computation. Our method requires constructing many pairs of observations: n^2 RS. This may get prohibitively expensive when the number of items is large. Using algebraic simplification and dynamic programming, this can be made much more efficient.
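As an illustration of what an algebraic simplification might look like (one possibility, not necessarily the one the authors have in mind), the expected disagreement averages D over every cross-replication pair, so it depends only on the pooled annotations of each replication. For categorical data it equals one minus the summed product of label proportions; for squared distance it expands into first and second moments. The function names below are hypothetical.

```python
from collections import Counter

def expected_brute(xs, ys, dist):
    """Average disagreement over all cross-replication pairs (quadratic)."""
    return sum(dist(x, y) for x in xs for y in ys) / (len(xs) * len(ys))

def expected_nominal(xs, ys):
    """Categorical case in linear time using label counts."""
    cx, cy = Counter(xs), Counter(ys)
    agreeing_pairs = sum(cx[c] * cy[c] for c in cx)
    return 1.0 - agreeing_pairs / (len(xs) * len(ys))

def expected_squared(xs, ys):
    """Squared-distance case: E[x^2] + E[y^2] - 2 E[x] E[y]."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    mean_xx = sum(x * x for x in xs) / len(xs)
    mean_yy = sum(y * y for y in ys) / len(ys)
    return mean_xx + mean_yy - 2 * mean_x * mean_y

# xs and ys are all annotations of each replication, pooled over items.
xs = [1, 1, 0, 0, 1, 0]
ys = [1, 0, 0, 0, 1, 1, 0, 1]
print(expected_brute(xs, ys, lambda a, b: float(a != b)), expected_nominal(xs, ys))
print(expected_brute(xs, ys, lambda a, b: (a - b) ** 2), expected_squared(xs, ys))
```

The observed disagreement only involves same-item pairs and is already linear in the number of items, so the expected term is where the savings arise.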
Alternative normalizations of κ x. We provided one particular normalization technique, but it may not suit all applications. For example, when comparing crowd annotations to expert annotations, one can consider κ x / IRR_expert.
Alternative xRR coefficients. Our proposed xRR coefficient, κ x , is based on Cohen's (1960) kappa for its assumption about rater non-interchangeability. It may be useful to consider Krippendorff's alpha and other agreement statistics as alternatives for other assumptions and statistical characteristics.
We hope this paper and dataset will spark research on these questions and increase reporting of reliability measures for human annotated data.

B IRR, xRR, and normalized xRR values for the IRep dataset
In Table 5 we report the IRR, κ x , and normalized κ x obtained from the entire IRep dataset.
Table 3: Schema of the CSV file: Each line in the IRep csv file corresponds to one item ID annotated by Rater 1 and Rater 2 with some of the emotion labels (Label 1 . . . Label 30) annotated on the corresponding item. There is one additional column for "unsure" indicating when it was not possible to determine which expression was expressed.
Table 4: Distribution of items and ratings across different rater pools, where every item is annotated by a maximum of 2 raters from each pool. Here we show what fraction of the unique items were annotated by one or two raters in each rating pool.
Table 5: The first column shows the 30 emotion labels + "unsure" in the IRep dataset. The next 3 columns are their IRR measured by Cohen's (1960) kappa in Mexico City (MC), Kuala Lumpur (KL), and Budapest (Bud). The next 3 columns are the κ x in the 3 pairs of cities, and the last 3 columns are the corresponding normalized κ x.