Competency Problems: On Finding and Removing Artifacts in Language Data

Much recent work in NLP has documented dataset artifacts, bias, and spurious correlations between input features and output labels. However, how to tell which features have “spurious” instead of legitimate correlations is typically left unspecified. In this work we argue that for complex language understanding tasks, all simple feature correlations are spurious, and we formalize this notion into a class of problems which we call competency problems. For example, the word “amazing” on its own should not give information about a sentiment label independent of the context in which it appears, which could include negation, metaphor, sarcasm, etc. We theoretically analyze the difficulty of creating data for competency problems when human bias is taken into account, showing that realistic datasets will increasingly deviate from competency problems as dataset size increases. This analysis gives us a simple statistical test for dataset artifacts, which we use to show more subtle biases than were described in prior work, including demonstrating that models are inappropriately affected by these less extreme biases. Our theoretical treatment of this problem also allows us to analyze proposed solutions, such as making local edits to dataset instances, and to give recommendations for future data collection and model design efforts that target competency problems.


Introduction
Attempts by the natural language processing community to get machines to understand language or read text are often stymied in part by issues in our datasets (Chen et al., 2016; Sugawara et al., 2018). Many recent papers have shown that popular datasets are prone to shortcuts, dataset artifacts, bias, and spurious correlations (Jia and Liang, 2017; Costa-jussà et al., 2019). While these empirical demonstrations of deficiencies in the data are useful, they often leave unanswered fundamental questions of what exactly makes a correlation "spurious", instead of a feature that is legitimately predictive of some target label.

Figure 1: A statistical test for deviation from a competency problem, where no individual feature (here words) should give information about the class label, plotting the number of occurrences of each word against the conditional probability of the label given the presence of the word. The label associated with each point is marked by color and superscript. All features above the blue line have detectable correlation with class labels, using a very conservative Bonferroni-corrected statistical test.
In this work we attempt to address this question theoretically. We begin with the assumption that in a language understanding problem, no single feature on its own should contain information about the class label. That is, all simple correlations between input features and output labels are spurious: p(y|x_i), for any feature x_i, should be uniform over the class label. We call the class of problems that meet this assumption competency problems (§2).¹ This assumption places a very strong restriction on the problems being studied, but we argue that it is a reasonable description of complex language understanding problems. Consider, for example, the problem of sentiment analysis on movie reviews. A single feature might be the presence of the word "amazing", which could be legitimately correlated with positive sentiment in some randomly-sampled collection of actual movie reviews. However, that correlation tells us more about word frequency in movie reviews than it tells us about a machine's ability to understand the complexities of natural language. A competent speaker of a natural language would know that "amazing" can appear in many contexts that do not have positive sentiment and would not base their prediction on the presence of this feature alone. That is, the information about the sentiment of a review, and indeed the meaning of natural language, is contained in complex feature interactions, not in isolated features. To evaluate a machine's understanding of language, we must remove all simple feature correlations that would allow the machine to predict the correct label without considering how those features interact.

¹Our use of the term "competency problems" is inspired by, but not identical to, the term "competence" in linguistics. We are referring to the notion that humans can understand essentially any well-formed utterance in their native language.
Collecting data that accurately reflects the assumptions of a competency problem is very challenging, especially when humans are involved in creating it. Humans suffer from many different kinds of bias and priming effects, which we collectively model in this work with rejection sampling during data collection. We theoretically analyze data collection under this biased sampling process, showing that any amount of bias will result in increasing probability of statistically-significant spurious feature correlations as dataset size increases (§3).
This theoretical treatment of bias in data collection gives us a new, simple measure of data artifacts (§3.2), which we use to explore artifacts in several existing datasets (§4). Figure 1 revisits prior analyses on the SNLI dataset (Bowman et al., 2015) with our statistical test. An analysis based on pointwise mutual information (e.g., Gururangan et al., 2018) would correspond to a horizontal line in that figure, missing many features that have less extreme but still significant correlations with class labels. These less extreme correlations still lead models to overweight simple features. The problem of bias in data collection is pervasive and not easily addressed with current learning techniques.
Our framework also allows us to examine the theoretical impact of proposed techniques to mitigate bias, including performing local edits after data collection (§5) and filtering collected data (§6). We derive properties of any local edit procedure that must hold for the procedure to effectively remove data artifacts. These proofs give dataset builders tools to monitor the data collection process to be sure that resultant datasets are as artifact-free as possible. Our analysis of local edits additionally suggests a strong relationship to sensitivity in boolean functions (O'Donnell, 2014), and we identify gaps in the theory of sensitivity that need to be filled to properly account for bias in sampled datasets.
We believe our theoretical analysis of these problems provides a good starting point for future analyses of methods to improve NLP data collection, as well as insights for inductive biases that could be introduced to better model competency problems.

Competency Problems
We define a competency problem to be one where the marginal distribution over labels given any single feature is uniform. For our analysis, we restrict ourselves to boolean functions: we assume an input vector x and an output value y, where x ∈ {0, 1}^d and y ∈ {0, 1}. In this setting, competency means p(y|x_i) = 0.5 for all i. In other words, the information mapping x to y is found in complex feature interactions, not in individual features.
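To make this definition concrete, consider a small synthetic example (ours, for illustration): the d-bit parity function satisfies the competency assumption exactly, since the label depends on every bit jointly while no single bit carries any information about it.

```python
import itertools

# d-bit parity: y = x_1 XOR ... XOR x_d. Every single feature is
# uninformative (p(y|x_i) = 0.5), yet y is a deterministic function of x.
d = 4

def parity(x):
    return sum(x) % 2

for i in range(d):
    # Compute p(y=1 | x_i=1) exactly by enumerating all inputs with x_i = 1.
    inputs = [x for x in itertools.product((0, 1), repeat=d) if x[i] == 1]
    p = sum(parity(x) for x in inputs) / len(inputs)
    print(f"p(y=1 | x_{i}=1) = {p}")  # prints 0.5 for every i
```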
Our core claim is that language understanding requires composing together many pieces of meaning, each of which on its own is largely uninformative about the meaning of the whole. We do not believe this claim is controversial or new, but its implications for posing language understanding as a machine learning problem are underappreciated and somewhat counterintuitive. If a model picks up on individual feature correlations in a dataset, it has learned something extra-linguistic, such as information about human biases, not about how words come together to form meaning, which is the heart of natural language understanding. To push machines towards linguistic competence, we must control for all sources of extra-linguistic information, ensuring that no simple features contain information about class labels.
For some language understanding problems, such as natural language inference, this intuition is already widely held. We find it surprising and problematic when the presence of the word "cat", "sleeping" or even "not" in either the premise or the hypothesis gives a strong signal about an entailment decision (Gururangan et al., 2018;Poliak et al., 2018). Competency problems are broader than this, however. Consider the case of sentiment analysis. It is true that a movie review containing the word "amazing" is more likely than not to express positive sentiment about the movie. This is because of distributional effects in how humans choose to use phrases in movie reviews. These distributional effects cause the lexical semantics of "amazing" to carry over into the whole context, essentially conflating lexical and contextual cues. If our goal is to build a system that can accurately classify the sentiment of movie reviews, exploiting this conflation is useful. But if our goal is instead to build a machine that understands how sentiment is expressed in language, this feature is a red herring that must be controlled for to truly test linguistic competence.

Biased Sampling
To get machines to perform well on competency problems, we need data that accurately reflects the competency assumption, both to evaluate systems and (presumably) to train them. However, humans suffer from blind spots, social bias, priming, and other psychological effects that make collecting data for competency problems challenging. Examples of these effects include instructions in a crowdsourcing task that prime workers to use particular language, or distributional effects in source material, such as the "amazing" examples above, or racial bias in face recognition (Buolamwini and Gebru, 2018) and abusive language detection datasets (Davidson et al., 2019; Sap et al., 2019).
In order to formally analyze the impact of human bias on collecting data for competency problems, we need a plausible model of this bias. We represent bias as rejection sampling from the target competency distribution based on single feature values. Specifically, we assume the following dataset collection procedure. First, a person samples an instance from an unbiased distribution p_u(x, y) where the competency assumption holds. The person examines this instance, and if feature x_i = 1 appears with label y = 0, the person rejects the instance and samples a new one, with probability r_i. If y = 0 corresponds to negative sentiment and x_i indicates the presence of the word "amazing", a high value for r_i would lead to "amazing" appearing more often with positive sentiment, as is observed in typical sentiment analysis datasets.
We do not claim that rejection sampling is a plausible psychological model of dataset construction. However, we do think it is a reasonable first-order approximation of the outcome of human bias on data creation, for a broad class of biases that have empirically been found in existing datasets, and it is relatively easy to analyze.

Emergence of Artifacts Under Rejection Sampling
Let p_u(y|x_i) be the conditional probability of y = 1 given x_i = 1 under the unbiased distribution, p_b(y|x_i) be the same probability under the biased distribution, and p̂(y|x_i) denote the empirical probability within a biased dataset of n samples. Additionally, let f_i be the marginal probability p_u(x_i). Recall that p_u(y|x_i) is 0.5 by assumption.
We will say that dimension i has an artifact if the empirical probability p̂(y|x_i) statistically differs from 0.5. In this section, we will show that an artifact emerges if there is a bias at dimension i in the sampling procedure, which is inevitable for some features in practice. We will formalize this bias in terms of a rejection sampling probability r_i.
For a single sample ⟨x, y⟩, we first derive the joint and marginal probabilities p_b(y, x_i) and p_b(x_i), from which we can obtain p_b(y|x_i). These formulas use a recurrence relation obtained from the rejection sampling procedure: a draw from p_u is kept unless x_i = 1 and y = 0, in which case it is rejected (and the process restarted) with probability r_i, so each round keeps a draw with probability 1 − r_i f_i / 2. This gives

$$p_b(y, x_i) = \frac{f_i/2}{1 - r_i f_i/2}, \qquad p_b(x_i) = \frac{f_i(2 - r_i)/2}{1 - r_i f_i/2}, \qquad p_b(y|x_i) = \frac{1}{2 - r_i}.$$

With no bias (r_i = 0), this probability is 0.5, as expected, and it rises to 1 as r_i increases to 1.
We define p̂(y|x_i) as the empirical expectation of p_b(y|x_i) over n samples containing x_i, with different samples indexed by superscript j:

$$\hat{p}(y|x_i) = \frac{1}{n}\sum_{j=1}^{n} y^j.$$

Note that p̂ is a conditional binomial random variable. By the central limit theorem, p̂ is approximately ∼ N(μ_p̂, σ²_p̂) for large n, where

$$\mu_{\hat{p}} = \frac{1}{2 - r_i}, \qquad \sigma^2_{\hat{p}} = \frac{1 - r_i}{n(2 - r_i)^2}.$$

This variance is inversely proportional to the number of samples n. Thus, p̂(y|x_i) can be well approximated by its expected value for a large number of samples. As the rejection probability r_i increases, the center of this distribution tends from 0.5 to 1. This formalizes the idea that bias in the sampling procedure will cause the empirical probability p̂(y|x_i) to deviate from 0.5, even if the "true" probability is 0.5 by assumption. Increasing the sample size n concentrates the distribution, with standard deviation shrinking proportional to 1/√n, but the expected value is unchanged. Thus, artifacts created by rejection sampling will not be combated by simply sampling more data from the same biased procedure: the empirical probability will still be biased by r_i even if n increases arbitrarily. These persistent artifacts can be exploited at i.i.d. test time to achieve high performance, but will necessarily fail if the learner is evaluated under the competency setting.
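A small Monte Carlo simulation (our sketch, not part of the paper's experiments) makes this concrete: under rejection sampling, p̂(y|x_i) converges to 1/(2 − r_i) rather than 0.5, and collecting more data only tightens the estimate around the biased value.

```python
import random

def sample_biased(f_i, r_i):
    """One (x_i, y) draw under the biased collection procedure."""
    while True:
        x_i = random.random() < f_i   # feature present with probability f_i
        y = random.random() < 0.5     # competency assumption: p_u(y|x_i) = 0.5
        if x_i and not y and random.random() < r_i:
            continue                  # biased annotator rejects and redraws
        return x_i, y

random.seed(0)
f_i, r_i = 0.1, 0.6
for n in (1_000, 100_000):
    draws = [sample_biased(f_i, r_i) for _ in range(n)]
    ys = [y for x_i, y in draws if x_i]
    print(n, round(sum(ys) / len(ys), 3),
          "predicted:", round(1 / (2 - r_i), 3))  # both near 0.714
```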

Hypothesis Test
Here we set up a hypothesis test to evaluate whether there is enough evidence to reject the hypothesis that r_i is 0, i.e., that the data is unbiased. In this case, we can use a one-sided binomial proportion hypothesis test, as our rejection sampling can only lead to binomial proportions for p_b(y|x_i) that are greater than 1/2. Our null hypothesis is that the binomial proportion p_b(y|x_i) = 0.5 = p_0, or equivalently, that r_i = 0. Our alternative hypothesis is that p_b(y|x_i) > 0.5. Let p̂ be the observed probability. We can compute a z-statistic using the standard formula:

$$z = \frac{\hat{p} - p_0}{\sqrt{p_0(1 - p_0)/n}}. \tag{1}$$

Thus, if our observed proportion p̂ is far from p_0 = 0.5, we will have enough evidence to reject the null hypothesis that r_i = 0. This depends on n as well, and to explore this interaction, we solve for p̂ for a given n and confidence level z*:

$$\hat{p} = \frac{z^*}{2\sqrt{n}} + \frac{1}{2}.$$
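A minimal implementation of this test (our sketch) follows directly from the formulas above; `artifact_z` computes Equation 1 and `rejection_threshold` inverts it for a given confidence level.

```python
import math

def artifact_z(count_y, n, p0=0.5):
    """z-statistic for H0: p_b(y|x_i) = p0, one-sided (Equation 1)."""
    p_hat = count_y / n
    return (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)

def rejection_threshold(n, z_star, p0=0.5):
    """Smallest p̂ at which H0 is rejected; z*/(2*sqrt(n)) + 1/2 when p0 = 0.5."""
    return z_star * math.sqrt(p0 * (1 - p0) / n) + p0

# A word seen 500 times, 300 of them with y = 1:
print(artifact_z(300, 500))             # ~4.47, well past z* ~ 2.33 at alpha = 0.01
print(rejection_threshold(500, 2.33))   # ~0.552
```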

Empirical Analysis
With a hypothesis test in hand, we can examine existing datasets for evidence of statistically significant feature bias, and then explore the extent to which this bias impacts models supervised with this data. Prior work has used pointwise mutual information (PMI) to find features that have high correlation with labels (e.g., Gururangan et al., 2018). This measure is useful for understanding why certain features might get used as deterministic decision rules by models (Ribeiro et al., 2018; Wallace et al., 2019). However, studies involving PMI have also intuitively understood that PMI by itself does not tell the whole story, as a strict ranking by PMI would return features that only appear once in the dataset. To account for this problem, they used arbitrary cutoffs and included information about feature occurrence in addition to their PMI ranking. A benefit of our approach to defining and detecting artifacts is that we have a single statistical test that takes into account both the number of times a feature appears and how correlated it is with a single label. We use this test to find features with the strongest statistical evidence for artifacts (§4.1) and then show empirically that models use these features inappropriately when making predictions (§4.2). This analysis goes beyond deterministic prediction rules, showing that the impact of sampling bias on model behavior is subtle and pervasive.

Data Analysis
We analyze two datasets with the hypothesis test from §3.2: SNLI (Bowman et al., 2015) and the Universal Dependencies English Web Treebank (Silveira et al., 2014).
SNLI Each feature x_i represents the presence of a word in a given example, counting each appearance in an instance as a separate occurrence for the purposes of computing n and p̂ in Equation 1. We compute a z-statistic for every token that appears in the SNLI data, where p_0 = 1/3, as SNLI has three labels. We then plot the z-statistic for each token against the number of times the token appears in the data. We also plot a curve for the value of the z-statistic at which the null hypothesis (that r_i = 0) should be rejected, using a significance level of α = 0.01 and a conservative Bonferroni correction (Bonferroni, 1936) for all 28,000 vocabulary items. This analysis is shown in Figure 1, in which we label several words that were also found to be artifacts by Gururangan et al. (2018) and Wallace et al. (2019), among others.
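A sketch of this per-token scan (the corpus format and helper names are our assumptions): count token/label co-occurrences, compare each token's best-class proportion to p_0 = 1/3, and apply a Bonferroni-corrected one-sided threshold over the vocabulary.

```python
from collections import Counter, defaultdict
from scipy.stats import norm

def scan_artifacts(examples, num_labels=3, alpha=0.01):
    """examples: iterable of (tokens, label) pairs, e.g. tokenized SNLI."""
    n = Counter()                 # total occurrences of each token
    k = defaultdict(Counter)      # occurrences of each token with each label
    for tokens, label in examples:
        for tok in tokens:        # each appearance counts as a separate occurrence
            n[tok] += 1
            k[tok][label] += 1
    p0 = 1 / num_labels
    z_star = norm.ppf(1 - alpha / len(n))   # Bonferroni over the vocabulary
    artifacts = {}
    for tok, total in n.items():
        label, count = k[tok].most_common(1)[0]
        z = (count / total - p0) / (p0 * (1 - p0) / total) ** 0.5
        if z > z_star:
            artifacts[tok] = (label, total, count / total)
    return artifacts              # tokens with statistically significant bias
```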
We find a very large number of deviations from the competency assumption, many more than would be suggested by a PMI-based analysis. PMI equals log(p̂(y|x_i)/p̂(y)); because p̂(y) does not vary across features, and the data is balanced over labels, a PMI analysis ranks features by p̂(y|x_i), looking only at the y-axis in Figure 1. But the threshold at which a deviation in p̂(y|x_i) becomes a statistical artifact depends on the number of times the feature is seen, so our statistical test gives a simpler and more complete picture of data artifacts. Strong statistical deviations with less extreme PMI values still impact model behavior (§4.2 and §6).
UD English Web Treebank Next we turn to dependency parsing. In particular, we focus on the classic problem of prepositional phrase (PP) attachment (Collins and Brooks, 1995), which involves determining whether a PP attaches to a verb (e.g., We ate spaghetti with forks) or a noun (e.g., We ate spaghetti with meatballs). We heuristically extract (verb, noun, prepositional phrase) constructions with ambiguous attachment from the UD English Web Treebank (EWT) training data. We treat (verb, preposition) tuples as features and attachment types (noun or verb) as labels, and we compute a z-statistic for each tuple. Figure 2 shows the z-statistic for each tuple that appears 10 or more times in the data. We labeled tuples that also appear in the locally edited samples from the UD English contrast set created by Gardner et al. (2020). Many of these tuples fall either above or close to the significance curve, suggesting that the low contrast consistency reported by Gardner et al. (2020) could potentially be explained by models' reliance on these artifacts. Some tuples in the plot with high p̂(y|x_i) are not artifacts, as the attachment decision is nearly deterministic; for instance, the top right blue dot corresponds to (have, of), and of can only attach to have in archaic or idiosyncratic constructions.

Figure 2: Artifact statistics of (verb, preposition) tuples in PP attachment in the UD English Web Treebank (EWT). Plotted are tuples that appear in the original EWT training data, and labeled are tuples that also appear in the UD English locally edited contrast set.

Model Analysis
The previous section reveals a large number of individual word dataset artifacts in the SNLI dataset. Here, we ask whether typical NLP models learn to bias their predictions based on these artifacts, for both the SNLI and RTE (Dagan et al., 2005) datasets, using the RTE data from SuperGLUE (Wang et al., 2019). That is, we will show that these single words noticeably influence a model's confidence in particular predictions, even when the PMI value is not extreme enough to create a universal trigger (Wallace et al., 2019). Importantly, this analysis focuses on words with high z-statistics, which are often words that show up very frequently with slight deviations from p_u(y|x_i). This includes words such as "for" and "to" (the two words with highest z-statistic for the neutral class), and "there" and "near" (the highest and fifth-highest z-statistic for the entailment class).
To measure the model bias learned from these words, we employ RoBERTa-base (Liu et al., 2019) fine-tuned on RTE, and ALBERT-base (Lan et al., 2020) fine-tuned on SNLI; both models are from Morris et al. (2020) and are implemented in the Transformers library (Wolf et al., 2020). Given a single type such as x_i = "nobody" and a target class such as y = "contradiction", we estimate the model p̂(y|x_i) as follows. We first create two synthetic input examples, one with the premise containing only the single token with an empty hypothesis, and one with an empty premise and hypothesis containing the single token. As each input contains only a single token without additional context, this tests whether the model will bias its output based on the token. We run a forward pass with each input and average the target class probabilities as an estimate of p̂(y|x_i). All of the words in each dataset appearing at least 20 times are partitioned among the classes based on their largest class conditional z*, and for each class we form two cohorts of 50 words each with the highest and lowest z*. Let X_y^< denote the set of x_i with the lowest z* for class y, and similarly let X_y^> denote the set with the largest z*. Finally, we compute the average difference in estimated probability between the two cohorts:

$$\Delta \hat{p}_y = \frac{1}{50}\sum_{x_i \in X_y^>} \hat{p}(y|x_i) - \frac{1}{50}\sum_{x_i \in X_y^<} \hat{p}(y|x_i).$$

The results are shown in Table 1. As can be seen, these models exhibit non-trivial bias based on the single token inputs, with Δp̂_y exceeding 10% for some classes. The bias is much more extreme for SNLI than for RTE, likely due to the fact that RTE has two orders of magnitude less data than SNLI.
A caveat about this experiment is in order: because automatically replacing high-z* words with low-z* words would likely make most inputs nonsensical, we chose to use very unnatural single-word inputs to the model instead. We believe this is a reasonable estimate of the model's marginal prior on these tokens, measured in a way that introduces the fewest possible confounding variables into the experiment, but it is possible that it does not completely reflect how a model treats these tokens in context. Section 6 discusses some additional empirical evidence for models' reliance on these artifacts.
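The probing protocol is simple to reproduce; the sketch below shows the two forward passes for one token, with the checkpoint identifier and label index as assumptions for illustration rather than the exact models used in our experiments.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "textattack/albert-base-v2-snli"   # assumed checkpoint id, for illustration
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL).eval()

def single_token_bias(token, label_idx):
    """Average the target-class probability over (token, "") and ("", token)."""
    probs = []
    for premise, hypothesis in ((token, ""), ("", token)):
        inputs = tokenizer(premise, hypothesis, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits
        probs.append(torch.softmax(logits, dim=-1)[0, label_idx].item())
    return sum(probs) / 2          # estimate of the model's p̂(y|x_i)

print(single_token_bias("nobody", label_idx=2))  # assumed contradiction index
```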

Mitigating Artifacts with Local Edits
Many works have tried to remove data artifacts by making minimal changes to existing data (Shekhar et al., 2017; Sennrich, 2017; Zhao et al., 2018, inter alia). In this section we show that this kind of data augmentation can be effective with an appropriately sensitive edit model, where sensitivity refers to how often a change to inputs results in the label changing. However, because humans are involved in making these changes, achieving appropriate sensitivity is challenging, and bias in this process can lead to the introduction of new artifacts. This suggests that care must be taken when performing edit-based data augmentation, as large edited training datasets are not likely to be artifact-free (cf. Huang et al., 2020).
Imagine a new dataset D_e consisting of samples ⟨x′, y′⟩ generated by making local edits according to the following repeated procedure:

1. Randomly sample an instance x from a dataset D_b of n instances created under p_b.
2. Make some changes to x to arrive at x′.
3. Manually label y′ and add ⟨x′, y′⟩ to D_e.

We examine the expected probability p_e(y′|x_i) under this edit process. To derive this probability, we will need to know the probability that a change to x changes y. We define the edit sensitivity s to be this probability, i.e., s = p_e(y′ = ¬y). The other quantity of interest for an edit model is e_i, the probability that dimension i gets flipped when going from x to x′. We now show theoretically that s and e_i control whether samples generated by local editing debias first-order artifacts.
Proposition 1 (Proof in §B). If r_i > 0, then

$$p_e(y'|x_i) = \frac{1}{2} \iff \frac{1 + e_i}{s} = 2.$$

An analogous statement holds when = is replaced by ≤ on both sides.
Thus, local editing will dilute or cancel out artifacts from rejection sampling if the edit sensitivity is high enough, but if it is too high, it can introduce artifacts in the opposite direction. In practice, if it is possible to pick a dimension to flip more or less uniformly, then e_i ≈ 0 in a high-dimensional space, so engineering s ≈ 0.5 should produce artifact-free additional samples with p_e(y′|x_i) ≈ 0.5.
Furthermore, edit sensitivity and edit dimension are empirically measurable when constructing a dataset. This empirical measurement can give theoretical guarantees for the degree to which the local editing will alleviate artifacts (i.e., this gives us a principled way to decide between edit models).
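As a sketch of this kind of monitoring (the edit-log format is our assumption), s and each e_i can be estimated directly from logged before/after pairs, and the quantity (1 + e_i)/s can then be checked against 2 per feature, per Proposition 1.

```python
from collections import Counter

def edit_statistics(edit_log):
    """edit_log: list of (x, y, x_edit, y_edit), with x a set of active features."""
    total = len(edit_log)
    label_flips = sum(y != y_edit for _, y, _, y_edit in edit_log)
    dim_flips = Counter()
    for x, _, x_edit, _ in edit_log:
        dim_flips.update(x.symmetric_difference(x_edit))  # features flipped by the edit
    s = label_flips / total                               # edit sensitivity s
    e = {i: c / total for i, c in dim_flips.items()}      # per-dimension flip rate e_i
    return s, e

# Per Proposition 1, edited samples are debiased for feature i when
# (1 + e.get(i, 0)) / s is close to 2.
```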

Local Edits in Practice
We empirically investigate the effectiveness of local edits for reducing single feature artifacts using locally edited samples generated from two datasets: (1) the Boolean Questions dataset (BoolQ; Clark et al., 2019a), which consists of pairs of paragraphs (p) and questions (q), where each q has a binary answer that can be found by reasoning over p; and (2) IMDb (Maas et al., 2011), a sentiment classification dataset in the domain of movie reviews. We define each feature x_i as the occurrence of a particular word within q for BoolQ, and within the text of the review for IMDb. Gardner et al. (2020) generated additional data for BoolQ and IMDb by making local edits to the question or review text and recording the updated binary label. Figure 3 visualizes the effect of these changes on single-feature artifacts by comparing the artifact statistics for the original texts to the statistics for the edited texts. For BoolQ, many tokens in the original data exhibit artifacts in the positive (> 0.5) direction, while, within the edited data, almost all tokens fall within the confidence region. In contrast, there is no apparent distributional difference between artifact statistics for the original vs. edited texts on IMDb. We find that for BoolQ the values (1 + e_i)/s tightly concentrate around a mean of 1.94, which, by Proposition 1, explains why most of the p̂(y|x_i) values for the edited samples are not significantly different from 0.5. For IMDb, (1 + e_i)/s was close to 1. This case study illustrates the importance of leveraging our theory to engineer better edit models.

Figure 3: The artifact statistics of the original BoolQ (above) and IMDb (below) samples are plotted in red, compared to the artifact statistics over the edited instances, plotted in green.

Local Edits and Boolean Sensitivity
In the above discussion we used the term sensitivity in an informal way to describe the probability that a local edit changes the label. This term also has a related formal definition in the study of boolean functions, where it is an implicit complexity measure (Wegener, 1987). Sensitivity in this sense has been shown to correlate with generalization in neural networks (Franco, 2001), and has been extended for use with practical NLP datasets (Hahn et al., 2021). In this section we discuss the intersection of our theory with sensitivity analysis, highlighting limitations in sensitivity analysis for sampled datasets that could be addressed in future work.

Formally, for a boolean function f : {0, 1}^d → {0, 1} and an input x, let S(f, x) be the set of Hamming neighbors of x with different labels: S(f, x) = {x′ : d_H(x, x′) = 1 and f(x′) ≠ f(x)}. The local sensitivity s(f, x) is the size of this set: s(f, x) = |S(f, x)|. Finally, the global sensitivity is defined as s(f) = max_x s(f, x).
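These definitions are easy to compute exactly for small d; the sketch below does so by enumeration, using parity as an example of a maximally sensitive function.

```python
import itertools

def local_sensitivity(f, x):
    """s(f, x): number of Hamming neighbors of x on which f disagrees with f(x)."""
    return sum(f(x[:i] + (1 - x[i],) + x[i + 1:]) != f(x) for i in range(len(x)))

def global_sensitivity(f, d):
    """s(f) = max_x s(f, x), by enumerating all of {0, 1}^d."""
    return max(local_sensitivity(f, x) for x in itertools.product((0, 1), repeat=d))

parity = lambda x: sum(x) % 2
print(global_sensitivity(parity, 4))   # 4: flipping any bit flips parity
```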
Importance of sensitivity In our case, the effect of local editing on a dataset can be understood in terms of sensitivity. Imagine a boolean function f : {0, 1}^d → {0, 1} from which we draw n samples ⟨x, y⟩. If these samples are drawn uniformly over {0, 1}^d, then the probability of observing any Hamming neighbors goes to 0 rapidly with d. Thus, it is possible to pick a low sensitivity function that can perfectly fit the data. In this sense, the true sensitivity of f is likely underspecified by the dataset.
Imagine we give this data to a learner with an inductive bias resembling some variant of Occam's razor. If the learner's notion of complexity is correlated with sensitivity (as many complexity measures are), then the learner will favor low sensitivity decision boundaries. Thus, the fact that sensitivity is underspecified in the training data is a problem if the gold-standard function has high sensitivity, as the inductive bias of the learning algorithm may favor low-sensitivity alternatives.
Contrast this with a dataset where some local neighborhoods in the input space have been filled in with local edits. The set of observed neighbors around a point x provides a lower bound on s(f, x), which in turn is a lower bound on s(f). In this sense, s(f) is no longer underspecified by the dataset.
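A sketch of that lower bound (assuming binary feature vectors, which real edited text only approximates): scan the dataset for observed Hamming neighbors with differing labels and take the maximum count at any point.

```python
def sensitivity_lower_bound(dataset):
    """dataset: list of (x, y) with x a tuple of bits; returns a lower bound on s(f)."""
    labels = dict(dataset)
    best = 0
    for x, y in labels.items():
        disagreeing = sum(
            labels.get(x[:i] + (1 - x[i],) + x[i + 1:], y) != y
            for i in range(len(x))
        )   # observed neighbors of x with a different label: a lower bound on s(f, x)
        best = max(best, disagreeing)
    return best
```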
In this discussion we have used underspecified in an informal way; there is no precise measure of the sensitivity of a sampled dataset (as opposed to a fully-specified function), particularly when generalizing from finite boolean functions to natural language inputs. Attempts to generalize sensitivity to natural language have done so by leveraging large language models to generate neighbors from which sensitivity can be estimated (Hahn et al., 2021). Resampling data in this way can give reasonable estimates of the sensitivity of the underlying task, but it is fundamentally incompatible with measuring dataset artifacts of the kind we discuss in this paper, as the generative model can fill in parts of the data distribution that are missing due to sampling bias, giving a higher estimate of sensitivity than is warranted by the sampled dataset.

Other Mitigation Techniques
In this section we briefly discuss the implications of our theoretical analysis for other artifact mitigation techniques that have been proposed in the literature. Our analysis in this section is not rigorous and is meant only to give high-level intuition or potential starting points for future work.
More annotators One suggested mitigation technique for dataset artifacts is to increase the number of annotators (Geva et al., 2019). Especially when people generate the text that is used in a dataset, there can be substantial person-specific correlations between features and labels. Having more annotators washes out those correlations in aggregate, making the data less biased overall.
We briefly analyze this procedure using our rejection sampling framework. For simplicity, we have so far only considered a single possible rejection probability, where an instance is rejected with probability r_i if x_i = 1 and y = 0. If we introduce additional rejection probabilities for the other three possible combinations of values for x_i and y, there will be the possibility that some rejections balance out other rejections. We can model multiple annotators by splitting a dataset into k different slices that each have their own bias vector r. If the r vectors are uncorrelated, it seems likely that as k increases, the probability that p̂(y|x_i) deviates from p_u(y|x_i) tends towards zero. Even in our simplistic model, if we assume a sparse r, averaging more and more of them will make the deviation tend toward zero, if the non-zero dimensions are uncorrelated.
However, if the r vectors are correlated, increasing the number of annotators will not produce data reflecting the competency assumption. When might the r vectors be correlated? This could happen due to societal biases, word usage frequencies, or priming effects from data collection instructions given to all annotators. Surely across any pool of annotators there will be some dimensions along which r values are correlated, and other dimensions along which they are not. Increasing the number of annotators thus helps mitigate the problem, but does not solve it completely.
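A toy simulation (ours) of this argument: pool slices from k annotators, each with their own rejection probability for feature i. When only a few annotators are biased (the uncorrelated case), the pooled p̂(y|x_i) sits near 0.5; when all share the bias (the correlated case), pooling does not help.

```python
import random

def annotator_slice(r_i, m):
    """m draws of y for instances with x_i = 1, under rejection probability r_i."""
    ys = []
    while len(ys) < m:
        y = random.random() < 0.5
        if not y and random.random() < r_i:
            continue          # this annotator rejects (x_i=1, y=0) and redraws
        ys.append(y)
    return ys

random.seed(0)
k, m = 50, 200
for correlated in (False, True):
    pooled = []
    for _ in range(k):
        # correlated: every annotator is biased on i; uncorrelated: only ~5% are
        r_i = 0.8 if correlated or random.random() < 0.05 else 0.0
        pooled += annotator_slice(r_i, m)
    print("correlated" if correlated else "uncorrelated",
          round(sum(pooled) / len(pooled), 3))   # ~0.52 vs ~0.83
```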
Data filtering A recent trend is to remove data from a training set that is biased in some way in order to get a model that generalizes better (Le Bras et al., 2020; Oren et al., 2020). While this method can be effective for very biased datasets, it is somewhat unsatisfying to remove entire instances because of bias in a single feature. In the extreme case where r_i ≈ 1, such as with "nobody" in SNLI (Fig. 1), this process could effectively remove x_i from the observed feature space.
To understand the effect of these automated methods on dataset artifacts, we repeat the analysis from §4.1 on data that was classified as "ambiguous" according to Dataset Cartography (Swayamdipta et al., 2020). This data was shown to provide better generalization when used as training data compared to the original training set. The ambiguous instances did not have a balanced label distribution, so we downsampled the data to balance it, then downsampled the whole training data to get the same number of instances as the balanced ambiguous set.
The resulting artifact plots are shown in Figure 4. As can be seen, the "ambiguous" instances have many fewer deviations from the competency assumption, across the entire range of our hypothesis test. It is not just high PMI values that are getting corrected by finding ambiguous instances; all statistical deviations are impacted. This effect is striking, and it further corroborates our arguments about the importance of the competency assumption.

Figure 4: Artifact statistics for SNLI instances classified as ambiguous by Dataset Cartography (above) versus a random (same-size) sample from the SNLI training set (below). The filtering done by ambiguous instance detection targets statistical artifacts across the whole range of the statistical test, not just high PMI values.
Other Related Work

Theoretical analysis of bias Several recent works explore sources and theoretical treatments of bias or spurious correlations in NLP (Shah et al., 2020a; Kaushik et al., 2020) or in ML more broadly (Shah et al., 2020b). Our work differs by introducing a competency assumption and exploring its implications. The difference between our biased and unbiased distributions is an instance of covariate shift (Quionero-Candela et al., 2009).
Competent models An interesting question is whether we can inject a "competency inductive bias" into models, i.e., discourage relying on individual features. The closest works we are aware of are methods that ensemble weak models together with strong models during training (Clark et al., 2020; Dagaev et al., 2021), or ensembles of models with unaligned gradients (Teney et al., 2021).
Other works use ensembles with models targeted at known sources of data artifacts, but these are less close to a competency assumption (Clark et al., 2019b; Karimi Mahabadi et al., 2020).

Conclusion
The more NLP models advance, the better they are at learning statistical patterns in datasets. This is problematic for language understanding research if some statistical patterns allow a model to bypass linguistic competence. We have formalized this intuition with a class of problems called competency problems, arguing that, for any language understanding task, all correlations between simple features and labels are spurious. Collecting data meeting this assumption is challenging, but we have provided theoretical analysis that can inform future data collection efforts for such tasks.
We conclude with some final thoughts on general best practices for data collection, informed by the analysis in this paper. If annotators are generating text for some data collection task, find ways to decrease priming effects. This could involve using images as prompts instead of text (Novikova et al., 2017; Weller et al., 2020), or randomly sampling words to include in the generated text. If existing text is being collected and annotated, make local edits to the text while monitoring the sensitivity of those edits according to the guidelines in §5, perhaps using different processes between train and test, to minimize correlations between train features and test labels.