PECO: Examining Single Sentence Label Leakage in Natural Language Inference Datasets through Progressive Evaluation of Cluster Outliers

Building natural language inference (NLI) benchmarks that are both challenging for modern techniques and free from shortcut biases is difficult. Chief among these biases is “single sentence label leakage,” where annotator-introduced spurious correlations yield datasets where the logical relation between (premise, hypothesis) pairs can be accurately predicted from only a single sentence, something that should in principle be impossible. We demonstrate that despite efforts to reduce this leakage, it persists in modern datasets that have been introduced since its 2018 discovery. To enable future amelioration efforts, we introduce a novel model-driven technique, the progressive evaluation of cluster outliers (PECO), which enables both the objective measurement of leakage and the automated detection of subpopulations in the data that maximally exhibit it.


Introduction
Natural language inference (NLI) is a fundamentally pairwise task, wherein a logical relation between two statements is predicted. Progress on NLI benchmarks is an important proxy for advancements in natural language reasoning by machines. Models are trained to process the two statements simultaneously, in a paired sentence condition (PSC). Unfortunately, many modern NLI datasets exhibit single sentence label leakage. When this leakage is present, models are able to accurately predict the pairwise relation encoded by the labels in a single sentence condition (SSC), where the model is only shown one of the statements (Poliak et al., 2018). This is a serious problem, rendering NLI datasets' capture of reasoning questionable and limiting the robustness of models trained on them.
NLI is formalized as predicting a relation r ∈ {neutral, entail, contradict} from a pair of sentences (premise s_1 and hypothesis s_2). An ideal NLI benchmark without single sentence label leakage will have a distribution of r that is conditionally dependent on the pair of sentences, but independent from either individual sentence (Wang et al., 2021c). In practice this is difficult to achieve, particularly when constructing usefully large datasets. Most large-scale NLI datasets are produced by sourcing seed sentences from an existing text population to serve as initial premises or hypotheses. Each seed sentence is then assigned one or more relations r, which annotators use to write new elicited sentences satisfying each selected relation relative to the seed. Many datasets exclusively build either the hypothesis or premise population from seed sentences, leaving the other exclusively elicited.

Figure 1: A T-SNE projection of the SNLI and CAugNLI test sets in PECO's model-driven single sentence condition (SSC) embedding space, showing entailment-, contradiction-, and neutral-labeled samples. This model was trained in the paired-sentence condition (PSC), where the relation between the sentences in a sample is observable. In a leakage-free dataset, a model should be unable to separate subpopulations that disproportionately exhibit one label class in the SSC. Local regions exhibiting an imbalanced label distribution are considered "biased regions" and plotted with large markers; balanced regions are plotted with small markers. SNLI exhibits large, continuous regions of the hypothesis-only space disproportionately exhibiting the same label, compared to CAugNLI. Accordingly, SNLI exhibits higher single sentence label leakage than CAugNLI (Table 3).

Table 1: Information on the NLI datasets we compare in this study. MNLI has "matched" (m) and "unmatched" (u) test sets that we evaluate separately. The ANLI dataset is decomposed into partitions "A1," "A2," and "A3." A vertical line denotes that a cell's value is identical to the one above it (sub-elements of the same dataset).
Systematic, shared biases in the words, sentence structures, or ideas that crowdworkers consider when given a logical relation, coupled with the exclusively elicited nature of one of the sentence populations, then drive relation leakage (Gururangan et al., 2018). For example, a slight preference for words like "not" or "doesn't" when given contradict as opposed to entail would lead to a bias in the n-gram distribution between the classes in the SSC. Simple heuristics inspired by these findings can produce challenging test sets that hobble models trained on these biased datasets, but they require manual guesswork, don't generalize, and may miss higher-level, more nuanced semantic shortcuts and biases.
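As an illustration, a word-level bias of this kind can be surfaced by comparing a candidate token's relative frequency across label classes. This is a minimal hypothetical sketch (the `word_class_bias` helper and the naive whitespace tokenization are our own simplifications, not the heuristics used in prior work):

```python
from collections import Counter

def word_class_bias(sentences, labels, word):
    """Relative frequency of `word` within each label class; a large
    disparity flags the token as a candidate leakage feature."""
    hits, totals = Counter(), Counter()
    for sent, label in zip(sentences, labels):
        tokens = sent.lower().split()  # naive whitespace tokenization
        hits[label] += tokens.count(word)
        totals[label] += len(tokens)
    return {label: hits[label] / totals[label] for label in totals}
```

Running this over elicited hypotheses for tokens like "not" would expose the contradict-class preference described above.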
These "leakage features" encoded in the elicited sentences are visible to NLI models (Zhang et al., 2019), enabling them to "cheat" by attending to them as shortcuts rather than logical correspondences between the two sentences, calling into question the appropriateness of NLI datasets as benchmarks for language understanding (Bowman and Dahl, 2021). In this work we rigorously analyze this problem of single sentence relation leakage in both popular and recent NLI datasets using novel techniques to enable targeted interventions and create higher quality future resources.
Further NLI datasets have been proposed to tackle these problems using machine-in-the-loop adversarial sentence elicitation (Nie et al., 2020), counterfactual augmentation (Kaushik et al., 2020), and learning dynamic-based debiasing (Wu et al., 2022). These datasets are purported to provide more challenging generalization scenarios for NLI models to better test logical reasoning capabilities. One big question remains: have these techniques actually eliminated relation leakage biases?
In this work, we demonstrate the following:

New NLI datasets still exhibit single sentence relation leakage. We compare SSC performance for 10 NLI datasets (including those previously assessed by Poliak et al. (2018)) using a simple transformer baseline, finding that single sentence relation leakage remains a severe problem.

NLI models still use the leakage features to cheat. We analyze the datasets using output decision agreement and input token importance statistics between models trained in the SSC and PSC to demonstrate this.

Automated leakage feature detection is feasible. We introduce a novel model-based metric and dataset analysis tool, the Progressive Evaluation of Cluster Outliers (PECO) (Figure 1), for examining the degree of single sentence relation leakage and eliminating it in future datasets.

Quantifying NLI Dataset Bias
An ideal NLI benchmark is neither "saturated" nor biased. Saturated benchmarks are datasets for which current approaches already achieve high accuracy. They are "solved" and have limited utility in tracking future progress (Bowman and Dahl, 2021). We refer to any NLI benchmark that exhibits significant single-sentence relation leakage, through high achievable accuracy in at least one single sentence condition, as biased. We analyze 10 datasets containing a total of 14 test or validation sets in terms of biasedness and saturatedness, across 17 SSC conditions (premise-only (s_1) or hypothesis-only (s_2)). Table 1 provides an overview of this information along with statistics such as train/dev/test set size and which sentence population is potentially unbalanced.
Each dataset D is composed of (s_1, s_2, r) tuples. We use the standard notation (X_i, Y_i) ← D to describe the samples; depending on whether the training condition is the standard PSC or an SSC, each X_i can be (s_1i, s_2i), s_1i, or s_2i. Models trained in condition c are referred to as f_c.

Saturation and Bias Scoring
We assess accuracy on the test set (or validation set when no labeled test set is available) for each dataset in paired- and single-sentence conditions.

Saturation (Accuracy): We report state-of-the-art (SOTA) model performance results in the PSC. For our model-level and sample-level comparative analysis between the PSC and SSC, we train our own transformer-based models using a simple procedure (Sec. 2.4) to assess replication accuracy.

SSC Accuracy: For each elicited population SSC we train a model f_SSC according to the procedure described in Sec. 2.4.2, to assess its accuracy A_SSC.

Table 2 shows current SOTA models and results for the 10 datasets, as well as our PSC model performance and the relevant training hyperparameters (more detail in Sec. 2.4). Figure 2 shows SSC accuracy against replication (PSC) accuracy for each dataset. Datasets that exhibit higher SSC accuracy have worse single sentence relation leakage, and are thereby more questionable in their ability to capture reasoning abilities. Ideally, an optimal benchmark for NLI will have both low maximum SSC accuracy and low maximum PSC accuracy (room for future model growth).
While these absolute measures of dataset bias and saturation are useful targets for future optimal benchmarks, it is important to understand how the measures interact with each other.

Relative Dataset Bias Scoring
We assess two relative dataset bias scores.

SSC Improvement over Chance: Following Poliak et al. (2018), we subtract the accuracy achieved by a "guess majority label" strategy from the SSC test accuracy:

∆_maj = A_SSC − A_maj

This metric gives an insight into single sentence relation leakage that compensates for datasets (such as SICK) with an uneven base label distribution.

SSC-PSC Accuracy Recovered: The fraction of PSC accuracy achieved by the SSC model:

%R_R = 100 × A_SSC / A_PSC

This metric captures how similarly the single sentence and normal condition models perform.

Table 3 shows the extent of the single-sentence relation leakage problem across the 17 SSC tests on the 14 splits for the 10 NLI datasets. These results clearly show that each dataset exhibits significant single-sentence relation leakage for at least one condition. The comparison columns, replication test recovery (%R_R) and SSC improvement over the chance majority guessing strategy (∆_maj), are computed using the single sentence condition accuracy and the standard NLI two-sentence condition SOTA and replication accuracy values in Table 2.
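The two relative scores can be sketched directly, assuming accuracies are given as fractions and the majority baseline is derived from raw label counts (function names are our own):

```python
def ssc_improvement_over_chance(ssc_acc, label_counts):
    """Delta_maj: SSC accuracy minus the "guess majority label" baseline."""
    majority_acc = max(label_counts.values()) / sum(label_counts.values())
    return ssc_acc - majority_acc

def accuracy_recovered(ssc_acc, psc_acc):
    """%R_R: percentage of paired-condition accuracy recovered by the
    single-sentence model."""
    return 100.0 * ssc_acc / psc_acc
```

For a balanced three-class dataset the majority baseline is 1/3, so any SSC accuracy meaningfully above that indicates leakage.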

Model Training
Our training technique is simple and applied consistently to all datasets. We fine-tune three different language-specific pretrained transformer checkpoints from HuggingFace (Wolf et al., 2019) using PyTorch Lightning. All models were trained on NVIDIA A100 GPUs. All models are the HuggingFace xForSequenceClassification with num_classes=3 and no other modifications. All models are trained using the Adam optimizer with cross entropy loss.
We find that this procedure produces broadly near-SOTA performance models, with a maximum relative accuracy difference of 8%, and a 92% Pearson's correlation coefficient (PCC) between SOTA and replication accuracy across the datasets (Figure 3). Our replications are a reasonable proxy to SOTA for comparative dataset analysis.
Single Dataset Fine-tuning We obtain separate fine-tuning checkpoints from the pretrained models for each dataset to enable clean analysis of one dataset at a time. We do not accumulate fine-tuning passes across multiple datasets.

Single Sentence Condition Training
To train each dataset's corresponding SSC model(s), we use the same setup as the PSC model but follow Poliak et al. (2018)'s formulation of finetuning the chosen classification model on only the SSC sentence, premise only or hypothesis only.
FEVER exhibits bias in the premise distribution, and CAugNLI, SNLI-debiased, and MNLI-debiased exhibit imbalance in both (Table 1). For the datasets that have imbalanced distributions in both conditions, we separately train bias models for both hypothesis-only and premise-only.

Analyzing NLI Dataset Bias
In this section we introduce quantification techniques for more accurately characterizing the extent of these bias problems in the aforementioned NLI datasets, analyze how they interact with the observable bias itself, and develop tools for producing future NLI benchmarks that more closely resemble the ideal benchmark.

Sample-level Model Behavior
We are particularly interested in understanding the degree to which models trained in the PSC and SSCs "reason" similarly. For this section we use the notation f(X_test), Y_test to denote the (1 × N) vectors of model output decisions and labels for a test set of N samples, and a simple agreement function Ag(Y_1, ..., Y_n), defined as the ratio of elements that are identical across all Y_i to the vector size N.

SSC-PSC Agreement (NBA): The number of samples for which the SSC and PSC models agree, over the total number of samples in the set:

NBA = Ag(f_SSC(X_test), f_PSC(X_test))

SSC-PSC Recovery (NBR): The number of samples for which the SSC and PSC models agree and both classify correctly, over the total number of samples on which they agree:

NBR = Ag(f_SSC(X_test), f_PSC(X_test), Y_test) / Ag(f_SSC(X_test), f_PSC(X_test))

Token Relevance Agreement (TRA): Do SSC and PSC models reason alike? For a sentence X with length n, we compute the gradient of the classification output posterior with respect to each token embedding emb(w_j). We take the 2-norm of each gradient vector and normalize it over the entire sequence to produce a normalized local explanation vector m(f(X)) (Sundararajan et al., 2017):

m(f(X))_j = ||∇_emb(w_j) f(X)||_2 / Σ_k ||∇_emb(w_k) f(X)||_2

To compare "reasoning" similarity between the two models, the samplewise input token relevance agreement is computed using cosine similarity:

TRA_i = cos(m(f_SSC(X_i)), m(f_PSC(X_i)))

As the SSC and PSC inputs have different lengths, we pad the SSC importance vector m(f_SSC(X_i)) with zeros, either prepended or postpended (depending on whether the SSC is hypothesis- or premise-only), to make the two local explanation map vectors equal in length. The dataset-level token relevance agreement is the average of the samplewise TRA.
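These agreement metrics can be sketched with NumPy as follows. This is an illustrative implementation assuming integer prediction vectors and per-token gradient matrices; all helper names are our own:

```python
import numpy as np

def agreement(*vecs):
    """Ag(Y_1, ..., Y_n): fraction of positions identical across all vectors."""
    stacked = np.stack(vecs)
    return float(np.mean(np.all(stacked == stacked[0], axis=0)))

def nba(y_ssc, y_psc):
    """SSC-PSC Agreement: fraction of samples where the two models agree."""
    return agreement(y_ssc, y_psc)

def nbr(y_ssc, y_psc, y_true):
    """SSC-PSC Recovery: of the agreeing samples, the fraction that are
    also classified correctly."""
    return agreement(y_ssc, y_psc, y_true) / agreement(y_ssc, y_psc)

def explanation_map(token_grads):
    """m(f(X)): per-token gradient 2-norms, normalized over the sequence."""
    norms = np.linalg.norm(token_grads, axis=1)
    return norms / norms.sum()

def tra(m_ssc, m_psc, prepend=True):
    """Samplewise token relevance agreement: cosine similarity after
    zero-padding the shorter SSC importance vector."""
    pad = len(m_psc) - len(m_ssc)
    m_ssc = np.pad(m_ssc, (pad, 0) if prepend else (0, pad))
    return float(m_ssc @ m_psc /
                 (np.linalg.norm(m_ssc) * np.linalg.norm(m_psc)))
```

In practice the gradient matrices would come from backpropagating each model's output posterior to its token embeddings; here they are assumed given.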

Cluster-based Bias Evaluation
We are interested in investigating how the biased distributions of the elicited sentences in NLI datasets are captured in the learned representation spaces of models trained on them. In particular, we are interested in answering this question: is elicited sentence label leakage captured semantically in regions of latent space?
To answer this we produce dimensionality-reduced elicited sentence embeddings for the test set, using the PSC replication models, then fit a high-k k-means clustering to this collection of embeddings. This allows us to analyze how the local distribution of labels varies over the elicited sentence embedding space. By comparing the divergence between the label distribution within each cluster and the global label distribution, we can compute the Progressive Evaluation of Cluster Outliers (PECO) score (Figure 4).

Elicited Sentence Embeddings:
To embed the elicited sentences as they are learned by a model in the PSC, we feed the elicited sentences s_e through the PSC replication fine-tuned NLI model encoder. We extract the latent codes produced at the output of the very last fully connected layer of the model before the linear classifier, collecting latent codes for every s_e in the test set. We then project these codes onto their 30 principal components to produce the embeddings (Figure 4 (a)).

Figure 4: An overview of the approach to computing the PECO score from a collection of elicited population sentences s_elicited and their corresponding labels r. When a fixed threshold is chosen, the hypothesis embeddings can be dimensionality-reduced using T-SNE to produce plots like Figure 1.
Clustering: We fit a high-k (in this case, k = 50) k-means clustering over the distribution of elicited sentence embeddings to provide a set of local bins for analysis. For each cluster, we count the relation labels its samples contain, to produce a set of 50 cluster-label distributions (Figure 4 (b)).
Computing Cluster Divergences: For each cluster label distribution p_i = P(Y | cluster = i), we assess the L2 divergence between it and the global label distribution p_G to produce divergence scores s_i:

s_i = ||p_i − p_G||_2

This step is depicted in Figure 4 (c).
Progressive Evaluation: Finally, we compute the PECO score for this collection of cluster divergences as the area under the curve produced by counting the number of clusters with divergence s_i above some threshold t, as t sweeps the range of observed s_i values.
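Under the stated choices (PCA reduction, high-k k-means, L2 divergence of cluster label distributions, threshold sweep), the full pipeline can be sketched as below. This is an illustrative reconstruction rather than the released implementation: scikit-learn stands in for whatever clustering library was actually used, and the AUC here is taken over the fraction (rather than raw count) of clusters above each threshold.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def peco_score(latents, labels, k=50, n_pc=30, n_classes=3, n_thresholds=100):
    emb = PCA(n_components=n_pc).fit_transform(latents)           # (a) reduce
    clusters = KMeans(n_clusters=k, n_init=10).fit_predict(emb)   # (b) cluster
    p_global = np.bincount(labels, minlength=n_classes) / len(labels)
    divergences = []
    for i in range(k):                                            # (c) per-cluster
        members = labels[clusters == i]
        if len(members) == 0:
            continue
        p_i = np.bincount(members, minlength=n_classes) / len(members)
        divergences.append(np.linalg.norm(p_i - p_global))        # L2 divergence
    divergences = np.asarray(divergences)
    # (d) sweep threshold t over the observed divergence range and take the
    # trapezoidal area under the fraction-of-clusters-above-t curve
    thresholds = np.linspace(0.0, divergences.max(), n_thresholds)
    frac_above = np.array([(divergences > t).mean() for t in thresholds])
    return float(np.sum((frac_above[:-1] + frac_above[1:]) / 2.0
                        * np.diff(thresholds)))
```

A dataset whose clusters mirror the global label distribution yields divergences near zero and thus a PECO score near zero; strongly label-separable regions inflate the score.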
Generality of PECO: These same techniques could be applied to a wide variety of potential leakage features on the input, analyzing many different correlation types. For example, input sentence words could be shuffled to test for word order invariance, or word classes could be specifically masked to test for spurious vocabulary correlations.

PECO Parameter Choices:
We discuss the impact of PECO parameters (e.g., choice of k-means, number of principal components to reduce to, use of L2 or KL-divergence) in Appendix C.

Results
As discussed above, ideal NLI benchmarks are neither saturated nor biased. Unfortunately, as Table 3 demonstrates, none of the NLI datasets tested thus far satisfy this condition. This is more clearly illustrated in Figure 2. Two questions remain: to what extent do current models cheat, and how can we make less biased, less saturated datasets? Table 4 contains experimental results intended to answer these two questions. The "agreement metrics" introduced in Sec. 3.1, SSC-PSC Agreement, SSC-PSC Recovery, and Token Relevance Agreement, are provided in Table 4.

Figure 6: Token relevance agreement (TRA) vs SSC accuracy. When the PSC and SSC models "reason" more similarly for a dataset, the relation leakage bias exhibited in that dataset tends to be higher.

Result-Metric Correlations
We find that SSC-PSC model output agreement (NBA) and recovery rate %R_R are correlated with a PCC of 0.69 (Figure 5). Datasets where the SSC and PSC models predict more similarly also have closer SSC and PSC performance results. While this result is unsurprising, ANLI R3 is an interesting outlier (Section 5). We find that TRA and SSC accuracy are also positively correlated with a PCC of 0.57 (Figure 6). This result demonstrates that, for a single dataset, similar reasoning patterns in the single sentence condition and the standard sentence pair condition are strongly correlated with single-sentence relation leakage. In other words, standard condition NLI models trained on biased (high leakage) datasets tend to cheat. Thus, models indeed rely on annotation artifacts in NLI datasets to achieve high accuracy, and this continues to be a problem in newer NLI datasets, in spite of mitigation attempts.

How can we use this knowledge to build better benchmarks? Figure 7 depicts the relationship between PECO score and bias recovery (%R_R). We find the two are positively correlated with a PCC of 0.64. This result is fairly intuitive: the more uneven the distribution of labels is in the single-sentence latent spaces (and thus, the higher the area under the PECO curve), the more SSC performance approaches the standard PSC condition performance for a given NLI dataset. This suggests PECO-reducing interventions may be able to target debiasing efforts.

Figure 7: Test-set PECO score vs bias accuracy recovery (%R_R). This result suggests that interventions that produce lower PECO score datasets tend to yield datasets that exhibit less relation leakage in the SSC.

Discussion
Some examples are only correctly classified in the single-sentence condition. A common assumption in discussions of cheating features in machine learning is that they play a role in inflating classification accuracy when present. However, ANLI R3 provides an interesting counterexample. For this dataset, the hypothesis-only model achieves an SSC accuracy of 48.1%, the PSC model achieves 49.8% (a %R_R of 90%), and SOTA achieves 53%. Despite this score similarity, the samples which the two conditions actually classify correctly vary surprisingly. With an NBR of 58.5%, only ≈ 27% of test samples are correctly classified by both the single and two sentence condition models. This means that around 21% of samples in the A3 test set are only correctly classified by the single-sentence model.
Perhaps unsurprisingly, ANLI exhibits the lowest TRA out of all datasets tested, indicating that it is somewhat of an outlier in having the SSC and PSC models reason differently on it.

XNLI demonstrates the cross-lingual and semantic nature of single-sentence leakage. While previous work has focused on finding words, phrases, patterns, and heuristics in the surface form of the data, our study of XNLI provides an interesting opportunity to investigate the influence of underlying semantics as a leakage feature. XNLI consists exclusively of test and validation sets in 14 languages, manually translated from MNLI examples. Our XNLI PSC and SSC models are thus trained on MNLI alone, using the multilingual xlm-roberta checkpoint.
This produces a natural experiment wherein surface form biases present in the training data are completely eradicated in the test set (as only the 14 non-English languages of Table 1 are present), while the underlying meanings encoded by those words remain. In Table 3 we indeed find that XNLI and MNLI exhibit very similar result comparisons: the models on both datasets have ∆_max ≈ 20% and %R_R ≈ 65. These leakage feature results, being robust to manual translation into 14 different languages, seem to indicate that there is a strong fundamental semantic component to the human biases driving the elicited sentence relation leakage.
Relation leakage remains unsolved. Elicited sentence relation leakage is a problem for all evaluated NLI datasets, including the new ones intended to fix it. Recent datasets such as XNLI, FEVER, and OCNLI exhibit high absolute SSC performance over majority (∆_maj > 20).
Although ANLI and CAugNLI are improvements over the others in terms of ∆_maj, with CAugNLI shining particularly in this regard, none eliminate the relation leakage problem entirely; even CAugNLI still has ∆_maj = 8.0, an 8% performance over chance in the single sentence condition. SNLI-debiased and MNLI-debiased, despite their intended purpose, still contain significant amounts of single sentence label leakage (29.8% and 20.9% over chance). This might be because while their production procedure (Wu et al., 2022) does eliminate bias originally present in SNLI and MNLI, it fails to prevent the introduction of new bias in the data generation pipeline.
Cluster approaches are promising for future debiasing efforts. Figure 1 shows how the PECO-derived cluster-bias T-SNE plots can be used directly to visualize, analyze, and "debug" biased datasets. In the plot, SNLI clearly has considerably more high-bias clusters, taking up a considerable portion of the latent space as compared to CAugNLI, for the bias threshold of 0.2.
An intervention could be performed on identified bias regions in the distribution by having human annotators create new premise sentences from the given hypotheses, thereby forcing the PECO-based bias metrics to decrease. This idea is further backed up by the positive correlation that we find between PECO and %R_R (Figure 7), suggesting that producing datasets with lower PECO scores will naturally lead to lower recovered performance in the SSC, and thus less elicited sentence relation leakage.

Related Work
Understanding Bias in NLI: Huang et al. (2020) demonstrated that counterfactual augmentation alone cannot debias NLI. Multi-task learning can improve model robustness to fitting spurious features (Tu et al., 2020), but because the underlying benchmarks are biased, progress on the desired reasoning capability is questionable (Poliak et al., 2018). Geva et al. (2019) strengthen the finding that annotator bias is a key driver of this poor generalization performance, showing that NLI models can struggle to even generalize across disjoint sets of annotators on the same task.
Simple word- and n-gram-level approaches have proven surprisingly capable in a priori characterizations of dataset difficulty (McKenna et al., 2020) and in producing difficult test sets (Saxon et al., 2021) in diverse language domains such as SLU. Gardner et al. (2021) show how such purely frequentist approaches can identify word-level spurious correlations with respect to label class which drive, in part, the shortcut features for classes of "competency problems" such as NLI.

Mitigating Bias in NLI: Belinkov et al. (2019) demonstrate an approach to train NLI models robustly against some of these biases, using Gururangan et al. (2018)'s hard test set. McCoy et al. (2019) utilize simple heuristics like lexical overlap to produce the synthetic debiased HANS NLI dataset to test generalization. This dataset has been used to evaluate techniques including predicate-argument- (Moosavi et al., 2020) and syntactic transformation-based (Min et al., 2020) augmentations. Zhou and Bansal (2020) leverage a bag-of-words approach to debias datasets along lexical features. However, these approaches have thus far failed to improve generalization in comprehensive replication studies (Bhargava et al., 2021). Meanwhile, Varshney et al. (2022) propose a fully unsupervised data collection pipeline for NLI, in order to sidestep the problem of human biases entirely.
Approaches like those of Kaushik et al. (2020) and Wu et al. (2022) are very promising for producing data that reduces bias on a samplewise, but not populationwise, level. Using our semantic, model-driven local bias finding strategies, future interventions can lead to the large scale production of debiased NLI datasets and a new generation of higher quality benchmarks for language understanding. Liu et al. (2022) perform such a targeted augmentation approach using the dataset cartography sample characterization scheme from Swayamdipta et al. (2020) to produce WANLI, an NLI dataset that allows for improved performance on the aforementioned challenging test sets.

PECO vs Dataset Cartography
To determine whether PECO-driven dataset augmentation is redundant given the recent release of WANLI, we ask whether the usable samplewise information for targeting interventions (e.g., presence in a "biased" cluster) captured during PECO analysis is redundant with the samplewise characterization produced by dataset cartography.
To do this, we collect the samplewise confidence feature (Swayamdipta et al., 2020) during training of the PSC model for each validation set sample in SNLI. We then assign each validation sample to its corresponding PECO cluster (out of the 50) and produce two histograms of the confidence feature, one for "biased" clusters (PECO L2 divergence > 0.25) and one for the other clusters. Figure 8 shows the results of this experiment. We find that out of the 10k validation set examples, roughly 2 are assigned to an "unbiased" cluster for every 1 assigned to a "biased" one, roughly evenly across all confidence bins.
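The partitioning step can be sketched as follows, assuming per-sample confidence scores, per-sample cluster assignments, and per-cluster L2 divergences are already in hand (the function name and argument layout are our own):

```python
import numpy as np

def split_confidence_by_cluster_bias(confidence, cluster_ids,
                                     cluster_divergence, threshold=0.25):
    """Split samplewise cartography confidence scores by whether each
    sample's PECO cluster exceeds the L2-divergence bias threshold."""
    biased_mask = cluster_divergence[cluster_ids] > threshold
    return confidence[biased_mask], confidence[~biased_mask]
```

Histogramming the two returned arrays yields the kind of biased-vs-unbiased comparison shown in Figure 8.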
This result suggests that PECO clusterwise "biasedness" is orthogonal to the samplewise ease of learnability captured by the dataset cartography confidence feature. In other words, we find that some samples are easy to learn (high confidence) because they are simple, while other samples are easy to learn because the model is using cheating features. PECO-like analyses will be instrumental in guiding future efforts to eliminate shortcut features in natural language datasets.

Conclusion
In the half decade since Poliak et al. (2018), single sentence relation leakage has proven to remain a difficult issue. Efforts to debias NLI have led to datasets that merely exhibit different kinds of bias than those shown before, or less saturated benchmarks that continue to exhibit cheating features. Future work must prioritize reducing observable bias directly using a model-driven approach.

Limitations
Our work is limited primarily by PECO's reliance on test-set classification. To successfully analyze the train-set-only datasets SNLI-debiased and MNLI-debiased, we had to generate our own train/test splits over the data by sampling. Luck in split selection may play a role in the level of observable bias in cases like these. Furthermore, this reliance on observing held-out samples to understand bias means that interventions to reduce single sentence label leakage must apply costly multi-fold splitting and analysis, consuming more significant compute resources than would otherwise be needed for other model-driven approaches.

Ethical Considerations
In the short term, progress toward better natural language inference does not appear to lead to significant social risks in its broader impacts. While "underclaiming" progress in natural language processing tasks (e.g., exaggerating the scope or severity of failures of specific models on specific tasks) (Bowman, 2021) may be enabled by this work in the future, our focus on directly quantifiable and observable single sentence leakage, our use of SOTA-like models (fine-tuned transformers) for analysis, and our side-by-side comparison of our model implementations with SOTA all ensure that our criticisms of current NLI benchmarks are well-founded. All data and tools we utilized were freely distributed for unlimited research use in the academic context.

A Dataset Details

SICK (Marelli et al., 2014) was produced by instructing annotators to label existing sourced pairs from the 8K ImageFlickr data set (Young et al., 2014).

FEVER (Thorne et al., 2018): The original dataset was collected by eliciting annotators to write fact sentences that are supported, refuted, or unverifiable relative to source passages drawn from Wikipedia. This is converted into an NLI task by treating the elicited sentences and the source passages as NLI pairs with relations entail, contradict, or neutral respectively. This dataset is unique in that the premises were elicited from seed hypotheses, meaning it has a balanced hypothesis distribution but a potentially biased premise distribution.

ANLI: The Adversarial NLI corpus (Nie et al., 2020) is collected through crowdworkers, with the purpose of making state-of-the-art models fail on this dataset. The sentences are selected from Wikipedia and the manually curated HotpotQA training set (Yang et al., 2018). The language is English. It contains three partitions of increasing complexity and size, which we refer to hereafter as A1, A2, and A3. Detailed data statistics are in Table 1.

OCNLI: The Original Chinese NLI corpus was collected following MNLI procedures but with strategies intended to produce challenging inference pairs. No translation was employed in producing this data; the source premise sentences and elicited hypotheses are original.
CAugNLI: Kaushik et al. (2020) produced counterfactually augmented datasets for NLI and sentiment analysis using human annotators, instructing them to make minimal changes to the sentences beyond those necessary to change the label. It extends the work of Maas et al. (2011) and Bowman et al. (2015). They find that a BiLSTM classifier achieves negligible performance over chance when trained hypothesis-only. However, since their dataset includes elicited modified sentences in both the premise and hypothesis populations, there are opportunities for bias in both. CAugNLI was produced by having human annotators minimally modify either the premise or hypothesis of 2,500 samples drawn randomly from SNLI so as to produce new samples with similar structure and word distributions but different meanings. These modifications are intended to reduce spurious correlations, in particular by roughly equalizing the distribution of relation labels with respect to word-level and semantic-level patterns in the elicited hypothesis sentences.

SNLI-debiased and MNLI-debiased: These are augmentations of the SNLI and MNLI train sets produced by training GPT-2 (Radford et al., 2019) generators on them, generating samples that are checked for accuracy using a pretrained RoBERTa NLI classifier, and rejected if they exhibit spurious correlations, including samplewise hypothesis-only model classifiability (Wu et al., 2022). To do this, they first train static hypothesis-only SNLI and MNLI models, and reject all generated samples that can be successfully classified hypothesis-only by them. However, beyond this test under a static hypothesis-only distribution, they do not attempt to assess whether their generator models introduce new leakage features in the sentence distributions as a result of their accuracy filtering process. To test this, we create test splits on the data (as they provide train sets only) which contain no sentence overlap with the train sets, through random sampling.

B Training on PSC, Testing on SSC
Here we justify why PECO is computed on single-sentence condition (SSC) examples using models trained on the paired-sentence condition (PSC).

Our core goal is to characterize only the model-relevant shortcut features that are present in the SSC data, to enable better model-level understanding and to enable shortcut feature elimination in future datasets. While all SSC accuracy must be driven by SSC-visible shortcuts, it is possible that some SSC-visible cheating features aren't actually used as shortcuts by PSC classifiers. Thus, we train on the PSC and test on the SSC, and PECO is an alternative metric of bias that captures this notion of model-level separability of sentences in the SSC better than other approaches.

C PECO Parameter Details
The PECO scoring pipeline contains a number of parameters that require motivation, including the SSC and PSC model training hyperparameters, the number of principal components |PC| to reduce to during PCA, and the number of k-means clusters k to divide the test set into for analysis (Figure 4). We specify the neural network hyperparameters we performed grid search over in Sec. 2.4.1. However, selecting k and |PC| is not a straightforward grid search.
We report PECO scores for all assessed datasets in Table 5, for k ∈ {10, 25, 50, 100} under no-PCA, |PC| = 50, and |PC| = 100 conditions. We find that for a given k, the different PCA conditions have limited impact on the final scores. We also find that the L2- and KLD-based PECO scores are well-correlated. For smaller test sets (e.g., ANLI and its partitions A1-A3, OCNLI) there is increased sensitivity to variations in k relative to larger datasets such as SNLI. We selected k = 30, |PC| = 50 for our main experimental PECO results, as this did not produce the extreme swings in score for small test sets that we observed for higher k.