Fighting Bias with Bias: Promoting Model Robustness by Amplifying Dataset Biases

NLP models often rely on superficial cues known as dataset biases to achieve impressive performance, and can fail on examples where these biases do not hold. Recent work sought to develop robust, unbiased models by filtering biased examples from training sets. In this work, we argue that such filtering can obscure the true capabilities of models to overcome biases, which might never be removed in full from the dataset. We suggest that in order to drive the development of models robust to subtle biases, dataset biases should be amplified in the training set. We introduce an evaluation framework defined by a bias-amplified training set and an anti-biased test set, both automatically extracted from existing datasets. Experiments across three notions of bias, four datasets and two models show that our framework is substantially more challenging for models than the original data splits, and even more challenging than hand-crafted challenge sets. Our evaluation framework can use any existing dataset, even those considered obsolete, to test model robustness. We hope our work will guide the development of robust models that do not rely on superficial biases and correlations. To this end, we publicly release our code and data.


Introduction
NLP models often exploit repetitive patterns introduced during data collection, known as dataset biases, to achieve strong performance (Poliak et al., 2018; McCoy et al., 2019). This trend has led to attempts to improve the evaluation of NLP models by creating test sets that are different from the training sets, e.g., from a different domain (Williams et al., 2018) or a different distribution (Koh et al., 2021), and challenge sets that focus on counterexamples to known biases in the training set, which we refer to as anti-biased examples (Jia and Liang, 2017; Naik et al., 2018; Utama et al., 2020).
To address these gaps, some works used balancing techniques to create unbiased datasets, by filtering out biased examples (Zellers et al., 2018; Le Bras et al., 2020; Swayamdipta et al., 2020), or injecting anti-biased examples into the training sets (Nie et al., 2020; Liu et al., 2022a). In this work we argue that in order to encourage the development of robust models, we should in fact amplify biases in the training sets, while adopting the challenge set approach and making test sets anti-biased (Fig. 1).
Amplifying dataset biases might seem counterintuitive at first. Our work follows recent work that challenged the assumption that biases can ever be fully removed from a given dataset (Schwartz and Stanovsky, 2022), arguing that models are able to pick up on very subtle phenomena even in partially balanced (or mostly unbiased) datasets (Gardner et al., 2021). As a result, dataset balancing, while potentially improving generalization, might make it harder to develop models that are resilient to such biases; these biases "hide" in the balanced training sets, and the way models handle them is hard to evaluate and make progress on. Instead, we argue that academic benchmarks should include training splits that mainly consist of biased examples (see Fig. 2). Such splits will drive the development of robust models that generalize beyond biases, ideally even subtle ones.
We present a simple method to implement our approach (Sec. 2). Given a dataset in which both training and test sets are divided into biased and anti-biased subsets, we remove the anti-biased instances from the training set and the biased ones from the test set. The new splits then form a challenging evaluation setting. We assume that biased instances constitute the majority of a dataset (Gururangan et al., 2018; Utama et al., 2020), and thus the resulting training sets are similar in size to the original ones (though the test sets are smaller).
To discern biased and anti-biased instances, we consider three model-based approaches (Sec. 3): (a) dataset cartography (Swayamdipta et al., 2020), which uses training dynamics to profile the difficulty of learning individual data instances. In this approach, we identify instances that are hard-to-learn as anti-biased (Sanh et al., 2021; He et al., 2019); (b) partial-input models (Kaushik and Lipton, 2018; Poliak et al., 2018), which are forced to rely on bias, regarding instances on which they fail as anti-biased; and a method we introduce for identifying (c) minority examples (Tu et al., 2020; Sagawa et al., 2020), which groups a dataset's instances using deep clustering (Caron et al., 2018) and regards the minority-label instances within each cluster as anti-biased.
We apply our framework to MultiNLI (Williams et al., 2018) and QQP (Wang et al., 2018), on which trained models exceed human performance. We also experiment with two datasets that are considered more challenging: Adversarial NLI (ANLI; Nie et al., 2020) and WANLI (Liu et al., 2022b). We use a ROBERTA-BASE (Liu et al., 2019b) model for selecting biased and anti-biased instances according to each method, and evaluate the performance of ROBERTA and DEBERTA (He et al., 2021) LARGE models under our proposed setting (Sec. 4). While anti-biased instances are naturally challenging for models, amplifying biases in the training set makes them even more challenging; using the partial-input and minority examples methods, we observe mean absolute performance reductions of 15.8% and 31.8%, respectively. Using instances detected with dataset cartography leads to smaller (though still large) reductions of 10.1%.
We compare bias-amplified splits to hand-crafted challenge sets such as HANS (McCoy et al., 2019), and find that our automatically generated anti-biased test sets are both of similar difficulty to such challenge sets and capture a more diverse set of biases. Our framework can further be used to augment existing challenge sets, as training on bias-amplified data increases their difficulty.
Next, we investigate how many anti-biased examples are required for generalization, by gradually re-inserting such instances into the training set (Liu et al., 2019a). While models greatly benefit from observing small amounts of anti-biased instances, anti-biased test sets remain challenging, and additional performance gains require much larger quantities (Sec. 5). We then show that standard debiasing methods applied to bias-amplified training sets lead to little to no gains in performance (Sec. 6).
Our findings may change the way we evaluate the robustness of NLP models, and in particular their level of generalization beyond the biases of their training sets. Our method requires no new annotation or any task-specific expertise. It allows rejuvenating datasets previously considered obsolete, and thus reusing the intensive efforts invested in their curation. We release our new dataset splits along with code for automatically creating bias-amplified splits for other datasets.

Amplifying Dataset Biases to Advance Model Robustness
This section motivates our approach in view of recent developments in NLP, provides a general overview of the framework we use to implement it, and discusses its applications.

Motivation: Data Balancing Hides Biases
This paper focuses on the problem of creating robust models that generalize beyond dataset biases. A common approach to addressing this problem is removing these biases from the training data (Zellers et al., 2018; Le Bras et al., 2020). This approach is intuitive: if a model doesn't observe these biases in the first place, it is less likely to learn them, and will thus generalize better. Despite the appeal of this approach, it suffers from several problems. First, recent work has argued that models are sensitive to very fine-grained biases, which are hard to detect and filter (Gardner et al., 2021). Other works have shown that training on bias-filtered datasets does not necessarily lead to better generalization (Kaushik et al., 2021; Parrish et al., 2021), indicating that while such training sets are less biased, models might still rely on biases to solve them. Finally, recent studies argued that even with our utmost efforts, we may never be able to create datasets that contain no exploitable biases (Linzen, 2020; Schwartz and Stanovsky, 2022).
As a result, this paper argues that mitigating the negative effect of dataset biases is not only a data problem, but needs to also come from better modeling. But how can we create a testbed for developing models that overcome these biases? We argue that training on datasets filtered for such biases will not suffice to develop such models, and will in fact make it harder to do so; as subtle biases still "hide" inside filtered training sets, it is much harder to track them, evaluate their impact and, importantly, develop models that learn to ignore them.
Instead, in this paper we propose that when evaluating model robustness, dataset biases should be amplified by training mostly on biased instances, while using anti-biased instances for evaluation (Fig. 2). This simple setting defines a challenging test, where models must counteract dataset biases and learn generalizable solutions in order to succeed, as the anti-biased test set cannot be solved using the biased training set's statistical cues.

Framework for Amplifying Dataset Biases
We describe our approach for amplifying dataset biases during training to evaluate model generalization. Given a dataset split into training and test sets D = D_train ∪ D_test, we begin by dividing its instances across both splits into biased and anti-biased subsets. To evaluate a model's robustness, we first train it on the portion of biased training instances D_train^biased. We assume most data instances are biased (Gururangan et al., 2018), so this process results in small reductions in training set sizes compared to D_train. We then evaluate the model on the anti-biased test instances D_test^anti-biased, and compare it to the performance of the same model trained on the full training set. Drops in performance between the two indicate that the model struggles to overcome its training set biases.
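The split construction itself is only a filtering step. A minimal sketch, assuming each instance already carries an is_biased flag assigned by one of the detection methods of Sec. 3 (the flag name is illustrative, not the paper's code):

```python
def bias_amplified_splits(train, test):
    """Build bias-amplified splits: keep only biased instances for
    training (D_train^biased) and only anti-biased instances for
    evaluation (D_test^anti-biased).

    Each instance is assumed to be a dict carrying an "is_biased"
    flag, produced beforehand by a bias-detection method."""
    train_biased = [ex for ex in train if ex["is_biased"]]
    test_anti_biased = [ex for ex in test if not ex["is_biased"]]
    return train_biased, test_anti_biased
```

A model trained on train_biased is then compared against the same model trained on the full training set, both evaluated on test_anti_biased.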

Discussion
Applications We suggest our framework as a tool for studying and evaluating models. As such, it is orthogonal to data collection procedures. Importantly, we do not suggest intentionally collecting biased data when curating new datasets. Nonetheless, data collected in large quantities tends to contain unintended regularities (Gururangan et al., 2018). We therefore propose to use bias-amplified splits to complement benchmarks with challenging evaluation settings that test model robustness, in addition to the dataset's main training and test sets.
Such splits, when created using the methods we consider in this work, can be created automatically and efficiently for any dataset.These include newly collected datasets, but also existing ones, such as obsolete benchmarks on which model performance is too high to measure further progress, allowing for the rejuvenation and reuse of benchmarks.
Anti-biased vs. challenge sets Our framework provides an evaluation environment to assess model robustness, similar to challenge sets. However, unlike challenge sets, which are often manually curated with protocols designed to create difficult examples, our approach is automatic and uses data collected with the exact same protocol as the model's training data. Still, we find that anti-biased test sets are challenging for models and can capture more diverse biases, and moreover, that training on bias-amplified data further enhances the difficulty of existing challenge sets (Sec. 4.2). Consequently, our framework can be employed to evaluate robustness in tasks where challenge sets are unavailable, or in conjunction with existing challenge sets for a more comprehensive evaluation.
Can models generalize from biased data? A natural question to ask about our approach is whether we can truly expect models to generalize from a biased training distribution. Although the biased training sets could be solved by capturing only a subset of relevant features, their instances can still provide valuable information for learning additional features that are important for generalization yet under-utilized by models (Shah et al., 2020; Geirhos et al., 2020). Previous work has proposed techniques to encourage models to learn diverse, unbiased representations from extremely biased training distributions, mostly focusing on domains outside of NLP (Kim et al., 2019; Bahng et al., 2020; Pezeshki et al., 2021). This is likely due to the difficulty of defining and controlling biased distributions in textual domains. Our work paves the way for implementing and evaluating such methods specifically for NLP.
Related to this concern is our decision to leave no anti-biased instances in the training set. Indeed, it is likely that for many biases, at least some counter-examples will be found in the training set. We admit that this decision is not a major component of our approach, and it could be easily implemented with a small number of anti-biased instances in the training set instead. To avoid deciding on the numeric definition of small, and to make the setup as challenging as possible, we experiment throughout this paper with no (identified) anti-biased instances in training. In Sec. 5 we study the effect of using limited amounts of such counter-examples, by reinserting some anti-biased instances into training.

Definitions of Biased and Anti-biased Examples
Our approach requires a drop-in method for classifying a dataset's examples into biased and anti-biased instances. We consider the following model-based methods for doing so. We note that none of them requires any prior knowledge or task-specific expertise. All methods can be computed automatically at the reasonable cost of training and evaluating a (possibly smaller) model on the dataset.
Dataset Cartography (Swayamdipta et al., 2020) is a method to automatically characterize a dataset's instances according to their contribution to a model's performance, by tracking the model's training dynamics. Specifically, measuring each instance's confidence (the mean of the predicted model probabilities for the gold label across training epochs) reveals a region of easy-to-learn instances with high confidence, which the model consistently predicts correctly throughout training, and a region of low-confidence hard-to-learn instances, on which the model consistently fails during training. We follow previous work which considered instances that models find easy or hard to solve as more likely to be biased or anti-biased, respectively (Sanh et al., 2021; He et al., 2019).
To estimate the confidence of test instances, we make predictions with a partially trained model at the end of each epoch on the test set (as typically done on the validation set), and use the average confidence scores across epochs. To choose anti-biased examples, we use the q% most hard-to-learn instances in each of the training and test sets individually, where q is a hyperparameter. We consider all other examples as biased.
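The selection step above reduces to ranking instances by mean gold-label probability across epochs. A sketch under the assumption that these per-epoch probabilities have already been recorded (array shapes and names are illustrative):

```python
import numpy as np

def select_anti_biased(gold_probs_per_epoch, q):
    """Select the q% most hard-to-learn instances as anti-biased.

    gold_probs_per_epoch: array of shape [n_epochs, n_examples]
    holding the model's predicted probability of the gold label,
    recorded at the end of each training epoch.

    Returns the indices of the q% of instances with the lowest
    mean confidence across epochs (the "hard-to-learn" region)."""
    confidence = gold_probs_per_epoch.mean(axis=0)  # per-instance mean over epochs
    k = int(len(confidence) * q / 100)
    return np.argsort(confidence)[:k]               # lowest-confidence first
```

All remaining indices are treated as biased.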
Partial-input baselines are a common method for identifying annotation artifacts in a dataset. The method works by examining the performance of models that are restricted to using only part of the input. Such models, if successful, are bound to rely on unintended or spurious patterns in the dataset. Examples include question-only models for visual question answering (Goyal et al., 2017), ending-only models for story completion (Schwartz et al., 2017) and hypothesis-only models for natural language inference (Poliak et al., 2018).
Held-out instances where such baselines fail are considered anti-biased and less likely to contain artifacts (Gururangan et al., 2018). Generating a biased training set for this method is not trivial, as the partial-input model is likely to fit the training data during training, and thus almost all examples will be labeled biased. We therefore follow the dataset cartography approach with a partial-input baseline, and compute the mean confidence score for each instance across epochs. We select the q% most hard-to-learn instances as anti-biased.
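Constructing the partial inputs themselves is a simple preprocessing step. A hedged sketch for the two baselines used in this paper, hypothesis-only NLI and first-question-only QQP (field and task names are illustrative, not the paper's code):

```python
def to_partial_input(example, task):
    """Reduce an example to the partial input seen by the biased
    baseline model. Field names ("premise", "hypothesis",
    "question1", "question2") are assumed dataset keys."""
    if task == "nli":
        # hypothesis-only baseline: discard the premise entirely
        return {"text": example["hypothesis"], "label": example["label"]}
    if task == "qqp":
        # first-question-only baseline: discard the second question
        return {"text": example["question1"], "label": example["label"]}
    raise ValueError(f"unknown task: {task}")
```

A standard classifier trained on these reduced examples can then supply the per-instance confidence scores used for selection.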
Minority examples Current models are typically sensitive to minority examples that defy common statistical patterns found in the rest of the data, especially when the amount of such examples in the training set is scarce (Tu et al., 2020; Sagawa et al., 2020). Minority examples are often detected by heuristically searching for spurious features correlated with one label in the instances of another label (e.g., high word overlap between two non-paraphrase texts). Motivated by recent work that leverages instance similarity in the representation space of fine-tuned language models for various use cases (Liu et al., 2022b; Pezeshkpour et al., 2022), we propose a model-based clustering approach to automatically detect minority examples.
We follow a three-step approach. First, we cluster the training set using [CLS] token representations extracted from a model trained on the dataset. Second, to detect minority examples in the training set, we inspect the distribution of instances over the task labels L within each cluster c_i. We define a cluster c_i's majority label as the label ℓ_i ∈ L associated with the most instances in c_i. We consider all other labels as c_i's minority labels. Instances belonging to their cluster's minority labels are regarded as minority examples, and accordingly anti-biased, while all others are considered biased. Finally, to detect minority examples in the test set, we extract [CLS] representations for all test instances, and assign each instance to the cluster of its nearest neighbor in the training set using Euclidean distance. If the test instance (x, y) is assigned to cluster c_i, we consider (x, y) a majority example iff it belongs to c_i's majority label, i.e., if y = ℓ_i.

Our preliminary experiments show that standard clustering algorithms tend to create label-homogeneous clusters, i.e., they are less likely to cluster together instances from different labels. We thus use DEEPCLUSTER (Caron et al., 2018), which we find to create more label-diverse clusters. DEEPCLUSTER alternates between grouping a model's representations with a standard clustering algorithm to produce pseudo-labels, and fine-tuning a new pretrained model to predict these pseudo-labels. We perform one iteration of deep clustering and then cluster the representations of the DEEPCLUSTER model to obtain the final clustering. App. C shows details and preliminary results on alternative clustering methods.
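The second and third steps can be sketched as follows, assuming cluster assignments (e.g., from a DEEPCLUSTER run) and [CLS] representations have already been computed; all names here are illustrative, not the paper's code:

```python
import numpy as np
from collections import Counter

def minority_flags(train_clusters, train_labels):
    """Flag training instances whose gold label differs from their
    cluster's majority label as minority (anti-biased) examples.
    train_clusters: per-instance cluster ids; train_labels: task labels."""
    labels = np.asarray(train_labels)
    clusters = np.asarray(train_clusters)
    # majority label ℓ_i of each cluster c_i
    majority = {c: Counter(labels[clusters == c].tolist()).most_common(1)[0][0]
                for c in set(train_clusters)}
    return [bool(majority[c] != y) for c, y in zip(train_clusters, train_labels)]

def assign_test_clusters(train_reprs, train_clusters, test_reprs):
    """Assign each test instance to the cluster of its nearest training
    neighbor in [CLS]-representation space (Euclidean distance)."""
    tr = np.asarray(train_reprs)
    te = np.asarray(test_reprs)
    dists = ((te[:, None, :] - tr[None, :, :]) ** 2).sum(-1)
    return [train_clusters[i] for i in dists.argmin(1)]
```

Test instances are then flagged as minority examples whenever their label differs from the assigned cluster's majority label, mirroring the training-set rule.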

Models Struggle with Amplified Biases
We next use our framework to evaluate the extent to which models generalize beyond the biases of their training sets.

Experimental Setup
We create bias-amplified splits for four datasets: two (QQP, Wang et al., 2018; and MultiNLI, Williams et al., 2018) that were shown to contain considerable biases (Zhang et al., 2019; Gururangan et al., 2018), and two additional datasets (ANLI, Nie et al., 2020; and WANLI, Liu et al., 2022b) designed to contain smaller proportions of biased instances. QQP is a duplicate question identification dataset, while the other three are natural language inference (NLI) datasets.
We split all datasets into biased and anti-biased parts according to each of the three methods described in Sec. 3. We use a ROBERTA-BASE (Liu et al., 2019b) model for all three methods: we fine-tune the model on each dataset to compute training dynamics for dataset cartography, and also to extract and cluster [CLS] representations for identifying minority examples; we separately train the model on partial inputs to obtain training dynamics for partial-input baselines. We use hypothesis-only baselines for the NLI datasets. For QQP, we use the first question of each pair.
We then evaluate the performance of ROBERTA and DEBERTA (He et al., 2021) LARGE models under our proposed framework. We train models on the biased training split obtained from each of the three methods, and report their performance on the corresponding anti-biased test sets. Since the number of biased training instances is induced by the clustering in the minority examples approach, but is a hyperparameter q for the two other approaches, we adjust q to create equally sized training sets for all three methods. This results in 79% of the training set for MultiNLI and 82% for QQP. When selecting minority examples for MultiNLI and QQP, we consider all labels but a cluster's majority label as minority labels. Using this setting for ANLI and WANLI results in specifying more than 40% of the training set as minority examples. This leaves too few biased instances for training and substantially changes the original training distribution. Therefore, for these datasets, we use the label with the fewest instances within a cluster as its minority label.

Results
Models struggle with biased training sets Tab. 1 shows our results for ROBERTA-LARGE. We observe that the baseline models struggle with all anti-biased test sets, even when training on the full training set. The anti-biased test splits based on dataset cartography prove to be the most initially difficult, with the splits created using the two other methods overall similar in difficulty. Still, model performance on anti-biased instances drops further when training on biased training splits; taking the mean across datasets, performance drops by 8.4% for dataset cartography-based splits, 16.2% for partial-input, and 32.5% for minority examples. Results for DEBERTA-LARGE (App. B.1) follow the same trends, with mean performance reductions of 11.8% for dataset cartography, 15.4% for partial-input, and 31.1% for minority examples.
We also observe that training on biased splits leads to minor reductions on the full test sets, indicating that while current models trained on our training splits fail to generalize beyond the biases in these sets, they are seemingly able to learn the tasks at hand.
Anti-biased test sets are as challenging as manual challenge sets We further compare model performance on our anti-biased test splits to performance on challenge sets collected manually. Particularly, we compare the splits created with the minority examples method for MultiNLI and QQP to the HANS (McCoy et al., 2019) and PAWS (Zhang et al., 2019) challenge sets, respectively.
Our results (Tab. 2 for HANS, Tab. 3 for PAWS) show that, when training on the full dataset, our automatically curated test splits are more difficult than the HANS challenge set, but not as challenging as PAWS (Mean column). Interestingly, training on biased splits (final row) makes the challenge sets dramatically more difficult, but our anti-biased splits are even more challenging in this setup: the model performs 0.9% worse on MultiNLI compared to HANS, and 13.2% worse on QQP compared to PAWS. We further find that anti-biased test splits are more diverse than the challenge sets, as difficult instances affected by biases arise in all labels in the anti-biased splits, while mostly in one label in the challenge sets. Our results suggest that bias-amplified splits can augment existing challenge sets by boosting their difficulty or uncovering instances that influence the biases they test.
Discussion Overall, bias-amplified splits prove to be extremely difficult for strong models. Such splits could be used to identify models that successfully generalize beyond substantial biases, and are more likely to overcome subtler ones. Importantly, bias amplification remains challenging even when applied to recent datasets that contain fewer biased instances (e.g., ANLI and WANLI), or when compared to hand-crafted challenge sets. Bias-amplified splits could therefore be used to complement model evaluation on future, more challenging datasets. Finally, our splits can be created automatically for any existing dataset, even those for which model performance on the standard splits exceeds human performance, such as MultiNLI and QQP.

How Many Anti-biased Examples are Needed for Generalization?
So far, we have seen that amplifying dataset biases by eliminating all anti-biased instances from the training set uncovers shortcomings in model generalization. We next study the effect of allowing some anti-biased instances in the training set (Liu et al., 2019a). We fine-tune ROBERTA-LARGE on all four datasets using the biased splits created with the minority examples method, while gradually reinserting 10%, 20%, 35%, 50% and 70% of the anti-biased instances back into the training set.

Our results (Fig. 3) show that reinserting 20% of the anti-biased training instances allows the model to close approximately 50% of the gap from its baseline performance on the anti-biased test set. Surprisingly, performance grows slowly when restoring additional anti-biased instances, and does not match the full training set's levels even when adding 70% of anti-biased instances. This indicates that the model is capable of generalizing from small amounts of anti-biased instances, but is inefficient in gaining further improvements. Results for the other models (Fig. 4) show a similar trend.
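The reinsertion experiment amounts to sampling a fixed fraction of the held-out anti-biased instances back into the biased training split. A minimal sketch (names are illustrative, not the paper's code):

```python
import random

def reinsert_anti_biased(train_biased, train_anti_biased, fraction, seed=0):
    """Return a training set consisting of the biased split plus a
    randomly sampled `fraction` of the removed anti-biased instances,
    e.g., fraction=0.2 restores 20% of them."""
    rng = random.Random(seed)  # fixed seed for reproducible samples
    k = int(len(train_anti_biased) * fraction)
    return train_biased + rng.sample(train_anti_biased, k)
```

Sweeping fraction over {0.1, 0.2, 0.35, 0.5, 0.7} and retraining at each point reproduces the setup described above.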
On the one hand, our results encourage careful data collection in order to fill gaps in dataset coverage (Parrish et al., 2021; Liu et al., 2019a). On the other hand, our findings indicate that data curation is not a sufficient solution, as models struggle on minority examples even when observing all available instances, and collecting more instances results in smaller further gains. Thus, it is also necessary to develop robust models that can better generalize from biased data. Our proposed framework provides a testbed for doing so.

Related Work
Biased splits The concept of re-organizing a dataset's training and test splits is often used to create more challenging evaluation benchmarks from existing datasets by inserting bias into the training set. Søgaard et al. (2021) showed that using biased splits better approximates real-world performance compared to standard, random splits. Koh et al. (2021) and Santurkar et al. (2021) simulated real-world distribution shifts by filtering out different kinds of data from the training and test sets, based on manually crafted heuristics. Agrawal et al. (2018) ignored the dataset's original training and test splits altogether and re-split instances to create biased splits for VQA using dataset-specific heuristics. Unlike such approaches, our method automatically constructs biased splits using dataset-agnostic approaches, and follows the original training and test splits. Concurrently to this work, Godbole and Jia (2023) re-split datasets by placing all examples that are assigned lower likelihood by an LM in the test set, and more likely examples in the training set. In some sense, that work also creates an "easy" training set and a "hard" test set, and can thus be considered a special case of our approach.
Challenge sets Given the exceptional performance of modern NLP tools on standard benchmarks, challenging test sets were created to better assess model capabilities across various tasks (Isabelle et al., 2017; Naik et al., 2018; Marvin and Linzen, 2018). Such approaches often rely on human experts to identify model weaknesses and create challenging test cases using instance perturbations (Jia and Liang, 2017; Glockner et al., 2018; Belinkov and Bisk, 2018; Gardner et al., 2020) or rule-based data creation protocols (McCoy et al., 2019; Jeretic et al., 2020). Some approaches automate certain parts of these procedures, yet still require human design or annotation (Bitton et al., 2021; Li et al., 2020; Rosenman et al., 2020).
Inserting instances from challenge sets into the training set was shown to potentially alleviate their difficulty (Liu et al., 2019a), perhaps similarly to how model performance in our framework improves when reintroducing anti-biased examples into the training set (Sec. 5). Other work extracted challenging test subsets from existing benchmarks for focused model evaluation (Gururangan et al., 2018). Our framework can similarly be used to better evaluate model generalization, but without requiring additional annotations or task-specific expertise, and using data that was collected in the exact same procedure as the model's training data. We further showed (Sec. 4) that our framework can be used along with existing challenge sets to increase their difficulty.
Dataset balancing Recent work proposed methods to collect benchmarks with balanced and ideally unbiased training and test splits. Such benchmarks often use a model-in-the-loop during data collection and task crowd workers to write examples on which models fail (Bartolo et al., 2020; Nie et al., 2020; Kiela et al., 2021; Talmor et al., 2021), or use adversarial filtering to remove examples from existing or newly collected datasets that were easily solved by models (Zellers et al., 2018, 2019; Dua et al., 2019; Le Bras et al., 2020; Sakaguchi et al., 2021). Parrish et al. (2021) proposed to use an expert linguist-in-the-loop during crowdsourcing to improve data quality and diversity. Other work used generative methods to enrich existing datasets and compose new machine-generated examples similar to challenging seed examples (Lee et al., 2021; Liu et al., 2022a). Other studies argued that despite our best efforts, we may never be able to create datasets that are truly balanced (Linzen, 2020; Schwartz and Stanovsky, 2022). Our framework can be used to expose biases in such datasets and to automatically augment them with more challenging evaluation splits.

Conclusion
Recent approaches in NLP attempted to eliminate dataset biases from training sets to produce robust models and reliable evaluation settings, yet model generalization remains a challenge, and subtler biases persist. In this work, we argued that to promote robust modeling, models should instead be evaluated on datasets with amplified biases, such that only true generalization will result in high performance. We presented a simple framework to automatically create bias-amplified splits for a given dataset, finding that such splits are difficult for strong models whether created from obsolete or from difficult datasets, and could potentially expose differences in generalization capabilities between models. Our results indicate that bias amplification could ease the creation of robustness evaluation tests for new datasets, as well as inform the development of robust methods.

Limitations
In our experiments, we evaluated models by fine-tuning on bias-amplified splits, but we did not explore the robustness of few-shot methods. Such methods are intuitively less likely to be affected by slight changes in the distribution of examples they observe. However, recent work has shown that they could still be affected by dataset biases (Utama et al., 2021; Li et al., 2022), and we will use our framework to explore this in future work.
We note that our approach is less suitable for datasets with relatively small test sets. In such cases, extracting an anti-biased test split, which consisted of 13-21% of the original test set in the benchmarks we considered, will result in a test set too small to reliably evaluate models. However, the methods we used to extract bias-amplified splits (Sec. 3) could be tuned to produce larger test sets (while keeping the amount of anti-biased instances in the training set relatively small), e.g., by selecting a lower number of biased training instances (q, Sec. 4.1).
Throughout this paper, we used the term "bias" to describe statistical regularities in datasets that can be exploited by models as unintended shortcut solutions.While we do not explore model robustness to other types of data biases (e.g., different kinds of societal biases) our framework could potentially be used to evaluate how models handle such cases by revising the definitions of biased and anti-biased instances used to create the evaluation splits.We leave such applications of our framework to future work.

A.1 Datasets
We experiment with four large datasets: QQP, MultiNLI, WANLI and ANLI. We also run a hyperparameter search on SST-2, and evaluate model performance on HANS and PAWS. Sizes of the different datasets are reported in Tab. 6. Our implementation loads all datasets from the Huggingface Datasets Hub using the datasets Python library (Lhoest et al., 2021). All datasets are for English tasks.
QQP We experiment with the Quora Question Pairs (QQP) dataset using the version released under the GLUE benchmark (Wang et al., 2018). QQP is a dataset for the task of predicting whether pairs of questions have the same intent, i.e., whether they are duplicates or not. The dataset is based on actual data from Quora.
Natural Language Inference (NLI) The task of natural language inference involves predicting the relationship between a premise and hypothesis sentence pair. The label determines whether the hypothesis entails, contradicts, or is neutral to the premise.

MultiNLI
We experiment with the multi-genre MultiNLI dataset (Williams et al., 2018), which was crowdsourced by tasking annotators to write hypotheses to a given premise for each of the three labels. MultiNLI contains ten distinct premise genres of written and spoken data: Face-to-face, Telephone, 9/11, Travel, Letters, Oxford University Press, Slate, Verbatim, Government and Fiction, of which five are included in the train and dev-matched sets. We do not use the dev-mismatched set in our experiments. We use the version released under the GLUE benchmark (Wang et al., 2018).
Adversarial NLI We experiment with Adversarial NLI (ANLI) (Nie et al., 2020), a large-scale human-and-model-in-the-loop natural language inference dataset collected over multiple rounds, using BERT (Devlin et al., 2019) and ROBERTA (Liu et al., 2019b) as adversary models. Although each of the dataset's rounds can be used as a separate evaluation setting (e.g., training on the first round and testing on the second), the data collected over all rounds can also be concatenated and used for training and evaluation; both settings were used in the original paper. In our experiments we take the concatenation approach.
WANLI We experiment with WANLI (Liu et al., 2022b), an NLI dataset collected through worker-and-AI collaboration. WANLI was created by identifying examples with challenging reasoning patterns in MultiNLI and using an LLM to compose new examples with similar patterns. The generated examples were then automatically filtered, and finally revised and labeled by human crowdworkers. WANLI is more challenging for models than MultiNLI, and using WANLI instances for training was shown to improve out-of-distribution generalization.

SST-2
We run a hyperparameter search on SST-2. The Stanford Sentiment Treebank (Socher et al., 2013) is a sentiment analysis corpus with fully labeled parse trees for single sentences extracted from movie reviews. SST-2 refers to a binary classification task on sentences extracted from these parse trees (negative or somewhat negative vs. somewhat positive or positive, with neutral sentences discarded). We use the version of SST-2 released under the GLUE benchmark (Wang et al., 2018).
HANS We evaluate models on HANS (Heuristic Analysis for NLI Systems; McCoy et al. 2019), a challenge set used to assess whether NLI models adopt invalid syntactic heuristics that succeed for the majority of NLI training examples (e.g., lexical overlap implies that the label is entailment), instead of learning more generalizable solutions. HANS contains many entailment examples that support these heuristics, and many non-entailment examples where such heuristics fail. When evaluating NLI models that were trained with 3-way labels (as in MultiNLI), we map contradiction or neutral predictions to the non-entailment label. HANS was created by automatically filling in words in templates devised by human experts.
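As a minimal sketch of the 3-way to 2-way label mapping described above (the label ids below are an assumption, following a common MultiNLI convention; the exact ids depend on the model's config):

```python
# Assumed 3-way label ids (the exact ids depend on the model's configuration).
NLI_LABELS = {0: "entailment", 1: "neutral", 2: "contradiction"}

def to_hans_label(pred_id: int) -> str:
    """Collapse a 3-way NLI prediction into HANS's binary label space:
    contradiction and neutral both map to non-entailment."""
    return "entailment" if NLI_LABELS[pred_id] == "entailment" else "non-entailment"
```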
PAWS We evaluate models on PAWS (Paraphrase Adversaries from Word Scrambling; Zhang et al. 2019), a challenge set for the paraphrase identification task that focuses on non-paraphrase pairs with high lexical overlap. Challenging pairs are generated by controlled word swapping and back translation, followed by fluency and paraphrase judgments by human raters. We evaluate models on the test set of the PAWS Wiki dataset.

A.2 Experimental Settings
We experiment with the BASE and LARGE variants of ROBERTA (Liu et al., 2019b) and DEBERTA (He et al., 2021). Our implementation and pretrained model checkpoints use the Huggingface Transformers library (Wolf et al., 2020).

Bias-amplified split sizes Tab. 7 reports the sizes of the biased train and anti-biased test splits created with each of the three methods (Sec. 3) we experimented with.
Hyperparameters For fine-tuning, we did not optimize the hyperparameters, and instead used parameters that were included in the hyperparameter search on downstream tasks from the original papers, except for training LARGE models for 5 epochs instead of 10. We also used an early-stopping patience threshold of 3 epochs. We report all fine-tuning hyperparameters in Tab. 8 and Tab. 9.
Average runtimes For ROBERTA-BASE, each training run was performed on a single RTX 2080Ti GPU (10GB). For all other models, each training run was performed on a single Quadro RTX 6000 GPU (24GB). We report average runtimes (training and inference combined) in Tab. 10.

B.1 Main Results for DEBERTA
Tab. 5 shows our results for DEBERTA-LARGE for the experiment described in Sec. 4.1.

C Clustering Algorithm for Detecting Minority Examples
Minority examples (Tu et al., 2020; Sagawa et al., 2020) are often detected by searching for spurious features correlated with one label in the instances of another label (e.g., high word overlap between two non-paraphrase texts). Motivated by recent work that leverages [CLS] token similarity in fine-tuned models between different instances (Liu et al., 2022b; Pezeshkpour et al., 2022), we proposed a model-based clustering approach to automatically detect minority examples (Sec. 3). Our approach is based on simple analyses applied to the clustering of a given dataset's [CLS] model representations. In this work we used the deep clustering algorithm described in Sec. 3, DEEPCLUSTER (Caron et al., 2018), to perform the clustering. In this appendix we provide more details on the algorithm (App. C.1) and its implementation (App. C.2).

C.1 Algorithm
To adapt DEEPCLUSTER to Transformer-like models, we consider a model fine-tuned on the dataset to be clustered. We extract and cluster its [CLS] token representations using a standard clustering algorithm, and then perform one DEEPCLUSTER iteration by fine-tuning a new pretrained model with the pseudo-labels (instead of the dataset's gold labels) for one epoch. We then cluster the representations from this second model to obtain the final clustering.
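The one-iteration procedure above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `extract_cls` is a hypothetical callable standing in for [CLS] feature extraction, the tiny k-means stands in for the standard clustering step, and the fine-tuning of the second model is stubbed out.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Tiny k-means standing in for the standard clustering step."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return labels

def one_deepcluster_iteration(extract_cls, dataset, k=10, m=1500):
    """A sketch of the procedure described above:
    (1) cluster the fine-tuned model's [CLS] features into m pseudo-labels,
    (2) fine-tune a fresh pretrained model on the pseudo-labels (stubbed here),
    (3) re-cluster the second model's features into the final k clusters."""
    feats = extract_cls(dataset)               # (n, d) [CLS] feature matrix
    m = min(m, len(feats))
    pseudo_labels = kmeans(feats, m)           # pseudo-labels for representation learning
    # ...fine-tune a new pretrained model on `pseudo_labels` for one epoch,
    # then extract its [CLS] features; here we reuse `feats` as a stand-in...
    second_feats = feats
    return kmeans(second_feats, k)             # final clustering
```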

C.2 Implementation Details
As the standard clustering algorithm at the base of DEEPCLUSTER, we use Ward's method (Ward Jr, 1963), a popular hierarchical clustering algorithm that is deterministic and therefore stable across different runs, a property we found preferable. We use the fastcluster (Müllner, 2013) Python implementation with the default settings.
Applying Ward's clustering to large-scale datasets We did not have resources with enough memory to cluster the entire training sets of MultiNLI and QQP, which contain more than 320k examples. We therefore approximate the clustering assignment by clustering a random sample of 50% of the training set, and then using a simple nearest-neighbor classifier to predict the assignments for the other 50%. For the hyperparameter search (App. C.3), we searched over m ∈ {10, 30, 50, 100, 300, 500, 1000, 1500, 3000} and representations from the last four layers of ROBERTA-BASE.
For each set of hyperparameters, we applied the minority examples method to create biased training and anti-biased test splits, and trained two ROBERTA-BASE models: one on the biased train split, and a baseline model on an equally sized random train subset, finally choosing the hyperparameters that lead to the largest performance drop on anti-biased test instances between the two. The best hyperparameters were the layer before last of ROBERTA-BASE (layer 11) and m = 1500.
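The clustering approximation described earlier in this subsection (Ward's method on a 50% random sample, 1-NN assignment for the remaining points) might be sketched as below. Note this sketch uses SciPy's Ward implementation in place of the fastcluster library used in the paper, and the function name is ours.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def ward_with_subsampling(X, k, sample_frac=0.5, seed=0):
    """Approximate Ward clustering for large datasets: run Ward's method on a
    random sample, then give each remaining point the cluster of its nearest
    sampled neighbor (1-NN)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    sample = rng.choice(n, size=int(n * sample_frac), replace=False)
    rest = np.setdiff1d(np.arange(n), sample)

    # Hierarchical clustering on the sampled half only.
    tree = linkage(X[sample], method="ward")
    sample_labels = fcluster(tree, t=k, criterion="maxclust")

    labels = np.empty(n, dtype=int)
    labels[sample] = sample_labels
    # Nearest-neighbor assignment for the held-out half.
    dists = ((X[rest][:, None, :] - X[sample][None, :, :]) ** 2).sum(-1)
    labels[rest] = sample_labels[dists.argmin(1)]
    return labels
```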

C.4.1 Using Standard Clustering to Detect Minority Examples
Our preliminary experiments show that standard clustering algorithms applied to the [CLS] representations of models fine-tuned on the original task tend to create label-homogeneous clusters, i.e., they are less likely to cluster together instances from different labels. In Fig. 5 we show the average proportions of majority and minority instances within clusters for different clusterings of SST-2 (which has two task labels) based on ROBERTA-BASE representations. We compare DEEPCLUSTER and two standard clustering algorithms: K-Means and Ward's method. We find that the clusters of standard methods contain, on average, less than 5% minority label instances, while clusters based on DEEPCLUSTER are more label-diverse and contain 15% minority label instances. When inspecting how many individual clusters contain more than 10% minority label instances, we find that for both standard methods only one cluster (out of 10) meets this threshold, whereas there are 6 such clusters with DEEPCLUSTER.
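The per-cluster analysis above amounts to counting, within each cluster, the instances whose task label differs from the cluster's majority label. A minimal sketch (function and argument names are ours):

```python
from collections import Counter

def minority_proportions(cluster_ids, task_labels):
    """For each cluster, the proportion of instances whose task label is not
    the cluster's majority label."""
    proportions = {}
    for cluster in set(cluster_ids):
        labels = [lab for cid, lab in zip(cluster_ids, task_labels) if cid == cluster]
        _, majority_count = Counter(labels).most_common(1)[0]
        proportions[cluster] = 1.0 - majority_count / len(labels)
    return proportions
```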

C.4.2 Difficulty of Minority Examples in Bias-amplification Over Random Seeds
We ran a preliminary experiment on SST-2 to examine whether the difficulty of the bias-amplified splits based on the minority examples method varies with the seed used to collect data representations. We clustered SST-2 using DEEPCLUSTER based on representations of ROBERTA-BASE. We used 3 different seeds to fine-tune the model and run DEEPCLUSTER, and created a bias-amplified split from each resulting clustering. We then examined the performance drops between a ROBERTA-BASE model trained on the biased vs. random split (as in the hyperparameter search; see App. C.3).
The mean absolute performance drop was 16.7 points, with a standard deviation of 5.9. This indicates that while there is variation between seeds, all clusterings produced challenging settings. We conclude that when seeking to create the most challenging splits, running a hyperparameter search over multiple seeds on the dataset the splits are created for would likely lead to better results. In this work, we did not optimize the clustering hyperparameters for each dataset, and therefore used one seed for all clusterings.

Figure 1 :
Figure 1: To guide the development of models robust to subtle biases, we propose to extract bias-amplified splits for existing benchmarks. Our approach first partitions a given dataset into biased and anti-biased instances. It then constructs a biased training set and an anti-biased test set, which are used to evaluate model generalization.

Figure 2 :
Figure 2: Different approaches to data collection. In standard datasets (1), the training and test sets mostly contain a majority of biased instances. Challenge sets (2) curate anti-biased test sets. Balancing and filtering methods (e.g., adversarial filtering, 3) collect unbiased training and test sets. Our framework (4) contains biased training sets and anti-biased test sets.

Figure 3 :
Figure 3: Accuracy for ROBERTA-LARGE models fine-tuned on bias-amplified splits created with the minority examples method, while gradually reinserting anti-biased instances back into the training set. Reported values are averaged across three random seeds. We interpolate and place stars (⋆) at points where the model regains 50% of its original performance. Models generalize from small amounts of anti-biased instances, but require much larger quantities to achieve comparable performance gains.

Figure 4 :
Figure 4: Accuracy for models trained on bias-amplified splits of WANLI created with the minority examples method, while gradually reinserting anti-biased instances back into the training set.
Recently proposed methods were shown to be effective in improving the out-of-distribution generalization of models, either by adjusting the training loss to account for biased instances (model debiasing; He et al. 2019; Clark et al. 2019), or by filtering the training set to increase the proportions of different kinds of instances found to be advantageous for generalization (data filtering; Le Bras et al. 2020; Yaghoobzadeh et al. 2021; Liu et al. 2021). We now examine whether such methods improve the generalization of models trained on bias-amplified training sets to anti-biased test instances. We consider a ROBERTA-LARGE model trained on bias-amplified splits of MultiNLI and QQP based on minority examples. For model debiasing, we apply the self-debiasing framework suggested by Utama et al. (2020) with example reweighting (Schuster et al., 2019) to down-weight the loss function for biased instances; for data filtering, we apply dataset cartography to identify ambiguous instances: examples for which the model's confidence in the gold label exhibits high variability across training epochs, training on the 33% most ambiguous ones, as shown to benefit generalization in Swayamdipta et al. (2020). Importantly, we apply both methods to the bias-amplified training split (rather than the original training set) and do not train on any other instances during the debiasing or filtering procedures. Our results (Tab. 4) show that neither debiasing nor filtering results in substantial improvements on anti-biased data. This indicates that such methods are less effective when training sets lack sufficient anti-biased instances, and highlights the need for methods that could improve model generalization when additional data curation is impractical. Our findings are also in line with recent results showing that various robustness interventions struggle with improving upon standard training in real-world distribution shifts (Koh et al., 2021) or dataset shifts

Table 4:
Accuracy of ROBERTA-LARGE models trained on MultiNLI and QQP with different training schemes: a biased subset, two debiasing methods applied to the biased subset, and the full training set. We use the biased and anti-biased splits created with the minority examples method. Applying model debiasing or data filtering approaches in the bias-amplified setting results in only slight improvements on the anti-biased test sets.
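As a rough sketch of the cartography-based selection described above, where variability is taken as the standard deviation of the gold-label probability across training epochs (the function and argument names are ours, not Swayamdipta et al.'s API):

```python
import numpy as np

def most_ambiguous_indices(gold_probs_per_epoch, frac=0.33):
    """Select the `frac` most ambiguous training instances, where ambiguity is
    measured as the cross-epoch standard deviation of the model's probability
    for the gold label. `gold_probs_per_epoch` has shape (epochs, n_examples)."""
    variability = np.asarray(gold_probs_per_epoch).std(axis=0)
    n_keep = int(round(len(variability) * frac))
    # Highest-variability examples first.
    return np.argsort(variability)[::-1][:n_keep]
```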

Runtime
Running DEEPCLUSTER requires (1) fine-tuning a model for 1 epoch and then extracting its representations, which takes 15-70 minutes on a GPU, and (2) clustering the representations on a CPU, which takes 40 minutes for WANLI and ANLI, and 3 hours for MultiNLI and QQP.

C.3 DEEPCLUSTER Hyperparameters
DEEPCLUSTER has three hyperparameters: the number of final clusters k, the number of pseudo-labels m for representation learning, and the Transformer layer from which [CLS] representations are extracted for clustering. We used k = 10 clusters for all datasets, and searched for a good configuration for the other two hyperparameters on SST-2 (Socher et al., 2013), which was then used for experiments on all other datasets in the paper. We searched over m ∈ {10, 30, 50, 100, 300, 500, 1000, 1500, 3000} and representations from the last four layers of ROBERTA-BASE.

Figure 5 :
Figure 5: The mean proportions of majority and minority label instances within clusters for different clusterings of SST-2 , based on the [CLS] representations of ROBERTA-BASE fine-tuned on the dataset.[CLS] tokens are taken from the layer before last of the model.
We compare against two baselines: the original training split (100% train) and a random sample of the same size as the biased training splits (random). In addition to the anti-biased test set, we also report performance on the original test set to validate that the model's training data (the biased training instances) is sufficient for learning the task.

Hyperparameter selection Our approach for identifying minority examples is based on clustering the representations of a fine-tuned model. The clustering algorithm we use, DEEPCLUSTER (Sec. 3), has three hyperparameters: the number of final clusters k, the number of pseudo-labels m for representation learning, and the Transformer layer from which [CLS] representations are extracted for clustering. We use k = 10 clusters for all datasets, and search for a good configuration for the other two hyperparameters on SST-2 (Socher et al., 2013): for each set of hyperparameters, we apply the minority examples method to create biased training and anti-biased test splits, and choose the configuration that leads to the largest performance drop on anti-biased test instances (see App. C.3).

Table 6 :
Dataset sizes. The development set of MultiNLI is the matched validation set (we did not use the mismatched validation set).

Table 7 :
Sizes of the train and test bias-amplified splits created with each of the considered methods (Sec. 3). Since the number of biased train instances is induced by the clustering in the minority examples approach, but is a hyperparameter q for the two other approaches, we simply adjust q to create equally sized training sets for all three methods. We use the same q used for choosing biased train instances when choosing anti-biased test instances. We note that for the minority examples method, the training set clustering and the predicted test set clustering (based on a simple nearest-neighbor classifier fitted on the training set) are two different clusterings, which can result in different proportions of minority examples between the train and test sets. This explains the difference in the amounts of anti-biased test instances between minority examples and the other two methods.

Table 10 :
Average runtimes for fine-tuning, in hours.