IBADR: an Iterative Bias-Aware Dataset Refinement Framework for Debiasing NLU models

As commonly-used methods for debiasing natural language understanding (NLU) models, dataset refinement approaches heavily rely on manual data analysis, and thus may be unable to cover all potential biased features. In this paper, we propose IBADR, an Iterative Bias-Aware Dataset Refinement framework, which debiases NLU models without predefining biased features. We maintain an iteratively expanded sample pool. Specifically, at each iteration, we first train a shallow model to quantify the bias degree of samples in the pool. Then, we pair each sample with a bias indicator representing its bias degree, and use these extended samples to train a sample generator. In this way, the generator effectively learns the correspondence between bias indicators and samples. Furthermore, we employ the generator to produce pseudo samples with fewer biased features by feeding it specific bias indicators. Finally, we incorporate the generated pseudo samples into the pool. Experimental results and in-depth analyses on two NLU tasks show that IBADR not only significantly outperforms existing dataset refinement approaches, achieving SOTA performance, but is also compatible with model-centric methods.


Introduction
Although neural models have made significant progress in many natural language understanding (NLU) tasks (Bowman et al., 2015; Gururangan et al., 2018), recent studies have demonstrated that these models exhibit limited generalization to out-of-distribution data and are vulnerable to various types of adversarial attacks (Dasgupta et al., 2018; McCoy et al., 2019). This is primarily due to their tendency to rely excessively on biased features, i.e., spurious surface patterns that are falsely associated with target labels, rather than capturing the underlying semantics. Consequently, how to effectively debias neural networks has become a prominent research topic, attracting increasing attention recently.
To alleviate this issue, researchers have proposed many methods that can be generally divided into two categories: model-centric mitigation approaches (Clark et al., 2019; Stacey et al., 2020; Utama et al., 2020a; Karimi Mahabadi et al., 2020; Du et al., 2021) and dataset refinement approaches (Lee et al., 2021; Wu et al., 2022; Ross et al., 2022). The former mainly focuses on designing model architectures or training objectives to prevent models from exploiting dataset biases, while the latter aims to adjust dataset distributions to reduce correlations between spurious features and labels. Of these two types of methods, dataset refinement approaches possess the advantage of not requiring modifications to the model architecture or training objective, while also being compatible with model-centric approaches. Therefore, in this work, we concentrate on dataset refinement approaches, which employ controllable generation techniques (Zhou et al., 2020; Hu et al., 2022) to refine the data distribution. However, recent studies (Lee et al., 2021; Wu et al., 2022; Ross et al., 2022) heavily rely on manual data analysis for debiasing models. They either define perturbation rules to generate adversarial examples, or generate pseudo samples and filter out samples with identified biased features. Typically, the state-of-the-art (SOTA) method (Wu et al., 2022) first generates a large number of samples and then applies z-filtering with a predefined set of biased features to eliminate samples exhibiting such features. However, these methods of manually predefining biased features may overlook some potential biased features, thus limiting their generalizability.
In this paper, we propose IBADR, an Iterative Bias-Aware Dataset Refinement framework, which iteratively generates samples to debias NLU models without predefining biased features. Under this framework, we create a sample pool initialized with the original training samples, and gradually expand it through multiple iterations. As shown in Figure 1, in each iteration, we first sort and group samples in the pool according to their bias degree, as determined by a shallow model trained on a limited set of training samples. Next, we concatenate the samples in each group with a bias indicator representing the group's bias degree. These concatenated samples are then used to train a sample generator, which effectively learns the correspondence between bias indicators and samples. Afterwards, using the same input format as in training, we feed a low-degree bias indicator to the sample generator, allowing it to generate pseudo samples with fewer biased features. Finally, we add these pseudo samples back into the sample pool and repeat the above process until the maximum number of iterations is reached.
Clearly, the above iterative process guides the sample generator towards producing samples with fewer biased features. However, we observe that the generated pseudo samples display less diversity when we feed the lowest-degree bias indicator to the sample generator. The underlying reason is that the shallow model consistently assigns a relatively low bias degree to samples with specific patterns, such as the premise directly negating the hypothesis by inserting the word "not". Consequently, the sample generator learns these patterns and tends to produce samples containing similar patterns, thereby limiting their diversity.
To address this issue, we further explore two strategies to diversify generations. First, instead of always using the lowest-degree bias indicator, we randomly select a low-degree bias indicator. In this way, the sample generator is discouraged from continually creating pseudo samples containing similar patterns, while the pseudo samples are still ensured to contain fewer biased features. Second, we dynamically update the shallow model with the newly generated pseudo samples during the iterative generation process. By doing this, we effectively reduce the assignment of the lowest-degree bias indicator to pattern-specific samples, ultimately promoting greater diversity in the generated samples.
To summarize, the main contributions of this paper are three-fold: • We propose a dataset refinement framework designed to iteratively generate pseudo samples without prior analysis of biased features.
• We present two strategies to enhance the diversity of the pseudo samples, which further boost the performance of NLU models.
• To verify the effectiveness and generality of IBADR, we conduct experiments on two NLU tasks. The experimental results show that IBADR achieves SOTA performance.

The IBADR Framework
In this section, we give a detailed description of IBADR. Under this framework, we first use a limited set of training samples to train a shallow model, which serves to measure the bias degree of samples. Then, we iteratively generate pseudo samples with fewer biased features, as illustrated in Figure 1. Finally, these pseudo samples are used to debias the NLU models via retraining.

Training a Shallow Model to Measure the Bias Degree of Samples
As investigated in (Utama et al., 2020b), a shallow model trained on a small portion of the training data tends to overfit on biased features, and is thus highly confident on samples that contain biased features. Motivated by this, we randomly select some training samples to train a shallow model, denoted as θ_s, for measuring the bias degree of samples. Let (x^(i), y^(i)) denote a training sample for an NLU task, where y^(i) is the gold label of the input x^(i). We directly use the model confidence p(y^(i) | x^(i); θ_s) to quantify the bias degree of (x^(i), y^(i)). Clearly, if p(y^(i) | x^(i); θ_s) → 1, then (x^(i), y^(i)) is more likely to be a biased sample.
Returning to our framework, our primary objective is to generate samples with a low bias degree, which can be used to reduce spurious correlations by adjusting the dataset distribution.
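A minimal sketch of this bias measure, assuming per-sample gold-label confidences have already been obtained from a shallow classifier (the sample IDs and confidence values below are illustrative stand-ins, not real model outputs):

```python
# Bias degree of a sample = the shallow model's confidence on the gold
# label, p(y | x; theta_s). Values near 1.0 suggest the sample can be
# solved from surface patterns alone, i.e. it is likely biased.

def bias_degree(gold_label_confidence: float) -> float:
    assert 0.0 <= gold_label_confidence <= 1.0
    return gold_label_confidence

# Toy confidences for three (x, y) pairs, standing in for running a
# real shallow classifier over the pool.
confidences = {"s1": 0.98, "s2": 0.55, "s3": 0.34}
degrees = {sid: bias_degree(c) for sid, c in confidences.items()}
most_biased = max(degrees, key=degrees.get)
```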

Iterative Pseudo Sample Generation
The overview of the iterative sample generation process is shown in Figure 1. During this process, we introduce a sample generator θ_g to iteratively generate pseudo samples, which are added into a sample pool S. Specifically, we initialize the sample pool S with the original training samples, and the sample generator θ_g with a generative pretrained language model. Then, we iteratively expand S via the following four stages: Step 1: Setting Bias Indicators. First, we use the above-mentioned shallow model to measure the bias degree of each sample in S, as described in Section 2.1, and then sort these samples according to their bias degree and divide them into N_bi groups of equal size. Each group is assigned a bias indicator b_n, where 1 ≤ n ≤ N_bi; b_1 represents the lowest-degree bias indicator and b_{N_bi} denotes the highest-degree one.
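Step 1 can be sketched as follows; the sample IDs and bias scores are hypothetical:

```python
# Sort pool samples by shallow-model bias degree and split them into
# N_bi equal-size groups, assigning bias indicator 1 (lowest degree)
# through N_bi (highest degree).

def assign_bias_indicators(pool, n_bi):
    """pool: list of (sample_id, bias_degree). Returns {sample_id: n},
    where n in 1..n_bi and n = 1 marks the least-biased group."""
    assert len(pool) >= n_bi
    ranked = sorted(pool, key=lambda s: s[1])  # ascending bias degree
    group_size = len(ranked) // n_bi
    indicators = {}
    for idx, (sid, _) in enumerate(ranked):
        # clamp so any remainder falls into the last (most biased) group
        indicators[sid] = min(idx // group_size + 1, n_bi)
    return indicators

pool = [("a", 0.95), ("b", 0.20), ("c", 0.60),
        ("d", 0.35), ("e", 0.80), ("f", 0.10)]
ind = assign_bias_indicators(pool, n_bi=3)
# "f" and "b" (lowest degrees) land in group 1; "e" and "a" in group 3.
```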
Step 2: Finetuning Sample Generator. Then, we use the samples in S to finetune the sample generator θ_g via the following loss function:

L(θ_g) = − Σ_i log p_g(x^(i) | b^(i), y^(i); θ_g),

where b^(i) represents the bias indicator assigned to the training sample (x^(i), y^(i)). Through training with this objective, the generator can effectively learn the correspondence between bias indicators and samples. Furthermore, in the subsequent stages, we can specify both the bias indicator and the label to control the generation of pseudo samples.
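A sketch of the Step 2 training setup. Each sample is serialized with its bias indicator and label as a control prefix; the "<b_n>"/"</s>" tag format is an illustrative assumption, not the paper's exact serialization. The toy loss mirrors the generator objective L(θ_g) = −Σ_i log p_g(x^(i) | b^(i), y^(i); θ_g):

```python
import math

def serialize(bias_indicator, label, premise, hypothesis):
    # Assumed control-prefix format: indicator and label come first so
    # the generator can condition on them.
    return f"<b_{bias_indicator}> <{label}> {premise} </s> {hypothesis}"

def nll(token_probs):
    """Negative log-likelihood of the gold tokens of x under theta_g."""
    return -sum(math.log(p) for p in token_probs)

seq = serialize(2, "entailment", "A man sleeps.", "Someone is asleep.")
loss = nll([0.9, 0.8, 0.95])   # toy per-token probabilities
```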
Step 3: Generating Pseudo Samples. Next, we designate a bias indicator b̄ representing a low degree of bias, and feed it together with a randomly-selected NLI label ȳ into the generator θ_g. This allows us to form a pseudo sample (x̄, ȳ) by sampling x̄ from the generator output distribution p_g(· | b̄, ȳ; θ_g). By repeating this sampling process, we obtain a set of generated pseudo samples with fewer biased features.
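A toy mock of Step 3: build a control prefix from a low-degree bias indicator and a randomly chosen NLI label, then "sample" from a stand-in generator. In a real run, mock_generate would be replaced by autoregressive sampling from p_g(· | b̄, ȳ; θ_g); the function names here are hypothetical:

```python
import random

random.seed(0)
LABELS = ["entailment", "neutral", "contradiction"]

def mock_generate(prefix):
    # stand-in for decoding from the finetuned language model
    return f"[pseudo text sampled with prefix {prefix}]"

def generate_pseudo_sample(low_bias_indicator):
    label = random.choice(LABELS)          # randomly-selected NLI label
    prefix = f"<b_{low_bias_indicator}> <{label}>"
    return mock_generate(prefix), label

text, label = generate_pseudo_sample(low_bias_indicator=1)
```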
Step 4: Expanding Sample Pool. Subsequently, to ensure the quality of the generated pseudo samples, we follow Wu et al. (2022) and filter out generated pseudo samples whose model confidence is lower than a threshold ϵ, incorporating the remaining pseudo samples back into S.
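The Step 4 quality filter reduces to a one-line confidence threshold; the confidence values below are illustrative stand-ins for model outputs:

```python
# Keep only pseudo samples whose model confidence on the assigned
# label reaches the threshold epsilon, mirroring the filter adopted
# from Wu et al. (2022).

def filter_pseudo_samples(samples, epsilon=0.7):
    """samples: list of (text, label, confidence)."""
    return [(t, y) for (t, y, c) in samples if c >= epsilon]

candidates = [("p1", "entailment", 0.91),
              ("p2", "neutral", 0.42),
              ("p3", "contradiction", 0.77)]
kept = filter_pseudo_samples(candidates, epsilon=0.7)
```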
After N_iter iterations of the above steps, our sample pool contains not only the original training samples, but also abundant pseudo samples with fewer biased features. Finally, we debias the NLU model via retraining on these samples.
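Putting the four stages together, the overall loop can be sketched as follows, with each stage mocked by a placeholder function; all names are illustrative, not the authors' code:

```python
def iterative_refinement(pool, n_iter, assign_indicators,
                         finetune_generator, generate, quality_filter):
    for _ in range(n_iter):
        indicators = assign_indicators(pool)              # Step 1
        generator = finetune_generator(pool, indicators)  # Step 2
        pseudo = generate(generator)                      # Step 3
        pool = pool + quality_filter(pseudo)              # Step 4
    return pool

# Toy instantiation: each "iteration" adds two filtered pseudo samples.
grown = iterative_refinement(
    pool=["orig1", "orig2"],
    n_iter=5,
    assign_indicators=lambda p: {s: 1 for s in p},
    finetune_generator=lambda p, ind: None,
    generate=lambda g: ["good", "bad", "good"],
    quality_filter=lambda ps: [s for s in ps if s == "good"],
)
```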

Diversifying Pseudo Samples
Intuitively, the most direct way is to set the above specified bias indicator b̄ to b_1, which denotes the lowest bias degree. However, we observe that the pseudo samples generated in this way lack diversity and fail to cover diverse biased features. The reason is that pseudo samples generated with b_1 always follow certain patterns, exhibiting less diversity than those assigned other bias indicators. For example, the premise directly negates the hypothesis using the word "not". Consequently, this results in spurious correlations between b_1 and these patterns. Hence, the generator tends to generate samples following these patterns and fails to generate samples that encompass a broader range of biased features.
To address this issue, we employ the following two strategies: (i) Instead of using the lowest-degree bias indicator b_1, we use a randomly-selected low-degree bias indicator b̄ = b_r, where 1 ≤ r ≤ N_bi/2, and feed it into the generator during the iterative generation process. Upon human inspection, we observe that the generated pseudo samples not only become more diverse but also still contain relatively few biased features. (ii) During the generation process, we update the shallow model θ_s using a randomly-extracted portion of S at each iteration. This prevents the shallow model from consistently assigning a low bias degree to pseudo samples that follow previously-appeared patterns, thereby enhancing the diversity of the pseudo samples.
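The two strategies can be sketched as follows: (i) draw the bias indicator uniformly from the lower half 1..N_bi/2 rather than always using b_1, and (ii) periodically refresh the shallow model on a random slice of the current pool. The refresh here is represented only by re-sampling the training subset; actual finetuning of θ_s is omitted:

```python
import random

def pick_low_degree_indicator(n_bi, rng):
    # strategy (i): any indicator in the lower half, not just b_1
    return rng.randint(1, n_bi // 2)

def sample_refresh_subset(pool, k, rng):
    # strategy (ii): a random subset of S used to update theta_s
    return rng.sample(pool, k)

rng = random.Random(42)
picks = {pick_low_degree_indicator(5, rng) for _ in range(100)}
subset = sample_refresh_subset(list(range(1000)), k=10, rng=rng)
```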

Setup
Tasks and Datasets. We conduct experiments on two NLU tasks: natural language inference and fact verification.
• Natural Language Inference (NLI). This task aims to predict the entailment relationship between a pair of premise and hypothesis. We conduct experiments using the MNLI (Williams et al., 2018) and SNLI (Bowman et al., 2015) datasets. Following previous studies (Stacey et al., 2020; Wu et al., 2022; Lyu et al., 2023), in addition to the development sets, we evaluate IBADR on the corresponding challenge sets for MNLI and SNLI, namely HANS (McCoy et al., 2019) and the Scramble Test (Dasgupta et al., 2018), respectively. These two challenge sets are specifically designed to assess whether the model relies on syntactic and word-overlap biases to make predictions.
• Fact Verification. This task is designed to determine whether a textual claim is supported or refuted by the provided evidence text. We select FEVER (Thorne et al., 2018) as our original dataset and evaluate model performance on the development set and two challenge sets: FeverSymmetric V1 and V2 (Symm.v1 and Symm.v2) (Schuster et al., 2019a), both of which are developed to mitigate biases stemming from claim-only data.
Baselines. We compare IBADR with the following baselines: • CrossAug (Lee et al., 2021). This method tackles negation bias in the fact verification task through a contrastive data augmentation method.
• z-filter (Wu et al., 2022). It first defines a set of task-relevant biased features, and then trains a generator on existing datasets to generate pseudo samples, from which pseudo samples with these biased features are filtered out. Finally, the remaining samples are used to retrain the model.
• Products-of-Experts (PoE) (He et al., 2019; Karimi Mahabadi et al., 2020). In an ensemble manner, it trains a debiased model together with a bias-only one, whose predictions heavily rely on biased features. By doing so, the debiased model is encouraged to focus on samples with fewer biased features, on which the bias-only model performs poorly.
• Confidence Regularization (Conf-reg) (Utama et al., 2020a). This method trains a debiased model by increasing the uncertainty on samples with biased features. It first trains a bias-only model to quantify the bias degree of each sample, and then scales the output distribution of a teacher model based on the bias degree, where the re-scaled distribution can be used to enhance the debiased model.
• Example Reweighting (Reweight) (Schuster et al., 2019b). This method aims to reduce the contribution of samples with biased features to the training loss by assigning them relatively small weights.
Please note that except for CrossAug and z-filter, which are dataset refinement approaches, all other baselines are model-centric.
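Two of the model-centric baselines above admit compact numeric sketches. Both are simplified illustrative forms rather than the exact published objectives: PoE multiplies the two experts' distributions and renormalizes, and example reweighting scales each sample's loss by one minus its bias degree (a common weighting choice, assumed here):

```python
def poe_combine(p_main, p_bias):
    """Products-of-Experts: element-wise product of the debiased and
    bias-only distributions, renormalized. Training on the combined
    output down-weights samples the bias-only model already solves."""
    joint = [m * b for m, b in zip(p_main, p_bias)]
    z = sum(joint)
    return [j / z for j in joint]

def reweighted_loss(per_sample_nll, bias_degrees):
    """Example reweighting: heavily biased samples contribute little."""
    weighted = [(1.0 - b) * l for b, l in zip(bias_degrees, per_sample_nll)]
    return sum(weighted) / len(weighted)

# Toy 3-class example: the bias-only model is confident on class 0.
combined = poe_combine([0.5, 0.3, 0.2], [0.8, 0.1, 0.1])
# Toy batch of three samples; the first is highly biased.
loss = reweighted_loss([2.0, 0.5, 1.0], [0.9, 0.1, 0.5])
```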
Implementation Details. In our experiments, we use GPT2-large (Radford et al., 2019) to construct both the sample generator and the shallow model. To train the shallow models for the different NLU tasks, we randomly select 2K, 2K, and 0.5K samples from the original training sets of MNLI, SNLI, and FEVER, respectively. These shallow models are trained for 3 epochs with a learning rate of 5e-5. When training the sample generator, we set the learning rate to 5e-5, the number of pseudo samples generated per iteration to 200K, and the iteration number N_iter to 5. In particular, we train the sample generator for 3 epochs in the first iteration and for only 1 epoch in the subsequent iterations. When updating the shallow model, we randomly select 2K samples from the sample pool to finetune it.
For the NLU models, we train them on the augmented datasets of the different tasks for 8 epochs using a learning rate of 1e-5, employing an early-stop strategy during training. We run all experiments three times with different random seeds and report the average results. Sample numbers of the augmented datasets are listed in Table 1.

Effect of Bias Indicator Number N bi
The bias indicator number N_bi is an important hyper-parameter in our framework, as it determines the partition granularity of the sample pool. Thus, we gradually vary N_bi from 3 to 9 with an increment of 2 at each step, and compare the model performance on the development sets of MNLI.
As shown in Table 2, when N_bi is set to a smaller value, such as 3, there is a significant decrease in model performance. This is because in this case, the low-degree bias indicator can only be set to b_1, which reduces the diversity of the generated pseudo samples, as discussed in Section 2.3. Conversely, when N_bi is set to larger values, e.g., 7 or 9, the model performance on both development sets also decreases. We hypothesize that this decline occurs because a larger N_bi results in a finer-grained partition of the sample pool, reducing the number of samples corresponding to each specific bias indicator. Consequently, this weakens the correspondence between bias indicators and samples, and thus harms the performance of the sample generator. According to these results, we set N_bi to 5 for all subsequent experiments.

Main Results
Table 3 presents the experimental results. Overall, compared with all baselines, IBADR achieves the most significant improvements on the challenge sets (i.e., HANS, Scramble, Symm.v1, and Symm.v2). Specifically, IBADR achieves improvements of 2.57, 4.44, 1.11 and 4.16 points over the previously-reported best results, respectively. Note that IBADR is effective on both the development and challenge sets of MNLI and FEVER, while other baselines, for example, Reweight and z-filter, decline on the development sets.

Compatibility of IBADR with PoE
To assess the compatibility of IBADR with model-centric debiasing methods, we report the model performance when simultaneously using IBADR and PoE (He et al., 2019), following the setting of Wu et al. (2022).
As shown in Table 4, on the challenge sets, the combination of IBADR and PoE not only yields better results than using PoE or IBADR individually, but also outperforms the combination of z-filter and PoE. Thus, we believe that IBADR has the potential to further enhance the performance of existing model-centric methods.

Ablation Study
To assess the effects of the special designs in IBADR, we also report the performance of several IBADR variants on MNLI: • w/o USM. In this variant, we do not Update the Shallow Model during the process of iterative sample generation.
• b_r ⇒ b_1. The sample generator uses the lowest-degree bias indicator b_1 rather than a randomly-selected low-degree bias indicator b_r, where 1 ≤ r ≤ N_bi/2, to generate pseudo samples. • w/o USM & b_r ⇒ b_1. In this variant, the sample generator uses the bias indicator b_1 to generate pseudo samples, and the shallow model remains fixed during the generation process.
• w/o bias indicator. This variant directly uses the samples without bias indicators to train the sample generator.
• w/o iterative generation. Instead of generating pseudo samples iteratively, we only utilize the sample generator trained in the first iteration to generate pseudo samples.
As shown in Table 5, all variants exhibit performance declines on HANS, indicating the effectiveness of our special designs. In particular, w/o bias indicator demonstrates the most significant performance drop, which is intuitive since the bias indicator guides the sample generator to produce pseudo samples with fewer biased features.
Without the bias indicators, the generated pseudo samples will contain undesired biased features, resulting in poorer performance on the challenge set.

Adversarial Tests for Combating Distinct Biases in NLI
As shown in (Liu et al., 2020), current debiasing approaches primarily concentrate on addressing known biases, and thus might fail to mitigate unknown biases in NLI tasks. To assess the robustness of NLI models trained with limited data, we randomly select two subsets from the original training samples of MNLI, with sizes of 100K and 200K, respectively. Afterwards, we employ IBADR to augment these subsets and retrain NLU models on the augmented datasets. Table 8 presents the results on both the development and challenge sets. We observe that IBADR consistently improves model performance across all test sets. Notably, even when using a limited number of original training samples, e.g., 100K, the model trained on the IBADR-augmented dataset outperforms the full-size baseline. This suggests that IBADR remains remarkably robust with limited original training samples.

The Effect of Augmented Dataset Size
To explore the influence of the augmented dataset size, we retrain the NLU model on the MNLI dataset with different numbers of augmented samples: 10K, 100K, 300K, 600K, and 900K. As indicated in Table 9, the performance of IBADR consistently improves as the augmented dataset grows. Moreover, with just 100K augmented samples, IBADR already outperforms z-filter across dev-m, dev-mm, and HANS, even though z-filter utilizes a larger set of 360K augmented samples.

The Compatibility with Advanced Language Models
To ensure a fair comparison with z-filter, we employ GPT-2 Large as the sample generator in our main study. To explore IBADR's compatibility with advanced large language models (LLMs), we finetune the LLaMA-7b model (Touvron et al., 2023) using LoRA (Hu et al., 2021) as an alternative sample generator. The results on the MNLI dataset are listed in Table 10. We observe that the performance of IBADR is further improved with LLaMA-7b, which indicates IBADR's generalizability.

Related Work
Our related work primarily focuses on two categories of methods: model-centric and dataset refinement methods.

Model-centric Data Debiasing Methods
Numerous previous studies have adopted model-centric approaches to address biases in NLU models, such as example reweighting (Schuster et al., 2019b), confidence regularization (Utama et al., 2020a), and products-of-experts (PoE) (Clark et al., 2019; He et al., 2019; Karimi Mahabadi et al., 2020). Typically, these methods follow a two-stage paradigm. In the first stage, a bias-only model is trained, either automatically (Utama et al., 2020c; Geirhos et al., 2020; Sanh et al., 2021) or by leveraging prior knowledge about the bias (Clark et al., 2019; He et al., 2019; Belinkov et al., 2019). Then, in the second stage, the output of the bias-only model is utilized to adjust the loss function of the debiased model. Recently, Lyu et al. (2023) propose a novel approach using contrastive learning to capture the dynamic influence of biases and effectively reduce biased features, offering an alternative perspective on addressing bias in NLU models. Wang et al. (2023) observe that the lower layers of Transformer models tend to capture biased features. They introduce residual connections to integrate low-layer representations with top-layer ones, thus minimizing the impact of biased features on the top layer.

Dataset Refinement
Several studies have explored generative data augmentation methods to enhance model robustness in various domains. Lee et al. (2021) train a generator to produce new claims and evidence for debiasing fact verification datasets like FEVER. Ross et al. (2022) introduce TAILOR, a semantically-controlled perturbation method for data augmentation based on manually defined perturbation strategies. Wu et al. (2022) identify a set of biased features via z-statistics, and then adjust the distribution of the generated samples by post-hoc filtering to remove those with biased features. Unlike these approaches, our framework does not require data analysis to define biased features or manual perturbation rules, and hence achieves better generalizability.

Conclusions
In this work, we propose IBADR, an iterative dataset refinement framework for debiasing NLU models. Under this framework, we train a shallow model to quantify the bias degree of samples, and then iteratively generate pseudo samples with fewer biased features, which can be used to debias the model via retraining. We also introduce two strategies to enhance the diversity of the generated pseudo samples, further improving model performance. Through extensive experiments on two tasks, IBADR consistently shows superior performance compared to baseline methods. Besides, IBADR can better handle unknown biased features and has good compatibility with larger language models.
In the future, we will explore the compatibility of IBADR with other large language models, such as GPT4 (OpenAI, 2023).

Limitations
This framework has the following limitations: (i) Despite filtering out pseudo samples with low model confidence, IBADR might still produce pseudo samples with incorrect labels, which limits model performance; (ii) We only conduct experiments on NLU tasks, leaving its applicability to a wider range of tasks unexplored.

Ethics Statement
This paper proposes a dataset refinement framework that aims to adjust dataset distributions in order to mitigate data bias. All the datasets used in this paper are publicly available and widely adopted by researchers to test the performance of debiasing frameworks. Additionally, this paper does not involve any data collection or release, thus eliminating any privacy concerns. Overall, this study will not pose any ethical issues.

Figure 1 :
Figure 1: Overview of the iterative sample generation process, which consists of four key stages: ① Setting bias indicators; ② Finetuning the sample generator; ③ Generating pseudo samples; and ④ Expanding the sample pool. Through N_iter iterations of these steps, we continuously augment the sample pool with pseudo samples, which can be effectively employed to debias the NLU models.

Table 1 :
Sample numbers of the constructed augmented datasets for MNLI, SNLI, and FEVER.

Table 2 :
Results on the development sets of MNLI with different numbers of bias indicators.

Table 3 :
Results on the development and challenge sets (HANS, Scramble, Symm.v1, Symm.v2) of MNLI, SNLI and FEVER. * means the results are directly cited from previous studies. Note that IBADR outperforms all baselines on the challenge sets, while maintaining comparable or better performance on the development sets. † indicates the results are significantly better than the best comparison method (p < 0.001).

Table 4 :
Results on the development and challenge sets of MNLI, SNLI, and FEVER. The combination of IBADR with PoE significantly enhances model performance on the challenge sets, surpassing all baselines.

Table 8 :
Results on MNLI when using different sizes of original training samples.

Table 9 :
Results on MNLI when using different sizes of augmented datasets.