Kernel-Whitening: Overcome Dataset Bias with Isotropic Sentence Embedding

Dataset bias has attracted increasing attention recently for its detrimental effect on the generalization ability of fine-tuned models. The current mainstream solution is designing an additional shallow model to pre-identify biased instances. However, such two-stage methods scale up the computational complexity of the training process and obstruct valid feature information while mitigating bias. To address this issue, we utilize the representation normalization method, which aims at disentangling the correlations between features of encoded sentences. We find it also promising in eliminating the bias problem by providing an isotropic data distribution. We further propose Kernel-Whitening, a Nyström kernel approximation method, to achieve more thorough debiasing on nonlinear spurious correlations. Our framework is end-to-end, with time consumption similar to fine-tuning. Experiments show that Kernel-Whitening significantly improves the performance of BERT on out-of-distribution datasets while maintaining in-distribution accuracy.


Introduction
Despite remarkable performance on NLP tasks, pretrained language models, like BERT, suffer sharp performance degradation in out-of-distribution (OOD) settings (McCoy et al., 2019). The above defect is rooted in an excessive reliance on spurious correlations, which are widely found in crowdsourcing-built datasets (Gururangan et al., 2018). These phenomena are denoted as the dataset bias problem (He et al., 2019). A line of works attempts to tackle this problem by down-weighting biased training examples to discourage the main model from adopting recognized biases, including example reweighting (Schuster et al., 2019), confidence regularization (Utama et al., 2020a), and model ensembling (Clark et al., 2019). As illustrated in Figure 1, an uneven sample distribution induces a biased decision boundary, resulting in errors on out-of-distribution data, while the normalization method maps the data to an isotropic latent space, where the new boundary is uncorrelated with redundant features.
However, the aforementioned methods over-depend on researchers' intuition and task-specific insights to characterize spurious correlations, causing unrecognized bias patterns to remain in individual datasets (Sharma et al., 2018). The assumption that dataset biases are known a priori has been relaxed in recent works by limited-capacity models (Utama et al., 2020b) or early training (Tu et al., 2020). These approaches still rely on extra shallow models, which are not end-to-end, and down-weighting biased samples simultaneously obstructs learning from their non-biased parts (Wen et al., 2021).
Instead of designing an extra model as previous attempts did, in this work, we propose a novel end-to-end framework, Kernel-Whitening, to significantly improve OOD performance while maintaining a computational cost similar to conventional BERT fine-tuning. The BERT-whitening (Su et al., 2021) and BERT-flow (Li et al., 2020) methods are effective normalization techniques for obtaining better semantic representations. BERT-whitening calculates a linear operator with SVD decomposition (Golub and Reinsch, 1970) to transform the sentence representation to follow the standard normal distribution. BERT-flow introduces normalizing flows (Rezende and Mohamed, 2015) to perform similar transformations. In particular, we find that the normalization method is also promising for improving the generalization ability of fine-tuned models by eliminating spurious correlations in training datasets. Despite the significant improvement on OOD datasets, the linear transformation of BERT-whitening is not capable of dealing with nonlinear dependencies between features. Meanwhile, flow-based methods require a complex inference process, which scales up training costs.
In an attempt to eliminate nonlinear correlations while maintaining low training expenditure, we introduce kernel methods to naturally reconstruct a set of sentence representations with only linear correlations (Achlioptas et al., 2001). However, traditional kernel methods focus only on data similarities without providing explicit mapping operators; therefore, we use the Nyström approximation (Xu et al., 2015) to obtain low-rank kernel estimations. In general, we transform the training data to an isotropic Gaussian distribution without affecting the topological relationships between data points.
Kernel-Whitening achieves competitive performance on generalization tasks. Experiments on eight datasets demonstrate that our method can improve accuracy by 7%-11% on OOD datasets. In addition, the analysis of sentence representations proves that our method effectively removes the spurious correlations between dimensional features, which are known to be the direct cause of the dataset bias problem. Overall, our main contributions are as follows: • We propose a novel framework, Kernel-Whitening, which ameliorates the bias problem by transforming sentence representations into an isotropic distribution with training time similar to fine-tuning.
• We introduce a kernel estimation algorithm, i.e., the Nyström approximation, to free normalization methods from the trade-off between computational complexity and disentangling effect.
• We conduct comprehensive experiments on debiasing tasks to verify the effectiveness of normalization methods for overcoming spurious correlations.

How Do Normalization Methods Provide Better Generalization
In this section, we discuss the negative impacts of dataset bias on the model's generalization ability, and subsequently, how normalization methods lead to better performance in OOD settings.

Illustrating Dataset Bias from a Feature Perspective
We first interpret dataset bias as triggered by the imbalanced distribution of training data in feature space. Figure 2 shows an empirical analysis on the MNLI dataset (Williams et al., 2018). Both train and test sets exhibit a strong positive correlation between the word overlap ratio and the class portion of the entailment label, which enables models to achieve high accuracy scores without modelling semantic information. McCoy et al. (2019) proposed HANS, a label-fair test set (i.e., labels are proportionally consistent across different degrees of word overlap) to investigate the generalizability of models fine-tuned on the above-mentioned bias; models suffer over 20% accuracy degradation when the specific literal heuristic cannot be utilized for prediction. Formally, given input data (X, Y) and a biased dataset D, the training process can be formulated as follows, where X_L represents the task-related features and X_P represents other irrelevant features. That is, the model learns X_L from the label pair (X, Y). By modeling the conditional distribution of labels given the input, the model extracts valid features for specific tasks. However, previous works argue that the construction process of a biased dataset introduces spurious correlations between (X_L|D) and (X_P|D) (Gururangan et al., 2018). Therefore, the actual training objective on dataset D is the posterior between the feature distribution (X_L, X_P|D) and the label Y, i.e., P(Y|X_L, X_P; D). By Eq. 2, irrelevant features increase the confidence for specific labels. Such overconfidence does not perturb the model's effect on the test set, which has a similar distribution to the training set. However, out-of-distribution data follow the correct distribution P(Y|X_L), and are therefore classified with the biased label (e.g., "entailment" in the MNLI dataset), even if they have different relations.
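The mechanism above can be reproduced in miniature. The following toy experiment (our own illustrative sketch, not from the paper: the feature names, data-generating process, and the plain-numpy logistic regression are all assumptions) trains a classifier on data where an irrelevant feature x_p is spuriously correlated with the label, then evaluates it on data where that correlation is broken:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, biased):
    """y in {0, 1}; x_l is a valid (task-related) feature, x_p an
    irrelevant one. In the biased training set, x_p is spuriously
    correlated with y; in the OOD set the correlation is broken."""
    y = rng.integers(0, 2, n)
    x_l = y + rng.normal(0.0, 1.0, n)        # weak but valid signal
    x_p = y + rng.normal(0.0, 0.3, n) if biased else rng.normal(0.0, 1.0, n)
    return np.stack([x_l, x_p, np.ones(n)], axis=1), y  # last column: bias term

def fit_logreg(X, y, lr=0.5, steps=3000):
    """Plain-gradient-descent logistic regression."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)     # gradient of cross-entropy
    return w

X_tr, y_tr = make_data(4000, biased=True)
w = fit_logreg(X_tr, y_tr)

X_id, y_id = make_data(4000, biased=True)    # same (biased) distribution
X_ood, y_ood = make_data(4000, biased=False) # spurious correlation removed
id_acc = np.mean((X_id @ w > 0) == y_id)
ood_acc = np.mean((X_ood @ w > 0) == y_ood)
print(f"ID accuracy: {id_acc:.2f}, OOD accuracy: {ood_acc:.2f}")
```

Because x_p carries a much stronger (but spurious) signal than x_l during training, the learned weights lean on x_p, and accuracy collapses once the correlation disappears.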

Isotropic Representation Leads to Better Generalization
According to the definition described in Eq. 3, dataset bias causes deep networks to fit the dataset-specific distribution, which impairs generalization performance. Normalization methods intervene in the above problem by reconstructing the feature space. When sentences are encoded by the pre-trained model, the embedding representations are transformed into an isotropic distribution, e.g., the standard normal distribution. Suppose data x and prior u satisfy u = f(x), where U represents the latent space of the isotropic u, and f is an invertible function. The probabilistic density function of the original data on the transformed space can then be calculated by the change-of-variables formula. By Eq. 5, the distribution of the training data is transformed into an isotropic one. Subsequently, supposing X_L^u and X_P^u are the latent representations of X_L and X_P, the spurious correlations between valid features and invalid features are eliminated. With an isotropic data distribution, a decision boundary independent of the redundant features is obtained, which provides better generalization on OOD samples.
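For the linear case (the transformation used by BERT-whitening-style methods), the decorrelating effect can be checked directly. This is a minimal numpy sketch under our own toy setup — the covariance values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Anisotropic "training distribution": two strongly correlated features
cov = np.array([[2.0, 1.8],
                [1.8, 2.0]])
X = rng.multivariate_normal([0.0, 0.0], cov, size=5000)

# Linear whitening: u = (x - mu) @ W with W = V diag(1/sqrt(s)),
# where S = V diag(s) V^T is the eigendecomposition of the covariance
mu = X.mean(axis=0)
S = np.cov(X - mu, rowvar=False)
s, V = np.linalg.eigh(S)
W = V @ np.diag(1.0 / np.sqrt(s))
U = (X - mu) @ W

print(np.round(np.cov(U, rowvar=False), 2))  # ~ identity: features decorrelated
```

After the transform, the covariance of the embedded data is the identity matrix, i.e., the distribution is isotropic and the (linear) correlation between the two features is gone.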

Minor Weakness for BERT-flow and BERT-whitening
The BERT-flow method learns a flow-based generative model to fit the transform function f_θ, and the BERT-whitening method computes the inverse linear operator with an SVD decomposition of the covariance matrix. Despite the decent effect of representation normalization methods, BERT-flow requires multiple convolutional layers to find the appropriate transformation function, which increases the difficulty and time consumption of the training process. When fine-tuned on a tiny dataset, the flow layer encounters obstacles in providing reasonable transform results. Moreover, the BERT-whitening method focuses only on eliminating linear correlations between features, which is ineffective in alleviating the nonlinear correlation problem. In an attempt to ease the effort of training and provide a faster but thorough transformation, we propose a novel normalization framework based on kernel approximation, which is discussed in detail in the next section.

Distribution Generalization with Kernel Approximation
In this section, we present our end-to-end framework named Kernel-Whitening. We first introduce the Nyström kernel estimation algorithm, and subsequently show how to apply this approximation method to debiasing tasks.

Nyström Kernel Estimation
We first elaborate on the kernel trick, which constructs a linearly separable structure by mapping the original feature space onto a high-dimensional RKHS (Alvarez et al., 2012). Given a set of training data X = {x_i ∈ R^d, i = 1, ..., n}, the kernel method maps X onto a dot product space H using ϕ : X → H. Generally, the dimension of H can be so large that the mapping function cannot be obtained explicitly. Nevertheless, the dot product result can be represented by a positive definite kernel k, i.e., a function satisfying k(x, x') = Σ_{i=1}^{N} λ_i ϕ_i(x) ϕ_i(x'), where λ_i and ϕ_i denote the eigenvalues and eigenfunctions of the kernel operator k, and N denotes their number.
With the finite dataset {x_i ∈ R^d, i = 1, ..., n}, such a decomposition can be replaced with an empirical estimation. Eq. 8 indicates a spectral decomposition of the kernel matrix G, which satisfies G_{k,j} = k(x_k, x_j). Consider the SVD decomposition of G as G = WΣW^T, where W is an orthogonal matrix and Σ is a diagonal matrix with positive diagonal elements. Matching Eq. 8 with Eq. 9, the mapping operator is therefore denoted as ϕ(X) = WΣ^{1/2}. However, existing datasets often contain thousands to hundreds of thousands of samples, which makes it impossible to directly calculate the SVD decomposition. Therefore, we introduce the Nyström method (Williams and Seeger, 2001) to provide a low-rank estimation of the kernel matrix.
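The exact feature map recovered from the decomposition of the Gram matrix can be verified numerically. The snippet below is an illustrative sketch (the RBF bandwidth and data sizes are our own choices): it eigendecomposes G and checks that Φ = WΣ^{1/2} reproduces G as an inner-product matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))

def rbf(A, B, gamma=0.5):
    """RBF kernel matrix k(a, b) = exp(-gamma * ||a - b||^2)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

G = rbf(X, X)                    # Gram matrix, G[k, j] = k(x_k, x_j)
s, W = np.linalg.eigh(G)         # symmetric PSD: G = W diag(s) W^T
s = np.clip(s, 0.0, None)        # clip tiny negative eigenvalues (round-off)
Phi = W * np.sqrt(s)             # explicit map Phi = W diag(s)^(1/2)
ok = np.allclose(Phi @ Phi.T, G, atol=1e-8)
print(ok)
```

This explicit map is only feasible for small n, since it requires the full n × n decomposition — exactly the bottleneck that motivates the Nyström approximation below.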
Suppose S is a sampled subset of X. The kernel matrix G can then be represented as G ≈ G_{:,S} G_s^{-1} G_{S,:}, where G_s denotes the Gram matrix of subset S. The W and Σ can subsequently be approximated from the SVD decomposition of G_s. The reconstructed Nyström representation of a single example x is ϕ(x) = Σ_s^{-1/2} W_s^T k_S(x), where k_S(x) = [k(x, s_1), ..., k(x, s_m)]^T. By estimating high-dimensional representations of the training samples, we obtain a linearly divisible distribution, which can be normalized with a linear transformation.
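A minimal sketch of the standard Nyström approximation (subset size, kernel bandwidth, and data are illustrative assumptions, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 8))

def rbf(A, B, gamma=0.05):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

m = 64                                    # size of the sampled subset S
S = X[rng.choice(len(X), m, replace=False)]
G_s = rbf(S, S)                           # m x m Gram matrix of the subset
s, W_s = np.linalg.eigh(G_s)
keep = s > 1e-10                          # drop numerically null directions
s, W_s = s[keep], W_s[:, keep]

# Nystrom feature map: phi(x) = diag(s)^(-1/2) W_s^T [k(x, s_1), ..., k(x, s_m)]
C = rbf(X, S)                             # n x m cross-kernel
Phi = C @ W_s / np.sqrt(s)                # n x rank reconstructed features

G_full = rbf(X, X)
err = np.abs(Phi @ Phi.T - G_full).mean() # quality of the low-rank estimate
print(f"mean abs reconstruction error: {err:.4f}")
```

Note that Phi @ Phi.T equals C G_s^{-1} C^T, i.e., the low-rank estimate of G, so only an m × m decomposition is ever computed instead of the full n × n one.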

Batch Iterate for Global Approximation
In Section 3.1, we elaborated the standard Nyström approach to processing input data. The difference is that the subset S in the traditional setting, e.g., kernel SVM, usually contains hundreds of elements, while deep networks are trained on smaller batches (e.g., 32 for Kernel-Whitening) with a stochastic gradient descent (SGD) optimizer. The insufficient samples compromise the information of the reconstructed representations, making the improvement inconspicuous when directly applying Nyström methods to debiasing tasks.
In an attempt to introduce global information while processing batch data, we design preservation and reloading structures to extend the dimension of the low-rank kernel matrix. For each batch, we calculate the Nyström matrix with the batch features Z^t_L and the global features Z^t_f, which represent the principal components of the data distribution at step t. Given input data Z^t_L containing L instances, the extended representation is given by Eq. 13, where G denotes the kernel matrix generated by T, and G_{0:L,:} denotes the first L rows of G. In particular, we select the Radial Basis Function (RBF) kernel in the Kernel-Whitening method.
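The batch-extension step can be sketched as follows. This is a hypothetical reconstruction under our own assumptions (function names, sizes, and the RBF bandwidth are illustrative; the paper's exact Eq. 13 may differ in detail): the stored global features are concatenated with the batch, the kernel is computed over the union, and only the first L rows are used to reconstruct the batch representations.

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf(A, B, gamma=0.05):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def nystrom_batch(Z_batch, Z_global):
    """Batch-wise Nystrom step: the global features extend the batch so the
    low-rank kernel estimate sees more than one batch of information."""
    T = np.concatenate([Z_batch, Z_global], axis=0)
    G = rbf(T, T)                          # kernel over batch + global features
    s, W = np.linalg.eigh(G)
    keep = s > 1e-10
    # first L rows of G -> reconstructed features for the batch only
    Phi = (G[: len(Z_batch)] @ W[:, keep]) / np.sqrt(s[keep])
    return Phi

Z_batch = rng.normal(size=(32, 16))        # batch pooler outputs (L = 32)
Z_global = rng.normal(size=(64, 16))       # preserved global features
Phi = nystrom_batch(Z_batch, Z_global)
print(Phi.shape)                           # (32, rank)
```

Using the union as the landmark set lets a batch of 32 benefit from global structure without ever decomposing a dataset-sized Gram matrix.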
Noticing that the kernel estimation method only projects the data points into a linearly separable space, we further normalize the distribution with a linear transformation. The reconstructed representation ϕ(Z^t_L) is subsequently weighted under the supervised signal of the Hilbert-Schmidt independence criterion (HSIC) (Wang et al., 2021), which is an adequate indicator for estimating the mutual independence between features. The optimal weight W* is calculated by minimizing this objective, where W = {W_i}_{i=1}^{n} denotes the sample weight vector, and Σ_{G,W} represents the empirical estimation of the covariance between features. At the end of each iteration, we update the global features with the local information Z^t_L to capture reasonable basis vectors, where α_i denotes the attenuation factor controlling the importance of local information.

Training Objective
In Section 3.2, we showed how to obtain the reconstructed feature representation ϕ(Z^t_L) and the weighting parameters W*. Subsequently, we use the above results to train the original BERT model. The final training loss of Kernel-Whitening is the W*-weighted sum of per-sample losses, where f(·, ·) represents the cross-entropy loss with input ϕ(Z^t_L)_i and its corresponding label y_i. Our detailed algorithm implementation is shown in Algorithm 1.
Algorithm 1: Framework of Kernel-Whitening. Input: the set of pooler outputs for the current batch.
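The weighted training loss described above can be sketched as follows (a minimal numpy sketch under our own assumptions — the function name and random inputs are illustrative, and the classifier head producing the logits is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

def weighted_ce_loss(logits, labels, weights):
    """Per-sample cross-entropy over classifier outputs for the
    reconstructed features phi(Z), weighted by sample weights W*."""
    z = logits - logits.max(axis=1, keepdims=True)    # stable softmax
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    nll = -log_p[np.arange(len(labels)), labels]      # per-sample CE
    return (weights * nll).sum() / weights.sum()      # normalized weighting

logits = rng.normal(size=(32, 3))      # hypothetical head outputs on phi(Z)
labels = rng.integers(0, 3, 32)
weights = np.ones(32)                   # uniform weights reduce to plain CE
loss = weighted_ce_loss(logits, labels, weights)
print(round(loss, 3))
```

Normalizing by the weight sum makes the loss invariant to the overall scale of W*, so only the relative importance of samples matters.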

Experiments
In this section, we provide a comprehensive analysis of Kernel-Whitening and the other two normalization methods (i.e., BERT-flow (Li et al., 2020) and BERT-whitening (Su et al., 2021)) through extensive experiments on three tasks.

Baseline Methods
Our method is compared with the state-of-the-art debiasing methods summarized in Table 2.

Datasets and Metrics
We conduct experiments on three tasks: natural language inference, fact verification, and paraphrase identification. Each task contains in-distribution and out-of-distribution datasets.
For the NLI task, we conduct experiments on the Multi-Genre Natural Language Inference (MNLI) dataset (Nangia et al., 2017) and HANS (McCoy et al., 2019). We train the model on MNLI, and choose MNLI-mismatched and HANS as the ID and OOD test sets, respectively.
For the fact verification task, we use FEVER (Thorne et al., 2018) as the training dataset. We train the model on FEVER, and evaluate the model on the ID test set and on FEVER Symmetric (Schuster et al., 2019) (versions 1 and 2) as the OOD test sets.

Table 2: Details of the state-of-the-art debiasing methods used to compare with Kernel-Whitening. Our model is end-to-end, requiring neither prior knowledge of biases nor additional shallow models.
For the paraphrase identification task, we perform the evaluation using Quora Question Pairs (QQP) as the ID dataset and PAWS (Zhang et al., 2019), which consists of duplicate and non-duplicate pairs, as the OOD dataset.

Evaluation Metrics
Following previous works, we measure the accuracy score on the in-distribution and out-of-distribution test sets to compare the results of different models.

Implementation Details
Following previous debiasing methods, we apply our debiasing method to BERT-base (Devlin et al., 2019). The hyperparameters of BERT are consistent with previous research papers. The learning rate is 2e-5 for the MNLI dataset and 1e-5 for FEVER and QQP; the batch size is 32; and the optimizer is AdamW with a weight decay of 0.01. Since previous methods (Sanh et al., 2020; Xiong et al., 2021) have shown high variance in experimental results under different settings, we evaluate the performance of our model with four random seeds and report the averaged result. We use the [CLS] vector as the sentence embedding for all three methods. The model is trained on an NVIDIA GeForce RTX 3090 GPU. All models are trained for five epochs, and the checkpoints with top-2 performance are finally evaluated on the challenge test set.

Experimental Results
The extensive results of all the above-mentioned methods are summarized in Table 1. Compared with other baseline methods, Kernel-Whitening significantly improves model performance on the challenge sets, and achieves state-of-the-art results on seven of the eight benchmarks. On the MNLI and FEVER datasets, our framework achieves the best performance, with accuracy about 10 percentage points higher than BERT-base, outperforming the other debiasing methods. This demonstrates that our framework has the best results and generalizability among these methods.
Moreover, our approach can effectively eliminate dataset bias while mitigating the damage to generalizable features. The vast majority of debiasing methods improve performance on out-of-distribution datasets by sacrificing performance on in-distribution datasets, which means that current debiasing methods attempt to achieve a trade-off between ID performance and OOD performance. In contrast, our approach achieves the best performance on the OOD datasets for the natural language inference and fact verification tasks while obtaining better results on the ID datasets. For the QQP dataset, our proposed approach also achieves decent generalization on PAWS without excessive performance degradation on the ID dataset.
In general, the normalization methods perform well on both in-distribution and out-of-distribution datasets for all tasks. All five models of the three methods are end-to-end approaches and do not rely on any prior knowledge of the dataset. That is to say, they achieve better utility and scalability while providing more effective debiasing. For BERT-whitening and Kernel-Whitening, a larger hidden dimension indicates better performance on OOD datasets, and Kernel-Whitening performs better with the same number of parameters as BERT-whitening, which strongly supports our analysis of normalization methods. BERT-flow shows an acceptable performance on OOD datasets, but is inferior to the whitening-based approaches. We argue that the flow model requires more samples as reference, and the original hyperparameters are not suitable for the additional network layers.

Analysis and Discussion
In this section, we conduct supplementary experiments to further analyze the effectiveness of normalization methods, especially our Kernel-Whitening framework.

Effect of Latent Dimension L
The dimensionality of the reconstructed features is a key factor. Reducing the vector size brings a smaller memory occupation and faster inference in downstream layers, while the missing information may impair the ability of the model. To further illustrate the effect of low-rank kernel approximation, we conduct a sensitivity analysis on the latent dimension L. Figure 3 shows the performance variation curves of the two whitening-based methods. For both in-distribution and out-of-distribution tasks, a latent dimension of double the batch size provides promising performance. As the dimensionality rises, Kernel-Whitening maintains a stable debiasing effect, while BERT-whitening fluctuates on the FEVER and Symm. v2 datasets. We argue that this phenomenon arises because high-dimensional features are more prone to nonlinear correlations, where Kernel-Whitening is designed to show better results. Moreover, Kernel-Whitening always performs better when the dimensionality is greater than 300, which illustrates the stability and generality of our method.

Independence Study
In Section 2, we analysed how an isotropic data distribution leads to better generalization. To check whether normalization methods remove the dependencies between features, we conduct experiments on the covariance between features during the training process. As shown in Figure 4, all three normalization methods exhibit a suppression effect on feature correlation, while our method achieves the optimal performance at the end of training. As the iterations increase, the covariance first decreases rapidly and then converges to a low point. All methods' performance fluctuates around certain steps; we believe such fluctuations are related to biased samples in the data.
Overall, Kernel-Whitening largely remits dependencies between features, and such independence effectively contributes to the generalization ability of deep network models.

Time Consumption
Besides outstanding debiasing performance, we compare the time consumption with baseline methods to further demonstrate the strength of Kernel-Whitening. We train each model equally on an NVIDIA RTX 2080Ti GPU with the same batch size. We compare the three normalization methods with the best baseline work, MoCaD (Xiong et al., 2021), which trains a bias model to perform model calibration. To give a horizontal comparison between different datasets, we set the time consumption of fine-tuning to 100 as a baseline. As shown in Table 3, the time consumption of Kernel-Whitening is nearly the same as fine-tuning, and it costs 6 times less extra time than MoCaD. Although BERT-whitening only uses a linear transformation to obtain reconstructed representations, our method is still faster, because our method performs SVD decomposition on an L × L matrix while BERT-whitening handles the same operations on an L × N matrix, where L is the latent dimension and N is the output dimension of BERT (e.g., 768).

Related Work

Debiasing Method

Existing methods train additional models to identify biased training data (Clark et al., 2019; Utama et al., 2020a; Schuster et al., 2019) or use the above bias model to calibrate the classification results on test data (Utama et al., 2020b; Sanh et al., 2020; Xiong et al., 2021). The so-called bias model refers to classifiers that use only a portion of the input data for prediction, e.g., a hypothesis-only model in the NLI task, which predicts only from specific linguistic phenomena in hypothesis sentences, such as negation. These methods are not end-to-end and face difficulty in fully identifying all bias patterns.

Recently, another line of works has noticed the connection between dataset bias and feature distribution, and tries to tackle the dataset bias problem by identifying features with better generalizability. Dou et al. (2022) use a loss function based on the information bottleneck (IB) to focus the model on task-relevant features, and Wu and Gui (2022) similarly achieve such feature filtering by mapping sentence embeddings into a specific low-dimensional subspace.

Unsupervised Semantic of Sentence Embedding
Previous works suggest that the word representations of pre-trained language models are not isotropic (Gao et al., 2018; Ethayarajh, 2019), leading models to poorly capture the underlying semantics of sentences (Li et al., 2020). Such anisotropy makes it difficult to use sentence embeddings directly through simple similarity metrics. Gao et al. (2018) propose word embedding matrix regularization methods to mitigate the degeneration problem. Recently, researchers have attempted to transform BERT sentence embeddings into an isotropic Gaussian distribution through normalizing flows (Li et al., 2020) or whitening methods (Su et al., 2021). As supervised learning also suffers from the uneven data distribution of training sets, we are the first to normalize the data distribution in supervised training to eliminate the dataset bias problem.

Conclusion
In this work, we propose a novel framework, Kernel-Whitening, to tackle spurious correlations from a feature perspective. We analyze how introducing isotropic sentence embeddings eliminates dataset bias, and propose a promising and computationally efficient kernel estimation method to obtain an approximation of disentangled sentence embeddings. Experiments on various datasets demonstrate that Kernel-Whitening achieves better performance on both ID and OOD datasets than comparative works. This implies that a shallow model, or prior knowledge of dataset bias, is not a must for improving generalization.

Limitations
In this section, we discuss the potential limitations of our work. The analysis of model effects in this paper focuses on commonly used benchmarks for natural language understanding debiasing works, which may carry confounding factors that affect the performance of our model. Therefore, it is worth further exploring the performance of our model on more tasks, e.g., the WikiGenderBias dataset for gender bias in the relation extraction task. In addition, the presented work is inspired by unsupervised semantic learning methods, such as BERT-whitening, and it would be better to test the performance of our approach on unsupervised tasks. We leave these two problems to future work.

Figure 1: Illustration of Kernel-Whitening. The vertical and horizontal axes represent the valid and invalid features, respectively. Uneven sample distribution induces a biased decision boundary, resulting in errors on out-of-distribution data. The normalization method maps the data to an isotropic latent space, where the new boundary is uncorrelated with redundant features, providing better generalization.
• Clark et al. (2019) (Reweighting and Learned-Mixin), which predicts confidence for each sample and down-weights problematic data.
• Sanh et al. (2020) (Product-of-Experts and PoE cross-entropy), which trains limited-capacity models as experts to debias without explicitly identifying dataset bias.
• Utama et al. (2020b) (PoE self-debias and Conf-reg self-debias), which uses a shallow model to identify biased samples and focus the main model on them.
• Utama et al. (2020a) (Conf-reg), which uses confidence regularization to discourage models from exploiting biases.
• Xiong et al. (2021) (MoCaD), which produces uncertainty estimations to achieve a three-stage ensemble-based debiasing framework.

Figure 3: Effect of different dimensionality L with whitening methods on each of the aforementioned tasks. The x-axis is the latent dimension of the sentence embeddings. The two images show model performance on out-of-distribution and in-distribution test sets, respectively.

Table 1: Model evaluation on MNLI, FEVER, QQP, and their respective challenge test sets. The performance of the three normalization models is shown in cyan; model names with asterisks represent experimental results reproduced on our machine. The best results on each dataset are bolded.

Table 3: Time consumption (percentages) of training one epoch on the whole dataset. Whitening-based methods cost much less time than previous works.