Looking at the Overlooked: An Analysis on the Word-Overlap Bias in Natural Language Inference

It has been shown that NLI models are usually biased with respect to the word-overlap between the premise and the hypothesis, as they take this feature as a primary cue for predicting the entailment label. In this paper, we focus on an overlooked aspect of the overlap bias in NLI models: the reverse word-overlap bias. Our experimental results demonstrate that current NLI systems are also highly biased towards the non-entailment label on instances with low overlap, and that existing debiasing methods, which are reportedly successful on challenge datasets, are generally ineffective in addressing this category of bias. Through a set of analyses, we investigate the reasons for the emergence of the overlap bias and the role of minority examples in mitigating this bias. For the former, we find that the word-overlap bias does not stem from pre-training, and for the latter, we observe that, in contrast to the accepted assumption, eliminating minority examples does not affect the generalizability of debiasing methods with respect to the overlap bias.


Introduction
Natural Language Inference (NLI) is one of the most commonly used NLP tasks, particularly in the scope of evaluating models for their language understanding capabilities. Since their emergence, pre-trained language models (PLMs) have been highly successful on standard NLI datasets, such as Multi-Genre Natural Language Inference (Williams et al., 2018, MultiNLI). However, recent analytical studies have revealed that their success is partly due to their reliance on spurious correlations between superficial features of the input texts and gold labels in these datasets (Poliak et al., 2018; Bhargava et al., 2021). As a result, performance usually drops on out-of-distribution datasets where such correlations do not hold. Several proposals have been put forth to enhance the robustness of models to known and unknown biases and improve performance on the so-called challenging datasets (Stacey et al., 2020; Utama et al., 2020a; Asael et al., 2022).

Figure 1: NLI model's confidence on a randomly sampled subset of instances from the SNLI dataset across four different degrees of word overlap between premise and hypothesis. BERT is biased towards the entailment label on instances with full overlap (denoted by the huge confidence gap with the non-entailment label). On the contrary, a reverse bias is seen for low- and non-overlapping instances, with a significant confidence lead on the non-entailment label.
One of the well-known dataset biases in NLI models is the spurious correlation between the entailment label and high word-overlap between premise and hypothesis. A number of challenging sets have been designed to showcase the tendency of PLMs to predict entailment for most such cases. HANS (McCoy et al., 2019) is arguably the most widely used dataset in this group. Constructed from human-made linguistic patterns, the dataset focuses on high-overlapping samples, the non-entailment subset of which is deemed challenging for NLI models. Most current debiasing methods have considered the word-overlap bias as one of their main targets and have shown substantial improvements on HANS (Mendelson and Belinkov, 2021; Min et al., 2020).
In this paper, we revisit the word-overlap bias in NLI and the effectiveness of existing debiasing techniques. Despite the popularity of this type of bias, we find that some of its aspects are generally ignored by the research community. Considering word-overlap as a feature with values ranging from no to full overlap, and NLI as a task with two labels (entailment and non-entailment), we show that there are other kinds of spurious correlation beyond the popular one between high word-overlap and entailment. Specifically, as shown in Figure 1, we see a clear bias towards non-entailment for low and no word-overlap values (denoted by the high performance on the non-entailment label, which comes at the price of reduced performance on the entailment class). We will refer to this type of bias as the reverse word-overlap bias throughout the paper.
Through a set of experiments, we demonstrate that the overlooked reverse word-overlap bias exists in popular NLI datasets, such as MNLI and SNLI, as well as in the predictions of PLMs. Moreover, our results suggest that while existing debiasing methods can mitigate the overlap bias in NLI models to some extent, they are ineffective in resolving the reverse bias.
Moreover, we analyze how NLI models employ minority instances to enhance their generalization. Focusing on the forgettable debiasing method (Yaghoobzadeh et al., 2021), we find that eliminating HANS-like examples, as well as their reverse counterparts, does not noticeably hurt generalization.
In search of the origin of the bias, we employ prompt-based techniques to check whether the bias stems from pre-training. We also verify the robustness of PLMs in a few-shot learning experiment with controlled and balanced training sets. Our results suggest that PLMs do not exhibit any bias towards a specific label. Nevertheless, introducing a few samples triggers the bias toward the entailment label. Furthermore, balancing the training examples with respect to their word-overlap prevents the emergence of bias to some extent.
Our contributions can be summarized as follows:
• We expand our understanding of the word-overlap bias in NLI by revealing an unexplored spurious correlation between low word-overlap and non-entailment.
• We analyze how debiasing methods work for the whole spectrum of word-overlap bias, finding that they generally fail at addressing bias for the low and non-overlapping cases.
• To explore the origin of word-overlap bias in PLMs, we design several new experiments showing that, even when exposed to only a few training examples, PLMs become biased towards predicting entailment.

Natural Language Inference
In NLI, a model is provided with two input sentences, namely a premise and a hypothesis. The task for the model is to predict whether the hypothesis is true (entailment), false (contradiction), or undetermined (neutral) given the premise.

Bias in NLI Models
Analyses of NLI models have demonstrated that they are sensitive to shortcuts that appear in the dataset. Several types of bias have been investigated in the literature, including hypothesis-only prediction, spurious correlations between certain words and labels (e.g., negation words and the non-entailment label), sensitivity to the length of the hypothesis, and lexical overlap between the premise and hypothesis (Gururangan et al., 2018; Poliak et al., 2018; McCoy et al., 2019; Wu et al., 2022).
Relying on these spurious features hampers the language understanding ability of NLI models, leading to poor performance on out-of-distribution datasets where such superficial correlations do not hold (He et al., 2019; McCoy et al., 2019).
Word-Overlap Bias. Among the detected dataset biases, word-overlap is a well-studied shortcut in the NLI task (Zhou and Bansal, 2020; Mendelson and Belinkov, 2021). We define word-overlap (wo) as the ratio of words in the hypothesis (h) that are shared with the premise (p), i.e., wo = |h ∩ p| / |h|. Table 1 shows examples of different degrees of word-overlap.
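For concreteness, the overlap ratio can be computed in a few lines. The sketch below is a minimal implementation of the definition above; whitespace tokenization and lowercasing are simplifying assumptions on our part:

```python
def word_overlap(premise: str, hypothesis: str) -> float:
    """Ratio of hypothesis words that also appear in the premise: wo = |h ∩ p| / |h|."""
    p_words = set(premise.lower().split())
    h_words = hypothesis.lower().split()
    if not h_words:
        return 0.0
    return sum(w in p_words for w in h_words) / len(h_words)

# A HANS-style pair: full overlap despite a different word order and meaning.
print(word_overlap("The doctor saw the lawyer", "The lawyer saw the doctor"))  # 1.0
```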

Debiasing Methods
Creating high-quality datasets without any spurious features between instances and gold labels is an arduous and expensive process (Gardner et al., 2021a), making it inevitable for a dataset to contain biases to some extent. Therefore, to have a robust model, it is essential to take extra steps for debiasing against dataset artifacts. The past few years have seen several debiasing methods (Karimi Mahabadi et al., 2020; Utama et al., 2020a,b; Belinkov et al., 2019). For our experiments, we opted for three different debiasing approaches and evaluate their effectiveness in mitigating the overlap bias and its reverse.

Table 1: NLI examples with different degrees of word-overlap (between premise and hypothesis), where the overlap is the ratio of hypothesis words that are shared with the premise (the samples are picked to reflect extreme cases across the word-overlap spectrum).

Low | P: A blond woman in a white dress sits in a flowering tree while holding a white bird. H: The woman beats two eggs to make breakfast for her husband. | Non-Entailment
None (0.0) | P: A couple sits in the grass. H: People are outside. | Entailment
None (0.0) | P: An older women tending to a garden. H: The lady is cooking dinner. | Non-Entailment
Long-tuning. Tu et al. (2020) have shown that fine-tuning NLI models for more epochs can enhance the generalizability of LMs over challenging datasets. Following their suggestion, we fine-tuned the models for 20 epochs on the MNLI dataset.
Forgettable Examples. Yaghoobzadeh et al. (2021) propose a method for finding minority examples without prior knowledge of the dataset artifacts. In this method, minority examples are taken to be the samples that have either never been learned, or learned once and then forgotten by the model. The already trained NLI model is then fine-tuned on this subset for a few more epochs. Following the authors' suggestion, to find the forgettable examples we utilized a simple Siamese Bag of Words (BoW) model in which the sentence representations of the premise and hypothesis are the averages of their word embeddings.
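As a rough sketch of the bookkeeping involved, forgettables can be identified from per-epoch correctness records of the shallow model. The (n_epochs, n_examples) matrix layout below is our assumption, not a detail of the original method:

```python
import numpy as np

def forgettable_indices(correct_per_epoch: np.ndarray) -> np.ndarray:
    """correct_per_epoch: boolean array of shape (n_epochs, n_examples) recording
    whether the BoW model classified each training example correctly after each
    epoch. An example is 'forgettable' if it was never learned, or if it shows
    at least one correct -> incorrect transition (a forgetting event)."""
    never_learned = ~correct_per_epoch.any(axis=0)
    forgotten = (correct_per_epoch[:-1] & ~correct_per_epoch[1:]).any(axis=0)
    return np.where(never_learned | forgotten)[0]
```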

Product of Experts (PoE). In this approach, a weak learner is trained to learn superficial features in the input (Karimi Mahabadi et al., 2020; Sanh et al., 2021). The weak learner's output is then used to normalize the main model's predictions on over-confident examples. Following previous studies (Karimi Mahabadi et al., 2020; Sanh et al., 2021), we employed the following combination strategy for taking into account both the weak learner and main model predictions:

y = softmax(log p_w + log p_m)

where p_w and p_m are the outputs of the weak learner and the main model, respectively. The robust model is trained using a cross-entropy loss function based on y. We used TinyBERT (Jiao et al., 2020) as our weak learner.
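A minimal PyTorch sketch of this combination, assuming both models produce classification logits over the same label set; detaching the weak learner's scores (so that only the main model is updated) is our assumption about the usual setup:

```python
import torch.nn.functional as F

def poe_loss(main_logits, weak_logits, labels):
    """Product of Experts: combine the log-probabilities of the weak learner
    and the main model (y = softmax(log p_w + log p_m)) and compute the
    cross-entropy of the combined distribution against the gold labels."""
    log_pm = F.log_softmax(main_logits, dim=-1)
    log_pw = F.log_softmax(weak_logits, dim=-1).detach()
    log_y = F.log_softmax(log_pm + log_pw, dim=-1)
    return F.nll_loss(log_y, labels)
```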

Experimental Setup
Datasets. In our experiments, we opted for the Multi-Genre Natural Language Inference dataset (Williams et al., 2018, MNLI) for training, and HANS (McCoy et al., 2019) and WANLI (Liu et al., 2022) as out-of-distribution challenging sets. In the former challenging dataset, each instance is curated in a way that all words of the hypothesis are also observed in the premise, irrespective of word order. Previous work has shown that biased NLI models tend to perform poorly on HANS, particularly on the non-entailment class (Yaghoobzadeh et al., 2021). The latter challenging set employed GPT-3 (Brown et al., 2020) to generate high-quality instances, followed by filtering by human crowd-workers. Quality tests on WANLI indicate that the dataset contains fewer artifacts than MNLI.

Models.
As for PLMs, we opted for the base versions of BERT and RoBERTa (Devlin et al., 2019; Liu et al., 2020) and fine-tuned them for three epochs as our baselines. We trained the models with a learning rate of 2e-5, employing the Adam optimizer. The batch size was set to 32 with a maximum sequence length of 128.
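A sketch of this fine-tuning setup with the Hugging Face Trainer is given below; the output directory and the concrete seed value are placeholders, and data loading is elided:

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3)

args = TrainingArguments(
    output_dir="nli-baseline",        # placeholder
    num_train_epochs=3,               # 20 for the long-tuning setting
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    seed=42,                          # repeated with three different seeds
)
# The max length of 128 is enforced when tokenizing the sentence pairs:
# tokenizer(premise, hypothesis, truncation=True, max_length=128)
# trainer = Trainer(model=model, args=args, train_dataset=..., eval_dataset=...)
# trainer.train()
```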
All the reported results are based on three random seeds.

Results
Table 2 shows the results for the baseline models (BERT and RoBERTa) and the three debiasing techniques on different datasets. The bias in the baseline models is highlighted by the performance contrast between the entailment (HANS+) and non-entailment (HANS−) subsets. As can be seen, the three debiasing methods are generally effective in softening the biased behavior, reflected by the improved performance on HANS− (and, in turn, HANS) and also on WANLI.

Reverse Word-Overlap
Considering the word-overlap bias as a spectrum, existing studies have mainly focused on a small subset of it, i.e., the case of full word-overlap and its spurious correlation with the entailment label. In this section, we evaluate the performance of NLI models on other areas of the spectrum, and with respect to both labels (entailment and non-entailment), to broaden our insights into the robustness of these models with regard to the word-overlap feature.

Table 3: The accuracy of the two NLI models across different overlap bins and on both subsets. The lowest numbers in each column are underlined.

Probing Dataset
As for this probing study, we experimented with the SNLI dataset (Bowman et al., 2015), merging the training, development, and test sets to build a unified evaluation set. The set was split into seven bins based on the degree of overlap; the statistics are reported in Figure 2. As an example, the [0.6, 0.8) bin contains samples that have a word overlap (between premise and hypothesis) of greater than or equal to 0.6 and less than 0.8.
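The binning logic can be made explicit with a small helper. The boundary conventions follow the bins named in this section (exact zeros and ones form their own bins), while the string labels are our own:

```python
def overlap_bin(ratio: float) -> str:
    """Map an overlap ratio in [0, 1] to one of the seven bins of the probing set."""
    if ratio == 0.0:
        return "None (0.0)"
    if ratio == 1.0:
        return "Full (1.0)"
    lo = 0.0
    for hi in (0.2, 0.4, 0.6, 0.8, 1.0):
        if ratio < hi:
            # the first interval is open at 0 since exact zeros go to "None"
            return f"({lo}, {hi})" if lo == 0.0 else f"[{lo}, {hi})"
        lo = hi
    raise ValueError(f"ratio out of range: {ratio}")
```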

Results
Unless specified otherwise, the experimental setup in this experiment is the same as the one reported in Section 2.3. Table 3 reports the results across different word-overlap bins for both BERT and RoBERTa and for both labels. As expected, a high contrast is observed on the full-overlap subset: near-perfect NLI performance on entailment, while poor performance on non-entailment, suggesting a strong bias towards the entailment label. This is the conventional type of NLI bias that has usually been discussed in previous studies. The HANS challenging dataset is constructed based on the same type of bias. However, surprisingly, the results show that this biased behavior only exists for samples with full overlap. In fact, no notable bias is observed even for the high-overlap samples in the [0.8, 1) bin. This observation further narrows down the scope of HANS as a challenging dataset and raises questions about the robustness of models developed based on the dataset.
Reverse bias. Interestingly, the results in Table 3 shed light on another inherent spurious correlation that exists between NLI performance and the degree of word-overlap. Particularly towards the non-overlap extreme, the performance drops on entailment and increases on non-entailment samples. In the (0.0, 0.2) bin, we see the largest gap: 55.5 on entailment vs. 96.7 on non-entailment for the BERT model. We refer to this biased behavior of NLI models on low word-overlapping samples, towards the non-entailment label, as the reverse bias.
It is also worth mentioning that, based on these results, the reverse bias covers a broader range of bins than the word-overlap bias does.

Effectiveness of Debiasing Methods
Figure 3 shows the performance of the three debiasing methods (described in Section 2.2) across the seven bins in our word-overlap analysis. As can be observed, the debiasing methods improve over the baseline on the full-overlap ("Full" in Figure 3) non-entailment subset, with PoE proving the most effective. The improvement is expected, since the results on the challenging dataset, HANS, suggest the same. This, however, comes at the price of reduced performance on the entailment subset, specifically for the BERT model.
As we move toward the non-overlap end of the spectrum ("None" in Figure 3), the performance gap between the entailment and non-entailment labels grows, mainly due to the drop in entailment performance. Interestingly, the experimental results reveal that the debiasing methods are clearly ineffective in addressing the reverse bias and perform similarly to the baseline models.

Role of Minority Examples
In the context of word-overlap bias, the non-entailment instances that have full overlap (between premise and hypothesis) are usually referred to as minority examples. Tu et al. (2020) show that the generalization of models over out-of-distribution datasets relies on such minority examples. We carry out a set of experiments on the forgettable approach, where a subset of the training data is chosen for further fine-tuning of the models (66k instances in our NLI experiments for the F_BoW method). We extend the forgettable analysis to the low word-overlap, or reverse, minority examples. We also verify the role played by minority examples in the performance of debiasing methods.
As the first step, we compare the distribution of instances with respect to their overlap in the original training set of MNLI and its forgettable subset. The results are shown in Figure 4. As can be seen, the forgettable subset tends to have better coverage over the minority subset than the original MNLI training set (see the right side of Figure 4(a) and the left side of Figure 4(b)). One can hypothesize that better coverage of minority examples is the reason behind the effectiveness of the forgettable approach. To verify this hypothesis, we eliminate several subsets from F_BoW and fine-tune the NLI models with the remaining samples. We considered the following four settings (an implementation sketch follows the list):
• Full−NEnt: Full overlap between premise and hypothesis with the non-entailment label.
• [0.8, 1]−NEnt: High overlap with the non-entailment label.
• None−Ent: No overlap and the entailment label.
• Low−Ent: Low overlap with the entailment label.
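A sketch of how such elimination could be implemented over the forgettable subset; the field names and label strings below are hypothetical:

```python
def eliminate(forgettables, overlap_range, label):
    """Drop forgettable examples whose overlap lies in overlap_range (inclusive
    bounds) and whose gold label equals `label`; the rest are kept for the
    extra fine-tuning stage."""
    lo, hi = overlap_range
    return [ex for ex in forgettables
            if not (lo <= ex["overlap"] <= hi and ex["label"] == label)]

# e.g., the Full-NEnt setting: remove full-overlap non-entailment (HANS-like) examples
# remaining = eliminate(forgettables, (1.0, 1.0), "non-entailment")
```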
The results are reported in Table 4. Interestingly, we observe that removing the HANS-like examples (Full−NEnt), which were hypothesized to play the main role in improving performance on the challenging datasets, does not notably affect the performance of F_BoW. The observation is consistent even for larger subsets of high-overlapping instances ([0.8, 1]−NEnt). Discarding the reverse group (low-overlapping entailment samples) yields a similar pattern. It can therefore be inferred that such samples do not play the primary role in the effectiveness of debiasing methods.
This opens up questions on how NLI models extrapolate to patterns unseen during training and how debiasing methods enhance their generalization over out-of-distribution data. This is particularly interesting in light of the observation by Tu et al. (2020) that standard training does not enable such extrapolation. We leave further investigations in this area to future work.

The Origin of Word-Overlap Bias
We conducted another experiment to see whether the vulnerability of NLI models to the word-overlap feature and the reverse bias comes from pre-training or from fine-tuning on the task-specific data. To this end, we followed Utama et al. (2021) in evaluating pre-trained models under zero- and few-shot settings. To rule out the impact of fine-tuning and verify whether the pre-trained model exhibits similar biases with respect to word-overlap, we evaluated BERT in a zero-shot setting by reformulating the NLI task as a masked language modeling objective. Following previous studies (Schick and Schütze, 2021; Utama et al., 2021), we transformed the NLI examples using the template below:

Premise ? [MASK], Hypothesis.
where the [MASK] token denotes the gold label. We used a simple verbalizer with yes, maybe, and no as mappings to, respectively, the entailment, neutral, and contradiction labels.
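A minimal sketch of this zero-shot evaluation is given below, assuming the label words map to single tokens in BERT's vocabulary (which holds for yes, maybe, and no); the exact prompt string follows the template above:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()
VERBALIZER = {"yes": "entailment", "maybe": "neutral", "no": "contradiction"}

def zero_shot_nli(premise: str, hypothesis: str) -> str:
    """Score each label word at the [MASK] position and return the argmax label."""
    text = f"{premise} ? {tokenizer.mask_token}, {hypothesis}"
    inputs = tokenizer(text, return_tensors="pt")
    mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]
    word_ids = tokenizer.convert_tokens_to_ids(list(VERBALIZER))
    best_word = list(VERBALIZER)[int(logits[word_ids].argmax())]
    return VERBALIZER[best_word]
```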
The first row of Table 5 shows the results for the zero-shot setting. The similar performance on HANS− and HANS+ shows that the pre-trained BERT model does not exhibit much bias towards a specific label. Therefore, the bias stems from fine-tuning on the task-specific instances. This is reflected even with as few as 16 samples in the few-shot scenario (where we fine-tuned the prompt-based model). As the number of training instances increases, the gap between the entailment and non-entailment samples grows.
Balanced data. We also examined the role of class imbalance in the training data in the emergence of word-overlap bias. For this experiment, we defined four categories based on the overlap degree {Full, [0.5, 1), (0.0, 0.5), None} and uniformly sampled K instances per label, as sketched below. The bottom block of Table 5 presents the results. It can be inferred that having a balanced training set reduces the bias to some extent. Finally, the high variance on the HANS subsets suggests that the quality of the training examples and the word-overlap percentage between premise and hypothesis can have a significant impact on the bias in NLI systems.
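One way to implement this balanced sampling is sketched below; splitting K evenly across the four overlap categories, and the field names on the example dicts, are our assumptions:

```python
import random
from collections import defaultdict

OVERLAP_CATS = ("Full", "[0.5, 1)", "(0.0, 0.5)", "None")

def balanced_sample(examples, k_per_label):
    """Sample K instances per label, spread evenly over the four overlap
    categories. Each example is assumed to carry 'label' and 'overlap_cat'."""
    by_cell = defaultdict(list)
    for ex in examples:
        by_cell[(ex["label"], ex["overlap_cat"])].append(ex)
    per_cell = k_per_label // len(OVERLAP_CATS)
    sample = []
    for pool in by_cell.values():
        sample.extend(random.sample(pool, min(per_cell, len(pool))))
    return sample
```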

Related Work
Dataset biases in NLP. Different categories of bias have been discovered and discussed in NLP datasets. Earlier work discovered that negation words are correlated with the contradiction label in the SNLI dataset (Naik et al., 2018; Gururangan et al., 2018). Hypothesis-only prediction (Gururangan et al., 2018) and word-overlap between hypothesis and premise (McCoy et al., 2019) are other types of biases discussed in the literature on the SNLI and MNLI datasets. In particular, word overlap has also been investigated in the context of duplicate question detection on the QQP dataset (Zhang et al., 2019). For both NLI and QQP, it has been shown that considerable spurious correlations exist between high word overlap and the entailment/duplicate label. In this work, we focused on the word-overlap bias in NLI datasets and introduced an overlooked aspect of this bias: the correlation between low word overlap and the non-entailment class.
Challenging sets. In the past few years, several challenging datasets have been introduced to study the limitations of NLP models, and in particular pre-trained language models, in learning robust features and ignoring dataset biases. Challenging datasets for NLI include HANS (McCoy et al., 2019), ANLI (Williams et al., 2022), MNLI-hard (Gururangan et al., 2018), and the Stress-tests (Naik et al., 2018). Similar datasets for other tasks include PAWS (Zhang et al., 2019; Yang et al., 2019) for duplicate question detection, and FEVER-Symmetric (Schuster et al., 2019) for stance detection.
Spurious correlation. Gardner et al. (2021b) argue that for complex language understanding tasks, any simple feature correlation should be considered spurious, e.g., between "not" and the contradiction label in NLI. Spurious correlations can also be defined from the viewpoint of generalizability. A common approach for mitigating such correlations is to reduce the model's reliance on biased examples during training, usually with the help of a weak learner (He et al., 2019; Karimi Mahabadi et al., 2020; Utama et al., 2020a; Sanh et al., 2021). Others augment the training set with examples that violate the spurious correlations. A mix of both of these approaches has also been investigated by Wu et al. (2022). An alternative approach is to extend fine-tuning, either on all of the training data (Tu et al., 2020) or on parts of it (Yaghoobzadeh et al., 2021).
Analysis of debiasing. Given the increasing interest in debiasing methods, there have been concerns about their widespread use. Schwartz and Stanovsky (2022b) argue that excessive balancing prevents models from learning anything (in particular, important world and commonsense knowledge), making it neither practical nor desirable. They suggest abstaining and interacting with the user when the contextual information is not sufficient, as well as focusing on zero- and few-shot learning approaches instead of full fine-tuning. In this paper, we showed that balancing datasets should only be taken as a partial solution for eliminating spurious correlations. We also showed that, in this context, few-shot learning might not be effective. Mendelson and Belinkov (2021) found that debiased models encode more extractable information about the bias in their inner representations. This observation is explained, in a work concurrent to ours, in terms of the necessity and sufficiency of the biases (Joshi et al., 2022). In this paper, for the word-overlap bias, we showed that our selected debiasing techniques are not robust once the whole spectrum of overlap is considered.

Conclusions
In this work, we uncovered an unexplored aspect of the well-known word-overlap bias in NLI models: a spurious correlation between low-overlap instances and the non-entailment label, which we call the reverse word-overlap bias. We demonstrated that existing debiasing methods are not effective in mitigating the reverse bias. We also found that the generalization power of debiasing methods (the forgettable approach in particular) does not stem from minority examples, and that the word-overlap bias does not seem to come from the pre-training step of PLMs. As future work, we plan to focus on designing new debiasing methods for mitigating the reverse bias in NLI and similar tasks. Building specific challenging sets, similar to HANS, for the reverse bias would also help expand this line of research.

Acknowledgements
We would like to acknowledge that the idea of reverse bias was initiated in discussion with Alessandro Sordoni (MSR Montreal). We also want to thank the anonymous reviewers for their valuable comments, which helped us improve the paper. Sara Rajaee is funded in part by the Netherlands Organization for Scientific Research (NWO) under project number VI.C.192.080.

Limitations
In our experiments, we focused on two popular PLMs, BERT and RoBERTa. Using more PLMs, with diversity in objective and architecture, and evaluating their robustness is one way our work could be extended. Moreover, we evaluated three debiasing methods, but this could be expanded to more. Another aspect open to improvement is creating a higher-quality dataset for analyzing the overlap bias and its reverse. We have used SNLI as our main probing set, a crowdsourcing-based dataset that contains some noisy examples, especially in the minority groups.

Figure 3: The performance of the baseline and the three debiasing methods across the seven word-overlap bins, for both labels and for BERT and RoBERTa. Across the spectrum, the debiasing techniques seem to be effective only on samples with high (particularly full) word-overlap on the non-entailment subset, and are either ineffective (or even harmful) towards the other end of the overlap spectrum and on the entailment subset.

Figure 4: Normalized distribution of instances with respect to their word-overlap in the original training set of MNLI and the subset identified by F_BoW.

Table 2: The average accuracy of the baseline models and debiasing methods on the MNLI development (matched) set as the in-distribution dataset, and WANLI and HANS as the out-of-distribution datasets (HANS+ and HANS− are the entailment and non-entailment subsets, respectively).

Table 4: The performance of F_BoW after eliminating four different subsets. Eliminated denotes the number of eliminated examples in each setting. All the subsets tend to be in the same performance ballpark with respect to the generalizability of the model on the out-of-distribution datasets (WANLI and HANS).

Table 5: Zero-shot and few-shot results of prompt-based fine-tuning for BERT. While no significant bias is seen in the zero-shot setting, with only a few task-specific examples BERT's predictions become biased towards entailment (HANS+ vs. HANS−). Balancing the training set (bottom block) slightly reduces the extent of the bias.