Statistically Profiling Biases in Natural Language Reasoning Datasets and Models

Recent work has indicated that many natural language understanding and reasoning datasets contain statistical cues that can be taken advantage of by NLP models, whose capability may thus be grossly overestimated. To discover potential weaknesses in such models, some human-designed stress tests have been proposed, but they are expensive to create and do not generalize to arbitrary models. We propose a lightweight and general statistical profiling framework, ICQ (I-See-Cue), which automatically identifies possible biases in any multiple-choice NLU dataset without the need to create additional test cases, and further evaluates, through black-box testing, the extent to which models may exploit these biases.


Introduction
Deep neural models have made remarkable strides in a broad spectrum of natural language understanding (NLU) tasks (Bowman et al., 2015; Wang et al., 2018; Mostafazadeh et al., 2016; Roemmele et al., 2011; Zellers et al., 2018). These tasks often employ a multiple-choice framework, as illustrated in Example 1. However, the inherent sensitivity of these models to minute variations calls for a robust and precise evaluation mechanism (Jurafsky et al., 2020).
Example 1 Natural language inference in the SNLI dataset, with the correct answer bolded.
Premise: A swimmer playing in the surf watches a low flying airplane headed inland.
Hypothesis: Someone is swimming in the sea.
In tasks akin to Example 1, humans typically rely on the logical relationship between the premise and hypothesis. In contrast, some NLP models may bypass this logical reasoning, focusing instead on the biases embedded within the dataset and, more specifically, within the hypotheses (Naik et al., 2018; Schuster et al., 2019). These biases, such as sentiment or shallow n-grams, can provide misleading cues for correct predictions.
We refer to these biases as "artificial spurious cues" when they pervade both the training and test datasets, maintaining a similar distribution over predictions. An example of such a cue is a model's disproportionate dependence on the word "someone" in Example 1. These cues, when absent or altered, can significantly impair a model's performance, underlining the importance of identifying them to enhance model robustness. To tackle the issue of cues, it is crucial to distinguish between cues embedded in the dataset and those learned by the model. Conventional bias detection and mitigation tools, such as the AI Fairness 360 toolkit (Bellamy et al., 2018), primarily target dataset biases and inadequately address those learned by models during training.
While existing methods such as "hypothesis-only" tests and CheckList can uncover model vulnerabilities, they are not expressly designed to identify model-learned cues. "Hypothesis-only" tests can highlight dataset issues where the hypothesis alone suffices to produce a correct answer, but they fail to realistically depict a model's capabilities because they do not evaluate the model with the full data context used during both training and prediction.
Drawing from the tenets of black-box testing in software engineering, CheckList scrutinizes model weaknesses without detailed knowledge of the model's internal architecture. It achieves this by delivering additional stress test cases premised on predefined linguistic features. However, CheckList's dependence on meticulously designed templates limits its scope, and it also falls short of illuminating the knowledge the model has actually gleaned from the data.
To address these limitations, we introduce ICQ ("I-see-cue"), a resilient statistical analysis framework engineered to identify model-learned cues. Diverging from traditional methods, ICQ identifies biases in multiple-choice NLU datasets without necessitating additional test cases. Employing black-box testing, ICQ assesses how models utilize these biases, delivering a comprehensive understanding of bias in NLU tasks.
We validate ICQ's efficacy by deploying it on various NLU datasets to probe potential cues learned by models during training. ICQ facilitates an in-depth understanding of how models like ChatGPT learn potential cues, and it offers illustrative examples to guide the selection of suitable prompts, providing invaluable guidance for model optimization.
In summary, this paper contributes the following:
• We unveil ICQ, a lightweight yet potent method for identifying statistical biases and cues in NLU datasets, proposing simple and efficient tests to quantitatively and visually evaluate whether a model leverages spurious cues in its predictions.
• We execute a comprehensive evaluation of statistical bias issues across ten popular NLU datasets and four models, corroborating previous findings and unveiling new insights. We also offer an online demonstration system to showcase the results and invite users to evaluate their own datasets and models.
• Through a case study, we delve into how ChatGPT learns potential biases, offering valuable recommendations for its practical applications.

Task Definition
We define an instance x of an NLU task dataset X as a triple x = (p, h, l), where p is the context against which to do the reasoning (p corresponds to "premise" in Example 1); h is the hypothesis given the context p; and l ∈ L is the label that depicts the type of relation between p and h. The size of the relation set L varies with the task.
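The triple above can be sketched as a small data structure (the class and field names are illustrative, not from the paper):

```python
from dataclasses import dataclass

@dataclass
class Instance:
    premise: str     # context p
    hypothesis: str  # hypothesis h
    label: str       # label l, drawn from a task-specific label set L

# The SNLI example from the introduction as an instance:
x = Instance(
    premise="A swimmer playing in the surf watches a low flying airplane headed inland.",
    hypothesis="Someone is swimming in the sea.",
    label="entailment",
)
```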

Linguistic Features
As demonstrated in previous work (Naik et al., 2018; Jurafsky et al., 2020), we consider the following linguistic features:
Word: The existence of a specific word in the premise or hypothesis of a dataset instance.
Sentiment: The sentiment value of an instance, calculated as the sum of sentiment polarities of individual words.
Tense: The tense feature (past, present, or future) of an instance, determined by the POS tag of the root verb.
Overlap: The existence of at least one word (excluding stop words) that occurs in both the premise and hypothesis.
NER: The presence of named entities (e.g., PER, ORG, LOC, TIME, or CARDINAL) in an instance, detected using the NLTK NER toolkit.
Typos: The presence of at least one typo in an instance, identified using a pretrained spelling model.
For multiple-choice datasets, all features except Overlap are applied exclusively to hypotheses.
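Two of the simpler features above can be sketched as predicates over an instance; this is an illustrative simplification (a real implementation would use a sentiment lexicon, a POS tagger, the NLTK NER toolkit, and a spelling model for the other features):

```python
# A tiny stop-word list for illustration only.
STOP_WORDS = {"a", "the", "is", "in", "of", "and", "to"}

def has_word(text: str, word: str) -> bool:
    """Word feature: does a specific word occur in the text?"""
    return word in text.lower().split()

def overlap(premise: str, hypothesis: str) -> bool:
    """Overlap feature: at least one non-stop word occurs in both
    the premise and the hypothesis."""
    p = set(premise.lower().split()) - STOP_WORDS
    h = set(hypothesis.lower().split()) - STOP_WORDS
    return len(p & h) > 0
```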

Approach
The ICQ framework, depicted in Figure 2, consists of three phases: data extraction, cue discovery, and model probing. In the data extraction phase, instances containing a specific linguistic feature f are extracted from the dataset. The cue discovery phase identifies potential cues among the pre-defined features. Finally, the model probing phase conducts two tests: the "accuracy test" and the "distribution test". We discuss these phases in more detail below.

Data Extraction Phase
After defining the linguistic features, our system's fundamental step is constructing a data extractor for each feature value f. An extractor processes a dataset and retrieves the set of instances associated with the specified feature value: if an instance contains the feature, it is singled out as part of the subset bearing that feature.
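An extractor of this kind can be sketched as a filter over the dataset; the dictionary keys and the "no" predicate below are illustrative assumptions:

```python
def extract(dataset, feature):
    """Return the subset of instances for which the feature predicate holds."""
    return [inst for inst in dataset if feature(inst)]

# A toy dataset and a Word feature for the word "no".
data = [
    {"premise": "p1", "hypothesis": "no one is here", "label": "contradiction"},
    {"premise": "p2", "hypothesis": "someone is here", "label": "entailment"},
]
has_no = lambda inst: "no" in inst["hypothesis"].split()
subset = extract(data, has_no)  # the single instance containing "no"
```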

Cue Discovery Phase
For each feature f, we apply its extractor to both the training and test data of dataset X, denoted as R and S in Figure 2. This results in clustered subsets of training instances (R_f) and test instances (S_f). A feature is considered a possible cue for a dataset only if it is present in both the training and test data.
The bias of the label distribution of an extracted set F is measured with the mean squared error (MSE), and the similarity between the training-side and test-side biases with the Jensen-Shannon divergence (JSD) (Lin, 1991):

MSE(F) = (1/|L|) * Σ_i (y_i − ȳ)²

where y_i is the number of instances with label l_i in the extracted set F, and ȳ is the mean number of instances per label. A larger MSE(F) implies a more peaked label distribution and a greater bias. If the extracted training set (R_f) and the extracted test set (S_f) exhibit similar biases, the JSD between their label distributions will be small:

JSD(Q(R_f), Q(S_f)) = (1/2) * KL(Q(R_f) || M) + (1/2) * KL(Q(S_f) || M), where M = (1/2) * (Q(R_f) + Q(S_f))

and Q(·) denotes the label distribution of an extracted set. The cueness score combines these two quantities: it grows with the MSE of the extracted sets and shrinks with the JSD between them, so that a feature which skews both training and test labels in the same way receives a high score. This cueness score represents the degree to which dataset X is biased with respect to feature f.
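The statistics above can be sketched as follows. The MSE and JSD implementations are standard; the final product combining them is an illustrative assumption that matches the verbal definition (high MSE, low JSD gives high cueness), not necessarily the paper's exact formula:

```python
from collections import Counter
from math import log2

def mse(counts):
    """Peakedness of a label-count distribution: mean squared deviation
    of per-label counts from the mean count."""
    ys = list(counts.values())
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys) / len(ys)

def distribution(counts):
    """Normalize label counts into a probability distribution Q."""
    total = sum(counts.values())
    return {l: c / total for l, c in counts.items()}

def jsd(q1, q2):
    """Jensen-Shannon divergence (base-2 logs, so bounded in [0, 1])."""
    labels = set(q1) | set(q2)
    m = {l: 0.5 * (q1.get(l, 0) + q2.get(l, 0)) for l in labels}
    def kl(p):
        return sum(p[l] * log2(p[l] / m[l]) for l in labels if p.get(l, 0) > 0)
    return 0.5 * kl(q1) + 0.5 * kl(q2)

# Toy label counts for R_f and S_f, skewed the same way.
train_counts = Counter({"contradiction": 80, "entailment": 10, "neutral": 10})
test_counts = Counter({"contradiction": 40, "entailment": 5, "neutral": 5})
similarity = 1 - jsd(distribution(train_counts), distribution(test_counts))
cueness = mse(train_counts) * similarity  # illustrative combination
```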

Model Probing Phase
In the previous section, we established that a dataset X can be influenced by a cue f. However, a model trained on this dataset may not necessarily exploit that cue, as model behavior depends on both the data and the architecture. In this section, we introduce a framework to probe any model instance trained on the biased dataset, assessing whether it utilizes cue f and to what degree. We achieve this through two tests: the accuracy test and the distribution test.

Accuracy Test
The accuracy test examines the model's performance on data subsets with and without a specific feature. By comparing the model's accuracy on these two subsets, we can understand the model's generalization capabilities under different conditions. If the model's performance shows a noticeable improvement on the subset containing the feature, the model may have exploited that feature for prediction.
In the accuracy test, we assess the prediction accuracies of the model M on the extracted test set (with feature f) and on the remaining test set (without feature f), denoted as acc(S_f) and acc(S_nf), respectively. The accuracy test computes the difference between these two accuracies:

ΔAcc = acc(S_f) − acc(S_nf)

The sign of ΔAcc indicates the direction of the model's performance change between the subsets with and without the feature. A positive value suggests that the model exploits the feature for prediction, while a negative value implies that the model struggles to generalize or is detrimentally sensitive to the feature.
The absolute value of ΔAcc reflects the degree to which the model's performance is affected by the presence or absence of the feature. A larger absolute value indicates a stronger reliance on or sensitivity to the feature, whereas a smaller absolute value suggests a more robust model that is less affected by the feature.
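The accuracy test can be sketched in a few lines. The stub `cue_model` below is a hypothetical cue-exploiting predictor used only to illustrate a positive ΔAcc; it is not any of the models evaluated in the paper:

```python
def accuracy(model, instances):
    """Fraction of instances the model labels correctly."""
    correct = sum(model(inst) == inst["label"] for inst in instances)
    return correct / len(instances)

def delta_acc(model, s_f, s_nf):
    """Delta_Acc = acc(S_f) - acc(S_nf); positive suggests the model exploits f."""
    return accuracy(model, s_f) - accuracy(model, s_nf)

def cue_model(inst):
    # Hypothetical model that leans entirely on the word "no" as a cue.
    return "contradiction" if "no" in inst["hypothesis"].split() else "entailment"

s_f = [{"hypothesis": "there is no dog", "label": "contradiction"}]
s_nf = [{"hypothesis": "a dog sleeps", "label": "neutral"}]
print(delta_acc(cue_model, s_f, s_nf))  # 1.0: perfect with the cue, 0.0 without
```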

Distribution Test
The distribution test is a visual examination that focuses on how changes in specific feature distributions within datasets affect a model's predictive performance.
First, we create a "stress dataset" by "flattening" the label distribution in S_f: we randomly remove instances from every label except the one with the fewest instances, stopping when all labels are balanced. In other words, we retain for each label only the minimum per-label number of instances, randomly discarding the excess. This effectively eliminates the bias in the extracted test set and challenges the model.
Next, we apply the model to the stress test set and obtain its predictions. We then compare the label distribution of the predictions on the stress test set with the label distribution of the extracted training data (R_f). The rationale is that if the extracted training data contains a cue, its label distribution will be skewed towards a specific label. If the model exploits this cue, it will prefer to predict that label as much as possible, amplifying the skewness of the distribution even though the input test set has been neutralized. We aim to observe such amplification in the output distribution to identify the model's weaknesses.
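The flattening step can be sketched as follows (a minimal version; the instance format is the illustrative one used earlier):

```python
import random
from collections import defaultdict

def flatten(instances, seed=0):
    """Build the stress set: keep, for every label, only as many randomly
    sampled instances as the rarest label has, so labels are balanced."""
    by_label = defaultdict(list)
    for inst in instances:
        by_label[inst["label"]].append(inst)
    k = min(len(v) for v in by_label.values())  # size of the rarest label
    rng = random.Random(seed)
    stress = []
    for insts in by_label.values():
        stress.extend(rng.sample(insts, k))
    return stress
```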
In summary, the accuracy test and the distribution test both assess a model's sensitivity to specific features, but they emphasize different aspects. The distribution test focuses on the impact of feature distribution changes on model predictions, while the accuracy test evaluates the model's performance on data subsets with and without specific features. By combining the two, a model's sensitivity to a particular feature can be assessed more comprehensively: if both tests find the model sensitive to a feature, we can draw that conclusion with higher confidence.

Evaluation
We first present the experimental setup, followed by results on cue discovery, model probing, and analysis. The entire framework is implemented in an online demo.

Cues in Datasets
In this section, we showcase the cues identified in each dataset using the cueness metric, as described in Section 3.
We first filter the training and test data of each dataset using all the features defined in this paper. The left half of Table 2 displays the top 5 cues discovered for each of the 10 datasets, along with their cueness scores. ARCT_adv, an adversarial dataset, is intentionally well-balanced. Consequently, we found only one cue, OVERLAP, with a very low cueness score. This is unsurprising, since OVERLAP is the only "second-order" feature in our list of linguistic features that considers tokens in both the premise and hypothesis, and it likely evaded data manipulation by the dataset's creator.
Mostly, the top 5 cues discovered are word features. However, besides OVERLAP, we also see NEGATION and TYPO appearing in the lists. In fact, SENTIMENT and NER features would have emerged had we expanded the lists to the top 10. Interestingly, several features previously reported as biased by other works are also found, such as "not" and NEGATION in ARCT, "no" in MNLI and SNLI, and "like" in ROC. In MNLI in particular, all five discovered cues are related to negatively toned words, suggesting significant human artifacts in this dataset that can lead to model fragility.
Additionally, we observe that some word cues are indicative of certain syntactic, semantic, or sentiment patterns in the questions. For example, "because" in SNLI implies a cause-effect structure; "like" in ROC indicates positive sentiment; "probably" and "may" in RACE suggest uncertainty; and so on. These features can serve as clues for revising datasets.

Biases in Models
To investigate whether a model is affected by a specific cue or feature in a dataset, we train four models on their original training sets and evaluate them using accuracy and distribution tests.
Accuracy Test: The results are presented in Table 2. As mentioned in Section 3.3.1, the sign of a ∆ value indicates the direction of a model's performance change when comparing its accuracy on data subsets with and without a specific feature. The absolute value of ∆ reflects the degree to which the model's performance is influenced by the feature, with larger values suggesting stronger reliance or sensitivity and smaller values indicating a more robust model.
The bottom of Table 2 shows that, across all 10 datasets, the sum of the absolute values of ∆ follows the order RoBERTa < BERT < ESIM < FastText. This is consistent with earlier hypothesis-only tests and the community's common perception of these popular models. However, examining individual datasets and features reveals a more nuanced situation. For instance, FastText tends to pick up individual word cues rather than semantic cues, while more complex models such as BERT and RoBERTa appear more sensitive to structural features like NEGATION and SENTIMENT, which are in fact classes of words. This is well explained by FastText's design, which focuses on modeling words rather than syntactic or semantic structures.
Interestingly, FastText exhibits a strong negative correlation with TYPO. We speculate that FastText might have been trained with a more orthodox vocabulary, making it less tolerant of typos in the text. Distribution Test: We observe in Table 2 that all models achieve a positive ∆ on the cue "no" in MNLI; Figure 3 shows this effect amplified for FastText and ESIM.
With the "no" cue, they prefer to predict "Contradiction" even more than the ground truth in the training data.In contrast, BERT and RoBERTA moderately follow the training data.While the cue "no" is effective at influencing the models, the cue "above" is not as successful.Figure 3 shows that the distribution of predicted results for ESIM in ARCT is entirely opposite to the training data, explaining the ∆ = −8.43 in Table 2 and demonstrating that models may not exploit a cue even if it is present in the data.Similarly, "speaking" in BERT and RoBERTA can also explain their low ∆ values, which are not shown in Table 2.
The cue "threw" presents an outlier for BERT, as the distribution test result is inconsistent with the accuracy test: the accuracy deviation is very high for BERT, but its prediction distribution is flat. We have not encountered many such contradictory cases. When they do occur, as in this example, we give BERT the benefit of the doubt that it might not have exploited the cue "threw".
Prompt 1: Analyze the logical relationship between the following premise: '{premise}' and the hypothesis: '{hypothesis}'. Determine whether the hypothesis is an entailment, contradiction, or neutral with respect to the premise.

Prompt 2: Please identify whether the premise entails, contradicts, or is neutral with respect to the hypothesis. The answer should be exactly "entailment," "contradiction," or "neutral." premise: '{premise}' hypothesis: '{hypothesis}'

Prompt 3 (P1 + CoT): Prompt 1 followed by "Let's think step by step."

Prompt 4 (P2 + CoT): Prompt 2 followed by "Let's think step by step."


Case Study: ChatGPT

ChatGPT, a GPT-3.5-series model, is trained through Reinforcement Learning from Human Feedback (RLHF) (Christiano et al., 2017), similarly to InstructGPT (Ouyang et al., 2022). In this section, we investigate whether zero-shot ChatGPT is influenced by bias features, using a case study focused on the word "no" in the MNLI dataset. We aim to compare the effectiveness of different prompts and select the best one for mitigating bias with respect to a single bias feature.

Dataset
We selected test instances from the MNLI dataset to study the influence of the word "no" on ChatGPT's performance.For the accuracy test, we used all 313 instances with "no" and an equal number of instances without "no", randomly chosen from the remaining test set.This ensures a balanced evaluation of ChatGPT's performance.
For the distribution test, we selected 38 instances per label containing "no", resulting in a total of 114 instances.

Prompts
Figure 4 shows the four prompts. The first prompt was proposed by ChatGPT itself: we asked, "What is the best prompt for the MNLI task according to you?", and ChatGPT returned Prompt 1. The second prompt is inspired by previous work (Qin et al., 2023). The third and fourth prompts are created by appending "Let's think step by step" (Kojima et al., 2022) to Prompt 1 and Prompt 2, respectively, to invoke "chain of thought" (CoT) reasoning. This modification has been shown to significantly improve the performance of InstructGPT on reasoning tasks (Ouyang et al., 2022). We evaluated the model's accuracy using the different prompts on instances with and without the word "no". The results are shown in Table 3. P1 demonstrates a negative ∆Acc, indicating difficulty in generalizing when "no" is present. P2 exhibits a positive ∆Acc, suggesting better generalization.

Adding "CoT" to both prompts reduces bias risk.P1 + CoT shows the most significant improvement in Acc ("no"), but P2 + CoT has the smallest absolute ∆ Acc, indicating the least sensitivity to "no" and the lowest bias risk among the tested prompts.
We also analyzed the model's prediction distributions (Figure 5) for the different prompts on the stress test set containing the word "no" with a balanced label distribution. The distribution test results reveal imbalances in the P1 and P2 prediction distributions, with both leaning towards predicting "Contradiction". Adding CoT mitigates these imbalances, leading to more balanced distributions. P2 + CoT presents the most balanced distribution among the labels, supporting its lowest bias risk.
In conclusion, our case study on the feature "no" indicates that zero-shot ChatGPT can be influenced by bias features, and that the choice of prompt significantly affects its performance. For this feature, P2 + CoT showed the lowest bias risk across both tests, and the CoT strategy appears effective in reducing bias risk. Notably, ChatGPT's self-recommended prompt (P1) is not always optimal, which underscores the importance of human intervention and ongoing exploration to optimize performance and minimize bias risks. Future studies would benefit from a more nuanced, feature-specific analysis.

Related Work
Our work is related to three research directions: spurious features analysis, bias calculation, and dataset filtering.
Spurious feature analysis has been increasingly studied recently. Much work (Sharma et al., 2018; Srinivasan et al., 2018; Zellers et al., 2018) has observed that some NLP models can surprisingly obtain good results on natural language understanding questions in MCQ form without even looking at the stems of the questions; such tests are called "hypothesis-only" tests in some works. Further, some research (Sanchez et al., 2018) discovered that these models are insensitive to certain small but semantically significant alterations in the hypotheses, leading to speculation that the hypothesis-only performance is due to simple statistical correlations between words in the hypothesis and the labels. Spurious features can be classified into lexicalized and unlexicalized (Bowman et al., 2015): lexicalized features mainly contain indicators of n-gram tokens and cross-n-gram tokens, while unlexicalized features involve word overlap, sentence length, and BLEU score between the premise and the hypothesis. Naik et al. (2018) refined the lexicalized class into Negation, Numerical Reasoning, and Spelling Error. McCoy et al. (2019) refined the word overlap features into Lexical overlap, Subsequence, and Constituent, which also consider syntactic structure overlap. Sanchez et al. (2018) added unseen tokens as an extra lexicalized feature.
Bias calculation is concerned with methods to quantify the severity of cues. Some work (Clark et al., 2019; He et al., 2019; Yaghoobzadeh et al., 2019) attempted to encode the cue feature implicitly by hypothesis-only training or by extracting features associated with a certain label from the embeddings. Other methods compute the bias with statistical metrics. For example, Yu et al. (2020) used the probability of seeing a word conditioned on a specific label to rank words by their biasness. LMI (Schuster et al., 2019) has also been used to evaluate cues and to re-weight instances in some models. However, these works did not justify the choice of these metrics one way or the other. Separately, Ribeiro et al. (2020) gave a test data augmentation method without assessing the degree of bias.
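One such statistical metric, local mutual information (LMI), can be sketched as follows for a single (word, label) pair. This is a simplified illustrative form, LMI(w, l) = p(w, l) * log(p(l | w) / p(l)), with made-up counts:

```python
from math import log

def lmi(count_wl, count_w, count_l, total):
    """Local mutual information of a (word w, label l) pair, estimated
    from corpus counts: joint count, word count, label count, corpus size."""
    p_wl = count_wl / total          # p(w, l)
    p_l_given_w = count_wl / count_w  # p(l | w)
    p_l = count_l / total            # p(l)
    return p_wl * log(p_l_given_w / p_l)

# Toy counts: "no" co-occurs with "contradiction" far more than chance.
score = lmi(count_wl=18, count_w=20, count_l=40, total=100)
```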
Dataset filtering is one way of achieving higher quality in datasets by reducing artifacts. In fact, datasets such as SWAG and RECLOR evaluated in this paper were produced using variants of this filtering approach, which iteratively perturbs the data instances until a target model can no longer fit the resulting dataset well. Some methods (Yaghoobzadeh et al., 2019), instead of preprocessing the data to remove biases, leave out biased samples in the middle of training according to decisions made from epoch to epoch. Bras et al. (2020) investigated model-based reduction of dataset cues and designed an iterative training algorithm in which any model can be used. Although such an approach is more general and more efficient than human annotation, it depends heavily on the models, and different models may catch different cues; thus such methods may not be complete.

Conclusion
Our lightweight framework, ICQ, identifies biases and cues in multiple-choice NLU datasets, illuminating model behaviors from a statistical perspective. Extensive experimentation on diverse tasks validates ICQ's efficiency in revealing dataset and model biases. Through a case study on ChatGPT, we explore the cues it learns and offer practical guidance. ICQ advances our understanding and optimization of large language models, promoting the creation of robust, unbiased AI systems.

Figure 1 :
Figure 1: Example of a cue.

Figure 2 :
Figure 2: ICQ workflow. ①: data extraction phase; ②: cue discovery phase; ③: model probing phase. f = a specific feature; R = training data; S = test data; R_f = extracted training data; S_f = extracted test data; S_nf = remaining test data without feature f; S'_f = flattened (stress) test data; M = a specific model.

Figure 3 :
Figure 3: Four test examples for distribution comparison with 4 different models

Table 1: Specifications of the 10 datasets. Data size refers to the number of questions in each dataset. CLS = Classification; MCQ = Multiple Choice Questions. By our definition, k-way MCQs are split into k instances in preprocessing. Most datasets are collected through crowdsourcing; SWAG is generated from an LSTM-based language model.

Table 2: Datasets, their top 5 cues, and the biases ∆ of 4 models on them.