Mitigating Word Bias in Zero-shot Prompt-based Classifiers

Prompt-based classifiers are an attractive approach for zero-shot classification. However, the precise choice of prompt template and label words can largely influence performance, with semantically equivalent settings often showing notable performance differences. This discrepancy can be partly attributed to word biases, where the classifier may be biased towards particular classes. To address this problem, it is possible to optimise classification thresholds on a labelled dataset; however, this mitigates some of the advantages of prompt-based classifiers. This paper instead approaches the problem by examining the expected marginal probabilities of the classes. Here, probabilities are reweighted to have a uniform prior over classes, in an unsupervised fashion. Further, we draw a theoretical connection between the class priors and the language model's word priors, which offers the ability to set thresholds in a zero-resource fashion. We show that matching class priors correlates strongly with the oracle upper-bound performance, and demonstrate large, consistent performance gains across prompt settings for a range of NLP tasks.


Introduction
Large language models (LLMs) have shown impressive general ability for natural language processing (NLP) tasks. LLMs can effectively handle a range of NLP tasks through 'prompting', where a natural language instruction is added to the input, conditioning the model to the task at hand. Prompting can either be an emergent ability learned through scaling up model size (Brown et al., 2020; Wei et al., 2022) or an ability learned through instruction tuning (Wei et al., 2021; Chung et al., 2022; Ouyang et al., 2022). Despite the recent popularity of prompting, prompt-based LLMs have a known sensitivity to elements such as the prompt template and label words (Gao et al., 2021; Schick and Schütze, 2021). [1]

[1] Code available on GitHub at https://github.com/adianliusie/robust-prompt-classifier

[Figure 1: two example review inputs ('Inception was great!' and 'disappointing, I thought the whale would be a wildlife documentary...') are passed through the template 'What is the sentiment of the following review?', yielding LM label-word probabilities, e.g. P(w=amazing|x)=0.02, P(w=bad|x)=0.0015 for the positive review and P(w=amazing|x)=0.001, P(w=bad|x)=0.010 for the negative one.] Figure 1: Instead of using the raw LM output probabilities of the label words, we consider mitigating bias by finding weights that make the classifier unbiased over classes. This is connected to normalising by word priors, which we use as a zero-resource de-biasing approach.

Previous works have demonstrated that prompt templates can significantly impact task performance (Shin et al., 2020; Zhou et al.) and that factors such as the chosen label words can influence system performance for classification tasks (Zhao et al., 2021; Holtzman et al., 2021). This work focuses on the influence of 'word biases' for prompt-based classifiers, i.e. the bias that prompts may have towards certain classes, independent of the input text. To account for this bias, one could use a labelled dataset to find optimal class decision thresholds. This, however, requires labelled task data, which may limit the zero-shot benefits of prompt-based classifiers. We propose a simple unsupervised solution of re-weighting probabilities, where we use unlabelled data to search for weight parameters that ensure a uniform prior over classes. We show that this prior matching leads to greater robustness across diverse prompt settings, and that the unsupervised weights which debias the classifier are highly correlated with the oracle weights that maximise accuracy. Further, we provide theoretical analysis that draws a connection between word priors and inherent class bias, which we use to motivate a zero-resource normalisation approach that is competitive with prior matching. Overall, we demonstrate that our unsupervised approach greatly reduces sensitivity to the chosen prompt and label words, and that settings which initially fail can often be made effective through a simple probability re-weighting.
Our contributions are: 1) We propose a simple unsupervised probability re-weighting method, and empirically demonstrate greater robustness to prompt and label word choice, with large accuracy gains across prompt settings for a range of standard NLP tasks. 2) We theoretically connect the weight parameters to word priors and use this to motivate a zero-resource re-weighting approach. 3) We show that the weights of prior matching are highly correlated with the optimal oracle weights that maximise accuracy, illustrating that our approach is a near-optimal use of a system's output probabilities.

Mitigating Bias by Re-weighting
Prompt-based classifiers Given an input sequence x ∈ X, large language models (LLMs) model P_θ(w|x), the output probability distribution over all possible sequences w ∈ X. For a classification task T, a prompt-based classifier 1) reformats the input text x to a prompt p ∈ X by including the task instruction, and 2) selects class words {w_k}_{1:K} which are associated with each output class {y_k}_{1:K}. For example, in sentiment classification one can use the prompt 'what is the sentiment of the following review? <x>' (where <x> is the current input x, e.g. 'Inception was absolutely brilliant'), and class words w_0 = bad and w_1 = good for the negative and positive classes respectively. For a prompt classifier Q = {p, {w_k}_{1:K}}, class probabilities can be set to be proportional to the probability of the associated class word, where the final decision ŷ is the class with the highest probability (Zhao et al., 2021; Jiang et al., 2020):

P̂_θ(y_k|x, Q) = P_θ(w_k|p(x)) / Σ_i P_θ(w_i|p(x)),    ŷ = argmax_k P̂_θ(y_k|x, Q)    (1)

However, as a large language model, the prompt-based classifier may return probabilities that are influenced by distributional statistics of words (Gardner et al., 2021; Liusie et al., 2022). This may lead to inherent class bias, where label words may have high probability not because they better answer the prompt, but because they have a high LM prior.
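As an illustrative sketch (not the authors' released code), the baseline decision rule of equation 1 simply renormalises the LM's probabilities over the K label words and takes the argmax; the probabilities below are toy numbers:

```python
import math

def prompt_classify(label_word_logprobs):
    """Turn raw LM log-probabilities of the K label words into class
    probabilities by renormalising over the label words only, then
    pick the argmax class (the baseline decision rule of equation 1)."""
    unnorm = [math.exp(lp) for lp in label_word_logprobs]
    z = sum(unnorm)
    probs = [p / z for p in unnorm]
    y_hat = max(range(len(probs)), key=probs.__getitem__)
    return probs, y_hat

# toy example: P(w=amazing|p(x)) = 0.02, P(w=bad|p(x)) = 0.0015
probs, y_hat = prompt_classify([math.log(0.02), math.log(0.0015)])
```

Here y_hat is 0 (the class associated with 'amazing'), since only the relative probabilities of the label words matter after renormalisation.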
Optimal Weights To account for this, one can define weight parameters α = {α_k}_{1:K}, where each α_k ∈ R+ scales the probabilities of the classifier:

P̂_θ(y_k|x, Q, α) = α_k P_θ(w_k|p(x)) / Σ_i α_i P_θ(w_i|p(x))

Given a labelled task dataset D = {(x^(j), y^(j))}_{j=1:N}, one can then find the optimal weights α* that maximise the accuracy of the prompt classifier P̂_θ(y_k|x, Q, α) over the dataset:

α* = argmax_α Σ_j 1[ŷ^(j)(α) = y^(j)]    (4)

Prior-Matching The previous approach requires labelled data, which may limit the benefit of using prompt-based classifiers. As an alternative, one can find the values ᾱ that ensure that the classifier is unbiased, such that the class prior P̂_θ(y_k|Q, α) matches the true prior P(y_k):

E_x[P̂_θ(y_k|x, Q, ᾱ)] = P(y_k),  ∀k    (7)

A deterministic solution that exactly matches the distributions exists, which can be found with a search; the single redundant degree of freedom (the overall scale of α) can be accounted for by setting α_1 = 1. If there is no expected class bias, one can assume equal probabilities over all classes, P(y_k) = U(y_k) = 1/K. This approach is therefore unsupervised and only requires text inputs D_x = {x^(j)}_{j=1:M}, and so can be applied at inference time to any test set.

Null-Input Approximation
The dependence of prior-matching on the unlabelled dataset D_x is a drawback. In Appendix A, we show that one can make the analytical approximation

ᾱ_k ≈ 1 / P_θ(w_k|p(∅))    (9)

where p(∅) denotes the prompt applied to a 'null' (content-free) input. This enables a zero-resource approximation of the weight parameters ᾱ.
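A sketch of this zero-resource calibration, assuming the label-word probabilities under a content-free input (e.g. the prompt with an empty review) have already been obtained from the LM; the numbers below are hypothetical:

```python
def null_input_weights(null_word_probs):
    """Approximate the de-biasing weights as the inverse of each label
    word's probability under a 'null' (content-free) input, with the
    redundant scale fixed by alpha_1 = 1 (equation 9)."""
    alpha = [1.0 / p for p in null_word_probs]
    return [a / alpha[0] for a in alpha]

# toy word priors: P(w=amazing|p(null)) = 0.004, P(w=bad|p(null)) = 0.020
alpha = null_input_weights([0.004, 0.020])
```

A label word with a high prior probability (here 'bad') receives a small weight, counteracting the classifier's inherent bias towards it.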
Experiments

We compare four methods: the raw label word probability via equation 1 (baseline); normalised probabilities calculated using null-input priors via equation 9 (null-input); optimising α_k with a search to obtain an unbiased class prior via equation 7 (prior-match); and the oracle upper-bound performance, found by searching for the accuracy-maximising thresholds via equation 4 (optimal).

Experimental Results
Classification Robustness Table 1 shows the mean and standard deviation of accuracies among all prompt and class word settings for a given task. We observe large, consistent gains from both re-weighting approaches, with prior-matching increasing baseline accuracy by 6.7% to 12.1% for sentiment classification, by 13.7% for QQP, and by over 25% for natural language inference. Prior-matching also demonstrates performance very similar to the oracle upper bound, often within 1%, showing that the unsupervised prior-match approach is competitive with the supervised threshold search. Prior-matching also performs better than null-input by a small margin in all tasks, where this small gap confirms that word-prior normalisation is a very reasonable zero-resource approximation.
Prompt Robustness Figure 2 illustrates a boxplot of Rotten Tomatoes performance over all class words for each considered method, over all 6 prompts. As observed in Table 1, naively using raw label word probabilities (dark blue) leads to considerable fluctuations in accuracy; some prompt and label word settings lead to reasonable accuracy (92%+), however there is observed brittleness to label word choice, with many settings demonstrating poor performance. Prior matching (green) leads to significant robustness, with nearly all sensible settings above 85% accuracy. We further find that, as shown in Table 1, the unsupervised approach has accuracies very comparable to those when using optimal thresholds.

Figure 2: boxplots of the accuracy of all label-word pairs for rotten tomatoes, over all the considered prompts

In Figure 3, we consider similar boxplots for SNLI and observe larger gains through re-weighting. This is because higher probabilities are often assigned to the entailment and contradiction label words, leading to under-classification of the neutral class. We observe greater sensitivity to prompt choice and label words for SNLI than for Rotten Tomatoes, even with re-weighting.

Weight Alignment Figure 4 shows a scatter plot of the weights found by the optimal threshold search α* (equation 4) against those found by the unsupervised prior-matching method ᾱ (equation 7) and the zero-resource word-prior approximation (equation 9). We see a clear linear relationship between optimal and prior-match, illustrating that accounting for the marginal bias is almost equivalent to maximising accuracy, however achieved in an unsupervised fashion. Null-input is also well correlated with the optimal thresholds, but the relationship is less direct. Similar linear relationships are also observed for other binary classification tasks and prompts, as shown in Appendix C.

Conclusions

This paper analyses prompt-based classifiers and demonstrates that inherent class bias is a significant factor that influences the sensitivity of the system to prompt and label words. We propose an unsupervised approach of prior matching, which we demonstrate performs competitively to the supervised alternative of searching for optimal thresholds, while avoiding the need for labelled data. We relate prior matching to word biases, and motivate a zero-resource approach for debiasing model probabilities. We show that our methods lead to practical approaches that reduce sensitivity to design choices such as prompts and label words.

Limitations
This work considered sentiment classification, natural language inference, and paraphrase detection, and could be extended over a greater suite of tasks to further verify its effectiveness. Further, this paper ran experiments on FlanT5 and Llama2, and has not yet explored a larger range of prompted language models. FlanT5 has also been instruction-tuned on similar tasks, so the findings may be limited in scenarios where capabilities have to be robustly elicited from models that have not been tuned for them.

Ethical Considerations
Though this work suggests methods to improve the robustness of prompt-based classifiers to prompts and label words, this does not imply that all design choices will work. In some setups, the system may be ineffective and generalise poorly over the task. Deploying machine learning classifiers in real-world settings has many associated risks, and careful analysis should be made before deploying such systems.

A Derivation of Zero-Resource Equation
For a prompt classifier Q = {p, {w_k}_{1:K}}, class probabilities are assumed to be proportional to the probability of the associated class word:

P̂_θ(y_k|x, Q) = P_θ(w_k|p(x)) / Σ_i P_θ(w_i|p(x))

Given the task dataset D = {(x^(j), y^(j))}_{j=1:N}, one can calculate the assumed prior of the prompt classifier over the output classes:

P̂_θ(y_k|Q) = E_x[P̂_θ(y_k|x, Q)] ≈ (1/N) Σ_j P̂_θ(y_k|x^(j), Q)

This can be compared to the actual prior of the task/domain:

P(y_k|D) = (1/N) Σ_j 1[y^(j) = y_k]

If D is sufficiently large, then an unbiased classifier should have a class prior similar to that approximated via the labels. However, if they diverge, one may wish to debias the classifier by scaling class probabilities by factors α_k:

P̂_θ(y_k|x, Q, α) = α_k P_θ(w_k|x, Q) / Z(x, Q, α)

where Z(x, Q, α) = Σ_i α_i P_θ(w_i|x, Q) and P_θ(w_k|x, Q) ≡ P_θ(w_k|p(x)). The parameters ᾱ that lead to an unbiased classifier can then be determined in a deterministic fashion.
Note that by constraining α_1 = 1, there will exist a deterministic solution that ensures that P̂_θ(y_k|Q, α) = P(y_k). For given weight parameters α, consider the prompt-classifier priors:

P̂_θ(y_k|Q, α) = E_x[ α_k P_θ(w_k|x, Q) / Z(x, Q, α) ]

One can approximate this using a first-order Taylor series of the expectation of a ratio, yielding

P̂_θ(y_k|Q, α) ≈ α_k E_x[P_θ(w_k|x, Q)] / E_x[Z(x, Q, α)]

By equating the predicted prior with the true prior, we find an approximation for ᾱ_k:

ᾱ_k ∝ P(y_k|D) / E_x[P_θ(w_k|x, Q)]

where the expectation over inputs can be approximated by the LM's word prior, computed with a null input, yielding equation 9. A final insight is that in many cases it is assumed that there should be no inherent class bias, and so P(y_k|D) can be assumed to be uniform and included in the normalisation term.
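The closed-form approximation above can be sketched numerically; the probabilities below are hypothetical toy values, not taken from the paper:

```python
def taylor_weights(P, prior=None):
    """Closed-form weights from the first-order Taylor approximation:
    alpha_k proportional to P(y_k) / E_x[P_theta(w_k|x, Q)],
    with the redundant scale fixed by alpha_1 = 1."""
    K = len(P[0])
    if prior is None:
        prior = [1.0 / K] * K  # assume no inherent class bias
    # empirical expectation of each label word's probability over inputs
    mean = [sum(row[k] for row in P) / len(P) for k in range(K)]
    alpha = [pr / m for pr, m in zip(prior, mean)]
    return [a / alpha[0] for a in alpha]

P = [[0.9, 0.1], [0.5, 0.5]]  # mean label-word probabilities: [0.7, 0.3]
alpha = taylor_weights(P)
```

With a uniform prior, the under-predicted class (mean probability 0.3) receives the larger weight, α_2 = (0.5/0.3)/(0.5/0.7) = 7/3.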
B Prompts and Label Words

Tables 8 and 9 show the prompt-based classifier performance for the different methods when using FlanT5-base and Llama-2-chat 7B respectively. For sentiment classification and natural language inference tasks, we similarly observe that the various re-weighting methods lead to considerable boosts in accuracy. Both null-input and prior-match again lead to performance near that of the optimal weights, with a considerable performance boost over the baseline. However, for paraphrase detection, we only observe moderate performance boosts over the baseline setting, with a larger performance discrepancy with the optimal weights.

Figure 3: boxplots of the accuracy of all label-word sets for snli, for the first 3 prompts

Figure 4: Scatter plot of the optimal weights α* (equation 4) with the prior-match weights ᾱ (equation 7) and the approximation via null-input (equation 9), for all settings of prompt 1 on amazon

Figure 7: boxplots of the accuracy of all label-word pairs on IMDB, over all the considered prompts

Table 1: Average dataset accuracy and standard deviations, over all prompts and label words. baseline and null-input are zero-resource classification methods, prior matching uses the text inputs but not labels, while optimal is an oracle approach that uses the labels to search for the best thresholds. Results for FlanT5 large.

Table 3: label words for sentiment classification

B.2 Natural Language Inference

prompts:
is the second text an entailment of the first text?
does the second text directly follow from the first text?
are the texts related?
are the texts consistent?
does text 1 imply text 2?
can text 2 be logically derived from text 1?
does the hypothesis logically follow the premise?

Table 5: label words for NLI

Table 7: label words for sentiment classification

C Threshold Alignment Plots

Figure 5: weights alignment plot for rotten tomatoes
Figure 6: weights alignment plot for imdb

D Impact of LLM Choice

Table 8: Robustness performance when using FlanT5 base as the base LLM (set-up equivalent to Table 1).

Table 9: Robustness performance when using Llama-2-chat 7B as the base LLM.