Zero-shot Approach to Overcome Perturbation Sensitivity of Prompts

Recent studies have demonstrated that natural-language prompts can help to leverage the knowledge learned by pre-trained language models for the binary sentence-level sentiment classification task. Specifically, these methods utilize few-shot learning settings to fine-tune the sentiment classification model using manual or automatically generated prompts. However, the performance of these methods is sensitive to the perturbations of the utilized prompts. Furthermore, these methods depend on a few labeled instances for automatic prompt generation and prompt ranking. This study aims to find high-quality prompts for the given task in a zero-shot setting. Given a base prompt, our proposed approach automatically generates multiple prompts similar to the base prompt employing positional, reasoning, and paraphrasing techniques and then ranks the prompts using a novel metric. We empirically demonstrate that the top-ranked prompts are high-quality and significantly outperform the base prompt and the prompts generated using few-shot learning for the binary sentence-level sentiment classification task.


Introduction
Recent large language models such as ChatGPT (ChatGPT, 2022), GPT-3 (Brown et al., 2020), and T5 (Raffel et al., 2020) have shown an astounding ability to understand natural language. These pre-trained models can perform various Natural Language Processing (NLP) tasks under zero/few-shot settings using natural language instructions (i.e., prompts) when no or only a few training samples exist. Prompts play a crucial role in these scenarios.
Prompts can be generated manually or automatically (Schick and Schütze, 2021; Gu et al., 2022; Wang et al., 2022). Manual prompts are handcrafted based on the user's intuition of the task (Schick and Schütze, 2021). Humans can easily write prompts, but manual prompts are likely to be suboptimal since language models may understand the instruction differently from humans. Prior studies have also shown that the performance of language models is sensitive to the choice of prompts. For example, Jiang et al. (2020) have shown that the performance is sensitive to the choice of certain words in the prompts and the position of the prompts. Due to this sensitivity and the potential misunderstanding of the instruction, manual prompts tend to suffer from poor performance under zero-shot settings. Language models tend to understand human intentions better when used with a small amount of training data. Therefore, the model can improve significantly under few-shot settings.
To address the problems of manual prompts, some studies (Jiang et al., 2020) further propose to generate prompts automatically under few-shot settings. These methods utilize generative language models, such as the T5 model, to write prompts automatically using a small amount of training data from the task. Some studies (Shin et al., 2020) also use the small training set to fine-tune the language models or to evaluate the prompts. However, automatically generated prompts have several drawbacks in real applications. First, prompts cannot be generated in zero-shot settings, and the generated prompts may not follow the human intuition of the tasks. Second, deploying the generative language models also poses challenges. Due to the size of pre-trained generative language models, deploying them on local hardware can be costly. Using generative language models via an API (ChatGPT, 2022) also faces limitations, such as privacy concerns when uploading confidential customer or organizational data.
In this work¹, we aim to study how to improve manual prompts for classification tasks under zero-shot settings using moderately sized masked language models. Specifically, we use the binary sentence-level sentiment classification task as the testbed. Instead of deploying large generative language models, we study the usability of moderately sized masked language models, such as BERT (Devlin et al., 2019), which can be deployed and tuned in-house easily for real-world applications. The prompt follows the cloze-style format, where the position of the label is masked (e.g., "Battery life was great. The sentence was [MASK]", where a positive polarity is the goal of prediction). The prompts are used to predict probability scores for the polarity labels from the pre-trained masked language model.
To overcome the sensitivity of the language model to a manual prompt, we propose augmentation strategies to automatically generate more candidate prompts similar to the manual prompt (i.e., the base prompt), which is not required to be complex or optimized. Three augmentation techniques are designed: positioning, subordination, and paraphrasing. Unlike prior work in which generative language models are used to generate candidate prompts, we use the same masked language model to paraphrase the base prompt. To find high-quality prompts under the zero-shot setting, we propose a novel ranking metric based on the intuition that high-quality prompts should be more sensitive to changes of certain keywords. If a prompt is not sensitive to the change of certain keywords, it is not high-quality, and vice versa.
We conduct extensive experiments on various benchmark datasets from different domains of binary sentence-level sentiment classification and show the efficacy of the proposed ZS-SC model compared with different prompts, including manually and automatically generated prompts, in the zero-shot setting. The experimental results demonstrate the effectiveness of the proposed method in real applications.
In summary, the main contributions of this paper are as follows:
• We propose a prompt augmentation method using moderately sized masked language models to improve manual prompts for classification tasks under zero-shot settings.
• To rank the automatically generated prompts under the zero-shot setting, we propose a novel ranking metric based on the intuition that high-quality prompts should be sensitive to the change of certain keywords in the given sentence.
• Extensive experiments and ablation studies performed on benchmark datasets for sentence-level sentiment classification tasks validate the effectiveness of the proposed method.
¹ The code can be found at https://github.com/Mohna0310/ZSSC

Related Work
Prompt-based learning is a recent paradigm used in the zero/few-shot setting. In the zero-shot setting, the model is given a natural language instruction (prompt) describing the task without any training data (Brown et al., 2020), whereas in the few-shot setting, a few samples of training data are used along with the prompt. In prompt-based learning, the downstream tasks are formalized as masked language modeling problems using natural language prompts. Then, a verbalizer is used to map the masked language model prediction to the labels of the downstream task. This work uses prompt-based learning for the binary sentence-level sentiment classification task. This section discusses the related work that explored prompt-based learning from generic and task-specific perspectives.
Prompt-based Learning: With the introduction of GPT-3 (Brown et al., 2020), recent years have witnessed a series of studies based on prompt-based learning. Schick and Schütze (2021) utilized manually designed hard prompts, composed of discrete words, to fine-tune the pre-trained language model. Finding the best-performing manual prompt is challenging, and to alleviate the problem, Jiang et al. (2020) and Shin et al. (2020) designed methods for automatic prompt generation. Specifically, Shin et al. (2020) performed the downstream tasks using gradient-guided search, utilizing a large number of annotations for automatic prompt generation. LM-BFF auto-generates prompts using the T5 model but relies on a few annotations for automatic prompt generation. However, the auto-generated prompts are hard prompts, making them sub-optimal.
To overcome the limitations of hard prompts, Zhong et al. (2021b); Li and Liang (2021); Wang et al. (2021) proposed methods to learn soft prompts under the few-shot settings. Soft (or continuous) prompts are composed of several continuous learnable embeddings, unlike hard prompts. Motivated by the prior studies, Zhao and Schütze (2021) utilized both the hard and soft prompts for training the pre-trained language model. Gu et al. (2022) proposed pre-training hard prompts by adding soft prompts into the pretraining stage to obtain a better initialization.
Another line of study (Khashabi et al., 2022; Wang et al., 2022; Zhong et al., 2021a) designed manual task-specific prompts by fine-tuning pre-trained language models on multiple tasks. The fine-tuned language model is then used on unseen tasks under the zero/few-shot setting.
Prompt-based Learning for Sentence-level Sentiment Classification: Over the past years, a large body of studies (Shin et al., 2020; Gu et al., 2022; Wang et al., 2022) have demonstrated excellent performance in few-shot settings on sentence-level sentiment classification tasks. Specifically, Shin et al. (2020) used gradient-guided search to generate automatic prompts, whereas other work used a more general-purpose search method to generate automatic prompts. To address the limitations of automatic prompts, Gu et al. (2022) suggested hybrid training combining hard and soft prompts in the initial stage, obtaining a better initialization. Wang et al. (2022) proposed a Unified Prompt Tuning framework and designed prompts by fine-tuning a pre-trained language model over a series of non-target NLP tasks and using the trained model to fit unseen tasks. For instance, when the target task is sentiment classification, the training data is from other domains like NLI and paraphrasing.
These studies consider access to labeled instances and perform the sentence-level sentiment classification task using a large-scale pre-trained generative language model. In our study, we do not use any training data, and the base prompt can be considered as a natural language description for the task. Therefore, this study follows the zero-shot setting. Using a moderately sized masked language model further makes the proposed method more appealing in practice.

Methodology
This section first discusses the problem formulation and the overview in Section 3.1 and Section 3.2. Our proposed method handles the language model's sensitivity to a manual prompt by utilizing prompt augmentation techniques to generate multiple candidate prompts. The detailed description of the prompt augmentation is discussed in Section 3.3. To rank the automatically generated prompts in the zero-shot setting, we propose a novel ranking metric, discussed in Section 3.4. Finally, the top-ranked prompts are used for prediction, discussed in Section 3.5.

Problem Formulation
Given an unlabeled corpus D with N sentences, a pre-trained MLM L with vocabulary V, an input mapping M : Y → V for the labels y ∈ Y = {−1, 1}, and a base prompt B_p, the task is to find quality prompts similar to the base prompt in a zero-shot setting for the binary sentence-level sentiment classification task. Figure 1 shows one example input to the model. In this example, y ∈ Y = {negative, positive}, M(positive) = great, and M(negative) = terrible.

Overview
Given a base prompt B_p, the proposed ZS-SC first generates multiple prompts similar to the base prompt using augmentation techniques. Specifically, we introduce positioning, subordination, and paraphrasing techniques in the augmentation process, which are discussed in detail in Section 3.3.
With more automatically generated candidate prompts, ZS-SC ranks the prompts using a novel ranking metric. This metric is designed based on the observation that quality prompts should flip the predicted label if M(y) present in the sentence is replaced with M(y′), where y ≠ y′, whereas the predicted label should stay the same if M(y) is replaced with its synonyms. Section 3.4 discusses the proposed ranking metric in detail. Finally, the top-ranked prompt is selected, or the top-k highly ranked prompts are aggregated, to conduct the zero-shot prediction for the unlabeled corpus D (Section 3.5). Figure 2 illustrates the overview of the proposed approach, ZS-SC.

Prompt Augmentation
A single base prompt provided by a user may not provide optimal results for the given task. Prior studies (Jiang et al., 2020) have shown that the performance of prompts is sensitive to the choice of certain words and to the position of the prompts. Furthermore, we observe that using subordinate conjunctions to join the prompt and sentence can improve the method's performance on some datasets since it introduces a dependency between the prompt and sentence, thereby leading the model to relate the predicted label with the context of the sentence. Based on the above observations, we propose to apply three augmentation techniques to generate prompts automatically, namely positioning, subordination, and paraphrasing.
The positioning technique places the prompt either before or after the given sentence. The subordination technique uses subordinate conjunctions like "because" and "so" to join the prompt and the sentence. Specifically, the conjunction "because" is used if the prompt is placed before the sentence, and the conjunction "so" is used if the prompt is placed after the sentence.
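As a rough sketch of the positioning and subordination techniques, the hypothetical helpers below build cloze templates from a base prompt. The function names, the `<sentence>` placeholder, and the exact punctuation are illustrative assumptions rather than the authors' implementation.

```python
from typing import Dict


def position_and_subordinate(base_prompt: str) -> Dict[str, str]:
    """Build cloze templates from a base prompt such as 'The sentence was [MASK]'.

    '<sentence>' marks where the input sentence is substituted later.
    The template keys and punctuation are assumptions made for this sketch.
    """
    lowered = base_prompt[0].lower() + base_prompt[1:]
    return {
        # Positioning: the prompt goes after or before the sentence.
        "after": f"<sentence>. {base_prompt}",
        "before": f"{base_prompt}. <sentence>",
        # Subordination: "because" when the prompt comes first,
        # "so" when the prompt follows the sentence.
        "before_because": f"{base_prompt} because <sentence>",
        "after_so": f"<sentence> so {lowered}",
    }


def fill(template: str, sentence: str) -> str:
    """Insert a concrete sentence, dropping its trailing period to avoid '..'."""
    return template.replace("<sentence>", sentence.rstrip("."))


templates = position_and_subordinate("The sentence was [MASK]")
for name, template in templates.items():
    print(name, "->", fill(template, "Battery life was great."))
```

Each filled template still contains exactly one [MASK] token, so any of the four variants can be passed to the masked language model in the same way as the base prompt.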
The paraphrasing technique generates multiple prompts similar to the base prompt B_p by swapping the tokens in the base prompt with similar tokens. These similar tokens should have the same part-of-speech (POS) tags as the tokens they replace and should not change the context of the prompt. To obtain these similar tokens, we use a pre-trained MLM model L. Pre-trained MLM models are trained to predict the missing tokens that fit the context of the given sentence and are thus suitable for this purpose. Figure 3 illustrates the paraphrasing technique for the base prompt. The label "positive" is used as a placeholder so that the pre-trained MLM model can learn the context of the given sentence. If a specific sentence is joined with the base prompt, the MLM model L can understand the context better, so the replacing tokens will make more sense. Therefore, instead of using prompts alone, we form sample instances by randomly selecting sentences from the unlabeled corpus D. We then mask the replaceable tokens of the base prompt one at a time and use the MLM model L to predict the masked token. For each masked token, the MLM model L assigns a score to every token in its vocabulary. We choose the top-K ranked tokens as similar-token candidates and remove those that do not have the same POS tag as the masked token.
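The token-replacement step can be sketched as follows. This is a minimal illustration under stated assumptions: `mlm_scores` and `pos_tag` are hypothetical callables standing in for the pre-trained MLM and a POS tagger, the toy scores and tags are invented for the example, and dropping the original token from the candidates is our own choice.

```python
from typing import Callable, Dict, List


def paraphrase_token(
    tokens: List[str],
    index: int,
    mlm_scores: Callable[[List[str], int], Dict[str, float]],
    pos_tag: Callable[[str], str],
    top_k: int = 30,
) -> List[str]:
    """Return replacement candidates for tokens[index].

    mlm_scores(masked_tokens, index) is assumed to return a score for every
    vocabulary token at the masked position; pos_tag(token) returns a POS tag.
    """
    original = tokens[index]
    masked = tokens[:index] + ["[MASK]"] + tokens[index + 1:]
    scores = mlm_scores(masked, index)
    # Keep the top-K tokens by score, then filter by POS tag.
    ranked = sorted(scores, key=scores.get, reverse=True)[:top_k]
    wanted_tag = pos_tag(original)
    # Dropping the original token itself is an assumption of this sketch.
    return [t for t in ranked if t != original and pos_tag(t) == wanted_tag]


# Toy stand-ins for the MLM and the tagger (invented values).
toy_scores = {"was": 9.0, "is": 7.5, "were": 7.0, "seems": 6.0, "review": 2.0, "the": 1.0}
toy_tags = {"was": "VERB", "is": "VERB", "were": "VERB", "seems": "VERB",
            "review": "NOUN", "the": "DET"}

# Sample instance: a sentence joined with the base prompt, with the label
# "positive" as a placeholder at the label position.
sample = "battery life was great . the sentence was positive".split()
candidates = paraphrase_token(
    sample, 7,
    mlm_scores=lambda masked, i: toy_scores,
    pos_tag=lambda t: toy_tags.get(t, "X"),
    top_k=4,
)
print(candidates)  # ['is', 'were', 'seems']
```

With a real model, `mlm_scores` would return the vocabulary scores at the masked position and `pos_tag` would tag the token in context; both are left abstract here.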
These three techniques can be applied in different combinations and permutations to generate prompts automatically. The number of candidate paraphrasing tokens K can be increased to generate more prompts. Figure 3 illustrates the process of obtaining paraphrasing tokens for the tokens of the base prompt.
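To illustrate how the candidate pool grows when techniques are combined, the sketch below crosses per-token paraphrase candidates. The helper name and the choice to replace several tokens at once are assumptions made for illustration; the exact combination scheme may differ.

```python
from itertools import product
from typing import Dict, List


def enumerate_prompts(base_tokens: List[str], alternatives: Dict[int, List[str]]) -> List[str]:
    """Cross every token position with its original token plus its candidates.

    `alternatives` maps a token index to hypothetical paraphrase candidates.
    """
    options = [[tok] + alternatives.get(i, []) for i, tok in enumerate(base_tokens)]
    return [" ".join(choice) for choice in product(*options)]


prompts = enumerate_prompts(
    ["The", "sentence", "was", "[MASK]"],
    {1: ["review", "response"], 2: ["is"]},
)
# 3 choices at position 1 times 2 choices at position 2 gives 6 prompts,
# the first of which is the unchanged base prompt.
print(len(prompts), prompts[0])
```

Each of these paraphrased prompts can in turn be placed before or after the sentence, with or without a conjunction, which multiplies the number of candidates again.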

Ranking Metric
Not all the prompts automatically generated in Section 3.3 perform well on the task. Therefore, we aim to rank these prompts and choose quality prompts for the task. Previous works (Shin et al., 2020) have used validation data or manually annotated few-shot training data to evaluate the automatically generated prompts. However, under the zero-shot setting, we do not assume that any manually annotated data exists. Therefore, we have to rank the automatically generated prompts in the absence of manually annotated data, a setting that previous works do not consider.
Intuitively, if the mapping token of the opposite label replaces the mapping token in a given sentence, the predicted label by a quality prompt should flip. On the other hand, the predicted label should remain the same if the mapping token in the sentence is replaced by its synonyms. For example, suppose we replace the word "great" in sentence "battery life was great" with "terrible". In this case, the predicted label should flip, whereas if we replace "great" with "excellent", the predicted label should remain the same. We use this intuition to measure the sensitivity of the prompt to the change of the mapping tokens in the given sentences. The measured sensitivity implies the quality of the prompt, namely prompts sensitive to the change of the mapping tokens in the given sentence can achieve good performance for the task. Figure 4 illustrates the key idea of the proposed ranking metric.
We model the above intuition as a zero-one scoring function. To do so, we first obtain the set S_W of sentences from the unlabeled corpus D that contain the mapping tokens M(y) ∈ V obtained from the provided input mapping M : Y → V. If the mapping tokens are not present in the corpus D, the synonyms of the mapping tokens can be used.
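A minimal sketch of this selection step is shown below. The function name, the regex tokenization, and the decision to skip sentences that contain mapping tokens of both polarities are assumptions made for the example.

```python
import re
from typing import Dict, List, Tuple


def sentences_with_mapping_tokens(
    corpus: List[str], mapping: Dict[str, str]
) -> List[Tuple[str, str]]:
    """Return (sentence, mapping_token) pairs for sentences containing a mapping token."""
    tokens_of_interest = set(mapping.values())
    selected = []
    for sentence in corpus:
        words = re.findall(r"[a-z']+", sentence.lower())
        present = {w for w in words if w in tokens_of_interest}
        # Assumption: keep only sentences with exactly one distinct mapping
        # token, so that the replacement target is unambiguous.
        if len(present) == 1:
            selected.append((sentence, present.pop()))
    return selected


corpus = [
    "Battery life was great.",
    "The screen is terrible!",
    "It arrived on Monday.",
    "Great camera, terrible speaker.",
]
mapping = {"positive": "great", "negative": "terrible"}
print(sentences_with_mapping_tokens(corpus, mapping))
```

No labels are needed for this step: the sentences are selected purely by the presence of a mapping token, which keeps the procedure zero-shot.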
For a sentence s_in ∈ S_W, let the label predicted by the model for a given prompt P be l_1. We then replace the mapping token M(y) in s_in with M(y′), where y ≠ y′, to obtain a new sentence s′_in. Let the label predicted for s′_in be l_2. The zero-one scoring function for this scenario is defined as:
\[ \mathrm{score}(P, s_{in}, s'_{in}) = \begin{cases} 1, & \text{if } l_1 \neq l_2 \\ 0, & \text{otherwise} \end{cases} \tag{1} \]
We consider the synonyms of M(y) to further diversify the scoring function. Specifically, we use WordNet (Miller, 1995) to obtain synonyms for M(y). We replace M(y) by its synonym to obtain a new sentence s′′_in. Let the label predicted for s′′_in be l_3. The scoring function for this scenario is defined as:
\[ \mathrm{score}(P, s_{in}, s''_{in}) = \begin{cases} 1, & \text{if } l_1 = l_3 \\ 0, & \text{otherwise} \end{cases} \tag{2} \]
Similarly, we can also consider the synonyms of M(y′). The predicted label should flip if M(y) is replaced by synonyms of M(y′). Let Z be the set of new sentences obtained through synonym replacement. The overall score for a given prompt P is defined as:
\[ \mathrm{score}(P) = \sum_{s_{in} \in S_W} \sum_{z \in Z} \mathrm{score}(P, s_{in}, z) \tag{3} \]
A higher score indicates that the prompt is more sensitive to the polarity of mapping tokens. The score is calculated for all the prompts generated in the prompt augmentation step (Section 3.3), and then the prompts are ranked based on their calculated score. The top-ranked prompt is the prompt with the highest score. Figure 4 depicts the functioning of our ranking metric.
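The ranking idea can be sketched in a few lines. In this illustration, `predict` is a hypothetical callable that already wraps one prompt and the MLM and returns a label for a sentence; the two toy predictors, the example sentences, and the word lists are invented, and plain string replacement stands in for token-level replacement.

```python
from typing import Callable, List


def prompt_score(
    predict: Callable[[str], str],
    sentences: List[str],
    token: str,
    opposite_tokens: List[str],
    synonym_tokens: List[str],
) -> int:
    """Zero-one sensitivity score of one prompt, summed over sentences and replacements."""
    score = 0
    for sentence in sentences:
        original_label = predict(sentence)
        for rep in opposite_tokens:
            # Opposite-polarity replacement: the label should flip.
            score += int(predict(sentence.replace(token, rep)) != original_label)
        for rep in synonym_tokens:
            # Synonym replacement: the label should stay the same.
            score += int(predict(sentence.replace(token, rep)) == original_label)
    return score


NEGATIVE_WORDS = {"terrible", "awful"}


def keyword_predict(sentence: str) -> str:
    """Toy stand-in for a prompt that reacts to polarity words."""
    return "negative" if NEGATIVE_WORDS & set(sentence.split()) else "positive"


def constant_predict(sentence: str) -> str:
    """Toy stand-in for a prompt that ignores the sentence."""
    return "positive"


sentences = ["battery life was great", "the screen looks great"]
args = (sentences, "great", ["terrible", "awful"], ["excellent", "fine"])
scores = {
    "sensitive prompt": prompt_score(keyword_predict, *args),
    "insensitive prompt": prompt_score(constant_predict, *args),
}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```

The insensitive predictor still earns credit for the synonym cases, which is why both replacement types are needed: only a prompt that flips on opposite tokens and stays stable on synonyms reaches the maximum score.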

Prediction
First, we define how we obtain the prediction probabilities using any given prompt. Given an input mapping M : Y → V that maps the task label space to individual words in the vocabulary V of the pre-trained MLM model L, the probability of a label y ∈ Y for a given sentence s_in in the unlabeled corpus D using a prompt P is obtained as:
\[ p(y \mid s_{in}) = p([\mathrm{MASK}] = M(y) \mid s_P) = \frac{\exp\left(w_{M(y)} \cdot h_{[\mathrm{MASK}]}\right)}{\sum_{y' \in Y} \exp\left(w_{M(y')} \cdot h_{[\mathrm{MASK}]}\right)} \tag{4} \]
where s_P = P(s_in) is the sentence s_in joined with the prompt P, which contains exactly one masked token at the position of the label, h_[MASK] is the hidden vector of the [MASK] token, and w_v is the pre-softmax vector corresponding to v ∈ V. The predicted label for the given sentence s_in is the label y with the highest probability. Our proposed approach is to use quality prompts for the zero-shot prediction tasks. We can either select the top-ranked prompt or aggregate the top-k ranked prompts. If the top-1 prompt is selected, Eq. (4) is used to obtain the label probability for each sentence, and the label with the highest probability is the predicted label.
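As a small numerical illustration of the label probability, the sketch below normalizes the [MASK] scores over the label words only. The function name and the toy logits are assumptions; `mask_logits` stands in for the pre-softmax scores that a masked language model would produce at the [MASK] position.

```python
import math
from typing import Dict


def label_probabilities(mask_logits: Dict[str, float], mapping: Dict[str, str]) -> Dict[str, float]:
    """Softmax over the label words only.

    mask_logits: vocabulary token -> pre-softmax score at the [MASK] position.
    mapping:     label -> label word, e.g. {"positive": "great"}.
    """
    logits = {label: mask_logits[word] for label, word in mapping.items()}
    top = max(logits.values())  # subtract the max for numerical stability
    exps = {label: math.exp(value - top) for label, value in logits.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}


# Invented scores: "the" has the highest raw score but is not a label word,
# so it does not take part in the normalization.
toy_logits = {"great": 2.0, "terrible": 0.0, "the": 5.0}
probs = label_probabilities(toy_logits, {"positive": "great", "negative": "terrible"})
predicted = max(probs, key=probs.get)
print(probs, predicted)
```

Restricting the normalization to the label words means that frequent but irrelevant tokens cannot absorb probability mass from the two classes.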
Prompt aggregation may help correct the mistakes of the individual prompts. We consider prediction confidence and use the soft labels computed by Eq. (4) in aggregation. Let p_1(y), p_2(y), ..., p_k(y) be the prediction probabilities for label y ∈ Y obtained using the top-k prompts. The aggregated prediction probability is:
\[ p(y) = \frac{1}{k} \sum_{i=1}^{k} p_i(y) \tag{5} \]
and then the label with the highest aggregated prediction probability is chosen for the sentence.
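The effect of aggregating soft labels can be seen in a toy example. The sketch assumes that aggregation is a plain average of the per-prompt probabilities, and the probabilities themselves are invented.

```python
from typing import Dict, List


def aggregate(prob_per_prompt: List[Dict[str, float]]) -> Dict[str, float]:
    """Average the label probabilities of the top-k prompts (assumed aggregation rule)."""
    k = len(prob_per_prompt)
    labels = prob_per_prompt[0].keys()
    return {y: sum(p[y] for p in prob_per_prompt) / k for y in labels}


top_k_probs = [
    {"positive": 0.90, "negative": 0.10},  # confident prompt
    {"positive": 0.40, "negative": 0.60},  # weakly negative
    {"positive": 0.45, "negative": 0.55},  # weakly negative
]
aggregated = aggregate(top_k_probs)
label = max(aggregated, key=aggregated.get)
print(aggregated, label)
```

Here two of the three prompts individually predict "negative", yet the aggregated label is "positive" because the confident prompt outweighs the two uncertain ones. This is the sense in which soft labels take prediction confidence into account, in contrast to majority voting.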

Experiments
In this section, we evaluate the proposed ZS-SC model on several benchmark binary sentence-level sentiment classification datasets from various domains. More studies can be found in Appendix A.

Evaluation Metrics
Since no training data is used in zero-shot settings, we evaluate all prompts on the entire dataset. We use Accuracy (Acc.) and macro F1 score (F1) for all the datasets to evaluate the performance of ZS-SC and compare it with baselines under different settings. Note that Accuracy is equivalent to micro F1 score in binary classification tasks.
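For reference, the two metrics and the stated equivalence between accuracy and micro F1 can be checked with a short, self-contained sketch. The helper names and the toy labels are illustrative.

```python
from typing import List, Sequence


def accuracy(gold: Sequence[int], pred: Sequence[int]) -> float:
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def confusion(label: int, gold: Sequence[int], pred: Sequence[int]) -> List[int]:
    """Return [tp, fp, fn] for one label."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    return [tp, fp, fn]


def f1(tp: int, fp: int, fn: int) -> float:
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(gold: Sequence[int], pred: Sequence[int], labels: Sequence[int]) -> float:
    """Unweighted mean of the per-label F1 scores."""
    return sum(f1(*confusion(l, gold, pred)) for l in labels) / len(labels)


def micro_f1(gold: Sequence[int], pred: Sequence[int], labels: Sequence[int]) -> float:
    """F1 computed from counts pooled over all labels."""
    totals = [sum(c) for c in zip(*(confusion(l, gold, pred) for l in labels))]
    return f1(*totals)


gold = [1, 1, 1, 1, -1, -1]
pred = [1, 1, 1, -1, -1, 1]
labels = [1, -1]
print(accuracy(gold, pred), macro_f1(gold, pred, labels), micro_f1(gold, pred, labels))
```

In this example the per-label F1 scores are 0.75 and 0.5, so macro F1 is 0.625, while micro F1 and accuracy both equal 4/6. Macro F1 is the more informative of the two when the classes are imbalanced.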

Baseline Methods
Since no prior work has performed binary sentence-level sentiment classification under the zero-shot setting, we compare ZS-SC with baselines that have performed the task under the few-shot setting on the datasets discussed in Section 4.1. For a fair comparison, we adapted these studies to the zero-shot setting, using the prompts reported in their papers. The baseline templates are listed in Table 5 of Appendix A.
Table 2: Results of the sentiment classification task on the three benchmark datasets using BERT base and BERT large. We report accuracy and F1 score for all datasets. The results are evaluated on the entire dataset. We report the majority voting results for the automatic prompt baselines. The best-performing and runner-up model per column are highlighted in bold and underlined, respectively.
LM-BFF: This paper explores manual prompts and generates automatic prompts under the few-shot setting. Specifically, it uses few-shot examples to automatically generate prompts using the T5 model. The performance of the method is evaluated on a range of classification and regression tasks using RoBERTa-large (Liu et al., 2019) with fine-tuning. We compare ZS-SC with its manual prompt and its top-ranked automatic prompts.
PPT (Gu et al., 2022): This paper proposes pre-training hard prompts by adding soft prompts into the pre-training stage to achieve a better initialization on classification tasks. ZS-SC is compared with its manual prompt.
UPT (Wang et al., 2022): This paper proposes a Unified Prompt Tuning framework and designs prompts by fine-tuning a pre-trained language model (RoBERTa-large) over a series of non-target NLP tasks. After multi-task training, the trained model can be fine-tuned to fit unseen tasks. ZS-SC is compared with their top-ranked prompts.

Settings
The experiments are conducted using pre-trained uncased BERT (BERT base and BERT large) encoders. BERT base has 12 attention heads, 12 hidden layers, and a hidden size of 768, resulting in 110M pre-trained parameters, whereas BERT large has 16 attention heads, 24 hidden layers, and a hidden size of 1024, resulting in 336M pre-trained parameters. We set K, the hyperparameter for the number of candidate words in paraphrasing, to 30. We obtain 6 synonyms for each mapping word from WordNet (Miller, 1995). The set of new sentences obtained through synonym replacement (Z) has size 12: 6 are obtained by replacing the mapping token M(y) with its synonyms, and the other 6 are obtained by replacing the mapping token with M(y′) and synonyms of M(y′), where y ≠ y′.
For ZS-SC, we considered two different base prompts. The first base prompt is "<sentence>. It was [MASK]", which is the same as the manual prompt used by LM-BFF (denoted by † in Table 2), whereas the second base prompt is "<sentence>. The sentence was [MASK]" (denoted by ⋆ in Table 2). The base prompts defined are generic and used for all datasets.

Results and Discussion
To better compare the performance of different methods, we categorize them based on the prompt type (manual or automatic). Table 2 shows the results of all prompts using the BERT base and BERT large pre-trained MLM models. ZS-SC with the ⋆ base prompt significantly outperforms both manual and automatic baseline methods with both pre-trained MLM models on all three datasets. Overall, the aggregation strategy tends to outperform the selection strategy, but the gain is not consistent across datasets. We study the impact of the number of top-k prompts further in Section 4.6.
It is interesting to notice that, for the † base prompt, ZS-SC outperforms the base prompt on the SST-2 and MR datasets but not on the CR dataset. Furthermore, the margin of ZS-SC over the base prompt is smaller for the † base prompt than for the ⋆ base prompt. This is because "It was" is harder to augment than "The sentence was", since the former is shorter and contains no concrete word. Even though the † base prompt is not ranked top-1 by ZS-SC on the CR dataset, it is ranked fourth for both pre-trained MLM models, demonstrating that ZS-SC can recognize the † base prompt as a high-quality prompt.
It is also interesting to note that for baseline methods, either using manual or automatic prompts, there is no significant gain using the BERT large over the BERT base encoder, and the performance of a prompt can change significantly using different pre-trained language models. However, we can observe that the performance of ZS-SC improves with the scale of the model. The key difference between ZS-SC and the automatic prompts generated by baseline models is that we use the same language models to generate prompts and conduct classification tasks, whereas baselines generate prompts manually or using a different model. These results suggest that different language models have different knowledge of the language, so prompts need to be generated specifically for the chosen language model.

Study of Selection VS Aggregation
Comparing top-1 selection with top-k aggregation in Table 2, we can observe that top-1 selection performs better than top-k aggregation on BERT base, whereas top-k aggregation performs better on BERT large. Furthermore, we can observe that the top-k aggregation result does not always increase with k, contrary to what previous works suggest. To further analyze this observation, we plot the change in performance of ZS-SC with respect to the number of aggregated top-k prompts for the BERT large encoder with the ⋆ base prompt in Figure 5. Figure 5 shows that the top-k aggregation performance increases with k only for the SST-2 dataset and does not increase for the CR and MR datasets. This implies that top-k aggregation performance increases with k for some datasets but not all. Furthermore, we can also observe that top-k aggregation performance can be better than top-1 selection performance on all three datasets. We believe that aggregation performance improves when the top-ranked prompts make independent mistakes.

Study of the Proposed Ranking Metric
To study the effectiveness of the proposed ranking metric, we plot the accuracy of the augmented prompts, evaluated using ground truth labels, against their ranks under the proposed ranking metric. The results for the SST-2 dataset using the BERT base model with the ⋆ base prompt are shown in Figure 6. The figure shows that highly ranked prompts generally achieve higher accuracy than low-ranked prompts, demonstrating the effectiveness of our proposed ranking metric. Furthermore, we can observe that the accuracy of the prompts decreases as the rank number assigned by our metric increases, i.e., as prompts are ranked lower.

Ablation Studies
We conduct ablation studies to investigate the contribution of WordNet synonyms to the overall model performance. Table 3 shows the performance of ZS-SC with and without WordNet. From the results, we can observe that ZS-SC with WordNet outperforms ZS-SC without WordNet for both variants of the pre-trained MLM models. The results show that diversifying the mapping tokens helps the scoring function rank the prompts better and subsequently improves the performance.

Conclusion
This work studies how to improve manual prompts for binary sentence-level sentiment classification tasks under zero-shot settings. To overcome the sensitivity of the language model to a manual prompt, we propose prompt augmentation techniques to generate multiple candidate prompts. Further, to rank the generated prompts without labeled data, we propose a novel ranking metric based on the intuition that high-quality prompts should be sensitive to the change of certain keywords in the given sentence. Extensive experiments and ablation studies demonstrate the effectiveness of the proposed ZS-SC on three benchmark datasets.

Limitations
The proposed method is tested for a binary labeling scenario where each instance can belong to one of the labels but not both. Neither the scenario of an overlapping label space nor that of a multi-class label space is tested. Since we aim to obtain high-quality prompts similar to the base prompt, if the base prompt is very restrictive, then the suggested prompt might be the same as the base prompt. The approach is evaluated only on two moderately sized MLM models, and the extension to other, larger models is not tested.

Ethics Statement
We comply with the ACL Code of Ethics.