Adversarial Knowledge Stimulated Contrastive Prompting for Few-shot Language Learners

Prompt-based fine-tuning has boosted the performance of Pre-trained Language Models (PLMs) on few-shot Natural Language Understanding (NLU) tasks by employing task-specific prompts. Yet PLMs are unfamiliar with prompt-style expressions during pre-training, which limits few-shot learning performance on downstream tasks. It would be desirable if models could stimulate prompting knowledge while adapting to specific NLU tasks. We present the Adversarial Knowledge Stimulated Contrastive Prompting (AKSCP) framework, which improves few-shot NLU performance by implicitly stimulating knowledge from the pre-trained language model. In AKSCP, a novel paradigm, the Cloze-driven prompt, is proposed for joint prompt tuning across the word cloze task and prompt-based learning, forcing PLMs to stimulate prompting knowledge. We further design an adversarial contrastive learning method to improve the generalization ability of PLMs across different downstream tasks. Experiments over a variety of NLU tasks show that AKSCP consistently outperforms the state of the art for prompt-based fine-tuning.

To alleviate the above dilemma in low-resource scenarios, natural language prompts have been applied to enable few-shot or zero-shot learning with PLMs (Brown et al., 2020b; Li and Liang, 2021; Liu et al., 2021b; Lester et al., 2021; Liu et al., 2021a). To make prompts more flexible and task-adaptive, prompt tuning freezes the PLM backbone and adjusts only the representations of prompts (Lester et al., 2021). This type of method is especially suitable for ultra-large PLMs that are difficult to tune. In addition, prompt-based fine-tuning has been proposed, transforming text classification tasks into cloze-style problems (Gao et al., 2020; Schick and Schütze, 2020). Specifically, task-specific discrete templates with masked language tokens are added to input texts. The tokens predicted at the masked positions by the Masked Language Modeling (MLM) head are used for class label prediction. The pre-trained knowledge acquired by PLMs can therefore be better utilized by "re-using" the MLM training objective. However, prompt construction is usually handcrafted or searched by gradient descent, which may lack coverage and introduce considerable bias and high variance into the results (Hu et al., 2021a). A recent work (Hu et al., 2021a) attempts to tackle this challenge using external knowledge data. Yet such external data can be expensive to obtain and is not transferable. It would be better if PLMs could stimulate more internal knowledge while being adapted to downstream tasks, without any external knowledge data.
A major limitation of task-based prompts is that they are too coarse-grained and fail to capture the fine-grained information in the input data. Existing methods use the same prompt for all input data within a tuning task. However, the input data also contains context-specific information that can help the PLM retrieve more relevant knowledge, such as the particular entity being discussed. Such knowledge embedded in the input data should be fully exploited to unleash the potential of prompts. The key challenge is the mismatch between prompting and pre-training: the template used in the prompt may never appear during pre-training (Gao et al., 2022; Su et al., 2022; Zheng et al., 2022). To address this issue, we propose a unified paradigm named the Cloze Driven Prompt (CDP). CDP uses a word cloze task that is more compatible with the PLM: it only masks and reconstructs the original input (see Table 1) to activate the knowledge learned by the original PLM. To enable the model to better understand the NLU task, we further propose a novel adversarial contrastive learning objective that encourages the PLM to discriminate between different classes. Specifically, we propose a supervised contrastive framework that clusters inputs from the same class under different augmented "views" and pushes away those from different classes. We create different "views" of an example by appending various language prompts and contextual demonstrations to it. Furthermore, we design a prompt-based adversarial training method to improve the generalization abilities of PLMs. Since our training method does not actually generate adversarial samples, it can be applied to large-scale training sets efficiently.
We conduct experiments over 15 public NLU benchmarks. The evaluation results indicate that our model, Adversarial Knowledge Stimulated Contrastive Prompting (AKSCP), not only outperforms the state-of-the-art models but also exhibits good generalization ability across extensive tasks. In addition, we find that as the amount of training data decreases, AKSCP with fine-tuned parameters consistently outperforms the standard prompt learning method, which freezes the LM parameters. We also analyze the sample efficiency and the differences in improvement margins to further verify the correctness of the motivation of AKSCP.
In this paper, we make three main contributions: (1) we introduce a knowledge stimulation method that leverages the knowledge of pre-trained language models (PLMs) to enhance the performance of prompt tuning; (2) we propose a unified cloze adversarial contrastive prompting learning framework that jointly optimizes the cloze prompts and the PLM parameters in an adversarial and contrastive manner; and (3) we conduct extensive experiments on fifteen few-shot natural language understanding (NLU) datasets and demonstrate the effectiveness of our approach.

Related Work
Prompt tuning Many studies (Li and Liang, 2021;Liu et al., 2021b;Lester et al., 2021;Liu et al., 2021a) have focused on how to design prompts since good prompts can narrow the gap between pretrained language models and downstream tasks.
Depending on the prompt type, existing research can be divided into two main categories: manually designed prompts (Li and Liang, 2021; Liu et al., 2021b; Lester et al., 2021; Liu et al., 2021a) and automatically created ones (discrete prompts (Gao et al., 2020; Schick and Schütze, 2020) or continuous prompts (Shin et al., 2020; Hambardzumyan et al., 2021)), where continuous prompts use learnable continuous embeddings as prompt templates rather than label words. However, such prompt construction still lacks coverage and introduces considerable bias and high variance into the results. Recently, Hu et al. (2021a) proposed utilizing external knowledge data to address this issue. However, these works cannot stimulate knowledge directly, without external data.
Contrastive learning Contrastive learning is a self-supervised learning technique that aims to learn representations that are semantically similar for samples from the same class (positive pairs) and semantically dissimilar for samples from different classes (negative pairs). It achieves this by maximizing a lower bound on the mutual information between two augmented views of the samples (Bachman et al., 2019; Tian et al., 2020b,a). Various contrastive learning methods have been proposed (Wang et al., 2021; Logeswaran and Lee, 2018; Wang et al., 2020; Gao et al., 2021; Zhang et al., 2021). Among them, SupCon (Khosla et al., 2020) is a distinctive method that performs contrastive learning at the class level by clustering two augmented batches of samples in the feature space. This allows SupCon to generate more negative pairs, which enhances the efficiency of contrastive learning in practice.
Adversarial training Many approaches for im-

Methodology
This section first formulates the knowledge stimulation method for the low-resource NLU task. We then introduce the two key components of the proposed Adversarial Knowledge Stimulated Contrastive Prompting (AKSCP): i) the Cloze Driven Prompt; and ii) adversarial contrastive learning. Figure 1 shows the overall AKSCP framework. The word cloze task narrows the gap between pre-training and prompt tuning, and stimulates more enriched knowledge from PLMs. Moreover, to improve the generalization abilities of PLMs, an adversarial attack A is applied at the embedding level. In addition, Z^b_K is unsupervised knowledge that does not require labeled data and has good extensibility, whereas Z^y_K is supervised knowledge that requires labeled data. Finally, Z^b_K injects its enhanced knowledge into Z^y_K for downstream training. The variables therefore let us model objectives at different knowledge levels with respect to keywords in a unified framework. Advantages of joint learning include: (1) the model and the contrastive learning are more robust to noise in Z^y_K inferred during training, because each task protects the others; and (2) at prediction time, the model can automatically control the expression of knowledge, so it can easily adapt to different scenarios without much extra effort. The overall learning objective is L = L_P + γL_C + βL_S + αL_Adv.

Low-Resource Learning Framework
Cloze Driven Prompt

Prompt-based Learning
Fine-tuning is a common method for adapting a PLM to specific downstream tasks (Devlin et al., 2019). However, for low-resource data augmentation, we want the stimulated synthetic knowledge K_LM to be different from K and to provide new information for NLU model learning. Fine-tuning the PLM may not be optimal, as it may overfit a small number of training examples. Inspired by the zero-shot instructions in GPT-3 (Brown et al., 2020a), we adopt prompt learning, which keeps all PLM parameters frozen and adds discrete natural language task instructions (e.g., "translate into English") before the task input. Freezing the PLM parameters may help with generalization during training. However, a suitable discrete task instruction is not easy to optimize in an end-to-end manner and requires additional human effort.
Compared with previous methods (Brown et al., 2020a; Gao et al., 2020) that generate prompts manually or with neural networks, we design the prompt mapping based on several heuristic rules: G_p denotes the prompt mapping for NLU tasks.
Let x and G_p = G_p(x) denote the input sentence and the instance prompt, respectively. The model input is then x_input = x [C] G_p(x), where [C] is a special token that separates the prompt from the input sentence. For example, the input in Figure 1 is "I love every minute of the movie, it was [MASK].", where the prompt G_p is "it was [MASK]". If a sentence does not follow this format, we append multiple [MASK] tokens to the end of the sentence. The number of [MASK] tokens in the prompt is a predefined hyper-parameter l_mask. We use demonstrations of label words to construct the input as follows: x_d = x_0, t_0([MASK]), x_i, t_i(word_i), where t_i and word_i are the template and the label word for s_i respectively, and s_i is sampled from the training set.
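The construction above can be sketched in a few lines of Python. This is an illustrative mock-up, not the paper's implementation: the template string, the demonstration sentence, and the label word "terrible" are all assumptions chosen for the example.

```python
def build_prompt(x, template="It was [MASK].", demos=()):
    """Append the cloze template to the input, then the demonstrations.

    demos: iterable of (sentence, label_word) pairs sampled from the
    training set; each demonstration fills its template with its label word.
    """
    parts = [f"{x} {template}"]
    for sent, word in demos:
        # Demonstrations show the template resolved with a real label word.
        parts.append(f"{sent} " + template.replace("[MASK]", word))
    return " ".join(parts)

example = build_prompt(
    "I love every minute of the movie,",
    demos=[("A tedious, plodding film.", "terrible")],
)
```

The single remaining [MASK] in `example` is the position whose MLM prediction is mapped to the class label.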
During training, we update the parameters using the masked language modeling (MLM) loss: L_P = -log P([MASK] = y | x_input), where y denotes the label word that corresponds to x_input.

Word Cloze Task
Table 1 illustrates that conventional prompt-based learning approaches rely on a single mask token to infer the label of an entire sentence. However, this method faces a challenge: pre-trained language models (PLMs) are not exposed to prompt-style expressions during pre-training, resulting in a gap between the prompt and the PLM's knowledge.
To address this issue, we propose a word cloze task that bridges the gap between pre-training, prompting, and fine-tuning in natural language understanding (NLU). The word cloze task has a significant impact on knowledge stimulation, especially in low-resource settings. Hu et al. (2021a) propose to further train the full PLM parameters using external knowledge bases (KBs) to enhance the knowledge capability. However, this strategy (i.e., full PLM training) incurs high data collection costs and substantial computational overhead. In contrast, we propose to directly train the parameters using the word cloze task without any external training data. Assuming that knowledge stimulation updates the parameters based on partial information (such as keywords) through the MLM model, we propose the Significant Keywords to Sentence cloze task. Given a piece of text, we use an unsupervised keyword extraction algorithm to extract keywords. Given these keywords, the cloze sequence is trained to reconstruct the original text blocks. When the cloze task is applied to knowledge stimulation, we only need to fine-tune the cloze sequence under unsupervised learning. This training process is conducted jointly with the prompt-based learning process, using only the few-shot training data. Formally, the Significant Keywords to Sentence cloze task creates a corrupted version x_c = X' for an input x_input, where X' is the corrupted version of X obtained by masking the significant keywords.
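A minimal sketch of the keyword-driven corruption step follows. The paper uses an unsupervised keyword extraction algorithm; here a toy longest-word heuristic stands in for it, purely as an assumption so the sketch is self-contained.

```python
def extract_keywords(tokens, k=2):
    # Stand-in for an unsupervised keyword extraction algorithm:
    # pick the k longest tokens (sorted is stable, so ties keep order).
    return set(sorted(tokens, key=len, reverse=True)[:k])

def corrupt(tokens, keywords, mask="[MASK]"):
    # Mask every keyword occurrence; the MLM is trained to restore them.
    return [mask if t in keywords else t for t in tokens]

toks = "I just loved every minute of this film".split()
kws = extract_keywords(toks)        # {"minute", "loved"} with this heuristic
x_c = corrupt(toks, kws)
```

Predicting the masked keywords from the surviving context is exactly the MLM objective the PLM saw during pre-training, which is why no prompt-style template is needed here.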
After constructing this corrupted version of the sequence, the MLM model attempts to restore the masked tokens. The word cloze task loss is then defined as L_C = -Σ_{i∈M} log P(x_i | x_c), where M is the set of masked positions. To reduce the inconsistency caused by masking between training and evaluation, we keep a phrase mention unchanged with probability δ in a direct alignment. In this way, the resulting output text maintains a higher level of distinguishability and diversity in the latent space and stimulates more task/keyword-agnostic novel knowledge. We use SupCon (Khosla et al., 2020) to compute the contrastive learning loss. To apply SupCon to multiple views of the input text, we first obtain two views: s_1 = x_0, G_p^i, x_i, G_p^i and s_2 = x_0, G_p^j, x_j, G_p^j. We generate candidate demonstrations for each input instance based on different G_p. Let s_{2b-1}, s_{2b} be two augmented views of input batch s_b, let r_{2b-1} and r_{2b} be their features, and let y_b denote the label of x_b. The SupCon loss is then L_S = Σ_i (-1/|P(i)|) Σ_{p∈P(i)} log [exp(r_i · r_p / τ) / Σ_{a≠i} exp(r_i · r_a / τ)], where P(i) is the set of other views sharing the label of view i and τ is a temperature.
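The SupCon computation can be sketched directly from its definition. This NumPy version is an illustration under assumed inputs (the toy feature vectors and the temperature value are not from the paper); a practical implementation would operate on PLM features inside the training graph.

```python
import numpy as np

def supcon_loss(features, labels, tau=0.1):
    # L2-normalize features, then compute temperature-scaled cosine similarities.
    r = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = r @ r.T / tau
    n = len(labels)
    loss, count = 0.0, 0
    for i in range(n):
        mask = np.ones(n, dtype=bool)
        mask[i] = False
        # log of the denominator: sum over all views except i itself.
        denom = np.log(np.exp(sim[i][mask]).sum())
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue
        # Average log-likelihood of pulling each positive toward view i.
        loss += -sum(sim[i][p] - denom for p in positives) / len(positives)
        count += 1
    return loss / max(count, 1)

feats = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
aligned = supcon_loss(feats, [0, 0, 1, 1])    # views agree with labels
shuffled = supcon_loss(feats, [0, 1, 0, 1])   # labels mismatch the clusters
```

When same-class views already cluster together (`aligned`), the loss is near zero; when they do not (`shuffled`), it is large, which is the gradient signal that pulls same-class views together and pushes different-class views apart.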

Adversarial Training
After completing the two kinds of multi-view data augmentation, we obtain synthesized data that is substantially less noisy, denoted Ĥ_LM = H_I ∪ H_O, as shown in Algorithm 1 (line 6). We then train the model f(·; θ) for the final NLU tasks. As a special training regimen, we adopt adversarial training, which minimizes the maximal loss caused by label-preserving adversarial perturbations (Szegedy et al., 2013; Goodfellow et al., 2014), thereby making the model more robust. Adversarial training is especially effective in a Natural Language Inference (NLI) framework when used to exploit augmented data, as it encourages the model to be resilient to the variation among similar words and word orders in different source sentences and to better adapt to new, moderately noisy data. We confirm this hypothesis in our experimental results (see SNLI in Table 3). Adversarial training finds parameters θ that make the model robust against any perturbation r within a norm ball on the continuous (sub-)word embedding space. The loss function therefore becomes min_θ E_{(x,y)} [max_{||r||≤ε} L(f(x + r; θ), y)]. Madry et al. (2017) demonstrated that projected gradient descent (PGD) finds a better perturbation r_adv(x_i, y_i). In particular, for the norm-ball constraint ||r|| ≤ ε, the projection Π_{||r||≤ε}(r_0) finds the point r within the ball that is closest to a given r_0: Π_{||r||≤ε}(r_0) = argmin_{||r||≤ε} ||r − r_0||. To explore better points in the latent space, one performs K-step PGD during training, which entails K forward-backward passes through the network. Under a linear approximation and an L_2 norm constraint, each PGD iteration takes the form r_{t+1} = Π_{||r||≤ε}(r_t + α g(r_t) / ||g(r_t)||_2), where g(r_t) = ∇_r L(f(x + r_t; θ), y), α is the step size, and t is the step index. The relevant fragments of Algorithm 1 are: line 8, L_S = SupCon(M_N, Ĥ_LM); lines 9-12, build Ĥ^Adv_LM with PGD and compute L_Adv = L(M_N, Ĥ^Adv_LM); line 13, L = L_P + γL_C + βL_S + αL_Adv; lines 14-16, M_N ← TRAIN(M, Ĥ_LM), N ← N + 1, and repeat; line 17, return M.
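The PGD iteration above can be sketched on a toy objective. Everything here is an assumption for illustration: the linear "loss" whose gradient is a constant vector, and the values of eps, alpha, and the step count; in the paper the gradient would come from backpropagation through the PLM.

```python
import numpy as np

def project(r, eps):
    # Projection onto the L2 ball ||r|| <= eps: rescale if outside the ball.
    norm = np.linalg.norm(r)
    return r if norm <= eps else r * (eps / norm)

def pgd_perturb(grad_fn, dim, eps=0.5, alpha=0.2, steps=3):
    # K-step PGD: ascend along the normalized gradient, project after each step.
    r = np.zeros(dim)
    for _ in range(steps):
        g = grad_fn(r)
        r = project(r + alpha * g / (np.linalg.norm(g) + 1e-12), eps)
    return r

# Toy loss L(r) = w . r, so grad_fn returns the constant w; PGD should push
# r to the ball boundary in the direction of w.
w = np.array([3.0, 4.0])
r_adv = pgd_perturb(lambda r: w, dim=2, eps=0.5)
```

After three steps the perturbation sits on the boundary of the norm ball, aligned with the gradient direction, which is the worst case the min-max objective trains against.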

Joint Learning
To integrate CDP and adversarial contrastive learning, we propose a joint training objective: L = L_P + γL_C + βL_S + αL_Adv, where α, β, and γ are loss balance weights with α, γ, β ∈ (0.0, 1.0). We note that γ > 0.0 is required so that the parameters of the word cloze task can be optimized through back-propagation, while γ < 1.0 is necessary to prevent the cloze-task loss from degrading the performance of prompt tuning (Zhang et al., 2019).
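The combination is a simple weighted sum. In this sketch, γ = 0.3 follows the experimental setup reported later; the β and α values and the individual loss values are placeholders, not the paper's numbers.

```python
def joint_loss(l_p, l_c, l_s, l_adv, gamma=0.3, beta=0.5, alpha=0.5):
    # Enforce the paper's constraint that each weight lies in (0, 1).
    assert 0.0 < gamma < 1.0 and 0.0 < beta < 1.0 and 0.0 < alpha < 1.0
    return l_p + gamma * l_c + beta * l_s + alpha * l_adv

# Placeholder loss values for the prompt, cloze, SupCon, and adversarial terms.
total = joint_loss(1.0, 2.0, 0.4, 0.6)
```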

Experiments
This section is organized as follows. Section 5.1 introduces the experimental settings. Main experimental results are reported in Section 5.2. Section 5.3 presents ablation studies, and Section 5.4 compares AKSCP with standard prompting under different settings and analyzes sample efficiency.

Experimental Setup
Following the few-shot setting in LM-BFF (Gao et al., 2020), we conduct experiments on 15 tasks. For each benchmark, we perform 16-shot experiments following Gao et al. (2020). Following previous works (Gao et al., 2020; Jian et al., 2022b), we repeat each experiment 5 times and report the means and standard deviations over 5 different train-test splits. The Baseline model is RoBERTa-BASE, which uses only the few-shot training data D for training; we train it with the same hyper-parameter settings. We compare our method with LM-BFF (Gao et al., 2020) (a method with demonstrations) and PET (Schick and Schütze, 2020) (a method without demonstrations). We use the state-of-the-art method LM-SupCon (Jian et al., 2022a) as the prompt tuning baseline for all tasks.
Implementation Details AKSCP is based on RoBERTa-base (Liu et al., 2019). Our method uses a single prompt/template (the primary prompt) for the prediction of each task, and a set of prompts (auxiliary prompts) for generating multi-view inputs for contrastive learning. We use the Adam optimizer with a learning rate of 1e-5, a warm-up rate of 0.1, and a weight decay of 1e-3 during training. The number of [MASK] tokens in the word cloze task is l_mask = 2. The batch size is set to 16. We conduct training on 8 Nvidia Tesla V100 32G GPU cards. The γ in Eq. 9 is set to 0.3. Early stopping on validation is adopted as a regularization strategy. All hyper-parameters are determined by grid search.

Main Results
In this subsection, we present the specific results and provide possible insights into AKSCP.
Main results on 15 tasks Table 2 summarizes the 16-shot experiment results. On all tasks, our method consistently boosts the performance of the baseline prompt tuning method LM-SupCon.
There is a maximum improvement of 3.3% on TREC. The reason is clear: the selection of label words from the vocabulary becomes inaccurate when labeled data is limited. Prompt-empowered AKSCP avoids this problem and stimulates the PLM's inner knowledge, distributed among neurons, to support model training on downstream tasks. In terms of variance, AKSCP enjoys smaller variances than the baseline methods in most cases, demonstrating that better coverage of label words stabilizes training. Table 2 also shows that our method outperforms prompt-based methods without demonstrations: PET, a method without demonstrations, performs consistently worse than AKSCP. On some tasks, e.g., SST-2, SST-5, QNLI, QQP, RTE, MRPC, MR, and CR, the contribution of AKSCP can be even larger than the sole use of demonstrations for label words. Figure 3(a), (b), and (c) shows the performance in shot-{16~2048} settings.
AKSCP is superior to the other models in all settings. Compared with TREC and SNLI, the improvement on SST-2 is smaller, which may be due to the relatively high performance of the SST-2 baseline (90.6%). LM-SupCon performs consistently worse than AKSCP (e.g., more than a 2.7% score gap in the SNLI and TREC experiments). This is because LM-SupCon tunes the full PLM, which can easily memorize and overfit the limited labeled training data. In contrast, adversarial contrastive learning allows AKSCP to maintain high generalization ability, and CDP provides additional stimulated signals to the NLU model. All results from AKSCP are statistically significant compared to the Baseline model (paired Student's t-test, p < 0.05).

Ablation Study
We conduct ablation studies to evaluate the effects of the word cloze task, multi-view contrastive learning, and adversarial training on the SST, TREC, and SNLI datasets under the 16-shot setting. Word Cloze Task To verify the contribution of the proposed prompt module, we replace the cloze-driven prompt with a standard prompt. Table 3 shows the result: the Cloze-driven Prompt (83.5%) outperforms standard prompt tuning (82.2%) by up to 1.3% on average. The results verify the correctness of our motivation and the effectiveness of word knowledge stimulation. The keywords in a sentence play a hint role, which makes the model ignore the overall semantic representation of the context, leading to representation collapse and generalization issues. Masking the phrase mentions forces the model to learn representations from context, which prevents overfitting and representation collapse (Gao et al., 2020).

Multi-View Contrastive Learning
We then examine the effect of multi-view contrastive learning in AKSCP. We generate positive and negative data pairs from the input view and the output view, respectively. As shown in Table 3, the data pairs from each single view improve model performance over the baseline, but still fall short of the full AKSCP.
Figure 3(d) compares our method against standard prompting with frozen LM parameters (Brown et al., 2020b) on SST-2 in shot-{4~2048} settings. When there is enough training data (e.g., >16 shots), fine-tuning the prompt improves performance over fixed parameters. As the number of shots decreases, our prompt consistently outperforms the fixed-parameter method, whereas the baseline shows the opposite trend: the performance of LM-SupCon drops sharply when shots < 16. In particular, accuracy drops by 29.5% (from 89.2% to 59.7%) at 8 shots, even underperforming the fixed model (67.5%), and the smaller the training size, the bigger the gap between the fixed-parameter and fine-tuned models. This is because the parameters overfit when the training set is small. AKSCP without the word cloze task shows a similar result. The reasons are: i) the word cloze task is self-supervised learning (SSL), which can be trained with supervision signals it provides itself; ii) the keywords of a sentence may play a hint role, making the model ignore the overall semantic representation of the context and leading to representation collapse and generalization issues (Li et al., 2022). Masking the phrase forces the model to learn representations from context, which prevents overfitting and representation collapse with limited data (Li et al., 2022; Gao et al., 2020). The other tasks show the same result.

Sample Efficiency
We discuss how the performance of AKSCP, LM-SupCon, PET, and LM-BFF varies as the number of training samples increases. As shown in Figure 3 and Table 2, the improvement margins on the classification tasks are generally larger than those on the similarity and paraphrase tasks. The reasons are two-fold: i) the similarity and paraphrase tasks are more fine-grained and knowledge-intensive than single-sentence classification; ii) the stimulated knowledge for the similarity and paraphrase tasks includes entity types and boundaries, which are more difficult for PLMs to mine, particularly in low-resource settings, than for sentence classification (Wang et al., 2022). Joint Learning parameters We investigate the effect of joint learning in AKSCP. In general, a lower loss balance weight leads to better performance in most cases. Specifically, in Eq. 9, setting γ to 0.3 is consistently better than other values on the SST dataset. This is because the weight of the word cloze task should not be too large, so as to avoid interfering with the prompt tuning task.

Conclusion
In this paper, we propose the first prompt-based knowledge stimulation model, AKSCP, for low-resource NLU tasks. We conduct experiments on 15 tasks and demonstrate the effectiveness of our approach. For future work, we plan to extend our model to other NLP tasks such as QA and NLG.

Limitations
In this paper, we only evaluated our method on a limited number of NLU tasks and datasets.It is possible that our method may not generalize well to other tasks or domains that require different types of prompting knowledge or cloze-driven prompts.
A promising direction for future work is to investigate how the prompt design and the learning objective influence the performance and robustness of PLMs on few-shot NLU tasks.

ACL 2023 Responsible NLP Checklist
A For every submission: A1. Did you describe the limitations of your work?
In Limitations Section.
A2. Did you discuss any potential risks of your work? Not applicable. Left blank.
A3. Do the abstract and introduction summarize the paper's main claims?
In sections of Abstract, Introduction, and Conclusion.
A4. Have you used AI writing assistants when working on this paper?
Left blank.
B Did you use or create scientific artifacts?
In sections of Related Work, Problem Formalization, and Experiments.
B1. Did you cite the creators of artifacts you used?
In sections of Related Work, Problem Formalization, and Experiments.
B2. Did you discuss the license or terms for use and / or distribution of any artifacts?
We use the open benchmark and open baseline models provided by authors legally.
B3. Did you discuss if your use of existing artifact(s) was consistent with their intended use, provided that it was specified? For the artifacts you create, do you specify intended use and whether that is compatible with the original access conditions (in particular, derivatives of data accessed for research purposes should not be used outside of research contexts)?
In the Limitations Section.
B4. Did you discuss the steps taken to check whether the data that was collected / used contains any information that names or uniquely identifies individual people or offensive content, and the steps taken to protect / anonymize it? Not applicable. Left blank.
B5. Did you provide documentation of the artifacts, e.g., coverage of domains, languages, and linguistic phenomena, demographic groups represented, etc.? Not applicable. Left blank.
B6. Did you report relevant statistics like the number of examples, details of train / test / dev splits, etc. for the data that you used / created? Even for commonly-used benchmark datasets, include the number of examples in train / validation / test splits, as these provide necessary context for a reader to understand experimental results. For example, small differences in accuracy on large test sets may be significant, while on small test sets they may not be.
In sections of Experiments.
C Did you run computational experiments?
In section of Experiments.
C1. Did you report the number of parameters in the models used, the total computational budget (e.g., GPU hours), and computing infrastructure used? Due to space limitations, we will add detailed parameters and more detailed training details in the camera-ready stage.

Figure 1: An illustration of our proposed Adversarial Knowledge Stimulated Contrastive Prompting approach.

Figure 2: Abstract logic of the proposed approach. Solid lines indicate the existence of stimulation links in both the probabilistic graph and the neural graph, dotted lines indicate different levels of stimulation intensity, and red lines denote backward optimization.

Figure 2 shows the graphical model of our approach. The model consists of the following variables: the dataset D; the cloze-driven version of D constructed by the keyword extraction method P; the label Y; the latent knowledge Z^b_K stimulated by the word cloze task; and the latent knowledge Z^y_K stimulated by prompt tuning. The variable Z^b_K bridges D and Z^y_K using self-supervised learning guided by D, while Z^b_K and Z^y_K stimulate each other simultaneously. Prompt tuning uses the prompt to stimulate the task-related knowledge Z^y_K, while the word cloze task narrows the gap between pre-training and prompt tuning.

Figure 3: Results of the sample efficiency analysis. Mean and variance are calculated over 5 different train-test splits. (a) Comparison of AKSCP and strong baselines with different shot K on SST-2. (b) Comparison of AKSCP and strong baselines with different shot K on TREC. (c) Comparison of AKSCP and strong baselines with different shot K on SNLI. (d) Comparison of LM parameter fine-tuning and fixing on SST-2.
… loved every minute of this film.
Prompt tuning: I just loved every minute of this film. It was M.
CDP (top-k): I just M every minute of this M, It was M.

Table 1: Masked examples. M denotes the [MASK] token. Different colors represent tokens replaced by M under different mask strategies. Italic words represent the prompt template. All models are large models trained with the efficient pre-training.

Table 2: Few-shot experiments of baseline methods and ours (mean ± std). LM-BFF is a prompt-based method with demonstrations of label words; PET is one without demonstrations; LM-SupCon is the SOTA approach.

Table 3: Results over SST, TREC, and SNLI of AKSCP in the few-shot learning setting. w/o. denotes removing only one component from AKSCP. † refers to standard prompt tuning. ♣ results taken from (Jian et al., 2022a).