Contrastive Demonstration Tuning for Pre-trained Language Models

Pretrained language models can be effectively stimulated by textual prompts or demonstrations, especially in low-data scenarios. Recent work has focused on automatically searching discrete or continuous prompts or optimized verbalizers, yet studies of the demonstration are still limited. Concretely, demonstration examples are crucial for the final performance of prompt-tuning. In this paper, we propose a novel pluggable, extensible, and efficient approach named contrastive demonstration tuning, which is free of demonstration sampling. Furthermore, the proposed approach can be: (i) plugged into any previous prompt-tuning approach; (ii) extended to widespread classification tasks with a large number of categories. Experimental results on 16 datasets illustrate that our method, integrated with the previous approaches LM-BFF and P-tuning, can yield better performance. Code is available at https://github.com/zjunlp/PromptKG/tree/main/research/Demo-Tuning.


Introduction
Pre-trained language models (PLMs) have been applied to widespread natural language understanding and generation tasks and have been proven to obtain significant gains across benchmarks (Devlin et al., 2019; Liu et al., 2019; Lewis et al., 2020a; Dong et al., 2019; Bao et al., 2020; Zhang et al., 2022c; Xie et al., 2022a). One paradigm for applying PLMs is pre-train-then-fine-tune, which has become the de facto standard for natural language processing (NLP), where task-specific objectives and additional parameters are leveraged in the tuning procedure. Recently, the paradigm of adapting PLMs has been shifting. A new fine-tuning methodology named prompt-tuning, which uses a natural language prompt and a few demonstrations, has made waves in the NLP community by demonstrating astounding few-shot capabilities on myriad language understanding tasks. Further studies try to mitigate labor-intensive prompt engineering with discrete prompt searching (Shin et al., 2020) or continuous prompt optimization (Liu et al., 2021d; Li and Liang, 2021; Hambardzumyan et al., 2021a; Zhong et al., 2021). However, few studies have focused on the demonstration, which is an indispensable component in prompt-oriented methodologies.
In previous studies, demonstrations are examples sampled from the training set. GPT-3's naive "in-context learning" paradigm picks up to 32 randomly sampled instances as demonstrations and directly concatenates them with the input sequence (Liu et al., 2021a; Min et al., 2022). Since informative demonstrations are crucial for model performance, Gao et al. (2021a) develop a refined strategy that samples demonstrations similar to the input, thereby providing the model with more discriminative comparisons. However, this still does not guarantee that the most informative demonstrations are prioritized, because (1) similarity-based sampling may select degraded demonstrations that belong to different classes yet lie at similar distances from the input; and (2) the number of usable demonstrations is still bounded by the model's maximum input length. For example, as shown in Figure 1, the purple lines refer to random sampling while the blue lines indicate similarity-based sampling. Note that similarity-based sampling may select examples very similar to the input sequence; however, sampled examples with different labels may then have similar representations and thus confuse the discriminability of the model. Moreover, for datasets with many classes, it is non-trivial to concatenate all sampled demonstrations. These challenges hinder the applicability of demonstrations in prompt-tuning.
To address these issues, in this paper, we propose contrastive DEMOnstration Tuning (Demo-tuning) for pre-trained language models. Specifically, we leverage learnable continuous embeddings (e.g., one or two learnable tokens) as virtual demonstrations, which relaxes the constraint on the maximum number of categories. We concatenate these virtual demonstrations to the input sequence; thus, our approach can be extended to a wide variety of classification tasks with many categories. To optimize the continuous embeddings, we explore a simple contrastive framework without negative pairs (Grill et al., 2020), since it is difficult to find appropriate negative pairs in semantic space for NLP. In each training batch, we randomly sample a real example and regard the virtual and real examples as a positive pair. With contrastive learning, we can obtain informative, optimized virtual demonstrations with more discriminative comparisons.
We conduct extensive experiments on 16 NLP datasets. Our contrastive demonstration tuning yields better performance when integrated with previous prompt-based methods (e.g., LM-BFF (Gao et al., 2021a), P-tuning (Liu et al., 2021d)). Moreover, our approach can be applied to datasets with many categories and outperforms baselines. Note that our approach is model-agnostic and can be plugged into many prompt-based methods without the effort of selecting suitable demonstrations. The main contributions of this study are as follows:
• We propose a pluggable, extensible, and efficient approach named contrastive demonstration tuning for pre-trained language models. To the best of our knowledge, optimizing demonstrations is a new branch of research that has not been explored in language model prompting.
• We propose virtual demonstrations and leverage contrastive learning to obtain informative demonstrations while relaxing the constraint on the maximum number of categories in classification tasks.
• A systematic evaluation on 16 NLP datasets shows that the proposed simple-yet-effective approach contributes improvements across all these tasks.
Related Work

Prompt-tuning

With the prevalence of GPT-3 (Brown et al., 2020), prompting PLMs for few-shot learning has become a new, popular learning paradigm in natural language processing (Schick and Schütze, 2021; Tam et al., 2021; Liu et al., 2021b) and has appealed to researchers. Recently, prompt-tuning has been applied to various NLP tasks, such as named entity recognition (Cui et al., 2021; Chen et al., 2021b; Zhou et al., 2021; Ma et al., 2022), entity typing (Ding et al., 2021), relation extraction, event extraction (Hsu et al., 2021), sentiment analysis, machine translation, and knowledge graph completion (Xie et al., 2022b). Schick and Schütze (2021, 2020) propose PET, which reformulates NLP tasks as cloze-style questions and yields satisfactory performance. Tam et al. (2021) further propose denser supervision during fine-tuning to improve PET. Note that handcrafting a best-performing prompt is like finding a needle in a haystack and necessitates labor-intensive prompt engineering. Thus, recent studies (Qin and Eisner, 2021; Hambardzumyan et al., 2021b; Ye et al., 2022; Chen et al., 2021c) have focused on automatically searching for prompts. Shin et al. (2020) propose AUTOPROMPT, a gradient-based method to acquire templates and label words for prompt-tuning. EFL reformulates NLP tasks as entailment and turns small LMs into better few-shot learners. Additionally, Gao et al. (2021a) propose LM-BFF (better few-shot fine-tuning of language models), which utilizes a generation model to obtain templates and a refined strategy for dynamically and selectively incorporating demonstrations into each context. However, discrete prompt searching is sub-optimal due to the continuous nature of neural networks.
To overcome these limitations, Liu et al. (2021d,c) propose P-tuning to automatically search for prompts in continuous space. Li and Liang (2021) propose prefix-tuning, which optimizes a sequence of continuous task-specific vectors while keeping the language model parameters frozen. Lester et al. (2021a) leverage a mechanism to learn "soft prompts" to condition frozen language models. Differentiable prompt learning methods for few-shot NLP have also been proposed, optimizing prompt templates as well as labels. Vu et al. (2021) propose SPoT, which learns a prompt on one or more source tasks and then uses it to initialize the prompt for a target task, boosting performance across many tasks. More related works, including WARP (Hambardzumyan et al., 2021a) and OPTIPROMPT (Zhong et al., 2021), also leverage continuous templates, which are more effective than discrete prompt search. To conclude, most existing works try to obtain optimized prompts for widespread NLP tasks; however, few studies have focused on the demonstration, which is an indispensable component of prompt-oriented learning.
Our work is orthogonal to previous prompt-tuning approaches, which aim at optimizing prompts. The major differences between virtual demonstrations and continuous prompts are: 1) they have wholly different training strategies, since continuous prompts are optimized via backpropagation over a training set, while our approach utilizes contrastive learning; 2) our approach requires no external architecture (e.g., the LSTM in P-tuning), making it efficient and pluggable into any prompt-tuning approach. To date, only one prior approach studies the demonstration, presenting a simple demonstration-based learning method for named entity recognition. In contrast, our approach focuses on general NLP classification tasks. Moreover, we propose virtual demonstrations with contrastive learning strategies, which can obtain better demonstrations and also relax the constraint on the maximum number of categories in datasets.

Contrastive Learning
Contrastive learning has long been considered effective in learning meaningful representations. In the early stage, Mikolov et al. (2013) propose to learn word embeddings by regarding words near a target word as positive instances and others as negative. Logeswaran and Lee (2018) further generalize this approach to learn sentence representations. Recently, Kim et al. (2021) propose a contrastive learning method that makes use of a self-guidance mechanism. Yan et al. (2021) propose ConSERT, a contrastive framework for self-supervised sentence representation transfer. Giorgi et al. (2021) propose DeCLUTR, a deep contrastive learning method for unsupervised textual representations. Gao et al. (2021b) leverage dropout as minimal data augmentation and propose SimCSE, a simple contrastive learning framework that greatly advances state-of-the-art sentence embeddings.
On the other hand, contrastive learning has also appealed to the computer vision community (Jaiswal et al., 2020). SimCLR is a simple framework for contrastive learning of visual representations that requires neither specialized architectures nor a memory bank. Chen and He (2021) observe that simple siamese networks can learn meaningful representations even without negative sample pairs, large batches, or momentum encoders.
Our work is related to Grill et al. (2020), a non-contrastive self-supervised learning approach that relies on two neural networks, referred to as the online and target networks, which interact and learn from each other. However, as opposed to this approach, we utilize the encoder in the same state, while Grill et al. (2020) leverage two networks in different states. Moreover, we focus on demonstration optimization in prompt-tuning for NLP, including learning informative demonstrations and acquiring prompt templates and label tokens.

Preliminaries
In this work, we focus on classification tasks in the few-shot setting, including text classification and natural language understanding, where the input x_in is either a single sentence x_in = x_1 or a pair of sentences x_in = (x_1, x_2). We let D_train denote the training set of a downstream task, composed of only K training examples per class, where Y is the label space of the task. Given a pre-trained language model comprising two stages, an encoder f(·) and a classifier g(·), we encode the input x_in into a sequence of hidden vectors {h_k ∈ R^d} and take a designated hidden vector (e.g., that of the [CLS] token) as the input to the classifier g(·).

Prompt-based Fine-tuning Prompt-based fine-tuning (Schick and Schütze, 2021; Gao et al., 2021a) bridges the gap between the masked LM objective of the pre-trained language model and the downstream fine-tuning objective by designing a cloze-style template T and a verbalizer M : Y → V that maps task labels to individual words from the vocabulary V of the pre-trained language model.
Template In the prompt-based fine-tuning paradigm, a template T mainly comprises the input x_in and a prompt P = [P_1], ..., [P_m], where the prompt can be a series of discrete tokens (Schick and Schütze, 2021) or continual pseudo tokens (Liu et al., 2021d). For instance, in a sentiment analysis task (see Figure 2), a template with a handcrafted prompt may be:

T(x_in) = x_in ⊕ "It was [MASK] ."

where "It was ... ." is the prompt and [MASK] is the target, which casts the classification task as a language modeling task.
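The template construction above can be sketched in a few lines; the helper name `build_template` and the default prompt string are illustrative assumptions, not the paper's actual code:

```python
# Minimal sketch of cloze-style template construction: the input sentence is
# concatenated with a handcrafted prompt containing a [MASK] slot.
def build_template(x_in: str, prompt: str = "It was [MASK] .") -> str:
    """Concatenate the input with a handcrafted prompt containing [MASK]."""
    return f"{x_in} {prompt}"

t = build_template("No reason to watch .")
# t is now "No reason to watch . It was [MASK] ."
```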
Verbalizer A verbalizer M defines a mapping from the label space of a specific task to label tokens. In Figure 2a, the verbalizer maps "negative/positive" to "terrible/great". In this way, we can reuse the output weight matrix W_v ∈ R^{d×|V|} of the MLM head used in pre-training and model the probability of predicting token M(y) ∈ V as p(y | x_in) = p([MASK] = M(y) | T(x_in)). We then combine the original template T with the templates above for all classes to form T*(x_in), which is used as the template during prompt-based tuning and inference (see Figure 2).
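As a hedged sketch of how the verbalizer turns [MASK]-position logits into class probabilities (the helper name `label_probs` and the tiny three-word vocabulary are assumptions for illustration):

```python
import math

# Map task labels to verbalizer tokens, then read class probabilities off a
# softmax over the vocabulary logits at the [MASK] position.
def label_probs(mask_logits, vocab, verbalizer):
    m = max(mask_logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in mask_logits]
    z = sum(exps)
    token_probs = {tok: e / z for tok, e in zip(vocab, exps)}
    # p(y | x_in) = p([MASK] = M(y) | T(x_in))
    return {y: token_probs[w] for y, w in verbalizer.items()}

vocab = ["terrible", "great", "movie"]  # toy stand-in for the full vocabulary
verbalizer = {"negative": "terrible", "positive": "great"}
probs = label_probs([2.0, 1.0, 0.5], vocab, verbalizer)
```

In a real model, `mask_logits` would be the MLM head's output at the [MASK] position over the full vocabulary V.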

Contrastive Demonstration Tuning
In this work, we focus on learning a compact and differentiable virtual demonstration to serve as prompt augmentation, instead of designing specific sampling strategies for demonstration-based learning. We propose a learning framework based on contrastive learning that is compatible with the current prompt-based learning paradigm. This section introduces the concepts of contrastive demonstration tuning (Demo-tuning) and provides details of the approach.
Let [D^(c)] = [D^(c)_1], ..., [D^(c)_n] refer to the virtual demonstration of the c-th class, where n is a hyper-parameter that sets the length of the virtual demonstration and is far less than the length of a real demonstration. For instance, a template for a binary classification task (see Figure 2) can be:

T(x_in) = x_in ⊕ "It was [MASK] ." ⊕ [D^(1)] ⊕ [D^(2)]

where ⊕ denotes concatenation of input sequences.
[D^(1)] and [D^(2)] denote the virtual demonstrations of the two classes, respectively. Virtual demonstrations are flexible and can be integrated into a wide variety of prompt learning approaches (Liu et al., 2021d; Lester et al., 2021b). Next, we study how to obtain optimal virtual demonstrations, which are initialized as a series of pseudo tokens at the start of fine-tuning. To address this challenging problem, we propose to use contrastive learning, which aims to obtain effective representations by pulling semantically close neighbors together. Intuitively, we believe the optimal virtual demonstrations may be analogous to "prototypes" (Snell et al., 2017), the representatives of their corresponding classes, which we discuss in §6.
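A sketch of a template that appends n pseudo tokens per class as virtual demonstrations; the pseudo-token spelling (e.g., `[D(1)_1]`) and helper names are illustrative assumptions:

```python
# Build a template string with n virtual-demonstration pseudo tokens per
# class appended after the prompt.
def virtual_demo(c: int, n: int) -> str:
    return " ".join(f"[D({c})_{i}]" for i in range(1, n + 1))

def template_with_demos(x_in: str, num_classes: int = 2, n: int = 2) -> str:
    demos = " ".join(virtual_demo(c, n) for c in range(1, num_classes + 1))
    return f"{x_in} It was [MASK] . {demos}"
```

Because each class contributes only n pseudo tokens rather than a full sampled example, the template stays short even when the number of classes is large.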
Positive Instances A key element of contrastive learning is how to construct reasonable (x_in, x_in^+) pairs. Here, we design a new template T^+(x) based on the template T(x) by randomly replacing one of the virtual demonstrations [D^(c)] with a real demonstration d_c, as shown in Figure 2b:

T^+(x_in) = x_in ⊕ "It was [MASK] ." ⊕ d_1 ⊕ [D^(2)]

where [D^(1)] is replaced with a demonstration d_1 of the class "terrible". Using this template, we convert the input x_in into a corresponding positive example x_in^+, i.e., (T(x_in), T^+(x_in)) is a positive training instance. In this way, aligning the virtual demonstration [D^(c)] with d_c, the only difference between x_in and x_in^+, and pulling the representations (h_in, h_in^+) closer in semantic space can effectively alleviate the problem of poor or irrelevant demonstrations introduced by previous sampling strategies.
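The positive-pair construction above can be sketched as follows; `positive_pair` and its arguments are illustrative names, and T^+ replaces exactly one randomly chosen virtual demonstration with a real one:

```python
import random

# Construct a positive pair (T(x), T+(x)): T+ replaces one randomly chosen
# virtual demonstration with a real demonstration of that class.
def positive_pair(x_in, virtual, real, rng=None):
    # virtual: {class_id: pseudo-token string}; real: {class_id: demo text}
    rng = rng or random.Random(0)
    classes = sorted(virtual)
    t = x_in + " It was [MASK] . " + " ".join(virtual[c] for c in classes)
    chosen = rng.choice(classes)
    slots = [real[c] if c == chosen else virtual[c] for c in classes]
    t_plus = x_in + " It was [MASK] . " + " ".join(slots)
    return t, t_plus
```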
Optimization Similar to SimCSE (Gao et al., 2021b), we can randomly sample a minibatch of N examples from D_train to construct positive pairs and take a cross-entropy objective with in-batch negatives for (x_i, x_i^+):

ℓ_i = −log ( exp(sim(h_i, h_i^+)/τ) / Σ_{j=1}^{N} exp(sim(h_i, h_j^+)/τ) )   (3)

where τ denotes a temperature parameter and sim(h_1, h_2) is the cosine similarity h_1 · h_2 / (∥h_1∥ ∥h_2∥). The negative pairs are composed of two different examples with the same demonstration in a minibatch.
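A pure-Python sketch of this in-batch contrastive objective (a stand-in for the batched PyTorch version; the function names are illustrative):

```python
import math

# In-batch contrastive objective (Eq. 3): cosine similarity with temperature
# tau, positives aligned by index, all other in-batch pairs as negatives.
def cos_sim(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce(h, h_pos, tau=0.05):
    n = len(h)
    total = 0.0
    for i in range(n):
        logits = [cos_sim(h[i], h_pos[j]) / tau for j in range(n)]
        m = max(logits)  # log-sum-exp with max subtraction for stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += -(logits[i] - log_z)
    return total / n
```

When each h_i is closest to its own h_i^+, the loss is near zero; misaligned positives drive it up, which is exactly the signal used to optimize the virtual demonstrations.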
In this work, we also explore a simple contrastive framework without negative pairs, similar to recent non-contrastive self-supervised learning (Grill et al., 2020). Given the difficulty of finding appropriate negative pairs in semantic space for NLP, especially in the few-shot setting, we only construct positive pairs and define the following mean squared error between h_i and h_i^+ with ℓ2-normalization:

ℓ_i = ∥ h_i/∥h_i∥ − h_i^+/∥h_i^+∥ ∥²   (4)

where h_i and h_i^+ are obtained through the encoder f(·) in the same state, unlike Grill et al. (2020), who encode x_i and x_i^+ through two networks in different states (an online network and a target network).
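A minimal sketch of this negative-free objective (function names are illustrative):

```python
import math

# Negative-free objective (Eq. 4): mean squared error between l2-normalized
# representations, both produced by the same encoder state.
def l2_normalize(v):
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def mse_no_negatives(h_i, h_i_pos):
    u, v = l2_normalize(h_i), l2_normalize(h_i_pos)
    return sum((a - b) ** 2 for a, b in zip(u, v))
```

After normalization this equals 2 − 2·cos(h_i, h_i^+), so minimizing it pulls the pair together exactly as a cosine objective would, without any negatives.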
When supervised examples D_train are available, the pre-trained language model can be fine-tuned to minimize a joint objective comprising the cross-entropy loss and the contrastive objective of Eq. (4). In this way, during inference, we can concatenate the input x_in with the trained virtual demonstrations in the template T(x), with no need to sample real demonstrations. We provide an empirical analysis of negative sampling in §5.4.

Settings
Evaluation During training, we follow the evaluation protocol adopted in Gao et al. (2021a) and assume a development set D_dev for model selection and hyper-parameter tuning, whose size is the same as that of D_train, i.e., |D_dev| = |D_train|. For every experiment, we measure the average performance across 5 different randomly sampled (D_train, D_dev) splits using a fixed set of seeds.
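The aggregation over the 5 random splits can be sketched as follows (`aggregate` is an illustrative helper):

```python
import statistics

# Aggregate task performance over the 5 random (D_train, D_dev) splits as
# mean and standard deviation, as reported in the results tables.
def aggregate(scores):
    return statistics.mean(scores), statistics.stdev(scores)
```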
Hyperparameter Selection We implement our framework and reproduce P-tuning ourselves using PyTorch (Paszke et al., 2019) and HuggingFace (Wolf et al., 2020). The main results of LM-BFF in Table 1 follow the setup of Gao et al. (2021a): we use RoBERTa-large as the backbone language model and set K = 16. For the length n of the virtual demonstration per class, we select it from the candidate set {1, 2, 3, 5}.

Table 1: Comparison of the performance of our approach with several baselines across 14 text classification tasks in the few-shot setting. We report the mean (and standard deviation) results over 5 random seeds. LM-BFF (w/ Demo) and P-tuning (w/ Demo): prompt-tuning methods (LM-BFF and P-tuning) using demonstrations in context with the manual templates used in Gao et al. (2021a). Demo-tuning (LM-BFF) and Demo-tuning (P-tuning): our proposed approach based on LM-BFF and P-tuning, respectively.

Main Results
We apply our method to two popular prompt-based tuning techniques, LM-BFF and P-tuning, and compare them to a number of baselines, namely: (1) standard fine-tuning in the few-shot setting; (2) "GPT-3"-style in-context learning: prediction without parameter updates, which concatenates randomly sampled demonstrations with the prompt; (3) LM-BFF using demonstrations in context with a manual template; (4) P-tuning using demonstrations in context with a manual template, where we do not specifically search for the optimal length of the continual prompt and fix the length m to 4 in all tasks.
In Table 1, we report the performance of the baseline approaches and our two variants. First, in-context learning can achieve comparable or even higher performance than standard fine-tuning and the prompt-tuning methods (LM-BFF and P-tuning); using demonstrations in context brings consistent improvement on a majority of tasks, which means that demonstrations are worth exploiting.
Second, our approach based on the two prompt-based tuning techniques consistently outperforms the vanilla methods. In detail, Demo-tuning based on LM-BFF improves the average score by 0.75 compared with LM-BFF with demonstrations in the input context. More importantly, Demo-tuning is flexible and orthogonal to most fine-tuning methods. To evaluate compatibility, we combine Demo-tuning with P-tuning (Liu et al., 2021d), which leads to a 1.0 average score improvement in total. In this work, we do not specially design templates for P-tuning. Although the templates for P-tuning and the prompt length are suboptimal, we find that Demo-tuning with P-tuning leads to consistent gains on a majority of tasks. Third, an advantage of our proposed virtual demonstration is that it applies well to multi-class sentence classification tasks. Table 2 gives the results of Demo-tuning compared to standard fine-tuning and prompt-based tuning. Due to the limitation of the model's input length, in-context learning and LM-BFF with demonstrations cannot be applied in this scenario. We notice that while the performance of LM-BFF is worse than fine-tuning, Demo-tuning based on LM-BFF improves the score by 1.7 on Yahoo and achieves a better score than fine-tuning.

Analysis of Virtual Demonstration
The selection of demonstrations is crucial for demonstration-based learning (e.g., in-context learning and LM-BFF with demonstrations). Next, we compare and discuss our proposed virtual demonstration against current approaches. Table 3 shows the impact of demonstration sampling strategies. During inference, our virtual demonstration, obtained by contrastive learning during training, can be an alternative to real demonstrations and can be viewed as an implicit sampling strategy. We compare our method with previous sampling strategies based on LM-BFF.

Demonstration Sampling
While the performance of uniform demonstration sampling from each class is better than vanilla LM-BFF on TREC and SNLI, we notice that on the MRPC task this method causes a severe accuracy loss of up to 3.6. We think random sampling is prone to introduce irrelevant information into demonstrations. To address this issue, Gao et al. (2021a) adopt a filter-based sampling strategy with Sentence-BERT (Reimers and Gurevych, 2019) to select demonstrations relevant to the input examples. The filter-based sampling strategy achieves consistent gains on the majority of tasks, yielding the highest improvement of 3.6 on the TREC task. We consider that this KNN-style method, which concatenates examples with demonstrations that are semantically close to them, can prompt language models to decipher meaningful patterns.
The virtual demonstration, an alternative to real demonstrations during inference that avoids complex sampling steps, achieves gains on most tasks. Besides our proposed method, we design a simple strategy to construct virtual demonstrations by averaging the representations of instances with the same label. We notice that constructing virtual demonstrations by simple averaging of instances causes poor performance on most tasks, whereas our method with contrastive learning outperforms previous approaches. The only exception is SNLI, where the score is only comparable with random sampling. We hypothesize that this is caused by confusion issues that may exist in the filter-based strategy regarding semantic closeness among contrastive demonstrations.
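The simple averaging baseline described above can be sketched as follows (the function name is an illustrative assumption):

```python
# Averaging baseline: a virtual demonstration formed by averaging the
# representations of all instances that share the same label.
def average_prototype(reps):
    # reps: list of equal-length representation vectors for one class
    n, d = len(reps), len(reps[0])
    return [sum(r[k] for r in reps) / n for k in range(d)]
```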
Optimization w/ vs. w/o Negative Samples Figure 3 compares virtual demonstration optimization with and without negative sampling. We conduct experiments with different optimization strategies on 3 tasks. We find that optimizing the objective of Eq. (3), i.e., conventional contrastive learning with negative samples, causes dramatic performance degradation, with an average score even lower than LM-BFF's. We see two possible reasons: (1) in NLP tasks, finding a semantically reasonable negative pair is difficult, especially in the few-shot setting; (2) without specific constraints, negative pairs may become example-demonstration pairs, which confuses the model. Moreover, since our goal is to obtain optimal virtual demonstrations for downstream tasks, contrastive optimization without negative sampling may be a more suitable solution.
Demonstration Length Figure 4 shows the ablation study on the length n of the virtual demonstration per class. We compare Demo-tuning with a variant without contrastive learning at different lengths n. It is noteworthy that without contrastive learning, a virtual demonstration degrades into a continual prompt. We find that a relatively short length (e.g., 2 or 3) gains stable performance improvements on QNLI and MR. Conversely, a larger length (e.g., 20) may decrease performance. We consider that as the length of the virtual demonstration increases, it introduces more parameters into the model, making it challenging to learn from a small amount of annotated data. Demo-tuning achieves consistent improvement over its variant at different lengths. Hence, we conclude that a virtual demonstration optimized by the simple contrastive framework plays a different role from a continuous prompt.

Discussion
We discuss several favorable properties of contrastive demonstration tuning and present some open problems. Possible Supplement for Parameter-efficient Fine-tuning. Previous studies (Liu et al., 2021d; Li and Liang, 2021) have demonstrated the effectiveness of prompt-tuning (e.g., P-tuning, prefix-tuning) as a parameter-efficient fine-tuning methodology for huge PLMs. Our approach can serve as a supplement for parameter-efficient fine-tuning by tuning only the demonstration while keeping the PLM fixed. We leave this for future work.
Relation to Prototype Learning. In §4, we note that the optimal virtual demonstrations may be analogous to "prototypes" (Snell et al., 2017), the representatives of their corresponding classes. Our approach may have connections to prototype learning, and further empirical and theoretical analysis should be conducted.
Demonstration as External Knowledge. Recall that concatenated demonstrations are similar to those in previous studies such as RAG (Lewis et al., 2020b) and REALM (Guu et al., 2020), which retrieve and concatenate relevant texts as external knowledge (Zhang et al., 2022b). We think it is also interesting to investigate novel knowledge injection approaches via demonstrations. We further discuss a few weaknesses of our method in its current form and look into possible avenues for future work. On the one hand, our work still suffers from biased/long-tailed label distributions. Since we obtain optimized virtual demonstrations via contrastive learning, the virtual demonstrations of classes with many samples may dominate the training stage. This limitation might be ameliorated with weighted sampling strategies. On the other hand, our approach cannot directly handle structured prediction tasks. Integrating demonstrations with prefix-tuning-based methods may help mitigate this limitation.

Conclusion and Future Work
In this work, we propose contrastive demonstration tuning, a simple model-agnostic approach for pre-trained language models, which improves state-of-the-art prompt-tuning performance without the necessity of demonstration selection. In the future, we plan to explore the following directions: 1) studying the connection between virtual demonstrations and prototypes and theoretically analyzing the optimal demonstration for prompt-tuning; 2) applying our work to more NLP tasks and adapting it to natural language generation.
Our contrastive demonstration tuning has limitations. First, our model leverages a pre-trained language model and thus requires considerable GPU resources. Besides, in few-shot settings, the performance gains are still limited, as virtual demonstrations are learned from only a few training instances. It is worth studying retrieving relevant context from the internet as "demonstrations" to aid efficient NLP.