Improving the Sample Efficiency of Prompt Tuning with Domain Adaptation

Prompt tuning, or the conditioning of a frozen pretrained language model (PLM) with soft prompts learned from data, has demonstrated impressive performance on a wide range of NLP tasks. However, prompt tuning requires a large training dataset to be effective and is outperformed by finetuning the entire PLM in data-scarce regimes. Previous work (Gu et al., 2022; Vu et al., 2022) proposed to transfer soft prompts pretrained on a source domain to the target domain. In this paper, we explore domain adaptation for prompt tuning, a problem setting where unlabeled data from the target domain are available during pretraining. We propose bOosting Prompt TunIng with doMain Adaptation (OPTIMA), which regularizes the decision boundary to be smooth around regions where source and target data distributions are similar. Extensive experiments demonstrate that OPTIMA significantly enhances the transferability and sample efficiency of prompt tuning compared to strong baselines. Moreover, in few-shot settings, OPTIMA exceeds full-model tuning by a large margin.


Introduction
Prompt tuning (Lester et al., 2021; Li and Liang, 2021; Liu et al., 2022; Hambardzumyan et al., 2021) is an effective method for adapting large-scale pretrained language models (PLMs) to downstream tasks. While keeping the PLM weights unchanged, prompt tuning trains input vectors, called soft prompts, that are fed to the PLM alongside the text embeddings. Compared to other adaptation techniques for PLMs, such as Adapter (Houlsby et al., 2019; Rücklé et al., 2021; He et al., 2022), Compacter (Mahabadi et al., 2021), BitFit (Elad et al., 2022), LoRA (Hu et al., 2021), and Ladder Side-Tuning (Sung et al., 2022), the advantage of prompt tuning is that it does not require adding or changing model parameters. As a result, with prompt tuning, we can easily specialize one neural network (possibly deployed on a large number of servers or as application-specific integrated circuits) to support many different tasks by simply switching out the soft prompt in the input, which greatly simplifies model deployment and maintenance.

Figure 1: Smooth vs. zigzag decision boundaries. Left: When the distribution of the target-domain data (orange) is similar to the source domain (blue), the smooth decision boundary (solid line) generalizes better than the zigzag boundary. Right: When the distributions are different, it is not clear whether the smooth decision boundary is the better choice.
However, training effective soft prompts usually requires sufficient labeled training data (Su et al., 2021). Studies have shown that prompt tuning significantly underperforms full-model tuning on many few-shot classification tasks (Gu et al., 2022). Our experiments corroborate this finding. In addition, we find that, in few-shot learning, prompt tuning is equally, if not more, sensitive to random seed choices compared to full-model tuning, despite having far fewer trainable parameters (§3.4). Gu et al. (2022) address this by transferring prompts learned from a source domain to the target domain with limited training data.
In this paper, we investigate a related but different scenario, unsupervised domain adaptation (UDA) (Wang et al., 2019; Long et al., 2022), where unlabeled data from the target domain are available. Such situations are common when data are abundant but the labeling cost, including annotator recruitment, annotator training, and quality assurance, is high. Utilizing unlabeled examples can be an effective approach toward enhancing the data efficiency of prompt tuning.
We propose bOosting Prompt TunIng with doMain Adaptation (OPTIMA). Employing regularization from adversarial perturbation, OPTIMA learns a smooth decision boundary that passes through regions of low data density. In addition, recognizing that the feature distributions in the two domains may overlap only partially, we propose to focus the regularization on regions where the target-domain and source-domain data exhibit high similarity. We illustrate the intuition in Figure 1.
The popular Domain-Adversarial Neural Network (DANN) technique (Ganin et al., 2016) encourages the network to learn domain-invariant features and optimizes for both a domain-specific task loss and a domain discrimination loss. However, the two losses can compete against each other, leading to optimization difficulties (Guo et al., 2021). Empirically, DANN exhibits low performance for prompt tuning. We hypothesize that the low capacity of prompts worsens the optimization problem. To solve this issue, in OPTIMA, we create input disturbance vectors that optimize for domain similarity, so that the prompt needs to optimize for only the task loss. This separation leads to excellent results.
Experiments show that OPTIMA learns effective data representations that transfer well to the target domain under zero-shot and few-shot settings. OPTIMA outperforms nine baselines, including state-of-the-art transfer learning techniques such as SPOT (Vu et al., 2022).

Domain Adaptation for Prompt Tuning
In this section, we first introduce prompt tuning for text classification. Then, we introduce how to enhance the in-domain generalization performance of soft prompts by augmenting the input with virtual perturbations. Next, we show how to optimize the perturbations to reduce the domain gap and obtain soft prompts with domain-invariant knowledge. Finally, we show how to use the soft prompts to boost few-shot learning in the target domain.

Preliminaries: Prompt Tuning
We start by introducing some notation. The input x is a sequence of n token embeddings, x = ⟨x_1, ..., x_n⟩. The trainable soft prompt sequence p has m embeddings, p = ⟨p_1, ..., p_m⟩. The manually designed hard prompt sequence h has k token embeddings, h = ⟨h_1, ..., h_k⟩. All embedding vectors have d dimensions. The soft prompt and the hard prompt are both task-specific. The hard prompt text is usually a natural language description of the task, whereas the soft prompts do not correspond to any text and are trained directly using gradient descent.
For classification problems, we adopt the masked language modeling formulation, which aims to predict a predefined verbalizer token y ∈ V at a masked position in the input. For example, for binary classification, the words "yes" and "no" may be used as verbalizers that indicate positive and negative predictions, so we may define the label space as Y = {yes, no}. In encoder-only networks such as BERT (Devlin et al., 2019), the output of the encoder is mapped to the label space Y via a projection head. In encoder-decoder networks like T5 (Raffel et al., 2020), the decoder is responsible for generating the verbalizer token.
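To make the verbalizer formulation concrete, the following is a minimal PyTorch sketch of mapping the logits at the masked position to a distribution over the label space Y. The function name and the argument layout are illustrative assumptions of this sketch, not part of any released implementation.

import torch

def verbalizer_probs(logits_at_mask: torch.Tensor,
                     verbalizer_ids: list) -> torch.Tensor:
    """Map LM logits at the [MASK] position to a distribution over Y.

    logits_at_mask: (batch, vocab_size) logits produced at the mask.
    verbalizer_ids: vocabulary ids of the verbalizer tokens,
        e.g., for Y = {yes, no}, the ids of "yes" and "no".
    """
    class_logits = logits_at_mask[:, verbalizer_ids]  # (batch, |Y|)
    return torch.softmax(class_logits, dim=-1)        # multinomial over Y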
We concatenate all sequences and the embedding of the [MASK] token, e([MASK]), to form the final input to the PLM: ⟨p; h; x; e([MASK])⟩. For simplicity, we use the function f(x, p) to denote the PLM prediction at the masked position, which is a multinomial distribution over Y. We adopt the cross-entropy classification loss ℓ_xe with the ground-truth label y ∈ Y.
We optimize the soft prompt by minimizing the expected classification loss over the labeled training set D:

p* = argmin_p E_{(x,y)∈D} ℓ_xe(x, y, p).    (2)
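Below is a minimal PyTorch sketch of this objective, assuming a frozen PLM plm that accepts precomputed input embeddings and a hypothetical helper mask_logits that extracts the class logits at the masked position; only the soft prompt p receives gradients. This is an illustration of the formulation above, not the paper's released implementation.

import torch
import torch.nn.functional as F

m, d = 100, 1024                           # prompt length and embedding size
p = torch.randn(m, d, requires_grad=True)  # soft prompt: the only trainable part
optimizer = torch.optim.Adam([p], lr=0.3)

def f(x_embeds, p, plm):
    """f(x, p): PLM class logits at the masked position."""
    batch = x_embeds.size(0)
    inputs = torch.cat([p.unsqueeze(0).expand(batch, -1, -1), x_embeds], dim=1)
    return mask_logits(plm(inputs_embeds=inputs))  # mask_logits: hypothetical helper

def prompt_tuning_step(x_embeds, y, plm):
    """One step of Equation 2: minimize l_xe over the soft prompt p."""
    loss = F.cross_entropy(f(x_embeds, p, plm), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()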

The OPTIMA Approach
We build OPTIMA on two intuitions regarding domain adaptation. First, as the target domain provides no direct supervision, it is easy to overfit to the source domain. Therefore, it is important to mitigate overfitting by regularizing the network to maintain a smooth decision boundary. Under an adversarial learning framework, we seek a small perturbation δ that, when added to the input, results in the maximum change in the model prediction. After that, we optimize the model parameters to minimize the prediction change under the adversarially perturbed input. The overall result is a network whose output f(x) changes little when a small change is added to the input x. In the sense of Lipschitz continuity, such a decision boundary is smooth. Smooth decision boundaries can be understood as passing through regions of low data density and are shown to improve generalization (Huang et al., 2020; Cicek and Soatto, 2019; Kim et al., 2019).
The second intuition is that we do not have to regularize the entire decision boundary. As the source and target domains may have different data distributions, all that matters is the decision boundary segment close to the target-domain data. Therefore, we target the regularization and the perturbation δ to areas on the data manifold where the source domain and target domain are similar.
Specifically, we have a labeled dataset from the source domain, D_s = {(x_s^(i), y_s^(i))}_{i=1}^{N_s}, drawn i.i.d. from distribution P_s, and an unlabeled dataset from the target domain, D_t = {x_t^(j)}_{j=1}^{N_t}, drawn i.i.d. from distribution P_t. We define ℓ_KL as the KL divergence between the prediction of the original input and that of the perturbed input,

ℓ_KL(δ, p, x_s) = KL(f(x_s, p) ∥ f(x_s + δ, p)).    (3)

ℓ_KL measures how much the model prediction changes when the perturbation δ is applied to x_s and captures the smoothness of the decision boundary. We illustrate the intuition in Figure 2.

Further, we introduce a domain discriminator network, parameterized by θ_d, which attempts to distinguish data instances from the two domains. Let z(·) denote the discriminator output, the predicted probability that an input comes from the source domain. This network is trained to reduce the domain discrimination loss

L_disc = −E_{x_s∈D_s} [log z(x_s) + log z(x_s + δ)] − E_{x_t∈D_t} [log(1 − z(x_t))],    (4)

a variation of the cross entropy with an additional term where x_s is perturbed by δ. In addition, we define an adversarial loss,

ℓ_adv(δ, x_s) = log(1 − z(x_s + δ)),    (5)

which, when maximized, causes the domain discriminator to mistake the perturbed source example x_s + δ as coming from the target domain.
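In code, the two losses can be sketched as follows, reusing f from the earlier snippet and assuming a discriminator disc that returns the probability that its input comes from the source domain. The sign conventions follow the reconstruction above and are an assumption of this sketch.

import torch
import torch.nn.functional as F

def l_kl(x_s, delta, p, plm):
    """Equation 3: KL(f(x_s, p) || f(x_s + delta, p))."""
    clean = F.softmax(f(x_s, p, plm), dim=-1).detach()       # original prediction
    perturbed = F.log_softmax(f(x_s + delta, p, plm), dim=-1)
    return F.kl_div(perturbed, clean, reduction="batchmean")

def l_adv(x_s, delta, disc):
    """Equation 5: maximized in delta so that x_s + delta looks target-like."""
    return torch.log(1.0 - disc(x_s + delta) + 1e-8).mean()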
For a given source-domain input x_s, we find the perturbation δ* within an ϵ-radius ball that maximizes the following objective:

δ* = argmax_{∥δ∥ ≤ ϵ} [ℓ_KL(δ, p, x_s) + ℓ_adv(δ, x_s)].    (6)

Here, ℓ_adv(δ, x_s) can be understood as a regularization term for δ. By maximizing ℓ_KL, we seek a disturbance to the input that causes the most change in the model prediction. At the same time, the disturbed input x_s + δ* from the source domain should resemble data in the target domain in order to maximize ℓ_adv(δ, x_s); ℓ_adv constrains δ* to the region where the data from the two domains are similar.
We solve Equation 6 approximately with projected gradient ascent (PGA). At each ascent step, we compute the gradient of the objective with respect to δ,

g = ∇_δ [ℓ_KL(δ, p, x_s) + ℓ_adv(δ, x_s)],    (7)

and the update to δ can be written as

δ ← Π_{∥δ∥≤ϵ} (δ + η_δ g / ∥g∥₂),    (8)

where η_δ is the learning rate and Π_{∥δ∥≤ϵ} projects the result back onto the ϵ-radius ball. We normalize g to make sure the updates have the same magnitude.
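The inner maximization can be sketched as a short projected-gradient-ascent loop, building on the l_kl and l_adv helpers above; the random initialization scale is an arbitrary choice of this sketch.

def find_delta(x_s, p, plm, disc, eps, K, eta_delta):
    """Approximate delta* of Equation 6 with K steps of PGA (Eqs. 7-8)."""
    delta = 1e-3 * torch.randn_like(x_s)              # small random start
    for _ in range(K):
        delta.requires_grad_(True)
        objective = l_kl(x_s, delta, p, plm) + l_adv(x_s, delta, disc)
        g = torch.autograd.grad(objective, delta)[0]  # Eq. 7
        with torch.no_grad():
            delta = delta + eta_delta * g / (g.norm() + 1e-12)       # normalized ascent
            scale = (eps / (delta.norm() + 1e-12)).clamp(max=1.0)    # project onto eps-ball
            delta = delta * scale
    return delta.detach()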
During training, we alternately optimize the perturbation δ and the soft prompt p. With δ* found by PGA, we optimize the following loss function over p using standard gradient-based optimization:

L_R = (1/B) Σ_{i=1}^{B} ℓ_KL(δ^(i)*, p, x_s^(i)),    (9)

p* = argmin_p E_{(x_s,y_s)∈D_s} ℓ_xe(x_s, y_s, p) + L_R,    (10)

where L_R is the empirical expectation of ℓ_KL computed over the current mini-batch of B examples. With the same δ*, we also minimize the domain discrimination loss (Equation 4) over the discriminator network parameters θ_d.

The OPTIMA Algorithm
We show the complete OPTIMA algorithm as Algorithm 1. In lines 5 and 6, we create an initial perturbation δ^(i)_0 for every source data point x_s^(i). From line 7 to line 13, we iteratively update the perturbation δ^(i) associated with every source-domain data point x_s^(i) using projected gradient ascent on ℓ_KL + ℓ_adv. After K iterations, we obtain the final perturbations δ^(i)*, compute the loss in Equation 10 accordingly, and update p with stochastic gradient descent (SGD) and learning rate η_p (line 16). At line 17, we update the domain discriminator parameters θ_d using SGD with the current mini-batches. Though we show the vanilla SGD updates in lines 16-17, we can easily switch to other optimizers such as SGD with momentum or Adam (Kingma and Ba, 2015).
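Putting the pieces together, a compact sketch of one outer iteration follows. It simplifies Algorithm 1 by recomputing per-batch perturbations instead of maintaining one per example, and the discriminator loss follows the reconstruction in Equation 4; both simplifications are assumptions of this sketch.

def optima_step(batch_s, batch_t, p_optimizer, d_optimizer, plm, disc,
                eps=1.0, K=3, eta_delta=0.1):
    x_s, y_s = batch_s   # labeled source mini-batch
    x_t = batch_t        # unlabeled target mini-batch

    delta = find_delta(x_s, p, plm, disc, eps, K, eta_delta)

    # Update the soft prompt on the task loss plus the smoothness regularizer (Eq. 10).
    loss_p = F.cross_entropy(f(x_s, p, plm), y_s) + l_kl(x_s, delta, p, plm)
    p_optimizer.zero_grad(); loss_p.backward(); p_optimizer.step()

    # Update the discriminator on clean source, perturbed source, and target (Eq. 4).
    loss_d = -(torch.log(disc(x_s) + 1e-8).mean()
               + torch.log(disc(x_s + delta) + 1e-8).mean()
               + torch.log(1.0 - disc(x_t) + 1e-8).mean())
    d_optimizer.zero_grad(); loss_d.backward(); d_optimizer.step()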

Comparison with Virtual Adversarial Training
Virtual Adversarial Training (VAT) (Miyato et al., 2016, 2018) is a pioneering work that applies adversarial perturbation to unlabeled examples in semi-supervised learning (SSL). The SSL assumption is that we have labeled data (x, y) ∼ P and unlabeled data x ∼ P, both drawn i.i.d. from the same distribution. Notice that x is drawn from the same distribution P regardless of the existence of the label y. VAT finds the disturbance δ ∈ Q_ϵ that maximizes the change in the model prediction, KL(f(x) ∥ f(x + δ)). After that, the neural network minimizes the cross-entropy on labeled data and the KL divergence under disturbance on all data. Similar ideas have been explored in (Cicek and Soatto, 2019; Kim et al., 2019; Park et al., 2022).
A critical difference between SSL and domain adaptation is that the unlabeled data are drawn from a different distribution (P_t) than the labeled data (P_s). As the two distributions may overlap in some regions and diverge in others, regularizing over the entire source dataset may be ineffective. Thus, we propose to focus the smoothness constraint on the regions of the data manifold where the source-domain and target-domain data are similar.

Experimental Evaluation
We evaluate the representations learned by OPTIMA under zero-shot and few-shot settings.

Datasets
We investigate domain adaptation on six text classification datasets spanning two tasks. For paraphrase detection, we employ MRPC and QQP. For natural language inference (NLI), we use MNLI, SNLI, SICK, and CB. The statistics and the label space Y of each dataset can be found in Table 1. We prepare eight groups of cross-domain experiments, two for paraphrase detection and six for NLI, as shown in Table 2.

Baseline Techniques
We include nine competitive single-domain and cross-domain baselines. Baselines #1-#4 do not use any transfer learning from the source domain; baselines #5-#9 utilize transfer learning and data from the source domain.
1) Frozen PLM. Large PLMs have demonstrated non-trivial zero-shot performance (Brown et al., 2020). Here, we directly apply T5-large (Raffel et al., 2020) with the manually written hard prompt and take the verbalizer with the highest probability as the prediction.
2) Prompt Tuning (PT). We feed the input data with both soft and hard prompts to a frozen T5-large model and finetune the soft prompt embeddings on the few-shot training set from the target domain.
3) Fine Tuning (FT). We feed the input data with the hard prompt to T5-large and finetune the entire network on the few-shot target-domain data. Notice that we use the verbalizer rather than training a separate task-specific prediction head.
4) Prompt Tuning + Fine Tuning. For fair comparison, we wrap the input with both soft and hard prompts and finetune both the PLM and the soft prompts on the target-domain data. The predictions are mapped via verbalizers.

5) Pre-trained Prompt Tuning (PPT). Gu et al. (2022) pretrain soft prompts with self-supervised pre-text tasks such as next sentence prediction; the pretrained prompts are then finetuned on the few-shot target-domain data.

6) Soft Prompt Transfer (SPOT). Vu et al. (2022) propose to pretrain soft prompts on source-domain datasets and finetune the learned soft prompts on the target-domain datasets. We apply this approach to different source-target pairs in the few-shot setting.
7) Prompt Tuning with FreeLB. FreeLB (Zhu et al., 2020) is an adversarial training approach that generates the adversarial perturbation from the supervised classification loss:

δ* = argmax_{∥δ∥≤ϵ} ℓ_xe(x_s + δ, y_s, p).    (11)

After that, we find the optimal p by minimizing ℓ_xe(x_s, y_s, p) + ℓ_xe(x_s + δ*, y_s, p). This adversarial training may be understood as another type of smoothness constraint, as the network attempts to maintain the same prediction despite the strongest possible perturbation (see the sketch after this list).
8) Prompt Tuning with VAT. We apply the original VAT (Miyato et al., 2018) to generate perturbations that maximally alter model predictions on the source domain,

δ* = argmax_{∥δ∥≤ϵ} ℓ_KL(δ, p, x_s),    (12)

and optimize p as in Equation 10. This can be seen as an ablation of OPTIMA, as Equation 12 omits the ℓ_adv term from Equation 6.

9) Prompt Tuning with DANN. We implement the Domain-adversarial Neural Network (DANN) (Ganin et al., 2016), a popular UDA method, for prompt tuning. DANN introduces a domain discrimination loss,

L_DD = −E_{x_s∈D_s} [log z(x_s)] − E_{x_t∈D_t} [log(1 − z(x_t))],    (13)

where z is the output of a domain discrimination network. The soft prompt p optimizes for the source-domain cross-entropy loss and the negative domain discrimination loss:

p* = argmin_p E_{(x_s,y_s)∈D_s} ℓ_xe(x_s, y_s, p) − L_DD.    (14)

For fair comparison, we use the same architecture for the domain discriminator as OPTIMA. Note that in DANN, the gradients from the domain discrimination loss are backpropagated to the soft prompts, while in OPTIMA such gradients are backpropagated to the perturbations (see the sketch after this list).
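For concreteness, the sketches below contrast the FreeLB perturbation objective (baseline #7) and DANN's gradient-reversal coupling (baseline #9) with OPTIMA, in the same hypothetical setting as the earlier snippets (f, find_delta, disc). The feature helper hidden is an assumption of this sketch, and the FreeLB version omits FreeLB's gradient accumulation across ascent steps.

import torch
import torch.nn.functional as F

def freelb_delta(x_s, y_s, p, plm, eps, K, eta_delta):
    """FreeLB-style delta (Eq. 11): ascend on the supervised loss, not l_KL + l_adv."""
    delta = torch.zeros_like(x_s)
    for _ in range(K):
        delta.requires_grad_(True)
        loss = F.cross_entropy(f(x_s + delta, p, plm), y_s)
        g = torch.autograd.grad(loss, delta)[0]
        with torch.no_grad():
            delta = delta + eta_delta * g / (g.norm() + 1e-12)
            delta = delta * (eps / (delta.norm() + 1e-12)).clamp(max=1.0)
    return delta.detach()

class GradReverse(torch.autograd.Function):
    """DANN's gradient-reversal layer: identity forward, flipped gradient backward."""
    @staticmethod
    def forward(ctx, x):
        return x
    @staticmethod
    def backward(ctx, grad_out):
        return -grad_out

def dann_loss(x_s, y_s, x_t, p, plm, disc):
    """Eq. 14: task loss minus domain loss; domain gradients reach the prompt p."""
    task = F.cross_entropy(f(x_s, p, plm), y_s)
    # hidden() is a hypothetical helper returning prompt-conditioned features.
    feats = torch.cat([hidden(x_s, p, plm), hidden(x_t, p, plm)])
    labels = torch.cat([torch.ones(len(x_s)), torch.zeros(len(x_t))])
    dom = F.binary_cross_entropy(disc(GradReverse.apply(feats)).squeeze(-1), labels)
    return task + dom  # minimizing this maximizes domain confusion via the reversal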

Experiment Settings
Pretraining. For all methods that utilize source-domain data, we train the soft prompts using the whole source-domain training set and perform model selection using the source-domain validation set. When domain adaptation is applied, we additionally use the entire target-domain training set for training, with all labels removed. To mitigate variance, we train each method using 3 different random seeds, yielding three different models. For zero-shot evaluation, we report the mean score and standard deviation of the three models.

Few-shot Evaluation. Following Gao et al. (2021), we sample the few-shot training set and validation set from the original target training set. Each set contains 8 data points per class. We evaluate the trained model on the original target validation set. To mitigate the high variance of few-shot learning, we repeat the sampling 16 times and report the average of 48 runs (16 samples × 3 models). More details can be found in Appendix A.

Model Settings. For all the experiments, unless specified, we use the LM-adapted version of T5-large as the PLM. Results in Lester et al. (2021) (Figure 3) show that T5 further trained for LM adaptation works best for prompt tuning; this version is also adopted by Gu et al. (2022) and Vu et al. (2022). For the domain discriminator, we use a linear classification layer whose input dimension is 1024, the dimension of the output hidden states from the decoder of the T5-large model.

Soft and Hard Prompts. Following Lester et al. (2021) and Gu et al. (2022), for all methods other than PPT, we set the soft prompt length to 100, initialized to the first 100 alphabetic token embeddings of T5. We combine soft prompts with hard prompts, with details in Appendix A.

Evaluation Metrics. Following Lester et al. (2021), we use accuracy and F1 score to evaluate performance on the MRPC and QQP datasets. Following Gu et al. (2022), we use accuracy for NLI. For zero-shot model selection, we use the source-domain validation set. For few-shot model selection, we use the target-domain validation set.
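The few-shot evaluation protocol above can be summarized by a short sketch; train_model and evaluate are hypothetical placeholders for prompt tuning and target-validation scoring.

import random
from collections import defaultdict

def sample_few_shot(dataset, shots=8, seed=0):
    """dataset: list of (text, label) pairs; returns `shots` examples per class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for example in dataset:
        by_class[example[1]].append(example)
    return [ex for exs in by_class.values() for ex in rng.sample(exs, shots)]

# 16 resampled splits x 3 pretraining seeds = 48 runs per reported number.
# train_model/evaluate are hypothetical placeholders.
scores = [evaluate(train_model(sample_few_shot(train_set, seed=s), model_seed=m))
          for s in range(16) for m in range(3)]
mean_score = sum(scores) / len(scores)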

Few-shot Performance
We adopt few-shot classification to evaluate the representations learned by different models and pretraining methods. We show the few-shot performance in Table 3 and make the following observations. First, OPTIMA significantly outperforms all baseline models across all the few-shot test cases, including the state-of-the-art SPOT baseline. We perform statistical significance tests that compare OPTIMA to all baselines in a pairwise manner. In all but the SICK experiments, the differences between OPTIMA and all baselines are statistically significant. We attribute this performance to the high-quality representations OPTIMA learns through domain adaptation.
Second, DANN performs much worse than the perturbation-based methods. As discussed earlier, we suspect the poor performance of DANN is partially due to the limited capacity of prompts (102K parameters in our case). In OPTIMA, the perturbation optimizes for domain invariance (Eq. 6), whereas the prompt optimizes for only the task loss (Eq. 10), which simplifies optimization for soft prompts.
Third, OPTIMA outperforms the VAT baseline, especially in the NLI tasks, where the performance difference ranges from 1.2% in MNLI→SNLI to 5.8% in SNLI→CB.The VAT baseline is an ablation of OPTIMA and omits the targeted regularization term when finding the perturbation.This comparison demonstrates the effectiveness of the proposed targeted smoothness constraint.
Finally, our experiments are consistent with earlier results of Gu et al. (2022), which show that prompt tuning (PT) suffers from high variance in its results. In the single-domain experiments, finetuning the entire T5-large (FT) exhibits comparable, if not lower, variance than PT, even though FT updates about 7500× more parameters. This underscores the importance of using pretrained prompts from a source domain. Indeed, all transfer learning methods utilizing a source domain similar to the target (SPOT, FreeLB, VAT, and OPTIMA) yield sizable performance gains over single-domain methods. Notably, FreeLB, VAT, and OPTIMA are consistently better than SPOT across the benchmarks, which underscores the importance of alleviating overfitting to the source-domain datasets.

Sample Efficiency. We perform an additional experiment where we increase the number of available samples per class from the target domain and show the results in Figure 3. We observe that 4-shot OPTIMA achieves performance comparable to full-model finetuning on a 128-shot dataset. Similarly, 8-shot OPTIMA achieves an accuracy comparable to 64-shot SPOT. These results clearly demonstrate the superior sample efficiency of OPTIMA.

Zero-shot Performance
Zero-shot performance on the target domain is also an effective way to evaluate the learned representations. We show the zero-shot performance in Table 4 and make the following observations. First, OPTIMA still takes the top spot in all target domains, outperforming the second-best baseline by up to 4.1%. In the source domain, OPTIMA is comparable with the baselines. Second, the ablation baseline, VAT, is consistently surpassed by OPTIMA, which again confirms the utility of our proposal. Third, the state-of-the-art method, SPOT, in the majority of cases produces results with higher variance than the three perturbation-based methods. This suggests that adversarial perturbation is effective against overfitting. Lastly, except in the MNLI→SICK task, DANN performs rather poorly across the benchmarks, indicating that DANN is not well suited for prompt tuning.

Class Similarity and Transfer Learning
We investigate the relationship between domain similarity and transfer learning performance. Due to space constraints, we present the results with CB as the target domain and leave additional content to the Appendix. CB is a difficult target: on SNLI, all models in Table 4 achieve in-domain test accuracy greater than 88%, but zero-shot SNLI-to-CB transfer obtains an accuracy of around 47%. This is disappointing given that even the Frozen PLM achieves 55.4% on CB.
To investigate the underlying cause, we plot the TF-IDF textual similarities between the classes of different domains in Figure 4. We compare SPOT, which performs direct transfer without any smoothness regularization, against OPTIMA in the form of confusion matrices in Figure 5 and F1 scores in Figure 6.
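A minimal sketch of this class-level similarity computation, assuming scikit-learn and treating all examples of a class as one document, as in Figure 4:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def class_similarity(classes_a, classes_b):
    """classes_a/classes_b: dicts mapping class name -> list of example texts."""
    docs = [" ".join(texts) for texts in classes_a.values()] \
         + [" ".join(texts) for texts in classes_b.values()]
    tfidf = TfidfVectorizer().fit_transform(docs)   # one row per class document
    n = len(classes_a)
    return cosine_similarity(tfidf[:n], tfidf[n:])  # (|Y_a|, |Y_b|) matrix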
Figure 4(a) shows irregular similarities between the classes of SNLI and CB, which explains the difficulty of transfer learning. For example, the SNLI Neutral class is more similar to the CB Yes class than to the CB Neutral class. The CB Neutral class has low similarity to all SNLI classes. This leads to significant confusion for the few-shot SPOT classifier.

Related Work
Few-shot Learning with PLMs. The traditional approach to few-shot learning is fine-tuning, where a PLM and a task-specific head are tuned together for the task at hand (Zhang et al., 2021; Chen et al., 2020; Das et al., 2022). However, finetuning incurs high memory consumption as the scale of PLMs increases. To better exploit large frozen PLMs, prompt-based methods wrap test examples in a cloze-question format for the PLM to make predictions and have demonstrated excellent few-shot performance on a range of datasets with GPT-3 (Brown et al., 2020). Prompts are also shown to boost finetuning in LM-BFF (Gao et al., 2021), PET (Schick and Schütze, 2021a,b), and PERFECT (Rabeeh et al., 2022).
Transfer Learning for Prompt Tuning. Soft prompt tuning methods (Lester et al., 2021; Li and Liang, 2021; Liu et al., 2022; Hambardzumyan et al., 2021) learn prompts from data and achieve performance comparable to full-model tuning when the PLMs are large enough. SPOT (Vu et al., 2022) pretrains soft prompts on a set of source-domain datasets and then uses the trained soft prompts to boost prompt tuning on target domains. PPT (Gu et al., 2022) introduces unsupervised tasks, such as next sentence prediction, as the pre-text task for prompt pretraining. After that, the soft prompts are finetuned on the few-shot target-domain data.

Domain Adaptation. Domain adaptation methods can be divided into supervised domain adaptation (Zhou et al., 2019) and unsupervised domain adaptation (UDA) (Wang et al., 2019; Long et al., 2022), depending on whether the target-domain data are labeled or unlabeled. Domain adaptation has been used in various applications such as sentiment analysis (Glorot et al., 2011; Dai et al., 2020; Ghosal et al., 2020), machine translation (Chu et al., 2017), reading comprehension (Wang et al., 2019), and others (Shah et al., 2018; Naik and Rose, 2020). For a complete survey of UDA in NLP, we refer readers to Ramponi and Plank (2020). In this paper, we do not induce domain-invariant soft prompts but encourage the learned adversarial perturbations to fill in the domain gap, focusing on smoothing the decision boundary where source-domain and target-domain data are similar.

Conclusions
In this paper, we propose OPTIMA to enhance soft prompt transfer by regularizing the training on the source domain with perturbations generated via domain adaptation. We extensively evaluate the proposed method. Compared to competitive baselines, soft prompts trained with OPTIMA generalize better and significantly boost zero-shot and few-shot learning in the target domain. We observe that pretraining soft prompts on a similar dataset confers more benefits than pretraining on a dissimilar dataset. We expect the current work to contribute to the wide deployment of PLMs.

Limitations
We identify a few limitations of the current work.
• The domain adaptation problem formulation requires unlabeled data from the target domain. Although unlabeled data are easy to obtain in most cases, collecting them might be difficult for some data-scarce domains.
• The proposed regularization technique addresses the situation where the source and target domains have different data distributions. When the two distributions are exactly the same, the technique degenerates to plain adversarial training. When the two distributions are extremely dissimilar, the transfer is unlikely to yield performance improvements. A unified framework that automatically detects domain distance and applies the appropriate method may be desirable.
• Perturbation-based regularization has the most effect in the few-shot and zero-shot settings. When the target domain has abundant labeled data, the gap between plain soft prompt tuning and our method will likely diminish.

Figure 2: Intuition about perturbation and smoothness. Under the zigzag (non-smooth) decision boundary, a small perturbation with a well-chosen direction is sufficient to flip the predicted class. The smooth boundary requires a larger perturbation.
Algorithm 1: The OPTIMA training procedure. Input: a labeled source-domain dataset D_s = {(x_s^(i), y_s^(i))}_{i=1}^{N_s}, an unlabeled target-domain dataset D_t = {x_t^(j)}_{j=1}^{N_t}, perturbation ball radius ϵ, ascent steps K, and step size η_δ. Initialize: soft prompt embeddings p and domain discriminator θ_d, learning rates η_p, η_d. Repeat: sample a pair of mini-batches, each of B data points, from D_s and D_t; run K steps of projected gradient ascent to obtain the perturbations; update p and θ_d.


Figure 3: Average test performance on MRPC across 16 runs. PT and FT are trained on MRPC directly. The rest use the soft prompt pretrained under the QQP→MRPC setting as initialization. OPTIMA exhibits the best performance across different few-shot settings.

Figure 4: TF-IDF similarity for SNLI, MNLI, and CB, where we treat all text in one class as a document.

Figure 5: Confusion matrices for 8-shot transfer learning to CB. Each result is the average test accuracy across 16 runs. (a) and (c) refer to the SNLI→CB setting, while (b) and (d) refer to the MNLI→CB setting, respectively.

Figure 7: Document similarity between the classes of the MRPC and QQP datasets.

Figure 8: F-score on three classes for the NLI datasets. SPOT_0 and OPTIMA_0 are compared for their zero-shot performance; SPOT_8 and OPTIMA_8 are compared for their 8-shot performance.

Figure 9: Document similarity using TF-IDF for each pair of NLI datasets.

Table 2: The set of domain adaptation experiments.

Table 3: Few-shot test performance. Results in bold are the best overall; underlined results are the best in the single-domain group. Results marked with * are significantly better than all the others under the Student's t-test (p < 0.05).

Table 4: Source-domain and zero-shot target-domain test performance.

Table 7: Confusion matrix for the zero-shot performance of SPOT on each class of CB. Results are in %. Bold indicates the most frequently predicted label for each class of CB.

Table 8: Confusion matrix for the zero-shot performance of OPTIMA on each class of CB.

Table 9: Confusion matrix for the few-shot performance of SPOT on each class of CB.

Table 10: Confusion matrix for the few-shot performance of OPTIMA on each class of CB.