Is Continuous Prompt a Combination of Discrete Prompts? Towards a Novel View for Interpreting Continuous Prompts



Introduction
Continuous prompts for pre-trained language models (PLMs) have shown remarkable performance on almost every NLP task (Li and Liang, 2021; Lester et al., 2021; Liu et al., 2021b). However, trained continuous prompts tend to improve performance at the sacrifice of interpretability and transferability relative to discrete prompts (Liu et al., 2021a), which undermines user trust and makes cross-model transfer challenging.
Recent advancements have spiked interest in understanding how prompts work and have uncovered the counterintuitive mechanisms behind them. Webson and Pavlick (2022) conducted numerous experiments on various discrete prompts, finding that the improvement on downstream tasks does not originate from the model understanding task instructions in a manner similar to how humans use them. Kavumba et al. (2022) presented the first investigation of the exploitation of superficial cues by prompt-based models, finding that such cues exist and are exploited. Continuous prompts, on the other hand, are more complicated and harder to comprehend. A recent attempt at interpreting continuous prompts came from Khashabi et al. (2022), who introduced the Prompt Waywardness Hypothesis to show the infeasibility of interpreting a learned continuous prompt with a single discrete prompt. To the best of our knowledge, no general post-hoc interpretable framework has been proposed to translate continuous prompts into a comprehensible form.

Figure 1: Interpreting continuous prompts for sentiment classification. Each continuous prompt (p_i) can be regarded as a combination of discrete prompts (r_i), which reflects the tokens utilized by continuous prompts in prompting the PLM to output expected labels.
Towards filling this research gap, we propose the Combination Hypothesis, which argues for the feasibility of utilizing combinations of discrete prompts as faithful interpretations of continuous prompts (§3.2). In other words, we treat the continuous prompt as an embedding lookup table with the one-hot restriction removed. For instance, a well-trained continuous prompt for sentiment classification should contain task-related tokens such as "drama" or auxiliary tokens such as "seem" or "look" to stimulate the PLM toward desired outputs (Fig. 1). To find an effective interpretation, a joint optimization framework is proposed to ensure both prompt fidelity and downstream fidelity (§3.3).
Comprehensive experiments are conducted to support our hypothesis and framework. We first directly optimize the parameters of a combination of discrete prompts to replace continuous prompts. Results show that the combination of discrete prompts achieves competitive performance in most scenarios (especially in few-shot learning), which verifies the feasibility of the Combination Hypothesis in practice (§5).
As a significant property of interpretations, faithfulness is comprehensively verified, checking how accurately an interpretation reflects the true reasoning process of the model (Jacovi and Goldberg, 2020). We first verify the prompt fidelity and downstream fidelity of the interpretations, using both discrete prompts and continuous prompts as the content to be interpreted (§6.1); then we verify that the tokens selected from the interpretations can better restore the performance of the source prompts on downstream tasks (§6.2).
Beyond faithfulness, a high-quality interpretation should also be plausible, i.e., convincing to humans (Jacovi and Goldberg, 2020). Through a visual comparison with the nearest tokens to continuous prompts (§7.1), our interpretations are shown to be more convincing and allow us to identify several "shortcuts" in the model's decision-making (§7.2). Furthermore, inspired by the readability and transferability of discrete prompts, we investigate the feasibility of cross-model transfer for continuous prompts using our interpretations. We argue this is a breakthrough, since no previous work achieves cross-model transfer of continuous prompts without any training signals on target PLMs. Experiments show that even continuous prompts trained on a simply structured PLM in 100-shot settings can be transferred to large PLMs using our method and achieve competitive performance (§8).

Related Work
Prompt Engineering. Prompt engineering, as a crucial part of prompt learning, is the process of creating a prompt function that performs effectively on the downstream task (Liu et al., 2021a). It can generally be divided into discrete prompts and continuous prompts.
Discrete prompts usually search for templates, i.e., natural language tokens in discrete spaces, as prompt functions. One line of work focuses on manually designed prompts (Petroni et al., 2019; Brown et al., 2020; Scao and Rush, 2021). These methods rely excessively on prior knowledge, while even experts have difficulty finding optimal templates (Jiang et al., 2020). Therefore, recent explorations have devoted much attention to automatically searching for templates in discrete spaces (Jiang et al., 2020; Shin et al., 2020; Gao et al., 2021; Haviv et al., 2021).
Continuous prompts, on the other hand, relax the constraint that templates are natural language tokens (Li and Liang, 2021; Liu et al., 2021b; Lester et al., 2021; Zhong et al., 2021; Qin and Eisner, 2021; Zhang et al., 2022). These works effectively improve performance at the expense of interpretability. Khashabi et al. (2022) demonstrated the disconnection between continuous prompts and discrete prompts. In this paper, we investigate the feasibility of using discrete prompts to interpret continuous prompts from a novel view.
Cross-model Transfer. Benefiting from the readability of discrete prompts, we can easily transfer manually designed prompts to any PLM (Perez et al., 2021). Nonetheless, since the embedding dimensions and semantic spaces of different PLMs are inconsistent, cross-model transfer of continuous prompts is tricky. Su et al. (2022) made the first attempt with prompt projectors, which are trained on another task to project continuous prompts into the semantic space of target PLMs. As a post-hoc interpretable framework, this paper investigates the feasibility of cross-model transfer without the help of additional task data.
Each continuous prompt p_i is interpreted by a vector r_i ∈ R^v, which decouples p_i into v discrete prompts (Fig. 1).
In this paper, we are interested in generating an interpretation R with both faithfulness and plausibility (Jacovi and Goldberg, 2020). In addition, as a side effect of the interpretation, we also expect to utilize the results for cross-model transfer of continuous prompts.

The Combination Hypothesis
Continuous prompts are essentially trained on a large corpus of natural language. These incomprehensible prompts occupy the place of discrete prompts, which are composed of natural language tokens, but better motivate the PLM to output desired results. Consequently, they are intuitively more likely to be associated with natural language tokens than to be isolated from them.
Considering the infeasibility of a one-to-one mapping (Khashabi et al., 2022), we propose the idea that a continuous prompt may be a combination of multiple discrete prompts. It is known that the essence of a discrete prompt e(x) is a function of the token x, parameterized by a one-hot embedding lookup table (Li et al., 2020a). If the one-hot restriction is removed, the continuous prompt can be seen as the output of a fully connected layer with all discrete prompts as input. We formalize this idea as the following hypothesis.
Hypothesis 1 (Combination Hypothesis). For any continuous prompt p ∈ R^d and discrete prompt matrix E ∈ R^{v×d} of a large pre-trained model, there exists a vector r ∈ R^v such that dist(r⊤E, p⊤) ≤ Δ, where dist(·) is the Euclidean distance function and Δ is the shortest distance to p among all discrete prompts.
In fact, it can almost be proved that the linear equation r⊤E = p⊤ has infinitely many solutions. For general PLMs, v ≫ d always holds (e.g., v = 30522, d = 768 in the BERT base model (Devlin et al., 2019)). Thus, in most cases, R(E⊤) = R(E⊤, p) < v, where R(·) denotes the rank of a matrix.
Nonetheless, although v ≫ d, it is still not guaranteed that these discrete prompts constitute a set of bases of the vector space, which implies that an exact solution may not exist. Thus, we relax the restriction in our hypothesis and only prove the existence of a more faithful interpretation than the nearest discrete prompt. We consider the following two cases.

Figure 2: The case where discrete prompts fail to form a set of bases of the space and the continuous prompt p (red) is not in the linear subspace V they form. We can still find a linear combination of discrete prompts p⊥ (green) such that its distance to p is not greater than the distance from the nearest discrete prompt e_0 (blue) to p.
1. E constitutes a set of bases of the vector space. In this case, every vector in the space can be represented by this set of bases. Therefore, there exists a solution r such that

dist(r⊤E, p⊤) = 0 ≤ Δ. (1)

2. E is not sufficient to constitute a set of bases of the vector space. Let e_0 be the nearest discrete prompt to p, and let V be the linear subspace spanned by E. If p ∈ V, then there exists a linear combination of discrete prompts that satisfies Eq. 1. If p ∉ V (Fig. 2), we take the orthogonal projection of p onto V, denoted p⊥; then

dist(p⊥, p⊤) ≤ dist(e_0⊤, p⊤) = Δ, (2)

since the projection minimizes the distance from p to V and e_0 ∈ V. Because p⊥ lies in the linear subspace V, it can be represented as a linear combination of discrete prompts. Therefore, the hypothesis also holds in this case, which implies the existence of a more faithful interpretation than the nearest discrete prompt.
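The projection argument above can be checked numerically. The sketch below uses illustrative random data (all shapes and values are our own, not from the paper) and verifies that the least-squares combination of discrete prompts is at least as close to p as the nearest single prompt e_0:

```python
# Numerical check of Hypothesis 1: a least-squares combination of the
# rows of E is never farther from p than the nearest single row.
import numpy as np

rng = np.random.default_rng(0)
v, d = 5, 10                   # deliberately v < d so rows of E cannot span R^d
E = rng.normal(size=(v, d))    # "discrete prompt" embedding rows
p = rng.normal(size=d)         # a "continuous prompt"

# Least squares minimizes ||r^T E - p^T||_2, i.e. it projects p onto the
# row space of E.
r, *_ = np.linalg.lstsq(E.T, p, rcond=None)
proj_dist = np.linalg.norm(r @ E - p)

# Delta: distance from p to the nearest single discrete prompt e_0
# (the one-hot choice of r is a feasible point of the same problem).
nearest_dist = np.linalg.norm(E - p, axis=1).min()

assert proj_dist <= nearest_dist + 1e-9
```

Because a one-hot r recovers exactly the nearest discrete prompt, the optimal r can only do as well or better, which is the content of the relaxed hypothesis.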
At first glance, simply summing rather than concatenating prompts may not seem sensible. Suppose we have two input vectors x_1, x_2 ∈ R^d and their concatenation x_concat = [x_1; x_2] ∈ R^{2d}. Then we apply a linear embedding projection to x_concat:

W x_concat = W_1 x_1 + W_2 x_2, (3)

where W_1 ∈ R^{d×d}, W_2 ∈ R^{d×d}, and W = [W_1, W_2] ∈ R^{d×2d} are parameters of the linear projection. This indicates that summing is, in a sense, equivalent to concatenating, which also supports the rationality of decoupling continuous prompts into discrete prompts.
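The summing-vs-concatenating identity is easy to verify numerically; the snippet below (with made-up shapes and random data) checks that a linear projection of the concatenation equals the sum of per-block projections:

```python
# Verify W @ [x1; x2] == W1 @ x1 + W2 @ x2 when W = [W1, W2].
import numpy as np

rng = np.random.default_rng(1)
d = 4
x1, x2 = rng.normal(size=d), rng.normal(size=d)
W1, W2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))

W = np.concatenate([W1, W2], axis=1)   # W in R^{d x 2d}
x_concat = np.concatenate([x1, x2])    # in R^{2d}

lhs = W @ x_concat          # projection of the concatenation
rhs = W1 @ x1 + W2 @ x2     # sum of the individual projections
assert np.allclose(lhs, rhs)
```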

Finding Interpretations
The hypothesis indicates the existence of R, but it does not address how to find a solution that best represents the continuous prompt. In this section, we first introduce an optimization method to find interpretations that both satisfy the hypothesis and ensure downstream fidelity; we then reduce the vocabulary size by traversing the dataset, thus speeding up the optimization.
Our post-hoc interpretable framework is similar to probes, which focus on simple linguistic properties of interest (Conneau et al., 2018). Therefore, following the view of Hewitt and Liang (2019), a simple model with only one linear layer is designed in this paper for interpreting continuous prompts. Since negative weights can be confusing or controversial, the softplus activation function (Dugas et al., 2000) is applied to the output layer.
To satisfy the Combination Hypothesis, we minimize the distance between the continuous prompt and the combination of discrete prompts:

ℓ_1(r) = dist(r⊤E, p⊤). (4)

The loss above is not sufficient to find the most reasonable solution. As a consequence, we introduce the following loss function to ensure downstream fidelity:

ℓ_2(r) = Σ_{i=1}^{v} r_i · D_KL(M(e_i ⊕ x) ∥ M(p ⊕ x)), (5)

where D_KL(·) is the Kullback-Leibler divergence and M(·) is the output of the PLM. This loss function helps to find a more meaningful combination, i.e., discrete prompts with larger weights should have outputs on downstream tasks that are as consistent as possible with the continuous prompt.
We learn the interpretation r by jointly minimizing the loss ℓ_1(·) for the Combination Hypothesis (Eq. 4) and the loss ℓ_2(·) for downstream fidelity (Eq. 5):

ℓ(r) = ℓ_1(r) + γ ℓ_2(r), (6)

where γ is a hyperparameter. In this paper, we find γ = 0.09 to achieve a reasonable trade-off between prompt fidelity and downstream fidelity (see §9). Nonetheless, optimization is time-consuming, since the second loss requires traversing the vocabulary of the PLM. As a post-hoc interpretation, we argue that the decoupling result r should be sparse, i.e., most entries should be 0. On the one hand, a dense interpretation is incomprehensible; on the other hand, an effective prompt that motivates the PLM toward desired outputs should not carry much useless token information.
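A minimal sketch of the joint objective (Eqs. 4-6) is given below. The PLM call is stubbed out with fixed label distributions, and the function and variable names (joint_loss, plm_out_discrete, etc.) are illustrative, not from the paper's code:

```python
# Sketch of the joint loss l = l1 + gamma * l2, with softplus keeping
# the combination weights non-negative (stubbed data, no real PLM).
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def kl(p, q, eps=1e-12):
    # KL divergence between two discrete label distributions.
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def joint_loss(theta, E, p_cont, plm_out_discrete, plm_out_cont, gamma=0.09):
    r = softplus(theta)                                   # non-negative weights
    l1 = np.linalg.norm(r @ E - p_cont)                   # Eq. 4: prompt fidelity
    # Eq. 5: downstream fidelity, weighting each discrete prompt's KL
    # divergence from the continuous prompt's output by its coefficient.
    l2 = sum(r_i * kl(out_i, plm_out_cont)
             for r_i, out_i in zip(r, plm_out_discrete))
    return l1 + gamma * l2                                # Eq. 6

rng = np.random.default_rng(0)
v, d, c = 6, 8, 2                          # toy vocab size, dim, num classes
E = rng.normal(size=(v, d))
p_cont = rng.normal(size=d)
outs = rng.dirichlet(np.ones(c), size=v)   # stub PLM outputs per discrete prompt
out_c = rng.dirichlet(np.ones(c))          # stub PLM output for continuous prompt
loss = joint_loss(rng.normal(size=v), E, p_cont, outs, out_c)
assert loss > 0
```

In practice this objective would be minimized with a gradient-based optimizer over theta, with the PLM frozen; the sketch only evaluates it once.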
We propose a simple method that traverses the full downstream dataset and selects the v tokens with the highest frequency into our new vocabulary, since it is intuitive that critical tokens contained in continuous prompts tend to appear in the training dataset. Moreover, since the parameters of the PLM are fixed, M(e ⊕ x) is invariant across epochs. Thus, for a given discrete prompt e and sample x, we only need to compute the output once, which further speeds up training.
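The vocabulary-reduction step can be sketched as follows; whitespace tokenization stands in for the PLM's tokenizer, and the corpus is a made-up example:

```python
# Keep only the v most frequent tokens seen in the downstream training set.
from collections import Counter

def reduce_vocab(texts, v):
    counts = Counter(tok for t in texts for tok in t.lower().split())
    return [tok for tok, _ in counts.most_common(v)]

corpus = [
    "a great drama , moving and great",
    "a terrible film , terrible acting",
    "great cast , great script",
]
vocab = reduce_vocab(corpus, 3)
# "great" appears most often, so it survives the cut.
assert vocab[0] == "great"
assert len(vocab) == 3
```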
Studying P-tuning: Experimental Setup

Model and Training Details

P-tuning (Liu et al., 2021b), a typical representative of continuous prompts, is used in this paper to study our proposed framework. For PLMs, we use the base version of BERT (Devlin et al., 2019), which is broadly adopted in the NLP field.
We freeze the parameters of BERT; the continuous prompt tokens in the template are the only trainable parameters, encoded by a two-layer LSTM (Hochreiter and Schmidhuber, 1997) head. We use a batch size of 8, an initial learning rate of 0.00001, the AdamW optimizer (Loshchilov and Hutter, 2019), and 15 training epochs for P-tuning; and an initial learning rate of 0.01, an L1 loss coefficient of 0.01, and 4000 steps for training our interpretations, with early stopping based on the validation set. Unless otherwise stated, all experiments are conducted in the 100-shot scenario.

Studied Datasets
Detailed experiments are conducted on the following 4 classification datasets: SST-2 (Socher et al., 2013), IMDB (Maas et al., 2011), Amazon Review Polarity (McAuley and Leskovec, 2013), and AGNews (Zhang et al., 2015). Statistics and target tokens for each dataset are attached in Appendices A and B. For all these datasets, test set accuracy is reported as our evaluation metric.

Hypothesis Verification
The Combination Hypothesis argues for the existence of combinations of discrete prompts in fairly small neighborhoods as an alternative to continuous prompts. Therefore, it should also be feasible to train combinations of discrete prompts directly for downstream tasks. The training objective is quantified as follows:

ℓ(r) = loss(M(r⊤E ⊕ x), y), (7)

where loss(·) is the loss function on the downstream task. We then minimize this loss function to obtain a replacement for continuous prompts. The optimized performance is provided in Table 1. Our method performs competitively, especially in few-shot scenarios. Furthermore, we find that v = 1500 is sufficient for the model to obtain good performance, while a larger vocabulary size is more likely to introduce noisy tokens, which is not conducive to optimization. Therefore, we set v = 1500 in the following research. Note that since the designed structure itself is difficult to optimize, we set the learning rate to 0.3 when training in few-shot scenarios and 0.1 when training on full datasets. Besides, an L1 loss with a coefficient of 0.01 is added. This method does not aim to fully surpass P-tuning, but to verify the feasibility of the hypothesis that a continuous prompt can be replaced by a full connection of discrete prompts without loss of precision, while also providing methods for the faithfulness verification in §6. As an approximate alternative to continuous prompts, some loss of accuracy is unavoidable. For example, P-tuning is able to accurately capture the simple connection between features and labels on full datasets, while this is more difficult for our method.
Table 2: Performance of prompt fidelity and downstream fidelity on discrete prompts (SST-2). For prompt fidelity, the percentage (%) of corresponding tokens in the interpretations is reported (left). For downstream fidelity, comparisons of accuracy (%) between continuous prompts p and our interpretations r⊤E on downstream tasks are reported (right).
Faithfulness Verification

Do the Interpretations Faithfully Reflect the Source Prompts?
In this section, we verify the prompt fidelity and downstream fidelity of the interpretations, i.e., the proximity of the weighted discrete prompts to the source prompts and the similarity of their performance on downstream tasks.
To obtain ground-truth labels, we first design three manual discrete templates on SST-2 and interpret them. The performance of prompt fidelity and downstream fidelity is shown in Table 2, where initial capitalization and plural forms are ignored. Most tokens account for more than 20% among the 1500 tokens. We consider this a fairly high value, and the synonyms of the original tokens also achieve high values. However, several tokens like "exactly", "drama" and "cat" still achieve low values. For tokens like "exactly" and "drama", the interpretations discover their synonyms and give them an extremely high percentage (>20%), such as "completely" for "exactly" and "film" for "drama". For tokens like "cat", since they do not help with downstream tasks, the model can only attempt to optimize the first objective (Eq. 4), leading to a jumbled interpretation. As for downstream fidelity, the performance of the interpretations is similar to the source prompts in all 3 sets of experiments.

Furthermore, we verify the fidelity of the interpretations to continuous prompts. Performance of prompt fidelity and downstream fidelity is shown in Table 3 and Table 4, respectively:

            SST-2   IMDB    Amazon  AGNews
Nearest-1   0.0026  0.0030  0.0036  0.0047
Nearest-2   0.0027  0.0030  0.0037  0.0048
Ours        0.0025  0.0027  0.0032  0.0043

For comparison, the two nearest tokens in the Euclidean space are selected as interpretations for continuous prompts. Across all tasks, the distance of our results from the source prompts is smaller than that of the nearest discrete token, indicating that our method has higher fidelity in restoring source prompts. Moreover, simply taking the two nearest discrete tokens as a replacement for continuous prompts performs quite poorly on downstream tasks, even similar to random predictions in most cases, while our method achieves performance comparable to the source prompts. In summary, our interpretations consistently maintain higher fidelity than the only existing method (selecting the nearest discrete prompts) and reflect the decision process of the source prompts well.

How Reductive are the Interpretations on Downstream Tasks?
As described in §3.3, the interpretations are intended to be sparse, which means that the top few tokens of an interpretation are supposed to contain the majority of the information from the source prompt. In this section, we select the top five tokens of the interpretations as the vocabulary and train the weighting of these tokens using the optimization method in §5. A comparison with baselines is shown in Table 5. In all scenarios, the tokens selected by our interpretations are more reductive than randomly selected tokens and the tokens nearest to the continuous prompts, implying that these tokens do contain more task-relevant information from the source continuous prompts. Moreover, for a more visual demonstration of the ability of the selected five tokens to restore performance, we show the test set accuracy of several baselines under different training scenarios, including Manually, LM-BFF (Gao et al., 2021), and P-tuning. For Manually, we report the best performance among the five manually designed templates (see Appendix E). For LM-BFF, we only use it to automatically generate templates, without changing target tokens or additional fine-tuning. The five tokens selected by our method outperform the templates selected by Manually and LM-BFF in all cases, and are even comparable to P-tuning in few-shot scenarios, while random selection and nearest-neighbor selection are not. This further shows that our selected tokens are reliable and faithful.
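Selecting the top tokens from a learned weighting can be sketched as follows (toy values; the token list and function name are illustrative):

```python
# Pick the k tokens with the largest interpretation weights r.
def top_k_tokens(r, vocab, k=5):
    order = sorted(range(len(r)), key=lambda i: r[i], reverse=True)
    return [vocab[i] for i in order[:k]]

vocab = ["great", "cat", "film", "seem", "terrible", "the", "quality"]
r = [0.9, 0.01, 0.7, 0.4, 0.8, 0.02, 0.5]
assert top_k_tokens(r, vocab) == ["great", "terrible", "film", "quality", "seem"]
```

The selected tokens would then form the reduced vocabulary whose weighting is retrained as in §5.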

What do the Interpretations Look Like?
Still taking the 100-shot scenario as an example, we show our interpretations on different tasks in Table 6. For each prompt, the five tokens with the largest values are selected for display. As a comparison, the five nearest tokens to each prompt in the Euclidean space are also displayed.
As can be seen, our interpretations better reflect the decision-making of continuous prompts and output meaningful tokens compared to the Nearest baseline. For example, the continuous prompts on SST-2 induce the PLM to determine how great or terrible something in the input is, while prompts on IMDB and Amazon induce the PLM to judge how well someone thinks of something.
To our surprise, the interpretations contain a large number of task-independent tokens that also induce the PLM to output the desired target tokens. For example, the interpretations on SST-2 contain tokens like "taste", "material" and "quality". These tokens are irrelevant to movie review sentiment classification, but can prompt the PLM to output the target tokens "terrible" or "great". We consider that continuous prompts may sneak in shortcuts (Geirhos et al., 2020) during training, which is briefly verified in §7.2.
Nonetheless, there still remain several noisy tokens that are hard for humans to understand, especially on AGNews. These tokens seem irrelevant to the downstream task, and it is difficult to spot potential shortcuts. We believe there are two reasons for this phenomenon. On the one hand, the tokens utilized by prompts are overcrowded in the semantic space, leading to the replacement of the interpreted tokens by irrelevant ones. On the other hand, the high complexity of the downstream task makes the optimization of the interpretations more difficult. Future work will be conducted along these two directions.

Do Continuous Prompts Contain Shortcuts?
As shown in Table 6, our interpretations reveal the possibility of continuous prompts using shortcuts, which perform well on benchmarks but may fail to transfer to anomalous test sets (Geirhos et al., 2020). Taking the interpretation on SST-2 as an example, it contains unexpected tokens like "something", "taste", etc. that induce the PLM toward the desired target labels "terrible" or "great".
To test whether the model makes use of these shortcuts, we select several task-irrelevant texts containing shortcut tokens as suffixes to be added to the SST-2 test set texts, with sentiment polarity opposite to the ground-truth labels (see Table 7). For example, "The food tastes delicious." is added if the ground-truth label is 0 (terrible), while "The food tastes unpalatable." is added if the ground-truth label is 1 (great). The significantly degraded performance suggests that the model exploits a large number of shortcuts. To our surprise, these shortcuts do not disappear as the training data increases but are exploited even more fully by the model, resulting in an accuracy of almost 0 after training on the full dataset. Obviously, the continuous prompts for SST-2 merely bait the PLM to output the target token terrible/great, without caring whether the input is really a review of a movie or of food, cats, or something else. We present this phenomenon in the hope that it will attract more attention and research in the future.
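The construction of the perturbed test set can be sketched as follows; the suffix texts come from the description above, while the function name and example inputs are ours:

```python
# Append a task-irrelevant suffix whose sentiment polarity opposes the
# gold label, to probe whether the prompt relies on shortcut tokens.
def add_shortcut_suffix(texts, labels,
                        pos_suffix="The food tastes delicious.",
                        neg_suffix="The food tastes unpalatable."):
    # Label 0 (terrible) gets a positive suffix; label 1 (great) a negative one.
    return [f"{t} {pos_suffix if y == 0 else neg_suffix}"
            for t, y in zip(texts, labels)]

perturbed = add_shortcut_suffix(
    ["A dreary movie.", "A wonderful movie."], [0, 1])
assert perturbed[0].endswith("delicious.")
assert perturbed[1].endswith("unpalatable.")
```

If accuracy collapses on the perturbed set, the prompt is reacting to the suffix's sentiment words rather than the movie review itself.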

Cross-Model Transfer
Due to the inconsistent embedding dimensions and semantic spaces of different PLMs, cross-model transfer of continuous prompts is tricky. With our proposed interpretable framework, which establishes connections between continuous and discrete prompts, it becomes feasible to transfer continuous prompts from source PLMs to target PLMs without extra training signals on the target PLMs. Consider transferring continuous prompts of the source PLM M_a to the target PLM M_b: we first obtain the decoupling result r using the method presented in §3.3; the continuous prompts transferred to M_b are then r⊤E_b, where E_b is the discrete prompt matrix of M_b.
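The transfer step itself is a single matrix product; the sketch below uses illustrative shapes (the shared reduced vocabulary from §3.3, with differing embedding dimensions between source and target):

```python
# Re-embed the learned weighting r with the target model's embedding
# matrix E_b; no training signal from the target PLM is needed.
import numpy as np

rng = np.random.default_rng(0)
v, d_b = 1500, 1024                  # shared (reduced) vocab, target dim
r = np.abs(rng.normal(size=v))       # decoupling weights from the source PLM
E_b = rng.normal(size=(v, d_b))      # target PLM embeddings for the same vocab

p_transferred = r @ E_b              # continuous prompt in the target space
assert p_transferred.shape == (d_b,)
```

The key point is that r is indexed by tokens rather than embedding coordinates, so it carries over to any PLM that can embed the same vocabulary, regardless of its hidden dimension.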
Following this idea, we investigate the feasibility of cross-model transfer from BERT_base (Devlin et al., 2019) to BERT_large, RoBERTa_base, and RoBERTa_large (Liu et al., 2019) in Table 8. Considering that in existing studies only discrete templates are capable of cross-model transfer without extra training signals on target PLMs, we choose as baselines: (1) the nearest tokens to continuous prompts; (2) the manually designed templates that perform best on BERT_base; and (3) templates automatically generated by LM-BFF (Gao et al., 2021). For LM-BFF, we automatically generate templates using T5_base (Raffel et al., 2020) in the 100-shot scenario for cross-model transfer. Detailed results for baselines (2) and (3) can be found in Appendix E.
As can be seen, our method outperforms the baselines in most scenarios, especially on tasks like AGNews where it is tricky to construct discrete templates from prior knowledge. This enables zero-shot transfer of continuous prompts across arbitrary models without the restrictions of vector dimensionality and semantic space. For the poor performance on SST-2, we suspect that the continuous prompts learned on BERT_base inherently contain a large number of shortcuts, which may no longer apply after being captured by the interpretations and transferred to larger PLMs. Therefore, the performance of cross-model transfer is affected by the robustness of the source prompts.
If continuous prompts are trained on larger PLMs and datasets, better performance is expected using our interpretations, which could be applied to areas such as model compression.

Further Analysis
Effect of Gamma. We analyze the effect of the hyperparameter γ, i.e., the trade-off between prompt fidelity and downstream fidelity (Eq. 6). Intuitively, as γ increases, prompt fidelity decreases while downstream fidelity goes up. When γ is 0, our method degenerates to using only prompt fidelity as the optimization objective. Fig. 3 shows the results of a grid search using the interpretations described in §3. As expected, the accuracy on BERT_base improves as γ increases, since the interpretations are directly optimized on it. Nonetheless, when γ is larger than 0.09, the performance of the interpretations for cross-model transfer decreases. As a consequence, we choose γ = 0.09 in this paper.

Conclusion
In this paper, we present a novel view that interprets continuous prompts as combinations of discrete prompts. Contrary to the previous perspective, which attempts to discover a one-to-one mapping between continuous prompts and discrete prompts, we treat the continuous prompt as an embedding lookup table with the one-hot restriction removed. Detailed experiments verify that our interpretations faithfully reflect the reasoning of source prompts in terms of both prompt fidelity and downstream fidelity. Furthermore, our interpretations exhibit promising readability and plausibility, which not only provides a tool for understanding model decisions but also offers a chance to discover potential shortcuts contained in the prompts. Finally, with this bridge between continuous prompts and discrete prompts, we analyze the feasibility of cross-model transfer for continuous prompts using the proposed method. Results show that even when trained on a small PLM (BERT_base) in a 100-shot scenario, continuous prompts maintain good performance after being transferred to various large PLMs. We hope that this work will bring a novel view for interpreting continuous prompts and encourage more research into their internal mechanisms.

Ethical Statement
We propose a novel view to interpret continuous prompts, which have been considered "black boxes", as combinations of human-understandable discrete tokens. Since the method itself is unbiased and faithful, and all experiments are conducted on publicly available datasets, we believe that our work does not create any potential ethical risk. Further, we discover shortcuts latent in continuous prompts, implying that systematic biases or discrimination may also exist in them. These biases may originate from training datasets, which are exploited by continuous prompts as shortcuts to the acquisition of true labels, or even from artificially implanted backdoors. We hope this work will provide the possibility of detecting these potential biases in continuous prompts.
Our created artifacts are intended to provide researchers or users with a tool for understanding decision-making and detecting possible unexpected shortcuts of continuous prompts, while at the same time offering the feasibility of cross-model transfer without extra training signals on target PLMs.They are compatible with the original access conditions.All use of existing artifacts is consistent with their intended use in this paper.

B Target Tokens
Manual verbalizers are adopted in this paper. We rank the target tokens by their likelihoods and select the target token with the maximum likelihood as the classification output. The target tokens used for each task are shown in Table 11.

C Usage of Existing Packages
The pre-processing steps and prompt-based methods are all implemented in OpenPrompt (Ding et al., 2022), an open-source framework for deploying prompt learning. Our interpretable method is implemented in PyTorch (Paszke et al., 2019), an open-source framework for deploying deep learning algorithms. For PLMs, we use "bert-base-cased" as the base model; "bert-large-cased", "roberta-base", and "roberta-large" for cross-model transfer; and "T5-base" for generating templates in LM-BFF, all from Huggingface Transformers (Wolf et al., 2020). All licenses of these packages allow normal research use. Identical hyperparameters are adopted regardless of the dataset. Detailed setups for P-tuning and our interpretable method are shown in §4.1.

Dataset   Target Tokens
SST-2     terrible, great
IMDB      bad, good
Amazon    bad, good
AGNews    politics, sports, business, technology

For the LM-BFF baseline, we fix the target tokens and only use T5_base to search for the best discrete template, with 10 training epochs, a learning rate of 0.00001, a batch size of 2, and a beam width of 100.

D Experimental Details
For all the experiments mentioned in this paper, we use 2 NVIDIA GeForce GTX 1080 Ti GPUs with 11 GB of memory each. For training our interpretable framework, an additional linear layer with n × v parameters is introduced besides the source PLM, where n denotes the number of continuous prompts and v denotes the vocabulary size. In this paper, we set n = 3 and v = 1500, which means only 4,500 extra parameters are introduced. Compared to large-scale PLMs such as BERT or RoBERTa, these parameters are almost negligible.

E Performance of Discrete Templates
The performance of the manually designed templates (the first five rows of each table) and the templates generated by LM-BFF (the last row of each table) on each task and PLM is shown in Tables 12-15. For manually designed templates, the best-performing templates on BERT_base are selected as the baseline templates for cross-model transfer.
Problem Formulation. Given a sequence of n continuous prompts P = {p_1, p_2, ..., p_n} trained on the dataset D = {x, y}, we analyze the feasibility of interpreting continuous prompts as a combination of discrete prompts.

Table 1:
Comparison of P-tuning and discrete prompt combinations (Discrete-v) on different tasks, where v is the vocabulary size, k-shot columns are trained in few-shot scenarios with k samples per label, and Full columns are trained on the full datasets.

Table 3:
Performance of prompt fidelity on continuous prompts (average squared distance reported).

Table 5:
Performance on downstream tasks obtained by training combinations of five selected tokens: random selection, the tokens nearest to continuous prompts, and our method.

Table 6:
Intuitive comparison of interpretations of various continuous prompts, where Nearest selects the five nearest tokens to continuous prompts in the Euclidean space and Ours selects the five largest values and their corresponding tokens using our proposed method.

Table 7:
Performance of continuous prompts on the SST-2 test set with shortcut tokens. Three task-irrelevant texts are selected to be added to the test set texts, with sentiment polarity opposite to the ground-truth labels.

Table 8:
Performance of cross-model transfer, including P-tuning on source PLMs, non-transferred baselines (P-tuning and random prompts), transferred baselines (Nearest, Manually Designed, and LM-BFF), and our proposed method, where M_a, M_b, M_c, and M_d refer to BERT_base, BERT_large, RoBERTa_base, and RoBERTa_large, respectively. All experimental setups are similar to Table 6, with BERT_base adopted as the source PLM in the 100-shot scenario.

P-tuning  74.52 87.70 73.09 77.02 80.70 88.84 85.25 92.21 84.07 86.66 85.63 82.82
Random    50.58 50.01 50.03 61.67 73.37 76.54 61.20 79.88 77.16 42.95 61.64 53.96

Table 9:
General descriptions of datasets.

Table 10:
Statistics of datasets.

Table 11:
Target tokens of classification tasks.