Efficiently Enhancing Zero-Shot Performance of Instruction Following Model via Retrieval of Soft Prompt

Enhancing the zero-shot performance of instruction-following models requires heavy computation, either by scaling the total number of training datasets or the model size. In this work, we explore how retrieval of soft prompts obtained through prompt tuning can efficiently assist hard prompts in zero-shot task generalization. Specifically, we train soft prompt embeddings for each prompt through prompt tuning, store samples of the training instances mapped to the prompt embeddings, and, during inference, retrieve the prompt embedding corresponding to the training instance closest to the query instance. While adding only 0.007% additional parameters, retrieval of soft prompts enhances the zero-shot performance of T0 on unseen tasks, outperforming it on 10 out of 11 datasets, and improves the mean accuracy of T0 on the BIG-bench benchmark by 2.39% points. Also, we report an interesting finding that retrieving source embeddings trained on similar answer choice formats is more important than retrieving those trained on similar task types.


1 Introduction
Training Large Language Models (LLMs) on huge amounts of data has enabled LMs to perform downstream tasks without any fine-tuning, with the aid of natural prompts or the concatenation of a few demonstration instances (Brown et al., 2020; Rae et al., 2021; Kojima et al., 2022; Chowdhery et al., 2022). Additionally, recent works have shown that adding an instruction-tuning stage, an additional training step that helps pretrained LMs understand prompts and demonstrations, results in a significant performance boost on zero-shot task generalization even for moderate-sized LMs (Min et al., 2021; Sanh et al., 2021; Wei et al., 2021; Wang et al., 2022b; Ye et al., 2022; Chung et al., 2022). This extra instruction-tuning stage involves explicit, multitask prompted learning on various tasks, enabling LMs to quickly adapt to unseen tasks at inference.1

1 Model checkpoints and code implementation are available at github.com/seonghyeonye/RoSPr.
To maximize the effect of instruction-tuning, two approaches have been widely explored: (1) scaling the number of training datasets, and (2) scaling the model size (Wang et al., 2022b; Chung et al., 2022). However, both approaches require heavy computation that is not feasible on an academic budget. Specifically, the first approach requires updating all parameters of the model every time a training dataset is added, showing limitations in terms of scalability. The second approach imposes heavy memory requirements to load and train a massive LLM.
To enhance the zero-shot performance of instruction-following models efficiently, we introduce Retrieval of Soft Prompt (ROSPR), which is easily scalable and requires minimal computation, adding only 0.007% parameters to the main model during inference. As shown in Figure 1, by training prompt embeddings (soft prompts) for each given hard prompt through prompt tuning, we construct a Source Prompt Library consisting of samples of training instances mapped to their corresponding prompt embeddings. Then, during inference, using a simple, off-the-shelf dense retriever model, we search for training instances similar to the given query instances and retrieve their corresponding prompt embeddings. Because the backbone LM is frozen, the retrieved embeddings serve as adapters assisting the hard prompts. While ROSPR can be applied to any LM, in this work we use T0 (Sanh et al., 2021) as our backbone LM and perform prompt tuning on the tasks used during the instruction-tuning stage.
While adding only 0.007% additional parameters, ROSPR outperforms T0 on 10 out of 11 evaluation datasets and outperforms efficient fine-tuning baselines without any target task fine-tuning. ROSPR is also effective on challenging tasks such as those from BIG-bench (Srivastava et al., 2022), outperforming T0 by 2.39% mean accuracy. Furthermore, we provide several interesting findings: (1) variants of ROSPR that add interpolation of multiple prompt embeddings and a scoring method that considers the answer choice distribution during retrieval further increase the effect of ROSPR; (2) analyzing which factors contribute to the performance of ROSPR, we show that, similarly to the role of demonstrations in in-context learning (Min et al., 2022), heuristic features such as the answer choice format are more important than the similarity of the source task.
2 Related Work

2.1 Task Generalization with Instruction-Tuning

Prompts and demonstrations are essential for task generalization since proper explanations are required for LMs to understand an unseen task (Kojima et al., 2022; Wei et al., 2022; Lampinen et al., 2022). Instruction-tuning, which is explicit multitask prompted training on various downstream tasks, is a simple but effective way to achieve this, resulting in improved zero-shot capabilities. Zhong et al. (2021) first introduced the method of instruction-tuning by converting various tasks into a question-answering format and fine-tuning the model on the aggregated dataset. Following works (Mishra et al., 2022; Min et al., 2021; Sanh et al., 2021; Wei et al., 2021; Wang et al., 2022b; Xu et al., 2022; Ouyang et al., 2022; Ye et al., 2022; Chung et al., 2022) extended this approach to a larger scale and showed that zero-shot task generalization can be enhanced with more diverse prompts, a larger number of training downstream tasks, and a larger LM.

2.2 Source Task Retrieval
Retrieving a source task that is relevant to the target task has been shown to result in faster and better task adaptation. For parameter-efficient fine-tuning, Vu et al. (2022); Su et al. (2022) retrieve a source prompt embedding that is similar to the target prompt embedding and obtain a better initialization point for prompt tuning. Instead of utilizing a single prompt embedding, recent works show a mixture of multiple prompt embeddings to be effective (Asai et al., 2022; Qin and Eisner, 2021).
For instruction-tuning, Lin et al. (2022) retrieve training instances that are similar to the query through a dense retriever and fine-tune the model using the retrieved examples. For in-context learning, Rubin et al. (2021); Liu et al. (2022b); Wang et al. (2023) retrieve training data that can be used as demonstrations. Wang et al. (2022c) show the effect of retrieving prompt embeddings in a continual learning setting. Although our proposed method is related to these works, the novelty of our work lies in applying source task retrieval in the zero-shot setting and in retrieving soft prompts instead of training instances.

3 Method
In this section, we introduce Retrieval of Soft Prompt (ROSPR) for zero-shot task generalization. A detailed overview is shown in Figure 2. We first train source prompt embeddings of the LM for each hard prompt of a given source task using prompt tuning (Section 3.1). Then, we save training instance samples along with their prompt embeddings in the Source Prompt Library and use it to retrieve embeddings at inference, performing tasks in a zero-shot manner (Section 3.2). We additionally introduce interpolation of multiple source prompt embeddings (Section 3.3) and variance-based ranking (Section 3.4) to increase robustness and accuracy.

3.1 Training Source Prompt Embeddings
Even though ROSPR may be used to augment any type of LM, we use T0 (Sanh et al., 2021) as the backbone LM in this paper. For training the soft prompts, we utilize the source tasks and prompts used for the instruction-tuning phase of T0. While T0 was trained in a multi-task learning manner, we freeze the initial T0 parameters and train only the soft prompts (source prompt embeddings), one for each hard prompt of each source task.
Prompt Tuning Among various parameter-efficient fine-tuning methods, we follow prompt tuning proposed by Lester et al. (2021) because the number of trainable parameters is extremely small (∼204K parameters per prompt), which implies that the memory overhead of parameter retrieval at inference is negligible.
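As a rough sanity check (assuming the prefix length of 100 reported in Appendix F and T0-3B's embedding dimension of 2048), the per-prompt parameter count works out to

$$100 \times 2048 = 204{,}800 \approx 204\text{K} \approx 0.007\% \text{ of the } {\sim}3\text{B backbone parameters}.$$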
For each source training dataset $D_i$ ($i = 1, \ldots, T$), where $T$ is the total number of source datasets, we train source embeddings $E_{ij}$ ($j = 1, \ldots, M_i$), where $M_i$ is the number of hard prompts for $D_i$, yielding a soft prompt embedding for each individual hard prompt. Specifically, given a training instance $\{x_{ik}, y_{ik}\}$ ($k = 1, \ldots, K$) from $D_i$, where $K$ is the number of sampled training instances per dataset, we first convert it into its hard-prompted version $\{h_j(x_{ik}), h_j(y_{ik})\}$, where $h_j(\cdot)$ denotes applying the $j$-th hard prompt. Next, we train the LM with the following objective:

$$\mathcal{L}_{ij} = -\sum_{k=1}^{K} \log p\big(h_j(y_{ik}) \mid [E_{ij}; h_j(x_{ik})]\big),$$

where all parameters of the underlying backbone LM are frozen and only $E_{ij}$ is trainable. In short, given $D_i$, we perform $M_i$ prompt tunings, one for each hard prompt, resulting in $\sum_{i=1}^{T} M_i$ source prompt embeddings in total. For training efficiency, we train each source prompt embedding on only $K = 5000$ training instances for a single epoch.
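To make the training step concrete, below is a minimal sketch in PyTorch/Transformers. The checkpoint name and the toy instance are illustrative assumptions, not the authors' released training code; only the structure (frozen backbone, a trainable prefix prepended at the embedding layer, cross-entropy on the hard-prompted target) follows the description above, with the hyperparameters from Appendix F.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

BACKBONE = "bigscience/T0_3B"  # assumed checkpoint name for the frozen backbone
PREFIX_LEN = 100               # prefix length used in the paper (Appendix F)

tokenizer = AutoTokenizer.from_pretrained(BACKBONE)
model = AutoModelForSeq2SeqLM.from_pretrained(BACKBONE)
for p in model.parameters():   # freeze all ~3B backbone parameters
    p.requires_grad = False

d_model = model.config.d_model                                            # 2048 for T0-3B
soft_prompt = torch.nn.Parameter(torch.randn(PREFIX_LEN, d_model) * 0.5)  # E_ij
optimizer = torch.optim.Adam([soft_prompt], lr=0.1, weight_decay=1e-5)

def prompt_tuning_loss(hard_prompted_input: str, hard_prompted_target: str):
    """Cross-entropy of h_j(y_ik) given [E_ij; h_j(x_ik)]; only E_ij gets gradients."""
    enc = tokenizer(hard_prompted_input, return_tensors="pt")
    labels = tokenizer(hard_prompted_target, return_tensors="pt").input_ids
    labels[labels == tokenizer.pad_token_id] = -100   # ignore padding in the loss
    input_embeds = model.get_input_embeddings()(enc.input_ids)   # (1, L, d)
    prefix = soft_prompt.unsqueeze(0)                            # (1, 100, d)
    inputs_embeds = torch.cat([prefix, input_embeds], dim=1)     # (1, 100+L, d)
    attention_mask = torch.cat(
        [torch.ones(1, PREFIX_LEN, dtype=enc.attention_mask.dtype), enc.attention_mask],
        dim=1)
    return model(inputs_embeds=inputs_embeds,
                 attention_mask=attention_mask, labels=labels).loss

# one illustrative training step on a toy hard-prompted NLI instance
loss = prompt_tuning_loss(
    "Premise: A man is playing a guitar. Hypothesis: A person plays music. Entailment?",
    "yes")
loss.backward()
optimizer.step()
optimizer.zero_grad()
```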

3.2 Zero-Shot Embedding Retrieval
After source prompt embedding training, we retrieve the most related source embeddings and select one from the retrieved candidates to be used at inference (right part of Figure 2).
We first construct a Source Prompt Library consisting of sentence-level representations of training instance inputs as keys and the corresponding source prompt embeddings as values. For each available source prompt embedding, n samples are stored in the library. The sentence-level representations are obtained by taking the mean of the hidden states of the last layer of the dense retriever. We use a T0-small encoder as the dense retriever, replicated based on Sanh et al. (2021) with a smaller model size.
At inference, we first randomly sample Q query instances from the target task, following Lin et al. (2022). After obtaining sentence-level representations for each query through our T0-small encoder, we retrieve the top-N examples for each query instance using a MIPS (maximum inner product search) operation on our Source Prompt Library, retrieving a total of Q × N prompt embeddings. As the default methodology, among the retrieved embedding candidates, we select the most frequently retrieved prompt embedding as the designated soft prompt for the given target task and concatenate the embedding with each of the target task instances before feeding it to our backbone LM. In the next two subsections, we explain different strategies for calculating the target embedding from the Q × N prompt embedding candidates.
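To make this pipeline concrete, below is a minimal sketch under stated assumptions: the `t5-small` checkpoint stands in for the paper's replicated T0-small encoder, and the stored instances are toy strings. The mean-pooled sentence representations, inner-product search, and frequency vote follow the description above.

```python
from collections import Counter
import torch
from transformers import AutoTokenizer, AutoModel

RETRIEVER = "t5-small"  # placeholder for the paper's replicated T0-small encoder
tok = AutoTokenizer.from_pretrained(RETRIEVER)
encoder = AutoModel.from_pretrained(RETRIEVER).encoder

@torch.no_grad()
def embed(texts):
    """Sentence-level representation: mean of the last layer's hidden states,
    ignoring padding positions."""
    batch = tok(texts, return_tensors="pt", padding=True, truncation=True)
    hidden = encoder(input_ids=batch.input_ids,
                     attention_mask=batch.attention_mask).last_hidden_state  # (B, L, d)
    mask = batch.attention_mask.unsqueeze(-1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)                      # (B, d)

# Source Prompt Library: instance embeddings as keys, prompt-embedding ids as values
library_keys = embed(["stored source instance 1", "stored source instance 2"])
library_values = ["prompt_emb_A", "prompt_emb_B"]   # ids of the trained E_ij tensors

def retrieve(queries, top_n=10):
    """MIPS over the library for each of the Q queries, then a frequency vote
    over the Q*N retrieved candidates (the default selection strategy)."""
    q = embed(queries)                               # (Q, d)
    scores = q @ library_keys.T                      # inner-product (MIPS) scores
    top = scores.topk(min(top_n, scores.size(1)), dim=1).indices
    votes = Counter(library_values[j] for row in top for j in row.tolist())
    return votes.most_common(1)[0][0]

chosen = retrieve(["a query instance sampled from the target task"], top_n=10)
```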

3.3 Interpolation of Prompt Embeddings
When retrieving only a single prompt embedding for a given task (Section 3.2), the result may show high variance across evaluation prompts if the selected prompt embedding does not fit the given task well. Recent works on prompt embedding retrieval have shown that interpolation of prompt embeddings transfers effectively to the target task (Asai et al., 2022; Vu et al., 2022). We therefore also explore calculating the target embedding through interpolation of multiple source embeddings instead of using a single embedding. Among the Q × N prompt candidates retrieved in Section 3.2, we select the top-N′ candidate embeddings based on retrieval frequency. Then, we calculate a weighted sum of the candidate embeddings, where the interpolation weight for each source embedding is proportional to its retrieval frequency. While Asai et al. (2022); Vu et al. (2022) require fine-tuning the target embeddings on the target task to calculate the interpolation weights, our approach does not require any target task fine-tuning, enabling zero-shot task transfer.
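A minimal sketch of the interpolation step follows; the ids and tensors are illustrative, and `interpolate` is a hypothetical helper name rather than the authors' code.

```python
from collections import Counter
import torch

def interpolate(retrieved_ids, embeddings, n_prime=3):
    """retrieved_ids: ids of the prompt embeddings over all Q*N retrievals.
    embeddings: dict mapping id -> tensor of shape (PREFIX_LEN, d_model).
    Returns a frequency-weighted sum of the top-N' most frequent candidates."""
    top = Counter(retrieved_ids).most_common(n_prime)        # [(id, freq), ...]
    total = sum(freq for _, freq in top)
    return sum((freq / total) * embeddings[i] for i, freq in top)

# toy usage with hypothetical 100 x 2048 soft prompts
embs = {k: torch.randn(100, 2048) for k in ("A", "B", "C")}
target_prompt = interpolate(["A", "A", "B", "A", "C", "B"], embs, n_prime=3)
```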

3.4 Variance-based Ranking
Similar to the scoring and calibration methods of Lu et al. (2022); Zhao et al. (2021), we introduce a scoring method, applicable to zero-shot classification tasks, that ranks the Q × N retrieved prompt embedding candidates by considering the answer choice distribution of the given target task as an extra cue alongside the original frequency cue. To accomplish this, we perform a forward pass with each candidate prompt embedding concatenated with the given hard prompt of the target task (only the instruction, excluding the input instance) and give a higher score to candidates that result in lower variance over the answer choices. Ideally, the combination of soft and hard prompts should assign equal probability to all answer choices because the actual context of the task is not included.
Specifically, given a target task with the $k$-th hard prompt $h_k$, for each candidate embedding $E_{ij}$ we calculate the score as

$$\mathrm{score}(E_{ij}) = -\,\mathrm{Var}_{y}\Big( p\big(y \mid [E_{ij}; h_k]\big) \Big),$$

where $y$ ranges over the available output options of the target task.
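A minimal sketch of this ranking step is given below, assuming a hypothetical helper `choice_logprob` (not part of the paper's released code) that returns the backbone LM's log-likelihood of an answer choice given a soft prompt and the instruction-only hard prompt.

```python
import torch

def variance_score(candidate_embedding, instruction_only_prompt, answer_choices, choice_logprob):
    """Score a candidate soft prompt by the negative variance of the answer-choice
    distribution p(y | [E_ij; h_k]), normalized over the choices. Lower variance
    (closer to uniform) yields a higher score."""
    logps = torch.tensor([choice_logprob(candidate_embedding, instruction_only_prompt, y)
                          for y in answer_choices])
    probs = torch.softmax(logps, dim=0)
    return -probs.var().item()

def rank_candidates(candidates, instruction_only_prompt, answer_choices, choice_logprob):
    """Rank the Q*N retrieved candidates from most to least uniform."""
    scored = [(variance_score(e, instruction_only_prompt, answer_choices, choice_logprob), e)
              for e in candidates]
    scored.sort(key=lambda t: t[0], reverse=True)  # sort on score only, not tensors
    return [e for _, e in scored]
```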

4 Experimental Settings
In this section, we explain the experimental settings for training the source prompt embeddings and constructing our Source Prompt Library. We also explain our evaluation setting for zero-shot inference and the baseline models. We provide detailed experiment configurations in Appendix F.

4.1 Source Tasks
For training soft prompts through prompt tuning, we use a subset of the source tasks used for the initial T0 instruction-tuning (Sanh et al., 2021). For each source task, we use the prompts for each dataset in T0, resulting in a total of 230 prompts. For Source Prompt Library construction, we sample only n = 100 training instances per source embedding to minimize inference latency. We show variations of n and different methods to sample the n training instances in Appendix D.

4.2 Evaluation Tasks
Following Sanh et al. (2021), we evaluate on the validation sets of 4 held-out tasks (natural language inference, sentence completion, coreference resolution, word sense disambiguation), resulting in a total of 11 evaluation datasets. We also follow Sanh et al. (2021) and evaluate on 14 datasets from the BIG-bench benchmark (Srivastava et al., 2022). We use the rank classification evaluation method, selecting the output option with the higher log-likelihood, following Brown et al. (2020); Sanh et al. (2021). For all evaluation tasks, we use accuracy as the evaluation metric and report the mean accuracy and standard deviation over all evaluation prompts for a given dataset (an average of ∼10 prompts per evaluation dataset). For BIG-bench tasks, we do not report the standard deviation because only one prompt is provided per task.
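As a concrete illustration, rank classification can be sketched as below. `seq2seq_logprob` is a hypothetical helper (not from the paper's code) that returns the backbone LM's log-likelihood of a target string given the prompted input, optionally with a retrieved soft prompt prepended.

```python
# Minimal sketch of rank classification: score each answer choice by its
# log-likelihood under the model and predict the highest-scoring option.
def rank_classify(prompted_input, answer_choices, seq2seq_logprob):
    scores = {y: seq2seq_logprob(prompted_input, y) for y in answer_choices}
    return max(scores, key=scores.get)

# e.g., for an RTE-style instance with yes/no answer choices:
# rank_classify("Premise: ... Does the premise entail the hypothesis?", ["yes", "no"], seq2seq_logprob)
```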

4.3 Baseline Models
Zero-shot Baseline For zero-shot baseline models, we show the results of T0 (3B) together with the 4 times larger instruction-tuned T0 (11B). We also compare with the GPT-3 (175B) model, which is 60 times larger than T0 (3B).

Fine-tuning Baseline We also compare with efficient fine-tuning baseline models that utilize prompt tuning. These models require target task prompt tuning, meaning that zero-shot transfer is infeasible. Similar to our source prompt tuning process, we train each target prompt for a single epoch with a maximum of 5,000 training instances. The first baseline is naive prompt tuning on the target tasks without any prompt retrieval, referred to as PT (Lester et al., 2021). The second is ATTEMPT (Asai et al., 2022), which trains the target soft prompts through attentional mixtures of source prompts. Because StoryCloze (Mostafazadeh et al., 2016) does not contain training instances, we exclude this dataset for fine-tuning. More training details of the fine-tuning baselines are specified in Appendix C.
5 Results

ROSPR outperforms T0 on 10 out of 11 evaluation datasets (Table 1). One exception is WSC, which is a binary classification task (yes/no) predicting whether the reference of the pronoun is correct. We observed that the evaluation data of this dataset has unbalanced labels, containing over 60% "No" labels. This might be why T0-11B underperforms T0-3B only on this dataset (Sanh et al., 2021). Indeed, predicting only "No" on this dataset outperforms T0-11B (63.46 > 61.45).
ROSPR also outperforms fine-tuning baselines even without utilizing any training instances of the target task. We first observe that PT harms the performance of the backbone model, which aligns with the results of Liu et al. (2022a); Gu et al. (2022) that prompt tuning is unstable when the number of training instances or training steps is small. Comparing ATTEMPT with ROSPR, ROSPR outperforms ATTEMPT on 7 out of 10 tasks and by 1.21% points in mean accuracy over the 10 tasks. This shows that ROSPR is better suited for efficient adaptation, as it requires 3 times fewer additional parameters than ATTEMPT and does not require any further fine-tuning on the target task.
INTER and VAR enhance the performance of ROSPR. We also analyze the effect of introducing the variants of ROSPR, interpolation of soft prompts (INTER) and variance-based ranking (VAR), in Table 1. First, applying INTER shows accuracy similar to ROSPR. However, as shown in the last column of Table 1, INTER reduces the standard deviation of T0 and ROSPR by 8.84% while improving the mean accuracy of T0, indicating increased robustness to different surface forms of evaluation prompts. This indicates that interpolation of multiple source embeddings outperforms single source embedding retrieval, aligning with the results of Asai et al. (2022). Applying VAR to T0+ROSPR improves both the zero-shot accuracy and the robustness of T0+ROSPR, showing that considering the answer choice distribution is beneficial in the zero-shot setting, in line with the results of Zhao et al. (2021); Shi et al. (2022). Moreover, applying both VAR+INTER results in the highest overall average accuracy, outperforming T0 by 2.18% points and largely closing the gap with larger LLMs.

The effect of ROSPR generalizes to challenging tasks. ROSPR is also effective on challenging tasks such as those from the BIG-bench benchmark. As shown in Figure 3, T0+ROSPR improves the mean accuracy of T0-3B by 2.39% points while adding only 0.007% additional parameters. T0+ROSPR also outperforms the 60 times larger zero-shot and 1-shot GPT-3 and largely reduces the performance gap with the 4 times larger T0-11B (1.84% points) and the 60 times larger 3-shot GPT-3 (0.53% points). Applying INTER to T0+ROSPR results in an additional mean accuracy enhancement, outperforming T0-3B by 2.67% points.

6 Analysis of ROSPR
Zero-shot task adaptation of LMs is often seen as a problem of task location: locating the target task where the model can solve it using the intrinsic ability obtained at the pretraining stage, with the aid of prompts and demonstrations (Reynolds and McDonell, 2021). In this section, we analyze which factors contribute to the performance enhancement from the perspective of identifying a better task location. We find that although the target task performance depends on the source task types, heuristic features such as the answer choice format are more important. This agrees with previous findings that a meta-trained LM focuses on simple features such as the label space, the input distribution, and the sequence format, instead of complex semantics (Webson and Pavlick, 2021; Min et al., 2022).
Target task performance depends on source task types. To analyze the effect of different source task types on each target task, we measure the frequency ratio of each source task type that results in the best performance (ORACLE) for the given prompts of the target tasks (visualized in Figure 4).
From this figure, we can observe a few patterns: the paraphrase task assists NLI and word sense disambiguation, while the multi-choice QA (MQA) task assists sentence completion. For coreference resolution, various source task types (paraphrase, summarization, multi-choice QA) assist the target task.

Answer choice format is important for task location. We also analyze the effect of using different answer choice formats with the same source task. The answer choice format determines how the available answer choices are given to the LM through the input. For example, a prompt that requires classifying a movie review into good/bad has a different answer choice format from one classifying it into positive/negative. We experiment on 3 datasets (RTE, COPA, WiC), which correspond to different tasks (NLI, sentence completion, word sense disambiguation), respectively. For each dataset, we select the source dataset that is retrieved most often for ORACLE. Among the source prompts of the selected source dataset, we select a prompt that has the same answer choice format as the target task (ALIGNED) and another prompt that has a different answer choice format (MISALIGNED). Figure 5 shows the effect of answer choice format alignment on the target task performance by comparing ALIGNED and MISALIGNED. The result shows that for all 3 datasets, ALIGNED significantly outperforms MISALIGNED. This result is non-trivial considering that the two prompt embeddings are trained on the same source training dataset with the same training configuration, the only difference being the given answer choice format; it implies that how the answer choices are given to solve a specific task is more important than the content of the training data for task location. ROSPR is mostly comparable to the ALIGNED embedding, implying that retrieving a source prompt embedding by searching for similar input instances results in retrieving a source embedding with a similar answer choice format.
We additionally analyze the effect of answer choice formats on the RTE and WiC datasets by retrieving prompt embeddings trained on various source tasks. Both target datasets have yes/no answer choices. Similar to the previous experiments, we retrieve ALIGNED (yes/no format) and MISALIGNED prompt embeddings across three source tasks: paraphrase, sentiment classification, and multi-choice QA. As shown in Figure 6, for both target datasets, ALIGNED outperforms MISALIGNED across all three source tasks. This shows that aligning to the answer choice format of the target task is crucial regardless of the retrieved source task.
Answer choice format is more important than task similarity. From Figure 6, we can see that all three source tasks benefit from aligning to the target task's answer choice format. One may think that embeddings from source tasks requiring knowledge similar to the target task would be important. Counterintuitively, for both the RTE and WiC target tasks, when the answer choice format is aligned, the source embedding of sentiment classification, a task known to be irrelevant to RTE and WiC (Pruksachatkun et al., 2020), outperforms embeddings sourced from datasets that are more relevant to the target datasets (paraphrase and multi-choice QA) (Appendix G). This implies that for retrieval of source embeddings for task location, the answer choice format is more important than containing similar knowledge required to solve the target task.
Role of ROSPR is similar to in-context learning. From the findings explained in the previous paragraphs, we can conclude that although the source task types influence the target task performance, retrieving a similar answer choice format is more important for task location. Indeed, source tasks containing similar knowledge help target tasks only if the answer choice formats are aligned with the target task. These findings support Min et al. (2022); Webson and Pavlick (2021) in that a meta-trained LM "takes less effort" to understand the input: models exploit simple aspects of prompts and demonstrations, such as the format and distribution, instead of complex semantics. In particular, for in-context learning, Xie et al. (2021); Min et al. (2022) show that the role of demonstrations lies in providing the shared concept and distribution hints of the target task. From this perspective, the role of ROSPR is similar to that of demonstrations. However, it is more efficient than including demonstrations because it avoids the heavy computation at inference caused by long sequence lengths (Liu et al., 2022a; Choi et al., 2022), since ROSPR prepends a fixed-length prefix regardless of the task. Also, ROSPR is free from the instability of in-context learning that comes from different orderings of demonstrations (Lu et al., 2022; Zhao et al., 2021). Lastly, we conjecture that ROSPR also has the benefits of soft prompts (Li and Liang, 2021), such as greater expressiveness.

7 Ablation Studies
In this section, we analyze the effect of the number of (1) prompts, (2) source datasets, and (3) queries sampled for evaluation. We evaluate variations of our proposed methods on 4 datasets: RTE for NLI, COPA for sentence completion, Winogrande for coreference resolution, and WiC for word sense disambiguation. We report the average of the mean accuracy over all evaluation prompts for each dataset across 3 different runs.
Scaling the number of prompts vs. the number of datasets. Recent works on meta-trained LMs show that the number of source datasets and prompts is an important factor for zero-shot task generalization (Sanh et al., 2021; Wei et al., 2021; Wang et al., 2022b; Chung et al., 2022). We also run ablations for ROSPR and measure how zero-shot generalization performance changes when we vary the number of prompts and datasets available during the prompt tuning stage (shown in Figure 7a). First, we vary (1) the total number of source prompts (60, 120, and 230) by increasing the number of prompts per dataset, and (2) the number of datasets (8, 16, and 30) by increasing the number of datasets per task cluster, where a task cluster is defined as a cluster of the same task types. Note that in (1), the total number of datasets is fixed, while in (2), we use all available prompts for each dataset while varying the number of datasets per task cluster.
In contrast to (1), (2) does not always lead to a linearly increasing performance boost; the performance saturates as more source datasets are included. By comparing the effect of scaling datasets and scaling prompts for similar Source Prompt Library sizes, we observe that the number of prompts has more impact on the accuracy of the target task (Figure 7a).
This ablation study also supports the analysis of the previous section: diverse answer choice formats of prompts, which are mostly influenced by the total number of source prompts, are more important than source task types, which are influenced by the number of source datasets. Therefore, once the number of task clusters is sufficient to some extent, scaling the number of source prompts per dataset is more crucial than scaling the number of source datasets per task cluster, even though the total number of prompts also increases as the number of datasets increases.
More sampled queries improve the performance. We also analyze the effect of the number of query instances sampled at inference for retrieval. As seen in Figure 7b, increasing the number of queries results in higher mean accuracy. This differs from the analysis of Lin et al. (2022), where sampling more queries leads to better performance only up to a point. Because we use the frequency of each prompt embedding candidate as the default metric for retrieval, utilizing more query instances represents the evaluation data more accurately, resulting in fewer wrong retrievals.

8 Conclusion
In this paper, we introduce ROSPR, a method that efficiently enhances the zero-shot generalization capabilities of a meta-trained LM by retrieving prompt-specific source prompt embeddings (soft prompts) for a given target task. We accomplish this by first training soft prompts for each hard prompt of the source tasks. After training the source prompt embeddings, we construct the Source Prompt Library by storing the mean representations of training instances as keys and the corresponding prompt embeddings as values. At inference, we search the library for training instances similar to sample instances from the target task, retrieve the corresponding prompt embeddings, select the most frequently retrieved embedding, and prepend it to each of the target task instances for prediction. Our results show that ROSPR efficiently enhances the zero-shot performance of the backbone model while introducing minimal additional parameters at inference. We additionally analyze which factors contribute to the performance of ROSPR and find that heuristic cues such as the answer choice format are critical for generalization performance, implying that ROSPR may play a role similar to demonstrations in in-context learning.

Limitations
Although we show the effectiveness of ROSPR by applying it to T0-3B (Sanh et al., 2021), we did not evaluate our method on different model scales such as the T0-11B variant or on other LM architectures such as decoder-only LMs due to limited computational resources. We leave applying ROSPR to even larger LMs and diverse LM architectures (Wang et al., 2022a) for future work. Moreover, it is hard to apply VAR to target tasks without answer choices, such as free-form generation, because the variance among options cannot be obtained. However, ROSPR and ROSPR+INTER can still be utilized, and we leave applying ROSPR to zero-shot task location for free-form generation as future work (Scialom et al., 2022).

A Comparison with Hard Prompt Optimization
We conducted additional experiments to compare our method with 3 different hard prompt optimization techniques: (1) ROHPR, which utilizes the same search technique as ROSPR but retrieves 4 corresponding training instances for in-context learning instead of the corresponding soft prompt; (2) APE (Zhou et al., 2023), an automatic prompt engineering method that utilizes the generation results of an LLM (Ouyang et al., 2022) and is known to outperform human instructions; and (3) ZPS (Liao et al., 2022), which selects an optimal hard prompt from a prompt pool using prompt ensembling and pseudo-labels. As shown in Table 2, ROSPR performs the best on average, showing the benefits of using soft prompts over hard prompts for task generalization. While ROSPR shows improvement on 10 out of 11 datasets, the other hard prompt optimization methods do not show consistent improvements.

C Fine-tuning Baseline Details
For fine-tuning baseline models (PT and ATTEMPT), we follow the training configuration of source prompt tuning. We train each target prompt for a single epoch using a maximum of 5,000 training instances. For fine-tuning, we use a batch size of 32 and a learning rate of 1e-3. We also randomly select hard prompts (templates) of the training dataset during fine-tuning. For ATTEMPT, we randomly sample one source prompt per task cluster, resulting in interpolation between 8 soft prompts.

D Additional Ablation Results

We provide detailed results for the variation of the number of prompts (Figure 8) and the number of datasets (Figure 9). We additionally analyze the effect of (1) different sampling methods for constructing the Source Prompt Library, (2) the number of instances sampled for constructing the Source Prompt Library, (3) the number of top-N retrievals for embedding retrieval, and (4) the number of source embeddings to interpolate. As in Section 7, we report the mean accuracy on 4 evaluation datasets (RTE, COPA, Winogrande, and WiC) over 3 runs with different random seeds for the sampling of evaluation queries.

D.1 Sampling Methods for Source Prompt Library
We experiment with three different methods to sample instances for constructing the Source Prompt Library and analyze the effect of each method. By default, we choose the RANDOM method, where the n training instances stored per prompt are sampled uniformly at random. As shown in Figure 10, the RANDOM method outperforms the CLUSTERING and DISTRIBUTED methods. Interestingly, the CLUSTERING method significantly hurts the performance of all 4 proposed methods, suggesting that storing similar instances per prompt results in retrieval failures more often. Also, for the DISTRIBUTED method, most of the methods significantly underperform RANDOM, except INTER and ROSPR+VAR+INTER. From these results, we can conclude that random sampling represents the source dataset most effectively.

D.2 Number of Instances Sampled for Constructing Source Prompt Library
We analyze the effect of the size of the Source Prompt Library by varying the number of instances n sampled for each hard prompt among 100, 300, and 500; the size of the Source Prompt Library is therefore n × (number of total hard prompts). As shown in Figure 11, increasing the number of sampled instances does not improve performance; it hurts performance in most cases. This suggests that only a small number of training instances is enough to represent the distribution of the prompted input (hard prompt + input instance) for each hard prompt, and that increasing the number can hurt performance by adding noise to the distribution. This also supports the importance of heuristic cues discussed in Section 6 by showing that adding more training instances per hard prompt does not increase performance; instead, adding hard prompts with diverse answer choice formats is more important.

D.3 Number of Top N instances for Embedding Retrieval
We vary the number of top-N instances retrieved for each query through MIPS search. As shown in Figure 12, varying the number of top-N instances has little effect compared to increasing the number of sampled queries (Figure 7b). This implies that if the evaluation set of the target task is large, sampling more queries is more effective than searching for more similar instances per query. This matters for variance-based methods because the number of forward passes needed before evaluation is proportional to Q × N; we can therefore reduce latency by reducing the number of instances retrieved per query without hurting performance much.

D.4 Number of Source Embeddings for Interpolation
We analyze the effect of the number of source embeddings used for interpolation by varying top-N′ from 1 (no interpolation) to 5, as shown in Figure 13. Comparing single prompt embedding retrieval (N′ = 1) with the interpolation of multiple embeddings (N′ > 1), the mean accuracy drops when adding multiple source embeddings because interpolation-based methods underperform on tasks such as COPA, as shown in Table 1. Mean accuracy would increase if we added other datasets that benefit from interpolation, such as WSC and CB, to the evaluation. Comparing various N′ values, we find that for ROSPR+INTER the accuracy decreases substantially at N′ = 2, implying that the possibility of a wrong retrieval varies depending on the value of N′. In contrast, ROSPR+VAR+INTER is more robust to the value of N′, showing that variance-based ranking also increases robustness to different numbers of source embeddings for interpolation.

E Visualization of Results
We show the visualization of the evaluation results on the 11 datasets in Figure 14. Methods based on ROSPR show not only higher accuracy but also lower variance on many datasets.

F Experimental Configurations
As mentioned in the previous sections, we use T0-3B as our backbone meta-trained LM. For prompt tuning, we fix the prefix length at 100, and the embeddings are initialized from the 5,000 most common vocabulary tokens following Lester et al. (2021). We train each source embedding for a single epoch with a learning rate of 0.1 and a batch size of 32, using the Adam optimizer with a weight decay of 1e-5. For retrieval, we randomly sample Q = 32 query instances and retrieve the top N = 10 examples for each query. We train a T0-small variant (∼35M params) as our dense retriever by multitask prompted training on the T5+LM model (Lester et al., 2021), replicating the original training setting of T0 by training T5+LM for 8 epochs using the same training instances as Sanh et al. (2021) with a learning rate of 1e-3, an input sequence length of 512, an output sequence length of 128, and a batch size of 1024. We select the model checkpoint by early stopping based on validation accuracy. We use a meta-trained LM instead of a naive pretrained model (e.g., Sentence-BERT) because meta-trained LMs have been shown to be more effective for retrieval (Lin et al., 2022). For the interpolation experiment, we set N′ = 3 for the top-N′ prompt embedding candidates. For training source prompt embeddings, we used 8 V100 GPUs.

G Examples of Applying Prompts, Answer Choice Format and Source Task Types
Figure 15 shows an example of applying a prompt through Promptsource (Bach et al., 2022), as mentioned in Section 3.1.
We assert in Section 6 that answer choice format is more important than task similarity. We further provide details of input instances of the mentioned tasks (paraphrase, NLI, word sense disambiguation, and sentiment classification) in Figure 16. As supported by Pruksachatkun et al. (2020), the paraphrase task is intuitively more similar to word sense disambiguation or NLI, while sentiment classification is very different. However, our counterintuitive result in Figure 6 (Section 6) shows the sentiment classification soft prompt achieving the best performance, bolstering the claim that similar source task types are not a major factor for evaluation performance.

H Full List of Source Training and Evaluation Datasets
All of our training and evaluation datasets are a subset of the datasets used in Sanh et al. (2021). We use the Huggingface version of each dataset (Lhoest et al., 2021).

H.1 Training Datasets
Following Sanh et al. (2021), we use 8 task clusters for training source prompt embeddings: sentiment classification, paraphrase, topic classification, summarization, struct-to-text, multiple-choice QA, extractive QA, and closed-book QA. We use imdb (Maas et al., 2011), amazon_polarity, and the remaining datasets of these clusters used in Sanh et al. (2021).

Figure 15: Example of applying a prompt to a given instance through Promptsource (Bach et al., 2022); e.g., the question pair "What are lanyards used for?" / "What is a lanyard?" (label 0) is converted by the template into the prompted input "Can an answer to 'What are lanyards used for?' also be used to answer 'What is a lanyard?'?".
We exclude 6 datasets (MRPC, TREC, DREAM, QuaRTz, QASC, QuaRel) that have small training sets because they lead to task imbalance, which is critical for training our small dense retriever (∼35M params). We also exclude the CNN Daily Mail, App Reviews, and WikiQA datasets due to dataset download issues, the absence of any test or validation data, and an unbalanced label distribution, respectively.
For BIG-bench tasks, we evaluate on 14 tasks, following Sanh et al. (2021).

I Full List of Retrieved Prompt Embeddings
We provide a full list of the retrieved prompt embeddings of ROSPR and ORACLE for all prompts of the 11 evaluation datasets. We report retrieval results for a single random seed (Table 4 ∼ Table 14).

Figure 1: During zero-shot inference, ROSPR selects training instances similar to the given input from the Source Prompt Library and retrieves the prompt embeddings corresponding to the selected training instances.

Figure 2: An overview of ROSPR. For each hard prompt of the source datasets, soft prompts are trained via prompt tuning. After storing training instances as keys and the corresponding prompt embeddings as values, ROSPR searches for training instances similar to the query set Q, retrieves the corresponding prompt embeddings, and selects the most frequently retrieved candidate for inference. Variants of the selection strategy are also shown: ROSPR+INTER interpolates between multiple related source embeddings, and ROSPR+VAR ranks candidate embeddings considering both frequency and variance.

Figure 3: Mean accuracy on 14 datasets of BIG-bench. We evaluate on a single prompt following Sanh et al. (2021). By adding only 0.007% parameters to T0-3B, T0+ROSPR largely reduces the performance gap with the 4 times larger T0-11B. The full result is provided in Appendix B.

Figure 4: Frequency of source task types (x-axis) that maximize (i.e., ORACLE) the accuracy of each target task (y-axis).

Figure 5: Effect of answer choice format alignment across different target datasets (RTE, COPA, WiC). We report the mean accuracy of the evaluation prompts; the performance of T0 is shown as a green dotted line.

Figure 6: Effect of retrieving MISALIGNED and ALIGNED answer choice formats across various source tasks. We report the mean accuracy of the evaluation prompts; the performance of T0 is shown as a green dotted line.

Figure 7: Results of various ablation settings of ROSPR. (a) compares the effect of scaling the number of datasets with scaling the number of prompts, and (b) shows the effect of the number of sampled queries at inference. Additional ablation results are shown in Appendix D.

Figure 8: Variation of the number of prompts by increasing the number of prompts per dataset.

Figure 9: Variation of the number of datasets by increasing the number of datasets per task cluster.

Figure 10: Different instance sampling methods for constructing the Source Prompt Library.

Figure 11: Variation of the number of instances sampled for constructing the Source Prompt Library. The default setting is n = 100.

Figure 12: Variation of the number of top-N instances for embedding retrieval. The default setting is N = 10.

Figure 13: Variation of the number of source embeddings for interpolation-based methods. The default setting is N′ = 3.

Figure 14: Visualization of evaluation results on the 11 datasets.

Figure 16: Examples of instances from different source tasks.

Table 1: ROSPR refers to our main proposed method; W/ INTER refers to applying interpolation of multiple source embedding candidates; W/ VAR refers to retrieval through variance-based ranking; W/ VAR & INTER refers to applying both interpolation and variance-based ranking, where the interpolation weight is based on the variance-based ranking score; and ORACLE refers to the performance when the most optimal source embedding is retrieved from the candidates, acting as an upper bound for retrieval. FT refers to models fine-tuned on the target tasks. For FT models, we exclude StoryCloze due to the absence of training instances. The best and second-best performance are shown in bold and underlined, respectively. Comparison with hard prompt optimization techniques and visualization of the results are shown in Appendix A and Appendix E, respectively.
