UPRISE: Universal Prompt Retrieval for Improving Zero-Shot Evaluation

Large Language Models (LLMs) are popular for their impressive abilities, but the need for model-specific fine-tuning or task-specific prompt engineering can hinder their generalization. We propose UPRISE (Universal Prompt Retrieval for Improving zero-Shot Evaluation), which tunes a lightweight and versatile retriever that automatically retrieves prompts for a given zero-shot task input. Specifically, we demonstrate universality in a cross-task and cross-model scenario: the retriever is tuned on a diverse set of tasks, but tested on unseen task types; we use a small frozen LLM, GPT-Neo-2.7B, for tuning the retriever, but test the retriever on different LLMs of much larger scales, such as BLOOM-7.1B, OPT-66B and GPT3-175B. Additionally, we show that UPRISE mitigates the hallucination problem in our experiments with ChatGPT, suggesting its potential to improve even the strongest LLMs. Our model and code are available at https://github.com/microsoft/LMOps.

Fine-tuning LLMs adjusts their weights to fit specific tasks. However, this can be constrained by computational limitations or the unavailability of model weights (Hu et al., 2022). Multi-task tuning provides an alternative approach to improve zero-shot task generalization (Wei et al., 2022a; Sanh et al., 2022), which partially justifies the tuning cost. Yet, the constant evolution of LLMs creates the need for tuning new models, making the cumulative fine-tuning cost a big concern.

Figure 1: UPRISE tunes a prompt retriever on multiple tasks with a small LLM, but conducts inference on unseen task types with a different, larger LLM.
Prompt engineering constructs prompts to guide frozen LLMs. Prompt design adds an engineered natural language prompt to teach the LLM to learn in context (Brown et al., 2020) or to induce the LLM to reason (Wei et al., 2022b). Prompt tuning adds a soft prompt represented by continuous parameters, and optimizes it through gradient propagation (Liu et al., 2021; Li and Liang, 2021; Lester et al., 2021). While these methods can improve performance for specific tasks, it is uncertain whether the prompts designed for one task can generalize to unseen task types, as prompt designers are blind in strict zero-shot settings (van de Kar et al., 2022).
In this paper, we propose UPRISE (Universal Prompt Retrieval for Improving Zero-Shot Evaluation), which tunes a lightweight and versatile retriever that automatically retrieves prompts from a pre-constructed pool, given a zero-shot task input. As illustrated in Figure 1, the retriever is trained to retrieve prompts for multiple tasks, enabling it to generalize to unseen task types during inference. In addition, we demonstrate that the cross-task capabilities can generalize well from a small LLM to different LLMs of much larger scales: we use GPT-Neo-2.7B (Black et al., 2021) to guide the tuning of the retriever and evaluate the retriever's performance on BLOOM-7.1B (Scao et al., 2022), OPT-66B (Zhang et al., 2022), and GPT3-175B (Brown et al., 2020). The cross-model and cross-task generalization of UPRISE makes it a promising and practical solution for real-world applications.
Furthermore, our approach demonstrates the potential for enhancing even the most powerful LLMs, as shown in our experiments with ChatGPT. Despite its impressive abilities, ChatGPT has been found to struggle with serious hallucination problems, leading to responses that are factually inaccurate (Bang et al., 2023). However, UPRISE is able to address this issue on fact-checking tasks by prompting the model to draw correct inferences from its built-in knowledge.
In summary, our contributions include:
• We introduce UPRISE, a lightweight and versatile approach to improve zero-shot performance of LLMs in the cross-task and cross-model scenario.
• UPRISE is tuned with GPT-Neo-2.7B, but can also benefit different LLMs of much larger scales, such as BLOOM-7.1B, OPT-66B, and GPT3-175B.
• Our exploration on ChatGPT demonstrates the potential of UPRISE to improve even the strongest LLMs.

Problem Definition
We aim to improve zero-shot performance of LLMs by training a prompt retriever to retrieve prompts for any given task input. Specifically, UPRISE decomposes the prompting process into two steps: retrieve, then predict. Given an input $x$, we first retrieve a set of positive prompts $P^{+}$ from a pre-constructed pool $\mathcal{P}$:

$$P^{+} = R(x, \mathcal{P}). \qquad (1)$$

Then we concatenate $P^{+}$ with $x$ to form an input sequence for a frozen LLM, which generates a predicted output:

$$y_{P^{+}} = \mathrm{LLM}(P^{+} \oplus x). \qquad (2)$$

Figure 2: Typical prompt engineering methods and prompt retrieval. Prompt retrieval prepends a natural language prompt to the task input and uses a frozen LLM to evaluate the prompt's performance. The obtained evaluation is then used to tune the retriever in a reverse manner.
Our objective is to optimize the prediction $y_{P^{+}}$ to match the target $y$ by updating the retriever $R$.
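To make the two-step process concrete, here is a minimal sketch of the retrieve-then-predict pipeline; `retriever.similarity` and `frozen_llm.generate` are hypothetical interfaces standing in for the tuned bi-encoder and the inference LLM, not the released implementation.

```python
# Minimal sketch of the retrieve-then-predict pipeline (Eqs. 1 and 2).
# `retriever.similarity` and `frozen_llm.generate` are hypothetical interfaces.

def uprise_predict(x, prompt_pool, retriever, frozen_llm, k=3):
    # Eq. (1): P+ = R(x, P) -- keep the k prompts most similar to the input.
    ranked = sorted(prompt_pool, key=lambda p: retriever.similarity(x, p),
                    reverse=True)
    positives = ranked[:k]  # p_1 (most similar) ... p_k
    # Eq. (2): y = LLM(p_k ⊕ ... ⊕ p_1 ⊕ x), with "\n" as the text delimiter.
    llm_input = "\n".join(list(reversed(positives)) + [x])
    return frozen_llm.generate(llm_input)
```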
Figure 2 compares prompt retrieval with typical prompt engineering methods: prompt design adds an engineered natural language prompt (Brown et al., 2020; Wei et al., 2022b), and prompt tuning tunes a soft prompt (Liu et al., 2021; Lester et al., 2021). In contrast, prompt retrieval tunes a retriever to retrieve natural language prompts, which is both interpretable and flexible. It uses the language model itself to label each prompt in the pool as positive or negative, and then tunes a retriever from this signal (Rubin et al., 2022). Such fine-tuned prompt retrieval has demonstrated effectiveness in the task-specific scenario (Rubin et al., 2022; Ye et al., 2023): a prompt retriever is tuned on one or multiple specific tasks using the training sets as the prompt pool. The retriever is then evaluated on the corresponding test sets.
Our goal is to achieve universality of the prompt retriever: the fine-tuned retriever can be directly used to retrieve prompts for unseen tasks and various inference LLMs, without the need for further tuning. We define this universality from two perspectives: cross-task retrieval and cross-model retrieval.
Cross-task retrieval. Considering the diversity of tasks in real-world applications, we propose cross-task retrieval: retrieving for task types on which the prompt retriever has not been trained. We simulate this setting by evaluating the prompt retriever on unseen task types: various tasks are grouped into different clusters based on their task types, and we hold out each task cluster for evaluation while training the retriever on all remaining clusters (Wei et al., 2022a).
Cross-model retrieval. Due to the high cost of tuning a prompt retriever with a large-scale LLM, we propose evaluating the capability to generalize from a small LLM to a large one. Specifically, we use a relatively small LLM for tuning the retriever, while using a much larger LLM for inference. Furthermore, we explore transferability between LLMs from different sources, as LLMs are developed by different companies and institutions.

Method
As shown in Figure 3, UPRISE uses a frozen LLM to supervise the fine-tuning of a prompt retriever on diverse tasks, and then uses this trained retriever to retrieve prompts for unseen task types with different LLMs during inference. In this section, we elaborate on our data construction, prompt scoring, retriever tuning, and inference pipeline.

Data Construction
Task Data. We use instruction templates from FLAN (Wei et al., 2022a) to convert task datasets into natural language instructions. Each task dataset corresponds to approximately seven templates. For each data example $(x_i, y_i)$, we randomly select one of the seven templates to convert $x_i$ into a task input and $y_i$ into a label completion. The option suffixes and newline characters "\n" are automatically removed from the task input to make the text format more similar to that of the pre-training corpus, improving prompting performance (van de Kar et al., 2022).
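As an illustration of this conversion, the sketch below applies a randomly chosen template and strips newlines and option suffixes; the template strings, field names, and the "OPTIONS:" marker are hypothetical placeholders rather than FLAN's actual templates.

```python
import random
import re

# Hypothetical templates standing in for FLAN's instruction templates.
TEMPLATES = [
    "Read the passage and answer the question.\n{passage}\nQuestion: {question}",
    "{passage}\nBased on the passage above, {question}",
]

def to_instruction(example):
    template = random.choice(TEMPLATES)
    task_input = template.format(**example)
    task_input = task_input.split("OPTIONS:")[0]   # drop the option suffix (marker is assumed)
    task_input = re.sub(r"\n+", " ", task_input)   # remove newline characters
    return task_input.strip(), example["answer"]
```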
Prompt pool. For each testing cluster, the prompt pool used for retrieval is made up of training demonstrations from the remaining task clusters (i.e., the clusters used to train the retriever). This is inspired by in-context learning (Brown et al., 2020), which presents a few training demonstrations before the task input to improve model performance. Each demonstration is a concatenation of the task input and the label completion. Our motivation is that the testing input may benefit from similar question types, topics, or reasoning chains in the retrieved demonstrations, even though the testing input and the demonstrations are of different task types.
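A minimal sketch of the pool construction; `clusters` is an assumed mapping from cluster name to (task input, label completion) pairs, and joining input and completion with "\n" is an illustrative choice of delimiter.

```python
# Sketch of the prompt-pool construction for one held-out test cluster.
# `clusters` maps a cluster name to (task input, label completion) pairs.

def build_prompt_pool(clusters, held_out_cluster):
    pool = []
    for name, demos in clusters.items():
        if name == held_out_cluster:
            continue  # never retrieve from the cluster under evaluation
        for task_input, completion in demos:
            pool.append(task_input + "\n" + completion)  # one demonstration
    return pool
```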

Prompt Scoring
For each training example $(x_i, y_i)$ in the training clusters, we collect a set of positive and negative prompts from the prompt pool $\mathcal{P} = \{p_j\}_{j=1}^{N_{\mathcal{P}}}$, where a positive prompt indicates that the frozen LLM achieves good task scores conditioned on the prompt-input concatenation. We use these positive and negative labels to supervise the contrastive learning of the retriever.
We categorize all tasks into two question types: text completion and multiple choice (Brown et al., 2020), and use different methods to score the prompts for each training example.
Text completion refers to free-form completion questions. We calculate the score of a prompt as

$$s(p_j, x_i, y_i) = \mathrm{metric}\bigl(y_i, y_i^{p_j}\bigr), \qquad (3)$$

where $y_i^{p_j}$ is the model prediction based on the input concatenation $p_j \oplus x_i$, $\oplus$ is the text delimiter "\n", and $\mathrm{metric}(\cdot)$ is the function used to calculate the task metric score (e.g., F1 or ROUGE).
Multiple choice refers to questions that require choosing one correct completion from several options. Suppose a multiple-choice question has $M$ options $\{o_m\}_{m=1}^{M}$, with $o_{y_i}$ the gold option. We feed the concatenation $p_j \oplus x_i$ to the LLM and calculate the per-token likelihood of each option, $\mathrm{LH}(o_m)$. The option with the highest likelihood is taken as the model prediction $y_i^{p_j}$ (Brown et al., 2020). Accuracy of the prediction, $\mathrm{acc}(y_i, y_i^{p_j})$, is a common metric for multiple-choice questions, but it only produces 0 or 1 for each example, making it hard to compare prompt effectiveness. To achieve a fine-grained comparison, we multiply the accuracy by the per-token likelihood of the gold option, normalized by the sum of the per-token likelihoods of all options. The final score is

$$s(p_j, x_i, y_i) = \mathrm{acc}\bigl(y_i, y_i^{p_j}\bigr) \times \frac{\mathrm{LH}(o_{y_i})}{\sum_{m=1}^{M} \mathrm{LH}(o_m)}. \qquad (4)$$
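The sketch below illustrates this scoring for multiple-choice questions with a HuggingFace causal LM. It is a hedged interpretation of Eq. (4): we take the per-token likelihood to be the exponentiated mean token log-probability of the option, and the tokenization details may differ from the actual implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-2.7B")
lm = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-2.7B").eval()

@torch.no_grad()
def per_token_likelihood(context: str, option: str) -> float:
    """LH(o_m): per-token likelihood of `option` continuing `context` (assumed
    to be the exponentiated mean token log-probability)."""
    ctx_ids = tok(context, return_tensors="pt").input_ids
    opt_ids = tok(option, return_tensors="pt").input_ids
    ids = torch.cat([ctx_ids, opt_ids], dim=1)
    logprobs = torch.log_softmax(lm(ids).logits[0, :-1], dim=-1)
    # Option tokens are predicted from positions ctx_len-1 ... end-1.
    start = ctx_ids.size(1) - 1
    token_lp = logprobs[start:].gather(1, opt_ids[0].unsqueeze(1))
    return token_lp.mean().exp().item()

def score_multiple_choice(prompt, x, options, gold_idx):
    """Eq. (4): accuracy times the normalized likelihood of the gold option."""
    context = prompt + "\n" + x  # p_j ⊕ x_i with the "\n" delimiter
    # Leading space so each option tokenizes as a continuation.
    lh = [per_token_likelihood(context, " " + o) for o in options]
    pred = max(range(len(options)), key=lambda m: lh[m])
    acc = 1.0 if pred == gold_idx else 0.0
    return acc * lh[gold_idx] / sum(lh)
```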
Prompt filtering. Intuitively, to collect the positive and negative prompts for each training example, we need to score every prompt in the prompt pool and identify the prompt that yields the best score as the positive prompt. Conversely, prompts that lead to the worst scores are labeled as negative prompts. However, scoring all the prompts can be computationally expensive (Rubin et al., 2022), even with a relatively small LLM.
To address this, we only score a subset of $L$ randomly sampled demonstrations; each demonstration is constrained to have the same task as the training example $(x_i, y_i)$. This is inspired by in-context learning, where the testing sample and training demonstrations share the same task, resulting in improved task scores. By scoring a subset of demonstrations, we significantly reduce the computational cost while increasing the likelihood of identifying positive prompts within the sampled subset.
Furthermore, in the case of a difficult question, all $L$ prompt-input concatenations may result in a score of 0. To address this, we repeat the sampling process to score another subset of $L$ prompts with the same task as $(x_i, y_i)$, until we find at least one prompt with a score greater than 0.
Among all the scored prompts for a training example, we label the prompt with the highest score as positive. For negatives, we randomly sample $B$ training demonstrations from the prompt pool, each with a task different from that of $(x_i, y_i)$. In addition, we label the $B$ demonstrations with the lowest scores among the sampled prompts as hard negatives; these are of the same task as $(x_i, y_i)$ but are less effective.
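Putting the sampling and labeling steps together, here is a minimal sketch using the settings reported later in the experiment section (L = 50 candidates per round, up to seven rounds, B = 20). `pool_by_task` (task name to demonstrations) and `score_fn(prompt, x, y)` are assumed helpers, e.g. a wrapper around the scoring functions above; each task pool is assumed to hold at least L demonstrations.

```python
import random

def label_prompts(x, y, task, pool_by_task, score_fn, L=50, B=20, max_rounds=7):
    scored = []
    for _ in range(max_rounds):
        # Same-task demonstrations, as in in-context learning.
        candidates = random.sample(pool_by_task[task], L)
        scored += [(score_fn(p, x, y), p) for p in candidates]
        if max(s for s, _ in scored) > 0:
            break
    else:
        return None  # no positive found: filter out this training example
    scored.sort(key=lambda t: t[0], reverse=True)
    positive = scored[0][1]
    hard_negatives = [p for _, p in scored[-B:]]  # same task, lowest scores
    other = [p for t, ds in pool_by_task.items() if t != task for p in ds]
    random_negatives = random.sample(other, B)    # different tasks
    return positive, hard_negatives + random_negatives
```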

Retriever Tuning
After labeling prompts for each training example, we split the collected data into two sets: 90% for training and 10% for validation. The prompt retriever is a bi-encoder model (Karpukhin et al., 2020) where the input encoder $E_X(\cdot)$ takes the task input $x_i$ as input, and the prompt encoder $E_P(\cdot)$ takes the prompt $p_j$ as input.
To train the prompt retriever, the InfoNCE loss (van den Oord et al., 2018) is used to maximize the similarity score between the encoded prompt and input for positive prompt-input pairs, and minimize it for (hard) negative pairs. For a single training example $(x_i, y_i)$ with positive prompt $p_i^{+}$ and (hard) negative prompts $p_{i,j}^{-}$, the loss is

$$\mathcal{L}\bigl(x_i, p_i^{+}, p_{i,1}^{-}, \ldots, p_{i,2B}^{-}\bigr) = -\log \frac{e^{\mathrm{sim}(x_i, p_i^{+})}}{e^{\mathrm{sim}(x_i, p_i^{+})} + \sum_{j=1}^{2B} e^{\mathrm{sim}(x_i, p_{i,j}^{-})}},$$

where $\mathrm{sim}(x_i, p) = E_X(x_i)^{\top} E_P(p)$ calculates the similarity score between input $x_i$ and prompt $p$ using inner products (Rubin et al., 2022).
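In PyTorch, this objective reduces to a cross-entropy over similarity logits. A minimal sketch for one training example, assuming the encoders output fixed-size vectors (e.g., BERT [CLS] embeddings):

```python
import torch
import torch.nn.functional as F

def infonce_loss(input_enc, pos_enc, neg_encs):
    """input_enc: (d,), pos_enc: (d,), neg_encs: (2B, d) -> scalar loss."""
    pos_sim = (input_enc @ pos_enc).view(1)   # sim(x_i, p_i^+)
    neg_sims = neg_encs @ input_enc           # sim(x_i, p_{i,j}^-), shape (2B,)
    logits = torch.cat([pos_sim, neg_sims]).unsqueeze(0)
    # Cross-entropy with target class 0 equals
    # -log( e^{pos} / (e^{pos} + sum_j e^{neg_j}) ).
    return F.cross_entropy(logits, torch.zeros(1, dtype=torch.long))
```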

Inference
After fine-tuning the retriever, we use the prompt encoder $E_P(\cdot)$ to encode the entire prompt pool. At inference time, for a testing task input $x_{\mathrm{test}}$, we compute its encoding $E_X(x_{\mathrm{test}})$ and use maximum inner-product search over the prompt pool to retrieve the $K$ most similar prompts, sorted by inner product in descending order and denoted $P^{+} = (p_1, \ldots, p_K)$. We then concatenate the prompts with the task input, resulting in $p_K \oplus \cdots \oplus p_1 \oplus x_{\mathrm{test}}$ (Rubin et al., 2022).
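A minimal sketch of this inference step using FAISS for maximum inner-product search; `encode_prompt` and `encode_input` are assumed wrappers around the tuned encoders $E_P$ and $E_X$.

```python
import numpy as np
import faiss  # pip install faiss-cpu

def build_index(pool, encode_prompt):
    vecs = np.stack([encode_prompt(p) for p in pool]).astype("float32")
    index = faiss.IndexFlatIP(vecs.shape[1])  # exact inner-product index
    index.add(vecs)
    return index

def retrieve_and_concat(x_test, pool, index, encode_input, k=3):
    query = encode_input(x_test).astype("float32")[None, :]
    _, idx = index.search(query, k)  # p_1 ... p_K, descending inner product
    prompts = [pool[i] for i in idx[0]]
    # Concatenate as p_K ⊕ ... ⊕ p_1 ⊕ x_test with "\n" as the delimiter.
    return "\n".join(list(reversed(prompts)) + [x_test])
```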
To evaluate the inference results, we use the same method described in Section 3.2 to generate predictions, and then use each task's corresponding evaluation metric to compute the scores.

Experiment Settings
LLMs. We use GPT-Neo-2.7B (Black et al., 2021) from EleutherAI to tune the retriever, and evaluate the performance on larger LLMs from various sources during inference, including BLOOM-7.1B (Scao et al., 2022) from BigScience, OPT-66B (Zhang et al., 2022) from Meta, and Davinci and text-davinci-001 from OpenAI, both belonging to the GPT3-175B (Brown et al., 2020) series. Greedy search is used to obtain predictions from all the LLMs.

Prompt scoring. We set the size of the randomly sampled subset to L = 50 and the number of (hard) negatives to B = 20. For difficult questions, we repeat the re-sampling process for up to seven rounds, as we found this sufficient to identify a positive prompt for 90% of the training examples. If no sampled prompt yields a score greater than 0, we filter out the corresponding training example.
Tuning. We initialize the two independent encoders of the retriever with BERT-base (Devlin et al., 2019). Each retriever is fine-tuned for three epochs, and the best checkpoint is chosen based on retrieval accuracy on the validation set. For detailed tuning hyperparameters, please refer to Appendix B.
Inference. During inference, we set the number K of concatenated prompts to a relatively small value of 3, to balance prompting performance and inference efficiency. For each dataset, we report metric scores on the test set when available, falling back to the validation set otherwise.

Main Results
We evaluate our prompt retriever on natural language understanding tasks where generative LLMs are known to need improvement (Liu et al., 2021). Table 1 compares the performance of UPRISE to vanilla zero-shot prompting.

Cross-Task Prompt Retrieval
Based on the results of GPT-Neo-2.7B, we can assess the retriever's ability to generalize across different task types. UPRISE has positive impacts on most of the testing clusters. Specifically, we achieve absolute gains of 8.5% and 14.6% on Reading Comprehension and Paraphrase Detection tasks, respectively. We also find that UPRISE shows consistent performance improvements across all tasks in the Closed-book QA and Natural Language Inference clusters.
However, UPRISE has negative impacts on Commonsense Reasoning and Coreference Resolution tasks. We conduct analyses in Appendix D to understand the reasons, revealing that Coreference Resolution hardly benefits from demonstrations and Commonsense Reasoning is harmed by differing demonstration formats.

Cross-Model Prompt Retrieval
In addition to evaluating cross-task generalization, we can explore the cross-model ability by examining the results of BLOOM, OPT, Davinci, and text-davinci-001. UPRISE continues to improve performance on Reading Comprehension, Closed-book QA, and Paraphrase Detection tasks across all LLMs. While the impact on Sentiment Analysis is negative with the small 2.7B GPT-Neo, we observe positive impacts when using larger LLMs. We achieve consistent gains on Natural Language Inference tasks with the models that have not been fine-tuned (BLOOM, OPT, and Davinci), but see a drop in performance on text-davinci-001, potentially because that model has been specifically fine-tuned on such tasks. Figure 4 presents the average performance across Reading Comprehension, Closed-book QA, Paraphrase Detection, Natural Language Inference, and Sentiment Analysis; the results indicate consistent performance gains across all LLMs.

Hallucination Mitigation of ChatGPT
Despite the strong abilities of ChatGPT, recent reports have shown that it suffers from hallucination: providing factually incorrect responses (Bang et al., 2023). To assess the versatility of UPRISE, we investigate whether it can mitigate this hallucination problem. We evaluate on three tasks: TruthfulQA (Lin et al., 2022) for detecting human falsehood, and FEVER2.0 (Thorne et al., 2018) and Covid-19 (Lee et al., 2021) for fact-checking.
Table 2 shows that UPRISE outperforms vanilla zero-shot prompting on the fact-checking tasks. Figure 5 presents an interesting case where 0-SHOT induces a correct generation of information ("Sonny & Cher... consisting of Sonny Bono and his wife Cher.") but an incorrect answer. In contrast, UPRISE induces a precise answer. We attribute this improvement to the retrieved demonstration, which is of the Natural Language Inference task type and may motivate the model to correctly infer from its parametric memory. This finding suggests that the limited memory of ChatGPT may not be the only factor behind the hallucination challenge; rather, it highlights the importance of effective inference mechanisms. Prompt engineering techniques such as UPRISE can help address this issue. Evaluation details and further analysis can be found in Appendix C.

Universal Prompt Retriever
We replace the universal retriever with three alternatives: 1) RANDOM samples prompts from the prompt pool randomly; 2) TOPK-BM25 uses the sparse retriever BM25 (Robertson and Zaragoza, 2009) to retrieve prompts similar to the testing input; and 3) TOPK-BERT follows KATE (Liu et al., 2022) in using SBERT (Reimers and Gurevych, 2019) to retrieve similar prompts. Figure 6 displays the comparative performance using GPT-Neo-2.7B, where UPRISE achieves the best results among all the universal retrievers. This suggests that word-level (TOPK-BM25) or sentence-level (TOPK-BERT) similarity to the testing input is not the only decisive factor for a good prompt, underscoring the effectiveness of fine-tuning a retriever with the language model itself as a data labeler.
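For reference, the two training-free baselines can be sketched with the rank_bm25 and sentence-transformers libraries. Whitespace tokenization and the all-MiniLM-L6-v2 checkpoint are illustrative choices, not necessarily the exact configurations used in the paper.

```python
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

def topk_bm25(x, pool, k=3):
    # Sparse lexical retrieval over whitespace-tokenized pool entries.
    bm25 = BM25Okapi([p.split() for p in pool])
    return bm25.get_top_n(x.split(), pool, n=k)

def topk_bert(x, pool, k=3):
    # Dense retrieval with SBERT sentence embeddings.
    sbert = SentenceTransformer("all-MiniLM-L6-v2")
    hits = util.semantic_search(sbert.encode([x]), sbert.encode(pool), top_k=k)[0]
    return [pool[h["corpus_id"]] for h in hits]
```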

Universal Prompt Pool
For each testing task cluster, we use training demonstrations of the remaining clusters to construct the prompt pool. To evaluate its effectiveness, we replace it with the raw texts of WikiText-103 (Merity et al., 2016), which belongs to the pre-training corpora of many LLMs. The results in Table 3 show that our prompt pool outperforms the raw texts on all the testing clusters.
In Appendix F, we analyze which training task clusters are retrieved when testing on each held-out cluster, showing that tasks with diverse question/answer types, such as Reading Comprehension and Closed-book QA, are most frequently retrieved. Furthermore, in Tables 7-11 in the Appendix, we conduct a case study on the relevance between the retrieved prompts and the task input, observing that the cross-task improvement benefits from similar question types, topics, text formats, or logical relationships. These findings underscore the importance of including diverse task demonstrations in the prompt pool (Asai et al., 2022; Su et al., 2022).

Analysis on Training Data Diversity
We conduct ablations to assess the impact of training data diversity.

Impact of reducing diversity. We reduce the range of trained task types to see the impact on testing performance: for each testing task type, we randomly select 50% of the remaining task types to train a retriever. The results in Figure 7 indicate a decline in performance as diversity decreases. Nonetheless, the retriever trained on 50% of the remaining task types still outperforms 0-SHOT on most task types.

Table 4: Comparative results with few-shot prompting. # Demos is the number of demonstrations prepended to the input instruction; FEW-SHOT is vanilla few-shot prompting where the demonstrations are randomly sampled from the training demonstrations of the target task (Brown et al., 2020).

Figure 8: Generalizability of each task type; ✓ means the performance of prompt retrieval is better than 0-SHOT.
Generalizability of each task type. We then reduce the number of trained tasks to only one to test its generalizability. Specifically, for each task type, we train a retriever on this type alone and then evaluate on the remaining task types. For example, if the retriever trained on A outperforms 0-SHOT when testing on B, we regard task type A as generalizable to task type B. The results in Figure 8 demonstrate that tasks with diverse question/answer types, such as Reading Comprehension and Closed-book QA, tend to be more generalizable and can serve as representative choices for training a universal retriever.

Exploration of Few-Shot Learning
We compare UPRISE with vanilla few-shot prompting and apply UPRISE to few-shot prompt retrieval in Table 4: (1) UPRISE approaches and even outperforms vanilla FEW-SHOT prompting on most task types; (2) UPRISE-REMAIN-TARGET, which uses the retriever trained on the remaining tasks to retrieve from the target task's pool, outperforms vanilla few-shot prompting; (3) substantial improvements are then observed with UPRISE-ALL-TARGET, a unified retriever trained on all task types. These findings emphasize UPRISE's effectiveness as a comprehensive method for both zero-shot and few-shot prompt retrieval.

Related Work
Our work is related to prompt engineering methods including prompt design, prompt tuning, and prompt search. Here we discuss prompt search, which relates most closely to our work, and describe prompt design and prompt tuning in Appendix E.
Prompt search involves searching for prompts from pre-training corpora or downstream tasks to construct the input text (Gao et al., 2021; Liu et al., 2022; Rubin et al., 2022; Ye et al., 2022, 2023). To retrieve prompts for the test examples, retrievers such as the sparse retriever BM25 (Robertson and Zaragoza, 2009) and dense retrievers based on SBERT (Reimers and Gurevych, 2019) are employed. Furthermore, methods like EPR (Rubin et al., 2022) and CEIL (Ye et al., 2023) use the LLM itself to score the searched prompts, thereby eliminating the need for manual prompt engineering and ensuring prompting performance.

Conclusion
This paper explores training a lightweight and versatile prompt retriever to improve the zero-shot performance of LLMs. We investigate the retriever's ability to generalize from trained task types to unseen task types, and from a small LLM to different LLMs of much larger scales. We hope our paper will spur further research on developing a universal assistant for the ever-expanding landscape of tasks and large language models.

Limitations
While UPRISE has shown consistent performance gains on most testing clusters, it has limited impact on tasks that are directly formulated as language modeling, such as Coreference Resolution and Commonsense Reasoning. Future work may explore including other formats of demonstrations, such as chain-of-thought (Wei et al., 2022b), to improve performance.
Besides, the universality of UPRISE has been verified only on language tasks in our experiments; future work may explore its versatility by incorporating prompts such as tool-use APIs (Schick et al., 2023) and multimodal information (Huang et al., 2023; Zhang et al., 2023).

Appendices

A Task Clustering
We use the following datasets for each task cluster.

C Hallucination Mitigation of ChatGPT
We evaluate ChatGPT's performance using its released API, with the gpt-3.5-turbo-0301 model and a temperature of 0. Human evaluation is conducted to check accuracy on sampled test examples from each dataset: 66 from TruthfulQA, following Bang et al. (2023), 100 from FEVER2.0, and 100 from the scientific subset of Covid-19. As these task types have no overlap with those listed in Appendix A, we use the retriever trained on all the listed task types for this cross-task and cross-model evaluation.
On the Covid-19 dataset, which requires a true/false answer to input claims, we observe that vanilla zero-shot prompting often leads to ambiguous answers, which we count as incorrect. However, by concatenating retrieved prompts, primarily focused on natural language inference, ChatGPT generates the correct answer. An example is shown in Figure 9. This suggests the model has access to the necessary information but struggles with making inferences, possibly due to RLHF training (Christiano et al., 2017) or the inherent difficulty of the task.
Figure 9 example (condensed). Task input: "On July 8 2020, the following claim was made: There is a possibility for some animals to become infected COVID-19 through close contact with infected humans. Q: Was this claim true or false?" UPRISE prepends a retrieved NLI demonstration: "Scientists at the Genome Institute of Singapore (GIS) have discovered the complete genetic sequence of a coronavirus isolated from a Singapore patient with SARS. Can we infer the following? Singapore scientists reveal that SARS virus has undergone genetic changes. Yes, No, or Maybe? Maybe". The 0-SHOT response is: "As an AI language model, I cannot determine the veracity of a claim without additional context or information. However, it is widely accepted that some animals, such as cats, dogs, and minks, can become infected with COVID-19 through close contact with infected humans."

D Analysis on Performance Decline
We conduct analysis on why UPRISE shows negative performance when testing on Coreference Resolution and Commonsense Reasoning tasks.
Coreference Resolution hardly benefits from demonstrations. For the Coreference Resolution task type, we observe that even vanilla few-shot prompting underperforms zero-shot prompting, as highlighted in Table 6. This trend is consistent with GPT-3 (Brown et al., 2020), GLaM (Du et al., 2022), and LaMDA-PT (Thoppilan et al., 2022), as reported by FLAN (Wei et al., 2022a): these models also exhibit limited performance gains from few-shot prompting compared to zero-shot on Coreference Resolution. We deduce that the task's inherent nature might make it less responsive to demonstrations, regardless of their alignment with the task.

Commonsense Reasoning is harmed by differing demonstration formats. By analyzing the retrieved training task types (shown in Figure 10), we find that Closed-book QA is the most frequently retrieved type when testing Commonsense Reasoning. However, the two types differ significantly in input-output format: Closed-book QA follows a question-answering format, whereas Commonsense Reasoning follows a language modeling format, which may explain the decrease in performance.

E Extended Related Work
Prompt Design. In-context learning (Brown et al., 2020) helps LLMs transfer to new tasks via inference alone, by conditioning on a concatenation of training demonstrations and the testing input, without any gradient updates.
With standard in-context learning, LLMs struggle to tackle complex arithmetic, commonsense, and symbolic reasoning tasks. Chain-of-Thought (CoT) prompting (Wei et al., 2022b) provides LLMs with a series of intermediate reasoning steps as demonstrations, inducing the LLM to produce its own intermediate reasoning steps that lead to the final answer.
Prompt Tuning. Traditional natural language prompts require significant human engineering and can lead to suboptimal performance. Prompt tuning instead learns a prompt represented by continuous parameters rather than discrete natural language tokens (Liu et al., 2021): it takes the source text embedded by the LM input embeddings and prepends learnable embeddings to obtain a new embedded sequence. A variant of prompt tuning is prefix tuning (Li and Liang, 2021; Lester et al., 2021), where the learnable vectors are added not only to the input but to all transformer layers.
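A toy sketch of this mechanism, assuming a HuggingFace-style causal LM that exposes its input-embedding layer; only the prepended soft prompt is trainable, and the number of soft tokens is an illustrative choice.

```python
import torch
import torch.nn as nn

class SoftPromptLM(nn.Module):
    def __init__(self, lm, n_soft_tokens=20):
        super().__init__()
        self.lm = lm.requires_grad_(False)  # freeze the LLM
        d = lm.get_input_embeddings().embedding_dim
        self.soft = nn.Parameter(torch.randn(n_soft_tokens, d) * 0.02)

    def forward(self, input_ids):
        tok_emb = self.lm.get_input_embeddings()(input_ids)        # (B, T, d)
        soft = self.soft.unsqueeze(0).expand(tok_emb.size(0), -1, -1)
        # Prefix the soft prompt; only `self.soft` receives gradients.
        return self.lm(inputs_embeds=torch.cat([soft, tok_emb], dim=1))
```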

F Analysis on Retrieved Training Clusters
To further interpret the impact of the retrieved prompts on the testing task performance, we analyze which training task clusters are retrieved when testing on the held-out cluster.
As shown in the visualization in Figure 10, clusters with diverse question types, such as Reading Comprehension, account for high retrieval ratios (e.g., 80.7% when testing on Closed-book QA and 36.1% on NLI), while the less diverse Sentiment Analysis cluster does not reach the top ranks. This finding further supports that including tasks of diverse question/answer types in the training data contributes to good generalizability of the retriever.

Figure 3 :
Figure 3: Training and inference pipeline. In the training stage, a frozen LLM is used to supervise the tuning of a prompt retriever: both the LLM and the retriever take the prompt-input pairs as input, and the task scores given by the LLM supervise the contrastive learning of the retriever. In the inference stage, for each task input, the tuned prompt retriever retrieves positive prompt(s) to guide the inference model to predict a task output. Overall, we follow a cross-task and cross-model paradigm where the task types and LLMs used for training may differ from those used for inference.

Figure 4 :
Figure 4: Cross-model results of the cross-task retriever.

Figure 5 :
Figure 5: Case of the chats of vanilla zero-shot prompting and UPRISE on the FEVER2.0 dataset; the label completion is "True".

Figure 6 :
Figure 6: Comparison of different universal retrievers; we report the average performance on each testing cluster.

Figure 7 :
Figure 7: Impact of training data diversity on testing task performance. 100%-DIVERSE is the UPRISE retriever trained on all remaining task types, 50%-DIVERSE reduces the range of trained task types to half of UPRISE, and 0-DIVERSE is 0-SHOT.

Figure 9 :
Figure 9: Case of the chats of vanilla zero-shot prompting and UPRISE on the Covid-19 dataset; the label completion is "True".

Figure 10 :
Figure 10: Percentages of retrieved prompts in each training task cluster when testing on the held-out cluster.


Table 1 :
Zero-shot performance across tasks and LLMs. The model Davinci-001 is text-davinci-001, the fine-tuned version of Davinci. The method 0-SHOT is vanilla zero-shot prompting, with only the input instruction fed into the LLM.

Table 3 :
Comparison of average performance on GPT-Neo-2.7B with different prompt pools (columns: Reading Comprehension, Closed-book QA, Paraphrase Detection, NLI, Sentiment Analysis): RAW TEXT uses raw data of the pre-training corpora; UPRISE uses training demonstrations of the trained tasks.

Table 5 :
Hyperparameter settings for tuning a prompt retriever.

Table 6 :
Average scores of vanilla zero-shot and few-shot prompting on Coreference Resolution tasks.