Prompt-Guided Retrieval Augmentation for Non-Knowledge-Intensive Tasks

Retrieval-augmented methods have received increasing attention to support downstream tasks by leveraging useful information from external resources. Recent studies mainly focus on exploring retrieval to solve knowledge-intensive (KI) tasks. However, the potential of retrieval for most non-knowledge-intensive (NKI) tasks remains under-explored. There are two main challenges to leveraging retrieval-augmented methods for NKI tasks: 1) the demand for diverse relevance score functions and 2) the dilemma between training cost and task performance. To address these challenges, we propose a two-stage framework for NKI tasks, named PGRA. In the first stage, we adopt a task-agnostic retriever to build a shared static index and select candidate evidence efficiently. In the second stage, we design a prompt-guided reranker to rerank the nearest evidence according to task-specific relevance for the reader. Experimental results show that PGRA outperforms other state-of-the-art retrieval-augmented methods. Our analyses further investigate the factors influencing model performance and demonstrate the generality of PGRA. Code is available at https://github.com/THUNLP-MT/PGRA.


Introduction
Retrieval-augmented methods aim at enhancing dense models with non-parametric indices to better leverage external knowledge (Borgeaud et al., 2022; Izacard et al., 2022; Wang et al., 2022). By decoupling knowledge storage from model parameters, retrieval-augmented methods can achieve comparable or better performance than large-scale pre-trained models with orders of magnitude fewer parameters on tasks such as language modeling (Khandelwal et al., 2020; Guu et al., 2020; Borgeaud et al., 2022) and question answering (Lee et al., 2019; Karpukhin et al., 2020; Izacard and Grave, 2021). Moreover, as external knowledge is stored in the non-parametric index, knowledge can be updated simply by replacing the index without further training (Izacard et al., 2022). Therefore, retrieval-augmented methods have attracted increasing interest in recent years and achieved promising results in various natural language processing tasks (Zhang et al., 2018; Khandelwal et al., 2020; Guu et al., 2020; Karpukhin et al., 2020).
Despite their success, retrieval-augmented methods for the majority of non-knowledge-intensive (NKI) tasks remain under-explored. Following Lewis et al. (2020), we define tasks that "humans could not reasonably be expected to perform without access to an external knowledge source" as knowledge-intensive (KI) tasks and the others as NKI tasks. Previous studies (Karpukhin et al., 2020; Izacard and Grave, 2021; Izacard et al., 2022) have extensively explored the potential of retrieval-augmented methods for various KI tasks. As for NKI tasks, most efforts are devoted to language modeling (Khandelwal et al., 2020; Guu et al., 2020), text generation (Lewis et al., 2020), and machine translation (Zhang et al., 2018; Khandelwal et al., 2021), although there is a wide range of NKI tasks, such as sentiment analysis (Ding et al., 2008; Socher et al., 2013), text classification (Hovy et al., 2001; Li and Roth, 2002), and linguistic acceptability (Warstadt et al., 2018). Therefore, we ask this question: Can retrieval-augmented methods assist with a wider range of NKI tasks?
However, leveraging retrieval-augmented methods for more types of NKI tasks faces two major challenges. On the one hand, there is a demand for diverse relevance score functions. To retrieve the most desirable evidence from the index, proper relevance score functions are needed. Although the relevance score functions suitable for predicting the next token distribution are well studied in works on language modeling, text generation, and machine translation, NKI tasks require more diverse relevance score functions. For example, the text classification task may favor evidence with similar sentence-level semantics (Reimers and Gurevych, 2019; Gao et al., 2021b), while the linguistic acceptability task may prefer linguistically similar evidence (Warstadt et al., 2018). Therefore, it is non-trivial to satisfy all these diverse requirements in a single framework. On the other hand, there is a dilemma between training cost and task performance. The external knowledge index and retriever are crucial for the performance of a retrieval-augmented method (Lee et al., 2019; Karpukhin et al., 2020; Izacard and Grave, 2021). Previous works show that jointly training the index with the dense model results in better performance (Guu et al., 2020; Xiong et al., 2020). However, due to the large size of external knowledge, updating the index periodically during training is computationally expensive. On the contrary, keeping the index static is computationally cheap but makes it hard to meet the diverse requirements of NKI tasks. Therefore, it is difficult to balance the trade-off between training cost and task performance.

Figure 1: The framework of our proposed Prompt-Guided Retrieval Augmentation (PGRA) method. We first retrieve candidates through a task-agnostic retriever (Section 2.1), then use a task-specific prompt and pre-trained language model (PLM) to rerank the candidates (Section 2.2). We send the top results to the reader to make predictions (Section 2.3).
To address these challenges, we propose a two-stage framework, entitled PGRA, to better retrieve task-specific resources for NKI tasks. The overall framework is shown in Figure 1. In the first stage, we use a task-agnostic retriever to recall candidate evidence, which builds a shared static index for all tasks. In the second stage, we adopt prompt-guided pre-trained language models (PLMs; Brown et al., 2020; Zhang et al., 2022) as a reranker to rerank the candidates according to task-specific relevance score functions. Finally, we feed the reranked top evidence to the reader to generate answers. By leveraging textual prompts, our framework can satisfy the demand for diverse relevance score functions. As both the retriever and the reranker are training-free, the expensive computational cost of periodically updating the index during training is avoided. At the same time, experimental results justify the effectiveness of our framework on various datasets. Therefore, we successfully break the dilemma between training cost and task performance.
Our main contributions are three-fold:

• We propose a prompt-guided retrieval augmentation method for a wider range of non-knowledge-intensive tasks, which are hardly explored in previous works.

• By combining the retrieve-and-rerank procedure with textual prompts, our framework maintains reasonably low training cost while satisfying diverse task-specific relevance score function requirements.

• Extensive experimental results and analysis show that our framework is effective for diverse non-knowledge-intensive tasks.

Methods
In this section, we introduce our proposed Prompt-Guided Retrieval Augmentation (PGRA) method, as shown in Figure 1. Our proposed method has three main components: (i) a task-agnostic retriever that uses a shared retriever to build a static index and select top-k candidate evidence from large-scale external resources; (ii) a prompt-guided reranker that adopts PLMs to measure task-specific relevance for reranking candidate evidence; (iii) a reader that takes the final top-d (d < k) reranked evidence as augmentation to generate answers.

Task-Agnostic Retriever
Given that the external resource is extremely large-scale, from millions (Khandelwal et al., 2020, 2021) to billions (Wang et al., 2022; Izacard and Grave, 2021; Chen et al., 2017) of entries, we use a shared retriever to build the static index once. The key and value of the index are the task-agnostic text representation and the text itself, respectively. The index is shared across tasks, and thus we save a significant amount of training cost (see Section 4.5 for discussion). Formally, given the input as query q and the external resource containing a collection of text R = {t_1, t_2, ..., t_|R|}, we first encode representations for both the query and the text, denoted as Enc(q) and Enc(t_i), respectively. The representations of the text then serve as keys of the index. Then, we use the dense inner product to compute the similarity based on the index:

Sim(q, t_i) = Enc(q)^T Enc(t_i).    (1)

With the similarity scores, we obtain the top-k nearest evidence according to the retrieval distribution, which is the softmax over these scores. We follow the faiss (Johnson et al., 2019) implementation to efficiently perform approximate retrieval via Maximum Inner Product Search (MIPS). These top-k pieces of evidence are regarded as candidates for further reranking in the second stage.
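The first-stage retrieval above can be sketched as follows. This is a minimal, exact-search illustration (faiss's IndexFlatIP computes the same inner-product ranking, with approximate variants for scale); the toy bag-of-characters encoder and the tiny corpus are placeholders, not the BERT encoder or Wiki1M resource used in the paper:

```python
import numpy as np

def build_index(encode, corpus):
    """Encode every text once; keys are representations, values are the texts."""
    keys = np.stack([encode(t) for t in corpus])  # shape (|R|, dim)
    return keys, list(corpus)

def retrieve_top_k(encode, query, keys, values, k):
    """Exact maximum inner product search: Sim(q, t_i) = Enc(q)^T Enc(t_i)."""
    sims = keys @ encode(query)
    probs = np.exp(sims - sims.max())
    probs /= probs.sum()                  # retrieval distribution (softmax over scores)
    top = np.argsort(-sims)[:k]
    return [(values[i], float(probs[i])) for i in top]

def toy_encode(text):
    """Toy encoder: normalized bag-of-characters vector, purely illustrative."""
    v = np.zeros(26)
    for ch in text.lower():
        if ch.isalpha():
            v[ord(ch) - 97] += 1.0
    return v / (np.linalg.norm(v) + 1e-9)

keys, values = build_index(toy_encode, ["a good movie", "bad weather", "a great film"])
print(retrieve_top_k(toy_encode, "good film", keys, values, k=2))
```

Because the index is built once and shared, only `retrieve_top_k` runs per query; no retriever training or index refresh is needed.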

Prompt-Guided Reranker
As discussed above, the task-agnostic retriever in the first stage selects the nearest candidates by evaluating the similarity between the static index representations of the input and the external text. However, such a shared retriever neglects the fact that different NKI tasks prefer their own task-specific relevance score functions, which are crucial for retrieving useful evidence.

In order to meet the demand for diverse relevance score functions, we further design a task-specific reranker in the second stage. To avoid the expensive cost of training a task-specific retriever per NKI task, we exploit the in-context learning ability of prompt-guided PLMs.
At first, we adopt in-context learning under few-shot setups to encode task-specific representations of the input query q and the top-k pieces of candidate evidence e_1, ..., e_k from Stage 1. Then, we feed an auto-regressive PLM (e.g., OPT; Zhang et al., 2022) with both the constructed prompts and our input to obtain the task-specific representations of the next predicted tokens:

x_q = p_1 ⊕ l_1 ⊕ ... ⊕ p_m ⊕ l_m ⊕ p_q,
x_{e_i} = p_1 ⊕ l_1 ⊕ ... ⊕ p_m ⊕ l_m ⊕ p_{e_i},

where p_1, p_2, ..., p_m are the m prompts of the exemplars, l_1, l_2, ..., l_m are the labels, p_q and p_{e_i} (i = 1, ..., k) are the prompts of the input query and evidence e_i, respectively, and ⊕ denotes concatenation. The prefix text is thus concatenated to the prompt of the query or the evidence as the textual input. Lastly, the inputs are fed to the model to obtain the last hidden states of the first new token, h*_q ∈ R^d and h*_{e_i} ∈ R^d. It is worth noting that text in the external knowledge resource may lack explicit labels for NKI tasks. Through in-context learning with prompt guidance, the representations of the inputs and external evidence encoded by the PLM implicitly contain the critical features needed to solve various tasks. Similar to the first stage, we compute the similarity between the representations of the input q and its candidate evidence e_i, which reflects their task-specific relevance:

Sim*(q, e_i) = (h*_q)^T h*_{e_i}.

Finally, we rerank the candidate evidence according to this task-specific relevance score and select the top-d results for the reader described in the next section.
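The reranking step can be sketched as below. Here the task-specific representations h*_q and h*_{e_i} are given as plain arrays rather than extracted from a prompted OPT model, and the prompt template is an illustrative placeholder, not the exact PromptSource template used in the paper:

```python
import numpy as np

def build_prompt(exemplars, labels, target_text, template):
    """Concatenate the m exemplar prompts with their labels, then the prompt of
    the target (the query q or a candidate evidence e_i). `template` stands in
    for a task prompt template, e.g. for sentiment analysis."""
    parts = [f"{template(p)} {l}" for p, l in zip(exemplars, labels)]
    parts.append(template(target_text))
    return "\n".join(parts)

def rerank(h_q, h_evidence, d):
    """Rerank candidates by task-specific relevance (h*_q)^T h*_{e_i} and keep
    the top-d for the reader. h_q: shape (dim,); h_evidence: shape (k, dim)."""
    scores = h_evidence @ h_q
    order = np.argsort(-scores)
    return order[:d].tolist(), scores[order[:d]].tolist()

# Illustrative: four candidate representations, keep d = 2.
h_q = np.array([1.0, 0.0])
h_e = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [-0.5, 0.5]])
idx, top_scores = rerank(h_q, h_e, d=2)
print(idx)  # candidates 0 and 2 score highest

prompt = build_prompt(["great movie", "awful plot"], ["positive", "negative"],
                      "dull film", lambda t: f"Review: {t} Sentiment:")
print(prompt)
```

In the actual system, each `h` vector would be the PLM's last hidden state at the first new-token position after the corresponding prompt; both prompting and scoring are training-free.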

Reader
To encode useful information from the reranked evidence and infer the final answer for the query text q, we use the FiD (Fusion-in-Decoder; Izacard and Grave, 2021) model as our reader, which is based on a Seq2seq pre-trained Transformer (Vaswani et al., 2017) such as T5 (Raffel et al., 2020). Specifically, each piece of evidence obtained from the reranker is concatenated with the query and independently fed into the encoder. The decoder takes the embeddings of these concatenations produced by the encoder and computes cross-attention over them to give the final answer prediction. Following prompt-based learning (Schick and Schütze, 2021; Liu et al., 2021), we transfer the NKI tasks to the form of language modeling, where the answers are deduced according to the label prediction in a context. The overall reader is trainable, and its parameters are updated given the training samples of the required NKI tasks.
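The reader's input construction can be illustrated as follows; the "question:"/"context:" markers are assumptions borrowed from common FiD implementations, not necessarily the exact format used here:

```python
def fid_inputs(query, evidence_list):
    """Build one encoder input per reranked evidence: each piece of evidence is
    concatenated with the query and encoded independently; the decoder then
    cross-attends over all encoded passages jointly to predict the answer."""
    return [f"question: {query} context: {e}" for e in evidence_list]

inputs = fid_inputs("the movie was dull", ["a tedious film", "lazy direction"])
print(inputs[0])
```

Because each query-evidence pair is encoded separately, the encoder cost grows linearly in d rather than quadratically in the total concatenated length, which is what makes feeding 8-16 pieces of evidence feasible.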

Experimental Setups
Tasks and Metrics. Following the setups in LM-BFF (Gao et al., 2021a), we conduct the experiments mainly on four types of NKI tasks: (1) Sentiment analysis. We use a variety of datasets from different domains, including SST-2 (Socher et al., 2013) and SST-5 (Socher et al., 2013) for the general domain with two and five labels, CR (Ding et al., 2008) for comment reviews, MR (Pang and Lee, 2004) for movie reviews, and MPQA (Wiebe et al., 2005) for news opinions; (2) Linguistic acceptability. We adopt CoLA (Warstadt et al., 2018), which aims to discriminate whether a sentence is grammatically correct; (3) Question classification. We use TREC (Hovy et al., 2001; Li and Roth, 2002), in which a question needs to be classified into six categories; (4) Subjectivity analysis. We use Subj (Pang and Lee, 2004), which requires judging whether a sentence is subjective or objective. As for metrics, we report Matthew's correlation for CoLA and accuracy for all other tasks. More details about datasets and metrics can be found in Appendix G.
External Resources and Models. As for the external resources, we use Wiki1M following Gao et al. (2021b). Furthermore, in the first stage, we use BERT-base-uncased (Devlin et al., 2019) as our shared task-agnostic retriever. We also compare with other first-stage retrievers in Section 4.6.

In the second stage, we use OPT-13b (Zhang et al., 2022) as our auto-regressive PLM to obtain the task-specific representations. We further explore the effect of the size of our PLMs in Section 4.3. Finally, we adopt T5-base and T5-large (Raffel et al., 2020) as our readers to generate answers.
Implementation Details. We use the checkpoints of T5-base, T5-large, and OPT-13b from HuggingFace. Our manually designed prompts are obtained from PromptSource (Bach et al., 2022). We fine-tune the T5 model on each task with the AdamW (Loshchilov and Hutter, 2019) optimizer. We search hyper-parameters over learning rates of {1e-5, 2e-5, 5e-5, 8e-5, 1e-4} and batch sizes of {4, 8}. We set the top-k in the first stage to 150, while the top-d in the second stage is 16 with T5-base and 8 with T5-large due to computational resource limitations. We further compare the effects of k and d in Section 4.2. We use 8 shots for prompts during reranking in the second stage. Our experiments are conducted with one NVIDIA V100 GPU.
Baselines. We compare our proposed method PGRA with the following baselines: (1) In-context learning (ICL; Brown et al., 2020), which directly uses OPT-13b, the same as our PLM in the second stage, to generate answers under few-shot setups (8 shots in our settings); (2) T5 (Raffel et al., 2020), which uses T5-base and T5-large in supervised learning; (3) k-Nearest Neighbour (k-NN; Cunningham and Delany, 2020), in which the model makes a majority vote based on distances between embeddings; (4) LM-BFF (Gao et al., 2021a), which is a few-shot inference method tuned with dedicated prompts; (5) RAG (Lewis et al., 2020), which treats context samples as hidden variables and jointly trains the retriever and generator; (6) FiD (Izacard and Grave, 2021), which concatenates query and context samples in the encoder and generates answers with cross-attention.

To ensure a fair comparison, we uniformly adopt the same readers (i.e., T5-base and T5-large) for all retrieval-augmented methods. For k-NN and LM-BFF, we also use T5-base and T5-large for building representations and training. For the in-context learning baseline, we use the same templates as ours in the second stage.

Results
We compare our proposed PGRA with the aforementioned baseline methods, with the results shown in Table 2. We include results on both T5-base and T5-large models for generality. We run our experiments three times and report details of each run in Appendix A. We report average results here and first-run results in the analysis sections below. Firstly, PGRA significantly outperforms the simple k-Nearest Neighbour and few-shot methods, including in-context learning with OPT-13b and LM-BFF. The k-Nearest Neighbour method is simply based on the distances of embeddings encoded by T5. As for the few-shot methods, in-context learning uses prompts to elicit PLMs to generate answers without updating parameters. It is worth noting that we use in-context learning with OPT-13b as our prompt-guided reranker in the second stage. The standalone performance of in-context learning is ordinary, so it is somewhat surprising that it can assist PGRA. We will further discuss the reason behind this in Section 4.1. Meanwhile, LM-BFF is further fine-tuned on the prompts to give answers. Thus, its performance is clearly higher than k-Nearest Neighbour and in-context learning with OPT-13b, but a large gap to PGRA remains.
Secondly, compared to supervised learning (i.e., T5-base and T5-large) and retrieval-augmented baselines, PGRA still outperforms them across most tasks. Specifically, the retrieval-augmented baselines with a T5-base reader outperform supervised learning with the T5-base model, while the retrieval-augmented baselines with a T5-large reader are worse than or comparable to supervised learning with the T5-large model. Furthermore, our method PGRA clearly surpasses these baselines in both T5-base and T5-large setups. In conclusion, extensive experimental results show that our PGRA is effective on diverse NKI tasks.

Effects of Label Consistency
In this section, we probe the influence of retrieved evidence on the model performance of our PGRA from the aspect of label consistency. Note that our external text comes without any task-specific labels. Therefore, we use a T5-base model fine-tuned on the specific task, which is the closest to our PGRA reader but without retrieval, to generate pseudo-labels for all text in the external resource. In detail, if the pseudo-label of a piece of evidence is the same as the ground-truth label of the input, we say the evidence is consistent with the input. We can then directly examine the relation between the number of consistent evidence and model performance at the instance level. Specifically, out of 16 pieces of total retrieved evidence, the number of consistent evidence with the same (pseudo) labels as the input varies from 0 to 16.
Taking the SST-2 task as an example, we count the total number of instances with different numbers of consistent evidence. We then compute the average accuracy of PGRA for the instances with the same number of consistent evidence. The results are shown in Figure 2a. Firstly, since we rerank the evidence based on the relevance score of pseudo-labels, the number of instances rises as the number of consistent evidence increases. This phenomenon indicates that we can almost always find sufficient task-specific evidence retrieved from the first stage, except for a small portion of inputs, which is possibly caused by the limited size of k in the first stage. Secondly, the average accuracy also rises as the number of consistent evidence increases, which reflects that model performance is related to (pseudo) label consistency. However, when the number of consistent evidence is small (i.e., 3 and 4), the accuracy can also be high. This is because the number of such instances is too small for the result to be significant. Furthermore, it is interesting to find that when the number of consistent evidence is high enough (i.e., larger than 13), the accuracy approaches 100%, which shows that there is high potential in increasing label consistency to improve model performance. As can be seen from the figure, it still holds that the more label-consistent evidence there is, the higher the accuracy the model can achieve. The difference between PGRA and FiD is that PGRA can retrieve more label-consistent evidence than FiD.

Effects of k and d
In this section, we further investigate the effects of k and d on performance, where k and d are the numbers of retrieved evidence in the first and second stages, respectively. In detail, we run PGRA with different k or different d, while keeping other setups the same as in the main experiments. As seen from Figure 3, larger k values consistently improve the average performance, while larger d values maintain a relatively stable trend. As for k, larger values mean providing more candidate evidence for the second-stage reranker, making it easier to find appropriate instances with (pseudo) label consistency. As for d, larger values indicate more consistent evidence only if the proportion of consistent evidence stays the same; at the same time, the top consistent evidence is identical across d values, and the candidate pool is fixed by the same k, so their performance is close. We expect that PGRA can solve diverse NKI tasks better with larger k when sufficient computing resources are available.

Effects of OPT Model Sizes
In this section, we first investigate the effect of the size of the OPT model used in the prompt-guided reranker on performance. Specifically, we vary the size of the OPT models and conduct experiments on five downstream tasks. The model performances are shown as the orange line in Figure 4. The overall trend is clear: larger OPT models achieve better performance. We believe that larger OPT models are better at applying task-specific features to encode representations, and thus obtain more effective task-specific relevance scores to retrieve evidence.
To validate this assumption, we further investigate the relation between (pseudo) label consistency and model performance for different OPT model sizes. We define the pseudo-label consistency score (i.e., consistency score) as the proportion of retrieved instances with the same pseudo-label as the input. For example, given an input with a positive ground-truth label, when our PGRA recalls 5 consistent and 3 inconsistent pieces of evidence, the consistency score is 5/8 = 62.5%. As shown in Figure 4, overall, larger models obtain higher consistency scores and result in better performance, which is within expectation.
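The consistency score defined above is a simple proportion; a minimal sketch, with the worked 5-of-8 example from the text:

```python
def consistency_score(input_label, evidence_pseudo_labels):
    """Proportion of retrieved evidence whose pseudo-label matches the input's
    ground-truth label: 5 consistent out of 8 retrieved gives 5/8 = 62.5%."""
    matches = sum(1 for l in evidence_pseudo_labels if l == input_label)
    return matches / len(evidence_pseudo_labels)

print(consistency_score("positive", ["positive"] * 5 + ["negative"] * 3))  # 0.625
```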

Effects of Evidence Granularity
In this work, we propose to use a task-specific relevance score to retrieve from sentence-level external resources, rather than the passage-level resources popular in previous studies (Chen et al., 2017; Izacard and Grave, 2021; Guu et al., 2020). To demonstrate that our granularity of external evidence is appropriate, we compare model performance between sentence-level and passage-level evidence. For passage-level evidence, we use WikiDPR (Chen et al., 2017) as the external resource. We randomly sample 1M passages from WikiDPR to keep the same data size as our sentence-level external resource in the main experiment. The results are shown in Figure 5. Across all NKI tasks, our sentence-level setup significantly surpasses the passage-level setup. This phenomenon indicates that sentence-level evidence better satisfies the task-specific demands of NKI tasks. For example, it is easier to show a clear sentiment orientation in a sentence than in a paragraph.

Training Cost
To solve the dilemma between training cost and task performance, we propose PGRA, where both the retriever and reranker are training-free. To demonstrate this statement, in this section, we approximately compare the training cost of our method PGRA with that of training a task-specific retriever per task, the latter of which needs to periodically refresh its indexes (i.e., refreshed-index models).
Considering that a significant amount of training time is concentrated on building and refreshing indexes, we mainly measure this part. Due to the limitation of computational resources, we conduct our main experiment on 1M data from Wikipedia. In our PGRA, we only need to build the index once without extra training, and the time cost c is about 0.5 hours. For refreshed-index models, although the time cost c of building the index once is almost the same, they need to periodically refresh the index n times to learn a task-specific retriever. Thus, for all h tasks, their total cost is c × n × h, which is much larger than our time cost c. It is worth noting that external resources are usually much larger than ours (Chen et al., 2017; Wang et al., 2022; Izacard et al., 2022), so the gap between refreshed-index models and our PGRA will grow even larger.
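The cost comparison above can be made concrete with a small sketch. Only c = 0.5 hours comes from our measurement; the values of n and h below are hypothetical, chosen purely for illustration:

```python
def index_cost(c_build_hours, n_refreshes, h_tasks, shared_static=True):
    """Approximate index-building cost in hours. A shared static index is
    built once (cost c); refreshed-index training rebuilds it n times for
    each of h tasks (cost c * n * h)."""
    if shared_static:
        return c_build_hours
    return c_build_hours * n_refreshes * h_tasks

# Hypothetical: 10 refreshes per task, 8 tasks, c = 0.5 h from our measurement.
print(index_cost(0.5, n_refreshes=10, h_tasks=8, shared_static=False))  # 40.0
print(index_cost(0.5, n_refreshes=10, h_tasks=8, shared_static=True))   # 0.5
```

The gap scales with corpus size as well, since c itself grows with the number of indexed entries.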

Generalization on Retrievers
In this section, we study the generalization of PGRA with different first-stage retrievers. We use popular retrievers, namely BM25 (Robertson and Zaragoza, 2009), BERT (Devlin et al., 2019), and SimCSE (Gao et al., 2021b), to compare FiD and our PGRA. As shown in Figure 6, our method PGRA consistently outperforms FiD regardless of which retriever is used. This phenomenon indicates that PGRA can adapt to different types of first-stage retrievers to solve various NKI tasks. Furthermore, BM25 performs worse than both the BERT and SimCSE counterparts, which is consistent with previous studies (Zhao et al., 2022).

Case Study
In Table 3, we present a case in SST-2 with different retrieved evidence from the baselines (i.e., FiD and RAG) and our PGRA. As shown in the table, our PGRA correctly predicts the answer, while both FiD and RAG are wrong. Analyzing the retrieved evidence from the different methods, we find that sentences retrieved by FiD and RAG may have overlapping tokens or similar semantics. For example, the evidence retrieved by FiD is highly related to filming and stories, consistent with the "title", "characters", and "camera" in the input.

However, their retrieved evidence hardly shares the same sentiment orientation needed to assist the downstream task; some pieces may even have the opposite sentiment, such as the second sentence retrieved by FiD. In contrast, the evidence retrieved by our PGRA clearly has a negative sentiment orientation, though some pieces may not be explicitly related to the query, such as the second retrieved sentence. In general, evidence retrieved by our PGRA method based on task-specific relevance can effectively improve performance on NKI tasks.

Related Work
Retrieval-augmented methods are widely used for knowledge-intensive tasks such as question answering (Chen et al., 2017; Karpukhin et al., 2020; Izacard and Grave, 2021; Izacard et al., 2022), where explicit knowledge is required to achieve reasonable performance, even for humans (Lewis et al., 2020). Such systems usually follow a retriever-reader architecture, where an existing retriever like BM25 (Robertson and Zaragoza, 2009) or a trained dual-encoder (Lee et al., 2019; Luan et al., 2021) is used, followed by a reader model to fuse the retrieved results. We focus on non-knowledge-intensive tasks and propose a prompt-guided retrieval method for mining fine-grained textual information across multiple tasks, without training a specific retriever for each of them. Recently, Wang et al. (2022) also applied retrieval-augmented methods to more general tasks by keeping a shared BM25 retriever unchanged for each task while modifying the reader for information filtering. In contrast, we propose a two-stage retrieval method to find task-specific information at a low cost for different downstream tasks.

Prompt-based methods have gained much advance in recent years (Schick and Schütze, 2021; Liu et al., 2021; Gao et al., 2021a), where downstream tasks can be solved by transforming the problem into the form of language modeling. Combined with PLMs such as GPT-3 (Brown et al., 2020) and OPT (Zhang et al., 2022), such methods show strong performance under zero-shot or few-shot settings. Recently, there have also been works that leverage prompts for retrieval. For example, Asai et al. (2022) collected large-scale instruction-annotated datasets for training instruction-guided retrievers, and van de Kar et al. (2022) use prompts for searching regex-based patterns from unlabeled corpora. Our method is inspired by these works and differs in that we leverage pre-trained models to retrieve according to task-specific relevance and propose an efficient retrieval-augmented method for NKI tasks.

Input: The title not only describes its main characters but the lazy people behind the camera as well. (Label: Negative)

Method: FiD (Prediction: Positive)
Evidence: (1) The story overlaps science fiction, theology, and philosophy. (2) However, the film's greatness is not limited to a few isolated scenes.

Method: RAG (Prediction: Positive)
Evidence: (1) The 1978 King Cup was the 20th season of the knockout competition since its establishment in 1956.

Method: PGRA (Prediction: Negative)
Evidence: (1) Once it had been shown that the film could not be realized, "The Works" was officially abandoned. (2) The play can also be seen as a discussion of romanticism and reality, in a quite disillusional way.

Table 3: Case study of FiD, RAG, and our PGRA with the top-2 retrieved evidence in SST-2.

Conclusion
In this paper, considering the demand for diverse relevance score functions to solve a wider range of NKI tasks, we propose a two-stage method, PGRA. In the first stage, we use a task-agnostic retriever to build a shared static index and select candidate evidence. In the second stage, we design a prompt-guided reranker to rerank candidates with task-specific relevance for the reader. Extensive experimental results show that our proposed method PGRA overall outperforms previous state-of-the-art retrieval-augmented methods. Furthermore, we explore the influence of label consistency between the input and the evidence retrieved by the prompt-guided reranker, and demonstrate the generality of our PGRA across both evidence granularities and types of retrievers. In the future, we will consider ways to improve pseudo-label consistency to enhance model performance according to our analyses.

Limitations
In this work, we present PGRA to retrieve task-specific context evidence to support NKI tasks. However, our work has some limitations. Firstly, we have not experimented with our PGRA on sentence-pair tasks, such as MRPC (Dolan and Brockett, 2005), in which the model needs to infer the relationship between two sentences. Retrieving two sentences from an external datastore is non-trivial, as there are hardly any sentence pairs in the Wikipedia datastore. A larger corpus with more diverse data sources may help in this case. Secondly, we restrict our PGRA to classification tasks rather than generation tasks. Similar to sentence-pair tasks, retrieving sentences that may help the model generate text is more complex. For example, data related to both the source and the target may help in machine translation (Khandelwal et al., 2021). We will research this question in the future. Last but not least, we have not extensively tested the performance of our method on KI tasks, except for some preliminary analysis in Appendix F, which restricts the generality of our method. Solving KI tasks depends on knowledge in a passage-level external datastore, and matching such information possibly needs more specialized prompts for our method. Thus, we leave it for future work.

Ethics Statement
Currently, large language models with retrieval augmentation require a large amount of computation for indexing a large-scale datastore, retrieving from that datastore, and refreshing the index during training. Despite improving model performance, retrieval augmentation methods demand substantial computing power. This not only limits the usability of such models but also harms fairness in the community. Our work tries to balance the performance of retrieval augmentation methods and the training cost, as our method does not need to retrain a new retriever and rebuild an index when facing a new task. This may help the community in developing new low-cost methods.
When selecting the external datastore and tasks, we follow previous studies and choose the well-known Wikipedia dataset and common tasks. Biases from the data may be reflected in the results. In addition, when using the model on a larger scale, more consideration needs to be paid to dealing with biases in retrieved text.

A Multiple Runs of the Main Experiment
We run PGRA three times under the settings of the main experiment in Table 2 and report results of these runs in Table 8.

B Experiments with More Retrieved Evidence for FiD Baselines

We run additional experiments for the FiD (T5-base) baseline with more retrieved evidence. The results are shown in Table 4. It can be seen that with more retrieved evidence, although the average scores become higher, FiD still underperforms PGRA.

C Impact of k, d and OPT Model Sizes
We explore the impact of k, d, and second-stage OPT model sizes. The full analysis is given in Section 4.2 and Section 4.3. Table 5, Table 6, and Table 9 show the detailed performance of our method on each task. For each ablation, we keep the other hyper-parameters the same as used in Table 2.

D Label Consistency
We include the details of label consistency scores of our PGRA with different second-stage OPT models on each task in Table 10.

E Generality on Retrievers
We include the detailed performance of FiD and our PGRA on all tasks with different first-stage encoders, namely BM25, BERT, and SimCSE. The results are shown in Table 11.

F Generalization on KI tasks
We perform experiments on the FEVER (Thorne et al., 2018) benchmark. FEVER is a fact verification task, requiring a model to classify whether a claim is factually correct. Due to resource limitations, we sample 5k claim-label pairs from the training set and 1k pairs from the validation set.
We run FiD and PGRA, both with a T5-base backbone, and keep the other hyperparameters the same as in Table 2. Note that we ran this experiment with a sentence-level datastore (Wiki1M). FiD and PGRA achieve 73.8% and 77.7% accuracy, respectively. The results again confirm the performance increase with PGRA. However, one might notice that FiD with a traditional passage-level datastore can achieve better performance. We acknowledge this as a limitation of our method, because a passage-level datastore requires quite different relevance metrics, as stated in the Limitations section. This is also a possible future direction.

G Datasets and Metrics
We use the Wiki1M dataset from SimCSE (Gao et al., 2021b) as our external datastore. This dataset is a subset of Wikipedia used in Gao et al. (2021b). We report information on the tasks in Table 12. We use the same configuration as Gao et al. (2021a), including dataset splits.

I Prompts
We include all prompts used in all 8 tasks in Table 13.

Figure 2: The pseudo-label consistency of samples in SST-2 with PGRA and FiD (T5-base models for both). We plot the accuracy scores of instances with different numbers of label-consistent evidence, along with the number of such instances.

Figure 3: Accuracy against k (left) and d (right). Details of performance on different tasks can be found in Appendix C.

Figure 4: Average performance and average consistency score on 5 tasks (SST-2, SST-5, CoLA, MR and MPQA) against different OPT model sizes. Detailed information can be found in Appendix C.

Figure 5: Performance of PGRA with passage-level and sentence-level external datastores.

Figure 6: Comparison between FiD and our PGRA with BM25, BERT and SimCSE retrievers. More details of specific performance in all tasks can be found in Appendix E.

Table 2: The results of baselines and our PGRA. For models with a T5-base backbone, we use d = 16. For models with a T5-large backbone, we use d = 8 in the second stage due to GPU memory limitation. The best results are bolded, and the second-best ones are underlined.

Table 4: Results of the FiD (T5-base) baseline with more retrieved evidence.

Table 5: Detailed analysis of the impact of top-k in the first stage.

Table 6: Detailed analysis of the impact of top-d in the second stage.

Table 7: Information of the tasks.

Table 8: Multiple-run results of PGRA.

Table 9: Detailed analysis of the impact of OPT sizes with k = 150, d = 16.

Table 10: Pseudo-label consistency with different OPT models. "Average" is the average label consistency score on the five tasks. Performance is the average over the 5 tasks. We keep k = 150, d = 16 in this experiment.

Table 11: Generalization performance of FiD and PGRA with different first-stage encoders.

Table 12: Information of the tasks and datasets.