Self-Knowledge Guided Retrieval Augmentation for Large Language Models

Large language models (LLMs) have shown superior performance without task-specific fine-tuning. Despite this success, the knowledge stored in the parameters of LLMs can still be incomplete and difficult to update due to the computational costs. As a complement, retrieval-based methods can offer non-parametric world knowledge and improve performance on tasks such as question answering. However, we find that the retrieved knowledge does not always help and occasionally even has a negative impact on the original responses. To make better use of both internal knowledge and external world knowledge, we investigate eliciting the model's ability to recognize what it knows and does not know (also called self-knowledge) and propose Self-Knowledge guided Retrieval augmentation (SKR), a simple yet effective method that lets LLMs refer to questions they have previously encountered and adaptively call for external resources when dealing with new questions. We evaluate SKR on multiple datasets and demonstrate that it outperforms chain-of-thought based and fully retrieval-based methods with either InstructGPT or ChatGPT.


Introduction
Large language models (LLMs, Brown et al., 2020; Chowdhery et al., 2022; Ouyang et al., 2022) have achieved remarkable performance without much task-specific fine-tuning. However, the full-parametric knowledge stored in LLMs can still be incomplete and difficult to update due to the computational costs. Alternatively, retrieval-augmented methods (Guu et al., 2020; Lewis et al., 2020b; Borgeaud et al., 2022; Izacard et al., 2022; Shi et al., 2023) can utilize external resources such as Wikipedia and offer complementary non-parametric knowledge to enrich the contextualized information, thus helping the model generate more reliable answers.
Retrieval augmentation has been shown to be very effective for models such as BERT (Devlin et al., 2019), BART (Lewis et al., 2020a), and T5 (Raffel et al., 2020) in various tasks (Karpukhin et al., 2020; Khandelwal et al., 2020, 2021; Izacard and Grave, 2021; Wang et al., 2022; Guo et al., 2023). As LLMs become more and more "knowledgeable", recent studies show that the benefit brought by retrieval augmentation is shrinking (Mallen et al., 2022; Yoran et al., 2023). Moreover, we find that the retrieved passages can even negatively affect what LLMs originally know. As illustrated in Figure 1, the model can directly give a reasonable answer, "German Shepherds are often used as seeing-eye dogs"; however, it is distracted and gives an incorrect one when retrieved passages are added.
The above findings show that one should be more careful when applying retrieval-based methods, since it is difficult to know in advance whether the retrieved results are better than what the LLM has already captured. To this end, a key issue is to figure out what LLMs do well (e.g., they answer correctly without assistance) and what they do not (e.g., they answer incorrectly and external information can lead to improved results).
Unfortunately, LLMs themselves have a limited ability to recognize what they know and do not know, which is also called "self-knowledge" (Yin et al., 2023). However, such an ability is crucial for generating truthful responses (Kadavath et al., 2022) and could help LLMs themselves "decide when and when not to use tools" such as a retriever (Mialon et al., 2023).
In this paper, we investigate eliciting the self-knowledge of LLMs and propose a simple yet effective Self-Knowledge guided Retrieval augmentation (SKR) method that flexibly calls the retriever to make better use of both internal and external knowledge. In particular, unlike existing studies that evaluate this ability through specifically designed metrics or datasets, we collect the self-knowledge of training questions by comparing performance with and without retrieval augmentation. Then, we propose several strategies to detect the self-knowledge corresponding to a question by referring to the collected training questions, either using the LLMs themselves through prompting or explicitly training a small model. Finally, we leverage the elicited self-knowledge to better solve the question through adaptive retrieval augmentation.

Related Work
Retrieval-Augmented LLMs Recent studies show that retrieval-augmented methods can enhance the reasoning ability of LLMs (Trivedi et al., 2022; He et al., 2022; Yu et al., 2023; Shao et al., 2023; Jiang et al., 2023) and make responses more credible and traceable (Xu et al., 2023b; Qian et al., 2023). For example, Trivedi et al. (2022) use chain-of-thought (Wei et al., 2022) reasoning steps as queries and use the retrieved results to guide further reasoning and retrieval. He et al. (2022) use an external natural language inference model to select the reasoning path best supported by retrieved evidence. Yu et al. (2023) propose using retrieval feedback to refine the output of LLMs to be more reliable and accurate. Xu et al. (2023b) propose search-in-chain, which makes LLMs interact with retrievers to improve accuracy and credibility. These methods aim to integrate sufficient external knowledge for a better reasoning process, while we propose to better utilize both internal and external knowledge by eliciting the self-knowledge of LLMs.
Another line of work tries to teach LLMs to use external tools, including retrievers, calculators, other foundation models, etc. (Schick et al., 2023; Shen et al., 2023; Qin et al., 2023). These works focus more on leveraging the language understanding capabilities of LLMs to deploy suitable tools in different scenarios, while our work investigates the self-knowledge of LLMs and tries to integrate them with retrievers in a more flexible manner.

Self-Knowledge in LLMs "Self-knowledge" in LLMs was originally mentioned by Kadavath et al. (2022), who use it to measure LLMs' confidence in their own knowledge and reasoning. This ability is further defined as "the ability to understand limitations on the unknowns" and evaluated by Yin et al. (2023), who find a considerable gap between self-knowledge in models and in humans. To explore LLM capabilities more extensively, unanswerable and more challenging datasets have also been proposed (Rajpurkar et al., 2018; Srivastava et al., 2022; Suzgun et al., 2022). Our work is also related to detecting what LLMs know and do not know, but we do not design new evaluation metrics or challenging datasets to test this ability. Instead, by explicitly introducing external resources, we detect the knowledge boundary of LLMs through performance changes. Moreover, instead of evaluating each question independently, we propose several ways to elicit self-knowledge by referring to existing cases.

Method
Our method is described under the question-answering setting, which has been a popular way to interact with and assess LLMs. The overall pipeline is shown in Figure 2 and includes collecting, eliciting, and using the self-knowledge of LLMs. We introduce each step below.

Collecting Self-Knowledge of LLMs from Training Samples
Given a dataset D with training question-answer pairs {q_j, a_j}_{j=1}^{|D|}, we can use the LLM M to generate the answer for each question q_i via few-shot in-context learning (Brown et al., 2020):

â(M, q_i) = M(q_1 • a_1, ..., q_d • a_d, q_i),   (1)
where • denotes concatenation and {q_j • a_j}_{j=1}^{d} are d demonstrations.
The generated answer â(M, q_i) reflects the internal knowledge of M about question q_i. Meanwhile, we can possibly find passages from external resources that may be related to q_i, and such passages can be used as additional information for the model input. Formally, for each question, we first use a pre-trained retriever R to find possibly related information from a corpus C:

p_i = R(q_i, C),   (2)

where p_i = {p_i1, p_i2, ..., p_ik} are the top-k retrieved passages for q_i. In practice, we set R as the dense passage retriever (Karpukhin et al., 2020) and C as passage chunks from Wikipedia. Then, we use M again to generate the answer with retrieval augmentation:

â_R(M, q_i) = M(q_1 • a_1, ..., q_d • a_d, p_i • q_i).   (3)

Given the answers â(M, q_i), â_R(M, q_i), and the ground-truth answer a_i, we categorize each question into a positive subset D+ or a negative subset D− based on the difference between results:

q_i ∈ D+ if E(â(M, q_i), a_i) > E(â_R(M, q_i), a_i),
q_i ∈ D− if E(â(M, q_i), a_i) < E(â_R(M, q_i), a_i),   (4)

where E is an evaluation metric such as accuracy or exact match score; we discard the question q_i if both â(M, q_i) and â_R(M, q_i) are incorrect.
Finally, the training set is split into the subset D+ = {q+_1, ..., q+_m}, which includes questions that M can answer correctly without external information (LLM knowns), and the subset D− = {q−_1, ..., q−_n}, where external information leads to more accurate results (LLM unknowns).
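The collection step above can be sketched as follows. Here `ask_llm` and `ask_llm_ret` are hypothetical stand-ins for the few-shot LLM call without and with retrieved passages (Eqs. 1 and 3), and the metric E is simplified to exact match; questions where the two settings tie are also skipped, mirroring the discard rule:

```python
def exact_match(pred: str, gold: str) -> bool:
    """A minimal evaluation metric E; the paper also uses accuracy."""
    return pred.strip().lower() == gold.strip().lower()

def collect_self_knowledge(train_pairs, ask_llm, ask_llm_ret):
    """Partition training questions into D+ (LLM knowns: the direct
    answer is already correct) and D- (LLM unknowns: retrieval
    augmentation fixes an otherwise wrong answer). Questions where
    neither setting is better (both wrong or both right) are dropped."""
    d_plus, d_minus = [], []
    for q, gold in train_pairs:
        direct_ok = exact_match(ask_llm(q), gold)
        retrieval_ok = exact_match(ask_llm_ret(q), gold)
        if direct_ok and not retrieval_ok:
            d_plus.append(q)
        elif retrieval_ok and not direct_ok:
            d_minus.append(q)
    return d_plus, d_minus
```

The two returned lists are exactly the D+ and D− subsets used by the eliciting strategies in the next section.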

Eliciting Self-Knowledge of LLMs
Four different strategies are proposed to detect the self-knowledge of target questions: direct prompting, in-context learning, training a classifier, and nearest neighbor search. We use the LLM itself in the former two and explicit smaller models in the latter two.

Direct Prompting Given a question q_t, a straightforward way to detect whether LLMs are capable of solving it is to ask them directly:

Direct Prompting (prompt)
{q_t} Q: Do you need additional information to answer this question?
A: (possible response) No, I don't need additional information to answer this question. / Yes, I need additional information to answer this question.
Here we use the prompt "Do you need additional information to answer this question?" as a template and detect self-knowledge according to the response. We consider the LLM capable (or not capable) of solving the question when it responds that it "doesn't need (or needs) additional information". Direct prompting may work intuitively, but it tests each question independently and does not make use of the training questions collected in Section 3.1. To remedy this issue, we leverage the collected self-knowledge from training questions in the next three strategies.

In-Context Learning LLMs have shown a strong capability to learn from demonstrations and infer through few-shot in-context learning (Brown et al., 2020). We select a few training questions from both D+ and D− as demonstrations to elicit the self-knowledge for the question q_t. Here we use the answer templates "No, I don't need..." or "Yes, I need..." in demonstrations based on whether the corresponding question comes from the positive set D+ or the negative set D−, respectively.
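As a sketch, the in-context-learning prompt can be assembled from D+ and D− demonstrations as below. The template wording follows the paper; demonstration selection is simplified to taking the first few questions from each subset, which is an assumption for illustration, not the paper's selection procedure:

```python
PROBE = "Q: Do you need additional information to answer this question?"
NO_NEED = "A: No, I don't need additional information to answer this question."
NEED = "A: Yes, I need additional information to answer this question."

def build_icl_prompt(q_t, d_plus, d_minus, shots_per_side=2):
    """Build the SKR-icl prompt: a few positive demonstrations
    (answered 'No, I don't need...'), a few negative ones
    (answered 'Yes, I need...'), then the target question q_t
    with the answer left open for the LLM to complete."""
    lines = []
    for q in d_plus[:shots_per_side]:
        lines += [q, PROBE, NO_NEED]
    for q in d_minus[:shots_per_side]:
        lines += [q, PROBE, NEED]
    lines += [q_t, PROBE, "A:"]
    return "\n".join(lines)
```

The LLM's completion after the final "A:" is then parsed for "No, I don't need" versus "Yes, I need" to obtain the self-knowledge label.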
The proposed direct prompting and in-context learning methods can elicit the self-knowledge of LLMs to some extent. However, they have several limitations. First, both methods require designing prompts and calling the LLMs for each new question, which is costly in practice. Second, in-context learning can be unstable due to contextual bias and sensitivity (Zhao et al., 2021; Lu et al., 2022), and such issues are harder to address for closed-source LLMs. Third, they cannot make use of all collected questions due to maximum-token constraints. To make our method more practical and avoid the above issues, we further leverage smaller models to help elicit self-knowledge.

Training a Classifier Given D+ and D−, we can treat the task as two-way classification (e.g., labeling q_i in D+ as positive and q_i in D− as negative) and use all the samples to train a classifier such as BERT-base (Devlin et al., 2019) explicitly:

ŷ_i = softmax(W h_i + b),   (5)

where h_i is the sentence-level representation of q_i from BERT-base, and W and b are parameters of the classification head. The parameters are optimized by minimizing the cross-entropy loss between the predicted label distribution ŷ_i and the ground-truth label of q_i. Once training is complete, we can infer the label of a question q_t similarly to Eq. 5.

Nearest Neighbor Search Instead of training an explicit smaller model, we can infer the labels of questions through k-nearest-neighbor (kNN) search with a pre-trained fixed encoder, as shown in Figure 3.
kNN (Fix and Hodges, 1989) is a widely used algorithm that benefits a range of NLP tasks (Khandelwal et al., 2020, 2021; Shi et al., 2022; Xu et al., 2023a). Our motivation is similar: if two questions are close in the semantic embedding space, the LLM should show similar self-knowledge for both of them. Formally, we encode each question into an embedding and compute the semantic similarity through the cosine distance sim(q_t, q_i) = e(q_t) · e(q_i) / (‖e(q_t)‖ · ‖e(q_i)‖), where q_i ∈ {q+_1, ..., q+_m, q−_1, ..., q−_n} and e(·) is the representation from a sentence encoder such as SimCSE (Gao et al., 2021). Then we search for the top-k nearest neighbors with the highest similarity. If the top-k nearest neighbors include ℓ positive ones and k − ℓ negative ones, we label the question q_t as positive if ℓ/(k − ℓ) > m/n, and as negative otherwise (m and n are the numbers of questions in D+ and D−, respectively).
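A minimal sketch of this nearest-neighbor decision, assuming the decision rule compares the local ratio of positives among the neighbors against the global prior. Embeddings here are plain vectors; in the paper they come from a sentence encoder such as SimCSE:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def knn_label(q_emb, pos_embs, neg_embs, k=3):
    """Label a new question by its k nearest training questions.
    With l positives among the top-k, the rule l/(k-l) > m/n is
    applied in the cross-multiplied form l*n > (k-l)*m to avoid
    division by zero when all k neighbors are positive."""
    scored = [(cosine(q_emb, e), 1) for e in pos_embs]
    scored += [(cosine(q_emb, e), 0) for e in neg_embs]
    top = sorted(scored, key=lambda t: t[0], reverse=True)[:k]
    l = sum(label for _, label in top)
    m, n = len(pos_embs), len(neg_embs)
    return "positive" if l * n > (k - l) * m else "negative"
```

Note that ℓ/(k − ℓ) > m/n is algebraically equivalent to ℓ/k > m/(m + n), i.e., the question is labeled positive when positives are over-represented among its neighbors relative to their overall share.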

Using Self-Knowledge for Adaptive Retrieval Augmentation
The self-knowledge given by the responses from LLMs (via direct prompting or in-context learning) or by the predicted labels (via the trained classifier or k-nearest-neighbor search) reflects whether external knowledge is necessary for the question q_t. Therefore, we can call the retriever adaptively instead of using it for every new question: if q_t is labeled positive, M answers directly as in Eq. 1; otherwise, the retrieved passages are added to the input as in Eq. 3.
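Putting the pieces together, the adaptive step reduces to a small dispatch. All three callables are hypothetical stand-ins for the components described above (any of the four eliciting strategies, the direct few-shot call, and the retrieval-augmented call):

```python
def skr_answer(question, self_knowledge, llm_direct, llm_with_retrieval):
    """Adaptive retrieval augmentation: answer directly when the
    elicited self-knowledge says the LLM already knows the answer
    (positive label), and call the retriever-augmented path otherwise."""
    if self_knowledge(question) == "positive":   # LLM knowns
        return llm_direct(question)
    return llm_with_retrieval(question)          # LLM unknowns
```

This is the only place where the retriever is invoked at inference time, so the cost of retrieval is paid only for questions predicted to need external knowledge.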

Baselines
In addition to the Zero-Shot and Few-Shot settings with direct output, we also compare with chain-of-thought (CoT) reasoning based methods, including Zero-Shot-CoT (Kojima et al., 2022) with the simple prompt "Let's think step by step", Manual-CoT (Wei et al., 2022) with manually written demonstrations, Auto-CoT (Similarity) with demonstrations automatically selected by semantic similarity (Liu et al., 2022; Rubin et al., 2022), and Auto-CoT (Diversity) with demonstrations selected by semantic diversity (Zhang et al., 2023). For retrieval-based methods, we compare with our implemented Manual-CoT-IR, which prepends retrieved passages before generating the answers; IRCoT (Trivedi et al., 2022), which retrieves passages using CoT reasoning steps as queries; and CoT-RR (He et al., 2022), which uses an external model to verify multiple reasoning paths against retrieved evidence and deduces the answer through self-consistency (Wang et al., 2023).

Implementation Details
By applying the different strategies of Section 3.2 to elicit self-knowledge, we denote our SKR method as SKR_prompt, SKR_icl, SKR_cls, and SKR_knn, respectively. For SKR_knn, we choose k from 3 to 10 according to the sizes of the datasets. For LLMs, we use InstructGPT (text-davinci-003) and ChatGPT (gpt-3.5-turbo-0301) through the OpenAI API (platform.openai.com). We use 4 demonstrations with CoT reasoning in few-shot settings and the top-3 passages as additional information in retrieval-based methods to fit the maximum length constraints.

Main Results
The main results are shown in Table 1. Overall, our proposed SKR_knn method achieves the best average results across the five datasets. Compared with Manual-CoT and the fully retrieval-based Manual-CoT-IR, our method gains 4.08%/2.91% improvements using InstructGPT and 4.02%/4.20% improvements using ChatGPT, respectively. By comparing different strategies to elicit self-knowledge, we find that 1) SKR_prompt shows relatively poor results, which shows that direct prompting may not be a good way to detect the self-knowledge of LLMs. These results are also in line with Yin et al. (2023), who find that self-knowledge in LLMs is relatively low and lags behind that of humans. 2) SKR_icl and SKR_cls work but do not show consistent improvements. For example, SKR_icl gives the second-best average results using InstructGPT; however, its results on CommonsenseQA and StrategyQA are not better than Manual-CoT and Manual-CoT-IR, respectively. SKR_cls gives the best results on StrategyQA and TruthfulQA using ChatGPT but performs less well on the others. The former demonstrates the sensitivity and bias of contextual information in in-context learning, and the latter reflects the difficulty of modeling self-knowledge across different datasets and LLMs by fine-tuning a pre-trained BERT.
From the results of the other baselines, we find that both internal and external knowledge have their own limitations. On the one hand, the process of CoT reasoning can be treated as internal knowledge from LLMs; however, it does not always bring significant improvement. For example, when evaluated on CommonsenseQA, Manual-CoT does not outperform the Few-Shot counterpart, where explicit reasoning steps are not required. The results from Wei et al. (2022) and Zhang et al. (2023) also show that CoT reasoning works well on arithmetic and symbolic tasks, while the gain is limited for tasks related to commonsense.
On the other hand, the retrieved passages can be seen as external knowledge from open resources, but they are not always helpful either. For example, Manual-CoT-IR shows substantial improvement over Manual-CoT on TemporalQA and TabularQA, which include the most knowledge-intensive questions. However, it can even make the results worse on StrategyQA, where the multi-hop questions are challenging and the retrieved passages may not be directly useful for answering. These results show that it is necessary to apply retrieval augmentation judiciously in different scenarios by taking the knowledge of the LLMs themselves into account.

Effects of Different Templates for Eliciting Self-Knowledge of LLMs
To directly prompt LLMs to elicit self-knowledge, we designed different templates, collected the responses, and evaluated the performance on the questions that the LLMs thought they could solve directly. The results are shown in Table 2.
First, for all designed templates, LLMs can show either a positive response (e.g., directly giving the predicted answer) or a negative response (e.g., expressing the need for external information) to a specific question. Second, interestingly, we find that the model achieves around 70%-73% accuracy on questions that it thought could be answered directly, indicating that for around 30% of questions the model does not know its own incapability (i.e., "unknown unknowns"). Nevertheless, how to prompt LLMs to express reasonable confidence in their knowledge in a more automatic, comprehensive, and generalizable way remains an open question.

Effects of Elicited Self-Knowledge across Different Datasets
We investigate the benefit brought by the elicited self-knowledge across different datasets. In each dataset, we collect the questions from the development set where LLMs show opposite responses with and without retrieval; then we use these questions to check whether the self-knowledge gives useful guidance on using retrieval augmentation or not.

Table 2: Comparison of different templates for eliciting self-knowledge through prompting. We use questions from TruthfulQA and list some possible responses by InstructGPT. Accuracy is evaluated on questions to which the model gives a positive response (i.e., questions where the model is confident it can answer directly).
Figure 4: The fine-grained effect of elicited self-knowledge on each dataset using different strategies.
The results are shown in Figure 4. The y-axis shows the percentage of "beneficial guidance", indicating how many questions are correctly answered under the guidance of self-knowledge. For example, without any prior knowledge, we have on average a 50% chance of getting a better result. However, the values of SKR_prompt are relatively low and can even fall below 50%, which shows that self-knowledge obtained from direct prompting may not be that useful across different tasks. The results of SKR_icl and SKR_cls are much better and benefit most datasets by integrating more examples. SKR_knn improves further, providing 55% (StrategyQA) to 78% (TruthfulQA) beneficial guidance across the different datasets.

Effects of Training Data Sizes
We investigate the effects of training data size on TabularQA and CommonsenseQA, both of which have relatively abundant training questions. In particular, we randomly select 10%, 25%, and 50% of the training data for the SKR_cls and SKR_knn methods and evaluate the final accuracy.
As shown in Figure 5, performance gradually improves as the training data increases, which shows that the self-knowledge collected from training data is valuable. Moreover, this phenomenon also indicates that self-knowledge from existing questions can be continually accumulated and reused in subsequent tasks, which is an important direction for future work.

Effects of Different Knowledge Resources
In addition to Wikipedia, we also compare the influence of different corpora C, including the passages from MS MARCO (Bajaj et al., 2016) and SciFact (Wadden et al., 2020). The former includes human-generated answers extracted via the Bing search engine, and the latter contains scientific articles with specialized domain knowledge.
The results are shown in Figure 6, where different external knowledge resources give different performances. In particular, Wikipedia leads to better results by a large margin on TabularQA, which contains the most questions querying information about specific entities or events (e.g., John Adams was born in the Spring of 1735. True or False?). MS MARCO and SciFact show results comparable to Wikipedia on CommonsenseQA and TruthfulQA, where the questions are more related to common knowledge (e.g., What happens to you if you eat watermelon seeds? Choices: ...). Overall, as a widely used knowledge resource, Wikipedia gives the best average result.

Conclusion
In this paper, we propose a Self-Knowledge guided Retrieval augmentation (SKR) method, which elicits the ability of LLMs to recognize what they know and do not know (i.e., self-knowledge) and lets them adaptively leverage external knowledge to generate more accurate responses. Several strategies are proposed to elicit self-knowledge, either by prompting the LLMs themselves or by using explicit smaller models. Experimental results on five datasets show that a simple yet effective k-nearest-neighbor based strategy leads to the best results, outperforming the chain-of-thought based and fully retrieval-based baselines.

Limitations
There are several directions in which to improve this work. First, due to resource limitations, we use retrieval augmentation as one way to detect the knowledge in LLMs and evaluate mostly on general question-answering datasets. Future work can explore self-knowledge at different levels (e.g., memorizing, understanding, and reasoning) and evaluate LLMs in broader domains beyond the mentioned datasets. Second, beyond finding related passages as external contextualized information, the retrieval augmentation method for LLMs can still be improved. As some existing work proposes (Yu et al., 2023; Shao et al., 2023), one can design specific mechanisms to make the retrieved results more suitable for and compatible with the reasoning ability of LLMs.

Ethics Statement
As for the datasets, we use Wikipedia as an external knowledge resource and five question-answering datasets for evaluation.All of them are publicly available and widely used by researchers.As for the LLMs, we use InstructGPT and ChatGPT through OpenAI API.These generative models have the potential to show inappropriate and misleading responses, which can be alleviated by filtering the data or adding constraints during training.
In this work, we only focus on the generated responses to the questions from the given datasets and try to combine LLMs with external world knowledge via retrieval augmentation, which actually has been shown as a potential way to reduce issues such as hallucination (Shuster et al., 2021;Roller et al., 2021).

Figure 1 :
Figure 1: Comparison between two responses given by InstructGPT. The retrieved passages are relevant but not particularly helpful for solving the question, which influences the model's judgment and leads to an incorrect answer.
Figure 2: The overall pipeline of our SKR method. We first collect self-knowledge from training questions according to the performance with or without external information (§3.1). Then we use the LLMs themselves or explicit small trainable models to elicit the self-knowledge of a question q_t by referring to the collected self-knowledge from training questions (§3.2). Finally, we apply the self-knowledge to the new question and adaptively call a retriever (§3.3).
{q+_1} Q: Do you need additional information to answer this question? A: No, I don't need additional information to answer this question.
{q−_1} Q: Do you need additional information to answer this question? A: Yes, I need additional information to answer this question.
......
{q_t} Q: Do you need additional information to answer this question? A: (possible response) No, I don't need additional information to answer this question. / Yes, I need additional information to answer this question.

Figure 3 :
Figure 3: Illustration of k-nearest-neighbor search to elicit the self-knowledge of the question q_t.

Figure 5 :
Figure 5: The performance on TabularQA and CommonsenseQA using different amounts of training data.

Figure 6 :
Figure 6: The performance on each dataset using different corpora for retrieval augmentation.
Question Q: In a Roman Osteria is a 19th-century work of art. True or False?
Q1: Cleopatra and Caesar is a 19th-century French work of art. True or False?
Q2: Boulevard des Capucines is a 19th-century European work of art. True or False?
Q3: Bulb Fields is a 19th-century European work of art. True or False?
A (w/o retrieval): An osteria is an Italian type of eating establishment. It is not a 19th-century artwork. So the answer is false. (✗)
A (w/ retrieval): In a Roman Osteria is a painting by the Danish painter Carl Bloch. It was painted in 1866. So the answer is true. (✓)

Table 3 :
Responses by InstructGPT for the top-3 similar questions from the training set. For all three training questions (Q1, Q2, Q3) related to 19th-century artworks, the model answers incorrectly but improves with retrieval augmentation. We infer, and then verify, that external knowledge would also be useful for question Q.

Table 3 illustrates an example showing the different responses with or without retrieval augmentation to similar questions and how self-knowledge is deduced using nearest-neighbor search. Given the question "In a Roman Osteria is a 19th-century work of art. True or False?", we search for similar questions from the training set and generate the answers with the LLM. From the direct responses, we find that the model itself does not fully understand the questions (e.g., Boulevard des Capucines is a street, not a work of art) and even hallucinates (e.g., that Cleopatra and Caesar is a 17th-century Italian painting by Francesco Barberini); however, it shows improved and correct responses when retrieved information is added. Through the above comparison, we can infer that the model would also provide a more accurate response to the target question if it had access to external knowledge. The results in the last row further validate our hypothesis. This case shows that it is helpful to consider existing similar cases when using LLMs to generate more reliable responses.