Do Large Language Models Know What They Don’t Know?

Large language models (LLMs) have a wealth of knowledge that allows them to excel in various Natural Language Processing (NLP) tasks. Current research focuses on enhancing their performance within their existing knowledge. Despite their vast knowledge, LLMs are still limited by the amount of information they can accommodate and comprehend. Therefore, the ability to understand their own limitations on the unknows, referred to as self-knowledge, is of paramount importance. This study aims to evaluate LLMs’ self-knowledge by assessing their ability to identify unanswerable or unknowable questions. We introduce an automated methodology to detect uncertainty in the responses of these models, providing a novel measure of their self-knowledge. We further introduce a unique dataset, SelfAware , consisting of unanswerable questions from five diverse categories and their answerable counterparts. Our extensive analysis, involving 20 LLMs including GPT-3, InstructGPT, and LLaMA, discovering an intrinsic capacity for self-knowledge within these models. Moreover, we demonstrate that in-context learning and instruction tuning can further enhance this self-knowledge. Despite this promising insight, our findings also highlight a considerable gap between the capabilities of these models and human proficiency in recognizing the limits of their knowledge.


Introduction
Recently, Large Language Models (LLMs) such as GPT-4 (OpenAI, 2023), PaLM 2 (Anil et al., 2023), andLLaMA (Touvron et al., 2023) have shown exceptional performance on a wide range of NLP tasks, including common sense reasoning (Wei et al., 2022;Zhou et al., 2022) and mathe- matical problem-solving (Lewkowycz et al., 2022;Chen et al., 2022).Despite their ability to learn from huge amounts of data, LLMs still have limitations in their capacity to retain and understand information.To ensure responsible usage, it is crucial for LLMs to have the capability of recognizing their limitations and conveying uncertainty when responding to unanswerable or unknowable questions.This acknowledgment of limitations, also known as "knowing what you don't know," is a crucial aspect in determining their practical applicability.In this work, we refer to this ability as model self-knowledge.
The Know-Unknow quadrant in Figure 1 illustrates the relationship between the model's knowledge and comprehension.The ratio of "Known Knows" to "Unknown Knows" demonstrates the model's proficiency in understanding and applying existing knowledge.Techniques such as Chain-of-Thought (Wei et al., 2022), Self-Consistency (Wang et al., 2022), and Complex CoT (Fu et al., 2022) can be utilized to increase this ratio, resulting in improved performance on NLP tasks.We focus on the ratio of "Known Unknows" to "Unknown Unknows", which indicates the model's self-knowledge level, specifically understanding its own limitations and deficiencies in the unknows.
Existing datasets such as SQuAD2.0(Rajpurkar et al., 2018) and NewsQA (Trischler et al., 2017), widely used in question answering (QA), have been utilized to test the self-knowledge of models with unanswerable questions.However, these questions are context-specific and could become answerable when supplemented with additional information.Srivastava et al. (2022) attempted to address this by evaluating LLMs' competence in delineating their knowledge boundaries, employing a set of 23 pairs of answerable and unanswerable multiple-choice questions.They discovered that these models' performance barely surpassed that of random guessing.Kadavath et al. (2022) suggested probing the selfknowledge of LLMs through the implementation of a distinct "Value Head".Yet, this approach may encounter difficulties when applied across varied domains or tasks due to task-specific training.Consequently, we redirect our focus to the inherent abilities of LLMs, and pose the pivotal question: "Do large language models know what they don't know?".
In this study, we investigate the self-knowledge of LLMs using a novel approach.By gathering reference sentences with uncertain meanings, we can determine whether the model's responses reflect uncertainty using a text similarity algorithm.We quantified the model's self-knowledge using the F1 score.To address the small and idiosyncratic limitations of existing datasets, we created a new dataset called SelfAware.This dataset comprises 1,032 unanswerable questions, which are distributed across five distinct categories, along with an additional 2,337 questions that are classified as answerable.Experimental results on GPT-3, In-structGPT, LLaMA, and other LLMs demonstrate that in-context learning and instruction tuning can effectively enhance the self-knowledge of LLMs.However, the self-knowledge exhibited by the current state-of-the-art model, GPT-4, measures at 75.47%, signifying a notable disparity when contrasted with human self-knowledge, which is rated at 84.93%.
Our key contributions to this field are summarized as follows: • We have developed a new dataset, SelfAware, that comprises a diverse range of commonly posed unanswerable questions.
• We propose an innovative evaluation technique based on text similarity to quantify the degree of uncertainty inherent in model outputs.
• Through our detailed analysis of 20 LLMs, benchmarked against human self-knowledge, we identified a significant disparity between the most advanced LLMs and humans1 .

Dataset Construction
To conduct a more comprehensive evaluation of the model's self-knowledge, we constructed a dataset that includes a larger number and more diverse types of unanswerable questions than Know-Unknowns dataset (Srivastava et al., 2022).To facilitate this, we collected a corpus of 2,858 unanswerable questions, sourced from online platforms like Quora and HowStuffWorks.These questions were meticulously evaluated by three seasoned annotation analysts, each operating independently.The analysts were permitted to leverage external resources, such as search engines.To ensure the validity of our dataset, we retained only the questions that all three analysts concurred were unanswerable.This rigorous process yielded a finalized collection of 1,032 unanswerable questions.
In pursuit of a comprehensive evaluation, we opted for answerable questions drawn from three datasets: SQuAD (Rajpurkar et al., 2016), Hot-potQA (Yang et al., 2018), and TriviaQA (Joshi et al., 2017).Our selection was guided by Sim-CSE (Gao et al., 2021), which allowed us to identify and select the answerable questions semantically closest to the unanswerable ones.From these sources, we accordingly drew samples of 1,487, 182, and 668 questions respectively, amassing a total of 2,337.Given that these questions can be effectively addressed using information available on Wikipedia, the foundational corpus for the training of current LLMs, it is plausible to infer that the model possesses the requisite knowledge to generate accurate responses to these questions.
Our dataset, christened SelfAware, incorporates 1,032 unanswerable and 2,337 answerable questions.To reflect real-world distribution, our dataset

Description
Example Percentage

No scientific consensus
The answer is still up for debate, with no consensus in scientific community.
"Are we alone in the universe, or will we discover alien life at some point?"

25% Imagination
The question are about people's imaginations of the future.
"What will the fastest form of transportation be in 2050?" 15% Completely subjective The answer depends on personal preference.
"Would you rather be shot into space or explore the deepest depths of the sea?"

Too many variables
The question with too many variables cannot be answered accurately.
"John made 6 dollars mowing lawns and 18 dollars weed eating.If he only spent 3 or 5 dollar a week, how long would the money last him?"10%

Philosophical
The question can yield multiple responses, but it lacks a definitive answer.
"How come god was born from nothingness?" 23% contains a proportion of answerable questions that is twice as large as the volume of unanswerable ones.Nevertheless, to ensure the feasibility of testing, we have purposefully capped the number of answerable questions.

Dataset Analysis
To gain insight into the reasons precluding a certain answer, we undertook a manual analysis of 100 randomly selected unanswerable questions.As tabulated in Table 1, we have broadly segregated these questions into five distinctive categories."No Scientific Consensus" encapsulates questions that ignite ongoing debates within the scientific community, such as those concerning the universe's origin."Imagination" includes questions involving speculative future scenarios, like envisaged events over the next 50 years."Completely Subjective" comprises questions that are inherently personal, where answers depend heavily on individual predispositions."Too Many Variables" pertains to mathematical problems that become unsolvable owing to the overwhelming prevalence of variables.Lastly, "Philosophical" represents questions of a profound, often metaphysical, nature that resist concrete answers.Ideally, upon encountering such questions, the model should express uncertainty instead of delivering conclusive responses.

Evaluation Method
This section elucidates the methodology employed for assessing self-knowledge in the generated text.
In order to achieve this, we define a similarity function, f sim , to compute the similarity, S, between a given sentence, t, and a collection of reference sentences, U = {u 1 , u 2 , . . ., u n }, endowed with uncertain meanings.
Whenever any S i surpasses a pre-determined threshold T , we perceive the text t as encompassing uncertain meanings, thereby eliminating the need for manual evaluation of the response.
Given the substantial disparity in the volume of answerable and unanswerable questions in Self-Aware, we adopt the F1 score as a measure of LLMs' self-knowledge.Our focus rests on identifying unanswerable questions, hence we designate them as positive cases and categorize answerable questions as negative cases.

Model
We conduct a sequence of experiments to evaluate the degree of self-knowledge manifested by various LLMs, including GPT-3 (Brown et al., 2020) and InstructGPT (Ouyang et al., 2022) series, as well as the recent LLaMA (Touvron et al., 2023) and its derivative models, namely Alpaca (Taori et al., 2023) and Vicuna (Chiang et al., 2023).Our investigative approach employed three distinct input forms: Direct, Instruction, and In-Context Learning (ICL), which is encapsulated in Appendix A.4. In-Context Learning

Setting
We devised the reference sentence set U through a process that combined automated generation by LLMs and manual filtering, detailed further in Appendix A.1.To quantify the similarity between target and reference sentences, we utilized Sim-CSE (Gao et al., 2021), setting the similarity threshold to 0.75 during our experiments.An exploration of threshold ablation is available in Appendix A.2.
To counteract potential errors in similarity calculation induced by varying lengths of the target and reference sentences, we employed a sliding window of length 5 to parse the target sentence into semantic chunks.During the generation process, we set the temperature to 0.7.We selected a random sample of 100 instances for GPT-4, while the remainder of the models were scrutinized using the full SelfAware dataset.

Human Self-Knowledge
To establish a benchmark for human selfknowledge, we engaged two volunteers and selected 100 random samples from the SelfAware dataset.The volunteers has 30 minutes to make judgments on the same set of questions, yielding an average F1 score of 84.93%, which we subsequently adopted as the benchmark for human self-knowledge.Detailed scores are available in Appendix A.3.

Analysis
We evaluate the manifestation of LLMs' selfknowledge, centering our investigation on three fundamental dimensions: the size of the model, the impact of instruction tuning, and the influence exerted by different input forms.
Model Size. Figure 2 illustrates the correlation between model size and self-knowledge across various LLMs.It is noteworthy that across all three input forms, an augmentation in model parameter size is associated with an elevation in the F1 Score, with the most conspicuous enhancement manifesting in the ICL input form.Therefore, our analysis indicates that an LLM's self-knowledge tends to enhance with increasing model size, a trend consistent with the scaling law.Instruction Tuning.Figure 2 delineates that models from the InstructGPT series exhibit a superior level of self-knowledge compared to their GPT-3 counterparts.Further evidence of model enhancement is provided by Figure 4, where textdavinci models show significant improvement relative to the base davinci model.An additional comparative analysis, presented in Figure 5, evaluates LLaMA against its derivative models.The results underscore a notable increase in self-knowledge for Alpaca and Vicuna upon instruction tuning, exceeding their base model performances.Among these, Vicuna-13B outperforms the LLaMA-65B, corroborating the efficacy of instruction tuning for enhancing model self-knowledge.
Input Forms.As shown in Figure 2, the incorporation of instructions and examples serves to boost the self-knowledge of both the GPT-3 and Instruct-GPT series.Specifically, ICL input form, providing richer contextual information, contributes to a significant enhancement in models' self-knowledge.This impact is particularly noticeable in the davinci model, where ICL facilitates a 27.96% improvement over the direct.Moreover, a comparison between Figure 3 and Figure 4 reveals that the inclusion of instructions and examples successfully minimizes the performance disparity between the davinci and text-davinci models, suggesting an acquisition of self-knowledge from the instructions and provided examples.
Compared with Human. Figure 3 reveals that, without supplementary samples, GPT-4 currently performs best among the tested models, achieving an impressive F1 score of 75.47%.However, a noticeable gap becomes evident when comparing this performance to the human benchmark of 84.93%.This underscores the considerable potential that remains for enhancing the self-knowledge level of LLMs.
Answerable Questions.Figure 6 traces the performance evolution of the InstructGPT series in addressing answerable questions, adhering to the closed-book question answering paradigm (Touvron et al., 2023), where output accuracy is contingent on the presence of the correct answer.Our observations underscore a steady enhancement in QA task accuracy corresponding to an increase in model parameter size and continuous learning.
Particularly, the accuracy of text-davinci-001 experiences a significant ascent, scaling from a meager 2.48% in text-ada-001 to 10.61%, whereas GPT-4 marks an even more striking jump to 42.64%.

Conclusion
This study investigates the self-knowledge of LLMs by evaluating their ability to identify unanswerable questions.Through the introduction of a novel dataset and an automated method for detecting uncertainty in the models' responses, we are able to accurately measure the self-knowledge of LLMs such as GPT-3, InstructGPT and LLaMA.
Our results reveal that while these models possess a certain degree of self-knowledge, there is still an apparent disparity in comparison to human selfknowledge.This highlights the need for further research in this area to enhance the ability of LLMs to understand their own limitations on the unknows.Such efforts will lead to more accurate and reliable responses from LLMs, which will have a positive impact on their applications in diverse fields.

Limitations
• Generalization of reference sentences.At present, we have selected sentences with uncertain meanings exclusively from the GPT-3 and InstructGPT series, potentially overlooking uncertainty present in responses generated by other LLMs.However, it is not feasible to catalog all sentences with uncertain meanings exhaustively.As a direction for future research, we propose to concentrate on the automated acquisition of more accurate reference sentences to address this concern.
• Limitations of input forms: Our examination was confined to three unique input forms: direct, instruction, and ICL.There is burgeoning research aimed at bridging the gap between models and human-like methods of reasoning and problem-solving, including but not limited to approaches like Reflexion (Shinn et al., 2023), ToT (Yao et al., 2023), MoT (Li and Qiu, 2023).Future endeavors will integrate additional cognitive and decision-making methods to delve deeper into the self-knowledge exhibited by these LLMs.

Ethics Statement
The SelfAware dataset, meticulously curated to evaluate LLMs' ability to discern unanswerable questions, is composed of unanswerable questions extracted from sources such as Quora and How-StuffWorks, alongside answerable questions procured from three distinct open datasets.Every question was thoroughly examined for relevance and harmlessness.To ensure content validity, three annotation analysts, compensated at local wage standards, dedicated regular working hours to content review.
Throughout our research process, we underscored the significance of privacy, data security, and strict compliance with dataset licenses.In order to protect data integrity, we implemented anonymization and content filtration mechanisms.Our adherence to OpenAI's stipulations remained unyielding for the usage of GPT-3 and InstructGPT models, and likewise for Meta's terms pertaining to LLaMA models.We rigorously vetted the licenses of the three publicly available datasets for compliance, ensuring that all our research methodologies were in alignment with ethical standards at the institutional, national, and global levels.
Adhering to the CC-BY-SA-4.0 protocol, the dataset, once publicly released, will be reserved exclusively for research purposes.We pledge to promptly and effectively address any concerns relating to the dataset, while concurrently anticipating researchers to maintain high ethical standards in their utilization of this data.

A.1 Uncertainty Text
To assemble a set of reference sentences, we randomly chose 100 entries from the SelfAware dataset.For each model in the GPT-3 and InstructGPT series, we conducted a preliminary test using the direct input form and manually curated sentences that displayed uncertainty.From this pre-test, we procured 16 sentences manifesting uncertain connotations to serve as our reference sentences.After normalizing these sentences by eliminating punctuation and converting to lowercase, we utilized them to compute similarity with target sentences throughout our experimental procedure.
1.The answer is unknown.
2. The answer is uncertain.
3. The answer is unclear.
4. There is no scientific evidence.5.There is no definitive answer.6.There is no right answer.7.There is much debate.8.There is no known case.9.There is no concrete answer to this question.10.There is no public information available.
11.It is impossible to know.
12. It is impossible to answer.
13.It is difficult to predict.14.It is not known.15.We do not know.
16.I'm not sure.

A.2 Threshold ablation
We generated 100 new responses using the textdavinci-002 with direct input form and manually filtered out sentences that contained uncertainty.We then used SimCSE (Gao et al., 2021) to calculate the similarity between these sentences and the reference sentences in Appendix A.1.We tested various thresholds for filtering sentences with uncertain meanings and compared them to manually  annotated sentences.We considered unanswerable questions as positive examples and calculated precision, recall, and F1 score.The results in Table 2 indicate that a threshold of 0.75 produced the highest F1 score, balancing precision and the inclusion of other uncertain sentences.As a result, we selected 0.75 as the similarity threshold for subsequent experiments.

A.3 Human Self-Knowledge Test
The evaluation results for the responses from our invited volunteers are presented in Table 3.The F1 scores for the responses were high, indicating that both volunteers exhibited a strong level of selfknowledge.

A.4 Template
The input templates used in our experiments, Direct, Instruction, and ICL, are illustrated in Figures 7, 8, and 9, respectively.In the ICL template, we composed 3 answerable and 3 unanswerable questions and provided the corresponding answers manually.
to understand experimental results.For example, small differences in accuracy on large test sets may be significant, while on small test sets they may not be.
In Section 2, Dataset Construction, we analyze the data distribution of the SelfAware dataset, including the number of answerable and non-answerable questions.All data is utilized solely for testing purposes.
C Did you run computational experiments?
In Section 4, Experiment, we conducted computational experiments.
C1. Did you report the number of parameters in the models used, the total computational budget (e.g., GPU hours), and computing infrastructure used?
In Section 4.4, Result, we present the number of parameters of the model used.
C2. Did you discuss the experimental setup, including hyperparameter search and best-found hyperparameter values?
In Section 4.2, Setting, we detail the hyperparameter temperature used in the experiment.
C3. Did you report descriptive statistics about your results (e.g., error bars around results, summary statistics from sets of experiments), and is it transparent whether you are reporting the max, mean, etc. or just a single run?
Considering the cost of experimentation, we did not conduct multiple experiments.However, replication of select experiments confirmed that there were no substantial variations in the outcomes, thus ensuring the reliability of our results.
We developed our own pre-processing and evaluation metrics, instead of utilizing existing packages.
D Did you use human annotators (e.g., crowdworkers) or research with human participants?
In Section 2, Dataset Construction, and Section 4, Experiment, we employ human annotators to assist us in sorting through the data and identify sentences with uncertain meanings.
D1. Did you report the full text of instructions given to participants, including e.g., screenshots, disclaimers of any risks to participants or annotators, etc.? We verbally communicated our expectations to the annotators, clearly outlining their roles and responsibilities.
D2. Did you report information about how you recruited (e.g., crowdsourcing platform, students) and paid participants, and discuss if such payment is adequate given the participants' demographic (e.g., country of residence)?In the Ethics Statement section, we recruited three annotators at rates that comply with local wage standards.
D3. Did you discuss whether and how consent was obtained from people whose data you're using/curating?For example, if you collected data via crowdsourcing, did your instructions to crowdworkers explain how the data would be used?In the Ethics Statement section, we clearly specify in the dataset usage regulations that it can only be used for scientific research.
D4. Was the data collection protocol approved (or determined exempt) by an ethics review board?
In the Ethics Statement section, we demonstrate our unwavering adherence to ethical and moral guidelines for data use.

Figure 1 :
Figure 1: Know-Unknow Quadrant.The horizontal axis represents the model's memory capacity for knowledge, and the vertical axis represents the model's ability to comprehend and utilize knowledge.

Figure 4 :
Figure 4: Experimental comparison of davinci series in ICL input form.

Figure 6 :
Figure 6: Accuracy of the InstructGPT series when responding to answerable questions in instruction input form.

Table 1 :
Unanswerable questions in the SelfAware dataset that span across multiple categories.

Table 2 :
Evaluation results comparing sentences with uncertain meaning filtered by various thresholds.

Table 3 :
Evaluation results of 100 responses from two volunteers.