The Internal State of an LLM Knows When It’s Lying

Introduction
Large Language Models (LLMs) have recently demonstrated remarkable success in a broad range of tasks Brown et al. [2020], Bommarito II and Katz [2022], Driess et al. [2023], Bubeck et al. [2023]. However, when composing a response, LLMs tend to hallucinate facts and provide inaccurate information Ji et al. [2023]. Furthermore, they tend to present this incorrect information in confident and compelling language. The combination of a broad body of knowledge with confident but incorrect information may cause significant harm, as people may accept the LLM as a knowledgeable source and fall for its confident and compelling language, even when it provides false information.
We believe that in order to perform well, an LLM must have some internal notion as to whether a sentence is true or false, as this information is required for generating (or predicting) subsequent tokens. For example, consider an LLM generating the following false statement: "The sun orbits the Earth." After stating this incorrect fact, the LLM is more likely to attempt to correct itself by saying that this is a misconception from the past. But after stating a true fact, for example "The Earth orbits the sun," it is more likely to focus on other planets that orbit the sun. Therefore, we hypothesize that whether a statement is true or false can be extracted from the LLM's internal state. Interestingly, retrospectively "understanding" that a statement it has just generated is false does not entail that the LLM would not generate it in the first place. We identify three reasons for such behavior. The first is that an LLM generates one token at a time, and it "commits" to each generated token. Therefore, even if each token maximizes the likelihood given the previous tokens, the overall likelihood of the complete statement may be low.

Figure 1: A tree diagram that demonstrates how generating words one at a time and committing to them may result in generating inaccurate information.

For example, consider a statement about Pluto. The statement begins with the common words "Pluto is the"; then, since Pluto used to be the smallest planet, the word "smallest" may be a very plausible choice. Once the sentence is "Pluto is the smallest", completing it correctly is very challenging, and when prompted to complete the sentence, GPT-4 (March 23rd version) completes it incorrectly: "Pluto is the smallest dwarf planet in our solar system." In fact, Pluto is the second largest dwarf planet in our solar system (after Eris). One way to complete the sentence correctly is "Pluto is the smallest celestial body in the solar system that has ever been classified as a planet" (see Figure 1). Consider the following additional example: "Tiztoutine is a town in Africa located in the republic of Niger." Indeed, Tiztoutine is a town in Africa, and many country names in Africa begin with "the republic of". However, Tiztoutine is located in Morocco, which is not a republic, so once the LLM commits to "the republic of", it cannot complete the sentence using "Morocco", and completes it with "Niger" instead. In addition, committing to one word at a time may require the LLM to complete a sentence that it simply does not know how to complete. For example, when describing a city, it may predict that the next words should describe the city's population. It may therefore include in a sentence "Tiztoutine has a population of", but if the population size is not present in the training data, it completes the sentence with a pure guess.
The second reason an LLM may provide false information is that, at times, there may be many ways to complete a sentence correctly, but fewer ways to complete it incorrectly. Therefore, a single incorrect completion may have a higher likelihood than any of the correct completions (when each is considered separately). Finally, since it is common for an LLM not to pick the most probable next word, but to sample according to the distribution over words, it may sample words that result in false information.
In this paper we present our Statement Accuracy Prediction, based on Language model Activations (SAPLMA). SAPLMA is a simple yet powerful method for detecting whether a statement generated by an LLM is truthful. Namely, we build a classifier that receives as input the activation values of the hidden layers of an LLM and determines, for each statement generated by the LLM, whether it is true or false. Importantly, the classifier is trained on out-of-distribution data, which allows us to focus specifically on whether the LLM has an internal representation of a statement being true or false, regardless of the statement's topic.
In order to train SAPLMA, we compose a dataset of true and false statements from 6 different topics. Each statement is fed to the LLM, and the values of its hidden layers are recorded. The classifier is then trained to predict whether a statement is true or false based only on the hidden layer values. Importantly, our classifier is not tested on the topics it is trained on, but on a held-out topic. We believe that this is an important measure, as it requires SAPLMA to extract the LLM's internal belief, rather than learning how information must be aligned to be classified as true. We show that SAPLMA, which leverages the LLM's internal states, performs better than explicitly prompting the LLM to say whether a statement is true or false. Specifically, SAPLMA reaches accuracy levels between 60% and 80% on specific topics, while few-shot prompting achieves only slightly above random performance, with no more than a 56% accuracy level.
As we later discuss, the probability an LLM assigns to a given statement depends heavily on the frequency of the tokens in the statement as well as on its length. Therefore, these probabilities must be normalized to become useful for detecting the veracity of a statement. Interestingly, the standard way for an LLM to generate a sentence is to generate tokens that maximize the likelihood of the next token, given all previous tokens. Therefore, we also consider statements generated by an LLM, which should be assigned a high probability by it; nevertheless, approximately 50% of them were annotated by human judges as false. However, we show that SAPLMA also performs well when tested on the statements generated by the LLM.
SAPLMA employs a simple and relatively shallow feedforward neural network as its classifier, which requires very little computational power at inference time. Therefore, it can be computed alongside the LLM's output. We propose that SAPLMA supplement an LLM presenting information to users. If SAPLMA detects that the LLM "believes" that a statement it has just generated is false, the LLM can mark it as such. This could raise human trust in the LLM's responses. Alternatively, the LLM may simply delete the incorrect statement and generate a new one instead.
To summarize, the contribution of this paper is twofold.
• The release of a dataset of true-false statements along with a method for generating such information.
• Demonstrating that an LLM might "know" when a statement that it has just generated is false, and proposing SAPLMA, a method for extracting this information.

Related Work
In this section we provide an overview of previous research on LLM hallucination, accuracy, and methods for detecting false information, and we discuss datasets used to that end.
Many works have focused on hallucination in machine translation Dale et al. [2022], Ji et al. [2023]. For example, Dale et al. [2022] consider hallucinations as translations that are detached from the source; hence, they propose a method that evaluates the percentage of the source's contribution to a generated translation. If this contribution is low, they assume that the translation is detached from the source and is thus considered hallucinated. Their method improves hallucination detection accuracy. The authors also propose to use multilingual embeddings and compare the similarity between the embeddings of the source sentence and the target sentence. If this similarity is low, the target sentence is considered hallucinated. The authors show that the latter method works better. However, their approach is very different from the work presented in this paper, as we do not assume any pair of source and target sentences. In addition, while we also use the internal states of the model, we do so by using the hidden states to discriminate between statements that the LLM "believes" are true and those it "believes" are false. Furthermore, we focus on detecting false statements rather than hallucination, as defined by their work.
Other works have focused on hallucination in text summarization Pagnoni et al. [2021]. Pagnoni et al. propose a benchmark for factuality metrics of text summarization. Their benchmark is developed by gathering summaries from several summarization methods and asking humans to annotate their errors. The authors analyze these annotations and the proportions of the different factual errors made by the summarization methods. We note that most works that consider hallucination do so in relation to a given input (e.g., a passage) that the model operates on. For example, a summarization model that outputs information that does not appear in the provided article is considered to hallucinate, regardless of whether the information is factually true. However, in this work we consider a different problem--the veracity of the output of an LLM, without respect to a specific input.
Some methods for reducing hallucination assume that the LLM is a black box Peng et al. [2023]. This approach uses different methods for prompting the LLM, possibly by posing multiple queries to achieve better performance. Some methods for detecting false statements may include repeating queries and measuring the discrepancy among the responses. We note that instead of asking the LLM to answer the same query multiple times, it is possible to ask the LLM to rephrase the query (without changing its meaning or any possible answer) and then to answer each rephrased question.
Other methods fine-tune the LLM using human feedback, reinforcement learning, or both Bakker et al. [2022], Ouyang et al. [2022]. Ouyang et al. propose a method to improve LLM-generated content using reinforcement learning from human feedback. Their approach focuses on fine-tuning the LLM with a reward model based on human judgments, aiming to encourage the generation of better content. However, fine-tuning, in general, may cause a model to perform worse on other tasks Kirkpatrick et al. [2017]. In this paper, we take an intermediate approach: we assume access to the model parameters, but do not fine-tune or modify them.
A dataset commonly used for training and fine-tuning LLMs is Wizard-of-Wikipedia Dinan et al. [2018]. The Wizard-of-Wikipedia dataset includes interactions between a human apprentice and a human wizard. The human wizard receives relevant Wikipedia articles, which should be used to select a relevant sentence and compose the response. The goal is to replace the wizard with a learned agent (such as an LLM). Another highly relevant dataset is FEVER Thorne et al. [2018, 2019]. The FEVER dataset is designed for developing models that receive as input a claim and a passage, and must determine whether the passage supports the claim, refutes it, or does not provide enough information to support or refute it. While the FEVER dataset is highly relevant, it does not provide simple sentences that are clearly true or false independently of a provided passage. In addition, the FEVER dataset is not partitioned into different topics as the true-false dataset provided in this paper is.
In conclusion, while several approaches have been proposed to address the problem of hallucination and inaccuracy in automatically generated content, our work is unique in its focus on utilizing the LLM's hidden layer activations to determine the veracity of generated statements. Our method offers the potential for more general applicability in real-world scenarios, operating alongside an LLM, without the need for fine-tuning or task-specific modifications.

The True-False Dataset
The work presented in this paper requires a dataset of true and false statements. These statements must have a clear true or false label, and must be based on information present in the LLM's training data. Furthermore, since our approach intends to reveal that the hidden states of an LLM carry a notion of a statement being true or false, the dataset must cover several disjoint topics, such that a classifier can be trained on the LLM's activations for some topics while being tested on another. Unfortunately, we could not find any such dataset, and therefore composed the true-false dataset.
Our true-false dataset covers the following topics: "Cities", "Inventions", "Chemical Elements", "Animals", "Companies", and "Scientific Facts". For the first 5 topics, we used the following method to compose the dataset. We used a reliable source that included a table with several properties for each instance. For example, for the "Chemical Elements" topic we used a table that included, for each element, its name, atomic number, symbol, standard state, group block, and a unique property (e.g., Hydrogen, 1, H, Gas, Nonmetal, the most abundant element in the universe). For each element we composed a true statement using the element name and one of its properties (e.g., "The atomic number of Hydrogen is 1"). Then, we randomly selected a different row for composing a false statement (e.g., "The atomic number of Hydrogen is 34"). If the value in the different row was identical to the value in the current row, we resampled until we obtained a value that differed. This process was repeated for all topics except "Scientific Facts". For the "Scientific Facts" topic, we asked ChatGPT (Feb 13 version) to provide "scientific facts that are well known to humans" (e.g., "The sky is often cloudy when it's going to rain"). We then asked ChatGPT to provide the opposite of each statement such that it becomes a false statement (e.g., "The sky is often clear when it's going to rain"). The statements provided by ChatGPT were manually curated. As a result of this curation method, the dataset maintains a well-balanced distribution of true and false statements.
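The row-resampling procedure described above can be sketched as follows. This is a minimal illustration in Python; the function and column names are ours, not the authors' released code.

```python
import random

def make_statements(rows, template, key_col, value_col, seed=0):
    """Compose (statement, label) pairs from a table of facts.

    For each row, emit one true statement using the row's own value, and one
    false statement whose value is resampled from a different row until it
    differs from the true value.
    """
    rng = random.Random(seed)
    statements = []
    for row in rows:
        # True statement: the instance paired with its own property value.
        statements.append((template.format(row[key_col], row[value_col]), True))
        # False statement: resample a row until its value differs.
        other = rng.choice(rows)
        while other[value_col] == row[value_col]:
            other = rng.choice(rows)
        statements.append((template.format(row[key_col], other[value_col]), False))
    return statements

elements = [
    {"name": "Hydrogen", "atomic_number": 1},
    {"name": "Helium", "atomic_number": 2},
    {"name": "Lithium", "atomic_number": 3},
]
pairs = make_statements(elements, "The atomic number of {} is {}.",
                        "name", "atomic_number")
```

Because the false value is drawn from the same column of a real row, false statements remain grammatical and topically plausible, which keeps the true/false classification from degenerating into surface-pattern detection.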
Table 1 presents the number of statements in each topic. The following are some examples of true statements from the dataset:
• Cities: "Oranjestad is a city in Aruba"
• Inventions: "Grace Hopper invented the COBOL programming language"
• Chemical Elements: "Boron is used in the production of glass and ceramics"
• Animals: "The llama has a diet of herbivore"
• Companies: "Meta Platforms has headquarters in United States"
• Scientific Facts: "The Earth's tides are primarily caused by the gravitational pull of the moon"
The following are some examples of false statements from the dataset:
• Cities: "Wellington is a name of a country"
• Inventions: "David Schwarz lived in France"
• Chemical Elements: "Indium is in the Lanthanide group"
• Animals: "The whale has a long, tubular snout, large ears, and a powerful digging ability to locate and consume termites and ants."
• Companies: "KDDI operates in the industry of Materials"
• Scientific Facts: "Ice sinks in water due to its higher density"
The true-false dataset will be made public.

SAPLMA
In this section, we present our Statement Accuracy Prediction, based on Language model Activations (SAPLMA), a method designed to determine the truthfulness of statements generated by an LLM using the values in its hidden layers. Our general hypothesis is that the values in the hidden layers of an LLM contain information on whether the LLM "believes" that a statement is true or false. However, it is unclear which layer is the best candidate for retaining such information. While the last hidden layer should contain such information, it is primarily focused on generating the next token. Conversely, layers closer to the input are likely focused on extracting lower-level information from the input. Therefore, we use several hidden layers as candidates. For the LLM we use Facebook OPT 6.7b Zhang et al. [2022], an LLM composed of 32 layers. We compose five different models, each using activations from a different layer. Namely, we use the last hidden layer, the 28th layer (the 4th before last), the 24th layer (the 8th before last), the 20th layer (the 12th before last), and the middle layer (the 16th layer). We note that each layer is composed of 4096 hidden units. SAPLMA's classifier consists of a feedforward neural network with three fully connected hidden layers using ReLU activation functions and decreasing numbers of neurons (256, 128, and 64), followed by a final sigmoid output. We use the Adam optimizer, and do not fine-tune any of these hyperparameters for this task. The classifier is trained for 5 epochs.
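The forward pass of the probe described above can be sketched as follows. This is our illustration with randomly initialized weights, not the authors' code; in the paper the probe is trained with Adam for 5 epochs.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class SAPLMAClassifier:
    """Feedforward probe over 4096-dim hidden-layer activations:
    4096 -> 256 -> 128 -> 64 -> sigmoid output (probability "true")."""

    def __init__(self, input_dim=4096, seed=0):
        rng = np.random.default_rng(seed)
        sizes = [input_dim, 256, 128, 64, 1]
        # He-style initialization for the ReLU layers; untrained here.
        self.weights = [rng.normal(0.0, np.sqrt(2.0 / a), (a, b))
                        for a, b in zip(sizes[:-1], sizes[1:])]
        self.biases = [np.zeros(b) for b in sizes[1:]]

    def predict_proba(self, activations):
        h = activations
        for W, b in zip(self.weights[:-1], self.biases[:-1]):
            h = relu(h @ W + b)
        return sigmoid(h @ self.weights[-1] + self.biases[-1])

clf = SAPLMAClassifier()
p = clf.predict_proba(np.random.default_rng(1).normal(size=(5, 4096)))
```

The probe has well under two million parameters, which is why it adds negligible cost at inference time relative to the LLM itself.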
For each topic in the true-false dataset, we train the classifier on the activation values obtained from all other topics and test its accuracy on the current topic. This way, the classifier is required to determine which sentences the LLM "believes" are true and which it "believes" are false in a general setting, and not specifically with respect to the topic being tested. To obtain more reliable results, we train each classifier three times with different initial random weights. This process is repeated for each topic, and we report the mean accuracy over these three runs.

Results
We compare the performance of SAPLMA against two different baselines. The first is BERT, for which we train a classifier (with an architecture identical to the one used by SAPLMA) on the BERT embeddings of each sentence. Since we train on the embeddings of all topics except the topic being tested, we expect the BERT embeddings to perform very poorly, as they focus on the semantics of the statement rather than on whether it is true or false. Our second baseline is a few-shot learner. This baseline is an attempt to reveal whether the LLM itself "consciously knows" whether a statement is true or false. Unfortunately, all attempts to explicitly prompt the LLM in a "zero-shot" manner to determine whether a statement is true or false failed completely, with accuracy levels not going beyond 52%. Therefore, we use a few-shot query instead, which provides the LLM with truth values from the same topic it is being tested on. Note that this is very different from our methodology in this paper, but it was necessary for obtaining some results. Namely, we provided a few statements along with their ground-truth labels, and then the statement in question. We recorded the probability that the LLM assigned to the token "true" and to the token "false". Unfortunately, the LLM had a tendency to assign higher values to the "true" token; therefore, we divided the probability assigned to the "true" token by the one assigned to the "false" token. Finally, we considered the LLM's prediction "true" if this value was greater than the average, and "false" otherwise. We tested a 3-shot and a 5-shot version.

Table 2: Accuracy of all the models tested for each of the topics, and average accuracy.

Table 2 and Figure 2 present the accuracy of all the models tested for each of the topics, along with the average accuracy. As depicted by the table and figure, SAPLMA clearly outperforms BERT and few-shot learning, with BERT, 3-shot, and 5-shot learning achieving only slightly above a random guess (0.50). It can also be observed that SAPLMA for OPT-6.7b seems to work best when using the 20th layer (out of 32). Recall that each model was trained three times. We note that the standard deviation of the accuracy among the different runs was very small; therefore, the differences in performance between the different layers seem consistent. However, the optimal layer to use for SAPLMA is very likely to depend on the LLM, but obtaining the activations from the third quartile of the network's depth seems to perform well.
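The calibration used for the few-shot baseline described above (divide the probability of the "true" token by that of the "false" token, then predict "true" whenever the ratio exceeds the average) can be sketched as follows; the ratio values are illustrative, not actual model outputs.

```python
def few_shot_predictions(ratio_scores):
    """Calibrate few-shot outputs as in the text: for each statement, the
    score is p("true") / p("false") from the LLM. Since the LLM tends to
    assign higher probability to "true", predict "true" only when a
    statement's ratio exceeds the average ratio over the test set."""
    avg = sum(ratio_scores) / len(ratio_scores)
    return [score > avg for score in ratio_scores]

preds = few_shot_predictions([2.0, 0.5, 3.0, 0.4])
```

Note that this calibration uses the test-set average, which forces roughly balanced predictions; this is why the few-shot baseline is not re-evaluated under the validation-threshold protocol later in the paper.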
As for the differences between the topics, we believe that these values depend very much on the training data of the LLM. That is, we believe that the data used for training OPT-6.7b includes information or stories about many cities and companies, but not as much about chemical elements and animals (other than the very common ones). This, we conjecture, is why SAPLMA achieves over 80% accuracy when tested on the "Cities" topic (while trained on all the rest) and on the "Companies" topic, but only around 60% when tested on the "Animals" and "Elements" topics.
In addition to the topics from the true-false dataset, we use statements generated by the LLM itself (the OPT 6.7b model). To generate statements, we simply provided a true statement and then let the LLM generate a statement of its own. We first filtered out any statements that were not factual statements (e.g., "I'm not familiar with the Japanese versions of the games."). All statements were generated using the maximal-probability token, i.e., we did not use sampling. This resulted in 245 labeled statements. The statements were fact-checked and manually labeled by three human judges based on web searches. The human judges had a very high average observed agreement of 97.82%, and an average Cohen's Kappa of 0.9566. The majority vote determined the ground-truth label for each statement. 48.6% of the statements were labeled as true, resulting in a balanced dataset.
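The labeling procedure above (majority vote over three judges, plus pairwise observed agreement) can be sketched as follows; the vote tuples are illustrative, and this is our reconstruction of a standard computation, not the authors' exact code.

```python
from itertools import combinations

def majority_label(votes):
    """Ground-truth label: the label chosen by more than half the judges."""
    return sum(votes) > len(votes) / 2

def observed_agreement(annotations):
    """Average pairwise observed agreement across judges.

    annotations: list of per-statement vote tuples, one boolean per judge.
    For each statement, count how many judge pairs agree, then divide by
    the total number of (statement, judge-pair) comparisons."""
    judge_pairs = list(combinations(range(len(annotations[0])), 2))
    agree = sum(a[i] == a[j] for a in annotations for i, j in judge_pairs)
    return agree / (len(annotations) * len(judge_pairs))

annotations = [(True, True, True), (True, True, False), (False, False, False)]
labels = [majority_label(a) for a in annotations]
```

Observed agreement alone overstates reliability on imbalanced labels, which is why the paper also reports Cohen's Kappa, a chance-corrected statistic.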
Each of the models was trained 14 times using the same classifier described in Section 4. The models were trained on the entire true-false dataset (i.e., all topics, but not the generated sentences) and tested on the generated sentences.
Table 3 presents the average accuracy of all models on the sentences generated by the LLM. As anticipated, SAPLMA clearly outperforms the baselines, which appear to be entirely ineffectual in this task, achieving an accuracy near 50%. However, the accuracy of SAPLMA on these sentences is not as promising as the accuracy achieved when tested on some of the topics in the true-false dataset (i.e., cities and companies). Since we expected the LLM to generate sentences that are more aligned with the data it was trained on, we expected SAPLMA's performance on the generated sentences to be closer to its performance on topics such as cities and companies, which are likely well represented in the data the LLM was trained on. However, there may also be a countereffect in play: the sentences in the true-false dataset were mostly generated using a specific pattern (except for the "Scientific Facts" topic), such that each sentence is clearly either true or false. The sentences generated by the LLM, in contrast, were much more open-ended, and their truth value may be less clearly defined (despite being agreed upon by the human judges). For example, one of the sentences classified by all human judges as false is "Lima gets an average of 1 hour of sunshine per day." However, this sentence is true during the winter. Another example is "Brink is a river," which was also classified as false by all three human judges; however, "brink" refers to a river bank (though it is not the name of a specific river, and does not mean river). Indeed, SAPLMA classified approximately 70% of the sentences as true, and the AUC values seem more promising. This may hint that any sentence that seems plausible is classified as true. Therefore, we evaluate the models using 30% of the generated sentences to determine which threshold to use, i.e., any prediction above the threshold is considered true. Importantly, we do not use this validation set for any other purpose. We test the models on the remaining 70% of the generated sentences. We do not evaluate the few-shot models again, as our evaluation procedure guaranteed that the number of positive predictions would match the number of negative predictions, which matches the distribution of the data.
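The threshold-selection step described above can be sketched as follows: sweep candidate thresholds on the validation split and keep the one that maximizes accuracy. The numbers in the test are illustrative, not the paper's scores.

```python
def best_threshold(scores, labels, candidates=None):
    """Pick the decision threshold that maximizes accuracy on a validation
    split (30% of the generated statements in the paper); any prediction
    strictly above the threshold is taken as "true"."""
    if candidates is None:
        # By default, try every distinct score as a candidate threshold.
        candidates = sorted(set(scores))

    def accuracy(t):
        return sum((s > t) == y for s, y in zip(scores, labels)) / len(labels)

    return max(candidates, key=accuracy)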

Table 4: Accuracy and average optimal threshold of all the models tested on the sentences generated by the LLM. The optimal threshold is computed using the validation set (30% of the original test set).
Table 4 presents the accuracy of all models when using the optimal threshold from the validation set. Clearly, SAPLMA performs better with a higher threshold. This somewhat confirms our assumption that the truth value of the sentences generated by the LLM is more subjective than that of the sentences appearing in the true-false dataset. We note that the BERT embeddings also perform better with a higher threshold. The use of a higher threshold can also be justified by the notion that it is better to delete, or to mark as unsure, a sentence that is actually true than to promote false information.
Another interesting observation is that the 20th layer no longer performs best for the statements generated by the LLM; instead, the 28th layer seems to perform best. This is somewhat perplexing, and we do not have a good guess as to why this might be happening. Nevertheless, we stress that the differences between the accuracy levels of the 28th layer and the others are statistically significant (using a two-tailed Student's t-test; p < 0.05). In future work we will consider fusing multiple layers together.
We also ran the statements generated by the OPT 6.7b model through GPT-4, prompting it to determine whether each statement was true or false. Specifically, we provided the following prompt: "Copy the following statements and for each statement write true if it is true and false if it is false:", and fed it 30 statements at a time. It achieved an accuracy level of 84.4%, and a Cohen's Kappa agreement level of 0.6883 with the true labels.

Discussion & Limitations
In this work we explicitly do not consider models that were trained or fine-tuned on data from the same topic as the test set. This is particularly important for the sentences generated by the LLM, as training on a held-out set of them would allow the classifier to learn which types of sentences generated by the LLM are generally true and which are false. While this information may be useful in practice, and using it is likely to yield much higher accuracy, it deviates from the core focus of this paper.
We note that the probability of the entire sentence (computed by multiplying the probabilities of each word, given the previous words) does not provide much valuable information, as some words are more common than others. Therefore, while sentence probabilities may be useful to determine which of two similar sentences is true, they are not useful for the general purpose of determining the truthfulness of a given sentence. For example, the false statement "Vaduz is a city in Belgium" has a probability of 6.44 × 10^-16. While this value is much lower than that of the corresponding true statement "Vaduz is a city in Liechtenstein", which has a probability of 2.31 × 10^-12, it is still much higher than that of the true statement "Triesenberg is a city in Liechtenstein", which has a probability of 3.21 × 10^-20. Therefore, any attempt to use these probabilities must normalize them by the frequency of the tokens in the data. Nevertheless, as we have discussed and shown, even when the LLM generates sentences using the most probable token, it commonly generates false information. This paper focuses on detecting whether a statement is true or false. However, in practice, it may be more beneficial to detect whether the LLM is positive that a statement is correct or is unsure. The simplest adjustment to the method proposed in this paper is to lift the threshold required for classifying a statement as true above 0.5; however, the exact value would require some form of calibration of the model Bella et al. [2010]. Another option is to use multiple classifiers and to require all (or a vast majority of) classifiers to output "true" for the statement to be considered true. Alternatively, dropout layers can be used for the same goal Chen and Yi [2021]. The overall system can benefit from multiple outcomes, such that if the LLM is not positive whether the statement is true or false, it can be marked for the user to treat with caution. However, if a statement is classified as false, the LLM can delete it and generate a different statement instead. To avoid regenerating the same statement again, the probability of sampling the words that appear in the current statement should be adjusted downward.
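The length and token-frequency effects discussed above can be made concrete with a small sketch. The per-token log-probabilities below are illustrative numbers, not actual OPT-6.7b scores.

```python
def sentence_log_prob(token_log_probs):
    """Log-probability of a whole sentence: the sum of each token's
    log-probability given the preceding tokens (equivalent to multiplying
    the raw probabilities)."""
    return sum(token_log_probs)

def length_normalized(token_log_probs):
    """One simple normalization: mean per-token log-probability. This
    removes the length effect, though token-frequency effects would need a
    further correction against corpus statistics."""
    return sentence_log_prob(token_log_probs) / len(token_log_probs)

# A short sentence built from common tokens can out-score a longer true
# sentence built from rarer tokens, regardless of which one is true.
short_common = sentence_log_prob([-2.0, -3.0, -4.0])
long_rare = sentence_log_prob([-5.0, -6.0, -7.0, -8.0])
```

This is the "Vaduz"/"Triesenberg" phenomenon in miniature: raw sentence probability tracks token frequency and length, not veracity.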
In our work we collected the activation values for each sentence separately. However, in practice, when an LLM generates a longer response, the activation values develop over time, so they may encode both correct and incorrect information. Therefore, the activation values would need to be decoupled so that one can test whether the most recent statement is true or false. One approach might be to subtract the activation values obtained after the previous statement from the current activation values (a discrete derivative). Clearly, training would have to be performed using the same approach.
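The discrete-derivative idea above can be sketched as follows. The paper leaves this to future work, so this is our illustration of the proposed operation, not a tested implementation.

```python
import numpy as np

def statement_activation_delta(prev_activations, curr_activations):
    """Subtract the hidden-layer activations recorded after the previous
    statement from those recorded after the current statement, isolating
    the contribution of the most recent statement. The resulting delta
    vector would be fed to the classifier (which must then also be trained
    on deltas rather than raw activations)."""
    return np.asarray(curr_activations) - np.asarray(prev_activations)
```

The key constraint, as the text notes, is consistency: if inference uses deltas, the classifier's training examples must be computed the same way.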

Conclusions & Future Work
In this paper, we tackle a fundamental problem associated with LLMs, i.e., the generation of incorrect and false information. We have introduced SAPLMA, a method that leverages the hidden layer activations of an LLM to predict the truthfulness of generated statements. We demonstrated that SAPLMA outperforms few-shot prompting in detecting whether a statement is true or false, achieving accuracy levels between 60% and 80% on specific topics. This is a significant improvement over the maximum 56% accuracy level achieved by few-shot prompting.
Our findings suggest that LLMs possess an internal notion of statement accuracy, which can be harnessed by SAPLMA to filter out incorrect information before it reaches the user. Using SAPLMA as a supplement to LLMs may increase human trust in the generated responses and mitigate the risks associated with the dissemination of false information. Furthermore, we have released the true-false dataset and proposed a method for generating such data. This dataset, along with our methodology, provides valuable resources for future research on improving LLMs' abilities to generate accurate and reliable information.
In future work we intend to apply our method to larger LLMs, and to run experiments with humans in which a control group interacts with an unfiltered LLM while the experimental group interacts with a system that combines SAPLMA and an LLM. We hope to demonstrate that humans trust, and better understand the limitations of, a system that is able to review itself and mark statements that it is unsure about. We also intend to study how the activations develop over time as additional words are generated, so that SAPLMA can be applied to detect inaccurate information within a passage generated by an LLM.
Figure 2: A bar chart comparing the accuracy of SAPLMA (20th layer), BERT, 3-shot, and 5-shot on the 6 topics, and the average. SAPLMA consistently outperforms the other models across all categories. Since the data is balanced, a random classifier would obtain an accuracy of 0.5.

Table 1: Number of rows for each topic in the true-false dataset.