LM-Polygraph: Uncertainty Estimation for Language Models

Recent advancements in the capabilities of large language models (LLMs) have paved the way for a myriad of groundbreaking applications in various fields. However, a significant challenge arises as these models often "hallucinate", i.e., fabricate facts without providing users an apparent means to discern the veracity of their statements. Uncertainty estimation (UE) methods are one path to safer, more responsible, and more effective use of LLMs. However, to date, research on UE methods for LLMs has been focused primarily on theoretical rather than engineering contributions. In this work, we tackle this issue by introducing LM-Polygraph, a framework with implementations of a battery of state-of-the-art UE methods for LLMs in text generation tasks, with unified program interfaces in Python. Additionally, it introduces an extendable benchmark for consistent evaluation of UE techniques by researchers, and a demo web application that enriches the standard chat dialog with confidence scores, empowering end-users to discern unreliable responses. LM-Polygraph is compatible with the most recent LLMs, including BLOOMz, LLaMA-2, ChatGPT, and GPT-4, and is designed to support future releases of similarly-styled LMs.


Introduction
Large language models (LLMs) have demonstrated remarkable performance across a variety of text generation tasks. Instruction fine-tuning and reinforcement learning from human feedback (RLHF) have brought the zero-shot performance of these models to a new level (Ouyang et al., 2022). However, the capabilities of LLMs, despite their profound power and complexity, are inherently constrained. Limitations arise from the finite nature of the training data and the model's intrinsic memorization and reasoning capacities. Hence, their utility is bounded by the depth and breadth of the knowledge they embed.
Due to their training objectives, even when the embedded knowledge of an LLM on a given topic is limited, it tends to be over-eager to respond to a prompt, sometimes generating misleading or entirely erroneous output. This dangerous behavior of attempting to appease the user with plausible-sounding but potentially false information is known as "hallucination" (Xiao and Wang, 2021; Dziri et al., 2022). It poses a significant challenge when deploying LLMs in practical applications.
There are several well-known approaches to censoring LLM outputs, including filtering with stop-word lists, post-processing with classifiers (Xu et al., 2023), rewriting of toxic outputs (Logacheva et al., 2022), and longer fine-tuning with RLHF. However, these approaches cannot be relied on to completely resolve hallucinations. Since LMs are natural (if "unintentional") liars, we propose LM-Polygraph, a program framework that, similar to a human polygraph, leverages various hidden signals to reveal when one should not trust the subject. In particular, LM-Polygraph provides a comprehensive collection of uncertainty estimation (UE) techniques for LLMs in text generation tasks.
Uncertainty estimation refers to the process of quantifying the degree of confidence in the predictions made by a machine learning model. For classification and regression tasks, there is a well-developed battery of methods (Gal, 2016). There has also been a surge of work investigating UE, particularly in text classification and regression in conjunction with encoder-only LMs such as BERT (Zhang et al., 2019; He et al., 2020; Shelmanov et al., 2021; Xin et al., 2021; Vazhentsev et al., 2022; Kotelevskii et al., 2022; Wang et al., 2022; Kuzmin et al., 2023). However, UE for sequence generation tasks, including text generation, is a much more complex problem. To quantify the uncertainty of a whole sequence, we have to aggregate the uncertainties of many individual token predictions and deal with non-trivial sampling and pruning techniques like beam search. Contrary to classification tasks, where the number of possible prediction options is finite, in text generation the number of possible predictions is infinite or exponential in vocabulary size, complicating the estimation of probabilities and information-based scores. Finally, a natural language text is not a simple sum of its tokens; it is a nuanced interplay of context, semantics, and grammar, so two texts can have very diverse surface forms but similar meanings, which should be taken into account during the UE process.
Several recent studies have delved into developing UE methods for LMs in text generation tasks (Malinin and Gales, 2021; van der Poel et al., 2022; Kuhn et al., 2023; Ren et al., 2023; Vazhentsev et al., 2023b; Lin et al., 2023). However, the current landscape of this research is quite fragmented, with many non-comparable or even concurrent studies, which makes it challenging to consolidate the findings and draw holistic conclusions.
In this work, with the development of LM-Polygraph, we strive to bridge these disparate research efforts, fostering more cohesion and synergy in the field. We envision a framework that consolidates the scattered UE techniques under a unified Python interface, provides an extendable evaluation benchmark, and offers tools to seamlessly integrate uncertainty quantification into standard LLM pipelines. This endeavor will not only make the journey less challenging for individual researchers and developers but also set the stage for more robust, reliable, and trustworthy LLM deployments for end-users.

Uncertainty Estimation Methods
Here, we summarize UE methods implemented in LM-Polygraph, as listed in Table 1.
There are two major types of techniques: white-box and black-box. The white-box methods require access to logits, internal layer outputs, or the LM itself. The black-box methods require access only to the generated texts and can easily be integrated with third-party online services such as the OpenAI LM API. We note that the methods differ in computational requirements: some techniques pose high computational or memory overheads, e.g., due to repeated inference, making them less suitable for practical usage. The application of some methods can also be hindered by the need for access to the model's training data.
Let us consider an input sequence x and an output sequence y ∈ Y of length L, where Y is the set of all possible output sequences. For probabilistic autoregressive language models, the probability of an output sequence given an input sequence is

P(y | x, θ) = ∏_{l=1}^{L} P(y_l | y_{<l}, x, θ),

where the distribution of each y_l is conditioned on all the previous tokens in the sequence, y_{<l} = {y_1, ..., y_{l−1}}, and θ denotes the parameters of the model.
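As a toy illustration of this chain-rule factorization (not part of LM-Polygraph's API; the per-step probabilities below are made-up stand-ins for real model outputs), the sequence probability is the product of per-token conditional probabilities, usually accumulated in log space for numerical stability:

```python
import math

def sequence_log_prob(step_probs):
    """Log-probability of a generated sequence, summing per-token log
    P(y_l | y_<l, x, theta); summing logs avoids underflow for long texts."""
    return sum(math.log(p) for p in step_probs)

# Toy probabilities the model assigned to each generated token in turn.
token_probs = [0.9, 0.5, 0.8]
logp = sequence_log_prob(token_probs)
seq_prob = math.exp(logp)  # equals 0.9 * 0.5 * 0.8 = 0.36
```

In practice these per-token probabilities come from softmaxed model logits at each generation step.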

White-box Methods
We start the discussion of white-box techniques with information-based methods. These techniques are based on token P(y_l | y_{<l}, x, θ) and sequence P(y | x, θ) probabilities obtained from a single model prediction. A notable example is entropy, which can be calculated at the token or sequence level. The benefits of information-based methods are that they are cheap to compute and simple to implement. However, their quality is usually relatively low, so they are typically considered baselines. Some domain-specific methods have recently been proposed in an attempt to improve over standard information-based approaches, such as semantic entropy (Kuhn et al., 2023).
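A minimal sketch of such an information-based baseline, with toy next-token distributions in place of real model outputs (the function names are ours, not the framework's API):

```python
import math

def token_entropy(dist):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)

def mean_token_entropy(dists):
    """Token-level entropy averaged over the generation steps; higher
    values indicate a less confident model."""
    return sum(token_entropy(d) for d in dists) / len(dists)

# Toy distributions over a 3-token vocabulary for two generation steps.
peaked = [[0.98, 0.01, 0.01], [0.90, 0.05, 0.05]]  # confident model
flat = [[1 / 3, 1 / 3, 1 / 3]] * 2                  # maximally unsure model
```

A flat distribution yields maximal entropy, so `mean_token_entropy(flat)` exceeds `mean_token_entropy(peaked)`.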
The second category of white-box techniques is ensemble-based methods, which leverage the diversity of output predictions made by multiple slightly different versions of models under slightly different conditions. Let us assume that M models are available with parameters θ_i, i = 1, ..., M. These parameters can be obtained via independent training of models. One can then use token P(y_l | y_{<l}, x, θ_i) and sequence P(y | x, θ_i) probabilities to compute various metrics, such as mutual information, that measure the discrepancy between model predictions.
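One common discrepancy score of this kind is the mutual information between the prediction and the model parameters, estimated as the entropy of the averaged distribution minus the average entropy of the members. A hedged sketch, with toy next-token distributions standing in for real ensemble outputs:

```python
import math

def entropy(dist):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)

def mutual_information(member_dists):
    """MI = H(mean_i P_i) - mean_i H(P_i) over M ensemble members;
    it is near zero when members agree and grows with their disagreement."""
    m = len(member_dists)
    mean_dist = [sum(d[k] for d in member_dists) / m
                 for k in range(len(member_dists[0]))]
    return entropy(mean_dist) - sum(entropy(d) for d in member_dists) / m

agree = [[0.9, 0.1], [0.9, 0.1]]     # members agree -> MI near zero
disagree = [[0.9, 0.1], [0.1, 0.9]]  # members disagree -> high MI
```

Each inner list is one member's next-token distribution for the same step; in a real ensemble these would come from separately trained model checkpoints.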
Density-based methods leverage latent representations of instances and construct a probability density on top of them. Usually, these methods approximate the training data distribution with the help of one or multiple Gaussian distributions. They can provide a probability or an unnormalized score that determines how likely it is that an instance belongs to the training data distribution. Therefore, they are good at spotting out-of-distribution (OOD) instances (Vazhentsev et al., 2023b). Several variations of these methods have been proposed in the literature (Lee et al., 2018; Yoo et al., 2022; Ren et al., 2023; Kotelevskii et al., 2022).
The primary advantage of these methods is that they are computationally efficient: they require little additional inference time, and the memory overhead for storing additional parameters is minimal. The drawback is that these methods require access to the model's training data to fit auxiliary models like Gaussians (e.g., the Mahalanobis Distance method requires constructing data centroids and covariance matrices). These methods are also known to capture only epistemic uncertainty. Therefore, they might not be ideal for selective generation, as they cannot be used to spot ambiguous in-domain instances.
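The Mahalanobis Distance method mentioned above can be sketched as follows; this is a simplified stand-in (real implementations operate on model hidden states rather than the random vectors used here):

```python
import numpy as np

def fit_gaussian(train_embeddings):
    """Fit a single Gaussian (centroid + inverse covariance) to
    training-set representations."""
    mu = train_embeddings.mean(axis=0)
    cov = np.cov(train_embeddings, rowvar=False)
    cov += 1e-6 * np.eye(cov.shape[0])  # regularize for invertibility
    return mu, np.linalg.inv(cov)

def mahalanobis(z, mu, cov_inv):
    """Distance of an embedding from the training distribution;
    large values flag out-of-distribution inputs."""
    d = z - mu
    return float(np.sqrt(d @ cov_inv @ d))

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(500, 4))  # stand-in hidden states
mu, cov_inv = fit_gaussian(train)
in_dist = np.zeros(4)      # near the training centroid
ood = np.full(4, 6.0)      # far from the training data
```

Fitting touches the training data once; scoring a new instance is a single matrix-vector product, which is why the overhead of these methods is low.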
Finally, we also combine information-based and density-based methods, as suggested by Vazhentsev et al. (2023a) and Ren et al. (2023). More specifically, we implement the hybrid uncertainty quantification (HUQ) method (Vazhentsev et al., 2023a), which performs a ranking-based aggregation and leverages the strengths of both information-based methods, which detect ambiguous instances, and density-based methods, which detect OOD instances.
Directly asking the model to validate its answer is another option for UE (Kadavath et al., 2022). In this method, one asks a model first to propose an answer and then to evaluate the probability P(True) that this answer is correct. Kadavath et al. (2022) show that this achieves reasonable performance on a variety of tasks, including question answering. We note that this method requires two model inferences: one to generate an answer and a second to process the model's own output. Even though the second inference is usually faster than the first, it still takes considerable time.
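A sketch of this recipe; the prompt template and helper below are hypothetical illustrations, and the toy logits stand in for the model's second pass (a real implementation would read the logits of the "True"/"False" tokens from the LLM):

```python
import math

# Hypothetical template for the second (self-evaluation) pass.
PROMPT_TEMPLATE = (
    "Question: {question}\n"
    "Proposed answer: {answer}\n"
    "Is the proposed answer true? Answer True or False:"
)

def p_true(next_token_logits, true_key, false_key):
    """Softmax over just the True/False logits, returning P(True)."""
    t, f = next_token_logits[true_key], next_token_logits[false_key]
    m = max(t, f)  # subtract the max for numerical stability
    et, ef = math.exp(t - m), math.exp(f - m)
    return et / (et + ef)

# Toy logits standing in for the model's output on the filled template.
logits = {"True": 2.0, "False": 0.5}
score = p_true(logits, "True", "False")  # confidence in the proposed answer
```

Low `score` values would then flag answers the model itself does not believe.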

Black-box Methods
In contemporary models, there are instances where the model's architecture and hidden states are unavailable, or there is no access to logits during response generation. Nevertheless, a whole class of black-box methods needs access only to the model's response. Within the scope of this paper, we consider several approaches of this type that have performed well in other studies (Fomicheva et al., 2020; Kuhn et al., 2023; Lin et al., 2023). We focus on Lexical Similarity, Number of Semantic Sets, Sum of Eigenvalues of the Graph Laplacian, Degree Matrix, and Eccentricity. We follow the same methodological approach as Lin et al. (2023):
• Obtain K responses y_1, ..., y_K for a particular input x.
• Compute a K × K similarity matrix S between responses, where S_ij = s(y_i, y_j) for some similarity score s (a Natural Language Inference score or the Jaccard score).
• Based on the similarity matrix S, compute the final uncertainty score.
Thus, the idea of these methods is to analyze the similarity matrix and aggregate its information into an uncertainty score.
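The steps above can be sketched with the Jaccard score as the similarity s and the mean pairwise dissimilarity as one simple aggregation (an illustrative choice, not the only aggregation the framework implements):

```python
def jaccard(a, b):
    """Word-overlap similarity between two responses, in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def dissimilarity_score(responses):
    """1 minus the mean pairwise similarity over K sampled responses;
    more diverse samples give a higher uncertainty score."""
    k = len(responses)
    sims = [jaccard(responses[i], responses[j])
            for i in range(k) for j in range(k) if i != j]
    return 1.0 - sum(sims) / len(sims)

# K = 3 sampled responses: mutually consistent vs. scattered.
consistent = ["paris is the capital",
              "paris is the capital",
              "the capital is paris"]
scattered = ["paris", "london maybe", "i am not sure at all"]
```

Because only the sampled texts are needed, this recipe works with closed web APIs where logits are unavailable.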

Demo
We constructed a demo application that can be used to interact with LLMs and also see confidence scores of model answers (see Figure 2). A user specifies a UE method and a language model from a number of publicly-available LLMs with up to 13B parameters, e.g., BLOOMz, Vicuna, and LLaMA-2. There is also the ability to communicate with LLMs deployed as web services, such as ChatGPT or GPT-4, and obtain their uncertainty scores based on black-box techniques. For these models, a user should provide an API key.
This demo application is potentially helpful for both end-users and researchers. For end-users, it extends the standard AI assistant interface with information about whether it is reasonable to trust a model answer. Researchers could use this tool for qualitative analysis of various UE methods and LLM responses.

Evaluation Benchmark
LM-Polygraph provides a vast evaluation benchmark. It contains a script for running one or multiple experiments with UE techniques, implemented as Python modules. This feature allows the user to easily extend the set of available methods and evaluate novel UE techniques in a unified manner. Using this benchmark, we have conducted experiments with most of the methods implemented in LM-Polygraph. Below, we provide the experimental details.
Metrics. We focus on the task of selective generation (Ren et al., 2023), in which generated sequences are "rejected" due to low quality based on uncertainty scores. Rejecting means that we do not use the model output, and the corresponding queries are processed differently: they could be processed manually or sent to a more advanced LLM. Following previous work on UE in text generation (Malinin and Gales, 2021; Vazhentsev et al., 2022), we compare the methods using the Prediction Rejection Ratio (PRR) metric (Malinin et al., 2017).
Consider a test dataset D = {(x_i, y_i)}. Let f(x_i) be the output generated by an LLM and U(x_i) be the uncertainty score of a prediction. The prediction rejection (PR) curve indicates the dependence of the average quality Q(f(x_i), y_i) of the covered instances on the rejection rate a, where instances with the highest uncertainty are rejected first. We use ROUGE-L and BERTScore (Zhang et al., 2020) as text quality metrics Q(f(x_i), y_i). Finally, PRR computes the ratio of the area AUCPR_unc between the PR curve for the uncertainty estimates and the curve for random estimates to the area AUCPR_oracle between the oracle and random curves:

PRR = AUCPR_unc / AUCPR_oracle.

Higher PRR values indicate better quality of selective generation.

When working with LLMs as web services, there is usually no access to full posterior distributions over tokens; therefore, only black-box methods can be used. Among this group of approaches, the best average performance is achieved by Eccentricity for Vicuna. For LLaMA, there is no clear advantage for any of the methods considered.
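A minimal sketch of PRR under these definitions; the discretization of the rejection curve is our simplification:

```python
def rejection_curve(order, quality):
    """Mean quality of the kept instances for each rejection count 0..n-1,
    rejecting instances in the order given by `order`."""
    n = len(order)
    return [sum(quality[i] for i in order[k:]) / (n - k) for k in range(n)]

def prr(uncertainty, quality):
    """Area gained over a flat random baseline by rejecting on uncertainty,
    divided by the area gained by an oracle that rejects on true quality."""
    n = len(quality)
    by_unc = sorted(range(n), key=lambda i: -uncertainty[i])  # most uncertain first
    by_oracle = sorted(range(n), key=lambda i: quality[i])    # worst quality first
    random_level = sum(quality) / n  # expected flat curve of a random ranking
    area = lambda curve: sum(q - random_level for q in curve)
    return (area(rejection_curve(by_unc, quality))
            / area(rejection_curve(by_oracle, quality)))

# Uncertainty that perfectly ranks quality yields PRR = 1.
q = [0.2, 0.9, 0.5, 0.7]
u_perfect = [0.8, 0.1, 0.5, 0.3]
```

Here PRR = 1 means the uncertainty scores are as useful for rejection as knowing the true quality, while values near 0 mean they are no better than random.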
Overall, we see that the absolute values for all evaluated methods, models, and datasets are far from perfect. The low performance of current methods is especially evident on more complicated tasks such as XSum and WMT14. Our experimental results demonstrate that the task of selective generation is far from solved. This once again underlines the importance of further research and development of efficient uncertainty estimation techniques for generative language models.

Conclusion
As the community strives to advance the potential of LLMs, it is critical to be mindful of the dangers of their uncontrolled usage. In this work, we propose a tool for making the application of LLMs safer. Enriching model predictions with uncertainty scores helps users and developers to be informed about these risks, encouraging healthy skepticism towards certain outputs generated by these models.
We plan to further expand our framework with implementations of new UE methods that emerge in the future.We hope that our work will foster the development of techniques to detect and mitigate LLM hallucinations, which we believe is a key to unlocking the safe, responsible, and effective use of LLMs in real-world applications.

Limitations
We have tried to be as comprehensive as possible with our collection of UE methods.However, we omit several techniques that have not demonstrated strong performance in previous work, do not have a strong theoretical motivation, or are similar to other implemented techniques.
We note that comprehensive evaluation of UE methods is an open research question. LM-Polygraph takes the first steps towards systematizing UE techniques and providing interfaces and tools for testing them in a unified manner. However, we believe that the number of tasks and datasets should be extended in the future.
When running the demo, we cannot provide access to the biggest and most powerful public LLMs, because running them is prohibitively expensive. Nevertheless, a user can access models such as ChatGPT by providing an API access key.
LM-Polygraph supports common application program interfaces used by modern LLMs. However, it is possible that certain modifications will be required to support future releases of LLMs.
At the moment of writing, LM-Polygraph provides valid uncertainty estimates only for model outputs in English. This is due to the fact that most of the implemented generation quality metrics are based on English-specific implementations and non-multilingual models. We plan to alleviate this limitation by allowing the user to easily employ custom quality metrics and scoring models.

Ethics Statement
We conducted all experiments on publicly-available datasets that have been leveraged in various previous work on uncertainty estimation of LLMs.
While training data for most LLMs, such as BLOOMz, was selected to contain little or no abusive text content, such models can still potentially output harmful textual content. Techniques investigated in our work estimate the certainty of an LM output in order to "censor" it, and model debiasing is an orthogonal direction to our line of work. These additional methods can and perhaps should be combined in real production LLM deployments. We hope that our framework contributes to safer and more reliable usage of language models.

It is worth noting that a higher number of semantic sets corresponds to an increased level of uncertainty, as it suggests a higher number of diverse semantic interpretations for the answer.
Nonetheless, it is essential to acknowledge a limitation of this measure: it can only take integer values. Additionally, it cannot be assumed that the semantic equivalence derived from the NLI model is always transitive. Consequently, Lin et al. (2023) suggest the consideration of a continuous counterpart of this metric. They propose the Sum of Eigenvalues of the Graph Laplacian as a potential alternative approach.
Let us consider the symmetrized similarity matrix S_{j₁j₂} = (s(y_{j₁}, y_{j₂}) + s(y_{j₂}, y_{j₁}))/2; averaging is done to obtain better consistency. The normalized graph Laplacian of the obtained similarity matrix S is L = I − D^{−1/2} S D^{−1/2}, where D is a diagonal matrix with D_ii = Σ_{j=1}^{K} S_ij. Consequently, the following uncertainty measure is derived:

U_EigV = Σ_{k=1}^{K} max(0, 1 − λ_k),

where λ_1, ..., λ_K are the eigenvalues of L. This value is a continuous analogue of U_NumSemSets; in the extreme case where the adjacency matrix S is binary, the two measures coincide.
From both a theoretical and a practical point of view, U_EigV is a much more flexible approach than U_NumSemSets. Still, they share a common disadvantage: they cannot provide an uncertainty score for each individual answer. However, Lin et al. (2023) demonstrate that such a score can be obtained from the Degree Matrix D computed above. The idea is that the total uncertainty of the answers can be measured as a corrected trace of the diagonal matrix D, because the elements on the diagonal of D are sums of similarities between a given answer and the other answers. Thus, trace(D)/K² is the average pairwise similarity between all answers, and the resulting uncertainty measure U_Deg = 1 − trace(D)/K² grows as the answers become more dissimilar.
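A sketch of the two Laplacian-based scores under the definitions above (following Lin et al., 2023); the toy similarity matrices stand in for NLI-based similarities between sampled answers:

```python
import numpy as np

def laplacian_scores(S):
    """U_EigV = sum_k max(0, 1 - lambda_k) for the eigenvalues of
    L = I - D^{-1/2} S D^{-1/2}, and U_Deg = 1 - trace(D) / K^2."""
    K = S.shape[0]
    deg = S.sum(axis=1)                      # diagonal of the degree matrix D
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    L = np.eye(K) - D_inv_sqrt @ S @ D_inv_sqrt
    eigvals = np.linalg.eigvalsh(L)
    u_eigv = float(np.sum(np.clip(1.0 - eigvals, 0.0, None)))
    u_deg = float(1.0 - deg.sum() / K ** 2)
    return u_eigv, u_deg

# Three mutually similar answers vs. three mutually dissimilar ones.
similar = np.array([[1.0, 0.9, 0.9], [0.9, 1.0, 0.9], [0.9, 0.9, 1.0]])
diverse = np.array([[1.0, 0.1, 0.1], [0.1, 1.0, 0.1], [0.1, 0.1, 1.0]])
eig_s, deg_s = laplacian_scores(similar)
eig_d, deg_d = laplacian_scores(diverse)
```

Both scores are larger for the diverse set of answers, reflecting higher uncertainty.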
A drawback of the previously considered methods is the limited knowledge of the actual embedding space of the different answers, since we only have measures of their similarities. Nevertheless, we can overcome this limitation by taking advantage of the spectral properties of the graph Laplacian, which make it possible to obtain coordinates for the answers. Let us introduce u_1, ..., u_k ∈ R^K as the eigenvectors of L that correspond to the k smallest eigenvalues. We can then construct an informative embedding v_j = [u_{1,j}, ..., u_{k,j}] for an answer y_j. Lin et al. (2023) demonstrate that this approach allows using the average distance from the center as an uncertainty metric and the distance of each response from the center as a measure of (negative) confidence. In mathematical terms, the Eccentricity estimates can be defined as

U_Ecc = ‖[v_1 − v̄, ..., v_K − v̄]‖_2,   C_Ecc(x, y_j) = −‖v_j − v̄‖_2,

where v̄ is the mean of the embeddings v_1, ..., v_K.

Last but not least, Lexical Similarity is a measure proposed by Fomicheva et al. (2020) that computes how similar two words or phrases are in terms of their meaning. Since the original article is dedicated to machine translation, this measure calculates the average similarity score between all pairs of translation hypotheses in a set, using a similarity measure based on the overlap of their lexical items. Different metrics can be used, such as ROUGE-1, ROUGE-2, ROUGE-L, and BLEU. For our task, this measure iterates over all responses and calculates the average score with the other answers.
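Eccentricity can be sketched from the same Laplacian construction (assumed details: k = 2 eigenvectors and Euclidean distances; the similarity matrix is a toy stand-in):

```python
import numpy as np

def eccentricity(S, k=2):
    """Embed answers via the k smallest-eigenvalue eigenvectors of the
    normalized Laplacian; return the total uncertainty and each answer's
    distance from the embedding center (its negative confidence)."""
    K = S.shape[0]
    deg = S.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    L = np.eye(K) - D_inv_sqrt @ S @ D_inv_sqrt
    _, eigvecs = np.linalg.eigh(L)                 # eigenvalues ascending
    V = eigvecs[:, :k]                             # embedding v_j per answer
    centered = V - V.mean(axis=0)
    per_answer = np.linalg.norm(centered, axis=1)  # distance from the center
    return float(np.linalg.norm(centered)), per_answer

# Two answers agree with each other; the third is the odd one out.
S = np.array([[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.1],
              [0.1, 0.1, 1.0]])
total, per_answer = eccentricity(S)
```

The outlier answer lands farthest from the center of the spectral embedding, so it receives the lowest confidence.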

D Dataset Statistics
Table 7 illustrates the statistics of the datasets used in the experiments. Experiments were conducted using all examples from the test sets of these datasets, while the training of density-based methods was performed on a random subset of 1000 elements from the train set. To evaluate the performance of the considered uncertainty estimation methods, we provide code to retrieve benchmark results. Figure 3 shows an example of starting an experiment with the Vicuna-v1.5-7b model on the Question Answering task (CoQA dataset). Figure 4 shows an example of a config file used for an experiment on the CoQA dataset with the Vicuna-v1.5-7b model. It contains information about imports and parameters. For other datasets and models, the config structure is the same.

E Normalization of Uncertainty Estimates in Demo App
To make uncertainty estimation more intuitive for the end user directly interacting with the LLM, we normalize the various uncertainty estimates. After normalization, the output UE(x) of any uncertainty estimation approach becomes a confidence score C(x) ∈ [0, 1].
We experimented with several ways of achieving this normalization, including a quantile-based approach and simple linear normalization by the maximum value obtained from a validation dataset. Eventually, we implemented normalization as a calibration procedure, where the normalized confidence score represents the expected value of a generation quality metric of choice (e.g., ROUGE-L) for a given uncertainty estimate. This expectation is estimated by computing sample averages of the quality metric over bins of uncertainty estimates calculated on a validation dataset. For the ROUGE-L metric, the confidence estimate thus becomes

C(x_input) = (1 / |B(x_input)|) Σ_{x_i ∈ B(x_input)} ROUGE-L(f(x_i), y_i),

where B(x_input) is the set of validation instances whose uncertainty estimates fall into the same bin as UE(x_input).
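A sketch of this bin-based calibration under assumed details (equal-width bins; the function names are ours, not the demo's internals):

```python
def fit_bins(val_unc, val_quality, n_bins=4):
    """On a validation set, bucket uncertainty estimates into equal-width
    bins and store each bin's mean quality metric (e.g., ROUGE-L)."""
    lo, hi = min(val_unc), max(val_unc)
    width = (hi - lo) / n_bins or 1.0  # guard against a degenerate range
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for u, q in zip(val_unc, val_quality):
        b = min(int((u - lo) / width), n_bins - 1)
        sums[b] += q
        counts[b] += 1
    means = [s / c if c else None for s, c in zip(sums, counts)]
    return lo, width, means

def confidence(u, lo, width, means):
    """Confidence of a new estimate = stored mean quality of its bin."""
    b = min(max(int((u - lo) / width), 0), len(means) - 1)
    return means[b]

# Validation data where low uncertainty coincides with high ROUGE-L.
val_u = [0.1, 0.2, 0.5, 0.6, 0.9, 1.0]
val_q = [0.9, 0.8, 0.5, 0.4, 0.2, 0.1]
lo, width, means = fit_bins(val_u, val_q)
c_low_unc = confidence(0.15, lo, width, means)   # confident prediction
c_high_unc = confidence(0.95, lo, width, means)  # unreliable prediction
```

Because the bin means are sample averages of the quality metric, the returned confidence directly approximates the expected generation quality for that uncertainty level.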

Figure 2: User interface of the demo. A user can interact with an LLM as with any other chat service, but in LM-Polygraph the user also sees the confidence of the model answers. It is possible to specify various UE techniques and various models, including ChatGPT.

Figure 4: Config example for Question Answering on the CoQA dataset.

Table 2: PRR↑ for the Vicuna model with ROUGE-L and BERTScore as text quality metrics. Darker color indicates better results.

Table 3: PRR↑ for the LLaMA-2 model with ROUGE-L and BERTScore as text quality metrics. Darker color indicates better results.

Table 4 presents the hyperparameters used for the experiments with the Vicuna-v1.5-7b and LLaMA-2-7b-hf LLMs on various datasets and tasks. The maximum length of the generated sequence was set for each dataset as the 99th percentile of the target sequence length on the respective train set.

Table 6: ROUGE-L↑ and BERTScore↑ for the LLaMA-2 model for various tasks.

Table 7: Quantitative information regarding the datasets used in the experiments. It includes the count of instances in the training, validation, and test sets, as well as the mean lengths of texts and targets (answers / translations / summaries) in tokens. In addition, the languages of the source and target texts are specified.