Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback

A trustworthy real-world prediction system should produce well-calibrated confidence scores; that is, its confidence in an answer should be indicative of the likelihood that the answer is correct, enabling deferral to an expert in cases of low-confidence predictions. Recent studies have shown that unsupervised pre-training produces large language models (LMs) whose conditional probabilities are remarkably well-calibrated. However, the most widely-used LMs are fine-tuned with reinforcement learning from human feedback (RLHF-LMs), and some studies have suggested that RLHF-LMs produce conditional probabilities that are very poorly calibrated. In light of this perceived weakness, we conduct a broad evaluation of methods for extracting confidence scores from RLHF-LMs. For RLHF-LMs such as ChatGPT, GPT-4, and Claude, we find that verbalized confidences emitted as output tokens are typically better-calibrated than the model's conditional probabilities on the TriviaQA, SciQ, and TruthfulQA benchmarks, often reducing the expected calibration error by a relative 50%.


Introduction
[Figure 1: Verbalized confidence scores (blue) are better-calibrated than log probabilities (orange) for gpt-3.5-turbo. Raw model probabilities (top-left) are consistently over-confident. Verbalized numerical probabilities (bottom) are better-calibrated. Considering more answer choices (bottom-right) further improves verbalized calibration (as in 'Considering the Opposite' in psychology; Lord et al. (1985)). Verbalized expressions of likelihood (top-right) also provide improved calibration. Bar height is average accuracy of predictions in bin. Darker bars mean more predictions fall in that confidence range. Results computed on SciQ.]

Real-world prediction systems invariably make errors. However, some mitigation of these errors is possible if the system produces well-calibrated confidence estimates. In this case, the system's least confident predictions correspond to those that are most likely to be incorrect, potentially allowing these predictions to be skipped or overridden by a human. In the context of language models, one consequence of poor calibration may be hallucination, where a language model confidently asserts incorrect facts or reasoning. While the ability of very large LMs to absorb and synthesize knowledge about the outside world has gained significant
attention (Brown et al., 2020; Roberts et al., 2020; Bubeck et al., 2023), relatively little attention has been given to their calibration (Kadavath et al., 2022). Further, most existing analyses of the calibration of LLMs focus on models trained with maximum likelihood, while in practice, the most widely-used LLMs (such as ChatGPT) are fine-tuned using methods such as reinforcement learning from human feedback (Christiano et al., 2017). Some findings suggest that RLHF-LMs may sacrifice well-calibrated predictions for the sake of closer adherence to user instructions in dialogue (Kadavath et al., 2022; OpenAI, 2023), as the reinforcement learning objective encourages the model to allocate probability mass to the most preferred answer(s), rather than matching the relative frequency of possible answers. This paper evaluates several methods for extracting confidences about model predictions from RLHF-LMs. Due to concerns that RLHF may cause systematic overconfidence in the model's probabilities (Figure 2), as well as the general unavailability of per-token log-probabilities in widely used RLHF-LMs, we pay particular attention to prompts that elicit verbalized probabilities, i.e., the model expresses its confidence in token space, as either numerical probabilities or another linguistic expression of uncertainty. We find that, surprisingly, popular RLHF-LMs are able to directly verbalize confidence scores that are better-calibrated than the model's conditional probabilities (estimated via sampling), without any fine-tuning to learn verbalization. To further improve calibration, we take inspiration from research in human psychology showing that overconfidence can be mitigated by considering alternative answers before responding (Lord et al., 1985; Mussweiler et al., 2000). We show that prompting a model to produce several answer choices before giving its confidence scores significantly improves calibration of verbalized probabilities. Combined with temperature scaling (Guo et al., 2017), this approach generally provides better calibration than model probabilities for ChatGPT, GPT-4, and Claude 2 across three datasets, often reducing expected calibration error (ECE) by over 50%.

Related Work. Several studies have examined the calibration of large LMs (Lin et al., 2022a; Park and Caragea, 2022; Kadavath et al., 2022; Xiao et al., 2022; Kuhn et al., 2023), finding that combining large pre-trained LMs with temperature scaling (Guo et al., 2017) produces very well-calibrated predictions (Kadavath et al., 2022; Xiao et al., 2022; Kuhn et al., 2023). Other work focuses on the tendency of language and dialogue models to use linguistic expressions of uncertainty in a well-calibrated manner (Zhou et al., 2023; Mielke et al., 2022). However, existing studies focus on LMs trained purely with unsupervised learning (although Kadavath et al.
(2022) briefly examine RLHF-LMs), while widely used models in practice are fine-tuned with instruction-tuning or RLHF (Christiano et al., 2017). RLHF has been shown to effectively leverage annotations of human preferences to control sentiment (Ziegler et al., 2020), improve summarization or instruction-following quality (Stiennon et al., 2022; Ouyang et al., 2022), and inject behavioral priors of harmlessness (Bai et al., 2022b,a). However, recent work has raised the question of whether or not RLHF harms calibration (OpenAI, 2023). Our work is the first to show that verbalized probabilities are often better-calibrated than the model's conditional probabilities for RLHF-LMs such as ChatGPT, GPT-4, Claude, and Llama-2-70B-Chat.
Metrics. We measure calibration with multiple metrics. To measure ECE (expected calibration error; Guo et al. (2017)), we bin model predictions by their confidence and measure the average accuracy of predictions in each confidence bin. The ECE is defined as the average (squared) error between the average accuracy and confidence within each bin, where each error is weighted by the fraction of samples falling within the bin. We report raw ECE as well as ECE with temperature scaling (ECE-t). Temperature scaling fits a single temperature value β to the model's confidences to minimize negative log likelihood (NLL) on the data, giving scaled probability p̂_i of class i as p̂_i ∝ p_i^β. See Figure 1 for a depiction of ECE binning. Although ECE is a standard and interpretable measure of calibration error, it completely fails to capture the confidences' discriminative power. We therefore also report Brier Score (BS; Brier (1950)) on temperature-scaled confidences (BS-t), a proper scoring rule (Ovadia et al., 2019) that is the mean squared error between the confidences and the correctness labels. Finally, we assess calibration using a metric from the selective classification literature (Geifman and El-Yaniv, 2017), specifically, the area under the curve of selective accuracy and coverage (AUC).
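As a concrete illustration (a sketch under our own naming, not the authors' released code), the binned ECE, the Brier score, and a grid-search version of the p̂_i ∝ p_i^β temperature scaling described above might look like:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by confidence; ECE is the bin-size-weighted average
    gap between each bin's mean accuracy and mean confidence (the common
    absolute-gap variant; a squared-gap variant is analogous)."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece

def brier_score(confidences, correct):
    """Mean squared error between confidences and 0/1 correctness labels."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    return float(np.mean((confidences - correct) ** 2))

def temperature_scale(confidences, correct, betas=np.linspace(0.1, 5.0, 50)):
    """Fit a single exponent beta by grid search to minimize NLL, rescaling
    a binary confidence p as p**beta / (p**beta + (1-p)**beta)."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)

    def rescale(p, b):
        p = np.clip(p, 1e-6, 1 - 1e-6)
        return p ** b / (p ** b + (1 - p) ** b)

    def nll(p):
        return -np.mean(correct * np.log(p) + (1 - correct) * np.log(1 - p))

    best = min(betas, key=lambda b: nll(rescale(confidences, b)))
    return rescale(confidences, best), best
```

For instance, ten predictions made with confidence 0.8, of which eight are correct, yield an ECE of zero, while four fully confident predictions that are correct only half the time yield an ECE of 0.5.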
Datasets. Our experiments use three question-answering datasets assessing factual knowledge.
TriviaQA (Joshi et al., 2017) contains 650k question-answer pairs gathered by trivia enthusiasts; SciQ (Welbl et al., 2017) contains approximately 14k crowdsourced science exam question-answer pairs; TruthfulQA (Lin et al., 2022b) contains 817 questions designed to test language models' tendency to mimic human falsehoods. We sample 1000 questions from the validation splits of TriviaQA (rc.web.nocontext) and SciQ and all 817 questions from the validation split of TruthfulQA (generation) for our experiments.
Evaluation protocol.For each dataset, we generate a response and corresponding confidence from each method on each of the evaluation questions.
Because calibration essentially quantifies the relationship between model confidence and correctness, computing correctness is crucial to accurate measurements of calibration. However, we find doing so to be a challenge, especially in datasets where only a single ground-truth answer (but not aliases or semantically equivalent rephrases) is provided. To avoid excessive false negatives in our correctness computation as a result of exact-match evaluation, we use either GPT-4 or GPT-3.5 to evaluate whether a response is essentially equivalent to the ground truth answer; see Appendix C for the complete equivalence-checking procedure.
Methods. We compare a wide variety of methods for extracting confidence estimates from LLMs. For a comprehensive list of the prompts used for each method, see Appendix Table 6. First, we consider two methods that leverage the true conditional distribution of the model to generate confidence scores. The simplest is Label prob., which uses the conditional probability distribution p(y|x) of the model given a question x, which we estimate using n = 10 samples, since many RLHF-LMs are closed-source and do not offer per-token probabilities. We return the most common answer, using the LLM-based equivalence function to determine when two lexically different answers are semantically equivalent. In a variation of the method described by Kadavath et al. (2022) (again, we use samples since model probabilities are not available), 'Is True' prob. samples a single answer ŷ from the model given a question x; the probability it is true is estimated by the probability the model assigns to 'True' when asked if the given answer is true (where once again the probabilities are estimated via samples), i.e., p(True|x, ŷ).
Next, we consider methods that extract confidence scores through verbalization (Lin et al., 2022a), i.e., where the model expresses its confidence in token space, either with numerical probabilities or linguistic expressions of likelihood. Note that none of the methods described fine-tune the model to perform better on verbalization. First, Verb. 1S top-k prompts the model to produce k guesses and a probability that each is correct, all in a single response (i.e., '1 stage'). We take the highest-probability prediction and its associated probability as the model's output and confidence. Verb. 2S top-k similarly uses numerical probabilities, except the model is first asked to provide only its answers, and afterwards, in a second round of dialogue, asked to assign probabilities of correctness to each answer (i.e., '2 stages'). Verb. 2S CoT uses a chain-of-thought prompt before giving a single answer, and in a second round of dialogue, the model is prompted to assign a probability to that answer (with the chain of thought present in the model's context). Ling. 1S-human uses linguistic likelihood expressions, rather than numerical probabilities, to express uncertainty. The model is prompted to assign confidences to its guesses by choosing from a set of linguistic expressions of uncertainty: {Almost certain, Likely, ..., Almost no chance}. Each linguistic likelihood expression is mapped to a probability using responses from a human survey on social media with 123 respondents (Fagen-Ulmschneider, 2023). Ling. 1S-opt. uses a held-out set of calibration questions and answers to compute the average accuracy for each likelihood expression, using these 'optimized' values instead. Expressions that are not used for at least 1/N of questions, where N is the number of calibration questions, simply use the human probability.

[Footnotes: We evaluated gpt-3.5-turbo on all three datasets using n = 20 samples, but the calibration did not meaningfully improve, so we always use n = 10 to reduce API costs. For each closed LM, we use its default sampling parameters (top-p 1.0 for GPT-* and top-p 0.7 for Claude). For Llama-2, we use temperature 1.0 and top-p 1.0.]
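The sampling-based Label prob. estimate above can be sketched as follows (our own illustration, not the paper's code): `sample_fn` stands in for a call to the chat model and `equivalent_fn` for the LLM-based equivalence judge.

```python
def label_prob(sample_fn, equivalent_fn, question, n=10):
    """Estimate p(y|x) from n sampled answers: cluster samples that the
    equivalence judge deems semantically equivalent, then return the most
    common cluster's representative answer and its empirical frequency."""
    clusters = []  # each entry: [representative answer, count]
    for _ in range(n):
        answer = sample_fn(question)
        for cluster in clusters:
            if equivalent_fn(cluster[0], answer):
                cluster[1] += 1
                break
        else:
            clusters.append([answer, 1])
    representative, count = max(clusters, key=lambda c: c[1])
    return representative, count / n
```

With a toy case-insensitive judge and a sampler that returns "Paris"-like strings 8 times out of 10, this returns the answer "Paris" with confidence 0.8.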

Results
Tables 1-5 show the results of evaluating various methods for extracting confidence from RLHF-LMs on gpt-3.5-turbo, gpt-4, claude-1, claude-2, and Llama-2-70b-chat, respectively. We distill several key conclusions from these experiments. 1. Large RLHF-LMs can often directly verbalize better-calibrated confidences (either a numerical confidence probability or an expression such as 'highly likely') than the models' conditional probabilities. 2. Among the methods for verbalizing probabilities directly, we observe that generating and evaluating multiple hypotheses improves calibration (see Figure 1), similarly to humans (Lord et al., 1985), and corroborating a similar finding in LMs (Kadavath et al., 2022).
3. Language models can express their uncertainty with numerical probabilities as well as or better than with words, which is surprising in light of longstanding difficulties in representing numbers in language models (Thawani et al., 2021). 4. Chain-of-thought prompting does not improve verbalized calibration (see Appendix Figure 5 for additional CoT results). 5. The calibration of both Claude models' conditional probabilities roughly falls between gpt-3.5-turbo and gpt-4; however, while Claude 1 is much weaker at verbalizing its confidence, Claude 2 is generally a bit stronger than gpt-3.5-turbo at verbalizing. The verbalized calibration of the open-source model Llama-2-70b-chat is generally weaker than that of closed-source models but still demonstrates improvement over its conditional probabilities by some metrics, and does so most clearly on TruthfulQA.

Discussion
In summary, we study the calibration of widely used RLHF-LMs. We first replicate the finding for GPT-4 (OpenAI, 2023) that RLHF can worsen the calibration of a model's conditional probabilities, using the open-source Llama-2-70B base and chat models (Figure 2). To mitigate this regression and ease extraction of calibrated confidence scores for models for which log probabilities are not available, we propose and study new methods that elicit calibrated confidences from RLHF-LMs by prompting the model to verbalize its confidence in token space. We find verbalized probabilities are better-calibrated than conditional probabilities across several closed models, with mixed results for Llama-2-70B-Chat.
Our results raise several questions for future work. Most notably, the difference between GPT-*, Claude-*, and Llama-2's ability to verbalize confidence is significant. What factors are important for learning this skill? Additionally, the 1-stage and 2-stage verbalized numerical confidence prompts sometimes differ drastically in the calibration of their confidences. How can we reduce the sensitivity of a model's calibration to the prompt? Going beyond question-answering, can we leverage good calibration in short-answer settings to improve the reliability of long-form generations, perhaps by breaking down long-form generation into a sequence of short questions? Finally, to what extent does a language model's calibration depend on the domain; do our conclusions in the context of factual recall hold in the context of reasoning or arithmetic? Answering these questions provides one path toward building more trustworthy and useful language systems.

Limitations. While our work demonstrates a promising new approach to generating calibrated confidences through verbalization, there are limitations that could be addressed in future work. First, our experiments are focused on factual-recall-oriented problems, and the extent to which our observations would hold for reasoning-heavy settings is an interesting open question. Additionally, the lack of technical details available for many state-of-the-art closed RLHF-LMs may limit our ability to understand what factors enable a model to verbalize well-calibrated confidences and differences in this ability across models. Finally, our study is limited to short-form question-answering; future work should extend this analysis to longer-form generation settings.
high disagreement with a human evaluator on TriviaQA. Using the ground truth answer as ${GOLD_ANSWER} and the model-generated answer as ${PRED_ANSWER}, we use the following prompt template:

Are the following two answers to my question Q semantically equivalent?\n\nQ: ${THE_QUESTION}\nA1: ${GOLD_ANSWER}\nA2: ${PRED_ANSWER}\n\nPlease answer with a single word, either "Yes." or "No.", and explain your reasoning.
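A minimal sketch of how this template might be applied programmatically; `chat` is a hypothetical callable wrapping a chat-completion API, and checking only whether the reply begins with "Yes" is our own simplification.

```python
EQUIVALENCE_TEMPLATE = (
    "Are the following two answers to my question Q semantically "
    "equivalent?\n\nQ: {question}\nA1: {gold}\nA2: {pred}\n\n"
    'Please answer with a single word, either "Yes." or "No.", '
    "and explain your reasoning."
)

def answers_equivalent(chat, question, gold, pred):
    """Ask the judge model whether pred matches gold; treat a reply
    beginning with 'Yes' as a semantic match."""
    reply = chat(EQUIVALENCE_TEMPLATE.format(question=question,
                                             gold=gold, pred=pred))
    return reply.strip().lower().startswith("yes")
```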

Figure 2: RLHF generally worsens the calibration of Llama-70B's log probabilities, as measured by ECE (lower is better) or AUC (higher is better). However, this paper (Tables 1-5) will show that for several strong RLHF-LMs, the model's verbalized confidence is often better-calibrated than its log probabilities, reversing some of this degradation. This reversal is strongest for TruthfulQA, an adversarial dataset testing common misconceptions and other difficult queries.

Table 1: Measuring calibration of various methods for extracting confidences from gpt-3.5-turbo (ChatGPT). The model's conditional probabilities are relatively poorly calibrated, whether using the model's conditional probability of the label given the query (Label prob.) or the probability assigned to 'True' given the query, proposed answer, and a prompt asking if the answer is correct ('Is True' prob.). Surprisingly, directly verbalizing a probability (Verb. 1S and Verb. 2S) or an expression of confidence such as 'highly likely' (Ling. 1S) yields significantly better-calibrated confidence estimates. 1S refers to one-stage prediction, where the model provides an answer and confidence probability/expression together. 2S refers to two-stage prediction, where the model first gives only an answer, and then in a second stage a confidence. To color the table cells, for each column, we demean and scale by a constant to obtain a shade in [-1, 1], where cyan indicates better and orange worse performance.

Table 2: gpt-4's verbalized probabilities are substantially better-calibrated than the model probabilities themselves, even after temperature scaling, similarly to gpt-3.5-turbo in Table 1.

Table 4: Claude-2 has weaker conditional probabilities than Claude-1 and GPT-*, but its verbalized calibration provides consistent improvement over conditional probabilities at a level comparable to GPT-3.5 and surpassing GPT-* on TruthfulQA.