Adaptation with Self-Evaluation to Improve Selective Prediction in LLMs

Large language models (LLMs) have recently shown great advances in a variety of tasks, including natural language understanding and generation. However, their use in high-stakes decision-making scenarios is still limited due to the potential for errors. Selective prediction is a technique that can improve the reliability of LLMs by allowing them to abstain from making predictions when they are unsure of the answer. In this work, we propose a novel framework for adaptation with self-evaluation to improve the selective prediction performance of LLMs. Our framework is based on the idea of using parameter-efficient tuning to adapt the LLM to the specific task at hand while improving its ability to perform self-evaluation. We evaluate our method on a variety of question-answering (QA) datasets and show that it outperforms state-of-the-art selective prediction methods. For example, on the CoQA benchmark, our method improves the AUACC from 91.23% to 92.63% and the AUROC from 74.61% to 80.25%.


Introduction
Large Language Models (LLMs) have recently demonstrated impressive capabilities in many natural language understanding, reasoning, and generation tasks, such as question answering (Jiang et al., 2021; Singhal et al., 2023), summarization (Tang et al., 2023; Zhang et al., 2023b), semantic classification, and code generation (Poesia et al., 2022; Zhang et al., 2023a). As their performance improves, LLMs are increasingly being considered as replacements for humans on high-stakes tasks. For example, LLMs can be used for medical QA to assist patients (Singhal et al., 2022). However, LLMs are not guaranteed to be accurate for all queries, so it is important to understand which queries they are reliable for. This information can be used to direct human oversight to the queries with the lowest selection scores. Selective prediction (Geifman and El-Yaniv, 2017) broadly refers to the deployment scenario for AI models where humans are involved to maintain overall accuracy by reviewing AI-generated, low-confidence outputs. In this scenario, both human and AI performance are considered together to minimize the cost of human involvement. LLMs should be deployed in the real world with enhanced selective prediction performance: they should be able to assess the accuracy of their predictions and refrain from making wrong ones. If an LLM detects that an answer might be wrong for a question, it should be able to generate an answer with the sentiment of "I don't know!" (as shown in Fig. 1) or defer the answer to a human for manual inspection. This helps ensure that LLMs are used reliably, especially in high-stakes applications.
Selective prediction for LLMs is challenging because LLMs are only trained to predict the next token given a context and are not guaranteed to always predict the correct next token. Moreover, since LLMs generate an output sequence auto-regressively, they do not directly produce a confidence score for the output sequence, so obtaining selection scores from LLMs for their output sequences is not straightforward. Although there is some research on selective prediction for LLMs, these studies have their own shortcomings. Kadavath et al. (2022) propose to use heuristic prompts (e.g., adding prompts like "Is the proposed answer True or False?") to trigger self-evaluation of LLMs. However, those prompts may only work for the LLM used in Kadavath et al. (2022) and may not generalize to other types of LLMs (e.g., the OPT and GPT-2 models evaluated in our work). Some approaches propose using semantic entropy (Kuhn et al., 2023) or self-consistency (Wang et al., 2022) as an uncertainty measure for the selection score. However, they usually require generating multiple output sequences to obtain the uncertainty measure for an input sequence, which introduces high computational cost and latency at test time. Fine-tuning LLMs on training data from the target question answering task using the standard LLM training loss can improve selective prediction performance, because fine-tuning improves the accuracy of the predictions and maximizes the likelihood of the ground-truth answer for a given question. However, maximizing the likelihood of the ground-truth answer is not the same as minimizing the likelihood of the wrong answers, since LLMs generate output sequences auto-regressively. Even after fine-tuning, some wrong answers may still have high likelihood and be generated by the LLM at test time. Therefore, distinguishing correct and incorrect answers based on likelihood scores alone is challenging.
To address these challenges of self-evaluation and uncertainty estimation, we propose a novel framework: Adaptation with Self-Evaluation to Improve Selective Prediction in LLMs (ASPIRE). Unlike previous methods that rely on hand-crafted heuristics or multiple output sequences, our framework learns to self-evaluate from target-task data. We do this by training LLMs on a subset of the training data from the question-answering tasks, which allows the LLMs to learn to distinguish between correct and incorrect answers on their own. We then define a selection score that combines the likelihood of the generated answer with the learned self-eval score (see Eq. (11)) to make selective predictions. This makes our method much less computationally expensive than solutions that require generating multiple output sequences to obtain the uncertainty measure. Thus, the proposed method is useful for practical applications where high selective prediction performance and low inference costs are desired after deploying the LLM. In such applications, practitioners prefer collecting some training data to fine-tune smaller LLMs to achieve high selective prediction performance, rather than directly deploying very large pre-trained LLMs with limited selective prediction performance for specific tasks.
We conduct extensive experiments to evaluate our proposed framework, ASPIRE. We show that ASPIRE achieves state-of-the-art selective prediction performance on three question answering datasets: CoQA, TriviaQA and SQuAD, using OPT and GPT-2 models. We also provide empirical analyses to delve deeper into the proposed technique.

Related Work
Selective Prediction for LLMs. Recently, LLMs (e.g., GPT-4 (OpenAI, 2023) and PaLM (Chowdhery et al., 2022)) have achieved great success in solving various kinds of Natural Language Generation (NLG) tasks. However, LLMs are still not very reliable and may generate wrong outputs when solving NLG tasks. Due to this, selective prediction (sometimes called selective generation (Ren et al., 2022)) is critical for safely deploying LLMs in the real world. Selective prediction for LLMs solving NLG tasks is fundamentally different from selective prediction for classification tasks (e.g., Natural Language Inference (NLI) tasks) (Xin et al., 2021), since the prediction is made auto-regressively over many steps and the possible answer set has infinite size. Recently, several works have proposed uncertainty measures for LLMs that can be used for selective prediction (Si et al., 2022; Kadavath et al., 2022; Varshney et al., 2022; Ren et al., 2022; Kuhn et al., 2023). Some recent work studies selective prediction for question answering tasks where questions are ambiguous (Cole et al., 2023; Yin et al., 2023). Varshney and Baral (2023) propose a selective prediction method that, at inference time, leverages an auxiliary model trained to distinguish the correct predictions of the QA model from the incorrect ones. Different from previous work, our work improves the selective prediction performance of LLMs on question answering tasks by learning self-evaluation during fine-tuning.

Parameter-Efficient Fine-tuning. Fine-tuning pre-trained LLMs on downstream datasets can bring huge performance gains compared to using the pre-trained LLMs out-of-the-box (e.g., k-shot inference). However, as LLMs get larger and larger, full fine-tuning becomes very expensive in terms of computational cost and memory requirements. In addition, massive models might not be data efficient, and overfitting issues might be observed, yielding suboptimal generalization. To address these issues, Parameter-Efficient Fine-tuning (PEFT) approaches have been proposed. PEFT approaches fine-tune only a small number of (extra) model parameters while freezing most parameters of the pre-trained LLMs, thereby greatly decreasing the computational and storage costs. It has also been shown that PEFT approaches outperform full fine-tuning in low-data regimes and generalize better to out-of-domain scenarios. Existing PEFT approaches include LoRA (Hu et al., 2021), Prefix Tuning (Liu et al., 2021a), Soft Prompt Tuning (Lester et al., 2021) and P-Tuning (Liu et al., 2021b). In this work, we use Soft Prompt Tuning to learn self-evaluation to improve the selective prediction performance of LLMs.

Problem Setup
Suppose we have a pre-trained LLM f for an arbitrary generative modeling task such as question answering. The output can be represented as a sequence of tokens from the vocabulary V. Let V* be the space of sequences of tokens, and let f(v | x) denote the logit of f on the token v given the input x. The likelihood of the next token following x being v is defined as:

$$\hat{f}(v \mid x) = \frac{\exp\big(f(v \mid x)\big)}{\sum_{v' \in V} \exp\big(f(v' \mid x)\big)}, \tag{1}$$

whereas the likelihood of generating ŷ ∈ V* given x is defined as:

$$\hat{f}(\hat{y} \mid x) = \prod_{i=1}^{|\hat{y}|} \hat{f}\big(\hat{y}_i \mid x, \hat{y}_{[i-1]}\big), \tag{2}$$

where ŷ = (ŷ_1, ..., ŷ_{|ŷ|}), |ŷ| is the length of ŷ, ŷ_{[i−1]} = (ŷ_1, ..., ŷ_{i−1}) for i > 0 and ŷ_{[0]} = ∅. This likelihood can be very small when |ŷ| is very large. To address this issue, we define the normalized likelihood as:

$$\bar{f}(\hat{y} \mid x) = \hat{f}(\hat{y} \mid x)^{1/|\hat{y}|}. \tag{3}$$

We use f to generate the output sequence for the given input x by solving the following objective:

$$\hat{y}^* = \arg\max_{\hat{y} \in V^*} \hat{f}(\hat{y} \mid x). \tag{4}$$

It is impossible to solve this objective exactly since the output sequences can be arbitrarily long. However, we can employ a decoding strategy such as greedy decoding or beam search to solve it approximately.
To evaluate whether the generated output ŷ is correct, we need a set of reference outputs S and an evaluation metric M : V* × V* → [0, 1] that evaluates the similarity of the generated output ŷ to a reference output y_r ∈ S. With a threshold γ, we can determine the correctness of the generated output: if $\max_{y_r \in S} M(\hat{y}, y_r) > \gamma$, the generated output is correct; otherwise, it is wrong. We discuss the specific choices of M and γ in Section 6.
In selective prediction, we need a rejection option, denoted by ⊥. Given a training dataset $\mathcal{D}^{tr} = \{(x_i, y_i)\}_{i=1}^{n_{tr}}$ randomly sampled from a target task distribution, we aim to build a selective predictor $f_s : V^* \to V^* \cup \{\bot\}$ that achieves strong selective prediction performance on the test dataset $\mathcal{D}^{te} = \{(x_i, S_i)\}_{i=1}^{n_{te}}$, where $S_i$ is the set of reference outputs for the input $x_i$. The selective predictor $f_s$ is composed of a predictor $f : V^* \to V^*$ and a selection scoring function $g : V^* \to \mathbb{R}$. With f and g, the selective predictor is defined as:

$$f_s(x) = \begin{cases} f(x) & \text{if } g(x) \ge \tau, \\ \bot & \text{if } g(x) < \tau, \end{cases} \tag{5}$$

where τ is a threshold. The accuracy of the selective predictor is the fraction of accepted inputs for which the predictions are correct, and its coverage is the fraction of inputs that are accepted. We can tune the threshold τ to achieve a certain coverage, and there is an accuracy-coverage trade-off. We use the area under the accuracy-coverage curve (AUACC) to measure selective prediction performance and the area under the receiver operating characteristic curve (AUROC) to measure the quality of the selection score estimation. AUACC is the common metric used for evaluating selective prediction performance (Xin et al., 2021; Yoshikawa and Okazaki, 2023). AUROC is equivalent to the probability that a randomly chosen correct output sequence has a higher selection score than a randomly chosen incorrect output sequence, and is used in Kuhn et al. (2023) for evaluating uncertainty estimation methods.
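To make these definitions concrete, the following minimal sketch (ours, not the authors' code; function names are hypothetical) computes the normalized likelihood of Eq. (3) from per-token log-probabilities and applies the accept/reject rule of Eq. (5):

```python
import math
from typing import List, Optional

def normalized_likelihood(token_logprobs: List[float]) -> float:
    """Eq. (3): geometric mean of the per-token probabilities, which keeps
    long outputs from receiving vanishingly small scores."""
    avg_logprob = sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_logprob)

def selective_predict(y_hat: str, selection_score: float, tau: float) -> Optional[str]:
    """Eq. (5): return the prediction if g(x) clears the threshold tau,
    otherwise abstain (the rejection option, returned here as None)."""
    return y_hat if selection_score >= tau else None
```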

ASPIRE Framework
We propose that LLMs should have a self-evaluation ability: they should be able to distinguish whether their proposed answers for a given question are correct or not. Although some previous work (Kadavath et al., 2022) shows that LLMs have good self-evaluation ability with specially designed prompts, those prompts may not transfer to different kinds of LLMs (as shown by our experiments and in Kuhn et al. (2023)), and hand-crafting prompts for different kinds of LLMs can be expensive. A more effective approach is to collect some training data to learn self-evaluation. Towards this end, we propose a novel framework: Adaptation with Self-Evaluation to Improve Selective Prediction in LLMs (ASPIRE). Fig. 2 illustrates the proposed framework, and the details are explained next.
Given a training dataset for a generative task, we can fine-tune the pre-trained LLM on the training data to improve its prediction performance. Towards this end, parameter-efficient tuning techniques (e.g., soft prompt tuning (Lester et al., 2021) and LoRA (Hu et al., 2021)) can be employed to adapt the pre-trained LLM to the task, given their effectiveness in obtaining strong generalization with small amounts of target task data. Specifically, the model parameters θ of the LLM are frozen and adaptable parameters θ_p are added for fine-tuning. Only θ_p are updated to solve the following training objective:

$$\min_{\theta_p} \; \mathbb{E}_{(x, y) \sim \mathcal{D}^{tr}} \; \mathcal{L}(x, y; \theta, \theta_p), \tag{6}$$

where L is the LLM training loss (e.g., cross-entropy). Such fine-tuning can improve selective prediction performance because it not only improves prediction accuracy, but also enhances the likelihood of correct output sequences.
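As a concrete illustration, this first adaptation stage can be set up with the HuggingFace peft library. This is a minimal sketch under our own assumptions (the paper modifies the HuggingFace Trainer directly, per Appendix A), using the paper's prompt length l = 50:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PromptTuningConfig, TaskType, get_peft_model

# Load a pre-trained LLM; its parameters theta stay frozen.
model = AutoModelForCausalLM.from_pretrained("facebook/opt-2.7b")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-2.7b")

# Attach l = 50 trainable soft prompt embeddings (theta_p).
peft_config = PromptTuningConfig(task_type=TaskType.CAUSAL_LM,
                                 num_virtual_tokens=50)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()  # only the prompt embeddings are trainable
```

Training then proceeds with the standard causal LM loss on the target-task data, as in objective (6).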
To further improve selective prediction performance, we propose to fine-tune the LLM to learn self-evaluation. We first use the LLM with the learned θ_p to generate different answers for each example (x, y) ∈ D^tr. Let A denote the decoding algorithm used to generate output sequences for each input x. A produces a list of generated output sequences:

$$\mathcal{A}(f, \theta_p, x) = [\hat{y}^1, \dots, \hat{y}^k], \tag{7}$$

where k is the number of output sequences generated. We aim to generate output sequences that have high likelihood (i.e., $\hat{f}(\hat{y}^j \mid x; \theta_p)$ is high).
We use the metric M defined in Section 3 to determine whether a generated output ŷ^j is correct. If M(ŷ^j, y) > γ, we label ŷ^j as a correct output for x; otherwise, we label it as a wrong output. The threshold γ used here may differ from the threshold used for evaluation; we choose a sufficiently large value (e.g., γ = 0.9) so that generated wrong outputs are not labeled as correct. In Appendix H, we provide more details and analyses on the selection of γ.
After sampling high-likelihood outputs for each query, we add adaptable parameters θ_s and tune only θ_s to learn self-evaluation. Since the output sequence generation depends only on θ and θ_p, freezing θ and the learned θ_p avoids changing the prediction behavior of the LLM while learning self-evaluation. Let z_c and z_w be a pair of tokens representing the words "correct" and "wrong", respectively. We can then optimize θ_s using the following training objective:

$$\min_{\theta_s} \; \mathbb{E}_{(x, y) \sim \mathcal{D}^{tr}} \Big[ \sum_{\hat{y} \in S_c(x, y)} -\log P(z_c \mid x, \hat{y}) \;+ \sum_{\hat{y} \in S_w(x, y)} -\log P(z_w \mid x, \hat{y}) \Big], \tag{8}$$

where S_c(x, y) is a set of correct outputs containing the reference output y and the k_c correct outputs with highest likelihood from A(f, θ_p, x), and S_w(x, y) is a set of wrong outputs containing the k_w wrong outputs with highest likelihood from A(f, θ_p, x). If A(f, θ_p, x) has fewer than k_c correct outputs (or fewer than k_w wrong outputs), we include all of its correct outputs (or all of its wrong outputs) in S_c (or S_w). We ensure that S_w contains at least one wrong output: if A(f, θ_p, x) contains no wrong outputs, we add a default wrong output (e.g., the empty string) to S_w.
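A minimal sketch of how the self-evaluation training sets S_c and S_w could be assembled from the k sampled outputs, assuming Rouge-L as the metric M via the rouge-score package (the helper name and exact bookkeeping are ours):

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def build_self_eval_sets(reference, sampled_answers, likelihoods,
                         gamma=0.9, k_c=2, k_w=10):
    """Split the sampled answers into correct/wrong sets for training theta_s."""
    # Rank sampled answers by likelihood, highest first.
    ranked = [y for y, _ in sorted(zip(sampled_answers, likelihoods),
                                   key=lambda p: -p[1])]
    correct = [y for y in ranked
               if scorer.score(reference, y)["rougeL"].fmeasure > gamma]
    wrong = [y for y in ranked
             if scorer.score(reference, y)["rougeL"].fmeasure <= gamma]
    s_c = [reference] + correct[:k_c]     # always include the reference output
    s_w = wrong[:k_w] if wrong else [""]  # default wrong output if none sampled
    return s_c, s_w
```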
After training θ_p and θ_s, we obtain the prediction for a query x by solving the following objective:

$$\hat{y}^* = \arg\max_{\hat{y} \in V^*} \hat{f}(\hat{y} \mid x; \theta_p), \tag{9}$$

using beam search decoding. We define the likelihood of the output ŷ* being correct for the query x as:

$$P(z_c \mid x, \hat{y}^*) = \frac{\exp\big(f(z_c \mid x, \hat{y}^*; \theta_p, \theta_s)\big)}{\exp\big(f(z_c \mid x, \hat{y}^*; \theta_p, \theta_s)\big) + \exp\big(f(z_w \mid x, \hat{y}^*; \theta_p, \theta_s)\big)}. \tag{10}$$

This score P(z_c | x, ŷ*) is referred to as the learned self-eval score. Overall, the selection scoring function is:

$$g(x) = (1 - \alpha) \cdot \log \bar{f}(\hat{y}^* \mid x; \theta_p) + \alpha \cdot \log P(z_c \mid x, \hat{y}^*), \tag{11}$$

where α ∈ [0, 1] is a hyper-parameter.
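For illustration, the selection score of Eq. (11) reduces to a few lines. This sketch assumes both components are combined in log space, consistent with Appendix L's note that ASPIRE scores are log-likelihood scores:

```python
import math

def aspire_selection_score(normalized_likelihood: float,
                           p_correct: float, alpha: float = 0.25) -> float:
    """Eq. (11): weighted sum of the log normalized likelihood of the
    generated answer and the log learned self-eval score P(z_c | x, y*)."""
    return ((1 - alpha) * math.log(normalized_likelihood)
            + alpha * math.log(p_correct))
```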

Implementation via Soft Prompt Tuning
In the proposed framework, θ_p and θ_s can be trained using parameter-efficient tuning approaches. In our work, we focus on Soft Prompt Tuning, as illustrated in Fig. 3. The driving force behind this approach is the recognition that if prompts that effectively stimulate self-evaluation exist, it should be possible to discover them through soft prompt tuning in conjunction with targeted training objectives. We first briefly introduce the soft prompt tuning method proposed by Lester et al. (2021). We consider LLMs based on the Transformer architecture (Vaswani et al., 2017). Given a query $x = (x_1, \dots, x_{m_q})$, the Transformer first embeds the tokens, forming a matrix $X \in \mathbb{R}^{m_q \times d_e}$, where $d_e$ is the dimension of the embedding space. The soft prompts are represented as parameters $\tilde{\theta} \in \mathbb{R}^{l \times d_e}$, where l is the length of the prompt. The prompt is then concatenated to the embedded input, forming a single matrix $[\tilde{\theta}; X] \in \mathbb{R}^{(m_q + l) \times d_e}$, which then flows through the transformer as normal.
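The mechanics of prepending a soft prompt are simple; the following PyTorch sketch (ours) shows the concatenation for a single query, with batch dimensions omitted:

```python
import torch
import torch.nn as nn

l, d_e = 50, 2560  # prompt length and embedding dim (d_e matches OPT-2.7B)
soft_prompt = nn.Parameter(torch.randn(l, d_e) * 0.02)  # learnable prompt

def prepend_soft_prompt(X: torch.Tensor) -> torch.Tensor:
    """X: (m_q, d_e) embedded query tokens -> ((l + m_q), d_e) input that
    then flows through the frozen transformer as normal."""
    return torch.cat([soft_prompt, X], dim=0)
```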
In the proposed framework, we need to train two soft prompts, $\theta_p \in \mathbb{R}^{l \times d_e}$ and $\theta_s \in \mathbb{R}^{l \times d_e}$. With soft prompt tuning, the training objective (6) becomes:

$$\min_{\theta_p} \; \mathbb{E}_{(x, y) \sim \mathcal{D}^{tr}} \sum_{j=1}^{|y|} -\log \hat{f}\big(y_j \mid [\theta_p; X], Y_{[j-1]}\big), \tag{12}$$

where X is the embedding of x and $Y_{[j-1]}$ is the embedding of $y_{[j-1]}$. Similarly, the training objective (8) becomes:

$$\min_{\theta_s} \; \mathbb{E}_{(x, y) \sim \mathcal{D}^{tr}} \Big[ \sum_{\hat{y} \in S_c(x, y)} -\log P\big(z_c \mid [\theta_p; X; \hat{Y}; \theta_s]\big) + \sum_{\hat{y} \in S_w(x, y)} -\log P\big(z_w \mid [\theta_p; X; \hat{Y}; \theta_s]\big) \Big], \tag{13}$$

where Ŷ is the embedding of ŷ. The inference objective (9) in the framework becomes:

$$\hat{y}^* = \arg\max_{\hat{y} \in V^*} \hat{f}\big(\hat{y} \mid [\theta_p; X]\big), \tag{14}$$

and the learned self-eval score P(z_c | x, ŷ*) becomes:

$$P\big(z_c \mid [\theta_p; X; \hat{Y}^*; \theta_s]\big) = \frac{\exp\big(f(z_c \mid [\theta_p; X; \hat{Y}^*; \theta_s])\big)}{\exp\big(f(z_c \mid [\theta_p; X; \hat{Y}^*; \theta_s])\big) + \exp\big(f(z_w \mid [\theta_p; X; \hat{Y}^*; \theta_s])\big)}, \tag{15}$$

where $\hat{Y}^*$ is the embedding of $\hat{y}^*$.
To generate the output sequence and obtain the selection score for a given input sequence, we employ two stages: first, we obtain the generated output and its likelihood; then, we obtain the learned self-eval score. Since the query of the second stage is constructed by appending additional tokens to the query of the first stage, the second stage can reuse the states computed in the first stage instead of recomputing them, saving computational cost (see Fig. 3).
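A rough sketch of this two-stage inference with state reuse, using the HuggingFace transformers cache (the model choice, trigger tokens, and prompt handling are simplified assumptions of ours; in ASPIRE the learned soft prompts θ_p and θ_s would also be part of the inputs):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
tok = AutoTokenizer.from_pretrained("facebook/opt-350m")

# Stage 1: encode the question plus the generated answer, keeping the cache.
stage1 = tok("Q: Which vitamin assists in blood clotting? A: Vitamin K",
             return_tensors="pt")
with torch.no_grad():
    out1 = model(**stage1, use_cache=True)
    # Stage 2: append the self-eval query and reuse the cached states
    # instead of re-encoding the question and answer from scratch.
    extra = tok(" The proposed answer is", return_tensors="pt",
                add_special_tokens=False)
    out2 = model(input_ids=extra.input_ids,
                 past_key_values=out1.past_key_values)
# out2.logits[:, -1] scores the next token, e.g. "correct" vs. "wrong".
```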
Lastly, we note that the computational complexity of the proposed method at test time is $O(l_{max})$, with $l_{max}$ being the maximum length of the generated output sequence. In Appendix F, we provide a more detailed analysis of the computational complexity of different methods. The predictive entropy and semantic entropy methods have a complexity of $O(m \cdot l_{max})$, where m is the number of output sequences sampled for uncertainty estimation, which is much larger than that of our method.

Experiments
Our experimental evaluation is focused on the following questions:

(Q1) Could a learning-based system using self-evaluation improve selective prediction in LLMs compared to other post-hoc selective prediction alternatives?
(A1) By learning self-evaluation, we can significantly improve selective prediction performance across different datasets and LLMs (see Table 1).

(Q2) What kinds of decoding algorithms could be used as A for the proposed framework ASPIRE?
(A2) Using decoding algorithms that can sample different high-likelihood answers as A (e.g., beam search) is important for ASPIRE to achieve good selective prediction performance (see Table 4).

(Q3) What is the effect of the number of training samples for the proposed method ASPIRE?
(A3) More training samples lead to enhanced performance, and with ∼2K samples, ASPIRE significantly outperforms the baselines without soft prompt tuning on different datasets (see Table 5).

Setup
Datasets. We focus on free-form question answering tasks on the datasets CoQA (Reddy et al., 2019), TriviaQA (Joshi et al., 2017) and SQuAD (Rajpurkar et al., 2016). For CoQA and SQuAD, since each question is asked based on a context paragraph, we evaluate the LLMs in the zero-shot setting. For TriviaQA, since the LLMs have limited accuracy under the zero-shot setting, we evaluate the LLMs in the 5-shot setting. For each dataset, we use a subset of the original training set containing 50K examples for adapting LLMs by default. The details of the datasets are given in Appendix B.

LLMs. We use OPT (Zhang et al., 2022) and GPT-2 (Radford et al., 2019) models of various sizes. For OPT, we consider OPT-350M, OPT-1.3B, OPT-2.7B and OPT-30B. For GPT-2, we consider GPT2-Medium, GPT2-Large and GPT2-XL. The details of these models are given in Appendix C.
Baselines. For selective prediction, we need to obtain a predicted output sequence ŷ* and a selection score g(x) for each input sequence x, given a model f. The model f can be a pre-trained LLM or an adapted LLM with θ_p trained using the training objective (12). We use beam search decoding to obtain the predicted output sequence ŷ* and consider the following baselines to compute the selection score g(x): (1) Perplexity; (2) Predictive Entropy; (3) Semantic Entropy (Kuhn et al., 2023); (4) Self-eval; (5) P(True) (Kadavath et al., 2022). More details can be found in Appendix D.

Evaluation metrics. We use Rouge-L (Lin and Och, 2004) as the evaluation metric M to evaluate the similarity of the generated answer to the reference answers, following Kuhn et al. (2023). For the threshold γ used to determine the correctness of the generated answer, we consider relatively large values of γ, since we focus on safety-critical applications where accepting a wrong answer is more costly than rejecting a correct answer that differs from the reference answers (refer to Appendix G for justification of the choices of γ). Unless specified, we use γ = 0.7 by default.
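For reference, the correctness rule under the Rouge-L metric can be written as below (a sketch; the rouge-score package is our assumed implementation of the metric):

```python
from typing import List
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def is_correct(prediction: str, references: List[str], gamma: float = 0.7) -> bool:
    """A prediction counts as correct if its best Rouge-L F-measure
    against any reference answer exceeds the threshold gamma."""
    return max(scorer.score(ref, prediction)["rougeL"].fmeasure
               for ref in references) > gamma
```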
Training hyper-parameters. We have two stages of training: the first stage trains the soft prompt θ_p using the training objective (12), and the second stage trains the soft prompt θ_s using the training objective (13). For both stages, we train the soft prompts for 10 epochs using the AdamW optimizer with a batch size of 8, a learning rate of 0.01 and cosine learning rate scheduling. More training details can be found in Appendix E.
ASPIRE setup. We use beam search as the decoding algorithm A. We set the number of beams equal to k and use the k highest-scoring beams as the answer list A(f, θ_p, x). We set l = 50, γ = 0.9, k = 10, k_c = 2, k_w = 10 and α = 0.25 by default. We choose these hyper-parameters based on performance on the validation set from TriviaQA using the OPT-2.7B model, and then use the same hyper-parameters across all datasets and models.

Results
We first evaluate the accuracy of different LLMs.
The results in Table 3 show that after training θ_p via soft prompt tuning, the accuracy of LLMs improves significantly. On the CoQA and SQuAD datasets, the adapted OPT-2.7B can even outperform the pre-trained OPT-30B, demonstrating that it is possible to adapt a smaller LLM to achieve better accuracy than a much larger LLM. We then evaluate different methods to compute the selection score when the model's predictions are fixed. The results in Table 1 show that the proposed method ASPIRE significantly outperforms the baselines in terms of the AUACC and AUROC metrics across different datasets and LLMs. The results also show that after prompt tuning, the AUACC of all methods improves significantly as the accuracy gets better and the perplexity becomes more meaningful in separating correct and wrong answers.
Additionally, the results show that the proposed ASPIRE with the adapted OPT-2.7B model significantly outperforms the Self-eval and P(True) baselines with the pre-trained OPT-30B model in selective prediction performance. Note that on the TriviaQA dataset, although the pre-trained OPT-30B model has better accuracy than the adapted OPT-2.7B model, the Self-eval and P(True) baselines with the pre-trained OPT-30B model have much worse selective prediction performance than the proposed ASPIRE with the adapted OPT-2.7B model. This demonstrates that the self-evaluation approaches are not necessarily effective even for high-capacity LLMs, and that applying ASPIRE to smaller LLMs can lead to better selective prediction performance than using those self-evaluation approaches with much larger LLMs. Additional results in Appendix I show that ASPIRE significantly outperforms the baselines across OPT and GPT-2 models of different sizes for different values of the Rouge threshold γ.

Empirical Analyses
The effect of α. We study the effect of the hyper-parameter α in the proposed selection score (Eq. (11)). The results in Table 2 show that setting α = 0.25 leads to the best performance, since it combines the normalized likelihood and the learned self-eval score in a balanced way. Using only the normalized likelihood (i.e., α = 0) or only the learned self-eval score (i.e., α = 1) leads to much worse performance. In practice, the value of α can be chosen based on performance on validation data. In Appendix J, we give results for other models and discuss how we choose α.
The choice of A. We compare two decoding algorithms that can be used as A for answer sampling: beam search and multinomial sampling. For beam search, we use the k highest-scoring beams as the answer list. For multinomial sampling, we consider temperatures (denoted as T) in the set {0.1, 1.0, 2.0}. The results in Table 4 show that multinomial sampling with T = 2.0 or T = 0.1 performs worse than the other decoding algorithms. With a high temperature (T = 2.0), multinomial sampling draws random answers that might not have high likelihood; with a low temperature (T = 0.1), it repeatedly samples the same high-likelihood answers. Thus, the results suggest that sampling different high-likelihood answers is important for our method to achieve high accuracy and coverage in selective prediction. The results also show that beam search performs similarly to multinomial sampling with T = 1.0, so either can be used in practice.
Training sample efficiency. We study the effect of the number of training samples for ASPIRE. We fix the number of training steps at 50K while varying the size of the training dataset. The results in Table 5 show that more training samples lead to performance improvements, and that with 2K training samples, ASPIRE outperforms the baselines without soft prompt tuning by a large margin across different datasets. This underlines that ASPIRE can significantly improve selective prediction performance even with a limited number of training samples.

Conclusion
In this paper, we proposed a novel framework for adaptation with self-evaluation to improve selective prediction in LLMs. We implemented the framework via soft prompt tuning and demonstrated its superior performance over existing methods through extensive experiments. Future work could explore implementing our framework via other parameter-efficient tuning approaches and applying our method to larger LLMs.

Limitations
Higher-capacity LLMs are known to often yield superior capabilities. Due to computational constraints, our work does not include fine-tuning results with the largest and strongest LLMs in the literature (we fine-tune LLMs of up to 2.7B parameters). However, the proposed framework can be applied to LLMs of any size, and similar improvements are expected. We leave the adoption of our methods to larger-scale LLMs to future work.

Ethics Statement
LLMs are widely used in various applications nowadays. However, they can generate wrong or misleading answers to questions, which can cause serious consequences in safety-critical applications. The framework proposed in our work can be used to improve the selective prediction performance of LLMs and make their deployments more reliable. We note, however, that the resulting selective prediction performance is still not perfect.

A Hardware and Software
We run all experiments using the HuggingFace API on 40GB NVIDIA A100 GPUs under the Debian GNU/Linux 10 system. We use the OPT and GPT-2 models via the HuggingFace transformers library, which can be easily adapted for reproducibility. We modify the Trainer class provided by the HuggingFace API for soft prompt tuning and use the generate() function of the HuggingFace API to generate answers. Unless specified, we use the default parameters of the generate() function.
When generating the answer set A(f, θ_p, x), we set max_new_tokens=50, while in all other cases we set max_new_tokens=256. The parameters for the different decoding strategies are provided below (minimal example calls follow the list):

• Beam search decoding: we set num_beams>1 and do_sample=False. If we want to get the num_beams highest-scoring beams, we set num_return_sequences=num_beams. We specify num_beams whenever we use beam search decoding.

• Multinomial sampling decoding: we set num_beams=1 and do_sample=True. We specify temperature whenever we use multinomial sampling decoding.
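For concreteness, the two settings above correspond to generate() calls like the following (a sketch; `model`, `tokenizer` and the tokenized `inputs` are assumed to be set up as in the rest of this appendix):

```python
# Beam search: the 10 highest-scoring beams as the answer set A(f, theta_p, x).
beam_outputs = model.generate(**inputs, num_beams=10, do_sample=False,
                              num_return_sequences=10, max_new_tokens=50)

# Multinomial sampling: e.g., temperature 0.5, as used for the
# predictive/semantic entropy baselines in Appendix D.
sampled_outputs = model.generate(**inputs, num_beams=1, do_sample=True,
                                 temperature=0.5, max_new_tokens=256)
```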

B.1 CoQA
CoQA is a large-scale dataset for Conversational Question Answering systems. The goal of the CoQA challenge is to measure the ability of machines to understand a text passage and answer a series of interconnected questions that appear in a conversation. CoQA contains 127,000+ questions with answers collected from 8,000+ conversations.
The training set contains 108,647 question queries while the test set contains 7,983 question queries.
We use the following template to construct question queries:

[The provided context paragraph]
[additional question-answer pairs]
Q: [Provided question]
A:

where the additional question-answer pairs are preceding turns of the conversation about the paragraph, consisting of questions and reference answers.

B.2 TriviaQA
TriviaQA is a reading comprehension dataset containing over 650K question-answer-evidence triples.

C LLMs
We perform experiments with OPT (Zhang et al., 2022) and GPT-2 (Radford et al., 2019) models, which are based on the Transformer architecture. The Transformer architecture imposes a limit on the lengths of the sequences that can be passed to the models: the OPT models can handle sequences of up to 2,048 tokens, while the GPT-2 models can handle sequences of up to 1,024 tokens. If the sequence length of an input exceeds the maximum allowed sequence length, we force the model to output an empty sequence with a −∞ selection score.

D Baselines
For selective prediction, we need to obtain a predicted output sequence ŷ* and a selection score g(x) for each input sequence x, given a model f. The model f can be a pre-trained LLM or an LLM adapted with prompt tuning using the training objective (12). We use beam search decoding, with the number of beams equal to 5, to obtain the predicted output sequence ŷ*. We consider the following baselines to compute the selection score g(x):

Perplexity. Perplexity is defined as the exponentiated average negative log-likelihood of a sequence. The perplexity of the generated output sequence ŷ* is computed as:

$$\text{Perp}(\hat{y}^* \mid x) = \hat{f}(\hat{y}^* \mid x)^{-1/|\hat{y}^*|},$$

and the selection score is its negation, $g(x) = -\text{Perp}(\hat{y}^* \mid x)$, so that lower perplexity yields a higher score.

Predictive Entropy. Predictive entropy is a widely used measure of uncertainty. We use multinomial sampling with a temperature of 0.5 to obtain an answer list [ŷ^1, ..., ŷ^m] for each input sequence x. The length-normalised predictive entropy is computed as:

$$\text{PE}(x) = -\frac{1}{m} \sum_{j=1}^{m} \frac{1}{|\hat{y}^j|} \log \hat{f}(\hat{y}^j \mid x),$$

and the selection score is g(x) = −PE(x). We set m = 10. This is the same as the length-normalised predictive entropy baseline in Kuhn et al. (2023).

Semantic Entropy. Semantic entropy is an entropy-based uncertainty measure which uses a bidirectional entailment algorithm for marginalising over semantically-equivalent samples (Kuhn et al., 2023). We follow the settings in Kuhn et al. (2023). Specifically, we use multinomial sampling with a temperature of 0.5 to obtain an answer list of size 10 for each input sequence for uncertainty estimation. We use the Deberta-large model (He et al., 2020) fine-tuned on the NLI dataset MNLI (Williams et al., 2017) for the bidirectional entailment clustering algorithm.

Self-eval. Self-eval is a simple baseline that obtains a selection score from the LLM by asking whether the proposed answer ŷ* is correct or not. Suppose z_s is a series of tokens representing the self-evaluation trigger string "The answer is ", and z_c and z_w are the tokens representing the words "correct" and "wrong", respectively. Recall that the logit of the model f on v given x is f(v | x). The self-eval score is then computed as:

$$g(x) = \frac{\exp\big(f(z_c \mid x, \hat{y}^*, z_s)\big)}{\exp\big(f(z_c \mid x, \hat{y}^*, z_s)\big) + \exp\big(f(z_w \mid x, \hat{y}^*, z_s)\big)}.$$

P(True). P(True), proposed by Kadavath et al. (2022), estimates the probability that a model's generation is correct by "asking" the model if its answer is correct. It samples m answers, constructs a new natural language question using these possible answers as context before asking whether the proposed answer ŷ* is correct, and measures the probability of the completion being True.
We set m = 4 and use multinomial sampling with a temperature of 1.0 to sample the answers. The format of the prompt is:

Question: Who was the third president of the United States?
Here are some brainstormed ideas:
James Monroe
Thomas Jefferson
John Adams
Benjamin Harrison
George Washington
Possible Answer: James Monroe
Is the possible answer:
(A) True
(B) False
The possible answer is:

where the "brainstormed answers" are from the set of sampled answers, and P(True) (i.e., the likelihood of the next token being True) is taken as the uncertainty measure.
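As an illustration of the perplexity and predictive entropy baselines above, both scores can be computed directly from sequence log-probabilities (a sketch with hypothetical inputs; the sign conventions assume higher scores mean "more likely correct"):

```python
import math
from typing import List

def perplexity_score(token_logprobs: List[float]) -> float:
    """Negated perplexity of the generated answer: lower perplexity
    (higher average log-likelihood) yields a higher selection score."""
    return -math.exp(-sum(token_logprobs) / len(token_logprobs))

def predictive_entropy_score(seq_logprobs: List[float],
                             seq_lengths: List[int]) -> float:
    """Negated length-normalised predictive entropy over m sampled answers:
    seq_logprobs[j] is log f(y_j | x), seq_lengths[j] is |y_j|."""
    m = len(seq_logprobs)
    entropy = -sum(lp / n for lp, n in zip(seq_logprobs, seq_lengths)) / m
    return -entropy
```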

E Training Details
We have two stages of training: the first stage trains the soft prompt θ_p using the training objective (12), and the second stage trains the soft prompt θ_s using the training objective (13). For both stages, we train the soft prompt for 10 epochs using the AdamW optimizer with a batch size of 8, a learning rate of 0.01 and a cosine learning rate schedule. We remove data points (x, y) where |x| + |y| > 700 from the training set D^tr to reduce GPU memory usage during training. Here, |x| is the length of the sequence x. This removes only a very small portion of the data points from the training set of each dataset (4.02% of training data points in CoQA, 0% in TriviaQA and 0.04% in SQuAD). When training θ_p or θ_s, we always use 20% of the training data as validation data for selecting the best model among all checkpoints after each training epoch. When training θ_p, we select the best model based on the loss on the validation data; when training θ_s, we select the best model based on the AUROC on the validation data.

F Computational Complexity Analysis
The proposed method ASPIRE needs to train two soft prompts, θ_p and θ_s. The complexity of training θ_p using the training objective (12) is the same as that of standard soft prompt tuning. When training θ_s using the training objective (13), the number of training steps is the same as that of training θ_p. In each training step for θ_s, we compute gradients for one correct output and two wrong outputs, while in each training step for θ_p, we compute gradients for one reference output. Thus, the complexity of training θ_s is of the same order as that of training θ_p, and the training-time complexity of ASPIRE is the same as that of standard soft prompt tuning.
We analyze the computational complexity of different methods at test time in terms of the number of forward passes of the LLM. Since the LLM generates the output sequence auto-regressively, the number of forward passes needed depends on the length of the generated output sequence. Suppose the maximum length of the generated output sequence is l_max. To generate an output sequence given an input sequence, we need one forward pass to encode the input sequence and at most l_max forward passes to obtain the output sequence. Thus, for generating the output sequence, the maximum number of forward passes is 1 + l_max and the complexity is O(l_max). For the perplexity method, the computational complexity is O(l_max), since we need only one additional forward pass to obtain the perplexity score. For the predictive entropy method, the computational complexity is O(m · l_max), since we additionally need to generate m answers and compute their likelihoods. For the semantic entropy method, we omit the computational complexity of the bidirectional entailment clustering algorithm, since its cost is much smaller than that of LLM generation, as stated in Kuhn et al. (2023); thus, its computational complexity is also O(m · l_max). For the self-eval method, the computational complexity is O(l_max), since we need only one additional forward pass to obtain the self-eval score. For the P(True) method, the computational complexity is O(m · l_max), since we additionally need to generate m answers and need one forward pass to compute the P(True) score. For the proposed method ASPIRE, the computational complexity is O(l_max), since we need only one additional forward pass to obtain the learned self-eval score.

G Rouge Threshold for Evaluation
We use the Rouge-L (Lin and Och, 2004) metric to evaluate whether the predicted answer is correct. The Rouge-L metric produces a score in [0, 1]. We need a threshold γ to determine whether the predicted answer is correct: if the Rouge-L score is larger than γ, the predicted answer is deemed correct; otherwise, it is deemed wrong. The choice of γ depends on the application. Low values of γ may lead to labeling some wrong answers as correct, while large values of γ may lead to labeling some correct answers as wrong. If we regard the wrong answer as the positive class, we can use the precision and recall metrics to evaluate the choice of γ. Computing precision and recall requires ground-truth labels for the correctness of predicted answers, which requires manual labeling. If the Rouge-L score is equal to 0 (or 1), the predicted answer is almost certainly wrong (or correct). Thus, we only need to label the samples whose Rouge-L scores are in (0, 1). To compare different values of γ, we compute the precision and recall metrics after manually labeling 200 samples whose Rouge-L scores are in the range (0, 1). The results in Table 7 show that larger values of γ lead to higher recall but lower precision, while lower values of γ lead to higher precision but lower recall. We propose this work for safety-critical applications where accepting a wrong answer is more costly than rejecting a correct answer that differs from the reference answers; thus, we prefer high recall over high precision. In our experiments, we evaluate different methods under the Rouge-L metric with γ ∈ {0.7, 0.8, 0.9} to ensure that the recall is at least 90%.

H Rouge Threshold for the Proposed Framework
In the proposed framework ASPIRE, we need the Rouge threshold γ to determine whether a generated answer is correct. We want to set a sufficiently large value of γ so that generated wrong answers are not labeled as correct. To determine the value of γ, we manually label the correctness of the 10 generated answers for 50 training examples from each of the three datasets (CoQA, TriviaQA and SQuAD). The answers are generated using the OPT-2.7B model. We find that with γ = 0.9, no wrong answers are labeled as correct. Thus, we set γ = 0.9 for the proposed framework.

I Complete Results
In this section, we present the complete results for OPT and GPT-2 models of different sizes and different values of the Rouge threshold γ. We first evaluate the accuracy of different LLMs. The results are in Table 8 (γ = 0.7), Table 9 (γ = 0.8) and Table 10 (γ = 0.9). They show that after training θ_p via soft prompt tuning, the accuracy of LLMs improves significantly. We then evaluate different approaches to compute the selection score when the model's predictions are fixed. The results are in Table 11 (GPT-2 models, γ = 0.7), Table 12 (GPT-2 models, γ = 0.8), Table 13 (GPT-2 models, γ = 0.9), Table 14 (OPT models, γ = 0.7), Table 15 (OPT models, γ = 0.8) and Table 16 (OPT models, γ = 0.9). The results show that the proposed method ASPIRE significantly outperforms the baselines in terms of AUACC and AUROC across different datasets and LLMs for different values of the Rouge threshold γ.
J The Effect of the Hyper-parameter α

We study the effect of the hyper-parameter α in the proposed selection score (Eq. (11)) for our method. The results in Table 17 show that setting α = 0.25 leads to the best performance across different datasets and models. Using only the normalized likelihood (i.e., α = 0) or only the learned self-eval score (i.e., α = 1) consistently leads to much worse performance. We choose α based on performance on the validation data from the TriviaQA dataset using the OPT-2.7B model, and then use the same α value for all datasets and models. We consider α ∈ {0.0, 0.25, 0.5, 0.75, 1.0} when tuning it. Based on the validation results, we set α = 0.25 by default.

K Comparing with Self-Consistency
Self-consistency (Wang et al., 2022) can be used to obtain confidence measures, as proposed by Si et al. (2022). We sample 10 times to obtain a set of different answers for each question using multinomial sampling with a temperature of 0.5. Among all the generated answers, we take the most frequent answer as the final prediction and its frequency as the selection score. Since self-consistency produces discrete selection scores (in the above setting, the number of possible selection scores is 10) and we use the composite trapezoidal rule to compute AUACC, it is easier for self-consistency to achieve high AUACC than for approaches that produce continuous selection scores, such as the proposed method. Even with this advantage, self-consistency is still significantly outperformed by the proposed method ASPIRE, as shown in Table 18. We also observe that self-consistency can lead to worse accuracy, meaning that the LLM can be consistently wrong.
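A minimal sketch of the self-consistency baseline as described above (the helper name is ours):

```python
from collections import Counter
from typing import List, Tuple

def self_consistency(sampled_answers: List[str]) -> Tuple[str, float]:
    """The most frequent sampled answer is the prediction; its relative
    frequency is the (discrete) selection score."""
    answer, count = Counter(sampled_answers).most_common(1)[0]
    return answer, count / len(sampled_answers)
```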

L Qualitative Evaluation
We present some concrete examples from the TriviaQA dataset to qualitatively show the advantages of the proposed method. We compare the proposed method ASPIRE to the Semantic Entropy baseline. The model for generating answers is the adapted OPT-2.7B with learned θ_p. The ASPIRE scores are log-likelihood scores and can be converted to likelihood scores by exponentiation with base e. The examples show that some semantic entropy scores for correct predictions are lower than some semantic entropy scores for wrong predictions, while the ASPIRE scores for correct predictions are consistently higher than the ASPIRE scores for wrong predictions.

Figure 1: A safety-critical question from the TriviaQA dataset: "Which vitamin helps regulate blood clotting?" The OPT-2.7B model incorrectly answers "Vitamin C", when the correct answer is "Vitamin K". Without selective prediction, LLMs will directly output the wrong answer, which in this case could lead users to take the wrong medicine, causing potential harm. With selective prediction, LLMs will output a low selection score along with the wrong answer and can further output "I don't know!" to warn users not to trust it or to verify it using other sources.

Figure 2: In the proposed framework ASPIRE, we first perform task-specific tuning to train adaptable parameters θ_p while freezing the LLM. Then we use the LLM with the learned θ_p to generate different answers for each training question to create a dataset for self-evaluation learning. Finally, we train the adaptable parameters θ_s to learn self-evaluation using the created dataset, while freezing the LLM and the learned θ_p.

Figure 3: Implementation of the proposed framework via soft prompt tuning. θ_p and θ_s are learnable soft prompt embeddings; the Q and A embeddings are the input embeddings for the question and answer, respectively. We first generate the answer and its likelihood, and then compute the learned self-eval score. The states computed when generating the answer can be cached and reused when computing the learned self-eval score to save computation.

Table 1: Results of evaluating different methods to compute the selection score when the model's predictions are fixed. All numbers are percentages. Bold numbers are superior results.

Table 3: Results of evaluating the accuracy of different LLMs. All numbers are percentages.

Table 4: Results of comparing different decoding algorithms for answer sampling in the proposed method. We denote the temperature as T. All numbers are percentages. Bold numbers are superior results.

Table 5: Results of studying the effect of training set size for the proposed ASPIRE. All numbers are percentages.
Table 6 summarizes the computational complexity of different methods at test time.

Table 6: Computational complexity of different methods at test time.

Table 7: Results of comparing different choices of the Rouge threshold γ. The wrong answer is regarded as the positive class. We use the OPT-2.7B model. We manually label 200 samples with Rouge-L scores in the range (0, 1) in each dataset and then compute precision and recall. All numbers are percentages.

Table 8: Results of evaluating the accuracy of different LLMs when the Rouge threshold γ = 0.7. All numbers are percentages.

Table 9: Results of evaluating the accuracy of different LLMs when the Rouge threshold γ = 0.8. All numbers are percentages.

Table 17: Results of studying the effect of α. All numbers are percentages. Bold numbers are superior results.

Table 18: Comparing with self-consistency. All numbers are percentages. Bold numbers are superior results.