Estimating Large Language Model Capabilities without Labeled Test Data

Large Language Models (LLMs) have the impressive ability to perform in-context learning (ICL) from only a few examples, but the success of ICL varies widely from task to task. Thus, it is important to quickly determine whether ICL is applicable to a new task, but directly evaluating ICL accuracy can be costly in situations where test data is expensive to annotate -- the exact situations where ICL is most appealing. In this paper, we propose the task of ICL accuracy estimation, in which we predict the accuracy of an LLM when doing in-context learning on a new task given only unlabeled test data for that task. To perform ICL accuracy estimation, we propose a method that trains a meta-model using LLM confidence scores as features. We compare our method to several strong accuracy estimation baselines on a new benchmark that covers 4 LLMs and 3 task collections. The meta-model improves over all baselines in 8 out of 12 settings and achieves the same estimation performance as directly evaluating on 40 collected labeled test examples per task. At the same time, no existing approach provides accurate and reliable ICL accuracy estimates in every setting, highlighting the need for better ways to measure the uncertainty of LLM predictions.


Introduction
In-context learning (ICL) with large language models (LLMs) has shown great potential for performing a wide range of language tasks (Brown et al., 2020). ICL has the unique advantages of being data-efficient (i.e., only a few labeled training examples are needed) and accessible (i.e., expertise in training models is no longer required). With these advantages, a non-expert user can create a system to perform a new task within minutes by writing a few examples. This has given rise to the popularity of ICL: it is being adopted and tested for a variety of use cases that stretch the boundary of what is considered possible to do with language models.

Figure 1: A demonstration of our task setting: given ICL accuracy observations on labeled training datasets, we want to estimate the dataset-level ICL accuracy on unseen test datasets without labeled test data. We propose training a meta-model based on the confidence score distributions (which we call dataset confidence profiles).

Despite the advantages of ICL, its performance is highly task-dependent (see Figure 6). It surpasses expectations on some tasks that are difficult for humans, such as answering trivia questions or riddles, but achieves near-zero performance on seemingly trivial tasks such as some text editing or spelling tasks (Srivastava et al., 2022). While evaluating ICL with a labeled test set is a direct way to know whether ICL will be effective, it greatly reduces the appeal of ICL, as one of ICL's key selling points is that it does not require a large labeled dataset. In addition, many tasks do not come with a labeled test set due to high annotation costs (e.g., medical or legal questions that require professional knowledge to answer). In such cases, it is highly desirable to estimate ICL performance without a labeled test set. This would help system developers determine whether ICL is likely to be useful for their problems of interest.
Guided by this motivation, we formalize the problem of few-shot ICL accuracy estimation: given a handful of labeled in-context examples and a set of unlabeled test examples, our goal is to estimate the overall accuracy of ICL on these test examples. Our contributions are twofold:
• We propose to address the accuracy estimation problem by training a "meta-model," which takes LLM confidence features as input and outputs the task accuracy. The meta-model is trained on observed ICL accuracies on seen datasets, and then used to estimate ICL accuracy on unseen datasets (see Figure 1).
• We obtain 42,360 observations of LLM ICL performance by conducting extensive ICL experiments spanning two tasks (multiple-choice QA and closed-book QA), 91 datasets, and 4 LLMs. We then benchmark the meta-model method and multiple baselines on a total of 12 evaluation settings derived from these observations.
Our meta-model can estimate ICL accuracies without the need for labeled test examples. In 10 out of 12 settings, the meta-model estimates are at least as accurate as directly evaluating on 16 labeled examples. In 2 out of 12 settings, they match evaluating on 128 labeled examples. On average, we save the annotation cost of 40 test labels per task by using the meta-model. Further, the meta-model outperforms all baseline methods in 8 out of 12 settings, reducing the relative estimation error by 23.6%. However, we also find that there exists substantial room for improvement across all settings. We envision estimating ICL accuracy without labeled test data as an open challenge and encourage the community to develop new techniques that can more accurately predict when ICL will be effective.

Model Confidence and Calibration
Calibration of LLMs has been studied on a diverse range of tasks such as classification (Desai and Durrett, 2020) and question answering (QA) (Jiang et al., 2021; Kadavath et al., 2022). Calibration research asks whether LLMs assign meaningful correctness likelihoods (also known as model confidence) to their outputs (Guo et al., 2017). Most prior work evaluates calibration at the example level (Desai and Durrett, 2020; Kamath et al., 2020); in this paper, we focus on using overall model confidence distributions to estimate dataset-level accuracies. We propose a method to learn model calibration patterns from observations of LLMs' performance at the dataset level.

In-context Learning
LLMs pre-trained with auto-regressive language modeling objectives have been shown to be capable of "learning" in context when given a prompt composed of a prompt template and a few labeled demonstrations (Brown et al., 2020; Chowdhery et al., 2022). While LLMs can learn a new task through model inference alone, the accuracy is sensitive to the choice of prompt templates and in-context examples (Lu et al., 2021; Zhao et al., 2021; Perez et al., 2021). Therefore, we aim to develop a method to accurately estimate ICL performance for a dataset prompted with any prompt template and combination of in-context examples.

Out-of-distribution (OOD) Prediction
Machine learning models in the real world commonly encounter distribution shifts between training and test time. Prior work (Guillory et al., 2021; Garg et al., 2022; Yu et al., 2022; Singhal et al., 2022; Li et al., 2022) aims to predict models' OOD performance under different setups. Garg et al. (2022) predict target-domain accuracy for image classification tasks under distribution shift by fitting a threshold on model confidence using only labeled source data and unlabeled target data. Singhal et al. (2022) use a few additional target-domain examples to predict the accuracy, focusing on known source-target dataset pairs on which models often have low OOD accuracy due to overfitting to spurious correlations (e.g., MNLI-HANS and QQP-PAWS). They find that accuracy on the given small set of target examples is a strong baseline for approximating accuracy on the full test set. We include the accuracy on a small set of labeled test examples as an oracle baseline (see Section 3.3). These papers all try to predict the OOD accuracy of a model trained on in-distribution training data; in contrast, in our setting we have access to some labeled datasets, but the language models we study were never finetuned on those datasets. To avoid confusion, we use the terms "seen/unseen tasks" to describe the datasets available to us, rather than "in-distribution/out-of-distribution."

Accuracy Prediction

Problem Definition
We formalize the task of ICL accuracy estimation for unseen datasets given observations of the same model's performance on other datasets. A method for the ICL accuracy estimation task takes four inputs: a language model M; a set of labeled seen datasets {D_i}_{i=1}^r, where each D_i consists of a set of labeled examples {(x_i^(1), y_i^(1)), ..., (x_i^(n_i), y_i^(n_i))} and n_i = |D_i|; a prompt c for the test task; and an unlabeled test dataset D_test = {x_test^(1), ..., x_test^(m)} of size m. In a typical setting, each seen task should consist of a sufficient number of labeled examples, i.e., n_i ≥ 100. The method should output the estimated accuracy of M on D_test when prompted with prompt c; we denote the actual accuracy of the model as acc^{M,c}_test and the predicted accuracy as âcc^{M,c}_test. Note that with the labeled datasets D_i and a corresponding prompt c, we can compute the corresponding dataset-level ICL accuracy acc^{M,c}_i for i = 1, ..., r.

Prompt Formulation and Data Splits
We construct prompts by sampling k in-context examples uniformly at random from the available labeled data and formatting them with prompt templates to form a prompt (see Section B and Table 6).

Comparing with Labeled Test Data
To put our results in context, we compare all methods to the Oracle approach of sampling l labeled examples from the test dataset D_test and measuring accuracy on those l examples, which we call oracle_l. This approach is used by Singhal et al. (2022), and it represents how well we can evaluate ICL performance for D_test by collecting labeled examples. With a large value of l, we get a better evaluation of the test dataset at the cost of collecting expensive annotations. In proposing the task of accuracy prediction, we hope to develop methods that outperform the l-labeled oracle for values of l that represent non-trivial annotation costs.
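The l-labeled oracle is straightforward to implement; a minimal sketch, where `oracle_estimate` is an illustrative helper name and `predict_fn` hypothetically stands in for running one ICL inference per example:

```python
import random

def oracle_estimate(labeled_test, predict_fn, l, seed=0):
    """Estimate dataset-level accuracy from l randomly sampled labeled examples.

    labeled_test: list of (x, y) pairs; predict_fn maps an input x to a
    predicted label (hypothetically, one ICL inference call per example).
    """
    rng = random.Random(seed)
    sample = rng.sample(labeled_test, min(l, len(labeled_test)))
    correct = sum(predict_fn(x) == y for x, y in sample)
    return correct / len(sample)
```

With l equal to the full test set this recovers the true accuracy; smaller l trades annotation cost for estimator variance.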

Confidence Profile Meta-Model
We propose a new method that trains a meta-model based on the confidence profiles of the seen datasets {D_i}_{i=1}^r to estimate ICL performance. We use the term confidence profile to denote the distribution of model confidence scores over the examples in a dataset. We extract the confidence profiles (see Figure 1) from each seen dataset and convert them to a feature vector. We then train a meta-model to map confidence feature vectors to dataset-level ICL accuracies. The benefits of using the confidence feature vector are twofold. First, we do not need any labeled test data, which saves annotation costs. Second, this approach is applicable to any pre-trained language model, such as GPT-3 (Brown et al., 2020) and OPT (Zhang et al., 2022).

Confidence Profile
In general, given a (not necessarily labeled) dataset D, LM M, and a prompt c, we obtain the confidence profile by first computing the confidence score s^{M,c}(x) for each x ∈ D. The score for each input x can be computed in one forward pass of M; the exact value of the score differs by task, as described below. Next, we sort the scores to obtain a list [s_1, ..., s_|D|] where s_i ≤ s_{i+1}. Then we create a d_conf-dimensional feature vector conf^{M,c}_D, whose i-th component is a linear interpolation between s_{⌊|D|·i/d_conf⌋} and s_{⌈|D|·i/d_conf⌉}. Intuitively, the i-th feature represents the (i/d_conf)-th quantile of the confidence scores. We refer to the feature vectors derived from confidence profiles as confidence vectors.
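The percentile featurization above can be sketched with NumPy; `confidence_vector` is an illustrative helper name, and `np.quantile`'s default linear interpolation stands in for the interpolation between adjacent sorted scores:

```python
import numpy as np

def confidence_vector(scores, d_conf=10):
    """Map a dataset's per-example confidence scores to a fixed-length
    percentile profile: the i-th entry approximates the (i/d_conf)-th
    quantile of the sorted scores, via linear interpolation."""
    s = np.sort(np.asarray(scores, dtype=float))
    q = np.linspace(1.0 / d_conf, 1.0, d_conf)  # quantile positions
    return np.quantile(s, q)
```

The resulting vector has a fixed dimension regardless of dataset size, so profiles from datasets of different sizes are directly comparable.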

Confidence scores
The confidence score s^{M,c}(x) is calculated differently for closed-set generation and open-ended generation.
Closed-set generation. Closed-set generation tasks have a pre-defined label space Y. We take outputs from LLMs and identify the answers only by their labels (Kadavath et al., 2022). For each example, we take the model confidence to be the normalized probability of the output label across the label space:

s^{M,c}(x) = p_ŷ / Σ_{ỹ∈Y} p_ỹ,   (1)

where p_ỹ is the model-assigned probability for label ỹ on input x, ŷ = arg max_{ỹ∈Y} p_ỹ is the output label from model M, and p_ŷ is its probability.
Open-ended generation. We refer to tasks that require sequence generation (e.g., closed-book QA, summarization, machine reading comprehension) as open-ended generation tasks. We use the negative log-likelihood (NLL) to obtain a confidence score for each generated sequence. Let ŷ be the model-generated sequence. We compute the confidence score as:

s^{M,c}(x) = −(1/|ŷ|) Σ_{t=1}^{|ŷ|} log p_t(ŷ_t),

where p_t is the model-assigned probability distribution at output token t and ŷ_t is the t-th output token.
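Both score types can be computed from per-label or per-token log-probabilities returned by a single forward pass. A sketch under those assumptions (helper names are ours; the sign convention for the NLL-based score is a free choice, since the meta-model only needs a consistent ordering):

```python
import math

def closed_set_confidence(label_logprobs):
    """Probability of the argmax label, normalized over the label space Y.

    label_logprobs: dict mapping each candidate label to the model's
    log-probability of that label given the prompt and input.
    """
    probs = {y: math.exp(lp) for y, lp in label_logprobs.items()}
    return max(probs.values()) / sum(probs.values())

def open_ended_confidence(token_logprobs):
    """Length-normalized NLL of the generated sequence; lower values
    correspond to higher model confidence."""
    return -sum(token_logprobs) / len(token_logprobs)
```

In practice the label log-probabilities would come from the LLM's output distribution over the first answer tokens, and the token log-probabilities from the generated sequence.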

Meta-Model Training Data
For each seen dataset D_i ∈ {D_i}_{i=1}^r, we sample K prompts {c_ij}_{j=1}^K. Then, for each sampled prompt, we compute the confidence vector conf^{M,c_ij}_{D_i} and pair it with the observed ICL accuracy acc^{M,c_ij}_i; these pairs form the meta-training examples.

Meta-Model Architectures
We choose meta-models that are easy to train and contain far fewer parameters than LLMs, for computational efficiency. In this paper, we consider three meta-model architectures. First, we use k-nearest neighbors regression (k-NN), which measures feature similarity. In the context of this paper, k-NN retrieves the seen-dataset confidence profiles most similar to the test dataset's confidence profile and predicts based on the observed ICL accuracies of the retrieved seen datasets. We use the implementation in the scikit-learn library. Second, we use a two-layer multilayer perceptron (MLP) that takes confidence feature vectors as input. Third, we use the tree-based method XGBoost (Chen and Guestrin, 2016) with the same confidence features. We use the XGBRegressor implemented in the XGBoost library and tune the hyperparameters as described in Appendix C.2.
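To make the k-NN variant concrete, here is a minimal NumPy sketch of the retrieve-and-average idea (the paper uses scikit-learn's KNeighborsRegressor; this standalone version is for illustration only, with hypothetical argument names):

```python
import numpy as np

def knn_predict(train_X, train_y, test_x, k=3):
    """Predict ICL accuracy for one test confidence vector as the mean
    observed accuracy of the k nearest seen-dataset profiles (Euclidean)."""
    dists = np.linalg.norm(train_X - test_x, axis=1)
    nearest = np.argsort(dists)[:k]
    return float(np.mean(train_y[nearest]))
```

Here each row of `train_X` is a seen dataset's confidence vector and each entry of `train_y` is its observed dataset-level ICL accuracy.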

Evaluation
For task performance evaluation, we use Exact Match (EM) accuracy for closed-set QA, and F1 score for open-ended QA.
We evaluate accuracy prediction models based on absolute error, defined as |âcc^{M,c_test}_test − acc^{M,c_test}_test|, where both quantities are computed on the test dataset D_test with a test prompt c_test. We then average the absolute error over all test prompts C_test to compute the dataset-specific mean absolute error. Finally, to evaluate the overall success of accuracy prediction across a collection of test datasets T, we measure the mean absolute error (MAE), defined as the average of the dataset-specific errors over all datasets in T.
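The two-level averaging can be sketched directly (helper names are ours):

```python
def dataset_mae(pred_accs, true_accs):
    """Dataset-specific error: mean |predicted - actual| over test prompts."""
    assert len(pred_accs) == len(true_accs)
    return sum(abs(p - t) for p, t in zip(pred_accs, true_accs)) / len(pred_accs)

def collection_mae(per_dataset_errors):
    """Overall MAE: mean of the dataset-specific errors over a collection T."""
    return sum(per_dataset_errors) / len(per_dataset_errors)
```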

Baselines
We consider four baselines for accuracy estimation.
Average training accuracy (AVGTRAIN). We simply take the average dataset-level accuracy of the seen datasets as our accuracy estimate:

âcc^{M,c}_test = (1/r) Σ_{i=1}^r acc^{M,c_i}_i.

Average Confidence (AVGCONF). We take the average confidence across the test dataset as the accuracy estimate:

âcc^{M,c}_test = (1/m) Σ_{x∈D_test} s^{M,c}(x).

Note that this baseline is only applicable to closed-set tasks, where the confidence scores are normalized probabilities.

Temperature Scaling (TS). Temperature scaling is a widely used calibration method (Hinton et al., 2015; Guo et al., 2017; Si et al., 2022). By fitting a single scalar parameter called the temperature τ, it produces softer model-assigned probabilities p_ỹ = softmax(z/τ)_ỹ, where z denotes the label logits. We then obtain scaled confidence scores with Equation 1 and evaluate AVGCONF on the test dataset. Note that we optimize the temperature τ based on the AVGCONF of the training datasets instead of the common approach of using NLL as the objective function.
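The AVGCONF-based temperature fitting differs from standard NLL-based temperature scaling; a grid-search sketch under our reading of it (all names are hypothetical, and per-example label logits are assumed available):

```python
import numpy as np

def scaled_max_prob(logits, tau):
    """Max softmax probability after dividing the logits by temperature tau."""
    z = np.asarray(logits, dtype=float) / tau
    z -= z.max()  # numerical stability
    p = np.exp(z)
    return (p / p.sum()).max()

def fit_temperature(train_logits, train_accs, grid=np.linspace(1.0, 3.0, 100)):
    """Pick tau so that the average confidence on the seen datasets best
    matches their observed ICL accuracies (AVGCONF objective, not NLL).

    train_logits: per-dataset lists of per-example label logits.
    train_accs: observed dataset-level ICL accuracies.
    """
    best_tau, best_err = None, float("inf")
    for tau in grid:
        err = sum(
            abs(np.mean([scaled_max_prob(l, tau) for l in logits]) - acc)
            for logits, acc in zip(train_logits, train_accs)
        )
        if err < best_err:
            best_tau, best_err = tau, err
    return best_tau
```

The grid matches the search range reported in Appendix C.3.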
Average Threshold Confidence (ATC). We use ATC (Garg et al., 2022) as one of our OOD accuracy estimation baselines. ATC estimates accuracy by fitting a confidence threshold on a single source dataset and transferring it to the target dataset. We take the estimated accuracy for the test dataset to be the average of the ATC estimates from each seen dataset:

âcc^{M,c}_test = (1/r) Σ_{i=1}^r atc^{M,c}_{i,test},

where atc^{M,c}_{i,test} is the D_i-to-D_test ATC estimate.
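A minimal sketch of one source-to-target ATC estimate, following the thresholding idea of Garg et al. (2022) under a simplified reading (the threshold is the error-rate quantile of the source confidences; helper names are ours):

```python
import numpy as np

def atc_estimate(src_confs, src_correct, tgt_confs):
    """Single source-target ATC estimate: fit a threshold t so that the
    fraction of source confidences below t equals the source error rate,
    then report the fraction of target confidences at or above t."""
    err = 1.0 - float(np.mean(src_correct))
    t = np.quantile(np.asarray(src_confs, dtype=float), err)
    return float(np.mean(np.asarray(tgt_confs, dtype=float) >= t))
```

The baseline above averages this estimate over all seen datasets D_i.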

Alternative Featurizations
In addition to the confidence profiles, we experiment with another featurization method that uses model embeddings from the LM M. Given a dataset D, LM M, and prompt c, we obtain the dataset embedding by first taking the last-layer, last-token embedding e^{M,c}(x) for each x ∈ D, and then averaging across the dataset:

embed^{M,c}_D = (1/|D|) Σ_{x∈D} e^{M,c}(x).

Since this vector is very high-dimensional (e.g., 5120-dimensional for 13B models), we use Principal Component Analysis (PCA) to reduce its dimensionality. We fit the PCA model on all dataset embedding vectors and project them into d_e-dimensional vectors embed_D, which we use as feature vectors. As an additional experiment, we concatenate the confidence vector and the embedding vector to form a combined feature vector. Reducing the dimensionality makes the comparison with confidence features fairer, and does not dilute the influence of confidence features when concatenating them with embedding features.
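The embedding featurization can be sketched with a plain SVD-based PCA (illustrative only; the paper fits a PCA model across all dataset embedding vectors):

```python
import numpy as np

def pca_reduce(embeddings, d_e=10):
    """Project mean dataset embeddings onto their top d_e principal
    components via SVD of the centered data matrix.

    embeddings: (num_datasets, hidden_dim) array, one averaged last-token
    embedding per dataset. Returns a (num_datasets, d_e) feature matrix.
    """
    X = np.asarray(embeddings, dtype=float)
    X = X - X.mean(axis=0)  # center before extracting components
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:d_e].T
```

The combined feature is then the concatenation of the confidence vector and the reduced embedding vector.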

Accuracy Estimation Benchmark
We benchmark both our meta-model method for ICL accuracy estimation and the baseline methods mentioned in Section 4.6 on a total of 12 LLM-dataset collection pairs (3 dataset collections × 4 LLMs). For each evaluation setting, we evaluate the 3 featurization methods mentioned in Section 4.7. This adds up to 36 experiment settings.
Datasets. We use three collections of datasets in total: multiple-choice QA (MCQA) from MMLU (Hendrycks et al., 2020) and both MCQA and closed-book QA (CBQA) from CrossFit (Ye et al., 2021) (see Table 5 for the full list of tasks). We henceforth use MCQA and CBQA to refer to the CrossFit dataset collections respectively. We use the implementations and training/test partitions from HuggingFace Datasets (Lhoest et al., 2021). We split each collection of datasets into meta-training/test splits using 5-fold cross-validation: we partition each dataset collection into 5 equal-sized subsets and run five versions of each experiment, one with each subset used as the meta-test set and the remaining subsets used as meta-training data. We take the average of the meta-test results as our final result.
Experimental Details.We generate prompts for each dataset using the method noted in Section 3.2.
For each dataset in MMLU, we combine the "validation" set and the "dev" set to form the training set. We choose up to the 5-shot setting because it has been studied in previous work (Touvron et al., 2023; Rae et al., 2021). Due to computational reasons, we sample only 30 prompts for MMLU (compared to 60 prompts for the MCQA/CBQA datasets) because it contains a very large number of datasets.

Main Results
The meta-model outperforms all baselines under certain evaluation settings. Table 1 shows the meta-model estimation error for each evaluation setting. For 8 out of 12 settings (all CBQA settings, LLaMA-7B on MCQA, LLaMA-13B on MMLU, and both OPT models on MCQA), the best meta-model architecture has 23.67% lower relative MAE than the best baseline method on average. In the best case (OPT-6.7B on MCQA), the meta-model achieves 43.5% lower relative MAE than all baselines. However, for the other 4 settings (both OPT models on MMLU, LLaMA-7B on MMLU, and LLaMA-13B on MCQA), baseline methods provide more accurate estimates of ICL accuracy. Figure 2 shows the evaluation results graphically. On average across all 12 settings, the best estimation errors from the meta-models are 32.5% less than the actual accuracy standard deviations. In 11 out of 12 settings, the estimation errors are within one standard deviation of the actual accuracy.
Oracle baselines indicate useful accuracy estimation. In comparison to the Oracle baselines, the meta-model outperforms the oracle_32 baseline in all MMLU and MCQA settings except LLaMA-13B on MCQA (which achieves oracle_8), and outperforms the oracle_16 baseline in all CBQA settings except LLaMA-13B (which achieves oracle_8).
In the two best-case settings (using XGBoost as the meta-model on MCQA with either OPT model), the meta-model achieves the oracle_128 baseline, i.e., it is equivalent to estimating the accuracy using 128 annotations.
Baseline methods are effective in some settings, providing competitive estimates for 5 settings (LLaMA-13B on MMLU, both OPT models on MMLU, and OPT-6.7B on MCQA).

Ablation on Model Architecture Across the three meta-model structures, the XGBoost meta-model provides the most accurate estimation overall, with the lowest MAE for 7 out of 12 evaluation settings. The average MAE is 5.88 for XGBoost meta-models, 5.94 for 3-NN meta-models, and 7.18 for MLP meta-models. Surprisingly, 3-NN meta-models have a lower average MAE than MLP meta-models despite having a simpler model structure. In Figure 3, we show that the XGBoost meta-model provides well-correlated accuracy estimates across 4 different evaluation settings.
Ablation on Featurization Methods We consider three featurization methods as described in Section 4.7. Table 2 in the appendix shows that the best overall accuracy estimation across settings is attained by using the confidence vectors as meta-features (lowest MAE for 26 out of 36 evaluation settings). The average MAE is 6.27 for conf, 8.34 for embed, and 7.36 for ce. Further, using conf as features shows a more dominant advantage on the CBQA tasks, achieving the lowest MAE for 11 out of 12 evaluation settings.

Effect of Unlabeled Data and Confidence Vector Dimensions
We now study confidence feature vector ablations by varying the number of unlabeled test examples m in each unseen dataset and the dimension of the confidence vector d_conf. We test with OPT-13B on MCQA datasets using the XGBoost meta-model, since this setting achieves the lowest MAE.
Figure 4 shows that increasing m enables better accuracy estimation, reducing the average MAE (across all d_conf) from 3.92 for m = 200 to 2.56 for m = 1000. Note that increasing m requires performing additional LLM inferences on unlabeled examples, so leveraging unlabeled test data is constrained by computational cost considerations. The quality of our accuracy estimates does not vary much as we change the confidence vector dimension d_conf, as shown in Figure 4.

Effect of Number of Shots
We compare ICL accuracy estimation performance given different k-shot ICL accuracy observations for LLaMA-13B on MMLU datasets. Table 4 in the Appendix shows that the meta-model produces slightly better ICL accuracy estimation in the 3-shot setting. Overall, the meta-model gives consistent accuracy estimates across different k-shot settings, as they all achieve oracle_32.

Prompt Selection
Previous work demonstrated that ICL performance is highly sensitive to prompt templates as well as in-context examples (Zhao et al., 2021; Perez et al., 2021; Chen et al., 2022); we are thus interested in whether our ICL accuracy estimation method can be used to select the best ICL prompt c ∈ C_test for the test dataset. For each dataset, we use the XGBoost meta-model to select the best prompt ĉ*, as opposed to the actual best prompt c*. We then compute the corresponding ICL accuracies and compare them to the average accuracy across all test prompts. Figure 5 shows that there is a significant difference in ICL accuracy across prompts for all 12 settings, and the selected prompts lead to better ICL accuracies than the average accuracy for 7 out of 12 settings. On average, the selected prompt is 15.6% as effective as the actual best prompt. The limited improvement over random selection indicates there is large room for improvement, and we encourage future work to derive better prompt selection criteria.

Discussion and Conclusion
In this paper, we study the problem of few-shot ICL accuracy estimation. We propose training a meta-model based on LLM confidence features and observed accuracies on seen datasets. We show that, without using any labeled test data, the meta-model is often able to produce accurate estimates of ICL accuracy, which is practically useful for predicting LLMs' accuracy on datasets that have high annotation costs. We also construct a large-scale benchmark for dataset-level ICL accuracy estimation by evaluating the meta-model and multiple baseline methods across 12 evaluation settings and 3 meta-feature options. We observe that while some baseline methods can provide good accuracy estimates, our meta-model demonstrates non-trivial improvements over baseline methods in 8 out of 12 evaluation settings. We encourage future work to develop better meta-model architectures as well as better meta-features, and to study potential applications of the meta-model, such as prompt template or in-context example selection. We believe that our benchmark can serve as an open challenge for improving dataset-level ICL accuracy estimation, leading to an improved understanding of when ICL is likely to be effective.

Limitations
While we conducted extensive experiments to study ICL accuracy estimation, there are many more LLMs that have exhibited impressive capabilities on a variety of tasks. Due to computational constraints, we do not benchmark accuracy estimation for LLMs with limited access (e.g., GPT-4 (OpenAI, 2023)), for which it is difficult to extract model embedding features, or for models larger than 13B. We also do not consider instruction-tuned models, to avoid possible overlaps between their training datasets and our evaluation datasets. Meanwhile, instruction tuning sometimes hurts model performance on canonical datasets such as MMLU, as shown in Gudibande et al. (2023), and it can also significantly hurt calibration, as reported in OpenAI (2023). For the same reasons, we include only a limited number of prompt templates and in-context example variations for ICL prompting. While we choose only 3 few-shot settings for MMLU and 2 for MCQA and CBQA, it is possible to achieve better accuracy estimation with more observations in the training data.
In terms of dataset selection, we use 13 closed-book QA tasks for the open-ended generation setting. Our findings might not generalize to other open-ended generation tasks such as summarization or long-form question answering. Overall, the meta-model provides effective accuracy estimation, but there is still substantial room for improvement.

C Implementation Details
We will release code to reproduce our results upon publication.

C.1 LLMs Implementations
We use LLaMA-7B and LLaMA-13B (Touvron et al., 2023) from Meta AI, and OPT-6.7B and OPT-13B (Zhang et al., 2022) from HuggingFace transformers. For LLaMA-7B, OPT-6.7B, and OPT-13B, we run evaluations on a single RTX A6000 GPU (48GB). For LLaMA-13B, we run evaluations with parallel inference on two RTX A6000 GPUs. We use half-precision for both OPT models. Note that these LLMs (except for OPT-13B) can be run on GPUs with smaller memory. We evaluate 42,360 ICL observations in total, where each observation is a dataset with 100 to 1000 examples. The total inference process takes around 2000 GPU hours.

C.2 Meta-model Implementations
All meta-model architectures can be trained on an i7-10700 CPU. The total training time of the meta-model for one experiment setting varies from 1.5 hours to 24 hours depending on the training data dimensions. We include the implementation details for each of the meta-model architectures below. We use random seed 1 for all processes involving randomness. For k-nearest neighbors regression, we use the KNeighborsRegressor implementation from the scikit-learn library. We use Euclidean distance as the distance metric and fit the model on the meta-training data.
For the MLP, we implement the model in PyTorch. We use a 2-layer MLP with hidden size 1536, learning rate 1e-5, and dropout rate 0.2. We use the Adam optimizer and MSE loss, and we perform early stopping on validation data, a 20% random partition of the meta-training data. For early stopping, the maximum number of epochs is 50 and the patience is 7.

C.3 Other Implementation Details
Output Collection For multiple-choice QA tasks (MMLU and MCQA), we collect generated choice labels (e.g., "(A)") from the first 5 generated tokens. For closed-book QA tasks (CBQA), we collect the first 16 newly generated tokens as the model output and truncate the outputs at the newline token.
Temperature Scaling We search for the optimal temperature τ based on the meta-training set. The search grid is np.linspace(1.0, 3.0, 100).

Meta-Model Training and Prediction The meta-model is trained on the meta-training examples and predicts the estimated accuracy âcc^{M,c_test}_test based on the test dataset feature vector conf^{M,c_test}_{D_test} for each test prompt c_test ∈ C_test. Note that since closed-set and open-ended generation have different confidence scores and accuracy evaluation metrics, the meta-model does not train on datasets that have a different task formulation than the test datasets.

D Few-shot Setting Ablation Results

Figure 2: Bar graph of evaluation results (MAE) for all meta-models, baseline methods, and Oracle baselines across all 3 dataset collections and all 4 LLMs. We use the confidence vector as the meta-feature. Red/blue bars represent the meta-model/baseline evaluation results, and the horizontal lines show the Oracle baselines.

Figure 3: We plot the meta-model predicted accuracy versus the actual accuracy across 4 settings. We use the XGBoost meta-model and the confidence vector meta-feature. Each point represents an observation for one dataset. Red/blue represents higher/lower absolute error.

Figure 4: Estimation results from ablating the number of unlabeled examples m (x-axis) and the confidence vector dimension d_conf (y-axis), evaluated with OPT-13B on MCQA using the XGBoost meta-model.

Figure 5: Prompt selection results for all evaluation settings, measured by the absolute difference between ICL accuracy when prompted with c and the average accuracy. Blue bars show the actual best prompt (c = c*), and red bars show the selected best prompt (c = ĉ*).

Figure 6: 4-shot ICL accuracy for OPT-13B and LLaMA-13B on CBQA tasks, where each boxplot summarizes the F1 scores over 30 ICL prompt variations of one dataset. Both OPT-13B and LLaMA-13B have large variances across different tasks, showing the challenge of ICL accuracy estimation.

Table 6: We use c to denote a prompt in general, C_i to denote the set of training prompts for dataset D_i, and C_test to denote the test prompts for dataset D_test.

Table 1: Evaluation results (MAE) and variations (SD) for all 4 LLMs and 3 dataset collections, using the confidence vector as the meta-feature. For the Oracle baselines, we include the closest lower bound; e.g., if the error is between oracle_32 and oracle_64, we report oracle_32. OPT-13B settings have the lowest average MAE.

Table 2: Evaluation results (MAE) for all 4 LLMs, 3 dataset collections, and 3 meta-feature choices. XGBoost is the best overall meta-model structure with an average MAE of 6.60. The confidence vector is the best overall feature with an average MAE of 6.38 across all evaluation settings. For MMLU, we use prompt templates from Hendrycks et al. (2020) plus one generated by ChatGPT. For MCQA and CBQA, we only use the Null template due to resource considerations.

Table 4: Estimation results for separate and mixed few-shot settings, measured by MAE and tested with LLaMA-13B on MMLU. The accuracy estimation is consistent across different shot settings. We report accuracy (ACC) as Exact Match accuracy.