Re-Examining Calibration: The Case of Question Answering

Accessible Abstract: Calibration is an important problem in question answering: if a search engine or virtual assistant doesn't know the answer to a question, it should probably abstain from showing an answer (to save embarrassment, as when Google said a horse had six legs). This EMNLP Findings paper shows that existing metrics for testing how well a QA system is calibrated reward pushing calibrated confidence toward the average confidence. We propose an alternative, both for evaluation and for producing better calibration, by looking at how models change as they learn.


Introduction
While large pretrained language models have conquered many downstream tasks (Devlin et al., 2019; Brown et al., 2020), it is sometimes unclear when we should trust them, since they often produce false (Lin et al., 2022) or hallucinated (Maynez et al., 2020) predictions. This is important both for model deployment, where low-confidence outputs can be censored, and for end users who need to know whether to trust a model output.¹

¹Code available at: https://github.com/NoviScl/calibrateQA

Figure 1: Distribution of predictions on HOTPOTQA in an OOD setting. We put predictions within the same confidence range into the same bucket (10 fixed-range buckets) and compute the average confidence and accuracy within each bucket. The x-axis represents the confidence range of each bucket; the y-axis represents the average answer accuracy for the dashed line plot and the relative bucket sizes for the histogram. Before calibration, most predictions have overly high confidence. After temperature scaling, all predictions' confidence values are scaled to become closer to the overall answer accuracy (24.5). Moreover, both correct (green bars) and wrong predictions (blue bars) are mixed in the same buckets, making them hard to distinguish.

The solution is to make sure that models provide reliable
confidence estimates so that we can abstain from wrong predictions and trust the right ones. The prerequisite for such abstention is Model Calibration: making the confidence represent the actual likelihood of being correct (Niculescu-Mizil and Caruana, 2005; Naeini et al., 2015). Past work proposes post-hoc approaches to calibrate model confidence, such as temperature scaling (Guo et al., 2017), which can effectively calibrate multi-class classification models as evaluated by the expected calibration error (ECE) metric.
We re-examine calibration and apply it to a complex task with real-world applications: open-domain question answering (ODQA; Chen et al., 2017). The task takes an input question, retrieves evidence passages from a large corpus such as Wikipedia, and then returns an answer string. Unlike classification, ODQA is a pipeline with multiple components: a passage retriever followed by a reader. This complexity is typical of modern machine learning systems and poses additional challenges. We explore adapting calibration methods to retriever-reader ODQA models in both in-domain and out-of-domain (OOD) settings. With the commonly used temperature scaling (TS) method, ODQA models get lower ECE, similar to previous findings on multi-class classification tasks (Guo et al., 2017; Desai and Durrett, 2020).
However, we argue that low ECE does not correspond to useful calibration: in fact, it underestimates the true calibration errors due to its bucketing mechanism. ECE measures the difference between the confidence and the expected accuracy by splitting the confidence values into buckets, taking the average confidence and average accuracy of each bucket, and marginalizing over their differences. However, this allows models with middling confidence to win on the ECE metric. For instance, temperature scaling assigns all predictions on HOTPOTQA confidence in the range [0.1, 0.5) (Figure 1), which does not help users separate correct from wrong predictions because the confidence values all fall in a similar range. Moreover, the bucketing mechanism causes a cancellation effect where over-confident and under-confident predictions are bucketed together and averaged out, hiding the instance-level calibration errors.
We propose Macro-average Calibration Error (MACROCE) as an alternative metric that directly focuses on distinguishing correct from wrong predictions (Section 4.2). MACROCE removes the bucketing mechanism and sums calibration error at the instance level. It also gives equal consideration to correct and wrong predictions through macro-averaging, making it insensitive to the accuracy level (e.g., when the accuracy is very low, simply lowering confidence on all predictions would lower ECE, but not MACROCE). This insensitivity to accuracy shifts satisfies the desiderata for a stable calibration metric (Nixon et al., 2019). We also show that this metric flips the conclusion based on ECE: four existing calibration methods, including temperature scaling (the ECE winner), do not lead to improvements in MACROCE (Section 4).
To address this shortcoming, we propose a new method, CONSCAL, which tracks whether the model makes consistent predictions over different checkpoints during training. The intuition is that if the model makes the same prediction consistently throughout the training trajectory, this is a strong sign that the model is confident about the prediction. CONSCAL significantly improves MACROCE in both in-domain and OOD evaluation (Section 5), including when downstream users must validate model predictions (Section 6).
In summary, our contributions are: 1. We thoroughly study calibration in the ODQA setting, an under-explored real-world problem involving complex pipelines. We find that existing calibration methods like TS achieve very low ECE yet fail to produce confidence scores that separate correct from wrong predictions. 2. We show that ECE underestimates calibration errors due to its bucketing mechanism and propose MACROCE, an instance-level, accuracy-insensitive alternative. 3. We propose CONSCAL, a calibration method based on prediction consistency across training checkpoints, which substantially improves MACROCE and better supports users deciding when to trust model predictions.

Background
This section reviews the existing calibration framework, the associated ECE metric, and the commonly used temperature scaling method that effectively optimizes the ECE metric.

Bucketing-based Calibration and ECE
Under the existing calibration framework, a model is "perfectly calibrated" if the prediction probability (i.e., confidence) reflects the ground truth likelihood (Niculescu-Mizil and Caruana, 2005). Specifically, given the input $x$, the ground truth $y$, and the prediction $\hat{y}$, the perfectly calibrated confidence $\mathrm{Conf}(x, \hat{y})$ satisfies
$$P(\hat{y} = y \mid \mathrm{Conf}(x, \hat{y}) = p) = p, \quad \forall p \in [0, 1].$$
Prior work (Guo et al., 2017) evaluates calibration with Expected Calibration Error (ECE), where $N$ model predictions are bucketed into $M$ bins and predictions within the same confidence range are put into the same bucket. Let $B_m$ be the $m$-th bin of $(x, y, \hat{y})$ triples. The accuracy $\mathrm{Acc}(B_m)$ measures how many instances in the bin are correct,
$$\mathrm{Acc}(B_m) = \frac{1}{|B_m|} \sum_{(x, y, \hat{y}) \in B_m} \mathbb{1}[\hat{y} = y],$$
where $|B_m|$ is the number of examples in the $m$-th bin, and $\mathrm{Conf}(B_m)$ computes the average confidence in the bin,
$$\mathrm{Conf}(B_m) = \frac{1}{|B_m|} \sum_{(x, y, \hat{y}) \in B_m} \mathrm{Conf}(x, \hat{y}).$$
Finally, ECE measures the difference in expectation between confidence and accuracy over all bins,
$$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{N} \big| \mathrm{Acc}(B_m) - \mathrm{Conf}(B_m) \big|.$$
Most work uses equal-width buckets: a triple $(x, y, \hat{y})$ with $\frac{m}{M} \le \mathrm{Conf}(x, \hat{y}) < \frac{m+1}{M}$ is assigned to the $m$-th bin. Minderer et al. (2021) and Nguyen and O'Connor (2015) also use equal-mass binning: predictions are sorted by their confidence values and $\frac{N}{M}$ triples are assigned to each bin. We find little difference between equal-width and equal-mass binning ECE results (Table 4), and so we use the more common equal-width binning in the rest of our experiments.
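To make the bucketing computation concrete, here is a minimal Python sketch of equal-width ECE; the function name and the default of 10 bins are our choices for illustration, not taken from the released code.

```python
import numpy as np

def expected_calibration_error(confidences, corrects, num_bins=10):
    """Equal-width ECE: bucket predictions by confidence, then sum the
    per-bucket |accuracy - confidence| gaps weighted by bucket size."""
    confidences = np.asarray(confidences, dtype=float)
    corrects = np.asarray(corrects, dtype=float)  # 1.0 if the prediction is correct, else 0.0
    n = len(confidences)
    bin_edges = np.linspace(0.0, 1.0, num_bins + 1)

    ece = 0.0
    for m in range(num_bins):
        lo, hi = bin_edges[m], bin_edges[m + 1]
        if m == num_bins - 1:
            in_bin = (confidences >= lo) & (confidences <= hi)  # include confidence 1.0
        else:
            in_bin = (confidences >= lo) & (confidences < hi)
        if in_bin.sum() == 0:
            continue
        gap = abs(corrects[in_bin].mean() - confidences[in_bin].mean())
        ece += (in_bin.sum() / n) * gap
    return ece
```

Because each bucket's gap is computed on averages, over-confident and under-confident predictions inside the same bucket can cancel out, which is exactly the issue examined in Section 4.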

Temperature Scaling
Without calibration, the confidence is often too high (or, less commonly, too low): it thus needs to be scaled down or up. A widely used calibration method is temperature scaling (Guo et al., 2017), which uses a single scalar parameter called the temperature $\tau$ to scale the confidence. The temperature value is optimized on the dev set. Given the set of candidate answers $C$ and the logit vector $z \in \mathbb{R}^{|C|}$, the confidence for the prediction $\hat{y}$ that is the $j$-th label in $C$ is
$$\mathrm{Conf}(x, \hat{y}) = \frac{\exp(z_j / \tau)}{\sum_{c=1}^{|C|} \exp(z_c / \tau)}.$$
For classification, the temperature scalar $\tau$ is tuned to optimize negative log likelihood (NLL) on the dev set. Temperature scaling only changes the confidence, not the predictions, so the model's accuracy always remains the same.
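As a rough illustration (not the authors' implementation), the sketch below applies the temperature to the logits before the softmax and picks $\tau$ by a simple grid search on dev-set NLL; the grid range and function names are our assumptions.

```python
import numpy as np

def softmax(logits, tau=1.0):
    z = np.asarray(logits, dtype=float) / tau
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def tune_temperature(dev_logits, dev_labels, grid=np.linspace(0.5, 5.0, 46)):
    """Pick the temperature minimizing dev-set NLL (the classification recipe).
    dev_logits: (N, |C|) array of logits; dev_labels: (N,) gold label indices."""
    best_tau, best_nll = 1.0, float("inf")
    for tau in grid:
        probs = softmax(dev_logits, tau)
        nll = -np.log(probs[np.arange(len(dev_labels)), dev_labels] + 1e-12).mean()
        if nll < best_nll:
            best_tau, best_nll = tau, nll
    return best_tau

# Scaled confidence of the argmax prediction (the prediction itself is unchanged):
# conf = softmax(test_logits, tau=best_tau).max(axis=-1)
```

For ODQA we later tune $\tau$ against dev-set ECE instead of NLL (Section 3.2), which only changes the objective inside the search loop.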

Calibration in Open-Domain Question Answering
This section adapts the bucketing-based calibration framework for multi-class classification to ODQA and evaluates this calibration method on multiple QA benchmarks, both in- and out-of-domain.

The ODQA Model
We use the model from Karpukhin et al. (2020), consisting of retrieval and reader components. The retrieval model is a dual encoder that computes vector representations of the question and each Wikipedia passage and returns the top-K passages with the highest inner product scores between the question vector and the passage vector. The reader model is a BERT-based (Devlin et al., 2019) span extractor. Given the concatenation of the question and each retrieved passage, it returns three logit values, representing the passage selection score, the start position score, and the end position score. These three logits are produced by three different classification heads on top of the final BERT layer. More precisely, letting $\mathbf{h}_{i}$ denote the final-layer representations for the $i$-th passage,
$$z_{\mathrm{psg}}(i) = w_{\mathrm{psg}}^{\top} \mathbf{h}_{i}^{[\mathrm{CLS}]}, \qquad z_{\mathrm{start}}(s, i) = w_{\mathrm{start}}^{\top} \mathbf{h}_{i,s}, \qquad z_{\mathrm{end}}(e, i) = w_{\mathrm{end}}^{\top} \mathbf{h}_{i,e},$$
where $w_{\mathrm{psg}}, w_{\mathrm{start}}, w_{\mathrm{end}} \in \mathbb{R}^{h}$ are trainable parameters.
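A schematic PyTorch sketch of these three heads, assuming hidden states from the final BERT layer; this illustrates the head structure rather than reproducing the authors' exact code.

```python
import torch
import torch.nn as nn

class ReaderHeads(nn.Module):
    """Three linear heads over the final BERT layer: passage selection (on the
    [CLS] token), span start, and span end (on every token)."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.w_psg = nn.Linear(hidden_size, 1, bias=False)
        self.w_start = nn.Linear(hidden_size, 1, bias=False)
        self.w_end = nn.Linear(hidden_size, 1, bias=False)

    def forward(self, hidden_states: torch.Tensor):
        # hidden_states: (num_passages, seq_len, hidden_size)
        z_psg = self.w_psg(hidden_states[:, 0]).squeeze(-1)    # (num_passages,)
        z_start = self.w_start(hidden_states).squeeze(-1)      # (num_passages, seq_len)
        z_end = self.w_end(hidden_states).squeeze(-1)          # (num_passages, seq_len)
        return z_psg, z_start, z_end
```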

Temperature Scaling For ODQA
The formulation of ODQA is unlike conventional multi-class classification since it involves both the retriever and the reader, which raises the question of what to base the confidence score on. To adapt temperature scaling to ODQA, we take the set of top span predictions as our candidate set C. Specifically, we compute the raw score for each candidate span and then apply a softmax over C to convert the raw span scores into probabilistic confidence values. We explore two possible implementations: Joint Calibration considers both passage and span scores; Pipeline Calibration selects the highest-scored passage and calibrates on span scores only.
Joint Calibration. Given the top $k = 10$ retrieved passages for each question and each passage's top $n = 10$ spans, we have an answer set of $n \times k = 100$ spans per question. We score each candidate span $(s, e)$ from passage $i$ by adding its passage, span start, and span end scores:
$$S(s, e, i) = z_{\mathrm{psg}}(i) + z_{\mathrm{start}}(s, i) + z_{\mathrm{end}}(e, i).$$
We then apply temperature scaling to the predicted logits, and the confidence becomes
$$\mathrm{Conf}(x, \hat{y}) = \frac{\exp\!\big(S(s, e, i)/\tau\big)}{\sum_{(s', e', i') \in C} \exp\!\big(S(s', e', i')/\tau\big)}.$$
For ODQA, the number of correct answers in the candidate set $C$ varies (zero, one, or more). Hence, the temperature scalar $\tau$ is tuned to optimize dev set ECE instead of NLL.
Pipeline Calibration. We choose the passage with the highest passage selection score, $i_{\max} = \arg\max_{1 \le i \le K} z_{\mathrm{psg}}(i)$, and then define the span score as
$$S(s, e, i_{\max}) = z_{\mathrm{start}}(s, i_{\max}) + z_{\mathrm{end}}(e, i_{\max}).$$
In this case, we only keep the top $n = 10$ spans from the top passage for each question. As in Joint Calibration, we apply temperature scaling to the predicted span logits and compute the confidence as a softmax over the scaled scores.
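To make the two variants concrete, here is an illustrative sketch of the candidate scoring and temperature-scaled confidence; it assumes the candidate spans have already been enumerated upstream (e.g., the top-n spans per passage), and the function names are ours.

```python
import numpy as np

def softmax(scores, tau=1.0):
    z = np.asarray(scores, dtype=float) / tau
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def joint_confidence(candidates, z_psg, z_start, z_end, tau):
    """candidates: list of (passage_idx, start, end) spans pooled over all top-k passages.
    Score = passage + start + end logits; confidence = temperature-scaled softmax."""
    scores = [z_psg[i] + z_start[i][s] + z_end[i][e] for (i, s, e) in candidates]
    return softmax(scores, tau)

def pipeline_confidence(candidates, z_psg, z_start, z_end, tau):
    """Keep only spans from the single highest-scoring passage; score = start + end logits."""
    i_max = int(np.argmax(z_psg))
    kept = [(i, s, e) for (i, s, e) in candidates if i == i_max]
    scores = [z_start[i][s] + z_end[i][e] for (i, s, e) in kept]
    return kept, softmax(scores, tau)
```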

Temperature Scaling Results
We apply the above temperature scaling methods in both in-domain and OOD settings, since Desai and Durrett (2020) and Jiang et al. (2021) argue that OOD calibration is more challenging than in-domain calibration. We use NATURALQUESTIONS (NQ; Kwiatkowski et al., 2019) as the in-domain dataset, and SQUAD (Rajpurkar et al., 2016), TRIVIAQA (Joshi et al., 2017), and HOTPOTQA (Yang et al., 2018) as the out-of-domain datasets.² We tune hyper-parameters on the in-domain NQ dev set. We report exact match (EM) for answer accuracy and ECE for calibration results. Without calibration, both joint and pipeline approaches have high ECE scores, and the pipeline approach incurs higher out-of-the-box calibration error (Table 1). Applying temperature scaling significantly lowers ECE in all cases, in both in-domain and OOD settings. As expected, OOD settings incur higher ECE than the in-domain setting even after calibration. However, in the next section, we challenge this "success" by re-examining the bucketing mechanism in ECE.
²More dataset details are in Appendix C.

Flaws in ECE and Better Alternatives
This section takes a closer look at model accuracy and confidence and illustrates how ECE is misleading in evaluating model calibration. We provide complementary views of calibration and propose a new calibration metric as an alternative.

What's Wrong With ECE?
We illustrate the ECE problem with a case study on HOTPOTQA; similar trends surface for other datasets (Appendix F). The uncalibrated model is over-confident (Figure 1): the confidence is higher than the accuracy. After temperature scaling, the accuracy and confidence converge, reducing ECE. However, this over-estimates the effectiveness of temperature scaling for two reasons. First, most instances are assigned similar confidence: all predictions have a confidence score between 0.1 and 0.5, which gives no useful cue beyond the fact that the model is not confident on most predictions. This is not ideal since, if there were examples we could trust or abstain on, an ideal calibration metric should recognize such scenarios and encourage the calibrator to differentiate correct and wrong predictions. Second, bucketing causes cancellation effects, ignoring instance-level calibration error. Many predictions are clustered in the same buckets; as a result, many over-confident and under-confident predictions end up in the same bucket and are averaged to become closer to the average accuracy.

The Need for an Alternative View. The above issues arise because ECE only measures the expectation: the aim is merely to match the confidence with the expected accuracy. However, this goal can be trivially achieved by outputting similar confidence for all predictions that matches the expected accuracy, as temperature scaling does, which is not useful because users cannot easily use such confidence scores to decide when to trust the model. Hence, we propose an alternative view of calibration where the goal is to maximally differentiate correct and wrong predictions. We argue that achieving this goal brings more practical value for real use cases. Toward this end, we propose a new calibration metric that aligns closely with this objective.

New Metric: MACROCE
We propose alternative metrics that remove the bucketing mechanism to prevent the above problems. We consider two such metrics, ICE and MACROCE, evaluate their robustness to various distribution shifts, and propose MACROCE as the main metric.
Instance-level Calibration Error (ICE) accumulates the calibration error of each individual prediction and takes an average. Formally,
$$\mathrm{ICE} = \frac{1}{N} \sum_{i=1}^{N} \big| \mathbb{1}[\hat{y}_i = y_i] - \mathrm{Conf}(x_i, \hat{y}_i) \big|.$$
ICE is similar to the Brier Score (Brier, 1950) except that we average the absolute difference between accuracy and confidence, instead of the squared error. While ICE and the Brier Score avoid the issues caused by bucketing, they have another problem: they can easily be dominated by the majority label class. For example, if the model achieves high accuracy, always assigning a high confidence can yield a low Brier Score and ICE because the wrong predictions contribute very little to the overall calibration error (we show this empirically in the following experiments). This is undesirable because even when wrong predictions are rare, mistrusting them can still cause severe harm to users. To address this, we additionally macro-average the calibration errors on correct and wrong predictions and name this metric MACROCE.

Macro-average Calibration Error (MACROCE)
considers instance-level errors but takes equal consideration of correct and wrong predictions made by the model. Specifically, it calculates a macro-average over the calibration errors on correct predictions and on wrong predictions:
$$\mathrm{ICE}_{\mathrm{pos}} = \frac{1}{n_p} \sum_{i:\, \hat{y}_i = y_i} \big(1 - \mathrm{Conf}(x_i, \hat{y}_i)\big), \qquad \mathrm{ICE}_{\mathrm{neg}} = \frac{1}{n_n} \sum_{i:\, \hat{y}_i \neq y_i} \mathrm{Conf}(x_i, \hat{y}_i),$$
$$\mathrm{MacroCE} = \frac{1}{2}\big(\mathrm{ICE}_{\mathrm{pos}} + \mathrm{ICE}_{\mathrm{neg}}\big),$$
where $n_p$ and $n_n$ are the numbers of correct and wrong predictions.
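A minimal sketch of ICE and MACROCE as defined above (variable names are ours):

```python
import numpy as np

def ice(confidences, corrects):
    """Instance-level Calibration Error: mean |correctness - confidence|."""
    confidences = np.asarray(confidences, dtype=float)
    corrects = np.asarray(corrects, dtype=float)
    return np.abs(corrects - confidences).mean()

def macro_ce(confidences, corrects):
    """MacroCE: macro-average of the calibration errors on correct and wrong predictions."""
    confidences = np.asarray(confidences, dtype=float)
    corrects = np.asarray(corrects, dtype=bool)
    ice_pos = (1.0 - confidences[corrects]).mean()  # correct predictions should be confident
    ice_neg = confidences[~corrects].mean()         # wrong predictions should not be
    return 0.5 * (ice_pos + ice_neg)
```

Unlike ECE, assigning every prediction a confidence equal to the overall accuracy does not score well here: wrong predictions are penalized through ICE_neg no matter how rare they are.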
Ideal calibration metrics should be insensitive to shifts in accuracy (Nixon et al., 2019) and stably reveal models' calibration in all situations. We examine the robustness of these metrics below.

Temperature Scaling at Different Accuracy Levels. We re-sample the data to vary the model accuracy and examine the effect of temperature scaling at different accuracy levels. Before calibration, the ECE score decreases with higher model accuracy (Table 2), since higher accuracy happens to match the over-confident predictions and is rewarded with a low ECE score. This finding also applies to ICE: since the majority of predictions are correct, the impact of over-confident wrong predictions is marginal. MACROCE results remain stable across all accuracy levels: as model accuracy increases, ICE_pos decreases and ICE_neg increases, and MACROCE captures this trade-off, implying that the model remains poorly calibrated.

Temperature Scaling under Accuracy Shift. We also tune the temperature on a dev set whose accuracy differs from the test set (a dev set with only 10% correct predictions paired with a test set with 90% correct predictions, and the reverse). ECE and ICE change significantly under such accuracy shifts even though the underlying model is the same; only MACROCE stays stable, as desired.

Existing Calibration Baselines

Simple Baselines. We begin with two simple baselines: the Binary baseline assigns confidence 1 to the top t% (t equal to the dev set accuracy) most confident predictions in the test set and 0 to the rest; the Average baseline assigns all test set predictions a confidence value equal to the average dev set accuracy.
Feature Based Classifier. Prior work has trained feature-based classifiers to predict the correctness of outputs (Zhang et al., 2021; Ye and Durrett, 2022). Following prior work (Kamath et al., 2020), we train a binary SVM classifier with features based on previous work (Rodriguez et al., 2019); the features are described in Appendix B. At inference time, we use the classifier's predicted probability that the test example is correct as its confidence.
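An illustrative scikit-learn sketch of this baseline; the feature extraction is assumed to happen elsewhere (Appendix B lists the features), and the hyper-parameters are placeholders rather than the paper's settings.

```python
from sklearn.svm import SVC

def train_correctness_classifier(X_dev, y_dev):
    """X_dev: (N, d) features of the QA model's dev-set predictions;
    y_dev: 1 if the predicted answer is correct, 0 otherwise."""
    clf = SVC(probability=True)  # probability=True enables predict_proba via Platt scaling
    clf.fit(X_dev, y_dev)
    return clf

def classifier_confidence(clf, X_test):
    # the predicted probability of the "correct" class is used as the confidence
    return clf.predict_proba(X_test)[:, 1]
```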
Neural Reranker. We train a neural reranker as an alternative to manual features. We adopt RECONSIDER (Iyer et al., 2021), training a BERT-large classifier on the concatenation of the question, passage, and answer span. Passing the raw logit through a sigmoid provides the confidence score (we also experimented with softmax but found sigmoid to be substantially better).
Label Smoothing. In addition to the post-hoc calibration methods above, another approach is to train models that are inherently better calibrated; a representative method is label smoothing (Pereyra et al., 2017; Desai and Durrett, 2020). Label smoothing assigns the gold label probability α and each of the other classes probability (1 − α)/(|Y| − 1). We apply label smoothing to two components of the ODQA pipeline: passage selection (where the first passage is gold and the remaining K − 1 are false) and span selection (where the gold classes are the correct start and end positions of the answer span and the false classes are the other positions in the passage).
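A small sketch of the smoothed target distribution and the KL-divergence loss described in Appendix B; it follows the convention stated above (the gold class receives probability α), and the function names are ours.

```python
import torch
import torch.nn.functional as F

def smoothed_targets(gold_idx, num_classes, alpha):
    """Gold class gets probability alpha; the remaining mass is spread
    uniformly over the other classes, as stated in the text."""
    targets = torch.full((num_classes,), (1.0 - alpha) / (num_classes - 1))
    targets[gold_idx] = alpha
    return targets

def label_smoothing_loss(logits, gold_idx, alpha):
    """KL divergence between the smoothed gold distribution and the model's
    predicted distribution (replacing the usual cross-entropy)."""
    log_probs = F.log_softmax(logits, dim=-1)
    targets = smoothed_targets(gold_idx, logits.size(-1), alpha)
    return F.kl_div(log_probs, targets, reduction="sum")
```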

CONSCAL: Calibration Through Consistency
The failure of temperature scaling under MACROCE implies that relying only on the final outputs of the QA model is not sufficient for calibration. We therefore propose CONSCAL, which derives confidence from how consistently the model predicts the same answer across checkpoints during training. Specifically, given N model checkpoints, we obtain the final model prediction p from the last checkpoint, count the checkpoints that make the same prediction, and assign a confidence of 1 if the count is greater than a threshold n, and 0 otherwise. The threshold n is a hyper-parameter chosen on the development set. Apart from this binary confidence setting, we also explore assigning continuous confidence values based on checkpoint consistency. This continuous variant gets slightly worse MACROCE than the binary version but still improves over all previous baselines (Appendix B).
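A minimal sketch of the binary CONSCAL rule described above; the data layout (one predicted answer string per saved checkpoint) and the example values are our assumptions.

```python
def conscal_confidence(checkpoint_predictions, threshold):
    """checkpoint_predictions: the answer predicted by each of the N saved
    checkpoints for one question, ordered by training time (last = final model).
    Confidence is 1 if more than `threshold` checkpoints agree with the final
    prediction, else 0."""
    final_prediction = checkpoint_predictions[-1]
    agreement = sum(p == final_prediction for p in checkpoint_predictions)
    confidence = 1.0 if agreement > threshold else 0.0
    return final_prediction, confidence

# Example with N = 5 checkpoints and a dev-tuned threshold (here 3 as a placeholder):
preds = ["elvis presley", "elvis presley", "frank sinatra", "elvis presley", "elvis presley"]
print(conscal_confidence(preds, threshold=3))  # ('elvis presley', 1.0)
```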

Experimental Results
Some existing calibration methods lower ECE, including temperature scaling, the simple average baseline, and label smoothing (Table 2). In particular, the simple average baseline has the lowest ECE both in-domain and OOD. This confirms our earlier point that you can lower ECE by assigning confidence values close to the accuracy for all predictions, without any discrimination between correct and wrong predictions. However, none of these baselines reduces MACROCE.
CONSCAL significantly lowers MACROCE, outperforming the previous best by 11% and 6% absolute in-domain and out-of-domain, respectively. This confirms the effectiveness of using consistency over different checkpoints throughout training. While both the binary baseline and CONSCAL have binary outputs, CONSCAL has lower MACROCE, so the improvement is not just due to the binary confidence. Additional results in Appendix D compare the joint and pipeline approaches (defined in Section 3.2), which have similar MACROCE.
To analyze whether the gains of CONSCAL are simply from ensembling multiple checkpoints, we compare with an additional baseline called CONSCAL w/o Training Dynamics: we finetune the model N times independently using different random seeds, obtain final predictions through majority vote, and compute the confidence scores as we did for CONSCAL. We use the same N (N = 5) for both CONSCAL and CONSCAL w/o Training Dynamics. While this variant reduces MACROCE more than any of the previous methods, its MACROCE values are still higher than CONSCAL's both in-domain and OOD. This suggests that, while ensembling is one factor in reducing MACROCE, it is not the only one: considering training dynamics remains important. We provide a qualitative example in Figure 4: the model changes its prediction in the last epochs, and such inconsistency is a cue for low confidence under CONSCAL.

Human Study
Finally, we investigate whether CONSCAL improves user decision making, mimicking the scenario of validating a search engine's answer to a question, with a human study. We randomly sample 100 questions from the NQ test set and present them to annotators along with the DPR-BERT predictions. We ask annotators to judge the correctness of the model predictions under four settings: (1) show only questions and predictions without model confidence; (2) show the QA pairs along with raw model confidence without calibration; (3) show the QA pairs along with temperature-scaled confidence; (4) show the QA pairs along with confidence calibrated by CONSCAL. We recruit a total of twenty annotators (five per setting) on Prolific, each annotating 100 questions, with average compensation of $14.4/hour.
We measure the precision, recall, and F1 score of the human judgements, and also report Krippendorff's alpha among the five annotators. Showing the confidence scores significantly improves human decision making (Table 3), and CONSCAL helps achieve better F1 than temperature scaling. Interestingly, despite CONSCAL's binary confidence scores of 0 and 1, humans sometimes do not follow the confidence scores and "overrule" them; this actually leads to a lower F1 than a baseline that always follows CONSCAL's confidence.

Table 3: We ask humans, given an estimate of confidence, whether they believe a QA system is correct or not. The model accuracy on this sampled set is 35%. Apart from human ratings, we additionally show a baseline of always following CONSCAL's judgement in the last row. Showing the confidence significantly improves human judgement, especially with CONSCAL. Surprisingly, annotators sometimes do not trust CONSCAL's confidence scores and overrule them with their own judgement, which results in worse F1 (second-to-last row). Furthermore, MACROCE ranks CONSCAL as the best method, agreeing with human evaluation, while ECE misleadingly favors temperature scaling.

Moreover, the ECE metric ranks temperature scaling as best,
contradicting the human study results. Nevertheless, our MACROCE correctly ranks CONSCAL as the best method, aligning with human judgement.

Related Work
This section reviews prior work on calibration metrics and methods that are relevant to NLP.
Calibration Metrics. The Brier Score (Brier, 1950) is one of the earliest calibration metrics; it sums the squared errors between accuracy and confidence over all instances, but is only applicable to binary classification. Later work studies calibration for NLP tasks such as natural language understanding (Desai and Durrett, 2020) and sequence tagging (Nguyen and O'Connor, 2015). In question answering, Kamath et al. (2020) propose the selective question answering setting, which aims to abstain on as few questions as possible while maintaining high accuracy. Toward this goal, later approaches (Zhang et al., 2021; Ye and Durrett, 2022) extract features to train a binary classifier that decides which questions to abstain on. While selective question answering offers a measurement of calibration, the scale of the confidence values is not considered: abstention can be effective as long as correct predictions have higher confidence than wrong ones, regardless of the absolute scales. Concurrent work (Dhuliawala et al., 2022) explores calibration for retriever-reader ODQA, focusing on combining information from the retriever and reader. In addition to span-extraction QA, Jiang et al. (2021) and Si et al. (2022) also explore calibration for generative QA. However, these works use ECE for evaluation.

Conclusion
This paper investigates calibration in the realistic application of ODQA, where users need to decide whether to trust model predictions based on confidence scores. Although confidence scores produced by existing calibration methods improve the popular ECE metric, they do not help distinguish correct from wrong predictions. We propose the MACROCE metric to remedy these flaws and show that existing calibration methods fail on it. We further propose a simple and effective calibration method, CONSCAL, that leverages training consistency. Our human study confirms both the effectiveness of CONSCAL and the alignment between MACROCE and human preference. Our work advocates and paves the path for user-centric calibration, and CONSCAL is a promising direction for better calibration. Future work can adapt our calibration metric and method to more diverse tasks (such as generative tasks) and explore other ways to further improve user-centric calibration.

Limitations
We note several limitations of this paper and point to potential future directions to address them: • MACROCE is motivated from a user-centric perspective where we want to maximally distinguish correct and wrong predictions. However, it is not a panacea for all use cases; in some applications, the confidence output may need to stay at an intermediate value to express uncertainty, rather than taking a stance as MACROCE encourages.
• Our experiments are focused on ODQA and, in particular, span-extraction models (we also showed similar findings on binary sentiment classification in Appendix F). While we expect most findings in this paper to hold for other models and tasks as well, this needs to be empirically verified in future work. In particular, one promising line of future work is to verify whether CONSCAL also works well for text generation tasks and models.

Ethical Considerations
Data and Human Subjects. All datasets used in this paper are from existing public sources, and we do not expect any violation of intellectual property or privacy. All human annotators that we recruited on Prolific were well compensated, and we did not receive any complaints from the annotators regarding the job (nor do we perceive any possible harm to them).
Broader Impact. We expect our study to have a positive impact on the safe deployment of AI applications. Our study targets the real-world application of question answering from a user-centric perspective. We make model predictions more trustworthy to users by providing well-calibrated confidence scores. This especially helps users avoid misleading wrong predictions, which can cause serious trouble in real-life applications such as digital assistants and search engines. Our human study has also confirmed the advantages of our proposed metric and calibration method.
Temperature Scaling with Different Temperature Scalars. We apply temperature scaling with varying temperature scalars τ. According to Figure 5, as we increase the temperature value, the confidence scores decrease; consequently, ECE varies largely with the temperature while MACROCE stays relatively constant.

B Implementation Details of Methods in Section 5
Feature Based Classifier. We include the following features based on previous work (Rodriguez et al., 2019): the length of the question, passage, and predicted answer; raw and softmax logits of the passage and span position selection; softmax logits of the other top predicted answer candidates; the number of times the predicted answer appears in the passage and the question; and the number of times the predicted answer appears in the top candidates.
We use the QA model's predictions on the NQ dev set as the training data. We re-sample the data to get a balanced training set, and the training objective is binary classification on whether the answer prediction is correct. We hold out 10% of the predictions as a validation set and apply early stopping based on the validation loss. During inference, we directly use the predicted probability as the confidence value.
Neural Reranker. During training, for each question we include one randomly chosen positive and M − 1 (M = 10) randomly chosen hard negatives (hard negatives are negative predictions with the highest raw logits). We use DPR-BERT's predictions on the NQ training set for reranker training. During inference, we use the trained reranker to rerank the top five predictions. In particular, we use a sigmoid to convert the raw reranker logits to probabilistic confidence values.
Label Smoothing. We use α = 0.1 in our experiments, and we find that the calibration results are largely insensitive to the choice of α. We change the loss function from cross entropy to KL divergence with the label-smoothed gold probability distribution. We compare the calibration results of the model trained with and without label smoothing, and we also explore applying temperature scaling on top of the model trained with label smoothing.
CONSCAL. We use the final checkpoint's predictions as the final predictions. We do this instead of taking the majority vote of all intermediate checkpoints because earlier checkpoints have lower answer accuracy than the final checkpoint.

E Impact of Checkpoint Numbers in CONSCAL
In the main paper we saved a total of N = 5 checkpoints during training for CONSCAL. To understand the impact of this hyper-parameter, we experiment with N = {9, 17} and report the results in Table 8. We observe that the impact of different N is very small.

F Illustration of ECE Flaws on More Datasets
In the main paper we illustrated the flaws of ECE with a case study on HOTPOTQA. Here we additionally present visualizations of calibration results on NQ (in-domain) and SQUAD (OOD). As shown in Figure 6, the flaws of ECE described in the main paper hold for these datasets as well, validating the generality of our conclusions.
In addition to QA, we also present results on SST-2 (Socher et al., 2013), a binary sentiment analysis dataset, in Figure 7. We observe the same trend: all predictions have similar confidence scores, making it difficult to identify the wrong predictions.

Figure 4: An example from NQ where the final prediction is wrong. The original uncalibrated confidence is high; temperature scaling lowers the confidence, but it is still over-confident and could mislead users. CONSCAL sets the confidence to 0 because the predictions across checkpoints in the training trajectory are inconsistent: the model is confused between the specific performer and the original singer in this example.

Figure 5: Calibration errors after temperature scaling. The x-axis represents different temperature values; lines with different colors represent different metrics. MACROCE stays relatively constant while ECE varies largely across temperature values.

Figure 6: Bucketing distribution of predictions on NQ and HOTPOTQA. The x-axis represents the confidence range of each bucket; the y-axis represents the average answer accuracy for the dashed line plot and the relative bucket sizes for the histogram. Similar flaws hold for these two datasets: after temperature scaling, all predictions' confidence values are scaled to become closer to the overall answer accuracy, and correct (green bars) and wrong (blue bars) predictions are mixed in the same buckets, making them hard to distinguish.

Figure 7: Bucketing distribution of predictions on SST-2 (Socher et al., 2013), a binary sentiment analysis dataset. We can see the same trend as on NQ and HOTPOTQA: all predictions' confidence values are very close. In fact, on SST-2, all predictions have high confidence both before and after temperature scaling, making it hard for users to identify wrong predictions.

Table 1: In-domain and OOD calibration results (ECE ↓, ICE ↓, MACROCE ↓). Joint and Pipeline refer to whether the candidate set consists of top answer candidates from all top-10 retrieved passages or just the top-1 retrieved passage. All numbers are multiplied by 100 for better readability throughout the paper. EM: higher is better. Calibration errors: lower is better. The best calibration result in each group is in bold. Across all settings, temperature scaling significantly improves ECE but not MACROCE, highlighting the difference between these calibration metrics; OOD settings also incur higher calibration errors.

Table: Calibration results at different model accuracy levels. ECE and ICE results differ substantially at different accuracies (i.e., they are highly sensitive to accuracy), while MACROCE stays stable.

Table: Calibration results when training and test accuracy are different. In the first case, we tune the temperature value on a dev set with only 10% correct predictions and a test set with 90% correct predictions, and we reverse the setup in the second case. ECE and ICE change significantly under such accuracy shifts even though the underlying model is the same. In contrast, only MACROCE is stable under train-test accuracy shifts, as desired.

Table 2: Results of existing calibration methods (Section 5.1) as well as CONSCAL (Section 5.2). 'CONSCAL w/o Training Dynamics' is the ensemble-based method from Section 5.3. While some existing methods drastically reduce ECE, none of them significantly reduces MACROCE; on the other hand, CONSCAL sets a new state of the art on MACROCE both in-domain and out-of-domain (best results in bold). Note that different metrics give different rankings between methods, which further highlights the importance of using a reliable and informative metric.

Table: ICE_pos increases and ICE_neg decreases.
ECE with Equal-Width and Equal-Mass Binning. We compare measuring calibration with equal-width bucketed ECE and equal-mass bucketed ECE in Table 4. We find that both variants of ECE give similar results in all experimental settings, and both reach conclusions contrary to MACROCE (e.g., they both underestimate the calibration errors of temperature scaling in OOD settings).