A Close Look into the Calibration of Pre-trained Language Models

Pre-trained language models (PLMs) may fail to give reliable estimates of their predictive uncertainty. We take a close look into this problem, aiming to answer two questions: (1) Do PLMs learn to become calibrated in the training process? (2) How effective are existing calibration methods? For the first question, we conduct fine-grained control experiments to study the dynamic change in PLMs' calibration performance during training. We consider six factors as control variables: dataset difficulty, available training samples, training steps, the number of tunable parameters, model scale, and pretraining. We observe a consistent change in calibration performance across all six factors, and find that PLMs don't learn to become calibrated in training, evidenced by a continual increase in confidence regardless of whether the predictions are correct. We highlight that this finding somewhat contradicts two established conclusions: (a) larger PLMs are more calibrated; (b) pretraining improves model calibration. Next, we study the effectiveness of existing calibration methods in mitigating the overconfidence issue. Besides unlearnable calibration methods (e.g., label smoothing), we adapt and extend two recently proposed learnable methods that directly collect data to train models to have reasonable confidence estimates. Experimental results show that learnable methods significantly reduce PLMs' confidence in wrong predictions.


INTRODUCTION
Pre-trained language models (PLMs) achieve strong performance on many downstream tasks (Wang et al., 2019). But for reliable deployment in practice, their calibration performance should also be carefully examined (Vaicenavicius et al., 2019). Well-calibrated models assign confidence scores that truly reflect the outcome probability of their predictions. However, the confidence scores of existing deep neural networks cannot serve as reliable estimates of their uncertainty (Guo et al., 2017), and a deep understanding of PLMs' calibration is lacking.
In this paper, we give a systematic analysis of PLMs' calibration. We consider two questions about PLMs' calibration: (1) Do PLMs learn to become calibrated in the training process? (2) How effective are existing calibration methods? We first introduce the metrics we adopt for evaluating calibration performance. We consider the most widely used calibration metric, ECE (Naeini et al., 2015), which measures the difference between confidence and accuracy by partitioning samples into confidence zones. To give a more comprehensive and practical evaluation, we also take an application-driven perspective, quantifying two undesirable situations in practice: correct predictions being rejected due to low confidence, and wrong predictions being accepted due to high confidence. For the first question, we consider six factors that influence PLMs' calibration performance: dataset difficulty, available training samples, training steps, the number of tunable parameters, model scale, and pretraining. Some of these are overlooked in previous empirical studies (Snoek et al., 2019; Nixon et al., 2019; Minderer et al., 2021). This motivates us to conduct fine-grained control experiments that track the dynamic change in PLMs' calibration performance during training by manipulating each control variable. We empirically observe a consistent change in calibration performance across all six factors. All six factors influence how well PLMs fit the training distribution, which yields two states of PLMs with respect to calibration, namely the under-fitted and over-fitted states (see Fig. 1). In the under-fitted state, PLMs' performance and confidence increase at different speeds as PLMs fit the training distribution better. In the over-fitted state, PLMs' confidence continues to increase while performance remains the same. We find evidence that PLMs don't learn to become calibrated in training: their confidence in their predictions continues to increase as they fit the distribution better (e.g., with more tunable parameters or longer training).
This results in two miscalibration behaviors: (1) increasing ECE in the over-fitted state, and (2) continually increasing confidence in wrong predictions, indicating that PLMs mostly don't know "what they don't know".
We highlight that our finding contradicts two established conclusions: (a) larger PLMs show better calibration (Srivastava et al., 2022); (b) pretraining improves model calibration (Hendrycks et al., 2019b). We identify that the inconsistency lies in: (1) the difficulty of the evaluation datasets: performance doesn't saturate on the considered datasets (e.g., BIG-bench (Srivastava et al., 2022)), so the evaluation happens in the under-fitted state, leaving the miscalibration behavior of the over-fitted state unobserved; (2) the evaluation metrics: previous work doesn't measure the confidence in wrong predictions, overlooking the fact that models become more confident in wrong predictions when scaled larger or pretrained.
Thus, we find that the main issue of PLMs' calibration lies in their overconfidence in wrong predictions, which cannot be trivially solved by increasing the model scale or using pretraining. We therefore consider the effectiveness of existing calibration methods in mitigating this overconfidence issue. We partition existing calibration methods into unlearnable and learnable groups. Unlearnable methods heuristically manipulate the original confidence in predictions (e.g., label smoothing). Learnable methods directly collect data and train models to give reasonable confidence scores for their predictions. Namely, an extra calibration task is introduced, which extracts features from samples and the model's preceding predictions to predict whether the model's predictions are correct.
In our experiments, we identify the superiority of learnable methods over unlearnable ones in both in-distribution (ID) and out-of-distribution (OOD) settings. This is characterized by a sharp decrease in confidence in wrong predictions when using learnable methods, indicating that they significantly mitigate the overconfidence issue. Moreover, learnable methods maintain a reasonable increase in CErr_pos, holding consistent correlations between the drop in confidence and the drop in performance under distribution shifts. This differs from unlearnable methods, which take effect by roughly imposing confidence regularization on models' predictions (e.g., label smoothing), resulting in an increase in CErr_pos of almost the same magnitude as the decrease in CErr_neg.
To further understand learnable calibration methods, we consider the influence of more data and a larger model scale for the calibration task, the model adopted for the calibration task, and the data distribution on PLMs' calibration performance. We highlight three findings: (1) more data and a larger model scale for the calibration task both play significant positive roles in PLMs' calibration performance; (2) PLMs can be trained to express their uncertainty. This finding is consistent with concurrent work (Lin et al., 2022). Further, we extend this conclusion: we find that an extrinsic predictive model achieves comparable results given the same calibration training data. Thus, we identify that the success of this paradigm essentially lies in the learnable nature of the calibration task, rather than in the PLMs' self-checking process; (3) PLMs' calibration performance under distribution shifts depends on the chosen evaluation datasets. Previous work shows that PLMs exhibit degraded calibration performance under distribution shifts (Desai & Durrett, 2020). We find that this conclusion is reversed when the ID datasets are harder and PLMs achieve better performance on the OOD datasets. The concrete arguments and explanations are detailed in Sec. 5.5.

BACKGROUND
Calibration measure. We can visualize model calibration through reliability diagrams (DeGroot & Fienberg, 1983). Based on the diagram, we can measure the Expected Calibration Error (ECE; Naeini et al., 2015; Guo et al., 2017) by partitioning samples into different confidence zones. The central idea is to measure the absolute difference between models' predictive confidence and accuracy. Although alternative theoretically motivated metrics have been proposed (Vaicenavicius et al., 2019; Gupta et al., 2021), we employ ECE in our experiments due to its simplicity and popularity. Besides, we also provide an application-driven perspective to look into model calibration.
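As a reference, a minimal sketch of ECE with equal-mass binning (the variant used in our experiments) together with the CErr_pos/CErr_neg scores defined in the evaluation metrics section; the function names and `n_bins` default are illustrative:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=100):
    """ECE with equal-mass binning: sort samples by confidence, split
    them into n_bins equally sized bins, and average the absolute
    |accuracy - confidence| gap per bin, weighted by bin size."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    order = np.argsort(confidences)
    conf_sorted, corr_sorted = confidences[order], correct[order]
    ece = 0.0
    for idx in np.array_split(np.arange(len(confidences)), n_bins):
        if len(idx) == 0:
            continue
        gap = abs(corr_sorted[idx].mean() - conf_sorted[idx].mean())
        ece += gap * len(idx) / len(confidences)
    return ece

def cerr_pos_neg(confidences, correct):
    """CErr_pos = 1 - mean confidence on correct predictions;
    CErr_neg = mean confidence on wrong predictions."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    cerr_pos = 1.0 - confidences[correct].mean()
    cerr_neg = confidences[~correct].mean()
    return cerr_pos, cerr_neg
```

A calibrated model keeps the per-bin gaps small, which drives ECE toward zero.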
Benchmark & Analysis. Given appropriate evaluation metrics, large-scale benchmarks have been conducted to analyze model calibration under different settings, spanning model architectures (Guo et al., 2017; Minderer et al., 2021), model scales (Dan & Roth, 2021), modalities (Desai & Durrett, 2020; Minderer et al., 2021; Kadavath et al., 2022), calibration methods (Guo et al., 2017; Desai & Durrett, 2020), and distribution shifts (Nixon et al., 2019; Kong et al., 2020). However, previous benchmarks follow fixed training and evaluation paradigms. In this paper, we instead conduct a fine-grained and more comprehensive empirical evaluation to take a close look into PLMs' calibration from multiple dimensions that have often been overlooked by previous work.
Method. Calibration is essential for out-of-distribution detection (Hendrycks et al., 2019a), selective prediction (Varshney et al., 2022), robustness (Kumar et al., 2022), and pseudo-labeling (Rizve et al., 2021). Existing calibration methods can be partitioned into unlearnable and learnable groups. For unlearnable methods, there are mainly four categories. Post-hoc calibration intends to readjust the output logits referring to the performance on a held-out validation set (Platt et al., 1999; Guo et al., 2017). Regularization methods aim to prevent models from being over-confident on predictions (Szegedy et al., 2016; Pereyra et al., 2017). Data augmentation (Hendrycks et al., 2020; Wang et al., 2021) and model ensemble (Gal & Ghahramani, 2016; Lakshminarayanan et al., 2017) have also been empirically proven to improve model calibration. For learnable methods, the typical way is to first collect data for the calibration task, and then train a model to predict whether the given answer is correct. The model can be a multilayer perceptron, and the features can be hand-engineered (Ye & Durrett, 2022; Zhang et al., 2021b; Si et al., 2022) or the last hidden states of PLMs (Kadavath et al., 2022). PLMs can also be directly trained to express their uncertainty in words (Lin et al., 2022).

EVALUATION METRICS
For basic evaluation, we report accuracy (Acc) and the average confidence score (Conf) on the test set. For calibration evaluation, we report ECE using equal-mass binning and 100 bins, following Minderer et al. (2021). Besides, we provide an application-driven perspective for evaluating model calibration, aiming to quantify two unsatisfactory scenarios caused by miscalibration in practice: (1) correct predictions (positive) are rejected due to low confidence; (2) wrong predictions (negative) are accepted due to high confidence. Specifically, we consider the average confidence in correct predictions, Conf_pos, and in wrong predictions, Conf_neg, respectively. For unified comparison, we report two calibration error (CErr) scores, CErr_pos = 1 - Conf_pos and CErr_neg = Conf_neg. In principle, we expect calibrated models to have both low CErr_pos and low CErr_neg, indicating that they reasonably assign high confidence to correct predictions and low confidence to wrong predictions.

EXPERIMENTAL SETTING
We experiment with RoBERTa (Liu et al., 2019) and T5 (Raffel et al., 2020), since they represent two classic types of PLMs, namely encoder-only and encoder-decoder models. We experiment with four representative NLP tasks: sentiment analysis, natural language inference, news classification, and topic classification. For datasets, we choose SST-2 (Socher et al., 2013a), MNLI (Williams et al., 2018a), AG-News (Zhang et al., 2015), and Yahoo (Zhang et al., 2015), respectively. We employ the prompt-based learning paradigm (Liu et al., 2021) given its superior performance compared to traditional fine-tuning, especially in the few-shot setting. Specifically, we inherit the masked language modeling task from the pre-training stage and use templates to wrap samples into prompts. We fine-tune the whole PLM to fill in the [mask] position in the prompt. The manual template and verbalizer for each dataset are listed in Appendix A.
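The prompt-based setup can be sketched as follows; the template and verbalizer here are illustrative stand-ins for sentiment analysis (the actual ones per dataset are listed in Appendix A):

```python
# Illustrative cloze-style prompt wrapping for an SST-2-style
# sentiment task; real templates/verbalizers are in Appendix A.
TEMPLATE = "{text} It was [mask]."
VERBALIZER = {"positive": "great", "negative": "terrible"}

def wrap_prompt(text):
    """Wrap a raw sample into a prompt with a [mask] slot that the
    PLM fills via its masked language modeling head."""
    return TEMPLATE.format(text=text)

def verbalize(label):
    """Map a class label to the word the PLM should predict at [mask]."""
    return VERBALIZER[label]
```

The predicted class is then the label whose verbalizer word receives the highest probability at the [mask] position.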

EXPERIMENTAL RESULTS
We conduct a fine-grained controlled study to explore the influence of six factors: dataset difficulty, available training samples (Fig. 2), training steps (Fig. 3), the number of tunable parameters (Fig. 4 and Fig. 10), model scale (Fig. 5), and pretraining (Fig. 6). Due to the space limit, we show the corresponding results of RoBERTa, and of T5 on AG-News, in Appendix B. We summarize the overall conclusions here, and leave the detailed experimental settings and findings to Appendix B.
We note that all six factors dynamically influence PLMs' fitness on the training distribution, which we identify as the decisive factor in PLMs' calibration performance. We observe a consistent change in calibration performance across the six factors, resulting in two states of PLMs (see Fig. 1) in training:
Under-fitted state. In this state, PLMs' performance and confidence increase at different speeds when more fitted on the training distribution. In principle, miscalibration is due to the mismatch between performance and confidence. However, we look closely into some critical points where ECE changes sharply (e.g., Fig. 2), and empirically find that the increase or decrease in ECE can be estimated by comparing the increasing rates of PLMs' performance and confidence. We observe that a larger (smaller) increasing rate in performance reduces (increases) ECE. Thus, high ECE can be partially attributed to PLMs' relatively rapid growth in confidence with performance lagging behind.
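Under the typical overconfident regime (confidence above accuracy), ECE is approximately conf − acc, so the direction of its change can be read off the two growth rates. A toy sketch of this heuristic (the function name and the overconfidence assumption are ours):

```python
def ece_direction(d_acc, d_conf):
    """Assuming confidence exceeds accuracy (the overconfident regime,
    where ECE is approximately conf - acc), ECE rises when confidence
    grows faster than accuracy, and falls otherwise."""
    delta = d_conf - d_acc  # approximate change in ECE
    return "increase" if delta > 0 else "decrease"
```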
Over-fitted state. In this state, PLMs' performance shows no substantial difference thanks to their generalization ability (Zhang et al., 2021a). However, PLMs' confidence continues to increase, resulting in increasing ECE. Thus, fitting the training distribution better may negatively affect PLMs' calibration. In addition, due to the variance of ECE in this state, which is especially pronounced when more training steps and tunable parameters are introduced (e.g., Fig. 3 and Fig. 4), the evaluation of calibration performance may be sensitive to the training paradigm. This indicates that previous conclusions drawn from empirical studies should be carefully examined, since the training paradigms may differ across model architectures and calibration methods.
Given the two states observed, we conclude that PLMs don't learn to become calibrated in training, evidenced by the continually increasing confidence in predictions, whether correct or not, throughout the fitting process. Specifically, this results in two miscalibration behaviors: (1) increasing ECE in the over-fitted state; (2) a consistent increase in CErr_neg throughout the whole training process. The latter is an undesirable property in practice, since users may accept wrong predictions due to their high confidence, and it indicates that PLMs mostly don't know "what they don't know".
We highlight two of the considered factors, namely model scale and pretraining (Fig. 5 and Fig. 6), which have been examined in previous work. Our findings partially contradict the established conclusions: (1) larger PLMs show better calibration (Srivastava et al., 2022); (2) pretraining improves model calibration (Hendrycks et al., 2019b). Actually, scaling larger and employing pretraining are both strategies that increase PLMs' capacity, making them more fitted on the training distribution, so our general conclusion also applies. We highlight two observations: (1) essentially, the influence of scaling and pretraining on PLMs' calibration is dynamically determined by the relative increase in performance and confidence, which is highly dependent on the chosen evaluation datasets. For example, the original scaling experiments are conducted on BIG-bench (Srivastava et al., 2022), where performance is far from saturation and increasing the model scale brings substantial improvement to PLMs' performance. This is consistent with the identified under-fitted state. However, when the performance score saturates on the evaluation datasets at a certain scale, scaling larger only raises confidence, which increases ECE due to the mismatch between the two trends (e.g., T5 and RoBERTa on Yahoo); (2) scaling larger and employing pretraining consistently raise CErr_neg. This indicates that these two strategies don't enable PLMs to learn to become calibrated in the training process.

CALIBRATION METHODS
We choose representative calibration methods from each category summarized in Sec. 2. For unlearnable methods, we consider vanilla fine-tuning (Vanilla), temperature scaling (TS) (Guo et al., 2017), label smoothing (LS) (Szegedy et al., 2016), easy data augmentation (EDA) (Wei & Zou, 2019), and deep ensemble (Ensemble) (Lakshminarayanan et al., 2017). For learnable methods, an extra calibration task is introduced, aiming to train a model to predict whether the original predictions are correct. Each sample in the calibration task's dataset consists of the original input or the hidden states, the original prediction, and a label indicating whether the prediction is correct.
We adopt the validation set as the training set for the calibration task.
For clarity, we use the main task to denote the original task. The predictive model for the calibration task can be a separate extrinsic model, which we denote with the prefix "E-". Specifically, we adapt the method proposed by Kadavath et al. (2022) that uses an MLP as the extrinsic model (E-MLP), with the hidden states of the main task model as inputs. Based on a similar intuition, we extend this method by using an extra T5 as the extrinsic model (E-T5). An example of the template used to wrap a sample into an input prompt is: "<original input>, the model prediction is <prediction>, is the prediction True or False? It's <mask>." The concrete manual templates and verbalizers of the calibration task for each dataset are listed in Table 8.
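Constructing the calibration-task dataset can be sketched as follows, using the example template from the text; the function names and the list-of-tuples data format are illustrative:

```python
def wrap_calibration_prompt(original_input, prediction):
    """Wrap a main-task sample and its prediction into a calibration
    prompt, following the example template in the text (the concrete
    per-dataset templates live in Table 8)."""
    return (f"{original_input}, the model prediction is {prediction}, "
            f"is the prediction True or False? It's <mask>.")

def build_calibration_dataset(inputs, predictions, gold_labels):
    """Each calibration sample pairs a wrapped prompt with a True/False
    target indicating whether the main-task prediction was correct."""
    data = []
    for x, pred, gold in zip(inputs, predictions, gold_labels):
        target = "True" if pred == gold else "False"
        data.append((wrap_calibration_prompt(x, pred), target))
    return data
```

The calibrator's confidence is then read off the probability it assigns to "True" at the <mask> position.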
Besides, the main task model can also be directly employed to perform the calibration task. We call this paradigm intrinsic, denoted "I-". Lin et al. (2022) show that GPT-3 (Brown et al., 2020) can be trained to express its uncertainty in words. We adapt this method by first training the model on the main task data, and then continuing training on the calibration task data (I-Vanilla). However, this continual learning paradigm may degrade performance on the main task. To tackle this, we propose two more practical intrinsic calibration methods by modifying the training paradigm. Specifically, we train PLMs iteratively (I-Iter) or simultaneously (I-Simul) on the original task and the calibration task. The latter is possible thanks to the unified text-to-text training paradigm. The input is the same as for E-T5.
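The three intrinsic training paradigms amount to different schedules over the two tasks' (prompt, target) pairs; a minimal sketch (the data format and function names are illustrative):

```python
import random

def i_vanilla_schedule(main_data, calib_data):
    """I-Vanilla: finish main-task training first, then continue
    training on the calibration task (risks degrading the main task)."""
    return [("main", main_data), ("calibration", calib_data)]

def i_iter_schedule(main_data, calib_data, n_rounds=3):
    """I-Iter: alternate passes over the two tasks for n_rounds."""
    schedule = []
    for _ in range(n_rounds):
        schedule += [("main", main_data), ("calibration", calib_data)]
    return schedule

def i_simul_stream(main_data, calib_data, seed=0):
    """I-Simul: since both tasks are cast as text-to-text, their
    (prompt, target) pairs can be shuffled into one training stream."""
    mixed = list(main_data) + list(calib_data)
    random.Random(seed).shuffle(mixed)
    return mixed
```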

EXPERIMENTAL SETTING
We experiment with both in-distribution (ID) and out-of-distribution (OOD) settings. We consider natural language inference, sentiment analysis, and hate-speech detection tasks due to their well-established OOD datasets in NLP. Specifically, we choose MNLI (HANS, ANLI), Amazon (SST-5, SemEval), and Civil (Hate Speech, Implicit Hate) as the ID (OOD) datasets. The details of the chosen datasets for ID and OOD evaluation are described in Appendix A.

EXPERIMENTAL RESULTS
The results are listed in Table 1 (T5) and Table 9 (RoBERTa). We summarize the overall conclusions as follows: all calibration methods have negligible influence on PLMs' performance in the ID and OOD settings except I-Vanilla. However, PLMs are significantly less calibrated under the considered distribution shifts, especially on challenging datasets, due to the severe mismatch between performance and confidence, consistent with Desai & Durrett (2020). Nevertheless, the conclusion that PLMs are calibrated on ID data (Desai & Durrett, 2020) is questionable given our answer to the first question (see Sec. 4). The low ECE can be attributed to their high performance on ID datasets and to their consistently assigning high confidence scores to their predictions. We further show in Sec. 5.5 that the conclusion that PLMs' calibration degrades under distribution shifts is one-sided.
Unlearnable methods. We summarize the findings as follows: (1) data augmentation and model ensembling don't bring substantial benefits to PLMs' calibration, considering the three calibration metrics across all evaluation datasets and both PLMs. The reason lies in their inability to relieve the overconfidence issue, resulting in the same CErr_neg as vanilla fine-tuning; (2) TS achieves the overall best ECE, remaining a strong baseline, with LS being the second most effective unlearnable method. This is consistent with previous empirical studies (Nixon et al., 2019). However, we observe an increase in CErr_pos of almost the same magnitude as the decrease in CErr_neg. The reason is that these two methods directly impose confidence regularization on predictions, which doesn't actually give PLMs clear confidence estimates.
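As a reference, minimal sketches of the two strongest unlearnable methods; the grid range, `eps` default, and data shapes are illustrative, and TS is typically fit by gradient descent on the validation NLL rather than the grid search used here for simplicity:

```python
import numpy as np

def softmax(logits, T=1.0):
    z = np.asarray(logits, float) / T
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_temperature(val_logits, val_labels, grid=np.linspace(0.5, 5.0, 91)):
    """Temperature scaling: choose the single scalar T minimizing the
    negative log-likelihood on a held-out validation set. T > 1 softens
    overconfident predictions without changing the argmax (accuracy)."""
    labels = np.asarray(val_labels, int)
    best_T, best_nll = 1.0, np.inf
    for T in grid:
        probs = softmax(val_logits, T)
        nll = -np.log(probs[np.arange(len(labels)), labels] + 1e-12).mean()
        if nll < best_nll:
            best_T, best_nll = T, nll
    return best_T

def label_smoothing_targets(labels, num_classes, eps=0.1):
    """Label smoothing: put (1 - eps) on the gold class and spread eps
    uniformly over all classes, penalizing overconfident predictions."""
    onehot = np.eye(num_classes)[np.asarray(labels, int)]
    return (1.0 - eps) * onehot + eps / num_classes
```

Both act uniformly on the output distribution, which is why they trade CErr_neg for CErr_pos rather than learning per-sample confidence.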
Learnable methods. Compared to unlearnable methods, learnable ones significantly mitigate the overconfidence issue, reflected in the sharp decrease in CErr_neg, indicating that they output very low confidence for wrong predictions. But we also observe that learnable methods lower the confidence in correct predictions, resulting in increasing CErr_pos and ECE. However, we highlight two observations indicating that learnable methods essentially teach models clear confidence estimation, instead of roughly reducing confidence like LS: (1) compared to the vanilla version, the increase in CErr_pos is significantly smaller than the decrease in CErr_neg, especially on ID samples; (2) learnable methods give clearly lower confidence on OOD samples, and the average confidence drop is strongly correlated with the performance drop under distribution shifts. Thus, the low confidence and relatively higher CErr_pos and ECE on OOD samples may be reasonable.
Further, we give a detailed analysis of extrinsic and intrinsic learnable methods, and compare our extended calibration methods with previous ones: (1) for extrinsic methods, the extended E-T5 exhibits significantly better calibration performance than the adapted E-MLP in terms of mitigating the overconfidence issue. The essential difference lies in the extrinsic model used for the calibration task: we find that using a larger-capacity model as the extrinsic calibrator shows the same trend as shifting from vanilla fine-tuning to learnable methods. We further study this scaling effect in Sec. 5.4; (2) for intrinsic methods, the three training paradigms don't show substantial differences in calibration performance, and none of them consistently achieves the best performance on all datasets. However, our methods (I-Iter and I-Simul) address the degraded-performance issue of I-Vanilla and make the main task performance match vanilla fine-tuning; (3) interestingly, there is no substantial difference between the extrinsic E-T5 method and the intrinsic methods, given the same base architecture (e.g., T5). This finding leads us to reconsider the conclusion of Lin et al. (2022) that PLMs can be trained to express their uncertainty in words. Given the comparable performance between intrinsic and extrinsic methods, we extend this conclusion: the success of this paradigm essentially lies in the learnable nature of the calibration task, rather than in the self-checking process of PLMs.

EMERGENT CALIBRATION
In Sec. 5.3, we identify the potential of learnable methods. However, a detailed exploration of learnable calibration methods is still lacking. We conduct experiments to study the influence of two important factors, namely the dataset size and the model scale for the calibration task, on PLMs' calibration. Note that the model scale in this section refers to the model adopted for the calibration task, not the main task.
Dataset size. Table 2 shows the results for different sizes of the calibration dataset. Two basic findings are: (1) the five learnable methods show a consistent trend when increasing the dataset size, indicating that the essence of these methods is the same; (2) the size of the calibration training dataset doesn't substantially influence PLMs' performance on the main task.
Beyond these findings, we observe a sharp difference in calibration performance when increasing the dataset size from small to middle. The trend is overall consistent with the one observed when shifting from vanilla fine-tuning to learnable calibration methods: (1) for ID samples, we observe a sharp decrease in CErr_neg with relatively little negative influence on ECE and CErr_pos; (2) for OOD samples, CErr_pos and ECE increase significantly with the dataset size. Given the arguments in Sec. 5.3, we conclude that PLMs' calibration performance improves when trained on larger calibration datasets. Besides, we observe no further improvement in calibration performance when increasing the dataset size from middle to large. This is consistent with standard task training, where increasing the dataset size doesn't improve performance beyond a critical point.
Model scale. Table 3 shows the results of increasing the model scale. Two basic findings are: (1) the five learnable methods still show a consistent trend when scaling larger; (2) we observe a consistent confidence increase in the scaling process. This trend is similar to the one observed in Sec. 4, where increasing capacity makes PLMs more confident.
Surprisingly, although confidence continues to increase, for ID samples we observe a consistent decrease in CErr_pos with negligible influence on ECE and CErr_neg when scaling larger. The reason is that the dataset for the calibration task is collected in-distribution. Thus, provided enough ID samples for calibration training, scaling larger enables models to better learn the calibration task, ensuring better calibration performance on ID samples. For OOD samples, we don't observe a consistent trend due to the influence of various factors. Specifically, when the trained calibrator is used out of the box on OOD samples, the problem of distribution shifts reappears in the introduced calibration task. Whether scaling the calibration task model larger improves calibration performance under distribution shifts is determined by many factors (e.g., the dataset difficulty, the overconfidence issue in the calibration task itself). We leave this for future exploration.

FURTHER ANALYSIS OF DISTRIBUTION SHIFTS
In Sec. 5.3, we show that PLMs are less calibrated under distribution shifts, consistent with previous work (Desai & Durrett, 2020; Minderer et al., 2021). However, can we safely conclude that distribution shifts degrade PLMs' calibration performance? We study hard-to-easy distribution shifts to further investigate the essence of this problem. In this setting, models are trained on a difficult ID dataset and evaluated on easier OOD datasets, which comes with relatively lower ID and higher OOD performance. Specifically, we consider the sentiment analysis task and choose Dynasent (Amazon and DSC) as the ID (OOD) datasets. The details of the datasets are described in Appendix A.
The results of T5 and RoBERTa are shown in Table 4 and Table 10, respectively. We observe completely different results from Sec. 5.3. Across all methods, ECE and CErr_pos decrease under hard-to-easy distribution shifts, contradicting the previous conclusion that PLMs are less calibrated on OOD samples. In hard-to-easy shifts, performance and confidence both increase due to the relative simplicity of the OOD samples. This indicates that PLMs' relative calibration performance on ID and OOD samples depends on dataset difficulty, and that the conclusion that PLMs are less calibrated under distribution shifts is one-sided. This is consistent with our empirical study in Sec. 4, which emphasizes the influence of dataset difficulty on PLMs' calibration.
To further investigate the influence of dataset difficulty on PLMs' calibration performance, we evaluate PLMs trained on ID datasets of different difficulty (e.g., SST-2 and Yahoo) on task-irrelevant inputs. The task-irrelevant inputs include plain texts (e.g., BookCorpus) and random words. Since no gold labels are available, we measure calibration through maximum confidence scores and predictive entropy. The results of T5 are shown in Table 5, and those of RoBERTa in Table 11. We show that PLMs have unreasonably high confidence in task-irrelevant inputs, especially when trained on SST-2.
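With no gold labels available, the two statistics can be computed directly from the model's output distribution; a minimal sketch (function name is ours):

```python
import numpy as np

def uncertainty_stats(logits):
    """For unlabeled (e.g., task-irrelevant) inputs: report the mean
    maximum confidence and mean predictive entropy. A well-behaved
    model should be close to uniform on inputs unrelated to its task,
    i.e., low max confidence and high entropy."""
    z = np.asarray(logits, float)
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    probs = e / e.sum(axis=-1, keepdims=True)
    max_conf = probs.max(axis=-1).mean()
    entropy = (-probs * np.log(probs + 1e-12)).sum(axis=-1).mean()
    return max_conf, entropy
```

For a k-class model, the uniform distribution gives max confidence 1/k and entropy log k, the reference point for task-irrelevant inputs.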
Comparing the results when trained on SST-2 versus Yahoo, we find that the ID training dataset has a significant influence on PLMs' calibration. Again, this can be attributed to dataset difficulty. We also observe the superior performance of learnable calibration methods: they assign lower confidence to plain texts and random words than unlearnable ones.
In summary, the influence of distribution shifts on PLMs' calibration depends on the chosen evaluation datasets. The original conclusion that calibration performance degrades on OOD samples rests on two premises: (1) PLMs are overconfident in their wrong predictions, which our experiments support; (2) the OOD datasets are harder, so PLMs cannot achieve good performance on them. The second premise is not always satisfied, and we show that the relative dataset difficulty significantly influences PLMs' calibration performance on ID and OOD samples.

CONCLUSION
In this paper, we take a close look into PLMs' calibration. We conduct an empirical study aiming to answer two questions: (1) Do PLMs learn to become calibrated in the training process? (2) How effective are existing calibration methods? Besides findings that support existing conclusions, we also provide extensions to, or arguments against, some established conclusions.

LIMITATIONS AND FUTURE WORK
In this paper, we propose three simple calibration methods extended from existing ones. In our experiments, we evaluate the calibration performance of existing methods and ours. We assume a large held-out validation set that can serve as the training dataset for the calibration task, and demonstrate the effectiveness of learnable calibration methods in this ideal situation. In practice, however, we need to decide how to allocate the data between the main task and the calibration task given limited training samples.
In the future, we plan to study the more practical scenario of a fixed budget of training samples, and to propose a training framework that makes the most of the training samples for both the main task and the calibration task.

A DATASETS
In this section, we describe the datasets adopted in our experiments, organized by task. The dataset statistics are shown in Table 6. The manual templates and verbalizers are presented in Table 7.
Sentiment analysis. Amazon, as the dataset is referred to throughout the paper, is a sentiment analysis dataset of reviews on fine foods from Amazon. Due to its enormous size, we sample 10k samples per class. SemEval 2016 Task 4 (Nakov et al., 2013) is the sentiment-analysis-in-Twitter task; we consider Subtask A, where all tweets are labeled as negative, neutral, or positive. Dynasent (Potts et al., 2021) is a challenging, dynamically evolving dataset constructed with human-in-the-loop efforts. We merge the data of round 1 and round 2 in our experiments.
Natural language inference. MNLI (Williams et al., 2018b) is a crowd-sourced multi-genre natural language inference corpus. ANLI (Nie et al., 2020) is an adversarial NLI dataset, created by an iterative (three rounds in total) human-and-model-in-the-loop procedure. We merge the data from all three rounds in our experiments.
Topic classification. Yahoo Answers Topics (Zhang et al., 2015) contains 10 categories of questions and their corresponding answers from the Yahoo! Webscope program. For each sample, the title and content of the question are concatenated with the best answer as one text, and the topic category is used as the label. Since the original training dataset is extremely large (1.4 million samples in total), we randomly sample 140,000 samples for simplicity. AG News (Zhang et al., 2015) is a corpus of news articles consisting of 4 classes: World, Sports, Business, and Science/Technology. For each article, we construct the text by concatenating the title and description.
Toxic detection. Civil Comments is collected from the Civil Comments platform. Each comment is annotated with a float toxicity score ranging from 0 to 1. Following the official instructions, we assign label 0 to samples with a toxicity score below 0.5 and label 1 otherwise. Hate Speech (de Gibert et al., 2018), arguably the most popular dataset in toxic detection, is collected from Stormfront, a large forum of white nationalists. The test set we use is sampled by the authors in the official GitHub repository. Implicit Hate (ElSherief et al., 2021) consists of hate tweets from extremist groups in the US. Notably, part of the hate tweets is implicit, employing subtle tricks to conceal the toxicity and evade keyword detection.
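The 0.5 thresholding on Civil Comments toxicity scores can be sketched as follows (a minimal illustration; the actual dataset schema and loading code may differ):

```python
def binarize_toxicity(samples, threshold=0.5):
    """Map each (text, float toxicity score) pair to a binary label:
    0 if the score is below the threshold, 1 otherwise."""
    return [(text, 0 if score < threshold else 1) for text, score in samples]
```

This mirrors the official instruction of treating scores below 0.5 as non-toxic.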
Plain text. BookCorpus (Zhu et al., 2015) is a large collection of free novels and is thus used in the pre-training stage of pre-trained language models. We sample 10k texts for evaluation. Random Words contains 1k meaningless texts, each synthesized by concatenating 20 random words.
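The Random Words construction described above can be sketched in a few lines of Python (the vocabulary and seed here are placeholders, not those used in our experiments):

```python
import random

def make_random_texts(vocab, n_texts=1000, words_per_text=20, seed=0):
    """Synthesize meaningless texts, each a concatenation of random vocabulary words."""
    rng = random.Random(seed)
    return [" ".join(rng.choice(vocab) for _ in range(words_per_text))
            for _ in range(n_texts)]
```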

B ADDITIONAL RESULTS OF CONTROL EXPERIMENTS
For the empirical control study on the influence of six factors on PLMs' calibration, we provide additional experimental results. The results of T5-base on AG News are shown in Fig. 7, Fig. 8, Fig. 9, and Fig. 10. The results of RoBERTa-base are shown in Fig. 11, Fig. 12, Fig. 13, and Fig. 14.
Available training samples. We adopt K-shot learning, where K is the number of samples per class. We run each K five times on each dataset and report the average performance due to the potential variance in the few-shot setting. In this dimension, we additionally find that the trends in average confidence differ between the two model architectures. While T5 shows an obvious confidence drop in the early stage, the confidence of RoBERTa seems to increase continually with the number of available training samples. This can be partially explained by the stronger few-shot adaptation of RoBERTa, since we observe that the performance of RoBERTa is significantly higher in extreme cases (e.g., K = 1, 2, 4).
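Per-class K-shot subsampling, as used above, can be sketched as follows (a generic implementation, not the exact script used in our experiments; re-running with different seeds gives the five repetitions):

```python
import random
from collections import defaultdict

def sample_k_shot(dataset, k, seed=0):
    """Sample k examples per class from an iterable of (text, label) pairs."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in dataset:
        by_label[label].append((text, label))
    subset = []
    for label in sorted(by_label):
        subset.extend(rng.sample(by_label[label], k))
    rng.shuffle(subset)
    return subset
```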
Training dynamics. We decompose the whole training process into steps and measure the five metrics at fixed intervals. In this dimension, the conclusion is consistent with the general one.
Number of tunable parameters. To quantitatively explore the influence of the number of tunable parameters on PLMs' calibration, we employ parameter-efficient tuning methods in NLP (Houlsby et al., 2019; Zaken et al., 2022; Ding et al., 2022). Specifically, we adopt Soft-prompt (Lester et al., 2021) and Adapter (Houlsby et al., 2019) tuning due to their simplicity, stability, and practicality. We experiment with various numbers of soft tokens and bottleneck dimensions of the inserted adapter modules. Only the parameters of the soft tokens and adapter modules are tunable.
We summarize the extra findings as follows: (1) Soft-prompt and Adapter tuning show different trends across the four datasets. (2) For Soft-prompt tuning, the model performance and confidence increase continually with more tunable parameters. The increasing rates are nearly matched, so ECE decreases continually. A negative effect is the increase in CErr_neg due to overconfidence on wrong predictions. This is consistent with the trend we observed in the under-fitting stage. (3) The situation in Adapter tuning is different from Soft-prompt tuning: increasing capacity cannot bring substantial performance gains, due to the already strong capacity of Adapter. However, the overall confidence continues to increase with more capacity, resulting in increasing ECE and CErr_neg while the performance stays constant. This is consistent with the trend we observed in the over-fitting stage. (4) The implication of the experimental results is that blindly increasing the number of tunable parameters does not improve calibration. Among the learnable methods, I-Iterative and I-Multitask train the model iteratively or simultaneously on the original task and the calibration task, respectively.
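For clarity on the metrics discussed above, a minimal sketch of binned ECE and of confidence on wrong predictions follows. We read CErr_neg here as the average confidence over wrong predictions; this reading is an assumption, not the paper's exact formula:

```python
def ece(confidences, correct, n_bins=10):
    """Expected Calibration Error with equal-width confidence bins:
    weighted average |avg confidence - accuracy| over non-empty bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    error = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        error += len(b) / total * abs(avg_conf - accuracy)
    return error

def cerr_neg(confidences, correct):
    """Average confidence over wrong predictions (assumed reading of CErr_neg)."""
    wrong = [c for c, ok in zip(confidences, correct) if not ok]
    return sum(wrong) / len(wrong) if wrong else 0.0
```

Under this sketch, rising confidence on wrong predictions directly drives CErr_neg up even when accuracy is unchanged.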

C DETAILS OF EVALUATION SETTINGS
Hard-to-easy shift. We choose Dynasent as the in-distribution dataset, and Amazon and DSC as the out-of-distribution datasets. The evaluation metrics are the same as in the experiments on standard OOD shifts. This evaluation setting is expected to test the conclusion that PLMs' calibration performance degrades under distribution shifts.
Calibration on task-irrelevant inputs. We choose SST-2 and Yahoo as the in-distribution datasets, and BookCorpus and a synthetic dataset as the out-of-distribution datasets. Specifically, each sample in the synthetic dataset is constructed by composing random words. Well-calibrated PLMs should give very low confidence and high predictive entropy on task-irrelevant inputs.
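The predictive entropy used in this setting is computed from the model's output distribution; a minimal sketch (in nats, assuming a normalized probability vector over the classes):

```python
import math

def predictive_entropy(probs):
    """Shannon entropy of a predicted class distribution (in nats).
    High entropy indicates the model spreads probability across classes."""
    return -sum(p * math.log(p) for p in probs if p > 0)
```

A uniform distribution attains the maximum entropy log(K) for K classes, which is the desired behavior on task-irrelevant inputs; a one-hot (fully confident) prediction has entropy 0.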

D ADDITIONAL RESULTS OF CALIBRATION METHODS
For the experiments exploring the effectiveness of existing calibration methods, we provide results with RoBERTa in Table 9, Table 10, and Table 11. In Table 11 (results on task-irrelevant inputs with RoBERTa), we do not report the entropy results of learnable methods when Yahoo is adopted as the ID dataset, since the class numbers differ between unlearnable (10 original classes in Yahoo) and learnable methods (2 classes), which would result in an unfair comparison.