An Overview of Uncertainty Calibration for Text Classification and the Role of Distillation

Recent advances in NLP systems, notably the pretraining-and-finetuning paradigm, have achieved great success in predictive accuracy. However, these systems are usually not well calibrated for uncertainty out-of-the-box. Many recalibration methods have been proposed in the literature for quantifying predictive uncertainty and calibrating model outputs, with varying degrees of complexity. In this work, we present a systematic study of several of these methods. Focusing on the text classification task and finetuned large pretrained language models, we first show that many of the finetuned models are not well calibrated out-of-the-box, especially when the data come from out-of-domain settings. Next, we compare the effectiveness of a few widely-used recalibration methods (such as ensembles and temperature scaling). Then, we empirically illustrate a connection between distillation and calibration: we view distillation as a regularization term encouraging the student model to output uncertainties that match those of a teacher model. With this insight, we develop simple recalibration methods based on distillation with no additional inference-time cost. We show on the GLUE benchmark that our simple methods can achieve competitive out-of-domain (OOD) calibration performance compared with more expensive approaches. Finally, we include ablations to understand the usefulness of components of our proposed method and examine the transferability of calibration via distillation.


Introduction
The recent success of NLP systems, notably the pretraining-and-finetuning paradigm, has led to widespread applications (Peters et al., 2018; Devlin et al., 2019; Radford et al., 2019). However, these systems are not always well-calibrated; in many high-stakes decision-making scenarios such as medical diagnosis, even small errors can cause great harm. For example, suppose an ML system predicts a 20% probability that a patient has cancer whereas the reality is 40%; a diagnosis relying on such inaccurate estimates could lead to devastating consequences (Kumar et al., 2019). Further, interpreting and communicating these uncertainties facilitates better trust between humans and ML systems (Bansal et al., 2020; Wilder et al., 2020; Ribeiro et al., 2016, 2018). Hence, it is increasingly important for users to understand not only when the systems would succeed, but also when they could fail. One seemingly straightforward approach is to have the systems output predictions together with some measure of their confidence/uncertainty. Users could then use both the predictions and the associated uncertainties to decide how much to trust a prediction. For example, one might decide to take an umbrella to work only if the confidence of the rain prediction exceeds 50%. For many statistical methods, confidence/uncertainty is either part of the system by design (e.g., Bayesian methods) or can be efficiently estimated (e.g., linear regression). Unfortunately, for large-scale DNNs, estimating uncertainty becomes a challenge (Gal, 2016): e.g., nominal probabilities from the softmax function have been shown to be uncalibrated estimates of model uncertainty (Platt, 1999; Niculescu-Mizil and Caruana, 2005; Guo et al., 2017; Ovadia et al., 2019).
In this work, we present a systematic study on recalibrating current NLP systems, particularly those that fall in the recent popular pretraining-and-finetuning paradigm (Hendrycks et al., 2020; Desai and Durrett, 2020): they are widely deployed in recent state-of-the-art systems, so it is important that they are well calibrated for safety and transparency. However, the methods discussed in this work generalize to a broader range of systems. We focus on calibration not only on the task itself, but also under dataset distribution shift (Ovadia et al., 2019).
We start by introducing uncertainty and calibration, and cover related advances in the deep learning literature. In addition to the widely-used maximum calibration error and expected calibration error, we follow previous works (Ovadia et al., 2019; Kumar et al., 2019) and include additional calibration evaluation metrics for better comparison (e.g., Brier scores and the ℓp calibration error).
We conduct experiments on GLUE classification tasks (Wang et al., 2019) and show that finetuned language models are usually not calibrated out-of-the-box, especially when the data come from a distribution different from the training data. We use the term "out-of-domain" (or "out-of-distribution", OOD) to refer to the setting where the train and evaluation data come from different "distributions"; related works have considered data from similar tasks but different datasets as OOD (Ovadia et al., 2019; Hendrycks and Gimpel, 2017). Next, in order to make models more calibrated, we study some of the widely-used recalibration methods, which have varying degrees of effectiveness and computational cost. For example, ensembling models has been shown to be very effective in out-of-domain settings (Ovadia et al., 2019), but the cost of computation scales with the size of the ensemble. On the other hand, distillation (Hinton et al., 2015) is a widely-known method for improving a system's performance by learning from a stronger teacher model. In this work, we empirically examine the connection between distillation and calibration. Notably, we view the objective function of distillation as a regularization term that encourages the student model to match the predictive uncertainty of a stronger, more calibrated teacher model.
We conduct analysis experiments to show that the teacher's calibration performance can be distilled into the student model, even when the teacher model's accuracy remains similar. With this insight, we show that simple methods based on distillation achieve competitive performance in out-of-domain calibration without introducing extra computation at inference time. Finally, we conduct ablation experiments to understand the usefulness of the method's components. In summary, our contributions are as follows:
• We present a systematic study of the performance of various recalibration methods on finetuned language models in both in-domain and out-of-domain settings.
• We empirically examine the connection between distillation and calibration, and conduct experiments showing that distillation can distill calibration performance.
• We describe two simple recalibration methods, and experimental results demonstrate their competitiveness in out-of-domain settings; finally, we ablate the method's components and measure the extent to which distillation transfers the teacher's calibration improvements.

Background and Related Works
Due to space constraints, we present only the most relevant material in the main paper; please see the appendix (Sec. A) for extended background and related works. The quality of an uncertainty measurement is usually assessed via calibration (Kendall and Gal, 2017). In the context of calibration, the uncertainties often refer to predictive probabilities. A model is calibrated if its predictive probabilities match the empirical frequency of the data (Gal, 2016). Let Ŷ and P̂ be the predicted class and its associated confidence output by a neural network. We would like the confidence estimates P̂ to be calibrated, which intuitively means that P̂ should represent true probabilities (Guo et al., 2017):

P(Ŷ = Y | P̂ = p) = p, for all p ∈ [0, 1]. (1)

For example, suppose a classification model is given N input examples and makes predictions ŷ_1, ..., ŷ_N, each with confidence p̂ = 0.35; we would then expect 35% of the predictions to be correct. The problem of uncertainty/confidence calibration and confidence scores has been studied and applied in various settings, such as structured prediction (Kuleshov and Liang, 2015), online recalibration (with potentially adversarial/OOD input) (Kuleshov and Ermon, 2017), model regularization (Pereyra et al., 2017), and the detection of misclassified/OOD examples (Hendrycks and Gimpel, 2017). In practice, however, perfect calibration is almost impossible (Guo et al., 2017), and estimating the left-hand side of Eq. 1 is not straightforward with finite samples, because in most cases P̂ is a continuous random variable (Guo et al., 2017; Kumar et al., 2019). In Sec. 3, we describe ways to estimate calibration performance.
It has been widely observed that modern neural networks are usually not calibrated out of the box (Platt, 1999; Zadrozny and Elkan, 2001; Guo et al., 2017; Ovadia et al., 2019). Recalibration methods improve calibration by transforming uncalibrated outputs into calibrated outputs/probabilities; they include scaling-based methods (Platt, 1999; Guo et al., 2017), histogram-binning-based methods (Guo et al., 2017; Zadrozny and Elkan, 2001), and ensembles (Lakshminarayanan et al., 2017). Recently, Kumar et al. (2019) proposed the scaling-binning calibrator and a more sample-efficient estimator of calibration error. In our work, we describe simple approaches that combine the strengths of ensembles and temperature scaling without introducing extra computation at inference time; we further apply the scaling-binning calibrator to ensure calibration.
Ensemble-based methods work by aggregating multiple networks trained independently on the entire dataset, and have been shown to achieve strong performance in out-of-domain calibration (Ovadia et al., 2019; Lakshminarayanan et al., 2017). More generally, there are randomization-based ensembles and boosting-based ensembles. Among randomization-based ensembles, we use the entire training dataset to train each model, instead of training on different bootstrap samples of the original training set (Lakshminarayanan et al., 2017).
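Concretely, this style of ensemble prediction amounts to averaging the softmax outputs of independently trained members. A minimal sketch (function and variable names are ours, not from the paper):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ensemble_predict(logits_per_model):
    """Average the softmax probabilities of independently trained models.

    logits_per_model: array of shape (n_models, n_examples, n_classes).
    Returns the ensemble's predictive distribution, (n_examples, n_classes).
    """
    probs = softmax(np.asarray(logits_per_model))
    return probs.mean(axis=0)

# Two hypothetical models that disagree on one example: the ensemble's
# confidence is pulled toward 0.5, reflecting higher uncertainty.
logits = np.array([[[2.0, 0.0]],   # model 1: confident in class 0
                   [[0.0, 2.0]]])  # model 2: confident in class 1
p_ens = ensemble_predict(logits)
```

Averaging probabilities (rather than logits) is the aggregation used by deep ensembles (Lakshminarayanan et al., 2017); disagreement between members naturally lowers the ensemble's confidence.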
Temperature scaling is an extension of Platt scaling (Guo et al., 2017). It uses a single scalar parameter T > 0 shared across all classes: given the logit vector z_i, the confidence prediction is

q̂_i = max_k softmax(z_i / T)^(k). (2)

An extension, called heteroscedastic regression, is used in our work; it replaces the constant scalar with learned values (Kendall and Gal, 2017; Kendall et al., 2018). Knowledge distillation (Hinton et al., 2015) is a compression technique in which a compact model (usually referred to as the student model) is trained to mimic the behavior of a more powerful teacher model. In the context of classification, knowledge distillation augments the loss function with an additional term D_KL(p_i ‖ p_j), where p_i = softmax(z_i / T) and p_j = softmax(z_j / T), with z_i and z_j the logits of the two models and T controlling the smoothness of the output distribution. In this work, we show that distillation can also be used to distill calibration performance, and we use it to build simple yet competitive recalibration methods.
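Temperature scaling (Eq. 2) takes only a few lines; the helper names here are illustrative, not from any particular codebase:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def temperature_scale(logits, T):
    """Temperature scaling (Guo et al., 2017): divide the logits by a
    single scalar T > 0 before the softmax. T > 1 flattens the output
    distribution; T < 1 sharpens it. The argmax is unchanged."""
    return softmax(np.asarray(logits, dtype=float) / T)

logits = np.array([[4.0, 1.0, 0.0]])
p_sharp = temperature_scale(logits, T=1.0)  # original confidence
p_soft = temperature_scale(logits, T=2.0)   # softened confidence
```

Because dividing by T preserves the ranking of the logits, recalibration with temperature scaling changes confidences but never predictions, which is why it cannot change accuracy.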
Concurrently, Desai and Durrett (2020) studied the calibration of pretrained transformers finetuned on downstream tasks, and Hendrycks et al. (2020) studied the out-of-distribution robustness of pretrained transformers. Our work differs in several ways: first, we present a systematic study of out-of-distribution calibration; second, we draw on the connection between distillation and temperature scaling to design simple yet competitive recalibration methods; third, we conduct experiments to understand this connection empirically; finally, we include a more comprehensive set of calibration evaluations, following Ovadia et al. (2019) and Kumar et al. (2019).

Calibration Error Metrics
Let X be the input space, Y = {1, ..., K} the label space, and X ∈ X and Y ∈ Y random variables denoting the input and the label, respectively. Further, let f : X → [0, 1]^K be a neural network that outputs the model's confidence for each class. For simplicity of notation, we define Ŷ = argmax_k f(X)^(k) and P̂ = max_k f(X)^(k).

Expected Calibration Error. One notion of miscalibration is the expected difference between confidence and accuracy,

E_P̂ [ | P(Ŷ = Y | P̂) − P̂ | ].

As mentioned in Sec. 2, this cannot be estimated using finitely many samples if P̂ is a continuous random variable. Expected Calibration Error (Naeini et al., 2015; Guo et al., 2017), or ECE, approximates it by partitioning the predictions into multiple bins and computing the weighted average of the per-bin gaps between accuracy and confidence.

Maximum Calibration Error.
In high-risk scenarios, we may be interested in measuring worst-case performance. Maximum Calibration Error (Naeini et al., 2015; Guo et al., 2017), or MCE, estimates the following quantity via binning:

max_p | P(Ŷ = Y | P̂ = p) − p |.

Brier Score. Calibration alone is not sufficient: we can construct cases in which the outputs of the model are calibrated but not useful, e.g., always outputting 50% in a binary classification task containing 50% of both labels (Kumar et al., 2019). An alternative measure is the Brier score (Brier, 1950), E[(f(X) − Y)²]. Note that the Brier score is a proper scoring rule, so the optimal score corresponds to a system with perfect calibration; we refer the reader to Lakshminarayanan et al. (2017) (Sec. 2.2) for a more detailed discussion of proper scoring rules. An extension of the Brier score is the Brier Skill Score (BSS), which is preferred when the classes are imbalanced. In our early experiments, we did not observe significant ranking changes between these two measures, so we report the Brier score for simplicity.

ℓp Calibration Error. A generalized notion of the calibration error is described in Kumar et al. (2019):

( E[ | P̂ − P(Ŷ = Y | P̂) |^p ] )^(1/p).

This recovers the MCE when p = ∞ and the ECE when p = 1 (Kumar et al., 2019). When p = 2, we refer to it as the Squared Calibration Error (SCE).2 In practice, it is estimated via binning the outputs and labels, similar to ECE and MCE. The plug-in estimate of each term in the calibration error was shown to be biased in Kumar et al. (2019), and the authors encouraged the use of a debiased estimator for the calibration error; we refer to this as the debiased Squared Calibration Error.
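The binned ECE/MCE estimators and the Brier score described above can be sketched as follows (a simplified equal-width-binning implementation; the names are ours):

```python
import numpy as np

def calibration_errors(confidences, correct, n_bins=10):
    """Binned estimates of ECE and MCE (Naeini et al., 2015; Guo et al., 2017).

    confidences: predicted top-class probabilities in [0, 1].
    correct: 1 if the corresponding prediction was right, else 0.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece, mce, n = 0.0, 0.0, len(confidences)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        gap = abs(correct[mask].mean() - confidences[mask].mean())
        ece += (mask.sum() / n) * gap  # weighted average of per-bin gaps
        mce = max(mce, gap)            # worst-case per-bin gap
    return ece, mce

def brier_score(probs, labels):
    """Mean squared error between the predictive distribution and the
    one-hot encoded label (Brier, 1950)."""
    probs = np.asarray(probs, dtype=float)
    onehot = np.eye(probs.shape[1])[np.asarray(labels)]
    return ((probs - onehot) ** 2).sum(axis=1).mean()

# Four predictions at 75% confidence, exactly three of which are correct,
# are perfectly calibrated under the binned estimates.
ece, mce = calibration_errors([0.75, 0.75, 0.75, 0.75], [1, 1, 1, 0])
bs = brier_score([[1.0, 0.0], [0.0, 1.0]], [0, 1])  # perfect predictor
```

Note that the Brier score of the always-50% predictor from the text is nonzero even though that predictor is calibrated, which is exactly why the Brier score complements ECE/MCE.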

Underestimation of Calibration Errors for Model with Continuous Outputs
As noted in Sec. 2, the key to estimating the calibration error is estimating the conditional expectation P(Ŷ = Y | P̂ = p). Kumar et al. (2019) showed that binned estimates can underestimate the true calibration error of models with continuous outputs. They proposed the scaling-binning calibrator, which first fits a parametric scaling function to the model outputs and then bins the function values to ensure calibration. Thus, in addition to reporting results using the metrics described in Sec. 3.1, we report results from running the scaling-binning calibrator on top of each method that we consider.3 We further include ECE results with multiple bin values in order to reduce the gap.

2 Technically, this is the 2-norm calibration error, but we refer to it as the Squared Calibration Error for notational simplicity.
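A simplified sketch of the two stages of a scaling-binning-style calibrator — fit a scaling function (here, a grid-searched temperature as a stand-in for the fit used by Kumar et al. (2019)), then discretize the scaled outputs so they take finitely many values — might look like this (all names are illustrative):

```python
import numpy as np

def fit_temperature(logits, labels, grid=np.linspace(0.25, 4.0, 151)):
    """Pick the temperature minimizing negative log-likelihood on a
    held-out recalibration set (a grid-search stand-in for the
    gradient-based fit used in practice)."""
    logits, labels = np.asarray(logits, dtype=float), np.asarray(labels)

    def nll(T):
        z = logits / T
        z = z - z.max(axis=1, keepdims=True)
        logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -logp[np.arange(len(labels)), labels].mean()

    return min(grid, key=nll)

def bin_outputs(confidences, n_bins=2):
    """Replace each scaled confidence with the mean of its (equal-mass)
    bin, so calibration error can be estimated without the
    underestimation issue affecting continuous outputs."""
    confidences = np.asarray(confidences, dtype=float)
    edges = np.quantile(confidences, np.linspace(0, 1, n_bins + 1))
    idx = np.clip(np.searchsorted(edges, confidences, side="right") - 1,
                  0, n_bins - 1)
    means = np.array([confidences[idx == b].mean() if (idx == b).any() else 0.0
                      for b in range(n_bins)])
    return means[idx]

# An overconfident toy model (one of three confident predictions is wrong)
# gets a softening temperature T > 1 from the recalibration set.
logits_val = np.array([[3.0, 0.0], [3.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
labels_val = np.array([0, 0, 1, 1])
T_fit = fit_temperature(logits_val, labels_val)
binned = bin_outputs(np.array([0.1, 0.2, 0.8, 0.9]), n_bins=2)
```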

Baseline Model
Our baseline model follows the general finetuning of large pretrained language models on downstream tasks: we finetune RoBERTa-base (Liu et al., 2019) on downstream tasks.

Distillation and Uncertainty
Despite the strong empirical performance of many calibration methods (e.g., ensembles), their usefulness in practice is limited due to increased computation and/or memory costs at inference time (Ovadia et al., 2019). In Sec. 4.3, we describe a simple baseline: recalibrate, ensemble, and distill.
Distillation has been shown to mostly "preserve" performance in terms of accuracy: stronger teacher models tend to yield stronger students (Hinton et al., 2015). Whether distillation can also "preserve" calibration performance is less studied, and a model with better accuracy does not necessarily have better calibration (Guo et al., 2017). Here, we briefly look at the distillation objective from the angle of uncertainty matching and show that the two are intuitively related. Sec. 6.1 provides empirical evidence that the teacher model's calibration performance can be distilled into the student model.
There are two ways to see the connection. First, note that distillation minimizes the KL-divergence between the teacher's output distribution and the student's output distribution. This intuitively regularizes the student model to output confidence values close to those of the teacher model; in Sec. 6.1, experimental results show that the confidences of the two models indeed correlate positively. Another perspective, which we elaborate below, considers distillation as encouraging the student to output uncertainty close to that of the teacher model.

Figure 1: Visualization of calibration performance, measured by SCEs (debiased), between teacher and student models trained on RTE and evaluated on QNLI. The n in the legend refers to the size of the ensemble(s). Left-most figure: one metric/task, emphasizing different ensemble sizes. The other three figures: zoomed-out versions of the left-most figure, along with other tasks; instead of using color to indicate ensemble size, here the color refers to the task on which the models are evaluated, and points of different ensemble sizes but the same evaluation task are aggregated under the same color. Each sub-figure represents one evaluation metric (more tasks/metrics, less emphasis on ensemble sizes). All figures: the x-axis shows the teacher model's performance, and the y-axis the student model's performance; each dot represents a different teacher-model configuration. The P/S in the legends refer to the Pearson/Spearman correlations.
We start by defining a loss function as a weighted combination of the regular cross-entropy loss and a regularization term that measures the difference in uncertainty between the student model, θ, and the teacher model, θ′:

L(θ) = L_CE(θ) + λ ( H(y|x; θ) − H(y|x; θ′) ), (6)

where λ weights the regularizer and H refers to the predictive entropy (Gal, 2016), defined as (θ is omitted for simplicity)

H(y|x, D) = − Σ_k p(y = k | x, D) log p(y = k | x, D).

Gal (2016) showed that H(y|x, D) can be approximated using samples from the (approximate) posterior distribution over the parameters. In practice, this can be satisfied, for example, if the student model is trained using dropout and the teacher model uses either MC-dropout or ensembles.4 Next, suppose we approximate the student's predictive entropy term in Eq. 6 using the cross entropy between the teacher and student distributions. This turns the second term of Eq. 6 into a KL-divergence, and hence recovers the distillation objective.5
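The approximation step above rests on the identity H(p, q) = H(p) + D_KL(p ‖ q): replacing the student's predictive entropy with the teacher–student cross entropy leaves exactly the KL term of the distillation objective. A small numeric check (a sketch, with our own helper names):

```python
import numpy as np

def entropy(p):
    """Predictive entropy H(p) = -sum_k p_k log p_k."""
    p = np.asarray(p, dtype=float)
    return -(p * np.log(p)).sum()

def cross_entropy(p, q):
    """Cross entropy H(p, q) = -sum_k p_k log q_k."""
    return -(np.asarray(p, dtype=float) * np.log(np.asarray(q, dtype=float))).sum()

def kl(p, q):
    """KL divergence via the identity D_KL(p || q) = H(p, q) - H(p)."""
    return cross_entropy(p, q) - entropy(p)

p_teacher = np.array([0.7, 0.2, 0.1])
p_student = np.array([0.5, 0.3, 0.2])

# Approximating the student's entropy by the teacher-student cross
# entropy turns the entropy-difference regularizer into the KL term:
regularizer = cross_entropy(p_teacher, p_student) - entropy(p_teacher)
```

The check also makes footnote 5 concrete: the approximation error of swapping entropy for cross entropy is itself D_KL(p_teacher ‖ p_student), the quantity the objective drives to zero.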

Recalibrate, Ensemble, and Distill
This simple algebraic manipulation shows that distillation has the effect of encouraging the student model to match the teacher model's uncertainty. It motivates a simple recalibration method, "recalibrate, ensemble, and distill": first build an expensive yet well-calibrated teacher model (an ensemble of models, each recalibrated using temperature scaling), and then distill the expensive teacher model into a cheaper student model.

4 Note that samples from a model using dropout (MC-dropout) or an ensemble can be used to approximate the posterior distribution (Gal, 2016; Lakshminarayanan et al., 2017).
5 Note that the approximation error equals the KL-divergence, i.e., the term the objective function seeks to minimize; as the KL-divergence decreases, so does the approximation error.
The training cost is roughly (N + 1)C_0 + C_1, where N is the ensemble size, C_0 the cost of training the baseline model, the +1 accounts for training the student via distillation, and C_1 is the (relatively cheap) cost of training the temperature-scaling model. However, the inference cost is almost the same as that of a single model (i.e., small overhead), which is very useful when inference cost is the primary concern (e.g., deployment).
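A minimal sketch of the teacher construction and the student's distillation loss (illustrative names; a real implementation would compute this loss on minibatch logits inside a training loop and backpropagate through the student):

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def teacher_probs(member_logits, temperature):
    """'Recalibrate, ensemble': apply temperature scaling to each
    ensemble member, then average the resulting distributions."""
    return softmax(np.asarray(member_logits), T=temperature).mean(axis=0)

def distill_loss(student_logits, labels, p_teacher, alpha=0.5):
    """'Distill': weighted sum of cross entropy on gold labels and KL
    divergence from the recalibrated teacher to the student."""
    p_s = softmax(student_logits)
    n = len(labels)
    ce = -np.log(p_s[np.arange(n), labels]).mean()
    kl = (p_teacher * (np.log(p_teacher) - np.log(p_s))).sum(axis=1).mean()
    return (1 - alpha) * ce + alpha * kl

member_logits = np.array([[[2.0, 0.0]],
                          [[1.0, 0.0]]])  # 2 members, 1 example, 2 classes
p_t = teacher_probs(member_logits, temperature=1.5)
loss = distill_loss(np.array([[1.5, 0.0]]),
                    labels=np.array([0]), p_teacher=p_t)
```

At inference time only the student is run, which is what keeps the deployment cost at roughly C_0.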

Choosing the Distillation Temperature
The distillation term is often written as

D_KL( P(x; θ′, T) ‖ P(x; θ, T) ),

where P(x; θ, T) = softmax(f(x; θ)/T) and T is usually a hyperparameter to be tuned. One might notice that this is similar to the equation of temperature scaling (Eq. 2). This, together with the uncertainty-matching viewpoint, motivates a small change to distillation: we remove the T from the student side, and choose the constant T̃ for the teacher that minimizes the calibration error, which is akin to performing another round of temperature scaling. The motivation is that we want the student model itself to produce calibrated probabilities, rather than a scaled version of them: if we simultaneously scale the student by T, then softmax(f(x; θ)/T) would be calibrated, but the student model alone would not. We emphasize that we are not the first to describe the connection between distillation and calibration; related findings have been presented in previous works (Tang et al., 2020; Müller et al., 2019). However, we believe our view from the angle of predictive entropy is novel. More importantly, we conduct extensive experiments and analyses in the context of finetuned language models on several text classification tasks to empirically verify that the calibration performance of student and teacher models is correlated.
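Choosing T̃ by grid search against a calibration-sensitive metric on validation data (the paper selects it by Brier score; the helper names here are illustrative) might look like:

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def brier(probs, labels):
    onehot = np.eye(probs.shape[1])[np.asarray(labels)]
    return ((probs - onehot) ** 2).sum(axis=1).mean()

def choose_T(teacher_logits, labels, grid=np.arange(0.5, 2.01, 0.02)):
    """Pick the teacher-side temperature minimizing the Brier score on a
    validation set. The student is then distilled from
    softmax(z / T~) while its own temperature stays fixed at 1."""
    labels = np.asarray(labels)
    return min(grid, key=lambda T: brier(softmax(teacher_logits, T=T), labels))

# A hypothetical overconfident teacher (~98% confidence, 75% accuracy)
# is assigned a softening temperature near the top of the search range.
teacher_logits = np.array([[4.0, 0.0], [4.0, 0.0], [0.0, 4.0], [4.0, 0.0]])
labels = np.array([0, 1, 1, 0])
T_best = choose_T(teacher_logits, labels)
```

The grid here mirrors the range reported in Sec. 5 (0.50 to 2.00 with a step of 0.02); searching against Brier score rather than accuracy is what allows T̃ to move even though predictions are unchanged.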

Setup
We include additional details in the supplementary materials. Also included are expanded experiment results, such as figures evaluated on more tasks using more evaluation metrics (Sec. 6.1), and detailed/expanded results tables as well as accuracy and ECEs with multiple bin-sizes (Sec. 6.2).
Model. Our codebase is largely based on HuggingFace Transformers (Wolf et al., 2019). When applicable, we use an ensemble of size 2 and choose T̃ (Eq. 9) based on the Brier scores on the validation dataset. The baseline model has 125.2M parameters, the temperature-scaling model (heteroscedastic variant) has 125.8M, and our method has 125.2M (the same as the baseline model).

Data. We use the GLUE classification tasks (Wang et al., 2019). The train and evaluation data come from the same task for in-domain evaluations, but from different tasks of the same type for out-of-domain evaluations. We group MRPC and QQP (paraphrase tasks), and group MNLI (2-label version), QNLI, RTE, and WNLI (NLI tasks). We leave SST-2 (sentiment), CoLA (acceptability), and MNLI (3-label version, NLI) as separate groups.
We use the in-domain validation data to train the scaling-binning calibrator.7

Analysis Experiment Details. We conduct experiments on RTE, in which we distill teacher models with different ensemble sizes (from 1 to 6) and temperature-scaling constants (from 0.50 to 2.00 with a step size of 0.02) into student models. Each model is then evaluated on both the in-domain task (RTE) and out-of-domain tasks (MNLI-2, QNLI, WNLI) using confidence, ECE, MCE, Brier score, SCE (debiased), and SCE (biased). The numbers represent performance on the validation dataset.

Analysis Experiments
Sec. 4.2 showed the connection between distillation and uncertainty regularization. In this section, we perform analysis experiments examining the correlation between the calibration performance of teacher models and student models. We conduct experiments on RTE, in which we distill teacher models with different ensemble sizes and temperature-scaling constants into student models; each model is then evaluated on both in-domain and out-of-domain tasks, and the numbers represent performance on the validation dataset. We start by examining the calibration performance of teacher and student models, where we vary the calibration performance of the teacher model while holding its accuracy almost constant.8 Fig. 1 (left) shows the debiased Squared Calibration Error of models trained on RTE and evaluated on QNLI. We observe that, as we vary the teacher model's calibration performance, the calibration performance of the student model changes in a similar direction. Next, Fig. 1 (right) depicts the calibration performance of each teacher-student pair across multiple calibration metrics. Similarly, these figures indicate that the correlation of calibration performance between teacher and student models is in general positive. This confirms the intuition described in Sec. 4.2 that the teacher model's calibration performance can be distilled into the student model.

7 We only use the 2-label version of MNLI for evaluation. We use accuracy for CoLA evaluation so that calibration-error computations are more consistent across tasks.
8 Note that the accuracies of teacher models with the same ensemble size but different temperature-scaling constants are almost identical: for each model, the temperature-scaling constant sharpens/flattens the probabilities but usually does not change their relative ranking. The motivation is to reduce external influences, as comparing calibration performance is less meaningful if the predictions/accuracies change significantly.

Main Experiments
Next, we show our experimental results comparing the following five models: the baseline (Baseline, Sec. 4.1), Ensemble (Lakshminarayanan et al., 2017) (Ensemble, Sec. 2), Temperature Scaling (Guo et al., 2017) (TempScale, Sec. 2), our method (Ours, Sec. 4.2), and its variant with automatic distillation-temperature selection (Ours (T̃), Sec. 4.4). For each table, we report results with and without running the scaling-binning calibrator, following the description in Sec. 3.2. Due to space constraints, we discuss and display the average performances here (please see Sec. 5).
Baseline Performances. Results are shown in Table 1; the baseline has relatively high calibration errors. Notably, the out-of-domain ECE values are around 18−19, which can be interpreted as over- or under-estimating the probability by about 18−19% in expectation.
Ensemble and Temperature Scaling. Next, we add ensembles/temperature scaling to the baseline. Results in Table 1 show that performance improves in general, especially in the out-of-domain settings: for ensembles, 3/9 in-domain metrics improve (2/9 are similar) and 8/9 out-of-domain metrics improve; for temperature scaling, 6/9 in-domain metrics improve (2/9 similar) and 7/9 out-of-domain metrics improve (1/9 similar). The results are largely consistent with previous observations: temperature scaling performs better when the data come from in-domain (it outperforms ensembles on 7/9 in-domain metrics, with 1/9 similar), whereas ensembles are more competitive in out-of-domain settings at the cost of extra computation (they outperform temperature scaling on 4/9 out-of-domain metrics, with 3/9 similar).
Our Methods. Then, we apply our method, which has the same computation at inference time as the baseline. Table 1 shows that performance improves as well, despite no extra inference-time computation cost: 2/9 in-domain metrics improve (3/9 similar) and 9/9 out-of-domain metrics improve. Applying automatic temperature selection on top of our method further improves out-of-domain performance on 4 metrics. However, automatic temperature selection does not further improve performance when we additionally apply the scaling-binning calibrator; we hypothesize that this is because the temperature values are chosen based on evaluation metrics computed before applying the calibrator, and thus fail to take it into account. Comparing our method to ensembles and temperature scaling, our method improves upon temperature scaling on 5/9 out-of-domain metrics (1/9 similar), but outperforms the more expensive ensembles on just 3/9 metrics (1/9 similar). With automatic temperature selection, our method improves upon temperature scaling on 8/9 out-of-domain metrics, and upon ensembles on 5/9 (1/9 similar). This shows that our methods are competitive in out-of-domain settings with little extra computation.

Ablation Experiments
In this section, we (1) ablate our method by removing components to gain insight into how each component contributes to the final performance,9 and (2) measure how well distillation transfers calibration performance. First, we remove ensembles (or temperature scaling), keeping only temperature scaling (or ensembles) and distillation (−Ensembles and −TempScale, respectively). The results in Table 2 show that removing either component generally leads to worse performance: removing ensembles hurts 7/9 in-domain metrics (2/9 similar) and 6/9 out-of-domain metrics (2/9 similar); removing temperature scaling hurts 4/9 in-domain metrics (4/9 similar) and 9/9 out-of-domain metrics. This shows that the additional calibration gains of the teacher model can be effectively distilled into the student models.
Next, we compare the models before/after distillation (−Distillation).10 As expected, the teacher model (before distillation) achieves strong performance at the expense of extra inference-time computation. We then study to what extent distillation transfers calibration performance. Let A_t and B_t be two teacher models (before distillation) that differ in only one component (e.g., ensembling or temperature scaling), and let A_s and B_s be the corresponding student models (after distillation). Assuming A is the stronger model, we compute the relative fraction of the teacher-side improvement due to that component which is transferred to the student model, denoted ρ_AB:

ρ_AB = ( ε(B_s) − ε(A_s) ) / ( ε(B_t) − ε(A_t) ),

where ε(·) denotes the out-of-domain calibration performance. We compute ρ_AB for each metric, and use the median of the percentages as the summary statistic. We find that 40.8% (111.2%) of the improvements from adding ensembles (temperature scaling) as an extra component in the teacher model are transferred to the student model via distillation.11
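Under this definition (as we reconstruct it from the text), the transfer statistic can be computed as follows; the error values below are purely hypothetical:

```python
import numpy as np

def transfer_ratio(err_teacher_A, err_teacher_B, err_student_A, err_student_B):
    """Fraction of the teacher-side improvement (B -> A) that survives
    distillation, for one calibration metric. Assumes teacher A is the
    stronger model, i.e. err_teacher_A < err_teacher_B."""
    return (err_student_B - err_student_A) / (err_teacher_B - err_teacher_A)

def summarize(ratios):
    """Median across metrics, the paper's summary statistic."""
    return float(np.median(ratios))

# Hypothetical calibration errors on three metrics: the teachers improve
# by 2, 4, and 1 points; the students by 1, 2, and 1 points.
ratios = [transfer_ratio(4.0, 6.0, 5.0, 6.0),   # -> 0.5
          transfer_ratio(3.0, 7.0, 4.0, 6.0),   # -> 0.5
          transfer_ratio(5.0, 6.0, 5.5, 6.5)]   # -> 1.0
```

Note that ρ_AB can exceed 1 (as with the reported 111.2% for temperature scaling) when the student gains more from a component than its teacher did.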

Conclusion and Discussion
We presented a study of the calibration of finetuned language models in the context of text classification, where models are evaluated on in-domain and out-of-domain data. We showed the effectiveness of a few widely-used calibration methods. We illustrated the intuitive connection between distillation and calibration, and described simple yet competitive calibration methods. We conducted experiments to empirically understand whether distillation can be used to distill calibration performance, and showed that the simple methods we described achieve competitive out-of-domain calibration performance. We further presented ablation studies on the usefulness of the components of the proposed method and examined the transferability of calibration via distillation. However, our method is limited in that it incurs an overhead cost for training the student model, which could be expensive in some settings. We leave the investigation of more efficient inference-time recalibration techniques to future work.

A.1 Epistemic and Aleatoric Uncertainty
Two types of uncertainty commonly appear in the machine learning literature: epistemic uncertainty and aleatoric uncertainty (Gal, 2016; Kendall and Gal, 2017). Epistemic uncertainty accounts for uncertainty in the model parameters, and tends to decrease as the amount of observed data increases. Aleatoric uncertainty conveys the noise inherent in the observations, and thus cannot be explained away with more data. In the case of classification, examples of aleatoric uncertainty include the probability of the top class12 and the entropy of the probability distribution over classes (Kendall et al., 2018); an example of epistemic uncertainty is the mutual information.13 In the uncertainty-calibration literature, we usually calibrate aleatoric uncertainty as measured by the probability of the prediction. In Sec. 4.2, we also view distillation from the angle of matching another uncertainty between teacher and student models: the predictive entropy (Gal, 2016).

A.2 Uncertainty Calibration
The quality of an uncertainty measurement is usually assessed via calibration (Kendall and Gal, 2017). In the context of calibration, the uncertainties often refer to predictive probabilities. A model is calibrated if its predictive probabilities match the empirical frequency of the data (Gal, 2016). Let Ŷ and P̂ be the predicted class and its associated confidence (probability of correctness) output by a neural network. We would like the confidence estimates P̂ to be calibrated, which intuitively means that P̂ should represent true probabilities (Guo et al., 2017):

P(Ŷ = Y | P̂ = p) = p, for all p ∈ [0, 1].

Recalibration methods improve calibration by transforming uncalibrated outputs into calibrated probabilities. Recently, Kumar et al. (2019) proposed the scaling-binning calibrator and a more sample-efficient estimator of calibration error. In our work, we describe simple approaches that combine the strengths of ensembles and temperature scaling without introducing extra computation at inference time; we further apply the scaling-binning calibrator to ensure calibration.

12 More specifically, it is one minus the probability/confidence of the top class.
13 Please see page 54 in Gal (2016) for details.
Ensembles work by aggregating multiple networks trained independently on the entire dataset, and have been shown to achieve strong calibration performance. Temperature scaling is an extension of Platt scaling (Guo et al., 2017). It uses a single scalar parameter T > 0 for all classes. Given the output z_i (usually a logits vector), the confidence prediction is:

q̂_i = max_k softmax(z_i / T)_k.

An extension, called heteroscedastic regression, is used in our work, which replaces the constant scalar T with learned, input-dependent values (Kendall and Gal, 2017; Kendall et al., 2018).
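Because dividing logits by a positive scalar preserves the argmax, temperature scaling changes confidences without changing accuracy; T is typically fit by minimizing negative log-likelihood on held-out validation logits. A minimal sketch using a simple grid search in place of a gradient-based optimizer (the grid range is an arbitrary choice):

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_temperature(val_logits, val_labels, grid=np.linspace(0.25, 5.0, 96)):
    """Pick the scalar T > 0 minimizing validation NLL.

    Accuracy is unchanged for any T > 0 since the argmax is preserved."""
    def nll(T):
        p = softmax(val_logits, T)
        return -np.log(p[np.arange(len(val_labels)), val_labels] + 1e-12).mean()
    return min(grid, key=nll)
```

For an overconfident model (high confidence, lower accuracy), the fitted T comes out greater than 1, softening the predictive distribution.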

A.3 Distillation
Knowledge distillation (Hinton et al., 2015; Domingos, 1997; Blum and Mitchell, 1998; Zeng and Martinez, 2000; Ba and Caruana, 2014) is a compression technique in which a compact model (usually referred to as the student model) is trained to mimic the behavior of a more powerful teacher model. In the context of classification, knowledge distillation works by augmenting the loss function with an additional term D_KL(p_i ‖ p_j), where p_i = softmax(z_i/T) and p_j = softmax(z_j/T), with z_i and z_j the logits from the two models, and T a temperature controlling the smoothness of the output distributions. Knowledge distillation has been used in a wide range of applications (Buciluǎ et al., 2006; Wang et al., 2018; Kim and Rush, 2016; Furlanello et al., 2018; Clark et al., 2019; Teh et al., 2017; Schwarz et al., 2018; Sanh et al., 2019). In this work, we show that distillation can also be used to distill calibration performance, and we use it to build simple yet competitive recalibration methods. A related area of research is label smoothing (Yuan et al., 2020). Label smoothing replaces the hard/one-hot targets y_k with modified targets y_k(1 − α) + α/K, where K is the number of classes and α is a hyperparameter. Pereyra et al. (2017) showed that label smoothing provides consistent gains across many tasks and proposed a new regularizer, termed the confidence penalty. Müller et al. (2019) studied when label smoothing is helpful, and found that label smoothing can implicitly calibrate a model's predictions. In contrast, our use of a teacher model can be seen as adaptively deciding how much smoothing is needed (Tang et al., 2020).
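The distillation term and the label-smoothing targets described above can be sketched directly from their definitions; this is an illustrative implementation of the generic formulas, not the paper's training code:

```python
import numpy as np

def softmax(z, T=1.0):
    e = np.exp(z / T - (z / T).max())  # shift for numerical stability
    return e / e.sum()

def distillation_kl(teacher_logits, student_logits, T=1.0):
    """D_KL(p_teacher || p_student), with both distributions smoothed by T."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return (p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum()

def label_smoothing_targets(y, num_classes, alpha=0.1):
    """y_k (1 - alpha) + alpha / K: a teacher-free, uniform version of soft targets."""
    onehot = np.eye(num_classes)[y]
    return onehot * (1.0 - alpha) + alpha / num_classes
```

The KL term vanishes when the student matches the teacher exactly, while label smoothing applies the same fixed amount of softening to every example; a teacher distribution instead softens each example differently.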

A.4 Recent Related Works
Finally, there are a few recent related works in the computer vision literature: e.g., Yun et al. (2020) proposed to distill the predictive distributions between different samples of the same label during training to improve calibration performance, and Gurau et al. (2018) proposed the Distilled Dropout Network, which distills knowledge from multiple MC samples of the teacher to improve the reliability of its uncertainty scores. In our work, we mainly focus on language tasks. Concurrent to our work, Desai and Durrett (2020) studied the calibration of pretrained transformers when finetuned on downstream tasks, and Hendrycks et al. (2020) studied the out-of-distribution robustness of pretrained transformers. We differ from these two works in several ways: first, we present a systematic study of out-of-distribution calibration; second, we draw insights from the connection between distillation and temperature scaling to design simple yet competitive recalibration methods; third, we conduct experiments to understand the connection between these two concepts empirically; finally, we include a more comprehensive set of calibration evaluations following Ovadia et al. (2019).

B Setup Details
Model Details and Hyperparameter Search.
Our codebase is largely based on the Transformers library from HuggingFace (Wolf et al., 2019). We used RoBERTa-base (Liu et al., 2019) as the language model backbone and used most of the default/recommended hyperparameters in the Transformers library. We tried two values of the learning rate in our initial experiments, 2e−5 and 1e−5; these values were chosen based on the hyperparameter search described in the library, and we stick with one of them (1e−5) based on accuracy. For experiments that involve ensembles, we use an ensemble size of 2. For experiments that involve distillation, we set T = 1.0 (i.e., no scaling) for models without automatic temperature selection unless we explicitly mention otherwise. When automatic temperature selection is used, we chose T based on the Brier Scores on the validation dataset. All of our experiments ran on a single V100 GPU. The baseline model has 125.2M parameters, the temperature-scaling (heteroscedastic variant) model has 125.8M parameters, and our method has 125.2M parameters (the same as the baseline model). We train multiple models using different random seeds before ensembling them, but otherwise run the training and inference once. The runtime varies among tasks, but most of them finish within a day.
Data. We perform experiments on the classification tasks from the GLUE Benchmark (https://gluebenchmark.com/) (Wang et al., 2019; Warstadt et al., 2019; Dolan and Brockett, 2005; Agirre et al., 2007; Williams et al., 2018; Rajpurkar et al., 2016; Dagan et al., 2006; Bar-Haim et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., 2009; Levesque et al., 2011), including the QQP dataset (https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs), and we refer readers to Wang et al. (2019) for dataset statistics. Because the calculation of calibration errors requires access to the ground-truth labels, which are not available for the GLUE test data, we split the validation dataset into two halves, one for validation and the other for test, following the approach of Desai and Durrett (2020). For MultiNLI, we merge the results for both the MultiNLI matched and mismatched sections. When computing the out-of-domain performance between the 3-label MultiNLI and other 2-label NLI tasks, we follow the approach used in the jiant library (Pruksachatkun et al., 2020) and merge the predictions/labels that correspond to "neutral" and "contradiction" into a single category. We follow the Transformers library for the rest of the data preprocessing. For in-domain evaluations, the train data and evaluation data come from the same task. For out-of-domain evaluations, the train data and evaluation data come from different tasks of the same type. We group MRPC and QQP (paraphrase tasks), and group MNLI (2-label version; we only use the 2-label version of MNLI for evaluation), QNLI, RTE, and WNLI (NLI tasks). We leave SST-2 (sentiment), CoLA (acceptability; we use accuracy for CoLA evaluation so that calibration error computations are more consistent across tasks), and MNLI (3-label version, NLI) as separate groups. We use the in-domain validation data to train the scaling-binning calibrator (https://github.com/p-lambda/verified_calibration), and follow the calibration evaluation setup of Ovadia et al. (2019) (https://github.com/google-research/google-research/tree/master/uq_benchmark_2019).
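The 3-to-2 label merge described above amounts to summing the probability mass of "neutral" and "contradiction" into a single non-entailment class. A minimal sketch; the class index order ([entailment, neutral, contradiction]) is an assumption for illustration, not necessarily the label map used by jiant or the paper:

```python
import numpy as np

def merge_nli_probs(probs3, entail_idx=0):
    """Collapse 3-class NLI probabilities into 2 classes.

    probs3: array (..., 3) of [entailment, neutral, contradiction]
    probabilities (index order assumed). Returns (..., 2) probabilities
    [entailment, not-entailment]."""
    probs3 = np.asarray(probs3, dtype=float)
    entail = probs3[..., entail_idx]
    other = probs3.sum(axis=-1) - entail  # neutral + contradiction mass
    return np.stack([entail, other], axis=-1)
```

Because only the two non-entailment classes are merged, probabilities still sum to one and the entailment confidence is unchanged.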
Analysis Experiments Details. We conduct experiments on RTE, in which we distill teacher models with different ensemble sizes (from 1 to 6) and temperature scaling constants (from 0.50 to 2.00 with a step size of 0.02) into student models. Each model is then evaluated on both the in-domain task (RTE) and out-of-domain tasks (MNLI-2, QNLI, WNLI) using confidence, ECE, MCE, Brier Score, SCE (debiased), and SCE (biased). The numbers represent performance on the validation dataset.

C Further Experiment Details
Distillation Transferability of Calibration.
We compute ρ_AB based on both Table 1 and Table 2 of the main paper. The percentage of improvement on ensembles presented in Sec. 6.3 of the main paper is computed with temperature scaling + ensembles (−Distillation in main paper Table 2) as A_t, ensembles only (Ensemble in main paper Table 1) as B_t, temperature scaling + ensembles + distillation (Ours in main paper Table 2) as A_s, and ensembles + distillation (−TempScale in main paper Table 2) as B_s. The percentage of improvement on temperature scaling is computed similarly, with the temperature scaling component as the main difference between the teacher/student models.

D Expanded Analysis Experiments
Please see Fig. 2 for the expanded visualization of the analysis experiments.

E Detailed Main Experiment Results
Please see Table 3 and Table 4 for detailed in-domain and out-of-domain experiment results.

F Detailed Ablation Experiment Results
Please see Table 5 and Table 6 for detailed in-domain and out-of-domain ablation experiment results.

Table 6: Out-of-domain ablation performances on MRPC (MR), QQP (QQ), QNLI (QN), RTE (R), WNLI (W). We use M2 to denote the 2-label version of the MultiNLI task. Note that for the metrics we considered here, lower scores indicate better calibration.