Bag of Tricks for In-Distribution Calibration of Pretrained Transformers

While pre-trained language models (PLMs) have become a de-facto standard for improving the accuracy of text classification tasks, recent studies find that PLMs often predict over-confidently. Although calibration methods such as ensemble learning and data augmentation have been proposed, most of them have been verified on computer vision benchmarks rather than on PLM-based text classification tasks. In this paper, we present an empirical study on confidence calibration for PLMs, addressing three categories: confidence penalty losses, data augmentations, and ensemble methods. We find that an ensemble model overfitted to the training set shows sub-par calibration performance, and we also observe that PLMs trained with a confidence penalty loss exhibit a trade-off between calibration and accuracy. Building on these observations, we propose the Calibrated PLM (CALL), a combination of calibration techniques. CALL complements the shortcomings that may arise when a calibration method is used individually and boosts both classification and calibration accuracy. Design choices in CALL's training procedures are extensively studied, and we provide a detailed analysis of how calibration techniques affect the calibration performance of PLMs.


Introduction
Trustworthy deployment of machine learning applications requires accurate and calibrated predictions to ensure their reliability and help users be less confused by models' decisions (Xiao and Wang, 2019; Liu et al., 2020).
* Corresponding author. This work was partially done at UNIST.
However, modern deep neural networks (DNNs) produce miscalibrated predictions, i.e., a mismatch between a model's confidence and its correctness. One of the reasons is that an over-parameterized
classifier typically produces over-confident predictions (Guo et al., 2017). Moreover, miscalibration can be exacerbated when DNNs make predictions on test data that differ from the training distribution, i.e., under distribution shift (Ovadia et al., 2019).
To obtain well-calibrated predictions, many pioneering studies have shown the calibration effect of ensemble and regularization techniques, focusing on computer vision benchmarks. Ensemble learning has become one of the standard approaches to reducing calibration errors (Lakshminarayanan et al., 2017; Bonab and Can, 2019). Pereyra et al. (2017) propose an entropy-regularized loss that penalizes confident output distributions in order to reduce overfitting. Hongyi Zhang (2018) and Hendrycks et al. (2020) demonstrate that DNNs trained on diverse augmented data are less prone to producing over-confident predictions, leading to a calibration benefit under distribution shift.
Intense research effort has focused on improving the calibration performance of vision models on image datasets. However, the exploration of existing calibration methods with pre-trained Transformers (PLMs) has received less attention. Moreover, recent studies show that PLMs such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) produce miscalibrated predictions caused by over-parameterization (Kong et al., 2020). Therefore, it is necessary to investigate how modern calibration techniques affect PLMs' calibration.
In this paper, focusing on PLMs in multi-class classification tasks, we explore widely used calibration families, including (1) confidence penalty loss functions that can be used instead of the cross-entropy loss, (2) data augmentations, and (3) ensemble methods. We consider a low-resource regime since a small training dataset amplifies the miscalibration of models (Rahaman et al., 2021). We also observe that PLMs produce especially unreliable predictions in the data scarcity setting (see Figure 1).

Contributions. We conduct a comprehensive empirical study of the effectiveness of the above calibration methods. Our findings are as follows:

• A PLM trained with a strong penalty on over-confident outputs shows significantly improved calibration performance, but its accuracy can slightly deteriorate.
• For ensemble methods, Deep Ensemble (Lakshminarayanan et al., 2017) and MIMO (Havasi et al., 2021) increase the diversity of predictions, resulting in well-calibrated predictions in the data scarcity setting. However, the ensemble methods show insufficient calibration when each ensemble member overfits the negative log-likelihood on the training dataset.
• Data augmentation methods that expose diverse patterns, such as MixUp (Hongyi Zhang, 2018) and EDA (Wei and Zou, 2019), are more effective for calibration in PLMs than weak text-augmentation methods (Kolomiyets et al., 2011; Karimi et al., 2021).
Building on our findings, we present Calibrated PLM (CALL), a blend of the discussed calibration methods. Numerical experiments demonstrate that the components of CALL complement each other's weaknesses. For instance, data augmentation and ensemble methods offset the accuracy decline caused by the confidence penalty loss, while data augmentation and the confidence penalty loss counteract overfitting in the ensemble model. Through extensive experiments, we show CALL's competitiveness on several text classification benchmarks.

Related Work
The calibration of machine learning models has mainly been studied for the trustworthy deployment of image recognition applications (Lakshminarayanan et al., 2017; Hongyi Zhang, 2018; Guo et al., 2017). Beyond computer vision, research on the calibration ability of language models in the NLP domain has also recently attracted attention (Desai and Durrett, 2020; Dan and Roth, 2021). Desai and Durrett (2020) investigate the calibration ability of PLMs and demonstrate that RoBERTa produces more calibrated predictions than BERT. They also show that temperature scaling (Hinton et al., 2014) and label smoothing (Szegedy et al., 2016) improve the calibration performance of PLMs on language understanding tasks. Dan and Roth (2021) conduct an empirical study of the effects of model capacity on PLMs and show that smaller pre-trained Transformers provide more reliable predictions. Moon et al. (2020) find that PLMs tend to produce over-confident outputs based on in-distribution (ID) keywords rather than contextual relations between words. They demonstrate that keyword-biased predictions can be over-confident even on out-of-distribution samples containing ID keywords. Kong et al. (2020) suggest two regularizers using generated pseudo-manifold samples to improve both ID and out-of-distribution calibration for PLMs. They use MixUp (Hongyi Zhang, 2018) as a regularization technique for BERT calibration and show that mixed training samples on the data manifold improve calibration performance. Similarly, Park and Caragea (2022) propose a variant of MixUp utilizing saliency signals and also analyze the impact of combining additional calibration methods with MixUp. However, they only consider temperature scaling and label smoothing as additional calibration methods.
Why Re-assess Calibration Methods?

Guo et al. (2017) observe that a larger DNN tends to be more poorly calibrated than a smaller one. As the parameter counts of modern DNNs continue to increase, miscalibration issues need to be addressed more than ever.
At the same time, the unique character of PLMs raises concerns about whether previous findings on calibration obtained from standard convolutional neural networks (CNNs) can be successfully extended to PLMs. For example, PLMs with ensemble learning may behave differently from randomly initialized CNNs, because PLMs have a massive number of parameters and are initialized with pre-trained weights in the fine-tuning stage.
For data augmentation, on the other hand, image transformations (e.g., flipping, translation, and rotation) cannot be directly applied to text-based samples; thus, it is also necessary to investigate the effect of text-specific augmentations on the calibration of PLMs.

Calibration Strategies
In this section, we review the existing methods used in our experiments and describe how we applied each method to PLMs. The calibration methods we explore are denoted in bold.

Preliminaries
Notation. Let $D = \{x_i, y_i\}_{i=1}^{N}$ be a dataset consisting of $N$ samples, where $x_i \in \mathcal{X}$ is an input and $y_i \in \mathcal{Y} = \{1, \ldots, K\}$ is a ground-truth label. We denote by $\hat{p}_i = f(y|x_i)$ the predicted distribution of a classifier $f$. The class prediction and associated confidence (maximum probability) of $f$ are computed as $\hat{y}_i = \operatorname{argmax}_{k \in \mathcal{Y}} \hat{p}_i^{(k)}$ and $\hat{p}_i = \max_{k \in \mathcal{Y}} \hat{p}_i^{(k)}$, respectively.
In the BERT-style architecture, the outputs of the embedding layer, the $L$ attention blocks, and the output dense layer (with softmax) are denoted by $z_{\text{embed}}$, $g = \{g_1, \ldots, g_L\}$, and $h$, respectively.

Calibration Metrics. A calibrated model provides reliable predictive probabilities whose confidence aligns with its expected accuracy, i.e., it minimizes $\mathbb{E}_{\hat{p}}\left[\left|\mathbb{P}(\hat{y} = y \mid \hat{p}) - \hat{p}\right|\right]$. Given a finite dataset, the Expected Calibration Error (ECE; Naeini et al., 2015) is widely used as a calibration performance measure. ECE is computed by binning predictions into $T$ groups based on the confidences of $f$ and then taking a weighted average of each group's accuracy/confidence difference:

$$\mathrm{ECE} = \sum_{t=1}^{T} \frac{|B_t|}{N} \left| \mathrm{acc}(B_t) - \mathrm{conf}(B_t) \right|,$$

where $B_t$ is the group of samples whose confidences fall into $\left(\frac{t-1}{T}, \frac{t}{T}\right]$, and $\mathrm{acc}(B_t)$ and $\mathrm{conf}(B_t)$ denote the average accuracy and confidence of the predictions in $B_t$, respectively.
Model calibration can also be measured using proper scoring rules (Gneiting and Raftery, 2007) such as the Brier score (Brier et al., 1950) and the negative log-likelihood (NLL).
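As a concrete reference, the ECE binning procedure described above can be sketched in a few lines of NumPy (a minimal sketch; bin-edge and tie-handling conventions may differ from the exact evaluation code):

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    """ECE: partition predictions into equal-width confidence bins and take
    the size-weighted average of each bin's |accuracy - confidence| gap."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    confidences = probs.max(axis=1)                       # max predicted probability
    accuracies = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)   # bin ((t-1)/T, t/T]
        if mask.any():
            ece += mask.mean() * abs(accuracies[mask].mean()
                                     - confidences[mask].mean())
    return ece
```

For example, a batch of always-correct predictions all made with confidence 0.9 yields an ECE of 0.1: the model is under-confident by exactly that margin.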

Confidence Penalty Losses
We explore alternative loss functions that can be used instead of the cross-entropy (CE) loss.

Brier Loss (BL; Brier et al., 1950) is one of the proper scoring rules, defined as the squared error between the softmax output and the one-hot ground-truth encoding. BL is related to ECE in that it is an upper bound of the calibration error by the calibration-refinement decomposition (Bröcker, 2009; Liu et al., 2020).

Entropy Regularized Loss (ERL; Pereyra et al., 2017) penalizes confident output distributions by adding the negative entropy:

$$\mathcal{L}_{\mathrm{ERL}} = \mathcal{L} - \beta \, \mathcal{H}\!\left(f(y|x)\right),$$

where $\mathcal{L}$ can be an arbitrary classification objective (e.g., CE or BL), $\mathcal{H}$ is the Shannon entropy, and $\beta$ is a hyperparameter that controls the strength of the confidence penalty.

Label Smoothing (LS; Szegedy et al., 2016) is a commonly used trick for improving calibration that generates a soft label by taking a weighted average of the uniform distribution and the hard label.
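For concreteness, the three penalties can be sketched as follows (a NumPy sketch under our own naming; the paper's exact implementation may differ):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def one_hot(labels, num_classes):
    return np.eye(num_classes)[labels]

def brier_loss(logits, labels, num_classes):
    # BL: squared error between the softmax output and the one-hot target,
    # summed over classes and averaged over the batch (a proper scoring rule).
    p = softmax(logits)
    return float(np.mean(np.sum((p - one_hot(labels, num_classes)) ** 2, axis=-1)))

def entropy_regularized_loss(base_loss, logits, beta):
    # ERL: subtract beta * H(p) so that confident (low-entropy)
    # output distributions are penalized.
    p = softmax(logits)
    entropy = float(np.mean(-np.sum(p * np.log(p + 1e-12), axis=-1)))
    return base_loss - beta * entropy

def smooth_labels(labels, num_classes, eps=0.1):
    # LS: weighted average of the hard one-hot label and the uniform distribution.
    return (1.0 - eps) * one_hot(labels, num_classes) + eps / num_classes
```

Note that a uniform prediction maximizes the entropy term, so ERL pushes the model away from saturated softmax outputs rather than toward any particular class.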

Data Augmentations
Data augmentations have been widely used to improve a model's calibration performance in computer vision (Hongyi Zhang, 2018; Hendrycks et al., 2020; Wang et al., 2021). However, text augmentations are often overlooked in the literature on calibration in NLP tasks. To the best of our knowledge, we are the first to extensively study how text augmentation techniques such as Synonym Replacement (SR; Kolomiyets et al., 2011), Easy Data Augmentation (EDA; Wei and Zou, 2019), and An Easier Data Augmentation (AEDA; Karimi et al., 2021) affect calibration performance. We also investigate a recent variant of MixUp (Zhang and Vaidya, 2021).

SR randomly chooses n words from the input sentence, excluding stop words, and then replaces each of these words with one of its synonyms chosen using WordNet (Miller, 1995).
EDA is a token-level augmentation method that consists of four random transformations: SR, Random Deletion, Random Swap, and Random Insertion.
AEDA only uses a Random Insertion operator that inserts punctuation marks (i.e., ".", ",", "!", "?", ";", ":") into an input sentence.

MixUp (Hongyi Zhang, 2018) is a data augmentation strategy using convex interpolations of inputs and their accompanying labels. Guo et al. (2019) investigate word- and sentence-level MixUp strategies to apply MixUp to recurrent neural networks. Zhang and Vaidya (2021) propose MixUp-CLS, which performs MixUp on the pooled [CLS] token embedding vector of the last attention layer of a PLM. MixUp-CLS shows improved accuracy on natural language understanding (NLU) tasks compared to word-level MixUp. Unless otherwise specified, we use MixUp-CLS in our experiments.
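The embedding-level mixing at the heart of MixUp-CLS can be sketched as follows (a minimal sketch; in practice the [CLS] vectors come from the PLM's last attention layer and the mixed pair is fed to the classification head, and the Beta parameter alpha is a tunable hyperparameter):

```python
import numpy as np

def mixup_cls(cls_embeddings, soft_labels, alpha=0.2, rng=None):
    """Mix pooled [CLS] embeddings and their (soft) labels within a batch
    using a Beta(alpha, alpha) interpolation coefficient."""
    rng = np.random.default_rng(0) if rng is None else rng
    lam = rng.beta(alpha, alpha)                  # interpolation coefficient
    perm = rng.permutation(len(cls_embeddings))   # partner examples in the batch
    mixed_x = lam * cls_embeddings + (1.0 - lam) * cls_embeddings[perm]
    mixed_y = lam * soft_labels + (1.0 - lam) * soft_labels[perm]
    return mixed_x, mixed_y, lam
```

Because the labels are mixed with the same coefficient as the embeddings, the resulting soft targets remain valid probability distributions.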

Ensembles
Ensemble techniques utilize $M$ models by combining them into an aggregate model and averaging their predictions to produce calibrated outputs:

$$\bar{p}(y|x) = \frac{1}{M} \sum_{m=1}^{M} f_m(y|x).$$

We compare the deterministic model with three ensemble approaches; the computational cost of the ensemble methods used in the experiments is reported in Appendix A.

In the original MIMO, the $M$ inputs (images) $\{x_m\}_{m=1}^{M}$ are sampled from $D_{\text{train}}$. MIMO concatenates the multiple inputs along the channel dimension before the first convolution layer and produces multiple outputs using $M$ independent output dense layers. The feature extractor of the CNN remains unchanged. During training, all ensemble members receive the same mini-batch inputs with probability $p$, and the inputs are independently sampled from the training dataset with probability $1 - p$.
When applying MIMO to PLMs, the following consideration arises: if the multiple inputs are concatenated before the embedding layer, the token sequence becomes $M$ times longer. Thus, applying MIMO to PLMs in this manner is inefficient for datasets consisting of long sentences.
Instead, we modify the original configuration of MIMO so that it can be applied to various NLP tasks. For a PLM, the output of the first attention layer $z$ is calculated by averaging the outputs of $M$ independent first attention blocks $\{g_1^m\}_{m=1}^{M}$:

$$z = \frac{1}{M} \sum_{m=1}^{M} g_1^m(z_{\text{embed}}).$$

To produce multiple predictions, we use $M$ modules consisting of the last attention blocks $\{g_L^m\}_{m=1}^{M}$ and the dense layer $h$. The ensemble prediction is calculated by:

$$\bar{p}(y|x) = \frac{1}{M} \sum_{m=1}^{M} h\!\left(g_L^m(\tilde{g}(z))\right),$$

where $\tilde{g} = \{g_2, \ldots, g_{L-1}\}$ denotes the shared attention blocks.
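A minimal forward-pass sketch of this modified MIMO, with toy linear maps standing in for the attention blocks and a shared head h (both assumptions made purely for illustration):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def linear(rng, d_in, d_out):
    # toy stand-in for an attention block / dense layer
    W = rng.normal(size=(d_in, d_out))
    return lambda x: x @ W

def mimo_plm_forward(z_embed, first_blocks, shared_blocks, last_blocks, head):
    # average the outputs of the M independent first blocks g_1^m
    z = np.mean([g(z_embed) for g in first_blocks], axis=0)
    for g in shared_blocks:                      # shared trunk g_2 .. g_{L-1}
        z = g(z)
    # M member predictions from the independent last blocks g_L^m,
    # passed through the shared head h and ensemble-averaged
    member_probs = [softmax(head(g(z))) for g in last_blocks]
    return np.mean(member_probs, axis=0)

# toy instantiation: M = 3 members, 4-dim features, 2 classes, batch of 5
rng = np.random.default_rng(0)
M, d, k = 3, 4, 2
head = linear(rng, d, k)                         # shared dense layer h
probs = mimo_plm_forward(
    z_embed=rng.normal(size=(5, d)),
    first_blocks=[linear(rng, d, d) for _ in range(M)],
    shared_blocks=[linear(rng, d, d), linear(rng, d, d)],
    last_blocks=[linear(rng, d, d) for _ in range(M)],
    head=head,
)
```

Averaging the first-block outputs keeps the shared trunk's sequence length unchanged, which is precisely what avoids the M-times-longer input problem noted above.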

Experiments
This section presents the experimental results of the calibration methods. We describe the experimental datasets and settings (Sections 5.1 and 5.2), followed by empirical results for the low-resource regime (Section 5.3), the overall calibration results (Section 5.4), and a detailed analysis (Section 5.5). We then introduce the training procedure of CALL in Section 6. In our experiments, we set RoBERTa trained with CE as the baseline. Unless otherwise specified, ensemble and augmentation methods are applied to the baseline.

Datasets and Metrics
Dataset. Following Zhou et al. (2021), we use the following three text classification datasets. Data statistics are described in Table 1.

• 20 Newsgroups (20NG; Lang, 1995) is a topic categorization dataset containing news articles in 20 categories.
• SST2 (Socher et al., 2013) is a binary sentiment classification dataset of movie reviews.

• TREC (Voorhees and Tice, 2000) is a dataset for question classification; we use its coarse-grained version with six classes.
To evaluate the effectiveness of the calibration methods in the data scarcity setting, we use 10% of the training set.
Metrics. We measure ECE and NLL for each calibration method. For ECE, we bin the predictions into T = 15 equidistant intervals. For convenience, we report ECE and NLL multiplied by 10^2 in all experimental results.

Training Configurations
We implement our framework upon Huggingface's Transformers (Wolf et al., 2020) and build the text classifiers based on RoBERTa (roberta-base) in the main experiment. All models are optimized with the Adam optimizer (Kingma and Ba, 2017) with a weight decay rate of 0.01, a warmup proportion of 0.1, a batch size of 16, a dropout rate of 0.1, and an initial learning rate of 1e-5. We fine-tune RoBERTa for 10 epochs. For each calibration method, hyper-parameters are tuned according to classification performance; the detailed hyper-parameter settings are described in Appendix B. We also provide empirical results for BERT (bert-base-cased) in Appendix C. We report the average performance over 5 runs with different random seeds, and our implementation is available at https://github.com/kimjeyoung/PLM_CALL.

Result for Low-resource Regime
Table 2 presents the classification accuracy and calibration performance for each dataset in the low-resource regime. Most calibration strategies perform better than the baseline, even in cases where the baseline calibration results were already good, e.g., on TREC. These results demonstrate that the existing methods can enhance a PLM's calibration ability when the annotation budget is small, as in many real-world settings. Interestingly, the augmentation methods, except for AEDA, also yield a calibration benefit. For example, MixUp and EDA show improved calibration performance on all datasets compared to the baseline.
Among the confidence penalty losses, BL significantly reduces ECE on the three datasets. Moreover, calibration performance is further improved when BL is combined with an additional regularization method (i.e., BL+ERL and BL+LS). However, BL+LS and BL+ERL underperform the baseline with respect to accuracy, and this performance drop is also observed when they are applied to BERT (Appendix C).
DE not only shows the most remarkable improvement in NLL but also improves accuracy on all datasets. MIMO also consistently outperforms the baseline on ECE. In summary, DE and MIMO are more effective than the other calibration methods when considering both accuracy and calibration in the low-resource regime.

Overall Result
The overall performance results are reported in Table 3. Similar to the results in

Analysis
Our empirical results raise the following questions: (1) Why do EDA and MixUp show better calibration performance than SR or AEDA? (2) How can we improve the accuracy of BL+ERL? (3) Why are ensemble methods more effective than regularization methods in the low-resource setting, whereas BL+ERL is most effective in the full-data setting? We conduct a detailed analysis focusing on these questions.
Role of Data Augmentation. Although a PLM trained with a proper scoring rule reduces the calibration error on the training dataset, minimizing calibration errors for all unseen ID samples is challenging because we use finite training data (Liu et al., 2020).
As an alternative, if models trained with augmented samples learn diverse representations, we can expect the distribution of the training data to match the distribution of unseen ID data. We analyze the distance between the unseen and training data distributions, assuming that an augmentation scheme that pulls the training data distribution towards the unseen data distribution will be effective for calibration.
To measure the distance between the two distributions, we use the Hausdorff-Euclidean distance. In Table 4, RoBERTa trained with MixUp shows the smallest distance between training data and test data, followed by EDA. In addition, the augmented data generated by MixUp and EDA are far away from the original training data, which can be interpreted as EDA and MixUp generating more diverse patterns of representations. Hence, matching the distribution of observed data with the distribution of unseen data by adopting a proper augmentation method that generates diverse patterns may help the model produce calibrated predictions.
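A Hausdorff distance between two sets of representation vectors can be sketched as follows (our own minimal implementation of the standard symmetric Hausdorff distance with Euclidean ground metric; the paper does not spell out the exact variant used):

```python
import numpy as np

def hausdorff_distance(A, B):
    """Symmetric Hausdorff distance between point sets A (n, d) and B (m, d):
    the larger of the two directed nearest-neighbour maxima."""
    A, B = np.asarray(A, float), np.asarray(B, float)
    pairwise = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    d_ab = pairwise.min(axis=1).max()   # farthest A-point from set B
    d_ba = pairwise.min(axis=0).max()   # farthest B-point from set A
    return float(max(d_ab, d_ba))
```

Applied to representations extracted at the last attention layer, a small value indicates that the two sets of embeddings cover each other closely.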
On the other hand, since data augmentation generally helps to improve accuracy, we investigate whether the augmentation methods improve the accuracy of BL+ERL. In Table 5, MixUp improves not only classification accuracy but also calibration performance on all datasets compared to the naive BL+ERL.

Role of Regularization. A crucial empirical observation by Guo et al. (2017) is that overfitting the NLL during training appears to be associated with the miscalibration of DNNs.
To better understand the role of strong regularization, we visualize the NLL during the training of the PLM. In Figure 2, both training and test NLL decrease at the beginning of training regardless of the regularization method. However, as training progresses, the test NLL of RoBERTa trained with CE increases. In contrast, the other regularization methods show an inhibitive effect on overfitting compared to CE.
A DNN can produce over-confident predictions if the network increases the norm of its weights, which results in high-magnitude logits (Mukhoti et al., 2020). Figure 2 (Bottom) shows that RoBERTa trained with CE also has a larger weight norm than the regularized models.

Diversity Analysis in Ensembles. The diversity of predictions in an ensemble is one of the key factors determining calibration performance (Havasi et al., 2021). However, in the presence of overfitting, the diversity of predictions between ensemble members may decrease because the trained individual members would produce similar predictions overfitted to the same training data distribution (Shin et al., 2021). We hypothesize that the ensemble members of DE applied to PLMs may also suffer from overfitting. Thus, we investigate whether the ensemble members are overfitted to the NLL. In Figure 3, DE trained with 10% of the training data shows a different test NLL for each ensemble member, while DE trained with 100% of the training data results in a closer NLL across the ensemble members as training progresses.
According to our experimental results, members within the ensemble often fail to produce different predictions due to overfitting, indicating that additional effective regularization schemes can be adopted to prevent overfitting when applying ensembles to PLMs. This finding also explains why the ensemble techniques show sub-par calibration performance compared to the regularization methods in the full-data setting. We therefore investigate whether the BL+ERL and MixUp methods can compensate for the aforementioned limitation of the ensemble method. We measure the disagreement score (see Havasi et al., 2021) to analyze the degree of diversity in the predictions. As shown in Figure 4, DE shows a high disagreement score in the low-resource regime. When the full data is available, the disagreement score of DE is consistently the lowest on all datasets. However, we observe that MixUp and BL+ERL significantly mitigate the reduction of predictive diversity for DE.
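The disagreement score can be sketched as the average pairwise rate at which two ensemble members predict different classes (a minimal sketch following the definition referenced from Havasi et al., 2021; the exact aggregation in the paper may differ):

```python
import numpy as np

def disagreement_score(member_predictions):
    """member_predictions: (M, N) array of predicted class ids, one row per
    ensemble member. Returns the mean pairwise fraction of inputs on which
    two members predict different classes."""
    preds = np.asarray(member_predictions)
    M = preds.shape[0]
    rates = [(preds[i] != preds[j]).mean()
             for i in range(M) for j in range(i + 1, M)]
    return float(np.mean(rates))
```

A score of 0 means the members are functionally identical (no diversity left to exploit), while higher values indicate the member-level diversity that the ensemble average relies on.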

Calibrated PLMs
Through extensive analyses, we find that (1) MixUp, which generates more diverse patterns, helps improve the accuracy of BL+ERL, and (2) the reduced predictive diversity in the ensemble can be mitigated by BL+ERL and MixUp.
Building on these findings, we report the calibration performance of incrementally applying BL+ERL, MixUp, and the ensemble techniques to the naive RoBERTa. Specifically, we denote BL+ERL+MixUp+DE and BL+ERL+MixUp+MIMO by CALL_DE and CALL_MIMO, respectively.
In Table 6, overall, CALL_DE achieves remarkable performance compared to DE on the SST2 and 20NG datasets. CALL_MIMO shows competitive performance with DE with respect to ECE and NLL. This experiment shows that calibration performance can be improved by combining ensembles, data augmentation, and confidence penalty losses in PLM-based NLP tasks, and that the calibration methods complement each other to further improve calibration performance without compromising accuracy.

Conclusion
In this work, we investigate the calibration behavior of PLMs under various calibration methods. As a result of a comprehensive analysis of how calibration methods work in PLMs, we find that (1) the confidence penalty losses have a trade-off between accuracy and calibration, and (2) ensemble techniques lose predictive diversity as training progresses, resulting in reduced calibration effectiveness. To address these findings, we propose CALL, a combination of BL, ERL, MixUp, and ensemble learning. CALL reduces the risk of accuracy reduction through its data augmentation and ensemble techniques, and enhances the predictive diversity of the ensemble methods by incorporating strong regularization and data augmentation. On multiple text classification datasets, CALL outperforms established baselines, making it a promising candidate as a strong baseline for calibration in text classification tasks.

Limitations
Although the proposed framework achieves significantly improved calibration performance compared to the baselines, CALL still has room for performance improvement and may require more diverse approaches (Zadrozny and Elkan, 2001; Hinton et al., 2014; Mukhoti et al., 2020; Liu et al., 2020). Another limitation is that we only address the ID calibration issue for PLMs. Therefore, it is unclear whether CALL would work well for out-of-distribution detection and generalization tasks. We leave these questions for future research.

Ethics Statement
The reliability of deep learning models is crucial to the stable deployment of real-world NLP applications. For example, computer-aided resume recommendation systems and neural conversational AI systems should produce trustworthy predictions because they are intimately related to the issue of trust in new technologies. In this paper, through extensive empirical analysis, we address diverse calibration techniques and provide a detailed experimental guideline. We hope our work will provide researchers with a new methodological perspective.

Figure 1 :
Figure 1: Reliability diagrams (DeGroot and Fienberg, 1983) on TREC (Li and Roth, 2002) with PLMs. The dashed line indicates perfect calibration, while the PLMs generally show over-confident predictions.
Deep Ensemble (DE; Lakshminarayanan et al., 2017) consists of M randomly initialized models and provides a calibration effect by leveraging the predictive diversity of the ensemble members. When applying DE to PLMs, the M independent models have different initialization weights only in the penultimate layer, since PLMs are initialized with pre-trained weights.

Monte Carlo Dropout (MCDrop; Gal and Ghahramani, 2016) interprets Dropout as an ensemble model, leading to its application for uncertainty estimation by sampling M dropout masks at test time.

Multi-Input and Multi-Output (MIMO). To alleviate the high computational cost and memory inefficiency of DE, Havasi et al. (2021) propose a multi-input multi-output architecture that trains M sub-networks inside a single CNN.

Figure 2 :
Figure 2: The NLL (Top) and the norm of the weights (Bottom) while training RoBERTa on TREC (Left), SST2 (Middle), and 20NG (Right), respectively. The weights are extracted from the penultimate layer of RoBERTa, and we use 10% of the samples for training.

Figure 3 :
Figure 3: The test NLL for DE. Each arrow denotes the point at which the validation accuracy is at its maximum.

Figure 4 :
Figure 4: The diversity of predictions in the ensemble with respect to the regularization methods. Blue: DE; Orange: DE+MixUp; Green: DE+BL+ERL. Results for MIMO and MCDrop are reported in Appendix D. Higher disagreement means that the models within the ensemble make more different predictions.

Table 1 :
Summary of data statistics. l_avg: average sentence length.

Table 2 :
Results for the low-resource regime. For each dataset, all methods are trained with 10% of the training samples. The best results in each category are underlined, and the best results among all methods are indicated in bold. Accuracy is a percentage. We report ECE and NLL multiplied by 10^2.

Table 3 :
Overall calibration results for the calibration techniques. For each dataset, all methods are trained with 100% of the training samples.

Table 4 :
(Left) Distance between the original and augmented sentences for the training samples. Higher is better. (Right) Distance between the augmented training sentences and the original test samples. Lower is better. The distances are computed at the last attention layer of RoBERTa.

Table 5 :
Comparison results for the augmentation methods. Each method is trained with 10% of the training data.

Table 6 :
CALL_MIMO: BL+ERL+MixUp+MIMO. CALL_DE: BL+ERL+MixUp+DE. The best and second-best results are indicated in bold and underline, respectively.

Table 9 :
Results for BERT with diverse calibration techniques in the low-resource regime. The best results are indicated in bold.