On the Effects of Transformer Size on In- and Out-of-Domain Calibration

Large, pre-trained transformer language models, which are pervasive in natural language processing tasks, are notoriously expensive to train. To reduce this cost, prior work has developed smaller, more compact models that achieve significant speedups in training time while maintaining accuracy competitive with the original model on downstream tasks. Though these smaller pre-trained models have been widely adopted by the community, it is not known how well they are calibrated compared to their larger counterparts. In this paper, focusing on a wide range of tasks, we thoroughly investigate the calibration properties of pre-trained transformers as a function of their size. We demonstrate that when evaluated in-domain, smaller models achieve calibration competitive with, and often better than, larger models, while offering significant speedups in training time. Post-hoc calibration techniques further reduce in-domain calibration error for all models. However, when evaluated out-of-domain, larger models tend to be better calibrated, and label smoothing instead is an effective strategy for calibrating models in this setting.


Introduction
Large pre-trained transformer language models like BERT (Devlin et al., 2019; Liu et al., 2019) have revolutionized natural language processing, achieving state-of-the-art results on several tasks. Applying these models to a downstream task consists of two stages: (1) self-supervised pre-training on a large text corpus and (2) supervised fine-tuning on the downstream task. Due to the very large number of parameters of such transformer-based architectures, the high downstream accuracy comes at a large computational cost (Sharir et al., 2020; Bender et al., 2021) during the pre-training stage and also, to a lesser extent, during fine-tuning. To alleviate this computational cost, several models with fewer parameters have been proposed that significantly speed up both the pre-training and the fine-tuning stages (Turc et al., 2019; Lan et al., 2020; Sanh et al., 2019; Sun et al., 2020). For example, the smallest model in Turc et al. (2019) has only 4 million parameters compared to BERT-base's 110 million; this leads to a 65x speedup in pre-training time. It has been widely observed (Turc et al., 2019; Lan et al., 2020) that smaller models achieve comparable downstream task performance with a very significant speedup in training time.
A second issue with pre-trained models with a massive number of parameters is their lack of calibration, i.e., how well the model's confidences (posterior probabilities) are aligned with the empirical likelihoods. In other words, for a calibrated model, the probability assigned to the predicted class label should reflect its ground-truth correctness likelihood. Importantly, in seminal work, Guo et al. (2017) demonstrate that for deep neural architectures, increasing model size negatively affects calibration even as classification accuracy increases. In this paper, we extend this line of inquiry to investigate how calibration depends on model size for pre-trained transformer models. Since miscalibrated models can make highly confident predictions even when they are wrong, especially on out-of-distribution data (Gupta et al.), it is crucial to study model calibration carefully.
Recently, there has been some progress on studying the calibration of deep neural networks and, specifically, pre-trained transformers (Guo et al., 2017; Desai and Durrett, 2020; Kong et al., 2020; Jagannatha et al., 2020). However, a careful study of how the size of the pre-trained model influences calibration is lacking. Given the computational cost of training large transformers like BERT and the increasingly wide adoption of smaller models, it is essential to study the calibration of these variants. In this work, we conduct a thorough empirical study of the calibration properties of smaller transformer architectures of the BERT family on a wide set of classification tasks. The models vary richly in number of layers, number of hidden neurons, and embedding representation. Additionally, we analyze the effects of techniques designed to calibrate models, both during training (e.g., label smoothing) and post hoc (e.g., temperature scaling), on the smaller models, for both in- and out-of-domain datasets. We establish the following results in this paper:

1. When evaluated in-domain, smaller models are as well calibrated as BERT-base, both with and without temperature scaling.

2. When evaluated out-of-domain, smaller models are worse calibrated than BERT-base. This persists, to a lesser extent, even after temperature scaling.

3. Label smoothing, on the other hand, is not effective in-domain, but helps smaller models attain better calibration than BERT-base out-of-domain. It also improves accuracy on out-of-domain data compared to the non-smoothed models.

Background
In this section, we describe how we measure calibration and two techniques that help calibrate models: temperature scaling and label smoothing.

Calibration Metric: We use the following notation: $K$ is the number of classes, $z_i$ denotes the raw logits from the model for the $i$-th example, and $\sigma(\cdot)^{(k)}$ denotes the $k$-th entry of the softmax output $\sigma$, corresponding to the probability of the $k$-th class (for $k \in \{1, \ldots, K\}$). The confidence on the $i$-th example is then $p_i = \max_k \sigma(z_i)^{(k)}$. A model is well calibrated if, in expectation, the confidence on a prediction is aligned with the accuracy of that prediction. The widely adopted Expected Calibration Error (ECE) metric (Guo et al., 2017) measures exactly this: the expected difference between confidence and accuracy. Empirically, this is approximated by dividing the data into $M$ confidence-based bins, i.e., bin $B_m$ (for $m \in \{1, \ldots, M\}$) contains all datapoints $i$ whose confidence $p_i$ falls in the interval $\left(\frac{m-1}{M}, \frac{m}{M}\right]$. If $\mathrm{acc}(B_m)$ and $\mathrm{conf}(B_m)$ denote the average accuracy and average confidence of the points in $B_m$, ECE is defined as

$$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|,$$

where $|B_m|$ denotes the number of datapoints in $B_m$ and $n = \sum_{m=1}^{M} |B_m|$ is the total number of samples. In our experiments we set $M = 10$. Reliability diagrams are a popular graphical representation of calibration: they plot the bucket-wise accuracies $\mathrm{acc}(B_m)$ against the confidences $\mathrm{conf}(B_m)$. The identity line denotes perfect calibration; the greater the deviation from the identity line, the worse the miscalibration of the model.

Post-hoc calibration: The calibration properties of a model can be evaluated directly out-of-box (OOB) based on the softmax scores of the model's predictions. Temperature scaling is designed to improve the calibration of a model after training. It rescales the logits $z_i$ by a factor of $T$ before applying the softmax $\sigma$. On the $i$-th example, the new confidence prediction is $q_i = \max_k \sigma(z_i / T)^{(k)}$. As $T \to \infty$, $q_i \to 1/K$, which is the uniform distribution with maximum uncertainty.
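The binned ECE computation described above can be sketched in a few lines of NumPy. This is an illustrative implementation of the standard equal-width-bin estimator, not the authors' code; `confidences` holds each example's $p_i$ and `correct` holds 0/1 correctness indicators.

```python
import numpy as np

def expected_calibration_error(confidences, correct, M=10):
    """ECE with M equal-width confidence bins, as in Guo et al. (2017)."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    n = len(confidences)
    ece = 0.0
    for m in range(1, M + 1):
        # Bin B_m holds points with confidence in ((m-1)/M, m/M]
        in_bin = (confidences > (m - 1) / M) & (confidences <= m / M)
        if in_bin.sum() == 0:
            continue
        acc = correct[in_bin].mean()       # acc(B_m)
        conf = confidences[in_bin].mean()  # conf(B_m)
        ece += (in_bin.sum() / n) * abs(acc - conf)
    return ece
```

The same per-bin (acc, conf) pairs are what a reliability diagram plots against the identity line.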
As $T \to 0$, the probability collapses to a point mass ($q_i = 1$), and if $T = 1$, $p_i = q_i$. The optimal temperature $T$ is tuned on the dev-set by a line-search algorithm. Label Smoothing (Szegedy et al., 2016) modifies the fine-tuning procedure to address overconfident predictions. While maximum likelihood estimation (MLE) sharpens the model's posterior distribution around the target labels, label smoothing introduces uncertainty to smooth the posterior over the labels. Label smoothing constructs a new target vector from the one-hot target vector, with probability $1 - \alpha$ on the target label and $\frac{\alpha}{K-1}$ on each of the other labels. The cross-entropy loss is then minimized in the standard manner between the model predictions and the modified target vectors. Label smoothing has been shown to implicitly calibrate neural networks (Müller et al., 2019), and Desai and Durrett (2020) show it is effective for calibrating models on out-of-distribution data.
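The two techniques above amount to small transformations of logits and targets; a minimal sketch (illustrative only, not the paper's code) of temperature-scaled softmax and label-smoothed target construction:

```python
import numpy as np

def softmax(z, T=1.0):
    """Softmax with temperature T; T > 1 flattens the distribution."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def smoothed_targets(labels, K, alpha=0.1):
    """Targets with 1 - alpha on the gold label and alpha/(K-1) elsewhere."""
    t = np.full((len(labels), K), alpha / (K - 1))
    t[np.arange(len(labels)), labels] = 1 - alpha
    return t
```

For example, with $K = 3$ and $\alpha = 0.1$, a gold label of class 0 becomes the target vector $(0.9, 0.05, 0.05)$, and training minimizes cross-entropy against this vector instead of the one-hot target.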

Models
We consider a family of smaller pre-trained transformer models from Turc et al. (2019), with the number of layers (L) ranging from 2 to 12 and the number of hidden neurons (H) ranging from 128 to 768. This family of models allows us to carefully study calibration as a function of L and H, since the other factors, like training data and architecture type, are constant across them. We focus on 5 models: Tiny (L=2, H=128), Mini (L=4, H=256), Small (L=4, H=512), Medium (L=8, H=512), and BERT-base (L=12, H=768). We evaluate on natural language inference (SNLI, MNLI), paraphrase detection (QQP, TwitterPPDB), and sentiment classification, where SST-2 (Socher et al., 2013) is used in the binary classification setting, with movie reviews assigned positive or negative labels. For details on the datasets, refer to Appendix A.

In Domain Calibration
For each of the different datasets, we fine-tune the various models on the train-set and evaluate their calibration error on the test-set. Additionally, we calibrate each model in-domain through temperature scaling, where the optimal T is tuned on the dev-set. Table 1 shows the accuracies and the ECE for the various models on the different datasets. We see that models with far fewer parameters than BERT-base have competitive accuracy as well as competitive (and sometimes better) calibration compared to BERT-base. This holds even after temperature scaling, which reduces the ECE for all the models. Fig. 1 shows the reliability diagram for MNLI, where we see that the different smaller models are as well calibrated as BERT-base.
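The dev-set tuning of $T$ can be done with a simple line search. The sketch below (an illustration under the common choice of minimizing dev-set negative log-likelihood; the paper does not specify its exact search criterion or grid) picks the temperature from a grid that best fits held-out logits:

```python
import numpy as np

def nll(logits, labels, T):
    """Mean negative log-likelihood of temperature-scaled logits."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()

def tune_temperature(dev_logits, dev_labels, grid=np.arange(0.5, 10.01, 0.05)):
    """Line search over a grid of temperatures, minimizing dev-set NLL."""
    return min(grid, key=lambda T: nll(dev_logits, dev_labels, T))
```

Because scaling logits by $T$ does not change the arg-max, accuracy is unchanged; only the confidences (and hence ECE) move.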

Out-of-domain calibration
We further investigate the effect of model size on calibration for out-of-domain data. For natural language inference, all models are fine-tuned on the SNLI train-set and evaluated on the MNLI test-set. For paraphrase detection, all models are fine-tuned on the QQP train-set and evaluated on the TwitterPPDB test-set. We also investigate the effect on calibration of (1) temperature scaling (where the optimal temperature is chosen based on dev-set performance in the source domain: SNLI or QQP) and (2) label smoothing with α = 0.1.
In the reliability diagram in Fig. 2 and in Table 2, we see that smaller models suffer from higher calibration error (ECE) on out-of-domain data, whether evaluated out-of-box (OOB) or with temperature scaling (TS). The gap in ECE between smaller models and BERT-base is more severe for the SNLI-to-MNLI transfer. However, label smoothing is very effective in the out-of-domain setting: it significantly reduces the calibration error of all models in general, but helps the smaller models more, as seen in both transfer tasks. (We also tried label smoothing in-domain, but it gave worse results than temperature scaling across all models.) Additionally, label smoothing improves accuracy for all models compared to their OOB counterparts.

Conclusion
We presented a thorough empirical study of the effects of model size on the calibration of pre-trained transformers, both in- and out-of-domain. In-domain, smaller models are as well calibrated as BERT-base, with and without temperature scaling. Out-of-domain, smaller models are worse calibrated than BERT-base, but label smoothing is an effective remedy in this setting, improving both calibration and accuracy.

A Dataset Details
Since the GLUE tasks (Wang et al., 2018) do not have an annotated public test-set, we split the dev-set equally, such that one half forms the new dev-set and the other half forms the test-set. The dev-set is used for hyper-parameter selection. Table 3 shows the details for each of the datasets considered.
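A minimal sketch of this split (the paper does not specify whether the halves are shuffled or how they are seeded; the random shuffle and seed below are assumptions for illustration):

```python
import random

def split_dev(dev_examples, seed=42):
    """Split a GLUE dev set into two equal halves: a new dev set and a test set."""
    rng = random.Random(seed)  # fixed seed assumed, for reproducibility
    idx = list(range(len(dev_examples)))
    rng.shuffle(idx)
    half = len(idx) // 2
    new_dev = [dev_examples[i] for i in idx[:half]]
    new_test = [dev_examples[i] for i in idx[half:]]
    return new_dev, new_test
```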

B Hyper-parameter Selection
All models are used from the HuggingFace Transformers library (Wolf et al., 2019). All models are fine-tuned for 2 to 4 epochs, with the best value chosen on the basis of accuracy on the dev-set. We set the batch size to 16, with a learning rate of 2e-5, gradient clipping at 1.0, and no weight decay. All models are optimized using AdamW (Loshchilov and Hutter, 2018).