The Art of Abstention: Selective Prediction and Error Regularization for Natural Language Processing

In selective prediction, a classifier is allowed to abstain from making predictions on low-confidence examples. Though this setting is interesting and important, selective prediction has rarely been examined in natural language processing (NLP) tasks. To fill this void in the literature, we study in this paper selective prediction for NLP, comparing different models and confidence estimators. We further propose a simple error regularization trick that improves confidence estimation without substantially increasing the computation budget. We show that recent pre-trained transformer models simultaneously improve both model accuracy and confidence estimation effectiveness. We also find that our proposed regularization improves confidence estimation and can be applied to other relevant scenarios, such as using classifier cascades for accuracy–efficiency trade-offs. Source code for this paper can be found at https://github.com/castorini/transformers-selective.


Introduction
Recent advances in deep learning models have pushed the frontier of natural language processing (NLP). Pre-trained language models based on the transformer architecture (Vaswani et al., 2017) have improved the state-of-the-art results on many NLP applications. Naturally, these models are deployed in various real-world applications. However, one may wonder whether they are always reliable: Guo et al. (2017) point out that modern neural networks, while having better accuracy, tend to be overconfident compared to simple networks from 20 years ago.
In this paper, we study the problem of selective prediction (Geifman and El-Yaniv, 2017), where the model is allowed to abstain from predicting on uncertain examples, trading coverage for a lower error rate. This is a practical setting in many realistic scenarios, such as making entailment judgments for breaking news articles in search engines (Carlebach et al., 2020) and making critical predictions in medical and legal documents (Zhang et al., 2019). In these cases, it is totally acceptable, if not desirable, for the models to admit their uncertainty and call for help from humans or better (but more costly) models.
Under the selective prediction setting, we construct a selective classifier by pairing a standard classifier with a confidence estimator. The confidence estimator measures how confident the model is for a certain example, and instructs the classifier to abstain on uncertain ones. Naturally, a good confidence estimator should have higher confidence for correctly classified examples than for incorrect ones. We consider two choices of confidence estimators: softmax response (SR; Hendrycks and Gimpel, 2017) and Monte-Carlo dropout (MC-dropout; Gal and Ghahramani, 2016). SR interprets the output of the final softmax layer as a probability distribution and the highest probability as the confidence. MC-dropout repeats the inference process multiple times, each time with a different dropout mask, and treats the negative variance of the maximum probability as the confidence. Confidence estimation is critical to selective prediction, and therefore studying this problem also helps relevant tasks such as active learning (Cohn et al., 1995; Shen et al., 2018) and early exiting (Schwartz et al., 2020; Xin et al., 2020; Zhou et al., 2020; Xin et al., 2021).
In this paper, we compare the selective prediction performance of different NLP models and confidence estimators. We also propose a simple trick, error regularization, which can be applied to any of these models and confidence estimators and improves their selective prediction performance. We further study selective prediction in a variety of interesting applications, such as classification with no valid labels (the no-answer problem) and using classifier cascades for accuracy-efficiency trade-offs. Experiments show that recent powerful NLP models such as BERT (Devlin et al., 2019) and ALBERT (Lan et al., 2020) improve not only accuracy but also selective prediction performance; they also demonstrate the effectiveness of the proposed error regularization, which produces better confidence estimators that reduce the area under the risk-coverage curve by 10%.

Related Work
Selective prediction has been studied by the machine learning community for a long time (Chow, 1957;El-Yaniv and Wiener, 2010). More recently, Geifman and El-Yaniv (2017, 2019) study selective prediction for modern deep learning models, though with a focus on computer vision tasks.
Selective prediction is closely related to confidence estimation, as well as, albeit more remotely, out-of-domain (OOD) detection (Schölkopf et al., 2000) and prediction error detection (Hendrycks and Gimpel, 2017). There have been many different methods for confidence estimation. Bayesian methods such as Markov Chain Monte Carlo (Geyer, 1992) and variational inference (Hinton and Van Camp, 1993; Graves, 2011) assume a prior distribution over model parameters and obtain confidence estimates through the posterior. Ensemble-based methods (Gal and Ghahramani, 2016; Lakshminarayanan et al., 2017) estimate confidence based on statistics of the ensemble model's output. These methods, however, are computationally practical only for small models. Current large-scale pre-trained NLP models, such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019), are too expensive to run inference with multiple times, and therefore require lightweight confidence estimation.
Previously, selective prediction and confidence estimation have been studied in limited NLP scenarios. Dong et al. (2018) train a separate confidence scoring model to explicitly estimate confidence in semantic parsing. Kamath et al. (2020) introduce selective prediction for OOD question answering, where abstention is allowed for OOD and difficult questions. However, selective prediction for broader NLP applications has yet to be explored, and we hope to draw the attention of the NLP community to this problem. There are two notable related topics, confidence calibration and unanswerable questions, but the difference between them and selective prediction is still nontrivial. Calibration (Guo et al., 2017; Jiang et al., 2018; Kumar et al., 2018; Wang et al., 2020; Desai and Durrett, 2020) focuses on adjusting the overall confidence level of a model, while selective prediction is based on relative confidence among the examples. For example, the most widely used calibration technique, temperature scaling (Platt, 1999), globally increases or decreases the model's confidence on all examples, but the ranking of the examples' confidence is unchanged. Unanswerable questions are considered in previous datasets, e.g., SQuAD 2.0 (Rajpurkar et al., 2018). Unanswerable questions are impossible to answer even for humans, while abstention in selective prediction is due to model uncertainty rather than model-agnostic data uncertainty.

Background
We introduce relevant concepts about selective prediction and confidence estimators, using multi-class classification as an example.

Selective Prediction
Given a feature space X and a set of labels Y, a standard classifier f is a function f : X → Y. A selective classifier is another function h : X → Y ∪ {⊥}, where ⊥ is a special label indicating abstention from prediction. Normally, the selective classifier is composed of a pair of functions h = (f, g), where f is a standard classifier and g is the selective function g : X → {0, 1}. Given an input x ∈ X, the output of the selective classifier is as follows:

h(x) = f(x) if g(x) = 1;  h(x) = ⊥ if g(x) = 0,    (1)

and we can see that the output of g controls prediction versus abstention. In most cases, g consists of a confidence estimator ĝ : X → R and a confidence threshold θ:

g(x) = 1[ĝ(x) > θ].    (2)

ĝ(x) indicates how confident the classifier f is on the example x, and θ controls the overall prediction versus abstention level. A selective classifier makes trade-offs between coverage and risk. Given a labeled dataset S = {(x_i, y_i)}_{i=1}^{n} ⊂ X × Y and an error function L to calculate each example's error l_i = L(f(x_i), y_i), the coverage and the selective risk of a classifier h = (f, g) on S are, respectively,

coverage(h) = (1/n) Σ_{i=1}^{n} g(x_i),    (3)

risk(h) = Σ_{i=1}^{n} l_i g(x_i) / Σ_{i=1}^{n} g(x_i).    (4)

The selective classifier aims to minimize the selective risk at a given coverage. The performance of a selective classifier h = (f, g) can be evaluated by the risk-coverage curve (RCC; El-Yaniv and Wiener, 2010), which is drawn by varying the confidence threshold θ (see Figure 2 for an example). Quantitatively, the area under the curve (AUC) of the RCC measures the effectiveness of a selective classifier (AUC in this paper always refers to the AUC of RCCs). In order to minimize the AUC of the RCC, the selective classifier should, intuitively, output g(x) = 1 for correctly classified examples and g(x) = 0 for incorrect ones. Therefore, an ideal ĝ has the following property: ∀(x_i, y_i), (x_j, y_j) ∈ S, ĝ(x_i) ≤ ĝ(x_j) iff l_i ≥ l_j.
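To make the definitions concrete, coverage, selective risk, and RCC-AUC can be sketched in a few lines. This is an illustrative sketch rather than the paper's released implementation (the function names are ours), assuming a list of confidence scores ĝ(x_i) and per-example errors l_i, and defining the risk at zero coverage as 0:

```python
def coverage_and_risk(confidences, errors, theta):
    """Coverage and selective risk (Equations 3-4) at threshold theta.

    confidences[i] is the confidence estimate for example i; errors[i] is
    its error l_i (e.g., 0-1 error). Predictions are made where g(x) = 1,
    i.e., where the confidence exceeds theta."""
    selected = [c > theta for c in confidences]
    n_selected = sum(selected)
    coverage = n_selected / len(confidences)
    if n_selected == 0:
        return 0.0, 0.0  # full abstention: define the risk as 0
    risk = sum(e for e, s in zip(errors, selected) if s) / n_selected
    return coverage, risk

def rcc_auc(confidences, errors):
    """Area under the risk-coverage curve, traced by sweeping the
    threshold theta over all observed confidence values."""
    thetas = [float("-inf")] + sorted(set(confidences))
    points = sorted(coverage_and_risk(confidences, errors, t) for t in thetas)
    auc = 0.0
    for (c0, r0), (c1, r1) in zip(points, points[1:]):
        auc += (r0 + r1) / 2.0 * (c1 - c0)  # trapezoidal rule over coverage
    return auc
```

A perfectly ranked confidence estimator concentrates all errors at the low-coverage end of the sweep, yielding the minimal attainable AUC for a given classifier.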
We propose the following metric, reversed pair proportion (RPP), to evaluate how far the confidence estimator ĝ is from ideal, given the labeled dataset S of size n:

RPP = (1/n²) Σ_{1 ≤ i, j ≤ n} 1[ĝ(x_i) < ĝ(x_j), l_i < l_j].    (5)

RPP measures the proportion of example pairs with a reversed confidence-error relationship, and the n² in the denominator normalizes the value. An ideal confidence estimator has an RPP value of 0.
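RPP admits a direct O(n²) implementation by counting mis-ranked pairs, which is fine for evaluation-sized datasets; a minimal sketch (function name ours):

```python
def rpp(confidences, errors):
    """Reversed pair proportion: the fraction of ordered pairs (i, j)
    where example i is less confident yet has lower error than j."""
    n = len(confidences)
    reversed_pairs = sum(
        1
        for i in range(n)
        for j in range(n)
        if confidences[i] < confidences[j] and errors[i] < errors[j]
    )
    return reversed_pairs / (n * n)
```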

Confidence Estimators
In most cases for multi-class classification, the last layer of the classifier is a softmax activation, which outputs a probability distribution P(y) over the set of labels Y, where y ∈ Y is a label. In this case, the classifier can be written as

f(x) = ŷ = argmax_{y ∈ Y} P(y),    (6)

where ŷ is the label with the highest probability. Perhaps the most straightforward and popular choice for the confidence estimator is softmax response (Hendrycks and Gimpel, 2017):

ĝ_SR(x) = P(ŷ).    (7)

Alternatively, we can use the difference between the probabilities of the top two classes for confidence estimation. We refer to this method as PD (probability difference). Gal and Ghahramani (2016) argue that "softmax outputs are often erroneously interpreted as model confidence", and propose to use MC-dropout as the confidence estimator. In MC-dropout, P(ŷ) is computed a total of R times, using a different dropout mask each time, producing P_1(ŷ), P_2(ŷ), ..., P_R(ŷ). Their variance is used to estimate the confidence:

ĝ_MC(x) = −Var[P_1(ŷ), P_2(ŷ), ..., P_R(ŷ)].    (8)

We use the negative sign here because a larger variance indicates greater uncertainty, i.e., lower confidence (Geifman and El-Yaniv, 2017; Kamath et al., 2020). By using different dropout masks, MC-dropout is equivalent to using an ensemble for confidence estimation, but does not require actually training and storing multiple models. Nevertheless, compared to SR, the inference cost of MC-dropout is multiplied by R, which can be a problem when model inference is expensive.
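The three estimators can be sketched as follows. This is a pure-Python illustration rather than our actual implementation: in practice the logits come from a neural model, and `stochastic_forward` stands in for a model run with dropout left on. For MC-dropout we take ŷ from the mean probability over the R passes, which is one reasonable choice.

```python
import math
import random
import statistics

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sr_confidence(logits):
    """Softmax response: the maximum softmax probability."""
    return max(softmax(logits))

def pd_confidence(logits):
    """Probability difference: gap between the top-two class probabilities."""
    top2 = sorted(softmax(logits), reverse=True)[:2]
    return top2[0] - top2[1]

def mc_dropout_confidence(stochastic_forward, x, runs=10):
    """MC-dropout confidence: negative variance of the predicted class's
    probability across `runs` stochastic forward passes.

    `stochastic_forward` is any callable mapping features to logits with
    dropout noise still active."""
    prob_runs = [softmax(stochastic_forward(x)) for _ in range(runs)]
    num_classes = len(prob_runs[0])
    mean_probs = [sum(p[c] for p in prob_runs) / runs for c in range(num_classes)]
    y_hat = mean_probs.index(max(mean_probs))  # predicted class
    p_hat = [p[y_hat] for p in prob_runs]      # P_r(y-hat) for r = 1..R
    return -statistics.pvariance(p_hat)        # larger variance = lower confidence
```

In a framework such as PyTorch, keeping dropout active at inference time simply means leaving the model in training mode during the R forward passes.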

Regularizers
SR and MC-dropout are often used directly out of the box as confidence estimators. We propose a simple regularization trick that can be easily applied at training (or, for pre-trained models, fine-tuning) time and improves the effectiveness of the induced confidence estimators.
Considering that a good confidence estimator should minimize the RPP defined in Equation 5, we add the following regularizer to the original training loss function:

L_total = H(f(x), y) + λ L_reg,    (9)

L_reg = Σ_{i, j} 1[e_i > e_j] max(0, ĝ_SR(x_i) − ĝ_SR(x_j)).    (10)

Here, H(·, ·) is the task-specific loss function such as cross entropy (H is not the same as the error function L), λ is the hyperparameter for regularization, ĝ_SR is the maximum softmax probability defined in Equation 7, and e_i is the error of example i at the current iteration; details of how to calculate it are explained in the next paragraph. We use SR confidence here because it is easily accessible at training time, while MC-dropout confidence is not. The intuition of this regularizer is as follows: if the model's error on example i is larger than its error on example j (i.e., example i is considered more "difficult" for the model), then the confidence on example i should not be greater than the confidence on example j.
In practice, at each iteration of training (fine-tuning), we can obtain the error e_i in one of the following two ways.
• Current iteration error We simply use the error function L to calculate the error of the example at the current iteration, and use it as e_i. In the case of multi-class classification, L is often chosen as the 0-1 error.
• History record error Since we intend to use e_i to quantify how difficult an example is, we draw inspiration from forgettable examples (Toneva et al., 2019). We calculate example errors with L throughout the training process, and use the error averaged from the beginning to the current iteration as e_i. In this case, e_i takes values in [0, 1].
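Putting the pieces together, the regularized objective can be sketched as a pairwise penalty over examples. The hinge form below is an assumption that matches the stated intuition (confidence on a higher-error example should not exceed confidence on a lower-error one); function names are ours and the actual implementation may differ in details:

```python
def error_regularizer(sr_confidences, errors):
    """Pairwise penalty: whenever example i has larger error than example j,
    penalize any excess of i's SR confidence over j's (hinged at zero)."""
    penalty = 0.0
    n = len(sr_confidences)
    for i in range(n):
        for j in range(n):
            if errors[i] > errors[j]:
                penalty += max(0.0, sr_confidences[i] - sr_confidences[j])
    return penalty

def regularized_loss(task_loss, sr_confidences, errors, lam=0.1):
    """Total training loss: task loss H plus lambda times the regularizer."""
    return task_loss + lam * error_regularizer(sr_confidences, errors)
```

Note that the penalty is zero whenever the confidence ranking already agrees with the error ranking, so well-ranked batches contribute nothing beyond the task loss.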

Practical Approximations
In practice, it is computationally prohibitive to either strictly compute L_reg from Equation 10 for all example pairs, or to calculate the history record error after every iteration. We therefore make the following two approximations. For L_reg from Equation 10, we only consider examples from the mini-batch of the current iteration. For calculating the history record error, we compute and record the error values for the entire training set 10 times per epoch (once after every 10% of the iterations). At each training iteration, we use the average of the error values recorded so far as e_i.
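The history record bookkeeping can be sketched as a small tracker that is updated at each recording checkpoint (10 per epoch in our setup); the class name is ours:

```python
class HistoryErrorTracker:
    """Keeps a running average of each training example's error, updated at
    fixed checkpoints rather than after every iteration."""

    def __init__(self, num_examples):
        self.error_sums = [0.0] * num_examples
        self.num_records = 0

    def record(self, errors):
        """Record the current error of every training example."""
        for i, e in enumerate(errors):
            self.error_sums[i] += e
        self.num_records += 1

    def history_error(self, i):
        """Average error of example i over all records so far; lies in
        [0, 1] when the recorded errors are 0-1 errors."""
        if self.num_records == 0:
            return 0.0
        return self.error_sums[i] / self.num_records
```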

Experiments
We conduct selective prediction experiments on NLP tasks. Since the formulation of selective prediction is model agnostic, we choose the following representative models: (1) BERT-base and BERT-large (Devlin et al., 2019), the dominant transformer-based models of recent years; (2) ALBERT-base (Lan et al., 2020), a variant of BERT featuring parameter sharing and memory efficiency; (3) Long Short-Term Memory (LSTM; Hochreiter and Schmidhuber, 1997), the popular pre-transformer model that is lightweight and fast.
In this section, we compare the performance of selective prediction of these models, demonstrate the effectiveness of the proposed error regularization, and show the application of selective prediction in two interesting scenarios-the no-answer problem and the classifier cascades.

Experiment Setups
We conduct experiments mainly on three datasets: MRPC (Dolan and Brockett, 2005), QNLI (Wang et al., 2018), and MNLI (Williams et al., 2018). In Section 5.4, we will need an additional non-binary dataset, SST-5 (Socher et al., 2013). Statistics of these datasets can be found in Table 2. Following the setting of the GLUE benchmark (Wang et al., 2018), we use the training set for training/fine-tuning and the development set for evaluation (the test set's labels are not publicly available); MNLI's development set has two parts, matched and mismatched (m/mm). These datasets include semantic equivalence judgments, entailment classification, and sentiment analysis, which are important application scenarios for selective prediction as discussed in Section 1.
The implementation is based on PyTorch (Paszke et al., 2019) and the Huggingface Transformers Library (Wolf et al., 2020). Training/fine-tuning and inference are done on a single NVIDIA Tesla V100 GPU. Since we are evaluating the selective prediction performance of different models instead of pursuing state-of-the-art results, we do not extensively tune hyperparameters; instead, most experiment settings such as hidden sizes, learning rates, and batch sizes are kept unchanged from the Huggingface Library. Further setup details can be found in Appendix A.

Comparing Different Models
We compare the selective prediction performance of different models in Table 1. For each model, we report the performance given by the two confidence estimators, softmax response (SR) and MC-dropout (MC); the results of using PD for confidence estimation are very similar to those of SR, and we report them in Appendix B due to space limitations. The accuracy and the F1 score measure the effectiveness of the classifier f, RPP measures the reliability of the confidence estimator ĝ, and AUC is a comprehensive metric for both the classifier and the confidence estimator. The choice of confidence estimator does not affect the model's accuracy. We also provide risk-coverage curves (RCCs) of different models and confidence estimators in Figure 2. MC in the table and the figure uses a dropout rate of 0.01 and R = 10 repetitive runs. We first notice that models with overall higher accuracy also have better selective prediction performance (lower AUC and RPP). For example, compared with LSTM, BERT-base has higher accuracy and lower AUC/RPP on all datasets, and the same applies to the comparison between BERT-base and BERT-large. Since the classifier's effectiveness does not directly affect RPP, the consistent improvement of RPP alongside accuracy indicates that sophisticated models simultaneously improve both model accuracy and confidence estimation. This is in contrast to the discovery by Guo et al. (2017) that sophisticated neural networks, despite having better accuracy, are more easily overconfident and worse calibrated than simple ones.
We also notice that MC-dropout performs consistently worse than softmax response, as shown by both AUC and RPP. This shows that for NLP tasks and models, model confidence estimated by MC-dropout fails to align well with real example difficulty. We further study and visualize in Figure 3 the effect of different dropout rates and different numbers of repetitive runs R on MC-dropout's selective prediction performance. We can see that (1) a dropout rate of 0.01 is a favorable choice: larger dropout rates lead to worse performance, while smaller ones do not improve it; (2) MC-dropout needs at least 20 repetitions to obtain results comparable to SR, which is extremely expensive. Although MC-dropout has a sound theoretical foundation, its practical application to NLP tasks needs further improvements.

Effect of Error Regularization
In this part, we show that our simple regularization trick improves selective prediction performance. In Table 3, we report the accuracy, AUC, and RPP for each model, paired with three different regularizers: no regularization (none), the current error regularizer (curr.), and the history error regularizer (hist.), as described in Section 4. We first see that applying error regularization (either current or history) does not harm model accuracy. There are minor fluctuations, but generally speaking, error regularization has no negative effect on the models' effectiveness.
We can also see that error regularization improves the models' selective prediction performance, reducing AUC and RPP. As we mention in the previous section, AUC is a comprehensive metric for both the classifier f and the confidence estimator ĝ. We therefore focus on this metric in this section, and we bold the lowest AUC in Table 3. We see that error regularization consistently achieves the lowest AUC values, and on average, the best scores are approximately 10% lower than the scores without regularization. This shows that error regularization produces confidence estimators that give better confidence rankings.
The two regularization methods, current error and history error, are similar in quality, with neither outperforming the other across all models and datasets. Therefore, we can conclude only that the error regularization trick improves selective prediction, but the best specific method varies. We leave this exploration for future work.

The No-Answer Problem
In this section, we conduct experiments to see how selective classifiers perform on datasets that either allow abstention or, equivalently, provide the no-answer label. This no-answer problem occurs whenever a trained classifier encounters an example whose label is unseen in training, which is common in practice. For example, in the setting of ultra-fine entity typing with more than 10,000 labels (Choi et al., 2018), it is unsurprising to encounter examples with unseen types. Ideally, in this case, the classifier should choose the no-answer label. This setting is important yet often neglected, and there exist few classification datasets with the no-answer label. We therefore build our own datasets, binarized MNLI and SST-5 (bMNLI and bSST-5), to evaluate different models in this setting (Table 2).
The MNLI dataset is for sentence entailment classification. Given a pair of sentences, the goal is to predict the relationship between them, among three labels: entailment, contradiction, and neutral. The SST-5 dataset is for fine-grained sentence sentiment classification. Given a sentence, the goal is to predict its sentiment, among five labels: strongly positive, mildly positive, strongly negative, mildly negative, and neutral. To convert the original MNLI and SST-5 datasets into our binarized versions bMNLI and bSST-5, we modify the following: for SST-5, we merge strongly and mildly positive/negative into one positive/negative class; for MNLI, we simply regard entailment as positive and contradiction as negative. We then remove all neutral instances from the training set but keep those in the development and test sets. This way, neutral instances in the development and test sets carry a label unseen during training; a good model is expected to assign them low confidence scores, thereby predicting the no-answer label for them.
We report results for these two datasets with the no-answer label in Table 4. Accuracy (Acc), AUC, and RPP have the same meaning as in the previous sections. We also consider a new metric specifically for the no-answer setting, augmented accuracy (Acc*), which is calculated as follows: (1) we make a number of attempts by searching a threshold α from 0.7 to 1.0 in increments of 0.01; (2) for each attempt, we regard all examples with predicted confidence lower than α as neutral, and then calculate the accuracy; (3) among all attempts, we take the highest accuracy as Acc*. Choosing the optimal α requires knowing the ground-truth answers in advance and is not practical in reality. Instead, Acc* indicates how well a model recognizes examples whose label is likely unseen in the training set.
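The Acc* search can be sketched directly from steps (1)-(3); the function name and the `no_answer` label string are illustrative:

```python
def augmented_accuracy(confidences, predictions, labels, no_answer="neutral"):
    """Acc*: best accuracy over thresholds alpha in [0.70, 1.00] (step 0.01),
    treating every prediction whose confidence falls below alpha as the
    no-answer label."""
    best = 0.0
    for step in range(31):  # alpha = 0.70, 0.71, ..., 1.00
        alpha = 0.70 + 0.01 * step
        correct = sum(
            1
            for c, p, y in zip(confidences, predictions, labels)
            if (p if c >= alpha else no_answer) == y
        )
        best = max(best, correct / len(labels))
    return best
```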
We first see that Acc* is consistently higher than Acc in all cases. This is unsurprising, but it demonstrates that unseen samples indeed have lower confidence and shows that introducing the abstention option is beneficial in the no-answer scenario. Also, we observe that error regularization improves the models' selective prediction performance, producing lower AUC/RPP and higher Acc* in most cases. This further demonstrates the effectiveness of the simple error regularization trick.
Secondly, we can see that the improvement of Acc* over Acc is larger in bMNLI than in bSST-5. The reason is that in bMNLI, neutral examples constitute about a third of the entire development set, while in bSST-5 they constitute only a fifth. The improvement is positively correlated with the proportion of neutral examples, since they are assigned lower confidence scores and provide the potential for abstention-based improvements.

Classifier Cascades
In this section, we show how confidence estimation and abstention can be used for accuracy-efficiency trade-offs. We use classifier cascades: we first use a less accurate classifier for prediction, abstain on examples with low confidence, then send them to more accurate but more costly classifiers. Here we choose LSTM and BERT-base to constitute the cascade, but one can also choose other models and more levels of classifiers.
We first use an LSTM for all examples' inference, and then send "difficult" ones to BERT-base. Since the computational cost of the LSTM is negligible compared to BERT-base, the key to efficiency here is correctly picking the "difficult" examples.
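A two-level cascade can be sketched as follows, assuming the cheap model exposes its prediction together with an SR confidence score; the names are illustrative:

```python
def cascade_predict(example, cheap_model, costly_model, threshold):
    """Two-level classifier cascade: answer with the cheap model when its
    confidence clears `threshold`; otherwise fall back to the costly model.

    `cheap_model` returns (prediction, confidence); `costly_model` returns
    a prediction. The second return value records which level answered."""
    prediction, confidence = cheap_model(example)
    if confidence >= threshold:
        return prediction, "cheap"
    return costly_model(example), "costly"
```

Sweeping `threshold` traces out the accuracy-efficiency curve: a higher threshold sends more examples to the costly model, raising both accuracy and average FLOPs.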
In Figure 4, we show the results of accuracy/F1 score versus average FLOPs per inference example. Each curve represents a method of choosing difficult examples: the blue curves are obtained by randomly selecting examples, as a simple baseline. The orange and green curves are obtained by using the SR of the LSTM as the indicator of example difficulty; the orange curves represent the LSTM trained with no regularization, while the green curves use history error regularization. Different points on the curves are obtained by varying the proportion of examples sent to the more accurate model, BERT-base. A curve with a larger area under it indicates a better accuracy-efficiency trade-off.
We can see that the blue curves are basically linear interpolations between the LSTM (the lower-left dot) and BERT-base (the upper-right dot), and this is expected for random selection. The orange and green curves are concave, indicating that using SR for confidence estimation is, unsurprisingly, more effective than random selection. Between these two, the green curves (history error regularization) have larger areas under them than the orange ones (no regularization), i.e., the green curves have better accuracy given the same FLOPs. This demonstrates the effectiveness of error regularization for better confidence estimation.

Conclusion
In this paper, we introduce the problem of selective prediction for NLP. We provide theoretical background and evaluation metrics for the problem, and also propose a simple error regularization method that improves selective prediction performance for NLP models. We conduct experiments to compare different models under the selective prediction setting, demonstrate the effectiveness of the proposed regularization trick, and study two scenarios where selective prediction and the error regularization method can be helpful.
We summarize interesting experimental observations as follows:
1. Recent sophisticated NLP models not only improve accuracy over simple models, but also provide better selective prediction results (better confidence estimation).
2. MC-dropout, despite having a solid theoretical foundation, has difficulties matching the effectiveness of simple SR in practice.
3. The simple error regularization helps models lower their AUC and RPP, i.e., models trained with it produce better confidence estimators.

A Detailed Experiment Settings
The LSTM is randomly initialized without pre-training. For models that require pre-training, we use the following checkpoints provided by the Huggingface Transformers Library (Wolf et al., 2020).
• BERT-BASE-UNCASED
• BERT-LARGE-UNCASED
• ALBERT-BASE-V2

All these models are trained/fine-tuned for 3 epochs without early stopping or checkpoint selection. The learning rate is 2 × 10^−5, the batch size for training/fine-tuning is 32, and the maximum input sequence length is 128. Choices for the regularization hyperparameter λ from Equation 9 are shown in Table 6.
The numbers of parameters for the two models BERT and ALBERT can be found in the paper by Lan et al. (2020).
The LSTM used in the paper is a two-layer bidirectional LSTM with a hidden size of 200. On top of it there is a max-pooling layer and a fully-connected layer.

B PD Confidence Estimator
Probability difference (PD), the difference between the probabilities of the top two classes, can also be used as a confidence estimator. Among the four datasets used in the paper, MRPC and QNLI are binary classification tasks, and therefore PD's results are identical to those of softmax response (SR). SST-5 and MNLI have more than two classes, and therefore PD's results differ from SR's. We show them in Table 5.
We can see that the results of PD are very similar to those of SR. Of course, MNLI and SST-5 have only three and five labels, respectively; for datasets with far more labels, PD may show greater differences from SR.