Towards Improving Selective Prediction Ability of NLP Systems

It’s better to say “I can’t answer” than to answer incorrectly. This selective prediction ability is crucial for NLP systems to be reliably deployed in real-world applications. Prior work has shown that existing selective prediction techniques fail to perform well, especially in the out-of-domain setting. In this work, we propose a method that improves the probability estimates of models by calibrating them using prediction confidence and the difficulty score of instances. Using these two signals, we first annotate held-out instances and then train a calibrator to predict the likelihood of correctness of the model’s prediction. We instantiate our method on Natural Language Inference (NLI) and Duplicate Detection (DD) tasks and evaluate it in both In-Domain (IID) and Out-of-Domain (OOD) settings. In the (IID, OOD) settings, we show that the representations learned by our calibrator result in improvements of (15.81%, 5.64%) and (6.19%, 13.9%) over MaxProb, a selective prediction baseline, on the NLI and DD tasks respectively.


Introduction
In real-world applications, AI systems often encounter novel inputs that differ from their training data distribution. Prior work has shown that even state-of-the-art models tend to make incorrect predictions on such inputs (Elsahar and Gallé, 2019; Miller et al., 2020; Koh et al., 2021; Hendrycks et al., 2021). This raises reliability concerns and hinders their adoption in safety-critical domains such as biomedicine and autonomous robotics. Selective prediction addresses these concerns by enabling systems to abstain from making predictions when they are likely to be incorrect. Avoiding incorrect predictions allows them to maintain high task accuracy and thus makes them more reliable. Hendrycks and Gimpel (2017) proposed 'MaxProb', which uses the maximum softmax probability across all answer candidates as the confidence estimate to selectively make predictions. While performing reasonably well in the in-domain setting, MaxProb and other existing selective prediction techniques fail to translate that performance to the out-of-domain setting (Varshney et al., 2022; Kamath et al., 2020).
In this work, we propose a selective prediction method that improves the probability estimates of models in both in-domain and out-of-domain settings by learning strong representations via calibration. Specifically, we calibrate a model's outputs using a held-out dataset and use the calibrator as the confidence estimator for selective prediction. To this end, we first argue that "all instances are not equally difficult and the model is not equally confident in all its predictions", and then, through extensive experiments, we show that prediction confidence is positively correlated with correctness while difficulty score is negatively correlated (Section 5.2). We leverage this finding to calibrate models' outputs using these two signals.
For computing the difficulty scores, we use a model-based technique (Section 3.1), because human perception of difficulty may not always correlate well with machine interpretation. To calibrate a model, we annotate instances of a held-out dataset conditioned on the model's predictive correctness (a score computed using the difficulty score and the prediction confidence) and then train a calibrator on these instances. The annotation score represents the likelihood of correctness of the model's prediction. Finally, the trained calibrator predicts this likelihood value for test instances and is used as the confidence estimator for selective prediction.
To evaluate the efficacy of our method, we conduct comprehensive experiments in In-Domain (IID) and Out-of-Domain (OOD) settings on Natural Language Inference (NLI) and Duplicate Detection (DD) tasks. We also compare its performance with existing calibration techniques. On the NLI task, our method achieves 15.81% and 5.64% improvement on the AUC of the risk-coverage curve over MaxProb in the IID and OOD settings respectively. Furthermore, on the DD task, it achieves 6.19% and 13.9% improvement in the IID and OOD settings respectively. Finally, we hope that our work will facilitate the development of more robust and reliable AI systems, making their wide adoption in real-world applications possible.

Selective Prediction
Selective prediction enables a system to abstain on instances where it is likely to be incorrect, i.e., it consists of a selector $g$ that determines whether the system should output the prediction. Usually, $g$ comprises a prediction confidence estimator $\tilde{g}$ and a threshold $th$ that controls the abstention level:

$$(f, g)(x) = \begin{cases} f(x), & \text{if } \tilde{g}(x) \geq th \\ \text{Abstain}, & \text{otherwise} \end{cases}$$

where $f$ is the underlying task model. A selective prediction system makes trade-offs between coverage and risk. For a dataset $D$, coverage at a threshold $th$ corresponds to the fraction of answered instances (those with $\tilde{g} \geq th$) and risk is the error on those answered instances.
As $th$ decreases, coverage increases, but the risk usually also increases. The overall selective prediction performance across all thresholds is measured by the area under the risk-coverage curve (AUC) (El-Yaniv et al., 2010). The lower the AUC, the better the system, as it represents a lower average risk across all thresholds.
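As an illustration (not part of the original method description), the AUC of the risk-coverage curve can be computed by sweeping the threshold over the sorted confidence estimates; the sketch below assumes NumPy arrays with the hypothetical names `confidences` and `correct`:

```python
import numpy as np

def risk_coverage_auc(confidences, correct):
    """Area under the risk-coverage curve (lower is better).

    confidences: confidence estimates for each test instance.
    correct: booleans, True where the model's prediction is correct.
    Sweeping th is equivalent to answering instances in decreasing
    order of confidence and tracking risk at each coverage level.
    """
    order = np.argsort(-np.asarray(confidences, dtype=float))
    errors = (~np.asarray(correct, dtype=bool)[order]).astype(float)
    answered = np.arange(1, len(errors) + 1)
    coverage = answered / len(errors)    # fraction of answered instances
    risk = np.cumsum(errors) / answered  # error on the answered set
    return np.trapz(risk, coverage)      # integrate risk over coverage
```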

Method
We propose to train a confidence estimator that can assign higher scores to correctly predicted instances than to incorrectly predicted ones. To this end, we leverage a held-out dataset and annotate its instances conditioned on the model's predictive correctness. Specifically, we run the model on the held-out dataset and annotate instances with a score such that correctly predicted instances get assigned a higher score than incorrectly predicted ones. This annotation score models the likelihood of the prediction being correct and is computed using the model's prediction confidence and the difficulty level of the instance. Finally, a calibrator (a regression model) is trained on this annotated held-out dataset and used as the confidence estimator for selective prediction.
We detail each component of our method and the intuition behind it in the following subsections.

Difficulty Score Computation
To compute the difficulty score of an instance, we evaluate it after every training epoch and subtract the aggregated softmax probability assigned to the ground-truth answer from 1, i.e., for an instance $i$, the difficulty score $d_i$ is calculated as:

$$d_i = 1 - \frac{1}{E}\sum_{j=1}^{E} c_{ji}$$

where the model is trained for $E$ epochs and $c_{ji}$ is the prediction confidence of the correct answer given by the model after the $j$-th training epoch. Note that $c_{ji}$ is the probability assigned to the correct answer, not the maximum probability across all answer candidates. The intuition behind this procedure is that instances that can be consistently answered correctly from the early stages of training are inherently easy and should receive a lower difficulty score than those that require a large number of training steps. A similar method has been explored by Swayamdipta et al. (2020) for analyzing "training dynamics", but here we use it to quantify the difficulty of the held-out instances.
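To make this concrete, here is a minimal sketch of the computation (our own illustration; `epoch_gold_probs` is a hypothetical array name):

```python
import numpy as np

def difficulty_scores(epoch_gold_probs):
    """epoch_gold_probs: array of shape (E, N), where entry (j, i) is the
    probability the model assigns to instance i's gold answer after epoch j.
    Returns d_i = 1 - mean_j(c_ji): instances answered correctly from early
    epochs get scores near 0; persistently hard ones get scores near 1."""
    return 1.0 - np.asarray(epoch_gold_probs).mean(axis=0)
```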

Annotation Score Computation
We define the annotation score for the held-out instances as a function of the softmax probability outputted by the model and the difficulty score. We show that the softmax score is positively correlated while the difficulty score is negatively correlated with predictive correctness, i.e., the system is more likely to be correct if the softmax score is high and the difficulty score is low. Furthermore, in order to justifiably separate the scores for correct and incorrect prediction scenarios in the range 0 to 1, we push the scores above 0.5 in the correct case and below 0.5 in the incorrect case. Concretely, we use the following functions, where $maxProb$ denotes the softmax probability of the prediction and $d_i$ the difficulty score:

$$AS_1 = \begin{cases} 0.5 + maxProb/2 & \text{if prediction is correct} \\ 0.5 - maxProb/2 & \text{otherwise} \end{cases}$$

$$AS_2 = \begin{cases} 0.5 + (1 - d_i)/2 & \text{if prediction is correct} \\ 0.5 - (1 - d_i)/2 & \text{otherwise} \end{cases}$$

$$AS_3 = \begin{cases} 0.5 + (maxProb + 1 - d_i)/4 & \text{if prediction is correct} \\ 0.5 - (maxProb + 1 - d_i)/4 & \text{otherwise} \end{cases}$$

$AS_1$ uses only the softmax score, $AS_2$ uses only the difficulty score, and $AS_3$ uses a combination of both. These annotation strategies assign a relatively higher score when the model's prediction is correct and a lower score when it is incorrect. This gold score ranges from 0 to 1, as both $d_i$ and $maxProb$ lie in the same range, and it captures the likelihood of correctness better than the categorical labels (1 for correct and 0 for incorrect) used in typical calibration approaches. Note that this annotation computation is only required for training the calibrator and not at test time. Therefore, the difficulty scores of test instances need not be computed.
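A small sketch of these annotation functions (our own rendering of the equations above; `max_prob`, `difficulty`, and `is_correct` are hypothetical array names):

```python
import numpy as np

def annotation_scores(max_prob, difficulty, is_correct, strategy="AS3"):
    """Map each held-out instance to a gold score in [0, 1]: above 0.5 when
    the model's prediction is correct, below 0.5 when it is incorrect."""
    max_prob = np.asarray(max_prob, dtype=float)
    difficulty = np.asarray(difficulty, dtype=float)
    if strategy == "AS1":      # softmax signal only
        margin = max_prob / 2
    elif strategy == "AS2":    # difficulty signal only
        margin = (1 - difficulty) / 2
    else:                      # AS3: combine both signals
        margin = (max_prob + 1 - difficulty) / 4
    sign = np.where(np.asarray(is_correct, dtype=bool), 1.0, -1.0)
    return 0.5 + sign * margin
```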
Both the difficulty score and annotation score computation procedures are generic and widely applicable, since NLP systems usually make probabilistic predictions for all kinds of tasks, ranging from classification to question answering.

Calibration
Equipped with annotation scores, we extract syntactic features, namely, sentence lengths, the Semantic Textual Similarity (STS) value, the number of common words between the given sentences, and the presence of negation words/numbers, from the held-out instances to train the calibrator model. These features, along with the maxProb and the prediction outputted by the model, serve as inputs to the calibrator. Finally, we use the random forest implementation of Scikit-learn (Pedregosa et al., 2011) to train our calibrator, which learns strong representations of the inputs. We note that these syntactic features are general and applicable to all language understanding tasks, and any regression model can be used as the calibrator. We compare our method with the other calibration techniques described in Section 4.1.
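For illustration, a calibrator along these lines could be trained as follows; this is a minimal sketch with placeholder data, not the paper's exact feature extraction:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Placeholder held-out data: each row would stack the syntactic features
# (lengths, STS value, word overlap, negation/number flags) with maxProb
# and the model's predicted label; gold_scores would come from AS_3.
rng = np.random.default_rng(0)
features = rng.random((500, 8))   # placeholder feature matrix
gold_scores = rng.random(500)     # placeholder annotation scores in [0, 1]

calibrator = RandomForestRegressor(n_estimators=100, random_state=0)
calibrator.fit(features, gold_scores)

# At test time, the regressor's output serves as the confidence estimate;
# the system answers only when it exceeds the abstention threshold th.
confidence = calibrator.predict(rng.random((10, 8)))
answered = confidence >= 0.5
```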

Calibration Baselines
Kamath et al. (2020) study a calibration-based selective prediction technique for Question Answering datasets, where they annotate a held-out dataset such that correctly predicted instances are assigned the class label '1' and incorrect ones the label '0'. A calibrator is then trained on this annotated binary classification dataset using features such as input length and the probabilities of the top 5 predictions. The softmax probability assigned to class '1' by this calibrator is used as the confidence estimator for selective prediction. We refer to this approach as Calib C. We also train a transformer-based model for calibration (Calib T) that leverages the entire input text for this classification task instead of the syntactic features (Garg and Moschitti, 2021).
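For contrast with the regression-based calibrator above, a minimal sketch of the Calib C baseline (with placeholder features; the original work's exact classifier and feature set may differ) is:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Placeholder held-out features and binary correctness labels
# (1 = the model's prediction was correct, 0 = incorrect).
rng = np.random.default_rng(0)
features, labels = rng.random((500, 8)), rng.integers(0, 2, 500)

calib_c = RandomForestClassifier(n_estimators=100, random_state=0)
calib_c.fit(features, labels)

# The predicted probability of class '1' is the confidence estimate.
confidence = calib_c.predict_proba(rng.random((10, 8)))[:, 1]
```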
Our proposed calibration method differs from these approaches in that we quantify correctness on a continuous scale (instead of the categorical labels '1' and '0') using prediction confidence and instance difficulty, and we use the explicitly provided general syntactic features described in Section 3.3 for training. Our annotation procedure gives the calibrator more flexibility to look for fine-grained features that distinguish the various annotation scores. We note that our simplest annotation strategy ($AS_1$), which does not incorporate the difficulty score, is similar to the Calib R method described in Varshney et al. (2022), but our calibration method uses more general syntactic features.
Note that for a fair estimation of the abilities of the proposed method, we compare it only with other calibration-based techniques. Other techniques such as Monte-Carlo dropout (Gal and Ghahramani, 2016) and error regularization (Xin et al., 2021) are complementary and can further improve our performance.

Datasets
We conduct experiments with Natural Language Inference and Duplicate Detection datasets and compare the performance of various calibration techniques in in-domain and out-of-domain settings. Natural Language Inference Datasets: SNLI (Bowman et al., 2015), MNLI (Williams et al., 2018), and the NLI Stress Test (Naik et al., 2018). Duplicate Detection Datasets: QQP (Iyer et al., 2017) and MRPC (Dolan and Brockett, 2005).
For the NLI task, we train a 3-way classification model (NLI has three labels) on SNLI and evaluate the selective prediction performance on SNLI (IID) and on the MNLI and Stress Test (OOD) datasets. For the DD task, we train a model on MRPC and evaluate on MRPC (IID) and QQP (OOD). We use the BERT-BASE model (Devlin et al., 2019) with a linear layer on top of the [CLS] token representation for training the models for these tasks. We train these models with the default learning rate of 5e-5 for 3 epochs, using the same experimental setup as Varshney et al. (2022).
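A rough sketch of this setup using the Hugging Face Trainer (our own rendering, with a tiny placeholder example standing in for SNLI; the paper's actual training script may differ):

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3)  # 3-way classification head for NLI

# Tiny placeholder premise/hypothesis pair standing in for the SNLI data.
enc = tokenizer(["A man is sleeping."], ["A person rests."], truncation=True)
train_dataset = [{"input_ids": enc["input_ids"][0],
                  "attention_mask": enc["attention_mask"][0],
                  "labels": 0}]  # label id convention is our assumption

args = TrainingArguments(output_dir="snli-bert-base",
                         learning_rate=5e-5,  # default LR noted in the text
                         num_train_epochs=3)
Trainer(model=model, args=args, train_dataset=train_dataset).train()
```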

Proposed Method Outperforms All
Our method shows a clear benefit over existing calibration techniques, as it leads to a considerable improvement in all cases. The proposed method achieves 15.81% and 6.19% improvement in the IID setting on SNLI and MRPC respectively. Furthermore, it achieves 2.19% improvement on MNLI, 5.64% on the Stress Test, and 13.9% on QQP in the OOD setting. Calib T considerably degrades performance in both the IID and OOD settings. Calib C results in a minor improvement in the IID setting (8.97% for SNLI) but does not consistently improve in the OOD setting (especially on MNLI Mismatched and the Competence Stress Test). We attribute this to the limited signal given to the calibrator by annotating the held-out dataset with the categorical labels '1' and '0'; as a result, it learns weak representations.
Comparing Annotation Functions: We find that the improvement achieved by our method comes from using $AS_3$ as the annotation score, which outperforms $AS_1$ and $AS_2$. This is expected, as $AS_3$ leverages the useful signals provided by both maxProb and the difficulty score for annotation computation.
Relationship With Predictive Correctness: To further analyze our method, we plot the relationship of predictive correctness with prediction confidence and difficulty score in Figure 1. It shows that prediction confidence is positively correlated while the difficulty score is negatively correlated with correctness. This further justifies our annotation score computation procedure.

Conclusion and Future Work
We proposed a selective prediction method that calibrates the model outputs using prediction confidence and the difficulty level of the instances. Through comprehensive experiments, we demonstrated that it achieves considerable improvements over MaxProb on the NLI and Duplicate Detection tasks in both IID and OOD settings. We hope that our work will facilitate the development of more robust and reliable AI systems, making their wide adoption in real-world applications possible.