Can Explanations Be Useful for Calibrating Black Box Models?

NLP practitioners often want to take existing trained models and apply them to data from new domains. While fine-tuning or few-shot learning can be used to adapt a base model, there is no single recipe for making these techniques work; moreover, one may not have access to the original model weights if it is deployed as a black box. We study how to improve a black box model’s performance on a new domain by leveraging explanations of the model’s behavior. Our approach first extracts a set of features combining human intuition about the task with model attributions generated by black box interpretation techniques, then uses a simple calibrator, in the form of a classifier, to predict whether the base model was correct or not. We experiment with our method on two tasks, extractive question answering and natural language inference, covering adaptation from several pairs of domains with limited target-domain data. The experimental results across all the domain pairs show that explanations are useful for calibrating these models, boosting accuracy when predictions do not have to be returned on every example. We further show that the calibration model transfers to some extent between tasks.


Introduction
With recent breakthroughs in pre-trained modeling, NLP models are showing increasingly promising performance on real-world tasks, leading to their deployment at scale for settings such as translation, sentiment analysis, and question answering. These models are sometimes used as black boxes, especially if they are only available as a service through APIs or if end users do not have the resources to fine-tune the models themselves. This poses a challenge when users try to deploy models on a new domain that diverges from the training domain, usually resulting in performance deterioration.
To this end, we investigate the task of domain adaptation of black box models: given a black box model and a small number of examples from a new domain, how can we improve the model's generalization performance on the new domain? In this setting, we are not able to update the model parameters, which makes transfer and few-shot learning techniques inapplicable. Furthermore, we cannot even access the model parameters, ruling out techniques requiring model internal representations.
This paper explores how explanations can help address this task. We leverage black box feature attribution techniques (Ribeiro et al., 2016; Lundberg and Lee, 2017) to interpret a model's internal reasoning process. As shown in Figure 1, we use this knowledge in a calibrator, a separate model that makes a binary decision of whether the black box model is likely to be correct or not on a given instance. While not fully addressing the domain adaptation problem, calibrating the model can nonetheless make it more useful in practice, as we can recognize when it is likely to make mistakes (Guo et al., 2017; Kamath et al., 2020; Desai and Durrett, 2020) and modify our deployment strategy accordingly.
We calibrate by connecting model interpretations with hand-crafted heuristics to extract a set of features describing the reasoning of the model. Figure 1 shows an example for question answering: we believe the answers are more reliable when the tokens of a particular set of tags (e.g., proper nouns) in the question are strongly considered. We extract a set of features describing the attribution values of different tags. Using a small number of examples in the target domain, we can train a simple calibrator for the black box model.
Our approach is closely related to the recent line of work on model behavior and explanations. Chandrasekaran et al. (2018) and Hase and Bansal (2020) show explanations can help users predict model decisions in some ways, and Ye et al. (2021) show how these explanations can be semi-automatically connected to model behavior. Our approach goes further by using a model to learn these heuristics, instead of handcrafting them or having a human inspect the explanations.

Figure 1: Overall pipeline and examples from the SQUAD-ADV dataset. A ROBERTA model trained on SQUAD resists the attack on the first example but fails on the second. Features that inspect attribution values produced by LIME can differentiate these two on the basis of attributions to NNP in the question and VB in the context. A calibrator can use these features to predict whether the original black box model was right or wrong.
We test whether our method can improve model generalization performance on two tasks, extractive question answering (QA) and natural language inference (NLI). We construct generalization tasks for 5 pairs of source and target domains across the two tasks. Compared to existing baselines (Kamath et al., 2020) and our own ablations, we find explanations are indeed helpful for this task, successfully improving model generalization performance on all pairs. Although the number of examples needed for training a calibrator is sometimes sufficient to adapt a trained model, we still find occasions where explanation-based calibrators outperform even methods that have full access to the models. Our analysis demonstrates promising cross-domain generalization ability of explanation-based calibrators: our calibrator trained on a new domain can transfer to another new domain in some cases. Moreover, our calibrator can also substantially improve model performance in the Selective QA setting.

Using Explanations for Black Box Model Calibration
Let x = (x_1, x_2, ..., x_n) be a sequence of input tokens and ŷ = f(x) be a prediction from our black box model under consideration. Our task in calibration is to assess whether the model prediction on x matches its ground truth y. We represent this with the variable t, i.e., t = 1{f(x) = y}.
We explore various calibrator models to perform this task, with our main focus being on calibrator models that leverage explanations in the form of feature attribution. Specifically, an explanation φ for the input x assigns an attribution score φ_i to each input token x_i, which represents the importance of that token. Next, we extract features u(x, φ) depending on the input and explanation, and use the features to learn a calibrator c : u(x, φ) → t for predicting whether a prediction is valid. We compare against baselines that do not use explanations in order to answer the core question posed by our paper's title.
Our evaluation focuses on binary calibration, or classifying whether a model's initial prediction is correct. Following recent work in this setting Kamath et al. (2020), we particularly focus on domain transfer settings where models make frequent mistakes. A good calibrator can identify instances where the model has likely made a mistake, so we can return a null response to the user instead of an incorrect one.
In the remainder of this section, we first introduce how we generate the explanations and then how we extract the features u for the input x.

Generating Explanations
Since we are calibrating black box models, we adopt LIME (Ribeiro et al., 2016) and SHAP (Lundberg and Lee, 2017) for generating explanations for models instead of other techniques that require access to the model details (e.g., integrated gradients (Sundararajan et al., 2017)).
Both LIME and SHAP generate local explanations by approximating the model's predictions on a set of perturbations around the base data point x. In this setting, a perturbation x′ with respect to x is a simplified input where some of the input tokens are absent (replaced with a <mask> token). Let z = (z_1, z_2, ..., z_n) be a binary vector with each z_i indicating whether x_i is present (value 1) or absent (value 0), and let h_x(z) be the function that maps z back to the simplified input x′.
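The mapping h_x(z) above can be sketched in a few lines; the token list and mask string here are illustrative, not taken from the paper's implementation:

```python
def h_x(tokens, z, mask_token="<mask>"):
    """Map a binary presence vector z back to a simplified input x':
    tokens with z_i = 0 are replaced by the mask token."""
    return [tok if keep else mask_token for tok, keep in zip(tokens, z)]

tokens = ["what", "year", "was", "tesla", "born"]
z = [1, 1, 0, 1, 0]
print(h_x(tokens, z))  # ['what', 'year', '<mask>', 'tesla', '<mask>']
```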
Both LIME and SHAP seek to learn a local linear classifier g on z which matches the prediction of the original model f by minimizing:

    g* = argmin_g Σ_z π_x(z) [f(h_x(z)) − g(z)]² + Ω(g)

where π_x is a local kernel assigning a weight to each perturbation z, and Ω is the L2 regularizer over the model complexity. The learned feature weight φ_i for each z_i then represents the additive attribution (Lundberg and Lee, 2017) of each individual token x_i.

LIME and SHAP differ in the choice of the local kernel π_x. LIME heuristically sets π_x as an exponential kernel (with bandwidth σ) defined on the cosine distance function d_cos between the perturbation and the original input:

    π_x(z) = exp(−d_cos(x, h_x(z)) / σ²)

That is, LIME assigns higher instance weights to perturbations that are closer to the original input, and so prioritizes classifying these correctly with the approximation. SHAP derives π_x so that the φ can be interpreted as Shapley values (Shapley, 1997):

    π_x(z) = (n − 1) / [ C(n, |z|) · |z| · (n − |z|) ]

where |z| denotes the number of activated tokens (the sum of z) and C(n, |z|) is the binomial coefficient. This kernel assigns high weights to perturbations with few or many active tokens, as the predictions when a few tokens' effects are isolated are important. This distinguishes SHAP from LIME, since LIME places very low weight on perturbations with few active tokens.
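To make the two kernels concrete, here is a minimal sketch; the function names and the bandwidth value are our own choices, and the LIME form follows the expression above (implementations vary in the exact distance and bandwidth):

```python
import math
from math import comb

def lime_kernel(cos_dist, sigma=0.25):
    """LIME's exponential kernel on the cosine distance, with bandwidth sigma."""
    return math.exp(-cos_dist / sigma ** 2)

def shap_kernel(n, z_size):
    """SHAP kernel weight for a perturbation with |z| = z_size active
    tokens out of n; unbounded at the empty and full coalitions, which
    are handled as hard constraints in practice."""
    if z_size == 0 or z_size == n:
        return float("inf")
    return (n - 1) / (comb(n, z_size) * z_size * (n - z_size))
```

Note the contrast the text describes: `shap_kernel(10, 1)` is much larger than `shap_kernel(10, 5)`, while `lime_kernel` depends only on how close the perturbation is to the original input.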
For additional details of these methods, we refer readers to the respective papers. The rest of this work only relies on these methods' ability to map an input sequence x and a model prediction ŷ to a set of importance weights φ_i.

Extracting Features by Combining Explanations and Heuristics
Armed with these black box explanations, we now wish to connect the explanations to the reasoning we expect from the task: if the model is behaving as we expect, it may be better calibrated. A human might look at the attributions of some important features and decide whether the model is trustworthy in a similar fashion (Doshi-Velez and Kim, 2017). Past work has explored such a technique to compare explanation techniques (Ye et al., 2021), or even used actual human users to do this task (Chandrasekaran et al., 2018; Hase and Bansal, 2020). Our method automates this process by learning what properties of explanations are important. We first assign each token x_i one or more human-understandable properties V(x_i) = {v_j}_{j=1}^{m_i}. Each property v_j ∈ V is an element in the property space, which includes indicators like POS tags and is used to describe an aspect of x_i whose importance might correlate with the model's robustness. We intend to conjoin these properties with aspects of the explanation to render our calibration judgment. Figure 1 shows examples of properties such as whether a token is a proper noun (NNP).
We now construct the feature set for the prediction made on input x. For every property v ∈ V, we extract a single feature F(v, x, φ) by aggregating the attributions of the tokens associated with v:

    F(v, x, φ) = Σ_{i=1}^{n} 1[v ∈ V(x_i)] · φ_i

where 1 is the indicator function, and φ_i is the attribution value produced by the explanation technique. In this way, an individual feature represents the total attribution with respect to property v when the model is making the prediction for x. The complete feature set u for x, given as u = {F(v, x, φ)}_{v∈V}, can summarize model rationales from the perspective of the properties defined in V.
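The aggregation F(v, x, φ) can be sketched as follows; the property names in the example are hypothetical, chosen only to mirror Figure 1:

```python
def extract_features(attributions, token_props, property_space):
    """u(x, phi): for each property v in the property space, sum the
    attributions phi_i of the tokens x_i whose property set V(x_i)
    contains v."""
    feats = {v: 0.0 for v in property_space}
    for props, phi in zip(token_props, attributions):
        for v in props:
            if v in feats:
                feats[v] += phi
    return feats

# Two question tokens, the second a proper noun with attribution 0.5.
feats = extract_features(
    attributions=[0.1, 0.5],
    token_props=[{"question"}, {"question", "question+NNP"}],
    property_space=["question", "question+NNP"],
)
print(feats)  # {'question': 0.6, 'question+NNP': 0.5}
```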
Properties We use several types of heuristic properties for calibrating QA and NLI models.
Segments of the Input (QA and NLI): In both of our tasks, an input sequence can naturally be decomposed into two parts, namely a question and a context (QA) or a premise and a hypothesis (NLI). We assign each token with the corresponding segment name, which yields features like Attributions to Question.

POS Tags (QA and NLI):
We also use tags from the English Penn Treebank (Marcus et al., 1993) to implement a group of properties. We hypothesize that tokens of some specific tags should be more important, like proper nouns in the questions of the QA tasks. If a model fails to consider proper nouns of a QA pair, it is more likely to make incorrect predictions.
Overlapping Words (NLI): Word overlap strongly affects model predictions (McCoy et al., 2019). We assign each token a property of Overlapping or Non-Overlapping.
Conjunction of Groups: We can further produce higher-level properties by taking the Cartesian product of two or more groups.
We conjoin Segment and Pos-Tags, which yields higher-level features like Attributions to NNP in Question. Such a feature aggregates attributions of tokens that are tagged with NNP and also required to be in the question (marked with orange).
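A minimal sketch of labeling tokens with both the base Segment property and the conjoined Segment × Pos-Tags property; the `"question+NNP"`-style names are our own illustrative convention, not the paper's:

```python
def label_tokens(segments, pos_tags):
    """Assign each token its segment property plus the conjoined
    Segment x POS property (e.g. 'question+NNP')."""
    return [{seg, f"{seg}+{tag}"} for seg, tag in zip(segments, pos_tags)]

props = label_tokens(["question", "question", "context"],
                     ["W", "NNP", "VB"])
print(props[1])  # {'question', 'question+NNP'}
```

Each token's property set then feeds the attribution-aggregation step, so a single feature like "Attributions to NNP in Question" sums φ_i over exactly the tokens carrying that conjoined label.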

Calibrator Model
We train the calibrator on a small number of samples in our target domain. Each sample is labeled by comparing the prediction of the original model to the ground truth. Using our feature set u = {F(v, x, φ)}, we learn a random forest classifier, shown to be effective for a similar data-limited setting by Kamath et al. (2020), to predict t (whether the corresponding prediction is correct). This classifier returns a score, which overrides the model's original confidence score for that prediction.
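A minimal sketch of this training step, assuming scikit-learn's random forest (the paper does not specify hyperparameters, so those below are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_calibrator(feature_matrix, correct_labels, seed=0):
    """Fit a random forest on calibration features u; labels are
    t = 1 if the base model's prediction was correct, else 0."""
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    clf.fit(np.asarray(feature_matrix), np.asarray(correct_labels))
    return clf
```

At deployment, `clf.predict_proba(u)[:, 1]` serves as the confidence score that replaces the base model's original probability.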
In Section 4, we discuss several baselines for our approach. Whenever we vary the features used by the model, all the other details of the classifier and setup remain the same.

Tasks and Datasets
Our task setup involves transferring from a source domain/task A to a target domain/task B. Figure 2 shows the data condition we operate in. Our primary setting is the black box setting (Figure 2): we have a black box model trained on a source domain A and a small amount of data from the target domain B. Our task is to train a calibrator using data from domain B to identify instances where the model potentially fails on the large unseen test data in domain B. We contrast this black box setting with glass box settings (right column in Figure 2). In glass box settings, we directly have access to the model parameters, so it is possible to finetune a model on domain B or train on B from scratch.
Question Answering We experiment with domain transfer from SQUAD (Rajpurkar et al., 2016) to three different settings: SQUAD-ADV (Jia and Liang, 2017), HOTPOTQA (Yang et al., 2018), and TRIVIAQA (Joshi et al., 2017). SQUAD-ADV is an adversarial setting built on SQUAD, which constructs adversarial QA examples by appending a distractor sentence at the end of each example's context. The added sentence contains a spurious answer and usually has high surface overlap with the question so as to fool the model. We use the ADDSENT setting from Jia and Liang (2017).
Similar to SQUAD, HOTPOTQA also contains passages extracted from Wikipedia, but HOTPOTQA asks questions requiring multiple reasoning steps. TRIVIAQA is collected from Web snippets, which present a different distribution of questions and passages than SQUAD. For HOTPOTQA and TRIVIAQA, we directly use the preprocessed versions of these datasets from the MRQA Shared Task (Fisch et al., 2019).
NLI For the task of NLI, we transfer a model trained on MNLI (Williams et al., 2018) to MRPC (Dolan and Brockett, 2005) and QNLI, similar to the settings in Ma et al. (2019). QNLI contains question and context sentence pairs from SQUAD, and the task is to verify whether a sentence contains the answer to the paired question. MRPC is a paraphrase detection dataset presenting a binary classification task to decide whether two sentences are paraphrases of one another. Note that generalization from MNLI to QNLI or MRPC introduces a shift not only in the distribution of the input text, but also in the nature of the task itself, since QNLI and MRPC are not strictly NLI tasks despite sharing some similarity: both are binary classification tasks rather than three-way.

Figure 3: Coverage-F1 curves of different approaches on SQUAD-ADV. As more low-confidence questions are answered, the average F1 scores decrease. We use AUC to evaluate calibration performance.

Experiments
Baselines We compare our calibrator against existing baselines as well as our own ablations. MAXPROB simply uses the probability of the top prediction to assess whether the prediction is trustworthy.
KAMATH (Kamath et al., 2020) (for QA only) is a baseline initially proposed to distinguish out-of-distribution data points from in-domain data points in the SELECTIVE QA setting, but it can also be applied in our settings. It trains a random forest classifier to learn whether a model prediction is correct based on several heuristic features, including the probabilities of the top 5 predictions, the length of the context, and the length of the predicted answer. Since we are calibrating black box models, we do not use the dropout-based features of Kamath et al. (2020).
CLSPROB (for NLI only) uses more detailed information than MAXPROB: it uses the predicted probability for Entailment, Contradiction, and Neutral as the features for training a calibrator instead of only using the maximum probability.
BOWPROP adds a set of heuristic property features on top of the KAMATH method. These are the same as the features used by the full model, excluding the explanations. This baseline measures what general "shape" features of the inputs can achieve when not paired with explanations.

Implementation of Our Method
We refer to our explanation-based calibration methods using explanations produced by LIME and SHAP as LIMECAL and SHAPCAL, respectively. We note that these methods also take advantage of the bag-of-words features in BOWPROP. For QA, the property space is the union of low-level Segment and Segment × Pos-Tags. For NLI, we use the union of Segment and Segment × Pos-Tags × Overlapping Words to label the tokens. Detailed numbers of features can be found in the Appendix.

Metrics In addition to calibration accuracy (ACC), which measures the accuracy of the calibrator, we also use the area under the coverage-F1 curve (AUC) to evaluate the calibration performance for QA tasks in particular. The coverage-F1 curve (Figure 3) plots the average F1 score the model achieves when it only chooses to answer varying fractions (coverage) of the examples, ranked by the calibrator-produced confidence. A better calibrator should assign higher scores to the questions that the model is sure of, thus resulting in a higher area under the curve; note that an AUC of 100 is impossible, since the F1 is always bounded by the base model when every question is answered. We additionally report the average scores when answering the top 25%, 50%, and 75% of questions, for a more intuitive comparison of performance.
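The coverage-F1 AUC can be sketched as follows (a simple average over all coverage levels; the paper does not specify its exact numerical integration, so this is an illustrative variant):

```python
def coverage_f1_auc(confidences, f1_scores):
    """Rank examples by calibrator confidence, then compute the average
    F1 over the top-k answered examples for each coverage k/n; the AUC
    is the mean of these averages across all coverage levels."""
    order = sorted(range(len(confidences)),
                   key=lambda i: confidences[i], reverse=True)
    ranked = [f1_scores[i] for i in order]
    avg_f1, total = [], 0.0
    for k, f1 in enumerate(ranked, start=1):
        total += f1
        avg_f1.append(total / k)  # average F1 at coverage k/n
    return sum(avg_f1) / len(avg_f1)
```

A calibrator that ranks correct answers first gets a higher score: confidences [0.9, 0.1] with F1s [1.0, 0.0] yields 0.75, while the reversed ranking yields only 0.25.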
Results Table 1 summarizes the results for QA. First, we show that using explanations is helpful for calibrating black box QA models out-of-domain. Our method using LIME substantially improves the calibration AUC compared to KAMATH by 7.1, 2.1, and 1.4 on SQUAD-ADV, TRIVIAQA, and HOTPOTQA, respectively. In particular, LIMECAL achieves an average F1 score of 92.3 at a coverage of 25% on SQUAD-ADV, close to the performance of the base model on original SQUAD examples. Our explanation-based approach is effective at identifying the examples that are robust with respect to the adversarial attacks.
Comparing LIMECAL against BOWPROP, we find that the explanations themselves do indeed help. On SQUAD-ADV and HOTPOTQA, BOWPROP performs on par with or only slightly better than KAMATH. These results show that connecting explanations with annotations is a path towards building better calibrators.
Finally, we compare the performance of our methods based on different explanation techniques. LIMECAL slightly outperforms SHAPCAL in all three settings; details of hyperparameters can be found in the Appendix.

Cross-Domain Generalization of Calibrators
Our calibrators so far are trained on individual transfer settings. Is the knowledge of a calibrator learned on one initial domain transfer setting, e.g., SQUAD → TRIVIAQA, generalizable to another transfer setting, e.g., SQUAD → HOTPOTQA? This would enable us to take our basic QA model and a calibrator and apply that pair of models in a new domain without doing any new training or adaptation. We explore this hypothesis on QA. For comparison, we also give the performance of a ROBERTA model first finetuned on SQUAD and then finetuned on domain A (ADAPT, Figure 2). ADAPT requires access to the model architecture and is an unfair comparison for the other approaches.
We show the results in Table 5. None of the approaches can generalize between SQUAD-ADV and the other domains (either trained or tested on SQUAD-ADV), which is unsurprising given the synthetic and very specific nature of SQUAD-ADV.
Between TRIVIAQA and HOTPOTQA, both the LIMECAL and KAMATH calibrators trained on one domain can generalize to the other, even though BOWPROP is not effective. Furthermore, our LIMECAL exhibits a stronger capability of generalization compared to KAMATH. We then compare LIMECAL against ADAPT. ADAPT does not always work well, as has also been discussed in Kamath et al. (2020) and Talmor and Berant (2019). ADAPT leads to a huge drop in performance when trained on HOTPOTQA and tested on TRIVIAQA, whereas LIMECAL is the best in this setting. From TRIVIAQA to HOTPOTQA, ADAPT works well, but LIMECAL is almost as effective.
Overall, the calibrator trained with explanations as features exhibits successful generalizability across the two realistic QA tasks. We believe this can be attributed to the features used in the explanation-based calibrator. Although the task is different, the calibrator can rely on some common rules to decide the reliability of a prediction.
Feature Importance We analyze the important features learned by the calibrator. We find explanation-based features are indeed generally among the top used features and more important than Bag-of-Word-based features (see the Appendix for a detailed list). All QA calibrators heavily rely on attribution values of the proper nouns (NNP) and wh-words in the question. BoW features of overlapping nouns are considered important on QNLI, but the top feature is still attribution-based.
These factors give insights into which parts of the QA or NLI reasoning processes are important for models to capture. E.g., the reliance on NNPs in SQUAD-ADV matches our intuitive understanding of this task: distractors typically have the wrong named entities in them, so if the model pays attention to NNPs on an example, it is more likely to be correct, and the calibrator can exploit this.
Impacts of Training Data Size Calibrating a model for a new domain becomes cumbersome if large amounts of annotated data are necessary. We experiment with varying the amount of training data the calibrator is exposed to, with results shown in Table 3.

Comparison to Finetuned Models
Throughout this work, we have assumed a black box model that cannot be fine-tuned on a new domain. In this section, we compare calibration-based approaches with glass-box methods that require access to the model architectures and parameters. We evaluate two glass-box methods in two different settings (Figure 2). On QA tasks, the limited training data is not sufficient for successfully finetuning a ROBERTA model; consequently, FINETUNE ROBERTA does not achieve credible performance. Finetuning a base QA model (ADAPT) greatly improves the performance, surpassing LIMECAL on SQUAD-ADV and HOTPOTQA. However, we still find that on TRIVIAQA, LIMECAL slightly outperforms ADAPT. This is a surprising result, and shows that explanation-based calibrators can still be beneficial in some scenarios, even when we have full access to the model.
On NLI tasks, which are substantially easier than QA, finetuning either a ROBERTA LM model or a base NLI model can reach an accuracy of roughly 80%. Our explanation-based approach largely lags behind glass-box methods, likely because the base NLI model utterly fails on QNLI (50.5% accuracy) and MRPC (55.0% accuracy) and does not provide much support for the two tasks. Nonetheless, the results on NLI still support our main hypothesis: explanations can be useful for calibration.

Selective QA Setting
Our results so far have shown that a calibrator can use explanations to help make binary judgments of correctness for a model running in a new domain. We now test our model in the selective QA setting from Kamath et al. (2020) (Figure 2).

Results As shown in Table 6, the results are similar to the main QA results: our explanation-based approach, LIMECAL, is consistently the best among all settings. We point out that our approach outperforms KAMATH especially in settings that involve SQUAD-ADV as the known or unknown OOD distribution. This can be attributed to the similarity between SQUAD and SQUAD-ADV, which cannot be well distinguished with the features used in KAMATH (context length, answer length, etc.). The strong performance of our explanation-based approach in the selective QA setting further verifies our assumption: explanations can be useful and effective for calibrating black box models.

Related Work
Our approach is inspired by recent work on the simulation test (Doshi-Velez and Kim, 2017), i.e., whether humans can simulate a model's prediction on an input example based on the explanations.
Simulation tests have been carried out on various tasks (Ribeiro et al., 2018; Nguyen, 2018; Chandrasekaran et al., 2018; Hase and Bansal, 2020) and give positive results in some tasks (Hase and Bansal, 2020). Our approach tries to mimic the process that humans would use to judge a model's prediction by combining heuristics with attributions, instead of having humans actually do the task. Using "meta-features" to judge a model also shows up in the literature on system combination for tasks like constituent parsing (Charniak and Johnson, 2005; Fossum and Knight, 2009) and semantic parsing (Yin and Neubig, 2019). The work of Rajani and Mooney (2018) on VQA is most relevant to ours; they also use heuristic features, but we further conjoin heuristics with model attributions.

Discussion & Conclusion
Limitations Despite showing promising results in improving model generalization performance, our attribution-based approach does suffer from an intensive computation cost. Using either LIME or SHAP to generate attributions requires running inference on a fair number of perturbations when the input size is large (see Appendix for details), which limits our method's applicability. However, this does not undermine the main contribution of this paper, answering the question in the title, and our approach is still applicable as-is in scenarios where we pay for access to the model but not per query.

Conclusion
We have explored whether model attributions can be useful for calibrating black box models. The answer is yes: by connecting attributions with light human heuristics, we successfully improve model generalization performance on new domains, or even different tasks. Moreover, our method exhibits promising generalization performance in some settings (cross-domain generalization and selective QA).

A Feature Importance

Table 7 shows the most important features learned by LIMECAL for QA and NLI. For brevity, we merge the features related to the probabilities of the top predictions into one feature (Prob). Explanation-based features are indeed generally among the top used features and more important than raw property features.

B Details of POS Tag Properties
We use the tagger implemented in the spaCy API. The tag set basically follows the Penn Treebank tag set, except that we merge some related tags to reduce the number of features given the limited amount of training data. Specifically, we merge JJ, JJR, JJS into JJ; NN, NNS into NN; NNP, NNPS into NNP; RB, RBR, RBS into RB; VB, VBD, VBG, VBN, VBP, VBZ into VB; and WDT, WP, WP$, WRB into W. In this way, we obtain a tag set of 25 tags in total.
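The tag-merging scheme above can be expressed as a simple lookup table (a sketch of the merges listed, applied after spaCy tagging):

```python
# Merge fine-grained Penn Treebank tags into the coarser groups
# described in the text; unlisted tags pass through unchanged.
MERGE = {
    "JJR": "JJ", "JJS": "JJ",
    "NNS": "NN",
    "NNPS": "NNP",
    "RBR": "RB", "RBS": "RB",
    "VBD": "VB", "VBG": "VB", "VBN": "VB", "VBP": "VB", "VBZ": "VB",
    "WDT": "W", "WP": "W", "WP$": "W", "WRB": "W",
}

def merge_tag(tag):
    return MERGE.get(tag, tag)

print(merge_tag("VBZ"))  # VB
print(merge_tag("NNP"))  # NNP
```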

C Details of Black Box Calibrators
Number of Features for QA

• KAMATH (Kamath et al., 2020): we use the 7 features described in Kamath et al. (2020), including the probabilities of the top 5 predictions, the context length, and the predicted answer length.
• BOWPROP: in addition to the 7 features used in KAMATH, we construct the property space V as the union of low-level Segment and Segment × Pos-Tags. Since there are three segments (question, context, answer) in the input and 25 tags (Section B), the size of the property space |V| is 3 + 3 × 25 = 78. Therefore, the total number of features (including the 7 from KAMATH) is 85.
• LIMECAL and SHAPCAL: recall that the size of the property space is 78. LIMECAL and SHAPCAL use 78 features describing the attributions related to the corresponding properties, in addition to the 85 features used in BOWPROP. The total number of features is therefore 163.