Knowing More About Questions Can Help: Improving Calibration in Question Answering

We study calibration in question answering: estimating whether a model correctly predicts the answer for each question. Unlike prior work, which mainly relies on the model's confidence score, our calibrator incorporates information about the input example (e.g., the question and the evidence context). Together with data augmentation via back translation, our simple approach achieves 5-10% gains in calibration accuracy on reading comprehension benchmarks. Furthermore, we present the first calibration study in the open retrieval setting, comparing the calibration accuracy of retrieval-based span prediction models and answer generation models. Here again, our approach shows consistent gains over calibrators relying on model confidence. Our simple and efficient calibrator can be easily adapted to many tasks and model architectures, showing robust gains in all settings.


Introduction
Despite rapid progress in AI models, building a question answering (QA) system that always answers any given query correctly remains out of reach. Thus, users have to interpret the model's prediction and decide whether to trust it. We study providing an accurate estimate of the correctness of the model's prediction for each example at test time. As making incorrect predictions can be much more costly than making no prediction (e.g., a missed diagnosis is far more costly than querying human experts), calibrators can bring practical benefits (Kamath et al., 2020).
Existing work on calibration focuses on model confidence, such as the max probability of the predicted class (Guo et al., 2017; Desai and Durrett, 2020). Unlike classification tasks, question answering has a large output space, either through answer generation (Raffel et al., 2020; Lewis et al., 2020) or through selecting a span from provided documents (Rajpurkar et al., 2016). In both settings, optimal decoding is often prohibitively expensive, and heuristic decoding is standard practice (Seo et al., 2017). Thus, relying on the model's confidence score alone is not sufficient for calibration (Kumar and Sarawagi, 2019).
Nonetheless, prior work (Kamath et al., 2020; Jagannatha and Yu, 2020) relied heavily on model confidence, such as the max probability of the predicted answer, together with a handful of manually crafted features containing little information about the input, such as the length of the question. We empower the calibrator by introducing, as additional features, an input example embedding from a pre-trained language model (Liu et al., 2019) finetuned on QA supervision data. With this simple and general feature, the calibrator can identify questions about rare entities or examples with little lexical overlap between the question and the context. We bring further gains by paraphrasing questions or contexts through back translation (Sennrich et al., 2016), providing lexical variations of the question and the context and enriching the feature space.
We evaluate our calibrator with internal metrics (i.e., calibration accuracy) and external metrics (i.e., impact on QA performance). We first evaluate calibrators in the reading comprehension settings introduced in Kamath et al. (2020): in-domain (Rajpurkar et al., 2016; Kwiatkowski et al., 2019), out-of-domain (Fisch et al., 2019), and adversarial (Jia and Liang, 2017). Then, we expand the calibration study to the more challenging open retrieval QA setting (Voorhees and Tice, 2000; Chen et al., 2017), where a system is not provided with an evidence document. We adapt our calibrator for state-of-the-art generation-based (Raffel et al., 2020) and extractive (retrieve-and-predict) QA models (Karpukhin et al., 2020), showing gains with both. While calibration accuracy is higher for the generation-based model, the extractive method provides better answer coverage above a fixed accuracy. Lastly, we use the calibrator as a reranker for answer span candidates in an extractive open retrieval QA model (Karpukhin et al., 2020), showing modest gains. We provide rich ablation studies on design choices for our calibrator, such as the choice of base model used to derive the input example encoding. Our simple input example embedding from pretrained language models shows consistent gains in all settings and datasets. Without any manual engineering specific to the question answering task, our calibrator could be easily adapted to other tasks with rich output spaces.

Problem Definition
We estimate how the model's prediction confidence aligns with the empirical likelihood of correctness (Brier, 1950). Formally, a calibrator f takes the input example x_i and the trained model M_θ and identifies whether the model's prediction is correct. We treat correctness as binary (i.e., answer string exact match) for simplicity, instead of assigning partial credit (e.g., token-level F1 score). We study two settings: reading comprehension (RC) and open retrieval QA. In RC, an input example x_i is a context c_i and a question q_i; in open retrieval QA, an input example is a corpus C and a question q_i.
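The binary correctness label can be computed with standard answer-string exact match. The sketch below uses a common SQuAD-style normalization (lowercasing, dropping articles and punctuation); the helper names are illustrative, not from the paper.

```python
import re
import string

def normalize_answer(s):
    """Lowercase, drop articles and punctuation, collapse whitespace."""
    s = s.lower()
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    return " ".join(s.split())

def exact_match(prediction, gold_answers):
    """Binary correctness: 1 if the prediction matches any gold answer."""
    pred = normalize_answer(prediction)
    return int(any(pred == normalize_answer(g) for g in gold_answers))
```

This is the label the calibrator is trained to predict for each example.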
We use the same metrics to evaluate the performance of the calibrator f in both settings.

Metric: Calibrator performance
Accuracy: Given evaluation data D_eval = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)} and a learned model M_θ, we define the accuracy of the calibrator f as:

Acc(f) = (1/N) Σ_{i=1}^{N} 1[ f(x_i, M_θ) = 1[M_θ(x_i) = y_i] ],

i.e., the fraction of examples on which the calibrator correctly predicts whether the model's answer is correct. AUROC: Based on the above definition of calibrator accuracy, we compute coverage, the fraction of the evaluation data D_eval that the model makes predictions on, and risk, the error at that coverage. We plot the risk versus coverage graph and measure the area under the curve, i.e., AUROC (Area Under the Receiver Operating Characteristic Curve) (Hanley and McNeil, 1982).
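As a concrete sketch (not the authors' code), the risk-coverage area can be approximated by ranking examples by calibrator confidence and averaging the risk observed at each coverage level:

```python
def risk_coverage_auc(confidences, correct):
    """Approximate area under the risk-coverage curve.

    confidences: calibrator confidence per example.
    correct: 1 if the model answered that example correctly, else 0.
    Lower is better (confident examples should be the correct ones).
    """
    # Answer the most confident examples first.
    order = sorted(range(len(confidences)), key=lambda i: -confidences[i])
    risks, errors = [], 0
    for rank, i in enumerate(order, start=1):
        errors += 1 - correct[i]
        risks.append(errors / rank)  # risk at coverage rank/N
    return sum(risks) / len(risks)   # average risk over coverage levels
```

For instance, if the one incorrect answer sits among the model's most confident predictions, the area grows accordingly.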

Metric: End task performance
We measure how calibrator performance impacts QA performance. First, we study the selective QA setting, where we use the calibrator score to decide on which examples from D_eval to make predictions.
For the extractive model for open retrieval QA (Karpukhin et al., 2020), where multiple answer candidates are given, we further evaluate the performance of the calibrator as a reranker and measure the answer span exact match (EM) score.
Selective QA (coverage at fixed accuracy): We use the calibrator score to rank the examples in the evaluation data. Specifically, we use the calibrator's confidence for the top answer candidate, instead of the model score, to decide which examples in D_eval the model answers most confidently. Then, we report the percentage of evaluation data that can be predicted while maintaining a threshold accuracy (80%), following prior work (Kamath et al., 2020).
Open Retrieval QA (top-N accuracy): We use the calibrator score to rank the answer candidates for each evaluation example, similar to how candidate translations are reranked in machine translation (Shen et al., 2004). We first retrieve answer candidates from multiple paragraphs and use the calibrator to override the model's prediction: the calibrator scores the top N answer candidates and outputs the answer with the highest calibrator confidence instead of the answer with the highest model score. Our calibrator can be added as a last step to any open retrieval QA system that generates multiple answer candidates, without retraining the model. We evaluate top 1 and top 5 exact match accuracy after reranking with our calibrator score.
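The override step described above can be sketched as below; `calibrator_score` stands in for the trained calibrator and is an assumption of this sketch:

```python
def rerank_answers(candidates, calibrator_score, top_n=5):
    """Pick the answer among the model's top_n candidates with the
    highest calibrator confidence, overriding the model's own ranking.

    candidates: (answer_string, model_score) pairs, ranked by model score.
    calibrator_score: callable mapping an answer string to a confidence.
    """
    top = candidates[:top_n]
    return max(top, key=lambda c: calibrator_score(c[0]))[0]
```

Because the reranker only consumes candidate lists, it can be bolted onto any retrieve-and-predict pipeline post hoc.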

Methods
We propose two general approaches to improve a binary calibrator: a new feature vector, a dense representation of the input example (Section 3.2), and data augmentation with backtranslation, which further improves the new feature vector (Section 3.3). While both are simple, well-established recipes for improving end tasks in NLP, neither has been explored in the context of calibration, as prior work assumed the model confidence score was the most prominent signal. We follow prior work (Kamath et al., 2020) for the calibrator architecture and focus on improving its feature space.

Calibrator Architecture
A binary classifier is trained using the gradient boosting library XGBoost (Chen and Guestrin, 2016); it classifies each test example as correctly answered by the base QA model or not. This calibrator does not share its weights with the base QA models. We finetune the following hyperparameters on the development set: colsample_bylevel, colsample_bynode, colsample_bytree, the learning rate, and the number of estimators. All calibrators are trained five times, each with different data partitions and random seeds. We report the variances in the results.
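A minimal sketch of the tuning loop, assuming the search space listed in Appendix B; `train_and_eval` is a placeholder for fitting an XGBoost calibrator with the given parameters and scoring it on the development set:

```python
import itertools

# Search space from Appendix B; the three colsample_by* parameters
# share a single value in the grid.
GRID = {
    "colsample": [0.0, 0.1, 0.2, 0.3, 0.4, 0.5],
    "learning_rate": [0.01, 0.1, 0.2, 0.5],
    "n_estimators": [5, 25, 50, 100],
}

def train_and_eval(params):
    """Placeholder: fit an XGBoost classifier with `params` and
    return development-set calibration accuracy."""
    return 0.0

def grid_search():
    """Enumerate every configuration and keep the best-scoring one."""
    configs = [dict(zip(GRID, vals)) for vals in itertools.product(*GRID.values())]
    return max(configs, key=train_and_eval), len(configs)
```

With this grid, 6 x 4 x 4 = 96 configurations are evaluated per calibrator.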

Input Example Embedding Feature From Base QA Model

Prior work uses manually designed features based on the scores of the predicted answer (details in Section 4.2.1). Such features retain little information about the input example, e.g., the question and the evidence context. Inspired by recent work in machine learning (e.g., Song et al., 2019; Hendrycks et al., 2019), which uses hidden vectors to classify in-domain and out-of-domain data, we introduce an input example embedding, a new feature vector that represents the question and (optionally) the evidence context to a calibrator. Our input example embedding is a fixed dimensional vector representing an input example, similar to sentence embeddings (Conneau et al., 2017). It differs in that the representation is taken from the final layer of the base QA model, which is trained with supervision from question answering data, and it encodes the question and (optional) evidence context simultaneously. In Section 6.1, we report minor performance degradation from using embeddings from a generic pretrained language model instead.
Each base model processes an input example, either a query q_i or a query-context pair (q_i, c_i), to generate a sequence of hidden vectors, which is compressed into a fixed dimensional vector to be used as a calibrator feature. We denote the input example as a sequence of tokens t = (t_0, t_1, ..., t_n) where n is the length of the input. We pass the sequence t through the base QA model and get (h_0, h_1, ..., h_n), where h_i is the final-layer hidden state corresponding to t_i and h_i = (h_{i,0}, ..., h_{i,m}) with m the number of hidden dimensions. Then, we get the m-dimensional feature vector φ(t) whose j-th dimension is the average across the sequence length:

φ(t)_j = (1 / (n+1)) Σ_{i=0}^{n} h_{i,j}.   (1)

We then train a binary classifier using these features as a calibrator. We now describe the base QA models used to obtain these hidden representations.
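The mean pooling in Eqn (1) is straightforward; a sketch over plain Python lists (a real implementation would use tensor operations over the model's hidden-state matrix):

```python
def mean_pool(hidden_states):
    """Average a sequence of final-layer hidden states into one vector.

    hidden_states: list of per-token vectors, each of dimension m.
    Returns the m-dimensional input example embedding of Eqn (1).
    """
    n, m = len(hidden_states), len(hidden_states[0])
    return [sum(h[j] for h in hidden_states) / n for j in range(m)]
```

The resulting vector is what gets concatenated with the other calibrator features.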

Base QA Model
We use a standard span prediction architecture for RC, and a generation-based model and an extractive model for open retrieval QA. For RC and the extractive open retrieval QA model, we use a standard span prediction architecture based on a pretrained language model (Devlin et al., 2018), which predicts the start and end index of the answer span separately with a softmax layer. The length of the output hidden vector sequence equals the sum of the length of the question and the length of the evidence context. For the open retrieval QA setting, the extractive model first retrieves a passage from the corpus and predicts an answer span from it. We use the best model from dense passage retrieval (DPR) (Karpukhin et al., 2020). Specifically, this model takes the top 100 retrieved passages as input and trains a span prediction model, which optimizes a softmax cross-entropy loss to select the correct passage among the candidates, together with the answer span prediction loss. The model then selects the answer span with the highest answer span score (the sum of the start and end logit scores) from the passage with the highest passage score. In this setting, t is a concatenation of the question q_i and the context c_i. For the generation-based model, we use a sequence-to-sequence (seq2seq) model, specifically T5-small (Raffel et al., 2020), which takes the question as input and generates answer tokens. For this base QA model, t consists only of the query, since the context is not provided.

Data: For all experiments in RC, we train the model on the SQuAD 1.1 dataset. For open retrieval QA, models are trained on the Natural Questions (NQ) dataset (Kwiatkowski et al., 2019) following the data split from prior work.

Data Augmentation Via Paraphrasing
Paraphrase generation can improve QA models by handling language variation. Compared to sentence retrieval (Du et al., 2020) and language model based example generation (Anaby-Tavor et al., 2020), backtranslation can capture the ambiguity of questions and answers (Singh et al., 2019). Given a (q_i, c_i) pair, we use back translation (Sennrich et al., 2016) to generate a paraphrase q′_i of the question q_i and a paraphrase c′_i of the evidence context c_i. We use standard transformer-based neural machine translation models (Junczys-Dowmunt et al., 2018) trained on WMT data. We first translate the original sentences to a pivot language and then translate them back to the source language. To ensure translation quality, French and German are used as the pivot languages. We use beam search decoding with a beam size of 4 and truncate the context length to 512, as the reading comprehension model truncates the context anyway. We analyze the quality of backtranslation in Section 6.2.
We denote the input with the backtranslated question, (q′_i, c_i), as t^q = (t^q_0, ..., t^q_{n_q}) and the input with the backtranslated context, (q_i, c′_i), as t^c = (t^c_0, ..., t^c_{n_c}). Here, n_q and n_c denote the input length after backtranslating the question and the context, respectively. We pass t^q and t^c through the base QA model, get h^q and h^c, and extract the m-dimensional feature vectors as in Eqn (1). We use the concatenation of the original input example embedding and the backtranslated ones as features. Backtranslating both the context and the question together did not bring further gains, so results from that feature set are not presented; we hypothesize that backtranslating the context and question together introduces too much noise. We do not use data augmentation for the open retrieval QA experiments.
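Assembling the final feature vector then amounts to concatenating the original embedding, the backtranslated variants, and any manually designed features; a sketch with illustrative names:

```python
def calibrator_features(phi_orig, phi_bt_question, phi_bt_context, manual=()):
    """Concatenate the original input example embedding with the
    embeddings of the backtranslated-question and backtranslated-context
    variants, plus any manual features (e.g., MaxProb)."""
    return (list(phi_orig) + list(phi_bt_question)
            + list(phi_bt_context) + list(manual))
```

The calibrator is then trained on these concatenated vectors exactly as before; only the feature space changes.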

Experimental Settings
In this section, we describe the experimental settings, dataset setups, and baseline systems. Table 1 summarizes the evaluation scheme. A separate calibrator is trained for each calibrator training data configuration.

Data
For all in-domain reading comprehension experiments, we randomly split the data into training, development, and test sets (40%, 10%, 50%), following regression and classification benchmarks (Asuncion and Newman, 2007). Further, we assume only limited supervised data is available for calibrators, simulating a setup where we have a general QA model and a small amount of annotated data reserved for calibration.
Standard RC We test two in-domain settings and two out-of-domain settings. We randomly sample 4K examples from each of the datasets included in the training portion of the MRQA shared task (Fisch et al., 2019) (SQuAD (Rajpurkar et al., 2016), NewsQA (Trischler et al., 2017), TriviaQA (Joshi et al., 2017), SearchQA (Dunn et al., 2017), HotpotQA, Natural Questions (Kwiatkowski et al., 2019)). We train two calibrators, one with the SQuAD 1.1 + HotpotQA datasets and another with the SQuAD 1.1 + NQ datasets. For out-of-domain evaluation, we use the four remaining datasets from the MRQA shared task training set.

Unanswerable RC We use SQuAD 2.0 (Rajpurkar et al., 2018), which contains examples where the answer to the question cannot be derived from the provided context. Crowdworkers posed questions that were impossible to answer based on the paragraph alone while referencing entities in the paragraph and ensuring that a plausible answer is present. For the out-of-domain setting, we train the calibrator on 6K examples (1K each sampled from the MRQA datasets) and test on the SQuAD 2.0 dataset (same as the adversarial RC setting).
Open Retrieval QA We use the open retrieval version of NQ. We split its training data 60% / 40% for calibrator training and validation and use the NQ test set for testing.

Comparison Systems
We summarize the calibrators used in our study in Table 2. All calibrators are trained with the same gradient boosting library XGBoost (Chen and Guestrin, 2016), and they only differ in the feature sets. These calibrators are efficient, trained within a few minutes even with our new feature space.

Reading Comprehension
MaxProb is the simplest baseline; it relies on the model's confidence score. For reading comprehension, the model score is the sum of the logit scores of the start and end of the answer span. For open retrieval question answering, the model first determines the passage with the highest passage-match score and then extracts the answer span from this passage.
Formally, given the set of answer spans Y, MaxProb with model M_θ estimates confidence on input x_i as:

MaxProb(x_i) = max_{y ∈ Y} M_θ(y | x_i),

where M_θ(y | x_i) refers to the model score for candidate answer y.
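Concretely, when the span scores are normalized with a softmax, MaxProb is simply the largest resulting probability; a small numerically stable sketch:

```python
import math

def maxprob(span_scores):
    """MaxProb baseline: softmax-normalize the candidate span scores
    (e.g., start + end logits) and return the largest probability."""
    mx = max(span_scores)                      # subtract max for stability
    exps = [math.exp(s - mx) for s in span_scores]
    return max(exps) / sum(exps)
```

A fixed threshold on this value is the standard confidence-only calibrator that our learned calibrator is compared against.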
Ours represents a calibrator that is trained with the question context embedding, φ(q_i, c_i) in Eqn (1). '+ features' refers to augmenting with the features from Kamath et al. (2020), described above. Augmenting the feature set with question context embeddings from backtranslated questions is denoted '+ φ(q′_i, c_i)', and augmenting the feature set with question context embeddings from backtranslated contexts is denoted '+ φ(q_i, c′_i)' (Eqn 2).
Extractive (Retrieve-and-Predict) We consider two baseline calibrators: one takes the product of the normalized passage score (normalized across all passage candidates) and the normalized answer score (normalized across the top 10 answer spans for each passage), and another takes the product of the unnormalized passage and answer scores. Then, we introduce a calibrator augmented with our input example embedding. We include two example embeddings as features: one is the question context embedding as used in the reading comprehension setting (from Eqn 1), and the other is the average of the answer span start token representation and the answer span end token representation.
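The normalized-product baseline can be sketched as follows; the real DPR reader operates on logit tensors, which are represented here as plain lists:

```python
import math

def softmax(xs):
    """Stable softmax over a list of logits."""
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    z = sum(e)
    return [v / z for v in e]

def candidate_scores(passage_logits, span_logits_per_passage):
    """Normalized-product baseline: passage probabilities (normalized
    over all retrieved passages) times span probabilities (normalized
    over each passage's candidate spans)."""
    p = softmax(passage_logits)
    scores = []
    for i, spans in enumerate(span_logits_per_passage):
        scores.extend(p[i] * s for s in softmax(spans))
    return scores
```

The unnormalized variant simply multiplies the raw passage and span logits instead of the softmax probabilities.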
Generation-based (Seq2Seq) For seq2seq models (Raffel et al., 2020), the output answer space includes all sentences that can be generated by the conditional language model. Thus, instead of MaxProb, we use the likelihood of the generated answer (i.e., the product of the conditional probabilities of each token in the generated answer) as a baseline. Then, we introduce a calibrator with our input example embedding (from Eqn 1).
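The answer likelihood is the product of the per-token conditional probabilities, which is most stably computed in log space; a sketch:

```python
import math

def sequence_likelihood(token_logprobs):
    """Likelihood of a generated answer: the product of per-token
    conditional probabilities, computed as exp(sum of log-probs)."""
    return math.exp(sum(token_logprobs))
```

This single scalar plays the role that MaxProb plays for span prediction models.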

Results
Calibration Table 3 reports calibration results on standard reading comprehension datasets. The top block displays the performance of calibrators trained on the SQuAD and HotpotQA datasets, and the bottom block shows the results of calibrators trained on the SQuAD and NQ datasets. In both settings, our input example embedding works better than the manual feature set; moreover, the two approaches are complementary in all settings. Interestingly, paraphrasing questions shows gains on Natural Questions but not on the other datasets. We hypothesize that organically collected search queries contain more ambiguous and ill-defined queries than crowdsourced questions, where questions were based directly on the context. Adding paraphrased context embeddings, on the other hand, shows a modest gain across all settings. Unlike QA models, which have access to millions of parameters, calibrators, even with our feature set, are provided with very limited information. We hypothesize that augmenting the feature set with paraphrased contexts gives the calibrator more information about the example, facilitating higher performance.

Table 4 shows the results in more challenging settings: one with adversarial attacks and another containing unanswerable questions. In both settings, we observe sizable gains (5-10% increase in calibration accuracy) in the in-domain setting, but the gains are smaller in out-of-domain settings. Similar to the Natural Questions dataset, on SQuAD 2.0, which includes adversarially designed questions without an answer, paraphrasing the question is more helpful than paraphrasing the context. On the other hand, in the adversarial setting where contexts are manipulated, paraphrasing contexts is more effective. Overall, our new feature vector shows consistent gains across all datasets and settings.
We present the calibration results for open retrieval QA in Table 5. Overall, calibrator accuracy is higher compared to RC, partially because the answer accuracy is substantially lower. For example, with the generation-based model (T5)'s answer accuracy of 25.5, simply predicting 'incorrect' for every example already yields 74.5 calibration accuracy. For both models, internal confidence scores (Likelihood and Normalized scores) provide reasonable calibrator performance, yet adding our feature set improves the performance. In particular, our calibrator shows a larger gain in the DPR setting. Encouraged by this result, we test our calibrator as an answer candidate reranker for the top answer candidates from DPR. Despite the high calibration accuracy of the generation-based approach, selective QA performance (Cov@Acc=80%) is higher with the extractive approach, suggesting that comparing calibration performance across models of different accuracy is challenging.
Answer Reranking Table 6 shows the results of our calibrator as an answer candidate reranker. The calibrator considers the top 1,000 answer candidates (100 retrieved passages, each with top 10 answer spans) and outputs top candidates based on the calibrator score instead of the model score. We show negligible gains in top 1 accuracy but bigger gains in top 5 accuracy. These small but noticeable gains show the potential of using calibrators to improve open retrieval QA performance, where multiple answer candidates are considered.

Table 6: Results on open domain question answering in NQ. The calibrator is used as a reranker for selecting the top answer span out of 1,000 answer spans (10 answer spans per each of 100 retrieved passages).

Embedding From a Generic Pretrained LM As an ablation, we train a calibrator that uses a standard pretrained language model (BERT) to encode [CLS; (q_i, c_i)] and takes the final-layer hidden representation of the [CLS] token as a feature. Table 7 shows the performance of this [CLS] token classifier. Surprisingly, it outperforms the MaxProb baseline (in Table 3) in all settings and outperforms Kamath et al. (2020) (in Table 3) in most settings, indicating that information about the question and context might be more useful than the QA model's confidence. Using the input example embedding from the QA model shows only 1-3 point gains over using the [CLS] token embedding. This trend holds across various settings (more results in Table 11 in Appendix).

Quality of Back Translation
We study how much variability is introduced during paraphrasing (cf. question paraphrasing; Dong et al., 2017) by measuring the divergence between the original sentence and the paraphrased sentence. We calculate the sentence BLEU score with NLTK (Bird et al., 2009), using the original text as source and the back-translated text as target, for both question paraphrasing and context paraphrasing. The average sentence BLEU score is larger than 0.55 for all datasets, indicating back-translation introduces relatively minor changes in phrasing.

q: In what country is Normandy located?
q′: What country is Normandy in?

q: When did Edward return?
q′: When did Edward come back?

q: How would one write T(n) = 7n2 + 15n + 40 in big O notation?
q′: How do you write T(n) = 7n2 + 15n + 40?

q: What kind of arches does Norman architecture have?
q′: What kind of arches does Norman's building have?

Table 8: Question back translation samples from the SQuAD 2.0 dataset. q refers to the original question and q′ to the backtranslated question. In the third example, back translation introduces an error.
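A simplified, self-contained version of the sentence-level BLEU computation is sketched below (NLTK's `sentence_bleu` uses up to 4-grams by default and offers smoothing options; this sketch stops at bigrams):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(reference, hypothesis, max_n=2):
    """Simplified sentence BLEU: geometric mean of modified n-gram
    precisions (up to max_n) times a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        ref, hyp = ngrams(reference, n), ngrams(hypothesis, n)
        overlap = sum((ref & hyp).values())          # clipped matches
        total = max(sum(hyp.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)
    bp = min(1.0, math.exp(1 - len(reference) / max(len(hypothesis), 1)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

A score near 1.0 means the backtranslation left the sentence nearly unchanged, matching the paper's observation of relatively minor paraphrase edits.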
Figure 1: A visualization of the input example embeddings from the HotpotQA and SQuAD datasets. We denote the data domain with markers of different shapes and denote correctness with different colors. The X-axis and Y-axis denote the first and second dimensions extracted by linear discriminant analysis, respectively.
Visualization Figure 1 shows a visualization of the question context embeddings from HotpotQA and SQuAD. We use linear discriminant analysis (Pedregosa et al., 2011) to plot input example embeddings and observe that embeddings from the same dataset are closer to each other. This demonstrates that embeddings are almost linearly separable between domains, but it is much harder to distinguish correct answers from incorrect ones.

Choice of Classifier We also test replacing XGBoost with other classifiers, such as k-nearest neighbors (KNN). Table 9 indicates that our gains hold across different classifiers. Full experimental results can be found in the Appendix.

Choice of Pivot Language
We test whether the choice of pivot language in backtranslation impacts performances. We find little difference between pivoting through German or French (See Table 10).

Related Work
Calibration in NLP Calibration has become an important topic in NLP as well as in general machine learning (Guo et al., 2018; Fan et al., 2021), as confidence scores from calibrators can be useful for the error correction process (Feng and Sears, 2004). Calibration has been studied in natural language inference, commonsense reasoning (Desai and Durrett, 2020; Varshney et al., 2020), dialogue systems (Mielke et al., 2020), semantic parsing (Dong et al., 2018), coreference resolution (Nguyen and O'Connor, 2015), and sequence labeling (Jagannatha and Yu, 2020). In question answering, Kamath et al. (2020)'s study on selective question answering inspired our work. We measure calibration performance with calibrator accuracy, AUROC, and coverage at accuracy. Expected Calibration Error (ECE) (Guo et al., 2017) is another commonly used metric for calibration performance, but we consider the calibrator as a binary classifier here. Jagannatha and Yu (2020) also study calibration in reading comprehension, using language model perplexity and the model's confidence as features. Language model perplexity coarsely and indirectly captures information about the question and context. We propose an improved feature space and thoroughly test it in challenging settings, e.g., adversarial RC, unanswerable RC, and open retrieval QA.
Calibration During Training Recent work in QA introduces an answer verification step (Tan et al., 2018;Hu et al., 2019;Wang et al., 2020) at the end of the pipeline. During the training, this verifier module takes the questions, answers, or MRC model's state as inputs and determines the answers' validity. Then, the validity score is used to update the model parameters during training. Thus, the validator is jointly trained with the MRC model. While this is conceptually similar to our set up, instead of tying the calibrator into the model, we design a universal post-hoc calibrator that can be easily applied to any model architecture.
Calibration with Ensembles Ensemble diversity has been used to improve uncertainty estimation and calibration (e.g., Raftery et al., 2005; Stickland and Murray, 2020). While effective, calibration with model ensembling is usually expensive and time-consuming (Zhou et al., 2002). Our calibrator is an offline postprocessing step that does not require further training of the original model.

Conclusion
We introduce a richer feature space for question answering calibrators, with question and context embeddings and paraphrase-augmented inputs. Our work suggests that deciding the correctness of a QA system's prediction depends on both the semantics of the question-context pair and the confidence of the model. We thoroughly test our calibrator in domain shift, adversarial, and open domain QA settings. The experiments show noticeable gains in performance across all settings. We further demonstrate our calibrator's general applicability by using it as a reranker in extractive open domain QA. To summarize, our calibrator is simple, effective, and general, with the potential to be incorporated into existing models or extended to other NLP tasks.

A Additional Experimental Results
In

B Hyperparameters and Training Details
A binary classifier is trained using the gradient boosting library XGBoost (Chen and Guestrin, 2016). We finetune the following hyperparameters on the development set: colsample_bylevel, colsample_bynode, colsample_bytree, the learning rate, and the number of estimators. We use the following search space: colsample_bylevel/bynode/bytree are set to the same value, selected from {0.0, 0.1, 0.2, 0.3, 0.4, 0.5}; the learning rate and the number of estimators are selected from {0.01, 0.1, 0.2, 0.5} and {5, 25, 50, 100}, respectively. These hyperparameters are chosen based on performance on the validation set. For base QA models, we mostly follow the hyperparameters used in the original work (e.g., batch size 32 and learning rate 5 × 10^-5 for the BERT-base SQuAD 1.1 model). All calibrators are trained five times, each with different data partitions and random seeds. We report the variances in the results. Our calibrator does not share its weights with the base QA models.