Will this Question be Answered? Question Filtering via Answer Model Distillation for Efficient Question Answering

In this paper we propose a novel approach to improving the efficiency of Question Answering (QA) systems: filtering out questions that the system will not answer. It is based on an interesting new finding: the answer confidence scores of state-of-the-art QA systems can be well approximated by models that use only the input question text. This enables preemptive filtering of questions whose answer confidence scores would fall below the system threshold. Specifically, we learn Transformer-based question models by distilling Transformer-based answering models. Our experiments on three popular QA datasets and one industrial QA benchmark demonstrate that our question models approximate the Precision/Recall curves of the target QA system well. Used as filters, these question models can effectively trade a small drop in Recall for a large reduction in QA system computation, e.g., reducing computation by ~60% while losing only ~3-4% of Recall.


Introduction
Question Answering (QA) technology is at the core of several commercial applications, e.g., virtual assistants such as Alexa, Google Home and Siri, serving millions of users. Optimizing the efficiency of such systems is vital to reducing their operational costs. Recently, a large body of research has been devoted to reducing the computational complexity of retrieval (Gallagher et al., 2019; Tan et al., 2019) and of transformer-based QA models (Sanh et al., 2019; Soldaini and Moschitti, 2020).
An alternative approach to improving QA system efficiency is to use automatic classifiers to discard questions that the system will most probably answer incorrectly. For example, Fader et al. (2013) and Faruqui and Das (2018) aim to capture the grammaticality and well-formedness of questions. However, these methods do not take the answering ability of the specific target system into account. In practice, QA systems typically do not answer a significant portion of user questions because their answer scores fall below a confidence threshold (Kamath et al., 2020), tuned by the system to achieve the required Precision. For example, QA systems for medical domains must operate at high Precision since providing incorrect/imprecise answers can have critical consequences for the end user. Based on this rationale, discarding questions that will not be answered by the QA system presents a remarkable cost-saving opportunity. However, applying this idea may appear unrealistic at the outset, since the QA system must first be executed to generate the answer confidence score.
In this paper, we take a new perspective on improving QA system efficiency by preemptively filtering out questions that will not be answered by the system, by means of a question filtering model. This is based on our interesting finding that the answer confidence score of a QA system can be well approximated solely using the question text.
Our empirical study is supported by several observations and intuitions. First, the final answer confidence score from a QA system (irrespective of its complex pipeline) is often generated by a Transformer-based model. This is because Transformer-based models are used for answer extraction in most research areas in QA with unstructured text, e.g., Machine Reading(MR) (Rajpurkar et al., 2016), and Answer Sentence Selection(AS2) (Garg et al., 2020). Second, more linguistically complex questions have a lower probability to be answered. Language complexity correlates with syntactic, semantic and lexical properties, which have been shown to be well captured by pre-trained language models (LMs) (Jawahar et al., 2019). Thus, the final answer extractor will be affected by said complexity, suggesting that we can predict which questions are likely to be unanswered just using their surface forms.
Third, pre-training transformer-based LMs on huge amounts of web data enables them to implicitly capture the frequency/popularity of general phrases 1 , among which entities and concepts play a crucial role for answerability of questions. Thus, the contextual embedding of a question from a transformer LM is, to some extent, aware of the popularity of entities and concepts in the question, which impacts the retrieval quality of good answer candidates. This means that a portion of the retrieval complexity of a QA system can also be estimated just using the question. Most importantly, we only try to estimate the answer score from a QA system and not whether the answer provided by the system for a question is correct or incorrect (the latter being a much more difficult task).
Following the above intuitions, we distill the knowledge of QA models, using them as teachers, into Transformer-based models (students) that only operate on the question. Once trained, the student question model can be used to preemptively filter out questions whose answer score will not clear the system threshold, translating to a proportional reduction in the runtime cost of the system. More specifically, we propose two loss objectives for training two variants of this question filter: one with a regression head and one with a classification head. The former attempts to directly predict the continuous score provided by the QA system. The latter aims at learning to predict if a question will generate a score >τ , which is the answer confidence threshold the QA system was tuned to.
We perform an empirical evaluation to (i) show the ability of our question models to estimate the QA system score; and (ii) measure the cost savings produced by our question filters, traded off against a drop in Recall. We test our models on two QA tasks over unstructured text, MR and AS2, using (a) three academic datasets: WikiQA, ASNQ, and SQuAD 1.1; (b) a large-scale industrial benchmark; and (c) a variety of transformer architectures such as BERT, RoBERTa and ELECTRA. Specifically, for (i), we compare the Precision(Pr)/Recall(Re) curves of the original and the new QA system, where the latter uses the question model score to trade off Precision for Recall. For (ii), we show the cost savings produced by our question filters when operating the original QA system at different Precision values. The results show that: (i) the Pr/Re curves of the question models are close to those of the original system, suggesting that they estimate the system scores well; and (ii) our question models can preemptively filter out 21.9−45.8% of questions while only incurring a drop in Recall of 3.2−4.9%.

Related Work

Question Answering Prior efforts on QA have broadly targeted two fronts: MR and AS2. For the former, pre-trained transformer models (Devlin et al., 2019; Liu et al., 2019; Lan et al., 2020; Clark et al., 2020) have recently achieved SOTA performance, sometimes even exceeding human performance. Progress on this front has also seen the development of large-scale QA datasets like SQuAD (Rajpurkar et al., 2016), HotpotQA (Yang et al., 2018), NQ (Kwiatkowski et al., 2019), etc. with increasingly challenging types of questions. For the task of AS2, initial efforts embedded the question and candidates using CNNs (Severyn and Moschitti, 2015), weight aligned networks (Shen et al., 2017; Tran et al., 2018; Tay et al., 2018) and compare-aggregate architectures (Wang and Jiang, 2016; Bian et al., 2017; Yoon et al., 2019).
Recent progress has stemmed from the application of transformer models for performing AS2 (Garg et al., 2020;Han et al., 2021;Lauriola and Moschitti, 2021). On the data front, small datasets like TrecQA (Wang et al., 2007) and WikiQA (Yang et al., 2015) have been supplemented with datasets such as ASNQ (Garg et al., 2020) having several million QA pairs.
Open Domain QA ODQA systems (Chen and Yih, 2020) involve a combination of a retriever and a reader (Semnani and Pandey, 2020), trained independently or jointly. Efforts in ODQA have transitioned from using knowledge bases for answering questions to using external text sources and web articles (Savenkov and Agichtein, 2016; Sun et al., 2018; Lu et al., 2019). Numerous works have proposed techniques for improving performance on ODQA (Min et al., 2019; Asai et al., 2019; Qi et al., 2019).

Filtering Ill-formed Questions Evaluating the well-formedness and intelligibility of queries has been a popular research topic for QA systems. Faruqui and Das (2018) annotate the Paralex dataset (Fader et al., 2013) for the well-formedness of its questions. The majority of research efforts have aimed at reformulating user queries to elicit the best possible answer from the QA system (Yang et al., 2014; Buck et al., 2017; Chu et al., 2019). A complementary line of work uses hate speech detection techniques (Gupta et al., 2020) to filter questions that incite hate on the basis of race, religion, etc.
Answer Verification QA systems sometimes use an answer validation component in addition to the system threshold, which analyzes the answer produced by the system and decides whether to answer or abstain. These systems often use external entity knowledge (Magnini et al., 2002;Ko et al., 2007;Gondek et al., 2012) for basing their decision to verify the correctness of the answer. Recently Wang et al. (2020) propose to add a new MR model to reflect on the predictions of the original MR model to decide whether the produced answer is correct or not. Other efforts (Rodriguez et al., 2019;Kamath et al., 2020;Jia and Xie, 2020;Zhang et al., 2021) have trained calibrators for verifying if the question should be answered or not. All these works are fundamentally different from our question filtering approach since they operate jointly on the question and generated answer, thereby requiring the entire computation to be performed by the QA system before making a decision. Our work operates only on the question text to preemptively decide whether to filter it or not. Thus the primary goal of these existing works is to improve the precision of the answering model by not answering when not confident, while our work aims to improve efficiency of the QA system and save runtime compute cost.
Query Performance Prediction Pre-retrieval query difficulty prediction has been previously explored in Information Retrieval (Carmel and Yom-Tov, 2010). Previous works (He and Ounis, 2004; Mothe and Tanguy, 2005; He et al., 2008; Zhao et al., 2008; Hauff, 2010) target p(a|q, f), the ground-truth probability of an answer a being correct given a question q and an input feature set f, using simple linguistic (e.g., parse trees, polysemy values) and statistical (e.g., query term statistics, PMI) methods; in contrast, we target the QA-system score s(a|q, f), the probability of an answer being correct as estimated by the system. This task is driven more semantically than syntactically, and enables the use of large amounts of training data without human labels of answer correctness.

Figure 1: A real-world QA system having a retrieval (R), candidate extraction (S) and answering component (M). Our proposed question filter (highlighted by the red box) preemptively removes the questions which will fail the threshold τ 1 of M.
Efficient QA Several works improve the efficiency of retrieval using a cascade of rerankers to quickly identify good documents (Wang et al., 2011, 2016; Gallagher et al., 2019), non-metric matching functions for efficient search (Tan et al., 2019), etc. Towards reducing the compute of the answer model, the following techniques have been explored: multi-stage ranking using progressively larger models (Matsubara et al., 2020), using intermediate representations for early elimination of negative candidates (Soldaini and Moschitti, 2020; Xin et al., 2020), combining separate encodings of question and answer with shallow DNNs, and, most popular, knowledge distillation (Hinton et al., 2015) to train smaller transformer models with low inference latencies (DistilBERT (Sanh et al., 2019), TinyBERT (Jiao et al., 2020), MobileBERT (Sun et al., 2020), etc.).

Preliminaries and Problem Setting
We first provide details of QA systems and explain the cost-saving opportunity space when they operate at a given Precision (or answer score threshold).

QA Systems for Unstructured Text
We consider QA systems based on unstructured text. A simple design (depicted in Fig. 1) works as follows: given a user question q, a search engine R first retrieves a set of documents (e.g., from a web index). A text splitter S, applied to the documents, produces a set of passages/sentences, which are input to an answering model M, which produces the final answer. There are two main research areas studying the design of M:

Machine Reading (MR): S extracts passages {p1, ..., pm} from the retrieved documents. M is a reading comprehension head, which uses (q, {p1, ..., pm}) to predict the start and end position (span) of the best answer within these passages.

Answer Sentence Selection (AS2): S splits the retrieved documents into a set {s1, ..., sm} of individual sentences. M performs sentence re-ranking over (q, {s1, ..., sm}), and the top-ranked candidate is provided as the final answer. M is typically learned as a binary classifier applied to QA pairs labelled as correct or incorrect. AS2 models scale to large collections of sentences (more documents and candidates) more easily than MR models: the latter process passages in their entirety before answering, while the former can break passages down into candidates and evaluate them in parallel, yielding lower latency at inference time.
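As a concrete toy illustration of the AS2 formulation above, the re-ranking step reduces to scoring every (q, s_i) pair and returning the argmax. The lexical-overlap scorer below is a hypothetical stand-in for the transformer model M, purely for illustration:

```python
# Minimal sketch of the AS2 re-ranking step. `score_pair` stands in for the
# transformer answering model M; the overlap scorer is a toy placeholder.
def as2_answer(question, candidates, score_pair):
    """Score every (q, s_i) pair and return the top-ranked candidate and sigma."""
    scored = [(score_pair(question, s), s) for s in candidates]
    sigma, best = max(scored, key=lambda t: t[0])
    return best, sigma

def overlap_score(q, s):
    # Hypothetical stand-in for M: fraction of question tokens in the candidate.
    q_toks, s_toks = set(q.lower().split()), set(s.lower().split())
    return len(q_toks & s_toks) / max(len(q_toks), 1)

best, sigma = as2_answer(
    "when was hamlet written",
    ["Hamlet is a tragedy.", "Hamlet was written by Shakespeare."],
    overlap_score,
)
```

In a real system, `score_pair` would be a fine-tuned cross-encoder applied to each (question, sentence) pair, and the candidates could be scored in parallel batches.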

Precision/Recall Tradeoff
M provides the best answer (irrespective of whether the answer modeling is MR or AS2) for a given question q, along with a prediction score σ (DNNs typically produce a normalized probability), termed MaxProb in several works (Hendrycks and Gimpel, 2017; Kamath et al., 2020). The most popular technique to tune the Pr/Re tradeoff is to set a threshold τ on σ: the system provides an answer for q only if σ > τ. Henceforth, we denote M operating at a threshold τ by M τ. While not perfectly calibrated (as shown by Kamath et al.), the predictions of QA models are expected to align with the ground truth, such that correctly answered questions are more likely to receive a higher σ than incorrectly answered ones. This is an effect of the binary cross-entropy loss, L CE, typically used for training M 3 . For example, Fig. 2 plots Pr/Re on varying the threshold τ of popular MR and AS2 systems, both built using transformer models (SQuAD: BERT-Base M, ASNQ: RoBERTa-Base M; details in Section 5.2). The results show that increasing τ achieves higher Pr at the cost of lower Re.
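The thresholding mechanism can be made concrete with a short sketch. The Precision/Recall definitions here reflect the common abstaining-QA convention (Precision over answered questions, Recall over all questions), and the scores/labels are invented toy values, not from the paper:

```python
def pr_re_at_threshold(scores, correct, tau):
    """Pr/Re of M operating at threshold tau (M answers q only if sigma > tau).

    Precision = correctly answered / answered; Recall = correctly answered /
    all questions. These definitions are our reading of the abstaining-QA
    convention, not taken verbatim from the paper.
    """
    answered = [c for s, c in zip(scores, correct) if s > tau]
    n_correct = sum(answered)
    precision = n_correct / len(answered) if answered else 1.0
    recall = n_correct / len(scores)
    return precision, recall

# Toy MaxProb scores and correctness labels: raising tau trades Re for Pr.
scores = [0.95, 0.90, 0.70, 0.60, 0.30]
correct = [True, True, False, True, False]
p_low, r_low = pr_re_at_threshold(scores, correct, 0.5)    # Pr 0.75, Re 0.60
p_high, r_high = pr_re_at_threshold(scores, correct, 0.8)  # Pr 1.00, Re 0.40
```

Sweeping `tau` over [0, 1] in this way produces exactly the kind of Pr/Re curve shown in Fig. 2.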

Question Filtering Opportunity Space
Real-world QA systems are always associated with a target Precision, which is typically rather high to meet the customer quality requirements 4 using the threshold τ . This means that systems will not provide an answer for a large fraction of questions (q's for which σ ≤ τ ). For example from Fig. 2(b), to obtain a Precision of 90% for SQuAD 1.1, we need to set τ =0.55, resulting in not answering 35.2% of all questions. Similarly, to achieve the same Precision on ASNQ ( Fig. 2(a)), we need to set τ =0.89, resulting in not answering 88.8% of all questions.
It is important to note that the QA system still performs the entire computation: R→S→M on all questions (even the unanswered ones), to decide whether to answer or not. Thus, filtering these questions before executing the system can save the cost of running redundant computation, e.g., 35.2% or 88.8% of the cost of the two systems above (assuming the required Precision value is 90%). In the next section, we show how we build models that can produce a reliable prediction of the QA system score, only using the question text.

Modelling QA System Scores using Input Questions
We propose to use a distillation approach to learn a model operating on questions that can predict the confidence score of the QA system, within a certain error bound. We denote the QA system with Ω(R, S,M τ ), and the question model by F (as we will use it to filter out questions preemptively). Intuitively, F aims at learning how confident the answer model M is on answering a particular question when presented with a set of candidate answers from a retrieval system (R, S).
For a question q, we indicate the set of answer candidates produced by (R, S) with s̄ = {s1, ..., sm}. The output from M for q and s̄ corresponds to the score of the best candidate/span: M(q, s̄) = max_{s ∈ s̄} M(q, s). We train a filter F M for M using a regression head to directly predict the score of M, irrespective of the threshold τ, using the loss:

L MSE(F(q), M(q, s̄)),   (1)

where L MSE is the mean squared error loss. Fig. 3 diagrammatically shows the training process of F M. Additionally, as M typically operates with a threshold τ, we train a filter F M τ corresponding to a specific τ, i.e., M τ, using the loss:

L CE(F(q), 1[M(q, s̄) > τ]),   (2)

where L CE is the cross-entropy loss and 1[·] denotes the binary indicator function.
Our approach differs from standard distillation techniques (Hinton et al., 2015; Sanh et al., 2019; Jiao et al., 2020) in that, unlike the standard setting, the teacher M and the student F operate on different inputs: F only on the question, while M on question-answer pairs. This makes our task much more challenging, as F needs to approximate the probability of M fitting all answer candidates for a question. Since F does not predict whether an answer provided by M is correct/incorrect, we do not require labels of the QA dataset for training F (F's output depends only on the predictions of M). This enables large-scale training of F without any human supervision, using the predictions of a system Ω(R, S, M τ) and a large number of questions.
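The two objectives (Eqs. 1 and 2) can be written out on toy values as follows; the function names and numbers are ours, for illustration only (a real implementation would apply these losses to batches of transformer outputs):

```python
import math

# Illustrative computation of the two distillation objectives. The teacher
# signal M(q, s_bar) is the max answer score over the retrieved candidates;
# the student F sees only the question text.
def teacher_score(candidate_scores):
    """M(q, s_bar) = max over candidates s of M(q, s)."""
    return max(candidate_scores)

def mse_loss(f_q, m_score):
    """Eq. 1 (regression head): L_MSE(F(q), M(q, s_bar))."""
    return (f_q - m_score) ** 2

def bce_loss(f_q, m_score, tau):
    """Eq. 2 (classification head): L_CE(F(q), 1[M(q, s_bar) > tau])."""
    y = 1.0 if m_score > tau else 0.0
    eps = 1e-12  # numerical safety for log
    return -(y * math.log(f_q + eps) + (1.0 - y) * math.log(1.0 - f_q + eps))

m = teacher_score([0.2, 0.8, 0.5])  # teacher confidence M(q, s_bar) = 0.8
reg_loss = mse_loss(0.6, m)         # student predicted 0.6 -> (0.6-0.8)^2
cls_loss = bce_loss(0.6, m, 0.5)    # target 1[0.8 > 0.5] = 1
```

Note that neither loss ever consults the gold answer labels, only the teacher's scores, which is what enables unsupervised large-scale training of F.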
To use the trained F for preemptively filtering out questions, we use a threshold on the score of F. Henceforth, we refer to the threshold of M by τ 1 and that of F by τ 2 . Any question q for which F(q)≤τ 2 , gets filtered out. Using the question filter, we define the new QA system as Ω(F τ 2 , R, S, M τ 1 ) where the filter F can be trained using Eq. 1 or 2 (F M or F M τ 1 ).
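The resulting system Ω(F τ 2, R, S, M τ 1) can be sketched as below. All components are toy stand-ins (names are ours); the point is that a filtered question never triggers retrieval or answering, which is where the cost saving comes from:

```python
def answer_with_filter(question, f_model, tau2, retrieve, split, m_model, tau1):
    """Sketch of Omega(F_tau2, R, S, M_tau1): filter first, answer later."""
    if f_model(question) <= tau2:
        return None  # preemptively filtered: R, S and M are never run
    candidates = split(retrieve(question))
    sigma, best = max((m_model(question, s), s) for s in candidates)
    return best if sigma > tau1 else None  # original system threshold tau1

# Toy components; `calls` counts retrievals to show the saved work.
calls = {"retrieve": 0}
def retrieve(q):
    calls["retrieve"] += 1
    return ["some web document"]
split = lambda docs: ["Paris is the capital of France."]
m_model = lambda q, s: 0.9

# A low filter score (0.2 <= tau2 = 0.5) abstains without any retrieval.
ans = answer_with_filter("capital of france?", lambda q: 0.2, 0.5,
                         retrieve, split, m_model, 0.5)
```

With a confident filter score (e.g., `lambda q: 0.8`), the same call runs the full pipeline and returns the top candidate.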

Experiments
First, we compare how well our models F can approximate the answer score of M. Then, we optimize τ 1 and τ 2 on the dev. set to precisely estimate the cost savings obtainable with the application of our approach to question filtering. We also compare it with different baseline models for F from previous works on question filtering.

Datasets
We use three academic datasets and one industrial dataset to validate our claims across different data domains and question answering tasks (MR and AS2).

WikiQA: An AS2 dataset (Yang et al., 2015) with questions from Bing search logs and answer candidates from Wikipedia. We use the most popular setting: training with questions having at least one positive answer candidate, and testing in the clean mode with questions having at least one positive and one negative answer candidate.

ASNQ: A large-scale AS2 dataset (Garg et al., 2020) 5 corresponding to Natural Questions (NQ), containing over 60k questions and 23M answer candidates. Compared to WikiQA, ASNQ has more sophisticated user questions, derived from Google search logs, and a very high class imbalance (∼1 correct in 400 candidate answers), making it a challenging dataset for AS2. We divide the dev set from the release of ASNQ into two equal splits of 1336 questions each, used for validation and testing.

SQuAD1.1: A large-scale MR dataset (Rajpurkar et al., 2016) 6 containing questions asked by crowdworkers, with answers derived from Wikipedia articles. Unlike the previous two datasets, SQuAD1.1 requires predicting the exact answer span within the provided passage. We divide the dev set into two splits of 5266 and 5267 questions for validation and testing, respectively. Pr/Re is computed based on exact answer match (EM).

AQAD: A large-scale internal industrial dataset containing non-representative, de-identified user questions from the Alexa virtual assistant. The Alexa QA Dataset (AQAD) contains 1 million and 50k questions in its train and dev. sets respectively, with their top answer and confidence scores as provided by the QA system (without any human labels of correctness). Note that the top answer is selected by an answer selection model from hundreds of candidates retrieved from a large web index (∼1B web pages).
For the purpose of this paper, we use a human annotated portion of AQAD (5k questions other than the train/dev. splits) as the test split for our experiments. Results on AQAD are presented relative to the baseline M 0 due to the data being internal.
Sugawara et al. previously highlighted several shortcomings of using popular MR datasets like SQuAD1.1 for evaluation, due to artifacts such as (i) 35% of questions being answerable using only their first 4 tokens, and (ii) 76% of questions having the correct answer in the sentence with the highest unigram overlap with the question. To ensure that our question filters are learning the capability of the QA system and not these artifacts, we consider datasets from industrial scenarios (where questions are real customer queries), like ASNQ, AQAD 7 and WikiQA, in addition to SQuAD.

Models
For each of the three academic datasets, we use two transformer based models (12 and 24 layer) as M: state-of-the-art RoBERTa-Base and RoBERTa-Large trained with TANDA for WikiQA 2 (Garg et al., 2020); RoBERTa-Base and RoBERTa-Large-MNLI fine-tuned on ASNQ 2 (Garg et al., 2020); and, BERT-Base and BERT-Large fine-tuned on SQuAD1.1 (Devlin et al., 2019). For AQAD, we use ELECTRA-Base trained using TANDA (Garg et al., 2020) after an initial transfer on ASNQ as M. For the question filter F, we use two different transformer based models (RoBERTa-Base, Large) for each of the four datasets. For WikiQA, ASNQ and SQuAD1.1, the RoBERTa-Base F is used for the 12-layer M and the RoBERTa-Large F is used for the 24-layer M. For AQAD we train both the RoBERTa-Base and RoBERTa-Large F for the single ELECTRA-Base M. All experimental details are presented in Appendix B, C for reproducibility.

Baselines
To demonstrate the efficacy of our question filters, we use two question filtering baselines. The first captures the well-formedness and intelligibility of questions from a human perspective. For this, we train RoBERTa-Base and RoBERTa-Large regression models on the question well-formedness human annotation scores of the Paralex dataset (Faruqui and Das, 2018) 8 . We denote the resulting filter by F W. For the second baseline, we train a question classifier which predicts whether M will correctly answer a question. This idea has been studied in very recent contemporary works (Varshney et al., 2020; Chakravarti and Sil, 2021), but for answer verification (not for efficiency). We fine-tune RoBERTa-Base and RoBERTa-Large for each dataset to predict whether the target M correctly answers the question. We denote this filter by F C.
We exclude comparisons with early-exiting strategies (Soldaini and Moschitti, 2020; Xin et al., 2020; Liu et al., 2020) that adaptively reduce the number of transformer layers per sample, since they aim to improve the efficiency of M instead of Ω. Inference batching with multiple samples cannot directly exploit this efficiency benefit, so these works report efficiency gains through abstract quantities such as FLOPs (floating-point operations) using an inference batch size of 1, which is not practical. The efficiency gains from our approach are tangible, since filtering questions can scale down the required number of GPU-compute instances. Furthermore, ideas from these works can easily be combined with ours to add both efficiency gains to the QA system.

Approximating Precision/Recall of M
First, we want to compare how well our question filter F can approximate the answer score from M. To do this, we plot the Pr/Re curves of M on varying τ 1 (i.e., M τ 1 ) and of the filter F on varying τ 2 (i.e., F τ 2 ) on the dataset test splits. We consider three options for the filter F: our regression-head question filter F M and the two baselines F W and F C. We present graphs on SQuAD1.1 and AQAD using the RoBERTa-Base F in Fig. 4. Note that our classification-head filter (F M τ 1 ) is trained for a specific τ 1 of M, and hence cannot be directly compared in Fig. 4 (since training F M τ 1 for every τ 1 ∈ [0, 1] is not feasible). The graphs show that F M approximates the Pr/Re of M much better than the baseline filters F W and F C. The gap between F M and F C in approximating the Pr/Re of M indicates that learning answer scores is easier than predicting whether the model's answer is correct using only the question text.
While these plots independently compare F M and M, in practice, Ω will operate M at a non-zero threshold τ 1 sequentially after F M τ 2 (henceforth we denote this by F M τ 2 →M τ 1 ). To simplify visualization of the resulting system in Fig. 4, we propose to use a single common threshold τ for both F M and M, denoted as F M τ →M τ . From Fig. 4-(a), (b), the Pr/Re curve for F M τ →M τ on varying τ approximates that of M very well. Using F M however, imparts a large efficiency gain to Ω as shown by the four operating points that represent the % of questions filtered out by F M . For example, for AQAD, 60% of all the questions can be filtered out before running (R, S, M) (translating to a cost saving of the same fraction) while only dropping the Recall of Ω by 3 to 4 points. Complete plots having Pr/Re curves for F C τ →M τ and F W τ →M τ for all four datasets are included in Appendix D.

Selecting Threshold τ 2 for F
When adding a question filter F to Ω(R, S, M), the operating threshold τ 2 of F is a user-tunable parameter which can be varied per use case, trading efficiency for recall. This tunability is a prominent advantage of our approach, since one can decide what fraction of questions to filter out based on how much recall one can afford to drop. We plot the fraction of questions filtered by F along with the change in Pr/Re of Ω on varying τ 2 in Fig. 5. Specifically, we consider the ASNQ and AQAD datasets, with M operating at τ 1 =0.5. From Fig. 5(a) we observe that for ASNQ, our filter F M can obtain ∼18% filtering gains while only losing ∼3 points of recall. F M obtains even better filtering gains on AQAD: from Fig. 5(b), ∼40% filtering while only losing ∼4 points of recall. Complete plots for all datasets can be found in Appendix E.
We now present one possible way to choose an operating threshold τ 2 for the filter F. For a QA system Ω(F τ 2 , R, S, M τ 1 ), we find the threshold τ 2 * for F at which it best approximates the answering/abstaining choice of M τ 1 . Specifically, we use the dev. split of the datasets to find τ 2 * ∈ [0, 1] such that F τ 2 * obtains the highest F1-score for the binary decision of answering or abstaining made by M τ 1 . We present empirical results of our filters at different thresholds τ 1 of M in Table 1. We evaluate the % of questions filtered out by F τ 2 * (efficiency gains) and the resulting drop in recall of F τ 2 * →M τ 1 relative to M τ 1 on the test split of the dataset. For each dataset and model M, we train one regression-head filter F M and five classification-head filters F M τ 1 , one for each threshold τ 1 ∈ {0.3, 0.5, 0.6, 0.7, 0.9} of M. For the regression-head F M, the optimal τ 2 * is calculated independently for every τ 1 of M.
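The selection of τ 2 * can be sketched as a simple grid search over candidate thresholds (our simplification of searching [0, 1]; the dev-set scores and M's decisions below are invented toy values):

```python
def best_tau2(f_scores, m_answers, grid):
    """Pick tau2* maximizing the F1 of F's keep/filter decision against the
    answer/abstain decision of M_tau1 on a dev set. `grid` is a finite set
    of candidate thresholds (our simplification of searching [0, 1])."""
    def f1_at(tau2):
        tp = sum(f > tau2 and a for f, a in zip(f_scores, m_answers))
        fp = sum(f > tau2 and not a for f, a in zip(f_scores, m_answers))
        fn = sum(f <= tau2 and a for f, a in zip(f_scores, m_answers))
        if tp == 0:
            return 0.0
        p, r = tp / (tp + fp), tp / (tp + fn)
        return 2 * p * r / (p + r)
    return max(grid, key=f1_at)

# Dev-set sketch: F's scores vs. whether M_tau1 actually answered.
f_scores = [0.9, 0.8, 0.4, 0.3, 0.1]
m_answers = [True, True, True, False, False]
tau2_star = best_tau2(f_scores, m_answers, [0.0, 0.2, 0.35, 0.5, 0.85])
```

Here the threshold 0.35 perfectly separates the questions M answers from those it abstains on, so the search returns it.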
Results: From Table 1 we observe that our question filters (with both classification and regression heads) impart filtering efficiency gains (of different proportions) while only incurring a small drop in recall. For example, on ASNQ (12-layer M, F), F M 0.5 is able to filter out 17.8% of the questions while only incurring a drop in recall of 2.9%. On ASNQ (24-layer M, F), F M is able to filter out 21.6% of the questions with a drop in recall of only 3.9% at τ 1 =0.7. Barring some cases at higher thresholds, F M achieves filtering performance comparable to F M τ 1 . The best filtering gains are obtained on the industrial AQAD dataset, which contains real-world noise: for τ 1 = 0.5, the 24-layer F M 0.5 can filter out 45.8% of all questions while only incurring a drop in recall of 4.9%.
We observe that the filtering gains at the optimal τ 2 * are inversely correlated with the precision of M. For example, for (12-layer M, F) at τ 1 = 0.5, the Pr on SQuAD 1.1 and ASNQ is 88.7 and 74.6 respectively, and that on AQAD is significantly lower than on ASNQ (due to real-world noise). The % of questions filtered by F M τ 1 or F M at τ 2 * correspondingly increases from 9−9.4% to 14.9−17.8% to 43.2−48.2%. The efficiency gain of our filters thus increases as the QA task becomes increasingly difficult (SQuAD 1.1 → ASNQ → AQAD). Furthermore, except for some cases on WikiQA, we observe that our question filters increase the precision of the system (for the full table with ∆Pr and ∆F1, refer to Table 5 in Appendix F). This is in line with our observations in Fig. 2 and those of Kamath et al..

Table 1: Filtering gains and drop in recall for question filters operating at the optimal filtering threshold τ 2 * . For a particular filter F operating with answer model M τ 1 , ∆Re refers to the difference in Recall between F τ 2 * →M τ 1 and M τ 1 . % Filter refers to the % of questions preemptively discarded by F. M τ 1 results for AQAD are relative to M 0 .
WikiQA (873 questions) is a very small dataset for effectively distilling information from a transformer M; standard distillation (Hinton et al., 2015) often requires millions of training samples. To mitigate this, we extend the sequential fine-tuning of Garg et al. (2020) to learning question filters for WikiQA: we perform a two-step learning of F M and F M τ 1 , first on ASNQ and then on WikiQA. The results for WikiQA in Table 1 correspond to this training paradigm for F M and F M τ 1 , and demonstrate that our approach works reasonably well even on very small datasets. This also has implications for the shared semantics of question filters for models M trained on different datasets.
The drop in Re and filtering gains are contingent on the Pr/Re curve of M for the dataset. At higher thresholds (say τ 1 =0.9), if the drop in recall due to F M τ 1 or F M at τ 2 * is more than desirable, then one can reduce the value of τ 2 down from τ 2 * by reducing the efficiency gains using plots like Fig. 5.

Comparison with Baselines:
We also present results at the optimal τ 2 * for F W and F C in Table 2 for ASNQ (complete results for all datasets are in Appendix F). Compared with F M and F M τ 1 on ASNQ in Table 1, both F W and F C perform worse in terms of filtering performance. F W, which evaluates the well-formedness of questions from a human perspective, is unable to filter out any questions even when operating at its optimal threshold. This indicates that human-supervised filtering of ill-formed questions is sub-optimal from an efficiency perspective. F C performs better than F W, but always trails F M and F M τ 1 , either filtering a smaller % of questions or incurring a larger drop in recall.
Efficiency Gains from Filtering: Under simplifying assumptions, the computational resources required to answer questions within a fixed time budget over a fixed set of documents scale roughly linearly with the number of concurrent requests that Ω must process. We present a simple analysis on ASNQ (ignoring the cost of retrieval R, S) considering 1000 questions (ASNQ has 400 candidate answers/question) and a batch size of 100. M requires 1000 * 400/100 = 4000 transformer forward passes (max_seq_length=128, standard for QA tasks due to long answers). On the other hand, max_seq_length=32 suffices for F. Since the inference latency of transformers scales roughly quadratically with input sequence length, one batch through M is 4^2 = 16 times slower than one through F. Assuming 20% question filtering by F, M now only answers 800 questions (3200 forward passes of M), while adding 1000/100 = 10 forward passes of F. The resulting reduction in compute time is 19.968% ∼ 20%. We perform inference on the ASNQ test set on one V100 GPU (with F M set to filter out 20% as above) and observe the latency dropping from 531.29s to 433.16s (18.47%, slightly lower than the calculated 19.968% due to input/output overheads). The latency reduction can also translate into a reduction in the number of GPU compute resources required when performing inference in parallel. Furthermore, in practice, our filter will also provide cost/latency savings by not performing document retrieval for the filtered-out questions.
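This back-of-the-envelope analysis can be written as a small cost model. It is our simplification: it ignores retrieval and I/O overheads, and small differences from the ~20% figure in the text come down to rounding conventions:

```python
def cost_saving(n_questions, cands_per_q, batch, filter_frac,
                m_seq_len=128, f_seq_len=32):
    """Fractional compute saving from filtering, assuming transformer batch
    latency scales quadratically with sequence length (so one M batch costs
    (m_seq_len / f_seq_len)**2 = 16 F batches here)."""
    m_batches_before = n_questions * cands_per_q / batch           # 4000
    f_batches = n_questions / batch                                # 10: F runs on every question
    f_cost_in_m_units = f_batches * (f_seq_len / m_seq_len) ** 2   # 10/16 = 0.625
    m_batches_after = (1 - filter_frac) * m_batches_before         # 3200
    return 1 - (m_batches_after + f_cost_in_m_units) / m_batches_before

saving = cost_saving(1000, 400, 100, 0.20)  # close to 20%, as in the text
```

The same function also shows why the gains grow with the filter rate: the filter's own cost (0.625 M-equivalent batches here) is essentially negligible against the 4000 batches it can avoid.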

Qualitative Analysis
In Table 3, we discuss some examples to highlight a few shortcomings of F. Both F and M successfully filter out non-factual queries asking for opinions (examples 1, 2). Having identified popular entities ("Jennifer Lopez", "Lakers", "horses") during training, F incorrectly assumes that a question composed of these entities will be answered by the system; however, due to the unavailability of a web document containing the exact answer, M may still fail to answer the question (examples 3, 4). On the other hand, when faced with entities not encountered during training ("Ahsoka Tano", "Mandalorian") or syntactically complex questions, F might preemptively filter out questions that will actually be answered by M (examples 5, 6).

Conclusion and Future Work
In this paper, we have presented a novel paradigm of training a question filter to capture the semantics of a QA system's answering capability by distilling the knowledge of the answer scores from it. Our experiments on three academic and one industrial QA benchmark show that the trained question models can estimate the Pr/Re curves of the QA system well, and can be used to effectively filter questions while only incurring a small drop in recall. An interesting future work direction is to analyze the impact/behavior of the question filters in a cross-domain setting, where the training and testing corpora are from different domains. This would allow examining the transferability of the semantics learned by the question filters. A complementary future work direction is knowledge distillation from a sophisticated answer verification module (Rodriguez et al., 2019; Kamath et al., 2020; Zhang et al., 2021).
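Concretely, the distillation objectives underlying the question filters can be written as follows (our notation, a sketch rather than the paper's exact formulation): the regression head regresses onto M's answer confidence, while the classification head matches M's thresholded answer/abstain decision.

```latex
% Regression-head filter F_M: regress onto M's answer confidence s_M(q_i)
\mathcal{L}_{\mathrm{reg}}(\theta) = \frac{1}{N}\sum_{i=1}^{N}\bigl(F_\theta(q_i) - s_M(q_i)\bigr)^2

% Classification-head filter F_{M,\tau_1}: match M's binary decision at threshold \tau_1
\mathcal{L}_{\mathrm{cls}}(\theta) = -\frac{1}{N}\sum_{i=1}^{N}\Bigl[\, y_i \log F_\theta(q_i)
    + (1 - y_i)\log\bigl(1 - F_\theta(q_i)\bigr)\Bigr],
\qquad y_i = \mathbb{1}\bigl[s_M(q_i) \ge \tau_1\bigr]
```

Note that only the question q_i (no answer candidates) is input to F, which is what makes the distillation "partial-input".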
In addition to providing efficiency gains, the question filters could be used to qualitatively study the characteristics of questions that are likely to lead to low answer confidence scores. This can (i) help error analysis for improving the accuracy of QA systems, and (ii) be used for efficient sampling of training questions that are harder to be answered by the target QA system.
Our approach for training the question filters proposes the idea of partial-input knowledge distillation (e.g., using only questions instead of QA pairs). This concept can possibly be extended to other NLP problems for achieving compute efficiency gains, improved explainability (e.g., to what extent a partial input influences model prediction) and qualitative analysis.

A Dataset Details
AQAD: A large-scale internal industrial QA dataset derived from the Alexa virtual assistant. The Alexa QA Dataset (AQAD) contains 1 million and 50k questions in its train and dev. sets respectively, with their top answer and confidence scores as provided by the QA system. Note that the question-answer pairs do not carry any human labels of correctness/incorrectness. The top answer is selected using an answer sentence selection model from hundreds of candidates retrieved from a large web index (∼1B web pages). For testing, we use 5000 questions (other than those in the train/dev. splits), each human-annotated with a label indicating whether the top answer from the QA system is correct or incorrect. For learning the correctness filter F_C baseline on AQAD, we use an additional annotated split of 2,500 questions distinct from the train/dev./test splits.

B Model Details
For each of the datasets, we describe the details of the answer models M for reproducibility purposes:
• SQuAD1.1: We consider the BERT-Base uncased and BERT-Large uncased (whole-word masking) model variants and fine-tune them on the training set of SQuAD1.1 for 3 epochs with a standard learning rate of 2e-5 using Adam, with learning-rate warm-up set for the first 5% of training steps. Baseline accuracy for the 12- and 24-layer model M on the test split is 77.9% and 84.0%, respectively.
• AQAD: We consider the ELECTRA-Base (Clark et al., 2020) model and perform sequential fine-tuning using TANDA (Garg et al., 2020): a first round of fine-tuning on ASNQ for 3 epochs with a learning rate of 2e-5 using Adam (learning-rate warm-up of 5%), followed by a second round of fine-tuning for 3 epochs with a learning rate of 2e-6 using Adam (learning-rate warm-up of 5%). Baseline accuracy is not disclosed since the data is internal.

C Experimental Details
Training Details: All computations are performed on NVIDIA Tesla V100 GPUs with a batch size of 128. For training a question filter, we train F using the proposed loss objectives for 3 epochs on the training split of the dataset with a standard learning rate of 2e-5 using Adam (with learning-rate warm-up set for the first 5% of the training steps). RoBERTa-Base and RoBERTa-Large question filters are trained corresponding to the 12- and 24-layer answer models M, respectively. For AQAD, both the RoBERTa-Base and RoBERTa-Large question filters are trained corresponding to the ELECTRA-Base answer model M. For WikiQA, the question filters are trained sequentially: first on ASNQ for 3 epochs with a standard learning rate of 2e-5 using Adam (with learning-rate warm-up set for the first 5% of the training steps), and then on WikiQA for 3 epochs with a learning rate of 3e-6 (with learning-rate warm-up set for the first 5% of the training steps). The baseline classifiers for well-formedness (F_W) and correctness of answering a question (F_C) are also trained by fine-tuning the RoBERTa-Base and -Large models for 3 epochs with a standard learning rate of 2e-5 (with learning-rate warm-up set for the first 5% of the training steps). As mentioned in Appendix A, we use an additional annotated data split containing 250k QA pairs (2.5k questions) for training F_C on AQAD.
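For quick reference, the shared training hyperparameters above can be gathered into a plain configuration dict (key names are ours; this is a summary of the text, not runnable training code):

```python
# Question-filter training configuration, summarizing the details above.
# Key names are illustrative; values are taken from the paper's description.
FILTER_TRAINING_CONFIG = {
    "hardware": "NVIDIA Tesla V100",
    "batch_size": 128,
    "epochs": 3,
    "optimizer": "Adam",
    "learning_rate": 2e-5,
    "lr_warmup_fraction": 0.05,       # warm-up over the first 5% of training steps
    "wikiqa_second_stage_lr": 3e-6,   # after sequential training on ASNQ
}
```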
Validation Strategy: For computing the optimal threshold τ2* of a question filter F as described in Section 5.5, we use the dev. split of the datasets to find τ2* ∈ [0, 1] such that F_τ2* obtains the highest F1-score corresponding to the binary decision of answering or abstaining by M_τ1. We present concrete results corresponding to 5 different operating thresholds τ1 of the answer model M: {0.3, 0.5, 0.6, 0.7, 0.9}. At each τ1 for M, we consider all 4 possible question filters: our answer-model distilled question filter with regression head F_M, our answer-model distilled question filter with classification head F_Mτ1, the correctness question filter F_C, and the well-formedness question filter F_W. For each of these four filters, we independently optimize τ2* ∈ [0, 1] corresponding to the best F1 filtering of M_τ1. Code: The code for training our answer-model distilled question filters can be accessed at https://github.com/alexa/wqa-question-filtering.
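The validation strategy can be sketched as a simple grid search (function and variable names are ours, not from the released code): on the dev split, pick the τ2 whose keep/discard decision best matches, by F1, the answer model's answer/abstain decision at τ1.

```python
# Sketch of the tau2* search described above. Illustrative only.

def f1(tp, fp, fn):
    """Binary F1 from true-positive, false-positive, false-negative counts."""
    if tp == 0:
        return 0.0
    p, r = tp / (tp + fp), tp / (tp + fn)
    return 2 * p * r / (p + r)

def optimal_tau2(filter_scores, answer_scores, tau1, grid_size=101):
    """filter_scores: F's score per dev question; answer_scores: M's confidence."""
    answered = [s >= tau1 for s in answer_scores]  # M_tau1's answer/abstain decision
    best_tau2, best_f1 = 0.0, -1.0
    for i in range(grid_size):
        tau2 = i / (grid_size - 1)
        kept = [s >= tau2 for s in filter_scores]  # F_tau2's keep/discard decision
        tp = sum(k and a for k, a in zip(kept, answered))
        fp = sum(k and not a for k, a in zip(kept, answered))
        fn = sum(a and not k for k, a in zip(kept, answered))
        score = f1(tp, fp, fn)
        if score > best_f1:
            best_tau2, best_f1 = tau2, score
    return best_tau2
```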

D Complete Graphs on Approximating Pr/Re of M
We present Pr/Re curves of M obtained by varying τ1 (i.e., M_τ1) and those of the filters F ∈ {F_M, F_C, F_W} obtained by varying τ2 (i.e., F_τ2) on the dataset test splits in Fig. 6 (a)-(d). We also present Pr/Re curves comparing the filters F ∈ {F_M, F_C, F_W} when operating jointly with the answer model M at the same threshold τ, i.e., F_Mτ→M_τ, F_Cτ→M_τ and F_Wτ→M_τ, on the dataset test splits in Fig. 6 (e)-(h). As visible from the graphs, our filter F_M better approximates the Pr/Re of M both when operating independently of M (Fig. 6 (a)-(d)) and when operating jointly with M at a non-zero threshold (Fig. 6 (e)-(h)). Note that the answer models trained on WikiQA (which has a very small test set of only 237 samples) are very poorly calibrated, as is visible from the shape of the Pr/Re curve of M in Fig. 6 (d), (h). Interestingly, even for such a poorly calibrated answer model, our filter F_M is still able to approximate the Pr/Re of M better than the baselines F_C and F_W. This illustrates the validity of our technique even for very small datasets (few-shot setting).
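The joint operating mode F_τ→M_τ can be sketched as follows (our simplification: recall is computed over all questions whose top candidate is correct, and exact metric definitions in the paper may differ):

```python
# Sketch of Precision/Recall for the joint pipeline F_tau -> M_tau.
# A question is answered only if BOTH the filter and the answer model
# score it at or above tau; `correct` flags whether M's top answer is correct.

def pipeline_pr_re(filter_scores, answer_scores, correct, tau):
    answered = [f >= tau and m >= tau
                for f, m in zip(filter_scores, answer_scores)]
    n_answered = sum(answered)
    n_correct = sum(a and c for a, c in zip(answered, correct))
    precision = n_correct / n_answered if n_answered else 1.0
    recall = n_correct / sum(correct) if any(correct) else 0.0
    return precision, recall
```

Sweeping tau over [0, 1] with this function traces out the joint Pr/Re curve of Fig. 6 (e)-(h).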
Our classification-head filter F_Mτ1 is trained specific to a threshold τ1 of M. Since each point in the Pr/Re graph of M corresponds to a different threshold τ1 ∈ [0, 1], a fair comparison in Fig. 6 would require training a separate F_Mτ1 for every τ1 ∈ [0, 1], which is infeasible. To show how the classification-head filters can approximate the Pr/Re of M, we arbitrarily select two thresholds τ1 = {0.5, 0.7} and plot the Pr/Re curves of the two classification-head filters F_M0.5 and F_M0.7 in Fig. 7. We also plot the operating configurations for these filters at the corresponding τ1 for M, i.e., F_M0.5→M_0.5 and F_M0.7→M_0.7, in Fig. 7.

E Complete Graphs on Varying Threshold τ2 of F
We present the variation of the fraction of questions filtered by F, along with the change in Pr/Re of Ω, on varying τ2 for all the datasets in Fig. 8. We present plots for three different operating thresholds τ1 = {0.5, 0.7, 0.9} of the answer model M.
For a given dataset, since F_M is trained independently of any threshold τ1, the fraction of filtered questions remains the same as we vary τ2, even at different values of τ1. Using these graphs, one can choose the desired operating point for the filter according to how much efficiency gain is desired and how much drop in recall can be tolerated.

F Complete Results at the Optimal Filtering Threshold
We present the complete empirical results corresponding to the optimal filtering threshold τ2* for our question filters in Table 5. We evaluate the % of questions filtered out by F_τ2* (efficiency gains) and the resulting drop in Precision, Recall and question-answering F1 score of F_τ2*→M_τ1 from M_τ1 on the test split of the dataset. For each dataset and model M, we train one regression-head question filter F_M and five classification-head question filters F_Mτ1: one at every threshold τ1 ∈ {0.3, 0.5, 0.6, 0.7, 0.9} of M. The optimal filtering threshold τ2* is computed using the validation strategy described in Appendix C. For the regression head F_M, the optimal τ2* is calculated independently for every τ1 of M.
Additionally, we present the complete empirical results on all datasets corresponding to the optimal filtering threshold τ2* for the baseline question filters F_C and F_W in Table 6. We observe that both F_W and F_C perform worse than our filters F_M and F_Mτ1 in terms of filtering performance. Except at higher thresholds on AQAD, the well-formedness filter F_W is unable to filter out a sizable fraction of questions even when operating at τ2*, which indicates that human-supervised filtering of ill-formed questions is sub-optimal from an efficiency perspective. F_C performs better than F_W, but always trails F_M and F_Mτ1, either filtering a smaller % of questions or incurring a larger drop in recall.