Model Agnostic Answer Reranking System for Adversarial Question Answering

While numerous methods have been proposed as defenses against adversarial examples in question answering (QA), these techniques are often model specific, require retraining of the model, and yield only marginal improvements in performance over vanilla models. In this work, we present a simple model-agnostic approach to this problem that can be applied directly to any QA model without any retraining. Our method employs an explicit answer candidate reranking mechanism that scores candidate answers based on their content overlap with the question before making the final prediction. Combined with a strong base QA model, our method outperforms state-of-the-art defense techniques, calling into question how well these techniques actually perform and how strong these adversarial testbeds are.


Introduction
As reading comprehension datasets (Richardson et al., 2013; Hermann et al., 2015a; Rajpurkar et al., 2016; Joshi et al., 2017) and models (Sukhbaatar et al., 2015; Seo et al., 2016; Devlin et al., 2019) have advanced, QA research has increasingly focused on out-of-distribution generalization (Khashabi et al., 2020; Talmor and Berant, 2019) and robustness. Jia and Liang (2017) and Wallace et al. (2019) show that appending unrelated distractors to contexts can easily confuse a deep QA model, calling into question the effectiveness of these models. Although these attacks do not necessarily reflect a real-world threat model, they serve as an additional testbed for generalization: models that perform better against such adversaries might be expected to generalize better in other ways, such as on contrastive examples (Gardner et al., 2020).
In this paper, we propose a simple method for adversarial QA that explicitly reranks candidate answers predicted by a QA model according to a notion of content overlap with the question. Specifically, by identifying contexts where more named entities are shared with the question, we can extract answers that are more likely to be correct in adversarial conditions.
The impact of this is two-fold. First, our proposed method is model agnostic: it can be applied post-hoc to any QA model that predicts probabilities of answer spans, without any retraining. Second, and most importantly, we demonstrate that even this simple named entity based question-answer matching technique can be surprisingly useful. We show that our method outperforms state-of-the-art but more complex adversarial defenses with both BiDAF (Seo et al., 2016) and BERT (Devlin et al., 2019) on two standard adversarial QA datasets (Jia and Liang, 2017; Wallace et al., 2019). The fact that such a straightforward technique works well calls into question how reliable current datasets are for evaluating the actual robustness of QA models.

Related Work
Over the years, various methods have been proposed for robustness in adversarial QA, the most prominent being adversarial training (Wang and Bansal, 2018; Lee et al., 2019; Yang et al., 2019b), data augmentation (Welbl et al., 2020), and posterior regularization (Zhou et al., 2019). Among these, for fairness we compare our method only with techniques that train on clean SQuAD (Wu et al., 2019; Yeh and Chen, 2019). Wu et al. (2019) use a syntax-driven encoder to model the syntactic match between a question and an answer. Yeh and Chen (2019) use a prior approach (Hjelm et al., 2019) to maximize mutual information among contexts, questions, and answers to avoid overfitting to surface cues. In contrast, our technique is more closely related to retrieval-based methods for open-domain QA (Chen et al., 2017; Yang et al., 2019a) and multi-hop QA (Welbl et al., 2018; De Cao et al., 2019): we show that shallow matching can improve the reliability of deep models against adversaries in addition to these more complex settings.

[Figure 1: Given each answer option (right column), we extract named entities and compare them to named entities in the question. The overlap is used as a reranking feature to choose the final answer. The sentence containing the ground-truth answer is highlighted in green, the ground-truth answer is boxed, and the distractor sentence is highlighted in red.]
Methods for (re)ranking candidate passages or answers have often been explored in the context of information retrieval (Severyn and Moschitti, 2015), content-based QA (Kratzwald et al., 2019), and open-domain QA (Lee et al., 2018). Similar to our approach, these methods also exploit some measure of coverage of the query by the candidate answers or their supporting passages to decide the ranks. However, the main motivation behind ranking in such cases is usually to narrow down the region of interest within the text in which to look for the answer. In contrast, we use a reranking mechanism that allows our QA model to ignore distractors in adversarial QA, and it can also provide model- and task-agnostic behavior, unlike the commonly used learning-based (re)ranking mechanisms.
In yet another related line of research, Chen et al. (2016) and Kaushik and Lipton (2018) reveal the simplistic nature and certain important shortcomings of popular QA datasets. Chen et al. (2016) conclude that the simple nature of the questions in the CNN/Daily Mail reading comprehension dataset (Hermann et al., 2015b) allows a QA model to perform well by extracting single-sentence relations. Kaushik and Lipton (2018) perform an extensive study with multiple well-known QA benchmarks and show several troubling trends: basic model ablations, such as making the input question-only or passage-only, can beat state-of-the-art performance, and the answers are often localized in the last few lines, even in very long passages, possibly allowing models to achieve very strong performance by learning trivial cues. Although we also question the efficacy of well-known adversarial QA datasets in this work, our core focus is on exposing certain issues specifically with the design of the adversarial distractors rather than with the underlying datasets.

Approach
Neural QA models are usually trained in a supervised fashion on labeled examples of contexts, questions, and answers to predict answer spans; we represent these as (s, e) tuples, where s denotes the sentence and e the candidate span. Prior work (Lewis and Fan, 2019; Mudrakarta et al., 2018; Yeh and Chen, 2019; Chen and Durrett, 2019) has noted that the end-to-end paradigm can overfit to superficial biases in the data, causing learning to stop when simple correlations are sufficient for the model to answer a question confidently. By explicitly enforcing content relevance between the predicted answer-containing sentence and the question, we can combat this poor generalization.
Specifically, we explicitly score the candidate sentences according to the word-level overlap in named entities common to both the question and a sentence. We refer to our method as the Model Agnostic Answer Reranking System (MAARS); Figure 1 illustrates its workflow. MAARS can be applied to any QA model that predicts answer span probabilities. First, we use the base QA model to compute the n best answer spans A = {(s_1, e_1), ..., (s_n, e_n)} for a context-question pair (c, q), where n is a hyperparameter. Any answer span that does not lie within a single sentence is broken into subspans that lie in separate sentences, and A is updated accordingly.

[Table 1: AddSent and AddOneSent results with BERT-S. MAARS outperforms the vanilla and baseline models on adversarial data, but its performance drops slightly on the original data due to constrained reranking of answers.]
Next, we extract from the context the set of candidate sentences L containing these n answer spans. For the question and each candidate sentence, we compute a set of named entity chunks using an open-source AllenNLP (Gardner et al., 2017) NER model. We then compute the set of words inside named entity chunks for each candidate sentence, NER(l_k) for all l_k ∈ L, and for the question, NER(q); note that NER(·) refers to a set of words, not a set of named entities. Each candidate sentence l_k is then given a score SC(l_k) = |NER(l_k) ∩ NER(q)|, and the answer spans are reranked according to the scores of the sentences containing them. In the case of ties, or if there are multiple spans in the same candidate sentence, the spans are ranked among themselves according to the original ordering from the QA model. Finally, the top-ranked span after reranking is chosen as the final answer.
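Assuming the base model's candidate spans and per-sentence named-entity word sets are already available, the reranking step above can be sketched as follows (a minimal illustration with hypothetical data structures, not the authors' implementation):

```python
def maars_rerank(spans, span_to_sentence, sentence_ner_words, question_ner_words):
    """Rerank candidate answer spans by named-entity word overlap.

    spans: candidate spans, ordered by base QA model confidence.
    span_to_sentence: span -> index of its containing sentence.
    sentence_ner_words: sentence index -> set of words inside NE chunks.
    question_ner_words: set of words inside the question's NE chunks.
    """
    def score(span):
        # SC(l_k) = |NER(l_k) ∩ NER(q)|
        return len(sentence_ner_words[span_to_sentence[span]] & question_ner_words)

    # sorted() is stable, so ties (and multiple spans in the same
    # sentence) keep the base model's original ordering.
    return sorted(spans, key=score, reverse=True)


# Toy usage: the second candidate's sentence shares more NE words
# with the question, so it is promoted to the top.
ranked = maars_rerank(
    spans=["Tesla", "1943", "Edison"],
    span_to_sentence={"Tesla": 0, "1943": 1, "Edison": 2},
    sentence_ner_words={0: {"Nikola", "Tesla"},
                        1: {"Tesla", "New", "York", "1943"},
                        2: {"Thomas", "Edison"}},
    question_ner_words={"Tesla", "New", "York"},
)
# ranked == ["1943", "Tesla", "Edison"]
```

A stable sort realizes the tie-breaking rule for free: spans with equal overlap scores retain the base model's confidence ordering.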
Compared to the base QA model, this approach only relies on an additional NER model and can be applied without any retraining of the base model. Note that the architecture does not depend on any specific tagger, and other content matching models, such as simple word matching, could also be used in its place.

Experimental Setup

We evaluate on two adversarial QA datasets: Adversarial SQuAD (Jia and Liang, 2017) and Universal Triggers (Wallace et al., 2019). We do not describe the adversarial distractor generation process for either dataset and point the interested reader to the original papers for exact details. For Adversarial SQuAD, we test MAARS with both BiDAF and BERT and compare against state-of-the-art baselines on the adversary types used in the original papers. To the best of our knowledge, there is no pre-existing literature that proposes a defense technique for Universal Triggers. We also find that this attack fails to degrade the performance of our vanilla BERT model, probably because the triggers were originally generated for BiDAF. Thus, we only evaluate on this dataset in the BiDAF setting, using all four triggers: Who, When, Where, and Why.
For BiDAF, we compare MAARS against the Syntactic Leveraging Network (SLN) of Wu et al. (2019) on AddSent. SLN encodes predicate-argument structures from the context and question, a conceptually similar structure-matching approach to MAARS, but trained end-to-end with many more parameters. For BERT, we benchmark MAARS against QAInfoMax (Yeh and Chen, 2019) on AddSent and AddOneSent. In addition to the standard loss for training QA models, QAInfoMax adds a loss to maximize the mutual information between the learned representations of words in the context and their neighborhood, and also between those of the answer spans and the question.

Implementation details. We use the uncased base (single) pretrained BERT from HuggingFace (Wolf et al., 2019) and finetune it on SQuAD v1.1 (Rajpurkar et al., 2016) for 2 epochs using the Adam with weight decay optimizer (Loshchilov and Hutter, 2019) and an initial learning rate of 3e-5, for both vanilla BERT and BERT + QAInfoMax. We set the training batch size to 5 and the proportion of linear learning rate warmup for the optimizer to 10%.

[Table 3: Results on Universal Triggers with BiDAF (BERT-specific triggers are not publicly available). MAARS is better than the vanilla model for most adversaries, but with smaller performance gains than on Adversarial SQuAD.]
Our BiDAF (Seo et al., 2016) model has a hidden state of size 100 and takes 100-dimensional GloVe (Pennington et al., 2014) embeddings as input. For character-level embedding, it uses 100 one-dimensional convolutional filters, each with a width of 5. A uniform dropout (Srivastava et al., 2014) of 0.2 is applied at the CNN layer for character embedding, at all LSTM (Hochreiter and Schmidhuber, 1997) layers, and at the layer before the logits. We train it with AdaDelta (Zeiler, 2012) and an initial learning rate of 0.5 for 50 epochs, with a training batch size of 128. For the Syntactic Leveraging Network, we follow the exact hyperparameter settings of Wu et al. (2019).
Other hyperparameters common to both BERT and BiDAF include an input sequence length of 400, a maximum query length of 64, and 40 predicted answer spans per context-question pair. For NER tagging, we use an ELMo-based implementation from AllenNLP (Gardner et al., 2017) that has been finetuned on CoNLL-2003 (Tjong Kim Sang and De Meulder, 2003). Finally, we set the value of n (the number of candidates considered for reranking) in MAARS to 10 across all our experiments.
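As an aside, the word-level NER(·) sets can be derived from any tagger's chunked output. A minimal sketch over BIO-style tags (`entity_words` is a hypothetical helper; the actual AllenNLP model emits BIOUL-style tags, but the idea is the same):

```python
def entity_words(tokens, tags):
    """Return the set of words inside named-entity chunks.

    Per the method description, NER(.) is a set of *words*,
    not a set of entity mentions: any token whose tag is not
    "O" is part of some entity chunk and contributes its word.
    """
    return {tok for tok, tag in zip(tokens, tags) if tag != "O"}


# Toy question: only tokens inside PER/DATE chunks are kept.
q_words = entity_words(
    ["Where", "did", "Nikola", "Tesla", "move", "in", "1884", "?"],
    ["O", "O", "B-PER", "I-PER", "O", "O", "B-DATE", "O"],
)
# q_words == {"Nikola", "Tesla", "1884"}
```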

Results
In all our results tables, we report the macro-averaged F1 and exact match (EM) scores, separated by a slash in each cell. In Tables 1 and 2, Original and Adversarial (Adv.) refer to a model's performance on only clean and only adversarial data, respectively. Mean denotes the mean of the Original and Adversarial scores, weighted by the respective number of samples in the dataset. Both AddSent and AddOneSent have 1000 clean and 787 adversarial instances.
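For concreteness, the Mean column is a sample-weighted average of the two splits; a one-line sketch using the AddSent/AddOneSent counts above:

```python
def weighted_mean(original, adversarial, n_orig=1000, n_adv=787):
    # Weight each split's score by its number of instances
    # (1000 clean and 787 adversarial for AddSent/AddOneSent).
    return (original * n_orig + adversarial * n_adv) / (n_orig + n_adv)


# e.g. 90.0 F1 on clean data and 60.0 F1 on adversarial data:
mean_f1 = weighted_mean(90.0, 60.0)  # ~76.8, closer to the clean score
```

Because the clean split is larger, the Mean score leans toward the Original score, which is worth keeping in mind when comparing headline numbers.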
Adversarial SQuAD. Table 1 shows the results with BERT-single (-S) on AddSent and AddOneSent. MAARS outperforms both the vanilla model and QAInfoMax on both the Adversarial and Mean metrics. The performance gains are substantial, especially on Adversarial, where MAARS improves F1 over QAInfoMax by about 20 points on AddSent and 16 points on AddOneSent. This clearly shows that our method is much more capable of avoiding distractors and is a much stronger defense technique in this setting. For both QAInfoMax and MAARS there is a drop in performance on clean data, but the drop for MAARS is larger. This drop arises naturally from the simplicity of the heuristic: matching words in named entities with the question sometimes assigns a higher score to a candidate sentence that has higher named entity overlap with the question but does not contain the right answer. One such example, where MAARS fails to pick the correct top candidate after reranking, is shown in Fig. 2a. Table 2 details the results with BiDAF on AddSent. Here, we also see significant performance gains over the vanilla model and the SLN baseline: MAARS increases adversarial F1 by 24 points over vanilla BiDAF and about 22 points over BiDAF + SLN. Interestingly, the performance on clean data does not drop as in the case of BERT. This difference may result from BiDAF relying more on surface word matching itself, leading to a closer alignment between its predictions and the reranker's choices. However, note that our simple heuristic still performs well even with a complex model like BERT.
Discussion. Overall, our results on this dataset look promising for both BERT and BiDAF despite our method's inherent simplicity. This raises two questions. First, how effective is the Adversarial SQuAD dataset as a testbed for adversarial attacks? When a simple method can achieve large gains, we cannot be sure that more complex methods are truly working as advertised rather than learning such heuristics. Second, how effective are the current defenses? They underperform a simple heuristic in this setting; however, because the full breadth of possible adversarial settings has not been explored, it is hard to get a holistic sense of which methods are effective. Additional settings are needed to fully contrast these techniques.

Universal Triggers. We evaluate MAARS with BiDAF on the Universal Triggers attacks and present the results in Table 3. In particular, for each adversary type we append the corresponding trigger distractor, which contains the attack's target answer, to the context. Due to the unavailability of prior work on trigger-specific defenses and of BERT-specific triggers, we report only vanilla BiDAF and BiDAF with MAARS. For Why, F1 drops by a small amount (0.3 points) from BiDAF to BiDAF with MAARS, while the EM score does not change at all. The scores improve by around 1-2 points for the other adversary types. However, the gains are much smaller than on Adversarial SQuAD. These results indicate the promise of simple defenses, but more exhaustive evaluation of defenses on different types of attacks is needed to draw a more complete picture of the methods' generalization abilities.

Failure cases
Besides the instances where the primary error source is picking a wrong top candidate (see Fig. 2a), we notice two other common types of failure cases with MAARS. One stems directly from MAARS' inability to attend to the question type during reranking. In Fig. 2b, the question word is How, but MAARS picks Scottish devolution referendum, which is not the appropriate answer type here. The other type of failure occurs when multiple similar span types are present in the same candidate sentence, creating ambiguity for the base QA model. In the example shown in Fig. 2c, the QA model fails to distinguish between the two spans and retrieve the specific information about the US. Better base QA models may resolve these issues, or a more powerful reranker could be used. However, rerankers learned end-to-end would suffer from the same issues as BERT and would require additional engineering to avoid overfitting the training data.

Conclusion
In this work, we introduce a simple, model-agnostic, post-hoc technique for adversarial question answering (QA) that predicts the final answer after reranking candidate answers from a generic QA model according to their overlap in relevant content with the question. Our results show the potential of our method through large performance gains over vanilla models and state-of-the-art methods. We also analyze common failure points of our method. Finally, we reiterate that our main contribution is not the heuristic defense itself but rather its ability to paint a more complete picture of the current state of affairs in adversarial QA. We seek to illustrate that our current adversaries are not strong and generic enough to attack a wide variety of QA methods, and that we need a broader evaluation of our defenses to meaningfully gauge progress in adversarial QA research.