TASA: Deceiving Question Answering Models by Twin Answer Sentences Attack

We present Twin Answer Sentences Attack (TASA), an adversarial attack method for question answering (QA) models that produces fluent and grammatical adversarial contexts while maintaining gold answers. Despite phenomenal progress on general adversarial attacks, few works have investigated the vulnerability of, and attacks specifically designed for, QA models. In this work, we first explore the biases in existing models and discover that they mainly rely on keyword matching between the question and context, while ignoring relevant contextual relations for answer prediction. Based on these two biases, TASA attacks the target model in two ways: (1) lowering the model's confidence on the gold answer with a perturbed answer sentence; and (2) misguiding the model towards a wrong answer with a distracting answer sentence. Equipped with designed beam search and filtering methods, TASA can generate more effective attacks than existing textual attack methods while sustaining the quality of contexts, as shown in extensive experiments on five QA datasets and in human evaluations.


Introduction
Question Answering (QA) is the cornerstone of various NLP tasks. In extractive QA (the most common setting), given a question and an associated context, a QA model needs to comprehend the context and predict the answer (Rajpurkar et al., 2016). While most works keep improving answer correctness on benchmarks (Devlin et al., 2019; Yu et al., 2018), few studies investigate the robustness of QA models, e.g., is the performance achieved by sound contextual comprehension or via shortcuts like keyword matching? Although adversarial attacks have attracted growing interest in computer vision (Goodfellow et al., 2014; Zhao et al., 2018) and recently in NLP (Ren et al., 2019; Li et al., 2021), most of them study general tasks without taking the properties of QA into account. The vulnerability and biases of models can lead to catastrophic failures outside the benchmark datasets. An effective way to study them is through adversarial attacks specifically designed for QA tasks.
Generating adversarial textual examples is challenging due to discrete syntactic restrictions, especially on QA, where the additional relationship between question and context should also be considered. Existing works such as AddSent and Human-in-the-loop (Jia and Liang, 2017; Wallace et al., 2019b) heavily rely on human annotators to create effective adversarial QA examples, which is costly and hard to scale. A few studies (Gan and Ng, 2019; Wang et al., 2020; Wallace et al., 2019a) can generate adversarial samples automatically, but they perturb either the context or the question separately, ignoring the consistency between them. Moreover, the major pitfalls in QA models' detailed comprehension process are not fully investigated, limiting the development of more powerful adversarial attacks.
In this paper, we develop an adversarial attack specifically targeting two biases of mainstream QA models discussed in §2: (1) making predictions via keyword matching in the answer sentence of contexts; and (2) ignorance of the entities shared between the question and context. Our method, Twin Answer Sentences Attack (TASA), automatically produces black-box adversarial attacks (Papernot et al., 2017) that perturb a context without hurting its fluency or changing the gold answer. TASA first locates the answer sentence in the context, which is decisive for answering (Chen and Durrett, 2019), and then modifies it into two sentences targeting the two biases above: one sentence preserves the gold answer and the meaning but replaces the keywords shared with the question with their synonyms; the other leaves the keywords and the syntactic structure intact but changes the entities (subjects/objects) associated with the answer. Thereby, the former is a perturbed answer sentence (PAS) that lowers the model's focus on the gold answer, while the latter is a distracting answer sentence (DAS), as in Jia and Liang (2017), that further misguides the model towards a wrong answer involving irrelevant entities. Thus, the adversarial context can substantially distort the QA model without changing the answer for humans. To address the challenges of efficiency and textual fluency, we further propose beam search and filtering techniques empowered by pretrained models.
In experiments, we evaluate TASA and other adversarial attack baselines on attacking three popular contextualized QA models, BERT (Devlin et al., 2019), SpanBERT (Joshi et al., 2020), and BiDAF (Seo et al., 2017), on five extractive QA datasets, i.e., SQuAD 1.1, NewsQA, NaturalQuestions, HotpotQA, and TriviaQA. Experimental results and human evaluations consistently show that TASA achieves better attack capability than the baselines, and meanwhile preserves the textual quality and gold answers identifiable by humans.
Our contributions are three-fold: • We propose a novel adversarial attack method "TASA" specifically designed to fool extractive QA models while retaining the gold answers for humans.
• We study the biases and vulnerability of QA models that motivate TASA, and demonstrate that these models mainly rely on keyword matching while possibly ignoring contextual relations.
• Experiments on five QA benchmark datasets and three types of victim models demonstrate that TASA outperforms existing baselines in attack performance, while preserving textual quality and gold answers comparably well.

Predicting Bias in Question Answering
Recent works show that state-of-the-art natural language inference models often overly rely on certain keywords as shortcuts for prediction (Wallace et al., 2019a; Sinha et al., 2021). In the empirical study of this section, we illustrate that current QA models consistently exhibit such a bias on sensitive words without leveraging the contextual relationship for predicting answers.
We analyze two mainstream QA models with contextualized comprehension capabilities, BERT (Devlin et al., 2019) and BiDAF (Seo et al., 2017), trained on the original training set of SQuAD 1.1 (Rajpurkar et al., 2016) and tested on samples modified from its validation set. We define the sentence in the context that contains the gold answer as the answer sentence, which is the key for predicting answers (Chen and Durrett, 2019). We first compare the performance of the models on original samples with only the answer sentence as the context ("Only answer sent."). Then, to investigate the bias on sensitive words, we further examine the models on samples in which various types of sensitive words in the answer sentence are (1) removed ("Remove") or (2) exclusively retained ("Only"). Three types of sensitive words are considered:
(1) Entities. The named entities shared between the answer sentence and the question.
(2) Lexical words (lexical.). Words with lexical meanings (excluding all named entities) shared between the answer sentence and the question. They cover words with POS tags of NOUN, VERB, ADJ, etc.
(3) Function words (func.). Words that do not have lexical meaning but are shared between the answer sentence and the question. They include words with POS tags of DET, ADP, PRON, etc.

[Figure 2: Examples of modifying the answer sentence "The annual NFL Experience was held at the Moscone Center located in San Francisco." under each condition, e.g., Remove entities: "The annual NFL Experience was held at the located in San Francisco."; Only entities: "Moscone Center San Francisco."; Only lexical.: "The annual NFL Experience held at the Moscone Center in San Francisco."; Remove func.: "The annual NFL Experience was held at Moscone Center located San Francisco."]

[Table 1: EM and F1 scores of BERT and BiDAF models on different modified samples compared to results on the original samples. Shuffle means the best results among texts whose tokens are random-ordered.]
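The three-way split of sensitive words can be reproduced mechanically from lemmas and POS tags. Below is a minimal sketch (not the authors' code); the tagging is assumed to come from an external tool such as SpaCy, so the token annotations here are hand-written:

```python
# Partition an answer sentence's tokens into the three sensitive-word
# categories used in the bias study, given (text, lemma, POS, is_entity)
# tuples. The tag sets follow the examples in the text.
LEXICAL_TAGS = {"NOUN", "VERB", "ADJ", "ADV"}
FUNCTION_TAGS = {"DET", "ADP", "PRON", "AUX", "CCONJ", "PART"}

def categorize(answer_tokens, question_lemmas):
    """Return (entities, lexical, function) word lists shared with the question."""
    entities, lexical, function = [], [], []
    for text, lemma, pos, is_ent in answer_tokens:
        if lemma not in question_lemmas:
            continue  # only words shared with the question are "sensitive"
        if is_ent:
            entities.append(text)
        elif pos in LEXICAL_TAGS:
            lexical.append(text)
        elif pos in FUNCTION_TAGS:
            function.append(text)
    return entities, lexical, function

# Toy example for the question "Where was the NFL Experience held?"
tokens = [
    ("NFL", "nfl", "PROPN", True), ("Experience", "experience", "NOUN", True),
    ("was", "be", "AUX", False), ("held", "hold", "VERB", False),
    ("at", "at", "ADP", False), ("the", "the", "DET", False),
    ("Moscone", "moscone", "PROPN", True), ("Center", "center", "PROPN", True),
]
q_lemmas = {"where", "be", "the", "nfl", "experience", "hold"}
print(categorize(tokens, q_lemmas))
# -> (['NFL', 'Experience'], ['held'], ['was', 'the'])
```

The "Remove"/"Only" conditions of the study then amount to deleting or exclusively keeping one of the three returned lists within the answer sentence.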
When modifying the answer sentence, we only remove or retain these three types of sensitive words, except the gold answer words, and keep the rest of the context intact. As shown in Figure 2, the modified texts are unreadable, and it is difficult to infer their true meaning from a human perspective. In addition, we follow UNLI (Sinha et al., 2021) and Shuffle the tokens in the answer sentence under the Only lexical. condition, to check whether models can achieve even better performance when the texts are totally ungrammatical but contain the sensitive words.
Table 1 compares the evaluation results on the different modifications. Given the answer-sentence-only context, the performance of both BERT and BiDAF improves, indicating that they mainly rely on the answer sentence and almost ignore the rest of the context. While removing entities or function words causes only slight metric differences, removing lexical words leads to a larger performance drop. In addition, both models perform surprisingly well when keeping only lexical words in answer sentences, compared to the 30% ∼ 60% drop when keeping other word types. Moreover, shuffling tokens under the lexical-only condition can even benefit the models, despite the answer sentence being merely discrete tokens that are hard to read. This suggests that both models can answer questions relying solely on the shared lexical words, i.e., keywords in the answer sentence, regardless of word order and other contextual information like entities.
Inspired by this observation, we ask whether we can utilize the discovered pitfall to design an efficient adversarial attack method specifically for QA. Can we lower the model's attention on the gold answer and then misguide it to incorrect answers by manipulating the existing sensitive keywords in the context and adding new misleading ones? The answer is affirmative: we show that predictions can be shifted to crafted wrong answers in §4.4.

Methodology
We propose an adversarial attack method for extractive QA models, Twin Answer Sentences Attack (TASA), which automatically produces black-box attacks solely based on the final output of the victim QA model F(·). Given a typical QA sample composed of a context c, a question q, and an answer a (i.e., a positional span in c), we study how to perturb the context c into c′ so as to deceive F(·) into producing an incorrect answer, F(c′, q) ≠ a, while c′ retains the correct answer a identifiable by humans. We keep the question q intact to ensure the answer a remains valid, as editing the short q with its simple syntactic structure easily alters its meaning.
TASA can be summarized in three main steps: (1) Remove coreferences in the context to facilitate the subsequent edits; (2) Perturb the answer sentence by replacing keywords (the overlapping sensitive lexical words of §2) with synonyms to produce a perturbed answer sentence (PAS), lowering the model's focus on the gold answer; (3) Add a distracting answer sentence (DAS) that keeps the keywords intact but changes the associated subjects/objects to misguide the model into producing a wrong answer, as verified in Table 5. How the three steps are applied is illustrated in Figure 1, and Algorithm 1 gives the complete procedure of TASA.
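The three steps above can be summarized as a short driver loop. This is a structural sketch only, not the authors' code: `remove_coref`, `gen_pas`, and `gen_das` are hypothetical stand-ins for the components described in the following subsections, and the victim model is treated as a black box.

```python
def tasa_attack(context, question, answer, victim, remove_coref, gen_pas, gen_das):
    """Structural sketch of the TASA pipeline (not the authors' exact code).

    victim(context, question) -> predicted answer.
    gen_pas / gen_das return candidate contexts, best first; beam search
    and filtering are assumed to happen inside them.
    """
    context = remove_coref(context)                      # step 1: remove coreferences
    for ctx_pas in gen_pas(context, question, answer):   # step 2: perturbed answer sentence
        if victim(ctx_pas, question) != answer:
            return ctx_pas                               # PAS alone already fools the model
        for ctx_das in gen_das(ctx_pas, question, answer):  # step 3: distracting answer sentence
            if victim(ctx_das, question) != answer:
                return ctx_das
    return None  # attack failed for this sample

# Toy demo with invented stand-ins: this "victim" flips its answer
# whenever the distractor marker appears in the context.
victim = lambda c, q: "wrong" if "DISTRACT" in c else "gold"
identity = lambda c: c
pas_gen = lambda c, q, a: [c.replace("held", "hosted")]
das_gen = lambda c, q, a: [c + " DISTRACT."]
adv = tasa_attack("The event was held in SF.", "Where?", "gold",
                  victim, identity, pas_gen, das_gen)
print(adv)  # -> The event was hosted in SF. DISTRACT.
```

Note that the PAS and DAS stages compose: the DAS is appended to a context whose answer sentence has already been perturbed, mirroring the sequential search described in §3.4.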

Remove Coreferences
Coreference relations across sentences commonly exist in texts (Hobbs, 1979) and bring extra challenges to adversarial attacks when substituting target words. For example, in the sentence "His patented AC induction motor were licensed", "His" refers to "Nikola Tesla's" according to the whole context. However, given the single sentence alone, it is hard to precisely allocate substitution candidates for "his", as it is a pronoun. Instead, we remove the coreference by replacing such pronouns with the entity names they refer to, e.g., specific persons or locations, so that we can edit them directly without handling complicated coreferences.
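Given coreference clusters from an off-the-shelf resolver, the replacement itself is a simple span rewrite. The sketch below hand-specifies the clusters for illustration (the paper does not name a specific resolver here) and uses a crude possessive rule:

```python
def resolve_pronouns(tokens, clusters):
    """Replace single-token pronoun mentions with their cluster's main mention.

    tokens: list of word strings.
    clusters: [{"main": [i, j], "mentions": [[i, j], ...]}] with inclusive
              token-index spans, as produced by typical coreference resolvers.
    """
    PRONOUNS = {"he", "his", "him", "she", "her", "it", "its", "they", "their"}
    edits = []
    for cl in clusters:
        i0, j0 = cl["main"]
        main_text = tokens[i0:j0 + 1]
        for i, j in cl["mentions"]:
            if i == j and tokens[i].lower() in PRONOUNS:
                repl = list(main_text)
                if tokens[i].lower() in {"his", "her", "its", "their"}:
                    repl[-1] += "'s"  # crude possessive; a real system would inflect properly
                edits.append((i, j, repl))
    # Apply right to left so earlier indices stay valid after replacement.
    for i, j, repl in sorted(edits, reverse=True):
        tokens[i:j + 1] = repl
    return tokens

text = "Nikola Tesla invented it . His patented AC induction motor was licensed .".split()
clusters = [{"main": [0, 1], "mentions": [[5, 5]]}]
print(" ".join(resolve_pronouns(text, clusters)))
# -> Nikola Tesla invented it . Nikola Tesla's patented AC induction motor was licensed .
```

After this rewrite, the answer sentence can be edited in isolation without consulting the rest of the context.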

Perturb the Answer Sentence
According to the former analysis, the answer sentence is the most important part of the context c for QA tasks, and QA models usually predict answers by keyword matching (Chen and Durrett, 2019). Hence, we first study how to obtain a perturbed answer sentence (PAS) by perturbing only those sensitive keywords instead of changing the whole context. Given the gold answer a, we first locate the answer sentence s_a in c; in TASA, we use text matching to search for the s_a that contains the span a.

Determine the keywords to perturb. As discussed in §2, QA models normally rely on keywords to make predictions. Hence, we directly perturb those keywords, rather than randomly selected tokens as in previous works (Ren et al., 2019; Jin et al., 2020), to produce more effective attacks. We adopt three criteria to select words of s_a into the keyword set X: (1) they are not included in the answer span a, so the gold answer is retained; (2) each of them shares the same lemma with a token in the question q; and (3) each keyword's POS tag belongs to a POS tag set for lexical words, e.g., NOUN, ADJ, etc.

Rank keywords by importance. Following previous works (Jin et al., 2020), we rank the keywords in X by their importance scores in descending order. Given the original context c and answer a, the importance score I_i of x_i ∈ X is

I_i = p_F(a|c) − p_F(a|mask(c, x_i)),

where p_F(a|·) denotes the probability of the original span position of the gold answer a predicted by the victim model F, and mask(c, x_i) means that c is modified by replacing the token x_i with a special mask symbol, i.e., given c = ..x_{i−1} x_i x_{i+1}.., mask(c, x_i) = ..x_{i−1} <mask> x_{i+1}... Finally, we obtain the set X of keywords ranked by their importance.
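The masking-based importance ranking can be sketched as follows. Here `p_gold` is a hypothetical black-box interface returning the victim model's probability on the gold span, and the toy victim is invented purely for illustration:

```python
def rank_keywords(context_tokens, keywords, p_gold):
    """Rank keywords by importance I_i = p_gold(c) - p_gold(mask(c, x_i))."""
    base = p_gold(context_tokens)
    score = {}
    for x in keywords:
        masked = ["<mask>" if t == x else t for t in context_tokens]
        score[x] = base - p_gold(masked)  # how much masking x hurts the gold answer
    return sorted(keywords, key=lambda x: score[x], reverse=True)

# Invented toy victim: confidence drops a lot without "held", a little without "annual".
def toy_p_gold(tokens):
    p = 0.9
    if "held" not in tokens:
        p -= 0.5
    if "annual" not in tokens:
        p -= 0.1
    return p

ctx = "The annual NFL Experience was held at the Moscone Center".split()
print(rank_keywords(ctx, ["annual", "held"], toy_p_gold))  # -> ['held', 'annual']
```

Keywords whose masking causes the largest drop in p_F(a|·) are perturbed first, concentrating the edit budget on the tokens the model actually relies on.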

Generate perturbed answer sentence (PAS).
Following the order in X, we edit each keyword x_i ∈ X one after another. Specifically, we replace x_i with a synonym r_j from a synonym set and transform the inflection of r_j to match that of x_i; e.g., we change "Tesla investigated..." to "Tesla looked into...", where "investigated" is a keyword and "look into" is one of its synonyms. Thereby, multiple PASs are obtained when editing each keyword if more than one synonym exists. We retain only the top few of them via the beam search and filtering strategy (elaborated in §3.4) to promote effectiveness as well as efficiency, resulting in a set of PASs P, which serve as the initial texts of the next perturbation turn. While PASs do not change the meaning of the text, since they replace words with synonyms, they distract a model that relies on keyword matching away from the sentence containing the answer.
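A minimal sketch of the substitution step. The synonym table and the inflection-transfer rule below are toy stand-ins for the paper's WordNet/PPDB synonym sources and pyinflect:

```python
def pas_candidates(sent_tokens, keyword, synonyms, match_inflection):
    """Generate PAS candidates by swapping one keyword for each of its synonyms."""
    out = []
    for syn in synonyms.get(keyword.lower(), []):
        syn = match_inflection(syn, keyword)
        out.append([syn if t == keyword else t for t in sent_tokens])
    return out

def copy_past_tense(repl, orig):
    # Toy inflection transfer: append "d" when the original ends in "-ed"
    # (works for "examine" -> "examined"); a real system would use pyinflect.
    if orig.endswith("ed") and not repl.endswith("ed"):
        return repl + "d"
    return repl

syns = {"investigated": ["examine", "probe"]}  # hand-written stand-in for WordNet/PPDB
sent = "Tesla investigated the phenomenon".split()
for cand in pas_candidates(sent, "investigated", syns, copy_past_tense):
    print(" ".join(cand))
# -> Tesla examined the phenomenon
#    Tesla probed the phenomenon
```

Each keyword edit multiplies the candidate set, which is why the beam search of §3.4 is needed to keep the search tractable.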

Add a Distracting Answer Sentence
To further deceive the model, we also add a distracting answer sentence (DAS) at the end of the context. In particular, the DAS is also modified from the answer sentence s_a: it changes the subjects/objects and the answer, but keeps intact the keywords that draw the model's attention. Collaborating with the PAS, the DAS misguides models into predicting incorrect answers about wrong subjects/objects due to the pitfall studied in §2, as verified in Table 5. Our method differs from previous works (Jia and Liang, 2017; Wallace et al., 2019a) as our distractors are added automatically and apply to more general conditions.

Determine the tokens to edit. Similar to PAS, the first step of generating a DAS is to select a set Y of tokens from s_a as the candidate subjects/objects to be edited. In TASA, each selected token y ∈ Y needs to meet all of the following criteria: (1) y ∉ X, so the original keywords are preserved; (2) y ∉ a (we process the answer tokens separately); (3) y is a named entity or its POS tag is NOUN. The goal of (3) is to extract and change the subjects/objects of s_a to produce a pseudo answer sentence that contains incorrect answers. We do not use a syntactic parser to locate the subjects/objects, as we find it empirically less accurate and effective than POS tags.

Generate distracting answer sentence (DAS).
Similar to PAS, we edit each y_i ∈ Y to obtain a DAS. Specifically, we replace each y_i with a token/phrase of the same entity/noun type; e.g., "Tesla investigated..." can be modified to "Charlie investigated...", since both "Tesla" and "Charlie" are persons. In principle, (1) if y_i is a named entity, we randomly sample N different entities with the same NER tag from the whole corpus as candidates to replace y_i; (2) otherwise, we randomly sample N nouns with the same hypernym as y_i from the corpus for substitution. Hence, multiple DASs can be generated, and we again use the beam search strategy to choose only the top few of them, resulting in a set of DASs D.

Change the answer in DAS. Since the main purpose of the DAS is to misguide the model into predicting a wrong answer, we entirely replace the text span of the original answer in the DAS with a pseudo answer, which also removes the ambiguity of the answer from the human perspective. Specifically, we replace every lexical token of a in the DAS with a pseudo answer token candidate that shares the same NER tag or POS tag, randomly sampled from the whole corpus. Likewise, this procedure yields multiple results, so a beam search is again necessary for efficiency and attack success.
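The same-type substitution can be sketched as follows; `same_type_pool` is a hypothetical interface over the corpus-level NER/hypernym dictionaries described in the appendix, and the PERSON pool here is hand-written:

```python
import random

def das_candidates(ans_sent, edit_tokens, same_type_pool, n=3, seed=0):
    """Generate DAS candidates by swapping subjects/objects for same-type tokens.

    edit_tokens: the selected subject/object tokens (not keywords, not the answer).
    same_type_pool(tok) -> replacements with the same NER tag / hypernym,
    assumed to come from corpus-level dictionaries.
    """
    rng = random.Random(seed)
    out = []
    for tok in edit_tokens:
        pool = [c for c in same_type_pool(tok) if c != tok]
        for repl in rng.sample(pool, min(n, len(pool))):
            out.append([repl if t == tok else t for t in ans_sent])
    return out

persons = {"Tesla": ["Charlie", "Ada", "Marie"]}  # hand-written PERSON pool
sent = "Tesla investigated the phenomenon in 1899".split()
cands = das_candidates(sent, ["Tesla"], lambda t: persons.get(t, []))
for c in cands:
    print(" ".join(c))
```

Because the keywords ("investigated" here) are left untouched, a keyword-matching model is drawn to the distractor even though its subject no longer matches the question.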

Beam Search and Filtering
Beam search. When editing each word during generation of the PAS and DAS, there usually exist multiple replacement candidates, resulting in multiple perturbed sentences. To obtain the one with the greatest potential to lead to a successful attack, and to improve the attack's efficiency, we apply a beam search strategy based on an effect score E_n for each perturbed sentence s_n:

E_n = p_F(a|c) − p_F(a|edit(c, s_n)),

where edit denotes that the original context c is modified by s_n: (1) if s_n is a PAS, it replaces the original answer sentence s_a in c; (2) if s_n is a DAS, it is appended to the end of c. The edited texts are ranked by E_n in descending order, and only the top M (M is the beam size) are retained for the next edit step. Beam search stops if (1) no additional edit is needed for the current sample, or (2) the minimum effect score among the results is higher than a threshold T_E that ensures a sufficient performance drop. TASA runs beam search for PAS to obtain a PAS set P, then obtains a DAS set D sequentially, and finally generates the adversarial context c′. Note that we obtain a DAS based on a series of contexts that are already perturbed by P. So each item in D is a pair of a DAS s_j and a corresponding perturbed context c_j, and the initial D contains all possible contexts edited by each PAS in P.
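The search loop itself is a standard beam search over edits, sketched below with hypothetical `candidates_fn` and `effect_fn` callbacks; the toy demo scores candidates by how many "a" characters they contain:

```python
def beam_search_edits(context, candidates_fn, effect_fn, beam=5, t_e=0.2, max_steps=5):
    """Beam search over successive edits, ranked by the effect score E_n.

    candidates_fn(ctx) -> next-step edited contexts (empty list = nothing left to edit).
    effect_fn(ctx) -> E_n, the drop in gold-answer probability caused by ctx.
    Stops when no edits remain or the weakest kept beam already exceeds T_E.
    """
    beams = [context]
    for _ in range(max_steps):
        expanded = [c2 for c in beams for c2 in candidates_fn(c)]
        if not expanded:
            break  # stopping criterion (1): no additional edit is possible
        expanded.sort(key=effect_fn, reverse=True)
        beams = expanded[:beam]
        if min(effect_fn(c) for c in beams) >= t_e:
            break  # stopping criterion (2): every kept candidate degrades the model enough
    return beams

# Toy demo: each appended "a" lowers the (fake) gold-answer probability by 0.1.
grow = lambda c: [c + "a", c + "b"] if len(c) < 5 else []
effect = lambda c: 0.1 * c.count("a")
print(beam_search_edits("", grow, effect, beam=2))  # -> ['aaa', 'aab']
```

In TASA the same loop runs twice, first over PAS edits and then over DAS edits applied on top of the PAS-perturbed contexts.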
Filtering by textual quality. To ensure high textual quality and answer preservation in the generated adversarial contexts, TASA applies a filtering procedure to the M (beam size) PASs obtained after the final beam search for generating PAS. We skip it for DASs, as they have no effect on the gold answer. In particular, we first use a model to judge whether the question q is still answerable given the perturbed context edit(c, s_n). Such a model can be a large-scale pretrained model fine-tuned on both answerable and unanswerable samples (refer to Appendix A.2 for details). Only contexts classified as answerable remain. In addition, we further constrain the remaining contexts' textual quality via a quality index U_n combining the USE similarity (Cer et al., 2018) between the two sentences and the perplexity (PPL) computed by a GPT2 model (Radford et al., 2019), rewarding high semantic similarity and fluency. Only the s_n fulfilling U_n ≥ T_U (T_U is a threshold) are retained for the next step.
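The filtering stage can be sketched with placeholder scorers standing in for USE, GPT2 perplexity, and the answerability classifier. The way the two scores are combined into U_n below (similarity times a perplexity ratio) is an assumption for illustration, not the paper's exact formula:

```python
def quality_filter(orig_sent, candidates, sim_fn, ppl_fn, answerable_fn, t_u=0.5):
    """Keep candidates that stay answerable and score well on a quality index U_n.

    sim_fn: semantic similarity in [0, 1] (USE in the paper; a placeholder here).
    ppl_fn: language-model perplexity (GPT2 in the paper; a placeholder here).
    The combination below (similarity times a perplexity ratio) is an assumed
    form of U_n, rewarding meaning preservation and penalizing fluency loss.
    """
    kept = []
    for cand in candidates:
        if not answerable_fn(cand):
            continue  # question no longer answerable from the edited context
        u = sim_fn(orig_sent, cand) * ppl_fn(orig_sent) / ppl_fn(cand)
        if u >= t_u:
            kept.append(cand)
    return kept

# Toy stand-ins: Jaccard token overlap for USE, a constant for PPL.
jaccard = lambda a, b: len(set(a.split()) & set(b.split())) / len(set(a.split()) | set(b.split()))
const_ppl = lambda s: 50.0
orig = "Tesla investigated the phenomenon"
cands = ["Tesla examined the phenomenon", "Completely unrelated words here"]
print(quality_filter(orig, cands, jaccard, const_ppl, lambda s: True))
# -> ['Tesla examined the phenomenon']
```

In practice the three scorers are the heavyweight pretrained models named in the text; only the thresholding logic is shown here.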

Experiments
We evaluate TASA on extractive QA tasks. We begin with the setup details (§4.1), then introduce the main results in §4.2, followed by ablation studies in §4.3 and additional analysis in §4.4 to better illustrate each module of our method.
Victim models. We attack three QA models, i.e., BERT (Devlin et al., 2019), SpanBERT (Joshi et al., 2020), and BiDAF (Seo et al., 2017), in our experiments. The former two are built on top of pretrained BERT base and SpanBERT large, respectively. Both benefit from huge corpora, and SpanBERT can be regarded as one of the SOTA models for general extractive QA tasks. The latter, BiDAF, is an end2end model based on LSTMs and bidirectional attention specially designed for extractive QA (related results are provided in Appendix B.1, as it is not a SOTA model). All models output the start and end positions of the answer span in the context as the prediction.
Implementation. Given a dataset, we first train each kind of model on its training set until it achieves satisfactory performance on the dev set. The trained model is then used as the victim model F(·), and we perform adversarial attacks using all samples from the whole dev set. We use a beam size M = 5 for TASA. The synonym set used for PAS is obtained by taking the union of two sources: (1) the WordNet synonym dictionary (Fellbaum, 2010) and (2) the PPDB 2.0 dataset containing token-level paraphrase pairs (Mrkšić et al., 2016). More details about TASA can be found in Appendix A.2.
Baselines. We consider the following two strong baselines besides the original dev set (Original).
• TextFooler (Jin et al., 2020): A general token-level attack method using synonyms derived from counter-fitted word embeddings. We directly apply it to the context c to make perturbations, and use the model's prediction F_a(·) on the gold answer to determine whether to stop attacking.
• T3 (Wang et al., 2020): A tree-autoencoder-based method that obtains perturbed sentences for attacking. It can be directly applied to QA by adding a distracting sentence to the context. Both baselines and our TASA are black-box attack methods that do not use the internal parameters of victim models. We also include the human-annotated AddSent adversarial data (Jia and Liang, 2017) for SQuAD 1.1, as they share the same contexts.

Evaluation metrics.
Following former works (Rajpurkar et al., 2016; Wang et al., 2020; Li et al., 2021), we evaluate attack methods using the following metrics: 1) EM, the exact match ratio of predicted answers; 2) F1, the F1 score between the predicted and gold answers; lower EM and F1 mean better attack effectiveness; 3) Grammar error (GErr), the number of grammatical errors in the context given by LanguageTool, following Zang et al. (2020); we use the average value per 100 tokens due to varying context lengths among datasets; 4) PPL, the average perplexity of all adversarial contexts given by a small-sized GPT2 model (Radford et al., 2019), measuring their fluency (Kann et al., 2018). Lower GErr and PPL indicate better textual quality.
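EM and F1 are the standard SQuAD span-overlap metrics; a compact reference implementation (omitting SQuAD's official normalization of articles and punctuation) is:

```python
from collections import Counter

def exact_match(pred, gold):
    """EM: 1.0 iff the predicted and gold answers match (case-insensitive here)."""
    return float(pred.strip().lower() == gold.strip().lower())

def f1_score(pred, gold):
    """Token-level F1 between a predicted and a gold answer span."""
    p, g = pred.lower().split(), gold.lower().split()
    common = Counter(p) & Counter(g)        # multiset intersection of tokens
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Moscone Center", "the Moscone Center"))          # -> 0.0
print(round(f1_score("Moscone Center", "the Moscone Center"), 3))   # -> 0.8
```

A successful attack drives both scores down on the adversarial contexts while they remain high for human readers.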

Main Results
The main experimental results on BERT and SpanBERT are summarized in Table 2. TASA achieves the overall best performance among all methods. In particular, it deceives models better than the others on 3 datasets and achieves comparably best results on NewsQA and TriviaQA, causing larger drops in EM and F1 than the baselines. This means the combination of PAS and DAS is more effective than solely editing tokens or adding distracting text. Noticeably, all methods have a modest attack effect on datasets with longer contexts, e.g., NewsQA and TriviaQA, because a limited number of token-level perturbations or a single added sentence has less impact on long texts. Besides, SpanBERT is more robust, with only slight accuracy declines, due to its larger scale and superior pre-training strategy.
In terms of textual quality, TASA achieves the overall lowest PPL and comparably low GErr values. TextFooler usually has the lowest GErr, as its pure token-level perturbations generate fewer sentence-level unnatural errors. T3, in contrast, always generates sentence-level distractors that are meaningless and lack a complete syntactic structure, resulting in worse GErr and PPL. TASA makes trade-off attacks at both the token and sentence levels, avoiding significant loss of textual quality.
It is also worth mentioning that TASA is better than AddSent at fooling models. Despite the better textual quality of its human-annotated distracting texts, AddSent does not perturb the influential part of the original context, limiting its attack effect.
Human evaluation. We randomly sample 150 sets of adversarial samples, each containing 3 samples generated by TextFooler, T3, and TASA from the same source sample in SQuAD 1.1, using BERT as the victim model. Each set is evaluated in two aspects: (1) Answer preservation, whether the gold answer of a sample remains unchanged; (2) Textual quality, ranking the quality (1 ∼ 3) of the context based on fluency and grammaticality. In total, 63 non-expert annotators are involved, and the results are summarized in Table 3.

[Table 3: Human evaluation results on SQuAD 1.1 (answer preservation in percentage). ± indicates confidence intervals at a 95% confidence level.]

TASA is slightly behind on answer preservation, as T3 always retains the original part of the context; it is on par with TextFooler, and both of them have significantly better textual quality than T3, for the reason concluded above. Such comparable sample quality is sufficient to verify the superiority of TASA, considering its much stronger capability to deceive models (refer to Appendix C for qualitative adversarial samples by TASA).

Ablation Studies
We verify the effectiveness of each key module in TASA by: 1) w/o remove coref.: without removing coreferences; 2) w/o PAS: without applying the perturbed answer sentence; 3) w/o DAS: without adding the distracting answer sentence. The upper part of Table 4 shows their contributions. Removing coreferences slightly increases the number of suitable attack samples, while both PAS and DAS make vital contributions to successful attacks and feasible numbers of adversarial samples. We then run ablations on PAS, including: 1) w/o importance: without ranking keywords, editing them in random order; 2) w/o quality: without filtering perturbed texts using the quality index U_n; 3) Only WordNet as the synonym source; and 4) Only PPDB as the synonym source. Based on the middle part of Table 4, w/o importance slightly lowers the overall performance. Although w/o quality can raise the attack success rate, it introduces extra textual quality degradation. Besides, more synonym sources mean a larger search space, so we include both WordNet and PPDB in TASA.
Ablations on DAS are conducted last: 1) w/o pseudo answer: do not change answers in DASs; 2) Only NE and 3) Only nouns: only edit named entities/nouns. Related results are given in the lower part of Table 4. The obvious change under w/o pseudo answer shows that changing the original answer in DASs is crucial for attacking, and also proves that DAS can shift models' focus away from the original answer sentence, since models could otherwise still derive the gold answer from the DAS. Moreover, involving both editing types, named entities and nouns, benefits the attack effectiveness and the number of generated samples.

More Analysis
Effect of beam size. We vary the beam size when generating PASs and DASs to investigate its influence. Figure 3 reports the changes of EM, F1, and the quantities of adversarial samples. Clearly, a larger beam size leads to better performance and more diverse adversarial samples, but also slower speed. Thus, we use M = 5 as a trade-off between performance and efficiency, as we see limited performance gains from beam sizes larger than 5.
Shift to the pseudo answers. Since DAS aims to draw the attention of models away from the original answer sentences, we expect QA models to output the pseudo answers contained in DASs. Table 5 shows the F1 scores between the predicted answers and the pseudo answers on all adversarial samples from the 5 datasets that include a DAS. The results demonstrate high overlap between the victim models' incorrect predictions and the pseudo answers, as these values are close to the performance drops caused by the adversarial samples, confirming that DASs draw the models' attention and induce incorrect predictions.
Adversarial training. To verify the effectiveness of TASA in improving the robustness of QA models, we randomly replace training data in SQuAD 1.1 with the corresponding adversarial samples generated by TASA at varied ratios, and then use the new training data to fine-tune a BERT model. The performance on the original dev set, the adversarial dev set generated by TASA, and the samples from AddSent is shown in Figure 4 for different mixing ratios.

[Figure 4: The performance of the BERT model fine-tuned on the original SQuAD data mixed with adversarial samples from TASA in different ratios, evaluated on the original dev samples and on adversarial samples from TASA and AddSent. We expect a slight influence on the former and improvements on the latter two kinds of samples.]

Noticeably, with a suitable mixture ratio, adversarial samples from TASA can make models more robust under adversarial attacks without significant performance loss on the original data. Interestingly, this defense capability also transfers to other adversarial data, e.g., AddSent. Such results verify the potential of TASA to enhance current QA models.

Related Work
Question answering. Extractive QA is the most common QA task, where the answer is a text span in the supporting context. Various datasets, e.g., SQuAD, NewsQA, and NaturalQuestions (Rajpurkar et al., 2016, 2018; Trischler et al., 2017; Kwiatkowski et al., 2019), have motivated many works on QA models, such as end2end models like BiDAF, R-Net, and QANet (Seo et al., 2017; Wang et al., 2017; Yu et al., 2018). Pretrained models have been widely applied recently, such as BERT, RoBERTa, and SpanBERT (Devlin et al., 2019; Liu et al., 2019; Joshi et al., 2020). They achieve remarkable improvements by benefiting from huge corpora, and can also serve as backbones for more complex QA tasks (Cao et al., 2019; Huang et al., 2021). Nevertheless, there are growing concerns (Sinha et al., 2021; Ettinger, 2020; Wallace et al., 2019a) about whether models really capture contextual information rather than simply using token-level knowledge.
Textual adversarial attack. Textual adversarial attacks have been widely investigated in general tasks like text classification and natural language inference (NLI). Some works use character-level misspelled tokens to attack models, but these are easy to defend against (Liang et al., 2018; Ebrahimi et al., 2018; Li et al., 2019; Pruthi et al., 2019). More studies use sophisticated token-level perturbations (Ren et al., 2019; Alzantot et al., 2018; Zang et al., 2020; Li et al., 2021) or phrase/sentence-level editing (Iyyer et al., 2018; Chen et al., 2021; Lei et al., 2022) to produce adversarial texts, with filtering strategies to guarantee textual meaning and quality. However, none of them demonstrates effectiveness on QA tasks. There are some efforts on attacking QA models.
AddSent (Jia and Liang, 2017) contains adversarial samples with distracting sentences added by human annotators. Wallace et al. (2019b) employ human testers to interact with models to realize dynamic attacks. Despite their effectiveness, these approaches are not extensible and are limited in scale.
There are also automatic methods. T3 (Wang et al., 2020) utilizes a Tree-LSTM to obtain a distracting sentence based on the skeleton of the question. Universal Trigger (Wallace et al., 2019a) finds input-agnostic texts that deceive models for a specific question type via gradient-guided search. Our TASA differs from them in that it bridges contexts and questions to attack more effectively, and it applies to more general conditions.

Conclusion
We present TASA, an automatic adversarial attack method for QA models. It generates twin answer sentences, a perturbed answer sentence (PAS) and a distracting answer sentence (DAS), to construct a new adversarial context in a QA sample. It can deceive models and misguide them to an incorrect answer based on their pitfall of overly relying on matching sensitive keywords when predicting answers. In experiments, TASA achieves remarkable attack performance on five datasets and three victim models with satisfactory sample quality. Our additional analysis also shows that TASA makes it possible to obtain more robust QA models in the future.
The gold answer is taken as the span chosen most often by annotators, or the first span in the context. The lemmas and POS tags of tokens are obtained via SpaCy. The POS tag set used to select keywords is {"VERB", "NOUN", "ADJ", "ADV"}. When perturbing a token with one of its synonyms, we use pyinflect to inflect the replacement's lemma to match the inflection of the original token.
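The keyword-selection step above can be sketched as a simple POS filter. This is a minimal illustration assuming tokens arrive pre-tagged as (text, lemma, pos) triples (in the actual pipeline these come from SpaCy):

```python
# POS tags that mark content words eligible as keywords (from the paper).
CONTENT_POS = {"VERB", "NOUN", "ADJ", "ADV"}

def extract_keywords(tagged_tokens):
    """Return the lemmas of tokens whose POS tag marks them as content words.

    tagged_tokens: list of (text, lemma, pos) triples, e.g. produced by SpaCy.
    """
    return [lemma for text, lemma, pos in tagged_tokens if pos in CONTENT_POS]

# Hypothetical pre-tagged sentence for illustration:
sent = [("The", "the", "DET"), ("cat", "cat", "NOUN"),
        ("quickly", "quickly", "ADV"), ("ran", "run", "VERB")]
print(extract_keywords(sent))  # ['cat', 'quickly', 'run']
```

In the real pipeline the triples would be `(tok.text, tok.lemma_, tok.pos_)` from a SpaCy `Doc`.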
Adding distracting answer sentences. We construct an NER dictionary and a word dictionary (excluding named entities) for each target dataset by parsing all contexts in both the train and dev sets via SpaCy. When generating a DAS or changing answers in a DAS, we randomly sample named entities with the same NER tag, or words with the same POS tag, from these dictionaries. Each time, we sample N = 20 candidates and ensure none of them overlaps with the original entity/token to be replaced. Pyinflect is also used during replacement.
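The sampling procedure can be sketched as follows; this is an illustrative stub, where `dictionary` stands for the tag-indexed NER/word dictionary built from the corpus:

```python
import random

def sample_replacements(dictionary, tag, original, n=20, seed=0):
    """Sample up to n replacement candidates sharing the original's
    NER (or POS) tag, excluding the original itself.

    dictionary: dict mapping tag -> list of surface forms from the corpus.
    """
    pool = [w for w in dictionary.get(tag, []) if w != original]
    rng = random.Random(seed)  # fixed seed here only for reproducibility
    return rng.sample(pool, min(n, len(pool)))

# Toy dictionary for illustration:
ner_dict = {"PERSON": ["Tesla", "Curie", "Darwin", "Turing"]}
candidates = sample_replacements(ner_dict, "PERSON", "Tesla", n=2)
```

Each sampled candidate would then be inflected with pyinflect before being substituted into the DAS.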
Beam search. During beam search, we apply an early-stop strategy on the filtered results after each search step. We also restrict the maximum number of perturbations to 5 for both PAS and DAS.
The beam search stops if either of the following criteria is satisfied: 1) the minimum effect score E_n among the kept candidates satisfies min(E_n) ≥ T_E, where T_E = 0.2 is a threshold; 2) all possible tokens/entities have been replaced. The final M sentences then proceed to the next step. Quality filtering. During filtering, we use the official USE model to compute USE similarity and a small GPT-2 model to compute PPL.
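The beam search with its stopping criteria can be sketched as below. This is a simplified sketch, not the exact implementation: `expand` and `effect_score` are hypothetical stand-ins for TASA's perturbation generator and its gold-answer confidence drop:

```python
def beam_search(seed_sent, expand, effect_score, beam_size=5,
                t_effect=0.2, max_perturb=5):
    """Beam search over single-token perturbations with early stopping.

    expand(sent)       -> iterable of one-step perturbations of sent.
    effect_score(sent) -> drop in the victim model's gold-answer confidence.
    Stops when every kept candidate reaches t_effect, when nothing is left
    to replace, or after max_perturb perturbation rounds.
    """
    beam = [seed_sent]
    for _ in range(max_perturb):
        candidates = [c for s in beam for c in expand(s)]
        if not candidates:          # criterion 2: nothing left to replace
            break
        beam = sorted(candidates, key=effect_score, reverse=True)[:beam_size]
        if min(effect_score(s) for s in beam) >= t_effect:
            break                   # criterion 1: all candidates effective
    return beam
```

In practice the surviving sentences would then pass through the USE-similarity and PPL quality filters before being accepted.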
Model to determine whether a question is answerable given the modified context. We use a RoBERTa-base model fine-tuned on the original SQuAD 2.0 dataset as the answerability judgment model for SQuAD 1.1, because the two datasets share the same corpus and a model trained on SQuAD 2.0 can predict whether a question is answerable. If the model outputs the highest answer probability on the special "<s>" token at the beginning of the input, the current sample is regarded as unanswerable.
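The decision rule reduces to an argmax check over the judge model's per-token answer probabilities; a minimal sketch (with made-up probability vectors in place of real model outputs):

```python
def is_unanswerable(token_probs, cls_index=0):
    """Return True when the model places its highest answer probability on
    the special "<s>" token (position cls_index), i.e. the SQuAD 2.0-style
    signal that the question is unanswerable for this context.
    """
    best = max(range(len(token_probs)), key=token_probs.__getitem__)
    return best == cls_index

print(is_unanswerable([0.7, 0.1, 0.2]))  # True: "<s>" wins -> unanswerable
print(is_unanswerable([0.1, 0.6, 0.3]))  # False: a context token wins
```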
For the other four datasets, we use RoBERTa models fine-tuned on newly constructed training sets. More specifically, each set includes the original samples, labeled answerable, and an equal number of negative (unanswerable) samples. Each negative sample pairs a given context with a question randomly sampled from the whole dataset that does not belong to that context. We follow the same training pattern as SQuAD 2.0 to fine-tune the RoBERTa models, so that each model can both answer answerable samples and output the "unanswerable" label for unanswerable ones. We list the performance of all these models used in our experiments in Table 6.
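The negative-sample construction can be sketched as below; a simplified stand-in for the actual dataset-building script, operating on bare (context, question) pairs:

```python
import random

def build_answerability_set(samples, seed=0):
    """Build a balanced answerability training set.

    samples: list of (context, question) pairs from the original dataset.
    For every original (answerable) pair, add one unanswerable pair whose
    question is drawn from a *different* sample in the corpus.
    """
    rng = random.Random(seed)
    out = [(c, q, "answerable") for c, q in samples]
    for i, (context, _) in enumerate(samples):
        j = rng.choice([k for k in range(len(samples)) if k != i])
        out.append((context, samples[j][1], "unanswerable"))
    return out
```

As the paper notes, such negatives are comparatively easy, since a randomly drawn question usually has little lexical overlap with the context.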
Our constructed data are less challenging for models, because the questions of negative samples are randomly sampled from the whole corpus and may thus be quite different from the context and easy to distinguish. We list all hyperparameter values used by TASA in Table 7; they were obtained by empirical tuning based on the trade-off between attack effectiveness and textual quality. We conduct all our experiments on a single NVIDIA V100 GPU. We also publish our code anonymously at https://anonymous.4open.science/r/TASA/.
The possible limitations of our method: TASA is only applicable to extractive QA tasks, and neither the question nor the answer is perturbed to achieve better deception of models, which we leave for future work.

A.3 Baselines
We run the official code provided by the authors of the original baseline papers to derive the relevant results in our experiments. We have tried our best to reproduce the results reported in those papers, but their configurations are quite different from ours.
TextFooler. Since this method is not designed for QA tasks, we made some modifications: 1) we only use the context as the attack target and mask tokens within it to get their importance scores; 2) to avoid changing the answer, we exclude answer tokens from the perturbation targets; 3) we use the prediction probability on the gold answer to evaluate each attack step and determine when to stop. We implement the attack based on the official code and keep the other settings at their defaults.
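The modified importance-ranking step can be sketched as below. This is an illustrative sketch, not TextFooler's actual code: `gold_prob` is a hypothetical callable standing in for the victim model's gold-answer probability:

```python
def token_importance(tokens, answer_span, gold_prob, mask="[MASK]"):
    """Rank context tokens by how much masking them drops the model's
    gold-answer probability, skipping tokens inside the answer span so
    the gold answer itself is never perturbed.

    answer_span: (start, end) token indices of the gold answer (end exclusive).
    gold_prob(tokens) -> model's probability of the gold answer.
    Returns token indices sorted from most to least important.
    """
    base = gold_prob(tokens)
    scores = {}
    for i in range(len(tokens)):
        if answer_span[0] <= i < answer_span[1]:
            continue  # modification 2): never touch answer tokens
        masked = tokens[:i] + [mask] + tokens[i + 1:]
        scores[i] = base - gold_prob(masked)  # larger drop = more important
    return sorted(scores, key=scores.get, reverse=True)
```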
T3. We apply its official code directly, as it already supports attacking QA datasets in SQuAD format. For a fair comparison, we use its black-box configuration, which does not access the internal parameters of models. Besides, we use its targeted configuration, which specifically misguides model predictions toward the pseudo answer in the distracting sentence and shows better performance.

A.4 Datasets
We provide statistics of the 5 datasets we used in Table 8. We use the official release of SQuAD 1.1 and the MRQA versions of the other 4 datasets, which we transform into the same format as SQuAD 1.1 for the convenience of our experiments.

B.1 Using BiDAF as the Victim Model
We also include BiDAF as a victim model in our experiments, as it is a representative end-to-end RNN-based model. The related results are not provided in the main part due to the page limit and its modest performance compared to SOTA models. The attack results on the same five datasets as in §4 are shown in Table 9. As before, our TASA achieves the best attack effectiveness on 3 of the 5 datasets according to the decline in EM and F1, while remaining comparable on the other 2. In addition, TASA achieves the lowest overall PPL among all conditions and performance close to TextFooler in terms of grammar errors. These observations again demonstrate the superiority of our method. Moreover, it is noticeable that BiDAF is less vulnerable than BERT, as its performance degradation is smaller, especially on datasets with long contexts, e.g., NewsQA and TriviaQA.

B.2 Shift to Pseudo Answers
Since DASs aim to draw models' focus away from the original answer sentences and misguide models into making predictions on pseudo answers, we conducted experiments in §4.4 to verify their validity.
Here, we provide more results for this experiment, including not only the F1 scores between the predictions of the 3 models on TASA adversarial samples and the pseudo answers contained in the corresponding DASs, but also the F1 scores between the pseudo answers and the models' predictions on the original samples, as a further comparison to eliminate the possible influence of pre-existing prediction overlap. Results are given in Table 10. Obviously, under all conditions, all models tend to predict answers that overlap more with the pseudo answers when given TASA adversarial samples, proving the misguiding effect of DASs. Besides, the F1 difference between predictions on TASA samples and on the original samples shrinks on datasets where TASA's attack capability is consistently weaker, such as NewsQA and TriviaQA. This shows that how efficiently DASs draw models' attention markedly affects the attack performance when combined with PASs.
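The overlap metric used above is the standard SQuAD-style token-level F1 between two answer strings; a self-contained sketch (omitting the usual normalization of casing, articles, and punctuation):

```python
from collections import Counter

def token_f1(prediction, reference):
    """Token-level F1 between two answer strings, as used to compare
    model predictions against pseudo answers."""
    pred, ref = prediction.split(), reference.split()
    common = Counter(pred) & Counter(ref)   # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(token_f1("the red cat", "the cat"))  # 0.8
```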

B.3 Analysis of Computational Complexity
We illustrate the per-sample attack time and the number of queries to the victim models for our TASA and two baselines, TextFooler and T3, on the SQuAD 1.1 dataset and all 3 types of models, in Figure 5. Note that T3 issues a constant number of queries to the victim model, so it is excluded from the query-number comparison. All results are obtained on a single NVIDIA V100 GPU. TASA is the fastest attack method among those compared, and it also makes fewer queries to the victim model before obtaining an adversarial sample. Although T3's query number is constant, its complexity depends on the size of the target model's embedding.

B.4 The Composition of Samples Generated by TASA
Although we design twin sentences, PAS and DAS, to attack QA models, it is possible that not both of them are applicable to a given sample. For example, only a PAS is applicable if there is no proper named entity or noun that can be edited in the answer sentence apart from keywords and the gold answer; conversely, only a DAS is applicable when no overlapping keyword is found between the answer sentence and the question. A sample where only a PAS or only a DAS is applied is also put into the final adversarial sample set, along with samples where both PAS and DAS (PAS+DAS) are involved. To study the composition of adversarial sample sources, as well as the performance of victim models on each part, we provide the ratio of each type of sample generated by TASA on the different datasets, along with the performance of the QA models on them, in Table 11. PAS+DAS samples compose the majority of adversarial samples on nearly all datasets, while samples containing only a DAS are generally more numerous than samples with only a PAS. Regarding the performance of QA models on each part, PAS+DAS shows the best attack effectiveness among all sample types, because such samples not only deceive models with perturbed keywords but also use distracting answer sentences to misguide models into wrong predictions on the included pseudo answers. On the other hand, using only a PAS or only a DAS lowers the attack effectiveness: a single attack source may not sufficiently fool models, which proves the necessity of combining the two pitfalls discussed in §2 into the adversarial attack on QA. Moreover, the attack difference between PAS+DAS and PAS narrows on datasets with longer contexts, such as NewsQA and TriviaQA, where the EM and F1 values on these two sample types are closer. The relatively weak attack ability on such datasets is likely the main cause. Besides, longer input sequences lower the attention weight a model places on each token, so merely adding a PAS also has less influence because its ratio within the whole input becomes smaller.

C Qualitative Samples
We provide some samples generated by TextFooler, T3, and TASA, along with the corresponding model predictions, in Table 12 and Table 13. We also provide screenshots of the human evaluation instructions in Figure 6 and Figure 7.

Figure 1 :
Figure 1: An example of TASA generating an adversarial context C. Underlined parts indicate keywords. Orange indicates the gold answer or pseudo answer. Other colors indicate tokens for perturbation, distraction, or coreference.

Figure 3 :
Figure 3: The EM, F1, and quantities of adversarial samples using different beam sizes on three victim models.

Figure 5 :
Figure 5: The per-sample time to generate adversarial samples (in seconds) and the average number of queries to victim models for TextFooler, T3, and TASA, using all kinds of victim models on the SQuAD 1.1 dataset.

Table 2 :
Main results on 5 QA datasets. The best results are in bold. Num is the number of samples in a dataset or of adversarial samples generated from the whole dataset by a method. ↓ indicates that lower is better.
*: samples are annotated by humans.

Table 4 :
Results of TASA ablation studies on SQuAD 1.1 dataset using BERT as the victim model.

Table 5 :
F1 score of predicted answers and pseudo answers, on adversarial samples from TASA with DASs.

Table 7 :
Values of hyperparameters used in TASA.

Table 8 :
The statistics of the 5 datasets used in our experiments. |C| is the average context length and |Q| the average question length, both in tokens.

Table 9 :
Attack results on 5 QA datasets using BiDAF as the victim model. The best results are in bold. Num is the number of samples in a dataset or generated from the whole dataset by a method. ↓ indicates that lower is better.

Table 11 :
The ratio and performance of QA models on different compositions of adversarial samples generated by TASA, on all 5 datasets and 3 victim models. PAS+DAS: both PAS and DAS are applicable in the current sample; PAS: only PAS is applicable; DAS: only DAS is applicable.