Improving Question Answering Model Robustness with Synthetic Adversarial Data Generation

Despite recent progress, state-of-the-art question answering models remain vulnerable to a variety of adversarial attacks. While dynamic adversarial data collection, in which a human annotator tries to write examples that fool a model-in-the-loop, can improve model robustness, this process is expensive, which limits the scale of the collected data. In this work, we are the first to use synthetic adversarial data generation to make question answering models more robust to human adversaries. We develop a data generation pipeline that selects source passages, identifies candidate answers, generates questions, then finally filters or re-labels them to improve quality. Using this approach, we amplify a smaller human-written adversarial dataset to a much larger set of synthetic question-answer pairs. By incorporating our synthetic data, we improve the state-of-the-art on the AdversarialQA dataset by 3.7F1 and improve model generalisation on nine of the twelve MRQA datasets. We further conduct a novel human-in-the-loop evaluation and show that our models are considerably more robust to new human-written adversarial examples: crowdworkers can fool our model only 8.8% of the time on average, compared to 17.6% for a model trained without synthetic data.


Introduction
Large-scale labelled datasets like SQuAD (Rajpurkar et al., 2016) and SNLI (Bowman et al., 2015) have been driving forces in natural language processing research. Over the past few years, however, such "statically collected" datasets have been shown to suffer from various problems. In particular, they often exhibit inadvertent spurious statistical patterns that models learn to exploit, leading to poor model robustness and generalisation (Jia and Liang, 2017; Gururangan et al., 2018; Geva et al., 2019; McCoy et al., 2019; Lewis et al., 2021a).

* Most of this work was carried out while MB was at Facebook AI Research.
Figure 1: Overview of the synthetic data generation pipeline, illustrated on a Wikipedia passage about Old English: (i) passage selection; (ii) answer candidate selection and filtering by model confidence (an example retained answer shown in green, and a dropped answer candidate in red); (iii) question generation using BART Large ; and (iv) answer re-labelling using self-training. The generated synthetic data is then used as part of the training data for a downstream Reading Comprehension model.
A recently proposed alternative is dynamic data collection (Bartolo et al., 2020;Nie et al., 2020), where data is collected with both humans and models in the annotation loop. Usually, these humans are instructed to ask adversarial questions that fool existing models. Dynamic adversarial data collection is often used to evaluate the capabilities of current state-of-the-art models, but it can also create higher-quality training data (Bartolo et al., 2020;Nie et al., 2020) due to the added incentive for crowdworkers to provide challenging examples. It can also reduce the prevalence of dataset biases and annotator artefacts over time (Bartolo et al., 2020;Nie et al., 2020), since such phenomena can be subverted by model-fooling examples collected in subsequent rounds. However, dynamic data collection can be more expensive than its static predecessor as creating examples that elicit a certain model response (i.e., fooling the model) requires more annotator effort, resulting in more time spent, and therefore higher cost per example.
In this work, we develop a synthetic adversarial data generation pipeline, making novel contributions to the answer selection, question generation, and filtering and re-labelling tasks. We show that dynamic adversarial data collection can be made more sample efficient by synthetically generating (see Figure 1) examples that improve the robustness of models in terms of performance on adversariallycollected datasets, comprehension skills, and domain generalisation.
We are also the first to evaluate models in-the-loop for robustness to human adversaries using the macro-averaged validated model error rate, demonstrating considerable improvements: crowdworkers are only able to fool the model-in-the-loop 8.8% of the time on average, compared to 17.6% for our best baseline. The collected dataset will form part of the evaluation for a new round of the Dynabench QA task.

Related Work

Adversarial Data Collection
We directly extend the AdversarialQA dataset collected in "Beat the AI" (Bartolo et al., 2020), which uses the same passages as SQuAD1.1. AdversarialQA was collected by asking crowdworkers to write extractive question-answering examples that three different models-in-the-loop were unable to answer correctly, creating the D BiDAF , D BERT , and D RoBERTa subsets.
Other datasets for question answering (Rajpurkar et al., 2018;Dua et al., 2019;Wallace et al., 2019), sentiment analysis (Potts et al., 2021), hate speech detection (Vidgen et al., 2021), and natural language inference (Nie et al., 2020) have been collected in a similar manner. While appealing, human-generated adversarial data is expensive to collect; our work is complementary in that it explores methods to extract further value from existing adversarially collected datasets without requiring additional annotation effort.

Synthetic Question Generation
Many approaches have been proposed to generate question-answer pairs given a passage (Du et al., 2017;Du and Cardie, 2018;Zhao et al., 2018;Lewis and Fan, 2019;Alberti et al., 2019;Puri et al., 2020;Lewis et al., 2021b). These generally use a two-stage pipeline that first identifies an answer conditioned on a passage, then generates a question conditioned on the passage and answer; we train a similar pipeline in our work. G-DAUG (Yang et al., 2020) trains generative models to synthesise training data for commonsense reasoning. Our work focuses on extractive question-answering (QA), which motivates the need for different generative models. Yang et al. (2020) filter generated examples using influence functions, or methods that attempt to maximise diversity; we find that a different approach that considers answer agreement between QA models trained with different random seeds leads to better performance in our setting.

Self-training
In self-training, a model is trained to both predict correctly on labelled examples and increase its confidence on unlabelled examples. Self-training can yield complementary accuracy gains with pretraining (Du et al., 2020) and can improve robustness to domain shift (Kumar et al., 2020). In our setting, large amounts of unlabelled adversarial-style questions are not readily available, which motivates our use of a question generation model.

Human Evaluation
The ultimate goal of automatic machine learning model evaluation is usually stated as capturing human judgements (Callison-Burch et al., 2006;Hill et al., 2015;Vedantam et al., 2015;Liu et al., 2016). Evaluation with real humans is considered beneficial, but not easily scalable, and as such is rarely conducted in-the-loop. With NLP model capabilities ever improving, adversarial worst case evaluation becomes even more pertinent. To our knowledge, this work is the first to compare models explicitly by their adversarial validated model error rate (vMER), which we define in Section 4.4.

Synthetic Data Generation
We develop a synthetic data generation pipeline for QA that involves four stages: passage selection, answer candidate selection, question generation, and synthetic data filtering and re-labelling. Due to the complexity of the system, we study each of these in isolation, and then combine our best identified approaches for the final systems. We evaluate each component both intrinsically and on its contribution to downstream QA performance on the AdversarialQA test sets and an unseen split of the SQuAD1.1 dev set. The final synthetic data generation pipeline consists of:

1. Passage selection: we use passages from Wikipedia for this work.

2. Answer candidate selection: the model identifies spans within the passage that are likely to be answers to a question.

3. Question generation: a generative model is used to generate a question, conditioned on the passage and each answer.

4. Filtering and re-labelling: synthetic question-answer pairs that do not meet the necessary criteria are discarded, or have their answers re-labelled using self-training.
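The four stages above can be sketched end-to-end as follows. This is a minimal illustration, not the paper's implementation: the stage functions are hypothetical stand-ins (a trivial capitalised-token answer selector, a templated question generator, and a single callable in place of the QA ensemble).

```python
# Illustrative sketch of the four-stage pipeline described above.
# All stage functions are simplified stand-ins for the paper's models.

def select_passages(corpus):
    # Stage 1: keep passages long enough to support interesting questions.
    return [p for p in corpus if len(p.split()) >= 10]

def select_answer_candidates(passage):
    # Stage 2: stand-in for the answer-candidate model; here we naively
    # propose capitalised tokens as candidate answers.
    return [tok for tok in passage.split() if tok[0].isupper()]

def generate_question(passage, answer):
    # Stage 3: stand-in for the BART-Large question generator.
    return f"What does the passage say about {answer}?"

def filter_and_relabel(pairs, qa_model):
    # Stage 4: keep only pairs whose (hypothetical) QA model prediction
    # is consistent with the prompted answer.
    return [(q, a) for q, a in pairs if qa_model(q) == a]

def generate_synthetic_data(corpus, qa_model):
    data = []
    for passage in select_passages(corpus):
        pairs = [(generate_question(passage, a), a)
                 for a in select_answer_candidates(passage)]
        data.extend(filter_and_relabel(pairs, qa_model))
    return data
```

In the paper, each stage is a trained model; this skeleton only shows how their outputs feed one another.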
Results for the baseline and overall best performing systems are shown in Table 7. Further results for ELECTRA Large (Clark et al., 2020) are shown in Appendix J.

Data Generation Pipeline
In order to generate synthetic adversarial examples, we first select passages, then identify candidate answers in those passages, generate corresponding questions for these answers, and finally filter or re-label them for improved quality based on various criteria.

Passage Selection
The text passages we use are sourced from SQuAD (further details can be found in Appendix A). We also experiment with using passages external to SQuAD, which are also sourced from Wikipedia. To preserve evaluation integrity, we analyse the 8-gram overlap of all external passages with the evaluation datasets, after normalisation to lower-cased alphanumeric words with a single space delimiter (Radford et al., 2019). We find that just 0.3% of the external passages have any overlap with the evaluation sets, and filter these out.
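The overlap check described above can be sketched as follows; this is a minimal illustration of the normalisation and n-gram intersection, not the paper's exact script.

```python
import re

def normalise(text):
    # Lower-case and keep only alphanumeric words, joined by a single
    # space (normalisation after Radford et al., 2019).
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))

def ngrams(text, n=8):
    # Set of all n-grams of the normalised text.
    toks = normalise(text).split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def has_overlap(passage, eval_passages, n=8):
    # True if the candidate passage shares any n-gram with any
    # evaluation passage; such passages are filtered out.
    eval_grams = set()
    for p in eval_passages:
        eval_grams |= ngrams(p, n)
    return bool(ngrams(passage, n) & eval_grams)
```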

Answer Candidate Selection
The next step is to identify which spans of text within the passages are likely to be answers to a question. We investigate a range of existing methods for answer candidate selection, which take the passage as input and output a set of possible answers. We further propose a self-attention-based classification head that jointly models span starts and ends, with improved performance. Since SQuAD and the AdversarialQA datasets use the same passages partitioned into the same data splits, we align the annotated answers to create representative answer selection training, validation, and test sets. Dataset statistics (see Appendix C) highlight the high percentage of overlapping answers, suggesting that existing answer tagging methods (Zhou et al., 2017; Zhao et al., 2018) might struggle, and that models should ideally be capable of handling span overlap.
Baseline Systems We investigate three baseline systems: noun phrases and named entities following Lewis et al. (2019), as well as an extended part-of-speech tagger incorporating named entities, adjectives, noun phrases, numbers, distinct proper nouns, and clauses.

Span Extraction
We fine-tune a RoBERTa Large span extraction model as investigated in previous work (Alberti et al., 2019;Lewis and Fan, 2019). We treat the number of candidates to sample as a hyper-parameter and select the optimal value for k ∈ {1, 5, 10, 15, 20} on the validation set.
Generative Answer Detection We use BART Large (Lewis et al., 2020) in two settings: one generating both the answer and the question, and the other generating the answer only, as we find that the latter provides better control of answer diversity. We use the same range of k ∈ {1, 5, 10, 15, 20} for both settings.
Self-Attention Labelling (SAL) We propose a multi-label classification head to jointly model candidate start and end tokens, and provide a binary label for whether each possible span of text from the passage is a candidate answer. We adapt scaled dot-product attention (Vaswani et al., 2017) where the candidate start, S, and end, E, token representations are analogous to the projected layer input queries and keys. We apply a sigmoid over the computed attention scores, giving a matrix where each cell gives the probability p(a_ij | c) of whether the span in the context, c, with start index i and end index j is a valid answer candidate. Formally:

p(a_ij | c) = σ(S_i E_j^T / √d)

where d is the dimensionality of the token representations.

We optimise using binary cross-entropy, masking out impossible answer spans (those not in the passage, with end indices before start indices, or longer than the maximum permitted answer length), and upweight positive examples to help counteract the class imbalance. We decode from the output probability matrix to the original passage tokens using a reversible tokeniser, and use a probability threshold of 0.5 for candidate selection, which can be adapted to tune precision and recall. While answer candidate selection only requires a single attention head, the multi-head implementation allows application to any labelling task requiring span modelling with overlaps, where each head is trained to predict labels for one class, such as nested Named Entity Recognition. We implement this in Transformers (Wolf et al., 2020) and fine-tune RoBERTa Large with SAL on the answer selection dataset.
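A minimal numeric sketch of the SAL scoring and decoding described above, assuming precomputed start/end token representations S and E (in the model these come from learned projections of RoBERTa hidden states; here they are plain lists of floats):

```python
import math

def sal_scores(S, E, max_len=8):
    # S[i], E[j]: start/end token representations (lists of floats).
    # Returns p[i][j] = sigmoid(S_i . E_j / sqrt(d)): the probability
    # that the span from token i to token j is an answer candidate.
    # Impossible spans (j < i, or longer than max_len) are masked to 0.
    d = len(S[0])
    n = len(S)
    probs = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if j < i or (j - i + 1) > max_len:
                continue  # masked: not a valid span
            score = sum(s * e for s, e in zip(S[i], E[j])) / math.sqrt(d)
            probs[i][j] = 1.0 / (1.0 + math.exp(-score))
    return probs

def select_candidates(probs, threshold=0.5):
    # Decode spans whose probability exceeds the threshold; overlapping
    # spans can both be selected, unlike BIO-style tagging.
    return [(i, j) for i, row in enumerate(probs)
            for j, p in enumerate(row) if p > threshold]
```

Note how overlapping spans are independent predictions, which is what lets SAL handle the overlapping answers discussed above.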
Evaluation We evaluate performance on the answer selection dataset using entity-level precision, recall, and F1 on unique normalised candidates. Results are shown in Table 1. We further investigate the effects of different answer candidate selection methods on downstream QA model performance (see Table 2) by training a RoBERTa Large model on synthetic QA pairs generated using each answer selection method. To eliminate generated dataset size as a potential confounder, we also replicate these experiments using a sample of 87,000 examples and find similar results (see Appendix C).
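The entity-level metrics above can be computed over sets of unique normalised candidates as in the following sketch (the whitespace normalisation is our simplifying assumption):

```python
def normalise(ans):
    # Simple normalisation: lower-case and collapse whitespace.
    return " ".join(ans.lower().split())

def candidate_prf(predicted, gold):
    # Entity-level precision/recall/F1 over unique normalised candidates:
    # a predicted candidate counts as correct iff its normalised form
    # exactly matches a normalised gold answer.
    pred = {normalise(a) for a in predicted}
    ref = {normalise(a) for a in gold}
    tp = len(pred & ref)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```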

Question Generation
Once answer candidates have been identified for a selected passage, we then generate a corresponding question by directly fine-tuning a BART Large (Lewis et al., 2020) autoregressive sequence generation decoder. 2 To discourage the model from memorising the questions in the SQuAD training set and directly reproducing these, we train on a subset of 10k examples from SQuAD, selected such that they correspond to the same source passages as the AdversarialQA training data. This ensures that when scaling up synthetic generation, the vast majority of passages are previously completely unseen to the generator.
Source Questions Since the types of questions a generative model is trained on can impact both performance and diversity, we experiment with training on SQuAD, different subsets of AdversarialQA, and the combination of both. Examples of the generated questions are shown in Table 3.
We carry out a manual answerability analysis on a random sample of 30 generated questions (using beam search with k = 5) in each of these settings (see Table 4 and Appendix B). We define answerability by the following criteria: (i) The question must be answerable from a single continuous span in the passage; (ii) There must be only one valid (or clearly one most valid) answer (e.g. in the case of a co-reference the canonical entity name should be the answer); (iii) A human should be able to answer the question correctly given sufficient time; and (iv) The correct answer is the one on which the model was conditioned during question generation.

Context: Following the series revival in 2005, Derek Jacobi [ANS] provided the character's re-introduction in the 2007 episode "Utopia". During that story the role was then assumed by John Simm who returned to the role multiple times through the Tenth Doctor's tenure. As of the 2014 episode "Dark Water," it was revealed that the Master had become a female incarnation or "Time Lady," going by the name of "Missy", played by Michelle Gomez.

SQuAD 10k : Who portrayed the Master in the 2007 episode "Utopia"?

We find that when the models attempt to generate complex questions, the generated question is often inconsistent with the target answer, despite remaining well-formed. We also observe that when the generated question requires external knowledge (e.g. "What is a tribe?" or "Which is not a country?") the models are reasonably consistent with the answer; however, they often lose answer consistency when answering the question requires resolving information in the passage (e.g. "What is the first place mentioned?"). For each of these models, we generate 87k examples (the same size as the SQuAD training set, to facilitate comparison) using the human-provided answers, and then measure the effects on downstream performance by training a QA model on this synthetic data. Results are shown in Table 5. We find that, in this setting, the best source data for the generative model is consistently the combination of SQuAD and AdversarialQA. We also note that using only synthetic generated data, we can achieve good performance on D SQuAD , consistent with the findings of Puri et al. (2020), and outperform the model trained on the human-written SQuAD data on D BERT (+0.6F1) and D RoBERTa (+6.6F1). This is in line with the observations of Bartolo et al. (2020) suggesting that the distribution of the questions collected using progressively stronger models-in-the-loop is less similar to that of SQuAD. It also shows that the generator can successfully identify and reproduce patterns of adversarially-written questions. However, the results using synthetic data alone are considerably worse than when training the QA model on human-written adversarial data, with, for example, a performance drop of 21.2F1 for D BERT .
This suggests that while we can do well on SQuAD using synthetic questions alone, we may need to combine the synthetic data with the human-written data for best performance in the more challenging adversarial settings.
Question Diversity In order to provide training signal diversity to the downstream QA model, we experiment with a range of decoding techniques (see Appendix D), and then evaluate these by downstream performance of a QA model trained on the questions generated in each setting. We observe minimal variation in downstream performance as a result of question decoding strategy, with the best downstream results obtained using nucleus sampling (top p = 0.75). However, we also obtain similar downstream results with standard beam search using a beam size of 5. We find that, given the same computational resources, standard beam search is roughly twice as efficient, and therefore opt for this approach for our following experiments.
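As an illustration of the decoding strategies compared above, the following sketch shows top-p (nucleus) truncation over a toy next-token distribution. This is our own minimal illustration, not the generation code used in the paper, which decodes from BART.

```python
import random

def nucleus_filter(probs, top_p=0.75):
    # Keep the smallest set of highest-probability tokens whose
    # cumulative probability reaches top_p, then renormalise; tokens
    # outside the nucleus can never be sampled.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        total += p
        if total >= top_p:
            break
    return {tok: p / total for tok, p in kept}

def sample_token(probs, top_p=0.75, rng=random):
    # Draw one token from the renormalised nucleus.
    nucleus = nucleus_filter(probs, top_p)
    toks, weights = zip(*nucleus.items())
    return rng.choices(toks, weights=weights)[0]
```

Beam search, by contrast, is deterministic: it keeps the k highest-scoring partial sequences at each step, which is why it trades diversity for efficiency.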

Filtering and Re-labelling
The synthetic question generation process can introduce various sources of noise, as seen in the previous analysis, which could negatively impact downstream results. To mitigate these effects, we explore a range of filtering and re-labelling methods. Results for the best performing hyper-parameters of each method are shown in Table 6 and results controlling for dataset size are in Appendix E.
Answer Candidate Confidence We select candidate answers using SAL (see Section 3.1.2), and filter based on the span extraction confidence of the answer candidate selection model.

Question Generator Confidence We filter out samples below various thresholds of the probability score assigned to the generated question by the question generation model.

Influence Functions
We use influence functions (Cook and Weisberg, 1982;Koh and Liang, 2017) to estimate the effect on the validation loss of including a synthetic example as explored by Yang et al. (2020), but adapted for QA. We filter out examples estimated to increase the validation loss.
Ensemble Roundtrip Consistency Roundtrip consistency (Alberti et al., 2019;Fang et al., 2021) uses an existing fine-tuned QA model to attempt to answer the generated questions, ensuring that the predicted answer is consistent with the target answer prompted to the generator. Since our setup is designed to generate questions which are intentionally challenging for the QA model to answer, we attempt to exploit the observed variation in model behaviour over multiple random seeds, and replace the single QA model with a six-model ensemble.
We find that filtering based on the number of downstream models that correctly predict the original target answer for the generated question produces substantially better results than relying on the model confidence scores, which could be prone to calibration imbalances across models.

Self-training
Filtering out examples that are not roundtrip-consistent can help eliminate noisy data; however, it also unnecessarily discards questions that may be difficult to answer but to which a valid answer still exists. Self-training has been shown to improve robustness to domain shift (Kumar et al., 2020) and, in our case, we re-label answers to the generated questions based on the six QA model predictions.
Specifically, in our best-performing setting, we keep any examples where at least five of the six QA models agree with the target answer (i.e. the one with which the question generator was originally prompted), re-label the answers for any examples where at least two of the QA models agree among themselves, and discard the remaining examples (i.e. those for which there is no agreement between any of the QA models).
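The keep/re-label/discard rule above can be sketched as follows; the function and threshold names are ours, chosen to mirror the thresholds stated in the text:

```python
from collections import Counter

def filter_or_relabel(target_answer, ensemble_answers,
                      keep_at=5, relabel_at=2):
    # ensemble_answers: predictions from the six QA models.
    # Keep the example if >= keep_at models agree with the prompted
    # target answer; otherwise re-label it with an answer that
    # >= relabel_at models agree on; otherwise discard (return None).
    counts = Counter(ensemble_answers)
    if counts[target_answer] >= keep_at:
        return target_answer
    answer, n = counts.most_common(1)[0]
    if n >= relabel_at:
        return answer
    return None  # no agreement among the models: discard
```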
We find that the best method combines self-training with answer candidate confidence filtering. By using appropriate filtering of the synthetic generated data, combined with the ability to scale to many more generated examples, we approach the performance of R SQuAD+AQA , practically matching performance on SQuAD and reducing the performance disparity to just 2.2F1 on D BiDAF , 6.6F1 on D BERT , and 8.3F1 on D RoBERTa , while still training solely on synthetic data.

Table 7: We report the mean and standard deviation (subscript) over 6 runs with different random seeds. mvMER is the macro-averaged validated model error rate in the adversarial human evaluation setting (lower is better).

End-to-end Synthetic Data Generation
We also try using BART to both select answers and generate questions in an end-to-end setting. We experiment with different source datasets, numbers of generations per passage, and decoding hyper-parameters, but our best results fall short of the best pipeline approach, at 62.7/77.9 EM/F1 on D SQuAD , 30.8/47.4 on D BiDAF , 23.6/35.6 on D BERT , and 18.0/28.3 on D RoBERTa . These results are competitive with some of the other answer candidate selection methods we explored, but fall short of the results obtained using SAL. We find that this approach tends to produce synthetic examples with similar answers, and leave exploring decoding diversity to future work.

Fine-tuning Setup
We investigate two primary fine-tuning approaches: combining all training data, and a two-stage set-up in which we first fine-tune on the generated synthetic data, and then perform a second-stage of finetuning on the SQuAD and AdversarialQA humanwritten datasets. Similar to Yang et al. (2020), we find that two-stage training marginally improves performance over standard mixed training, and we use this approach for all subsequent experiments.

Measuring Model Robustness
Based on the findings in the previous section, we select four final models for robustness evaluation:

1. R SQuAD : trained on the SQuAD1.1 training data.

2. R SQuAD+AQA : trained on SQuAD combined and shuffled with AdversarialQA.

3. SynQA: uses a two-stage fine-tuning approach, first trained on 380,785 synthetically generated questions on the passages in the SQuAD training set, and then further fine-tuned on SQuAD and AdversarialQA.

4. SynQA Ext : first trained on the same synthetic SQuAD examples as (3), combined with 1.5M synthetic questions generated on the previously described Wikipedia passages external to SQuAD, and then further fine-tuned on SQuAD and AdversarialQA.
Individual models are selected for the best combined and equally-weighted performance on a split of the SQuAD validation set and all three Adver-sarialQA validation sets. We first evaluate model robustness using three existing paradigms: adversarially-collected datasets, checklists, and domain generalisation. We also introduce adversarial human evaluation, a new way of measuring robustness with direct interaction between the human and model.

Adversarially-collected Data
We evaluate the final models on AdversarialQA, with results shown in Table 7. We find that synthetic data augmentation yields state-of-the-art results on AdversarialQA, providing performance gains of 2.3F 1 on D BiDAF , 4.1F 1 on D BERT , and 4.9F 1 on D RoBERTa over the baselines while retaining good performance on SQuAD, a considerable improvement at no additional annotation cost.

Comprehension Skills
CheckList (Ribeiro et al., 2020) is a model-agnostic approach that serves as a convenient test-bed for evaluating which comprehension skills a QA model could learn. We find that some skills that models struggle to learn when trained on SQuAD, such as discerning between profession and nationality, or handling negation in questions, can be learnt by incorporating adversarially-collected data during training (see Appendix H). Furthermore, augmenting with synthetic data improves performance on a variety of these skills, with a 1.7% overall gain for SynQA and 3.1% for SynQA Ext . Adding the external synthetic data improves performance on most taxonomy-related skills, considerably so on "profession vs nationality", as well as skills such as "his/her" coreference, or subject/object distinction. While many of these skills seem to be learnable, it is worth noting the high variation in model performance over multiple random initialisations.

Domain Generalisation
We evaluate domain generalisation of our final models on the MRQA (Fisch et al., 2019) dev sets, with results shown in Table 8. 3 We find that augmenting training with synthetic data provides performance gains on nine of the twelve tasks. Performance improvements on some of the tasks can be quite considerable (up to 8.8F 1 on SearchQA), which does not come at a significant cost on the three tasks where synthetic data is not beneficial.

Adversarial Human Evaluation
While existing robustness measures provide valuable insight into model behaviour, they fail to capture how robust a model might be in a production setting. We use Dynabench (Kiela et al., 2021), a research platform for dynamic benchmarking and evaluation, to measure model robustness in an adversarial human evaluation setting. This allows for live interaction between the model and human annotator, and more closely simulates realistic and challenging scenarios a deployed system might encounter, compared to evaluation on static datasets.
We set up the experiment as a randomised controlled trial where annotators are randomly allocated to one of our four final models based on a hash of their annotator identifier. We run the experiment through Amazon Mechanical Turk (AMT) using Mephisto. Workers (see Appendix I) are first required to complete an onboarding phase to ensure familiarity with the interface, and are then required to ask five questions of the model. We pay $0.20 per question and give a strong incentive to try to beat the model: a $0.50 bonus for each validated question that the model fails to answer correctly. The model identity is kept hidden and workers receive the same base pay irrespective of the model-in-the-loop, to avoid creating an incentive imbalance. Each annotator is allowed to write at most 50 questions, to avoid having a few productive annotators dominate our findings. All model-fooling examples are further validated by an expert annotator. We skip validation of questions the model answered correctly, as manual validation of a sample of 50 such examples found that all were valid, suggesting that the QA model's ability to answer them is a good indicator of their validity.
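The hash-based allocation described above can be sketched as follows; the model names are from this paper, but the specific hashing scheme is an illustrative assumption rather than the platform's implementation:

```python
import hashlib

MODELS = ["R_SQuAD", "R_SQuAD+AQA", "SynQA", "SynQA_Ext"]

def assign_model(annotator_id, models=MODELS):
    # Deterministically allocate an annotator to one of the four
    # models-in-the-loop from a hash of their annotator identifier,
    # so repeat visits always face the same (hidden) model.
    digest = hashlib.sha256(annotator_id.encode("utf-8")).hexdigest()
    return models[int(digest, 16) % len(models)]
```

Because the assignment is a pure function of the identifier, no allocation state needs to be stored, and the split is approximately uniform across annotators.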
We measure performance as the validated model error rate (vMER), that is, the percentage of validated examples that the model fails to answer correctly. Despite limiting the number of collected examples to 50 per annotator, there is still the potential for an imbalance in the number of QA pairs produced by each annotator. In order to eliminate annotator effect as a potential confounder, we propose using the macro-averaged validated model error rate (mvMER) over annotators, defined as:

mvMER = (1/N) Σ_{i=1}^{N} (e_i / q_i)

where N is the number of annotators, e_i is the number of validated model-fooling examples written by annotator i, and q_i is the number of questions they asked.

We find that SynQA roughly halves the model error rate compared to R SQuAD+AQA , from 17.6% to 8.8% (see Table 7, further details in Appendix I), meaning that it is considerably harder for human adversaries to ask questions that the model cannot answer. While SynQA Ext still considerably outperforms R SQuAD+AQA at a 12.3% mvMER, we find that it is not as hard to beat as SynQA in this setting. A low model error rate also translates into increased challenges for the adversarial human annotation paradigm, as the effort required for each model-fooling example increases, and provides motivation to expand the current extractive QA task beyond single answer spans on short passages.
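The macro-average above can be computed as in the following minimal sketch, where each annotator contributes one per-annotator error rate regardless of how many questions they wrote:

```python
def mvmer(per_annotator):
    # per_annotator: list of (validated_model_fooling, total_questions)
    # tuples, one per annotator. Macro-averaging the per-annotator
    # validated error rates stops prolific annotators from dominating.
    rates = [fooled / total for fooled, total in per_annotator if total]
    if not rates:
        return 0.0
    return sum(rates) / len(rates)
```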
These findings further suggest that while static adversarial benchmarks are a good evaluation proxy, performance gains on these may be underestimating the effect on model robustness in a setting involving direct interaction between the models-inthe-loop and human adversaries.

Discussion and Conclusion
In this work, we develop a synthetic adversarial data generation pipeline for QA, identify the best components, and evaluate on a variety of robustness measures. We propose novel approaches for answer candidate selection, adversarial question generation, and synthetic example filtering and relabelling, demonstrating improvements over existing methods. Furthermore, we evaluate the final models on three existing robustness measures and achieve state-of-the-art results on AdversarialQA, improved learnability of various comprehension skills for CheckList, and improved domain generalisation for the suite of MRQA tasks.
We then put the synthetically-augmented models back in-the-loop in an adversarial human evaluation setting to assess whether these models are actually harder for a human adversary to beat.
We find that our best synthetically-augmented model is roughly twice as hard to beat. Our findings suggest that synthetic adversarial data generation can be used to improve QA model robustness, both when measured using standard methods and when evaluated directly against human adversaries.
Looking forward, the methods explored in this work could also be used to scale the dynamic adversarial annotation process in multiple ways. Synthetic adversarial data generation could facilitate faster iteration over rounds of adversarial human annotation as it reduces the amount of human data required to effectively train an improved QA model. Generative models could also help guide or inspire human annotators as they try to come up with more challenging examples. Furthermore, while our work focuses on improving adversarial robustness, this approach is not limited to the adversarial setting. We believe that our findings can motivate similar investigations for tasks where data acquisition can be challenging due to limited resources, or for improving different aspects of robustness, for example for model bias mitigation.

Ethical Considerations
We collect an evaluation dataset as a part of the adversarial human evaluation process. The passages are sourced from the SQuAD1.1 dataset distributed under the CC BY-SA 4.0 license. As described in the main text, we designed our incentive structure to ensure that crowdworkers were fairly compensated. Full details are provided in the main text and Appendix I. Our datasets focus on the English language. As this data is not collected for the purpose of designing NLP applications, we do not foresee any risks associated with the use of this data.

A Further Details on Passage Selection
Passages are sourced from SQuAD1.1, and are therefore from Wikipedia. For training answer candidate selection models and question generation models, we use a subset of 10,000 examples from the SQuAD1.1 training set asked on 2,596 of the 18,891 available training passages. This ensures that both the answer candidate selection and question generation models do not simple reproduce their respective training sets. Bartolo et al.
(2020) split the SQuAD1.1 dev set into a dev and a test set, with passages allocated to one or the other. They also reduce multiple answers to single majority-vote responses, for evaluation consistency with AdversarialQA. These two splits are referred to as D_SQuAD^dev and D_SQuAD^test. We use D_SQuAD^dev and the AdversarialQA dev sets for validation, and report results on D_SQuAD^test and the AdversarialQA test sets. For adversarial human evaluation, we use passages from the test sets to ensure that they are completely unseen by all models during both training and validation.
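The passage-aligned splitting and majority-vote reduction described above can be sketched as follows. This is a minimal illustration; the function names and example dictionary keys are our own, not taken from any released code:

```python
import random
from collections import Counter

def passage_aligned_split(examples, dev_fraction=0.5, seed=0):
    """Split QA examples into dev/test so that no passage appears in both."""
    passages = sorted({ex["passage"] for ex in examples})
    rng = random.Random(seed)
    rng.shuffle(passages)
    n_dev = int(len(passages) * dev_fraction)
    dev_passages = set(passages[:n_dev])
    dev = [ex for ex in examples if ex["passage"] in dev_passages]
    test = [ex for ex in examples if ex["passage"] not in dev_passages]
    return dev, test

def majority_vote_answer(answers):
    """Reduce multiple annotator answers to a single majority-vote answer."""
    return Counter(answers).most_common(1)[0][0]
```

Splitting by passage rather than by question is what guarantees the evaluation passages are unseen at training time.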

B Manual Answerability Analysis
For the manual answerability analysis, we define answerability by the following criteria: (i) The question must be answerable from a single continuous span in the passage; (ii) There must be only one valid (or clearly one most valid) answer (e.g. in the case of a co-reference the canonical entity name should be the answer); (iii) A human should be able to answer the question correctly given sufficient time; and (iv) The correct answer is the one on which the model was conditioned during question generation.
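Of these criteria, (i) and (iv) can be approximated mechanically, while (ii) and (iii) require human judgement. A minimal sketch of the mechanical checks (our own illustrative code, not part of the annotation tooling):

```python
def span_answerable(passage, question, conditioned_answer, model_answer):
    """Mechanical proxies for answerability criteria (i) and (iv); criteria
    (ii) and (iii) require human judgement and are not checked here."""
    # (i) the answer must appear as a single continuous span in the passage
    is_span = conditioned_answer in passage
    # (iv) the model's answer should match the answer the question
    # generator was conditioned on (case/whitespace-normalised)
    norm = lambda s: " ".join(s.lower().split())
    matches = norm(model_answer) == norm(conditioned_answer)
    return is_span and matches
```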

C Further Details on Answer Candidate Selection
Dataset statistics for the passage-aligned splits are shown in Table 9. Furthermore, the different answer candidate selection approaches we explore in this work have different behaviours, which could make one method more appropriate than another depending on the use case. To aid this choice, we provide example answer candidates from each method in Table 11.
D Further Details on Question Generation
We observe minimal variation in downstream performance (see Table 13) as a result of the question decoding strategy, with the best downstream results obtained using nucleus sampling (top-p = 0.75). However, we obtain similar downstream results with standard beam search using a beam size of 5. We find that, given the same computational resources, standard beam search is roughly twice as efficient, with minimal performance drop compared to nucleus sampling, and we therefore opt for this approach in our subsequent experiments.
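As an illustration of nucleus sampling, the following sketch implements the core top-p truncation step on a toy next-token distribution. This is our own simplified code; the actual experiments decode from BART Large:

```python
def top_p_filter(probs, p=0.75):
    """Keep the smallest set of tokens whose cumulative probability
    reaches p, then renormalise (the core of nucleus sampling)."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cum = {}, 0.0
    for tok, pr in ranked:
        kept[tok] = pr
        cum += pr
        if cum >= p:
            break
    # renormalise the retained nucleus into a proper distribution
    total = sum(kept.values())
    return {tok: pr / total for tok, pr in kept.items()}
```

At generation time one would sample the next token from the renormalised nucleus; smaller p concentrates mass on high-probability tokens, while beam search instead searches deterministically for high-likelihood sequences.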

E Controlling for Data Size
Since the synthetic data generation process can scale to a large number of unseen passages, at the limit the bottleneck becomes the quality of the generated data rather than its quantity. We therefore provide results for experiments controlling for dataset size, for both answer candidate selection (see Table 12) and filtering method (see Table 14). Our findings are in line with those on the full sets of generated data: both answer candidate selection using SAL and filtering using self-training provide considerable downstream benefits.
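As a rough sketch of how self-training-based filtering and re-labelling can work, the following code keeps a synthetic pair only if a trained QA model's prediction sufficiently overlaps the answer the generator was conditioned on, optionally replacing the label with the model prediction. The `qa_predict` callable and the 0.5 threshold are illustrative assumptions, not the paper's exact configuration:

```python
from collections import Counter

def f1_overlap(pred, gold):
    """Token-level F1 between two answer strings (SQuAD-style)."""
    p, g = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

def self_train_filter(pairs, qa_predict, threshold=0.5, relabel=True):
    """Filter or re-label synthetic (passage, question, answer) triples
    using a trained QA model; qa_predict stands in for that model."""
    kept = []
    for passage, question, answer in pairs:
        pred = qa_predict(passage, question)
        if f1_overlap(pred, answer) >= threshold:
            kept.append((passage, question, pred if relabel else answer))
    return kept
```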

F A Note on Data Efficiency
It is challenging to compare the efficiency of the synthetic generation process to that of manually collecting additional data. Figure 3 shows that, for RoBERTa Large, performance starts to converge when training on around 5-6k manually-collected adversarial examples. In fact, the performance gain from training on 10k instead of 8k examples is just 0.5 F1 on the overall AdversarialQA test set. Our approach is inherently more efficient from a data collection perspective, as the performance gains it provides require no additional manual annotation.

G AdversarialQA Dev Set Results
Results for the final models on the AdversarialQA validation sets are shown in Table 15.

H Results on CheckList
We provide a breakdown of results by comprehension skill and example model failure cases on CheckList in Table 17.

I Adversarial Human Evaluation
For adversarial human evaluation, crowdworkers are required to be based in Canada, the UK, or the US, have a Human Intelligence Task (HIT) Approval Rate greater than 98%, and have previously completed at least 1,000 HITs.
We provide a breakdown of results from the Adversarial Human Evaluation experiments in Table 16.