On the Efficacy of Adversarial Data Collection for Question Answering: Results from a Large-Scale Randomized Study

In adversarial data collection (ADC), a human workforce interacts with a model in real time, attempting to produce examples that elicit incorrect predictions. Researchers hope that models trained on these more challenging datasets will rely less on superficial patterns, and thus be less brittle. However, despite ADC’s intuitive appeal, it remains unclear when training on adversarial datasets produces more robust models. In this paper, we conduct a large-scale controlled study focused on question answering, assigning workers at random to compose questions either (i) adversarially (with a model in the loop); or (ii) in the standard fashion (without a model). Across a variety of models and datasets, we find that models trained on adversarial data usually perform better on other adversarial datasets but worse on a diverse collection of out-of-domain evaluation sets. Finally, we provide a qualitative analysis of adversarial (vs standard) data, identifying key differences and offering guidance for future research.


Introduction
Across such diverse natural language processing (NLP) tasks as natural language inference (NLI; Poliak et al., 2018; Gururangan et al., 2018), question answering (QA; Kaushik and Lipton, 2018), and sentiment analysis (Kaushik et al., 2020), researchers have discovered that models can succeed on popular benchmarks by exploiting spurious associations that characterize a particular dataset but do not hold more widely. Despite performing well on independent and identically distributed (i.i.d.) data, these models are liable to fail under plausible domain shifts. With the goal of providing more challenging benchmarks that require this stronger form of generalization, an emerging line of research has investigated adversarial data collection (ADC), a scheme in which a worker interacts with a model (in real time), attempting to produce examples that elicit incorrect predictions (e.g., Dua et al., 2019; Nie et al., 2020). The hope is that by identifying parts of the input domain where the model fails, one might make the model more robust. Researchers have shown that models trained on ADC data perform better on such adversarially collected data and that with successive rounds of ADC, crowdworkers are less able to fool the models (Dinan et al., 2019).
While adversarial data may indeed provide more challenging benchmarks, the process and its actual benefits vis-à-vis tasks of interest remain poorly understood, raising several key questions: (i) do the resulting models typically generalize better out of distribution compared to standard data collection (SDC)? (ii) how much of the difference between ADC and SDC can be attributed to the way workers behave when attempting to fool models, regardless of whether they succeed? and (iii) what is the impact of training models on adversarial data only, versus using it as a data augmentation strategy?
In this paper, we conduct a large-scale randomized controlled study to address these questions. Focusing our study on span-based question answering and a variant of the Natural Questions dataset (NQ; Lee et al., 2019; Karpukhin et al., 2020), we work with two popular pretrained transformer architectures, BERT-large (Devlin et al., 2019) and ELECTRA-large (Clark et al., 2020), each fine-tuned on 23.1k examples. To eliminate confounding factors when assessing the impact of ADC, we randomly assign the crowdworkers tasked with generating questions to one of three groups: (i) with an incentive to fool the BERT model; (ii) with an incentive to fool the ELECTRA model; and (iii) a standard, non-adversarial setting (no model in the loop). The pool of contexts is the same for each group, and each worker is asked to generate five questions for each context that they see. Workers are shown similar instructions (with minimal changes) and paid the same base amount.
We fine-tune three models (BERT, RoBERTa, and ELECTRA) on resulting datasets and evaluate them on held-out test sets, adversarial test sets from prior work (Bartolo et al., 2020), and 12 MRQA (Fisch et al., 2019) datasets. For all models, we find that while fine-tuning on adversarial data usually leads to better performance on (previously collected) adversarial data, it typically leads to worse performance on a large, diverse collection of out-of-domain datasets (compared to fine-tuning on standard data). We observe a similar pattern when augmenting the existing dataset with the adversarial data. Results on an extensive collection of out-of-domain evaluation sets suggest that ADC training data does not offer clear benefits vis-à-vis robustness under distribution shift.
To study the differences between adversarial and standard data, we perform a qualitative analysis, categorizing questions based on a taxonomy (Hovy et al., 2000). We notice that more questions in the ADC dataset require numerical reasoning compared to the SDC sample. These qualitative insights may offer additional guidance to future researchers.

Related Work
In an early example of model-in-the-loop data collection, Zweig and Burges (2012) use n-gram language models to suggest candidate incorrect answers for a fill-in-the-blank task. Richardson et al. (2013) suggested ADC for QA as future work, speculating that it might challenge state-of-the-art models. In the Build It, Break It: The Language Edition shared task (Ettinger et al., 2017), teams worked as builders (training models) and breakers (creating challenging examples for subsequent training) for sentiment analysis and QA-SRL.
Research on ADC has picked up recently, with Chen et al. (2019) tasking crowdworkers to construct multiple-choice questions to fool a BERT model and Wallace et al. (2019) employing Quizbowl community members to write Jeopardy-style questions to compete against QA models. Zhang et al. (2018) automatically generated questions from news articles, keeping only those questions that were incorrectly answered by a QA model. Dua et al. (2019) and Dasigi et al. (2019) required crowdworkers to submit only questions that QA models answered incorrectly. To construct FEVER 2.0 (Thorne et al., 2019), crowdworkers were required to fool a fact-verification system trained on the FEVER (Thorne et al., 2018) dataset. Some works explore ADC over multiple rounds, with adversarial data from one round used to train models in the subsequent round. Yang et al. (2018b) ask workers to generate challenging datasets working first as adversaries and later as collaborators. Dinan et al. (2019) build on their work, employing ADC to address offensive language identification. They find that over successive rounds of training, models trained on ADC data are harder for humans to fool than those trained on standard data. Nie et al. (2020) applied ADC to an NLI task over three rounds, finding that training for more rounds improves model performance on adversarial data, and observing improvements on the original evaluation sets when training on a mixture of original and adversarial training data. Williams et al. (2020) conducted an error analysis of model predictions on the datasets collected by Nie et al. (2020). Bartolo et al. (2020) studied the empirical efficacy of ADC for SQuAD (Rajpurkar et al., 2016), observing improved performance on adversarial test sets but noting that trends vary depending on the models used to collect data and to train. Previously, Lowell et al. (2019) showed that data collected with one model in the loop (via active learning) may not benefit successor models of other architectures. Kaushik et al. (2020, 2021) collect counterfactually augmented data (CAD) by asking crowdworkers to edit existing documents to make counterfactual labels applicable, showing that models trained on CAD generalize better out-of-domain.
Absent further assumptions, learning classifiers robust to distribution shift is impossible (Ben-David et al., 2010). While few NLP papers on the matter make their assumptions explicit, they typically proceed under the implicit assumptions that the labeling function is deterministic (there is one right answer) and that covariate shift (Shimodaira, 2000) applies (the labeling function p(y|x) is invariant across domains). Note that neither condition is generally true of prediction problems. For example, faced with label shift (Schölkopf et al., 2012; Lipton et al., 2018), p(y|x) can change across distributions, requiring one to adapt the predictor to each environment.

Study Design
In our study of ADC for QA, each crowdworker is shown a short passage and asked to create 5 questions and highlight answers (spans in the passage, see Fig. 1). We provide all workers with the same base pay and for those assigned to ADC, pay out an additional bonus for each question that fools the QA model. Finally, we field a different set of workers to validate the generated examples.
Context passages For context passages, we use the first 100 words of Wikipedia articles. Truncating the articles keeps the task of generating questions from growing unwieldy. These segments typically contain an overview, providing ample material for factoid questions. We restrict the pool of candidate contexts by leveraging a variant of the Natural Questions dataset (Kwiatkowski et al., 2019; Lee et al., 2019). We first keep only a subset of 23.1k question/answer pairs for which the context passages are the first 100 words of Wikipedia articles. From these passages, we sample 10k at random for our study.
Models in the loop We use BERT-large (Devlin et al., 2019) and ELECTRA-large (Clark et al., 2020) models as our adversarial models in the loop, using the implementations provided by Wolf et al. (2020). We fine-tune these models for span-based question answering, using the 23.1k training examples (subsampled previously) for 20 epochs, with early stopping based on word-overlap F1 over the validation set. Our BERT model achieves an EM score of 73.1 and an F1 score of 80.5 on an i.i.d. validation set. The ELECTRA model performs slightly better, obtaining 74.2 EM and 81.2 F1 on the same set.
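For concreteness, the word-overlap metrics used above can be sketched as follows. This is a simplified version of the standard SQuAD-style EM/F1 computation; the official evaluation script additionally strips articles and punctuation during normalization, which we omit here for brevity.

```python
from collections import Counter

def normalize(text):
    # Lowercase and tokenize on whitespace; the official SQuAD script
    # also removes articles and punctuation (omitted in this sketch).
    return text.lower().split()

def exact_match(prediction, gold):
    # 1.0 if the normalized prediction equals the normalized gold answer.
    return float(normalize(prediction) == normalize(gold))

def f1_score(prediction, gold):
    # Token-level F1: harmonic mean of precision and recall over
    # the multiset of overlapping tokens.
    pred_tokens, gold_tokens = normalize(prediction), normalize(gold)
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

In practice, F1 is computed against each reference answer and the maximum is taken, which matters when validators supply multiple gold spans.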
Crowdsourcing protocol We build our crowdsourcing platform on the Dynabench interface (Kiela et al., 2021) and use Amazon's Mechanical Turk to recruit workers to write questions.
To ensure high quality, we restricted the pool to U.S. residents who had already completed at least 1000 HITs and had over a 98% HIT approval rate. For each task, we conducted several pilot studies to gather feedback from crowdworkers on the task and interface. We identified the median time taken by workers to complete the task in our pilot studies and used that to design the incentive structure for the main task. We also conducted multiple studies with different variants of instructions to observe trends in the quality of questions and refined our instructions based on feedback from crowdworkers. Feedback from the pilots also guided improvements to our crowdsourcing interface. In total, 984 workers took part in the study, with 741 creating questions. In our final study, we randomly assigned workers to generate questions in the following ways: (i) to fool the BERT baseline; (ii) to fool the ELECTRA baseline; or (iii) without a model in the loop. Before beginning the task, each worker completes an onboarding process to familiarize them with the platform. We present the same set of passages to workers regardless of which group they are assigned to, tasking them with generating 5 questions for each passage.
Incentive structure During our pilot studies, we found that workers spend ≈2-3 minutes to generate 5 questions. We provide all workers with the same base pay ($0.75 per HIT) to ensure compensation at a $15/hour rate. For tasks involving a model in the loop, we define a model prediction to be incorrect if its F1 score is less than 40%, following the threshold set by Bartolo et al. (2020). Workers tasked with fooling the model receive a bonus of $0.15 for every question that leads to an incorrect model prediction. This way, a worker can double their pay if all 5 of their generated questions induce incorrect model predictions.
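Under the thresholds above, the payout for a single HIT can be sketched as follows (constant and function names are illustrative, not from any released code):

```python
BASE_PAY = 0.75        # per HIT of 5 questions
BONUS_PER_FOOL = 0.15  # per question that fools the model
FOOL_THRESHOLD = 0.40  # prediction counts as incorrect if F1 < 40%

def hit_payout(f1_scores):
    # f1_scores: the model's word-overlap F1 against the worker's
    # labeled answer, one score per question in the HIT.
    fooled = sum(1 for f1 in f1_scores if f1 < FOOL_THRESHOLD)
    return BASE_PAY + BONUS_PER_FOOL * fooled
```

Fooling the model on all 5 questions yields $0.75 + 5 × $0.15 = $1.50, i.e., double the base pay, as stated above.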
Quality control Upon completion of each batch of our data collection process, we presented ≈20% of the collected questions to a fourth group of crowdworkers who were tasked with validating whether the questions were answerable and the answers were correctly labeled. In addition, we manually verified a small fraction of the collected question-answer pairs. If at least 20% of a worker's validated examples were judged incorrect, their work was discarded in its entirety. The entire process, including the pilot studies, cost ≈$50k and spanned seven months. Through this process, we collected over 150k question-answer pairs corresponding to the 10k contexts (50k from each group), but the final datasets are much smaller, as we explain below.
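The per-worker discard rule can be sketched as follows (a minimal sketch; the function name is illustrative):

```python
def keep_worker(validations):
    # validations: booleans for a worker's spot-checked examples,
    # True if validators judged the example correct. A worker's entire
    # output is discarded if at least 20% of their validated examples
    # were judged incorrect.
    if not validations:
        return True  # no validated examples, nothing to judge against
    incorrect = sum(1 for ok in validations if not ok)
    return incorrect / len(validations) < 0.20
```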

Experiments and Results
Our study allows us to answer three questions: (i) how well do models fine-tuned on ADC data generalize to unseen distributions compared to fine-tuning on SDC data? (ii) how many of the differences between ADC and SDC are due to workers trying to fool the model, regardless of whether they are successful? and (iii) what is the impact of training on adversarial data only versus using it as a data augmentation strategy?
Datasets For both BERT and ELECTRA, we first identify contexts for which at least one question elicited an incorrect model prediction. Note that this set of contexts is different for BERT and ELECTRA. For each such context c, we identify the number of questions k c (out of 5) that successfully fooled the model. We then create 3 datasets per model by, for each context, (i) choosing precisely those k c questions that fooled the model (BERT fooled and ELECTRA fooled ); (ii) randomly choosing k c questions (out of 5) from ADC data without replacement (BERT random and ELECTRA random )-regardless of whether they fooled the model; and (iii) randomly choosing k c questions (out of 5) from the SDC data without replacement. Thus, we create 6 datasets, where all 3 BERT datasets have the same number of questions per context (and 11.3k total training examples), while all 3 ELECTRA datasets likewise share the same number of questions per context (and 14.7k total training examples). See Table 1 for details on the number of passages and question-answer pairs used in the different splits.
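The dataset construction above can be sketched as follows. The data structures and field names are illustrative, not the released format: each ADC context maps to its five (question, fooled) pairs, and each SDC context to its five questions on the same passage.

```python
import random

def build_datasets(adc_examples, sdc_examples):
    # adc_examples: context id -> list of 5 (question, fooled_flag) pairs
    # sdc_examples: context id -> list of 5 questions (no model in loop)
    fooled_set, adc_random_set, sdc_set = [], [], []
    for cid, questions in adc_examples.items():
        fooled = [q for q, flag in questions if flag]
        k = len(fooled)  # k_c: how many of the 5 questions fooled the model
        if k == 0:
            continue  # keep only contexts where >= 1 question fooled the model
        # (i) exactly the k_c questions that fooled the model
        fooled_set.extend(fooled)
        # (ii) k_c questions drawn from all 5 ADC questions, fooled or not
        adc_random_set.extend(random.sample([q for q, _ in questions], k))
        # (iii) k_c questions drawn from the SDC questions on the same context
        sdc_set.extend(random.sample(sdc_examples[cid], k))
    return fooled_set, adc_random_set, sdc_set
```

By construction, all three splits for a given model contain the same number of questions per context, which is what makes the downstream comparisons controlled.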
Models For our empirical analysis, we fine-tune BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), and ELECTRA (Clark et al., 2020) models on all six datasets generated as part of our study (four datasets via ADC: BERT fooled , BERT random , ELECTRA fooled , ELECTRA random , and the two datasets via SDC). We also fine-tune these models after augmenting the collected datasets with the original data. We report the means and standard deviations (in subscript) of EM and F1 scores over 10 runs of each experiment. Models fine-tuned on the ADC datasets typically perform better on their held-out test sets than those trained on SDC data, and vice-versa.

Out-of-domain generalization to adversarial data We evaluate these models on adversarial test sets constructed with BiDAF (D BiDAF ), BERT (D BERT ), and RoBERTa (D RoBERTa ) in the loop (Bartolo et al., 2020). Prior work suggests that training on ADC data leads to models that perform better on similarly constructed adversarial evaluation sets. Both BERT and RoBERTa models fine-tuned on adversarial data generally outperform models fine-tuned on SDC data (whether or not either dataset is augmented with the original data) on all three evaluation sets (Table 3 and Appendix Table 6). A RoBERTa model fine-tuned on BERT fooled outperforms a RoBERTa model fine-tuned on SDC data by 9.1, 9.3, and 6.2 EM points on D RoBERTa , D BERT , and D BiDAF , respectively. We observe similar trends for ELECTRA models fine-tuned on ADC versus SDC data, but these gains disappear when the same models are fine-tuned on augmented data. For instance, while ELECTRA fine-tuned on BERT random obtains an EM score of 14.8 on D RoBERTa , outperforming ELECTRA fine-tuned on SDC data by ≈3 pts, the difference is no longer significant once the respective datasets are augmented with the original data. ELECTRA models fine-tuned on ADC data with ELECTRA in the loop perform no better than those trained on SDC data.
Fine-tuning ELECTRA on SDC data augmented with the original data leads to a ≈1 pt improvement on both metrics compared to augmenting with ADC data. Overall, we find that models fine-tuned on ADC data typically generalize better to out-of-domain adversarial test sets than models fine-tuned on SDC data, confirming the findings of Dinan et al. (2019).
Out-of-domain generalization to MRQA We further evaluate these models on 12 out-of-domain datasets used in the 2019 MRQA shared task (Fisch et al., 2019) (Table 4 and Appendix Table 7). Notably, for BERT, fine-tuning on SDC data leads to significantly better performance (as compared to fine-tuning on ADC data collected with BERT) on 9 out of 12 MRQA datasets, with gains of more than 10 EM pts on 6 of them. Interestingly, RoBERTa appears to perform better than BERT and ELECTRA; prior works have hypothesized that the larger size and increased diversity of RoBERTa's pretraining corpus (compared to those of BERT and ELECTRA) might be responsible for its better out-of-domain generalization (Baevski et al., 2019; Hendrycks et al., 2020; Tu et al., 2020). On BioASQ, BERT fine-tuned on BERT fooled obtains EM and F1 scores of 23.5 and 30.3, respectively; by comparison, fine-tuning on SDC data yields markedly higher EM and F1 scores of 35.1 and 55.7. Similar trends hold across models and datasets. Interestingly, ADC fine-tuning often improves performance on DROP compared to SDC fine-tuning: for instance, RoBERTa fine-tuned on ELECTRA random outperforms RoBERTa fine-tuned on SDC data by ≈7 pts. Note that DROP itself was adversarially constructed. On Natural Questions, models fine-tuned on ADC data generally perform comparably to those fine-tuned on SDC data. RoBERTa fine-tuned on BERT random obtains EM and F1 scores of 48.1 and 62.6, respectively, whereas RoBERTa fine-tuned on SDC data obtains 47.9 and 61.7. It is worth noting that the passages used to construct both the ADC and SDC datasets come from the Natural Questions dataset, which could be one reason why models fine-tuned on ADC data perform similarly to those fine-tuned on SDC data when evaluated on Natural Questions.
On the adversarial process versus adversarial success We notice that models fine-tuned on BERT random and ELECTRA random typically outperform models fine-tuned on BERT fooled and ELECTRA fooled , respectively, both on the adversarial test data collected in prior work (Bartolo et al., 2020) and on MRQA. A similar observation holds when the ADC data is augmented with the original training data. These trends suggest that the ADC process itself (regardless of the outcome) explains our results more than successfully fooling a model does. Furthermore, models fine-tuned only on SDC data tend to outperform models fine-tuned only on ADC data; after augmentation, however, ADC fine-tuning achieves comparable performance on more datasets than before. Note that augmenting the original data with ADC data may not always help: BERT fine-tuned on the original 23.1k examples achieves an EM of 11.3 on SearchQA; when fine-tuned on the original data augmented with BERT fooled , this drops to 8.7, and with BERT random , to 11.2. Fine-tuning on the original data augmented with SDC data, however, yields an EM of 13.6.

Qualitative Analysis
Finally, we perform a qualitative analysis of the collected data, revealing profound differences with models in (versus out of) the loop. Recall that because these datasets were constructed in a randomized study, any observed differences are attributable to the model-in-the-loop collection scheme.
To begin, we analyze 100 questions from each dataset and categorize them using the taxonomy introduced by Hovy et al. (2000). We also look at the first word of the wh-type questions in each dev set (Fig. 3) and observe key qualitative differences between data collected via ADC and SDC for both models.
In the case of ADC with BERT (and the associated SDC data), while most questions in the dev sets start with what, ADC has a higher proportion than SDC (587 in BERT fooled and 492 in BERT random versus 416 in SDC). Furthermore, compared to the BERT fooled dev set, SDC has more when- (148) and who-type (220) questions, whose answers typically refer to dates and to people (or organizations), respectively. This is also reflected in the taxonomy categorization. Interestingly, the BERT random dev set has more when- and who-type questions than BERT fooled (103 and 182 versus 50 and 159, respectively). This indicates that the BERT model may have been better at answering questions about dates and people (or organizations), which could have incentivized workers, upon observing these patterns, not to generate such questions. Similarly, in the 100-question samples, we find that a larger proportion of ADC questions are categorized as requiring numerical reasoning (11 and 18 in BERT fooled and BERT random , respectively) compared to SDC (7). It is possible that the model's weaker performance on numerical reasoning (as also demonstrated by its lower performance on DROP compared to fine-tuning on ADC or SDC data) incentivized workers to generate more questions requiring numerical reasoning, skewing the distribution toward such questions.
Similarly, with ELECTRA, we observe that what-type questions constitute most of the questions in the development sets for both ADC and SDC, although data collected via ADC has a higher proportion of these (641 in ELECTRA fooled and 619 in ELECTRA random versus 542 in SDC). We also notice more how-type questions in ADC (126 in ELECTRA random versus 101 in SDC), and that the SDC sample has more questions relating to dates (223) than the ADC samples (157 and 86 in ELECTRA random and ELECTRA fooled , respectively). As with BERT, the ELECTRA model was likely better at identifying answers about dates or years, which could have incentivized workers to generate fewer questions of such types. However, unlike with BERT, the ELECTRA ADC and SDC 100-question samples contain similar numbers of questions involving numerical answers (8, 9, and 10 in ELECTRA fooled , ELECTRA random , and SDC, respectively).
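The first-word tallies above can be reproduced with a simple counter over each dev set (a minimal sketch; the function name is illustrative):

```python
from collections import Counter

def first_word_counts(questions):
    # Tally the leading word of each question (lowercased) to compare
    # wh-word distributions across datasets.
    return Counter(q.split()[0].lower() for q in questions if q.strip())
```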
Lastly, despite explicit instructions not to generate questions about passage structure (Fig. 1), a small number of workers nevertheless created such questions. For instance, one worker wrote, "What is the number in the passage that is one digit less than the largest number in the passage?" While most such questions were discarded during validation, some are present in the final data. Overall, we notice considerable differences between ADC and SDC data, particularly vis-à-vis the kinds of questions workers generate. Our qualitative analysis suggests that ADC skews the distribution of questions workers create, as the incentives align with quickly creating more questions that can fool the model; this is reflected in all our ADC datasets. One remedy could be to provide workers with initial questions, asking them to edit those questions to elicit incorrect model predictions. Similar strategies were employed by Ettinger et al. (2017), where breakers minimally edited original data to elicit incorrect predictions from the models built by builders, as well as in recently introduced adversarial benchmarks for sentiment analysis (Potts et al., 2020).

Conclusion
In this paper, we demonstrated that across a variety of models and datasets, training on adversarial data leads to better performance on evaluation sets created in a similar fashion, but tends to yield worse performance on out-of-domain evaluation sets not created adversarially. Additionally, our results suggest that the ADC process (regardless of the outcome) might matter more than successfully fooling a model. We also identify key qualitative differences between data generated via ADC and SDC, particularly the kinds of questions created.
Overall, our work investigates ADC in a controlled setting, offering insights that can guide future research in this direction. These findings are particularly important given that ADC is more time-consuming and expensive than SDC, with workers requiring additional financial incentives. We believe one remedy could be to ask workers to edit questions rather than generate them from scratch. In the future, we would like to extend this study to investigate the efficacy of various constraints on question creation, and the role of other factors such as domain complexity, passage length, and incentive structure.

Ethical Considerations
The passages in our datasets are sourced from the datasets released by Karpukhin et al. (2020) under a Creative Commons License. As described in the main text, we designed our incentive structure to ensure that crowdworkers were paid $15/hour, which is twice the US federal minimum wage. Our datasets focus on the English language and were collected not to design NLP applications but to conduct a human study. We share our dataset to allow the community to replicate our findings and do not foresee any risks associated with the use of this data.