What Ingredients Make for an Effective Crowdsourcing Protocol for Difficult NLU Data Collection Tasks?

Crowdsourcing is widely used to create data for common natural language understanding tasks. Despite the importance of these datasets for measuring and refining model understanding of language, there has been little focus on the crowdsourcing methods used for collecting the datasets. In this paper, we compare the efficacy of interventions that have been proposed in prior work as ways of improving data quality. We use multiple-choice question answering as a testbed and run a randomized trial by assigning crowdworkers to write questions under one of four different data collection protocols. We find that asking workers to write explanations for their examples is an ineffective stand-alone strategy for boosting NLU example difficulty. However, we find that training crowdworkers, and then using an iterative process of collecting data, sending feedback, and qualifying workers based on expert judgments is an effective means of collecting challenging data. Using crowdsourced judgments, instead of expert judgments, to qualify workers and send feedback does not prove effective. We observe that the data from the iterative protocol with expert assessments is more challenging by several measures. Notably, the human–model gap on the unanimous agreement portion of this data is, on average, twice as large as the gap for the baseline protocol data.


Introduction
Crowdsourcing is a scalable method for constructing examples for many natural language processing tasks. Platforms like Amazon's Mechanical Turk give researchers access to a large, diverse pool of people to employ (Howe, 2006; Snow et al., 2008; Callison-Burch, 2009). Given the ease of data collection with crowdsourcing, it has been frequently used for collecting datasets for natural language understanding (NLU) tasks like question answering (Mihaylov et al., 2018), reading comprehension (Rajpurkar et al., 2016; Huang et al., 2019), natural language inference (Dagan et al., 2005; Bowman et al., 2015; Williams et al., 2018; Nie et al., 2020a), and commonsense reasoning (Talmor et al., 2019). There has been substantial research devoted to studying crowdsourcing methods, especially in the human-computer interaction literature (Kittur et al., 2008, 2011; Bernstein et al., 2012). However, most prior research investigates methods for collecting accurate annotations for existing data, for example labeling objects in images or labeling the sentiment of sentences (Hsueh et al., 2009; Liu et al., 2019a; Sun et al., 2020). There are some small-scale studies that use writing tasks, like writing product reviews, to compare crowdsourcing methodologies (Dow et al., 2012). However, we are unaware of any prior work that directly evaluates the effects of crowdsourcing protocol design choices on the quality of the resulting data for NLU tasks.

* Equal contribution. † Work done while at New York University.
Decisions around methodology and task design used to collect datasets dictate the quality of the data collected. As models become stronger and are able to solve existing NLU datasets, we have an increasing need for difficult, high-quality datasets that are still reliably solvable by humans. As a result, our thresholds for what makes a dataset acceptable become stricter: The data needs to be challenging, have high human-agreement, and avoid serious annotation artifacts (Gururangan et al., 2018).
To make collecting such large-scale datasets feasible, making well-informed crowdsourcing design decisions becomes crucial.
Existing NLP datasets have been crowdsourced with varying methods. The prevailing standard is to experiment with task design during pilots that are run before the main data collection (Vaughan, 2018). This piloting process is essential to designing good crowdsourcing tasks with clear instructions, but the findings from these pilots are rarely discussed in published corpus papers, and the pilots are usually not large enough or systematic enough to yield definitive conclusions. In this paper, we use a randomized trial to directly compare crowdsourcing methodologies to establish general best practices for NLU data collection.
We compare the efficacy of three types of crowdsourcing interventions that have been used in previous work. We use multiple-choice question answering in English as a testbed for our study and collect four small datasets in parallel, including a baseline dataset with no interventions. We choose QA as our testbed over the similarly popular task of natural language inference (NLI) because our focus on very high human-agreement examples calls for minimizing label ambiguity. In multiple-choice QA, the correct label is the answer choice that is most likely to be correct, even if there is some ambiguity in whether that choice is genuinely true. In NLI, however, if more than one label is plausible, then resolving the disagreement by ranking labels may not be possible (Pavlick and Kwiatkowski, 2019). In the trial, crowdworkers are randomly assigned to one of four protocols: BASELINE, JUSTIFICATION, CROWD, or EXPERT. 1 In BASELINE, crowdworkers are simply asked to write question-answering examples. In JUSTIFICATION, they are additionally tasked with writing explanations for their examples, prompting self-assessment. For the EXPERT and CROWD protocols, we train workers using an iterative process of collecting data, sending feedback, and qualifying high-performing workers to subsequent rounds. We use expert-curated evaluations in EXPERT, and crowdsourced evaluations in CROWD, for generating feedback and assigning qualifications. We use a standard of high pay and strict qualifications for all protocols. We also validate the data to discard ambiguous and unanswerable examples. The experimental pipeline is sketched in Figure 1.
To quantify the dataset difficulty, we collect additional label annotations to establish human performance on each dataset and compare these to model performance. We also evaluate the difficulty of the datasets for typical machine learning models using IRT (Baker and Kim, 1993; Lalor et al., 2016).
We find that the EXPERT protocol dataset is the most challenging. The human-model gap with RoBERTa LARGE (Liu et al., 2019b) on the unanimous agreement portion of EXPERT is 13.9 percentage points, compared to 7.0 on the BASELINE protocol. The gap with UnifiedQA (Khashabi et al., 2020) is 6.7 on EXPERT, compared to 2.9 on BASELINE. However, the data from the CROWD protocol is far less challenging than that from EXPERT, suggesting that expert evaluations are more reliable than crowdsourced evaluations for sending feedback and assigning qualifications.
We also find that the JUSTIFICATION intervention is ineffective as a stand-alone method for increasing NLU data quality. A substantial proportion of the explanations submitted are duplicates, reused for multiple examples, or give trivial reasoning that is not specific to the example.
Lastly, to evaluate the datasets for serious annotation artifacts, we test the guessability of answers by omitting the questions from the model input. This partial-input baseline achieves its lowest accuracy on EXPERT, suggesting that the interventions that successfully boost example difficulty may also reduce annotation artifacts.

Related Work
Creating NLU Corpora Existing NLU datasets have been collected using a multitude of methods, ranging from expert-designed, to crowdsourced, to automatically scraped. The widely used Winograd schema dataset by Levesque et al. (2012) is constructed manually by specialists and contains 273 examples. Larger NLU datasets, more appropriate for training neural networks, are often crowdsourced, though the crowdsourcing methods used vary widely. Popular datasets, such as SQuAD (Rajpurkar et al., 2016) for question answering and SNLI (Bowman et al., 2015) for natural language inference, are collected by providing crowdworkers with a context passage and instructing workers to write an example given the context. Rogers et al. (2020) crowdsource QuAIL, a QA dataset, using a more constrained data collection protocol in which they require workers to write nine specific types of questions for each passage. QuAC (Choi et al., 2018) is crowdsourced by pairing crowdworkers, providing one worker with a Wikipedia article, and instructing the second worker to ask questions about the hidden article.
Recently, there has been a flurry of corpora collected using adversarial models in the crowdsourcing pipeline. Dua et al. (2019), Nie et al. (2020a), and Bartolo et al. (2020) use models in the loop during data collection, where crowdworkers can only submit examples that cannot be solved by the models. However, such datasets can be biased towards quirks of the model used during data collection (Zellers et al., 2019;Gardner et al., 2020).
Crowdsourcing Methods While crowdsourcing makes it easy to collect large datasets quickly, there are some clear pitfalls: Crowdworkers are generally less knowledgeable than field experts about the requirements the data needs to meet, crowdwork can be monotonous, resulting in repetitive and noisy data, and crowdsourcing platforms can create a "market for lemons" where fast work is incentivized over careful, creative work because requesters cannot easily distinguish the two (Akerlof, 1978; Chandler et al., 2013). Daniel et al. (2018) give a broad overview of the variables at play when trying to crowdsource high-quality data, discussing many strategies available to requesters. Motivated by the use of self-assessment in teaching (Boud, 1995), Dow et al. (2012) study the effectiveness of self-assessment and external assessment when collecting data for product reviews. They find that both strategies are effective for improving the quality of submitted work. However, Gadiraju et al. (2017) find that crowdworker self-assessment can be unreliable since poor-performing workers overestimate their ability. Drapeau et al. (2016) test a justify-reconsider strategy: Crowdworkers justify their annotations in a relation extraction task, are shown a justification written by a different crowdworker or an expert, and are asked to reconsider their annotation. They find that this method significantly boosts the accuracy of annotations.
Another commonly used strategy when crowdsourcing NLP datasets is to only qualify workers who pass an initial quiz or perform well in preliminary crowdsourcing batches (Wang et al., 2013; Cotterell and Callison-Burch, 2014; Ning et al., 2020; Shapira et al., 2020; Roit et al., 2020). In addition to using careful qualifications, Roit et al. (2020) send workers feedback detailing errors they made in their QA-SRL annotation. Writing such feedback is labor-intensive and can become untenable as the number of workers grows. Dow et al. (2011) design a framework for promoting crowdworkers into "shepherding roles" to crowdsource such feedback. We compare expert and crowdsourced feedback in our EXPERT and CROWD protocols.

Data Collection Protocols
We run our study on Amazon Mechanical Turk. 2 At launch, crowdworkers are randomly assigned to one of four data collection protocols, illustrated in Figure 1. 3 To be included in the initial pool, workers need to have an approval rating of 98% or higher, have at least 1,000 approved tasks, and be located in the US, the UK, or Canada.

Writing Examples
This task is used for collecting question-answer pairs in the crowdsourcing pipeline for all four protocols. Crowdworkers assigned to the BASELINE protocol are presented with only this task.
In this writing task, we provide a context passage drawn from the Open American National Corpus (Ide and Suderman, 2006). 4 Inspired by Hu et al. (2020), we ask workers to write two questions per passage with four answer choices each. We direct workers to ensure that the questions are answerable given the passage and that there is only one correct answer for each question. We instruct them to limit word overlap between their answer choices and the passage and to write distracting answer choices that will seem plausibly correct to someone who hasn't carefully read the passage. To clarify these criteria, we provide examples of good and bad questions.

Self-Assessment
Workers assigned to the JUSTIFICATION protocol are given the writing task described above (Section 3.1) and are also tasked with writing a 1-3 sentence explanation for each question. They are asked to explain the reasoning needed to select the correct answer choice, mentioning what they think makes the question they wrote challenging.

Iterative Feedback and Qualification
Tutorial Workers assigned to the CROWD and EXPERT protocols are directed to a tutorial upon assignment. The tutorial consists of two quizzes and a set of writing tasks. The quizzes have four steps. In each step, workers are shown a passage and two question candidates and are asked to select which candidate (i) is less ambiguous, (ii) is more difficult, (iii) is more creative, or (iv) has better distracting answer choices. These concepts are informally described in the writing task instructions, but the tutorial makes the rubric explicit, giving crowdworkers a clearer understanding of our desiderata. We give workers immediate feedback on their performance during the first quiz but not the second, so that we can use the second for evaluation. Lastly, for the tutorial writing tasks, we provide two passages and ask workers to write two questions (with answer choices) for each passage. These questions are graded by three experts 5 using a rubric with the same metrics described in the quiz (shown in Fig.). We assign the qualification for the writing tasks to the top 60% of crowdworkers who complete the tutorial. We only qualify workers who wrote answerable, unambiguous questions, and we qualify enough workers to ensure a large pool of people in our final writing round.
Intermediate Writing Rounds After passing the tutorial, workers go through three small rounds of writing tasks. At the end of each round, we send them feedback and qualify a smaller pool of workers for the next round. We only collect 400-500 examples in these intermediate rounds. At the end of each round, we evaluate the submitted work using the same rubric defined in the tutorial. In the EXPERT protocol, three experts grade worker submissions, evaluating at least four questions per worker. The evaluation annotations are averaged and workers are qualified for the next round based on their performance. The qualifying workers are sent a message with feedback on their performance and a bonus for qualifying. Appendix A gives details on the feedback sent.
Evaluating the examples in each round is labor-intensive and challenging to scale (avg. 30 expert-minutes per worker). In the CROWD protocol we experiment with crowdsourcing these evaluations. After the first intermediate writing round in CROWD, experts evaluate the submitted work. The evaluations are used to qualify workers for the second writing round and to promote the top 20% of workers into a feedback role. After intermediate writing rounds 2 and 3, the promoted workers are tasked with evaluating all the examples (no one evaluates their own work). We collect five evaluations per example and use the averaged scores to send feedback and qualify workers for the subsequent round.
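The round-level requalification logic described above can be sketched as follows. This is a hypothetical illustration, not the authors' actual tooling; the function name and data layout are assumptions. Rubric scores are averaged per worker, and the top fraction requalifies for the next round:

```python
def qualify_workers(scores_by_worker, keep_fraction=0.8):
    """scores_by_worker: {worker_id: [per-example rubric scores (floats)]}.
    Averages each worker's rubric scores and returns the set of worker
    ids in the top `keep_fraction`, who qualify for the next round."""
    avg = {w: sum(s) / len(s) for w, s in scores_by_worker.items()}
    # Rank workers by average score, best first.
    ranked = sorted(avg, key=avg.get, reverse=True)
    n_keep = max(1, round(len(ranked) * keep_fraction))
    return set(ranked[:n_keep])
```

In the EXPERT protocol the scores would come from three expert graders; in CROWD, from five promoted crowdworkers per example.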
For both CROWD and EXPERT protocols, the top 80% of workers are requalified at the end of each round. Of the 150 workers who complete the tutorial, 20% qualify for the final writing round. Our qualification rate is partly dictated by a desire to have a large enough pool of people in the final writing task to ensure that no dataset is skewed by only a few people (Geva et al., 2019).
Cost We aim to ensure that our pay rate is at least US $15/hr for all tasks. The total cost per question, excluding platform fees, is $1.75 for the BASELINE protocol and $2 for JUSTIFICATION. If we discard all the data collected in the intermediate writing rounds, the cost is $3.76 per question for EXPERT, 6 and $5 for CROWD.
The average pay during training for workers who qualify for the final writing task in EXPERT is about $120/worker (with an estimated 6-7 hours spent in training). In CROWD, there is an additional cost of $85/worker for collecting crowdsourced evaluations. The cost per example after training is $1.75 per question for both protocols, and the total training cost does not scale linearly with dataset size, as one may not need twice as many writers to double the dataset size. More details on our payment and incentive structure can be found in Appendix B.
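As a rough illustration of why per-question cost falls with scale, the figures above (~$120/worker for training and $1.75/question afterwards) can be amortized over a growing dataset; the worker count used below is hypothetical:

```python
def cost_per_question(n_questions, n_workers,
                      training_per_worker=120.0,
                      marginal_per_question=1.75):
    """Amortized cost per question: a fixed training outlay plus a
    constant marginal cost per collected question."""
    total = n_workers * training_per_worker + n_questions * marginal_per_question
    return total / n_questions
```

With 30 trained workers, doubling the dataset from 1,500 to 3,000 questions lowers the amortized cost per question, since the training outlay is fixed.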

Data Validation
We collect label annotations by asking crowdworkers to pick the correct answer choice for a question, given the context passage. In addition to the answer choices written by the writer, we add an Invalid question / No answer option. We validate the data from each protocol. For CROWD and EXPERT, we only validate the data from the final large writing rounds. Data from all four protocols is shuffled and we run a single validation task, collecting either two or ten annotations per example.
We use the same minimum qualifications as the writing task (Section 3) and require that workers first pass a qualification task. The qualification task consists of 5 multiple-choice QA examples that have been annotated by experts. 7 People who answer at least 3 out of 5 questions correctly receive the qualification to work on the validation tasks. Of the 200 crowdworkers who complete the qualification task, 60% qualify for the main validation task. Following Ho et al. (2015), to incentivize higher-quality annotations, we include expert-labeled examples in the validation task, constituting 10% of all examples. If a worker's annotation accuracy on these labeled examples falls below 50%, we remove their qualification (7 workers are disqualified through this process); conversely, workers who label these examples correctly receive a bonus.

6 The discarded data collected during training was annotated by experts, and if we account for the cost of expert time used, the cost for EXPERT increases to $4.23/question. This estimate is based on the approximate hourly cost of paying a US PhD student, including benefits and tuition.

10-Way Validation

Pavlick and Kwiatkowski (2019) show that annotation disagreement may not be noise, but could be a signal of true ambiguity. Nie et al. (2020b) recommend using high-human-agreement data for model evaluation to avoid such ambiguity. To have enough annotations to filter the data for high human agreement and to estimate human performance, we collect ten annotations for 500 randomly sampled examples per protocol.

Cost We pay $2.50 for the qualification task and $0.75 per pair of questions for the main validation task. For every 3 out of 4 expert-labeled examples a worker annotates correctly, we send a $0.50 bonus.
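The catch-trial screening used during validation can be sketched as below. The function name and data layout are illustrative assumptions, not the authors' code; the 50% disqualification threshold is from the text:

```python
def review_validator(responses, gold, threshold=0.5):
    """responses/gold: {item_id: label} over expert-labeled catch trials.
    Returns (accuracy on catch trials, whether the worker keeps the
    validation qualification)."""
    scored = [responses[i] == gold[i] for i in gold if i in responses]
    if not scored:
        return None, True  # no catch trials seen yet
    acc = sum(scored) / len(scored)
    return acc, acc >= threshold
```

A bonus rule (e.g. the $0.50 per 3-of-4 correct described above) could be layered on the same accuracy counts.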

Datasets and Analysis
We collect around 1,500 question-answer pairs from each protocol design: 1,558 for BASELINE, 1,534 for JUSTIFICATION, 1,600 for CROWD, and 1,580 for EXPERT. We use the validation annotations to determine the gold labels and to filter out examples: If there is no majority agreement on the answer choice, or if the majority selects invalid question, the example is discarded (∼5% of examples). For the 2-way annotated data, we take a majority vote over the two annotations plus the original writer's label. For the 10-way annotated data, we sample four annotations and take a majority vote over those four plus the writer's vote, reserving the remainder to compute an independent estimate of human performance.

Table 1: RoBERTa shows average zero-shot performance for six RoBERTa LARGE models fine-tuned on RACE (standard deviation in parentheses). UniQA shows zero-shot performance of the T5-based UnifiedQA-v2 model. ∆ shows the difference between human and model performance.
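The gold-label aggregation described above can be sketched as follows. This is a hypothetical implementation; the paper's sampling of four of the ten annotations is left to the caller, and the sentinel label name is an assumption:

```python
from collections import Counter

INVALID = "INVALID"  # stands in for the "Invalid question / No answer" option

def gold_label(writer_label, annotations):
    """Majority vote over validator annotations plus the writer's label.
    Returns the gold label, or None if the example is discarded
    (no majority, or the majority votes the question invalid)."""
    votes = Counter(annotations)
    votes[writer_label] += 1
    (top, top_n), *rest = votes.most_common()
    if top == INVALID:
        return None  # majority deems the question unanswerable
    if rest and rest[0][1] == top_n:
        return None  # tie: no majority agreement
    return top
```

For the 2-way data this is a vote over three labels; for the 10-way data, over five (four sampled annotations plus the writer's).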

Human Performance and Agreement
For the 10-way annotated subsets of the data, we take a majority vote over the six annotations that are not used when determining the gold answer, and compare the result to the gold answer to estimate human performance. Table 1 shows the result for each dataset. The EXPERT and CROWD datasets have lower human performance numbers than BASELINE and JUSTIFICATION. This is also mirrored in the inter-annotator agreement for validation, where Krippendorff's α (Krippendorff, 1980) is 0.67 and 0.71 for EXPERT and CROWD, compared to 0.81 and 0.77 for BASELINE and JUSTIFICATION (Table 3 in Appendix C). The lower agreement may reflect the fact that while these examples are still clearly human-solvable, they are more challenging than those in BASELINE and JUSTIFICATION. As a result, annotators are prone to higher error rates, motivating us to look at the higher-agreement portions of the data to determine true dataset difficulty. And while the agreement rate is lower for EXPERT and CROWD, more than 80% of the data still has high human agreement on the gold label, where at least 4 out of 5 annotators agree on the label. The remaining low-agreement examples may have more ambiguous questions, and we follow Nie et al.'s (2020b) recommendation and focus our analysis on the high-agreement portions of the dataset.
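Krippendorff's α for nominal labels can be computed from a coincidence matrix over the per-example annotations. The rough sketch below follows the standard nominal-data formulation; in practice a vetted library (e.g. the `krippendorff` package on PyPI) is preferable:

```python
from collections import defaultdict

def krippendorff_alpha_nominal(units):
    """units: one list of annotations per example (nominal labels).
    Computes alpha = 1 - D_o / D_e via the coincidence matrix."""
    o = defaultdict(float)  # coincidence counts o[(c, k)]
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # units need at least two annotations
        for i, c in enumerate(labels):
            for j, k in enumerate(labels):
                if i != j:
                    o[c, k] += 1.0 / (m - 1)
    n_c = defaultdict(float)  # marginal totals per category
    for (c, _), v in o.items():
        n_c[c] += v
    n = sum(n_c.values())
    disagree = sum(v for (c, k), v in o.items() if c != k)
    expected = sum(n_c[c] * n_c[k] for c in n_c for k in n_c if c != k)
    if expected == 0:
        return 1.0  # only one category observed: define as perfect
    return 1.0 - (n - 1) * disagree / expected
```

Perfect agreement yields α = 1; chance-level agreement yields α ≈ 0.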

Zero-Shot Model Performance
We test two pretrained models that perform well on other comparable QA datasets: RoBERTa LARGE (Liu et al., 2019b) and UnifiedQA-v2 (Khashabi et al., 2020). We fine-tune RoBERTa LARGE on RACE (Lai et al., 2017), a large-scale multiple-choice QA dataset that is commonly used for training (Sun et al., 2019). We fine-tune six RoBERTa LARGE models and report the average performance across runs. The UnifiedQA-v2 model is a single T5-based model that has been trained on 15 QA datasets. 8 We also fine-tune RoBERTa LARGE on CosmosQA and QuAIL, finding that zero-shot model performance is best with RACE fine-tuning, but that the trends in model accuracy across our four datasets are consistent (Appendix D).

Comparing Protocols
As shown in Table 1, model accuracy on the full datasets is lowest for EXPERT, followed by CROWD, JUSTIFICATION, and then BASELINE. However, model accuracy alone does not tell us how much headroom is left in the datasets. Instead, we look at the difference between the estimated human performance and model performance.

Human-Model Gap
The trends in the human-model gap on the 10-way annotated sample are inconsistent across models. For a more conclusive analysis, we focus on the higher-agreement portions of the data, where label ambiguity is minimal. On the high-agreement section of the datasets, both models' performance is weakest on EXPERT. RoBERTa LARGE shows the second-largest human-model gap on CROWD; for UnifiedQA, however, JUSTIFICATION is the next hardest dataset. This discrepancy between the two types of iterative feedback protocols is even more apparent in the unanimous agreement portion of the data. On the unanimous agreement examples, both models show the lowest performance on EXPERT, but UnifiedQA achieves near-perfect performance on CROWD. This suggests that while the CROWD protocol used nearly the same crowdsourcing pipeline as EXPERT, evaluations done by experts are a much more reliable signal for selecting workers to qualify and for generating feedback, at the cost of greater difficulty in scaling to larger worker pools. This is confirmed by inter-annotator agreement: Expert agreement on the rubric-based evaluations has a Krippendorff's α of 0.65, while agreement between crowdworker evaluations is 0.33.

Self-Justification
Model performance on the unanimous agreement examples of JUSTIFICATION is comparable to, or better than, performance on BASELINE. To estimate the quality of justifications, we manually annotate a random sample of 100 justifications. About 48% (95% CI: [38%, 58%]) are duplicates or near-duplicates of other justifications, and of this group, nearly all are trivial (e.g., Good and deep knowledge is needed to answer this question) and over half are in non-fluent English (e.g., To read the complete passage to understand the question to answer.). On the other hand, non-duplicate justifications are generally of much higher quality, mentioning distractors, giving specific reasoning, and rewording phrases from the passage (e.g., Only #1 is discussed in that last paragraph. The rest of the parts are from the book, not the essay. Also the answer is paraphrased from "zero-sum" to "one's gain is another's loss"). While we find that JUSTIFICATION does not work as a stand-alone strategy, we cannot conclude that self-justification would be equally ineffective if combined with more aggressive screening to exclude crowdworkers who author trivial or duplicate justifications. Gadiraju et al. (2017) also recommend using the accuracy of a worker's self-assessments to screen workers.
Cross-Protocol Transfer Since the datasets from some protocols are clearly more challenging than others, a natural question arises: are these datasets also better for training models? To test cross-protocol transfer, we fine-tune RoBERTa LARGE on one dataset and evaluate on the other three. We find that model accuracy is not substantively better from fine-tuning on any one dataset (Table 5, Appendix E). The benefit of EXPERT as a more challenging evaluation dataset does not clearly translate to training. However, these datasets may be too small to offer clear and distinguishable value in this setting.
Annotation Artifacts To test for undesirable artifacts, we evaluate partial-input baselines (Kaushik and Lipton, 2018; Poliak et al., 2018). We take a RoBERTa LARGE model, pretrained on RACE, and fine-tune it using five-fold cross-validation, providing only part of the example input. We evaluate three baselines: providing the model with the passage and answer choices only, the question and answer choices only, and the answer choices alone. Results are shown in Table 2. The passage+answer baseline has significantly lower performance on the EXPERT dataset in comparison to the others. This indicates that the iterative feedback and qualification method using expert assessments not only increases overall example difficulty but may also lower the prevalence of simple artifacts that can reveal the answer. Performance of the question+answer and answer-only baselines is comparably low on all four datasets.
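Partial-input baselines are constructed by ablating fields of each example before it is encoded for the model. The sketch below is schematic: the field names and mode strings are illustrative, not the paper's preprocessing code.

```python
def make_inputs(example, mode="full"):
    """Returns one (context, candidate) string pair per answer choice.
    mode: 'full', 'passage+answer', 'question+answer', or 'answer-only'.
    Ablated fields are replaced with the empty string."""
    passage = example["passage"] if "passage" in mode or mode == "full" else ""
    question = example["question"] if "question" in mode or mode == "full" else ""
    return [(f"{passage} {question}".strip(), choice)
            for choice in example["choices"]]
```

Each pair would then be scored by a multiple-choice model; a high-accuracy ablated baseline signals that the omitted field was not needed, i.e. an annotation artifact.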
Question and Answer Length We observe that the difficulty of the datasets is correlated with average answer length (Figure 3). The hardest dataset, EXPERT, also has the longest answer options, with an average of 9.1 words, compared to 3.7 for BASELINE, 4.1 for JUSTIFICATION, and 6.9 for CROWD. This reflects the tendency of the 1- and 2-word answers common in the BASELINE and JUSTIFICATION datasets to be extracted directly from the passage, while the sentence-length answers more common in EXPERT and CROWD tend to be more abstractive. Figure 3 also shows that incorrect answer options tend to be shorter than correct ones. This pattern holds across all datasets, suggesting a weak surface cue that models could exploit. Using an answer-length heuristic alone, accuracy is similar to the answer-only model baseline: 34.2% for BASELINE, 31.7% for JUSTIFICATION, 31.5% for CROWD, and 34.3% for EXPERT.
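The answer-length cue noted above can be turned into a trivial classifier: always guess the longest answer option. A minimal sketch (our own illustration, not the paper's exact heuristic):

```python
def longest_answer_heuristic(choices):
    """Returns the index of the answer option with the most words;
    ties go to the first such option."""
    return max(range(len(choices)), key=lambda i: len(choices[i].split()))

def heuristic_accuracy(examples):
    """examples: list of (choices, gold_index) pairs."""
    hits = sum(longest_answer_heuristic(c) == g for c, g in examples)
    return hits / len(examples)
```

On a four-choice task, random guessing gives 25%, so accuracies in the low 30s indicate a real but weak length cue.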
Wh-words We find that the questions in EXPERT and CROWD protocols have similar distributions of wh-words, with many why questions and few who or when questions compared to the BASELINE and JUSTIFICATION protocols, seemingly indicating that this additional feedback prompts workers to write more complex questions.

Non-Passage-Specific Questions
We also observe that many questions in the datasets are formulaic and include no passage-specific content, for instance Which of the following is true?, What is the main point of the passage?, and Which of the following is not mentioned in the passage?. We manually annotate 200 questions from each protocol for questions of this kind. We find that there is no clear association between the dataset's difficulty and the frequency of such questions: 15% of questions in EXPERT are generic, compared to 4% for CROWD, 10% for JUSTIFICATION, and 3% for BASELINE. We might expect that higher quality examples that require reading a passage closely would ask questions that are specific rather than generic. But our results suggest that difficulty may be due more to the subtlety of the answer options, and the presence of distracting options, rather than the complexity or originality of the questions.

Order of Questions
We elicit two questions per passage in all four protocols, with the hypothesis that the second question may be more difficult in aggregate. However, we find that there is only a slight drop in model accuracy from the first to the second question on the CROWD and EXPERT datasets (1.0 and 0.7 percentage points). Model accuracy on BASELINE remains stable, while it increases by 2.7 percentage points on JUSTIFICATION. A task design with minimal constraints, like ours, does not prompt workers to write an easier question followed by a more difficult one, or vice versa.

Item Response Theory
Individual examples within any dataset can have different levels of difficulty. To better understand the distribution of difficult examples in each protocol, we turn to Item Response Theory (IRT; Baker and Kim, 1993), which has been used to estimate individual example difficulty based on model responses (Lalor et al., 2019; Martínez-Plumed et al., 2019). Specifically, we use the three-parameter logistic (3PL) IRT model, where an example is characterized by discrimination, difficulty, and guessing parameters. Discrimination defines how effective an example is at distinguishing between weak and strong models, difficulty defines the minimum ability of a model needed to obtain high performance, and the guessing parameter defines the probability of a correct answer by random guessing. Following Vania et al. (2021), we use 90 Transformer-based models fine-tuned on RACE, with varying ability levels, and use their predictions on our four datasets as responses. For comparison, we also use model predictions on QuAIL and CosmosQA. Refer to Appendix F for more details.

Figure 4 shows the distribution of example difficulty for each protocol. Also plotted are the difficulty parameters for the intermediate rounds of data collected in the iterative feedback protocols. 9 We see that EXPERT examples have the highest median and 75th-percentile difficulty scores, while BASELINE scores the lowest. We also note that the greatest gain in difficulty for CROWD examples happens between rounds 1 and 2, the only feedback and qualification stage that is conducted by experts. This offers further evidence that expert assessments are more reliable, and that crowdsourcing such assessments poses a significant challenge.
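The 3PL item response function described above can be written directly (plain logistic form, without the optional 1.7 scaling constant some formulations include):

```python
import math

def p_correct_3pl(theta, a, b, g):
    """Three-parameter logistic IRT model: probability that a responder
    with ability `theta` answers an item with discrimination `a`,
    difficulty `b`, and guessing parameter `g` correctly."""
    return g + (1.0 - g) / (1.0 + math.exp(-a * (theta - b)))
```

The curve floors at `g` for very weak responders (random guessing), passes through the midpoint `(b, (1 + g) / 2)`, and approaches 1 for very strong responders; fitting `a`, `b`, and `g` per example from the 90 models' responses yields the difficulty distributions plotted in Figure 4.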
While the examples in EXPERT have higher difficulty scores than the other protocols, the scores are significantly lower than those for CosmosQA and QuAIL (all four datasets show similar discrimination scores to CosmosQA and QuAIL). The data collection methods used for both CosmosQA and QuAIL differ substantially from methods we tested. Rogers et al. (2020) constrain the task design for QuAIL and require workers to write questions of specific types, like those targeting temporal reasoning. Similarly, in CosmosQA workers are encouraged to write questions that require causal or deductive commonsense reasoning. In contrast, we avoid dictating question type in our instructions. The IRT results here suggest that using prior knowledge to slightly constrain the task design can be effective for boosting example difficulty. In addition to differing task design, CosmosQA and QuAIL also use qualitatively different sources for passages. Both datasets use blogs and personal stories, QuAIL also uses texts from published fiction and news. Exploring the effect of source text genre on crowdsourced data quality is left to future work.

Conclusion
We present a study to determine effective protocols for crowdsourcing difficult NLU data. We run a randomized trial to compare interventions in the crowdsourcing pipeline and task design. Our results suggest that asking workers to write justifications is not a helpful stand-alone strategy for improving NLU dataset difficulty, at least in the absence of explicit incentives for workers to write high-quality justifications. However, we find that training workers using an iterative feedback and requalification protocol is an effective strategy for collecting high-quality QA data. The benefit of this method is most evident in the high-agreement subset of the data, where label noise is low. We find that using expert assessments to conduct this iterative protocol is fruitful, in contrast with crowdsourced assessments, which have much lower inter-annotator agreement and provide a noisy signal that does not boost example difficulty.

Acknowledgments

This work has benefited from financial support to SB by Eric and Wendy Schmidt (made by recommendation of the Schmidt Futures program), Apple, and Intuit, and from in-kind support by the NYU High-Performance Computing Center and by NVIDIA Corporation (with the donation of a Titan V GPU). SS was supported by JST PRESTO Grant No. JPMJPR20C4. This material is based upon work supported by the National Science Foundation under Grant No. 1922658. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Ethics Statement
We are cognizant of the asymmetrical relationship between requesters and workers in crowdsourcing, and we take care to be responsive employers and to pay a wage commensurate with the high-quality work we are looking for. So, in addition to the ethical reasons for paying fair wages, our success in collecting high-quality NLU data offers weak evidence that others should follow this practice as well. However, the mere existence of more research on NLU crowdsourcing with positive results could arguably encourage more people to do crowdsourcing under a conventional model, with low pay and little worker recourse against employer malpractice. The only personal information we collect from workers is their Mechanical Turk worker IDs, which we keep secure and will not release. However, we do not engage with issues of bias during data collection, and we expect that the data collected under all our protocols will, at least indirectly, reinforce stereotypes.
We confirmed with New York University's IRB that crowdsourced NLP dataset construction work, including experimental work on data collection methods, is exempt from their oversight.

A Iterative Protocol Feedback
In the EXPERT and CROWD protocols, we conduct three small intermediate rounds of data collection to help train crowdworkers and give them feedback on their submissions. At the end of each small round of writing, the submitted examples are evaluated either by experts or by crowdworkers, as described in Section 3.3. The rubric given in Figure 2 is used during these evaluations. After compiling the evaluations, we qualify the top 80% of workers for the next round and send them a feedback message.
We tell workers what their difficulty and creativity scores are in comparison to the average. We also tell them what percentage of their question-answer pairs were labeled as having distracting answer choices and what percentage were labeled ambiguous, with examples of any such questions. Lastly, we list the examples they wrote that received the highest and lowest overall rubric scores.
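The per-round requalification step described above can be sketched as follows; the function name, tie handling, and rubric-score format are our illustrative assumptions, not the authors' actual pipeline:

```python
def qualify_top_workers(rubric_scores, frac=0.8):
    """Rank workers by mean rubric score and requalify the top `frac`.

    `rubric_scores` maps a worker id to the list of rubric scores their
    submitted examples received during evaluation.
    """
    means = {w: sum(s) / len(s) for w, s in rubric_scores.items()}
    ranked = sorted(means, key=means.get, reverse=True)
    cutoff = max(1, int(len(ranked) * frac))  # top 80% by default
    return set(ranked[:cutoff])
```

The remaining per-worker statistics (difficulty and creativity relative to the average, rates of flagged examples) would be computed from the same evaluation records before the feedback message is sent.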

B Payment and Incentive Structure
The compensation for writing two questions in the baseline writing task is $3.50, excluding platform fees; we estimate it takes 12-15 minutes to do a close reading of the passage and write two challenging questions. For the JUSTIFICATION protocol, the compensation is $4 per task to account for the additional time it takes to write a justification for each question. For the tutorial that workers in the CROWD and EXPERT protocols need to complete, we pay $3.50, and give a bonus of $1.50 if they qualify for the writing tasks. Similarly, at the end of each intermediate writing batch, a bonus is sent to the workers who qualify for the subsequent round: $5, $7, and $10 after the 1st, 2nd, and 3rd rounds respectively. Promoted workers who are tasked with the crowdsourced evaluations in the CROWD protocol are paid $0.50 per question. They are also sent a bonus of $5 for each round of evaluations they complete.

C Inter-Annotator Agreement

Table 3 shows the inter-annotator agreement during the data validation task for each dataset. The Krippendorff's α is lowest for EXPERT, which also has the lowest human performance baseline, likely due to the pressure to produce subtle questions.

Table 4: Zero-shot model accuracy on our datasets, when training on the datasets named in the columns.
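The Krippendorff's α statistic reported for the validation task can be computed for nominal labels as in the following sketch (a minimal reference implementation for illustration, not the code used in this work):

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` is a list of items, each a list of the labels assigned to
    that item by different annotators (items with fewer than two labels
    are skipped). Undefined if only one label value ever occurs.
    """
    o = Counter()  # coincidence matrix over ordered label pairs
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue
        for a, b in permutations(ratings, 2):
            o[(a, b)] += 1.0 / (m - 1)
    n_c = Counter()  # marginal totals per label value
    for (a, _), w in o.items():
        n_c[a] += w
    n = sum(n_c.values())
    d_o = sum(w for (a, b), w in o.items() if a != b)  # observed disagreement
    d_e = sum(n_c[a] * n_c[b] for a in n_c for b in n_c if a != b) / (n - 1)
    return 1.0 - d_o / d_e  # 1 = perfect agreement, 0 = chance level
```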

D Zero-Shot Model Performance: CosmosQA and QuAIL
In addition to fine-tuning RoBERTa LARGE on RACE, we also fine-tune it on CosmosQA and QuAIL to test zero-shot model performance. Table 4 shows the zero-shot results. We observe that model performance on our datasets is substantially worse when fine-tuning on CosmosQA or QuAIL. However, the pattern in model behavior is consistent regardless of the fine-tuning corpus: in all three conditions, model accuracy is highest on BASELINE, followed by JUSTIFICATION, then CROWD, and finally EXPERT.

E Cross-Protocol Transfer
As discussed in Section 5.3, we test cross-protocol transfer by fine-tuning RoBERTa LARGE on one dataset and evaluating on the other three. For a baseline comparison, we also fine-tune the model on each dataset using five-fold cross-validation. Results are shown in Table 5.

Table 5: Cross-protocol evaluation, where the row and column indicate target and source datasets, respectively. Cross-val shows the accuracy and std. dev. from five-fold cross-validation on each dataset.
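The cross-validation baseline can be sketched as follows; `train_and_eval` is a hypothetical placeholder for fine-tuning RoBERTa on a fold and measuring accuracy, and the fold-splitting logic is our illustrative assumption rather than the authors' code:

```python
import random
import statistics

def cross_val_accuracy(examples, train_and_eval, n_splits=5, seed=0):
    """Mean and population std. dev. of accuracy over K folds.

    `train_and_eval(train, test)` fine-tunes a model on `train` and
    returns its accuracy on `test`.
    """
    idx = list(range(len(examples)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_splits] for i in range(n_splits)]  # near-equal folds
    accs = []
    for k in range(n_splits):
        test = [examples[i] for i in folds[k]]
        train = [examples[i] for j, fold in enumerate(folds)
                 if j != k for i in fold]
        accs.append(train_and_eval(train, test))
    return statistics.mean(accs), statistics.pstdev(accs)
```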