WANLI: Worker and AI Collaboration for Natural Language Inference Dataset Creation

A recurring challenge of crowdsourcing NLP datasets at scale is that human writers often rely on repetitive patterns when crafting examples, leading to a lack of linguistic diversity. We introduce a novel approach for dataset creation based on worker and AI collaboration, which brings together the generative strength of language models and the evaluative strength of humans. Starting with an existing dataset, MultiNLI for natural language inference (NLI), our approach uses dataset cartography to automatically identify examples that demonstrate challenging reasoning patterns, and instructs GPT-3 to compose new examples with similar patterns. Machine-generated examples are then automatically filtered, and finally revised and labeled by human crowdworkers. The resulting dataset, WANLI, consists of 107,885 NLI examples and presents unique empirical strengths over existing NLI datasets. Remarkably, training a model on WANLI improves performance on eight out-of-domain test sets we consider, including by 11% on HANS and 9% on Adversarial NLI, compared to training on the 4x larger MultiNLI. Moreover, it continues to be more effective than MultiNLI augmented with other NLI datasets. Our results demonstrate the promise of leveraging natural language generation techniques and re-imagining the role of humans in the dataset creation process.


Introduction
As much as large-scale crowdsourced datasets have expedited progress on various NLP problems, a growing body of research has revealed fundamental limitations in existing datasets: they are often flooded with repetitive and spurious patterns, rather than covering the broad range of linguistic phenomena required by the task (Bowman and Dahl, 2021). This leads to models that seem to achieve human-level performance on in-domain test sets, yet are brittle when given out-of-domain or adversarial examples (Ribeiro et al., 2020; Glockner et al., 2018).

Figure 1: Using a data map (Swayamdipta et al., 2020) of an existing dataset relative to a trained model, (1) we automatically identify pockets of data instances exemplifying challenging reasoning patterns. Next, (2) we use GPT-3 to generate new instances with the same pattern. These generated examples are then (3) automatically filtered via a metric we introduce inspired by data maps, and (4) given to human annotators to assign a gold label and optionally revise.
We attribute this problem to an inherent challenge in the crowdsourcing design, the prevalent paradigm for creating large-scale NLP datasets, where a relatively small number of workers create a massive number of free-text examples. While human annotators are generally reliable for writing correct examples, crafting diverse and creative examples at scale can be challenging. Thus, crowdworkers often resort to a limited set of writing strategies for speed, at the expense of diversity (Geva et al., 2019; Gururangan et al., 2018). When models overfit to such repetitive patterns, they fail to generalize to out-of-domain examples where these patterns no longer hold (Geirhos et al., 2020).
On the other hand, there has been remarkable progress in open-ended text generation based on massive language models (Brown et al., 2020; Raffel et al., 2020, i.a.). Despite known deficiencies such as incoherence or repetition (Dou et al., 2021), these models often produce human-like text (Clark et al., 2021) and show potential for creative writing tasks (Lee et al., 2022). Importantly, these models are capable of replicating a pattern given just a few examples in context (Brown et al., 2020, GPT-3).
In this paper, we introduce a novel approach for dataset creation which brings together the generative strength of language models and the evaluative strength of humans through human and machine collaboration (§2). The key insight of our approach is that language models can create new examples by replicating linguistic patterns that are valuable for training, without necessarily "understanding" the task itself. Illustrated in Figure 1, our pipeline starts with an existing dataset. We use dataset cartography from Swayamdipta et al. (2020) to automatically identify pockets of examples that demonstrate challenging reasoning patterns relative to a trained model. Using each group as a set of in-context examples, we leverage a pretrained language model to generate new examples likely to have the same pattern (see Table 1). We then propose a novel metric, building on dataset cartography, to automatically filter for the generations that are most likely to aid model learning. Finally, we validate the generated examples by subjecting them to human review, where crowdworkers assign a gold label and (optionally) revise for quality.
We demonstrate the effectiveness of our approach on the task of natural language inference (NLI), which determines whether a premise entails (i.e., implies the truth of) a hypothesis, both expressed in natural language. Despite being one of the most resource-available tasks in NLP, analysis and challenge sets repeatedly demonstrate the limitations of existing datasets and the brittleness of NLI models trained on them (Gururangan et al., 2018; Poliak et al., 2018; Tsuchiya, 2018). Using MultiNLI (Williams et al., 2018) as our original dataset, we use our pipeline to create a dataset of 107,885 examples, which we call Worker-and-AI NLI (WANLI). Remarkably, empirical results demonstrate that replacing MultiNLI supervision with WANLI (which is 4 times smaller) improves performance on eight different out-of-domain test sets, including datasets that are converted to the NLI format from downstream tasks such as question-answering and fact verification (§3). This result holds even when augmenting MultiNLI with other NLI datasets and recently proposed augmentation sets. Moreover, including WANLI in the training data can help improve performance on certain in-domain test sets.
We then analyze WANLI and show that it has fewer previously documented spurious correlations than MultiNLI (§4), and provide insights into the collaborative framework (§5).
Our approach contrasts with previous instruction-based generation of dataset examples (Schick and Schütze, 2021; West et al., 2021), which requires the model to understand the task from context, fundamentally limiting the complexity of generated output to what is accessible by the model. Moreover, our human-in-the-loop approach is collaborative, rather than adversarial (Dinan et al., 2019; Nie et al., 2020; Bartolo et al., 2020). Overall, we leverage the best of both worlds: a powerful model's ability to efficiently generate diverse examples, and humans' ability to improve and ensure the quality of generations.
Our worker-AI collaborative approach is more scalable than the traditional crowdsourcing framework. It is also generalizable, allowing for rejuvenating datasets for many different classification tasks, especially when performance seems to stagnate due to overfitting to popular benchmarks (Recht et al., 2019). Our work shows the promise of leveraging language models in a controlled way to aid the dataset creation process, and we encourage the community to think of dataset curation as an AI challenge in itself.

Worker-AI Collaborative Dataset Creation for NLI

We describe our four-stage approach for dataset creation based on worker and AI collaboration. In this work, we apply it to the task of natural language inference (NLI), which involves predicting whether a premise entails, contradicts, or is neutral to a hypothesis. NLI has broad applicability in NLP: it has proven useful for pretraining (Clark et al., 2019; Phang et al., 2018), and can be applied to verify candidate answers in question-answering (Chen et al., 2021) or the factuality of generated summaries (Maynez et al., 2020). Our approach requires as prerequisites an initial dataset D_0 and a strong task model M trained on D_0. We use MultiNLI (Williams et al., 2018), a large-scale multi-genre NLI dataset, as D_0. We finetune RoBERTa-large (Liu et al., 2019) on MultiNLI for our task model M (training details in Appendix B).
As an overview, we first automatically collect groups of examples exemplifying challenging reasoning patterns in D_0 relative to M, using data maps (Swayamdipta et al., 2020; Stage 1, see §2.1). Then we overgenerate similar examples by leveraging the pattern replication capabilities of GPT-3 (Brown et al., 2020) (Stage 2; §2.2). While GPT-3 can generate examples efficiently, it may not reliably replicate the desired pattern, and its output quality will not be uniform. We address this by automatically filtering the generated examples using a metric derived from data maps (Stage 3; §2.3). We finally subject the collected data to human review, in which crowdworkers optionally revise examples and assign gold labels (Stage 4; §2.4).

Dataset Cartography.
A key component of our pipeline is inspired by data maps (Swayamdipta et al., 2020), which automatically reveal different regions in a dataset with respect to the behavior of a classification model during training. These include easy-to-learn examples, which the model consistently predicts correctly throughout training; hard-to-learn examples, on which it is consistently incorrect; and ambiguous examples, for which the model's confidence in the correct answer exhibits high variability across training epochs. Our pipeline focuses on ambiguous examples, which were shown to lead to more robust models. Additionally, ambiguous examples contain fewer spurious correlations (Gardner et al., 2021), suggesting that they capture underrepresented counterexamples to spurious correlations. Indeed, such counterexamples take more epochs of training to learn and are crucial for generalization (Tu et al., 2020), providing a potential explanation for why they appear ambiguous across early epochs and lead to more robust models.

Stage 1: Collection of Exemplars
In this stage, we automatically collect groups of examples from D_0 which represent linguistic patterns we wish to include in the target dataset. We begin with a seed example (x_i, y_i) ∈ D_0 belonging to the most ambiguous p = 25% relative to M. To generate a new example with the same reasoning pattern, we wish to leverage the ability of GPT-3 (Brown et al., 2020) for in-context learning; hence, we need to first collect examples that test a similar kind of reasoning to x_i. To do this, we use the [CLS] token representation of each example relative to the task model M, and find the k = 4 nearest neighbors via cosine similarity to x_i that have the same label. Detailed qualitative inspection shows that the nearest neighbors in this representation space tend to capture a human-interpretable similarity in the reasoning required to solve an example, rather than lexical or semantic similarity (examples in Table 1). Han and Tsvetkov (2021) give another interpretation for this approach: for examples with the same label, the similarity of [CLS] token embeddings actually represents the similarity of gradient updates in the row of the final projection layer corresponding to that label. Thus, two examples are close if training on them would "update" the final layer of the model similarly.
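As a sketch of this retrieval step, the following assumes the [CLS] embeddings have already been extracted from the task model; the function name and toy inputs are ours, not the paper's released code.

```python
import numpy as np

def nearest_neighbors(cls_embeddings, labels, seed_idx, k=4):
    """Return indices of the k most similar same-label examples to the seed,
    by cosine similarity of [CLS] representations (most similar last)."""
    seed = cls_embeddings[seed_idx]
    # Normalize rows so the dot product equals cosine similarity.
    normed = cls_embeddings / np.linalg.norm(cls_embeddings, axis=1, keepdims=True)
    sims = normed @ (seed / np.linalg.norm(seed))
    # Restrict candidates to examples sharing the seed's label (excluding itself).
    candidates = [i for i in range(len(labels))
                  if labels[i] == labels[seed_idx] and i != seed_idx]
    candidates.sort(key=lambda i: sims[i])
    # Ordered by increasing similarity, matching the prompt ordering in Stage 2.
    return candidates[-k:]
```

The increasing-similarity ordering matters because the retrieved group is later laid out in the prompt with the most similar example closest to the generation slot.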
By automatically identifying areas for augmentation, our method requires no prior knowledge of challenging patterns, making it tractable for building on top of large-scale datasets. Nonetheless, exemplar collection could potentially be approached in different ways (e.g., through expert curation or category labels).

Stage 2: Overgeneration
Given an automatically extracted group of k + 1 examples from the original dataset D 0 , we construct a natural language context (prompt) for a left-to-right language model; in this work, we use GPT-3 Curie (the second-largest GPT-3 model).The prompt template we use is shown in Figure 2, where we order the examples in increasing similarity to the seed example.
Note that our method leverages GPT-3 in a way that does not require it to "understand" the task: the model only needs to replicate the reasoning pattern demonstrated by the in-context examples.

Stage 3: Automatic Filtering
In this step, we wish to filter generated examples from Stage 2 to retain those that are the most ambiguous with respect to M. However, computing ambiguity for an example requires that it be a part of the original training set, whereas we wish to estimate the ambiguity of an unlabeled example without additional training. Thus we introduce a new metric called estimated max variability, which measures the worst-case spread of predictions on an example x_i across checkpoints of a trained model. Let E be the total number of epochs in training, Y the label set, and p_{θ^(e)} the probability assigned by the model with parameters θ^(e) at the end of the e-th epoch. We define the estimated max variability as

$$\hat{\sigma}_{\max}(x_i) = \max_{y \in \mathcal{Y}} \; \sigma\!\left(\left\{\, p_{\theta^{(e)}}(y \mid x_i) \,\right\}_{e=1}^{E}\right),$$

where σ is the standard deviation function.
Concretely, we retroactively compute the prediction from each saved epoch of M on x_i. The only assumption made is that the single example, if it had been a part of the training set, would have made a negligible difference to each model checkpoint (at least as observed through its posterior probabilities). In taking a maximum across labels, we consider x_i to be ambiguous as long as M is undecided on any label y ∈ Y.
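Under the definitions above, the metric reduces to a per-label standard deviation over epoch checkpoints, followed by a maximum over the label set. A minimal sketch (assuming the per-epoch label probabilities are already computed; names are ours):

```python
import numpy as np

def estimated_max_variability(epoch_probs):
    """epoch_probs: array of shape (E, |Y|); row e holds the probabilities the
    checkpoint after epoch e assigns to each label for one unlabeled example.
    Returns the largest per-label standard deviation across epochs."""
    per_label_std = np.std(epoch_probs, axis=0)  # spread of each label's probability
    return float(per_label_std.max())            # worst case over the label set
```

An example the model confidently settles on in every epoch scores near zero, while one it flip-flops on scores high and survives the filter.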
We first employ simple heuristics to discard examples exhibiting observable failure cases of GPT-3. Specifically, we discard examples where 1) the premise and hypothesis are identical, modulo punctuation or casing, 2) the generated example is an exact copy of an in-context example, 3) the example contains some phrase from the instruction (e.g., "pair of sentences"), or 4) the premise or hypothesis is shorter than 5 characters. Then, we compute the estimated max variability for the remaining examples with respect to M, and retain an equal number of examples from each (intended) label class with the highest max variability, creating a dataset D_filtered that is half the size of D_gen.
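The four heuristics can be sketched as a simple predicate; the function and constant names below are hypothetical, not from the released code:

```python
import string

INSTRUCTION_PHRASES = ("pair of sentences",)  # phrases leaked from the prompt

def passes_heuristics(premise, hypothesis, in_context_pairs):
    """Discard degenerate GPT-3 generations before the variability-based filter."""
    def normalize(s):
        return s.lower().translate(str.maketrans("", "", string.punctuation)).strip()
    if normalize(premise) == normalize(hypothesis):          # 1) hypothesis copies premise
        return False
    if (premise, hypothesis) in in_context_pairs:            # 2) copies an in-context example
        return False
    text = (premise + " " + hypothesis).lower()
    if any(p in text for p in INSTRUCTION_PHRASES):          # 3) instruction leakage
        return False
    if len(premise) < 5 or len(hypothesis) < 5:              # 4) degenerately short
        return False
    return True
```

Only examples passing all four checks proceed to the estimated-max-variability ranking.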

Stage 4: Human Review
As the final stage of our pipeline, we recruit human annotators on Amazon Mechanical Turk to review each unlabeled example x_i ∈ D_filtered (details about crowdworkers and guidelines in Appendix D). The annotator may optionally revise x_i to create a higher-quality example x′_i, or let x′_i = x_i. Either way, they assign a label y_i. When revising examples, we asked annotators to preserve the intended meaning as much as possible through minimal revisions. However, if an example would require a great deal of revision to fix, or if it could be perceived as offensive, they should discard it. This results in the labeled dataset D_collab = {(x′_i, y_i)}_i. Crowdworkers annotated a total of 118,724 examples, with two distinct workers reviewing each example. For examples that both annotators labeled without revision, we achieved a Cohen's κ of 0.60, indicating substantial agreement. To create the final dataset, we discard an example if either annotator chose to discard it, and we keep a revision only if both annotators revise an example (choosing one of the two revisions uniformly at random). When both annotators label the example as-is but choose different labels, we sample one of the two labels uniformly at random.
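The aggregation rules above can be sketched as follows. The tuple format and the handling of the single-revision case are our assumptions; the paper explicitly specifies only the discard, both-revise, and label-disagreement cases.

```python
import random

def aggregate(ann1, ann2):
    """Combine two annotations of one example into a final dataset entry.
    Each annotation is (text, label, revised, discard); returns None if dropped."""
    if ann1[3] or ann2[3]:                  # either annotator discarded the example
        return None
    if ann1[2] and ann2[2]:                 # both revised: keep one revision at random
        text, label, _, _ = random.choice([ann1, ann2])
        return text, label
    if not ann1[2] and not ann2[2]:         # neither revised
        text = ann1[0]
        label = ann1[1] if ann1[1] == ann2[1] else random.choice([ann1[1], ann2[1]])
        return text, label
    # Exactly one revision: revisions are kept only when both annotators revise,
    # so fall back to the unrevised version (an assumption of this sketch).
    unrevised = ann1 if not ann1[2] else ann2
    return unrevised[0], unrevised[1]
```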

In general, we believe the role of revision depends on the quality of machine-generated examples. Indeed, we need to strike a balance between leveraging human capabilities and avoiding the reemergence of annotation artifacts that may come with too much freedom in revision.

Training NLI Models with WANLI
We finetune different copies of RoBERTa-large (Liu et al., 2019) on different training sets, and evaluate each resulting model's performance on a large suite of NLI challenge sets. Given that the challenge sets were constructed independently of MultiNLI or WANLI, we consider them out-of-distribution (OOD) for both training datasets.

NLI Test Suite
The NLI challenge sets come from a wide array of domains, methodologies (e.g., crowdsourcing, expert curation, generation), and initial task formats (e.g., question-answering, fact verification).

NLI Diagnostics (Wang et al., 2018) is a manually-curated test set that evaluates a variety of linguistic phenomena using naturally-occurring sentences from several domains.

HANS (McCoy et al., 2019) targets unreliable syntactic heuristics based on lexical overlap between the premise and hypothesis.
QNLI was adapted from the Stanford Question Answering Dataset (Rajpurkar et al., 2016) by the GLUE benchmark (Wang et al., 2018). Each example pairs a question with a sentence from the source passage, and the task is to determine whether the sentence contains the answer to the question.

FEVER NLI is adapted from the FEVER fact verification dataset (Thorne et al., 2018), and introduced along with ANLI. In each example, the premise is a short context from Wikipedia, and the hypothesis is a claim that is either supported (entailed), refuted (contradicted), or neither (neutral).

Training Datasets
In addition to stand-alone WANLI and MultiNLI, we also consider combining MultiNLI with other NLI datasets. We use the train sets of SNLI (Bowman et al., 2015), ANLI, and FEVER-NLI, as well as the augmentation set generated via TAILOR (Ross et al., 2022), which perturbed SNLI hypotheses to create examples with high lexical overlap between the premise and hypothesis, and the augmentation set Z-Aug (Wu et al., 2022), which was created by generating in-distribution examples and filtering them based on spurious correlations. We consider two schemes for combining datasets A and B: 1) augmentation (A + B), in which the two datasets are concatenated, and 2) random replacement (A ⋄ B), where |B| examples from A are randomly swapped out and replaced with all examples from B.

A model trained on WANLI outperforms models trained on MultiNLI combined with other NLI datasets and augmentation sets, in every OOD setting. This includes when comparing to a model trained on 9× more data from three existing NLI datasets, MNLI + SNLI + ANLI. The consistent advantage of WANLI over datasets that include ANLI (e.g., MNLI + ANLI) is noteworthy, as ANLI's adversarial creation pipeline posed a much greater challenge for human workers, and used more existing resources to train model adversaries.
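The two combination schemes, augmentation (A + B) and random replacement (A ⋄ B), can be sketched as follows; representing a dataset as a plain list is our simplification.

```python
import random

def augment(a, b):
    """A + B: simple concatenation of the two datasets."""
    return a + b

def random_replace(a, b, seed=0):
    """A ⋄ B: swap |B| random examples of A out for all of B (requires |A| >= |B|),
    so the combined size stays equal to |A|."""
    rng = random.Random(seed)
    kept = rng.sample(a, len(a) - len(b))  # drop |B| examples from A at random
    return kept + b
```

Random replacement holds the total training-set size fixed, which isolates the effect of the data's content from the effect of simply having more data.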

Results are shown in Table 3. Quite surprisingly, training on WANLI alone also outperforms combining WANLI with MultiNLI. This reinforces that more data is not necessarily better, especially when the data predominantly consists of easy-to-learn examples.
In addition to the OOD setting, we consider whether augmentation with WANLI can improve in-domain test performance for another dataset (Table 4).Indeed, augmenting ANLI's train set with WANLI improves test accuracy on ANLI by 1.4%, while greatly aiding OOD test performance.

Artifacts in WANLI
We next investigate whether WANLI contains similar artifacts to MultiNLI. We find that while WANLI contains fewer previously known spurious correlations, it has a distinct set of lexical correlations that may reflect artifacts in GPT-3 output.

Partial Input Models
Given that the task requires reasoning with both the premise and the hypothesis, a model that sees only one of the two inputs should have no information about the correct label. We reproduce the methodology from Gururangan et al. (2018) and train fastText classifiers to predict the label using partial input. After first balancing WANLI, a model trained on just the hypotheses of WANLI achieves 41.6% accuracy on the test set, compared to 49.6% for MultiNLI when restricted to the same size. A premise-only model trained on WANLI achieves an accuracy of 42.9%.

Figure 3: Competency problem-style statistical correlation plot between individual words and particular class labels, where the y-axis is the probability of label y given the presence of the word x_i, and the x-axis is the number of times word x_i appears in the data. All points representing (word, label) pairs above the blue line have detectable correlations (Gardner et al., 2021).

Lexical Correlations
Gardner et al. (2021) posit that all correlations between single words and output labels are spurious. We plot the statistical correlation for every word and label in Figure 3, after balancing WANLI and downsampling MultiNLI. We observe that WANLI also contains words with detectable correlations, suggesting that GPT-3 may have some artifacts of its own, due to the slightly different templates and different sets of in-context examples for each label. Interestingly, the correlations tend to be a different set of words than for MultiNLI (other than "not" and "no"), with less interpretable reasons for correlating with a certain label (e.g., "second", "was").
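The competency-problems test can be approximated with simple counting: flag a (word, label) pair when its empirical p(label | word) exceeds chance by more than z standard errors under a normal approximation. This is a hedged sketch of that style of analysis, not the authors' exact procedure:

```python
import math
from collections import Counter, defaultdict

def detectable_correlations(examples, num_labels=3, z=3.0):
    """examples: list of (text, label). Flags (word, label) pairs whose empirical
    p(label | word) exceeds a one-sided normal bound around uniform chance,
    in the spirit of the competency-problems analysis (Gardner et al., 2021)."""
    word_label = defaultdict(Counter)
    for text, label in examples:
        for w in set(text.lower().split()):  # count each word once per example
            word_label[w][label] += 1
    p0 = 1.0 / num_labels
    flagged = []
    for w, counts in word_label.items():
        n = sum(counts.values())
        bound = p0 + z * math.sqrt(p0 * (1 - p0) / n)  # chance + z standard errors
        for label, c in counts.items():
            if c / n > bound:
                flagged.append((w, label))
    return flagged
```

Rare words need a very skewed label distribution to be flagged, while frequent words are flagged at smaller deviations, mirroring the count-dependent threshold (the blue line) in Figure 3.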

Premise-Hypothesis Semantic Similarity
We explore the semantic similarity between the premise and hypothesis within each label class using Sentence-BERT (Reimers and Gurevych, 2019); these distributions are shown in Figure 4. In both MultiNLI and WANLI, entailed hypotheses are naturally most semantically similar to the premise. In MultiNLI, this is followed by neutral hypotheses.

What does WANLI show about the human-machine collaboration pipeline?
We discuss observations from collecting WANLI that may offer insights for future work in the direction of collaborative dataset creation.

What kinds of revisions do annotators tend to make?
We find that revisions fall broadly into two categories: improving the fluency of the text, and improving the clarity of the relationship. The majority of revisions change the length only slightly, with 74% of both premise revisions and hypothesis revisions changing the word count by between −1 and +2 words. Fluency revisions often target well-documented issues with text generation, such as redundancy and self-contradiction. Clarity revisions often resolve ambiguities in the example that make the entailment relationship difficult (or impossible) to determine, such as ambiguous coreference or temporal references. We provide examples of revisions in Appendix D.3.

What kinds of examples do annotators disagree on?
We find that examples on which annotators disagree provide an interesting test bed for how ambiguities surface in classification tasks. Upon inspecting the examples (some are shown in Table 5), we observe that they represent genuinely ambiguous cases rather than careless mislabels, echoing previous findings (Pavlick and Kwiatkowski, 2019). See further discussion in Appendix D.4.

How reliably does GPT-3 reproduce the in-context pattern?
One characteristic of WANLI is its imbalanced label distribution: even though the set of seed examples for generation was constructed to be balanced, after undergoing human labeling, only 15% of examples are given the contradiction label. We observe that contradiction patterns in in-context examples are generally much more challenging for GPT-3 to copy, likely because it was trained on (mostly) coherent sequences of sentences. More broadly, we find that more abstract reasoning patterns are harder for GPT-3 to mimic than patterns that involve simpler transformations. Nonetheless, even when GPT-3 does not successfully copy the examples, the diverse set of in-context examples leads to a variety of creative output that may be challenging for human crowdworkers to achieve.

Related Work
Crowdsourcing The scalability and flexibility of crowdsourcing have enabled the creation of foundational NLP benchmarks across a wide range of subproblems, and made it the dominant paradigm for data collection (Mihaylov et al., 2018; Rajpurkar et al., 2016; Huang et al., 2019; Talmor et al., 2019, i.a.). Nonetheless, a growing body of research shows that the resulting datasets may not isolate the key linguistic phenomena (Jia and Liang, 2017; Chen et al., 2016; Sugawara et al., 2020).
For crowdsourced NLI datasets, where the annotator is given a premise and asked to write a hypothesis for each label (Bowman et al., 2015; Williams et al., 2018), the presence of annotation artifacts is especially well-studied (Gururangan et al., 2018; McCoy et al., 2019; Glockner et al., 2018). Recent work attempted to remedy this through different data collection protocols but found negative results (Vania et al., 2020; Bowman et al., 2020), showing this is a hard problem requiring greater innovation.

Table 5: Examples where two annotators assigned different labels. We find that many examples represent genuinely ambiguous cases rather than careless mislabels, echoing previous findings (Pavlick and Kwiatkowski, 2019).

Labels: Entailment / Contradiction. Source of ambiguity: Does "almost half" mean "not half" or "basically half"?

P: As a result of the disaster, the city was rebuilt and it is now one of the most beautiful cities in the world.
H: A disaster made the city better.
Source of ambiguity: Do indirect consequences count?

P: It is a shame that the world has to suffer the pain of such unnecessary war.
H: The world does not have to suffer such pain.
Labels: Entailment / Contradiction. Source of ambiguity: Is the scope of "has to" in the hypothesis given the war or not?

P: The original draft of the treaty included a clause that would have prohibited all weapons of mass destruction.
H: The clause was removed in the final version of the treaty.
Labels: Entailment / Neutral. Source of ambiguity: Does the premise imply that the clause is no longer in the treaty?

P: If you can't handle the heat, get out of the kitchen.
H: If you can't handle the pressure, get out of the situation.
Labels: Entailment / Neutral. Source of ambiguity: Is the premise to be interpreted literally or figuratively?

P: In a world of increasing uncertainty, the only certainty is that nothing is certain.
H: There is no certainty in the world.
Labels: Entailment / Contradiction. Source of ambiguity: Self-contradictory but coherent premise.
Adversarial data collection An alternative line of work collects examples that fool existing models (Dinan et al., 2019; Nie et al., 2020; Bartolo et al., 2020). However, this has been shown to hurt generalization to non-adversarial test sets (Kaushik et al., 2021) and decrease data diversity (Bowman and Dahl, 2021). Moreover, the resulting data has been shown to depend strongly on the adversaries, inhibiting a fair evaluation (Phang et al., 2021). Finally, these approaches may produce examples beyond the scope of the task. For example, in Adversarial NLI (Nie et al., 2020), an estimated 58% of examples required "reasoning from outside knowledge or additional facts," which is arguably separate from the underlying problem of understanding semantic entailments. We argue that we can better leverage the strengths of machines and humans by having them collaborate rather than act as adversaries.
Dataset generation Another recent approach leverages language models toward fully automatic dataset creation (Schick and Schütze, 2021; Wu et al., 2022; West et al., 2021; Bartolo et al., 2021a, i.a.). Removing human input may fundamentally limit the complexity of examples to phenomena already accessible by the model, when our goal is precisely to teach models more diverse phenomena. The most similarly-motivated work to ours, Lee et al. (2021), trains a data generator on "data-rich slices" of an existing dataset, and applies it to under-represented slices. However, they use labels or metadata to represent slices, leaving automatic methods of identifying slices to future work. Relatedly, language models have been employed to generate counter-narratives to hate speech and, in Yuan et al. (2021), biographies, with outputs validated and revised by humans. These are generative tasks; we complement their findings by showing that human-machine collaboration can also be useful for generating labeled datasets for robust classification models. Contemporary work (Bartolo et al., 2021b) finetunes a generative annotation assistant to produce question-answer pairs that humans can revise for extractive QA.

Conclusion
At the heart of dataset creation is distilling human linguistic competence into data that models can learn from. The traditional crowdsourcing paradigm takes the view that the best approach for this is to solicit people to write free-form examples expressing their capabilities. In this work, we present a worker-and-AI collaborative approach and apply it to create WANLI, whose empirical utility suggests that a better way of eliciting human intelligence at scale is to ask workers to revise and evaluate machine-generated content. To this end, we hope to encourage more work on developing generative algorithms to aid the dataset creation process, thereby re-imagining the role of human annotation.

Ethics Statement
We acknowledge that text generated from large pretrained language models is susceptible to perpetuating social harms and containing toxic language (Sheng et al., 2019; Gehman et al., 2020). To partially remedy this, we ask annotators to discard any examples that may be perceived as offensive. Nonetheless, it is possible that harmful examples (especially if they contain subtle biases) may have been missed by annotators and included in the final dataset. Specifically due to the above harms, we additionally caution readers and practitioners against fully automating any data creation pipeline.
In addition, we are cognizant of the asymmetrical relationship between requesters and workers in crowdsourcing. We took great care to pay fair wages, and were responsive to feedback and questions throughout the data collection process (see Appendix D for details). The only personal information we collect is the worker IDs from Amazon Mechanical Turk, which we will not release. The annotation effort received an IRB exemption.

Limitations
In this paper, we apply our collaborative dataset creation pipeline to a single language and task, English natural language inference, and leave application of the pipeline more broadly to future work.
It is possible (if not likely) that datasets partially authored by language models will have artifacts of their own, especially those reflecting social biases that may not be captured by our accuracy-based evaluation setup. For an investigation of a specific generation artifact observed by Yuan et al. (2021) in their own collaborative dataset, namely the over-representation of Western entities, please see Appendix C.4.
We are not able to perform ablations on different parts of the pipeline to understand the effectiveness of each component, e.g., by comparing different means of collecting exemplar groups or different templates for prompting GPT-3. Unfortunately, such variations would be prohibitively expensive, as they each require collecting a dataset of sufficient scale (along with the necessary human annotation).
Finally, although we uncover examples where annotators disagree for valid reasons (see Table 5), we only use one label per example for training and evaluation. This is because, to show the effectiveness of WANLI, we need to compare WANLI to existing (singly-labeled) training datasets via performance on established (singly-labeled) benchmarks. We encourage future work to understand the limitations of forcing inherently ambiguous instances into the n-way classification scheme, or otherwise discarding these potentially valuable examples of linguistic reasoning as noise.

B Modeling Details
All model training is implemented with the HuggingFace (Wolf et al., 2020) library and uses the original hyperparameters from the RoBERTa paper for finetuning on GLUE (Liu et al., 2019). We train the model for five epochs and evaluate the final model. We choose not to use an early stopping scheme in order to isolate the training data as the object of study and control for training length as a confounding factor. This is important since Tu et al. (2020) show that counterexamples to spurious correlations require more epochs of training to learn.

C.2 GPT-3 Generation Hyperparameters
We queried the GPT-3 Curie model available through the OpenAI API on the dates November 3 to November 5, 2021. In total, the generation cost $677.89. Hyperparameters for generation are shown in Table 7.

C.4 Over-Representation of Western Entities

Generated datasets have been observed to over-represent Western entities (Yuan et al., 2021). To investigate whether this is also characteristic of WANLI, we use flair (Akbik et al., 2019) to perform named entity recognition on MultiNLI and WANLI. Due to the challenges and ethical risks of automatically determining the origin of names and organizations, we focus on the diversity of locations mentioned. We use geopy (https://geopy.readthedocs.io) to map all locations (e.g., cities, provinces, landmarks, as well as countries) to a country. We find that 79% of location mentions in WANLI are in Europe or North America, compared to 71% in MultiNLI. In particular, the United States is massively over-represented, accounting for 46% of mentions in WANLI and 26% in MultiNLI. However, both datasets feature a diversity of location names: WANLI mentions locations in 210 countries across 22K location entities, and MultiNLI mentions locations in 227 countries across 163K location entities. We conclude that over-representation of Western entities is indeed a concern for generated datasets, and encourage future work to consider this.

D Human Review
Screenshots of the instructions, guidelines, and annotation interface are shown in Figures 6, 7, and 8. The guidelines take inspiration from the design of the NLI Diagnostics dataset (Wang et al., 2018). To collect a pool of qualified workers, we designed a qualification task with examples testing each of these categories. NLI is a challenging task, and many generated examples are especially challenging by design. Therefore, instructing annotators on how to think about the task and resolve common issues is key to collecting high-quality, label-consistent data.

D.1 The Annotators
Annotators were required to have a HIT approval rate of 98%, a total of 10,000 approved HITs, and to be located in the United States.
300 Turkers took our qualification test, of which 69 passed. Turkers who were later found to produce extremely careless annotations were removed from the qualification list (and oftentimes their annotations were discarded, though they were still paid for their work). In total, 62 workers contributed to the final dataset.
Throughout the data collection process, the authors would review annotations and write individualized emails to Turkers with feedback, as well as group emails to clarify common challenging cases of NLI (such as examples involving questions). This follows the recommended crowdsourcing protocol from Nangia et al. (2021).

D.2 Compensation
In designing the task, we aimed for a pay rate of at least $15 per hour. Workers were paid $0.12 for each example they annotated. At the end of data collection, we aggregated the earnings and time spent for each crowdworker, and found that the median hourly rate was $22.72, with 85% of workers earning above the $15/hour target.
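The per-worker aggregation is straightforward; the figures below are hypothetical stand-ins for the actual worker data, used only to show the computation.

```python
from statistics import median

PER_EXAMPLE_PAY = 0.12  # dollars per annotated example

# Hypothetical per-worker totals: (examples annotated, hours worked).
workers = [(3000, 14.0), (1200, 8.0), (500, 4.0), (2500, 11.0)]

# Effective hourly rate for each worker.
hourly = [n * PER_EXAMPLE_PAY / hours for n, hours in workers]
print(f"median hourly rate: ${median(hourly):.2f}")       # $21.86 on this toy data
print(f"share above $15/hr: {sum(r > 15 for r in hourly) / len(hourly):.0%}")
```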

D.3 Revision Analysis
We provide examples of revisions in Table 9. We find that revisions are generally targeted yet effective. The majority of revisions change the length only slightly, with 74% of both premise and hypothesis revisions changing the word count by between −1 and +2 words. A sizable proportion, 11.6% of premise revisions and 20.6% of hypothesis revisions, changed the set of pronouns present in the text, often to clarify coreference.
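These two revision statistics can be computed as sketched below; the pronoun list and the example pair are illustrative, not the exact implementation used for the analysis.

```python
# A small (non-exhaustive) pronoun list for illustration.
PRONOUNS = {"he", "she", "it", "they", "him", "her", "them", "his", "their", "its"}

def word_count_delta(original, revised):
    """Change in word count from the original to the revised text."""
    return len(revised.split()) - len(original.split())

def pronoun_set_changed(original, revised):
    """True if the set of pronouns present differs between the two versions."""
    def pronouns(text):
        return {w.strip(".,!?").lower() for w in text.split()} & PRONOUNS
    return pronouns(original) != pronouns(revised)

orig = "The lawyer said they would appeal the ruling."
rev = "The lawyer said she would appeal the ruling."
print(word_count_delta(orig, rev))     # 0: length essentially unchanged
print(pronoun_set_changed(orig, rev))  # True: {they} -> {she}, clarifying coreference
```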
We instructed annotators to revise examples only when doing so would make the example more "interesting" in some sense, or clearer without removing what makes it interesting. Nonetheless, we still observed a large number of revisions that greatly simplified the example, oftentimes re-introducing the same artifacts documented in prior work. Therefore, we ultimately chose to include revisions only when both annotators revised the example, indicating that revision was necessary to improve its quality.

D.4 Disagreement Analysis
In order to investigate the utility of collecting a third annotation, we randomly sampled 80 examples where the two annotators disagreed on the label (and neither revised nor discarded the example), and two of the authors separately annotated each one. Shockingly, the two authors agreed on the label only 49% of the time. Furthermore, in 12% of cases, all three labels were present among the four annotations. This suggests that disagreement is often due to true ambiguity rather than careless mislabeling, and that a third annotation would be unlikely to have a high payoff in terms of "correcting" the label. As a result, we choose not to collect a third annotation in this work. Instead, we believe that the doubly-annotated examples in WANLI have flagged many interesting cases of ambiguity in NLI, and we encourage future work to design richer annotation frameworks to uncover the source(s) of ambiguity. We choose to keep examples with disagreement in the WANLI dataset because we believe that finetuning with one of multiple reasonable labels still provides valuable training signal.
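The two agreement statistics can be computed as below; the label tuples are illustrative, not the actual annotations.

```python
# Each record holds the four labels for one sampled example:
# (worker1, worker2, author1, author2). Workers disagree by construction.
records = [
    ("entailment", "neutral", "entailment", "entailment"),
    ("neutral", "contradiction", "neutral", "contradiction"),
    ("entailment", "contradiction", "neutral", "neutral"),
    ("entailment", "neutral", "contradiction", "neutral"),
]

# Fraction of examples on which the two authors agreed with each other.
author_agreement = sum(a1 == a2 for _, _, a1, a2 in records) / len(records)
# Fraction of examples where all three labels appear among the four annotations.
all_three = sum(len(set(r)) == 3 for r in records) / len(records)
print(author_agreement, all_three)  # 0.5 0.5 on this toy data
```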

E.2 Evaluation on MultiNLI
We report the results on MultiNLI's development set in Table 8. We find that mixing WANLI into the MultiNLI training data (either through swapping or augmentation) maintains in-domain accuracy within ∼1%. Training on WANLI alone drops performance on MultiNLI's development set by ∼10%; however, the higher performance on other out-of-domain test sets suggests that evaluation on MultiNLI may not be a definitive signal of model ability.
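The two combination schemes (augmentation, and swapping via random replacement) can be sketched as follows; this is a minimal illustration on toy examples, not the paper's exact implementation.

```python
import random

def augment(base, extra):
    """Augmentation (+): simply concatenate the two training sets."""
    return base + extra

def random_replace(base, extra, seed=0):
    """Random replacement / swapping: replace a random subset of `base` with
    `extra`, so the resulting dataset is the same size as `base`."""
    rng = random.Random(seed)
    keep = rng.sample(base, len(base) - len(extra))
    return keep + extra

mnli = [f"mnli-{i}" for i in range(8)]    # toy stand-ins for MultiNLI examples
wanli = [f"wanli-{i}" for i in range(2)]  # toy stand-ins for WANLI examples

print(len(augment(mnli, wanli)))         # 10: sizes add
print(len(random_replace(mnli, wanli)))  # 8: size of the base set is preserved
```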

E.3 Finetuning T5
We demonstrate that the robustness improvements from training on WANLI generalize to another model architecture, T5-base (Raffel et al., 2020), which was never used in the data curation pipeline. As shown in Table 11, training T5-base on WANLI also outperforms training on MultiNLI on every test set, including by 4% on NLI Diagnostics, 10% on HANS, and 8% on Adversarial NLI (margins similar to those for finetuning RoBERTa-large).

F Data Map of WANLI
In Figure 9, we show a data map of MultiNLI relative to RoBERTa-large trained on MNLI, and of WANLI relative to RoBERTa-large trained on WANLI.

Figure 1 :
Figure 1: An illustration of our pipeline for creating WANLI. Starting with a data map (Swayamdipta et al., 2020) of an existing dataset relative to a trained model, (1) we automatically identify pockets of data instances exemplifying challenging reasoning patterns. Next, (2) we use GPT-3 to generate new instances with the same pattern. These generated examples are then (3) automatically filtered via a metric we introduce inspired by data maps, and (4) given to human annotators to assign a gold label and optionally revise.

Figure 2 :
Figure 2: Prompt template instructing GPT-3 to generate a new example, given a set of in-context examples. To separate the premise and hypothesis, the word "Implication" is used for entailment examples (shown here), "Possibility" for neutral examples, and "Contradiction" for contradiction examples.
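A sketch of how such a prompt might be assembled: the label-specific separator words follow Figure 2, but the numbering scheme and example sentences here are assumptions for illustration.

```python
# Separator word between premise and hypothesis, chosen by target label (Figure 2).
SEPARATOR = {
    "entailment": "Implication",
    "neutral": "Possibility",
    "contradiction": "Contradiction",
}

def build_prompt(examples, label):
    """Format in-context (premise, hypothesis) pairs, then leave a trailing
    numbered slot for GPT-3 to continue with a new example."""
    sep = SEPARATOR[label]
    blocks = [f"{i}. {p}\n{sep}: {h}" for i, (p, h) in enumerate(examples, 1)]
    blocks.append(f"{len(examples) + 1}.")  # generation continues from here
    return "\n\n".join(blocks)

prompt = build_prompt(
    [("A dog is running.", "An animal is moving."),
     ("She bought a car.", "She owns a vehicle.")],
    "entailment",
)
print(prompt)
```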

Figure 4 :
Figure 4: Semantic similarity between the premise and hypothesis, computed from SBERT embeddings (Reimers and Gurevych, 2019). The distributions for each label class are much better separated in MultiNLI than in WANLI.
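The similarity in Figure 4 reduces to cosine similarity between sentence embeddings. In the sketch below, toy 3-dimensional vectors stand in for the SBERT embeddings of a premise and hypothesis.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# In the paper these would be SBERT embeddings of the premise and hypothesis
# (e.g., via the sentence-transformers library); toy vectors are used here.
premise_emb = [0.2, 0.8, 0.1]
hypothesis_emb = [0.3, 0.7, 0.0]
print(round(cosine(premise_emb, hypothesis_emb), 3))
```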

Figure 5 :
Figure 5: Correlation between the variability of examples under a model trained on the full MNLI dataset, and the estimated maximum variability of the same examples when they are held out of the training set.

Figure 6 :
Figure 6: Instructions provided to crowdworkers on Amazon Mechanical Turk.

Figure 7 :
Figure 7: Guidelines provided to crowdworkers in the human review stage.

Figure 8 :
Figure 8: The interface on Amazon Mechanical Turk used for collecting human annotations. Annotators are given free text boxes pre-populated with the original premise and hypothesis, to ease the work of revision. Then, they either select an entailment class or discard the example.

Figure 9 :
Figure 9: Left: Data map for the MultiNLI train set, based on a RoBERTa-large classifier trained on MultiNLI. Right: Data map for the WANLI train set, based on a RoBERTa-large classifier trained on WANLI. A comparison of the distribution in variability (which determines example ambiguity) is remarkable: MultiNLI is overwhelmingly dominated by easy-to-learn examples with variability close to 0. In contrast, the distribution in variability is much more spread out in WANLI, suggesting that the dataset contains more valuable examples overall.

Table 1:
A seed example identified using dataset cartography (Swayamdipta et al., 2020) and corresponding WANLI examples generated by GPT-3. P stands for premise, H for hypothesis. The seed example is "ambiguous" according to the definitions of Swayamdipta et al. (2020), discussed in §2. The remaining in-context examples (shown in Appendix C.1) share the same pattern and are found using distance in [CLS] embeddings of a trained task model. The reasoning is a short description of the pattern we observe in the group, which is successfully repeated in the generated example.

Table 3 :
Empirical comparison of different training sets for RoBERTa-large, for generalization to out-of-distribution (OOD) challenge sets. Gray cells mark settings that do not represent an OOD challenge. Top: Training on MultiNLI alone. Middle: Comparison of combination schemes with MultiNLI. We consider two data combination strategies, augmentation (+) and random replacement (⋄), where the resulting dataset size is unchanged. Bottom: Training sets that include WANLI. The highest accuracy on each test set (excluding gray cells) is bolded. Test sets marked with * contain two label classes: entailment and non-entailment.
QNLI contains a premise that is a sentence and a hypothesis that is a question, which is entailed if the question is answered by the premise. Winograd NLI was adapted by the GLUE benchmark from the Winograd Schema Challenge (Levesque et al., 2011), which tests correct coreference via common sense. To convert this dataset to NLI, an entailed hypothesis is formed by substituting a correct referent, and a non-entailed hypothesis is formed by substituting an incorrect referent. Adversarial NLI (ANLI; Nie et al., 2020) is an adversarially-constructed dataset in which crowdworkers are instructed to write examples that stump existing models. Examples are collected in three rounds that progressively increase in difficulty, with model adversaries trained on MultiNLI, SNLI (Bowman et al., 2015), FEVER-NLI (discussed below), as well as ANLI sets from earlier rounds.

Results are shown in Table 3. When comparing MultiNLI (MNLI) and WANLI alone, training a model on WANLI instead of MultiNLI leads to better performance on every test set we consider, including by 4% on Diagnostics, 11% on HANS, and 9% on Adversarial NLI. This is remarkable given that WANLI is 4× smaller than MultiNLI and consists primarily of machine-written examples. A WANLI-trained model continues to outperform baselines that combine MultiNLI with other NLI datasets.

Table 4 :
Comparison of whether including WANLI in the training data of ANLI improves in-domain test performance, when finetuning RoBERTa-large.
Adversarial methods have also been challenged for not leading to better generalization on non-adversarial test sets.
P: According to the most recent statistics, the rate of violent crime in the United States has dropped by almost half since 1991.
H: The rate of violent crime has not dropped by half since 1991.

Table 6 :
Training hyperparameters for RoBERTa-large.

Table 8:
Results on MultiNLI's development set. We additionally perform comparisons with several subsets of MultiNLI that are the same size as WANLI: MultiNLI filtered with the AFLite algorithm (MultiNLI with AFLite; Le Bras et al., 2020), the most ambiguous examples of MultiNLI (MultiNLI ambiguous; Swayamdipta et al., 2020), and a random subset of MultiNLI (MultiNLI downsampled). Results in Table 10 show that a WANLI-trained model outperforms these baselines on every test set.