DynaSent: A Dynamic Benchmark for Sentiment Analysis

We introduce DynaSent (‘Dynamic Sentiment’), a new English-language benchmark task for ternary (positive/negative/neutral) sentiment analysis. DynaSent combines naturally occurring sentences with sentences created using the open-source Dynabench Platform, which facilitates human-and-model-in-the-loop dataset creation. DynaSent has a total of 121,634 sentences, each validated by five crowdworkers, and its development and test splits are designed to produce chance performance for even the best models we have been able to develop; when future models solve this task, we will use them to create DynaSent version 2, continuing the dynamic evolution of this benchmark. Here, we report on the dataset creation effort, focusing on the steps we took to increase quality and reduce artifacts. We also present evidence that DynaSent’s Neutral category is more coherent than the comparable category in other benchmarks, and we motivate training models from scratch for each round over successive fine-tuning.


Introduction
Sentiment analysis is an early success story for NLP, in both a technical and an industrial sense. It has, however, entered into a more challenging phase for research and technology development: while present-day models achieve outstanding results on all available benchmark tasks, they still fall short when deployed as part of real-world systems (Burn-Murdoch, 2013; Grimes, 2014, 2017; Gossett, 2020) and display a range of clear shortcomings (Kiritchenko and Mohammad, 2018; Hanwen Shen et al., 2018; Wallace et al., 2019; Tsai et al., 2019; Jin et al., 2019; Zhang et al., 2020).
In this paper, we seek to address the gap between benchmark results and actual utility by introducing version 1 of the DynaSent dataset for English-language ternary (positive/negative/neutral) sentiment analysis. DynaSent is intended to be a dynamic benchmark that expands in response to new models, new modeling goals, and new adversarial attacks. We present the first two rounds here and motivate some specific data collection and modeling choices, and we propose that, when future models solve these rounds, we use those models to create additional DynaSent rounds. This is an instance of "the 'moving post' dynamic target" for NLP that Nie et al. (2020) envision.

Figure 1 summarizes our method, which incorporates both naturally occurring sentences and sentences created by crowdworkers with the goal of fooling a top-performing model. The starting point is Model 0, which is trained on standard sentiment benchmarks and used to find challenging sentences in existing data. These sentences are fed into a human validation task, leading to the Round 1 Dataset. Next, we train Model 1 on Round 1 in addition to publicly available datasets. In Round 2, this model runs live on the Dynabench Platform for human-and-model-in-the-loop dataset creation; crowdworkers try to construct examples that fool Model 1. These examples are human-validated, which results in the Round 2 Dataset. Taken together, Rounds 1 and 2 have 121,634 sentences, each with five human validation labels. Thus, with only two rounds collected, DynaSent is already a substantial new resource for sentiment analysis.

* Equal contribution.

Figure 1: The DynaSent dataset creation process. The human validation task is the same for both rounds; five responses are obtained for each sentence. On Dynabench, we explore conditions with and without prompt sentences that workers can edit to achieve their goal.
In addition to contributing DynaSent, we seek to address a pressing concern for any dataset collection method in which workers are asked to construct original sentences: human creativity has intrinsic limits. Individual workers will happen upon specific strategies and repeat them, and this will lead to dataset artifacts. These artifacts will certainly reduce the value of the dataset, and they are likely to perpetuate and amplify social biases.
We explore two methods for mitigating these dangers. First, by harvesting naturally occurring examples for Round 1, we tap into a wider population than we can via crowdsourcing, and we bring in sentences that were created for naturalistic reasons, rather than the more artificial goals present during crowdsourcing. Second, for the Dynabench cases created in Round 2, we employ a 'Prompt' setting, in which crowdworkers are asked to modify a naturally occurring example rather than writing one from scratch. We compare these sentences with those created without a prompt, and we find that the prompt-derived sentences are more like naturally occurring sentences in length and lexical diversity. Of course, fundamental sources of bias remain; we seek to identify these in the Datasheet (Gebru et al., 2018) distributed with our dataset, but we argue that these steps help, and can inform crowdsourcing efforts in general.
As noted above, DynaSent presently uses the labels Positive, Negative, and Neutral. This is a minimal expansion of the usual binary (Positive/Negative) sentiment task, but a crucial one, as it avoids the false presupposition that all texts convey binary sentiment. We chose this version of the problem to show that even basic sentiment analysis poses substantial challenges for our field.

2 https://dynabench.org/

We find that the Neutral category is especially difficult. While it is common to synthesize such a category from middle-of-the-scale product and service reviews, we use an independent validation of the Stanford Sentiment Treebank (Socher et al., 2013) dev set to argue that this tends to blur neutrality together with mixed sentiment and uncertain sentiment (Section 5.2). DynaSent can help tease these phenomena apart, since it already has a large number of Neutral examples and a large number of examples displaying substantial variation in validation. Finally, we argue that the variable nature of the Neutral category is an obstacle to fine-tuning (Section 5.3), which favors our strategy of training models from scratch for each round.

Related Work
Sentiment analysis was one of the first natural language understanding tasks to be revolutionized by data-driven methods. Rather than trying to survey the field (see Pang and Lee 2008; Liu 2012; Grimes 2014), we focus on the benchmark tasks that have emerged in this space, and then seek to situate these benchmarks with respect to challenge (adversarial) datasets and crowdsourcing methods.

Sentiment Benchmarks
Many sentiment datasets are derived from customer reviews of products and services (Pang and Lee, 2004, 2005; Socher et al., 2013; Maas et al., 2011; Jindal and Liu, 2008; Ni et al., 2019; McAuley et al., 2012; Zhang et al., 2015). This is an appealing source of data, since such texts are accessible and abundant in many languages and regions of the world, and they tend to come with their own author-provided labels (star ratings). On the other hand, over-reliance on such texts is likely also limiting progress; DynaSent begins moving away from such texts, though it remains rooted in this domain.
Not all sentiment benchmarks are based in review texts. The MPQA Opinion Corpus of Wiebe et al. (2005) contains news articles labeled at the phrase level for a variety of subjective states; it presents an exciting vision for how sentiment analysis might become more multidimensional. SemEval 2016 and 2017 (Nakov et al., 2016; Rosenthal et al., 2017) offered Twitter-based sentiment datasets. And of course there are numerous additional datasets for specific languages, domains, and emotional dimensions; Google's Dataset Search currently reports over 100 datasets for sentiment.

Challenge and Adversarial Datasets
Challenge and adversarial datasets (Winograd, 1972; Levesque, 2013) have risen to prominence in response to the sense that benchmark results overstate the quality of the models we are developing (Linzen, 2020). These efforts seek to determine whether models have met specific learning targets (Alzantot et al., 2018; Glockner et al., 2018; Naik et al., 2018; Nie et al., 2019), whether they exploit relatively superficial properties of their training data (Jia and Liang, 2017; Kaushik and Lipton, 2018; Zhang et al., 2020), or whether they inherit social biases from the data they were trained on (Kiritchenko and Mohammad, 2018; Rudinger et al., 2017; Sap et al., 2019; Schuster et al., 2019).
For the most part, challenge and adversarial datasets are meant to be used primarily for evaluation (though Liu et al. (2019a) show that even small amounts of training on them can be fruitful in some scenarios). However, there are existing adversarial datasets that are large enough to support full-scale training efforts (Zellers et al., 2018, 2019; Dua et al., 2019; Bartolo et al., 2020). DynaSent falls into this class; it has large train sets that can support from-scratch training as well as fine-tuning. Our approach is closest to, and directly inspired by, the Adversarial NLI (ANLI) project, which is reported on by Nie et al. (2020) and which continues on Dynabench. In ANLI, human annotators construct new examples that fool a top-performing model but make sense to other human annotators. This is an iterative process that allows the annotation project itself to organically find phenomena that fool current models. The resulting dataset has, by far, the largest gap between estimated human performance and model accuracy of any benchmark in the field right now. We hope DynaSent follows a similar pattern, and that its naturally occurring sentences and prompt-derived sentences bring beneficial diversity.

Crowdsourcing Methods
Within NLP, Snow et al. (2008) helped establish crowdsourcing as a viable method for collecting data for at least some core language tasks. Since then, it has become the dominant mode for dataset creation throughout all of AI, and the scientific study of these methods has in turn grown rapidly. For our purposes, a few core findings from research into crowdsourcing are centrally important.
First, crowdworkers are not fully representative of the general population (Hube et al., 2019), and any crowdsourcing project will reach only a small population of workers (Gadiraju et al., 2017). This narrowness seems to be an underlying cause of many of the artifacts that have been identified in prominent NLU benchmarks (Poliak et al., 2018; Gururangan et al., 2018; Tsuchiya, 2018; Belinkov et al., 2019). DynaSent's naturally occurring sentences and prompt sentences can help, but we acknowledge that those texts come from people who write online reviews, which is also a special group.

Second, as with all work, quality varies across workers and examples, which raises the question of how best to infer individual labels from response distributions. Dawid and Skene (1979) is an early contribution to this problem, leveraging Expectation Maximization (Dempster et al., 1977). Much subsequent work has pursued similar strategies; for a full review, see Zheng et al. 2017. Our corpus release uses the true majority (at least three of five labels) as the gold label where such a majority exists, leaving examples unlabeled otherwise, but we include the full response distributions in our corpus release and make use of those distributions when training Model 1. For additional details, see Section 3.3.
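The majority-label rule just described can be sketched in a few lines. This is a minimal sketch under our own naming: the helper name and lowercase label strings are ours, not part of the release.

```python
from collections import Counter

def majority_label(responses, threshold=3):
    """Return the gold label if at least `threshold` of the five
    validators chose it; otherwise None (the example stays unlabeled).
    Hypothetical helper; the released corpus ships the full response
    distributions alongside the inferred gold labels."""
    label, count = Counter(responses).most_common(1)[0]
    return label if count >= threshold else None
```

For example, `majority_label(["positive", "positive", "positive", "neutral", "mixed"])` yields `"positive"`, while a 2/2/1 split yields `None` and the example is left unlabeled.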

Round 1: Naturally Occurring Sentences
We now begin to describe our method for constructing DynaSent (Figure 1). The current section focuses on Model 0 and Round 1, and Section 4 explains how these feed into Model 1 and Round 2.

Model 0
Our Model 0 begins with the RoBERTa-base parameters (Liu et al., 2019b) and adds a three-way sentiment classifier head. The model was trained on a number of publicly available datasets, as summarized in Table 2. See Appendix A for details on these datasets and how we processed them for our ternary task. We evaluate this and subsequent models on three external datasets (Table 1). Model 0 performs well on the Positive and Negative categories but much worse on Neutral; we attribute this to the fact that the Neutral categories for all these corpora were derived from three-star reviews, which actually mix a lot of different phenomena: neutrality, mixed sentiment, and (in the case of the reader judgments in SST) uncertainty about the author's intentions. We return to this issue in Section 5.2, arguing that DynaSent marks progress on creating a more coherent Neutral category. Finally, Table 3 includes results for our Round 1 dataset, as we are defining it. Performance is at chance across the board by construction (see Section 3.4 below). We include these columns to help with tracking the progress we make with Model 1. We also report performance of this model on our Round 2 dataset (described below in Section 4), again to help with tracking progress and understanding the two rounds.

Harvesting Sentences
Our first round of data collection focused on finding naturally occurring sentences that would challenge our Model 0. To do this, we harvested sentences from the Yelp Academic Dataset, using the version of the dataset that contains 8,021,122 reviews. The sampling process was designed so that 50% of the sentences fell into two groups: those that occurred in 1-star reviews but were predicted by Model 0 to be Positive, and those that occurred in 5-star reviews but were predicted by Model 0 to be Negative. The intuition here is that these would likely be examples that fooled our model. Of course, negative reviews can (and often do) contain positive sentences, and vice versa. This motivates the validation stage that we describe next.
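The heuristic behind these two groups amounts to a simple filter over (review rating, model prediction) pairs. A minimal sketch, with a hypothetical function name and lowercase label strings of our own choosing (the other half of the sample was drawn differently):

```python
def is_candidate(review_stars, model_prediction):
    """Round 1 harvesting heuristic: keep a sentence when its review's
    star rating and Model 0's prediction point in opposite directions.
    Such sentences likely fool the model, but human validation is still
    required, since negative reviews often contain positive sentences."""
    return ((review_stars == 1 and model_prediction == "positive")
            or (review_stars == 5 and model_prediction == "negative"))
```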

Validation
Our validation task was conducted on Mechanical Turk. Workers were shown ten sentences and asked to label them according to the categories Positive, Negative, Neutral, and Mixed. See Appendix B for the full interface, including glosses for the categories and the task instructions.
For this round, 1,978 workers participated in the validation process. In the final version of the corpus, each sentence is validated by five different workers. To obtain these ratings, we employed an iterative strategy. Sentences were uploaded in batches of 3-5K and, after each round, we measured each worker's rate of agreement with the majority. We then removed from the potential pool those workers who disagreed more than 80% of the time with their co-annotators, using a method of 'unqualifying' workers that does not involve rejecting their work or blocking them (Turk, 2017). We then obtained additional labels for examples that those 'unqualified' workers annotated. The final version of DynaSent keeps only the responses from the highest-rated workers. This led to a substantial increase in dataset quality by removing a lot of labels that seemed to us to be randomly assigned. Appendix B describes the process in more detail, and our Datasheet enumerates the known unwanted biases that this process can introduce.

Round 1 Dataset
The Round 1 dataset is summarized in Table 5, and Table 4 gives randomly selected short examples. Because each sentence has five ratings, there are two perspectives we can take on the dataset.

Distributional Labels

We can repeat each example with each of its labels (de Marneffe et al., 2012; Pavlick and Kwiatkowski, 2019). For instance, the first sentence in Table 4 would be repeated three times with 'Mixed' as the label and twice with 'Negative'. For many classifier models, this reduces to labeling each example with its probability distribution over the labels. This is an appealing approach to creating training data, since it allows us to make use of all the examples, even those that do not have a majority label, and it allows us to make maximal use of the labeling information. In our experiments, we found that training on the distributional labels consistently led to slightly better models, suggesting that annotator disagreement is stable and informative.

Table 3: Model 0 performance (F1 scores) on external assessment datasets (Table 1). We also report on our Round 1 dataset (Section 3.4), where performance is at chance by construction, and on our Round 2 dataset (Section 4) to further quantify the challenging nature of that dataset.
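For classifiers, the distributional view reduces each example's five responses to a probability distribution over the label space. A minimal sketch; the function name and lowercase label inventory are our own assumptions:

```python
from collections import Counter

def distributional_label(responses,
                         labels=("positive", "negative", "neutral", "mixed")):
    """Map an example's five validation responses to a probability
    distribution over the label space, for use as a soft training target."""
    counts = Counter(responses)
    return {lab: counts[lab] / len(responses) for lab in labels}
```

An example with three 'Mixed' and two 'Negative' responses thus becomes the target distribution {mixed: 0.6, negative: 0.4, positive: 0.0, neutral: 0.0}.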

Majority Label
We can take a more traditional route and infer a label based on the distribution of labels. In Table 5, we show the labels inferred by assuming that an example has a label just in case at least three of the five annotators chose that label. This is a conservative approach that creates a fairly large 'No Majority' category. More sophisticated approaches might allow us to make fuller use of the examples and account for biases relating to annotator quality and example complexity (see Section 2.3). We set these options aside for now because our validation process placed more weight on the best workers we could recruit (Section 3.3). The Majority Label splits given in Table 5 are designed to ensure five properties: (1) the classes are balanced, (2) Model 0 performs at chance, (3) the review-level rating associated with the sentence has no predictive value, (4) at least four of the five workers agreed, and (5) ... Over the entire round, 47% of cases are such that the validation majority label is Positive, Negative, or Neutral and Model 0 predicted a different label.

Estimating Human Performance

Table 6a provides a conservative estimate of human F1 in order to have a quantity that is comparable to our model assessments. To do this, we randomize the responses for each example to create five synthetic annotators, and we calculate the precision, recall, and F1 scores for each of these annotators with respect to the gold label. We average those scores. This heavily weights the single annotator who disagreed for the cases with 4/5 majorities. We can balance this against the fact that 614 of 1,280 workers never disagreed with the majority label (see Appendix B for the full distribution). However, it seems reasonable to say that a model has solved the round if it achieves scores comparable to our aggregate F1, a signal to start a new round.

Table 7 (Model 1 train set; cf. Table 2): SST-3 is repeated 3 times. For Yelp and Amazon, we sample 1-, 3-, and 5-star reviews with the goal of down-weighting them and removing ambiguous reviews. Round 1 uses distributional labels and is copied twice.
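The synthetic-annotator estimate of human F1 can be sketched as follows. This is a minimal sketch under stated assumptions: macro-averaged F1, exactly five responses per example, and function names and the seed of our own choosing.

```python
import random

def macro_f1(gold, pred, labels):
    """Macro-averaged F1 over the given label inventory."""
    scores = []
    for lab in labels:
        tp = sum(g == lab and p == lab for g, p in zip(gold, pred))
        fp = sum(g != lab and p == lab for g, p in zip(gold, pred))
        fn = sum(g == lab and p != lab for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(scores) / len(scores)

def estimate_human_f1(examples, labels, seed=0):
    """examples: list of (gold_label, five_responses) pairs. Shuffle each
    example's responses to form five synthetic annotators, score each
    annotator against the gold labels, and average the macro-F1 scores."""
    rng = random.Random(seed)
    columns = [[] for _ in range(5)]
    golds = []
    for gold, responses in examples:
        responses = responses[:]  # don't mutate the caller's list
        rng.shuffle(responses)
        golds.append(gold)
        for i, r in enumerate(responses):
            columns[i].append(r)
    return sum(macro_f1(golds, col, labels) for col in columns) / 5
```

Because every disagreeing response lands in exactly one synthetic annotator's column, a single dissenter on a 4/5-majority example pulls that annotator's score down, which is what makes this estimate conservative.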

Round 2: Dynabench
In Round 2, we leverage Dynabench to begin creating a new dynamic sentiment benchmark.

Model 1
Model 1 was created using the same general methods as for Model 0 (Section 3.1): we begin with RoBERTa parameters and add a three-way sentiment classifier head. The differences between the two models lie in the data they were trained on. The train set is summarized in Table 7, and Appendix A provides additional details. Table 8 summarizes the performance of our model on the same evaluation sets as are reported in Table 3 for Model 0. Overall, we see a small performance drop on the external datasets, but a huge jump in performance on our dataset (Round 1). While it is unfortunate to see a decline in performance on the external datasets, this is expected if we are shifting the label distribution with our new dataset; it might be an inevitable consequence of hill-climbing in our intended direction.

Dynabench Interface
Our data distribution provides the Dynabench interface we created for DynaSent as well as the complete instructions and training items given to workers. The essence of the task is that the worker chooses a label y to target and then seeks to write an example to which the model (currently, Model 1) assigns a label other than y but that other humans would label y. Workers can try repeatedly to fool the model, and they get feedback on the model's predictions as a guide for how to fool it.

Methods
We consider two conditions. In the Prompt condition, workers are shown a sentence and given the opportunity to modify it as part of achieving their goal. Prompts are sampled from parts of the Yelp Academic Dataset not used for Round 1. In the No Prompt condition, workers wrote sentences from scratch, with no guidance beyond their goal of fooling the model. We piloted both versions and compared the results. Our analyses are summarized in Section 5.1. The findings led us to drop the No Prompt condition and use the Prompt condition exclusively, as it clearly leads to examples that are more naturalistic and linguistically diverse.
For Round 2, our intention was for each prompt to be used only once, but prompts were repeated in a small number of cases. We have ensured that our dev and test sets contain only sentences derived from unique prompts (Section 4.5).

Validation
We used the same validation process described in Section 3.3, obtaining five responses for each example as before. This again opens up the possibility of using label distributions or inferring individual labels. In this round, 395 workers participated. See Appendix B for additional details.

Round 2 Dataset
Table 10 summarizes our Round 2 dataset, and Table 9 provides train examples from Round 2, sampled using the same criteria we used for Table 4. The percentage of cases in which the validation majority label is Positive, Negative, or Neutral and Model 1 predicted a different label is about 19%, which is much lower than the comparable value for Round 1 (47%). There seem to be three central reasons for this. First, Model 1 is hard to fool, so many workers reach the maximum number of attempts. We retain the examples they enter, as many of them are interesting in their own right. Second, some workers seem to get confused about the true goal and enter sentences that the model in fact handles correctly. Some non-trivial rate of confusion here seems inevitable given the cognitive demands of the task, but we have taken steps to improve the interface to minimize this factor. Third, a common strategy is to create examples with mixed sentiment; the model does not predict this label, but it is chosen at a high rate in validation.
Despite these factors, we can construct splits that meet our core goals: (1) Model 1 performs at chance on the dev and test sets, and (2) the dev and test sets contain only examples where the majority label was chosen by at least four of the five workers. In addition, (3) our dev and test sets contain only examples from the Prompt condition (the No Prompt cases are in the train set, and flagged as such), and (4) all the dev and test sentences are derived from unique prompts to avoid leakage between train and assessment sets and reduce unwanted correlations within the assessment sets. Table 6b provides estimates of human F1 for Round 2 using the same methods as described in Section 3.5. We again emphasize that these are conservative estimates. A large percentage of workers (116 of 244) never disagreed with the gold label on the examples they rated, suggesting that human performance can approach perfection. Nonetheless, the estimates we give here seem useful for helping us decide whether to continue hill-climbing on this round or begin creating new rounds.

Discussion
We now address a range of issues that our methods raise but that we have so far deferred in the interest of succinctly reporting on the methods themselves.

The Role of Prompts
As discussed in Section 4, we explored two methods for collecting original sentences on Dynabench: with and without a prompt sentence that workers could edit to achieve their goal. We did small pilot rounds in each condition and assessed the results. This led us to use the Prompt condition exclusively. This section explains our reasoning more fully.
First, we note that workers did in fact make use of the prompts. In Figure 2a, we plot the Levenshtein edit distance between the prompts provided to annotators and the examples the annotators produced, normalized by the length of the prompt or the example, whichever is longer. There is a roughly bimodal distribution in this plot, where the peak on the right represents examples generated by the annotator tweaking the prompt slightly and the peak on the left represents examples where they deviated significantly from the prompt. Essentially no examples fall at the extreme ends (literal reuse of the prompt; complete disregard for the prompt).
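The normalization we use is standard Levenshtein distance divided by the length of the longer of the two strings, so 0 means literal reuse and 1 means complete disregard. A self-contained sketch (the function name is ours):

```python
def normalized_edit_distance(prompt, example):
    """Levenshtein distance between prompt and example, normalized by
    the length of the longer string (0 = identical, 1 = maximally
    different). Standard two-row dynamic-programming implementation."""
    m, n = len(prompt), len(example)
    if max(m, n) == 0:
        return 0.0
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if prompt[i - 1] == example[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / max(m, n)
```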
Second, we observe that examples generated in the Prompt condition are generally longer than those in the No Prompt condition, and more like our Round 1 examples. Figure 2b summarizes string lengths; the picture is essentially the same for tokenized word counts. In addition, the Prompt examples have a more diverse vocabulary overall. Figure 2c provides evidence for this: we sampled 100 examples from each condition 500 times, sampled five words from each example, and calculated the vocabulary size (unique token count) for each sample. (These measures are intended to control for the known correlation between token counts and vocabulary sizes; Baayen 2001.) The Prompt-condition vocabularies are much larger, and again more similar to our Round 1 examples.
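The vocabulary-size measure can be sketched as follows. This is a sketch under stated assumptions: whitespace tokenization, discarding examples shorter than five tokens, and a function name and seed of our own choosing.

```python
import random

def vocab_size_samples(examples, n_examples=100, n_words=5,
                       n_samples=500, seed=0):
    """Repeatedly sample `n_examples` examples, draw `n_words` tokens
    (without replacement) from each, and record the number of unique
    tokens in the pooled sample. Fixing the token budget per sample
    controls for the correlation between token counts and vocabulary
    size."""
    rng = random.Random(seed)
    tokenized = [e.split() for e in examples]
    usable = [toks for toks in tokenized if len(toks) >= n_words]
    sizes = []
    for _ in range(n_samples):
        chosen = rng.sample(usable, min(n_examples, len(usable)))
        vocab = set()
        for toks in chosen:
            vocab.update(rng.sample(toks, n_words))
        sizes.append(len(vocab))
    return sizes
```

Comparing the resulting size distributions across conditions then gives the picture in Figure 2c: larger values indicate a more diverse vocabulary at a fixed token budget.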
Third, a qualitative analysis further substantiates the above picture. For example, many workers realized that they could fool the model by attributing a sentiment to another group and then denying it, as in "They said it would be great, but they were wrong". As a result, there are dozens of examples in the No Prompt condition that employ this strategy. Individual workers hit upon more idiosyncratic strategies and repeatedly used them. This is just the sort of behavior that we know can create persistent dataset artifacts. For this reason, we include No Prompt examples in the training data only, and we make it easy to identify them in case one wants to handle them specially.

The Neutral Category
For both Model 0 and Model 1, there is consistently a large gap between performance on the Neutral category and performance on the other categories, but only for the external datasets we use for evaluation. For our dataset, performance across all three categories is fairly consistent. We hypothesized that this traces to semantic diversity in the Neutral categories for these external datasets. In review corpora, three-star reviews can signal neutrality, but they are also likely to signal mixed sentiment or uncertain overall assessments. Similarly, where the ratings are assigned by readers, as in the SST, it seems likely that the middle of the scale will also be used to register mixed and uncertain sentiment, along with a real lack of sentiment.
To further support this hypothesis, we ran the SST dev set through our validation pipeline. This leads to a completely relabeled dataset (distributed with DynaSent) with five ratings for each example and a richer array of categories. The new labels are closely aligned with SST's for Positive and Negative, but the SST-3 Neutral category has a large percentage of cases falling into Mixed and No Majority. Appendix D provides the full comparison matrix and gives a random sample of cases where the two label sets differ with regard to the Neutral category. It also provides all seven cases of sentiment confusion. We think these comparisons favor our labels over SST's original labels.

Fine-Tuning
Our Model 1 was trained from scratch (beginning with RoBERTa parameters). An appealing alternative would be to begin with Model 0 and fine-tune it on our Round 1 data. This would be more efficient, and it might naturally lead to the Round 1 data receiving the desired overall weight relative to the other datasets. Unfortunately, our attempts at this led to worse models, and the problems traced to very low performance on the Neutral category.
To study the effect of our dataset on Model 1 performance, we employ the "fine-tuning by inoculation" method of Liu et al. (2019a). We first divide our Round 1 train set into small subsets via random sampling. Then, we fine-tune our Model 0 using these subsets of Round 1 train with non-distributional labels. We early-stop our fine-tuning process if performance on the Round 0 dev set of Model 0 (SST-3 dev) has not improved for five epochs. Lastly, we measure model performance with Round 1 dev (SST-3 dev plus Round 1 dev) and our external evaluation sets (Table 1).

Figure 2: (a) Normalized edit distances between the prompt and the example. (b) String lengths; the picture is essentially the same for tokenized word counts. (c) Vocabulary sizes in samples of 100 examples (500 samples with replacement).

Figure 3 presents F1 scores for our three class labels using this method. Model performance on Round 1 dev increases for all three labels given more training examples. The F1 scores for the Positive and Negative classes remain high, but they begin to drop slightly with larger samples. The F1 scores on SST-3 dev show larger perturbations. The most striking trends are for the Neutral category, where the F1 score on Round 1 dev increases steadily while the F1 scores on the three original development sets for Model 0 decrease drastically. This is the pattern that Liu et al. (2019a) associate with dataset artifacts or label distribution shifts.
Our current hypothesis is that the pattern we observe can be attributed, at least in large part, to label shift -specifically, to the difference between our Neutral category and the other Neutral categories, as discussed in the preceding section. Our strategy of training from scratch seems less susceptible to these issues, though the label shift is still arguably a factor in the lower performance we see on this category with our external validation sets.

Conclusion
We presented DynaSent as the first stage in an ongoing effort to create a dynamic benchmark for sentiment analysis. To date, the best future-looking Model 2 we have developed achieves 83.1 F1 on Round 1 and 70.8 F1 on Round 2 while maintaining good performance on our external benchmarks. Appendix E provides details on this model and others, and the Dynabench platform offers a detailed and up-to-date leaderboard. We hope and expect that the community will find models that solve both rounds. That will be our cue to launch another round of data collection to fool those models and push sentiment analysis forward by another step.

Impact Statement
DynaSent is distributed with a detailed Datasheet (Gebru et al., 2018) that describes the data collection process and its motivations, and seeks to articulate known limitations of the resource. The data distribution also includes a Model card (Mitchell et al., 2019) that seeks to provide similar disclosures concerning Model 0 and Model 1. Taken together, these documents further articulate our central goals for these resources and provide guidance on responsible use. These documents will be updated appropriately as DynaSent and our associated models evolve.

B.1 Interface

Figure 4 shows the interface for the validation task used for both Round 1 and Round 2. The top provides the instructions, and then one item is shown. The full task had ten items per Human Intelligence Task (HIT). Workers were paid US$0.25 per HIT, and all workers were paid for all their work, regardless of whether we retained their labels.

B.2 Worker Selection
Examples were uploaded to Amazon's Mechanical Turk in batches of 3-5K examples. After each round, we assessed workers by the percentage of examples they labeled for which they agreed with the majority. For example, a worker who selects Negative where three of the other workers chose Positive disagrees with the majority for that example. If a worker disagreed with the majority more than 80% of the time, we removed that worker from the annotator pool and revalidated the examples they labeled. This process was repeated iteratively over the course of the entire validation process for both rounds. Thus, many examples received more than 5 labels; we collected a total of 808,289 responses, of which 608,170 (75%) are used in the final dataset, as we keep only those by the top-ranked workers according to agreement with the majority. We observed that this iterative process led to substantial improvements to the validation labels according to our own intuitions. To remove workers from our pool, we used a method of 'unqualifying', as described in Turk 2017. This method does no reputational damage to workers and is often used in situations where the requester must limit responses to one per worker (e.g., surveys).

We do not know precisely why workers tend to disagree with the majority. The reasons are likely diverse. Possible causes include inattentiveness, poor reading comprehension, a lack of understanding of the task, and a genuinely different perspective on what the examples convey. While we think our method mainly increased label quality, we recognize that it can introduce unwanted biases. We acknowledge this in our Datasheet, which is distributed with the dataset.

Figure 5 shows the distribution of workers for the validation task for both rounds. In the final version of Round 1, the median number of examples per worker was 45 and the mode was 11. For Round 2, the median was 20 and the mode was 1.
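The 80% disagreement rule can be sketched as follows. The function names are ours, and the bookkeeping for revalidating an unqualified worker's examples is omitted.

```python
def agreement_rate(worker_responses, majority_labels):
    """Fraction of a worker's responses that match the majority label
    for the corresponding example."""
    matches = sum(r == m for r, m in zip(worker_responses, majority_labels))
    return matches / len(worker_responses)

def flag_for_removal(worker_responses, majority_labels, max_disagreement=0.8):
    """A worker is 'unqualified' (and their examples revalidated) if
    they disagree with the majority on more than 80% of their
    responses."""
    return 1 - agreement_rate(worker_responses, majority_labels) > max_disagreement
```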
Figure 6 summarizes the rates at which individual workers agree with the gold label. Across the dev and test sets for both rounds, substantial numbers of workers agreed with the gold label on all of the cases they labeled, and, in both rounds, more than half of the workers had an agreement rate above 95%.

C.2 Instructions
Our data distribution includes the complete instructions for the Dynabench task, and the list of comprehension questions we required workers to answer correctly before starting.

C.3 Data Collection Pipeline
For each task, a worker has ten attempts in total to find an example that fools the model. A worker can claim their payment immediately after submitting a single fooling example, or after running out of attempts. On average, a worker needs two attempts before generating an example that they claim fools the model. Workers are paid US$0.30 per task. A confirmation step is required whenever the model predicts incorrectly: we explicitly ask workers to confirm that the examples they come up with are truly fooling examples.
To incentivize workers, we pay a bonus of US$0.30 for each truly fooling example, as determined by our separate validation phase. We temporarily disallow a worker from doing our task if they fail to answer all of our onboarding questions correctly within five attempts. We also temporarily disallow a worker from doing our task if they consistently fail to come up with truly fooling examples according to our validation task.
A worker must meet the following qualifications before accepting our tasks. First, a worker must reside in the U.S. and speak English. Second, a worker must have completed at least 1,000 tasks on Amazon Mechanical Turk with an approval rating of at least 98%. Lastly, a worker must not be in any of our temporarily disallowed worker pools.
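The payment and bonus rules above amount to a simple accounting function per task. A minimal sketch with hypothetical names (the actual pipeline is implemented in our data collection tooling, not this code):

```python
MAX_ATTEMPTS = 10   # attempts allowed per task
BASE_PAY = 0.30     # US$ paid per completed task
BONUS = 0.30        # US$ per truly fooling example, per validation

def task_payment(attempt_fooled, validated):
    """Compute the pay for one task.

    `attempt_fooled`: per-attempt flags for whether the model's
    prediction was wrong; the task ends at the first claimed
    fooling example or when attempts run out.
    `validated`: whether the separate validation phase confirmed
    the submitted example as truly fooling.
    """
    assert len(attempt_fooled) <= MAX_ATTEMPTS
    pay = BASE_PAY                       # base pay is unconditional
    if any(attempt_fooled) and validated:
        pay += BONUS                     # bonus only after validation
    return round(pay, 2)
```

For example, a worker who fools the model on their second attempt and whose example is confirmed by validation earns US$0.60 for that task; an unvalidated claim earns only the base US$0.30.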
We adapt the open-source software package Mephisto as our data collection tool.7

D SST-3 Validation Examples

Table 11 compares the SST-3 labels with the labels from our separate validation task. There are just seven cases of polarity (Positive/Negative and Negative/Positive) disagreement; these are included in Table 12. The rate of disagreement is much higher where the SST-3 Neutral category is involved, which we trace (in Section 5.2) to the nature of the SST-3 Neutral category. Table 12 gives a random selection of cases involving the Neutral category to support these claims qualitatively.

E A Future-Looking Model 2
As we say in Section 6, we hope that DynaSent continues to grow. A future Round 3 would use a future Model 2 (or a set of such models), either to harvest naturally occurring examples or to drive another round of adversarial example creation on Dynabench. We have explored a variety of Transformer-based architectures (Vaswani et al., 2017) for Model 2, designed and optimized according to the protocols given in Appendix A: RoBERTa (Liu et al., 2019b), BERT (Devlin et al., 2019), XLNet, and ELECTRA (Clark et al., 2019). ELECTRA has yielded the best results so far, with 83.1 F1 on Round 1 and 70.8 on Round 2. We do not think these are the best possible models; we offer these very preliminary results in the hope that they provide some useful guidance.

Sentence | SST-3 | Responses
should be seen at the very least for its spasms of absurdist humor. | neu | pos, pos, pos, pos, pos
Van Wilder brings a whole new meaning to the phrase 'comedy gag.' | neu | mix, neu, pos, pos, pos
'They' begins and ends with scenes so terrifying I'm still stunned. | neu | neu, neu, pos, pos, pos
Barely gets off the ground. | neu | neg, neg, neg, neg, neg
As a tolerable diversion, the film suffices; a Triumph, however, it is not. | neu | mix, mix, mix, mix, neg

(c) A random selection of examples for which the SST-3 label is Neutral and our validation label is not.
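The five validation responses per sentence can be collapsed to a single label. A minimal sketch, assuming a rule that assigns a label only when at least three of the five workers chose it (the threshold and function name are our illustration; the distributed dataset includes the full response lists):

```python
from collections import Counter

def majority_label(responses, threshold=3):
    """Return the label chosen by at least `threshold` of the
    validation responses, or None when no label reaches the
    threshold (a 'no majority' case)."""
    label, count = Counter(responses).most_common(1)[0]
    return label if count >= threshold else None

# Two rows from the selection above:
print(majority_label(["mix", "neu", "pos", "pos", "pos"]))  # pos
print(majority_label(["mix", "mix", "mix", "mix", "neg"]))  # mix
```

Under this rule, every example in the selection above receives a validation label that differs from its SST-3 Neutral label, which is what makes them instructive cases.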