Don’t Blame the Annotator: Bias Already Starts in the Annotation Instructions

In recent years, progress in NLU has been driven by benchmarks. These benchmarks are typically collected by crowdsourcing, where annotators write examples based on annotation instructions crafted by dataset creators. In this work, we hypothesize that annotators pick up on patterns in the crowdsourcing instructions, which bias them to write many similar examples that are then over-represented in the collected data. We study this form of bias, termed instruction bias, in 14 recent NLU benchmarks, showing that instruction examples often exhibit concrete patterns, which are propagated by crowdworkers to the collected data. This extends previous work (Geva et al., 2019) and raises a new concern of whether we are modeling the dataset creator’s instructions, rather than the task. Through a series of experiments, we show that, indeed, instruction bias can lead to overestimation of model performance, and that models struggle to generalize beyond biases originating in the crowdsourcing instructions. We further analyze the influence of instruction bias in terms of pattern frequency and model size, and derive concrete recommendations for creating future NLU benchmarks.


Introduction
Benchmarks have proven pivotal for driving progress in Natural Language Understanding (NLU) in recent years (Rogers et al., 2021; Bach et al., 2022; Wang et al., 2022). Nowadays, NLU benchmarks are mostly created through crowdsourcing, where crowdworkers write examples following annotation instructions crafted by dataset creators (Callison-Burch and Dredze, 2010; Zheng et al., 2018; Suhr et al., 2021). The instructions typically include a short description of the task, along with several examples (Dasigi et al., 2019; Sakaguchi et al., 2020).
Despite the vast success of this method, past studies have shown that data collected through crowdsourcing often exhibit various biases that lead to overestimation of model performance (Schwartz et al., 2017; Gururangan et al., 2018; Poliak et al., 2018; Tsuchiya, 2018; Le Bras et al., 2020; Mishra et al., 2020a; Mishra and Arunkumar, 2021; Hettiachchi et al., 2021). Such biases are often attributed to annotator-related factors, such as writing style and background knowledge (Gururangan et al., 2018; Geva et al., 2019) (see further discussion of related work in §A).
In this work, we propose that biases in crowdsourced NLU benchmarks often originate at an early stage of the data collection process: the design of the annotation task. In particular, we hypothesize that the task instructions provided by dataset creators, which serve as the guiding principles for annotators, often influence crowdworkers to follow specific patterns, which are then propagated and become over-represented in the collected data. For instance, ∼ 36% of the instruction examples for the QUOREF dataset (Dasigi et al., 2019) start with "What is the name", and the same pattern can be observed in ∼ 59% of the collected instances.
To test our hypothesis, we conduct a broad study of this form of bias, termed instruction bias, in 14 recent NLU benchmarks. We find that instruction bias is evident in most of these datasets, showing that ∼ 73% of instruction examples on average share a few clear patterns. Moreover, we find that these patterns are propagated by annotators to the collected data, covering ∼ 61% of the instances on average. This suggests that instruction examples play a critical role in the data collection process and the resulting example distribution.
It is difficult to represent a task with only a few examples, and bias in the instruction examples makes this even harder, since a task and its associated reasoning have a much larger scope than a handful of instruction patterns. For example, coreference resolution, temporal commonsense reasoning, and numerical reasoning are much broader tasks than the prevalent patterns in the QUOREF ("what is the name..."), MC-TACO ("how long..."), and DROP ("how many field goals...") datasets suggest.
We investigate the effect of instruction bias on model performance, showing that performance is overestimated due to instruction bias and that models often fail to generalize beyond instruction patterns. Moreover, we observe that a higher frequency of instruction patterns in the training set often widens the model's performance gap between pattern and non-pattern examples, and that large models are generally less sensitive to instruction bias.
In conclusion, our work shows that instruction bias is widespread in NLU benchmarks, often leading to an overestimation of model performance. Based on our study, we derive concrete recommendations for monitoring and alleviating this bias in future data collection efforts. From a broader perspective, our findings also have implications for the recent learning-from-instructions paradigm (Efrat and Levy, 2020; Mishra et al., 2021), where crowdsourcing instructions are used in model training.

Instruction Bias in NLU Benchmarks
Instructions are the primary resource for educating crowdworkers on how to perform their task (Nangia et al., 2021). Bias in the instructions, dubbed instruction bias, could lead crowdworkers to propagate specific patterns to the collected data.
Here, we study instruction bias in NLU benchmarks, focusing on two research questions: (a) Do crowdsourcing instructions exhibit patterns that annotators can pick up on? and (b) Are such patterns propagated by crowdworkers to the collected data? In our study, we use the instructions of 14 recent NLU benchmarks, listed in Tab. 1, which cover a wide range of different tasks. We also believe that the small number of examples in crowdsourcing instructions might limit the imagination of annotators when writing samples, contributing to instruction bias.

Patterns in Crowdsourcing Instructions
Our goal is to quantify biases in instruction examples that propagate to collected data instances. In this study, we focus on an intuitive form of bias: recurring word patterns, which crowdworkers can easily pick up on. To find such patterns, we manually analyze the instruction examples of each dataset to find a dominant pattern, using the following procedure: (a) identifying repeating patterns of n ≥ 2 words, (b) merging patterns that are semantically similar or have a substantial word overlap, and (c) selecting the most frequent pattern as the dominant pattern (an example is provided in §C). Tab. 1 shows the dominant pattern in the instruction examples of each dataset. On average, 72.7% of the instruction examples used to create a dataset exhibit the same dominant pattern, and for 10 out of 14 datasets, the dominant pattern covers more than half of the instruction examples. This suggests that crowdsourcing instructions demonstrate a small set of repeating "shallow" patterns. Moreover, the short length of the patterns (2-4 words) and the typically low number of instruction examples (Tab. 2) make the patterns easily visible to crowdworkers, who can end up following them.
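To make the procedure concrete, the sketch below shows how step (a) can be automated by counting recurring word n-grams over the instruction examples; function and variable names are illustrative, and steps (b) and (c) were performed manually in our analysis.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return consecutive word n-grams of a token list as strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def candidate_patterns(instruction_examples, min_n=2, max_n=4, min_count=2):
    """Step (a): count recurring word n-grams (n >= 2) across instruction examples.

    Returns candidate patterns sorted by frequency. Merging semantically similar
    candidates (step (b)) and picking the dominant one (step (c)) are manual.
    """
    counts = Counter()
    for example in instruction_examples:
        tokens = example.lower().split()
        for n in range(min_n, max_n + 1):
            counts.update(ngrams(tokens, n))
    return [(p, c) for p, c in counts.most_common() if c >= min_count]

# Toy usage with the MC-TACO (event duration) instruction examples from Sec. C.
examples = [
    "How long did Jack play basketball?",
    "How long did he do his homework?",
    "How long did it take for him to get the Visa?",
]
print(candidate_patterns(examples))
# [('how long', 3), ('long did', 3), ('how long did', 3)]
```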
Notably, our results are an underestimation of the actual instruction bias, since (a) we only consider the dominant pattern for each dataset, (b) our manual analysis of instruction examples has a preference for short patterns, (c) we do not consider paraphrased patterns (beyond the shallow paraphrases which are visible in the annotation instructions), and (d) datasets may include implicit patterns (e.g., writing style and biases from the annotator's background knowledge) that also contribute to instruction bias. Accounting for such patterns is expected to further increase the bias percentages in Tab. 1.

Instruction Bias Propagation to Datasets
We now turn to investigate whether patterns in instruction examples are further propagated by crowdworkers to the collected data. We analyze the train and test sets of each benchmark to find the same patterns, using simple string matching. To account for minor syntactic modifications of the identified patterns, we also consider synonyms where appropriate and match paraphrased versions of each pattern.
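As a rough illustration of this matching step, the sketch below computes the fraction of instances that contain any variant of a dominant pattern; the variant list and example questions are hypothetical, and the real analysis uses the per-dataset patterns from Tab. 1.

```python
def pattern_coverage(instances, pattern_variants):
    """Fraction of instances containing any variant of the dominant pattern.

    `instances` are question/prompt strings; `pattern_variants` holds the dominant
    pattern plus its paraphrased forms (e.g., "how long did", "how long was").
    Matching is lowercase substring search, mirroring the string matching above.
    """
    hits = sum(any(v in text.lower() for v in pattern_variants) for text in instances)
    return hits / len(instances) if instances else 0.0

# Hypothetical usage on a benchmark's train and test questions.
variants = ["how long did", "how long was", "how long does"]
train_questions = ["How long did the meeting last?", "Where was the game held?"]
test_questions = ["How long does a flight to Paris take?"]
print(pattern_coverage(train_questions, variants))  # 0.5
print(pattern_coverage(test_questions, variants))   # 1.0
```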
Tab. 1 shows the results. Across all datasets, instruction patterns are ubiquitous in the collected data, occurring in 60.5% of the instances on average, with similar presence in training (59%) and test (62%) examples. While the dominant pattern's frequency in the data is typically not higher than in the instructions, for CLARIQ, DUORC, MULTIRC, QUOREF and ROPES, the pattern frequency was amplified by the crowdworkers. Interestingly, these datasets used a relatively large number of instruction examples (Tab. 2), suggesting that more examples do not necessarily alleviate the propagation of instruction bias. Example data instances with instruction patterns are provided in §D.
A natural question that arises is whether the patterns in the collected data reflect the true task distribution rather than a bias in the instructions. We argue that this is highly unlikely. First, while the space of possible patterns for an NLU task is arguably large, the patterns in the collected data are disproportionately concentrated on those appearing in the instructions. Second, our survey in §E shows that when annotators are not shown instruction examples, the dominant patterns appear far less frequently and the responses are much more diverse. Propagation of instruction bias to the test set raises concerns regarding its reliability for evaluation, which we address next.

Effect on Model Learning
Let $S_{train}$ ($S_{test}$) be the set of training (test) examples, and denote by $S^{p}_{train}$ ($S^{p}_{test}$) and $S^{-p}_{train}$ ($S^{-p}_{test}$) its disjoint subsets of examples with and without instruction patterns, respectively. We conduct two experiments, fine-tuning models on (a) $S^{p}_{train}$ and (b) $S^{p}_{train} \cup S^{-p}_{train}$, and evaluating them on $S^{p}_{test}$ and $S^{-p}_{test}$. Experiment (a) assesses to what extent models generalize from instruction patterns to the broader task, and experiment (b) compares model performance on instances with and without instruction patterns.
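A minimal sketch of how these subsets can be constructed, assuming each example is a dict with a question field (the field name and pattern variants are illustrative):

```python
def split_by_pattern(examples, pattern_variants, text_field="question"):
    """Split a dataset into (S^p, S^-p): examples with / without the dominant pattern."""
    with_pattern, without_pattern = [], []
    for ex in examples:
        text = ex[text_field].lower()
        if any(v in text for v in pattern_variants):
            with_pattern.append(ex)
        else:
            without_pattern.append(ex)
    return with_pattern, without_pattern

# Hypothetical usage: build the four subsets used in experiments (a) and (b).
variants = ["how long did", "how long was", "how long does"]
train = [{"question": "How long did the trial run?"}, {"question": "Who won the trial?"}]
test = [{"question": "How long does hibernation last?"}, {"question": "Why do bears hibernate?"}]
s_p_train, s_np_train = split_by_pattern(train, variants)
s_p_test, s_np_test = split_by_pattern(test, variants)
# Experiment (a): fine-tune on s_p_train only; experiment (b): on s_p_train + s_np_train.
# Both are evaluated separately on s_p_test and s_np_test.
```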

Experimental Setting
Datasets. Since model training is computationally expensive, we select a subset of seven datasets from those analyzed in §2; their statistics are given in Tab. 5 (§B).

Models. We fine-tune T5 and BART models, in both base and large sizes (results for BART are reported in §F).

Evaluation. We evaluate model performance using the standard F1 score, and report the average score over three random seeds.
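The F1 score here is the standard token-overlap F1 commonly used for extractive QA; the sketch below shows the per-example computation and the averaging over seeds, omitting dataset-specific answer normalization (e.g., punctuation and article stripping), which would need to be added per benchmark.

```python
from collections import Counter

def token_f1(prediction, reference):
    """Token-overlap F1 between a predicted and a gold answer string."""
    pred_tokens = prediction.lower().split()
    gold_tokens = reference.lower().split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def average_f1(per_seed_predictions, references):
    """Mean F1 over test examples, then averaged over random seeds."""
    seed_means = []
    for predictions in per_seed_predictions:
        scores = [token_f1(p, g) for p, g in zip(predictions, references)]
        seed_means.append(sum(scores) / len(scores))
    return sum(seed_means) / len(seed_means)

# Hypothetical usage with predictions from three seeds on two test questions.
refs = ["two hours", "John Smith"]
preds_by_seed = [["two hours", "Smith"], ["about two hours", "John Smith"], ["two hours", "John"]]
print(round(average_f1(preds_by_seed, refs), 3))  # ≈ 0.856
```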

Results
We observe similar results for T5 and BART, and thus present only the results for T5 in this section. Results for BART are provided in §F.
Models often fail to generalize beyond instruction patterns. Tab. 3 shows the performance on $S^{p}_{test}$ and $S^{-p}_{test}$ when training only on examples with instruction patterns. Across all experiments, there are large performance gaps, reaching 58% in DROP and > 10% in both base and large models for CLARIQ, MULTIRC, PIQA, and QUOREF. This indicates that models trained only on examples with instruction patterns fail to generalize to other task examples, and stresses that instruction bias should be monitored and avoided during data collection. Notably, the gap is smaller for large models than for base ones, showing that large models are less sensitive to instruction bias. This might be attributed to their larger capacity for capturing knowledge and skills during pre-training.
Model performance is overestimated due to instruction bias. We compare the performance on $S^{p}_{test}$ and $S^{-p}_{test}$ of models trained on the full training set (Tab. 4). The average performance across all datasets is higher on examples that exhibit instruction patterns, by ∼ 7% and ∼ 3% for the base and large models, respectively. Specifically, base models perform worse on $S^{-p}_{test}$ than on $S^{p}_{test}$ for all datasets except DROP, in some cases by a dramatic gap of > 15% (e.g., 18.7% in ROPES and 15.7% in QUOREF). In contrast, results for the large models vary across datasets, and the performance gap is generally smaller in magnitude. This shows that model performance is often overestimated due to instruction bias, and reiterates that large models are generally less sensitive to instruction patterns.

Conclusions and Discussion
We identify a prominent source of bias in crowdsourced NLU datasets, called instruction bias, which originates in annotation instructions written by dataset creators. We study this bias in 14 NLU benchmarks, showing that instruction examples used to create NLU benchmarks often exhibit clear patterns that are propagated by annotators to the collected data. In addition, we investigate the effect of instruction bias on model performance, showing that instruction patterns can lead to overestimated performance as well as limit the ability of models to generalize to other task examples.
Based on our findings, we derive three recommendations for future crowdsourced NLU benchmarks: (1) diversify the examples shown in the annotation instructions, (2) monitor the frequency of instruction patterns in the collected data throughout the collection process, and (3) evaluate models separately on instances with and without instruction patterns.

Limitations
This work covers 14 NLU datasets, for which annotation instructions are publicly available. However, most of these datasets are QA datasets. Our analysis can be extended to other NLU task categories, such as Natural Language Inference (NLI) and Relation Extraction (RE).
Our study reveals a concrete bias that skews the collected data distribution toward specific patterns. While the effect of instruction examples on collected data is prominent, it is hard to quantify how different the distribution of crowdsourced examples is from the natural distribution of the task. Concretely, to conduct a study that compares the distributions of crowdsourced versus natural complex reasoning questions, datasets of complex natural questions are needed. However, to the best of our knowledge, as of today, no such datasets exist.
In our analysis, we focused on shallow patterns based on word matching; however, other types of biases are implicit in the text, and exploring them is an interesting future direction. In addition, our analysis of model performance is based on splitting dataset instances according to the dominant pattern. It is possible that more patterns exist, and that the non-pattern subset includes other, less frequent patterns. Hence, exploring the effect of such less frequent patterns on model learning is left for future work.
Last, our work studied the effect of instruction bias on widely used generative models (i.e., T5 and BART); it would be valuable to investigate whether our findings hold for encoder-only models, such as BERT (Devlin et al., 2019).

A Related Work
Prior work largely attributes biases in crowdsourced data to the annotators themselves, e.g., their writing style and background knowledge (Gururangan et al., 2018; Geva et al., 2019). In this work, we show that biases exhibited by annotators start from the crowdsourcing instructions designed by dataset creators.

B Dataset Statistics
Tab. 5 describes the statistics of the train and evaluation sets of the datasets used in our experiments. The selected datasets differ in the number of training samples, the percentage of instruction patterns, and the underlying task.

C Pattern Extraction Method
Here, we describe an example to show how we extract the dominant pattern from the crowdsourcing instructions and subsequently identify the same pattern in the dataset. We try to find recurring word patterns such as "Are you...", "how many points...", "Was... still...", "since... the...".
For example, MC-TACO (event duration) has 3 examples in its crowdsourcing instructions: (1) how long did Jack play basketball?, (2) how long did he do his homework?, and (3) how long did it take for him to get the Visa? In step (a), we analyze the examples manually and find the dominant pattern; here, all examples contain the tri-gram "how long did". In step (b), we generate more possible patterns that are semantically similar to the dominant pattern or have a significant word overlap; here, "how long did" can also appear as "how long was", "how long does", etc. (i.e., "how long AUX"). In step (c), we look for all these possible patterns in the datasets using simple word matching.
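As an illustration of step (c), a regular expression that generalizes the tri-gram to "how long AUX" could look as follows; the list of auxiliaries is illustrative.

```python
import re

# "how long" followed by an auxiliary verb, generalizing the tri-gram "how long did".
HOW_LONG_AUX = re.compile(r"\bhow long (did|does|do|was|were|is|are|will|would)\b", re.IGNORECASE)

questions = [
    "How long did Jack play basketball?",
    "How long was the ceremony?",
    "Who played basketball with Jack?",
]
print([bool(HOW_LONG_AUX.search(q)) for q in questions])  # [True, True, False]
```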

D Pattern Examples
Tab. 8 provides, for each dataset, the instruction patterns and corresponding examples of data instances that exhibit them.

E The Effect of Instruction Examples on Pattern Frequency in Collected Data
To study the effect of biased instruction examples on collected data, we asked NLP graduate students to write five questions for each of (1) temporal reasoning (event duration) and (2) coreference resolution, based on the crowdsourcing instructions of MC-TACO and QUOREF, respectively. For each task, we conducted two surveys, where the instructions either included or did not include examples. We collected responses from 10 participants. The dominant patterns of MC-TACO ("how long") and QUOREF ("What is the name") contribute to only 38% and 8% of the data collected without examples, in contrast to 68% (↑79%) and 32% (↑300%) of the data collected with examples. This indicates that instruction examples bias crowdworkers toward certain patterns, whereas omitting examples leads to more diverse responses.
In addition, the responses collected without examples contain 10 and 9 unique patterns for MC-TACO (event duration) and QUOREF, respectively, in contrast to only 4 and 5 unique patterns in the data collected with examples. This shows that there is substantial linguistic diversity associated with these NLP tasks, which is not reflected in the instruction patterns that get propagated to the corresponding datasets. The task instructions and collected annotations are available at https://github.com/Mihir3009/instruction-bias/blob/main/SURVEY.md.

F Additional Results
Tab. 6 and Tab. 7 show the performance of BART on $S^{p}_{test}$ and $S^{-p}_{test}$ when training only on examples with instruction patterns and on the full training set, respectively. In Tab. 6, there are large performance gaps, reaching 45.9% in MULTIRC and > 20% in both base and large models for QUOREF and PIQA. Overall, the average performance across all datasets is 27.9% and 22.2% higher on $S^{p}_{test}$ for the base and large models, respectively. This indicates that both base and large models often fail to generalize beyond instruction patterns.
From Tab. 7, we see that the average performance across all datasets is higher on examples that exhibit instruction patterns, by ∼ 10.5% for both base and large models. From these results, we conclude that model performance is overestimated by instruction bias.