Unveiling the Multi-Annotation Process: Examining the Influence of Annotation Quantity and Instance Difficulty on Model Performance

The NLP community has long advocated for the construction of multi-annotator datasets to better capture the nuances of language interpretation, subjectivity, and ambiguity. This paper conducts a retrospective study to show how performance scores can vary when a dataset expands from a single annotation per instance to multiple annotations. We propose a novel multi-annotator simulation process to generate datasets with varying annotation budgets. We show that similar datasets with the same annotation budget can lead to varying performance gains. Our findings challenge the popular belief that models trained on multi-annotation examples always lead to better performance than models trained on single or few-annotation examples.


Introduction
The process of creating datasets often involves practical constraints such as time, resources, and budget that limit the number of annotators or experts available for collecting annotations (Sheng et al., 2008). As a result, there is a prevalence of single or few labels per instance (depending on the limited number of annotators) in the collected data. However, training models on these datasets poses challenges to their generalization abilities, primarily because the data lacks diversity. With a scarcity of different perspectives and variations in the training data (Basile et al., 2021; Plank, 2022), models may struggle to learn robust representations and fail to generalize effectively (Nie et al., 2020; Meissner et al., 2021).
To address these challenges, the NLP community has highlighted the advantages of utilizing multi-annotator datasets (Davani et al., 2022) and also emphasized the importance of releasing multi-annotator datasets and associated information (cultural, demographic, etc.) (Sap et al., 2022; Hershcovich et al., 2022). However, this approach introduces its own set of challenges. Collecting data with multiple annotators requires significant time, annotation budget, and annotator expertise to ensure the creation of high-quality datasets with diverse perspectives.
Moreover, with a limited annotation budget, it becomes crucial to determine the optimal number of annotators within the given constraints. This not only helps save annotation time and budget but also ensures efficient utilization of available resources. While some research (Wan et al., 2023; Zhang et al., 2021) has provided insights and suggestions on finding the optimal number of annotators, a definitive solution to this problem has yet to be achieved.
Another challenge is the restricted number of annotations available per instance, typically not exceeding 6-10, even with a large number of recruited annotators (Plank, 2022). This limitation arises from the considerable annotation effort required for a large volume of instances. As a result, when models are trained on such datasets, they only capture the opinions and information of a small subset of the annotator pool. Additionally, certain datasets have not released annotator-specific labels or established mappings to individual annotators (Nie et al., 2020; Jigsaw, 2018; Davidson et al., 2017). However, the trend is gradually shifting, and there is a growing recognition that annotator-level labels should be made available (Prabhakaran et al., 2021; Basile et al., 2021; Denton et al., 2021).
This study aims to tackle the challenge of lacking annotator-specific labels by simulating a multi-annotation process. Through this study, we provide insights into how the inclusion of more annotators can introduce variations in model performance and identify the factors that influence this variation. Considering that previous research (Swayamdipta et al., 2020) has highlighted the influence of individual instance difficulty on model performance, we examine how the addition of more annotations alters the difficulty level of instances and consequently affects model performance.
In summary, our main contributions are:
• We propose a novel multi-annotator simulation process to address the issue of missing annotator-specific labels.
• We demonstrate that increasing the number of annotations per instance does not necessarily result in significant performance gains.
• We also demonstrate that altering the number of annotations per instance has a noticeable impact on the difficulty of instances as perceived by the model and consequently affects model performance.

The Multi-annotated Dataset
In practical scenarios, the annotation process begins by hiring one or more annotators who annotate each instance in the dataset. To enhance the representation of the true label distribution, we have the option to extend this process by recruiting additional annotators. We continue this iterative process until either the annotation budget is exceeded or we observe saturation in the model's performance in predicting the true label distribution. As a result, we obtain multiple annotations assigned to each instance in this multi-annotated dataset.
A multi-annotator dataset $D$ is formally characterized in this paper as a triplet $D = (X, A, Y)$. The set $X$ represents $N$ text instances, denoted as $x_1, x_2, \ldots, x_N$. The set $A$ corresponds to $M$ annotators, represented as $a_1, a_2, \ldots, a_M$. The annotation matrix $Y$ captures the annotations, with rows indexed by $X$ and columns indexed by $A$. In simpler terms, the entry $Y[x_i; a_j]$ stores the label $y_{i,j}$ assigned to instance $x_i$ by annotator $a_j$. Furthermore, an annotator-set $A_k$, which comprises $k$ annotators where $1 \le k \le M$, induces the subset $D_k = (X, A_k, Y[X; A_k])$. This paper refers to $D_k$ as the dataset subset with $k$ annotations per instance. Figure 1 illustrates a toy multi-annotator dataset, showcasing $M$ annotators and $N$ instances, along with its subsets comprising 2 and $k$ annotators.
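The triplet notation can be made concrete with a small sketch. The instance ids, annotator ids, labels, and the helper name `restrict` below are illustrative assumptions, not part of the paper; the sketch only shows one way to store the annotation matrix and to restrict it to an annotator-set of size k.

```python
from typing import Dict, List, Sequence, Tuple

# Toy data for illustration only; ids and labels are assumptions.
X: List[str] = ["x1", "x2", "x3"]          # N = 3 text instances
A: List[str] = ["a1", "a2", "a3", "a4"]    # M = 4 annotators

# One row of labels per instance, one column per annotator.
LABELS = [
    ["E", "E", "N", "E"],  # labels given to x1 by a1..a4
    ["N", "C", "N", "N"],  # labels given to x2 by a1..a4
    ["C", "C", "C", "E"],  # labels given to x3 by a1..a4
]

# Annotation matrix Y: Y[(x_i, a_j)] is the label y_ij.
Y: Dict[Tuple[str, str], str] = {
    (x, a): LABELS[i][j] for i, x in enumerate(X) for j, a in enumerate(A)
}


def restrict(
    instances: Sequence[str],
    annotator_set: Sequence[str],
    annotations: Dict[Tuple[str, str], str],
) -> Tuple[List[str], List[str], Dict[Tuple[str, str], str]]:
    """Return the subset of the dataset that keeps only the given annotators."""
    kept = {(x, a): annotations[(x, a)] for x in instances for a in annotator_set}
    return list(instances), list(annotator_set), kept


# Subset with k = 2 annotations per instance (first two annotators).
X_2, A_2, Y_2 = restrict(X, A[:2], Y)
```

Here `Y_2` holds exactly two labels per instance, which is what the text calls the dataset subset with 2 annotations per instance.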

Simulating the Multi-annotation Process
To the best of our knowledge, existing multi-annotator datasets typically do not include annotator-specific labels. Instead, the available information is limited to the label distribution for each instance (Nie et al., 2020; Jigsaw, 2018; Davidson et al., 2017). For instance, in cases with $M$ annotations per instance and three possible labels, the label distribution is commonly represented by a list $[p, q, r]$, where $p$, $q$, and $r$ are non-negative integers that sum to $M$. To address this constraint, we introduce a simulation process for multi-annotator scenarios that leverages the instance-level label distribution. Our proposed approach (see Algorithm 1) encompasses the following steps:
• Initially, we generate a list of annotations for each instance by considering the actual instance-level label distribution. [Line 1]
• Subsequently, we randomize these annotation lists using a consistent random seed across instances. [Lines 5-6]
• Next, we select the first $k$ annotations from each randomized list, creating the dataset subset $D_k$. [Lines 4-8]
By employing this algorithm, we can generate $k$ annotations per instance, thereby addressing the limitation of annotator-specific labels in existing multi-annotator datasets. By repeating the algorithm with different random seeds or parameters, we can create multiple dataset subsets $D_k$, each containing $k$ annotations per instance. This flexibility enables the generation of diverse subsets, expanding the range of multi-annotator scenarios that can be explored and analyzed in our research.
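The three steps can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, and "a consistent random seed across instances" is read here as re-seeding the generator for every instance, so that equally long lists receive the same permutation.

```python
import random
from typing import Dict, List, Sequence


def get_annotation_list(counts: Sequence[int], class_labels: Sequence[str]) -> List[str]:
    """Step 1: expand label counts such as [p, q, r] into a flat list of labels."""
    annotations: List[str] = []
    for label, count in zip(class_labels, counts):
        annotations.extend([label] * count)
    return annotations


def simulate_subsets(
    label_counts: Sequence[Sequence[int]],
    class_labels: Sequence[str],
    seed: int,
) -> Dict[int, List[List[str]]]:
    """Return {k: annotations per instance, truncated to the first k}."""
    shuffled: List[List[str]] = []
    for counts in label_counts:
        annotations = get_annotation_list(counts, class_labels)
        # Step 2 (assumption): re-seed per instance so the seed is consistent.
        random.Random(seed).shuffle(annotations)
        shuffled.append(annotations)
    # Step 3: keep the first k annotations of every shuffled list.
    max_k = min(len(annotations) for annotations in shuffled)
    return {k: [a[:k] for a in shuffled] for k in range(1, max_k + 1)}


# Two toy instances with M = 3 annotations each over three NLI labels.
subsets = simulate_subsets([[2, 1, 0], [0, 1, 2]], ["e", "n", "c"], seed=0)
```

Calling `simulate_subsets` again with a different `seed` yields another family of subsets, which is how several subsets for the same k could be produced.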

Datasets
We selected the ChaosNLI dataset (Nie et al., 2020) for our study, as it contains the highest number of annotations (100) per instance among the publicly available datasets (Plank, 2022). ChaosNLI is a Natural Language Inference (NLI) task dataset known for its high ambiguity. The ChaosNLI dataset includes the sub-datasets ChaosNLI-S and ChaosNLI-M, which are subsets extracted from the development sets of SNLI (Bowman et al., 2015) and MNLI-matched (Williams et al., 2018), respectively. Another sub-dataset, ChaosNLI-α, is created from the entire development set of AbductiveNLI (Bhagavatula et al., 2019), hereafter referred to as α-NLI. The ChaosNLI dataset consists of 4,645 instances, each annotated with 100 new annotations. Additionally, the dataset already includes 5 old annotations for ChaosNLI-S and ChaosNLI-M, and 1 old annotation for ChaosNLI-α. Subsequently, we create $D_k$'s (see §3) utilizing these datasets and then divide these $D_k$'s into train, development, and test sets using an 80:10:10 ratio.
It is important to clarify that our objective is not to showcase state-of-the-art (SOTA) performance using these models, but rather to demonstrate the variations in performance as we incrementally add annotations to the dataset.

Training Strategies
In this section, we describe two variants of training strategies.
Majority Label (ML): The PLMs are finetuned using the majority label, which is determined by aggregating annotations from the target list of annotations. The training objective aims to minimize the cross-entropy between the output probability distribution and the one-hot encoded majority label.
Label Distribution (LD): The PLMs are finetuned using the label distribution from the target list of annotations (Meissner et al., 2021). The training objective aims to minimize the cross-entropy between the output probability distribution and the target label distribution.
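The difference between the two objectives lies only in the target vector. The sketch below builds both targets from a list of annotations and evaluates the cross-entropy against a fixed model output. The function names and the tie-breaking rule (first class wins) are assumptions made for illustration.

```python
import math
from collections import Counter
from typing import List, Sequence


def label_distribution(annotations: Sequence[str], class_labels: Sequence[str]) -> List[float]:
    """LD target: relative frequency of each class among the annotations."""
    counts = Counter(annotations)
    return [counts[label] / len(annotations) for label in class_labels]


def majority_one_hot(annotations: Sequence[str], class_labels: Sequence[str]) -> List[float]:
    """ML target: one-hot vector of the most frequent label."""
    dist = label_distribution(annotations, class_labels)
    winner = dist.index(max(dist))  # assumption: ties go to the first class
    return [1.0 if i == winner else 0.0 for i in range(len(class_labels))]


def cross_entropy(target: Sequence[float], predicted: Sequence[float], eps: float = 1e-12) -> float:
    """Cross-entropy (natural log) between a target and a predicted distribution."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, predicted))


classes = ["e", "n", "c"]
annotations = ["e", "e", "n", "c"]
model_output = [0.5, 0.25, 0.25]  # hypothetical softmax output

ml_loss = cross_entropy(majority_one_hot(annotations, classes), model_output)
ld_loss = cross_entropy(label_distribution(annotations, classes), model_output)
```

For the same model output, the ML loss only rewards probability mass on the majority label, whereas the LD loss also accounts for the minority annotations.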

Evaluation
To evaluate the performance of our models, we utilize the classification accuracy computed on the test dataset. In the ML setting, the accuracy is computed by comparing the label associated with the highest softmax probability predicted by the model with the majority label derived from the target annotations. In the LD setting, the accuracy is computed by comparing the label corresponding to the highest softmax probability predicted by the model with the label that has the highest relative frequency in the target label distribution.
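Both settings reduce to comparing two argmax indices, as the following sketch illustrates. The function names and the toy numbers are assumptions. In the ML setting the target rows would be one-hot vectors, and in the LD setting they would be label distributions.

```python
from typing import Sequence


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (the first one in case of ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def accuracy(predicted: Sequence[Sequence[float]], targets: Sequence[Sequence[float]]) -> float:
    """Share of instances whose predicted top label matches the target top label."""
    hits = sum(argmax(p) == argmax(t) for p, t in zip(predicted, targets))
    return hits / len(predicted)


# Hypothetical softmax outputs and LD targets for three test instances.
predictions = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3], [0.3, 0.3, 0.4]]
targets = [[0.5, 0.25, 0.25], [0.2, 0.2, 0.6], [0.0, 0.0, 1.0]]
score = accuracy(predictions, targets)
```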

Experimental Settings
Following the approaches described in previous studies (Nie et al., 2020; Meissner et al., 2021), we construct base models by finetuning PLMs (described in §4.2) on the combined train sets of SNLI and [...] We choose hyperparameters from the experimental settings of prior work (Nie et al., 2020; Meissner et al., 2021; Bhagavatula et al., 2019). Our optimization technique involves employing the AdamW optimizer (Loshchilov and Hutter, 2019). More details on hyperparameters can be found in §A.2. To ensure reproducibility, we conduct our experiments using the open-source Hugging Face Transformers library (https://huggingface.co/docs/transformers/; Wolf et al., 2020). Furthermore, all experiments are performed using 2 × NVIDIA RTX 2080 Ti GPUs.

Results and Discussion

Is higher performance always guaranteed by increasing the number of annotations?
Figure 2 presents the accuracy scores as the number of annotations increases. Notably, the trends observed in the performance of ChaosNLI-S, ChaosNLI-M, and ChaosNLI-α challenge the prevailing belief that increased annotations invariably lead to improved performance. Specifically, for ChaosNLI-S and ChaosNLI-M, the accuracy scores exhibit a non-monotonic increasing pattern. In contrast, the trend observed for ChaosNLI-α, particularly with BERT and DistilBERT models, deviates from this expected behavior. In these cases, the accuracy scores show a decreasing trend as the number of annotations increases. Upon examining the RoBERTa accuracy scores for the LD setting in ChaosNLI-S, it is observed that the performance reaches a saturation point between 20 and 80 annotations. This means that increasing the number of annotations beyond this range does not result in significant improvement in the accuracy scores.
Table 2 provides a complementary perspective on the observed trends. It highlights that the minimum performance is not consistently associated with the dataset having the fewest annotations, and vice versa. In the case of ChaosNLI-α with BERT and DistilBERT, it is interesting to note that the optimal performance is achieved with just three annotations. This represents an extreme scenario where a minimal number of annotations can lead to the best performance. In general, these findings shed light on the optimization of our annotation budget. Similarly, the performance gain (maximum minus minimum accuracy) also varies significantly across datasets. The average performance gain for ChaosNLI-M, ChaosNLI-S, and ChaosNLI-α is 0.106, 0.177, and 0.031, respectively. The notable variability in performance gain across different datasets further emphasizes that the impact of increasing annotations on performance improvement is not consistent. It underscores the need to carefully analyze and understand the specific characteristics of each dataset and model combination to ascertain the relationship between annotation quantity and performance.
To provide an explanation for the observed complex behavior, we utilize V-information (Ethayarajh et al., 2022). V-information is a measure that quantifies the ease with which a model can predict the output based on a given input. The higher the V-information, the easier it is for the model to predict the output given the input. Furthermore, V-information cannot be negative unless, for example, the model overfits (see §A.1).
Figure 3 provides a visual representation of the V-information scores for the three datasets across five different PLMs. As anticipated, the V-information scores are higher for the ChaosNLI-S and ChaosNLI-M datasets. Models that exhibit higher V-information scores also tend to yield higher accuracy scores in the LD-based performance evaluation. For instance, RoBERTa outperforms other models (except XLNet, for which the performance is similar) in terms of accuracy for the ChaosNLI-S dataset. The saturation of V-information scores starting at k = 20 for the ChaosNLI-S dataset effectively explains the observed saturation of LD-based accuracy after 20 annotations, as depicted in Figure 2. This phenomenon suggests that the model reaches a point where additional annotations provide diminishing returns in terms of extracting valuable insights from the instances. Therefore, the model's performance ceases to improve significantly beyond this threshold. For the ChaosNLI-α dataset, except for RoBERTa and XLNet (V-information ∈ [0, 0.25], comparatively low), all models yielded approximately zero V-information scores (see footnote 3). This implies that adding more annotations to the ChaosNLI-α dataset does not establish a clear relationship between the input and output label distribution. This observation suggests that, for this particular variant of the dataset, the model might rely on factors other than the provided annotations to make accurate predictions.
The aforementioned findings indicate that not all datasets yield similar performance when trained under the same budget, underscoring the importance of selecting the appropriate dataset for a specific task.Furthermore, these findings emphasize the significance of determining the optimal number of annotators, as the model's performance varies with the increase in annotations.

Does the number of annotations influence the difficulty of instances as perceived by the model?
To investigate this question, we employ the concept of dataset cartography as proposed by Swayamdipta et al. (2020), which leverages training dynamics to distinguish instances based on their (1) confidence, measured as the mean probability of the correct label across epochs, and (2) variability, measured as the standard deviation of that probability across epochs. This analysis generates a dataset map that identifies three distinct regions of difficulty: easy-to-learn, hard-to-learn, and instances that are ambiguous with respect to the trained model. Easy-to-learn (e) instances exhibit consistently high confidence and low variability, indicating that the model can classify them correctly with confidence. Hard-to-learn (h) instances, on the other hand, have low confidence and low variability, indicating the model's struggle to consistently classify them correctly over multiple epochs. Ambiguous (a) instances display high variability in predicted probabilities for the true label. We investigate the proportion of the transitions between these categories with the incorporation of additional annotations. For example, e → a represents the proportion of transitions from the easy-to-learn to the ambiguous category among all transitions. This provides valuable insights into the underlying factors that contribute to the observed improvements or lack thereof in the model's performance.
Footnote 3: Models for k ≤ 3 overfitted, resulting in negative V-Information.
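The cartography statistics and the transition proportions can be sketched as follows. This is an illustration under assumptions: variability is computed as the standard deviation across epochs, following Swayamdipta et al. (2020), while the region thresholds and all function names are hypothetical choices that the text does not specify.

```python
from collections import Counter
from statistics import mean, pstdev
from typing import Dict, Sequence, Tuple


def cartography_stats(gold_probs: Sequence[float]) -> Tuple[float, float]:
    """Confidence (mean) and variability (std) of the true-label probability."""
    return mean(gold_probs), pstdev(gold_probs)


def region(confidence: float, variability: float,
           conf_threshold: float = 0.5, var_threshold: float = 0.2) -> str:
    """Map an instance to e / h / a; both thresholds are illustrative assumptions."""
    if variability >= var_threshold:
        return "a"  # ambiguous: probability fluctuates across epochs
    return "e" if confidence >= conf_threshold else "h"


def transition_proportions(before: Sequence[str],
                           after: Sequence[str]) -> Dict[Tuple[str, str], float]:
    """Share of each category change among all instances that changed category."""
    moves = Counter((b, a) for b, a in zip(before, after) if b != a)
    total = sum(moves.values())
    return {move: count / total for move, count in moves.items()}


# Regions of five toy instances for two annotation budgets.
proportions = transition_proportions(
    ["a", "a", "e", "h", "e"],
    ["e", "e", "a", "h", "e"],
)
```

In this toy example, `proportions[("a", "e")]` is the fraction of all category changes that move an instance from ambiguous to easy-to-learn.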
Figure 4 illustrates an interesting pattern in the ChaosNLI-S and ChaosNLI-M datasets: as the number of annotations increases, a significant proportion of training instances transition from the ambiguous to the easy-to-learn category (a → e). For instance, more than 60% of all transitions between 1 and 10 annotations are a → e transitions. However, beyond 10 annotations, the proportion of a → e transitions does not show a substantial increase. On the other hand, the reverse transition, e → a, is the second most common transition, with an average proportion of 20%. The difference in proportions between a → e and e → a becomes more substantial (at least 29%) as more annotations are added. In the ChaosNLI-M dataset, we observe a higher proportion of a → h transitions than in the ChaosNLI-S dataset. Specifically, over 15% of the ambiguous instances in ChaosNLI-M exhibit a shift towards the hard region, which is more than 50% of similar transitions observed in ChaosNLI-S. We argue that this substantial difference in transition patterns has a direct impact on the performance of models on the ChaosNLI-S dataset compared to ChaosNLI-M. Despite the higher proportion of a → e transitions in ChaosNLI-M compared to ChaosNLI-S, the a → h transitions in ChaosNLI-M mean that performance is consistently better on the ChaosNLI-S dataset across all models analyzed.

ChaosNLI-α exhibits distinct trends across various models. Specifically, in the case of BERT and DistilBERT, where accuracy scores decline as the number of annotations increases (see Figure 2), we witness significant proportions of e → a (∼80%) and a → h (∼43%) transitions, respectively. These transitions suggest that the models struggle to comprehend the instances and classify them with reduced confidence. For XLNet and ALBERT, the combined proportion of the low-confidence transitions e → a and a → h either surpasses or equals the proportion of the high-confidence transition a → e. RoBERTa behaves on ChaosNLI-α as it does on ChaosNLI-S and ChaosNLI-M.
These results suggest that adding more annotations does affect the difficulty of instances, thereby affecting the performance of the model.

Related Works
Human disagreements in annotations. Traditional approaches like majority voting or averaging can overlook important nuances in subjective NLP tasks, where human disagreements are prevalent. To address this issue, multi-annotator models treat annotators' judgments as separate subtasks, capturing the distribution of human opinions, which challenges the validity of models relying on a majority label with high agreement as ground truth (Davani et al., 2022; Nie et al., 2020). Human variation in labeling, which is often considered noise (Pavlick and Kwiatkowski, 2019), should be acknowledged to optimize and maximize machine learning metrics, as it impacts all stages of the ML pipeline (Plank, 2022). Incorporating annotation instructions that consider instruction bias (Parmar et al., 2023), which leads to the over-representation of similar examples, is crucial. This bias can limit model generalizability and performance. Future data collection efforts should focus on evaluating model outputs against the distribution of collective human opinions to address this issue. All of the above works study annotator disagreements and how they affect the performance of models on downstream tasks. In our work, building on the effect of disagreements on model performance, we investigate how model performance varies as we increase the number of annotations per instance, i.e., as we vary the annotator disagreement. Overall, we try to answer the question: do more annotations per instance lead to better performance, or is it the other way around?
Annotation under restricted annotation budget. Prior studies have also investigated how to achieve optimal performance in natural language processing (NLP) models under restricted annotation budgets. Sheng et al. (2008) examined the impact of repeated labeling on the quality of data and model performance when labeling is imperfect and/or costly. Bai et al. (2021) framed domain adaptation with a constrained budget as a consumer choice problem and evaluated the utility of different combinations of pretraining and data annotation under varying budget constraints. Zhang et al. (2021) explored new annotation distribution schemes, assigning multiple labels per example for a small subset of training examples, and proposed a learning algorithm that efficiently combines signals from uneven training data. Finally, Chen et al. (2022) proposed an approach that reserves a fraction of annotations to explicitly clean up highly probable error samples to optimize the annotation process. All these studies contribute to the understanding of how to maximize the performance of NLP models under restricted annotation budgets. Our study aims to address a specific question within this context: assuming a fixed annotation budget, which dataset would yield the highest performance?
Previous studies have demonstrated that annotation disagreements affect model performance. However, our study aims to explore how performance varies as we change the level of disagreement. We consider ideas from Zhang et al. (2021), who proposed a learning algorithm that can learn from training examples with different amounts of annotation (5-way, 10-way, 20-way) in a multi-label setting, but we expand the number of annotations from 1-way to 100-way and train our model in a label distribution setting rather than in a multi-label setting. To investigate the reasons for performance variation as we increase the number of annotations, we incorporate the dataset difficulty ideas of Swayamdipta et al. (2020) and Ethayarajh et al. (2022). While previous studies focused on building datasets and models and their impact on performance when the annotation budget is restricted, our work examines whether increasing the annotation budget necessarily leads to improved model performance. Overall, our study aims to demonstrate that, even with a smaller annotation budget than the upper bound, it is possible to match or exceed the performance obtained at the upper bound, thereby saving annotation budget and time. Our findings provide insights into optimizing annotation budgets.

Conclusion
In this paper, we introduced a novel approach to handle the absence of annotator-specific labels in the dataset through a multi-annotator simulation process. Additionally, we investigated the impact of varying the number of annotations per instance on the difficulty of instances and its effect on model performance. Our results highlighted that increasing the number of annotations does not always lead to improved performance, emphasizing the need to determine an optimal number of annotators. This has important implications for optimizing annotation budgets and saving time. Our findings provide valuable insights for optimizing annotation strategies and open up new possibilities for future research in this direction.

Limitations
The current study acknowledges several limitations that deserve attention. Firstly, the experiments were conducted using small-size language models due to resource constraints. It is important to recognize that employing larger language models, such as BLOOM, GPT, and others, could potentially yield different outcomes and should be explored in future research. Furthermore, the scope of the discussion is constrained by the availability of datasets with a large number of labels per instance, leading to the utilization of the ChaosNLI dataset (Nie et al., 2020). Consequently, the generalizability of the findings to other datasets, if they emerge in the future, might be restricted.

Appendices

A More Details
A.1 V-Information
V-Information (Kulmizev and Nivre, 2023; Ethayarajh et al., 2022), where V represents specific model families such as BERT, GPT, etc., measures the ease with which model V can predict the output variable Y given the input X. The higher the V-Information, the easier it is for the model V to predict the output variable Y given X. To measure V-Information, we use the predictive V-entropy:

$$H_{\mathcal{V}}(Y) = \inf_{f \in \mathcal{V}} \mathbb{E}\left[-\log_2 f[\varnothing](Y)\right]$$

and the conditional V-entropy:

$$H_{\mathcal{V}}(Y \mid X) = \inf_{f \in \mathcal{V}} \mathbb{E}\left[-\log_2 f[X](Y)\right]$$

In simple terms, our goal is to find the $f \in \mathcal{V}$ that maximizes the log-likelihood of the label data with and without the input X. Using these two quantities, V-Information can be calculated using the formula:

$$I_{\mathcal{V}}(X \to Y) = H_{\mathcal{V}}(Y) - H_{\mathcal{V}}(Y \mid X)$$

While V-Information functions as an aggregated measure calculated for the whole dataset, Ethayarajh et al. (2022) extended this measure to a new measure called Pointwise V-Information (PVI), which allows for the calculation of the difficulty of individual instances. The higher the PVI, the easier the instance is for V in the given distribution. It is given by the formula:

$$\mathrm{PVI}(x \to y^*) = -\log_2 f'_\theta[\varnothing](y^*) + \log_2 f_\theta[x](y^*)$$

where $f_\theta, f'_\theta \in \mathcal{V}$ are models trained with and without the input $x \in X$, respectively, and $y^*$ refers to the gold label. Unlike V-Information, PVI can be negative, indicating that the model predicts the majority class better without considering the input $x$ than when considering it.
Refer to Table 6 for a sample of instances from the ChaosNLI-α dataset with very low PVI, which demonstrates the high ambiguity in these instances.
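As a rough illustration of these quantities, the sketch below computes PVI from the gold-label probabilities of two already trained models and estimates V-Information as the difference of the two empirical entropies on held-out data, as in Ethayarajh et al. (2022). The function names and the probabilities are hypothetical; training the two models is outside the scope of the sketch.

```python
import math
from statistics import mean
from typing import Sequence


def pvi(p_with_input: float, p_without_input: float) -> float:
    """PVI in bits from the gold-label probabilities of the two models."""
    return -math.log2(p_without_input) + math.log2(p_with_input)


def v_information(p_with_input: Sequence[float],
                  p_without_input: Sequence[float]) -> float:
    """Empirical V-entropy of Y minus empirical conditional V-entropy of Y given X."""
    h_y = mean(-math.log2(p) for p in p_without_input)
    h_y_given_x = mean(-math.log2(p) for p in p_with_input)
    return h_y - h_y_given_x


# Hypothetical gold-label probabilities for two held-out instances.
with_x = [0.5, 0.5]         # model that sees the input
without_x = [0.25, 0.25]    # model trained on an empty input

easy = pvi(0.5, 0.25)       # input helps: positive PVI
hard = pvi(0.25, 0.5)       # input hurts: negative PVI
estimate = v_information(with_x, without_x)
```

The estimate equals the mean PVI over the same instances, which is why a dataset dominated by negative-PVI instances can produce a negative V-Information estimate.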

A.2 Hyperparameter Details
Referring to Table 4, we initially trained the models using the hyperparameters provided by Nie et al. (2020). However, during our experiments, we observed signs of overfitting to our datasets. Consequently, we adjusted the hyperparameters, leading to the set provided in the table. More hyperparameter details can be found in Tables 3 and 5.

A.3 Detailed Plots for Figure 2
For a more comprehensive view of the phenomenon where performance decreases with an increasing number of annotations, we provide detailed plots for BERT and DistilBERT, as shown in Figure 5. While Figure 2 maintains a consistent y-axis for datasets ChaosNLI-(S, M, and α), these plots feature distinct axes.

Algorithm 1
Creation of Annotator Datasets
Input:
  X: set of N instances
  CL: list of C class labels
  LC: label counts of shape N × C
  M: number of annotators
Output: D′ = {D_1, D_2, ..., D_M}
1: AL ← GETANNOTATIONLIST()
2: Initialize an empty set D′
3: for k ← 1 to M do

Figure 2: The figure displays accuracy scores for various models across k for datasets ChaosNLI-S, ChaosNLI-M, and ChaosNLI-α. For every k on the x-axis, the mean and standard deviation of the accuracy scores of models trained on 10 $D_k$'s are displayed. The detailed plots for ChaosNLI-α BERT and ChaosNLI-α DistilBERT can be found in Figure 5 in the Appendix.

Figure 3: The figure displays the V-Information values for various models in the LD setting. A higher value indicates that the data is easier for the respective model V with respect to extracting information from it. These values can be compared across datasets and models.

Figure 4: The figure provides a visual representation of the transition of instances between different categories during training as the number of annotators increases from $A_1$ to $A_{10}, \ldots, A_{100}$. e → a indicates the percentage of instances that transitioned from category e to a.

Figure 5: The figure displays accuracy scores for BERT and DistilBERT across k for dataset ChaosNLI-α. For every k on the x-axis, the mean and standard deviation of the accuracy scores of models trained on 10 $D_k$'s are displayed.

Table 1: Dataset Statistics. Table 1 provides detailed statistics of the datasets used in our study.

Table 2: The performance of various models in both the ML and LD settings is presented in this table. Values indicate accuracy, and values in braces indicate k. The values highlighted in bold indicate the optimal number of annotators where the performance reaches its peak compared to the maximum annotation budget allocated (100). Conversely, the highlighted values in the minimum accuracy column indicate the lowest performance achieved compared to the minimum budget allocated (1). This information provides insights into the impact of the number of annotators on the model's performance.
k, where k ∈ [1, 100]. For each k, we report average performance scores over test sets of 10 $D_k$'s (see §3).

Table 5: Hyperparameters for finetuned models for datasets ChaosNLI-S and ChaosNLI-M