Disagreement Matters: Preserving Label Diversity by Jointly Modeling Item and Annotator Label Distributions with DisCo

This paper contains content that can be offensive or disturbing. Annotator disagreement is common whenever human judgment is needed for supervised learning. It is conventional to assume that one label per item represents ground truth. However, this obscures minority opinions, if present. We regard "ground truth" as the distribution of all labels that a population of annotators could produce, if asked (and of which we only have a small sample). We introduce DisCo (Distribution from Context), a simple neural model that learns to predict this distribution. The model takes annotator-item pairs, rather than items alone, as input, and performs inference by aggregating over all annotators. Despite its simplicity, our experiments show that, on six benchmark datasets, our model is competitive with, and frequently outperforms, other, more complex models that either do not model specific annotators or were not designed for label distribution learning.


Introduction
Human feedback remains a critical component as machine-learning-based systems play an ever larger role in society and our daily lives. However, most systems that learn from this feedback assume that there is a single, correct answer in every case. Yet, due to differences in perception, context, experiences, and attitudes that vary from person to person, as well as demographic differences, humans often disagree on what the "right response" should be. For instance, Binns et al. (2017) showed that female annotators frequently disagree with males on what constitutes offensive speech.
This prevalence of disagreement in human-labeled data has made annotator modeling a popular research problem. In its most elementary form, each item is assigned the majority label. More sophisticated approaches seek to understand annotator behavior (Dawid and Skene, 1979; Rodrigues and Pereira, 2018; Lakkaraju et al., 2015; Gordon et al., 2022) via machine learning. Yet the vast majority of these approaches require or assume that ground truth is a single (but unknown) label (or collection of labels) and that any deviation from the ground truth label is indicative of poor quality. Consequently, most models learn to discriminate between "good" and "bad" annotators (Lakkaraju et al., 2015; Rodrigues and Pereira, 2018) and resolve disagreement out of existence.
However, on problems such as hate speech detection, language complexity, or machine translation, disagreement may actually signify the views of vulnerable communities that should be preserved (Gray and Suri, 2019; Klenner et al., 2020; Basile, 2020; Prabhakaran et al., 2021), or even made predictable (Lakkaraju et al., 2015). A major barrier to achieving these goals is annotator sparseness. Human annotators are an expensive and often limiting factor in a learning loop. It is usually not feasible to collect enough annotations for each item to have confidence in them as a representative sample of the underlying population's response.
In this paper, we explore the idea that, in key settings, ground truth is more plainly seen as a distribution of labels representing the opinions and beliefs of a (partially observed) population of annotators, rather than as a single label (or multi-label). In the extreme (as we do here), this approach ignores the near-certainty that some annotators are unreliable. However, modeling annotator responses at the population level as precisely as possible is a transparent, data-conservative approach to preserving annotator information, as opposed to the conventional approach of resolving disagreement before learning. Later in this work, we discuss extensions for modeling annotator reliability.
We propose a new neural model, DisCo (Distribution from Context), designed to address the annotator sparsity problem. At training time, the model takes as input a training example and an annotator id, and outputs three simultaneous predictions: the label the annotator gives the example, the distribution of all labels the example received (from all annotators who responded to it), and the distribution of all labels (over all examples) that the annotator provided. At inference time, it takes an unlabeled item as input and predicts the distribution of labels that it would receive from the population of annotators. It ties together two rather successful prior approaches: label distribution learning (LDL) (Geng, 2016) and item-annotator modeling (Dawid and Skene, 1979). Our models are publicly available.¹ In this work, we address the following questions:
RQ1 How does the performance of DisCo compare to that of LDL approaches that do not model annotators?
RQ2 How does the performance of DisCo compare to that of non-LDL approaches that model annotators?
To answer these questions, we evaluate DisCo against three competitive models that exemplify, respectively, the conventional ground truth approach, a label distribution approach without annotator modeling, and an approach that models annotators but, unlike DisCo, is not purpose-built for distributional ground truth. We test these models on six benchmark datasets that contain annotator-item assignments and annotator-level labels. We evaluate our models against two different gold standards: the most frequent label and the label distribution.

Related Work
Our work is philosophically aligned with well-documented analyses of inherent annotator disagreement (Davani et al., 2021; Prabhakaran et al., 2021; Pavlick and Kwiatkowski, 2019) and annotator bias (Field and Tsvetkov, 2020). However, we go beyond simply analyzing the disagreement. Rather, we seek to leverage the distribution of annotator responses as a signal to be learned for its own sake.
The study of annotator disagreement has a long history, coincident with the emergence of data-driven behavioral research (Cohen, 1960). Dawid and Skene (1979) introduced item-annotator tableau models. They use the multiple labels associated with each data item and each annotator to jointly estimate the ground truth label of each item as well as the error rate of each annotator. Their approach uses only the labels, not the data item features associated with them, and so, alone, this method cannot outperform supervised learning. Rather, it is used as the first of a two-step learning process, where the second step can be any supervised learning algorithm.
Later researchers put this model on a fully Bayesian foundation (Raykar et al., 2010; Kim and Ghahramani, 2012) or considered more complex models of annotators, ground truth, or both (Whitehill et al., 2009; Northcutt et al., 2019). Notably (as spam is a common problem in crowdsourced label sets), several investigators distinguish between honest and dishonest annotators (Raykar and Yu, 2012; Hovy et al., 2013). More recently, investigators have studied clustering as an unsupervised approach to discover annotators with similar behavior (Venanzi et al., 2014; Lakkaraju et al., 2015). Yet all of these approaches are still based on the assumption that each item is associated with a single ground truth label.
Here, the goal is to predict the distribution of labels associated with an item rather than a single ground truth label. It is relatively natural, in this setting, to consider clustering together related data items in order to improve the ground truth estimates of label distributions, as several prior efforts have done, either in the feature space of the items (Zheng et al., 2018; Zeng et al., 2020; Xu et al., 2021) or directly in the label space of the items themselves (Liu et al., 2019b,a; Weerasooriya et al., 2020). Note that models that cluster only in the label space can only be used as the first step in a two-step supervised learning process (for the same reason that the Dawid and Skene model can only be used in this way).
Our work is most closely aligned with others who seek not only to gain understanding of annotator disagreement, but to predict it for its own sake. CrowdTruth (Aroyo and Welty, 2013; Dumitrache et al., 2018) views truth in crowdsourcing as a function of the data, the response space, and the workers who annotate the data. Gordon et al. (2022) study modeling and predicting annotator behavior for specific demographic groups; their approach is based on a recommender system. Wan et al. (2023) also proposed a model to learn and predict labels using annotator demographics. In contrast to their work, DisCo is able to jointly model and learn from annotator behavior, their annotations, and the content of the data itself.

Data
Figure 1 summarizes the notation that we use to describe our data. Let x ∈ R^{J×1} be the J-dimensional (column) feature vector for a particular data item, and let X ∈ R^{J×M} be the design matrix, i.e., the entire collection of all M data items of a dataset. Y ∈ {0, 1, ..., Q}^{N×M} is the dataset's annotator response matrix, where each column y_{•,m} of Y corresponds to a data item, each row y_{n,•} to an annotator, and each entry y_{n,m} is one label in {1, 2, ..., Q}, or 0, indicating "no response" for that annotator-item pair. Note that, in practice, each item commonly has ≤ 5 labels, so Y is typically a sparse matrix. However, in principle, each annotator could label any item if asked.
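As a toy illustration of this notation (the values and sizes below are our own, chosen for exposition only), a response matrix with Q = 3 labels, N = 4 annotators, and M = 3 items might look like:

```python
import numpy as np

# Rows = annotators, columns = items; entries in {0, 1, 2, 3},
# where 0 marks a missing annotator-item pair.
Y = np.array([
    [2, 0, 1],
    [2, 3, 0],
    [0, 3, 1],
    [1, 0, 0],
])

# As in practice, many annotator-item pairs are unobserved, so Y is sparse.
sparsity = (Y == 0).mean()
```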
We are interested in distributions over annotator responses, for any slice of Y, horizontal (denoted y_{n,•}) or vertical (denoted y_{•,m}), as well as the response of an individual annotator to an individual item (denoted y_{n,m}).
We also view the responses in each slice as a probability distribution over the space of possible responses. Let # denote an operator that converts a (horizontal or vertical) slice y into a vector #y ∈ [0, 1]^Q, with Σ_i #y_i = 1, representing the frequency of each response in {1, 2, ..., Q} as a probability distribution. So, e.g., if there are three responses of "2" out of 10 responses total in y_{•,m}, then #(y_{•,m})_2 = 0.3.
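A minimal sketch of the # operator in Python (the function name is our own; the "no response" code 0 is excluded from the counts, as in the definition above):

```python
import numpy as np

def freq_dist(slice_y, num_labels):
    """Convert a slice of the response matrix Y into a label distribution.

    slice_y: 1-D array of responses in {0, 1, ..., Q}, where 0 means
    "no response" and is excluded from the distribution.
    Returns a length-Q vector of label frequencies summing to 1.
    """
    responses = slice_y[slice_y > 0]                       # drop non-responses
    counts = np.bincount(responses, minlength=num_labels + 1)[1:]
    return counts / counts.sum()

# Example from the text: 3 responses of "2" out of 10 total gives mass 0.3
y_col = np.array([2, 1, 2, 3, 1, 1, 2, 3, 3, 1])
dist = freq_dist(y_col, num_labels=3)                      # -> [0.4, 0.3, 0.3]
```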
We conducted our experiments on the few publicly available human-annotated datasets with annotator assignments.² See Table 1 for a summary of the datasets.
DisCo: A Neural Probabilistic Model for Estimating Label Distributions

DisCo (Figure 2) stands for "Distribution from Context," because it takes two inputs, an item x_{•,m} and an annotator a_n, and then learns to jointly predict the annotator's response to the item, y_{n,m}, the distribution of all responses to the item, #y_{•,m}, and the distribution of all responses the annotator provides (to all items), #y_{n,•}. Additionally, we intend the name to invoke the inclusive, diversity-celebrating spirit of the early disco movement, as preserving annotator diversity is the primary motivation behind the design. Note that, because the annotators in our datasets are completely anonymous (except for their sets of responses), we represent a_n as a one-hot vector 0^{n−1} 1 0^{N−n}. In future work, we hope to have annotator features associated with key attributes believed to drive disagreement, such as age, race, gender, ethnicity, political affiliation, etc. Also, because we only deal with vertical slices of X, for clarity we denote x_{•,m} as x_m. We denote the machine predictions of y_{n,m}, #y_{•,m}, and #y_{n,•} as z_y, z_yI, and z_yA, respectively.
Although, strictly speaking, only z_y is needed for prediction, (x_m, a_n) represents the intersection of a column and a row of the label matrix Y, and z_yI and z_yA represent the marginal distributions associated with this column and row, respectively. This pairing also provides the same context during training that many of the established item-annotator models rely on (Dawid and Skene, 1979). Moreover, items and annotators tend to cluster in label distribution space (Lakkaraju et al., 2015; Venanzi et al., 2014), and so backpropagating gradients from the Kullback-Leibler (KL) terms placed on these distributions (as we describe later) acts as a form of regularization tailored to distributional labels. By aggregating labels from related items and annotators, we believe this approach also addresses label sparsity.

[Table 1: Summary of the datasets that we conduct our experiments with. The datasets are: GoEmotion (D_GE), LabelMe (D_LM), Jobs (D_JQ1-3), and SBIC Intent (D_SI). All of our datasets contain posts that are in English or are based on image data that is already processed (D_LM). [A] We calculated the mean entropy per data item (respectively, annotator). [B] We calculated the entropy of the mean label distribution over all data items (respectively, annotators). Entropy is calculated via the natural logarithm (the units are nats). See Section "Data" for more details.]

[Figure 2 caption, truncated: ...takes as input x_m and a one-hot encoding a_n of an integer identifier n, and is ultimately trained to output a set of three probability distributions, namely, a vector of class probabilities z_y, a distribution of labels from all annotators z_yI, and a distribution of labels from all items z_yA. Notice that x_m and a_n are first each embedded into their own respective sub-spaces (z_I and z_A) before they are combined through a vector combination operator (such as concatenation).]
In order to facilitate tractable inference and parameter learning, we opted to craft a probabilistic encoder-decoder architecture. DisCo is defined by a set of synaptic weight parameter matrices housed in two constructs, Θ_e = {W_I, W_A, W_P, W_E} and Θ_d = {W_yI, W_yA, W_y} (bias vectors omitted for clarity), where Θ_e contains the encoder parameters and Θ_d contains the decoder parameters. The model is designed, for each data item feature vector x_m and annotator a_n pair, to estimate a set of targets, i.e., the item label distribution #y_{•,m}, the annotator label distribution #y_{n,•}, and the annotator's individual label y_{n,m}.
The output of the encoder is the latent representation of data items and annotators; note that the data item is projected into the space z_I while the annotator identification integer is embedded into the space z_A. The encoder, which takes as input the data item feature vector x_m ∈ R^{J×1} (where J is the dimensionality of the item feature space) and the annotator identifier a_n, computes the following output:

    z_I = W_I • x_m,    z_A = W_A • a_n    (1)
    z_P = W_P • [z_I, z_A]    (2)
    z_E = φ(W_E • z_P) + z_P    (3)

where • denotes matrix multiplication, [a, b] represents a vector combination operation applied to input vectors a and b (such as concatenation or element-wise summation), φ is a nonlinear activation, and z_I ∈ R^{J_I×1} and z_A ∈ R^{J_A×1}, where J_I and J_A are their respective embedding dimensionalities. An additional linear projection is applied to the combined item and annotator embeddings via the matrix W_P to reduce the dimensionality further, before the representation is run through one more non-linear transform to obtain the encoder output z_E ∈ R^{J_P×1}. Notice that a residual connection has been introduced in Equation 3 to improve gradient flow during model training.
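A minimal NumPy sketch of this encoder (the dimensions, activation choice, and random weights are illustrative stand-ins, not the tuned configuration; concatenation is used as the combination operator):

```python
import numpy as np

rng = np.random.default_rng(0)
J, N = 384, 50                   # item feature dim, number of annotators
J_I, J_A, J_P = 128, 32, 64      # embedding / projection sizes (illustrative)

# Encoder parameters (bias vectors omitted, as in the paper)
W_I = rng.normal(size=(J_I, J))          # item embedding matrix
W_A = rng.normal(size=(J_A, N))          # annotator embedding matrix
W_P = rng.normal(size=(J_P, J_I + J_A))  # projection of combined embeddings
W_E = rng.normal(size=(J_P, J_P))        # final encoder transform

def phi(v):
    # tanh is one of the activation choices explored during tuning
    return np.tanh(v)

def encode(x_m, a_n):
    z_I = W_I @ x_m                          # item sub-space
    z_A = W_A @ a_n                          # annotator sub-space
    z_P = W_P @ np.concatenate([z_I, z_A])   # [z_I, z_A] via concatenation
    z_E = phi(W_E @ z_P) + z_P               # residual connection
    return z_E

x_m = rng.normal(size=J)      # stand-in item feature vector
a_n = np.eye(N)[7]            # one-hot annotator id
z_E = encode(x_m, a_n)
```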
The decoder, which takes as input the latent code z_E produced by the encoder, computes its outputs (three different label distribution estimates) as follows:

    z_y = softmax(W_y • z_E),  z_yI = softmax(W_yI • z_E),  z_yA = softmax(W_yA • z_E)    (4)

where softmax(v)_j = exp(v_j) / Σ_k exp(v_k) retrieves the jth value/element of the vector v. Note that z_y is interpreted as P(y_{n,m} | x_m, a_n), and can be seen as a Bayesian distribution over annotator n's response to x_m. In contrast, z_yI represents #y_{•,m} (the mth column of Y, normalized to sum to one), and z_yA represents #y_{n,•} (also normalized). Thus, these two outputs can be interpreted as frequentist representations of item m's and annotator n's responses, respectively.³
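Continuing the sketch above, the three decoder heads are softmax layers over the shared code z_E (the weights below are illustrative random stand-ins):

```python
import numpy as np

rng = np.random.default_rng(1)
J_P, Q = 64, 4                   # encoder code size, number of labels

# Decoder parameters (illustrative stand-ins)
W_y  = rng.normal(size=(Q, J_P))
W_yI = rng.normal(size=(Q, J_P))
W_yA = rng.normal(size=(Q, J_P))

def softmax(v):
    e = np.exp(v - v.max())      # subtract max for numerical stability
    return e / e.sum()

def decode(z_E):
    z_y  = softmax(W_y  @ z_E)   # P(y_{n,m} | x_m, a_n)
    z_yI = softmax(W_yI @ z_E)   # estimate of #y_{.,m}, the item distribution
    z_yA = softmax(W_yA @ z_E)   # estimate of #y_{n,.}, the annotator distribution
    return z_y, z_yI, z_yA

z_y, z_yI, z_yA = decode(rng.normal(size=J_P))
```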

Out-of-Sample Inference
After DisCo has been trained according to the objective function above (Equation 8), it conducts inference over out-of-sample items in a slightly different manner than described in Equations 1-3. Specifically, we present the form that this takes when only an item x_m is presented to the model (we later discuss the case when not only an item x_m but also its associated label distribution vector y_{•,m} is available).
Inference under our proposed model entails using its knowledge of all annotators encountered in the training set to make multiple predictions for a newly encountered data item x_m, and then aggregating across this set in order to produce a predicted label or label distribution vector. Concretely, this means that our model will emit N predictions for x_m, i.e., one prediction per annotator embedding stored in its internal memory W_A. Formally, instead of using Equations 1-3, we conduct inference as follows:

    Z_I = (W_I • x_m) • 1_c    (5)
    Z_A = W_A    (6)
    Z_P = W_P • [Z_I, Z_A],  Z_E = φ(W_E • Z_P) + Z_P    (7)

where 1_c = {1}^{1×N} is a row vector of ones meant to be multiplied with a column vector to yield a matrix of shape J_I × N (meaning the vector result of (W_I • x_m) is copied into each column of the output matrix). When using Equation 4 after computing Z_E via Equations 5-7, the resulting outputs z_y, z_yI, and z_yA are matrices, each containing N columns (one distribution vector per annotator).
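The 1_c construction is just column replication; a small NumPy sketch (shapes are illustrative):

```python
import numpy as np

J_I, N = 5, 4
z_I = np.arange(J_I, dtype=float)    # stand-in for the vector W_I @ x_m
ones_row = np.ones((1, N))           # the row vector 1_c
Z_I = z_I[:, None] @ ones_row        # J_I x N: z_I copied into each column
```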
If one desires to use DisCo to produce a final predicted label for item x_m, then the arg max of each column in z_y is taken to produce a list of integer labels, and the mode is taken over this final set of model-generated class integers. If one desires a single label distribution vector for item x_m, then the expectation (mean) is calculated across the columns of z_y.
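Both aggregation modes can be sketched as follows, given the N per-annotator columns of z_y (here a random stand-in matrix):

```python
import numpy as np

rng = np.random.default_rng(2)
Q, N = 4, 50
z_y = rng.dirichlet(np.ones(Q), size=N).T    # Q x N: one distribution per annotator

# Single-label prediction: per-annotator argmax, then the mode
per_annotator = z_y.argmax(axis=0)           # N integer labels
values, counts = np.unique(per_annotator, return_counts=True)
predicted_label = values[counts.argmax()]

# Label-distribution prediction: mean across annotator columns
predicted_dist = z_y.mean(axis=1)            # length-Q vector summing to 1
```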

Experimental Setup
Before conducting the research described here, we consulted with our institutional review board(s). They determined that it did not constitute human subjects research, primarily because the data is publicly available and secondary. Beyond that, all authors have basic training on conducting human subjects research from CITI.⁴ Moreover, we do not reveal any apparent personal identifiers in the data that we use.
In cases when the original data splits are not provided, we use a 50/25/25-percent train/dev/test split.
For the natural language datasets D_GE, D_JQ1-3, D_MR, and D_SI, we used SBERT (Reimers and Gurevych, 2019) with the pretrained paraphrase-MiniLM-L6-v2 model to generate sentence embeddings as our feature vectors. The model generates embeddings in a 384-dimensional space. For D_LM, we use features that are distributed with the dataset; these are pre-encoded using VGG-16 (Simonyan and Zisserman, 2015).
There are relatively few publicly available datasets that provide label distributions, rather than single labels or label sets. There are even fewer that say which annotators labeled which items (i.e., that provide the annotator label matrix Y). These annotator assignments (or annotator-level labels (Prabhakaran et al., 2021)) are essential for modeling annotators. Thus, we chose our comparison models from those that had been previously tested on data with annotator assignments, even if some of the models in question do not use them. One model (CNN) is a baseline that does no pre-processing or modeling of the labels. Another (MM+CNN) is LDL-aware, but does not explicitly model annotators. The third (CL) models annotators, but is not explicitly designed for LDL. In the Appendix, we present a short description of our baselines.

DisCo
Our model is formally described earlier in the paper. The item (z_I) and annotator (z_A) embeddings are combined by setting [z_I, z_A] to be vector concatenation. We furthermore regularized model parameters during training by running a Bayesian hyperparameter search (Biewald, 2020) on each dataset across 100 different model configurations. The Adam adaptive learning rate method (Kingma and Ba, 2014) was used to optimize parameters using gradients calculated over mini-batches of 256 samples for 200 epochs. The model parameters that we tuned were: (1) drop-out probabilities, varying from p = 0 to p = 0.99, applied to the outputs of z_P and z_E, (2) random orthogonal (Saxe et al., 2013) versus Gaussian versus uniform matrix initialization, (3) the choice of activation function among softsign, elu, relu, relu6, and tanh, (4) annotator and item encoder weights varying from 0 to 2, (5) L1 and L2 regularization weights ranging from 1e-7 to 0.001, and (6) hidden layer sizes ranging from 64 to 256.
We evaluate these models as a single-label learning (SL) problem using accuracy, F1-score, precision, and recall measured over the test set. We further evaluate our models with respect to the label distribution learning (LD) problem using KL-divergence.

Results and Discussion
Table 2 presents our main results (additional datasets and results are included in Appendix A.1). Since the main goal of this paper is to learn to predict the distributions of annotator responses, we focus first on KL-divergence. RQ1 asks about the performance of DisCo compared to other LDL approaches that do not model annotators, i.e., CNN, Max Ent, and MM+CNN in our experiments. As per Table 2, we see mixed results, with DisCo performing best on three datasets and MM+CNN performing best on three. Recall that MM+CNN uses multinomial mixture model clustering and pools together all labels from all items in a given cluster. Compared to DisCo, this tends to result in each item having a much denser set of labels, and this may explain why it performs so well. We speculate that there is a "sweet spot" between including just enough labels from related items/annotators but not so many that the labels are irrelevant, and that this spot varies from dataset to dataset. So when more labels are needed, MM+CNN does best, but when fewer labels are optimal, DisCo wins. Indeed, the datasets on which MM+CNN performs best tend to have more label choices than the datasets where DisCo performs best. It would seem that the more label choices there are, the more labels one needs to collect in order to get a representative sample of annotator disagreement. In addition, note that DisCo takes about half as much time to train as MM+CNN.
RQ2 asks about the performance of DisCo versus approaches that model annotators but are not LDL-based, i.e., CL and DS+CNN in our experiments. In contrast to RQ1, DisCo outperforms all of these models across all datasets.
Learning label distributions can result in better single-label learning (Venanzi et al., 2014; Liu et al., 2019a; Weerasooriya et al., 2020). When evaluated as a single-label problem, DisCo beats all of the other models on all but one dataset (MM+CNN beats it in terms of accuracy on D_GE). Thus, D_GE is the only dataset on which another model surpasses DisCo on both an SL and an LD measure. Notably, it is one of the largest datasets with offensive language content, with a large label selection and a high number of items per annotator.
To get a sense of the impact of the W_yI and W_yA aggregating layers of DisCo, we included results using the model with those layers removed (A = I = 0). The full model shows substantial improvement over this ablation in nearly all tests.
To gain a qualitative sense of our results, we inspected several of the test splits of the more interesting datasets for examples on which DisCo assigned nearly even weights to the two highest-scoring labels. The SBIC Intent (D_SI) dataset is one on which DisCo performed best. It is also the one for which DisCo would be expected to yield the most interesting results, as annotator disagreement could be quite significant. Many of these examples were jokes, such as "What do you get when you mix human DNA and goat DNA? Kicked out of the petting zoo." with (Intended, Not-Intended) = (0.35, 0.65), or "why was the lord of the rings trilogy filmed in new zealand? cause the us were missing the two towers." with (Intended, Not-Intended) = (0.30, 0.70), which are clearly offensive to some people, but apparently funny to others. There are also politically charged messages, such as "we need to bring back monster trucks, guns, heavy metal 1776 MAGA I want trumps next speech to have monster trucks jumping over an ac/dc concert," with (Intended, Not-Intended) = (0.35, 0.65), as well as uses of racially derogatory terms that may not be universally recognized as such.
Wan et al. (2023) also proposed a model, using the D_SI dataset, for modeling annotators based on their demographic details. We summarize F1-scores for all of the baselines and DisCo in Figure 3. The contextual learning ability of DisCo shows a significant improvement over the prior models, which lack this ability.
On the other hand, items on which the prediction assigned all or nearly all of the probability mass to one label tended to be very obviously racist and/or hateful. In the specific research focus of hate or offensive speech monitoring in real-world settings involving contentious issues (Palakodety et al., 2020), there is a growing consensus that human-in-the-loop systems aided by automated methods can be more robust in handling controversial edge cases. If our automated method assigns nearly even weights to the two (or more) highest-scoring labels, perhaps those instances merit greater scrutiny and vetting from multiple web moderators. Since real-world human moderation is costly, our model can potentially serve as a guide in prioritizing human moderation resources.
Our involvement in label distribution learning grew out of a community-based participatory research group, to which we belonged, on the use of AI technology in vulnerable communities, as a means of preserving, in AI pipelines, minority perspectives that would otherwise be erased when annotator disagreement is resolved (usually in favor of the plurality label, as is common practice today). We believe that these methods, coupled with demographic information on annotators and reliable confidence estimates, can lead to annotated data that is more representative of the true values within a society.

Future work
Our work currently assumes that each annotator provides at most one label for each item, drawn from a fixed set of allowed responses. However, settings in which annotators may provide multiple labels per item, or in which the domain of responses is open or highly structured, are common; they are often where response diversity is particularly rich, and they are rewarding to model. Consider, for instance, machine translation, in which there is clearly no single "correct" translation from one language to another. One way to handle multiple responses per item/annotator is to consider each element of the powerset P(Q) of individual responses Q, so that each subset of Q is treated as an individual response. However, this creates a very large, and usually sparse, response space that is unwieldy. Such simplistic approaches also do not address more complex responses such as translations from one language to another. We hope to fulfill a vision laid out by Lakkaraju et al. (2015), further motivated by Sap et al. (2021), and addressed by Gordon et al. (2022), in which we predict not just the distribution of responses of the entire population, but that of key vulnerable subgroups. This would allow us to better understand when disagreement is likely to have social or political impacts. Note that if we had demographic information about our annotators, we could infer over any such group by masking out, at inference time, all annotators who do not belong to the group of interest.
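To illustrate the blowup: a label set of size |Q| yields 2^{|Q|} compound responses under the powerset treatment, e.g.:

```python
from itertools import chain, combinations

def powerset(labels):
    """All subsets of a label set, each treated as one compound response."""
    s = list(labels)
    return list(chain.from_iterable(combinations(s, r) for r in range(len(s) + 1)))

# Even a modest 12-way label set yields 4096 compound responses,
# most of which would never be observed in a sparsely labeled dataset.
num_compound = len(powerset(range(12)))    # -> 4096
```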
Given the clustering of responses revealed in Figure 4 and the competitive performance of MM+CNN with respect to KL-divergence, we would like to explore ways to incorporate clustering into the design of DisCo.

Conclusion
We proposed a novel neural architecture, DisCo, for modeling the distribution of labels an item receives and the distribution that annotators provide, in the presence of item-annotator pairs. Our design was motivated by the desire to break free of the standard assumption (in supervised learning) of single-label ground truth. Experimental results indicate that DisCo performs at a level comparable to state-of-the-art models that were purpose-built for label distribution learning, but with faster training time, and outperforms state-of-the-art annotator-modeling models, even on single-label learning problems. Qualitative inspection of the data shows that the model can predict striking examples of annotator disagreement. Future work will explore ways to more flexibly pool the labels of related items/annotators in order to enrich sparse label sets.

Limitations
It is highly desirable to test our model on more datasets. However, there are very few multi-class, publicly available datasets that include information about annotator assignments. Often this information is, unfortunately, either discarded or withheld. Without annotator assignments, it is difficult to run experiments related to label distribution learning driven by annotator-item modeling. We hope that this paper encourages more researchers to collect and share datasets that retain information about annotator-item matchings.
Datasets: We understand that disagreement between annotators can arise from the subjectivity/ambiguity of the content to be annotated, the nature of the study, or even worker reliability (Aroyo and Welty, 2013; Inel et al., 2014). These observations alone are not grounds for disregarding a dataset, since disagreement is not a limitation of the dataset but the nature of the problem domain.

Ethical Considerations
All statistical methods are double-edged swords. Used maliciously, these methods could misrepresent social values and opinions. Moreover, while these methods would be more informative with demographic information on the annotators, this conflicts with the privacy of the annotators, a group of workers who are often treated unfairly (Gray and Suri, 2019).

A Datasets
All the datasets we use for this research are collected by other researchers.We have included information on how they were collected and the platforms utilized in the descriptions.
Job-related (D JQ1 , D JQ2 , and D JQ3 ): On a dataset of 2000 tweets, Liu et al. (2016) asked five annotators each from MTurk and FigureEight to label work-related tweets according to three questions with multiple choice responses: point of view of the tweet (D JQ1 : 1st person, 2nd person, 3rd person, unclear, or not job related), subject's employment status (D JQ2 : employed, not in labor force, not employed, unclear, and not job-related), and employment transition event (D JQ3 : getting hired/job seeking, getting fired, quitting a job, losing job some other way, getting promoted/raised, getting cut in hours, complaining about work, offering support, going to work, coming home from work, none of the above but job related, and not job-related).
LabelMe (D_LM): was originally released as part of a data challenge for computer vision research. The label categories were: highway, inside city, tall building, street, forest, coast, mountain, or open country. There are a total of 2,688 images in the dataset, out of which 1,000 were annotated by an average of 2.547 MTurkers (Rodrigues et al., 2017). The authors use data augmentation to create a larger sample of 10,000 items for training CL (Russell et al., 2008). In order to compare DisCo against this previous benchmark, we ran our experiments on this larger dataset.
Movie Reviews (D_MR): Rodrigues and Pereira (2018) culled 1,500 items from a dataset of 5,006 movie reviews in English, each with a single rating on a scale of 1-10 (Pang and Lee, 2005). They asked multiple AMT workers (4.96 per item, on average) to provide their own ratings as test data for a Crowd-Layer regression task.
The Social Bias Inference Corpus (D_SI): The D_SI dataset contains 45k posts from Reddit, Twitter, and hate sites collected by Sap et al. (2019).⁶ The dataset was annotated with respect to seven questions: offensiveness, intent to offend, lewdness, group implications, targeted group, implied statement, and in-group language. Of these, we consider only the "intent to offend" question, as it had the richest label distribution patterns. Its label options are: Intended, Probably Intended, Probably Not Intended, Not Intended. The items in this dataset are in English.

The Multi-Domain Agreement (D_MDA): The dataset created by Leonardelli et al. (2021) consists of 10,753 English tweets from three domains (Black Lives Matter movement, Election 2020, and COVID-19 pandemic). Each tweet was annotated for offensiveness by five annotators through Amazon Mechanical Turk. The annotator pool consisted of > 800 annotators.
The Gab Dataset (D GAB ): This dataset, collected from the social network Gab and introduced by Kennedy et al. (2022), consists of 27,665 posts that are annotated by a minimum of three annotators. The original dataset was annotated for hate and offensive content. We work with the labels associated with vulgar and/or offensive language classification.

B.1 Computational Setup
Our experiments were conducted on: #1 - a desktop computer with an Intel i5-7600K (4 cores) at 4.20GHz, 32GB RAM, and an NVIDIA GeForce RTX 2070 Super with 8GB VRAM, and #2 - a shared server at our institution with an Intel Xeon E7 v4, 264GB RAM, and a Tesla P4 GPU with 8GB VRAM. Our worst-case computation used machine #1 with the dataset D GE. The runtime for a single pass of experiments on a single dataset was 2 minutes for the CNN, 2 hours for MM+CNN, 30 minutes for CL, and 1 hour for DisCo. We repeated each experiment 100 times in order to report standard error.

Multinomial Mixture (MM): This baseline follows Liu et al. (2019a). It is an LDL-aware, two-step process that, in order to improve the estimates of each item's given label distribution, first applies an unsupervised clustering step to the label distributions before passing them, in the second step, to a supervised learner, which is the same as the CNN model described above. We performed a parameter search over the number of item clusters K ∈ {4, ..., 40} and report the results of the best-performing model. Specifically, Table 4 presents the model selection parameters for the MM model of Liu et al. (2019a). The MM+CNN model clusters only on item classes. K is the number of item classes, L is the number of annotator classes, and KL is the KL divergence when evaluated against the empirical ground truth. The rationale behind this design (Liu et al., 2019a) is that if a group of data items have similar label distributions, then the annotators believe that this group of items is related; the items can therefore be clustered together and regarded as having the same distribution, namely the cluster centroid. In this way, the clustering helps with label sparsity. This approach, however, does not model the annotators (nor does it need to be aware of which annotators labeled which items).
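The two-step design can be sketched as follows. This is an illustration only: it uses scikit-learn's KMeans as a stand-in for the multinomial mixture of Liu et al. (2019a), and the tiny label distributions are invented toy values.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy empirical label distributions for six items over Q = 3 labels
# (invented for illustration; real datasets are far larger).
dists = np.array([
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.1, 0.8],
    [0.0, 0.2, 0.8],
])

# Step 1 (unsupervised): cluster the label distributions. KMeans stands
# in for the multinomial mixture; the number of clusters K is a
# hyperparameter (the paper searches K in {4, ..., 40} on larger data).
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(dists)

# Step 2: each item's sparse empirical distribution is replaced by its
# cluster centroid, which then serves as the CNN's training target.
smoothed = km.cluster_centers_[km.labels_]
print(smoothed.shape)
```

Because centroids average several items' distributions, each training target is denser than any single item's handful of annotations, which is how the clustering mitigates label sparsity.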
Maximum Entropy (Max Ent) is a barebones maximum entropy linear classifier: a single dense-layer classification model with a softmax activation. We use this model as an alternative to the CNN classification model.

Crowd Layer (CL): Rodrigues and Pereira (2018) attach to the output of any neural network with a Q-dimensional output layer (recall that Q is the size of the label space) a crowd layer, which has multiple, parallel, Q-dimensional, new output layers, one for each annotator, and takes as input the old output layer. This extended model is trained as a single, monolithic neural network. It thus learns to simultaneously predict the labels of each annotator. The old output layer (now an inner layer) becomes a bottleneck through which each of these independent annotator predictions must pass, and the overall model effectively learns a collective ground-truth distribution for the entire population of annotators. During inference the crowd layer is discarded and the old output layer is used instead. During learning, however, the weights from the bottleneck layer to each individual annotator layer learn to discount unreliable annotators and favor reliable ones. This model can effectively learn a label distribution as ground truth (that is, there is nothing in the model to bias the bottleneck layer toward a single-label output). However, the authors did not anticipate LDL or evaluate the model's ability to learn label distributions.
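A minimal numeric sketch of the crowd-layer idea, not the released CrowdLayer code: a Q-dimensional bottleneck feeds one Q x Q transformation per annotator. All sizes and values here are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

Q, R = 4, 3  # label-space size and number of annotators (illustrative)
rng = np.random.default_rng(0)

# Logits of the base network's old output layer for one item; this layer
# becomes the bottleneck once the crowd layer is attached.
bottleneck_logits = rng.normal(size=Q)

# At inference the crowd layer is discarded and this is the prediction.
inference_pred = softmax(bottleneck_logits)

# Crowd layer: one Q x Q transformation per annotator. With identity
# initialization, every annotator head starts out mirroring the
# bottleneck; training then reweights them per annotator.
crowd_weights = np.stack([np.eye(Q) for _ in range(R)])

# Per-annotator predictions, used only during training.
per_annotator = softmax(crowd_weights @ bottleneck_logits)
print(per_annotator.shape)
```

Training against each annotator's labels adjusts the per-annotator matrices, which is how unreliable annotators can be discounted without corrupting the shared bottleneck.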
Compared to DisCo, CL takes a single data item as input, while DisCo takes an annotator-item pair. CL also has parallel, independent output dimensions for each annotator, while DisCo has an output layer whose size is independent of the number of annotators and items. Consequently, DisCo takes in more information at input (an item-annotator pair versus just an item) and has to solve a simpler prediction task (namely, to output one label distribution per input versus one label distribution for each annotator per input). We believe that our design offers a more scalable and more tractable learning problem (especially if there are many annotators, as is commonly the case, e.g., when crowdsourcing is used). We also believe DisCo is the superior design for sparse labels, because each input to the model uses all of its layers. By contrast, CL has a large number of parallel layers that are only active when the corresponding annotators are present. So when annotators are sparse, a relatively large number of these annotator layers are not used. Furthermore, while both our model and CL have a bottleneck layer, the bottleneck in the CL model must have the same dimension as the label space (because it is used for inference), while ours can have an arbitrary dimension. This gives our model a bit more flexibility but also requires us to treat this dimension as a hyperparameter that must be tuned. Our implementation is based on the code released for the CrowdLayer classification task.
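The architectural contrast can be made concrete with a toy forward pass in the spirit of DisCo. Everything here (layer sizes, random weights, a single decoder head) is an illustrative assumption, not the paper's actual implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
n_annotators, d_item, d_emb, Q = 5, 16, 8, 4  # illustrative sizes

# Encoder: separate (random, untrained) embeddings for the item
# features and the one-hot annotator id.
W_item = rng.normal(size=(d_emb, d_item))
W_annot = rng.normal(size=(d_emb, n_annotators))

x_m = rng.normal(size=d_item)    # item features
a_n = np.eye(n_annotators)[2]    # one-hot id of annotator n = 2

z_I = W_item @ x_m
z_A = W_annot @ a_n

# Combine via concatenation (one of the vector-combination choices);
# note the bottleneck width (2 * d_emb) is a free hyperparameter here,
# unlike CL's bottleneck, which must be Q-dimensional.
h = np.concatenate([z_I, z_A])

# One decoder head producing the class probabilities z_y; the two
# distribution heads (z_yI, z_yA) would be built the same way.
W_out = rng.normal(size=(Q, 2 * d_emb))
z_y = softmax(W_out @ h)
print(z_y.shape)
```

Every annotator-item pair passes through the same weights, which is why all of DisCo's parameters receive gradient signal even when each annotator labels only a few items.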

C Derivation of KL-Divergence for DisCo
Our use of KL divergence as a loss function and, in our results, as an evaluation instrument is particularly apt because it has an important connection to the likelihood of multinomial samples. Suppose we wished to estimate ŷ·,m by drawing a sample of size S from the distribution defined by z_yI. Let L(ŷ·,m | z_yI) denote the log-likelihood of this sample. Then (Shlens, 2014),

$$\lim_{S \to \infty} \frac{L(\hat{y}_{\cdot,m} \mid z_{y_I})}{S} = -\mathrm{KL}(\hat{y}_{\cdot,m} \,\|\, z_{y_I}).$$
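This connection can be checked numerically: the full multinomial log-likelihood of a sample (including the multinomial coefficient), divided by the sample size, approaches the negative KL divergence between the sample's empirical distribution and the model distribution. The two distributions below are toy values.

```python
from math import lgamma, log

# Toy empirical item label distribution and a model estimate
# (both invented for illustration).
p_hat = [0.5, 0.3, 0.2]
z = [0.4, 0.4, 0.2]

kl = sum(p * (log(p) - log(q)) for p, q in zip(p_hat, z))

# The "ideal" sample of size S whose empirical distribution is p_hat.
S = 1_000_000
counts = [int(S * p) for p in p_hat]

# Full multinomial log-likelihood of these counts under z, including
# the multinomial coefficient: log S! - sum log k_j! + sum k_j log z_j.
loglik = lgamma(S + 1) - sum(lgamma(k + 1) for k in counts) \
         + sum(k * log(q) for k, q in zip(counts, z))

# As S grows, loglik / S approaches -KL(p_hat || z).
print(loglik / S, -kl)
```

The Stirling correction to the multinomial coefficient shrinks like O(log S / S), so the two printed numbers agree to several decimal places at this sample size.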

D Crowd Analysis
We generated t-SNE visualizations of the outputs of DisCo models trained on D JQ1 , D JQ2 , and D JQ3 ; see Figure 4. The visualizations reveal clustering in the output space of these models, reminiscent of the clustering in the label-distribution space that the MM+CNN model is designed to exploit but which is not explicitly modeled by DisCo.
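Such a projection can be produced with scikit-learn's TSNE; here random vectors stand in for DisCo's actual training-set outputs, and all sizes are illustrative.

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for DisCo's output-layer activations on the training set
# (one row per annotator-item pair; dimensions are illustrative).
rng = np.random.default_rng(0)
outputs = rng.normal(size=(200, 32))

# Project to 2-D for plotting; perplexity must be smaller than the
# number of samples and is a tunable hyperparameter of t-SNE.
coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(outputs)
print(coords.shape)
```

Coloring `coords` by each pair's label class is what reveals (or fails to reveal) the clustering discussed above.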

Figure 1 :
Figure 1: We represent each item x·,m from each dataset used in this study as a column vector of features. Each data item is associated with a vector of annotations, which we represent as a column of individual annotator responses y·,m, as well as a distribution over the label choices ŷ·,m. We are also interested in the distribution of responses that each annotator provides over all the items they annotate, ŷn,·.

Figure 2 :
Figure 2: Block diagram showing the main components and parameters of DisCo. The model takes as input an item x_m and a one-hot encoding a_n of an integer identifier n, and is ultimately trained to output a set of three probability distributions, namely a vector of class probabilities z_y, a distribution of labels from all annotators z_yI, and a distribution of labels from all items z_yA. Notice that x_m and a_n are first each embedded into their own respective sub-spaces (z_I and z_A) before they are combined through a vector-combination operator (such as concatenation).

Figure 3 :
Figure 3: A comparison of the models on the D SI dataset based on F1-score. Left to right: (grey) CNN is a baseline with no label modeling, which receives for each data item the empirical label distributions provided by annotators. (red) DS+CNN uses the same CNN model, but the empirical label distributions are replaced with labels from the DS model. MM+CNN uses the same CNN model, but the empirical label distributions are replaced with MM centroids. (yellow) CL is the CrowdLayer baseline. WKK is the baseline from Wan et al. (2023). (pink) A=I=0 is our DisCo model without the output layers for the annotator and item embeddings. (black) DisCo is the new method introduced here.
To train DisCo's parameters Θ = {Θ_e, Θ_d}, we propose the following multi-objective function:

$$L(\Theta_e, \Theta_d) = -\sum_m y_{\cdot,m} \cdot \log(z_y) + \sum_m \mathrm{KL}(\hat{y}_{\cdot,m} \,\|\, z_{y_I}) + \sum_n \mathrm{KL}(\hat{y}_{n,\cdot} \,\|\, z_{y_A}), \quad (8)$$

where the first term is the negative categorical log-likelihood of the target one-hot encoded label y, and the second and third terms measure the Kullback-Leibler (KL) divergence between the decoder's estimate and the actual item label distribution and the actual annotator label distribution, respectively. Specifically, the form of the KL divergence that we use compares two multinomial/multinoulli distributions:

$$\mathrm{KL}(\hat{y}_{\cdot,m} \,\|\, z_{y_I}; \Theta) = \hat{y}_{\cdot,m} \cdot \log \hat{y}_{\cdot,m} - \hat{y}_{\cdot,m} \cdot \log z_{y_I}, \quad (9)$$

$$\mathrm{KL}(\hat{y}_{n,\cdot} \,\|\, z_{y_A}; \Theta) = \hat{y}_{n,\cdot} \cdot \log \hat{y}_{n,\cdot} - \hat{y}_{n,\cdot} \cdot \log z_{y_A}, \quad (10)$$

where here and above log is applied to each scalar value independently and is base e. DisCo's parameters are adjusted to minimize the function defined in Equation 8 by calculating the gradients with respect to both the encoder and decoder weights, i.e., $\partial L(\Theta_e, \Theta_d)/\partial \Theta_e$ and $\partial L(\Theta_e, \Theta_d)/\partial \Theta_d$. The resultant partial derivatives are then used to change the current values in Θ_e and Θ_d via stochastic gradient descent or with a more advanced adaptive learning-rate rule such as Adam (Kingma and Ba, 2014).
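A single-example numeric sketch of this objective (all probability vectors below are toy values; in training the sums run over all items m and annotators n):

```python
import numpy as np

def kl(p, q):
    # KL divergence between two discrete distributions (natural log).
    return float(np.sum(p * (np.log(p) - np.log(q))))

# Toy values for one item/annotator pair with Q = 3 labels.
y_onehot = np.array([1.0, 0.0, 0.0])   # target one-hot label
z_y      = np.array([0.7, 0.2, 0.1])   # predicted class probabilities
y_item   = np.array([0.6, 0.3, 0.1])   # empirical item label distribution
z_yI     = np.array([0.5, 0.3, 0.2])   # decoder's item-distribution estimate
y_annot  = np.array([0.4, 0.4, 0.2])   # empirical annotator distribution
z_yA     = np.array([0.3, 0.5, 0.2])   # decoder's annotator-distribution estimate

# Equation 8 for one example: negative log-likelihood of the target
# plus the two KL terms.
loss = -float(y_onehot @ np.log(z_y)) + kl(y_item, z_yI) + kl(y_annot, z_yA)
print(loss)
```

Each term is non-negative and is zero only when the corresponding prediction matches its target, so minimizing the sum pushes all three heads toward their respective distributions at once.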

Figure 4
Figure 4: t-SNE plots of the training set from DisCo models. Each color represents a label class. (Top) Plot for the D JQ1 dataset, which has five label classes (C). (Middle) Plot for the D JQ2 dataset, which also has five label classes (C). (Bottom) Plot for the D JQ3 dataset, which has 12 label classes (C).

Table 2 :
Experimental results for the classification tasks. DisCo is the new method introduced here. CNN is a baseline with no label modeling, which receives for each data item the empirical (unprocessed, i.e., f_dist) label distributions provided by annotators. DS+CNN uses the same CNN model, but the empirical label distributions are replaced with labels from the Dawid and Skene (1979) model. MM+CNN uses the same CNN model, but the empirical label distributions are replaced with multinomial mixture model centroids. CL is the CrowdLayer baseline. We repeated each experiment 100 times and report the mean and standard deviation. A=I=0 is our DisCo model without the contextual layer with annotator and item encodings. The best-performing model is indicated in bold. See Section "Experiments" for further details.

Table 4 :
Experimental parameters of the MM+CNN model.