SemEval-2021 Task 12: Learning with Disagreements

Disagreement between coders is ubiquitous in virtually all datasets annotated with human judgements in both natural language processing and computer vision. However, most supervised machine learning methods assume that a single preferred interpretation exists for each item, which is at best an idealization. The aim of the SemEval-2021 shared task on learning with disagreements (Le-Wi-Di) was to provide a unified testing framework for methods for learning from data containing multiple and possibly contradictory annotations covering the best-known datasets containing information about disagreements for interpreting language and classifying images. In this paper we describe the shared task and its results.


Introduction
The assumption that natural language expressions have a single and clearly identifiable interpretation in a given context, or that images have a preferred label, still underlies most work in natural language processing (NLP) and computer vision. However, there is now plenty of evidence that this assumption is just a convenient idealization: virtually every project devoted to large-scale annotation has found that genuine disagreements are widespread.
In NLP, it has long been known from work on anaphora and coreference that annotator/coder disagreement can be genuine, i.e., the result of debatable judgements, difficult cases, or linguistic ambiguity (Poesio and Artstein, 2005; Versley, 2008; Recasens et al., 2011). But in recent years, we have also seen evidence that disagreements among subjects/coders are common in virtually every aspect of language interpretation, from apparently simple tasks such as part-of-speech tagging (Plank et al., 2014b), to more complex ones like semantic role assignment (Dumitrache et al., 2019), to subjective tasks such as sentiment analysis (Kenyon-Dean et al., 2018), and to the inferences that can be drawn from sentences (Pavlick and Kwiatkowski, 2019). (See also the analyses of disagreements in OntoNotes and on word senses in Pradhan et al. (2012), Passonneau et al. (2012), and Martínez Alonso et al. (2016).)
In computer vision, as well, the assumption that gold labels can be specified for items has proven to be an idealization (Rodrigues and Pereira, 2018), in fact possibly even more so than in NLP. In many widely used crowdsourced datasets for computer vision, different coders assign equally plausible labels to the same items. The problem of disagreement among coders, including experts, on the classification of noisy image data has arisen in many applications, including the classification of astronomical images (Smyth et al., 1994), medical image classification (Raykar et al., 2010), and numerous others (Sharmanska et al., 2016; Rodrigues and Pereira, 2018; Firman et al., 2018).
Many researchers have concluded that rather than attempting to eliminate disagreements from annotated corpora, we should preserve them; indeed, some researchers have argued that corpora should aim to collect all distinct interpretations of an expression (Smyth et al., 1994; Poesio and Artstein, 2005; Aroyo and Welty, 2015; Sharmanska et al., 2016; Plank, 2016; Kenyon-Dean et al., 2018; Firman et al., 2018; Pavlick and Kwiatkowski, 2019). Poesio and Artstein (2005) and Recasens et al. (2012) suggest that the best way to create resources capturing disagreements is to preserve implicit ambiguity, i.e., to have multiple annotators label the items and then keep all of these annotations rather than just an aggregated 'gold standard'. A number of corpora with these characteristics now exist (Passonneau and Carpenter, 2014; Plank et al., 2014a; Dumitrache et al., 2019; Poesio et al., 2019; Rodrigues and Pereira, 2018; Peterson et al., 2019).

Much recent research has explored the question of whether corpora of this type, besides being more accurate characterizations of the linguistic reality of language interpretation and image categorization, are also better resources for training NLP and computer vision models, and if so, what the best way of exploiting disagreements in modelling is. Beigman Klebanov and Beigman (2009) used information about disagreements to exclude items on which judgements are unclear ('hard' items). In the CrowdTruth project (Aroyo and Welty, 2015; Dumitrache et al., 2019), information about disagreement is used to weigh the items used for training. Plank et al. (2014a) proposed to use the information about disagreement to supplement the gold label during training. Finally, methods have been proposed for training directly from the data with disagreements, without first obtaining an aggregated label (Sheng et al., 2008; Rodrigues and Pereira, 2018; Peterson et al., 2019; Uma et al., 2020).
Only limited comparisons of these methods have been carried out (Jamison and Gurevych, 2015), and the sparse research landscape remains fragmented; in particular, methods applied in NLP have not yet been tested in computer vision, and vice versa. The objective of SemEval-2021 Task 12, Learning with Disagreements (Le-Wi-Di), was to provide a unified testing framework for learning from disagreements in NLP and computer vision, using datasets containing information about disagreements for interpreting language and classifying images. Our expectation was that unifying research on disagreement from different fields could lead to novel insights with wide impact.

Task organization
In order to provide a thorough benchmark for methods for learning from disagreements, we identified five well-known datasets for very different NLP and computer vision tasks, all characterized by providing a multiplicity of labels for each instance, by being large enough to train state-of-the-art models, and by differing in the characteristics of their crowd annotators and data collection procedures. We found or developed near-state-of-the-art models for the tasks represented by these datasets. Both 'hard' and 'soft' evaluation metrics were employed (Uma et al., n.d.).
The shared task was set up on the CodaLab Competitions platform (https://www.microsoft.com/en-us/research/project/codalab/), which enables training and uniform evaluation on these datasets, such that the crowd learning adaptations of the base models proposed by task participants would be directly comparable.
In this section, we briefly introduce the five datasets included in the benchmark and our evaluation criteria. We also elaborate on the setup of the shared task.

Data
There are by now quite a few datasets preserving disagreements, covering many levels of language interpretation; remarkably, none of them had ever been used for a shared task like the one proposed here, and the majority had never been used for a shared task at all. Our shared task aimed at leveraging this diversity. The datasets included are outlined in this section, and their characteristics are summarized in Table 1. Figure 1 shows the observed agreement for each dataset.

The POS dataset

One widely used resource for developing disagreement-aware models is the dataset of Twitter posts annotated with part-of-speech (POS) tags collected by Gimpel et al. (2011). Plank et al. (2014b) mapped the Gimpel tags to the universal POS tag set (Petrov et al., 2012) and collected at least five crowdsourced labels per token. This dataset contains 14K training examples (English words/tokens) annotated by 177 annotators; each item was annotated between five and 177 times, 16.38 times on average. For this shared task, we selected 8.3K, 3K, and 3.1K tokens as training, development, and test sets respectively.

The Phrase Detectives corpus

The Phrase Detectives corpus was created with the Phrase Detectives gamified online platform (Poesio et al., 2013). We use a simplified version of the corpus containing only binary information status labels: Discourse New (the entity referred to has never been mentioned before) and Discourse Old (it has been mentioned). The corpus consists of 542 documents, for a total of 408K tokens and over 96K markables. These documents were annotated by game players, who produced an average of 11.87 annotations per markable.
Forty-five of the documents (5.2K markables), collectively called the gold portion of the corpus, additionally contain expert-adjudicated gold labels. This subset was designated as the test set. The training and development sets consist of 473 documents (86.9K markables) and 24 documents (4.2K markables) respectively.

The Humour dataset
The comprehension and appreciation of humour is known to vary across individuals (Ruch, 2008), making disagreement over the perceived funniness of jokes an appealing subject of study. For our training data, we used the corpus of Simpson et al. (2019), which consists of 4,030 short texts (3,398 jokes, mostly based on puns, and 632 non-jokes such as proverbs and aphorisms). 28,210 unique pairings of these texts were each presented to five crowd workers, who indicated which text in the pair (if either) they found to be funnier. The goal is to learn a model that, given a pair of short texts, predicts which of the two is funnier.
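As an illustration of the task's output (and not of the GPPL base model used in the shared task), pairwise judgements of this kind can be aggregated into a simple win-rate ranking; the texts, votes, and scoring scheme below are all invented:

```python
from collections import defaultdict

# Five crowd judgements per pair: +1 = first text funnier, -1 = second,
# 0 = neither. Toy data; pair ids and votes are made up.
pairs = {("jokeA", "jokeB"): [1, 1, 1, -1, 0],
         ("jokeB", "jokeC"): [1, 1, 0, 0, -1],
         ("jokeA", "jokeC"): [1, 1, 1, 1, -1]}

def win_rate_scores(pairs):
    """A simple win-rate ranking derived from pairwise labels; an
    illustration of the task, not the GPPL base model."""
    wins, games = defaultdict(float), defaultdict(int)
    for (a, b), votes in pairs.items():
        for v in votes:
            games[a] += 1; games[b] += 1
            if v == 1: wins[a] += 1
            elif v == -1: wins[b] += 1
            else: wins[a] += 0.5; wins[b] += 0.5
    return {t: wins[t] / games[t] for t in games}

scores = win_rate_scores(pairs)
funnier = max(scores, key=scores.get)
print(funnier)  # jokeA
```

A trained preference model replaces these raw win rates with a ranking function that generalizes to unseen pairs.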
The 4,030 text instances were split into 60% (2,418 texts, 9,916 unique pairs) for the training set and 20% (806 texts, 1,086 unique pairs) for the development set. Since this dataset had already been published, we constructed a new test dataset along similar lines: 1,000 short texts (all punning jokes) were paired in 7,000 different ways, and each of these 7,000 pairs was then presented to five crowd workers, paid in line with the federal minimum wage, for a preference judgement.

The LabelMe dataset

Rodrigues and Pereira (2018) collected an average of 2.5 annotations per image from 59 annotators for the 10K images in this dataset (http://labelme.csail.mit.edu/Release3.0). We randomly selected 5K, 2.5K, and 2.5K images for training, development, and testing respectively, taking care to keep the label proportions in each subset close to the proportions in the full 10K dataset.

The CIFAR-10 corpus

Krizhevsky's (2009) CIFAR-10 dataset consists of 60K tiny images from the web, carefully labelled and expert-adjudicated to produce a single gold label for each image in one of 10 categories: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. Peterson et al. (2019) collected crowd annotations for 10K images from this dataset (the designated test portion) using Amazon Mechanical Turk, creating the CIFAR-10H dataset (https://github.com/jcpeterson/cifar-10h), which we use for this shared task.
We randomly selected 7K, 1K, and 2K images for training, development and testing respectively. We kept as much data as we could for training without jeopardizing the evaluation process, as the base model was found to be sensitive to data size. As with the original dataset, each subset we created contains an equal number of images per category.
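The proportion-preserving subsets described for LabelMe and CIFAR-10 can be drawn with a generic stratified sampling routine along these lines; the function and its parameters are our sketch, not the organizers' actual code:

```python
import random

def stratified_split(items, labels, fractions, seed=0):
    """Split items so that each subset keeps (approximately) the same
    per-label proportions; with balanced classes, each subset gets an
    equal number of items per category."""
    rng = random.Random(seed)
    by_label = {}
    for it, lab in zip(items, labels):
        by_label.setdefault(lab, []).append(it)
    splits = [[] for _ in fractions]
    for lab, group in by_label.items():
        rng.shuffle(group)
        start = 0
        for i, frac in enumerate(fractions):
            n = round(len(group) * frac)  # per-class quota for this subset
            splits[i].extend(group[start:start + n])
            start += n
    return splits

items = list(range(100))
labels = [i % 10 for i in items]          # 10 balanced classes
train, dev, test = stratified_split(items, labels, [0.7, 0.1, 0.2])
print(len(train), len(dev), len(test))    # 70 10 20
```

Because the quotas are computed per class, a balanced source dataset like CIFAR-10 yields subsets with equal per-category counts.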

Evaluation metrics
While recent research questions the assumption that a single 'hard' (gold) label exists for every item in a dataset, the models proposed for learning from multiple interpretations are still largely evaluated under this assumption, using 'hard' measures like accuracy or class-weighted F1 (Sheng et al., 2008; Plank et al., 2014a; Martínez Alonso et al., 2015; Sharmanska et al., 2016; Rodrigues and Pereira, 2018). For reference and comparison purposes, we also evaluate the models produced for this shared task using F1. However, a way of evaluating models on their ability to capture disagreement is needed, especially for datasets with a substantial degree of disagreement.

The simplest 'soft' metric of this type treats the probability distribution over labels produced by an ambiguity-aware model as a soft label, and compares it to the full distribution produced by the annotators using, for example, cross-entropy. This approach was adopted in, inter alia, Peterson et al. (2019) and Uma et al. (2020). Peterson et al. (2019) tested this approach on image classification tasks, generating the soft label by transforming the item's annotation distribution using standard normalization. In this shared task we also use standard normalization to produce soft labels for the humour dataset.

Uma et al. (2020) show that the choice of soft label encoding function depends on the characteristics of the dataset. For some of the datasets, they show that a softmax function over the annotator distribution is preferable to standard normalization; for others, training a soft-loss model using the posterior probability produced by Hovy et al.'s (2013) probabilistic aggregation model as a soft label produces predictions that are most accurate with respect to the gold. Therefore, in this shared task we used different soft label encoders to generate soft labels from annotator distributions for the test data.
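The two soft-label encodings mentioned here (standard normalization and softmax over annotation counts) and the soft cross-entropy metric are easy to state concretely. A minimal sketch, with function names ours:

```python
import math
from collections import Counter

def normalize(counts):
    """Standard normalization: divide each label's count by the total."""
    total = sum(counts.values())
    return {lab: n / total for lab, n in counts.items()}

def softmax(counts, labels):
    """Softmax over raw annotation counts (0 for unobserved labels)."""
    exps = {lab: math.exp(counts.get(lab, 0)) for lab in labels}
    z = sum(exps.values())
    return {lab: e / z for lab, e in exps.items()}

def cross_entropy(soft_label, predicted, eps=1e-12):
    """Soft metric: cross-entropy between the human soft label and the
    model's predicted distribution (lower is better)."""
    return -sum(p * math.log(predicted.get(lab, 0.0) + eps)
                for lab, p in soft_label.items() if p > 0)

# Toy item: five annotators labelled one image.
counts = Counter(["cat", "cat", "cat", "dog", "dog"])
soft = normalize(counts)            # {'cat': 0.6, 'dog': 0.4}
pred = {"cat": 0.7, "dog": 0.3}
print(round(cross_entropy(soft, pred), 3))  # 0.696
```

Note how softmax sharpens the distribution relative to standard normalization when counts differ, which is why the best encoding depends on the dataset's annotation characteristics.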

Task setup
CodaLab was the designated site for hosting SemEval-2021 competitions; our competition can be found at https://competitions.codalab.org/competitions/25748. Le-Wi-Di was run in two main phases.

Practice phase. In the practice phase, the goal was to train models for each task to learn from crowd annotations, given (1) the training data (consisting of raw and preprocessed input data and crowd annotations), (2) the development data with no labels, and (3) the base models (discussed in Section 3). While participants were encouraged to start with the base models and extend them, we did not make this mandatory. Participants could test the performance of their models on the development set by making predictions on the development input data and uploading their submissions to CodaLab for preliminary testing. We permitted up to 999 submissions in this phase. The leader board was made public to allow participants not only to see how their models performed, but also to compare their performance to that of models submitted by other participants.
Evaluation phase. The evaluation phase was the official testing phase of the competition. In this phase, we released the test data (without labels), but we also released the gold labels and crowd annotations for the development set to facilitate quick offline testing, refinement, and model selection. The number of submissions for this phase was initially limited to ten per participant, to prevent participants from fine-tuning their models on the test data; it was later increased to 999 to further encourage submission attempts. The leader board was also kept public in this phase, so each participant could see the best model for each task under each evaluation metric.
Post-campaign evaluation. As our aim was to make this benchmark available beyond the competition to researchers developing disagreement-aware models, we included a third, post-evaluation phase to allow lifetime access to the data. Researchers participating in this phase will be able to access the same data as in the evaluation phase and test their models on the test data for the various tasks.

Base models and baselines
In order to encourage the participants to focus on the development of methods for learning from disagreement, as opposed to achieving higher performance by developing better models, we provided 'base' models for each of the tasks represented by the corpora above. In the event, stronger measures proved unnecessary, as the inherent difficulty of the shared task was enough of a deterrent. In this section, we briefly discuss the base model we provided for each task. In Section 5, we report the results obtained using these base models with two crowd learning approaches: majority voting and the soft-loss method (Peterson et al., 2019; Uma et al., 2020).
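The two crowd learning approaches differ only in the training target derived from the crowd annotations: a single aggregated hard label versus a per-item distribution used as the target of a soft (cross-entropy) loss. A minimal sketch, assuming simple count-based aggregation:

```python
from collections import Counter

def majority_vote(annotations):
    """Aggregate crowd annotations into a single hard label (ties here
    fall to the first-seen label; real pipelines need a tie policy)."""
    return Counter(annotations).most_common(1)[0][0]

def soft_target(annotations, labels):
    """Per-item probability distribution used as the target of a soft
    (cross-entropy) loss, instead of a one-hot majority label."""
    counts = Counter(annotations)
    total = sum(counts.values())
    return [counts.get(lab, 0) / total for lab in labels]

anns = ["NOUN", "NOUN", "VERB", "NOUN", "ADJ"]
print(majority_vote(anns))                         # NOUN
print(soft_target(anns, ["NOUN", "VERB", "ADJ"]))  # [0.6, 0.2, 0.2]
```

The base model architecture is identical in both cases; only the loss target changes, which is what makes the two approaches directly comparable.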

The POS tagging model. The POS tagger is a bi-LSTM with additional attention over the input word and character embeddings, as used in Uma et al. (2020).

The information status classification model. The model for this task was developed by comparing architectures from two models: a state-of-the-art coreference model and a state-of-the-art information status classification model. We combined the mention representation component of Lee et al.'s (2018) coreference resolution system with the mention sorting and non-syntactic feature extraction components of the classification model proposed by Hou (2016), originally developed for fine-grained information status classification (Markert et al., 2012; Hou et al., 2013), to create a novel classification model that outperforms Hou (2016) on the Phrase Detectives corpus. The training parameters were set following Lee et al. (2018).
The humour preference learning model. As the base model for this task, we use Gaussian process preference learning (GPPL) with stochastic variational inference, as described and implemented by Simpson and Gurevych (2020). As an input vector to GPPL, we first take the mean word embedding of a text, using 300-dimensional word2vec embeddings trained on the Google News corpus (Mikolov et al., 2013). Then, we compute the frequency of each unigram in the text in a 2017 Wikipedia dump, and of each bigram in the text in a Google Books Ngram dataset. Finally, we concatenate the mean unigram and bigram frequencies with the mean word embedding vector to obtain the input representation for each short text. The model is trained on pairwise labels from the training set to obtain a ranking function that can be used to score test instances or output pairwise label probabilities. As a Bayesian model, it takes into account sparsity and noise in the crowdsourced training labels, and moderates its confidence accordingly. Hence, it is a strong baseline for accounting for disagreement among annotators. This same approach set the previous state of the art on the humour dataset (Simpson et al., 2019).
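The input representation just described (mean word embedding concatenated with mean unigram and bigram corpus frequencies) can be sketched as follows; the embedding table and frequency dictionaries are toy stand-ins for the word2vec, Wikipedia, and Google Books resources:

```python
# Toy 4-d "embeddings" and corpus frequencies stand in for word2vec and
# the Wikipedia / Google Books counts; all values here are invented.
EMB = {"why": [.1, .2, .0, .3], "did": [.0, .1, .1, .2],
       "the": [.2, .0, .1, .1], "chicken": [.3, .3, .2, .0]}
UNI_FREQ = {"why": .004, "did": .006, "the": .05, "chicken": .0005}
BI_FREQ = {("why", "did"): .0008, ("did", "the"): .002,
           ("the", "chicken"): .0001}

def text_features(tokens):
    """Mean word embedding, concatenated with the mean unigram and mean
    bigram corpus frequencies of the text's tokens."""
    dim = len(next(iter(EMB.values())))
    mean_emb = [sum(EMB.get(t, [0] * dim)[d] for t in tokens) / len(tokens)
                for d in range(dim)]
    mean_uni = sum(UNI_FREQ.get(t, 0) for t in tokens) / len(tokens)
    bigrams = list(zip(tokens, tokens[1:]))
    mean_bi = (sum(BI_FREQ.get(b, 0) for b in bigrams) / len(bigrams)
               if bigrams else 0.0)
    return mean_emb + [mean_uni, mean_bi]

vec = text_features(["why", "did", "the", "chicken"])
print(len(vec))  # 6 = 4 embedding dims + 2 frequency features
```

With the real resources, the vector is 302-dimensional (300 embedding dimensions plus the two frequency features); this sketch only shows the construction.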
The LabelMe image classification model. For this task, we replicated the model from Rodrigues and Pereira (2018). The images were encoded using pretrained layers of the VGG-16 deep neural network (Simonyan et al., 2013). This encoding is passed into a feed-forward neural network with a ReLU-activated hidden layer of 128 units. A 0.2 dropout is applied to this learned representation, which is then passed through a final layer with softmax activation to produce the model's predictions.
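The feed-forward head on top of the VGG-16 features can be sketched as follows; the randomly initialized weights stand in for trained parameters, and the feature and class dimensions are illustrative:

```python
import math, random

random.seed(0)
N_FEAT, N_HID, N_CLASS = 512, 128, 8   # feature/class sizes are illustrative

# Random weights stand in for the trained parameters.
W1 = [[random.gauss(0, 0.05) for _ in range(N_FEAT)] for _ in range(N_HID)]
W2 = [[random.gauss(0, 0.05) for _ in range(N_HID)] for _ in range(N_CLASS)]

def forward(x, train=False, p_drop=0.2):
    """Image features -> 128-unit ReLU layer -> 0.2 dropout -> softmax."""
    h = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W1]
    if train:  # inverted dropout, applied only during training
        h = [0.0 if random.random() < p_drop else hi / (1 - p_drop) for hi in h]
    logits = [sum(w * hi for w, hi in zip(row, h)) for row in W2]
    m = max(logits)                       # stabilize the softmax
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

probs = forward([random.random() for _ in range(N_FEAT)])
print(round(sum(probs), 6))  # 1.0
```

In the actual model the input `x` is the VGG-16 encoding of an image; here a random vector simply exercises the head.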

The CIFAR-10 image classification model. The trained model provided for this task is a ResNet-34 model (He et al., 2016), a deep residual network which is among the best performing systems for CIFAR-10 image classification. We made available to participants a publicly available PyTorch implementation of this ResNet model (https://github.com/KellerJordan/ResNet-PyTorch-CIFAR10).

Participating systems
Unfortunately, we observed a dramatic drop-off between the number of participants that signed up for the competition (over 100 groups), the number of groups that participated in the practice phase, and the number of groups that submitted a run for official evaluation. Two participating groups cited an inability to come up with a novel crowd learning paradigm as the reason they did not submit for official evaluation.
Only one group submitted in the evaluation phase (Osei-Brefo et al., 2021). However, they did submit models for each of the tasks, and did adopt a learning-from-disagreements approach.

POS tagging. For this task, the group used a fine-tuned pretrained model. To extend it for crowd learning, they added an adaptation of the crowd layer from Rodrigues and Pereira (2018). Rather than computing a single loss from the crowd layer as Rodrigues and Pereira (2018) do, they compute a joint loss from both the crowd layer and the base model (without the crowd layer bottleneck).

Information status classification. For this task, the group also used a fine-tuned model together with Rodrigues and Pereira's (2018) crowd layer.

Humour preference learning. For humour preference learning, the group submitted predictions using the base model without modifications.

LabelMe image classification. For this task, the group adapted the Rodrigues and Pereira (2018) crowd layer to the base model.
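The crowd-layer idea and the joint loss described above can be sketched as follows; the confusion-matrix formulation of the crowd layer and the equal weighting of the two loss terms are our illustrative assumptions, not the group's exact implementation:

```python
import math

def ce(p, y, eps=1e-12):
    """Cross-entropy of distribution p against hard label index y."""
    return -math.log(p[y] + eps)

def crowd_transform(p, M):
    """An annotator-specific matrix M maps the base model's class
    distribution p to what that annotator would be expected to say."""
    k = len(p)
    out = [sum(M[i][j] * p[j] for j in range(k)) for i in range(k)]
    z = sum(out)
    return [o / z for o in out]

def joint_loss(p, crowd_labels, matrices, alpha=0.5):
    """Crowd-layer loss (each annotator's label scored through that
    annotator's matrix) plus the base model's own loss on the same
    labels; the alpha weighting is an illustrative choice."""
    crowd = sum(ce(crowd_transform(p, matrices[r]), y)
                for r, y in crowd_labels)
    base = sum(ce(p, y) for _, y in crowd_labels)
    return alpha * crowd + (1 - alpha) * base

p = [0.7, 0.2, 0.1]                      # base model prediction, 3 classes
eye = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]  # identity = unbiased annotator
labels = [(0, 0), (1, 0)]                # (annotator id, label) pairs
print(round(joint_loss(p, labels, {0: eye, 1: eye}), 3))  # 0.713
```

With identity matrices the two loss terms coincide; in training, the matrices are learned and absorb each annotator's biases, while the base term keeps a direct signal on the shared prediction.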
CIFAR-10 image classification. For CIFAR-10, the crowd labels were aggregated into hard labels using majority voting. However, the group combined Zagoruyko and Komodakis's (2016) WideResNet model, which has been shown to outperform He et al.'s (2016) ResNet, with the novel Sharpness-Aware Minimization (SAM) optimization technique proposed by Foret et al. (2020), which has been shown to efficiently improve model generalization, especially on noisy, singly labelled data.

Table 2 contains the results of the various models discussed in Sections 3 and 4 on this shared task, evaluated with the hard metric (the class-weighted F1 with respect to the gold labels) and the soft metric (the cross-entropy between the soft labels for each task, see Section 2.2, and the model's predictions for that task). The best results for each task are highlighted in bold.

Results and discussion
The group concentrated their effort on the CIFAR-10H dataset, on which they did achieve good results and outperformed the baseline (see below). On the other datasets, their official results at the end of the evaluation phase were less competitive.
With the POS and information status datasets, the group's model, which adds a crowd layer on top of the base model, achieved substantially worse results than training from a label aggregated using majority voting or training using a soft-loss function, according both to the hard evaluation metric (F1) and to the soft metric (cross-entropy). The ranking between the soft-loss method, aggregation, and the crowd layer is consistent with that obtained by Uma et al. (n.d.), but the results obtained by the group are much worse, for reasons that will require further investigation. (For at least one of these datasets, Uma et al. (n.d.) obtain comparable results with soft-loss functions and with the crowd layer.) More generally, the results show that although the hard label (the majority voting aggregate of the annotator distribution) and the soft label (a probability distribution encoding of the annotator distribution) were drawn from the same annotator distribution, given the same base model, training by targeting the soft label (base model + soft loss) outperforms training using majority voting aggregates (base model + majority voting) regardless of which evaluation metric is used to compare the models.
For the humour preference learning task, again, the base model outperforms the group's submission on both metrics, but in this case the difference in performance is much less substantial with the hard metric, although it remains large according to the soft metric. This large difference may be due to a technical issue that requires further investigation, since the submission was also supposed to have been produced by the same base system. A possible reason for a poor cross-entropy score is the use of discrete labels, which are heavily penalized for overconfidence by cross-entropy. On this soft metric, the Bayesian probabilistic approach of GPPL may have advantages over approaches with poorer calibration, which remains to be explored in future work. The GPPL approach therefore remains the state of the art on this dataset.

For LabelMe, again, soft-loss training achieved better hard and soft scores than both aggregation training with majority voting labels and the extension of the base model with a crowd layer adapted from Rodrigues and Pereira (2018). The finding that the group's adaptation of the Rodrigues and Pereira (2018) crowd layer yielded lower F1 than training using majority voting is unexpected, given that in Rodrigues and Pereira (2018), Uma et al. (2020), and Uma et al. (n.d.), the crowd layer was shown to be a competitive approach to learning from crowds that consistently outperforms majority voting. However, the group's crowd layer does achieve better soft evaluation (cross-entropy) scores than majority voting.
There is one dataset, however, on which the group outperformed the two baselines: CIFAR-10H, thanks to the combination of WideResNet with Foret et al.'s (2020) SAM optimization technique. The results show that WideResNet outperforms ResNet on this task according to both the hard metric and the soft metric. Interestingly, this is the one dataset on which the Deep Learning from Crowds approach of Rodrigues and Pereira (2018) works best according to Uma et al. (n.d.), outperforming both soft-loss training and majority voting training. It would thus be interesting to understand whether the performance of the group's model could be further increased by adopting one of these methods.

Table 2: Results of the benchmarks and participant submissions on all the tasks, using F1 (higher is better) and cross-entropy (lower is better)
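Foret et al.'s (2020) SAM update, which drove the WideResNet result on CIFAR-10H, can be sketched on a toy loss surface; numerical gradients stand in for backpropagation, and all hyperparameters are illustrative:

```python
import math

def grad(f, w, h=1e-6):
    """Central-difference numerical gradient of a scalar loss."""
    g = []
    for i in range(len(w)):
        wp = w[:]; wp[i] += h
        wm = w[:]; wm[i] -= h
        g.append((f(wp) - f(wm)) / (2 * h))
    return g

def sam_step(f, w, lr=0.1, rho=0.05):
    """One Sharpness-Aware Minimization step: perturb the weights toward
    the (approximately) worst nearby point, then descend using the
    gradient taken there (Foret et al., 2020)."""
    g = grad(f, w)
    norm = math.sqrt(sum(gi * gi for gi in g)) or 1.0
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]  # ascent step
    g_adv = grad(f, w_adv)                                  # sharpness-aware grad
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]       # descent step

loss = lambda w: (w[0] - 1) ** 2 + (w[1] + 2) ** 2  # toy loss surface
w = [3.0, 0.0]
for _ in range(100):
    w = sam_step(loss, w)
print([round(x, 2) for x in w])  # approaches the minimum at [1, -2]
```

Seeking flat minima in this way is what gives SAM its robustness to noisy, singly labelled training data, which is plausibly why it paired well with majority-voted CIFAR-10H labels.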

Conclusion
This shared task presented the first unified testing framework for learning with disagreements. The datasets include sequence labelling, three classification tasks, and preference learning, and hence provide a testbed for a wide range of challenges arising when learning from multiple annotators. We proposed to evaluate not just 'hard' performance against a gold standard, but also the ability to predict the distribution of different interpretations of the data, that is, the alternative labellings provided by different annotators. The results show the benefit of soft loss functions that account for the distribution of labels in the training data. However, modelling alternative interpretations of data remains an under-researched topic in NLP and computer vision. To encourage future work on learning with disagreements, the shared task and datasets will remain available for evaluating new methods.

As a postscript, we should note that after the end of the official competition we did carry out an investigation of the reasons for the poor performance of the participating group's models on the tasks other than CIFAR-10H. Some points emerging from that discussion are presented in the participants' paper for the shared task.