Temporal-aware Language Representation Learning From Crowdsourced Labels

Learning effective language representations from crowdsourced labels is crucial for many real-world machine learning tasks. A challenging aspect of this problem is that the quality of crowdsourced labels suffers from high intra- and inter-observer variability. Since high-capacity deep neural networks can easily memorize all disagreements among crowdsourced labels, directly applying existing supervised language representation learning algorithms may yield suboptimal solutions. In this paper, we propose TACMA, a temporal-aware language representation learning heuristic for crowdsourced labels with multiple annotators. The proposed approach (1) explicitly models the intra-observer variability with an attention mechanism; (2) computes and aggregates per-sample confidence scores from multiple workers to address the inter-observer disagreements. The heuristic is extremely easy to implement, in around 5 lines of code, and is evaluated on four synthetic and four real-world data sets. The results show that our approach outperforms a wide range of state-of-the-art baselines in terms of prediction accuracy and AUC. To encourage reproducibility, we make our code publicly available at https://github.com/CrowdsourcingMining/TACMA.


Introduction
Crowdsourcing offers the ability to utilize the power of human computation to generate the data annotations needed to train various AI systems. For many practical supervised learning applications, it may be infeasible (or very expensive) to obtain objective and reliable labels due to reasons such as the varying skill levels and biases of crowdsourced workers. Instead, to improve label quality, we can collect subjective and inconsistent labels from multiple heterogeneous crowdsourced workers. In practice, there is a substantial amount of disagreement between crowdsourced workers (Nie et al., 2020), i.e., inter-observer variability, or even between a worker and the same worker looking at the same example some time later (Guan et al., 2018), i.e., intra-observer variability. Hence, it is of great practical interest to address supervised learning problems in this scenario.

* Corresponding author: Zitao Liu.
Meanwhile, with recent advances in deep neural networks (DNNs), supervised representation learning (SRL) has led to rapid improvements in the ability to learn intrinsic nonlinear embeddings that keep similar examples close and dissimilar examples far apart in the embedding space. In spite of the significant progress of SRL in applications such as face recognition (Schroff et al., 2015) and image retrieval (Xia et al., 2014), directly applying existing deep language representation learning approaches to crowdsourced labels may yield poor generalization performance (Han et al., 2018). Because of their high capacity, DNNs will sooner or later entirely memorize the inconsistency within crowdsourced labels during the model training process. Moreover, this phenomenon does not change with the choice of training optimizations or network architectures (Han et al., 2018).
A large spectrum of approaches has been successfully developed for estimating true labels from crowdsourced labels, a.k.a. truth inference or label aggregation (Dawid and Skene, 1979; Whitehill et al., 2009), for learning via adversarial data generation (Wang et al., 2020a), or for learning language representations discriminatively from large-scale consistently labeled data with complicated neural architectures (Rodrigues and Pereira, 2018). However, learning effective neural embeddings directly from crowdsourced labels of real-world data poses numerous challenges. First, crowdsourced workers conduct labeling tasks sequentially, i.e., they label samples one after another. Such sequential labeling behavior is a process of learning, and the expertise of the workers is not stable but gradually changing, even without feedback (Elliott and Riach, 1965). According to Miller's Law (Miller, 1956), humans retain what they just learned in their short-term working memory with a limited span of 7 ± 2. Temporal factors such as fatigue and intrinsic motivation (Kaufmann et al., 2011) implicitly influence crowdsourcing quality, and they differ from well-studied factors such as the quality of crowdsourced workers, the difficulty of data samples, and the price of annotation tasks. In the following, such unconscious temporal behaviors are referred to as "temporal labeling effects". Modeling such sample-level temporal information for each individual worker undoubtedly poses a hard problem. Second, a large number of real-world crowdsourced data sets have a substantial amount of disagreement among labels and a relatively small sample size. The majority of existing SRL approaches are discriminatively trained on large-scale consistently labeled data to learn their complicated neural architectures, and may easily overfit inconsistent crowdsourced data.
In this paper we study and develop solutions that can learn effective neural language representations from crowdsourced labels in an end-to-end manner. Our work focuses on refinements of a popular deep language representation learning paradigm: deep metric learning (DML) (Koch et al., 2015; Xu et al., 2019; Wang et al., 2020b). We aim to develop an algorithm that automatically learns a nonlinear language representation of the crowdsourced data from multiple workers using DNNs.
Briefly, DML is a classical and widely used approach to language representation learning that keeps similar examples close and dissimilar examples far apart in the embedding space. The majority of existing DML techniques are restricted to noise-free labels. However, learning effective representations from highly inconsistent crowdsourced data sets with multiple workers raises numerous important questions: (1) since annotation performance in practice is affected by, and varies over, time (Boksem et al., 2005), how do we capture such temporal labeling effects in the DML learning framework? (2) while in some cases the problem may be alleviated by pre-processing methods such as filtering (Li et al., 2016), label correction (Li et al., 2019a), and truth inference (Dawid and Skene, 1979; Raykar et al., 2010), the number of remaining instances is often significantly reduced, or the pre-processing errors are propagated to the downstream representation learning tasks. How do we capture the label uncertainties from multiple workers and at the same time prevent overfitting in an end-to-end framework?
In this work we address the above issues by presenting a temporal-aware language representation learning heuristic for crowdsourced labels with multiple annotators (TACMA), which

• utilizes the attention mechanism to capture the temporal influence among sequential labeling tasks according to each worker's short-term working memory.
• estimates and aggregates the annotation confidence from disagreements among multiple workers for each sample.
• supports language representation learning with DML in an end-to-end fashion, and is extremely easy to implement on top of an existing DML framework for crowdsourced labels, i.e., RLL (Xu et al., 2019), in around 5 lines of code.
Related Work

Truth Inference in Crowdsourcing
A large body of research has focused on inferring true labels from crowdsourced labels from multiple workers (Dawid and Skene, 1979; Whitehill et al., 2009; Li et al., 2019c; Rodrigues and Pereira, 2018). The majority of truth inference approaches are inspired by the classic Expectation-Maximization learning paradigm that iterates between estimating the expertise of annotators given the inferred true labels and inferring true labels given the expertise of annotators (Dawid and Skene, 1979; Whitehill et al., 2009; Zhang et al., 2014; Li et al., 2019c). Some improvements include modeling the difficulty of items and the expertise of annotators jointly (Whitehill et al., 2009), applying spectral methods to initialize worker confusion matrices (Zhang et al., 2014), and modeling correlations between workers (Li et al., 2019c).
In spite of the successful applications of truth inference techniques, the majority of the aforementioned approaches do not consider the temporal effects of each individual worker's labeling tasks, and they cannot be seamlessly integrated into deep SRL frameworks.

Learning from Noisy Labels
Learning with noisy labels has been an important research topic since the beginning of machine learning (Frénay and Verleysen, 2013), and a large spectrum of models has been developed and successfully applied to improve model prediction performance in noisy settings from different perspectives, such as effective label cleaning (Lee et al., 2018), robust model architectures (Vahdat, 2017) and loss functions (Ghosh et al., 2017), sample reweighting (Ren et al., 2018), and carefully designed training procedures (Zhong et al., 2019).
However, different from the above approaches to robust learning from noisy labels, which assume a certain percentage of labels are corrupted, our scenario focuses on noisy labels obtained from multiple annotators, where the disagreement (corruption) proportion may be surprisingly high, sometimes even 100%, i.e., not a single sample on which all crowd workers completely agree.

Deep Metric Learning
DML approaches automatically learn nonlinear metric spaces (Schroff et al., 2015) and have achieved promising results in tasks such as face recognition (Schroff et al., 2015), person re-identification (Yi et al., 2014), and collaborative filtering (Hsieh et al., 2017). Recently, a body of work has attempted to learn effective embeddings from crowdsourced labels using DML approaches (Xu et al., 2019; Wang et al., 2020b). For example, Xu et al. estimated crowdsourced label confidence and adjusted the DML loss function accordingly (Xu et al., 2019). An exhaustive review of previous work is beyond the scope of this paper; we refer to the survey of (Schroff et al., 2015) on DML. Although DML approaches are able to learn effective representations, they rely heavily on comparisons within pairs or triplets, which are very sensitive to ambiguous examples and may be easily misled by inconsistent crowdsourced labels.
Please note that models from the above three categories are complementary and can be combined. For example, learning representations from crowdsourced labels can be conducted in two stages, where the truth inference algorithms in Section 2.1 are applied to obtain estimated labels and then the standard DML approaches in Section 2.3 are used to output the learned embeddings. Details are discussed in Section 4.

Notations
Without loss of generality, we consider crowdsourcing scenarios in which each data sample is annotated by multiple workers. Following crowdsourcing practice, and to avoid the order effect (Hogarth and Einhorn, 1992) and cheating, each worker annotates the same set of samples but in a shuffled order. Let α^j be the sample order index set for the j-th worker and α_i^j be the index of the i-th sample for worker j. Let x_{α_i^j} and y_{α_i^j} be the feature vector and the worker's assigned label for sample α_i^j. Let F(·) denote the learned language representation. Let (·)^+ and (·)^- be the indicators of positive and negative examples.

Temporal-Aware Memory Confidence
According to Miller's Law (Miller, 1956), humans can only hold a very limited number of objects in their short-term working memory. When workers conduct labeling tasks, they tend to make relative comparisons within their memory spans, and the annotation quality of a sample is largely influenced by its preceding samples. Therefore, in this work, we focus on studying and modeling the effects of unconscious human behaviors during the labeling process that may implicitly influence the overall crowdsourcing quality. We design an approach to explicitly capture such unconscious temporal human behaviors, i.e., temporal labeling effects. We aim to ensure that newly annotated samples obtain labels consistent with similar samples that have been annotated recently.
Here we first define the short-term labeling memory as follows:

Definition 1. (SHORT-TERM LABELING MEMORY) A short-term labeling memory of the i-th sample of worker j, i.e., indexed as α_i^j, is composed of the current item and the k most recent historical items that have been labeled by worker j, i.e.,

M_i^j = (α_{i-k}^j, …, α_{i-1}^j, α_i^j).

When the new labeling task arrives, i.e., the i-th sample, we compute a weight for every element in worker j's short-term labeling memory M_i^j as the dot product of their learned language representations:

w_{i,l}^j = F(x_{α_i^j}) · F(x_{α_{i-l}^j}), l = 1, …, k.

This weight may be viewed as an attention over the short-term labeling memory, per sample and per worker.

To form a proper probability distribution over the elements in M_i^j, we normalize the weights using the softmax function. This way we model the probability s_{α_{i-l}^j} that represents the similarity between the i-th sample and the sample appearing at position l in M_i^j. In functional form:

s_{α_{i-l}^j} = exp(w_{i,l}^j) / Σ_{l'=1}^{k} exp(w_{i,l'}^j).

Then we define a memory confidence score, c_i^j, to represent the probability that sample i is positive (y_{α_i^j} = 1) solely considering similar samples in the short-term labeling memory. The memory confidence score c_i^j is computed as

c_i^j = Σ_{l=1}^{k} s_{α_{i-l}^j} · 1(y_{α_{i-l}^j} = 1).

Please note that our attention-based temporal-aware memory confidence scores are not limited to binary crowdsourcing tasks and can be easily extended to multi-class tasks.
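The memory confidence computation above can be sketched in a few lines. This is our own illustrative code, not the authors' released implementation; the function name `memory_confidence` and the toy embeddings are ours:

```python
import numpy as np

def memory_confidence(emb_current, emb_history, labels_history):
    """Attention-weighted probability that the current sample is positive,
    given one worker's k most recent (embedding, binary label) pairs."""
    # Dot-product attention between the current sample and each memory item.
    weights = emb_history @ emb_current
    # Softmax normalization over the short-term labeling memory.
    e = np.exp(weights - weights.max())
    s = e / e.sum()
    # Confidence = similarity-weighted vote of the recent binary labels.
    return float(s @ labels_history)

emb_cur = np.array([1.0, 0.0, 0.0])
emb_hist = np.stack([np.array([0.9, 0.1, 0.0]),    # similar item, labeled 1
                     np.array([0.0, 0.0, 1.0])])   # dissimilar item, labeled 0
c = memory_confidence(emb_cur, emb_hist, np.array([1.0, 0.0]))
```

Because the similar memory item carries label 1, the attention weights push the confidence above 0.5, reflecting the intuition that the new sample should agree with recently labeled look-alikes.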

Multi-Worker Confidence Aggregation
For each sample i, after collecting the memory confidence scores from all workers, we use mean pooling as our aggregation operation; the final aggregated multi-worker confidence is computed as

c_i = (1/m) Σ_{j=1}^{m} c_i^j,

where m is the number of workers.
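The aggregation step is a one-liner in practice; this minimal sketch (the function name is ours) mean-pools the per-worker scores:

```python
import numpy as np

def aggregate_confidence(per_worker_scores):
    """Mean-pool the per-worker memory confidence scores c_i^j into c_i."""
    return float(np.mean(per_worker_scores))

c_i = aggregate_confidence([0.9, 0.7, 0.8])   # scores from m = 3 workers
```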

Representation Learning Framework
We use DML as our representation learning framework. Specifically, following the suggestion of (Xu et al., 2019), instead of using pair or triplet comparisons, we use a group, a.k.a. n-tuplet, as our comparison unit. A group is made up of two positive and n negative examples. Similar to (Xu et al., 2019), we learn our model parameters by maximizing the conditional likelihood of retrieving the positive example x_j^+ given the positive example x_i^+ from a given group. Importantly, we do not assume that we know the ground truth labels of items in the training and validation sets. During the training stage, after obtaining the aggregated multi-worker confidence c_i of an item with the methods introduced in Section 3.3, its label is estimated by arg max c_i.
Given a collection of groups, we optimize the DML model parameters by maximizing the sum of log conditional likelihoods of finding a positive example x_j^+ given the paired positive example x_i^+ within every group g, which pushes items of the same class close together and items of different classes far apart in the embedding space. Furthermore, we incorporate the aggregated temporal-aware multi-worker confidence scores from Section 3.3 into the loss function to capture the inconsistency of crowdsourced labels. The loss function is defined as

L(Ω) = - Σ_g c_i log p(x_j^+ | x_i^+),  with  p(x_j^+ | x_i^+) = exp(η r_{ij}) / (exp(η r_{ij}) + Σ_{x^- ∈ g} exp(η r_{i-})),

where Ω is the parameter set of the DNN, r_{i*} represents the cosine similarity score between the representations of x_i^+ and x_* in the embedding space, and η is a smoothing hyper-parameter in the softmax function, which is set empirically on a held-out data set in our experiments. Since L(Ω) is differentiable with respect to Ω, we use gradient-based optimization to train the DNN. The goal of training is thus to maximize the conditional likelihood p(x_j^+ | x_i^+) while incorporating the temporal-aware confidence scores.
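The per-group loss can be sketched as follows. This is our own numpy illustration of a confidence-weighted softmax retrieval loss in the spirit of RLL (Xu et al., 2019), not the paper's TensorFlow code; the function and variable names are ours:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def group_loss(x_pos_i, x_pos_j, negatives, confidence, eta=1.0):
    """Confidence-weighted negative log-likelihood of retrieving the paired
    positive x_j^+ given x_i^+ among a group with n negatives."""
    logits = eta * np.array([cosine(x_pos_i, x_pos_j)] +
                            [cosine(x_pos_i, x_n) for x_n in negatives])
    m = logits.max()
    log_p = logits[0] - (m + np.log(np.exp(logits - m).sum()))  # log-softmax
    return -confidence * log_p

anchor = np.array([1.0, 0.0])
pos = np.array([0.9, 0.1])
negs = [np.array([0.0, 1.0]), np.array([-1.0, 0.2])]
l_full = group_loss(anchor, pos, negs, confidence=1.0)
l_half = group_loss(anchor, pos, negs, confidence=0.5)
```

Low-confidence groups contribute proportionally less gradient, which is how the aggregated confidence softens the influence of inconsistent crowdsourced labels.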

Experiments
Experiments are conducted on both real-world and synthetic data sets. Internal cross-validation is used to select hyper-parameters when optimizing the models' predictive performance. Means as well as standard deviations of both accuracy and AUC scores are reported to comprehensively evaluate the performance of our proposed method, i.e., TACMA.

Real-World Data Sets
Experiments are first conducted on 4 real-world data sets and the corresponding descriptive statistics can be found in Table 1.
• Emotion: A vocal emotional speech data set with binary labels indicating whether the voice fragment is exciting or not.
• Concluding: A linguistic data set where each item is labeled on whether it is a conclusion of a lesson.
• Commending: A linguistic data set of ASR transcripts from real-world classroom recordings. Each item is labeled on whether it's a commending instruction from the instructors.
• Question: A vocal speech data set where each item is labeled on whether it is an interrogative sentence.
Acoustic features of the Emotion data set are extracted using OpenSmile¹ with the computational paralinguistics challenge (COMPARE-2013) feature set (Schuller et al., 2013). Sentence embedding features are extracted with a Chinese RoBERTa pretrained model². Again, we emphasize that the ground truth labels of items in the training and validation sets are not observed. In order to evaluate the performance of each model objectively, the items in the test sets are labeled by experts who have reached an agreement on the labels.
Inter-observer variability of each data set is measured with the Fleiss kappa score (Fleiss, 1971). Intra-observer variability, i.e., the level of consistency of an annotator when labeling items from the same class, is hard to measure directly without ground truth labels. We explore the effect of intra-observer variability using temporal-aware memory confidence in Section 4.8.

¹ https://www.audeering.com/opensmile/
² https://github.com/ymcui/Chinese-BERT-wwm

Synthetic Data Sets
In real-world scenarios, annotators are not guaranteed to be serious about their annotation work, and some may assign random labels in order to get paid quickly. Methods designed for crowdsourcing scenarios should be robust to the influence of such noisy annotations. Hence we build synthetic data sets to evaluate each method's robustness to irresponsible annotators. Starting from the original Question data set, we gradually add 2, 4, 6 and 8 simulated irresponsible annotators who make random judgments regardless of item features. In the worst case, 8 out of 13 workers are making random judgments, resulting in an extremely low kappa of 0.02. Experiments conducted on these synthetic data sets help examine the robustness of each method.
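The synthetic construction above can be sketched as follows (our own illustration of the described procedure; the function name and toy label matrix are ours):

```python
import numpy as np

def add_random_annotators(label_matrix, n_extra, seed=0):
    """Append n_extra simulated irresponsible workers who assign uniformly
    random binary labels regardless of item features (items x workers)."""
    rng = np.random.default_rng(seed)
    n_items = label_matrix.shape[0]
    random_labels = rng.integers(0, 2, size=(n_items, n_extra))
    return np.hstack([label_matrix, random_labels])

base = np.ones((100, 5), dtype=int)             # 5 real workers (toy labels)
noisy = add_random_annotators(base, n_extra=8)  # worst case: 8 random workers
```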
Group 3: Learning from Noisy Data. Group 3 contains methods for learning with noisy labels: LC (Arazo et al., 2019) uses a two-component beta mixture model to perform unsupervised noise modeling, and (Rodrigues and Pereira, 2018) is an end-to-end approach learning a DNN from noisy labels with a crowd layer.
Group 4: Combining Group 1 with Groups 2 & 3. Some methods in Groups 2 & 3, i.e., Triple, Center, LC and DivideMix, are not specifically designed for crowdsourcing scenarios. Although majority-voting labels serve as a default choice, these models can be trained with labels inferred by methods of Group 1 as stronger baselines, since Group 1 methods are likely to provide more accurate inferred labels than majority voting. These methods are therefore trained with labels inferred by EBCC, which achieves the best performance among Group 1 methods on all data sets.

Setup and Implementation Details
Experimental code is implemented in TensorFlow 1.8 and available at https://github.com/CrowdsourcingMining/TACMA. Experiments are conducted on a server with a GTX 1080 Ti GPU. We set the tuplet size n to 5 for all experiments, as suggested in (Xu et al., 2019). The representation learning network has a simple structure, i.e., 2 fully-connected layers with a drop-out rate of 0.2 and a learning rate of 1e-3; hyper-parameters including the size of each layer and the scale of ℓ2 regularization are searched via grid search with cross-validation. The network weights are initialized with a normal distribution initializer and updated with the Adadelta optimizer (Zeiler, 2012). For all representation learning methods, the downstream classifier is a logistic regression classifier with an ℓ2 penalty, whose only hyper-parameter C, the penalty strength, ranges from 1e-2 to 1e4.

Performance Comparison
We compare the performance of TACMA with existing methods on 4 real-world data sets; the results are summarized in Table 2. TACMA outperforms all 4 groups of baselines, and we make the following observations: • The advantage of TACMA over truth inference methods is larger on the Concluding data set than on the other data sets. The Concluding data set has a low kappa score of 0.37, indicating more disagreement among workers, which makes it hard to infer correct labels without regard to item features. By contrast, TACMA makes full use of item representations to gain more information, resulting in the best performance.
• Although labels inferred by EBCC boost the performance of representation learning models, e.g., Triple+EBCC, they still perform worse than TACMA. A possible explanation is that these two-stage methods give equal weight to each item and ignore temporal labeling effects. TACMA is able to discover potential conflicts in the short-term working memory by applying the attention mechanism, and gives low weights to the conflicting judgments.
• TACMA shares the same representation network structure with the other methods of representation learning with crowdsourced labels, i.e., RLL-MLE, RLL-Bayesian and RECLE.
The learned representations are compared in Figure 2 by feeding the raw features into the representation network and reducing the dimension to a 2-dimensional space with t-SNE (van der Maaten and Hinton, 2008). In the raw feature space, items of different classes are interleaved with each other. By contrast, the learned representations of TACMA are more separated than those of the other methods, reducing the difficulty of downstream classification tasks.

Robustness to Irresponsible Workers
We select representatives from Groups 1-4 and draw the accuracy curves on synthetic data sets containing different numbers of irresponsible workers in Figure 3. We find that: • Truth inference methods such as EBCC remain stable across different numbers of irresponsible workers. In contrast, the accuracy of the other methods decreases as the number of irresponsible workers increases. This result may be explained by the fact that for methods such as RLL-Bayesian and Triple, learning effective representations of items relies heavily on correct labels and hence becomes harder as the labels become noisier.
• TACMA maintains the highest accuracy of all methods. Unlike the two-stage method, i.e., Triple+EBCC, which gives equal weight to each item and ignores temporal labeling effects, TACMA is able to discover potential conflicts in the short-term working memory using the attention mechanism, and gives low training weights to the conflicting judgments.

Effect of Working Memory Sizes
We vary the working memory size from 3 to 11 to find the optimal length and, at the same time, explore its influence on performance, as shown in Figure 4. The accuracy of our proposed method initially increases with the working memory size, while the standard deviations gradually become smaller. This is reasonable because potential inconsistent judgments among similar items cannot be found without observing enough historical annotations. As the working memory size continues to grow, the accuracy scores become relatively stable, indicating that there is sufficient evidence to estimate the temporal-aware confidence of the current annotation.

Relations between Temporal-aware Memory Confidence and Worker's Expertise
In this part we further explore the relation between a worker's expertise and temporal-aware memory confidence. To evaluate a worker's expertise, a logistic regression classifier is trained with the labels annotated by that person, and the accuracy on the corresponding test set is recorded. On the other hand, the temporal-aware confidence of all the judgments made by this worker is averaged. We standardize both the accuracy scores and the averaged temporal-aware confidence scores within the corresponding data set, and put the standardized values of all 62 workers from the 4 real-world and 4 synthetic data sets together in Figure 5, to reveal the universal relation between temporal-aware confidence and worker expertise. We find a wide range of intra-observer variability among different workers, as estimated by their temporal-aware confidence scores. A strong positive correlation is found between averaged confidence and prediction accuracy (Pearson's r = 0.844). In particular, the synthetic irresponsible annotators, colored in blue, are automatically clustered in the lower left corner, indicating that the poor performance of the classifiers trained with their labels derives from large internal inconsistencies in their judgments.

Figure 5: The relation between standardized temporal-aware memory confidence and standardized prediction accuracy of annotators in both real and synthetic data sets. Most of the irresponsible annotators appear in the lower left corner, indicating that there are internal conflicts in their judgments (low confidence), and therefore LR models trained with these labels perform worse than average.
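The standardization-and-correlation analysis above can be sketched as follows (our own illustration with toy per-worker values; the function names are ours):

```python
import numpy as np

def standardize(x):
    """Z-score values within one data set."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def pearson_r(x, y):
    """Pearson correlation via the mean product of standardized scores."""
    return float(np.mean(standardize(x) * standardize(y)))

accuracy   = [0.60, 0.70, 0.80, 0.90]   # per-worker test accuracy (toy values)
confidence = [0.55, 0.65, 0.78, 0.88]   # averaged memory confidence (toy values)
r = pearson_r(accuracy, confidence)
```

A worker whose confidence tracks their accuracy produces an r close to 1, which is the pattern the figure reports across the 62 workers.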

Conclusion
We presented TACMA, an end-to-end framework for language representation learning from crowdsourced labels. Compared with traditional SRL approaches, the advantages of our framework are: (1) it considers temporal labeling effects within each worker's sequence of sample-level labeling tasks; (2) it automatically computes and aggregates sample-level confidence scores from multiple workers, which makes the training process more effective. Experimental results on both synthetic and real-world data sets demonstrate that our approach outperforms other state-of-the-art baselines in terms of accuracy and AUC scores.