Crowdsourcing Learning as Domain Adaptation: A Case Study on Named Entity Recognition

Crowdsourcing is regarded as one prospective solution for effective supervised learning, aiming to build large-scale annotated training data with crowd workers. Previous studies focus on reducing the influence of noise in crowdsourced annotations on supervised models. We take a different point of view in this work, regarding all crowdsourced annotations as gold-standard with respect to the individual annotators. In this way, we find that crowdsourcing could be highly similar to domain adaptation, and then the recent advances of cross-domain methods can be almost directly applied to crowdsourcing. Here we take named entity recognition (NER) as a case study, suggesting an annotator-aware representation learning model inspired by domain adaptation methods that attempt to capture effective domain-aware features. We investigate both unsupervised and supervised crowdsourcing learning, assuming that no or only small-scale expert annotations are available. Experimental results on a benchmark crowdsourced NER dataset show that our method is highly effective, leading to a new state-of-the-art performance. In addition, under the supervised setting, we can achieve impressive performance gains with only a very small scale of expert annotations.


Introduction
Crowdsourcing has gained growing interest in the natural language processing (NLP) community, as it helps hard NLP tasks such as named entity recognition (Finin et al., 2010;Derczynski et al., 2016), part-of-speech tagging (Hovy et al., 2014), relation extraction (Abad et al., 2017), translation (Zaidan and Callison-Burch, 2011), argument retrieval (Mayhew et al., 2020), and others (Snow et al., 2008;Callison-Burch and Dredze, 2010) to collect large-scale datasets for supervised model training. In contrast to gold-standard annotations labeled by experts, crowdsourced annotations can be constructed quickly at a low cost with masses of crowd annotators (Snow et al., 2008;Nye et al., 2018). However, these annotations are of relatively lower quality, with much unexpected noise, since the crowd annotators are not professional enough and can make errors in complex and ambiguous contexts (Sheng et al., 2008).

Previous crowdsourcing learning models strive to reduce the influence of noise in crowdsourced annotations (Hsueh et al., 2009;Raykar and Yu, 2012a;Hovy et al., 2013;Jamison and Gurevych, 2015). Majority voting (MV) is one straightforward way to aggregate high-quality annotations, which has been widely adopted (Snow et al., 2008;Fernandes and Brefeld, 2011;Rodrigues et al., 2014), but it requires multiple annotations for a given input. Recently, the majority of models concentrate on modeling the distances between crowdsourced and gold-standard annotations, obtaining better performance than MV by considering the annotator information as well (Nguyen et al., 2017;Simpson and Gurevych, 2019;Li et al., 2020). Most of these studies treat the crowdsourced annotations as untrustworthy answers, proposing sophisticated strategies to recover the golden answers from crowdsourced labels.
In this work, we take a different view for crowdsourcing learning, regarding the crowdsourced annotations as the gold standard in terms of individual annotators. In other words, we assume that all annotators (including experts) own their specialized understandings towards a specific task, and they annotate the task consistently according to their individual principles by the understandings, where the experts can reach an oracle principle by consensus. The above view indicates that crowdsourcing learning aims to train a model based on the understandings of crowd annotators, and then test the model by the oracle understanding from experts.
Based on this assumption, we find that crowdsourcing learning is highly similar to domain adaptation, which is one important topic that has been investigated extensively for decades (Ben-David et al., 2006;Daumé III, 2007;Chu and Wang, 2018;Jia and Zhang, 2020). Specifically, we treat each annotator as one domain, and then crowdsourcing learning is essentially almost a multi-source domain adaptation problem. Thus, one natural question arises: what is the performance when a state-of-the-art domain adaptation model is applied directly to crowdsourcing learning?
Here we take NER as a study case to investigate crowdsourcing learning as domain adaptation, considering that NER has been one popular task for crowdsourcing learning in the NLP community (Finin et al., 2010;Rodrigues et al., 2014;Derczynski et al., 2016). We suggest a state-of-the-art representation learning model that can effectively capture annotator(domain)-aware features. Also, we investigate two settings of crowdsourcing learning, one being the unsupervised setting with no expert annotation, which has been widely studied before, and the other being the supervised setting where a certain scale of expert annotations exists, which is inspired by domain adaptation.
Finally, we conduct experiments on a benchmark crowdsourcing NER dataset (Tjong Kim Sang and De Meulder, 2003;Rodrigues et al., 2014) to evaluate our methods. We take a standard BiLSTM-CRF (Lample et al., 2016) model with BERT (Devlin et al., 2019) word representations as the baseline, and adapt it to our representation learning model. Experimental results show that our method is able to model crowdsourced annotations effectively. Under the unsupervised setting, our model can give a strong performance, outperforming previous work significantly. In addition, the model performance can be greatly boosted by feeding with small-scale expert annotations, which can be a prospective direction for low-resource scenarios.
Figure 2: Illustration of the connection between multi-source domain adaptation and crowdsourcing learning.
In summary, we make the following three major contributions: (1) We present a different view of crowdsourcing learning, and propose to treat crowdsourcing learning as domain adaptation, which naturally connects the two important topics of machine learning for NLP.
(2) We propose a novel method for crowdsourcing learning. Although the method is of limited novelty for domain adaptation, it is the first such work for crowdsourcing learning, and can achieve state-of-the-art performance on NER.
(3) We introduce supervised crowdsourcing learning for the first time, which is borrowed from domain adaptation and would be a prospective solution for hard NLP tasks in practice.
We will release the code and detailed experimental settings at github.com/izhx/CLasDA under the Apache License 2.0 to facilitate future research.

The Basic Idea
Here we describe the concepts of domain adaptation and crowdsourcing learning in detail, and show how they are connected.

Domain Adaptation
Domain adaptation arises when a supervised model trained on a fixed training corpus covering several specific domains is required to be tested on a different domain (Ben-David et al., 2006;Mansour et al., 2009). The scenario is quite frequent in practice, and thus has received extensive attention with massive investigations (Csurka, 2017;Ramponi and Plank, 2020). The major problem lies in the different input distributions between the source and target domains, leading to biased predictions over inputs with a large gap to the source domains.
Here we focus on multi-source domain adaptation, which best matches our correspondence below. Following Mansour et al. (2009) and Zhao et al. (2019), multi-source domain adaptation assumes a set of labeled examples from $M$ domains, denoted by $D^{\mathrm{src}} = \{(\mathcal{X}^i, \mathcal{Y}^i)\}_{i=1}^{M}$ with $\mathcal{X}^i = \{x^i_j\}_{j=1}^{N_i}$,² and we aim to train a model on $D^{\mathrm{src}}$ to adapt to a specific target domain with the help of a large-scale raw corpus $\mathcal{X}^{\mathrm{tgt}} = \{x_i\}_{i=1}^{N_t}$ of the target domain. Note that under this setting, all $\mathcal{X}$s, including those of the source and target domains, are generated individually according to their unknown distributions; thus the abstract representations learned from the source dataset $D^{\mathrm{src}}$ would inevitably be biased with respect to the target domain, which is the primary reason for degraded target-domain performance (Huang and Yates, 2010;Ganin et al., 2016). A number of domain adaptation models have striven for better transferable high-level representations to address this domain shift (Ramponi and Plank, 2020).

Crowdsourcing Learning
Crowdsourcing aims to produce a set of large-scale annotated examples created by crowd annotators, which is used to train supervised models for a given task (Raykar et al., 2010). As the majority of NLP models assume that gold-standard, high-quality training corpora are already available (Manning and Schutze, 1999), crowdsourcing learning has received much less interest than cross-domain adaptation, although such corpora are often unavailable in reality.
Formally, under the crowdsourcing setting, we usually assume that there are a number of crowd annotators $A = \{a^i\}_{i=1}^{M}$ (here we use the same $M$ as well as later superscripts in order to align with domain adaptation), and each annotator provides a sufficient number of training examples according to their different understandings of a given task, which are referred to as $D^{\mathrm{crowd}} = \{(\mathcal{X}^i, \mathcal{Y}^i)\}_{i=1}^{M}$. We aim to train a model on $D^{\mathrm{crowd}}$ and adapt it to predict the expert outputs. Note that all $\mathcal{X}$s do not have significant differences in their distributions in this paradigm.¹

¹ A domain is commonly defined as a distribution on the input data in many works, e.g., Ben-David et al. (2006). To make domain adaptation and crowdsourcing learning highly similar in formulation, we follow Zhao et al. (2019), defining a domain as a joint distribution on the input space $\mathcal{X}$ and the label space $\mathcal{Y}$. Section 4.5 gives a discussion of their connection.
² $N_*$ indicates the number of instances.

Crowdsourcing Learning as Domain Adaptation
By scrutinizing the above formalization, when we join all $\mathcal{X}$s with their annotators by using $\tilde{x}^i_j = a^i(x^i_j)$, which indicates the contextualized understanding (a vectorial form of the neural representations is desirable here) of $x^i_j$ by annotator $a^i$, then we can regard $D^{\mathrm{crowd}} = \{(\tilde{\mathcal{X}}^i, \mathcal{Y}^i)\}_{i=1}^{M}$ as being generated from different distributions as well. In this way, we are able to connect crowdsourcing learning and domain adaptation, as shown in Figure 2, based on the assumption that all $\mathcal{Y}$s are gold-standard for crowdsourced annotations once crowd annotators are joined into the inputs. Finally, we need to perform predictions by regarding $\tilde{x}^{\mathrm{expert}} = \mathrm{expert}(x)$; in particular, the learning of $\mathrm{expert}$ differs from that of the target domain in domain adaptation.
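The correspondence above can be summarized in one line, using the notation of this section (a sketch, with the annotator playing the role of the domain):

```latex
\underbrace{D^{\mathrm{src}}=\{(\mathcal{X}^i,\mathcal{Y}^i)\}_{i=1}^{M}}_{\text{multi-source domain adaptation}}
\;\;\Longleftrightarrow\;\;
\underbrace{D^{\mathrm{crowd}}=\{(\tilde{\mathcal{X}}^i,\mathcal{Y}^i)\}_{i=1}^{M},
\quad \tilde{x}^i_j=a^i(x^i_j)}_{\text{crowdsourcing learning}}
```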

A Case Study On NER
In this section, we take NER as a case study, as it has been investigated most frequently in NLP (Yadav and Bethard, 2018), and propose a representation learning model mainly inspired by the domain adaptation model of Jia et al. (2019) to perform crowdsourcing learning. In addition, we introduce the unsupervised and supervised settings for crowdsourcing learning, which are directly borrowed from domain adaptation.

The Representation Learning Model
We convert NER into a standard sequence labeling problem by using the BIO schema, following the majority of previous works, and extend a state-of-the-art BERT-BiLSTM-CRF model (Mayhew et al., 2020) to our crowdsourcing learning. Figure 3 shows the overall network structure of our representation learning model. By using a sophisticated parameter generator module (Platanios et al., 2018), it can capture annotator-aware features. In the following, we introduce the proposed model in terms of four components: (1) word representation, (2) annotator switcher, (3) BiLSTM encoding, and (4) CRF inference and training.
Word Representation Given a sentence of $n$ words $x = w_1 \cdots w_n$, we first convert it to vectorial representations by BERT. Different from the standard BERT exploration, here we use Adapter•BERT (Houlsby et al., 2019), where two extra adapter modules are inserted inside each transformer layer. The process can be simply formalized as $r_1 \cdots r_n = \mathrm{Adapter}{\bullet}\mathrm{BERT}(w_1 \cdots w_n)$, where $\bullet$ indicates an injection operation. The detailed structure of the transformer with adapters is described in Appendix A.
Noticeably, the Adapter•BERT method no longer requires fine-tuning the huge number of BERT parameters, and can obtain comparable performance by adjusting the much more lightweight adapter parameters instead. Thus the representation is more parameter-efficient, and in this way we can easily extend the word representations to annotator-aware representations.
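As an illustration, the adapter computation inside one transformer layer can be sketched as follows (illustrative dimensions; the real module also includes layer normalization and is trained, whereas the weights here are random placeholders):

```python
import numpy as np

def make_adapter(hidden_dim, bottleneck_dim, rng):
    """Create a down-project / up-project adapter in the style of Houlsby et al. (2019)."""
    return {
        "W_down": rng.standard_normal((hidden_dim, bottleneck_dim)) * 0.02,
        "W_up": rng.standard_normal((bottleneck_dim, hidden_dim)) * 0.02,
    }

def apply_adapter(h, adapter):
    """Bottleneck transform with a residual connection: h + up(relu(down(h)))."""
    z = np.maximum(h @ adapter["W_down"], 0.0)   # down-project + ReLU
    return h + z @ adapter["W_up"]               # up-project + residual

rng = np.random.default_rng(0)
adapter = make_adapter(hidden_dim=768, bottleneck_dim=64, rng=rng)
h = rng.standard_normal((5, 768))                # 5 token states from a transformer layer
out = apply_adapter(h, adapter)
assert out.shape == h.shape                      # residual keeps the hidden size
```

Because only the small `W_down`/`W_up` matrices are trained, the number of tunable parameters stays far below full BERT fine-tuning.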
Annotator Switcher Our goal is to efficiently learn annotator-aware word representations, which can be regarded as contextualized understandings of individual annotators. Hence, we introduce an annotator switcher to support Adapter•BERT with annotator input as well, which is inspired by Üstün et al. (2020). The key idea is to use a Parameter Generation Network (PGN) (Platanios et al., 2018;Jia et al., 2019) to produce adapter parameters dynamically from the input annotators. In this way, our model can flexibly switch among different annotators.
Concretely, assuming that $V$ is the vectorial form of all adapter parameters obtained by a pack operation, which can also be unpacked to recover all adapter parameters, the PGN module generates $V$ for Adapter•BERT dynamically according to the annotator input, as shown by the right orange part of Figure 3. The switcher can be formalized as $V^a = \Theta \times e^a$ and $\tilde{x} = r_1 \cdots r_n = \mathrm{PGN}(\mathrm{Adapter}{\bullet}\mathrm{BERT})(w_1 \cdots w_n, e^a)$, where $\Theta \in \mathbb{R}^{|V| \times |e^a|}$, $\tilde{x}$ is the annotator-aware representation of annotator $a$ for $x = w_1 \cdots w_n$, and $e^a$ is the annotator embedding.
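A tiny numerical sketch of the PGN idea, with hypothetical (very small) dimensions: the generator $\Theta$ maps an annotator embedding to a flat parameter vector $V$, which is then unpacked into annotator-specific adapter weights:

```python
import numpy as np

rng = np.random.default_rng(1)
hidden, bottleneck, emb_dim = 8, 4, 3            # toy sizes, not the real model's
n_params = 2 * hidden * bottleneck               # W_down and W_up, flattened together

Theta = rng.standard_normal((n_params, emb_dim)) * 0.02  # shared generator, |V| x |e^a|
e_a = rng.standard_normal(emb_dim)               # embedding of annotator a

V = Theta @ e_a                                  # pack: all adapter parameters at once
W_down = V[: hidden * bottleneck].reshape(hidden, bottleneck)   # unpack
W_up = V[hidden * bottleneck :].reshape(bottleneck, hidden)

def adapter(h):                                  # the annotator-specific adapter
    return h + np.maximum(h @ W_down, 0.0) @ W_up

h = rng.standard_normal((2, hidden))             # two token states
assert adapter(h).shape == (2, hidden)
```

Switching annotators only requires swapping `e_a`; the shared `Theta` is what lets annotators borrow strength from one another.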
BiLSTM Encoding Adapter•BERT requires an additional task-oriented module for high-level feature extraction. Here we exploit a single BiLSTM layer to achieve it: $h_1 \cdots h_n = \mathrm{BiLSTM}(\tilde{x})$, which is used for next-step inference and training.
CRF Inference and Training We use a CRF to calculate the score of a candidate sequential output $y = l_1 \cdots l_n$ globally: $\mathrm{score}(x, y) = \sum_{i=1}^{n} \big( (W_{\mathrm{crf}} h_i + b_{\mathrm{crf}})[l_i] + T[l_{i-1}, l_i] \big)$, where $W_{\mathrm{crf}}$, $b_{\mathrm{crf}}$ and $T$ are model parameters.
Given an input $(x, a)$, we perform inference by the Viterbi algorithm. For training, we define a sentence-level cross-entropy objective: $\mathcal{L} = -\log p(y^a \mid x, a)$, with $p(y^a \mid x, a) = \exp(\mathrm{score}(x, y^a)) / \sum_{y} \exp(\mathrm{score}(x, y))$, where $y^a$ is the gold-standard output of $x$ from $a$, $y$ ranges over all possible candidates, and $p(y^a \mid x, a)$ indicates the sentence-level probability.
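A minimal numpy sketch of this CRF machinery, with toy emission scores standing in for the per-token outputs $W_{\mathrm{crf}} h_i + b_{\mathrm{crf}}$ (illustrative only, not the authors' implementation):

```python
import numpy as np

def crf_score(emissions, trans, tags):
    """Global score of one tag sequence: emission scores plus transition scores."""
    s = emissions[0, tags[0]]
    for t in range(1, len(tags)):
        s += trans[tags[t - 1], tags[t]] + emissions[t, tags[t]]
    return s

def crf_log_partition(emissions, trans):
    """log of the sum over all tag sequences, via the forward algorithm."""
    alpha = emissions[0]
    for t in range(1, emissions.shape[0]):
        # logsumexp over the previous tag, for each current tag
        m = alpha[:, None] + trans + emissions[t][None, :]
        alpha = np.log(np.exp(m - m.max(0)).sum(0)) + m.max(0)
    return np.log(np.exp(alpha - alpha.max()).sum()) + alpha.max()

rng = np.random.default_rng(2)
emissions = rng.standard_normal((4, 3))          # 4 tokens, 3 labels (e.g., B/I/O)
trans = rng.standard_normal((3, 3))              # transition matrix T
tags = [0, 1, 1, 2]                              # a candidate gold path

# Negative log-likelihood of the gold path: log Z - score(gold).
nll = crf_log_partition(emissions, trans) - crf_score(emissions, trans, tags)
assert nll > 0                                   # the gold path is one of many candidates
```

Training minimizes `nll`; at test time the Viterbi algorithm replaces the sum in the forward recursion with a max to recover the best path.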

The Unsupervised Setting
Here we introduce unsupervised crowdsourcing learning in alignment with unsupervised domain adaptation, assuming that no expert annotation is available, which is the widely-adopted setting of previous work on crowdsourcing learning (Sheng et al., 2008;Zhang et al., 2016;Sheng and Zhang, 2019). This setting diverges greatly from domain adaptation in target learning: in unsupervised domain adaptation, the information of the target domain can be learned from a large-scale raw corpus (Ramponi and Plank, 2020), whereas there is no counterpart in unsupervised crowdsourcing learning from which to learn information about experts.
To this end, here we suggest a simple and heuristic method to model experts that exploits the specialty of crowdsourcing learning. Intuitively, we expect that experts should approve the knowledge of the common consensus for a given task, and meanwhile, our model needs an embedding representation of the expert for inference. Thus, we can estimate the expert embedding as the centroid of all annotator embeddings: $e^{\mathrm{expert}} = \frac{1}{|A|} \sum_{a \in A} e^a$, where $A$ represents all annotators who contributed to the training corpus. This expert can be interpreted as the outcome elected by annotator voting with equal importance. In this way, we perform inference in unsupervised crowdsourcing learning by feeding $e^{\mathrm{expert}}$ as the annotator input.
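The centroid estimate can be sketched in two lines (toy embedding dimension; the real annotator embeddings are learned model parameters):

```python
import numpy as np

rng = np.random.default_rng(3)
annotator_embeddings = rng.standard_normal((47, 8))   # 47 crowd annotators, toy dim 8

# Expert as the centroid of all annotator embeddings: equal-weight "voting".
e_expert = annotator_embeddings.mean(axis=0)
assert e_expert.shape == (8,)
```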

The Supervised Setting
Inspired by supervised domain adaptation, we also present supervised crowdsourcing learning, which has seldom been considered. The setting is very simple: we just assume that a certain scale of expert annotations is available. In this way, we can learn the expert representation directly by supervised learning with our proposed model. The supervised setting could be a more practicable scenario in real applications. Intuitively, it should bring much better performance than the unsupervised setting given few-shot expert annotations, which do not increase the overall annotation cost much. In fact, during or after the crowdsourcing annotation process, we usually have a quality control module, which can help to produce silver-quality pseudo-expert annotations (Kittur et al., 2008;Lease, 2011). Thus, the supervised setting can be highly valuable, yet it has mostly been ignored.

Setting
Dataset We use the CoNLL-2003 English NER dataset (Tjong Kim Sang and De Meulder, 2003) with crowdsourced annotations provided by Rodrigues and Pereira (2018) to investigate our methods in both the unsupervised and supervised settings. The crowdsourced annotations cover 400 news articles, involving 5,985 sentences in practice, which are labeled by a total of 47 crowd annotators. The total number of annotations is 16,878; thus the average number of annotated sentences per annotator is 359, which covers 6% of the total sentences. The dataset includes golden/expert annotations on the training sentences and a standard CoNLL-2003 test set for NER evaluation.
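These dataset statistics can be checked with quick arithmetic (numbers taken from the dataset description above):

```python
sentences, annotators, annotations = 5985, 47, 16878

per_annotator = annotations / annotators          # average sentences per annotator
coverage = per_annotator / sentences              # fraction of all sentences per annotator

print(round(per_annotator), f"{coverage:.0%}")    # 359 6%
```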
Evaluation The standard CoNLL-2003 evaluation metric is used to calculate NER performance, reporting entity-level precision (P), recall (R), and their F1 value. All experiments of the same setting are conducted five times, and the median outputs are used for performance reporting. We exploit the pair-wise t-test for significance testing, regarding two results as significantly different when the p-value is below $10^{-5}$.
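For reference, entity-level P/R/F1 over BIO sequences can be sketched as follows (a simplified scorer, not the official conlleval script; here stray I- tags without a B- are ignored rather than treated as entity starts):

```python
def extract_entities(tags):
    """Collect (start, end, type) spans from a BIO tag sequence."""
    spans, start, etype = set(), None, None
    for i, tag in enumerate(tags + ["O"]):       # sentinel "O" flushes the last span
        if tag.startswith("B-") or tag == "O" or (tag.startswith("I-") and tag[2:] != etype):
            if start is not None:
                spans.add((start, i, etype))
            start, etype = (i, tag[2:]) if tag.startswith("B-") else (None, None)
    return spans

def entity_prf(gold, pred):
    """Entity-level precision, recall, F1 over exact span-and-type matches."""
    g, p = extract_entities(gold), extract_entities(pred)
    tp = len(g & p)
    prec = tp / len(p) if p else 0.0
    rec = tp / len(g) if g else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

gold = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["B-PER", "I-PER", "O", "O"]
print(entity_prf(gold, pred))                    # precision 1.0, recall 0.5
```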
Baselines We re-implement several methods of previous work as baselines, and all the methods are based on Adapter•BERT-BiLSTM-CRF (no annotator switcher inside) for fair comparisons.
For both the unsupervised and supervised settings, we consider the following baseline models:

• ALL: which treats all annotations equally, ignoring the annotator information, no matter whether it comes from crowd annotators or experts.
• MV: which is borrowed from Rodrigues et al. (2014), where aggregated labels are produced by token level majority voting. In particular, the gold-standard labels are used instead if they are available for a specific sentence during the supervised crowdsourcing learning.
• LC: which is proposed by Nguyen et al. (2017), where the annotator bias to the gold-standard labels is explicitly modeled at the CRF layer for each crowd annotator, and specifically, the expert is with zero bias.
• LC-cat: which is also presented by Nguyen et al. (2017) as a baseline to LC, where the annotator bias is modeled at the BiLSTM layer instead, and the expert bias is also set to zero.

Notice that ALL and MV are annotator-agnostic models, which exploit no information specific to the individual annotators, while the other three models (LC, LC-cat, and ours) are all annotator-aware models, where the annotator information is used in different ways.
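The MV baseline's aggregation step can be sketched as follows (toy tags and hypothetical annotator votes; ties are broken arbitrarily by `Counter` order here, whereas Rodrigues et al. (2014) describe the original scheme):

```python
from collections import Counter

def majority_vote(annotations):
    """Token-level majority voting over per-annotator BIO tag sequences."""
    aggregated = []
    for token_tags in zip(*annotations):          # one tuple of tags per token position
        tag, _ = Counter(token_tags).most_common(1)[0]
        aggregated.append(tag)
    return aggregated

votes = [
    ["B-PER", "I-PER", "O"],                     # annotator 1
    ["B-PER", "O",     "O"],                     # annotator 2
    ["B-PER", "I-PER", "B-LOC"],                 # annotator 3
]
print(majority_vote(votes))                      # ['B-PER', 'I-PER', 'O']
```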

Hyper-parameters
We offer all detailed hyper-parameter settings in Appendix B.

Unsupervised Results
Table 1 shows the test results of the unsupervised setting. Overall, we can see that our representation learning model (i.e., This Work), borrowed from domain adaptation, achieves the best performance, with an F1 score of 77.95, significantly better than the second-best model LC-cat (i.e., 77.95 − 76.79 = 1.16). The result indicates the advantage of our method over the other models.
By examining the results in depth, we can find that the annotator-aware models are significantly better than the annotator-agnostic models, demonstrating that the annotator information is highly helpful for crowdsourcing learning. This observation further supports the reasonableness of aligning annotators to domains, since domain information is also useful for domain adaptation. In addition, the better performance of our representation learning method among the annotator-aware models indicates that our model can capture annotator-aware information more effectively, because our starting point is totally different: we do not attempt to model the expert labels based on crowdsourced annotations.
Further, we observe that several models show better precision values, while others give better recall values. A high precision but low recall indicates that the model is conservative in detecting named entities, and vice versa. Our proposed model is able to balance the two directions better, with the smallest gap between them. Also, the results imply that there is still much room for future development, and the recent advances of domain adaptation might offer good avenues.
Finally, we compare our results with previous studies. As shown, our model obtains the best performance in the literature. In particular, by comparing our results with the original performances reported in Nguyen et al. (2017), we can see that our re-implementation is much better than theirs. The major difference lies in the exploration of BERT in our model, which brings improvements close to 6% for both LC and LC-cat.

Supervised Results
To investigate the supervised setting, we assume that expert annotations (ground truths) of all crowdsourced sentences are available. Besides exploring the full expert annotations, we study three further scenarios by incrementally adding expert annotations to the unsupervised setting, aiming to study the effectiveness of our model with small-scale expert annotations as well. Concretely, we assume proportions of 1%, 5%, 25%, and 100% of the expert annotations are available.⁴ Table 2 shows all the results, including our four baselines and a gold model trained on only the expert annotations for comparison. Overall, we can see that our representation learning model brings the best performance in all scenarios, demonstrating its effectiveness in the supervised setting as well.
Next, by comparing annotator-agnostic and annotator-aware models, we can see that annotator-aware models are better, which is consistent with the unsupervised setting. More interestingly, the results show that ALL is better than gold with very small-scale expert annotations (1% and 5%), and the tendency is reversed only when there are sufficient expert annotations (25% and 100%). This observation indicates that crowdsourced annotations are always helpful when golden annotations are not enough. In addition, it is easy to understand that MV is worse than gold, since the latter has a higher-quality training corpus. Further, we find that even the annotator-aware LC and LC-cat models are unable to obtain any positive influence compared with gold, which demonstrates that distilling ground truths from the crowdsourced annotations might not be the most promising solution. In contrast, our representation learning model gives consistently better results than gold, indicating that crowdsourced annotations are always helpful under our method. By regarding crowdsourcing learning as domain adaptation, we no longer take crowdsourced annotations as noise; on the contrary, they are treated as transferable knowledge, similar to the relationship between the source domains and the target domain. Thus they can always be useful in this way.

Analysis
To better understand our idea and model in depth, we conduct the following fine-grained analyses.⁵

Visualization of Annotator Embeddings Our representation learning model is able to learn annotator embeddings through the task objective. It is interesting to visualize these embeddings to check their distributions, which can reflect the relationships between the individual annotators. Figure 4 shows the visualization results after Principal Component Analysis (PCA) dimensionality reduction, where the unsupervised and three supervised scenarios are investigated.⁶ As shown, most crowd annotators are distributed in a concentrated area in all scenarios, indicating that they share certain common characteristics of task understanding. Further, we focus on the relationship between the expert and crowd annotators, and the results show two interesting findings. First, the heuristic expert of our unsupervised learning is almost consistent with that of the supervised learning on the whole expert annotations (100%), which indicates that our unsupervised expert estimation is quite accurate. Second, the visualization shows that the relationship between the expert and crowd annotators can be biased when expert annotations are not enough; as the size of expert annotations increases, the learned relationship gradually becomes more accurate.
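The visualization step can be sketched as follows (a numpy-only PCA via SVD, over toy random embeddings standing in for the learned ones; the paper does not specify the exact PCA implementation):

```python
import numpy as np

def pca_2d(X):
    """Project row vectors onto their top-2 principal components (via SVD)."""
    Xc = X - X.mean(axis=0)                      # center the embeddings
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T                         # (n, 2) coordinates for plotting

rng = np.random.default_rng(4)
emb = rng.standard_normal((47, 16))              # 47 crowd annotators, toy dim 16
expert = emb.mean(axis=0, keepdims=True)         # unsupervised expert estimate (centroid)
coords = pca_2d(np.vstack([emb, expert]))        # 48 points: annotators + expert
assert coords.shape == (48, 2)
```

Each row of `coords` would then be scattered on a 2D plot, with the expert point highlighted.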
The Predictability of Crowdsourced Annotations Our primary assumption is that all crowdsourced annotations are regarded as gold-standard with respect to the crowd annotators, which naturally indicates that these annotations are predictable. Here we conduct an analysis to verify the assumption with a new task of predicting the crowdsourced annotations. Concretely, we divide the annotations into two sections, where 85% of them are used for training and the remainder for testing, and then we apply our baseline and proposed models to learn and evaluate. Table 3 shows the results. As shown, our model achieves the best performance with an F1 score of 77.12%, and the other models are significantly worse (at least 4.86 points lower by F1). Considering that the proportion of the average training examples per annotator over the full 5,985 sentences is only 5%,⁷ we exploit the gold model of the 5% expert annotations for reference. We can see that the gap between them is small (77.12% vs. 79.33%), which indicates that our assumption is acceptable as a whole. The other models could be unsuitable for our assumption, as suggested by the poor performance induced by their modeling strategies.

⁶ The 1% setting is excluded for its incapability to capture the relationship between the expert and crowd annotators with such small expert annotations.
⁷ The value can be directly calculated (0.06 × 0.85 ≈ 0.05).

Figure 5: Comparisons by F1 scores between full and filtered crowdsourced annotations (i.e., excluding unreliable annotators). We compute F1 values of each annotator with respect to the gold-standard labels, and filter out the 10 annotators with the lowest scores.
The Impact of Unreliable Annotators Handling unreliable annotators, such as spammers, is a practical and common issue in crowdsourcing (Raykar and Yu, 2012b). Obviously, regarding crowd annotations as untrustworthy answers is better suited to handling this problem. In contrast, our assumption might be challenged, because these unreliable annotators are inconsistent even in their own annotations. To show the influence of unreliable annotators, we filter out several unreliable annotators from the corpus, and re-evaluate the performance for the low-resource supervised and unsupervised scenarios on the remaining annotations. Figure 5 shows the comparison between the original corpus and the filtered corpus.⁸ First, we find that improved performance is achieved in all cases, indicating that excluding these unreliable annotations is helpful for crowdsourcing. Second, the LC and LC-cat models give smaller score differences than the ALL model between these two kinds of results, which verifies that they account for unreliable annotators. Third, our model also performs robustly; it can cope with this practical issue to a certain degree as well.
Results on Sampled Annotators and Annotations The above analysis shows the benefit of removing unreliable annotators, which reduces a small number of annotators and annotations. A question arises naturally: will the performance be consistent if we sample a small proportion of annotators? To verify this, we sampled two subsets from the crowdsourced training corpus and re-trained our model as well as the baselines. Table 4 shows the evaluation results of the re-trained models on the standard test set in the unsupervised setting; we also add our main result for comparison. As shown, all sampled datasets demonstrate similar trends to the main result (denoted as Full). The supervised results are consistent with our main result as well, and are not listed due to space reasons.

Table 4: The Full setting is the main result in Table 1; the Excluded setting is the filtered corpus in Figure 5. Part-1 and Part-2 each consist of 13 annotators. Part-1 has 1,800 texts with 6,275 crowd annotations, each text labeled by at least 3 annotators; for Part-2, these numbers are 2,192, 5,582, and 2, respectively.

The Discussion of Domain Definitions
The most widely used definition of a domain is a distribution on the input space $\mathcal{X}$. Zhao et al. (2019) define a domain $D$ as the pair of a distribution $\mathcal{D}$ on the input space $\mathcal{X}$ and a labeling function $f : \mathcal{X} \rightarrow \mathcal{Y}$, i.e., $D = \langle \mathcal{D}, f \rangle$. In this work, we assume each annotator is a unique labeling function $a : \mathcal{X} \rightarrow \mathcal{Y}$. Uniting each annotator and the instances he/she labeled, we obtain a number of domains $\{\langle \mathcal{D}^i, a^i \rangle\}_{i=1}^{M}$, where $A = \{a^i\}_{i=1}^{M}$ represents all annotators. Then crowdsourcing learning can be interpreted under the latter definition, i.e., learning from these crowd annotators/domains and predicting the labels of raw inputs (sampled from the raw data distribution $\mathcal{D}^{\mathrm{expert}}$) in the expert annotator/domain $\langle \mathcal{D}^{\mathrm{expert}}, \mathrm{expert} \rangle$. To unify the definition in a single distribution, we directly define a domain as the joint distribution on the input space $\mathcal{X}$ and the label space $\mathcal{Y}$.
In addition, we can align with the former definition by using the representation outputs $\tilde{x}^i = a^i(x)$ as the data input, which yields different distributions for the same sentence under different annotators. Thus, each source domain $\mathcal{D}^i$ is the distribution of $\tilde{x}^i$, and we need to learn the expert representations $\tilde{x}^{\mathrm{expert}}$ to perform inference on the unlabeled texts.

Crowdsourcing Learning
Crowdsourcing is a cheap and popular way to collect large-scale labeled data, which can facilitate model training for hard tasks that require supervised learning (Wang and Zhou, 2016;Sheng and Zhang, 2019). However, crowdsourced data is often regarded as low-quality, containing much noise when expert annotations are regarded as the gold standard. Initial studies of crowdsourcing learning try to arrive at a high-quality corpus by majority voting, or control the quality by sophisticated strategies during the crowd annotation process (Khattak and Salleb-Aouissi, 2011;Liu et al., 2017;Tang and Lease, 2011).
Recently, the majority work focuses on full exploration of all annotated corpus by machine learning models, taking the information from crowd annotators into account including annotator reliability (Rodrigues et al., 2014), annotator accuracy (Huang et al., 2015), worker-label confusion matrix (Nguyen et al., 2017), and sequential confusion matrix (Simpson and Gurevych, 2019).
In this work, we present a totally different viewpoint for crowdsourcing, regarding all crowdsourced annotations as golden in terms of individual annotators, just like the primitive gold-standard labels corresponded to the experts, and further propose a domain adaptation paradigm for crowdsourcing learning.

Domain Adaptation
Domain adaptation has been studied extensively to reduce the performance gap between the resourcerich and resource-scarce domains (Ben-David et al., 2006;Mansour et al., 2009), which has also received great attention in the NLP community (Daumé III, 2007;Jiang and Zhai, 2007;Finkel and Manning, 2009;Glorot et al., 2011;Chu and Wang, 2018;Ramponi and Plank, 2020). Typical methods include self-training to produce pseudo training instances for the target domain (Yu et al., 2015) and representation learning to capture transferable features across the source and target domains (Sener et al., 2016).
In this work, we draw correlations between domain adaptation and crowdsourcing learning, enabling crowdsourcing learning to benefit from the advances of domain adaptation, and then present a representation learning model borrowed from Jia et al. (2019) and Üstün et al. (2020).
In addition, NER has been widely adopted for crowdsourcing learning as well (Finin et al., 2010; Rodrigues et al., 2014; Derczynski et al., 2016). Thus, we exploit NER as a case study following these works, and take a BERT-BiLSTM-CRF model as the basic model for our annotator-aware extension.
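To make the annotator-aware extension concrete, the following PyTorch sketch conditions a BiLSTM tagging layer on an annotator embedding. This is only an illustrative sketch: the BERT encoder is replaced by random stand-in features, the CRF decoding layer is omitted (a linear layer produces emission scores it would consume), and the annotator count (`ann_count=47`) and tag set size (`n_tags=9`) are hypothetical values, not the paper's actual implementation. The annotator embedding size 8 and BiLSTM hidden size 400 follow the hyper-parameters reported in the appendix.

```python
import torch
import torch.nn as nn

class AnnotatorAwareTagger(nn.Module):
    """Sketch: a BiLSTM tagging layer whose input is conditioned on the
    annotator identity. The BERT encoder and CRF decoder of the full
    BERT-BiLSTM-CRF model are omitted for brevity."""

    def __init__(self, feat_dim=768, ann_count=47, ann_dim=8,
                 hidden=400, n_tags=9):
        super().__init__()
        self.ann_emb = nn.Embedding(ann_count, ann_dim)
        self.bilstm = nn.LSTM(feat_dim + ann_dim, hidden,
                              batch_first=True, bidirectional=True)
        self.emission = nn.Linear(2 * hidden, n_tags)  # scores for a CRF

    def forward(self, feats, annotator_ids):
        # feats: (batch, seq_len, feat_dim) contextual word representations
        ann = self.ann_emb(annotator_ids)               # (batch, ann_dim)
        ann = ann.unsqueeze(1).expand(-1, feats.size(1), -1)
        h, _ = self.bilstm(torch.cat([feats, ann], dim=-1))
        return self.emission(h)                         # (batch, seq_len, n_tags)

# Stand-in for BERT output: random features for a batch of 2 sentences.
feats = torch.randn(2, 5, 768)
scores = AnnotatorAwareTagger()(feats, torch.tensor([3, 10]))
```

Each training sentence is thus encoded together with the identity of the crowd worker who labeled it, mirroring how a domain-adaptation model encodes the domain of each instance.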

Conclusion and Future Work
We studied the connection between crowdsourcing learning and domain adaptation, and proposed to treat crowdsourcing learning as a domain adaptation problem. We then took NER as a case study, suggesting a representation learning model drawn from recent advances in domain adaptation for crowdsourcing learning. Through this case study, we investigated both unsupervised and supervised crowdsourcing learning, where the former is a widely-studied setting while the latter has seldom been investigated. Finally, we conducted experiments on a widely-adopted benchmark dataset for crowdsourced NER, and the results show that our representation learning model is highly effective in the unsupervised setting, achieving the best performance in the literature. In addition, supervised learning with a very small scale of expert annotations can boost the performance significantly.
Our work sheds light on the application of effective domain adaptation models on crowdsourcing learning. There are still many other sophisticated cross-domain models, such as adversarial learning (Ganin et al., 2016) and self-training (Yu et al., 2015). Future work may include how to apply these advances to crowdsourcing learning properly.

Ethical Impact
We present a different view of crowdsourcing learning and propose to treat it as domain adaptation, showing the connection between these two topics of machine learning for NLP. In this view, many sophisticated cross-domain models could be applied to crowdsourcing learning. Moreover, the motivation of regarding all crowdsourced annotations as gold-standard with respect to their corresponding annotators also sheds light on introducing other transfer learning techniques in future work.
The above idea, and our proposed representation learning model for crowdsourcing sequence labeling, are totally agnostic to any private information of the annotators. We do not use any sensitive information, but only the IDs of annotators, in problem modeling and learning. The crowdsourced CoNLL English NER data also anonymizes its annotators. Therefore, we foresee no privacy issues.

A Transformer with Adapters
In our Adapter • BERT word representation, we insert two adapter modules into each transformer layer inside BERT. Figure 6 shows the detailed network structure of the transformer with adapters. More specifically, the forward operation of an adapter layer is computed as follows:

$\mathbf{h}' = \mathbf{h} + \mathbf{W}^{\text{up}} f(\mathbf{W}^{\text{down}} \mathbf{h})$,

where $\mathbf{W}^{\text{down}}$ and $\mathbf{W}^{\text{up}}$ are the down-projection and up-projection matrices of the adapter, $f$ is the nonlinear activation, and the input $\mathbf{h}$ is added back through a residual connection. Here we also give a supplement to illustrate the pack operation from all adapter parameters into a single vector $V$:

$V = \oplus_{i}\, \text{vec}(A_i)$, (7)

where first all parameters of a single adapter $A_i$ are reshaped and concatenated ($\text{vec}$), and then a further concatenation ($\oplus$) is performed over all adapters.
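The adapter layer and the pack operation can be sketched in PyTorch as follows. This is a generic bottleneck-adapter sketch under stated assumptions: the bottleneck size 128 matches the hyper-parameters reported below, while the ReLU activation and the exact parameter ordering inside `pack` are illustrative choices, not necessarily the paper's.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project,
    with a residual connection around the whole block."""

    def __init__(self, hidden=768, bottleneck=128):
        super().__init__()
        self.down = nn.Linear(hidden, bottleneck)
        self.up = nn.Linear(bottleneck, hidden)

    def forward(self, h):
        return h + self.up(torch.relu(self.down(h)))

def pack(adapters):
    """Pack operation: reshape every parameter of every adapter into a
    flat vector and concatenate them all into a single vector V."""
    return torch.cat([p.reshape(-1)
                      for a in adapters for p in a.parameters()])

adapters = [Adapter() for _ in range(2)]   # e.g. the two adapters of a layer
V = pack(adapters)
# Each adapter contributes 768*128 + 128 + 128*768 + 768 parameters.
```

The packed vector $V$ is what allows all adapter parameters to be treated as a single object, e.g. to be generated or modulated jointly.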

B Hyper-parameters
We choose BERT-base-cased, which is for the English language and consists of 12 transformer layers with a hidden size of 768 for all layers. We load the BERT weights and implement the adapter injection based on the transformers library (Wolf et al., 2020). The size of the adapter middle hidden states is set to 128 constantly. The annotator embedding size is 8, to fit the model in one RTX-2080TI GPU with 11GB memory. The BiLSTM hidden size is set to 400. For all models, we inject adapters or switchers in all 12 layers of BERT. All experiments are run on a single GPU of an 8-GPU server with a 14-core CPU and 128GB memory.
We exploit stochastic gradient-based online learning, with a batch size of 64, to optimize the model parameters. We apply time-step dropout on the word representations to avoid overfitting, which randomly sets entire positions of the sequence representation to zeros with a probability of 0.2. We use the Adam algorithm to update the parameters with a constant learning rate of 1 × 10^-3, and apply gradient clipping with a maximum value of 5.0 to avoid gradient explosion.
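Time-step dropout differs from standard element-wise dropout in that it zeroes whole sequence positions rather than individual feature dimensions. A minimal sketch, assuming the usual inverted-dropout rescaling (the paper does not specify whether rescaling is applied):

```python
import torch

def timestep_dropout(x, p=0.2, training=True):
    """Zero out entire time steps. x: (batch, seq_len, dim); each position
    is dropped with probability p, and kept positions are rescaled by
    1/(1-p) as in standard inverted dropout."""
    if not training or p == 0.0:
        return x
    # One Bernoulli draw per (batch, position), broadcast over features.
    keep = (torch.rand(x.size(0), x.size(1), 1) >= p).float()
    return x * keep / (1.0 - p)

x = torch.ones(2, 6, 4)
y = timestep_dropout(x, p=0.2)
# Every dropped position zeros all 4 feature dimensions at once.
```

Dropping whole positions forces the BiLSTM to rely on surrounding context rather than any single word representation.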

C The Advantage of Adapter • BERT
Our models are all based on Adapter • BERT as the basic representations, which is different from the widely-adopted BERT fine-tuning architecture.
Here we compare the two strategies in detail. The results are shown in Table 5, where for Adapter • BERT we consider gradually increasing the number of transformer layers (covering the last n layers) inside BERT that receive adapters. As shown, it is apparent that Adapter • BERT is much more parameter-efficient, and when all layers are exploited, the model can be even better than BERT fine-tuning. Thus it is more desirable to use Adapter • BERT covering all BERT transformer layers.

D Case Study
Here we also offer a case study to understand the performance of unsupervised and supervised crowdsourcing learning, as well as the different crowdsourcing models. We exploit one complex example in Table 6, which yields different outputs across the various models. As shown, the supervised models are able to recall the ambiguous entity (i.e., Pace, a single word with multiple senses) correctly, while the unsupervised models fail, which may be due to inconsistencies in the crowdsourced annotations. By comparing our model with the other baselines, we can see that our representation learning model captures a consistent global understanding of the text input, e.g., being able to connect Ohio State and Arizona State together.