A Confidence-based Partial Label Learning Model for Crowd-Annotated Named Entity Recognition

Existing models for named entity recognition (NER) are mainly based on large-scale labeled datasets, which are usually obtained through crowdsourcing. However, it is hard to obtain a unified and correct label via majority voting from multiple annotators for NER due to the large labeling space and the complexity of this task. To address this problem, we aim to utilize the original multi-annotator labels directly. In particular, we propose a Confidence-based Partial Label Learning (CPLL) method to integrate the prior confidence (given by annotators) and the posterior confidence (learned by models) for crowd-annotated NER. This model learns a token- and content-dependent confidence via an Expectation-Maximization (EM) algorithm by minimizing empirical risk. The true posterior estimator and the confidence estimator are run iteratively to update the true posterior and the confidence, respectively. We conduct extensive experiments on both real-world and synthetic datasets, which show that our model improves performance effectively compared with strong baselines.


Introduction
Named entity recognition (NER) plays a fundamental role in many downstream natural language processing (NLP) tasks, such as relation extraction (Bach and Badaskar, 2007) and event extraction (Wadden et al., 2019; Zhou et al., 2022). Recently, by leveraging deep learning models, NER systems have achieved superior performance on NER datasets. However, these models typically require a massive amount of labeled training data, such as MSRA (Levow, 2006), Ontonotes 4.0 (Weischedel et al., 2011), and Resume (Zhang and Yang, 2018). In real applications, we often need to consider new types of entities in new domains where we do not have existing annotated data. The most common way to label data at a lower cost is crowdsourcing (Peng and Dredze, 2015), which labels the data using multiple annotators.
Corresponding author: jie_zhou@fudan.edu.cn.
The crowd-annotated datasets are usually of low quality for the following two reasons. First, in exchange for the lower cost, crowd annotators are usually non-experts. Different annotators may interpret the labeling guidelines differently. Moreover, they may make mistakes in the labeling process. It is hard to get multiple annotators to reach an agreement. For example, annotator 1 labels "David and Jack" as a single PER entity, while the correct labels are "David" and "Jack" under our guidelines (Table 1). Also, a continuous time or place should be labeled as one entity (e.g., "tomorrow at 10:00 a.m." and "company ( room 1003 )"). Second, due to ambiguous word boundaries and complex composition, the NER task is more challenging than text classification tasks. Annotator 3 ignores the token "a.m." in the time entity and wrongly adds "the" as part of the place entity. He/she also misses the person entities in the text. In this paper, we focus on building a powerful NER system based on crowd-annotated data, which is of low quality.
There are two main ways to utilize crowd-annotated data. One simple and important way to obtain high-quality annotations for each input instance is majority voting. As shown in Table 1, majority voting cannot recover the correct answers from these three annotations. Each of the right labels (e.g., "David", "Jack", "tomorrow at 10:00 a.m.", and "company ( room 1003 )") is annotated only once, by annotator 1 or annotator 2. Another line of work models the differences among annotators to find the trustworthy annotators (Rodrigues et al., 2014; Nguyen et al., 2017; Yang et al., 2018). From Table 1, we can see that none of the three annotators labels all the entities correctly. Thus, both kinds of methods waste human labor.
To address this problem, we translate this task into a partial label learning (PLL) problem, which trains the model on a dataset where each sample is assigned a set of candidate labels (Cour et al., 2011; Wen et al., 2021). Thus, it is natural to utilize all human labor via PLL. PLL methods can be divided into two types: 1) average-based methods, which consider each candidate class equally (Hüllermeier and Beringer, 2006; Zhang and Yu, 2015); 2) identification-based methods, which predict the ground-truth label as a latent variable via a translation matrix that describes the scores of each candidate label (Feng and An, 2019; Yan and Guo, 2020; Feng et al., 2020). Despite extensive studies on PLL methods, there are still two challenges in our setting. One challenge (C1) is that these methods fall short when the same candidate label occurs more than once. General PLL assumes that each candidate label is assigned only once, whereas in our setting the same class may be assigned to a sample multiple times by different annotators. Another challenge (C2) is that most existing PLL studies focus on image or text classification, while we focus on a more complex task, sequence labeling, where each token is assigned a label. Thus, the token itself and its content should be considered in this task.
In this paper, we propose a Confidence-based Partial Label Learning (CPLL) model for crowd-annotated NER. For C1, we treat the number of times each class is labeled for a sample as prior confidence provided by the annotators. We also learn the confidence scores via an Expectation-Maximization (EM) algorithm (Dempster et al., 1977). We estimate the real conditional probability $P(Y = y \mid T = t, X = x)$ via a true posterior estimator based on the confidence, which consists of the prior and posterior confidences. For C2, we learn a token- and content-dependent confidence via a confidence estimator that considers both the token $t$ and the input sequence $x$, because the candidate labels are always token-dependent and content-dependent. In fact, our model can be applied to any sequence labeling task, such as word segmentation and part-of-speech tagging. We conduct a series of experiments on one real-world dataset and four synthetic datasets. The empirical results show that our model makes use of the crowd-annotated data effectively. We also explore the influence of annotation inconsistency and of the balance between prior and posterior confidences.
The main contributions of this work are listed as follows.
• To better utilize the crowd-annotated data, we propose a CPLL algorithm that incorporates the prior and posterior confidences for sequence labeling tasks (here, NER).
• To take the confidence scores into account, we design a true posterior estimator and a confidence estimator that iteratively update the probability distribution of the ground truth and the token- and content-dependent confidence via the EM algorithm.
• Extensive experiments on both real-world and synthetic datasets show that our CPLL model outperforms the state-of-the-art baselines, which indicates that our model disambiguates the noisy labels effectively.

Our Approach
In this section, we first give the formal definition of our task. Then, we provide an overview of our proposed CPLL model. Finally, we introduce the main components of our model.

Formal Definition
Given a training corpus $D = \{x_i, (\hat{Y}_i, A_i)\}$, each text $x_i$ comes with the candidate label sets $\hat{Y}_i$ and the labeled times $A_i$ of its tokens. Here, $\hat{y} = \{y_1, y_2, \ldots, y_{|\hat{y}|}\}$ is the candidate label set of the token $t$ and $a = [a_1, a_2, \ldots, a_{|\hat{y}|}]$ is the labeled times obtained from annotations. Specifically, $a_j$ is the number of times candidate label $y_j$ is labeled for token $t$. $\hat{y} \in \{2^{\mathcal{Y}} \setminus \emptyset \setminus \mathcal{Y}\}$, where $\mathcal{Y}$ is the label space and $2^{\mathcal{Y}}$ is its power set. For the rest of this paper, $y$ denotes the true label of token $t$ in text $x$ unless otherwise specified. The goal of this task is to predict the true posterior probability $P(Y = y \mid T = t, X = x)$ of token $t$ in text $x$.
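As a rough illustration of this definition, the sketch below derives a candidate label set and the matching labeled times for each token from several annotators' label sequences. The function name `build_partial_labels`, the BIO tagging scheme, and the toy annotations are assumptions made for this example; they are not taken from the dataset or from Table 1.

```python
from collections import Counter


def build_partial_labels(annotations):
    """Turn per-annotator label sequences into per-token (candidates, times).

    annotations: list of label sequences, one per annotator, all the same length.
    Returns a list with one (candidate_labels, labeled_times) pair per token.
    """
    n_tokens = len(annotations[0])
    if any(len(seq) != n_tokens for seq in annotations):
        raise ValueError("all annotators must label the same tokens")
    result = []
    for i in range(n_tokens):
        counts = Counter(seq[i] for seq in annotations)
        candidates = sorted(counts)               # candidate label set for token i
        times = [counts[y] for y in candidates]   # labeled times, aligned with candidates
        result.append((candidates, times))
    return result


# Hypothetical annotations for the tokens "David", "and", "Jack" (BIO tags assumed).
annotations = [
    ["B-PER", "I-PER", "I-PER"],  # annotator 1 labels the whole phrase as one entity
    ["B-PER", "O", "B-PER"],      # annotator 2 labels two separate entities
    ["O", "O", "O"],              # annotator 3 misses the person entities
]
partial = build_partial_labels(annotations)
```

Keeping the counts next to the candidate labels is what separates this setting from standard PLL, where only the set itself is available.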

Overview
In this paper, we propose a Confidence-based Partial Label Learning (CPLL) model for crowd-annotated NER (Figure 1). In particular, we learn the true posterior $P(Y = y \mid T = t, X = x)$ via a true posterior estimator $f$ and a confidence score $g(y; \hat{y}, t, x)$ by minimizing a risk that weights the loss by this confidence (Equation 1), where the classifier $f(y; t, x)$ is used to predict $P(Y = y \mid T = t, X = x)$ and $\mathcal{L}$ is the loss. We rely on the Expectation-Maximization algorithm (Dempster et al., 1977) to find the maximum likelihood parameters of CPLL by regarding the ground truth as a latent variable. In the M-step, we train a naive classifier $f$ to predict the true posterior $P(Y = y \mid T = t, X = x)$ via a true posterior estimator (Section 2.3). In the E-step, we update the confidence score via a confidence estimator (Section 2.4), which combines the prior confidences (calculated from annotations) and the posterior confidences (learned by the model).

True Posterior Estimator
First, we train a naive classifier as our true posterior estimator $f$ to infer the true posterior $P(Y = y \mid T = t, X = x)$. To model the sequence, we adopt a pre-trained language model $M$ (BERT (Kenton and Toutanova, 2019)) to learn a content-aware token representation. Specifically, we input the sequence $x = \{t_1, t_2, \ldots, t_{|x|}\}$ into $M$ to obtain the sequence representations. Then, we utilize a fully connected (FC) layer to predict the probability distribution, where $\sigma$ is a sigmoid function and $\theta_{FC} = \{W, b\}$ are the learnable parameters of the FC layer. We regard $\theta = \{\theta_{M}, \theta_{FC}\}$ as the parameter set of the true posterior estimator $f$. Negative learning (Kim et al., 2019) is adopted, which considers not only "the token belongs to a positive label (candidate label $y \in \hat{y}$)" but also "the token does not belong to a negative label (its complementary label $y \notin \hat{y}$)". Finally, we optimize the empirical risk by integrating the confidence $g(y; \hat{y}, t, x)$ with the loss function (Equation 1). We introduce the confidence $g(y; \hat{y}, t, x)$ in detail below.
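The loss equations did not survive in this text, so the following is only a minimal sketch of how a confidence-weighted loss with negative learning could look for a single token. It assumes a sigmoid output per label, a log-loss on $f$ for candidate labels, and a log-loss on $1 - f$ for non-candidate labels; the function names and the toy numbers are hypothetical, and this is not a reproduction of Equation 1.

```python
import math


def sigmoid(z):
    """Logistic function applied to a single score."""
    return 1.0 / (1.0 + math.exp(-z))


def confidence_weighted_loss(logits, confidence, candidates):
    """Assumed per-token loss: confidence-weighted positive and negative terms.

    logits: dict mapping each label to the classifier score for this token.
    confidence: dict mapping each label to its confidence g.
    candidates: labels in the candidate set of this token.
    """
    candidates = set(candidates)
    loss = 0.0
    for label, z in logits.items():
        p = sigmoid(z)
        if label in candidates:
            # Positive learning: the token may belong to a candidate label.
            loss -= confidence[label] * math.log(p)
        else:
            # Negative learning: the token should not belong to a non-candidate label.
            loss -= confidence[label] * math.log(1.0 - p)
    return loss


confidence = {"B-PER": 0.7, "I-PER": 1.0, "O": 0.3}
good = confidence_weighted_loss(
    {"B-PER": 2.0, "I-PER": -2.0, "O": 0.0}, confidence, ["B-PER", "O"]
)
bad = confidence_weighted_loss(
    {"B-PER": -2.0, "I-PER": 2.0, "O": 0.0}, confidence, ["B-PER", "O"]
)
```

In this toy case, the scores that favor the high-confidence candidate and suppress the non-candidate label give the smaller loss, which is the behavior the confidence weighting is meant to encourage.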

Confidence Estimator
The confidence estimator is used to learn the confidence scores $g(y; \hat{y}, t, x)$, which represent the confidence of label $y$ given the token $t$, text sequence $x$, and partial label $\hat{y}$.
$$g(y; \hat{y}, t, x) = \alpha \cdot c^{A}_{y;t,x} + (1 - \alpha) \cdot c^{M}_{y;t,x} \qquad (5)$$
where the confidence score $c^{M}_{y;t,x}$ is learned by the model and $c^{A}_{y;t,x}$ is given by the annotators. $\alpha$ is a hyper-parameter used to balance these two terms. The annotators affect the quality of the datasets, and we can calculate the prior confidence based on the labeled times of each class. However, the prior confidence is biased since the annotators we selected have biases. To address this problem, we also let the model learn the posterior confidence to reduce the biases in the prior confidence.
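To make Equation 5 concrete, here is a small sketch that mixes a prior and a posterior confidence for the candidate labels of one token. The prior is assumed to be the labeled times normalized to sum to one, which is one simple reading of "based on the labeled times of each class" and not necessarily the exact formula used in the paper; `prior_confidence` and `combine_confidence` are hypothetical names.

```python
def prior_confidence(candidates, times):
    """Assumed prior: labeled times normalized over the candidate labels."""
    total = sum(times)
    return {y: a / total for y, a in zip(candidates, times)}


def combine_confidence(prior, posterior, alpha):
    """Equation 5: g = alpha * c_A + (1 - alpha) * c_M, per label."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return {y: alpha * prior[y] + (1.0 - alpha) * posterior[y] for y in prior}


# Toy token: "B-PER" was chosen by two annotators and "O" by one.
c_a = prior_confidence(["B-PER", "O"], [2, 1])
c_m = {"B-PER": 0.9, "O": 0.1}  # hypothetical posterior confidence from the model
g = combine_confidence(c_a, c_m, alpha=0.5)
```

With `alpha=1.0` the result reduces to the annotator-given prior, and with `alpha=0.0` to the model-learned posterior.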
Posterior Confidence. We update the posterior confidence $c^{M}_{y;t,x}$ based on the true posterior distribution $P(Y = y \mid T = t, X = x)$ estimated by the true posterior estimator $f(y; t, x)$.
We calculate the confidence score for positive and negative labels independently. ... the token. Thus, we model the confidence by considering both the token and content. Finally, we compute the final confidence score $g(y; \hat{y}, t, x)$ via Equation 5, which considers both biases from annotators and models.
We update the parameters $\theta$ and the confidence score in the M-step and E-step of the EM algorithm, respectively. Specifically, we run the true posterior estimator and the confidence estimator iteratively. $c^{M}_{y;t,x}$ is initialized to $\frac{1}{|\hat{y}|}$ for $y \in \hat{y}$ and $\frac{1}{|\mathcal{Y}| - |\hat{y}|}$ for $y \notin \hat{y}$.
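The initialization above can be written down directly. The update rule is not given in this text, so the sketch below assumes one simple choice: renormalize the classifier outputs over the candidate labels, and renormalize $1 - f$ over the non-candidate labels, keeping the two groups separate. Both function names are hypothetical.

```python
def init_posterior_confidence(label_space, candidates):
    """Initialization from the text: 1/|candidates| for candidate labels,
    1/(|label space| - |candidates|) for the remaining labels."""
    cand = set(candidates)
    n_pos = len(cand)
    n_neg = len(label_space) - n_pos
    return {y: (1.0 / n_pos if y in cand else 1.0 / n_neg) for y in label_space}


def update_posterior_confidence(probs, candidates):
    """Assumed E-step update: normalize positives and negatives independently.

    probs: dict mapping each label to the classifier output f in (0, 1).
    """
    cand = set(candidates)
    pos_total = sum(p for y, p in probs.items() if y in cand)
    neg_total = sum(1.0 - p for y, p in probs.items() if y not in cand)
    updated = {}
    for y, p in probs.items():
        if y in cand:
            updated[y] = p / pos_total          # share among candidate labels
        else:
            updated[y] = (1.0 - p) / neg_total  # share among non-candidate labels
    return updated


label_space = ["B-PER", "I-PER", "O"]
candidates = ["B-PER", "O"]
c_m = init_posterior_confidence(label_space, candidates)
c_m_next = update_posterior_confidence(
    {"B-PER": 0.6, "I-PER": 0.2, "O": 0.2}, candidates
)
```

In a full training loop, an M-step that refits the classifier with the current confidence would alternate with this E-step.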

Experimental Setups
In this section, we first introduce the one real-world and four synthetic datasets we adopt to evaluate performance (Section 3.1). Then, we list the selected popular baselines used to investigate the validity of our CPLL model (Section 3.2). Finally, we present the implementation details and metrics so that the experiments can be replicated easily (Section 3.3).

Datasets
Real-World Dataset. To build the real-world dataset, we ask the annotators to label the person, place, and time entities in the text independently. Each sample is assigned to three annotators with guidelines and several examples. To be specific, we ask three students to label 1000 samples as the training set. The average Kappa value among the annotators is 0.215, indicating that the crowd annotators have low agreement on identifying entities in this data. To evaluate system performance, we create a set with gold annotations. Concretely, we randomly select 881 sentences from the raw dataset and let two experts generate the gold annotations. Among them, we use 440 sentences as the development set and the remaining 441 as the test set. Table 2 shows the statistical information of this dataset.
Synthetic Datasets. Inspired by Rodrigues et al. (2014), we build synthetic datasets by adding noise to four typical NER datasets: MSRA (Levow, 2006), Weibo (Peng and Dredze, 2015), Ontonotes 4.0 (Weischedel et al., 2011), and Resume (Zhang and Yang, 2018). To simulate a real noise situation, we add noise to the original datasets using four rules: 1) BE (Bound Error), which adds or deletes some tokens of the entity to destroy its boundary (e.g., changing "room 1003" to "(room 1003"); 2) ME (Missing Error), which removes the entity from the labels (e.g., "David" is not labeled); 3) CE (Category Error), which changes the category of the entity (e.g., changing "Location" to "Organization"); 4) SE (Segmentation Error), which splits the entity into two entities (e.g., changing "tomorrow at 10:00 am" to "tomorrow" and "at 10:00 am"). We run each rule randomly with a perturbation rate r, which is set to 10% in the experiments. Additionally, we explore the influence of annotation inconsistency with different rates. Table 3 shows the statistical information of these datasets based on token-level majority voting. We can see that a large number of entities are perturbed by our rules. For example, more than 40% of the tokens labeled as entities are perturbed with a perturbation rate r of 20%.
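The four noise rules can be sketched on entity spans as follows. This is a simplified illustration, not the script used to build the synthetic datasets: spans are assumed to be `(start, end, type)` tuples with an exclusive end, at most one randomly chosen rule is applied to each entity with probability `rate`, overlaps created by boundary shifts are ignored, and the entity type list is hypothetical.

```python
import random

# Hypothetical entity types; each real dataset defines its own label set.
ENTITY_TYPES = ["PER", "LOC", "ORG"]


def bound_error(span, n_tokens, rng):
    """BE: add or delete one token at one boundary of the entity."""
    start, end, etype = span
    options = []
    if start > 0:
        options.append((start - 1, end, etype))  # add a token on the left
    if end < n_tokens:
        options.append((start, end + 1, etype))  # add a token on the right
    if end - start > 1:
        options.append((start + 1, end, etype))  # delete the first token
        options.append((start, end - 1, etype))  # delete the last token
    return [rng.choice(options)] if options else [span]


def missing_error(span, n_tokens, rng):
    """ME: drop the entity from the labels."""
    return []


def category_error(span, n_tokens, rng):
    """CE: replace the entity type with a different one."""
    start, end, etype = span
    others = [t for t in ENTITY_TYPES if t != etype]
    return [(start, end, rng.choice(others))]


def segmentation_error(span, n_tokens, rng):
    """SE: split the entity into two adjacent entities."""
    start, end, etype = span
    if end - start < 2:
        return [span]  # a single-token entity cannot be split
    cut = rng.randint(start + 1, end - 1)
    return [(start, cut, etype), (cut, end, etype)]


RULES = [bound_error, missing_error, category_error, segmentation_error]


def perturb(spans, n_tokens, rate, rng):
    """Perturb each entity with probability `rate` using one random rule."""
    noisy = []
    for span in spans:
        if rng.random() < rate:
            noisy.extend(rng.choice(RULES)(span, n_tokens, rng))
        else:
            noisy.append(span)
    return noisy


gold = [(0, 1, "PER"), (2, 3, "PER"), (5, 9, "LOC")]
noisy = perturb(gold, n_tokens=12, rate=0.1, rng=random.Random(0))
```

Running `perturb` several times with different seeds would give one noisy copy per simulated annotator.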

Baselines
To verify the effectiveness of our CPLL model, we compare it with several strong and typical baselines, which can be categorized into three groups: voting-based models, partial label learning-based models, and annotator-based models.
• Voting-based models. We select two voting-based models, the entity-level and token-level voting models. The entity-level voting model obtains the ground truth by voting at the entity level. The token-level voting model calculates the ground truth by voting at the token level. A BERT-based sequence labeling model (Kenton and Toutanova, 2019) is trained on the ground truth calculated by voting.
• Partial label learning-based models. We adopt two classic PLL baselines to utilize the crowd-annotated data with multiple candidate labels. PRODEN-mlp (Lv et al., 2020) adopts a classifier-consistent risk estimator with a progressive identification method for PLL. Wen et al. (2021) propose a Leveraged Weighted (LW) loss for PLL that takes the partial and non-partial labels into account and is proved to be risk consistent. It achieved state-of-the-art results on various computer vision tasks. We implement the models by adapting the official code to our NER task.
• Annotator-based models. Given the great success of fully-supervised learning, a natural idea is to derive fully-supervised data from crowd-annotated data. Seqcrowd (Nguyen et al., 2017) uses a crowd component, a Hidden Markov Model (HMM) learned by the Expectation-Maximization algorithm, to transform crowd-annotated data into fully-supervised data instead of simply voting at the token level or entity level. Once the ground truth is computed by this crowd component, an efficient fully-supervised learning method can be adopted to finish the corresponding task.
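For reference, the token-level voting baseline described above can be sketched in a few lines. The tie-breaking rule (alphabetical order) and the toy annotations are assumptions for this example; the text does not say how ties are resolved.

```python
from collections import Counter


def token_level_vote(annotations):
    """Pick the most frequent label for each token across annotators.

    annotations: list of label sequences, one per annotator.
    Ties are broken alphabetically (an assumption made for this sketch).
    """
    voted = []
    for labels in zip(*annotations):
        counts = Counter(labels)
        top = max(counts.values())
        voted.append(min(label for label, c in counts.items() if c == top))
    return voted


# Hypothetical BIO annotations for the tokens "tomorrow", "at", "10:00", "a.m.".
annotations = [
    ["B-TIME", "I-TIME", "I-TIME", "I-TIME"],
    ["B-TIME", "O", "B-TIME", "I-TIME"],
    ["B-TIME", "I-TIME", "I-TIME", "O"],
]
voted = token_level_vote(annotations)
```

Voting discards how often the losing labels were chosen, which is exactly the information that the labeled times keep in the partial label setting.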

Implementation Details and Metrics
We implement our model with the Transformers framework, which is based on PyTorch (Paszke et al., 2019), on a GTX TITAN X GPU. The Chinese-roberta-wwm-ext model (Cui et al., 2019) is used for our true posterior estimator. We utilize the Adam optimizer (Kingma and Ba, 2014) to update our model and set different learning rates for the BERT module (0.00002) and the rest module (0.002). The max sequence length is 512, the batch size is 8, and the dropout rate is 0.1. We search for the best α from 0.1 to 0.9 with a step of 0.1 using the development set. All the baselines use the hyper-parameter settings mentioned in their papers. Our source code will be available soon after this paper is accepted.
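The search over α described above amounts to a small grid search on the development set. In the sketch below, `evaluate_dev` is a hypothetical callable standing in for "train with this α and return the development F1"; the toy scoring function is invented purely so the example runs.

```python
def search_alpha(evaluate_dev):
    """Return the alpha in {0.1, 0.2, ..., 0.9} with the best development score."""
    grid = [k / 10 for k in range(1, 10)]
    return max(grid, key=evaluate_dev)


def toy_dev_score(alpha):
    """Made-up stand-in for a development-set F1 that peaks at alpha = 0.3."""
    return -(alpha - 0.3) ** 2


best_alpha = search_alpha(toy_dev_score)
```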
To measure the performance of the models, we adopt Macro-F1 as the metric, which is widely used for NER (Yadav and Bethard, 2018). In particular, we evaluate the performance at the span level, where an answer is considered correct only when the entire span is matched.
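A minimal sketch of span-level evaluation is given below. It assumes that entities are `(start, end, type)` tuples, that a prediction counts only on an exact match of all three fields, and that Macro-F1 is the unweighted mean of the per-type F1 scores; the exact averaging used in the experiments is not spelled out in the text.

```python
def span_f1(gold, pred):
    """F1 over two sets of spans, counting exact matches only."""
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold_spans, pred_spans):
    """Unweighted mean of per-type span-level F1 (assumed definition)."""
    types = sorted({etype for _, _, etype in gold_spans | pred_spans})
    if not types:
        return 0.0
    scores = []
    for etype in types:
        gold = {s for s in gold_spans if s[2] == etype}
        pred = {s for s in pred_spans if s[2] == etype}
        scores.append(span_f1(gold, pred))
    return sum(scores) / len(scores)


gold = {(0, 1, "PER"), (2, 3, "PER"), (5, 9, "TIME")}
# The prediction merges the two person entities into one wrong span.
pred = {(0, 3, "PER"), (5, 9, "TIME")}
score = macro_f1(gold, pred)
```

Here the merged person span matches neither gold entity, so PER scores 0 while TIME scores 1.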

Experimental Results
In this section, we conduct a series of experiments to investigate the effectiveness of the proposed CPLL model. Specifically, we compare our model with three kinds of strong baselines (Section 4.1) and do ablation studies to explore the influence of the key parts of CPLL (Section 4.2). We also investigate the influence of annotation inconsistency (Section 4.3) and of the hyper-parameter α, which controls the balance between posterior confidence and prior confidence (Section 4.4).

Main Results
To evaluate the performance of our model, we present the results of the compared baselines and our CPLL model (see Table 4). First, our model outperforms all the baselines on both the real-world and synthetic datasets. The labels obtained by voting-based methods (e.g., token-level voting and entity-level voting) always contain much noise because of the large labeling space and the complexity of this task. PLL-based models (e.g., PRODEN-mlp and LW loss) ignore the labeled times given by the annotators. Furthermore, annotator-based methods (e.g., Seqcrowd) aim to find the trustworthy label or annotator. Note that Seqcrowd does not work on Weibo and performs poorly on Ontonotes. This is because Seqcrowd cannot handle small or highly noisy datasets, which is also verified in Section 4.3. All these methods cause information loss, which largely hurts the performance of the models. Our CPLL model makes use of the crowd-annotated data by translating this task into a PLL task that integrates confidence. Second, our CPLL model reduces the influence of noise effectively. From the results, we observe that CPLL obtains results comparable to those of the model trained on the clean data. Our confidence estimator can learn the bias generated by annotations effectively via the posterior and prior confidence.

Ablation Studies
To evaluate the effectiveness of each part of our model, we do ablation studies (see Table 5). We remove the posterior confidence (w/o Posterior Confidence), the prior confidence (w/o Prior Confidence), and both of them (w/o Both) from the CPLL model. For w/o Both, we remove the confidence estimator by setting the confidences to $1/|\hat{y}|$ for partial labels and 0 for non-partial labels.
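The w/o Both setting replaces the learned confidence with a fixed one, which can be written as a short sketch. The function name `uniform_confidence` and the toy label space are hypothetical.

```python
def uniform_confidence(label_space, candidates):
    """Fixed confidence for the w/o Both ablation:
    1/|candidates| for partial (candidate) labels and 0 for the other labels."""
    cand = set(candidates)
    if not cand:
        raise ValueError("the candidate set must not be empty")
    return {y: (1.0 / len(cand) if y in cand else 0.0) for y in label_space}


fixed = uniform_confidence(["B-PER", "I-PER", "O"], ["B-PER", "O"])
```

Because every candidate receives the same weight, this variant uses neither the labeled times nor the model's predictions.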
From the results, we make the following observations. 1) The confidence estimator learns the annotation bias effectively. Removing it (w/o Both) reduces F1 by more than 4 points on the test sets of the real-world and Weibo datasets. 2) Both posterior confidence and prior confidence are useful for this task. Prior confidence is vital for leveraging the labeled confidence given by annotators. However, prior confidence may be biased since the number of annotators is limited. Thus, the posterior confidence learned by the model is also crucial for partial label learning to rectify the prediction.

Influence of Annotation Inconsistency
We also explore the influence of annotation inconsistency on synthetic datasets with various perturbation rates. Annotation inconsistency is used to model the label quality of crowdsourcing. The larger the perturbation rate, the worse the quality of the annotation. We report the results with rates from 5% to 25% in steps of 5% on the Weibo, Resume, and Ontonotes datasets (Figure 2).
First, our CPLL model outperforms all the baselines at different perturbation rates. Moreover, the higher the annotation inconsistency, the more our model improves relative to the baselines, which shows that our model reduces the influence of annotation inconsistency more effectively. Second, several baselines almost fail at a large perturbation rate (e.g., 25%), while our model can handle it effectively. The F1 score of Seqcrowd falls below 20 when the rate r is larger than 20%. Third, the annotation quality largely affects the performance of the model. The higher the inconsistency, the worse the quality of the annotation and the worse the performance of the model.

Influence of Hyper-parameter α
We further investigate the influence of the hyper-parameter α (in Equation 5), which is used to balance the posterior and prior confidence (Figure 3). The prior confidence reflects the labeled confidence given by the annotators, which is biased due to the selection of annotators. To reduce this bias, we let our model estimate the posterior confidence.
From the figures, we make the following observations. First, when the noise is high, the smaller the α, the better the performance. Intuitively, the confidence given by annotators is not reliable when the perturbation rate r is large. Second, when the noise is low, the trend that a larger α gives better performance is less obvious. The reason is that the model can disambiguate the ground truth from the candidates easily since the data is clean. Most of the labels are correct, and confidence is less important in this case. All the findings indicate that our confidence estimator can make use of prior confidence and learn posterior confidence effectively.

Related Work
In this section, we mainly review the most related works about named entity recognition (Section 5.1) and partial label learning (Section 5.2).

Named Entity Recognition
Named Entity Recognition (NER) is a research hotspot since it can be applied to many downstream Natural Language Processing (NLP) tasks. A well-trained NER model takes a language sequence as input and marks out all the entities in the sequence with the correct entity type. NER is widely treated as a sequence labeling problem, a token-level tagging task (Chiu and Nichols, 2015; Akbik et al., 2018; Yan et al., 2019). Some researchers also regard NER as a span-level classification task (Xue et al., 2020; Fu et al., 2021; Alemi et al., 2023). In these works, NER is a fully-supervised learning task based on large-scale labeled data, where each token is assigned a golden label.
Crowdsourcing platforms (e.g., Amazon Mechanical Turk) are a popular way to obtain large amounts of labeled data. Due to the large label space and the complexity of NER, the quality of the labeled data is low. The ground truth obtained by simple majority voting contains a lot of noise, which largely limits the performance of the model. Some studies train the model from multiple annotators directly (Simpson and Gurevych, 2019; Nguyen et al., 2017). They mainly focus on modeling the differences among annotators to find a trustworthy annotator. In fact, a sentence may not be labeled correctly by any single annotator, while each annotator may label part of the right entities. To address this problem, we translate this task into a partial label learning problem with a prior confidence score.

Partial Label Learning
Unlike fully-supervised learning, which uses data with a golden label y, Partial Label Learning (PLL) assigns a candidate set Y to each input x (Zhang et al., 2016; Wang et al., 2023; Lv et al., 2020). Although we cannot ensure that the golden label y is always in the candidate set Y, most PLL researchers assume that one of the candidate labels is the golden label for simplicity. The existing studies on PLL can be categorized into two groups, average-based methods (Zhang and Yu, 2015) and identification-based methods (Jin and Ghahramani, 2002; Lyu et al., 2019). Average-based methods (Zhang and Yu, 2015; Hüllermeier and Beringer, 2006) intuitively treat the candidate labels with equal importance. The main weakness of these algorithms is that false positives may severely distract the model with wrong label information. Recently, identification-based methods (Jin and Ghahramani, 2002; Wang et al., 2023) have been proposed to identify the true label among the candidates by regarding the ground truth as a latent variable. More and more literature pays attention to representative methods (Lyu et al., 2019; Nguyen and Caruana, 2008), self-training methods (Wen et al., 2021), and loss function adjustments (Wu and Zhang, 2018).
However, most of the current work focuses on image classification or text classification tasks, while how to model the confidence for NER is not well studied. The sequence labeling task aims to identify the entities in the sentence, together with their entity types, at the token level. Thus, how to model the token itself and its content also plays an important role in this task. To address this problem, we design a confidence estimator to predict the token- and content-dependent confidence based on the prior confidence given by annotators.

Conclusion and Future Work
In this paper, we translate crowd-annotated NER into a PLL problem and propose a CPLL model based on an EM algorithm. To rectify the model's prediction, we design a confidence estimator to predict token- and content-dependent confidence by incorporating prior confidence with posterior confidence. We conduct experiments on one real-world dataset and four synthetic datasets to evaluate the performance of our proposed CPLL model by comparing it with several state-of-the-art baselines. Moreover, we do ablation studies to verify the effectiveness of the key components and explore the influence of annotation inconsistency.
In the future, we would like to investigate the performance of our model on other sequence labeling tasks.

Limitations
Although our work shows that our CPLL model can learn from crowd-annotated NER data well, there are at least two limitations. First, we set the hyper-parameter α manually. It would be better to design a strategy that automatically learns an adaptive value of α for each sample. Second, although we mainly experiment on NER tasks, our model can be applied to all sequence labeling tasks, such as part-of-speech (POS) tagging, Chinese word segmentation, and so on. We would like to explore this in future work.
C2. Did you discuss the experimental setup, including hyperparameter search and best-found hyperparameter values?
3.3 Implementation Details and Metrics
C3. Did you report descriptive statistics about your results (e.g., error bars around results, summary statistics from sets of experiments), and is it transparent whether you are reporting the max, mean, etc. or just a single run?
We run our model using the same seed and select the best based on the development set.
C4. If you used existing packages (e.g., for preprocessing, for normalization, or for evaluation), did you report the implementation, model, and parameter settings used (e.g., NLTK, Spacy, ROUGE, etc.)?
Not applicable. Left blank.
D. Did you use human annotators (e.g., crowdworkers) or research with human participants?
3.1 Datasets
D1. Did you report the full text of instructions given to participants, including e.g., screenshots, disclaimers of any risks to participants or annotators, etc.?
3.1 Datasets
D2. Did you report information about how you recruited (e.g., crowdsourcing platform, students) and paid participants, and discuss if such payment is adequate given the participants' demographic (e.g., country of residence)?
3.1 Datasets
D3. Did you discuss whether and how consent was obtained from people whose data you're using/curating? For example, if you collected data via crowdsourcing, did your instructions to crowdworkers explain how the data would be used?
3.1 Datasets
D4. Was the data collection protocol approved (or determined exempt) by an ethics review board?
Not applicable. Left blank.
D5. Did you report the basic demographic and geographic characteristics of the annotator population that is the source of the data?
Not applicable. Left blank.

Figure 1: The framework of our CPLL model, which consists of a true posterior estimator and a confidence estimator. The true posterior estimator is used to predict the true posterior $P(Y = y \mid T = t, X = x)$ based on the confidence score learned by the confidence estimator. The confidence estimator learns the confidence based on the prior confidence obtained from annotators and the posterior confidence learned by the model.

Figure 2: The influence of annotation inconsistency.

Figure 3: The influence of hyper-parameter α, which is leveraged to control the balance between the posterior and prior confidence.

Table 1: The spans marked with blue, green, and red are time (TIME), person (PER), and place (PLACE) entities labeled by three annotators.

Table 2: The statistical information of the real-world dataset. #Sample is the number of samples in the corresponding dataset. #TIME, #PLACE, and #PERSON are the numbers of time, place, and person entities.
Table 2 shows the statistical information of this dataset. Synthetic Datasets. Inspired by Rodrigues et al. (2014), we build synthetic datasets by adding

Table 3: The statistical information of the synthetic datasets. #Original is the number of tokens labeled as an entity (not O) in the original dataset. BI/C is the number of tokens that have a wrong BI/Category label but the right Category/BI label. Percent = (BI+C)/#Original.

to update our model and set different learning rates for the BERT module (0.00002) and the rest module (0.002). The max sequence length
Table 4: The performance of our model and baselines in terms of F1. For the real-world dataset, we do not report the results on clean data and Seqcrowd since we do not have ground truth for the training set.

Table 5: The performance of ablation studies.