Named Entity Recognition through Deep Representation Learning and Weak Supervision

Weakly supervised methods estimate the labels for a dataset using the predictions of several noisy supervision sources. Many machine learning practitioners have begun using weak supervision to annotate data more quickly and cheaply than traditional manual labeling. In this paper, we focus on the specific problem of weakly supervised named entity recognition (NER) and propose an end-to-end model to learn optimal assignments of latent NER tags using observed tokens and weak labels provided by labeling functions. To capture the sequential dependencies between the latent and observed variables, we propose a sequential graphical model whose components are approximated using neural networks. State-of-the-art contextual embeddings are used to further discriminate the quality of noisy weak labels in various contexts. Results of experiments on four public weakly supervised named entity recognition datasets show a significant improvement in F1 score over recent approaches.


Introduction
Many industries and organizations have collected large amounts of unlabeled text data that they want to use for various Natural Language Processing (NLP) applications. However, in many of these applications (named-entity recognition, question answering, text summarization, relation extraction), obtaining a large number of labels can be prohibitively expensive, error-prone, or otherwise infeasible. Furthermore, domain adaptation (Han and Eisenstein, 2019), which is commonly used in scarce label settings, can often struggle in newly emerging or highly specific domains that do not have any closely related labeled datasets.
In the absence of a closely related labeled dataset, weakly supervised learning is often used as a cheaper, less time-consuming alternative to obtaining gold standard labels. The main idea of weak supervision is to approximate the true labels by integrating multiple sets of noisy training labels. Each set of noisy labels (commonly referred to as "weak labels") is provided by a weak labeling function, which often comes in the form of a knowledge base, heuristic, or pre-trained model. Weak supervision has achieved considerable success in several NLP tasks containing approximately i.i.d. data such as topic classification (Bach et al., 2019), sentiment analysis, and social media content tagging (Fu et al., 2020).
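To make the labeling-function idea concrete, here is a minimal sketch (our own illustration, not code from the paper) of a dictionary-based weak labeling function; the name `make_dictionary_labeler` and the 'Drug' label are hypothetical.

```python
def make_dictionary_labeler(entity_dict, label):
    """Build a weak labeling function from a controlled vocabulary.

    The returned function votes `label` for tokens found in the
    vocabulary and abstains (returns None) on everything else.
    """
    vocab = {w.lower() for w in entity_dict}

    def labeler(token):
        # Case-insensitive dictionary lookup; abstain when unknown.
        return label if token.lower() in vocab else None

    return labeler
```

A heuristic (e.g., a regular expression) or a pre-trained tagger can be wrapped in the same interface, so all weak supervision sources look alike to the aggregation model.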
Weak supervision for sequential data labeling problems such as NER is an emerging topic. Most current methods either require the time-consuming creation of additional heuristics such as 'linking rules' (Safranchik et al., 2020) or 'entity boundary detectors', or assume that the accuracy of a weak labeler depends only on the true latent class (Safranchik et al., 2020; Lison et al., 2020). The latter is likely suboptimal, since we would expect the accuracy to vary even within instances of the same class depending on the context given by surrounding tokens. One exception is the Fuzzy-LSTM-CRF (Shang et al., 2018). However, it ignores which weak labeler each prediction came from as well as the number of predictions for each class. This can be an issue when the weak labelers have differing accuracies, since the model may learn from a majority of wrong labels.
To address the foregoing issues, one of our main contributions is the proposal of an end-to-end method called Deep Weak Supervision on Sequential Data (DWS), which learns context-dependent proficiency representations for weak labelers, enhanced further through contextual embeddings from pre-trained language models. In addition, instead of following the traditional approach that treats named entity (NE) tags as multinomial samples, we directly model the conditional dependency of tags on tokens as interactions of two representations: a tag representation and a contextual token representation.
Advantages of adopting representation learning are two-fold: First, learning context-aware proficiency of labeling functions enables the weak supervision procedure to denoise unreliable labels at tokens where labeling functions have many disagreements. Second, because embedding methods have demonstrated great potential in capturing semantic meanings, the proposed model allows flexible transfer of existing NER pipelines to new domains through leveraging pre-trained domain-specific embeddings.
DWS relies on a graphical model to capture statistical dependencies among tokens, weak labels, and true latent NE tags. However, latent variable estimation is challenging, and the techniques are often both sample- and computationally complex. For example, some prior approaches require a Gibbs-based sampling algorithm, while others require estimating the full inverse covariance matrix among the labelers. In video analysis, Varma et al. (2019b) use multiple iterations of stochastic gradient descent (SGD) to learn accuracy parameters, but the dependencies are limited to weak labels and true labels. When the context of the sequential inputs is directly involved in the model, optimizing or even formulating the analytical solution becomes much more difficult.
Our solution is motivated by the advantages of using neural networks to model the transition and output distributions (Bengio and Frasconi, 1996; Li and Shum, 2006). Instead of deriving analytical formulations to learn the parameters for the given structure, we use deep neural networks to approximate conditional dependencies and sequential transitions. Furthermore, the marginal likelihood of the proposed model is optimized via hard EM (Min et al., 2019) to find the most probable sequence of latent tags. Such a hybrid process allows us to easily integrate distributed representations in our model and also enables us to explore complex model structures.
We benchmarked the proposed DWS model on several NER datasets and compared it with recent weak supervision approaches. Experimental results show DWS's advantage on tagging tasks where there are many conflicting weak labels. Furthermore, we conduct a detailed analysis to evaluate the complexity of these NER datasets in terms of the number of NE tags and labeling functions as well as the amount of inconsistency among the weak labels.
We conclude that some datasets used in past works are useful for evaluating the robustness of weak supervision models, whereas others contain weak labels that are trivial to denoise and make all algorithms look equivalent. Therefore, directly comparing results on weakly labeled datasets without accounting for the difficulty of denoising their labels can lead to misleading conclusions. To more rigorously test how the performance of the algorithms scales with the difficulty of the weak supervision problem, we introduce a new method that stratifies a dataset into tokens containing varying levels of weak labeler disagreement, quantified by the entropy of the weak label predictions, and compares how the performance of each algorithm scales with the difficulty of the denoising task. Results show that the performance advantage of our algorithm over current methods grows quickly with respect to the disagreement among the weak labelers.

Related Work
Snorkel is a well-known tool that learns a generative model to estimate the accuracies of and correlations between weak labelers on i.i.d. data. SwellShark treats weakly supervised NER as an i.i.d. task but requires creating additional heuristics (sometimes called entity span generators) to find entity boundaries. In recent years, various new algorithms have extended the weak supervision idea to tasks involving sequential data such as NER. BOND (Liang et al., 2020) and AutoNER (Shang et al., 2018) respectively denoise the predictions of a single weak labeler and of a set of dictionaries. Several generative approaches based on Hidden Markov Models (Lison et al., 2020; Safranchik et al., 2020) and a discriminative method, the Fuzzy-LSTM-CRF (Shang et al., 2018), have been proposed to model the dependencies between NE tags; with these, entity span generators are no longer needed. There have also been several relevant approaches to other weak supervision tasks. ReHession (Liu et al., 2017) was developed for weakly supervised relation extraction; it tries to learn the contexts where each weak labeler is proficient, then uses that knowledge to infer the true labels. Varma et al. (2019a) propose a robust PCA-based algorithm to learn dependency structures for image classification. In the application of video analysis, Fu et al. (2020) apply a general binary Ising model to factorize likelihood expectations over cliques so that an analytical solution can be found to speed up model parameter learning.
In this paper, we systematically compare most of the recent approaches proposed for NER, except for BOND, AutoNER, and SwellShark, which are not directly applicable to our experimental settings.

Problem Definition
We assume a sequential labeling problem formulation in which we are given a sequence of tokens $\mathcal{X} = \{x_1, \dots, x_N\}$ that maps to a sequence of latent class variables $\mathcal{Y} = \{y_1, \dots, y_N\}$. In a fully supervised scenario, $\mathcal{Y}$ is usually partially annotated, i.e., $\mathcal{Y} = \mathcal{Y}_{train} \cup \mathcal{Y}_{test}$, so model parameters are estimated from $(\mathcal{X}_{train}, \mathcal{Y}_{train})$. Because $\mathcal{Y}_{train}$ is usually expensive to obtain, our primary goal is to estimate $\mathcal{Y}_{train}$ with the help of multiple, potentially noisy labeling sources. Suppose we have a set of weak labels $\mathcal{L}_{train}$ available for $\mathcal{X}_{train}$, where each token is assigned a set of weak labels provided by $m$ different labeling sources $\lambda_1, \dots, \lambda_m$ (labelers can choose to make no prediction by voting 'Abstain'). For simplicity, we drop the distinction between training and test data, and simply use $\mathcal{X}$, $\mathcal{Y}$, $\mathcal{L}$ to represent all tokens, latent tags, and weak labels in our problem. Additionally, we define the vote of weak labeler $\lambda_j$ on token $x_i$ as $l_{i,j}$.
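The vote matrix over all tokens and labeling sources can be sketched as follows; this is our own illustration of the $l_{i,j}$ notation, and the helper name `collect_weak_labels` is hypothetical.

```python
def collect_weak_labels(tokens, labelers, abstain="Abstain"):
    """Build the weak-label matrix: entry [i][j] is l_{i,j}, the vote of
    labeling source lambda_j on token x_i, with None mapped to 'Abstain'."""
    matrix = []
    for x in tokens:
        row = []
        for lam in labelers:
            vote = lam(x)
            row.append(vote if vote is not None else abstain)
        matrix.append(row)
    return matrix
```

Each row then carries the (possibly conflicting) evidence that the denoising model must reconcile for one token.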
In a supervised setting where $\mathcal{Y}$ is given, we learn model parameters $\Theta$ by maximizing the log-likelihood of $\mathcal{Y}$ given the input $\mathcal{X}$ with respect to $\Theta$:

$$\Theta^* = \arg\max_\Theta \log P(\mathcal{Y}|\mathcal{X}; \Theta) \quad (1)$$

whereas in our weak supervision scenario, $\mathcal{Y}$ is hidden and $\mathcal{L}$ is fully observable, so the learning objective is instead to maximize the marginal likelihood of the weak labels. Letting $E$ be the set of $K$ entity classes (which includes 'No Entity'), we have:

$$P(\mathcal{L}|\mathcal{X}) = \sum_{\mathcal{Y}_c \in E^N} P(\mathcal{L}, \mathcal{Y}_c|\mathcal{X}) \quad (2)$$

which can be used to compute the objective as follows:

$$\Theta^* = \arg\max_\Theta \log P(\mathcal{L}|\mathcal{X}; \Theta) \quad (3)$$

Model Overview

Model Structure and Likelihood
To incorporate the dependencies among tokens, weak labels, and true labels, we define the graphical structure illustrated in Figure 1. The proposed graphical model is partially directed; its analytical solutions and exact inference are available through segmentations of the input sequence, but are usually complex to derive. In our approach, the likelihood $P(\mathcal{L}, \mathcal{Y}|\mathcal{X})$ defined in (2) is factorized as $P(\mathcal{L}|\mathcal{X}, \mathcal{Y})$ and $P(\mathcal{Y}|\mathcal{X})$, each approximated via a neural network. These networks are described in Sections 4.2 and 4.3, respectively. The parameters are learned by maximizing an approximation to the marginal likelihood $P(\mathcal{L}|\mathcal{X})$ using a hard EM algorithm. An in-depth description of the model formulation follows. According to (3) and (2) we have:

$$\log P(\mathcal{L}|\mathcal{X}) = \log \sum_{\mathcal{Y}_c \in E^N} P(\mathcal{L}|\mathcal{X}, \mathcal{Y}_c)\, P(\mathcal{Y}_c|\mathcal{X})$$

We now assume different labeling functions are independent of each other given the input text and true labels. Additionally, we assume the weak labels at token $x_i$ are conditionally independent of the true labels at any index $k \neq i$ when given the input text and the true label at index $i$ (d-separation details are in Appendix A.2). Therefore, we have:

$$P(\mathcal{L}|\mathcal{X}, \mathcal{Y}_c) = \prod_{i=1}^{N} \prod_{j=1}^{m} P(l_{i,j}|\mathcal{X}, y_{c,i})$$

where $y_{c,i}$ denotes the label at index $i$ of $\mathcal{Y}_c$. This means:

$$\log P(\mathcal{L}|\mathcal{X}) = \log \sum_{\mathcal{Y}_c \in E^N} P(\mathcal{Y}_c|\mathcal{X}) \prod_{i=1}^{N} \prod_{j=1}^{m} P(l_{i,j}|\mathcal{X}, y_{c,i})$$

In its current form, the conditional log-likelihood is difficult to optimize because of the sum-product inside the logarithm. As an alternative, we maximize $J_{HardEM}$, an approximation to the conditional log-likelihood in which the summation over entity labels is replaced with a maximum:

$$J_{HardEM} = \max_{\mathcal{Y}_c \in E^N} \Big[ \log P(\mathcal{Y}_c|\mathcal{X}) + \sum_{i=1}^{N} \sum_{j=1}^{m} \log P(l_{i,j}|\mathcal{X}, y_{c,i}) \Big]$$
This new optimization problem attempts to find accurate modes of $P(\mathcal{L}, \mathcal{Y}|\mathcal{X})$. All that remains is to formulate $P(\mathcal{Y}_c|\mathcal{X})$ and $P(l_{i,j}|\mathcal{X}, y_{c,i})$ for each pair $(i, j)$.
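The sum-to-max substitution can be checked by brute force on a toy problem. The sketch below is our own illustration (not the paper's code); `log_p_y` and `log_p_l` stand in for the two model components that the following sections define.

```python
import itertools
import math

def j_hard_em(tags, n_tokens, log_p_y, log_p_l):
    """Brute-force J_HardEM on a toy problem: replace the intractable sum
    over all tag sequences with a max, i.e.
    max_Y [ log P(Y|X) + sum_{i,j} log P(l_{i,j}|X, y_i) ].

    log_p_y: function mapping a tag sequence (tuple) to log P(Y|X)
    log_p_l: function (i, y_i) -> total log-likelihood of the weak
             labels observed at token i given tag y_i
    """
    best = -math.inf
    # Enumerate every sequence in E^N (only feasible for tiny N and |E|).
    for y_seq in itertools.product(tags, repeat=n_tokens):
        score = log_p_y(y_seq) + sum(log_p_l(i, y) for i, y in enumerate(y_seq))
        best = max(best, score)
    return best
```

The enumeration is exponential in $N$, which is exactly why the paper later resorts to Viterbi decoding over the CRF structure instead.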

Modeling Weak Labeler Representations
In practice, a weak labeler is usually derived from a specific rule or a controlled vocabulary. Thus, it is reasonable to assume its accuracy depends on the context of the token it is making a prediction on. Inspired by Liu et al. (2017), we model the event that labeling function $\lambda_j$ provides the correct label for $x_i$ as a discrete event following a Bernoulli distribution with success probability

$$\alpha_{ij} = \sigma\!\left(\frac{\theta_j^\top z_i}{\sqrt{d}}\right) \quad (8)$$

where $\sigma$ is the sigmoid function, $\theta_j$ is a learnable embedding specific to $\lambda_j$, and $z_i \in \mathbb{R}^d$ is the contextual embedding of token $x_i$. This formulation improves the modeling capacity over many past methods such as Snorkel or the HMM (Safranchik et al., 2020; Lison et al., 2020), which have purely class-conditioned accuracies. In addition, it allows the utilization of external knowledge from large pre-trained language models by making $z_i$ a function of their contextual embeddings. When a weak labeler abstains, there is no concept of accuracy. We chose to set the probability of 'Abstain' votes to 1, but in the future one could instead model the probability of abstaining.
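As a minimal sketch of this accuracy model (assuming the reconstructed scaled dot-product form above; in the real model $\theta_j$ and $z_i$ are trained end to end rather than passed in):

```python
import numpy as np

def labeler_accuracy(theta_j, z_i):
    """Context-dependent proficiency of labeler j on token i:
    alpha_ij = sigmoid(theta_j . z_i / sqrt(d)),
    where d is the shared embedding dimension."""
    d = len(z_i)
    score = float(np.dot(theta_j, z_i)) / np.sqrt(d)
    return 1.0 / (1.0 + np.exp(-score))
```

Because the same labeler embedding $\theta_j$ is scored against different token contexts $z_i$, the model can trust a labeler in contexts where it tends to be right and discount it elsewhere.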

Modeling Class Representations and Transition Scores
We model $P(\mathcal{Y}_c|\mathcal{X})$ as a function of the contextual embeddings $z$. The model uses a linear-chain conditional random field (CRF) output layer, which is often utilized to model dependencies between labels (Huang et al., 2015; Lample et al., 2016; Akbik et al., 2018). The distribution $P(\mathcal{Y}_c|\mathcal{X})$ is given by

$$P(\mathcal{Y}_c|\mathcal{X}) = \frac{C(\mathcal{Y}_c, \mathcal{X})}{\sum_{\mathcal{Y}' \in E^N} C(\mathcal{Y}', \mathcal{X})}, \qquad C(\mathcal{Y}, \mathcal{X}) = \exp\Big(\sum_{i=1}^{N} \big(T_{y_{i-1}, y_i} + s_{i, y_i}\big)\Big) \quad (10)$$

and

$$s_{i, y_i} = \frac{t_{y_i}^\top z_i}{\sqrt{d}} \quad (11)$$

where $T$ is a learnable matrix defining the transition scores between any two entity classes and $t_{y_i}$ is the learnable embedding for class $y_i$. We have scaled the dot products in both the formula for $\alpha_{ij}$ and (11) by the square root of the embedding size $d$, which significantly increases the stability of training, as was shown to be useful in Vaswani et al. (2017).
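The CRF scoring of a single tag sequence can be sketched as follows (our own illustration; the first transition is simply skipped here, whereas a fuller implementation would use a learned start state):

```python
import numpy as np

def crf_score(T, emissions, y_seq):
    """Unnormalized log-score of a tag sequence under a linear-chain CRF:
    sum over positions of the transition score T[y_{i-1}, y_i] plus the
    emission score s_{i, y_i}. Normalizing over all sequences gives P(Y|X)."""
    score = emissions[0, y_seq[0]]
    for i in range(1, len(y_seq)):
        score += T[y_seq[i - 1], y_seq[i]] + emissions[i, y_seq[i]]
    return float(score)
```

Here `emissions[i, k]` stands for $s_{i,k}$; in the paper these come from dot products between class embeddings and contextual token embeddings.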

Algorithm for Optimization
To maximize the objective $J_{HardEM}$, we repeat the following two steps:

Step 1 (hard E-step): $\hat{\mathcal{Y}} = \arg\max_{\mathcal{Y}_c \in E^N} P(\mathcal{L}, \mathcal{Y}_c|\mathcal{X}; \Theta)$

Step 2 (M-step): $\Theta \leftarrow \arg\max_\Theta \log P(\mathcal{L}, \hat{\mathcal{Y}}|\mathcal{X}; \Theta)$

This approach is often termed hard EM and has been successful in other areas such as weakly supervised relation extraction (Liu et al., 2017) and question answering (Min et al., 2019).
Step 1 can be computed as follows:

$$\hat{\mathcal{Y}} = \arg\max_{\mathcal{Y}_c \in E^N} \sum_{i=1}^{N} \big(T_{y_{c,i-1}, y_{c,i}} + \tilde{s}_{i, y_{c,i}}\big), \qquad \tilde{s}_{i, y_{c,i}} = s_{i, y_{c,i}} + \sum_{j=1}^{m} \log P(l_{i,j}|\mathcal{X}, y_{c,i}) \quad (13)$$

The details of the derivation are in Appendix A.1. Equation (13) can be solved efficiently with the same Viterbi decoding algorithm used in Huang et al. (2015). In practice, we constrain each $y_i$ to be in the set of classes voted on by the weak labelers on token $x_i$ by setting $\tilde{s}_{i, y_{c,i}} = -\infty$ for any $y_{c,i}$ that does not satisfy the constraint. Additionally, we found it useful to penalize choosing sequences $\mathcal{Y}$ containing illegal class transitions. This was done by adding a (negative) hyperparameter $\tau$ to the transition matrix $T$ from (13) in all locations that correspond to illegal transitions. For example, transitioning from the middle to the beginning of a 'Person' entity (i.e., I-Per → B-Per) would be penalized, as it does not make sense.
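The constrained decoding above can be sketched as standard Viterbi over modified scores. This is our own illustration under the stated assumptions (classes as integer indices, `allowed[i]` as the set of voted classes at token i), not the paper's implementation.

```python
import numpy as np

def constrained_viterbi(T, emissions, allowed, illegal, tau):
    """Hard E-step sketch: Viterbi decoding over modified scores.

    emissions[i, k] plays the role of s~_{i,k}; tags not in allowed[i]
    (the classes voted on at token i) stay masked at -inf, and transitions
    in `illegal` (a set of (from, to) index pairs) receive the penalty tau.
    """
    N, K = emissions.shape
    Tpen = T.astype(float).copy()
    for a, b in illegal:
        Tpen[a, b] += tau  # tau is negative, discouraging e.g. I-Per -> B-Per
    scores = np.full((N, K), -np.inf)
    back = np.zeros((N, K), dtype=int)
    for k in allowed[0]:
        scores[0, k] = emissions[0, k]
    for i in range(1, N):
        for k in allowed[i]:
            prev = scores[i - 1] + Tpen[:, k]
            back[i, k] = int(np.argmax(prev))
            scores[i, k] = prev[back[i, k]] + emissions[i, k]
    # Backtrack from the best final tag.
    path = [int(np.argmax(scores[-1]))]
    for i in range(N - 1, 0, -1):
        path.append(int(back[i, path[-1]]))
    return path[::-1]
```

With a strongly negative `tau`, the decoder routes around a penalized transition even when the emission scores favor it.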

Training Details
The contextual token embeddings z used to calculate (8) and (11) are obtained from a trainable single layer bidirectional LSTM that uses contextual word embeddings from pre-trained BERT models as described in Section 5.2. We choose to freeze the parameters of the pre-trained models to allow fast and inexpensive training.
To obtain a better starting point for the EM algorithm, the models are given a warm start by training for one epoch where the token labels Y in step 2 of the optimization algorithm in Section 4.4 are set to be the majority vote labels.
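The majority-vote initialization can be sketched as below (our own illustration; the helper name `warm_start_labels` is hypothetical):

```python
from collections import Counter

def warm_start_labels(vote_matrix, abstain="Abstain", no_entity="O"):
    """Majority-vote initialization for the hard-EM warm start: for each
    token, take the most common non-abstaining vote, falling back to
    'No Entity' when every labeler abstains."""
    labels = []
    for row in vote_matrix:
        votes = [v for v in row if v != abstain]
        labels.append(Counter(votes).most_common(1)[0][0] if votes else no_entity)
    return labels
```

Starting from these labels biases the first M-step toward a reasonable region of parameter space before EM begins refining the labeler proficiencies.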
Lastly, we choose the model parameters to be those that give the best entity F1 score on a held out validation set over 10 random restarts of 5 epochs each. Further details of the training procedure and architectures used are described in Appendix A.3 and A.4.

Experiments and Results
Experiments are conducted on four public weakly supervised NER datasets (Lison et al., 2020; Safranchik et al., 2020), as summarized in Table 1. We use the same weak labeling functions as reported in these approaches. Each dataset is split into train, validation, and test sets using the same splits as Safranchik et al. (2020) for NCBI-Disease, BC5CDR, and LaptopReview, and Liang et al. (2020) for CoNLL2003.

Weakly Labeled Dataset Difficulty
We first conduct a study to establish the difficulty of denoising the weak labels in each dataset. As we will demonstrate, this allows a detailed analysis of the strengths and weaknesses of each model. Both the amount and the types of disagreement among the weak labelers differ drastically across datasets. To understand the prominent types of disagreement in each weakly labeled dataset, we calculated the number of tokens containing each type of disagreement. The results are displayed in the last three columns of Table 1, and show that weak labelers on NCBI-Disease, BC5CDR, and LaptopReview disagree on a token's positioning within an entity much more often than on the unpositioned entity class. This indicates that the main difficulty on those weakly labeled datasets is resolving disagreements on where each entity begins and ends. On the contrary, CoNLL2003 contains many disagreements on both the token position and the unpositioned class predictions.
We also include Figure 2 to emphasize the difficulty of the weakly labeled CoNLL2003 dataset compared to the others. On average, a much higher number of classes is disagreed upon per token, which serves as a heuristic for the difficulty of a weakly labeled dataset.
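The two disagreement statistics above can be sketched as follows; this is our own illustration of the counting (assuming BILU-prefixed tags such as 'B-PER'), not the paper's analysis code.

```python
def disagreement_counts(vote_matrix, abstain="Abstain"):
    """Count tokens with more than one distinct positioned tag voted on
    (e.g. 'B-PER' vs 'I-PER') and tokens with more than one distinct
    unpositioned class voted on (e.g. 'PER' vs 'ORG')."""
    positioned = unpositioned = 0
    for row in vote_matrix:
        votes = {v for v in row if v != abstain}
        if len(votes) > 1:
            positioned += 1
        # Strip the BILU position prefix to get the unpositioned class.
        classes = {v.split("-", 1)[-1] for v in votes}
        if len(classes) > 1:
            unpositioned += 1
    return positioned, unpositioned
```

Note that a 'B-PER' vs 'I-PER' conflict counts as a positioned disagreement only, matching the boundary-vs-class distinction drawn above.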

Results
Our experiments focus on the following topics: (1) Comparison of the proposed method's performance with existing methods on benchmark datasets. (2) Stratified analysis and benchmarking of various models w.r.t. the difficulty of the NER task.
(3) Systematic study to establish the importance of various design decisions of our proposed method.
The proposed DWS model is compared to several recent weak supervision approaches.

Table 1: The weak labels for CoNLL2003 are defined by Lison et al. (2020), and the weak labels for NCBI-Disease, LaptopReview, and BC5CDR are defined by Safranchik et al. (2020). The last three columns are defined as follows. '#Unpositioned': the number of tokens with more than one unpositioned class voted on. '#Positioned': the number of tokens with more than one positioned class voted on.

Our weak supervision pipeline has two steps. Step 1: First it learns the latent labels of the training data using weak labels and tokens, so that the learned labels can be used as annotated training data to train NER classifiers. We measure model performance in this step by evaluating the micro-entity precision, recall, and F1 scores of the learned latent labels against the ground-truth labels. Table 2 (upper) displays these results and shows that the proposed DWS method creates training labels with an F1 score 1.65% higher than the nearest comparison model on CoNLL2003, and F1 scores within 0.1% and 0.03% of the best comparative approach on BC5CDR and NCBI-Disease, respectively. Additionally, DWS is outperformed by an F1 score of 0.58% on LaptopReview, but this is expected since the dataset contains only 43 tokens where the weak labelers disagree. This is likely far too little for DWS to learn robust deep representations for the accuracies of each weak labeler.
Step 2: Using the learned labels as training labels, we train classifiers and apply them to test data. We train the same classifier mentioned in Safranchik et al. (2020) on the weakly labeled training data obtained by each algorithm on all benchmark datasets. We also keep the test data identical for each dataset and report the performance obtained on test data using micro-entity precision, recall, and F1 score. According to the results displayed in Table 2 (lower), the proposed DWS outperforms the next best model by an F1 score of 3.95% on CoNLL2003, which is arguably the hardest dataset given the analysis in Section 5.1. Additionally, DWS outperforms the closest competitor on NCBI-Disease by an F1 score of 0.83%, and achieves the second best performance on both BC5CDR and LaptopReview.
The purpose of reporting performance separately for the two steps is to clearly demonstrate the improvement obtained by weak supervision alone, which only compares the quality of the learned annotations with the true labels in Step 1. In our view, this is important because, in practice, model selection and quality control for weak supervision should also focus on the quality of the learned annotations. The second step, after the training data is automatically annotated, focuses on the bias-variance problem in a supervised scenario, where the main goal is to select the classifier that generalizes best to unseen data.

Performance vs Entropy of Weak Labels
As previously discussed, comparing F1 scores alone across weak supervision approaches without knowing the difficulty of the problem can lead to misleading conclusions. For example, if the weak labels provided for a dataset are highly correlated, denoising them becomes a trivial problem, meaning algorithms will have indistinguishable F1 scores.

Table 2: Step 1 (upper): Training set results of the training labels produced by the weak supervision models. Step 2 (lower): Test set results of discriminators trained on the label predictions of the weakly supervised NER algorithms. All of the results in both tables are the average scores over 5 runs of each model using different random seeds. * HMM1 is from Lison et al. (2020) and HMM2 is the Hidden Markov Model from Safranchik et al. (2020) without 'linking' rules. Note that we attempted to train HMM2 on CoNLL2003 but ran out of memory when using 64GB of RAM. ** We use our own implementation of the Fuzzy-LSTM-CRF. The results differ from Shang et al. (2018) since different weak labelers are used.

To more clearly differentiate the performance of the algorithms, we can calculate how their performance scales with the difficulty of the denoising task. Intuitively, we expect that better algorithms should be more robust to challenging denoising tasks. To demonstrate this, we first bucket tokens in the weakly labeled CoNLL2003 training dataset by the level of disagreement among their weak labels (which can be thought of as the difficulty of denoising them), and then plot the differences in average token F1 scores between DWS and the comparison models in each bucket. To quantify the 'disagreement' of the weak labelers on a token, we use the entropy of the distribution of their non-abstained votes. To gain further insight into the strengths and weaknesses of each model, we group the entropy and F1 calculations by three different class types (Token Position, Positioned Class, and Unpositioned Class), described in Section 5.1. The results are plotted in Figure 3, which shows the improvement in F1 score obtained by DWS over the other methods at several different levels of entropy. The left panel shows that as entropy increases, the improvement on unpositioned classes generally also increases. The middle panel shows that our model does not have much of an advantage in resolving disagreements over the position each token has within an entity (i.e., B, I, L, U), and that Fuzzy-LSTM-CRF is actually significantly better at high entropy levels. Most importantly, when both types of disagreement are combined (Figure 3, right), the advantage of DWS becomes very significant, as the gaps in performance between it and the others are steadily increasing functions of entropy.
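The per-token disagreement measure used for bucketing can be sketched as follows (our own illustration of the entropy computation, assuming base-2 logarithms):

```python
import math
from collections import Counter

def vote_entropy(votes, abstain="Abstain"):
    """Shannon entropy (bits) of the distribution of non-abstained weak
    labels on one token: 0 means full agreement; higher values mean the
    token is harder to denoise."""
    counts = Counter(v for v in votes if v != abstain)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # all labelers abstained; no disagreement to measure
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Tokens are then grouped by this value, so each bucket holds tokens of roughly equal denoising difficulty.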

Ablation Studies
To better understand the importance of each component in DWS, we study how the performance of DWS on CoNLL2003 is affected by removing individual design elements. Specifics of the experiments are as follows.

No Penalty: Sets the penalty given to illegal class transitions (defined in Section 4.4) to 0. Without this illegal class transition penalty, validation and test F1 scores drop by 1.27% and 2.34%, respectively.

No Penalty No CRF: In addition to removing the penalty for illegal tag transitions as explained above, we treat the true tags as conditionally independent by replacing the CRF in $P(\mathcal{Y}|\mathcal{X})$ with a token-wise softmax. These changes lower the test F1 score by 8.77%. This experiment highlights the importance of modeling sequential dependencies, since without them the performance is lower than even majority voting.

No Warm Start: Warm starting by initializing the hard EM procedure with majority votes helps achieve better performance compared to random initialization. After iterating hard EM for the same number of epochs, the warm start boosts validation and test F1 scores by 1.16% and 2.72%, respectively, compared to random initialization.

The results of each ablation are reported in Table 3.

Figure 3: Each plot contains 95% confidence intervals on the difference in mean token F1 scores between DWS and comparison models in each bucket. These were calculated using the results from 5 randomized runs of each model. When the entropy is large, there is a lot of disagreement; when it is zero, the weak labelers all agree on the same class.

Discussion
In Section 5.1 we showed that the amount of disagreement between the weak labelers varies substantially between datasets. This insight is important when interpreting the effectiveness of a weak supervision model, because on datasets containing very little disagreement we would not expect to learn anything much different from majority voting. As the number of contradictions between the weak labeler votes increases, for the most part so does the amount of information available to gauge the accuracies of the weak labelers and surpass majority voting. This means we would roughly expect a good algorithm to have its greatest performance advantage on CoNLL2003, followed by NCBI-Disease and BC5CDR, and lastly LaptopReview, which did not contain many weak labeler disagreements. The results in Section 5.2 show that, as we would hope, our method has the largest performance advantage on CoNLL2003 and retains strong results on the remaining datasets, which have fewer disagreements.
In Section 5.3 ('Performance vs Entropy'), we more concretely show that the F1 score advantage of DWS over the other models when using positioned labels increases with respect to the amount of contradiction information. These advantages may be attributable to the learned proficiency representations of weak labelers that help discriminate noisy labels provided by labeling functions with low proficiencies.
Experiments in Section 5.3 also show that DWS is very effective at resolving disagreements on unpositioned classes but is more mediocre at denoising disagreements on the entity positioning. These results suggest that the ideal model to use in a practical scenario likely depends on the amount of each type of disagreement in the given weakly labeled dataset.

Conclusion
In this paper we introduced a novel method (DWS) for weakly supervised NER which learns context-dependent proficiency for labeling sources while also modeling sequential dependencies among weak labels, inputs, and true labels. The proposed approach integrates representation learning and graphical models within the weak supervision setting, and obtains better performance on public benchmark datasets than recently introduced approaches. The proposed method is quite generic and can be applied to other sequential learning tasks in NLP or in other modalities, such as image analysis and computer vision, by adopting various pre-trained domain-specific models to embed the input sequence.

A.1 Derivations
The details for efficiently computing $\hat{\mathcal{Y}}$ in Step 1 of the hard EM algorithm mentioned in Section 4.4 are as follows:

$$\arg\max_{\mathcal{Y}_c \in E^N} P(\mathcal{L}, \mathcal{Y}_c|\mathcal{X}) = \arg\max_{\mathcal{Y}_c \in E^N} \Big[ \log P(\mathcal{Y}_c|\mathcal{X}) + \sum_{i=1}^{N} \sum_{j=1}^{m} \log P(l_{i,j}|\mathcal{X}, y_{c,i}) \Big]$$

$$= \arg\max_{\mathcal{Y}_c \in E^N} \Big[ \log C(\mathcal{Y}_c, \mathcal{X}) - \log K + \sum_{i=1}^{N} \sum_{j=1}^{m} \log P(l_{i,j}|\mathcal{X}, y_{c,i}) \Big]$$

$$= \arg\max_{\mathcal{Y}_c \in E^N} \sum_{i=1}^{N} \big(T_{y_{c,i-1}, y_{c,i}} + \tilde{s}_{i, y_{c,i}}\big)$$

where $\tilde{s}_{i, y_{c,i}} = s_{i, y_{c,i}} + \sum_{j=1}^{m} \log P(l_{i,j}|\mathcal{X}, y_{c,i})$, $K$ is a normalizing constant (which does not depend on $\mathcal{Y}_c$ and can be dropped from the argmax), and $C$ is the function defined in equation (10).

A.2 Graphical Model of DWS
We illustrate the exact graphical model of DWS in Figure 4, where the structure is similar to the Input/Output HMM (IOHMM) proposed by Bengio and Frasconi (1996). In the IOHMM, transitions from latent variables $y_{i-1}$ to $y_i$ are directional, and analytical solutions are available based on factorizations of the sequence marginal likelihood. In our approach, similar analytical solutions could be derived for a partially directed model (e.g., a CRF) without the need for parameterized distributions of the input embeddings $z_i$. However, hybrid approaches that integrate neural networks with graphical models to enable efficient and scalable model training are preferred (Bourlard and Wellekens, 1990; Bengio et al., 1992; Bengio and Frasconi, 1996; Li and Shum, 2006).

A.3 Training Details

DWS: The optimizer used was RMSProp with learning rate 0.001. The model parameters were chosen to be those that gave the best micro-entity F1 score on the validation set over 10 random restarts of 5 epochs each. The batch sizes used were 32 for LaptopReview, 128 for NCBI-Disease, 256 for CoNLL2003, and 512 for BC5CDR, where we define the batch size by the number of tokens with at least one weak vote. Additionally, the illegal transition penalty was -2 for LaptopReview and -10 for the remaining datasets. Lastly, we used one warm-up epoch on majority vote labels and a dropout probability of 0.1 in the BiLSTM. These hyperparameters were chosen through manual tuning on the validation set.
Training Details Fuzzy-LSTM-CRF: The optimizer used was RMSProp with learning rate 0.01. The model parameters were chosen to be those that gave the best micro-entity F1 score on the validation set over 5 random restarts of 10 epochs each since it converged more slowly than DWS and had less variance among the random restarts.

Training Details of Other Comparing Models:
We used the same priors for the HMM1 model as were used in Lison et al. (2020) on CoNLL2003. For the remaining datasets, the priors were tuned on the validation sets. Additionally, we tried running the HMM from Safranchik et al. (2020) on CoNLL2003 but ran out of memory when using 64GB of RAM, even when reducing the batch size to 1 and using unpositioned classes rather than BILU-positioned classes (which reduces the number of classes by a factor of four).
To improve the performance of majority vote and unweighted vote on CoNLL2003, which has many very noisy labels, we predicted 'No Entity' with probability 1 when there were fewer than T votes, where T was a tuned hyperparameter. The best value of T was found to be 5 for both methods, chosen from {1, 5, 10}.
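The thresholded voting baseline described above can be sketched as follows (our own illustration; the helper name `thresholded_majority_vote` is hypothetical):

```python
from collections import Counter

def thresholded_majority_vote(votes, min_votes, abstain="Abstain", no_entity="O"):
    """Majority vote that backs off to 'No Entity' whenever fewer than
    `min_votes` labelers cast a real (non-abstain) vote on the token."""
    real = [v for v in votes if v != abstain]
    if len(real) < min_votes:
        return no_entity
    return Counter(real).most_common(1)[0][0]
```

With `min_votes` set to 5 this reproduces the backoff behavior tuned for the voting baselines on CoNLL2003.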

A.4 Discriminator Model Architecture / Training Details
The architecture of the discriminator was the same as used in Safranchik et al. (2020), which consisted of a 2-layer BiLSTM with 200-dimensional embeddings along with a CRF as the output layer. The model used both word embeddings from a variant of BERT and character embeddings from a CNN.