BERTifying the Hidden Markov Model for Multi-Source Weakly Supervised Named Entity Recognition

We study the problem of learning a named entity recognition (NER) tagger using noisy labels from multiple weak supervision sources. Though cheap to obtain, the labels from weak supervision sources are often incomplete, inaccurate, and contradictory, making it difficult to learn an accurate NER model. To address this challenge, we propose a conditional hidden Markov model (CHMM), which can effectively infer true labels from multi-source noisy labels in an unsupervised way. CHMM enhances the classic hidden Markov model with the contextual representation power of pre-trained language models. Specifically, CHMM learns token-wise transition and emission probabilities from the BERT embeddings of the input tokens to infer the latent true labels from noisy observations. We further refine CHMM with an alternate-training approach (CHMM-ALT). It fine-tunes a BERT-NER model with the labels inferred by CHMM, and this BERT-NER’s output is regarded as an additional weak source to train the CHMM in return. Experiments on four NER benchmarks from various domains show that our method outperforms state-of-the-art weakly supervised NER models by wide margins.


Introduction
Named entity recognition (NER), which aims to identify named entities from unstructured text, is an information extraction task fundamental to many downstream applications such as event detection (Li et al., 2012), relationship extraction (Bach and Badaskar, 2007), and question answering (Khalid et al., 2008). Existing NER models are typically supervised by a large number of training sequences, each pre-annotated with token-level labels. In practice, however, obtaining such labels can be prohibitively expensive. On the other hand, many domains have various knowledge resources such as knowledge bases, domain-specific dictionaries, or labeling rules provided by domain experts (Farmakiotou et al., 2000; Nadeau and Sekine, 2007). These resources can be used to match a corpus and quickly create large-scale noisy training data for NER from multiple views.
Learning an NER model from multiple weak supervision sources is a challenging problem. While there are works on distantly supervised NER that use only knowledge bases as weak supervision (Mintz et al., 2009; Shang et al., 2018; Cao et al., 2019; Liang et al., 2020), they cannot leverage complementary information from multiple annotation sources. To handle multi-source weak supervision, several recent works (Nguyen et al., 2017; Safranchik et al., 2020; Lison et al., 2020) leverage the hidden Markov model (HMM), modeling true labels as hidden variables and inferring them from the observed noisy labels through unsupervised learning. Though principled, these models fall short in capturing token semantics and context information, as they either model input tokens as one-hot observations (Nguyen et al., 2017) or do not model them at all (Safranchik et al., 2020; Lison et al., 2020). Moreover, the flexibility of HMM is limited as its transitions and emissions remain constant over time steps, whereas in practice they should depend on the input words.
We propose the conditional hidden Markov model (CHMM) to infer true NER labels from multi-source weak annotations. CHMM conditions HMM training and inference on BERT by predicting token-wise transition and emission probabilities from the BERT embeddings. These token-wise probabilities are more flexible than HMM's constant counterparts in modeling how the true labels should evolve according to the input tokens. The context representation ability they inherit from BERT also relieves the Markov constraint and expands HMM's context-awareness.
Further, we integrate CHMM with a supervised BERT-based NER model via an alternate-training method (CHMM-ALT). It fine-tunes BERT-NER with the denoised labels generated by CHMM. Taking advantage of the pre-trained knowledge contained in BERT, this process aims to refine the denoised labels by discovering the entity patterns neglected by all of the weak sources. The fine-tuned BERT-NER serves as an additional supervision source, whose output is combined with other weak labels for the next round of CHMM training. CHMM-ALT trains CHMM and BERT-NER alternately until performance converges.
Our contributions include:
• A multi-source label aggregator, CHMM, with token-wise transition and emission probabilities for aggregating multiple sets of NER labels from different weak labeling sources.
• An alternate-training method CHMM-ALT that trains CHMM and BERT-NER in turn utilizing each other's outputs for multiple loops to optimize the multi-source weakly supervised NER performance.
• A comprehensive evaluation on four NER benchmarks from different domains demonstrates that CHMM-ALT achieves a 4.83 average F1 score improvement over the strongest baseline models.
The code and data used in this work are available at github.com/Yinghao-Li/CHMM-ALT.

Related Work
Weakly Supervised NER There have been works that train NER models with different weak supervision approaches. Distant supervision, a specific type of weak supervision, generates training labels from knowledge bases (Mintz et al., 2009; Yang et al., 2018; Shang et al., 2018; Cao et al., 2019; Liang et al., 2020). But such a method is limited to one source and falls short of acquiring supplementary annotations from other available resources. Other works adopt multiple additional labeling sources, such as heuristic functions that depend on lexical features, word patterns, or document information (Nadeau and Sekine, 2007; Ratner et al., 2016), and unify their results through multi-source label denoising. Several multi-source weakly supervised learning approaches are designed for sentence classification (Ratner et al., 2017, 2019; Yu et al., 2020). Although these methods can be adapted for sequence labeling tasks such as NER, they tend to overlook the internal dependency between token-level labels during inference. Fries et al. (2017) target the NER task, but their method first generates candidate named entity spans and then classifies each span independently, so it suffers from the same drawback as the sentence classification models. A few works consider label dependency while dealing with multiple supervision sources. Lan et al. (2020) train a BiLSTM-CRF network (Huang et al., 2015) with multiple parallel CRF layers, one for each labeling source, and aggregate their transitions with confidence scores predicted by an attention network (Bahdanau et al., 2015; Luong et al., 2015). HMM is a more principled model for multi-source sequential label denoising, as the true labels are implicitly inferred through unsupervised learning without deliberately assigning any additional scores. Following this track, Nguyen et al. (2017) and Lison et al. (2020) use a standard HMM with multiple observed variables, each from one labeling source. Safranchik et al. (2020) propose linked HMM, which differs from an ordinary HMM by introducing unique linking rules as a supervision source additional to general token labels. However, these methods fail to utilize the context information embedded in the tokens as effectively as CHMM, and their NER performance is further constrained by the Markov assumption.
Neuralizing the Hidden Markov Model Some works attempt to neuralize HMM in order to relax the Markov assumption while maintaining its generative property (Kim et al., 2018). For example, Dai et al. (2017) and follow-up work incorporate recurrent units into the hidden semi-Markov model (HSMM) to segment and label high-dimensional time series; other work learns discrete template structures for conditional text generation using a neuralized HSMM. Wessels and Omlin (2000) and Chiu and Rush (2020) factorize HMM with neural networks to scale it up and improve its sequence modeling capacity. The work most related to ours leverages a neural HMM for sequence labeling (Tran et al., 2016). CHMM differs from neural HMM in that the tokens are treated as a dependency term in CHMM instead of as the observations in neural HMM. Besides, CHMM is trained with generalized EM, whereas neural HMM optimizes the marginal likelihood of the observations.

Figure 1: An example of label aggregation with two weak labeling sources. We use the BIO labeling scheme. PER represents person; LOC represents location.

Problem Setup
In this section, we formulate the multi-source weakly supervised NER problem. Consider an input sentence that contains T tokens w^(1:T). NER can be formulated as a sequence labeling task that assigns a label to each token in the sentence. Assuming the set of target entity types is E and the tagging scheme is BIO (Ramshaw and Marcus, 1995), NER models assign one label l from the label set L to each token, where the size of the label set is |L| = 2|E| + 1. For example, if E = {PER, LOC}, then L = {O, B-PER, I-PER, B-LOC, I-LOC}.
Suppose we have K weak sources, each of which can be a heuristic rule, knowledge base, or existing out-of-domain NER model. Each source serves as a labeling function that generates token-level weak labels from the input corpus, as shown in Figure 1. For the input sequence w^(1:T), we use x_k^(1:T), k ∈ {1, . . . , K} to represent the weak labels from source k, where x_k^(t) ∈ R^|L|, t ∈ {1, . . . , T} is a probability distribution over L. Multi-source weakly supervised NER aims to find the underlying true label sequence ŷ^(1:T), ŷ^(t) ∈ L, given {w^(1:T), x_{1:K}^(1:T)}.
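As an illustration of this setup, the following sketch (not from the paper's codebase; all names are hypothetical) builds the (K, T, |L|) tensor of weak-label distributions for a toy sentence with two sources:

```python
import numpy as np

# Hypothetical label set for E = {PER, LOC} under the BIO scheme (|L| = 2|E| + 1 = 5).
LABELS = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]
L2ID = {l: i for i, l in enumerate(LABELS)}

def to_distributions(tags):
    """Turn one source's hard BIO tags for a T-token sentence into a
    T x |L| matrix of one-hot probability rows (x_k in the paper)."""
    x = np.zeros((len(tags), len(LABELS)))
    for t, tag in enumerate(tags):
        x[t, L2ID[tag]] = 1.0
    return x

# Two weak sources annotating the same 3-token sentence.
x1 = to_distributions(["B-PER", "I-PER", "O"])
x2 = to_distributions(["B-PER", "O", "O"])
x = np.stack([x1, x2])  # shape (K=2, T=3, |L|=5)
```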

Methodology
In this section, we describe our proposed method CHMM-ALT. We first sketch the alternate-training procedure ( § 4.1), then explain the CHMM component ( § 4.2) and how BERT-NER is involved ( § 4.3).

Alternate-Training Procedure
The alternate-training method trains two models, a multi-source label aggregator (CHMM) and a BERT-NER model, in turn with each other's output. CHMM aggregates multiple sets of labels from different sources into a unified sequence of labels, while BERT-NER refines them with the language modeling ability it gained from pre-training. The training process is divided into two phases. (Footnote 1: We represent vectors, matrices, or tensors with bold fonts and scalars with regular fonts; 1:a denotes {1, 2, . . . , a}.)
• In phase I, CHMM takes the annotations x_{1:K}^(1:T) from the existing sources and produces a set of denoised labels y*^(1:T), which are used to fine-tune the BERT-NER model. The fine-tuned model is then regarded as an additional labeling source, whose outputs ỹ^(1:T) are added into the original weak label sets to give the updated observations x_{1:K+1}^(1:T).
• In phase II, each loop trains a new CHMM with the updated observations x_{1:K+1}^(1:T) from the previous loop. Its predictions are then adopted to fine-tune BERT-NER, whose output updates x_{K+1}^(1:T).
Figure 2 illustrates the alternate-training method. In general, CHMM gives high-precision predictions, whereas BERT-NER trades precision for recall. In other words, CHMM can classify named entities with high accuracy but is slightly disadvantaged in discovering all entities; BERT-NER increases the coverage with a certain loss of accuracy. Combined through the alternate-training approach, this complementarity between the models further increases the overall performance.
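The two-phase procedure above can be sketched as a small driver loop. This is a hedged illustration with stubbed-out components: `train_chmm`, `finetune_bert_ner`, and the argument names are placeholders, not the authors' implementation.

```python
def alternate_training(weak_labels, train_chmm, finetune_bert_ner, n_loops=3):
    """Phase I: denoise K sources with CHMM and fine-tune BERT-NER on the result.
    Phase II: repeat with BERT-NER's output appended as source K+1."""
    # Phase I (acyclic).
    denoised = train_chmm(weak_labels)        # y* from the K original sources
    bert_out = finetune_bert_ner(denoised)    # ỹ from the fine-tuned BERT-NER
    weak_labels = weak_labels + [bert_out]    # now K+1 sources
    # Phase II loops: the (K+1)-th slot is overwritten each round.
    for _ in range(n_loops):
        denoised = train_chmm(weak_labels)
        bert_out = finetune_bert_ner(denoised)
        weak_labels[-1] = bert_out
    return denoised
```

In practice the loop would exit via early stopping on a development score rather than a fixed `n_loops`.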

Conditional Hidden Markov Model
The conditional hidden Markov model is an HMM variant for multi-source label denoising. It models true entity labels as hidden variables and infers them from the observed noisy labels. Traditionally, a discrete HMM uses one transition matrix to model how the hidden labels transition and one emission matrix to model how the observations are generated from the hidden labels. These two matrices are constant, i.e., their values do not change over time steps. CHMM, on the contrary, conditions both its transition and emission matrices on the BERT embeddings e^(1:T) of the input tokens w^(1:T). This design not only allows CHMM to leverage the rich contextual representations of the BERT embeddings but also relieves the constant-matrix constraint.
In phase I, CHMM takes K sets of weak labels from the provided K weak labeling sources. In phase II, in addition to the existing sources, it takes another set of labels from the previously fine-tuned BERT-NER, making the total number of sources K + 1. For convenience, we use K as the number of weak sources below. Figure 3 shows a sketch of CHMM's architecture, with the plate notation relaxed to present details. z^(1:T) denotes the discrete hidden states of CHMM with z^(t) ∈ L, representing the underlying true labels to be inferred from the multiple weak annotations.

Figure 2: An illustration of the alternate-training method. Phase I is acyclic, starting from getting K weak labels from the supervision sources and ending at the fine-tuning of BERT-NER with CHMM's denoised output. Phase II contains several loops, each of which trains CHMM with K + 1 sources, including the additional BERT predictions from the previous loop, and fine-tunes BERT-NER using the updated denoised labels.

Model Architecture
Ψ^(t) ∈ R^{|L|×|L|} is the token-wise transition matrix, whose element ψ^(t)_{i,j} = p(z^(t) = j | z^(t−1) = i, e^(t)) is the probability of the hidden label moving from i to j at time step t. Φ^(t)_{1:K} ∈ R^{K×|L|×|L|} are the token-wise emission probabilities, whose element φ^(t)_{k,i,j} = p(x^(t)_{j,k} = 1 | z^(t) = i, e^(t)) represents the probability of source k observing label j when the hidden label is i at time step t.
For each step, e^(t) ∈ R^{d_emb} is the output of a pre-trained BERT, with d_emb being its embedding dimension. Ψ^(t) and Φ^(t)_{1:K} are calculated by applying multi-layer perceptrons (MLPs) to e^(t):

ψ̃^(t) = MLP_Ψ(e^(t)),  φ̃^(t) = MLP_Φ(e^(t)).

Since the MLP outputs are vectors, we reshape them into a matrix Ψ̃^(t) ∈ R^{|L|×|L|} and a tensor Φ̃^(t) ∈ R^{K×|L|×|L|}. To obtain proper probability distributions, we apply the Softmax function along the label axis so that the values are positive and sum up to 1:

Ψ^(t)_{i,:} = Softmax(Ψ̃^(t)_{i,:}),  Φ^(t)_{k,i,:} = Softmax(Φ̃^(t)_{k,i,:}),

where Softmax(a)_i = exp(a_i) / Σ_j exp(a_j) for an arbitrary vector a. The formulae in the following discussion always depend on e^(1:T), but we omit this dependency term for simplicity.
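A minimal numpy sketch of these token-wise parameters, with a single linear layer standing in for each MLP (the weights `W_psi` and `W_phi` and the function name are hypothetical):

```python
import numpy as np

def softmax(a, axis=-1):
    # Numerically stable softmax along the given axis.
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def token_wise_params(e_t, W_psi, W_phi, n_labels, n_src):
    """Map one token's embedding e_t (d_emb,) to a transition matrix
    Psi^(t) of shape (|L|, |L|) and emissions Phi^(t) of shape (K, |L|, |L|)."""
    psi = (W_psi @ e_t).reshape(n_labels, n_labels)         # vector -> matrix
    phi = (W_phi @ e_t).reshape(n_src, n_labels, n_labels)  # vector -> tensor
    # Softmax along the label axis so each row is a probability distribution.
    return softmax(psi, axis=-1), softmax(phi, axis=-1)
```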
Model Training According to the generative process of CHMM, the joint distribution of the hidden states and the observed weak labels for one sequence factorizes as

p(z^(0:T), x^(1:T) | θ) = p(z^(0)) ∏_{t=1}^{T} p(z^(t) | z^(t−1), e^(t)) ∏_{k=1}^{K} p(x_k^(t) | z^(t), e^(t)),

where θ represents all the trainable parameters.
HMM is generally trained with the expectation-maximization (EM, also known as Baum-Welch) algorithm. In the expectation step (E-step), we compute the expected complete-data log likelihood

Q(θ, θ_old) = E_z[ log p(z^(0:T), x^(1:T) | θ) | x^(1:T), θ_old ],

where θ_old denotes the parameters from the previous training step and E_z[·] is the expectation over the hidden states z. This expectation is defined by the smoothed marginals γ^(t)_i ≜ p(z^(t) = i | x^(1:T)) and the expected transition counts ξ^(t)_{i,j} ≜ p(z^(t−1) = i, z^(t) = j | x^(1:T)), both computed with the forward-backward algorithm (appendix A). In the maximization step (M-step), traditional HMM updates the parameters θ_HMM = {Ψ, Φ, π} in closed form using these pseudo-statistics. However, as the transitions and emissions in CHMM are not standalone parameters but MLP outputs, we cannot optimize CHMM this way. Instead, we update the model parameters through gradient descent w.r.t. θ_CHMM, with Q(θ, θ_old) as the objective function. In practice, the calculation is conducted in the logarithm domain to avoid the loss of precision that occurs when floating-point numbers become too small.
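The log-domain computation mentioned above can be illustrated with a forward pass that uses the log-sum-exp trick. This is a simplified sketch, assuming the per-step observation log-likelihoods have already been combined across the K sources into `log_b`; it is not the authors' implementation.

```python
import numpy as np

def log_forward(log_pi, log_psi, log_b):
    """Forward pass in the log domain to avoid underflow.
    log_pi: (|L|,) initial state log-probs; log_psi: (T, |L|, |L|) token-wise
    transition log-probs; log_b: (T, |L|) observation log-likelihoods."""
    def lse(a, axis):
        # Stable log-sum-exp along `axis`.
        m = a.max(axis=axis, keepdims=True)
        return (m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True))).squeeze(axis)

    T, L = log_b.shape
    log_alpha = np.empty((T + 1, L))
    log_alpha[0] = log_pi  # step 0 has no observation
    for t in range(T):
        # log alpha^(t+1)_j = log b^(t+1)_j + logsumexp_i(log alpha^(t)_i + log psi^(t)_{i,j})
        log_alpha[t + 1] = log_b[t] + lse(log_alpha[t][:, None] + log_psi[t], axis=0)
    return log_alpha
```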
To address the label sparsity issue, i.e., that some entities are observed by only a minority of the weak sources, we modify the observations x^(1:T) before training. If one source k observes an entity at time step t, i.e., x^(t)_{j,k} > 0 for some j ≠ 1, the observation of each non-observing source κ at t is modified to x^(t)_{1,κ} = ε, where ε is an arbitrarily small value. Note that x^(t)_{1,κ} corresponds to the observed label O.
CHMM Initialization Generally, HMM has its transition and emission probabilities initialized with the statistics Ψ* and Φ* computed from the observation set. But it is impossible to directly set Ψ^(t) and Φ^(t) in CHMM to these values, as these matrices are the outputs of the MLPs rather than standalone parameters. To address this issue, we pre-train the MLPs before starting CHMM's training by minimizing the mean squared error (MSE) between their outputs and the target statistics:

ℓ_init = Σ_t ( ‖Ψ^(t) − Ψ*‖_F² + Σ_k ‖Φ_k^(t) − Φ*_k‖_F² ),

where ‖·‖_F is the Frobenius norm. Right after initialization, the MLPs can only output similar probabilities for all time steps: Ψ^(t) ≈ Ψ*, Φ^(t) ≈ Φ*, ∀t ∈ {1, . . . , T}. Their token-wise prediction divergence emerges once CHMM has been trained. The initial hidden state z^(0) is fixed to O as it has no corresponding token.
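As an illustration, the target transition statistics Ψ* can be estimated from label sequences (e.g., majority-voted ones), and the initialization objective is a plain squared Frobenius distance. A hedged numpy sketch; the function names and the smoothing constant are ours, not the paper's:

```python
import numpy as np

def empirical_transitions(label_seqs, n_labels):
    """Psi*: empirical transition statistics from sequences of label ids.
    A tiny smoothing constant keeps unseen rows from being all-zero."""
    counts = np.full((n_labels, n_labels), 1e-8)
    for seq in label_seqs:
        for a, b in zip(seq[:-1], seq[1:]):
            counts[a, b] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def init_loss(psi_t, phi_t, psi_star, phi_star):
    """Squared Frobenius distance between the MLP outputs at one step
    and the corpus-level statistics Psi*, Phi*."""
    return ((psi_t - psi_star) ** 2).sum() + ((phi_t - phi_star) ** 2).sum()
```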
Inference Once trained, CHMM can provide the most probable sequence of hidden labels ẑ^(1:T) along with the probabilities of all labels y*^(1:T).

Improving Denoised Labels with BERT
The pre-trained BERT model encodes semantic and structural knowledge, which can be distilled to further refine the denoised labels from CHMM. Specifically, we construct the BERT-NER model by stacking a feed-forward layer and a Softmax layer on top of the original BERT to predict the probability of each class that a token belongs to (Sun et al., 2019). The probability predictions of CHMM, y*^(1:T), often referred to as soft labels, are chosen to supervise the fine-tuning procedure. Compared with the hard labels ẑ^(1:T), soft labels lead to a more stable training process and higher model robustness (Thiel, 2008; Liang et al., 2020). We train BERT-NER by minimizing the Kullback-Leibler (KL) divergence between the soft labels y* and the model output y:

ℓ(θ_BERT) = Σ_{t=1}^{T} D_KL( y*^(t) ‖ y^(t) ),

where θ_BERT denotes all the trainable parameters in the BERT model. BERT-NER does not update the embeddings e^(1:T) that CHMM depends on. We obtain the refined labels ỹ^(1:T) ∈ R^{T×|L|} from the fine-tuned BERT-NER directly through a forward pass. Different from CHMM, we continue BERT-NER's training with parameter weights from the last loop's checkpoint so that the model is initialized closer to the optimum. Correspondingly, phase II trains BERT-NER with a smaller learning rate, fewer epochs, and batch gradient descent instead of the mini-batch version. This strategy speeds up phase II training without sacrificing model performance, as y*^(1:T) does not change significantly from loop to loop.
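The soft-label objective is a sum of per-token KL divergences. A small numpy sketch of the loss value (a toy illustration only; the actual training computes this on BERT logits and backpropagates through it):

```python
import numpy as np

def kl_soft_label_loss(y_star, y_pred, eps=1e-12):
    """Sum over tokens of KL(y* || y), the fine-tuning signal for BERT-NER
    given CHMM's soft labels. y_star, y_pred: (T, |L|) rows summing to 1."""
    y_star = np.clip(y_star, eps, 1.0)
    y_pred = np.clip(y_pred, eps, 1.0)
    return float((y_star * (np.log(y_star) - np.log(y_pred))).sum())
```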

Experiments
We benchmark CHMM-ALT on four datasets against state-of-the-art weakly supervised NER baselines, including both distant learning models and multi-source label aggregation models. We also conduct a series of ablation studies to evaluate the different components in CHMM-ALT's design.

Setup
Datasets We consider four NER datasets covering the general, technological, and biomedical domains: 1) CoNLL 2003 (English subset) (Tjong Kim Sang and De Meulder, 2003) is a general-domain dataset containing 22,137 sentences manually labelled with 4 entity types. 2) The LaptopReview dataset (Pontiki et al., 2014) contains laptop-related entity mentions from customer reviews. 3) NCBI-Disease and 4) BC5CDR are biomedical datasets annotated with disease (and, for BC5CDR, chemical) entity mentions.

Baselines We compare our model to the following state-of-the-art baselines: 1) Majority Voting returns the label for a token that has been observed by most of the sources and randomly chooses one if there is a tie; 2) Snorkel (Ratner et al., 2017) treats each token in a sequence as i.i.d. and classifies it without considering its context; 3) SwellShark (Fries et al., 2017) improves Snorkel by predicting all the target entity spans before classifying them using naïve Bayes; 4) AutoNER (Shang et al., 2018) augments distant supervision by predicting whether two consecutive tokens should be in the same entity span; 5) BOND (Liang et al., 2020) adopts self-training and high-confidence selection to further boost distant-supervision performance; 6) HMM is the multi-observation generative model used in Lison et al. (2020) without the integrated neural network; 7) Linked HMM (Safranchik et al., 2020) uses linking rules to provide additional inter-token structural information to the HMM.
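For reference, the Majority Voting baseline with random tie-breaking can be sketched in a few lines (a toy illustration under our (K, T, |L|) representation, not the evaluation code):

```python
import random
import numpy as np

def majority_vote(x, seed=0):
    """x: (K, T, |L|) one-hot weak labels. Returns T label ids,
    breaking ties uniformly at random."""
    rng = random.Random(seed)
    votes = x.sum(axis=0)  # (T, |L|) vote counts per token
    out = []
    for row in votes:
        winners = np.flatnonzero(row == row.max())  # all tied top labels
        out.append(int(rng.choice(list(winners))))
    return out
```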
For the ablation study, we modify CHMM into an i.i.d. model by removing its transition matrices. This model, named CHMM-i.i.d., directly predicts the hidden states from the BERT embeddings while being otherwise identical to CHMM. We also investigate how CHMM-ALT performs with label aggregators other than CHMM. In addition, we introduce two upper bounds from different aspects: 1) a fully supervised BERT-NER model trained with manually labeled data, regarded as a supervised reference; 2) the best possible consensus of the weak sources. The latter assumes an oracle that always selects the correct annotations from the weak supervision sources. By definition, its precision is always 100% and its recall is non-decreasing in the number of weak sources.

Table 2: Evaluation results on four datasets, presented in the "F1 (Precision/Recall)" format. "CHMM + BERT-NER" is essentially CHMM-ALT's phase I output. "BOND-MV" is the BOND model trained with majority-voted labels. † indicates an unsupervised label denoiser; ‡ represents a fully supervised model. A model with †‡ is either distantly supervised or trains a supervised model with labels from the denoiser. Marked models report results from our experiments; in addition, Snorkel and Linked HMM also share our labeling sources.
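The best-possible-consensus upper bound can be computed directly from entity spans: an oracle keeps every gold entity matched by at least one source, so precision is 100% and recall is the coverage of the union of the sources. A sketch under the assumption that annotations are represented as (start, end, type) tuples:

```python
def oracle_consensus_recall(weak_spans, gold_spans):
    """weak_spans: list of per-source sets of (start, end, type) entity spans.
    Returns the oracle's recall: the fraction of gold entities that at
    least one source annotated correctly."""
    union = set().union(*weak_spans)
    hit = len(union & set(gold_spans))
    return hit / len(gold_spans)
```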

Evaluation Metrics
We evaluate the performance of NER models using entity-level precision, recall, and F1 scores. All scores are presented as percentages and averaged over 5 trials with different random seeds. The only tunable hyper-parameter in CHMM is the learning rate, but its influence is negligible: benefiting from the stability of the generalized EM algorithm, the model is guaranteed to converge to a local optimum if the learning rate is small enough. For all the BERT-NER models used in our experiments, the hyper-parameters except the batch size are fixed to the default values (appendix C).
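Entity-level scores count a prediction as correct only when both the span boundaries and the entity type match a gold entity exactly. A minimal sketch (our own helper, using the same (start, end, type) span representation as above):

```python
def entity_prf(pred_spans, gold_spans):
    """Entity-level precision, recall, and F1 over (start, end, type) spans."""
    pred, gold = set(pred_spans), set(gold_spans)
    tp = len(pred & gold)  # exact span-and-type matches
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```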

Implementation Details
To prevent overfitting, we use an early stopping strategy at two scales based on the development set: micro-scale early stopping chooses the best model parameters within each individual training run of CHMM and of BERT-NER, while macro-scale early stopping selects the best-performing model across phase II iterations, which reports the test results. In our experiments, phase II exits if the macro-scale development score has not increased for 5 loops or the maximum number of loops (10) is reached.
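The macro-scale loop can be sketched as follows, with `train_loop` and `dev_score` as placeholders for one CHMM + BERT-NER round and its development-set evaluation (names are ours):

```python
def run_phase_two(train_loop, dev_score, patience=5, max_loops=10):
    """Macro-scale early stopping over phase II loops: track the best
    dev-score checkpoint; stop after `patience` loops without improvement
    or when `max_loops` is reached."""
    best, best_loop, stale = -1.0, -1, 0
    for loop in range(max_loops):
        model = train_loop(loop)  # one CHMM + BERT-NER round
        score = dev_score(model)
        if score > best:
            best, best_loop, stale = score, loop, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_loop, best
```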

Main Results
Table 2 presents model performance across the different domains. Our alternate-training framework outperforms all weakly supervised baseline models. In addition, CHMM-ALT approaches or even exceeds the best source consensus, which demonstrates the effectiveness of the design. General HMM-based label aggregators such as CHMM cannot exceed the best consensus, since they can only predict entities observed by at least one source. Based on this fact, CHMM is designed to select the most accurate observations from the weak sources without shrinking their coverage. In comparison, BERT's language representation ability enables it to generalize entity patterns and discover entities annotated by none of the sources. Comparing CHMM + BERT to CHMM, we can conclude that BERT essentially exchanges precision for recall, and its high-recall predictions improve CHMM's result in return. The complementary nature of these two models is why CHMM-ALT improves the overall performance of weakly supervised NER.

Figure 4: F1 score evolution across the alternate-training phases. "PI" is phase I; "PII-i" is the i-th loop of phase II. The "strongest baseline" reports the result from the best-performing baseline in Table 2 for each dataset.

Analysis of CHMM
Looking at Table 2, we notice that CHMM performs the best amongst all generative models, including majority voting, HMM, and CHMM-i.i.d. The performance of the conventional HMM is largely limited by the Markov assumption with its unchanging transition and emission probabilities. The results in the table validate that conditioning the model on BERT embeddings alleviates this limitation. However, the transition matrices in HMM are indispensable, as implied by CHMM-i.i.d.'s results: they provide supplemental information about how the underlying true labels should evolve.

Analysis of Alternate-Training
Performance Evolution Figure 4 reveals the details of the alternate-training process. For less ambiguous tasks with fewer entity types, including NCBI-Disease, BC5CDR, and LaptopReview, BERT generally performs better in phase I but is surpassed in phase II. Interestingly, BERT's performance never exceeds that of CHMM on the LaptopReview dataset. This may be because BERT fails to construct sufficiently representative patterns from the denoised labels for this dataset. For CoNLL 2003, where it is harder for the labeling sources to model the language structures, the strength of a pre-trained language model in pattern recognition becomes more prominent. From the results, it seems that the improvement of the denoised labels y*^(1:T) provides only marginal extra information to BERT after phase II, as most of the gain comes from information provided by BERT itself. Even so, keeping phase II is reasonable when we want to get the best out of both the weak labeling sources and the pre-trained BERT.
BERT-NER Initialization CHMM-ALT initializes BERT-NER's parameters from its previous checkpoint at the beginning of each loop in phase II to reduce training time ( § 4.3). If we instead fine-tune BERT-NER from the initial parameters of the pre-trained BERT model for each loop, CHMM-ALT gets 84.30, 84.71, and 76.68 F1 scores on NCBI-Disease, BC5CDR, and LaptopReview datasets. These scores are close to the results in Table 2, but the training takes much longer. Consequently, our BERT-NER initialization strategy is a more practical choice overall.
Applying Alternate-Training to Other Methods Table 3 shows the alternate-training performance acquired with different label aggregators. The accompanying BERT-NER models are identical to those described in § 5.1. The results in the table suggest that the performance improvement from applying alternate-training to a label aggregator is stable and should generalize to other aggregation models.

Conclusion
In this work, we present CHMM-ALT, a multi-source weakly supervised approach that does not depend on manually labeled data to learn an accurate NER tagger. It integrates a label aggregator, CHMM, and a supervised model, BERT-NER, into an alternate-training procedure. CHMM conditions HMM on BERT embeddings to achieve greater flexibility and stronger context-awareness. Fine-tuned with CHMM's predictions, BERT-NER discovers patterns unobserved by the weak sources and complements CHMM. Training these models in turn, CHMM-ALT uses the knowledge encoded in both the weak sources and the pre-trained BERT model to improve the final NER performance. In the future, we will consider imposing more constraints on the transition and emission probabilities, or manipulating them according to sophisticated domain knowledge. This technique could also be extended to other sequence labeling tasks such as semantic role labeling or event extraction.

A.1 The Forward-Backward Algorithm

The smoothed marginals are defined as γ^(t)_i ≜ p(z^(t) = i | x^(1:T)), i ∈ {1, 2, . . . , |L|}, t ∈ {1, 2, . . . , T}, and the expected transition counts as ξ^(t)_{i,j} ≜ p(z^(t−1) = i, z^(t) = j | x^(1:T)), i, j ∈ {1, 2, . . . , |L|}. Here, |L| is the number of BIO-formatted entity labels, which are regarded as hidden states, and T is the total number of hidden steps in a sequence, which equals the number of tokens.
Defining the filtered marginals α^(t)_i ≜ p(z^(t) = i | x^(1:t)) and the conditional future evidence β^(t)_i ≜ p(x^(t+1:T) | z^(t) = i), γ and ξ can be represented by α and β using Bayes' rule and the Markov assumption:

γ^(t)_i ∝ α^(t)_i β^(t)_i,    (12)
ξ^(t)_{i,j} ∝ α^(t−1)_i ψ_{i,j} b^(t)_j β^(t)_j,    (13)

where b^(t)_i ≜ p(x^(t) | z^(t) = i) is the likelihood of the observation when the hidden state is i (§ 4.2).
Written in matrix form, (12) and (13) become

γ^(t) ∝ α^(t) ⊙ β^(t),    (14)
ξ^(t) ∝ Ψ ⊙ ( α^(t−1) (b^(t) ⊙ β^(t))^⊤ ),    (15)

where ⊙ is the element-wise product. Note that the elements in both γ^(t) and ξ^(t) should sum up to 1.
The Forward Pass The filtered marginal α^(t)_i can be computed iteratively:

α^(t)_j ∝ b^(t)_j Σ_i α^(t−1)_i ψ_{i,j}.    (16)

Written in matrix form, (16) becomes

α^(t) ∝ b^(t) ⊙ ( Ψ^⊤ α^(t−1) ).    (17)

We initialize α with α^(0) = π (§ 4.2), since we have no observation at time step 0. As α^(t) is a probability distribution, its elements sum up to 1. The calculation of α is the forward pass.
The Backward Pass In the same way, the backward pass computes the conditional future evidence β:

β^(t)_i = Σ_j ψ_{i,j} b^(t+1)_j β^(t+1)_j.    (18)

In matrix form, (18) becomes

β^(t) = Ψ ( b^(t+1) ⊙ β^(t+1) ),    (19)

whose base case is β^(T) = 1.
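Putting the two recursions together, here is a numpy sketch of the matrix-form forward-backward for a constant-transition HMM (the CHMM version would simply index Ψ and the emissions by t; the function name is ours):

```python
import numpy as np

def forward_backward(pi, psi, b):
    """pi: (L,) initial distribution; psi: (L, L) transition matrix;
    b: (T, L) observation likelihoods. Returns normalized alpha, beta,
    and the smoothed marginals gamma over state steps 0..T."""
    T, L = b.shape
    alpha = np.empty((T + 1, L))
    beta = np.empty((T + 1, L))
    alpha[0] = pi
    for t in range(T):
        a = b[t] * (psi.T @ alpha[t])         # forward recursion (17)
        alpha[t + 1] = a / a.sum()            # normalize each step
    beta[T] = 1.0                             # base case beta^(T) = 1
    for t in range(T - 1, -1, -1):
        beta[t] = psi @ (b[t] * beta[t + 1])  # backward recursion (19)
    gamma = alpha * beta                      # smoothed marginals (14)
    gamma /= gamma.sum(axis=1, keepdims=True)
    return alpha, beta, gamma
```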

A.2 The Maximization step for Unsupervised HMM
For traditional unsupervised HMM, the expected complete-data log likelihood is maximized by updating the matrices with the approximated pseudo-statistics. Different from CHMM, HMM has constant transition and emission matrices for all time steps, i.e., Ψ^(t) = Ψ and Φ^(t) = Φ, so for simplicity we drop the superscript t for these matrices. Suppose we are updating HMM based on one instance with t starting from 1:

π_i = γ^(0)_i,
ψ_{i,j} = Σ_{t=1}^{T} ξ^(t)_{i,j} / Σ_{t=1}^{T} Σ_{j′} ξ^(t)_{i,j′},
φ_{k,i,j} = Σ_{t=1}^{T} γ^(t)_i x^(t)_{j,k} / Σ_{t=1}^{T} γ^(t)_i.

Note that the observations have the property 0 ≤ x^(t)_{j,k} ≤ 1 and Σ_j x^(t)_{j,k} = 1.
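These closed-form updates can be sketched in numpy (a toy illustration; in practice the statistics are accumulated over all instances and smoothed so that no row of the normalizers is zero):

```python
import numpy as np

def m_step(gamma, xi, x):
    """Closed-form M-step for a vanilla HMM (not CHMM, whose matrices are
    MLP outputs). gamma: (T+1, L) smoothed marginals; xi: (T, L, L)
    expected transitions; x: (K, T, L) observed label distributions."""
    pi = gamma[0]
    psi = xi.sum(axis=0)
    psi /= psi.sum(axis=1, keepdims=True)     # normalize rows
    # phi[k, i, j]: expected co-occurrence of hidden label i with source
    # k's observed label j, normalized over j.
    phi = np.einsum("ti,ktj->kij", gamma[1:], x)
    phi /= phi.sum(axis=2, keepdims=True)
    return pi, psi, phi
```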

B Labeling Source Performance
The weak labeling sources for the CoNLL 2003 dataset come from Lison et al. (2020), whereas Safranchik et al. (2020) provide the sources for the LaptopReview, NCBI-Disease, and BC5CDR datasets. For Safranchik et al. (2020)'s labeling sources, we apply majority voting over their tagging results within the spans detected by their linking rules to convert the linking results into token annotations. In consideration of training time and resource consumption, we only adopt a subset of the labeling sources provided by the authors. The performance of the labeling sources is presented in the tables below. Please refer to Lison et al. (2020) for information about the construction of the labeling sources on the CoNLL 2003 dataset, and to Safranchik et al. (2020) for the labeling sources on the other three datasets.

C Hyper-Parameters
The experiments are conducted on one GeForce RTX 2080 Ti GPU. For the NCBI-Disease, BC5CDR, and LaptopReview datasets, CHMM is pre-trained for 5 epochs and trained for 20 epochs. The learning rates for these three datasets are 5 × 10^−4, 10^−3, and 10^−4, respectively, and the batch sizes are 64, 64, and 128. In phase I, BERT-NER is trained with the default learning rate (5 × 10^−5) for 100 epochs; the batch sizes are 8, 8, and 48, respectively. Note that for LaptopReview, the maximum sequence length of BERT-NER is set to 128, whereas the limit is 512 for the other two datasets. In phase II, we use half the learning rate with 20 epochs for each loop.
For CoNLL 2003, CHMM has the same number of training epochs as for the other datasets. The batch size is 32, and the learning rate is 10^−5. BERT-NER has a maximum sequence length of 256. It is trained for 15 epochs in phase I and 5 epochs in phase II. Other hyper-parameters are identical to those of the other BERT-NER models.