GLaRA: Graph-based Labeling Rule Augmentation for Weakly Supervised Named Entity Recognition

Instead of using expensive manual annotations, researchers have proposed to train named entity recognition (NER) systems using heuristic labeling rules. However, devising labeling rules is challenging because it often requires a considerable amount of manual effort and domain expertise. To alleviate this problem, we propose GLARA, a graph-based labeling rule augmentation framework, to learn new labeling rules from unlabeled data. We first create a graph with nodes representing candidate rules extracted from unlabeled data. Then, we design a new graph neural network to augment labeling rules by exploring the semantic relations between rules. We finally apply the augmented rules on unlabeled data to generate weak labels and train a NER model using the weakly labeled data. We evaluate our method on three NER datasets and find that we can achieve an average improvement of +20% F1 score over the best baseline when given a small set of seed rules.


Introduction
Named entity recognition (NER) models often need to be trained with many manual labels to perform well. Due to the high cost of manual annotation, collecting labeled data to train NER models is challenging for real-world applications. Recently, researchers have proposed to collect weak labels using heuristic rules, called labeling rules (Bach et al., 2017; Fries et al., 2017; Ratner et al., 2020; Safranchik et al., 2020). These methods typically first ask domain experts to write labeling rules for a NER task, then use these manual rules to generate labeled data and train a NER model with the weakly labeled data. The advantage of these methods is that they do not require manual annotations. However, during our study, we find that writing labeling rules is also challenging for domain-specific tasks. Devising accurate rules often demands a significant amount of manual effort because it requires developers to have deep domain expertise and a thorough understanding of the target data.

Figure 1: Two example labeling rules for Disease entities. Rule-1, "associated_with_* → Disease" (def rule1(x): return "Disease" if x.PreBigram == "associated with" else "Other"), matches sentences such as "The symptoms were associated with enzyme deficiency." Rule-2, "cause_of_* → Disease" (def rule2(x): return "Disease" if x.PreBigram == "cause of" else "Other"), matches sentences such as "Migraine is an uncommon cause of cranial neuropathy."
To alleviate the manual effort of writing labeling rules, we propose GLARA, a Graph-based Labeling Rule Augmentation framework, to automatically learn new rules from unlabeled data given a handful of seed rules. Our work is motivated by the intuition that if two rules can accurately label the same type of entities, then they are semantically related via the entities matched by them. Therefore, we can acquire new labeling rules based on their semantic relatedness to the rules that we already know. For example, Figure 1 shows two example rules for labeling Disease entities. If we know that rule1, "associated_with_* → Disease", is an accurate rule for labeling diseases, and rule1 is semantically related to rule2, "cause_of_* → Disease", then we can infer that rule2 is likely another accurate rule for labeling diseases.

Figure 2: An example workflow of the GLARA framework using suffix rules to recognize Diseases. Given unlabeled data and seeding rules, (1) we first extract candidate rules from the unlabeled data; (2) we then build a graph of rules and learn new rules by propagating the labeling information from seeding rules to other rules; (3) next, we apply the selected rules on unlabeled data and obtain a label matrix; (4) finally, we estimate the noisy labels using a generative model, and (5) train a final NER model with the noisy labels. "*noma → Disease" denotes that if a word's suffix is "noma", it will be labeled as Disease.
To augment labeling rules, we first define six types of rules and extract all possible rules from unlabeled data as candidate rules. Then, for each rule type, we build a graph by connecting rules of this type based on their semantic similarities. In our work, we compute a rule's embedding as the average of the contextual embeddings of the entities matched by the rule, and compute the similarities between rules using their embeddings. Starting from a small set of seed rules, we learn new rules using a graph neural network model. Next, we train a discriminative NER model using the weak labels generated by both the seeding and learned rules. To obtain weak training labels, we first build a label matrix by applying all the augmented rules to each token in the unlabeled data. Then, we estimate the labels of unlabeled instances using the LinkedHMM model (Safranchik et al., 2020). We evaluate our framework on three datasets. In our experiments, we first show that our method can achieve better results than baselines when abundant rules are available. We also demonstrate that we can achieve an average improvement of +20% F1 when only a small set of rules is available.
We summarize our major contributions as follows:
• We propose a new Graph-based Labeling Rule Augmentation (GLARA) framework that can effectively and automatically learn new labeling rules from unlabeled data. (Our code is available at https://github.com/zhaoxy92/GLaRA.)
• We propose a new graph neural network to estimate rules' labeling confidence with a new class distance-based loss function.
• We define six types of labeling rules, which have proven effective on three named entity recognition tasks.

The GLARA Framework
Our goal is to build a NER system with a small set of manually selected seeding rules and unlabeled data. Our key idea is to first augment labeling rules using graph neural networks, based on the hypothesis that semantically similar rules should have similar abilities to recognize entities. Then, we train a NER model using the weak training data labeled by the augmented rules.
Overview Figure 2 shows an example workflow of the GLARA framework using suffix rules to recognize Disease entities. Our framework consists of five major components.
(1) Rule extractor: We define six types of rules and extract all possible rules from unlabeled data as candidate rules.
(2) Rule augmentation: For each rule type, we first build a graph of rules by connecting rules based on their semantic similarities. Given a small set of manual seeding rules, we learn new rules by propagating the labeling confidence from seeding rules to other rules.
(3) Rule Applier: We obtain a label matrix by applying all the augmented rules on each token in the unlabeled data.
(4) Generative Model: We estimate the labels of unlabeled instances using a generative model.
(5) Discriminative Model: We train a final discriminative NER model using the weak labels produced by the generative model.
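To make the workflow concrete, the following sketch strings the five components together; every function here is a hypothetical placeholder for the corresponding component described above, not a released API:

```python
# A high-level sketch of the full GLARA pipeline; all function names are
# hypothetical placeholders for the five components described above.
def run_glara(unlabeled_sentences, seed_rules):
    # (1) Rule extractor: mine all candidate rules of the six types.
    candidates = extract_candidate_rules_from(unlabeled_sentences)
    # (2) Rule augmentation: propagate seed confidence over the rule graph.
    new_rules = propagate_over_rule_graph(candidates, seed_rules)
    # (3) Rule applier: have every rule vote on every token.
    label_matrix = apply_rules(unlabeled_sentences, seed_rules + new_rules)
    # (4) Generative model: aggregate conflicting votes into weak labels.
    weak_labels = linked_hmm_estimate(label_matrix)
    # (5) Discriminative model: train the final NER tagger on weak labels.
    return train_bilstm_crf(unlabeled_sentences, weak_labels)
```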

Candidate Rule Extraction
As demonstrated in previous work on NER (Zhou and Su, 2002), lexical and contextual clues are strong indicators for entity recognition. Therefore, we define and extract six types of rules, SurfaceForm, Prefix, Suffix, PreNgram, PostNgram, and DependencyRule, to recognize entities by considering their lexical, contextual, and syntactic information.
Given an unlabeled sentence, we first extract all noun phrases (NPs) as candidate entity mentions using a set of Part-of-Speech (POS) patterns. The POS patterns include "JJ? NN+" (JJ denotes an adjective, and NN denotes a noun) and the top 15 most frequent POS patterns of the entity mentions in the development sets. Then, for each candidate entity mention, we extract all six types of rules from the unlabeled data as candidate rules. Specifically, we extract the surface form of each candidate entity mention as a SurfaceForm rule. If the mention is a single token, we extract its first and last m characters as Prefix and Suffix rules, respectively. For PreNgram rules, we extract the leading n words of a candidate entity as an inclusive PreNgram rule; meanwhile, we also extract the n words to the left of the candidate as an exclusive PreNgram rule. PostNgram rules are created similarly from the right context. Also, for each multi-token entity candidate, we first extract the dependency relations of the first token and the second-to-last token, respectively, and then combine each dependency with the last token to form the Dependency rules of this mention. We treat m and n as hyperparameters in our work. Figure 3 shows some example rules extracted for the candidate entity mention "Alzheimer 's disease" and how they are used for labeling entities.
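For illustration, here is a minimal sketch of this extraction for a single candidate mention; the helper name and output format are our assumptions, and Dependency rules are omitted since they require a dependency parse:

```python
# A minimal sketch of candidate rule extraction (hypothetical helper).
# Given a candidate mention tokens[start:end] and its sentence context,
# we emit the rule types described above; `m` and `n` are the
# hyperparameters from the text. Dependency rules are omitted here.

def extract_candidate_rules(tokens, start, end, m=4, n=2):
    mention = tokens[start:end]
    rules = {"SurfaceForm": [" ".join(mention)]}

    if len(mention) == 1:              # Prefix/Suffix only for single tokens
        word = mention[0]
        if len(word) > m:
            rules["Prefix"] = [word[:m] + "*"]
            rules["Suffix"] = ["*" + word[-m:]]

    # Inclusive PreNgram: the leading n words of the mention itself.
    rules["PreNgram-inclusive"] = ["_".join(mention[:n]) + "_*"]
    # Exclusive PreNgram: the n words immediately to the left of the mention.
    left = tokens[max(0, start - n):start]
    if left:
        rules["PreNgram-exclusive"] = ["_".join(left) + "_*"]
    # PostNgram rules are built symmetrically from the right context.
    right = tokens[end:end + n]
    if right:
        rules["PostNgram-exclusive"] = ["*_" + "_".join(right)]
    return rules

# Example: for "The cause of hearing loss is unknown" with the mention
# "hearing loss", the exclusive PreNgram rule is "cause_of_*" (cf. Figure 1).
```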
Labeling Rule Augmentation

Our rule augmentation process starts with manually selected positive and negative seeding rules. Positive seeding rules are those that can be used to predict a target entity type; negative ones are used to predict instances of the "Other" class. Next, we estimate the representations of rules by optimizing a graph neural network model. Finally, we compute the distances of each rule to the centroids of the positive and negative seeding rules, respectively, and select rules close to the positive centroid as new labeling rules.
Graph of Rules For each type of rules, we create a graph $G = (V_u, V_s^{pos}, V_s^{neg}, A)$ with rules of this type as nodes, where $V_u$ denotes the candidate rules extracted in the previous step, $V_s^{pos}$ and $V_s^{neg}$ denote the positive and negative seeding rules, respectively, and $A$ is the adjacency matrix over the nodes. In our graph, each node (i.e., rule) is connected to its top 10 most semantically similar nodes. The similarity between two rules is computed as the cosine similarity of their embeddings.
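A minimal sketch of this graph construction is shown below, assuming the rule embeddings (described next) are given as a NumPy matrix; symmetrizing the top-k neighborhood into an undirected adjacency matrix is our assumption:

```python
# A minimal sketch of building the graph of rules from precomputed embeddings.
import numpy as np

def build_rule_graph(embeddings, k=10):
    """Connect each rule to its top-k most similar rules by cosine similarity."""
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = X @ X.T                       # pairwise cosine similarities
    np.fill_diagonal(sim, -np.inf)      # exclude self from the ranking
    A = np.zeros_like(sim)
    for i in range(sim.shape[0]):
        neighbors = np.argsort(sim[i])[-k:]   # indices of the top-k neighbors
        A[i, neighbors] = 1.0
    return np.maximum(A, A.T)           # symmetrize (our assumption)
```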

Rule Embeddings We estimate the semantic relatedness between rules with their embeddings, which are computed using pre-trained contextual embedding models. Specifically, we first apply the pretrained ELMo (Peters et al., 2018) model to all unlabeled sentences to obtain contextual embeddings of candidate entity mentions. Then, we compute the embedding of a rule as the average of the embeddings of all the candidate entities matched by this rule. For example, the embedding of the suffix rule "*noma" is calculated as the average of the embeddings of all candidate mentions that end with "noma".
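As a minimal sketch, the averaging step could be implemented as follows, assuming `embed_mention` returns a mention's contextual vector (e.g., from the pretrained ELMo model) and `rule.matches` is a hypothetical predicate testing whether the rule fires on a mention:

```python
import numpy as np

def rule_embedding(rule, candidate_mentions, embed_mention):
    """Average the contextual embeddings of all mentions matched by the rule.

    `candidate_mentions` is a list of (sentence, span) pairs;
    `embed_mention(sentence, span)` returns the mention's contextual vector;
    `rule.matches` is a hypothetical matching predicate.
    """
    vecs = [embed_mention(sent, span)
            for sent, span in candidate_mentions if rule.matches(sent, span)]
    return np.mean(vecs, axis=0) if vecs else None
```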
Graph Propagation Model Given a graph of rules and seeding rules, we formulate the problem of learning new labeling rules (i.e., positive rules) as a graph-based semi-supervised node classification task that aims to classify candidate rules (which could be treated as unlabeled nodes in the graph) as positive or negative.
Based on the intuition that semantically similar rules should predict entity labels similarly, we propose a graph neural network model that propagates labeling information from seeding nodes to other nodes, building on the recent Graph Attention Network (Veličković et al., 2017). Specifically, given the input embedding $h_i$ of node $i$ and its neighbors $\mathcal{N}_i$, we first compute an attention weight for each connected pair $(i, j)$ as

$\alpha_{ij} = \mathrm{softmax}_j\big(f(a^\top [W h_i \,\|\, W h_j])\big)$, (1)

where $W$ is a parameter matrix, $a$ is a shared attention vector, $\|$ denotes concatenation, and $f$ is the LeakyReLU activation function. We then recompute the embedding of node $i$ as

$h_i' = \sigma\big(\textstyle\sum_{j \in \mathcal{N}_i} \alpha_{ij} W h_j\big)$. (2)

To keep the model stable, we apply a multi-head attention mechanism to obtain $K$ attentional states for each node, whose average is used as the final node representation, i.e., $h_i^{*} = \frac{1}{K} \sum_{k=1}^{K} h_i^{(k)}$. The objective of our model is defined as

$\mathcal{L} = \mathcal{L}_{sup} + \mathcal{L}_{reg} + \mathcal{L}_{dist}$, with $\mathcal{L}_{dist} = \mathrm{dist}(h^{pos}, h^{neg})$, (3)

where $\mathcal{L}_{sup}$ is the supervision loss computed on both the positive and negative seeding rule nodes using $p_i$, the probability of node $i$ being classified as positive; $\mathcal{L}_{reg}$ is a regularization term that encourages connected nodes to share similar representations; and $\mathcal{L}_{dist}$ aims to maximize the distance between the positive and negative seeding nodes. Here $\mathrm{dist}(\cdot)$ computes the cosine similarity between the centroids of the positive and negative seeds, and $h^{pos}$ and $h^{neg}$ are the average embeddings of the positive and negative nodes, respectively.
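The following PyTorch sketch shows one plausible implementation of this propagation model and objective. The layer sizes, the sigmoid classifier head, and the loss weights `lam_reg` and `lam_dist` are our assumptions rather than the paper's released configuration; in practice, self-loops should be added to `adj` so that each node also attends to itself.

```python
# A simplified, self-contained sketch of the propagation model: a single-layer
# multi-head GAT update plus the objective of Eq. 3.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RulePropagationGAT(nn.Module):
    def __init__(self, in_dim, hid_dim, heads=4):
        super().__init__()
        self.W = nn.Linear(in_dim, hid_dim * heads, bias=False)
        self.a = nn.Parameter(torch.empty(heads, 2 * hid_dim))
        nn.init.xavier_uniform_(self.a)
        self.heads, self.hid = heads, hid_dim
        self.clf = nn.Linear(hid_dim, 1)      # positive-rule probability p_i

    def forward(self, h, adj):
        N = h.size(0)
        Wh = self.W(h).view(N, self.heads, self.hid)          # (N, K, d)
        # Attention logits e_ij = LeakyReLU(a^T [Wh_i || Wh_j]), per head.
        src = (Wh * self.a[:, :self.hid]).sum(-1)             # (N, K)
        dst = (Wh * self.a[:, self.hid:]).sum(-1)             # (N, K)
        e = F.leaky_relu(src.unsqueeze(1) + dst.unsqueeze(0)) # (N, N, K)
        e = e.masked_fill(adj.unsqueeze(-1) == 0, float("-inf"))
        alpha = torch.softmax(e, dim=1)                       # softmax over neighbors j
        h_new = torch.einsum("ijk,jkd->ikd", alpha, Wh)       # attention-weighted sum
        h_star = h_new.mean(dim=1)                            # average the K heads
        p = torch.sigmoid(self.clf(h_star)).squeeze(-1)
        return h_star, p

def glara_loss(h_star, p, adj, pos_idx, neg_idx, lam_reg=0.1, lam_dist=1.0):
    # L_sup: binary cross-entropy on the positive/negative seeding nodes.
    seeds = torch.cat([pos_idx, neg_idx])
    labels = torch.cat([torch.ones(len(pos_idx)), torch.zeros(len(neg_idx))])
    l_sup = F.binary_cross_entropy(p[seeds], labels)
    # L_reg: connected nodes should have similar representations.
    diff = h_star.unsqueeze(1) - h_star.unsqueeze(0)
    l_reg = (adj * diff.pow(2).sum(-1)).sum() / adj.sum()
    # L_dist: cosine similarity between the positive and negative centroids;
    # minimizing it pushes the two groups apart.
    h_pos, h_neg = h_star[pos_idx].mean(0), h_star[neg_idx].mean(0)
    l_dist = F.cosine_similarity(h_pos, h_neg, dim=0)
    return l_sup + lam_reg * l_reg + lam_dist * l_dist
```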
When the learning process is finished, each rule is associated with a new embedding representation $h_i^{*}$. For each rule, we first compute its cosine distances to the centroids of the positive and negative seeding nodes, respectively, using these embeddings. Then, we rank all rules by the difference of the two distances and select the top M rules closest to the centroid of the positive seeding rules as new labeling rules.
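Continuing the sketch above, this selection step could look as follows (`M` and any post-hoc filtering of the seed nodes themselves are assumptions):

```python
import torch.nn.functional as F

def select_new_rules(h_star, pos_idx, neg_idx, M=100):
    h_pos = h_star[pos_idx].mean(0)
    h_neg = h_star[neg_idx].mean(0)
    # Higher score = closer to the positive centroid than to the negative one.
    score = (F.cosine_similarity(h_star, h_pos.unsqueeze(0))
             - F.cosine_similarity(h_star, h_neg.unsqueeze(0)))
    return score.topk(M).indices   # indices of the top-M candidate rules
```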

Generative Model for Label Estimation
After the rule learning process, we apply both the newly learned rules and the seeding rules on unlabeled data to produce a matrix of labels. Since one token can potentially be matched by several different rules, the resulting labels can have conflicts. Therefore, we use the LinkedHMM (Safranchik et al., 2020) model to combine these labels into one label for each token. Briefly, the main idea of LinkedHMM is to treat the true label of a token as a latent random variable and estimate its value by relating it to the label outputs of the different labeling rules. The estimated labels can then be used to train the final discriminative NER model.
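As an illustration, the rule-application step that produces this label matrix might look like the following; `rule.vote` is a hypothetical per-token voting interface, and the LinkedHMM itself comes from Safranchik et al.'s released implementation rather than from this sketch:

```python
# A minimal sketch of the rule-application step: build a token-level label
# matrix where entry (i, j) is rule j's vote on token i (e.g., 1 = Disease,
# 0 = Other, -1 = abstain).
import numpy as np

def build_label_matrix(sentences, rules):
    rows = []
    for tokens in sentences:
        for i in range(len(tokens)):
            # Each rule votes on each token; rules abstain (-1) when they do
            # not match, which the generative model treats as "no signal".
            rows.append([rule.vote(tokens, i) for rule in rules])
    return np.array(rows)
```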

Discriminative NER Model
After the training of the generative model (i.e., the LinkedHMM in our work) is completed, each token in the unlabeled data is associated with a weak label. Each weak label is a probability distribution over all entity classes, which can be used to train a discriminative NER model. One advantage of training a discriminative NER model is that it can use other token features while the generative model can only use labeling rules' outputs as inputs. Therefore, even if a token is not matched by any labeling rules, it can still be predicted correctly by the discriminative model.
In our work, we use BiLSTM-CRF (Huang et al., 2015) as our discriminative model. The model first uses BiLSTM to generate a state representation for each token in a sequence. The CRF layer then predicts each token by maximizing the expected likelihood of the entire sequence based on the estimated labels.
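A compact sketch of such a tagger using the pytorch-crf package is shown below. For brevity it trains on the argmax of the weak label distributions and a plain embedding layer; the paper instead maximizes the expected likelihood under the soft labels and fine-tunes SciBERT/BERT representations, so this is a simplified approximation.

```python
# A simplified BiLSTM-CRF sketch; `pip install pytorch-crf` provides CRF.
import torch
import torch.nn as nn
from torchcrf import CRF

class BiLSTMCRF(nn.Module):
    def __init__(self, vocab_size, num_tags, emb_dim=100, hid_dim=200):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim // 2, bidirectional=True,
                            batch_first=True)
        self.proj = nn.Linear(hid_dim, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def loss(self, token_ids, weak_label_dists, mask):
        emissions = self.proj(self.lstm(self.emb(token_ids))[0])
        hard = weak_label_dists.argmax(-1)            # simplification: hard labels
        return -self.crf(emissions, hard, mask=mask)  # negative log-likelihood

    def decode(self, token_ids, mask):
        emissions = self.proj(self.lstm(self.emb(token_ids))[0])
        return self.crf.decode(emissions, mask=mask)  # best tag sequences
```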

Experimental Setup
In this section, we present the details of our experimental setup. First, we describe the three NER datasets on which we evaluate our method, which we refer to as NCBI, BC5CDR, and LaptopReview. Then, we describe the baseline methods compared in our experiments. We also give a detailed description of the seed rules used in our experiments.

Datasets
We evaluated our method on three datasets. Details of each dataset are described below.
NCBI (Doğan et al., 2014) contains disease mentions annotated in PubMed abstracts. BC5CDR (Li et al., 2016) contains disease and chemical mentions annotated in PubMed articles. LaptopReview (Pontiki et al., 2014) contains laptop aspect terms annotated in customer reviews. Note that we use all training data as our unlabeled data by removing the manual annotations.

Compared Methods
We compare our method with state-of-the-art weakly supervised methods using dictionaries or heuristic rules as supervision. In this section, we briefly describe these baseline methods.
AutoNER (Shang et al., 2018) is a distantly supervised method, which automatically builds a neural named entity recognition model using dictionaries as weak supervision.
Snorkel (Ratner et al., 2020) is a general machine learning framework that can train classifiers using heuristic rules. By default, it uses a Naive Bayes generative model to denoise labeling rules by predicting each token's label independently.
SwellShark (Fries et al., 2017) is an extension of Snorkel that was developed for biomedical NER. Like Snorkel, it uses a naive Bayes generative model to denoise manual labeling rules. It also requires a special entity candidate generator to detect entity spans accurately before predicting their entity labels. In our experiments, we report both the results obtained using simple noun phrases as candidates and those obtained with a candidate generator built with extra expert effort.

LinkedHMM (Safranchik et al., 2020) is a framework for training sequence tagging models using weak supervision from manual rules. Besides labeling rules, it can also use linking rules, which indicate whether two consecutive tokens have the same label.

Seed Rules
In our experiments, we used all the labeling rules developed by Safranchik et al. (2020) as part of our positive seeding rules. In addition, we manually selected another small set of labeling rules as input because: (1) we defined six types of rules as described in Section 2.1, but some types were not used in Safranchik et al. (2020) (e.g., prefix rules are not used for NCBI); and (2) our method requires negative seed rules, which are used to identify terms that are not entities, to initiate its learning process. To automatically learn new labeling rules, we use both the labeling rules from Safranchik et al. (2020) and our manually selected rules as seeding rules. The numbers of positive and negative seeding rules used in our experiments are included in Appendix C.

Implementation Details
In our experiments, we create a separate graph for each rule type in each dataset and learn new rules independently with the same setup. Prefix and Suffix candidate rules are generated by considering the first and last 3 to 6 characters, and PreNgram and PostNgram candidate rules are extracted with windows of 1 to 3 tokens.
We use a two-layer graph attention network to train our graph model. After training, we select M new rules for each rule type, where the value of M is searched between 20 and 500 on the dev sets.
For different datasets, our discriminative NER models used different pre-trained contextual models. Since the NCBI and BC5CDR datasets are in the biomedical domain, we fine-tuned our NER model on pretrained SciBERT embeddings (Beltagy et al., 2019), while for the LaptopReview data, the NER model was fine-tuned on pretrained BERT-base embeddings (Devlin et al., 2018). More details are provided in Appendix A.

Experimental Results
In this section, we first compare our method with state-of-the-art methods using all available manual rules. Then, we evaluate our method in scenarios where only limited seeding rules are available, which are common in real-world applications. Next, we conduct an ablation study to investigate the effectiveness of the different rule types and of our newly proposed loss function based on centroid distance. Finally, we perform a quality analysis of the automatically learned labeling rules.

Results with Abundant Seeding Rules
In this subsection, we report both the results of our generative model with augmented rules in Table 2 and the results of our discriminative NER model trained using weak labels from the generative model in Table 3. Table 2 shows the performance of our generative model with augmented rules and of the baseline generative models. The Supervised line is the performance of the LinkedHMM model trained on fully labeled training data. The results show that our method with augmented labeling rules performed best on the BC5CDR and LaptopReview datasets. We also achieved an F1 score of 80.2 on NCBI, comparable to SwellShark (F1 80.8). Note that SwellShark required substantial manual effort from experts to carefully tune a particular candidate generator for a given dataset. We also notice that, with augmented rules, our method outperforms LinkedHMM by an average of 2.0 F1 points, which demonstrates the effectiveness of the augmented rules. Table 3 shows the performance of our discriminative NER model (BiLSTM-CRF) trained on weak labels from GLARA with augmented rules, and of the baseline discriminative models trained using weak labels from LinkedHMM and Snorkel without augmented rules (Safranchik et al., 2020). The results show that our method achieved a +1.9 F1 point improvement over the second-best system. We also notice that our discriminative model performs slightly better than the generative model on NCBI and LaptopReview, but worse on BC5CDR, which is consistent with the findings of Safranchik et al. (2020).

Results with Limited Seeding Rules
Though the weakly supervised state-of-the-art methods reported in Table 2 achieve relatively good performance, they require a significant amount of manual effort from domain experts for designing and tuning labeling rules. Well-performing methods that require less manual effort are often more desirable. Therefore, we also evaluate our method with little manual effort (i.e., limited seeding rules). We conducted experiments by randomly selecting at most k rules for each rule type from our seed rules (Section 3.3), where k ∈ {10, 20, 50, 100, 200}. When a rule type has fewer than k seeding rules, we use all of them. Figure 4 shows the performance of the LinkedHMM model using only seeding rules and of our GLARA method with augmented rules on the three datasets. Figure 4(d) shows that, with automatically learned rules, our method obtains an average F1 gain of +22 points when using 10 seeding rules and +13 points when using 200 seeding rules.

Impact of Different Types of Rules
To investigate the effectiveness of different types of labeling rules, we conducted ablation experiments to evaluate our generative model with augmented rules (i.e., GLARA) by adding each type of rules cumulatively. Results are shown in

Effectiveness of Distance Loss L dist
In this work, we proposed a new graph neural network model with a new loss function, $\mathcal{L}_{dist}$ in Eq. 3. This loss function measures the distance between the centroids of the positive and negative seeding rules by computing their cosine similarity. The motivation is that the model should keep positive rules distant from negative ones during the learning process. Table 5 shows the performance of our generative model with and without the distance loss. We find that our method obtains an average F1 gain of 1.2 points across the three datasets with this new loss function.

Quality Analysis of Learned New Rules
In Table 6, we present the top 5 rules for each rule type that we automatically learned on the NCBI dataset. Each rule is formatted in the way described in Section 2.1. Table 7 shows the number of new rules (#Rule) and their accuracy (ACC) on the dev sets, as learned during the experiments presented in Section 4.1. The results show that all the learned rules have an accuracy of at least 64%, with an average of 76%, which confirms the quality of these new rules. In addition, we manually analyzed why some learned rules do not help improve recognition performance. First, some of the learned rules overlap. For example, both "*demia → Disease" and "*edemia → Disease" can be used to recognize Diseases. However, if "*demia" is learned first, then learning the "*edemia" rule does not help, because the sets of entities matched by these two rules largely overlap. Second, some of the rules learned from unlabeled data may not apply to the test data due to the mismatch between the two datasets. For example, although the PostNgram rule "*dystonia → Disease" is an accurate rule for labeling diseases such as "oromandibular dystonia" and "responsive dystonia", we do not find any matches in the test data.

Related Work
Training reliable NER systems usually requires large annotation efforts, which are often considered expensive and impractical in certain domains. Therefore, previous studies have sought to reduce the manual effort required for annotation while producing comparable performance on NER tasks by utilizing manually designed rules that are cheap and accurate. Studies have shown that human-defined rules (i.e., dictionaries) can greatly aid NER tasks, especially in domains where the identification of entities requires domain knowledge (Cohen and Sarawagi, 2004; Savova et al., 2010; Xu et al., 2010; Eftimov et al., 2017), and that they can also be used to distantly create labeled data sets that benefit machine learning models (Mann and McCallum, 2010; Neelakantan and Collins, 2015; Giannakopoulos et al., 2017).
While distantly labeled data sets can be created at a low cost to boost NER tasks, the resulting models still suffer from the noise introduced by imperfect rules. Therefore, denoising models have been proposed to allow models to better tolerate the imperfect annotations created by rules. Shang et al. (2018) proposed AutoNER, which trains NER systems using only lexicons with a tie-or-break tagging denoising model. Similarly, some recent work (Liu et al., 2019; Yang et al., 2018; Cao et al., 2019) has used a partial matching module to denoise noisily labeled data sets.
Recently, weak supervision has been proposed as another form of denoising framework that does not use any labeled data. Weakly supervised systems use handcrafted labeling rules to create weak training instances from unlabeled data and then use a denoising generative model to approximate the posteriors of the rules. In this process, the unknown gold label is treated as a latent variable by the generative model. The performance of the labeling rules can be further enhanced by training a neural network-based discriminative model that treats the posteriors as soft labels. Related studies such as Snorkel (Ratner et al., 2020), SwellShark (Fries et al., 2017), LinkedHMM (Safranchik et al., 2020), and Lison et al. (2020) have demonstrated great success with carefully curated labeling rules. However, manually designing such high-quality rules often requires domain expertise, and the resulting rules tend to have low sensitivity in identifying entities. In Lison et al. (2020), the weak training data is created by broadly collecting available labeling rules from multiple sources, which demonstrates the importance of being able to automatically find new heuristics missed by human effort. To find new heuristic rules on the basis of a relatively limited number of manually designed rules, previous studies have tried bootstrapping, relying on co-occurrence, context, and pattern features (Thelen and Riloff, 2002; Riloff et al., 2003; Yangarber, 2003; Shen et al., 2017; Tao et al., 2015; Berger et al., 2018; Yan et al., 2019).
Recent studies on graph neural networks have opened up another possibility for learning new rules. By internally infusing the semantics of neighboring nodes, the popular Graph Convolutional Network (GCN) (Kipf and Welling, 2016) and Graph Attention Network (GAT) (Veličković et al., 2017) have shown great success in semi-supervised node classification when the number of labeled nodes is limited. Graph neural networks have been applied to many NLP tasks, such as text classification (Yao et al., 2019; Zhang et al., 2019a; Hu et al., 2019), semantic role labeling (Marcheggiani and Titov, 2017), machine translation (Beck et al., 2018), question answering (Song et al., 2018; Saxena et al., 2020), and information extraction (Liu et al., 2018; Vashishth et al., 2018; Nguyen and Grishman, 2018; Sahu et al., 2019; Fu et al., 2019; Zhang et al., 2019b). In our work, we propose to use graph neural networks to learn new labeling rules. Based on the Graph Attention Network (Veličković et al., 2017), we designed a new graph network model with a new loss function. Experimental results demonstrate that our model performs better than the original graph attention network at learning accurate labeling rules.

Conclusion
In this work, we proposed a weakly supervised NER framework that automatically learns high-quality new rules from only a handful of manually designed rules using a graph-based labeling rule augmentation method (GLARA). Experiments on three NER datasets demonstrate that our model outperforms baseline systems and achieves substantially better performance when the number of manual rules is limited. In addition, we defined six types of rules that have been demonstrated to be useful for recognizing entities. In the future, we plan to improve GLARA by investigating more complex rule types and rule representation methods for weakly supervised NER.

References
Stephen H. Bach, Bryan He, Alexander Ratner, and Christopher Ré. 2017. Learning the structure of generative models without labeled data.

A Hyperparameters

Table 8: Summary of hyperparameters for propagation on the BC5CDR data. "D" denotes the number of selected rules for Disease, and "C" denotes that for Chemical, for the corresponding rule type. "-" means the corresponding type of propagated rules is not used in our final model.

Table 9 presents the hyperparameters used for tuning our LinkedHMM generative model, including the initial accuracy (the estimated initial accuracy of the rules), the accuracy prior (regularization for the initial accuracy), and the balance prior (the entity class distribution). We used grid search to find the best hyperparameters.

B Performance on Development Data
In

C Manually Selected Seeds for Propagation
As mentioned in the paper, the baseline LinkedHMM does not have seeding rules for all rule types. For example, negative seeding rules are missing from the baseline LinkedHMM model, except that the SurfaceForm rules use a list of stopwords as negative seeding rules. For some types of seeding rules, only a few are available, which is not enough for training the propagation model. Therefore, for the rule types that do not have enough seeds, we manually select a small set of additional rules as seeds. Note that we keep the total number of seeding rules (including the ones from the baseline system) below 15. We report the manually selected seeding rules for NCBI, BC5CDR-Disease, BC5CDR-Chemical, and LaptopReview in Table 12, Table 13, Table 14, and Table 15, respectively.

Table 12: Manually selected seeding rules for the NCBI dataset. "-" means no seeding rules were selected, and we only use the rules provided in the baseline.