Named Entity Recognition with Small Strongly Labeled and Large Weakly Labeled Data

Weak supervision has shown promising results in many natural language processing tasks, such as Named Entity Recognition (NER). Existing work mainly focuses on learning deep NER models with weak supervision only, i.e., without any human annotation, and shows that merely using weakly labeled data can achieve good performance, though it still underperforms fully supervised NER with manually/strongly labeled data. In this paper, we consider a more practical scenario, where we have both a small amount of strongly labeled data and a large amount of weakly labeled data. Unfortunately, we observe that weakly labeled data does not necessarily improve, and can even deteriorate, the model performance (due to the extensive noise in the weak labels) when we train deep NER models over a simple or weighted combination of the strongly labeled and weakly labeled data. To address this issue, we propose a new multi-stage computational framework, NEEDLE, with three essential ingredients: (1) weak label completion, (2) a noise-aware loss function, and (3) final fine-tuning over the strongly labeled data. Through experiments on E-commerce query NER and Biomedical NER, we demonstrate that NEEDLE can effectively suppress the noise of the weak labels and outperforms existing methods. In particular, we achieve new SOTA F1-scores on 3 Biomedical NER datasets: BC5CDR-chem 93.74, BC5CDR-disease 90.69, NCBI-disease 92.28.


Introduction
Named Entity Recognition (NER) is the task of detecting mentions of real-world entities from text and classifying them into predefined types. For example, the task of E-commerce query NER is to identify the product types, brands, and product attributes of a given query. Traditional deep learning approaches mainly train the model from scratch (Ma and Hovy, 2016; Huang et al., 2015), and rely on large amounts of labeled training data. As NER tasks require token-level labels, annotating a large number of documents can be expensive, time-consuming, and prone to human errors. Therefore, labeled NER data is often limited in many domains (Leaman and Gonzalez, 2008). This has become one of the biggest bottlenecks that prevent deep learning models from being adopted in domain-specific NER tasks.

* Work was done during internship at Amazon.
To achieve better performance with limited labeled data, researchers resort to large unlabeled data. For example, Devlin et al. (2019) propose to pre-train the model using masked language modeling on large unlabeled open-domain data, which is usually hundreds/thousands of times larger than the manually/strongly labeled data. However, open-domain pre-trained models can only provide limited semantic and syntactic information for domain-specific tasks. To further capture domain-specific information, Lee et al. (2020) and Gururangan et al. (2020) propose to continually pre-train the model on large in-domain unlabeled data.
When there is no labeled data, one approach is to use weak supervision to generate labels automatically from domain knowledge bases (Shang et al., 2018; Liang et al., 2020). For example, Shang et al. (2018) match spans of unlabeled Biomedical documents against a Biomedical dictionary to generate weakly labeled data. Shang et al. (2018) further show that by merely using weakly labeled data, one can achieve good performance in Biomedical NER tasks, though it still underperforms supervised NER models trained with manually labeled data. Throughout the rest of the paper, we refer to the manually labeled data as strongly labeled data for notational convenience.
In practice, however, we often have access to both a small amount of strongly labeled data and a large amount of weakly labeled data, generated from large-scale unlabeled data and domain knowledge bases. A natural question arises: "Can we simultaneously leverage small strongly and large weakly labeled data to improve the model performance?" The answer is yes, but only if we can properly suppress the extensive labeling noise in the weak labels. The weak labels have three features: 1) "incompleteness": some entity mentions may not be assigned weak labels due to the limited coverage of the knowledge base; 2) "labeling bias": some entity mentions may not be labeled with the correct types, and thus weak labels are often noisy; 3) "ultra-large scale": the weakly labeled data can be hundreds/thousands of times larger than the strongly labeled data. An ultra-large volume of weakly labeled data contains useful domain knowledge, but it also comes with enormous noise due to the "incompleteness" and "labeling bias" of the weak labels. The enormous noise can dominate the signal in the strongly and weakly labeled data, especially when combined with unsupervised pre-training techniques. Such noise can easily be overfitted by huge neural language models, and may even deteriorate the model performance. This is further corroborated by our empirical observation (see Section 4) that when we train deep NER models over a simple or weighted combination of the strongly labeled and weakly labeled data, the model performance almost always becomes worse.
To address this issue, we propose a three-stage computational framework named NEEDLE (Noise-aware wEakly supErviseD continuaL prE-training). At Stage I, we adapt an open-domain pre-trained language model to the target domain by continual pre-training on the large in-domain unlabeled data. At Stage II, we use knowledge bases to convert the in-domain unlabeled data into weakly labeled data. We then conduct another round of continual pre-training over both the weakly and strongly labeled data, in conjunction with our proposed weak label completion procedure and noise-aware loss function, which can effectively handle the "incompleteness" and "labeling bias" of the weak labels. At Stage III, we fine-tune the model on the strongly labeled data again. This last fine-tuning stage is essential for fitting the model to the strongly labeled data.
We summarize our key contributions as follows: • We identify an important research question on weak supervision: when training deep NER models using a simple or weighted combination of the strongly labeled and weakly labeled data, the ultra-large scale of the weakly labeled data aggravates the extensive noise in the weak labels and can significantly deteriorate the model performance.
• We propose a three-stage computational framework named NEEDLE to better harness the power of ultra-large weakly labeled data. Our experimental results show that NEEDLE significantly improves the model performance on E-commerce query NER and Biomedical NER tasks. In particular, we achieve new SOTA F1-scores on 3 Biomedical NER datasets: BC5CDR-chem 93.74, BC5CDR-disease 90.69, NCBI-disease 92.28. We also extend the proposed framework to the multilingual setting.

Preliminaries
We briefly introduce the NER problem and the unsupervised language model pre-training.

Named Entity Recognition
NER is the process of locating and classifying named entities in text into predefined entity categories, such as products, brands, diseases, and chemicals. Formally, given a sentence with N tokens X = [x_1, ..., x_N], an entity is a span of tokens s = [x_i, ..., x_j] (1 ≤ i ≤ j ≤ N) associated with an entity type. Based on the BIO schema (Li et al., 2012), NER is typically formulated as a sequence labeling task of assigning a sequence of labels Y = [y_1, ..., y_N] to the sentence X. Specifically, the first token of an entity mention of type X is labeled as B-X; the other tokens inside that entity mention are labeled as I-X; and the non-entity tokens are labeled as O.

Supervised NER. We are given M sentences that are annotated at the token level, denoted as {(X_m, Y_m)}_{m=1}^M. Let f(X; θ) denote an NER model, which can compute the probability of the entity labels of any new sentence X, where θ denotes the model parameters. We train such a model by minimizing the following loss over {(X_m, Y_m)}_{m=1}^M:

    min_θ (1/M) Σ_{m=1}^M ℓ(f(X_m; θ), Y_m),    (1)

where ℓ(·, ·) is the cross-entropy loss for a token-wise classification model, or the negative log-likelihood for a CRF model (Lafferty et al., 2001).

Weakly Supervised NER. Previous studies (Shang et al., 2018; Liang et al., 2020) of weakly supervised NER consider the setting where no strong labels are available for training, but only weak labels generated by matching unlabeled sentences against external gazetteers or knowledge bases. The matching can be achieved by string matching (Giannakopoulos et al., 2017), regular expressions (Fries et al., 2017), or heuristic rules (e.g., POS tag constraints). Accordingly, they learn an NER model by minimizing Eq. (1) with {Y_m}_{m=1}^M replaced by their weakly labeled counterparts.
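As a concrete illustration of the BIO schema described above, the following sketch converts entity spans (inclusive token indices) into a BIO label sequence; the example query and entity types are hypothetical:

```python
def spans_to_bio(tokens, entities):
    """Convert entity spans (start, end, type) with inclusive token
    indices to a BIO label sequence."""
    labels = ["O"] * len(tokens)
    for start, end, etype in entities:
        labels[start] = f"B-{etype}"          # first token of the mention
        for i in range(start + 1, end + 1):   # remaining tokens inside it
            labels[i] = f"I-{etype}"
    return labels

# e.g. "yarn yoshi amiibo": "yarn yoshi" as Misc, "amiibo" as ProductLine
print(spans_to_bio(["yarn", "yoshi", "amiibo"],
                   [(0, 1, "Misc"), (2, 2, "ProductLine")]))
# -> ['B-Misc', 'I-Misc', 'B-ProductLine']
```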

Unsupervised Pre-training
One of the most popular approaches to leveraging large unlabeled data is unsupervised pre-training via masked language modeling. Pre-trained language models, such as BERT and its variants (e.g., RoBERTa (Liu et al., 2019), ALBERT (Lan et al., 2020b), and T5 (Raffel et al., 2019)), have achieved state-of-the-art performance in many natural language understanding tasks. These models are essentially massive neural networks based on bi-directional transformer architectures, and are trained on a tremendous amount of open-domain data. For example, the popular BERT-base model contains 110 million parameters, and is trained on the BooksCorpus (Zhu et al., 2015) (800 million words) and English Wikipedia (2,500 million words). However, such open-domain data can only provide limited semantic and syntactic information for domain-specific tasks. To further capture domain-specific knowledge, Lee et al. (2020) and Gururangan et al. (2020) propose to continually pre-train the model over large in-domain unlabeled data.
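For intuition, a minimal sketch of BERT-style token masking (the 80/10/10 rule of Devlin et al. (2019): a selected position becomes the mask token 80% of the time, a random token 10%, and is left unchanged 10%); this helper is illustrative only, not the authors' implementation:

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", mask_prob=0.15, seed=0):
    """BERT-style masking. labels is -100 (ignored by the loss) for
    unselected positions, else the original token to be predicted."""
    rng = random.Random(seed)
    masked, labels = list(tokens), [-100] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok
            r = rng.random()
            if r < 0.8:
                masked[i] = mask_token
            elif r < 0.9:
                masked[i] = rng.choice(tokens)  # stand-in for a vocabulary sample
    return masked, labels
```

The model is then trained to predict the original token at every position where the label is not -100.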

Method
To harness the power of weakly labeled data, we propose a new framework, NEEDLE, which consists of three stages, as illustrated in Figure 1: 1) we first adapt an open-domain pre-trained language model to the downstream domain via MLM continual pre-training on the unlabeled in-domain data; 2) we use knowledge bases to convert the unlabeled data into weakly labeled data through weak supervision, and then apply noise-aware continual pre-training to learn task-specific knowledge from both strongly and weakly labeled data; 3) lastly, we fine-tune the model on the strongly labeled data again.

Stage I: Domain Continual Pre-training over Unlabeled Data
Following previous work on domain-specific BERT (Gururangan et al., 2020; Lee et al., 2020), we first conduct continual masked language model pre-training on the large in-domain unlabeled data {X̃_m}_{m=1}^{M̃}. Note that the masked language model f_LM(·; θ_enc, θ_LM) contains encoder parameters θ_enc and classification head parameters θ_LM, which are initialized from open-domain pre-trained masked language models (e.g., BERT and RoBERTa).

Stage II: Noise-Aware Continual Pre-training over Both Strongly and Weakly Labeled Data

In the second stage, we use knowledge bases to generate weak labels for the unlabeled data, yielding the weakly labeled data {(X̃_m, Ỹ^w_m)}_{m=1}^{M̃}. We then continually pre-train the model with both the weakly labeled in-domain data and the strongly labeled data. Specifically, we first replace the MLM head with a CRF classification head (Lafferty et al., 2001) and conduct noise-aware weakly supervised learning, which contains two ingredients: a weak label completion procedure and a noise-aware loss function.

• Weak Label Completion. As the weakly labeled data suffer from a severe missing-entity issue, we propose a weak label completion procedure. Specifically, we first train an initial NER model f(·; θ_Init) by optimizing Eq. (1) with θ_Init = (θ_enc, θ_CRF), where the encoder θ_enc is initialized from Stage I and the NER CRF head θ_CRF is randomly initialized. Then, for a given sentence X̃, we combine the original weak labels Ỹ^w with the predictions Ỹ^p of this initial model: tokens matched by the knowledge base keep their weak labels, while the remaining tokens take the model predictions, which yields the corrected weak labels Ỹ^c. Such a weak label completion procedure can remedy the incompleteness of the weak labels.

• Noise-Aware Loss Function. The model tends to overfit the noise of the weak labels when using the negative log-likelihood loss of Eq. (1) over the weakly labeled data. To alleviate this issue, we propose a noise-aware loss function based on the estimated confidence of the corrected weak labels Ỹ^c, defined as the estimated probability of Ỹ^c being the true labels Y: P(Ỹ^c = Y | X̃). This confidence can be estimated by the model prediction score f(X̃; θ) and histogram binning (Zadrozny and Elkan, 2001). See more details in Appendix A.
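A minimal sketch of the weak label completion rule just described, assuming "O" marks a token the knowledge base did not match, so such tokens fall back to the initial model's prediction:

```python
def complete_weak_labels(weak_labels, model_labels):
    """Weak label completion: keep the knowledge-base weak label where
    one exists ("O" marks an unmatched token); otherwise fall back to
    the initial NER model's prediction."""
    return [w if w != "O" else p for w, p in zip(weak_labels, model_labels)]

weak  = ["O", "B-ProductLine", "O"]          # knowledge base missed "zelda"
model = ["B-Misc", "B-ProductLine", "O"]     # initial model's prediction
print(complete_weak_labels(weak, model))     # -> ['B-Misc', 'B-ProductLine', 'O']
```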
We design the noise-aware loss function to make the fitting to the weak labels more conservative when the confidence is lower, and more aggressive when it is higher. Specifically, when Ỹ^c = Y, we let the loss function L be the negative log-likelihood, i.e., L(·, · | Ỹ^c = Y) = ℓ(·, ·); when Ỹ^c ≠ Y, we let L be the negative log-unlikelihood, i.e., L(·, · | Ỹ^c ≠ Y) = −ℓ(·, ·). Accordingly, the noise-aware loss function is designed as

    L(f(X̃; θ), Ỹ^c) = P(Ỹ^c = Y | X̃) · ℓ(f(X̃; θ), Ỹ^c) − (1 − P(Ỹ^c = Y | X̃)) · ℓ(f(X̃; θ), Ỹ^c),    (4)

where the log-unlikelihood loss can be viewed as a regularizer and the confidence of the weak labels can be viewed as an adaptive weight. The training objective over both the strongly labeled data and the weakly labeled data is:

    min_θ (1/(M + M̃)) [ Σ_{m=1}^M ℓ(f(X_m; θ), Y_m) + Σ_{m=1}^{M̃} L(f(X̃_m; θ), Ỹ^c_m) ].
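To illustrate, the noise-aware loss in Eq. (4) collapses to a confidence-scaled negative log-likelihood; the sketch below assumes a single sequence-level log-likelihood and a pre-estimated confidence:

```python
import math

def noise_aware_loss(log_likelihood, confidence):
    """Expected loss under the estimated confidence P(Y_c = Y | X):
        conf * NLL + (1 - conf) * (-NLL) = (2 * conf - 1) * NLL.
    With confidence 1 this is the ordinary negative log-likelihood;
    with confidence 0.5 the weakly labeled sentence contributes nothing;
    below 0.5 the sign flips and the corrected labels are penalized."""
    nll = -log_likelihood
    return (2.0 * confidence - 1.0) * nll

full = noise_aware_loss(math.log(0.9), confidence=1.0)   # ordinary NLL
none = noise_aware_loss(math.log(0.9), confidence=0.5)   # zeroed out
```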

Stage III: Final Fine-tuning
Stages I and II of our proposed framework mainly focus on preventing the model from overfitting to the noise of the weak labels. Meanwhile, they also suppress the model's fitting to the strongly labeled data. To address this issue, we fine-tune the model on the strongly labeled data again. Our experiments show that this additional fine-tuning is essential.

Datasets
We evaluate the proposed framework on two different domains: the E-commerce query domain and the Biomedical domain. The data statistics are summarized in Table 1. For E-commerce query NER, we consider two settings: English queries and multilingual queries. English NER has 10 different entity types, while multilingual NER has 12. The queries are collected from search queries issued to a shopping website. The unlabeled in-domain data and the weak annotations are obtained by aggregating user behavior data collected from the shopping website. We give more details about the weakly labeled data in Appendix E.
For Biomedical NER, we use three popular benchmark datasets: BC5CDR-Chem, BC5CDR-Disease (Wei et al., 2015), and NCBI-Disease (Dogan et al., 2014). Each of these datasets contains only a single entity type. We use the pre-processed data in BIO format from Crichton et al. (2017), following BioBERT (Lee et al., 2020) and PubMedBERT (Gu et al., 2020). We collect unlabeled data from the PubMed 2019 baseline, and use dictionary lookup and exact string matching to generate weak labels. We only include sentences with at least one weak entity label.
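A simplified sketch of such dictionary-based weak label generation (greedy longest-span exact matching over lowercased surface forms; the dictionary entries below are hypothetical):

```python
def dictionary_weak_labels(tokens, dictionary):
    """Greedy longest-match dictionary lookup: any token span whose
    lowercased surface form exactly matches a dictionary entry is
    labeled with the entry's type; everything else stays "O"."""
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        for j in range(len(tokens), i, -1):   # try the longest span first
            surface = " ".join(tokens[i:j]).lower()
            if surface in dictionary:
                etype = dictionary[surface]
                labels[i] = f"B-{etype}"
                for k in range(i + 1, j):
                    labels[k] = f"I-{etype}"
                i = j - 1                     # skip past the match
                break
        i += 1
    return labels

d = {"breast cancer": "Disease", "tamoxifen": "Chemical"}
print(dictionary_weak_labels("tamoxifen treatment of breast cancer".split(), d))
# -> ['B-Chemical', 'O', 'O', 'B-Disease', 'I-Disease']
```

Note that any entity surface form missing from the dictionary silently becomes "O", which is exactly the "incompleteness" issue discussed in the paper.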
• Weak Label Performance. Table 1 also presents the precision and recall of the weak labels on a golden evaluation set. As can be seen, the weak labels suffer from a severe incompleteness issue; in particular, the recall of the E-commerce query NER weak labels is below 50. The weak labels also suffer from labeling bias.

Baselines
We compare NEEDLE with the following baselines (for a fair comparison, all pre-trained models used in the baseline methods have been continually pre-trained on the in-domain unlabeled data, i.e., Stage I of NEEDLE): • Supervised Learning Baseline: we directly fine-tune the pre-trained model on the strongly labeled data. For E-commerce query NER, we use Query-RoBERTa-CRF, which is adapted from the RoBERTa-large model. For E-commerce multilingual query NER, we use Query-mBERT-CRF, which is adapted from mBERT. For Biomedical NER, we use BioBERT-CRF (Lee et al., 2020), which is adapted from BERT-base.
• Semi-supervised Self-Training (SST): SST uses the model obtained by supervised learning to generate pseudo labels for the unlabeled data and then conducts semi-supervised learning (Wang et al., 2020; Du et al., 2021).
• Weakly Supervised Learning (WSL): training over a simple combination of the strongly labeled and weakly labeled data.
• Weighted WSL: WSL with a weight γ balancing the loss on the strongly labeled and weakly labeled data. We tune the weight γ and present the best result.
• Robust WSL: WSL with a mean squared error loss function, which is robust to label noise (Ghosh et al., 2017). As the robust loss is not compatible with the CRF, we use the token-wise classification model for Stage II training.
• Partial WSL: WSL with non-entity weak labels excluded from the training loss (Shang et al., 2018).

E-commerce NER
We use span-level precision/recall/F1-score as the evaluation metrics. We present the main results on English query NER in Table 2.

Main Results
• NEEDLE: NEEDLE outperforms the fully supervised baseline and achieves the best performance among all baseline methods; • Weakly Supervised Baselines: all weakly supervised baseline methods, including WSL, Weighted WSL, Partial WSL, and Robust WSL, lead to worse performance than the supervised baseline. This is consistent with our claim in Section 1: the weakly labeled data can hurt the model performance if not properly handled; • SST: semi-supervised self-training outperforms the supervised baseline and the weakly supervised baselines. This indicates that, if not properly handled, the weak labels are even worse than the pseudo labels generated by model prediction. In contrast, NEEDLE outperforms SST, which indicates that the weak labels can indeed provide additional knowledge and improve the model performance when their noise is suppressed.

Ablation
We study the effectiveness of each component of NEEDLE. Specifically, we use the following abbreviation to denote each component of NEEDLE: • WLC: Weak label completion.
• NAL: Noise-aware loss function, i.e., Eq. (4). Since NAL is built on top of WLC, the two components need to be used together.
• FT: Final fine-tuning on strongly labeled data (Stage III).
As can be seen from the ablation results, each of these components contributes to the final performance of NEEDLE.

Biomedical NER
We present the main results on Biomedical NER in Table 5. NEEDLE achieves the best performance among all compared methods. We outperform the previous SOTA (Lee et al., 2020; Gu et al., 2020) by 0.41%, 5.07%, and 3.15% in F1-score on BC5CDR-chemical, BC5CDR-disease, and NCBI-disease, respectively. The improvement on BC5CDR-disease is particularly significant. We conjecture that the weak labels for disease entities are relatively accurate, since WSL can also improve the model performance.

Analysis
Size of Weakly Labeled Data. To study how the model benefits from the size of the weakly labeled data, we test the model performance with randomly sub-sampled weakly labeled data. We plot the F1-score curve for E-commerce English query NER in Figure 2a and for the BC5CDR data in Figure 2b. We find that NEEDLE gains more from increasing the size of the weakly labeled data than the other methods (SST and WSL). We also present the performance of NEEDLE w/o FT in Figure 2c. As can be seen, although the performance of NEEDLE w/o FT decreases with more weakly labeled data, the model can still learn more useful information and achieves better performance after fine-tuning.

Two Rounds of Stage II Training. Since the model after the final fine-tuning is better than the initial model in Stage II, we study whether using the fine-tuned model for an additional round of Stage II can further improve the performance of NEEDLE. Specifically, after Stage III, we 1) use the new model to complete the original weak labels; 2) conduct noise-aware continual pre-training over both strongly and weakly labeled data; and 3) fine-tune the model on the strongly labeled data. The results are presented in Figure 2 (last point of each curve). As can be seen, NEEDLE obtains a slight improvement from the second round of Stage II training. On the other hand, SST and NEEDLE w/o NAL achieve little improvement from a second round of training.

Size of Strongly Labeled Data. To demonstrate that NEEDLE is sample efficient, we test NEEDLE on randomly sub-sampled strongly labeled data on E-commerce NER. As shown in Figure 3, NEEDLE requires only 30% ∼ 50% of the strongly labeled data to achieve the same performance as the (fully) supervised baseline. We also observe that NEEDLE achieves a more significant improvement with fewer labeled data: +2.28/+3.64 F1-score with 1%/10% of the labeled data.

Weak Label Errors in E-commerce NER
Here we study several typical errors in the weak labels to better understand them and how the proposed techniques reduce these errors.

Label Distribution Mismatch. First, we show the distribution difference between the weak labels and the strong labels, and demonstrate how weak label completion reduces the gap. Specifically, we compare the entity distributions of the true labels, weak labels, corrected weak labels, and self-training pseudo labels in Figure 4. As can be seen, the original weak labels suffer from a severe missing-entity issue (i.e., too many non-entity labels) and distribution shift (e.g., nearly no Misc labels). The corrected weak labels, on the other hand, suffer much less from missing entities and distribution shift. The SST pseudo labels are the most similar to the strong labels, which explains why SST can directly improve the performance.

Systematic Errors. We observe that many errors in the weakly labeled data are systematic errors, which can be easily fixed by the final fine-tuning stage. For example, "amiibo" is a product line of "nintendo". The amiibo characters should be labeled as the Misc type, while the weak labels wrongly annotate them all as Color.
We list 4 queries with their strong labels and weak labels in Table 6. Although these errors lead to worse performance in Stage II, they can be easily fixed in the final fine-tuning stage. Specifically, the pre-training first encourages the model to learn that "xxx amiibo" is a combination of color + productLine from a large amount of weakly labeled data, and the fine-tuning step then corrects this pattern to misc + productLine with a limited amount of data. This is easier than directly learning the misc + productLine pattern from the limited strongly labeled data.

Figure 2: Size of weakly labeled data vs. performance. We present the performance after the final round of fine-tuning in (a) and (b). We also compare the performance with and without fine-tuning in (c) using the E-commerce English query NER data. The baselines are Query-RoBERTa-CRF for (a, c) and BioBERT-CRF for (b). "Baseline": the fully supervised baseline. We also present the performance after two rounds of Stage II training at the rightmost point of each curve ("Stage II x2").

Table 6: Four example queries ("zelda amiibo", "wario amiibo", "yarn yoshi amiibo", "amiibo donkey kong") shown with their human labels, original weak labels, corrected weak labels, and self-training labels.

Entity BIO Sequence Mismatch in Weak Label Completion. Another error in the weak labels is a mismatched entity BIO sequence introduced in the weak label completion step, e.g., B-productType followed by I-color. For English query NER, the proportion of such broken queries is 1.39%. Removing these samples makes Stage II perform better (F1-score +1.07), but it does not improve the final-stage performance (F1-score -0.18). This experiment indicates that the final fine-tuning suffices to correct these errors, and we do not need to explicitly exclude these samples from Stage II.
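A check for such broken BIO sequences can be sketched as follows:

```python
def has_bio_mismatch(labels):
    """Detect a broken BIO sequence: an I-X tag whose preceding tag is
    neither B-X nor I-X (e.g., B-productType followed by I-color)."""
    prev = "O"
    for lab in labels:
        if lab.startswith("I-"):
            etype = lab[2:]
            if prev not in (f"B-{etype}", f"I-{etype}"):
                return True
        prev = lab
    return False

print(has_bio_mismatch(["B-productType", "I-color"]))  # -> True
print(has_bio_mismatch(["B-color", "I-color", "O"]))   # -> False
```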
Quantifying the Impact of Weak Labels. Here we examine the impact of the weak labels through the lens of prediction errors. We check the errors made by the model on the validation set. There are 2,384 entities wrongly classified by the initial NER model. After applying NEEDLE, 454 of these 2,384 entities are correctly classified. On the other hand, the model makes 311 new wrong predictions. Note that not all of these changes are directly caused by the weakly labeled data, i.e., some entities are not observed in the weakly labeled data, and some changes may be due only to data randomness. If we exclude the entities that are not observed among the weakly annotated entities, there are 171 newly correct and 93 newly wrong classifications that are affected by the weak labels. The ratio 171/93 = 1.84 >> 1 justifies that the advantage of NAL significantly outweighs the disadvantage of the noise of the weak labels.

Discussion and Conclusion
Our work is closely related to fully weakly supervised NER. Most previous works focus only on weak supervision without strongly labeled data (Shang et al., 2018; Lan et al., 2020a; Liang et al., 2020). However, the gap between a fully weakly supervised model and a fully supervised model is usually large. For example, a fully supervised model trained on only 300 articles can outperform a weakly supervised model (AutoNER; Shang et al., 2018). Such a large gap makes fully weakly supervised NER impractical in real-world applications. Our work is also relevant to semi-supervised learning, where the training data is only partially labeled. There have been many semi-supervised learning methods, including the popular self-training methods used in our experiments for comparison (Yarowsky, 1995; Rosenberg et al., 2005; Tarvainen and Valpola, 2017; Miyato et al., 2018; Meng et al., 2018; Clark et al., 2018; Yu et al., 2021). Different from weak supervision, these semi-supervised learning methods usually have a partial set of labeled data. They rely on the labeled data to train a sufficiently accurate model; the unlabeled data are usually used to induce certain regularization that further improves generalization. Existing semi-supervised learning methods such as self-training do not leverage the knowledge from weak supervision and can only marginally improve the performance.
Different from previous studies on fully weakly supervised NER, we identify an important research question on weak supervision: the weakly labeled data, when simply combined with the strongly labeled data during training, can degrade the model performance. To address this issue, we propose a new computational framework named NEEDLE, which effectively suppresses the extensive noise in the weakly labeled data and learns from both strongly labeled and weakly labeled data. Our proposed framework bridges supervised NER and weakly supervised NER, and harnesses the power of weak supervision in a principled manner. Note that NEEDLE is complementary to fully weakly supervised / semi-supervised learning. One potential future direction is to combine NEEDLE with other fully weakly supervised / semi-supervised learning techniques to further improve the performance, e.g., contrastive regularization (Yu et al., 2021).

Broader Impact
This paper studies NER with small strongly labeled and large weakly labeled data. Our investigation neither introduces any social/ethical bias to the model nor amplifies any bias in the data. We do not foresee any direct social consequences or ethical issues.

A Estimation of Weak Label Confidence
Here we describe how we estimate the confidence of the corrected weak labels, P(Ỹ^c = Y | X̃). Note that the corrected weak labels Ỹ^c in NEEDLE consist of two parts: the original weak labels Ỹ^w and the model predictions Ỹ^p. We therefore estimate the confidence of the corrected weak labels from the confidences of these two parts using a simple linear combination:

    P(Ỹ^c = Y | X̃) = λ · P(Ỹ^w = Y | X̃) + (1 − λ) · P(Ỹ^p = Y | X̃),

where the weight λ of the linear combination comes from the rule of the weak label completion procedure.
Recall that we use the original weak labels for all matched tokens in the original weakly supervised data, and the model predictions for the other tokens. We first assume the confidence of the weak labels is high, i.e., P(Ỹ^w = Y | X̃) = 1, since there is little ambiguity in the domain-specific dictionary and the matching process.
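One way to instantiate the linear combination above is to weight the two source confidences by the fraction of tokens each source contributes to the corrected labels; this token-fraction weighting is our assumption, since the text only states that the weight follows from the completion rule:

```python
def corrected_label_confidence(from_weak, p_weak, p_model):
    """Linear combination of the two source confidences, weighted by the
    fraction of tokens each source contributed (an assumed weighting).
    `from_weak[i]` is True where token i kept its original weak label."""
    n = len(from_weak)
    lam = sum(from_weak) / n
    return lam * p_weak + (1.0 - lam) * p_model

# 2 of 4 tokens matched the dictionary; P(weak)=1.0, P(model)=0.8
conf = corrected_label_confidence([True, True, False, False], 1.0, 0.8)
```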
The label prediction Ỹ^p of the CRF model is based on its Viterbi decoding score s(Ỹ^p). The confidence of Ỹ^p, i.e., P(Ỹ^p = Y | X̃), can be estimated via histogram binning (Zadrozny and Elkan, 2001). Specifically, we categorize samples into bins based on the decoding score s(Ỹ^p). For each bin, we estimate the confidence using a validation set (independent of the final evaluation set). For a new sample, we first calculate the decoding score, and then estimate the prediction confidence by the confidence of the corresponding bin in the histogram. Figure 5 illustrates an example of histogram binning. As can be seen, the decoding score has a strong correlation with the prediction confidence.
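A minimal sketch of histogram binning over decoding scores (equal-width bins on [0, 1]; the bin count and the empty-bin fallback are assumptions, not details from the paper):

```python
from bisect import bisect_right

def fit_histogram_binning(scores, correct, n_bins=10):
    """Equal-width histogram binning on [0, 1]: the confidence estimate
    for a bin is the empirical accuracy of validation predictions whose
    decoding scores fall into that bin."""
    edges = [i / n_bins for i in range(1, n_bins)]
    hits, totals = [0] * n_bins, [0] * n_bins
    for s, c in zip(scores, correct):
        b = bisect_right(edges, s)
        totals[b] += 1
        hits[b] += int(c)
    # fall back to the bin midpoint when a bin received no samples
    return [h / t if t else (i + 0.5) / n_bins
            for i, (h, t) in enumerate(zip(hits, totals))]

def confidence(score, bin_conf):
    """Look up the calibrated confidence for a new decoding score."""
    n_bins = len(bin_conf)
    return bin_conf[min(int(score * n_bins), n_bins - 1)]

bin_conf = fit_histogram_binning([0.95, 0.92, 0.15], [True, True, False])
print(confidence(0.93, bin_conf))  # -> 1.0 (high scores were always correct)
```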

B Additional Experimental Results for E-commerce NER
We also present Token/Span/Query level Accuracy, as they are commonly used in E-commerce NER tasks.

D Extension: Multilingual NER
The proposed framework can be extended to improve multilingual NER. For Stage I and Stage II, we use data from other languages to learn domain-specific and task-related knowledge. In the final fine-tuning stage, we use the data from the target language, which allows us to adapt the model to the target language and obtain better performance on it. The framework is summarized in Figure 6. The results on multilingual query NER are presented in Table 11. As can be seen, NEEDLE outperforms the baseline methods.

Figure 6: Three-Stage NEEDLE for Multilingual NER.

Table 11: E-commerce Multilingual Query NER: Span Precision/Recall/F1 and Token/Span/Query level Accuracy. The best performance is in bold, and results close to the best performance (≤ 0.2%) are also in bold. "mBERT-CRF (Single)": fine-tune mBERT with strongly labeled data from the target language. "w/ Fine-tune": the additional fine-tuning stage uses only strongly labeled data from the target language. For the other methods, we use multilingual human-annotated data.

E Details of Weakly Labeled Datasets

E.1 Weak Labels for Biomedical NER Data

Unlabeled Data
The large-scale unlabeled data is obtained from titles and abstracts of Biomedical articles.

Weak Label Generation
The weak annotation is generated by dictionary lookup and exact string match.

E.2 Weak Labels for E-commerce Query NER Data

Unlabeled Data
The unlabeled in-domain data is obtained by aggregating anonymized user behavior data collected from the shopping website.

Weak Label Generation
The weak annotations are obtained from the aggregated anonymized user behavior data in three steps:
Step 1. For each query, we aggregate the user click behavior data and find the most clicked product.
Step 2. Identify product attributes in the product knowledge base by product ID.
Step 3. We match spans of the query against the product attributes. If a match is found, we annotate the span with the attribute type.
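Steps 2-3 above can be sketched as follows (the attribute map from the most-clicked product is hypothetical):

```python
def annotate_query(query_tokens, attributes):
    """Given the attributes of the most-clicked product as a map from
    surface string to attribute type, label matching query spans."""
    labels = ["O"] * len(query_tokens)
    lowered = [t.lower() for t in query_tokens]
    for surface, atype in attributes.items():
        parts = surface.lower().split()
        n = len(parts)
        for i in range(len(query_tokens) - n + 1):
            span_is_free = all(l == "O" for l in labels[i:i + n])
            if lowered[i:i + n] == parts and span_is_free:
                labels[i] = f"B-{atype}"
                for k in range(i + 1, i + n):
                    labels[k] = f"I-{atype}"
    return labels

attrs = {"nintendo": "Brand", "switch console": "ProductType"}  # hypothetical
print(annotate_query("nintendo switch console".split(), attrs))
# -> ['B-Brand', 'B-ProductType', 'I-ProductType']
```

Query tokens with no matching attribute stay "O", which is one source of the incompleteness of these weak labels.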