Improving Named Entity Recognition by External Context Retrieving and Cooperative Learning

Recent advances in Named Entity Recognition (NER) show that document-level contexts can significantly improve model performance. In many application scenarios, however, such contexts are not available. In this paper, we propose to find external contexts of a sentence by retrieving and selecting a set of semantically relevant texts through a search engine, with the original sentence as the query. We find empirically that the contextual representations computed on the retrieval-based input view, constructed through the concatenation of a sentence and its external contexts, can achieve significantly improved performance compared to the original input view based only on the sentence. Furthermore, we can improve the model performance of both input views by Cooperative Learning, a training method that encourages the two input views to produce similar contextual representations or output label distributions. Experiments show that our approach can achieve new state-of-the-art performance on 8 NER data sets across 5 domains.


Introduction
Pretrained contextual embeddings such as ELMo (Peters et al., 2018), Flair (Akbik et al., 2018) and BERT (Devlin et al., 2019) have significantly improved the accuracy of Named Entity Recognition (NER) models. Recent work (Devlin et al., 2019; Yu et al., 2020; Yamada et al., 2020) found that including document-level contexts of the target sentence in the input of contextual embedding methods can further boost the accuracy of NER models.

* Yong Jiang and Kewei Tu are the corresponding authors. ‡: This work was conducted when Xinyu Wang was interning at Alibaba DAMO Academy. 1 Our code is publicly available at https://github.com/Alibaba-NLP/CLNER.

Figure 1: A motivating example from the WNUT-17 dataset. The input sentence is "senate democrats eliminated the nuclear option when they had the majority a few years ago, over republican objections."; the retrieved texts (news sentences about the Senate's nuclear option, Democrats and Republicans) help the model to correctly predict the named entities "democrats" and "republican".

However, there are a lot of application scenarios
in which document-level contexts are unavailable in practice. For example, there are often no available contexts for users' search queries, tweets, and short comments in domains such as social media and E-commerce. When professional annotators annotate ambiguous named entities in such cases, they usually rely on domain knowledge for disambiguation, and this kind of knowledge can often be found through a search engine. Moreover, when annotators are not sure about a certain entity, they are usually encouraged to look up related knowledge through a search engine. Therefore, we believe that NER models can benefit from such a process as well.
In this paper, we propose to improve NER models by retrieving texts related to the input sentence with an off-the-shelf search engine. We re-rank the retrieved texts according to their semantic relevance to the input sentence and select several top-ranking texts as the external contexts. We then concatenate the input sentence and the external contexts together as a new retrieval-based input view and feed it to the pretrained contextual embedding module, so that the resulting semantic representations of the input tokens are improved. The token representations are then fed into a CRF layer for named entity prediction. A motivating example is shown in Figure 1.
Moreover, we consider utilizing the new input view to improve model performance with the original input view that does not have external contexts. This can be useful in application scenarios where external contexts are unavailable or undesirable (e.g., in time-critical scenarios). To this end, we propose Cooperative Learning (CL), which encourages the two input views to produce similar predictions. We propose two approaches to CL, which minimize either the L2 distances between the token representations of the two input views or the Kullback-Leibler (KL) divergence between their prediction distributions during training.
Our experiments show that including the retrieved external contexts can significantly improve the accuracy of NER models on 8 NER datasets from 5 domains. With CL, the accuracy of the NER models with both input views can be further improved. Our approaches outperform previous state-of-the-art approaches in each domain.
The contributions of this paper are: 1. We propose a simple and straightforward way to improve the contextual representation of an input sentence by retrieving related texts through a search engine. We take the retrieved texts together with the input sentence as a new retrieval-based view.
2. We propose Cooperative Learning to jointly improve the accuracy of both input views in a unified model. We propose two approaches to CL based on the L2 norm and the KL divergence respectively. CL can also utilize unlabeled data for further improvement.
3. We show the effectiveness of our approaches on several NER datasets across 5 domains, achieving state-of-the-art accuracy. By leveraging a large amount of unlabeled data, the performance can be further improved.

Framework
Given a sentence of n tokens x = {x_1, ..., x_n}, the input sentence is fed into a search engine as a query. The search engine returns the top k relevant texts {x̂_1, ..., x̂_k}, which our framework feeds into a re-ranking model. We concatenate the l top-ranking texts output by the re-ranking model as the external context. The NER model is then fed either an input view containing only the input sentence (original input view) or the concatenation of the input sentence and the external context (retrieval-based input view). Based on a CRF layer, the model outputs the label predictions y = {y_1, ..., y_n} at each position. To further improve the model, we use Cooperative Learning to train a unified model that is strong on both input views: with CL, the model is additionally constrained to be consistent in the internal representations or the output distributions of the two input views. The architecture of our framework is shown in Figure 2.
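As a minimal sketch of this pipeline (the `search` and `rerank` callables, the separator token, and the default l = 6 are illustrative stand-ins for the actual search engine, re-ranking module, and configuration):

```python
def build_retrieval_view(sentence, search, rerank, l=6, sep="[SEP]"):
    """Build the retrieval-based input view for one sentence.

    search: callable, sentence -> list of retrieved texts (top k from the engine).
    rerank: callable, (sentence, texts) -> texts sorted by semantic relevance.
    """
    candidates = search(sentence)            # query the search engine
    best = rerank(sentence, candidates)[:l]  # keep the l most relevant texts
    if not best:
        return sentence                      # fall back to the original input view
    # Concatenate the sentence and its external contexts with separator tokens.
    return sentence + f" {sep} " + f" {sep} ".join(best)
```

The resulting string is what gets fed to the transformer-based embedding module in place of the bare sentence.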

Re-ranking
Given an input sentence as a search query, the search engine returns ranked relevant texts. However, an off-the-shelf search engine is highly optimized for fast retrieval over a large set of documents, so it may sometimes produce semantically irrelevant results or rank the results with inaccurate relevance scores. Since the NER task aims at recognizing named entities semantically, the retrieved texts are more helpful when they are semantically similar to the input sentence. Therefore, we re-rank the retrieved texts so that the most semantically relevant ones are chosen. We apply BERTScore (Zhang et al., 2020) to score the relatedness of each retrieved text to the input sentence. BERTScore is a language generation metric based on the cosine similarities between the token representations of two sentences, so a large BERTScore makes it more likely that the search query and the retrieved text have a strong semantic relation. The token representations are generated from pretrained contextual embeddings such as BERT. Given the pre-normalized token representations {r_1, ..., r_n} of the input sentence x and the pre-normalized token representations {r̂_1, ..., r̂_m} of a certain retrieved text x̂ with m words, the Precision (P) and Recall (R) of BERTScore measure the semantic similarity from each side to the other:

P = (1/m) Σ_{j=1}^{m} max_{i=1,...,n} r_i⊤ r̂_j        R = (1/n) Σ_{i=1}^{n} max_{j=1,...,m} r_i⊤ r̂_j

We re-rank the retrieved texts by the F1 score F1 = 2·P·R / (P+R) and concatenate the l top-ranking texts {x̂_1, ..., x̂_l} together as the external context:

x̂ = x̂_1 sep_token x̂_2 sep_token ... sep_token x̂_l

where sep_token is a special token that separates sentences in the transformer-based pretrained contextual embeddings (for example, "[SEP]" in BERT).

Figure 2: The architecture of our framework. An input sentence x is fed into a search engine to get k related texts, which are then fed into the re-ranking module. The framework selects the l highest-ranking related texts output by the re-ranking module and feeds them to a transformer-based model together with the input sentence. Finally, we calculate the negative log-likelihood losses L_NLL and L_NLL-EXT together with the CL loss (either L_CL-L2 or L_CL-KL).
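The greedy token matching behind the P and R scores can be sketched in a few lines of NumPy; the embedding matrices here are generic stand-ins for real pre-normalized BERT token representations, and the default l = 6 mirrors the setting described later:

```python
import numpy as np

def bertscore_f1(ref_emb, cand_emb):
    """BERTScore F1 between the query's (n, d) and a retrieved text's (m, d)
    token-embedding matrices, via greedy cosine matching."""
    ref = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    cand = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    sim = ref @ cand.T                  # (n, m) pairwise cosine similarities
    recall = sim.max(axis=1).mean()     # each query token takes its best match
    precision = sim.max(axis=0).mean()  # each retrieved token takes its best match
    return 2 * precision * recall / (precision + recall)

def rerank_by_f1(query_emb, retrieved_embs, l=6):
    """Indices of the l retrieved texts with the highest BERTScore F1."""
    scores = [bertscore_f1(query_emb, emb) for emb in retrieved_embs]
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:l]
```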

NER Model
We solve the NER task as a sequence labeling problem. We apply a neural model with a CRF layer, which is one of the most popular state-of-the-art approaches to the task (Lample et al., 2016; Ma and Hovy, 2016; Akbik et al., 2019). In the sequence labeling model, the input sentence x is fed into a transformer-based pretrained contextual embedding model to get the token representations v = {v_1, ..., v_n}. The token representations are fed into a CRF layer to get the conditional probability p_θ(y|x):

p_θ(y|x) = [Π_{i=1}^{n} ψ(y_{i-1}, y_i, v_i)] / [Σ_{y'∈Y(x)} Π_{i=1}^{n} ψ(y'_{i-1}, y'_i, v_i)],    ψ(y', y, v) = exp((W⊤v)_y + b_{y',y})   (1)

where ψ is the potential function and θ represents the model parameters. Y(x) denotes the set of all possible label sequences given x, and y_0 is defined to be a special start symbol. W⊤ ∈ R^{t×d} and b ∈ R^{t×t} are the parameters computing the emission and transition scores respectively; d is the hidden size of v and t is the size of the label set. During training, the negative log-likelihood loss for the input sequence with gold labels y* is defined by:

L_NLL(θ) = −log p_θ(y*|x)   (2)

In our approach, we concatenate the external context x̂ at the end of the input sentence x to form the retrieval-based input view, so the token representations of the original n tokens are now computed with the external context attached. The conditional probability p_θ(y|x) then becomes p_θ(y|x, x̂), and the loss function in Eq. 2 becomes:

L_NLL-EXT(θ) = −log p_θ(y*|x, x̂)   (3)

The architecture of our NER model is shown in Figure 3.
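A minimal NumPy sketch of Eq. 1-2: the emission scores (the W⊤v_i terms) and transition scores (b) are passed in directly rather than computed by a transformer, and the partition function is computed with the forward algorithm:

```python
import numpy as np

def crf_log_likelihood(emissions, transitions, labels):
    """log p(y|x) for a linear-chain CRF.

    emissions:   (n, t) per-token label scores, i.e. the W^T v_i terms.
    transitions: (t, t) score of moving from label y_{i-1} to label y_i.
    labels:      gold label sequence of length n.
    """
    n, t = emissions.shape
    # Unnormalized score of the gold path.
    score = emissions[0, labels[0]]
    for i in range(1, n):
        score += transitions[labels[i - 1], labels[i]] + emissions[i, labels[i]]
    # Forward algorithm: log partition function over all t^n label sequences.
    alpha = emissions[0].copy()
    for i in range(1, n):
        alpha = emissions[i] + np.logaddexp.reduce(alpha[:, None] + transitions, axis=0)
    log_z = np.logaddexp.reduce(alpha)
    return score - log_z  # the NLL loss of Eq. 2 is the negative of this
```

Summing exp(log p) over all label sequences of a small example recovers 1, which is a quick sanity check on the forward recursion.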

Cooperative Learning
In practice, there are two application scenarios for the NER model: 1) offline prediction, which requires high prediction accuracy while prediction speed is less emphasized; 2) online serving, which requires a faster prediction speed. The retrieval-based input view meets the requirement of the first scenario thanks to its strong token representations. However, it does not meet the requirement of the second: the external contexts are usually significantly longer than the input sentence, and a search engine may not meet the latency requirements. These two issues significantly slow down the prediction speed of the model. Therefore, it is essential to improve the accuracy of the original input view within a unified model so that both scenarios can be served.
Cooperative Learning aims at using the retrieval-based input view to improve the accuracy of the model when no external contexts are available. CL adds constraints between the internal representations or the output distributions of the two input views to enforce that the predictions of both views should be close. The objective function of CL is:

L_CL(θ) = D(h([x; x̂]), h(x))   (4)

where D is a distance function and h maps an input to either the model's token representations or its output distributions. Because the representations and the distributions computed from the retrieval-based input view are usually more informative, we do not back-propagate the gradient through h([x; x̂]). We propose two approaches for CL.
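The two instantiations of D used in this paper (an L2 penalty on token representations, and a KL/cross-entropy penalty on per-position CRF marginals obtained with forward-backward) can be sketched as follows; plain NumPy has no autograd, so the stop-gradient on the retrieval view is only noted in comments:

```python
import numpy as np

def crf_marginals(emissions, transitions):
    """Per-position label marginals q(y_i) of a linear-chain CRF,
    computed with the forward-backward algorithm in log space."""
    n, t = emissions.shape
    alpha, beta = np.zeros((n, t)), np.zeros((n, t))
    alpha[0] = emissions[0]
    for i in range(1, n):
        alpha[i] = emissions[i] + np.logaddexp.reduce(alpha[i - 1][:, None] + transitions, axis=0)
    for i in range(n - 2, -1, -1):
        beta[i] = np.logaddexp.reduce(transitions + emissions[i + 1] + beta[i + 1], axis=1)
    log_z = np.logaddexp.reduce(alpha[-1])
    return np.exp(alpha + beta - log_z)  # (n, t); each row sums to 1

def cl_l2_loss(reps_with_ctx, reps_without_ctx):
    """Squared L2 distance between the (n, d) token representations of the two
    views; gradients would flow only through reps_without_ctx."""
    return float(np.sum((reps_with_ctx - reps_without_ctx) ** 2))

def cl_kl_loss(marg_with_ctx, marg_without_ctx):
    """Cross-entropy between the (n, t) per-position marginals of the two views
    (equal to the KL term up to a constant, since the retrieval view is a
    detached target)."""
    return float(-np.sum(marg_with_ctx * np.log(marg_without_ctx + 1e-12)))
```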
Token Representations: Stronger token representations usually lead to better accuracy on the task. Therefore, CL can constrain the token representations of the two input views to be similar. This helps the model learn to predict the token representations it would compute with external contexts even when the contexts are not available. In this approach, D is the squared L2 distance between the token representations of the two views:

L_CL-L2(θ) = Σ_{i=1}^{n} ||v_i([x; x̂]) − v_i(x)||_2^2   (5)

Label Distributions: Since CL enforces the label predictions of both input views to be similar, a straightforward approach is constraining the label distributions predicted under the two input views to be similar. In this approach, we use the KL divergence as the function D, so the objective function in Eq. 4 becomes the KL divergence between p_θ(y|x, x̂) and p_θ(y|x):

L_CL-KL(θ) = KL(p_θ(y|x, x̂) || p_θ(y|x))   (6)

With the CRF layer, this loss function is difficult to calculate because the output space of p_θ(y|·) is exponential in size. To alleviate this issue, we calculate the KL divergence between the marginal distributions q_θ(y_i|x, x̂) and q_θ(y_i|x) at each position of the sentence to approximate Eq. 6; the marginal distributions can be obtained using the forward-backward algorithm:

L_CL-KL(θ) ≈ Σ_{i=1}^{n} KL(q_θ(y_i|x, x̂) || q_θ(y_i|x))   (7)

As mentioned earlier, we do not back-propagate the gradient through p_θ(y|x, x̂). Calculating the KL divergence is therefore equivalent to calculating the cross-entropy loss between q_θ(y_i|x, x̂) and q_θ(y_i|x):

L_CL-KL(θ) = −Σ_{i=1}^{n} Σ_{y_i} q_θ(y_i|x, x̂) log q_θ(y_i|x)   (8)

Together with the negative log-likelihood losses in Eq. 2 and 3, the total training loss is a summation of the label losses and a CL loss:

L(θ) = L_NLL(θ) + L_NLL-EXT(θ) + L_CL(θ)   (9)

where L_CL(θ) can be one of the CL losses in Eq. 5 and 8, or a summation of both.

Experiments

Settings
Datasets To show the effectiveness of our approach, we experiment on 8 NER datasets across 5 domains: • Social Media: We use WNUT-16 (Strauss et al., 2016) and WNUT-17 (Derczynski et al., 2017) datasets collected from social media. We use the standard split for these datasets.
• News: We use the CoNLL-03 English dataset (Tjong Kim Sang and De Meulder, 2003) and the CoNLL++ dataset. CoNLL-03 is the most popular dataset for NER. CoNLL++ is a revision of CoNLL-03 that fixed the annotation errors on the test set through professional annotators and improved the quality of the training data through the CrossWeigh approach. We use the standard dataset split for these datasets.
• Biomedical: We use the BC5CDR (Li et al., 2016) and NCBI-disease (Dogan et al., 2014) datasets, two popular biomedical NER datasets. We merge the training and development data as the training set, following Nooralahzadeh et al.
• Science and Technology: We use the CBS SciTech News dataset collected by Jia et al. (2019). The dataset only contains a test set, with the same label set as the CoNLL-03 dataset. We use it to evaluate the effectiveness of cross-domain transfer from the news domain.
• E-commerce: We collect and annotate an internal dataset from one anonymous E-commerce website. The dataset contains 25 named entity labels for goods in short texts. We also collect 300,000 unlabeled sentences for semi-supervised training.
We show the statistics of the datasets in Table 1.
Annotations of the E-commerce dataset We manually labeled user queries from www.aliexpress.com, a real-world E-commerce website, through crowd-sourcing. For each query, we asked one annotator to label the entities and another annotator to check the quality. After that, we randomly selected 10% of the dataset and asked a third annotator to check the accuracy. The resulting overall averaged query-level accuracy is 95%. The dataset will not be released due to user privacy.

Retrieving and Ranking
We use an internal E-commerce search engine for the E-commerce dataset. For the other datasets, we use Google Search, an off-the-shelf search engine that can simulate offline search over various domains. We use the summarized descriptions from the search results as the retrieved texts. As Google Search limits the maximal length of search queries to 32 words, we chunk a sentence longer than 30 words into multiple sub-sentences based on punctuation, feed each sub-sentence to the search engine, and retrieve up to 20 results. We filter out retrieved texts that contain any part of the datasets. Our re-ranking module selects the top 6 relevant texts as the external contexts of the input sentence, and we truncate the external contexts if the total sub-token length of the input sentence and external contexts exceeds 510.
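A sketch of the punctuation-based query chunking described above; the regular expression and the greedy packing of sub-sentences are our own illustrative choices, not necessarily the exact implementation:

```python
import re

def chunk_query(sentence, max_words=30):
    """Split an over-long search query into sub-queries at punctuation
    boundaries so each stays under the engine's word limit."""
    if len(sentence.split()) <= max_words:
        return [sentence]
    # Split after sentence-internal punctuation, keeping the punctuation mark.
    parts = re.split(r'(?<=[,.;!?])\s+', sentence)
    chunks, current = [], []
    for part in parts:
        if current and len(" ".join(current + [part]).split()) > max_words:
            chunks.append(" ".join(current))  # close the current sub-query
            current = [part]
        else:
            current.append(part)
    if current:
        chunks.append(" ".join(current))
    return chunks
```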

Model Configurations
For the re-ranking module, we use RoBERTa-Large for token representations, which is the default configuration in the code of BERTScore (Zhang et al., 2020). For token representations in the NER model, we use pretrained BioBERT (Lee et al., 2020) for the datasets from the biomedical domain and XLM-RoBERTa (Conneau et al., 2020) for the datasets from the other domains.
Training During training, we fine-tune the pretrained contextual embeddings with the AdamW (Loshchilov and Hutter, 2018) optimizer and a batch size of 4. We use a learning rate of 5 × 10^−6 to update the parameters of the pretrained contextual embeddings and a learning rate of 0.05 for the CRF layer parameters. We train the NER models for 10 epochs on the datasets in the Social Media and Biomedical domains, and for 5 epochs on the other datasets for efficiency, as these datasets have more training sentences.

Results
We experiment with the following approaches: • LUKE is a very recent state-of-the-art model on the CoNLL-03 NER dataset proposed by Yamada et al. (2020). We use the same parameter setting as Yamada et al. (2020) but use a single sentence as the input instead of taking document-level contexts from the dataset, for a fair comparison.
• W/O CONTEXT represents training the NER model without external contexts (Eq. 2), which is the baseline of our approaches.
• W/ CONTEXT represents training the NER model with external contexts (Eq. 3).
Besides, we also compare our approaches with previous state-of-the-art approaches on entity-level F1 scores. During evaluation, our approaches are evaluated on inputs without external contexts (W/O CONTEXT) and on inputs with them (W/ CONTEXT). We report results averaged over 5 runs. The results are listed in Table 2. (We do not compare here with the results from previous work such as Yu et al. (2020), Luoma and Pyysalo (2020) and Yamada et al. (2020) that utilize the document-level contexts in CoNLL-03 NER; we conduct a comparison with these approaches in Appendix A.) With the external contexts, our models with CL outperform previous state-of-the-art approaches on most of the datasets. Our approaches significantly outperform the baseline trained without external contexts, with only one exception. Our approaches and our baseline outperform LUKE in all cases; a possible reason is that LUKE is pretrained only on long word sequences, which makes the model prone to failing to capture entity information in short sentences. With CL, the accuracy of our approaches is improved on both input views compared with W/O CONTEXT and W/ CONTEXT, which shows that adding constraints between the two views during training helps the model better utilize the original text information. Between the two constraints in CL, we find that CL-KL is relatively stronger than CL-L2 in a majority of the cases.

Cross-Domain Transfer
For cross-domain transfer, we train the models on the CoNLL-03 dataset, evaluate the accuracy on the CBS SciTech News dataset, and compare the results with those in Jia et al. (2019). We evaluate our approaches with each input view; the results are shown in Table 3. Our approaches improve the accuracy in cross-domain evaluation, and the external contexts during evaluation help to improve the accuracy of W/ CONTEXT. However, the gap between the two input views is diminished for the CL approaches. This observation shows that CL is able to improve the accuracy of cross-domain transfer for both views and to eliminate the gap between the two views.

Semi-supervised Cooperative Learning
Cooperative Learning can take advantage of large amounts of unlabeled text for further improvement. We jointly train on labeled and unlabeled data in a semi-supervised manner: during training, we alternate between minimizing the loss in Eq. 9 on labeled data and the CL loss in Eq. 4 on unlabeled data. We conduct the experiment on the E-commerce dataset as an example.

Table 4: A comparison of the CL approaches with and without semi-supervised learning. SEMI represents the approaches with semi-supervised learning. † indicates that the approach is significantly (p < 0.05) stronger than the approach without semi-supervised learning on the same input view.

Results in Table 4 show that the accuracy of both input views can be improved, especially for inputs without external contexts, which shows the effectiveness of CL in semi-supervised learning.
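The alternating schedule can be written as a generic loop; `step_supervised` and `step_cl` are hypothetical callables that each run one optimizer step, on the Eq. 9 loss for a labeled batch and on the CL loss alone for an unlabeled batch:

```python
def semi_supervised_epoch(labeled_batches, unlabeled_batches, step_supervised, step_cl):
    """One epoch of semi-supervised Cooperative Learning: alternate a
    supervised step on labeled data with a CL-only step on unlabeled data."""
    losses = []
    for labeled, unlabeled in zip(labeled_batches, unlabeled_batches):
        losses.append(step_supervised(labeled))  # minimizes the total loss (Eq. 9)
        losses.append(step_cl(unlabeled))        # minimizes only the CL loss (Eq. 4)
    return losses
```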

Analysis
We use the WNUT-17 dataset in the analysis.

Comparison of Re-ranking Approaches
Different re-ranking approaches may affect the token representations of the model, so we compare our approach with three alternatives. The first is the ranking from the search engine without any re-ranking. The second is re-ranking through a fuzzy match score, an approach that has been widely applied in previous work (Gu et al., 2018; Zhang et al., 2018; Hayati et al., 2018; Xu et al., 2020). The third is BERTScore with tf-idf importance weighting, which makes rare words more indicative than common words in scoring. We train our models (W/ CONTEXT) with external contexts from these re-ranking approaches and report the averaged and best results on WNUT-17 in Table 5. Our results show that re-ranking with BERTScore performs the best, which shows that semantic relevance is helpful for performance. However, with the tf-idf weighting, the accuracy of the model drops significantly (p < 0.05). A possible reason is that the tf-idf weighting gives high weights to irrelevant texts containing rare words during re-ranking.

How the Context Quality Affects Accuracy
We analyze how the NER model performs when the quality of the external contexts varies. We train and evaluate the NER model under four conditions with different contexts. The first takes each dataset split as a document and encodes each sentence with document-level contexts, following the approach of Yamada et al. (2020). The second uses GPT-2 (Radford et al., 2019) to generate 6 relevant sentences as external contexts. The other two conditions randomly select external contexts from the retrieved texts or from the dataset. Results in Table 6 show that all these conditions result in inferior accuracy compared with the model without any external context, whereas our retrieved external contexts are more semantically relevant to the input sentence and helpful for prediction.

Ablation Study
To show the effectiveness of CL, we conduct three ablation studies. The first trains the NER model on one view and predicts on the other. The second jointly trains both views without the CL loss term (removing L_CL(θ) in Eq. 9). The final one uses both CL losses to train the model (L_CL(θ) = L_CL-L2(θ) + L_CL-KL(θ) in Eq. 9). Results in Table 7 show that the external contexts can help to improve the accuracy even when the NER model is trained without them. However, when the model is trained with the external contexts, its accuracy drops when predicting on inputs without external contexts. In joint training without CL, the accuracy on inputs without contexts can be slightly improved but the accuracy on inputs with contexts drops, which shows the benefit of adding CL. For the model trained with both CL losses, we find no improvement over the models trained with a single CL loss.

Related Work
Named Entity Recognition Named Entity Recognition (Sundheim, 1995) has been studied for decades. Most work takes NER as a sequence labeling problem and applies a linear-chain CRF (Lafferty et al., 2001) to achieve state-of-the-art accuracy (Ma and Hovy, 2016; Lample et al., 2016; Akbik et al., 2018, 2019; Wang et al., 2020b). Recently, improvements in accuracy have mainly come from stronger token representations produced by pretrained contextual embeddings such as BERT (Devlin et al., 2019), Flair (Akbik et al., 2018) and LUKE (Yamada et al., 2020). Very recent work (Yu et al., 2020; Yamada et al., 2020) utilizes the strength of pretrained contextual embeddings in modeling long-range dependencies and encodes document-level contexts into the token representations to achieve state-of-the-art accuracy on the CoNLL 2002/2003 NER datasets (Tjong Kim Sang, 2002; Tjong Kim Sang and De Meulder, 2003).
Improving Models through Retrieval Retrieving related texts from a certain database (such as the training set) has been widely applied in tasks such as neural machine translation (Gu et al., 2018; Zhang et al., 2018; Xu et al., 2020), text generation (Weston et al., 2018) and semantic parsing (Hashimoto et al., 2018; Guo et al., 2019). Most of this work uses the retrieved texts to guide generation or refines the retrieved texts through the neural model, whereas we take the retrieved texts as the contexts of the input sentence to improve the semantic representations of the input tokens.
For the re-ranking models, fuzzy match scores (Gu et al., 2018; Zhang et al., 2018; Hayati et al., 2018; Xu et al., 2020), attention mechanisms (Cao et al., 2018; Cai et al., 2019), and dot products between sentence representations (Lewis et al., 2020; Xu et al., 2020) are the usual scoring functions for re-ranking retrieved texts. We instead use BERTScore to re-rank the retrieved texts, as BERTScore evaluates the semantic correlations between texts based on pretrained contextual embeddings.
Multi-View Learning Multi-view learning is a technique applied to inputs that can be split into multiple subsets. Co-training (Blum and Mitchell, 1998) and co-regularization (Sindhwani and Niyogi, 2005) train a separate model for each view. These approaches are semi-supervised learning techniques that require two independent views of the data; the model with higher confidence is applied to construct additional labeled data by predicting on unlabeled data. Sun (2013) and Xu et al. (2013) have extensively surveyed multi-view learning approaches, and Hu et al. (2021) show the effectiveness of multi-view learning on cross-lingual structured prediction tasks. Recently, Cross-View Training (CVT) was proposed, which trains a unified model instead of multiple models and minimizes the KL divergence between the probability distributions of the model and auxiliary prediction modules. Compared with CVT, CL aims to improve the accuracy of two kinds of inputs rather than only one of them, and we additionally propose to minimize the distance between the token representations of the different views besides the KL divergence. Moreover, CL utilizes the external contexts, so we do not need to construct auxiliary prediction modules in the model, and CVT cannot be directly applied to our transformer-based embeddings. Finally, the decoding layer of our model is a CRF layer instead of the simple Softmax layer used in CVT; the CRF layer is stronger but makes the KL divergence more difficult to compute.
Knowledge Distillation Knowledge distillation (Buciluǎ et al., 2006; Hinton et al., 2015) transfers the knowledge of "teacher" models to smaller "student" models by minimizing the KL divergence between the models' predicted probability distributions. In speech recognition (Huang et al., 2018) and natural language processing (Wang et al., 2020a, 2021b), the marginal probability distributions of the linear-chain CRF layer have been applied to distill knowledge from teacher models into student models. Compared with these approaches, our approach trains a single unified model instead of transferring knowledge between two models. We also show that the accuracy of both views can be improved with our approach, whereas in knowledge distillation only the student model is updated and improved.

Conclusion
In this paper, we propose to improve the accuracy of NER models by retrieving related texts from a search engine as external contexts of the inputs. To improve the robustness of the models when no external contexts are available, we propose Cooperative Learning, which adds constraints that encourage the token representations or label distributions of the two input views to be consistent. Empirical results show that our approach significantly outperforms the baseline models and previous state-of-the-art approaches on datasets over 5 domains. We also show the effectiveness of Cooperative Learning in a semi-supervised training setting.

A Retrieved Contexts Versus Document-Level Contexts on CoNLL-03

We conduct a comparison between our retrieved contexts and the document-level contexts on the CoNLL-03 datasets. In Table 8, we report the best model on the development set following Yamada et al. (2020). Compared with previous state-of-the-art approaches that encode document-level contexts, our approaches are competitive and even stronger than some of the approaches utilizing maximal document-level contexts. Compared with our model trained on document-level contexts (W/ DOC CONTEXT), there is still a gap between the document-level contexts and the retrieved contexts, but our CL approaches can reduce this gap.