Explicitly Capturing Relations between Entity Mentions via Graph Neural Networks for Domain-specific Named Entity Recognition

Named entity recognition (NER) is well studied for the general domain, and recent systems have achieved human-level performance for identifying common entity types. However, NER performance is still moderate for specialized domains that tend to feature complicated contexts and jargonistic entity types. To address these challenges, we propose explicitly connecting entity mentions based on both global coreference relations and local dependency relations to build better entity mention representations. In our experiments, we incorporate entity mention relations via Graph Neural Networks and show that our system noticeably improves NER performance on two datasets from different domains. We further show that the proposed lightweight system can effectively elevate NER performance to a higher level even when only a tiny amount of labeled data is available, which is desirable for domain-specific NER.


Introduction
Named entity recognition (NER) has been well studied for the general domain, and recent systems have achieved close to human-level performance for identifying a small number of common NER types, such as Person and Organization, mainly benefiting from the use of Neural Network models (Ma and Hovy, 2016; Yang and Zhang, 2018) and pretrained Language Models (LMs) (Akbik et al., 2018; Devlin et al., 2019). However, the performance is still moderate for specialized domains that tend to feature diverse and complicated contexts as well as a richer set of semantically related entity types (e.g., Cell, Tissue, Organ, etc. for the biomedical domain). With these challenges in view, we hypothesize that being aware of the re-occurrences of the same entity as well as semantically related entities will lead to better NER performance for specific domains. The code for the system is available here: https://github.com/brickee/EnRel-G
Therefore, we propose to explicitly connect entity mentions in a document that are coreferential or in a tight semantic relation, in order to learn better entity mention representations. Specifically, as shown in Figure 1, we first connect repeated mentions of the same entity even if they are sentences apart. For example, the named entity "tumor vasculature" appears in both the Title and sentence S6, but in quite different contexts. Connecting the repeated mentions in a document enables the integration of contextual cues and encourages consistent predictions of their entity types.
Second, we also connect entity mentions based on sentence-level dependency relations to effectively identify semantically related entities. For example, the two entities in sentence S3, "bone marrow" of the type Multi-tissue Structure and "endothelial progenitors" of the type Cell, are respectively the subject and object of the predicate "contains" in the dependency tree. If the system can reliably predict the type of one entity, we can more easily infer the type of the other, knowing that they are closely related in the dependency tree.
We incorporate both relations using Graph Neural Networks (GNNs); specifically, we use Graph Attention Networks (GATs) (Velickovic et al., 2018), which have been shown effective for a range of tasks (Sui et al., 2019; Linmei et al., 2019). Empirical results show that our lightweight method can learn better word representations for sequence tagging models and further improve NER performance over strong LM-based baselines on two datasets: the AnatEM (Pyysalo and Ananiadou, 2014) dataset from the biomedical domain and the Mars (Wagstaff et al., 2018) dataset from the planetary science domain. In addition, considering the challenge of limited annotations for domain-specific NER, we plot learning curves and show that leveraging relations between entity mentions can effectively and consistently improve NER performance when only limited annotations are available.

Related Work
NER research has a long history, and recent approaches (Yang and Zhang, 2018; Jiang et al., 2019; Jie and Lu, 2019; Li et al., 2020) using Neural Network models like BiLSTM-CNN-CRF (Ma and Hovy, 2016) and contextual embeddings such as BERT (Devlin et al., 2019) and FLAIR (Akbik et al., 2018) have improved NER performance in the general domain to human level. However, NER performance for specific domains is still moderate due to the challenges of limited annotations and complicated domain-specific contexts.
We aim to further improve NER performance by considering coreference relations and semantic relations between entity mentions. This is in contrast to the usual view of NER as an upstream task conducted before coreference resolution or entity relation extraction. The idea aligns with recent works that conduct joint inference among multiple information extraction tasks (Miwa and Bansal, 2016; Li et al., 2017; Bekoulis et al., 2018; Luan et al., 2019; Sui et al., 2020; Yuan et al., 2020), including NER, coreference resolution, and relation extraction, by mining dependencies among the extractions. However, joint inference approaches require annotations for all the target tasks and aim to improve performance on all of them, while our lightweight approach aims to improve the basic NER task and requires no additional annotations (which are usually unavailable for specific domains).
Our approach is also related to several recent neural approaches for NER that encourage label dependencies among entity mentions. The Pooled FLAIR model (Akbik et al., 2019) proposed a global pooling mechanism to learn word representations. Dai et al. (2019) used a coreference layer with a regularizer to harmonize word representations. Closely related to our work, Qian et al. (2019) also used graph neural networks to capture repetitions of the same word, but in a denser graph that includes edges between adjacent words and is meant to completely overlay the lower encoding layers. Memory networks (Gui et al., 2020; Luo et al., 2020) have also been used to store and refine the predictions of a base model by considering repetitions or co-occurrences of words. In addition, dependency relations have been commonly used to connect entities for relation extraction (Bunescu and Mooney, 2005), but we aim to better infer the type of an entity by associating it with other closely related entities in a sentence.

Model Architecture
Our system with Entity Relation Graphs (EnRel-G) mainly consists of five layers, as shown in Figure 2: an embedding layer, an encoding layer, a GNNs layer, a fusion layer, and a decoding layer.

Embedding Layer
We choose the BERT-base LM as our embedding layer. For the domain-specific datasets, we use BioBERT (Lee et al., 2020) for the AnatEM dataset and SciBERT (Beltagy et al., 2019) for the Mars dataset. Given an input word sequence $W = [w_1, w_2, \ldots, w_n]$ with $n$ words, the BERT model outputs a contextual word embedding matrix $E = [w_1, w_2, \ldots, w_n] \in \mathbb{R}^{n \times d_1}$, with a $d_1$-dimensional vector for each word.
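As a rough illustration of this step, the sketch below extracts word-level contextual embeddings with the Hugging Face transformers library. The checkpoint name and the first-subtoken pooling strategy are our assumptions, not details specified in the paper.

```python
# Minimal sketch of the embedding layer (assumptions: Hugging Face transformers,
# a BioBERT checkpoint, and first-subtoken pooling to get one vector per word).
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_NAME = "dmis-lab/biobert-base-cased-v1.1"  # assumed checkpoint for AnatEM
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
bert = AutoModel.from_pretrained(MODEL_NAME)

words = ["Bone", "marrow", "contains", "endothelial", "progenitors", "."]
enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
with torch.no_grad():
    hidden = bert(**enc).last_hidden_state[0]          # (num_subtokens, d1)

# Pool subtoken vectors back to one vector per word (first subtoken here).
word_ids = enc.word_ids(0)
first_idx = {}
for pos, wid in enumerate(word_ids):
    if wid is not None and wid not in first_idx:
        first_idx[wid] = pos
E = torch.stack([hidden[first_idx[i]] for i in range(len(words))])  # (n, d1)
print(E.shape)  # torch.Size([6, 768])
```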

Encoding Layer
To capture sequential context information, we use a BiLSTM layer to encode the word embeddings from the BERT model. We concatenate the forward and backward LSTM hidden states as the encoded representations, obtaining the embedding matrix $E^{lstm} = \mathrm{BiLSTM}(E) \in \mathbb{R}^{n \times d_2}$, with a $d_2$-dimensional vector for each word.
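A minimal sketch of this encoding step in PyTorch, with assumed dimensions (the per-direction hidden size is an assumption; PyTorch concatenates the two directions, so the output is twice that size):

```python
# Sketch of the encoding layer: one BiLSTM over the BERT embeddings,
# concatenating forward and backward hidden states.
import torch
import torch.nn as nn

d1, hidden = 768, 256        # BERT dim and per-direction LSTM dim (assumed)
bilstm = nn.LSTM(input_size=d1, hidden_size=hidden, num_layers=1,
                 batch_first=True, bidirectional=True)

E = torch.randn(1, 6, d1)    # (batch, n, d1) embeddings from the layer above
E_lstm, _ = bilstm(E)        # (batch, n, 2 * hidden), i.e., d2 = 512 here
print(E_lstm.shape)          # torch.Size([1, 6, 512])
```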

Graph Neural Networks Layer
For the GNNs layer, we first introduce how to build Entity Relation Graphs using global coreference relations (coreference graph, C-graph) and local dependency relations (dependency graph, D-graph) between entities, and then describe how the GNNs model incorporates them into the word representations.
Coreference Relation Graph For each document, we build a graph $G^C = (V, A^C)$ based on coreference relations, in which $V$ is the set of nodes denoting all the words in the document and $A^C$ is the adjacency matrix. Specifically, we approximate entity coreference relations using three syntactic coreference clues, as in Figure 1: (1) Exact Match, two nouns are connected if they are identical, e.g., "tumor vasculature" in both the Title and S6; (2) Lemma Match, two nouns are linked if they have the same lemma, e.g., "progenitors" and "progenitor" in S3 and S6; (3) Acronym Match, an acronym is connected to all the words of its full expression, e.g., "VEGF" and "vascular endothelial growth factor" in S6. For each connected node pair $(i, j)$, we set $A^C_{i,j} = 1$. We also add a self-connection to each node ($A^C_{i,i} = 1$) to preserve each word's original semantic information.
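The following sketch builds such an adjacency matrix for one document. It is a simplified stand-in: the noun (POS) filter is omitted, the lemmas are assumed to come from scispaCy, and the acronym-matching rule shown here is our own assumption.

```python
# Sketch of the coreference graph adjacency matrix A^C for one document.
import numpy as np

def build_coref_graph(words, lemmas, acronym_map):
    """words/lemmas: one entry per token; acronym_map: {"VEGF": {"vascular", ...}}."""
    n = len(words)
    A = np.eye(n)                                  # self-connections A[i, i] = 1
    lower = [w.lower() for w in words]
    for i in range(n):
        for j in range(i + 1, n):
            if lower[i] == lower[j]:               # (1) Exact Match
                A[i, j] = A[j, i] = 1
            elif lemmas[i] == lemmas[j]:           # (2) Lemma Match
                A[i, j] = A[j, i] = 1
    for acro, full_words in acronym_map.items():   # (3) Acronym Match
        idx_a = [i for i, w in enumerate(words) if w == acro]
        idx_f = [i for i, w in enumerate(words) if w.lower() in full_words]
        for i in idx_a:
            for j in idx_f:
                A[i, j] = A[j, i] = 1
    return A

words = ["VEGF", "binds", "vascular", "endothelial", "growth", "factor", "receptors"]
lemmas = ["vegf", "bind", "vascular", "endothelial", "growth", "factor", "receptor"]
A_C = build_coref_graph(words, lemmas,
                        {"VEGF": {"vascular", "endothelial", "growth", "factor"}})
```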
Dependency Relation Graph We build a Dependency Relation Graph $G^D = (V, A^D)$ for each document based on sentence-level dependency relations. We first parse each sentence using the scispaCy tool and then connect the following word pairs in the dependency tree: (1) subject head word, object head word, and their predicate: we connect them to enhance the interactions between the entities in the subject and object, e.g., "marrow" and "progenitors" with the predicate "contains" in S3; (2) compound and head word: we connect compounds with their head words because they often occur together within an entity, e.g., "bone" and "marrow" in S3. As before, we set $A^D_{i,j} = 1$ for each connected pair $(i, j)$ and add a self-connection ($A^D_{i,i} = 1$) for each node.
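A sketch of this graph construction is shown below, assuming the scispaCy "en_core_sci_sm" model is installed; the exact set of dependency labels treated as subjects and objects is our assumption.

```python
# Sketch of the dependency graph A^D, using spaCy-style parses from scispaCy.
import numpy as np
import spacy

nlp = spacy.load("en_core_sci_sm")

def build_dep_graph(doc):
    n = len(doc)
    A = np.eye(n)                                   # self-connections
    for tok in doc:
        if tok.dep_ == "compound":                  # (2) compound <-> head word
            A[tok.i, tok.head.i] = A[tok.head.i, tok.i] = 1
    for verb in doc:
        # (1) connect subject head, object head, and their predicate
        args = [c.i for c in verb.children
                if c.dep_ in ("nsubj", "nsubjpass", "dobj", "obj")]
        group = args + [verb.i]
        for i in group:
            for j in group:
                A[i, j] = 1
    return A

doc = nlp("Bone marrow contains endothelial progenitors.")
A_D = build_dep_graph(doc)
```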
We then update the encoded word embeddings over the entity relation graphs using GNNs, specifically the GATs. Since the nodes represent the words in a document, we initialize the node representations in the graphs from the encoding layer. The graph attention mechanism updates the initial representation of node $w^{lstm}_i$ to $w^{gnn}_i$ by aggregating its neighbors' representations with their corresponding normalized attention scores:

$w^{gnn}_i = \big\Vert_{k=1}^{K} \, \sigma\Big(\sum_{j \in N_i} \alpha^k_{ij} W^k w^{lstm}_j\Big)$    (1)

As in equation (1), we use $K$ attention heads and concatenate ($\Vert$) their outputs as the final representation. For head $k$, we weight all the adjacent nodes ($N_i$, obtained from the adjacency matrix $A$) by $W^k$ and then aggregate them with the attention scores $\alpha^k_{ij}$; $\sigma$ is the LeakyReLU activation function. The attention score $\alpha^k_{ij}$ is obtained as follows ($a^T$ is a weight vector):

$\alpha^k_{ij} = \dfrac{\exp\big(\mathrm{LeakyReLU}\big(a^T [W^k w^{lstm}_i \,\Vert\, W^k w^{lstm}_j]\big)\big)}{\sum_{l \in N_i} \exp\big(\mathrm{LeakyReLU}\big(a^T [W^k w^{lstm}_i \,\Vert\, W^k w^{lstm}_l]\big)\big)}$    (2)

For each of the two relation graphs, we use an independent graph attention layer. The output word representations from the two GATs are denoted as $E^{gnn}_C$ and $E^{gnn}_D$, respectively.
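As a rough illustration, the following is a minimal multi-head GAT layer corresponding to equations (1) and (2). The dimensions, the masked-softmax formulation over the dense adjacency matrix, and the use of LeakyReLU as the outer activation follow the description above but are otherwise our assumptions; a production system might instead use an existing GAT implementation.

```python
# Minimal single-layer, multi-head GAT sketch over a dense 0/1 adjacency matrix.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATLayer(nn.Module):
    def __init__(self, d_in, d_out, heads=4):
        super().__init__()
        self.heads = heads
        self.W = nn.ModuleList(nn.Linear(d_in, d_out, bias=False) for _ in range(heads))
        self.a = nn.ParameterList(nn.Parameter(torch.randn(2 * d_out)) for _ in range(heads))

    def forward(self, H, A):
        # H: (n, d_in) node states; A: (n, n) adjacency with self-loops
        outs = []
        for k in range(self.heads):
            Wh = self.W[k](H)                                    # (n, d_out)
            n = Wh.size(0)
            pair = torch.cat([Wh.unsqueeze(1).expand(n, n, -1),
                              Wh.unsqueeze(0).expand(n, n, -1)], dim=-1)
            e = F.leaky_relu(pair @ self.a[k])                   # raw scores, (n, n)
            e = e.masked_fill(A == 0, float("-inf"))             # keep only neighbors N_i
            alpha = torch.softmax(e, dim=-1)                     # alpha_ij^k, eq. (2)
            outs.append(F.leaky_relu(alpha @ Wh))                # aggregate, eq. (1)
        return torch.cat(outs, dim=-1)                           # concatenate K heads

H = torch.randn(6, 512)      # encoded word states from the BiLSTM (assumed size)
A = torch.eye(6)             # adjacency from the coreference or dependency graph
E_gnn = GATLayer(512, 128, heads=4)(H, A)   # (6, 512) after concatenating 4 heads
```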

Fusion Layer
Similar to Sui et al. (2019), we use a fusion layer to blend the encoded word embeddings and the GNN-updated word embeddings. We first project these embeddings into the same hidden space using linear transformations and then add them, which gives a feature matrix $F \in \mathbb{R}^{n \times d_4}$ for the $n$ words, blending the sequential context information with the global entity relations.
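A minimal sketch of this "project and add" fusion, with assumed dimensions (any combination rule beyond projecting and summing is not specified above):

```python
# Sketch of the fusion layer: project the BiLSTM states and the two GAT outputs
# into a shared space with linear maps and sum them.
import torch
import torch.nn as nn

d2, d3, d4 = 512, 512, 256          # assumed dimensions
proj_lstm = nn.Linear(d2, d4)
proj_coref = nn.Linear(d3, d4)
proj_dep = nn.Linear(d3, d4)

E_lstm = torch.randn(6, d2)         # encoder output
E_gnn_C = torch.randn(6, d3)        # output of the coreference-graph GAT
E_gnn_D = torch.randn(6, d3)        # output of the dependency-graph GAT

F_fused = proj_lstm(E_lstm) + proj_coref(E_gnn_C) + proj_dep(E_gnn_D)   # (n, d4)
```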

Decoding Layer
Finally, a Conditional Random Field (CRF) (Lafferty et al., 2001) layer is used to decode the enriched embeddings $F = [f_1, f_2, \ldots, f_n]$ into a sequence of labels $y = \{y_1, y_2, \ldots, y_n\}$. In the training phase, we optimize the whole model by minimizing the negative log-likelihood loss with respect to the gold labels.

Baselines

Pooled FLAIR The Pooled FLAIR model (Akbik et al., 2019) is an extended version of the FLAIR model with a global memory and pooling mechanism for repeated occurrences of the same word, which helps make consistent predictions for coreferential entity mentions. We also use its embeddings with a BiLSTM-CRF architecture as a baseline.
Tuning Bio/SciBERT We also use Bio/SciBERT with a BiLSTM-CRF architecture as baselines for the AnatEM/Mars datasets; compared with our system, these baselines do not have the GNNs layer or the fusion layer.

Results
To reduce the effect of random variation, we train all the systems five times using different random seeds and evaluate their average performance on the test sets using the same evaluation script.

One main limitation of domain-specific NER systems is the lack of annotations; it is therefore vital to make the best use of the labeled data. The learning curves (Figure 3) show that leveraging the relations between entity mentions can effectively elevate the NER performance to a higher level even when only a tiny amount of labeled data (a quarter of the training data) is available, and this is true on both the AnatEM dataset and the Mars dataset.

Although fine-tuning pretrained LMs has improved the performance of many NLP tasks, one limitation is the increase in training time, so it is important to build computationally efficient models on top of pretrained LMs. As shown in Figure 4, our model with the GNNs layer does not increase the time cost of fine-tuning the BERT models; the training time with or without the GNNs layer is similar.

Edge selection in the Dependency Graph
To build the sentence-level dependency graph, we selected only two types of dependency relations: between the subject, the object, and their predicate (Key Edges), and between a compound modifier and its head word. As shown in Table 2, we also tried connecting all modifiers to their head words and found that this yields slightly worse performance; the reason may be that many modifiers other than compounds are not parts of entities themselves. In addition, including all dependency edges also yields worse performance than using the two selected types of dependency relations, probably for the same reason: many nodes in a dependency tree are not parts of entity mentions, and many dependency relations do not directly contribute to capturing relations between entities.

Conclusion
In this work, we explicitly capture the global coreference relations and local dependency relations between entity mentions, and use graph neural networks to incorporate these relations to improve domain-specific NER. Experimental results on two datasets show the effectiveness of this lightweight approach. We also find that the selection of entity relations is important to system performance. Future work may consider using GNNs to incorporate external knowledge for further performance improvement.

Appendix B: Data Preprocessing
We want our model to take advantage of document-level information, but some of the documents are extremely long, and the BERT model has a limit of 512 subtokens for input texts, so we need to split the long documents. In addition, the BERT language model needs a large enough batch size (e.g., 16 or 32) to be fine-tuned well, which is also a burden on GPU memory. In consideration of these restrictions, we limit the maximum subtoken count of a split document to 128 during data preprocessing. Future work with more computing resources may try longer input documents.
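A minimal sketch of such a splitting step is shown below; splitting at sentence boundaries and the placeholder tokenizer checkpoint are our assumptions.

```python
# Sketch of splitting long documents so each chunk stays under a subtoken budget
# (128, as in the paper), accumulating whole sentences until the budget is reached.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")  # placeholder checkpoint
MAX_SUBTOKENS = 128

def split_document(sentences, max_subtokens=MAX_SUBTOKENS):
    chunks, current, count = [], [], 0
    for sent in sentences:
        n_sub = len(tokenizer.tokenize(sent))
        if current and count + n_sub > max_subtokens:
            chunks.append(current)
            current, count = [], 0
        current.append(sent)
        count += n_sub
    if current:
        chunks.append(current)
    return chunks

doc = ["Tumor vasculature is abnormal.", "Bone marrow contains endothelial progenitors."]
print(split_document(doc))
```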
We also add POS and dependency tree information to the data using scispaCy, for constructing the Coreference Graph and the Dependency Graph in our model.

Appendix C: Model Settings
For the NCRF++ baseline, we use one layer of BiLSTM for word sequence representation with 300-dim GloVe (Pennington et al., 2014) word embeddings, a character sequence representation with 50-dim randomly initialized character embeddings, and a CRF layer for inference. For the FLAIR and Pooled FLAIR baselines, we use the PubMed version (pretrained on biomedical corpora) for the AnatEM dataset and the general English version (pretrained on English news articles) for the Mars dataset. In particular, for the Pooled FLAIR model, we use the mean pooling mechanism to calculate the average of the embeddings of multiple occurrences of a word, and then use it as the representation for that word.
For the Tuning BERT baselines, we use BioBERT-Base v1.1 for the AnatEM dataset and SciBERT-scivocab-uncased for the Mars dataset.
For our EnRel-G system, we keep the embedding layer the same as in the Tuning BERT baselines. For the GNNs layer, we use one graph attention layer with 4 heads, each with a hidden dimension of 128.
For the optimization-related parameters, as shown in Table 4, we mainly use the recommended settings for the baseline models. For our EnRel-G system, we keep the same parameters as the Tuning BERT baseline for a fair comparison.
We train all the systems on a single Nvidia GeForce GTX 2080Ti GPU. We set the maximum number of epochs to 100 and use the model that performs best on the development set to evaluate on the test data.