Modularized Interaction Network for Named Entity Recognition

Although the existing Named Entity Recognition (NER) models have achieved promising performance, they suffer from certain drawbacks. The sequence labeling-based NER models do not perform well in recognizing long entities as they focus only on word-level information, while the segment-based NER models, which process segments instead of single words, are unable to capture the word-level dependencies within a segment. Moreover, as boundary detection and type prediction can cooperate with each other for the NER task, it is also important for the two sub-tasks to mutually reinforce each other by sharing their information. In this paper, we propose a novel Modularized Interaction Network (MIN) model which utilizes both segment-level information and word-level dependencies, and incorporates an interaction mechanism to support information sharing between boundary detection and type prediction to enhance the performance of the NER task. We have conducted extensive experiments on three NER benchmark datasets. The performance results have shown that the proposed MIN model outperforms the current state-of-the-art models.


Introduction
Named Entity Recognition (NER) is one of the fundamental tasks in natural language processing (NLP) that intends to find and classify the type of a named entity in text such as person (PER), location (LOC) or organization (ORG). It has been widely used for many downstream applications such as relation extraction (Xiong et al., 2018), entity linking (Gupta et al., 2017), question generation (Zhou et al., 2017) and coreference resolution (Barhom et al., 2019).
Currently, there are two types of methods for the NER task. The first one is sequence labeling-based methods (Lample et al., 2016; Chiu and Nichols, 2016; Luo et al., 2020), in which each word in a sentence is assigned a special label (e.g., B-PER or I-PER). Such methods can capture the dependencies between adjacent word-level labels and maximize the probability of predicted labels over the whole sentence. They have achieved state-of-the-art performance on various datasets over the years. However, NER is a segment-level recognition task. As such, the sequence labeling-based models, which focus only on word-level information, do not perform well especially in recognizing long entities (Ye and Ling, 2018). Recently, segment-based methods (Kong et al., 2016; Li et al., 2020b; Yu et al., 2020b; Li et al., 2021) have gained popularity for the NER task. They process segments (i.e., spans of words) instead of single words as the basic unit and assign a special label (e.g., PER, ORG or LOC) to each segment. As these methods adopt segment-level processing, they are capable of recognizing long entities. However, the word-level dependencies within a segment are usually ignored.
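To make the two formulations concrete, the following sketch (our own illustrative example, not code from the paper) shows the same sentence under the word-level BIO view and the segment-level span view, together with a small conversion helper:

```python
# Sketch: the same sentence under the two NER formulations.
# Sequence labeling assigns one BIO tag per word; segment-based
# labeling assigns one type per (start, end) span.
sentence = ["Emmy", "Rossum", "was", "from", "New", "York", "University"]

# Word-level (BIO) view used by sequence labeling-based models.
bio_tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "I-ORG", "I-ORG"]

# Segment-level view used by segment-based models: (start, end, type),
# with inclusive word indices.
segments = [(0, 1, "PER"), (4, 6, "ORG")]

def bio_to_segments(tags):
    """Convert a BIO tag sequence to (start, end, type) segments."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):          # sentinel to flush
        if tag.startswith("B-") or tag == "O":
            if start is not None:
                spans.append((start, i - 1, etype))
                start = None
            if tag.startswith("B-"):
                start, etype = i, tag[2:]
    return spans

assert bio_to_segments(bio_tags) == segments
```

The conversion is lossless in this direction, but a segment-based model sees each span as one unit, which is exactly why word-level dependencies inside the span need separate treatment.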
NER aims at detecting the entity boundaries and the type of a named entity in text. As such, the NER task generally contains two separate and independent sub-tasks: boundary detection and type prediction. However, from our experiments, we observe that the boundary detection and type prediction sub-tasks are actually correlated. In other words, the two sub-tasks can interact and mutually reinforce each other by sharing their information. Consider the following example sentence: "Emmy Rossum was from New York University". If we know "University" is an entity boundary, it will be more accurate to predict the corresponding entity type to be "ORG". Similarly, if we know an entity has an "ORG" type, it will be more accurate to predict that "University" is the end boundary of the entity "New York University" instead of "York" (which is the end boundary of the entity "New York"). However, sequence labeling-based models consider the boundary and type as labels, and thus such information cannot be shared between the sub-tasks to improve the accuracy. On the other hand, segment-based models first detect the segments and then classify them into the corresponding types. These methods generally cannot use entity type information in the process of segment detection and may propagate errors when passing such information from segment detection to segment classification.
In this paper, we propose a Modularized Interaction Network (MIN) model which consists of the NER Module, Boundary Module, Type Module and Interaction Mechanism for the NER task. To tackle the issue of recognizing long entities in sequence labeling-based models and the issue of utilizing word-level dependencies within a segment in segment-based models, we incorporate a pointer network (Vinyals et al., 2015) into the Boundary Module as the decoder to capture segment-level information on each word. Then, this segment-level information and the corresponding word-level information of each word are concatenated as the input to the sequence labeling-based models.
To enable information interaction, we propose to separate the NER task into the boundary detection and type prediction sub-tasks to enhance the performance of the two sub-tasks by sharing the information from each sub-task. Specifically, we use two different encoders to extract their distinct contextual representations from the two sub-tasks and propose an Interaction Mechanism to mutually reinforce each other. Finally, this information is fused into the NER Module to enhance the performance. In addition, the NER Module, Boundary Module and Type Module share the same word representations, and we apply multitask training when training the proposed MIN model.
In summary, the main contributions of this paper include: • We propose a novel Modularized Interaction Network (MIN) model which utilizes both the segment-level information from segment-based models and word-level dependencies from sequence labeling-based models in order to enhance the performance of the NER task.
• The proposed MIN model consists of the NER Module, Boundary Module, Type Module and Interaction Mechanism. We propose to separate boundary detection and type prediction into two sub-tasks and the Interaction Mechanism is incorporated to enable information sharing between the two sub-tasks to achieve the state-of-the-art performance.
• We conduct extensive experiments on three NER benchmark datasets, namely CoNLL2003, WNUT2017 and JNLPBA, to evaluate the performance of the proposed MIN model. The experimental results have shown that our MIN model has achieved the state-of-the-art performance and outperforms the existing neural-based NER models.

Related Work
In this section, we review the related work on the current approaches for Named Entity Recognition (NER). These approaches can be categorized into sequence labeling-based NER and segment-based NER.

Sequence Labeling-based NER
Sequence labeling-based NER is regarded as a sequence labeling task, where each word in a sentence is assigned a special label (e.g., B-PER, I-PER). Huang et al. (Huang et al., 2015) utilized a BiLSTM as an encoder to learn the contextual representation of words, and then Conditional Random Fields (CRFs) were used as a decoder to label the words. This architecture achieved state-of-the-art results on various datasets for many years. Inspired by the success of the BiLSTM-CRF architecture, many other state-of-the-art models have adopted it. Chiu and Nichols (Chiu and Nichols, 2016) used a Convolutional Neural Network (CNN) to capture spelling features, and the character-level and word-level embeddings are concatenated as the input to the BiLSTM-CRF network. Further, Lample et al. (Lample et al., 2016) proposed RNN-BiLSTM-CRF as an alternative. More recently, pre-trained language models such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019) have been adopted to further enhance the performance of NER.

Segment-based NER
Segment-based NER identifies segments in a sentence and classifies each segment with a special label (e.g., PER, ORG or LOC). Kong et al. (Kong et al., 2016) used a BiLSTM to map arbitrary-length segments into fixed-length vectors, and then these vectors were passed to Semi-Markov Conditional Random Fields (Semi-CRFs) for labeling the segments. Zhuo et al. (Zhuo et al., 2016) adopted a gated recursive Convolutional Neural Network instead of a BiLSTM to build a pyramid-like structure for extracting segment-level features in a hierarchical way. In recent years, Ye et al. (Ye and Ling, 2018) exploited the weighted sum of word-level representations within a segment to learn segment-level features with Semi-CRFs, which were then trained jointly at the word level with the BiLSTM-CRF network. Li et al. used a recurrent neural network encoder-decoder framework with a pointer network to detect entity segments. Li et al. (Li et al., 2020b) treated NER as a machine reading comprehension (MRC) task, where entities were extracted as retrieved answer spans. Yu et al. (Yu et al., 2020b) ranked all the spans in terms of the pairs of start and end tokens in a sentence using a biaffine model.

Proposed Model
This section presents our proposed Modularized Interaction Network (MIN) for NER. The overall model architecture is shown in Figure 1(a), which consists of the NER Module, Boundary Module, Type Module and Interaction Mechanism.

NER Module
In the NER Module, we adopt the RNN-BiLSTM-CRF model (Lample et al., 2016) as our backbone, which consists of three components: word representation, BiLSTM encoder and CRF decoder. Word Representation Given an input sentence $S = \langle w_1, w_2, \cdots, w_n \rangle$, each word $w_i$ ($1 \le i \le n$) is represented by concatenating a word-level embedding $x^w_i$ and a character-level word embedding $x^c_i$ as follows:

$x_i = [x^w_i; x^c_i]$

where $x^w_i$ is the pre-trained word embedding, and the character-level word embedding $x^c_i$ is obtained with a BiLSTM to capture orthographic and morphological information. It considers each character in the word as a vector, and then inputs them to a BiLSTM to learn the hidden states. The final hidden states from the forward and backward outputs are concatenated as the character-level word information.
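The word representation is a plain concatenation, as the following sketch shows (the dimensions match the implementation details reported later; the char-BiLSTM output is faked with a random vector, which is our own simplification):

```python
import numpy as np

# Sketch of the word representation x_i = [x_i^w; x_i^c] (hypothetical
# stand-ins): x_word plays the role of the pre-trained GloVe embedding
# and x_char the char-BiLSTM output (25 dims per direction, concatenated).
rng = np.random.default_rng(2)
word_dim, char_dim = 100, 50
x_word = rng.normal(size=word_dim)    # stands in for the GloVe embedding
x_char = rng.normal(size=char_dim)    # stands in for the char-BiLSTM output

x = np.concatenate([x_word, x_char])  # final representation of the word

assert x.shape == (word_dim + char_dim,)
```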
BiLSTM Encoder The distributed word embeddings $X = \langle x_1, x_2, \cdots, x_n \rangle$ are then fed into the BiLSTM encoder to extract the hidden sequences $H = \langle h_1, h_2, \cdots, h_n \rangle$ of all words as follows:

$\overrightarrow{h}_i = \overrightarrow{\mathrm{LSTM}}(x_i, \overrightarrow{h}_{i-1}), \quad \overleftarrow{h}_i = \overleftarrow{\mathrm{LSTM}}(x_i, \overleftarrow{h}_{i+1}), \quad h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i]$

In the NER Module, we fuse the distinct contextual boundary representation and type representation for the NER task. In addition, we also fuse the segment information from the Boundary Module to support the recognition of long entities. Note that the boundary information and type information can mutually reinforce each other. Thus, we use an interaction mechanism to reinforce them before fusing this information in the NER Module. Instead of directly concatenating this information with the hidden representations in the NER Module, we follow previous studies (Zhang et al., 2018; Yu et al., 2020a) and use a gate function to dynamically control the amount of information flowing in by infusing the expedient part while excluding the irrelevant part. The gate function uses the information from the NER Module to guide the process, which is described formally as follows:

$g_i = \sigma(W_g[h_i; a_i] + b_g), \qquad \tilde{a}_i = g_i \otimes a_i$

where $a_i$ denotes one of the auxiliary representations to be fused (the boundary representation, type representation or segment information), $\sigma$ is the sigmoid function and $\otimes$ denotes the element-wise multiplication.
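The gate computation above can be sketched in a few lines (a minimal illustration with hypothetical shapes and randomly initialized weights, not the trained parameters):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Sketch of the gate fusion: the NER hidden state h guides how much of
# an auxiliary representation a (boundary, type, or segment information)
# flows into the NER Module. Shapes and weights are hypothetical.
rng = np.random.default_rng(0)
d = 4
h = rng.normal(size=d)            # NER Module hidden state h_i
a = rng.normal(size=d)            # auxiliary representation a_i to be gated
W = rng.normal(size=(d, 2 * d))   # gate parameters W_g (random here)
b = np.zeros(d)                   # gate bias b_g

g = sigmoid(W @ np.concatenate([h, a]) + b)  # gate values in (0, 1)^d
a_gated = g * a                              # element-wise filtering

assert a_gated.shape == (d,)
```

Because each gate value lies strictly between 0 and 1, the gate attenuates rather than replaces the auxiliary signal, which is the intended "infuse the expedient part, exclude the irrelevant part" behavior.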
The final hidden representations in the NER Module are obtained by concatenating the word hidden states with the gated auxiliary representations:

$h^{NER}_i = [h_i; \tilde{h}^{Bdy}_i; \tilde{h}^{Type}_i; \tilde{h}^{Seg}_i]$

CRF Decoder CRF has been widely used in state-of-the-art NER models (Chiu and Nichols, 2016; Lample et al., 2016) to model tagging decisions when considering strong connections between output tags. For an input sentence $S = \langle w_1, w_2, \cdots, w_n \rangle$, the score of a predicted sequence of labels $y = \langle y_1, y_2, \cdots, y_n \rangle$ is defined as follows:

$score(S, y) = \sum_{i=0}^{n} T_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}$

where $T_{y_i, y_{i+1}}$ represents the score of a transition from $y_i$ to $y_{i+1}$, and $P_{i, y_i}$ is the score of the $y_i$ tag of the $i$th word in the sentence. The CRF model defines the probability of the predicted labels $y$ over all possible tag sequences in the set $Y$, that is:

$p(y \mid S) = \dfrac{e^{score(S, y)}}{\sum_{\tilde{y} \in Y} e^{score(S, \tilde{y})}}$

We maximize the log-probability of the correct sequence of labels during training. During decoding, we predict the label sequence with the maximum score:

$y^{*} = \arg\max_{\tilde{y} \in Y} score(S, \tilde{y})$
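A minimal numeric sketch of the CRF scoring and Viterbi decoding (toy emission/transition matrices of our own, with 3 tags and 3 words; not the paper's trained parameters):

```python
import numpy as np

# Minimal sketch of the CRF decoder: the score of a tag sequence is the
# sum of emission scores P[i, y_i] and transition scores T[y_i, y_{i+1}];
# Viterbi recovers the argmax sequence. Toy numbers, 3 tags.
P = np.array([[2.0, 0.5, 0.1],    # emissions P[i, tag] for 3 words
              [0.3, 1.5, 0.8],
              [0.2, 0.4, 1.7]])
T = np.array([[0.5, 0.5, -2.0],   # transitions T[from, to]
              [0.1, -1.0, 1.0],
              [0.3, -0.5, 0.8]])

def sequence_score(P, T, y):
    return sum(P[i, t] for i, t in enumerate(y)) + \
           sum(T[y[i], y[i + 1]] for i in range(len(y) - 1))

def viterbi(P, T):
    n, k = P.shape
    score, back = P[0].copy(), []
    for i in range(1, n):
        cand = score[:, None] + T + P[i][None, :]   # (prev, cur)
        back.append(cand.argmax(axis=0))
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for bp in reversed(back):
        path.append(int(bp[path[-1]]))
    return path[::-1]

best = viterbi(P, T)
# The Viterbi path must score at least as high as any fixed sequence.
assert sequence_score(P, T, best) >= sequence_score(P, T, [0, 1, 2])
```

In the full model the same dynamic program runs over the BIO tag set, and training maximizes the log-probability of the gold sequence under the normalized distribution above.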

Boundary Module
The Boundary Module needs to provide not only distinct contextual boundary information but also segment information for the NER Module. Here, we use another BiLSTM as encoder to extract distinct contextual boundary information. Inspired by BDRYBOT, a recurrent neural network encoder-decoder framework with a pointer network is used to detect entity segments for the segment information. The BDRYBOT model trains the starting boundary word of an entity to point to the corresponding ending boundary word. The other entity words in the entity are skipped. The non-entity words are pointed to a specific position. This method has achieved promising results in the boundary detection task. However, due to the variable length of entities, this model cannot exploit batch training. In addition, as the segment information of each word in an entity is the same as that of the starting boundary word, the segment information for all the words within a segment will be incorrect if the starting boundary word is wrongly detected. To avoid this problem, we improve the training process and propose a novel method to capture the segment information of each word.
We train the starting boundary word to point to the corresponding ending boundary word, and all the other words in the sentence to point to a sentinel word inactive. The process is shown in Figure 1(b). Specifically, we use another BiLSTM as encoder to obtain the distinct boundary hidden sequences $H^{Bdy} = \langle h^{Bdy}_1, h^{Bdy}_2, \cdots, h^{Bdy}_n \rangle$, and a sentinel vector is padded into the last position of the hidden sequences $H^{Bdy}$ for the sentinel word inactive. Then, a unidirectional LSTM is used as a decoder to generate the decoded state $d_j$ at each time step $j$. To add extra information to the input of the LSTM, we follow (Fernández-González and Gómez-Rodríguez, 2020) and use the sum of the hidden states of the current ($h^{Bdy}_j$), previous ($h^{Bdy}_{j-1}$) and next ($h^{Bdy}_{j+1}$) words instead of the word embedding as the input to the decoder as follows:

$s_j = h^{Bdy}_{j-1} + h^{Bdy}_j + h^{Bdy}_{j+1}, \qquad d_j = \mathrm{LSTM}(s_j, d_{j-1})$

Note that since the first word has no previous hidden state and the last word has no next hidden state, we use zero vectors in their place, shown as grey blocks in Figure 1(b). After that, we use the biaffine attention mechanism (Dozat and Manning, 2017) to generate a feature representation for each possible boundary position $i$ at time step $j$, and the Softmax function is used to obtain the probability of word $w_i$ determining an entity segment that starts with word $w_j$ and ends with word $w_i$:

$u_{j,i} = d_j^{\top} W h^{Bdy}_i + U d_j + V h^{Bdy}_i + b, \qquad p(w_i \mid w_j) = \mathrm{Softmax}(u_j)_i$
where $W$ is the weight matrix of the bi-linear term, $U$ and $V$ are the weight matrices of the linear terms, $b$ is the bias vector, and $i \in [j, n+1]$ indicates a possible position in decoding. Different from existing methods (Zhuo et al., 2016; Sohrab and Miwa, 2018) that enumerate all segments starting with word $w_j$ with equal importance, we use the probability $p(w_i \mid w_j)$ as the confidence of the segment that starts with word $w_j$ and ends with word $w_i$, and all these segments weighted by $p(w_i \mid w_j)$ are summed as the segment information of word $w_j$:

$h^{Seg}_j = \sum_{i=j}^{n+1} p(w_i \mid w_j) \odot h^{p}_{j,i}$
where $h^{p}_{j,i}$ is the representation of the segment that starts with word $w_j$ and ends with word $w_i$, and $\odot$ denotes the element-wise product.
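The probability-weighted sum over candidate end positions can be sketched with toy numbers (hypothetical scores and segment representations of our own; in the model the scores come from the biaffine attention):

```python
import numpy as np

# Sketch of the segment-information computation: for a start word w_j,
# the pointer scores over candidate end positions are normalized with a
# softmax, and the candidate segment representations are summed under
# those probabilities. All numbers here are toy values.
def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

scores = np.array([1.2, 0.1, 2.5])        # biaffine scores u_{j,i} over ends
p = softmax(scores)                       # p(w_i | w_j)
seg_reps = np.array([[1.0, 0.0],          # h^p_{j,i} for each candidate end
                     [0.0, 1.0],
                     [0.5, 0.5]])

seg_info = (p[:, None] * seg_reps).sum(axis=0)   # segment info for w_j

assert np.isclose(p.sum(), 1.0)
assert seg_info.shape == (2,)
```

Unlike hard enumeration with equal weights, the softmax concentrates the segment information on the most confident end positions while still keeping a soft contribution from the alternatives.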

Type Module
For the Type Module, we use the same network structure as in the NER Module. Given the shared input $X = \langle x_1, x_2, \cdots, x_n \rangle$, a BiLSTM is used to extract distinct contextual type information $H^{Type} = \langle h^{Type}_1, h^{Type}_2, \cdots, h^{Type}_n \rangle$, and then a CRF is used to tag the type labels.

Interaction Mechanism
As discussed in Section 1, the boundary information and type information can mutually reinforce each other. We first follow (Cui and Zhang, 2019; Qin et al., 2021) and use a self-attention mechanism over the labels of each sub-task to obtain explicit label representations. Then, we concatenate these representations with the contextual information of the corresponding sub-task to get label-enhanced contextual information. For the $i$th label-enhanced boundary contextual representation $h^{B\text{-}E}_i$, we first use the biaffine attention mechanism (Dozat and Manning, 2017) to compute the attention scores between $h^{B\text{-}E}_i$ and the label-enhanced type contextual representations $H^{T\text{-}E} = \langle h^{T\text{-}E}_1, \cdots, h^{T\text{-}E}_n \rangle$, which are computed in the same way as the biaffine scores in the Boundary Module. The normalized scores are then used to obtain the interaction representation $r^{B\text{-}E}_i$ as the attention-weighted sum of $H^{T\text{-}E}$. Then, we concatenate the $i$th label-enhanced boundary representation $h^{B\text{-}E}_i$ and the interaction representation $r^{B\text{-}E}_i$, which considers the type information, as its updated boundary representation. Similarly, we can obtain the updated type representation by considering the boundary information.
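The cross-attention step of the Interaction Mechanism can be sketched as follows (hypothetical shapes and a randomly initialized bilinear matrix; we keep only the bilinear term of the biaffine scorer for brevity):

```python
import numpy as np

# Sketch of the Interaction Mechanism's cross-attention: each boundary
# representation attends over all type representations via a bilinear
# score, and the attended summary is concatenated back onto the boundary
# representation. Weights and shapes are hypothetical.
def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(1)
n, d = 5, 8
H_bdy = rng.normal(size=(n, d))   # label-enhanced boundary representations
H_typ = rng.normal(size=(n, d))   # label-enhanced type representations
W = rng.normal(size=(d, d))       # bilinear interaction matrix

H_updated = []
for i in range(n):
    scores = H_typ @ (W @ H_bdy[i])   # score vs. every type state
    attn = softmax(scores)            # attention over type states
    r = attn @ H_typ                  # interaction representation r_i
    H_updated.append(np.concatenate([H_bdy[i], r]))
H_updated = np.stack(H_updated)

assert H_updated.shape == (n, 2 * d)
```

The symmetric direction (type representations attending over boundary representations) follows the same pattern with the roles of the two matrices swapped.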

Joint Training
There are three modules in our proposed MIN model: the NER Module, Boundary Module and Type Module. They share the same word representations. Thus, the whole model can be trained with multitask training. During training, we minimize the negative log-probability of the correct sequence of labels (as defined in the CRF decoder) for the NER Module and Type Module, while the cross-entropy loss is used for the Boundary Module:

$L_{NER} = -\log p(\hat{y}^{NER} \mid X), \qquad L_{Type} = -\log p(\hat{y}^{Type} \mid X), \qquad L_{Bdy} = -\sum_{i} \hat{y}^{Bdy}_i \log p^{Bdy}_i$

where $X$ represents the input sequence, and $\hat{y}^{NER}$ and $\hat{y}^{Type}$ represent the correct sequences of labels for the NER Module and Type Module respectively. $p^{Bdy}_i$ is the predicted probability distribution over boundary positions and $\hat{y}^{Bdy}_i$ is the gold one-hot vector for the Boundary Module. Then, the final multitask loss is a weighted sum of the three losses:

$L = \lambda_1 L_{NER} + \lambda_2 L_{Bdy} + \lambda_3 L_{Type}$
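The weighted combination is straightforward; the sketch below uses toy per-module loss values and hypothetical weights $\lambda$ (the paper does not report the weight values here):

```python
# Sketch of the multitask objective: the final loss is a weighted sum of
# the NER, boundary and type losses computed on the shared word
# representations. Loss values and weights below are hypothetical.
loss_ner, loss_bdy, loss_type = 0.8, 0.5, 0.6   # toy per-module losses
lam_ner, lam_bdy, lam_type = 1.0, 0.5, 0.5      # hypothetical lambdas

loss = lam_ner * loss_ner + lam_bdy * loss_bdy + lam_type * loss_type

assert abs(loss - 1.35) < 1e-9
```

Because all three modules share the word representations, a single backward pass through this combined loss updates the shared parameters with gradients from all three tasks at once.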

Experiments
In this section, we first introduce the datasets, baseline models and implementation details. Then, we present the experimental results on three benchmark datasets. Moreover, an ablation study is also conducted. Finally, we provide further analysis.
Datasets

• CoNLL2003 - It is collected from Reuters news articles. Four different types of named entities including PER, LOC, ORG and MISC are defined by the CoNLL 2003 NER shared task.
• WNUT2017 - It is a set of noisy user-generated text including YouTube comments, StackExchange posts, Twitter text, and Reddit comments. Six types of entities including PER, LOC, Group, Creative work, Corporation and Product are annotated.
• JNLPBA - It is collected from MEDLINE abstracts. Five types of entities including DNA, RNA, protein, cell line and cell type are annotated. Table 1 presents the statistics of these datasets.

Baseline Models
We compare the proposed MIN model with several baseline models including sequence labeling-based models and segment-based models. The compared sequence labeling-based models include: • CNN-BiLSTM-CRF (Chiu and Nichols, 2016) - This model utilizes a CNN to capture character-level word features, and then the character-level and word-level embeddings are concatenated as the input to the BiLSTM-CRF network. It is a classical baseline for NER. The compared segment-based models include: • BiLSTM-Pointer - This model uses a BiLSTM as the encoder and another unidirectional LSTM with pointer networks as the decoder for entity boundary detection. Then, the entity segments generated by the decoder are classified with the Softmax classifier for NER.
• HSCRF (Ye and Ling, 2018) - This model exploits the weighted sum of word-level representations within a segment to learn segment-level features with Semi-CRFs, which is then trained jointly at the word level with the BiLSTM-CRF network.

• MRC+BERT (Li et al., 2020b) -This model formulates the NER task as a machine reading comprehension task.
• Biaffine+BERT (Yu et al., 2020b) -This model ranks all the spans in terms of the pairs of start and end tokens in a sentence using a biaffine model.

Implementation Details
Our proposed MIN model is implemented with the PyTorch framework. We use 100-dimensional pre-trained GloVe word embeddings (Pennington et al., 2014). The character embeddings are initialized randomly as 25-dimensional vectors. When training the model, both embeddings are updated along with the other parameters. We use the Adam optimizer (Kingma and Ba, 2014) for training with mini-batches. The initial learning rate is set to 0.01 and is decayed by 5% after each epoch; the dropout rate is set to 0.5, the hidden layer size to 100, and the gradient clipping to 5. We report the results based on the best performance on the development set. All of our experiments are conducted on the same machine with an 8-core Intel(R) Xeon(R) E5-1630 CPU@3.70GHz and two Nvidia GeForce-GTX GPUs. Following the work in (Ye and Ling, 2018), the maximum segment length for the segment information discussed in Section 3.2 is set to 6 for better computational efficiency. Table 2 shows the experimental results of our proposed MIN model and the baseline models. In Table 2, when compared with models that do not use any language models or external knowledge, we observe that our MIN model outperforms all the compared baseline models in terms of precision, recall and F1 scores, and achieves 0.57%, 4.77% and 3.26% improvements in F1 scores on the CoNLL2003, WNUT2017 and JNLPBA datasets respectively. Among the compared models, the F1 scores of the BiLSTM-Pointer model are generally lower than those of the other models. This is because it does not utilize the word-level dependencies within a segment and also suffers from boundary error propagation between boundary detection and type prediction. The CNN-BiLSTM-CRF and RNN-BiLSTM-CRF models achieve similar performance on the three datasets, both performing worse than HCRA and HSCRF.
The HCRA model uses sentence-level and document-level representations to augment the contextualized word representation, while the HSCRF model considers the segment-level and word-level information with multitask training. However, the HCRA model does not consider segment-level information, and the HSCRF model does not directly model the word-level dependencies within a segment. In addition, none of the above models share information between the boundary detection and type prediction sub-tasks. Our MIN model achieves the best performance as it is capable of considering all of this information.

Experimental Results
When pre-trained language models such as ELMo and BERT are incorporated, all the models achieve better performance. In particular, we observe that our MIN model achieves 0.95%, 3.83% and 2.73% improvements in F1 scores on the CoNLL2003, WNUT2017 and JNLPBA datasets respectively when compared with the other models. These results are consistent with those observed for the models without any pre-trained language models.

Ablation Study
To show the importance of each component in our proposed MIN model, we conduct an ablation experiment on the Boundary Module, Type Module and Interaction Mechanism. As shown in Table 3, we can see that all these components contribute significantly to the effectiveness of our MIN model.
The discussion on the effectiveness of each component is given with respect to the three datasets. The Boundary Module improves the F1 scores by 1.13%, 3.58% and 2.1% for CoNLL2003, WNUT2017 and JNLPBA respectively. This is because it not only provides segment-level information for the NER Module but also provides the boundary information for the Type Module. As such, it helps recognize long entities and predict the entity types more accurately.
The Type Module improves the F1 scores by 1.02%, 2.81% and 1.42% for CoNLL2003, WNUT2017 and JNLPBA respectively. This is because it provides the type information for the Boundary Module, which can help detect entity boundaries more accurately. In addition, it can also help obtain more effective segment information.

The Interaction Mechanism achieves 0.54%, 1.86% and 0.72% improvements in F1 scores for CoNLL2003, WNUT2017 and JNLPBA respectively. As it bridges the gap between the Boundary Module and Type Module for information interaction and sharing, it can help improve the performance of boundary detection and type prediction simultaneously.
Overall, the different components of the proposed model work effectively with each other under multitask training and enable the model to achieve the state-of-the-art performance for the NER task.

Performance Against Entity Length
As our proposed MIN model is capable of recognizing long entities, we compare the performance of our MIN model with RNN-BiLSTM-CRF and HSCRF. Note that the RNN-BiLSTM-CRF model is the base model used in our MIN model, and the HSCRF model also considers segment-level and word-level information with multitask training. The results are shown in Figure 2. The experiment is conducted on the CoNLL2003 test dataset. We follow the setting in (Ye and Ling, 2018) and group the data according to the entity length from 1 to ≥ 6. We observe that our MIN model and the HSCRF model consistently outperform RNN-BiLSTM-CRF in each group. In particular, the improvement is obvious when the entity length is longer than 4 because both our MIN model and the HSCRF model consider segment-level information. However, our MIN model performs better than the HSCRF model in each group. More specifically, when the entity length is longer than 4, our MIN model shows a substantial improvement over HSCRF. This is because the HSCRF model directly uses segment-level features with Semi-CRFs to tag the segments, which ignores the word-level dependencies within the segment. In contrast, our MIN model combines segment-level information with word-level dependencies within a segment for the NER task.

Conclusion
In this paper, we have proposed a novel Modularized Interaction Network (MIN) model for the NER task. The proposed MIN model utilizes both segment-level information and word-level dependencies, and incorporates an interaction mechanism to support information sharing between boundary detection and type prediction to enhance the performance of the NER task. We have conducted extensive experiments on three NER benchmark datasets. The experimental results have shown that our proposed MIN model achieves the state-of-the-art performance.