Locate and Label: A Two-stage Identifier for Nested Named Entity Recognition

Named entity recognition (NER) is a well-studied task in natural language processing. Traditional NER research only deals with flat entities and ignores nested entities. The span-based methods treat entity recognition as a span classification task. Although these methods have the innate ability to handle nested NER, they suffer from high computational cost, ignorance of boundary information, under-utilization of the spans that partially match with entities, and difficulties in long entity recognition. To tackle these issues, we propose a two-stage entity identifier. First we generate span proposals by filtering and boundary regression on the seed spans to locate the entities, and then label the boundary-adjusted span proposals with the corresponding categories. Our method effectively utilizes the boundary information of entities and partially matched spans during training. Through boundary regression, entities of any length can be covered theoretically, which improves the ability to recognize long entities. In addition, many low-quality seed spans are filtered out in the first stage, which reduces the time complexity of inference. Experiments on nested NER datasets demonstrate that our proposed method outperforms previous state-of-the-art models.


Introduction
Named entity recognition (NER) is a fundamental task in natural language processing, focusing on identifying the spans of text that refer to entities. NER is widely used in downstream tasks, such as entity linking (Ganea and Hofmann, 2017;Le and Titov, 2018) and relation extraction (Li and Ji, 2014;Miwa and Bansal, 2016).
Previous works usually treat NER as a sequence labeling task, assigning a single tag to each token in a sentence. Such models lack the ability to identify nested named entities. Various approaches for nested NER have been proposed in recent years. Some works revised sequence models to support nested entities using different strategies (Alex et al., 2007;Ju et al., 2018;Straková et al., 2019;Wang et al., 2020a), and some works adopt hypergraphs to capture all possible entity mentions in a sentence (Lu and Roth, 2015;Katiyar and Cardie, 2018). We focus on the span-based methods (Sohrab and Miwa, 2018;Zheng et al., 2019;Tan et al., 2020), which treat named entity recognition as a classification task over spans and thus have the innate ability to recognize nested named entities. For example, Sohrab and Miwa (2018) exhaust all possible spans in a text sequence and then predict their categories. However, these methods suffer from some serious weaknesses. First, due to numerous low-quality candidate spans, these methods require high computational cost. Second, it is hard to identify long entities because the length of the spans enumerated during training is bounded. Third, boundary information is not fully utilized, although it is important for the model to locate entities.
Although some methods (Zheng et al., 2019;Tan et al., 2020) have used a sequence labeling model to predict boundaries, without dynamic adjustment the boundary information is still not fully utilized. Finally, the spans that partially match with entities are not effectively utilized: these methods simply treat the partially matched spans as negative examples, which can introduce noise into the model. Different from the above studies, we observe that NER and the object detection task in computer vision have a high degree of consistency. Both need to locate regions of interest (ROIs) in the context (image/text) and then assign corresponding categories to them. Furthermore, both flat NER and nested NER have corresponding structures in the object detection task, as shown in Figure 1. For the flat structure, there is no overlap between entities or between objects, while for nested structures, fine-grained entities are nested inside coarse-grained entities, just as small objects are nested inside large objects. In computer vision, two-stage object detectors (Girshick et al., 2014;Girshick, 2015;Ren et al., 2017;Dai et al., 2016;Cai and Vasconcelos, 2018) are the most popular object detection algorithms. They divide the detection task into two stages, first generating candidate regions and then classifying and fine-tuning the positions of these candidate regions.
Inspired by these, we propose a two-stage entity identifier and treat NER as a joint task of boundary regression and span classification to address the weaknesses mentioned above. In the first stage, we design a span proposal module, which contains two components: a filter and a regressor. The filter divides the seed spans into contextual spans and span proposals, and filters out the former to reduce the number of candidate spans. The regressor locates entities by adjusting the boundaries of the span proposals to improve the quality of the candidate spans. In the second stage, we use an entity classifier to label entity categories for the span proposals, which are now fewer in number and higher in quality. During training, to better utilize the spans that partially match with entities, we construct soft examples by weighting the loss of the model based on the IoU. In addition, we apply the soft non-maximum suppression (Soft-NMS) (Bodla et al., 2017) algorithm to entity decoding to drop false positives.
Our main contributions are as follows:

• Inspired by the two-stage detectors popular in object detection, we propose a novel two-stage identifier for NER that locates entities first and labels them later. We treat NER as a joint task of boundary regression and span classification.
• We make effective use of boundary information. Taking the identification of entity boundaries a step further, our model can adjust the boundaries to accurately locate entities. When training the boundary regressor, in addition to the boundary-level Smooth L1 loss, we also use a span-level loss, which measures the overlap between two spans.
• During training, instead of simply treating the partially matched spans as negative examples, we construct soft examples based on the IoU. This not only alleviates the imbalance between positive and negative examples, but also effectively utilizes the spans which partially match with the ground-truth entities.
• Experiments show that our model achieves state-of-the-art performance consistently on the KBP17, ACE04 and ACE05 datasets, and outperforms several competing baseline models on F1-score by +3.08% on KBP17, +0.71% on ACE04 and +1.27% on ACE05.

Figure 2 illustrates an overview of the model structure. We first obtain the word representations through the encoder and generate seed spans. Among these seed spans, those with higher overlap with the entities are the span proposals, and the others with lower overlap are the contextual spans. In the span proposal module, we use a filter to keep the span proposals and drop the contextual spans. Meanwhile, a regressor regresses the boundaries of each span to locate the left and right boundaries of entities. Next, we adjust the boundaries of the span proposals based on the output of the regressor, and then feed them into the entity classifier module. Finally, the entity decoder decodes the entities using the Soft-NMS algorithm. We cover the details of our model in the following sections.

Token Representation
For the i-th word in a sentence of n words, we represent it by concatenating its word embedding x_i^w, contextualized word embedding x_i^lm, part-of-speech (POS) embedding x_i^pos and character-level embedding x_i^char. The character-level embedding is generated by a BiLSTM module with the same settings as Ju et al. (2018). For the contextualized word embedding, we follow Yu et al. (2020) to obtain the context-dependent embedding for a target token with one surrounding sentence on each side. Then, the concatenation of these embeddings is fed into another BiLSTM to obtain the hidden state as the final word representation h_i ∈ R^d.

Seed Span Generation
Seed spans are subsequences sampled from a sequence of words. By filtering them, adjusting their boundaries, and classifying them, we can extract entities from the sentence. Under the constraint of a prespecified set of lengths, whose maximum does not exceed L, we enumerate all possible start and end positions to generate the seed spans. We denote the set of seed spans as B = {b_0, . . . , b_K}, where b_i = (st_i, ed_i) denotes the i-th seed span, K denotes the number of generated seed spans, and st_i and ed_i denote the start and end positions of the span, respectively.
To train the filter and the regressor, we need to assign a corresponding category and regression target to each seed span. Specifically, we pair each seed span in B with the ground-truth entity with which the span has the largest IoU. The IoU measures the overlap between spans, defined as IoU(A, B) = |A ∩ B| / |A ∪ B|, where A and B are two spans. Then we divide the seed spans into positive and negative spans based on the IoU of each pair. Spans whose IoU with the paired ground truth is above the threshold α_1 are classified as positive examples, and those below the threshold are classified as negative examples. For a positive span, we assign it the same category ŷ as the paired ground truth and compute the boundary offset t̂ between them. For a negative span, we only assign the NONE label. We downsample the negative examples such that the ratio of positives to negatives is 1:5.
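The span generation and target assignment described above can be sketched in a few lines of Python. This is a simplified illustration, not the paper's implementation: function names and the use of inclusive token indices are our own choices, and the 1:5 downsampling step is omitted.

```python
def span_iou(a, b):
    """IoU between two inclusive token spans a = (st, ed), b = (st, ed)."""
    inter = min(a[1], b[1]) - max(a[0], b[0]) + 1
    if inter <= 0:
        return 0.0
    union = (a[1] - a[0] + 1) + (b[1] - b[0] + 1) - inter
    return inter / union

def generate_seed_spans(n, lengths):
    """Enumerate all spans of the prespecified lengths in an n-token sentence."""
    return [(st, st + l - 1) for l in lengths for st in range(n - l + 1)]

def assign_targets(seed_spans, entities, alpha1=0.5):
    """Pair each span with its largest-IoU entity and split by the threshold."""
    positives, negatives = [], []
    for b in seed_spans:
        iou, ent = max(((span_iou(b, e), e) for e in entities),
                       key=lambda pair: pair[0])
        if iou >= alpha1:
            # regression target: boundary offsets from the span to the entity
            positives.append((b, ent, (ent[0] - b[0], ent[1] - b[1])))
        else:
            negatives.append(b)  # downsampling to a 1:5 ratio omitted here
    return positives, negatives
```

For a 5-token sentence with span lengths {1, 2}, this enumerates 5 + 4 = 9 seed spans; a span identical to its paired entity gets a zero offset target.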

Span Proposal Module
The quality of the generated seed spans varies: high-quality spans have higher overlap with entities, while low-quality spans have lower overlap. Feeding all of them directly into the entity classifier would lead to a lot of computational waste. We denote the high- and low-quality spans as span proposals and contextual spans, respectively. Our span proposal module consists of two components: a span proposal filter and a boundary regressor. The former drops the contextual spans and keeps the span proposals, while the latter adjusts the boundaries of the span proposals to locate entities.
Span Proposal Filter For the seed span b_i = (st_i, ed_i), we concatenate the maximum-pooled span representation h_i^p with the inner boundary word representations (h_{st_i}, h_{ed_i}) to obtain the span representation h_i^filter = [h_i^p ; h_{st_i} ; h_{ed_i}], where [;] denotes the concatenation operation. Based on it, we calculate the probability that the span b_i belongs to the span proposals as p_i^filter = Sigmoid(MLP(h_i^filter)), where the MLP consists of two linear layers and a GELU (Hendrycks and Gimpel, 2016) activation function.
Boundary Regressor Although a span proposal has a high overlap with an entity, it may not hit the entity exactly. We design another boundary regression branch, where a regressor locates entities by adjusting the left and right boundaries of the span proposals. Boundary regression requires not only the information of the span itself but also that of the outer boundary words. Thus we concatenate the maximum-pooled span representation h_i^p with the outer boundary word representations (h_{st_i−1}, h_{ed_i+1}) to obtain the span representation h_i^reg = [h_i^p ; h_{st_i−1} ; h_{ed_i+1}]. Then we calculate the offsets t_i = (t_i^l, t_i^r) of the left and right boundaries from h_i^reg with another MLP.
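The two span representations (inner boundary words for the filter, outer boundary words for the regressor) can be sketched with numpy. This is not the paper's parameterization: the MLPs are stubbed with single random projections, and all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
H = rng.normal(size=(10, d))           # word representations h_1 .. h_n

def span_rep(H, st, ed, outer=False):
    pooled = H[st:ed + 1].max(axis=0)  # max-pooled span representation h_i^p
    if outer:                          # regressor: outer boundary words
        left, right = H[max(st - 1, 0)], H[min(ed + 1, len(H) - 1)]
    else:                              # filter: inner boundary words
        left, right = H[st], H[ed]
    return np.concatenate([pooled, left, right])   # shape (3d,)

W_f = rng.normal(size=(3 * d,))        # stand-in for the filter MLP
p_filter = 1.0 / (1.0 + np.exp(-(span_rep(H, 2, 4) @ W_f)))

W_r = rng.normal(size=(3 * d, 2))      # stand-in for the regressor MLP
t = span_rep(H, 2, 4, outer=True) @ W_r  # (t_l, t_r) boundary offsets
```

The filter head ends in a sigmoid (a keep/drop probability), while the regressor head outputs two unbounded real-valued offsets.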

Entity Classifier Module
With the boundary offsets t_i predicted by the boundary regressor, we adjust the boundaries of the span proposals. The adjusted start position s̃t_i and end position ẽd_i of b_i are calculated as s̃t_i = st_i + t_i^l and ẽd_i = ed_i + t_i^r, where t_i^l and t_i^r denote the left and right offsets, respectively. As in the filter above, we concatenate the maximum-pooled span representation h_i^p with the inner boundary word representations (h_{s̃t_i}, h_{ẽd_i}), and then perform entity classification with an MLP that, as in the filter, consists of two linear layers and a GELU activation function. To train the entity classifier, we reassign the categories based on the IoU between the adjusted span proposal and its paired ground-truth entity: if the IoU is higher than the threshold α_2, we assign the span the same category as the entity; otherwise, we assign it the NONE category and treat the span as a negative example.
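A minimal sketch of the adjustment and relabeling step. Rounding the real-valued offsets to the nearest token is our assumption, and α_2 = 0.9 is only a placeholder value.

```python
def span_iou(a, b):
    """IoU between two inclusive token spans."""
    inter = min(a[1], b[1]) - max(a[0], b[0]) + 1
    if inter <= 0:
        return 0.0
    return inter / ((a[1] - a[0] + 1) + (b[1] - b[0] + 1) - inter)

def adjust_and_relabel(span, offsets, entity, label, alpha2=0.9):
    """Apply predicted boundary offsets, then reassign the training category."""
    st, ed = span
    tl, tr = offsets
    adjusted = (st + round(tl), ed + round(tr))  # snap real offsets to tokens
    new_label = label if span_iou(adjusted, entity) >= alpha2 else "NONE"
    return adjusted, new_label
```

For example, a proposal (2, 4) with offsets (−1.2, 0.1) paired to the entity (1, 4) is adjusted to (1, 4) and keeps the entity's category.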

Training Objective
The spans that partially match with entities are very important, but previous span-based approaches simply treat them as negative examples. This practice not only fails to take advantage of these spans but also introduces noise into the model. We instead treat partially matched spans as soft examples by weighting their losses based on their IoU with the corresponding ground truth. For the i-th span b_i with corresponding ground-truth entity e_i, the weight w_i is calculated as

w_i = IoU(b_i, e_i)^η if IoU(b_i, e_i) ≥ α, and w_i = (1 − IoU(b_i, e_i))^η otherwise, (11)

where α ∈ {α_1, α_2} denotes the IoU threshold used in the first or the second stage. η is a focusing parameter that smoothly adjusts the rate at which partially matched examples are down-weighted. If we set η = 0, the above formula degenerates to a hard one; and if a span does not overlap with any entity or matches exactly with some entity, the loss weight is w_i = 1.

Then, we calculate the losses for the span proposal filter, the boundary regressor and the entity classifier, respectively. For the span proposal filter, we use the focal loss (Lin et al., 2017) to alleviate the imbalance problem, weighting the loss of the i-th example by w_i from Equation 11, with γ denoting the focusing parameter of the focal loss. For the boundary regressor, the loss consists of two components: the Smooth L1 loss at the boundary level, computed between the predicted offsets (t_i^l, t_i^r) and the ground-truth offsets (t̂_i^l, t̂_i^r), and an overlap loss at the span level, which measures the overlap between the adjusted span and the ground-truth boundaries (ŝt_i, êd_i). For the entity classifier, we simply use the cross-entropy loss, again weighted by w_i from Equation 11. We train the filter, regressor and classifier jointly, so the total loss is computed as

L = λ_1 L_filter + λ_2 L_reg + λ_3 L_cls,

where λ_1, λ_2 and λ_3 are the weights of the filter, regressor and classifier losses, respectively.
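The weighting scheme can be sketched as follows. The exact functional form of the weight is an assumption reconstructed from the stated properties: η = 0 recovers hard examples, and spans with no overlap or an exact match get weight 1.

```python
def soft_weight(iou, alpha, eta):
    """Loss weight for a span given its IoU with the paired ground-truth entity."""
    if iou == 0.0 or iou == 1.0:   # no overlap, or exact match: full weight
        return 1.0
    if iou >= alpha:               # partially matched positive example
        return iou ** eta
    return (1.0 - iou) ** eta      # partially matched negative example
```

With η = 1 and α = 0.5, a span with IoU 0.7 contributes 70% of a full positive example's loss, and a span with IoU 0.3 contributes 70% of a full negative example's loss; with η = 0, every weight collapses to 1.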

Entity Decoding
In the model prediction phase, after the above steps, we obtain the classification probability and the boundary offset regression result for each span proposal. Based on them, we need to extract all entities in the sentence, i.e., find the exact start and end positions of the entities as well as their corresponding categories. We assign label y_i = argmax(p_i) to span s_i and use score_i = max(p_i) as the confidence that span s_i belongs to category y_i. Now for each span proposal, our model has predicted the exact start and end positions, the entity class and the corresponding score, denoted as s_i = (l_i, r_i, y_i, score_i). Given the score threshold δ and the set of span proposals S = {s_1, . . . , s_N}, where N denotes the number of span proposals, we use the Soft-NMS (Bodla et al., 2017) algorithm to filter out false positives. As shown in Algorithm 1, we traverse the span proposals in descending order of score (the current proposal is denoted as s_i) and adjust the scores of the remaining span proposals s_j to f(s_i, s_j), defined as

f(s_i, s_j) = u · score_j if IoU(s_i, s_j) ≥ k, and f(s_i, s_j) = score_j otherwise,

where u ∈ (0, 1) denotes the decay coefficient of the score and k denotes the IoU threshold. Finally, we keep all span proposals with a score > δ as the extracted entities.
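The decoding step can be sketched as follows: a simplified Soft-NMS with a hard decay rule, assuming proposals are (left, right, label, score) tuples. The original Soft-NMS paper also supports a continuous Gaussian decay, which is not shown here.

```python
def span_iou(a, b):
    """IoU between two inclusive token spans."""
    inter = min(a[1], b[1]) - max(a[0], b[0]) + 1
    if inter <= 0:
        return 0.0
    return inter / ((a[1] - a[0] + 1) + (b[1] - b[0] + 1) - inter)

def soft_nms(proposals, u=0.5, k=0.5, delta=0.5):
    """proposals: list of (left, right, label, score); returns kept entities."""
    live = sorted(proposals, key=lambda s: -s[3])
    kept = []
    while live:
        best = live.pop(0)              # highest-scoring remaining proposal
        kept.append(best)
        # decay the score of every lower-ranked proposal overlapping `best`
        live = [(l, r, y, s * u if span_iou((l, r), best[:2]) >= k else s)
                for (l, r, y, s) in live]
        live.sort(key=lambda p: -p[3])
    return [s for s in kept if s[3] > delta]
```

For instance, with u = k = δ = 0.5, a nested duplicate (1, 3, 'PER', 0.8) overlapping a kept span (1, 4, 'PER', 0.9) decays to 0.4 and is dropped, while a non-overlapping (6, 7, 'ORG', 0.7) survives.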

Evaluation Metrics
We use strict evaluation metrics that an entity is confirmed correct when the entity boundary and the entity label are correct simultaneously. We employ precision, recall and F1-score to evaluate the performance.

Parameter Settings
In most experiments, we use GloVe (Pennington et al., 2014) and BERT (Devlin et al., 2019) embeddings.

Results

Table 1 illustrates the performance of the proposed model as well as the baselines on ACE04, ACE05, GENIA and KBP17. Our model outperforms the state-of-the-art models consistently on three nested NER datasets. Specifically, the F1-scores of our model advance previous models by +3.08%, +0.71% and +1.27% on KBP17, ACE04 and ACE05, respectively, and on GENIA we achieve comparable performance. We analyze the performance on entities of different lengths on ACE04, as shown in Table 2. We observe that the model works well on entities whose lengths are not enumerated during training. For example, although entities of length 6 are not enumerated while those of lengths 5 and 7 are, our model achieves a comparable F1-score for entities of length 6. In particular, entities whose lengths exceed the maximum length (15) enumerated during training are still well recognized. This verifies that our model is able to identify length-uncovered entities and long entities through boundary regression. We also evaluated our model on two flat NER datasets, as shown in Appendix B.

Ablation Study
We choose the ACE04 and KBP17 datasets to conduct several ablation experiments that elucidate the main components of our proposed model. To illustrate the performance of the model on entities of different lengths, we divide the entities into three groups according to their lengths: 1 ≤ L < 5, 5 ≤ L < 10 and L ≥ 10. The results are shown in Table 3. First, we observe that the boundary regressor is very effective for the identification of long entities: removing it decreases the F1-score for long entities (L ≥ 10) by 36.73% on ACE04 and 30.54% on KBP17. Then, compared with the w/o filter setting, the F1-scores of our full model on the two datasets improve by 0.52% and 0.75%, respectively. In addition, the experimental results demonstrate that the soft examples we construct are effective: they allow the model to take full advantage of the information in partially matched spans during training, improving the F1-score by 0.87% on ACE04 and 0.16% on KBP17. However, Soft-NMS plays a limited role and improves the model performance only slightly. We believe that text is sparse compared to images and the number of false positives predicted by our model is quite small, so Soft-NMS can hardly perform its role as a filter.

Time Complexity
Theoretically, the number of possible spans in a sentence of length N is N(N+1)/2. Previous span-based methods need to classify almost all spans into the corresponding categories, which leads to a high computational cost with O(cN²) time complexity, where c is the number of categories. The words in a sentence can be divided into two categories: contextual words and entity words. Traditional approaches waste a lot of computation on spans composed of contextual words. In contrast, our approach retains only the span proposals containing entity words through the filter, whose time complexity is O(N²). Although in the worst case the model keeps all seed spans, generating N(N+1)/2 span proposals, we observe that in practice we generate approximately three times as many span proposals as there are entities. Assuming that the number of entities in the sentence is k, the total time complexity of our model is O(N² + ck), where k ≪ N².
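A quick back-of-the-envelope check of the gap between the two quantities, using the roughly-three-times-the-entities observation above purely for illustration:

```python
# For a 40-token sentence with 5 entities, compare the number of candidate
# spans an exhaustive span classifier must score against the number of span
# proposals our filter would keep (~3x the entity count, per the observation).
N, num_entities = 40, 5
all_spans = N * (N + 1) // 2       # N(N+1)/2 exhaustive candidate spans
kept_proposals = 3 * num_entities  # empirical proposal count after filtering
print(all_spans, kept_proposals)
```

Even for a modest sentence length, the filter reduces the classifier's workload by roughly two orders of magnitude.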

Case Study
Examples of model predictions are shown in Table  4. The first line illustrates that our model can recognize entities with multi-level nested structures. We can see that the three nested entities from inside to outside are united nations secretary general kofi annan, united nations secretary general and united nations, all of which can be accurately recognized by our model. The second line illustrates that our model can recognize long entities well, although trained without seed spans of the same length as it. The long entity Aceh, which is rich in oil and gas and has a population of about 4.1 million people, with a length of 20, exceeds the maximum length of generated seed spans, but can still be correctly located and classified. However, our model has difficulties in resolving ambiguous entity references. As shown in the third line, our model incorrectly classifies the reference phrase both sides, which refers to ORG, into the PER category.
Related Work

Nested Named Entity Recognition

NER is usually modeled as a sequence labeling task, and a sequence model (e.g., LSTM-CRF (Huang et al., 2015)) is employed to output the sequence of labels with maximum probability. However, traditional sequence labeling models cannot handle nested structures because they can only assign one label to each token. In recent years, several approaches have been proposed to solve the nested named entity recognition task, mainly including tagging-based (Alex et al., 2007;Wang et al., 2020a), hypergraph-based (Muis and Lu, 2017;Katiyar and Cardie, 2018), and span-based (Sohrab and Miwa, 2018;Zheng et al., 2019) approaches. Tagging-based nested NER models transform the nested NER task into a special sequential tagging task by designing a suitable tagging schema. Layered-CRF (Alex et al., 2007) dynamically stacks flat NER layers to identify entities from inner to outer. Pyramid (Wang et al., 2020a) designs a pyramid-structured tagging framework that uses CNN networks to identify entities from the bottom up. Hypergraph-based models construct a hypergraph from the structure of nested NER and decode the nested entities on the hypergraph. Lu and Roth (2015) were the first to propose the use of mention hypergraphs to solve the overlapping mention recognition problem. Katiyar and Cardie (2018) proposed a hypergraph representation for the nested NER task and learned the hypergraph structure in a greedy way with LSTM networks. Span-based nested NER models first extract the subsequences (spans) in a sequence and then classify these spans. The Exhaustive Model (Sohrab and Miwa, 2018) exhausts all possible spans in a text sequence and then predicts their classes. Zheng et al. (2019) and Tan et al. (2020) used a sequence labeling model to identify entity boundaries and then predicted the categories of boundary-relevant regions. Different from the above methods, some works adopt methods from other tasks. For example, Yu et al. (2020) reformulated NER as a structured prediction task and adopted a biaffine model for nested and flat NER, while BERT-MRC treated NER as a reading comprehension task and constructed type-specific queries to extract entities from the context.

Object Detection
Object detection is a computer vision technique that can localize and identify objects in an image: it determines the exact location of objects while assigning them categories. Neural object detection algorithms fall into two main categories: one-stage and two-stage approaches. One-stage object detectors densely propose anchor boxes covering the possible positions, scales, and aspect ratios, and then predict the categories and accurate positions based on them in a single-shot way; examples include OverFeat (Sermanet et al., 2013), YOLO (Redmon et al., 2016) and SSD (Liu et al., 2016). Two-stage object detectors can be seen as an extension of the dense detector and have been the most dominant object detection algorithms for many years (Girshick et al., 2014;Girshick, 2015;Ren et al., 2017;Dai et al., 2016;Cai and Vasconcelos, 2018). They first obtain sparse proposal boxes containing objects from a dense set of region candidates, and then adjust the position and predict a category for each proposal.

Conclusion
In this paper, we treat NER as a joint task of boundary regression and span classification and propose a two-stage entity identifier. First, we generate span proposals through a filter and a regressor, then classify them into the corresponding categories. Our proposed model makes full use of the boundary information of entities and reduces the computational cost. Moreover, by constructing soft examples during training, our model can exploit the spans that partially match with entities. Experiments illustrate that our method achieves state-of-the-art performance on several nested NER datasets. For future work, we will combine the named entity recognition and object detection tasks, and try to use a unified framework to address joint identification on multimodal data.

A Experiments on Nested NER

A.1 Statistics of Nested Datasets
In Table 5, we report the number of sentences, the number of sentences containing nested entities, the average sentence length, the total number of entities, the number of nested entities and the nesting ratio on the ACE04, ACE05, GENIA and KBP17 datasets.

A.2 Baseline Methods
We use the following models as baselines for nested NER:

• Biaffine (Yu et al., 2020) reformulates NER as a structured prediction task and adopts a dependency parsing approach for NER.
• Pyramid (Wang et al., 2020a) consists of a stack of inter-connected layers. Each layer predicts whether a text region of a certain length is a complete entity mention.
• BiFlaG (Luo and Zhao, 2020) designs a bipartite flat-graph network with two interacting subgraph modules for outermost entities and inner entities, respectively.
• HIT (Wang et al., 2020b) leverages the head-tail pair and token interaction to express the nested entities.
• ARN (Lin et al., 2019) designs a sequence-to-nuggets architecture by modeling and leveraging the head-driven phrase structures of entity mentions.
• KBP17-Best (Ji et al., 2017) gives an overview of the Entity Discovery task and reports previous best results for the task of nested NER.
We did not compare our model with BERT-MRC, because it uses additional external resources to construct the questions, which essentially introduces descriptive information about the categories.

A.3 Detailed Parameter Settings
In our experiments, the detailed parameter settings for the model are shown in Table 6.

A.4 Analysis of Boundary Offset Regression
We analyzed the distribution of the boundary offsets predicted by the model on the ACE04 dataset, as shown in Figure 3. The numbers of offsets of 0, 1, 2, 3 and ≥ 4 are 2162, 2440, 888, 368 and 202, respectively. Most of the offsets are 1, indicating that most seed spans require only a slight boundary adjustment to accurately locate the entities. There are also many offsets of 0, because many entities in the dataset are short enough to be covered exactly by the seed spans, so their boundaries do not need to be adjusted.

B Experiments on Flat NER

B.1 Datasets

We use two flat NER datasets to evaluate our model: CoNLL03 English (Tjong Kim Sang and De Meulder, 2003) is an English dataset with four types of flat entities: Location, Organization, Person and Miscellaneous. Following Lin et al. (2019), we train our model on the concatenation of the train and dev sets.
Weibo Chinese (Peng and Dredze, 2015) is a Chinese dataset sampled from Weibo with four types of flat entities: Person, Organization, Location and Geo-political. We evaluate our model using the same setting as Li et al. (2020a).

B.2 Baselines
For English flat NER, we use several taggers as baseline models, including ELMO-Tagger (Peters et al., 2018) and BERT-Tagger (Devlin et al., 2019), which use ELMo and BERT as encoders, respectively. For Chinese flat NER, we use Glyce (Meng et al., 2019), the model of Li et al. (2020a) and SLK-NER (Hu and Wei, 2020) as baseline models. They incorporate glyph information, phrase embeddings and second-order lexicon knowledge for Chinese NER, respectively.

B.3 Results
We evaluated our model on the flat NER datasets, as shown in Table 7. Our model outperforms the baseline models on Weibo Chinese, improving the F1-score by 0.61%. On CoNLL03, our model also achieves comparable results, with less than a 1% performance drop compared to Yu et al. (2020).