Nested Named Entity Recognition via Explicitly Excluding the Influence of the Best Path

This paper presents a novel method for nested named entity recognition. As a layered method, our method extends the prior second-best path recognition method by explicitly excluding the influence of the best path. Our method maintains a set of hidden states at each time step and selectively leverages them to build a different potential function for recognition at each level. In addition, we demonstrate that recognizing innermost entities first results in better performance than the conventional outermost entities first scheme. We provide extensive experimental results on ACE2004, ACE2005, and GENIA datasets to show the effectiveness and efficiency of our proposed method.


Introduction
Named entity recognition (NER), as a key technique in natural language processing, aims at detecting entities and assigning semantic category labels to them. Early research (Huang et al., 2015;Ma and Hovy, 2016;Lample et al., 2016) proposed to employ deep learning methods and obtained significant performance improvements. However, most of them assume that the entities are not nested within other entities, so-called flat NER. Inherently, these methods do not work satisfactorily when nested entities exist. Figure 1 displays an example of the nested NER task.
Recently, a large number of papers have proposed novel methods (Fisher and Vlachos, 2019; Wang et al., 2020) for the nested NER task. Among them, layered methods solve the task through multi-level sequential labeling: entities are divided into several levels, where the term level denotes the depth of entity nesting, and sequential labeling is performed repeatedly. As a special case of the layered method, Shibuya and Hovy (2020) force the next-level entities to lie on the second-best path of the current level's search space.*

* This work was done when the first author was at NAIST.

[Figure 1: an example sentence, "Former Hogwarts headmaster Albus Dumbledore", containing nested entities.]

Hence, their algorithm can repeatedly detect inner entities by applying a conventional conditional random field (CRF) (Lafferty et al., 2001) and then excluding the obtained best path from the search space. To accelerate computation, they also designed an algorithm that efficiently computes the partition function with the best path excluded. Moreover, because they search for the outermost entities first, it is sufficient to perform the second-best path search only on the spans of extracted entities, since inner entities can only exist within outer entities. However, we claim that the target path at the next level is neither necessarily nor even likely to be the second-best path at the current level. Instead, the paths sharing many overlapping labels with the current best path are the likely candidates for the second-best path. Besides, Shibuya and Hovy (2020) reuse the same potential function at all higher levels. Thus, even though they exclude the best path, its influence is still preserved, since the emission scores of the labels on the best path are reused in the next-level recognition. Moreover, these best-path labels are treated as the target labels at the current level; if they are not on the best path of the next level, they are treated as non-target labels there, and these adversarial optimization goals eventually hurt performance.
In this paper, we use a different potential function at each level to solve this issue. We achieve this by introducing an encoder that produces a set of hidden states at each time step. At each level, we select some hidden states for entity recognition and then remove the hidden states that interacted with the best-path labels before moving to the next level. In this way, the emission scores of the best-path labels become completely different at the next level, so we explicitly exclude the influence of the best path. Furthermore, we propose three different selection strategies to fully leverage the information among hidden states.
Besides, Shibuya and Hovy (2020) proposed to recognize entities from outermost to innermost. We empirically demonstrate that extracting the innermost entities first results in better performance. This may be due to the fact that some long entities do not contain any inner entity, so the outermost-first encoding mixes these entities with other short entities at the same levels, leading the encoder representations to be dislocated. In this paper, we convert entities to the IOBES encoding scheme (Ramshaw and Marcus, 1995) and solve nested NER by applying a CRF level by level.
Our contributions are fourfold: (a) we design a novel nested NER algorithm that explicitly excludes the influence of the best path by using a different potential function at each level, (b) we propose three different selection strategies for fully utilizing the information among hidden states, (c) we empirically demonstrate that recognizing entities from innermost to outermost results in better performance, and (d) we provide extensive experimental results demonstrating the effectiveness and efficiency of our proposed method on the ACE2004, ACE2005, and GENIA datasets.

Proposed Method
The named entity recognition task aims to recognize entities in a given sequence {x_t}_{t=1}^n. In nested NER, some shorter entities may be nested within longer entities, whereas in flat NER there is no such case. Existing algorithms solve flat NER with a sequential labeling method, which assigns each token a label y_t ∈ Y to determine the span and category of each entity and non-entity simultaneously. To solve nested NER, we follow the previous layered method and extend this sequential labeling method with a multi-level encoding scheme: entities are divided into several levels according to their depths, and we apply the sequential labeling method level by level to recognize all entities.

Encoding Schemes
Shibuya and Hovy (2020) proposed to recognize the outermost entities first and recursively detect the nested inner entities. However, we find that detecting from the innermost entities results in better performance. We take the sentence in Figure 1 as an example to illustrate the details of these two encoding schemes. The results of the outermost-first encoding scheme look as follows.
Labels B-, I-, and E- indicate that the current word is the beginning, an intermediate word, and the end of an entity, respectively. Label S- means a single-word entity, and label O stands for a non-entity word. For example, the outermost entity "Former Hogwarts headmaster Albus Dumbledore" appears at the first level, while the innermost entities "Hogwarts" and "headmaster" appear at the fourth level. Since there exists no more deeply nested entity, the remaining levels contain only the label O.
In contrast, the innermost-first encoding scheme converts the same example to the following label sequences.
In this encoding scheme, innermost entities "Hogwarts", "headmaster", and "Albus Dumbledore" appear at the first level. Note that the innermost-first encoding scheme is not the simple reverse of the outermost-first encoding scheme. For example, the entity "Former Hogwarts headmaster" and the entity "Albus Dumbledore" appear at the same level in the outermost-first scheme but they appear at different levels in the innermost-first scheme.
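The two level-assignment rules can be sketched in a few lines of Python. Note that the exact span annotation of the Figure 1 sentence (in particular the types of the intermediate entities, e.g., a "Hogwarts headmaster" span of type ROLE) is our own reconstruction from the descriptions above and may differ from the actual annotation; it is only meant to illustrate the two schemes.

```python
# Sketch: convert nested entity spans into per-level IOBES sequences,
# under either the innermost-first or outermost-first scheme.
# Spans are half-open (start, end, type); the example annotation below
# is our reconstruction of the Figure 1 sentence, not dataset ground truth.

def contains(a, b):
    """True if span a strictly contains span b."""
    return a[0] <= b[0] and b[1] <= a[1] and a != b

def assign_levels(spans, innermost_first=True):
    """Map each span to its 1-based level under the chosen scheme."""
    levels = {}
    def level(e):
        if e not in levels:
            if innermost_first:
                # 1 + deepest level among entities nested inside e
                inner = [x for x in spans if contains(e, x)]
                levels[e] = 1 + max((level(x) for x in inner), default=0)
            else:
                # 1 + number of entities strictly containing e
                levels[e] = 1 + sum(1 for x in spans if contains(x, e))
        return levels[e]
    for e in spans:
        level(e)
    return levels

def to_iobes(spans_at_level, n):
    """Encode one level's (non-overlapping) spans as an IOBES sequence."""
    labels = ["O"] * n
    for s, t, typ in spans_at_level:
        if t - s == 1:
            labels[s] = "S-" + typ
        else:
            labels[s] = "B-" + typ
            for i in range(s + 1, t - 1):
                labels[i] = "I-" + typ
            labels[t - 1] = "E-" + typ
    return labels

# "Former Hogwarts headmaster Albus Dumbledore" (5 tokens)
spans = [(0, 5, "PER"), (0, 3, "ROLE"), (1, 3, "ROLE"),
         (1, 2, "ORG"), (2, 3, "ROLE"), (3, 5, "PER")]
lv = assign_levels(spans, innermost_first=True)
for l in range(1, max(lv.values()) + 1):
    print(l, to_iobes([e for e in spans if lv[e] == l], 5))
```

With innermost-first assignment this reproduces the paths quoted in the next subsection: level 3 is B-ROLE, I-ROLE, E-ROLE, O, O and level 4 is B-PER, I-PER, I-PER, I-PER, E-PER, while under outermost-first the full-span PER moves to level 1.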

Influence of the Best Path
Although the second-best path search algorithm is the main contribution of Shibuya and Hovy (2020), we claim that forcing the target path at the next level to be the second-best path at the current level is not optimal. As in the innermost-first encoding example above, the best path at level 3 is B-ROLE, I-ROLE, E-ROLE, O, O. The second-best path is therefore more likely to be one of the paths that share as many labels as possible with the best path, e.g., B-ROLE, I-ROLE, E-ROLE, O, S-ORG, rather than the actual target label sequence at level 4, i.e., B-PER, I-PER, I-PER, I-PER, E-PER, which does not overlap with the best path at all. In addition, Shibuya and Hovy (2020) reuse the same potential function at all higher levels. This means that, for instance, at level 3 and time step 1, their model encourages the emission score h_1 · v_{B-ROLE} to be larger than h_1 · v_{B-PER}, while at level 4 the remaining influence of the best path conversely forces h_1 · v_{B-PER} to be larger than h_1 · v_{B-ROLE}. These adversarial optimization goals eventually hurt performance. Therefore, the crux of the matter is to introduce different emission scores at different levels. For example, encouraging h^3_1 · v_{B-ROLE} at level 3 and h^4_1 · v_{B-PER} at level 4 no longer leads to adversarial optimization directions, where h^3_1 and h^4_1 are two distinct hidden states used at levels 3 and 4, respectively.
To achieve this goal, we introduce a novel encoder that outputs m hidden states {h^l_t}_{l=1}^m at each time step, where m is the number of levels, as an alternative to the conventional encoder, which outputs only a single hidden state h_t ∈ R^{d_h} at each time step. To distinguish our m hidden states from the conventional single hidden state, we use the term chunk from now on to refer to these hidden states h^l_t ∈ R^{d_h/m}. We restrict the chunk dimension to d_h/m so that the total number of parameters remains unchanged.

Chunk Selection
As mentioned above, our algorithm maintains a chunk set at each time step and excludes the influence of the best path by selecting and removing chunks. Naturally, how to select a chunk becomes the next detail to be finalized. For clarity, we use the notation H^l_t to denote the chunk set at level l and time step t, and use H^l to refer to all chunk sets at level l across time steps, i.e., {H^l_t}_{t=1}^n. Because we remove exactly one chunk at each time step per level, |H^l_t| + l = m + 1 always holds.
An intuitive idea is to follow the original chunk order and simply select the l-th chunk for level l. At level l, no matter which label is considered, the emission score is computed from h^l_t. This naive potential function can be defined as

φ(y^l_{t-1}, y^l_t, H^l_t) = A_{y^l_{t-1}, y^l_t} + h^l_t · v_{y^l_t},    (1)

where A_{y^l_{t-1}, y^l_t} is the transition score from label y^l_{t-1} to label y^l_t, and v_{y^l_t} ∈ R^{d_h/m} is the embedding of label y^l_t. In this case, the l-th chunk h^l_t ∈ H^l_t is exactly the chunk that interacts with the target label, so it is the one removed before the next level:

H^{l+1}_t = H^l_t \ {h^l_t}.    (2)
One concern about the naive potential function is that it implicitly assumes the outputs of the encoder are automatically arranged in level order rather than in some particular syntactic or semantic order; e.g., the encoder may encode all LOC-related information in the first d_h/m dimensions while the remaining dimensions encode other information. Ideally, the chunk selected for a target label should score it at least as highly as any other remaining chunk, i.e., h^{σ_l}_t · v_{y^l_t} ≥ h^j_t · v_{y^l_t} for any h^j_t ∈ H^l_t, where σ_l is the index of the selected chunk at level l; but for the naive potential function this inequation does not always hold. From this aspect, our method can also be considered as selecting the best path in the second-best search space.

[Algorithm 1 (Training): given the first-level chunk sets H^1 and the target label sequences y^1, ..., y^m, initialize L ← 0 and accumulate the negative log-likelihood over levels.]
Therefore, instead of following the original chunk order, we propose to let each label y^l_t select the chunk most similar to it to obtain its emission score. We denote this definition as the max potential function,

φ(y^l_{t-1}, y^l_t, H^l_t) = A_{y^l_{t-1}, y^l_t} + max_{h ∈ H^l_t} h · v_{y^l_t}.    (3)

In this case, we update the chunk sets by removing the chunks selected by the target labels:

H^{l+1}_t = H^l_t \ {argmax_{h ∈ H^l_t} h · v_{y^l_t}}.    (4)
Furthermore, since the log-sum-exp operation is a well-known differentiable approximation of the max operation, we also introduce it as the third potential function,

φ(y^l_{t-1}, y^l_t, H^l_t) = A_{y^l_{t-1}, y^l_t} + log Σ_{h ∈ H^l_t} exp(h · v_{y^l_t}).    (5)

[Algorithm 2 (Decoding): given the first-level chunk sets H^1, initialize E ← ∅; for l = 1 to m, decode ŷ^l with the Viterbi algorithm, add its entities to E, and update the chunk sets.]

The chunk set is updated in the same way as Equation 4. We refer to this potential function definition as logsumexp in the rest of this paper.
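The three emission-score definitions and the chunk-set update can be sketched as follows. This is a minimal numpy illustration of Equations 1, 3, 4, and 5 (function names and dimensions are ours, not the released code).

```python
# Sketch of the three emission scores over one time step's chunk set.
import numpy as np

def naive_emission(chunks, v, l):
    """Eq. 1: always use the l-th remaining chunk, whatever the label."""
    return float(chunks[l] @ v)

def max_emission(chunks, v):
    """Eq. 3: the label picks its most similar remaining chunk."""
    scores = chunks @ v
    return float(scores.max()), int(scores.argmax())

def logsumexp_emission(chunks, v):
    """Eq. 5: a differentiable approximation of the max."""
    scores = chunks @ v
    s = scores.max()
    return float(s + np.log(np.exp(scores - s).sum()))

def remove_chunk(chunks, idx):
    """Eq. 4: H_t^{l+1} = H_t^l minus the chunk taken by the target label."""
    return np.delete(chunks, idx, axis=0)

rng = np.random.default_rng(0)
chunks = rng.normal(size=(4, 8))  # |H_t^1| = m = 4 chunks, d_h/m = 8
v = rng.normal(size=8)            # one label embedding v_y

score, idx = max_emission(chunks, v)
chunks = remove_chunk(chunks, idx)  # move to the next level
```

Since log-sum-exp sums over all remaining chunks, it always upper-bounds the max score, which is why it lets gradients flow to every chunk rather than only the selected one.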

Embedding Layer
Following previous work (Shibuya and Hovy, 2020), we convert words to word embeddings w_t ∈ R^{d_w} and employ a character-level bidirectional LSTM to obtain character-based word embeddings c_t ∈ R^{d_c}. Their concatenation x_t = [w_t, c_t] is fed into the encoding layer as the token representation.

Encoding Layer
We employ a three-layered bidirectional LSTM to encode sentences and leverage contextual information, with hidden state h_t ∈ R^{d_h} at each time step. In contrast to the encoders of previous work, which output only a single hidden state at each time step, we split h_t into m chunks, h_t = [h^1_t, ..., h^m_t], where each chunk h^l_t ∈ R^{d_h/m}.
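In PyTorch terms, the encoder and the chunk split amount to the following sketch. The concrete numbers (d_x = 200, d_h = 600, and m = 6 so that d_h/m = 100) follow the ACE hyper-parameter description later in the paper; the value of m is our assumption about the maximal nesting depth and is not stated explicitly.

```python
# Sketch (our reconstruction, not the released code) of the encoding
# layer: a 3-layer BiLSTM whose per-step output h_t is split into m
# equal chunks, keeping the total parameter count unchanged.
import torch
import torch.nn as nn

m, d_x, d_h = 6, 200, 600  # m = 6 is an assumed nesting depth
encoder = nn.LSTM(input_size=d_x, hidden_size=d_h // 2, num_layers=3,
                  bidirectional=True, batch_first=True)

x = torch.randn(2, 7, d_x)           # (batch, seq_len, d_x)
h, _ = encoder(x)                     # (batch, seq_len, d_h)
chunks = h.view(2, 7, m, d_h // m)    # each h_t split into m chunks
```

The split is a pure reshape, so no extra parameters are introduced compared with a conventional single-hidden-state encoder of the same width.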

Training and Decoding
Following the definition of the CRF, the conditional probability of a given label sequence at the l-th level, i.e., y^l = {y^l_t}_{t=1}^n, can be defined as

p(y^l | H^l) = exp( Σ_{t=1}^n φ(y^l_{t-1}, y^l_t, H^l_t) ) / Z(H^l),

where Z(H^l), the sum of the scores of all paths, is commonly known as the partition function. We optimize our model by minimizing the sum of the negative log-likelihoods of all levels, L = −Σ_{l=1}^m log p(y^l | H^l).
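The level-wise loss can be made concrete with a small self-contained sketch: emissions at level l come from the l-th chunk (the naive potential of Equation 1), Z(H^l) is computed with the standard forward algorithm in log space, and the per-level negative log-likelihoods are summed. Shapes and names are ours.

```python
# Sketch of the level-wise CRF negative log-likelihood.
import numpy as np

def lse(a, axis):
    """Numerically stable log-sum-exp along one axis."""
    m = a.max(axis=axis, keepdims=True)
    return np.squeeze(m, axis) + np.log(np.exp(a - m).sum(axis=axis))

def crf_nll(emit, trans, y):
    """-log p(y): emit (n, L) emission scores, trans (L, L) transitions."""
    n, _ = emit.shape
    gold = emit[np.arange(n), y].sum() + trans[y[:-1], y[1:]].sum()
    alpha = emit[0]                        # forward algorithm for log Z
    for t in range(1, n):
        alpha = lse(alpha[:, None] + trans, axis=0) + emit[t]
    return float(lse(alpha, axis=0) - gold)

def nested_nll(chunks, V, trans, labels):
    """Sum of level-wise NLLs with the naive potential: level l reads
    the l-th chunk. chunks: (n, m, d); V: (L, d) label embeddings."""
    return sum(crf_nll(chunks[:, l, :] @ V.T, trans, np.asarray(y))
               for l, y in enumerate(labels))
```

Because the gold path's score is one term of the sum inside Z(H^l), each level's NLL is non-negative, and so is their sum.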
At the decoding stage, we iteratively apply the Viterbi algorithm (Forney, 1973) at each level to search for the most probable label sequence.
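For completeness, here is a minimal Viterbi decoder in the same notation as the training sketch above (emit holds the (n, L) emission scores of one level, trans the (L, L) transition scores); this is a generic textbook implementation, not the authors' code.

```python
# Sketch: max-product Viterbi decoding for one level's CRF.
import numpy as np

def viterbi(emit, trans):
    """Return the most probable label path as a list of indices."""
    n, L = emit.shape
    score = emit[0].copy()
    back = np.zeros((n, L), dtype=int)   # backpointers
    for t in range(1, n):
        cand = score[:, None] + trans    # (prev_label, cur_label)
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + emit[t]
    path = [int(score.argmax())]
    for t in range(n - 1, 0, -1):        # follow backpointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

Running it level by level, and updating the chunk sets after each level as in Equation 4, yields the full nested decoding procedure.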
The pseudocode of the training and decoding algorithms with the max or logsumexp potential function is shown in Algorithms 1 and 2, respectively.

Datasets
We conduct experiments on three English nested named entity recognition datasets: ACE2004 (Doddington et al., 2004), ACE2005 (Walker et al., 2006), and GENIA (Kim et al., 2003). We divide all these datasets into train/dev/test splits following Shibuya and Hovy (2020) and Wang et al. (2020). The dataset statistics can be found in Table 1.

Hyper-parameters Settings
For word embedding initialization, we utilize 100-dimensional pre-trained GloVe (Pennington et al., 2014) for the ACE2004 and ACE2005 datasets, and 200-dimensional biomedical-domain word embeddings (Chiu et al., 2016) for the GENIA dataset. Moreover, we randomly initialize 30-dimensional vectors for character embeddings. The hidden state dimension of the character-level LSTM d_c is 100, i.e., 50 in each direction, so the dimension of the token representation d_x is 200. We apply dropout (Srivastava et al., 2014) to the token representations before feeding them into the encoder. The hidden state dimension of the three-layered LSTM is 600 for ACE2004 and ACE2005, i.e., 300 in each direction, and 400 for GENIA. We choose different dimensions because the maximal depth of entity nesting m differs across datasets. We apply layer normalization (Ba et al., 2016) and dropout with ratio 0.5 after each bidirectional LSTM layer.
Different from Shibuya and Hovy (2020), we use only one CRF instead of employing different CRFs for different entity types. Besides, our CRF is also shared across levels, which means we learn and decode entities at all levels with the same CRF.
Our model is optimized using stochastic gradient descent (SGD) with a decaying learning rate η_τ = η_0 / (1 + γ · τ), where τ is the index of the current epoch. For ACE2004, ACE2005, and GENIA, the initial learning rates η_0 are 0.2, 0.2, and 0.1, and the decay rates γ are 0.01, 0.02, and 0.02, respectively. We set the weight decay rate, the momentum, the batch size, and the number of epochs to 10^−8, 0.5, 32, and 100, respectively, except that we use batch size 64 on the GENIA dataset. We clip gradients exceeding 5.
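The decay schedule is a one-line helper (the function name is ours):

```python
# eta_tau = eta_0 / (1 + gamma * tau): the epoch-indexed SGD schedule.
def lr_at_epoch(eta0, gamma, tau):
    return eta0 / (1.0 + gamma * tau)
```

For instance, with the GENIA settings (η_0 = 0.1, γ = 0.02) the learning rate halves to 0.05 after 50 epochs, giving a gentle hyperbolic decay rather than a step schedule.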
Besides, we also conduct experiments to evaluate the performance of our model with contextual word representations. BERT (Devlin et al., 2019) and Flair (Akbik et al., 2018) are the most commonly used contextual word representations in previous work and have been shown to substantially improve model performance. In these settings, the contextual word representation is concatenated with the word and character representations to form the token representation, i.e., x_t = [w_t, c_t, e_t], where e_t is the contextual word representation; it is not fine-tuned in any of our experiments.

[Table 2 caption: Bold and underlined numbers indicate the best and second-best results, respectively. naive, max, and logsumexp refer to the three potential function definitions, i.e., Equations 1, 3, and 5, respectively. Numbers in parentheses are standard deviations.]
BERT is a transformer-based (Vaswani et al., 2017) pre-trained contextual word representation. In our experiments, we use the general-domain checkpoint bert-large-uncased for the ACE2004 and ACE2005 datasets, and the biomedical-domain checkpoint BioBERT large v1.1 (https://github.com/naver/biobert-pretrained) for the GENIA dataset. We average the subword embeddings in the last four BERT layers to build 1024-dimensional vectors.
Flair is a character-level BiLSTM-based pre-trained contextual word representation. We concatenate the vectors obtained from the news-forward and news-backward checkpoints for ACE2004 and ACE2005, and from the pubmed-forward and pubmed-backward checkpoints for GENIA, to build 4096-dimensional vectors.

Evaluation
Experiments are evaluated by precision, recall, and F1. All of our experiments were run 4 times with different random seeds, and averaged scores are reported in the following tables.

Our model is implemented with PyTorch (Paszke et al., 2019), and we run experiments on a GeForce GTX 1080Ti with 11 GB memory.

Table 2 shows the performance of previous work and our model on the ACE2004, ACE2005, and GENIA datasets. Our model substantially outperforms most previous work, especially our baseline, Shibuya and Hovy (2020). When using only word embeddings and character-based word embeddings, our method exceeds theirs by 2.64 F1 points and achieves results comparable to the recent competitive method of Wang et al. (2020). When utilizing BERT and further employing Flair, our method consistently outperforms Shibuya and Hovy (2020) by 1.09 and 0.60 F1 points, respectively.

Experimental Results
On the ACE2005 dataset, our method improves the F1 scores by 1.98, 0.72, and 0.59, respectively, compared with Shibuya and Hovy (2020). Although our model is in general inferior to Wang et al. (2020), our max potential function method is slightly superior, by 0.05 F1 points, when employing BERT.
Furthermore, on the biomedical-domain dataset GENIA, our method consistently outperforms Shibuya and Hovy (2020) by 0.18, 1.62, and 1.57 F1 points, respectively. Although the low scores of Shibuya and Hovy (2020) are partly due to their use of the general-domain checkpoint bert-large-uncased instead of our biomedical-domain checkpoint, our model is still superior by 0.47 and 0.62 F1 points to Straková et al. (2019), who used the same checkpoint as us.
As for the three potential functions, we notice that the max and logsumexp potential functions generally work better than the naive potential function. These results demonstrate that the chunk selection strategies of max and logsumexp can leverage information from all remaining chunks and constrain the hidden states of the LSTM to be more semantically ordered. When we use BERT and Flair, the advantage of the max and logsumexp potential functions is less obvious than when we only use word embeddings and character-based word embeddings, especially on the GENIA dataset. We hypothesize that BERT and Flair already provide rich contextual information, so selecting chunks in the original order is sufficient and our dynamic selection mechanism can only slightly improve performance.

Influence of the Encoding Scheme
We also conduct experiments on the ACE2004 dataset to measure the influence of the outermost-first and innermost-first encoding schemes. As shown in Table 3, the innermost-first encoding scheme consistently works better than the outermost-first encoding scheme with all potential functions. We hypothesize that outermost entities, especially longer ones, do not necessarily contain inner entities, and that putting those diversely nested outermost entities at the same level would dislocate the encoder representations. Furthermore, even when we use the outermost-first encoding scheme, our method is superior to Shibuya and Hovy (2020), which further demonstrates the effectiveness of excluding the influence of the best path.

Time Complexity and Speed
The time complexity of the encoder is O(n), and because we employ the same tree-reduction acceleration trick as Rush (2020), the time complexity of the CRF is reduced to O(log n); therefore, the overall time complexity is O(n + m · log n).
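The tree-reduction trick rests on the fact that CRF matrix products in the log-semiring are associative, so the chain of per-step matrices can be reduced pairwise in O(log n) parallel rounds instead of a length-n sequential scan. A minimal sketch (ours, following the idea in Rush (2020), not that library's API):

```python
# Sketch: log-partition of a linear-chain CRF by pairwise tree reduction.
import numpy as np

def lse(a, axis):
    """Numerically stable log-sum-exp along one axis."""
    m = a.max(axis=axis, keepdims=True)
    return np.squeeze(m, axis) + np.log(np.exp(a - m).sum(axis=axis))

def logmatmul(A, B):
    """(A o B)[i, j] = logsumexp_k A[i, k] + B[k, j]."""
    return lse(A[:, :, None] + B[None, :, :], axis=1)

def log_partition(emit, trans):
    """log Z for emit (n, L) and trans (L, L)."""
    n, _ = emit.shape
    if n == 1:
        return float(lse(emit[0], axis=0))
    # M_t folds the transition into step t together with its emission
    mats = [trans + emit[t][None, :] for t in range(1, n)]
    while len(mats) > 1:                  # pairwise tree reduction
        nxt = [logmatmul(mats[i], mats[i + 1])
               for i in range(0, len(mats) - 1, 2)]
        if len(mats) % 2:
            nxt.append(mats[-1])
        mats = nxt
    return float(lse(lse(emit[0][:, None] + mats[0], axis=0), axis=0))
```

Each reduction round halves the number of matrices, so on parallel hardware the chain collapses in logarithmically many steps while producing exactly the same partition function as the sequential forward algorithm.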
Although our model performs slightly worse than Wang et al. (2020), its training and inference speed is much faster, as shown in Table 4, since we do not need to stack the decoding component to 16 layers. In particular, when we increase the batch size to 64, the decoding speed is more than two times faster than their model.

[Table 4: training and decoding speeds by method and batch size.]

Level-wise Performance
Table 5 shows the performance on the ACE2005 dataset at each level. The max potential function achieves consistently higher precision scores than the naive and logsumexp potential functions at the first three levels, while at the same time obtaining the lowest recall scores. The logsumexp potential function, on the contrary, achieves the highest recall scores but fails to obtain satisfactory precision scores. Because most entities are located at the first two levels, max and logsumexp achieve the best overall precision and recall scores, respectively.

Chunk Distribution
We analyze the chunk distribution on the test split of the ACE2005 dataset by plotting heat maps in Figure 3, in which the numbers indicate the percentage of each chunk being selected by a particular level or label. For example, the 35 at the upper-right corner means that, when using the logsumexp potential function, 35% of predictions at the first level are made by choosing the sixth chunk, while the 78 at the lower-left corner shows that 78% of WEA predictions are related to the first chunk under naive. To ease comparison with naive, we rearranged the chunk orders of max and logsumexp, without loss of generality, so that the level-chunk distribution concentrates mainly on the diagonal. The naive potential function simply selects the l-th chunk at the l-th level, so its heat map is exactly diagonal. At the first level, the logsumexp potential function prefers to select the sixth and fourth chunks rather than the first chunk; we hypothesize that this is because most B- and S- labels are located at the first level, which is confirmed by the syntactic-chunk heat map of logsumexp, where 78% of B- and 70% of S- labels go to the sixth and fourth chunks. Similarly, max also has a high probability of selecting the second chunk.
Generally, the chunk distribution of logsumexp is smoother than that of max. Besides, we find that label O selects chunks almost uniformly in both the syntactic and semantic heat maps, while the other, meaningful labels have their own distinct preferences.
Syntactic labels S- and B- mainly represent the beginning of an entity, while I- and E- stand for the continuation and ending of an entity. In the syntactic-chunk heat map of naive, they are indiscriminately assigned to the first chunk, because most entities are located at the first level. However, max and logsumexp utilize different chunks to represent these different syntactic categories.
Likewise, when using logsumexp, the semantic label GPE also has a 61% probability of selecting the sixth chunk rather than concentrating on the first chunk as with naive. These observations further demonstrate that our dynamic chunk selection strategies are capable of learning more meaningful representations.

Related Work
Existing NER algorithms commonly employ various neural networks to leverage more morphological and contextual information to improve performance. For example, to handle the out-of-vocabulary issue by introducing morphological features, Huang et al. (2015) proposed to employ manual spelling features, while Ma and Hovy (2016) and Lample et al. (2016) suggested introducing a CNN and an LSTM, respectively, to build word representations from the character level. Zhang et al. (2018) and Chen et al. (2019) introduced global representations to enhance the encoder's capability of encoding contextual information.
Layered Model As a layered model, Ju et al. (2018) dynamically update span-level representations for next-layer recognition according to the recognized inner entities. Fisher and Vlachos (2019) proposed a merge-and-label method to further enhance this idea. Recently, Shibuya and Hovy (2020) designed a novel algorithm to efficiently learn and decode the second-best path on the spans of detected entities. Luo and Zhao (2020) build two different graphs, one over the original token sequence and the other over the tokens in recognized entities, to model the interaction between them. Wang et al. (2020) proposed to learn l-gram representations at layer l by applying a decoder component that reduces a sentence layer by layer and directly classifies these l-gram spans. Lin et al. (2019) proposed an anchor-region network that recognizes nested entities by first detecting anchor words and entity boundaries and then classifying each detected span. Exhaustive models simply enumerate all possible spans and utilize a maximum entropy tagger (Byrne, 2007) or neural networks (Xu et al., 2017; Sohrab and Miwa, 2018; Zheng et al., 2019) for classification. Luan et al. (2019) additionally consider the relationships among entities and proposed a novel method to jointly learn both entities and relations.

Hypergraph-based Model Lu and Roth (2015) proposed a hyper-graph structure in which edges are connected to multiple nodes to represent nested entities. Muis and Lu (2017) and subsequent work resolved the spurious-structure and ambiguity issues of the hyper-graph structure, and Katiyar and Cardie (2018) proposed another kind of hyper-graph structure.
Parsing-based Model Finkel and Manning (2009) observed that all nested entities are located at non-terminal nodes of the constituency parses of the original sentences, and thus proposed to use a CRF-based constituency parser to obtain them. However, its cubic time complexity limits its applicability. Later work instead proposed to use a transition-based constituency parser to incrementally build a constituency forest; its linear time complexity ensures that it can handle longer sentences.

Conclusion
In this paper, we proposed a simple and effective method for nested named entity recognition that explicitly excludes the influence of the best path by selecting and removing chunks at each level to build different potential functions. We also proposed three selection strategies to leverage information from all remaining chunks. Besides, we found that the innermost-first encoding scheme works better than the conventional outermost-first encoding scheme. Extensive experimental results demonstrate the effectiveness and efficiency of our method. However, one demerit of our method is that the number of chunks, i.e., the maximal depth of entity nesting, must be chosen in advance as a hyper-parameter. We will extend it to arbitrary depths in future work.