HiTRANS: A Hierarchical Transformer Network for Nested Named Entity Recognition

Abstract
Nested Named Entity Recognition (NNER) has been extensively studied, aiming to identify all nested entities from potential spans (i.e., one or more continuous tokens). However, recent studies for NNER either focus on tedious tagging schemas or utilize complex structures, which fail to learn effective span representations from the input sentence with highly nested entities. Intuitively, explicit span representations will contribute to NNER due to the rich context information they contain. In this study, we propose a Hierarchical Transformer (HiTRANS) network for the NNER task, which decomposes the input sentence into multi-grained spans and enhances the representation learning in a hierarchical manner. Specifically, we first utilize a two-phase module to generate span representations by aggregating context information based on a bottom-up and top-down transformer network. Then a label prediction layer is designed to recognize nested entities hierarchically, which naturally explores semantic dependencies among different spans. Experiments on GENIA, ACE-2004, ACE-2005 and NNE datasets demonstrate that our proposed method achieves much better performance than the state-of-the-art approaches.

Introduction
Named entity recognition (NER) is an essential task in natural language processing, which aims to detect text spans in a chunk of text and classify them into corresponding semantic categories, e.g., Person (PER), Organization (ORG), and Location (LOC). Most existing studies focus on flat NER, i.e., NER without nested entities, using sequence labeling methods (Yang et al., 2020; Yoon et al., 2019). However, named entities are often nested within each other in the real world (Finkel and Manning, 2009). For example, in Figure 1, the entity "St. Louis" with label "CITY" is nested in "St. Louis Cardinals" with label "SPORTSTEAM". This poses a major technical challenge to previous methods, and a more robust model for nested NER (NNER) is therefore urgently needed.
Previous literature on NNER can be roughly categorized into three types: 1) hypergraph-based models focus on designing a complex hypergraph structure to obtain an expressive tagging schema (Straková et al., 2019; Katiyar and Cardie, 2018; Lu and Roth, 2015), but are time-consuming when encountering ambiguous schemas; 2) span-based models first detect candidate spans from an input sentence and then train a classifier to predict entity categories (Luo and Zhao, 2020; Zheng et al., 2019). However, it is hard to capture the complete meaning of the sentence because each text span contains only part of its semantics, and errors may propagate to the prediction stage if span boundaries are divided incorrectly in the first stage; and 3) layered-based models utilize layered structures to deal with NNER based on the divide-and-conquer strategy (Jue et al., 2020; Xia et al., 2019; Ju et al., 2018). However, this strategy merely breaks the complex problem into several smaller subtasks and pays little attention to hierarchical representation learning for multi-grained named entities.
To this end, we propose a novel hierarchical transformer network (HiTRANS) to recognize named entities (either nested or not) in a given sentence, which captures the dependencies of adjacent candidate spans and utilizes an attention mechanism to enhance the representation of text spans. More specifically, the input of our proposed method combines character-level, word-level, and sentence-level representations, obtained by three embedding networks, respectively. We then propose a two-phase span generation model (SGM) on top of the multi-level representations, which hierarchically aggregates adjacent spans with a transformer mechanism at each layer. The SGM includes a bottom-up and a top-down structure to enhance the representation learning of each candidate span, which is finally fed into a label prediction layer to assign an entity class to the span. As a result, nested entities are comprehensively contained in the candidate spans, and the representation learning is further enhanced by both the multi-level embeddings and the hierarchical transformer mechanism. Experimental results on four datasets demonstrate that HiTRANS establishes new state-of-the-art performance, which verifies the effectiveness of our proposed framework. The main contributions of this work are as follows:
• We propose a novel hierarchical transformer framework (HiTRANS) for NNER, which is superior in modeling the nested relations among multi-grained named entities and learning more effective representations.
• Entity representation learning is formulated as a two-phase span generation, which aggregates context information of adjacent spans in a bottom-up and a top-down manner, respectively. The span representation is enhanced by multi-level features and context information.
• The overall superiority of our HiTRANS is validated on four benchmarks compared with state-of-the-art methods. Visualization and a case study conducted on the outputs of each layer further provide an in-depth understanding of our method.

Related Work
We briefly review some prior works closely related to ours from three perspectives: hypergraph-based, span-based, and layered-based approaches.
Hypergraph-based approaches obtain expressive tagging schemas for NNER (Lu and Roth, 2015). However, hypergraphs require specifically designed modules to prevent spurious structures. Muis and Lu (2017) introduced mention separators to facilitate multi-graph representation. Katiyar and Cardie (2018) further improved the results using features extracted from a recurrent neural network. Recently, Straková et al. (2019) proposed two competitive neural networks using a linearized scheme. However, more expressive and unambiguous schemas inevitably cause higher time complexity.
Span-based methods achieve promising results for NNER (Tan et al., 2020; Zheng et al., 2019; Sohrab and Miwa, 2018). They explicitly enumerate all possible spans from input sentences and feed them into a classifier for category prediction based on multi-task learning. Lin et al. (2019) proposed a sequence-to-nuggets architecture to recognize nested entities with semantic central words. Li et al. (2020) extracted answer spans from a passage given a question. Luo and Zhao (2020) proposed a novel bipartite flat-graph network to learn the dependencies of inner spans. However, most of these methods break input sequences into fragments, leading to incomplete semantics.
Layered-based models have been proposed more recently. Finkel and Manning (2009) constructed a syntactic constituency tree to transform each sentence into a tree; another line of work proposed a transition-based model that maps a sentence with nested mentions to a designated forest; Fisher and Vlachos (2019) and Ju et al. (2018) dynamically stacked multiple flat NER layers from inside to outside; Shibuya and Hovy (2020) introduced a decoding method that iteratively recognizes entities in an outside-to-inside way; and Jue et al. (2020) and Xia et al. (2019) utilized a layered model to recursively identify named entity candidates based on a hierarchical structure, which is suitable for NNER. However, few of them emphasize learning more effective span representations, and they thus fail to recognize nested named entities in more complex sentences.
Our work is motivated by enhancing representation learning for more complex sentences. We propose to leverage the representation power of the transformer based on a hierarchical structure to improve NNER. In particular, pre-trained word embeddings such as GloVe (Pennington et al., 2014) and pre-trained sentence-level embeddings such as BERT (Devlin et al., 2019) and ALBERT (Lan et al., 2020) have proven to be effective for NER. In this paper, we apply both kinds of embeddings, in addition to character embeddings, to further improve the performance.
Figure 2: Overview of the HiTRANS framework. (b) The span generation model (SGM) consists of two phases that iteratively generate span representations for each layer by merging adjacent spans in a bottom-up and a top-down manner, respectively; "P" denotes the padding used by the CNN, and the outputs "*t spans" denote the representations of candidate spans at each layer (the maximum span length is 6 in this example). (c) During hierarchical label prediction (HLP), the same labeling network is employed in each layer (e.g., Layer 4). (d) Output entities; different layers are displayed in different colors.

Our Proposed Method
Prior hypergraph-based and span-based methods for NNER suffer from ambiguous schemas or error propagation in complex sentences; thus, layered-based models have been proposed to decompose the problem into several smaller subtasks. However, for NNER, learning effective representations and modeling inter-entity dependencies remains a substantial challenge. In this study, we hypothesize that nested entities in the same context are complementary and that multi-level text representations can improve NNER.
Given an input sentence S composed of a sequence of words, i.e., S = {w_1, w_2, ..., w_{|S|}}, where |S| denotes the number of words, the NNER task associates each word w_i with multiple categorical labels {y_i^1, y_i^2, ..., y_i^L}, where L denotes the maximum nesting depth. Note that if L = 1, each word is associated with a single categorical label, which corresponds to flat NER. Therefore, we formulate NNER as a multi-layer prediction problem. Specifically, the topmost layer is processed as flat NER, and the remaining L − 1 layers use their corresponding labels to recognize complete entities from text spans. Each layer l is modeled as sequence labeling, that is, f_l : e_1 e_2 · · · e_{T_l} → y_1 y_2 · · · y_{T_l}, where e_i indicates the representation of a text span (i.e., one or more continuous words) iteratively generated from the previous layer and T_l indicates the number of spans in the l-th layer (1 ≤ l ≤ L).
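To make this formulation concrete, the short sketch below (purely illustrative; the helper build_layer_labels and the token indexing are our own, not the authors' preprocessing code) decomposes the "St. Louis Cardinals" example from the Introduction into per-layer span labels, where Layer l holds all l-token spans.

```python
# Illustrative only: decompose a sentence with nested entities into per-layer
# span labels (layer l holds all l-token spans). Labels follow the example
# from the Introduction; this is a hypothetical sketch of the label layout.

def build_layer_labels(tokens, entities, max_depth):
    """entities: list of (start, end_exclusive, label) over token indices."""
    gold = {(s, e): lab for s, e, lab in entities}
    layers = []
    for l in range(1, max_depth + 1):                 # layer l -> spans of l tokens
        spans = [(i, i + l) for i in range(len(tokens) - l + 1)]
        layers.append([gold.get(span, "O") for span in spans])
    return layers

tokens = ["St.", "Louis", "Cardinals"]
entities = [(0, 2, "CITY"), (0, 3, "SPORTSTEAM")]     # "St. Louis", "St. Louis Cardinals"
for l, labels in enumerate(build_layer_labels(tokens, entities, 3), start=1):
    print(f"Layer {l}: {labels}")
# Layer 1: ['O', 'O', 'O']
# Layer 2: ['CITY', 'O']
# Layer 3: ['SPORTSTEAM']
```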
In the following subsections, we introduce our proposed HiTRANS, which consists of three parts: Multi-level Representation, Span Generation Model, and Hierarchical Label Prediction. Figure 2 gives an overview of our framework.

Multi-level Representation
To better capture the semantic information of a sentence, we learn token representations at multiple levels, e.g., the character level, word level, and sentence level. As Figure 2 (a) shows, given a sentence composed of a sequence of words S = {w_1, w_2, ..., w_{|S|}}, let c_{i,j} denote the j-th character within the i-th word w_i. For the i-th word, the multi-level representation is:
x_i = [x_i^c ; x_i^w ; x_i^s],
where x_i^c denotes the character-level representation of w_i. As each word can be regarded as a character sequence, randomly initialized character embeddings are encoded by a bidirectional LSTM layer (Zheng et al., 2019) to capture sequential features in the context, and we use the last hidden state as x_i^c. x_i^w denotes the word-level representation obtained from GloVe (Pennington et al., 2014) for the i-th word w_i, x_i^s denotes the sentence-level representation obtained from a pretrained language model, e.g., BERT or ALBERT, and [;] denotes concatenation. Furthermore, a dense layer is applied to reduce the embedding dimension, i.e., x_i → e_i. Thus, we obtain the span representation of the l-th layer as H_l = {e_1, e_2, ..., e_{T_l}}, where T_l is the span number.
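The following PyTorch sketch illustrates one way such a multi-level token representation could be assembled (character BiLSTM, word embedding, and a precomputed sentence-level vector, concatenated and projected by a dense layer). Module names and the 768-dimensional sentence embedding are illustrative assumptions rather than the authors' exact configuration.

```python
# A minimal sketch of the multi-level token representation; dimensions and
# module names are illustrative assumptions, not the paper's exact setup.
import torch
import torch.nn as nn

class MultiLevelEmbedding(nn.Module):
    def __init__(self, n_chars, n_words, char_dim=30, word_dim=100,
                 sent_dim=768, out_dim=200):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.char_lstm = nn.LSTM(char_dim, char_dim, bidirectional=True,
                                 batch_first=True)
        self.word_emb = nn.Embedding(n_words, word_dim)   # e.g., initialized with GloVe
        self.proj = nn.Linear(2 * char_dim + word_dim + sent_dim, out_dim)

    def forward(self, char_ids, word_ids, sent_repr):
        # char_ids: (n_tokens, max_chars), word_ids: (n_tokens,),
        # sent_repr: (n_tokens, sent_dim) from a pretrained LM such as BERT.
        _, (h_n, _) = self.char_lstm(self.char_emb(char_ids))
        x_char = torch.cat([h_n[0], h_n[1]], dim=-1)      # last fwd/bwd hidden states
        x_word = self.word_emb(word_ids)
        x = torch.cat([x_char, x_word, sent_repr], dim=-1)
        return self.proj(x)                               # e_i for each token

emb = MultiLevelEmbedding(n_chars=100, n_words=5000)
e = emb(torch.randint(0, 100, (6, 12)),                   # 6 tokens, 12 chars each
        torch.randint(0, 5000, (6,)),
        torch.randn(6, 768))
print(e.shape)  # torch.Size([6, 200])
```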
To learn more effective span representations in Figure 2 (b), we further adapt the multi-head attention mechanism from the Transformer (Vaswani et al., 2017) in each layer of HiTRANS, as illustrated in Figure 3. Specifically, in the l-th layer, HiTRANS first transforms the multi-level representation H_l into multiple subspaces with different linear projections:
Q_h = H_l W_h^Q,  K_h = H_l W_h^K,  V_h = H_l W_h^V,
where {Q_h, K_h, V_h} are respectively the query, key, and value representations with trainable parameters {W_h^Q, W_h^K, W_h^V} corresponding to the h-th head. Then, the attention function is applied to refine the span representations:
H_l^h = softmax(Q_h K_h^T / √d_h) V_h,
where H_l^h is the h-th head and d_h is its dimension. Furthermore, we concatenate the output representations of all heads with a residual connection to capture global semantic information in parallel:
H_l = [H_l^1 ; H_l^2 ; ... ; H_l^N] W^O + H_l,
where H_l ∈ ℝ^{T_l × N d_h} is the final span representation in the l-th layer, N is the number of parallel heads, and W^O is a trainable parameter. For example, H_1 indicates the refined span representations for the first layer at Phase 1 of Figure 2 (b).
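A minimal sketch of this residual multi-head attention refinement is given below, assuming a 200-dimensional span representation and 8 heads as in our experimental settings; the class name and implementation details are illustrative.

```python
# A minimal sketch of residual multi-head attention over the spans of one layer.
import math
import torch
import torch.nn as nn

class ResidualMultiHeadAttention(nn.Module):
    def __init__(self, d_model=200, n_heads=8):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, H):                       # H: (n_spans, d_model)
        T = H.size(0)
        def split(x):                           # -> (n_heads, n_spans, d_head)
            return x.view(T, self.n_heads, self.d_head).transpose(0, 1)
        Q, K, V = split(self.w_q(H)), split(self.w_k(H)), split(self.w_v(H))
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_head)
        heads = torch.softmax(scores, dim=-1) @ V          # per-head attention
        heads = heads.transpose(0, 1).reshape(T, -1)       # concatenate heads
        return self.w_o(heads) + H                         # residual connection

mha = ResidualMultiHeadAttention()
print(mha(torch.randn(6, 200)).shape)  # torch.Size([6, 200])
```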

Span Generation Model
To extract nested entities from sentences with nested structures, we design a hierarchical span generation model (SGM) consisting of two phases to generate candidate spans for NNER, as shown in Figure 2 (b). Specifically, the two phases are composed of layers that generate candidate spans in a bottom-up and a top-down manner (i.e., BU-SGM and TD-SGM) in sequence. In each layer of the SGM, a convolutional neural network (CNN) is first utilized to aggregate two adjacent spans for the next layer, which generates all possible flat entities as candidates for further prediction. Then a multi-head attention layer is utilized to enhance the representation learning of each candidate. The details of each component are described below.
BU-SGM. The core idea of the BU-SGM network is to generate feature vectors for candidate spans by recursively stacking convolutional neural networks from the bottom layer to the top layer, as shown at Phase 1 of the SGM in Figure 2 (b). Specifically, the span representations generated in the first layer correspond to 1-token entities. For higher layers, a CNN with a kernel size of 2 is iteratively applied to generate continuous l-token span representations from the (l − 1)-th layer in a bottom-up manner, which avoids breaking the consecutive context. The span representations in the l-th layer are obtained in a bottom-up manner as:
Ĥ_l = f(CNN(Ĥ_{l−1})),  Ĥ_1 = H_1,
where f(·) is the shorthand of Equation (4), H_1 denotes the refined multi-level representation obtained for the first layer, and Ĥ_l denotes the span representations in Layer l generated iteratively from Layer l − 1. Note that stacking CNNs reduces the number of spans by 1 in each layer. Besides, a ReLU and Norm layer is applied to obtain the final span representations.
TD-SGM. In the opposite direction, since long entities at higher layers are closely related to short entities at lower layers in the same context, high-level features can contribute to identifying entities in lower layers by providing additional background information, which is complementary to low-level features. Therefore, the TD-SGM network propagates higher-layer information to lower layers in a top-down manner; it is initialized with 0 and guided by the output from the corresponding layer of Phase 1. Specifically, the span representations generated at Phase 2 are iteratively obtained by stacking CNNs (with a kernel size of 2) with proper zero padding in a top-down manner. For example, as Figure 2 (b) shows, the span representations in Layer 4 at Phase 2, i.e., {e_1, e_2, e_3}, are generated from the span representations in Layer 5, i.e., {0, e_1, e_2, 0}, which are obtained by concatenating the span representations in Layer 5 at Phase 1 and Phase 2 and then padding with zeros. Similarly, the span representations in the l-th layer are generated in a top-down manner as follows:
Ȟ_l = f(CNN(pad([Ĥ_{l+1} ; Ȟ_{l+1}]))),  Ȟ_L = 0,
where Ȟ_l denotes the span representations in Layer l generated from Layer l + 1, 0 denotes the zero tensor used to initialize the top-layer representation, and pad(·) denotes zero padding.
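The sketch below illustrates a single bottom-up and a single top-down span-generation step under the assumptions above: a kernel-size-2 CNN merges adjacent spans (reducing the span count by 1 bottom-up, and, with zero padding over the concatenated Phase-1/Phase-2 inputs, increasing it by 1 top-down). The refine placeholder stands in for the attention refinement f(·); all names and dimensions are illustrative.

```python
# A minimal sketch of one BU-SGM step and one TD-SGM step (illustrative).
import torch
import torch.nn as nn

d = 200
merge_bu = nn.Conv1d(d, d, kernel_size=2)            # bottom-up merge of adjacent spans
merge_td = nn.Conv1d(2 * d, d, kernel_size=2)        # top-down, over [Phase-1; Phase-2]
refine = nn.Identity()                               # placeholder for f(.), the attention

def bottom_up_step(H_prev):                          # H_prev: (T, d)
    x = H_prev.t().unsqueeze(0)                      # (1, d, T)
    return refine(merge_bu(x).squeeze(0).t())        # (T-1, d)

def top_down_step(H_phase1, H_phase2):               # both from the layer above: (T, d)
    x = torch.cat([H_phase1, H_phase2], dim=-1)      # (T, 2d)
    x = nn.functional.pad(x.t().unsqueeze(0), (1, 1))  # zero-pad both ends
    return refine(merge_td(x).squeeze(0).t())        # (T+1, d)

H1 = torch.randn(6, d)                               # six 1-token spans
H2 = bottom_up_step(H1)                              # five 2-token spans
print(H2.shape, top_down_step(H2, torch.zeros_like(H2)).shape)
# torch.Size([5, 200]) torch.Size([6, 200])
```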

Hierarchical Label Prediction
To recognize named entities from candidate spans, a hierarchical label prediction (HLP) network is introduced, as shown for Layer 4 in Figure 2 (c). First, the outputs of Phase 1 and Phase 2 are concatenated as the final candidate span representations, combining the bidirectional features into a globally informative representation. Formally, the final span representations in the l-th layer are:
E_l = [Ĥ_l ; Ȟ_l].
As BiLSTM networks can make full use of context information at a higher level, we employ a BiLSTM and a linear layer to predict labels for candidate spans in a hierarchical manner. Since we have obtained complete candidate spans, e.g., {1-token spans, 2-token spans, ..., L-token spans}, based on the attention weights in the SGM module, we can easily classify them into proper categories. The predicted labels for the span representations in the l-th layer are obtained as:
Y_l = softmax(U_2 (U_1 BiLSTM(E_l) + b_1) + b_2),
where U_1, U_2, b_1, and b_2 are trainable parameters and Y_l denotes the predicted labels of the l-th layer. The total output of the L layers is Y = {Y_1, Y_2, ..., Y_L}.
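A minimal sketch of this label prediction step is shown below: the Phase-1 and Phase-2 span representations are concatenated, passed through a BiLSTM over the spans of one layer, and projected to label logits. The exact arrangement of the two linear maps (corresponding to U_1 and U_2) is an assumption based on the description above.

```python
# A minimal sketch of the hierarchical label prediction layer (illustrative).
import torch
import torch.nn as nn

class LabelPredictor(nn.Module):
    def __init__(self, d=200, hidden=200, n_labels=8):
        super().__init__()
        self.bilstm = nn.LSTM(2 * d, hidden, bidirectional=True, batch_first=True)
        self.hidden2tag = nn.Sequential(nn.Linear(2 * hidden, hidden),  # ~U_1, b_1
                                        nn.ReLU(),
                                        nn.Linear(hidden, n_labels))    # ~U_2, b_2

    def forward(self, H_bu, H_td):               # each: (n_spans, d)
        E = torch.cat([H_bu, H_td], dim=-1)      # final span representations E_l
        out, _ = self.bilstm(E.unsqueeze(0))     # context over the spans of this layer
        return self.hidden2tag(out.squeeze(0))   # (n_spans, n_labels) logits

pred = LabelPredictor()
logits = pred(torch.randn(5, 200), torch.randn(5, 200))
print(logits.argmax(dim=-1))                     # predicted label index per span
```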

Model Training
We prepare the gold labels in a hierarchical manner; therefore, each of the bottom L − 1 layers of the proposed model can be simplified as a multi-class classification task, while the topmost layer is a flat NER task. During training, our model predicts the distribution of entity semantic labels for each layer. Finally, we compute the cross-entropy loss as follows:
ℒ = − Σ_{l=1}^{L} Σ Ŷ_l log Y_l,
where Ŷ_l and Y_l denote the true and predicted distributions of entity semantic labels, respectively, and ℒ is the summation of the loss over all layers. The complete training procedure for HiTRANS is shown in Algorithm 1.
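The training objective can be sketched as follows, summing the cross-entropy loss over all layers' span-label predictions; variable names and the toy shapes are illustrative.

```python
# A minimal sketch of the multi-layer cross-entropy objective (illustrative).
import torch
import torch.nn.functional as F

def nner_loss(logits_per_layer, gold_per_layer):
    # logits_per_layer[l]: (n_spans_l, n_labels); gold_per_layer[l]: (n_spans_l,)
    return sum(F.cross_entropy(logits, gold, reduction="sum")
               for logits, gold in zip(logits_per_layer, gold_per_layer))

logits = [torch.randn(6, 8), torch.randn(5, 8), torch.randn(4, 8)]
gold = [torch.randint(0, 8, (n,)) for n in (6, 5, 4)]
print(nner_loss(logits, gold))   # scalar loss summed over the three layers
```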
Datasets and Baselines
We use the original dataset split and pre-processing. There are 5/7/7/114 different entity types in the GENIA, ACE-2004, ACE-2005, and NNE datasets, respectively. For evaluation, we employ micro-averaged precision (P), recall (R), and F1. Table 1 lists the data statistics of each dataset. We comprehensively compare our proposed model with state-of-the-art baselines, which can be categorized into three groups:
• Hypergraph-based methods: These obtain expressive tagging schemas for NER, including the Revisited model (Katiyar and Cardie, 2018) and the Linearization model (Straková et al., 2019).

Experimental Settings
We obtain the character-level representation encoded by a BiLSTM and the word-level representation from the 100-dimensional pre-trained GloVe word embeddings (Pennington et al., 2014), which are trained on 6B tokens. For sentence-level embeddings, we use BERT and ALBERT embeddings to further improve NNER. For the ACE-2004, ACE-2005, and NNE datasets, the dimensions of the character-level, word-level, and sentence-level embeddings are set by default to 30, 100, and 5120 (1024 + 4096), respectively. For the GENIA dataset, we obtain word embeddings from pre-trained PubMed embeddings trained on a biomedical corpus (Chiu et al., 2016), setting the dimension of the word-level embeddings to 200. The output dimension of the multi-level representation and the hidden size of the bidirectional LSTM are set to 200. The number of parallel heads is set to 8.
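For reference, the embedding dimensions reported above can be collected into a single configuration, sketched below; the key names are ours, and the values follow the text (the 5120-dimensional sentence-level embedding corresponds to the concatenated 1024 + 4096 features).

```python
# Purely illustrative: the embedding dimensions stated above as one config dict.
embedding_config = {
    "char_emb_dim": 30,            # character-level embedding (BiLSTM-encoded)
    "word_emb_dim": 100,           # GloVe; 200 when using PubMed embeddings for GENIA
    "sent_emb_dim": 5120,          # sentence-level, 1024 + 4096 (BERT/ALBERT features)
    "multi_level_out_dim": 200,    # output of the dense projection
    "bilstm_hidden_size": 200,
    "n_attention_heads": 8,
}
```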
The number of layers is set to 16, which exceeds the length of most entities, and the batch size is empirically set to 32. We use the SGD optimizer to train our model with a learning rate of 0.02, and the dropout rate is set to 0.4 to avoid overfitting. All of our experiments are performed on the same machine. We repeat each experiment 5 times and report the average performance on the test set.

Overall Performance
Table 2 shows the overall results compared with the baseline methods by group. Overall, hypergraph-based methods achieve decent results thanks to their expressive tagging schemas; however, ambiguity and high time complexity are hard to avoid. Span-based methods improve the performance of NNER; however, they may break the continuous structure of the context. To alleviate this problem, layered-based models further improve the final performance with hierarchical layers; however, their span representations are oversimplified. In addition, methods incorporating a pretrained language model, e.g., BERT or ALBERT, generally outperform previous methods, as they take advantage of sentence-level features from context. As shown in Table 1, 22%, 47%, 39%, and 83% of the test sets of GENIA, ACE-2004, ACE-2005, and NNE, respectively, contain nested entities to different degrees. Table 2 shows that our proposed HiTRANS achieves state-of-the-art results on the GENIA, ACE-2004, ACE-2005, and NNE datasets, which verifies the effectiveness of HiTRANS for NNER. Moreover, HiTRANS outperforms all baselines on the NNE dataset, which contains 114 categories of entities, further validating its superiority in recognizing nested entities in complex sentences. Regarding the overall trend, span-based and layered-based methods have drawn more attention than hypergraph-based methods in recent years, probably because they effectively balance performance and efficiency. In summary, the overall performance of HiTRANS demonstrates its superiority in NNER, which benefits from the hierarchical span representations.

Ablation Study
As shown in Table 3, the multi-level features (i.e., CL-EMB, WL-EMB, and SL-EMB) obtained from the character, word, and sentence levels are essential for the final performance. In particular, the sentence-level feature improves the performance by a large margin, which may be because the language model usually has a large number of parameters and thus learns a better representation. Besides, HiTRANS without WL-EMB shows a slight increase in recall but a decrease in precision, which indicates that the word-level feature helps select the correct entity from candidate spans. The residual multi-head attention (MHA) also contributes to the final performance, which can be attributed to the refined span representations in each layer. In addition, the HiTRANS model with two phases shows better performance, which may be because Phase 2 further propagates information in a top-down manner. We only remove Phase 2 in the ablation studies, since Phase 1 needs to take the original multi-level representations as input. In all, our HiTRANS achieves an 87.04% F1 score, which indicates that all components contribute to the effectiveness and that the whole framework is superior in overall performance.

Case Study and Visualization
Table 4 shows a case study comparing our model with the Exhaustive (Sohrab and Miwa, 2018), Layered (Ju et al., 2018), Boundary-aware (Zheng et al., 2019), and Pyramid (Jue et al., 2020) models, which are the most germane and representative. In this example, the entity "the New England chain Stop n' shop" contains the entity "the New England chain", which in turn has the entity "New England" nested in it. Our proposed model recognizes all potential entities of different lengths in a fine-to-coarse manner. Exhaustive predicts wrong entity heads and misses the token "the" in entities, and Layered merely extracts the outer entities. Compared with the Pyramid model, which detects wrong spans, our HiTRANS extracts both inner and outer entities more precisely in a hierarchical manner. This demonstrates that HiTRANS contributes to the performance of NNER, which may be because the hierarchical transformer refines span representations in each layer. Furthermore, the hierarchical label prediction model has the advantage of identifying nested named entities by incorporating the semantic dependencies among spans.
For an in-depth analysis of HiTRANS, we visualize the predictions in each layer with masking. Owing to the space limit, only the first four layers are shown in Figure 4. From the input sentence, "his" is correctly recognized as a 1-token entity with 0.43 confidence in Layer 1, and "Saddam Hussein" and "his Henchmen" are recognized as 2-token entities with 0.29 and 0.31 confidence in Layer 2, respectively. Likewise, other l-token spans in Layer l are assigned different confidences. In short, the recognized entities of different lengths are assigned higher confidences than other spans in each layer, which helps distill true named entities from candidate spans and further validates the effectiveness of HiTRANS for NNER.

Parameter Sensitivity Analysis
Two primary parameters, i.e., the number of layers and the batch size, are selected to verify the impact of parameter settings on the effectiveness of HiTRANS. The number of layers denotes how many layers are used in the hierarchical model, and the batch size controls the size of the allocated resources. To study the uncertainty in the output of HiTRANS, we adopt a single-parameter sensitivity analysis, varying one parameter while fixing the others each time. As Figure 5 shows, when the number of layers and the batch size change, especially when both are greater than 4, HiTRANS still maintains high performance on the four benchmark datasets. Note that an out-of-memory problem occurs when the number of layers is set to 32 (i.e., 2^5) on the NNE and GENIA datasets, as shown on the left of Figure 5. Although the number of layers is related to the maximum nesting depth, the results demonstrate that HiTRANS is not sensitive to parameter settings and exhibits superior performance and robustness for NNER.

Conclusion
This paper presents a novel HiTRANS framework, which learns effective span representations for the label prediction of nested entities in a hierarchical manner. The proposed framework iteratively generates candidate span representations by aggregating adjacent features and further refines them with a bottom-up and top-down transformer network. Moreover, candidate spans are recognized as named entities sequentially, leveraging the semantic dependencies of adjacent spans. Extensive experimental results demonstrate that HiTRANS achieves state-of-the-art performance on the GENIA, ACE-2004, ACE-2005, and NNE datasets.