Enhancing Entity Boundary Detection for Better Chinese Named Entity Recognition

In comparison with English, Chinese Named Entity Recognition (NER) is much more challenging due to the lack of explicit word boundaries and tense information. In this paper, we propose a boundary-enhanced approach for better Chinese NER. In particular, our approach enhances boundary information from two perspectives. On one hand, we enhance the representation of the internal dependencies of phrases with an additional Graph Attention Network (GAT) layer. On the other hand, taking entity head-tail prediction (i.e., boundary detection) as an auxiliary task, we propose a unified framework to learn boundary information and recognize named entities jointly. Experiments on both the OntoNotes and the Weibo corpora show the effectiveness of our approach.


Introduction
Given a sentence, the NER task aims to identify noun phrases that carry predefined special meanings. Due to its importance for many downstream tasks, such as relation extraction (Ji et al., 2017), coreference resolution (Clark and Manning, 2016) and knowledge graphs, NER has attracted much attention for a long time.
In comparison with English, Chinese NER is much more challenging due to the lack of explicit word boundaries and tense information. In fact, the performance of current state-of-the-art systems on Chinese is far inferior to that on English, with a gap of about 10% in F1-measure. In this paper, we propose a boundary-enhancing approach for better Chinese NER.
Firstly, using Star-Transformer (Guo et al., 2019), we construct a lightweight baseline system. Benefiting from its unique star topological structure, Star-Transformer excels at representing long-distance sequences, and thus our baseline achieves performance comparable to the state of the art. Considering its deficiency in representing local sequence information, we then try to enhance local boundary information. In particular, our approach enhances boundary information from two perspectives. On one hand, we add an additional GAT (Veličković et al., 2017) layer to capture the internal dependencies of phrases. In this way, boundaries can be distinguished implicitly, while the semantic information within the phrase is enhanced. On the other hand, we add an auxiliary task to predict the heads and tails of entities. In this way, using a multi-task learning framework, we can learn the boundary information explicitly and thereby help the NER task. Experiments show the effectiveness of our approach. Notably, our approach obtains new state-of-the-art results on both the OntoNotes and the Weibo corpora, which means it performs well on both formal and informal texts.

Related Work
As is well known, most studies cast the NER task as a traditional sequence labelling problem, and many models extending the Bi-LSTM+CRF architecture have been proposed (Huang et al., 2015; Chiu and Nichols, 2016; Dong et al., 2016; Lample et al., 2016; Ma and Hovy, 2016). Although the attention-based Transformer (Vaswani et al., 2017) has gradually surpassed traditional RNN models (Zaremba et al., 2014) in various fields, Yan et al. (2019) verified that the fully connected Transformer mechanism does not work well on NER. More recently, some studies have shown that Star-Transformer works well on NER owing to its lightweight topological structure (Guo et al., 2019; Chen et al., 2020). Moreover, lexical and dependency information has been widely used in this task (Zhang and Yang, 2018; Ma et al., 2020; Gui et al., 2019; Sui et al., 2019; Tang et al., 2020) to better capture local semantic information.
In this paper, using Star-Transformer as our baseline, we mainly focus on enhancing boundary information to improve Chinese NER.

Model
We also treat NER as a sequence labeling task, decoding with a classical CRF (Lafferty et al., 2001). Figure 1 shows the complete model. The encoder of our model consists of three parts: a GRU-based head and tail representation layer, a Star-Transformer based contextual embedding layer, and a GAT-based dependency embedding layer.

Token embedding layer
Considering the lack of explicit word boundaries, we combine word-level representations with character-level ones, avoiding the error propagation caused by word segmentation. For a given sentence, we represent each word and character by looking up pre-trained embeddings (Li et al., 2018), available at https://github.com/Embedding/Chinese-Word-Vectors. The sequence of character embeddings contained in a word is fed to a bidirectional GRU layer, whose hidden states are computed as:

$$\overrightarrow{h^t_i} = \overrightarrow{\mathrm{GRU}}\big(x^t_i, \overrightarrow{h^{t-1}_i}\big), \quad \overleftarrow{h^t_i} = \overleftarrow{\mathrm{GRU}}\big(x^t_i, \overleftarrow{h^{t-1}_i}\big)$$

where $x^t_i$ is the token representation and $\overrightarrow{h^t_i}$, $\overleftarrow{h^t_i}$ denote the $t$-th forward and backward hidden states of the GRU layer. Writing $c_i$ for the concatenation of the final forward and backward hidden states, the final token representation is obtained as:

$$e_i = [w_i; c_i; \mathrm{pos}_i]$$

where $[;]$ denotes concatenation, $w_i$ is the word embedding, and $\mathrm{pos}_i$ is the Part-of-Speech tag embedding of word $i$.
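To make the layer concrete, the following is a minimal PyTorch sketch of the token representation described above; the module name `TokenEmbedding`, the embedding dimensions, and the randomly initialized `nn.Embedding` tables (standing in for the pre-trained vectors) are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class TokenEmbedding(nn.Module):
    """Sketch: word embedding + char-level BiGRU + POS embedding, concatenated."""
    def __init__(self, n_words, n_chars, n_pos,
                 word_dim=300, char_dim=50, pos_dim=30):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, word_dim)
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.pos_emb = nn.Embedding(n_pos, pos_dim)
        # Bidirectional GRU over the characters of each word.
        self.char_gru = nn.GRU(char_dim, char_dim,
                               bidirectional=True, batch_first=True)

    def forward(self, word_ids, char_ids, pos_ids):
        # word_ids, pos_ids: (n,); char_ids: (n, max_chars) for one sentence.
        _, h_n = self.char_gru(self.char_emb(char_ids))
        # c_i: final forward and backward hidden states, concatenated.
        c = torch.cat([h_n[0], h_n[1]], dim=-1)        # (n, 2 * char_dim)
        # e_i = [w_i ; c_i ; pos_i]
        return torch.cat([self.word_emb(word_ids), c,
                          self.pos_emb(pos_ids)], dim=-1)
```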

Star-transformer based contextual embedding layer
Star-Transformer abandons redundant connections while retaining an approximate ability to model long-range dependencies. For the NER task, entities are sparse, so it is unnecessary to attend to all nodes in the sentence all the time. We use this structured model to encode the words in a sentence; it shows performance comparable to traditional RNN models while retaining the ability to capture long-range dependencies.

Multi-Head Attention
Transformer employs $h$ attention heads to apply self-attention to an input sequence separately; the results of the heads are then integrated, which is called Multi-Head Attention. Given a sequence of vectors $X$, we use a query vector $Q$ to softly select the relevant information with attention:

$$\mathrm{Att}(Q, K, V) = \mathrm{softmax}\Big(\frac{QK^{\top}}{\sqrt{d_k}}\Big)V, \quad K = XW^K, \quad V = XW^V$$

where $W^K$ and $W^V$ are learnable parameters. Multi-Head Attention is then defined as:

$$\mathrm{MultiAtt}(Q, X) = (\mathrm{head}_1 \oplus \cdots \oplus \mathrm{head}_h)W^O, \quad \mathrm{head}_k = \mathrm{Att}(QW^Q_k, XW^K_k, XW^V_k)$$

where $\oplus$ denotes concatenation and $W^O$, $W^Q_k$, $W^K_k$, $W^V_k$ are learnable parameters.
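A compact PyTorch sketch of this standard multi-head attention follows; the default dimensions and the batch-first layout are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    """Standard scaled dot-product attention with h heads."""
    def __init__(self, d_model=128, n_heads=4):
        super().__init__()
        assert d_model % n_heads == 0
        self.d_k, self.h = d_model // n_heads, n_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, q, x):
        # q: (B, Lq, d) queries; x: (B, Lx, d) provides keys and values.
        B, Lq, _ = q.shape
        Lx = x.size(1)
        Q = self.W_q(q).view(B, Lq, self.h, self.d_k).transpose(1, 2)
        K = self.W_k(x).view(B, Lx, self.h, self.d_k).transpose(1, 2)
        V = self.W_v(x).view(B, Lx, self.h, self.d_k).transpose(1, 2)
        # softmax(Q K^T / sqrt(d_k)) V, computed per head.
        att = F.softmax(Q @ K.transpose(-2, -1) / self.d_k ** 0.5, dim=-1)
        out = (att @ V).transpose(1, 2).reshape(B, Lq, -1)
        return self.W_o(out)  # concatenated heads, projected back to d_model
```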

Star-Transformer Encoder
The topological structure of Star-Transformer is made up of one relay node and n satellite nodes.
The state of the i-th satellite node represents the feature of the i-th token in the text sequence. The relay node acts as a virtual hub to gather and scatter information from and to all the satellite nodes (Guo et al., 2019).
Star-Transformer adopts a cyclic, time-step updating scheme, in which each satellite node is initialized with its input vector and the relay node is initialized as the average of all tokens. The state of each satellite node is updated from its adjacent nodes: the previous node in the previous round $h^{t-1}_{i-1}$, the current node in the previous round $h^{t-1}_i$, the next node in the previous round $h^{t-1}_{i+1}$, the current input $e_i$, and the relay node in the previous round $s^{t-1}$:

$$C^t_i = [h^{t-1}_{i-1}; h^{t-1}_i; h^{t-1}_{i+1}; e_i; s^{t-1}]$$
$$h^t_i = \mathrm{MultiAtt}(h^{t-1}_i, C^t_i)$$

where $C^t_i$ denotes the contextual information of the $i$-th node. The relay node is updated from the information of all the satellite nodes and its own state in the previous round:

$$s^t = \mathrm{MultiAtt}(s^{t-1}, [s^{t-1}; H^t])$$

where $H^t$ is the matrix of all satellite states in round $t$.
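Reusing the `MultiHeadAttention` module sketched above, one update round can be written as follows; the zero-padding of boundary neighbours, the batch-first shapes, and the per-node Python loop (kept for clarity rather than speed) are our simplifying assumptions, not the authors' implementation.

```python
import torch

def star_step(h, s, e, att_sat, att_relay):
    """One cyclic update round of Star-Transformer (simplified sketch).
    h: (B, n, d) satellite states, s: (B, 1, d) relay state,
    e: (B, n, d) token inputs; att_* are MultiHeadAttention modules."""
    B, n, d = h.shape
    pad = h.new_zeros(B, 1, d)
    h_pad = torch.cat([pad, h, pad], dim=1)   # zero neighbours at the edges
    new_h = []
    for i in range(n):
        # C_i^t = [h_{i-1}; h_i; h_{i+1}; e_i; s^{t-1}]
        C = torch.cat([h_pad[:, i:i + 3], e[:, i:i + 1], s], dim=1)
        # h_i^t = MultiAtt(h_i^{t-1}, C_i^t)
        new_h.append(att_sat(h[:, i:i + 1], C))
    h = torch.cat(new_h, dim=1)
    # s^t = MultiAtt(s^{t-1}, [s^{t-1}; H^t])
    s = att_relay(s, torch.cat([s, h], dim=1))
    return h, s
```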

Highway Networks
Highway Networks (Srivastava et al., 2015) can alleviate blocked gradient backflow as networks deepen. Such gating mechanisms can be of vital significance to Transformer (Chai et al., 2020). We use Highway Networks to offset the depth and complexity of Star-Transformer.
After the Multi-Head Attention is computed, a new branch governed by Highway Networks joins in, implementing the self-updating and dynamic adjustment of the satellite nodes. Following the standard highway formulation, a gate mixes a transformed input with the untouched input:

$$g_i = \mathrm{sigmoid}(w_2 e_i + b_2)$$
$$\mathrm{Highway}(e_i) = g_i \odot \sigma(w_1 e_i + b_1) + (1 - g_i) \odot e_i$$

where $w_1$, $w_2$, $b_1$, $b_2$ are learnable parameters and $\sigma$ is the activation function. Finally, the updated satellite node is:

$$h^t_i = \mathrm{MultiAtt}(h^{t-1}_i, C^t_i) + \mathrm{Highway}(e_i)$$

Highway Networks not only enhance the inherent characteristics of the satellite nodes, but also avoid gradient blocking.
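A minimal sketch of this gated branch, following the standard highway formulation; the ReLU transform and the way the branch is added to the attention output (last comment) reflect our reading of the description above, not a confirmed implementation detail.

```python
import torch
import torch.nn as nn

class Highway(nn.Module):
    """Gate mixes a transformed input with the untouched input, keeping a
    direct path open for gradients."""
    def __init__(self, d_model=128):
        super().__init__()
        self.transform = nn.Linear(d_model, d_model)  # w1, b1
        self.gate = nn.Linear(d_model, d_model)       # w2, b2

    def forward(self, x):
        g = torch.sigmoid(self.gate(x))               # carry/transform gate
        return g * torch.relu(self.transform(x)) + (1.0 - g) * x

# Satellite update with the highway branch (shapes as in star_step above):
#   h_i_t = att_sat(h[:, i:i+1], C) + highway(e[:, i:i+1])
```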

GAT-based dependency embedding layer
In this work, we use the dependencies between words to construct a graph neural network. Dependencies are directional, and the current word is related only to words with which it shares an edge. This kind of directed linkage captures the internal structural information of entities, enriching the sequential representation.
Graph Attention Networks (GAT) (Veličković et al., 2017), which leverage masked self-attention layers to assign different importance to neighbouring nodes, fit our setting well.
The attention coefficient $e_{ij}$ and its normalized form $\alpha_{ij}$ represent the importance of node $j$ to node $i$:

$$e_{ij} = \mathrm{LeakyReLU}\big(\vec{a}^{\top}[Wh_i \oplus Wh_j]\big)$$
$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}_i}\exp(e_{ik})}$$

A GAT operation with $K$ independent attention heads can be expressed as:

$$h'_i = \bigoplus_{k=1}^{K} \sigma\Big(\sum_{j \in \mathcal{N}_i} \alpha^k_{ij} W^k h_j\Big)$$

where $\oplus$ denotes concatenation, $W$ and $\vec{a}$ are learnable parameters, $\mathcal{N}_i$ is the neighborhood of node $i$, and $\sigma$ is the activation function.
In addition to its strong focus on associated nodes, the GAT layer compensates well for the deficiency of Star-Transformer in capturing the internal dependencies of phrases.
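The following single-head PyTorch sketch shows how the masked attention above operates over a dependency adjacency matrix; a K-head version would run K copies in parallel and concatenate their outputs. The shapes and the ELU activation are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATLayer(nn.Module):
    """Single-head GAT over a dependency graph given as an adjacency mask."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.W = nn.Linear(d_in, d_out, bias=False)
        self.a = nn.Linear(2 * d_out, 1, bias=False)

    def forward(self, h, adj):
        # h: (n, d_in); adj: (n, n) bool matrix, True where j is a neighbour
        # of i (self-loops should be included so every row has a neighbour).
        z = self.W(h)
        n = z.size(0)
        # e_ij = LeakyReLU(a^T [W h_i ; W h_j]) for every pair (i, j).
        pairs = torch.cat([z.unsqueeze(1).expand(n, n, -1),
                           z.unsqueeze(0).expand(n, n, -1)], dim=-1)
        e = F.leaky_relu(self.a(pairs)).squeeze(-1)
        # Mask non-neighbours, then normalize over j.
        alpha = torch.softmax(e.masked_fill(~adj, float('-inf')), dim=-1)
        return F.elu(alpha @ z)
```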

GRU-based head and tail representation layer
While GAT is effective in capturing internal dependencies within an entity, the boundaries of the entity still need to be strengthened. We therefore cast entity boundary detection as a binary classification task trained jointly with NER, giving NER explicit entity boundary information. During training, two separate GRU layers predict the heads and tails of entities, and their hidden features are fused with the output of the GAT layer:

$$H = W_1 H^{\mathrm{gat}} + W_2 H^{\mathrm{head}} + W_3 H^{\mathrm{tail}}$$

where $W_1$, $W_2$, $W_3$ are learnable parameters, $H^{\mathrm{head}}$ and $H^{\mathrm{tail}}$ are the hidden states of the two GRUs, and $H$ is the final input to the CRF.
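A sketch of this layer follows; the binary classifiers on top of the two GRUs and the linear fusion mirror our reading of the description above, and all module and parameter names are illustrative.

```python
import torch.nn as nn

class BoundaryHeads(nn.Module):
    """Two GRUs predict entity heads/tails; their hidden states are fused
    with the GAT output to form the CRF input H."""
    def __init__(self, d_model=128):
        super().__init__()
        self.head_gru = nn.GRU(d_model, d_model, batch_first=True)
        self.tail_gru = nn.GRU(d_model, d_model, batch_first=True)
        self.head_clf = nn.Linear(d_model, 2)   # is-head vs. not
        self.tail_clf = nn.Linear(d_model, 2)   # is-tail vs. not
        self.W1 = nn.Linear(d_model, d_model, bias=False)
        self.W2 = nn.Linear(d_model, d_model, bias=False)
        self.W3 = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x, h_gat):
        # x: (B, n, d) encoder output; h_gat: (B, n, d) GAT output.
        h_head, _ = self.head_gru(x)
        h_tail, _ = self.tail_gru(x)
        # H = W1 H_gat + W2 H_head + W3 H_tail, fed to the CRF.
        H = self.W1(h_gat) + self.W2(h_head) + self.W3(h_tail)
        return H, self.head_clf(h_head), self.tail_clf(h_tail)
```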

Model Learning
Entity boundaries are not only a prediction target in their own right but also a natural aid to NER, since they mark the transitions from outside to inside a mention and vice versa. The multi-task loss function combines the categorical cross-entropy losses for boundary detection with the loss for entity label prediction:

$$\mathcal{L} = \mathcal{L}_{\mathrm{NER}} + \mathcal{L}_{\mathrm{head}} + \mathcal{L}_{\mathrm{tail}}$$
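As a sketch, the joint objective can be computed as below; the CRF negative log-likelihood is assumed to come from a separate CRF layer, and the equal weighting of the three terms is an assumption rather than a reported setting.

```python
import torch.nn.functional as F

def multi_task_loss(crf_nll, head_logits, tail_logits, head_gold, tail_gold):
    """Joint loss: CRF loss for NER plus cross-entropy for the two binary
    boundary classifiers (equal weighting assumed)."""
    l_head = F.cross_entropy(head_logits.view(-1, 2), head_gold.view(-1))
    l_tail = F.cross_entropy(tail_logits.view(-1, 2), tail_gold.view(-1))
    return crf_nll + l_head + l_tail
```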

Experiments

Datasets
Labels in our work follow the BIESO scheme (an illustrative example is given at the end of this subsection), and we use Precision (P), Recall (R) and F1 score (F1) as evaluation metrics. OntoNotes V4.0 (Pradhan, 2011) is a Chinese dataset consisting of texts from the news domain. We use the same split as Zhang and Yang (2018).
OntoNotes V5.0 (Pradhan et al., 2013) is also a Chinese dataset from the news domain, but with larger scale and more entity types. We use the same split as Jie and Lu (2019).
Weibo NER (Peng and Dredze, 2015) contains annotated messages drawn from the social media platform Sina Weibo. We use the same split as Peng and Dredze (2015).
Additionally, the tool used to parse syntactic dependencies in this paper is DDParser.
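To make the BIESO scheme concrete, here is an illustrative tagged sentence (the example and its tags are ours, not drawn from any of the corpora):

```python
# B/I/E mark the beginning/inside/end of a multi-token entity,
# S marks a single-token entity, and O marks non-entity tokens.
tokens = ["北", "京", "大", "学", "在", "中", "国"]  # "Peking University is in China"
tags   = ["B-ORG", "I-ORG", "I-ORG", "E-ORG", "O", "B-GPE", "E-GPE"]
```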

Results and Analysis
We conduct experiments on the OntoNotes and Weibo corpora and compare the results with existing models, as shown in Table 1.
We begin by establishing a Star-Transformer baseline, which is more effective on the smaller social-media Weibo corpus than on OntoNotes. On Weibo, Star-Transformer proves superior to all existing models, by at least 6.29% (F1) for Named Entities (NE) and 8.85% (F1) for Nominal Entities (NM).
Considering the structural peculiarity of OntoNotes, where entities have similar composition, we utilize GAT to model the features inside entities. The precision on the two OntoNotes corpora improves by 3.93% and 1.62%, respectively. Furthermore, boundary prediction, used as a multi-task objective, is trained together with label classification, supplying local sequence information for NER. Table 2 shows the number of different entity recognition errors of our models, including Type Errors (TE), Unidentification Errors (UE) and Boundary Errors (BE). The addition of entity head-tail prediction reduces the number of boundary errors on OntoNotes V4.0 by 37. There is no doubt that the boundary-enhanced model is highly beneficial to the recognition of both entity boundaries and entity types.
For Weibo, NE and NM show different behaviour. The more standard NE performs similarly to OntoNotes, while NM benefits less from GAT, due to its short length and lack of internal structure.
Combining the respective advantages of the three layers above, a unified and lightweight model can be applied to Chinese NER, achieving new state-of-the-art results on both the OntoNotes and Weibo corpora.

Conclusion
In this paper, we mainly focus on the impact of boundary information on Chinese NER. We first propose a Star-Transformer based NER system. Then both explicit head-tail boundary information and implicit boundary information from a dependency-based GAT are combined to improve Chinese NER. Experiments on both the OntoNotes and the Weibo corpora show the effectiveness of our approach.