MECT: Multi-Metadata Embedding based Cross-Transformer for Chinese Named Entity Recognition

Recently, word enhancement has become very popular for Chinese Named Entity Recognition (NER), reducing segmentation errors and increasing the semantic and boundary information of Chinese words. However, these methods tend to ignore the information of the Chinese character structure after integrating the lexical information. Chinese characters have evolved from pictographs since ancient times, and their structure often reflects more information about the characters. This paper presents a novel Multi-metadata Embedding based Cross-Transformer (MECT) to improve the performance of Chinese NER by fusing the structural information of Chinese characters. Specifically, we use multi-metadata embedding in a two-stream Transformer to integrate Chinese character features with the radical-level embedding. With the structural characteristics of Chinese characters, MECT can better capture the semantic information of Chinese characters for NER. The experimental results obtained on several well-known benchmarking datasets demonstrate the merits and superiority of the proposed MECT method.


Introduction
Named Entity Recognition (NER) plays an essential role in structuring unstructured text. It is a sequence tagging task that extracts named entities from unstructured text. Common categories of NER include names of people, places, organizations, time, quantity, currency, and some proper nouns. NER is the basis for many Natural Language Processing (NLP) tasks such as event extraction (Chen et al., 2015), question answering (Diefenbach et al., 2018), information retrieval (Khalid et al., 2008), knowledge graph construction (Riedel et al., 2013), etc.¹

Compared with English, there are no spaces between Chinese characters to serve as word delimiters. Chinese word segmentation is mostly inferred by readers through the semantic information of sentences, posing many difficulties for Chinese NER (Duan and Zheng, 2011). Besides, the task also has many other challenges, such as complex combinations, entity nesting, and indefinite entity length. In English, different words may share a root or affix that better represents the word's semantics. For example, physiology, psychology, sociology, technology and zoology contain the same suffix, '-logy', which helps identify the entity of a subject name. Moreover, roots and affixes often determine the general meaning of English words (Yadav et al., 2018). Roots such as 'ophthalmo-' (ophthalmology), 'esophage-' (esophagus) and 'epithelio-' (epithelium) can help humans or machines better recognize professional nouns in medicine. Therefore, even the state-of-the-art methods, such as BERT (Devlin et al., 2019) and GPT (Radford et al., 2018), trained on large-scale datasets, adopt this kind of fine-grained subword segmentation for a performance boost.

¹ The source code of the proposed method is publicly available at https://github.com/CoderMusou/MECT4CNER.
For Chinese characters, there is also a similar structure: the radicals and components of a character often carry semantic cues, analogous to the roots and affixes of English words.
To address the aforementioned issues, we take advantage of the Flat-Lattice Transformer (FLAT) (Li et al., 2020), with its efficient parallel computing and excellent lexicon learning, and introduce a radical stream as an extension on this basis. By combining the radical information, we propose a Multi-metadata Embedding based Cross-Transformer (MECT). MECT has lattice and radical streams, so it not only possesses FLAT's word boundary and semantic learning ability but also incorporates the structural information of Chinese character radicals. This is very effective for NER tasks, improving on the baseline method on different benchmarks. The main contributions of the proposed method include:

• The use of multi-metadata feature embedding of Chinese characters in Chinese NER.
• A novel two-stream model that combines the radicals, characters and words of Chinese characters to improve the performance of the proposed MECT method.
• The proposed method is evaluated on several well-known Chinese NER benchmarking datasets, demonstrating the merits and superiority of the proposed approach over the stateof-the-art methods.

Related Work
The key idea of the proposed MECT method is to use the radical information of Chinese characters to enhance the Chinese NER model, so we focus on the mainstream information enhancement methods in the literature. There are two main types of Chinese NER enhancement methods: lexical information fusion and glyph-structural information fusion.
Lexical Enhancement In Chinese NER, many recent studies use word matching methods to enhance character-based models. A typical method is the Lattice-LSTM model, which improves the NER performance by encoding and matching words in the lexicon. Recently, some lexical enhancement methods were proposed using CNN models, such as LR-CNN (Gui et al., 2019a) and CAN-NER (Zhu and Wang, 2019). Graph networks have also been used for lexical enhancement, a typical one being LGN (Gui et al., 2019b). There are also Transformer-based lexical enhancement methods, such as PLT (Xue et al., 2019) and FLAT. In addition, SoftLexicon introduces lexical information at the character representation layer through label and probability methods.
Glyph-structural Enhancement Some studies also use glyph-structural information in Chinese NER. An early work was the first to study the application of radical-level information in Chinese NER, using a Bi-LSTM to extract radical-level embedding and then concatenating it with the embedding of characters as the final input. The radical information used in the Bi-LSTM is the structural components (SC) shown in Table 1, and the method achieved state-of-the-art performance on the MSRA dataset. The Glyce model used Chinese character images to extract features such as strokes and structure of Chinese characters, achieving promising performance in Chinese NER. Some other methods (Xu et al., 2019; Song et al., 2020) also proposed to use radical information and Tencent's pre-trained embedding² to improve the performance. In these works, the structural components of Chinese characters have been proven to enrich the semantics of the characters, resulting in better NER performance.

Background
The proposed method is based on the Flat-Lattice Transformer (FLAT) model. Thus, we first briefly introduce FLAT, which improves the encoder structure of the Transformer by adding word lattice information, including semantic and position boundary information. These word lattices are obtained through dictionary matching. Figure 1 shows the input and output of FLAT. It uses a relative position encoding transformed from head and tail positions to fit each word's boundary information. The relative position encoding, R_ij, is calculated as follows:

    R_ij = ReLU(W_r (p_{h_i - h_j} ⊕ p_{t_i - t_j} ⊕ p_{h_i - t_j} ⊕ p_{t_i - h_j})),    (1)

where W_r is a learnable parameter, h_i and t_i represent the head position and tail position of the i-th character, ⊕ denotes the concatenation operation, and p_span is obtained as in Vaswani et al. (2017):

    p_span^(2k) = sin(span / 10000^(2k / d_model)),    (2)
    p_span^(2k+1) = cos(span / 10000^(2k / d_model)),    (3)

where span corresponds to any of the four position offsets in Eq. (1), i.e., h_i - h_j, t_i - t_j, h_i - t_j and t_i - h_j. Then the scaled dot-product attention is obtained by:

    Attn(A, V) = softmax(A) V,    (4)
    A_ij = Q_i K_j^T + Q_i (R*_ij)^T + u K_j^T + v (R*_ij)^T,    (5)

where R*_ij = R_ij · W_R, and u, v and W_R are learnable parameters.
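The relative position encoding described above can be sketched in NumPy as follows. This is a minimal illustration, not FLAT's implementation: the head/tail indices are toy values, and W_r is randomly initialized in place of a learned parameter.

```python
import numpy as np

def sinusoidal(span, d_model):
    # p_span^(2k) = sin(span / 10000^(2k/d_model)), p_span^(2k+1) = cos(...)
    k = np.arange(d_model // 2)
    angles = span / np.power(10000.0, 2 * k / d_model)
    p = np.empty(d_model)
    p[0::2] = np.sin(angles)
    p[1::2] = np.cos(angles)
    return p

def relative_position_encoding(heads, tails, d_model, W_r):
    # R_ij = ReLU(W_r [p_{hi-hj} ; p_{ti-tj} ; p_{hi-tj} ; p_{ti-hj}])
    n = len(heads)
    R = np.zeros((n, n, d_model))
    for i in range(n):
        for j in range(n):
            p = np.concatenate([
                sinusoidal(heads[i] - heads[j], d_model),
                sinusoidal(tails[i] - tails[j], d_model),
                sinusoidal(heads[i] - tails[j], d_model),
                sinusoidal(tails[i] - heads[j], d_model),
            ])
            R[i, j] = np.maximum(0.0, W_r @ p)  # ReLU
    return R

# Toy lattice: 3 characters plus 2 matched words spanning them.
heads = [0, 1, 2, 0, 1]
tails = [0, 1, 2, 1, 2]
d = 8
rng = np.random.default_rng(0)
W_r = rng.standard_normal((d, 4 * d))
R = relative_position_encoding(heads, tails, d, W_r)
print(R.shape)  # (5, 5, 8)
```

Note that a word and a character get distinct encodings even when their head positions coincide, because the four span offsets jointly encode both boundaries.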

The Proposed MECT Method
To better integrate the information of Chinese character components, we use the Chinese character structure as another kind of metadata and design a two-stream multi-metadata embedding network. The architecture of the proposed network is shown in Figure 2a. The proposed method is based on the encoder structure of the Transformer and the FLAT method, in which we integrate the meaning and boundary information of Chinese words. The proposed two-stream model uses a Cross-Transformer module, similar in structure to self-attention, to fuse the information of Chinese character components. In our method, we also use the multi-modal collaborative attention method that is widely used in vision-language tasks (Lu et al., 2019). The difference is that we add a randomly initialized attention matrix to compute an attention bias for the two types of metadata embedding.

CNN for Radical-level Embedding
Chinese characters are based on pictographs, and their meanings are expressed in the shape of objects. Consequently, the structure of Chinese characters carries useful information for NER. For example, radicals such as '艹' (grass) and '木' (wood) generally represent plants, which enhances Chinese medicine entity recognition. Similarly, '月' (body) represents human body parts or organs, and '疒' (disease) represents diseases, which benefits Chinese NER in the medical field. Besides, Chinese people have their own culture and beliefs about naming. The radicals '钅' (metal), '木' (wood), '氵' (water), '火' (fire), and '土' (earth), represented by the Wu-Xing (Five Elements) theory, are often used in names of people or companies. But '锈' (rust), '杀' (kill), '污' (dirt), '灾' (disaster) and '堕' (fall) are usually not used in names, even though they contain some elements of the Wu-Xing theory. This is because the other radical components also determine the semantics of Chinese characters. Radicals that generally appear negative or conflict with Chinese cultural beliefs are usually not used for naming.
Therefore, we choose the more informative Structural Components (SC) in Table 1 as the radical-level features of Chinese characters and use a Convolutional Neural Network (CNN) to extract character features. The structure of the CNN is shown in Figure 3. We first disassemble the Chinese characters into SC and then feed the radicals into the CNN. Finally, we use max-pooling and fully connected layers to obtain the feature embedding of Chinese characters at the radical level.
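The radical-level CNN can be sketched as follows. This is a minimal NumPy illustration under loud assumptions: the radical vocabulary is a toy dictionary, all weights are random rather than trained, and the decomposition of '草' into components is only indicative; the 30 1-D kernels of size 3 follow the hyper-parameters stated in the experimental settings.

```python
import numpy as np

rng = np.random.default_rng(1)
d_emb, n_kernels, ksize = 16, 30, 3                # 30 1-D kernels of size 3
radical_vocab = {"艹": 0, "日": 1, "十": 2, "早": 3}  # toy SC vocabulary
E = rng.standard_normal((len(radical_vocab), d_emb))  # radical embedding table
K = rng.standard_normal((n_kernels, ksize, d_emb))    # convolution kernels
W_fc = rng.standard_normal((n_kernels, n_kernels))    # fully connected layer
b_fc = np.zeros(n_kernels)

def radical_embedding(radicals):
    # Embed the character's structural components (SC).
    x = E[[radical_vocab[r] for r in radicals]]        # (L, d_emb)
    x = np.pad(x, ((ksize - 1, ksize - 1), (0, 0)))    # pad so short inputs work
    # 1-D convolution over the component sequence.
    T = x.shape[0] - ksize + 1
    conv = np.array([[np.sum(K[c] * x[t:t + ksize]) for t in range(T)]
                     for c in range(n_kernels)])       # (n_kernels, T)
    pooled = conv.max(axis=1)                          # max-pooling over positions
    return W_fc @ pooled + b_fc                        # radical-level embedding

v = radical_embedding(["艹", "日", "十"])  # components of '草' (grass), illustrative
print(v.shape)  # (30,)
```

Max-pooling makes the embedding invariant to the number of components, so characters with few or many SC map to vectors of the same size.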

The Cross-Transformer Module
After radical feature extraction, we propose a Cross-Transformer network to obtain the supplementary semantic information of the structure of Chinese characters. It also uses contextual and lexical information to enrich the semantics of Chinese characters. The Cross-Transformer network is illustrated in Figure 2b. We use two Transformer encoders to cross the lattice and radical information of Chinese characters, which differs from the self-attention method in the Transformer. The inputs Q_L (Q_R), K_L (K_R) and V_L (V_R) are obtained by a linear transformation of the lattice and radical-level feature embeddings:

    [Q_L, K_L, V_L] = E_L · [W_Q^L, I, W_V^L],    (6)
    [Q_R, K_R, V_R] = E_R · [W_Q^R, I, W_V^R],    (7)

where E_L and E_R are the lattice embedding and the radical-level embedding, I is the identity matrix, and each W is a learnable parameter. Then we use the relative position encoding of FLAT to represent the boundary information of a word and calculate the attention scores in our Cross-Transformer, exchanging the queries of the two streams:

    A_L,ij = Q_R,i K_L,j^T + Q_R,i (R*_ij)^T + u K_L,j^T + v (R*_ij)^T,    (8)
    A_R,ij = Q_L,i K_R,j^T + Q_L,i (R*_ij)^T + u K_R,j^T + v (R*_ij)^T,    (9)

where u and v are learnable parameters for the attention bias, A_L is the lattice attention score, A_R denotes the radical attention score, and R*_ij = R_ij · W_R, with W_R a learnable parameter and R_ij the relative position encoding defined in the Background section.
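The cross-attention with exchanged queries can be sketched in NumPy as follows. This is a toy illustration under stated assumptions: weights and embeddings are random, R*_ij is precomputed as a random tensor standing in for R_ij · W_R, and the keys are left unprojected (the identity transform), matching the identity matrix in the input equations.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(2)
n, d = 5, 8
E_L = rng.standard_normal((n, d))        # lattice embedding
E_R = rng.standard_normal((n, d))        # radical-level embedding
Rstar = rng.standard_normal((n, n, d))   # stands in for R*_ij = R_ij @ W_R
W_qL, W_vL = rng.standard_normal((2, d, d))
W_qR, W_vR = rng.standard_normal((2, d, d))
u, v = rng.standard_normal((2, d))       # attention-bias parameters

# Keys use the identity transform; queries and values are projected.
Q_L, K_L, V_L = E_L @ W_qL, E_L, E_L @ W_vL
Q_R, K_R, V_R = E_R @ W_qR, E_R, E_R @ W_vR

def rel_attention(Q, K, V):
    # A_ij = Q_i K_j^T + Q_i R*_ij^T + u K_j^T + v R*_ij^T
    A = (Q @ K.T
         + np.einsum("id,ijd->ij", Q, Rstar)
         + K @ u                              # broadcasts over rows i
         + np.einsum("d,ijd->ij", v, Rstar))
    return softmax(A / np.sqrt(d)) @ V

# Cross-attention: each stream attends with the *other* stream's query.
H_L = rel_attention(Q_R, K_L, V_L)   # lattice stream, radical query
H_R = rel_attention(Q_L, K_R, V_R)   # radical stream, lattice query
print(H_L.shape, H_R.shape)  # (5, 8) (5, 8)
```

Exchanging the queries is the only difference from running two independent self-attention streams, which is exactly the contrast examined in Experiment B of the ablation study.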

Random Attention
We empirically found that the use of random attention in the Cross-Transformer can improve the performance of the proposed method. This may be due to the need for an attention bias between the lattice and radical feature embeddings, which can better adapt the scores of the two subspaces. Random attention is a randomly initialized parameter matrix B of size max_len × max_len that is added to the previous attention score to obtain the total attention score:

    A* = A_{L/R} + B,    (10)

where A_{L/R} is the attention score from Eq. (8) or Eq. (9), and B is cropped to the actual sequence length.
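A minimal sketch of the random attention bias follows; the scale factor on B is a hypothetical initialization choice, and in the real model B would be a trainable parameter updated by backpropagation.

```python
import numpy as np

rng = np.random.default_rng(3)
max_len = 6                               # maximum sentence length in the dataset
n = 4                                     # actual sequence length
A = rng.standard_normal((n, n))           # attention score from the cross-attention
B = rng.standard_normal((max_len, max_len)) * 0.02  # learnable random-attention bias
A_total = A + B[:n, :n]                   # total attention score
print(A_total.shape)  # (4, 4)
```

Because B is indexed purely by position, it contributes the same bias pattern to every input, acting as a content-independent prior over position pairs.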

The Fusion Method
To reduce information loss, we directly concatenate the lattice and radical features and feed them into a fully connected layer for information fusion:

    F = (H_L ⊕ H_R) W_o + b,    (11)

where ⊕ denotes the concatenation operation, H_L and H_R are the outputs of the two streams, and W_o and b are learnable parameters.
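The fusion step can be sketched as follows; the stream outputs and weights are random placeholders, and the symbol names (H_L, H_R, W_o) are chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 5, 8
H_L = rng.standard_normal((n, d))        # lattice-stream output
H_R = rng.standard_normal((n, d))        # radical-stream output
W_o = rng.standard_normal((2 * d, d))    # fusion weights
b = np.zeros(d)

# Concatenate along the feature axis, then project back to d dimensions.
fused = np.concatenate([H_L, H_R], axis=-1) @ W_o + b
print(fused.shape)  # (5, 8)
```

Concatenation followed by a linear projection keeps both streams' features intact until the final mixing, rather than averaging or gating them away earlier.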
After the fusion step, we mask the word part and pass the fused feature to a Conditional Random Field (CRF) (Lafferty et al., 2001) module.

Experimental Results
In this section, we evaluate the proposed MECT method on four datasets. To make the experimental results more convincing, we also set up two additional configurations for assessing the contribution of radicals in a two-stream model. We use the span-based method to calculate F1-score (F1), precision (P), and recall (R) as the evaluation metrics.

Experimental Settings
We use four mainstream Chinese NER benchmarking datasets: Weibo (Peng and Dredze, 2015; He and Sun, 2016), Resume, MSRA (Levow, 2006), and Ontonotes 4.0 (Weischedel and Consortium, 2013). The corpora of MSRA and Ontonotes 4.0 come from news, the corpus of Weibo comes from social media, and the corpus of Resume comes from resume data in Sina Finance. Table 3 shows the statistics of these datasets. Among them, the Weibo dataset has four types of entities: PER, ORG, LOC, and GPE. Resume has eight types of entities: CONT, EDU, LOC, PER, ORG, PRO, RACE, and TITLE. OntoNotes 4.0 has four types of entities: PER, ORG, LOC, and GPE. The MSRA dataset contains three types of entities, i.e., ORG, PER, and LOC.
We use the state-of-the-art method, FLAT, as the baseline model. FLAT is a Chinese NER model based on the Transformer combined with a lattice. Besides, we also compare the proposed method with both classic and recent Chinese NER models. We use the more informative 'SC' as the radical feature, which comes from the online Xinhua Dictionary³. The pre-trained embeddings of characters and words are the same as for FLAT.
For hyper-parameters, we used 30 1-D convolution kernels of size 3 for the CNN. We used the SMAC (Hutter et al., 2011) algorithm to search for the optimal hyper-parameters. Besides, we set a different learning rate for training the radical-level embedding with the CNN. Readers can refer to the appendix for our hyper-parameter settings.

Comparison with SOTA Methods
In this section, we evaluate and analyze the proposed MECT method with a comparison to both the classic and state-of-the-art methods. The experimental results are reported in Tables 4−7, in which the fourth blocks contain the results obtained by the proposed MECT method as well as the baseline models.

Weibo: Table 4 shows the results obtained on Weibo in terms of the F1 scores of named entities (NE), nominal entities (NM), and both (Overall). From the results, we can observe that MECT achieves state-of-the-art performance. Compared with the baseline method, MECT improves by 2.98% in terms of the F1 metric. For the NE metric, the proposed method achieves 61.91%, beating all the other approaches.
Resume: The results obtained on the Resume dataset are reported in Table 5. The first block shows comparative results of the character-level and word-level models. We can observe that incorporating word features into the character-level model performs better than the other models. Additionally, MECT combines lexical and radical features, and its F1 score is higher than those of the other models and the baseline method.
Ontonotes 4.0: Table 6 shows the results obtained on Ontonotes 4.0. The symbol '§' indicates gold segmentation, and the symbol '¶' denotes automated segmentation. The other models use no segmentation and rely on lexical matching. Compared to the baseline method, the F1 score of MECT is increased by 0.47%. MECT also achieves a high recall rate, keeping precision and recall relatively stable.

MSRA: Table 7 shows the experimental results obtained on MSRA. The first block includes the first method that used radical information in Chinese NER. From the table, we can observe that the overall performance of MECT is higher than that of the existing SOTA methods. Similarly, our recall rate is higher, so the final F1 score gains a certain performance boost.
With BERT: Besides the single-model evaluation on the four datasets, we also evaluated the proposed method combined with the SOTA method, BERT. The BERT model is the same as in FLAT, using the 'BERT-wwm' model released by Cui et al. (2020). The results are shown in the fourth block of each table, with the results of BERT taken from the FLAT paper. We find that MECT further improves the performance of BERT significantly.

Effectiveness of Cross-Transformer
There are two sub-modules in the proposed Cross-Transformer method: the lattice and radical attentions. Figure 4 shows two heatmaps of the normalized attention scores of the two modules. From the heatmaps, we can see that lattice attention pays more attention to the relationship between words and characters, so the model can obtain the position and boundary information of words. Radical attention focuses on global information and corrects the semantic information of each character through radical features. Therefore, the lattice and radical attentions provide complementary information for the performance boost of the proposed MECT method in Chinese NER.

Impact of Radicals
We visualized the radical-level embedding obtained by the CNN and found that the cosine distance between Chinese characters with the same radical or a similar structure is smaller. For example, Figure 5 shows part of the Chinese character embedding trained on the Resume dataset. The highlighted dots represent Chinese characters that are close to the character '华'. We can see that they have the same radicals or a similar structure. This can enhance the semantic information of Chinese characters to a certain extent. We also examined the inference results of MECT and FLAT on Ontonotes 4.0 and found many interesting cases. For example, some percentage expressions like '百分之四十三点二 (43.2%)' are incorrectly labelled as PER in the training dataset, which causes FLAT to mark percentage expressions as PER on the test dataset, while MECT avoids this mistake. There are also some strings such as '田时' and '以国' that appear in the lexicon and were mistakenly identified as valid words by FLAT, leading to recognition errors. Our MECT addresses these issues by paying global attention to the radical information. Besides, FLAT incorrectly marks some numbers and letters as PER, ORG, or other entities. We compared the PER label accuracy of FLAT and MECT on the test dataset: FLAT achieves 81.6%, and MECT reaches 86.96%, a very significant improvement.

Analysis in Efficiency and Model Size
Following the same protocol as FLAT, we evaluate the parallel and non-parallel inference speed of MECT on an NVIDIA GeForce RTX 2080Ti card, using batch size = 16 and batch size = 1, respectively. We use the non-parallel version of FLAT as the reference and calculate the other models' relative inference speed. The results are shown in Figure 6. According to the figure, even though MECT adds a Transformer encoder to FLAT, the relative parallel inference speed only drops by 0.15. Our model's speed compares favourably with LSTM-, CNN-, and graph-based network models. Because the Transformer can make full use of the GPU's parallel computing power, the speed of MECT does not drop too much, and it remains faster than the other models. The model size is between 2 and 4 million parameters, determined by the maximum sentence length in the dataset and the d_model size of the model.

Ablation Study
To validate the effectiveness of the main components of the proposed method, we set up the two experiments shown in Figure 7. In Experiment A, we only use a single-stream model with a modified self-attention, which is similar to the original FLAT model. The difference is that we use a randomly initialized attention matrix (random attention) in the attention calculation, and we combine the lattice embedding and radical-level embedding as the input of the model. The purpose is to compare the two-stream model against a single-stream model. In Experiment B, we do not exchange the query feature vectors: we replace the cross-attention with two sets of modified self-attention and follow the two modules' outputs with the same fusion method as MECT. The purpose of Experiment B is to verify the effectiveness of MECT relative to a two-stream model without the crossover. Besides, we evaluate the proposed MECT method with the random attention module removed. Table 8 shows the ablation study results. 1) By comparing the results of Experiment A with those of Experiment B and MECT, we find that the two-stream model works better. Using the lattice-level and radical-level features as the two streams of the model helps it better understand and extract the semantic features of Chinese characters. 2) Based on the results of Experiment B and MECT, we can see that by exchanging the two query feature vectors, the model extracts features more effectively at the lattice and radical levels. The two streams have different attention mechanisms for obtaining contextual information, resulting in an interaction between global and local attention. This provides better information extraction capabilities for the proposed method in a complementary way. 3) Finally, the performance of MECT drops on all the datasets when the random attention module is removed (the last row).
This indicates that, as an attention bias, random attention can eliminate the differences caused by different embeddings, thereby improving the model's performance further.

Conclusion
This paper presented a novel two-stream network, namely MECT, for Chinese NER. The proposed method uses multi-metadata embedding that fuses the information of radicals, characters and words through a Cross-Transformer network. Additionally, random attention was used for a further performance boost. Experimental results obtained on four benchmarks demonstrate that the radical information of Chinese characters can effectively improve performance in Chinese NER. The proposed MECT method with the radical stream increases the model's complexity. In the future, we will consider how to integrate the character, word and radical information of Chinese characters in a more efficient two-stream or multi-stream design to improve the performance of Chinese NER and extend it to other NLP tasks.