Lexicon Enhanced Chinese Sequence Labeling Using BERT Adapter

Lexicon information and pre-trained models, such as BERT, have been combined to explore Chinese sequence labeling tasks due to their respective strengths. However, existing methods fuse lexicon features only via a shallow, randomly initialized sequence layer and do not integrate them into the bottom layers of BERT. In this paper, we propose Lexicon Enhanced BERT (LEBERT) for Chinese sequence labeling, which integrates external lexicon knowledge into BERT layers directly through a Lexicon Adapter layer. Compared with existing methods, our model facilitates deep lexicon knowledge fusion at the lower layers of BERT. Experiments on ten Chinese datasets across three tasks, including Named Entity Recognition, Word Segmentation, and Part-of-Speech Tagging, show that LEBERT achieves state-of-the-art results.


Introduction
Sequence labeling is a classic task in natural language processing (NLP), which assigns a label to each unit in a sequence (Jurafsky and Martin, 2009). Many important language processing tasks can be converted into this problem, such as part-of-speech (POS) tagging, named entity recognition (NER), and text chunking. The current state-of-the-art results for sequence labeling have been achieved by neural network approaches (Lample et al., 2016; Ma and Hovy, 2016; Chiu and Nichols, 2016; Gui et al., 2017).
Chinese sequence labeling is more challenging due to the lack of explicit word boundaries in Chinese sentences. One way of performing Chinese sequence labeling is to perform Chinese word segmentation (CWS) first, before applying word sequence labeling (Sun and Uszkoreit, 2012; Yang et al., 2016). However, this pipeline can suffer from segmentation errors propagated from the CWS system.
There are two lines of recent work enhancing character-based neural Chinese sequence labeling. The first considers integrating word information into a character-based sequence encoder, so that word features can be explicitly modeled (Zhang and Yang, 2018; Yang et al., 2019; Liu et al., 2019; Ding et al., 2019; Higashiyama et al., 2019). These methods can be viewed as designing different neural architecture variants for integrating discrete structured knowledge. The second considers the integration of large-scale pre-trained contextualized embeddings, such as BERT (Devlin et al., 2019), which has been shown to capture implicit word-level syntactic and semantic knowledge (Goldberg, 2019; Hewitt and Manning, 2019).
The two lines of work are complementary to each other due to the different natures of discrete and neural representations. Recent work considers the combination of lexicon features and BERT for Chinese NER (Ma et al., 2020; Li et al., 2020), Chinese Word Segmentation (Gan and Zhang, 2020), and Chinese POS tagging (Tian et al., 2020b). The main idea is to integrate contextual representations from BERT and lexicon features into a neural sequence labeling model (shown in Figure 1(a)). However, these approaches do not fully exploit the representation power of BERT, because the external features are not integrated into its bottom layers.
Inspired by work on the BERT Adapter (Houlsby et al., 2019; Bapna and Firat, 2019; Wang et al., 2020), we propose Lexicon Enhanced BERT (LEBERT) to integrate lexicon information directly between the Transformer layers of BERT. Specifically, a Chinese sentence is converted into a char-words pair sequence by matching the sentence against an existing lexicon. A lexicon adapter is designed to dynamically extract the most relevant matched words for each character using a char-to-word bilinear attention mechanism. The lexicon adapter is applied between adjacent Transformer layers in BERT (shown in Figure 1(b)) so that lexicon features and BERT representations interact sufficiently through the multi-layer encoder within BERT. We fine-tune both BERT and the lexicon adapter during training to make full use of word information, which differs considerably from the BERT Adapter, where the BERT parameters are fixed.
We investigate the effectiveness of LEBERT on three Chinese sequence labeling tasks: Chinese NER, Chinese Word Segmentation, and Chinese POS tagging. Experimental results on ten benchmark datasets illustrate the effectiveness of our model, which achieves state-of-the-art performance for each task on all datasets. In addition, we provide comprehensive comparisons and detailed analyses, which empirically confirm that bottom-level feature integration contributes to span boundary detection and span type determination.

Related Work
Our work is related to existing neural methods using lexicon features and pre-trained models to improve Chinese sequence labeling.

Lexicon-based. Lexicon-based models aim to enhance character-based models with lexicon information. Zhang and Yang (2018) introduced a lattice LSTM to encode both characters and words for Chinese NER. It was further improved by subsequent efforts in terms of training efficiency (Gui et al., 2019a; Ma et al., 2020), model degradation (Liu et al., 2019), graph structure (Gui et al., 2019b; Ding et al., 2019), and removing the dependency on the lexicon (Zhu and Wang, 2019). Lexicon information has also been shown helpful for Chinese Word Segmentation (CWS) and part-of-speech (POS) tagging. Yang et al. (2019) applied a lattice LSTM to CWS, showing good performance. Zhao et al. (2020) improved the results of CWS with lexicon-enhanced adaptive attention. Tian et al. (2020b) enhanced the character-based Chinese POS tagging model with a multi-channel attention of N-grams.
Pre-trained Model-based. Transformer-based pre-trained models, such as BERT (Devlin et al., 2019), have shown excellent performance for Chinese sequence labeling. Yang (2019) simply added a softmax layer on top of BERT, achieving state-of-the-art performance on CWS. Meng et al. (2019) and Hu and Verberne (2020) showed that models using character features from BERT outperform static embedding-based approaches by a large margin for Chinese NER and Chinese POS tagging.
Hybrid Model. Recent work tries to integrate lexicons and pre-trained models by utilizing their respective strengths. Ma et al. (2020) concatenated separate features, the BERT representation and lexicon information, and input them into a shallow fusion layer (an LSTM) for Chinese NER. Li et al. (2020) proposed a shallow Flat-Lattice Transformer to handle the character-word graph, in which the fusion is still at the model level. Similarly, character N-gram features and BERT vectors are concatenated for jointly training CWS and POS tagging (Tian et al., 2020b). Our method is in line with the above approaches in combining lexicon information and BERT. The difference is that we integrate the lexicon into the bottom layers, allowing in-depth knowledge interaction within BERT.
There is also work employing lexicons to guide pre-training. ERNIE (Sun et al., 2019a,b) exploited entity-level and word-level masking to integrate knowledge into BERT in an implicit way. Jia et al. (2020) proposed Entity Enhanced BERT, further pre-training BERT on a domain-specific corpus and entity set with a carefully designed character-entity Transformer. ZEN (Diao et al., 2020) enhanced Chinese BERT with a multi-layered N-gram encoder but is limited by the small size of its N-gram vocabulary. Compared to the above pre-training methods, our model integrates lexicon information into BERT using an adapter, which is more efficient and requires no raw texts or entity set.

BERT Adapter. The BERT Adapter (Houlsby et al., 2019) aims to learn task-specific parameters for downstream tasks. Specifically, adapters are added between layers of a pre-trained model, and only the parameters in the added adapters are tuned for a certain task. Bapna and Firat (2019) injected task-specific adapter layers into pre-trained models for neural machine translation. MAD-X (Pfeiffer et al., 2020) is an adapter-based framework that enables high portability and parameter-efficient transfer to arbitrary tasks. Wang et al. (2020) proposed K-ADAPTER to infuse knowledge into pre-trained models with further pre-training. Similar to these methods, we use a lexicon adapter to integrate lexicon information into BERT. The main difference is that our goal is to better fuse the lexicon and BERT at the bottom level rather than to train efficiently. To achieve this, we fine-tune the original parameters of BERT instead of fixing them, since directly injecting lexicon features into a frozen BERT would hurt performance due to the difference between the two types of information.

Method
The main architecture of the proposed Lexicon Enhanced BERT is shown in Figure 2. Compared to BERT, LEBERT has two main differences. First, LEBERT takes both character and lexicon features as input, given that the Chinese sentence is converted to a character-words pair sequence. Second, a lexicon adapter is attached between Transformer layers, allowing lexicon knowledge to be integrated into BERT effectively.
In this section we describe: 1) the Char-Words Pair Sequence (Section 3.1), which naturally incorporates words into a character sequence; 2) the Lexicon Adapter (Section 3.2), which injects external lexicon features into BERT; and 3) Lexicon Enhanced BERT (Section 3.3), which applies the Lexicon Adapter to BERT.

Char-Words Pair Sequence
A Chinese sentence is usually represented as a character sequence, containing character-level features solely. To make use of lexicon information, we extend the character sequence to a character-words pair sequence.

Figure 2: The architecture of Lexicon Enhanced BERT, in which lexicon features are integrated between the k-th and (k+1)-th Transformer layers using the Lexicon Adapter, where c_i denotes the i-th Chinese character in the sentence and x_i^{ws} denotes the matched words assigned to character c_i.
Given a Chinese lexicon D and a Chinese sentence with n characters s^c = {c_1, c_2, ..., c_n}, we find all the potential words inside the sentence by matching the character sequence against D. Specifically, we first build a Trie based on D, then traverse all character subsequences of the sentence and match them against the Trie to obtain all potential words. Taking the truncated sentence "美国人民 (American People)" as an example, we can find four different words, namely "美国 (America)", "美国人 (American)", "国人 (Compatriot)", and "人民 (People)". Subsequently, each matched word is assigned to the characters it contains. As shown in Figure 3, the matched word "美国 (America)" is assigned to the characters "美" and "国" since they form that word. Finally, we pair each character with its assigned words and convert a Chinese sentence into a character-words pair sequence, i.e., s^{cw} = {(c_1, ws_1), (c_2, ws_2), ..., (c_n, ws_n)}, where c_i denotes the i-th character in the sentence and ws_i denotes the matched words assigned to c_i.

Figure 3: Character-words pair sequence of the truncated Chinese sentence "美国人民 (American People)". There are four potential words, namely "美国 (America)", "美国人 (American)", "国人 (Compatriot)", and "人民 (People)". "<PAD>" denotes the padding value, and each word is assigned to the characters it contains.
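The matching procedure above can be sketched in a few lines; `build_trie` and `match_words` are illustrative helper names, not from a released implementation:

```python
# Sketch of building char-words pairs from a lexicon via a nested-dict Trie.

def build_trie(lexicon):
    """Build a nested-dict Trie from a list of words."""
    root = {}
    for word in lexicon:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["#end"] = True  # marks a complete word
    return root

def match_words(sentence, trie):
    """Pair each character with every lexicon word that covers it."""
    pairs = [[] for _ in sentence]
    for start in range(len(sentence)):
        node = trie
        for end in range(start, len(sentence)):
            ch = sentence[end]
            if ch not in node:
                break
            node = node[ch]
            if "#end" in node:  # found a word spanning [start, end]
                for i in range(start, end + 1):
                    pairs[i].append(sentence[start:end + 1])
    return pairs

lexicon = ["美国", "美国人", "国人", "人民"]
pairs = match_words("美国人民", build_trie(lexicon))
# e.g. the character "国" is covered by "美国", "美国人", and "国人"
```

This reproduces the assignment in Figure 3: each character collects every matched word whose span contains it.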

Lexicon Adapter
Each position in the sentence contains two types of information, namely character-level and word-level features. In line with existing hybrid models, our goal is to combine the lexicon features with BERT. Specifically, inspired by recent work on the BERT adapter (Houlsby et al., 2019; Wang et al., 2020), we propose a novel Lexicon Adapter (LA), shown in Figure 4, which directly injects lexicon information into BERT.
A Lexicon Adapter receives two inputs: a character and its paired words. For the i-th position in a char-words pair sequence, the input is denoted as (h_i^c, x_i^{ws}), where h_i^c is a character vector, the output of a certain Transformer layer in BERT, and x_i^{ws} = {x_{i1}^w, x_{i2}^w, ..., x_{im}^w} is a set of word embeddings. The j-th word in x_i^{ws} is represented as:

x_{ij}^w = e^w(w_{ij})

where e^w is a pre-trained word embedding lookup table and w_{ij} is the j-th word in ws_i.
To align these two different representations, we apply a non-linear transformation to the word vectors:

v_{ij}^w = W_2 (tanh(W_1 x_{ij}^w + b_1)) + b_2

where W_1 is a d_c-by-d_w matrix, W_2 is a d_c-by-d_c matrix, b_1 and b_2 are bias terms, and d_w and d_c denote the dimension of the word embeddings and the hidden size of BERT, respectively.

As Figure 3 shows, each character is paired with multiple words; however, their contribution varies from task to task. For example, for Chinese POS tagging, the words "美国 (America)" and "人民 (People)" are superior to "美国人 (American)" and "国人 (Compatriot)", since they form the ground-truth segmentation of the sentence. To pick out the most relevant words from all matched words, we introduce a character-to-word attention mechanism. Specifically, we denote all v_{ij}^w assigned to the i-th character as V_i = (v_{i1}^w, ..., v_{im}^w), which has size m-by-d_c, where m is the total number of assigned words. The relevance of each word is calculated as:

a_i = softmax(h_i^c W_attn V_i^T)

where W_attn is the weight matrix of the bilinear attention. Consequently, we obtain the weighted sum of all word representations:

z_i^w = Σ_{j=1}^{m} a_{ij} v_{ij}^w

Finally, the weighted lexicon information is injected into the character vector:

h̃_i = h_i^c + z_i^w

This is followed by a dropout layer and layer normalization.
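A minimal numpy sketch of one Lexicon Adapter step follows, using random weights and small illustrative dimensions; the real model uses BERT's hidden size, pre-trained embeddings, and dropout (omitted here):

```python
# Numpy sketch of the Lexicon Adapter: a non-linear transform aligns word
# embeddings to the character space, bilinear attention weights the matched
# words, and the result is added to the character vector and layer-normalized.
import numpy as np

def lexicon_adapter(h_c, x_ws, W1, b1, W2, b2, W_attn):
    """h_c: (d_c,) character vector; x_ws: (m, d_w) matched-word embeddings."""
    # Align word vectors to the character dimension: v = W2 tanh(W1 x + b1) + b2
    V = np.tanh(x_ws @ W1.T + b1) @ W2.T + b2     # (m, d_c)
    # Char-to-word bilinear attention: a = softmax(h_c W_attn V^T)
    scores = h_c @ W_attn @ V.T                   # (m,)
    a = np.exp(scores - scores.max())
    a /= a.sum()
    z = a @ V                                     # weighted word feature, (d_c,)
    out = h_c + z                                 # inject into the character vector
    # Layer normalization (dropout omitted in this sketch)
    return (out - out.mean()) / (out.std() + 1e-6)

d_c, d_w, m = 8, 6, 3
rng = np.random.default_rng(0)
h = lexicon_adapter(rng.normal(size=d_c), rng.normal(size=(m, d_w)),
                    rng.normal(size=(d_c, d_w)), np.zeros(d_c),
                    rng.normal(size=(d_c, d_c)), np.zeros(d_c),
                    rng.normal(size=(d_c, d_c)))
```

The output has the character dimension d_c, so it can be fed directly into the next Transformer layer.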

Lexicon Enhanced BERT
Lexicon Enhanced BERT (LEBERT) is a combination of the Lexicon Adapter (LA) and BERT, in which the LA is applied to a certain layer of BERT, as shown in Figure 2. Concretely, the LA is attached between certain Transformer layers within BERT, thereby injecting external lexicon knowledge into BERT. Given a Chinese sentence with n characters s^c = {c_1, c_2, ..., c_n}, we build the corresponding character-words pair sequence s^{cw} = {(c_1, ws_1), (c_2, ws_2), ..., (c_n, ws_n)} as described in Section 3.1. The characters {c_1, c_2, ..., c_n} are first input into the Input Embedder, which outputs E = {e_1, e_2, ..., e_n} by adding token, segment, and position embeddings. Then we input E into the Transformer encoders, where each Transformer layer acts as follows:

G = LN(H^{l-1} + MHAttn(H^{l-1}))
H^l = LN(G + FFN(G))

where H^l = {h_1^l, h_2^l, ..., h_n^l} denotes the output of the l-th layer and H^0 = E; LN is layer normalization; MHAttn is the multi-head attention mechanism; and FFN is a two-layer feed-forward network with ReLU as the hidden activation function.
To inject lexicon information between the k-th and (k+1)-th Transformer layers, we first obtain the output H^k = {h_1^k, h_2^k, ..., h_n^k} after k successive Transformer layers. Then, each pair (h_i^k, x_i^{ws}) is passed through the Lexicon Adapter, which transforms the i-th pair into h̃_i^k:

h̃_i^k = LA(h_i^k, x_i^{ws})

Since there are L = 12 Transformer layers in BERT, we input H̃^k = {h̃_1^k, h̃_2^k, ..., h̃_n^k} into the remaining (L − k) Transformer layers. In the end, we obtain the output of the L-th Transformer layer, H^L, for the sequence labeling task.
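The position of the adapter in the forward pass can be illustrated with dummy stand-ins for the Transformer layers and the adapter; none of this is the actual BERT computation:

```python
# Structural sketch of LEBERT's forward pass: run k Transformer layers,
# apply the Lexicon Adapter once, then run the remaining (L - k) layers.
# `transformer_layer` and `lexicon_adapter` are toy placeholders.
import numpy as np

L, k, n, d = 12, 1, 4, 8   # layers, injection point, sentence length, hidden size

def transformer_layer(H):   # stand-in: identity, just to show the control flow
    return H

def lexicon_adapter(h, x_ws):  # stand-in: add the mean matched-word feature
    return h + x_ws.mean(axis=0)

H = np.zeros((n, d))        # embedder output H^0
X_ws = np.ones((n, 3, d))   # matched-word features per character (m = 3)

for l in range(1, L + 1):
    H = transformer_layer(H)
    if l == k:              # inject lexicon knowledge after the k-th layer
        H = np.stack([lexicon_adapter(H[i], X_ws[i]) for i in range(n)])
# H now plays the role of H^L, fed to the sequence labeling head
```

With k = 1, as in the paper's best setting, the lexicon features pass through eleven further Transformer layers, which is the intended deep interaction.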

Training and Decoding
Considering the dependency between successive labels, we use a CRF layer for sequence labeling. Given the hidden outputs of the last layer H^L = {h_1^L, h_2^L, ..., h_n^L}, we first compute the emission scores O = H^L W_o + b_o. For a label sequence y = {y_1, y_2, ..., y_n}, we define its probability to be:

p(y|s) = exp(Σ_i (O_{i,y_i} + T_{y_{i-1},y_i})) / Σ_{ỹ} exp(Σ_i (O_{i,ỹ_i} + T_{ỹ_{i-1},ỹ_i}))

where T is the transition score matrix and ỹ ranges over all possible label sequences. Given N labeled examples {(s_j, y_j)}_{j=1}^{N}, we train the model by minimizing the sentence-level negative log-likelihood loss:

L = − Σ_{j=1}^{N} log p(y_j | s_j)

During decoding, we find the label sequence with the highest score using the Viterbi algorithm.
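Viterbi decoding over emission scores O and a transition matrix T can be sketched as follows (a generic CRF decoder, not the paper's exact implementation):

```python
# Viterbi decoding for a linear-chain CRF: given per-position emission
# scores O (n, K) and transition scores T (K, K), recover the label
# sequence with the highest total score by dynamic programming.
import numpy as np

def viterbi(O, T):
    n, K = O.shape
    score = O[0].copy()                   # best score ending in each label
    back = np.zeros((n, K), dtype=int)    # backpointers
    for i in range(1, n):
        # cand[prev, cur] = best score at i-1 ending in prev, plus transition
        # prev -> cur, plus emission for cur at position i
        cand = score[:, None] + T + O[i]
        back[i] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    # Backtrack from the best final label
    path = [int(score.argmax())]
    for i in range(n - 1, 0, -1):
        path.append(int(back[i][path[-1]]))
    return path[::-1]

O = np.array([[2.0, 0.0], [0.0, 1.0], [2.0, 0.0]])
T = np.array([[1.0, -1.0], [-1.0, 1.0]])  # favors staying in the same label
path = viterbi(O, T)
```

Here the self-transition bonus outweighs the middle position's emission preference, so the decoder keeps label 0 throughout rather than switching for one step.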

Experiments
We carry out an extensive set of experiments to investigate the effectiveness of LEBERT. In addition, we aim to empirically compare model-level and BERT-level fusion in the same setting. The standard F1 score (F1) is used as the evaluation metric.

Datasets
We evaluate our method on ten datasets across three different sequence labeling tasks: Chinese NER, Chinese Word Segmentation, and Chinese POS tagging. The statistics of the datasets are shown in Table 1.

Chinese NER. We conduct experiments on four benchmark datasets: Weibo NER (Peng and Dredze, 2015, 2016), OntoNotes (Weischedel et al., 2011), Resume NER (Zhang and Yang, 2018), and MSRA (Levow, 2006). Weibo NER is a social media domain dataset drawn from Sina Weibo, while the OntoNotes and MSRA datasets are in the news domain. The Resume NER dataset consists of resumes of senior executives, annotated by Zhang and Yang (2018).

Chinese Word Segmentation. For Chinese word segmentation, we employ three benchmark datasets: PKU, MSR, and CTB6, where the former two are from the SIGHAN 2005 Bakeoff (Emerson, 2005) and the last is from Xue et al. (2005). For MSR and PKU, we follow the official training/test data split. For CTB6, we use the same split as stated in Yang and Xue (2012) and Higashiyama et al. (2019).

Chinese POS Tagging. For POS tagging, three Chinese benchmark datasets are used: CTB5 and CTB6 from the Penn Chinese TreeBank (Xue et al., 2005) and the Chinese GSD Treebank of Universal Dependencies (UD) (Nivre et al., 2016). The CTB datasets are in simplified Chinese while the UD dataset is in traditional Chinese. Following Shao et al. (2017), we first convert the UD dataset into simplified Chinese before the POS tagging experiments. In addition, UD has both universal and language-specific POS tags; following previous work (Shao et al., 2017; Tian et al., 2020a), we refer to the corpus with the two tagsets as UD1 and UD2, respectively. We use the official train/dev/test splits in our experiments.

Experimental Settings
Our model is built on BERT-base (Devlin et al., 2019), with 12 Transformer layers, and is initialized from the Chinese BERT checkpoint released by huggingface. We use the 200-dimension pre-trained word embeddings from Song et al. (2018), which are trained on news and webpage texts using a directional skip-gram model. The lexicon D used in this paper is the vocabulary of the pre-trained word embeddings. We apply the Lexicon Adapter between the 1st and 2nd Transformer layers in BERT and fine-tune both BERT and the pre-trained word embeddings during training.

Hyperparameters. We use the Adam optimizer with an initial learning rate of 1e-5 for the original parameters of BERT and 1e-4 for the other parameters introduced by LEBERT, and a maximum of 20 training epochs on all datasets. The maximum sequence length is set to 256, and the training batch size is 20 for MSRA NER and 4 for the other datasets.

Baselines. To evaluate the effectiveness of the proposed LEBERT, we compare it with the following approaches in the experiments.
• BERT. Directly fine-tuning a pre-trained Chinese BERT on the Chinese sequence labeling tasks.
• BERT+Word. A strong model-level fusion baseline, which takes the concatenation of the BERT vector and the bilinear-attention-weighted word vector as input, and uses an LSTM and a CRF as the fusion layer and inference layer, respectively.
• ERNIE (Sun et al., 2019a). An extension of BERT that uses entity-level masking to guide pre-training.
• ZEN (Diao et al., 2020). Explicitly integrates N-gram information into BERT through an extra multi-layer N-gram Transformer encoder and pre-training.
Further, we also compare with the state-of-the-art models for each task.

Overall Results
Chinese NER. Table 2 shows the experimental results on the Chinese NER datasets. The first four rows (Zhang and Yang, 2018; Zhu and Wang, 2019; Liu et al., 2019; Ding et al., 2019) in the first block show the performance of lexicon-enhanced character-based Chinese NER models, and the last two rows (Ma et al., 2020; Li et al., 2020) in the same block are the state-of-the-art models using a shallow fusion layer to integrate lexicon information and BERT. The hybrid models, including the existing state-of-the-art models, BERT+Word, and the proposed LEBERT, achieve better performance than both the lexicon-enhanced models and the BERT baseline. This demonstrates the effectiveness of combining BERT and lexicon features for Chinese NER.
Compared with the model-level fusion models ((Ma et al., 2020; Li et al., 2020), and BERT+Word), our BERT-level fusion model, LEBERT, improves the F1 score on all four datasets across different domains, which shows that our approach is more effective at integrating word information and BERT. The results also indicate that our adapter-based method, LEBERT, using only an extra pre-trained word embedding, outperforms the two lexicon-guided pre-training models (ERNIE and ZEN). This is likely because the implicit integration of the lexicon in ERNIE and the restricted pre-defined N-gram vocabulary size in ZEN limit their effect.

Chinese Word Segmentation. We report the F1 score of our model and the baseline methods on Chinese Word Segmentation in Table 3. Our proposed model achieves state-of-the-art results across all datasets. To better show the strength of our method, we also summarize the relative error reduction over the BERT baseline and BERT-based state-of-the-art models in Table 5. The results show that the relative error reductions are significant compared with the baseline models.

Table 8 shows example tagging results on the Ontonotes and UD1 datasets, respectively. In the first example, BERT cannot determine the entity boundary, but BERT+Word and LEBERT segment it correctly. However, the BERT+Word model fails to predict the type of the entity "呼伦贝尔盟 (Hulunbuir League)" while LEBERT makes the correct prediction. This is likely because fusion at the lower layers helps capture the more complex semantics provided by BERT and the lexicon. In the second example, all three models find the correct span boundary, but both BERT and BERT+Word predict the span type incorrectly. Although BERT+Word can use the word information, it is disturbed by the irrelevant word "七八 (Seven and Eight)" and predicts the type as NUM.
In contrast, LEBERT not only integrates lexicon features but also chooses the correct word for prediction.

Discussion
Adaptation at Different Layers. We explore the effect of applying the Lexicon Adapter (LA) between different Transformer layers of BERT on the Ontonotes dataset. Different settings are evaluated, including applying the LA after one, multiple, and all Transformer layers. For one layer, we apply the LA after the k-th layer, k ∈ {1, 3, 6, 9, 12}; for multiple layers, after layers {1, 3}, {1, 3, 6}, and {1, 3, 6, 9}. All layers means the LA is applied after every Transformer layer in BERT. The results are shown in Table 7. Applying the LA at a shallow layer achieves better performance, which may be because a shallow injection point allows more layers of interaction between the lexicon features and BERT. Applying the LA at multiple layers hurts performance; one possible reason is that integration at multiple layers causes overfitting.
Tuning BERT or Not. Intuitively, integrating the lexicon into BERT without fine-tuning can be faster (Houlsby et al., 2019) but yields lower performance due to the different characteristics of lexicon features and BERT (discrete vs. neural representations). To evaluate its impact, we conduct experiments with and without fine-tuning the BERT parameters on the Ontonotes and UD1 datasets. We find that without fine-tuning BERT, the F1 score declines by 7.03 points (82.08 → 75.05) on Ontonotes and 3.75 points (96.06 → 92.31) on UD1, illustrating the importance of fine-tuning BERT for our lexicon integration.

Conclusion
In this paper, we proposed a novel method to integrate lexicon features and BERT for Chinese sequence labeling, which directly injects lexicon information between Transformer layers in BERT using a Lexicon Adapter. Compared with model-level fusion methods, LEBERT allows in-depth fusion of lexicon features and BERT representations at the BERT level. Extensive experiments show that the proposed LEBERT achieves state-of-the-art performance on ten datasets across three Chinese sequence labeling tasks.

Figure 1 :
Figure 1: Comparison of fusing lexicon features and BERT at different levels for Chinese sequence labeling. For simplicity, we only show two Transformer layers in BERT and truncate the sentence to three characters. c_i denotes the i-th Chinese character and w_j denotes the j-th Chinese word.

Figure 4 :
Figure 4: Structure of the Lexicon Adapter (LA). The adapter takes as input a character vector and the paired word features. A bilinear attention over the character and words is then used to weight the lexicon features into a single vector, which is added to the input character-level vector and followed by layer normalization.

Table 1 :
The statistics of the datasets.

Table 2 :
Results on Chinese NER.

Table 3 :
Results on Chinese Word Segmentation.

Table 5 :
The relative error reductions over different base models.

Table 6 :
Span F1 and Type Acc of different models.

Table 7 :
Results of variations of LEBERT with the Lexicon Adapter applied at different layers of BERT. one, multi, and all mean applying the LA after one layer, multiple layers, and all Transformer layers in BERT, respectively.

Table 8 :
Examples of tagging results.