Unsupervised Boundary-Aware Language Model Pretraining for Chinese Sequence Labeling

Boundary information is critical for various Chinese language processing tasks, such as word segmentation, part-of-speech tagging, and named entity recognition. Previous studies usually resorted to a high-quality external lexicon, whose items offer explicit boundary information. However, ensuring the quality of such a lexicon requires substantial human effort, a cost that previous work has generally ignored. In this work, we suggest unsupervised statistical boundary information instead, and propose an architecture to encode the information directly into pre-trained language models, resulting in Boundary-Aware BERT (BABERT). We apply BABERT for feature induction of Chinese sequence labeling tasks. Experimental results on ten benchmarks of Chinese sequence labeling demonstrate that BABERT provides consistent improvements on all datasets. In addition, our method can complement previous supervised lexicon exploration, where further improvements can be achieved when integrated with external lexicon information.


Introduction
The representative sequence labeling tasks for the Chinese language, such as word segmentation, part-of-speech (POS) tagging, and named entity recognition (NER) (Emerson, 2005; Jin and Chen, 2008), have increasingly been performed at the character level in an end-to-end manner (Shen et al., 2016). This paradigm is standard for Chinese word segmentation (CWS), while for Chinese POS tagging and NER, its straightforward modeling helps reduce error propagation compared with word-based counterparts (Sun and Uszkoreit, 2012; Yang et al., 2016; Liu et al., 2019a).
Recently, all the above tasks have reached state-of-the-art performance with the help of BERT-like pre-trained language models (Yan et al., 2019; Meng et al., 2019). BERT variants such as BERT-wwm (Cui et al., 2021), ERNIE (Sun et al., 2019), ZEN (Diao et al., 2020), and NEZHA (Wei et al., 2019) further improve the vanilla BERT by using either external knowledge or larger-scale training corpora. These improvements also benefit character-level Chinese sequence labeling tasks.
Notably, since the output tags of all these character-level Chinese sequence labeling tasks involve identifying Chinese words or entities (Zhang and Yang, 2018;Yang et al., 2019), prior boundary knowledge could be highly helpful for them.
A number of studies propose the integration of an external lexicon to enhance their baseline models by feature representation learning (Jia et al., 2020; Tian et al., 2020a; Liu et al., 2021). Moreover, some works suggest injecting similar resources into the pre-trained BERT weights. BERT-wwm (Cui et al., 2021) and ERNIE (Sun et al., 2019) are representatives, which leverage an external lexicon for masked word prediction in Chinese BERT.
The lexicon-based methods have indeed achieved great success for boundary integration. However, there are two major drawbacks. First, the lexicon resources are always constructed manually (Zhang and Yang, 2018; Diao et al., 2020; Jia et al., 2020; Liu et al., 2021), which is expensive and time-consuming. The quality of the lexicon is critical to our tasks. Second, different tasks as well as different domains require different lexicons (Jia et al., 2020; Liu et al., 2021). A well-studied lexicon for word segmentation might be inappropriate for NER, and a lexicon for news NER might also be problematic for finance NER. The two drawbacks can be attributed to the supervised character of these lexicon-based enhancements. Thus, it is more desirable to offer boundary information in an unsupervised manner.
In this paper, we propose an unsupervised Boundary-Aware BERT (BABERT), which fully explores the potential of statistical features mined from a large-scale raw corpus. We extract a set of N-grams (for a predefined fixed N), regardless of whether they are valid words or entities, and then calculate their corresponding unsupervised statistical features, which are closely related to boundary information. We inject the boundary information into an internal layer of a pre-trained BERT, so that our final BABERT model approximates the boundary knowledge softly through its internal representations. BABERT is architecturally identical to the original BERT, so we can use it in exactly the same way as standard BERT.
We conduct experiments on three Chinese sequence labeling tasks to demonstrate the effectiveness of our proposed method. Experimental results show that our approach can significantly outperform other Chinese pre-trained language models. In addition, compared with supervised lexicon-based methods, BABERT obtains competitive results on all tasks and achieves further improvements when integrated with external lexicon knowledge. We also conduct extensive analyses to understand our method comprehensively. The pre-trained model and code are publicly available at http://github.com/modelscope/adaseq/examples/babert.
Our contributions in this paper are threefold: 1) we design a method to encode unsupervised statistical boundary information into a boundary-aware representation; 2) we propose a new pre-trained language model, BABERT, as a boundary-aware extension of BERT; 3) we verify BABERT on ten benchmark datasets across three Chinese sequence labeling tasks.

Related Work
In the past decades, machine learning has achieved good performance on sequence labeling tasks with statistical information (Bellegarda, 2004; Low et al., 2005; Bouma, 2009). Recently, neural models have led to state-of-the-art results for Chinese sequence labeling (Lample et al., 2016; Ma and Hovy, 2016; Chiu and Nichols, 2016). In addition, the advent of language representation models such as BERT (Devlin et al., 2019) has led to impressive improvements. In particular, many variants of BERT are devoted to integrating boundary information into BERT to improve Chinese sequence labeling (Diao et al., 2020; Jia et al., 2020; Liu et al., 2021).
Statistical Machine Learning Statistical information is critical for sequence labeling. Previous works count such information from large corpora and combine it with machine learning methods for sequence labeling (Bellegarda, 2004; Liang, 2005; Bouma, 2009). Peng et al. (2004) conduct sequence labeling with a CRF and a statistics-based new word discovery method. Low et al. (2005) introduce a maximum entropy approach for sequence labeling. Liang (2005) utilizes unsupervised statistical information in Markov models, and obtains a boost on Chinese NER and CWS.
Pre-trained Language Model Pre-trained language models are a central topic in the natural language processing (NLP) community (Devlin et al., 2019; Liu et al., 2019b; Wei et al., 2019; Clark et al., 2020; Diao et al., 2020; Zhang et al., 2021) and have been extensively studied for Chinese sequence labeling. For instance, TENER (Yan et al., 2019) adopts a Transformer encoder to model character-level features for Chinese NER. Glyce (Meng et al., 2019) uses BERT to capture contextual representations combined with glyph embeddings for Chinese sequence labeling.

Lexicon-based Methods
In recent studies, lexicon knowledge has been applied to improve model performance. There are two mainstream categories of lexicon enhancement. The first aims to enhance the original BERT with implicit boundary information via a multi-granularity word masking mechanism. BERT-wwm (Cui et al., 2021) and ERNIE (Sun et al., 2019) are representatives of this category, which mask tokens, entities, and phrases as the mask units in the masked language modeling (MLM) task to learn coarse-grained lexicon information during pre-training. ERNIE-Gram (Xiao et al., 2021), an extension of ERNIE, utilizes statistical boundary information for unsupervised word extraction to support masked word prediction. The second category, which includes ZEN (Diao et al., 2020), EEBERT (Jia et al., 2020), and LEBERT (Liu et al., 2021), directly injects lexicon information into BERT via extra modules, leading to better performance but limited by the predefined external knowledge. Our work follows the first line and is most similar to ERNIE-Gram. However, unlike ERNIE-Gram, we do not discretize the real-valued statistical information extracted from the corpus, but instead leverage it fully in a regression manner.

Method
Figure 1 shows the overall architecture of our unsupervised boundary-aware pre-trained language model, which mainly consists of three components: 1) a boundary information extractor for unsupervised statistical boundary information mining, 2) a boundary-aware representation that integrates statistical information at the character level, and 3) boundary-aware BERT learning, which injects boundary knowledge into an internal layer of BERT. In this section, we first describe these components in detail, and then introduce the fine-tuning method for Chinese sequence labeling.

Boundary Information Extractor
Statistical boundary information has been shown to have a positive influence on a variety of Chinese NLP tasks (Song and Xia, 2012; Higashiyama et al., 2019; Ding et al., 2020; Xiao et al., 2021). We follow this line of work, designing a boundary information extractor to mine statistical information from a large raw corpus in an unsupervised way.
The overall flow of the extractor includes two steps: I) first, we collect all N-grams from the raw corpus to build a dictionary N, count the frequency of each N-gram, and filter out low-frequency items; II) second, considering that word frequency alone is insufficient for representing the flexible boundary relations in Chinese text, we compute two unsupervised indicators that capture most of the boundary information in the corpus. We describe these two indicators in detail below.
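Step I can be sketched in a few lines of Python. The corpus, the maximum N-gram length, and the frequency threshold below are illustrative stand-ins (the paper uses a 3B-token corpus and a threshold of 50):

```python
from collections import Counter

def build_ngram_dict(corpus, max_n=4, min_freq=2):
    """Collect all N-grams up to length max_n from a list of sentences
    (step I of the extractor) and drop low-frequency items."""
    counts = Counter()
    for sent in corpus:
        for n in range(1, max_n + 1):
            for i in range(len(sent) - n + 1):
                counts[sent[i:i + n]] += 1
    return {g: c for g, c in counts.items() if c >= min_freq}
```

The surviving N-grams are then annotated with the two statistical indicators described next.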
Pointwise Mutual Information (PMI) Given an N-gram, we split it into two sub-strings and compute the mutual information (MI) between them as a candidate. Then, we enumerate all sub-string pairs and choose the minimum MI as the overall PMI to estimate the tightness of the N-gram. Let g = c_1...c_m be an N-gram consisting of m characters; we calculate PMI as:
$$\mathrm{PMI}(g) = \min_{1 \le j < m} \log \frac{p(c_1 \cdots c_m)}{p(c_1 \cdots c_j)\, p(c_{j+1} \cdots c_m)},$$
where p(·) denotes the probability over the corpus. Note that when m = 1, the corresponding PMI is constantly equal to 1. A higher PMI indicates that the N-gram (e.g., "贝克汉姆 (Beckham)") has an occurrence probability similar to that of its sub-string pairs (e.g., "贝克 (Beck)" and "汉姆 (Ham)"), i.e., a strong association between the internal sub-strings, which makes the N-gram more likely to be a word or entity. In contrast, a lower PMI means the N-gram (e.g., "克汉 (Kehan)") is likely an invalid word or entity.
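A minimal sketch of the minimum-over-splits PMI computation, assuming `prob` is a hypothetical lookup table of corpus probabilities derived from the N-gram dictionary:

```python
import math

def pmi(gram, prob):
    """Minimum pointwise mutual information over all binary splits of
    an N-gram. `prob` maps a string to its corpus probability."""
    if len(gram) == 1:
        return 1.0  # single characters get a constant PMI of 1, as in the paper
    scores = []
    for j in range(1, len(gram)):
        left, right = gram[:j], gram[j:]
        scores.append(math.log(prob[gram] / (prob[left] * prob[right])))
    return min(scores)
```

Taking the minimum over splits means one loosely attached sub-string is enough to flag the N-gram as a poor word candidate.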
Left and Right Entropy (LRE) Given an N-gram g, we first collect its left-adjacent character set S^l = {c^l_1, ..., c^l_{n_l}} with n_l characters. Then, we use the conditional probability between g and its left-adjacent characters to compute the left entropy (LE), which captures boundary information:
$$\mathrm{LE}(g) = -\sum_{i=1}^{n_l} p(c_i^l g \mid g) \log p(c_i^l g \mid g).$$
Similarly, we collect a right-adjacent character set S^r = {c^r_1, ..., c^r_{n_r}} with n_r characters to calculate the right entropy (RE) for the N-gram g:
$$\mathrm{RE}(g) = -\sum_{i=1}^{n_r} p(g c_i^r \mid g) \log p(g c_i^r \mid g).$$
Intuitively, LRE represents the abundance of neighboring characters for the N-gram. With a lower LRE, the N-gram (e.g., "汉姆") has a more fixed context, indicating it is more likely to be part of a longer phrase or entity. Conversely, an N-gram with a higher LRE (e.g., "贝克汉姆") interacts with more varied contexts and is more likely to be an independent word or phrase.
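The two entropies are plain Shannon entropies over adjacent-character distributions. A small sketch, where the count dictionaries are hypothetical inputs counted from the corpus:

```python
import math

def entropy(counts):
    """Shannon entropy of an adjacent-character count distribution."""
    total = sum(counts.values())
    return -sum(c / total * math.log(c / total) for c in counts.values())

def lre(left_counts, right_counts):
    """Left and right entropy of an N-gram, given how often each character
    appears immediately to its left / right in the corpus."""
    return entropy(left_counts), entropy(right_counts)
```

An N-gram that always follows the same character gets zero left entropy (a fixed context), while one preceded by many different characters gets high entropy.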
Finally, we utilize PMI and LRE to measure the flexible boundary relations in the Chinese context, and then update each N-gram in N with the unsupervised statistical indicators above.

Boundary-Aware Representation
By using the boundary information extractor, we obtain an N-gram dictionary N with unsupervised statistical boundary information. Unfortunately, because such statistics are context-independent and tied to specific N-grams, previous works (Ding et al., 2020; Xiao et al., 2021) use them for word extraction only, ignoring the potential of statistical boundary information for representation learning. To alleviate this problem, we propose the boundary-aware representation, a highly extensible method that fully exploits statistical boundary information for representation learning.
To achieve the boundary-aware representation, we first build contextual N-gram sets from the sentence. As shown in Figure 1 (b), given a sentence x = {c_1, c_2, ..., c_n} with n characters and a maximum N-gram length N, we extract all N-grams that include c_i as the contextual N-gram set S_{c_i} = {c_i, c_{i-1}c_i, c_ic_{i+1}, ..., c_{i-N+1}...c_i, ..., c_i...c_{i+N-1}} for character c_i. Then, we design a composition method to integrate the statistical features of the N-grams in S_{c_i} using specific conditions and rules, aiming to avoid the sparsity and contextual independence limitations of statistical information.
Concretely, we divide the information composition into PMI and entropy representations. First, we concatenate the PMI of all N-grams in S_{c_i} to generate the PMI representation:
$$e_i^p = \mathrm{PMI}(g_1) \oplus \mathrm{PMI}(g_2) \oplus \cdots \oplus \mathrm{PMI}(g_a),$$
where e_i^p ∈ R^a, and a = 1 + 2 + ... + N is the number of N-grams that contain c_i. Note that the position of each N-gram is fixed in the PMI representation. We strictly follow the order of N-gram length and the position of c_i within each N-gram when concatenating the corresponding PMI values, ensuring that position and context information are encoded into e_i^p. The entropy representation focuses on the contextual interactions of each character. When c_i is a border of an N-gram in S_{c_i}, we separately aggregate the LE and RE values into left and right entropy representations:
$$e_i^{le} = \mathrm{LE}(g_1) \oplus \cdots \oplus \mathrm{LE}(g_b), \quad e_i^{re} = \mathrm{RE}(g_1) \oplus \cdots \oplus \mathrm{RE}(g_b),$$
where e_i^{le} ∈ R^b, e_i^{re} ∈ R^b, and b = N is the number of integrated N-grams. Similar to the PMI representation, the position of each N-gram in e_i^{le} and e_i^{re} is fixed and symmetric. The boundary-aware representation e_i of c_i can then be formalized as:
$$e_i = e_i^p \oplus e_i^{le} \oplus e_i^{re},$$
where e_i ∈ R^{a+2b}. Finally, by composing multi-granularity statistical boundary information in a specific order, we obtain the boundary-aware representation, which explicitly encodes boundary and context information.

Figure 2 shows an example of the boundary-aware representation. Given the sentence "南京市长江大桥 (Nanjing Yangtze River Bridge)" and a maximum N-gram length N = 3, we first build the contextual N-gram set for the character "长 (Long)". Then, we integrate the PMI of all N-grams in a specific order (from the N-gram "长" to "京市长 (Mayor of Jing)") to compute the PMI representation. The left and right entropy representations are calculated in an analogous order (from the N-gram "长" to "长江大 (Yangtze River Big)" and to "京市长", respectively). Finally, we concatenate the above features to produce the overall boundary-aware representation of "长".
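The composition above can be sketched as follows. The zero padding for N-grams that fall outside the sentence and the exact iteration order are our assumptions for illustration; the paper fixes the ordering only conceptually, and the `pmi_of`/`le_of`/`re_of` lookups stand in for the extractor's statistics:

```python
def boundary_repr(sent, i, max_n, pmi_of, le_of, re_of):
    """Assemble the boundary-aware vector e_i for character position i:
    PMI of every N-gram containing c_i (ordered by length, then position),
    plus LE of grams ending at c_i and RE of grams starting at c_i."""
    e_pmi, e_le, e_re = [], [], []
    for n in range(1, max_n + 1):
        # all n-grams covering position i, left to right
        for start in range(i - n + 1, i + 1):
            if start >= 0 and start + n <= len(sent):
                e_pmi.append(pmi_of(sent[start:start + n]))
            else:
                e_pmi.append(0.0)  # pad at sentence edges (our assumption)
        # c_i as right border (gram ends at i) / left border (gram starts at i)
        e_le.append(le_of(sent[i - n + 1:i + 1]) if i - n + 1 >= 0 else 0.0)
        e_re.append(re_of(sent[i:i + n]) if i + n <= len(sent) else 0.0)
    return e_pmi + e_le + e_re  # length a + 2b = N(N+1)/2 + 2N
```

For N = 3 this yields a 6 + 3 + 3 = 12-dimensional vector per character, matching a + 2b with a = 1 + 2 + 3 and b = 3.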

Boundary-Aware BERT Learning
Boundary-aware BERT is a variant of BERT, enhanced with boundary information simply and effectively.In this subsection, we describe how the boundary information can be integrated into BERT during pre-training by boundary-aware learning.
Boundary-Aware Objective As mentioned in Section 3.2, given a sentence x of n characters, we can compute the corresponding boundary-aware representations E = {e_1, ..., e_n}. We then project the BERT features into the boundary information space and push them toward E for boundary-aware learning. Moreover, Liu et al. (2021) show that encoding basic lexical knowledge in the shallow BERT layers is more effective. Hence, we use the hidden features H^l = {h^l_1, ..., h^l_n} of the l-th (shallow) layer for the boundary-aware objective:
$$\mathcal{L}_{BA} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{MSE}(W_B h_i^l, e_i),$$
where MSE(·) denotes the mean squared error loss, and W_B is a trainable matrix that projects the BERT representation into the boundary information space.
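A toy, dependency-free sketch of this regression objective (the real model applies the projection W_B to the l-th layer's hidden states with tensor operations; plain Python lists are used here purely for illustration):

```python
def mse(pred, target):
    # mean squared error over all entries of the two n x k matrices
    n = sum(len(p) for p in pred)
    return sum((a - b) ** 2
               for p, t in zip(pred, target)
               for a, b in zip(p, t)) / n

def boundary_aware_loss(H, W_B, E):
    """H: n x d hidden states of the l-th layer; W_B: d x k projection;
    E: n x k boundary-aware targets. Loss = MSE(H @ W_B, E)."""
    proj = [[sum(h[i] * W_B[i][j] for i in range(len(h)))
             for j in range(len(W_B[0]))]
            for h in H]
    return mse(proj, E)
```

The regression target is the real-valued statistics themselves, so no threshold is needed to decide which N-grams count as words.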
Previous classification-based word-level masking methods use statistical information as thresholds to filter valid words for masked word prediction. Unlike those works, we softly utilize such information in a regression manner, avoiding the errors introduced by empirically filtering valid words and thereby fully exploiting this information.
Pre-training Following Jia et al. (2020) and Gao and Callan (2021), we initialize our model with the pre-trained BERT model released by Google (https://github.com/google-research/bert) and randomly initialize the remaining parameters, alleviating the enormous cost of training BABERT from scratch. In particular, we discard the next sentence prediction task during pre-training, which has been shown to be non-essential for pre-trained language models (Lan et al., 2020; Liu et al., 2019b). The total pre-training loss of BABERT is:
$$\mathcal{L} = \mathcal{L}_{MLM} + \mathcal{L}_{BA},$$
where L_MLM is the standard objective of the MLM task.

Fine-tuning for Sequence Labeling
Straightforward Fine-tuning As shown in Figure 1 (c), because BABERT has the same architecture as BERT, we can adopt the identical fine-tuning procedure as BERT, using the output of BABERT as the contextual character representation for sequence labeling. Concretely, given a sequence labeling dataset D = {(x_j, y_j)}_{j=1}^N, where y_j is the label sequence of x_j, we feed the output of BABERT into a CRF layer to compute the sentence-level output probability p(y_j | x_j), exactly as in Liu et al. (2021). The negative log-likelihood loss for training is:
$$\mathcal{L} = -\sum_{j=1}^{N} \log p(y_j \mid x_j).$$
At the inference stage, we use the Viterbi algorithm (Viterbi, 1967) to generate the final label sequence.
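At inference, Viterbi decoding over the CRF scores can be sketched as below; the emission and transition matrices are hypothetical inputs standing in for the fine-tuned model's scores:

```python
def viterbi(emissions, transitions):
    """Best tag sequence under emission scores (n x T) and transition
    scores (T x T), as in CRF decoding at inference time."""
    n, T = len(emissions), len(emissions[0])
    score = list(emissions[0])   # best score ending in each tag so far
    back = []                    # backpointers for path recovery
    for t in range(1, n):
        new_score, ptr = [], []
        for j in range(T):
            best_i = max(range(T), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emissions[t][j])
            ptr.append(best_i)
        score = new_score
        back.append(ptr)
    best = max(range(T), key=lambda j: score[j])
    path = [best]
    for ptr in reversed(back):   # walk backpointers to recover the path
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

Dynamic programming makes decoding O(n·T²) instead of enumerating all Tⁿ tag sequences.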

Combining with Supervised Lexicon Features
We can naturally combine BABERT with supervised lexicon-based methods because of its unsupervised setting. To this end, we propose a lexicon-enhanced BABERT (BABERT-LE) for the fine-tuning stage, which uses the lexicon adapter proposed by Liu et al. (2021) to incorporate external lexicon knowledge into the BABERT features:
$$\hat{h}_i = \mathrm{LA}(h_i, S_i^{lex}),$$
where LA(·) is the lexicon adapter, S_i^{lex} is the set of related N-gram embeddings for character c_i, and ĥ_i is the lexicon-enhanced version of the original BERT feature h_i. We apply the lexicon adapter after the l-th layer to be consistent with boundary-aware learning. Finally, BABERT-LE follows the same fine-tuning procedure as BABERT.


Experiments

Datasets
Following previous works (Devlin et al., 2019; Xiao et al., 2021), we use a mixed corpus of Chinese Wikipedia and Baidu Baike as our pre-training corpus, which contains 3B tokens and 62M sentences. To further confirm the effectiveness of our proposed method for Chinese sequence labeling, we evaluate BABERT on ten benchmark datasets covering three representative tasks.

Chinese Word Segmentation We use three CWS benchmarks to evaluate BABERT: Penn Chinese TreeBank version 6.0 (CTB6) from Xue et al. (2005), and MSRA and PKU from the SIGHAN 2005 Bakeoff (Emerson, 2005).
Part-Of-Speech Tagging For Chinese POS tagging, we conduct experiments on CTB6 (Xue et al., 2005) and the Chinese part of Universal Dependencies (UD) (Nivre et al., 2016). The UD dataset uses two different POS tagsets: a universal and a language-specific tagset. Following Shao et al. (2017), we refer to the corpus with the two tagsets as UD1 and UD2, respectively.
Named Entity Recognition For the Chinese NER task, we conduct experiments on OntoNotes 4.0 (Onto4) (Weischedel et al., 2011) and the News dataset (Jia et al., 2020), both of which are from the standard newswire domain. Moreover, we evaluate BABERT on the internet novel (Book) and financial report (Finance) domains (Jia et al., 2020) to further verify the robustness of our method. The statistics of the benchmark datasets are shown in Table 1. For a fair comparison, we split these datasets into training, development, and test sections following previous works (Jia et al., 2020; Liu et al., 2021). Note that MSRA, PKU, and Finance do not have development sections. Therefore, we randomly select 10% of the instances from the training set as the development set for these datasets.

Experimental Settings
Hyperparameters During pre-training, we use the hyperparameters of BERT BASE to initialize BABERT and Adam (Kingma and Ba, 2014) for optimization. The number of BERT layers L is 12, with 12 self-attention heads, 768-dimensional hidden states, and 64 dimensions per head. The batch size is set to 32, the learning rate is 1e-4 with a warmup ratio of 0.1, and the maximum length of the input sequence is 512. To extract unsupervised boundary information, we set the maximum N-gram length N to 4 and the frequency filtering threshold to 50. We use the 3rd BERT layer to compute the boundary-aware objective. BABERT has no extra modules, so its parameter size and model architecture are identical to those of BERT BASE. Finally, we train BABERT on 8 NVIDIA Tesla V100 GPUs with 32GB memory.
For Chinese sequence labeling, we empirically set hyperparameters based on previous studies (Jia et al., 2020;Liu et al., 2021) and preliminary experiments.The batch size is 32, the max sequence length is 256, and the learning rate is fixed to 2e-5.
Baselines To verify the effectiveness of our proposed BABERT, we build systems on the following methods to conduct fair comparisons:

• BERT is the Chinese version of the BERT BASE model released by Google.
• BERT-wwm performs segmentation on the corpus and further conducts word-level masking in pre-training (Cui et al., 2021).
• ERNIE is an extension of BERT, which leverages external lexicons for word-level masking (Sun et al., 2019).
• ERNIE-Gram is an extension of ERNIE, which alleviates the limitations of external lexicons by using statistical information for entity and phrase extraction (Xiao et al., 2021).
• ZEN uses an extra N-gram encoder to integrate external lexicon knowledge into BERT during pre-training (Diao et al., 2020).
• NEZHA leverages functional relative positional encoding, supervised word-level masking strategy, and enormous training data 6 to enhance vanilla BERT (Wei et al., 2019).
• BERT-LE is a lexicon-enhanced BERT (Liu et al., 2021), which introduces a lexicon adapter between BERT layers to incorporate external lexicon embeddings. We strictly follow Liu et al. (2021) to reimplement it with open-source word embeddings.7

Main Results
The overall Chinese sequence labeling results are shown in Table 2. We report the F1-score on the test sets of the CWS, POS, and NER tasks. We first compare our BABERT with various Chinese pre-trained language models to evaluate its effectiveness. Then, we compare BABERT-LE with other supervised lexicon-based methods to show the potential of BABERT in combination with external lexicon knowledge.
6 NEZHA uses three large corpora, including Chinese Wikipedia, Baidu Baike, and Chinese News, which contain 11B tokens and are four times larger than ours.
7 https://ai.tencent.com/ailab/nlp/en/embedding.html

First, we examine the F1 values of the BERT baseline. As shown, BERT obtains comparable results on all Chinese sequence labeling tasks, similar to those of Diao et al. (2020), Tian et al. (2020a), and Liu et al. (2021). BABERT significantly outperforms BERT, with an average increase of 90.47 − 89.80 = 0.67. This observation clearly indicates the advantage of introducing boundary information into BERT pre-training.
Compared with various BERT extensions, our BABERT achieves competitive performance overall. First, in comparison with BERT-wwm, ERNIE, ERNIE-Gram, and ZEN, which leverage external lexicons of high-frequency words for pre-training, BABERT outperforms them by (0.54 + 0.41 + 0.40 + 0.57)/4 = 0.48 points on average, and achieves top scores on eight of the ten benchmarks. This result is consistent with our intuition that directly exploiting a supervised lexicon can only achieve good performance on specific tasks, indicating the limitation of these methods when the chosen lexicon is incompatible with the target tasks. Second, we find that BABERT surpasses NEZHA in average F1, indicating that boundary information is more critical than data scale for Chinese sequence labeling.
Then, we compare our method with supervised lexicon-based methods. Lattice-based methods (Zhang and Yang, 2018; Yang et al., 2019) are the first to integrate word features into neural networks for Chinese sequence labeling. MEM (Tian et al., 2020a,b) designs an external memory network after the BERT encoder to incorporate lexicon knowledge. EEBERT (Jia et al., 2020) builds entity embeddings from the corpus and further uses them in the multi-head attention mechanism. The results are shown in Table 2 (II). All the above methods lead to significant improvements over the base BERT model, which shows the effectiveness of external lexicon knowledge. Moreover, BABERT achieves comparable performance with these methods, which further demonstrates the potential of our unsupervised approach. BABERT learns boundary information from unsupervised statistical features with vanilla BERT, which gives it excellent scalability to fuse with other BERT-based supervised lexicon models. As shown, our BABERT-LE achieves further improvements and state-of-the-art performance on all tasks, demonstrating the advantages of our unsupervised setting and boundary-aware learning. Interestingly, compared with MEM and ZEN, BABERT-LE obtains larger improvements over its corresponding baseline. One reason might be that both ZEN and the memory network module exploit supervised lexicons, which leads to a duplication of the introduced knowledge.

Analysis
In this subsection, we conduct detailed experimental analyses for an in-depth comprehensive understanding of our method.
Few-Shot Setting To further verify the effectiveness of BABERT, we conduct experiments under the few-shot setting, where we randomly sample 10, 50, and 100 instances of the original training data from PKU (CWS) and Onto4 (NER). For fair comparison, we compare BABERT with pre-trained language models that use no external supervised knowledge. The results are presented in Table 3. As the size of the training data is reduced, [...] adding the T-test does not bring further improvements. One possible reason is that the T-test is essentially similar to the entropy measure over 2-grams, which has already been injected into our BABERT model.
Boundary Information Encoding Layer Previous works (Jawahar et al., 2019; Liu et al., 2021) show that different BERT layers generate representations at different levels of abstraction: shallow layers tend to capture basic lexical information, while top layers focus on semantic representation. We empirically set l ∈ {1, 3, 6, 12} to explore the effect of computing the boundary-aware loss on the hidden features H^l of different BERT layers for Chinese sequence labeling. Table 5 shows the results. The best F1-score is achieved at l = 3 on all datasets, indicating that BABERT still needs sufficient parameters above the encoding layer to learn basic boundary information. Interestingly, BABERT performs poorly at l = 12, which might be due to a conflict between the MLM loss and our boundary-aware regression loss during pre-training.
Qualitative Analysis To explore how BABERT improves Chinese sequence labeling, we conduct a qualitative analysis on the News test set, which consists of four sub-domains: game (GAM), entertainment (ENT), lottery (LOT), and finance (FIN). The results are shown in Table 6. Compared with other pre-trained language models, BABERT obtains consistent improvements in all domains with unsupervised statistical boundary information, while the other models only improve performance on specific domains. Moreover, as shown in Table 7, we give an example from the game domain to further demonstrate the effectiveness of our method. BABERT is the only model that correctly recognizes all entities. In particular, its prediction of the entity "WCG2011 org" indicates the potential of boundary information.

Conclusion
In this paper, we proposed BABERT, a novel unsupervised boundary-aware pre-training model for Chinese sequence labeling. In BABERT, given a Chinese sentence, we calculate a boundary-aware representation from unsupervised statistical information to capture boundary information, and directly inject it into the BERT weights during pre-training. Unlike previous works, BABERT exploits boundary information in an effective, unsupervised manner, thereby alleviating the limitations of supervised lexicon-based approaches. Experimental results on ten benchmark datasets across three different tasks illustrate that our method is highly effective and outperforms other Chinese pre-trained models. Moreover, the combination with supervised lexicon extensions achieves further improvements and state-of-the-art results on most tasks.

Limitations
BABERT suffers from three major limitations. The first is that, in the boundary information extractor, we empirically chose PMI and LRE. Beyond these indicators and the T-test measure verified in our experimental analyses, other statistics that contain boundary information could be used to compute the boundary-aware representation. Thus, we plan to explore more unsupervised statistical features. The second limitation is that we focused only on Chinese sequence labeling tasks in this work, ignoring the potential of boundary information and BABERT for other Chinese NLP tasks.
The third is that we only consider BABERT for Chinese. For other languages that do not use spaces between words, such as Japanese and Thai, we could also attempt to inject boundary information, but its effectiveness in these languages remains to be verified experimentally in future studies.

Figure 1 :
Figure 1: The overall architecture of the boundary-aware pre-trained language model, which consists of three parts: (a) boundary information extractor, (b) boundary-aware representation, and (c) boundary-aware BERT learning. The boundary-aware objective L_BA is defined in Equation 7.

Table 1 :
The statistics of sentence numbers for the benchmark datasets. For the datasets without development sections, we randomly select 10% of sentences from the corresponding training set as the development set.

Table 2 :
The overall results on three Chinese sequence labeling tasks, where we report the F1-score on the test set. ‡ denotes that external knowledge is used. † denotes that a large-scale pre-training corpus is used. ♠ indicates that we reproduce LEBERT in a similar way for fair comparison.

Table 4 :
The results of different feature combining settings on the four NER benchmark datasets.

Table 5 :
The influence of boundary information learned at different layers of the BERT model.

Table 6 :
The results of different pre-trained language models on the four sub-domains of the News NER test set.