Lexicon-Based Graph Convolutional Network for Chinese Word Segmentation

Precise word-boundary information can alleviate lexical ambiguity and thereby improve the performance of natural language processing (NLP) tasks; Chinese word segmentation (CWS) is thus a fundamental task in NLP. Owing to the development of pre-trained language models (PLMs), pre-trained knowledge helps neural methods solve the main problems of CWS to a significant extent, and existing methods have already achieved high performance on several benchmarks (e.g., Bakeoff-2005). However, recent outstanding studies remain limited by small-scale annotated corpora. To further improve the performance of CWS methods based on fine-tuning PLMs, we propose a novel neural framework, LBGCN, which incorporates a lexicon-based graph convolutional network into the Transformer encoder. Experimental results on five benchmarks and four cross-domain datasets show that LBGCN successfully captures the information of candidate words and improves performance on the benchmarks (Bakeoff-2005 and CTB6) and the cross-domain datasets (SIGHAN-2010). Further experiments and analyses demonstrate that our proposed framework effectively models the lexicon to enhance basic neural frameworks and strengthens robustness in the cross-domain scenario.


Introduction
Neural methods often leverage word-level information to improve the performance of many downstream natural language processing (NLP) tasks, such as text classification and machine translation (Yang et al., 2018). Because it determines word boundaries, word segmentation is regarded as a prerequisite for most downstream NLP tasks.
Unlike most written languages, written Chinese has no explicit delimiters to separate words in the text. Thus, Chinese word segmentation (CWS) is an essential pre-processing step for many Chinese NLP tasks.
With the development of deep learning techniques, recent neural CWS approaches that do not heavily rely on hand-crafted feature engineering have already achieved high performance on several benchmark datasets (Cai and Zhao, 2016; Cai et al., 2017; Ma et al., 2018). In particular, recent outstanding studies have also exploited the learning paradigm of applying pre-trained language models (PLMs) to many NLP tasks. Various methods that fine-tune PLMs have achieved progress on in-domain and cross-domain CWS without much manual effort (Meng et al., 2019; Tian et al., 2020; Ke et al., 2021).
Prior research has shown that the main problems of CWS are segmentation ambiguity and out-of-vocabulary (OOV) words. With the help of pre-trained knowledge (Devlin et al., 2018), fine-tuning CWS methods can effectively alleviate these two issues and outperform other neural network architectures, and methods that fine-tune PLMs have become the mainstream approach for CWS. However, the performance of fine-tuning CWS methods is limited by the scale and quality of annotated CWS corpora. The dependencies between neighboring Chinese characters are diverse, and the linguistic characteristics of Chinese make it hard to build a large-scale annotated corpus; the difficulty of manual annotation restricts the scale and quality of CWS datasets. Besides, direct fine-tuning methods do not utilize contextual n-grams or other contextual information, which was important for previous model architectures (e.g., BiLSTM and Transformer) (Huang et al., 2015; Ma et al., 2018; Qiu et al., 2020). Methods that fine-tune PLMs may therefore generate segmentation errors given ambiguous contextual information. Thus, it is a challenge to design a framework that effectively transfers pre-trained knowledge into CWS.
In this paper, we propose LBGCN, a neural framework with a lexicon-based graph convolutional network (GCN), to improve the performance of CWS by leveraging lexicon knowledge. In detail, we utilize the GCN to extract contextual features of candidate words and word-boundary information from a pre-defined lexicon. The framework incorporates the GCN into the Transformer encoder (Vaswani et al., 2017), which is a component of PLMs (e.g., BERT (Devlin et al., 2018)). The additional lexicon-based GCN fills a gap in the fine-tuning paradigm and better transfers pre-trained knowledge into the in-domain and cross-domain CWS tasks. Besides, through multi-feature interaction, disambiguation and OOV word recognition are carried out effectively.
To sum up, the contributions of this work are as follows:
• Our proposed framework mainly consists of a lexicon-based GCN and the Transformer encoder. The lexicon-based GCN captures rich contextual information to alleviate the problem of insufficient training caused by small-scale annotated corpora. This framework achieves a noticeable improvement for CWS.
• Experimental results obtained from widely used benchmark datasets demonstrate that LBGCN can improve the performance compared with powerful baseline methods and outperform previous state-of-the-art studies.
• The novel method extracts information from the lexicon via the GCN and is not over-reliant on the quality of the lexicon. Experimental results in the cross-domain scenario show that the method can enhance the robustness of basic neural CWS approaches.

Related Work
Chinese Word Segmentation Since Xue (2003) formalized CWS as a sequence labeling problem, most studies have followed the character-based paradigm, predicting a segmentation label for each character in the sentence. The adopted methods fall into two categories: 1) statistical machine learning methods (Peng et al., 2004; Tseng et al., 2005; Zhao and Kit, 2008) and 2) neural network methods (Zheng et al., 2013; Pei et al., 2014; Chen et al., 2015a,b; Cai and Zhao, 2016; Yang et al., 2017). As deep learning techniques have developed, neural CWS methods have achieved better performance than statistical learning methods (Cai et al., 2017; Zhou et al., 2017; Ma et al., 2018; Yang et al., 2019a), and neural network architectures have gradually replaced statistical machine learning methods as the mainstream approach for CWS.
Cross-Domain CWS However, there is an obvious performance gap in the cross-domain CWS scenario, where neural CWS methods still suffer from the OOV problem.
To alleviate this problem, many studies utilize external resources (e.g., pre-trained embeddings, unlabeled data, and lexicons) to improve the performance of cross-domain CWS (Zhao et al., 2018; Ye et al., 2019; Ding et al., 2020). For example, some works try to fully transfer pre-trained knowledge into cross-domain CWS by leveraging more annotated datasets with different segmentation criteria. Tian et al. (2020) utilize lexicons and wordhood measures to enhance robustness in the cross-domain CWS scenario.
Graph Neural Network In recent years, graph neural networks have been extensively explored and have achieved significant progress in several kinds of NLP tasks (Zhou et al., 2020). In text scenarios, graphs can extract features from non-structural data by modeling a set of objects (nodes) and their relationships (edges). In particular, for the sequence labeling task, we can treat each variable in the text as a node and the dependencies between them as edges. Marcheggiani and Titov (2017) present a syntactic GCN to solve the problem of semantic role labeling. Ding et al. (2019) utilize a multi-graph structure to capture the information offered by gazetteers. In addition, a graph neural network based on a domain lexicon has been used to learn local composition features for medical-domain CWS (Du et al., 2020).

Proposed Framework
The framework of LBGCN is illustrated in Figure 1. It mainly consists of two parts: an encoder-decoder layer and a GCN. In the first part, we utilize the Transformer as the encoder and a dense layer as the decoder. The Transformer encoder adopts the weights of PLMs (e.g., BERT and RoBERTa), which contain rich pre-trained knowledge; this knowledge can effectively help the model alleviate the problem of OOV word recognition. In the second part, a GCN based on the pre-defined lexicon is built. The generated graph embeddings compensate for the missing contextual information from candidate words. In addition, the proposed framework, which integrates bi-gram features and multiple contextual features, can improve the performance of CWS. Following previous studies (Xue, 2003), we regard CWS as a character-based sequence labeling task: the framework predicts a tag representing the position in a word for each character (e.g., the tag "B" represents the first character of a word). The process by which LBGCN finds the most probable label path Ŷ can be formalized as

Ŷ = argmax_{Y ∈ T^N} p(Y | X),

where T denotes the set of all segmentation labels and N is the length of the input sentence X. The rest of this section describes the architecture of the encoder-decoder layer, the construction of the lexicon graph, and how it is integrated with the GCN, respectively.
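To make the character-based labeling scheme concrete, the following minimal sketch (an illustration, not the authors' code) converts a gold word segmentation into per-character tags from the set T = {B, M, E, S}:

```python
def to_bmes_tags(words):
    """Convert a gold word segmentation into per-character BMES tags.

    "B" marks the first character of a multi-character word, "M" a middle
    character, "E" the last character, and "S" a single-character word.
    """
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags

# "水仙花 / 是 / 草本 / 植物" -> B M E S B E B E
tags = to_bmes_tags(["水仙花", "是", "草本", "植物"])
```

Predicting these tags for each character, and then cutting the sentence at every "E" or "S", recovers the segmentation.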

Encoder and Decoder
Transformer Encoder Recently, several PLMs (e.g., BERT and RoBERTa) have shown state-of-the-art performance on many NLP tasks. In particular, a modified method based on the RoBERTa model has been built for Chinese NLP tasks (Cui et al., 2019). Given the previous success of PLMs, we adopt their main architecture (the Transformer) as the encoder of our proposed framework, so that the encoder can straightforwardly leverage pre-trained knowledge from PLMs because of the identical structure.
The PLM is generally trained to predict words. To transfer the pre-trained knowledge into CWS, we fine-tune the PLM on the annotated CWS corpus. Given an input sentence X = x_1 ... x_{i-1} x_i ... x_n from the training data, the sentence is converted to the corresponding vector embeddings H through the embedding lookup table layer, where H ∈ R^{N×d_model}, and N and d_model denote the sentence length and the hidden dimension of the PLM, respectively. To be consistent with the pre-training process, two tags ("[CLS]" and "[SEP]") are added to the beginning and the end of each sentence, respectively.
Given an input sentence representation H ∈ R^{N×d_model}, the Transformer encoder utilizes self-attention layers to extract contextual features for each character. The self-attention layer adopts "Scaled Dot-Product Attention" to compute representations:

Attention(Q, K, V) = softmax(QK^T / √d_k) V,

where Q, K, and V denote a query and a set of key-value pairs obtained from H through linear transformations. Instead of performing a single-head attention function, the Transformer encoder uses a multi-head self-attention layer to extract contextual features from different representation spaces, and a feed-forward network (FFN) to enhance representation ability. Assuming the input of the multi-head self-attention layer is H, the output H̃ is calculated as

H' = LN(H + MultiHead(H)),
H̃ = LN(H' + FFN(H')),

where "LN" denotes layer normalization (Ba et al., 2016).
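A single attention head can be sketched in a few lines of numpy (a simplified, batch-free illustration under the standard scaled dot-product formulation; production code would use a deep learning framework):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (N, N) similarity scores
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted sum of values
```

Each output row is a convex combination of the value rows, so every character representation mixes in context from the whole sentence.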
Dense Decoder A dense layer with W_D ∈ R^{d_model×T_n} projects the hidden states to the 4-tag set T = {B, M, E, S}, where T_n denotes the size of the tag set (T_n = 4). After the linear mapping, the framework applies the Softmax function and greedy search for decoding. In previous studies, many works adopt a CRF as the decoder layer to improve the performance of sequence labeling tasks (Lample et al., 2016). However, the CRF layer has higher time and space complexity for CWS (Duan and Zhao, 2020). For practicality, the proposed framework utilizes the lightweight Softmax function as the decoder layer and still achieves competitive performance compared with studies using a CRF.
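The dense-plus-Softmax decoding step can be sketched as follows; this is a minimal single-sentence illustration (not the paper's implementation), where `W` and `b` stand for hypothetical trained parameters:

```python
import numpy as np

TAGS = ["B", "M", "E", "S"]

def greedy_decode(hidden, W, b):
    """Project hidden states (N, d_model) to tag logits (N, 4) with a dense
    layer, apply Softmax, and take the per-character argmax (greedy search)
    instead of CRF Viterbi decoding."""
    logits = hidden @ W + b                          # (N, 4) tag scores
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)       # softmax over the tag set
    return [TAGS[i] for i in probs.argmax(axis=-1)]  # one tag per character
```

Since Softmax is monotone, the argmax of the probabilities equals the argmax of the logits; the Softmax matters only for the training loss, which is why greedy decoding stays cheap at inference time.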
The training step minimizes segmentation errors by solving the following optimization problem:

(Θ̂_t, Θ̂_g) = argmin_{Θ_t, Θ_g} J_seg(Θ_t, Θ_g),

where Θ_t and Θ_g are all trainable parameters in the Transformer layer and the GCN, respectively, and the loss function J_seg is the negative log-likelihood of the true labels:

J_seg = − Σ_{x ∈ X} log p(y(x) | x),

where y(x) denotes the true labels in the annotated corpus.

Lexicon-Based GCN
Lexicon-Based Construction The bottom part of Figure 1 starts with a lexicon, from which we construct the graph. Given the input sentence X = x_1 ... x_{i-1} x_i ... x_n, the graph utilizes a pre-defined lexicon to extract candidate words in the sentence after the Transformer encoder. For example, X = ["水仙花是草本植物"] (Daffodils are herbaceous plants) consists of 8 Chinese characters, and the word list L = ["水仙花" (daffodils), "是" (are), "草本" (herbaceous), "植物" (plant)] is obtained from the lexicon. The lexicon-based graph is defined as G := (V, E), where V and E are the sets of nodes and edges, respectively. Each character is represented as a node in the graph, and adjacent nodes are connected by undirected edges to capture contextual information; the set of these undirected edges is E_c. Besides, we integrate three additional nodes, V_B, V_M, and V_E, with the character node set V_c, so the entire set of nodes is V = V_c ∪ V_d, where V_d = {V_B, V_M, V_E}. To extract word-boundary information, we also build edges between candidate words w_i = c_1 ... c_n, w_i ∈ L, and the additional nodes V_d. The entire set of edges is E = E_c ∪ E_d, where E_d represents the set of edges between candidate words and additional nodes. The first character c_1 of a candidate word connects to the nodes V_B and V_M, and the last character c_n connects to the nodes V_M and V_E. For instance, the candidate word "水仙花" (daffodils) consists of the three characters "水" (water), "仙" (fairy), and "花" (flower). In particular, the character node "水" (water) connects to the
nodes V_B and V_M. The character node "仙" (fairy) connects only to the node V_M. The character node "花" (flower) connects to the nodes V_M and V_E. The construction of the lexicon graph is illustrated in Figure 1.
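The construction described above can be sketched as follows. This is an illustrative reimplementation, not the authors' code; in particular, the handling of single-character lexicon words is unspecified in the text, so this sketch simply applies both the first-character and last-character rules to the same node:

```python
import numpy as np

def build_lexicon_graph(sentence, lexicon):
    """Build the adjacency matrix of the lexicon-based graph.

    Nodes are the characters of the sentence plus three boundary nodes
    V_B, V_M, V_E (indices n, n+1, n+2). Adjacent characters are linked by
    undirected edges (E_c); each lexicon word found in the sentence links
    its first character to V_B and V_M, its inner characters to V_M, and
    its last character to V_M and V_E (E_d).
    """
    n = len(sentence)
    B, M, E = n, n + 1, n + 2                 # indices of the boundary nodes
    A = np.zeros((n + 3, n + 3), dtype=int)

    def link(i, j):                           # add an undirected edge
        A[i, j] = A[j, i] = 1

    for i in range(n - 1):                    # chain of adjacent characters
        link(i, i + 1)

    for i in range(n):                        # enumerate candidate words
        for j in range(i + 1, n + 1):
            if sentence[i:j] in lexicon:
                link(i, B); link(i, M)            # first character
                for k in range(i + 1, j - 1):     # inner characters
                    link(k, M)
                link(j - 1, M); link(j - 1, E)    # last character
    return A

A = build_lexicon_graph("水仙花是草本植物",
                        {"水仙花", "是", "草本", "植物"})
```

The resulting symmetric matrix A is exactly the adjacency matrix consumed by the GCN described next; the brute-force substring enumeration is O(n²) and would be replaced by a trie lookup for long sentences.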
GCN After the construction of the lexicon-based graph, we utilize a GCN (Kipf and Welling, 2016) to encode the graph G.
The propagation rule of the l-th layer is

Ĥ^{(l+1)} = σ( D̃^{−1/2} Ã D̃^{−1/2} Ĥ^{(l)} W^{(l)} ),  Ã = A + I_N,

Here, A is the adjacency matrix of the undirected graph G, I_N is the identity matrix, D̃ is the degree matrix of Ã, and W^{(l)} is a layer-specific trainable weight matrix. σ(·) denotes the ReLU activation function. Ĥ^{(l)} ∈ R^{N×D} is the matrix of hidden states in the l-th layer, and Ĥ^{(0)} = X. The GCN mainly involves two matrices: the symmetrically normalized adjacency matrix D̃^{−1/2} Ã D̃^{−1/2} and the layer-specific trainable weight matrix W^{(l)}. The GCN extracts features from the lexicon-based graph. In addition, the weight matrix is randomly initialized, and the learning rate of this layer differs from that of the Transformer encoder.
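The propagation rule can be sketched in a few lines of numpy (a dense, single-layer illustration of the Kipf-and-Welling update; real implementations typically use sparse matrices and a framework's autograd):

```python
import numpy as np

def gcn_layer(H, A, W):
    """One GCN layer: ReLU( D~^{-1/2} (A + I) D~^{-1/2} H W ),
    where D~ is the degree matrix of A + I (self-loops added)."""
    A_tilde = A + np.eye(A.shape[0])                 # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))  # D~^{-1/2} diagonal
    A_hat = A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(0.0, A_hat @ H @ W)            # ReLU activation
```

The symmetric normalization keeps the propagated features on a comparable scale regardless of node degree, which matters here because the boundary nodes V_B, V_M, and V_E can have far more edges than character nodes.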

Datasets and Settings
To verify the improvement of our proposed framework LBGCN, we conduct comparative experiments on both benchmarks (Bakeoff-2005 (Emerson, 2005) and CTB6) and cross-domain datasets (SIGHAN-2010) (Zhao and Liu, 2010). The sizes of the benchmarks are shown in Table 1. Note that the cross-domain datasets do not contain training samples, so we use "PKU", the corpus most similar to them, as the training data. We pre-process the unsegmented sentences in a manner similar to previous work (Cai et al., 2017). The evaluation metrics for CWS are the F-score and R_OOV (the recall of out-of-vocabulary words). We utilize three mainstream PLMs for training the Transformer encoder: XLNET-BASE (Yang et al., 2019b), BERT-BASE, and ROBERTA-WWM (Cui et al., 2019). To fine-tune the PLMs, we tune a few crucial hyper-parameters on the development sets; the hyper-parameters and search ranges are shown in Table 2. We deploy the model on the same device (GPU environment: Nvidia Tesla V100).
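Both metrics can be computed from word spans. The sketch below is a standard span-based formulation (not the official Bakeoff scoring script), where R_OOV is the recall restricted to gold words absent from the training vocabulary:

```python
def word_spans(words):
    """Map a word sequence to (word, start, end) character spans."""
    spans, i = [], 0
    for w in words:
        spans.append((w, i, i + len(w)))
        i += len(w)
    return spans

def evaluate(gold, pred, train_vocab):
    """Return (F-score, R_OOV) for one pair of segmented sentences."""
    g, p = word_spans(gold), word_spans(pred)
    gset = {(s, e) for _, s, e in g}
    pset = {(s, e) for _, s, e in p}
    correct = len(gset & pset)                 # exactly matching word spans
    prec, rec = correct / len(pset), correct / len(gset)
    f = 2 * prec * rec / (prec + rec) if correct else 0.0
    oov = [(s, e) for w, s, e in g if w not in train_vocab]
    r_oov = (sum((s, e) in pset for s, e in oov) / len(oov)) if oov else 1.0
    return f, r_oov
```

A predicted word counts as correct only when both its start and end boundaries match the gold segmentation, which is why boundary errors are penalized twice (once in precision, once in recall).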

Experimental Results
This section first reports the results of LBGCN with different configurations on five benchmarks and compares them with existing models. It then describes the effect of LBGCN in the cross-domain scenario.

Results on Benchmarks As shown in Table 3, the three baseline models, which utilize different PLMs to train the Transformer encoder of our proposed framework, are denoted "XLNET", "BERT", and "ROBERTA", respectively. Three observations can be drawn from the results. First, the framework integrated with our proposed LBGCN outperforms the baseline models on all five datasets in terms of F-score, and on the majority of datasets in terms of R_OOV. Second, the proposed LBGCN makes small improvements on some datasets, whereas considerable improvements are shown on the others. Third, compared with existing methods, also presented in Table 3, the proposed framework LBGCN based on BERT or RoBERTa outperforms all existing models in terms of F-score on all benchmarks.

Results on Cross-Domain CWS Domain variance strongly affects the performance of word segmenters. To demonstrate the efficiency of LBGCN, we also run frameworks with and without the LBGCN in the cross-domain scenario. Table 4 reports the results in F-score, which show a trend similar to Table 3: LBGCN outperforms the baselines in all five domains, and the framework ROBERTA-LBGCN achieves state-of-the-art performance in terms of average F-score. In particular, XLNET performs better on the "Computer" domain, and BERT performs better on the "Literature" domain. In general, our proposed LBGCN mechanism can effectively improve performance in the cross-domain scenario, and all LBGCN variants fine-tuning different PLMs achieve competitive performance compared with existing methods.

(Figure 2: The F-scores of LBGCN using four different lexicons on two benchmark datasets, where "PKU" is the simplified Chinese dataset and "AS" is the traditional Chinese dataset.)

Effect of Using Different Lexicons
LBGCN utilizes a general way of integrating a lexicon for CWS. To analyze the effect of using different lexicons, we adopt four different lexicons in ROBERTA-LBGCN and compare them with the baseline model, as shown in Figure 2. The four lexicons comprise two simplified Chinese dictionaries and two traditional Chinese dictionaries. In particular, the two simplified Chinese dictionaries consist of a basic version "LBGCNs" (red) and a modified version "LBGCNsd" (yellow); similarly, the two traditional Chinese dictionaries consist of a basic version "LBGCNt" (purple) and a modified version "LBGCNtd" (green).
As shown in Figure 2, the performance using any of the four lexicons is better than that of the baseline model on both the "PKU" and "AS" datasets, indicating the efficiency of our proposed lexicon-based framework. The framework using the basic simplified Chinese dictionary (red) achieves the largest improvement on "PKU", and the one using the basic traditional Chinese dictionary (purple) achieves the largest improvement on "AS".

Ablation Study
LBGCN integrates two additional components for CWS: the bi-gram features and the lexicon-based GCN. To analyze the effect of each component, we conduct an ablation experiment based on the ROBERTA PLM, which performs better on both the "PKU" and "MSR" benchmarks; the results are shown in Table 5. Table 5 shows that the GCN (ID: 3, 4) effectively improves the performance of the baseline model on "PKU" and "MSR" and also alleviates the issue of OOV words, indicating the effectiveness of our proposed framework. While the GCN integrated with the bi-gram component (ID: 4) increases the gain on "PKU" from +0.15 to +0.21, it hurts R_OOV. A single bi-gram component (ID: 2) hardly affects the F-score but can improve the recall of OOV words. In terms of the results in Table 5, the bi-gram features and the GCN together boost performance considerably.

Case Study
To investigate how the proposed framework learns from the lexicon-based GCN, we choose the example input sentence "在/青石板/路上/的/清脆/回声" (The clear echo on the flagstone road) from the literature domain as a case study. In this sentence, the n-gram "青石板/路" (flagstone road) denotes a road made of a special kind of stone and often occurs in Chinese literature. However, the split "板路" is short for "plate circuit". The baseline model may be confused by this case because of the character diversity in Chinese. Intuitively, "青石板" is in the pre-defined lexicon, so an undirected graph is constructed carrying its word-boundary information. The lexicon-based GCN then captures this information and integrates the graph embeddings with the original hidden states. The integrated embeddings, enriched with lexicon knowledge, are transformed into the correct tags by the decoder. In Figure 3, we visualize the resulting weights learned by the basic Transformer encoder (a) and by the final tagger (b).
In addition, in another case, "地瓜/粥" (sweet potato congee) is a Chinese food, and "粥" (congee) should be regarded as a single suffix word under the "PKU" segmentation criterion. The baseline model cannot segment it correctly because it retains superabundant pre-trained knowledge from the PLM. In LBGCN, "地瓜粥" does not exist in the lexicon, but "地瓜" is a lexicon word. LBGCN encodes this relationship in the graph to distinguish important n-grams and accordingly improves CWS performance.

Conclusion
To make up for the insufficiency of previous methods that fine-tune PLMs, we propose in this paper a lexicon-based graph convolutional network that better transfers pre-trained knowledge from PLMs into CWS. Our proposed framework LBGCN provides baseline models with word-boundary and contextual information while preserving their merits in applying PLMs. In summary, the advantages of LBGCN are threefold. First, the framework does not rely on a particular PLM; it further improves baselines built on all three mainstream PLMs for CWS. Second, extensive experimental results show that LBGCN achieves competitive performance on the CWS benchmarks compared with previous methods. Third, further experiments and analyses demonstrate the effectiveness of LBGCN in the cross-domain scenario as well as when using different lexicons and components. Overall, this paper presents an elegant way to use a graph neural network for CWS and to enhance fine-tuning CWS methods. For future work, we plan to investigate other sequence labeling tasks using the same methodology.