Improved Unsupervised Chinese Word Segmentation Using Pre-trained Knowledge and Pseudo-labeling Transfer

Unsupervised Chinese word segmentation (UCWS) has made progress by incorporating linguistic knowledge from pre-trained language models through parameter-free probing techniques. However, such approaches suffer from increased training time because they require multiple inferences with a pre-trained language model to perform word segmentation. This work introduces a novel way to enhance UCWS performance while maintaining training efficiency. Our proposed method integrates the segmentation signal from an unsupervised segmental language model into a pre-trained BERT classifier under a pseudo-labeling framework. Experimental results demonstrate that our approach achieves state-of-the-art performance on seven out of eight UCWS tasks while considerably reducing training time compared to previous approaches.


Introduction
Word segmentation is critical in natural language processing (NLP) tasks. Unlike English, Chinese text has no explicit delimiters between words (e.g., whitespace), making Chinese word segmentation (CWS) a challenging task. Traditional unsupervised Chinese word segmentation (UCWS) approaches include rule-based and statistical methods (Chang and Lin, 2003; Tseng et al., 2005; Low et al., 2005; Mochihashi et al., 2009). In recent years, neural approaches based on word embeddings and recurrent neural networks have been studied for UCWS. Sun and Deng (2018) proposed the Segmental Language Model (SLM), built on pre-trained word embeddings (Mikolov et al., 2013) and an LSTM-based language model (Hochreiter and Schmidhuber, 1997). By modeling the probabilities of segments up to a fixed maximum length, SLM performs better than the traditional approaches on UCWS tasks.
Recently, as large-scale pre-trained language models have become mainstream solutions for many NLP tasks, Li et al. (2022) successfully applied BERT (Devlin et al., 2018) to UCWS and demonstrated state-of-the-art (SOTA) performance by incorporating Perturbed Masking (Wu et al., 2020) and self-training loops into a BERT classifier. However, Perturbed Masking is computationally inefficient because it requires at least two BERT forward passes for each token, processed in left-to-right order (Li et al., 2022). In other words, for a sequence of length N, Perturbed Masking (Wu et al., 2020) adds a training-time cost of 2N BERT forward passes, which significantly increases the time needed to fine-tune BERT for UCWS.
This paper introduces a simple unsupervised training framework that leverages pre-trained BERT (Devlin et al., 2018) for UCWS efficiently. Following Xue (2003), we view the CWS task as a sequence tagging problem. To make the BERT model learn how to segment words with its implicit pre-trained knowledge, we propose a pseudo-labeling approach that fine-tunes BERT with pseudo-labels generated by an unsupervised segment model. Our experiments demonstrate that the proposed method brings substantial performance gains over previous studies (Sun and Deng, 2018; Downey et al., 2022; Li et al., 2022). In addition, the proposed method reduces training time by 80% compared to the existing SOTA method (Li et al., 2022), which also utilizes a pre-trained language model for UCWS.

Method
There are two modules in the proposed framework: the segment model and the classifier, as shown in Figure 1. The segment model produces text segmentation results, which serve as pseudo-labels. The classifier, a BERT-based classification model, then uses these labels to learn how to separate the words in a sequence. In other words, the segment model acts as a teacher for the classifier. The following sections first introduce the details of the two modules and then describe the training process in our proposed framework.

Segment Model
In our approach, we employ the Segmental Language Model (SLM) proposed by Sun and Deng (2018) as the segment model that provides pseudo-labels in our framework. SLM is a language model that segments a sequence according to the probability of generating <eos> (end of a segment) as the next character. For example, given a sequence {x_1, x_2, x_3}, the two segments {x_1, x_2} and {x_3} are obtained if the probability of generating <eos> after x_2 is higher than that after x_3. We omit the details of SLM's unsupervised training process and refer readers to the original paper (Sun and Deng, 2018).
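To make this boundary decision concrete, the toy Python sketch below places a boundary whenever the probability of <eos> after the current prefix exceeds a threshold. The function eos_prob is a hypothetical stand-in for SLM's estimate, and this greedy rule is only an illustration: the actual SLM scores all candidate segmentations with dynamic programming (Sun and Deng, 2018).

def greedy_segment(chars, eos_prob, threshold=0.5):
    # chars: list of characters; eos_prob(prefix) -> estimated P(<eos> is generated next).
    # Toy illustration only: the real SLM searches over all segmentations.
    segments, current = [], []
    for i, ch in enumerate(chars):
        current.append(ch)
        if eos_prob(chars[: i + 1]) > threshold:
            segments.append("".join(current))
            current = []
    if current:
        segments.append("".join(current))
    return segments

# Dummy scorer that favors a boundary after every second character.
print(greedy_segment(list("你好吗"), lambda prefix: float(len(prefix) % 2 == 0)))
# ['你好', '吗']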

Classifier
We follow Xue (2003) to treat CWS as a sequence tagging task. As illustrated in Figure 1 (b), we add fully-connected layers (FC) on top of the BERT (Devlin et al., 2018) model as our classifier:

p_{ij} = \mathrm{softmax}(W^{\top} h_{ij} + b), \quad 1 \le i \le N, \; 1 \le j \le T_i,

where p_{ij} is the predicted label distribution of the j-th token in the i-th sequence, N is the number of examples in a dataset, and T_i is the length of the i-th sequence. W ∈ R^{d×k} and b ∈ R^k are trainable parameters for k output tagging labels, and h_{ij} ∈ R^d is the output hidden state of BERT at the j-th token with d dimensions.
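As a rough illustration of this classifier (a sketch under our own assumptions, not the authors' released implementation), the PyTorch code below stacks a linear layer on top of BERT via the HuggingFace transformers interface; the model name and label count are assumptions.

import torch
from transformers import BertModel, BertTokenizerFast

class SegTagger(torch.nn.Module):
    """BERT encoder with a fully-connected layer that tags every token."""
    def __init__(self, name="bert-base-chinese", k=2):
        super().__init__()
        self.bert = BertModel.from_pretrained(name)
        self.fc = torch.nn.Linear(self.bert.config.hidden_size, k)  # W, b

    def forward(self, input_ids, attention_mask):
        # h: (batch, seq_len, d) hidden states; output: (batch, seq_len, k) label probabilities p_ij.
        h = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        return torch.softmax(self.fc(h), dim=-1)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = SegTagger()
batch = tokenizer(["我们喜欢自行车"], return_tensors="pt")
probs = model(batch["input_ids"], batch["attention_mask"])  # shape (1, seq_len, 2)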

Training Framework
Our training framework is composed of two stages, as illustrated in Figure 1. In the first stage, we follow Sun and Deng (2018) to train the segment model. We then perform pseudo-labeling with the trained segment model to create high-quality word segments. In the second stage, we use these pseudo-labeled segments to train the classifier with the cross-entropy loss:

\mathcal{L} = -\sum_{i=1}^{N} \sum_{j=1}^{T_i} \log p_{ij}[y_{ij}],

where y_{ij} is the pseudo-label of the j-th token in the i-th sequence and p_{ij}[y_{ij}] is the probability the classifier assigns to that label. We adopt the tagging schema with binary tagging labels (k = 2), where "1" represents "segment from the next character" and "0" indicates "do not segment." The results with other tagging schemas are included in Appendix A.1.
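For illustration, a pseudo-labeled segmentation can be mapped to this binary schema as sketched below; the helper is hypothetical, and the released code may follow a different convention.

def segments_to_tags(segments):
    # "1": the segment ends at this character (the next character starts a new word); "0": do not segment.
    tags = []
    for seg in segments:
        tags.extend([0] * (len(seg) - 1) + [1])
    return tags

print(segments_to_tags(["我们", "喜欢", "自行车"]))
# [0, 1, 0, 1, 0, 0, 1]

The resulting tag sequences play the role of y_{ij} in the cross-entropy loss above.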
Our goal is to provide the pre-trained classifier with pseudo-labels as training data, allowing it to leverage the knowledge acquired during pre-training to further improve its performance on Chinese word segmentation.

Implementation Details
Following Sun and Deng (2018), we replace continuous English characters with the special token <eng>, digits with the token <num>, and punctuation marks with the token <punc>. We use a CBOW model (Mikolov et al., 2013) pre-trained on the Chinese Gigaword Corpus (LDC2011T13) to acquire word representations for the segment model. The encoder and decoder of the segment model are a one-layer LSTM and a two-layer LSTM, respectively. For the classifier, we use the pre-trained Chinese BERT-base model. We use Adam (Kingma and Ba, 2015) with a learning rate of 5e-3 for the segment model and learning rates in [1e-5, 5e-5] for the classifier. We train the segment model for 6,000 steps on each dataset, including an initial 800 steps of linear learning-rate warm-up. The classifier is trained for 1,000 steps on each task with early stopping. More training details and the training progress of our approach without early stopping can be found in Appendices A.2 and A.3.
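The normalization step described above can be approximated with simple regular expressions, as in the sketch below; the exact character classes used by Sun and Deng (2018) may differ.

import re

def normalize(text):
    # Collapse runs of Latin letters, digits, and punctuation into special tokens.
    text = re.sub(r"[A-Za-z]+", "<eng>", text)
    text = re.sub(r"[0-9]+", "<num>", text)
    text = re.sub(r"[，。！？、；：,.!?;:]+", "<punc>", text)
    return text

print(normalize("我在2023年读了NLP论文。"))
# 我在<num>年读了<eng>论文<punc>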

Results
Table 1 shows the results for UCWS on the eight datasets. Our baselines include SLM-4 (Sun and Deng, 2018), MSLM (Downey et al., 2022), and BERT-circular (Li et al., 2022). We demonstrate that the proposed approach outperforms the baselines for UCWS. Although our approach is slightly worse than the existing SOTA method (Li et al., 2022) on the MSR dataset, we observe substantial performance gains on the remaining seven Chinese word segmentation tasks.
Note that the methods we compare in Table 1 do not report complete results on all eight tasks in their original papers, so the remaining scores were obtained with our own implementations. In Appendix A.4, we compare our re-implementation results with the baseline methods on the datasets presented in their original papers.

Training Speed Comparison
As previously mentioned, our work aims to simplify the framework of BERT-circular (Li et al., 2022) to reduce the training time for UCWS. Here, we compare the training speed of the proposed method and the baselines using a single RTX 3090 GPU. Table 2 shows that the proposed method takes only 20% of the training time of BERT-circular while performing better on the PKU dataset.
In addition, our approach is better than MSLM (Downey et al., 2022) in both training speed and model performance. However, compared to SLM (Sun and Deng, 2018), our method needs a slightly longer training time due to the training of the classifier with pseudo-labeling. Li et al. (2022) include self-training in their framework to improve model performance, so we also examine whether our training framework benefits from the self-training technique. We first follow the proposed approach to train the classifier, and then iteratively train the segment model and the classifier with the pseudo-labels produced by each of the two modules until early stopping. The results show that self-training (ST) brings only a subtle improvement to our proposed framework. We argue that a filtering strategy for low-confidence examples should be combined with self-training, which we will study in future work.

Segmentation Examples
Table 4 provides three examples of CWS. Across these examples, both SLM and BERT-circular exhibit a mixture of correct and incorrect word segmentation results. For instance, in the first example, "自行车" (bicycle) is incorrectly segmented as "自行/车" (self / bicycle) by SLM, while BERT-circular segments it correctly. Conversely, "俱乐部" (club) is wrongly segmented by BERT-circular, while SLM is correct. Notably, our model excels at correctly segmenting proper nouns, as seen with "特拉维夫" (Tel Aviv), where both SLM and BERT-circular falter. However, our model does tend to encounter challenges with complex terms, such as the combination of a proper noun and a common noun, exemplified by "哈尔滨" + "市" (Harbin city). Despite this, our method is able to leverage the insights from both SLM and BERT, achieving accurate segmentation in numerous instances. See Table 9 in the Appendix for more examples.
We also discover that BERT-circular shows unsatisfactory results when segmenting words longer than two characters, such as "俱乐部" (club) and "奥运功臣" (Olympic hero). Therefore, we analyze the relationship between performance and segmentation length in the next section.

Comparison of Model Performance on Different Segmentation Lengths
Figure 2 compares performance across different segmentation lengths on the PKU dataset for 1-gram, 2-gram, 3-gram, and 4-gram segments. Although BERT-circular (Li et al., 2022) uses the pre-trained BERT model (Devlin et al., 2018) in its framework, it fails to provide satisfying results for longer segments (3-gram and 4-gram). In contrast, our method performs well in most settings and shows competitive results for 4-gram segmentations compared with SLM (Sun and Deng, 2018). The behavior of BERT-circular may stem from Perturbed Masking (Wu et al., 2020), which measures the relationship between every pair of adjacent tokens in left-to-right order and relies on the similarity between the two representations. The segmentation results of BERT-circular may therefore become less accurate when similarities must be computed multiple times over longer segments.
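The per-length breakdown in Figure 2 can be approximated by measuring, for each gold word length, how many gold spans are exactly recovered. The sketch below is our rough proxy for that analysis, not the authors' evaluation script.

from collections import defaultdict

def recall_by_length(gold_sents, pred_sents):
    # gold_sents / pred_sents: lists of word lists, one list per sentence.
    def spans(words):
        out, pos = set(), 0
        for w in words:
            out.add((pos, pos + len(w)))
            pos += len(w)
        return out

    hit, total = defaultdict(int), defaultdict(int)
    for gold, pred in zip(gold_sents, pred_sents):
        pred_spans = spans(pred)
        for s, e in spans(gold):
            total[e - s] += 1
            hit[e - s] += (s, e) in pred_spans
    return {n: hit[n] / total[n] for n in sorted(total)}

print(recall_by_length([["我们", "喜欢", "自行车"]], [["我们", "喜欢", "自行", "车"]]))
# {2: 1.0, 3: 0.0}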

Related Work
Sun and Deng (2018) proposed the Segmental Language Model (SLM), an LSTM-based language model for modeling Chinese fragments without labels. To perform unsupervised Chinese word segmentation, they leveraged dynamic programming to find the optimal segmentations based on the word probabilities of the LSTM-based language model. Following the invention of the self-attention mechanism (Vaswani et al., 2017), Downey et al. (2022) carefully designed a masking strategy to replace the LSTM-based architecture of the SLM with a Transformer (Vaswani et al., 2017). Wang and Zheng (2022) integrated forward and backward segmentations instead of using only forward information as in previous work.
In terms of using pre-trained semantic knowledge, Wu et al. (2020) developed Perturbed Masking, a probing technique that assesses the relations between tokens in a sentence using the masked language model of BERT (Devlin et al., 2018). Building on this probing approach, Li et al. (2022) proposed a self-training scheme in which a classifier learns to divide word units from the perturbed segmentation, achieving SOTA performance for UCWS.

Conclusion
This work presents an improved training framework for unsupervised Chinese word segmentation (UCWS). Our framework leverages a pseudo-labeling approach to bridge the two-stage training of the LSTM-based segment model and the pre-trained Chinese BERT classifier. The experiments show that the proposed framework outperforms previous UCWS methods. In addition, without using Perturbed Masking (Wu et al., 2020) or self-training (Li et al., 2022), our framework significantly reduces training time compared with the SOTA approach (Li et al., 2022). Our code is available at https://github.com/IKMLab/ImprovedUCWS-KnowledgeTransfer.

Limitations
Our segment model (Sun and Deng, 2018) requires pre-trained word embeddings (Mikolov et al., 2013) to initialize its embedding layer. Random initialization of the embedding layer might lead to slow convergence of the segment model.
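As an illustration of this initialization step (assuming word2vec-format CBOW vectors loadable with gensim; the actual Gigaword embeddings may be distributed in a different format), the embedding layer can be filled as follows.

import torch
from gensim.models import KeyedVectors

def init_embedding(vocab, vectors_path):
    # Copy pre-trained CBOW vectors into an nn.Embedding; tokens not found in the
    # vector file (e.g., <eos>, <eng>, <num>, <punc>) keep their random initialization.
    kv = KeyedVectors.load_word2vec_format(vectors_path)
    emb = torch.nn.Embedding(len(vocab), kv.vector_size)
    with torch.no_grad():
        for idx, token in enumerate(vocab):
            if token in kv:
                emb.weight[idx] = torch.tensor(kv[token])
    return emb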

A Appendix
A.1 Additional results for different tagging schemas
We show two additional tagging schemas, BI (beginning, inside) and BMES (beginning, middle, end, single), for evaluating the performance of the proposed method in Table 5. The results show only minor differences between the "01" tagging schema and the other two across the eight unsupervised Chinese word segmentation (UCWS) tasks.
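For reference, the BMES schema can be produced from the same word segments as in the hypothetical helper below, mirroring the binary conversion sketched in the Training Framework section; the BI schema simply marks the first character of each word as B and all remaining characters as I.

def segments_to_bmes(segments):
    # B = begin, M = middle, E = end, S = single-character word.
    tags = []
    for seg in segments:
        if len(seg) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(seg) - 2) + ["E"])
    return tags

print(segments_to_bmes(["我们", "喜欢", "自行车"]))
# ['B', 'E', 'B', 'E', 'B', 'M', 'E']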

A.2 Training Details
Table 6 shows the train and test set splits. In our training framework, we use the train set of each CWS dataset for unsupervised training of the segment model. Afterward, we run the fixed segment model on each example in the test set to acquire segmentation results as pseudo-labels for training the BERT-based classifier. For UCWS evaluation, we use the trained classifier to predict the examples in the test sets, following the approach of Li et al. (2022).

A.3 Performance without Early Stopping
The line chart in Figure 3 illustrates the training progress of our framework in the second stage without early stopping. The curves in Figure 3 show an initial upward trend, followed by a decrease before stabilizing. In the second stage, we train the BERT-based classifier using pseudo-labels. We believe that BERT's pre-trained knowledge (Devlin et al., 2018) improves word segmentation quality in the initial training phase. Therefore, we include an early stopping mechanism to preserve this improvement. Without early stopping, the BERT-based classifier would be drawn toward the distribution of the pseudo-labels generated by the SLM (Sun and Deng, 2018).

A.4 Re-implementation Results
We evaluate the proposed method on the eight CWS datasets (Table 1). However, not all of them are reported in the original publications of the compared baseline methods. In order to validate the re-implementation results, Table 7 compares the scores obtained by our implementations on the datasets presented in the original publications. According to the results, our scores for SLM (Sun and Deng, 2018) and MSLM (Downey et al., 2022) are close to those reported in their original papers.
However, we could not reproduce the scores reported in the BERT-circular paper (Li et al., 2022) using their publicly released code and hyperparameters. We discovered that the results of BERT-circular are not deterministic due to the lack of randomness control in their code. Nevertheless, as stated among the main contributions of their paper (Li et al., 2022), BERT-circular performs much better than SLM (Sun and Deng, 2018), which is consistent with our results and can also be observed in Tables 1 and 7. Additional experiments can be found in Table 8, which shows five runs of each approach on every CWS dataset.

Figure 1: The proposed training framework for unsupervised Chinese word segmentation.

Figure 2: Comparison of model performance in F1-score on different segmentation lengths using the PKU dataset. The x-axis shows the gold segmentation lengths, with the proportion of each length denoted in parentheses.

Figure 3: Training progress of the BERT-based classifier in the second stage without the use of early stopping.

Table 2: Training time comparison (in minutes) on the PKU dataset. The underlined scores are taken from the original papers.

Table 4: Segmentation examples from the PKU dataset.

Table 5: Performance in F1-score (%) on the eight datasets using different tagging schemas.

Table 6: Number of examples in each Chinese word segmentation dataset.

Table 7: Comparison of F1-score (%) on the datasets included in the original publications of the baseline methods. Our re-implementations are marked with an asterisk (*); each reported score is the best of 5 runs.