Enhancing Chinese Word Segmentation via Pseudo Labels for Practicability

Pre-trained language models (e.g., BERT) significantly alleviate two traditional challenging problems for Chinese word segmentation (CWS): segmentation ambiguity and out-of-vocabulary (OOV) words. However, such improvements are usually achieved on traditional benchmark datasets and do not get close to an important goal of CWS: practicability (i.e., low complexity as a standalone task and high beneficiality to downstream tasks). To make a trade-off between traditional evaluation and practicability for CWS, we propose a semi-supervised neural method via pseudo labels. The method consists of a teacher model and a student model, and distills knowledge from the teacher model to the student model via unlabeled data so as to improve both in-domain and out-of-domain CWS. Experiments show that our proposed method not only keeps the practicability of the lightweight student model but also improves segmentation performance effectively. We also evaluate a range of heterogeneous neural architectures for CWS on downstream Chinese NLP tasks. Results of further experiments demonstrate that our proposed segmenter is reliable and practical as a pre-processing step of downstream NLP tasks at minimal cost.


Introduction
Natural language processing (NLP) tasks often leverage word-level features to exploit lexical knowledge. Segmenting a sentence into a sequence of words, especially for languages without explicit word boundaries (e.g., Chinese) not only extracts lexical features, but also shortens the length of the sentence to be processed. Thus, word segmentation, detecting word boundaries, is a crucial pre-processing task for many NLP tasks. In this aspect, Chinese word segmentation (CWS) is widely acknowledged as an essential task for Chinese NLP.
CWS has made substantial progress in recent studies on several benchmarks, as reported by Huang and Zhao (2007) and Zhao et al. (2019). In particular, pre-trained language models (PLMs), like BERT (Devlin et al., 2019), have established a new state of the art in sequence labeling (Meng et al., 2019). Various fine-tuning methods have been proposed to improve the performance of in-domain and cross-domain CWS based on PLMs (Tian et al., 2020). The two challenging problems in CWS, segmentation ambiguity and out-of-vocabulary (OOV) words, have been significantly mitigated by PLM-based methods that are fine-tuned on large-scale annotated CWS corpora; such methods are even reaching human performance on benchmarks. Nevertheless, CWS is more valuable as a prelude to downstream NLP tasks than as a standalone task. Intrinsic evaluation of CWS on benchmark datasets only examines the effectiveness of current neural methods on word boundary detection. To better apply CWS in downstream NLP tasks, we should comprehensively re-think CWS from the perspective of practicability. In this paper, we define the practicability of CWS with two aspects: low complexity as a standalone task and high beneficiality to downstream tasks.
The complexity is twofold: 1) complexity of implementation and 2) time and space complexity of the CWS algorithm. Previous neural methods usually require additional resources (Zhou et al., 2017; Ma et al., 2018; Zhang et al., 2018b; Zhao et al., 2018; Qiu et al., 2020), such as external pre-trained embeddings. The complexity of implementation is reflected in the difficulty of acquiring external resources, which vary in quality and in the time needed to compute them. For example, it is time-consuming to obtain effective pre-trained embeddings as they are trained on a huge amount of data. Generally, it is difficult for many previous neural methods to maintain high CWS performance in a low-resource environment. Neural methods with external resources achieve high CWS performance, but at the cost of a high complexity of implementation. On the other hand, for training and inference, PLM-based CWS methods also consume large amounts of memory to store the huge number of parameters of their models, and their inference speed is usually slow. The huge memory consumption and slow inference prevent PLM-based CWS models from being deployed on small-scale smart devices. Moreover, as CWS is often used together with downstream models, this further weakens applicability on smart devices, where CWS is not supposed to incur much overhead.
The second aspect is the beneficiality to downstream tasks. CWS is rarely used as a standalone task in industry. Existing CWS evaluations only rely on benchmarks and analyze the behavior of segmentation methods in a static scenario. Some well-known benchmarks are quite old (e.g., Bakeoff-2005) and no longer challenging for neural CWS. Such evaluations are intrinsic and are not associated with downstream NLP tasks. High CWS performance (e.g., Precision and F1) does not mean that segmentation results are beneficial to downstream processing. Additionally, benchmark datasets contain plenty of segmentation noise that affects CWS training and evaluation. For instance, the structure "副" (vice) + X, where X denotes any job title, e.g., "总统" (president) or "经理" (manager), is segmented as two words, "副" (vice) and X, in the training data and never unified as a single word, yet "副校长" (vice-president) appears as one word in the test data. There are also many obvious errors due to annotation inconsistency in the data. In one benchmark dataset, we found that the word "操作系统" (operating system) is regarded as two words, "操作" (operate) + "系统" (system), 6 times, and appears as one word 14 times. Therefore, to measure and improve the beneficiality of CWS to downstream tasks, intrinsic evaluations on CWS benchmark datasets are not sufficient; we should also perform extrinsic evaluations with downstream tasks.
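Annotation-consistency statistics of this kind can be gathered with a short script. The following is a minimal Python sketch, assuming the benchmark training file is in the usual space-delimited Bakeoff-2005 format; the file name in the usage comment is illustrative, not part of the released data.

```python
def count_consistency(path, whole, parts):
    """Count how often `whole` appears as a single token versus as the
    adjacent token sequence `parts` in a space-delimited CWS corpus."""
    as_whole, as_parts, k = 0, 0, len(parts)
    with open(path, encoding="utf-8") as f:
        for line in f:
            toks = line.split()
            as_whole += toks.count(whole)
            as_parts += sum(1 for i in range(len(toks) - k + 1)
                            if toks[i:i + k] == parts)
    return as_whole, as_parts

# Hypothetical usage on a Bakeoff-style training file:
# whole, split = count_consistency("msr_training.utf8", "操作系统", ["操作", "系统"])
# print(f"as one word: {whole}, split into two words: {split}")
```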
To address the aforementioned practicability issue of CWS, we propose a semi-supervised neural method via pseudo labels. The method consists of two parts: a teacher model and a student model. First, we use a fine-tuned CWS model trained on the annotated CWS data as the teacher model, which achieves competitive performance from the traditional perspective of CWS. Then we collect massive unlabeled data and distill knowledge from the teacher model to the student model by generating pseudo labels. We filter out noisy pseudo labels to provide reliable knowledge for training the student model. The unlabeled data is easier to obtain than other external resources (e.g., lexicons and pre-trained embeddings) and can be updated at any time at low cost. We then use the lightweight student model for inference, which significantly reduces memory consumption and inference time. The practicability of our proposed method is thus competitive.
To sum up, the contributions of this work are as follows:

• Our proposed method distills knowledge from the teacher model via unlabeled data to coach the lightweight student model. The proposed method achieves a noticeable improvement over strong baselines for CWS under the traditional intrinsic evaluation.
• The lightweight student can be deployed on a small-scale device, even in a non-GPU environment. We abandon the PLM neural architectures (teacher model) during decoding. The speed of decoding is thus fast for practical application. Our method reduces the complexity of implementation, inference time, and memory consumption.
• We empirically investigate the effectiveness of the proposed method on downstream Chinese NLP tasks and analyze the impact of segmentation results on them via extrinsic evaluations.

Related Work
Since Xue (2003) formalizes CWS as a sequence labeling problem, many traditional statistical methods have achieved high performance for CWS on several benchmarks (Emerson, 2005). According to Huang and Zhao (2007) and Zhao et al. (2019), CRF-based models (Tseng et al., 2005; Zhao and Kit, 2008; Zhao et al., 2010; Sun et al., 2012; Zhang et al., 2013) and neural methods (Zheng et al., 2013; Pei et al., 2014; Chen et al., 2015; Cai and Zhao, 2016; Cai et al., 2017) have been reported to outperform traditional methods with high F1 scores (0.95-0.97). In particular, Long Short-Term Memory networks (LSTM) have been the main backbone networks in these methods (Ma et al., 2018). Besides LSTM, self-attention networks have also been employed for CWS (Duan and Zhao, 2020). The OOV problem has been a long-standing challenge for CWS, and it is particularly serious in the cross-domain scenario. To alleviate this issue, many studies incorporate a variety of pre-trained word embeddings and external resources into CWS models (Zhou et al., 2017; Zhang et al., 2018a,b). Recently, with the development of PLMs (Devlin et al., 2018), fine-tuning methods benefit from a huge amount of pre-trained knowledge for alleviating the OOV problem in CWS (Meng et al., 2019; Tian et al., 2020; Qiu et al., 2020). Such methods are nearly reaching human-level performance.
Nevertheless, external resources and PLMs result in additional costs in memory consumption and inference time. Knowledge distillation has been proposed to alleviate this cost issue (Ba and Caruana, 2014; Hinton et al., 2015). Kim and Rush (2016) propose to use knowledge distillation for neural machine translation, while Mukherjee and Awadallah (2019) study several aspects of distillation to match the student model for sentiment classification. Jiao et al. (2020) adopt multiple distillation strategies to minimize the number of parameters of the pre-trained language model. Different from these previous studies, our proposed method utilizes unified pseudo labels to improve the performance of the lightweight model, which can provide a positive influence as a pre-processing step for downstream tasks, compared with previous state-of-the-art methods.

Proposed Framework
Aiming at not only keeping competitive performance on benchmarks but also reducing the complexity of CWS methods, our proposed framework consists of two essential modules: a student model and a teacher model, as shown in Figure 1. There is an obvious performance gap between PLM-based models and lightweight models (Duan and Zhao, 2020), and the OOV issue is the main reason for this gap. Since the teacher model based on fine-tuned PLMs, despite its high complexity, can alleviate the OOV issue effectively, we use a combination of a PLM-based teacher and a lightweight student. First, the teacher model transfers pre-trained knowledge into a specific data distribution by annotating unlabeled data. Then we utilize a huge amount of such annotated data to distill knowledge from the teacher model to the lightweight student model. The pseudo labels provided by the teacher model help the lightweight model to alleviate the OOV issue of CWS.

Teacher Model
Recently, several PLMs (e.g., BERT and RoBERTa) have shown competitive performance on many NLP tasks. In particular, a modified RoBERTa model has been built for Chinese NLP tasks (Cui et al., 2019). Inspired by the previous success of PLM-based models on CWS, we use the RoBERTa-WWM PLM as the teacher model. Normally, PLMs are trained for predicting words in general. To adapt PLMs and transfer their knowledge to CWS, we need to fine-tune them on the annotated CWS data. Let $X$ denote the input, which is converted into a sequence of embeddings. For consistency, two tags ("[CLS]" and "[SEP]") are added to the beginning and end of each sentence, respectively. A linear projection layer with weight $W^{(t)} \in \mathbb{R}^{d_{model} \times N}$ replaces the original output layer, where $d_{model}$ is the dimension of the pre-trained model and $N$ is the number of tags in the CWS annotation scheme ($N = 4$). We convert CWS annotations into a 4-tag scheme $T = \{B, M, E, S\}$ that indicates the Begin, Middle, or End of a word, or a Single character forming a word. After the linear mapping, the teacher model applies Softmax and decodes with greedy search:

$$\hat{y} = \arg\max_{y \in T} \mathrm{Softmax}\left(W^{(t)} h_t(x)\right),$$
where $h_t(x)$ represents the hidden states of the teacher model. Complex decoding algorithms (e.g., CRF or beam search) are abandoned in order to reduce complexity; moreover, they yield only a slight improvement for CWS. Compared with greedy search, CRF increases the time complexity by a factor of $n$, and beam search requires additional search time that grows with the beam size:

$$T_{\mathrm{greedy}} = O(M \cdot n), \qquad T_{\mathrm{CRF}} = O(M \cdot n^{2}), \qquad T_{\mathrm{beam}} = O(M \cdot n \cdot b),$$

where $M$ is a constant representing other factors in the model complexity, $n$ is the length of the sentence, and $b$ is the beam width.
The teacher model is trained to minimize segmentation errors by solving the following optimization problem:

$$\theta_t^{*} = \arg\min_{\theta_t} J_{seg}(X, Y; \theta_t),$$

where the loss function $J_{seg}$ is the character-level cross-entropy over the tag distribution:

$$J_{seg}(X, Y; \theta_t) = -\sum_{i=1}^{n} \log p\left(y_i \mid x_i; \theta_t\right).$$
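As a concrete illustration, the teacher described above can be assembled from standard components. The following is a minimal sketch, assuming PyTorch and the HuggingFace transformers library with the hfl/chinese-roberta-wwm-ext checkpoint (loaded with the BERT model class); the class and function names are illustrative and not the authors' code.

```python
import torch
import torch.nn as nn
from transformers import BertModel

TAGS = ["B", "M", "E", "S"]   # 4-tag set T = {B, M, E, S}

class TeacherSegmenter(nn.Module):
    """RoBERTa-WWM encoder with a linear layer W^(t) over the four CWS tags."""
    def __init__(self, plm_name="hfl/chinese-roberta-wwm-ext"):
        super().__init__()
        self.encoder = BertModel.from_pretrained(plm_name)
        self.proj = nn.Linear(self.encoder.config.hidden_size, len(TAGS))

    def forward(self, input_ids, attention_mask):
        h = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        return self.proj(h)                        # (batch, seq_len, 4) tag logits

def segmentation_loss(logits, gold_tags, mask):
    """Character-level cross-entropy J_seg, ignoring [CLS]/[SEP]/padding positions."""
    loss = nn.functional.cross_entropy(
        logits.view(-1, len(TAGS)), gold_tags.view(-1), reduction="none")
    return (loss * mask.view(-1).float()).sum() / mask.float().sum()

def greedy_decode(logits):
    """Greedy search: arg max over the tag distribution at every position."""
    return logits.argmax(dim=-1)
```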

Student Model
To improve the practicability of CWS, our proposed framework rediscovers the potential of lightweight models. The lightweight model suffers more from the OOV problem than the teacher model does, but it enables us to address the practicability issue of CWS. We propose multiple lightweight models as the student model, as shown in Figure 1.
-ConPrune. This is a pruned PLM model in which three quarters of the PLM's layers are discarded; specifically, we only use the first 3 of the 12 layers. We also incorporate a Convolutional Neural Network (CNN) encoder to capture the local features of the sequence.
-LSTM. LSTM is the most popular architecture for sequence labeling tasks (Ma et al., 2018). As shown in Figure 1, for each input character $c_i$, the corresponding character uni-gram embedding and bi-gram embedding are represented as $e_{c_i}$ and $e_{c_i c_{i+1}}$, respectively. The LSTM model is fed with the concatenation of the two types of character embeddings, $w_i = e_{c_i} \oplus e_{c_i c_{i+1}}$ (see the sketch after this list). The loss function and the decoding are the same as for the teacher model.
-Transformer. The Transformer usually does not work as well as LSTM for sequence labeling tasks, despite its success on other tasks. We propose a new Transformer variant inspired by Duan and Zhao (2020); the modified Transformer utilizes a Gaussian directional mask to encode uni-gram features.
-CRF. Although CRF is not a dominant model for CWS, it still has great significance for practicability. We only utilize uni-gram and bi-gram features for CRF, the same features as the neural methods, for a fair comparison. It does not rely on any auxiliary features, e.g., accessor variety (AV) (Feng et al., 2004) or pointwise mutual information (PMI) (Sun et al., 1998).
All formulations and details of the student models are shown in Appendix A.
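As one concrete instance of a student, here is a minimal sketch of the LSTM variant; the embedding and hidden sizes are assumed values, a bidirectional LSTM is assumed, and vocabulary construction as well as the CRF/Transformer/ConPrune variants are omitted.

```python
import torch
import torch.nn as nn

class LSTMStudent(nn.Module):
    """Character uni-gram and bi-gram embeddings are concatenated
    (w_i = e_{c_i} (+) e_{c_i c_{i+1}}) and fed to an LSTM, followed by a
    linear layer over the 4 CWS tags; decoding is greedy, as for the teacher."""
    def __init__(self, uni_vocab_size, bi_vocab_size,
                 uni_dim=64, bi_dim=32, hidden=256, num_tags=4):
        super().__init__()
        self.uni_emb = nn.Embedding(uni_vocab_size, uni_dim, padding_idx=0)
        self.bi_emb = nn.Embedding(bi_vocab_size, bi_dim, padding_idx=0)
        self.lstm = nn.LSTM(uni_dim + bi_dim, hidden,
                            batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, num_tags)

    def forward(self, uni_ids, bi_ids):
        w = torch.cat([self.uni_emb(uni_ids), self.bi_emb(bi_ids)], dim=-1)
        h, _ = self.lstm(w)
        return self.proj(h)            # (batch, seq_len, num_tags) tag logits
```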

Pseudo Labels
Neural networks typically predict the probability of each class by Softmax. In the standard form of distillation, knowledge is transferred to the distilled model by using a distribution produced by the teacher model with a temperature in its Softmax. However, the architectures of the student models are completely different from the teacher model, as shown in the last section. Unlike previous studies on distilling knowledge, the process of knowledge distillation in our framework is essentially the same as the original CWS task. In particular, our proposed method distills the knowledge from the teacher model to the student model by using a huge amount of unlabeled data as the knowledge container. Such unlabeled data is easy to obtain from the Internet. The pseudo labels are generated together with noisy labels, and we reduce the impact of the noisy labels. Due to the high correlation between the training data and the unlabeled data, we directly distill knowledge from the teacher model to the student model, and the final loss is as follows:

$$J = J_{seg}(x, y; \theta_s) + \alpha \, J_{seg}(x_u, y_u; \theta_s),$$

where $s$ denotes the student model and $\alpha$ is a weight to balance the losses on the labeled and unlabeled data ($\alpha = 0.5$ in our experiments). The loss function consists of two parts: one from the labeled data and the other from the unlabeled data $x_u$. Hard predictions of the teacher model on the unlabeled data produce the noisy pseudo labels $y_u$, and $y^{*}$ denotes the prediction of the student model on the unlabeled data. To reduce redundant computation, pseudo labels are mix-sampled at a regular interval. The sampling strategy chooses sentences whose n-gram features differ from those of the annotated data, which makes the distribution of the unlabeled sentences different from the annotated data. Instead of optimizing the loss function jointly, we adopt a two-stage optimization method. The first stage trains the student models on the large-scale annotated data. In the second stage, the student model continues to be trained on the data with labels predicted by the teacher model. Since the teacher model is also fine-tuned on the annotated data, the two-stage training does not suffer from the catastrophic forgetting issue.
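A minimal sketch of the pseudo-labeling and two-stage training loop described above follows. The confidence-threshold filter is only a simple stand-in for the paper's noise filtering and mix-sampling, and all names, thresholds, and the batch layout (each unlabeled batch is assumed to carry both teacher-side and student-side encodings of the same sentences, with position alignment handled in the data loader) are assumptions.

```python
import torch

ALPHA = 0.5   # weight balancing labeled and unlabeled losses (alpha in the paper)

@torch.no_grad()
def generate_pseudo_labels(teacher, unlabeled_loader, min_conf=0.9):
    """Hard teacher predictions on unlabeled sentences; sentences whose mean
    tag confidence is low are discarded (a stand-in for noise filtering)."""
    teacher.eval()
    pseudo = []
    for batch in unlabeled_loader:
        logits = teacher(batch["input_ids"], batch["attention_mask"])
        conf, tags = logits.softmax(dim=-1).max(dim=-1)
        if conf[batch["attention_mask"].bool()].mean() >= min_conf:
            pseudo.append((batch, tags))
    return pseudo

def train_two_stage(student, labeled_loader, pseudo_data, optimizer, loss_fn):
    # Stage 1: supervised training on the annotated CWS data.
    for batch in labeled_loader:
        optimizer.zero_grad()
        logits = student(batch["uni_ids"], batch["bi_ids"])
        loss_fn(logits, batch["tags"], batch["mask"]).backward()
        optimizer.step()
    # Stage 2: continue training on teacher-annotated (pseudo-labeled) data,
    # down-weighted by ALPHA.
    for batch, pseudo_tags in pseudo_data:
        optimizer.zero_grad()
        logits = student(batch["uni_ids"], batch["bi_ids"])
        (ALPHA * loss_fn(logits, pseudo_tags, batch["mask"])).backward()
        optimizer.step()
```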

Datasets and Settings
To examine the advantage of distilling knowledge and the complexity of our proposed framework via pseudo labels, we conducted experiments on five benchmarks (Bakeoff-2005 and CTB6). The statistics of the benchmarks are shown in Table 1. We randomly picked 10% of the sentences from the training data as the development data for tuning. The unlabeled data were collected from the People's Daily website, where we crawled 5,000 articles. For consistency, we pre-processed the unsegmented sentences in a manner similar to previous work (Cai et al., 2017). In addition, to empirically validate the beneficiality of the proposed CWS method to downstream tasks, we carried out comprehensive experiments on named entity recognition (NER) and machine translation (MT). The details of the datasets for these two tasks are also shown in Table 1. We used F1 as the evaluation metric for NER and BLEU (Papineni et al., 2002) for MT.

Table 1: Statistics of the CWS benchmarks.

          MSR              PKU              AS               CITYU            CTB6
          TRAIN    TEST    TRAIN    TEST    TRAIN    TEST    TRAIN    TEST    TRAIN    TEST
# CHAR    4,050K   184K    1,826K   173K    8,368K   198K    2,403K   68K     1,156K   134K
# WORD    2,368K   107K    1,110K   104K    5,500K   123K    1,456K   41K     701K     82K
To fine-tune the teacher model (i.e., RoBERTa-WWM), we adjusted a few crucial hyperparameters, as shown in Table 2. The hyperparameters of the student models were tuned on the development sets. We evaluated the inference speed of all models on the same hardware configuration (non-GPU environment: Intel(R) Core(TM) i9-10900KF CPU @ 3.70GHz; GPU environment: Nvidia GeForce RTX 3090). All other hyperparameters and search ranges are shown in Appendix B.
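One simple way to obtain comparable CPU and GPU inference speeds is a throughput measurement such as the sketch below; the exact protocol used in the paper is not specified, so the warm-up and counting choices here are assumptions.

```python
import time
import torch

@torch.no_grad()
def throughput(model, batches, device="cpu", warmup=5):
    """Rough segmentation throughput (characters per second) on `device`."""
    model.to(device).eval()
    for batch in batches[:warmup]:                        # warm-up passes
        model(*(t.to(device) for t in batch["inputs"]))
    n_chars, start = 0, time.perf_counter()
    for batch in batches:
        model(*(t.to(device) for t in batch["inputs"]))
        n_chars += int(batch["mask"].sum())
    if device.startswith("cuda"):
        torch.cuda.synchronize()                          # wait for queued GPU work
    return n_chars / (time.perf_counter() - start)
```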

Results of Intrinsic Evaluation
As shown in Table 3, we investigated the effect of the proposed method on the benchmark Bakeoff-2005, which is the most widely-used dataset for CWS. "TEACHER" denotes the teacher model as introduced in section 3.1. It achieves competitive performance as it is based on a state-of-the-art pretrained model. Other models are the student models.
Experimental results in Table 4 show that our proposed semi-supervised method significantly improves the performance on all 5 benchmark datasets, compared with the pure student models. Surprisingly, the results of the proposed semi-supervised method are even close to those of the teacher model. We also compared our proposed method against previous state-of-the-art models, as shown in Table 4. In particular, Tian et al. (2020) utilize PLMs, which are slower in inference than non-PLM CWS models; this paper focuses on methods with low complexity. These results demonstrate that our proposed method achieves state-of-the-art performance among non-PLM methods. Although there is a small performance gap between our proposed method and fine-tuned PLM methods, the advantage of our method over PLMs is that it is much faster in both CPU and GPU environments, as displayed in Table 3, which is the key interest of our work. From this perspective, our method can be more readily used in downstream tasks than previous state-of-the-art PLM methods. In addition, our proposed method not only maintains the advantages of the basic neural methods but also has low complexity for practicability. Meanwhile, the method leverages easily available unlabeled data to make up for the insufficiency of the student model.

Results of Extrinsic Evaluation
The performance of models on various CWS benchmarks only demonstrates the merits of the models themselves. However, CWS results of different methods that achieve good performance on benchmarks are not necessarily beneficial for specific downstream tasks. We therefore investigated the effect of using different CWS results on two popular downstream Chinese NLP tasks (NER and MT) to analyze the beneficiality of CWS methods to other tasks. The benchmarks we adopted for these two tasks are both widely acknowledged in the literature of NER and MT. In particular, we used the "PKU" open resources for NER evaluation and a Chinese-to-English machine translation task from WMT-18 (http://data.statmt.org/wmt18/translation-task/) for MT evaluation. The model for NER employs a word-based LSTM to extract context information and applies a CRF layer stacked over the LSTM for decoding. It utilizes randomly initialized word-level embeddings which are further fine-tuned during training. The evaluation metric of this task is the same as for CWS (F1). The MT model is based on the Transformer (Vaswani et al., 2017) neural network. We used Byte Pair Encoding (BPE) to alleviate the issue of rare words. We kept all other hyper-parameters of the NER and MT models as those widely used. We then fed the CWS results produced by different models into the NER and MT models. Results are shown in Table 5.

Table 5: Results on NER and MT. "NR", "NP", and "NT" represent entities of person, place, and organization. "GOLD" denotes gold-standard word segmentations for NER and "CHAR" indicates the character-level neural model based on the Transformer for baseline comparison. † indicates that the corresponding model utilizes the proposed semi-supervised method.

Clearly, our proposed method can provide word segmentations that are beneficial to the two downstream tasks. The performance of NER using segmentation results yielded by our proposed method is better than with the others, even with ground-truth word segmentations. All segmentation systems achieve good performance with no evident OOV problems. However, there are still some distinctions between the two CWS methods, which will be analyzed in the case study section. Beyond the quality of word segmentations, the speed of our proposed method is fast enough to support specific downstream tasks. Surprisingly, we find that word segmentations with high F1 scores on CWS benchmarks do not necessarily lead to high performance on downstream tasks. In particular, the best segmentation results ("Seg. train" and "Seg. test") do not yield the highest performance on NER ("NR", "NP" and "NT"), as shown in Table 5. This might be due to two reasons. First, the gold labels in NER contain noise, similar to CWS, and our proposed method is robust to such noisy labels. Second, word segmentation errors do not necessarily cause error propagation.

Case Study
To make further progress on CWS, it is important to understand errors that CWS methods are making. Hence we randomly selected typical errors from the PKU test set and manually analyzed them.
The segmentation errors can be roughly divided into two categories. The first category is errors on OOV words. The proposed semi-supervised method can alleviate the issue of OOV words effectively. For instance, "威尔第" (Verdi) is incorrectly segmented into two words by the pure student model. This word occurs frequently in the unlabeled data, and the corresponding knowledge is distilled from the teacher model, so the semi-supervised method can correct such OOV errors.
Apart from OOV errors, the remaining errors are mainly caused by segmentation inconsistency. For example, the word "人" (person) should be segmented off as a suffix after some words, e.g., "中国+人" (Chinese) and "代理+人" (agent), yet "人" (person) also exists as part of other words, e.g., "关系人" (related party) and "继承人" (heir). Simply training a neural model on such inconsistent segmentation data may be insufficient to resolve these segmentation errors without further effort in data processing. This situation naturally raises a question: do the errors caused by segmentation inconsistency really influence the performance of downstream NLP tasks?
To answer this question, we conducted additional experiments on the two downstream NLP tasks. In NER, segmentation results of non-entity words hardly affect the performance of NER. For instance, the phrase "不懈奋斗" (untiring struggle) is regarded as a single word according to the "PKU" criterion. Previous state-of-the-art methods that achieve high F1 scores for CWS segment it correctly, while our proposed method splits this unit into two independent words, "不懈" (untiring) and "奋斗" (struggle). However, these two words do not belong to any entities. In other words, better performance in segmenting non-entity words does not necessarily indicate better NER performance. In addition, segmentation results of entity words directly affect the veracity of NER. Consider the phrase "西方七国集团" (the Group of Seven, abbreviated G7) in a sentence; this segment is an organizational entity. According to the ground-truth segmentation, it is regarded as two words, "西方" (western) and "七国集团" (the group of seven countries). Previous state-of-the-art methods are usually able to segment it correctly. By contrast, our proposed method segments it into three words, "西方" (western), "七国" (seven countries), and "集团" (group). Surprisingly, the entity whose words are incorrectly segmented by our method is correctly recognized, and the gold segmentation does not achieve a better result on this entity. The vague boundary of a word may increase the uncertainty and difficulty of downstream Chinese NLP tasks. There are many prefix and suffix words in Chinese, and it is sometimes hard to determine whether they form a single word or not. For this reason, high CWS performance is not equal to high performance on Chinese NER.
In MT, thanks to BPE, rare words are segmented into sub-words, so the issue of unknown words can be alleviated effectively. Even if a word as simple as "日本" (Japan) is incorrectly segmented into two words, NMT models are able to prevent the propagation of segmentation errors during training. Thus, a faster segmentation system, rather than a higher-performance one, is more practical for NMT. To analyze how NMT translations differ between two versions of a sentence with different segmentations, we provide additional analysis in Appendix C. We find that slight differences in segmentation can change the translation results, and word boundaries may cause these differences. But we also conjecture that this is more due to the robustness of NMT models.

Conclusion
To bring a positive impact of CWS to downstream NLP tasks, this paper makes a trade-off between the traditional evaluation and the complexity (e.g., implementation and decoding speed), which makes the segmenter more practical. We propose a semi-supervised method that distills knowledge via pseudo labels into lightweight student models. The method is loosely coupled and significantly improves the performance of multiple heterogeneous tiny neural architectures. The proposed framework achieves competitive performance on CWS benchmarks, and the speed of the student model also satisfies practical requirements. In summary, the advantages of our model are twofold. First, the inference speed of the method is much faster than that of PLM methods, and it can run under low-resource conditions, even on CPUs. Second, the model provides efficient segmentation results for downstream NLP tasks.

A Student Model Details

Layer normalization is adopted at the end of each multi-head Gaussian directional attention layer of the modified Transformer.

B Hyper-parameter Setting
To improve reproducibility, we list all important hyper-parameters of the teacher model (Table 6), the student models (Tables 7 and 8), the word-based NER model (Table 9), and the NMT model. We randomly pick 10% of the sentences from the training data as the development data for tuning, and we use the original development set of WMT-18 for MT. We use uniform sampling to choose the hyper-parameters. In particular, we follow previous studies (Vaswani et al., 2017; Duan and Zhao, 2020) for the hyper-parameters of the modified Transformer model and the NMT model.

C Case Study
For NMT, it is difficult to analyze translation results as the interpretability of NMT is poor. We therefore start with examples and focus on the differences between two translations with different segmentation results, in addition to sentence-level BLEU scores. The Pinyin sequences of the two Chinese sentences are shown in Figure 2 and the translations are shown in Table 10. We find that slight differences in segmentation lead to varying translation results, and word boundaries may cause these differences. There is also a considerable discrepancy when the neural machine translation system stops training at different steps, which shows that the neural machine translation system is unstable. This uncertainty still poses great challenges for the MT model itself and for other crucial techniques built on it.
Table 10: Translations under different segmentation results (segmentations shown in Pinyin), with sentence-level BLEU scores.

LABEL  MODEL       SEG. RESULT                                                                      MT. RESULT                                                                                      BLEU
I      TEACHER     yi/ ta/ wei/ yi tuo/ ,/ wu zhen da dao/ ke chuang/ ji ju qu/ ying yun er sheng    with it as its base, the koku district of wuzhen boulevard came into being.                    21.79
       CONPRUNE†   yi/ ta/ wei/ yi tuo/ ,/ wu zhen/ da dao/ ke chuang/ ji ju qu/ ying yun er sheng   to rely on it, wuzhen boulevard science set up the agglomeration area emerged at the moment.   70.71