Segment, Mask, and Predict: Augmenting Chinese Word Segmentation with Self-Supervision

Recent state-of-the-art (SOTA) neural network methods and fine-tuning methods based on pre-trained models (PTMs) have been applied to Chinese word segmentation (CWS) and achieve strong results. However, previous works train models on a fixed corpus at every iteration, discarding the valuable intermediate information generated during training. Moreover, the robustness of previous neural methods is limited by the large-scale annotated data, which inevitably contains some noise, and limited effort has been made by previous studies to deal with these problems. In this work, we propose a self-supervised CWS approach with a straightforward and effective architecture. First, we train a word segmentation model and use it to generate segmentation results. Then, we use a revised masked language model (MLM) to evaluate the quality of the segmentation results based on the predictions of the MLM. Finally, we leverage these evaluations to aid the training of the segmenter via improved minimum risk training. Experimental results show that our approach outperforms previous methods on 9 different CWS datasets under both single-criterion and multi-criteria training, and achieves better robustness.


Introduction
In many natural language processing (NLP) scenarios, most tasks operate at the word level. Unlike Western languages such as English, where words are delimited by spaces, written Chinese has no explicit boundary between two words. Chinese word segmentation (CWS) is therefore considered an essential task for accurately representing semantic information in Chinese NLP. Besides, word segmentation shortens the length of a sequence, which is beneficial for deep learning methods in some cases.
Recently, good performance for CWS has already been achieved on large-scale annotated corpora, as reported by related research (Huang and Zhao, 2007; Zhao et al., 2019). Most methods are data-driven attempts to improve CWS performance. For instance, some neural methods incorporate external resources to achieve good performance for in-domain and cross-domain CWS (Zhou et al., 2017). Previous methods fall into two categories: (1) statistical machine learning methods and (2) neural network methods. Among statistical machine learning methods, Conditional Random Fields (CRF) is the most effective model for the sequence labeling problem (Zhao and Kit, 2008). However, the performance of a CRF model depends on the quality of its hand-crafted features. To reduce the reliance on hand-crafted features, neural network methods (Chen et al., 2015b; Cai et al., 2017; Ma et al., 2018) have been widely adopted.
On the other hand, these supervised learning methods are usually limited by the training data. Recent SOTA approaches utilize pre-trained models (PTMs) to improve the quality of CWS (Tian et al., 2020; Huang et al., 2020). However, CWS methods based on PTMs only use the large-scale annotated data to fine-tune the parameters, discarding much of the information generated during training. Besides, the annotated data contains some incorrect labels due to the lexical diversity of Chinese, so the robustness of CWS methods is quite important.
In this work, we propose a self-supervised CWS approach to enhance the performance of the CWS model. In addition, we investigate cross-domain and low-quality datasets to analyze the robustness of CWS models. As depicted in Figure 1, our model consists of two parts: a segmenter and a predictor. We use a Transformer encoder as the word segmenter and a revised masked language model (MLM) as the predictor that improves the segmentation model. We generate masked sequences according to the segmentation results, then use the MLM to predict the masked parts and evaluate the quality of the segmentation based on the quality of the predictions. Finally, we leverage an improved version of minimum risk training (MRT) (Shen et al., 2016) to enhance the segmenter.
Our contributions are as follows: • We propose a self-supervised method for CWS, which uses the predictions of revised MLM to assist the word segmentation model.
• We present an improved version of MRT by adding regularization terms to boost the performance of the word segmentation model.
• Experimental results show that our approach outperforms previous methods with different criteria training, and our proposed method also improves the robustness of the model.

Related Work
Chinese word segmentation (CWS) has been studied for years as an essential Chinese NLP task. CWS methods are divided into two streams: word-based methods and character-based methods. Since Xue (2003) first formalized CWS as a sequence labeling problem, almost all methods convert CWS results into sequence labels. As a sequence labeling task, CRF-based models achieve competitive performance with multiple features (Peng et al., 2004; Tseng et al., 2005; Zhao and Kit, 2008). However, the effect of each method is determined by the quality of its manual features. To reduce the influence of feature engineering, neural CWS methods have been studied and further progress has been made (Zheng et al., 2013; Pei et al., 2014; Chen et al., 2015a,b; Cai and Zhao, 2016; Cai et al., 2017), and neural methods have gradually replaced traditional machine learning methods. Ma et al. (2018) propose a basic LSTM model that is the same as that of Chen et al. (2015b), but achieve SOTA performance by tuning the hyper-parameters. Some studies leverage rich pre-trained embeddings to improve neural CWS (Zhou et al., 2017; Yang et al., 2017, 2019). To alleviate the issue of OOV words, cross-domain CWS has also been studied: one line of work incorporates a domain dictionary into the neural network, and Zhao et al. (2018) utilize unlabeled data to enhance the ability to recognize OOV words. With the development of pre-trained language models (PLMs) (Devlin et al., 2019), CWS methods have made further progress. Previous SOTA methods achieve good performance for CWS (Meng et al., 2019; Huang et al., 2020; Duan and Zhao, 2020), but their gains come mainly from the PLMs rather than the models themselves: the additional components yield only slight improvements compared with the PLM learning paradigm.

Method
The overall process of our method is shown in Algorithm 1. First, we train a word segmentation model and use it to generate segmentation results. Then, according to the segmentation results, masked sentences are generated with certain strategies, and an MLM is trained on the masked sentences. Afterward, we mask the sentences in the training set and predict the masked parts using the MLM to evaluate the quality of the segmentation results. Finally, we use these evaluations to aid the training of the segmentation model.

Segmentation Model
The model architecture is shown in Figure 2. Similar to the architecture of Huang et al. (2020), our segmentation model is based on BERT (Devlin et al., 2019). The input is a sentence with character-based tokenization, and the output is produced by the BERT model followed by a CRF layer. Segmentation results are represented by four tags: B, M, E, and S. B and E denote the beginning and end of a multi-character word, respectively; M denotes the middle of a multi-character word; and S represents a single-character word. Our segmentation model is initialized with a PTM (i.e., BERT) and trained with the negative log-likelihood (NLL) loss.
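The mapping from a segmentation to the B/M/E/S tag sequence can be sketched as follows; the function name is our own, not from the paper's code:

```python
# Minimal sketch: convert a segmentation (a list of words) into the
# character-level B/M/E/S tags consumed by the BERT + CRF segmenter.
def words_to_tags(words):
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")                 # single-character word
        else:
            tags.append("B")                 # beginning of a multi-char word
            tags.extend("M" * (len(w) - 2))  # middle characters, if any
            tags.append("E")                 # end of the word
    return tags
```

For example, `words_to_tags(["中国", "人"])` yields `["B", "E", "S"]`.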

Revised MLM as Predictor
In this work, we use a revised MLM similar to BERT (Devlin et al., 2019) to evaluate the quality of segmentations. However, the masking strategy used when training the Chinese BERT PTM treats a single character as the masking unit and therefore cannot reflect segmentation information. We thus design a new masking strategy that does: 1. Only one character or multiple consecutive characters within the same word can be masked simultaneously.
2. We set a threshold mask_count. If the length of a word is less than or equal to mask_count, the entire word is masked. Otherwise, we randomly choose mask_count consecutive characters within the word and mask them.
3. From all possible maskings, we randomly select one with equal probability and apply it to the input. Table 1 shows an example of the masking strategy introduced above.
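A minimal sketch of the three rules above (function names are our own; spans are character offsets into the concatenated sentence):

```python
import random

def legal_maskings(words, mask_count=2):
    """Enumerate the legal maskings for one segmented sentence: masks never
    cross word boundaries; a word no longer than mask_count is masked whole;
    a longer word contributes every consecutive span of mask_count characters."""
    maskings, start = [], 0
    for w in words:
        n = len(w)
        if n <= mask_count:
            maskings.append((start, start + n))       # mask the whole word
        else:
            for i in range(n - mask_count + 1):       # each consecutive span
                maskings.append((start + i, start + i + mask_count))
        start += n
    return maskings

def apply_random_masking(words, mask_count=2, mask_token="[MASK]"):
    """Rule 3: pick one legal masking uniformly at random and apply it."""
    chars = list("".join(words))
    lo, hi = random.choice(legal_maskings(words, mask_count))
    for i in range(lo, hi):
        chars[i] = mask_token
    return chars
```

With the segmentation ["自然语言"] and mask_count = 2, the legal spans are (0, 2), (1, 3) and (2, 4), each chosen with equal probability.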
When evaluating the quality of a segmentation result, we first find all the legal masked sequences based on the segmentation result. Then, we use the revised MLM to evaluate the prediction quality of all masked words in these inputs, and take the average of all the scores as the quality of the segmentation result:

q(y, x) = 1/|M(x, y)| · Σ_{x_m ∈ M(x, y)} d(x_m, x),    (1)

where x and y denote the input sequence and the tag sequence, respectively, M(x, y) denotes the set of all legal maskings of x when the segmentation result is y, x_m denotes the prediction of the MLM for masking m, and d(·, ·) measures the gap between the prediction and the ground truth. According to Equation (1), a larger value of q(y, x) indicates a larger gap between the prediction and the ground truth, i.e., a worse quality of prediction results.
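Our reading of this averaging step can be sketched as follows; `mask_error` is a hypothetical callback standing in for the MLM's per-span prediction error, and the span enumeration follows the masking rules of the previous subsection:

```python
def segmentation_quality(words, mask_error, mask_count=2):
    """Illustrative sketch of Equation (1): average the MLM's prediction
    error over every legal masking of the sentence. mask_error(lo, hi) is a
    hypothetical callback that masks characters [lo, hi) and returns the
    MLM's error on that span (larger = worse prediction)."""
    spans, start = [], 0
    for w in words:
        n = len(w)
        if n <= mask_count:
            spans.append((start, start + n))          # whole short word
        else:
            spans.extend((start + i, start + i + mask_count)
                         for i in range(n - mask_count + 1))
        start += n
    # q(y, x): mean error over all legal maskings M(x, y)
    return sum(mask_error(lo, hi) for lo, hi in spans) / len(spans)
```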

Training Procedure with Improved MRT
After we train the segmentation model with the NLL loss, we further train it using MRT (Shen et al., 2016). Specifically, on the training data X, we optimize the expected risk

Table 2: Example of an abnormal phenomenon in the MRT loss without regularization. "Cand.", P and Q denote "Candidate", P(·|x; ·) and Q(·|x; ·, α), respectively.

R(θ) = Σ_{x ∈ X} Σ_{y ∈ Y(x)} P(y|x; θ) q(y, x),    (2)
where θ is the parameter of the segmentation model, and Y(x) is the set of all possible word segmentation results of x. However, due to the large number of possible segmentation results, the computational cost of Equation (2) is unacceptably large. Therefore, we sample a subset S(x) from Y(x) and define a new probability distribution Q on S(x):

Q(y|x; θ, α) = P(y|x; θ)^α / Σ_{y' ∈ S(x)} P(y'|x; θ)^α,    (3)

where α is a parameter that controls the sharpness of Q. We then calculate the approximation of Equation (2) on Q:

L(θ) = Σ_{x ∈ X} Σ_{y ∈ S(x)} Q(y|x; θ, α) q(y, x).    (4)

Additionally, the loss defined in Equation (4) can only provide a weak supervision signal: when the denominator of Equation (3) becomes smaller, the loss can be rather low even if the value of P(y|x; θ) is very small (see Table 2). This may decrease the probability of some good segmentation results, thereby reducing the performance of the segmentation model. Therefore, we modify the loss in Equation (4) by adding a regularization term that mitigates the impact of a shrinking denominator of Q(y|x; θ, α):

L'(θ) = L(θ) − λ Σ_{x ∈ X} log Σ_{y ∈ S(x)} P(y|x; θ),    (5)

where the hyper-parameter λ adjusts the weight of the regularization term.
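The improved MRT objective for a single sentence can be sketched as below. This is an illustrative reading of Equations (3)-(5), not the paper's implementation: the regularization term here penalizes a shrinking denominator of Q by rewarding total probability mass on the sampled set S(x).

```python
import math

def improved_mrt_loss(probs, risks, alpha=0.5, lam=0.1):
    """Sketch of the improved MRT loss for one sentence.

    probs: model probabilities P(y|x) for the sampled segmentations S(x)
    risks: quality scores q(y, x) from the revised MLM (larger = worse)
    """
    denom = sum(p ** alpha for p in probs)
    q = [p ** alpha / denom for p in probs]                   # Equation (3)
    expected_risk = sum(qi * ri for qi, ri in zip(q, risks))  # Equation (4)
    reg = -lam * math.log(sum(probs))                         # regularizer
    return expected_risk + reg                                # Equation (5)
```

With λ > 0, the regularizer grows as the total probability of the sampled segmentations shrinks, discouraging the degenerate low-loss solutions illustrated in Table 2.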

Data Preparation
All the corpora used in our experiments are from SIGHAN05 (Emerson, 2005), SIGHAN08 (Jin and Chen, 2008), SIGHAN10 (mei Zhao and Liu, 2010) and some OTHER open datasets (Zhang et al., 2014). The statistics of our corpora are shown in Table 3, and some hyper-parameters are given in Table 4. The datasets MSRA, PKU, AS and CITYU are from SIGHAN05 (http://sighan.cs.uchicago.edu/bakeoff2005/), while CTB and SXU are from SIGHAN08, and CNC, UDC and ZX are from the OTHER open datasets; both SIGHAN08 and the OTHER datasets are also openly available (https://github.com/hankcs/multi-criteria-cws/tree/master/data/other). SIGHAN10 contains data from different domains, and we choose "Finance", "Literature" and "Medicine" for our cross-domain experiment. Besides, we take CTB6 as the CTB dataset throughout our experiments. We use the original formats of AS and CITYU rather than their simplified versions. Furthermore, we use the same data pre-processing as Huang et al. (2020) for the whole experiment.
In both the single-criterion and multi-criteria experiments, the majority of results originate from the corresponding papers. For the multi-criteria experiment, we follow He et al. (2018) and prepare the training data by combining all the datasets. For the noisy-label experiment, we convert the input sequence into characters and randomly generate one of the four tags (i.e., B, M, E, and S) for each character position in the input sequence. We use the identical pre-processed data for all architectures, building noisy labels for 10% of each corpus and keeping 90% real data. For the revised masking strategy, we explore the best accuracy of the predictor by training and testing the MLM on SIGHAN05.
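The noisy-label construction can be sketched as follows; the function name and data layout are our own illustrative choices:

```python
import random

def make_noisy_dataset(examples, noise_ratio=0.1, seed=0):
    """Sketch of the noisy-label construction described above: for a random
    noise_ratio fraction of examples, replace the gold tag sequence with
    uniformly random B/M/E/S tags of the same length."""
    rng = random.Random(seed)
    noisy = []
    for chars, tags in examples:
        if rng.random() < noise_ratio:
            tags = [rng.choice("BMES") for _ in chars]  # random tags
        noisy.append((chars, tags))
    return noisy
```

Note that a randomly drawn tag sequence may occasionally coincide with the gold one, so the effective noise rate is slightly below noise_ratio.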

Baselines
We compare our method with the following strong baselines in the field of CWS:

Results of Single Criterion Learning
As shown in Table 5, our proposed method obtains better results on different standard datasets with single-criterion learning. The popular datasets use different segmentation criteria; in particular, the segmentation rules of PKU, MSRA and ZX differ from each other (Huang et al., 2020). Therefore, to investigate the quality of our segmentation model, we compare our approach with previous SOTA methods on the 9 benchmark CWS datasets. We cite the results reported in the corresponding papers, except for the baselines BERT and BERT+LTL on the SIGHAN05 corpora; for the other two corpora (i.e., SIGHAN08 and OTHER) we re-run most of the results not reported in those papers. Due to limited GPU memory, we re-implement the BERT version of BERT+LTL rather than using RoBERTa. We report all results with single-criterion learning.

Results of Multiple Criteria Learning
As given in Table 6, to further validate the quality of our method, we also conduct the multiple criteria experiment proposed in prior work and compare the performance of our model with other methods on the same corpora as in single-criterion training. Our proposed approach consistently outperforms previous SOTA methods.

Table 8: Comparison among the SOTA performance (F1-score, %) with supervised training on different domains. "Fin.", "Lit." and "Med." represent different domains (i.e., "Finance", "Literature" and "Medicine"). The underlined results indicate methods we re-implemented for a fair comparison.
Although we remarkably outperform all baselines on the majority of datasets, we find that some results in multi-criteria learning are very close to, and sometimes lower than, the results of single-criterion training. We again cite the results from the corresponding papers except for BERT and BERT+LTL, and we only compare with the few baselines that also explore multi-criteria learning. Multi-criteria learning does not improve the performance of our model on the CITYU corpus; however, on the other datasets we obtain higher results than with single-criterion learning.
Comparison on Low-quality Datasets

As Table 7 shows, to analyze the robustness of our proposed method with respect to the revised MLM, we prepare noisy-labeled datasets containing 90% real data and 10% randomly shuffled data (see Section 4.1). In this experiment, we use single-criterion training on the noisy-labeled data rather than multi-criteria training, and we run all models on the same noisy-labeled datasets with their corresponding architectures. As expected, almost all results are lower than those obtained with single-criterion training on clean data. However, our proposed method still achieves better results than the SOTA baselines on the noisy-labeled datasets. Thus, with single-criterion training, multi-criteria training and noisy-label training alike, we consistently obtain improvements over closely related previous work.

Comparison on Different Domains
In Table 8, to further validate the effectiveness of our model, we choose datasets from different domains of the SIGHAN10 corpora and compare the segmentation quality with closely related previous works that also used cross-domain datasets. In this experiment, we again cite the reported results from the corresponding papers, except for the baseline systems BERT and BERT+LTL. We use the model trained on the PKU corpus and test on the different domain test sets. The proposed approach gains better performance in the "Literature" and "Medicine" domains compared with other approaches, but obtains worse results than BERT+LTL in the "Finance" domain.

Effect of Masked Count in MLM
As shown in Table 9, to explore the influence of mask_count on the quality of the MLM, we train the MLM with different values of mask_count. We find that the accuracy of the predictor is highest when mask_count = 2. Note that if mask_count = 1, only one character can be masked; in this case, masking any character is legal regardless of the segmentation result. Therefore, we only analyze the cases where mask_count is greater than or equal to 2, and choose the value that makes the accuracy of the MLM highest.

Table 10: The effect of the PTM on our model with single-criterion learning. "P.", "R." and "F." denote precision, recall and F1-score (%). "√" and "×" represent with and without the PTM, respectively.

Effect of Pre-Trained Model
As shown in Table 10, we explore the influence of the PTM on the segmentation model with single-criterion training. We take BERT as the PTM and study its effect on the quality of our word segmentation model on datasets with different segmentation criteria (i.e., MSRA, AS, CTB and CNC). As expected, our segmentation approach obtains remarkably better results with the PTM than without it.

Effect of Hyper-Parameters
We regard improved MRT as a crucial part of our self-supervised word segmentation architecture. To choose the best values of the hyper-parameters, we explore different values of α, λ and the size of S(x) on the CTB, CNC and UDC datasets with single-criterion training.
Effect of the size of S(x) S(x) is a subset of all word segmentation results Y(x) for the sentence x, used to generate the distribution Q defined in Equation (3). As shown in Figure 3(a), improved MRT enhances the quality of the segmentation model most notably across the different corpora when the size of S(x) is 10.
Effect of α α controls the sharpness of the distribution Q defined in Equation (3). As depicted in Figure 3(b), improved MRT increases the quality of our segmentation model most when α = 0.5.
Effect of λ λ is the weight of the regularization term of improved MRT, which appears in Equation (5). As illustrated in Figure 3(c), our model achieves the best segmentation performance when λ = 0.1.

Conclusion and Future Work
In this work, we propose a self-supervised method for CWS. We first generate masked sequences based on the segmentation results, then use the revised MLM to evaluate the quality of the segmentation, and enhance the segmenter with improved MRT. Experimental results show that our approach outperforms previous methods on both popular and cross-domain CWS datasets and has better robustness on noisy-labeled data. In the future, we plan to extend our work to morphological word segmentation tasks (e.g., morphological analysis).