Improving Multi-Criteria Chinese Word Segmentation through Learning Sentence Representation

Recent Chinese word segmentation (CWS) models have shown competitive performance by drawing on pre-trained language models' knowledge. However, these models tend to learn segmentation from in-vocabulary words rather than by understanding the meaning of the entire context. To address this issue, we introduce a context-aware approach that incorporates unsupervised sentence representation learning over different dropout masks into the multi-criteria training framework. We demonstrate that our approach reaches state-of-the-art (SoTA) F1 scores on six of the nine CWS benchmark datasets and SoTA out-of-vocabulary (OOV) recall on eight of the nine. Further experiments show that substantial improvements can be achieved with various sentence representation objectives.


Introduction
Chinese word segmentation (CWS) is a fundamental step for Chinese natural language processing (NLP) tasks. Researchers have published various labeled datasets for evaluating CWS models. However, due to the varied properties of the CWS datasets, different segmentation criteria exist across datasets (Chen et al., 2017; Huang et al., 2020a). A straightforward solution is to create a model for each segmentation criterion (Tian et al., 2020), but this prevents the model from learning cross-dataset segmentation instances.
In order to facilitate the differentiation of various segmentation criteria, researchers started to build multi-criteria CWS (MCCWS) models. Common MCCWS models employ either a single encoder with multiple decoders (Chen et al., 2017; Huang et al., 2020b) or a single model with extra special tokens (He et al., 2019; Huang et al., 2020a; Ke et al., 2020; Qiu et al., 2020; Ke et al., 2021). The former assigns distinct decoders to different criteria while sharing the other model parts. The latter adds a special token at the start of each input, serving the same purpose as private decoders, yielding a compact model that still differentiates segmentation criteria. However, both approaches tend to overfit the in-domain majority criteria of each dataset in use and therefore fail to provide correct segmentations for minority words, especially context-dependent ones. We show an example in Figure 1.
In this paper, we present a context-aware approach to improve MCCWS. To enhance the model's understanding of context, inspired by Gao et al. (2021), we leverage the randomness of dropout (Srivastava et al., 2014) and introduce an auxiliary task that minimizes the difference between sentence representations under the multi-criteria training framework. Our contributions are: (1) the proposed approach sets a new state of the art in F1 score and OOV recall on several CWS datasets; (2) various objective designs for sentence representation learning are also effective for improving CWS.

Related work
After Xue (2003) first treated the CWS task as a character tagging problem, most subsequent researchers followed the approach and performed well. Chen et al. (2017) incorporated multiple CWS datasets by matching a specific decoder to each criterion during training. Inspired by the idea of a decoder per criterion, Huang et al. (2020b) used a pre-trained language model as the encoder, further enhancing CWS performance; however, the cost of maintaining decoders increases as the MCCWS model addresses more datasets. To reduce model parameters, He et al. (2019); Ke et al. (2020); Qiu et al. (2020); Huang et al. (2020a); Ke et al. (2021) utilized extra special tokens to represent the criteria, so that the MCCWS model can interpret the special token as a hint and segment text differently. Learning sentence representation enhances a pre-trained language model's contextual understanding. In recent years, contrastive learning (Gao et al., 2021; Chuang et al., 2022; Zhang et al., 2022) has been the most popular method for learning sentence representation without additional annotated data. Maximizing the similarity between hidden representations of the same text under different dropout masks, and minimizing the similarity between hidden representations of different texts, helps the model further understand the text. These models perform better on natural language understanding (NLU) tasks after training on the sentence representation task. As a result, we use an extra special token as a hint to control the criterion and add a sentence representation task, producing SoTA performance.

Chinese Word Segmentation
Our model is based on pre-trained Chinese BERT (Devlin et al., 2019). Suppose that we have M datasets {D_k}, k = 1, ..., M. Each input sentence s = s_1 ... s_n from a dataset D_k is transformed into the sequence

[CLS] [CT] s_1 ... s_n [SEP]

where [CLS] and [SEP] are special tokens for the pre-trained language model, and [CT] is the criterion token for dataset D_k. We denote the hidden representation that BERT outputs for the token at index i as h_i. As a sequence labeling task, the MCCWS model outputs, for each input character s_i, a vector y_i consisting of the probability of each label in the label set A = {B, M, E, S}, where B, M, and E stand for the beginning, middle, and end of a word, and S represents a single-character word. To accelerate decoding and keep our model simple, we replace the CRF layer (Lafferty et al., 2001) with a linear layer as our decoder:

y_i = softmax(W_d h_i + b_d)  (1)

where W_d and b_d are trainable parameters. We use the cross-entropy function as our loss to force the model to predict the correct label for each character:

L_seg = - Σ_i log y_i^{l_i}  (2)

where y_i^{l_i} represents the probability our model assigns to the correct label l_i of character s_i.
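The BMES tagging scheme and the per-character cross-entropy loss described above can be sketched in plain Python. This is a minimal illustration, not the actual model: in the real system the logits come from a trainable linear layer over BERT hidden states, which we stand in for here with raw numbers.

```python
import math

LABELS = ["B", "M", "E", "S"]  # beginning, middle, end, single-character word


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def words_to_bmes(words):
    """Convert a gold segmentation (list of words) into per-character BMES labels."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags


def char_cross_entropy(probs, gold_tag):
    """Per-character segmentation loss: -log p(correct label)."""
    return -math.log(probs[LABELS.index(gold_tag)])
```

For example, `words_to_bmes(["证券", "登记"])` yields `["B", "E", "B", "E"]`, and a uniform four-way distribution incurs a loss of log 4 regardless of the gold tag.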

Criterion Classification
To let our model distinguish the criteria accurately, we follow the approach proposed by Ke et al. (2020) and train the criterion token with a classification task. These criterion tokens can also be viewed as control tokens that manually prompt the model to segment sentences under different criteria. The criterion classifier is

c = softmax(W_c h_1 + b_c)  (3)

where h_1 is the hidden representation of the criterion token [CT], W_c and b_c are trainable parameters, and c has one entry per dataset, M being the number of datasets used for training. The objective for criterion classification is the cross-entropy loss

L_cls = - log c_k  (4)

where c_k is the probability our model assigns to the input's dataset D_k.
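A sketch of how the criterion token might be injected and trained. The token spelling `[CT-k]` and the function names are illustrative assumptions, not the model's actual vocabulary; the classifier's logits would really come from a linear layer over the criterion token's hidden state.

```python
import math


def build_input(chars, criterion_id):
    """Prepend the dataset-specific criterion token after [CLS].
    Token names like [CT-3] are illustrative placeholders."""
    return ["[CLS]", "[CT-%d]" % criterion_id] + list(chars) + ["[SEP]"]


def criterion_loss(logits, gold_index):
    """Cross-entropy over the M dataset classes: -log p(correct dataset)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    return -math.log(exps[gold_index] / sum(exps))
```

With two datasets and uninformative logits, the loss is log 2, the entropy of a fair coin; training pushes the logit of the correct dataset up.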

Learning Sentence Representation
To make our model further understand the input text, we add a sentence representation loss to our training objective. Following the contrastive learning method proposed by Gao et al. (2021), we pass every sequence s through the encoder twice with different dropout masks. The two resulting hidden representations form a positive pair, while hidden representations from different input sentences form negative pairs. For each input sequence i, we take the two hidden representations at index 0, i.e., of the [CLS] token, and denote them h_0i and h_0i^+. The sentence representation loss over a batch of N sequences is

L_rep = - Σ_{i=1}^{N} log [ exp(sim(h_0i, h_0i^+)/τ) / Σ_{j=1}^{N} exp(sim(h_0i, h_0j^+)/τ) ]  (5)

where τ is a temperature hyperparameter and sim(h_0i, h_0i^+) is the cosine similarity.
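The dropout-based contrastive objective can be illustrated with a stdlib-only sketch over plain vectors. In the real model, `h` and `h_pos` would come from two stochastic forward passes of the same batch through BERT; here they are just lists of floats.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


def contrastive_loss(h, h_pos, tau=0.1):
    """SimCSE-style in-batch loss: h[i] and h_pos[i] are the [CLS]
    representations of sentence i under two dropout masks; every other
    sentence in the batch serves as a negative."""
    n = len(h)
    loss = 0.0
    for i in range(n):
        sims = [math.exp(cosine(h[i], h_pos[j]) / tau) for j in range(n)]
        loss += -math.log(sims[i] / sum(sims))
    return loss / n
```

When positives align perfectly and negatives are orthogonal, the loss approaches zero; as positives drift toward negatives, it grows, which is the gradient signal that pulls the two dropout views of a sentence together.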

Total Loss
Combining Equation (2), Equation (4), and Equation (5) yields the final training objective

L_total = L_seg + α L_cls + (1 − α) L_rep  (6)

where α is a hyperparameter that controls how the model weighs the criterion classification loss against the sentence representation loss.

Datasets
We perform our experiments on nine CWS datasets. The AS, CITYU, MSR, and PKU datasets are from the SIGHAN 2005 bakeoff (Emerson, 2005). CNC is from the Chinese National Corpus. CTB6 (Xue et al., 2005) is from the Chinese Penn Treebank. The SXU dataset is from the SIGHAN 2008 bakeoff (Jin and Chen, 2008). UD is from Zeman et al. (2017). The ZX corpus (Zhang et al., 2014) is segmented from a novel called Zhuxian. For Chinese word segmentation, the F1 score is used to evaluate performance, while OOV recall evaluates an MCCWS model's generalization ability. We report the F1 score and OOV recall on the test set from the checkpoint that performs best on the development set.
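For reference, word-level F1 for CWS is conventionally computed by matching character-offset spans between the gold and predicted segmentations. The sketch below is our assumption of that standard scoring, not the official bakeoff score script.

```python
def seg_f1(gold, pred):
    """Word-level F1: convert each segmentation (list of words) to
    character-offset spans and count exact span matches."""
    def spans(words):
        out, i = set(), 0
        for w in words:
            out.add((i, i + len(w)))
            i += len(w)
        return out

    g, p = spans(gold), spans(pred)
    tp = len(g & p)           # words whose boundaries match exactly
    if tp == 0:
        return 0.0
    prec = tp / len(p)
    rec = tp / len(g)
    return 2 * prec * rec / (prec + rec)
```

For instance, gold `证券 / 登记 / 结算 / 公司` against the prediction `证券 / 登记结算 / 公司` matches 2 of 4 gold words and 2 of 3 predicted words, giving F1 = 4/7.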

Experimental Setting
We preprocess all nine datasets by replacing consecutive English letters with 'a' and consecutive numbers with '0'. The optimizer used for fine-tuning is AdamW (Loshchilov and Hutter, 2019), with moving-average coefficients set to (0.9, 0.999). We set the learning rate to 2 × 10^-5 with a linear warmup and linear decay; the warmup spans 0.1 of the total training steps. Our model is fine-tuned for 5 epochs. We use gradient accumulation with step 2 on a batch size of 16, which approximates a batch size of 32. The value of α in Equation (6) is 0.3, and τ in Equation (5) is 0.1. We apply label smoothing to the word segmentation decoder with a smoothing factor of 0.1. We refrain from using label smoothing on the criterion classifier because we want the model to precisely distinguish the differences between datasets. We run all experiments on an Intel Xeon Silver 4216 CPU and an Nvidia RTX 3090 GPU.
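The letter/digit normalization described above might look like the following. The exact character classes (for example, whether full-width digits are included) are our assumption; the paper only specifies the replacement symbols.

```python
import re


def normalize(text):
    """Replace runs of ASCII letters with 'a' and runs of ASCII digits
    with '0', as in the preprocessing described above."""
    text = re.sub(r"[A-Za-z]+", "a", text)
    text = re.sub(r"[0-9]+", "0", text)
    return text
```

Collapsing every alphanumeric run onto a single symbol keeps the character vocabulary small while preserving the positions where non-Chinese tokens occur, which is what the segmenter needs.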

F1 score
Table 1 shows our F1 score on the nine datasets. Our method achieves SoTA on 6 of the 9. We also report the average F1 score over 4 datasets (denoted Avg.4) and 6 datasets (denoted Avg.6) to compare with methods that did not evaluate on all nine. The model most similar to ours is that of Ke et al. (2020); by adding a sentence representation task, our MCCWS model improves on Avg.4 and Avg.6 by 0.25% and 0.29%, respectively. Huang et al. (2020b) used a private structure and a CRF decoder, which means more parameters for each criterion. Nevertheless, with a simpler architecture, our model performs better on Avg.6 and Avg.9.

OOV recall
Out-of-vocabulary (OOV) recall is a critical evaluation benchmark measuring an MCCWS model's ability to segment unseen words. Table 2 shows our OOV recall on the nine datasets. Our method achieves SoTA on eight of the nine, showing that our approach learns to segment sequences according to the context instead of relying on in-vocabulary information.
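OOV recall can be sketched as span-level recall restricted to gold words that do not appear in the training vocabulary. This is the commonly used definition; the paper does not spell out its exact scorer, so treat the sketch as an assumption.

```python
def oov_recall(gold, pred, train_vocab):
    """Recall computed only over gold words absent from the training
    vocabulary. Segmentations are converted to character spans, so a word
    counts as recalled only when its exact boundaries are predicted."""
    def spans(words):
        out, i = set(), 0
        for w in words:
            out.add((i, i + len(w), w))
            i += len(w)
        return out

    g = {s for s in spans(gold) if s[2] not in train_vocab}
    if not g:
        return 1.0  # no OOV words to recall in this sentence
    return len(g & spans(pred)) / len(g)
```

Because in-vocabulary words are excluded from the denominator, this metric isolates exactly the behavior the section discusses: whether the model segments by context rather than memorized vocabulary.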

Ablation Study
To understand the influence of the various loss functions on model performance, we first remove the criterion classification loss. The F1 score drops slightly, by 0.008% (see Table 3). This result shows that the criterion classification task helps the model distinguish the criteria, but an MCCWS model can learn most of the criteria differences by itself. Second, we exclude the sentence representation loss. The F1 score drops by 0.049% (see Table 3), six times the reduction observed when removing the criterion classification task alone. The OOV recall drops dramatically, showing that our model segments unseen words by referring to the context. Finally, we remove both additional tasks, and the F1 score drops even more (see Table 3). Therefore, both additional tasks improve performance, and the sentence representation task plays the pivotal role: learning semantic knowledge further enhances CWS performance.
To confirm that learning semantic knowledge enhances our model's segmentation ability, we also tried MSE loss, cosine embedding loss, and the loss function proposed by Zhang et al. (2022) for learning the sentence representation (see Table 4). MSE loss and cosine embedding loss keep the two hidden representations of the [CLS] token under different dropout masks similar, while the loss revised by Zhang et al. (2022) aims at better sentence embeddings. Every method we use for learning sentence representation leads to improvement: the F1 score and OOV recall are both better than without the sentence representation task. Therefore, learning sentence representation helps the model segment input sentences better.

Case Study
We select a sentence from the MSR dataset and present the respective segmentations offered by different models in Table 5. The segmentation in question ends in "... / 结算 / 公司". While the character sequence remains unsegmented, it reads as "a company that offers the services of securities registration". In contrast, the segmented representation conveys the meaning of "securities registration and clearing companies". Based on the contextual cues, the semantics of the segmented representation must include all companies related to securities registration and clearing. Therefore, the segmented sequence is superior in meaning and interpretation.
In the end, our model agrees with the gold label, consistent with our analysis, while the model not trained with the sentence representation task cannot deliver the correct result.

Conclusion
In this paper, we introduce a novel training task for CWS, achieving SoTA F1 score and OOV recall on several datasets. We then demonstrate that learning sentence representation matters more than adding the criterion classification objective, because our model can learn the criteria differences during training in most cases. By understanding each context, our model can also deliver different results for the same sequence of words in different contexts (more examples can be found in Table 6). Finally, because our model is simple and has no additional components, it can easily serve as a base model to fine-tune on specific datasets when only a single criterion is required.

Acknowledgment
This work was funded by the National Science and Technology Council, Taiwan, under grant 112-2223-E-006-009. Finally, we thank Jo-Ting Chen for writing suggestions.

Limitations
Several previous works (Ke et al., 2021; Huang et al., 2020a; Qiu et al., 2020; Huang et al., 2020b) did not release their code, so we could not reproduce their experiments.

A Segment Examples
We list another case in Table 6 to show that our model can generate different segmentation results according to the context.

B Loss for Learning Sentence Representation
In this section, we list all the loss functions we used for learning sentence representation. The MSE loss is

L_MSE = (1/N) Σ_{i=1}^{N} || h_0i − h_0i^+ ||^2

and the cosine embedding loss is

L_cos = (1/N) Σ_{i=1}^{N} (1 − sim(h_0i, h_0i^+))

where N is the batch size, and h_0i and h_0i^+ are the hidden representations of the [CLS] token from the same input text under different dropout masks. sim(h_0i, h_0i^+) is the cosine similarity, computed as h_0i · h_0i^+ / (||h_0i|| ||h_0i^+||). These two loss functions minimize the difference between the representations of the input text under different dropout masks, making the model treat the two inputs as having the same meaning.
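Both objectives fit in a few lines of stdlib Python. This is a sketch over plain vectors; the real implementation operates on batched tensors, but the arithmetic is the same.

```python
import math


def mse_loss(h, h_pos):
    """Mean squared distance between the paired [CLS] representations,
    averaged over the batch."""
    n = len(h)
    return sum(
        sum((a - b) ** 2 for a, b in zip(h[i], h_pos[i])) for i in range(n)
    ) / n


def cosine_embedding_loss(h, h_pos):
    """1 - cosine similarity for each positive pair, averaged over the batch."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv)

    return sum(1 - cosine(h[i], h_pos[i]) for i in range(len(h))) / len(h)
```

Both losses are zero when the two dropout views coincide and grow as they diverge; unlike the contrastive loss of Equation (5), neither uses in-batch negatives.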
The third loss function is based on contrastive learning and was proposed by Zhang et al. (2022). It adds an angular margin m, viewed as a decision boundary between a positive pair and a negative pair, to the contrastive objective. θ_{i,j} is the angle between two hidden representations h_i and h_j, calculated as

θ_{i,j} = arccos( h_i · h_j / (||h_i|| ||h_j||) )

By adding a decision boundary, the model can distinguish positive pairs from negative pairs more precisely.
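The angle computation is a direct arccos of the cosine similarity; a small sketch, with clamping added for numerical safety (an implementation detail we assume, since floating-point error can push the cosine slightly outside [-1, 1]):

```python
import math


def angle(u, v):
    """θ_{i,j}: the angle (radians) between two hidden representations."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    cos_sim = dot / (nu * nv)
    return math.acos(max(-1.0, min(1.0, cos_sim)))  # clamp for safety
```

Adding the margin m to the positive pair's angle inside the objective makes the positive pair have to be "m radians better" than the negatives to win, which is the decision boundary described above.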

C Significance Test
We use a t-test to demonstrate that our model's performance surpasses the previous SoTA, and show the results in Tables 7 and 8.

Original Sentence → Segmentation Result from Our Model
请小心使用，不要用坏了 → 请-小心-使用-，-不要-用坏-了
不要用坏了，别间厕所才可以正常使用 → 不要-用-坏-了-，-别-间-厕所-才-可以-正常-使用

The t-test results show that our method statistically produces better results than the previous SoTA. (Based on our trials, we perform a one-tailed t-test with the null hypothesis that our average OOV recall ≤ the previous SoTA OOV recall; the calculated t-value is approximately 23.638. With a significance level of α = 0.05 and 4 degrees of freedom, the critical value is 2.776. Since the t-value exceeds the critical value, we reject the null hypothesis and conclude that our method is significantly better than the previous SoTA.)
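The reported t-value can in principle be reproduced from the five trial scores with a one-sample t statistic. A sketch with hypothetical numbers (not the paper's actual trial data):

```python
import math
import statistics


def one_sample_t(samples, mu0):
    """One-sided, one-sample t statistic: tests whether the mean of the
    OOV-recall trials exceeds a reference value mu0 (the previous SoTA)."""
    n = len(samples)
    mean = statistics.mean(samples)
    sd = statistics.stdev(samples)  # sample standard deviation (n - 1)
    return (mean - mu0) / (sd / math.sqrt(n))


# Hypothetical trial scores and reference value, for illustration only.
t = one_sample_t([0.861, 0.863, 0.862, 0.860, 0.864], 0.850)
```

With n = 5 trials there are 4 degrees of freedom; the null hypothesis is rejected when t exceeds the critical value at the chosen significance level.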

Figure 1:
Figure 1: Comparison of previous works and ours.
Table tennis balls are sold out in the auction.Where can I buy limited table tennis balls?
Table tennis rackets are sold out.Where can I buy table tennis rackets?

Table 3:
Ablation study, where "w/o both" indicates removing both the criterion classification and sentence representation tasks.

Table 5:
The sentence is picked from the MSR dataset, and our model delivers the correct segmentation. In contrast, the CWS model that was not trained with the sentence representation task segments the sentence with a wrong comprehension.
Other previous works also face the same problem. We follow previous comparison methods to compare our results with prior works and surpass their performance. Even though comparing results without reproducing other works' experiments is slightly unfair, we still inevitably do so in Table 1 and Table 2.

Table 6:
An example showing that our model can segment sentences according to the context. The hyphen "-" denotes a segmentation boundary.

Table 8:
OOV recall of 5 different trials. Seed: the random seed used in the experiment. Avg.9: average among the nine datasets. Avg: average OOV recall among the 5 trials. Std: standard deviation among the 5 trials.