SOM-NCSCM : An Efficient Neural Chinese Sentence Compression Model Enhanced with Self-Organizing Map

Sentence Compression (SC), which aims to shorten sentences while retaining the important words that express their essential meanings, has been studied for many years in many languages, especially English. However, progress on the Chinese SC task remains limited due to several difficulties: the scarcity of parallel corpora, the varying segmentation granularity of Chinese sentences, and the imperfect performance of syntactic analyzers. Furthermore, fully neural Chinese SC models have been under-investigated so far. In this work, we construct an SC dataset of Chinese colloquial sentences from a real-life question answering system in the telecommunication domain, and then propose a neural Chinese SC model enhanced with a Self-Organizing Map (SOM-NCSCM) to gain valuable insights from the data and improve the performance of the whole neural Chinese SC model in a valid manner. Experimental results show that our SOM-NCSCM significantly benefits from a deep investigation of similarity among the data, achieving a promising F1 score of 89.655 and a BLEU4 score of 70.116, which also provides a baseline for further research on the Chinese SC task.


Introduction
Sentence Compression (SC) is an important natural language processing (NLP) task which aims to shorten sentences or texts while preserving their essential original meanings. The technique of SC can benefit several real applications, such as automatic title generation (Zhang et al., 2012; Wang et al., 2018), information extraction, opinion mining (Feng et al., 2010), machine translation (Li et al., 2020), and question answering systems.
Previous works on SC can be classified into two categories: (1) rule-based approaches (Vanetik et al., 2020) and (2) machine-learning-based approaches (Filippova et al., 2015; Zhao et al., 2017). The latter can be further divided into statistic-based approaches (Knight and Marcu, 2002) and syntax-based approaches (Kamigaito and Okumura, 2020). Most approaches have treated SC as a deletion-based process that estimates the importance of each word in a sentence in turn and then decides whether it should be kept or deleted when generating the compressed sentence.
Although research on SC has been conducted for many years in many languages, especially English, progress on the Chinese SC task remains limited as far as we know. One reason is the lack of Chinese parallel corpora, which are necessary for training and evaluating supervised or semi-supervised Chinese SC models. To address this problem, several Chinese SC works translate English SC datasets into Chinese, or use web crawlers and data filtering techniques to produce high-quality data from popular Chinese micro-blogging or news websites (Chen et al., 2009; Zhang et al., 2013; Hu et al., 2015). However, those datasets are mostly written in formal Chinese, and are either not publicly available or not strictly extraction-based. Another reason is the particularity of the Chinese language itself, which not only gives rise to the former problem but also causes varying segmentation granularity of Chinese sentences, a lack of standard syntactic or grammatical guidance, and vagueness in some Chinese expressions, so that different human evaluators' judgments of compressed results may differ while all remaining reasonable.
To deal with the problems above, many Chinese SC approaches are unsupervised. They either use heuristic rules, which are hard to duplicate and transfer to other domains for deleting unnecessary constituents, or use statistical scores such as TF-IDF to trim syntactic parse trees (Zhang et al., 2012, 2013), which lack flexibility and capacity and depend on a proper data distribution computed from the adopted dataset. Since the domains of current Chinese SC datasets are limited, to better explore the advantages of fully neural models for the Chinese SC task, we (1) produce an SC dataset of Chinese colloquial sentences from a real-life question answering system in the telecommunication domain, which is more natural than formal written Chinese but more challenging to compress, and (2) augment a neural sequence labeling model with the output of a neural clustering model based on a Self-Organizing Map (SOM) (Kohonen, 1982) pre-trained on this labeled dataset, together with several lexical features, yielding a more effective Chinese SC method (SOM-NCSCM). We measure the performance of our models on the Chinese colloquial SC dataset using the F1 metric, BLEU scores (Papineni et al., 2002), and the compression ratio (CR) (Napoles et al., 2011).
The contributions of this paper can be summarized as follows: • To deal with the sparsity of Chinese SC parallel datasets, we create a Chinese colloquial SC corpus, which is, as far as we know, the first Chinese parallel SC dataset in the telecommunication domain. Evaluations during manual labeling and afterwards demonstrate the high quality of the parallel data.
• On account of the source of our data, and to better explore the similarity among Chinese colloquial sentences, we propose a SOM-enhanced neural Chinese SC model (SOM-NCSCM) that gains valuable insights from our data and improves the performance of the whole neural Chinese SC model in a simple but valid manner.
• We conduct extensive experiments to examine the effectiveness of our proposed SOM-NCSCM on our Chinese colloquial SC dataset. The results show that our model achieves promising performance.
Related Work

Sentence Compression (SC)

As mentioned in the introduction, works on SC can be classified into two categories: rule-based and machine-learning-based approaches. The rule-based approaches are mostly unsupervised; they dispense with large parallel corpora and generate compressed sentences using hand-crafted rules (Zajic et al., 2007). The statistic-based approaches compress sentences according to probabilities calculated from large corpora, and can be supervised or unsupervised depending on the types of training corpora (Knight and Marcu, 2002; Turner and Charniak, 2005; Malireddy et al., 2020). More recent research tends to operate on syntactic trees and reformulate compression as a tree pruning procedure (Clarke and Lapata, 2008; Filippova et al., 2015; Zhao et al., 2018; Kamigaito and Okumura, 2020).
However, the works above mainly focus on English SC, while research on Chinese SC is less common. Furthermore, most Chinese SC approaches do not fully adopt neural network models. Xu and Grishman (2009) enhanced linguistically motivated heuristics for Chinese news SC by exploiting event word significance and event information density via the TF-IDF weighting scheme. Similarly, Feng et al. (2010) applied a statistical scoring function for opinion-oriented Chinese SC, while Zhang et al. (2012) used a Support Vector Machine (SVM) to trim the syntax trees of Chinese sentences and generate news titles from preprocessed WangYi news. Another work proposed a Chinese SC algorithm based on the combination of heuristic rules and the emotional needs of different text scenes, and tested it on articles from Chinese news websites.
Moreover, from the above research on Chinese SC, along with that on English SC, two conclusions can be drawn. (1) The parallel training corpora are mostly produced from news articles, in which the compressed sentences are the article titles, all written in formal language. That is far from human daily expressions, which are more casual and natural. Original Chinese parallel training corpora are relatively few, and the colloquial expressions found in question answering scenarios make such casual sentences even more complicated to compress. (2) Little effort has been made to explore fully neural network models for the Chinese SC task. Therefore, we work on creating a Chinese parallel SC dataset from Chinese colloquial expressions in a real-life question answering system, and on proposing an efficient neural-network method for the Chinese SC task.

Text Clustering
Text clustering is an application of cluster analysis to textual data for sample classification problems. As an unsupervised machine learning method, text clustering has a certain flexibility and a high capacity for automatic processing, since it does not require training with manually labeled text categories in advance.
There are many typical clustering algorithms, such as K-means (MacQueen, 1967), Balanced Iterative Reducing and Clustering Using Hierarchies (BIRCH) (Zhang et al., 1996), and the Gaussian Mixture Model (GMM) (Rasmussen et al., 1999). These techniques have been applied to automatic document summarization, information retrieval, recommendation systems, etc.
In this paper, we introduce a neural clustering method to enhance a neural Chinese SC model in dealing with Chinese colloquial sentences. The unsupervised clustering algorithm we utilize is the SOM, which fully uses the Artificial Neural Network (ANN) framework (Kohonen, 1982). Another reason for choosing the SOM as our clustering method is that it maps input data to a low-dimensional map while still preserving their topology. Moreover, it applies competitive learning: the neurons in the output computational layer compete amongst themselves to be activated and are selectively tuned to various classes of input data in the course of learning, so that only one output neuron can be activated at any one time. This differs from classic clustering algorithms and neural-network-based clustering methods, which need to be provided with the number of clusters in advance, use error-correction learning (such as error back-propagation with gradient descent), or merely generate word embeddings for traditional clustering algorithms.
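To make the competitive-learning idea concrete, the following is a minimal, illustrative SOM sketch in pure Python (not the implementation used in our experiments, which relies on MiniSom; function names and hyperparameters here are purely illustrative). For each input, only the best-matching unit (BMU) "wins", and the BMU and its grid neighbors are pulled toward the input with a Gaussian neighborhood that shrinks over time:

```python
import math
import random

def train_som(data, grid_w, grid_h, dim, epochs=50, lr0=0.5, sigma0=2.0, seed=0):
    """Train a tiny Self-Organizing Map via competitive learning."""
    rng = random.Random(seed)
    # one weight vector per grid cell, randomly initialized
    w = {(i, j): [rng.random() for _ in range(dim)]
         for i in range(grid_w) for j in range(grid_h)}
    n_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in data:
            frac = step / n_steps
            lr = lr0 * (1.0 - frac)            # decaying learning rate
            sigma = sigma0 * (1.0 - frac) + 1e-3  # decaying neighborhood radius
            # competition: only the best-matching unit "wins"
            bmu = min(w, key=lambda c: sum((a - b) ** 2 for a, b in zip(w[c], x)))
            for (i, j), wv in w.items():
                # Gaussian neighborhood on the grid, centered at the BMU
                d2 = (i - bmu[0]) ** 2 + (j - bmu[1]) ** 2
                g = math.exp(-d2 / (2.0 * sigma ** 2))
                for k in range(dim):
                    wv[k] += lr * g * (x[k] - wv[k])
            step += 1
    return w

def winner(w, x):
    """Grid coordinates of the single activated neuron for input x."""
    return min(w, key=lambda c: sum((a - b) ** 2 for a, b in zip(w[c], x)))
```

After training, well-separated inputs activate different neurons, which is exactly the property we exploit to derive cluster indices for query sentences.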

Data Collection and Annotation
The process of data collection and annotation is shown in Figure 1. The data come from a real-life question answering system in the telecommunication domain. This question answering system runs within China and most of its users are Chinese. The queries of users are composed of simplified Chinese natural language words and occasionally a few English words and numbers, such as "VIP", "WIFI" and "1GB".
Data collection. Real-life queries from different users are randomly collected from the question answering system. No user is directly contacted, nor is any personal privacy information stored in our dataset, so the collected data are neither secrecy-related nor privacy-invasive. If a query contains user information or domain-related sensitive information, we either filter out the query or use special tokens to mask the whole phrases expressing the sensitive information. Appendix A.1 shows all special tokens used in our dataset. In addition, our annotators re-confirm all data during the annotation process. It is worth noting that we keep the same segmentation results as the question answering system produces, since Chinese sentences should be segmented before downstream processing.
Data annotation. We ask two professional annotators to manually recheck the spelling of words and the granularity of segmentation, and then to label the compressed sentences. Both are expert annotators recruited specifically for their understanding of the telecommunication domain and at least two years of experience handling and annotating various telecommunication-domain data from the question answering system. They are treated equally and, before actually compressing the sentences, are provided with 500 sampled queries to become familiar with the purpose of the Chinese SC task and to discuss with each other. To ensure unbiased annotations, the users' intent as understood by both annotators, together with the question answering system's responses to the queries, is used as an extra reference standard. The annotators review and correct mistakes iteratively until there are no obvious conflicts between them. Finally, we obtain two sets of 3,300 compressed sentences from the two annotators, apply several quantitative metrics to measure the compressed results, and merge them to produce the final compressed sentences.

Data Evaluation and Description
Data Evaluation. During each annotator's manual compression process, we provide several automatic, quantitative evaluation aids: (1) a list of high-frequency words in the whole dataset, excluding common Chinese stop words; (2) a list of words kept in real-time compressed sentences, together with their 0/1 labels and corresponding frequencies. In the end, to assess inter-annotator agreement, we compute Cohen's unweighted kappa coefficient (Cohen, 1960), obtaining 0.623, which reaches a substantial level (Landis and Koch, 1977). To assess the suitability of our dataset for clustering, we compute the Hopkins statistic (Banerjee and Dave, 2004) and consistently obtain values around 0.719~0.726, which indicates that our data have a high tendency to be clustered.
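The Hopkins statistic compares nearest-neighbor distances from uniform random probe points against those from sampled data points: values near 0.5 indicate random data, values near 1.0 indicate a strong clustering tendency. A small pure-Python sketch of this computation (an illustration, not the exact implementation we used):

```python
import random

def hopkins(data, m=None, seed=0):
    """Hopkins statistic sketch: ~0.5 for uniform data, near 1.0 for clustered data."""
    rng = random.Random(seed)
    n, dim = len(data), len(data[0])
    m = m or max(1, n // 10)
    # bounding box of the data, used to draw uniform probe points
    lo = [min(p[d] for p in data) for d in range(dim)]
    hi = [max(p[d] for p in data) for d in range(dim)]

    def nn_dist(q, pts):
        return min(sum((a - b) ** 2 for a, b in zip(q, p)) ** 0.5 for p in pts)

    # u: nearest-neighbor distances from uniform probes to the data
    u = 0.0
    for _ in range(m):
        q = [rng.uniform(lo[d], hi[d]) for d in range(dim)]
        u += nn_dist(q, data)
    # w: nearest-neighbor distances from sampled data points to the rest
    w = 0.0
    for i in rng.sample(range(n), m):
        w += nn_dist(data[i], [p for j, p in enumerate(data) if j != i])
    return u / (u + w)
```

For clustered data, uniform probes fall far from any point (large u) while sampled points have close neighbors (small w), driving the statistic toward 1.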
Data Description. We merge the manually labeled compressed results of the two annotators to obtain a unified dataset. (As a consistency check, an original query sentence and its compressed counterpart should receive the same response from the question answering system.) The statistics of this Chinese colloquial SC dataset are detailed in Table 1: (1) there are 3,300 queries left for annotation after we automatically mask personal information and filter out sentences with too many meaningless punctuation marks or outside the scope of the system's responses. This amount is large enough to train and test a supervised model for the Chinese SC task, especially when utilizing public pre-trained word embeddings and neural models; (2) the average CR is 0.709, which implies that the dataset contains some short sentences that need no compression and are therefore confusing and challenging for models making deletion decisions. Appendix A.2 shows further data details: the word frequency proportions of 15 randomly chosen words, and the number and proportion of different sentence lengths over all segmented query sentences in our dataset.

Models
In this section, we first provide the formalized definition of the SC task. Then, we describe the baseline model, a commonly-used neural sequence labeling model. Finally, we introduce our SOM-NCSCM to improve the performance and generality of the baseline model.

Task Definition
The formal definition of the SC task is the same as that of Filippova et al. (2015). Each original query sentence contains n word tokens s = (w_1, w_2, ..., w_n), where each w_i ∈ V and V is the vocabulary of our dataset. The SC task is to delete some of the words in s while retaining the necessary words that express the important information, producing a compressed sentence. The corresponding compressed sentence thus contains m word tokens c = (w_1, w_2, ..., w_m) drawn from the original, and we use a sequence of 0/1 labels y = (y_1, y_2, ..., y_n) to denote the binary operations on the words of the original, where y_i ∈ {0, 1}: y_i = 0 denotes deletion of w_i and y_i = 1 denotes retention of w_i, so the total number of ones in the label sequence y is m. Our Chinese colloquial SC dataset is denoted D = {(s^(j), c^(j))}_{j=1}^{N}, and its corresponding deletion/retention label sequences are denoted C = {(s^(j), y^(j))}_{j=1}^{N}. Our Chinese SC goal is therefore to learn a sequence labeling model from C, so that for any Chinese query sentence s we can obtain its label sequence y and convert it into the compressed sentence c.
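The deletion/retention scheme above can be illustrated in a few lines of Python (the token strings are placeholders):

```python
def compress(tokens, labels):
    """Apply the 0/1 label sequence y to sentence s: keep w_i iff y_i == 1."""
    assert len(tokens) == len(labels)
    return [w for w, y in zip(tokens, labels) if y == 1]
```

For example, a sentence of four tokens with labels (1, 0, 1, 0) yields a compressed sentence of the first and third tokens, and the number of ones in the label sequence equals the compressed length m.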

Baseline Model
As discussed earlier, there are few neural Chinese SC baseline models that can readily be trained on our Chinese colloquial SC dataset, so we choose a commonly used neural sequence labeling model as our baseline: a Bidirectional Long Short-Term Memory (Bi-LSTM) network combined with a Conditional Random Field (CRF) layer, whose embedding layer is enhanced with rich lexical features, such as named entity (NE) and part-of-speech (POS) features.
Hence, for each query s, the Bi-LSTM takes the joint embedding of its words and lexical features (the NE and POS features) as input x = (w_1, w_2, ..., w_{T_x}) with w_i ∈ R^{d_w+d_n+d_p}, where T_x is the input length and d_w, d_n, d_p are the dimensionalities of the word, NE, and POS embeddings, respectively. It then produces a sequence of hidden states h = (h_1, h_2, ..., h_{T_x}) to represent its input, each of which is the concatenation of a forward and a backward LSTM representation:

h_i = [→h_i; ←h_i]    (1)

Then, instead of predicting label decisions independently, we pass the output of the Bi-LSTM to a CRF layer, which produces a probability distribution over label sequences and yields the best label sequence among all possible ones. Specifically, before passing the output of the Bi-LSTM to the CRF, we add a dense layer and take its output as the input to the CRF layer:

ĥ_i = W_d h_i + b_d

where W_d and b_d are the weight and bias of the dense layer. The score of a label sequence y is computed as:

score(s, y) = Σ_{i=1}^{T_x} (W_c^{(y_{i-1}, y_i)} ĥ_i + b_c^{(y_{i-1}, y_i)})

where W_c and b_c are the weight and bias corresponding to the label pair (y_{i-1}, y_i) being processed. Then, considering all possible label sequences Y(ĥ), the probability of the label sequence y is defined as:

p(y | ĥ; W_c, b_c) = exp(score(s, y)) / Σ_{y' ∈ Y(ĥ)} exp(score(s, y'))

During training, the objective of the whole model is to maximize the log-probability of the correct label sequences, Σ_j log p(y^(j) | ĥ^(j); W_c, b_c). At decoding time, we adopt the Viterbi algorithm to find the label sequence y* with the maximum score as the optimal label sequence:

y* = argmax_{y ∈ Y(ĥ)} p(y | ĥ; W_c, b_c)
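The CRF scoring and normalization can be sketched for the binary label set {0, 1} by brute-force enumeration (a pedagogical illustration only; the real model uses the forward algorithm for the partition function and Viterbi for decoding; the weight layout here is an assumption):

```python
import math
from itertools import product

def seq_score(h, W, b, y, start=0):
    """Score of label sequence y given per-step feature vectors h.
    Each step adds a label-pair-specific linear term W[(prev, cur)] . h_i + b[(prev, cur)].
    `start` is a fixed initial label, an illustrative simplification."""
    s, prev = 0.0, start
    for h_i, y_i in zip(h, y):
        s += sum(wk * hk for wk, hk in zip(W[(prev, y_i)], h_i)) + b[(prev, y_i)]
        prev = y_i
    return s

def crf_prob(h, W, b, y):
    """p(y | h): exponentiated score normalized over all 2^n label sequences."""
    num = math.exp(seq_score(h, W, b, y))
    Z = sum(math.exp(seq_score(h, W, b, list(yp)))
            for yp in product([0, 1], repeat=len(h)))
    return num / Z
```

The brute-force normalizer makes the global nature of the CRF explicit: probabilities over all label sequences sum to one, so the model trades off deletion and retention decisions jointly rather than per token.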

Our SOM-enhanced Neural Chinese SC Model
To minimize manual labelling, we make no use of any syntactic features, such as dependency parse trees. To better explore the deeper implications of our data and relieve data sparsity, we take advantage of clustering methods and apply a pre-trained neural clustering model to improve our baseline neural Chinese SC model. Specifically, we set up a SOM as our clustering method, built as a feed-forward structure with a single output computational feature layer in which each neuron is fully connected to all the nodes in the input layer. The architecture of our model is shown in Figure 2.
To train the SOM, we first discretize the word embeddings e_w = (e_{w_1}, e_{w_2}, ..., e_{w_{T_x}}) of our original query sentences onto a two-dimensional feature layer. Its output representation is defined as:

z_s = som(e_w; θ_s)

where som(·) refers to the SOM processing and θ_s denotes the trainable weights, which are initialized with random values. The SOM processing stops when the feature layer stops changing. In particular, the normalized coordinates z_s of the activated neuron, which represent the cluster of a query sentence, can then be converted to a mono-dimensional index. By applying the pre-trained SOM, we assign a corresponding cluster index to each original query sentence. We then append an extra attention-based Bi-LSTM model (Bahdanau et al., 2015) to better enhance the representation of a query sentence. It takes as input x = (w_1, w_2, ..., w_{T_x}) the joint embedding of the words, the lexical features (the NE and POS features), and the randomly initialized cluster index feature, with w_i ∈ R^{d_w+d_n+d_p+d_c}, where T_x, d_w, d_n, d_p are the same as in the input to the baseline model mentioned above and d_c is the dimensionality of the cluster index feature. The output representation e_s of the current sentence is calculated via additive attention over the Bi-LSTM states:

u_t = v^T tanh(W_h h_t + W_s h_{T_x})
α_t = exp(u_t) / Σ_{i=1}^{T_x} exp(u_i)
e_s = Σ_{t=1}^{T_x} α_t h_t

where t ∈ [1, T_x]; W_h, W_s, and v are trainable parameters; and h_i and h_t are the outputs of the extra Bi-LSTM, each the concatenation of a forward and a backward LSTM representation computed with a formula similar to Equation 1. Finally, the input to the base model in our SOM-NCSCM is the concatenation of the word embeddings e_w and the sentence representation e_s. To avoid repetition, we do not describe the subsequent compression process, which is the same as in the baseline model.
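Two pieces of this pipeline can be sketched compactly: flattening the activated neuron's 2-D coordinates into a mono-dimensional cluster index (row-major flattening is one plausible conversion; the paper does not specify the exact mapping), and attention pooling over Bi-LSTM states (a sketch assuming the final state as the query, which may differ from the exact formulation):

```python
import math

def grid_to_index(coord, grid_w):
    """Row-major flattening of the activated neuron's 2-D grid coordinates
    into a mono-dimensional cluster index (an illustrative conversion)."""
    i, j = coord
    return i * grid_w + j

def attention_pool(H, W_h, W_s, v):
    """Additive-attention pooling over a list of Bi-LSTM state vectors H.
    Sketch only: uses the final state as the attention query."""
    def matvec(W, x):
        return [sum(wk * xk for wk, xk in zip(row, x)) for row in W]
    query = matvec(W_s, H[-1])
    # unnormalized attention scores u_t = v . tanh(W_h h_t + W_s h_last)
    scores = [sum(vk * math.tanh(a + b)
                  for vk, a, b in zip(v, matvec(W_h, h), query)) for h in H]
    # numerically stable softmax over time steps
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    Z = sum(exps)
    alpha = [e / Z for e in exps]
    # weighted sum of states -> sentence representation e_s
    pooled = [sum(a * h[k] for a, h in zip(alpha, H)) for k in range(len(H[0]))]
    return alpha, pooled
```

The attention weights sum to one, so the pooled vector is a convex combination of the Bi-LSTM states and has the same dimensionality.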

Dataset and Experiment Settings
Dataset. We conduct experiments on our Chinese colloquial SC dataset. We shuffle the whole dataset and then split it into three parts, which produces 3,000, 150 and 150 samples for training set, development set and test set, respectively.
Implementation Details. In the experiments, we use pre-trained 300-dimensional Chinese word vectors to initialize the Chinese word embeddings. We use Stanford CoreNLP to extract the POS and NE features. MiniSom is employed to construct the neural clustering model; we choose an 11 × 11 square map with a sigma of 4, an initial learning rate of 0.5, the Euclidean distance function to activate the map, and the Gaussian function to weigh the neighborhood of nodes in the map.

Evaluation Metrics. To automatically evaluate the performance of Chinese SC models and allow comparison with future models, we report micro F1, BLEU scores, and CR as the main evaluation metrics, where CR is computed as the number of words kept in the compressed sentences divided by the total number of words in the original query sentences.
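The two non-BLEU metrics are straightforward to compute from label sequences. Below is one plausible formulation (treating retained words, label 1, as the positive class for micro F1; the paper does not spell out its exact averaging):

```python
def micro_f1(gold_seqs, pred_seqs):
    """Micro-averaged F1 over retained (label 1) tokens across all sentences."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        for g, p in zip(gold, pred):
            if p == 1 and g == 1:
                tp += 1
            elif p == 1 and g == 0:
                fp += 1
            elif p == 0 and g == 1:
                fn += 1
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def compression_ratio(pred_seqs):
    """Words kept in compressed sentences divided by total words in the originals."""
    kept = sum(sum(seq) for seq in pred_seqs)
    total = sum(len(seq) for seq in pred_seqs)
    return kept / total
```

Counting true/false positives over all tokens before computing precision and recall is what makes the F1 "micro": long sentences contribute proportionally more than short ones.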
Model Comparison. To evaluate performance on our Chinese SC dataset, we compare the following models.
(1) Chinese BERT-based model. We implement a char-level model using the fine-tuned Chinese BERT-wwm model (Cui et al., 2020), and evaluate it on our dataset by replacing the pre-trained word embeddings in the baseline with the concatenation of BERT's outputs (without the NE and POS features). The output labels follow the BIO scheme.
(2) Baseline model. This is the baseline model with NE and POS features as introduced in Section 4.2. We also evaluate its variants, such as those without the NE and POS features, and those directly incorporating the cluster index features into the inputs. (3) NCSCM enhanced with classical clustering algorithms. We replace the cluster index features obtained from the SOM with those from two classical clustering algorithms (K-Means and Gaussian Mixture) and evaluate their performance on our dataset, respectively. We set the number of clusters to 120, which is comparable to the output of the SOM model. (4) Our SOM-NCSCM. This is our proposed neural Chinese SC model as described in Section 4.3.

Main Results

Table 2 shows the main evaluation results of the models on the test set of our Chinese colloquial SC dataset, from which we make the following observations:
Table 4: Some original query sentences along with the actual compressed outputs from the compared models and our SOM-NCSCM. We provide aligned translations, with additional words added in parentheses to form proper English sentences. Words in the segmented Chinese sentences are separated by slashes; words exactly matching the Gold are in blue, words in brackets are masked personal information, words in red are frequently omitted by the models, and characters in green braces are tokens retained by the fine-tuned Chinese BERT-based model.

• Although the fine-tuned Chinese BERT-based model tends to retain more tokens/chars (its CR is relatively higher than the others'), its F1 score is not impressive. Moreover, it easily over-fits during training (fine-tuning), because its complex structure contains a large number of parameters and requires a large amount of data to fine-tune.

• All models employing the lexical features (NE and POS) outperform those without them. This verifies the effectiveness of the lexical features in expressing word functions and sentence meaning.

• Comparing the baseline variants that directly incorporate cluster index features with the baseline that uses only word embeddings as input, it is clear that adding the cluster index features generally helps. However, directly adding the cluster index features to the baseline variants causes performance to drop slightly compared with the standard baseline. This makes sense: while lexical features characterize each word in a sentence, the cluster index feature denotes the sentence as a whole, which can introduce noise and a sparsity problem, since our dataset is not large enough to trade off fixing clustering mistakes against learning from the cluster index features.

• Our NCSCM combined with the two classic clustering algorithms achieves attractive performance, which indicates that the structure of our Chinese SC model is efficient and feasible. Furthermore, our SOM-NCSCM, which uses the cluster index features from the SOM and other lexical features in an extra attention-based neural network to represent sentences, achieves the best F1 score of 89.655 and BLEU4 score of 70.116 among all compared models. This testifies to the advantageous ability of the SOM and implies that the whole model can alleviate the shortage of parallel data while making better use of similarity among the data to solve the SC task. We analyze several actual compressed sentences in the following section.

(Note: bold scores are the optimal scores among all scores, and scores with * are sub-optimal. som-X denotes a SOM of size X×X.)

Experiments on SOM Parameters
To better understand how different settings of the SOM size in the neural clustering model impact the overall performance of our SOM-NCSCM, we conduct several ablation studies, whose results are shown in Table 3. As the SOM map grows, the number of clusters increases, which may make each cluster so sparse that it contains fewer than two query sentences. Besides, based on the outputs of the SOM models under different parameters, we manually track several original query sentences and analyze their clusters together with the other original query sentences in the same clusters. The manual judgment criteria include the user demand, the sentence length, and the structure of the original query sentence; for example, queries asking for an e-bill should be unlikely to be clustered with queries asking about the installation costs of TV.
In addition, we randomly select 50 query sentences from our dataset and calculate the Silhouette Coefficient (Rousseeuw, 1987) and Calinski-Harabasz (Caliński and Harabasz, 1974) scores to evaluate the clustering performance of all experimental models and the manual judgment; we use the scikit-learn toolkit to calculate those scores. The scores are shown in Table 5. As a result, in the main experiments we choose the SOM size of 11×11, whose outputs are most consistent with the manual judgments and obtain the best F1 score, for illustration and comparison. More detailed information and analyses can be found in Appendix A.3.

Case Studies
Here, we provide two instances from our Chinese colloquial SC dataset together with the actual outputs of each model in Table 2, and analyze them to help guide further research on this task (Table 4).
In both examples, we can see that although the models' outputs tend either to miss some words that form the grammatical structure of the sentences or to keep too many optional words that carry no key information, they are acceptable and meaningful to some extent in Chinese colloquial expression. This indicates a gap between exactly matching the strict gold compressed sentences and producing acceptable outputs; a grammatical check could be added during training or post-processing. Besides, apart from its over-fitting, the fine-tuned Chinese BERT-based model keeps the most characters in its outputs, and some characters (边 and 份 in the first example) are even incorrectly labeled "I-1" without a corresponding "B-1", so that they cannot form the accurate, meaningful words (那边 -there and 一份 -an).
In the second example, the outputs of the baseline model and our SOM-NCSCM are not "perfect": both mistakenly delete the keyword "客服代表 (customer service representative)" in the original query sentence. We further investigate the cause of this mistake and find that this word is a domain-specific proper noun that is unknown to the pre-trained word vector dictionary and occurs only once in our dataset; moreover, the Stanford CoreNLP tool cannot recognize it as a whole word phrase when labeling the NE and POS tags. Meanwhile, the fine-tuned Chinese BERT-based model is char-level and can keep the correct word tokens, though at a different segmentation granularity. This phenomenon stimulates our interest in exploring how to better use domain-specific knowledge and other neural-network techniques (e.g., the copy mechanism) in our model to improve the quality of compressed sentences in the future.

Conclusion
To sum up, we construct a Chinese SC dataset composed of Chinese colloquial sentences from a real-life question answering system, addressing a major problem for supervised Chinese SC models: the lack of parallel corpora. As far as we know, this is the first Chinese parallel SC dataset in the telecommunication domain. We then build several fundamental baselines and propose an efficient neural Chinese SC model that introduces a neural clustering technique (the SOM) to enhance the fully neural Chinese SC task, achieving satisfactory performance on the dataset. These results confirm that utilizing similarity among data can benefit the SC task. We believe our Chinese SC dataset and SOM-NCSCM provide a public, diverse Chinese SC dataset and a fully neural, efficient Chinese SC model. We also plan to construct a larger Chinese colloquial SC dataset and to explore other neural-network techniques and semi-supervised approaches for the Chinese SC task.

A.1 Description of Special Tokens
In order to protect users' personal information and guard against breaches of sensitive information in the telecommunication domain, we use five tokens to mask such information. If such information in a query contains more than one segmented Chinese word, we still use one token to mask the whole word phrase. These tokens do not impede the use and comprehension of the query sentences.

A.2.1 Word Frequency Proportions of the Randomly-chosen Words

After the manual annotation process, we randomly choose 15 words in our dataset, including telecommunication-related words and common stop words, and calculate their word frequencies in both the original and the compressed sentences. Figure 3 shows the heat map of the word frequency proportions of those words in our dataset.
(1) In the original query sentences, the focus on domain-related words is diluted by stop words, since users usually use colloquial expressions and polite words in the real-life question answering system; their queries are therefore not concise enough.
(2) After compression, stop words without semantic meaning, and words whose removal does not affect the syntactic structure of the compressed sentences, are deleted. As a result, the compressed sentences are shorter and more succinct than the originals and express users' demands immediately, which also achieves the goal of the SC task: distilling important information and passing it to downstream tasks more effectively.

A.2.2 Distribution of Sentence Lengths of Our Dataset
All query sentences of real-life users are randomly chosen from the question answering system, and we calculate the number and proportion of different lengths over all 3,300 segmented query sentences in our dataset, for both the original and the compressed parts. Figure 4 shows the distribution of sentence lengths in our dataset. In the original part, nearly 85% of the sentences contain 4-11 words, while in the compressed part they are shortened to 3-8 words. There are some very short sentences (e.g., fewer than 4 words) in the original part, which are exceptions for the SC task since they may not be shortened any further, while long sentences (e.g., more than 15 words) are obviously tightened. Covering different lengths of original sentences ensures the diversity of our data and challenges the ability of SC models.

A.3 Cluster Results of the Randomly-selected Query Sentences in the Dataset
To analyse the performance of the clustering algorithms with different parameters more clearly, we randomly select 50 query sentences from our dataset and use the GMM, K-means, and pre-trained SOM models of different SOM sizes to produce the corresponding cluster results, which are shown in Figure 5. Several observations can be made as follows.
(1) Looking directly at the data distribution, we can see a tendency for the sentences to form clusters, and the cluster results of the manual judgment also confirm this tendency.
(2) The GMM and K-Means produce more than twice as many clusters as the SOM model of size 11×11 and the manual judgment do; they assign clusters to those 50 query sentences somewhat randomly.
(3) Every SOM model obtains good cluster results for those sentences; the results of the SOM of size 11×11 are much closer to those of the manual judgment, and the separation between clusters in its results is clearer than in the SOM models of other sizes.
(4) There are some intersections and correlations among the data, such as the query sentences located in the lower left and lower right corners of the sub-figure, respectively. Analyzing those query sentences in detail, we notice several without clear user demands or specific domain-related words, which are difficult to classify. There are also some query sentences manually classified into more fine-grained clusters than any clustering algorithm produces. To classify them better, telecommunication-related knowledge should be involved and the specific situation in which a query is spoken should be considered; such information is important for real-life semantic analysis.
We will continue our studies on the Chinese SC task in the future.