Contextualized Semantic Distance between Highly Overlapped Texts

Overlapping frequently occurs in paired texts in natural language processing tasks like text editing and semantic similarity evaluation. Better evaluation of the semantic distance between overlapped sentences benefits the language system's understanding and guides generation. Since conventional semantic metrics are based on word representations, they are vulnerable to disturbance from overlapped components with similar representations. This paper addresses the issue with a mask-and-predict strategy. We take the words in the longest common subsequence (LCS) as neighboring words and use masked language modeling (MLM) from pre-trained language models (PLMs) to predict the distributions at their positions. Our metric, Neighboring Distribution Divergence (NDD), represents the semantic distance by calculating the divergence between distributions in the overlapped parts. Experiments on Semantic Textual Similarity show NDD to be more sensitive to various semantic differences, especially for highly overlapped paired texts. Based on this discovery, we further implement an unsupervised and training-free method for text compression, leading to a significant improvement over the previous perplexity-based method. The high scalability of our method even enables NDD to outperform the supervised state-of-the-art in domain adaptation by a huge margin. Further experiments on syntax and semantics analyses verify NDD's awareness of internal sentence structures, indicating its high potential for further studies.


Introduction
Comparison between highly overlapped sentences exists in many natural language processing (NLP) tasks, like text rewriting (Liu et al., 2020) and semantic textual similarity (Zhelezniak et al., 2019). A reliable evaluation of these paired sentences will benefit controllable generation and precise understanding of semantic differences.
Conventional metrics, like cosine similarity (S_C), have been popular for semantic similarity evaluation. Nevertheless, we find that the evaluating capability of S_C severely degrades as the overlapping ratio rises. Niu et al. (2019) introduce the difference in perplexity (∆PPL) to describe the semantic distance. Unfortunately, ∆PPL suffers from word frequency imbalance, and many sentences share a similar PPL.
Based on the failure of S_C, we hypothesize that the evaluation is disturbed by the overlapped components, which share similar representations in the paired sentences. We therefore propose a mask-and-predict strategy to attenuate the disturbance from overlapped words. Compared to directly comparing word representations, we discover that using predicted distributions from masked language modeling (MLM) results in better evaluation. Taking Figure 1 as an instance, unmasked comparison involves similar representations of the and heavy in both sentences, since the encoder sees these words and encodes their information. But when these words are masked, the MLM has to predict the distributions considering the contextual difference. While using representations results in a trivial heavy-heavy comparison, the difference of distributions (between candidates long, short and large, wide) better indicates how the contextual semantics changes.
Thus, we are motivated to propose a new metric, Neighboring Distribution Divergence (NDD), which compares predicted MLM distributions from pre-trained language models (PLMs) and uses the divergence between them to represent the semantic distance. We take the overlapped words in the longest common subsequence (LCS) between the paired sentences as neighboring words for the divergence calculation. We conduct experiments on semantic textual similarity and text compression. Experiment results verify NDD to be more sensitive to precise semantic differences than conventional metrics like S_C. Experiments on the Google dataset show our method outperforms the previous PPL-based baseline by around 10.0 on F1 and ROUGE scores. Moreover, the NDD-based method enjoys outstanding compression rate controlling ability, which enables it to outperform the supervised state-of-the-art by 18.8 F1 scores when adapting to the Broadcast News Compression Corpus in a new domain. The cross-language generality of NDD is also verified by experiments on a Chinese colloquial Sentence Compression dataset.
We further use syntax and semantics analyses to test NDD's awareness of the sentence's internal structure. Our experiments show that NDD can be applied for accurate syntactic subtree pruning and semantic predicate detection. Results from our analyses verify the potential of NDD on more syntax- or semantics-related tasks. Our contributions are summarized as follows:
• We address the component overlapping issue in text comparison by using a mask-and-predict strategy and proposing a new metric, Neighboring Distribution Divergence.
• We use semantic tests to verify NDD to be more sensitive to various semantic differences than previous metrics.
• Our NDD-based training-free algorithm has strong performance and compression rate controlling ability. The algorithm sets the new unsupervised state-of-the-art on the Google dataset and outperforms the supervised state-of-the-art by a sharp margin on the Broadcast News Compression dataset.
• Further syntax and semantics analyses show NDD's awareness of internal structures in sentences.
Neighboring Distribution Divergence

Background
We first recall the definitions of perplexity and cosine similarity as the basis for further discussion.
Perplexity For a sentence W with n words (more specifically, subwords), perplexity is the exponentiated negative average log probability of each word appearing in W. If the perplexity is evaluated by an MLM-based PLM, the probability is taken from the distribution predicted at the masked position. The PLM predicts logits for the masked word at the i-th position, and the softmax function turns them into a probability distribution Q_i, where q_j is the probability that the j-th word of the c-word dictionary appears at the i-th position. Here Idx(·) returns the index of a word in the dictionary, so the probability of word w_i is q_{Idx(w_i)}. We summarize the distribution prediction process as a function MLM(·), where MLM(W, i) = Q_i.
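As a concrete illustration, the sketch below computes an MLM-style pseudo-perplexity from toy per-token probabilities (stand-ins for the q_{Idx(w_i)} values a real PLM would predict); the `softmax` helper mirrors how logits become a distribution Q:

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution Q."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def pseudo_perplexity(token_probs):
    """MLM-based pseudo-perplexity: each token is masked in turn, and
    token_probs[i] is the predicted probability of the true i-th token."""
    n = len(token_probs)
    nll = -sum(math.log(p) for p in token_probs) / n
    return math.exp(nll)

# Toy example: a fluent sentence gets higher per-token probabilities,
# hence lower pseudo-perplexity, than an implausible one.
fluent = [0.6, 0.5, 0.7, 0.4]
odd = [0.1, 0.05, 0.2, 0.1]
assert pseudo_perplexity(fluent) < pseudo_perplexity(odd)
```

The probabilities here are hand-written for illustration; with a real MLM they would be read off the softmaxed logits at each masked position.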
As implausible words or structures result in high perplexity, this metric can reflect some semantic information. Perplexity is commonly used to evaluate the plausibility of text and to detect semantic errors in sentences.
Cosine Similarity For a sentence pair W_x, W_y, a pre-trained encoder (such as a PLM or word embeddings) encodes their contextual representations as R_x, R_y. We use PLM-based S_C for experiments and follow the best-performing scenario in (Gao et al., 2021), using the [CLS] token as the sentence representation.

The Calculation Method
This section details the steps of computing the Neighboring Distribution Divergence. Breaking down the term NDD: Neighboring refers to the words contained in the longest common subsequence; Distribution refers to the masked language model's predictions at those neighboring words; and Divergence signifies the disparity between the predicted distributions within the LCS of the sentence pair under scrutiny.
We start with a sentence pair, denoted as (W, W′). The first step is to identify the LCS between these sentences, denoted as W_LCS. Words within this LCS serve as our neighboring words for comparison. The pre-trained language model (PLM) is applied to each word in W_LCS to predict its distribution via MLM. Subsequently, a divergence function assesses the distribution divergence between W and W′ on each shared word. The divergence scores are then weighted and summed to produce the final NDD output.
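The LCS identification step is the standard dynamic program; a minimal sketch over token lists (the example tokens are hypothetical):

```python
def longest_common_subsequence(x, y):
    """Standard DP for the LCS between two token lists; the LCS tokens
    serve as the neighboring words for the NDD comparison."""
    m, n = len(x), len(y)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x[i] == y[j] else max(dp[i][j + 1], dp[i + 1][j])
    # Backtrack to recover the subsequence itself.
    lcs, i, j = [], m, n
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            lcs.append(x[i - 1]); i -= 1; j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return lcs[::-1]

a = "the truck is hauling a heavy load".split()
b = "the rain is heavy".split()
print(longest_common_subsequence(a, b))  # ['the', 'is', 'heavy']
```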
The process can be mathematically expressed as:

NDD(W, W′) = Σ_{w ∈ W_LCS} a_w · F_div(MLM(W, Idx_W(w)), MLM(W′, Idx_W′(w)))

In this equation, F_div(·) symbolizes a divergence function that calculates the divergence between distributions. The functions Idx_W(·) and Idx_W′(·) return the position of w in sentences W and W′, respectively. The term a_w denotes the weight assigned to each word w, which is determined by its distance to the nearest word outside the LCS.
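A minimal sketch of this weighted-divergence sum, using toy hand-written distributions in place of real PLM predictions (the vocabulary, distributions, and weights are all illustrative assumptions):

```python
import math

def kl_divergence(p, q):
    """One possible F_div: Kullback-Leibler divergence between distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def ndd(lcs_words, dists_x, dists_y, weights, f_div=kl_divergence):
    """Weighted sum of divergences between the MLM distributions predicted
    at each LCS word's position in the two sentences.  In a real setup,
    dists_x[w] = MLM(W, Idx_W(w)) would come from a PLM; here they are toy
    distributions over a tiny vocabulary."""
    return sum(weights[w] * f_div(dists_x[w], dists_y[w]) for w in lcs_words)

# Toy example over a 3-word vocabulary: the distribution predicted at the
# masked position of "heavy" shifts between the two contexts.
lcs = ["heavy"]
dx = {"heavy": [0.7, 0.2, 0.1]}   # e.g. candidates like long/short favoured
dy = {"heavy": [0.2, 0.2, 0.6]}   # e.g. candidates like large/wide favoured
w = {"heavy": 1.0}
print(round(ndd(lcs, dx, dy, w), 3))  # → 0.698
```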

Semantic Distance Evaluation
We conduct experiments on the test set of the Semantic Textual Similarity Benchmark (STS-B) to analyze the metrics. Multiple sentence pair similarity evaluation tasks are designed to compare metric performance and investigate metric properties.
• Synonym-Antonym test creates sentence pairs by replacing words with their synonyms and antonyms. Replacing with a synonym (antonym) results in a positive (negative) pair.
• POS test replaces words with ones that have the same (positive) or different (negative) part-of-speech.
• Term test replaces verbs with ones in the same (positive) and different (negative) terms.
• Lemma test replaces words with ones that have the same (positive) or different (negative) lemma root.
• Supervised test uses the human-annotated scores for STS-B sentence pairs.
We replace 20% of words for the synonym-antonym, POS, and lemma tests; 100% of verbs are replaced for the term test. The words for replacement are sampled from the STS test set following their frequency. For the supervised test, we sample sentence pairs whose LCS consists of at least 80% of the words in the shorter sentence. We use RoBERTa-base as the PLM and apply the Hellinger distance as the divergence function to guarantee the boundedness of our metric. Mean pooling is used as the attention-assigning strategy.
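The Hellinger distance mentioned here is bounded in [0, 1], which is what bounds the resulting NDD scores; a minimal sketch:

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete distributions; bounded in
    [0, 1], which keeps the resulting NDD score bounded as well."""
    return math.sqrt(sum((math.sqrt(pi) - math.sqrt(qi)) ** 2
                         for pi, qi in zip(p, q)) / 2)

# Identical distributions -> 0; distributions with disjoint support -> 1.
assert hellinger([0.5, 0.5], [0.5, 0.5]) == 0.0
assert abs(hellinger([1.0, 0.0], [0.0, 1.0]) - 1.0) < 1e-12
```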
The STS experiment results are presented in Table 1. For a fair comparison, RoBERTa-base is also applied to calculate S_C and PPL. NDD outperforms the other metrics on all tasks, showing its strong capability for analyzing semantic similarity. NDD is also more sensitive to POS and lemma changes, an admirable property for preserving the semantic structure in text editing.
Figure 3 shows how the ratio of overlapped words affects metric performance. r = 0 indicates no overlapped words, so only the [CLS] and [SEP] tokens can be used to evaluate the divergence; r = 1 indicates the shorter sentence is a substring of the longer one, as all its words are overlapped.
While S_C performs better when fewer overlapped words hinder its evaluation, its performance drops severely, even to negative correlation, when the overlapped word ratio exceeds 80%. In contrast, a rising ratio helps NDD perform even better, as more neighboring words participate in the evaluation. The ensemble (ratio = 1 : 0.0025) of NDD and S_C generally boosts evaluation performance when the overlapped ratio is ≤ 80%, indicating that NDD and S_C evaluate different aspects of semantic similarity. We further discuss the metrics using specific cases in Appendix B.

Unsupervised Text Compression
The prominent performance of NDD and its correlation with the overlapped word ratio inspire us to apply it to extractive text compression. Text compression takes a sentence W as input and outputs W_C, a substring of W that maintains the main semantics of W. As a substring, the compressed sentence guarantees a 100% overlapped ratio, which supports NDD's performance.

Span Searching and Selection
Given a sentence W, we try every span with length below a length limit L_max for deletion. Then we use NDD to score the semantic difference caused by each deletion.

Table 2: Time complexity of evaluating and compressing methods.
As in Figure 4, we first filter out spans W_ij with NDD_ij above the threshold N_max. As overlaps may still exist among the searched spans, we compare each overlapped span pair and drop the span with the lower NDD score. The process iterates until no overlapped candidates remain.
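Under our reading that the lower-NDD span is the one chosen for deletion (it causes less semantic change), the filtering and overlap-resolution step can be sketched as follows; spans are half-open (i, j) index pairs and the candidate scores are made up:

```python
def select_spans(candidates, n_max):
    """Sketch of the span-selection step: `candidates` maps (i, j) spans to
    their NDD deletion scores.  Spans scoring above the threshold n_max are
    filtered out; among overlapping survivors, the span whose deletion
    causes less semantic change (lower NDD) is preferred."""
    kept = {s: d for s, d in candidates.items() if d <= n_max}
    # Greedily resolve overlaps, preferring lower-NDD (cheaper) deletions.
    selected = []
    for span in sorted(kept, key=kept.get):
        i, j = span
        if all(j <= a or i >= b for a, b in selected):
            selected.append(span)
    return sorted(selected)

cands = {(0, 3): 0.2, (2, 5): 0.4, (6, 8): 0.1, (7, 9): 2.0}
print(select_spans(cands, 1.0))  # [(0, 3), (6, 8)]
```

Here (7, 9) is filtered by the threshold, and (2, 5) loses the overlap comparison to the cheaper (0, 3).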

Experiment
Dataset We conduct our experiments on two English datasets, the Google dataset (Filippova et al., 2015) and the Broadcast News Compression (BNC) Corpus. On the Google dataset, we follow previous setups and use the first 1000 sentences for testing. The BNC dataset has no training set, so a previous work (Kamigaito and Okumura, 2020) trains a compressor on the Google training set. We also include a Chinese colloquial Sentence Compression (SC) dataset to investigate the cross-language generality of NDD. For this dataset, we replace the masks of entities with their natural language expressions to avoid the inaccuracy they cause in NDD calculation.
Configuration We take cased BERT-base as the PLM for English and Chinese BERT for Chinese. The divergence function is set to the Kullback-Leibler (KL) divergence. The prediction on the initial text is used as the approximating distribution, since it is predicted on text with an intact structure.
We fix the following hyperparameters during the experiments. L_max is set to 9 when syntax is used and to 5 otherwise. N_max is set to 1.0. Our compression is iterated at most 5 times, until no word is deleted. The weighting process is described in the Appendix. Other parameters are adjusted to control the compression ratio. Considering the time complexity of NDD, we develop a faster variant called Fast NDD, which calculates the divergence considering only the two words adjacent to the compressed span, based on the hypothesis that the nearest words are the most affected by span deletion. As syntax information is shown to be effective in supervised text compression (Kamigaito and Okumura, 2020), we add a constraint that restricts dropped spans to subtrees of the syntactic dependency tree at each step. This also boosts efficiency, as we only need to consider the sparse subtree spans. The efficiency of the different NDD scenarios is shown in Table 2, where the time complexity refers to the number of PLM-based MLM or representation calculations, and n and k refer to the lengths of the sentence and the dropped span, respectively.
Metric We apply the commonly-used F1 score and the ROUGE metric (Lin, 2004) to evaluate the overlap between our compression and the gold one, and compare with previous works. For ROUGE, we follow the evaluation scenario in (Kamigaito and Okumura, 2020) and truncate the parts of the prediction that exceed the byte length of the gold compression. We also report BLEU (Papineni et al., 2002) to compare with baselines that report BLEU on the Chinese colloquial Sentence Compression dataset. The compression ratio (CR) refers to the percentage of words preserved from the initial sentence, and ∆C = CR_pred − CR_gold, which is better when closer to 0.

Baseline We use the PPL Deleter (Niu et al., 2019) as the main baseline. Deleter uses ∆PPL to control the compression procedure and tries to preserve a low PPL at each step. Simple baselines that directly drop words according to the compression ratio are also included. We report several supervised results to show the current development of the tasks.
Google Table 3 presents our results on the Google dataset. Compared to the PPL Deleter, basic NDD leads to a sharp improvement on all metrics: 10.3 on the F1 score and 8.7 on ROUGE-L. Fast NDD underperforms the initial NDD, but its performance is admirable considering its efficiency. Our method benefits from syntactic constraints, especially Fast NDD, whose performance they boost by around 7.0 on most metrics, setting the new unsupervised state-of-the-art on the F1 score. Still, the initial NDD method with syntax is state-of-the-art on the ROUGE metrics. Thanks to the compression rate controlling ability of our method, we can keep the compression at a CR extremely close to the gold one.
BNC The BNC Corpus is a perfect case for showing the advantage of NDD's compression rate control. We compare against the supervised SOTA, the syntactically look-ahead attention network (SLAHAN) (Kamigaito and Okumura, 2020); the results demonstrate the strength of our approach. In our experiments, we also deployed a whole-word-masking (wwm) RoBERTa (see footnote 7) as the PLM, which led to additional performance gains, indicating that NDD's accuracy can benefit from whole-word masking during pre-training. In summary, our NDD method coupled with the subtree constraint offers the best overall performance among unsupervised models: it achieves the highest F1 score of 76.7, surpasses all others on most metrics, and is very close to the best CR. This confirms its strong potential for sentence compression across languages.

Compression Rate Controlling
We provide a more specific analysis of NDD's compression rate controlling ability. By changing the configuration of our scenario, our method can produce different compression ratios, from 16% to 69%. When the compression ratio is higher than 43%, NDD always yields text of admirable quality (F1 > 60%, ROUGE-1&L > 60%). Even when the CR is extremely small, NDD still preserves much of the information in the initial sentence, with overlapping F1 scores of 58.5 at CR 27% and 44.6 at 16%. Also, adjusting the compression iterations under the same configuration can yield high-quality output at different compression ratios. This compression rate controlling ability enables our method to easily adapt to systems requiring different compression ratios. Further case-based discussion can be found in Appendix H.

7 https://huggingface.co/hfl/chinese-roberta-wwm-ext

Further Analysis
We continue studying the compression algorithm to further investigate NDD's syntax awareness by analyzing the roles of pruned words in the syntax treebank.

Syntax Subtree Pruning
This task tests whether NDD is able to detect syntactic structures, using syntax treebanks. (1) If the pruned nodes mostly play subordinate roles in the tree, our algorithm can be credited with compressing with an awareness of syntax. We depict an instance of a dependency tree in Figure 5; in the tree, deeper nodes like the and that are less important for the integrity of the syntactic structure. (2) Also, pruning a subtree like that cake preserves more syntactic structure than pruning a non-subtree like ate that. Thus, we introduce two metrics to evaluate the pruning quality, Depth-n and Subtree-k:

Depth-n = Count({w | Depth(w) = n}) / Count({w}),
Subtree-k = Count({s | IsSub(s), Len(s) = k}) / Count({s | Len(s) = k}),

where w and s range over the pruned words and spans, Count(·) returns the number of items, Depth(·) returns the depth of a word in the tree, IsSub(·) returns whether a span is a subtree of the tree, and Len(·) returns the number of words in a span. Depth-n and Subtree-k thus reflect word-level and span-level pruning quality, respectively. We experiment on the PTB-3.0 test set (Marcus et al., 1993), using a random dropping strategy with the same compression ratio as the baseline for comparison. As shown in Table 7, the proportion of nodes at shallow levels (depth 1-3) pruned by our algorithm is smaller than for the corresponding random and PPL-based pruning, and the proportion of subtrees among the spans pruned by the NDD-based algorithm is significantly larger than for its counterparts. We thus conclude that NDD can guide the compression algorithm to detect subordinate components in dependency trees.
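A toy sketch of these two notions, assuming a head-array encoding of the dependency tree (the sentence, heads, and spans are illustrative; here a span counts as a subtree when exactly one word's head lies outside it):

```python
def depth_n(pruned_depths, n):
    """Fraction of pruned words whose tree depth equals n; a syntax-aware
    compressor should prune mostly deep (subordinate) nodes."""
    return sum(1 for d in pruned_depths if d == n) / len(pruned_depths)

# Toy dependency heads for "He ate that cake" (0-indexed, root = -1).
heads = [1, -1, 3, 1]          # He->ate, ate = root, that->cake, cake->ate

def depth(i):
    """Depth of word i: number of head links up to the root."""
    return 0 if heads[i] == -1 else 1 + depth(heads[i])

def is_subtree(span):
    """A half-open span (i, j) is a subtree iff exactly one word inside it
    has its head outside the span (that word is the span's root)."""
    inside = set(range(span[0], span[1]))
    return sum(1 for w in inside if heads[w] not in inside) == 1

assert is_subtree((2, 4))      # "that cake" is a subtree
assert not is_subtree((1, 3))  # "ate that" is not
print(depth_n([depth(2), depth(3)], 2))  # 0.5 — only "that" has depth 2
```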

Predicate Detection
To explore the semantics awareness of NDD, we experiment on the semantic role labeling (SRL) task for predicate detection. As predicates are semantically related to more components (arguments) in sentences, deleting them or replacing them with stop words will result in a larger semantic distance from the initial sentence. We evaluate predicate detection as a word-ranking task: we rank words by their probability of being predicates according to the NDD evaluation, and measure performance with ranking metrics, mean average precision (mAP) and area under the curve (AUC). We conduct our experiments on the CoNLL-2009 SRL datasets (Hajic et al., 2009). To test our method's generality, in-domain (ID) and out-of-domain (OOD) English (ENG) datasets are included; a Spanish (SPA) dataset is also used for cross-language evaluation. To generate a new sentence for semantic distance computation, we edit each word in the sentence in three ways: (a) deletion, (b) replacement with a mask token, (c) replacement with a stop word. We apply cased SpanBERT-base (Joshi et al., 2020) and cased Spanish BERT (Cañete et al., 2020) as PLMs. For comparison, we implement a PPL-based algorithm that uses ∆PPL to detect predicates.
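The ranking evaluation can be sketched with a small average-precision helper over toy per-word NDD scores (the scores and labels below are made up for illustration):

```python
def average_precision(scores, labels):
    """mAP-style average precision for ranking words by their NDD score;
    labels mark the true predicates (1) vs. other words (0)."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total, ap = 0, sum(labels), 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y:
            hits += 1
            ap += hits / rank
    return ap / total

# Toy sentence: deleting the predicate shifts the neighboring distributions
# the most, so it receives the highest NDD and ranks first.
ndd_scores = [0.1, 0.9, 0.2, 0.3]   # per-word NDD after deletion
is_predicate = [0, 1, 0, 0]
print(average_precision(ndd_scores, is_predicate))  # 1.0
```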
Our results are presented in Table 8. The generally poor performance of ∆PPL shows that it might not be a proper metric for predicate detection. In contrast, the NDD-based algorithm produces much better results, outperforming the PPL-based algorithm by 10 to 20 points on both AUC and mAP, a remarkably large margin that verifies NDD to be much more capable of understanding semantics. The ensemble of the three editing processes boosts AUC and mAP to above 80.0 and 50.0, respectively, making this a plausible way to detect predicates with an unsupervised procedure.

Related Works
The evaluation of text similarity provides valuable guidance for various downstream tasks, including text classification (Park et al., 2020), document clustering (Lakshmi and Baskar, 2021), and translated text detection (Nguyen-Son et al., 2021). The commonly used cosine similarity evaluates paired sentences' similarity based on the cosine value between word embeddings or pre-trained representations (Reimers and Gurevych, 2019; Zhang et al., 2020b). Unfortunately, when the overlapping ratio between paired sentences rises, the representation-based method suffers from faults caused by similar word representations. Our work replaces word representations with predicted distributions to mitigate the disturbance from overlapped components.
The proposal of PLMs (Devlin et al., 2019) inspires researchers to leverage the upstream training process for text similarity evaluation. Niu et al. (2019) leverage the perplexity calculated by PLMs to represent the semantic distance between texts during text compression. While perplexity can evaluate the fluency of sentences, a recent study (Kuribayashi et al., 2021) suggests that low perplexity does not directly indicate a human-like sentence. Perplexity also fails on words that share a similar appearance probability but have opposite or irrelevant meanings. Other PLM-based metrics, like BERTScore, have been experimentally verified to evaluate text generation better (Zhang et al., 2020a). Other pre-trained models for evaluation are also of interest: to evaluate semantics preservation in AMR-to-sentence generation, Opitz and Frank exploit an AMR parser to compare the AMR graph of generated results with the gold graph, showing the potential of pre-trained models to evaluate more complex linguistic structures.
Many supervised methods (Malireddy et al., 2020; Nóbrega et al., 2020) have been proposed for text compression. Syntax treebanks play a critical role in text compression (Xu and Durrett, 2019; Wang and Chen, 2019; Kamigaito and Okumura, 2020). Unsupervised methods have been explored to extract sentences from documents to represent key points (Jang and Kang, 2021). Nevertheless, span pruning is still far from satisfactory. As mentioned before, Niu et al. (2019) explore using ∆PPL for compression, which is less capable than NDD of preserving semantics.
Syntax and semantics analyses (Dozat and Manning, 2017; Li et al., 2020b,a,c) reflect a model's awareness of the internal structures of sentences. We use such tasks to verify NDD's awareness of syntax and semantics.

Conclusion
We address the overlapping issue in semantic distance evaluation in this paper. To mitigate the disturbance from overlapped components, we mask and predict words in the LCS via PLM-based MLM. NDD evaluates the semantic distance as a weighted sum of the divergences between predicted distributions. STS experiments verify NDD to be more sensitive to a wide range of semantic differences and to perform better on highly overlapped paired texts, which are challenging for conventional metrics. Our NDD-based text compression algorithm significantly boosts unsupervised performance, and its strong compression rate controlling ability enables adaptation to datasets in different domains. NDD's awareness of syntax and semantics is verified by further analyses, showing its potential for further studies.

Limitations
While our NDD metric has demonstrated its effectiveness in measuring the semantic distance between overlapped sentences, some limitations remain. First, the calculation efficiency of NDD may become a bottleneck when dealing with large amounts of data: the mask-and-predict strategy requires a prediction for each word in the LCS, which can be computationally expensive. For large-scale applications, more efficient algorithms or hardware acceleration may be necessary to speed up the calculation of NDD. Second, our method currently cannot selectively compress certain parts of the text. The mask-and-predict strategy compresses the entire overlapped segment, which may not always be desirable: in some cases, it may be preferable to compress only the less relevant portion of the text while retaining the most informative content. While NDD has an advantage over supervised compressors in controlling the compression ratio, it still cannot control the compression order. Future research may investigate techniques that allow more fine-grained control over the compression process. Overall, while NDD shows great promise in improving the evaluation of semantic similarity and text compression, further research is needed to address these limitations and to improve the versatility of the method.

B Specific Cases for Semantic Difference Evaluation
We use specific cases to further explore the ability of NDD to capture precise semantic differences. As in Table 10, we edit the initial sentence "I am walking in the cold rain." with a series of replacements. We keep the syntactic structure of the sentence unchanged and replace some words with other words of the same part-of-speech, so the difference between the initial and edited sentences is mainly semantic. We divide the editing cases into several groups. In the first three groups, we change words (an adjective, a noun, and a verb, respectively) into similar, different, or opposite meanings. NDD successfully detects the semantic difference and precisely evaluates the extent of change. Taking the first group as an instance, changing cold into cool or freezing keeps most of the semantics, while changing it into hot leads to opposite and even implausible semantics. NDD reflects this difference and assigns a much higher score to the cold-to-hot case. Moreover, in the medium case where the described aspect is changed to heavy, NDD remarkably assigns a medium score, showing its high discerning capability.
In the last case group, we change the tense and subject of the sentence.NDD is shown to be fairly sensitive to tenses and subjects.This property can be used to retain those critical properties during edits.NDD is also able to detect syntactic faults like the combination of He am and can thus be used for fault prevention during the edit.
From these cases, we can also see why perplexity and cosine similarity are incapable of detecting precise semantic differences as NDD does. In Table 10, cosine similarity cannot detect the subtle semantic differences or even the syntactic faults. We attribute this to the high reliance of sentence representations on word representations: sentences with many overlapped words will be judged similar.
For perplexity (PPL), the first problem is that the metric evaluates the fluency of a single sentence. Perplexity will thus guide edits toward more syntactically plausible versions while ignoring semantics. As a result, edited results with lower perplexity may change the semantics, as in the cold-to-heavy and rain-to-snow cases. NDD preserves semantics much better by suggesting changing cold to cool or freezing and changing walking to running or wandering.
Another reason is that perplexity can easily be misguided by low-frequency words. In the walking-to-wandering case, since wandering is a low-frequency word and perplexity is scored on the appearance probability of words, the resulting perplexity is even higher than in the walking-to-swimming case, even though wandering is semantically closer to walking than swimming. This issue is overcome in NDD, as we use predicted distributions rather than the actual words. As described before, NDD handles low-frequency words and even named entities much better, and it correctly scores the semantic difference caused by replacements of walking.

C Other Details for Compression
For weighting in text compression, we modify the exponential weight and use balanced weights for distance,
where n is the length of the initial sentence, k is the neighboring word's position, and i, j are the start and end positions of the pruned span. The modification guarantees that the total distance weights are the same for each NDD calculation, whereas the exponential weight assigns smaller weights to words on the two sides of the sentence. Furthermore, we add another weight b_k to encourage our algorithm to delete later words in the sentence; as shown in Figure 6, later words are less commonly used for the summary. We modify the weighted sum accordingly.
In experiments, we fix µ to 0.9 and adjust ν to adapt to the compression rate.

D Mask to Expression
Table 11: The dictionary that transforms Chinese masks (e.g., 某地) to natural language expressions.

E Extra Comparison
We further investigate the capability differences of the metrics in text compression. As in Table 12, we replace the evaluator in the compression scenario with S_C and BERTScore (Zhang et al., 2020a). The results show a large gap between NDD and the other metrics, verifying NDD's prominent capability for semantic distance evaluation.

F Human Evaluation
We further use human evaluation to compare the performance of the text compression algorithms. We sample 100 sentences from the Google test dataset and ask human evaluators to score the syntactic and semantic integrity of the output. The evaluators are blind to which algorithm produced the output. We assign scores from 0 to 5 as follows:
• 0: No legal structure; totally a combination of meaningless fragments.
• 1: Poor structure; only some meaningful components, and the whole structure is not understandable.
• 2: The whole structure is acceptable but contains faults compared to the initial sentence.
• 3: Some parts of the initial structure are preserved, but the compression drops some important components.
• 4: Most parts of the initial structure are preserved, but there is still a little inconsistency or omission of important components.
• 5: The structure is as integral as a human's.
The human evaluation verifies that NDD maintains a large lead over conventional metrics in text compression, in both syntactic and semantic integrity. The benefit of introducing syntactic constraints is also shown for every algorithm.

G NDD distributions
To present more specifically how sensitive NDD is to semantic differences, we depict the distributions of bounded (Hellinger distance-based) and unbounded (KL divergence-based) NDD scores.

H Compression Cases
Init: The speed limit on rural interstate highways in Illinois will be raised to 70 mph next year after Gov. Pat Quinn approved legislation Aug. 19, despite opposition from the Illinois Dept. of Transportation, state police and leading roadway safety organizations.
Edit: The speed limit will be 70 mph despite opposition from organizations.
Gold: The speed limit on highways in Illinois will be raised to 70 mph next year.
F1 Score = 51.9 (↓ 8.…)

Real Effect vs. Automatic Metrics As the compressed results for a sentence can vary, automatic metrics might not fully reflect the compression ability of our algorithm. Also, as our compression follows a training-free procedure, the compressed results might not be in the same style as the annotated gold ones, as in the first instance in Table 14. Both our compressed result and the gold one keep the main point that the speed limit will be 70 mph, preserving the semantics of the whole sentence. Nevertheless, the gold compression tends to keep some auxiliary information, such as the location on highways in Illinois and the time next year. In contrast, NDD-based compression tends to remove that unimportant information while keeping the semantics of the other parts of the sentence unchanged; thus it retains despite opposition from organizations for the sake of the integrated semantics. In the second instance of Table 14, as the gold compression also removes location and time information, our algorithm shows a significant improvement, since our compression style matches the annotated one. Considering that the automatic metrics may be biased by the annotation style, we present more cases in this section to show the capacity of our algorithm to keep semantics and fluency while removing unimportant and auxiliary components.
Init: A US$5 million fish feed mill with an installed capacity of 24,000 metric tonnes has been inaugurated at Prampram, near Tema, to help boost the aquaculture sector of the country.
Iter1: A US$5 million fish feed mill with an installed capacity of 24,000 metric tonnes has been inaugurated at Prampram, near Tema, to help boost the aquaculture sector of the country.
Iter2: A fish feed mill with capacity 24,000 has been inaugurated at Prampram to boost the aquaculture sector.

NDD-based text compression is shown to be capable of detecting and removing auxiliary components in the sentence, such as locations or adjective spans. Also, the syntactic integrity and the initial semantics are preserved in each iteration of our algorithm. This is an advantage over supervised methods, as the output of each iteration is still a plausible compression of the initial sentence. We can thus set proper thresholds and iterate the compression until we obtain a fully satisfying output.
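The iterative, threshold-controlled procedure described above can be sketched as a greedy loop. Here `ndd_score` is a hypothetical stand-in for the NDD evaluation of a candidate span removal; the interface and the greedy span search are illustrative assumptions:

```python
def iterative_compress(tokens, ndd_score, threshold=0.1, max_len=5, max_iters=10):
    """Greedy iterative compression sketch (hypothetical interface).

    ndd_score(tokens, i, j) -> NDD between the sentence and the sentence
    with tokens[i:j] removed; lower means less semantic change.
    """
    for _ in range(max_iters):
        best = None  # (score, i, j) of the least-disruptive removable span
        for i in range(len(tokens)):
            for j in range(i + 1, min(i + max_len, len(tokens)) + 1):
                s = ndd_score(tokens, i, j)
                if best is None or s < best[0]:
                    best = (s, i, j)
        if best is None or best[0] > threshold:
            break  # no removal keeps the semantics within tolerance
        _, i, j = best
        tokens = tokens[:i] + tokens[j:]  # each intermediate result is itself a valid compression
    return tokens
```

Because every intermediate output is a plausible compression, raising or lowering `threshold` directly trades compression rate against semantic preservation.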
C2. Did you discuss the experimental setup, including hyperparameter search and best-found hyperparameter values? Left blank.
C3. Did you report descriptive statistics about your results (e.g., error bars around results, summary statistics from sets of experiments), and is it transparent whether you are reporting the max, mean, etc. or just a single run? Left blank.
C4. If you used existing packages (e.g., for preprocessing, for normalization, or for evaluation), did you report the implementation, model, and parameter settings used (e.g., NLTK, Spacy, ROUGE, etc.)? Left blank.

D
Did you use human annotators (e.g., crowdworkers) or research with human participants?
Left blank.
D1. Did you report the full text of instructions given to participants, including e.g., screenshots, disclaimers of any risks to participants or annotators, etc.? Left blank.
D2. Did you report information about how you recruited (e.g., crowdsourcing platform, students) and paid participants, and discuss if such payment is adequate given the participants' demographic (e.g., country of residence)? Left blank.
D3. Did you discuss whether and how consent was obtained from people whose data you're using/curating? For example, if you collected data via crowdsourcing, did your instructions to crowdworkers explain how the data would be used? Left blank.
D4. Was the data collection protocol approved (or determined exempt) by an ethics review board? Left blank.
D5. Did you report the basic demographic and geographic characteristics of the annotator population that is the source of the data? Left blank.

Figure 1 :
Figure 1: The comparison between two possible text scenarios with shared components. Masking and predicting neighboring words attenuates the disturbance from overlapping when evaluating semantic distance.

Figure 4 :
Figure 4: The compressing scenario of our NDD-based algorithm.

Figure 6 :
Figure 6: The ratio of preserved tokens at certain positions of the initial sentence. Statistics are from the Google training dataset.

Final:
A mill has been inaugurated to boost aquaculture sector.

Table 3 :
Results for sentence compression on the Google dataset. SC: Subtree Constraint with syntax treebanks.
Underline: the performance improvement over the highest baseline is significant (p < 0.05). †: the method is a re-implementation. ‡: the method uses syntactic information.

Table 4 :
Comparison between the supervised state-of-the-art SLAHAN and our NDD method on BNC Corpus.
as the baseline. Since BNC does not have a training dataset, SLAHAN is trained on the 200K Google corpus. Nevertheless, the cross-domain adaptation of SLAHAN is not successful, as its ∆C is an extremely negative −0.35 in Table 4.
In contrast, our PLM-based unsupervised method is robust, can be easily adapted to different domains, and reaches a CR close to the golden one. Our unsupervised method thus outperforms the supervised state-of-the-art by a huge margin (20 ∼ 30) on all metrics.
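As a reading aid, CR and ∆C in this comparison can be understood as follows. These are common definitions assumed for illustration, not quoted from the paper:

```python
def compression_ratio(compressed_tokens, source_tokens):
    """CR: fraction of source tokens that the compression keeps."""
    return len(compressed_tokens) / len(source_tokens)

def delta_c(system_cr, gold_cr):
    """∆C: gap between the system's CR and the gold CR; a strongly negative
    value means the system compresses far more aggressively than the gold."""
    return system_cr - gold_cr
```

Under these definitions, a ∆C of −0.35 means the cross-domain system keeps 35 percentage points fewer tokens than the gold compressions, which is why it signals a failed adaptation.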

Table 5 :
Results for sentence compression on the Chinese colloquial Sentence Compression dataset.

Table 6 :
Performance results of different configuration setups on the Google dataset.

Table 7 :
Proportion (%) of pruned nodes at certain depths of the syntax treebanks and proportion (%) of pruned spans that are subtrees. L_max is set to 5.

Table 9 :
Statistics of our datasets in experiments.

Table 10 :
Cases for NDD's detection of very precise semantic differences. The initial sentence is "I am walking in the cold rain."

Table 13 :
Human evaluation on the syntax and semantics integrity of outputs from unsupervised text compression algorithms.

Table 14 :
Examples of how automatic metrics reflect the performance of NDD-based compression. Improvement refers to the comparison with unedited texts.

Table 15 :
Cases of outputs in iterations of the NDD-based compression. Bold: kept components.

Outputs from Compression Iterations We present the intermediate outputs of our algorithm in Table 15.