Conciseness: An Overlooked Language Task

We report on novel investigations into training models that make sentences concise. We define the task and show that it is different from related tasks such as summarization and simplification. For evaluation, we release two test sets, consisting of 2000 sentences each, that were annotated by two and five human annotators, respectively. We demonstrate that conciseness is a difficult task for which zero-shot setups with large neural language models often do not perform well. Given the limitations of these approaches, we propose a synthetic data generation method based on round-trip translations. Using this data to either train Transformers from scratch or fine-tune T5 models yields our strongest baselines that can be further improved by fine-tuning on an artificial conciseness dataset that we derived from multi-annotator machine translation test sets.


Introduction
"Vigorous writing is concise.A sentence should contain no unnecessary words, a paragraph no unnecessary sentences, for the same reason that a drawing should have no unnecessary lines and a machine no unnecessary parts."Strunk and White (1918) The Elements of Style Conciseness is a writing principle of removing redundant information in text.Even though conciseness is highly valued in expository English writing and is often considered good writing style (Brock and Walters, 1992;Zinsser, 2016), it is still an understudied topic in the natural language processing (NLP) community, mainly due to the lack of annotated data sets.However, automatic methods for improving conciseness have the potential to improve the writing experience even for native speakers, or to provide useful tools for editorial tasks.In this work we take initial steps towards conciseness from an NLP perspective.We release1 two hand-annotated test sets for conciseness -Concise-Lite (2-way annotated) and Concise-Full (5-way annotated).Concise-Lite annotators were asked to make minimal changes to the original sentence, whereas Concise-Full annotators were given the option to make larger rewrites.Table 1 contains examples from both test sets.For evaluation, we compute F 0.5 -scores of edit spans, a metric that is also commonly used for grammatical error correction (GEC) (Dahlmeier and Ng, 2012;Felice et al., 2016;Bryant et al., 2017).Given that both the test sets and the evaluation tool we employ are publicly available, we hope our setup will encourage NLP researchers to investigate models for conciseness.
We evaluate a range of models on our newly collected conciseness test sets. Our initial approach follows the recent paradigm of using massively pre-trained neural models with either no or very little task-specific training data. Inspired by Brown et al. (2020) we report on zero-shot experiments with the large language model LaMDA (Thoppilan et al., 2022). We also fine-tune the large sequence model T5 (Raffel et al., 2020) on small conciseness data sets. We achieve our best results using an unsupervised synthetic data generation method based on round-trip translations, i.e. sentence pairs that were generated by translating an English sentence into another language (e.g. German) and back, a technique that was previously proposed for GEC pre-training (Lichtarge et al., 2019). We construct additional data sets by creating mappings from the longest to the shortest reference in multi-reference machine translation (MT) test sets. Our experiments suggest that conciseness is a hard task for current NLP models. We conclude with a thorough investigation into the similarities and differences of our systems and map out the challenges ahead.

Input sentence: Chuck Knoblauch and Tino Martinez were as popular as squeegee men a week ago, the speculation rampant that one or the other or both might be exiled if the Yankees' historic year crumbled in the post-season.
Sentence summarization: Knoblauch and Martinez home run hits cinch Yankee's First World Series game
Conciseness model output: Chuck Knoblauch and Tino Martinez were as popular as squeegee men a week ago, the speculation rampant that either or both could be exiled if the Yankees' historic year crumbled in the postseason.

Table 2: Example outputs of one of our conciseness models on sentences from an abstractive sentence summarization data set (Over et al., 2007, DUC2004).

Input sentence: A mutant is a type of fictional character that appears in comic books published by Marvel comics.
Sentence simplification: A mutant is a form of imaginary character that is seen in comic books published by Marvel comics.
Conciseness model output: A mutant is a fictional character that appears in comics published by Marvel comics.

Input sentence: It will then dislodge itself and sink back to the river bed in order to digest its food and wait for its next meal.
Sentence simplification: It will then get away from its place and sink back into the river bed in order to digest its food and wait for its next meal.
Conciseness model output: It will then dislodge and return to the riverbed to digest its food and wait for the next meal.
Table 3: Example outputs of one of our conciseness models on sentences from a text simplification data set (Zhang and Lapata, 2017, WikiLarge).

The conciseness task
In this work we define the conciseness task as applying the required edits to make a sentence less wordy without changing its meaning, intent or sentiment. We will shed more light on the limitations of this definition in Sec. 6. We expect conciseness models to be useful mainly for native or advanced non-native writers who wish to improve their writing style. Conciseness is related to several other NLP tasks, but we argue below that each of these tasks has a different focus and deserves an independent treatment.

Summarization and sentence compression
Abstractive sentence summarization (Over et al., 2007) attempts to produce a condensed version of the input text. Summaries are similar to headlines, with a maximum length that is independent of the input sentence length (Rush et al., 2015). Thus, generating a summary often requires much more severe compression than conciseness does.
Unlike summarization, conciseness is faithful to the input and aims to avoid the loss of any information: the goal is to generate a shorter sentence that can replace the original sentence within continuous text (see Table 2 for examples). Furthermore, most work on summarization focuses on the compression of entire documents or paragraphs (Zhang et al., 2020) and not on single sentences.
Similarly to sentence summarization, sentence compression also aims to generate a shorter version of the input text. Many sentence compression models only allow the deletion of words, without the ability to rephrase parts of the sentence (Knight and Marcu, 2000; Jing, 2000; Filippova et al., 2015). Perhaps closest to our work, Mallinson et al. (2018) trained sentence compression models on round-trip translations and thereby avoided this restriction. The main difference to our work is that we evaluate a broader range of methods on human-annotated test sets, which we release for future research.
Sentence simplification The task of reducing the linguistic complexity of text to improve readability is known as sentence simplification (Saggion, 2017). It can be subdivided into lexical (e.g. replacing uncommon words with synonyms) and syntactic (e.g. changing passive to active voice) simplification (Devlin, 1999; Carroll et al., 1999). Most forms of syntactic simplification result in concise outputs,2 but lexical simplification may yield even more verbose outputs. For example, replacing 'to portray' with a simpler but verbose phrase such as 'to describe very vividly' would be an instance of lexical simplification but not of conciseness. Conversely, a conciseness system may substitute a phrase with another that is concise but less common and thereby deteriorate readability. Another difference is that simplification often targets people with cognitive disabilities (Devlin, 1999; Carroll et al., 1999; Rello et al., 2013), people with low literacy (Watanabe et al., 2009), or second language learners (Petersen and Ostendorf, 2007; Siddharthan, 2002; Xia et al., 2016), whereas conciseness can be thought of as writing assistance for proficient writers. Table 3 contrasts simplification and conciseness with the help of example sentences.
Style transfer Text style is an important consideration for several NLP tasks (Fu et al., 2018). For example, it is desirable for MT output to match the stylistic properties of the source sentence (Sennrich et al., 2016; Lohar et al., 2017). Natural language generation systems not only need to take into account the content of generated utterances but also other attributes such as style and sentiment (Li et al., 2018). Text-to-text style transfer systems have been used to change Shakespearean English to modern English (Jhamtani et al., 2017). We consider conciseness a special case of style transfer with a single source style (wordy) and a single target style (concise). However, while most style transfer systems attempt to change attributes like sentiment or political slant (Li et al., 2018; Fu et al., 2018; Prabhumoye et al., 2018; Shen et al., 2017), our conciseness models aim to keep them unchanged.
Paraphrasing Paraphrasing databases such as PPDB (Ganitkevitch et al., 2013; Pavlick et al., 2015) that store pairs of phrases with the same meaning have proven useful for various NLP tasks such as textual entailment (Bjerva et al., 2014) and semantic similarity (Han et al., 2013). In this work we include a paraphrasing system for comparison.

2 An exception would be sentence splitting, since it is a syntactic simplification strategy that often makes the text longer.

Modeling conciseness
The approaches in this section cover a wide range of NLP models to convey a better sense of the task. They are intended to serve as baselines to compare against, and as a starting point for future research.

Giant language models (LaMDA)
Large language models (LMs) such as OpenAI's GPT-3 (Brown et al., 2020), Google's Meena (Adiwardana et al., 2020) and PaLM (Chowdhery et al., 2022), and Microsoft's Turing-NLG have recently captured the interest of the general public through their ability to generate text that is sometimes astonishingly difficult to distinguish from text written by humans. While these models are useful for building open-domain dialog agents, they also have the potential to solve specific NLP problems when provided with an appropriate preamble (LM history) (Brown et al., 2020). We expect general dialog agents to understand the nuances of language such as grammar, conciseness, etc. Thus, we explored using the large LM LaMDA (Thoppilan et al., 2022) with a zero-shot preamble that steers the model towards making a sentence more concise. We use the following template to provide the LM context:

Here is some text: "[INPUT_SENTENCE]". Rewrite it to be more concise.
where [INPUT_SENTENCE] is replaced by the source sentence. We post-process the output to a) discard any additional comment that the model generated besides the rewrite, and b) retain only the first suggestion if multiple rewrites are generated.
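The template and the two post-processing rules can be sketched as follows. The exact cleanup heuristics are not published, so the string handling below (first non-empty line, surrounding quotes stripped) is an illustrative assumption rather than the actual implementation:

```python
def build_prompt(sentence: str) -> str:
    # Zero-shot preamble from the paper; the LM continues from this context.
    return f'Here is some text: "{sentence}". Rewrite it to be more concise.'

def postprocess(model_output: str) -> str:
    """Heuristic cleanup of a raw LM continuation (illustrative assumption):
    a) drop any extra commentary or alternative rewrites on later lines,
    b) keep only the first suggestion, without surrounding quotes."""
    for line in model_output.strip().split("\n"):
        line = line.strip()
        if line:
            return line.strip('"')
    return ""
```

In a zero-shot setup like this, the post-processing step matters because dialog-oriented LMs tend to wrap the rewrite in conversational framing (e.g. "Here is a revision: ...").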

Transformers pre-trained on round-trip translations
This method employs synthetic training data generated using MT. Fig. 1 illustrates the approach. First, we translate an English sentence into a pivot language such as German, and then translate it back into English. This idea of generating sentence pairs via round-trip translation was initially proposed by Lichtarge et al. (2019) to pre-train GEC systems.
In this work, we construct synthetic parallel data for conciseness by using the longer sentence as the source and the shorter sentence as the target. We then train a standard neural sequence-to-sequence Transformer (Vaswani et al., 2017) on the synthetic data until convergence. This approach is simple and enables us to generate large quantities of data, but the resulting data set contains noise. For example, round-trip translation pairs often contain synonym substitutions (see the replacement of almost with nearly in the second sentence in Fig. 1) that do not help conciseness. Furthermore, MT may fail to translate the sentence properly, resulting in an undesirable change of meaning (see the third sentence in Fig. 1). Another problem is that it is hard to control the compression ratio in the data set. Despite these limitations, we show in Sec. 5 that round-trip translations are useful for pre-training.
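The pairing logic can be sketched as follows, with `translate` standing in for any MT system; the function name, its interface, and the equal-length skip rule are placeholders for illustration, not the actual pipeline:

```python
def make_conciseness_pair(sentence, translate):
    """Round-trip a sentence through a pivot language and emit a
    (wordy, concise) training pair: the longer side becomes the source,
    the shorter side the target. `translate(text, src, tgt)` is a
    placeholder for any MT system."""
    pivot = translate(sentence, src="en", tgt="de")    # en -> de
    round_trip = translate(pivot, src="de", tgt="en")  # de -> en
    a, b = sentence, round_trip
    if len(a.split()) == len(b.split()):
        return None  # no length difference, nothing to learn from this pair
    # Longer sentence is the (wordy) source, shorter one the (concise) target.
    return (a, b) if len(a.split()) > len(b.split()) else (b, a)
```

Note that this sketch reproduces the noise sources discussed above: nothing filters out pairs whose length difference comes from synonym substitutions or mistranslations.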

Fine-tuning T5
The final method considered in this work employs T5 (Raffel et al., 2020). Very large sequence-to-sequence models have been found to be extremely powerful, even for challenging language tasks with a limited amount of training data. We fine-tuned the publicly available 11B parameter version (xxl) of T5, with a batch size of 1,024 sentences and a learning rate of 10⁻⁴, on conciseness data sets that were prepared as described in Sec. 3.2. For fine-tuning T5 on round-trip translations we randomly sample 1M sentence pairs from the full data set to limit computation.

Data sets
OpenMT-based fine-tuning and development sets (MultiRefMT-*) We derive fine-tuning and development sets from existing publicly available MT test sets. It is common practice in several NLP areas to collect reference sentences from multiple annotators to increase the trustworthiness of automatic evaluation measures, for example in grammatical error correction (Ng et al., 2014; Bryant and Ng, 2015; Napoles et al., 2017), MT (Freitag et al., 2020), and image caption generation (Zheng et al., 2018). Multi-reference MT test sets have been used in the past to evaluate paraphrasing or sentence compression systems (Ganitkevitch et al., 2011; Pang et al., 2003). We make use of these multi-annotator test sets by selecting the longest reference sentence as the (wordy) source sentence and the shortest reference sentence as the golden (concise) target sentence (Fig. 2). Our MultiRefMT-FineTune set uses all Arabic-English and Chinese-English NIST Open Machine Translation (OpenMT) evaluation sets from 2002-2005. The MultiRefMT-Dev set is based on the Chinese-English 2012 OpenMT evaluation set.
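The longest-to-shortest reference mapping can be sketched in a few lines. Measuring length in whitespace-separated words is an assumption here, since the exact length criterion is not specified:

```python
def wordy_concise_pair(references):
    """Map one multi-reference MT test item to a synthetic conciseness pair:
    the longest reference becomes the (wordy) source, the shortest the
    (concise) target. Length is measured in words (an assumption)."""
    by_len = sorted(references, key=lambda r: len(r.split()))
    return by_len[-1], by_len[0]
```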
Hand-annotated test sets (Concise-*) Deriving conciseness test sets from multi-reference MT evaluation sets is viable as a first approximation, given that all references have similar meaning, intent, and sentiment by design (apart from annotation errors). However, it does not allow us to determine how wordy a sentence is in the first place: if all MT references agreed, it would only suggest that the original source sentence has a single obvious translation, not that the references are already concise. Therefore, we collected two new data sets, consisting of 2000 sentences each, that were explicitly annotated for conciseness: Concise-Lite and Concise-Full. Both data sets use the same set of source sentences drawn from Wikipedia. Sentences that a) were ungrammatical, b) contained fewer than 15 words, or c) included mismatched quotation marks were not selected. While Concise-Lite annotators were asked to make minimal changes to the original sentence, Concise-Full annotators were given the flexibility to make larger changes. The exact annotator guidelines are listed in Appendix B.
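The two mechanical selection criteria can be sketched as a filter; the grammaticality check (criterion a) needed human judgment and is not modeled here, and counting only straight double quotes is a simplifying assumption:

```python
def keep_sentence(sentence: str) -> bool:
    """Mirror two of the selection criteria for test-set source sentences:
    b) at least 15 words, c) balanced (straight double) quotation marks.
    Criterion a), grammaticality, required human judgment and is omitted."""
    if len(sentence.split()) < 15:
        return False
    if sentence.count('"') % 2 != 0:
        return False
    return True
```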
We will make the test sets publicly available to establish a benchmark for researchers to evaluate conciseness models.

Results
We use the GEC evaluation toolkit ERRANT (Bryant et al., 2017; Felice et al., 2016) to compute F0.5-scores on spaCy-tokenized text. As in GEC, precision is weighted twice as high as recall in the F0.5-score, which matches our intuition that a conciseness system should act as a minimally intrusive writing assistant for which false positives are far worse than false negatives.
• Simplification: T5 fine-tuned on the WikiLarge simplification dataset (Zhang and Lapata, 2017) using a procedure similar to our T5-conciseness system from Sec. 3.3.

The summarization baselines (rows a and b) perform poorly since they are mostly trained on full documents. The simplification system achieves slightly higher performance but is weaker than the paraphrasing or the Transformer/T5-based conciseness systems. The paraphrasing system (row d) achieved a recall of over 20% on both test sets, but its precision is relatively low because the ParaNMT training set contains various types of edits, such as synonym replacements or word reorderings, that do not necessarily help conciseness.
The zero-shot Giant-LM (LaMDA) setup (row e) was not able to match either the precision or the recall of the other conciseness systems. Round-trip translations are useful both for training a Transformer model from scratch (row f) and for fine-tuning T5 (row h). Subsequent fine-tuning on MultiRefMT-FineTune yields large precision and recall gains for the Transformer model (row g). MultiRefMT-FineTune also improves the recall for T5, but the precision suffers (row i). T5 outperforms the Transformer in terms of F0.5-score by achieving higher precision on both test sets, but has many more parameters (Table 7).

Ablation studies and analyses
The following analyses were carried out on the Concise-Lite and Concise-Full test sets.

Round-trip translation languages Our final models in Table 6 use round-trip translations from four different pivot languages: French, German, Japanese, and Russian. Fig. 3 shows that combining all languages yields consistent gains on both test sets over using any single language.
Preserving semantics To measure how well our systems retain the meaning of the original sentence, we computed semantic similarity scores between the input and the output sentences using the models provided by the Semantic Reactor toolkit (Yang et al., 2018; Cer et al., 2018). Systems and annotators trade off compression against semantic similarity differently (Fig. 4). There is large variability in compression ratio (i.e. the number of target words divided by the number of source words) and semantic similarity between the Concise-Full annotators (dark purple). The Giant-LM (blue) is more prone to meaning change than the other systems, and is not effective in reducing the sentence length. Fine-tuning on MultiRefMT-FineTune (empty vs. filled circle/square) improves the compression ratio but hurts semantic similarity. T5 (red) preserves semantics better than the Transformer but outputs slightly longer sentences.
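The compression ratio used throughout these analyses is straightforward to compute; whitespace word splitting is assumed here in place of the actual tokenizer:

```python
def compression_ratio(source: str, output: str) -> float:
    """Number of target words divided by the number of source words.
    Values below 1.0 mean the output is shorter than the input."""
    return len(output.split()) / len(source.split())
```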
Readability Fig. 5 shows that our systems often improve the readability of the sentence, in particular the Giant-LM system. The Giant-LM prefers simpler language as it was originally designed for dialog applications (Thoppilan et al., 2022). In contrast, the Concise-Full annotators tend to achieve concision using longer and more complex words, resulting in a decline in readability (dark purple).

Information density We expect the outputs of a high-performing conciseness system to have a high information content per word. This information density can be measured using per-token inverse document frequency (Jones, 1973):

IDF(t) = log(N / |{d ∈ D : t ∈ d}|)

where t is the token, N is the total number of documents, and D is the document collection. In our case, the document frequencies are derived from the C4 corpus (Raffel et al., 2020). Fig. 6 shows that the reference sentences from the Concise-Lite and Concise-Full annotators indeed have a higher per-token IDF than the input sentences (pink and dark purple bars). The results on the system outputs are mixed, but fine-tuning on MultiRefMT-FineTune improves the per-token IDF for the Transformer and T5 ("RT" vs. "RT→MT").
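The per-token IDF measure can be sketched as follows, with a toy document-frequency dictionary standing in for counts derived from C4; the handling of unseen tokens (clamping the document frequency to 1) is an assumption:

```python
import math

def idf(token: str, doc_freq: dict, num_docs: int) -> float:
    # IDF(t) = log(N / |{d in D : t in d}|); unseen tokens get the maximum.
    return math.log(num_docs / max(doc_freq.get(token, 0), 1))

def per_token_idf(sentence: str, doc_freq: dict, num_docs: int) -> float:
    """Mean IDF over the tokens of a sentence, a proxy for information
    density: removing low-content function words raises the mean."""
    tokens = sentence.lower().split()
    return sum(idf(t, doc_freq, num_docs) for t in tokens) / len(tokens)
```

Under this measure, a concise rewrite that drops high-frequency filler words has a higher per-token IDF than the original sentence, which is exactly the effect plotted in Fig. 6.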
Synonym substitutions One problem with using round-trip translations for training and multi-reference test sets for evaluation is that both may contain synonym substitutions that do not help conciseness. We counted synonym substitutions by extracting all 1:1 substitutions and checking whether these were marked as synonyms in WordNet (Miller, 1995). Fig. 7 shows that most of our systems replace synonyms on average in every 10th sentence. Fine-tuning the Transformer or T5 on MultiRefMT-FineTune reduces the number of synonym substitutions. Synonyms are much less of a problem with the Giant-LM (blue bar), which was not trained on round-trip translations.

Table 8: Measuring annotator agreement on Concise-Full by evaluating each single annotator using the other four annotations as references (precision / recall / F0.5 per row). We list the Transformer and T5 system outputs ("RT→MT") for comparison.
25.3 / 28.9 / 26.0
20.7 / 29.2 / 22.0
25.7 / 28.9 / 26.3
23.1 / 28.0 / 23.9
25.4 / 27.0 / 25.7
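Extracting 1:1 substitutions can be done with a diff alignment over tokens; a minimal sketch using the standard library (the subsequent WordNet synonymy lookup, done via NLTK in a full implementation, is omitted here):

```python
import difflib

def one_to_one_substitutions(source: str, target: str):
    """Extract single-word replacements between two tokenized sentences
    via a diff alignment. Each resulting pair is a candidate synonym
    substitution; checking the pair against WordNet is a separate step."""
    src, tgt = source.split(), target.split()
    matcher = difflib.SequenceMatcher(a=src, b=tgt, autojunk=False)
    subs = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        # Keep only 'replace' blocks that swap exactly one word for one word.
        if op == "replace" and i2 - i1 == 1 and j2 - j1 == 1:
            subs.append((src[i1], tgt[j1]))
    return subs
```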

Limitations
In terms of both information density (Fig. 6) and number of unnecessary synonym replacements (Fig. 7), the annotators are clearly separated from most of our automatic systems, illustrating the gap to human performance on this task.
Our experiments showed that the Giant-LM (zero-shot) underperformed the other approaches. Preliminary experiments using few-shot learning did not yield improvements over the zero-shot setting. We expect the performance of the Giant-LM to improve with systematic prompt engineering.
Another challenge lies in the intrinsic uncertainty (Ott et al., 2018; Stahlberg et al., 2022) of the conciseness task, i.e. the existence of multiple viable ways to make a sentence more concise. Table 8 demonstrates that the five Concise-Full annotators usually did not agree on a single concise version of a sentence, leading to great variability in F0.5-scores when evaluated against each other. Therefore, adequate system outputs may get penalized if they do not match one of the human references. We mitigate this concern by using multiple annotators, but, as in other intrinsically uncertain NLP tasks such as MT, a certain level of noise remains in our evaluation.

Limitations of our task definition
We acknowledge that there are various aspects of conciseness that are not covered by our definition in Sec. 2 ("applying the required edits to make a sentence less wordy without changing its meaning, intent or sentiment"). First, we intentionally did not include the use of context in our definition. In practice, however, appropriate levels of conciseness can be highly context dependent. Treating the problem on the sentence level is limiting because using inter-sentential cross-references for conciseness requires access to document-level context such as the previous sentence. Furthermore, the sentence-level restriction prevents the systems from improving conciseness through sentence splitting (Botha et al., 2018) or merging (Geva et al., 2019). In real-life situations, the context may also be provided through other channels such as the physical medium (e.g. pointing to things) or social factors (e.g. does person B know person A?). We also noticed that our Concise-Full annotators occasionally relied on common knowledge to shorten sentences (see Appendix C for examples), a strategy that is not covered by our definition and thus makes our evaluation slightly noisier. Exploring the various forms of context for conciseness is a promising direction for future research.
Another limitation of our definition is that it does not allow for a change of semantics, intent, or sentiment. In practice, however, conciseness or the lack of it may itself reflect the intent of the speaker, for example by signalling urgency through brevity in emergency situations, or as a cue in lie detection (Vrij, 2005). Conciseness can also carry meaning when used as a rhetorical device to persuade or inspire the audience, a well-known strategy in legal writing (Osbeck, 2011) that was perhaps most famously demonstrated by Abraham Lincoln in the Gettysburg Address (Oseid, 2009). Furthermore, our ablation studies in Sec. 5.2 revealed that systems and human annotators alike sometimes accepted a minor loss of (irrelevant) information to achieve better compression, which, despite being contrary to our definition, may be acceptable in practice.

Conclusion
Our work is an initial exploration of conciseness from an NLP point of view. We compared a variety of approaches to the problem using popular techniques based on synthetic data generation or giant pre-trained sequence models. Round-trip translations provide a useful data source for training conciseness models but can introduce undesirable synonym substitutions. Our analyses show that our systems trade off the objectives in conciseness differently (e.g. reducing the sentence length vs. preserving semantics vs. improving readability vs. increasing information density). Further experiments are necessary to understand how these trade-offs would impact the user experience or potential downstream NLP tasks. We expect our study and our annotated test sets to provide impetus for researchers to explore this field further.

A Transformer hyper-parameters
Our round-trip translation based models (Sec. 3.2) are trained on TPUs with the LAMB optimizer (You et al., 2020) in JAX (Bradbury et al., 2021). We used the Transformer (Vaswani et al., 2017) implementation from the MT example in Flax with the 32K SentencePiece vocabulary (Kudo and Richardson, 2018) from T5 (Raffel et al., 2020). Model hyper-parameters are listed in Table 9.

B Annotator instructions
The Concise-Lite annotators received the following instructions: Rewrite the sentence to make it more concise, without changing the sentence structure. By sentence structure, we mean the general order of words in the sentence should not change; some sub-phrases could be rewritten/replaced/deleted (3-5 words). These should be relatively minor rewrites, such that you can replace a phrase with a shorter alternative without reorganizing the entire sentence. The sentences should be annotated in isolation without any assumptions on preceding or succeeding sentences.
The Concise-Full instructions are: Rewrite the sentence to achieve maximum conciseness.These can be major rewrites that alter the sentence structure to make it as concise as possible.The annotator needs to make sure that the sentence stays the same semantically (meaning, intent & sentiment) and there is no loss of any critical information.The sentences should be annotated in isolation without any assumptions on preceding or succeeding sentences.

C Example outputs
Table 10 shows some example outputs of our systems and the baselines. The summarization (Long T5) system frequently changes the meaning of the source sentence. The simplification (Simplify T5) system performs slightly better but still changes the meaning in some instances (example c). The T5 system is mostly faithful to the meaning of the source sentence. We observe occasional slight meaning shifts with the Transformer and ParaNMT systems (see e.g. examples b) and g)).

Figure 1: Synthetic pre-training data generation using round-trip translations.
Round-trip translations (RoundTrip-*) Our Transformer system is pre-trained on round-trip translations of sentences crawled from news websites, following the recipe of Lichtarge et al. (2019).

Table 5: Synthetic and hand-annotated conciseness data sets used in this work.

Figure 2: Fine-tuning data generation using multi-reference MT test sets.

Figure 3: Transformer models trained from scratch on round-trip translations via different pivot languages.
• Paraphrasing: A Transformer model trained on the full ParaNMT-50M (Wieting and Gimpel, 2018) training set using the hyper-parameters in Appendix A.

Figure 4: Trade-off between semantic similarity and the sentence compression ratio.

Figure 6: Relative change in information density.
The Giant-LM often changes or expands the information in the source sentence (e.g. examples b), d), and f)) or adds certain artefacts (e.g. "Here is a revision: '...'" in example a)) that stem from its main use case as a user-facing dialog agent. Being a paraphrasing system, ParaNMT often falls short of actually improving conciseness (examples c) and f)), and often uses unnecessary synonyms. Synonym replacements can also sometimes be found in Transformer and T5 outputs (examples a) and c)), but not in Giant-LM and human-annotated sentences. The pre-trained models Giant-LM and T5 are sometimes able to compress sentences by relying on background knowledge, e.g. by replacing "the Northern States" with "the Union" or "the North" in example g).

Table 10: Example sentences from our conciseness systems and other baselines (summarization: Long T5, simplification: Simplify T5, ParaNMT). We use the "RT→MT" setups for the Transformer and T5 systems. We show one Concise-Lite and one Concise-Full human reference.

Table 1: Example sentences from our Concise-Lite and Concise-Full test sets.

Table 4: Data set statistics. The compression ratio is the number of target words divided by the number of source words.
Table 4 lists the data sets used in this work. Table 5 contains information about their provenance.

Table 6: System comparison on our two conciseness test sets. "RT" denotes models trained on round-trip translations. "RT→MT" configurations are subsequently fine-tuned on MultiRefMT-FineTune.

Table 7: Number of model parameters.