Generating Multiple-Length Summaries via Reinforcement Learning for Unsupervised Sentence Summarization

Sentence summarization shortens given texts while maintaining their core contents. Unsupervised approaches have been studied to summarize texts without human-written summaries. However, recent unsupervised models are extractive, which remove words from texts and are thus less flexible than abstractive summarization. In this work, we devise an abstractive model based on reinforcement learning without ground-truth summaries. We formulate unsupervised summarization as a Markov decision process with rewards representing the summary quality. To further enhance the summary quality, we develop a multi-summary learning mechanism that generates multiple summaries with varying lengths for a given text, while making the summaries mutually enhance each other. Experimental results show that the proposed model substantially outperforms both abstractive and extractive models, while frequently generating new words not contained in input texts.


Introduction
The goal of sentence summarization is to enhance the readability of texts by reducing their lengths through word dropping, replacement, or paraphrasing. The applications of the task include subtitle generation (Luotolahti and Ginter, 2015) and email summarization (Zajic et al., 2008). An issue is that it is costly to have human editors write summaries for each text. Hence, it is critical to develop an unsupervised model that does not require any human-written summaries.
Early models focus on abstractive summarization, which generates words from a vocabulary set, rather than extractive summarization, which merely selects words from texts. Specifically, abstractive models have adopted autoencoder networks to summarize texts in an unsupervised manner (Wang and Lee, 2018; Févry and Phang, 2018; Baziotis et al., 2019). In contrast, extractive models summarize texts by finding word combinations from texts, aiming to maximize predefined scores (e.g., fluency of summaries) (West et al., 2019). Despite their limited functionality, i.e., word selection, recent extractive models outperformed the abstractive models (Schumann et al., 2020; Liu et al., 2022).
Despite the success of the extractive models, we argue that they have an inherent downside. Extractive models only select words from texts, and thus they cannot generate new words that can be effective for sentence summarization. For example, extractive models are unable to generate acronyms (e.g., PM) for words (e.g., Prime Minister) if the acronyms do not appear in the texts. In contrast, abstractive models can resolve this limitation. However, the summary quality of existing abstractive models is sometimes worse than that of a simple baseline that merely truncates input texts from the beginning (Schumann et al., 2020). This implies that existing abstractive models fall short of reducing text lengths while maintaining summary quality. The aforesaid limitations of existing models motivate us to devise an abstractive model that produces high-quality summaries while generating new words not contained in input texts.
This work employs reinforcement learning (RL) for unsupervised abstractive summarization. RL enables a model to learn to summarize using rewards even though they are non-differentiable. Our model generates high-quality summaries by considering 1) the semantic similarity between a generated summary and its corresponding input text, and 2) the fluency of the generated summaries. Notably, the semantic similarity is more robust in preserving the core contents of input texts than the word-level reconstruction objective (Pagliardini et al., 2018) adopted by existing abstractive models.
Moreover, we argue that the difficulty of summarization depends on the summary length (e.g., the shorter the summary, the more difficult it is to summarize). In this respect, we develop a multi-summary learning mechanism that generates multiple summaries with varying lengths for a given text, while making the summaries mutually enhance each other. The main idea is to use a high-quality summary of a certain length, which is easy to generate, to enhance the quality of a low-quality summary of another length, which is difficult to generate, rather than independently generating summaries of each length. Specifically, we design the mechanism to make low-quality summaries semantically similar to high-quality ones.
We also devise a pretraining task to obtain well-initialized model parameters for the RL training. We first augment input texts by applying word-level perturbations and inserting length prompts, which indicate the lengths of the original texts. Then, we train the model to reconstruct the original text from the augmented one, which makes the model learn to summarize and to control the output length. By pretraining the model in this manner, we equip it with essential abilities for summarization, resulting in improved summary quality after the RL training. We dub our model Multi-Summary based Reinforcement learning with Pretraining (MSRP).
Experiments show that MSRP outperforms the abstractive and extractive baseline models in both automatic and human evaluations. We also analyze summaries generated by MSRP to illuminate its benefits compared to the recent extractive models.

Unsupervised Sentence Summarization
Supervised models depend on human-written summaries, which involve costly and time-consuming data creation (Rush et al., 2015; He et al., 2020; Song et al., 2021). In contrast, unsupervised models learn to summarize texts without any human-written summaries. Abstractive models mainly adopt autoencoders to build a summarization model. Févry and Phang (2018) adopt a denoising autoencoder to summarize texts by treating texts as noised data and summaries as clean data. Wang and Lee (2018) and Baziotis et al. (2019) design autoencoders that generate word sequences as interim outputs and use the word sequences as summaries. Zhou and Rush (2019) devise a model that selects the best next word based on a fluency score to generate summaries. In contrast, an extractive model (West et al., 2019) iteratively deletes words from texts to generate summaries while measuring the fluency of each intermediate summary. Schumann et al. (2020) select the best word combination that maximizes predefined scores based on a hill-climbing search algorithm, which surpassed the abstractive models. However, the search requires exhaustive computation. In response, Liu et al. (2022) train an extractive model with summaries generated by Schumann et al. (2020) so that it can quickly generate summaries without the exhaustive search. Compared to extractive models, this work aims to design an abstractive model to enjoy its flexible operation, i.e., generating words not contained in texts.

Reinforced Summarization Models
RL has been used as a technique to solve summarization tasks. With referential summaries, Paulus et al. (2018) and Bian et al. (2019) relieve the exposure bias of teacher forcing-based supervision. Without referential summaries, Böhm et al. (2019) and Stiennon et al. (2020) devise RL-based models that maximize a reward representing the summary quality, which is annotated by human experts. Wang and Lee (2018) address unsupervised sentence summarization where only input texts are available, which is our target scenario. They utilize RL to train an autoencoder with a word-level reconstruction loss to preserve the contents of texts in summaries. In this work, we formulate an RL framework that achieves three goals: 1) semantic similarity between input texts and summaries instead of word-level similarity, 2) controllability of the summary length, and 3) a model-agnostic RL framework.

Pretraining Task for Summarization
Pretraining tasks are crucial to obtaining high accuracy on NLP tasks (Devlin et al., 2019; Lewis et al., 2020). Recent research introduces pretraining tasks for long-document summarization (Zhang et al., 2020; Zhu et al., 2021). However, these approaches are not applicable to sentence summarization due to the absence of multiple sentences, and they do not consider controlling the summary length. We thus propose an effective pretraining task that makes models learn to summarize and control the summary length.

Reinforcement Learning Framework
Due to the absence of ground-truth summaries, we train a text generator based on the quality of generated summaries. However, summary generation requires a word-sampling process, which is non-differentiable. We thus consider RL to address the non-differentiability (Figure 1), and describe the proposed Markov decision process as follows.
States describe the possible combinations of input texts $\mathbf{t}$ and generated summaries $\mathbf{y}_t = [y_1, y_2, \cdots, y_t]$ at time $t$. The state at time $t$ can be formulated as $s_t = [\mathbf{t}, \mathbf{y}_t]$. Actions are the candidate next words from a vocabulary set $V$ at given states. A policy $\pi_\theta$ selects an action $a_t \in V$ as the next word $y_{t+1}$ based on a given state $s_t$, resulting in the next summary $\mathbf{y}_{t+1}$. The transition function determines the next state based on a state $s_t$ and an action $a_t$, i.e., $s_{t+1} = T(s_t, a_t) = [\mathbf{t}, \mathbf{y}_{t+1}]$. The reward $R(s_t, a_t)$ represents the summary quality when a target summary length $l$ is given. We obtain the reward of the generated summaries such that:

$$R(s_t, a_t) = \begin{cases} R(\mathbf{y}, \mathbf{t}, l) & \text{if } a_t = \text{[EOS]} \text{ or } |\mathbf{y}_{t+1}| = M_g, \\ 0 & \text{otherwise}, \end{cases}$$

where $\mathbf{y}$ denotes the generated summary (i.e., $\mathbf{y} = \mathbf{y}_t$ for simplicity), [EOS] is the end-of-sentence token, and $M_g$ is the maximum length of generated summaries. We design rewards for the following pertinent aspects of summaries:

• Content preservation A requirement for high-quality summaries is to preserve the gist of the input texts. We consider the semantic similarity between summaries and the corresponding texts:

$$R_C(\mathbf{y}, \mathbf{t}) = \mathrm{sim}(f(\mathbf{y}), f(\mathbf{t})), \qquad (1)$$

where $R_C \in [0, 1]$, sim is a similarity function, and $f$ is a function that embeds texts (i.e., $\mathbf{y}$ and $\mathbf{t}$) such as BERT. We use cosine similarity as sim; the specific embedding model is described in Section 4.1.4. The semantic similarity enables the model to robustly capture the meaning of texts despite different words in two texts, e.g., Who's the winner? and Who won the game?.
• Fluency Another requisite for summaries is fluency, representing how grammatically and semantically natural the generated summaries are. We use the perplexity as fluency:

$$\mathrm{PPL}(\mathbf{y}) = \Big( \prod_{t=1}^{|\mathbf{y}|} p_\psi(y_t \mid \mathbf{y}_{t-1}) \Big)^{-\frac{1}{|\mathbf{y}|}}, \qquad (2)$$

where PPL is the perplexity from a language model with parameters $\psi$, and $\mathbf{y}_{t-1}$ is the generated summary before time $t$. Low PPL indicates high fluency. We define the fluency reward:

$$R_F(\mathbf{y}) = \exp\big(-\mathrm{PPL}(\mathbf{y}) / \sigma_F\big), \qquad (3)$$

where $R_F \in (0, 1]$ and $\sigma_F \in \mathbb{R}^+$ is a tunable scaling factor that controls the steepness of $R_F$.

• Summary length We design our model to summarize texts in a desired length. We first insert the desired length $l$ (e.g., 8 words) at the beginning of an input text (e.g., '8:' in Figure 1), and then optimize the following reward:

$$R_L(\mathbf{y}, l) = \exp\big(-\big| |\mathbf{y}| - l \big| / \sigma_L\big), \qquad (4)$$

where $R_L \in (0, 1]$ and $\sigma_L \in \mathbb{R}^+$ is a tunable scaling factor. After training, we can control the summary length by changing the desired length.
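To make the reward design concrete, below is a minimal Python sketch of the three rewards. The exponential forms of $R_F$ and $R_L$ follow the reconstructed equations above, and the shift of cosine similarity from $[-1, 1]$ into $[0, 1]$ is our assumption where the text leaves details open; `summary_emb` and `text_emb` stand in for outputs of the embedding function $f$.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def content_reward(summary_emb, text_emb):
    # R_C in [0, 1]: cosine similarity between f(y) and f(t), shifted from
    # [-1, 1] to [0, 1] (the shift is an assumption, not stated in the text).
    return (cosine(summary_emb, text_emb) + 1.0) / 2.0

def fluency_reward(ppl, sigma_f):
    # R_F in (0, 1]: low perplexity from the language model -> reward near 1.
    return math.exp(-ppl / sigma_f)

def length_reward(summary_len, target_len, sigma_l):
    # R_L in (0, 1]: decays as |y| deviates from the desired length l.
    return math.exp(-abs(summary_len - target_len) / sigma_l)
```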

Policy Gradient
Policy gradient directly updates the policy parameters $\theta$ to minimize an objective function $J$:

$$J(\theta) = -\big(R(\mathbf{y}, \mathbf{t}, l) - R(\bar{\mathbf{y}}, \mathbf{t}, l)\big) \log \pi_\theta(\mathbf{y} \mid \mathbf{t}, l).$$

The gradient of the policy can be written as:

$$\nabla_\theta J(\theta) = -\big(R(\mathbf{y}, \mathbf{t}, l) - R(\bar{\mathbf{y}}, \mathbf{t}, l)\big) \nabla_\theta \log \pi_\theta(\mathbf{y} \mid \mathbf{t}, l),$$

where $\bar{\mathbf{y}}$ is a baseline summary whose words are greedily selected, i.e., $\bar{y}_{t+1} = \arg\max \pi_\theta(y_{t+1} \mid \bar{s}_t)$, instead of sampled, i.e., $y_{t+1} \sim \pi_\theta(y_{t+1} \mid s_t)$. The gradient has a direction that maximizes the likelihood of the sampled summary if $R(\mathbf{y}, \mathbf{t}, l) > R(\bar{\mathbf{y}}, \mathbf{t}, l)$.
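A minimal PyTorch sketch of this baseline-relative (self-critical) update follows; `token_log_probs` is assumed to hold the per-token log-probabilities of the sampled summary, and the two scalar rewards come from reward functions such as those sketched above.

```python
import torch

def self_critical_loss(token_log_probs: torch.Tensor,
                       sampled_reward: float,
                       baseline_reward: float) -> torch.Tensor:
    # token_log_probs: log pi_theta(y_{t+1} | s_t) for each sampled token,
    # shape (T,), differentiable w.r.t. the policy parameters.
    # The greedily decoded summary's reward acts as a baseline: samples that
    # beat the baseline get their likelihood pushed up, and vice versa.
    advantage = sampled_reward - baseline_reward  # treated as a constant
    return -advantage * token_log_probs.sum()
```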

Multi-Summary Learning Mechanism
We further improve the summary quality by making multiple summaries with varying lengths mutually enhance each other (Figure 2). The main idea is to use a high-quality summary of a certain length, which is easy to generate, to enhance the quality of a low-quality summary of another length, which is difficult to generate. We first generate multiple summaries of different lengths for each text:

$$\mathcal{Y} = \{\mathbf{y}^l \mid l \in L\},$$

where $\mathcal{Y}$ is the set of summaries generated for each length $l \in L$, $L$ is a set of lengths, and $\mathbf{y}^l$ is a summary generated for the length $l$. For brevity, we henceforth denote a target summary as $\mathbf{y}$ and the other summaries as $\mathbf{y}'$.
We then design the mechanism to make a summary semantically similar to the other summaries based on their mutual relationship:

$$R_M(\mathbf{y}, \mathbf{t}, l) = \sum_{\mathbf{y}' \in \mathcal{Y} \setminus \{\mathbf{y}\}} u(\mathbf{y}, \mathbf{y}', \mathbf{t}, l) \cdot \mathrm{sim}(f(\mathbf{y}), f(\mathbf{y}')),$$

where $u$ measures the usefulness of a summary $\mathbf{y}'$ to a target summary $\mathbf{y}$ generated for a target length $l$ given text $\mathbf{t}$. Hence, the model makes a target summary $\mathbf{y}$ semantically similar to another summary $\mathbf{y}'$ according to its usefulness, i.e., it refers to a summary $\mathbf{y}'$ only if $\mathbf{y}'$ is useful to the target summary $\mathbf{y}$.
We design the usefulness function $u$ by considering the summary quality and length: 1) given the input text $\mathbf{t}$, a target summary $\mathbf{y}$ should refer to another summary $\mathbf{y}'$ with a different length only if the quality of $\mathbf{y}'$ is higher than that of $\mathbf{y}$, i.e., $q(\mathbf{y}', \mathbf{t}) > q(\mathbf{y}, \mathbf{t})$, where $q \in [0, 1]$ is a function that measures the summary quality; 2) a summary generated for a length $l$ should refer to another summary whose length is similar to the target length $l$, i.e., the more similar the length, the higher the relevance. We define the usefulness function $u$:

$$u(\mathbf{y}, \mathbf{y}', \mathbf{t}, l) = \big[q(\mathbf{y}', \mathbf{t}) - q(\mathbf{y}, \mathbf{t})\big]_+ \cdot \exp\big(-\alpha \big| |\mathbf{y}'| - l \big|\big),$$

where $\alpha \in \mathbb{R}$ is a scaling factor, and $[\cdot]_+$ represents $\max(\cdot, 0)$. The first term produces a positive score only if $q(\mathbf{y}', \mathbf{t}) > q(\mathbf{y}, \mathbf{t})$. Similarly, the second term produces a high score if the length of another summary $\mathbf{y}'$ is close to the target length $l$ of the given summary $\mathbf{y}$. We measure the summary quality based on content preservation and fluency:

$$q(\mathbf{y}, \mathbf{t}) = R_C(\mathbf{y}, \mathbf{t}) \cdot R_F(\mathbf{y}).$$

Finally, the total reward $R_\star$ can be written with the multi-summary reward and the quality reward:

$$R_\star(\mathbf{y}, \mathbf{t}, l) = R(\mathbf{y}, \mathbf{t}, l) + R_M(\mathbf{y}, \mathbf{t}, l).$$

This mechanism makes summaries mutually enhance each other at training time, but summaries are generated independently at inference time. Thus, the complexity of inference does not increase.
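As a concrete illustration, here is a hedged sketch of the usefulness function and the multi-summary reward under the reconstructed forms above; the multiplicative combination and the exponential length term are assumptions, and the dictionary layout of the summaries is purely illustrative.

```python
import math

def usefulness(q_other, q_target, len_other, target_len, alpha):
    # [q(y') - q(y)]_+ : refer to y' only if its quality exceeds the target's.
    quality_gap = max(q_other - q_target, 0.0)
    # Summaries whose length is closer to the target length l weigh more.
    length_relevance = math.exp(-alpha * abs(len_other - target_len))
    return quality_gap * length_relevance

def multi_summary_reward(target, others, target_len, alpha, sim):
    # target / others: dicts with 'emb' (f(y)), 'quality' (q(y, t)),
    # and 'length' (|y|); sim is the same similarity used for R_C.
    return sum(
        usefulness(o["quality"], target["quality"],
                   o["length"], target_len, alpha)
        * sim(target["emb"], o["emb"])
        for o in others
    )
```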

Pretraining Task
We also devise a prompt-based text reconstruction (PTR) task (Table 1), whose main goal is to make our model learn to control the output length. We first apply perturbations to texts: shuffling, dropping, and adding words. We then insert the length of the original text at the beginning of the perturbed text as a prompt, e.g., '20:' in Table 1. By inserting the prompt into perturbed texts, the model can be explicitly informed of the target length of the original text. We train our model to reconstruct the original text from the perturbed text, which makes the model learn to control the output length and to reorder, add, and remove words. After pretraining, we perform the RL training with the pretrained model. We provide the details in Appendix A.2.
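The sketch below illustrates one way to build such a pretraining input; the perturbation rates and the full-shuffle step are illustrative assumptions rather than the paper's exact settings (those are in Appendix A.2 of the original).

```python
import random

def perturb_with_prompt(text, vocab, drop_p=0.1, add_p=0.1, seed=0):
    # Build a (perturbed input, original target) pair for the PTR task.
    rng = random.Random(seed)
    words = text.split()
    target_len = len(words)                 # length prompt = original length
    rng.shuffle(words)                      # 1. shuffle words
    words = [w for w in words if rng.random() > drop_p]  # 2. drop words
    noisy = []
    for w in words:                         # 3. add random vocabulary words
        if rng.random() < add_p:
            noisy.append(rng.choice(vocab))
        noisy.append(w)
    source = f"{target_len}: " + " ".join(noisy)  # prepend the length prompt
    return source, text                     # the model reconstructs `text`
```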

Datasets
We evaluate MSRP on benchmark datasets for sentence summarization. The Gigaword dataset contains a news headline per news article. The numbers of training and evaluation examples are 3,803,957 and 1,951, respectively. We use only the news articles to train MSRP so that our model does not draw on any article-headline pairs. We select 500 validation examples only for tuning the hyperparameters, as done in prior work (Schumann et al., 2020; Liu et al., 2022). We also use the DUC2004 dataset, which is designed only for evaluation; it contains 500 news articles with four headlines per article.

Metrics and Evaluation Protocol
We use ROUGE, a word-overlapping ratio between generated and human-written summaries: ROUGE-n for n-gram matching and ROUGE-L for longest common subsequence matching. We use ROUGE F1 (RF) on the Gigaword dataset, but ROUGE recall (RR) on the DUC2004 dataset, following its evaluation protocol. In addition, we measure the fidelity (i.e., content preservation) of generated summaries to input texts using SentenceBERT (Reimers and Gurevych, 2019) and the fluency of generated summaries with a language model, i.e., GPT-2 (Radford et al., 2019), based on Equation 3. Besides, since ROUGE increases as summaries get longer, we group models based on the average length of the generated summaries for fair comparisons, following Schumann et al. (2020) and Liu et al. (2022). We consider both settings of summarizing with a length condition (i.e., 8, 10, or 13 words) and a compression-ratio condition (i.e., 50% of the length of the input text), as done in the prior work. We also note that the evaluation protocol of DUC2004 truncates summaries that exceed 75 characters for fair comparisons in terms of the summary length.
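For reference, here is a minimal sketch of the ROUGE-n computation underlying this evaluation; it is a simplification of the official scorer, assuming whitespace tokenization and no stemming.

```python
from collections import Counter

def rouge_n(candidate, reference, n=1, use_f1=True):
    # Clipped n-gram overlap between a candidate and a reference summary.
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n])
                       for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate.split(), n), ngrams(reference.split(), n)
    overlap = sum((cand & ref).values())    # min counts = clipped matches
    if overlap == 0:
        return 0.0
    recall = overlap / sum(ref.values())
    if not use_f1:
        return recall                       # DUC2004 protocol: recall
    precision = overlap / sum(cand.values())
    return 2 * precision * recall / (precision + recall)  # Gigaword: F1
```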

Models Compared
Abstractive models Zajic et al. (2004) summarize texts using syntax tree trimming. Wang and Lee (2018) train a model with an adversarial and cycle consistency loss. Févry and Phang (2018) utilize a denoising autoencoder. Zhou and Rush (2019) model fidelity and fluency of summaries via contextual matching. Baziotis et al. (2019) stack autoencoders to impose the cycle consistency loss.
Extractive models The Lead baseline truncates texts from the beginning to the target lengths. West et al. (2019) iteratively delete words from a text to generate a summary based on a fluency score. Schumann et al. (2020) search for the best word combination from texts based on a hill-climbing algorithm. Liu et al. (2022) train a non-autoregressive transformer using summaries generated by Schumann et al. (2020) with corresponding input texts in a supervised manner. We also report another non-autoregressive model (Su et al., 2021) that is trained similarly to Liu et al. (2022).

Implementation Details
We use sent2vec (Pagliardini et al., 2018) as a word embedding-based projection function (i.e., $f$ in Equation 1) that is trained on the text corpus (i.e., news articles), following the prior work (Schumann et al., 2020; Liu et al., 2022). We also report the results of MSRP with SentenceBERT as a BERT-based projection function (Section 4.5). As a language model, we use pretrained GPT-2 to obtain the fluency reward ($\psi$ in Equation 2). We fine-tuned the language model on a target corpus (i.e., news headlines) as done in prior work (Zhou and Rush, 2019; Schumann et al., 2020). As a policy $\pi_\theta$, we use pretrained T5 (Raffel et al., 2020). For the multi-summary learning mechanism, we train MSRP with a set of lengths $L = \{8, 10, 13\}$ for the length-based evaluation and with a set of compression ratios $L = \{30\%, 40\%, 50\%\}$ for the compression ratio-based evaluation. During beam search, we select a summary that maximizes the rewards (i.e., $R_C$, $R_F$, $R_L$) and does not include predefined patterns. We provide more details in Appendix A.3.
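A possible implementation of the final beam-selection step is sketched below; the text does not specify the predefined patterns or the exact reward combination, so `banned_patterns` and `reward_fn` are hypothetical placeholders.

```python
def select_beam_candidate(candidates, reward_fn, banned_patterns):
    # Keep candidates free of banned patterns, then pick the candidate that
    # maximizes the combined reward (e.g., R_C * R_F * R_L via reward_fn).
    valid = [c for c in candidates
             if not any(p in c for p in banned_patterns)]
    pool = valid or candidates  # fall back if every candidate was filtered
    return max(pool, key=reward_fn)
```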

Automatic Evaluation
We compare the summary quality of the models in Tables 2 and 3, and make the following observations. MSRP consistently shows the best ROUGE scores compared to both abstractive and extractive models over different groups of summary length. In terms of fidelity, MSRP consistently achieves the best score compared to the baseline models, although Schumann et al. (2020) and Liu et al. (2022) also consider the fidelity score ($R_C$ in MSRP) during training time. MSRP achieves competitive fluency scores, while being generally better than the best baseline model (Liu et al., 2022). Moreover, as Liu et al. (2022) do not use a pretrained model, we include another baseline, denoted by Liu et al. (2022)†, that uses the same initial model (i.e., pretrained T5) as MSRP for fair comparisons. We observe that MSRP still outperforms Liu et al. (2022)† with the pretrained model.
To investigate the effect of our RL framework, we consider a variant of MSRP that is not trained under the RL framework (denoted by MSRP w/o RL). This model is substantially inferior to MSRP and the baseline models, indicating that our RL framework is vital to surpassing the recent extractive models.
In a nutshell, MSRP achieves the best ROUGE and fidelity scores and competitive fluency thanks to our RL framework. We also observe that the inference time of MSRP is competitively short compared to the state-of-the-art baseline models (Appendix A.4).

Human Evaluation
We perform human evaluations to compare the summary quality between MSRP and the baseline models, i.e., Schumann et al. (2020) and Liu et al. (2022), on Gigaword data with 10 words as the summary length (Table 4). We provide the summaries generated by MSRP and each baseline model, along with the corresponding input texts, to annotators, who are asked to choose the better summary in terms of fidelity and fluency. We ask a global annotation corporation to have three native speakers annotate 100 summaries. We use majority voting and unanimity to consolidate the annotators' responses.
We analyze the inter-annotator agreement based on Fleiss' kappa κ, which indicates fair agreement for the comparison between MSRP and Schumann et al. (2020) and moderate agreement for the comparison between MSRP and Liu et al. (2022).
In Table 4, MSRP substantially outperforms the baseline models on both criteria. In particular, the annotators indicate that MSRP generates more fluent summaries than Schumann et al. (2020), despite the latter's highest fluency score in Table 2. Such a discrepancy between automatic and human evaluation results has also been observed in recent work (Kuribayashi et al., 2021), and thus we argue that human evaluations are crucial for accurately evaluating the fluency of generated summaries. From this experiment, we conclude that MSRP is indeed superior to the baseline models in both automatic and human evaluations.

Frequency Analysis of New Words
We demonstrate the benefit of MSRP as an abstractive model in Table 5. In this experiment, we examine the generated summaries that contain new words, i.e., words that do not appear in the input texts.
We observe that the ratio of summaries containing new words is around 50%, and roughly 1.3 new words appear per summary. This result indicates that MSRP frequently performs the abstractive operation (i.e., generating new words), which helps MSRP achieve higher summary quality than the extractive baselines, which merely select words from the input texts. We also report the POS tags of new words, and observe that MSRP mainly generates prepositions and nouns as new words. We illustrate the generated summaries with new words in Section 4.8.

Ablation Study
This section provides an ablation study to inspect the effect of each component in MSRP (Table 6). We first train MSRP without the multi-summary learning mechanism (− MSL) and without the prompt-based text reconstruction task (− PTR), and observe that the performance generally degrades. Thus, both components are necessary to enhance the summary quality. In Sections 4.6 and 4.7, we provide in-depth analyses of each component. We then train MSRP without the fluency (−R_F) and content preservation (−R_C) rewards, and observe that both rewards are essential to generating high-quality summaries.
We further compare the semantic similarity with the word-level similarity adopted by prior abstractive models (Wang and Lee, 2018; Baziotis et al., 2019). Following their approaches, we build an autoencoder with an additional seq2seq model (i.e., pretrained T5). We then design a reward that minimizes the reconstruction loss $L_{AE}$ with a scaling factor. The result shows that the semantic similarity is a reason for the superior performance of MSRP compared to prior abstractive models.
Lastly, we replace the projection function $f$ from sent2vec with SentenceBERT (SBERT as $f$) and observe further improvements in ROUGE. This result implies that an accurate projection function $f$ can enhance the summary quality of MSRP.

Effect of Multi-Summary Learning
We investigate which summary length benefits from the MSL mechanism in Figure 3. MSRP tends to generate higher-quality summaries at the short length, i.e., 8 words, than our model not trained under the MSL mechanism (MSRP w/o MSL). This result suggests that MSRP can better learn to generate short summaries by referring to the corresponding long summaries than by generating short summaries independently.

Effect of Pretraining Task
In Figure 4, we inspect the effect of the PTR task. MSRP optimizes the rewards (particularly the length reward $R_L$) more quickly than the model that is not pretrained (MSRP w/o PTR). This result implies that the PTR task enables the model to learn how to control the summary length and summarize before RL training. We thus posit that the PTR task improves the summary quality, as the RL training takes advantage of the well-initialized model parameters.

Case Study
We study the generated summaries to better understand the behavior and benefits of MSRP compared to the best-performing baseline models (Table 7). In the top example, MSRP generates an acronym, pm, to replace prime minister, a new word that does not appear in the input text. Similarly, MSRP generates another new word, will, resulting in a summary similar to the human-written one that cannot be produced by the extractive operation alone.
In the bottom example, MSRP changes the past tense of the word announced to the present tense announces, which is more appropriate for news headlines than the past tense (Chovanec, 2003). In contrast, the baseline models use reforms as a verb, which can be reasonable. However, MSRP preserves both important words, announces and reforms, so that its summary is more similar to the referential summary. We thus affirm that MSRP surpasses the state-of-the-art extractive models by performing abstractive operations for summarization.

Conclusion
This work employs RL for unsupervised abstractive sentence summarization with rewards representing the summary quality and length. We invent the multi-summary learning mechanism to make summaries with varying lengths mutually enhance each other. In addition, we design the prompt-based text reconstruction task to further improve the RL training. Experimental results show that MSRP achieves the state-of-the-art summary quality in both automatic and human evaluations.

Limitations
RL enables summarization models to learn how to summarize with rewards representing the summary quality even though the rewards are non-differentiable. However, RL requires a word-sampling process to generate summaries at training time. Thus, the computation time per input text is inherently longer than that of sequence-to-sequence training with the cross-entropy loss, which the best baseline adopts (Liu et al., 2022).
As a remedy, we expect that non-autoregressive models can enhance the training efficiency of the RL framework by generating words in parallel instead of sequentially, i.e., autoregressive generation. An issue of non-autoregressive models is the inferior quality of generated texts compared to autoregressive models (Su et al., 2021), as non-autoregressive models are limited in considering previously generated words. Thus, future work can study non-autoregressive models in the RL framework to enhance training efficiency while maintaining the summary quality.
It is worth noting that the total training time of MSRP is shorter than that of the best baseline (Liu et al., 2022) despite the RL training. The best baseline depends on the summaries generated by Schumann et al. (2020), whose inference time is excessively long due to the search operation. Based on the inference time in Appendix A.4, 27 hours are required to generate summaries for the 3M texts used by Liu et al. (2022), while the training time of MSRP, including the pretraining task, is about 8 hours. Thus, MSRP is more efficient in terms of the total training time than the best baseline once its data-generation time is considered.
The goal of sentence summarization is to shorten a text (i.e., a long sentence) $\mathbf{t} = [w_1, w_2, \cdots, w_{|\mathbf{t}|}]$ into a short summary $\mathbf{y} = [y_1, y_2, \cdots, y_{|\mathbf{y}|}]$, where $w, y$ are words and $|\mathbf{y}| < |\mathbf{t}|$. It is important to note that text-summary pairs are not available for training models. In other words, we focus on unsupervised sentence summarization.

Figure 1: Reinforcement learning with a length prompt.

Figure 2: Multi-summary learning mechanism.

Figure 3: Effect of multi-summary learning mechanism.
Table 7: Case study with generated summaries. NAUS: Liu et al. (2022), HC: Schumann et al. (2020).

Input: israeli prime minister shimon peres said monday he was confident the ceasefire in lebanon would hold because it was in the best interests of both countries as well as syria .
Reference: peres confident ceasefire will hold
MSRP: israeli pm confident ceasefire in lebanon will hold
NAUS: israeli minister shimon peres confident ceasefire in Lebanon
HC: israeli prime minister shimon peres confident in syria

Input: president bill clinton announced reforms of the central intelligence agency aimed at restoring credibility in an espionage agency tarnished by the discovery of a russian mole in its midst .
Reference: clinton announces us intelligence reforms
MSRP: president bill clinton announces reforms of intelligence agency
NAUS: bill reforms intelligence agency aimed at restoring credibility
HC: clinton reforms intelligence agency aimed at restoring credibility

Table 1: Example of text perturbation with a length prompt. Changes in each step are marked in red in the original figure.

Original text: three researchers on monday won the nobel medicine prize for discovering how nitric oxide acts as a signal molecule
1. Shuffle: nitric three researchers on monday won the nobel prize for discovering how signal medicine oxide acts as a molecule
2. Drop: nitric three researchers on monday won the nobel prize for discovering how signal medicine oxide acts as a molecule
3. Add & Prompt: 20: three crashing flight researchers town on 103 won the down nobel medicine on prize for tiny how this nitric oxide as a signal molecule

Table 2: Automatic evaluation on the Gigaword dataset. ∆_R: the improvement in total ROUGE of MSRP over each model; Len: average length of summaries; †: Liu et al. (2022) with the same pretrained model used for MSRP.

Table 3: Automatic evaluation on the DUC2004 dataset. FD and FL stand for fidelity and fluency, respectively.

Table 4: Human evaluation results. κ denotes Fleiss' kappa representing inter-annotator agreements.

Table 5: Statistics of new words in summaries generated by MSRP (top) and the meaning of POS tags (bottom). l denotes the target summary length.