Efficient and Interpretable Compressive Text Summarisation with Unsupervised Dual-Agent Reinforcement Learning

Compressive text summarisation offers a balance between the conciseness issue of extractive summarisation and the factual hallucination issue of abstractive summarisation. However, most existing compressive summarisation methods are supervised, relying on the expensive effort of creating a new training dataset with corresponding compressive summaries. In this paper, we propose an efficient and interpretable compressive summarisation method that utilises unsupervised dual-agent reinforcement learning to optimise a summary's semantic coverage and fluency by simulating human judgment on summarisation quality. Our model consists of an extractor agent and a compressor agent, and both agents have a multi-head attentional pointer-based structure. The extractor agent first chooses salient sentences from a document, and then the compressor agent compresses these extracted sentences by selecting salient words to form a summary, without using reference summaries to compute the summary reward. To the best of our knowledge, this is the first work on unsupervised compressive summarisation. Experimental results on three widely used datasets (Newsroom, CNN/DM, and XSum) show that our model achieves promising performance and a significant improvement on Newsroom in terms of the ROUGE metric, as well as interpretability of the semantic coverage of summarisation results.


Introduction
Most existing works on neural text summarisation are extractive, abstractive, or compressive. Extractive methods select salient sentences from a document to form its summary and ensure the production of grammatically and factually correct summaries. These methods usually follow the sentence ranking conceptualisation (Narayan et al., 2018b; Liu and Lapata, 2019; Zhong et al., 2020). Supervised models commonly rely on creating proxy extractive training labels (Nallapati et al., 2017; Jia et al., 2021; Mao et al., 2022; Klaus et al., 2022), which can be noisy and may not be reliable. Various unsupervised methods (Zheng and Lapata, 2019; Padmakumar and He, 2021; Liu et al., 2021) were proposed to leverage pre-trained language models to compute sentence similarities and select important sentences. Although these methods have significantly improved summarisation performance, the redundant information that appears in the salient sentences may not be minimised effectively.
Abstractive methods formulate the task as a sequence-to-sequence generation task, with the document as the input sequence and the summary as the output sequence (See et al., 2017). As supervised learning with ground-truth summaries may not provide useful insights on human judgment approximation, reinforcement training was proposed to optimise the ROUGE metric (Parnell et al., 2021) and to fine-tune a pre-trained language model (Laban et al., 2020). Prior studies showed that these generative models are highly prone to external hallucination (Maynez et al., 2020).
Compressive summarisation is a recent approach that selects words, instead of sentences, from an input document to form a summary, which improves the factuality and conciseness of the summary. Compressive document summarisation is usually formulated as a two-stage extract-then-compress approach (Zhang et al., 2018; Mendes et al., 2019; Xu and Durrett, 2019; Desai et al., 2020): it first extracts salient sentences from a document and then compresses the extracted sentences to form the summary. Most of these methods are supervised and require a parallel dataset with document-summary pairs for training. However, the ground-truth summaries of existing datasets are usually abstractive and do not contain the supervision information needed for extractive or compressive summarisation (Xu and Durrett, 2019; Mendes et al., 2019; Desai et al., 2020).
Therefore, to address these limitations, we propose a novel unsupervised compressive summarisation method with a dual-agent reinforcement learning strategy to mimic human judgment, namely URLComSum. As illustrated in Figure 1, URLComSum consists of two modules, an extractor agent and a compressor agent. We model the sentence and word representations using an efficient Bi-LSTM (Graves and Schmidhuber, 2005) with multi-head attention (Vaswani et al., 2017) to capture both the long-range dependencies and the relationship between each word and each sentence. We use a pointer network (Vinyals et al., 2015) to find the optimal subset of sentences and words to be extracted, since the Pointer Network is well known for tackling combinatorial optimisation problems. The extractor agent uses a hierarchical multi-head attentional Bi-LSTM model for learning the sentence representations of the input document and a pointer network for extracting the salient sentences of a document given a length budget. To further compress these extracted sentences all together, the compressor agent uses a multi-head attentional Bi-LSTM model for learning the word representations and a pointer network for selecting the words to assemble a summary.
As an unsupervised method, URLComSum does not require a parallel training dataset. We propose an unsupervised reinforcement learning training procedure to mimic human judgment: rewarding the model when it achieves high summary quality in terms of semantic coverage and language fluency. Inspired by Word Mover's Distance (Kusner et al., 2015), the semantic coverage reward is measured by the Wasserstein distance (Peyré et al., 2019) between the semantic distribution of the document and that of the summary. The fluency reward is measured by the Syntactic Log-Odds Ratio (SLOR) (Pauls and Klein, 2012). SLOR is a reference-less fluency evaluation metric, which is effective in sentence compression (Kann et al., 2018) and correlates better with human acceptability judgments (Lau et al., 2017).
The key contributions of this paper are:
• We propose URLComSum, the first unsupervised compressive summarisation method with dual-agent reinforcement learning.
• We design an efficient and interpretable multi-head attentional pointer-based neural network for learning sentence and word representations and for extracting salient sentences and words.
• We propose to mimic human judgment by optimising summary quality in terms of the semantic coverage reward, measured by Wasserstein distance, and the fluency reward, measured by Syntactic Log-Odds Ratio (SLOR).
• Comprehensive experimental results on three widely used datasets, namely CNN/DM, XSum, and Newsroom, demonstrate that URLComSum achieves promising performance.

Related Work
Most of the existing works on neural text summarisation are extractive, abstractive, or compressive.

Extractive Methods
Extractive methods usually follow the sentence ranking conceptualisation, and an encoder-decoder scheme is generally adopted: an encoder formulates document or sentence representations, and a decoder predicts extraction classification labels. Supervised models commonly rely on creating proxy extractive training labels (Cheng and Lapata, 2016; Nallapati et al., 2017; Jia et al., 2021), which can be noisy and may not be reliable. Some methods were proposed to tackle this issue by training with reinforcement learning (Narayan et al., 2018b; Luo et al., 2019) to optimise the ROUGE metric directly. Various unsupervised methods (Zheng and Lapata, 2019; Padmakumar and He, 2021) were also proposed to leverage pre-trained language models to compute sentence similarities and select important sentences. Although these methods have significantly improved summarisation performance, since entire sentences are extracted individually, the redundant information that appears in the salient sentences may not be minimised effectively.

Abstractive Methods
Abstractive methods formulate text summarisation as a sequence-to-sequence generation task, with the source document as the input sequence and the summary as the output sequence. Most existing methods follow the supervised RNN-based encoder-decoder framework (See et al., 2017). As supervised learning with ground-truth summaries may not provide useful insights on human judgment approximation, reinforcement training was proposed to optimise the ROUGE metric (Paulus et al., 2018; Parnell et al., 2021) and to fine-tune a pre-trained language model (Laban et al., 2020). These models naturally learn to integrate knowledge from the training data while generating an abstractive summary. Prior studies showed that these generative models are highly prone to external hallucination and thus may generate content that is unfaithful to the original document (Maynez et al., 2020).

Compressive Methods
Compressive methods select words from a given document to assemble a summary. Due to the lack of training datasets, works on compressive summarisation have emerged only recently (Zhang et al., 2018; Mendes et al., 2019; Xu and Durrett, 2019; Desai et al., 2020). Compressive document summarisation is usually formulated as a two-stage extract-then-compress approach: it first extracts salient sentences from a document, then compresses the extracted sentences to form the summary. Most of these methods are supervised and require a parallel dataset with document-summary pairs for training. However, the ground-truth summaries of existing datasets are usually abstractive and do not contain the supervision information needed for extractive or compressive summarisation. Several reinforcement learning based methods (Zhang et al., 2018) use existing abstractive datasets for training, which are not aligned with the compression objective. Note that existing compressors often perform compression sentence by sentence; as a result, the duplicated information among multiple sentences could be overlooked. Therefore, to address these limitations, we propose a novel unsupervised compressive method that explores a dual-agent reinforcement learning strategy to mimic human judgment and performs text compression instead of sentence compression.

Methodology
As shown in Figure 1, our proposed compressive summarisation method, namely URLComSum, consists of two components, an extractor agent and a compressor agent. Specifically, the extractor agent selects salient sentences from a document D to form an extractive summary S E , and then the compressor agent compresses S E by selecting words to assemble a compressive summary S C .

Extractor Agent
Given a document D consisting of a sequence of M sentences {s i | i = 1, ..., M}, with each sentence s i consisting of a sequence of N words {we ij | j = 1, ..., N}, the extractor agent aims to produce an extractive summary S E by learning sentence representations and selecting L E sentences from D. As illustrated in Figure 2, we design a hierarchical multi-head attentional sequential model for learning the sentence representations of the document and use a Pointer Network to extract sentences based on their representations.

Hierarchical Sentence Representation
To model the local context of each sentence and the global context between sentences, we use two levels of Bi-LSTMs to model this hierarchical structure: one at the word level to encode the word sequence of each sentence, and one at the sentence level to encode the sentence sequence of the document.
To model the context-dependency of the importance of words and sentences, we apply two levels of multi-head attention mechanism (Vaswani et al., 2017), one at each of the two-level Bi-LSTMs.
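To make one attention level concrete, below is a minimal NumPy sketch of scaled dot-product multi-head attention in this configuration, where the queries come from the Bi-LSTM outputs and the keys and values come from the layer below. All weight matrices and dimensions here are illustrative random stand-ins for parameters that would be learned jointly with the model; this is a sketch, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(Q, K, V, num_heads, W_q, W_k, W_v, W_o):
    """Scaled dot-product multi-head attention (Vaswani et al., 2017).
    Q: (n_q, d_model) queries; K, V: (n_k, d_model) keys/values."""
    d_model = Q.shape[-1]
    d_head = d_model // num_heads
    q, k, v = Q @ W_q, K @ W_k, V @ W_v           # linear projections
    outputs = []
    for h in range(num_heads):                     # one slice per head
        sl = slice(h * d_head, (h + 1) * d_head)
        scores = q[:, sl] @ k[:, sl].T / np.sqrt(d_head)
        outputs.append(softmax(scores, axis=-1) @ v[:, sl])
    return np.concatenate(outputs, axis=-1) @ W_o  # merge heads

rng = np.random.default_rng(0)
n_words, d_model, heads = 5, 8, 2                  # toy sizes (assumptions)
le_w = rng.standard_normal((n_words, d_model))     # Bi-LSTM outputs (queries)
xe = rng.standard_normal((n_words, d_model))       # word embeddings (keys/values)
Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4))
ae_w = multi_head_attention(le_w, xe, xe, heads, Wq, Wk, Wv, Wo)
he_input = np.concatenate([le_w, ae_w], axis=-1)   # concatenation fed to the next Bi-LSTM
```

In the full model, the concatenated `he_input` would be consumed by a further Bi-LSTM, and the same pattern is repeated at the sentence level.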
Given a sentence s i, we encode its words into word embeddings xe i = {xe ij | j = 1, ..., N} by xe ij = Enc(we ij), where Enc() denotes a word embedding lookup table. The sequence of word embeddings is then fed into the word-level Bi-LSTM to produce an output representation of the words, le w. The word-level multi-head attention output ae w is obtained by defining Q = le w and K = V = xe. The concatenation of le w i and ae w i of the words is fed into a Bi-LSTM, and the output is concatenated to obtain the local context representation he ws i for each sentence s i.

To further model the global context between sentences, we apply a similar structure at the sentence level. he ws = {he ws i | i = 1, ..., M} is fed into the sentence-level Bi-LSTM to produce an output representation of the sentences, le s. The sentence-level multi-head attention output ae s = {ae s 1, ..., ae s M} is obtained by defining Q = le s and K = V = he ws. The concatenation of the Bi-LSTM output le s and the multi-head attention output ae s is fed into a Bi-LSTM to obtain the final sentence representations he s = {he s 1, ..., he s M}.

Similar to (Chen and Bansal, 2018), we use an LSTM-based Pointer Network to decode the above sentence representations he s = {he s 1, ..., he s M} and extract sentences recurrently to form an extractive summary S E = {A 1, ..., A k, ..., A L E} with L E sentences, where A k denotes the k-th extracted sentence.
At the k-th time step, the pointer network receives the sentence representation of the previously extracted sentence and maintains a hidden state de k. It first obtains a context vector by attending to he s, where v, W 1, and W 2 are the learnable parameters of the attention, and then predicts the extraction probability p(A k) of each sentence. Decoding iterates until L E sentences are selected to form S E.
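The pointing step above can be sketched with the standard additive attention used by Pointer Networks: u i = v^T tanh(W 1 h i + W 2 d k), followed by a softmax over input positions. The parameters below are random placeholders for the learnable v, W 1, W 2; this is a minimal sketch of one decoding step, not the trained model.

```python
import numpy as np

def pointer_scores(h_s, d_k, v, W1, W2):
    """Additive pointing scores of a Pointer Network at one decoding step.
    h_s: (M, d) sentence representations; d_k: (d,) decoder hidden state.
    Returns the extraction probability distribution p(A_k) over sentences."""
    u = np.tanh(h_s @ W1.T + d_k @ W2.T) @ v   # (M,) unnormalised scores
    e = np.exp(u - u.max())
    return e / e.sum()                          # softmax over input positions

rng = np.random.default_rng(1)
M, d = 4, 6                                     # toy sizes (assumptions)
h_s = rng.standard_normal((M, d))               # sentence representations he_s
d_k = rng.standard_normal(d)                    # decoder hidden state at step k
v = rng.standard_normal(d)
W1, W2 = rng.standard_normal((d, d)), rng.standard_normal((d, d))
p = pointer_scores(h_s, d_k, v, W1, W2)
picked = int(np.argmax(p))                      # greedy choice of the next sentence
```

During training, the next pointer may instead be sampled from `p`, which is what the SCST procedure described later relies on.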

Compressor Agent
Given an extractive summary S E consisting of a sequence of words wc = {wc i | i = 1, ..., N}, the compressor agent aims to produce a compressive summary S C by selecting L C words from S E. As illustrated in Figure 3, it uses a multi-head attentional Bi-LSTM model to learn the word representations and a pointer network to extract words based on their representations.

Word Representation
Given a sequence of words wc, we encode the words into word embeddings xc = {xc i | i = 1, ..., N} by xc i = Enc(wc i). The sequence of word embeddings is then fed into a Bi-LSTM to produce the output representation of the words, lc w. The multi-head attention output ac w = {ac w 1, ..., ac w N} is obtained by defining Q = lc w and K = V = xc. The concatenation of lc w and ac w of the words is fed into a Bi-LSTM to obtain the representation hc w i for each word wc i.

Word-Level Extraction
The word extractor of the compressor agent shares the same structure as that of the extractor agent's sentence extractor. To select the words based on the above word representations hc w = {hc w 1 , ..., hc w N }, the word extractor decodes and extracts words recurrently to produce {B 1 , ..., B k , ..., B L C }, where B k denotes the word extracted at the k-th time step. The selected words are reordered by their locations in the input document and assembled to form the compressive summary S C .
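The final reordering step can be sketched in a few lines: the pointed-to word positions are sorted back into document order and joined into the summary. Deduplication of repeated pointers is an assumption here, not something the paper specifies.

```python
def assemble_summary(words, selected_indices):
    """Reorder the word indices chosen by the pointer network by their
    original position and join them into the compressive summary S_C."""
    ordered = sorted(set(selected_indices))       # restore document order
    return " ".join(words[i] for i in ordered)

words = "the cat sat on the mat today".split()
# suppose the pointer network selected these positions out of order
print(assemble_summary(words, [6, 1, 2]))  # → "cat sat today"
```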

Reward in Reinforcement Learning
We use the compressive summary S C to compute the reward of reinforcement learning, Reward(D, S) = w cov · Reward cov (D, S) + w flu · Reward flu (S), where w cov and w flu denote the weights of the two rewards.
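The weighted combination is straightforward; the sketch below uses the weight setting w cov = 1, w flu = 2 reported later in the experimental settings as its defaults.

```python
def total_reward(reward_cov, reward_flu, w_cov=1.0, w_flu=2.0):
    """Weighted sum of the semantic coverage and fluency rewards.
    Defaults follow the hyperparameters reported in the experiments."""
    return w_cov * reward_cov + w_flu * reward_flu

print(total_reward(0.6, 0.3))  # → 1.2
```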

Semantic Coverage Reward
We compute Reward cov with the Wasserstein distance between the corresponding semantic distributions of the document D and the summary S, which is the minimum cost required to transport the semantics from D to S. We denote D = {d i | i = 1, ..., N} to represent a document, where d i indicates the count of the i-th token (i.e., a word or phrase in a vocabulary of size N). Similarly, for a summary S = {s j | j = 1, ..., N}, s j denotes the count of the j-th token. The semantic distribution of a document is characterised in terms of normalised term frequency with stopwords removed. The term frequencies of the i-th token in the document D and the j-th token in the summary S are denoted as TF D (i) and TF S (j), respectively. By defining TF D = {TF D (i)} ∈ R N and TF S = {TF S (j)} ∈ R N, we obtain the semantic distributions of D and S, respectively.
The transportation cost matrix C is obtained by measuring the semantic similarity between each pair of tokens. Given a pre-trained tokeniser and token embedding model with N tokens, let v i denote the feature embedding of the i-th token. The transport cost c ij from the i-th to the j-th token is then computed from the cosine similarity: c ij = 1 − cos(v i, v j). An optimal transport plan T* = {t* ij} ∈ R N×N that minimises the transportation cost can be obtained by solving the optimal transport problem (Peyré et al., 2019). Note that the transport plan can be used to interpret the transportation of tokens from the document to the summary, which brings interpretability to our URLComSum method.
The Wasserstein distance between the two semantic distributions TF D and TF S under the optimal transport plan is computed as d W (TF D, TF S | C) = Σ i,j t* ij c ij. Reward cov (D, S) is then defined from this distance, such that a smaller Wasserstein distance yields a higher semantic coverage reward.
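The computation above can be sketched end to end: build the cosine cost matrix from token embeddings, solve the transport problem as a linear program over the plan T, and read off the objective as d W. The paper uses BERT embeddings and a dedicated OT solver; here `emb` is any toy (N, d) embedding matrix and SciPy's generic LP solver stands in, so this is a minimal illustration rather than the actual pipeline.

```python
import numpy as np
from scipy.optimize import linprog

def coverage_distance(tf_doc, tf_sum, emb):
    """Wasserstein distance between two term-frequency distributions with
    cost c_ij = 1 - cosine(v_i, v_j), solved as a linear program."""
    n = len(tf_doc)
    V = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    C = 1.0 - V @ V.T                        # transport cost matrix
    # LP: minimise sum_ij t_ij c_ij subject to row sums = tf_doc,
    # column sums = tf_sum, t_ij >= 0.
    A_eq, b_eq = [], []
    for i in range(n):                       # row-marginal constraints
        row = np.zeros((n, n)); row[i, :] = 1
        A_eq.append(row.ravel()); b_eq.append(tf_doc[i])
    for j in range(n):                       # column-marginal constraints
        col = np.zeros((n, n)); col[:, j] = 1
        A_eq.append(col.ravel()); b_eq.append(tf_sum[j])
    res = linprog(C.ravel(), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=(0, None), method="highs")
    return res.fun                           # d_W(TF_D, TF_S | C)

rng = np.random.default_rng(2)
emb = rng.standard_normal((4, 8))            # toy token embeddings (assumption)
tf_doc = np.array([0.4, 0.3, 0.2, 0.1])
d_same = coverage_distance(tf_doc, tf_doc, emb)                 # identical distributions
d_diff = coverage_distance(tf_doc, np.array([0.1, 0.2, 0.3, 0.4]), emb)
```

Transporting a distribution onto itself costs (almost) nothing, while shifting mass between distinct tokens incurs positive cost, which is exactly the behaviour the reward exploits. In practice a dedicated OT library scales far better than a dense LP.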

Fluency Reward
We utilise the Syntactic Log-Odds Ratio (SLOR) (Pauls and Klein, 2012) to measure Reward flu (S), which is defined as Reward flu (S) = (1/|S|) (log P LM (S) − log P U (S)), where P LM (S) denotes the probability of the summary assigned by a pre-trained language model LM, P U (S) = Π t∈S P(t) denotes the unigram probability used for rare-word adjustment, and |S| denotes the length of the summary.
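Given per-token log-probabilities, SLOR reduces to a length-normalised difference of log-probabilities. In the paper the language model is GPT-2; the numbers below are illustrative placeholders standing in for real LM and unigram scores.

```python
def slor(token_logprobs_lm, token_logprobs_unigram):
    """Syntactic Log-Odds Ratio (Pauls and Klein, 2012):
    SLOR(S) = (log P_LM(S) - log P_U(S)) / |S|.
    Both arguments are per-token log-probabilities, so the sequence
    log-probabilities are just their sums."""
    log_p_lm = sum(token_logprobs_lm)        # log P_LM(S)
    log_p_u = sum(token_logprobs_unigram)    # log P_U(S) = sum of unigram log-probs
    return (log_p_lm - log_p_u) / len(token_logprobs_lm)

# a fluent sequence gets LM probabilities well above its unigram baseline
print(slor([-2.0, -1.5, -2.5], [-4.0, -5.0, -6.0]))  # → 3.0
```

Because the unigram term subtracts out word rarity, a summary is not penalised merely for containing infrequent but well-formed vocabulary.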
We use the Self-Critical Sequence Training (SCST) method (Rennie et al., 2017), since this training algorithm has demonstrated promising results in text summarisation (Paulus et al., 2018; Laban et al., 2020). For a given input document, the model produces two separate output summaries: the sampled summary S s, obtained by sampling the next pointer t i from the probability distribution at each time step i, and the baseline summary Ŝ, obtained by always picking the most likely next pointer t i at each i. The training objective is to minimise the loss L = −(Reward(D, S s) − Reward(D, Ŝ)) · (1/N) Σ i log p(t i), where N denotes the length of the pointer sequence, i.e., the number of extracted sentences for the extractor agent and the number of extracted words for the compressor agent. Minimising the loss is equivalent to maximising the conditional likelihood of S s if the sampled summary outperforms the baseline summary Ŝ, i.e., Reward(D, S s) − Reward(D, Ŝ) > 0, thus increasing the expected reward of the model.
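The SCST objective can be sketched as follows. The mean-over-steps normalisation is an assumption on my part; the essential point is the sign behaviour: when the sampled summary beats the greedy baseline, minimising the loss increases the likelihood of the sampled pointer sequence.

```python
import numpy as np

def scst_loss(logprobs_sampled, reward_sampled, reward_baseline):
    """Self-critical sequence training loss (Rennie et al., 2017):
    L = -(r(S^s) - r(S_hat)) * mean_i log p(t_i^s).
    logprobs_sampled: log-probabilities of the sampled pointers."""
    advantage = reward_sampled - reward_baseline   # self-critical baseline
    return -advantage * np.mean(logprobs_sampled)

logp = np.array([-0.5, -1.0, -0.25])               # toy pointer log-probs
print(scst_loss(logp, reward_sampled=0.8, reward_baseline=0.5))
```

With a positive advantage the loss is positive (log-probabilities are negative), and its gradient with respect to the log-probabilities pushes them upward; with a negative advantage the push reverses.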

Experimental Settings
We conducted comprehensive experiments on three widely used datasets: Newsroom (Grusky et al., 2018), CNN/DailyMail (CNN/DM) (Hermann et al., 2015), and XSum (Narayan et al., 2018a). We set the LSTM hidden size to 150 and the number of recurrent layers to 3. We performed a hyperparameter search for w cov and w flu and set w cov = 1, w flu = 2 in all our experiments, since this setting provides more balanced results across the datasets. We trained URLComSum with AdamW (Loshchilov and Hutter, 2018) with a learning rate of 0.01 and a batch size of 3. We obtained the word embeddings from pre-trained GloVe (Pennington et al., 2014). We used BERT as the pre-trained embedding model for computing the semantic coverage reward, and GPT-2 as the pre-trained language model for computing the fluency reward, due to its strong representation capacity.
As shown in Table 1, we followed (Mendes et al., 2019) to set L E for Newsroom and (Zhong et al., 2020) to set L E for CNN/DM and XSum. We also followed their protocols to set L C by matching the average number of words in the reference summaries. We compare our model with existing compressive methods, which are all supervised, including LATENTCOM (Zhang et al., 2018), EXCONSUMM (Mendes et al., 2019), JECS (Xu and Durrett, 2019), and CUPS (Desai et al., 2020). Since our method is unsupervised, we also compare it with unsupervised extractive and abstractive methods, including TextRank (Mihalcea and Tarau, 2004), PacSum (Zheng and Lapata, 2019), PMI (Padmakumar and He, 2021), and SumLoop (Laban et al., 2020). To better evaluate compressive methods, we followed a similar concept to the LEAD baseline (See et al., 2017) and created the LEAD-WORD baseline, which extracts the first several words of a document as a summary. In the result tables, URLComSum (Ext.) denotes the extractive summary produced by our extractor agent, and URLComSum (Ext.+Com.) denotes the compressive summary produced further by our compressor agent. The commonly used ROUGE metric (Lin, 2004) is adopted for evaluation.
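The LEAD-WORD baseline described above is simple enough to state as code; the budget value and document are illustrative.

```python
def lead_word(document, budget):
    """LEAD-WORD baseline: take the first `budget` words of the document
    as its summary (a word-level analogue of the LEAD baseline)."""
    return " ".join(document.split()[:budget])

doc = "The city council approved the new transit plan after months of debate ."
print(lead_word(doc, 5))  # → "The city council approved the"
```

Despite its simplicity, this kind of lead baseline is strong on news data because of the Inverted Pyramid writing structure discussed in the results section.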

Experimental Results
The experimental results of URLComSum on the different datasets are shown in Table 2, Table 3, and Table 4 in terms of ROUGE-1, ROUGE-2, and ROUGE-L F-scores. (Ext.), (Abs.), and (Com.) denote that a method is extractive, abstractive, and compressive, respectively. Note that on the three datasets, the LEAD and LEAD-WORD baselines are considered strong baselines in the literature and sometimes perform better than state-of-the-art supervised and unsupervised models. As discussed in (See et al., 2017; Padmakumar and He, 2021), this could be due to the Inverted Pyramid writing structure (Pöttker, 2003) of news articles, in which important information is often located at the beginning of an article or paragraph.
Our URLComSum method significantly outperforms all the unsupervised and supervised methods on Newsroom, which demonstrates the effectiveness of our proposed method. Note that, unlike the supervised EXCONSUMM, our reward strategy contributes to a performance improvement when the compressor agent is utilised. For example, in terms of ROUGE-L, EXCONSUMM (Ext.+Com.) does not outperform EXCONSUMM (Ext.), while URLComSum (Ext.+Com.) outperforms URLComSum (Ext.). Similarly, our URLComSum method achieves the best performance among all the unsupervised methods on XSum in terms of ROUGE-1 and -L. URLComSum underperforms in ROUGE-2, which may be due to the trade-off between informativeness and fluency. The improvement on Newsroom is greater than those on CNN/DM and XSum, which could be because the larger size of Newsroom is more helpful for training our model.
Our URLComSum method achieves comparable performance with other unsupervised methods on CNN/DM. Note that URLComSum does not explicitly take position information into account, while some extractive methods, such as PacSum and LEAD, take advantage of the lead bias of CNN/DM. Nevertheless, we observe that URLComSum (Ext.) achieves the same result as LEAD: even though URLComSum is unsupervised, the extractor agent eventually learns to select the first few sentences of a document, which follows the principle of the aforementioned Inverted Pyramid writing structure.

Ablation Studies
Effect of Compression. We noticed that URLComSum (Ext.+Com.) generally achieves higher ROUGE-1 and -L scores than its extractive version on Newsroom. Meanwhile, on CNN/DM and XSum, the compressive version has slightly lower ROUGE scores than the extractive version. Similar behaviour has been reported in the compressive summarisation literature, which may be because the sentences of news articles are information-dense, so compression does little to further condense the content.
Effect of Transformer. Note that we investigated the popular transformer model (Vaswani et al., 2017) in our proposed framework as a replacement for Bi-LSTM for learning the sentence and word representations. However, we noticed that the transformer-based agents do not perform as well as the Bi-LSTM-based ones when trained from scratch with the same training procedure. The difficulties of training a transformer model have also been discussed in (Popel and Bojar, 2018). Besides, the commonly used pre-trained transformer models, such as BERT (Devlin et al., 2019) and BART (Lewis et al., 2020), require high computational resources and usually use subword-based tokenisers; they are not suitable for URLComSum since our compressor agent points to words instead of subwords. Therefore, at this stage Bi-LSTM is a simpler and more efficient choice. Nevertheless, the transformer is a module that can be included in our framework and is worth further investigation in the future.
Comparison of Extraction, Abstraction and Compression Approaches. We observed that the extractive and compressive approaches usually obtain better results than the abstractive ones in terms of ROUGE scores on CNN/DM and Newsroom, and vice versa on XSum. It may be because CNN/DM and Newsroom contain summaries that are usually more extractive, whereas XSum's summaries are highly abstractive. Since the ROUGE metric reflects lexical matching only and overlooks the linguistic quality and factuality of a summary, it is difficult to conclude the superiority of one approach over the others solely based on ROUGE scores. Automatic linguistic quality and factuality metrics would be essential to provide further insights and more meaningful comparisons.

Qualitative Analysis
Figures 5, 6, and 7 in Appendix A show summaries produced by URLComSum together with the reference summaries of sample documents in the CNN/DM, XSum, and Newsroom datasets. They demonstrate that our proposed URLComSum method is able to identify salient sentences and words and produce reasonably fluent summaries even without supervision information.

Interpretable Visualisation of Semantic Coverage
URLComSum is able to provide an interpretable visualisation of the semantic coverage of the summarisation results through the transportation matrix. Figure 4 illustrates the transport plan heatmap from a source document to a resulting summary on the CNN/DM dataset. The heatmap indicates the transportation of semantic content between tokens in the document and its resulting summary: the higher the intensity, the more the semantic content of a particular document token is covered by a summary token. The red line highlights the transportation from the document to the summary of the semantic content of the token "country", which appears in both the document and the summary. The purple line highlights how the semantic content of the token "debt", which appears in the document only, is transported to the tokens "bankruptcy" and "loans", which are semantically closer and have a lower transport cost, and thus achieve a minimum transportation cost in the OT plan.

Conclusion
In this paper, we have presented URLComSum, the first unsupervised and efficient method for compressive text summarisation. Our model consists of dual agents: an extractor agent and a compressor agent. The extractor agent first chooses salient sentences from a document, and the compressor agent further selects salient words from these extracted sentences to form a summary. To achieve unsupervised training of the two agents, we devise a reinforcement learning strategy to simulate human judgment on summary quality and optimise the summary's semantic coverage and fluency rewards. Comprehensive experiments on three widely used benchmark datasets demonstrate the effectiveness of our proposed URLComSum and the great potential of unsupervised compressive summarisation. Our method also provides interpretability of the semantic coverage of summarisation results.

A Sample Summaries
The following shows the sample summaries generated by URLComSum on the CNN/DM, XSum, and Newsroom datasets. Sentences extracted by the URLComSum extractor agent are highlighted. Words selected by the URLComSum compressor agent are underlined in red. Our unsupervised method URLComSum can identify salient sentences and words to produce a summary with reasonable semantic coverage and fluency.