DiffuSum: Generation Enhanced Extractive Summarization with Diffusion

Extractive summarization aims to form a summary by directly extracting sentences from the source document. Existing works mostly formulate it as a sequence labeling problem, making individual label predictions for each sentence. This paper proposes DiffuSum, a novel paradigm for extractive summarization that directly generates the desired summary sentence representations with diffusion models and extracts sentences based on sentence representation matching. In addition, DiffuSum jointly optimizes a contrastive sentence encoder with a matching loss for sentence representation alignment and a multi-class contrastive loss for representation diversity. Experimental results show that DiffuSum achieves new state-of-the-art extractive results on CNN/DailyMail with ROUGE scores of $44.83/22.56/40.56$. Experiments on two other datasets with different summary lengths also demonstrate the effectiveness of DiffuSum. The strong performance of our framework shows the great potential of adapting generative models for extractive summarization. To encourage future work, we have released our code at \url{https://github.com/hpzhang94/DiffuSum}.


Introduction
Document summarization aims to compress text material while keeping its most salient information. It plays a critical role given the growing amount of publicly available text data. Automatic text summarization approaches can be divided into two streams: abstractive and extractive summarization. Although abstractive methods (Nallapati et al., 2016; Gupta and Gupta, 2019; Bae et al., 2019; Li et al., 2020) produce flexible and less redundant summaries, they suffer from generating ungrammatical or even nonfactual content (Kryściński et al., 2019; Zhang et al., 2022b). In contrast, extractive summarization forms a summary by directly extracting sentences from the source document. Thus, the extracted summaries are grammatically accurate and faithful.
We focus on extractive summarization in this work. Extractive summarization is commonly formulated as a sequence labeling problem, which predicts a 0/1 label for each sentence indicating whether the sentence should be included in the summary (Nallapati et al., 2017; Zhou et al., 2018; Liu and Lapata, 2019). Compared to individual sentence label prediction in the sequence labeling setting, generative models offer increased flexibility and attend to the entirety of the input context. Recent works have also successfully applied generative models to a wide range of token-level sequence labeling tasks (Athiwaratkun et al., 2020; Du et al., 2021; Yan et al., 2021). Nonetheless, how to apply generative models to sentence-level tasks like extractive summarization has not been explored.
Recently, continuous diffusion models have achieved great success in the vision and audio domains (Ho et al., 2020; Kong et al., 2020; Yang et al., 2022; Rombach et al., 2022; Ho et al., 2022). Researchers have also attempted to apply diffusion models to text generation by converting discrete tokens to continuous embeddings and mapping from the embedding space back to words with a rounding method (Li et al., 2022; Yuan et al., 2022; Strudel et al., 2022; Gong et al., 2022). However, these approaches are not applicable to sentence-level tasks like summarization: (1) Summarization has a relatively longer input context and larger generation length (around 3-6 sentences), while the above token-level diffusion-LM models are only applicable to short generation tasks like text simplification and question generation; their performance tends to drop by a large margin when generating longer sequences. (2) The word embeddings generated by these models can be indistinguishable, resulting in ambiguous and hallucinated generation. (3) The rounding step in existing diffusion models is inefficient and slows down inference dramatically.
To address the above issues, we propose a novel extractive summarization paradigm, DiffuSum, which generates the desired summary sentence representations with transformer-based diffusion models and extracts summaries based on sentence representation matching. Instead of generating word by word, DiffuSum directly generates the desired continuous representations for each summary sentence and thus can process much longer text. DiffuSum is a summary-level framework since the transformer-based diffusion architecture generates all summary sentence representations simultaneously. Moreover, DiffuSum incorporates a contrastive sentence encoding module with a matching loss for sentence representation alignment and a multi-class contrastive loss (Khosla et al., 2020) for representation diversity. DiffuSum jointly optimizes the sentence encoding module and the diffusion generation module, and extracts sentences by representation matching without any rounding step. We validate DiffuSum with extensive experiments on three benchmark datasets, and experimental results demonstrate that DiffuSum achieves comparable or even better performance than state-of-the-art systems that rely on pre-trained language models. DiffuSum also shows strong adaptation ability based on cross-dataset evaluation results.
We highlight our contributions in this paper as follows: (i) We propose DiffuSum, a novel generation-augmented paradigm for extractive summarization with diffusion models. DiffuSum directly generates the desired summary sentence representations and then extracts sentences based on representation matching. To the best of our knowledge, this is the first attempt to apply diffusion models to the extractive summarization task.
(ii) We also introduce a contrastive sentence encoding module with a matching loss for representation alignment and a multi-class contrastive loss for representation diversity.

Extractive Summarization
Recent advances in deep neural networks have dramatically boosted progress in extractive summarization systems. Existing extractive summarization systems span an extensive range of approaches. Most works formulate the task as a sequence classification problem and use sequential neural models with different encoders, such as recurrent neural networks (Cheng and Lapata, 2016; Nallapati et al., 2016) and pre-trained language models (Egonmwan and Chali, 2019; Liu and Lapata, 2019; Zhang et al., 2023). Another group of works formulates extractive summarization as a node classification problem and applies graph neural networks to model inter-sentence dependencies (Xu et al., 2019; Zhang and Zhang, 2020; Wang et al., 2020; Zhang et al., 2022a). These formulations are sentence-level methods that make individual predictions for each sentence. Recently, Zhong et al. (2020) observed that a summary consisting of the highest-scoring sentences is not necessarily the best. As a result, summary-level formulations like text matching (Zhong et al., 2020; An et al., 2023) and reinforcement learning (Narayan et al., 2018b; Bae et al., 2019) have been proposed. Our proposed framework DiffuSum is also a novel summary-level extractive system with generation augmentation. Instead of sequentially labeling sentences, DiffuSum directly generates the desired summary sentence representations with diffusion models and extracts sentences by representation matching.

Diffusion Models on Text
Continuous diffusion models were first introduced in (Sohl-Dickstein et al., 2015) and have achieved great success in continuous-domain generation such as image, video, and audio (Kong et al., 2020; Yang et al., 2022; Rombach et al., 2022; Ho et al., 2022). Nevertheless, few works have applied continuous diffusion models to text data due to its inherently discrete nature. Among the initial attempts, Diffusion-LM (Li et al., 2022) first adapts continuous diffusion models for text by adding an embedding step and a rounding step, and designing a training objective to learn the embedding. DiffuSeq (Gong et al., 2022) proposes a diffusion model designed for sequence-to-sequence (seq2seq) text generation tasks by adding partial noise during the forward process and conditional denoising during the reverse process. CDCD (Dieleman et al., 2022) is proposed for text modeling and machine translation based on variance-exploding stochastic differential equations (SDEs) on token embeddings. SeqDiffuSeq (Yuan et al., 2022) also proposes an encoder-decoder diffusion model architecture for conditional generation, combining self-conditioning and an adaptive noise schedule. However, these works only focus on generating token-level embeddings for short text generation (less than 128 tokens). To adapt diffusion models to longer sequences like summaries, our DiffuSum directly generates summary sentence embeddings with a partial denoising framework. In addition, DiffuSum jointly optimizes the diffusion model with a contrastive sentence encoding module instead of using a static embedding matrix.

Continuous Diffusion Models
The continuous diffusion model (Ho et al., 2020) is a probabilistic model containing two Markov chains: the forward and the backward process.

Forward Process. Given a data point sampled from a real-world data distribution $x_0 \sim q(x)$, the forward process gradually corrupts $x_0$ into a standard Gaussian prior $x_T \sim \mathcal{N}(0, I)$. Each step of the forward process interpolates Gaussian noise into the sample:
$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t; \sqrt{1-\beta_t}\, x_{t-1}, \beta_t I\big), \quad (1)$$
where $\beta_t \in (0, 1)$ adjusts the scale of the variance.

Reverse Process. The reverse process starts from $x_T \sim \mathcal{N}(0, I)$ and learns a parametric distribution $p_\theta(x_{t-1} \mid x_t)$ to gradually invert the diffusion process of Eq. 1. Each step of the reverse process is defined as:
$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1}; \mu_\theta(x_t, t), \sigma^2_\theta(t) I\big),$$
where $\mu_\theta(x_t, t)$ and $\sigma^2_\theta(t)$ are learnable means and variances predicted by neural networks.
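As a sketch of these two processes, the following NumPy snippet (our own illustration, not the paper's code) implements one forward noising step and one reverse sampling step; `mu` and `sigma2` stand in for the network predictions $\mu_\theta$ and $\sigma^2_\theta$:

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_step(x_prev, beta_t):
    """One forward step: q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I)."""
    noise = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * noise

def reverse_step(x_t, mu, sigma2):
    """One reverse step: sample x_{t-1} ~ N(mu, sigma2 * I), where mu and
    sigma2 would be predicted by a neural network in the real model."""
    noise = rng.standard_normal(x_t.shape)
    return mu + np.sqrt(sigma2) * noise

x0 = rng.standard_normal(4)          # a toy data point
x1 = forward_step(x0, beta_t=0.02)   # slightly noised version of x0
x0_hat = reverse_step(x1, mu=x1, sigma2=0.02)  # one denoising sample
```

Iterating `forward_step` for $T$ steps corrupts the sample toward $\mathcal{N}(0, I)$; iterating `reverse_step` with learned predictions recovers it.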
While there exists a tractable variational lower bound (VLB) on $\log p_\theta(x_0)$, Ho et al. (2020) simplify the loss function of continuous diffusion to:
$$\mathcal{L}_{simple} = \sum_{t=1}^{T} \mathbb{E}_{q(x_t \mid x_0)} \left\| f_\theta(x_t, t) - x_0 \right\|^2,$$
where $f_\theta(x_t, t)$ is the reconstructed $x_0$ at step $t$.
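A minimal sketch of this simplified objective (our own toy code; `f_theta` here is an arbitrary reconstruction function, not the paper's Transformer):

```python
import numpy as np

def simple_diffusion_loss(f_theta, x0, noised):
    """L_simple: sum over steps t of || f_theta(x_t, t) - x_0 ||^2,
    using one sample x_t per step as a Monte-Carlo estimate."""
    return sum(
        float(np.sum((f_theta(x_t, t) - x0) ** 2))
        for t, x_t in enumerate(noised, start=1)
    )

# Toy check: a reconstructor that always returns x_0 incurs zero loss.
x0 = np.ones(3)
noised = [x0 + 0.1 * t for t in range(1, 4)]  # fake x_t samples for t = 1..3
loss = simple_diffusion_loss(lambda x_t, t: x0, x0, noised)
```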

Problem Formulation
Given a document with $n$ sentences $D = \{s^d_1, s^d_2, ..., s^d_n\}$, an extractive summarization system aims to form an $m$-sentence ($m \ll n$) summary $S = \{s^s_1, s^s_2, ..., s^s_m\}$ by directly extracting sentences from the source document. Most existing work formulates this as sequence labeling and assigns each sentence a $\{0, 1\}$ label, where label 1 indicates that the sentence will be included in summary $S$. Since extractive ground-truth labels (ORACLE) are not available for the human-written gold summary, it is common to use a greedy algorithm to generate an ORACLE consisting of multiple sentences that maximize the ROUGE-2 score against the gold summary, following (Nallapati et al., 2017).
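The greedy ORACLE construction can be sketched as follows. This is our own illustration with a simplified bigram-recall stand-in for ROUGE-2, not the exact implementation of Nallapati et al. (2017):

```python
def bigrams(text):
    toks = text.lower().split()
    return {(a, b) for a, b in zip(toks, toks[1:])}

def rouge2_recall(selected_sents, gold):
    """Bigram recall of the selected sentences against the gold summary
    (a simplified stand-in for the ROUGE-2 score)."""
    gold_bg = bigrams(gold)
    if not gold_bg:
        return 0.0
    return len(bigrams(" ".join(selected_sents)) & gold_bg) / len(gold_bg)

def greedy_oracle(doc_sents, gold, max_sents=3):
    """Greedily add the document sentence that most improves the score
    against the gold summary; stop when no sentence helps."""
    selected, best_score = [], 0.0
    while len(selected) < max_sents:
        best_i = None
        for i in range(len(doc_sents)):
            if i in selected:
                continue
            cand = [doc_sents[j] for j in selected] + [doc_sents[i]]
            score = rouge2_recall(cand, gold)
            if score > best_score:
                best_score, best_i = score, i
        if best_i is None:
            break
        selected.append(best_i)
    return sorted(selected)

doc = ["the cat sat on the mat", "dogs are loyal animals", "the mat was red"]
gold = "the cat sat on the red mat"
oracle = greedy_oracle(doc, gold)  # indices of the pseudo-ORACLE sentences
```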
In contrast, we propose a summary-level framework with generative model augmentation, as shown in Figure 1. Formally, we train a diffusion model with the reverse process $p_\theta(\tilde{H}^s_{t-1} \mid \tilde{H}^s_t, H^d)$ to directly generate the desired summary sentence representations $\tilde{H}^s$, where $\tilde{h}^s_j$ is the vector representing the $j$-th summary sentence at diffusion step $t-1$. The model then extracts summary sentences based on the matching between the generated summary sentence representations after $T$ reverse steps, $\tilde{H}^s_0 = [\tilde{h}^s_1, \tilde{h}^s_2, ..., \tilde{h}^s_m]$, and the document sentence embeddings $H^d = [h^d_1, h^d_2, ..., h^d_n]$. The matching score for the $j$-th output sentence $s^s_j$ with the document is defined as:
$$\tilde{y}^{pred}_j = \arg\max_i \big( h^d_i \cdot \tilde{h}^s_j \big). \quad (4)$$
Here we use the dot product as the similarity measurement and extract the sentence with the highest matching score for each generated summary sentence.
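A minimal sketch of this matching step, assuming the representations are already computed (the function name is ours):

```python
import numpy as np

def extract_by_matching(doc_reps, gen_reps):
    """For each generated summary representation, pick the document sentence
    with the highest dot-product matching score (cf. Eq. 4)."""
    scores = gen_reps @ doc_reps.T      # (m, n) similarity matrix
    return np.argmax(scores, axis=1)    # best document index per summary slot

doc_reps = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])  # 3 doc sentences
gen_reps = np.array([[0.9, 0.1], [0.1, 0.9]])              # 2 generated slots
picked = extract_by_matching(doc_reps, gen_reps)
```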

Figure 2: The overall architecture of DiffuSum. The input document is passed to the sentence encoding module and the diffusion generation module. DiffuSum generates the desired summary sentence representations for inference.
Our framework operates at the summary level by generating all summary sentence representations simultaneously, and adopts continuous diffusion models for sentence embedding generation.

Method
In this section, we introduce the detailed design of DiffuSum. DiffuSum consists of two major modules: a sentence encoding module and a diffusion generation module, introduced in Section 4.1 and Section 4.2, respectively. After that, we explain how we optimize our model and conduct inference in Section 4.3. The overall model architecture of DiffuSum is illustrated in Figure 2.

Sentence Encoding Module
To generate the desired summary sentence embeddings, we first build a contrastive sentence encoding module that maps the discrete text inputs to continuous sentence representations in $\mathbb{R}^h$, where $h$ is the dimension of the encoded sentence representations.
Specifically, we first obtain the initial sentence representations $E^d = [e^d_1, e^d_2, ..., e^d_n]$ with Sentence-BERT (Reimers and Gurevych, 2019). Note that Sentence-BERT is only used for the initial sentence embeddings and is not updated during training. The initial representations are then fed into stacked transformer layers followed by a projection layer to obtain contextualized sentence representations $h^d_i$. The same encoding process is applied to the summary sentences $S = \{s^s_1, s^s_2, ..., s^s_m\}$ to obtain the encoded summary sentence representations $H^s = [h^s_1, h^s_2, ..., h^s_m]$. The encoded document sentence representations $H^d$ and summary sentence representations $H^s$ are then concatenated as $H_{in} = H^d \| H^s \in \mathbb{R}^{(n+m) \times h}$ and passed to the diffusion generation module.
To ensure the sentence encoding module produces accurate and distinguishable representations, we introduce a matching loss $\mathcal{L}_{match}$ and a multi-class supervised contrastive loss $\mathcal{L}_{contra}$ to optimize the module, defined as follows.

Matching Loss. We first introduce a matching loss to ensure an accurate matching between the encoded document and summary sentence representations. Formally, for the $j$-th encoded summary sentence representation $h^s_j$, we compute its encoding matching scores $\hat{y}_j$ via the dot product with the document representations followed by a softmax:
$$\hat{y}_j = \mathrm{softmax}\big( h^s_j \cdot (H^d)^\top \big).$$
The encoding matching loss $\mathcal{L}_{match}$ is then the cross-entropy between the encoding matching score $\hat{y}_j$ and the ground-truth extractive summarization label (ORACLE) $y_j$:
$$\mathcal{L}_{match} = \sum_{j} \mathrm{CrossEntropy}(y_j, \hat{y}_j). \quad (7)$$

Contrastive Loss. The sentence encoding module also needs to ensure that the encoded summary sentence embeddings $[h^s_1, h^s_2, ..., h^s_m]$ are diverse and distinguishable. Thus, we introduce the multi-class supervised contrastive loss (Khosla et al., 2020) to push each summary sentence representation closer to its corresponding document sentence representation while keeping it away from other sentence embeddings.
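A toy sketch of the matching loss (our own illustration; the real module operates on learned Transformer representations and batched tensors):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def matching_loss(summary_reps, doc_reps, oracle_ids):
    """Cross-entropy between the softmaxed dot-product scores of each encoded
    summary sentence and its ORACLE document-sentence index."""
    loss = 0.0
    for h_s, gold in zip(summary_reps, oracle_ids):
        probs = softmax(doc_reps @ h_s)   # matching distribution over doc sentences
        loss += -np.log(probs[gold] + 1e-12)
    return loss / len(oracle_ids)

doc_reps = np.eye(3) * 5.0                               # 3 toy doc sentences
aligned = matching_loss([doc_reps[0]], doc_reps, [0])    # correct match: low loss
shifted = matching_loss([doc_reps[0]], doc_reps, [1])    # wrong match: high loss
```

A well-trained encoder drives the matching distribution toward the ORACLE index, so `aligned` is far smaller than `shifted`.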
Given the sentence contextual representations $H_{in} = [h_1, h_2, ..., h_{n+m}] \in \mathbb{R}^{(n+m) \times h}$, the contrastive label $y^c$ assigns the same label $q \in \{1, 2, ..., m\}$ to the $q$-th summary sentence and its corresponding ORACLE document sentence, where $y^c_p$ denotes the $p$-th element of $y^c$. The contrastive loss $\mathcal{L}_{contra}$ follows the supervised contrastive form:
$$\mathcal{L}_{contra} = \sum_{p} \frac{-1}{N_{y^c_p} - 1} \sum_{\substack{p' \neq p \\ y^c_{p'} = y^c_p}} \log \frac{\exp(h_p \cdot h_{p'} / \tau)}{\sum_{k \neq p} \exp(h_p \cdot h_k / \tau)},$$
where $N_{y^c_p}$ is the total number of sentences with the same label $y^c_p$ ($N_{y^c_p} = 2$ in our case) and $\tau$ is a temperature hyperparameter.
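A sketch of this supervised contrastive loss in NumPy (our own simplified implementation of the Khosla et al. (2020) form, operating on raw vectors rather than the model's learned representations):

```python
import numpy as np

def supcon_loss(H, labels, tau=0.07):
    """Multi-class supervised contrastive loss: pull representations with the
    same label together, push representations with different labels apart."""
    H = H / np.linalg.norm(H, axis=1, keepdims=True)  # cosine similarities
    sim = H @ H.T / tau
    n = len(labels)
    loss, terms = 0.0, 0
    for p in range(n):
        positives = [q for q in range(n) if q != p and labels[q] == labels[p]]
        if not positives:
            continue
        denom = sum(np.exp(sim[p, k]) for k in range(n) if k != p)
        for q in positives:
            loss += -np.log(np.exp(sim[p, q]) / denom)
            terms += 1
    return loss / terms

H = np.array([[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [0.01, 0.99]])
tight = supcon_loss(H, [1, 1, 2, 2])   # positives are the nearby vectors
mixed = supcon_loss(H, [1, 2, 1, 2])   # positives are the orthogonal vectors
```

When same-label pairs are geometrically close (`tight`), the loss is much lower than when labels cut across the clusters (`mixed`), which is exactly the pressure that keeps summary-sentence representations near their matched document sentences.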
The overall optimization objective for the sentence encoding module is:
$$\mathcal{L}_{se} = \mathcal{L}_{match} + \gamma \mathcal{L}_{contra},$$
where $\gamma$ is a rescale factor that adjusts the diversity of the sentence representations.

Diffusion Generation Module
After obtaining the input encoding $H_{in} = H^d \| H^s$, we adopt the continuous diffusion model to conditionally generate the desired summary sentence embeddings. As described in Section 3.1, our diffusion generation module gradually adds Gaussian noise through the forward process and fits a stacked Transformer to invert the diffusion in the reverse process.
We first perform a one-step Markov transition $q(x_0 \mid H_{in}) = \mathcal{N}(H_{in}, \beta_0 I)$ for the starting state $x_0 = x^d_0 \| x^s_0$. Note that this initial Markov transition is applied to both the document and summary sentence embeddings.
We then start the forward process by gradually injecting partial noise into the summary embeddings $x^s$ while leaving the document embeddings $x^d$ unchanged, similar to (Gong et al., 2022). This enables the diffusion model to generate conditionally on the source document. At step $t$ of the forward process $q(x^s_t \mid x_{t-1})$, the noised representation is:
$$x_t = x^d \| x^s_t, \quad x^s_t \sim \mathcal{N}\big(\sqrt{1-\beta_t}\, x^s_{t-1}, \beta_t I\big),$$
where $t \in \{1, 2, ..., T\}$ for a total of $T$ diffusion steps and $\|$ represents concatenation.
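The partial-noising step can be sketched as follows (our own illustration; the shapes and the `partial_forward_step` name are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def partial_forward_step(x_doc, x_sum_prev, beta_t):
    """Noise only the summary part; the document part stays clean so the
    reverse process remains conditioned on the source (DiffuSeq-style)."""
    noise = rng.standard_normal(x_sum_prev.shape)
    x_sum_t = np.sqrt(1.0 - beta_t) * x_sum_prev + np.sqrt(beta_t) * noise
    return np.concatenate([x_doc, x_sum_t], axis=0)   # x_t = x^d || x^s_t

x_doc = rng.standard_normal((5, 8))    # 5 document sentence embeddings
x_sum = rng.standard_normal((2, 8))    # 2 summary sentence embeddings
x_t = partial_forward_step(x_doc, x_sum, beta_t=0.02)
```

The document half of `x_t` is bit-for-bit identical to `x_doc`; only the summary half is corrupted, which is what lets the reverse process denoise the summary conditioned on an intact document.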
Once the partially noised representations are acquired, we conduct the backward process to remove the noise from the summary representations, conditioned on the sentence representations of the previous step:
$$p_\theta(x^s_{t-1} \mid x_t) = \mathcal{N}\big(x^s_{t-1}; \mu_\theta(x_t, t), \sigma^2_\theta(t) I\big),$$
where $\mu_\theta(\cdot)$ and $\sigma^2_\theta(\cdot)$ are parameterized models (a stacked Transformer in our case) that predict the mean and standard deviation at diffusion step $t-1$. The final output of the diffusion module is the generated summary sentence representations after $T$ reverse steps, $\tilde{H}^s_0 = [\tilde{h}^s_1, \tilde{h}^s_2, ..., \tilde{h}^s_m]$. We optimize the diffusion generation module with the diffusion loss:
$$\mathcal{L}_{diffusion} = \sum_{t=1}^{T} \mathbb{E}_{q(x_t \mid x_0)} \left\| f_\theta(x_t, t) - x_0 \right\|^2 + R(x_0),$$
where $f_\theta(x_t, t)$ is the reconstructed $x_0$ at step $t$ and $R(x_0)$ is an L2 regularization term.

Optimization and Inference
We jointly optimize the sentence encoding module and the diffusion generation module in an end-to-end manner. The overall training loss of DiffuSum is a weighted combination of the two module losses, where $\eta$ is a balancing factor between the sentence encoding loss $\mathcal{L}_{se}$ and the diffusion generation loss $\mathcal{L}_{diffusion}$. During inference, the diffusion module generates the summary sentence representations conditioned on the document, and we compute the matching label $\tilde{y}^{pred}_j$ as in Eq. 4. We extract the sentence with the highest score for each generated summary sentence representation to form the summary.

Experimental Setup
Datasets. We conduct experiments on three benchmark summarization datasets: CNN/DailyMail, XSum, and PubMed. CNN/DailyMail (Hermann et al., 2015) is the most widely adopted summarization dataset, containing news articles and corresponding human-written news highlights as summaries. We use the non-anonymized version in this work and follow the common training, validation, and testing splits (287,084/13,367/11,489). XSum (Narayan et al., 2018a) is a one-sentence news summarization dataset with all summaries professionally written by the original authors of the documents. We follow the common training, validation, and testing splits (204,045/11,332/11,334). PubMed (Cohan et al., 2018) is a scientific paper summarization dataset of long documents. We follow the setting in (Zhong et al., 2020) and use the introduction section as the article and the abstract section as the summary. The training/validation/testing split is (83,233/4,946/5,025). The detailed statistics of each dataset are shown in Table 1.
We also compare DiffuSum with state-of-the-art summary-level approaches: the contrastive learning-based re-ranking framework COLO (An et al., 2023) and the summary-level two-stage text matching framework MATCHSUM (Zhong et al., 2020).

Implementation Details
We use the Sentence-BERT (Reimers and Gurevych, 2019) checkpoint all-mpnet-base-v2 for the initial sentence representations. The dimension of the sentence representations $h$ is set to 128. We use an 8-layer Transformer with 12 attention heads in the sentence encoding module and a 12-layer Transformer with 12 attention heads in the diffusion generation module. The hidden size of the model is set to 768, and the temperature $\tau$ is set to 0.07. The scaling factors $\gamma$ and $\eta$ are set to 0.001 and 100, where $\gamma$ is searched in the range [0.0001, 1] and $\eta$ in the range [10, 1000]. We set the diffusion steps $T$ to 500. The effects of the hyperparameters $T$ and $h$ are discussed in Section 6.2.
DiffuSum has a total of 13 million parameters and is optimized with the AdamW optimizer (Loshchilov and Hutter, 2017) with a learning rate of 1e-5 and a dropout rate of 0.1. We train the model for 10 epochs and validate performance by the average of the ROUGE-1 and ROUGE-2 F-1 scores on the validation set.
Following the standard setting, we evaluate model performance with ROUGE F-1 scores (Lin and Hovy, 2003). Specifically, the ROUGE-1/2 scores measure summary informativeness, and the ROUGE-L score measures summary fluency. Single-run results are presented in the following sections with the default random seed of 101.

Experiment Results
Results on CNN/DailyMail. Experimental results on the CNN/DailyMail dataset are shown in Table 2. The first block of the table contains the extractive ground truth ORACLE (upper bound) and LEAD, which selects the first few sentences as a summary. The second block includes recent strong one-stage extractive baseline methods and our proposed model DiffuSum. The third block includes two-stage baseline methods that pre-select salient sentences. We follow the same setting and show the results of DiffuSum with the same pre-selection for a fair comparison.
According to the results, DiffuSum achieves new state-of-the-art performance under both one-stage and two-stage settings, with an especially large gain in the ROUGE-2 score. The strong performance suggests that DiffuSum handles longer input contexts and complex generations well. Our summary-level setting also benefits data with longer summaries by considering summary sentence dependencies.
For data with shorter summaries like XSum, DiffuSum also achieves comparable performance to SOTA approaches, with a significantly higher ROUGE-2 score. Short-summary data tend to be simpler for matching-based methods like MatchSum since the candidate pool is much smaller.
Overall, DiffuSum achieves comparable or even better performance than pre-trained language model-based baseline methods. The results demonstrate the effectiveness of DiffuSum on summarization data of different lengths.

Ablation Study
To understand the strong performance of DiffuSum, we perform an ablation study by removing components of the sentence encoding module; the results are shown in Table 4. The second row shows that performance drops when replacing the initial sentence representations from Sentence-BERT with a BERT-base encoder (Devlin et al., 2018), indicating that sentence-level information is necessary for the success of DiffuSum. The third row shows that replacing ORACLE with abstractive reference summaries degrades performance. As for the sentence encoding loss, both the matching loss and the contrastive loss benefit overall model performance according to rows 4 and 5. The matching loss is critical to the model: performance drops dramatically, by more than 40%, without it. These results prove the importance of jointly training a sentence encoder that produces accurate and diverse sentence representations with the generation module.

Hyperparameter Sensitivity
We also study the influence of the diffusion generation module's two important hyperparameters: the diffusion steps $T$ and the sentence representation dimension $h$, in Table 5. The first row is our best model, and the second block shows the performance of DiffuSum with different sentence representation dimensions. Performance drops by a large margin when the dimension is set to 64, indicating severe information loss when the sentence dimension is shrunk too much. Performance also drops slightly when the dimension is set to 256, suggesting that an overly large dimension may introduce more noise. The third block shows the influence of diffusion steps: model performance first increases with more diffusion steps, then starts to decrease and oscillate as steps keep increasing. We argue that the noise injected in the forward pass cannot be fully removed if there are too few steps, while too many steps introduce more noise than the model can recover.

Cross-dataset Evaluation
We also notice that DiffuSum shows strong cross-dataset adaptation ability. As shown in Table 6, the model trained on the news domain (CNN/DM and XSum) achieves comparable performance (only 1.57 and 2.69 ROUGE-1 drops) when directly tested on the scientific paper domain. The cross-dataset results demonstrate the robustness of our generation-augmented framework and its potential for building a generalized extractive summarization system.

Representation Analysis
We also analyze the quality of the generated sentence representations. We apply t-SNE (van der Maaten and Hinton, 2008) to reduce the sentence representations to 2 dimensions and show the encoded sentence representations as well as the generated summary sentence representations in Figure 3. The blue dots represent non-summary sentences, and the red dots represent summary sentences (ORACLE) from our sentence encoding module. The green dots are summary sentence representations reconstructed by our diffusion generation module. We find that most of the ORACLE sentences gather on the right, which shows that our contrastive encoder can distinguish ORACLE sentences from non-summary sentences. We also find that the sentence representations generated by the diffusion module (green) are very close to the original summary representations (red), demonstrating that our diffusion generation module is powerful at reconstructing sentence representations from random Gaussian noise.

Conclusions
This paper proposes a new paradigm for extractive summarization with generation augmentation. Instead of sequentially labeling sentences, DiffuSum directly generates the desired summary sentence representations with diffusion models and extracts summary sentences based on representation matching. Experimental results on three benchmark datasets prove the effectiveness of DiffuSum. This work is the first attempt to adapt diffusion models for summarization. Future work could explore various ways of applying continuous diffusion models to both extractive and abstractive summarization.

Limitations
Despite the strong performance of DiffuSum, its design still has the following limitations. First, DiffuSum is only designed for extractive summarization, and the diffusion generation module only generates sentence embeddings instead of token-level information; thus, it is not applicable to the abstractive summarization setting. Moreover, DiffuSum is only tested on single-document summarization datasets; how to adapt DiffuSum to multi-document and long-document summarization scenarios needs further investigation. In addition, our generative model involves multiple steps of noise injection and denoising, which makes inference slower than discriminator-based extractive systems.

ACL 2023 Responsible NLP Checklist
A For every submission:
A1. Did you describe the limitations of your work?
The Limitations section.
A2. Did you discuss any potential risks of your work?
Our paper proposes a summarization model and we experiment with public datasets. The model only summarizes documents and poses no potential risk.
A3. Do the abstract and introduction summarize the paper's main claims?
The Abstract and Section 1 (Introduction). We only use publicly available data and models.
B3. Did you discuss if your use of existing artifact(s) was consistent with their intended use, provided that it was specified? For the artifacts you create, do you specify intended use and whether that is compatible with the original access conditions?
Section 5.
B4. Did you discuss the steps taken to check whether the data that was collected / used contains any information that names or uniquely identifies individual people or offensive content, and the steps taken to protect / anonymize it?
Not applicable.
B5. Did you provide documentation of the artifacts, e.g., coverage of domains, languages, and linguistic phenomena, demographic groups represented, etc.?
Section 5.
B6. Did you report relevant statistics like the number of examples, details of train / test / dev splits, etc. for the data that you used / created?
Section 5.1.
C Did you run computational experiments?
Section 5.
C1. Did you report the number of parameters in the models used, the total computational budget (e.g., GPU hours), and computing infrastructure used?
Section 5.2.

(iii) We conduct extensive experiments and analysis on three benchmark summarization datasets to validate the effectiveness of DiffuSum. DiffuSum achieves new extractive state-of-the-art results on the CNN/DailyMail dataset with ROUGE scores of 44.83/22.56/40.56.

Figure 1: The proposed generation-enhanced extractive summarization framework. The model first conditionally generates the desired summary embeddings and then extracts sentences based on representation matching.

A4. Did you use AI writing assistants when working on this paper?
Left blank.
B Did you use or create scientific artifacts?
Section 5.
B1. Did you cite the creators of artifacts you used?
Section 5.
B2. Did you discuss the license or terms for use and / or distribution of any artifacts?

Table 1: Statistics of the experimental datasets. Doc # words and Sum # words refer to the average number of words in the source document and summary. # Ext refers to the number of sentences to extract.

Table 2: Experimental results on the CNN/DailyMail dataset. Models using pre-trained language models are marked with *.

Table 3: Experimental results on the PubMed and XSum datasets.

Table 5: The performance of DiffuSum with different hyperparameter settings on the CNN/DM dataset.