Unsupervised Multi-document Summarization with Holistic Inference

Multi-document summarization aims to obtain core information from a collection of documents written on the same topic. This paper proposes a new holistic framework for unsupervised multi-document extractive summarization. Our method incorporates the holistic beam search inference method associated with the holistic measurements, named Subset Representative Index (SRI). SRI balances the importance and diversity of a subset of sentences from the source documents and can be calculated in unsupervised and adaptive manners. To demonstrate the effectiveness of our method, we conduct extensive experiments on both small and large-scale multi-document summarization datasets under both unsupervised and adaptive settings. The proposed method outperforms strong baselines by a significant margin, as indicated by the resulting ROUGE scores and diversity measures. Our findings also suggest that diversity is essential for improving multi-document summary performance.


Introduction
Multi-document summarization (MDS) is one of the essential tools for obtaining core information from a collection of documents written on the same topic. It seeks to find the main ideas from multiple sources with diversified messages. In spite of recent advances in MDS system designs (Mihalcea and Tarau, 2004; Liu and Lapata, 2019a; Xiao et al., 2022), three major challenges hinder its development. First, existing extractive multi-document summarization systems rely on optimization with individual scoring, which becomes sub-optimal when multiple summary sentences need to be extracted (Zhong et al., 2020). A typical individual system scores each candidate summary using only measurements of the newly added sentence during inference. In contrast, a holistic system simultaneously measures all summary sentences and the relations among them. Despite recent efforts in holistic methods on a single document (An et al., 2022; Zhong et al., 2020), how to extract sentences holistically for multi-document summarization remains open. In this work, we propose an inference method that holistically optimizes the extractive summary under the multi-document setting.
In Figure 1, we show a salient and diversified summary versus a salient but redundant summary. A salient and diversified summary often covers the information thoroughly, while the salient but redundant summary is usually incomplete. Different from existing approaches (Suzuki and Nagata, 2017; Cho et al., 2019b; Xiao and Carenini, 2020) for limiting repetitions, we introduce the Subset Representative Index (SRI), a holistically balanced measurement between importance and diversity for extractive multi-document summarization.
Finally, recent deep learning-based supervised summarization methods are data-driven and require a massive number of high-quality summaries in the training data. However, hiring humans to write summaries is expensive and time-consuming, and thus hard to scale up. This problem becomes more severe for multi-document summarization, since it requires more effort to read more documents. Therefore, existing multi-document summarization datasets are either small-scale (Over and Yen, 2004; Dang and Owczarzak, 2008) or created by acquiring data from the Internet with automatic alignments (Fabbri et al., 2019; Antognini and Faltings, 2020) that could be erroneous. Here we propose an unsupervised multi-document summarization method to tackle the low-resource issue. Unsupervised multi-document summarization can further benefit from an adaptive setting that uses large-scale, high-quality single-document summarization data (e.g., CNN/DailyMail (Hermann et al., 2015)).
In this work, we present a novel framework for unsupervised extractive multi-document summarization, aiming to holistically select the extractive summary sentences. The framework contains the holistic beam search inference method associated with the holistic measurement named SRI (Subset Representative Index). SRI is designed as a holistic measurement for balancing the importance of individual sentences and the diversity among sentences within a set. To address data sparsity, we propose to calculate SRI in both unsupervised and adaptive manners. Unsupervised SRI relies on the centrality from graph-based methods (Erkan and Radev, 2004; Mihalcea and Tarau, 2004) for subset importance measurement, while adaptive SRI uses BERT (Devlin et al., 2018) fine-tuned on a single-document summarization (SDS) corpus for sentence importance measurement. Our method shows performance improvements in both summary informativeness and diversity scores, indicating that our approach achieves better coverage of the documents while maintaining the gist of the multi-document input. We highlight the contributions of our work as follows:
• We propose a novel holistic framework for multi-document extractive summarization. Our framework incorporates a holistic inference method for summary sentence extraction and a holistic measurement called Subset Representative Index (SRI) for balancing the importance and diversity of a subset of sentences.
• We propose two unsupervised ways to measure SRI, using either graph-based centrality or a BERT scorer adapted from a single-document summarization corpus.
• We conduct extensive experiments on several benchmark datasets, and the results demonstrate the effectiveness of our paradigm under both unsupervised and adaptive settings. Our findings suggest that effectively modeling sentence importance and pairwise sentence similarity is crucial for extracting diverse summaries and improving summarization performance.

Related Works
Multi-document Summarization. Traditional non-neural approaches to multi-document summarization have been both extractive (Carbonell and Goldstein, 1998; Erkan and Radev, 2004; Mihalcea and Tarau, 2004) and abstractive (Ganesan et al., 2010). Recent neural MDS systems rely on Transformer-based encoder-decoder models to process the integrated long documents with hierarchical inter-paragraph attention (Liu and Lapata, 2019a; Fabbri et al., 2019), or attention across representations of different granularity (Jin et al., 2020). This work focuses on unsupervised MDS scenarios where gold reference summaries are unavailable. Prior unsupervised MDS systems are mostly graph-based (Erkan and Radev, 2004; Liu et al., 2021). Similar to our adaptive setting, Lebanoff et al. (2018) proposed to adapt the encoder-decoder framework from a single-document corpus, but our work focuses on the extractive summarization setting with holistic inference.
Sentence Importance Measurements. Most works formulate extractive summarization as a sequence classification problem and use sequential neural models with different encoders, such as recurrent neural networks (Cheng and Lapata, 2016; Nallapati et al., 2016) and pre-trained language models (Liu and Lapata, 2019b; Zhang et al., 2023b). The prediction probabilities are treated as the importance measurement of sentences. On the other hand, unsupervised graph-based methods calculate the importance of sentences with node centrality and rank them for the summaries, including TextRank (Mihalcea and Tarau, 2004), LexRank (Erkan and Radev, 2004), PACSUM (Zheng and Lapata, 2019), and its variants (Liang et al., 2021; Liu et al., 2021). Recent research (Xu et al., 2019; Wang et al., 2020; Zhang et al., 2022, 2023a) [...] Maximal Marginal Relevance (MMR) (Carbonell and Goldstein, 1998), Determinantal Point Process (DPP) (Kulesza and Taskar, 2012), and submodular selection (Lin et al., 2009). Trigram blocking is introduced to explicitly reduce redundancy by avoiding sentences that share a 3-gram with the previously added one (Liu and Lapata, 2019b). Paulus et al. (2017) first adopted trigram blocking in decoding for abstractive summarization. Ma et al. (2016) proposed sentence filtering and beam search methods for extractive summarization sentence selection. Xiao and Carenini (2020) conducted a systematic study of redundancy in long documents.

Method
This section provides a detailed description of our proposed holistic MDS summarization framework.
We first explain how we formulate the MDS problem holistically in Section 3.1. The overall architecture of our holistic framework is shown in Figure 2, which includes holistic inference methods for summary sentence extraction in Section 3.2, and a new holistic measurement, the Subset Representative Index (SRI), in Section 3.3.

Problem Formulation
Multi-document summarization typically takes a collection of $n$ documents $D = \{D^{(1)}, \dots, D^{(n)}\}$ as input. Each document $D^{(i)} = \{s^{(i)}_1, \dots, s^{(i)}_{l_i}\}$ contains a varying number of sentences, where $l_i$ is the number of sentences in the $i$-th document. Let $S$ be the collection of all sentences, i.e., $S = D^{(1)} \cup \dots \cup D^{(n)}$. Additionally, let $e_{i,j}$ denote the similarity score between sentence $s_i$ and sentence $s_j$. Our goal is to select a representative subset of sentences $S' \subset S$ that maximizes the total importance of the subset while minimizing the redundancy among the sentences in the subset.

Holistic Inference
Most existing approaches for unsupervised extractive summarization formulate it as an individual sentence ranking problem. They first calculate a measurement $M(s_i)$ (e.g., sentence importance) for each sentence $s_i \in S$ and rank all sentences in $S$ accordingly. For summary inference, they directly use an individual greedy method that adds the highest-ranked remaining sentence one at a time until the desired number of summary sentences is reached. In contrast, a holistic summarization method should evaluate a subset of sentences as a whole with a subset measurement $M(S')$, then select the best subset $S'$. This formulates holistic summary inference as a best subset selection problem, which has exponential time complexity.
To address the exponential time complexity issue, we propose several holistic inference methods for summary sentence extraction. These methods optimize subsets of sentences using subset measurements, as opposed to the individual greedy inference method. We describe the different variants of the proposed method below.
Holistic Greedy Method. The most straightforward way to address the exponential time complexity issue is to adopt a greedy approach. Similar to the individual greedy method, the holistic greedy method also adds one sentence at a time. However, it picks the sentence using a subset measurement that takes into account the previously selected sentences. Formally, at each step, the method selects the sentence that maximizes the following objective:
$$\hat{s} = \arg\max_{s \in S \setminus S'} M(S' \cup \{s\}),$$
where $S'$ represents the previously selected sentences.
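The greedy step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `holistic_greedy`, the integer sentence ids, and the toy subset measure are all assumptions introduced here.

```python
from typing import Callable, FrozenSet, List


def holistic_greedy(n_sentences: int, k: int,
                    measure: Callable[[FrozenSet[int]], float]) -> List[int]:
    """Add one sentence per step, scoring the whole candidate subset each time."""
    selected: List[int] = []
    for _ in range(min(k, n_sentences)):
        remaining = [i for i in range(n_sentences) if i not in selected]
        # Score S' ∪ {s} as a set, not the new sentence s in isolation.
        best = max(remaining, key=lambda i: measure(frozenset(selected + [i])))
        selected.append(best)
    return selected


# Hypothetical toy measure: summed importance minus a penalty when the
# near-duplicate sentences 0 and 1 are both selected.
importance = [0.9, 0.85, 0.5]


def toy_measure(subset: FrozenSet[int]) -> float:
    penalty = 1.0 if {0, 1} <= subset else 0.0
    return sum(importance[i] for i in subset) - penalty


greedy_pick = holistic_greedy(3, 2, toy_measure)
```

In this toy case an individual ranking would pick sentences 0 and 1, whereas the subset-level score skips sentence 1 because it duplicates sentence 0.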
Holistic Exhaustive Search. This brute-force method considers every possible subset with the desired number of sentences. However, due to the exponential computation time, it is necessary to first filter out low-importance candidates using $M(\{s_i\})$ to reduce the search space.
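A small sketch of exhaustive search with pre-filtering is given below, assuming sentences are identified by integer ids and the subset measure is passed in as a function. The names `exhaustive_search` and `keep` (the filter size) are hypothetical, as is the toy measure.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Tuple


def exhaustive_search(n_sentences: int, k: int, keep: int,
                      measure: Callable[[FrozenSet[int]], float]) -> Tuple[int, ...]:
    """Score every size-k subset of the `keep` best singleton candidates."""
    # Pre-filter: rank sentences by their singleton score M({s_i}).
    ranked = sorted(range(n_sentences),
                    key=lambda i: measure(frozenset([i])), reverse=True)
    candidates = sorted(ranked[:keep])
    k = min(k, len(candidates))
    # Brute force over the reduced search space.
    return max(combinations(candidates, k),
               key=lambda c: measure(frozenset(c)))


# Hypothetical toy measure: summed importance minus a penalty when the
# near-duplicate sentences 0 and 1 are both selected.
importance = [0.9, 0.85, 0.5, 0.1]


def toy_measure(subset: FrozenSet[int]) -> float:
    penalty = 1.0 if {0, 1} <= subset else 0.0
    return sum(importance[i] for i in subset) - penalty


best_subset = exhaustive_search(4, 2, 3, toy_measure)
```

The number of subsets scored is C(keep, k), which is why the filter size has to stay small in practice.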
Holistic Beam Inference. We also propose Holistic Beam Inference, which balances the trade-off between search space size and efficiency. It is a more advanced holistic inference method that adapts the beam-search decoding algorithm. We illustrate the algorithm in Algorithm 1. At each step, it considers the top-k candidate subsets, which enlarges the search space and therefore has a higher chance of finding a better subset solution compared to the holistic greedy method. Meanwhile, the algorithm has linear time complexity, making it more efficient than the holistic exhaustive search method.
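The following is a minimal sketch of beam search over sentence subsets, written to match the description above rather than to reproduce Algorithm 1 exactly. The function name, the tie-breaking rule, and the toy measure are assumptions made for illustration.

```python
from typing import Callable, FrozenSet, List

Measure = Callable[[FrozenSet[int]], float]


def holistic_beam_search(n_sentences: int, k: int, beam_size: int,
                         measure: Measure) -> FrozenSet[int]:
    """Grow subsets one sentence at a time, keeping the best `beam_size` subsets."""
    beam: List[FrozenSet[int]] = [frozenset()]
    for _ in range(min(k, n_sentences)):
        # Expand every subset X in the beam with every sentence x not yet in X.
        expanded = {subset | {x}
                    for subset in beam
                    for x in range(n_sentences) if x not in subset}
        # Rank whole subsets by the subset measure; ties are broken by sentence ids.
        ranked = sorted(expanded, key=lambda s: (-measure(s), sorted(s)))
        beam = ranked[:beam_size]
    return beam[0]


# Hypothetical toy measure: sentence 0 is the most important one on its own,
# but it is redundant with both sentence 1 and sentence 2.
importance = [1.0, 0.9, 0.9]
redundant_pairs = [frozenset({0, 1}), frozenset({0, 2})]


def toy_measure(subset: FrozenSet[int]) -> float:
    penalty = sum(1.0 for pair in redundant_pairs if pair <= subset)
    return sum(importance[i] for i in subset) - penalty


greedy_like = holistic_beam_search(3, 2, 1, toy_measure)   # beam size 1
beam_result = holistic_beam_search(3, 2, 2, toy_measure)   # beam size 2
```

With beam size 1 the search commits to sentence 0 and ends with a redundant pair; with beam size 2 it keeps a second partial subset alive and finds the non-redundant pair {1, 2}.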

Subset Representative Index
To complement the holistic inference methods, we propose a new subset measurement, the Subset Representative Index (SRI), denoted as $M(S')$. It balances the importance measurement $I(S')$ and the redundancy measurement $R(S')$.
An ideal extractive summary should select the most representative subset from a collection of the input sentences, maximizing the total non-redundant salient information passed to the user. SRI is a holistic subset measurement that balances the importance and redundancy of a subset of sentences from the source documents. Formally, we define SRI as below:
$$M(S') = I(S') - \lambda \cdot R(S'),$$
where $I(S')$ measures the informativeness of a set of sentences, and $R(S')$ measures the redundancy within the set. The parameter $\lambda$ is used to control the weight of the redundancy in the overall SRI score. We detail the methods for measuring the set importance and redundancy in an unsupervised manner as follows.
Graph-Based Importance Measurement. To measure the importance of sentences, we use a graph-based approach. We construct a graph $G = (V, E)$, where node $v_i \in V$ represents sentence $s_i \in S$, and edge $e_{i,j} \in E$ represents the similarity between sentences $s_i$ and $s_j$. Our proposed approach for the sentence similarity score employs a combination of two methods: TF-IDF and Sentence-BERT (Reimers and Gurevych, 2019). TF-IDF is used to encode sentences with surface-form similarity, while Sentence-BERT is used to encode sentences with semantic similarity. The edge weight $e_{i,j}$ is a weighted combination of the similarity between the TF-IDF features $c_i$ and $c_j$ and the similarity between the sentence embeddings $r_i$ and $r_j$ of the $i$-th and $j$-th sentences. The weight term $\alpha \in [0, 1]$ is a configurable hyperparameter that balances statistical similarity and contextualized similarity.
Algorithm 1 (recovered fragment): for X ∈ C do; for x ∈ X′ do; add X ∪ {x} to C′; end for; end for.
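One plausible reading of the combined similarity is sketched below. Two points are assumptions, since the text does not fix them: cosine similarity is used for both feature types, and α multiplies the TF-IDF term while (1 − α) multiplies the embedding term. The feature vectors are assumed to be precomputed; the function names are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity, defined as 0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def edge_weights(tfidf: List[Sequence[float]], emb: List[Sequence[float]],
                 alpha: float) -> List[List[float]]:
    """Symmetric similarity matrix with zero diagonal (no self-loops)."""
    n = len(tfidf)
    return [[0.0 if i == j else
             alpha * cosine(tfidf[i], tfidf[j])            # assumed: alpha on TF-IDF
             + (1.0 - alpha) * cosine(emb[i], emb[j])      # assumed: rest on embeddings
             for j in range(n)]
            for i in range(n)]


# Toy features: sentences 0 and 1 share surface form, 1 and 2 share meaning.
toy_tfidf = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
toy_emb = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
E = edge_weights(toy_tfidf, toy_emb, alpha=0.9)
```

With α = 0.9 the surface-form match between sentences 0 and 1 dominates, while the purely semantic match between sentences 1 and 2 contributes only a small weight.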
Inspired by Mihalcea and Tarau (2004) and Erkan and Radev (2004), we define the importance of a sentence as its node centrality in the graph, which is calculated as the sum of the weights of the edges connected to the node representing this sentence. Similarly, the importance of a subset of sentences is defined as the total weight of the edges between the subgraph and the remaining graph, with a normalizing denominator. Since $|S'|$ is usually far smaller than $|S|$ in summarization tasks, we can approximate the denominator by using $|S|$ directly. This way, the subset importance only takes into account the relationship of the subset with the remaining sentences, rather than considering dependencies within the subset.
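A minimal sketch of the two importance scores follows, assuming a precomputed symmetric similarity matrix `E` with a zero diagonal. The exact normalization is an assumption: the subset score below divides by |S|, following the approximation mentioned in the text, and the function names are hypothetical.

```python
from typing import Iterable, List


def sentence_centrality(E: List[List[float]], i: int) -> float:
    """Sum of the weights of all edges incident to sentence i."""
    return sum(E[i][j] for j in range(len(E)) if j != i)


def subset_importance(E: List[List[float]], subset: Iterable[int]) -> float:
    """Total edge weight between the subset and the remaining sentences,
    divided by |S| (assumed normalization)."""
    chosen = set(subset)
    rest = [j for j in range(len(E)) if j not in chosen]
    cut = sum(E[i][j] for i in chosen for j in rest)
    return cut / len(E) if E else 0.0


# Toy graph: sentences 0 and 1 are strongly connected, 1 and 2 weakly.
E = [[0.0, 0.9, 0.0],
     [0.9, 0.0, 0.1],
     [0.0, 0.1, 0.0]]
centrality_1 = sentence_centrality(E, 1)
importance_01 = subset_importance(E, {0, 1})
```

Note that the strong edge between sentences 0 and 1 does not count toward the importance of the subset {0, 1}, since only edges leaving the subset are summed.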
Adaptive Importance Measurement. In spite of the data sparsity issue in MDS, the Single Document Summarization (SDS) task has abundant high-quality labeled data (Hermann et al., 2015; Narayan et al., 2018; Cohan et al., 2018). We propose a method called adaptive importance measurement, which adapts SDS data for MDS importance measurement. This method utilizes the labeled data from SDS to train a model for predicting the importance of sentences in MDS.
In the adaptive setting, we fine-tune BERT (Devlin et al., 2018) as a sentence importance scorer on SDS datasets, and then adapt the fine-tuned model to the target MDS datasets. Specifically, we first calculate the normalized salience $f(s_i)$ of a sentence from $W$ and $r_i$, where $W$ is a trainable weight and $r_i$ is the contextualized representation of sentence $s_i$. Then, we fine-tune BERT by minimizing a training loss defined on these salience scores. The fine-tuned BERT can be directly applied to the MDS datasets to calculate the adaptive importance measurement for sentences.
Redundancy Measurement. The redundancy measurement for a subset of sentences $S'$ is defined as the total similarity score of each sentence with its most similar counterpart. This measurement captures the degree of overlap between the sentences in the subset, indicating the level of redundancy present in the selected sentences:
$$R(S') = \sum_{s_i \in S'} \max_{s_j \in S',\, j \neq i} e_{i,j}.$$
Overall, we can calculate SRI in both unsupervised and adaptive manners. Our holistic framework extracts summaries as a whole with the holistic inference method, which is guided by SRI to measure the importance and redundancy of a subset of sentences. This approach allows us to balance the importance and redundancy of a summary, making it more informative and coherent.
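The redundancy term and the resulting SRI score can be sketched as follows. This is an illustration under stated assumptions: `E` is a precomputed similarity matrix, the importance of a subset is supplied by the caller as a function, and the toy importance used here (subset size) is purely hypothetical.

```python
from typing import Callable, Iterable, List, Set


def redundancy(E: List[List[float]], subset: Iterable[int]) -> float:
    """Sum, over the subset, of each sentence's similarity to its most
    similar other sentence in the subset."""
    items = list(subset)
    if len(items) < 2:
        return 0.0  # a single sentence has no counterpart
    return sum(max(E[i][j] for j in items if j != i) for i in items)


def sri(subset: Set[int], importance_fn: Callable[[Set[int]], float],
        E: List[List[float]], lam: float) -> float:
    """SRI = importance of the subset minus lambda times its redundancy."""
    return importance_fn(subset) - lam * redundancy(E, subset)


E = [[0.0, 0.9, 0.0],
     [0.9, 0.0, 0.1],
     [0.0, 0.1, 0.0]]


def toy_importance(subset: Set[int]) -> float:
    # Hypothetical importance: every sentence counts equally.
    return float(len(subset))


score_redundant = sri({0, 1}, toy_importance, E, lam=0.5)
score_diverse = sri({0, 2}, toy_importance, E, lam=0.5)
```

Both subsets have the same toy importance, so the difference in SRI comes entirely from the redundancy term, which penalizes the highly similar pair {0, 1}.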

Experiments
In this section, we provide details on our experimental setup, including the datasets, evaluation metrics, baselines, and implementation details (Section 4.1). We then present the results of our model on benchmark MDS datasets in both unsupervised (Section 4.2) and adaptive (Section 4.3) settings.

Experimental Setting
Dataset. We evaluate our unsupervised method on benchmark multi-document summarization datasets. In particular, we use the MultiNews (Fabbri et al., 2019), WikiSum (Liu et al., 2018), DUC-04 (Over and Yen, 2004), and TAC-11 (Dang and Owczarzak, 2008) datasets. MultiNews is collected from a diverse set of news articles on newser.com. It is a large-scale dataset containing reference summaries written by professional editors. WikiSum is another large-scale dataset that provides documents and summaries from Wikipedia webpages, where the documents come from the reference webpages of Wikipedia articles and the top-10 Google search results, and the summaries are the lead sections of the Wikipedia articles. We use the top-40 high-ranked paragraphs as the document inputs, following Liu and Lapata (2019a).
For summary extraction, we use the average number of reference sentences: 10 and 5 on MultiNews and WikiSum, respectively. For the DUC and TAC datasets, the task is to generate a succinct summary of up to 100 words from a set of 10 news articles. We report results on DUC-04 and TAC-11, which are standard test sets used in previous studies (Hong et al., 2014; Cho et al., 2019a). DUC-03 and TAC-08/09/10 are used as the validation set to tune hyper-parameters. For the adaptive setting, we fine-tune BERT on the single-document summarization dataset CNN/DailyMail (Hermann et al., 2015) and directly adapt it to the MDS test sets. Table 1 shows the statistics of the datasets in detail.
Evaluation Metrics. The extracted summaries are evaluated against human reference summaries using ROUGE (Lin, 2004) for summarization quality. We report ROUGE-1, ROUGE-2, ROUGE-SU4, and ROUGE-L, which respectively measure the overlap of unigrams, bigrams, skip bigrams with a maximum distance of 4 words, and the longest common subsequence between the extracted summary and the reference summary. To align with previous works, we report R-1, R-2, R-L for the MultiNews and WikiSum datasets, and R-1, R-2, R-SU4 for the DUC and TAC datasets. For all baseline methods, we report ROUGE results from their original papers if available, or use the results reported in Cho et al. (2019a) and Liu et al. (2021). We also report a measure of diversity for the generated summaries by calculating the unique n-gram ratio (Xiao and Carenini, 2020; Peyrard et al., 2017), defined as:
$$\text{Unique } n\text{-gram ratio} = \frac{\#\,\text{unique } n\text{-grams}}{\#\,n\text{-grams}}.$$
Baselines. We compare our methods with strong unsupervised summarization baselines. In particular, MMR (Carbonell and Goldstein, 1998) combines query relevance with information novelty in the context of summarization. LexRank (Erkan and Radev, 2004) computes sentence importance based on eigenvector centrality in a graph representation of sentences. TextRank (Mihalcea and Tarau, 2004) adopts PageRank (Page et al., 1999) to compute node centrality recursively based on a Markov chain model. SumBasic (Vanderwende et al., 2007) is an extractive approach assuming that words frequently occurring in a document cluster are more likely to be included in the summary. KL-Sum (Haghighi and Vanderwende, 2009) uses a greedy approach to add a sentence to the summary to minimize the KL divergence. PRIMERA (Xiao et al., 2022) is a pyramid-based pre-trained model for MDS that achieves state-of-the-art performance. We compare against it under its zero-shot setting.
Implementation Details. We run all experiments with 88 Intel(R) Xeon(R) CPUs. We combine the surface indicator based on TF-IDF and contextualized embeddings. We treat each document cluster as a corpus and each sentence as a document when calculating the TF-IDF scores. We employ the pre-trained sentence-transformer (Reimers and Gurevych, 2019) and extract sentence representations using the 'all-mpnet-base-v2' checkpoint.
The graph edges with low similarity are treated as disconnected to emphasize the connectivity of the graph and avoid noisy edge connections. We keep a threshold $\tilde{e}$ for edge weights such that edges with similarity scores smaller than $\tilde{e}$ are set to 0. The threshold is defined as $\tilde{e} = \min(e) + \theta(\max(e) - \min(e))$, where $\theta$ is a hyper-parameter tuned per dataset. For exhaustive search, we filter out the sentences with low centrality and only keep the top 15 sentences at inference. All hyper-parameters are tuned on the validation sets of MultiNews and WikiSum and the training sets of DUC and TAC. The best parameters are selected based on the highest R-1 score. More specifically, for the balancing factor $\lambda$ in SRI, we use $\{2^{-13}, 2^{-7}, 2^{-4}, 2^{-6}\}$ on the DUC, TAC, MultiNews, and WikiSum datasets. For $\alpha$, which weights the contributions of TF-IDF and contextualized sentence similarity, we use 0.9 on the news-domain datasets and 0.8 on the WikiSum dataset. The edge weight threshold $\theta$ is $\{0, 0, 0.1, 0.1\}$ for DUC, TAC, MultiNews, and WikiSum. As for beam search, we use beam sizes $\{4, 4, 4, 3\}$ on the corresponding datasets.
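The edge thresholding described above can be illustrated with a short sketch. It assumes a precomputed similarity matrix, takes the minimum and maximum over off-diagonal entries, and leaves edges at or above the threshold unchanged; these details and the function name are assumptions, not taken from the paper.

```python
from typing import List


def threshold_edges(E: List[List[float]], theta: float) -> List[List[float]]:
    """Zero out edges whose weight falls below min(e) + theta * (max(e) - min(e))."""
    n = len(E)
    values = [E[i][j] for i in range(n) for j in range(n) if i != j]
    if not values:
        return [row[:] for row in E]
    lo, hi = min(values), max(values)
    cutoff = lo + theta * (hi - lo)
    # Assumption: surviving edges keep their original weight.
    return [[E[i][j] if i != j and E[i][j] >= cutoff else 0.0
             for j in range(n)]
            for i in range(n)]


E = [[0.0, 0.9, 0.0],
     [0.9, 0.0, 0.1],
     [0.0, 0.1, 0.0]]
pruned = threshold_edges(E, theta=0.2)      # cutoff = 0.18
unchanged = threshold_edges(E, theta=0.0)   # cutoff = min(e), nothing is pruned
```

Setting θ = 0 keeps every edge, which corresponds to the value reported for DUC and TAC above.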

Unsupervised Summarization Results
The unsupervised summarization results on four benchmark MDS datasets are shown in Table 2.
Our method outperforms strong unsupervised baselines. Note that the MultiNews and WikiSum datasets provide abundant training samples and contain shorter inputs than the DUC or TAC datasets. Our method performs better than the pre-trained model PRIMERA in its zero-shot setting. Compared to the baseline (Sent. greedy) that extracts sentences solely based on importance, balancing diversity with SRI boosts performance by a large margin.
For the DUC-04 and TAC-11 datasets, our proposed methods outperform unsupervised baselines by a large margin. This demonstrates that balancing summary informativeness and diversity during the sentence extraction process is crucial for better summary quality. Note that the inputs of the DUC/TAC datasets are extremely long, spanning an average of 180 sentences. Such long inputs easily exceed the input capacity of transformer-based models, possibly resulting in information loss from the documents. The proposed methods, on the other hand, process documents regardless of the input length or format (SDS or MDS). Also, our unsupervised methods have the advantage of handling datasets with little training data. The strong performance on datasets with different input lengths and low-resource data illustrates the effectiveness of our methods. To further verify the model performance, we also conduct a human evaluation by experts on a scale of 1-5. The results shown in Table 3 also indicate that our method outputs better summaries in the unsupervised setting.

Table 2 (baseline rows):
Method           | DUC-04               | TAC-11               | MultiNews            | WikiSum
                 | R-1    R-2   R-SU4   | R-1    R-2   R-SU4   | R-1    R-2    R-L    | R-1    R-2    R-L
LEAD             | 30.77  8.27  7.35    | 32.88  7.84  11.46   | 39.41  11.77  14.51  | 37.63  14.75  33.76
MMR (1998)       | 30.14  4.55  8.16    | 31.43  6.14  11.16   | 38.77  11.98  12.91  | 31.22  10.24  22.48
LexRank (2004)   | 34.44  7.11  11.19   | 33.10  7.50  11.13   | 38.27  12.70  13.20  | 36.12  11.67  22.52
TextRank (2004)  | 33.16  6.13  10.16   | 33.24  7.62  11.27   | 38.44  13.10  13.50  |

Adaptive Summarization Results
The experimental results under the adaptive setting are shown in Table 4. Compared to a large pre-trained generation model (BART) and other task-specific pre-trained summarization models (PEGASUS, PRIMERA), our framework shows strong performance when adapting from a single-document summarization dataset. We also notice that fine-tuning on a single-document summarization corpus improves the performance of all pre-trained models, but our framework still achieves the best results under the adaptive setting.

Summary Diversity
Other than summary quality, we also test the effectiveness of our SRI in terms of the diversity of the output summaries. We present the unique n-gram ratios of output summaries under unsupervised and adaptive settings and of the reference summary on the TAC-11 dataset in Figure 3. According to the results, our framework is highly effective in reducing summary redundancy and increasing summary diversity under both unsupervised and adaptive settings. Taken together with the ROUGE-F1 results, holistic inference with the importance-diversity balancing measurement SRI increases both summary quality and diversity at the same time. The results suggest that considering summary diversity is beneficial in extractive summarization, especially in redundant cases like MDS and long document summarization. Our finding also verifies the crucial role of effective modeling of sentence importance and similarity.
To test the robustness of our proposed approaches, we study the hyperparameter sensitivity of our proposed methods. The results are shown in Figure 4. The first plot shows the impact of the balancing factor $\lambda$ in SRI. The second plot shows the impact of $\alpha$, which balances the contextualized and TF-IDF sentence similarities, and of the edge weight threshold $\theta$. The results show that our methods are relatively stable with respect to the hyperparameter values and could be easily adapted to unseen datasets.
Table 5: ROUGE-F1 (w/o word limit) results of SRI-beam with different beam sizes on TAC with λ = 0.125.
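The unique n-gram ratio used as the diversity measure can be computed with a few lines of Python. This sketch assumes simple whitespace tokenization, which may differ from the tokenization used in the experiments; the function name is hypothetical.

```python
from typing import List


def unique_ngram_ratio(tokens: List[str], n: int) -> float:
    """Number of distinct n-grams divided by the total number of n-grams."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


# Assumed whitespace tokenization of a toy summary.
summary_tokens = "the cat sat on the cat".split()
unigram_ratio = unique_ngram_ratio(summary_tokens, 1)  # 4 distinct of 6
bigram_ratio = unique_ngram_ratio(summary_tokens, 2)   # 4 distinct of 5
```

A ratio close to 1 means few repeated n-grams, so a more redundant summary yields a lower value.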

Inference Approaches Analysis
We also compare the efficiency and effectiveness of different inference methods. As shown in Figure 5, we compare sentence-level greedy search, set-level greedy search, set-level beam search (beam size = 4), and set-level exhaustive search with pre-filtering as inference methods for both unsupervised and adaptive settings. We pick a filter size of 20 here since the search space without filtering, C(N, K), is extremely large. According to the results, all set-level inference methods outperform the sentence-level methods. This suggests that extracting summaries at the set level (holistically) is preferable to the common sentence-level setting that extracts sentences individually. The finding is also consistent with the inherent performance gap between sentence-level and holistic extractors reported by Zhong et al. (2020). Moreover, we observe that set-level beam search and set-level exhaustive search achieve comparable, best performance. However, set-level beam search is much faster than set-level exhaustive search. We also show the effect of different beam sizes in Table 5. The results indicate that a reasonably small beam size achieves the best ROUGE results, making it both effective and efficient. To conclude, set-level beam search with SRI shows the best overall performance.

Conclusion
This paper proposes a holistic framework for unsupervised multi-document extractive summarization. Our framework incorporates the holistic beam search inference method and SRI, a holistically balanced measurement between importance and diversity. We conduct extensive experiments on both small and large-scale MDS datasets under both unsupervised and adaptive settings, and the proposed method outperforms strong baselines by a large margin. We also find that balancing summary set importance and diversity benefits both the quality and diversity of output summaries for MDS.

Limitations
The proposed framework in this paper is mainly designed for low-resource scenarios without gold summaries for multi-document summarization. Adapting the framework to a supervised setting requires further investigation. Recently, large language models (LLMs) like ChatGPT have shown strong zero-shot summarization ability, which may raise doubt about the necessity of unsupervised summarization methods.
However, LLMs suffer from the hallucination problem, and MDS inputs (e.g., 4,696 words for TAC) may exceed the input limit of ChatGPT (500 words / 4,000 characters). In contrast, unsupervised summarization methods can handle input of arbitrary length and have a faster inference speed than ChatGPT when processing long input documents. In addition, a recent study (Zhang et al., 2023c) shows that ChatGPT's extractive summarization performance is still inferior to existing supervised systems in terms of ROUGE scores.

Ethical Consideration
Our proposed framework forms a summary by directly extracting sentences from the source documents. Therefore, the extracted summary may be incoherent or contain non-factual co-references. In addition, the extracted summary will retain biased content from the source sentences, if any.

Figure 1: An example of a diverse summary vs. a redundant summary. Sentences in the redundant summary have higher semantic similarity than those in a diverse summary.

Figure 2: Illustration of the proposed holistic framework for multi-document summarization. The individual inference only resorts to each candidate, while the holistic inference is based on all candidates. Orange and green indicate newly added sentences and sentences already added to the summary, respectively.

Table 1: Detailed statistics of four multi-document datasets. #test denotes the number of document clusters in the test set, #ref denotes the number of reference summaries, avg.word(doc) denotes the average number of words in the source document cluster, and avg.word(sum) denotes the average number of words in the ground truth summary.

Table 2: ROUGE-F1 results on four datasets under the unsupervised setting. Best unsupervised results are in bold. For a fair comparison, we report R-L on MultiNews and R-Lsum (See et al., 2017) for WikiSum, and limit summaries to 100 words on DUC-04 and TAC-11. R-L scores are marked with * if they report ROUGE-Lsum numbers.

Table 3: Human evaluation results on a scale of 1-5.

Table 4: ROUGE-F1 results on the DUC-04 and MultiNews datasets under the adaptive setting. Models adapted from the CNN/DailyMail dataset are marked in brackets.

Figure 5: Efficiency vs. average ROUGE (w/o word limit) scores of different inference methods on TAC-11.