Supervising the Centroid Baseline for Extractive Multi-Document Summarization

The centroid method is a simple approach for extractive multi-document summarization, and many improvements to its pipeline have been proposed. We refine it further by adding a beam search process to the sentence selection and a centroid estimation attention model, both of which lead to improved results. We demonstrate this on several multi-document summarization datasets, including in a multilingual scenario.


Introduction
Multi-document summarization (MDS) addresses the need to condense content from multiple source documents into concise and coherent summaries while preserving the essential context and meaning. Abstractive techniques, which involve generating novel text to summarize source documents, have gained traction in recent years (Liu and Lapata, 2019; Jin et al., 2020; Xiao et al., 2022), following the advent of large pre-trained generative transformers. However, their effectiveness in summarizing multiple documents remains limited. This is attributed not only to the long input context imposed by multiple documents but also to a notable susceptibility to factual inconsistencies. Owing to the hallucination-proneness of large language models, this susceptibility is more pronounced in abstractive methods than in their extractive counterparts.
Extractive approaches, on the other hand, tackle this problem by identifying and selecting the most important sentences or passages from the given documents to construct a coherent summary. Extractive MDS usually involves a sentence importance estimation step (Hong and Nenkova, 2014; Cao et al., 2015; Cho et al., 2019), in which sentences from the source documents are scored according to their relevance and redundancy with respect to the remaining sentences. Then, the summary is built by selecting a set of sentences achieving high relevance and low redundancy. The centroid-based method (Radev et al., 2000) is a cheap unsupervised solution in which each cluster of documents is represented by a centroid, consisting of the sum of the TF-IDF representations of all the sentences within the cluster, and the sentences are ranked by their cosine similarity to the centroid vector. While the original method is a baseline that can be easily surpassed, subsequent enhancements have been introduced to make it a more competitive yet simple approach (Rossiello et al., 2017; Gholipour Ghalandari, 2017; Lamsiyah et al., 2021).
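For illustration, the original centroid baseline can be sketched in a few lines of Python. This is our own simplified rendering, not the authors' code: plain bag-of-words counts stand in for the TF-IDF representations, and the function names are invented for this example.

```python
import numpy as np
from collections import Counter

def bow_vectors(sentences):
    # Bag-of-words counts as a simplified stand-in for the TF-IDF
    # representations used by the original centroid method.
    vocab = sorted({w for s in sentences for w in s.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    vecs = np.zeros((len(sentences), len(vocab)))
    for row, sent in enumerate(sentences):
        for word, count in Counter(sent.lower().split()).items():
            vecs[row, index[word]] = count
    return vecs

def centroid_rank(sentences):
    # Rank sentences by cosine similarity to the cluster centroid,
    # i.e. the sum of all sentence vectors in the cluster.
    vecs = bow_vectors(sentences)
    centroid = vecs.sum(axis=0)
    sims = vecs @ centroid / (
        np.linalg.norm(vecs, axis=1) * np.linalg.norm(centroid) + 1e-12
    )
    return list(np.argsort(sims)[::-1])  # best-first order
```

Sentences sharing vocabulary with the rest of the cluster end up closest to the centroid, so an off-topic sentence is ranked last.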
In this work, we refine the centroid method even further: i) we utilize multilingual sentence embeddings to enable summarization of clusters of documents in various languages; ii) we employ beam search for sentence selection, leading to a more exhaustive exploration of the candidate space and ultimately enhancing summary quality; iii) we leverage recently proposed large datasets for multi-document summarization by adding supervision to the centroid estimation process. To achieve this, we train an attention-based model to approximate the oracle centroid obtained from the ground-truth target summary, leading to significant ROUGE-score improvements in mono- and multilingual settings. To the best of our knowledge, we are the first to tackle the problem within a truly multilingual framework, enabling the summarization of a cluster of documents in different languages.

Related Work
Typical supervised methods for extractive summarization involve training a model to predict sentence saliency, i.e., the model learns to score sentences in a document with respect to the target summary, either by direct match when an extractive target is available or constructed (Svore et al., 2007; Woodsend and Lapata, 2012; Mendes et al., 2019) or by maximizing a similarity score (e.g., ROUGE) with respect to the abstractive target summaries (Narayan et al., 2018). Attempts to reduce redundancy exploit the notion of maximum marginal relevance (MMR; Carbonell and Goldstein, 1998; McDonald, 2007) or are coverage-based (Gillick et al., 2008; Almeida and Martins, 2013), seeking a set of sentences that cover as many concepts as possible while respecting a predefined budget. During inference, the model is then able to score the sentences with respect to their salience, selecting the highest-scored sentences for the predicted summary. Rather than training a model that predicts salience for each individual sentence, we employ a supervised model that directly predicts an overarching summary representation, specifically the centroid vector of the desired summary. Training this model can thus be more direct when training with abstractive summaries (as is the case in most summarization datasets), since computing the reference summary centroid is independent of whether the target is extractive or abstractive.
Regarding enhancements to the centroid method for extractive MDS, Rossiello et al. (2017) refined it by substituting the TF-IDF representations with word2vec embeddings (Mikolov et al., 2013), and further incorporated a redundancy filter into the algorithm. Gholipour Ghalandari (2017), on the other hand, retained the TF-IDF sentence representations but improved the sentence selection process. Recently, Lamsiyah et al. (2021) introduced modifications to the sentence scoring mechanism, incorporating novelty and position scores, and evaluated a diverse array of sentence embeddings with the proposed methodology, including contextual embeddings provided by ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019).
While there have been initiatives to foster research in multilingual extractive MDS (Giannakopoulos, 2013; Giannakopoulos et al., 2015), the proposed approaches (Litvak and Vanetik, 2013; Aries et al., 2015; Huang et al., 2016) are merely language-agnostic, requiring all the documents within each cluster to be in the same language. In contrast, we address extractive MDS in a scenario where each cluster itself is multilingual.

Methodology
The pipeline of our proposed model is divided into two stages. In the first stage, we use an attention model to obtain a cluster representation that replaces the naive centroid obtained by averaging the sentence embeddings of the documents in a cluster. The rationale behind this approach is that the contribution of each sentence to the cluster centroid should depend on its relevance to the cluster summary. In order to capture the whole cluster context, a sentence-level attention model is employed, assigning variable weights to each sentence embedding so as to approximate the resulting average to the centroid that would be obtained by averaging the sentence embeddings of the target summary. In the second stage, an adapted version of the greedy sentence selection algorithm from Gholipour Ghalandari (2017) for extractive MDS is used to select the sentences included in the predicted summary. This adapted version uses our proposed supervised centroid and also includes a beam search algorithm to better explore the space of candidate summaries.

Centroid Estimation
Gholipour Ghalandari (2017) builds a centroid by summing the TF-IDF sentence representations of all the sentences that compose the cluster to summarize. In our work, we instead compute the centroid as a learnable weighted average of contextual sentence embeddings, via an attention model.

Attention Model
In our centroid estimation procedure, we use a pre-trained multilingual sentence transformer from Yang et al. (2020) to encode the sentences from the news articles, obtaining contextual embeddings e_k ∈ R^d, k ∈ {1, . . . , N}, for each of the N sentences in a cluster. Since the first sentences of a document are often especially important for news summarization tasks, we add sentence-level learnable positional embeddings to the contextual embeddings at the input of the attention model. Specifically, given a cluster D comprising N sentences, we compute:

e_pos,k = e_k + p_pos(k),  (1)

where pos(k) is the position within the respective document of the k-th sentence in the cluster and p_pos(k) ∈ R^d is the corresponding learnable positional embedding. Each e_pos,k ∈ R^d is then concatenated with the mean-pool vector of the cluster, denoted by ē_pos ∈ R^d, resulting in e′_pos,k = concat(e_pos,k, ē_pos) for each sentence. This concatenation ensures that the computation of the attention weight for each position uses information from all the remaining positions. The vector β ∈ R^N of attention weights is obtained as:

β = softmax([MLP(e′_pos,1), . . . , MLP(e′_pos,N)]),  (2)

where MLP is a two-layer perceptron shared by all the positions. It has a single output neuron and a hidden layer with d units and a tanh activation.
After computing the attention weights for the cluster, we take the original sentence embeddings e_k, k ∈ {1, . . . , N}, and compute a weighted sum of these representations:

h = Σ_{k=1}^{N} β_k e_k.  (3)

Consequently, the resultant vector h ∈ R^d is a convex combination of the input sentence embeddings. Since it is not guaranteed that the target centroid lies within this space, h is subsequently mapped to the output space through a linear layer, yielding an estimate ĉ_attn ∈ R^d of the centroid. Hereafter we refer to this attention model as Centroid Regression Attention (CeRA).
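A minimal numpy sketch of the CeRA forward pass, under our own toy shapes; the positional embeddings, MLP weights, and output projection below are random stand-ins for parameters that are learned in the actual model, and the softmax over the MLP scores is our reading of how the convex combination in equation (3) is obtained.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 4, 3  # toy embedding size and number of sentences in the cluster

e = rng.normal(size=(N, d))   # contextual sentence embeddings e_k
p = rng.normal(size=(N, d))   # learnable positional embeddings (random stand-in)

e_pos = e + p                                         # add positional information
mean_pool = np.tile(e_pos.mean(axis=0), (N, 1))
e_prime = np.concatenate([e_pos, mean_pool], axis=1)  # concat with cluster mean

# Shared two-layer MLP: hidden layer of d units with tanh, single output neuron.
W1, b1 = rng.normal(size=(2 * d, d)), np.zeros(d)
W2, b2 = rng.normal(size=(d, 1)), np.zeros(1)
scores = np.tanh(e_prime @ W1 + b1) @ W2 + b2         # one score per sentence

beta = np.exp(scores) / np.exp(scores).sum()          # attention weights, sum to 1
h = (beta * e).sum(axis=0)                            # convex combination of the e_k

W_out = rng.normal(size=(d, d))
c_attn = h @ W_out    # linear layer mapping h to the centroid estimate
```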
Interpolation The original (unsupervised) approach estimates the centroid by computing the average of all sentence representations e_k within a cluster, which has consistently demonstrated strong performance. Let e_D represent this centroid for cluster D. To leverage the advantages of this effective technique, we introduce e_D as a residual component to enhance the estimate produced by the attention model. Thus, our final centroid estimate is computed as:

ĉ = α ⊙ ĉ_attn + (1 − α) ⊙ e_D,  (4)

where α ∈ [0, 1]^d is a vector of interpolation weights and ⊙ denotes elementwise multiplication. The interpolation weights are obtained by concatenating ĉ_attn and e_D and mapping the result through an MLP of two linear layers with d units each.
The two layers are interleaved with a ReLU activation and a sigmoid is applied at the output.We call the model with interpolation CeRAI.
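The interpolation step can be sketched similarly; again the gating-MLP weights below are random placeholders for parameters that are learned in the actual model.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
c_attn = rng.normal(size=d)   # centroid estimate from the attention model
e_D = rng.normal(size=d)      # unsupervised mean-pool centroid of the cluster

# Gating MLP: two linear layers of d units with a ReLU in between,
# followed by a sigmoid so each weight lies in (0, 1).
W1 = rng.normal(size=(2 * d, d))
W2 = rng.normal(size=(d, d))
hidden = np.maximum(0.0, np.concatenate([c_attn, e_D]) @ W1)
alpha = 1.0 / (1.0 + np.exp(-(hidden @ W2)))

c_hat = alpha * c_attn + (1.0 - alpha) * e_D   # elementwise interpolation
```

Because alpha is elementwise, the model can trust the attention estimate on some dimensions while falling back to the unsupervised centroid on others.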
Training Objective Finally, we minimize the cosine distance between the model prediction ĉ and the mean-pool of the sentence embeddings of the target summary, c_gold.
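In code, the objective is simply one minus the cosine similarity between the two centroids (a sketch with an invented helper name):

```python
import numpy as np

def cosine_distance(c_hat, c_gold):
    # Training loss: 1 - cos(c_hat, c_gold); zero when the prediction
    # points in exactly the same direction as the target centroid.
    sim = c_hat @ c_gold / (
        np.linalg.norm(c_hat) * np.linalg.norm(c_gold) + 1e-12
    )
    return 1.0 - sim
```

Parallel vectors give a loss near 0, orthogonal ones a loss of 1, so only the direction of the estimate matters, not its magnitude.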

Sentence Selection
Consider the cluster D and a set S containing the sentences currently in the summary. At each iteration of greedy sentence selection (Gholipour Ghalandari, 2017), we compute the mean-pool embedding of each candidate summary,

e_{S∪{s}} = (1 / |S ∪ {s}|) Σ_{s′ ∈ S∪{s}} e_{s′},  (5)

for each sentence s ∈ D \ S. Then, the new sentence s* to be included in the summary is

s* = argmax_{s ∈ D\S} cos_sim(e_{S∪{s}}, e_D),  (6)

where cos_sim is the cosine similarity. The algorithm stops when the summary length reaches the specified budget. As demonstrated in that work, redundancy is mitigated since the centroid is compared to the whole candidate summary S ∪ {s} at each iteration and not only to the new sentence s.
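The greedy selection loop described above can be sketched as follows (our own rendering, with invented function names; `sent_lens` holds each sentence's word count):

```python
import numpy as np

def cos_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def greedy_select(sent_embs, sent_lens, centroid, budget):
    # At each step, add the sentence whose inclusion makes the candidate
    # summary's mean embedding most similar to the cluster centroid.
    selected, length = [], 0
    while True:
        best, best_score = None, -np.inf
        for i in range(len(sent_embs)):
            if i in selected or length + sent_lens[i] > budget:
                continue
            cand = np.mean(sent_embs[selected + [i]], axis=0)
            score = cos_sim(cand, centroid)
            if score > best_score:
                best, best_score = i, score
        if best is None:        # budget reached or no sentences left
            return selected
        selected.append(best)
        length += sent_lens[best]
```

Because the score is computed on the whole candidate summary, a sentence redundant with the already selected ones barely moves the mean embedding and is unlikely to win an iteration.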
In our version of the algorithm, we not only estimate the cluster centroids as explained in §3.1, replacing e_D by ĉ in equation (6), but also employ a beam search (BS) algorithm so that the space of candidate summaries is explored more thoroughly. Moreover, in order to exhaust the chosen budget, we add a final greedy search that makes further improvements to the extracted summary. The procedure is defined in Algorithm 1, shown in Appendix A, and we describe it less formally below.

Beam Search
The process begins by preselecting sentences, retaining only the first n sentences from each document. Beam search starts by selecting the B sentences with the highest similarity scores to the centroid, where B is the beam size. In each subsequent iteration, the algorithm finds the B highest-scoring sentences for each beam, generating a total of B² candidates. Among these candidates, only the B highest-ranked are retained. If adding any of these sentences would exceed the specified budget length for the summary, we preserve the corresponding previous state, and no further exploration is conducted on that beam. The beam search concludes when all candidate beams have exceeded the budget or when no more sentences are available.
Greedy Search To exhaust the specified budget and improve results, we add a greedy search of sentences that are allowed within the word limit.
The top-scoring B states from the beam search are used as starting points for this greedy search.
Then, for each state, we greedily select the highest-scoring sentence that does not exceed the budget among the top T ranked sentences. This process iterates until either all of the top T ranked sentences would exceed the budget or there are no further sentences left for consideration.
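A simplified sketch of the beam-search stage (our own rendering with invented names: states that would exceed the budget are frozen instead of expanded; deduplication of permutation-equivalent beams and the final greedy refinement are omitted):

```python
import numpy as np

def cos_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def beam_search_select(sent_embs, sent_lens, centroid, budget, B=2):
    # Beam states: (selected indices, total length, finished flag).
    beams = [([], 0, False)]
    while not all(done for _, _, done in beams):
        candidates = []
        for sel, length, done in beams:
            if done:
                candidates.append((sel, length, True))
                continue
            # Score every remaining sentence by the similarity of the
            # grown candidate summary's mean embedding to the centroid.
            scored = []
            for i in range(len(sent_embs)):
                if i in sel:
                    continue
                score = cos_sim(np.mean(sent_embs[sel + [i]], axis=0), centroid)
                scored.append((score, i))
            scored.sort(reverse=True)
            grown = False
            for score, i in scored[:B]:
                if length + sent_lens[i] > budget:
                    continue  # over budget: the previous state is kept instead
                candidates.append((sel + [i], length + sent_lens[i], False))
                grown = True
            if not grown:
                candidates.append((sel, length, True))
        # Keep only the B highest-scoring states across all beams.
        candidates.sort(
            key=lambda c: cos_sim(np.mean(sent_embs[c[0]], axis=0), centroid)
            if c[0] else -1.0,
            reverse=True,
        )
        beams = candidates[:B]
    return [sel for sel, _, _ in beams]
```

Each iteration expands every live beam with its top B sentences, giving up to B² candidates, of which only the best B survive.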

Experimental Setup
Herein, we outline the methods, datasets, and evaluation metrics employed in our experiments.
Methods We compare our approaches with the centroid-based methods from Gholipour Ghalandari (2017) and Lamsiyah et al. (2021), described in §2. To be consistent with the remaining methods, the approach by Gholipour Ghalandari (2017) was implemented on top of contextual sentence embeddings instead of TF-IDF. Additionally, we perform ablation evaluations in three scenarios: i) a scenario (BS) where we do not use the centroid estimation model (§3.1) and rely solely on the beam search for the sentence selection step (§3.2); ii) a scenario (BS+GS) identical to the previous one, except that we perform the greedy search step after the beam search; iii) two scenarios (CeRA and CeRAI) where we utilize the centroid estimation model without and with interpolation, respectively, and apply the BS+GS algorithm on the predicted centroid. The "Oracle centroid" upper-bounds our approaches, since it results from applying BS+GS with the mean-pool of the sentence embeddings of the target summary, c_gold, as the cluster centroid.
Appendix C provides additional details about data processing and hyperparameters.
Datasets We evaluate on four English MDS benchmarks (Multi-News, WCEP-10, TAC2008, and DUC2004) and on the multilingual CrossSum dataset. CrossSum was conceived for single-document cross-lingual summarization, so we had to adapt it for multilingual MDS. This adaptation results in clusters that encompass documents in multiple languages, with each cluster being associated with a single reference summary containing sentences in various languages. We explain this procedure and provide further details about each dataset in Appendix B.

Evaluation Metrics
We report ROUGE scores (Lin, 2004) in all the experiments. When evaluating models in the multilingual setting, we translated both the reference summaries and the extracted summaries into English prior to ROUGE computation. As we optimized for ROUGE-2 recall (R2-R) on the validation sets, we report it as our main metric in Tables 1 and 2. The remaining scores are shown in Appendix D.

Results
Monolingual Setting The ROUGE-2 recall (R2-R) scores of all the methods on the monolingual datasets are presented in Table 1. F1 scores and results for the other ROUGE variants are presented in Table 4, in Appendix D. The first observation is that BS alone outperforms Gholipour Ghalandari (2017) on all datasets, with additional improvements obtained when the greedy search step is also performed (BS+GS). This was expected, since our approach explores the candidate space more thoroughly. The motivation for using a supervised centroid estimation model arose from the excellent ROUGE results obtained when using the target summaries to build the centroid ("Oracle centroid" in the tables), showing that an enhanced centroid estimation procedure could improve the results substantially. This is confirmed by the two methods using the centroid estimation model (CeRA and CeRAI), which improve R2-R significantly on Multi-News and WCEP-10 and perform at least on par with Lamsiyah et al. (2021) on TAC2008 and DUC2004. It is also worth noting that CeRA and CeRAI were only trained on the Multi-News training set and nevertheless performed better than or on par with the remaining baselines on the test sets of the remaining corpora. Incorporating the interpolation step (CeRAI) appears to yield supplementary enhancements compared to the non-interpolated version (CeRA) across various settings, which we attribute to this method adding regularization to the estimation process, improving results in harder scenarios.
Multilingual Setting The R2-R scores of all the methods on CrossSum can be found in Table 2, while additional results are in Table 5 of Appendix D. Once again, we observe the superiority of the centroid estimation models, CeRA and CeRAI, in comparison to all the remaining methods, with the variants with and without interpolation performing on par with each other. Most notably, these models prove to be useful even when tested on languages unseen during the training phase, underscoring their robustness and applicability in a zero-shot setting.

Conclusions
We enhanced the centroid method for multi-document summarization by extending a previous approach with a beam search followed by a greedy search. Additionally, we introduced a novel attention-based regression model for better centroid prediction. These improvements outperform existing methods across various datasets, including a multilingual setting, offering a robust solution for this challenging scenario. Regarding future work, we believe an interesting research direction would be to further explore the supervised centroids obtained by the CeRA and CeRAI models, using them as a proxy objective to obtain improved abstractive summaries.

Limitations
While we believe that our approach possesses merits, it is equally important to recognize its inherent limitations. Diverging from conventional centroid methods that operate entirely in an unsupervised manner, our centroid estimation model requires training with reference summaries. Nevertheless, its robustness to dataset shifts was demonstrated: the model trained on Multi-News consistently yielded strong results when assessed on different English datasets, and the model trained on a subset of languages from CrossSum displayed successful generalization to other languages. Finally, our method introduces increased computational complexity. This arises from both the forward pass through the attention model and the proposed beam search algorithm, which incurs a greater computational cost compared to the original, simpler greedy approach proposed by Gholipour Ghalandari (2017).

B Datasets
We now describe each of the datasets used for evaluation and explain how we have adapted CrossSum for the task of MDS.
Multi-News The Multi-News dataset (Fabbri et al., 2019) is a large-scale dataset for MDS of news articles. It contains up to 10 documents per cluster and more than 50 thousand clusters divided into training, validation, and test splits. There is a single human-written reference summary for each cluster.
WCEP-10 This dataset (Ghalandari et al., 2020; Xiao et al., 2022) consists of short human-written target summaries extracted from the Wikipedia Current Events Portal (WCEP). Each news cluster associated with a certain event is paired with a single reference summary, and there are at most 10 documents per cluster. The dataset comprises 1022 clusters, all of which are used for testing.
TAC2008 This is a multi-reference dataset introduced by the Text Analysis Conference (TAC). It provides neither training nor validation sets; the test set consists of 48 news clusters, each with 10 related documents and 4 human-written reference summaries.
DUC2004 Another multi-reference news summarization dataset, designed and used for testing only. It contains 50 clusters, each with 10 documents and 4 human-written reference summaries.
CrossSum To assess the performance of the models in a multilingual context, we have adapted the CrossSum dataset (Bhattacharjee et al., 2023) for the task of MDS. Initially designed for cross-lingual summarization, this dataset offers document-summary pairs for more than 1500 language directions. The dataset is derived from pairs of articles sourced from the multilingual summarization dataset XL-Sum (Hasan et al., 2021). Notably, these pairings were established using an automatic similarity metric, resulting in many pairs covering similar topics rather than the exact same stories, rendering it well-suited for MDS.
To tailor this dataset to our task, we began by selecting the data from a predefined subset of the languages. Subsequently, we aggregated the documents into clusters, taking into account their pairings. For instance, if document A was paired with document B and document B was paired with document C, then A, B, and C would belong to the same cluster. Clusters containing only one document were discarded. To obtain a multilingual reference summary for each cluster, we interleaved the sentences from the individual summaries until reaching a predefined limit of 100 words. We built training, validation, and test sets using data in English, Spanish, and French, and another test set using data in Portuguese, Russian, and Turkish to evaluate our model in a zero-shot setting. Statistics about each split are presented in Table 3.
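The clustering step amounts to taking connected components over the document pairings; a small union-find sketch (function name invented for illustration):

```python
def cluster_documents(pairs):
    # Group documents into clusters by the transitive closure of their
    # pairings: if A-B and B-C are paired, {A, B, C} is one cluster.
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)

    clusters = {}
    for doc in list(parent):
        clusters.setdefault(find(doc), set()).add(doc)
    # Singleton clusters are discarded, as in the CrossSum adaptation.
    return [c for c in clusters.values() if len(c) > 1]
```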

C Experimental Details
Data Processing To ensure a fair comparison, all the models we evaluated used the same sentence representations, specifically, sentence embeddings obtained from the distiluse-base-multilingual-cased-v2 sentence encoder (Yang et al., 2020).
For monolingual datasets, the documents were split into sentences using sent_tokenize from the NLTK library (Bird et al., 2009). For CrossSum, we used SentSplitter from the multilingual ICU tokenizer. Regular expressions were applied to replace redundant white space and excessive paragraph breaks, and empty sentences were excluded. Before sentence selection (Algorithm 1), the data goes through a second processing step, during which duplicate sentences and sentences that individually exceed the summary budget are eliminated.
When evaluating models in CrossSum, we translated both the reference summaries and the extracted summaries into English prior to ROUGE computation.All the translations were performed using the M2M-100 12-billion-parameter model (Fan et al., 2021).
The following word-limit budgets were used by all models: 230 words for the Multi-News dataset, 100 words for TAC2008, DUC2004, and CrossSum, and 50 words for WCEP-10.

Hyperparameters The hyperparameters for the beam search-based methods were tuned by running a grid search on the BS+GS approach on the Multi-News validation set. For the number of sentences n, odd numbers from 1 to 9 were tested. For the beam width B, the values 1, 5, and 9 were examined, and for the number of candidates T, the values 1, 5, and 9 were considered. The values that maximized R2-R on this validation set were n = 9, B = 5, and T = 9. These values were used in all of our experiments. Note that for the BS method only n and B are relevant.
The hyperparameters of the centroid estimation model used in CeRA were obtained by random search on Multi-News. The hyperparameters yielding the highest R2-R score on the validation set for the produced summaries were kept. The CeRAI model was trained using the optimal hyperparameters found for CeRA. The optimal parameters were: batch size = 2, learning rate = 5×10⁻⁴, and number of positional encodings = 35. We utilized the Adam optimizer with a multi-step learning rate scheduler configured with step size = 3 and γ = 0.1.

Implementation Details Our CeRA and CeRAI models used early stopping, with the stopping criterion based on R2-R. Layer normalization (Ba et al., 2016) was applied to the input data before adding the positional information and before passing the data through the last linear layer that transforms h (equation (3)) into ĉ_attn in the CeRA and CeRAI models. We also normalized the input data to have unit L2 norm.

D Additional Results
The ROUGE-1/2/L recall and F1 scores obtained by all the methods in the monolingual datasets are shown in Table 4. Table 5 presents the same quantities for the multilingual case.

Table 1 :
ROUGE-2 recall with 95% bootstrap confidence intervals of different extractive methods on the considered test sets. CeRA and CeRAI were only trained on the Multi-News training dataset.