Improving Unsupervised Extractive Summarization with Facet-Aware Modeling

Unsupervised extractive summarization aims to extract salient sentences from a document without a labeled corpus. Existing methods are mostly graph-based and rank sentences by their centrality. However, these methods tend to select sentences from the same facet, which leads to the facet bias problem, especially when the document has multiple facets (i.e., long documents and multi-document inputs). To address this problem, we propose a novel facet-aware centrality-based ranking model. We let the model pay more attention to different facets by introducing a sentence-document weight, which is added to the sentence centrality score. We evaluate our method on a wide range of summarization tasks covering 8 representative benchmark datasets. Experimental results show that our method consistently outperforms strong baselines, especially in long- and multi-document scenarios, and even performs comparably to some supervised models. Extensive analyses confirm that the performance gains come from alleviating the facet bias problem.


Introduction
Document summarization is the task of transforming a long document into a shorter version while retaining its most important content (Nenkova and McKeown, 2011). Existing extractive and abstractive methods are mostly supervised and rely on large amounts of labeled data (Cheng and Lapata, 2016; Nallapati et al., 2017; Gehrmann et al., 2018; Liu and Lapata, 2019a,b; Zhang et al., 2019). However, such data is not available for every summarization style, domain, and language. Fortunately, recent work has shown successful practices on unsupervised extractive summarization (Radev et al., 2000; Mihalcea and Tarau, 2004; Erkan and Radev, 2004; Schluter and Søgaard, 2015; Tixier et al., 2017; Zheng and Lapata, 2019; Xu et al., 2020; Dong et al., 2020). Compared with supervised methods, unsupervised methods 1) remove the dependency on large-scale annotated document-summary pairs and 2) generalize more readily to various scenarios.
Figure 1: Examples from the New York Times. We show a selection of key sentences from the source document. "..." marks context sentences omitted due to space limitations.
* Contribution during internship at Tencent Inc.
† Corresponding Author
Graph-based models are commonly used in unsupervised extractive methods (Radev et al., 2000; Mihalcea and Tarau, 2004; Erkan and Radev, 2004). For example, Zheng and Lapata (2019) proposed a directed centrality-based method named PacSum, which assumes that the contribution of any two nodes to their respective centrality is influenced by their relative position in a document. Dong et al. (2020) further improved PacSum by incorporating hierarchical and positional information into the directed centrality method. The core idea of centrality-based models is that the more similar a sentence is to other sentences, the more important it is (Radev et al., 2000). This usually works well for documents with a single facet (i.e., topic or aspect). However, documents often have more than one facet, especially long documents and multi-document inputs. Figure 1 shows an example of a long document with 3 facets, where we highlight the key phrases of each facet in different colors. Current centrality-based models often select sentences from the one facet that is supported by the most similar sentences. For example, the baseline model selects 3 sentences from facet 1. We call this the facet bias problem.
Figure 2 gives an intuitive explanation of the facet bias problem. The nodes are sentence representations, the star is the document representation, and the rhombuses are the centers of the selected summary sentences. Sentences that support the same facet are enclosed in the same circle. Centrality-based models tend to select sentences from facet 1 (red nodes) because these sentences are more similar to each other, which leads to higher centrality scores. However, the true summary should consist of important sentences from different facets (blue nodes).
To address the facet bias problem, we propose a facet-aware centrality-based model called Facet-Aware Rank (FAR). First, we introduce a modified graph-based ranking method to filter irrelevant sentences. Then we encode the whole document into a vector that is used to capture all facets in the document. For each candidate summary, we calculate a similarity score between the summary sentences and the document. This sentence-document similarity measures the relevance between summary and document, whereas the sentence centrality measures sentence-level importance. In the ranking phase, we combine the sentence-document similarity and the sentence centrality so that the selected sentences are both important and cover the facets of the document. As shown in Figure 2, by incorporating the sentence-document similarity, we are more likely to select the blue nodes, which are closer to the star, instead of the red ones.
We evaluate our method on 8 representative datasets. The results show that our model surpasses strong unsupervised baselines on most datasets and is comparable to supervised models on some datasets. Extensive analyses confirm that the performance gains indeed come from alleviating the facet bias problem. In addition, we find, somewhat surprisingly, that our method can reduce redundancy in summaries to some extent.

Background: Graph-based Ranking
Given a document $D$ containing a set of sentences $\{s_1, \dots, s_i, \dots, s_j, \dots, s_n\}$, graph-based algorithms treat $D$ as a graph $G = (V, E)$. $V = \{v_1, v_2, \dots, v_n\}$ is the vertex set, where $v_i$ is the representation of sentence $s_i$. $E$ is the edge set, represented as an $n \times n$ matrix in which each $e_{ij} \in E$ denotes the weight between vertices $v_i$ and $v_j$.
The key idea of graph-based ranking is to calculate the centrality score of each sentence (or vertex). Traditionally, this score is measured by degree or by ranking algorithms (Mihalcea and Tarau, 2004; Erkan and Radev, 2004) based on PageRank (Brin and Page, 1998). The sentences with the top scores are then extracted as the summary. The undirected graph algorithm computes the sentence centrality score as follows:
$$\mathrm{Centrality}(s_i) = \sum_{j \in \{1, \dots, n\},\, j \neq i} e_{ij} \qquad (1)$$
This is based on the assumption that the contribution of a sentence to importance in the document is not affected by the order of the sentences. In contrast, directed graph-based ranking algorithms take positional information into consideration, based on the assumption that the content preceding the current sentence and the content following it have different impacts on the centrality score of the current sentence (Mann and Thompson, 1988). Equation 1 is then reformulated as
$$DC(s_i) = \lambda_1 \sum_{j < i} e_{ij} + \lambda_2 \sum_{j > i} e_{ij} \qquad (2)$$
where $\lambda_1 + \lambda_2 = 1$. The hyper-parameters $\lambda_1$ and $\lambda_2$ adjust the influence of the preceding and following content, respectively. Our method builds on the directed graph-based ranking algorithm.
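The two centrality scores described above can be illustrated with a minimal pure-Python sketch. It assumes the undirected score sums the edge weights to all other sentences, and the directed score weights preceding and following sentences by λ1 and λ2. The function names and the toy similarity matrix are hypothetical and only serve the illustration.

```python
def undirected_centrality(E):
    """Sum the edge weights from sentence i to every other sentence."""
    n = len(E)
    return [sum(E[i][j] for j in range(n) if j != i) for i in range(n)]


def directed_centrality(E, lam1=0.5):
    """Weight preceding (j < i) and following (j > i) sentences separately."""
    lam2 = 1.0 - lam1  # the text requires lambda_1 + lambda_2 = 1
    n = len(E)
    scores = []
    for i in range(n):
        before = sum(E[i][j] for j in range(i))
        after = sum(E[i][j] for j in range(i + 1, n))
        scores.append(lam1 * before + lam2 * after)
    return scores


# Hypothetical similarity matrix for a 3-sentence document.
E = [
    [0.0, 0.8, 0.2],
    [0.8, 0.0, 0.4],
    [0.2, 0.4, 0.0],
]
undirected = undirected_centrality(E)
directed = directed_centrality(E, lam1=0.0)  # only following sentences count
```

With `lam1=0.0` the last sentence receives a score of zero because nothing follows it, which shows how the directed variant makes the score depend on position.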

Modified Directed Graph-based Ranking
We propose a variation of directed graph-based ranking in this section. We modify Equation 2 to filter out negligible sentences. Take $s_1$ in Figure 2 as an example. For $s_1$ there usually exist many unrelated sentences, especially in long documents, e.g., $s_2$, $s_3$, and $s_4$. As shown in Equation 2, all of these sentences contribute to the centrality score of $s_1$. We regard such sentences as noise for $s_1$ and propose a modified directed graph-based ranking to filter them. To this end, we simply introduce a threshold $\epsilon$ into Equation 2. With $s_1$ as the centre, $\epsilon$ can be seen as a diameter, and the centrality score of $s_1$ only considers the nodes inside the red dashed circle. We rewrite Equation 2 as
$$DC(s_i) = \lambda_1 \sum_{j < i} \max(e_{ij} - \epsilon, 0) + \lambda_2 \sum_{j > i} \max(e_{ij} - \epsilon, 0) \qquad (3)$$
where $\epsilon = \beta \cdot (\max(e_{ij}) - \min(e_{ij}))$ and $\beta$ is a hyper-parameter that controls the scale of the diameter. As shown in Equation 3, if the similarity between $s_i$ and $s_j$ is lower than $\epsilon$, $s_j$ is neglected. We find this modification very effective, but the model is very sensitive to the choice of $\beta$, so we carefully tune $\beta$ on the development set. Finally, we rank and select sentences with Equation 4:
$$\mathrm{Summary} = \operatorname{top}_k\big(\{DC(s_i)\}_{i=1}^{n}\big) \qquad (4)$$
where the top-ranked $k$ sentences are extracted as the summary, and $k$ is pre-defined from the average summary length in the training data.
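The thresholded ranking and top-k selection described above can be sketched as follows. This is a rough illustration under stated assumptions: edges below the threshold contribute nothing, kept edges contribute their margin above the threshold, and the maximum and minimum are taken over all off-diagonal edges of the document. The function names and toy values are hypothetical.

```python
def modified_directed_centrality(E, beta=0.5, lam1=0.5):
    """Directed centrality that ignores edges weaker than the threshold eps."""
    n = len(E)
    lam2 = 1.0 - lam1
    # Assumption: max/min are taken over all off-diagonal edges of the document.
    weights = [E[i][j] for i in range(n) for j in range(n) if i != j]
    eps = beta * (max(weights) - min(weights))

    def kept(w):
        # Assumption: an edge below eps contributes nothing; otherwise it
        # contributes its margin above eps.
        return max(w - eps, 0.0)

    scores = []
    for i in range(n):
        before = sum(kept(E[i][j]) for j in range(i))
        after = sum(kept(E[i][j]) for j in range(i + 1, n))
        scores.append(lam1 * before + lam2 * after)
    return scores


def select_top_k(scores, k):
    """Return the indices of the k best-scoring sentences in document order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


E = [  # hypothetical similarity matrix
    [0.0, 0.8, 0.2],
    [0.8, 0.0, 0.4],
    [0.2, 0.4, 0.0],
]
scores = modified_directed_centrality(E, beta=0.5, lam1=0.5)
summary = select_top_k(scores, k=2)
```

In this toy example the weak edge between the first and third sentences (0.2) falls below the threshold and no longer adds to either score.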

Facet-Aware Centrality Scoring
In this section, we describe how we implement Equation 3 and how we incorporate facets into centrality-based ranking. We propose a simple method that models the facets of a document with a single representation of the whole document. Specifically, on top of Equation 4, we add a sentence-document similarity, computed between the sentences of a candidate summary $C$ and the document $d$, to measure the relevance of the summary to the document. To reduce the search range, candidate summaries are built from the top-ranked $K$ sentences by $DC(s_i)$. We combine the sentence-document similarity with the sentence centrality and obtain the best candidate summary by Equation 5,
where $\alpha$ is a hyper-parameter that controls the influence of the directed centrality. $\mathrm{sim}(d, \bar{v})$ is the sentence-document similarity, where $d$ is the document representation and $\bar{v}$ is the candidate summary representation. $\bar{v}$ is obtained as $\bar{v} = \frac{1}{|C|} \sum_{i \in C} v_i$, the mean representation of the summary sentences. We use cosine similarity for $\mathrm{sim}(\cdot)$.
Combining sentence-document similarity with sentence centrality not only tackles the facet bias problem but also reduces redundancy to some extent. As shown in Figure 2, the centrality scores of the red nodes are extremely high because they are similar to each other, so previous centrality-based models tend to select them as the summary. We incorporate the document representation and use the sentence-document similarity to weight the centrality score. This forces the model to choose the blue nodes, whose center is closer to the star, instead of the red nodes. Introducing the sentence-document similarity makes it extremely unlikely that a set of highly cohesive nodes will be selected, so redundancy is also reduced.
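The candidate selection described above can be sketched in a few lines. The exact form of the combination is not reproduced here, so the sketch assumes an additive score, cosine similarity plus α times the summed centrality, purely for illustration. It also simplifies candidates to fixed-size subsets of the top-K sentences rather than length-constrained ones. All names and toy values are hypothetical.

```python
import math
from itertools import combinations


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mean_vector(vectors):
    """Mean representation of the candidate summary sentences."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def best_candidate(V, dc, d, top_K, size, alpha=1.0):
    """Pick the subset of the top-K sentences with the best combined score."""
    pool = sorted(range(len(dc)), key=lambda i: dc[i], reverse=True)[:top_K]
    best, best_score = None, float("-inf")
    for C in combinations(sorted(pool), size):
        v_bar = mean_vector([V[i] for i in C])
        centrality = sum(dc[i] for i in C)
        # ASSUMPTION: additive combination, chosen only for this sketch.
        score = cosine(d, v_bar) + alpha * centrality
        if score > best_score:
            best, best_score = C, score
    return best, best_score


# Toy document: sentences 0-2 share one facet, sentence 3 covers another.
V = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
dc = [1.0, 0.9, 0.9, 0.8]  # hypothetical centrality scores
d = [1.0, 1.0]             # document vector covering both facets
by_centrality = tuple(sorted(sorted(range(4), key=lambda i: dc[i], reverse=True)[:2]))
chosen, score = best_candidate(V, dc, d, top_K=4, size=2, alpha=1.0)
```

Ranking by centrality alone picks two sentences from the same facet, while the combined score prefers the pair whose mean vector is closest to the document vector, which is the behaviour the red and blue nodes in Figure 2 are meant to convey.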
A candidate summary $C$ is a subset of the top-ranked $K$ sentences after ranking by $DC(s_i)$ that satisfies the following two conditions: 1) the total length of the sentences in the candidate summary is a predefined $L$, which is set according to the summary length in the training data of the dataset; 2) the total length of the top-ranked $K$ sentences is $t \times L$, where $t$ is empirically set to 3.
For the sentence representations $v_i$, we employ BERT as the encoder, which maps each word to a hidden state. Specifically, the sentence representation $v_i$ is obtained as $\mathrm{sigmoid}(h_i)$, where $h_i$ is the hidden state of "[CLS]". Each $e_{ij}$ in $E$ is calculated as the dot product of the two sentence representations, $e_{ij} = v_i^{\top} v_j$. For the document representation, we first collect all the sentence representations $\{v_1, v_2, \dots, v_n\}$. To compress the valuable information in the document, we apply a max-pooling function to the sentence representations. The document representation $d$ is computed as
$$d = \mathrm{maxpool}(v_1, v_2, \dots, v_n)$$
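To make the representation step concrete, the sketch below starts from already-computed "[CLS]" hidden states (the BERT encoder itself is left out) and derives sentence vectors, dot-product edge weights, and a max-pooled document vector. It assumes max-pooling is applied element-wise across sentences; the hidden-state values and function names are hypothetical.

```python
import math


def sigmoid_vec(h):
    """Element-wise sigmoid of a hidden-state vector."""
    return [1.0 / (1.0 + math.exp(-x)) for x in h]


def edge_matrix(V):
    """Edge weight e_ij as the dot product of sentence vectors v_i and v_j."""
    return [[sum(a * b for a, b in zip(vi, vj)) for vj in V] for vi in V]


def max_pool(V):
    """Element-wise maximum over all sentence vectors (assumed pooling axis)."""
    return [max(col) for col in zip(*V)]


H = [[0.0, 2.0], [1.0, -1.0]]    # hypothetical [CLS] hidden states
V = [sigmoid_vec(h) for h in H]  # sentence representations
E = edge_matrix(V)               # similarity graph
d = max_pool(V)                  # document representation
```

Because each dimension of the document vector keeps the strongest activation from any sentence, a facet expressed by only a few sentences can still be visible in `d`.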

Improved Sentence Representation
Sentence representations play a crucial role in our ranking model. Previous studies show that improving the quality of sentence representations helps improve ranking performance (Zheng and Lapata, 2019; Dong et al., 2020). We post-train BERT on a sentence-level task constructed from the corpus of the target task. The idea is that the representation of a sentence is affected not only by the words in it but also by the sentences around it. For a sentence in a document, we take its previous sentence and its following sentence as positive examples and randomly sample sentences from documents as negative examples. The objective function follows that of Reimers and Gurevych (2019). Specifically, for a sentence $s_i$, a positive sentence $s_j$, and a negative sentence $s_k$, BERT is trained to minimize
$$\max\big(\lVert v_i - v_j \rVert - \lVert v_i - v_k \rVert + \mu,\, 0\big)$$
where $v$ is the sentence representation and $\mu$ is a margin which ensures that $v_j$ is at least $\mu$ closer to $v_i$ than $v_k$ is. The hidden state vector of "[CLS]" is used as the sentence representation, and we set $\mu$ to 1 following Reimers and Gurevych (2019) in the post-training phase.
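The post-training objective can be illustrated with a small sketch of triplet construction and the margin loss. It assumes Euclidean distance as the distance measure and draws negatives from a separate pool of sentences; both choices, along with the helper names, are assumptions made for illustration rather than details confirmed by the text.

```python
import math
import random


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(v_i, v_j, v_k, margin=1.0):
    """Zero once the positive v_j is at least `margin` closer to v_i than v_k."""
    return max(euclidean(v_i, v_j) - euclidean(v_i, v_k) + margin, 0.0)


def build_triplets(doc, negative_pool, rng):
    """Pair each sentence with its neighbours (positives) and a random negative."""
    triplets = []
    for i, anchor in enumerate(doc):
        for j in (i - 1, i + 1):
            if 0 <= j < len(doc):
                triplets.append((anchor, doc[j], rng.choice(negative_pool)))
    return triplets


doc = ["s1", "s2", "s3"]          # hypothetical sentences of one document
negative_pool = ["x1", "x2"]      # hypothetical sentences from other documents
triplets = build_triplets(doc, negative_pool, random.Random(0))
satisfied = triplet_loss([0.0, 0.0], [0.0, 0.0], [3.0, 4.0])
violated = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 0.0])
```

A three-sentence document yields four triplets, since the first and last sentences each have only one neighbour.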

Datasets
We introduce the datasets used in our experiments in this section.
CNN/DM dataset contains 93k articles from CNN and 220k articles from the Daily Mail (Hermann et al., 2015). We use the non-anonymous version. Following Zheng and Lapata (2019), documents whose summaries are shorter than 30 tokens are filtered out.
NYT dataset contains articles published by the New York Times between January 1, 1987 and June 19, 2007 (Li et al., 2016). The summaries are written by library scientists. Unlike in CNN/DM, salient sentences are distributed evenly in an article (Durrett et al., 2016). We filter out documents whose summaries are shorter than 50 tokens (Zheng and Lapata, 2019).
MultiNews dataset consists of news articles and human-written summaries. It is the first large-scale multi-document summarization (MDS) news dataset and comes from a diverse set of news sources (over 1500 sites) (Fabbri et al., 2019).
arXiv&PubMed datasets are two long document datasets of scientific publications from arXiv.org (113k) and PubMed (215k) (Cohan et al., 2018). The task is to generate the abstract from the paper body.
WikiSum dataset is a multi-document summarization dataset from Wikipedia (Liu et al., 2018). We use the version provided by Liu and Lapata (2019a), which selects the top-40 ranked paragraphs as input. For this dataset, we filter out documents whose summary is shorter than 100 tokens. After this process, the WikiSum test set contains 15,795 examples and the average summary length is 198.
WikiHow dataset is a large-scale dataset of instructions from the online WikiHow.com website (Koupaee and Wang, 2018). The task is to generate the concatenated summary-sentences from the paragraphs.
BillSum dataset contains US Congressional bills and human-written reference summaries from the 103rd-115th (1993-2018) sessions of Congress (Kornilova and Eidelman, 2019).
These datasets differ in scale, domain, and task type. We collect the details of the 8 corpora in Table 1.

Implementation Details and Metrics
FAR has 4 hyper-parameters, and the best values are chosen from the following settings: $\alpha \in \{1, 2\}$, $\beta \in \{0.0, 0.1, \dots, 0.9\}$, $\lambda_1 + \lambda_2 = 1$, $\lambda_1 \in \{0.0, 0.1, \dots, 1.0\}$. In most cases, FAR with the default setting ($\alpha = 1$, $\beta = 0.5$, $\lambda_1 = 0.5$, $\lambda_2 = 0.5$) achieves satisfactory performance across datasets. We select the best hyper-parameters by sampling 1,000 examples from the validation set (Zheng and Lapata, 2019). Our encoder is based on the PyTorch implementation of BERT*, using the base settings. For post-training, we initialize our sentence encoder with the basic BERT model. We use Adam (Kingma and Ba, 2014) as our optimizer with a learning rate of 2e-5. During post-training, we sample documents from the training sets of all datasets. The maximum length of the input sentence is set to 60. We use a linear warm-up for the first 10% of steps followed by a linear decay to 0. The BERT encoder is post-trained on 6 Tesla V100 GPUs.
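The learning-rate schedule described above (linear warm-up over the first 10% of steps, then linear decay to 0, with a peak of 2e-5) can be written as a small standalone function. This is a sketch of the schedule shape only, not the training code; the function name and the example step count are hypothetical.

```python
def learning_rate(step, total_steps, peak_lr=2e-5, warmup_frac=0.1):
    """Linear warm-up to peak_lr, then linear decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / decay_steps)


# Learning rate at a few points of a hypothetical 1,000-step run.
schedule = [learning_rate(s, 1000) for s in (0, 50, 100, 550, 1000)]
```

In this example the rate rises from 0 to the peak at step 100 and returns to 0 at step 1,000.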
We use the ROUGE-1.5.5.pl script† to evaluate summarization quality automatically with ROUGE F1 (Lin and Hovy, 2003). We report ROUGE-1/2/L scores to measure the quality of summaries. In addition, we conduct a human evaluation of the facet bias and redundancy of the extracted summaries.

Results
Tables 2-4 report the results on the three types of datasets. In each table, we present the results of Oracle and previous supervised models in the first block. Oracle can be seen as the upper bound of extractive models; it extracts gold-standard summaries by greedily selecting sentences to optimize the mean of ROUGE-1 and ROUGE-2 (Nallapati et al., 2017). In the second block of each table, we compare our approach with the strong unsupervised baselines Lead, TextRank (Mihalcea and Tarau, 2004), LexRank (Erkan and Radev, 2004), and MMR (Carbonell and Goldstein, 1998). Lead selects the first k tokens as a summary. In the third block of each table, we also report the previous best centrality-based model, PacSum (Zheng and Lapata, 2019).
* https://github.com/huggingface/transformers
† https://github.com/andersjo/pyrouge
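The greedy Oracle procedure mentioned above can be sketched as follows. The sketch uses a simplified n-gram overlap F1 on whitespace tokens as a stand-in for ROUGE; it is not the official ROUGE-1.5.5 script, and the function names and toy sentences are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def overlap_f1(candidate, reference, n):
    """Simplified n-gram overlap F1 (a stand-in for ROUGE-N)."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def greedy_oracle(sentences, reference, max_sentences):
    """Add the sentence that most improves the mean of the 1-gram and 2-gram F1."""
    ref_tokens = reference.split()
    selected, best = [], 0.0
    while len(selected) < max_sentences:
        options = []
        for i in range(len(sentences)):
            if i in selected:
                continue
            chosen = sorted(selected + [i])
            tokens = " ".join(sentences[j] for j in chosen).split()
            score = (overlap_f1(tokens, ref_tokens, 1)
                     + overlap_f1(tokens, ref_tokens, 2)) / 2
            options.append((score, i))
        if not options:
            break
        score, i = max(options)
        if score <= best:  # stop when no sentence improves the score
            break
        selected.append(i)
        best = score
    return sorted(selected)


sentences = ["the cat sat on the mat", "stocks fell sharply today",
             "a dog barked loudly"]
reference = "the cat sat on the mat a dog barked loudly"
oracle = greedy_oracle(sentences, reference, max_sentences=3)
```

The loop stops early when adding any remaining sentence would lower the score, so the Oracle summary can be shorter than `max_sentences`.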
Overall, FAR outperforms the above-mentioned strong unsupervised baselines on most datasets, especially on long-document and multi-document datasets, and generalizes better across dataset types and domains.
Table 2 reports the results on the single document summarization (SDS) datasets CNN/DM, NYT, and WikiHow. PTR-GEN (See et al., 2017) is a supervised abstractive model with a classic seq2seq structure. REFRESH (Narayan et al., 2018) and BertExt (Liu and Lapata, 2019b) are supervised extractive models. STAS (Xu et al., 2020) is the best unsupervised model on CNN/DM and NYT, with two redesigned pre-training tasks to measure the importance of sentences.

Results on SDS
From the results, we can see that: 1) our model outperforms all strong baselines in the second block, as well as PacSum, by wide margins in terms of ROUGE-1/2/L on the 3 SDS datasets; 2) on NYT in particular, our model outperforms the previous best unsupervised extractive system STAS and the supervised method REFRESH.
After we re-implement the trigram blocking trick used by STAS (i.e., removing sentences that repeat trigrams from existing summary sentences) (Xu et al., 2020), FAR achieves ROUGE-1/2/L scores of 40.93/17.80/37.00 on CNN/DM, with a better ROUGE-1 score than STAS.
Table 3 reports the results on the long document summarization (LDS) datasets arXiv, PubMed, and BillSum. For supervised extractive models, we compare with SummaRuNNer (Nallapati et al., 2017) and GlobalLocalCont (Xiao and Carenini, 2019). We also compare with the supervised abstractive models Discourse-aware (Cohan et al., 2018) and PTR-GEN.
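The trigram blocking trick mentioned above can be sketched in a few lines: sentences are visited in ranked order, and a sentence is skipped if it shares any trigram with the summary built so far. Tokenization by lowercased whitespace splitting and the function names are simplifying assumptions for illustration.

```python
def trigrams(sentence):
    tokens = sentence.lower().split()
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}


def select_with_trigram_blocking(ranked_sentences, k):
    """Take up to k sentences in ranked order, skipping trigram repeats."""
    summary, seen = [], set()
    for sentence in ranked_sentences:
        grams = trigrams(sentence)
        if grams & seen:  # repeats a trigram already in the summary
            continue
        summary.append(sentence)
        seen |= grams
        if len(summary) == k:
            break
    return summary


ranked = [  # hypothetical sentences, best-ranked first
    "the court ruled on the case",
    "the court ruled against him",
    "markets rallied on friday",
]
summary = select_with_trigram_blocking(ranked, k=2)
```

Here the second sentence is dropped because it repeats the trigram "the court ruled", so the third sentence fills the remaining slot.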

Results on LDS
As shown in Table 3, our model has clearly higher ROUGE-1/2/L scores than PacSum on arXiv (+1.89/+1.56/+1.38) and on PubMed (+2.22/+1.55/+1.45). Compared with supervised models, our unsupervised model outperforms the supervised abstractive models PTR-GEN and Discourse-aware, but still lags behind supervised extractive models. The reason for this gap is that supervised extractive models can extract summaries of dynamic length by training on a labeled corpus, whereas unsupervised models need to predefine the length or number of extracted sentences.
In addition, the improvement on BillSum is limited. We analyzed the input documents of BillSum and found that they contain many very short sentences, which leads to this limited improvement.
Table 4 reports the results on the multi-document summarization datasets MultiNews and WikiSum. T-DMCA and HiMAP were proposed together with the construction of WikiSum and MultiNews. FT (Flat Transformer) and HT (Hierarchical Transformer) are two supervised extractive models proposed by Liu and Lapata (2019a).

Results on MDS
From the results in Table 4, we can see that PacSum and FAR perform strongly on MultiNews, which may result from the characteristics of news datasets and the high-quality human-written document-summary pairs of MultiNews. On WikiSum, FAR is clearly better than PacSum. We also observe that unsupervised models perform far worse than supervised models, because the length of multi-document summaries fluctuates greatly and it is hard for unsupervised methods to decide the length of the extracted summary.

Analysis
In this section, we present a series of analyses and tests to understand the improvements of FAR reported in the previous section and to verify our intuition that the design of our model alleviates facet bias. We choose NYT from SDS and arXiv from LDS to analyze the performance of FAR. These 2 datasets are typical and cover both short and long document inputs.
Ablation Study To assess the contribution of the 3 components of FAR (modified DC in section 3.1, facet-aware scoring in section 3.2, and post-training in section 3.3), we remove each component in turn and report the ablation study results in 5. We can see that modified DC and facet-aware scoring are indispensable to the performance of FAR: if we remove either of them, the performance of FAR drops sharply. When we replace the post-trained BERT with the original BERT, the results also confirm that post-training is useful.
Human Evaluation To evaluate the ability of FAR to reduce facet bias and redundancy, we asked 3 human annotators to compare the summaries extracted by PacSum and FAR against the gold reference summaries. The annotators gave scores from 0 to 2 for facet bias and for redundancy on 100 randomly sampled examples. PacSum scores 1.42 on facet bias and 1.17 on redundancy. FAR performs significantly better than PacSum (p < 0.05), with a facet bias score of 0.96 and a redundancy score of 0.81. These results indicate that FAR can extract high-quality summaries through facet-aware modeling and reduce the redundancy of summaries to some extent.
Figure 3: Sentence position distribution on arXiv and NYT. We use the first 40 sentences for NYT and the first 120 sentences for arXiv.

Sentence Position Distribution
We compare the position distributions of the sentences extracted by FAR, PacSum, and Oracle to further inspect the performance of FAR. We report the position distributions in Figure 3. We can see that 1) the distribution of FAR is closer to that of Oracle; 2) PacSum only extracts sentences from the head of documents on arXiv, which is also noted by Dong et al. (2020); 3) the advantages of our model are more significant on LDS datasets.
Analysis of Hyper-parameter $\beta$ The hyper-parameter $\beta$ is crucial because it is used to filter out noise sentences in documents. We fix the other hyper-parameters and show in Figure 4 how ROUGE-1 changes as $\beta$ varies from 0.1 to 0.9. We can see that $\beta$ has a great impact on model performance, especially on the NYT dataset. These curves indicate that noise sentences do exist and hurt the performance of centrality-based models.
Case Study To show intuitively the ability of FAR to tackle the facet bias problem and reduce redundancy, we choose one typical example from the NYT dataset. (The example is from a news report and is only used to analyze the effectiveness of our model.) As shown in Table 5, the sentences extracted by PacSum all focus on the facet that describes terrorist attacks in Iraq, whereas FAR covers all 3 facets in the gold reference. This shows that FAR can effectively improve performance by reducing facet bias.
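A position distribution of the kind compared above can be computed from the indices of the extracted sentences. The sketch below is a minimal illustration that assumes each summary is given as a list of zero-based sentence positions and that positions beyond a cut-off are ignored; the names and toy data are hypothetical.

```python
from collections import Counter


def position_distribution(selections, max_position):
    """Fraction of extracted sentences at each of the first max_position slots."""
    counts = Counter(p for selected in selections for p in selected
                     if p < max_position)
    total = sum(counts.values())
    return [counts[p] / total if total else 0.0 for p in range(max_position)]


# Hypothetical extracted positions for two documents.
selections = [[0, 1], [0, 5]]
distribution = position_distribution(selections, max_position=4)
```

Computing this distribution for each system on the same documents gives the curves that can then be compared against the Oracle distribution.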

Related Work
Summarization is a long-standing research challenge. Thanks to the power of neural networks and the availability of large-scale parallel datasets, supervised summarization algorithms have developed rapidly (Chopra et al., 2016; Cao et al., 2018; Zhang et al., 2018; Zhong et al., 2019; Gehrmann et al., 2019; Cho et al., 2019; Jin et al., 2020b; Cao et al., 2020; Jin et al., 2020a; Zhong et al., 2020). However, high-quality parallel datasets are not always available, so research on unsupervised summarization is necessary. Unsupervised summarization can be divided into extractive and abstractive approaches. Unsupervised abstractive summarization is more challenging than extractive summarization, and there are many interesting works on it (Wang and Lee, 2018; Févry and Phang, 2018; Baziotis et al., 2019; Jernite, 2019; Zhou and Rush, 2019; West et al., 2019; Chu and Liu, 2019; Yang et al., 2020). However, most unsupervised summarization models are extractive (Radev et al., 2000; Mihalcea and Tarau, 2004; Erkan and Radev, 2004; Carbonell and Goldstein, 1998; Wan, 2008; Wan and Yang, 2008; Schluter and Søgaard, 2015; Zhao et al., 2020) and focus on measuring sentence salience.
Graph-based models are effective and widely studied among unsupervised extractive methods. Different from traditional undirected graph ranking models (Radev et al., 2000; Mihalcea and Tarau, 2004; Erkan and Radev, 2004), Zheng and Lapata (2019) proposed a directed centrality method based on an assumption from Rhetorical Structure Theory (RST) (Mann and Thompson, 1988). Dong et al. (2020) point out that PacSum has a position bias, which makes it unsuitable for long document summarization, and proposed the hierarchical position-based model HipoRank for scientific document summarization. STAS (Xu et al., 2020) designs two summarization-related pre-training tasks to improve sentence representations and then proposes a ranking method that combines attention weights with reconstruction loss to measure the centrality of sentences.
We identify the facet bias problem in graph-based models, which causes the extracted summaries to miss the multi-facet information in a document. A similar concept in summarization is redundancy. However, the two differ: 1) to address redundancy, we only need to make sure that the selected sentences are not too similar to each other; 2) to tackle the facet bias problem, we need to select sentences that together retain the multi-facet information.

Conclusion
In this paper, we identify the facet bias problem in centrality-based unsupervised summarization models and propose a novel facet-aware centrality-based ranking model, FAR, to tackle it. We introduce a sentence-document weight into centrality, which forces the model to pay more attention to different facets, and we find that FAR can reduce redundancy to some extent. Results on a wide range of summarization tasks show that our method consistently outperforms strong baselines, especially in long- and multi-document scenarios, which indicates that our model is robust and effective. Extensive analyses confirm that the performance gains of our model come from alleviating the facet bias problem.
Jiawei Zhou and Alexander Rush. 2019. Simple unsupervised summarization by contextual matching. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5101-5106, Florence, Italy. Association for Computational Linguistics.

A Hyperparameters
The hyper-parameters of FAR are reported in Table 6.

B Filter Summary Length of arXiv&PubMed
To show that unsupervised models are limited by summary length, we filter the examples in the test set by summary length and report the results in Figure 6. When examples with short summaries, which do not match the predefined length, are removed, the performance improves noticeably.

C Sentence Position Distribution
We show the sentence position distributions of all 8 datasets in Figure 7.