Unsupervised Extractive Summarization using Pointwise Mutual Information

Unsupervised approaches to extractive summarization usually rely on a notion of sentence importance defined by the semantic similarity between a sentence and the document. We propose new metrics of relevance and redundancy using pointwise mutual information (PMI) between sentences, which can be easily computed by a pre-trained language model. Intuitively, a relevant sentence allows readers to infer the document content (high PMI with the document), and a redundant sentence can be inferred from the summary (high PMI with the summary). We then develop a greedy sentence selection algorithm to maximize relevance and minimize redundancy of extracted sentences. We show that our method outperforms similarity-based methods on datasets in a range of domains including news, medical journal articles, and personal anecdotes.


Introduction
Modern neural network-based approaches to summarization require a large number of document-summary pairs, which are usually unavailable outside of the news domain. For example, summarization datasets of personal narratives and office meetings contain only a few hundred examples (Ouyang et al., 2017; Carletta et al., 2005). In this work, we tackle the problem of unsupervised extractive summarization, which aims to select important sentences from the document. While there exists extensive prior work (Radev et al., 2000; Mihalcea and Tarau, 2004; Liu and Lapata, 2019; Zheng and Lapata, 2019), most approaches rely on the assumption that important sentences are similar to other sentences in the document. However, it is unclear if similarity-based features lead to meaningful content selection (Kedzie et al., 2018).
Inspired by recent work on formalizing the notion of importance in summarization (Peyrard, 2019), we propose metrics for relevance and redundancy based on pointwise mutual information (PMI). Intuitively, a relevant summary allows the reader to maximally infer the document content, and a summary has minimal redundancy if each sentence in it provides additional information. Therefore, we measure the relevance of a summary by its PMI with the document. High relevance means that the probability of the document increases conditioning on the summary. Similarly, we measure redundancy by PMI of sentence pairs within the summary. A sentence is redundant if seeing other sentences significantly increases its probability.
Based on the new metrics, we design a simple sentence extraction algorithm. We estimate the PMI of sentence pairs by a pre-trained language model fine-tuned on in-domain documents. We then use a simple sequential sentence selection algorithm for extractive summarization, which greedily maximizes relevance and minimizes redundancy.
Experimental results show that our algorithm outperforms similarity-based methods across multiple domains, including news, personal stories, and medical articles.

Relevance and Redundancy
We begin by formalizing relevance and redundancy for summarization. Consider a document of $n$ sentences $D = \{d_1, \ldots, d_n\}$ and a summary of $m$ sentences $S = \{s_1, \ldots, s_m\}$. We would like to measure the relevance of $S$ to $D$ and the redundancy of $S$.
Relevance. Relevance measures how well the summary condenses the original text such that we can infer its key content. Specifically, a summary sentence is relevant if observing it reduces our uncertainty about (unseen) sentences in the document. For example, the summary may contain the main link in the thread of conversation in the document. We thus quantify the relevance of a summary sentence $s$ to a document sentence $d$ by their PMI:
$$\mathrm{pmi}(s; d) = \log \frac{p(s, d)}{p(s)\,p(d)}, \tag{1}$$
which measures the dependence between $s$ and $d$.
A positive score means that $s$ and $d$ are likely to co-occur, so seeing one makes the other more probable. A zero score means that $s$ and $d$ are independent. A negative score means that $s$ and $d$ are unlikely to co-occur, e.g. contradicting sentences, so such a summary sentence is discouraged. We further define the relevance of a summary $S$ to the document $D$ by the sum of sentence-level relevance:
$$\mathrm{Relevance}(S, D) = \sum_{s \in S} \sum_{d \in D} \mathrm{pmi}(s; d). \tag{2}$$
Redundancy. Redundancy measures how much overlap exists among the summary sentences. It is typically measured by the semantic similarity between two sentences. However, even if two sentences express different meanings, there is redundancy if one is entailed by the other. For example, consider:
1. "Michelle, of South Shields, Tyneside, says she feels like a new woman after dropping from dress size 30 to size 12."
2. "Michelle weighed 25st 3lbs when she joined the group in April 2013 and has since dropped to 12st 10lbs."
Though expressing different information, both imply Michelle's weight loss. Given a summary sentence, we want to assign a score proportional to the amount of information in the sentence which is already present in the rest of the summary. Therefore, we quantify the redundancy of a sentence $s$ given another sentence $s'$ by their dependence in terms of PMI:
$$\mathrm{pmi}(s; s') = \log \frac{p(s, s')}{p(s)\,p(s')}. \tag{3}$$
Similarly, the redundancy of a summary $S$ is defined by the total redundancy of all sentence pairs:
$$\mathrm{Redundancy}(S) = \sum_{1 \le i < j \le m} \mathrm{pmi}(s_i; s_j). \tag{4}$$
Estimate PMI. By definition, $\mathrm{pmi}(s; d) = \log \frac{p(s \mid d)}{p(s)} = \log \frac{p(d \mid s)}{p(d)}$. Since both $s$ and $d$ are sentences, we use a language model, $p_{LM}$, to estimate the probabilities. Conditional probabilities are computed by considering the condition sentence as the prefix.
Note that while PMI can be computed in two equivalent ways according to the definition, the estimates from a language model do not guarantee that $\frac{p_{LM}(d \mid s)}{p_{LM}(d)} = \frac{p_{LM}(s \mid d)}{p_{LM}(s)}$. Thus we choose to condition on the summary sentence:
$$\mathrm{pmi}(s; d) \approx \log \frac{p_{LM}(d \mid s)}{p_{LM}(d)}. \tag{5}$$
This is consistent with our definition of relevance: having seen the summary, how well can we estimate the document content? For redundancy, we condition on the earlier sentence:
$$\mathrm{pmi}(s_i; s_j) \approx \log \frac{p_{LM}(s_j \mid s_i)}{p_{LM}(s_j)}, \quad i < j, \tag{6}$$
since a sentence is redundant if it can be inferred from previous sentences.
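The estimates above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the experiments: `log_prob` stands for any function returning $\log p_{LM}(\text{target} \mid \text{prefix})$, and `toy_log_prob` is a hypothetical stand-in (a crude word-overlap scorer, not a normalised language model) that only serves to make the example runnable. The summation over pairs $i < j$ in `redundancy` is likewise an assumption about how "all sentence pairs" is enumerated.

```python
import math
from typing import Callable, Sequence

# log_prob(target, prefix) is assumed to return log p_LM(target | prefix);
# an empty prefix stands for the unconditional log p_LM(target).
LogProb = Callable[[str, str], float]


def pmi(condition: str, target: str, log_prob: LogProb) -> float:
    """Estimate pmi(condition; target) as log p(target | condition) - log p(target)."""
    return log_prob(target, condition) - log_prob(target, "")


def relevance(summary: Sequence[str], document: Sequence[str], log_prob: LogProb) -> float:
    """Sum of sentence-level PMI, conditioning on each summary sentence."""
    return sum(pmi(s, d, log_prob) for s in summary for d in document)


def redundancy(summary: Sequence[str], log_prob: LogProb) -> float:
    """Sum of PMI over summary pairs, conditioning on the earlier sentence."""
    return sum(
        pmi(summary[i], summary[j], log_prob)
        for i in range(len(summary))
        for j in range(i + 1, len(summary))
    )


def toy_log_prob(target: str, prefix: str, vocab_size: int = 1000, boost: float = 50.0) -> float:
    """Hypothetical scorer: a word is `boost` times more likely if the prefix contains it."""
    seen = set(prefix.lower().split())
    total = 0.0
    for word in target.lower().split():
        p = 1.0 / vocab_size
        if word in seen:
            p *= boost
        total += math.log(p)
    return total


if __name__ == "__main__":
    s = "michelle lost weight"
    d = "michelle dropped to size 12"
    print(pmi(s, d, toy_log_prob))            # positive: the sentences share a word
    print(pmi(s, "rain fell", toy_log_prob))  # zero: no shared words
```

In practice `toy_log_prob` would be replaced by a function that sums token log-probabilities from the fine-tuned language model, with the condition sentence supplied as the prefix.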

Sequential Sentence Extraction
Given relevance and redundancy defined above, we aim to select important sentences from the document that maximize relevance and minimize redundancy. We consider a weighted combination of the two criteria:
$$\max_{S \subseteq D,\ |S| \le k} \ \lambda_1\,\mathrm{Relevance}(S, D) + \lambda_2\,\mathrm{Redundancy}(S), \tag{7}$$
where $|S|$ denotes the number of sentences in the summary and $k$ is the number of sentences to select. This is a combinatorial problem that is expensive to solve when $k$ is large. Therefore, we solve it approximately by selecting sentences sequentially in a greedy fashion. Given the previously selected sentences, we select the next sentence from the document that maximally improves the objective (7).

Baselines. We compare our approach against the following unsupervised extraction methods: (i) heuristic: lead-k, which selects the first k sentences; (ii) similarity-based: TextRank (Barrios et al., 2016; Mihalcea and Tarau, 2004) and PacSum (Zheng and Lapata, 2019), which use graph-based selection with similarity metrics based on tf-idf sentence vectors and BERT embeddings, respectively. To ablate the contribution of PMI, we also include a variant of our algorithm which uses cosine similarity of tf-idf sentence representations to measure redundancy and relevance of sentence pairs. Additionally, we include two reference methods: oracle extraction and the state-of-the-art supervised method.

Language model fine-tuning. For our method, we use the pre-trained GPT-2 large model (Radford et al., 2019) to calculate the PMI. To adapt the language model to specific domains, we fine-tune it on the training documents (excluding the gold summaries) in each dataset. To make it better fit our task of estimating the probability of one sentence given another, as described in Section 2, we fine-tune GPT-2 on two-sentence segments (as opposed to a long stream of tokens). Each segment consists of a pair of sentences from the document.

Implementation details. We preprocess the documents with spaCy (Honnibal and Montani, 2017) to split the text into sentences.
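The greedy selection described at the start of this section can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the released implementation: `pmi(condition, target)` is any PMI estimator, the relevance of a candidate is summed over all other document sentences, its redundancy is summed over the sentences already selected (which act as the condition), and the default weights mirror the values 2 and -2 reported for the grid search. The word-overlap `overlap` function in the usage part is a hypothetical stand-in for the language-model estimate.

```python
from typing import Callable, List, Sequence

Pmi = Callable[[str, str], float]  # pmi(condition, target)


def greedy_extract(
    sentences: Sequence[str],
    pmi: Pmi,
    k: int,
    lam_rel: float = 2.0,
    lam_red: float = -2.0,
) -> List[int]:
    """Return indices of up to k sentences, in the order they were selected."""
    selected: List[int] = []
    remaining = list(range(len(sentences)))

    def gain(i: int) -> float:
        # Relevance: how well candidate i predicts the other document sentences.
        rel = sum(pmi(sentences[i], d) for j, d in enumerate(sentences) if j != i)
        # Redundancy: how well the already selected sentences predict candidate i.
        red = sum(pmi(sentences[p], sentences[i]) for p in selected)
        return lam_rel * rel + lam_red * red

    while remaining and len(selected) < k:
        best = max(remaining, key=gain)  # ties go to the earliest sentence
        selected.append(best)
        remaining.remove(best)
    return selected


def overlap(condition: str, target: str) -> float:
    """Hypothetical stand-in for PMI: the number of shared word types."""
    return float(len(set(condition.split()) & set(target.split())))


if __name__ == "__main__":
    doc = [
        "the cat sat on the mat",
        "the cat sat on the rug",
        "dogs bark loudly",
        "the dogs chase the cat",
    ]
    print(greedy_extract(doc, overlap, k=2))               # with redundancy penalty
    print(greedy_extract(doc, overlap, k=2, lam_red=0.0))  # relevance only
```

With the redundancy penalty the second pick skips the near-duplicate of the first sentence, while the relevance-only run selects it; sorting the returned indices restores document order if desired.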
All hyperparameters are tuned on 200 randomly sampled document-summary pairs from the validation set to optimize the Rouge-1 F-measure. These include $\lambda_1$ and $\lambda_2$ in our method, which balance the relevance and redundancy scores in Equation (7), the number of keywords in TextRank, and the number of sentences to select for all extractive methods.
To select the values of $\lambda_1$ and $\lambda_2$, we run a grid search at intervals of 0.1 from -2 to 2 for both. For all datasets, the best weighting was 2 for relevance and -2 for redundancy. The weights are intuitive because we want to maximize relevance ($\lambda_1$) and minimize redundancy ($\lambda_2$). For the Lead-k baselines, k was 3 for CNN-Dailymail, Reddit-TIFU and XSum, 4 for Reddit and 9 for PubMed.
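The grid search over the two weights can be sketched as below. The names `lambda_grid`, `grid_search` and `validation_score` are hypothetical; in the setting described above, `validation_score` would run the extractor with the given weights on the 200 validation pairs and return the average Rouge-1 F-measure. The toy scoring function in the usage part is only there to make the sketch runnable.

```python
import itertools
from typing import Callable, List, Tuple


def lambda_grid(low: float = -2.0, high: float = 2.0, step: float = 0.1) -> List[float]:
    """Evenly spaced weights from low to high inclusive (rounded to one decimal, so step=0.1 is assumed)."""
    n_steps = round((high - low) / step)
    return [round(low + i * step, 1) for i in range(n_steps + 1)]


def grid_search(
    validation_score: Callable[[float, float], float],
    grid: List[float],
) -> Tuple[float, float]:
    """Return the (lambda_1, lambda_2) pair with the highest validation score."""
    return max(itertools.product(grid, grid), key=lambda pair: validation_score(*pair))


if __name__ == "__main__":
    grid = lambda_grid()
    # Toy score that rewards a high relevance weight and a low redundancy weight.
    best = grid_search(lambda lam_rel, lam_red: lam_rel - lam_red, grid)
    print(len(grid), best)
```

A grid of 41 values per weight gives 1681 configurations, which is why tuning on a small validation sample keeps the search affordable.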

Results
We evaluate all methods on the five datasets using the Rouge-1/2/L (Lin and Rey, 2004) F-measure. Table 1 shows our main results.
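For readers unfamiliar with the metric, the following is a simplified sketch of the Rouge-N F-measure as clipped n-gram overlap between a candidate and a reference summary. It assumes whitespace tokenization and lowercasing only, so it will not reproduce the scores of the official toolkit (which applies its own tokenization and optional stemming), and it does not cover Rouge-L.

```python
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(candidate: str, reference: str, n: int = 1) -> float:
    """F-measure of clipped n-gram overlap between candidate and reference."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # counts clipped to the smaller side
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(rouge_n_f1("the cat sat", "the cat sat on the mat"))       # unigram overlap
    print(rouge_n_f1("the cat sat", "the cat sat on the mat", n=2))  # bigram overlap
```

In the first call every candidate unigram appears in the reference (precision 1) while half of the reference unigrams are covered (recall 0.5), so the F-measure is 2/3.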
PMI vs similarity. We first compare PMI and tf-idf in our framework. The results (Ours (PMI) vs Ours (tf-idf)) show that measuring relevance and redundancy using PMI is better than word overlap, especially on narratives (Reddit). Reddit posts report fewer facts than news articles and instead describe a sequence of events, so it is helpful to capture the dependence between events. We show an example contrasting the two metrics in Table 2. Further, our method achieves better or comparable results across all datasets compared to other similarity-based methods. The oracle extraction is predictably the best result across all datasets. Our extractive approach outperforms the supervised baseline on the smallest dataset, Reddit, demonstrating the utility of unsupervised approaches in this setting. More examples are shown in Appendix B.

Table 1: Rouge-1/2/L F-Measure scores of unsupervised extractive methods, oracle extraction, and SOTA results using supervised learning on CNN-DM, XSum, Reddit, Reddit-TIFU and PubMed. The best results among unsupervised methods are in bold. Our PMI-based extractor outperforms similarity-based methods on non-news domains, and is comparable on news domains. (PacSum results on non-news domains are not reported as the released model is fine-tuned only on the news domain.)

Gold Summary: Cillian McCann was filmed by his mother Toni at seven weeks old. In the clip, the little boy can clearly be seen trying to speak to his family. After several attempts he manages to say "hello". The average child can say six words by the time they reach 18 months.
Ours (PMI): Whose adorable son Cillian said his first word at just seven weeks old. In the video Cillian is seen struggling to get his word out, but with a bit of encouragement from his mother he finally says hello. Toni says that Cillian was very alert from a young age and had been trying to make out words since he was just five weeks old .
PacSum: Most parenting advice says you don't have to worry if your baby doesn't start speaking until around 18 months. The tiny tot, who is now nine weeks old, was filmed by his 36-year-old mother who says that she knew he had been trying to communicate for a while. Cillian has three older sisters, Toni revealed that her little girls, Sophie(bottom right), Eva(bottom left) and Ellie(top), did not start talking at such an early age.

Extraction algorithm. Another important component in extractive methods is the sentence selection/decoding algorithm. The most common approach is to select sentences greedily according to a certain objective. TextRank uses a graph-based method inspired by PageRank. However, in Table 1 we see that TextRank and Ours (tf-idf) (using greedy selection) achieve similar results, showing that the selection algorithm does not have a large impact, which is also found by Zheng and Lapata (2019).
Ablation study. To understand the contribution of relevance and redundancy in the proposed metric, we conducted an ablation study on CNN/DM and Reddit-TIFU. In Table 3, we see that relevance alone does well, but augmenting it with redundancy obtains the best performance across all metrics. Minimizing redundancy alone works poorly because it cannot identify important content.
Table 3: Ablation Results on CNN-DM and Reddit-TIFU. We observe that the combination of relevance and redundancy yields the best performance across all metrics.

Position bias in PacSum. One may wonder why PacSum and lead-k significantly outperform other extractive methods on CNN-DM. We hypothesize that they take advantage of the lead bias on CNN/DM. Figure 1 shows histograms of the positions of summary sentences selected by our method and PacSum on CNN/DM. Notably, 82.3% of sentences selected by PacSum were in the first three, and this value drops to 21.4% in our method. This provides an empirical explanation as to why PacSum is so far ahead of the other extractive approaches on CNN-DM. The authors (Zheng and Lapata, 2019) also noted a drop in performance when positional information was removed in PacSum. In addition, their performance degrades on XSum (which doesn't suffer from lead bias). A concurrent work (Xu et al., 2020) performed a similar analysis, observing the reliance on position information of PacSum in the news domain.
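The position statistic quoted above (the share of selected sentences that fall within the first three positions) is straightforward to compute from the selected sentence indices. The sketch below assumes each summary is represented as a list of 0-based sentence positions; the function name `lead_fraction` and the example data are hypothetical.

```python
from typing import Sequence


def lead_fraction(selections: Sequence[Sequence[int]], lead: int = 3) -> float:
    """Fraction of all selected sentences whose 0-based position is below `lead`."""
    total = sum(len(positions) for positions in selections)
    if total == 0:
        return 0.0
    in_lead = sum(1 for positions in selections for pos in positions if pos < lead)
    return in_lead / total


if __name__ == "__main__":
    # Two made-up summaries: one identical to Lead-3, one spread over the document.
    print(lead_fraction([[0, 1, 2], [0, 5, 7]]))
```

A value close to 1 indicates that an extractor behaves almost like the Lead-3 baseline.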

Related Work
Similarity-based summarization. Most unsupervised extractive summarization methods rely on sentence pair similarity as a proxy for importance (Zheng and Lapata, 2019; Mihalcea and Tarau, 2004; Barrios et al., 2016; Erkan and Radev, 2004) and use variants of the PageRank (Page et al., 1999) algorithm to perform selection. We propose PMI as an alternative and compare it to a concurrent similarity-based approach.
Leveraging pretrained language models. Scoring a sentence on its ability to predict subsequent sentences using a language model has been adapted for sentence summarization (West et al., 2019). Zhou and Rush (2019) use two language models, a generic pretrained language model for contextual matching and a task-specific one to enforce fluency. A concurrent work (Xu et al., 2020) used sentence-level transformer self-attentions and probabilities to rank sentences for unsupervised extractive summarization. We use the language model to compute PMI, which then scores sentences on relevance and redundancy as criteria for selection.
Diversity in content selection. Maximal Marginal Relevance (Goldstein and Carbonell, 1998) has been used to produce summaries that prioritize diversity in selected content. A similarity metric is used to produce summaries based on similarity to a query while maintaining diversity among selected sentences in various domains (Chandu et al., 2017). This can be seen as analogous to the tf-idf based variant of our approach that we compare against.

Conclusion and Discussion
We propose metrics for relevance and redundancy in summarization based on pointwise mutual information, and an unsupervised extractive summarization algorithm using pre-trained language models. We demonstrate the effectiveness of our method on both news and non-news domains. Supervised models often learn the lead bias in the datasets and degrade significantly when such cues are absent (Kedzie et al., 2018). Furthermore, even human evaluation of content selection has large variance (Nenkova et al., 2007; Chaganty et al., 2018). Our work is a first step towards formalizing a notion of importance that informs algorithm design in summarization. We believe it is important to have a better formalization of content importance in terms of both task definition/evaluation and modeling. We hope our results will spur more work in this direction.

A Dataset Details
CNN/DM is known to have a very strong extractive Lead-3 baseline, as is common in the news domain. XSum contains summaries of BBC news articles but is highly abstractive in nature. The Reddit dataset is a small corpus of around 500 personal stories shared on Reddit with abstractive and extractive summaries. For Reddit-TIFU, we use the TIFU-long subset as used in Zhang et al. (2019). Reddit-TIFU does not come with a train split and, since we look at unsupervised methods, we used 200 pairs as validation data to decide parameters and report test results on the rest. The PubMed dataset contains longer medical journal articles with the corresponding abstracts functioning as the ground-truth summaries.

B Analysis of Examples
Table 5 shows some example summaries from the CNN-Dailymail validation set in comparison to extractive candidate summaries obtained for the corresponding documents using the baseline Lead-3 approach, our Interpolated PMI based approach and the PacSum approach (Zheng and Lapata, 2019) that uses sentence similarity to obtain state-of-the-art Rouge results on the dataset. These are shown to highlight the difference between using PMI and similarity for sentence selection.
In the first example, the gold summary details how medical information regarding two patients was leaked to sales representatives. The similarity-based approach selects all three sentences associated with only one of the patients, whereas the PMI-based approach yields a summary that contains information about both. Once the first sentence concerning the first patient is selected, all sentences associated with it are penalised by a corresponding amount, resulting in a more well-rounded selection of information.
Similarly, in the second example the sentence that details how the child overcame his initial struggles to speak after some encouragement from his parent was only selected by the PMI approach. This summarises exactly what happens in the video clip being spoken about, and the same point is highlighted even in the gold summary. The contents of the paragraph can be easily understood when the information about the clip is used as a context. The rest of the paragraph goes into how the child spoke faster than his siblings, which explains the selection made by PacSum.
The third example highlights the issue of PacSum being identical to the Lead-3 baseline by modelling position information present in the dataset. In the fourth example, the PacSum-based approach selects two sentences with quotes that have negative connotations, while the one selected by PMI about how the protagonist could not forgive himself could serve to better explain the need for an intervention on the Dr. Phil show.
The purpose of this analysis is to highlight the intangible nature of the definition of relevance. The content selected varies between PMI and sentence similarity, and each might find an application in the right setting. It again highlights the need to consider what one expects from the summarisation task.

Example 1
Gold: Tim Esworthy, 66, has a prosthetic limb after losing his leg in a workplace incident and said he had been targeted by cold callers selling products easing joint pain. Christine Lewis, 62, is wheelchair-bound following a brain haemorrhage. She is also on list of people obtained by the mail and has been targeted by stairlift salesmen.
Interpolated PMI: Tim Esworthy, 66, from colchester, was 'absolutely appalled' to find his private medical details had been sold. Christine Lewis, who is recovering from a brain haemorrhage she had 12 years ago, was on a list of people who have mobility problems obtained by the mail and has been targeted by stairlift salesmen cold calling her. 'They shouldn't have my information, especially if they know I'm disabled because they are targeting me because they think I'm vulnerable.'
PacSum: Tim Esworthy, 66, from colchester, was 'absolutely appalled' to find his private medical details had been sold. Retired financial services manager Tim Esworthy was 'absolutely appalled' to find his private medical details had been sold . They know they can target vulnerable people because they have their medical information.
Lead-3: Tim Esworthy, 66, from Colchester, was 'absolutely appalled' to find his private medical details had been sold. Case 1 : Pensioner who lost leg at work. Retired financial services manager Tim Esworthy was 'absolutely appalled' to find his private medical details had been sold.

Example 2
Gold: Cillian McCann was filmed by his mother Toni at seven weeks old. In the clip, the little boy can clearly be seen trying to speak to his family. After several attempts he manages to say "hello". The average child can say six words by the time they reach 18 months.
Interpolated PMI: Whose adorable son Cillian said his first word at just seven weeks old. In the video Cillian is seen struggling to get his word out, but with a bit of encouragement from his mother he finally says hello. Toni says that Cillian was very alert from a young age and had been trying to make out words since he was just five weeks old .
PacSum: Most parenting advice says you don't have to worry if your baby doesn't start speaking until around 18 months. The tiny tot, who is now nine weeks old, was filmed by his 36-year-old mother who says that she knew he had been trying to communicate for a while. Cillian has three older sisters, Toni revealed that her little girls, Sophie(bottom right), Eva(bottom left) and Ellie(top), did not start talking at such an early age.
Lead-3: Most parenting advice says you don't have to worry if your baby doesn't start speaking until around 18 months. Whose adorable son Cillian said his first word at just seven weeks old. The tiny tot, who is now nine weeks old, was filmed by his 36-year-old mother who says that she knew he had been trying to communicate for a while.

Example 3
Gold: Ashleigh humphrys, 20, died in a hit-and-run early on sunday morning. Police believe the driver of the car was heading to work. A man is assisting police with their investigations after the death. Ms Humphrys was walking home after celebrating her birthday with friends. A security guard rang police after she was walking disorientated. CCTV footage shows two taxis stop near her before she was struck and put hazard lights on. Then a car drove past the taxis, mounted the footpath before swerving back onto the road and driving off. A taxi is said to have been seized and police are talking to a person 'within the vicinity' at the time of the incident.
Interpolated PMI: Brisbane woman Ashleigh Humphrys died in a hit-and-run incident after deciding to walk from Toowong to her Seventeen Mile Rocks home in Brisbane after having an argument with a friend while they were out celebrating her 20th birthday. Only moments later the guard, who was still on the phone to police while driving around trying to find Ms Humphrys, discovered her dead on the road at the city end of the western freeway. Just before Ms Humphrys was hit, CCTV footage shows two taxis stop near the woman and put their hazard lights on before a car drove past the taxis, mounted the footpath and then swerved back onto the road before driving off.
PacSum: The driver of a car that hit and killed a young woman in the early hours on Sunday morning was on the way to work, police believe. Brisbane woman Ashleigh Humphrys died in a hit-and-run incident after deciding to walk from Toowong to her Seventeen Mile Rocks home in Brisbane after having an argument with a friend while they were out celebrating her 20th birthday. Now, after it was revealed that a man was assisting police with their investigations, officers have said they believe he was on his way to work and went to his shift as normal on sunday, the Courier Mail reported.
Lead-3: The driver of a car that hit and killed a young woman in the early hours on Sunday morning was on the way to work, police believe. Brisbane woman Ashleigh Humphrys died in a hit-and-run incident after deciding to walk from Toowong to her Seventeen Mile Rocks home in Brisbane after having an argument with a friend while they were out celebrating her 20th birthday. Now, after it was revealed that a man was assisting police with their investigations, officers have said they believe he was on his way to work and went to his shift as normal on sunday, the Courier Mail reported.

Example 4
Gold: Dr. Phil Mcgraw staged a highly-charged intervention with Nick Gordon last Thursday. With his mother, Michelle, by his side a sobbing Gordon talked about missing Bobbi Kristina. Gordon is now in rehab after the intervention having been drinking heavily and taking xanax. Girlfriend Bobbi Kristina has been in a medically induced coma since January 31 and Gordon has not been allowed to see her. The dramatic intervention will air Wednesday on the Dr Phil show.
Interpolated PMI: Amid scenes of high emotion, an often incoherent Gordon admitted drinking heavily and taking xanax, for which he has a prescription, in an attempt to deal with life since Bobbi Kristina was found face down and unresponsive in her bathtub on January 31. Breakdown: With his mother, Michelle, by his side Nick Gordon struggles to stay coherent as he is questioned by Dr Phil. According to his mother, Michelle, Gordon can not forgive himself for his 'failure' to revive Bobbi Kristina
PacSum: Weeping and wailing Nick Gordon, the troubled fiancé of Bobbi Kristina Brown, has admitted that he has twice tried to kill himself and confessed: "I'm so sorry for everything." Asked if he still intended to kill himself he said: "If anything happens to Krissi I will." Amid scenes of high emotion, an often incoherent Gordon admitted drinking heavily and taking xanax, for which he has a prescription, in an attempt to deal with life since Bobbi Kristina was found face down and unresponsive in her bathtub on January 31.
Lead-3: Weeping and wailing Nick Gordon, the troubled fiancé of Bobbi Kristina Brown, has admitted that he has twice tried to kill himself and confessed : "I'm so sorry for everything." Gordon, 25, was speaking to Dr Phil Mcgraw in a dramatic intervention due to air on Wednesday, Daily Mail online can reveal. Asked if he still intended to kill himself he said: "If anything happens to Krissi I will."

Table 5: Example summaries obtained from the CNN-Dailymail validation set compared to the corresponding extractive candidate summary obtained using Interpolated PMI, PacSum (state-of-the-art unsupervised summary using sentence similarity) and the Lead-3 baseline.