OTExtSum: Extractive Text Summarisation with Optimal Transport

Extractive text summarisation aims to select salient sentences from a document to form a short yet informative summary. While learning-based methods have achieved promising results, they have several limitations, such as dependence on expensive training and lack of interpretability. Therefore, in this paper, we propose a novel non-learning-based method, Optimal Transport Extractive Summariser (OTExtSum), which formulates text summarisation as an Optimal Transport (OT) problem for the first time. Optimal sentence extraction is conceptualised as obtaining an optimal summary that minimises the transportation cost to a given document regarding their semantic distributions. Such a cost is defined by the Wasserstein distance and used to measure the summary's semantic coverage of the original document. Comprehensive experiments on four challenging and widely used datasets, i.e., MultiNews, PubMed, BillSum, and CNN/DM, demonstrate that our proposed method outperforms the state-of-the-art non-learning-based methods and several recent learning-based methods in terms of the ROUGE metric.


Introduction
Text summarisation aims to condense a given document into a short and succinct summary that best covers the semantics of the document with the least redundancy. It helps users quickly browse and understand long documents by focusing on their most important sections (Mani, 2001; Nenkova and McKeown, 2011). A common practice for text summarisation is extractive summarisation, which aims to select the salient sentences of a given document to form its summary. Extractive summarisation ensures the production of grammatically and factually correct summaries, though the output summaries could be inflexible. Since abstractive summaries are highly prone to contain content that is unfaithful and nonfactual to the original document, extractive summaries are more practical for real-world scenarios, especially for domains requiring formal writing such as legal, scientific, and journalistic documents.

Figure 1: Illustration of Optimal Transport Extractive Summariser (OTExtSum): the formulation of extractive summarisation as an optimal transport (OT) problem. Optimal sentence extraction is conceptualised as obtaining the optimal extraction vector m*, which achieves an OT plan from a document D to its optimal summary S* that has the minimum transportation cost. Such a cost is defined as the Wasserstein distance between the document's semantic distribution TF_D and the summary's semantic distribution TF_S and is used to measure the summary's semantic coverage.

[1] Our code is publicly available for research purposes at https://github.com/peggypytang/OTExtSum/
Existing methods (Yao et al., 2017) often first score the importance of individual sentences of a given document and then combine the top-ranked ones to form a summary. However, the sentences with high importance scores may not represent the document well from a global perspective, which results in a sub-optimal summary. Recently, learning-based methods, especially those based on supervised and unsupervised deep learning techniques (Narayan et al., 2018; Zheng and Lapata, 2019; Zhang et al., 2019; Xu et al., 2020; Zhong et al., 2020; Padmakumar and He, 2021), have significantly improved summarisation performance. However, training deep learning models is computationally expensive, and it can be difficult to apply models learned from a particular domain to other domains with different distributions. Moreover, deep learning methods generally lack interpretability for the summarisation process.
Motivated by these issues, we propose a novel non-learning based extractive summarisation method, namely Optimal Transport Extractive Summariser (OTExtSum). As illustrated in Figure 1, we formulate extractive summarisation based on the optimal transport (OT) theory (Peyré et al., 2019). A candidate summary can be evaluated by an OT plan regarding the optimal cost to transport between the semantic distributions of the summary and its original document. Then a Wasserstein distance can be obtained with this optimal plan to measure the discrepancy between the two distributions. Hence, it can be expected that a summary of high quality minimises this Wasserstein distance. Moreover, a common assumption in formulations of the OT problem is that the source and target distributions are fixed. In the OTExtSum problem formulation, we relax this assumption by adding an extraction vector m to indicate which document sentences are extracted to form the summary's semantic distribution, thus making the target distribution variable.
The semantic distributions of a given document and its candidate summary can be formulated in line with the frequency of their tokens. Inspired by the Word Mover's Distance (Kusner et al., 2015), summarisation can be conceptualised as moving the "semantics" of a given document to its summary, and the ideal summary is obtained at the minimal transportation cost. This ensures the highest semantic coverage of the given document and the least redundancy in the summary without explicitly modelling conventional criteria such as relevance and redundancy. Thus, under the OT plan, the Wasserstein distance indicates the candidate summary's semantic coverage of the given document.
We design two optimisation strategies to approximate the optimal extraction vector m*: a beam search strategy (Tillmann and Ney, 2003), which iteratively evaluates the semantic coverage scores of a set of candidate summaries to obtain the optimal extraction, and a binary integer programming strategy, which approximates the optimal extraction given the constraints of the Wasserstein distance and the extraction budget. As a non-learning based method, OTExtSum does not require any training and is applicable to different document domains. Furthermore, it provides explainable results in terms of the semantic coverage of the summary.
There have been some studies on OT in NLP, such as document distance (Kusner et al., 2015; Yurochkin et al., 2019), text generation (Chen et al., 2018), text matching (Swanson et al., 2020), and machine translation (Xu et al., 2021). These methods generally focus on deriving similarities between words, sentences, and documents. In contrast, we formulate text summarisation as an OT problem for the first time, optimally transporting the semantic distributions between two texts (e.g., a source document and a summary candidate).
Overall, the key contributions of this paper are: • We propose a non-learning based extractive summarisation method, OTExtSum, by treating the text summarisation task as an optimal transport problem for the first time.
• We design two optimisation strategies for OTExtSum: beam search strategy and binary integer programming strategy.
• We present an interpretable visualisation of the semantic coverage of a generated summary by visualising the transport plan between summary tokens and document tokens.
• Comprehensive experimental results on four widely used datasets, including CNN/DM, MultiNews, BillSum and PubMed, demonstrate that OTExtSum outperforms the state-of-the-art non-learning based methods.

Related Work
Generally, text summarisation methods can be categorised as extractive, abstractive, and hybrid. While abstractive and hybrid summarisation methods (Lebanoff et al., 2019; Zhang et al., 2020) aim to mimic human beings by paraphrasing a given document, extractive summarisation generally produces more factual summaries.
In this section, we review existing extractive summarisation methods in two categories: non-learning based and learning-based methods.

Non-learning based Methods
Most of the non-learning based methods conceptualise text summarisation as a sentence ranking task. Each sentence in a given document is scored in terms of various sentence importance criteria, which measure how well the sentence represents the document. The top-ranked sentences are combined to form a summary. These methods often rely heavily on handcrafted features based on linguistic knowledge, focusing on local and/or global contexts.

Local Context based Methods. Local context-based methods rank a sentence based on features obtained from the sentence itself. Sentence features such as frequency-based and topic-based features have been studied. Frequency-based features (Edmundson, 1969; Hovy and Lin, 1998) assume that the occurrence of high-frequency terms in a sentence is associated with its importance. Topic-based features (Kupiec et al., 1995; Nobata and Sekine, 2004; Lin and Hovy, 2000) assume that the density of a set of topic terms is highly correlated with the topic of a document.
Global Context based Methods. As local context features could overlook the correlations between sentences and lead to redundant summaries containing similar sentences, global context-based methods rank individual sentences from the perspective of the entire document. Discourse-based methods (Marcu, 1999) construct a document's rhetorical structure and extract the sentences on the longest chain of the semantic structure, i.e., the main topic. Centroid-based methods (Radev et al., 2000) cluster the sentences of a document through similarity measures and rank the sentences based on their distances to the cluster centroids. TextRank (Mihalcea and Tarau, 2004), a graph-based method, is the state-of-the-art non-learning based method. A graph among document sentences is first formed by connecting sentences using sentence similarity scores; then the sentence connectivity can be used to score the importance of a sentence. Nonetheless, these sentence-based scoring methods could miss summary-level or document-level patterns.

Learning-based Methods
Instead of utilising handcrafted features, motivated by the great success of deep learning in many natural language processing tasks, recent studies on extractive summarisation aim to learn sentence features from the corpus in a data-driven manner.
Supervised Methods. Most of these methods follow the sentence ranking conceptualisation, and an encoder-decoder scheme is generally adopted (Nallapati et al., 2017;Zhang et al., 2019;Xu et al., 2020). An encoder formulates document or sentence representations, and a decoder predicts a sequence of sentence importance scores with the supervision of ground-truth sentence labels.
Reinforcement Learning based Methods. Reinforcement learning (RL) can be utilised for extractive summarisation by directly optimising the ROUGE metric, which is used as the training reward. The RL-based summarisation task can be treated as a sentence ranking problem similar to the aforementioned methods (Narayan et al., 2018) or as a contextual-bandit problem (Luo et al., 2019).
Unsupervised Methods. Various unsupervised methods have also been proposed to leverage pre-trained language models to compute sentence similarities and select important sentences. Some methods (Zheng and Lapata, 2019) use these similarities to construct a sentence graph and select sentences based on their centrality. Others (Padmakumar and He, 2021) use these similarities to score the relevance and redundancy of sentences as selection criteria.
Although these learning-based methods have significantly improved summarisation performance, computationally expensive training is inevitable, and it is challenging to generalise the trained models to documents from other domains whose distributions differ from the training dataset. In addition, it is difficult to explain the correspondence and coverage between a summary and a source document using these deep models. Therefore, to address these limitations, we revisit the non-learning based approach and propose a novel summarisation method by exploring optimal transport theory for the first time.

Methodology
As shown in Figure 1, OTExtSum utilises a text OT approximation to obtain the optimal extraction vector $m^* = [m_1, \ldots, m_n]^\top$, where $m_i \in \{0, 1\}$ denotes whether the i-th sentence is to be extracted (denoted by 1) or not (denoted by 0). The optimal extraction vector m* achieves an OT plan from the semantic distribution of the document to that of its optimal candidate summary, which has the minimum total transportation cost.
The OT approximation consists of four components: 1) a tokeniser and embedding procedure that formulates token-level representations, and a semantic distribution estimation that computes the frequency of each token within a summary or a document; 2) a transportation cost matrix that measures the cost of using one token to represent another based on their Euclidean distances; 3) an OT solver that approximates the Wasserstein distance and semantic coverage of the candidate summaries; and 4) an optimisation strategy that obtains the optimal extraction vector by choosing the summary with the minimum Wasserstein distance, and thus the highest semantic coverage of the source document.

Optimal Transport
Consider a transportation problem that transports goods from a collection of suppliers D = {d_i | i = 1, ..., N} to a collection of customers S = {s_j | j = 1, ..., N}, where d_i and s_j indicate the supply quantity of the i-th supplier and the order quantity of the j-th customer, respectively. Note that, in this study, we consider the number of suppliers to be the same as that of the customers. By defining t_ij as the quantity transported from the i-th supplier to the j-th customer, a transport plan T = {t_ij} ∈ R^{N×N} can be obtained. Given a cost matrix C = {c_ij} ∈ R^{N×N}, where c_ij is the cost to deliver a unit of goods from the i-th supplier to the j-th customer, the cost of the transport plan T can be calculated. In particular, an OT plan T* = {t*_ij} ∈ R^{N×N} that minimises the transportation cost can be obtained by solving the following optimisation problem:

$$T^* = \underset{T}{\arg\min} \sum_{i=1}^{N}\sum_{j=1}^{N} t_{ij} c_{ij}, \quad \text{s.t.} \quad \sum_{j=1}^{N} t_{ij} = d_i, \;\; \sum_{i=1}^{N} t_{ij} = s_j, \;\; t_{ij} \ge 0, \quad (1)$$

where the first two constraints indicate the quantity requirements for both suppliers and customers, and the last constraint ensures a non-negative order quantity. Mathematically, this OT problem is to find a joint distribution T with respect to a cost C, of which the marginal distributions are D and S. In particular, the Wasserstein distance can be defined as:

$$d_W(D, S \mid C) = \sum_{i=1}^{N}\sum_{j=1}^{N} t^*_{ij} c_{ij}. \quad (2)$$

It can be viewed as the distance between the two probability distributions D and S, if they are normalised, in line with the cost C.
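In practice the linear programme above is often approximated with entropic regularisation via the Sinkhorn algorithm; the POT and GeomLoss libraries adopted in our experiments provide efficient implementations. Purely as an illustrative sketch (not the paper's implementation), a minimal pure-Python Sinkhorn solver for two small histograms might look as follows:

```python
import math

def sinkhorn(a, b, C, eps=0.05, n_iters=200):
    """Entropic-regularised approximation of the OT plan between
    histograms a and b under cost matrix C (Sinkhorn iterations).
    Returns (plan T, approximate transport cost sum_ij T_ij * C_ij)."""
    n = len(a)
    # Gibbs kernel K = exp(-C / eps)
    K = [[math.exp(-C[i][j] / eps) for j in range(n)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * n
    for _ in range(n_iters):
        # Alternate the scaling updates u <- a / (K v), v <- b / (K^T u);
        # zero marginal entries simply yield zero scalings.
        Kv = [sum(K[i][j] * v[j] for j in range(n)) for i in range(n)]
        u = [a[i] / Kv[i] if Kv[i] > 0 else 0.0 for i in range(n)]
        KTu = [sum(K[i][j] * u[i] for i in range(n)) for j in range(n)]
        v = [b[j] / KTu[j] if KTu[j] > 0 else 0.0 for j in range(n)]
    # Recover the plan T = diag(u) K diag(v) and its transport cost.
    T = [[u[i] * K[i][j] * v[j] for j in range(n)] for i in range(n)]
    cost = sum(T[i][j] * C[i][j] for i in range(n) for j in range(n))
    return T, cost
```

For example, moving all mass from the first token to the second at unit cost yields a distance close to 1, while identical histograms under a zero-diagonal cost yield a distance close to 0.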

Semantic Distribution
In the context of text summarisation, let D = {s_1, ..., s_n} denote a document, where s_i denotes the i-th sentence of the document. The sentence s_i has a semantic distribution TF_i ∈ R^N computed by the normalised bag-of-tokens with stop-words removed:

$$\mathrm{TF}_i = \frac{[d_1, \ldots, d_N]^\top}{\sum_{j=1}^{N} d_j},$$

where d_j indicates the count of the j-th token in a vocabulary of size N.
A document D has a semantic distribution TF_D:

$$\mathrm{TF}_D = \frac{\sum_{i=1}^{n} \mathrm{TF}_i}{\lVert \sum_{i=1}^{n} \mathrm{TF}_i \rVert_1}.$$

For a summary S ⊂ D with its corresponding extraction vector m, of which the i-th element m_i is an indicator (m_i = 1 if s_i ∈ S, and m_i = 0 otherwise), the summary has a semantic distribution TF_S:

$$\mathrm{TF}_S = \frac{\sum_{i=1}^{n} m_i \mathrm{TF}_i}{\lVert \sum_{i=1}^{n} m_i \mathrm{TF}_i \rVert_1}.$$

In our proposed method, a normalisation step is introduced to approximate the semantic distributions of D and S with term frequency. Note that after the normalisation, TF_D and TF_S have an equal total goods quantity of 1 and can be completely transported from one to the other. In addition, TF_D and TF_S satisfy the property of discrete probability distributions, whose elements sum to 1.
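The distributions above can be sketched in a few lines of pure Python. The tiny stop-list, whitespace tokenisation, and helper names below are illustrative stand-ins for the actual tokeniser of the pre-trained embedding model:

```python
from collections import Counter

STOP_WORDS = {"the", "a", "is", "of", "to"}  # tiny illustrative stop-list

def sentence_tf(sentence, vocab):
    """Normalised bag-of-tokens distribution TF_i of one sentence over vocab."""
    counts = Counter(t for t in sentence.lower().split() if t not in STOP_WORDS)
    total = sum(counts[t] for t in vocab) or 1
    return [counts[t] / total for t in vocab]

def mixture_tf(tf_rows, m):
    """Semantic distribution of the sentences selected by extraction vector m,
    renormalised so the histogram sums to one (m = all-ones gives TF_D,
    a general multi-hot m gives TF_S)."""
    agg = [sum(m[i] * tf_rows[i][j] for i in range(len(tf_rows)))
           for j in range(len(tf_rows[0]))]
    total = sum(agg) or 1
    return [x / total for x in agg]
```

With two toy sentences, the document distribution sums to one, and selecting a single sentence recovers that sentence's own distribution.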

Transport Cost between Tokens
We define the unit transportation cost between two tokens by measuring their semantic similarity. Intuitively, the more semantically dissimilar a pair of tokens are, the higher the "transport cost" of transporting one token to another. Given a pre-trained tokeniser and token embedding model with N tokens, let v_i denote the feature embedding of the i-th token. The transport cost from the i-th token to the j-th token, c_ij in C, can be written as:

$$c_{ij} = \lVert v_i - v_j \rVert_2,$$

which is based on the Euclidean distance. [1]
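As a sketch of building this cost matrix, with toy 2-D vectors standing in for the actual Word2Vec, BERT, or GPT2 token embeddings:

```python
import math

def cost_matrix(embeddings):
    """Pairwise unit transport costs c_ij = ||v_i - v_j||_2
    between token embedding vectors."""
    n = len(embeddings)
    return [[math.dist(embeddings[i], embeddings[j]) for j in range(n)]
            for i in range(n)]
```

The resulting matrix has a zero diagonal and is symmetric, as expected of a distance-based cost.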

Semantic Coverage of Candidate Summaries
Intuitively, a good summary S is supposed to be close to the document D in terms of their semantic distributions. OTExtSum utilises the Wasserstein distance to measure the distance between the two associated semantic distributions TF_D and TF_S with the OT cost. The computation of the Wasserstein distance has a time complexity of O(p^3 log(p)) (Altschuler et al., 2017), where p denotes the number of unique words in the document.
In detail, it can be obtained with Eq. (2) as d_W(TF_D, TF_S | C) with a pre-defined cost matrix C. Then a semantic coverage score of the summary S with respect to the document D can be further defined based on the Wasserstein distance:

$$g(\mathrm{TF}_D, \mathrm{TF}_S \mid C) = -\, d_W(\mathrm{TF}_D, \mathrm{TF}_S \mid C).$$

Therefore, OTExtSum aims to search for an extraction vector m whose corresponding summary S minimises the Wasserstein distance, i.e., maximises the semantic coverage score for the given document D, by solving OT problems.

Optimisation Strategy
The remaining problem for OTExtSum is to search for the optimal extraction vector m* that achieves the minimum total transportation cost from the semantic distribution of the document TF_D to that of the optimal summary TF_S, given a budget B on the number of sentences that can be extracted to form a summary:

$$m^* = \underset{m}{\arg\min}\; d_W(\mathrm{TF}_D, \mathrm{TF}_S \mid C), \quad \text{s.t.} \quad \sum_{i=1}^{n} m_i \le B, \;\; m_i \in \{0, 1\}.$$

In search of the optimal extraction vector m*, we design two optimisation strategies: a beam search strategy to achieve a better coverage approximation, and a binary integer programming strategy to achieve better computational efficiency.

[1] We investigated the effect of different distance measurements. As discussed in Section 4.3, cost matrices based on the Euclidean distance and the cosine distance yield similar ROUGE scores.

Algorithm 1: Optimisation of OTExtSum with Beam Search Strategy
Input: D the document, B the budget of the number of extracted sentences, K the beam width.
Output: S* the optimal extractive summary.
1: Initialise the candidate summary set S with the single-sentence candidate summaries of D;
2: for b = 1, ..., B do
3:   for each S_k ∈ S do
4:     Form a set of new candidate summaries S_k^b by appending each sentence of D not in S_k to S_k;
5:   end
6:   Update S as the union of all S_k^b;
7:   for each S_k ∈ S do
8:     Compute the semantic distribution TF_{S_k} of S_k ∈ S;
9:     Compute the Wasserstein distance d_W(TF_D, TF_{S_k} | C) and the semantic coverage g(TF_D, TF_{S_k} | C);
10:   end
11:   Keep the top K candidate summaries with the highest g(TF_D, TF_{S_k} | C) and prune the rest in S;
12: end
13: S* = argmax_{S_k ∈ S} g(TF_D, TF_{S_k} | C);

Beam Search Strategy
The Beam Search (BS) strategy with beam width K maintains the candidate summary set S and searches for the optimal extraction vector m*, and thus the optimal extractive summary S*. Algorithm 1 presents the steps to obtain the optimal summary with OTExtSum using the BS strategy. The time complexity is O(BKn · p^3 log(p)).
Initially, we have m = 0, where none of the sentences are extracted. Then, each sentence in the document D is selected as a candidate summary, which derives a set of candidate extraction vectors corresponding to a set of candidate summaries, and each semantic coverage score can be evaluated. The top K candidate summaries in terms of semantic coverage are kept in the set S and the rest are pruned. During the b-th iteration of the beam search, by appending each possible sentence to an existing candidate summary S_k ∈ S, where the sentence is not in S_k, a set of new candidate summaries S_k^b can be obtained. Then S is updated by combining all these sets of new candidate summaries over k:

$$\mathcal{S} = \bigcup_{k} \mathcal{S}_k^b.$$

At the end of the beam search, a set of K final summary candidates within the budget B is obtained. Among the K final candidates from the beam search, OTExtSum obtains the optimal extraction vector and thus the optimal summary by choosing the candidate with the highest semantic coverage of the document D.
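The search itself can be sketched compactly with the coverage scorer left pluggable. In this hypothetical sketch, a toy set-overlap scorer replaces the Wasserstein-based g used by OTExtSum:

```python
def beam_search_summary(n_sents, budget, beam_width, coverage):
    """Beam search over extraction sets (a sketch of Algorithm 1).
    `coverage(indices)` scores a frozenset of selected sentence indices;
    in OTExtSum it would be the negative Wasserstein distance between the
    selection's semantic distribution and the document's."""
    beam = [frozenset()]
    for _ in range(budget):
        # Expand every beam entry by one unselected sentence,
        # then score and prune back to the beam width.
        candidates = {s | {i} for s in beam for i in range(n_sents) if i not in s}
        beam = sorted(candidates, key=coverage, reverse=True)[:beam_width]
    return max(beam, key=coverage)
```

For instance, with a toy scorer that rewards covering sentences {0, 2} and mildly penalises length, the search recovers exactly that pair under a budget of two.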

Algorithm 2: Optimisation of OTExtSum with Binary Integer Programming Strategy
Input: D the document, B the budget of the number of extracted sentences, T the number of iterations.
Output: S* the optimal extractive summary.
1: Initialise the continuous proxy vector w ∈ R^n;
2: for t = 1, ..., T do
3:   Hard sample a multi-hot vector b from the Gumbel-Softmax distribution parameterised by w;
4:   Compute the semantic distribution TF_S of the sentences selected by b;
5:   Compute the loss L(w) = d_W(TF_D, TF_S) + α||b||_1;
6:   Update w by gradient descent on L(w);
7: end
8: Soft sample m* from the Gumbel-Softmax distribution parameterised by w;
9: Obtain S* by extracting the top-B sentences with the highest m_i values for i = 1, ..., n;

Binary Integer Programming Strategy
Some prior works showed that integer linear programming is an efficient solution to the summarisation problem (McDonald, 2007; Gillick and Favre, 2009). The Binary Integer Programming (BIP) strategy is therefore utilised to search for the optimal extraction vector m* over T iterations. Based on the extraction vector, we obtain the optimal extractive summary S*. Algorithm 2 presents the optimisation steps to obtain the optimal summary with OTExtSum using the BIP strategy. The time complexity is O(T · p^3 log(p)).
As m * is a multi-hot vector and is not differentiable, to make the backpropagation work, we optimise a proxy continuous vector w ∈ R n , which is differentiable. Then we hard sample from the Gumbel-Softmax distribution (Maddison et al., 2016) to discretise and compute a multi-hot vector b during the iterations, and soft sample to compute m * at the end.
The BIP strategy optimises the following loss function w.r.t. w, which is a weighted sum of the Wasserstein distance d_W(TF_D, TF_S) and the L1 regularisation of b [2]:

$$L(w) = d_W(\mathrm{TF}_D, \mathrm{TF}_S) + \alpha \lVert b \rVert_1,$$

where α denotes the weight of the L1 regularisation.
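The hard-sampling step can be illustrated (forward pass only) with a hypothetical pure-Python sketch of one plausible realisation, a per-element binary Gumbel-Softmax (Gumbel-sigmoid). The actual method additionally backpropagates through the soft relaxation to update w:

```python
import math, random

def gumbel_noise():
    """Sample standard Gumbel noise via inverse transform sampling."""
    u = random.random()
    return -math.log(-math.log(u + 1e-12) + 1e-12)

def hard_binary_sample(w, tau=0.5):
    """Hard multi-hot sample b from logits w using a binary Gumbel-Softmax
    relaxation: the forward pass thresholds the soft sample at 0.5, while
    training would backpropagate through the soft value
    (straight-through estimator)."""
    b = []
    for wi in w:
        soft = 1.0 / (1.0 + math.exp(-(wi + gumbel_noise() - gumbel_noise()) / tau))
        b.append(1 if soft > 0.5 else 0)
    return b
```

With strongly positive or negative logits, the sampled entries are almost surely 1 or 0 respectively, while entries near zero are stochastic.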

Datasets
To validate the effectiveness of the proposed OTExtSum on documents with various writing styles and its ability to achieve improved summarisation performance, we perform experiments on four widely used challenging datasets collected from different domains. Among them, PubMed is a scientific article dataset that uses the abstract section as the ground-truth summary and the long body section as the document. Table 1 shows an overview of the four datasets. The dataset details are in Appendix A. While CNN/DM contains shorter documents and summaries, the other three datasets are more challenging because they have more extended documents and summaries, and thus have a higher chance of extracting sentences that contain redundant content or have limited relevance to the document.

Implementation Details
In terms of the pre-trained token embedding model, we compare the static embedding model Word2Vec with the contextual embedding models BERT and GPT2. The details of the hyperparameter settings and software used are in Appendices B and C.
Our OTExtSum is compared against LEAD (See et al., 2017), ORACLE (Nallapati et al., 2017), the state-of-the-art non-learning based methods, and recent unsupervised learning-based methods. LEAD and ORACLE are standard baselines in the summarisation task. The LEAD baseline extracts the first several sentences of a document as a summary. The ORACLE baseline greedily extracts the sentences that maximise the ROUGE-L score with the reference summary. We compare with the results of strong non-learning based methods, including LSA (Gong and Liu, 2001), TextRank (Mihalcea and Tarau, 2004), and LexRank (Erkan and Radev, 2004).
Their results on MultiNews, BillSum, PubMed, and CNN/DM are from (Fabbri et al., 2019), (Kornilova and Eidelman, 2019), (Cohan et al., 2018), and (Padmakumar and He, 2021), respectively. For an informative reference, we report recent unsupervised learning-based methods, including PacSum (Zheng and Lapata, 2019), whose released model was trained on the news domain, and PMI (Padmakumar and He, 2021), whose released models were trained on the news and science domains. Their results on CNN/DM are from (Padmakumar and He, 2021). Their results on Multi-News, BillSum, and PubMed are evaluated on these datasets with the corresponding released models from the same domains. We also include the results of the state-of-the-art supervised learning-based methods: the extractive approach MatchSum from (Zhong et al., 2020) and the abstractive approach PEGASUS from (Zhang et al., 2020).

Quantitative Analysis
The commonly used ROUGE metric (Lin, 2004) is adopted for our quantitative analysis. It evaluates the content consistency between the generated summary and the reference summary. In detail, ROUGE-n scores measure the number of overlapping n-grams between the generated summary and the reference summary, while the ROUGE-L score considers the longest common subsequence between the two.
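As a sketch of how such scores are computed, here is a simplified, hypothetical re-implementation of the ROUGE-n F-score on whitespace tokens; the reported results use the pyrouge package, which applies additional preprocessing:

```python
from collections import Counter

def rouge_n(candidate, reference, n=2):
    """Simplified ROUGE-n F-score: clipped n-gram overlap between
    candidate and reference token sequences."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    c, r = ngrams(candidate), ngrams(reference)
    if not c or not r:
        return 0.0
    overlap = sum((c & r).values())  # Counter & clips repeated n-grams
    precision, recall = overlap / sum(c.values()), overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall) if overlap else 0.0
```

An identical candidate and reference score 1.0, disjoint texts score 0.0, and partial overlaps fall in between.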
Performance Overview. The experimental results of OTExtSum on the four datasets are listed in Table 2 in terms of ROUGE-1, ROUGE-2 and ROUGE-L F-scores. We observed that the BS strategy generally achieves better optimisation results than the BIP strategy, which is in line with our design understanding that beam search can better reach the global optimum. However, the two strategies achieve similar results on CNN/DM, which could be because CNN/DM has fewer document sentences and a lower budget, and thus fewer possible solutions, making the optimum easier to find.
OTExtSum outperforms the state-of-the-art non-learning based methods and is comparable to the learning-based methods. Note that the state-of-the-art methods usually optimise at the sentence level, whilst OTExtSum is based on summary-level OT evaluation, by which the quality of the resulting summaries is improved.
We observed that OTExtSum obtains significantly better ROUGE scores than the baseline methods on Multi-News, BillSum and PubMed, while the improvement is not as significant on CNN/DM. When the summary is more extended, as in these three more challenging datasets, the summary sentences are more likely to contain redundant content. That is, even though summary-level optimisation is more difficult to achieve, our OTExtSum demonstrates higher improvements.
OTExtSum is a non-learning based method, and training is not required. Unlike learning-based methods, it is not limited by the training data domain and can be used for different domains. Experimental results demonstrate the generalisation ability of OTExtSum over the news, law, and science domains.
Effects of Token Embeddings Models. OTExtSum is dependent on a pre-trained token embedding method.
Specifically, the token embedding model affects the cost matrix C and the tokenisation of the document, and thus its frequency vector. We examine how different token embedding models affect the performance of OTExtSum by comparing the static embedding model Word2Vec and the contextual embedding models BERT and GPT2.
The results on most of the datasets indicate that a more advanced contextual embedding model, such as BERT or GPT2, is more effective than the static embedding model Word2Vec. This is in line with the intuitive understanding that a more representative model with adequate training samples often approximates better token embeddings and representations. Despite that, the performance of OTExtSum with Word2Vec is surprisingly competitive.
Effects of Stop-words. We investigate the impact of stop-words on the performance of OTExtSum. As shown in Table 3 in Appendix E, the effect varies slightly across the datasets and does not greatly influence the ROUGE scores. This could be because text summarisation does not generally depend on stop-words. A side benefit of removing the stop-words is a reduced vocabulary size and thus a shorter OT computation time.

Effects of Distance Measurement. We examine how the distance measurement of the cost matrix impacts the performance of OTExtSum. As shown in Table 3 in Appendix E, cost matrices based on the cosine distance and the Euclidean distance usually yield similar ROUGE scores.

Interpretable Visualisation
OTExtSum is able to provide an interpretable visualisation of the summarisation procedure. Figure 2 in Appendix D illustrates the transport plan heatmap, which indicates the transportation of semantic content between tokens in the document and its resulting summary. The higher the intensity, the more the semantic content of a particular document token is covered by a summary token. Figures 3, 4, 5, and 6 in Appendix F compare the summaries produced by OTExtSum and TextRank. TextRank extracts sentences that are salient on their own yet redundant when combined to form a summary. In comparison, OTExtSum is able to compose summaries that have higher semantic coverage and less redundant content.

Conclusion
In this paper, we have presented OTExtSum, the first optimal transport-based optimisation method for extractive text summarisation. It aims to identify an optimal subset of sentences for producing a summary that achieves high semantic coverage of the document by minimising the Wasserstein distance between the semantic distributions of the document and the summary. It helps obtain a summary from a global perspective and provides an interpretable visualisation of extraction results. In addition, OTExtSum does not require computationally expensive training. The comprehensive experiments demonstrate the effectiveness of OTExtSum, which is generalisable over various document domains. In our future work, we will explore other OT solvers for extractive summarisation.

A Dataset Details
We followed (Zhong et al., 2020) to set B for CNN/DM, PubMed and Multi-News, and used the average number of sentences in the summaries to set B for BillSum, since this is a common practice in the literature (Narayan et al., 2018). These datasets were obtained from HuggingFace Datasets [3].
Since OTExtSum does not require training, for a fair comparison, all experimental results are reported on the test splits of the four datasets only.

B Hyperparameter Details
For the hyperparameter settings of the BIP strategy, the number of iterations T was set to 200, α was set to 1, and the SGD optimiser (Sutskever et al., 2013) was used with a learning rate of 0.1. For the BS strategy, the beam width K was set to 5 [4].

C Software and Hardware Used
We obtained the pre-trained Word2Vec model (Google News, 300 dimensions) from GENSIM [5], and the contextual embedding models BERT (base version) and GPT2 from HuggingFace [6]. To compute the Wasserstein distances, we adopted the GENSIM, POT [7], and GeomLoss (Feydy et al., 2019) libraries. The list of stop-words was from the NLTK library [8]. Our experiments were run on a GeForce GTX 1080 GPU card. We obtained our ROUGE scores using the pyrouge package [9].

D Example of Interpretable Visualisation
Figure 2: Interpretable visualisation of the OT plan from a source document to a resulting summary on the CNN/DM dataset. The higher the intensity, the more the semantic content of a particular document token is covered by a summary token. The purple line highlights the transportation from the document to the summary of the semantic content of the token "month", which appears in both the document and the summary. The red line highlights how the semantic content of the token "sponsor", which appears in the document only, is transported to the tokens "tour" and "extension", which are semantically closer and have lower transport costs, thus achieving the minimum transportation cost in the OT plan.

F Generation Samples
Below are generation samples of OTExtSum and TextRank. In general, the OTExtSum-based summaries contain less redundant content and provide higher semantic coverage with the same number of extracted sentences.