Bipartite Graph Pre-training for Unsupervised Extractive Summarization with Graph Convolutional Auto-Encoders

Pre-trained sentence representations are crucial for identifying significant sentences in unsupervised extractive document summarization. However, the traditional two-step paradigm of pre-training and sentence ranking creates a gap, because the two steps optimize different objectives. To address this issue, we argue that pre-trained embeddings derived from a process explicitly designed to optimize cohesive and distinctive sentence representations help rank significant sentences. To this end, we propose a novel graph pre-training auto-encoder that obtains sentence embeddings by explicitly modelling intra-sentential distinctive features and inter-sentential cohesive features through sentence-word bipartite graphs. These pre-trained sentence representations are then used in a graph-based ranking algorithm for unsupervised summarization. Our method provides summary-worthy sentence representations that yield strong performance in unsupervised summarization frameworks, surpassing heavy BERT- or RoBERTa-based sentence representations in downstream tasks.


Introduction
Unsupervised document summarization involves generating a shorter version of a document while preserving its essential content (Nenkova and McKeown, 2011). It typically involves two steps: pre-training to learn sentence representations and sentence ranking using sentence embeddings to select the most relevant sentences within a document.

[Figure 1 example sentences: S_b: "A Chinese fan wearing an Argentina shirt runs onto the pitch to hug soccer legend Lionel Messi." S_c: "The clips show Messi, who appears initially shocked, stretching out his arms and hugging the fan back."]

For unsupervised document summarization, learning semantic sentence embeddings is crucial, alongside the sentence ranking paradigm. Textual pre-training models like the skip-thought model (Kiros et al., 2015), TF-IDF, and BERT (Devlin et al., 2019) generate sentential embeddings, enabling extractive systems to produce summaries that capture the document's central meaning (Yasunaga et al., 2017; Xu et al., 2019; Jia et al., 2020; Wang et al., 2020). By combining sentence representations generated from pre-trained language models, graph-based sentence ranking methods have achieved prominent performance (Zheng and Lapata, 2019; Liang et al., 2021; Liu et al., 2021).
Despite the effectiveness of graph-based ranking methods that incorporate pre-trained sentential embeddings, some issues remain underexplored. Firstly, a significant gap exists between the two-step paradigm of textual pre-training and sentence graph-ranking, as the optimization objectives diverge in these two steps. The pre-trained framework is primarily designed to represent sentences with universal embeddings rather than summary-worthy features. By relying solely on the universal embeddings, the nuanced contextual information of the document may be overlooked, resulting in sub-optimal summaries. Secondly, the existing graph formulations (e.g., GCNs (Bruna et al., 2014)) only encode distinctive sentences but not necessarily cohesive ones, which may limit the extraction of summary-worthy sentences.
In summarization, cohesive sentences reveal how well the summary represents a document, while distinctive sentences determine how much complementary information should be included in a summary. To exemplify how these sentence features arise from words, we analyze a sentence-word bipartite graph as depicted in Figure 1.
• The connections S_a−w_1, S_b−w_2, S_c−w_4 capture intra-sentential information, where the unique word nodes w_1 = Beijing, w_2 = fan, w_4 = shocked contribute distinctive features to their respective sentence nodes S_a, S_b, S_c.
• The connections S_a−w_0, S_b−w_0, S_b−w_3, S_c−w_3 capture inter-sentential information, where the shared word nodes w_0 = Argentina, w_3 = Messi contribute cohesive features to their connected sentence nodes S_a, S_b, S_c.
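The unique-versus-shared distinction above can be sketched in a few lines of Python. This is a toy illustration, not the paper's implementation: the word lists are paraphrased from the Figure 1 example, and S_a's words are hypothetical.

```python
# Toy sketch of the Figure 1 intuition: words shared by several sentences
# carry cohesive (inter-sentential) signal, while words unique to one
# sentence carry distinctive (intra-sentential) signal.
sentences = {
    "Sa": ["Beijing", "Argentina"],
    "Sb": ["fan", "Argentina", "Messi"],
    "Sc": ["shocked", "Messi"],
}

# Document frequency: how many sentences each word appears in.
freq = {}
for words in sentences.values():
    for w in set(words):
        freq[w] = freq.get(w, 0) + 1

unique_words = sorted(w for w, f in freq.items() if f == 1)  # distinctive
shared_words = sorted(w for w, f in freq.items() if f > 1)   # cohesive
print(unique_words)  # ['Beijing', 'fan', 'shocked']
print(shared_words)  # ['Argentina', 'Messi']
```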
Clearly, a sentence's distinctive features come from its individual word nodes, while its cohesive features come from word nodes shared with other sentences. Based on this observation, we argue that optimizing cohesive and distinctive sentence representations during pre-training is ultimately beneficial for ranking significant sentences in downstream extractive summarization. To achieve this, we propose a novel graph pre-training paradigm that uses a sentence-word bipartite graph with a graph convolutional auto-encoder (termed Bi-GAE) to learn sentential representations.
In detail, we pre-train the bipartite graph by predicting the word-sentence edge centrality score in a self-supervised manner. Intuitively, more unique nodes imply smaller edge weights, as they are not shared with other nodes; conversely, when there are more shared nodes, their edge weights tend to be greater. We present a novel method for bipartite graph encoding that concatenates an inter-sentential GCN_inter and an intra-sentential GCN_intra. These two GCNs provide two encoding channels for aggregating inter-sentential cohesive features and intra-sentential distinctive features during pre-training. Ultimately, the pre-trained sentence node representations are utilized for downstream extractive summarization.
Our pre-trained sentence representations obtain superior performance in both single-document summarization on the CNN/DailyMail dataset (Hermann et al., 2015) and multi-document summarization on the Multi-News dataset (Fabbri et al., 2019) within salient extractive summarization frameworks. Our contributions are twofold: i) To our knowledge, we are the first to introduce a bipartite word-sentence graph pre-training method and to pioneer bipartite-graph pre-trained sentence representations in unsupervised extractive summarization. ii) Our pre-trained sentence representation excels in downstream tasks using the same summarization backbones, surpassing heavy BERT- or RoBERTa-based representations and highlighting its superior performance.
Background & Related Work
In contrast to LexRank and TextRank, which construct an undirected sentence graph, PacSum (Zheng and Lapata, 2019) builds a directed graph. Its sentence centrality is computed by aggregating its incoming and outgoing edge weights:

centrality(s_i) = λ_1 Σ_{j<i} e_{i,j} + λ_2 Σ_{j>i} e_{i,j},

where the hyper-parameters λ_1, λ_2 are different weights for forward- and backward-looking directed edges and λ_1 + λ_2 = 1. e_{i,j} is the weight of the edge e_{i,j} ∈ E and is computed using word co-occurrence statistics, such as a similarity score. Building upon the achievements of PacSum (Zheng and Lapata, 2019), recent models such as FAR (Liang et al., 2021) and DASG (Liu et al., 2021) have aimed to improve extractive summarization by integrating centrality algorithms. These models primarily focus on seeking central sentences based on semantic facets (Liang et al., 2021) or sentence positions (Liu et al., 2021).
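As a hedged sketch of this directed-centrality idea (the actual PacSum formulation additionally shifts similarities by a tuned threshold, which is omitted here), the centrality of each sentence can be computed from a similarity matrix as follows:

```python
def pacsum_centrality(sim, lam1, lam2):
    """Directed centrality in the PacSum style: edges to preceding sentences
    and edges to following sentences receive different weights lam1 and lam2
    (lam1 + lam2 = 1). sim is a symmetric n x n similarity matrix."""
    n = len(sim)
    scores = []
    for i in range(n):
        backward = sum(sim[i][j] for j in range(i))        # edges with preceding sentences
        forward = sum(sim[i][j] for j in range(i + 1, n))  # edges with following sentences
        scores.append(lam1 * backward + lam2 * forward)
    return scores

# Toy symmetric similarity matrix for three sentences.
sim = [[1.0, 0.6, 0.1],
       [0.6, 1.0, 0.4],
       [0.1, 0.4, 1.0]]
scores = pacsum_centrality(sim, lam1=0.3, lam2=0.7)  # favours earlier sentences here
```

With lam2 > lam1, edges pointing forward in the document count more, which is how PacSum encodes position information into centrality.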

Sentential Pre-training
Pre-training of PLMs such as BERT and GPT is crucial for identifying meaningful sentences in downstream summarization tasks. The previously mentioned graph-based summarization methods, such as PacSum (Zheng and Lapata, 2019), FAR (Liang et al., 2021), and DASG (Liu et al., 2021), utilize pre-trained BERT representations for sentence ranking. STAS (Xu et al., 2020) takes a different approach by pre-training a Transformer-based LM to estimate sentence importance. However, STAS is not plug-and-play and requires a separate pre-training model for each downstream task.
Despite the success of the aforementioned unsupervised extractive summarization methods, a gap remains between the PLMs' pre-training and the downstream sentence ranking methods. Additionally, low-quality representations can result in incomplete or less informative summaries, negatively affecting their quality. Pre-training models typically produce generic semantic representations instead of summary-worthy ones, which can result in suboptimal performance in unsupervised summarization tasks.

Methodology
In what follows, we describe our pre-training model Bi-GAE (shorthand for Bipartite Graph Pre-training with Graph Convolutional Auto-Encoders) used for unsupervised extractive summarization. We first introduce bipartite graph encoding and the pre-training procedure of Bi-GAE. Ultimately, we utilize the pre-trained sentence representations for downstream unsupervised summarization.

Document as a Bipartite Graph
Formally, we denote the constructed bipartite word-sentence graph G = {V, A, E, X}, where V = V_w ∪ V_s. Here, V_w denotes the |V_w| = n unique words of the document, and V_s corresponds to the |V_s| = m sentences in the document. A = {e_{1,1}, ..., e_{i,j}, ..., e_{n,m}} defines the adjacency relationships among nodes, where e_{i,j} ∈ A indicates the edge weight from source node i to target node j. X ∈ R^{(n+m)×d} is a matrix containing the representations of all nodes. The node representations are iteratively updated by aggregating summary-worthy features (intra-sentential and inter-sentential messages) between word and sentence nodes via the bipartite graph auto-encoder.
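A minimal sketch of constructing the adjacency part of such a bipartite graph, assuming naive whitespace tokenization and binary edges (the pre-training procedure later replaces these with centrality-based weights):

```python
# Rows index the n unique words, columns the m sentences; entry (i, j) is 1
# when word i occurs in sentence j. Whitespace tokenization is an assumption
# made for illustration only.
def build_bipartite(sentences):
    vocab = sorted({w for s in sentences for w in s.split()})
    index = {w: i for i, w in enumerate(vocab)}
    A = [[0] * len(sentences) for _ in vocab]
    for j, s in enumerate(sentences):
        for w in set(s.split()):
            A[index[w]][j] = 1
    return vocab, A

vocab, A = build_bipartite(["a fan hugs messi", "messi hugs back"])
# "hugs" and "messi" are shared word nodes; "fan" and "back" are unique.
```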

Bipartite Graph Pre-training
We reform the original VGAE (Kipf and Welling, 2016) pre-training framework by optimizing edge weight prediction in bipartite graphs. The pre-training optimizer learns to fit the input weighted adjacency matrix to the reconstructed adjacency matrix in a typical self-supervised manner. By integrating an intra-sentential GCN_intra and an inter-sentential GCN_inter in the VGAE self-supervised framework, our pre-training method enables effective aggregation of intra-sentential and inter-sentential information, allowing high-level summary-worthy features to be represented during bipartite graph pre-training.

Bipartite Graph Initializers. Let X_w ∈ R^{n×d_w} and X_s ∈ R^{m×d_s} represent the input feature matrices of the word and sentence nodes respectively, where d_w and d_s are the dimensions of the word embedding vector and the sentence representation vector respectively. We first use convolutional neural networks (CNN) (LeCun et al., 1998) with different kernel sizes to capture the local n-gram feature S^C_i for each sentence, and then use a bidirectional long short-term memory (BiLSTM) (Hochreiter and Schmidhuber, 1997) layer to get the sentence-level feature S^L_i. The concatenation of the CNN local feature and the BiLSTM global feature, [S^C_i ; S^L_i], is used as the initialized feature of the sentence node. The initialized representations are used as inputs to the graph auto-encoder module.

Bipartite Graph Encoder. To model summary-worthy representations, we encode the bipartite graph by concatenating an intra-sentential GCN_intra and an inter-sentential GCN_inter, in which the two GCNs provide two encoding channels for aggregating intra-sentential distinctive features and inter-sentential cohesive features. The GCN_intra(H^0 = X, A_weight, Θ) can be seen as a form of message passing that aggregates intra-sentential distinctive features. The first GCN_intra layer generates a lower-dimensional feature matrix. Its node-wise formulation (a standard degree-normalized weighted GCN update, where d_i and d_j are node degrees) is given by:

h_i^{(l+1)} = σ( Σ_{j∈N(i)} (e_{i,j} / √(d_i d_j)) W^{(l)} h_j^{(l)} ),
where e_{i,j} ∈ A_weight denotes the edge weight from source node i to target node j. Here we use the betweenness centrality as the edge weights. The betweenness centrality of an edge is the sum, over node pairs, of the fractions of shortest paths passing through it. The first GCN_intra layer aggregates and enlarges the features of neighbour nodes with fewer association relationships and outputs a lower-dimensional feature matrix H_intra. Then, the second GCN_intra layer generates µ_intra = GCN_µ(H_intra, A_weight) and log(σ_intra)^2 = GCN_σ(H_intra, A_weight).
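For illustration, edge betweenness can be computed by brute force on a tiny unweighted graph. This is a sketch, not the optimized routine one would use in practice, and unlike common library implementations it does not normalize the scores:

```python
from collections import deque
from itertools import combinations

def all_shortest_paths(adj, s, t):
    """Enumerate all shortest s-t paths in an unweighted graph via BFS."""
    dist, preds, q = {s: 0}, {}, deque([s])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v], preds[v] = dist[u] + 1, [u]
                q.append(v)
            elif dist[v] == dist[u] + 1:
                preds[v].append(u)
    if t not in dist:
        return []
    def back(v):  # backtrack predecessor lists into full paths
        if v == s:
            return [[s]]
        return [p + [v] for u in preds[v] for p in back(u)]
    return back(t)

def edge_betweenness(adj):
    """Sum over node pairs of the fraction of shortest paths using each edge."""
    scores = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s, t in combinations(adj, 2):
        paths = all_shortest_paths(adj, s, t)
        for p in paths:
            for a, b in zip(p, p[1:]):
                scores[frozenset((a, b))] += 1.0 / len(paths)
    return scores

# A tiny path-shaped word-sentence graph: the bridge edge w0-Sb lies on
# every path crossing between the two sides, so it gets the largest weight.
adj = {"w0": ["Sa", "Sb"], "Sa": ["w0"], "Sb": ["w0", "w1"], "w1": ["Sb"]}
scores = edge_betweenness(adj)
```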
The GCN_inter(H^0 = X, A_weight, Θ) can be seen as a form of message passing that aggregates inter-sentential cohesive features: the graph convolution operator GCN_inter aggregates and enlarges the features of neighbour nodes with more association relationships. Analogously, we can obtain µ_inter and log(σ_inter)^2, which are parameterized by the two-layer GCN_inter. Then we generate the latent variable Z as the output of the bipartite graph encoder by sampling from GCN_inter and GCN_intra and concatenating the two sampled latent variables Z_inter and Z_intra:

Z = [Z_inter ; Z_intra], with z_inter_i ~ q(z_inter_i) = N(µ_inter_i, diag((σ_inter_i)^2)) and z_intra_i ~ q(z_intra_i) = N(µ_intra_i, diag((σ_intra_i)^2)),

where q(z_inter_i) and q(z_intra_i) are given by the two GCNs and satisfy independent distribution conditions.

Generative Decoder. Our generative decoder is given by an inner product between latent variables Z. The output of our decoder is a reconstructed adjacency matrix Â, defined as:

p(Â | Z) = Π_i Π_j p(A_{i,j} | z_i, z_j), with p(A_{i,j} | z_i, z_j) = σ(z_i^⊤ z_j),

where A_{i,j} are the elements of Â and σ(•) is the logistic sigmoid function.

Edge Weight Prediction as the Pre-training Objective. We use edge weight reconstruction as the training objective to optimize our pre-trained Bi-GAE. Specifically, the pre-training optimizer learns to fit the input weighted adjacency matrix A_weight to the reconstructed adjacency matrix Â_weight. We then use the pre-trained sentence representations of Bi-GAE to replace those used in state-of-the-art unsupervised summarization backbones. This allows us to assess the effectiveness of the pre-trained sentence representations in downstream tasks.
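The inner-product decoder can be sketched numerically. In this sketch, Z stands in for the concatenated [Z_inter ; Z_intra] latent samples, and the numbers are arbitrary placeholders rather than trained embeddings:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode(Z):
    # Reconstructed adjacency: A_hat[i][j] = sigmoid(z_i . z_j).
    n = len(Z)
    return [[sigmoid(sum(a * b for a, b in zip(Z[i], Z[j]))) for j in range(n)]
            for i in range(n)]

# Placeholder latent vectors for three nodes.
Z = [[0.5, 0.1], [0.4, 0.2], [-0.3, 0.6]]
A_hat = decode(Z)  # symmetric, entries in (0, 1)
```

Because the decoder is a symmetric inner product, the reconstruction is inherently symmetric, which matches the undirected nature of the word-sentence edges.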

Downstream Tasks and Datasets
We evaluate our approach on two summarization datasets: the CNN/DailyMail (Hermann et al., 2015) dataset and the Multi-News (Fabbri et al., 2019) dataset. CNN/DailyMail comprises articles from the CNN and Daily Mail news websites, summarized by their associated highlights. We follow the standard splits and preprocessing steps used in baselines (See et al., 2017; Liu and Lapata, 2019; Zheng and Lapata, 2019; Xu et al., 2020; Liang et al., 2021); the resulting dataset contains 287,226 articles for training, 13,368 for validation, and 11,490 for testing. Multi-News is a large-scale multi-document summarization (MDS) dataset drawn from a diverse set of news sources. It contains 44,972 articles for training, 5,622 for validation, and 5,622 for testing. Following prior works (Fabbri et al., 2019; Liu et al., 2021), we create sentence discourse graphs for each document and cluster them, with each cluster yielding a summary sentence.

Pre-training Datasets
We construct a bipartite graph with word and sentence nodes, determining edge weights through graph centrality. The centrality-based weights, denoted as A_weight, serve as inputs to the Bi-GAE model. During pre-training, we use the MSE loss to measure the average squared difference between the predicted edge values Â_weight and the true input values A_weight; a lower loss indicates smaller errors between the predicted and true values. We conveniently utilize the training datasets, without their summarization labels, as the corpus to pre-train sentence representations with our Bi-GAE.
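The edge-weight reconstruction objective reduces to a mean squared error over adjacency entries; a minimal sketch:

```python
def mse(A, A_hat):
    """Mean squared error between the observed weighted adjacency and its
    reconstruction -- the self-supervised pre-training objective sketched above."""
    rows, cols = len(A), len(A[0])
    return sum((A[i][j] - A_hat[i][j]) ** 2
               for i in range(rows) for j in range(cols)) / (rows * cols)

A_true = [[0.0, 0.5], [0.5, 0.0]]
A_pred = [[0.1, 0.5], [0.5, 0.1]]
# Off by 0.1 in two of the four entries: MSE = (0.01 + 0.01) / 4 = 0.005
loss = mse(A_true, A_pred)
```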

Backbones of Summarization Approaches
There are several simple unsupervised extractive summarization frameworks, including TextRank (Mihalcea and Tarau, 2004) and LexRank (Erkan and Radev, 2004), as well as more robust graph-based ranking methods such as PacSum (Zheng and Lapata, 2019), FAR (Liang et al., 2021), and DASG (Liu et al., 2021). Graph-based ranking methods take sentence representations as input and use graph-based sentence centrality ranking for sentence selection. We now introduce the extractive summarization backbones.
• TextRank and LexRank utilize PageRank to recursively calculate node centrality based on a Markov chain model.
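A power-iteration sketch of the PageRank-style centrality these frameworks rely on, assuming a symmetric non-negative similarity matrix whose rows are normalized into transition probabilities (damping factor and iteration count are illustrative choices):

```python
def pagerank(sim, d=0.85, iters=100):
    """Power iteration over a similarity graph: each node's score is a damped
    mix of a uniform restart term and the scores of its neighbours."""
    n = len(sim)
    out = [sum(row) for row in sim]  # row sums used to normalize transitions
    r = [1.0 / n] * n
    for _ in range(iters):
        r = [(1 - d) / n + d * sum(sim[j][i] / out[j] * r[j] for j in range(n))
             for i in range(n)]
    return r

# Node 0 is strongly tied to both others, so it receives the highest rank.
sim = [[0.0, 1.0, 1.0],
       [1.0, 0.0, 0.5],
       [1.0, 0.5, 0.0]]
ranks = pagerank(sim)
```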
• PacSum (Zheng and Lapata, 2019) constructs graphs with directed edges. The rationale behind this approach is that the centrality of two nodes is influenced by their relative position in the document, as illustrated by Equation 15.
• DASG (Liu et al., 2021) selects sentences for summarization based on the similarities and relative distances among neighbouring sentences. It adds a graph edge weighting scheme to Equation 15, using a coefficient that maps a pair of sentence indices to a value calculated from their relative distance.
• FAR (Liang et al., 2021) modifies Equation 15 by applying a facet-aware centrality-based ranking model to filter out insignificant sentences. FAR also incorporates a similarity constraint between the candidate summary representation and the document representation to ensure the selected sentences are semantically related to the entire text, thereby facilitating summarization.
The main distinction among the extractive frameworks mentioned above lies in their centrality algorithms.A comprehensive comparison of these algorithms can be found in Appendix 8.

Compared Sentence Embeddings
We evaluate three sentence representations for computing sentence centrality. The first employs a TF-IDF-based approach, where each vector dimension is calculated from the term frequency (TF) of the word in the sentence and the inverse document frequency (IDF) of the word across the entire corpus of documents. The second is based on the skip-thought model (Kiros et al., 2015), an encoder-decoder model trained on surrounding sentences using a sentence-level distributional hypothesis. We utilize the publicly available skip-thought model to obtain sentence representations. The third relies on BERT (Devlin et al., 2019) or RoBERTa (Liu et al., 2019) to generate sentence embeddings. In our experiments, sentence representations initialized with BERT or RoBERTa perform poorly in the TextRank and LexRank frameworks. This could be attributed to the collapse of BERT-derived sentence representations, which yields high similarity scores for all sentences and thus fails to leverage the potential of centrality in TextRank and LexRank. However, our method surpasses BERT and RoBERTa in the FAR and DASG summarization frameworks, showcasing the effectiveness of sentence representations pre-trained by our graph auto-encoders.
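The TF-IDF baseline described first can be sketched as follows, assuming whitespace tokenization over a tiny corpus; note that a word occurring in every sentence receives zero weight (idf = log 1 = 0), which is what makes such vectors discriminative:

```python
import math
from collections import Counter

def tfidf_vectors(sentences):
    """One vector per sentence, one dimension per vocabulary word,
    weighted as tf(w, s) * idf(w) with idf(w) = log(N / df(w))."""
    docs = [Counter(s.split()) for s in sentences]
    vocab = sorted({w for d in docs for w in d})
    n = len(docs)
    idf = {w: math.log(n / sum(1 for d in docs if w in d)) for w in vocab}
    return [[d[w] * idf[w] for w in vocab] for d in docs], vocab

# "fan" appears in every sentence, so its idf (and weight) is zero.
vecs, vocab = tfidf_vectors(["messi hugs fan", "fan runs", "fan smiles"])
```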

Multi-Document Experiments
Table 2 shows the comparison on Multi-News summarization. Given that all frameworks employing our pre-trained representations outperform the First-3 baseline, our approach effectively mitigates position bias (Dong et al., 2021). This bias often results in incomplete summaries that neglect essential information located in the middle of the document.
The results demonstrate two key findings: (i) our method adeptly captures essential summary-worthy sentences, thereby consolidating the process of sentence clustering and, in turn, improving extractive accuracy; (ii) the embedded intra-sentential distinctive features and inter-sentential cohesive features are crucial for ranking significant sentences across multiple documents.

Component-wise Analysis
To understand how modelling intra-sentential and inter-sentential features contributes to sentence-word bipartite graphs, we conducted an ablation study on the CNN/DailyMail dataset. As shown in Table 3, the Bi-GAE model equipped solely with GCN_inter or solely with GCN_intra still performs reasonably, but when equipped with both, Bi-GAE yields the best results across all metrics. This highlights the importance of incorporating both intra-sentential and inter-sentential features for effective summarization: combining the two GCNs leads to complementary effects, enhancing the model's overall performance. In contrast, using only GCN_inter or only GCN_intra results in weaker performance, as it fails to capture either the semantically cohesive or the distinctive content of the document.

Effects of Pre-training Datasets
To evaluate the impact of different pre-training datasets, we test summarization frameworks using two types of representations pre-trained on distinct corpora. In Table 4 and Table 5, we observe that pre-training on the Multi-News dataset causes minimal performance degradation or limited changes in CNN/DailyMail summarization, and vice versa; the similarity between the two news corpora leads to consistent results in downstream tasks.

Density Estimation of Summarization
Three measures (density, coverage, and compression) were introduced by Grusky et al. (2018) and Fabbri et al. (2019) to assess the extractive nature of an extractive summarization dataset. In this paper, we adopt these measures to evaluate the quality of extracted summaries, as illustrated in Figure 3. Coverage (x-axis) assesses the degree to which a summary is derived from the original text. Density (y-axis) measures the extent to which a summary can be described as a series of extractions. Compression c, on the other hand, refers to the word ratio between two texts, Text A and Text B. Higher compression poses a challenge, as it necessitates capturing the essential aspects of the reference text with precision. For the detailed mathematical definitions of these evaluation measures, please refer to Appendix 8.6.
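A simplified sketch of these measures, assuming tokenized word lists and a greedy longest-match fragment extraction in the spirit of Grusky et al. (2018) (the official algorithm differs in its matching details):

```python
def fragments(summary, article):
    """Greedy extractive fragments: at each summary position, take the longest
    token span that also appears somewhere in the article; returns the
    lengths of the copied fragments."""
    lengths, i = [], 0
    while i < len(summary):
        best = 0
        for j in range(len(article)):
            k = 0
            while (i + k < len(summary) and j + k < len(article)
                   and summary[i + k] == article[j + k]):
                k += 1
            best = max(best, k)
        if best > 0:
            lengths.append(best)
            i += best
        else:
            i += 1  # novel token, not copied from the article
    return lengths

def coverage_density_compression(summary, article):
    f = fragments(summary, article)
    s = len(summary)
    # coverage: copied-token fraction; density: mean squared fragment length;
    # compression: article-to-summary length ratio.
    return sum(f) / s, sum(x * x for x in f) / s, len(article) / s

article = "a b c d e".split()
summary = "a b x d e".split()  # two copied fragments of length 2, one novel token
cov, den, comp = coverage_density_compression(summary, article)
```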
We apply these measures to quantify the level of text overlap between (i) the Oracle summary and the manual summary (subfigures (a) and (e)), (ii) the summary extracted by the BERT-based DASG and the manual summary (subfigures (b) and (f)), (iii) the summary extracted by our Bi-GAE-based DASG and the manual summary (subfigures (c) and (g)), and (iv) the summary extracted by our Bi-GAE-based DASG and the Oracle (subfigures (d) and (h)). These measures are plotted using kernel density estimation in Figure 3. Among them, subfigure (a) displays the comparison of the Oracle summary against the manual summary, which serves as the upper bound for the density and coverage distributions of the extractive compression score in extractive summarization. Subfigure (e) shows this score on the Multi-News dataset.
Comparing the extractive summary of our Bi-GAE-based DASG (DASG integrated with the sentence representation of our Bi-GAE) and the extractive Oracle summary in subfigures (a), (b), and (c), we observe variability in copied-word percentages for diverse sentence extraction in CNN/DailyMail. A lower score on the x-axis suggests a greater inclination of the model to extract fragments (novel words) that differ from standard sentences. Our model also outperforms the BERT-based DASG in compression score (0.6522) when comparing subfigures (b) and (c). Regarding the y-axis (fragment density) in subfigure (d), our model shows variability in the average length of sequences copied from the Oracle summary, suggesting varying styles of word sequence arrangement. These advantages persist on the Multi-News dataset.

Conclusion
In this paper, we introduce a pre-training process that optimizes summary-worthy representations for extractive summarization. Our approach employs graph pre-training auto-encoders to learn intra-sentential and inter-sentential features on sentence-word bipartite graphs, resulting in pre-trained embeddings useful for extractive summarization. Our model is easily incorporated into existing unsupervised summarization models and outperforms salient BERT-based and RoBERTa-based summarization methods with consistent ROUGE-1/2/L score gains. Future work involves exploring the potential of our pre-trained sentential representations for other unsupervised extractive summarization tasks and text-mining applications.

Limitations
We emphasize the importance of pre-trained sentence representations in learning meaningful representations for summarization. In our approach, we pre-train the sentence-word bipartite graph by predicting the edge betweenness score in a self-supervised manner. Exploring alternative centrality scores (such as the TF-IDF score or current-flow betweenness for edges) as optimization objectives for the MSE loss would be a viable option.
Additionally, we seek to validate the effectiveness of the sentence representations learned from Bi-GAE in other unsupervised summarization backbones and tasks.

Hyper-parameters in Pre-training

We limit the vocabulary to 50,000 and initialize tokens with 300-dimensional GloVe 840B embeddings. We filter stop words and punctuation when creating word nodes and truncate the input document to a maximum length of 50 sentences. To eliminate noisy common words, we remove 10% of the vocabulary with low TF-IDF values over the whole dataset. We initialize sentence nodes with d_s = 150. We use a batch size of 8 during pre-training and apply the Adam optimizer with a learning rate of 5e-5 for CNN/DailyMail and 2e-5 for Multi-News. The dropout rate is 0.1. The pre-training model is trained for 210,000 steps, and the warm-up step is set to 8,000. Attempts to invoke certain model interfaces in PyG revealed that using JKNet (Xu et al., 2018) or GCNII (Chen et al., 2020) as the encoder backbone in pre-training yields downstream performance essentially indistinguishable from that of GCN.

Hyper-parameters in Summarization
We begin by using Stanford NLP to split sentences and preprocess the dataset. The source text has a maximum sentence length of 512, while the summary is limited to a maximum sentence length of 140. During tuning for extractive summarization, we fine-tune the parameters related to the centrality algorithm within a narrow range of [-1.0, 2.0]. Table 6 presents the optimal hyper-parameters for each extractive summarization backbone utilizing our Bi-GAE pre-trained sentence representations. For the CNN/DailyMail dataset, we select the top-3 sentences for the summary based on the average length of the Oracle human-written summaries, whereas for Multi-News we choose the top-9 sentences.

Sentence Similarity Computation
The crucial aspect of the unsupervised graph ranking method in downstream tasks lies in the calculation of similarity between two sentences. In this regard, we examine two methods for calculating similarity, both of which draw inspiration from the similarity calculation approach utilized in PacSum (Zheng and Lapata, 2019). The first employs a pair-wise dot product to compute an unnormalized similarity matrix Ē_{ij} = v_i^⊤ v_j, and the second is the cosine similarity Ē_{ij} = cos(v_i, v_j). The final similarity matrix is then normalized.

Summarizing with higher compression is challenging, as it requires capturing more precisely the critical aspects of the article text. In our settings for the above metrics, we have expanded the comparison between summary text and article text to include: the comparison between the extracted summary and the manual summary, the comparison between the extractive Oracle and the manual summary, and the comparison between the extracted summary and the Oracle summary.
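The two similarity computations can be sketched as follows; v_i here are arbitrary sentence vectors, and the normalization step applied afterwards is omitted:

```python
import math

def dot_sim(u, v):
    """Unnormalized pair-wise similarity: the plain dot product."""
    return sum(a * b for a, b in zip(u, v))

def cosine_sim(u, v):
    """Dot product divided by both vector norms; invariant to embedding scale."""
    return dot_sim(u, v) / (math.sqrt(dot_sim(u, u)) * math.sqrt(dot_sim(v, v)))

u, v = [1.0, 2.0], [2.0, 4.0]  # parallel vectors: cosine 1.0, dot 10.0
```

The choice matters in practice: the dot product rewards longer (larger-norm) sentence vectors, while the cosine compares directions only.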

Figure 1: The structure of the bipartite sentence-word graph. Sentences connect with unique word nodes (monopolized by a single sentence node) and common word nodes (shared by multiple sentence nodes).

Figure 2: Overall architecture of our pre-training model Bi-GAE. We construct a sentence-word bipartite graph to optimize both distinctive intra-sentential and cohesive inter-sentential nodes by predicting the word-sentence edge centrality scores using a self-supervised graph auto-encoder.
The loss function of the bipartite graph pre-training has two parts. The first part is the MSE loss, which measures how well the pre-training model reconstructs the structure of the bipartite graph. The KL term works as a regularizer, as in the original VGAE, with the Gaussian prior p(Z) = N(0, I).

Figure 3: Density and coverage distributions of extractive compression scores on the CNN/DailyMail (subfigures (a), (b), (c), (d)) and Multi-News (subfigures (e), (f), (g), (h)) datasets. Each box represents a normalized bivariate density plot, showing the extractive fragment coverage on the x-axis and density on the y-axis. The top left corner of each plot shows the number n of texts and the median compression ratio c between text A and text B. comp(A, B) denotes that text A is compared against reference text B. comp(O_r, M_u): the Oracle and the manual summary. comp(B_e, M_u): the extracted summary of the BERT-based DASG and the manual summary. comp(B_i, M_u): the extracted summary of our Bi-GAE-based DASG and the manual summary. comp(B_i, O_r): the extracted summary of our Bi-GAE-based DASG and the Oracle.

Figure 4: Verification results of edge prediction accuracy during Bi-GAE pre-training on the CNN/DailyMail and Multi-News corpora.
* Jianxin Li is the corresponding author.

Table 1: ROUGE F1 performance of single-document extractive summarization on the CNN/DailyMail. ♭ is reported in Xu et al. (2020), † is reported in Zheng and Lapata (2019), and ‡ is reported in Liang et al. (2021). * means our careful re-implementation due to the absence of publicly accessible source code for these methods or the experiment being missing from the published paper. The best results are in bold.
Single-Document Experiments

Our results on the CNN/DailyMail are summarized in Table 1. The Oracle upper bound extracts gold-standard summaries by greedily selecting sentences.

Table 2: ROUGE F1 performance of multi-document extractive summarization on the Multi-News. • is reported in Fabbri et al. (2019), † is reported in Li et al. (2020), and ‡ is reported in Wang et al. (2020). * means our careful implementation due to the absence of publicly accessible source code for these methods or the experiment being missing from the published paper. The best results are in bold.

Table 3: ROUGE F1 performance of the extractive summarization ablation. The pre-trained encoder in our Bi-GAE is equipped with one kind of GCN (GCN_inter or GCN_intra). FAR and DASG are the two extractive frameworks tested on the CNN/DailyMail dataset. The pre-training corpus is also the downstream CNN/DailyMail dataset without summarization labels.

Table 4: ROUGE F1 performance of Bi-GAE on the downstream CNN/DailyMail summarization, where Bi-GAE is pre-trained on the Multi-News dataset.

Table 5: ROUGE F1 performance on the downstream Multi-News extractive summarization, in which the model is pre-trained on the CNN/DailyMail dataset.