Discourse Understanding and Factual Consistency in Abstractive Summarization

We introduce a general framework for abstractive summarization with factual consistency and distinct modeling of the narrative flow in an output summary. Our work addresses current limitations of models for abstractive summarization that often hallucinate information or generate summaries with coherence issues. To generate abstractive summaries with factual consistency and narrative flow, we propose Cooperative Generator-Discriminator Networks (Co-opNet), a novel transformer-based framework where the generator works with a discriminator architecture to compose coherent long-form summaries. We explore four different discriminator objectives which each capture a different aspect of coherence, including whether salient spans of generated abstracts are hallucinated or appear in the input context, and the likelihood of sentence adjacency in generated abstracts. We measure the ability of Co-opNet to learn these objectives with arXiv scientific papers, using the abstracts as a proxy for gold long-form scientific article summaries. Empirical results from automatic and human evaluations demonstrate that Co-opNet learns to summarize with considerably improved global coherence compared to competitive baselines.


Introduction
Generating summaries with coherent discourse structure and domain knowledge awareness poses a challenge for current methods in summarization. Generative models can commonly produce high-quality text (Figure 1), but fail to understand finer-grained details of coherence such as the structure and flow of a narrative. In addition, they often generate factually incorrect content. Prior work on factuality in abstractive summarization has found that current models can hallucinate information more than 70% of the time when generating summaries of news articles (Maynez et al., 2020).

[Figure 1: Generated abstracts for a biology article (from the Bio subset of our arXiv dataset). Abstracts are ranked from most (top) to least likely (bottom) using the generator model; the listed generator scores are -13.873, -13.213, and -12.883. Abstracts with better narrative structure and domain-specific content (such as the circled abstract) are often out-ranked in terms of likelihood by abstracts with factual errors and less structure. The three generated abstracts shown are: "Stochastic birth-death-immigration models are widely used in biology and ecology to study population dynamics. In this paper, we introduce a new formalism for describing..."; "We consider the evolution of multispecies populations in which each individual is assigned a species. The model is based on birth-deathimmigration (bdiy), and we assume..."; "We study the evolution of stochastic models for the evolution of multispecies populations, where each species is a clone of a parent species..."]
To address these issues, we focus our study on generating abstractive summaries with factuality and narrative flow. Given an input document, the goal is to generate a paragraph-length abstractive summary with proper discourse structure that contains factually correct claims. Our study builds on and extends previous work that focuses on either extractive document-level summarization (Nenkova and McKeown, 2012; Allahyari et al., 2017) or abstractive sentence-level summarization (Rush et al., 2015; Grusky et al., 2019; Narayan et al., 2018).
In pursuit of this goal, we introduce Cooperative Generator-Discriminator Networks (Co-opNet), a framework for abstractive summarization that considers subtle aspects of fact-checking and discourse necessary for coherent text generation. In this framework, the generator, a transformer language model fine-tuned for abstractive summarization, proposes a pool of candidate summaries ( §2). The discriminator, also transformer-based, scores the factuality or discourse quality of candidate summaries using one of four different objectives: the overlap between a scientific article introduction and predicted fact-checking evidence spans in generated summaries, the ordering of predicted discourse roles, the coverage of predicted discourse roles, or the likelihood of adjacency between generated sentences ( §3). The best summary is chosen cooperatively by combining the generator and discriminator scores ( §4).
Most previous works on abstractive document-level summarization have difficulty in directly modeling or evaluating narrative flow and factuality in generated summaries. This weakness is largely due to the inherent limitations of existing datasets, such as the CNN/DailyMail dataset (Hermann et al., 2015). The reference summaries available in these commonly used resources are mainly headlines of news articles or stories. As a result, they are often sets of disconnected sentences that are highly extractive, leading to models that are also extractive (Hoang et al., 2019), rather than abstractive.
In order to address these data challenges, we test our summarization model on a set of arXiv scientific papers. Scientific abstracts are ideal for modeling narrative flow as they are structured with highly coherent discourse flow. They also maintain implicit abstractive alignments with respect to the introduction of the article -in contrast to the tight, extractive alignments of current models. Scientific article summarization is also a task where factuality is more well-defined than in other domains like story summarization which leave more room for interpretation.
Comprehensive empirical results considering both automatic and human evaluations demonstrate that Co-opNet learns to summarize scientific articles from three domains with considerably improved global coherence compared to competitive baselines ( §6). We also demonstrate that the framework is generalizable to multiple coherence objectives, and effective at generating scientific abstracts that are more factually consistent.

Generator Networks
We use the transformer architecture of Radford et al. (2019) as our generator's architecture. Following the work of Liu et al. (2018), we adapt a language model to the task of abstractive summarization by concatenating the article a, a delimiter token [SEP], the summary s, and an end token [END] into one input vector X = (a_1, ..., a_{|a|}, [SEP], s_1, ..., s_{|s|}, [END]), where |a| is the length of the gold article and |s| is the length of the gold summary.
At each time step i, the model produces an output probability distribution over the vocabulary for the next token w_i given all previous output tokens w_{<i}. For any arbitrary token w_j preceding w_i, the per-layer representation of that token is computed in the following way:

h_j^0 = W_e w_j + p_j    (1)
h_j^l = block(h_{≤j}^{l-1}),  l = 1, ..., L

where block refers to each transformer block composed of multi-headed attention, a feedforward network and layer normalization, W_e is a word embedding matrix, p_j is the position embedding, h_j^0 is the initial representation, h_j^l is the block output for an arbitrary layer l, and h_{≤j}^{l-1} is the set of all block outputs from the preceding layer for positions up to j. Finally, for the current position i in the sequence, we compute a distribution over the output vocabulary as follows:

P(w_i | w_{<i}) = softmax(W_e h_{i-1}^L)

where W_e is the same embedding matrix as in Equation 1 and h_{i-1}^L is the final layer transformer block output.
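The input construction for the generator can be sketched in a few lines. This is a minimal illustration: the word-level token lists stand in for the WordPiece pipeline, and the 800/200 truncation limits are taken from the experimental setup.

```python
# Sketch of the generator's input formatting: concatenate article tokens,
# a delimiter, summary tokens, and an end token into one sequence
# X = (a_1, ..., a_|a|, [SEP], s_1, ..., s_|s|, [END]).
# Token names mirror the paper's delimiters; tokenization is simplified.

SEP, END = "[SEP]", "[END]"

def build_lm_input(article_tokens, summary_tokens, max_article=800, max_summary=200):
    a = article_tokens[:max_article]   # truncate long articles
    s = summary_tokens[:max_summary]   # truncate long summaries
    return a + [SEP] + s + [END]
```

At test time the same construction stops after the delimiter, i.e. `article_tokens + [SEP]`, and decoding continues from there.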

Discriminator Networks
Because summarization models are prone to narrative flow and factual consistency issues (Kryściński et al., 2020; Xu et al., 2020), we use a discriminator to score generated summaries for discourse and factuality properties. Due to the challenge of explicitly defining discourse and factuality properties as scores, these properties are approximated using parameterized scoring functions.
These scoring functions determine if generated text demonstrates discourse and factuality properties in three ways: (1) predicting the discourse role of sentences within a full summary, (2) predicting the likelihood of adjacency given a sentence pair, and (3) measuring the presence of salient facts in the generated summary from the original input context. While our discriminators focus on these three properties, we note that this framework is generalizable and could be extended to include other  discriminator models that encourage different communicative norms associated with high-quality language generation.

Discourse
We explore different discriminator architectures as additional discourse scoring functions during the generator's decoding process. For these discriminators, we generally score discourse in two ways. First, we use inferred sentence-level scientific abstract discourse role labels defined by Cohan et al. (2019) and predict them using a sequence classifier based on SciBERT (Beltagy et al., 2019).
Using these predictions, we score the discourse properties of the abstract relative to their coverage ( §3.1.1) or ordering ( §3.1.2). Second, we learn a function that can score the likelihood that sentences within generated abstracts should be adjacent to one another ( §3.1.3).

Coverage
We measure the completeness of the narrative structure within a scientific abstract by defining the following coverage score:

f_cov = D_abs / D_all

where D_abs is the number of unique discourse roles appearing in an abstract and D_all is the total number of possible discourse roles. This objective rewards abstracts whose sentences cover the full range of discourse roles.
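The coverage score can be sketched directly; the role label set below is an illustrative stand-in for the label inventory of Cohan et al. (2019), not the exact set used in our experiments.

```python
# Minimal sketch of the coverage score: the fraction of possible discourse
# roles that appear in an abstract (D_abs / D_all).
# ALL_ROLES is a hypothetical label set for illustration.

ALL_ROLES = {"BACKGROUND", "OBJECTIVE", "METHOD", "RESULT", "OTHER"}

def coverage_score(sentence_roles):
    """D_abs / D_all: unique predicted roles over all possible roles."""
    d_abs = len(set(sentence_roles) & ALL_ROLES)
    return d_abs / len(ALL_ROLES)
```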

Ordering
We also score the order in which discourse labels appear in generated abstracts. In Table 1, we hard-code valid orderings of discourse labels for generated sentences based on each of the abstract discourse roles of Cohan et al. (2019). If the ordering for two adjacent sentences in the abstract, O(s_{i-1}, s_i), is valid, the score for the ordering is 1 (-1 otherwise). We sum the scores for all the orderings within a particular abstract and normalize between 0 and 1 (as described by the normalization function f_n):

f_ord = f_n( Σ_i O(s_{i-1}, s_i) )

We also impose a rule that s_1 = 'BACKGROUND' and a rule that s_{|S|} = 'RESULT' to encourage more natural orderings.
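A sketch of the ordering score follows. The VALID_NEXT transition table is a hypothetical stand-in for the hard-coded orderings in Table 1, and the normalization shown is one plausible min-max form of f_n.

```python
# Sketch of the discourse ordering score: +1 for each valid adjacent role
# pair, -1 otherwise, then min-max normalize the sum into [0, 1].
# VALID_NEXT is illustrative; the real Table 1 orderings may differ.

VALID_NEXT = {
    "BACKGROUND": {"BACKGROUND", "OBJECTIVE", "METHOD"},
    "OBJECTIVE": {"METHOD", "RESULT"},
    "METHOD": {"METHOD", "RESULT"},
    "RESULT": {"RESULT"},
}

def ordering_score(roles):
    """Normalized sum of pairwise ordering scores O(s_{i-1}, s_i)."""
    pairs = list(zip(roles, roles[1:]))
    if not pairs:
        return 1.0
    raw = sum(1 if b in VALID_NEXT.get(a, set()) else -1 for a, b in pairs)
    # raw ranges over [-(n-1), n-1] for n sentences; map to [0, 1]
    return (raw + len(pairs)) / (2 * len(pairs))
```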

Adjacency Classification
To model the likelihood of adjacency between two sentences s_u and s_v, we first compute a hidden representation of the sentence pair using SciBERT. The encoder input is the concatenation of the sentences:

X = ([CLS], s_u, [SEP], s_v)

where [CLS] is a special token associated with the task and [SEP] is a sentence delimiter token. Each word in the sequence is encoded by a word embedding w_i and positional embedding p_i and passed through the SciBERT model to yield h_cls, the output state at the position of the [CLS] token. We then obtain the probability of adjacency between the sentences by a linear projection of h_cls followed by a sigmoid activation:

P_adj(s_u, s_v) = σ(W_adj h_cls + b)

We define the training objective for the adjacency discriminator to minimize the negative log likelihood of predicting whether two sentences are adjacent or not:

L_adj = -Σ_s [ δ_adj(s) log P_adj(s) + (1 - δ_adj(s)) log(1 - P_adj(s)) ]

where δ_adj(s) is an indicator function for whether the two sentences in s are adjacent. We note that while the discourse discriminators mainly focus on narrative structure, they may also capture context-aware aspects of factuality and content selection.
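The adjacency scoring head and its per-example loss can be sketched as follows. Here h_cls is treated as a plain feature vector rather than an actual SciBERT output, and the projection weights are illustrative.

```python
# Schematic adjacency scorer: a linear projection of the [CLS]
# representation followed by a sigmoid, plus the binary NLL loss.
# h_cls, w, and b are stand-ins for the learned SciBERT state and
# projection parameters.

import math

def adjacency_prob(h_cls, w, b):
    """P_adj = sigmoid(w . h_cls + b)."""
    logit = sum(wi * hi for wi, hi in zip(w, h_cls)) + b
    return 1.0 / (1.0 + math.exp(-logit))

def adjacency_loss(p, is_adjacent):
    """Negative log likelihood of the binary adjacency label delta_adj."""
    return -math.log(p) if is_adjacent else -math.log(1.0 - p)
```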

Factuality and Faithfulness
To measure the factuality of generated summaries, we predict which tokens in the summary are likely to belong to a fact-checking evidence span (i.e., a span of the text used to prove a scientific claim) using a fine-tuned BERT token classification model (see Appendix A.4 for details). Recent work has shown that inspecting attention weights alone is not necessarily a reliable metric for determining the saliency of particular aspects of the input context to the output of neural models (Serrano and Smith, 2019). The saliency weights representing the likelihood of tokens belonging to evidence spans provide us with a more explicit representation of factual importance. We obtain proxy saliency labels for the importance of a particular token t appearing in an abstract using a BERT model trained on evidence spans annotated for scientific fact-checking (Wadden et al., 2020). Specifically, if t is not a stopword and t ∈ E, where E is an evidence span used to check a scientific claim, then we assign a label of 1 to t. Otherwise, the label for t is 0. Examples of extracted spans are given in Table 2.
We compare the predicted evidence spans against information presented in the original introduction to capture the degree to which generative models are hallucinating information.
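The proxy labeling rule above can be sketched directly; the stopword list below is a tiny illustrative stand-in for a real stopword inventory.

```python
# Sketch of the proxy saliency labeling rule: a token receives label 1
# iff it is not a stopword and falls inside a fact-checking evidence
# span E, else 0. STOPWORDS is a small hypothetical list.

STOPWORDS = {"the", "a", "an", "of", "in", "we", "is", "are", "to"}

def saliency_labels(tokens, evidence_tokens):
    """Binary label per token for evidence-span membership."""
    evidence = set(evidence_tokens)
    return [int(t not in STOPWORDS and t in evidence) for t in tokens]
```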
Factuality Objective At inference time, we compare the extracted salient spans, F(g), of the generated summary g against the set of all n-grams in the article input context, N(a), measuring the degree to which salient spans are hallucinated:

f_fact(g) = |F(g) ∩ N(a)| / |F(g)|

where a higher score indicates that more salient spans are supported by the input context.
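One way to sketch this comparison is exact n-gram matching of extracted spans against the article; the matching criterion and the fraction-of-supported-spans form of the score are assumptions for illustration.

```python
# Sketch of the factuality score at inference time: the fraction of
# predicted salient spans F(g) that can be matched against n-grams of
# the input article N(a). Span extraction (the BERT token classifier)
# is assumed to have already run.

def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def factuality_score(salient_spans, article_tokens):
    """Fraction of salient spans supported by an article n-gram;
    unsupported spans are treated as hallucinated."""
    if not salient_spans:
        return 1.0
    supported = sum(
        1 for span in salient_spans
        if tuple(span) in ngrams(article_tokens, len(span))
    )
    return supported / len(salient_spans)
```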

Reranking with Discourse and Factuality Experts
To incorporate the discriminator objective into our summarization framework, we first generate a pool of candidate summaries from the base summarization model ( §2) using any decoding strategy (e.g., beam search or top-k sampling). Then, the discriminator is used to re-rank these candidates in conjunction with the original token-level generator scores. For example, in the case of the adjacency discriminator, we maximize the generator token-level probability of a candidate summary g, together with the average of adjacency scores for the set of sentences composing g (denoted S(g)), i.e., the probability of each sentence s_u being adjacent to the previous sentence s_{u-1} in S(g):

g* = argmax_g [ λ_gen log P_gen(g | a) + λ_disc (1/|S(g)|) Σ_{s_u ∈ S(g)} P_adj(s_u, s_{u-1}) ]

where λ_gen and λ_disc are hyper-parameters controlling the contribution of the generator and adjacency discriminator to the final predicted summary. The same procedure is followed for the other discourse and factuality objectives, replacing P_adj(s_u, s_{u-1}) with the scores from these discriminators.

Datasets
Since the focus of this work is on generating summaries with more coherent narrative flow and greater factual consistency, we concentrate on datasets requiring discourse structure to generate good summaries. Particular attributes of the discourse structure of these datasets include:

• Length of summaries → Are the summaries long enough to clearly show narrative flow properties and factual correctness?
• Abstractiveness of gold summaries → Do the summaries exhibit particular sentence-level flow, or are the summary sentences extracted highlights from the context?
ArXiv We crawled over 700K samples (472K abstracts) from scientific articles on arxiv.org.
In our experiments we primarily focus on the CS and Bio domain subsets. The task we define is to generate an abstract given an introduction, which presents a challenge to existing summarization models. This task also requires models to learn relevant domain knowledge for the scientific domain of interest and to recognize common discourse structure for papers written in that domain.
AAN Additionally, we include an existing dataset of scientific articles that focuses on papers in the NLP computer science domain. This dataset consists of a 12k-paper subset from the ACL Anthology Network (AAN; Radev et al., 2009) with extracted introduction and abstract pairs. Scientific abstracts in ArXiv and AAN have properties that are missing from existing summarization datasets based on Newswire data. For example, XSum (Narayan et al., 2018) and Newsroom (Grusky et al., 2019) summaries are generally too short to exhibit cross-sentence narrative flow. Meanwhile, CNN/DailyMail (Hermann et al., 2015) summaries are acquired by concatenating extracted highlights, which can be unrelated. Conversely, ArXiv and AAN abstracts are long enough to have multiple sentences (see Appendix A.6 for a comparison of datasets), and generally exhibit strong discourse patterns typical of scientific writing, making them ideal corpora for assessing discourse understanding in abstractive summarization. Table 3 provides details of dataset splits.

Experimental Setup
Our implementation is based on the Huggingface implementation (https://github.com/huggingface/transformers) of the BERT (Devlin et al., 2019) and GPT-2 (Radford et al., 2019) language models.
Generator We perform WordPiece tokenization for the input context and output summaries. Because of the fixed input size of the transformer language model, the input context is truncated to a maximum of 800 tokens, and summaries are truncated to a maximum of 200 tokens. We use a learning rate of 2e-5 and a batch size of 16 to fine-tune the generator. We train the base summarization transformer model for 12 epochs. All experiments are run on either a Titan-X or Quadro RTX 8000 GPU. For the adjacency discourse models, we fine-tune the discriminator using a learning rate of 2e-5, a linear warmup learning rate schedule, and a batch size of 32. All adjacency discourse discriminator models are fine-tuned for 2 epochs on a Titan-X GPU. (See the original papers for details of training the SciFact and abstract discourse models.)
The adjacency discriminator models are adapted from the Huggingface implementation of the BERT next sentence prediction classifier. We initialize the 12-layer BERT-base discriminator model with the pretrained weights of the SciBERT-uncased model, which was originally trained on 1.14 million scientific papers (Beltagy et al., 2019). Two discriminators are trained: one is fine-tuned on AAN for decoding both ArXiv CS and AAN, while the other discriminator is fine-tuned on ArXiv Bio and used exclusively for decoding that subset. We weight the generation and discriminator models equally when decoding by setting λ_gen = λ_disc = 0.5. Additional implementation details are provided in Appendices A.3 and A.4.

Experiments
We compare against extractive approaches using the Lede-3 and LexRank (Erkan and Radev, 2004) baselines. We also compare against two abstractive approaches: a 2-layer bi-LSTM sequence-to-sequence model with attention (LSTM), and a pointer-generator model (PGen; See et al., 2017). Training details of the supervised baselines can be found in Appendix A.2. In addition, we compare to a subset of our approach that only uses the generator to produce summaries, rather than the full framework. Our code/data is released here: https://github.com/skgabriel/coopnet.

Automatic Evaluation
Following previous work on summarization, we use the ROUGE metric (Lin, 2004) for automatic evaluation of generative models and Co-opNet. Specifically, we report ROUGE-1, ROUGE-2 and ROUGE-L F1 scores. To capture similarity in contextual meaning, we look at BERTScore F1 (Zhang et al., 2020a), which has been shown to more closely correlate with human judgements than other generation metrics.
Results on the AAN, CS and Bio subsets of ArXiv are shown in Table 4. Co-opNet outperforms all baselines on ROUGE-1 and ROUGE-L by a consistent margin. Notably, Co-opNet's performance is superior to the generator-only model, illustrating the importance of the discriminators for generating more coherent summaries. Interestingly, on the more domain-specific AAN subset, our model is over 12% better on ROUGE-L compared to the PGen baseline and 5.86% better than the best extractive model. Our model also outperforms the strongest baselines on BERTScore.
When we break down results for various Co-opNet architectures (see Table 6), we find that the factuality and discourse role discriminators lead to the best performance in terms of ROUGE scores with the adjacency discriminator achieving lower performance on ROUGE than the base generator. However, as shown by Table 5, the adjacency discriminator outperforms the base generator when we consider BERTScore, a more contextual evaluation metric, indicating that this generator-discriminator combination selects summaries that capture the same linguistic patterns and meaning as reference summaries without directly copying.

Human Evaluation
Since the coherence of generated text is difficult to measure with automatic metrics (Kilickaya et al., 2017; Sun et al., 2019), we conduct human evaluations to assess how the discriminator affects generation quality using pairwise model comparisons.
Setup We use four key criteria in all evaluations (abstractiveness, coherence, factuality and best overall quality), which we define as follows:

• Abstractiveness → Which abstract rewords information from the introduction instead of directly copying from the introduction?
• Coherence → Which abstract is more structured, and presents a complete and coherent story about the work done in the paper?
• Factuality → Which abstract is more factually consistent, presenting the same information that appears in the introduction and not producing hallucinated information?
• Overall → Which abstract is better overall?
We conduct human evaluations on Amazon Mechanical Turk (AMT), considering 4 different abstractive baseline model variants over 100 randomly sampled AAN test set examples. Given a gold introduction, AMT evaluators are asked to compare a corresponding abstract generated by Co-opNet against an abstract generated by a baseline or our generator model. To reduce bias, the ordering of generated abstracts is randomized and evaluators are not told that the abstracts are machine-generated. Each abstract pair is judged by three unique annotators. For each criterion, we filter to 50 abstracts based on the amount of time AMT workers spent (≥ 20 seconds) and inter-annotator agreement (at least 2/3 of annotators should agree on which abstract is best). We also prime annotators to consider subtler aspects of discourse coherence by providing examples that capture good or bad narrative flow without complete text degeneration.
We test the Co-opNet framework using both the factuality and adjacency discriminators, as these are the highest and lowest performing discriminator architectures in terms of automatic metrics on the AAN domain. We allow for ties, as Co-opNet and the generator baseline sometimes assign the highest probability to the same abstract, or generated abstracts in the candidate pool are high quality enough that there is little room for improvement.

Results
We find that Co-opNet is preferred across all criteria for all comparisons when we use the adjacency discriminator (see Table 7). When using the factuality discriminator, Co-opNet is superior to baselines in all cases except when compared on abstractiveness to the PGen model.
In particular, human evaluators prefer Co-opNet with the adjacency discriminator over baselines by over 8% on the coherence metric and by 18.12% over PGen on overall quality. Notably, the adjacency discriminator encourages more abstractiveness in generated abstracts while still maintaining higher levels of factual consistency. We also find that Co-opNet with the factuality discriminator improves coherence and overall quality in addition to factuality. However, Co-opNet generations with the factuality discriminator were found to be more extractive than abstracts generated by PGen.

Gold
We investigate mutual benefits between syntax and semantic roles using neural network models, by studying a parsing->SRL pipeline, a SRL->parsing pipeline, and a simple joint model by embedding sharing. The integration of syntactic and semantic features gives promising results in a Chinese Semantic Treebank...

PGen
In this paper, we propose a novel approach to learn syntactic and semantic role labeling models to semantic role labeling (wsd). In the first neural network models induce non-linear feature features from word and part-of-speech (pos) parsing. We show that semantic features can be used to learn...

Generator
Syntax-semantic relations play a crucial role in natural language processing. In contrast, semantic role labeling (srl) models typically rely on parser output features to improve accuracy. In this work, we propose a joint srl and syntactic parsing srl pipeline using the chinese treebank (qiu et al., 2016)...

Co-opNet (Adj)
In this paper, we explore the use of neural network models to jointly train semantic role labelers and parsers for semantic role labeling (srl). We first propose a simple neural srl model that uses a neural long short-term memory (lstm)-based parser to represent the output of an srl system...

Table 8: Example of gold and generated abstracts from the baseline Pointer-Generator + Coverage model (See et al., 2017) (PGen) and two of our proposed models, Generator and Co-opNet, on the NLP scientific domain. Coherence issues and factual errors in generated abstracts are highlighted in italics; correct terminology and transitional phrases that contribute to coherent flow by properly delineating sections of abstracts are highlighted in bold.
As shown in Table 8, generations selected by the adjacency discriminator more closely match the distribution of abstracts, while the generator sometimes favors copying from the introduction at the loss of narrative structure. For example, the generator will select a summary that opens with "we present a method for jointly solving penn treebank style empty category (e.g. figure 1)...", while the adjacency discriminator selects a summary that opens with "we present a method to jointly solve the problem of empty categories..." and does not refer to a particular figure. Both summaries are faithful to the introduction, but the discriminator-selected summary makes more sense in the context of a paper abstract.

Related Work
Narrative Flow and Factuality Modeling coherent narrative flow remains a major challenge in the field of text generation, due to the need for accurate understanding of narrative structure (Christensen et al., 2013; Nikolov et al., 2018; Holtzman et al., 2018; Qin et al., 2019; Koncel-Kedziorski et al., 2019; Gabriel et al., 2021). Early approaches to incorporating structure include the integration of explicit discourse markers into automatic summarization (Alonso i Alemany and Fuentes Fort, 2003). Recently proposed solutions include global tracking of entities (Kiddon et al., 2016; Mei et al., 2016), as well as discourse-aware attention (Cohan et al., 2018). While there has been prior work on factual consistency (Cao et al., 2018; Gao et al., 2019; Kryściński et al., 2020; Zhang et al., 2020b), these works did not focus on scientific paper summarization.
Neural Abstractive Summarization In the past, abstractive summarization models (Rush et al., 2015;Gehrmann et al., 2018) have relied upon seq2seq encoder-decoder architectures (Sutskever et al., 2014;Narayan et al., 2018;Celikyilmaz et al., 2018). Transformer models have emerged as a promising architecture for text generation and summarization (Liu et al., 2018;Hoang et al., 2019;Khandelwal et al., 2019;Zhang et al., 2019). While our model builds upon this work, it is, to our knowledge, the first transformer summarization framework to explicitly model narrative flow and scientific fact-checking across domains.

Conclusion
In this work, we introduced Cooperative Generator-Discriminator Networks, a framework for more coherent natural language generation with transformer language models through the integration of discriminators that encourage proper narrative flow and factual consistency. Through our analyses over scientific papers from ArXiv and AAN, we empirically showed that our framework selects generations that are more relevant and narratively coherent than previous approaches.

A Appendices
A.1 Additional Implementation Details

A.2 Baselines
For the sequence-to-sequence RNN model, a bi-LSTM is used to encode a given source article a and a separate decoder LSTM produces the generated summary g. At each decoding time step, the decoder attends to all the context vectors produced by the encoder as well as the maintained state from the previous decoder tokens to produce the next token in the summary.
The Pointer-Generator (PGEN + Cov) model extends the base LSTM model (LSTM + Cov) to allow tokens to be copied from the input during generation. Baselines are trained for up to 40000 steps with a batch size of 16. Following previous work, we decode from these baselines using beam search with a beam size of 4.

A.3 Generator Model
We use the 345M parameter GPT-2 model. The model is trained to minimize the negative log likelihood of the next word w_i given all preceding words:

L_gen = -Σ_i log P(w_i | w_{<i})

where w_i is the i-th token of our full input vector X, a is our article and s is our summary. At test time, X consists only of the gold article and delimiter token, (a_1, ..., a_{|a|}, [SEP]), and we decode generated summaries g starting from this input. During generation, we filter candidate summaries from the hypothesis generation pool that contain sentences longer than a fixed maximum length of 200 tokens, a clear sign of coherence deterioration. We use a candidate pool size of 30 for ATLAS and 20 for AAN.
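The candidate filtering step above can be sketched as follows; the period-based sentence splitting and whitespace token counting are naive stand-ins for the real tokenizer.

```python
# Sketch of candidate filtering: drop generated summaries containing any
# sentence longer than a fixed maximum token count, a clear sign of
# coherence deterioration. Sentence splitting is simplified.

def filter_candidates(candidates, max_sentence_tokens=200):
    def ok(summary):
        sentences = [s.strip() for s in summary.split(".") if s.strip()]
        return all(len(s.split()) <= max_sentence_tokens for s in sentences)
    return [c for c in candidates if ok(c)]
```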

A.4 Discriminator Training
Factuality Discriminator Details For the token-level classification model, we use the BERT base model with binary labels for whether or not a token should be included in a salient span. We predict labels for all spans in an abstract at once.
Order Discriminator Details We set the max length of summaries considered by the order discriminator to be 10 sentences, truncating longer summaries. Given the max length of a summary, we have a fixed number of orderings |O| that can be scored. We calculate the final score from the order discriminator based on the unnormalized sum of scores from these orderings, S, and the following normalization function f_n:

f_n(S) = (S + |O|) / (2|O|)

Sentence Selection for Discriminator Models To train an adjacency discriminator model, we use a subset of adversarial and positive sentence pair examples extracted from the training set. The sentence pairs are extracted from gold abstracts containing at least five sentences using the following approach: for a randomly selected sentence s_u from the abstract, we randomly select an adjacent sentence, s_{u-1} or s_{u+1}, as a positive example and any non-adjacent sentence s_v, with v ∉ {u-1, u, u+1}, as a negative example.
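The sentence-pair sampling procedure can be sketched as follows, assuming sentences are given as a list; the exact sampling distribution used in training is not specified here.

```python
# Sketch of adjacency training-pair extraction from abstracts with at
# least 5 sentences: for a random anchor sentence s_u, pick an adjacent
# sentence as a positive example and any non-adjacent sentence as a
# negative example. Returns (anchor, other, label) triples.

import random

def sample_pair(sentences, rng=random):
    assert len(sentences) >= 5
    u = rng.randrange(len(sentences))
    adjacent = [i for i in (u - 1, u + 1) if 0 <= i < len(sentences)]
    pos = rng.choice(adjacent)
    non_adjacent = [i for i in range(len(sentences)) if i not in (u - 1, u, u + 1)]
    neg = rng.choice(non_adjacent)
    return (sentences[u], sentences[pos], 1), (sentences[u], sentences[neg], 0)
```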
Discriminator Performance We measure the performance of discriminator models using recall, precision, accuracy and F1. Table 9 provides summary statistics of discriminator performance on the various discourse and factuality objectives. Discourse-Adj denotes the adjacency discriminators, Discourse-Abs denotes the discourse role label prediction model, and Factuality denotes the token saliency prediction model.

A.5 Details on Model Performance
Automatic results for Co-opNet selection were obtained using a context size of 800 tokens for the input, while a context size of 800 characters was used to select Co-opNet summaries for the human evaluation. The automatic results for the summaries used in the human evaluation were therefore lower than those obtained with the longer context size. Using a smaller context size leads to faster and more memory-efficient Co-opNet selection, but slightly lower overall automatic performance (while maintaining the same ordering of highest and lowest ROUGE scores across Co-opNet variants).

A.6 Comparison of Datasets
We removed duplicates and articles without abstracts from AAN. From this subset, we extract introduction and abstract pairs.

A.7 Additional Analysis
Comparison with Gold Summaries To obtain an upper-bound comparison for the human evaluation and verify the effectiveness of our human evaluation pipeline for judging the quality of abstracts, we used the same intro-abstract pairs and Mturk annotation framework as the model comparison to conduct a Turing-style evaluation. In this evaluation, we presented a Co-opNet (adj) generated abstract and a gold abstract to the annotators in a random ordering without noting whether either of the abstracts were human-written or machine-generated. We found that annotators consistently selected the gold abstract over the machine-generated abstract when considering factuality and coherence, though they found the machine-generated abstracts to be slightly more abstractive. We provide the results for this full evaluation in Table 11.