DYLE: Dynamic Latent Extraction for Abstractive Long-Input Summarization

Transformer-based models have achieved state-of-the-art performance on short-input summarization. However, they still struggle with summarizing longer text. In this paper, we present DYLE, a novel dynamic latent extraction approach for abstractive long-input summarization. DYLE jointly trains an extractor and a generator and treats the extracted text snippets as the latent variable, allowing dynamic snippet-level attention weights during decoding. To provide adequate supervision, we propose simple yet effective heuristics for oracle extraction as well as a consistency loss term, which encourages the extractor to approximate the averaged dynamic weights predicted by the generator. We evaluate our method on different long-document and long-dialogue summarization tasks: GovReport, QMSum, and arXiv. Experimental results show that DYLE outperforms all existing methods on GovReport and QMSum, with gains of up to 6.1 ROUGE, while yielding strong results on arXiv. Further analysis shows that the proposed dynamic weights provide interpretability of our generation process.


Introduction
Transformer-based (Vaswani et al., 2017) pretrained language models (PLMs), e.g., BART (Lewis et al., 2020a) and T5 (Raffel et al., 2020), have achieved state-of-the-art performance on short-text summarization. However, due to the high memory complexity of the full self-attention mechanism (Tay et al., 2020a), PLMs still struggle to handle long inputs (Rohde et al., 2021). Model efficiency and summary quality present a pair of challenges for long-input summarization (Huang et al., 2021): models need to capture information scattered across the long input while maintaining a low computational cost.

Figure 1: Graphical overview of our approach. The input is a document X and an optional query q, where each x ∈ X is a sentence, and the output is a summary y of length T.
Prior models tackled long-input summarization mostly in four ways. First, sparse attention (Child et al., 2019;Beltagy et al., 2020;Tay et al., 2020b) is used to reduce the memory complexity of Transformers so that they can attend to more tokens. Second, extract-then-generate methods extract salient texts from the input and then summarize based on the extracted texts. Extractors are either independently trained with full supervision (Zhong et al., 2021b) or optimized using reinforcement learning (Williams, 1992;Chen and Bansal, 2018;Bae et al., 2019). Third, models are proposed to divide the source text into sections (Gidiotis and Tsoumakas, 2020), which are individually summarized and combined to form a full summary. Fourth, hierarchical models (Rohde et al., 2021;Zhu et al., 2020) attempt to improve summarization by capturing sentence- or discourse-level dependencies. We elaborate on these four directions and their drawbacks in Section 5.
We believe that the extract-then-generate approach mimics the way a person would handle long-input summarization: identify important information in the text and then summarize it. This approach reduces the source input to a fixed pre-set length, which addresses the main challenge that models cannot handle input beyond a certain length limit. However, previous separately-trained extract-then-generate approaches are limited as they break contextual dependencies between extracted chunks and suffer from errors cascading from the extractor to the generator. Though various reinforcement learning techniques have been introduced to bridge the two steps, they have noticeable drawbacks (explored in Section 2.3), and we argue that the nature of long input makes this approach suboptimal.
In this paper, we propose a new approach for long-input summarization: Dynamic Latent Extraction for Abstractive Summarization (DYLE). We jointly train an extractor with a generator, using the extracted text snippets as the latent variable. For each output token, we compute its probability conditioned on each input snippet separately, and its generation probability is computed by marginalizing over all input snippets under a learned distribution. This distribution over the input snippets is conditioned on the previously generated tokens, so snippet weights are dynamically recomputed at each decoding time step.
We propose to optimize the extractor using two surrogate losses. First, we compute the extractive oracle based on the gold output using greedy search. These oracle snippets are used as targets to optimize the extractor. Second, we propose the consistency loss, which encourages the extractor to approximate the averaged dynamic weights predicted by the generator.
We conducted experiments on two long-input summarization datasets: GovReport (Huang et al., 2021) for long document summarization and QMSum (Zhong et al., 2021b) for long dialogue summarization. Our method achieves state-of-the-art results on both datasets and significantly outperforms the previously best baselines. These experiments demonstrate the generalizability of our model to multiple long-input summarization tasks of different domains. The dynamic weights in our model improve the interpretability of the generation process and help denoise the extraction by down-weighting irrelevant text snippets. Our contributions are as follows.
• We introduce dynamic latent extraction for the abstractive long-input summarization task, a new approach that better captures information in the long input, allows interpretable dynamic weights, and reduces computational complexity.
• We propose multiple auxiliary optimizations: an extractive oracle as a learning signal for the extractor, a consistency loss that bridges extraction and generation, and a hybrid training method that improves the generalizability of the extractor.
• Experimental results show that our approach achieves state-of-the-art results on two long-input summarization datasets covering documents and dialogues. We also conduct a detailed analysis of the interpretability of our model.

Our Approach
An overview of our approach is shown in Figure 1. In Section 2.1, we formulate our task and the extractor-generator framework. In Section 2.2, we introduce our parameterization of the extractor for long inputs. The extractor module is optimized with the consistency loss and the oracle loss (Sections 2.3 and 2.4). The overall training objective is summarized in Section 2.5.

Extractor-Generator Framework
We start by formulating the task of our interest. The input consists of L text snippets, X = (x_1, ..., x_L), and an optional query q for query-based summarization tasks. In long-input summarization, the number of text snippets, L, is usually very large. The output is a summary y of length T. For the dialogue summarization task, dialogue turns (utterances by each speaker) are used as snippets. For the long document summarization task, we tokenize the input into sentences and use the sentences as snippets. The goal is to learn a model that generates the sequence of summary tokens y given the input snippets X and the previously generated tokens y_{<t}:

P(y_t | q, X, y_{<t}), for t = 1, ..., T.

The extractor-generator framework is based on the assumption that salient information useful for summarization only occupies a small portion of the input, which is a sensible assumption given the long input length. Specifically, the extractor takes the query and the document as input and outputs a score s_i = E_η(q, x_i) for each text snippet x_i. Here η denotes the extractor parameters. To extract K snippets X_K from the document X, we take the K snippets with the highest scores:

X_K = top-K_{x_i ∈ X} E_η(q, x_i).  (1)

After retrieving X_K from X, the extractor-generator framework models the output probability by replacing X with X_K, i.e.,

P(y | q, X) ≈ P_θ(y | q, X_K) = ∏_{t=1}^{T} P_θ(y_t | q, X_K, y_{<t}),  (2)

where θ denotes the generator parameters. Note that the top-K operation in Eq. (1) is non-differentiable, and we do not propagate gradients through top-K; instead, we propose methods to optimize the extractor in Section 2.3 and Section 2.4.
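The hard top-K selection in Eq. (1) can be sketched in a few lines; this is an illustrative stand-in, not the paper's implementation:

```python
def extract_top_k(scores, k):
    """Return indices of the K highest-scoring snippets (the set X_K).

    scores: a list of extractor scores s_i = E_eta(q, x_i), one per snippet.
    The hard top-K selection is non-differentiable, so no gradient flows
    through this step in the paper's training scheme.
    """
    k = min(k, len(scores))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])  # keep selected snippets in document order

print(extract_top_k([0.1, 2.3, -0.5, 1.7, 0.9], 3))  # → [1, 3, 4]
```

Returning the indices in document order preserves the original snippet ordering when the selected snippets are fed to the generator.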

Extractor for Long Inputs
An interesting research question is how to design the extractor for long inputs. Limited by GPU memory, it is impractical to concatenate all snippets and encode them with a large pre-trained language model. As shown in Figure 2, we group consecutive snippets into chunks. We concatenate the query q with each chunk and compute the encoded vector for each snippet within the chunk it belongs to. We project the encoded vectors to scalar scores s_i = E_η(q, x_i) using an MLP.
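The chunked scoring procedure can be sketched as follows; `encode_chunk` and `mlp` are hypothetical stand-ins for the RoBERTa encoder and the scoring head, and the toy functions below are only for illustration:

```python
def score_snippets(query, snippets, chunk_size, encode_chunk, mlp):
    """Score every snippet by encoding fixed-size chunks of consecutive
    snippets together with the query (a sketch of Figure 2).

    encode_chunk(query, chunk) returns one vector per snippet in the chunk;
    mlp maps each vector to a scalar score s_i.
    """
    scores = []
    for start in range(0, len(snippets), chunk_size):
        chunk = snippets[start:start + chunk_size]
        # one encoder pass per chunk keeps memory bounded for long inputs
        vectors = encode_chunk(query, chunk)
        scores.extend(mlp(v) for v in vectors)
    return scores

# toy stand-ins: "encode" each snippet as its word count, score = identity
toy_encode = lambda q, chunk: [len(s.split()) for s in chunk]
toy_mlp = lambda v: float(v)
print(score_snippets("q", ["a b", "c", "d e f", "g"], 2, toy_encode, toy_mlp))
# → [2.0, 1.0, 3.0, 1.0]
```

Because each chunk is encoded independently, memory usage grows with the chunk size rather than the full input length.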

Generator with Dynamic Weights
A simple way to use the extracted snippets is to concatenate them into a single sequence and feed this sequence to a seq2seq generator. However, when applied to long-input summarization, this faces two challenges. The first challenge is that the extraction operation, i.e., top-K in Eq. (1), is non-differentiable. One approach is to adopt RL-based optimization (Chen and Bansal, 2018;Bae et al., 2019). However, this approach has drawbacks if applied to long-input summarization. Firstly, reinforcement learning over large action spaces (i.e., extracting K out of L snippets when L is very large) has high variance. Secondly, current methods for fine-tuning extract-then-abstract models with RL use either sentence-level ROUGE (Chen and Bansal, 2018) or summary-level ROUGE (Bae et al., 2019) as rewards. Using sentence-level ROUGE could select sentences with overlapping contents (Narayan et al., 2018), resulting in redundant final summaries, while using summary-level ROUGE as the training reward leads to sparse training signals, and longer inputs make this approach even harder to train. The second challenge is that we cannot interpret how the generator utilizes the extracted snippets. For example, one may want to know whether the generator is leveraging extracted information at each decoding time step. To address these challenges, we propose a generator that dynamically assigns weights to every extracted snippet at each time step. Different from the extractor scores, which are independent of the decoding time step, the generator assigns different dynamic scores at different time steps. The dynamic weights make the decoding process interpretable, and they also provide a training signal for the extractor through what we term the consistency loss.
Generator formulation The overview of the generator is shown in Figure 3. Specifically, for each extracted snippet x, the generator predicts the generation probability P_θ(y_t | q, x, y_{<t}) based on this snippet and a dynamic weight P_θ(x | q, X_K, y_{<t}) for this snippet. Without loss of generality, we assume that P_θ(· | q, x, y_{<t}) is computed by first mapping the input (q, x, y_{<t}) to a contextualized representation vector h^x_t. For Transformers (Vaswani et al., 2017) and encoder-decoder models with attention (Bahdanau et al., 2015), h^x_t is usually the model's output before the final language model head. The generation probability P_θ(y_t | q, x, y_{<t}) is computed by feeding h^x_t into the language model head. For the dynamic weight P_θ(x | q, X_K, y_{<t}), we adopt a separate MLP to map each h^x_t to a scalar logit l_x, and P_θ(· | q, X_K, y_{<t}) is defined as softmax({l_x}_{x ∈ X_K}). The generation probability is computed by marginalizing over all extracted snippets:

P_θ(y_t | q, X_K, y_{<t}) = Σ_{x ∈ X_K} P_θ(x | q, X_K, y_{<t}) · P_θ(y_t | q, x, y_{<t}).  (3)

The dynamic weight P_θ(x | q, X_K, y_{<t}) at each decoding time step t allows us to interpret how the generator utilizes the extracted snippets. For example, a larger weight on a particular snippet indicates a larger importance of that snippet at the current decoding time step. The generation loss is defined as the NLL of the gold summary:

L^θ_gen = -log P_θ(y | q, X_K),  (4)

where P_θ(y | q, X_K) is defined in Eq. (2). Here we do not propagate gradients of L^θ_gen to the extractor parameters since top-K is non-differentiable.
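The marginalization over extracted snippets can be sketched in plain Python; the toy probabilities below are illustrative, not outputs of a real language model head:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scalars."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def marginal_next_token_prob(per_snippet_probs, dynamic_logits):
    """Marginalize the next-token distribution over the K extracted
    snippets under the generator's dynamic weights.

    per_snippet_probs: K lists, each P(y_t | q, x, y_<t) over the vocabulary.
    dynamic_logits: K scalar logits l_x from the generator's weighting MLP.
    """
    weights = softmax(dynamic_logits)  # dynamic weights P(x | q, X_K, y_<t)
    vocab = len(per_snippet_probs[0])
    return [sum(w * p[v] for w, p in zip(weights, per_snippet_probs))
            for v in range(vocab)]

# two snippets, a 3-token vocabulary, equal dynamic weights
probs = marginal_next_token_prob([[0.8, 0.1, 0.1], [0.2, 0.6, 0.2]], [0.0, 0.0])
print([round(p, 2) for p in probs])  # → [0.5, 0.35, 0.15]
```

In practice the per-snippet distributions and dynamic logits would be tensors computed in one batched forward pass, but the arithmetic is the same.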

Consistency loss
We also leverage the dynamic weights to provide a training signal for the extractor, since the dynamic weight of a snippet can be interpreted as the importance of that snippet at a particular time step. We can average the dynamic weights over all decoding steps and view the averaged weight as the overall importance of the snippet. Based on this intuition, we propose what we term the consistency loss, which measures the distance between the averaged weight distribution and the extractor distribution. We want these two distributions to be close on an arbitrary subset of X. For simplicity, we take X_K as the subset and define the consistency loss as

L^η_consist = KL( (1/T) Σ_{t=1}^{T} P_θ(· | q, X_K, y_{<t}) || softmax({E_η(q, x)}_{x ∈ X_K}) ).  (5)

Note that the consistency loss is superscripted with the extractor's parameters η, which means that we do not compute gradients for the generator's parameters θ. Since we want the distributional distance to be small on an arbitrary subset of X, we do not propagate gradients through the top-K operator.
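The consistency loss can be sketched as follows; this is an illustrative reading of the loss as a KL divergence between the averaged dynamic weights and the extractor's softmax over X_K, with only the extractor side receiving gradients in the actual model:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) over two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def consistency_loss(dynamic_weights_per_step, extractor_scores):
    """Distance between the dynamic weights averaged over decoding steps
    and the softmax of the extractor scores over the K extracted snippets.

    dynamic_weights_per_step: T lists of K dynamic weights (one per step).
    extractor_scores: K scalar scores E_eta(q, x) for x in X_K.
    """
    T = len(dynamic_weights_per_step)
    K = len(extractor_scores)
    avg = [sum(step[i] for step in dynamic_weights_per_step) / T
           for i in range(K)]
    m = max(extractor_scores)
    exps = [math.exp(s - m) for s in extractor_scores]
    z = sum(exps)
    extractor_dist = [e / z for e in exps]
    return kl_divergence(avg, extractor_dist)

# when the two distributions match, the loss is zero
print(consistency_loss([[0.5, 0.5], [0.5, 0.5]], [1.0, 1.0]))  # → 0.0
```

The loss pushes the extractor toward snippets the generator actually attends to during decoding, without backpropagating through the generator.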

Leveraging Extractive Oracles
For long-input summarization, the extracted snippets X_K used during training are important for stable optimization. Instead of using the definition of X_K in Eq. (1), which is adopted at test time, we propose to leverage extractive oracles during training.
Greedy search for extractive oracles Extractive oracles denote a set of selected text snippets whose concatenation maximizes the evaluation metric given the gold summary. We implement the extractive oracle using greedy search. Specifically, we start with an empty set, and we iteratively select a snippet from the input such that the concatenation of that snippet and the already selected snippets maximizes the average of ROUGE-1, ROUGE-2 and ROUGE-L scores given the gold summary. We denote the extractive oracles as X o .
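The greedy search can be sketched as follows; `score_fn` is a stand-in for the averaged ROUGE-1/2/L scorer, and the toy coverage scorer below is only for illustration:

```python
def greedy_oracle(snippets, gold_summary, score_fn, max_oracle=None):
    """Greedily build the extractive oracle X_o: repeatedly add the snippet
    whose inclusion most improves score_fn of the selected snippets'
    concatenation against the gold summary; stop when no snippet helps."""
    selected, best = [], 0.0
    while max_oracle is None or len(selected) < max_oracle:
        gains = []
        for i in range(len(snippets)):
            if i in selected:
                continue
            candidate = " ".join(snippets[j] for j in selected + [i])
            gains.append((score_fn(candidate, gold_summary), i))
        if not gains:
            break
        score, i = max(gains)
        if score <= best:  # no remaining snippet improves the score
            break
        best = score
        selected.append(i)
    return selected

# toy score: fraction of gold tokens covered by the concatenation
def toy_score(text, gold):
    tokens = set(text.split())
    gold_tokens = gold.split()
    return sum(w in tokens for w in gold_tokens) / len(gold_tokens)

print(greedy_oracle(["a b", "c d", "e"], "a b c", toy_score))  # → [0, 1]
```

The real oracle uses the average of ROUGE-1, ROUGE-2, and ROUGE-L in place of the toy coverage score, but the greedy loop is the same.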
Hybrid training We leverage the extractive oracles to define the X_K used during training. If the number of oracles equals or exceeds K, we define X_K as the first K oracle snippets. If the number of oracles is less than K, we define X_K as the union of X_o and the top snippets ranked by the extractor that do not appear in X_o. Such hybrid training has two benefits. First, compared with the X_K defined in Eq. (1), it provides higher-quality inputs to the generator. Second, it reduces the reliance on the oracle and improves the generalizability of our model beyond the training set. Specifically, it is possible that there are other text snippets in the source input that are omitted in the greedy oracle extraction but could still help the generation. This way, hybrid training allows our model to capture a greater variety of source text snippets.
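The construction of the training-time X_K can be sketched as a short merge of oracle indices and extractor-ranked indices:

```python
def hybrid_training_set(oracle_indices, extractor_ranking, k):
    """Build the training-time X_K: take all oracle snippets first
    (truncated to K if there are K or more), then fill the remaining
    slots with the extractor's top-ranked snippets not already chosen."""
    chosen = list(oracle_indices[:k])
    for i in extractor_ranking:  # assumed sorted by extractor score, best first
        if len(chosen) >= k:
            break
        if i not in chosen:
            chosen.append(i)
    return chosen

print(hybrid_training_set([4, 2], [7, 4, 0, 9], 4))  # → [4, 2, 7, 0]
```

At test time, by contrast, X_K comes purely from the extractor's top-K scores, since no gold summary (and hence no oracle) is available.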
Oracle loss The extractive oracles X_o are used as a supervision signal for the extractor part of our model. The oracle loss L^η_oracle is the cross-entropy between the extractor's score distribution over all snippets and the extractive oracles. Formally, the oracle loss is computed as

L^η_oracle = -Σ_{x ∈ X_o} log ( e^{E_η(q, x)} / Σ_{x_i ∈ X} e^{E_η(q, x_i)} ).  (6)
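A minimal sketch of this cross-entropy, assuming the reading of Eq. (6) as the summed negative log-likelihood of each oracle snippet under the softmax of extractor scores:

```python
import math

def oracle_loss(extractor_scores, oracle_indices):
    """Negative log-likelihood of each oracle snippet under the softmax
    of the extractor scores over all L snippets in the input."""
    m = max(extractor_scores)  # shift for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in extractor_scores))
    return -sum(extractor_scores[i] - log_z for i in oracle_indices)

# with uniform scores, each of the 4 snippets has probability 1/4,
# so two oracle snippets give a loss of 2 * log(4)
print(round(oracle_loss([0.0, 0.0, 0.0, 0.0], [1, 2]), 4))  # → 2.7726
```

Lowering this loss raises the extractor scores of the oracle snippets relative to all others.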

Training Objective
The overall training objective of our method is

L = λ_g L^θ_gen + λ_o L^η_oracle + λ_c L^η_consist,  (7)

where λ_g, λ_o, and λ_c are hyperparameters that balance the loss components. Gradients are computed only for the superscripted parameters. Specifically, the extractor is solely optimized with the consistency loss and the oracle loss, and the generator is solely optimized with the generation loss.
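The combined objective is a simple weighted sum; the lambda values below are illustrative defaults, not the paper's tuned hyperparameters:

```python
def total_loss(gen_loss, oracle_loss, consistency_loss,
               lambda_g=1.0, lambda_o=1.0, lambda_c=1.0):
    """Weighted sum of the three loss terms. In the full model the
    generator receives gradients only through gen_loss, and the extractor
    only through oracle_loss and consistency_loss."""
    return (lambda_g * gen_loss
            + lambda_o * oracle_loss
            + lambda_c * consistency_loss)

print(total_loss(2.0, 1.0, 0.5, lambda_c=2.0))  # → 4.0
```

In a framework like PyTorch, the gradient routing would be enforced by detaching the generator's dynamic weights in the consistency term and by not backpropagating the generation loss through the hard top-K selection.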

Datasets and Baselines
QMSum (Zhong et al., 2021c) is a benchmark for query-based multi-domain meeting summarization. It consists of meetings from three domains: AMI meetings, ICSI meetings, and committee meetings of the Welsh Parliament and the Parliament of Canada. The meetings in this dataset comprise a large number of turns uttered by multiple speakers.
GovReport (Huang et al., 2021) is a large-scale long document summarization dataset, consisting of about 19.5k U.S. government reports with expert-written abstractive summaries. GovReport is a good benchmark as it contains significantly longer documents (9.4k words on average) and summaries (553 words on average) than other long document datasets, such as arXiv, PubMed (Cohan et al., 2018), BillSum (Kornilova and Eidelman, 2019), and BigPatent (Sharma et al., 2019). The GovReport authors also note that salient information is more scattered across the documents. A comparison of the datasets is presented in Table 1.
Baselines We use reported baselines. The baselines for QMSum come from Zhong et al. (2021a). The baselines for GovReport come from the original paper (Huang et al., 2021), which experiments with multiple encoder self-attention and encoder-decoder attention patterns on BART-large.

Implementation Details
The extractor is initialized with the pretrained RoBERTa-base model. The generator is initialized with the BART-large (Lewis et al., 2020a) model. Both models use Hugging Face implementations. Training is done with the Adam optimizer. We apply gradient checkpointing (Chen et al., 2016) to both the extractor and the generator to save memory. Each experiment is run on a single NVIDIA Quadro RTX 8000 GPU. We set the batch size to 8 (1 sample per forward pass with the gradient accumulation step set to 8). ROUGE (Lin, 2004) is used as the automatic evaluation metric throughout all experiments. We split each generated summary into sentences to obtain the full ROUGE-L scores.

Automatic Evaluation
For automatic evaluation, we report ROUGE-1, ROUGE-2, and ROUGE-L. The results are summarized in Table 2 and Table 3.
Our model DYLE achieves state-of-the-art performance on both datasets. On the GovReport dataset, we achieve a 4.15 ROUGE-1 improvement, a 6.21 ROUGE-2 improvement, and a 4.00 ROUGE-L improvement compared to the previously best model.
On the QMSum dataset, we achieve a 2.13 ROUGE-1 improvement, a 1.04 ROUGE-2 improvement, and a 1.93 ROUGE-L improvement compared to the state-of-the-art model HMNet.
These results show that DYLE can be applied to both the long document summarization and long dialogue summarization tasks.

Evaluation of Auxiliary Optimizations
We conduct detailed studies to investigate the effectiveness of the auxiliary optimizations we introduced. Specifically, we report the full model's performance after removing 1) hybrid training, 2) the consistency loss, and 3) the oracle loss. The results are summarized in Table 4. Without the hybrid training optimization, only the extractive oracles are used to train the generator.
We see that excluding any of the hybrid training, consistency loss, and oracle loss optimizations leads to a performance drop. Training the model without the supervision of the oracle leads to the greatest decrease in model performance, showing the importance of good supervision for the extractor. Removing the consistency loss also decreases the model performance.

Analysis of Extracted Snippets
We are interested in the amount of salient information passed to the generator. To investigate, we report the decomposed precision and recall of the ROUGE scores in Table 5. We observe that the extracted snippets have much higher recall than the generated summaries, and the generated summaries have higher precision. This suggests that one way to improve the overall performance is to increase the information coverage (i.e., recall) of the extractor and to more accurately identify the salient snippets (i.e., precision) in the generator.

Discussion
Capacity to Summarize Longer Input This paper demonstrates the effectiveness of latent extraction for abstractive summarization, in both long document summarization and long dialogue summarization. The first-step extraction picks out salient information from the long input, thereby greatly extending the input length that the model can handle. Previous BART-large-based baselines on GovReport, even with sparse encoder self-attention and encoder-decoder attention, are only able to process up to 10,240 tokens. In contrast, our model can handle the long input at its full length.
Interpretability of Dynamic Weights Our approach is more interpretable than sparse attention and two-step extraction-generation pipeline methods. Specifically, the dynamic weights in the generator show how the information is used throughout the decoding process. In Figure 4, we visualize the dynamic weights for the extracted snippets assigned by the generator during decoding. In each subfigure, we visualize the dynamic weight matrices of the generated summary and a random summary from other samples in the validation set. The x-axis and y-axis represent the index of the extracted top-K snippets and the decoding time step, respectively. Darker squares denote higher weights. For each generated summary, we observe multiple consecutive high-weight areas, indicating alignments between the extracted snippets and the generated summary. By contrast, weights are uniformly distributed for random summaries.
Comparison with RAG The generator of our method is related to, but differs significantly from, Retrieval-Augmented Generation (RAG) (Lewis et al., 2020b). The similarity only lies in the idea of marginalization over a set of text snippets. However, unlike our dynamic weights, the weights in RAG remain static during decoding. Specifically, using our notations, RAG decomposes the generation probability as:

P_θ(y_t | q, X_K, y_{<t}) = Σ_{x ∈ X_K} P_θ(x | q, X_K) · P_θ(y_t | q, x, y_{<t}).  (8)

The static weight P_θ(x | q, X_K) in Eq. (8) is computed based only on q and X_K, while our dynamic weight P_θ(x | q, X_K, y_{<t}) is additionally conditioned on the already generated tokens. Furthermore, RAG retrieves a set of documents, whereas our extractor extracts text snippets from the input.
Effect of K in top-K We vary the value of K in the top-K operation of Eq. (1) and test it on both the GovReport and QMSum datasets. Results are summarized in Table 6. We observe that model performance generally increases as the value of K increases. Due to the limitation of computational resources, the largest K value we tried is 25.
Extractor performance We feed the extractive oracle to the generator. The results are summarized in Table 7. We observe that the extractive oracle contains more salient information than the text snippets extracted by the extractor.
Future Directions One future direction is to adapt our approach to other long input generation tasks, such as open-domain question answering and response generation in multi-turn dialogue systems when the dialogue history is long.

Related Work
Sparse Attention Mechanism The full attention mechanism has a quadratic dependency on memory. Prior research has proposed different sparse attention mechanisms that reduce the memory cost. Longformer (Beltagy et al., 2020) uses a dilated sliding window and task-motivated global attention patterns. BigBird (Zaheer et al., 2020) treats attention patterns as a graph sparsification problem and employs sliding windows and random blocks to simplify attention complexity. Reformer (Kitaev et al., 2020) makes use of locality-sensitive hashing to reduce the memory complexity. In addition to optimizing the encoder self-attention, Huang et al. (2021) propose head-wise positional strides to reduce the cost of the encoder-decoder attention. However, sparse attention diminishes the benefits of pretraining and sacrifices parts of the receptive field.
Extract-then-abstract Method In this setting, a model first extracts salient text snippets from the input and then rewrites them abstractively to generate a concise overall summary. Most two-stage summarization approaches (Zhang et al., 2019;Lebanoff et al., 2019;Xu and Durrett, 2019;Bajaj et al., 2021) are trained separately, and thus suffer from information loss due to cascaded errors. Some approaches attempt to reduce that loss by bridging the two stages. Chen and Bansal (2018) adopt an extract-then-rewrite approach using reinforcement learning with a sentence-level policy gradient method. Bae et al. (2019) improve upon it by using a summary-level policy gradient. In addition to avoiding the drawbacks explained in Section 2.3, our model is different in that we jointly train an extract-then-abstract model for summarization using latent variables.
Divide-and-conquer Approach A common approach in long input summarization is divide-andconquer (Gidiotis and Tsoumakas, 2020;Grail et al., 2021). This approach breaks a long input into multiple parts, which are summarized separately and later combined to produce a final complete summary. However, these models do not capture the contextual dependencies across parts and need to assume a certain structure of the input (such as paper sections).
Hierarchical Models Various hierarchical models have been proposed to handle longer inputs. Cohan et al. (2018)'s model consists of a hierarchical encoder that models the document discourse structure and an attentive discourse-aware decoder to generate the summary. HAT-Bart (Rohde et al., 2021) proposes a new hierarchical attention Transformer-based architecture that attempts to capture sentence- and paragraph-level information. HMNet (Zhu et al., 2020) builds a hierarchical structure that includes discourse-level information and speaker roles. However, these models focus mainly on model performance and not on reducing the memory and computational cost.
Latent Retrieval in Open-domain QA RAG (Lewis et al., 2020b) combines a parametric memory (a Transformer) with a non-parametric memory (a Wikipedia vector index) and trains these components end-to-end in a probabilistic model. REALM (Guu et al., 2020) augments language model pretraining with a latent knowledge retriever. Both works target open-domain question answering; little attention has been given to long-input summarization using latent variables.

Conclusions
In this paper, we propose the first framework that jointly trains an extract-then-abstract model with latent extraction for long-input summarization. We demonstrate its effectiveness on the GovReport and QMSum datasets. Our model significantly outperforms the current state-of-the-art on both, while offering the advantages of processing arbitrarily long input, low memory cost, and interpretable generator weights.