Sparsity and Sentence Structure in Encoder-Decoder Attention of Summarization Systems

Transformer models have achieved state-of-the-art results in a wide range of NLP tasks including summarization. Training and inference using large transformer models can be computationally expensive. Previous work has focused on one important bottleneck, the quadratic self-attention mechanism in the encoder. Modified encoder architectures such as LED or LoBART use local attention patterns to address this problem for summarization. In contrast, this work focuses on the transformer’s encoder-decoder attention mechanism. The cost of this attention becomes more significant in inference or training approaches that require model-generated histories. First, we examine the complexity of the encoder-decoder attention. We demonstrate empirically that there is a sparse sentence structure in document summarization that can be exploited by constraining the attention mechanism to a subset of input sentences, whilst maintaining system performance. Second, we propose a modified architecture that selects the subset of sentences to constrain the encoder-decoder attention. Experiments are carried out on abstractive summarization tasks, including CNN/DailyMail, XSum, Spotify Podcast, and arXiv.


Introduction
The Transformer architecture (Vaswani et al., 2017) with large-scale pre-training has become the de facto approach for a wide range of NLP tasks, from classification (Devlin et al., 2019) to seq2seq (Raffel et al., 2020). Training and inference using large transformer models can be computationally expensive because the self-attention's time and memory grow quadratically with sequence length. Hence, there has been significant interest in efficient transformer architectures. A number of approaches have been proposed to tackle the quadratic complexity, and a comprehensive survey on efficient transformers has been compiled by Tay et al. (2020). Most existing approaches are developed for encoder-only architectures. For seq2seq tasks, efficient models such as BigBird (Zaheer et al., 2020) or LED (Beltagy et al., 2020) pair an efficient encoder with the vanilla decoder. For long-document summarization, this combination has been shown to be effective because the major bottleneck is the encoder self-attention (Manakul and Gales, 2021). The attention mechanisms in the decoder consist of self-attention and encoder-decoder attention. Techniques such as local attention are applicable to self-attention in both the encoder and decoder, whereas this work focuses on the encoder-decoder attention.
When humans produce a summary, the information conveyed by each word or phrase in the summary is likely drawn from a few key sentences in the original document. Inspired by this, we hypothesize that if the encoder-decoder attention is constrained dynamically to salient sentences, the computational cost can be reduced. Indeed, sentence-level structures for encoder-decoder attention have been shown to be effective in traditional RNN encoder-decoder models (Cohan et al., 2018; Li et al., 2019; Manakul et al., 2020). In this work, we first compare the decoder's cost in the training and inference stages. We study the sparsity of the encoder-decoder attention in a common transformer-based abstractive summarization model; an approximation method to exploit this sparsity is described, and an empirical upper-bound performance is given. Second, we propose a modified decoder architecture that can dynamically select salient input sentences to constrain the encoder-decoder attention without having to compute the complete attention at inference time. Techniques to train our proposed model are described and compared to the full attention baseline and the empirical upper bound.

Models and Data
Vanilla Transformers. We use BART (Lewis et al., 2020) and local-attention BART (LoBART) (Manakul and Gales, 2021) as our base models. BART's maximum input length is 1024, while that of LoBART is 4096 with an attention width of 1024. BART is fine-tuned on CNN/DailyMail and XSum, and LoBART is fine-tuned on Podcast and arXiv.
Data. CNN/DailyMail (Hermann et al., 2015) and XSum (Narayan et al., 2018) are used with BART, while long-document arXiv (Cohan et al., 2018) and Spotify Podcast (Clifton et al., 2020) are used with LoBART. More details about models, data, and training are provided in Appendix A.

Attention in the Transformer
Time and memory are dominated by the encoder self-attention, and models such as LoBART adopt local attention in the encoder to mitigate this bottleneck while keeping the original decoder (Manakul and Gales, 2021). Training is fast because attention is highly parallelizable. However, during inference the decoder consumes its own histories and becomes less parallelizable. To understand when the decoder might become a bottleneck, we fix the input length N and measure the computational time as a function of the target length M in three operating modes: i) Forward+Backward, e.g. at training time; ii) Forward only, e.g. a forward pass where the input to the decoder is provided in advance; iii) Inference, i.e. the decoder using its own generated histories as input. Through curve fitting, the results in Table 1 show that the relative decoder cost during inference is almost an order of magnitude larger than during training (forward+backward or forward only). More details are provided in Appendix B, where we also show that the encoder-decoder attention cost is greater than the decoder self-attention cost. Therefore, this work focuses on the encoder-decoder attention.

Encoder-Decoder Attention
Let M denote the summary length, N the input length, N1 the number of input sentences, and N2 the average number of words in a sentence, i.e. N = N1 N2. The standard encoder-decoder attention (scaling factor omitted) is

A = softmax(QKᵀ)V,  (2)

where Q ∈ R^{M×D} and K, V ∈ R^{N×D}, and it has complexity O(MN) = O(M N1 N2). Note that we fix the representation dimension D, so D is omitted in our complexity notation.

Table 1: Decoder cost coefficients (normalized by c1) in each operating mode.
Mode             | c2/c1 (10⁻³) | c3/c1 (10⁻⁶)
Forward+Backward | 1.08         | 0.17
Forward only     | 1.14         | 0.25
Inference        | 9.96         | 1.30
If the attention is concentrated on some r sentences, then by selecting an appropriate r, the encoder-decoder attention can be sped up by a factor of N1/r on average. This is equivalent to computing

Â = softmax(Q K̂ᵀ)V̂,  (3)

where K̂, V̂ ∈ R^{rN2×D} contain only the keys and values of the selected sentences, with complexity O(M r N2).
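To make the constrained attention concrete, here is a minimal NumPy sketch (our own illustration; the function names, shapes, and toy sizes are assumptions, not the paper's code): given a chosen subset of sentences, only their keys and values enter the softmax, so the per-query cost drops from N words to roughly r N2 words.

```python
import numpy as np

def full_attention(q, K, V):
    """Standard encoder-decoder attention for one query q over all
    N input words: O(N) score computations per decoding step."""
    s = q @ K.T
    w = np.exp(s - s.max())
    w /= w.sum()
    return w @ V

def constrained_attention(q, K, V, sent_ids, selected_sents):
    """Restrict the attention to words whose sentence index is in
    the selected subset: roughly O(r * N2) per decoding step."""
    mask = np.isin(sent_ids, selected_sents)
    K_hat, V_hat = K[mask], V[mask]   # keys/values of kept sentences
    s = q @ K_hat.T
    w = np.exp(s - s.max())
    w /= w.sum()
    return w @ V_hat

# Toy example: N1=4 sentences of N2=3 words each, D=8.
rng = np.random.default_rng(0)
N1, N2, D = 4, 3, 8
K = rng.normal(size=(N1 * N2, D))
V = rng.normal(size=(N1 * N2, D))
sent_ids = np.repeat(np.arange(N1), N2)   # word -> sentence index
q = rng.normal(size=D)

out = constrained_attention(q, K, V, sent_ids, selected_sents=[0, 2])
print(out.shape)
```

When all N1 sentences are selected, the constrained version reduces exactly to the full attention, which is a useful sanity check.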

Sparsity of Encoder-Decoder Attention
Let the subscript (i, j) denote the position of the j-th word in the i-th input sentence, e.g. K = [k_{1,1}, k_{1,2}, ..., k_{1,J_1}, k_{2,1}, ...], where J_i is the number of words in sentence i. At inference time, the outputs are generated sequentially, a_m = softmax(q_m Kᵀ)V, so the r sentences can be determined independently for each q_m. Consider the following sum of attention weights as the saliency of sentence i at decoding step m:

α^s_{m,i} = (1/Z_m) Σ_{j=1}^{J_i} exp(q_m · k_{i,j}),  (4)

where Z_m = Σ_{i'} Σ_{j'} exp(q_m · k_{i',j'}). We then compute Σ_i α^s_{m,i} over up to r sentences ranked by α^s_{m,i}. The results in Fig. 1a show that r=25 is required for the sum of attention weights to reach 90%. In addition to the vanilla model, we can fine-tune BART explicitly to make the attention sparse using:

L = L_xent + γ L_sparse,  (5)

where L_xent is the teacher-forced cross-entropy loss, L_sparse = (1/M) Σ_{m=1}^{M} H(α^s_m), and H(.) denotes entropy. Fig. 1b shows that the fine-tuned models (γ=0.1 and γ=1.0) retain close to 100% of the attention weights even for small r. Subsequently, we investigate how selecting r sentences impacts summarization performance. To obtain an empirical upper-bound performance for Eq. 3, for each q_m we can take the ideal k, v corresponding to the top r sentences ranked by α^s_{m,i}:

I^{r,Ideal}_m = top-r_i (α^s_{m,i}).  (6)

The results in Table 2 show that:
• For the vanilla model, despite the sum of attention weights being only around 50% at r=5 (Fig. 1a), the model is sufficiently sparse, and constraining to the r ideal sentences (All → I^{r,Ideal}_m) results in a small performance degradation.
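The saliency computation of Eq. 4 can be sketched as follows (a NumPy illustration under assumed toy shapes; `sentence_saliency` and the sizes are our own names): the word-level attention is computed once, summed within each sentence, and the top-r sentences are ranked by the resulting mass.

```python
import numpy as np

def sentence_saliency(q_m, K, sent_ids, n_sents):
    """alpha^s_{m,i} (Eq. 4): word-level attention weights summed
    within each input sentence."""
    s = q_m @ K.T
    w = np.exp(s - s.max())
    w /= w.sum()                   # word-level attention weights
    alpha = np.zeros(n_sents)
    np.add.at(alpha, sent_ids, w)  # accumulate weight per sentence
    return alpha

rng = np.random.default_rng(1)
N1, N2, D = 10, 4, 8
K = rng.normal(size=(N1 * N2, D))
sent_ids = np.repeat(np.arange(N1), N2)
q_m = rng.normal(size=D)

alpha = sentence_saliency(q_m, K, sent_ids, N1)
r = 3
top_r = np.argsort(alpha)[::-1][:r]   # I^r_m: top-r salient sentences
covered = alpha[top_r].sum()          # attention mass kept at this r
print(top_r, round(covered, 3))
```

The quantity `covered` is exactly the per-step sum of attention weights plotted against r in Fig. 1a.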
• Forcing sparsity (Fig. 1b) does not yield a significant performance improvement; moreover, this forcing makes the model more sensitive to random selection (results in Appendix C).
Thus, for summarization, there is an observable sparsity, which allows us to reduce the cost of encoder-decoder attention with a minimal degradation. Next, we investigate how to build an efficient form of approximator to obtain salient sentences.

Sentence-Level Structure for Encoder-Decoder Attention
In Section 3.1, we used the ideal selection I^r_m (Eq. 6), which requires computing α^s_{m,i} (Eq. 4) using all input words; this process cannot make the decoder more efficient. By exploiting the sentence structure of the document, we propose the following partition of the sentence-level attention score (Eq. 4) to allow a compact approximation:

α̂^s_{m,i} = (1/Ẑ_m) exp(f1(q_m) · f2(k_{i,1}, ..., k_{i,J_i})),  (7)

where f1 and f2 map the decoder query and the keys of sentence i into a common sentence-level space, and Ẑ_m = Σ_{i'} exp(f1(q_m) · f2(k_{i',1}, ..., k_{i',J_{i'}})). Essentially, we modify the standard encoder-decoder attention such that it performs sentence selection based on α̂^s_{m,i} (Eq. 7) and computes the subset attention Â (Eq. 3).

Complexity of Modified Attention
The modified encoder-decoder attention consists of two components: i) sentence-level attention over N1 sentences; ii) word-level attention over rN2 words. Let p denote a unit of matrix multiplication cost and q denote a unit of softmax cost. The additional cost associated with the sentence-level representation on the encoder side grows with the input length N = N1 N2. Thus, as opposed to O(M N1 N2) for the vanilla encoder-decoder attention, the overall complexity of the modified attention is O(M N1 + k_w M r N2 + k_e N1 N2), where k_w = pD + q and k_e depends on the exact form of the sentence-level representation computation.
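A back-of-the-envelope count makes the complexity switch concrete (the operating point and the constants k_w = k_e = 1 are illustrative assumptions, not measured values):

```python
# Vanilla enc-dec attention: O(M * N1 * N2) score computations.
# Modified attention: sentence-level scores O(M * N1), word-level
# attention over r sentences O(k_w * M * r * N2), plus a one-off
# sentence-representation cost O(k_e * N1 * N2) on the encoder side.
M, N1, N2, r = 144, 100, 40, 10   # illustrative operating point
k_w, k_e = 1.0, 1.0               # constants depend on the realization

vanilla = M * N1 * N2
modified = M * N1 + k_w * M * r * N2 + k_e * N1 * N2
print(vanilla, modified, round(vanilla / modified, 2))
```

At this operating point the dominant remaining term is the word-level attention over the r selected sentences, which is why keeping r small matters.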

Model-based Neural Approximator
To utilize the simple partition and sentence-level structure in Eq. 7, we use a linear mapping for f1 (Eq. 8) and, for f2, a bidirectional RNN producing the sentence-level representation (Eq. 10), combined with a linear key mapping (Eq. 9). As illustrated in Fig. 3, the base transformer model is extended with two additional modules: i) a sentence-level encoder-decoder attention computing α̂^s_{m,i} in Eq. 7; ii) a sentence encoder computing the sentence-level representation in Eq. 10. Details of the model parameters are provided in Appendix A.
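A compact sketch of the approximator (our own NumPy illustration with untrained random weights; the paper's f2 is a two-layer bi-GRU, for which a single-layer vanilla bi-RNN serves as a stand-in here):

```python
import numpy as np

rng = np.random.default_rng(2)
D, H = 8, 8   # representation and RNN hidden sizes (illustrative)

# f1: linear mapping of the decoder query into the sentence-level space
W_q = rng.normal(scale=0.1, size=(D, 2 * H))
def f1(q_m):
    return q_m @ W_q

# f2: bidirectional RNN over the keys of one sentence (vanilla RNN
# here as a stand-in for the paper's two-layer bi-GRU)
W_x = rng.normal(scale=0.1, size=(D, H))
W_h = rng.normal(scale=0.1, size=(H, H))

def run_rnn(X):
    h = np.zeros(H)
    for x in X:
        h = np.tanh(x @ W_x + h @ W_h)
    return h

def f2(K_i):
    # concatenate forward and backward final states -> (2H,)
    return np.concatenate([run_rnn(K_i), run_rnn(K_i[::-1])])

def approx_saliency(q_m, sentences):
    """alpha-hat^s_{m,i} (Eq. 7, sketch): softmax over sentence-level
    dot products f1(q_m) . f2(K_i), no word-level attention needed."""
    q_t = f1(q_m)
    k_t = np.stack([f2(K_i) for K_i in sentences])
    s = k_t @ q_t
    w = np.exp(s - s.max())
    return w / w.sum()

sentences = [rng.normal(size=(J, D)) for J in (3, 5, 4)]
alpha_hat = approx_saliency(rng.normal(size=D), sentences)
print(np.round(alpha_hat, 3))
```

Note that the per-sentence representations f2(K_i) depend only on the encoder output, so they are computed once and reused at every decoding step.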

KL Loss and Integrated Training
Let θ_dec denote the original decoder and θ̃ the neural approximator. We train θ̃ by minimizing the KL loss:

L_KL = (1/M) Σ_{m=1}^{M} KL(α^s_m || α̂^s_m).

In addition, we can integrate θ_dec into the training process. With the teacher-forced cross-entropy loss L_xent, we define the integrated training loss:

L_I = L_xent + λ L_KL.

We cannot optimize L_I in an end-to-end fashion because the top-r operation in Eq. 6 is not differentiable. Hence, we interleave the training, i.e. we update θ_dec with θ̃ fixed, and we update θ̃ using L_KL only. Because during training we compute both α^s_{m,i} (ideal) and α̂^s_{m,i} (approximated), we can use either in the top-r selection. Also, inspired by scheduled sampling (Bengio et al., 2015), we try mixing them: α^s_{m,i} with probability 1 − step/epoch_size, and α̂^s_{m,i} otherwise.
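The KL objective and the scheduled mixing rule can be sketched as follows (a NumPy illustration; the function names and the Dirichlet toy distributions are our own assumptions):

```python
import numpy as np

def kl_div(p, q, eps=1e-12):
    """KL(p || q) between the ideal and approximated sentence-level
    attention distributions at one decoding step."""
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)))

def l_kl(alpha, alpha_hat):
    """L_KL: the KL divergence averaged over the M decoding steps."""
    return float(np.mean([kl_div(a, ah) for a, ah in zip(alpha, alpha_hat)]))

def pick_selection_source(step, epoch_size, rng):
    """Scheduled mixing: use the ideal alpha^s with probability
    1 - step/epoch_size, otherwise the approximator's alpha-hat^s."""
    return "ideal" if rng.random() < 1.0 - step / epoch_size else "approx"

rng = np.random.default_rng(3)
M, N1 = 4, 6
alpha = rng.dirichlet(np.ones(N1), size=M)      # "ideal" distributions
alpha_hat = rng.dirichlet(np.ones(N1), size=M)  # approximator outputs
print(round(l_kl(alpha, alpha_hat), 4))
```

At step 0 the mixing always uses the ideal selection, and by the end of the epoch it always uses the approximator, mirroring the scheduled-sampling curriculum.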

System Performance
In Table 3, we compare random sentence selection, the KL-only approximator, and integrated training against the ideal selection (I^{r,Idl}_m) obtained by Eq. 6. The results show that the KL-only system clearly outperforms the random selection baseline, and that the remaining performance degradation of the KL-only system can be reduced by our integrated training. These results verify the effectiveness of our modified decoder that attends to a subset of sentences. Table 3 also shows that it is best to use I^{r,Apx}_m as the reference in integrated training, likely because integrated training is initialized from the KL-only model. In addition, we apply the modified architecture to BART trained on XSum (≤1k words), and to LoBART trained on Podcast and arXiv (≤4k words). The results in Fig. 2 confirm that the performance of our proposed method converges to that of the full attention baseline across all models and datasets.
In addition, r* ≈ 5, 10, 30, 30, respectively. Although XSum has fewer sentences on average than CNNDM, r*_XSum > r*_CNNDM because XSum is more abstractive. For the longer summarization tasks, Podcast and arXiv, the performance degradation appears larger, meaning that constraining to salient sentences is more challenging for longer tasks and a larger r is required.

Sensitivity of r in Integrated Training
We train BART in three settings, r_train = 2, 5, 10, and show the performance relative to the model with r_train = 5 in Fig. 4. The results show that setting r_train beyond r* is not necessarily beneficial, as shown by the r_train = 10 model on CNNDM, and that it is best to set r_train close to r_inference.

Further Discussion on Sentence-Level Encoder-Decoder Attention
The results in Section 4.4 demonstrate empirically that a neural network can predict this sparsity, thereby allowing sentence selection. Our framework requires adding modules to the original attention mechanism, and the real speed gain will depend on the balance between the sparsity and the computational cost of these additional modules. Consequently, the challenge is to make the additional modules highly efficient. Because the particular network realization in this paper was chosen to show the feasibility of our framework, its limitations and possible improvements are discussed below.

Limitations and Possible Improvements
The choice of an RNN for the sentence encoder in Eq. 10 incurs a large additional computational cost (i.e. a large k_e) because the cost of the RNN grows with N1 N2 D². Because the goal is only to obtain a sentence-level representation, the RNN could be replaced by a hierarchical attention over sentences, whose cost would instead grow with N1 N2 D. The additional sentence-level query and key mappings in Eq. 8 and Eq. 9 also incur a large computational cost.

Model-free Approximator
Lastly, we revisit the sentence-level attention in Eq. 4, which has so far been approximated by Eq. 7 and the model-based approximator. A more challenging alternative is a model-free algebraic approximation, which requires no training and adds little inference-time cost. We examined various forms and present one model-free approach with experimental results in Appendix D; however, its current form yields worse summarization performance than our model-based approach.

Related Work
The discrepancy between the low/moderate attention weight sparsity and the good sparse approximation could be because a considerable amount of the attention weight is assigned to special tokens, e.g. '.' in every sentence, whose vector norms are small, as observed by Kobayashi et al. (2020). The sparse attention (Eq. 3) with ideal selection (Eq. 6) can be considered a form of content selection, which has been shown to improve summarization (Gehrmann et al., 2018; Hsu et al., 2018). Recently, head-wise masks have been applied to the encoder-decoder attention at inference time, with a reported performance improvement (Cao and Wang, 2021). Voita et al. (2019) observed that attention heads are redundant, and Clark et al. (2019) found that a head in BERT rarely attends to several consecutive tokens. Based on these observations, Huang et al. (2021) apply a stride pattern in the encoder-decoder attention, reducing its cost by a factor of the stride size; this method is likely complementary to our work.

Conclusion
We show that the computational cost of the transformer decoder becomes more significant at inference time. Towards reducing this cost, we first show that there is sparsity in the encoder-decoder attention that allows the computational cost to be reduced with minimal degradation. Second, we partition the sentence-level attention score, and we augment the standard decoder with a neural network that approximates the attention over sentences, allowing sentence selection. We show that the summarization performance of our approach converges to that of the full attention baseline, while switching the complexity from O(M N1 N2) to O(M N1 + k_w M r N2 + k_e N1 N2).

A Models, Data, and Training

A.1 Models
Base Models: We use models from HuggingFace's Transformers (Wolf et al., 2020), including BART models fine-tuned to CNNDM and XSum. We take LoBART from Manakul and Gales (2021), including LoBART(4k)+MCS fine-tuned to Podcast and arXiv. MCS is the multitask content selection system for handling the Podcast/arXiv input documents that exceed 4096 words.
Modified Architecture: As shown in Fig. 3, the modified architecture consists of a sentence encoder and a sentence-level encoder-decoder attention. The sentence encoder, which realizes f2(.) in Eq. 7, is a two-layer bi-directional GRU (Cho et al., 2014).

A.2 Data
Our data processing uses the same byte-pair encoding (BPE) tokenizer as BART-large, and we use the NLTK toolkit for sentence splitting. In Fig. 2 and Fig. 4, to reduce computational cost, we use the first 2,000 samples of each test set, except for Podcast, which contains fewer than 2,000 samples.

A.3 Training and Inference
We use PyTorch (Paszke et al., 2019) and the Adam optimizer (Kingma and Ba, 2015) with β1=0.9, β2=0.999, and a learning-rate schedule with 20,000 warmup steps. In all experiments, we set the batch size to 1 with gradient accumulation over 2 steps. We evaluate the loss on the validation set every 20,000 steps and stop training if the validation loss does not improve 3 consecutive times. All training experiments converged within 1 epoch. All experiments were carried out in 32-bit precision on either one V100 (32GB) GPU or one RTX 2080Ti (11GB) GPU.
At inference time, we use the standard setting: beam search of width 4 and a length penalty of 2.0 (Wu et al., 2016) for all experiments. The ROUGE (Lin, 2004) scoring tool is pyrouge.

A.4 Multi-head attention
In all equations and expressions in the paper, we omit the attention heads for simplicity. Both BART and LoBART have 16 heads. In Fig. 1, we average α^s_{m,i} over heads before the summation. When computing an uncertainty measure such as entropy H(.) or KL-divergence KL(.), we compute the measure for each head separately and take the average. In obtaining I^r_m, we average α^s_{m,i} over heads before the top-r operation, i.e. all heads are assigned the same subset of sentences, but the subsets differ across layers and decoding timesteps.
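The head-averaging rule for I^r_m can be sketched as follows (NumPy illustration, our own names and toy sizes):

```python
import numpy as np

def shared_top_r(alpha_heads, r):
    """alpha_heads: (H, N1) sentence-level attention per head.
    Average over heads first, then take a single top-r, so all
    heads in a layer attend to the same subset of sentences."""
    mean_alpha = alpha_heads.mean(axis=0)
    return np.argsort(mean_alpha)[::-1][:r]

rng = np.random.default_rng(4)
H, N1, r = 16, 20, 5
alpha_heads = rng.dirichlet(np.ones(N1), size=H)  # one row per head
idx = shared_top_r(alpha_heads, r)
print(sorted(idx.tolist()))
```

Sharing one subset across heads keeps the gathered K̂, V̂ identical for all heads of a layer, which simplifies the constrained attention computation.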

A.5 KL Loss and Integrated Training
The target α^s_{m,i} is re-normalized with a temperature T = 0.5 to encourage higher sparsity. For integrated training, we set λ in L_I to 0.2. We initialize integrated training experiments from the KL-only models.
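The exact re-normalization equation is elided in the extracted text; one standard temperature sharpening consistent with "T = 0.5 encourages higher sparsity" is α_i^{1/T} / Σ_j α_j^{1/T}, which we sketch here as an assumption:

```python
import numpy as np

def sharpen(alpha, T=0.5):
    """Temperature re-normalization of the target distribution:
    alpha_i^(1/T) / sum_j alpha_j^(1/T). With T < 1, mass is
    concentrated on the already-salient sentences (assumed form)."""
    p = alpha ** (1.0 / T)
    return p / p.sum()

alpha = np.array([0.5, 0.3, 0.2])
sharp = sharpen(alpha)   # more peaked than alpha
print(np.round(sharp, 4))
```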

B Time Analysis
For each mode in Table 1, we take 6 samples of M and the average time of 100 iterations. Curve fitting yields an R-squared of at least 0.994. The computational time as a function of M and N is modeled as:

time = c1 + c2 M + c3 N + c4 MN + c5 M² + c6 N².

The coefficients are obtained by least-squares regression, i.e. c* = (PᵀP)⁻¹Pᵀt, where P is the matrix of the M, N terms associated with the coefficients and t contains the time measurements. We collect 30 samples of the average time of 100 Forward+Backward passes, spanning N ∈ [256, 1024] and M ∈ [50, 300]. The normalized coefficients are: c1 = 1.00, c2 = 3.78×10⁻³, c3 = 3.15×10⁻³, c4 = 1.47×10⁻⁶, c5 = 7.26×10⁻⁷, c6 = 7.79×10⁻⁷. Because c4 ≈ 2c5 and N > M, the encoder-decoder attention cost is greater than the decoder self-attention cost.

C Sensitivity to Random Selection

Table 5: Impact of sparsity on the sensitivity to random selection, based on CNNDM with r = 5.
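The least-squares fit of Appendix B can be sketched on synthetic timings (the ground-truth coefficients and noise level below are illustrative, not the paper's measurements):

```python
import numpy as np

# Model: time = c1 + c2*M + c3*N + c4*M*N + c5*M^2 + c6*N^2.
rng = np.random.default_rng(5)
c_true = np.array([1.0, 3.8e-3, 3.2e-3, 1.5e-6, 7.3e-7, 7.8e-7])

M = rng.uniform(50, 300, size=30)
N = rng.uniform(256, 1024, size=30)
P = np.stack([np.ones_like(M), M, N, M * N, M**2, N**2], axis=1)
t = P @ c_true + rng.normal(scale=1e-4, size=30)  # noisy "timings"

# c* = (P^T P)^{-1} P^T t; lstsq is the numerically stable equivalent
c_hat = np.linalg.lstsq(P, t, rcond=None)[0]
print(np.round(c_hat / c_hat[0], 6))  # normalized by c1, as in Table 1
```

Using `np.linalg.lstsq` instead of forming the normal equations explicitly avoids the poor conditioning caused by the very different scales of the M, N, MN, M², and N² columns.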

D Model-free Approximation for Eq. 4
Since α^s_{m,i} = (1/Z_m) Σ_{j=1}^{J_i} exp(q_m · k_{i,j}) is only used to rank input sentences, the normalization term Z_m can be dropped. The encoder-decoder attention has O(M N1 N2) complexity because q_m · k_{i,j} is computed for every m and every (i, j) pair. Hence, if we can group the k_{i,j} into sentences, the complexity could potentially be reduced. We approximate the unnormalized α^s_{m,i} with a feature-map factorization over the hidden dimensions d ∈ {1, ..., D} (Eq. 19), using φ(.) = ELU(.) + 1 (or another form such as exp or ReLU). The model-free method reduces the complexity from O(M N1 N2) to O(M N1 + k_w M r N2 + k_e N1 N2), where k_e is now much smaller than in the model-based approach. Based on Eq. 19, we provide model-free results in Table 6.

Table 6: Model-free results on CNNDM (r = 5).
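Since Eq. 19 itself is elided in the extraction, the sketch below shows one plausible feature-map factorization of the unnormalized score, score_i = φ(q_m) · Σ_j φ(k_{i,j}); this exact form is our assumption. Its key property holds regardless: the per-sentence sums can be precomputed once and reused at every decoding step.

```python
import numpy as np

def phi(x):
    """Positive feature map, phi(.) = ELU(.) + 1 (alpha = 1)."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def model_free_scores(q_m, sentences):
    """Unnormalized sentence scores via a feature-map factorization,
    score_i = phi(q_m) . sum_j phi(k_{i,j})  (assumed form of Eq. 19).
    The per-sentence sums depend only on the encoder keys, so they
    are computed once and reused at every decoding step."""
    k_sums = np.stack([phi(K_i).sum(axis=0) for K_i in sentences])
    return k_sums @ phi(q_m)

rng = np.random.default_rng(6)
D = 8
sentences = [rng.normal(size=(J, D)) for J in (3, 5, 4, 2)]
scores = model_free_scores(rng.normal(size=D), sentences)
top_r = np.argsort(scores)[::-1][:2]   # rank sentences with no training
print(scores.shape, sorted(top_r.tolist()))
```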
Our model-free approach is better than the random selection baseline, but it is significantly worse than both the ideal selection baseline and the model-based approach. The reasons for this poor performance are: (i) the approximation from (16) to (17) requires that A1 A2 > B1 B2 imply A1 + A2 > B1 + B2, which is inaccurate when the values are not in a similar range; (ii) the approximation from (17) to (18) is inaccurate for non-positive values. In conclusion, this experiment investigates a challenging alternative that requires no training and is computationally cheaper at inference time. Although the current algebra does not work well, we hope that our initial study draws more interest in this type of model-free approach for exploiting sentence structure in seq2seq tasks such as abstractive summarization.
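A two-term counterexample (our own numbers) shows why step (16)→(17) fails when the values are not in a similar range: products and sums need not order the same way.

```python
# A1*A2 > B1*B2 does not imply A1+A2 > B1+B2 when the values
# span different ranges (illustrative counterexample).
A1, A2 = 1.0, 1.0   # product 1.0, sum 2.0
B1, B2 = 3.0, 0.3   # product 0.9, sum 3.3
print(A1 * A2 > B1 * B2, A1 + A2 > B1 + B2)
```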

E Word-level and Sentence-level Attention Weight Plots
Constraining the encoder-decoder attention to r sentences is motivated by observations of the sentence-level attention (Figs. 5b, 6b, 7b, 8b). Note that we average over all heads for the plots. For instance, Fig. 5 shows that the decoder attends particularly to input sentences #1, #2, and #13 during summary generation. Compared to Fig. 5, Fig. 6 shows a wider spread of the attention over sentences in a more abstractive task. When using LoBART, Fig. 7 and Fig. 8 show a trend of sparsity similar to the BART scenarios. These figures also explain why, in Fig. 1a (Section 3), Σ_{i∈I^r_m} α^s_{m,i} is only low/moderate: most sentences are assigned some attention weight despite being non-salient.