HETFORMER: Heterogeneous Transformer with Sparse Attention for Long-Text Extractive Summarization

To capture the semantic graph structure from raw text, most existing summarization approaches are built on GNNs with a pre-trained model. However, these methods suffer from cumbersome procedures and inefficient computations for long-text documents. To mitigate these issues, this paper proposes HetFormer, a Transformer-based pre-trained model with multi-granularity sparse attentions for long-text extractive summarization. Specifically, we model different types of semantic nodes in raw text as a potential heterogeneous graph and directly learn heterogeneous relationships (edges) among nodes by Transformer. Extensive experiments on both single- and multi-document summarization tasks show that HetFormer achieves state-of-the-art performance in Rouge F1 while using less memory and fewer parameters.


Introduction
Recent years have seen resounding success in the use of graph neural networks (GNNs) on document summarization tasks (Hanqi Jin, 2020), due to their ability to capture inter-sentence relationships in complex documents. Since GNNs require node features and a graph structure as input, various methods, both extractive and abstractive (Li et al., 2020; Huang et al., 2020; Jia et al., 2020), have been proposed for learning desirable node representations from raw text. In particular, they have shown that Transformer-based pre-trained models such as BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019) offer an effective way to initialize and fine-tune the node representations used as GNN input.
Despite the great success of combining Transformer-based pre-trained models with GNNs, all existing approaches have limitations. The first lies in the adaptation capability to long-text input. Most pre-trained methods truncate longer documents to a small fixed-length sequence (e.g., n = 512 tokens), as the full attention mechanism incurs a quadratic cost w.r.t. the sequence length. This leads to serious information loss (Li et al., 2020; Huang et al., 2020). The second limitation is that they use pre-trained models as multi-layer feature extractors to learn better node features and then build multi-layer GNNs on top of the extracted features, which results in cumbersome networks with a tremendous number of parameters (Jia et al., 2020).
Recently, several works have focused on reducing the computational overhead of fully-connected attention in Transformers. In particular, ETC (Ravula et al., 2020) and Longformer (Beltagy et al., 2020) proposed local-global sparse attention in pre-trained models that limits each token to attending to a subset of the other tokens (Child et al., 2019), which achieves computational cost linear in the sequence length. Although these methods use local and global attention to preserve the hierarchical structure information contained in raw text, they are still unable to capture the multi-level granularity of semantics in complex text summarization scenarios.
In this work, we propose HETFORMER, a HETerogeneous transFORMER-based pre-trained model for long-text extractive summarization using multi-granularity sparse attentions. Specifically, we treat tokens, entities, and sentences as different types of nodes and the multiple sparse masks as different types of edges representing their relations (e.g., token-to-token, token-to-sentence), which preserves the graph structure of the document even with raw textual input. Moreover, our approach eschews GNNs and instead relies entirely on a sparse attention mechanism to draw heterogeneous graph-structural dependencies between input tokens.
The main contributions of the paper are summarized as follows: 1) we propose a new structured pre-training method to capture the heterogeneous structure of documents using sparse attention; 2) we extend the pre-trained method to long-text extractive summarization instead of truncating the document to small inputs; 3) we empirically demonstrate that our approach achieves state-of-the-art performance on both single- and multi-document extractive summarization tasks.

HETFORMER on Summarization
HETFORMER aims to learn a heterogeneous Transformer within a pre-trained model for text summarization.
To be specific, we model different types of semantic nodes in raw text as a potential heterogeneous graph, and explore multi-granularity sparse attention patterns in the Transformer to directly capture the heterogeneous relationships among nodes. The node representations are interactively updated during fine-tuning, and finally the sentence node representations are used to predict the labels for extractive text summarization.

Node Construction
In order to accommodate multiple granularities of semantics, we consider three types of nodes: token, sentence and entity.
The token node represents the original textual item and stores token-level information. Different from HSG, which aggregates identical tokens into one node, we keep each token occurrence as a separate node to avoid ambiguity and confusion across different contexts. Each sentence node corresponds to one sentence and represents that sentence's global information. Specifically, we insert an external [CLS] token at the start of each sentence and use it to encode the features of each token in the sentence. We also use interval segment embeddings to distinguish multiple sentences within a document, and position embeddings to encode the monotonically increasing token positions within each sentence. The entity node represents a named entity associated with the topic. The same entity may appear in multiple spans of the document. We utilize NeuralCoref to obtain the coreference resolution of each entity, which determines whether two expressions (or "mentions") refer to the same entity.
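As a concrete illustration, the per-sentence [CLS] insertion and interval segment embeddings described above can be sketched as follows (a minimal Python sketch; the function name and the whitespace tokenization are our assumptions, not the released implementation):

```python
def build_input(sentences):
    """Prepend a [CLS] token to each sentence and assign alternating
    (interval) segment ids so the model can tell sentences apart."""
    tokens, segment_ids, cls_positions = [], [], []
    for i, sent in enumerate(sentences):
        words = sent.split()
        cls_positions.append(len(tokens))  # sentence node = its [CLS] slot
        tokens.append("[CLS]")
        tokens.extend(words)
        seg = i % 2  # interval segment embedding: 0, 1, 0, 1, ...
        segment_ids.extend([seg] * (1 + len(words)))
    return tokens, segment_ids, cls_positions
```

Position embeddings (not shown) would simply follow the index of each token in the resulting sequence.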

Sparse Attention Patterns
Our goal is to model different types of relationships (edges) among nodes, so as to directly achieve a sparse graph-like structure. To this end, we leverage multi-granularity sparse attention mechanisms in the Transformer, considering five attention patterns, as shown in Fig. 1: token-to-token (t2t), token-to-sentence (t2s), sentence-to-token (s2t), sentence-to-sentence (s2s) and entity-to-entity (e2e).
Specifically, we use a fixed-size window attention surrounding each token (Fig. 1(a)) to capture the short-term t2t dependence of the context. Although each window captures only short-term dependence, using multiple stacked layers of such windowed attention results in a large receptive field (Beltagy et al., 2020), because the top layers have access to all input locations and have the capacity to build representations that incorporate information across the entire input.
The t2s pattern represents the attention of all tokens connecting to the sentence nodes and, conversely, s2t is the attention of sentence nodes connecting to all tokens across the sentence (the dark blue lines in Fig. 1(b)). The s2s pattern is the attention between multiple sentence nodes (the light blue squares in Fig. 1(b)). To compensate for the limitation of t2t caused by the fixed-size window, we allow the sentence nodes unrestricted attention for all three of these types. Thus, tokens that are arbitrarily far apart in the long-text input can transfer information to each other through the sentence nodes.
Complex topics related to the same entity may span multiple sentences, making it challenging for existing sequential models to fully capture the semantics among entities. To solve this problem, we introduce the e2e attention pattern (Fig. 1(c)). The intuition is that if there are several mentions of a particular entity, all pairs of those mentions are connected. In this way, we facilitate the connections of relevant entities and preserve global context, e.g., entity interactions and topic flows.
Linear Projections for Sparse Attention. To ensure the sparsity of attention, we create a binary mask for each attention-pattern group, M_t2t, M_ts and M_e2e, where 0 means disconnection and 1 means connection between a pair of nodes. In particular, M_ts is used jointly for s2s, t2s and s2t. We use different projection parameters for each attention pattern in order to model the heterogeneity of relationships across nodes. To do so, we first calculate each attention with its respective mask and then sum the three attentions together as the final integrated attention (Fig. 1(d)).
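The three binary masks can be sketched as follows (an illustrative Python/NumPy construction under our assumptions about the input format; the real implementation stores sparse patterns rather than dense n x n matrices):

```python
import numpy as np

def build_masks(n, window, cls_positions, entity_clusters):
    """Binary masks (1 = attend, 0 = disconnected) over n positions.
    entity_clusters: list of coreference clusters, each a list of
    mentions, each mention a list of token indices (our assumed format)."""
    # t2t: fixed-size sliding window around each token
    m_t2t = np.zeros((n, n), dtype=np.int8)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        m_t2t[i, lo:hi] = 1
    # ts: sentence ([CLS]) nodes attend to and from every position;
    # this single mask covers s2t, t2s and (between [CLS] slots) s2s
    m_ts = np.zeros((n, n), dtype=np.int8)
    for c in cls_positions:
        m_ts[c, :] = 1  # s2t / s2s
        m_ts[:, c] = 1  # t2s
    # e2e: all mention pairs of the same entity are connected
    m_e2e = np.zeros((n, n), dtype=np.int8)
    for cluster in entity_clusters:
        idx = [i for mention in cluster for i in mention]
        for i in idx:
            for j in idx:
                m_e2e[i, j] = 1
    return m_t2t, m_ts, m_e2e
```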
Each sparse attention is calculated as

Attn_m(X) = softmax((Q_m K_m^T ⊙ M_m) / √d_h) V_m, with Q_m = X W_Q^m,

where X is the input text embedding, ⊙ represents the element-wise product and W_Q^m is the projection parameter. The key K_m and the value V_m are calculated in the same way as Q_m, but with different projection parameters, which helps learn better representations for heterogeneous semantics. The expensive operation of fully-connected attention is Q K^T, as its computational complexity grows with the sequence length (Kitaev et al., 2020). In HETFORMER, we instead follow the implementation of Longformer and only calculate and store attention at the positions where the mask value is 1, which results in memory use that grows linearly, rather than quadratically, with the sequence length.
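A dense emulation of one masked attention pattern might look like the following (a sketch, not the released code: we realize the mask by setting disconnected positions to -inf before the softmax, which zeroes their weights; the actual implementation computes only the mask = 1 positions, which is what gives the linear memory cost):

```python
import numpy as np

def sparse_attention(X, W_Q, W_K, W_V, mask):
    """One masked attention pattern, emulated densely.
    X: (n, d) input embeddings; W_*: (d, d_h) projections; mask: (n, n)
    binary pattern mask. Assumes every row attends to at least one position."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_h = Q.shape[-1]
    scores = (Q @ K.T) / np.sqrt(d_h)
    # disconnected positions (mask == 0) get zero attention weight
    scores = np.where(mask == 1, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```

The final integrated attention would then be the sum of three such calls, one per mask, each with its own projection parameters.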

Sentence Extraction
As extractive summarization is more general and widely used, we build a classifier on each sentence node representation o_s from the last layer of HETFORMER to select sentences. The classifier uses a linear projection layer with a sigmoid activation to get the prediction score for each sentence: ŷ_s = σ(W_o o_s + b_o), where σ is the sigmoid function and W_o and b_o are the parameters of the projection layer.
In the training stage, these prediction scores are trained with a binary cross-entropy loss against the gold labels y. In the inference stage, the scores are used to sort the sentences and select the top-k as the extracted summary.
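The scoring and top-k selection described above can be sketched as follows (illustrative Python; the function names are ours):

```python
import numpy as np

def score_sentences(O_s, W_o, b_o):
    """Prediction score per sentence node: sigmoid(W_o o_s + b_o).
    O_s: (num_sentences, d) stacked sentence node representations."""
    return 1.0 / (1.0 + np.exp(-(O_s @ W_o + b_o)))

def select_top_k(scores, k):
    """Inference: pick the k highest-scoring sentences, returned in
    document order so the extracted summary reads naturally."""
    top = np.argsort(-scores)[:k]
    return sorted(top.tolist())
```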

Extension to Multi-Document
Our framework can establish document-level relationships in the same way as sentence-level ones, simply by adding document nodes for the multiple documents (i.e., adding a [CLS] token in front of each document) and calculating the document↔sentence (d2s, s2d), document↔token (d2t, t2d) and document-to-document (d2d) attention patterns. It can therefore be easily adapted from single-document to multi-document summarization.

Discussions
The most relevant approaches to this work are Longformer (Beltagy et al., 2020) and ETC (Ravula et al., 2020), which use a hierarchical attention pattern to scale Transformers to long documents. Compared to these two methods, we formulate the Transformer as multi-granularity graph attention patterns, which can better encode heterogeneous node types and different edge connections. More specifically, Longformer treats the input sequence as one sentence with single tokens marked as global. In contrast, we consider the input sequence as multi-sentence units by using sentence-to-sentence attention, which is able to capture the inter-sentence relationships in complex documents. Additionally, we introduce an entity-to-entity attention pattern to facilitate the connection of relevant subjects and preserve global context, both of which are ignored in Longformer and ETC. Moreover, our model extends more flexibly to the multi-document setting.

Datasets
CNN/DailyMail is the most widely used benchmark dataset for single-document summarization (Zhang et al., 2019;Jia et al., 2020). The standard dataset split contains 287,227/13,368/11,490 samples for train/validation/test. To be comparable with other baselines, we follow the data processing in (Liu and Lapata, 2019b;See et al., 2017).
Multi-News is a large-scale dataset for multi-document summarization introduced in (Fabbri et al., 2019), where each sample is composed of 2-10 documents and a corresponding human-written summary. Following Fabbri et al. (2019), we split the dataset into 44,972/5,622/5,622 samples for train/validation/test. The average lengths of the source documents and output summaries are 2,103.5 tokens and 263.7 tokens, respectively. Given the N input documents, we take the first L/N tokens from each source document, then concatenate the truncated source documents into one sequence in the original order. Due to memory limitations, we truncate the input length L to 1,024 tokens; if memory capacity allows, our model can process inputs of up to 4,096 tokens.
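The truncation-and-concatenation step can be sketched as follows (a minimal Python sketch of the preprocessing described above; the function name is ours):

```python
def truncate_and_concat(documents, L):
    """Take the first L // N tokens from each of the N source documents
    (each a list of tokens), then concatenate them in the original order."""
    per_doc = L // len(documents)
    return [tok for doc in documents for tok in doc[:per_doc]]
```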
While the dataset contains abstractive gold summaries, it is not readily suited to training extractive models. We therefore follow the work of Zhou et al. (2018) on extractive summary labeling, constructing gold-label sequences by greedily optimizing R-2 F1 on the gold-standard summary.
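The greedy labeling procedure can be sketched as follows (illustrative Python using a simplified set-based bigram F1 as a stand-in for ROUGE-2; all names are ours, and this is not the authors' exact script):

```python
def bigrams(tokens):
    return {(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)}

def rouge2_f1(candidate, reference):
    """Set-based bigram F1, a simplified stand-in for ROUGE-2 F1."""
    c, r = bigrams(candidate), bigrams(reference)
    if not c or not r:
        return 0.0
    overlap = len(c & r)
    if overlap == 0:
        return 0.0
    p, rec = overlap / len(c), overlap / len(r)
    return 2 * p * rec / (p + rec)

def greedy_oracle(sentences, reference, max_sents=3):
    """Greedily add the sentence that most improves R-2 F1 against the
    gold summary; stop when no remaining sentence improves the score."""
    selected, best = [], 0.0
    while len(selected) < max_sents:
        gains = []
        for i in range(len(sentences)):
            if i in selected:
                continue
            cand = [t for j in sorted(selected + [i]) for t in sentences[j]]
            gains.append((rouge2_f1(cand, reference), i))
        if not gains:
            break
        score, idx = max(gains)
        if score <= best:
            break
        best, selected = score, selected + [idx]
    return sorted(selected)  # binary labels: 1 for these indices, else 0
```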

Baselines and Metrics
We evaluate our proposed model against pre-trained language models (Devlin et al., 2018; Liu et al., 2019), state-of-the-art GNN-based pre-trained language models (Jia et al., 2020; Hanqi Jin, 2020) and pre-trained language models with sparse attention (Narayan et al., 2020; Beltagy et al., 2020). Please see Appendix B for details.
We use the unigram, bigram, and longest common subsequence variants of Rouge F1 (denoted R-1, R-2 and R-L) (Lin and Och, 2004) to evaluate summarization quality. Note that the experimental results of the baselines are taken from the original papers.

Implementation Detail
Our model HETFORMER is initialized from the Longformer pretrained checkpoint longformer-base-4096, which is itself further pretrained with the standard masked language model task from the RoBERTa checkpoint roberta-base on documents of max length 4,096. We apply dropout with probability 0.1 before all linear layers in our models. The proposed model follows the Longformer-base architecture, where the number of hidden units d_model is set to 768, the hidden size d_h is 64, the number of layers is 12 and the number of heads is 12. We train our model for 500K steps on a Titan RTX GPU (24GB) with gradient accumulation every two steps, using the Adam optimizer. The learning rate schedule follows (Vaswani et al., 2017), with warm-up over the first 10,000 steps. We select the top-3 checkpoints according to the evaluation loss on the validation set and report averaged results on the test set. At test time, we select the top-3 sentences for CNN/DailyMail and the top-9 for Multi-News, according to the average length of their human-written summaries. Trigram blocking is used to reduce repetition.

Summarization Results
As shown in Table 1, our approach outperforms or is on par with the current state-of-the-art baselines. Longformer and ETC outperform HiBERT, a hierarchical-structure model using fully-connected attention, which shows the superiority of sparse attention in capturing more relations (e.g., token-to-sentence and sentence-to-token). Among the pre-trained models using sparse attention, HETFORMER, which considers the heterogeneous graph structure of the text input, outperforms Longformer and ETC. Moreover, HETFORMER achieves competitive performance compared with GNN-based models such as HSG and HAHsum. Our model scores slightly lower than HAHsum_large, but that model uses a large architecture (24 layers with about 400M parameters), while ours is a base-size model. Table 2 shows the results of multi-document summarization. Our model outperforms all the extractive and abstractive baselines. These results reveal the importance of modeling longer documents to avoid serious information loss.

Memory Cost
Compared with the self-attention component in the original Transformer, which requires quadratic memory, the proposed model only computes the positions where the attention-pattern mask = 1, which significantly saves memory. To verify this, we show the memory costs of the base versions of BERT, RoBERTa, Longformer and HETFORMER on the CNN/DailyMail dataset with the same configuration (input length = 512, batch size = 1). From the results in Table 3, we can see that HETFORMER takes only 55.9% of the memory cost of the RoBERTa model, and does not take much more memory than Longformer.

Ablation Study
To show the importance of the design choices in our attention patterns, we tried different variants and report their controlled experiment results. To make the ablation study manageable, we train each configuration for 500K steps on the single-document CNN/DailyMail dataset, then report the Rouge scores on the test set.
The top of Table 4 demonstrates the impact of different ways of configuring the window sizes per layer. We observe that increasing the window size from the bottom to the top layer (from 32 to 512) leads to the best performance, while the reverse (from 512 to 32) leads to worse performance. Using a fixed window size (the average of the window sizes of the other configurations) leads to performance in between.
The middle of Table 4 presents the impact of incorporating the sentence node in the attention pattern. In the implementation, "no sentence node" means that we delete the [CLS] tokens from the document input and use the average representation of the tokens in each sentence as the sentence representation. We observe that performance decreases when the sentence node is not fully connected to the other tokens.
The bottom of Table 4 shows the influence of the entity node. We can see that without the entity node, performance decreases. This demonstrates that facilitating connections between relevant subjects preserves global context, which benefits the summarization task.

Conclusion
For the task of long-text extractive summarization, this paper has proposed HETFORMER, which uses multi-granularity sparse attention to represent the heterogeneous graph among texts. Experiments show that the proposed model achieves comparable performance on a single-document summarization task, as well as state-of-the-art performance on a multi-document summarization task with longer input documents. In future work, we plan to expand the edges from a binary type (connect or disconnect) to richer semantic types, e.g., is-a, part-of, and others (Zhang et al., 2020).

A.1 Graph-enhanced Summarization
In recent state-of-the-art summarization models, there is a trend to extract structure from the text and formulate the document as a hierarchical structure or a heterogeneous graph. HiBERT (Zhang et al., 2019), GraphSum (Li et al., 2020) and HT (Liu and Lapata, 2019a) consider the word, sentence and document levels of the input text to formulate a hierarchical structure. MGSum (Hanqi Jin, 2020), ASGARD (Huang et al., 2020), HSG and HAHsum (Jia et al., 2020) construct the source article as a heterogeneous graph in which words, sentences, and entities serve as semantic nodes; they iteratively update the sentence node representations, which are then used for sentence extraction. The limitation of these models is that they use pre-trained models as feature extractors to learn node features and build GNN layers on top of the nodes, which introduces more training parameters than using pre-trained models alone. Compared with these models, our work achieves the same goal with a lighter framework. Moreover, these models typically limit inputs to n = 512 tokens because of the O(n^2) cost of attention. Due to long source articles, when applying BERT or RoBERTa to the summarization task, they need to truncate source documents into one or several smaller input blocks (Li et al., 2020; Jia et al., 2020; Huang et al., 2020). Huang et al. (2021) proposed an efficient encoder-decoder attention with head-wise positional strides, which is ten times faster than existing full-attention models and can scale to long documents. Liu et al. (2021) leveraged the syntactic and semantic structures of text to improve the Transformer and achieved a nine-times speedup. Our model focuses on a different direction, using graph-structured sparse attention to capture long-term dependencies in long text inputs.
The most related approaches to the work presented in this paper are Longformer (Beltagy et al., 2020) and ETC (Ravula et al., 2020), which feature a very similar global-local attention mechanism and take advantage of the pre-trained model RoBERTa. The difference is that Longformer uses a single input sequence with some tokens marked as global (the only ones that use full attention), while the global tokens in ETC are pre-trained with a CPC loss. Compared with these two works, we formulate a heterogeneous attention mechanism that considers word-to-word, word-to-sentence, sentence-to-word and entity-to-entity attention.

A.3 Graph Transformer
Given the great similarity between the attention mechanisms used in the Transformer (Vaswani et al., 2017) and the Graph Attention Network (Veličković et al., 2017), several Graph Transformer works have appeared recently. GTN (Yun et al., 2019), HGT (Hu et al., 2020), (Fan et al., 2021) and HetGT (Yao et al., 2020) formulate different types of attention mechanisms to capture node relationships in a graph.
The major difference between our work and Graph Transformers is that the input of a Graph Transformer is structured, such as a graph or a dependency tree, whereas the input of our HETFORMER is unstructured text. Our work converts the Transformer into a structure-aware model so that it can capture the latent relations in unstructured text, such as word-to-word, word-to-sent, sent-to-word, sent-to-sent and entity-to-entity relations.

B Baseline Details
Extractive Models: BERT (or RoBERTa) (Devlin et al., 2018; Liu et al., 2019) is a Transformer-based model for text understanding trained through masked language modeling. HiBERT (Zhang et al., 2019) proposed a hierarchical Transformer that first encodes each sentence with a sentence-level Transformer encoder, and then encodes the whole document with a document-level Transformer encoder. HSG and HDSG formulated the input text as a heterogeneous graph containing semantic nodes of different granularity (word, sentence and document nodes) connected by TF-IDF weights. HSG used a CNN and a BiLSTM to initialize the node representations and updated them by iteratively passing messages through a Graph Attention Network (GAT). In the end, the final sentence node representations are used to select summary sentences. HAHsum (Jia et al., 2020) constructed the input text as a heterogeneous graph containing word, named entity, and sentence nodes; it used a pre-trained ALBERT to learn the initial node representations and then applied GAT to iteratively learn node hidden representations. MGSum (Hanqi Jin, 2020) treated documents, sentences, and words as semantic units of different granularity, connected these units within a multi-granularity hierarchical graph, and also proposed a GAT-based model to update the node representations. ETC (Narayan et al., 2020) and Longformer (Beltagy et al., 2020) are two pre-trained models that capture hierarchical structure among input documents through a sparse attention mechanism.
Abstractive Models: Hi-MAP (Fabbri et al., 2019) expands the pointer-generator network into a hierarchical network and integrates an MMR module to calculate sentence-level scores. GraphSum (Li et al., 2020) leverages graph representations of documents, processing the input documents as a hierarchical structure with a pre-trained language model to generate abstractive summaries.