HiPool: Modeling Long Documents Using Graph Neural Networks

Encoding long sequences in Natural Language Processing (NLP) is a challenging problem. Although recent pretrained language models achieve satisfying performance on many NLP tasks, they are still restricted by a predefined maximum length, making it hard to extend them to longer sequences. Some recent works therefore use hierarchies to model long sequences; however, most of them apply sequential models for the upper hierarchies and thus suffer from long-dependency issues. In this paper, we alleviate these issues with a graph-based method. We first chunk the sequence into fixed-length pieces to model sentence-level information, and then leverage graphs to model intra- and cross-sentence correlations with a new attention mechanism. Additionally, due to the limited standard benchmarks for long document classification (LDC), we propose a new challenging benchmark totaling six datasets, with up to 53k samples and an average length of 4034 tokens. Evaluation shows our model surpasses competitive baselines by 2.6% in F1 score, and by 4.8% on the dataset with the longest sequences. Our method outperforms hierarchical sequential models with better performance and scalability, especially for longer sequences.


Introduction
Transformer-based (Vaswani et al., 2017) models like BERT (Devlin et al., 2019) and RoBERTa (Zhuang et al., 2021) have achieved satisfying results in many Natural Language Processing (NLP) tasks thanks to large-scale pretraining. However, they usually have a fixed length limit, due to the quadratic complexity of the dense self-attention mechanism, making it challenging to encode long sequences.
One way to solve this problem is to adapt Transformers to accommodate longer inputs by optimizing the attention mechanism. BigBird (Zaheer et al., 2020) applies sparse attention that combines random, global, and sliding-window attention over a long sequence, reducing the quadratic dependency of full attention to linear. Similarly, Longformer (Beltagy et al., 2020) applies an efficient self-attention with dilated windows that scales linearly with the window length. Both models can take up to 4096 input tokens. Though it is possible to train even larger models for longer sequences, they remain restricted by a pre-defined maximum length and scale poorly. More importantly, they fail to capture high-level structures, such as relations among sentences or paragraphs, which are essential to improving NLP system performance (Zhu et al., 2019).
Another way is to apply a hierarchical structure that processes adjustable input lengths through chunked representations, yielding scalability for long sequences. Hi-Transformer (Wu et al., 2021) encodes both sentence-level and document-level representations using Transformers. ToBERT (Pappagari et al., 2019) takes a similar approach, stacking a sentence-level Transformer over a pretrained BERT model. However, most existing work models the upper-level hierarchy with sequential structures, such as multiple layers of LSTMs (Hochreiter and Schmidhuber, 1997) or Transformers, which may still suffer from the long-dependency issue as sequences grow longer. To alleviate this, we investigate graph modeling as a novel hierarchy for the upper levels. In addition, we consider inter-hierarchy relationships with a new attention mechanism.
Our key insight is to replace the sequence-based upper-level model with a hierarchical attentional graph for long documents. We first apply a basic pretrained language model, BERT or RoBERTa, to encode local representations of fixed-length document chunks. The number of chunks can be extended for longer sequences, giving better scalability. Different from other works, we apply a graph neural network (GNN) (Zhou et al., 2018) to model the upper-level hierarchy and aggregate local sentence information, which alleviates the long-dependency issue of sequential models. Moreover, within this graph structure, we propose a new heterogeneous attention mechanism that considers intra- and cross-sentence-level correlations.
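To make this concrete, the following is a minimal sketch (not the released implementation) of the chunk-level encoding step: a long document is split into fixed-length, overlapping chunks, and each chunk's CLS vector becomes one low-level node. The chunk length, overlap size, and the `bert-base-uncased` checkpoint are illustrative assumptions, not the paper's exact settings.

```python
# Hypothetical sketch: overlapping chunking + CLS encoding of a long document.
import torch
from transformers import AutoModel, AutoTokenizer

def encode_chunks(text, chunk_len=510, overlap=50, model_name="bert-base-uncased"):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    encoder = AutoModel.from_pretrained(model_name)

    token_ids = tokenizer.encode(text, add_special_tokens=False)
    stride = chunk_len - overlap          # adjacent chunks share `overlap` tokens
    chunks = [token_ids[i:i + chunk_len]
              for i in range(0, max(len(token_ids), 1), stride)]

    cls_vectors = []
    with torch.no_grad():
        for chunk in chunks:
            ids = torch.tensor([[tokenizer.cls_token_id] + chunk + [tokenizer.sep_token_id]])
            out = encoder(input_ids=ids)
            cls_vectors.append(out.last_hidden_state[:, 0])   # CLS embedding of this chunk
    return torch.cat(cls_vectors, dim=0)                      # (num_chunks, hidden_dim)
```

Each row of the returned matrix corresponds to one chunk and is later treated as a low-level graph node.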
Our contributions are two-fold: 1) We propose HiPool, with multi-level hierarchies for long-sequence tasks and a novel inter-hierarchy graph attention structure. This heterogeneous graph attention is shown to outperform hierarchical sequential models in both performance and scalability, especially for longer sequences; 2) We benchmark the LDC (long document classification) task with better-scaled and length-extended datasets. Evaluation shows that HiPool surpasses competitive baselines by 2.6% in F1 score, and by 4.8% on the longest-sequence dataset. Code is available at https://github.com/IreneZihuiLi/HiPool.

Model
We introduce the HiPool (Hierarchical Pooling) model for long document classification, illustrated in Fig. 1. It consists of an overlapping sequence encoder, a HiPool graph encoder, and a linear layer.

Overlapping Sequence Encoder. Given the input document S, we first chunk it into a number of shorter pieces with a fixed length L, setting the overlapping window size to L_olp. Overlapping encoding allows a chunk to carry information from its adjacent chunks rather than being isolated, differentiating our model from other hierarchical ones. Each chunk is then encoded with a pretrained Transformer model, i.e., BERT or RoBERTa; we take the CLS token representation as the input to our HiPool layer: X = BERT(S).

HiPool Graph Encoder. We apply a graph neural network to encode the incoming word-level information. We construct a graph G(V, E), where V is the set of nodes and E is the set of node connections. There are two node types: n low-level nodes and m high-level nodes, typically with m < n. In our experiments, we set m = n/p with p ≥ 1. The feedforward operation goes from low-level to high-level nodes. In layer l, the low-level nodes are the inputs from the previous layer l − 1, while the high-level nodes of layer l are computed from the low-level ones. These high-level nodes in turn become the low-level nodes of the next layer l + 1. We take X as the low-level nodes of the first HiPool layer, as shown in the figure.

In each HiPool layer, given the node representation H^l and adjacency matrix A^l at layer l, the task is to obtain H^{l+1}:

H^{l+1} = HiPool(H^l, A^l).   (1)

Inspired by DiffPool (Ying et al., 2018), we aggregate information with a clustering method, assigning node clusters in a fixed pattern based on position: adjacent low-level neighbors map to the same high-level clustering node. We therefore define a clustering adjacency matrix A_self ∈ R^{n×m} that maps the n low-level nodes to the m high-level nodes (black arrows in the figure). Our approach allows overlapping, so some nodes may belong to two clusters; the clustering sliding window has size 2p with stride p (the figure shows the case p = 2). Interactions between low-level nodes are captured by the adjacency matrix A^l, which we model as a chain graph following the natural order of the document. The relations between high-level nodes, A^l_high, and their node representations, H^l_high, are then computed as

A^l_high = A_self^T A^l A_self,   H^l_high = A_self^T H^l.

Besides, for each high-level node, we propose an attention mechanism to obtain cross-sentence information and strengthen the connections across different clusters. We introduce a new edge type that connects the low-level nodes outside a cluster to that cluster's high-level node; the corresponding adjacency matrix is simply A_cross = 1 − A_self (green in the figure). We then update H^l_high with attention-weighted messages from these cross-cluster low-level nodes, where W_atten is a trainable attention matrix and W_score is the resulting scoring matrix.

We then apply a GNN to obtain H^{l+1}, for example a graph convolutional network (GCN) (Kipf and Welling, 2016):

H^{l+1} = σ(Â^l_high H^l_high W^l),

where Â^l_high is the self-loop-augmented, symmetrically normalized adjacency matrix and W^l is a trainable weight matrix. We run our experiments with two HiPool layers and apply a sum aggregator to obtain the document embedding; more HiPool layers are also possible.

Linear Layer. Finally, a linear layer is connected, and cross-entropy loss is applied during training.
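As an illustration of the layer described above, here is a minimal single-layer sketch. It is not the official implementation: the cross-cluster attention update and the GCN normalization are one plausible reading of the text (DiffPool-style pooling, a chain graph over low-level nodes, A_cross = 1 − A_self, a trainable W_atten), and the exact equations in the released code may differ.

```python
# Hypothetical sketch of one HiPool layer (assumptions noted in the text above).
import math
import torch
import torch.nn as nn

class HiPoolLayer(nn.Module):
    def __init__(self, dim, p=2):
        super().__init__()
        self.p = p
        self.w_atten = nn.Parameter(torch.randn(dim, dim) / math.sqrt(dim))  # trainable attention matrix
        self.w_gcn = nn.Linear(dim, dim)                                      # GCN weight for high-level nodes

    def forward(self, h_low):
        n, _ = h_low.shape
        m = max(1, math.ceil(n / self.p))        # number of high-level (cluster) nodes

        # A_self: position-based clustering, window 2p with stride p, so clusters overlap.
        a_self = torch.zeros(n, m)
        for j in range(m):
            start = j * self.p
            a_self[start:start + 2 * self.p, j] = 1.0

        # A_low: chain graph over low-level nodes, following document order.
        a_low = torch.zeros(n, n)
        idx = torch.arange(n - 1)
        a_low[idx, idx + 1] = 1.0
        a_low[idx + 1, idx] = 1.0

        # DiffPool-style aggregation to high-level nodes.
        a_high = a_self.t() @ a_low @ a_self     # (m, m) relations between clusters
        h_high = a_self.t() @ h_low              # (m, dim) cluster representations

        # Cross-cluster attention: each cluster attends to low-level nodes outside it.
        a_cross = 1.0 - a_self                   # (n, m)
        w_score = torch.softmax(h_high @ self.w_atten @ h_low.t(), dim=-1)   # (m, n)
        h_high = h_high + (w_score * a_cross.t()) @ h_low

        # One GCN step over the high-level graph (self-loops + symmetric normalization).
        a_hat = a_high + torch.eye(m)
        d_inv_sqrt = a_hat.sum(dim=1).clamp(min=1e-6).pow(-0.5)
        a_norm = d_inv_sqrt.unsqueeze(1) * a_hat * d_inv_sqrt.unsqueeze(0)
        return torch.relu(self.w_gcn(a_norm @ h_high))   # becomes the low-level input of the next layer


# Usage: 10 chunk embeddings -> 5 cluster-level embeddings.
layer = HiPoolLayer(dim=768, p=2)
pooled = layer(torch.randn(10, 768))
```

Stacking two such layers and summing the resulting node embeddings would yield the document representation that the final linear layer consumes, as described above.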

LDC Benchmark
The LDC benchmark contains six datasets. We first choose four widely used public datasets. Hyperpartisan (HYP) (Kiesel et al., 2019) and 20NewsGroups (20NG) (Lang, 1995) are both news text datasets at different scales. IMDB (Maas et al., 2011) is a movie review dataset for sentiment classification. ILDC (Malik et al., 2021) is a large corpus of legal cases annotated with binary court decisions ("accepted" and "rejected").

Limitation and new datasets. However, 20NewsGroups and IMDB cannot test the limits of models in encoding long documents, since their average document length is still relatively small, whereas Hyperpartisan contains only 645 examples and is thus prone to overfitting and not representative. ILDC is large and contains long texts, but it is mainly in the legal domain. Therefore, to enrich the evaluation scenarios, we select and propose two new benchmarks with longer documents based on an existing large-scale corpus of Amazon product reviews (He and McAuley, 2016). Amazon-512 (A-512) contains all reviews longer than 512 words from the Electronics category; Amazon-2048 (A-2048) contains 10,000 randomly sampled reviews longer than 2048 words from the Books category. We randomly split each dataset into training, validation, and test sets. These longer datasets allow us to draw statistically significant conclusions on model performance as sequence lengths increase, as demonstrated in Table 1.
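For illustration only, a hypothetical preprocessing script for the two Amazon subsets could look like the following; the JSON field name (`reviewText`), the file paths, and the sampling details are assumptions rather than the authors' released preprocessing code.

```python
# Hypothetical sketch of building Amazon-512 / Amazon-2048 from raw review dumps.
import json
import random

def build_subset(path, min_words, sample_size=None, seed=42):
    """Collect reviews longer than `min_words`, optionally down-sampling to `sample_size`."""
    reviews = []
    with open(path) as f:
        for line in f:                                   # one JSON review per line (assumed format)
            text = json.loads(line).get("reviewText", "")
            if len(text.split()) > min_words:
                reviews.append(text)
    if sample_size is not None:
        random.Random(seed).shuffle(reviews)
        reviews = reviews[:sample_size]
    return reviews

# Amazon-512: every Electronics review longer than 512 words.
# amazon_512 = build_subset("Electronics.json", min_words=512)
# Amazon-2048: 10,000 randomly sampled Books reviews longer than 2048 words.
# amazon_2048 = build_subset("Books.json", min_words=2048, sample_size=10000)
```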

Evaluation
Hyperparameters. We list details in Appendix C.
Baselines. We select four pretrained models: BERT (Devlin et al., 2019), RoBERTa (Zhuang et al., 2021), BigBird (Zaheer et al., 2020), and Longformer (Beltagy et al., 2020). We also compare with a hierarchical Transformer model, ToBERT (Pappagari et al., 2019). Hi-Transformer (Wu et al., 2021) could not be reproduced as no code is available. We evaluate two variations of our HiPool method by changing the sequence encoder: HiPool-BERT and HiPool-RoBERTa. We report the Micro-F1 score in Tab. 2.

To further validate the HiPool graph encoder, we compare it with the following sequential modules: Simple is a linear summation over low-level nodes; CNN applies a 1-dimensional convolution; Trans applies a Transformer on top of the low-level nodes. Besides, we also look at multiple graph settings: Aggr-mean uses a mean aggregator to obtain the final document representation; Aggr-std uses a feature-wise standard deviation aggregator; finally, Aggr-pna applies Principal Neighbourhood Aggregation (PNA) (Corso et al., 2020). We report results on Amazon-2048 in Tab. 3, as it has the longest sequences on average. We observe that applying aggregators is better than the simpler sequential structures, while keeping a graph is still the better choice. HiPool additionally considers attention in message passing, and thus performs even better. We also test other variations in Appendix B.
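To spell out what the aggregator variants above compute, the sketch below collapses the node embeddings produced by the graph encoder into a single document vector. The PNA case is deliberately simplified to a concatenation of mean, max, and feature-wise standard deviation; the full method of Corso et al. (2020) also applies degree-based scalers.

```python
# Sketch of document-level aggregation over node embeddings (simplified PNA).
import torch

def aggregate(node_embs, mode="mean"):
    # node_embs: (num_nodes, hidden_dim) output of the graph encoder
    if mode == "sum":
        return node_embs.sum(dim=0)
    if mode == "mean":
        return node_embs.mean(dim=0)
    if mode == "std":                           # feature-wise standard deviation
        return node_embs.std(dim=0, unbiased=False)
    if mode == "pna":                           # simplified multi-aggregation, no scalers
        return torch.cat([node_embs.mean(dim=0),
                          node_embs.max(dim=0).values,
                          node_embs.std(dim=0, unbiased=False)])
    raise ValueError(f"unknown aggregator: {mode}")
```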

Ablation Study
Effect of input length. To better understand the effect of input length, in Fig. 2 we present an ablation study on Amazon-2048 and ILDC, comparing three models: BigBird, Longformer, and HiPool. In general, the models benefit from longer input sequences on both datasets. Interestingly, when the sequence length exceeds 2048, Longformer and BigBird cannot improve further, as they are limited by their maximum input lengths. In contrast, HiPool steadily improves as the input sequence gets longer, showing its ability to encode long documents in a hierarchical structure.

Effect of model components. As shown in Tab. 4, we first take the best model setting, HiPool-RoBERTa, and compare it with the following settings: 1) w/o RoBERTa replaces RoBERTa with BERT, turning the model into HiPool-BERT; 2) w/o HiPool removes the proposed HiPool module and replaces it with a simple CNN (Kim, 2014); 3) w/o Overlapping removes the overlapping word encoding. We can see that removing the HiPool layer leads to a significant drop, indicating the importance of the proposed method. Moreover, the HiPool framework can work with many pretrained language models, as applying RoBERTa improves over BERT. A complete result table can be found in the Appendix.

Conclusion
In this paper, we proposed a hierarchical framework for long document classification. The evaluation shows our model surpasses competitive baselines.

A IMDB-long Dataset
HiPool Performs the Best for Long Sequences in IMDB. As a supplementary analysis, we look at the IMDB dataset, on which HiPool performs worse than BigBird and Longformer. We keep only the sequences that are longer than 512 tokens to construct the IMDB-long dataset, resulting in 3250 training and 3490 test samples; detailed statistics of IMDB-long are given in Tab. 5. The evaluation is shown in Fig. 3, where we observe that HiPool does better for long sequences.

Figure 3: Performance on IMDB-long. HiPool outperforms BigBird and Longformer when the sequence length is larger than 512.

B Graph Variations
We study other possible GNN types for hierarchy modeling. In Eq. 1, we replace the HiPool graph encoder with a GCN or GAT encoder. For a fair comparison, we apply two layers of these graph networks before the linear layer and show the results in Tab. 6. We notice that GCN and GAT yield lower performance than HiPool. A possible reason is that they only model the low-level nodes and lack a cross-sentence attention mechanism, like the one in HiPool, to strengthen high-level communication on long sequences.

C Hyperparameters

Table 7: Hyperparameters for baseline models and HiPool. Time* indicates how many hours the overall trial, training, and testing took on a single GPU. Note that we report the average and standard deviation for HiPool, so we ran the evaluation at least 5 times.

D Frequently Asked Questions
• Q: Why do we call it a heterogeneous graph?
A: We use the term "heterogeneous" to distinguish between the node types in the graph. We wish to emphasize that the nodes are not all the same: they come from multiple levels and represent different information.
• Q: Are there other possible variations for modeling the hierarchy?
A: Yes, our HiPool model is a framework that applies a graph structure for the high-level hierarchy, so it is possible to apply other GNN models. One can use Relational Graph Convolutional Networks (R-GCNs) (Schlichtkrull et al., 2018) to model the different relations for A_self and A_cross. Besides, some inductive methods like GraphSAGE (Hamilton et al., 2017) can also be applied to obtain node embeddings in the graph. We leave this topic as future work.
• Q: How does the aggregator work in Tab. 3?
A: The aggregator combines the node embeddings produced by the graph encoder into a single document vector (e.g., by sum, mean, feature-wise standard deviation, or PNA), which is then fed to the linear classification layer.
• Q: Why did we not evaluate on the LRA (Long Range Arena) benchmark (Tay et al., 2021)?
A: LRA is more suitable for testing the efficiency of Transformer-based models, and it consists of multiple types of long sequences. As we mentioned in the Introduction, our proposed model belongs to another category of long-sequence encoding, not the efficient-Transformer category that focuses on optimizing the KQV attention.

E Limitations and Potential Risks
Limitations. The model we propose is specifically for classification, though it can be extended to other NLP tasks by changing the high-level task-specific layer. Besides, our evaluation focused on English corpora; we plan to test on other languages in the future.

Potential Risks. We make our code publicly available so that everyone can access it. As the model is a classification model, it does not generate risky content. Users should also note that the classification predictions may not be perfectly correct.