Hi-Transformer: Hierarchical Interactive Transformer for Efficient and Effective Long Document Modeling

Transformer is important for text modeling. However, it has difficulty in handling long documents due to its quadratic complexity with respect to input text length. To handle this problem, we propose a hierarchical interactive Transformer (Hi-Transformer) for efficient and effective long document modeling. Hi-Transformer models documents in a hierarchical way, i.e., it first learns sentence representations and then learns document representations. It can effectively reduce the complexity and meanwhile capture the global document context in the modeling of each sentence. More specifically, we first use a sentence Transformer to learn the representation of each sentence. Then we use a document Transformer to model the global document context from these sentence representations. Next, we use another sentence Transformer to enhance sentence modeling with the global document context. Finally, we use a hierarchical pooling method to obtain the document embedding. Extensive experiments on three benchmark datasets validate the efficiency and effectiveness of Hi-Transformer in long document modeling.


Introduction
Transformer (Vaswani et al., 2017) is an effective architecture for text modeling, and has been an essential component in many state-of-the-art NLP models like BERT (Devlin et al., 2019; Yang et al., 2019; Wu et al., 2021). The standard Transformer needs to compute a dense self-attention matrix based on the interactions between each pair of tokens in text, where the computational complexity is proportional to the square of the text length (Vaswani et al., 2017; Wu et al., 2020b). Thus, it is difficult for Transformer to model long documents efficiently.
There are several methods to accelerate Transformer for long document modeling (Kitaev et al., 2019; Qiu et al., 2020). One direction is using Transformer in a hierarchical manner to reduce the sequence length, e.g., first learning sentence representations and then learning document representations from the sentence representations. However, the modeling of sentences is agnostic to the global document context, which may be suboptimal because the local context within a sentence is usually insufficient. Another direction is using a sparse self-attention matrix instead of a dense one. For example, Beltagy et al. (2020) proposed to combine local self-attention with a dilated sliding window and sparse global attention. Zaheer et al. (2020) proposed to incorporate a random sparse attention mechanism to model the interactions between a random set of tokens. However, these methods cannot fully model the global context of a document (Tay et al., 2020).
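As a concrete illustration of the sparse-attention direction discussed above, the following minimal sketch builds a Longformer-style boolean attention mask that combines a local sliding window with a few global positions. The function name `sliding_window_mask` and all parameter values are hypothetical, chosen for illustration only; real implementations fold such masks into batched attention kernels rather than materializing a dense boolean matrix.

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int, global_idx=()):
    """Boolean attention mask: True = attention allowed.

    Combines a local sliding window with a few global tokens,
    in the spirit of the sparse patterns described above.
    """
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for i in range(seq_len):
        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
        mask[i, lo:hi] = True          # local window around token i
    for g in global_idx:               # global tokens attend everywhere
        mask[g, :] = True              # and are attended to by everyone
        mask[:, g] = True
    return mask

mask = sliding_window_mask(8, window=1, global_idx=(0,))
print(mask.sum())  # far fewer allowed pairs than the dense 8*8 = 64
```

Only the allowed pairs need to be computed, which is what reduces the attention cost from quadratic to roughly linear in the sequence length for a fixed window size.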
In this paper, we propose a hierarchical interactive Transformer (Hi-Transformer) for efficient and effective long document modeling, which models documents in a hierarchical way to effectively reduce the complexity and at the same time can capture the global document context for sentence modeling. In Hi-Transformer, we first use a sentence Transformer to learn the representation of each sentence within a document. Next, we use a document Transformer to model the global document context from these sentence representations.

Hi-Transformer
In this section, we introduce our hierarchical interactive Transformer (Hi-Transformer) approach for efficient and effective long document modeling. Its framework is shown in Fig. 1. It uses a hierarchical architecture that first models the contexts within a sentence, next models the document contexts by capturing the interactions between sentences, then employs the global document contexts to enhance sentence modeling, and finally uses hierarchical pooling techniques to obtain document embeddings. In this way, the input sequence length of each Transformer is much shorter than directly taking the word sequence of a document as input, and the global contexts can still be fully modeled. The details of Hi-Transformer are introduced as follows.

Model Architecture
Hi-Transformer mainly contains three modules, i.e., sentence context modeling, document context modeling, and global document context-enhanced sentence modeling. The sentence-level context is first modeled by a sentence Transformer. Assume a document contains M sentences, and the words in the i-th sentence are denoted as [w_{i,1}, w_{i,2}, ..., w_{i,K}] (K is the sentence length). We append a "[CLS]" token (denoted as w_s) to the end of each sentence. This token is used to convey the contextual information within the sentence. The sequence of words in each sentence is first converted into a word embedding sequence via a word and position embedding layer. Denote the word embedding sequence for the i-th sentence as [e_{i,1}, e_{i,2}, ..., e_{i,K}, e_s]. Since sentences are usually short, we apply a sentence Transformer to each sentence to fully model the interactions between the words within it. It takes the word embedding sequence as input, and outputs the contextual representations of the words, which are denoted as [h_{i,1}, h_{i,2}, ..., h_{i,K}, h^s_i]. In particular, the representation h^s_i of the "[CLS]" token is regarded as the sentence representation. Next, the document-level context is modeled by a document Transformer from the representations of the sentences within the document. Denote the embedding sequence of sentences in this document as [h^s_1, h^s_2, ..., h^s_M]. We add a sentence position embedding (denoted as p_i for the i-th sentence) to the sentence representations to capture sentence order. We then apply a document Transformer to these sentence representations to capture the global context of the document, and further learn document context-aware sentence representations, which are denoted as [r^s_1, r^s_2, ..., r^s_M]. Then, we use the document context-aware sentence representations to further improve sentence context modeling by propagating the global document context to each sentence.
Motivated by Guo et al. (2019), we apply another sentence Transformer to the hidden word representations and the document context-aware sentence representation of each sentence. It outputs a document context-aware word representation sequence for each sentence, which is denoted as [r_{i,1}, r_{i,2}, ..., r_{i,K}]. In this way, the contextual representations of words can benefit from both the local sentence context and the global document context.
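The data flow of one Hi-Transformer layer described above can be sketched at the shape level as follows. This is a minimal sketch, not the authors' implementation: the `encoder` function is a trivial stand-in (an identity linear map with a nonlinearity) where a real Transformer encoder layer would go, and all dimensions are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, d = 4, 6, 8   # sentences per document, words per sentence, hidden dim

def encoder(x):
    """Stand-in for a Transformer encoder layer (here just a fixed
    linear map plus a nonlinearity, so shapes and data flow are concrete)."""
    W = np.eye(x.shape[-1])
    return np.tanh(x @ W)

# word embeddings for each sentence, plus one "[CLS]" slot at the end
words = rng.normal(size=(M, K, d))
cls = np.zeros((M, 1, d))
x = np.concatenate([words, cls], axis=1)         # (M, K+1, d)

# 1) sentence Transformer over each sentence independently
h = encoder(x)                                   # (M, K+1, d)
sent_repr = h[:, -1, :]                          # "[CLS]" positions -> (M, d)

# 2) document Transformer over the M sentence representations
pos = rng.normal(size=(M, d)) * 0.01             # sentence position embeddings
doc_aware_sent = encoder(sent_repr + pos)        # (M, d)

# 3) second sentence Transformer: word reps + document-aware sentence rep
x2 = np.concatenate([h[:, :K, :], doc_aware_sent[:, None, :]], axis=1)
r = encoder(x2)                                  # (M, K+1, d): doc context-aware words
print(r.shape)  # (4, 7, 8)
```

The key point the sketch makes concrete is that no Transformer call ever sees more than K+1 or M tokens at once, yet step 3 lets every word representation condition on information from the whole document.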
By stacking multiple layers of Hi-Transformer, the contexts within a document can be fully modeled. Finally, we use hierarchical pooling (Wu et al., 2020a) techniques to obtain the document embedding. We first aggregate the document context-aware word representations in each sentence into a global context-aware sentence embedding s_i, and then aggregate the global context-aware sentence embeddings within a document into a unified document embedding d, which is further used for downstream tasks.
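The hierarchical pooling step can be sketched as two stages of attentive pooling, in the spirit of Yang et al. (2016). The query vectors `q_word` and `q_sent` stand in for learnable attention parameters and are random here purely for illustration.

```python
import numpy as np

def attentive_pool(x, q):
    """Attentive pooling: softmax(x . q)-weighted sum over the first axis."""
    scores = x @ q
    w = np.exp(scores - scores.max())   # stable softmax over positions
    w = w / w.sum()
    return w @ x

rng = np.random.default_rng(0)
M, K, d = 4, 6, 8
r = rng.normal(size=(M, K, d))    # document context-aware word representations
q_word = rng.normal(size=d)       # stand-in for learnable word-level query
q_sent = rng.normal(size=d)       # stand-in for learnable sentence-level query

# words -> sentence embeddings, then sentences -> document embedding
s = np.stack([attentive_pool(r[i], q_word) for i in range(M)])  # (M, d)
doc = attentive_pool(s, q_sent)                                  # (d,)
print(doc.shape)  # (8,)
```

In a trained model, the attention weights let informative words dominate each sentence embedding and informative sentences dominate the document embedding, instead of plain averaging.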

Efficiency Analysis
In this section, we provide some discussion of the computational complexity of Hi-Transformer. In sentence context modeling and document context propagation, the total computational complexity is O(M·K^2·d), where M is the number of sentences in a document, K is the sentence length, and d is the hidden dimension. In document context modeling, the computational complexity is O(M^2·d). Thus, the total computational cost is O(M·K^2·d + M^2·d). Compared with the standard Transformer, whose computational complexity is O(M^2·K^2·d), Hi-Transformer is much more efficient.
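The claimed savings are easy to check numerically. The sketch below plugs illustrative values of M, K, and d into the two complexity expressions from the analysis above (constants and layer counts omitted):

```python
# Rough per-layer self-attention cost comparison, up to constant factors,
# using the paper's symbols: M sentences, K words per sentence, hidden size d.
# The values of M, K, and d here are illustrative, not the paper's settings.
M, K, d = 64, 32, 256

hi_transformer = M * K**2 * d + M**2 * d   # sentence-level + document-level
vanilla = (M * K)**2 * d                   # dense attention over all M*K tokens

print(vanilla // hi_transformer)  # -> 60
```

At these sizes the hierarchical factorization is roughly 60x cheaper, and the gap widens as the document grows, since the dense cost scales with (M·K)^2 while the hierarchical cost scales with M·K^2 + M^2.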

Datasets and Experimental Settings
Our experiments are conducted on three benchmark document modeling datasets. The first one is Amazon Electronics (He and McAuley, 2016) (denoted as Amazon), which is for product review rating prediction. The second one is IMDB (Diao et al., 2014), a widely used dataset for movie review rating prediction. The third one is the MIND dataset (Wu et al., 2020c), which is a large-scale dataset for news intelligence. We use the content-based news topic classification task on this dataset. The detailed dataset statistics are shown in Table 1.
In our experiments, we use the 300-dimensional pre-trained GloVe (Pennington et al., 2014) embeddings to initialize word embeddings. We use two Hi-Transformer layers in our approach and two Transformer layers in the baseline methods. We use attentive pooling (Yang et al., 2016) to implement the hierarchical pooling module. The hidden dimension is set to 256, i.e., 8 self-attention heads in total, with an output dimension of 32 per head. Due to the limitation of GPU memory, the input sequence lengths of the vanilla Transformer and its variants for long documents are 512 and 2048, respectively. The dropout (Srivastava et al., 2014) ratio is 0.2. The optimizer is Adam (Kingma and Ba, 2015), and the learning rate is 1e-4. The maximum number of training epochs is 3. The models are implemented using the Keras library with the TensorFlow backend. The GPU we used is a GeForce GTX 1080 Ti with 11 GB of memory. We use accuracy and macro-F scores as the performance metrics. We repeat each experiment 5 times and report both average results and standard deviations.

Table 3: Complexity of different methods. K is the sentence length, M is the number of sentences in a document, T is the number of positions for sparse attention, and d is the hidden dimension.
The results of these methods on the three datasets are shown in Table 2. We find that Transformers designed for long documents, like Hi-Transformer and BigBird, outperform the vanilla Transformer. This is because the vanilla Transformer cannot handle long sequences due to the restriction of computational resources, and truncating the input sequence leads to the loss of much useful contextual information. In addition, Hi-Transformer and HI-BERT outperform Longformer and BigBird. This is because the sparse attention mechanism used in Longformer and BigBird cannot fully model the global contexts within a document. Besides, Hi-Transformer achieves the best performance, and t-test results show that its improvements over the baselines are significant. This is because Hi-Transformer can incorporate global document contexts to enhance sentence modeling.
We also compare the computational complexity of these methods in Table 3. The complexity of Hi-Transformer is much lower than that of the vanilla Transformer and is comparable with other Transformer variants designed for long documents. These results indicate the efficiency and effectiveness of Hi-Transformer.

Model Effectiveness
Next, we verify the effectiveness of the global document contexts for enhancing sentence modeling in Hi-Transformer. We compare Hi-Transformer and its variant without global document contexts in Fig. 2. We find the performance consistently declines when the global document contexts are not encoded into sentence representations. This is because the local contexts within a single sentence may be insufficient for accurate sentence modeling, and the global contexts in the entire document can provide rich complementary information for sentence understanding. Thus, propagating the document contexts to enhance sentence modeling can improve long document modeling.

Influence of Text Length
Then, we study the influence of text length on model performance and computational cost. Since the documents in the MIND dataset are the longest, we conduct experiments on MIND to compare the model performance, as well as the training time per layer, of Transformer and Hi-Transformer under different input text lengths, and the results are shown in Fig. 3. We find that the performance of both methods improves when longer text sequences are used. This is intuitive because more information can be incorporated when longer text is input to the model for document modeling. However, the computational cost of Transformer grows very fast, which limits its maximal input text length. Different from Transformer, Hi-Transformer is much more efficient and meanwhile can achieve better performance with longer sequence lengths. These results further verify the efficiency and effectiveness of Hi-Transformer in long document modeling.

Conclusion
In this paper, we propose a Hi-Transformer approach for both efficient and effective long document modeling. It incorporates a hierarchical architecture that first learns sentence representations and then learns document representations. It can effectively reduce the computational complexity and meanwhile be aware of the global document contexts in sentence modeling, which helps understand document content accurately. Extensive experiments on three benchmark datasets validate the efficiency and effectiveness of Hi-Transformer in long document modeling.