H-Transformer-1D: Fast One-Dimensional Hierarchical Attention for Sequences

We describe an efficient hierarchical method to compute attention in the Transformer architecture. The proposed attention mechanism exploits a matrix structure similar to the Hierarchical Matrix (H-Matrix) developed by the numerical analysis community, and has linear run time and memory complexity. We perform extensive experiments to show that the inductive bias embodied by our hierarchical attention is effective in capturing the hierarchical structure in the sequences typical for natural language and vision tasks. Our method outperforms alternative sub-quadratic proposals by more than 6 points on average on the Long Range Arena benchmark. It also sets a new SOTA test perplexity on the One-Billion Word dataset with 5x fewer model parameters than the previous-best Transformer-based models.


Introduction
Linearly combining information using content-based weights, a method generically known as attention, is a key building block in many deep neural networks such as recurrent neural networks (RNN) (Luong et al., 2015), convolutional neural networks (CNN) (Bello et al., 2019) and graph convolutional networks (GCN) (Velickovic et al., 2018). One particular type of such attention, called multi-head scaled dot-product attention, is one of the main components of the Transformer architecture proposed by Vaswani et al. (2017), which has been shown to push the state-of-the-art (SOTA) performance for various understanding and generation tasks. These include standard natural language processing (NLP) tasks such as machine translation, document classification, entailment, summarization and question answering (Zaheer et al., 2020; Dai et al., 2019; Baevski and Auli, 2019), as well as music generation (Huang et al., 2018), image generation (Parmar et al., 2018; Chen et al., 2020) and genomics (Zaheer et al., 2020; Choromanski et al., 2020). The Transformer is also the backbone architecture for models such as BERT (Devlin et al., 2019) (and its numerous relatives) and GPT3 (Brown et al., 2020), which have delivered impressive performance across many NLP tasks. However, the standard attention mechanism of the Transformer has a run time and memory usage that scale quadratically with sequence length. This quadratic complexity has become a critical bottleneck in processing long sequences (over 1,000 tokens), and has motivated many new attention algorithms; see (Tay et al., 2020d) for a survey of such work.
In this paper, we draw inspiration from two branches in numerical analysis: the Hierarchical Matrix (H-Matrix) (Hackbusch, 1999, 2000) and the Multigrid method (Briggs et al., 2000). We propose a hierarchical attention that has linear complexity in run time and memory, and only utilizes dense linear algebra operations optimized for GPUs or TPUs.
We hypothesize that the inductive bias embodied by the proposed hierarchical structure for the attention matrix is effective in capturing the hierarchical structure in the sequences typically seen in many natural language processing and computer vision tasks. The main benchmark we use in this paper is the Long Range Arena (LRA) benchmark (Tay et al., 2020c), which has been specifically designed to evaluate and compare various sub-quadratic attention algorithms. Our new hierarchical attention mechanism achieves the best average performance to date on the LRA benchmark, by more than 6 points over the previous-best BigBird algorithm (Zaheer et al., 2020), while pushing SOTA performance higher in 4 of the 5 successful tasks. Furthermore, using this new attention, a Transformer-based language model trained on the One-Billion Word dataset (Chelba et al., 2014) sets a new SOTA performance record by reducing the test perplexity by 1.55 points compared to the previous-best Transformer-XL (Dai et al., 2019), which has 5x more parameters. Overall, these empirical results validate both the soundness of our approximation method for computing attention weights and the appropriateness of the inductive bias present in the proposed hierarchical attention.

Related Works
It is well established in the NLP literature that the embeddings of nearby tokens tend to be more similar than those of distant ones (Manning and Schütze, 1999). This leads to the intuition that token similarity, and hence the attention, should decrease with the sequence distance between a query token and a key token. This motivates the sliding-window local attention (Parmar et al., 2018; Ramachandran et al., 2019; Qiu et al., 2019), which amounts to truncating off-diagonal entries in the attention matrix beyond a user-specified sequence distance. A second approach is to keep O(1) nonzeros per row in the attention matrix. The nonzero entry selection is either content-based (Kitaev et al., 2020; Roy et al., 2020; Tay et al., 2020b; Zhou et al., 2020), hand-crafted (Beltagy et al., 2020; Brown et al., 2020; Child et al., 2019; Ho et al., 2019) or simply random (Zaheer et al., 2020). It is also well known in the NLP literature that long-range contextual information is necessary for many NLP tasks (Khandelwal et al., 2018; Liu and Lapata, 2019). So a set of global tokens is also considered. This adds O(1) dense rows and columns to the attention matrix (Zaheer et al., 2020; Beltagy et al., 2020). A third approach is to approximate the attention matrix with a low-rank factored form (Choromanski et al., 2020; Tay et al., 2020a).
The first two approaches are based on the premise that one needs to explicitly zero out entries in the attention matrix in order to reduce the quadratic complexity. Decades of research by the scientific computing and numerical analysis community have resulted in more sophisticated algorithms to sparsify matrices. A small sample of these algorithms and their engineering applications includes the Fast Multipole Method (Greengard and Rokhlin, 1987; Greengard, 1994; Nabors et al., 1994; Shi et al., 1998), Pre-corrected FFT (Phillips and White, 1997), Hierarchical Singular Value Decomposition (SVD) (Kapur and Long, 1997) and Hierarchical Matrix (H-Matrix) (Hackbusch, 1999, 2000). These are generally called Multilevel Methods (Brandt and Lubrecht, 1990). The hierarchical attention proposed in this paper is inspired by these Multilevel Methods in general and the H-Matrix in particular. The hierarchical matrix structure allows a linear complexity in both constructing and applying the attention matrix.

Definition and Notation
Given matrices Q, K and V, with rows representing sequences of token embedding or feature vectors for query, key and value respectively, the output weighted by the scaled dot-product attention in the Transformer (Vaswani et al., 2017) is defined as

$$Z = \mathrm{softmax}\left(\frac{Q K^T}{\sqrt{d}}\right) V \tag{1}$$

where $Z, Q, K, V \in R^{L \times d}$, L is the length of the sequences, and d is the embedding or feature size. In a more compact matrix form, Eq. (1) can be written as

$$Z = D^{-1} A V \tag{2}$$

where

$$A = e^{S} \tag{3}$$

$$S_{i,j} = \frac{Q_i K_j^T}{\sqrt{d}} \tag{4}$$

$$D = \mathrm{diag}\{A \cdot 1_L\} \tag{5}$$

$$1_L = [1, 1, ..., 1]^T \tag{6}$$

Here, $A, S \in R^{L \times L}$, $1_L \in R^L$ is a vector with all ones, and $S_{i,j}$ represents the unnormalized cosine similarity between query embedding $Q_i$ (the i-th row in Q) and key embedding $K_j$ (the j-th row in K).
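As a point of reference for the rest of the exposition, the following is a minimal NumPy sketch of the dense (quadratic) attention described above. The function name `dense_attention` and the random test inputs are illustrative assumptions; the row-max subtraction is a standard numerical-stability trick that leaves the normalized result unchanged and is not something the text prescribes.

```python
import numpy as np

def dense_attention(Q, K, V):
    """Single-head scaled dot-product attention; O(L^2) time and memory."""
    d = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d)                       # similarity scores, shape (L, L)
    # Subtracting the row max rescales each row of A by a constant,
    # which cancels after normalization by the row sums.
    A = np.exp(S - S.max(axis=-1, keepdims=True))  # unnormalized attention matrix
    row_sums = A @ np.ones(A.shape[-1])            # A . 1_L
    return (A @ V) / row_sums[:, None]             # normalize each output row

rng = np.random.default_rng(0)
L, d = 16, 8
Q, K, V = (rng.standard_normal((L, d)) for _ in range(3))
Z = dense_attention(Q, K, V)
```

Both `S` and `A` are full L × L arrays here, which is exactly the cost the hierarchical scheme is meant to avoid.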
For the sake of clarity, we focus on single-head attention in the exposition of the proposed algorithm. Extension to the multi-head case is straightforward since each attention head is computed independently (Vaswani et al., 2017).

Introduction to H-Matrix and Multigrid Method

H-Matrix
The singular-value decomposition of the attention matrix A in Eq. (3) is

$$A = U \Sigma V^T \tag{7}$$

where $\Sigma = \mathrm{diag}\{\sigma_1, \sigma_2, ..., \sigma_L\}$ and $\sigma_i$ is the i-th singular value. The numerical rank of matrix A is r if $\sum_{i=r+1}^{L} \sigma_i < \epsilon$ for a given tolerance $\epsilon$ (Trefethen and Bau, 1997). The standard rank-r approximation to matrix A is

$$A \approx \hat{U} \hat{\Sigma} \hat{V}^T = \hat{U} \tilde{V}^T \tag{8}$$

where $\hat{\Sigma} = \mathrm{diag}\{\sigma_1, \sigma_2, ..., \sigma_r\}$, $\hat{U}$ and $\hat{V}$ have the first r columns of U and V, and $\tilde{V} = \hat{V} \hat{\Sigma}$. This is the low-rank approximation used in (Choromanski et al., 2020; Tay et al., 2020a). This approximation compresses $L^2$ entries in A to $2rL$ entries in $\hat{U}$ and $\tilde{V}^T$. So the compression rate is $\frac{L}{2r}$.

The H-Matrix generalizes this low-rank approximation by using matrix block hierarchy. Consider a two-level H-Matrix with 4 × 4 and 2 × 2 block partition at level-0 and level-1, respectively. Matrix A is partitioned as

$$A = \begin{bmatrix} \begin{bmatrix} A^{(0)}_{11} & A^{(0)}_{12} \\ A^{(0)}_{21} & A^{(0)}_{22} \end{bmatrix} & A^{(1)}_{12} \\ A^{(1)}_{21} & \begin{bmatrix} A^{(0)}_{33} & A^{(0)}_{34} \\ A^{(0)}_{43} & A^{(0)}_{44} \end{bmatrix} \end{bmatrix} \tag{9}$$

The low-rank approximation in Eq. (8) is applied to the off-diagonal blocks at each level. For example,

$$A^{(l)}_{12} \approx \hat{U}^{(l)}_{12} (\tilde{V}^{(l)}_{12})^T \tag{10}$$

where l = 0, 1. To give a concrete example, suppose each entry in matrix A has the analytical form $A_{i,j} = e^{S_{i,j}}$ (Eq. (11)) with $S_{i,j}$ as defined in Eq. (12), where i, j = 0, 1, 2, ..., 15. With the block hierarchy defined in Eq. (9), the size of the matrix block at level-1 and level-0 is 8 × 8 and 4 × 4, respectively. For tolerance $\epsilon = 10^{-3}$, one can verify that the numerical rank map of matrix A is

$$\begin{bmatrix} \begin{bmatrix} 4 & 2 \\ 2 & 4 \end{bmatrix} & 2 \\ 2 & \begin{bmatrix} 4 & 2 \\ 2 & 4 \end{bmatrix} \end{bmatrix} \tag{13}$$

where the number in each block is the numerical rank of the corresponding block in Eq. (9). Note that matrix A still has full numerical rank of 16 at a looser tolerance $10^{-1}$. So the standard low-rank approximation is ineffective in this case. But even this simple two-level H-Matrix already offers a compression rate of $\frac{4}{3}$, since storing an H-Matrix with the rank map in Eq. (13) takes 192 entries. In addition, one can verify that no entry $A_{i,j}$ in Eq. (11) is very small, since $S_{i,j} \in [-1, 1]$ in Eq. (12). Therefore, truncating off-diagonal entries of matrix A, as proposed in (Parmar et al., 2018), would produce a poor approximation.
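The 192-entry storage count and the 4/3 compression rate quoted above can be checked with a few lines of arithmetic. The sketch below assumes the block layout described in the text for the 16 × 16 example: four full-rank 4 × 4 diagonal blocks, four rank-2 4 × 4 off-diagonal blocks at level-0, and two rank-2 8 × 8 off-diagonal blocks at level-1. The helper names are hypothetical.

```python
def dense_storage(n):
    """Entries in a dense n x n block."""
    return n * n

def low_rank_storage(n, r):
    """Entries in a rank-r factored n x n block (two n x r factors)."""
    return 2 * n * r

L = 16
diagonal = 4 * dense_storage(4)          # full-rank diagonal blocks
level0 = 4 * low_rank_storage(4, 2)      # rank-2 off-diagonal 4 x 4 blocks
level1 = 2 * low_rank_storage(8, 2)      # rank-2 off-diagonal 8 x 8 blocks

h_matrix_entries = diagonal + level0 + level1
compression_rate = dense_storage(L) / h_matrix_entries
print(h_matrix_entries, compression_rate)
```

Each of the three groups contributes 64 entries, so the factored form stores three quarters of the 256 entries of the dense matrix.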
In practice, the number of levels is adapted to the underlying governing equations that result in matrix A, and it can easily be over 10 (Kapur and Long, 1997; Hackbusch, 2000). In turn, this can substantially increase the compression rate. In general, the computational complexity of the H-Matrix is either O(L) or O(L log L), depending on the underlying physics (Hackbusch, 1999, 2000).

Elements of the Multigrid Method
Multigrid Method is a multi-level nested iterative method for solving large-scale sparse matrices resulting from discretized partial-differential equations (PDEs) (Briggs et al., 2000;Trottenberg et al., 2000). At its core are two simple but powerfully complementary ideas: relaxation and correction. Our proposed hierarchical attention only uses the correction scheme as a building block since there is no sparse matrix to relax on.
The correction scheme has two components: restriction or coarsening, and interpolation or prolongation. Consider a vector $v^h$ of scalar values defined on a set of N grid points with uniform interval h. The simplest coarsening is to take the average of the scalar values on each pair of grid points, i.e.,

$$v^{2h}_j = \frac{1}{2}\left(v^h_{2j} + v^h_{2j+1}\right) \tag{14}$$

where j = 0, 1, 2, ..., N/2 − 1. The superscript in Eq. (14) indicates that the grid interval at these two levels is h and 2h, respectively. The simplest interpolation is to duplicate the value on each coarse grid point to values on a pair of fine grid points, i.e.,

$$v^h_{2j} = v^{2h}_j, \quad v^h_{2j+1} = v^{2h}_j \tag{15}$$

where j = 0, 1, 2, ..., N/2 − 1.
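The two operations of the correction scheme can be sketched in a few lines of NumPy. This is only an illustration of pairwise averaging and piecewise-constant duplication as described above; the function names `coarsen` and `interpolate` and the sample vector are assumptions made for the example.

```python
import numpy as np

def coarsen(v):
    """Average each adjacent pair: N fine-grid values -> N/2 coarse-grid values."""
    v = np.asarray(v, dtype=float)
    if v.shape[0] % 2 != 0:
        raise ValueError("length must be even")
    return 0.5 * (v[0::2] + v[1::2])

def interpolate(v_coarse):
    """Duplicate each coarse-grid value onto its pair of fine-grid points."""
    return np.repeat(np.asarray(v_coarse, dtype=float), 2, axis=0)

v_h = np.array([1.0, 3.0, 2.0, 6.0, 0.0, 4.0, 5.0, 5.0])
v_2h = coarsen(v_h)          # averages of (1,3), (2,6), (0,4), (5,5)
v_back = interpolate(v_2h)   # piecewise-constant fine-grid vector
```

Coarsening after interpolation returns the coarse vector unchanged, while interpolation after coarsening only recovers a piecewise-constant approximation of the fine vector.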

Intuition for Hierarchical Attention
The hierarchical low-rank structure like Eq. (13) turns out to be pervasive in many if not all physical phenomena. Much of the theoretical analysis by (Greengard and Rokhlin, 1987; Hackbusch, 1999) is concerned with quantifying such aspects.
The key insight into these Multilevel Methods can be summarized as follows: perform no approximation for near interactions, and apply progressively lower-precision approximation for progressively longer distance interactions. The simple case shown in Eq. (9)-(13) is a good example. To satisfy the tolerance of 10 −3 , we need full rank (no approximation) for the diagonal blocks (near interactions), higher precision approximation (rank-2 vs full-rank of 4) for the 4 × 4 off-diagonal blocks at level-0 (mid-distance) and lower precision approximation (rank-2 vs full-rank of 8) for the 8 × 8 off-diagonal blocks at level-1 (long-distance).
In this section, we present some intuition to answer two important questions: 1) Does the hierarchical low-rank structure hold for the attention matrix A in Eq. (3)? 2) What is the algorithm to efficiently compute the hierarchical low-rank structure? We only give an informal exposition of the hierarchical attention. The formal mathematical derivation is deferred to the Appendix.

Hierarchical Structure As Inductive Bias
The error analysis in (Greengard and Rokhlin, 1987; Hackbusch, 1999) offers little direct insight since the attention matrix A in Eq. (3) is data dependent by definition, and hence its analytical form like Eq. (11) and (12) is generally unknown. So gathering empirical evidence seems the only viable path to answer the first question listed above.
The ablation studies by (Khandelwal et al., 2018) examine the effect of context words on a language model. Within the context range of about 200 tokens, word order is only relevant within the 20 most recent tokens or about a sentence. In the long-range context, order has almost no effect on performance, suggesting that the model maintains a high-level, rough semantic representation of faraway words. The observation is succinctly summarized by the title of the paper "sharp nearby, fuzzy far away". Remarkably, this is in spirit very close to the key insight into the Multilevel Methods.
A few recent attention-related studies have explored this direction with some success, such as word-level and sentence-level attentions in (Miculicich et al., 2018;Abreu et al., 2019), and sentence-level and paragraph-level attentions in (Liu and Lapata, 2019). Even though the proposed hierarchical attention in these studies only has two levels, as opposed to ten or more levels typically used by the Multilevel Methods, the reported positive results are quite suggestive.
We therefore hypothesize that the same hierarchical low-rank structure as shown in Eq. (13) might also hold for the attention matrix in many NLP tasks, and we treat it as the inductive bias in the hierarchical attention mechanism proposed in this paper. As pointed out in (Goyal and Bengio, 2020), inductive biases encourage the learning algorithm to prioritise solutions with certain properties. Hence good benchmark performance delivered by a Transformer-based model with the proposed hierarchical attention can be regarded as positive evidence in support of the hierarchical low-rank structure hypothesis.

Informal Exposition of Hierarchical Attention
In the standard definition of attention in Eq. (3) and (4), there is no preference given to any keys based on the sequence distance between a query and keys. The observation in (Khandelwal et al., 2018) clearly suggests that a distance-dependent attention mechanism should be a better alternative. We will take three steps to informally explain the hierarchical attention mechanism. First, the attention matrix blocks for nearby, mid-distance and long-distance attention are separated in section 5.2.1. This is the first step toward the distance-dependent attention mentioned above. Second, a token hierarchy is established in section 5.2.2. Third, the hierarchical attention is constructed in section 5.2.3.

Attention Partition
Consider a 16-word sentence in Fig. 1. The sentence is partitioned at three segment granularities. This induces a three-level partition of the attention matrix A for the original sequence, given in Eq. (16), where the level-wise matrices $A^{(2)}$, $A^{(1)}$ and $A^{(0)}$ are defined in Eq. (17), (18) and (19), respectively. Note that the nonzero entries in $A^{(0)}$, $A^{(1)}$ and $A^{(2)}$ are the same as the corresponding entries of matrix A in Eq. (3). The matrix block size of $A^{(0)}_{ij}$, $A^{(1)}_{ij}$ and $A^{(2)}_{ij}$ is 2 × 2, 4 × 4 and 8 × 8, respectively. Following the key insight into Multilevel Methods, we perform no approximation to any level-0 matrix block $A^{(0)}_{ij}$ and apply a low-rank approximation to off-diagonal matrix blocks in $A^{(1)}$ and $A^{(2)}$. If we set the numerical rank of all these blocks to 2, then we can assemble the three rank maps into the single rank map in Eq. (20). (We omit some implementation details for handling the overlapping entries between adjacent levels.)

The hierarchical structure embodied by the predetermined rank map in Eq. (20) represents the inductive bias for the attention matrix A in Eq. (16). But this construction step is inefficient because we need to form the original attention matrix and then perform SVD to discover the low-rank approximation.

Token Hierarchy
To illustrate the notion of token hierarchy, consider the same 16-word sentence in Fig. 2. A simple 3-level binary-tree hierarchy can be set up by following the simple coarsening defined in Eq. (14): 1) At level-0, each one of the 16 words is mapped to its word embedding; 2) At level-1, each token (parent node) corresponds to a pair of adjacent words at level-0 (child nodes), which are shown inside each box. The embedding of each parent token is simply the average of its child token embeddings; 3) At level-2, each token (parent node) corresponds to one pair of adjacent tokens at level-1 (child nodes) or 4 adjacent words at level-0 (grandchild nodes), which are shown inside each box. The embedding of each parent token is simply the average of its child token embeddings. In general, the height of the binary tree is $O(\log_2(L))$ and the total number of tree nodes is O(2L), where L is the sequence length. We only need word embeddings for the leaf nodes since the embeddings of all other tree nodes can be recursively computed. The formal definition and notations of the recursion for query and key are detailed in section 6.1.
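A small sketch of this recursion for the 16-word example follows. The helper `build_token_hierarchy`, the embedding size of 4 and the random embeddings are hypothetical choices for illustration; only the pairwise averaging itself comes from the description above.

```python
import numpy as np

def build_token_hierarchy(embeddings, num_levels):
    """Return [level-0, level-1, ...] embeddings; each parent is the mean of its two children."""
    levels = [np.asarray(embeddings, dtype=float)]
    for _ in range(num_levels - 1):
        fine = levels[-1]
        levels.append(0.5 * (fine[0::2] + fine[1::2]))
    return levels

rng = np.random.default_rng(0)
words = rng.standard_normal((16, 4))          # 16 leaf tokens, embedding size 4
levels = build_token_hierarchy(words, num_levels=3)
shapes = [level.shape for level in levels]    # 16, 8 and 4 tokens per level
```

Because every parent averages two equally sized groups, a level-2 token is simply the mean of the four words it covers, so only the leaf embeddings need to be stored as inputs.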

Informal Construction of Hierarchical Attention
It is clear from Fig. 2

A key step to arrive at the hierarchical attention is to apply the contextual sliding window at each hierarchy level. The tokens at each level are partitioned into segments of size 2 in Fig. 2. One way to implement the local attention is to allow each query token segment to attend only two adjacent key token segments, one to its left and another to its right. At level-0, each query token segment also attends to the collocated key token segment. The token segment partition and local attention lead to a tri-diagonal block sparse matrix structure for $\tilde{A}^{(0)}$ and a bi-diagonal block sparse matrix structure for $\tilde{A}^{(1)}$ and $\tilde{A}^{(2)}$. Their sparsity patterns are

$$\tilde{A}^{(2)} \propto \begin{bmatrix} & 2 \\ 2 & \end{bmatrix} \tag{21}$$

$$\tilde{A}^{(1)} \propto \begin{bmatrix} & 2 & & \\ 2 & & 2 & \\ & 2 & & 2 \\ & & 2 & \end{bmatrix} \tag{22}$$

$$\tilde{A}^{(0)} \propto \begin{bmatrix} 2 & 2 & & & & & & \\ 2 & 2 & 2 & & & & & \\ & 2 & 2 & 2 & & & & \\ & & 2 & 2 & 2 & & & \\ & & & 2 & 2 & 2 & & \\ & & & & 2 & 2 & 2 & \\ & & & & & 2 & 2 & 2 \\ & & & & & & 2 & 2 \end{bmatrix} \tag{23}$$

where the 2 in the nonzero blocks indicates that these are dense blocks of size 2 × 2. It is clear that $\tilde{A}^{(0)}$ is identical to $A^{(0)}$ in Eq. (19). The efficiency gain comes from $\tilde{A}^{(2)}$ and $\tilde{A}^{(1)}$. Each nonzero entry in $\tilde{A}^{(2)}$ and $\tilde{A}^{(1)}$ captures the aggregated or coarse attention between two disjoint chunks of four and two tokens, respectively. Progressively larger token chunks lead to progressively lower-precision approximation to the original attention blocks. This is precisely the intention of the rank map in Eq. (20). We can now see that $\tilde{A}^{(2)}$ and $\tilde{A}^{(1)}$ provide an efficient way to approximate $A^{(2)}$ in Eq. (17) and $A^{(1)}$ in Eq. (18), respectively.

Key Components in Hierarchical Attention

Constructing Hierarchical Attention
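The simple example in Fig. 2 can be easily generalized. Eq. (14) is used to coarsen or merge rows in matrices Q, K and V in Eq. (1). For sequence length $L = 2^{M+1}$, the coarsening establishes a binary tree of depth M for Q, K and V, respectively. Each tree node represents a matrix row and there are $2^{M+1-l}$ nodes or rows at level-l. To facilitate the discussion, we define a few hierarchy-related notations here. Let $\tilde{Q}^{(l)}$, $\tilde{K}^{(l)}$ and $\tilde{V}^{(l)}$ be coarsened versions of Q, K and V at level-l in the binary tree. We note that l = 0 is a special case, which is defined as

$$\tilde{Q}^{(0)} = Q, \quad \tilde{K}^{(0)} = K, \quad \tilde{V}^{(0)} = V \tag{24}$$

Following Eq. (14), the recursion to coarsen Q, K and V is:

$$\tilde{Q}^{(l+1)}_j = \frac{1}{2}\left(\tilde{Q}^{(l)}_{2j} + \tilde{Q}^{(l)}_{2j+1}\right) \tag{25}$$

$$\tilde{K}^{(l+1)}_j = \frac{1}{2}\left(\tilde{K}^{(l)}_{2j} + \tilde{K}^{(l)}_{2j+1}\right) \tag{26}$$

$$\tilde{V}^{(l+1)}_j = \tilde{V}^{(l)}_{2j} + \tilde{V}^{(l)}_{2j+1} \tag{27}$$

where l = 0, 1, ..., M − 2 and j = 0, 1, 2, ..., $2^{M-l} - 1$. It should be noted that the coarsening of V in Eq. (27) does not have the averaging factor $\frac{1}{2}$. We defer more details on coarsening to Appendix Section A.1.

Now we are ready to compute the nonzero entries in Eq. (21), (22) and (23) and construct the hierarchical attention matrix $\tilde{A}^{(l)}$. Substituting Eq. (25) and (26) into (4) and then into (3), we obtain

$$\tilde{A}^{(l)}_{ij} = e^{\tilde{S}^{(l)}_{ij}} = e^{\tilde{Q}^{(l)}_i (\tilde{K}^{(l)}_j)^T / \sqrt{d}} \tag{28}$$

Again, we note that l = 0 is a special case because $\tilde{A}^{(0)}_{ij} = A_{ij}$.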
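To make the coarsening recursion concrete, the sketch below coarsens Q and K by pairwise averaging, coarsens V by pairwise summation, and forms coarse scores from the coarsened queries and keys. The helper names and random inputs are assumptions for illustration. The check at the end relies on a simple identity rather than on anything stated in the text: because the score is bilinear, the coarse score between two averaged tokens equals the mean of the corresponding block of fine scores.

```python
import numpy as np

def coarsen_rows(X, average=True):
    """Merge adjacent row pairs; average for Q and K, plain sum for V."""
    merged = X[0::2] + X[1::2]
    return 0.5 * merged if average else merged

def coarse_scores(Q, K, level):
    """Scaled dot-product scores between level-`level` query and key tokens."""
    d = Q.shape[-1]
    Qc, Kc = Q, K
    for _ in range(level):
        Qc, Kc = coarsen_rows(Qc), coarsen_rows(Kc)
    return Qc @ Kc.T / np.sqrt(d)

rng = np.random.default_rng(0)
L, d = 16, 8
Q, K, V = (rng.standard_normal((L, d)) for _ in range(3))

S = Q @ K.T / np.sqrt(d)                      # fine (level-0) scores, 16 x 16
S1 = coarse_scores(Q, K, level=1)             # level-1 scores, 8 x 8
A1 = np.exp(S1)                               # coarse attention entries
V1 = coarsen_rows(V, average=False)           # level-1 values, summed not averaged

# Mean of every 2 x 2 block of fine scores.
block_mean = S.reshape(L // 2, 2, L // 2, 2).mean(axis=(1, 3))
```

Note that the identity holds for the scores, not for the exponentiated entries: `A1` is the exponential of a block-averaged score, which is generally not the block average of the fine attention matrix.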

Applying Hierarchical Attention
The hierarchical matrix structure in Eq. (17), (18) and (19) naturally leads to a hierarchical approach to the matrix-matrix multiplication in Eq. (2) and the matrix-vector multiplication in Eq. (5). We use the matrix-matrix multiplication as an example since matrix-vector multiplication is just a special case of the matrix-matrix multiplication.

Algorithm And Computational Complexity
To facilitate the description and the complexity analysis of the algorithm, we define a few more hierarchy-related notations. In addition to sequence length L, number of hierarchy levels M and embedding or feature size d in Eq. (1), the new notations include: 1) $N_r$: numerical rank of the off-diagonal blocks (for instance, 2 in Eq. (20)). This is also the diagonal block size at level-0; 2) $N_b^{(l)}$: number of blocks at level-l. Note that L and d are usually data-dependent hyper-parameters, while $N_r$ is the only model hyper-parameter responsible for our method's inductive bias. In turn, $N_b^{(l)}$ and M are derived parameters, computed as:

$$N_b^{(0)} = \frac{L}{N_r}, \quad N_b^{(l+1)} = \frac{N_b^{(l)}}{2}, \quad M = \log_2 N_b^{(0)}$$

It is easy to verify that

$$\sum_{l=0}^{M-1} N_b^{(l)} = \sum_{l=0}^{M-1} \frac{N_b^{(0)}}{2^l} < 2 N_b^{(0)}$$

It is important to note that only the diagonal blocks at level-0 and the super-diagonal and sub-diagonal blocks at level-l are needed in applying the hierarchical attention matrix. This is clearly shown in Eq. (21)-(23). This means that only $N_b^{(l)} - 1$ super-diagonal and sub-diagonal blocks are computed at level-l. This is crucial to the overall linear complexity in run time and memory.
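The block bookkeeping can be sketched as follows. The code assumes that the number of level-0 blocks is L / N_r and that it halves at each coarser level, which matches the 16-token example with N_r = 2 (8, 4 and 2 blocks); the function names and the chosen sequence lengths are hypothetical.

```python
def hierarchy_block_counts(seq_len, n_r):
    """Blocks per level, assuming L / N_r blocks at level-0 and halving per level."""
    n_b0 = seq_len // n_r
    num_levels = n_b0.bit_length() - 1      # log2(n_b0) when n_b0 is a power of two
    return [n_b0 >> level for level in range(num_levels)]

def stored_entries(seq_len, n_r):
    """Entries in all computed N_r x N_r blocks of the hierarchical attention."""
    counts = hierarchy_block_counts(seq_len, n_r)
    diagonal = counts[0]                             # level-0 diagonal blocks
    off_diagonal = sum(2 * (n - 1) for n in counts)  # super- and sub-diagonal blocks
    return (diagonal + off_diagonal) * n_r * n_r

counts = hierarchy_block_counts(16, 2)
entries_small = stored_entries(16, 2)

for seq_len in (1024, 2048, 4096):
    # Hierarchical block entries grow roughly linearly; dense entries grow quadratically.
    print(seq_len, stored_entries(seq_len, 16), seq_len * seq_len)
```

Doubling the sequence length roughly doubles the number of stored block entries, whereas the dense attention matrix quadruples.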
We should also note that all matrix blocks in the coarse attention matrix $\tilde{A}^{(l)}$ have the same size $N_r \times N_r$. This is due to the rank map in Eq. (20). This is crucial for efficiency since the single-instruction-multiple-data (SIMD) programming style supported by the dense linear algebra libraries for GPU and TPU encourages uniform tensor shapes.
We summarize the main steps to construct and apply the hierarchical attention in Algorithm 1. The overall run time complexity of the hierarchical attention algorithm is O(dL). Likewise, the memory complexity can be shown to be O(dL) as well. We defer the detailed analysis to Appendix Sections A.5 and A.6.

Table 1: Experimental results on the Long-Range Arena benchmark. The best model is in boldface and the second best is underlined. No model learns anything on the Path-X task, in contrast to the Pathfinder task; this is denoted by FAIL. Path-X is not counted toward the Average score as it has no impact on relative performance.

Experiments And Results
We have implemented the proposed hierarchical attention using Jax, an open source library 5 for automatic gradient computation and linear algebra operations on GPUs and TPUs. All numerical operations in our algorithm use the Numpy native linear algebra functions supported by Jax. In all our experiments in this section, we use the standard Transformer architecture described in (Vaswani et al., 2017) as the backbone for our H-Transformer-1D model. Unless specified otherwise, the model parameters are: number of layers is 6, number of heads is 8, word embedding size is 512 and the feed-forward module (FFN) size is 2048. We follow the API for the standard multihead scaled dot-product attention implementation 6 so that we can perform a simple drop-in replacement of the standard multihead attention with our hierarchical attention implementation. This allows for an easy and fair comparison.

Long-Range Arena
The open-source Long-Range Arena (LRA) benchmark 7 has been proposed as a standard way to probe and quantify the capabilities of various xformer (long-range Transformer) architectures (Tay et al., 2020c). In our case, it also serves to highlight the effectiveness of the inductive bias inspired by the H-Matrix method, as well as the capability of our hierarchical attention to handle long sequences.

5 https://github.com/google/jax
6 https://github.com/google/flax/blob/master/flax/nn
7 https://github.com/google-research/long-range-arena
The LRA has several desirable qualities that made us focus on it as a primary evaluation benchmark: generality (restricted to encoder-only tasks to accommodate most proposals); simplicity (no pretraining, no data augmentation allowed); difficulty (large headroom with existing approaches); long-input focus (so that modeling improvements in this area are visible); diversity (6 tasks, covering math, language, image, and spatial modeling); and light weight (so that modeling improvements are measurable independently of the ability to train and run high-capacity models).
The tasks that comprise LRA are: ListOps (sequences of arithmetical expressions of lengths of up to 2K that test the ability to reason hierarchically while handling long context); Text (byte/character-level text classification at document level, which both simulates longer input sequences, with max length 4K, and increases the difficulty level); Retrieval (byte/character-level document retrieval, which simulates the ability to model document similarity as a score between two independently-encoded long input sequences, with max length 4K + 4K = 8K); Image (image classification based on the CIFAR-10 dataset, where an N × N image is flattened to a sequence of $N^2$ pixels); Pathfinder (long-range spatial dependency task, with images consisting of two small circles and dash-line paths that either connect the two circles or not, with image dimensions of 32 × 32 for a pixel sequence of length 1,024); Path-X (same as Pathfinder, but with image dimensions of 128 × 128 for a total pixel sequence of length 16,384). The default Transformer model parameters, such as the number of layers and the number of heads, are pre-determined by the benchmark configuration for each task. The results obtained by our H-Transformer-1D model on the LRA benchmark are given in Table 1. Overall, the H-Transformer-1D model achieves 61.41 average accuracy, a 6.4-point improvement over the previous-best average performance from BigBird (Zaheer et al., 2020). We want to highlight ListOps, Text and Retrieval because they all involve long sequences and the H-Transformer-1D model improves SOTA performance by relatively large margins. These results are strong evidence in support of our hypothesis in section 5.1 and validate the inductive bias due to the hierarchical attention.

Table 2: Experimental results on the One-Billion Word benchmark. We compare previous SOTA results obtained with models of size 465M-4900M parameters against the performance of the quadratic attention baseline and the H-Transformer-1D models.

Language Models Trained on One-Billion Words
We have used Flax, an open-source library 8 to train neural networks, as the code base for the model training. Our H-Transformer-1D model uses the standard Transformer decoder implementation in Flax as the backbone. Only the attention is replaced with our hierarchical attention.
We trained both the Transformer baseline and H-Transformer-1D on the One-Billion Word benchmark (Chelba et al., 2014). We tried different $N_r$ (numerical rank) in our H-Transformer-1D model. These represent different inductive biases. We found that H-Transformer-1D with $N_r = 16$ generated text with quality comparable to that of the baseline Transformer. For both the Transformer baseline and H-Transformer-1D, we also tried two sets of model parameters: 1) embedding size is 512 and feed-forward module size is 2048, and hence the parameter count is 53M; 2) embedding size is 1024 and feed-forward module size is 4096, and hence the parameter count is 144M. The test perplexity results of these four models and various SOTA models are shown in Table 2. H-Transformer-1D delivers the lowest perplexity to date while using 5× smaller model capacity than that of the previous SOTA model Transformer-XL (Dai et al., 2019). This is further strong evidence in support of our hypothesis in section 5.1 and validates the inductive bias due to the hierarchical attention.

8 https://github.com/google/flax

Conclusions and Future Work
We have proposed a new Transformer attention using the inductive bias inspired by the H-Matrix. The new algorithm has linear complexity in run time and memory usage and is fully compatible with dense linear algebra libraries on GPU and TPU. The effectiveness of this new attention is demonstrated by the empirical evidence from the Long-Range Arena benchmark and One-Billion Word language modeling. Future work includes applying the new attention to music and genomics, developing a proper inductive bias for cross-attention, and extending the one-dimensional hierarchical attention to 2D images.

A.1 Restriction or Coarsening Matrices
For sequence length $L = 2^M$, the coarsening establishes a binary tree of depth M for Q, K and V, respectively. The root of the binary tree at level-(M − 1) has two nodes which correspond to the two matrix rows coarsened from four matrix rows at level-(M − 2). The piecewise constant restriction matrix at level-(M − 2) is

$$R^{(M-2)} = \frac{1}{2} \begin{bmatrix} 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \end{bmatrix} \tag{34}$$

In general, the restriction matrices follow the recursion

$$R^{(l-1)} = \begin{bmatrix} R^{(l)} & 0 \\ 0 & R^{(l)} \end{bmatrix} \tag{36}$$

which starts from $R^{(M-2)}$ of size 2 × 4 and goes backward to $R^{(0)}$ of size $\frac{L}{2} \times L$.

A.2 Interpolation Matrices
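Given $Y^{(l)}$ at level-l, the interpolated $Y^{(l-1)}$ at level-(l − 1) can be written as

$$Y^{(l-1)} = P^{(l)} Y^{(l)} \tag{37}$$

where l = 1, 2, ..., M − 1, sparse matrix $P^{(l)}$ has size $L^{(l-1)} \times L^{(l)}$, and $L^{(l)} = 2^{M-l}$ is the node count at level-l of the binary tree. This recursion also follows the binary tree hierarchy. The four matrix rows at level-(M − 2) are interpolated from the two matrix rows at level-(M − 1). Specifically, the piecewise constant interpolation matrix at level-(M − 1) is

$$P^{(M-1)} = \begin{bmatrix} 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \end{bmatrix} \tag{38}$$

Likewise, the piecewise constant interpolation matrix at level-(M − 2) is

$$P^{(M-2)} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \end{bmatrix} \tag{39}$$

In general, the interpolation matrices follow the recursion

$$P^{(l-1)} = \begin{bmatrix} P^{(l)} & 0 \\ 0 & P^{(l)} \end{bmatrix} \tag{40}$$

which starts from $P^{(M-1)}$ of size 4 × 2 and goes backward to $P^{(1)}$ of size $L \times \frac{L}{2}$. In view of Eq. (34) and (38), it is obvious that

$$P^{(M-1)} = 2 \left(R^{(M-2)}\right)^T \tag{41}$$

In view of the recursions in Eq. (36) and (40), it is easy to prove by induction that

$$P^{(l)} = 2 \left(R^{(l-1)}\right)^T \tag{42}$$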
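The relationship between the piecewise-constant restriction and interpolation operators can be checked numerically. The sketch below builds both as explicit dense matrices with Kronecker products, purely for illustration; the function names are hypothetical, and reading the relation between the two operators as "interpolation equals twice the transposed restriction" is an assumption inferred from the pairwise-average and duplication definitions, not a formula taken verbatim from the text.

```python
import numpy as np

def restriction(num_fine):
    """Pairwise-averaging matrix of shape (num_fine / 2, num_fine)."""
    return 0.5 * np.kron(np.eye(num_fine // 2), np.ones((1, 2)))

def interpolation(num_fine):
    """Duplication matrix of shape (num_fine, num_fine / 2)."""
    return np.kron(np.eye(num_fine // 2), np.ones((2, 1)))

R4, P4 = restriction(4), interpolation(4)     # 2 x 4 and 4 x 2
R8, P8 = restriction(8), interpolation(8)     # 4 x 8 and 8 x 4

# Block-diagonal doubling: the operator for 8 fine nodes is two copies
# of the operator for 4 fine nodes placed on the diagonal.
R8_from_R4 = np.kron(np.eye(2), R4)
P8_from_P4 = np.kron(np.eye(2), P4)
```

In an actual implementation these matrices would not be formed; slicing and `repeat` operations achieve the same effect without the explicit products.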

A.3 Expansion Matrices
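For the purpose of factored low-rank approximation for the off-diagonal attention matrix blocks, we design a series of so-called expansion matrices. The first two expansion matrices in this series are

$$T^{(M-1)} = \begin{bmatrix} 1_2 & 0 \\ 0 & 1_2 \end{bmatrix} \tag{43}$$

and

$$T^{(M-2)} = \begin{bmatrix} 1_4 & 0 \\ 0 & 1_4 \end{bmatrix} \tag{44}$$

where $1_N$ is a length-N vector of ones. The general form of matrix $T^{(l)}$ is defined in Eq. (45), where l = 1, 2, ..., M − 1. In view of Eq. (43), (45) and (40), it is easy to prove by induction that

$$T^{(l)} = \begin{bmatrix} 1_{2^{M-l}} & 0 \\ 0 & 1_{2^{M-l}} \end{bmatrix} \tag{46}$$

and it has size $2^{M-l+1} \times 2$. Furthermore, in view of Eq. (45) and (42), we obtain Eq. (47).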

A.4 Low-Rank Factored Form
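Matrix $T^{(l)}$ plays a pivotal role in constructing the low-rank approximation to the off-diagonal attention matrix blocks. Let the ij-th block in the coarsened attention matrix at level-1 be

$$\tilde{A}^{(1)}_{ij} = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \tag{48}$$

where $a_{ij}$ is the entry resulting from the inner product between a row in $\tilde{Q}^{(1)}$ and a row in $\tilde{K}^{(1)}$. The rank-2 approximation to the corresponding ij-th block in the original attention matrix A at level-1 can be written as

$$A^{(1)}_{ij} \approx T^{(M-1)} \tilde{A}^{(1)}_{ij} \left(T^{(M-1)}\right)^T \tag{49}$$

$$= \begin{bmatrix} a_{11} & a_{11} & a_{12} & a_{12} \\ a_{11} & a_{11} & a_{12} & a_{12} \\ a_{21} & a_{21} & a_{22} & a_{22} \\ a_{21} & a_{21} & a_{22} & a_{22} \end{bmatrix} \tag{50}$$

It is clear that the resulting 4 × 4 matrix $A^{(1)}_{ij}$ is essentially the piecewise constant interpolation of the 2 × 2 matrix $\tilde{A}^{(1)}_{ij}$ and necessarily has rank 2. One can also view $a_{ij}$ as being similar to the average value at the ij-th cluster center in the K-means method. The role of matrix $T^{(M-1)}$ is to expand from these 2 × 2 clusters to the 4 × 4 grid, hence the name expansion matrix.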
Since we maintain the same numerical rank 2 for all super- and sub-diagonal attention matrix blocks, the rank-2 approximation to the ij-th block in the original attention matrix A at level-l is given by Eq. (51), where the last equality is due to Eq. (45) and (47). We note that matrix $T^{(l)}$ has full column rank 2 by design, and this can be easily shown from Eq. (46). We have used this fact to construct the rank-2 approximation in Eq. (51).
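The expansion step can be illustrated with a small numerical sketch. It assumes the expansion matrix is the two-column matrix whose columns are the indicator vectors of the upper and lower halves of a block (a block-diagonal arrangement of all-ones vectors), which matches the stated size and full column rank; the function name and the sample 2 × 2 coarse block are hypothetical.

```python
import numpy as np

def expansion_matrix(block_size):
    """Two-column matrix [[1_n, 0], [0, 1_n]] with n = block_size // 2."""
    n = block_size // 2
    return np.kron(np.eye(2), np.ones((n, 1)))

A_coarse = np.array([[1.0, 2.0],
                     [3.0, 4.0]])             # 2 x 2 coarse attention block

T4 = expansion_matrix(4)                      # shape (4, 2)
A_block4 = T4 @ A_coarse @ T4.T               # 4 x 4 piecewise-constant block

T8 = expansion_matrix(8)                      # shape (8, 2)
A_block8 = T8 @ A_coarse @ T8.T               # 8 x 8 block, still rank 2
```

The expanded blocks are never needed explicitly when applying the attention: multiplying by the expanded block is equivalent to summing the value rows in each half, multiplying by the 2 × 2 coarse block, and duplicating the result.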

A.5 Construct Hierarchical Attention Matrix
To see how Eq. (51) can be used, consider a simple three-level partition of the attention matrix A for sequence length L = 16, where the size of level-0, level-1 and level-2 matrix blocks is 2 × 2, 4 × 4 and 8 × 8, respectively. Note that the number of levels is $M = \log_2(L/2) = 3$. We use this simple three-level example to illustrate the key steps in both constructing and applying the hierarchical attention matrix. In view of Eq. (51), we obtain the level-wise approximations in Eq. (55), (56) and (57). We note that matrices $T^{(l)}$, l = 1, 2 are never explicitly formed and are only implicitly used, as shown in the next section. So only the diagonal blocks at level-0 and the super- and sub-diagonal blocks of the coarsened matrix $\tilde{A}$ at level-l need to be explicitly computed. By design, all these blocks have the same size 2 × 2 if we set the numerical rank to $N_r = 2$. The total number of super- and sub-diagonal blocks in the binary tree hierarchy is upper bounded by twice the number of super- and sub-diagonal blocks at level-0, which is 2N

A.6 Apply Hierarchical Attention Matrix
Computing the matrix-matrix product AV follows the hierarchical structure of matrix A in Eq. (55), (56) and (57). We first partition matrix V according to the three-level binary tree established by the coarsening process, as in Eq. (58). Note that these are partitions of the same matrix V at 3 different levels. For sequence length L = 16, matrix V has size 16 × d, and the size of the partitioned blocks V So we may replace $V^{(2)}_1$ with the right-hand side in Eq. (59).