MATE: Multi-view Attention for Table Transformer Efficiency

This work presents a sparse-attention Transformer architecture for modeling documents that contain large tables. Tables are ubiquitous on the web and rich in information. However, more than 20% of relational tables on the web have 20 or more rows (Cafarella et al., 2008), and these large tables present a challenge for current Transformer models, which are typically limited to 512 tokens. Here we propose MATE, a novel Transformer architecture designed to model the structure of web tables. MATE uses sparse attention in a way that allows heads to efficiently attend to either rows or columns in a table. Its runtime and memory scale linearly in the sequence length, and it can handle documents containing more than 8000 tokens on current accelerators. MATE also has a more appropriate inductive bias for tabular data, and sets a new state of the art for three table reasoning datasets. On HybridQA (Chen et al., 2020), a dataset that involves large documents containing tables, we improve the best prior result by 19 points.


Introduction
The Transformer architecture (Vaswani et al., 2017) is expensive to train and run at scale, especially for long sequences, due to the quadratic asymptotic complexity of self-attention. Although some work addresses this limitation (Kitaev et al., 2020; Zaheer et al., 2020), there has been little prior work on scalable Transformer architectures for semi-structured text. However, although some of the more widely used benchmark tasks involving semi-structured data have been restricted to moderate-size tables, many semi-structured documents are large: more than 20% of relational tables on the web have 20 or more rows (Cafarella et al., 2008), and these would pose a problem for typical Transformer models.

Figure 1: Sparse self-attention heads on tables in MATE are of two classes: row heads attend to tokens inside cells in the same row, as well as the query. Column heads attend to tokens in the same column and in the query. Query tokens attend to all other tokens.
Here we study how efficient implementations for transformers can be tailored to semi-structured data. Figure 1 highlights our main motivation through an example: to obtain a contextual representation of a cell in a table, it is unlikely that the information in a completely different row and column is needed.
We propose the MATE architecture (Section 3), which allows each attention head to reorder the input so as to traverse the data from multiple points of view, namely column- or row-wise (Figure 2). This allows each head to have its own data-dependent notion of locality, which enables the use of sparse attention in an efficient and context-aware way.
This work focuses on question answering (QA) and entailment tasks on tables. While we apply our model to several such tasks (see Section 6), HYBRIDQA (Chen et al., 2020b) is particularly interesting, as it requires processing tables jointly with long passages associated with entities mentioned in the table, yielding large documents that may not fit in standard Transformer models.

Figure 2: Efficient implementation for MATE, schematically attention(·, concat([roll(·, shift=-1, axis=-2), roll(·, shift=+1, axis=-2), ·], axis=-1)). Each attention head reorders the tokens by either column or row index and then applies a windowed attention mechanism. The figure omits the global section, which attends to and from all other tokens. Since column/row order can be pre-computed, the method is linear for a constant block size.
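To make the mechanism of Figure 2 concrete, the following sketch reorders table tokens row- or column-major and restricts attention scores to a window around each position in the reordered sequence. It is a simplified NumPy illustration, not the released TensorFlow code: `reorder_for_head` and `windowed_scores` are hypothetical helper names, and the global section is omitted.

```python
import numpy as np

def reorder_for_head(token_rows, token_cols, head_type):
    """Return a permutation that traverses table tokens row-major
    ("row" heads) or column-major ("column" heads), so that a windowed
    attention sees whole rows/columns as contiguous spans."""
    if head_type == "row":
        keys = list(zip(token_rows, token_cols))
    else:  # column head: column-major traversal
        keys = list(zip(token_cols, token_rows))
    return sorted(range(len(keys)), key=lambda i: keys[i])

def windowed_scores(x, order, radius):
    """Attention scores restricted to a +/- radius window in the
    reordered sequence; disallowed pairs get -inf before the softmax."""
    n = len(order)
    inv = np.empty(n, dtype=int)
    inv[order] = np.arange(n)          # position of each token after reordering
    mask = np.abs(inv[:, None] - inv[None, :]) <= radius
    scores = x @ x.T                   # toy scores; real model uses Q/K projections
    return np.where(mask, scores, -np.inf)
```

Because the permutation depends only on the (pre-computed) row and column indices, it adds no quadratic cost.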
Overall, our contributions are the following: i) We show that table transformers naturally focus attention according to rows and columns, and that constraining attention to enforce this improves accuracy on three table reasoning tasks, yielding new state-of-the-art results in SQA and TABFACT.
ii) We introduce MATE, a novel transformer architecture that exploits table structure to allow running training and inference in longer sequences. Unlike traditional self-attention, MATE scales linearly in the sequence length.
iii) We propose POINTR (Section 4), a novel two-phase framework that exploits MATE to tackle large-scale QA tasks, like HYBRIDQA, that require multi-hop reasoning over tabular and textual data. We improve the state-of-the-art by 19 points.
All the code is available as open source at github.com/google-research/tapas.

Related Work
Transformers for tabular data Traditionally, tasks involving tables were tackled by searching for logical forms in a semantic parsing setting. More recently, Transformers (Vaswani et al., 2017) have been used to train end-to-end models on tabular data as well (Chen et al., 2020a). For example, TAPAS (Herzig et al., 2020) relies on Transformer-based masked language model pre-training and special row and column embeddings to encode the table structure. Chen et al. (2021) use a variant of ETC on an open-domain version of HYBRIDQA to read and choose an answer span from multiple candidate passages and cells, but the proposed model does not jointly process the full table with passages. To overcome the limitations on sequence length, prior work has proposed heuristic column selection techniques and pre-training on synthetic data, as well as a model-based cell selection technique that is differentiable and trained end-to-end together with the main task model. Our approach is orthogonal to these methods, and can be usefully combined with them, as shown in Table 4.
Recently, Zhang et al. (2020) proposed SAT, which uses an attention mask to restrict attention to tokens in the same row and same column. SAT also computes an additional histogram row appended at the bottom of the table, and encodes the table content as text only (unlike TAPAS). The proposed masking is not head-dependent as ours is, which prevents it from being implemented efficiently to scale to larger sequence lengths. Controlling for model size and pre-training for a fair comparison, we show that our model is both faster (Table 4) and more accurate (Table 6) than SAT.

Efficient Transformers
There is prior work that tries to improve the asymptotic complexity of the self-attention mechanism in transformers. Tay et al. (2020) review the different methods and cluster them based on the nature of the approach. We cover some of the techniques below and show a theoretical complexity comparison in Table 1.
The LINFORMER model (Wang et al., 2020) uses learned projections to reduce the sequence length axis of the key and value vectors to a fixed length. The projections are anchored to a specific input length, which makes adapting the sequence length between pre-training and fine-tuning challenging, and makes the model more sensitive to position offsets in sequences of input tokens. REFORMER (Kitaev et al., 2020) uses locality sensitive hashing to reorder the input tokens at every layer in such a way that similar contextual embeddings have a higher chance of being clustered together. We instead rely on the structure of the input data to define ways to cluster the tokens. REFORMER was originally defined for auto-regressive training, although this limitation can be circumvented by adapting the architecture. Ainslie et al. (2020) introduce ETC, a framework for global memory and local sparse attention, and use the mechanism of relative positional attention (Dai et al., 2019) to encode hierarchy. ETC was applied to large document tasks such as Natural Questions (Kwiatkowski et al., 2019). The method does not allow dynamic or static data re-ordering that differs across heads. In practice, we have observed that the use of relative positional attention introduces a large overhead during training. BIGBIRD (Zaheer et al., 2020) presents a similar approach with the addition of attention to random tokens.

The MATE model
Following TAPAS (Herzig et al., 2020), the transformer input in MATE, for each table-QA example, is the query and the table, tokenized and flattened, separated by a [SEP] token, and prefixed by a [CLS] token. Generally the table comprises most of the input. We use the same row, column and rank embeddings as TAPAS.
To restrict attention between the tokens in the table, we propose having some attention heads limited to attending between tokens in the same row (plus the non-table tokens), and likewise for columns. We call these row heads and column heads respectively. In both cases, we allow attention to and from all the non-table tokens. Formally, if X ∈ ℝ^{d×n} is the input tensor for a Transformer layer with sequence length n, the k-th position of the output of the i-th attention head is

head^i_k = σ( (W^i_Q X_k)^⊤ (W^i_K X_{A^i_k}) / √d_h ) (W^i_V X_{A^i_k})^⊤,

where W^i_Q, W^i_K and W^i_V are query, key and value projections respectively, σ is the softmax operator, and A^i_k ⊆ {1, …, n} represents the set of tokens that position k can attend to, also known as the attention pattern. Here X_{A^i_k} denotes gathering from X only the indexes in A^i_k. When A^i_k contains all positions (except padding) for all heads i and token indexes k, we recover the standard dense transformer. For a token position k, we define r_k, c_k ∈ ℕ_0 as the row and column number, which is set to 0 if k belongs to the query set Q: the set of token positions in the query text, including the [CLS] and [SEP] tokens.
In MATE, we use two types of attention patterns: the first h_r ≥ 0 heads are row heads and the remaining h_c heads are column heads. For a row head i,

A^i_k = {1, …, n} if k ∈ Q, and A^i_k = { j : r_j = r_k } ∪ Q otherwise,

and analogously for column heads with c in place of r. One possible implementation of this is an attention mask that selectively sets elements in the attention matrix to zero. (Similar masks are used for padding tokens, or auto-regressive text generation.) The ratio of row and column heads is a hyper-parameter, but empirically we found a 1:1 ratio to work well. In Section 6 we show that attention masking improves accuracy on four table-related tasks. We attribute these improvements to a better inductive bias, and support this in Section 7 by showing that full attention models learn to approximate this behavior.
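The mask-based variant of these attention patterns can be sketched as a boolean matrix built from row/column indices and a query-token flag. This is an illustrative helper (`head_mask` is a hypothetical name, not from the released code):

```python
import numpy as np

def head_mask(rows, cols, in_query, head_type):
    """Boolean attention mask M[k, j]: True iff token k may attend to j.
    Row heads: same row, or either endpoint is a query token; column
    heads analogously. Query tokens thus attend to and from everything."""
    rows, cols = np.asarray(rows), np.asarray(cols)
    q = np.asarray(in_query, dtype=bool)
    if head_type == "row":
        same = rows[:, None] == rows[None, :]
    else:
        same = cols[:, None] == cols[None, :]
    return same | q[:, None] | q[None, :]
```

At attention time, positions where the mask is False are set to a large negative value before the softmax, exactly as padding masks are handled.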

Efficient implementation
Although row- and column-related attention masking improves accuracy, it does not improve Transformer efficiency: despite the restricted attention, the Transformer still uses quadratic memory and time. We thus also present an approximation of row and column heads that can be implemented more efficiently. Inspired by ETC (Ainslie et al., 2020), the idea is to divide the input into a global part of length G that attends to and from everything, and a local (typically longer) part that attends only to the global section and some radius R around each token in the sequence. ETC does this based on a fixed token order. The key insight in MATE is that the notion of locality can be configured differently for each head: one does not need to choose a specific traversal order for tokens ahead of time; instead, tokens can be ordered in a data-dependent (but deterministic) way. In particular, row heads can order the input according to a row-order traversal of the table, and column heads can use a column-order traversal. The architecture is shown in Figure 2.
After each head has ordered its input we split off the first G tokens and group the rest in evenly sized buckets of length R. By reshaping the input matrix in the self-attention layer to have R as the last dimension, one can compute attention scores from each bucket to itself, or similarly from each bucket to an adjacent one. Attention is further restricted with a mask to ensure row heads and column heads don't attend across rows and columns respectively. See model implementation details in Appendix D. When G is large enough to contain the question part of the input and R is large enough to fit an entire column or row, then the efficient implementation matches the mask-based one.
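The bucketing arithmetic can be sketched as follows. This is a simplified, hypothetical helper that only computes which index ranges each local bucket may attend to (itself and its adjacent buckets, given a global prefix of size G and bucket length R); the row/column masking within buckets and the reshaping into tensors are omitted, as in Appendix D.

```python
def bucket_attention_pairs(n, global_size, radius):
    """For a reordered sequence of length n, return one tuple per local
    bucket: (bucket_start, bucket_end, attend_lo, attend_hi), where the
    bucket's tokens attend to [attend_lo, attend_hi) plus the global
    prefix [0, global_size). Buckets see themselves and their neighbors."""
    local = n - global_size
    assert local % radius == 0, "pad the sequence so buckets divide evenly"
    n_buckets = local // radius
    pairs = []
    for b in range(n_buckets):
        lo = global_size + max(b - 1, 0) * radius
        hi = global_size + min(b + 2, n_buckets) * radius
        pairs.append((global_size + b * radius,
                      global_size + (b + 1) * radius, lo, hi))
    return pairs
```

Since each bucket attends to at most 3R + G positions, total attention cost is linear in n for fixed R and G.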
As observed in prior work, asymptotic complexity improvements often do not materialize for small sequence lengths, given the overhead of tensor reshaping and reordering. The exact break-even point depends on several factors, including accelerator type and size as well as batch size. In the experiments below, the better of the two functionally equivalent implementations of MATE is chosen for each use case.

Compatibility with BERT weights
The sparse attention mechanism of MATE adds no additional parameters. As a consequence, a MATE checkpoint is compatible with any BERT or TAPAS pre-trained checkpoint. Following Herzig et al. (2020) we obtained best results running the same masked language model pre-training used in TAPAS with the same data but using the sparse attention mask of MATE.
For sequence lengths longer than 512 tokens, we reset the index of the positional embeddings at the beginning of each cell. This removes the need to learn positional embeddings for larger indexes as the maximum sequence length grows, while avoiding the large computational cost of relative positional embeddings.
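A minimal sketch of this positional reset, assuming each token is tagged with the index of the cell it belongs to (the function name is illustrative):

```python
def cell_relative_positions(cell_ids):
    """Position ids that restart at 0 at the beginning of every cell,
    so long inputs never need positional embeddings beyond the length
    of the longest cell. `cell_ids` gives each token's cell index."""
    positions, prev, p = [], None, 0
    for c in cell_ids:
        p = 0 if c != prev else p + 1   # reset on cell boundary
        positions.append(p)
        prev = c
    return positions
```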

Universal approximators
Yun et al. (2020a) showed that Transformers are universal approximators for any continuous sequence-to-sequence function, given sufficient layers. This result was further extended by Yun et al. (2020b) and Zaheer et al. (2020) to some sparse Transformers under reasonable assumptions. However, prior work limits itself to the case of a single attention pattern per layer, whereas MATE uses different attention patterns depending on the head. We show that MATE is nonetheless a universal approximator for sequence-to-sequence functions.
Formally, let F be the class of continuous functions f : D ⊂ ℝ^{d×n} → ℝ^{d×n}, with D compact, equipped with the p-norm ‖·‖_p. Let T_MATE be any family of transformer models with a fixed set of hyper-parameters (number of heads, hidden dimension, etc.) but with an arbitrary number of layers. Then we have the following result.
Theorem 1. If the number of heads is at least 3 and the hidden size of the feed-forward layer is at least 4, then for any f ∈ F and ε ∈ ℝ⁺ there exists f̂ ∈ T_MATE such that ‖f − f̂‖_p < ε.

See Appendix C for a detailed proof, which relies on the fact that 3 heads guarantee at least two heads of the same type. The problem can then be reduced to the results of Yun et al. (2020b).

The POINTR architecture
Many standard table QA datasets (Pasupat and Liang, 2015; Chen et al., 2020a; Iyyer et al., 2017), perhaps by design, use tables that can be limited to 512 tokens. Recently, more datasets (Kardas et al., 2020; Talmor et al., 2021) that require parsing larger semi-structured documents have been released. Among them, we focus on HYBRIDQA (Chen et al., 2020b), which uses Wikipedia tables with entity links, with answers taken from either a cell or a hyperlinked paragraph. Dataset statistics are shown in Table 2. Each question is paired with a table containing on average 70 cells and 44 linked entities. Each entity is represented by the first 12 sentences of its Wikipedia description, averaging 100 tokens. The answer is often a span extracted from the table or paragraphs, but the dataset has no ground-truth annotations on how the span was obtained, leaving around 50% of examples ambiguous, with more than one possible answer source. The total number of word pieces needed to cover the table, question and entity descriptions grows to more than 11,000 if one intends to cover more than 90% of the examples, well beyond the limit of traditional transformers.

To apply sparse transformers to the HYBRIDQA task, we propose POINTR, a two-stage framework similar in spirit to open-domain QA pipelines (Chen et al., 2017). We expand the content of each cell by appending the descriptions of its linked entities. The two stages of POINTR correspond to (Point)ing to the correct expanded cell and then (R)eading a span from it. See Figure 3 for an example. Full set-up details are discussed in Appendix A.

POINTR: Cell Selection Stage
In the first stage we train a cell selection model using MATE whose objective is to select the expanded cell that contains the answer. Although MATE accepts the full table as input, expanding every cell with all of its passages is impractical. Instead, we consider only the top-k sentences of the entity descriptions for expansion, ranked by a TF-IDF score against the query. Using k = 5, we can fit 97% of the examples into 2048 tokens; for the remaining examples, we truncate the longest cells uniformly until they fit in the budget.
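The sentence-expansion step can be sketched as below. The paper does not specify the exact TF-IDF weighting, so the scoring here (log-scaled IDF over whitespace tokens) is an illustrative assumption, and `topk_sentences` is a hypothetical name:

```python
import math
from collections import Counter

def topk_sentences(query, sentences, k=5):
    """Rank candidate description sentences by TF-IDF overlap with the
    query and keep the top k, mirroring the cell-expansion step."""
    docs = [s.lower().split() for s in sentences]
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))          # document frequency
    idf = {w: math.log(n / df[w]) + 1.0 for w in df}
    q = set(query.lower().split())
    def score(d):
        tf = Counter(d)
        return sum(tf[w] * idf[w] for w in q if w in tf)   # TF * IDF over query terms
    ranked = sorted(range(n), key=lambda i: -score(docs[i]))
    return [sentences[i] for i in ranked[:k]]
```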
The logit score S for each cell c is obtained by mean-pooling the logits of the tokens t inside it; each token logit is in turn the result of applying a single linear layer to the token's contextual representation produced by MATE over the query q and the expanded table e.
We use a cross entropy loss to train the model to select expanded cells that contain the answer span. Even though the correct span may appear in multiple cells or passages, in practice many of these appearances are coincidental and do not correspond to a reasoning path consistent with the question asked. In Figure 3, for instance, there could be other British divers, but we are only interested in selecting the cell marked with a star. To handle these cases we rely on Maximum Marginal Likelihood (MML; Berant et al., 2013). As shown by Guu et al. (2017), MML can be interpreted as using the online model predictions (without gradients) to compute a soft label distribution over candidates. For an input query x and a set C of candidate cells containing the answer span, the loss is

L(x) = − Σ_{z ∈ C} q(z) log p_Θ(z | x),

with q(z) = p_Θ(z | x, z ∈ C) the probability distribution given by the model restricted to the candidate cells, taken here as a constant with zero gradient.
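A NumPy sketch of this MML objective, treating the candidate-restricted distribution q as a constant (in the real model no gradient flows through q); `mml_loss` is an illustrative name:

```python
import numpy as np

def mml_loss(logits, candidate_mask):
    """Maximum Marginal Likelihood over candidate cells: the soft
    labels q(z) are the model's own distribution renormalized over the
    candidates, and the loss is a soft cross entropy against them."""
    logits = np.asarray(logits, dtype=float)
    p = np.exp(logits - logits.max())
    p /= p.sum()                                # p(z | x) over all cells
    mask = np.asarray(candidate_mask, dtype=bool)
    q = np.where(mask, p, 0.0)
    q = q / q.sum()                             # q(z) = p(z | x, z in C), stop-gradient
    return -np.sum(q * np.log(p))
```

With a single candidate, q collapses to a one-hot label and the loss reduces to ordinary cross entropy.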

POINTR: Passage Reading Stage
In the second stage we develop a span selection model that reads the answer from a single expanded cell selected by the POINTR cell selector. To construct the expanded cell for each example, we concatenate the cell content with all the sentences of the linked entities and keep the first 512 tokens. Following recent neural machine reading work (Chen et al., 2017; Herzig et al., 2021), we fine-tune a pre-trained BERT-large uncased model that predicts a text span from the text in a given table cell c (and its linked paragraphs) and the input query q. We compute a span representation as the concatenation of the contextual embeddings of the first and last tokens of a span s, and score it using a multi-layer perceptron. A softmax is computed over valid spans in the input, and the model is trained with a cross entropy loss. If the span text appears multiple times in a cell, we consider only the first appearance. To compute EM and F1 scores during inference, we evaluate the trained reader on the highest ranked cell predicted by the POINTR cell selector, using the official evaluation script.
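The span scoring scheme can be sketched as follows, with random stand-in MLP weights (the real reader learns them) and a hypothetical `max_len` cap on span length for brevity:

```python
import numpy as np

def score_spans(token_emb, max_len=3):
    """Enumerate spans, represent each by concatenating the contextual
    embeddings of its first and last tokens, score with a one-hidden-
    layer MLP, and softmax over all candidate spans."""
    rng = np.random.default_rng(0)
    d = token_emb.shape[1]
    w1 = rng.normal(size=(2 * d, d))    # stand-in MLP weights
    w2 = rng.normal(size=(d,))
    spans, scores = [], []
    for i in range(len(token_emb)):
        for j in range(i, min(i + max_len, len(token_emb))):
            rep = np.concatenate([token_emb[i], token_emb[j]])
            scores.append(np.tanh(rep @ w1) @ w2)
            spans.append((i, j))
    s = np.array(scores)
    probs = np.exp(s - s.max())
    probs /= probs.sum()                # softmax over valid spans
    return spans, probs
```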

Experimental Setup
We begin by comparing the performance of MATE on HYBRIDQA to other existing systems, focusing on prior efficient transformers so as to isolate the benefits of table-specific sparsity. We follow Herzig et al. (2020) in reporting error bars with the interquartile range.

Baselines
The first baselines for HYBRIDQA are Table-Only and Passage-Only, as defined in Chen et al. (2020b). Each uses only the part of the input indicated in its name, but not both at the same time. Next, the HYBRIDER model from the same authors consists of four stages: entity linking, cell ranking, cell hopping, and finally a reading comprehension stage equivalent to our final stage. The first three stages are jointly covered by our single cell selection stage; hence, we use their reported error rates to estimate the retrieval rate. The simpler approach enabled by MATE avoids error propagation and yields improved results.
We also consider two recent efficient transformer architectures as alternatives for the POINTR cell selector, one based on LINFORMER (Wang et al., 2020) and one based on ETC (Ainslie et al., 2020). In both cases we preserve the row, column and rank embeddings introduced by Herzig et al. (2020). LINFORMER learns a projection matrix that reduces the sequence length dimension of the key and value tensors to a fixed length of 256 (which performed better than 128 and 512 in our tests). ETC is a general architecture that requires some choices about how to allocate global memory and local attention. Here we use a 256-token global memory to summarize the content of each cell, assigning each token in the first half of the global memory to a row, and each token in the second half to a column. Tokens in the input use a special relative positional value (Dai et al., 2019) to mark when they are interacting with their corresponding global row or column memory position. We refer to this model as TABLEETC.
Finally, we consider two non-efficient models: a simple TAPAS model without any sparse mask, and an SAT (Zhang et al., 2020) model pre-trained on the same MLM task as TAPAS for a fair comparison. For the cell selection task TAPAS obtains results similar to MATE, but both TAPAS and SAT lack the efficiency improvements of MATE.

Other datasets
We also apply MATE to three other datasets involving tables to demonstrate that the sparse attention bias yields stronger table reasoning models. SQA (Iyyer et al., 2017) is a sequential QA task, WIKITQ (Pasupat and Liang, 2015) is a QA task that sometimes also requires aggregation over table cells, and TABFACT (Chen et al., 2020a) is a binary entailment task. See Table 3 for dataset statistics. We evaluate with and without the intermediate pre-training tasks (CS).

Results
In Figure 4 we compare the inference speed of different models as we increase the sequence length. Similar results showing the number of FLOPS and memory usage are in Appendix A. The linear scaling of LINFORMER and of the linear-time version of MATE can be seen clearly. Although LINFORMER has a slightly smaller linear constant, its pre-training is 6 times slower, since, unlike the other models, LINFORMER pre-training must be done at the final sequence length of 2048.

Table 5 shows the end-to-end results of our system using POINTR with MATE on HYBRIDQA, compared to the previous state-of-the-art as well as the other efficient transformer baselines from Section 5. MATE outperforms the previous SOTA HYBRIDER by over 19 points, and LINFORMER, the next best efficient-transformer system, by over 2.5 points, for both exact-match accuracy and F1.

We also applied MATE to three tasks involving table reasoning over shorter sequences. In Table 4 we see that MATE provides improvements in accuracy, which we attribute to a better inductive bias for tabular data. When combining MATE with Counterfactual + Synthetic intermediate pre-training (CS) we often get even better results. For TABFACT and SQA we improve over the previous state-of-the-art. For WIKITQ we close the gap with the best published system, TABERT (Yin et al., 2020) (51.8 mean test accuracy), which relies on traditional semantic parsing instead of an end-to-end approach. Dev results show a similar trend and can be found in Appendix B. No special tuning was done on these models: we used the same hyper-parameters as the open source release of TAPAS.

Table 5: Results of different large transformer models on HYBRIDQA. The In-Table and In-Passage subsets refer to the location of the answer. For dev, we report errors over 5 runs using half the interquartile range.
Since the test set is hidden and hosted online, we report the results corresponding to the model with the median total EM score on dev.
Many errors are near misses, where the prediction differs from the gold answer only superficially (e.g., Q: What round was the Oklahoma athlete drafted in? Gold: "second", Predicted: "second round"). While around 30% of such misses involved numerical answers (e.g., "1" vs. "one"), most of the rest (58% of the near misses) either had redundant auxiliary words or were missing them (e.g., Q: What climate is the northern part of the home country of Tommy Douglas? G: "Arctic", P: "Arctic climate"). The inconsistent gold-answer format and the unavailability of multiple gold answers are potential causes here.
Among the errors that are not near misses, the majority of predictions were either numerically incorrect, or referenced an incorrect entity in an otherwise relevant context, especially for questions involving more than 2 hops (e.g., Q: In which sport has an award been given every three years since the first tournament held in 1948-1949? G: "Badminton", P: "Thomas Cup"). Reassuringly, for a large majority (> 80%), the entity type of the predicted answer (person, date, place, etc.) matches the type of the gold answer. The observed errors suggest potential gains from improving the entity (Xiong et al., 2020) and numerical (Andor et al., 2019) reasoning skills of the model.

In Table 6 we compare architectures for cell selection on HYBRIDQA. Hits@k indicates whether a cell containing an answer span was among the top-k retrieved candidates. As ablations, we remove the sparse pre-training and try using only row or only column heads. We also observe a drop when we discard the ambiguous examples from training instead of using MML to deal with them. Unlike on the other datasets, TAPAS shows comparable results to MATE here, but without any of the theoretical and practical efficiency improvements.

Observed Attention Sparsity. To motivate our choice of how to sparsify the attention matrix, we inspect the magnitude of attention connections in a trained dense TAPAS model for table question answering. It is important to note that in this context we are not using attention as an explanation method (Jain and Wallace, 2019; Wiegreffe and Pinter, 2019). Instead, we treat the attention matrix in the fashion of magnitude-based pruning techniques (Han et al., 2015; See et al., 2016), and simply consider between which pairs of tokens the scores are concentrated.
Given a token in the input, we aggregate the attention weights flowing from it depending on the position of the target token in the input (CLS token, question, header, or table) and on whether the source and target tokens are in the same column or row, where applicable. We average scores across all tokens, heads, layers and examples in the development set. As a baseline, we also compare against the output of the same process when using a uniform attention matrix, discarding padding. In Table 7, we show the obtained statistics considering only table tokens as a source, using the WIKITQ development set as a reference. While 23% of the attention weight goes to tokens in different columns and rows, this is only about one third of the number one would obtain with a uniform attention matrix. This effect corroborates the approach taken in MATE.
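The aggregation described above can be sketched as a small diagnostic over one attention matrix (a hypothetical helper; note that the same-row and same-column buckets overlap on the diagonal, so the fractions need not sum to 1):

```python
import numpy as np

def attention_by_category(attn, rows, cols, is_table):
    """Fraction of attention mass flowing from table tokens to targets
    in the same row, same column, or elsewhere in the table."""
    rows, cols = np.asarray(rows), np.asarray(cols)
    tbl = np.asarray(is_table, dtype=bool)
    src = attn[tbl]                                 # only table tokens as sources
    same_row = rows[tbl][:, None] == rows[None, :]
    same_col = cols[tbl][:, None] == cols[None, :]
    in_table = np.broadcast_to(tbl, src.shape)      # target must be a table token
    total = src.sum()
    return {
        "same_row": float(src[same_row & in_table].sum()) / total,
        "same_col": float(src[same_col & in_table].sum()) / total,
        "other": float(src[~(same_row | same_col) & in_table].sum()) / total,
    }
```

Averaging such statistics over heads, layers and examples yields the kind of numbers reported in Table 7.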

Conclusion
We introduce MATE, a novel method for efficiently restricting the attention flow in Transformers applied to tabular data. We show in both theory and practice that the method improves inductive bias and, thanks to its linear complexity, allows scaling training to larger sequence lengths. We improve the state-of-the-art on TABFACT, SQA and HYBRIDQA, the last one by 19 points.

Ethical Considerations
Although one outcome of this research is more efficient Transformers for table data, it remains true that large Transformer models can be expensive to train from scratch, so experiments of this sort can incur high monetary cost and carbon emissions. This cost was reduced by conducting some experiments at relatively small scale, e.g. the results of Figure 4. To further attenuate the impact of this work, we plan to release all the models that we trained, so that other researchers can reproduce and extend our work without re-training.
All human annotations required for the error analysis (Section 7) are provided by authors, and hence a concern of fair compensation for annotators did not arise.

Appendix
We provide all details of our experimental setup needed to reproduce the results in Section A. In Section B we show the development set results for our experiments. The proof of Theorem 1 is given in Section C, and in Section D we include the main code blocks for implementing MATE efficiently in a deep learning framework.

A Experimental setup
A.1 Pre-training

Pre-training for MATE was performed with constrained attention and a masked language modeling objective, applied to the corpus of tables and text extracted by Herzig et al. (2020). With a sequence length of 128 and a batch size of 512, the total training of 1 million steps took 2 days.
In contrast, for LINFORMER the pre-training was done with a sequence length of 2048 and a batch size of 128, and the total training took 12 days for 2 million steps. For TABLEETC we also pre-trained for 2 million steps but the batch size had to be lowered to 32. In all cases the hardware used was a 32 core Cloud TPUs V3.

A.2 Fine-tuning
For all experiments we use Large models over 5 random seeds and report the median results. Errors are estimated with half the interquartile range. For TABFACT, SQA and WIKITQ we keep the original hyper-parameters used in TAPAS and provided in the open source release. In Figure 5 we show the floating point operation count of the different Transformer models as the sequence length increases, as extracted from the execution graph. We also measure memory usage during CPU inference in Figure 6. The linear scaling of LINFORMER and MATE can be observed. No additional tuning or sweep was done to obtain the published results. We set the global size G to 116 and the radius R for local attention to 42. We use an Adam optimizer with weight decay with the same configuration as BERT. The number of parameters for MATE is the same as for BERT: 340M for Large models and 110M for Base models.
In the HYBRIDQA cell selection stage, we use a batch size of 128 and a sequence length of 2048, and train for 80,000 steps. Training requires 1 day. We clip the gradients to 10 and use a learning rate of 1 × 10−5 with a 5% warm-up schedule. For the reader stage we use a learning rate of 5 × 10−5 with a 1% warm-up schedule and a batch size of 512, and train for 25,000 steps, which takes around 6 hours.

B Development set results for SQA, WIKITQ and TABFACT
We show in Table 8 the dev set results for all datasets we attempted; they are consistent with the test set results reported in the main paper.

C Proof of Theorem 1
In this section we discuss the proof that MATE models are universal approximators of sequence-to-sequence functions.
Theorem. If the number of heads is at least 3 and the hidden size of the feed-forward layer is at least 4, then for any f ∈ F and ε ∈ ℝ⁺ there exists f̂ ∈ T_MATE such that ‖f − f̂‖_p < ε.

Proof. When the number of heads is at least 3, there are at least 2 heads of the same type. Fixing those two heads, we may restrict the value of the projection weights W_V to be 0 for the remaining heads. This is equivalent to having only those two heads with the same attention pattern to begin with, and it only makes the family of functions modelled by MATE smaller. In a similar way, we can assume that the hidden size of the feed-forward layer is exactly 4 and that the head size is 1.
Note that the attention pattern of the two heads, regardless of their type, contains a token (the first one) which attends to and from every other token. We also have that every token attends to itself. Then Assumption 1 of Yun et al. (2020b) is satisfied. Hence we can rely on Theorem 1 of Yun et al. (2020b), which asserts that sparse transformers with 2 heads, hidden size 4 and head size 1 are universal approximators, which concludes the proof.

D TensorFlow Implementation
In Figure 7 we provide an approximate implementation of MATE in the TensorFlow library. For the sake of simplicity, we omit how attention is masked between neighboring buckets for tokens in different columns or rows. We also omit the tensor manipulation steps that reorder and reshape the sequence into equally sized buckets in order to compute attention across consecutive buckets. The full implementation will be part of the open source release.