Investigating Efficiently Extending Transformers for Long Input Summarization

While large pretrained Transformer models have proven highly capable at tackling natural language tasks, handling long sequence inputs continues to be a significant challenge. One such task is long input summarization, where inputs are longer than the maximum input context of most pretrained models. Through an extensive set of experiments, we investigate what model architectural changes and pretraining paradigms can most efficiently adapt a pretrained Transformer for long input summarization. We find that a staggered, block-local Transformer with global encoder tokens strikes a good balance of performance and efficiency, and that an additional pretraining phase on long sequences meaningfully improves downstream summarization performance. Based on our findings, we introduce PEGASUS-X, an extension of the PEGASUS model with additional long input pretraining to handle inputs of up to 16K tokens. PEGASUS-X achieves strong performance on long input summarization tasks comparable with much larger models while adding few additional parameters and not requiring model parallelism to train.


Introduction
Large pretrained Transformer models have proven to be extremely capable at tackling natural language tasks (Devlin et al., 2018; Brown et al., 2020). However, handling long textual sequences continues to be a significant challenge for these models. Training models to handle long sequences is expensive in both computation and memory, and moreover requires training and evaluating on long sequence data, which is rarer and more costly to collect. Given the broad success of Transformer models on short-sequence language tasks, our goal is to investigate the best way to extend these models to handle longer input sequences.
In this work, we focus on the task of long input summarization: summarizing long input documents into shorter text sequences. The inputs of such tasks are often significantly longer than the maximum input lengths of most standard Transformer models, and hence warrant both architecture modifications as well as new training regimes. For instance, to avoid the quadratic growth in memory consumption of attention in Transformers, many memory-efficient Transformer variants have been proposed (Tay et al., 2020, 2021). However, the manner in which these changes are incorporated into models has been inconsistent and ad-hoc, and there are few established best practices. For instance, some works directly fine-tune on long-input summarization tasks (Zaheer et al., 2020; Pang et al., 2022), while others first perform additional pretraining (Beltagy et al., 2020). Because of the high cost of training these models, there has yet to be a systematic study of how best to adapt models for long input sequences. Hence, it has been difficult to establish which model and training changes are necessary or complementary.
To answer these questions, we conduct an extensive empirical investigation into the architectural changes, model configurations and pretraining schemes to identify better approaches to training Transformer models for long input summarization. We evaluate a set of efficient Transformer variants, and propose a simple block-wise local Transformer architecture with staggered blocks and global tokens that strikes a good balance of performance and memory efficiency. We show that given a fixed token budget, pretraining on short sequences and then pre-adapting the model to an efficient Transformer architecture by training on longer sequences leads to better performance than only long input pretraining or no adaptation at all. We also investigate model design choices such as position encoding schemes, encoder-decoder layer distributions, and the impact of discrepancies between pretraining and fine-tuning architecture hyperparameters.
Based on the findings from our empirical investigation, we adapt the pretrained PEGASUS Large model (Zhang et al., 2020) to tackle long input summarization on up to 16K input tokens. The resulting model, which we call PEGASUS-X, attains top scores on long summarization tasks, outperforming much larger models like LongT5 (Guo et al., 2021). Moreover, the impact on short input summarization performance is minimal. A smaller version, which we call PEGASUS-X Base, attains similar scores with far fewer parameters. Beyond summarization, we believe that many of our findings will be useful to the community for efficiently adapting Transformer models to handle ever longer input sequences for other tasks.
In summary, our contributions are:
1. We evaluate a series of proposed efficient Transformer architectures as well as other model modifications, and report their efficacy and computational trade-offs when applied to long input summarization tasks.
2. Based on our findings, we propose a recipe for adapting a short-context, pretrained Transformer encoder-decoder to longer inputs, and apply it to PEGASUS to greatly improve its long-document summarization performance, with comparable short-input performance.

Experimental Setup
Similar to Zhang et al. (2020), we perform the majority of our experiments with a PEGASUS Base-sized model, before applying our findings to a PEGASUS Large-sized model.

Pretraining
We generally follow the recipe from PEGASUS (Zhang et al., 2020) for pretraining PEGASUS Base-sized models. All experiments in our ablation study performed pretraining with C4 (Raffel et al., 2020) for 500k steps with 512 input tokens and 256 output tokens and a masking ratio of 45%, unless otherwise stated. For long input pretraining, we extend the input length to 4096 tokens, and adjust the masking ratio from 45% to 5.625%, reducing the ratio by a factor of 8 to account for the 8x increase in input sequence length. We also filter for only documents longer than 10000 characters.
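The adjusted masking ratio follows directly from scaling by the change in input length, which keeps the amount of masked content per example roughly constant. A small sketch of the arithmetic (illustrative only, not part of the pretraining code):

```python
short_len, long_len = 512, 4096
short_mask_ratio = 0.45

# Scale the ratio down by the factor the input length grew by (8x).
long_mask_ratio = short_mask_ratio * short_len / long_len
print(long_mask_ratio)  # 0.05625, i.e. 5.625%
```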

Fine-tuning
We evaluate models by fine-tuning on the arXiv (Cohan et al., 2018) and GovReport (Huang et al., 2021) long input summarization tasks. Where relevant, we also fine-tune on the shorter-context XSUM and CNN/DailyMail tasks. For each experiment, we report the best validation set scores based on the geometric average (RG) of ROUGE-1, ROUGE-2 and ROUGE-L scores (Lin, 2004), computed with the rouge-score package. Fine-tuning hyperparameters can be found in Appendix E. Unless otherwise stated, we directly switch to the efficient Transformer architectures between pretraining (on shorter contexts) and fine-tuning (on longer contexts), with no adaptation phase in between.
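For clarity, the aggregate RG score is simply the geometric mean of the three ROUGE scores; a minimal sketch (the helper name is ours, not part of the rouge-score package):

```python
def rouge_geometric_mean(rouge1: float, rouge2: float, rougeL: float) -> float:
    """Geometric average (RG) of ROUGE-1, ROUGE-2 and ROUGE-L scores."""
    return (rouge1 * rouge2 * rougeL) ** (1.0 / 3.0)

# Example: ROUGE-1/2/L of 45.0 / 20.0 / 40.0 gives an RG of ~33.0.
print(rouge_geometric_mean(45.0, 20.0, 40.0))
```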

Encoder architectures
We first investigate whether using an efficient Transformer encoder allows models to incorporate longer input sequences while consuming reasonable amounts of device memory. We consider two encoder architectures that exemplify different approaches to efficient attention. Big Bird (Zaheer et al., 2020) uses sparse attention computation, combining sliding-window and random attention, and a set of global-attention tokens. Conversely, Performer (Choromanski et al., 2021) factorizes attention matrices via orthogonal random features.
Both models also performed well on the LRA tasks (Tay et al., 2021). For this experiment, we perform both pretraining and fine-tuning with the same encoder architecture to avoid the issue of mismatch between pretraining and fine-tuning architectures. In addition, we introduce two simple variants of local attention Transformer encoders. First, we use a simple block-local Transformer (Local), where encoder input tokens are divided into non-overlapping blocks, and tokens can only attend to other tokens within the same block. Second, we extend the local Transformer by adding a set of global tokens with learned embeddings, which can attend to and be attended to by every encoder token (Global-Local). These components are similar to the sliding window attention and global token attention of Big Bird, ETC (Ainslie et al., 2020) and Longformer (Beltagy et al., 2020). However, we opt for the simpler block-local attention rather than sliding window attention, and compensate for the lack of overlapping blocks by staggering the local attention blocks, which we elaborate on in Section 3.2. As we show below, the performance is highly competitive despite its simplicity.
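As a rough illustration of the attention pattern (a sketch with illustrative names and an arbitrary index layout, not the implementation in our codebase), sequence tokens attend only within their block, while global tokens attend to and are attended to by everything:

```python
import numpy as np

def global_local_mask(seq_len: int, block_size: int, num_global: int) -> np.ndarray:
    """Boolean attention mask for one Global-Local encoder layer.

    Illustrative convention: positions [0, num_global) are global tokens,
    the remaining positions are the sequence tokens.
    """
    total = num_global + seq_len
    mask = np.zeros((total, total), dtype=bool)

    # Global tokens attend to everything and are attended to by everything.
    mask[:num_global, :] = True
    mask[:, :num_global] = True

    # Sequence tokens attend only within their non-overlapping local block.
    block_ids = np.arange(seq_len) // block_size
    same_block = block_ids[:, None] == block_ids[None, :]
    mask[num_global:, num_global:] = same_block
    return mask

# Example: 512 sequence tokens, block size 64, 32 global tokens (the ablation default).
mask = global_local_mask(seq_len=512, block_size=64, num_global=32)
```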
Results on short and long summarization tasks are shown in Table 1, with the relative training steps per second and memory consumed per device for fine-tuning on arXiv shown in the right-most columns. Among the short tasks, the full-attention Transformer performs best, followed by Big Bird. On the long tasks, Big Bird and Global-Local models perform best, but Big Bird consumes significantly more memory and trains much more slowly than the other architectures. Conversely, although Performer has relatively low memory consumption and trains efficiently, it performs worst among the architectures tested by a noticeable margin.
On the other hand, the Local and Global-Local encoders strike a good balance of performance and efficiency. The simple Local encoder, which uses block-local attention, attains performance close to that of Big Bird while being much faster and using much less memory. Global-Local trades off a small amount of speed and memory for better performance, outperforming Big Bird.
Takeaways: Local attention is a strong baseline, and adding global tokens significantly improves performance. Both models are resource-efficient.

Local and Global-Local configurations
Given the good performance of both Local and Global-Local encoder variants, we next consider further architectural tweaks to these models.
First, we introduce staggering of local attention blocks. In block-local attention, tokens can only attend to other tokens within the same block. If the input tokens are divided into the same blocks in every layer, no information is exchanged across blocks through the entire encoder. To address this pitfall, we stagger attention blocks by shifting the block boundaries by half a block every other layer. We show an example of this in Figure 2. In practice, we implement this by padding the hidden representations on either side by half a block and masking accordingly.
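A minimal sketch of the staggering idea (illustrative only; our Flax implementation realizes it by padding the hidden states by half a block and masking, as described above):

```python
import numpy as np

def block_boundaries(seq_len: int, block_size: int, staggered: bool) -> np.ndarray:
    """Return the block id of each position for one encoder layer.

    In staggered layers the block boundaries are shifted by half a block,
    equivalent to padding the sequence by block_size // 2 on each side
    and masking out the padding positions.
    """
    offset = block_size // 2 if staggered else 0
    return (np.arange(seq_len) + offset) // block_size

# Layer 0 uses blocks [0..63], [64..127], ...; layer 1 shifts boundaries by 32,
# so tokens 32..95 now share a block and information crosses the old boundary.
even_layer = block_boundaries(seq_len=256, block_size=64, staggered=False)
odd_layer = block_boundaries(seq_len=256, block_size=64, staggered=True)
```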
Secondly, in the Global-Local model, the decoder only attends to the encoded token representations, and not the global token representations. We consider a variant where we supply the global token representations to the decoder and introduce a second cross-attention layer that attends only to the global tokens. Our goal is to allow the decoder to incorporate global information before performing cross-attention over the encoded sequence.
Results are shown in Table 2. We find that staggering local blocks noticeably improves performance in both Local and Global-Local models. Performance improves even for Global-Local models, which already allow for cross-block interactions via global tokens, indicating that the two modifications are complementary. Conversely, incorporating global token information in the decoder did not lead to much performance improvement, particularly once staggered local blocks were used.
Takeaways: Staggering local attention blocks significantly improves performance, and is complementary to global tokens.

Global-Local: Block Size and Number of Global Tokens
Next, we vary the block size and number of global tokens for the Global-Local encoder, with results shown in Table 3. Broadly, we find that increasing either the block size or the number of global tokens leads to improved performance, with a corresponding increase in memory consumption and computation time. However, the gains are modest and saturate at larger block sizes or numbers of global tokens. As such, increasing either of these hyperparameters is worthwhile if resources allow, but is not a high priority compared to other model improvements. For the remainder of the ablation experiments, we use a block size of 64 and 32 global tokens for consistency.
Takeaways: Larger block sizes and/or more global tokens lead to improved performance, although the effect saturates.

Other Architecture Modifications
We further investigate a series of architectural modifications to the encoder-decoder model, including the position encoding scheme (Table 8), scaling the encoder and decoder layers (Table 10) and using cross-attention in only a fraction of the decoder layers (Table 12). We find that sinusoidal position encodings provide a good balance of performance and efficiency, and that a balanced encoder-decoder with full cross-attention generally performs best. More details are provided in Appendix B.

Pretraining vs Fine-tuning Architectures
Previous works using efficient Transformer encoders have generally taken the parameters of a full-attention Transformer pretrained on shorter sequences and adapted them to efficient architectures, either directly during fine-tuning (Zaheer et al., 2020) or with an intermediate stage of additional pretraining (Beltagy et al., 2020). In this section, we investigate whether such an approach is optimal, or whether models benefit from being pretrained with efficient encoders from the beginning. Note that we still perform pretraining on short sequences (512 tokens), even with an efficient encoder.
We consider both pretraining with a Transformer and pretraining with the efficient architecture for both Local and Global-Local models. We also vary the block size, as the main difference between a Transformer and a Local Transformer is the block size (aside from staggering, a Local model with block size 512 is equivalent to a dense Transformer), and hence the difference in block size also corresponds to the extent to which the model needs to adapt between architectures. When adapting from a pretrained Transformer encoder to a Global-Local architecture, because the Global-Local model relies on newly introduced global token embeddings, we initialize them by randomly sampling tokens from the vocabulary embeddings. Results are shown in Table 11. For Local models, pretraining with local attention using small block sizes tends to hurt performance, but at moderate block sizes (e.g. 64) there is little difference between the two approaches. In contrast, for Global-Local models, pretraining with the efficient architecture tends to perform better. We hypothesize that this difference arises because of the learned global embedding tokens, which are randomly initialized when adapting from a pretrained Transformer and hence may benefit from being pretrained jointly with the local attention.
Takeaways: For moderate block sizes, pretraining with and adapting to a Local encoder perform about equally well, but pretraining with a Global-Local encoder performs slightly better.

Pretraining Schemes
Up to this point, we have only considered pretraining with short sequences. We might expect that pretraining with longer sequences ought to improve performance on downstream long input summarization. However, pretraining only on long sequences is computationally expensive and requires a large collection of long input documents, which are relatively rare. Long documents may also contain different information from short documents, so limiting training to only long inputs may reduce the diversity of the training data. Different long context Transformers have taken different approaches to pretraining on long inputs. For instance, Longformer (Beltagy et al., 2020) performed several additional stages of increasingly longer-sequence pretraining to adapt the initial RoBERTa to long sequence inputs. On the other hand, LongT5 (Guo et al., 2021) is pretrained exclusively with long input sequences. Others (Zaheer et al., 2020; Ivgi et al., 2022) perform no long input pretraining at all. In this section, we investigate how the balance of short and long pretraining impacts downstream performance, and try to find the best trade-off between pretraining cost and downstream performance.
We consider two setups for pretraining: short-input pretraining, with 512 input tokens and 256 output tokens, and long-input pretraining, with 4096 input tokens and 256 output tokens. We describe the corresponding differences in data preprocessing in Section 2.1. We fix the number of input tokens seen during training, and vary configurations subject to this constraint, which roughly proxies for the amount of compute consumed. We set our total input token budget at 131 billion tokens, which corresponds to 1M steps with 512 input tokens, compared to the 500k steps in the above experiments. This larger budget ensures that when we only do long-input pretraining, the model is still pretrained for a reasonable number of steps. We consider four pretraining configurations:
• Short-input for 100% of tokens (1M steps)
• Short-input for 75% of tokens (98.3B, 750k steps), then long-input for 25% of tokens (32.8B, 31.25k steps)
• Short-input for 50% of tokens (62.5B, 500k steps), then long-input for 50% of tokens (62.5B, 62.5k steps)
• Long-input for 100% of tokens (125k steps)
We compare the performance of the different pretraining schemes in Table 4. We also include short-input pretraining for 500k steps for comparison. First, comparing short-input pretraining for 500k and 1M steps, we find that more pretraining still improves performance, indicating that our base models may still be undertrained at 500k steps. Second, long-input-only pretraining performs consistently worse than the other variants, which we attribute to it having fewer training steps, again highlighting the issue of potential undertraining. Among the remaining configurations, all three non-long-only variants attain similar scores on the long tasks, with more long-input pretraining having slightly better performance, particularly on the ROUGE-2 and ROUGE-L scores. While the small absolute differences in scores make it hard to draw strong conclusions, we lean towards the conclusion that adding a short phase of long-input pretraining can improve performance on long input summarization tasks.
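As a rough sanity check on the step counts above (a sketch; the per-step batch of 256 sequences is inferred from 131B ≈ 1M steps × 512 tokens × 256, and is not stated explicitly here):

```python
BUDGET = 131e9   # total input tokens
BATCH = 256      # sequences per step (inferred: 131e9 / (1e6 * 512) ≈ 256)

def steps(tokens: float, seq_len: int) -> float:
    """Number of pretraining steps a token budget buys at a given input length."""
    return tokens / (BATCH * seq_len)

print(steps(BUDGET, 512))           # ~1.0M short-input steps
print(steps(0.75 * BUDGET, 512))    # ~750k short-input steps, followed by
print(steps(0.25 * BUDGET, 4096))   # ~31.25k long-input steps
print(steps(BUDGET, 4096))          # ~125k long-input-only steps
```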
Takeaways: Given a fixed compute budget, allocating some training steps to long-input training can improve performance, although the optimal allocation is difficult to determine. Exclusively long pretraining results in worse performance.

PEGASUS-X
Based on our findings, we settle on the following recipe for adapting PEGASUS models (Zhang et al., 2020) to long sequence summarization.
• We use a Global-Local architecture with block staggering, a large number of global tokens, and large block sizes during pretraining.
• We perform additional long input pretraining on 4096 token inputs for 300k steps.
• We extend input sequences up to 16384 input tokens in fine-tuning, depending on the task.
We experiment with two model sizes: PEGASUS-X (PEGASUS eXtended), based on PEGASUS Large, and PEGASUS-X Base, based on a newly trained PEGASUS Base model which we call PEGASUS Base+. We initialize the weights of PEGASUS-X and PEGASUS-X Base with the pretrained weights of PEGASUS Large and PEGASUS Base+ respectively. Only two new sets of parameters are introduced: global token embeddings, and a new LayerNorm for the global input representations in each Transformer layer. This adds ∼1M parameters for PEGASUS-X Base and ∼2M for PEGASUS-X. We initialize the global token embeddings by randomly sampling tokens from the input embeddings, and we initialize the LayerNorm weights with the regular input LayerNorm weights.
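A minimal sketch of this initialization, with hypothetical parameter names (the actual Flax parameter tree differs):

```python
import numpy as np

rng = np.random.default_rng(0)

def init_new_params(tok_emb: np.ndarray,
                    input_ln_scale: np.ndarray,
                    input_ln_bias: np.ndarray,
                    num_global: int):
    """Initialize the only new parameters added when extending PEGASUS.

    Global token embeddings are copies of randomly sampled rows of the input
    token embedding table; the new global LayerNorm starts from the existing
    input LayerNorm weights of the same layer.
    """
    sampled_ids = rng.choice(tok_emb.shape[0], size=num_global, replace=False)
    global_emb = tok_emb[sampled_ids].copy()
    global_ln = (input_ln_scale.copy(), input_ln_bias.copy())
    return global_emb, global_ln
```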
The task- and model-specific hyperparameters for fine-tuning can be found in Table 15 (Appendix E). For this section, we report ROUGE-Lsum rather than ROUGE-L for consistency with the metrics reported in other papers and leaderboards.

Results on Summarization Tasks
Long summarization tasks In Table 6, we compare the performance of PEGASUS models to those of PEGASUS-X on three long-input summarization tasks: arXiv, Big Patent and PubMed. On all three tasks, we see significant improvements in performance of PEGASUS-X Base over PEGASUS Base+, and PEGASUS-X over PEGASUS Large. To isolate the impact of additional long input pretraining compared to only switching the architecture during fine-tuning, we also include evaluation of the PEGASUS models using the Global-Local architecture with no further pretraining, which we list in the table as PEGASUS Base+ + Global-Local.
We also compare to reported results of Big Bird-PEGASUS (Zaheer et al., 2020), LED (Beltagy et al., 2020), Top-Down Transformer (Pang et al., 2022) with both Average-Pool (AvgP) and Adaptive-Pool (AdaP) variants, BART-LS (Xiong et al., 2022a), LongT5-Large and XL, and SLED (Ivgi et al., 2022). Note that Big Bird-PEGASUS only has a context of 3072 tokens, likely due to the larger memory consumption of Big Bird. LED, Top-Down and SLED are initialized with BART Large weights with no additional pretraining on long input sequences. BART-LS is concurrent work that also incorporates staggered block-local attention and additional long-sequence pretraining, in addition to pooling layers and different pretraining data.
PEGASUS-X outperforms Big Bird-PEGASUS on all tasks, and Top-Down-AvgP on both compared tasks. Although Top-Down-AdaP outperforms PEGASUS-X, it uses a much more complex fine-tuning setup, using an importance tagger on reference summaries to construct token pooling weights, whereas PEGASUS-X only uses standard fine-tuning. Even so, PEGASUS-X still outperforms Top-Down-AdaP on PubMed. PEGASUS-X outperforms BART-LS on PubMed and slightly underperforms on arXiv; as mentioned above, PEGASUS-X and BART-LS share many similarities, and we see the strong performance of BART-LS as confirmation of the efficacy of parts of our recipe for longer sequence models. PEGASUS-X also outperforms LongT5 on both arXiv and PubMed, despite both compared LongT5 models having more parameters. However, we find that LongT5 performs much better on BigPatent, which is a largely extractive summarization task. We hypothesize that a larger hidden size may improve extraction over very long sequences.

Short summarization tasks
We show in Table 14 the performance of PEGASUS and PEGASUS-X models on shorter summarization tasks, where there is a slight regression in performance of both PEGASUS-X models compared to their PEGASUS equivalents. We hypothesize that long input pretraining might negatively impact the performance on shorter input tasks because of the data filtering for long documents, resulting in a potentially less diverse training data distribution.

SCROLLS Summarization Tasks
We report the performance of the PEGASUS-X models on the summarization tasks in the recently introduced SCROLLS benchmark in Table 7. This includes GovReport (Huang et al., 2021), the ForeverDreaming subset of SummScreen (Chen et al., 2022), and QMSum (Zhong et al., 2021).
PEGASUS-X outperforms all other models on GovReport, setting the state of the art on the dataset. It also performs comparably to both LongT5 Large and Top-Down-AvgP on SummScreen/FD, although it underperforms LongT5 models and BART-LS on QMSum. Moreover, PEGASUS-X Base also performs competitively, outperforming both LongT5 models on GovReport, and falling only a small margin behind PEGASUS-X on all three tasks. PEGASUS-X Base also outperforms BART Large-SLED, a larger model with a similar 16K input length.

Pertinent Related Work
Many works, such as Zaheer et al. (2020), adapt Transformers to long inputs by incorporating sliding window attention and global representations, while others, like LongT5 (Guo et al., 2021), pretrain exclusively on long sequences. However, pretraining only on long sequences significantly increases the pretraining time, and as we show in Section 3.6, pretraining first on short inputs and then subsequently on long inputs is much more cost efficient.
In concurrent work released shortly before this submission deadline, Xiong et al. (2022a) also investigated extending short input Transformer models for long input tasks. While they focus on BART rather than PEGASUS, they similarly find that global tokens, staggered block-local attention, and extended pretraining greatly improve performance, lending further support to our findings. Their final model also incorporates pooling layers and is trained on different data.
A broader treatment of related work can be found in Appendix A.

Conclusion
In this work, we investigate a range of proposed improvements to Transformer models to effectively and economically handle long inputs in summarization tasks. Through extensive ablation experiments, we find a simple but effective recipe for extending short-input models to tackle long-input summarization. Based on our findings, we introduce PEGASUS-X, an extended version of PEGASUS with a modified architecture and additional long-sequence pretraining. We show that PEGASUS-X sets the state of the art on two long input summarization tasks (GovReport and PubMed) and performs competitively on many others, despite being much smaller than some compared models. Our findings can be extended to models in other domains beyond summarization, both for pretraining long input models from scratch as well as for extending already pretrained short sequence models.

Limitations Challenges of Evaluating Long-Document Summarization Models
One limitation of our work is that evaluation of long-document summarization models is challenging, and while we evaluate on the widely used benchmarks for long-document summarization, we highlight here the difficulties of measuring the capabilities of such models. In addition to the widely accepted issues with automatic evaluation of model-generated summaries with metrics such as ROUGE, long-document summarization brings new challenges. In particular, there are relatively few long-document summarization tasks available to evaluate models on, and many of them (e.g. arXiv, PubMed, SummScreen) are constructed by repurposing existing data and proxies for summaries (e.g. abstracts) rather than explicitly written summaries. As such, the available datasets for summarization reflect the data that is easy to repurpose into summarization rather than practical downstream summarization settings; in other words, the available evaluation datasets may not match the distribution of data or settings where such models are realistically used.
On scoring generations, human evaluation should ideally be conducted to measure the quality of model-generated summaries. However, the much longer input texts also mean that human evaluation of summaries becomes much more expensive and onerous, as raters would need to read the whole input before judging the quality of the summary.
More discussion on the challenges of evaluating long-document summarization models can be found in Wang et al. (2022).

Findings May Not Generalize to Other Tasks
We have confined our study to summarization tasks, as it matches our goal of investigating the ability for models to process large input contexts, with less focus on generating long outputs.We acknowledge that our ablation studies and experiments are focused solely on summarization tasks, and that our findings may not directly apply or extend to other long-input language tasks.

A Full Related Work
Long Document Summarization Several new long input summarization datasets and benchmarks have been introduced recently, providing better measures of long input summarization capability as well as prompting new interest in this research direction. The BookSum dataset (Kryściński et al., 2021) consists of paragraph, chapter, and full summaries of books on Project Gutenberg, based on web-scraped educational websites. SummScreen (Chen et al., 2022) consists of television show transcripts and episode summaries based on web-scraped fan-written summaries. The SCROLLS benchmark (Shaham et al., 2022) and the MuLD benchmark (Hudson and Al Moubayed, 2022) consist of multiple natural language tasks with long inputs, including long input summarization. The SQuALITY dataset (Wang et al., 2022) consists of question-focused summaries of Project Gutenberg stories, where annotators write summaries based on different questions that cover different aspects of the same story.
Efficient Transformers Many efficient Transformer variants have been introduced in recent years (Tay et al., 2020), and we discuss here the works most relevant to this manuscript. Longformer (Beltagy et al., 2020) uses global tokens as well as sliding window local attention, implemented using custom CUDA kernels. The ETC model (Ainslie et al., 2020) uses both global tokens and block-wise sliding window local attention, although the global attention is based on the first few tokens of a sequence rather than separately learned global tokens. Zaheer et al. (2020) extend ETC by adding random attention blocks, but we found that this significantly increases code complexity and computational cost. Guo et al. (2021) similarly extend ETC's block-wise sliding window attention, but compute transient "global token" representations by pooling over blocks of tokens. Pang et al. (2022) propose to augment the Longformer encoder-decoder with additional pooling layers to improve long-sequence summarization performance. Ivgi et al. (2022) propose an alternative approach to sparse attention via encoding overlapping chunks and fusing information across chunks in the decoder. We highlight that while the final Global-Local model architecture that we settle on shares similarities with several other proposed efficient Transformer architectures, our key contribution lies in our extensive ablation study that identifies which architectural tweaks improve and, just as importantly, which do not improve downstream performance.
Among the listed model architectures for long input summarization, LongT5 (Guo et al., 2021) is the most similar to PEGASUS-X, sharing a similar encoder-decoder architecture, a similar training objective of generating masked sentences, and a mix of local attention and global information sharing in the encoder. We briefly highlight the key differences between the two models. Firstly, LongT5 trains from scratch on long sequences, whereas we initialize our model weights with PEGASUS weights (which are trained on short sequences) before doing additional pretraining on long input sequences. This significantly reduces the overall pretraining cost, as short sequence pretraining can be performed much more economically. LongT5 also uses the T5 relative position biases whereas PEGASUS-X uses sinusoidal position embeddings; as shown in Section B.1, T5 relative position biases perform slightly better but are significantly slower. The efficient encoder architecture also differs between the two models: LongT5 uses transient global representations based on pooling chunks of tokens, whereas PEGASUS-X uses learned global token embeddings. LongT5 also uses a sliding window local attention based on ETC (Ainslie et al., 2020), whereas we use a simpler block-local attention with staggered blocks. Lastly, the largest LongT5 model is 3B parameters, more than 5× the size of PEGASUS-X.
More broadly, Tay et al. (2021) compare a variety of efficient Transformer architectures on a set of tasks designed to probe long-sequence processing capability, evaluating the different models on both performance and computation requirements. Tay et al. (2022) further evaluate the scaling properties of novel Transformer architectures, finding that deviating from full attention tends to hurt downstream performance. Xiong et al. (2022b) showed that simple local attention variants can be highly competitive with more complex sparse attention schemes, consistent with our findings.

B Details of Architecture Modification Experiments

B.1 Position Encoding Schemes
New position encoding schemes such as RoPE (Su et al., 2021) and ALiBi (Press et al., 2022) have garnered recent attention, showing improved performance on downstream evaluations. As input sequence lengths have gotten much longer, and in particular longer than the dimensions of hidden representations, previous choices of position encoding may no longer be optimal. Moreover, relative position encodings such as RoPE, T5 and ALiBi may be better suited for adapting models to different input lengths between pretraining and fine-tuning. Hence, this is a good opportunity to revisit the choice of position encoding schemes in encoder models.
Because of the more complex interaction between local attention blocks and relative position encoding implementations, we conduct a preliminary investigation with a full-attention Transformer. We pretrain with an input length of 512, and fine-tune with an input length of 2048 for the long sequence tasks; this experiment also tests the propensity for position encodings to be adapted to longer sequences downstream. In addition to the sinusoidal position encoding used in PEGASUS and Vaswani et al. (2017), we also consider the bucket-based relative position encoding scheme of T5, RoPE, absolute position embeddings, and no position encoding as a baseline. For absolute position embeddings, we follow the recipe of Beltagy et al. (2020) and duplicate the learned position embeddings to handle longer sequences before fine-tuning. The chosen position encoding scheme is applied to all parts of the model, including both the encoder and the decoder. We do not experiment with ALiBi, as we found no natural way to adapt ALiBi to cross-attention.
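A sketch of the position-embedding replication recipe of Beltagy et al. (2020), with illustrative names and shapes (the pretrained table covers 512 positions and is tiled to the longer fine-tuning length):

```python
import numpy as np

def extend_position_embeddings(pos_emb: np.ndarray, new_len: int) -> np.ndarray:
    """Tile a learned position embedding table of shape [old_len, dim] to new_len rows."""
    old_len = pos_emb.shape[0]
    reps = -(-new_len // old_len)  # ceiling division
    return np.concatenate([pos_emb] * reps, axis=0)[:new_len]

# Example: pretrained with 512 positions, fine-tuned with 2048-token inputs.
extended = extend_position_embeddings(np.zeros((512, 1024)), new_len=2048)
```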
Our results are shown in Table 8. We find that although T5 performs the best, it is also almost twice as slow as the other position encoding schemes, which is consistent with the findings of Press et al. (2022). Sinusoidal position encodings and RoPE perform only slightly worse than T5 with much better efficiency, making them more desirable choices. Given the much simpler implementation of sinusoidal position encodings, we opt to stick with them for the remainder of the experiments.
Takeaways: Sinusoidal position encodings still remain a good choice for long input Transformers.

B.2 Scaling Encoder and Decoder Layers
Scaling laws (Kaplan et al., 2020; Ghorbani et al., 2021; Zhang et al., 2022) that describe the empirical relationship between model size and performance have proven surprisingly consistent and have received significant attention in recent years. We present in this section a small set of scaling experiments, exploring the distribution of layers between the encoder and decoder.
Our results are shown in Table 10. In the top half, we fix the total number of layers to 24, and consider both encoder-heavy and decoder-heavy distributions, for both Local and Global-Local models. We observe that the impact of the distribution of encoder and decoder layers on performance is relatively small. For Local models, we see a slight boost from decoder-heavy models. For Global-Local models, we observe that a balanced encoder-decoder outperforms encoder- and decoder-heavy models, both of which perform about comparably.
We also consider cases where we further increase the size of either the encoder or decoder to 18 layers, shown in the second half of Table 10. We observe no improvement in performance over the 12/12-layer encoder-decoder, and suspect that other hyperparameters (e.g. hidden size) might be the bottleneck rather than the number of layers.
We highlight here that because of the asymmetry of the input and output lengths, there are different computational trade-offs to different balances of encoder and decoder layers. Encoder-heavy models require more memory because of the long input sequences, whereas decoder-heavy models are relatively slower at inference because of the autoregressive nature of decoding. Given the relatively small margin of difference in performance, memory or computational constraints may outweigh the performance differences in practical scenarios.
Takeaways: A balanced encoder-decoder performs best, but the difference in performance may be outweighed by other resource considerations.

B.3 Partial Cross Attention
Given the use of an efficient attention architecture, whose memory consumption scales linearly rather than quadratically in input sequence length, another major memory bottleneck is the encoder-decoder cross-attention. Because each decoder layer attends separately to the long encoder representations, and the attention is dense, this is a large contiguous chunk of memory that we could seek to reduce.
Perceiver AR (Hawthorne et al., 2022) demonstrated strong performance by using only a single cross-attention at the bottom layer of an autoregressive language model. Based on these results, we investigate the impact of having cross-attention on only a subset of decoder layers. In Table 12, we show the results of pretraining and fine-tuning Global-Local models with cross-attention only on specific layers, in a variety of configurations. We find that reducing the number of cross-attention layers leads to a drop in performance, but the impact on performance is smaller than expected. For instance, with cross-attention only on the first and sixth layers, the Global-Local model still outperforms a Local model. The reduction of cross-attention layers also leads to a corresponding improvement in training speed and reduction in memory consumption.
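A schematic of how such a decoder might be configured (illustrative field names, not our actual Flax modules); for example, Cross[0,6] corresponds to cross-attention only at decoder layers indexed 0 and 6:

```python
def build_decoder_layers(num_layers: int, cross_attn_layers: set) -> list:
    """Return a per-layer spec with cross-attention enabled only on selected layers."""
    layers = []
    for i in range(num_layers):
        layers.append({
            "self_attention": True,
            "cross_attention": i in cross_attn_layers,  # dense attention over encoder outputs
            "mlp": True,
        })
    return layers

# Cross[0,6]: only two of the 12 decoder layers attend to the encoder outputs.
spec = build_decoder_layers(num_layers=12, cross_attn_layers={0, 6})
```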
Given the small drop in performance from using fewer decoder layers with cross-attention, we consider the viability of dropping cross-attention layers after pretraining. In other words, we take a Global-Local model pretrained with full cross-attention, drop the cross-attention for a subset of layers, and fine-tune directly. Our results are shown in Table 13. We find that dropping the cross-attention after pretraining again only leads to a small (additional) dip in performance. This indicates that dropping cross-attention may be a viable strategy for further reducing memory requirements for an existing pretrained model with a small performance trade-off, and pretraining a separate model from scratch is not necessary.
Takeaways: Dropping cross-attention for a fraction of decoder layers can reduce memory consumption at the cost of a slight performance regression. Cross-attention can be dropped after pretraining, with an associated performance trade-off.

B.4 Comparison on short summarization tasks

C PEGASUS Base+
In a similar finding to Hoffmann et al. (2022), we found that PEGASUS Base benefits from training on significantly more tokens. As such, we trained a PEGASUS Base model for a much larger number of tokens (the same as PEGASUS Large), which achieves much better performance than the previously released PEGASUS Base model.

D Encoder Architecture Hyperparameters
For experiments in Section 3.1, Big Bird, Local and Global-Local all use a block size of 64. Big Bird and Global-Local also use 32 global tokens. Performer uses 256 random features.

E Fine-tuning Hyperparameters
For arXiv, we fine-tune with an input length of up to 16384 tokens and 256 output tokens, while for GovReport we use an input length of 10240 input tokens and 1024 output tokens, given the longer summaries for that task. For XSUM and CNN/DailyMail, we use an input length of 512, and output lengths of 64 and 128 respectively, following PEGASUS hyperparameters. The full set of hyperparameters for fine-tuning models is shown in Table 15.

F Engineering Details
The original PEGASUS model was trained using a codebase based on TensorFlow. The experiments in this paper were run using a new codebase written with JAX (Bradbury et al., 2018) and Flax (Heek et al., 2020). PEGASUS-X Base and PEGASUS-X were trained by converting the weights from the TensorFlow checkpoint to a Flax checkpoint format, and then continuing with long input training.

Figure 1: Performance on SCROLLS (Shaham et al., 2022) summarization tasks. All models evaluated on up to 16K input tokens. PEGASUS-X outperforms other models at comparable model sizes. Scores (as of 08/08/22) shown are the average of the geometric mean of ROUGE-1/2/L.

Figure 2: In block-local attention (a), the same block boundaries are used across all layers, preventing information from being shared across blocks. Staggering the block boundaries (b) by shifting the boundaries every other layer allows for cross-block interactions with minimal additional computational cost or complexity.

Table 1: Comparison of different encoder architectures on short (XSUM, CNN/DM) and long (arXiv, GovReport) summarization tasks. Training steps per second and memory are computed based on arXiv, and normalized to Local Transformer performance.

Table 2: Comparison of architectural tweaks to the Local and Global-Local encoders. Staggering local blocks uses different block boundaries for different layers in block-local attention. Global information is incorporated in the decoder via an additional cross-attention before cross-attention over the encoded input.

Table 3: Varying the block size and number of global tokens in Global-Local encoders. Training steps per second and memory are computed based on arXiv, and normalized to the run with Block Size=128 and Global Tokens=32.

Table 4: Comparison of different pretraining formats, given an input token budget of 131B tokens, which corresponds to 1M steps with 512 input tokens. Short pretraining uses 512 input tokens, whereas long pretraining uses 4096 input tokens.

Table 5: Hyperparameters of PEGASUS-X models.

Table 7: Comparison on the SCROLLS benchmark (summarization tasks, test sets). Results for SLED, BART-LS, LongT5 and UL2 models are taken from the SCROLLS benchmark leaderboard. Pang et al. (2022) report much higher scores for ROUGE-L on SummScreen/FD than any other model, which may have been computed with a variant of ROUGE-L that involves splitting on sentences rather than newlines.

Table 8: Comparison of position encoding schemes for a Transformer encoder-decoder. Absolute position embeddings are replicated to longer input sequences, following Beltagy et al. (2020). Training steps per second are computed based on arXiv, and normalized to the run with absolute position embeddings.

Table 9: Comparison of different scaling constants in sinusoidal position encodings.

Table 10: Varying the distribution of encoder/decoder layers.

Table 11: Comparison of adapting model architectures between pretraining and fine-tuning.

Table 12: Comparison of models with cross-attention only in a subset of the 12 decoder layers. Training steps per second and memory are computed based on arXiv, and normalized to the Cross[0,6] run.