CoLT5: Faster Long-Range Transformers with Conditional Computation

Many natural language processing tasks benefit from long inputs, but processing long documents with Transformers is expensive -- not only due to quadratic attention complexity but also from applying feedforward and projection layers to every token. However, not all tokens are equally important, especially for longer documents. We propose CoLT5, a long-input Transformer model that builds on this intuition by employing conditional computation, devoting more resources to important tokens in both feedforward and attention layers. We show that CoLT5 achieves stronger performance than LongT5 with much faster training and inference, achieving SOTA on the long-input SCROLLS benchmark. Moreover, CoLT5 can effectively and tractably make use of extremely long inputs, showing strong gains up to 64k input length.


Introduction
Many natural language processing tasks, such as summarization (Cohan et al., 2018) or question answering over long documents (Joshi et al., 2017), require machine learning models to encode long-form text. Processing long documents with a Transformer model is computationally expensive, both because attention cost scales quadratically with input length and because feedforward and attention projection layers have to be applied to each input token.
Over the past few years, many "efficient Transformer" approaches have been proposed that reduce the cost of the attention mechanism over long inputs (Child et al., 2019; Ainslie et al., 2020; Beltagy et al., 2020; Zaheer et al., 2020; Wang et al., 2020; Tay et al., 2021; Guo et al., 2022). However, especially for larger models, the feedforward and projection layers actually make up the majority of the computational burden and can render processing long inputs intractable.
This paper presents COLT5 (Conditional LongT5), a new family of models that, building on top of LONGT5 (Guo et al., 2022), enables fast processing of long inputs by combining architecture improvements for both attention and feedforward layers. COLT5 is based on the intuition that some tokens are more important than others, and we can achieve better quality for lower cost by devoting more computation to important tokens. Moreover, the fraction of important tokens is likely to diminish with document length, allowing for tractable processing of long documents.
In particular, COLT5 divides each feedforward layer and each attention layer into a light branch which is applied to all tokens and a heavy branch which is applied to a set of important tokens, selected specifically for that input and component.

Figure 2: COLT5 achieves stronger performance than LONGT5 at any speed. Average performance on all datasets as a function of inference and fine-tuning time per sample (ms) for LONGT5 and COLT5 Base, Large, and XL models. LONGT5 does not use MQA, but we report speed as though it had for a conservative baseline.
The light feedforward branch has lower hidden dimension than standard LONGT5 while the heavy feedforward branch has higher hidden dimension.
The light attention branch has fewer heads and applies only local attention, while the heavy attention branch performs full attention over another separately selected set of important tokens. Figure 1 provides an overview of the COLT5 conditional mechanism.
Finally, COLT5 also includes two other modifications to the LONGT5 architecture. COLT5 adds multi-query cross-attention (Shazeer, 2019), significantly speeding up inference. COLT5 also employs the UL2 (Tay et al., 2022) pre-training objective, which we demonstrate allows for in-context learning over long inputs.
We show that COLT5 achieves much faster fine-tuning and inference with similar or better model quality, improving over LONGT5 on the arXiv summarization (Cohan et al., 2018) and TriviaQA question answering (Joshi et al., 2017) datasets and achieving SOTA on the SCROLLS benchmark (Shaham et al., 2022). Moreover, COLT5 achieves further gains in quality and speed for tasks with extremely long inputs (64k tokens), with less-than-linear scaling of "focus" tokens.

Background
Transformer FLOPs COLT5 follows an extensive line of work attempting to reduce the computational cost of Transformer models, particularly over long inputs. The computational burden of Transformer models has several distinct elements, and different approaches focus on reducing the cost of different components. For that reason, it is helpful to start with a breakdown of the computational cost of Transformer components. Table 1 shows the FLOPs for each component of a Transformer encoder layer (Kaplan et al., 2020).

[Table 1 here: computational cost of each encoder layer component (vanilla self-attention, sparse attention, projections, feedforward) measured in FLOPs.]

Sparse attention
The first challenge of applying a Transformer to a long input is that the FLOPs of the self-attention mechanism scales quadratically in the input length, becoming intractable for long inputs. A large body of work focuses on reducing self-attention cost, restricting attention between a subset of inputs (Child et al., 2019; Ainslie et al., 2020; Beltagy et al., 2020; Zaheer et al., 2020; Wang et al., 2020; Guo et al., 2022) or to a subset of layers (Zemlyanskiy et al., 2021). In LONGT5 (Guo et al., 2022), the most closely related model to COLT5, tokens attend within a local window as well as to a mean-pooled summary representation for each block of 16 tokens in the input. LONGT5 attention leads to sharply reduced (though still non-negligible) FLOPs (Table 1).
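The scaling argument above can be made concrete with a rough FLOP count. This is a sketch using the standard 2-FLOPs-per-multiply-add convention, not the paper's exact Table 1 entries:

```python
# Rough FLOP accounting for the attention score/value computation of one
# encoder layer. These formulas illustrate the quadratic-vs-linear scaling
# argument; they are not the paper's exact table entries.

def vanilla_attention_flops(n: int, d: int) -> int:
    """QK^T scores plus attention-weighted values: two n x n x d matmuls."""
    return 2 * n * n * d + 2 * n * n * d  # = 4 n^2 d

def local_attention_flops(n: int, d: int, w: int) -> int:
    """Each token attends only within a window of w tokens: linear in n."""
    return 4 * n * w * d

n, d, w = 16384, 1024, 128
# Local attention is n/w times cheaper than vanilla attention.
print(vanilla_attention_flops(n, d) / local_attention_flops(n, d, w))  # 128.0
```

At a 16k input length with a 128-token window, the attention computation itself is 128x cheaper, which is why the projection and feedforward layers then dominate.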
Conditional computation After applying a sparse attention mechanism, the feedforward and attention projection layers account for the majority of the FLOPs. These costs scale with the length of the input, such that processing long inputs is still prohibitively expensive. A common approach to reduce the remaining cost is to employ some form of conditional computation, avoiding applying all model parameters to the entire input. CALM (Schuster et al., 2022) applies a varying number of decoder layers to each decoded token, outputting a token early if the model is confident in its prediction. Mixture-of-Experts models (Shazeer et al., 2017; Fedus et al., 2021; Zoph et al., 2022) route inputs through a small proportion of expert sub-modules, bringing to bear only the parameters most relevant to the input. In the context of retrieval-augmented models, numerous works rerank retrieved passages by their relevance to the query and process only the highest scoring passages (Mao et al., 2021; Wang et al., 2018; Yu et al., 2022) or vary the number of processed passages depending on model confidence (Kratzwald and Feuerriegel, 2018; Varshney et al., 2022). Concurrent work CoDA (Lei et al., 2023) employs a related conditional computation mechanism, designed for efficient adaptation rather than modeling long documents.
Device utilization FLOPs do not tell the whole story, as modeling choices can influence the effective speed of operations achieved by accelerators.
For long text inputs, autoregressive decoder inference is very slow due to memory bandwidth constraints from repeatedly loading the long sequence of keys and values (Shazeer, 2019; de Jong et al., 2022). Shazeer (2019) introduces multi-query attention (MQA), sharing heads for keys and values to reduce memory bandwidth overhead. Pope et al. (2022) studies how to shard large models, especially in the context of MQA, to obtain optimal device utilization and therefore speed.
Training objectives T5 introduced the span corruption objective (Raffel et al., 2020), a modification of masked language modeling (Devlin et al., 2019). LONGT5 made use of the PEGASUS (Zhang et al., 2020) sentence reconstruction objective for improved summarization performance. Tay et al. (2022) proposes UL2, a mixture of span corruption, prefix, and causal language modeling, and shows that it leads to strong performance on both short-output and generative tasks.

Conditional computation
As discussed in the previous section, a large proportion of Transformer FLOPs arise from feedforward and projection layers that scale with the length of the input sequence. Therefore, LONGT5 training and inference on long documents remains expensive.
COLT5 further reduces the cost of processing long documents through conditional computation, following the intuition that some tokens are more important and therefore benefit more than others from heavy computation. First, some types of tokens may inherently require less computation, such as filler words and punctuation. Second, especially in long documents, large parts of the input may not be relevant to the current question, task, or processing stage.
The COLT5 conditional computation mechanism consists of three components: routing modules, conditional feedforward layers, and conditional attention layers. All tokens are processed by standard, lightweight attention and feedforward layers. Routing modules additionally select important tokens from an input at each attention or feedforward layer, and a heavy conditional layer applies additional computation to routed tokens. This section describes each component in detail. Figure 1 provides an overview of the COLT5 conditional computation mechanism, and Table 2 compares COLT5 and LONGT5 FLOPs.

[Table 2 here: encoder layer FLOPs by model.]

Routing In order to separately select important tokens for each component in each layer, we need a learnable and tractable routing function. We follow the simple three-step mechanism from Lei et al. (2023): (1) multiply inputs with a learned embedding to obtain routing scores, (2) normalize, and (3) select the top-k highest scoring inputs. Let X_i be the representation of token i, and u a d-dimensional learnable embedding. Then the routing score of token i is s_i = X_i · u, and we select the top-k highest scoring inputs. In order to provide a learning signal to the scoring embedding, we make sure the contribution of the routed tokens to the layer update is scaled according to the routing score, as will be seen later. To provide a better distributed signal to all tokens, we also globally normalize the routing scores to sum to the number of desired routed tokens using a generalized softmax, resulting in normalized scores s̃_i. Each COLT5 layer has three independent routers, one each for the feedforward layer, attention queries, and attention key-values.
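The three-step routing mechanism can be sketched as follows. Note that the generalized softmax used for normalization is approximated here by a plain softmax rescaled to sum to k; this stand-in is an assumption, not the paper's exact procedure:

```python
import numpy as np

def route(X: np.ndarray, u: np.ndarray, k: int):
    """One routing step, as a sketch.

    X: (n, d) token representations; u: (d,) learned routing embedding.
    The paper normalizes scores with a generalized softmax summing to k;
    here a plain softmax rescaled by k stands in for it.
    """
    scores = X @ u                           # (1) routing score s_i = X_i . u
    probs = np.exp(scores - scores.max())
    s_norm = k * probs / probs.sum()         # (2) normalized scores, summing to k
    top_k = np.argsort(-scores)[:k]          # (3) indices of the top-k tokens
    return top_k, s_norm

rng = np.random.default_rng(0)
X, u = rng.normal(size=(8, 4)), rng.normal(size=4)
idx, s = route(X, u, k=2)                    # two selected indices; s sums to 2.0
```

In the model, three such routers run independently per layer (feedforward, attention queries, attention key-values), and `s_norm` is what scales each routed token's contribution to the layer update.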
Conditional Feedforward Intuitively, some token representations may benefit from more processing than others. The COLT5 conditional feedforward layer applies an additional high-capacity feedforward layer to selected tokens. In particular, let X_i be the model state of the ith token and s̃_i denote the normalized routing score (set to 0 for non-routed tokens). Then the feedforward update for COLT5 is given by X_i = X_i + FFd_Light(X_i) + s̃_i · FFd_Heavy(X_i). The light and heavy feedforward branches differ only in their hidden dimension, with the light branch having smaller hidden dimension than the standard T5 feedforward layer and the heavy branch larger. Let n denote the number of input tokens, m the number of selected tokens, and r_L and r_H the ratios of light and heavy hidden dimension to standard T5 hidden dimension. Since only the hidden dimension differs between branches, the FLOPs of the COLT5 feedforward layer scale as r_L · n + r_H · m relative to the per-token FLOPs of the standard T5 feedforward layer.
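The conditional feedforward update described above can be sketched as follows. This is a minimal NumPy illustration: the real branches use T5's gated feedforward layers, and the dimensions below are illustrative, not the model's:

```python
import numpy as np

def ffn(x, W_in, W_out):
    """Plain two-matrix feedforward with ReLU (the real model uses T5's
    gated activations; this is a sketch)."""
    return np.maximum(x @ W_in, 0.0) @ W_out

def conditional_ffn(X, routed_idx, s_norm, light_w, heavy_w):
    """Light branch applied to every token; heavy branch applied only to
    routed tokens, scaled by normalized routing scores s_norm (0 for
    non-routed tokens)."""
    out = X + ffn(X, *light_w)               # light update for all tokens
    out[routed_idx] += s_norm[routed_idx, None] * ffn(X[routed_idx], *heavy_w)
    return out

rng = np.random.default_rng(0)
n, d = 6, 4
light = (rng.normal(size=(d, 2)), rng.normal(size=(2, d)))    # small hidden dim
heavy = (rng.normal(size=(d, 16)), rng.normal(size=(16, d)))  # large hidden dim
routed = np.array([1, 3])
s = np.zeros(n); s[routed] = 0.5
X = rng.normal(size=(n, d))
Y = conditional_ffn(X, routed, s, light, heavy)
```

Non-routed tokens receive only the light update, so most of the heavy branch's FLOPs are spent on the small routed subset.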

We set the light and heavy ratios as r_L = 1/2 and r_H = 4, half and quadruple the standard T5 hidden dimension respectively. For our main experiments, a fraction 1/16 of tokens is routed to the heavy feedforward branch.

Conditional Attention COLT5 conditional attention operates on the intuition that most tokens have simple, local interactions, but some tokens benefit from heavier processing and long-range interactions. The COLT5 conditional attention layer applies an additional high-capacity attention layer that attends from selected query tokens to selected key-value tokens. Let s̃q_i denote the normalized query routing score for token i, and s̃kv the key-value scores for all tokens (set to 0 if not routed). Then the attention update for COLT5 is given by X_i = X_i + Attn_Light(X_i) + s̃q_i · Attn_Heavy(X_i), where the heavy branch scales the contribution of each routed key-value token by s̃kv. The light and heavy branches differ in the number of heads and tokens attended to: the light branch has fewer heads and attends to a local context window, while the heavy branch has more heads and attends to all routed key-value tokens. Separately selecting query and key-value tokens also allows the model to differentiate between tokens that require additional information and those that possess such information. Figure 3 shows the COLT5 attention pattern. Let q, v be the number of selected query and key-value tokens, w the size of the local attention window, and r_L, r_H the proportion of light and heavy heads relative to standard T5. We set the light and heavy head ratios as r_L = 1/4 and r_H = 3/4, keeping the total number of heads across the light and heavy branches equal to standard T5 heads. For our main experiments we set q = m and v = 2m, with m a fraction 1/16 of the input length, resulting in less than half the projection FLOPs and an order-of-magnitude smaller quadratic length scaling compared to LONGT5. Table 2 shows total FLOPs for the COLT5 layer; going forward we use m to summarize the number of routed tokens.
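A single-head sketch of the conditional attention update follows. It assumes a simple formulation in which routing scores scale the heavy branch's queries and key-values; the paper's exact parameterization, head counts, and projection matrices are omitted:

```python
import numpy as np

def attend(Q, K, V):
    """Plain scaled dot-product attention."""
    A = Q @ K.T / np.sqrt(Q.shape[-1])
    A = np.exp(A - A.max(-1, keepdims=True))
    return (A / A.sum(-1, keepdims=True)) @ V

def conditional_attention(X, q_idx, kv_idx, sq, skv, w=2):
    """Sketch of the CoLT5 attention update (single head, no projections).
    Light branch: every token attends within a local window of radius w.
    Heavy branch: routed queries attend over routed key-values, with both
    contributions modulated by normalized routing scores."""
    n = len(X)
    out = X.copy()
    for i in range(n):                        # light: local attention
        lo, hi = max(0, i - w), min(n, i + w + 1)
        out[i] += attend(X[i:i+1], X[lo:hi], X[lo:hi])[0]
    Kh = skv[kv_idx, None] * X[kv_idx]        # heavy: score-scaled routed keys
    out[q_idx] += sq[q_idx, None] * attend(X[q_idx], Kh, X[kv_idx])
    return out

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
q_idx, kv_idx = np.array([1, 4]), np.array([0, 2, 5])
sq, skv = np.full(6, 0.5), np.full(6, 0.5)
Y = conditional_attention(X, q_idx, kv_idx, sq, skv)
```

The key point the sketch captures is that the quadratic term involves only q routed queries against v routed key-values, not the full n x n interaction.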
² Global projection and attention FLOPs are rounded to readable fractions; the exact values are 9/32 and 3/256. The complexity assumes a constant fraction of routed tokens; we show we can do better in practice for extremely long inputs.

Multi-query Attention
Conditional computation effectively reduces the computational cost of the encoder. However, for encoder-decoder models with long inputs, the majority of inference time is spent in the decoder due to memory bandwidth constraints (Shazeer, 2019; de Jong et al., 2022). Most of the overhead is caused by repeatedly reading all the input token keys and values from memory for every output token that is autoregressively decoded during cross-attention. Multi-query attention (MQA) (Shazeer, 2019) allows all query heads to share a single key and value head, alleviating this bottleneck. Accordingly, we apply MQA in cross-attention layers for much faster inference. Note, however, that MQA does not improve training speed, since target tokens are processed in parallel during training, avoiding this memory bandwidth bottleneck.
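Multi-query attention can be sketched as follows: all query heads share one key head and one value head, so the K/V tensors read at every decoding step shrink by a factor of the head count. This is a minimal NumPy sketch without the decoder's actual caching machinery:

```python
import numpy as np

def multi_query_attention(X_q, X_kv, Wq, Wk, Wv, n_heads):
    """Multi-query attention sketch: n_heads query heads share ONE key head
    and ONE value head, so the cached K/V for a long input is n_heads times
    smaller than with standard multi-head attention."""
    n, d = X_q.shape
    dh = d // n_heads
    K = X_kv @ Wk                             # (m, dh) -- single shared key head
    V = X_kv @ Wv                             # (m, dh) -- single shared value head
    outs = []
    for h in range(n_heads):
        Q = X_q @ Wq[h]                       # per-head queries (n, dh)
        A = Q @ K.T / np.sqrt(dh)
        A = np.exp(A - A.max(-1, keepdims=True))
        outs.append((A / A.sum(-1, keepdims=True)) @ V)
    return np.concatenate(outs, axis=-1)      # (n, n_heads * dh)

rng = np.random.default_rng(0)
Wq = rng.normal(size=(4, 8, 2))               # 4 query heads
Wk, Wv = rng.normal(size=(8, 2)), rng.normal(size=(8, 2))
out = multi_query_attention(rng.normal(size=(5, 8)), rng.normal(size=(7, 8)),
                            Wq, Wk, Wv, n_heads=4)
```

During autoregressive decoding the bottleneck is reading K and V from memory at each step; sharing them across heads cuts that traffic by the head count while leaving query-side compute unchanged.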

UL2
The UL2 pre-training objective (Tay et al., 2022) combines different denoising objectives, extending the span corruption pre-training used in T5 to a variety of noise rates and average span lengths and adding a prefix language modeling objective more similar to typical decoder-only model pre-training. UL2 has been shown to lead to improved in-context learning. We train COLT5 on UL2 instead of PEGASUS (Zhang et al., 2020), endowing COLT5 with in-context learning capabilities.

Experiments
In order to evaluate COLT5, we perform the following experiments: (1) our main results compare COLT5 and LONGT5 on a collection of long-input datasets using input length of 16k tokens; (2) we evaluate COLT5 on extremely long inputs up to 64k tokens and compare scaling against LONGT5; (3) we demonstrate COLT5's few-shot capability, investigating how performance changes as input length and number of shots increase; (4) we perform a series of ablations to understand the effect of individual COLT5 components; and (5) we investigate empirical routing patterns. The remainder of the section outlines our experimental setup and then describes each of the experiments above.

Experimental setup
Configurations COLT5 is based on the T5.1.1 architecture (Raffel et al., 2020), implemented with JAX (Bradbury et al., 2018), Flax (Heek et al., 2020), and Flaxformer. Following LONGT5, we experiment with Base, Large, and XL model sizes. COLT5 models use the same embedding dimension, number of layers, and total attention heads as corresponding LONGT5 models of the same size, with more overall parameters (but less compute) due to the conditional branch. See Appendix B for additional details on model configuration.

Pre-training
We pre-train COLT5 for 1M steps on the C4 dataset (Raffel et al., 2020) using a variant of the UL2 objective (Tay et al., 2022) with batch size 256, input length 4096, and output length 910. In particular, our mixture contains four objectives in equal proportion: prefix-LM with noise rate 0.5, and span corruption (Raffel et al., 2020) with noise rate 0.15 and average span lengths 3, 8, and 64. We use the Adafactor optimizer (Shazeer and Stern, 2018).

Fine-tuning For fine-tuning we use a constant learning rate of 0.001, batch size 128, and dropout rate 0.1 for all tasks. Main results use input length 16384 for all datasets other than ContractNLI, which uses 8192. Question answering datasets use output length 128 and summarization datasets use output length 512, except for GovReport, which uses output length 1024. We route m = 1024 tokens, 1/16th of the input length. We train until convergence and select the checkpoint with the highest dev performance. We use greedy decoding for inference. (Flaxformer: https://github.com/google/flaxformer)
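The pre-training mixture described above can be written down as a small config sketch. The field names are illustrative, not the T5X/SeqIO API:

```python
import random

# The UL2 variant used for pre-training, as described above: four denoising
# objectives sampled in equal proportion. Field names are illustrative.
UL2_MIXTURE = [
    {"objective": "prefix_lm",       "noise_rate": 0.50, "mean_span": None},
    {"objective": "span_corruption", "noise_rate": 0.15, "mean_span": 3},
    {"objective": "span_corruption", "noise_rate": 0.15, "mean_span": 8},
    {"objective": "span_corruption", "noise_rate": 0.15, "mean_span": 64},
]

def sample_objective(rng: random.Random) -> dict:
    """Pick one denoising objective for a training example (1/4 each)."""
    return rng.choice(UL2_MIXTURE)
```

Mixing short-span, long-span, and prefix-LM denoising is what gives the model both infilling and left-to-right generation abilities.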

Main results
[Figure 2 and Table 3: COLT5 achieves stronger performance than LONGT5 at every speed across Base, Large, and XL sizes.]

Scaling to extremely long inputs
We hypothesize that the advantage of COLT5 over LONGT5 strengthens with input length, as the fraction of important tokens decreases and COLT5 can route a greater proportion of important tokens to the heavy branch. Figure 4 compares the quality-speed trade-off for LONGT5 and COLT5 on NarrativeQA, sweeping over input length rather than model size. The number of routed tokens is 1/16th of the input length, except that we do not increase routed tokens going from 32k to 64k, so at 64k we route only 1/32nd of the input length. COLT5 achieves both stronger performance and faster inference speed at all input lengths and is able to effectively make use of extremely long inputs. We note that COLT5 achieves large quality gains going from 32k to 64k tokens even while keeping the number of routed tokens constant, providing more evidence for our hypothesis.

In-context learning
Models trained on the UL2 objective have shown strong few-shot in-context learning (ICL) capabilities even at smaller sizes (Tay et al., 2022). COLT5 enables tractable inference over long inputs; here, we leverage this to scale the number of examples used for in-context learning. We evaluate few-shot learning performance on Natural Questions (Kwiatkowski et al., 2019) and TriviaQA as a function of input length, using as many examples as fit in the context. We consider the open-book setting, such that each example consists of a question, context document, and answer. Table 5 shows the number of examples by input length. We evaluate on the full dev set, randomly sampling examples from the training set for each dev sample until no further examples fit in the input length. We found that COLT5 can perform in-context learning only up to the input length it was trained on, so for these experiments we continued pre-training a COLT5-Large model on input length 16384 for another 100k steps. For the same reason we route m = 512 tokens, as in pre-training.
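The example-packing procedure described above can be sketched as follows. Here `count_tokens` is a tokenizer-dependent stand-in, and the random sampling of training examples is omitted:

```python
def pack_few_shot(examples, query, max_tokens, count_tokens):
    """Greedily prepend training examples until the next one would exceed
    the input length, then append the query example (a sketch of the
    few-shot setup described above)."""
    packed, used = [], count_tokens(query)
    for ex in examples:
        cost = count_tokens(ex)
        if used + cost > max_tokens:
            break
        packed.append(ex)
        used += cost
    return packed + [query]

# Whitespace token counting as a crude stand-in for a real tokenizer.
shots = pack_few_shot(["q1 c1 a1", "q2 c2 a2", "q3 c3 a3"], "q? ctx",
                      max_tokens=20, count_tokens=lambda s: len(s.split()))
```

Because each open-book example carries a full context document, longer model inputs translate directly into more in-context shots, which is what Figure 5 measures.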
Figure 5 displays COLT5 few-shot performance as a function of input length, showing that COLT5 is able to apply its long-input capabilities to extract information from increasing numbers of examples.

Ablations
This section studies the effect of different choices in the COLT5 recipe.

Table 6: COLT5 ablations evaluated on validation sets. Each experiment modifies a component of the COLT5 recipe for COLT5-Base. Static routing divides the input into equal-length blocks and selects the first token in each block to be routed. Shared QKV routing shares routing decisions for queries and keys/values. In v=all the routed queries attend to the entire input, while v=q selects the same number of key and value tokens as query tokens. m=512 and m=1536 use different numbers of routed tokens. LONGT5-B uses a LONGT5 encoder while retaining other parts of the COLT5 training recipe such as MQA and the UL2 objective. Multi-head refers to using multi-head cross-attention. The final ablation replaces the UL2 objective with PEGASUS as in LONGT5.
Routing First, we note that static routing -- evenly distributing routed tokens over the input -- leads to a massive drop in performance. The importance of routing provides evidence that the model learns to devote capacity to important tokens and that the advantage of COLT5 is not merely a result of additional parameters. Sharing routing decisions for query and KV tokens (to be compared with v=q) leads to a modest reduction in quality and an increase in speed.
The optimal number of routed tokens represents a trade-off between improved performance and computational cost of applying heavier layers.Table 6 shows strong gains going from 512 to 1024 (baseline) routed tokens and diminishing returns for further increases.
Attention COLT5 relies on routing to identify not only tokens that can benefit from important information elsewhere in the input, but also tokens that contain such important information. We study whether COLT5 is successful in this task by comparing performance with two different attention settings: v=all, in which routed tokens attend to the entire input, and v=q, which uses an equal number of routed keys and values as queries, rather than twice as many. COLT5 appears to occupy a sweet spot, as using fewer routed key-values modestly decreases performance at similar speed, but attending to all inputs barely helps at sharply increased cost.

Other We compare COLT5 to LONGT5 with multi-query cross-attention, confirming that LONGT5 does not achieve an unexpected quality gain from MQA, so our conservative assumptions in Figures 2 and 4 are valid. Next, we evaluate multi-head cross-attention for COLT5, finding that it leads to modestly improved performance. However, as MHA exhibits order-of-magnitude slower inference, MQA is clearly favored. Finally, PEGASUS appears to fine-tune slightly better than UL2, though the difference is small and UL2 enables few-shot learning.

Routing analysis
It is interesting to ask whether COLT5 routed tokens line up with what we intuitively consider important tokens in each document. We investigate this question by studying routing patterns of a Large COLT5 model fine-tuned on TriviaQA. We divide tokens into three categories: (1) question tokens, (2) answer tokens, and (3) other tokens.
Figure 6 shows the average fraction of each type of token that is routed through the heavy path for MLP and attention layers on TriviaQA. We note that question and answer tokens are significantly more likely to be routed than other tokens, for the feedforward layer as well as attention queries and keys/values. Appendix F presents a more detailed routing analysis; e.g., semantically important tokens are much more likely to be selected in later layers.

Conclusion
We propose COLT5, a new model for long inputs that employs conditional computation for higher quality and faster speed. COLT5 has light feedforward and attention layers that apply to the entire input, as well as heavy branches that are applied only to a subset of important tokens selected by a learned router. We show that COLT5 achieves stronger performance at any speed compared to LONGT5 on a variety of long-input datasets, and can effectively and efficiently make use of extremely long inputs up to 64k tokens.

Limitations
COLT5 applies conditional computation only in the encoder. Applying conditional computation in the decoder is more complicated; the routing method in COLT5 is not causal, so it is not applicable when generating token by token. Since decoder-only models and applications with long outputs have become more popular recently, this is a significant limitation of the current approach. Although the routing method in COLT5 could potentially be applied to the input context in a decoder-only model, we did not investigate this setup. COLT5 is also specialized towards long sequences and has to be trained from scratch. For large-scale training and deployment, it is desirable either to train a single model that can handle both short and long sequences, or to develop a long-input architecture that can be adapted from an existing large model.

D Additional Experimental Results
Table 8 compares LONGT5 and COLT5 inference speed in more detail, splitting out encoder and total time per sample. Since COLT5 applies conditional computation only in the encoder, encoder speed gains are larger than the overall speed gain, and total speed gains are largest for shorter output lengths. Trade-offs are even more in COLT5's favor when paired with other decoder optimizations.

E Computational Resources
For pre-training we generally used 128 TPUv4 chips for Base and 256 TPUv4 chips for Large and XL. Pre-training took approximately 2.5 days for Base, 3.7 days for Large, and 12.8 days for XL. For fine-tuning we generally used 64, 128, and 256 TPUv4 chips for Base, Large, and XL, respectively, with training time varying with dataset size.

F Routing Analysis
In this section we take a closer look at the routing mechanisms in COLT5. There are three routing processes in each layer of COLT5: (1) routing of attention keys and values ("KV-routing"), (2) routing of attention queries ("Q-routing"), and (3) routing of MLP tokens ("MLP-routing"). For simplicity, we will say that a token is selected when it is routed to the heavy alternative (of either MLP or attention). We are interested in understanding which tokens are selected and whether these mechanisms select similar or different tokens in each layer.
Which tokens are selected We divide input tokens into three categories: (1) question tokens, (2) answer tokens (found via simple normalized string match of the ground truth answer), and (3) other tokens. Figure 7 shows the proportion of each token type that is routed by a fine-tuned COLT5-Large model on the TriviaQA dev set, by layer and routing component.
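The token categorization described above can be sketched with a simple normalized string match. This is an illustration; the paper's exact matching procedure may differ:

```python
import re

def normalize(s: str) -> str:
    """Lowercase and strip punctuation for a loose string match."""
    return re.sub(r"[^a-z0-9 ]", "", s.lower()).strip()

def tag_tokens(tokens, question, answer):
    """Label each token as 'question', 'answer', or 'other', marking answer
    tokens by normalized string match against the ground-truth answer."""
    q = set(normalize(question).split())
    a = set(normalize(answer).split())
    tags = []
    for t in tokens:
        nt = normalize(t)
        if nt in a:
            tags.append("answer")
        elif nt in q:
            tags.append("question")
        else:
            tags.append("other")
    return tags

tags = tag_tokens(["Who", "wrote", "Papageno", "?", "the", "opera"],
                  question="Who wrote it?", answer="Papageno")
# -> ["question", "question", "answer", "other", "other", "other"]
```

Averaging routing decisions within each tag, per layer and per router, yields plots like Figures 6 and 7.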
The question is heavily routed to the expensive alternative in the last layers of the model. Earlier we showed that question and answer tokens are more likely to be selected, but separating routing decisions by layer reveals interesting patterns. In early layers question and answer tokens are only modestly more likely to be selected, with routing probability increasing sharply in later layers and peaking in the last layer. This makes intuitive sense: in early layers the model has not yet had the opportunity to identify which tokens and parts of the document are important. However, the increase is not monotonic, and there is strong variation between layers. This variation may imply that different layers focus on different types of tokens, or that some routing components do not successfully learn to identify important tokens.
To gain better insight, Figure 8 visualizes routing on two sample fragments from a TriviaQA example (given the large input length used in COLT5, we do not show the complete example in the figure). The two fragments correspond to the beginning of the example (where the question is located) and the part of the context surrounding the correct answer. We have added a colored background to the figure, where each of the three CMY channels is mapped to the KV-routing weights in a different layer of the model: cyan corresponds to layer 1, magenta to layer 12, and yellow to layer 24. Question tokens are heavily colored yellow, showing that those tokens are selected in the last layer. The same is true for tokens around the correct answer ("papageno" in this example).
Correlation between routing processes Table 10 shows the Pearson correlation coefficient between the routing weights of the different routing mechanisms in each layer of a COLT5 Large model (MLP-routing with KV-routing, MLP-routing with Q-routing, and KV-routing with Q-routing). We show numbers for both the pre-trained checkpoint and a model fine-tuned on TriviaQA. The routing of keys/values and the routing of queries are highly correlated at all layers except the first two, while the routing of tokens in the MLP has lower correlation with the other two processes. Interestingly, the correlation between MLP and attention routing increases in the last layers of the model.
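The per-layer correlations reported in Table 10 can be reproduced in outline as follows. This is a sketch over arbitrary routing-weight arrays, not the model's actual weights:

```python
import numpy as np

def routing_correlations(mlp_w, kv_w, q_w):
    """Pearson correlations between per-token routing weights of the three
    routers in one layer; each argument is an array of shape (n_tokens,)."""
    corr = lambda a, b: float(np.corrcoef(a, b)[0, 1])
    return {
        "mlp_vs_kv": corr(mlp_w, kv_w),
        "mlp_vs_q":  corr(mlp_w, q_w),
        "kv_vs_q":   corr(kv_w, q_w),
    }

# Synthetic weights sharing a common signal, standing in for real routers.
rng = np.random.default_rng(0)
base = rng.normal(size=100)
out = routing_correlations(base + 0.1 * rng.normal(size=100),
                           base + 0.1 * rng.normal(size=100),
                           base + 0.1 * rng.normal(size=100))
```

Running this per layer, for both the pre-trained and fine-tuned checkpoints, gives the layer-by-layer correlation profiles discussed above.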

Figure 1 :
Figure 1: An overview of a COLT5 Transformer layer with conditional computation. All tokens are processed by light attention and MLP layers, while q routed query tokens perform heavier attention over v routed key-value tokens, and m routed tokens are processed by a heavier MLP.

Figure 3 :
Figure 3: An overview of the COLT5 attention pattern. The light branch performs local attention for each token. In the higher-capacity heavy branch, q selected query tokens (2 in the figure) attend to v separately selected key and value tokens (4 in the figure).
We use the T5.1.1 inverse square root learning rate schedule and no dropout. COLT5 is trained with the T5X (Roberts et al., 2022) framework. For pre-training, we route m = 512 tokens, 1/8th of the input length.

Figure 5 :
Figure 5: COLT5 can use its long-input capability to benefit from more shots for in-context learning. Few-shot exact match for COLT5-Large on Natural Questions and TriviaQA dev sets as a function of input tokens, fitting as many examples as possible. Each example contains a question, context, and answer. Input lengths used are 1024, 2048, 4096, 8192, and 16384.

Figure 6 :
Figure 6: Proportion of tokens routed for answer (string match), question, and other tokens by routing component for the COLT5 Large model, averaged over examples in the TriviaQA dev set and all layers of the model.

Figure 7 :
Figure 7: Proportion of tokens routed for answer (string match), question, and other tokens by routing component and layer for COLT5 Large model, averaged over examples in TriviaQA dev set.

Figure 8 :
Figure 8: Visualization of token routing weights for some fragments of an example on TriviaQA.

Table 1 :
Computational cost of encoder layer transformer components measured in FLOPs.n is the input length, d is the model dimensionality, and w is the size of the local attention window.
Table 2: COLT5 FLOPs rounded to readable fractions.

Table 3 :
Performance comparison of COLT5 and LONGT5 Base, Large, and XL models on question-answering datasets TriviaQA (TQA), NarrativeQA (NQA), QASPER (QAS), and QuALITY (QuAL), NLI dataset ContractNLI (CNLI), and summarization datasets arXiv, SummScreenFD (SumS), QMSum (QMS), and GovReport (GovR). SCROLLS results are on the leaderboard test set, where COLT5-XL achieves SOTA. Average speed is reported in samples per second for inference (inf) and fine-tuning (fn). LONGT5 does not use MQA, but inference speed is reported without/with MQA for a conservative baseline. R_gm stands for the geometric mean of ROUGE-1, 2, and L. Similar to SCROLLS, we take a simple average across all datasets even though the datasets use different performance metrics.
Table 4 provides an overview of the size and input length for each dataset.

Table 4 :
Median and 90th percentile input length by dataset measured in SentencePiece tokens.

Table 5 :
Average number of Natural Questions and TriviaQA few-shot examples that fit in input length.
Table 6 contains results of a series of experiments that change a single component for COLT5 Base.

Table 9 :
Full performance comparison with ROUGE-1, ROUGE-2, and ROUGE-L metrics of COLT5 and LONGT5 Base, Large, and XL models on summarization dev sets. Results are based on the checkpoint that maximizes R_gm, as in Table 3.