Prefix Propagation: Parameter-Efficient Tuning for Long Sequences

Parameter-efficient tuning aims to mitigate the large memory requirements of adapting pretrained language models for downstream tasks. For example, one popular method, prefix-tuning, prepends trainable tokens to sequences while freezing the rest of the model's parameters. Although such models attain comparable performance to fine-tuning when applied to sequences of short to moderate length, we show that they perform markedly worse when modelling long sequences. To bridge this gap, we propose prefix-propagation, a simple but effective approach that conditions prefixes on previous hidden states. We empirically demonstrate that prefix-propagation outperforms prefix-tuning across long-document tasks, while using 50% fewer parameters. To further investigate the proposed architecture, we also show its advantage in calibration, and perform an additional study on its relationship with kernel attention. To the best of our knowledge, this work is the first to focus on parameter-efficient learning for long-sequence language tasks.


Introduction
The Transformer architecture (Vaswani et al., 2017) has changed the landscape of recent natural language processing approaches by enabling the pretraining of state-of-the-art large language models (LLMs) (Devlin et al., 2019; He et al., 2020; Brown et al., 2020). However, fine-tuning and storing full copies of LLMs can consume prohibitively large quantities of resources. Parameter-efficient tuning (PEFT) methods such as prefix-tuning (Li and Liang, 2021; He et al., 2021a) address these concerns by reducing the number of trainable parameters. Prefix-tuning can tune 0.01% of parameters and still match the performance of regular fine-tuning (updating all model parameters). PEFT has been investigated for tasks whose inputs consist of sentences, sentence pairs, or sequences that fit within a typical LLM's maximum input length. But how does PEFT perform on tasks that require modelling longer textual sequences? In this work, we start with this basic question and provide evidence suggesting that the gap between PEFT and regular fine-tuning is substantial when modelling long sequences. As shown in Table 1, prefix-tuning underperforms fine-tuning on the long sequence classification tasks Hyperpartisan (Kiesel et al., 2019) and 20-newsgroups (Lang, 1995) when used with the popular long-document model Longformer (Beltagy et al., 2020).

* Work done during a student internship at Ingenuity Labs.
In this paper, we propose a simple and effective method, prefix-propagation, which consistently improves the performance of PEFT for long-sequence models. Unlike prefix-tuning, prefix-propagation propagates the hidden states corresponding to prefixes through the attention computation. This allows the prefixes' hidden states to change dynamically as the input propagates through each layer. To further understand prefix-propagation, we investigate the reliability of the model's predictions by performing analyses on calibration. Lastly, we conduct a study of prefix-based methods in terms of kernel attention to strengthen their theoretical value.
In summary, our contributions are as follows:

Figure 1: Illustration of the differences between (a) prefix-propagation (ours) and (b) prefix-tuning (Li and Liang, 2021). Blue blocks denote trainable prompts, and "Transformer Layer" represents the computation done in a layer of the pre-trained LLM. Note that in prefix-propagation (a), the summation of prefixes continues for layers beyond 3, up to n; this operation is encapsulated by the ellipses. In prefix-tuning (b), prefixes in subsequent layers do not depend on hidden states from past layers (they are simply overwritten).
• We study PEFT for long documents and show that prefix-tuning is significantly inferior to fine-tuning in this scenario. To the best of our knowledge, this is the first work to focus on PEFT for long documents.
• We introduce prefix-propagation, which consistently improves performance over prefix-tuning on long-document datasets, while using 50% fewer parameters.
• We study the reliability of the predictions by performing analyses on calibration and show that models tuned with prefix-propagation are better calibrated.
• We elucidate the relationship between prefix-propagation and kernel attention and perform an ablation study that utilizes this insight.

Related Works
Long Sequence Models Numerous methods have been proposed to reduce the complexity of attention from O(n²) to O(n), such as kernel approximations (Choromanski et al., 2020; Katharopoulos et al., 2020; Peng et al., 2021) and fixed (Child et al., 2019; Beltagy et al., 2020; Zaheer et al., 2020) or learned (Kitaev et al., 2020) sparse attention patterns. For a broader summary, please refer to Tay et al. (2022). In this work, we use Longformer (Beltagy et al., 2020). To linearize attention complexity, Longformer employs sliding-window attention while globally attending to relatively few special tokens.
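To make this sparsity pattern concrete, the following toy numpy sketch builds a boolean mask in the style of sliding-window attention with a few global tokens. The function name and parameters are ours for illustration; this shows the pattern only, not Longformer's actual banded-matrix kernels.

```python
import numpy as np

def longformer_style_mask(seq_len, window, global_positions):
    """Boolean attention mask: True where attention is allowed.

    Each token attends to a local window of `window` tokens on each side;
    tokens at `global_positions` attend to (and are attended by) everyone.
    """
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for i in range(seq_len):
        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
        mask[i, lo:hi] = True  # local sliding window
    for g in global_positions:
        mask[g, :] = True  # global token attends everywhere
        mask[:, g] = True  # everyone attends to the global token
    return mask

mask = longformer_style_mask(seq_len=16, window=2, global_positions=[0])
# The number of allowed entries grows linearly with sequence length,
# O(n * window), unlike the O(n^2) dense pattern.
print(mask.sum())
```

With few global positions, the allowed-entry count stays linear in sequence length, which is the property that lets Longformer scale to 4096-token inputs.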
Parameter-Efficient Tuning Inspired by the success of manual prompting (Brown et al., 2020), prefix-tuning (Li and Liang, 2021) prepends trainable "soft" prompts to an input sequence. Although further PEFT methods have since been introduced (He et al., 2021a; Hu et al., 2021; Ben Zaken et al., 2022), we focus on adapting prefix-tuning. We note that our adaptation does not violate orthogonality, and thus prefix-propagation can still be compounded with other PEFT methods as proposed in the UnifiedPET framework (He et al., 2021a), likely yielding similar performance gains. We leave the empirical validation of this hypothesis for future work. Our work also adheres to the key motivation of the recent PEFT method inducer-tuning (Chen et al., 2022), which is that optimal prefixes should be close to queries within their latent space. We derive queries, keys, and values from the same prefix token, limiting the distance that separates them.

Methodology
In this section we introduce prefix-propagation, which, unlike prefix-tuning, propagates the hidden states corresponding to prefixes through the attention computation. This allows the prefixes' hidden states to change dynamically as the input propagates through each layer. Prefix-propagation and its predecessor, prefix-tuning, are depicted in Figure 1.

Table 2: Main results of prefix-propagation compared to prefix-tuning and traditional fine-tuning on the validation sets of each dataset. All approaches use Longformer-base except "RoBERTa PT", which is prefix-tuning on RoBERTa-base. Micro F1 and macro-averaged precision ("P") and recall ("R") are reported for ArXiv, Hyperpartisan (with mean across 5 runs), and 20-newsgroups. Accuracy is reported for WikiHop. Performance is reported on test splits, with the exception of Hyperpartisan, for which we report performance on the validation split (see Appendix B for reasoning). The best run is bold and the second best is underlined.

Prefix-propagation first prepends trainable prefixes (i.e., soft embeddings) to the input sequence (blue blocks in the top left of Figure 1a). Then, before every subsequent layer, we sum new trainable matrices onto the first j embeddings corresponding to the prefixes (denoted by the sum operators in Figure 1a). By propagating instead of overwriting, we halve the number of parameters trained while simultaneously improving performance on long-document tasks.

We now formalize prefix-propagation. Multi-headed attention processes query, key, and value matrices derived from a sequence C ∈ R^(m×d) with length m and embeddings of size d. Our method modifies traditional attention by concatenating a prefix P_l ∈ R^(j×d) of length j to the sequence:

H^(i) = Attn(cat(P_l, C) W_q^(i), cat(P_l, C) W_k^(i), cat(P_l, C) W_v^(i))    (1)

per layer l and head i, yielding the output of the attention head, H ∈ R^((j+m)×d_h). The prefixes are concatenated for the first layer (l = 1) and summed to their corresponding hidden states for the remaining layers (l > 1). We do not continually concatenate new prefixes to the sequence, to avoid increasing the sequence length after each layer.
For both prefix-tuning and prefix-propagation, prefixes (keys and values) are globally attended to by all queries. Unlike prefix-tuning, however, our method concatenates additional hidden states before the hidden states C are projected by W_k^(i) and W_v^(i). By doing so, prefix-propagation modifies the query matrices, allowing prefixes to attend to other hidden states globally and thereby increasing representation capability. This approach is somewhat analogous to the external global tokens inserted in the BigBird-ETC model (Zaheer et al., 2020). By attending to other tokens, the prefixes can act as special storage tokens, which is particularly useful in the restricted regime of long-document modelling, where relatively few tokens have global context. Conversely, prefix-tuning only concatenates trained key and value matrices, P_k, P_v ∈ R^(j×d_h), statically to the sequence:

H^(i) = Attn(C W_q^(i), cat(P_k, C W_k^(i)), cat(P_v, C W_v^(i)))    (2)

Since our method has a single prefix matrix P instead of separate P_k and P_v matrices, we reduce the number of trained parameters by 50%.
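The mechanism above can be sketched in a few lines of numpy. This is a toy, single-head illustration under assumptions of ours (small dimensions, no layer norms, feed-forward blocks, or multi-head split): one trainable prefix matrix is concatenated at the first layer, and per-layer prefix matrices are summed onto the prefix positions thereafter, so the sequence length grows only once.

```python
import numpy as np

rng = np.random.default_rng(0)
d, j, m, n_layers = 8, 2, 5, 3  # embed dim, prefix len, seq len, layers

# One trainable prefix matrix (j x d) per layer -- half the parameters of
# prefix-tuning, which keeps separate P_k and P_v matrices per layer.
prefixes = [rng.normal(size=(j, d)) for _ in range(n_layers)]

def attn_layer(h, W_q, W_k, W_v):
    """Toy single-head self-attention over hidden states h."""
    q, k, v = h @ W_q, h @ W_k, h @ W_v
    scores = q @ k.T / np.sqrt(h.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

weights = [(rng.normal(size=(d, d)), rng.normal(size=(d, d)),
            rng.normal(size=(d, d))) for _ in range(n_layers)]

C = rng.normal(size=(m, d))                   # input embeddings
h = np.concatenate([prefixes[0], C], axis=0)  # layer 1: concatenate prefix
for l in range(n_layers):
    if l > 0:
        h[:j] += prefixes[l]                  # later layers: sum, don't overwrite
    h = attn_layer(h, *weights[l])

assert h.shape == (j + m, d)  # sequence length grew only once
```

Because the prefix positions pass through every attention layer, their hidden states depend on the input, unlike prefix-tuning's static per-layer key/value prefixes.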

Calibration
We further study the proposed prefix-propagation method to understand the reliability of the model's predictions through calibration. Well-calibrated models output confidence scores that closely match the models' accuracy. Both over-confident and under-confident models are undesirable. Calibration has been widely overlooked in PEFT methods. To quantify calibration in our work, we use expected calibration error (ECE), which bins predictions based on model confidence and compares them to accuracy (Pakdaman Naeini et al., 2015; Guo et al., 2017).
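The ECE metric described above can be sketched as follows: a minimal numpy implementation with equal-width confidence bins (binning conventions vary across libraries; this is an illustrative version, not the exact code used in our experiments).

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin predictions by confidence, compare each bin's mean
    confidence to its accuracy, and average the gaps weighted by bin size."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(confidences[in_bin].mean() - correct[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by fraction of samples in bin
    return ece

# A perfectly calibrated toy model: 85% confidence, 85% accuracy.
conf = [0.85] * 20
hits = [1] * 17 + [0] * 3
print(expected_calibration_error(conf, hits))  # → 0.0
```

A model that is 95% confident but only 50% accurate would instead receive an ECE of 0.45, reflecting its over-confidence.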

Kernel Decomposition
Traditional attention is analogous to applying a kernel smoother over inputs (Tsai et al., 2019).
Motivated by this insight, we reformulate prefix-propagation as a sum of kernelized attention modules. Separating the modules introduces flexibility in two ways: (1) their individual kernel forms can be mixed and matched, and (2) a hyperparameter scale factor α can be applied to the prefix component to increase or decrease its weighting. Equation 3 defines kernel decomposition for prefix-propagation:

H = Kern(cat(P, C) W_q, C W_k, C W_v) + α Kern(cat(P, C) W_q, P W_k, P W_v)    (3)

where Kern refers to kernel attention as formulated in Tsai et al. (2019). The first term results from attending to the original sequence, C, and the second comes from attending to the prefixes, P. We provide the derivation of Equation 3 and the full definition of kernel attention in Appendix A.
Our main motivation for presenting prefix decomposition is to establish foundational knowledge and guide future research. Ergo, we restrict experiments in this initial presentation to using just the default exponential kernel (Appendix A).
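The decomposition is easy to verify numerically: softmax attention over a prefix-extended key/value sequence splits exactly into a sequence module and a prefix module weighted by a per-query normalizer, and the kernelized variant re-weights the prefix module with a scalar α. The numpy sketch below uses sequence-only queries for simplicity (the full formulation also includes prefix rows in the queries), and the α re-weighting shown is our illustrative reading of the scale factor.

```python
import numpy as np

def kern_attn(Q, K, V):
    """Kernelized attention with the default exponential (softmax) kernel."""
    scores = np.exp(Q @ K.T / np.sqrt(Q.shape[-1]))
    return (scores / scores.sum(axis=-1, keepdims=True)) @ V

rng = np.random.default_rng(1)
d, m, j = 4, 6, 2
Q = rng.normal(size=(m, d))
K, V = rng.normal(size=(m, d)), rng.normal(size=(m, d))
Pk, Pv = rng.normal(size=(j, d)), rng.normal(size=(j, d))

# Full attention over the prefix-extended key/value sequence...
full = kern_attn(Q, np.vstack([Pk, K]), np.vstack([Pv, V]))

# ...splits exactly into two modules with a per-query weight lambda:
s_seq = np.exp(Q @ K.T / np.sqrt(d)).sum(axis=-1, keepdims=True)
s_pre = np.exp(Q @ Pk.T / np.sqrt(d)).sum(axis=-1, keepdims=True)
lam = s_seq / (s_seq + s_pre)
decomposed = lam * kern_attn(Q, K, V) + (1 - lam) * kern_attn(Q, Pk, Pv)
assert np.allclose(full, decomposed)

# The kernelized variant replaces the fixed (1 - lambda) prefix weighting
# with a tunable hyperparameter alpha:
alpha = 0.05
reweighted = lam * kern_attn(Q, K, V) + alpha * kern_attn(Q, Pk, Pv)
```

The exact identity in the middle holds because the shared softmax denominator distributes over the prefix and sequence score masses; the α form then trades exactness for a controllable prefix weighting.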

Experiments and Results
Datasets We evaluate our approach on three longdocument classification tasks: ArXiv (He et al., 2019), an 11-class classification task composed of academic research papers, the 20-newsgroups (Lang, 1995) classification task consisting of mailing lists that fall into one of 20 classes, and the Hyperpartisan dataset, a binary classification task for extremist news classification (Kiesel et al., 2019). We also run experiments on WikiHop (Welbl et al., 2018), a long-document reading comprehension task requiring multi-step reasoning.
Due to compute limitations inherent to working with long documents, with the exception of Hyperpartisan, we only report a single run for each task. This mimics the original Longformer reporting scheme (Beltagy et al., 2020). For Hyperpartisan, the smallest of the datasets, we report mean metrics averaged over five seeds.

More details on dataset sizes, pre-processing, and hyperparameters are in Appendix B.

Results and Discussion
Across all tasks, our results in Table 2 verify that prefix-tuning is inferior to fine-tuning on long sequences. Conversely, prefix-propagation consistently outperforms prefix-tuning and is comparable to fine-tuning on most tasks. Prefix-propagation also performs competitively on Hyperpartisan, a relatively small dataset with only 625 samples. This is in contrast to prefix-tuning, which is known to underperform in low-data settings (Gu et al., 2022). Because we ran multiple seeds on Hyperpartisan, we also found that prefix-propagation's improvement over prefix-tuning is statistically significant (p < 0.05, using a one-tailed t-test). We do not have multiple runs with which to conduct these tests for the larger datasets, but we emphasize that Hyperpartisan likely has the most variance and the improvement is still statistically significant. We suspect that prefix-propagation outperforms prefix-tuning because propagated prefixes can transmit global context across multiple layers, possibly modelling more expressive abstractions.
We note one exception where prefix-based methods still leave room for improvement: multiple-choice question answering on WikiHop. We hypothesize that prefix methods have insufficient capacity to properly model complex long-document multi-step question answering.
We also observe that prefix-based methods, and especially prefix-propagation, achieve better calibration than fine-tuning, as shown in Table 3. Unlike prefix-tuning however, prefix-propagation effectively balances calibration with accuracy metrics. The calibration of fine-tuning deteriorates as training progresses (Figure 4 in Appendix C) and we speculate that this may be due to catastrophic forgetting (Jagielski et al., 2022).
As an initial test of our ongoing prefix-propagation kernel study, we show results on Hyperpartisan in Figure 2. The kernelized version of prefix-propagation achieves the best single-run performance, but has higher variance than fine-tuning and prefix-propagation, which necessitates further research.

Figure 2: Violin plot of micro F1 score for five different seeds on the Hyperpartisan task. White dots, gray boxes, and gray lines are the medians, interquartile ranges, and ranges, respectively. The widths of the five violin shapes show the probability densities for the corresponding F1 scores. All methods tune Longformer-base except "R Prefix", which is prefix-tuning on RoBERTa-base.

Conclusion
Our research focuses on parameter-efficient tuning for long-document tasks. We introduce prefix-propagation, which consistently improves performance over prefix-tuning on long-document datasets, while using 50% fewer parameters. We study the reliability of predictions by performing analyses on calibration and show that models tuned with prefix-propagation are better calibrated. Lastly, we explicate prefix-propagation from a kernel perspective, uncovering insights for future PEFT research.

Limitations

Scope
This short paper serves as an initial step toward PEFT for long-document models. As such, our evaluated scope of models, tasks, datasets, and kernel variations is limited. We acknowledge the need to experiment across broader settings and hope our work provides a foundation for others to build on.
Future experiments should analyze the validity and efficacy of using prefix-propagation with other long-sequence models to determine whether the prefix modality is suitable for non-sparse attention approximations. For example, would the projection of prefix vectors using a random feature map as in Choromanski et al. (2020) result in an excessive loss of information for these critical tokens?
Regarding tasks and datasets, the performance degradation of prefix methods on WikiHop deserves significant attention. Verifying whether this extends to other reading comprehension and question-answering tasks will help guide future research efforts. We restricted our research to the encoder-only version of Longformer, but using the encoder-decoder version, LED, would enable analysis of sequence-to-sequence tasks. The SCROLLS benchmark (Shaham et al., 2022) would be a good starting point for this analysis since it includes an LED baseline.
Combining prefix and kernel methods is an ongoing research effort and there are several questions we plan to address: (1) What are the effects of swapping the default exponential kernel with other variants such as linear, polynomial, and RBF? (2) Does making the α scale parameter trainable improve performance? (3) Can we have a separate scale parameter for each query and should they be trainable? (4) Is this approach effective for modalities other than long-document? (5) Can we separate other components of attention into modular kernels (e.g. local and global kernels for sparse attention)?

Robustness
The size and nature of long-sequence tasks often resulted in long run times for the larger datasets (ArXiv, 20-newsgroups, and WikiHop). Consequently, we report results of one seed after doing a hyperparameter search for learning rate. This aligns with the reporting scheme of the original Longformer paper (Beltagy et al., 2020), but greater confidence in performance on all long-sequence tasks could be achieved by accumulating results over several seeds. The size of the datasets and iteration over several epochs somewhat mitigate this concern.

Ethics Statement
Our work helps to address the environmental and equitable-distribution concerns of LLMs (Strubell et al., 2019). All PEFT variants attempt to reduce resource requirements, primarily GPU memory consumption and storage requirements. By applying prefix-tuning and our variation, prefix-propagation, to long-document models, we limit carbon emissions and increase accessibility for low-resource groups. We note that prefix-propagation neither exacerbates nor alleviates other ethical risks, such as biases regarding gender, race, religion, etc., that are often embedded in pre-trained LLMs. If such biases exist in the pre-trained model, they will be propagated to downstream tasks regardless of tuning method.

A Kernel Decomposition Derivation
In the unified framework of He et al. (2021b), we can write the first-layer (l = 1) attention mechanism of prefix-propagation as:

H = Attn(cat(P, C) W_q, cat(P, C) W_k, cat(P, C) W_v)    (4)

where P is a trained prefix for each downstream task. Omitting layer and head indices and using D = cat(P, C) for brevity, we can rewrite Equation 4 as:

H = λ(C) Attn(D W_q, C W_k, C W_v) + (1 − λ(C)) Attn(D W_q, P W_k, P W_v)    (5)

where λ(C) is a scalar (dependent on C) to normalize the softmax over the sequence and the prefixes, computed by:

λ(C) = Σ_i exp(D W_q (C W_k)^T)_i / [Σ_i exp(D W_q (C W_k)^T)_i + Σ_i exp(D W_q (P W_k)^T)_i]    (6)

We consider the two terms of Equation 5 as kernelized attention modules, which brings us back to the complete kernel decomposition:

H = Kern(D W_q, C W_k, C W_v) + α Kern(D W_q, P W_k, P W_v)    (7)

where α is an introduced hyperparameter that replaces the fixed weighting of λ. This change allows us to explicitly increase the weighting of the prefixes. Kernel attention is defined as:

Kern(Q, K, V)_i = Σ_{j=1}^{N} [ k(Q_i, K_j) / Σ_{j'=1}^{N} k(Q_i, K_{j'}) ] V_j    (8)

where subscripts (e.g., i) index the rows of a matrix, N is the number of key and value vectors, and k is a kernel function that calculates the similarity score between two vectors. We do not experiment with altering the kernel type since the default exponential kernel inherent to softmax attention already implicitly maps the input vectors to an infinite feature space. Therefore, the kernel function in Equation 8 takes the form:

k(q, k′) = exp(⟨q, k′⟩ / √d_k)    (9)

where ⟨·, ·⟩ signifies the dot product and d_k is the dimension of the key projections.

B Experimental Details
Artifact Notes   evaluate and/or develop state-of-the-art algorithms.
The intended use of 20-newsgroups is not explicit, although it is commonly used for natural language processing research. We therefore believe we have adhered to the intended usages of the datasets we included. We do not anonymize the data for 20-newsgroups as (a) the trained models are not being deployed (they are only used for evaluation purposes) and (b) the non-anonymized variant is already publicly available. We chose to use the datasets in their current form for fair comparison with other baselines and therefore did not do a detailed analysis of those artifacts. We refer readers to the cited original works in Table 4 for complete documentation.
Training For our experiments, we use and adapt the prefix-tuning implementation provided in . Training was conducted on 12 NVIDIA GeForce 1080 Ti cards, for an estimated 2300 single-GPU hours (including preliminary experiments). All models tested fit on a single card, so we did not use any model parallelism. Throughout experiments, we use gradient accumulation for an effective batch size of 32. We use early stopping for our hyperparameter search, and show results for the run with the best validation F1-score. For the learning rate, we search over {1e-2, 5e-2, 1e-3, 5e-3, 5e-4} for prefix-based methods, and {3e-5, 5e-5} for fine-tuning. For kernelized prefix-propagation, we search for a scale factor (hyperparameter α) over {1e-2, 4e-2, 1e-3, 3e-3, 5e-3, 7e-3} (after choosing the best learning rate). Other hyperparameters are listed in Table 5.
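The effective batch size of 32 is obtained via gradient accumulation: per-micro-batch gradients are accumulated (scaled by their fraction of the full batch) before a single optimizer step. The toy numpy sketch below illustrates the idea on a linear model, where the identity holds exactly because the loss gradient is a mean over samples; it is an illustration of the technique, not our training code.

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(32, 3)), rng.normal(size=32)
w = np.zeros(3)  # linear-model parameters

def grad_mse(w, Xb, yb):
    """Gradient of mean squared error for a linear model on a batch."""
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

# Accumulate gradients over 4 micro-batches of 8 before one optimizer step;
# scaling each by its share of the full batch reproduces the batch-of-32
# gradient exactly.
accum = np.zeros(3)
for i in range(0, 32, 8):
    Xb, yb = X[i:i + 8], y[i:i + 8]
    accum += grad_mse(w, Xb, yb) * (len(yb) / 32)

assert np.allclose(accum, grad_mse(w, X, y))
```

This is why accumulation lets long-document models with large memory footprints train at effective batch sizes that would not fit on a single 1080 Ti.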
Despite seeding random number generators for Hugging Face's transformers library through the set_seed method, slight deviations will propagate when using GPUs, due to some non-deterministic CUDA methods that do not respect the seed-setting mechanisms of PyTorch (Paszke et al., 2019). Upon further analysis, we found that this issue with non-deterministic algorithms is widely overlooked in the field, and believe that this area needs further discussion in the research community. However, we note that our results should be reproducible when running across multiple seeds.
Task Details All datasets used have a considerable portion of documents greater than RoBERTa's max sequence limit of 512 tokens, as shown in Figure 3. Number of samples and number of classes for each dataset are in Table 6.
For all classification tasks, we prepend a globally-attended [CLS] token to the start of the sequence and pass the output into a learned classification head. We truncate document lengths to 4096 and 512 tokens for Longformer and RoBERTa, respectively. For Hyperpartisan, we use the same data pre-processing and training split as Beltagy et al. (2020). However, we noticed overlap between training and testing samples, so we instead show validation results. We use the ArXiv dataset from He et al. (2019) that is available on Huggingface datasets (which we reviewed for correctness). The original dataset has labels leaked in the source text, so we use the no_ref version, which has those labels filtered. We use the 20-newsgroups dataset and follow the preprocessing recommended by the scikit-learn authors, removing headers, quotations, and signatures from each sample to prevent the model from learning spurious correlations.
WikiHop instances include a question, candidate answers, and multiple context documents. We chunk the context documents to fit the model's maximum sequence length (512 tokens for RoBERTa) and pass each chunk separately through the model, concatenated to the question and candidate pair. We then train a classifier to predict a single logit for each [ent] token, take the average over all chunks, apply softmax, and finally use cross-entropy loss. We also train the new special tokens [ent] and [q] in prefix-based methods to better learn an effective representation (as they did not appear in pre-training).
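The chunk-aggregation step described above can be sketched as follows. The function name and input shape are our assumptions for illustration: one logit per candidate answer per chunk, averaged across chunks before the softmax.

```python
import numpy as np

def score_candidates(chunk_logits):
    """Average per-candidate logits across document chunks, then softmax.

    chunk_logits: (num_chunks, num_candidates) array, one logit per
    candidate's [ent] token per chunk.
    """
    avg = np.mean(chunk_logits, axis=0)  # average over chunks
    p = np.exp(avg - avg.max())          # numerically stable softmax
    return p / p.sum()

# Two chunks scoring three candidate answers.
logits = np.array([[2.0, 0.5, -1.0],
                   [1.5, 1.0, -0.5]])
probs = score_candidates(logits)
assert np.isclose(probs.sum(), 1.0)
print(probs.argmax())  # → 0 (highest averaged logit)
```

At training time, the resulting distribution would feed a cross-entropy loss against the gold candidate; at inference, the argmax gives the predicted answer.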
C Impact of Training Time on ECE

As is apparent in Figure 4, prefix-propagation is better calibrated relative to other approaches throughout training. Prefix-tuning and fine-tuning, however, either start less calibrated or deviate from prefix-propagation as training progresses.

Table 7: Runtime for inference using "No PEFT" (i.e., a regular forward pass), prefix-tuning, and prefix-propagation. "Relative Runtime" is the runtime relative to "No PEFT".

D Runtime Performance
We test the inference time of the studied methods and show the results in Table 7. We use the same 8000 randomly generated sequences of length 4096 across methods and test on an NVIDIA GTX 1080 Ti. We notice that prefix-propagation is slightly more efficient than prefix-tuning. We theorize that this discrepancy is caused by prefix-propagation only needing to concatenate a matrix in the first layer (and sum in the rest), whereas prefix-tuning concatenates before every layer.