Parameter-Efficient Language Model Tuning with Active Learning in Low-Resource Settings



Introduction
Pre-trained language models (PLMs) have rapidly become a staple in the field of natural language processing. With the growing demand for data for training these models, developing efficient fine-tuning methods has become critical. This is particularly relevant for many domains and languages where it is difficult or downright impossible to obtain large amounts of labeled training data. In such low-resource settings, it becomes essential to effectively leverage and adapt PLMs while minimizing the need for extensive labeled data.
Data labeling is notoriously time-consuming and expensive, often hindering the development of sizable labeled datasets required for training high-performance models. Active learning (AL) (Cohn et al., 1996; Settles, 2009) has emerged as a potential solution to this challenge. In contrast to passive learning, in which the training set is sampled at random, AL encompasses a family of machine learning algorithms specifically designed to reduce labeling costs by reducing label complexity, i.e., the number of labels required by an acquisition model to achieve a certain level of performance (Dasgupta, 2011). With the advent of PLMs, AL research has pivoted towards investigating training regimes for PLMs, such as task-adaptive pre-training (TAPT; Gururangan et al., 2020), that could be combined with AL to further reduce label complexity.
While AL explicitly aims to minimize the label complexity of learning, it is also important to consider reducing the parameter complexity of the acquisition model. Effectively lowering parameter complexity can lead to a corresponding reduction in label complexity, further improving training efficiency. As PLMs grow larger, fine-tuning becomes increasingly challenging due to the sheer number of parameters involved. To address this issue, adapters (Houlsby et al., 2019) have been recently introduced as compact modules that can be incorporated between the layers of PLMs. These adapters enable a considerable degree of parameter sharing, thereby fostering the practice of parameter-efficient fine-tuning (PEFT) through modular learning (Pfeiffer et al., 2023). During the tuning process for a downstream task, only the parameters of the adapters are updated, resulting in a significantly reduced number of parameters. Recent research (He et al., 2021; Li and Liang, 2021; Karimi Mahabadi et al., 2021) has revealed that some PEFT methods outperform full fine-tuning (FFT) in low-resource settings, potentially due to better stability and a decreased risk of overfitting. In contrast, FFT has been shown to exhibit instability in scenarios with limited data, whereas PEFT methods have proven to be well-suited for such circumstances.
Despite the promising results demonstrated by PEFT methods in low-resource settings, a notable gap remains in the research landscape. Given that the majority of real-world AL scenarios involve a restricted amount of data, PEFT methods emerge as strong candidates for acquisition models in AL. However, there has been no exploration of AL in conjunction with adapters. Investigating this uncharted territory can further advance our understanding of AL and reveal novel strategies for optimizing performance in low-resource settings.
In this paper, we present an empirical study on the behavior of PEFT in low-resource settings for text classification tasks. We analyze PEFT with and without AL and compare it against FFT. While our results confirm that PEFT exhibits superior performance in low-resource setups compared to FFT, we show that the improved performance with PEFT extends to AL scenarios in terms of performance gains over passive learning. Furthermore, we analyze the efficacy of TAPT in conjunction with AL and PEFT. We find that TAPT is beneficial in AL scenarios for both PEFT and fully fine-tuned models, thus representing a viable technique for improving performance in low-resource scenarios. Finally, aiming to illuminate why PEFT and TAPT improve AL performance in low-resource settings, we analyze the properties of PEFT and FFT via forgetting dynamics (Toneva et al., 2019) and PLMs' instance-level representations. We find that AL methods choose fewer unforgettable and more moderately forgettable examples when combined with PEFT and TAPT, where forgetfulness indicates the model's tendency to learn and forget the gold label of a particular instance. Compared to FFT, we observe that PEFT yields early and middle layer representations more similar to the base model, which might suggest increased stability and targeted regularization during training in the low-resource setup. We hypothesize that the stability of early and middle layers mitigates the issue of forgetting the knowledge obtained during pre-training when fine-tuning for downstream tasks.
In summary, we showed that (1) PEFT yields greater improvements in performance and increased stability compared to FFT in AL low-resource scenarios for text classification and (2) TAPT enhances the overall text classification performance of adapters and is well-suited for AL scenarios. We also found that (3) AL methods choose fewer unforgettable and more moderately forgettable examples with PEFT and TAPT and (4) PEFT produces instance-level representations of early and middle layers that are more similar to the base PLM than FFT. Our results uncover the intricacies of positive interactions between AL, PEFT, and TAPT, providing empirical justification for their combined use in low-resource settings.

Related Work
Our study encompasses research efforts that marry AL with PLMs and explore the application of PEFT techniques in low-resource contexts.
AL with PLMs. Until recently, the conventional approach for integrating PLMs with AL involved performing full fine-tuning with a fixed number of training epochs and training the model from scratch in each AL step (Ein-Dor et al., 2020; Margatina et al., 2021; Shelmanov et al., 2021; Karamcheti et al., 2021; Schröder et al., 2022). However, studies by Mosbach et al. (2021) and Zhang et al. (2021) revealed that fine-tuning in low-resource setups is prone to instability, particularly when training for only a few epochs. This instability, often sensitive to weight initialization and data ordering (Dodge et al., 2020), presents a significant challenge for AL, which frequently operates in low-resource settings. Recent research has looked into the impact of PLM training regimes on AL performance (Grießhaber et al., 2020; Yuan et al., 2020; Yu et al., 2022), suggesting that the choice of training regime is more critical than the choice of the AL method. Notably, TAPT has proven particularly effective in enhancing AL performance (Margatina et al., 2022; Jukić and Šnajder, 2022).
Adapters in low-resource settings. Research on adapters in low-resource settings has primarily focused on areas such as cross-lingual transfer for low-resource languages (Ansell et al., 2021; Lee et al., 2022; Parović et al., 2022), where the emphasis lies on exploring diverse methods of fusing adapters. In monolingual settings with scarce data, adapters have been found to outperform full fine-tuning (Li and Liang, 2021; Mao et al., 2022). A study by He et al. (2021) demonstrated that adapter-based tuning exhibits enhanced stability and generalization capabilities by virtue of being less sensitive to learning rates than traditional fine-tuning methods. Furthermore, incorporating task adaptation techniques, such as TAPT, has been shown to improve performance in limited-data scenarios.
However, it appears that TAPT yields diminishing returns when sufficient data is available (Kim et al., 2021).
While adapters have shown promise in low-resource settings, their application in the context of AL, where such conditions are common, remains unexplored. We investigate the performance of adapters in AL scenarios, aiming to expand our understanding of why they exhibit favorable properties in low-resource settings.

Preliminaries
In this section, we outline our experimental setup, providing details on the datasets we employed, as well as the PEFT and AL methods used in our study.

Datasets
We employ four single-text classification tasks commonly used for AL evaluation. These include:
• The Subjectivity dataset (SUBJ; Pang and Lee, 2004), designed to assess the subjectivity of a given text;
• The Question Type classification dataset (TREC; Li and Roth, 2002), designed for categorizing questions according to their types;
• The Stanford Sentiment Treebank (SST; Socher et al., 2013), which focuses on sentiment analysis;
• AG's News classification dataset (AGN; Zhang et al., 2015), which classifies news articles into different categories.
We provide the dataset statistics in the appendix for further reference (cf. Appendix Table 3).

PEFT methods
In our experiments, we explore four distinct PEFT techniques that introduce additional trainable parameters, covering the main approaches to PEFT.
Adapter incorporates trainable bottleneck layers after both the multi-head attention and feedforward block in each Transformer layer (Houlsby et al., 2019).
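The bottleneck structure described above can be sketched in a few lines. This is an illustrative simplification in plain NumPy (a single hidden state, a ReLU nonlinearity, no biases or layer normalization), not the implementation used in our experiments:

```python
import numpy as np

def adapter_forward(h, W_down, W_up):
    """Bottleneck adapter applied after a Transformer sublayer:
    down-project to a small bottleneck, apply a nonlinearity,
    up-project back, and add a residual connection."""
    z = np.maximum(0.0, W_down @ h)   # ReLU in the bottleneck
    return h + W_up @ z               # residual connection around the adapter

# Toy dimensions: model dimension 8, bottleneck size 2.
rng = np.random.default_rng(0)
h = rng.normal(size=8)
W_down = 0.1 * rng.normal(size=(2, 8))
W_up = 0.1 * rng.normal(size=(8, 2))
out = adapter_forward(h, W_down, W_up)
```

During fine-tuning, only `W_down` and `W_up` would be updated; the surrounding Transformer weights stay frozen.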
Prefix-tuning adds new parameters in the multi-head attention blocks within each Transformer layer (Li and Liang, 2021).
LoRA (Low-rank adaptation) represents an additive method that incorporates trainable low-rank decomposition matrices into the layers of a pretrained model (Hu et al., 2022).
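As a rough sketch of the low-rank update, assuming the common parameterization with a frozen weight W, trainable factors A and B, and an alpha/r scaling (the variable names are ours, not from the original implementation):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16):
    """LoRA forward pass: y = W x + (alpha / r) * B (A x), where W is the
    frozen pre-trained weight and A (r, d_in), B (d_out, r) are trainable
    low-rank factors. B starts at zero, so the module initially reproduces
    the frozen model exactly."""
    r = A.shape[0]
    return W @ x + (alpha / r) * (B @ (A @ x))

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))      # frozen pre-trained weight
A = rng.normal(size=(2, 4))      # rank r = 2
B = np.zeros((6, 2))             # zero-initialized factor
x = rng.normal(size=4)
y = lora_forward(x, W, A, B)     # equals W @ x at initialization
```

The zero initialization of `B` is what makes LoRA a no-op before training starts, which helps stability in low-resource fine-tuning.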
UniPELT combines multiple PEFT approaches, namely LoRA, Prefix-tuning, and Adapter, in a single unified setup (Mao et al., 2022). Each constituent is a submodule, and UniPELT employs gating mechanisms to activate them effectively.
All the mentioned PEFT methods fall under the category of lightweight fine-tuning. While prefix-tuning does not technically qualify as an adapter, He et al. (2022) demonstrated that it shares formal similarities with adapters, with prefix-tuning performing weighted addition and an adapter employing unweighted addition. We refer to all four considered PEFT methods as adapters for terminological simplicity. We use BERT (Devlin et al., 2018) as the base model for every adapter.

Active learning methods
Our study employs five sampling strategies, including random selection as a passive learning baseline.The other four strategies are active learning methods originating from different families.We have deliberately chosen these methods for their robustness (ability to perform well across various tasks) and widespread usage in the field.
Random selection (RND) uniformly selects instances from the unlabeled pool.
Maximum entropy (ENT; Lewis and Gale, 1994) comes from the family of uncertainty strategies. The method queries instances where the model is least certain based on the maximum entropy criterion of the prediction output.
Monte Carlo dropout (MC; Gal and Ghahramani, 2016) resembles ENT but utilizes the stochasticity of forward passes with dropout layers (Srivastava et al., 2014) to estimate the entropy for a given instance. In our experiments, we use ten inference cycles to approximate the entropy of the output via Monte Carlo dropout sampling.
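The two uncertainty strategies above can be sketched as follows. This is a simplified NumPy illustration: `probs` stands in for the acquisition model's softmax outputs, and the MC variant receives the outputs of T stochastic forward passes stacked along the first axis:

```python
import numpy as np

def entropy(probs):
    """Predictive entropy per instance; probs has shape (n, num_classes)."""
    return -np.sum(probs * np.log(probs + 1e-12), axis=1)

def max_entropy_query(probs, k):
    """ENT: select the k pool instances with the highest predictive entropy."""
    return np.argsort(-entropy(probs))[:k]

def mc_dropout_query(prob_samples, k):
    """MC: average class probabilities over T stochastic forward passes
    (dropout kept active at inference), then rank by entropy of the mean.
    prob_samples has shape (T, n, num_classes)."""
    return max_entropy_query(prob_samples.mean(axis=0), k)

# Toy pool: instance 0 is maximally uncertain, instance 2 is nearly certain.
probs = np.array([[0.5, 0.5], [0.8, 0.2], [0.99, 0.01]])
picked = max_entropy_query(probs, k=1)
samples = np.stack([probs, probs])      # two (identical) stochastic passes
picked_mc = mc_dropout_query(samples, k=1)
```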
Core-set (CS; Sener and Savarese, 2018) encourages instance diversity by using the learned representations of the acquisition model. This method aims to minimize the distance between an example in the unlabeled set and its closest counterpart in the labeled subset. We use the [CLS] token representation from the Transformer's penultimate layer. We follow the greedy method described in the original work.
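A minimal sketch of the greedy k-center selection on toy 2-D embeddings (in practice the embeddings would be the [CLS] representations mentioned above; the helper names are ours):

```python
import numpy as np

def coreset_greedy(pool_emb, labeled_emb, k):
    """Greedy core-set (k-center) selection: repeatedly pick the pool
    point whose distance to the nearest covered point (labeled set plus
    previous picks) is largest, then update the coverage distances."""
    dists = np.min(
        np.linalg.norm(pool_emb[:, None, :] - labeled_emb[None, :, :], axis=2),
        axis=1,
    )
    picked = []
    for _ in range(k):
        i = int(np.argmax(dists))
        picked.append(i)
        dists = np.minimum(dists, np.linalg.norm(pool_emb - pool_emb[i], axis=1))
    return picked

# Toy 2-D embeddings: the labeled set is a single point at the origin.
pool = np.array([[0.1, 0.0], [5.0, 5.0], [0.2, 0.1]])
labeled = np.array([[0.0, 0.0]])
picked = coreset_greedy(pool, labeled, k=1)   # the farthest, most diverse point
```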
Discriminative active learning (DAL; Gissin and Shalev-Shwartz, 2019) frames active learning as a binary classification task of predicting whether a particular instance is labeled or unlabeled, with the goal of making the labeled and unlabeled sets indistinguishable. Specifically, DAL queries instances that a trained classifier is most likely to consider part of the unlabeled subset.
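A self-contained sketch of the DAL idea, using a small hand-rolled logistic-regression discriminator in place of the classifier used in practice (all names and hyperparameters here are illustrative):

```python
import numpy as np

def dal_query(pool_emb, labeled_emb, k, steps=500, lr=0.5):
    """DAL sketch: fit a logistic-regression discriminator to separate
    labeled (class 0) from unlabeled (class 1) embeddings, then query the
    pool instances the discriminator most confidently deems unlabeled."""
    X = np.vstack([labeled_emb, pool_emb])
    y = np.concatenate([np.zeros(len(labeled_emb)), np.ones(len(pool_emb))])
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):                      # plain batch gradient descent
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # P(unlabeled | x)
        g = p - y
        w -= lr * (X.T @ g) / len(y)
        b -= lr * g.mean()
    scores = pool_emb @ w + b                   # higher = more "unlabeled-like"
    return np.argsort(-scores)[:k]

# Toy setup: labeled points cluster near the origin; one pool point is far away.
labeled = np.array([[0.0, 0.0], [0.1, 0.0]])
pool = np.array([[0.05, 0.05], [3.0, 3.0]])
picked = dal_query(pool, labeled, k=1)
```

The far-away pool point is easiest for the discriminator to recognize as unlabeled, so it is queried first.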

Experimental setup
In AL runs, we selected 50 new examples in each step of each AL experiment, using 100 examples for the warm start (randomly sampled labeled data to initiate the model). To probe different PEFT approaches with and without AL in low-resource settings, we established a labeling budget limit of 1,000 instances.
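The resulting loop can be sketched as follows. Training the acquisition model and scoring the pool are stubbed out with placeholder random scores (making the sketch equivalent to RND); only the budget bookkeeping reflects our actual setup of 100 warm-start labels, 50 labels per step, and a 1,000-label budget:

```python
import numpy as np

def al_loop(n_pool, warm_start=100, step=50, budget=1000, seed=0):
    """Bookkeeping of the AL loop: a random warm start of `warm_start`
    labels, then `step` newly labeled instances per iteration until
    `budget` is reached. In a real run, the placeholder scores would come
    from an acquisition model retrained on the current labeled set."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_pool)
    labeled = list(indices[:warm_start])
    pool = list(indices[warm_start:])
    sizes = [len(labeled)]
    while len(labeled) < budget and pool:
        scores = rng.random(len(pool))          # placeholder acquisition scores
        chosen = set(np.argsort(-scores)[:step].tolist())
        labeled += [p for j, p in enumerate(pool) if j in chosen]
        pool = [p for j, p in enumerate(pool) if j not in chosen]
        sizes.append(len(labeled))
    return sizes

sizes = al_loop(n_pool=5000)   # labeled-set sizes at each AL step
```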
To circumvent the need for a validation set in our experiments, which is frequently unavailable in real-world AL scenarios, we adopted Besov early stopping (Jukić and Šnajder, 2022). This approach leverages the smoothness of Transformer layers to determine the appropriate epoch at which to halt training.
In the case of TAPT, we pre-trained the base model on a masked language modeling task using unlabeled training data. For adapters, we only updated the injected parameters while keeping the remaining parameters of the base model frozen. This approach aligns with the primary function of adapters, which is to utilize a common base model across diverse tasks. In the implementation of TAPT, we randomly masked 15% of tokens for both FFT models and adapters and trained the model via self-supervision for 50 epochs with the learning rate set to 10^-5. Other training and hyperparameter details are given in Appendix A.3.
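The masking step of this TAPT phase can be illustrated as below. This is a simplified sketch: it replaces every masked position with a single mask id (103 is the [MASK] id in BERT's uncased vocabulary) and omits the 80/10/10 replacement scheme of the original BERT objective:

```python
import numpy as np

def mask_for_mlm(token_ids, mask_id, mask_prob=0.15, seed=0):
    """Corrupt a token sequence for masked language modeling: mask a
    random `mask_prob` fraction of positions. Returns the corrupted
    inputs and the labels, with -100 marking positions excluded from
    the loss (the convention used by common MLM implementations)."""
    rng = np.random.default_rng(seed)
    token_ids = np.asarray(token_ids)
    mask = rng.random(token_ids.shape) < mask_prob
    labels = np.where(mask, token_ids, -100)   # predict only masked tokens
    inputs = np.where(mask, mask_id, token_ids)
    return inputs, labels

ids = list(range(1000, 1100))                  # toy token-id sequence
inputs, labels = mask_for_mlm(ids, mask_id=103)
```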
For every setting, we performed five runs with different random seeds. We report the average F1 score at each sampling step (with and without AL for FFT and PEFT) to show the corresponding learning curve averaged over five runs, with standard error denoting the confidence intervals.

Evaluation
In order to gauge the overall impact of an AL method, we resort to the area under the performance curve (AUC). In each individual AL step with a specific quantity of labeled examples, we measure the classification performance in terms of the F1 score, which is subsequently used in computing AUC. We advocate for using AUC alongside the AL curves, as it serves as a suitable approximation of AL feasibility through a summary numeric score, which is in line with recent AL literature (Schröder et al., 2022; Jukić and Šnajder, 2022). As our experiments involve different training regimes, we compare each AL sampling strategy S_AL to passive learning S_PL within the same training regime to isolate the effects of AL. The primary objective of AL is to improve the label efficiency of passive learning. To test whether AL is successful, we compare it against random sampling as the baseline, where we compute the difference between the AUC of a specific AL method and that of the corresponding passive learning curve. We normalize this difference by the maximum possible improvement over AUC(S_PL) in order to evaluate the relative improvement over passive learning (RIPL) and make the metric more comparable across different datasets. We define this metric as follows:

RIPL(S_AL) = (AUC(S_AL) − AUC(S_PL)) / (1 − AUC(S_PL)),

where AUC scores are normalized to the unit interval. Intuitively, the RIPL metric provides an estimate of the proportion of maximum possible improvement achievable by a given AL method in comparison to the passive learning baseline. A score of 1 signifies the maximum theoretical improvement, which would be tantamount to attaining an F1 score of 1 in the initial sampling step and sustaining that score through the final step. Conversely, a negative score indicates that the AL method performs worse than passive learning.
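A small sketch of how AUC and RIPL can be computed from a learning curve of F1 scores, assuming AUC is taken with the trapezoidal rule on a unit-spaced grid of AL steps and normalized to [0, 1] (so that a constant F1 of 1 gives AUC = 1):

```python
def auc(f1_scores):
    """Area under the learning curve (trapezoidal rule) on a unit-spaced
    grid of AL steps, normalized so a constant F1 of 1 gives AUC = 1."""
    n = len(f1_scores)
    area = sum((f1_scores[i] + f1_scores[i + 1]) / 2 for i in range(n - 1))
    return area / (n - 1)

def ripl(f1_al, f1_pl):
    """Relative improvement over passive learning: 1.0 corresponds to an
    F1 of 1 from the first step onward; negative values mean the AL
    method underperforms passive learning."""
    auc_pl = auc(f1_pl)
    return (auc(f1_al) - auc_pl) / (1.0 - auc_pl)

al_curve = [0.70, 0.80, 0.85]   # toy F1 scores per AL step
pl_curve = [0.60, 0.70, 0.80]
score = ripl(al_curve, pl_curve)
```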

Experiments
In this section, we first examine the performance of PEFT methods in comparison to FFT within passive learning scenarios. Subsequently, we conduct a battery of experiments in AL settings to explore their efficacy further.

PEFT vs. FFT
Previous research that investigated the use of adapters in low-resource settings (Li and Liang, 2021; Mao et al., 2022; He et al., 2021) demonstrated that adapters perform at least as well as, and sometimes even better than, FFT. However, these conclusions were primarily derived from comparing FFT to a single adapter variant on a full dataset or from evaluating the performance of adapters at only a few discrete points.
In the first part of our experiments, we build upon these findings by conducting a more nuanced analysis. We generate detailed learning curves that facilitate the comparison of multiple adapters with FFT across a range of conditions. Our comparison, summarized by the AUC metric in Table 1, reveals that UniPELT and Prefix-tuning consistently outshine FFT across all datasets used in our study. Conversely, the performance of Adapter and LoRA is mostly comparable to FFT, although there are instances where they either outperform or underperform FFT. Figure 1 shows the corresponding learning curves under the passive learning setup, where it is possible to dissect the performance dynamics as the training set increases. The performance disparities between adapters and FFT become more noticeable under conditions of extreme data scarcity, particularly when dealing with only 100-300 labeled instances. Notably, the greatest differences in performance appear at the initial step, with only 100 labels, which is consistent across all datasets. These properties highlight the potential of adapter-based methods in effectively handling low-resource scenarios, with Prefix-tuning and UniPELT emerging as the more successful approaches compared to Adapter and LoRA.

AL with PEFT
Our initial findings using PEFT in low-resource settings further support the notion of employing PEFT in AL scenarios. We conducted evaluations of individual PEFT methods in AL scenarios with and without the application of TAPT. Table 2 presents the performance of PEFT methods in terms of gains over random sampling (passive learning). We observe that without TAPT, PEFT methods demonstrate greater gains than FFT. When TAPT is applied, FFT becomes comparable to PEFT, although Prefix-tuning and UniPELT still yield the most significant improvements, depending on the specific dataset and AL method employed.
We further investigate the behavior of AL methods and their performance throughout individual steps, where we compare adapters against FFT models. Figure 2 shows the learning curves for corresponding adapter models with and without applying TAPT. We show the learning curves for the SUBJ dataset to avoid clutter, as we noticed similar trends on the other datasets we used. Prefix-tuning and UniPELT seem to be invariant to the particular AL method, showing better overall performance compared to Adapter and LoRA. These latter adapters, however, show superior label efficiency with ENT and MC, whereas CS and DAL yield modest improvements but still outdo passive learning. These results are particularly pronounced when TAPT is applied to Adapter and LoRA. TAPT typically boosts performance throughout the AL steps, though it appears that CS and DAL glean less benefit from it than the entropy-based methods we used.

Analysis
We now look into the properties of adapters and FFT models to examine their behavior with AL. First, we analyze the influence of TAPT on the forgetting dynamics during training. We continue with example-level representation analysis, where we investigate the representation similarity of PEFT and FFT to their respective base models.

Forgetting dynamics
To understand forgetting dynamics, we draw upon the study by Toneva et al. (2019), focusing on the occurrence of forgetting events, i.e., cases where a specific training example transitions from correct to incorrect classification over the course of multiple learning epochs. We delve into forgetting dynamics by dividing instances into three categories: (1) unforgettable instances, i.e., those that have never experienced a forgetting event during training; (2) instances that have witnessed one or two forgetting events, labeled as moderately forgettable; and (3) instances subjected to three or more forgetting events, referred to as highly forgettable instances.
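Counting forgetting events and assigning instances to the three categories reduces to scanning per-epoch correctness histories; a sketch (the data layout is illustrative):

```python
def forgetting_profile(correctness):
    """Categorize training instances by their number of forgetting events.

    correctness: dict mapping instance id to a list of per-epoch booleans
    (True if the gold label was predicted correctly in that epoch).
    A forgetting event is a correct -> incorrect transition."""
    profile = {"unforgettable": [], "moderate": [], "highly": []}
    for idx, hist in correctness.items():
        events = sum(1 for a, b in zip(hist, hist[1:]) if a and not b)
        if events == 0:
            profile["unforgettable"].append(idx)
        elif events <= 2:
            profile["moderate"].append(idx)
        else:
            profile["highly"].append(idx)
    return profile

hist = {
    0: [True, True, True],                       # never forgotten
    1: [True, False, True, False, True, False],  # three forgetting events
    2: [False, True, False, True, True],         # one forgetting event
}
prof = forgetting_profile(hist)
```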
Table 2: Improvement over passive learning in terms of the RIPL metric. We present the results for all combinations of the adapters and datasets utilized in our study. In addition, we compare the outcomes of the regular approach with those achieved when we apply TAPT to the pre-trained base model. Positive values indicate improvement over passive learning, while negative values represent situations where performance drops compared to passive learning.

Figure 3 shows the frequency of instances across the three forgetting categories. In particular, we contrast RND with MC, an AL method which consistently enhances performance across all datasets. We find that FFT selects a larger number of unforgettable instances compared to adapters, while it picks fewer moderately forgettable instances. Intriguingly, when TAPT is introduced, these disparities in forgetting profiles between FFT and two of the adapters, Prefix-tuning and UniPELT, seem to vanish. Conversely, TAPT appears to intensify the differences between FFT and the remaining two adapters, LoRA and Adapter, which generally exhibit smaller improvements than Prefix-tuning and UniPELT. We speculate that LoRA and Adapter might not reap as much benefit from TAPT and thus fail to achieve an improved forgetting profile, which would ideally consist of fewer unforgettable instances and a greater number of moderately forgettable ones compared to the profiles achieved with FFT without TAPT.

Representation analysis
To analyze the effect of PEFT and FFT on AL selection, we utilized centered kernel alignment (CKA) as a similarity measure between two sets of representations (Kornblith et al., 2019). It has been shown previously that PEFT methods result in representations closer to the base model at the token level (He et al., 2021). We extend the analysis to example-level representations to explore the behavior of models with AL. CKA is invariant to orthogonal transformations and isotropic scaling but, by design, not to arbitrary invertible linear transformations, which is important for measuring meaningful similarities between representations of higher dimensions than the number of data points.
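Linear CKA between two example-level representation matrices can be computed directly from the feature matrices; a sketch, assuming the linear-kernel variant of CKA with mean-centered features:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between representation matrices X (n, d1) and Y (n, d2):
    ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) after mean-centering.
    Invariant to orthogonal transformations and isotropic scaling."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    return (np.linalg.norm(Y.T @ X, "fro") ** 2
            / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")))

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 32))                    # 10 examples, 32-dim features (d > n)
Q = np.linalg.qr(rng.normal(size=(32, 32)))[0]   # random orthogonal matrix
sim_self = linear_cka(X, X)                      # identical representations
sim_rot = linear_cka(X, X @ Q)                   # orthogonal transformation of X
```

The orthogonal-invariance check (`sim_rot`) illustrates why CKA remains well-behaved even when the feature dimensionality exceeds the number of examples.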
In Figure 4, we compare the similarity between the base model and the adapter model, and between the base model and the fully fine-tuned model. We find that PEFT representations are more similar to the base model in the early and middle layers when using a specific AL method. Specifically, up to the eighth layer, representations are substantially more similar in adapters than in FFT models. In the final four layers, the difference in CKA scores between the adapter and the FFT model is close to zero. Interestingly, the penultimate, 11th layer is more similar to the base model in the FFT model.

Conclusion
Our study has shed light on the advantages of parameter-efficient fine-tuning (PEFT) in low-resource settings, confirming its superiority over full fine-tuning (FFT) methods. Importantly, we have demonstrated that the integration of PEFT with active learning (AL) can offer substantial performance gains compared to passive learning strategies, even in settings where labeled data is scarce. Furthermore, we highlighted the potential of task-adaptive pre-training (TAPT) to improve model performance further when used in conjunction with both PEFT and AL. We found that AL methods in combination with PEFT and TAPT tend to select fewer unforgettable instances and more moderately forgettable examples. We further found that PEFT maintains the integrity of early and middle layer representations similar to the base model, thereby preventing forgetting during downstream task fine-tuning. These insights inform us of a possible underpinning mechanism that contributes to PEFT's superior performance and stability in low-resource settings. Overall, our work highlights the potential of PEFT and AL and establishes a foundation for developing increasingly efficient and cost-effective approaches for training models in low-resource settings.

Figure 2: Learning curves for AL compared with random sampling. We report the results for SUBJ. The first row features learning curves for adapters without the application of TAPT, while the second row highlights their corresponding curves with TAPT implemented. The final row compares FFT, both with and without TAPT. The results are averaged over five iterations, and the confidence intervals represent the standard deviation. Best viewed on a computer screen.

Table 1: Comparison of adapters with FFT in passive learning setup. We report the AUC metric (based on F1 score) averaged over five runs. Numbers in bold denote the best-performing approach for a particular dataset.

Figure 1: Learning curves for passive learning with different PEFT methods and FFT. The results are averaged over five runs. The confidence intervals denote the standard deviation. Best viewed on a computer screen.