Sparse Low-rank Adaptation of Pre-trained Language Models

Fine-tuning pre-trained large language models in a parameter-efficient manner is widely studied for its effectiveness and efficiency. The popular method of low-rank adaptation (LoRA) offers a notable approach, hypothesizing that the adaptation process is intrinsically low-dimensional. Although LoRA has demonstrated commendable performance, it is implemented with a fixed and unalterable intrinsic rank that might not always be the ideal choice. Recognizing the need for more flexible adaptation, we extend the methodology of LoRA to an innovative approach we call sparse low-rank adaptation (SoRA) that enables dynamic adjustments to the intrinsic rank during the adaptation process. We achieve this through the incorporation of a gate unit, optimized with the proximal gradient method in the training stage, whose sparsity controls the effective rank. In the subsequent inference stage, we eliminate the parameter blocks corresponding to zeroed-out ranks to reduce each SoRA module back to a concise yet rank-optimal LoRA. Our approach strengthens the representation power of LoRA by initializing it with a higher rank, while efficiently taming a temporarily increased number of parameters via sparse updates. We further introduce a sparsifying scheduler for SoRA, aiming to examine the impact of the number of non-zero parameters on the model's memorization and generalization. Our experimental results demonstrate that SoRA can outperform other baselines even with 70% retained parameters and 70% training time.


Introduction
Adapting large-scale pre-trained language models (Devlin et al., 2019; Brown et al., 2020; He et al., 2020; Bommasani et al., 2021; Han et al., 2021; Touvron et al., 2023) in a parameter-efficient (He et al., 2022; Ding et al., 2023; Hu et al., 2023) manner is increasingly gaining traction within the research community. The methods of this paradigm typically keep most of the parameters of the underlying model unchanged, and either insert additional trainable parameters into the model (Houlsby et al., 2019; Li and Liang, 2021), specify a small number of parameters (Zaken et al., 2021; Liu et al., 2021; Su et al., 2023) to be trainable, or reparameterize the adaptation process into a more efficient form (Hu et al., 2021; Qin et al., 2021). They have been validated to be effective across various models and tasks, often yielding comparable or even better results than full-parameter fine-tuning.
The development potential of parameter-efficient fine-tuning became evident after extensive validation of its performance. These methods offer the opportunity to adapt the base model to fit any data, allowing for enhancements and customization of language models tailored to specific tasks and personalized user characteristics. Due to the lightweight nature of the optimized parameters, they can be seamlessly plugged into the model, allowing targeted enhancements to be made. Among these methods, low-rank adaptation (LoRA (Hu et al., 2021)) is considered one of the most efficient methods at present. It assumes that the change of the model's parameters after adaptation is "intrinsically low-dimensional" and performs adaptation by optimizing the matrices obtained from low-rank decomposition. LoRA avoids the forward propagation latency caused by inserting additional neural modules while demonstrating stable performance. Although effective, how to set the intrinsic rank (normally a hyperparameter) remains unclear. Intuitively, a larger rank brings a larger optimization space and creates the capacity to handle more challenging tasks. However, in practice, the optimal intrinsic rank varies according to multiple factors such as the backbone model and the task.
Given the enormous computational cost of searching hyperparameters on large-scale models (such as GPT-3 (Brown et al., 2020) with 175 billion parameters and LLaMA (Touvron et al., 2023) with 7 billion to 65 billion parameters), developing a method based on adaptive ranks is a natural approach. Some existing work has attempted to explore this direction (Valipour et al., 2022; Zhang et al., 2023), but these methods are largely heuristic or introduce additional costs. In this paper, we propose SoRA, a simple, effective, and automated method for adaptive parameter-efficient fine-tuning. We introduce a gating module with a proximal gradient descent update under ℓ1 regularization to control the sparsity of the updated matrices. After training, the zero entries of the gating vector index the rows of the down-projection matrix and the columns of the up-projection matrix, which can simply be dropped, so that the module is stored in a more parameter-efficient manner. Compared to other adaptive approaches, the proximal gradient method has a clear mathematical meaning and does not involve extra computations or heuristics. For example, AdaLoRA (Zhang et al., 2023) introduces an additional regularizer to ensure that the down-projection and up-projection matrices strictly adhere to the definition of singular value decomposition (SVD), with each matrix being orthogonal. However, this regularization term incurs substantial computational overhead due to the gradient calculations. In contrast, we eliminate this requirement and instead selectively filter low-rank components by controlling the intermediate diagonal matrix. We compare SoRA and related methods in detail in Section 3.
The mechanism of SoRA also allows us to control the sparsity on the fly and investigate the relationship between the number of non-zero trainable parameters and memorization and generalization capabilities. We propose a sparsifying scheduler and find that the process of model adaptation exhibits a strong "compression capability": even a tiny portion of parameters (fewer than a LoRA of rank 1) can retain considerable performance. Extensive experiments are conducted to demonstrate the effectiveness of our method. Particularly, our model consistently outperforms parameter-efficient baselines with fewer parameters and 30% shorter training time on a wide range of downstream tasks. The code of this work will be publicly available at https://github.com/TsinghuaC3I/SoRA.

A Closer Look at Adaptive Rank
Related Work. Before introducing our approach, we first briefly recap parameter-efficient tuning and our backbone, low-rank adaptation (LoRA). Parameter-efficient tuning is a set of methods that only optimize a small portion of parameters and keep the main model untouched for adaptation. Some parameter-efficient methods insert additional neural modules or parameters into the backbone model, such as Adapter (Houlsby et al., 2019), Prefix and Prompt Tuning (Li and Liang, 2021; Lester et al., 2021). Another line of such methods attempts to specify particular parameters to be trainable or prunable (Guo et al., 2021; Zhao et al., 2020; Zaken et al., 2021). Researchers have derived a series of variants of parameter-efficient methods to improve effectiveness or efficiency (Karimi Mahabadi et al., 2021; Hu et al., 2022; Sung et al., 2022; He et al., 2022). Recently, the applications of parameter-efficient fine-tuning have expanded to multi-modal and instruction-tuning scenarios (Gao et al., 2023; Dettmers et al., 2023). In this paper, we focus on LoRA (Hu et al., 2021), which uses low-rank matrices to approximate the change of weights.
In LoRA, pre-trained weights (denoted as W_0 ∈ R^{p×q}) are frozen, and the trainable LoRA modules are low-rank decomposition matrices W_d ∈ R^{r×q} and W_u ∈ R^{p×r} of the change of each weight matrix, ∆ = W_u W_d ∈ R^{p×q}. In this way, the output of the current layer h can be represented as

h = W_0 x + ∆x = W_0 x + W_u W_d x,

where r ≪ min{p, q} is a hyper-parameter of "intrinsic dimension" that controls the size of the low-rank matrices and the number of trainable parameters. In this section, we primarily focus on the last term, denoting z ← W_u W_d x.
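As a concrete illustration, the forward pass above can be sketched in a few lines of NumPy (a minimal sketch with arbitrary dimensions, not the authors' implementation; W_u is zero-initialized so that adaptation starts from the pre-trained behavior, a common LoRA convention):

```python
import numpy as np

p, q, r = 64, 32, 4                      # output dim, input dim, intrinsic rank r << min{p, q}
rng = np.random.default_rng(0)

W0 = rng.standard_normal((p, q))         # frozen pre-trained weight
Wd = rng.standard_normal((r, q)) * 0.01  # trainable down-projection W_d
Wu = np.zeros((p, r))                    # trainable up-projection W_u, zero-initialized

x = rng.standard_normal(q)
h = W0 @ x + Wu @ (Wd @ x)               # h = W_0 x + W_u W_d x

# with W_u = 0, the low-rank branch starts as a no-op
assert np.allclose(h, W0 @ x)
```

Only W_d and W_u (r(p + q) numbers) are trained, while the pq entries of W_0 stay fixed.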
Adaptive Rank on LoRA. Despite a great step forward in tractability and efficiency, LoRA is still restricted by its inflexibility in selecting the optimal rank r. Unlike continuous hyperparameters such as the learning rate and weight decay, which can be tuned adaptively online during the training process, the LoRA rank r takes discrete values, and changing it directly alters the model structure. The optimal choice of rank can vary across different backbone models and downstream tasks. A conservative choice of a huge rank r can waste training time and computation resources, while setting r too small may degrade model performance and lead to from-scratch re-training. These limitations highlight the importance of upgrading LoRA with an adaptive-rank-selection plug-in.

Several remedies have been proposed in recent years to enable the flexible tuning of LoRA rank. For example, rather than setting a fixed rank, Valipour et al. (2022) introduce DyLoRA, in which a pre-defined discrete distribution p_B(·) is cast over a range of rank choices. This approach is related to but different from nested dropout (Rippel et al., 2014), and can be regarded as optimizing a mixture model with LoRA modules of different ranks.
Nevertheless, tuning the LoRA rank straightforwardly and deterministically appears to be a more attractive approach. To devise such an approach, we first gain a crucial hint from the connection between a matrix's rank and its singular value decomposition (SVD). Let us denote the tunable incremental weight matrix in LoRA by ∆ := W_u W_d. We can then formulate its SVD as

∆ = U Σ V^⊤,

in which U ∈ R^{p×p} and V ∈ R^{q×q} are orthogonal, and Σ ∈ R^{p×q} is a (rectangular) diagonal matrix with diagonal elements being the singular values of ∆:

σ_1 ≥ σ_2 ≥ ⋯ ≥ σ_d ≥ 0,   d = min{p, q}.

For notational convenience, we reshape the diagonal of Σ into a column vector

g := (σ_1, σ_2, …, σ_d)^⊤.

Then we can reformulate the LoRA forward propagation as

h = W_0 x + U_{:,1:d} (g ⊙ (V_{:,1:d}^⊤ x)),

where U_{:,1:d} and V_{:,1:d} collect the first d columns of U and V, and ⊙ denotes the element-wise product (Hadamard product). Note that rank(∆) = ∥g∥_0, the ℓ_0 norm of g. Therefore, tuning the LoRA rank suffices to control the sparsity of the vector g. Zhang et al. proceed along this SVD-based track with their methodology named AdaLoRA (Zhang et al., 2023). In AdaLoRA, the elements in vector g are calibrated such that the number of nonzero entries is smaller than a pre-defined budget b. To be specific, they preserve only the entries with the top-b importance scores, where the importance score is their newly proposed metric of "sensitivity", heuristically constructed from the weight-gradient product. The nonnegativity of the entries of g is reasonably dropped, since a negative g_i can simply be reduced to the positive case by flipping the sign of either u_i or v_i. Besides, they transform the constrained optimization problem into its unconstrained version by replacing the orthogonality conditions U^⊤U = I_p and V^⊤V = I_q with a regularization term

R(U, V) = ∥U^⊤U − I∥_F^2 + ∥V^⊤V − I∥_F^2.   (5)

In spite of the effectiveness demonstrated through experiments, there are still two problems in AdaLoRA that demand rethinking of the methodology and wait for further improvements. First, the sparsity selection criterion in AdaLoRA is based on their newly proposed importance score, which relies on the moving average of the weight-gradient product. Despite its effectiveness in
empirical study, this criterion is largely heuristic and lacks theoretical motivation. Second, both the moving-average operation of importance scores and the gradients of the orthogonality regularization (5) add additional computation cost. Compared to AdaLoRA with the aforementioned limitations, our approach, SoRA, serves as an amelioration with highly simplified updating rules and is backed by the theory of sparsity regularization and proximal gradient methods. The detailed methodology of SoRA will be elaborated in the next section.
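The SVD-based view above can be checked numerically: the rank of ∆ equals the ℓ_0 norm of the singular-value vector g, and routing the input through U, g, and V reproduces ∆x. A small NumPy sketch (illustrative dimensions only):

```python
import numpy as np

rng = np.random.default_rng(3)
p, q, r = 12, 10, 3
Wu = rng.standard_normal((p, r))
Wd = rng.standard_normal((r, q))
Delta = Wu @ Wd                          # incremental weight, rank at most r

U, s, Vt = np.linalg.svd(Delta)          # Delta = U diag(s) V^T (full SVD)
g = s                                    # singular values reshaped into a vector

# rank(Delta) equals the number of non-zero entries of g (its l0 norm)
assert np.sum(g > 1e-8) == np.linalg.matrix_rank(Delta)

# forward propagation through the factorization matches Delta @ x
x = rng.standard_normal(q)
d = min(p, q)
z = U[:, :d] @ (g * (Vt @ x))            # U_{:,1:d} (g ⊙ (V_{:,1:d}^T x))
assert np.allclose(z, Delta @ x)
```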

Our Approach
The key idea of our approach, sparse low-rank adaptation (SoRA), is to dynamically adjust the intrinsic rank in the training process with a sparse gating unit trained by the proximal gradient method. SoRA adopts the previously introduced framework of low-rank decomposition because of its widely validated effectiveness and parameter efficiency.

Sparse Low-rank Adaptation
Module Structure. At the start of building a SoRA module, we pre-define a maximum acceptable rank r_max according to practical or research concerns. Then, each SoRA module inherits two matrices W_d ∈ R^{r_max×q} and W_u ∈ R^{p×r_max} from LoRA for down projection and up projection. The maximum rank r_max is set to be relatively large, but we will show in the subsequent paragraph how to tame it efficiently in a sparse sense. In fact, this is realized by injecting a gating unit g ∈ R^{r_max} between the projection matrices, which imitates the formulation of SVD. The forward propagation of the SoRA module proceeds as follows:

z ← W_d x,   z ← g ⊙ z,   h ← W_u z,

or, more compactly,

h ← W_u (g ⊙ (W_d x)).

Optimization. We optimize the down-projection and up-projection matrices with stochastic gradient methods as in LoRA, while each gate g is updated in a different, sparsity-promoting way:

g_{t+1} ← T_{η_t·λ}( g_t − η_t ∇_g L_0(∆_t) ),   (10)

in which L_0(·) is the original loss function of the language model, ∆ denotes the complete tunable parameter set (including the gates), η_t > 0 stands for the step-size at the t-th iteration, and λ > 0 works as the regularization strength hyperparameter that promotes sparsity. Besides, T_{η_t·λ}(·) in the above expression stands for the element-wise broadcast of the following soft-thresholding function:

T_ξ(x) := sign(x) · max(|x| − ξ, 0),

with ξ = η_t · λ being the threshold. In practice, the true gradient ∇_g L_0 in (10) is approximated by its mini-batch stochastic counterpart.
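The forward pass and the gate update above can be sketched as follows (a minimal NumPy sketch; `grad` is a random stand-in for the true mini-batch gradient ∇_g L_0):

```python
import numpy as np

def soft_threshold(v, xi):
    """Element-wise T_xi(v) = sign(v) * max(|v| - xi, 0)."""
    return np.sign(v) * np.maximum(np.abs(v) - xi, 0.0)

p, q, r_max = 16, 8, 6
rng = np.random.default_rng(1)
W0 = rng.standard_normal((p, q))
Wd = rng.standard_normal((r_max, q))
Wu = rng.standard_normal((p, r_max))
g = rng.standard_normal(r_max)           # gating unit injected between the projections

x = rng.standard_normal(q)
h = W0 @ x + Wu @ (g * (Wd @ x))         # h = W_0 x + W_u (g ⊙ (W_d x))

# one proximal-gradient step on the gate
eta, lam = 0.1, 2.0
grad = rng.standard_normal(r_max)        # stand-in for the stochastic gradient
g_next = soft_threshold(g - eta * grad, eta * lam)

# entries whose gradient-step value falls inside the threshold are zeroed out
assert np.all(g_next[np.abs(g - eta * grad) <= eta * lam] == 0.0)
```

Each step thus shrinks all gate entries toward zero and exactly zeroes those within the threshold, which is what drives ranks to drop out during training.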
Post-pruning. When training is completed, we further prune the SoRA weights to drop the zeroed-out ranks and reduce the module back to the LoRA form. To be specific, for the k-th SoRA module, let

I^(k) = { i ∈ [r_max] : g_i^(k) = 0 }

be the index set of zero entries in the k-th gating vector g^(k). We drop the I^(k)-th rows of the down-projection matrix W_d^(k) and the I^(k)-th columns of the up-projection matrix W_u^(k), as well as the I^(k)-th entries of the gate g^(k). In this way, during inference time the k-th SoRA module proceeds as a usual LoRA module of rank r_max − |I^(k)|.
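Post-pruning amounts to an index-selection step; because the dropped ranks correspond to zero gate entries, the pruned module computes exactly the same output (an illustrative NumPy sketch with a hand-picked gate):

```python
import numpy as np

rng = np.random.default_rng(2)
p, q, r_max = 16, 8, 6
Wd = rng.standard_normal((r_max, q))
Wu = rng.standard_normal((p, r_max))
g = np.array([0.7, 0.0, -0.3, 0.0, 0.0, 1.1])   # trained gate with zeroed-out ranks

keep = np.flatnonzero(g != 0.0)                  # complement of the zero-index set I
Wd_p, Wu_p, g_p = Wd[keep, :], Wu[:, keep], g[keep]

# pruning is lossless: the reduced module of rank r_max - |I| gives the same output
x = rng.standard_normal(q)
assert np.allclose(Wu @ (g * (Wd @ x)), Wu_p @ (g_p * (Wd_p @ x)))
assert Wd_p.shape[0] == r_max - int(np.sum(g == 0.0))
```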

Interpretation and Comparison
Theoretical interpretation. The update rule (10) is in fact an application of the proximal gradient method for ℓ_1 loss (Chambolle et al., 1998; Beck and Teboulle, 2009). This follows immediately once we reformulate (10) equivalently as

g_{t+1} ← argmin_g { (1/(2η_t)) ∥g − (g_t − η_t ∇_g L_0(∆_t))∥^2 + λ∥g∥_1 }.   (13)

The above equation (13) is exactly the proximal gradient update of the ℓ_1 regularized loss function

L(∆) = L_0(∆) + λ Σ_k ∥g^(k)∥_1,   (14)

where g^(k) denotes the gate of the k-th SoRA module. This sparsity-promoting strategy dates back to the LASSO estimator (Tibshirani, 1996) and compressed sensing (Candes et al., 2006), and is also adopted by many works within the realm of deep learning (Wen et al., 2016; Scardapane et al., 2017).
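This equivalence between soft-thresholding and the proximal update can be verified numerically in the scalar case: brute-force minimization of the proximal objective lands on the closed-form soft-threshold value (a sanity-check sketch, not part of the method):

```python
import numpy as np

def soft_threshold(v, xi):
    return np.sign(v) * np.maximum(np.abs(v) - xi, 0.0)

eta, lam = 0.05, 1.5
v = 0.2                                      # the plain gradient-step point g_t - eta * grad

# brute-force the scalar proximal objective (1/(2*eta)) (g - v)^2 + lam * |g|
grid = np.linspace(-2.0, 2.0, 400001)
objective = (grid - v) ** 2 / (2 * eta) + lam * np.abs(grid)
brute = grid[np.argmin(objective)]

closed_form = soft_threshold(v, eta * lam)   # = sign(v) * max(|v| - eta*lam, 0)
assert abs(brute - closed_form) < 1e-4
```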
Comparison with AdaLoRA. Inspired alike by the SVD decomposition, our approach SoRA differs from the preceding work AdaLoRA (Zhang et al., 2023) in the following sense. First, we do not apply the orthogonality regularization (5) used in AdaLoRA. The reason is that for rank selection purposes, sparsifying the gate g is sufficient.
Sticking to the original requirements of SVD can result in additional computation expenditure. Second, the moving-averaged importance score in AdaLoRA works as an approximation to the change in loss when the corresponding entry is zeroed out, which is regarded as a heuristic measurement of parameter "sensitivity". However, a model's temporary sensitivity to a certain parameter does not imply that the parameter should be retained, since there is no rigorous theory supporting this choice. By contrast, our rank selection based on the soft-thresholding operation (10) proceeds in a much cleaner form and is soundly justified by the theory of proximal gradient iteration. As explained earlier in this section, the updating rule of a SoRA module exactly follows the first principle of the interpolation-complexity trade-off by minimizing a regularized loss objective (14).
Beyond this formal simplicity and theoretical clarity, SoRA achieves superior experimental performance with fewer parameters in less wall-clock time, as will be presented in Section 4.

Scheduling ξ to Explore Memorization and Generalization
We dub the threshold ξ a sparsity indicator. As the name implies, this parameter directly determines the sparsity of SoRA in the training process.
It can be set as a constant to heuristically control the sparsity according to the parameter budget and expected performance. When ξ is changed dynamically during the adaptation process, SoRA serves as an effective tool to assess the memorization and generalization of a model M on a dataset D.
In other words, we can visually observe how many additional parameters are required to achieve a particular level of performance given the model M and data D. We elaborate the fundamental idea as follows. The process starts by assigning a relatively small value to ξ. Consequently, the SoRA model is initially "dense" and is trained until convergence. Once this stage is reached, we introduce a scheduler to incrementally increase the value of ξ, thereby enhancing the model's sparsity. During this transition from a dense to a sparse model, it becomes possible to evaluate the model's memorization and generalization abilities by examining performance on the training and testing data respectively. The procedure is reported in Algorithm 1.
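Since Algorithm 1 is not reproduced in this text, the scheduler can be summarized by the following schematic sketch (all helper names such as `train_until_convergence` and `count_nonzero` are hypothetical placeholders, not the authors' API):

```python
def sparsify_and_evaluate(model, train_data, test_data, xi_schedule,
                          train_until_convergence, evaluate):
    """Schematic sketch: train dense first, then raise the sparsity indicator xi
    step by step, logging memorization (train score) and generalization (test score).
    All callables and attributes are placeholder names."""
    history = []
    for xi in xi_schedule:                      # e.g. [1e-4, 5e-4, 1e-3, ...]
        model.threshold = xi                    # a larger xi yields sparser gates
        train_until_convergence(model, train_data)
        history.append({
            "xi": xi,
            "nonzero_params": model.count_nonzero(),
            "memorization": evaluate(model, train_data),
            "generalization": evaluate(model, test_data),
        })
    return history
```

Plotting `memorization` and `generalization` against `nonzero_params` yields curves like those in Figure 2.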
The process can be regarded as exploring the "compression loss" in the scenario of model adaptation. Here, "compression loss" refers to the reduction in model performance due to increased sparsity, providing a measure of how well the model can retain its predictive power under constraints. Investigating this "compression loss" is meaningful for understanding the behavior of model adaptation and can facilitate developing efficient, compact models that maintain high performance levels.

Experiments

Extensive experiments are carried out to assess the effectiveness of our approach comprehensively. Generally speaking, we explore two aspects in this section: (1) the performance and corresponding analysis as a normal parameter-efficient method; and (2) the investigation of memorization and generalization by virtue of the sparsity nature of SoRA.
We omit the variants of the Adapter since we find that their performance is very close. We also do not include Prompt Tuning since we find that it takes considerably longer to converge and cannot yield non-trivial performance on our backbone models.

Results
We first conduct an evaluation on the GLUE benchmark, a widely recognized benchmark for natural language understanding. The experimental performance of SoRA, as well as other baseline methodologies, is recorded in Table 1. We reproduce these methods in our infrastructure and present the average results drawn from 5 random seeds. Our findings indicate that both AdaLoRA and SoRA consistently outperform the initial LoRA baseline. This underlines the validity of adaptive rank as a potent solution for enhanced model adaptation. Most notably, SoRA outshines all other baselines, particularly LoRA and AdaLoRA, despite utilizing fewer parameters. This lends credence to the argument that our proximal gradient method may constitute a more efficacious approach to achieving adaptive rank. For instance, on MRPC, SoRA achieves an accuracy of 91.98%, surpassing AdaLoRA by 1.76%. On average, SoRA surpasses LoRA and AdaLoRA on the GLUE benchmark by 0.98% and 0.52%, respectively, using 31.5% and 28.3% fewer parameters. To take a closer look at the effectiveness of adaptive rank, we conduct an experiment to compare LoRA and SoRA with different ranks in Table 2. The results affirm that SoRA's superiority is consistent across different parameter budgets; that is, SoRA outperforms the LoRA baseline in all settings while utilizing over 30% fewer parameters.

Sparsifying Scheduler
We apply the sparsifying scheduler introduced in Section 3.3 by enlarging the sparsity indicator ξ (starting from 1e-4) of SoRA progressively in the adaptation process. As illustrated in Figure 2, we plot the memorization and generalization curves of RoBERTa-large (Liu et al., 2019) on MRPC, RTE, STS-B, CoLA, QNLI, and SST-2, where memorization is gauged by the performance on the training set and generalization is measured by the performance on the validation set. Intriguingly, we observe robust "compression performance" across almost all the datasets. Among these, SST-2 emerges as the most "compressible" task, where the model sustains over 99% performance even when restricted to 47,104 non-zero parameters. Remarkably, a mere 4,096 parameters can still conserve above 90% memorization and generalization capabilities. As the sparsifying process proceeds, the model encounters an "inflection point" on different data, after which the performance significantly plummets. This consistent phenomenon suggests that there exist some critical parameters that underpin the performance and are worth further investigation; on some datasets the inflection point arrives earlier and the decline in performance is more pronounced than on others. Another finding is that the trends of memorization and generalization are consistent in the sparsifying procedure, which aligns with intuition. Our observations also indicate a tendency for the parameters of intermediate and deep layers to maintain their density, while those of the shallow layers show a higher propensity towards sparsity.

Rank Analysis
An intuitive statement is that a single model faces varying degrees of difficulty when being adapted to different downstream datasets. Concurrently, it is evident that not all parameters within the model carry equal importance: some are more critical to performance than others. In this section, we visualize the final ranks after the training process converges with SoRA on four datasets in Figure 3. Quite evidently, the trained parameter matrices on QQP are exceedingly dense while the others do not exhibit such density, which echoes the existence of different levels of difficulty. This phenomenon also suggests that the trade-off between performance and parameter budget does not follow an invariable constant law, but needs specific consideration in different situations.

Applying SoRA to Different Weights
In our experiments in Table 1, we apply LoRA, AdaLoRA, and SoRA to all weight matrices to enhance performance. It should be noted that the performance may fluctuate when parameter-efficient fine-tuning is applied to various positions within the model, as evidenced by previous research (Zaken et al., 2021; Hu et al., 2022; Zhang et al., 2023). We carry out such ablation experiments with SoRA on three datasets to investigate the impact. Although SoRA is not a budget-oriented method, we adjust λ to approximately equate the retained non-zero parameters. As reported in Table 3, in most cases, applying SoRA to all weight matrices results in a considerable improvement in performance compared to applying it to merely one or several types of weights, which suggests that uniformly applying SoRA to all weight matrices can serve as a beneficial strategy. Merely applying SoRA to W_Q and W_K leads to a considerable performance drop, which is aligned with the findings for LoRA.

Efficiency Analysis
We elaborated in Section 3.2 that SoRA is a theoretically clear and computation-efficient method; the average per-epoch training times of AdaLoRA and SoRA are reported in Table 4.

Conclusion
Our work presents sparse low-rank adaptation (SoRA), an innovative method for parameter-efficient fine-tuning of large pre-trained language models. Based on the hypothesis that the adaptation process could be intrinsically sparse, we offer a dynamic alternative to a fixed rank by introducing an optimizable gate, updated with the proximal gradient method, to regulate sparsity, thereby expanding the optimization space while enhancing parameter efficiency. The method is simple, theoretically supported, and shows promising performance across various tasks. Utilizing SoRA as a tool, we propose a sparsifying scheduler to analyze the correlation between the number of parameters and memorization and generalization.

Limitations
Despite the encouraging results demonstrated by SoRA, there are certain limitations in our current study that are worth acknowledging. This paper only evaluates the effectiveness of SoRA on traditional natural language processing tasks. However, recent studies demonstrate that parameter-efficient methods can be applied to cross-modal or instruction-tuning scenarios. In those cases, how the sparsity of SoRA manifests is still unknown and worth investigating. Our sparsifying scheduler can provide insights into the adaptation process of language models, but it remains challenging to rigorously explain the procedure and to assess the difficulty of an adaptation process more efficiently.

A Experimental Details
A.1 Datasets

The GLUE benchmark, consisting of CoLA (Warstadt et al., 2019), SST-2 (Socher et al., 2013), MRPC (Dolan and Brockett, 2005), QQP (Wang et al.), STS-B (Wang et al.), MNLI (Williams et al., 2017), QNLI (Rajpurkar et al., 2016) and RTE (Dagan et al., 2005; Haim et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., 2009), is used for natural language understanding. The details and the evaluation metrics are reported in Table 5. We source each dataset from Huggingface Datasets (Lhoest et al., 2021) and utilize the full dataset for our experiments. For almost all experiments, we run 5 times using different random seeds and report the average results in order to ensure statistical significance.

Table 5: The size and evaluation metric of the datasets in the GLUE benchmark. "Mcc", "Acc", "F1" and "Corr" represent the Matthews correlation coefficient, accuracy, the F1 score and the Pearson correlation coefficient respectively. "Acc(m/mm)" represents the results corresponding to the matched and mismatched datasets of MNLI, where the metric is accuracy.

A.2 Implementation Details
Regarding hyper-parameters, we set the learning rate to 8e-4. Based on the size and training convergence speed of the datasets, we set the number of epochs for CoLA, MRPC, and STS-B to 20, and the number of epochs for the remaining tasks to 10. As for RTE, we reference the settings of Friedman et al. (2021), which entail a learning rate of 1.2e-3 and an epoch count of 50. We set λ to 0.1 in all our experiments, and select ξ with a grid search in {1e-5, 5e-5, 1e-4}. When dealing with the MRPC, RTE, and STS-B datasets, a common trick in certain studies is to initialize from the best model checkpoint on the MNLI dataset, which could boost performance.
In our experiments, we do not use this strategy and instead opt for standard initializations across all models.
Huggingface Transformers (Wolf et al., 2020) and PyTorch (Paszke et al., 2019) are utilized for all the experiments. We use an NVIDIA GeForce RTX 3090 (maximum GPU memory 24268MB), and running SoRA with a batch size of 8 occupies 6110MB of GPU memory on average.

A.3 Optimization of Hyperparameters
In this section, we delve into the optimization of hyperparameters. The results for different r_max are provided in Table 2, and we supplement the results for two other important hyperparameters, ξ and η, in Table 6 and Table 7. The performance of SoRA is highly stable with respect to different choices of ξ and η, and for each fixed ξ and η, the variance of performance is rather low. In general, we suggest setting ξ to the 1e-4 level and η around 1e-1∼1e-3.

Figure 1: An illustration of sparse low-rank adaptation (SoRA). At the training stage, the gate g controls the sparsity of W_d and W_u. At the inference stage, zero vectors in W_d and W_u, indexed by the zero entries of g, are eliminated.

Figure 2: The memorization and generalization curves on six datasets. The "Param" axis indicates the number of non-zero parameters. The sparsity indicator ξ increases every 5 epochs.

Figure 3: The final ranks after training with SoRA on four datasets (i.e., QQP, MNLI, QNLI, and MRPC). The X-axis is the index of DeBERTaV3-base layers, and the Y-axis indicates the different weight matrices SoRA is applied to.

Table 1: Test results of SoRA and other baselines on the GLUE benchmark. We denote the best result in bold and underline the second best result. The standard deviations of results from different methods are similar, and we show them in Table 8 in Appendix A.4.

Table 2: Test results and number of parameters of SoRA initialized with different r_max on the GLUE benchmark, compared with LoRA of the same rank. The standard deviations of results from different methods are similar, and we show them in Table 9 in Appendix A.4.

Table 3: Test results when applying SoRA to different weights. Q, K, V, and A.O represent the query, key, value and attention output layers respectively. #Params means the number of parameters that remain after training.

Table 4: The average training time per epoch on six datasets. For each task, the experiments with AdaLoRA and SoRA use the same batch size of 32.

Table 7: Test results of the optimization experiments on different η_t.

A.4 Results with Standard Deviations

The test results in Table 1 are shown in Table 8, and the results in Table 2 are shown in Table 9.

Table 8: Test results of SoRA and other baselines on the GLUE benchmark. We denote the best result in bold and underline the second best result. The standard deviation is provided in parentheses.

Table 9: Test results and number of parameters of SoRA initialized with different r_max on the GLUE benchmark, compared with LoRA of the same rank. The standard deviation is provided in parentheses.