Orthogonal Subspace Learning for Language Model Continual Learning

Benefiting from massive corpora and advanced hardware, large language models (LLMs) exhibit remarkable capabilities in language understanding and generation. However, their performance degrades when multiple tasks are encountered sequentially, a phenomenon known as catastrophic forgetting. In this paper, we propose orthogonal low-rank adaptation (O-LoRA), a simple and efficient approach for continual learning in language models that effectively mitigates catastrophic forgetting while learning new tasks. Specifically, O-LoRA learns tasks in different (low-rank) vector subspaces that are kept mutually orthogonal in order to minimize interference. Our method incurs only marginal additional parameter costs and requires no storage of user data for replay. Experimental results on continual learning benchmarks show that our method outperforms state-of-the-art methods. Furthermore, compared to previous approaches, our method excels at preserving the generalization ability of LLMs on unseen tasks.


Introduction
Learning tasks sequentially is crucial for developing real-world NLP models (Wang et al., 2023b; Xi et al., 2023), as it enables continuous evolution when encountering new tasks or knowledge. Although pre-trained models (Devlin et al., 2019; Brown et al., 2020; Raffel et al., 2020; OpenAI, 2023) have achieved tremendous success on static tasks (Wang et al., 2022a, 2023c), learning multiple tasks sequentially, commonly referred to as continual learning, remains challenging (Wu et al., 2022; Luo et al., 2023). As a model learns new tasks, it tends to forget or lose the knowledge it had acquired for earlier tasks, a phenomenon known as catastrophic forgetting (McCloskey and Cohen, 1989).

Figure 1: Illustration highlighting the intuition of our approach. O-LoRA mitigates catastrophic forgetting of past task knowledge by constraining the gradient updates of the current task to be orthogonal to the gradient subspace of the past tasks.
Existing continual learning works (Ke and Liu, 2022; Wang et al., 2023a) can be mainly categorized into rehearsal-based, regularization-based, and architecture-based approaches. Rehearsal-based approaches (Lopez-Paz and Ranzato, 2017; de Masson D'Autume et al., 2019) allow access to a memory buffer with examples from prior tasks and train the model jointly with the current task. Unfortunately, storing and replaying data from previous tasks may raise privacy concerns, especially when sensitive or personally identifiable information is involved. Regularization-based approaches (Kirkpatrick et al., 2017; Li and Hoiem, 2017; Smith et al., 2023) introduce additional terms in the loss function to penalize changes in important weights, aiming to protect earlier learned tasks; they often struggle to handle long task sequences. Architecture-based approaches (Wang et al., 2023e; Razdaibiedina et al., 2023) dynamically expand the model capacity or isolate existing model weights to reduce interference. However, such approaches essentially learn different expert models for different tasks, limiting their generalization to unseen tasks.
Existing methods typically update all tasks within a shared vector space, directly affecting the model's hidden layer outputs. Recent studies (Farajtabar et al., 2020; Saha et al., 2021) have highlighted a promising approach to address this issue. By taking gradient steps in the direction orthogonal to the gradient subspaces associated with past tasks, we can effectively mitigate catastrophic forgetting, as this prevents interference with the past task loss functions. However, previous approaches either require storing historical data (Chaudhry et al., 2019), which raises data privacy concerns, or historical data gradients (Farajtabar et al., 2020), which becomes impractical for large-scale models.
In this work, we propose orthogonal low-rank adaptation (O-LoRA), a simple and efficient approach for continual learning in language models. Our key insight is rooted in the nature of LoRA: large pre-trained models primarily fine-tune within a specific low-rank subspace. With this premise, we hypothesize that the gradient subspaces from previous tasks can be effectively captured by the LoRA parameters. In the context of continual learning, we incrementally learn new tasks in an orthogonal subspace while fixing the LoRA parameters learned from past tasks. Figure 1 provides a visual representation of how O-LoRA minimizes catastrophic forgetting.
Our method offers three advantages: (1) Data privacy-friendliness: we require no storage of user data for replay, addressing concerns associated with privacy. (2) Model parameter-friendliness: by introducing only a marginal number of additional parameters, our approach enables the learning of new tasks without compromising the performance of previous tasks. (3) Generalization-friendliness: our method does not rely on task IDs during testing, making it compatible with the instruction tuning paradigm (Wang et al., 2022b) and thus preserving LLMs' generalization ability on unseen tasks.
Our main contributions are summarized as follows: • We introduce O-LoRA, a simple and efficient approach for continual learning in language models, incrementally learning new tasks in orthogonal subspaces.
• Our method significantly outperforms prior SOTA methods on standard continual learning benchmarks.
• Experimental results show that our method preserves the generalization ability of large language models on unseen tasks, which was lacking in previous approaches.
Background

In this study, we tackle a more challenging setting.
During the training phase, the model is prohibited from accessing any historical data. In the testing phase, the model predicts a sample's label without knowing which task it belongs to.

Figure 2: The framework of O-LoRA for language model continual learning. First, instruction tuning allows the integration of human expertise and enhances generalization. Next, the gradient subspace of each task is approximated using LoRA. For each sequentially incoming task, we incrementally learn a new LoRA module while enforcing orthogonality between the current task's LoRA and those of past tasks.

Orthogonal Low-rank Adaptation
In this section, we introduce O-LoRA, illustrated in Figure 2. First, we adopt instruction tuning as our training paradigm. Then, we incrementally learn new tasks within an orthogonal subspace while keeping the LoRA parameters fixed for past tasks. Lastly, we conduct a comparative analysis of our method against existing approaches.

Instruction Schema
Instruction-following capability is essential to LLMs as an interface between humans and AI models (Wang et al., 2022b; Ouyang et al., 2022; Wang et al., 2023d). We choose instruction tuning as our training paradigm for two reasons: 1) Incorporating human expertise: the model can leverage prior knowledge and benefit from human expertise through explicit instructions, leading to more efficient learning. 2) Enhanced generalization: the explicit guidance helps the model capture the underlying principles, enabling better generalization to unseen situations.
All task instructions follow the same uniform schema, which is composed of: 1) Task Definition: a detailed guide on how an input text (e.g., a sentence or a document) is expected to be mapped to an output text. 2) Options: the output label constraints for a task, representing the set of possible outputs the model can generate for a given input. 3) Text: the input sentence of a task instance. 4) Answer: the expected output of the given sample. The input sequence is fed into the pre-trained language model together with the task definition and options.
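As a concrete illustration, the four-part schema above could be assembled programmatically. The helper below is a hypothetical sketch, not the paper's verbatim template; the field labels and example task are illustrative only.

```python
# Hypothetical rendering of the four-part instruction schema.
# Field wording is illustrative; the paper's exact prompts may differ.

def build_prompt(task_definition: str, options: list, text: str) -> str:
    """Assemble an instruction-tuning prompt from the schema's components."""
    return (
        f"Task Definition: {task_definition}\n"
        f"Options: {', '.join(options)}\n"
        f"Text: {text}\n"
        f"Answer:"
    )

prompt = build_prompt(
    "Classify the sentiment of the given sentence as positive or negative.",
    ["positive", "negative"],
    "The movie was a delightful surprise.",
)
print(prompt)
```

At inference time, the model would complete the sequence after "Answer:" with one of the listed options.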

Continual Learning in Orthogonal Subspaces
Previous methods exhibit a common feature: all tasks undergo updates within a shared vector space, directly impacting the hidden layer outputs of the model. Catastrophic forgetting happens in neural networks when the gradient updates with respect to a new task are applied to the model without considering previous tasks. Farajtabar et al. (2020) propose the Orthogonal Gradient Descent (OGD) method to mitigate this problem, which constrains the parameters to move within the space orthogonal to the gradients of previous tasks. With limited access to previous task data, OGD approximates the current gradient of previous data with the gradient at the previous convergence parameters. However, OGD needs to store the gradients of all previous data. This is especially intractable for large-scale language models with billions of parameters (Raffel et al., 2020; Brown et al., 2020), which have become a standard in the NLP field.
Is it possible to approximate the gradient direction of previous tasks without storing historical gradients? In this study, we leverage the low-rank subspace of LoRA (Hu et al., 2021) as a proxy for the gradient subspace of past tasks. Our fundamental insight is rooted in the nature of LoRA: large pre-trained models primarily fine-tune within a specific low-rank subspace. This characteristic behavior suggests that the LoRA parameters are not mere numerical adjustments but encapsulate crucial model update directions. Therefore, we hypothesize that the gradient subspaces of previous tasks are succinctly represented by the LoRA parameters. By learning within a subspace orthogonal to the LoRA subspaces associated with previous tasks, we can prevent interference with past task loss functions, thus mitigating catastrophic forgetting.
We propose O-LoRA, which incrementally learns new tasks in a direction orthogonal to the LoRA subspace of past tasks while keeping the previous parameters fixed. For each task $t$, we introduce a set of LoRA parameters denoted as $\{A_t, B_t\}$, where $A_t \in \mathbb{R}^{d \times r}$, $B_t \in \mathbb{R}^{r \times k}$, and the rank $r \ll \min(d, k)$. We approximate the parameter update subspace $U_t$ for the $t$-th task as the subspace spanned by the column vectors of $A_t$:

$$U_t = \operatorname{span}\{a_t^1, a_t^2, \ldots, a_t^r\},$$

where $a_t^j$ denotes the $j$-th column of $A_t$, and the elements $b_t^i \in B_t$ represent the linear weighting coefficients of the column vectors of $A_t$ in the update $\Delta W_t = A_t B_t$.
To ensure the orthogonality of two subspaces $U$ and $W$, we need to satisfy $\langle u, w \rangle = u^{\top} w = 0$ for any $u \in U$ and $w \in W$. Therefore, achieving orthogonality between the LoRA subspaces of task $i$ ($U_i$) and task $t$ ($U_t$) can be expressed as

$$O_{i,t} = A_i^{\top} A_t = 0.$$

Finally, our training objective is defined as

$$\sum_{(x,y) \in D_t} \log p_{\Theta}(y \mid x) \;-\; \lambda_1 \sum_{i=1}^{t-1} L_{\text{orth}}(A_i, A_t), \qquad L_{\text{orth}}(A_i, A_t) = \sum_{j,k} \big\lVert O_{i,t}[j,k] \big\rVert^2,$$

where $O_{i,t}[j,k]$ denotes the element at the $j$-th row and $k$-th column of $O_{i,t}$, and $\lambda_1$ is the weight of the orthogonality loss. During the training process, to mitigate forgetting of past knowledge, we fix the previous LoRA parameters $\{A_i, B_i \mid i < t\}$.
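The orthogonality penalty can be sketched in a few lines of NumPy. The function below is an illustrative re-implementation, not the paper's code: it sums the squared entries of $O_{i,t} = A_i^{\top} A_t$ over all past tasks, and the toy data constructs a second LoRA factor whose columns are projected off the first one's column space.

```python
import numpy as np

def orthogonality_loss(past_As, A_t):
    """Sum of squared entries of O_{i,t} = A_i^T A_t over all past tasks i < t."""
    loss = 0.0
    for A_i in past_As:
        O = A_i.T @ A_t          # (r_i, r_t) overlap between the two LoRA subspaces
        loss += float(np.sum(O ** 2))
    return loss

rng = np.random.default_rng(0)
d, r = 16, 4
A_1 = rng.standard_normal((d, r))        # LoRA factor of a past task

# Build A_2 orthogonal to A_1: project random columns off A_1's column space.
Q, _ = np.linalg.qr(A_1)                 # orthonormal basis of span(A_1)
A_2 = rng.standard_normal((d, r))
A_2 -= Q @ (Q.T @ A_2)                   # now A_1^T A_2 ≈ 0

print(orthogonality_loss([A_1], A_2))    # ≈ 0 for orthogonal subspaces
```

During training, this scalar would be added to the task loss with weight $\lambda_1$ while gradients flow only into the current task's factors.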
Following Hu et al. (2021), we apply LoRA only to the attention weight matrices for queries ($W_q$) and values ($W_v$).
While the number of LoRA parameters grows with the number of tasks during training, the updates corresponding to the LoRA parameters can be merged into the initial parameters, $W' = W + \Delta W$, to avoid GPU memory inflation.
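The merging step can be made concrete with a minimal NumPy sketch. Shapes follow the $A_t \in \mathbb{R}^{d \times r}$, $B_t \in \mathbb{R}^{r \times k}$ convention above; variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, r = 8, 8, 2
W = rng.standard_normal((d, k))          # frozen pre-trained weight
A = rng.standard_normal((d, r))          # LoRA factors of a finished task
B = rng.standard_normal((r, k))

x = rng.standard_normal(k)
y_unmerged = W @ x + A @ (B @ x)         # adapter applied on the fly

W_merged = W + A @ B                     # fold the low-rank update into the base weight
y_merged = W_merged @ x

print(np.allclose(y_unmerged, y_merged))  # True
```

Because $W + AB$ has the same shape as $W$, the merged model serves identical outputs with no extra parameters resident in memory.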

Comparisons Between O-LoRA and Other Methods
In this section, we compare O-LoRA with existing continual learning methods across several dimensions: rehearsal-freeness, parameter efficiency, availability of task-id during inference, and applicability to unseen tasks, as summarized in Table 1.

Model parameter-friendliness. Many previous methods (Kirkpatrick et al., 2017; Farajtabar et al., 2020) train the entire set of model parameters for each task, while our method introduces only marginal additional parameters per task. O-LoRA therefore has lower requirements in terms of computational resources and GPU memory during training. Additionally, because LoRA training freezes the pre-trained model parameters, it is less prone to forgetting the knowledge acquired during pre-training.
Generalization-friendliness. Traditional methods (Kirkpatrick et al., 2017; Chaudhry et al., 2018; Wang et al., 2022c), primarily designed for classification tasks, often fall short in generalizing to unseen tasks due to their narrow task-specific focus. In contrast, O-LoRA employs instruction tuning (Wang et al., 2022b) as its training paradigm, which preserves the ability of LLMs to generalize to unseen tasks.

Large number of tasks. Our method's performance on longer task sequences, which pose a greater challenge, is evaluated through experiments on a continual learning benchmark of 15 datasets (Razdaibiedina et al., 2023). This includes five tasks from the CL benchmark, four from the GLUE benchmark (MNLI, QQP, RTE, SST-2) (Wang et al., 2018), five from the SuperGLUE benchmark (WiC, CB, COPA, MultiRC, BoolQ) (Wang et al., 2019), and the IMDB movie reviews dataset (Maas et al., 2011). Following Razdaibiedina et al. (2023), we select 1,000 random samples for training each task and hold out 500 samples per class for validation.
Generalization to unseen tasks. To assess the impact of our approach on LLMs' generalization ability, we initially train an LLM on the Alpaca dataset (Taori et al., 2023), an open-source multitask instruction tuning dataset. We then use the pre-trained LLM for sequential training on the standard CL benchmark (Zhang et al., 2015). Our zero-shot benchmark, MMLU (Hendrycks et al., 2020), covers 57 subjects across domains such as STEM, the humanities, and the social sciences, assessing world knowledge and problem-solving abilities across various difficulty levels.

Metrics
Let $a_{i,j}$ be the testing accuracy on the $i$-th task after training on the $j$-th task. Our evaluation metric is Average Accuracy (AA), the average accuracy of all tasks after training on the last task:

$$AA = \frac{1}{T} \sum_{i=1}^{T} a_{i,T}.$$
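The metric is straightforward to compute from the matrix of per-stage accuracies; the helper below is an illustrative sketch (0-indexed, with NaN marking task/stage combinations that never occur).

```python
import numpy as np

def average_accuracy(acc):
    """acc[i][j] = accuracy on task i after training on task j (0-indexed).
    AA = (1/T) * sum_i acc[i][T-1]: mean accuracy over all tasks after the last one."""
    acc = np.asarray(acc, dtype=float)
    return float(acc[:, -1].mean())

# Toy 3-task example: rows are tasks, columns are training stages.
acc = [[0.90,   0.85,   0.80],
       [np.nan, 0.88,   0.84],
       [np.nan, np.nan, 0.92]]
print(average_accuracy(acc))  # (0.80 + 0.84 + 0.92) / 3 ≈ 0.8533
```

Only the last column enters AA, which is why a method can score well here despite forgetting earlier in the sequence; the full matrix also supports forgetting-style metrics.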

Baselines
We evaluate O-LoRA against 10 baseline methods. Importantly, only the prompt-based methods among these baselines are exceptions; all others utilize the LoRA framework. This uniform foundation ensures consistent parameter settings between O-LoRA and its comparatives, guaranteeing a fair comparison.
• SeqFT (de Masson D'Autume et al., 2019): train all model parameters on a sequence of tasks (without adding any regularization or replaying samples from the previous tasks).
• SeqLoRA: fixed-size LoRA parameters are trained on a sequence of tasks (without adding any regularization or replaying samples from the previous tasks).
• Replay: finetune the whole model with a memory buffer, and replay samples from old tasks when learning new tasks to avoid forgetting.
• EWC (Kirkpatrick et al., 2017): finetune the whole model with a regularization loss that prevents updating parameters that could interfere with previously learned tasks.
• LwF (Li and Hoiem, 2017): constrains the shared representation layer to be similar to its original state before learning the new task.
• L2P (Wang et al., 2022c): uses the input to dynamically select and update prompts from the prompt pool in an instance-wise fashion.
• LFPT5 (Qin and Joty, 2021): continuously train a soft prompt that simultaneously learns to solve the tasks and generate training samples, which are subsequently used in experience replay.
• ProgPrompt (Razdaibiedina et al., 2023): adopts a task-specific soft prompt for each distinct task, sequentially appending it to previously learned prompts. In essence, it trains individual models per task, leveraging the task ID to select the appropriate model during inference.
• PerTaskFT: train a separate model for each task.
• MTL: train a model on all tasks as multi-task learning.This method is the upper bound of continual learning.

Implementation Details
O-LoRA is a model-agnostic CL method that can be used with any transformer-based model. In our experiments, we use two language models adopted by previous lines of work in CL for NLP: the encoder-decoder T5 model (Raffel et al., 2020) and the decoder-only LLaMA model (Touvron et al., 2023).
To compare O-LoRA to recent CL approaches (Wang et al., 2022c; Qin and Joty, 2021), we use the pre-trained T5-large model. To validate the impact of our approach on the generalization ability of LLMs for unseen tasks, we use the pre-trained LLaMA-7B model. All experimental results are reported as the average of 3 runs. For more detailed settings, refer to Appendix A.1.

Main Results

Performance on standard CL benchmarks. As Table 2 shows, O-LoRA outperforms the previous state-of-the-art method. Our approach demonstrates comparable performance to multi-task learning and significantly outperforms PerTaskFT, indicating that our method not only effectively avoids catastrophic forgetting but also leverages knowledge from past tasks for efficient learning of new tasks.

Performance with a large number of tasks. On a more challenging benchmark with a large number of tasks, O-LoRA outperforms the state-of-the-art method, LFPT5, in terms of average performance across three orders of tasks. While ProgPrompt performs better than our method in handling long task sequences, its inherent constraints cannot be overlooked: ProgPrompt is strictly tied to the tasks it is trained on and leans heavily on task IDs during inference, limiting its generalization ability and making it less adaptive for LLMs. It is worth noting that almost all existing continual learning methods perform significantly below PerTaskFT and MTL, indicating that continual learning for a large number of tasks remains a challenging problem.
Impact on the generalization ability of LLMs. We investigate the impact of O-LoRA on the generalization of large language models through continual learning experiments. We start with a LLaMA-7B model fine-tuned on the Alpaca dataset, then test models with and without the O-LoRA constraint on the MMLU benchmark. Since MMLU is a four-choice multiple-choice benchmark, 25% accuracy equates to random guessing. As Table 3 shows, models without O-LoRA (Alpaca-LoRA-CL, Alpaca-LoRA-inc-CL) achieve accuracies of 23.3% and 28.6% respectively, comparable to random guessing. In contrast, models with O-LoRA average 33.6% accuracy, demonstrating the effectiveness of O-LoRA in maintaining generalization on unseen tasks.

Discussions
Does O-LoRA preserve the loss of previous tasks while training new tasks? We assess how well the low-rank subspace of LoRA approximates the gradient subspace of previous tasks. In our evaluation, we apply an orthogonality constraint with a weight of 0.5 ($\lambda_1 = 0.5$). In comparison, without the constraint ($\lambda_1 = 0$), new LoRA parameters are added for new tasks while the historical LoRA and model parameters are kept fixed. As Figure 3 shows, the O-LoRA constraint keeps the loss on previous samples low, indicating that it effectively counteracts catastrophic forgetting.
How does O-LoRA influence the output of each layer in the model? We examine the variation in hidden states for past-task samples in models trained with and without O-LoRA constraints, using the T5-base model. Figure 4 demonstrates that the O-LoRA constraint minimizes these variations, hence reducing the forgetting of internal knowledge. We find that lower layers encode more generic semantic knowledge that can be shared across tasks. Conversely, higher layers encode task-specific semantic knowledge and change substantially during new-task learning. The decoder can capture relevant information from these rich semantic representations, confirming the minimal impact of our method on past tasks.
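A drift measurement of this kind can be sketched as follows. The data here is a synthetic stand-in rather than the paper's measurements, simulating the observed pattern of larger changes in higher layers; the function name and shapes are illustrative.

```python
import numpy as np

def layerwise_drift(hidden_before, hidden_after):
    """Mean L2 distance per layer between hidden states of the same past-task
    inputs, measured before and after learning a new task."""
    return [float(np.linalg.norm(b - a, axis=-1).mean())
            for b, a in zip(hidden_before, hidden_after)]

rng = np.random.default_rng(3)
# Toy stand-in for per-layer hidden states: (num_samples, hidden_dim) per layer.
before = [rng.standard_normal((4, 16)) for _ in range(3)]
# Simulate larger perturbations in higher layers, as the paper observes.
after = [b + (layer_idx + 1) * 0.1 * rng.standard_normal(b.shape)
         for layer_idx, b in enumerate(before)]

drift = layerwise_drift(before, after)
print(drift)  # increases with layer depth in this synthetic example
```

In practice, `hidden_before` and `hidden_after` would come from forward passes of the two model checkpoints over held-out past-task samples.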
How do different PLMs influence performance? We evaluate models across varying parameter sizes (T5-base, T5-large, T5-XL) and distinct architectures (T5, LLaMA) using the standard continual learning benchmark. Our findings are as follows: 1) In the T5 series, O-LoRA's average accuracy improves as the parameter size increases. 2) Larger networks appear to counteract catastrophic forgetting, approaching the proficiency levels of multi-task learning. 3) Notably, even though the LLaMA-7B model has a greater parameter count, the T5-3B model registered a higher average accuracy. This implies that encoder-decoder architectures might be more resistant to forgetting.

What is the Optimal Rank r for O-LoRA?
To investigate the influence of the rank parameter ($r$) on the performance of O-LoRA, we conduct experiments using T5-base on the standard CL benchmark. Table 5 presents the results for varying $r$ values. Increasing the rank $r$ improves the average accuracy of the model to a certain extent. However, we observe no significant difference in performance between $r = 2$ and $r = 16$, indicating that the gradient space of the model has a relatively low intrinsic dimensionality.

Related Work

For an in-depth discussion of continual learning in the era of large language models, readers may refer to Wang et al. (2023b).
Rehearsal-based approaches (Lopez-Paz and Ranzato, 2017; de Masson D'Autume et al., 2019; Han et al., 2020; Bai et al., 2022) leverage a memory buffer that stores examples from previous tasks, training the model jointly with the current task. Experience replay (ER) (Rolnick et al., 2019) is a common strategy employed in rehearsal-based approaches and serves as a strong baseline. However, the storage and replay of data from previous tasks raise privacy concerns, particularly when dealing with sensitive information.
Regularization-based approaches (Kirkpatrick et al., 2017; Li and Hoiem, 2017; Farajtabar et al., 2020; Smith et al., 2023) incorporate additional terms into the loss function to penalize changes in crucial weights. For instance, Orthogonal Gradient Descent (OGD) (Farajtabar et al., 2020) constrains the parameters to move within the space orthogonal to the gradients of previous tasks. However, OGD requires storing the gradients of all historical data, which becomes infeasible for large language models. Another work introduces C-LoRA (Smith et al., 2023) for continual learning of text-conditioned images, which regularizes the similarity of new LoRA parameters with historical versions, limiting plasticity in learning new tasks.
Architecture-based approaches (Wang et al., 2023e; Razdaibiedina et al., 2023) focus on dynamically expanding model capacity or isolating existing model weights to mitigate interference between new and old tasks. Progressive Prompts (Razdaibiedina et al., 2023) learns separate prompts for each incoming task and sequentially concatenates them with previously learned prompts. However, such approaches essentially train distinct expert models for different tasks, which restricts their generalization ability to unseen tasks.
In contrast to existing methods, our approach offers unique advantages in terms of data privacy, model parameter efficiency, and generalization capability, as discussed in the previous sections.

Parameter Efficient Tuning
Parameter-efficient tuning (PET) (He et al., 2021) has emerged as a significant research direction aimed at optimizing model performance while minimizing computational resources and annotation efforts. Various approaches have been proposed to achieve parameter efficiency in tuning, including adapters (Houlsby et al., 2019), prompt learning (Lester et al., 2021), LoRA (Hu et al., 2021), and fine-tuning subsets of the model (Zaken et al., 2021). One particularly promising approach is the use of low-rank adapters, which have demonstrated effectiveness in adapting models to new tasks with minimal additional parameters. Building upon LoRA, we propose an efficient continual learning neural architecture in this work. Our approach involves layering low-rank adapters on the query and value projection matrices of transformer blocks. By leveraging the benefits of low-rank adapters, we aim to strike a balance between model performance and computational efficiency in the context of continual learning.
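A minimal sketch of such a low-rank adapter layer, in NumPy for clarity: training logic is omitted, and the class name and initialization scale are illustrative, though initializing one factor to zero mirrors common LoRA practice so that the adapted layer starts identical to the frozen one.

```python
import numpy as np

class LoRALinear:
    """Frozen linear map W plus a trainable low-rank update A @ B of rank r."""

    def __init__(self, W, r, rng):
        d, k = W.shape
        self.W = W                                   # frozen pre-trained weight
        self.A = 0.01 * rng.standard_normal((d, r))  # trainable factor
        self.B = np.zeros((r, k))                    # zero-init: update starts at zero

    def __call__(self, x):
        # Base projection plus the low-rank correction, computed factor by factor.
        return self.W @ x + self.A @ (self.B @ x)

rng = np.random.default_rng(2)
W_q = rng.standard_normal((8, 8))        # stand-in for a query projection matrix
layer = LoRALinear(W_q, r=2, rng=rng)
x = rng.standard_normal(8)

# With B = 0, the adapted layer matches the frozen projection exactly.
print(np.allclose(layer(x), W_q @ x))  # True
```

In a real model, one such adapter would wrap each $W_q$ and $W_v$ matrix, and only `A` and `B` would receive gradients.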

Conclusion
In this paper, we introduce O-LoRA, a novel approach that leverages orthogonal subspace learning for continual learning in language models. O-LoRA systematically addresses catastrophic forgetting by adopting an incremental learning strategy within orthogonal subspaces. Our method stands out for its data privacy considerations, efficient use of model parameters, and robust generalization to novel tasks. Empirical evaluations underscore O-LoRA's efficacy in tackling the intricacies of continual learning.

Limitations
While our method has demonstrated effectiveness in empirical evaluations, there are a few limitations to consider. First, its performance and applicability in more complex scenarios with a large number of tasks, such as hundreds of tasks, require further investigation. Additionally, although our method does not rely on task identification during inference, it still requires task identification during training to train different LoRA parameters for each task. Exploring methods for task-agnostic training would be a valuable future direction. By addressing these limitations, we can enhance the scalability and task-agnostic capabilities of our approach, further advancing the field of continual learning for language models.

A.2 Datasets
Table 4 shows details of the 15 datasets we used for our CL experiments, along with their evaluation metrics. Overall, we used datasets from the CL benchmark (Zhang et al., 2015) and the GLUE (Wang et al., 2018) and SuperGLUE (Wang et al., 2019) benchmarks, and added the IMDB movie reviews dataset, following Razdaibiedina et al. (2023).

A.3 Task Sequence Orders
We report task orders used for our CL experiments across T5 and LLaMA models in Table 5.

A.4 Task Instructions
Table 6 shows the prompts for different tasks. NLI denotes natural language inference, including MNLI, RTE, and CB. SC denotes sentiment analysis, including Amazon, Yelp, SST-2, and IMDB. TC denotes topic classification, including AG News, DBpedia, and Yahoo.
A.5 Detailed Results of MMLU Zero-shot

We evaluate our approach using the CL benchmark for language models, which consists of five text classification datasets introduced by Zhang et al. (2015): AG News, Amazon reviews, Yelp reviews, DBpedia, and Yahoo Answers. We adopt the CL setup for the T5 model, following LFPT5 (Qin and Joty, 2021), and explore three different orders of the benchmark. Appendix A.2 provides the task details, and the sequences of tasks used in our experiments are provided in Appendix A.3.

Figure 3 :
Figure 3: Histogram of prediction loss changes after training on a new task. The O-LoRA constraint ($\lambda_1 = 0.5$) helps reduce the changes in comparison to when it is absent ($\lambda_1 = 0$).

Figure 4 :
Figure 4: Variation in hidden states across different layers of the T5-base model with and without O-LoRA constraints. (a) illustrates the changes in the hidden states of each layer in the T5 encoder after training on a new task; (b) shows the corresponding changes in the T5 decoder. Light blue indicates the use of the orthogonality loss, while dark blue represents its absence from the training objective.
Formally, a sequence of tasks $\{D_1, \ldots, D_T\}$ arrives in a streaming fashion. Each task $D_t = \{(x_i^t, y_i^t)\}_{i=1}^{n_t}$ contains a separate target dataset, where $x_i^t \in X_t$ and $y_i^t \in Y_t$. A single model needs to adapt to these tasks sequentially, with access only to $D_t$ at the $t$-th task. In general, given a prediction model $h_\Theta$ parameterized by $\Theta$, continual learning seeks to optimize the following objective across all tasks:

$$\max_{\Theta} \sum_{t=1}^{T} \sum_{(x,y) \in D_t} \log p_{\Theta}(y \mid x).$$

Table 1 :
The comparison between O-LoRA and other continual learning methods. Specifically, RF indicates whether the method is rehearsal-free. PE indicates whether the method is parameter-efficient. TIF indicates whether task-id is available during inference. UT indicates whether the method can be applied to solve unseen tasks.

Data privacy-friendliness. Rehearsal-based methods (de Masson D'Autume et al., 2019; Huang et al., 2021), which rely on storing past task data in a buffer and replaying it during training, are not suitable for scenarios with data privacy concerns. Additionally, as the number of training tasks increases, the cost of training new tasks with replay-based methods also grows. In contrast, our method does not require storing historical data, alleviating concerns regarding data privacy. Moreover, since we only modify the training loss, no additional training cost is incurred.

Table 2 :
Summary of the results on two standard CL benchmarks with T5-large model.Averaged accuracy after training on the last task is reported.All results are averaged over 3 runs.
• IncLoRA: incremental learning of new LoRA parameters on a sequential series of tasks

Table 4 :
Comparison of different PLMs' performances across three orders in a standard continual learning benchmark.Results also include average accuracy ("avg") and multitask learning performance ("MTL").

Table 5 :
Comparison of different ranks $r$ of LoRA. This experiment is conducted with T5-base on the standard continual learning benchmark.