Overcoming Catastrophic Forgetting in Massively Multilingual Continual Learning

Real-life multilingual systems should be able to efficiently incorporate new languages as data distributions fed to the system evolve and shift over time. To do this, systems need to handle the issue of catastrophic forgetting, where the model performance drops for languages or tasks seen further in its past. In this paper, we study catastrophic forgetting, as well as methods to minimize it, in a massively multilingual continual learning framework involving up to 51 languages and covering both classification and sequence labeling tasks. We present LR ADJUST, a learning rate scheduling method that is simple, yet effective in preserving new information without strongly overwriting past knowledge. Furthermore, we show that this method is effective across multiple continual learning approaches. Finally, we provide further insights into the dynamics of catastrophic forgetting in this massively multilingual setup.


Introduction
Standard supervised NLP methods perform well when training on enough data from a uniform distribution. However, they fail to retain knowledge learnt in the past when sudden shifts occur in training data distributions. This drop in performance on data from past distributions is commonly referred to as catastrophic forgetting (McCloskey and Cohen, 1989; de Masson D'Autume et al., 2019; Biesialska et al., 2020a), where stability or preservation of knowledge is traded off for increased plasticity or the ability to acquire new knowledge. To tackle this issue, continual learning (CL) methods were proposed under various settings, such as limited compute or limited ability to store past data (Lopez-Paz and Ranzato, 2017; de Masson D'Autume et al., 2019). The data shifts commonly studied are obtained by training over a sequence of non-iid partitions (Chaudhry et al., 2018), different tasks (Jin et al., 2021), or by training on various domains such as task-oriented dialogue (Madotto et al., 2021), named entity recognition (Monaikul et al., 2021), part-of-speech tagging (Liu et al., 2020), and intent detection (Wu et al., 2021).
Lifelong learning is key to the success of deployed multilingual systems, enabling the system to incorporate annotated data for new languages as they become available without costly retraining and redeployment of the entire system. This sequential availability of data for new languages is a common case of training data shift (see Figure 1 for the task setup). Yet, the effect of catastrophic forgetting has not been systematically studied for multilingual models covering many diverse languages. M'hamdi et al. (2022) study continual learning in a cross-lingual setting limited to just six languages. The cross-lingual abilities of pre-trained models were found to drop when fine-tuning for a target language (Liu et al., 2021), although applying continual learning approaches can effectively reduce the magnitude of the effect (Lopez-Paz and Ranzato, 2017).
In this paper, we systematically study the effect of catastrophic forgetting, and strategies to mitigate it, in a massively multilingual setting covering up to 51 languages on three different tasks. We start by quantifying the extent to which forgetting happens when languages are presented to the model in sequence, identifying a drop of up to 16% F1 compared to training on all the data mixed together. Next, we propose LR ADJUST, a simple yet effective method that adjusts the learning rate over time to alleviate knowledge overwriting by the new language and preserve previously learned knowledge. This method is orthogonal to continual learning methods and can thus be readily combined with any of them. We find that across three different CL methods, LR ADJUST helps further reduce the gap between a fully trained model and the CL setup. We analyze cross-lingual transfer in the backward and forward directions to measure the influence of CL on previous tasks and its zero-shot ability, respectively. Finally, we analyze the effects of catastrophic forgetting when first training on multiple languages jointly and when using a curriculum learning approach informed by language similarity.

Task Setup
We define a curriculum of T tasks as an ordered set of data sets D = {D_1, D_2, ..., D_t, ..., D_T}, where D_t is the data set for task t; in our setting, each task is a distinct language. The weights of model θ are updated sequentially, θ_{t+1} ← f(θ_t, D_t), by minimizing the negative log-likelihood over data set D_t via gradient updates.
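As a toy illustration of the sequential update θ_{t+1} ← f(θ_t, D_t) (this is not the paper's implementation; the one-parameter model and all names below are invented for exposition):

```python
# Toy sketch of the sequential task setup theta_{t+1} <- f(theta_t, D_t):
# a one-parameter model fit by SGD on each "language's" data in turn.
# The drift of theta toward the latest task's optimum is the mechanism
# behind catastrophic forgetting.

def sgd_on_task(theta, data, lr, epochs=50):
    """Minimize the squared loss (theta - x)^2 over the task's data."""
    for _ in range(epochs):
        for x in data:
            grad = 2.0 * (theta - x)
            theta -= lr * grad
    return theta

def continual_training(theta, curriculum, lr):
    """curriculum = [D_1, ..., D_T], one data set per task (language)."""
    history = []
    for data in curriculum:              # tasks arrive strictly in sequence
        theta = sgd_on_task(theta, data, lr)
        history.append(theta)
    return theta, history

theta, history = continual_training(0.0, [[1.0, 1.2], [5.0, 5.2]], lr=0.1)
# After the second task, theta sits near the second task's optimum (~5.1)
# and has "forgotten" the first task's optimum (~1.1).
```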

Inter-task Learning Rate Adjustment
We propose LR ADJUST, a simple and effective method that adjusts the learning rate each time we start training on a new task. Our intuition is that models are more susceptible to catastrophic forgetting at higher learning rates, so the learning rate should be toned down over time to preserve the learned knowledge and to reduce the extent to which weights are overwritten by new knowledge. Learning rate adjustments have been studied in the context of incremental learning (Cavalin et al., 2009; Khreich et al., 2012) and for efficient optimization using schedules (Ge et al., 2019). Concretely, the learning rate is lowered at each task boundary as lr_t = max(lr_min, lr_{t-1} * γ), with a decay factor γ < 1, where lr_min is the minimum learning rate. The method is detailed in Algorithm 1.
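A minimal sketch of the schedule (variable names follow the notation above; the concrete values of γ and lr_min are illustrative, not the paper's hyper-parameters):

```python
# LR ADJUST: decay the learning rate at each task boundary, but never
# let it drop below the floor lr_min.

def lr_adjust(lr_prev, gamma, lr_min):
    """lr_t = max(lr_min, lr_{t-1} * gamma)"""
    return max(lr_min, lr_prev * gamma)

# Example schedule over six sequential tasks (illustrative values):
lr, gamma, lr_min = 3e-5, 0.5, 1e-6
schedule = []
for _ in range(6):
    schedule.append(lr)                 # learning rate used for this task
    lr = lr_adjust(lr, gamma, lr_min)   # decay before the next task starts
# schedule decays geometrically and then clamps at lr_min:
# [3e-05, 1.5e-05, 7.5e-06, 3.75e-06, 1.875e-06, 1e-06]
```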

Continual Learning Method
We experiment with the following continual learning approaches:
• Replay (de Masson D'Autume et al., 2019) uses an episodic memory to store previously seen training data and retrieve it for fine-tuning. We schedule a replay step every few iterations; during a replay step, we retrieve data from memory and fine-tune the model on it. The amount of stored data is constrained to ensure efficient memory use, and we sample uniformly across all labels.
• Averaged GEM (A-GEM) (Chaudhry et al., 2018) also utilizes an episodic memory M and is a more efficient implementation of GEM (Lopez-Paz and Ranzato, 2017). It computes gradient constraints from the memory and minimizes the current-task loss L_t(θ) subject to the constraint that the loss on M does not increase: L(θ, M) ≤ L(θ_{t−1}, M).
• Elastic Weight Consolidation (EWC) (Kirkpatrick et al., 2017) minimizes the loss L(θ) = L_t(θ) + Σ_i (λ/2) F_i (θ_i − θ*_{t−1,i})², where F_i is the Fisher information matrix, λ is a coefficient that sets how important the old task is compared to the new one, and θ*_{t−1} are the previously learned weights. F is pre-computed after each task is completed, and we add the penalty to the loss on each gradient update.
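As one concrete piece of the above, the A-GEM gradient rule can be sketched as follows (this is the standard A-GEM projection from Chaudhry et al. (2018), not necessarily the exact implementation used in these experiments):

```python
import numpy as np

# A-GEM: if the current-task gradient g conflicts with the gradient g_ref
# computed on the episodic memory M (i.e., g . g_ref < 0, so the update
# would increase the memory loss), project g to remove the conflict.

def agem_project(g, g_ref):
    """Return g, projected if it would increase the loss on memory."""
    dot = float(np.dot(g, g_ref))
    if dot >= 0.0:
        return g                          # no conflict: use g unchanged
    # subtract the component of g that points against g_ref
    return g - (dot / float(np.dot(g_ref, g_ref))) * g_ref

g = np.array([1.0, -3.0])      # gradient on the new language's batch
g_ref = np.array([1.0, 1.0])   # gradient on the memory batch
g_tilde = agem_project(g, g_ref)
# g . g_ref = -2 < 0 here, so the conflicting component is removed and
# g_tilde . g_ref == 0 afterwards.
```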

Data sets
We use a multilingual natural language understanding data set, MASSIVE (FitzGerald et al., 2022), and a multilingual named entity recognition (NER) data set, WikiAnn (Rahimi et al., 2019). The MASSIVE data set consists of two tasks, intent classification and slot filling, with 51 languages. The WikiAnn data set covers 40 languages. We adopt the data splits from the original papers.

Methods
Our model architecture is an encoder-only multilingual model, XLM-R BASE (Conneau et al., 2020), with a classification layer for each task. All parameters in the model are updated during training. The full hyper-parameters are listed in Appendix A, and the language orders for the experiments are listed in Appendix B. We experiment with the following approaches:
• MULTI: A single multilingual model is trained on data mixed from all languages. This represents an upper bound for CL methods, as there are no memory or data sequencing constraints.
• MONO: A separate model is trained on all the supervised data for each language and applied to all inputs in that language.
• VANILLA: A single model is trained by sequentially presenting data from each language. The language order is selected randomly.
• CL Methods: We run REPLAY, A-GEM, and EWC to train a single model on data from each language presented sequentially.
• CL Methods + LR ADJUST: We run the CL methods with the learning rate adjustment method described in Section 2.2.

Metrics
We measure cross-lingual transfer using CL metrics adapted from Lopez-Paz and Ranzato (2017). We define a matrix R ∈ R^{T×T}, where R_{i,j} denotes the test score of the model on task t_j after training on the last sample from task t_i. We formally define the metrics below.

Cross-lingual Forward Transfer (CFT)
This metric represents the ability to perform zero-shot learning, evaluated on the test data from tasks/languages that are unseen in training. We formally define the metric as CFT = (1/(T−1)) Σ_{i=1}^{T−1} X_i, where X_i = (1/(T−i)) Σ_{j>i} R_{i,j} is the average performance on the languages that will be seen in the future (t_{>i}).

Cross-lingual Backward Transfer (CBT)
This metric measures the influence of learning a task t_i on the performance of the previous tasks. We formally define the metric as CBT = (1/(T−1)) Σ_{i=1}^{T−1} (R_{i,i} − R_{T,i}). CBT practically measures the magnitude of catastrophic forgetting of past tasks after adding new tasks to the model; higher values indicate more forgetting.

Figures 2, 3 and 4 show performance numbers across the languages seen to that point in training for the three data sets. Each point on the graph shows the average and standard deviation of the F1 scores, obtained over 5 runs with different seeds. For the CL experiments, the seed also controls the order of languages in training. This can lead to higher variance across runs (compared to mixing the same data across runs), because the forgetting effect depends on the language order. Table 1 shows the forward and backward transfer for the three data sets.
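Both transfer metrics can be computed directly from the score matrix R. The sketch below follows one plausible reading of the definitions above (0-based indices; CBT reported as a positive forgetting magnitude, matching how backward transfer is discussed in the results):

```python
import numpy as np

# R[i, j] = test score on task j after finishing training on task i.

def cross_lingual_forward_transfer(R):
    """Average zero-shot score on tasks not yet seen during training."""
    T = R.shape[0]
    X = [R[i, i + 1:].mean() for i in range(T - 1)]  # future-task averages
    return float(np.mean(X))

def cross_lingual_backward_transfer(R):
    """Average drop on past tasks after the final task (forgetting)."""
    T = R.shape[0]
    drops = [R[i, i] - R[T - 1, i] for i in range(T - 1)]
    return float(np.mean(drops))

R = np.array([[80.0, 40.0, 30.0],
              [70.0, 85.0, 45.0],
              [60.0, 75.0, 90.0]])
# CFT = mean(mean(40, 30), 45) = 40.0
# CBT = mean(80 - 60, 85 - 75) = 15.0
```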
Results show the following. Catastrophic forgetting is present in multilingual continual learning, with performance dropping quickly when training on languages sequentially (VANILLA). This method converges to between 4 and 17 F1 below mixing the same data and training a full model each time (MULTI). We also see that this effect is present even when the performance of monolingual models is close to that of multilingual models, as in the case of WikiAnn. Training a full model on all languages (MULTI) always performs best, outperforming training on one language at a time (MONO), sometimes substantially (by 10 F1 on MASSIVE-Slot and 4 F1 on MASSIVE-Intent), highlighting the importance of cross-lingual transfer and of preserving information from past seen languages in a continual learning setup.
Continual learning methods generally help dampen the impact of forgetting. For example, on WikiAnn, the REPLAY and A-GEM CL methods reduce the backward transfer from 16.90 to 11.87 and 10.75, respectively, albeit EWC does not substantially improve performance relative to the VANILLA method.
LR ADJUST, the learning rate adjustment scheme, further reduces the gap to the multi-task model significantly and consistently across languages when combined with any of the CL methods. For example, on the WikiAnn data set, the backward transfer is reduced from 16.90 to just 3.59 and 3.79 for the A-GEM and REPLAY methods, respectively, making multilingual CL feasible. Further, we see that using CL methods alone results in a continuous drop in performance as more languages are added, while adding LR ADJUST stabilizes average performance after the first few languages, resulting in a flatter curve.
Finally, we see that the patterns of improvement hold when studying cross-lingual forward transfer, which quantifies zero-shot model performance on languages unseen in training. The continual learning approaches improve over sequential training (e.g., +4.45 on WikiAnn) and adding LR ADJUST further boosts performance (e.g., +9.25 for VANILLA and +5.83 for REPLAY on WikiAnn). This shows that the resulting models retain essential and generalizable task information that is more universal across languages.

Multi-task Training vs. Catastrophic Forgetting
We conduct additional experiments to understand whether initially training on multiple languages at once can reduce the severity of catastrophic forgetting in the CL setup when new languages are added. We run multi-task training on the first k languages, where k is 10 or 30, and then train on the remaining languages sequentially on the WikiAnn data set. As shown in Figure 5, the model is more robust to forgetting when it is exposed to multi-task training on more languages, achieving higher average scores at the final task, but performance still drops dramatically after the model is exposed to the first new language fed sequentially.

The Role of Language Order
To investigate the role of language order in CL, we reorder the language list using heuristics. We start with two languages from the same family, as listed in Ethnologue (Eberhard and Gary, 2019), add all languages from that family one by one, then switch to a new language family and continue the same process. We conjecture that seeing a similar language at regular intervals will allow more effective cross-lingual transfer.
Figure 5 (LANGUAGE ORDER) displays the results, which indicate that performance does not improve with this manually selected order; its performance is similar to random ordering (VANILLA).

Conclusion
We present the first study of catastrophic forgetting in a massively multilingual setting involving up to 51 languages on named entity recognition and natural language understanding tasks. We investigate continual learning methods and present a learning rate scheduling method that is simple yet effective in reducing the effects of catastrophic forgetting. Furthermore, we show that this method is effective across multiple continual learning methods. Finally, we provide analysis and further insights into the dynamics of catastrophic forgetting.

A Hyper-parameters
All experiments are run with five different seeds {42, 52, 62, 72, 82} on a V100 32GB GPU, and each run takes up to a week to finish.

A.1 WikiAnn
Table 2 shows the hyper-parameters used in the experiments on the WikiAnn data set.

A.2 MASSIVE-Slot
Table 3 shows the hyper-parameters used in the experiments on the MASSIVE-Slot data set.

A.3 MASSIVE-Intent
Additionally, we run the CL setup on the MASSIVE-Intent data set; the results are shown in Figure 4. Table 4 shows the hyper-parameters used in these experiments.

B Language Order
We randomly shuffle the language order for each seed. Tables 5 and 6 show the language orders used in the experiments for WikiAnn and MASSIVE, respectively.

C Geographical Information of Languages
Table 7 shows the language family and subgroup of each language in the WikiAnn and MASSIVE data sets.

Figure 1 :
Figure 1: Multilingual Continual Learning: The model θ is trained sequentially with data from T different languages.

Figure 2 :
Figure 2: Average F1 scores and standard deviation over 5 runs on WikiAnn (Rahimi et al., 2019) evaluated over increasing number of languages seen in training.

Figure 3 :
Figure 3: Average F1 scores and standard deviation over 5 runs on MASSIVE-Slot (FitzGerald et al., 2022) evaluated over increasing number of languages seen in training.


Figure 4 :
Figure 4: Average F1 scores and standard deviation over 5 runs on MASSIVE-Intent (FitzGerald et al., 2022) evaluated over increasing number of languages seen in training.

Figure 5 :
Figure 5: Multi-task strategies and training using the language order heuristics on WikiAnn. It shows average F1 scores and standard deviations over 5 runs.
Afshin Rahimi, Yuan Li, and Trevor Cohn. 2019. Massively multilingual transfer for NER. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 151-164.

Table 3 :
Hyper-parameters for MASSIVE-Slot data set.

Table 4 :
Hyper-parameters for MASSIVE-Intent data set.

Table 5 :
Language Order for Experiments with WikiAnn.

Table 6 :
Language Order for Experiments with MAS-SIVE.