Cross-lingual Continual Learning

The longstanding goal of multi-lingual learning has been to develop a universal cross-lingual model that can withstand the changes in multi-lingual data distributions. There has been a large amount of work to adapt such multi-lingual models to unseen target languages. However, the majority of work in this direction focuses on the standard one-hop transfer learning pipeline from source to target languages, whereas in realistic scenarios, new languages can be incorporated at any time in a sequential manner. In this paper, we present a principled Cross-lingual Continual Learning (CCL) evaluation paradigm, where we analyze different categories of approaches used to continually adapt to emerging data from different languages. We provide insights into what makes multilingual sequential learning particularly challenging. To surmount such challenges, we benchmark a representative set of cross-lingual continual learning algorithms and analyze their knowledge preservation, accumulation, and generalization capabilities compared to baselines on carefully curated datastreams. The implications of this analysis include a recipe for how to measure and balance different cross-lingual continual learning desiderata, which go beyond conventional transfer learning.


Introduction
With more than 7,000 languages spoken around the globe, downstream applications still lack proper linguistic resources across languages (Joshi et al., 2020), necessitating transfer learning techniques that take advantage of data mismatched to the application. In an effort to reduce architectural complexity and energy consumption, it is desirable to unify multi-lingual performance into a single, parameter- and memory-constrained model, and to allow this model to evolve, learning on multi-lingual training data as it becomes available without having to pre-train or fine-tune from scratch. Such is the longstanding goal of language representation learning. Existing multi-lingual representations such as M-BERT (Devlin et al., 2019) and XLM-R (Conneau et al., 2020) are strong pillars in cross-lingual transfer learning, but if care is not taken when choosing how to fine-tune them, they can fail to maximize transfer (Ruder et al., 2019) to new tasks or languages and are subject to forgetting (French, 1993), where performance decreases after exposure to a new task or language.

Figure 1: An overview of CCL. We use an example of a non-stationary datastream moving from high- to low-resource languages. Each bold and dashed box represents either a training or test data instance being fine-tuned or evaluated on, respectively. To support this problem setup, we evaluate the cross-lingual capabilities of continual approaches: knowledge preservation on old languages, accumulation to the current language, and generalization to unseen languages at each point of training. In addition, we evaluate model utility at the end of continual learning.
Most previous work that attempts to deal with the challenges of transfer exploitation and forgetting mitigation focuses on sequential learning over different NLP downstream tasks or domains (Sun et al., 2020; Han et al., 2020; Madotto et al., 2021), rather than on language shifts. Indeed, the current literature on learning over sequences of languages is rather scarce, and is mostly reduced to cross-lingual transfer learning between a pair of languages (Liu et al., 2021; Garcia et al., 2021; Muller et al., 2021; Pfeiffer et al., 2021; Minixhofer et al., 2022). Liu et al. pre-train a (parent) language model and then fine-tune it on a downstream task in one of several different (child) languages. This conflates task and language transfer and confuses analysis: the interference between the pre-trained language model 'task' and the fine-tuned task, along with the parent and child languages, cannot be disentangled. Garcia et al. propose a scheme that adapts to each new language pair independently while retaining translation quality on the parent language pairs. Similarly, Muller et al. (2021) and Pfeiffer et al. (2021) propose lexical- and semantic-level techniques to adapt to target languages. However, all these works still focus on the 'one-hop' case, consisting of two steps: (1) training on initial parent language(s) (or pairs), then (2) adapting to new child language(s) (or pairs); the effect of multiple shifts in the datastream is not trivially generalizable to more than one hop. More recently, Pfeiffer et al. (2022) propose language-specific modules based on adapters and evaluate them on sequential streams of languages. However, they focus only on adapters and on two desiderata of continual learning: interference mitigation and transfer maximization. We need a more robust, comprehensive, and fine-grained evaluation that balances the dynamics between different cross-lingual continual learning desiderata.
In this paper, we pave the way for a more comprehensive multi-hop continual learning evaluation that simulates the sequential learning of a single task over a stream of input from different languages. This evaluation paradigm requires experimentation over balanced streams of n data scenarios for n > 2. Unlike previous work, this paper concretely defines the following comprehensive goals, along with their evaluation metrics, as guidelines for analyzing the cross-lingual capabilities of multilingual sequential training: knowledge preservation, accumulation, generalization, and model utility, as shown in Figure 1. We apply our test bed to a six-language task-oriented dialogue benchmark and comprehensively analyze a wide variety of successful continual learning algorithms from previous literature, originally investigated in continual learning contexts other than the cross-lingual one, including (a) model expansion (Pfeiffer et al., 2020b), (b) regularization (Kirkpatrick et al., 2017), (c) memory replay (Chaudhry et al., 2019b), and (d) distillation-based approaches (Hinton et al., 2015; Aguilar et al., 2020). Our findings confirm the need for a multi-hop analysis and the effectiveness of continual learning algorithms in enhancing knowledge preservation and accumulation in our multilingual language model. We additionally demonstrate the robustness of different continual learning approaches to variations in individual data setup choices that would be misleading if presented in a traditional manner.
Our main contributions are: (1) We are the first to explore and analyze cross-lingual continual fine-tuning across multiple hops and show the importance of this multi-hop analysis in reaching clearer conclusions with greater confidence compared to conventional cross-lingual transfer learning (§4.1). (2) We demonstrate the aggregated effectiveness of a range of different continual learning approaches (Figure 1) at reducing forgetting and improving transfer (§4.3) compared to multilingual sequential baselines (§4.2). (3) We make concrete recommendations on model design to balance transfer and final model performance with forgetting (§4.3). (4) We show that the order of languages and the dataset size impact the knowledge preservation and accumulation of multi-lingual sequential fine-tuning, and we identify the continual learning approaches that are most robust to this variation (§4.4). (5) We analyze zero-shot generalization trends and their correlation with forgetting, and show that current continual learning approaches do not substantially improve generalization (§4.5).

Cross-lingual Continual Learning
In this section, we formally define cross-lingual continual learning, describe its goals and challenges, and introduce the downstream tasks, datastreams, and evaluation protocols used. Although we are not the first to define or investigate continual learning for languages, we are, to the best of our knowledge, the first to define and study cross-lingual continual learning in which continual learning is focused on languages only. Thus, we formally define cross-lingual continual learning as learning over a set of languages seen sequentially in multiple hops, which is truer to the terms 'cross-lingual' and 'continual learning', respectively. We distinguish this from 'cross-lingual cross-task cross-stage continual learning', which continually learns over a set of pre-training and downstream tasks sampled from different languages (Liu et al., 2021), and from 'cross-lingual one-hop transfer learning' (Garcia et al., 2021).

Problem Formulation
We define cross-lingual continual learning as the problem of sequentially fine-tuning a model θ for a particular downstream task K over a cross-lingual datastream. In this case, a cross-lingual datastream is made of N labeled and distinct datasets D1···N, each one sampled from a distinct language and consisting of separate train and test portions. Let hop i be the stage in cross-lingual continual learning at which θi is optimized to θi+1 via exposure to Di. Let L = {ℓ1, ℓ2, ..., ℓN} be a set of labeled languages, let S(L) be the set of all permutations of L, and, without loss of generality, let p ∈ S(L) be one such permutation and p[i] ∈ L the i-th language in p. The language of Di is p[i]. Therefore, by default, the number of languages used is equal to the number of datasets. Let D<i and D>i refer to the sequences of datasets (train or test portions, depending on context) used in hops 1 to i−1 and i+1 to N, respectively; we generalize these terms to D≤i and D≥i by including hop i as well at the end or, respectively, the beginning of the sequence.
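As a concrete illustration, the stream and slice notation above can be sketched in a few lines of Python. This is our own minimal sketch: the language codes and the dictionary-based dataset stand-ins are illustrative, not taken from the paper's implementation.

```python
from itertools import permutations

# Illustrative six-language set (codes and dict layout are ours).
LANGS = ["en", "de", "fr", "hi", "es", "th"]

def datastream(order):
    """A cross-lingual datastream: one dataset D_i per hop, in language p[i]."""
    return [{"hop": i + 1, "lang": lang} for i, lang in enumerate(order)]

# S(L): the set of all N! permutations of L; any one of them defines a stream.
all_orders = list(permutations(LANGS))
stream = datastream(all_orders[0])

def slices(stream, i):
    """D<i, D>i, D<=i, D>=i for a 1-indexed hop i."""
    return {
        "<i": stream[: i - 1],   # hops 1 .. i-1
        ">i": stream[i:],        # hops i+1 .. N
        "<=i": stream[:i],       # hops 1 .. i
        ">=i": stream[i - 1:],   # hops i .. N
    }
```

Note how D≤i and D≥i both include hop i itself, matching the definition above.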

Goals
We define the goals, necessarily dependent on each other, for our study of cross-lingual continual learning as follows (also depicted in Figure 1):
• Cross-lingual preservation. This is the ability to retain previous knowledge on seen languages.
• Cross-lingual accumulation. This is the ability to accumulate knowledge learned from previous languages to benefit learning on the current language.
• Cross-lingual generalization. This is the ability to generalize uniformly well to unseen languages which goes beyond accumulating knowledge up to the current languages.
• Model utility. This is the ability of the fully trained model to perform equally well on all languages.
In this paper, we wish to understand the relationships between these goals. Our aim is to arrive at a recipe for more systematic cross-lingual continual learning. Thus, we need to understand whether the goals are aligned with each other or whether maximizing some goals leads to minimizing others.

Challenges
Learning sequentially from a non-stationary data distribution (i.e., task datasets coming from different languages) can impose considerable challenges on the goals defined earlier:
• Catastrophic forgetting. This happens when fine-tuning a model on D≥i leads to a decrease in performance on D<i.
• Negative transfer. This happens when fine-tuning a model up to D ≤i leads to a lower performance on D i than training on it alone.
• Low zero-shot transfer. This happens when fine-tuning on D≤i gives lower performance than random on unseen D>i.
• Low final performance. This is when fine-tuning on all D ≤N gives an uneven performance between languages when tested on D ≤N at the end of training.

Downstream Tasks and Datastreams
Here, we describe the downstream tasks and multi-lingual sequential datastreams used.

Downstream Tasks. We choose task-oriented dialogue parsing as a use case and consider the multi-lingual task-oriented parsing (MTOP) benchmark (Li et al., 2021). Task-oriented dialogue parsing provides a rich testbed for analysis, as it encompasses two subtasks, intent classification and slot filling, allowing us to test different task capabilities in cross-lingual continual learning.

Datastream Construction. For a set of N languages L, our study considers a permutation subset P ⊂ S(L) with the following properties:
• |P| = |L| = N, i.e., P consists of N permutations, each of which is a sequence of N datasets, one in each of the N languages in L.
• H2L ∈ P, the permutation from the most high-resource to the most low-resource fine-tuning datasets, based on the training-split dataset size.
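A small sketch of how such a permutation subset could be constructed is shown below. The per-language training-split counts and the random filling of the remaining orders are illustrative assumptions, not the paper's actual statistics or sampling procedure.

```python
import random

# Hypothetical per-language training-split sizes; the counts are
# illustrative placeholders, not MTOP's actual statistics.
train_sizes = {"en": 15000, "de": 13000, "fr": 12000,
               "es": 11000, "hi": 10500, "th": 10000}

def h2l_order(sizes):
    """H2L: most high-resource to most low-resource, by train-split size."""
    return sorted(sizes, key=sizes.get, reverse=True)

def build_permutation_subset(sizes, seed=0):
    """A subset P of S(L) with |P| = N, always containing H2L (and, here,
    its reverse L2H); drawing the remaining orders at random is our
    assumption about how P is filled out."""
    rng = random.Random(seed)
    h2l = h2l_order(sizes)
    P = [h2l, list(reversed(h2l))]
    langs = list(sizes)
    while len(P) < len(langs):
        cand = langs[:]
        rng.shuffle(cand)
        if cand not in P:  # keep permutations distinct
            P.append(cand)
    return P
```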
In our experiments, we use MTOP (Li et al., 2021), a multi-lingual task-oriented dialogue dataset that covers six typologically diverse languages and spans 11 domains and 117 intents. We chose MTOP because it is the largest-scale dataset available for task-oriented dialogue and because it covers languages with varying amounts of available data resources. We use only the flat representation of slots (without nesting) to simplify our evaluation. We use the original data for most experiments. Table 1 shows a summary of the number of sentences (dialogue utterances) per language and split.

Evaluation Protocols
For each language permutation, we train on each dataset in sequence but continually evaluate on all languages. Let R be some success metric for evaluating a downstream task K, and let R_{i,≤j} be the evaluation on the test set of language ℓi after fine-tuning K on D≤j. We define the following meta-metrics (inspired by, but slightly different from, the metrics in Lopez-Paz and Ranzato (2017) and Chaudhry et al. (2019a)):

• Forgetting (F ↓). This is the average forgetting over all datasets (excluding the first dataset), computed as:

$$F = \frac{1}{N-1} \sum_{j=2}^{N} F_{\leq j}, \qquad F_{\leq j} = \frac{1}{j-1} \sum_{i=1}^{j-1} F_{i,\leq j} \quad (1)$$

where F_{≤j} is the average forgetting that has occurred at the point of training on Dj. We compute F_{i,≤j} = max_{k∈[1,j−1]} R_{i,≤k} − R_{i,≤j}; F_{i,≤j} is the degree to which performance on Di has suffered by continuing to train on D≤j instead of stopping before covering Dj.

• Transfer (T ↑). This is the average forward transfer, computed as:

$$T = \frac{1}{N-1} \sum_{i=2}^{N} T_i, \qquad T_i = R_{i,\leq i} - R_i \quad (2)$$

where Ri denotes the evaluation of a model fine-tuned only on Di. Ti is thus the incremental impact of sequential training on datasets prior to seeing Di. To measure generalization to new languages, we add a zero-shot transfer (T0 ↑) metric, measured as:

$$T^0 = \frac{1}{N-1} \sum_{i=2}^{N} T^0_i, \qquad T^0_i = \frac{1}{i-1} \sum_{j=1}^{i-1} R_{i,\leq j} - R^0_i \quad (3)$$

where T0_i is the average forward-transfer performance on a language ℓi after training on D<i, compared to the random performance R0_i before fine-tuning on any language (i.e., using fixed pre-trained M-BERT weights and randomly initialized weights for the output layer).

• Final performance (FP ↑). This is the average performance after training on all datasets in the studied stream, computed as:

$$FP = \frac{1}{N} \sum_{i=1}^{N} R_{i,\leq N} \quad (4)$$
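All four meta-metrics can be computed directly from a matrix of scores R[i][j], the performance on language i after training through hop j (0-indexed below). The sketch that follows is an unofficial re-implementation of those definitions, not the authors' evaluation code.

```python
def forgetting(R):
    """F: average over hops j >= 1 of the mean drop on earlier languages
    relative to their best score at any earlier hop."""
    N = len(R)
    per_hop = []
    for j in range(1, N):
        drops = [max(R[i][k] for k in range(j)) - R[i][j] for i in range(j)]
        per_hop.append(sum(drops) / j)
    return sum(per_hop) / (N - 1)

def transfer(R, R_indep):
    """T: mean over i >= 1 of R[i][i] - R_indep[i], where R_indep[i] is the
    score of a model fine-tuned on D_i alone."""
    N = len(R)
    return sum(R[i][i] - R_indep[i] for i in range(1, N)) / (N - 1)

def zero_shot_transfer(R, R0):
    """T0: mean over i >= 1 of the average zero-shot score on language i
    (from models before hop i) minus the random baseline R0[i]."""
    N = len(R)
    return sum(sum(R[i][j] for j in range(i)) / i - R0[i]
               for i in range(1, N)) / (N - 1)

def final_performance(R):
    """FP: mean score over all languages after the final hop."""
    N = len(R)
    return sum(R[i][N - 1] for i in range(N)) / N
```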

Methods
For our base model, we use the same M-BERT-based architecture as was used in Castellucci et al. (2019) and M'hamdi et al. (2021) to jointly learn the intent classification and slot filling subtasks of MTOP. On top of that, we define baselines, non-continual learning reference models, and continual learning algorithms.

Baseline & Reference Models
Before delving into continual learning approaches, we consider a simple lower-bound baseline. In addition, we design reference models trained either from scratch for each new language, in a joint manner, or in a sequential multi-hop manner. These are upper-bound, non-continual learning models used to assess the performance of models trained with continual learning methodologies. Such reference models are in general superior to continual learning models, but they can also be less efficient or infeasible. For a fair comparison, all models use the same base model architecture and loss, with no further additions or special optimizations.

Lower-bound Baseline. This is naive sequential fine-tuning (Naive Seq FT), which fine-tunes sequentially with no continual learning mechanism.
Non-continual Learning Upper-bound Models.
These are stronger upper-bound models used as reference points of performance. However, they are either inefficient or prohibitive in the context of cross-lingual continual learning: some require training from scratch for each language, which is inefficient; others require access to all languages, either simultaneously or incrementally, which can be restrictive due to privacy or storage concerns.
• Language-specific fine-tuning (Lang-Spec FT). This trains an independent model for each language ℓi using only Di.
• Multi-lingual learning (Multilingual). This trains a single model jointly across all datasets D≤N.
• Incremental joint learning (Inc Joint). This trains incrementally, adding the dataset for each language in the stream over the following hops: 1) D≤1, 2) D≤2, ..., N) D≤N. This is the only sequential reference model.
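The difference between the baseline and the reference models comes down to which data each training run sees. The schematic below makes that explicit; it is a data-flow sketch of our own (dataset names are placeholders, and no actual training is shown).

```python
def training_schedules(stream):
    """Which data each training run sees, per setup.
    `stream` is the ordered list [D_1, ..., D_N]."""
    N = len(stream)
    return {
        # Naive Seq FT: one model, fine-tuned hop by hop on D_i alone.
        "Naive Seq FT": [[stream[i]] for i in range(N)],
        # Lang-Spec FT: N independent models, each trained on one D_i.
        "Lang-Spec FT": [[stream[i]] for i in range(N)],
        # Multilingual: a single joint run over all datasets.
        "Multilingual": [stream[:]],
        # Inc Joint: hop i trains on D<=i, i.e. all data seen so far.
        "Inc Joint": [stream[: i + 1] for i in range(N)],
    }
```

Naive Seq FT and Lang-Spec FT see the same per-hop data; they differ only in whether one model is carried across hops or a fresh model is trained per language.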

Continual Learning Approaches
To continually fine-tune on different languages, we establish a representative set of strong approaches spanning the following categories, inspired by previous evaluation paradigms such as Jin et al. (2022)'s lifelong, domain-incremental language model pretraining. To the best of our knowledge, we are the first to exhaustively investigate such approaches in the context of cross-lingual continual learning; previously, these approaches were investigated separately, for different problem definitions. More details about the approaches and their hyperparameters can be found in Appendix B and Appendix C.2, respectively.
Model Expansion. We consider the following approaches, which add hop-specific parameters, as shown in Figure 2. We expand on either the input side (i.e., M-BERT representations) or the output side (i.e., task-specific prediction heads). For the former (Lang-Spec Trans), the transformer layers are replicated for each hop while the prediction heads are shared. To expand on the output side (Lang-Spec Task), we use different prediction heads across hops but share the M-BERT layers. We additionally consider Lang-Spec Enc[1-9], which trains M-BERT encoder layers 1...9 in a language-specific manner while sharing the rest. We also separately add MAD-X adapters (Pfeiffer et al., 2020b), either fine-tuning the adapter layers and freezing the rest of M-BERT (Lang-Spec Ada(F)) or tuning both (Lang-Spec Ada(T)).

Regularization. We focus on elastic weight consolidation (EWC) (Kirkpatrick et al., 2017), which mitigates catastrophic forgetting by reducing changes to parameters deemed critical to previously seen languages. We use the online version of EWC (EWC-Online) for efficiency.
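To make the regularizer concrete, here is a minimal, scalar-parameter sketch of the online-EWC penalty. The hyperparameter names (lam, gamma) are our own, and a real implementation would operate on full model tensors rather than dicts of floats.

```python
def ewc_online_penalty(params, anchor, fisher, lam=0.1):
    """Quadratic EWC penalty: (lam / 2) * sum_k F_k * (theta_k - theta*_k)^2,
    pulling current parameters toward the anchor from the previous hop,
    weighted by each parameter's (running) Fisher importance."""
    return 0.5 * lam * sum(fisher[k] * (params[k] - anchor[k]) ** 2
                           for k in params)

def update_running_fisher(fisher, new_fisher, gamma=0.9):
    """Online EWC's key trick: keep one running Fisher estimate, decaying
    the accumulated values and adding the estimate from the hop just done,
    instead of storing one Fisher matrix per hop."""
    return {k: gamma * fisher[k] + new_fisher[k] for k in fisher}
```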
Memory Replay. We use experience replay (ER) (Chaudhry et al., 2019b), which alleviates forgetting by maintaining a fixed-size memory equally balanced between the different languages and regularly drawing examples from the memory to replay.
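A minimal sketch of a language-balanced replay memory in the spirit of ER is given below. The exact quota enforcement and batch-drawing policy here are our simplifications, not the precise bookkeeping used in the experiments.

```python
import random

class BalancedReplayMemory:
    """Fixed-capacity memory split equally among seen languages."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.per_lang = {}
        self.rng = random.Random(seed)

    def add_language(self, lang, examples):
        # Re-balance: each language keeps at most capacity / #languages items.
        self.per_lang[lang] = list(examples)
        quota = self.capacity // len(self.per_lang)
        for l, ex in self.per_lang.items():
            if len(ex) > quota:
                self.per_lang[l] = self.rng.sample(ex, quota)

    def sample(self, batch_size):
        # Draw a replay batch from the pooled memory of past languages.
        pool = [x for ex in self.per_lang.values() for x in ex]
        return self.rng.sample(pool, min(batch_size, len(pool)))
```

During training on the current language, replay batches drawn this way would be interleaved with regular mini-batches at fixed intervals.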
Distillation-based. On top of ER, we distill dark knowledge (Kariya, 2018) from previous model checkpoints. We explore two variants: logit distillation (KD-Logit) (Hinton et al., 2015) and representation distillation (KD-Rep) (Aguilar et al., 2020), which minimize the mean squared error between the current and previous models' output logits or M-BERT representations, respectively.
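Both variants reduce to the same mean-squared-error objective applied to different tensors. A toy sketch over plain Python lists (real models would use framework tensors):

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def kd_logit_loss(student_logits, teacher_logits):
    """KD-Logit: match the current model's output logits to those of the
    previous checkpoint on replayed examples."""
    return mse(student_logits, teacher_logits)

def kd_rep_loss(student_reps, teacher_reps):
    """KD-Rep: the same objective applied to encoder representations
    instead of output logits."""
    return mse(student_reps, teacher_reps)
```

The distillation term would be added to the task loss on replayed examples, with the previous checkpoint acting as a frozen teacher.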

Results & Analysis
In this section, we provide an extensive analysis in the form of different ablation studies, asking critical questions that revolve around the continual learning goals described in §2.2. For §4.2, scores are reported using accuracy (Acc) for intent classification and F1 score (F1) for slot filling. For the remaining sections, all results are reported for intent classification only; slot filling results, for which the same trends are observed, can be found in Appendix D. Bootstrap sampling (over test data shuffling) is used to compute the average and 95% confidence intervals (averaged over all language permutations, except in §4.4). More details can be found in Appendix C.3. We also separately repeat key experiments over 3 different seeds and obtain similar findings, reported in Appendix E. We report results using bootstrap sampling since it yields tighter confidence intervals.
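As a reference point for this evaluation setup, a percentile-bootstrap confidence interval over per-example test scores can be sketched as follows. The resample count and seed are arbitrary choices of ours, not the paper's settings.

```python
import random

def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample per-example scores with replacement,
    recompute the mean each time, and take the alpha/2 and 1 - alpha/2
    quantiles of the resampled means as the confidence interval."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(sum(rng.choice(scores) for _ in range(n)) / n
                   for _ in range(n_resamples))
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(scores) / n, (lo, hi)
```

Larger test sets (and, analogously, averaging over more hops) shrink the resulting interval, which is the effect discussed in §4.1.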

How is a Multi-Hop Analysis Different from its One-Hop Counterpart?
To motivate our cross-lingual continual learning evaluation paradigm, we start by investigating how a multi-hop analysis is different from a conventional one-hop transfer learning analysis. Figure 3 shows a comparison between the two in terms of forgetting (Eq. 1) for different approaches, aggregated over different language permutations. More results for slot filling and other metrics can be found in Figure 9 in Appendix D.5.

Figure 3: Comparison between forgetting trends for intent classification using one-hop (crossed boxplots on the left) and multi-hop analysis (dotted boxplots on the right), showing the variance over different language permutations. One-hop analysis exhibits higher variance than its multi-hop counterpart.
Lang-Spec Trans tends to have the least forgetting and Naive Seq FT the most, but importantly, the variance for the multi-hop analysis is much smaller than that for the one-hop analysis. Having larger confidence intervals, the one-hop analysis also tends to be misleading, in the sense that certain models are depicted as performing well when that is not truly the case. For example, according to the one-hop analysis, Naive Seq FT shows a range of forgetting from very little (0.5) to a lot (2.0), so in some circumstances it exhibits little forgetting and thus appears to perform well. According to the multi-hop analysis, however, it clearly forgets a lot, and with more confidence. Therefore, the multi-hop analysis leads to more conclusive findings. We conjecture that averaging over more hops and over balanced, diversified datastreams is what leads to narrower confidence intervals, in agreement with the well-known fact that larger sample sizes lead to narrower confidence intervals (Hazra, 2017).

Can a Multi-lingual Language Model Learn to Preserve and Accumulate Knowledge across Different Languages?
Given the conclusiveness of the multi-hop analysis in §4.1, we follow that type of analysis thereafter.
In this section, we investigate how well the baseline and the different non-continual learning reference models learn to preserve and accumulate knowledge across different languages, by looking at the average over language permutations. Since not all reference models are sequential, we start by comparing them to the baseline using their final performances (Eq. 4). The final performance is indicative of how well a single final model can encapsulate the knowledge across all languages at the end of training. From Table 2, we notice that Naive Seq FT and Multilingual have the worst and best final performances, respectively. This suggests that a multilingual joint model is more beneficial than sequential models. In practical scenarios, however, we may not have access to all languages at the same time. Among the non-continual learning reference models, Inc Joint comes closest to Multilingual and is nearly as good, although it requires that all previously seen data be preserved, which may likewise not always be possible. Training incrementally and sequentially (Inc Joint) is also more beneficial than fine-tuning on just the language of interest (Lang-Spec FT), as the former exploits cross-lingual transfer capabilities.

Table 2: The average final performance across different language permutations for the baseline compared to reference models. We highlight the best scores in bold and underline the second best across models.

We focus, thereafter, on Inc Joint and compare its forgetting (Eq. 1) and transfer (Eq. 2) trends to the baseline Naive Seq FT, as shown in Table 3. Inc Joint exhibits significantly less forgetting, which also causes its final performance to be higher than Naive Seq FT's. This suggests that recalling previously used training data helps knowledge preservation. However, Naive Seq FT seems to slightly outperform Inc Joint in terms of transfer, although this difference is not statistically significant. We hypothesize that this could be because Inc Joint is exposed to all resources from previously seen languages, so the data distribution over all these languages may distract the model from learning on the new one.

Table 3: Forgetting (F) and transfer (T) performance averaged across different language permutations for the sequential baseline and reference models. We highlight the best models in bold for each subtask and metric.

Is Continual Learning Effective in Boosting Knowledge Preservation, Accumulation, and Model Utility?
To study the effectiveness of continual learning approaches, we compare them to the baseline using the average over language permutations. We show, in Figures 4(a) and 4(c), the final performances (Eq. 4) and transfer (Eq. 2) of different approaches, respectively, versus their negative forgetting (Eq. 1).
In general, we observe that continual learning approaches mitigate forgetting and improve final performance. They also improve transfer to some degree, though the gains are mostly not significant compared to Naive Seq FT (Appendix C.3). From Figure 4(a), we notice that the model expansion approaches Lang-Spec Trans and Lang-Spec Enc[1-9] (described previously) are good at mitigating forgetting and improving the final performance, while Lang-Spec Task is not. M-BERT, when trained in a language-specific manner, is responsible for encapsulating the cross-lingual representations that enable knowledge preservation, whereas changes to the downstream task-specific layers do not make much of a difference. This implies that in cross-lingual continual learning, more attention should be paid to how to train those representations in a language-specific manner efficiently. Lang-Spec Ada(T) is one way to do this more efficiently, but its performance still lags behind the other model expansion approaches. ER achieves a performance close to Lang-Spec Trans and Lang-Spec Enc[1-9], which suggests that replaying a portion of past data from memory is beneficial. The baseline approach, which suffers from the highest forgetting, also shows the lowest final performance and transfer in Figures 4(a) and 4(c). As continual learning approaches reduce forgetting, they also improve the final performance, and some of them improve transfer as well, though not to the same degree. This suggests that the less a model forgets, the easier it is to learn a strong final model. However, there is no direct correlation between forgetting and transfer. For example, Lang-Spec Trans is the best model at reducing forgetting but also the worst in terms of transfer. This could be because Lang-Spec Trans behaves similarly to Lang-Spec FT, so its transfer, which is the difference between its performance and that of Lang-Spec FT, is almost null.
On the other hand, although Lang-Spec Ada(F) has the highest transfer, it has the lowest final performance and close-to-average forgetting. Although an adapter is no longer updated after its language has been fine-tuned on, we think its forgetting could be due to the shared task-specific layer, which brings its forgetting closer to Lang-Spec Trans than to Lang-Spec Ada(T), which also shares and tunes M-BERT. We show in Figure 4(b) that there is no direct correlation between final performance and transfer either. This posits that all three metrics need to be studied independently for a more insightful analysis.

Which Permutations Impose More Challenges on Knowledge Preservation, Accumulation, and Model Utility?
So far our analysis has focused on the average over language permutations, but are the same patterns observed for individual permutations? To shed light on this, we analyze the performance of the different continual learning algorithms and the baseline in terms of their forgetting (Eq. 1), transfer (Eq. 2), and final performance (Eq. 4) over the H2L and L2H permutations, in Table 4. (Full results for slot filling, more language permutations, and a balanced version of the data can be found in Appendix D.3.) In general, we observe that it is more challenging to learn from low- to high-resource languages. However, model expansion and memory replay approaches reduce the forgetting and final-performance gaps between language permutations. We hypothesize that L2H is more challenging than H2L because the fine-tuning data size differs between languages.

Table 4: Comparison of intent classification for two language permutations. We highlight in bold the best forgetting (F), highest transfer (T), and final performance (FP) of accuracy scores among H2L and L2H, whereas the best and second best scores across approaches for H2L and L2H separately are underlined and italicized, respectively. We report mean performances for each metric and language order. All 95% confidence intervals range from ±0.01 to ±0.04.

To verify this hypothesis, we check whether the difference in fine-tuning data sizes between languages is the main factor by performing an ablation study: we use the same amount of fine-tuning and evaluation resources for each language (9,219 train, 1,285 dev, and 2,299 test instances) and report results for Naive Seq FT in Table 5. We notice that a gap between these two language permutations remains for forgetting and final performance. This suggests that the difference in fine-tuning data size is not what accounts for the differences between the two permutations; there are perhaps biases in pre-training or other linguistic artifacts that need to be studied in future work.

Table 5: Performance comparison on intent classification between two versions of the data, the original and a balanced version, for Naive Seq FT across the same permutations as Table 4. We bold the best among H2L and L2H for each metric.

How do Continual Learning Models Generalize to Unseen Languages?
To analyze zero-shot transfer to languages unseen during fine-tuning, we plot zero-shot transfer (Eq. 3) as a function of negative forgetting, averaged over the different language permutations, to investigate any relationship between generalization and preservation. In Figure 4(d), we infer that most continual learning approaches do not substantially improve generalization compared to Naive Seq FT. In particular, model expansion approaches (in red) hurt generalization even though they significantly reduce forgetting. This zero-shot transfer versus interference trade-off is referred to as the stability-plasticity dilemma (Mermillod et al., 2013), where the weights responsible for improving on new tasks are often responsible for forgetting previous tasks. Except for model expansion approaches, we notice that approaches that reduce forgetting also improve generalization compared to Naive Seq FT.
Better approaches to balancing the two can be investigated in future work.

Related Work

(2021) analyze the adaptability and usability of large language models for unseen and under-studied low-resource languages. However, these works all focus on a one-hop analysis, from high- to low-resource language pairs or from pre-training to fine-tuning tasks, unlike our work, which analyzes across multiple hops. More recently, Pfeiffer et al. (2022) propose a new methodology based on adapters and show that their approach mitigates negative interference between languages while enabling positive transfer. They use a multi-hop evaluation paradigm closer to our setup, but they evaluate only adapters, using interference and transfer, and do not analyze other aspects of cross-lingual continual learning capabilities.

Conclusion
We formulate the cross-lingual continual learning problem setup. We show that naive sequential fine-tuning is prone to catastrophic forgetting and has poor accumulation capabilities that are sensitive to different language permutations. We provide the first benchmark comparing the effectiveness of different continual learning algorithms for the cross-lingual case. We show that continual learning models improve cross-lingual knowledge preservation, which also contributes to improving final model performance and, to a lesser degree, accumulation and generalization. We also discuss the challenges of sequentially training on certain language permutations. We hope that this study will encourage more analyses in the same spirit, covering more benchmarks and datastream setups, to gain insights that go beyond conventional cross-lingual transfer learning.

Limitations
Application to Other Benchmarks A central limitation of our work is that the main experiments are based on a single task-oriented dialogue benchmark. While there are multiple other natural language understanding benchmarks, like XNLI, XQuAD, MLQA, and PAWS-X (Conneau et al., 2018; Artetxe et al., 2020; Lewis et al., 2020; Yang et al., 2019), that could also be used to back up our claims, we argue that this is outside the scope of this paper. The main objectives of this paper are to first come up with a new definition of a cross-lingual continual learning challenge and then to give an example using a comprehensive and realistic benchmark like task-oriented dialogue, so as to catalyze more research in that direction.
Choice of Realistic Permutations For more realistic setups of continual learning, we need an approach to define continual learning annotation scenarios for languages. Rather than brute-forcing all possible ways the languages could be annotated at different stages, a principled approach would be preferable. Since it is hard to tell whether there is any logic or pattern in the annotation process itself, and given the sheer number of realistic scenarios, we chose one scenario experienced by some users: a model is built for a user, then the user reveals that more languages are desired. In our work, we test the plausibility of continual learning approaches where the sequence moves from one language to another without repetition of the same language. Working on scenarios where the data from different languages are integrated as soon as they are annotated, implying different languages for different hops, is out of the scope of this paper.
Data and Model Size Analysis In this paper, we pick certain model expansion approach variations to analyze the effect of model components (one aspect of model size) and two data distribution scenarios. However, an extensive analysis of the effect of data scale and model size is beyond the scope of our work. We agree that different data sizes can be used, and it would be interesting to analyze different supervision levels, such as using different proportions of the data for each language and simulating few-shot scenarios. We believe that for low-resource scenarios we need to investigate dedicated continual learning approaches such as meta-learning. We plan to investigate that in future work.

Application to Other Transformers
Another possible limitation of our work is the restriction of the evaluation to a base model on top of M-BERT Transformers. With the advent of Transformer-based encoders as strong pillars for transfer learning, several Transformers such as XLM-R have been proposed more recently. Although those models have been shown to outperform M-BERT on numerous downstream applications, especially on low-resource languages (Conneau et al., 2020), M-BERT is still largely used due to its reduced number of parameters. In our specific continual learning challenge, efficiency is a top concern, as we are training in multiple hops and benchmarking different models, which made M-BERT the feasible choice in our use case. We leave experimenting with other Transformer-based encoders to future work.

Acknowledgements
This material is partially based upon work supported in part by the Office of the Director of National Intelligence (

Related Work
Continual learning for cross-lingual NLP is underexplored. Existing work either focuses on cross-lingual approaches that only indirectly support continual learning, such as Artetxe et al. (2020), who study the transferability of monolingual models, or derives a cross-lingual continual learning problem directly from cross-lingual transfer learning, such as Garcia et al. (2021), who propose a lexical approach to adapt to new low-resource languages for machine translation. Similarly, Pfeiffer et al. (2021) propose lexical-level adaptation schemes that can be applied to models relying on subword-based tokenization to adapt them to low-resource languages that are not covered, or whose scripts are unseen, during pre-training. Minixhofer et al. (2022) also propose adaptations that go beyond the lexical level; their approach facilitates the creation of monolingual language models that are transferable to new languages. Liu et al. (2021) explore continual techniques to fine-tune on downstream applications for new languages while preserving the original cross-lingual ability of the pre-trained model. However, they all focus on a one-hop analysis from high- to low-resource language pairs or from pre-training to fine-tuning tasks, unlike our work, which analyzes across multiple hops. Muller et al. (2021) analyze the adaptability and usability of large language models on unseen and under-studied low-resource languages. Based on that, and depending on the degree of additional pre-training and fine-tuning required, they categorize the low-resource languages into easy, intermediate, and hard. Although this work paves the way for a better understanding of the mechanics of transferability to low-resource scenarios, it does not study the scenario where transferability needs to be performed in multiple hops following a sequential stream of data. More recently, Pfeiffer et al. (2022) propose a new methodology based on language-specific modules that add additional capacity to deal with the curse of multilinguality, and show that their approach mitigates negative interference between languages while enabling positive transfer. They use a continual learning multi-hop evaluation paradigm closer to our setup, but they only evaluate interference and transfer, only with one adapter-based approach, and do not analyze other aspects of cross-lingual continual learning capabilities holistically as our work does.

B.1 Downstream Model
More specifically, a multi-lingual pre-trained model is used to encode the input. Then, to predict the intent and slot spans, we add task-specific prediction heads. For intent prediction, this takes the form of a linear layer plus softmax on top of the [CLS] token representation. For slot filling, we use a sequence labeling layer in the form of a linear layer plus CRF. We use the sum of the intent loss and the CRF-based slot loss to optimize the model parameters.
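As a sketch of the prediction heads described above (illustrative NumPy only; the weight names and dimensions are assumptions, and the CRF on the slot side is reduced to its emission scores for brevity):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def prediction_heads(cls_repr, token_reprs, w_intent, w_slot):
    """Intent head: linear + softmax over the [CLS] representation.
    Slot head: per-token linear scores (the CRF layer that would consume
    these emission scores is omitted in this sketch)."""
    intent_probs = softmax(cls_repr @ w_intent)   # (n_intents,)
    slot_emissions = token_reprs @ w_slot         # (seq_len, n_slot_labels)
    return intent_probs, slot_emissions
```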

B.2 Model Expansion
Model expansion methods, such as Lang-Spec Trans and Lang-Spec Enc[1-9], are fine-tuned for each language with either an entirely or partially language-specific M-BERT (the whole 12 layers in addition to the embeddings for Lang-Spec Trans, or just the first 9 encoder layers for Lang-Spec Enc[1-9]). When fine-tuning on a new language, the parameters previously tuned on the old languages are kept unchanged, while the remaining parameters that are not language-specific are fine-tuned. During the evaluation on a particular language, the parameters tuned for that language are restored and used if the language has been seen in training. Otherwise, the parameters initialized from M-BERT (before fine-tuning on any language) are used for zero-shot evaluation.
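A minimal sketch of this save/restore bookkeeping (hypothetical names; the actual implementation would operate on M-BERT parameter tensors rather than plain dictionaries):

```python
class LangSpecStore:
    """Keep a per-language copy of the language-specific parameters and
    fall back to the pre-trained (M-BERT) initialization for languages
    unseen during training (zero-shot evaluation)."""

    def __init__(self, init_params):
        self.init = dict(init_params)   # snapshot before any fine-tuning
        self.per_lang = {}

    def save(self, lang, params):
        # called after fine-tuning on a new language
        self.per_lang[lang] = dict(params)

    def load(self, lang):
        # restore tuned parameters if seen, otherwise the initialization
        return self.per_lang.get(lang, self.init)
```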
Adapters consist of down-projection layers followed by up-projection layers inserted between layers of our Transformer encoder, in addition to their invertible components. We do not add task-specific adapters, as our ablation studies showed they were not beneficial. We add adapter components to every encoder layer following the MAD-X configuration and using their pre-trained weights.12 We either fine-tune the weights for the languages available in AdapterHub or train from scratch for languages with no pre-trained adapter weights. At inference time, we use adapter layers fine-tuned independently for each language in the datastream.

B.3 Online Elastic Weight Consolidation (EWC-Online)
To penalize changes in the parameters crucial to previous languages, we use EWC, which adds a regularization term to the loss, applied only after the first dataset $D_1$ in the language stream is seen. For all $i \in 2 \ldots N$, we compute the total loss as follows:

$$\mathcal{L}_{total} = \mathcal{L}_{cur} + \lambda \, \mathcal{L}_{reg},$$

where $\mathcal{L}_{cur}$ is the usual loss of the downstream task on the current data $D_i$, $\mathcal{L}_{reg}$ is the regularization term, and $\lambda$ is a hyperparameter controlling the regularization strength (fixed to 20). For efficiency purposes, we use the online version of EWC (EWC-Online). Following the formulation in van de Ven et al. (2022), our regularization term is computed as:

$$\mathcal{L}_{reg} = \sum_{j=1}^{N_p} \tilde{F}^{(i-1)}_{jj} \left( \theta_j - \hat{\theta}^{(i-1)}_j \right)^2,$$

where $\theta$ are the parameters of the Transformer model in addition to the downstream prediction heads, $N_p$ is the total number of parameters, $\hat{\theta}^{(i-1)}$ are the parameter values after training on the last language just before $D_i$, and $\tilde{F}^{(i-1)}_{jj}$ is the $j$-th diagonal element of the running Fisher information matrix at that point, accumulated as $\tilde{F}^{(i)}_{jj} = \gamma \tilde{F}^{(i-1)}_{jj} + F^{(i)}_{jj}$ with $\tilde{F}^{(1)}_{jj} = F^{(1)}_{jj}$. In practice, $F^{(i)}$ is estimated from the squared gradients of all parameters flattened into one single vector.
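The EWC-Online bookkeeping can be sketched as follows (a simplified illustration using flattened NumPy parameter vectors, not the paper's actual code; `gamma` and `lam` mirror the γ and λ hyperparameters):

```python
import numpy as np

class OnlineEWC:
    def __init__(self, n_params, gamma=0.1, lam=20.0):
        self.gamma, self.lam = gamma, lam
        self.fisher = np.zeros(n_params)  # running diagonal Fisher estimate
        self.theta_prev = None            # parameters after the last language

    def consolidate(self, grads, theta):
        """Call after finishing a language: update the running Fisher
        (diagonal approximated by squared gradients) and snapshot theta."""
        self.fisher = self.gamma * self.fisher + grads ** 2
        self.theta_prev = theta.copy()

    def penalty(self, theta):
        """Regularization term added to the current-language loss."""
        if self.theta_prev is None:       # no penalty on the first language
            return 0.0
        return self.lam * float(np.sum(self.fisher * (theta - self.theta_prev) ** 2))
```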

B.4 Experience Replay (ER)
After training on each $D_i$ for all $i \in 1 \ldots N$, we populate the memory with randomly sampled examples from $D_i$. For each $D_i$ with $i \in 2 \ldots N$, after training for every $k = 100$ mini-batches and optimizing for the current loss separately, the model randomly samples an equally sized batch from the memory for each $D_j$ with $j \in 1 \ldots (i-1)$ and replays it using the current model checkpoint being trained on $D_i$. We retrieve an equal number of memory items from each language at each step and hop. The loss on the current $D_i$ and the loss on the memory from the $D_j$ are interleaved, as the replay on the memory only happens every $k$ steps. This prioritization of the current language helps make the training more stable without over-fitting on the small memory from previous languages.
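A sketch of the interleaving schedule (illustrative only; batch contents, batch size, and memory layout are placeholders):

```python
import random

def replay_schedule(num_batches, memory, k=100, batch_size=16, seed=42):
    """Return ('current', step) for every mini-batch of the current language,
    and after every k steps one ('replay', lang, batch) entry per previous
    language, sampled uniformly from that language's memory."""
    rng = random.Random(seed)
    schedule = []
    for step in range(1, num_batches + 1):
        schedule.append(("current", step))
        if step % k == 0:
            for lang, examples in memory.items():
                batch = rng.sample(examples, min(batch_size, len(examples)))
                schedule.append(("replay", lang, batch))
    return schedule
```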

B.5 Knowledge Distillation (KD-Logit & KD-Rep)
We use the same strategy explained in §B.4 to select the memory, which is replayed here through a knowledge distillation loss. For each $D_i$ with $i \in 2 \ldots N$, after training for every $k = 100$ mini-batches, we randomly sample an equally sized batch from the memory for each $D_j$ with $j \in 1 \ldots (i-1)$. We also load the model checkpoint for each hop $j$ and use that model and the memory for $D_j$ to compute either the intent and slot logits in the case of KD-Logit or the multilingual representations of M-BERT in the case of KD-Rep. We do the same using the current model checkpoint. Then, for KD-Logit, we use the mean squared error loss to minimize the distance between the intent logits obtained using the previous and current model checkpoints, and do the same for the slot logits. We then aggregate the intent and slot distillation losses across the different languages retrieved from the memory. The same is done for computing the distillation loss over the multilingual representations in KD-Rep.
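Sketched numerically (hypothetical shapes; teacher logits come from the previous-hop checkpoint and student logits from the current one, and averaging across languages is our assumption for the aggregation):

```python
import numpy as np

def mse(a, b):
    # mean squared error between teacher and student outputs
    return float(np.mean((a - b) ** 2))

def kd_logit_loss(teacher_logits, student_logits):
    """Per language sampled from memory, sum the intent and slot
    distillation (MSE) losses, then average across languages. Each dict
    maps a language to an (intent_logits, slot_logits) pair."""
    losses = []
    for lang, (t_intent, t_slot) in teacher_logits.items():
        s_intent, s_slot = student_logits[lang]
        losses.append(mse(t_intent, s_intent) + mse(t_slot, s_slot))
    return sum(losses) / len(losses)
```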

C.1 Datastreams
We use the following datastreams for all our experiments as summarized in

C.2 Implementation Details
For all experiments, we use M-BERT (bert-base-multilingual-cased)13 with 12 layers as our pre-trained Transformer model. We use the dev set to pick the hyperparameters of the optimizer. We perform a manual search for the best learning rate over the range [1e-4, 3e-4, 1e-5, 3e-5] for the Adam optimizer (Kingma and Ba, 2015) and, based on dev performance, fix the learning rate to 3e-5 for all experiments for a fair comparison. We use ϵ = 1e-8, β1 = 0.9, β2 = 0.99, a batch size of 16, γ = 0.1 for EWC-Online, and a memory size of 6,000 for ER and knowledge distillation. For all experiments, we run for a maximum of 10 epochs and pick the best model based on dev data. We also fix a seed of 42 for the random initialization of numpy, random, and torch over all bootstrap experiments. For additional experiments using multiple seeds, we fix three seeds. All experiments are run using the same computing infrastructure: PyTorch version 1.7.1, one Tesla P100 GPU with 16280 MiB of memory, and CUDA version 11.2. The runtime and the number of parameters depend on the approach and the mode of training used, as shown in Table 7. With the exception of model expansion and language-specific approaches, all approaches have the same number of parameters, coming from the sum of M-BERT and prediction-head parameters. Lang-Spec Trans has the highest number of parameters, six times more than Naive Seq FT, but requires only two times more runtime, as only one sixth of the language-specific M-BERT is updated at each hop, whereas the rest is used in evaluation mode only. Lang-Spec Ada(F) has the smallest number of parameters (around 24%) and takes about half the runtime of Naive Seq FT (while exhibiting lower forgetting and higher transfer than Naive Seq FT, as shown in Table 8).
Memory replay and knowledge distillation approaches have a higher runtime (slightly more than Lang-Spec Trans), as they store and handle memory and compute the replay or distillation losses interleaved with the main loss, which makes them time-consuming. What impacts the runtime of ER is much more than just iterating over a small sampled memory: its runtime depends less on the size of the memory than on the frequency of interleaving in the fine-tuning schedule. After every k mini-batch steps, we sample a mini-batch from the memory and fine-tune on it, interleaved with the fine-tuning on the main mini-batch. So, the runtime depends on k and not only on the size of the memory. This makes training more time-consuming than if we sampled only after each epoch with the same memory size.
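The seed fixing described earlier (one fixed seed of 42 for numpy, random, and torch) can be sketched as follows; the torch call is commented out so the snippet stays dependency-free:

```python
import random

import numpy as np

def set_seed(seed=42):
    """Fix all relevant RNGs for reproducibility."""
    random.seed(seed)
    np.random.seed(seed)
    # torch.manual_seed(seed)  # added when running the actual models
```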

C.3 Bootstrap Sampling & Statistical Significance
We run all experiments over one fixed seed of 42. We then use bootstrap sampling (Koehn, 2004) to compute the mean and confidence intervals for each of the metrics described in §2.5 for a single approach. For each language permutation, and for each $R_{i,\le j}$, representing some performance metric on language $\ell_i$ after training on $D_{\le j}$, we sample with replacement 600 sentences from the testing data over 600 iterations. By using this number of iterations and sampled sentences, we ensure (and also double-check) that all sentences in the test set are covered in the evaluation, ensuring a uniform evaluation across approaches. Let $x$ be the list of results obtained for each iteration independently. Then, we compute the mean $\bar{x}$, the standard deviation $\mathrm{std}(x)$, and the size of the 95% confidence interval $CI = 1.96 \times \mathrm{std}(x)$. This computes $\bar{x}$ and $CI$ for each language permutation separately. To aggregate across different language permutations, we simply take the average and the standard deviation.
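A sketch of the bootstrap procedure in pure Python (assuming the normal-approximation interval, i.e. CI half-width = 1.96 × std of the bootstrap means; function and argument names are illustrative):

```python
import random
import statistics

def bootstrap_ci(scores, n_iter=600, sample_size=600, z=1.96, seed=42):
    """Return (mean, CI half-width) over bootstrap resamples of `scores`,
    where each resample draws `sample_size` items with replacement."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_iter):
        resample = [rng.choice(scores) for _ in range(sample_size)]
        means.append(sum(resample) / sample_size)
    return statistics.mean(means), z * statistics.stdev(means)
```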
To compute the statistical significance between different approaches, we use ANOVA and perform a multiple pairwise comparisons analysis using Tukey's honestly significant difference (HSD) test 14 over different language permutations for each metric.

D More Results & Analysis using Bootstrap Sampling

D.1 Full Average Results
Table 8 shows the full results and confidence intervals for different continual learning approaches; for each metric and score, we highlight the best score in bold and underline the second best. Compared to intent classification, we observe higher forgetting and slightly higher transfer, but lower zero-shot transfer and final performance, in the case of slot filling. This could be due to the nature of slot filling, which is more challenging to learn. In general, we observe the same forgetting, transfer, zero-shot transfer, and final performance trends between intent classification and slot filling. In other words, if a model a has higher forgetting on intent classification than a model b, then the same applies to slot filling. This could be due to the transfer between intent classification and slot filling, which is maximized when training them jointly. The best model for transfer is Lang-Spec Ada(F), which we hypothesize is due to its lightweight adaptation to the current language, which makes it overfit on that language at the cost of lower average and final performance overall.

D.2 Per Group Layer Analysis
Table 9 shows ablation studies for the analysis of M-BERT components following four different categories: groups of 12 layers with or without embeddings, and groups of 3, 6, and 9 layers at a time, trained in a language-specific manner with the rest shared between languages; best, second best, and third best scores for each metric are in bold, underlined, and italicized, respectively. We notice that training the full Lang-Spec Trans and Lang-Spec Enc[1-12] gives the best results in terms of forgetting and final performance.
Training only the first 9 encoder layers (Lang-Spec Enc[1-9]), excluding embeddings, in a language-specific manner comes next in terms of low forgetting and comparable final performance, with relatively better transfer and zero-shot transfer performance. Other good models reaching a good compromise between transfer, zero-shot transfer, and forgetting with fewer language-specific layers are Lang-Spec Enc[1-3] and Lang-Spec Enc[1-6]. Naive Seq FT is comparable to those model-expansion approaches in terms of zero-shot performance, but has a lower final performance and significantly higher forgetting. We also notice the same trend for language-specific embeddings (Lang-Spec Embed), which reach the second best zero-shot transfer performance, but also exhibit high forgetting. This suggests that language-specific knowledge is less likely to be encoded in the embeddings and more likely in the encoder layers, and shows that there is a real plasticity-stability tradeoff between zero-shot transfer and knowledge preservation (which we explain in more detail in §4.5).

D.3 Full Results on Language Permutations
Full results for all language permutations can be found in Tables 10, 11, and 12. By looking at additional language permutations, L2H (Thai → Spanish → Hindi → French → German → English) is still the most challenging one in terms of knowledge preservation, accumulation, generalization, and model utility. H2L (English → German → French → Hindi → Spanish → Thai) is still the easiest to learn. Order 5 (Hindi → English → Spanish → Thai → French → German) is the second most challenging language permutation to train on. In general, the same trends regarding the more challenging nature of training on certain language permutations are observed for both intent classification and slot filling uniformly. Table 13 includes the results for more language permutations for the balanced data.

D.4 Per Language Analysis
Tables 14, 15, and 16 show the full results for forgetting, transfer, and zero-shot transfer, respectively, across different languages, averaged over different language permutations. We notice that languages like English, German, French, and Spanish have consistently lower forgetting and higher zero-shot transfer than languages like Hindi and Thai, for both intent classification and slot filling, for Naive Seq FT compared to the reference model Inc Joint, for which forgetting is low and nearly equal across languages. Approaches like Lang-Spec Trans, Lang-Spec Enc[1-9], Lang-Spec Ada(F), and to a certain degree ER, also reduce that gap. We also notice that approaches that lower forgetting for a particular language do so uniformly for all languages. The performance in terms of zero-shot transfer is significantly lower in the case of Thai. Figure 6 plots final performance versus negative forgetting, final performance versus transfer, transfer versus negative forgetting, and zero-shot transfer versus negative forgetting for the subtask of slot filling. The same trends observed for intent classification can also be observed for slot filling. Figures 7a and 7b show how the Naive Seq FT intent classification accuracy and slot filling F1 scores, respectively, change for each language separately after different hops of training. We can see that although performance increases as more hops are seen for high-resource Latin-script languages like English, Spanish, and to some degree French, the same cannot be said for the low-resource languages Thai and Hindi, which also suffer from being script isolates.

D.5 More Analysis
To analyze the zero-shot generalization to unseen languages, we analyze the performance of each model across different hops. In other words, we consider the average performance after seeing from 1 to 5 languages, enabled by the balanced datastreams we carefully curated (§2.4). We can check the performance after training on each $x$ language(s) from exactly one datastream. Figures 8a and 8b show a comparison between different approaches across different hops of training using the zero-shot transfer metric for intent classification and slot filling, respectively. In general, we observe that the average zero-shot transfer performance decreases after seeing $n$ languages, where $n \in [1 \ldots 5]$. In this case, after seeing one language, the performance is equivalent to conventional transfer learning involving two hops, whereas the performance after seeing $n \ge 2$ languages is for multi-hop continual learning. We notice that as we increase the number of hops, the transfer capabilities decrease nearly uniformly across most approaches, making the problem more challenging and different from conventional transfer learning. Figures 8c and 8d show the generalization trends for different continual learning approaches compared to the baselines for intent classification and slot filling, respectively. We can see that most continual learning approaches improve in terms of both intent accuracy and slot filling F1 scores over Naive Seq FT, and the gap increases mainly as more languages are seen (except at hop 4). After 5 hops, there is a clear gap between Naive Seq FT and continual learning approaches, most notably Lang-Spec Ada(T) and KD-Logit. Figure 9 shows more results for the multi-hop versus one-hop analysis for more metrics and tasks. In general, we observe the same trend, whereby the multi-hop analysis (dotted boxplots) has smaller confidence intervals than the one-hop analysis (crossed boxplots).
Table 17 shows a comparison between the performance of experience replay variants with different memory sizes, ranging from 750 to 6,000 instances, which accounts for 5% to 60% of the training data for each language. Although we notice that forgetting is the lowest and the final performance is the highest when a memory of 6,000 instances is used, the gap is not that significant as the memory is scaled down. Moreover, differences in transfer are not correlated with the size of the memory. We notice that ER achieves a performance that surpasses Naive Seq FT even when using the smallest memory size. This suggests that even tiny bits of memory are helpful.

Table 10: Per language permutation view: a pairwise comparison between H2L (English → German → French → Hindi → Spanish → Thai) and L2H (Thai → Spanish → Hindi → French → German → English). We highlight the best forgetting (lowest), transfer (highest), zero-shot transfer (highest), and final performance (highest) of accuracy and F1 scores among those two orders for each approach in bold, whereas the best scores across approaches for the two orders separately are underlined.

E More Results using Multiple Seeds
In this section, we show the results using different seeds for key experiments in the main paper. We show in Tables 18 and 19 the average final performance, forgetting, and transfer averaged across different language permutations for the baseline model compared to reference models. We also show in Table 20 the performance on intent classification comparing the baseline and different continual learning algorithms across H2L and L2H. Overall, we notice the same trends and findings observed earlier in Tables 2, 3, and 4.

F Statistical Significance
We show in Figures 10 and 11 the results for different approaches with a p-value lower than 0.05 for confidence intervals of 95%, thus rejecting the null hypothesis that they are drawn from the same distribution. Figures 12 and 13 show the corresponding statistical significance p-value confusion plots using multiple seeds. With a few exceptions, like Lang-Spec FT + Ada(T) and Lang-Spec FT + Ada(F), most pairwise p-values indicating statistical significance between two models using the bootstrap sampling analysis are consistent with the statistical significance computed using multiple seeds.

Table 12: Per language permutation view: a pairwise comparison between Order 5 (Hindi → English → Spanish → Thai → French → German) and Order 6 (German → French → Thai → Spanish → English → Hindi). We highlight the best forgetting (lowest), transfer (highest), zero-shot transfer (highest), and final performance (highest) of accuracy and F1 scores among those two orders for each approach in bold, whereas the best scores across approaches for the two orders separately are underlined.

Table 18: The average final performance across different language permutations for the baseline compared to reference models using multiple seeds. We highlight the best scores in bold and underline the second best across models. We notice the same findings as when using bootstrap sampling, but with tighter confidence intervals, as shown in Table 2.

Table 19: Forgetting (F) and transfer (T) performance averaged across different language permutations for the sequential baseline and reference models using different seeds. We highlight the best models in bold. We notice exactly the same trends as when using bootstrap sampling in our analysis in Table 3.
Table 20: Performance on intent classification comparison between the baseline and continual learning algorithms across two language permutations using multiple seeds. We highlight in bold the lowest forgetting (F), highest transfer (T), and final performance (FP) of accuracy scores among H2L and L2H, whereas the best scores across approaches for H2L and L2H separately are underlined. We notice the same trends and findings as Table 4 where only bootstrap sampling is used to compute the confidence intervals.

ACL 2023 Responsible NLP Checklist
A For every submission: A1. Did you describe the limitations of your work?
We discuss the limitations in the limitations section after the conclusion. There are no strong assumptions, claims, or biases we are aware of that are not stated in the paper. We state only claims that are supported by evidence and by an experimental setup that we design and describe clearly in the main paper and in the supplemental material (appendix and code); more details on the experimental setup can be found in Appendix C.
A2. Did you discuss any potential risks of your work? Not applicable. To the best of our knowledge, there is no potential risk or harm of any kind that could result from this work. Our research is an analysis paper about cross-lingual continual learning in which we share our experiments on pre-existing approaches.
A3. Do the abstract and introduction summarize the paper's main claims?
The main claims and contributions are stated clearly in the abstract, introduction (section 1) especially the contribution paragraph, emphasized in the results and analysis section (section 4) and summarized briefly in the conclusion (section 6). The distinction between the main claims and future work is also clear, especially in section 4 where we discuss our claims based on the current experiments and provide speculations and hypotheses that can be verified in future work.
A4. Have you used AI writing assistants when working on this paper?
Left blank.

B Did you use or create scientific artifacts?
We provide the GitHub repository for the code we create in footnote 1. We provide citations to the dataset and pre-trained models used in Section 2.4 and Section 3.

B1. Did you cite the creators of artifacts you used?
The dataset and pre-trained model used are all cited, and the approaches used, with their algorithms, are also cited in Section 2.4 and Section 3. We write our own code for the experiments. Each time we use a specific tool, we also cite it or include its link in the main text, footnotes, or appendices.
B2. Did you discuss the license or terms for use and / or distribution of any artifacts?
The license for the usage of the dataset is in Section C.1 in the appendix. The dataset was released as open source by Facebook. We only use the dataset; there is no distribution, repackaging, etc.
B3. Did you discuss if your use of existing artifact(s) was consistent with their intended use, provided that it was specified? For the artifacts you create, do you specify intended use and whether that is compatible with the original access conditions (in particular, derivatives of data accessed for research purposes should not be used outside of research contexts)? Not applicable. Left blank.
B4. Did you discuss the steps taken to check whether the data that was collected / used contains any information that names or uniquely identifies individual people or offensive content, and the steps taken to protect / anonymize it? Not applicable. To the best of our knowledge, there is no author information or offensive content in the open-source data that we used (not our own data). This dataset was created using manual translation from English data, where the creators of the data asked crowd workers to generate natural language sentences for task-oriented dialogue with neutral domains about alarm, weather, etc.