Efficiently Upgrading Multilingual Machine Translation Models to Support More Languages

With multilingual machine translation (MMT) models continuing to grow in size and number of supported languages, it is natural to reuse and upgrade existing models to save computation as data becomes available in more languages. However, adding new languages requires updating the vocabulary, which complicates the reuse of embeddings. The question of how to reuse existing models while also making architectural changes to provide capacity for both old and new languages has also not been closely studied. In this work, we introduce three techniques that help speed up the effective learning of new languages and alleviate catastrophic forgetting despite vocabulary and architecture mismatches. Our results show that by (1) carefully initializing the network, (2) applying learning rate scaling, and (3) performing data up-sampling, it is possible to exceed the performance of a same-sized baseline model with 30% of the computation and to recover the performance of a larger model trained from scratch with over a 50% reduction in computation. Furthermore, our analysis reveals that the introduced techniques help learn new directions more effectively while alleviating catastrophic forgetting. We hope our work will guide research into more efficient approaches to growing the language coverage of MMT models and ultimately maximize the reuse of existing models.


Introduction
Research into multilingual machine translation (MMT) (Aharoni et al., 2019; Fan et al., 2020) has shifted from a relatively small number of translation directions (Dong et al., 2015; Firat et al., 2016; Ha et al., 2016) to much larger scale, recently reaching up to tens of thousands of translation directions (NLLB Team et al., 2022; Bapna et al., 2022). Despite the notable increase in the number of supported languages, these models still need to be upgraded as increasing amounts of data in new languages become available. The process of adding new languages to existing models is an instance of continual learning (Ring, 1994), in which the models need to effectively learn new tasks (new translation directions) while not catastrophically forgetting (French, 1993; McCloskey and Cohen, 1989) the knowledge about tasks from the previous training stage (the original translation directions).
Unlike the conventional continual learning setup, where the model remains the same as in the initial learning stage, growing the set of languages in MMT models requires dealing with new parameters. Adding new languages to the existing training data shifts the subword token distribution (e.g., tokens from unseen scripts are added), hence the need to re-train the tokenizer, which adds new embedding parameters to the MMT model. Previous studies have shown the effectiveness of adapting embedding parameters to retain performance on old translation directions (Lakew et al., 2018; Escolano et al., 2019; Garcia et al., 2021). This is usually done by reusing the embeddings of tokens that overlap between the old and new vocabularies.
One aspect that has not been extensively addressed in previous research is how to deal with other architecture mismatches that may arise during the continual learning phase. When growing MMT models to support many additional languages, it also makes sense to increase the model size overall. This extra capacity can be used to learn not only the new directions well but also the old directions better. It is not obvious, however, how to reuse the parameters from existing models (i.e., to engage in continual learning) given such architectural changes. Thus, in addition to dealing with different vocabularies, we also investigate how to make use of previously trained models when scaling up model size in order to train much more efficiently than training from scratch.

Figure 1: Illustration of two architectural mismatches we tackle in this paper. Left: The hidden dimension in the feed-forward layers is doubled (d'_hidden = 2 d_hidden) during the continual learning stage so that the model becomes "wider". Right: Additional layers are inserted at the bottom of the encoder and the top of the decoder so that the model becomes "deeper". Both architectural changes increase capacity and are not well addressed by previous works.
In this work, we introduce a recipe that significantly reduces the amount of computation required for continually learning new languages in MMT models. The recipe involves training the models with a combination of three techniques: (1) careful initialization of the network, (2) learning rate scaling, and (3) up-sampling of selected language pairs. We validate our method in settings both with and without architecture mismatches (e.g., models becoming wider or deeper, as shown in Figure 1, to have extra capacity for both old and new directions). We compare our models with strong same-sized baselines that are trained on all data from scratch. Our experimental results show that, without architecture mismatches, it is possible to outperform the baseline with just 30% of the computation required by the baseline. When training larger models, less than 50% of the original computation is required to match the full performance of the wider baseline model, and less than 10% of the computation is needed to recover over 95% of the corresponding baseline performance. We further conduct a suite of analyses which show that:
• Proper initialization of the parameters before continual learning is crucial for fast convergence.
• Data up-sampling is vital to achieving good performance on new language pairs.
• Scaling down the learning rate for reused parameters helps alleviate catastrophic forgetting.
It is our hope that this work will help save computation for future research into large-scale multilingual machine translation, guide more efficient reuse of existing models for continual learning, and allow people to efficiently adapt large publicly-released MMT models for new languages and datasets.

Method
Adding new languages to existing models, especially languages in new scripts, leads to different subword tokenization and thus different vocabularies, which precludes simple fine-tuning of the exact same model on additional data. During the continual learning stage, we may also want to increase the model size overall to have extra capacity to learn the new languages and improve the old languages at the same time. Therefore, we also investigate two typical architectural changes commonly made to increase network capacity: (1) making the model "wider" by expanding the hidden dimension of the feed-forward layers, and (2) making the model "deeper" by inserting new layers into both the encoder and the decoder. In this section, we delineate the three techniques that we found most effective in reducing computation for the continual learning of MMT models.
Table 1: While continually learning new languages by bootstrapping from a model trained on 20 languages (M20), given new embedding parameters (vocabulary mismatch), one can exceed the performance of a baseline model trained on all languages from scratch (M25 @100k) with just 30% of the computation (Mt25 @30k). With architectural mismatches, employing our method, the wider-model baseline performance (M25 wide) and over 98% of the performance of a deeper-model baseline (M25 @100k deep) can be recovered with half of the corresponding baseline computation (Mt25 @50k wide and Mt25 @50k deep respectively). The "Orig." row shows performance on the old 20 languages and the "Added" row on the newly added 5 languages.
All:   76  74  74  75  75  75  75
Orig.: 76  76  76  76  76  76  76
Added:  -  67  69  68  70  68  69

Proper initialization. While we can copy weights from the old model 1 to the new model, it is not immediately clear how the new parameters (e.g., new token embeddings, new feed-forward weights, new layers) should be initialized and co-adapted with the old weights such that maximal knowledge about the old directions is retained. Instead of initializing the new parameters randomly, we find that initializing the new embeddings with that of <unk> leads to the best performance. When the network becomes "wider" in the continual learning phase, concatenating each old weight matrix with a noisy version of itself performs better than other methods we tried. When the network becomes "deeper", initializing new layers with averaged weights of old layers results in a slight improvement over other naive initialization methods. For each setup, we compare different initialization methods in section 4.2.
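To make the embedding reuse concrete, the sketch below (PyTorch-style Python with hypothetical dict-based vocabularies) copies the trained vectors of overlapping tokens and initializes unseen tokens with the <unk> embedding; it is an illustrative sketch rather than the exact code used in our experiments.

```python
import torch

def init_new_embeddings(old_emb, old_vocab, new_vocab, unk_token="<unk>"):
    """Build the embedding matrix for the new (larger) vocabulary.

    Tokens shared between the old and new vocabularies keep their trained
    vectors; tokens the old model has never seen are initialized with the
    old <unk> embedding instead of random noise.
    old_vocab / new_vocab are assumed to map token string -> row index.
    """
    d_model = old_emb.size(1)
    new_emb = torch.empty(len(new_vocab), d_model)
    unk_vec = old_emb[old_vocab[unk_token]]
    for token, new_idx in new_vocab.items():
        if token in old_vocab:
            new_emb[new_idx] = old_emb[old_vocab[token]]  # reuse trained vector
        else:
            new_emb[new_idx] = unk_vec.clone()            # initialize with <unk>
    return new_emb
```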
Data up-sampling. Similar to Garcia et al. (2021), we introduce the new tasks by mixing the old and new training data together. Since the main goal of the continual learning phase is to quickly learn the new directions, we up-sample the new pairs so that the model gets more learning signal from these directions. To increase transfer across related language pairs and reduce the chance of catastrophically forgetting less represented directions in the original training data, we also up-sample a small number of the old low-resource languages that are from the same language family as the new languages. We present the effect of up-sampling these selected directions in section 4.3.
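As an illustration of how up-sampling shifts the sampling distribution over directions, the sketch below (Python; direction names and sizes are hypothetical) multiplies the sizes of the selected pairs by a factor α before computing temperature-based sampling probabilities. The exact sampler in our training pipeline may differ in detail.

```python
def sampling_probs(sizes, upsampled=(), alpha=5, temperature=1.0):
    """Compute per-direction sampling probabilities.

    sizes: dict mapping a direction (e.g., "eng-npi") to its number of
    training sentence pairs. Directions in `upsampled` (new language pairs
    plus related old low-resource pairs) have their sizes multiplied by
    `alpha` before the temperature-based probabilities are computed.
    """
    scaled = {d: n * alpha if d in upsampled else n for d, n in sizes.items()}
    weights = {d: n ** (1.0 / temperature) for d, n in scaled.items()}
    total = sum(weights.values())
    return {d: w / total for d, w in weights.items()}

# Example: the new pairs receive a much larger share after up-sampling.
probs = sampling_probs(
    {"eng-deu": 4_000_000, "eng-npi": 100_000, "eng-zul": 150_000},
    upsampled={"eng-npi", "eng-zul"},
    alpha=5,
)
```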
Learning rate scaling. To further mitigate the catastrophic forgetting problem frequently exhibited in continual learning (Ring, 1994; Thompson et al., 2019), we scale the learning rate for individual parameters depending on whether or not they are copied from the old model. Based on the assumption that the model better retains the knowledge about old tasks if the weights stay close to those of the old model (Kirkpatrick et al., 2017), we scale down the learning rate for these old parameters while maintaining or scaling up the learning rate for the newly added parameters. In contrast to other methods that incur extra computation, such as a Fisher-information-based loss (Thompson et al., 2019), our approach is simple, straightforward, and efficient in alleviating catastrophic forgetting. We present a deeper analysis of learning rate scaling in section 4.4 and section 5.2.
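When the old and new parameters live in separate tensors (e.g., new token embeddings or newly inserted layers), the scaling can be implemented with optimizer parameter groups, as in the PyTorch-style sketch below (parameter names, the Adam optimizer, and the scaling factors are illustrative). When old and new weights share a single tensor, as in the widened feed-forward matrices, the same effect can be obtained by rescaling the corresponding gradient slices before the optimizer step.

```python
import torch

def build_optimizer(model, old_param_names, base_lr=3e-3,
                    gamma_old=0.5, gamma_new=1.0):
    """Assign a scaled learning rate to parameters copied from the old model.

    old_param_names: set of parameter names whose weights were copied from
    the seed model; every other parameter is treated as newly added.
    """
    old_params, new_params = [], []
    for name, param in model.named_parameters():
        (old_params if name in old_param_names else new_params).append(param)
    return torch.optim.Adam([
        {"params": old_params, "lr": base_lr * gamma_old},  # reused weights: smaller steps
        {"params": new_params, "lr": base_lr * gamma_new},  # new weights: full-size steps
    ])
```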

Experiments
Languages. We conduct our experiments on 25 languages covering ten language families, four resource levels (high, mid, low, and very low), and four scripts 2 . To grow the set of languages atop the already-learned ones, we train the seed model on 20 languages (40 English-centric directions) and add 5 new languages (10 English-centric directions) during the continual learning stage. Mimicking the common scenario where newly added languages are mostly low-resource, we select five low- and very-low-resource languages covering four language families and four scripts as the new languages to add to the seed model. To verify the validity of our approach, we also experiment with two other 20/5 groupings and one 12/13 division of old/new languages in section 5.1 3 . Table 2 shows the groups of original and added languages used in most of our experiments. For all our models, we train sentencepiece (Kudo, 2018) 4 tokenizers with 64K tokens and a sampling temperature of 2 on the joined (source and target) data from all language directions available in each setup.

Table 2: The M20 model is trained on the "Original" languages. Mt25, Mt25 wide and Mt25 deep bootstrap from M20 and train on the combination of "Original" and "Added" languages. We perform data up-sampling over the added data in conjunction with related old low-resource languages (marked with *).
Languages ("Original" and "Added"): lav, rus, fin, spa, xho, lit, hin, est, deu, guj, swh, mar, pol, zul*, npi, ukr, mkd, ces, msa*, ind, bel, bul, fra, kir*, kaz
This means that the tokenizers from the initial training on the old languages were not exposed to the unseen languages. We use a batch size of ∼444K tokens and a peak learning rate of 0.003, warmed up over 8000 steps. For the bootstrapped models, we train Mt25 for 30k updates and Mt25 wide and Mt25 deep for 50k updates, since there are more added parameters. We use a sampling temperature of 1 and prepend a language token at the beginning of each encoder and decoder input. All models are trained with attention dropout 0.1 and label smoothing with ϵ = 0.1 (Szegedy et al., 2016). The baseline models M25, M25 wide and M25 deep complete 100K updates of training on 64 GPUs in ∼24h, ∼29.9h, and ∼29.0h respectively. We do not reset the learning rate scheduler for the second training phase. 7 The data of the selected languages are up-sampled by a factor of 5, and the learning rate of the old parameters is scaled by 0.5 in Mt25 wide, and in Mt25 deep by 0.05 at the beginning, increasing linearly to 0.5. We present the effect of these hyperparameters in the next section.
Main results. Results in Table 1 indicate that by properly applying the techniques introduced earlier, it is possible to recover the baseline performance with significantly less computation. When the architecture remains the same during the second learning phase, better overall performance can be achieved (31.8 vs. 31.6) after only 30% of the updates required by the baseline model M25. The gains come from effective learning of both the old and new directions, while the latter appear to be learned better than in the baseline model (25.2 vs. 24.6). When training into a wider model, applying our techniques recovers the performance with approximately half of the M25 wide computation. Although we did not fully recover the performance when training into a deeper model, over 98% of the baseline performance can be achieved using our techniques with half of the baseline M25 deep computation. We find that the major degradation in Mt25 deep comes from worse performance on the original directions (33.9 vs. 34.5), which suggests that mitigating catastrophic forgetting is harder when expanding the network by depth than by width.

Effect of each technique

4.1 Ablation study
To understand the effect of each introduced technique, we conduct an ablation study where each model is trained with the same configuration except for one essential element (i.e., proper initialization, learning rate scaling, or data up-sampling). As a naive baseline, we include models trained with scaled learning rates and data up-sampling but without any weights from the seed M20 model ("random init all"), and compare them with a less naive baseline, "random init new", where the M20 weights are reused and only the newly added parameters (i.e., new token embeddings in all three models, new fully-connected layer weights in Mt25 wide, and weights of new layers in Mt25 deep) are randomly initialized. To summarize, each configuration in Table 3 corresponds to the following:
Random init all: All parameters are initialized randomly, while the model is trained with data up-sampling and learning rate scaling.
Random init new: Newly added parameters are initialized randomly, while the weights of M20 are copied to the new model. The model is trained with data up-sampling and learning rate scaling.
No up-sampling: Model weights are properly initialized and their learning rates are scaled during training. No language pair is up-sampled.
No lr scaling: Model weights are properly initialized and low-resource pairs are up-sampled, whereas no learning rate scaling is applied.

Results in Table 3 confirm the contribution of each of the three introduced techniques to achieving the desired performance across different settings. Overall, not reusing the M20 weights leads to performance 2∼6 BLEU worse than the baseline in different settings. When reusing the old model's weights, proper initialization of the new parameters also yields better performance than simply initializing with the default normal distribution. The benefit is most obvious when training into a wider model (29.9 vs. 32.6) compared to the other two settings. We also observe that data up-sampling is crucial to achieving good performance on the new directions. Not applying up-sampling degrades performance by around 3 BLEU on the new directions across all settings, while barely or only slightly hurting performance on the old directions. On the other hand, not applying learning rate scaling harms the performance on old directions across all settings, which suggests the effectiveness of scaling the learning rate to mitigate catastrophic forgetting, about which we include a more detailed analysis in section 5.2. 8 Since learning rate scaling helps counteract catastrophic forgetting and up-sampling speeds up learning of the new directions, we find that their effects are additive: better performance can be achieved on both old and new directions by combining these two techniques.

Effect of proper initialization
In this section we briefly discuss different variations we attempted for initializing new parameters and present the results in Table 4.

Mt25
In the case of having only mismatched vocabularies, we find that dropping the entire old embedding table ("all emb random") causes a large drop in performance, and that initializing the new embeddings with that of <unk> leads to slightly better performance than random initialization.

Mt25 wide A naive method to initialize the wider feed-forward projections is to expand the old weight matrix with a randomly initialized weight matrix. However, random additional parameters disturb the output of the feed-forward layer, which interferes with the inter-dependency among layers. We therefore also tried concatenating the old weight matrix with a noisy version of itself (concat 10 ), followed by a normalization operation 11 to keep the output as close to the original projection output as possible. In addition, rather than keeping the old weights in one block, we also tried linearly interpolating the original weight matrices along the expanded dimension (linear interp). It is important that the new parameters in the two projection matrices of a feed-forward layer match each other along the hidden-dimension axis. Results indicate that weight concatenation is a simple and effective approach that allows for faster convergence.
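The sketch below (PyTorch-style; tensor shapes and the normalization choice are illustrative) shows one way to widen a feed-forward block by concatenating each projection with a noisy copy of itself along the hidden dimension and rescaling toward the original Frobenius norm. It reflects one reading of footnotes 10 and 11 rather than the exact code used in our experiments.

```python
import torch

def widen_ffn(W1, W2, noise_std=0.01):
    """Widen a feed-forward block from d_hidden to 2 * d_hidden.

    W1: (d_hidden, d_model) input projection; W2: (d_model, d_hidden) output
    projection. Each old matrix is concatenated with a noisy copy of itself
    along the hidden dimension, so the new halves of the two projections stay
    aligned, and then rescaled so the Frobenius norm stays close to that of
    the original matrix.
    """
    def noisy(W):
        return W + noise_std * torch.randn_like(W)  # zero-mean Gaussian noise

    W1_wide = torch.cat([W1, noisy(W1)], dim=0)     # new rows = new hidden units
    W2_wide = torch.cat([W2, noisy(W2)], dim=1)     # new cols = new hidden units
    W1_wide = W1_wide * (W1.norm() / W1_wide.norm())
    W2_wide = W2_wide * (W2.norm() / W2_wide.norm())
    return W1_wide, W2_wide
```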
10 We inject zero-mean Gaussian noise with std = 0.01. We also tried not adding noise to the new parameters, which has almost identical performance. To avoid both parts of the weight matrices learning redundant information, we decided to add a small perturbation to the new parameters.
11 We normalize the new weight matrix such that it has a similar Frobenius norm to the old weight matrix.

Figure 2: At the beginning of the continual learning stage (@0k), the wider hidden projection matrix is initialized with a concatenation of the M20 weights M and a noisy version of itself. During the continual learning stage, the learning rate for weights copied from the old model is scaled by γ(old); the rest are scaled by γ(new).

Mt25 deep Our preliminary experiments show that inserting new layers at the bottom of the encoder and at the top of the decoder is more effective than the other placements we attempted, so we use this setup in all our Mt25 deep experiments. We tried different methods of initializing the inserted layers, including random initialization, copying parameters from the closest layer, 12 and averaging the weights across all encoder or decoder layers. Although the benefit is small, we do observe that initializing the new layers with averaged layer weights results in better performance, especially on the new translation directions (26.7 vs. 26.3).
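For the deeper model, a minimal sketch of the averaged-weight initialization is given below (PyTorch-style; the commented insertion line assumes a fairseq-like module layout and is purely illustrative).

```python
import copy
import torch

def make_averaged_layer(layers):
    """Create a new layer whose weights are the element-wise average of the
    given (architecturally identical) encoder or decoder layers."""
    new_layer = copy.deepcopy(layers[0])
    avg_state = {
        name: torch.stack([layer.state_dict()[name] for layer in layers]).mean(dim=0)
        for name in layers[0].state_dict()
    }
    new_layer.load_state_dict(avg_state)
    return new_layer

# Hypothetical usage: insert the averaged layer at the bottom of the encoder.
# model.encoder.layers.insert(0, make_averaged_layer(model.encoder.layers))
```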

Effect of data up-sampling
We multiply the original dataset size of each up-sampled direction by a factor α before computing the final up-sampling ratio, so the new directions constitute a larger portion of the combined dataset than they otherwise would. We show the effect of the up-sampling factor α in Table 5. We find that up-sampling the selected languages leads to much better performance (+3 BLEU) on the new directions than no up-sampling. However, up-sampling the new directions too aggressively (e.g., α = 10) hurts performance on the old directions without improving the new ones. In general, we find the up-sampling factor α = 5 adopted in previous studies (Garcia et al., 2021; Berard, 2021) suitable for the many other variants we attempted in this work. Besides up-sampling the new directions, we also up-sample the old low-resource directions that belong to the same language family as any of the new directions. Doing so slightly improves performance on the old directions while causing a drop on the new directions, which further confirms the critical role of up-sampling in balancing performance between the old and new directions.

Table 6: Performance of Mt25 wide after 30K updates while applying different learning rate scaling factors, based on the notation in Figure 2. These experiments adopt an earlier setup where the old related low-resource languages are not up-sampled. * indicates that the p-value of a paired t-test against the baseline (top row) is smaller than 0.05.
                              All     Orig.   Added
γ(old) = 1,   γ(new) = 1      32.0    33.5    26.1
γ(old) = 0.5, γ(new) = 5      32.3*   33.8*   26.2
γ(old) = 0.5, γ(new) = 0.5    31.5*   33.3    24.7*
γ(old) = 0.5, γ(new) = 1      32.1*   33.6*   26.0
γ(old) = 5,   γ(new) = 5      31.9    33.4    26.1

Effect of learning rate scaling
As described in section 2, we scale down the learning rate for the old (reused) parameters. In the case of Mt25, all parameters are updated with a smaller learning rate than the new token embeddings. In the wider network, since we have established the effectiveness of using concatenated weights, we apply a different learning rate to each part of the weight matrices, as generically illustrated in Figure 2. In Mt25 deep, the new layers are updated with a larger learning rate while the rest of the parameters receive a smaller one. Table 6 suggests that having a smaller learning rate for the old parameters is more favorable than scaling all parameters by the same ratio. Scaling down the updates for all parameters slows down the learning of the new directions, whereas scaling everything up by a larger value does not improve performance either. When already applying smaller updates to the old parameters, scaling up the learning rate for the new parameters can in fact improve both the old and new directions (the top two rows in Table 6). Overall, we find that learning rate scaling is an effective and easy-to-implement alternative to previous methods (Kirkpatrick et al., 2017) for alleviating catastrophic forgetting.

Table 8: Zoomed-in analysis over specific language pairs. Our approach is more effective in learning eng→xxx directions than xxx→eng directions across different seed-language setups, regardless of whether the language is up-sampled (Up=Y) or not (Up=N).

Analysis
5.1 How does the choice of seed languages affect continual learning?
In addition to the core setup described in Table 2, we also experiment with other settings. We consider three new setups following the natural scenario where the seed languages are mostly high- and mid-resource languages. Two of the new setups still adopt the 20/5 division between old and new languages but emphasize different scenarios: Mt25_v1 covers one mid-resource language in the set of new languages, and Mt25_v2 does not add any new script. The last setup instead initializes from a model trained on 12 high- and mid-resource languages and adds 13 new languages. 13 We follow the same configuration used in the previous experiments and train into wider models in the continual learning phase. Results in Table 7 show that the overall conclusions hold when models are trained on different choices of seed languages, even when the number of seed languages differs. However, we do observe small differences after zooming in by resource level. A more fine-grained analysis over a few low-resource languages that cover diverse scripts (bel-Cyrl, guj-Gujr, npi-Deva, xho-Latn) is presented in Table 8.
In general, regardless of whether the language pair is in the seed language set or not, the eng→xxx directions are learned faster than the xxx→eng directions, since most of the pairs already exceed the baseline performance after just 30K updates (compare the upper part of Table 8 to the lower part). We also verify the effectiveness of data up-sampling for eng→xxx directions. Up-sampling these pairs leads to much better performance than not up-sampling them (e.g., compare the performance of eng-bel in the three setups).

5.2 How effective is learning rate scaling in mitigating catastrophic forgetting?
To quantitatively measure the amount of information lost after the continual learning phase, we adopt an evaluation setup akin to that of Garcia et al. (2021), in which the embeddings in M20 that overlap with Mt25 are substituted with the corresponding embeddings from Mt25. One can then evaluate this M20 model with substituted embeddings on the original 20 languages and use the drop in spBLEU as a proxy for the amount of knowledge lost in the embeddings due to catastrophic forgetting. In Figure 3, we display the spBLEU drop after substituting the embeddings of our continually trained models back into M20. For both variants, the spBLEU scores drop, indicating that some information in the embeddings is lost after training on the new languages. We also find that, for both models, the decrease in spBLEU is larger for (very) low-resource languages than for high- and mid-resource languages, which suggests that (very) low-resource languages are more easily "forgotten" in the second training phase and reinforces our decision to also up-sample the related low-resource languages as introduced in section 4.3. Finally, not applying learning rate scaling leads to a much larger decrease in all directions, which demonstrates the effectiveness of scaling down the learning rate for alleviating catastrophic forgetting. 14

Figure 4: Training M25 from scratch without using the M20 weights converges significantly slower. Applying the techniques introduced in this work results in the largest computation reduction, achieving over 95% of the baseline performance with less than 10% of the baseline computation.
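For reference, a sketch of this embedding substitution is shown below (Python; the attribute path to the embedding table is hypothetical and depends on the model implementation).

```python
import copy

def substitute_overlapping_embeddings(m20, mt25, old_vocab, new_vocab):
    """Return a copy of M20 whose embeddings for tokens shared with the new
    vocabulary are replaced by the corresponding Mt25 embeddings. Evaluating
    this hybrid model on the original directions and measuring the spBLEU
    drop gives a proxy for forgetting in the embeddings."""
    hybrid = copy.deepcopy(m20)
    old_emb = hybrid.encoder.embed_tokens.weight.data  # hypothetical attribute path
    new_emb = mt25.encoder.embed_tokens.weight.data
    for token, old_idx in old_vocab.items():
        if token in new_vocab:
            old_emb[old_idx] = new_emb[new_vocab[token]]
    return hybrid
```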

Computation saved
Due to the mismatched vocabularies and architectures, models covering the combination of new and old languages are typically re-trained from scratch, even though a large amount of computation has already been spent on the old languages. In this section, we look at how much computation can be effectively saved with the approach proposed in this paper. Note that, for both our method and the retraining approach, M20 is already trained, and its computation cost is excluded from our calculation.
Although the results in Table 1 show that 50% of the baseline computation is required to recover the full baseline performance, Figure 4 shows that much computation can be saved if we slightly relax the target performance. The M25 wide model trained from scratch reaches only 78% of the baseline performance after 10k updates, while models trained with the combination of our techniques achieve over 97% of the baseline performance after the same number of updates (10% of the computation of the M25 wide baseline).

Related Work
Our work is closely related to prior research on adapting existing MMT models (Mohammadshahi et al., 2022) to new languages (Lakew et al., 2019; Kocmi, 2020). Neubig and Hu (2018) add low-resource languages to multilingual models by fine-tuning on low-resource data while regularizing with related high-resource data. Garcia et al. (2021) introduce a simple vocabulary substitution for adapting MMT models to new languages without any architectural changes. Another line of research employs modular approaches, which include training lightweight adapters (Bapna and Firat, 2019), language-specific encoder-decoders (Escolano et al., 2019, 2021), and language-specific embeddings (Berard, 2021) for learning new languages. While sometimes escaping the need to train on old examples, growing the model in a modular fashion (Rusu et al., 2016) requires non-trivial changes to standard architectures. In contrast, our work relies on a rehearsal mechanism (Robins, 1995), i.e., it also trains on old examples, but does not need to modify the network structure.
Our approach is also related to work on the continual learning of MT models for adapting to multiple domains. Thompson et al. (2019) and Gu and Feng (2020) adopt a method derived from Elastic Weight Consolidation (Kirkpatrick et al., 2017) to alleviate catastrophic forgetting. While most prior works only investigate two-stage continual learning, Cao et al. (2021) propose a new framework that extends to multi-stage training to mitigate catastrophic forgetting (Ring, 1994). The initialization of the new parameters and embeddings in our technique is also related to that of Pfeiffer et al. (2021), who accommodate multilingual models to unseen scripts via matrix factorization. Our focus on the architectural differences between the initial and continual learning phases is also relevant to recent findings that wider networks forget less catastrophically (Mirzadeh et al., 2022).

Conclusion
We show in this work that it is possible to efficiently bootstrap from existing models and recover the baseline performance with much less computation, even when vocabularies and architectures differ in the continual learning stage. We highlight the importance of (1) reusing the existing model weights and carefully initializing the new parameters, (2) applying learning rate scaling, and (3) performing data up-sampling. Our analyses reveal that scaling down the learning rate for old parameters helps alleviate catastrophic forgetting, and that data up-sampling is vital to achieving good performance on the new directions. We hope our work can help save computation for research into large-scale multilingual MT models and, more generally, will help spur research into continual multitask learning in the presence of architectural changes.

Limitations
While we explore the under-studied architectural mismatches for continual learning of MMT models, we focus exclusively on adding new languages in bulk, without investigating adding languages one by one continuously. Furthermore, due to limited computational resources, we only experimented with a few typical scenarios where the new languages are low- or very-low-resource. Experiments on other groupings of old and new languages could further validate the effectiveness of our approach.

Figure 5: Scaling down the learning rate for parameters copied from the old model keeps them close to their initialization while pushing the newly added parameters farther away.

A Language details
We present two extra language grouping settings that we experimented with in this section. To verify the validity of our approach on other language groupings, we experiment with two additional settings, shown in Table 10. In addition to dividing the old/new languages by 20/5, we also tried a setting where the seed model is trained on fewer languages, as reported in Table 11.
B Detailed performance of each direction

Table 12 contains the detailed performance of all translation directions for all models reported in Table 1.

C Norm Analysis
The analysis in section 5.2 only measures the information lost in the embeddings. To understand how learning rate scaling affects weights other than the embeddings, one natural extension is to substitute other weights back into M20 as well. However, this leads to much worse performance for both variants, as the inter-dependency among layers is impaired in this case. Therefore, we instead measure how much the weights of each encoder or decoder layer (denoted L0, L1, ..., L11) change in the latent space via the Frobenius norm. Following the notation in Figure 2, we measure how much the weight matrices M1 and M2 have changed from the original weight matrix M (∥M1 − M∥ and ∥M2 − M∥), as well as how much they differ from each other (∥M1 − M2∥). We refer to the Frobenius norm in an encoder or a decoder layer with ∥·∥e and ∥·∥d respectively.
The trend in Figure 5 shows that applying learning rate scaling prevents M1 from deviating too much from the original weights M and at the same time pushes the new parameters farther away from their initialization. This is in contrast to the smaller differences observed when not applying learning rate scaling. The left side of Figure 5 indicates that even after continual learning, M1 and M2 stay close to each other across the encoder layers; only in the last several decoder layers do the two matrices show larger differences. Since we initialize both M1 and M2 based on M, having a larger ∥M1 − M2∥ reduces the chance of both parts learning redundant information, i.e., the additional parameters are used effectively.
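The sketch below (PyTorch-style, illustrative shapes) shows how these quantities can be computed for a single widened projection.

```python
import torch

def widen_drift(M, M_wide):
    """Given the original projection M (d_hidden x d_model) and its widened
    counterpart M_wide (2*d_hidden x d_model) after continual learning, report
    how far each half has moved from M and how far the halves are from each
    other, all measured with the Frobenius norm."""
    d_hidden = M.size(0)
    M1, M2 = M_wide[:d_hidden], M_wide[d_hidden:]
    return {
        "||M1 - M||": torch.norm(M1 - M).item(),
        "||M2 - M||": torch.norm(M2 - M).item(),
        "||M1 - M2||": torch.norm(M1 - M2).item(),
    }
```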

D Scaling learning rate by Fisher information
Besides multiplying the learning rate of all old parameters by the same scaling factor, we also tried scaling the learning rate based on their Fisher information. This is directly inspired by Elastic Weight Consolidation (Kirkpatrick et al., 2017), in which an extra penalty is incurred when parameters crucial to the old tasks deviate too much from their original values. We plot the distribution of the per-token Fisher information of each parameter in Figure 6 (right). We further experiment with learning rate scaling over selected parameters that are supposed to be important for the old tasks (those whose Fisher information exceeds a certain threshold based on Figure 6 (right)). Results in Table 11 (right) show that, among our attempted settings, scaling only part of the parameters based on Fisher information does not improve the overall performance. We conjecture that the performance could be improved if the Fisher information were calculated on a larger set, or if a piece-wise threshold function were applied for scaling the learning rate of different parameters.
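For reference, the sketch below shows a standard diagonal Fisher estimate obtained from squared gradients over a sample of old-task batches (illustrative Python; the per-token variant used in this appendix differs in how the statistics are normalized).

```python
import torch

def diagonal_fisher(model, batches, loss_fn):
    """Estimate the diagonal Fisher information of each parameter by averaging
    squared gradients of the loss over a sample of old-task batches. The
    resulting per-parameter scores can then be thresholded to decide which old
    parameters receive a scaled-down learning rate."""
    fisher = {name: torch.zeros_like(p) for name, p in model.named_parameters()}
    for batch in batches:
        model.zero_grad()
        loss_fn(model, batch).backward()
        for name, p in model.named_parameters():
            if p.grad is not None:
                fisher[name] += p.grad.detach() ** 2
    return {name: f / len(batches) for name, f in fisher.items()}
```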

E Effect of LR scaling in alleviating catastrophic forgetting
We include the spBLEU drop on the old languages after substituting the embeddings of Mt25 back into M20 in Figure 6.

Table 12: Detailed performance of each translation direction for all models shown in Table 1.