Preserving Cross-Linguality of Pre-trained Models via Continual Learning

Recently, fine-tuning pre-trained language models (e.g., multilingual BERT) on downstream cross-lingual tasks has shown promising results. However, the fine-tuning process inevitably changes the parameters of the pre-trained model and weakens its cross-lingual ability, which leads to sub-optimal performance. To alleviate this problem, we leverage continual learning to preserve the original cross-lingual ability of the pre-trained model when we fine-tune it on downstream tasks. Experimental results show that our fine-tuning methods better preserve the cross-lingual ability of the pre-trained model on a sentence retrieval task. Our methods also achieve better performance than other fine-tuning baselines on zero-shot cross-lingual part-of-speech tagging and named entity recognition tasks.


Introduction
Recently, multilingual language models (Devlin et al., 2019; Conneau and Lample, 2019), pre-trained on extensive monolingual or bilingual resources across numerous languages, have been shown to enjoy surprising cross-lingual adaptation abilities, and fine-tuning them on downstream cross-lingual tasks has achieved promising results (Pires et al., 2019; Wu and Dredze, 2019). Taking this further, better pre-trained language models have been proposed to improve cross-lingual performance, such as using larger amounts of pre-training data with larger models (Conneau et al., 2019; Liang et al., 2020) and utilizing more tasks in the pre-training stage (Huang et al., 2019).
However, we observe that multilingual BERT (mBERT) (Devlin et al., 2019), a pre-trained language model, forgets the masked language model (MLM) task that it has learned and partially loses its cross-lingual ability (measured by a cross-lingual sentence retrieval (XSR) experiment) after being fine-tuned on a downstream task in English, as shown in Figure 1, which results in sub-optimal cross-lingual performance on target languages.

Figure 1: Masked language model and cross-lingual sentence retrieval results before and after fine-tuning mBERT on the English part-of-speech tagging task.

In this paper, we consider a new direction to improve cross-lingual performance: preserving the cross-lingual ability of pre-trained multilingual models in the fine-tuning stage. Motivated by continual learning (Ring, 1994; Rebuffi et al., 2017; Kirkpatrick et al., 2017; Lopez-Paz and Ranzato, 2017), which aims to learn a new task without forgetting previously learned tasks, we adopt a continual learning framework to constrain the parameter learning of the pre-trained multilingual model when we fine-tune it on downstream tasks in the source language. Specifically, based on the results in Figure 1, we aim to maintain the cross-linguality of pre-trained multilingual models by utilizing the MLM and XSR tasks to constrain parameter learning in the fine-tuning stage.
Experiments show that our methods help pre-trained models better preserve their cross-lingual ability. Additionally, our methods surpass other fine-tuning baselines with the strong multilingual models mBERT and XLMR (Conneau et al., 2019) on zero-shot cross-lingual part-of-speech tagging (POS) and named entity recognition (NER) tasks.

Methodology
In this section, we first describe gradient episodic memory (GEM) (Lopez-Paz and Ranzato, 2017), the continual learning framework that we adopt to constrain the fine-tuning process. Then, we introduce how we fine-tune the pre-trained multilingual model with GEM.

Gradient Episodic Memory (GEM)
We consider a scenario where the model has already learned n − 1 tasks and needs to learn the n-th task. The main feature of GEM is an episodic memory M_k that stores a subset of the observed examples from task k (k ∈ [1, n]). The loss at the memories from the k-th task can be defined as

\ell(f_\theta, \mathcal{M}_k) = \frac{1}{|\mathcal{M}_k|} \sum_{(x_i, y_i) \in \mathcal{M}_k} \ell(f_\theta(x_i, k), y_i),   (1)

where the model f_\theta is parameterized by \theta. In order to maintain the performance of the model on the previous n − 1 tasks while learning the n-th task, GEM utilizes the losses for the previous n − 1 tasks in Eq. (1) as inequality constraints, avoiding their increase but allowing their decrease. Concretely, when observing the training samples (x, y) from the n-th task, GEM solves the following problem:

\text{minimize}_{\theta} \;\; \ell(f_\theta(x, n), y)
\text{subject to} \;\; \ell(f_\theta, \mathcal{M}_k) \leq \ell(f_\theta^{n-1}, \mathcal{M}_k) \;\; \text{for all } k < n,   (2)

where f_\theta^{n-1} is the model before learning task n.
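In practice, GEM enforces Eq. (2) in gradient space: if the current-task gradient g conflicts with a gradient g_mem computed on an episodic memory (negative inner product), g is projected to the closest gradient that no longer conflicts. With a single previous task, the case used later in this paper, the projection has a closed form. A minimal NumPy sketch (the function name is ours, for illustration only):

```python
import numpy as np

def gem_project(g, g_mem):
    """Project the current-task gradient g so the update does not
    increase the memory loss to first order: if <g, g_mem> < 0,
    remove the component of g that conflicts with g_mem.

    With a single constraint, GEM's quadratic program reduces to
    g_tilde = g - (<g, g_mem> / <g_mem, g_mem>) * g_mem.
    """
    dot = float(np.dot(g, g_mem))
    if dot >= 0.0:  # constraint already satisfied; no projection needed
        return g
    return g - dot / float(np.dot(g_mem, g_mem)) * g_mem

# Example: a conflicting gradient has its component along g_mem removed.
g = np.array([1.0, -1.0])
g_mem = np.array([0.0, 1.0])
g_tilde = gem_project(g, g_mem)  # -> [1.0, 0.0]
```

After projection, <g_tilde, g_mem> = 0, so a small step along −g_tilde leaves the memory loss unchanged to first order while still decreasing the current-task loss.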

Fine-tuning with GEM
We consider two tasks (n = 2) in total when applying GEM to the fine-tuning of pre-trained multilingual models, namely, mBERT and XLMR. The first task is either a task the pre-trained models have already learned (MLM) or an ability they already possess (XSR), and the second task is the fine-tuning task. We follow Eq. (2) when we fine-tune the pre-trained models:

\text{minimize}_{\theta} \;\; \ell(f_\theta(x, T_2), y)
\text{subject to} \;\; \ell(f_\theta, \mathcal{M}_{T_1}) \leq \ell(f_\theta^{*}, \mathcal{M}_{T_1}),

where T_1 and T_2 denote the first and second tasks, respectively, and f_\theta^* represents the original pre-trained model. When the MLM task is taken as the first task, we constrain the fine-tuning process of the pre-trained model by preventing it from forgetting its original task, so as to better preserve its original cross-lingual ability. When the XSR task is taken as the first task, on the other hand, we prevent the pre-trained model from losing its cross-lingual ability after fine-tuning. We also consider incorporating both MLM and XSR as the first task.
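The two-task setting can be made concrete with a toy example: the sketch below "fine-tunes" a two-parameter model on a stand-in T2 while constraining each update with GEM's single-constraint gradient projection against a stand-in T1. The quadratic losses and all names here are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def t1_loss_grad(theta):
    # Stand-in for the preserved task (e.g. MLM): only depends on
    # theta[0] and is minimized at the "pre-trained" value theta[0] = 0.
    return 0.5 * theta[0] ** 2, np.array([theta[0], 0.0])

def t2_loss_grad(theta):
    # Stand-in for the downstream fine-tuning task: pulls both
    # parameters toward (2, -1).
    opt = np.array([2.0, -1.0])
    return 0.5 * np.sum((theta - opt) ** 2), theta - opt

theta = np.zeros(2)  # pre-trained parameters minimize T1
lr = 0.1
for _ in range(200):
    _, g = t2_loss_grad(theta)      # fine-tuning (T2) gradient
    _, g_mem = t1_loss_grad(theta)  # gradient on the T1 memory batch
    dot = float(np.dot(g, g_mem))
    if dot < 0.0:                   # update would increase the T1 loss
        g = g - dot / float(np.dot(g_mem, g_mem)) * g_mem
    theta -= lr * g

# T2 is learned where it does not conflict with T1 (theta[1] -> -1),
# while the T1-relevant parameter theta[0] stays near its optimum.
```

Unconstrained descent would drive theta all the way to (2, −1) and ruin T1; the projection lets the conflict-free direction be learned while the T1 loss stays essentially untouched.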

Dataset
For the POS task, we use Universal Dependencies 2.0 (Nivre et al., 2017) and select English (en), French (fr), Spanish (es), Greek (el) and Russian (ru) to evaluate our methods. For the NER task,

Baselines
We compare our methods to several baselines. Naive Fine-tune (Wu and Dredze, 2019) adds one linear layer on top of the pre-trained model and fine-tunes with L2 regularization. Fine-tune with Partial Layers Frozen (Wu and Dredze, 2019) fine-tunes pre-trained multilingual models while freezing the bottom layers. Multi-Task Fine-tune (MTF) fine-tunes pre-trained multilingual models on both the fine-tuning task and additional tasks (MLM and XSR).

Training Details
We conduct the MLM task with two settings. First, we only utilize the English Wikipedia corpus (MLM (en)), since we observe catastrophic forgetting on the English MLM task, as in Figure 1. Second, we utilize the Wikipedia corpora of both the source and target languages (MLM (all)). The first setting is used in our main experiments. Note that we do not use all the languages mBERT was pre-trained on for the MLM task because doing so would make the fine-tuning process very time-consuming. For the XSR task, we leverage sentence pairs between the source and target languages from the Europarl parallel corpus (Koehn, 2005). (More training details are in the appendix.)

Results & Analysis
Does GEM preserve the cross-lingual ability?
From Table 1, we can see that naively fine-tuning mBERT significantly decreases the MLM performance, especially in English. Since mBERT is fine-tuned on the English task, the English subword embeddings are updated, which makes mBERT lose more MLM task information in English. Naive fine-tuning also makes the XSR performance of mBERT drop significantly. We observe that fine-tuning with partial layers frozen somewhat prevents the MLM performance from getting worse, while fine-tuning with GEM based on the MLM task almost fully preserves the original MLM performance of mBERT. Although we only use English data for the MLM task, using GEM based on the MLM task still preserves the task-related parameters that are useful for other languages. Correspondingly, we can see that GEM w/ MLM achieves better XSR performance than Naive Fine-tune w/ frozen layers, which shows that GEM helps better preserve the cross-lingual ability of mBERT. In addition, although GEM w/ XSR aggravates the catastrophic forgetting on the MLM task, it significantly improves the XSR performance thanks to the XSR supervision. Furthermore, incorporating both the MLM and XSR tasks better preserves the performance on both tasks.
Does GEM improve the cross-lingual performance? From Table 2, we can see that our methods consistently surpass the fine-tuning baselines on all target languages in the POS and NER tasks. In terms of average performance, our methods outperform the baselines by around 1% or more (the results for XLMR are included in the appendix). In addition, constraining mBERT fine-tuning on the MLM task shows similar performance to constraining it on the XSR task. We conjecture that the effectiveness of both methods is similar, although they come from different angles. When the information of both tasks is utilized, GEM slightly improves the performance further. We find that the experimental results on XLMR are consistent with those on mBERT.

Table 3: Ablation study on the two settings of using the MLM task based on mBERT.
GEM vs. MTF From Table 1, we notice that, using the MLM task, MTF achieves lower perplexity than GEM since it aggressively trains mBERT on this task. However, we observe that MTF w/ MLM makes the performance on the XSR, POS and NER tasks worse than Naive Fine-tune, and we speculate that MTF pushes mBERT to overfit to the MLM task instead of preserving its cross-lingual ability. Meanwhile, we can see that GEM regularizes the loss of the training on the MLM task to avoid catastrophic forgetting of previously trained languages and conserves the cross-linguality of the pre-trained multilingual models.
In addition, we observe that adding the XSR objective to the training makes the MLM performance worse. Although MTF achieves the best performance on the XSR task since it directly fine-tunes mBERT on that task, we can see from Table 2 that GEM w/ XSR boosts the cross-lingual performance on downstream tasks, while MTF w/ XSR has the opposite effect. We speculate that aggressively fine-tuning mBERT on the XSR task (MTF w/ XSR) merely makes mBERT learn the XSR task, while using GEM to constrain the fine-tuning on the XSR task preserves the cross-lingual ability of mBERT. Incorporating both the MLM and XSR tasks further improves the performance for GEM, while MTF still performs worse than Naive Fine-tune.

From Table 3, we can see that using GEM to constrain fine-tuning on MLM with all languages (GEM w/ MLM (all)) achieves better performance on the MLM task than with only English (GEM w/ MLM (en)), since more MLM supervision signals are provided, while their performances on the POS task are similar. Intuitively, since GEM w/ MLM is able to improve the cross-lingual performance, constraining on more languages should give better performance. We conjecture, however, that the constraint with all languages could be too aggressive, so mBERT might tend to overfit to the monolingual MLM task in all languages instead of preserving its original cross-lingual ability. In addition, we observe that fine-tuning mBERT on the MLM task (MTF) gets worse when more languages are utilized.

Conclusion
In this paper, we propose to preserve the cross-linguality of pre-trained language models in the fine-tuning stage. To do so, we adopt a continual learning framework, GEM, to constrain the parameter learning in pre-trained multilingual models based on the MLM and XSR tasks when we fine-tune them on downstream tasks. Experiments on the MLM and XSR tasks illustrate that our methods better preserve the cross-lingual ability of pre-trained models. Furthermore, our methods achieve better performance than fine-tuning baselines with the strong multilingual models mBERT and XLMR on the zero-shot cross-lingual POS and NER tasks.

A Training Details
We utilize the Wikipedia corpus for the MLM task. Given that using the whole Wikipedia corpus would greatly lower the training speed, we randomly sample 1M sentences per language for the training of MTF w/ MLM and GEM w/ MLM, and we use another 100K sentences per language to evaluate the model's performance on the MLM task. We take the English-Spanish (en-es), English-Italian (en-it), English-French (en-fr), English-Greek (en-el), English-German (en-de), and English-Dutch (en-nl) parallel datasets from the Europarl parallel corpus. We randomly select 90% of each for the training of MTF w/ XSR and GEM w/ XSR, and the remaining 10% is used for evaluating the model's performance on the XSR task. We use accuracy for evaluating the POS task, BIO-based F1-score for evaluating the NER task, perplexity for evaluating the MLM task, and P@k for evaluating the XSR task. Concretely, P@k (k = 1, 5, 10) is the fraction of pairs for which the correct translation of the source-language sentence is among the k nearest neighbors. We use an early-stopping strategy based on the average performance over the target languages to select the model. We use the Adam optimizer with a learning rate of 1e-5. We use a batch size of 16 for all tasks, namely, POS, NER, MLM and XSR. In each iteration, we use GEM to constrain the fine-tuning with a batch of data samples from the MLM and XSR tasks. Our models are trained on a V100 GPU.
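The P@k metric described above can be sketched as follows. This is a hypothetical helper that assumes sentence embeddings have already been extracted (embedding extraction from mBERT/XLMR is model-specific and omitted); retrieval uses cosine similarity:

```python
import numpy as np

def precision_at_k(src, tgt, k):
    """P@k for cross-lingual sentence retrieval.

    src[i] and tgt[i] are embeddings of a translation pair; the score is
    the fraction of source sentences whose true translation tgt[i] lies
    among the k nearest target neighbors under cosine similarity.
    """
    # L2-normalize so the dot product equals cosine similarity
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    sim = src @ tgt.T                        # pairwise similarity matrix
    topk = np.argsort(-sim, axis=1)[:, :k]   # k most similar targets
    hits = (topk == np.arange(len(src))[:, None]).any(axis=1)
    return hits.mean()
```

By construction P@k is non-decreasing in k, which matches reporting P@1, P@5, and P@10 together.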

B Data Statistics
The data statistics of the NER and POS datasets are shown in Table 4 and Table 5, respectively.

C.1 XLMR Experiments
Experiments on the POS and NER tasks for XLMR base are illustrated in Table 6 (on the next page). The results on XLMR are consistent with those on mBERT.

C.2 XSR Experiments
Experiments on more language pairs are illustrated in Table 7.

Table 7: Experiments on the XSR task based on mBERT. Models other than mBERT are fine-tuned on the English POS task. The bold numbers in the XSR task denote the best performance after fine-tuning without using the XSR supervision.