Continual Mixed-Language Pre-Training for Extremely Low-Resource Neural Machine Translation

The data scarcity in low-resource languages has become a bottleneck to building robust neural machine translation systems. Fine-tuning a multilingual pre-trained model (e.g., mBART (Liu et al., 2020)) on the translation task is a good approach for low-resource languages; however, its performance will be greatly limited when there are unseen languages in the translation pairs. In this paper, we present a continual pre-training (CPT) framework on mBART to effectively adapt it to unseen languages. We first construct noisy mixed-language text from the monolingual corpus of the target language in the translation pair to cover both the source and target languages, and then, we continue pre-training mBART to reconstruct the original monolingual text. Results show that our method can consistently improve the fine-tuning performance upon the mBART baseline, as well as other strong baselines, across all tested low-resource translation pairs containing unseen languages. Furthermore, our approach also boosts the performance on translation pairs where both languages are seen in the original mBART's pre-training. The code is available at https://github.com/zliucr/cpt-nmt.


Introduction
Neural machine translation (NMT) (Bahdanau et al., 2015; Luong et al., 2015; Vaswani et al., 2017) generalizes poorly to low-resource languages for which large monolingual and parallel corpora are not available. Recently, leveraging multilingual pre-trained models (Song et al., 2019; Liu et al., 2020a) as the starting checkpoints has been shown to be effective for building low-resource NMT systems. However, the effectiveness of the pre-training is severely limited for low-resource languages that are not in the list of pre-training languages. Given that there are more than 7000 languages around the world (Austin and Sallabank, 2011), it is almost impossible for a multilingual model to include all languages, and it is expensive and time-consuming to pre-train another model from scratch to include the languages we need. To address this issue, we propose to leverage the advantages of an off-the-shelf multilingual pre-trained model and focus on better generalizing it to any low-resource language pair. In this paper, we use mBART (Liu et al., 2020a) as the multilingual pre-trained model, given its effectiveness at building low-resource NMT systems.
To simulate the problem, we suppose that we need an NMT system for a low-resource translation pair, and at least one of the languages in the translation pair is an unseen language for the pre-trained model. To adapt mBART to unseen languages in the NMT task, we propose to conduct continual pre-training (CPT) on it with mixed-language training (MLT). Concretely, we first follow the noise function used in Liu et al. (2020a) to corrupt the monolingual text of the target language in the translation pair. Then, we utilize a bilingual dictionary to generate mixed-language sentences and simultaneously delete some tokens based on the corrupted text. After that, we conduct the CPT on mBART to reconstruct the original monolingual text. After the CPT, we follow Liu et al. (2020a) to directly fine-tune mBART on the parallel data of the translation pair. The purpose of producing mixed-language sentences is to make a rough alignment between the languages in the translation pair. The token deletion increases the difficulty of the reconstruction task and the diversity of the noisy mixed-language text, which forces the model to quickly learn an unseen language.
We consider an extremely low-resource setting where we have very little parallel data (10K samples) for low-resource translation pairs and very little monolingual data (100K paragraphs) for each language in the translation pair. Experimental results show that our proposed pre-training approach consistently outperforms the mBART baseline as well as other pre-training baselines across all tested translation pairs that contain unseen languages. Interestingly, we observe that the continual mixed-language pre-training is even beneficial for a translation pair where both languages are in mBART's pre-training list. Results also show that mBART achieves better zero-shot performance after applying the CPT with MLT, which illustrates that the mixed-language pre-training produces a better alignment. Furthermore, we investigate our method in various low-resource settings where different amounts of parallel and monolingual data are available, and experimental results show that the effectiveness of our approach can be further improved when a larger pre-training corpus is available.
The contributions of this paper are summarized as follows:
• To the best of our knowledge, we are the first to investigate how to effectively adapt a multilingual pre-trained model to unseen languages for the NMT task.
• We show that our proposed method can consistently surpass strong baselines across all the tested translation pairs.
• We conduct in-depth experiments and analyses in terms of different low-resource settings and the effectiveness of the various components of our method.

Methodology
In this section, we first give a brief overview of the mBART model (Liu et al., 2020a), and then we introduce our proposed method that aims to adapt mBART to unseen languages in the translation task.

Model: mBART
The mBART model follows the sequence-to-sequence (Seq2Seq) pre-training scheme of the BART model (Lewis et al., 2020) (i.e., reconstructing the corrupted text) and is pre-trained on large-scale monolingual corpora in 25 languages. Two types of noise are used to produce the corrupted text. The first is to remove text spans and replace them with a mask token, and the second is to permute the order of sentences within each instance.
Thanks to the large-scale pre-training on multiple diverse languages, the mBART model has shown its strength at building low-resource NMT systems by being fine-tuned to the target language pair, and it is also shown to possess a powerful generalization ability to languages that do not appear in the pre-training corpora (Liu et al., 2020a).
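As a minimal sketch of the two noise types described above, the following Python code masks random token spans and permutes sentences. It is not the mBART implementation: the whitespace tokenization, the `<mask>` symbol, the function names, and the default values of `mask_ratio` and `mean_span` are all assumptions made for illustration and are not stated in this paper.

```python
import math
import random

MASK = "<mask>"  # placeholder mask symbol (assumption, not mBART's actual token id)


def sample_poisson(rng, lam):
    """Draw one Poisson(lam) sample using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def permute_sentences(sentences, rng):
    """Return the sentences of one instance in a random order."""
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled


def mask_spans(tokens, rng, mask_ratio=0.35, mean_span=3.5):
    """Replace random token spans with a single MASK token each.

    mask_ratio and mean_span are illustrative defaults (assumptions).
    """
    out = list(tokens)
    budget = int(round(len(out) * mask_ratio))  # number of tokens to remove
    attempts = 0
    max_attempts = 10 * len(out)  # guard against endless retries
    while budget > 0 and attempts < max_attempts:
        attempts += 1
        span = min(max(1, sample_poisson(rng, mean_span)), budget)
        # Only start positions whose span does not touch an existing mask.
        starts = [i for i in range(len(out) - span + 1)
                  if MASK not in out[i:i + span]]
        if not starts:
            continue
        start = rng.choice(starts)
        out[start:start + span] = [MASK]
        budget -= span
    return out


def noise_g(sentences, rng, mask_ratio=0.35, mean_span=3.5):
    """Hypothetical noise function g: permute sentences, then mask spans."""
    permuted = permute_sentences(sentences, rng)
    tokens = [tok for sent in permuted for tok in sent.split()]
    return mask_spans(tokens, rng, mask_ratio, mean_span)
```

Passing an explicit `random.Random` instance keeps the corruption reproducible, which is convenient when the same corrupted corpus has to be regenerated across runs.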

Continual Pre-Training
Despite the powerful adaptation ability that mBART possesses, we argue that its performance on unseen languages is still sub-optimal since it has to learn these languages from scratch. Therefore, we propose to conduct the continual pre-training (CPT) on the mBART model to improve its adaptation ability to unseen languages. The process of this additional pre-training task is illustrated in Figure 1, and the details are described as follows.
Pre-Training We denote $\mathrm{lang}_1 \rightarrow \mathrm{lang}_2$ as the needed translation pair, where $\mathrm{lang}_1$ is the source language and $\mathrm{lang}_2$ is the target language, and at least one of them is an unseen language for the mBART model. The CPT can be considered as maximizing $\mathcal{L}_\theta$:
$$\mathcal{L}_\theta = \sum_{X \in D_2} \log P\bigl(X \mid f(X); \theta\bigr), \tag{1}$$
where $\theta$ is initialized with mBART's parameters, $D_2$ denotes a collection of monolingual documents in $\mathrm{lang}_2$, and $f$ is a function to generate noisy mixed-language text that contains both $\mathrm{lang}_1$ and $\mathrm{lang}_2$.
Noisy Mixed-Language Function ($f$) Given a monolingual instance $X$, we first use the noise function of Liu et al. (2020a) (denoted as $g$, described in §2.1) to corrupt the text, and then we use a dictionary from $\mathrm{lang}_2$ to $\mathrm{lang}_1$ to produce mixed-language sentences (this function is denoted as $h$). Specifically, after applying the noise function $g$, if a non-masked token in $\mathrm{lang}_2$ exists in the dictionary, we replace it with its translation in $\mathrm{lang}_1$ with a set probability. If it is not replaced, there is a 50% chance that we directly delete the token; otherwise, we keep the original token in $\mathrm{lang}_2$. More formally, the function $f$ (in Eq. (1)) can be considered as the composition of two functions:
$$f(X) = h(g(X)). \tag{2}$$
Notice that $\mathrm{lang}_2$ is not always the unseen language (i.e., $\mathrm{lang}_1$ could be the only unseen language). Since the inputs mix tokens in $\mathrm{lang}_1$ and $\mathrm{lang}_2$, the model can always learn the unseen language. We choose to reconstruct $\mathrm{lang}_2$ instead of $\mathrm{lang}_1$ because $\mathrm{lang}_2$ is the target language that the decoder needs to generate in the translation task, and reconstructing $\mathrm{lang}_2$ in the pre-training makes it easier for the model to adapt to the $\mathrm{lang}_1 \rightarrow \mathrm{lang}_2$ translation pair. We leverage the noise function $g$ since it has shown its effectiveness at helping pre-trained models to obtain language understanding ability. The intuition behind producing mixed-language text for inputs is to roughly align $\mathrm{lang}_1$ and $\mathrm{lang}_2$, since the model needs to understand the tokens of $\mathrm{lang}_1$ so as to reconstruct their translations in $\mathrm{lang}_2$.
The purpose of not replacing all tokens in the dictionary with their translations is to increase the variety of the mixed-language text. In addition, given that there will be plenty of frequent words (e.g., stopwords), replacing all of them with the corresponding translations could make the sentences unnatural, and the translations of the frequent words in $\mathrm{lang}_1$ would likely not match the context in $\mathrm{lang}_2$. Finally, adding a probability to delete the original token in function $h$ injects extra noise and further increases the diversity of the generated mixed-language text.
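The replace, delete, or keep decision of function h can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the released implementation: the input is assumed to be a token list already corrupted by g, the dictionary is assumed to map each target-language token to a list of candidate source-language translations, `mix_h` and its parameter names are hypothetical, and deletion is assumed to apply only to dictionary tokens that were not replaced.

```python
import random

MASK = "<mask>"  # placeholder mask symbol produced by the noise function g (assumption)


def mix_h(tokens, dictionary, rng, replace_prob=0.5, delete_prob=0.5):
    """Apply the mixed-language step h to an already-corrupted token list.

    tokens:       output of g, a list of target-language tokens and MASK symbols
    dictionary:   maps a target-language token to a list of source-language
                  translations (assumed data layout)
    replace_prob: probability of replacing a dictionary token (illustrative value)
    delete_prob:  probability of deleting a non-replaced token (50% in the text)
    """
    mixed = []
    for tok in tokens:
        if tok == MASK or tok not in dictionary:
            # Masked positions and out-of-dictionary tokens pass through unchanged.
            mixed.append(tok)
        elif rng.random() < replace_prob:
            # Replace the target-language token with one of its translations.
            mixed.append(rng.choice(dictionary[tok]))
        elif rng.random() < delete_prob:
            # Delete the token to inject extra noise.
            continue
        else:
            # Keep the original target-language token.
            mixed.append(tok)
    return mixed


if __name__ == "__main__":
    toy_dictionary = {"x": ["X1"], "y": ["Y1", "Y2"]}  # made-up placeholder entries
    corrupted = ["x", MASK, "y", "z"]
    print(mix_h(corrupted, toy_dictionary, random.Random(0)))
```

The model input is then `mix_h(g(X), ...)` and the reconstruction target is the original X, which matches the composition f(X) = h(g(X)).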
To produce noisy mixed-language sentences, we collect monolingual corpora for the target languages from Wikipedia, and we utilize the bilingual dictionaries from MUSE (Lample et al., 2018b) for the En-X and X-En pairs. For a dictionary (denoted as X-Y) that is not available in MUSE (English is not in the pair in this case), we first obtain the token list of language X from the X-En dictionary in MUSE, and then construct the X-Y dictionary by using Google Translate to translate the tokens from language X to Y.
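The construction of an X-Y dictionary from an X-En dictionary can be sketched as below. The function name `build_pivot_dictionary` and the `translate_x_to_y` callable are hypothetical: the callable stands in for whatever translation service is used, and no real translation API is called here. Skipping tokens for which the translator returns nothing is also an assumption.

```python
def build_pivot_dictionary(x_en_entries, translate_x_to_y):
    """Build an X-Y dictionary from the X side of an X-En dictionary.

    x_en_entries:     iterable of (x_token, en_token) pairs from the X-En dictionary
    translate_x_to_y: hypothetical callable mapping one X token to one Y token
                      (or None when no translation is available)
    """
    # Collect the unique X tokens while preserving their first-seen order.
    x_tokens = []
    seen = set()
    for x_tok, _en_tok in x_en_entries:
        if x_tok not in seen:
            seen.add(x_tok)
            x_tokens.append(x_tok)

    # Translate each X token into Y; skip tokens without a translation.
    xy_dictionary = {}
    for x_tok in x_tokens:
        y_tok = translate_x_to_y(x_tok)
        if y_tok:
            xy_dictionary[x_tok] = [y_tok]
    return xy_dictionary


if __name__ == "__main__":
    # Toy stand-in translator backed by a lookup table (made-up placeholder tokens).
    toy_table = {"x1": "y1", "x2": "y2"}
    entries = [("x1", "one"), ("x1", "first"), ("x2", "two"), ("x3", "three")]
    print(build_pivot_dictionary(entries, toy_table.get))
```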

Low-Resource Settings
We focus on an extremely low-resource setting, where we assume that only 10K parallel samples are available. Considering that obtaining a large monolingual corpus could be difficult for some low-resource languages, we constrain the number of monolingual paragraphs to be as few as 100K (the size is ∼30MB). To do so, we randomly sample 10K parallel examples and 100K monolingual paragraphs from the available corpora. In addition, we also conduct experiments with different amounts of parallel data (from 10K to 100K) and monolingual data (from 100K to 1M) to investigate the effectiveness of the proposed method at different levels of the low-resource setting. As for the translation pairs En ↔ Gu and En ↔ Kk, we follow the settings in Liu et al. (2020a) and use parallel data with a size of 10K and 91K for En ↔ Gu and En ↔ Kk, respectively.
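The random subsampling described above can be sketched as follows. The function name, the in-memory list inputs, and the fixed seed are assumptions made for illustration; the paper does not specify how the sampling was implemented.

```python
import random


def sample_low_resource(parallel_pairs, mono_paragraphs,
                        n_parallel=10_000, n_mono=100_000, seed=0):
    """Randomly subsample parallel pairs and monolingual paragraphs.

    parallel_pairs:  list of (source_sentence, target_sentence) tuples
    mono_paragraphs: list of monolingual paragraphs (strings)
    The seed is an illustrative choice to make the subsets reproducible.
    """
    rng = random.Random(seed)
    # Sample without replacement; cap at the corpus size for small corpora.
    parallel = rng.sample(parallel_pairs, min(n_parallel, len(parallel_pairs)))
    mono = rng.sample(mono_paragraphs, min(n_mono, len(mono_paragraphs)))
    return parallel, mono


if __name__ == "__main__":
    toy_parallel = [(f"src {i}", f"tgt {i}") for i in range(50)]
    toy_mono = [f"paragraph {i}" for i in range(500)]
    pairs, paragraphs = sample_low_resource(toy_parallel, toy_mono,
                                            n_parallel=10, n_mono=100)
    print(len(pairs), len(paragraphs))
```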

Models & Baselines
mBART We directly fine-tune the mBART model on the parallel data of the translation pair. Note that it is already a strong baseline since mBART is shown to possess a good generalization ability to unseen languages (Liu et al., 2020a).
CPT w/ Ori (Src) We follow the original objective function of mBART (only using the noise function g in §2.2 to corrupt the text) to continue pre-training it on the source language of the translation pair. Then we directly fine-tune it on the translation parallel data.
CPT w/ Ori (Tgt) This baseline is the same as the previous one except that we continue pretraining mBART on the target language of the translation pair.
CPT w/ MLT (Src) Different from CPT w/ Ori, we use the noisy mixed-language function (f ) to create noisy mixed-language text. However, different from what we propose in Eq. (1), it reverses the pre-training direction (i.e., it corrupts the text in the source language instead of the target language).
CPT w/ MLT (Tgt) This is our proposed method described in §2.2. We use Tgt or Src to distinguish the target or source language (in the translation pair), respectively, that mBART needs to reconstruct in the CPT.
mT5 Like mBART, mT5 (Xue et al., 2020) is also a multilingual pre-trained model using a Seq2Seq pre-training. It is pre-trained in 101 languages covering all the languages in our experimental settings.
Note that we use mT5-base (600M parameters), which has a similar size to mBART (610M parameters), to ensure a fair comparison.
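The four CPT variants above differ only in which monolingual corpus is reconstructed and which corruption function is applied. One way to summarize them, in the style of Eq. (1), is given below. Here $D_1$ and $D_2$ denote monolingual documents in the source and target language, respectively, $g$ is the mBART noise function, and $f$ is the noisy mixed-language function. The symbol $f_{1 \to 2}$, standing for the mixed-language function built with a source-to-target dictionary, is shorthand introduced only for this summary.

```latex
\begin{align*}
\text{CPT w/ Ori (Src):} \quad & \max_\theta \sum_{X \in D_1} \log P\bigl(X \mid g(X); \theta\bigr) \\
\text{CPT w/ Ori (Tgt):} \quad & \max_\theta \sum_{X \in D_2} \log P\bigl(X \mid g(X); \theta\bigr) \\
\text{CPT w/ MLT (Src):} \quad & \max_\theta \sum_{X \in D_1} \log P\bigl(X \mid f_{1 \to 2}(X); \theta\bigr) \\
\text{CPT w/ MLT (Tgt):} \quad & \max_\theta \sum_{X \in D_2} \log P\bigl(X \mid f(X); \theta\bigr)
\end{align*}
```

Only the last line corresponds to the proposed method; the other three serve as baselines that isolate the effect of the mixed-language inputs and of the reconstruction direction.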

Training Details
Given that the sizes of the pre-training data and the parallel data are relatively small, we freeze the first 8 layers (out of 12) of the encoder and the first 8 layers (out of 12) of the decoder in both the CPT and the fine-tuning processes (applied to both mBART and mT5) to avoid over-fitting. Note that we keep the embedding layer unfrozen since the model needs to learn the embeddings for unseen languages. For CPT, we control the probability of replacing a token with its translation so that around 30% of tokens are replaced. In the CPT stage, we train with a dropout rate of 0.1, a batch size of 100, and a learning rate of 3e-5 for 5 epochs. In the fine-tuning stage, we train with a dropout rate of 0.3, a batch size of 32, and 2500 warm-up steps with a maximum learning rate of 5e-5 for all directions. We use the Adam optimizer (Kingma and Ba, 2015) for both the CPT and fine-tuning processes. We set the maximum number of fine-tuning epochs to 20, and the final model is selected based on the performance on the validation dataset. The final results are reported in case-sensitive tokenized BLEU (Papineni et al., 2002). We notice that the tokenizer of mBART is the same as that of XLM-R (Conneau et al., 2020), which covers 100 languages. Extending the vocabulary may be necessary for new languages that are not included in the original tokenizer; we do not extend the vocabulary in our experiments since all the languages in the experiments are included in the vocabulary of XLM-R, and we find that the unknown token rates for unseen languages in the experiments are zero. Therefore, for all the models, we directly use mBART's tokenizer on the text for all languages in the experiments to ensure a fair comparison in BLEU, and we use thai-segmenter to pre-tokenize the text in Thai (Th) before using mBART's tokenizer. For inference, we use beam search with a beam size of 5 for all directions.
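The layer-freezing rule and the hyperparameters above can be captured in a small sketch. The parameter naming scheme (`encoder.layers.<index>.…`, `decoder.layers.<index>.…`) is an assumption modeled on common Transformer codebases and may differ from the actual checkpoint; `is_trainable` and the two config dictionaries are hypothetical helpers, with values copied from the text.

```python
import re

# Hyperparameters as stated in the text, gathered in plain dictionaries.
CPT_CONFIG = {"dropout": 0.1, "batch_size": 100, "learning_rate": 3e-5, "epochs": 5}
FINETUNE_CONFIG = {"dropout": 0.3, "batch_size": 32, "warmup_steps": 2500,
                   "max_learning_rate": 5e-5, "max_epochs": 20, "beam_size": 5}

# Assumed naming scheme: "<encoder|decoder>.layers.<index>.<rest>".
LAYER_PATTERN = re.compile(r"^(encoder|decoder)\.layers\.(\d+)\.")


def is_trainable(param_name, n_frozen=8):
    """Return True if a parameter should receive gradient updates.

    The first n_frozen layers of the encoder and of the decoder are frozen.
    Everything outside the layer stacks (e.g., embeddings) stays trainable.
    """
    match = LAYER_PATTERN.match(param_name)
    if match is None:
        return True
    return int(match.group(2)) >= n_frozen


if __name__ == "__main__":
    for name in ["encoder.embed_tokens.weight",
                 "encoder.layers.0.fc1.weight",
                 "decoder.layers.11.fc2.bias"]:
        print(name, is_trainable(name))
    # With a PyTorch model, the rule could be applied as:
    #   for name, param in model.named_parameters():
    #       param.requires_grad = is_trainable(name)
```

Matching on parameter names keeps the rule independent of the model class, so the same helper could be reused for both mBART and mT5 once the pattern is adjusted to their respective naming schemes.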

Main Results
The results of our proposed methods and baseline models are illustrated in Table 1, from which we can observe that conducting CPT on mBART is generally effective in the low-resource scenario of the NMT task, although the pre-training corpus contains as few as 100K paragraphs. Also, we can see that the CPT w/ MLT consistently outperforms all baseline models since the additional mixed-language information helps to construct a better alignment between the source and target languages in the translation pair. We observe that the CPT w/ MLT (Tgt) significantly outperforms mBART in multiple translation pairs (e.g., by 2.92 BLEU points in En → Id and 2.39 BLEU points in En → Th). We find that, although conducting CPT (w/ Ori or w/ MLT) on text that contains tokens in the unseen language generally enhances the translation performance, CPT w/ Ori is less effective than CPT w/ MLT. We conjecture that the original objective function of mBART loses its advantages when the amount of pre-training monolingual data is small, while MLT is still beneficial thanks to the additional bilingual alignments that it has learned. Additionally, we find that the direction of the CPT (Src or Tgt) also plays an important role. As we can see from Table 1, conducting CPT by reconstructing the target language in the translation pair generally achieves better performance than reconstructing the source. We conjecture that making the generated language in the CPT stage consistent with that in the fine-tuning stage increases the benefits from the CPT. This is because, if the generated languages are different in these two stages, the model needs to learn to generate sentences in an entirely different language with only a few data samples in the fine-tuning stage, which could make the fine-tuning task much more challenging. Interestingly, when English (a seen language) is the target language, the CPT w/ Ori (Tgt) becomes less effective, but CPT w/ MLT (Tgt) still works well. The reason is that CPT w/ Ori (Tgt) ignores the unseen language in the continual pre-training stage, while the mixed-language inputs of CPT w/ MLT (Tgt) still contain tokens in the unseen language, which enables the model to learn the unseen language. Surprisingly, mT5 generally performs worse than mBART, although it covers all the languages in our experiments. We conjecture that, since the objective function of mT5 is to generate the masked tokens, the average length of its generated text is shorter than that of mBART, which might limit its ability to quickly adapt to a generation task in the low-resource scenario.
Table 1: Fine-tuning performance on the 10K parallel data for the 24 translation pairs. All CPT methods utilize a corpus with a size of 100K paragraphs. The upper 12 pairs contain one unseen language for mBART (the other seen language is English), and the bottom 12 pairs contain two unseen languages. The CPT using our proposed method consistently outperforms all baseline models.

Different Low-Resource Settings
In this section, we investigate whether our method can generalize to other low-resource settings (i.e., different sizes of the parallel data and monolingual data). We choose three translation pairs (En ↔ Bn, En ↔ Id, and Sk ↔ Sv), which cover two scenarios: 1) only one language in a translation pair is unseen; and 2) both languages in a translation pair are unseen. As illustrated in Figure 2, we can observe that our method consistently improves on the mBART baseline across different parallel data sizes, and the improvements can be further boosted when the size of the pre-training data (monolingual data) increases. This is because a larger corpus is able to amplify the benefits of MLT and better align the space between the two languages in the translation pair. Moreover, we find that our method is especially effective for the Sk ↔ Sv translation pair when the size of the pre-training data reaches 1M. For example, in the Sk → Sv translation, the performance of CPT (1M) with 10K parallel samples (3.79) is on par with mBART with 70K parallel samples (3.80), which might suggest that gathering larger monolingual data (along with a dictionary) can be an alternative to collecting a larger amount of parallel data.
Figure 2: The performance over different numbers of parallel data (from 10K to 100K) and pre-training data (from 100K to 1M). CPT denotes our method, CPT w/ MLT (Tgt). Since the maximum number of paragraphs in Wikipedia for Bn is ∼600K, we set the data size in the CPT as 100K, 300K and 600K for En ↔ Bn.

Effectiveness on Seen Languages
As we can see from […] Additionally, the improvement brought by our method can be further boosted when a larger pre-training corpus is available, which accords with the experimental results for the unseen languages.
We conjecture that there are two reasons: 1) Continuing to pre-train mBART can make the model focus on the languages in the translation pair and increase the model's ability to adapt quickly to the translation task. 2) Continual pre-training with the mixed-language text can further align the two languages in the translation pair, which gives a better initialization for the low-resource translation task.
Table 4: Ablation study on the noise function g (denoted as noise) and token deletion (denoted as deletion).

Zero-shot Performance
To further analyze the alignment quality between the source and target languages in the translation pair after the CPT, we evaluate the models in the zero-shot scenario, where we directly test the pre-trained models on the test set without any fine-tuning on the parallel data. As illustrated in Table 3, we can see that the zero-shot performance is relatively low since the models are not trained on any parallel or pseudo-parallel data, and mBART gets 0 BLEU points due to the unseen languages in the test data. The results for each translation pair are in Appendix A.
We find that CPT w/ Ori achieves more than 0 BLEU points, even though it does not utilize any supervision from bilingual text. We conjecture that this can be attributed to the multilingual ability of mBART. Furthermore, CPT w/ MLT outperforms CPT w/ Ori since it learns additional bilingual alignments by reconstructing the target documents from the mixed-language text. These results further illustrate that our method achieves a better alignment quality than the baseline method.

Ablation Study
In this section, we first explore how the noise function g and token deletion in function h affect the effectiveness of our method (g and h are described in §2.2). Then, we investigate how the language mixing ratio of the mixed-language text affects our method's performance.
Noise & Deletion As shown in Table 4, we can see that both the noise function and token deletion play an important part in the CPT, and removing both of them further degrades the performance. Given that the number of pre-training documents is as few as 100K, it is relatively difficult for the model to learn a good representation for the unseen language. However, adding the noise function in the CPT forces the model to learn to perform text infilling and sentence reordering, which increases the model's ability to understand the unseen language. Conducting the token deletion brings two benefits: 1) It increases the variety of the mixed-language text, which keeps the model from overfitting to a certain mixed-language pattern. 2) It also injects extra noise into the inputs, which further compels mBART to understand the unseen language better. Moreover, incorporating both the noise function g and the token deletion further boosts the effectiveness of the pre-training.
Language Mixing Ratio We control the probability of replacing a token with its translation to generate different language mixing ratios and investigate how different ratios affect the effectiveness of the pre-training. As shown in Table 5, a mixing ratio that is too high or too low will degrade the advantages of the CPT w/ MLT, and keeping the ratio between 30% and 40% achieves the best performance. We conjecture that, if the mixing ratio is too low, the dictionary, which provides the supervision for bilingual alignment, is not well utilized, while if the mixing ratio is too high (e.g., 50%), we replace almost all the tokens existing in the dictionary, which lowers the diversity of the mixed-language text and makes the model more likely to overfit to the pre-training data.
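One way to turn a target mixing ratio into a per-token replacement probability is sketched below. It rests on a simplifying assumption that is not stated in the paper: the mixing ratio is measured over all tokens, only tokens found in the dictionary can be replaced, and masked tokens are ignored, so that the ratio is approximately the replacement probability times the dictionary coverage. The function names are hypothetical.

```python
def dictionary_coverage(corpus_tokens, dictionary):
    """Fraction of corpus tokens that have an entry in the dictionary."""
    if not corpus_tokens:
        return 0.0
    hits = sum(1 for tok in corpus_tokens if tok in dictionary)
    return hits / len(corpus_tokens)


def replacement_probability(target_ratio, coverage):
    """Per-token replacement probability needed to reach target_ratio.

    Assumes ratio ~= probability * coverage; the result is capped at 1.0
    because a token cannot be replaced with probability above one.
    """
    if coverage <= 0.0:
        raise ValueError("coverage must be positive to reach any mixing ratio")
    return min(1.0, target_ratio / coverage)


if __name__ == "__main__":
    toy_tokens = ["a", "b", "c", "d"]           # made-up placeholder corpus
    toy_dictionary = {"a": ["A"], "b": ["B"]}   # covers half of the tokens
    cov = dictionary_coverage(toy_tokens, toy_dictionary)
    print(cov, replacement_probability(0.3, cov), replacement_probability(0.5, cov))
```

Under this assumption, a dictionary that covers about half of the tokens would need a replacement probability close to one to reach a 50% mixing ratio, which is consistent with the remark that almost all dictionary tokens are replaced at that setting.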

Importance of Pre-Tokenization
Considering that the tokenizer of mBART is created based on the text of the pre-training languages, it might not tokenize well the unseen languages that differ from the pre-training languages. Therefore, it could be a better option to pre-tokenize the text before using mBART's tokenizer. We conduct experiments on the En-Th language pair and compare the performance with and without pre-tokenization for Thai. As shown in Table 6, we find that pre-tokenization improves the performance in En → Th significantly, while the improvements are marginal in Th → En. We conjecture that decoding (generating) tokens in the unseen language is much more difficult than encoding those tokens when they are not properly tokenized. This is because the task of the encoder is to understand the meaning of the input text, while the decoder needs to attend to the input text and generate tokens simultaneously, which makes the task of the decoder more difficult than that of the encoder. Therefore, when the unseen language (Thai) becomes the target language in the translation pair, the performance drops remarkably without pre-tokenization.

Multilingual Pre-Trained Models
Recently, multilingual pre-trained models based on the masked language modeling (MLM) objective function (Devlin et al., 2019; Conneau and Lample, 2019; Huang et al., 2019; Conneau et al., 2020) have shown their effectiveness at performing cross-lingual classification-based tasks. However, these models perform poorly on generation tasks (Rönnqvist et al., 2019) since they are not pre-trained in a generative way. Multilingual pre-training performed in a Seq2Seq fashion is able to mitigate this issue (Radford et al.; Lewis et al., 2020; Raffel et al., 2019), and has become a strong backbone for building NMT systems, especially in a low-resource scenario (Liu et al., 2020a; Song et al., 2019; Xue et al., 2020; Fan et al., 2020; Tang et al., 2020). Liu et al. (2020a) pre-trained a Seq2Seq multilingual model (mBART) by denoising full texts in 25 languages, while multilingual random aligned substitution has been proposed to pre-train an NMT model for many languages based on parallel data. Instead of pre-training models from scratch, other work has extended multilingual BERT (Devlin et al., 2019) to an unseen language and evaluated it on the named entity recognition task. Although many studies have focused on pre-training multilingual models, few have investigated how to adapt the pre-trained models to new languages effectively. Also, to the best of our knowledge, we are the first to explore how to adapt a multilingual model pre-trained in a Seq2Seq fashion to unseen languages and evaluate the methods on a generative task (the NMT task).

Conclusion & Future Work
In this paper, we present a continual pre-training framework to improve mBART's generalization ability to extremely low-resource translation pairs that contain unseen languages. We propose to construct noisy mixed-language text from the monolingual corpus to cover both the source and target languages, and then, we continue pre-training mBART to reconstruct the original monolingual text. Results illustrate that our method is able to consistently surpass strong baselines across all tested translation pairs that contain unseen languages, as well as the ones where both languages are seen in the original mBART's pre-training. Moreover, we observe that our method is also beneficial for different low-resource settings, and its performance can be further boosted when a larger pre-training corpus is available. Furthermore, we find that not only mixing the source and target languages, but also increasing the variety of the inputs plays an essential role in the continual mixed-language pre-training. In future work, we will explore more pre-training methods to further boost the performance of pre-trained models on the NMT task. Additionally, we will study more applications of continual mixed-language pre-training, such as applying it to downstream cross-lingual tasks.