Robust Domain Adaptation for Pre-trained Multilingual Neural Machine Translation Models

Recent literature has demonstrated the potential of multilingual Neural Machine Translation (mNMT) models. However, the most efficient models are not well suited to specialized industries, where internal data is scarce and expensive to gather in all language pairs, making it hard to fine-tune an mNMT model on a specialized domain. In this context, we focus on a new task: domain adaptation of a pre-trained mNMT model on a single language pair while maintaining model quality on generic-domain data for all language pairs. The risk of losing performance on the generic domain and on the other pairs is high. This task is key for mNMT adoption in industry and sits at the border of many others. We propose a fine-tuning procedure for a generic mNMT model that combines embedding freezing and an adversarial loss. Our experiments demonstrate that the procedure improves performance on specialized data with a minimal loss of initial performance on the generic domain for all language pairs, compared to a naive standard approach (+10.0 BLEU on specialized data; -0.01 to -0.5 BLEU on the WMT and Tatoeba datasets on the other pairs with M2M100).


Introduction
Building an NMT model supporting multiple language pairs is an active and emerging area of research (NLLB Team et al., 2022; Fan et al., 2020; Tang et al., 2020). Multilingual NMT (mNMT) uses a single model that supports translation in multiple language pairs. Multilingual models have several advantages over their bilingual counterparts (Arivazhagan et al., 2019b). This modeling proves to be both efficient and effective, as it reduces the operational cost (a single model is deployed for all language pairs) and improves translation performance, especially for low-resource languages.
All these advantages make mNMT models interesting for real-world applications. However, they are not suitable for specialized industries that require domain-specific translation. Training a model from scratch or fine-tuning all the pairs of a pre-trained mNMT model is almost impossible for most companies, as it requires access to a large number of resources and specialized data. That said, fine-tuning a single pair of a pre-trained mNMT model on a specialized domain seems possible. Ideally, this domain adaptation could be learned while sharing parameters from the old pairs, without suffering from catastrophic forgetting (Mccloskey and Cohen, 1989). This is rarely the case. The risk of degrading performance on the old pairs is high, due to the limited available data from the target domain and to the extremely high complexity of the pre-trained model. In our case, overfitting on fine-tuning data means that the model might not even be multilingual anymore. In this context, this article focuses on a new real-world oriented task: fine-tuning a pre-trained mNMT model on a single language pair in a specific domain without losing initial performance on the other pairs and on generic data. Our research focuses on fine-tuning two state-of-the-art pre-trained multilingual mNMT models freely available, M2M100 (Fan et al., 2020) and mBART50 (Tang et al., 2020), which both provide high BLEU scores and translate up to 100 languages.
We explored multiple approaches for this domain adaptation. Our experiments were run on English-to-French data from the medical domain. This paper shows that fine-tuning a pre-trained model with initial layer freezing, for a few steps and with a small learning rate, is the best performing approach.
The paper is organized as follows: first, we introduce the standard components of modern NMT; second, we describe related works; third, we present our methods. We finally study systematically the impact of several state-of-the-art fine-tuning methods and present our results.
Our main contributions can be separated into two parts:
• Defining a new real-world oriented task that focuses on domain adaptation and catastrophic forgetting in multilingual NMT models
• Defining a procedure that allows fine-tuning a pre-trained generic model on a specific domain

Background

Neural Machine Translation
Neural Machine Translation (NMT) has become the dominant field of machine translation. It studies how to automatically translate from one language to another using neural networks.
Most NMT systems are trained using Seq2Seq architectures (Sutskever et al., 2014; Cho et al., 2014) by maximizing the probability of the target sequence $V^T = (v_1, \ldots, v_T)$ given the source sentence $W^S = (w_1, \ldots, w_S)$:

$$P(V^T \mid W^S; \theta) = \prod_{t=1}^{T} p(v_t \mid v_{<t}, W^S; \theta)$$

Today the best performing Seq2Seq architecture for NMT is the Transformer (Vaswani et al., 2017). Transformers are built from different layers, among which the multi-head attention and the feed-forward layer. These are applied sequentially and are both followed by a residual connection (He et al., 2015) and layer normalization (Ba et al., 2016). Although powerful, traditional NMT only translates from one language to another, with a high computational cost compared to its statistical predecessor. It has been shown that a simple language token can condition the network to translate a sentence into any target language from any source language (Johnson et al., 2017). This allows creating multilingual models that can translate between multiple languages. Using the previous notation, the multilingual model adds a condition on the target language $l_T$ to the previous modeling: $P(V^T \mid W^S, l_T; \theta)$.
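The language-token trick above can be sketched in a few lines. This is a minimal illustration, not the actual preprocessing of M2M100 or mBART; the `__xx__` token format is an assumption for the example (real systems define their own special tokens):

```python
def add_language_token(source_tokens, target_lang):
    """Prepend a target-language token so a single model can be
    conditioned to translate the same source into any supported
    target language (Johnson et al., 2017 style).
    The '__xx__' format is a hypothetical convention for illustration."""
    return [f"__{target_lang}__"] + list(source_tokens)

# The same English source, conditioned for two different targets:
fr = add_language_token(["Hello", "world"], "fr")
de = add_language_token(["Hello", "world"], "de")
```

The rest of the pipeline (tokenization, training) is unchanged: the model simply learns to associate the leading token with the output language.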

Transfer Learning
Transfer learning is a key topic in Natural Language Processing (Devlin et al., 2018; Liu et al., 2019). It is based on the assumption that pre-training a model on a large set of data in various tasks will help initialize a network trained on another task where data is scarce.
It is already a key area of research in NMT, where large sets of generic data are freely available (news, Common Crawl, ...). However, real-world applications require specialized models. In-domain data is rare and more costly to gather for industries (finance, legal, medical, ...), making specialized models harder to train. This is even more true for multilingual models.
In our work, we study how we can adapt an mNMT model to a specific domain by fine-tuning on only one language pair, without losing too much generality across all language pairs.

Related works

Multilingual Neural Machine Translation
While initial research on NMT started with bilingual translation systems (Sutskever et al., 2014; Cho et al., 2014; Luong et al., 2015; Yang et al., 2020), it has been shown that the NMT framework is extendable to multilingual models (Dong et al., 2015; Firat et al., 2016; Johnson et al., 2017; Dabre et al., 2020). mNMT has seen a sharp increase in the number of publications, since it is easily extendable and allows both end-to-end modeling and cross-lingual language representation (Conneau et al., 2017; Linger and Hajaiej, 2020; Conneau et al., 2019).
Competitive multilingual models have been released and open-sourced. mBART (Liu et al., 2019) was first trained following the BART (Lewis et al., 2019) objective before being fine-tuned on an English-centric multilingual dataset (Tang et al., 2020). M2M100 (Fan et al., 2020) scaled large Transformer layers (Vaswani et al., 2017) with a large amount of mined data in order to create an mNMT model that does not use English as a pivot and can translate between any pair among 100 languages. More recently, NLLB was released (NLLB Team et al., 2022), extending the M2M100 framework to 200 languages. These models are extremely competitive, as they have performance similar to their bilingual counterparts while allowing a pooling of training and resources.
Our experiments rely on M2M100 and mBART, but our approach can be generalized to any new pre-trained multilingual model (NLLB Team et al., 2022).

Domain Adaptation
Domain adaptation in the field of NMT is a key real-world oriented task. It aims at maximizing model performance on a certain in-domain data distribution. Dominant approaches are based on fine-tuning a generic model using either in-domain data only or a mixture of out-of-domain and in-domain data to reduce overfitting (Servan et al., 2016a; Van Der Wees et al., 2017). Many works have extended domain adaptation to the multi-domain setting, where the model is fine-tuned on multiple different domains (Sajjad et al., 2017; Zeng et al., 2018; Mghabbar and Ratnamogan, 2020). However, to the best of our knowledge, our work is the first to explore domain adaptation in the context of recent pre-trained multilingual neural machine translation systems while focusing on keeping the model performant on out-of-domain data in all languages.

Learning without forgetting
Training on a new task or new data without losing past performance is a generic machine learning problem, named learning without forgetting (Li and Hoiem, 2016).
Limiting pre-trained weight updates using either trust regions or an adversarial loss is a recent idea that has been used to improve training stability in both natural language processing and computer vision (Zhu et al., 2019; Jiang et al., 2020; Aghajanyan et al., 2020). These methods have not been explored in the context of NMT but are key assets that have demonstrated their capabilities on other NLP tasks (Natural Language Inference in particular). Our work applies a combination of these methods to our task.

Zero Shot Translation
mNMT has shown the capability of direct translation between language pairs unseen in training: an mNMT system can automatically translate between unseen pairs without any direct supervision, as long as both source and target languages were included in the training data (Johnson et al., 2017). However, prior works (Johnson et al., 2017; Firat et al., 2016; Arivazhagan et al., 2019a) showed that the quality of zero-shot NMT significantly lags behind pivot-based translation (Gu et al., 2019). Based on these ideas, some papers (Liu et al., 2021) have focused on training an mNMT model supporting the addition of new languages by relaxing the correspondence between input tokens and encoder representations, thereby improving its zero-shot capacity. We were interested in using this method, as learning less specific input tokens during the fine-tuning procedure could help our model not overfit the training pairs. Indeed, generalizing to a new domain can be seen as a task that includes generalizing to an unseen language.

Methods
Our new real-world oriented task being at the crossroads of many existing tasks, we applied ideas from the current literature and combined different approaches to achieve the best results.

Hyperparameter search heuristics for efficient fine-tuning
We seek to adapt a generic multilingual model to a specific task or domain (Cettolo et al., 2014; Servan et al., 2016b). Recent works in NMT (Domingo et al., 2019) have proposed methods to incrementally adapt a model to a specific domain. We continue the training of the generic model on specific data through several iterations (see Algorithm 1). This post-training fine-tuning procedure is done without dropping the previous learning states of the multilingual model. The resulting model is considered adapted, or specialized, to a specific domain. We want to prevent the model from suffering from forgetting on the generic domain and the other pairs. To this end, we include in this fine-tuning different methods that have been mentioned in the literature. These methods include in particular choosing a small learning rate (Howard and Ruder, 2018), a triangular learning-rate schedule (Houlsby et al., 2019), reducing the number of steps, and freezing some of the layers (Stickland and Murray, 2019).
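Two of the ingredients above, the triangular learning-rate schedule and layer freezing, can be sketched as follows. This is a minimal illustration under stated assumptions: the 10% warm-up fraction and the `"embed"` name filter are hypothetical choices, not values from the paper, and the parameter dictionaries stand in for a real framework's parameter objects:

```python
def triangular_lr(step, total_steps, peak_lr, warmup_frac=0.1):
    """Triangular learning-rate schedule: linear warm-up to peak_lr,
    then linear decay to zero. warmup_frac is an assumed hyperparameter."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)

def freeze_embeddings(named_params):
    """Mark embedding parameters as frozen so the optimizer skips them.
    `named_params` maps parameter names to dicts carrying a
    'requires_grad' flag (a stand-in for real tensor parameters)."""
    for name, p in named_params.items():
        if "embed" in name:  # assumed naming convention for embedding layers
            p["requires_grad"] = False
    return named_params
```

In a real setup, the same idea applies by iterating over `model.named_parameters()` and setting `requires_grad = False` on the embedding tensors before building the optimizer.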

Smoothness-inducing Adversarial Regularizer
We seek to reduce the loss on the generic domain and the other pairs. Indeed, due to limited data resources from downstream tasks and the extremely large capacity of pre-trained models, aggressive fine-tuning often causes the adapted model to overfit the data of downstream tasks and forget the knowledge of the pre-trained model. To this end, we added a Smoothness-inducing Adversarial Regularization (SMART) term during fine-tuning (Jiang et al., 2020). Models fine-tuned on the GLUE tasks with the SMART approach outperform even the strongest pre-trained baselines on all 8 tasks: compared with BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019), BERT-SMART and RoBERTa-SMART perform better by a significant margin. This approach gives a smoothness-inducing property to the model f, which helps prevent overfitting and improves generalization on a low-resource target domain for a given task. Therefore, adding it to our task should avoid overfitting on the new domain.
Given the model $f(\cdot; \theta)$ and $n$ data points of the target task denoted by $\{(x_i, y_i)\}_{i=1}^{n}$, where the $x_i$'s denote the embeddings of the input sentences, given by the first embedding layer of the language model, and the $y_i$'s are the associated labels, SMART adds a regularization term $\mathcal{R}_s(\theta)$ to the canonical optimization loss:

$$\min_\theta \mathcal{F}(\theta) = \mathcal{L}(\theta) + \lambda_s \mathcal{R}_s(\theta), \quad (1)$$

where $\mathcal{L}(\theta)$ is the loss function defined as

$$\mathcal{L}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell\big(f(x_i; \theta), y_i\big), \quad (2)$$

$\ell(\cdot, \cdot)$ is the loss function depending on the target task, $\lambda_s > 0$ is a tuning parameter, and $\mathcal{R}_s(\theta)$ is the smoothness-inducing adversarial regularizer.
Here we define $\mathcal{R}_s(\theta)$ as

$$\mathcal{R}_s(\theta) = \frac{1}{n} \sum_{i=1}^{n} \max_{\|\tilde{x}_i - x_i\|_\infty \le \epsilon} \ell_s\big(f(\tilde{x}_i; \theta), f(x_i; \theta)\big), \quad (3)$$

where $\epsilon > 0$ is a tuning parameter. Since NMT is a classification task, $f(\cdot; \theta)$ outputs a probability simplex and $\ell_s$ is chosen as the symmetrized KL divergence, i.e., $\ell_s(P, Q) = \mathrm{KL}(P \,\|\, Q) + \mathrm{KL}(Q \,\|\, P)$.
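The symmetrized KL divergence used as ℓ_s can be sketched for discrete distributions as follows. This is a toy illustration of the regularizer's distance term only (the inner adversarial maximization over the perturbation is omitted); the `eps` smoothing constant is an assumed numerical guard, not a value from SMART:

```python
import math

def kl(p, q, eps=1e-12):
    """KL(P || Q) for discrete distributions given as lists of
    probabilities; eps avoids log(0) for illustration purposes."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def symmetrized_kl(p, q):
    """The symmetric distance ℓ_s(P, Q) = KL(P||Q) + KL(Q||P) that SMART
    penalizes between predictions on clean and perturbed embeddings."""
    return kl(p, q) + kl(q, p)
```

Intuitively, the regularizer is small when a tiny perturbation of the input embeddings barely changes the output distribution, which is exactly the smoothness property that limits overfitting.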

Enabling the model to learn less aggressive input tokens
We seek to reduce the loss of performance on the pairs learned during the pre-training of the model. A factor causing an overly language-specific representation is the positional correspondence to input tokens (Liu et al., 2021). Relaxing it should help the model learn the new domain without focusing too much on the language representation. Recent advances in mNMT showed that we can reduce the positional correspondence learned from the input tokens seen during training thanks to the Positional Disentangling Encoder (PDE) (Liu et al., 2021). PDE corresponds to removing some of the residual connections of the model architecture.
PDE is reported to beat models that do not use it by +18.5 BLEU on zero-shot translation pairs while retaining quality on supervised directions (Liu et al., 2021). Applying it during the domain adaptation fine-tuning helps the model learn less specific input tokens (since we train only from English to French). Therefore, adapting this method to our domain adaptation training is straightforward and could bring BLEU gains on language pairs seen during pre-training while not sacrificing performance on the new specific domain.
Experimental Settings

Pre-trained Generic Models used
We have worked with two pre-trained mNMT models: M2M100 and mBART50 large. M2M100 is a multilingual encoder-decoder model, based on a large Transformer architecture, that can handle 100 languages. It was trained on a non-English-centric dataset of 7.5B sentences from the generic domain; as such, it is the first true many-to-many NMT model. To ease the fine-tuning process and due to hardware limitations, we worked with the lightest version released (418M parameters). mBART50 is a multilingual encoder-decoder model, trained on an English-centric dataset with a large Transformer architecture, that can handle 50 languages. It was trained following the BART objective (Lewis et al., 2019); more formally, the model aims to reconstruct a text that has been previously noised.
We will compare the domain adaptation performance of mBART50, which was trained on English-centric data, and M2M100, which was trained on non-English-centric data.

Datasets and preprocessing
In order to assess the effectiveness of our different domain adaptation strategies, we focused on the medical domain for English to French, using data from the EMEA dataset (Tiedemann, 2012). We used the same preprocessing as the original publications (joint BPE tokenization from SentencePiece). We split the dataset into a train and a test set, using the first 5,000 sentences for testing and 350,000 sentences for training. For the evaluation data on the generic domain, we used generic data from different sources, including WMT and Tatoeba. For the evaluation data on the medical domain, we also used the EMEA dataset in different languages.
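The split described above can be sketched as a simple slicing operation. The sizes come from the paper; the assumption that the test block precedes the training block in corpus order is ours, made only for illustration:

```python
def split_corpus(sentence_pairs, test_size=5_000, train_size=350_000):
    """Split a list of (source, target) sentence pairs following the
    paper's sizes: the first test_size pairs form the test set, the
    next train_size pairs form the training set. The ordering of the
    two blocks is an assumption for this sketch."""
    test = sentence_pairs[:test_size]
    train = sentence_pairs[test_size:test_size + train_size]
    return train, test
```

Keeping the test slice disjoint from (and drawn before) the training slice avoids leakage between the two sets.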

Detailed Procedure
We first define a hyperparameter search heuristic. We choose a range of learning rates and train the model with these values. We set prior thresholds between the loss we accept on generic data and the increase we target on medical data, then apply the procedure in Algorithm 1. Having done this, we keep the best settings (best learning rate and number of steps for the given thresholds) and try freezing the first layers to reduce the loss on the generic domain. We define ε_3, a threshold between the loss on the medical domain and the gain on the generic domain, reproduce the same procedure, and report our best results. This allows us to find the optimal model θ_opt, representing the best compromise between not losing performance on generic data and adapting well to the medical domain.
Algorithm 1: Hyperparameter search heuristic for domain adaptation using simple fine-tuning.
Input: T, the maximum number of steps; L, the number of frozen layers; L_r, the learning rate; ε_1, the threshold for Δ_1, the BLEU difference between the baseline and the adapted model on EN-FR generic-domain data; ε_2, the threshold for Δ_2, the mean BLEU difference between the baseline and the adapted model on all other generic data; θ_0, the parameters of the pre-trained model.
Output: θ_opt, the parameters of the model with the optimal compromise between in-domain and generic BLEU.
1: T ← 100K
2: L ← 1
3: for L_r = 3e-5, 1e-5, ..., 1e-8 do
4:   fine-tune θ_0 for at most T steps with learning rate L_r and the first L layers frozen
5:   every 2K steps, evaluate the model on the validation set and compute Δ_1 and Δ_2
6:   if Δ_1 < ε_1 and Δ_2 < ε_2 and the in-domain BLEU improves, update θ_opt
7: end for

M2M100. We trained M2M100 on the medical EN-FR dataset. We used the Adam optimizer (β_1 = 0.9, β_2 = 0.98), label smoothing, a dropout of 0.1 and a weight decay of 0. We applied our hyperparameter search heuristic (Algorithm 1) to find the best model, setting ε_1 = 2 and ε_2 = 1. In this configuration, optimal results were obtained with a learning rate of 1e-07, freezing the embeddings at the encoder level, and 60K steps.

mBART50. We trained mBART50 large on the medical EN-FR dataset. We used the Adam optimizer (β_1 = 0.9, β_2 = 0.98), label smoothing, a dropout of 0.3 and a weight decay of 0. Again, we applied our hyperparameter search heuristic (Algorithm 1) to find the best model. We had to increase the values of ε_1 and ε_2, since mBART50 tends to forget the generic domain quicker than M2M100; we set ε_1 = 4 and ε_2 = 3. In this configuration, optimal results were obtained with a learning rate of 6e-07, freezing the embeddings at the encoder level, and 10K steps.
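The selection logic of the search heuristic can be sketched as a small loop. This is a schematic of the checkpoint-selection criterion only; `train_eval` is a hypothetical user-supplied function standing in for the actual fine-tuning and BLEU evaluation, which the paper runs every 2K steps:

```python
def search_heuristic(learning_rates, train_eval, eps1, eps2):
    """Sketch of the Algorithm 1 selection rule. For each learning
    rate, train_eval(lr) yields per-checkpoint tuples
    (domain_bleu, delta1, delta2), where delta1/delta2 are the BLEU
    losses vs. the baseline on EN-FR generic data and on the other
    pairs. Keep the checkpoint with the best in-domain BLEU whose
    generic losses stay under both thresholds."""
    best_bleu, best = float("-inf"), None
    for lr in learning_rates:
        for ckpt, (domain_bleu, d1, d2) in enumerate(train_eval(lr)):
            if d1 < eps1 and d2 < eps2 and domain_bleu > best_bleu:
                best_bleu, best = domain_bleu, (lr, ckpt)
    return best
```

With the paper's thresholds (ε_1 = 2, ε_2 = 1 for M2M100), a checkpoint with a large in-domain gain is still rejected if it costs too much BLEU on generic data, which is how catastrophic forgetting is filtered out.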

SMART
We fine-tuned the model with the SMART procedure and continued the hyperparameter search as in Algorithm 1. In Algorithm 2, R_s(θ) is the regularizer of Equation 3 and AdamUpdate denotes the Adam update rule for optimizing Equation 1 using the mini-batch B. Lastly, we set T_x = 1. For the perturbation, we set ε = 10^-5 and σ = 10^-5. The learning rate η is set to 10^-3.

M2M100
As shown in Table 1, we reached an increase of more than 9.00 BLEU on the medical dataset without sacrificing performance on the generic domain; the loss is small on most of the pairs (between 0.01 and 0.2). In Figure 3, we see that the mean result is rather stable and that the BLEU on generic English-to-French data does not decrease much (around -1.5 BLEU). The model converges after 60K steps, so we stop training there.

mBART50. Again we reach an increase of more than 9.00 BLEU (Figure 4). We observe that after 50K steps mBART50 starts converging around 40.00 BLEU, yet we decided to stop the domain adaptation training sooner than with M2M100 as a trade-off between good performance on the EN-FR medical domain and loss of performance on the generic domain. Globally, we achieved better results with M2M100 than with mBART50. We tested several learning rate values and report here our results with a bigger learning rate (3e-5).
For both models, this bigger learning rate led to catastrophic forgetting on the non-fine-tuned pairs, along with a huge performance increase on the EN-FR medical dataset, reaching a higher BLEU on the medical dataset. We therefore decided to focus on a smaller learning rate as a trade-off between the loss on the generic domain and the gain on the medical domain.

SMART
We have reported our fine-tuning results for M2M100 and mBART50 with SMART in Table 1.
Our goal with SMART was to reach a higher BLEU score on the generic-domain data without sacrificing performance on the medical dataset. In Table 1, we note a good increase in BLEU score. Moreover, we noted that the BLEU changes less when moving the learning rate within a reasonable range. However, if exploring a large range of hyperparameters is feasible, a simple fine-tuning procedure like Algorithm 1 can provide better results, as shown in Table 1.

PDE
We seek to reduce the loss of performance on the pairs learned during pre-training (and not used during the post-training domain adaptation) by relaxing the correspondence to the input tokens learned during domain adaptation. Fine-tuning with PDE was supposed to help learn less specific input tokens, so that the model would be less likely to forget all the pre-trained pairs. As expected, the model learned less aggressive input tokens and did not overfit on English input tokens. However, in practice this does not work well: the model is also likelier to forget the pre-trained input tokens, making this method unfit for our procedure. Using PDE a posteriori (during fine-tuning) seems to be inefficient, since the model performs worse on all pairs and not only on the English pairs. We report our results in Table 1.

Zero-shot Domain Adaptation on other pairs
We challenged the approach on domain adaptation for languages unseen during the post-training. We also investigated why mBART50 was more likely to forget other pairs compared to M2M100. First, we worked with the 418M-parameter version of M2M100; this is not the largest M2M100 version released (and certainly not the most optimized), which could possibly explain the differences. Another hypothesis is the different datasets used during the training of the two models: mBART50 is trained on English-centric data, and M2M100 is not. Non-English-centric models are known to achieve higher BLEU, especially on low-resource data (Fan et al., 2020). Extending this observation to domain adaptation, we believe non-English-centric models might be more robust to domain adaptation. We noted that when fine-tuning mBART50 with a bigger learning rate, the first pairs to be forgotten are the non-English ones. Testing this hypothesis on NLLB might be useful.

Conclusion and Discussion
In this paper, we propose a study of robust domain adaptation approaches for mNMT models where in-domain data is available only for a single language pair. The best performing approach combines embedding freezing and simple fine-tuning with good hyperparameters. This approach shows good improvements with little in-domain data, across all language pairs. The framework effectively avoids overfitting and aggressive forgetting on out-of-domain generic data while quickly adapting to in-domain data. We demonstrate that this could be a solution for the incremental adaptation of mNMT models. Finally, our work is a call for more research on domain adaptation for multilingual models, as it is key for real-world applications.

Limitations
This study was limited by hardware constraints. We did not have the possibility to fine-tune the large version of M2M100 (12B parameters), which requires 64 GB of VRAM.
Testing our results with a larger version of M2M100 might be interesting.
Also, our study focused on two pre-trained multilingual neural machine translation models. However, many others exist or will be released (NLLB Team et al., 2022). We think that our work is generic enough to be applied to other pre-trained models, but extensive experiments on these new models should be carried out.
Finally, the work was carried out on English-to-French data. We showed domain adaptation is possible for this language pair and tested the impact of this training on many languages with different morphologies (Japanese, English, Russian, ...). Applying domain adaptation training to languages with other morphologies and to other domains is also an area to investigate.

Ethics Statement
The dataset was gathered on OPUS and is largely open-sourced. It was released by Tiedemann (2012) and we downloaded it from the OPUS website. We reviewed the dataset and did not note any issue with the data: it is very specific to the health domain and therefore not inappropriate, and it does not deal with demographic or identity characteristics.
Moreover, these experiments were made using only 2 GPUs, and the training runs were relatively short. Given the urgency of addressing climate change, we believe our domain adaptation procedure could help obtain high-performing mNMT models at small carbon and energy costs. Moreover, the SMART framework allows a quicker search for the right hyperparameters, therefore further reducing the number of experiments and the carbon cost of our method.

Figure 1: Domain Adaptation of a Pre-trained mNMT

Algorithm 2
Figure 2: PDE illustration: removing residual connections on the encoder block

Figure 5: Domain Adaptation of M2M100 with a big Learning Rate

Figure 6: Adding SMART to M2M100 Domain Adaptation training

Table 1: Global results on domain adaptation of M2M100 and mBART50