Adapting High-resource NMT Models to Translate Low-resource Related Languages without Parallel Data

The scarcity of parallel data is a major obstacle for training high-quality machine translation systems for low-resource languages. Fortunately, some low-resource languages are linguistically related or similar to high-resource languages; these related languages may share many lexical or syntactic structures. In this work, we exploit this linguistic overlap to facilitate translating to and from a low-resource language with only monolingual data, in addition to any parallel data in the related high-resource language. Our method, NMT-Adapt, combines denoising autoencoding, back-translation and adversarial objectives to utilize monolingual data for low-resource adaptation. We experiment on 7 languages from three different language families and show that our technique significantly improves translation into low-resource languages compared to other translation baselines.


Introduction
While machine translation (MT) has made incredible strides due to the advent of deep neural machine translation (NMT) models (Sutskever et al., 2014; Bahdanau et al., 2014), this improvement has been shown to be primarily in well-resourced languages with large amounts of available parallel training data.
However, with the growth of internet communication and the rise of social media, individuals worldwide have begun communicating and producing content in their native low-resource languages. Many of these low-resource languages are closely related to a high-resource language. One such example is "dialects": variants of a language traditionally considered oral rather than written. Machine translating dialects using models trained on the formal variant of a language (typically the high-resource variant, which is sometimes considered the "standardized form") can pose a challenge due to the prevalence of non-standardized spelling as well as significant slang vocabulary in the dialectal variant. Similar issues arise from translating a low-resource language using a related high-resource model (e.g., translating Catalan with a Spanish MT model).
* This work was conducted while the author was working at Facebook AI.
An intuitive approach to better translating low-resource related languages would be to obtain high-quality parallel data. However, this approach is often infeasible due to the lack of specialized expertise or bilingual translators. The problems are exacerbated by issues that arise in quality control for low-resource languages (Guzmán et al., 2019). This scarcity motivates our task of learning machine translation models for low-resource languages while leveraging readily available data such as parallel data from a closely related language or monolingual data in the low-resource language. The use of monolingual data when little to no parallel data is available has been investigated for machine translation. A few approaches involve synthesizing more parallel data from monolingual data using backtranslation (Sennrich et al., 2015) or mining parallel data from large multilingual corpora (Tran et al., 2020; El-Kishky et al., 2020b,a; Schwenk et al., 2019). We introduce NMT-Adapt, a zero-resource technique that does not need parallel data of any kind in the low-resource language.
We investigate the performance of NMT-Adapt at translating two directions for each low-resource language: (1) low-resource to English and (2) English to low-resource. We claim that translating into English can be formulated as a typical unsupervised domain adaptation task, with the high-resource language as the source domain and the related low-resource language as the target domain. We then show that adversarial domain adaptation can be applied to this related language translation task. For the second scenario, translating into the low-resource language, the task is more challenging as it involves unsupervised adaptation of the generated output to a new domain. To approach this task, NMT-Adapt jointly optimizes four tasks to perform low-resource translation: (1) denoising autoencoding, (2) adversarial training, (3) high-resource translation, and (4) low-resource backtranslation.
We test our proposed method and demonstrate its effectiveness in improving low-resource translation for languages from three distinct families: (1) Iberian languages, (2) Indic languages, and (3) Semitic languages, specifically Arabic dialects. We make our code and resources publicly available. 2

Related Work
Zero-shot translation. Our work is closely related to that of zero-shot translation (Johnson et al., 2017; Al-Shedivat and Parikh, 2019). However, while zero-shot translation translates between a language pair with no parallel data, there is an assumption that both languages in the target pair have some parallel data with other languages. As such, the system can learn to process both languages. In one work, Currey and Heafield (2019) improved zero-shot translation using monolingual data in the pivot language. However, in our scenario, there is no parallel data between the low-resource language and any other language. In other work, Arivazhagan et al. (2019) showed that adding adversarial training to the encoder output could help zero-shot training. We adopt a similar philosophy in our multi-task training to ensure our low-resource target is in the same latent space as the higher-resource language.
Unsupervised translation. A related set of work is the family of unsupervised translation techniques; these approaches translate between language pairs with no parallel corpus of any kind. In work by Artetxe et al. (2018) and Lample et al. (2018a), unsupervised translation is performed by training denoising autoencoding and backtranslation tasks concurrently. In these approaches, multiple pretraining methods were proposed to better initialize the model (Lample et al., 2018b; Lample and Conneau, 2019; Liu et al., 2020; Song et al., 2019). Different approaches were proposed that used parallel data between X-Y to improve unsupervised translation between X-Z (Garcia et al., 2020a;. This scenario differs from our setting as it does not assume that Y and Z are similar languages. These approaches leverage a cross-translation method on a multilingual NMT model where, for a parallel data pair $(S_x, S_y)$, they translate $S_x$ into language Z with the current model to get $S_z$. They then use $(S_y, S_z)$ as an additional synthesized data pair to further improve the model. Garcia et al. (2020b) experiment using multilingual cross-translation on low-resource languages with some success. While these approaches view the parallel data as auxiliary, to supplement unsupervised NMT, our work looks at the problem from a domain adaptation perspective. We attempt to use monolingual data in Z to make the supervised model trained on X-Y generalize to Z.
2 https://github.com/wjko2/NMT-Adapt
Leveraging High-resource Languages to Improve Low-resource Translation. Several works have leveraged data in high-resource languages to improve the translation of similar low-resource languages. Neubig and Hu (2018) showed that it is beneficial to mix the limited parallel data pairs of low-resource languages with high-resource language data. Lakew et al. (2019) proposed selecting high-resource language data with lower perplexity in the low-resource language model. Xia et al. (2019) created synthetic sentence pairs by unsupervised machine translation, using the high-resource language as a pivot. However, these previous approaches emphasize translating from the low-resource language to English, while the opposite direction is either unconsidered or shows poor translation performance. Siddhant et al. (2020) trained multilingual translation and denoising simultaneously, and showed that the model could translate languages without parallel data into English near the performance of supervised multilingual NMT.
Similar language translation. Similar to our work, there have been methods proposed that leverage similar languages to improve translation. Hassan et al. (2017) generated synthetic English-dialect parallel data from an English-main language corpus. However, this method assumes that the vocabulary in the main language could be mapped word by word into the dialect vocabulary, and they calculate the corresponding word for substitution using localized projection. This approach differs from our work in that it relies on the existence of a seed bilingual lexicon to the dialect/similar language. Additionally, the approach only considers translating from a dialect to English and not the reverse direction. Other work trains a massively multilingual many-to-many model and demonstrates that high-resource training data improves related low-resource language translation (Fan et al., 2020). In other work, Lakew et al. (2018) compared ways to model translations of different language varieties in the setting where parallel data for both varieties is available but the variety of some pairs may not be labeled. Another line of work focuses on translating between similar languages. In one such work, Pourdamghani and Knight (2017) learned a character-based cipher model. In other work, Wan et al. (2020) improved unsupervised translation between the main language and the dialect by separating the token embeddings into pivot and private parts while performing layer coordination.

Method
We describe the NMT-Adapt approach to translating a low-resource language into and out of English without utilizing any low-resource language parallel data. In Section 3.1, we describe how NMT-Adapt leverages a novel multi-task domain adaptation approach to translating English into a low-resource language. In Section 3.2, we then describe how we perform source-domain adaptation to translate a low-resource language into English. Finally, in Section 3.3, we demonstrate how we can leverage these two domain adaptations to perform iterative backtranslation, further improving translation quality in both directions.

English to Low-resource
To translate from English into a low-resource language, NMT-Adapt is initialized with a pretrained mBART model whose pretraining is described by Liu et al. (2020). Then, as shown in Figure 1, we continue to train the model simultaneously with four tasks inspired by Lample et al. (2018a) and update the model with a weighted sum of the gradients from the different tasks.
The language-identifying tokens are placed at the same position as in mBART. For the encoder, both high- and low-resource language source text, with and without noise, use the language token of the high-resource language [HRL] in the pre-trained mBART. For the decoder, the related high- and low-resource languages use their own, different, language tokens. We initialize the language token embedding of the low-resource language with the embedding from the high-resource language token.
Task 1: Translation. The first task is translation from English into the high-resource language (HRL), which is trained using readily available high-resource parallel data. This task aims to transfer high-resource translation knowledge to aid in translating into the low-resource language. We use the cross-entropy loss formulated as follows:
$$L_{t} = L_{CE}\big(D(Z_{En}, [\mathrm{HRL}]),\ X_{HRL}\big), \tag{1}$$
where $Z_{En} = E(X_{En}, [\mathrm{En}])$ and $(X_{En}, X_{HRL})$ is a parallel sentence pair. $E$ and $D$ denote the encoder and decoder functions, which take (input, language token) as parameters. $L_{CE}$ denotes the cross-entropy loss.
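The language-token convention described above can be sketched in a few lines. This is a minimal illustration rather than our released implementation: the helper names, the bracketed token format, and the example language codes (`hi`, `mr`, `en`) are assumptions made for the example, and the embedding table is a plain dictionary standing in for the mBART embedding matrix.

```python
def encoder_token(lang, hrl, lrl):
    """Language token fed to the encoder for source text in `lang`.

    Assumption for illustration: tokens are written as "[code]".
    HRL and LRL inputs share the HRL token; other languages keep their own.
    """
    return f"[{hrl}]" if lang in (hrl, lrl) else f"[{lang}]"


def decoder_token(lang):
    """Language token fed to the decoder: every language uses its own token."""
    return f"[{lang}]"


def init_lrl_token_embedding(embeddings, hrl, lrl):
    """Initialize the LRL token embedding with a copy of the HRL token embedding."""
    embeddings[f"[{lrl}]"] = list(embeddings[f"[{hrl}]"])
    return embeddings


# Hypothetical usage with Hindi as the HRL and Marathi as the LRL.
table = init_lrl_token_embedding({"[hi]": [0.1, -0.2, 0.3]}, hrl="hi", lrl="mr")
print(encoder_token("mr", "hi", "mr"), decoder_token("mr"), table["[mr]"])
```

Copying the embedding (rather than aliasing it) lets the two decoder tokens diverge during training while starting from the same point.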
Task 2: Denoising Autoencoding. For this task, we leverage monolingual text by introducing noise to each sentence, feeding the noised sentence into the encoder, and training the model to generate the original sentence. The noise we use is similar to that of Lample et al. (2018a), and includes random shuffling and masking of words. The shuffling is a random permutation of words, where each word is constrained to shift at most 3 positions from its original position. Each word is masked with a uniform probability of 0.1. This task aims to learn a feature space for the languages, so that the encoder and decoder can transform between the features and the sentences. This is especially necessary for the low-resource language if it is not already pretrained in mBART. Adding noise was shown to be crucial to translation performance by Lample et al. (2018a), as it forces the learned feature space to be more robust and to contain high-level semantic knowledge.
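The noise function can be sketched as follows. This is a minimal sketch under stated assumptions: the local shuffle is implemented by sorting positions after adding uniform jitter in [0, k+1), which is the construction of Lample et al. (2018a) and guarantees that no word moves more than k positions; the function name `add_noise` and the mask symbol `<mask>` are hypothetical, and the order in which shuffling and masking are applied is our choice for the example.

```python
import random


def add_noise(words, max_shift=3, mask_prob=0.1, mask_token="<mask>", rng=None):
    """Return a noised copy of `words`: local shuffle, then random masking."""
    rng = rng or random.Random()
    # Local shuffle: sort indices by (index + jitter), jitter drawn from [0, max_shift + 1).
    # A word at position i can then end up at most max_shift positions away.
    keys = [i + rng.uniform(0, max_shift + 1) for i in range(len(words))]
    order = sorted(range(len(words)), key=keys.__getitem__)
    shuffled = [words[i] for i in order]
    # Masking: each word is independently replaced with probability mask_prob.
    return [mask_token if rng.random() < mask_prob else w for w in shuffled]


sentence = "the quick brown fox jumps over the lazy dog".split()
print(add_noise(sentence, rng=random.Random(0)))
```

Passing an explicit `random.Random` instance keeps the noise reproducible, which is convenient when debugging the denoising task.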
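We train the denoising autoencoding on both the low-resource and related high-resource languages and compute the loss as follows:
$$L_{da} = \sum_{i \in \{LRL,\ HRL\}} L_{CE}\big(D(Z_i, [i]),\ X_i\big), \tag{2}$$
where $Z_i = E(N(X_i), [\mathrm{HRL}])$, $N$ is the noising function, and $X_i$ is from the monolingual corpus.
Task 3: Backtranslation. For this task, we train on English to low-resource backtranslation data. The aim of this task is to capture a language-modeling effect in the low-resource language. We describe how we obtain this data using the high-resource translation model to bootstrap backtranslation in Section 3.3.
The objective used is
$$L_{bt} = L_{CE}\big(D(Z'_{En}, [\mathrm{LRL}]),\ X_{LRL}\big), \tag{3}$$
where $Z'_{En} = E(Y_{En}, [\mathrm{En}])$ and $(Y_{En}, X_{LRL})$ is an English to low-resource backtranslation pair.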
Task 4: Adversarial Training. The final task aims to make the encoder output language-agnostic features. The representation is language-agnostic with respect to the noised high- and low-resource languages as well as English. Ideally, the encoder output should contain the semantic information of the sentence and little to no language-specific information. This way, any knowledge learned from the English to high-resource parallel data can be directly applied to generating the low-resource language by simply switching the language token during inference, without capturing spurious correlations (Gu et al., 2019a).
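To adversarially mix the latent space of the encoder among the three languages, we use two critics (discriminators). The critics are recurrent networks to ensure that they can handle variable-length text input. Similar to Gu et al. (2019b), the adversarial component is trained using a Wasserstein loss, which is the difference of expectations between the two types of data. This loss minimizes the earth mover's distance between the distributions of different languages. We compute the loss functions as follows:
$$L_{adv1} = \mathbb{E}\big[\mathrm{Disc}_1(Z_{HRL})\big] - \mathbb{E}\big[\mathrm{Disc}_1(Z_{LRL})\big], \tag{4}$$
$$L_{adv2} = \mathbb{E}\big[\mathrm{Disc}_2(Z_{HRL} \cup Z_{LRL})\big] - \mathbb{E}\big[\mathrm{Disc}_2(Z_{En} \cup Z'_{En})\big], \tag{5}$$
where $\mathrm{Disc}_1$ and $\mathrm{Disc}_2$ denote the two critics, $Z_{HRL}$ and $Z_{LRL}$ are the encoder outputs for the noised high- and low-resource sentences, and $Z_{En}$ and $Z'_{En}$ are the encoder outputs for the English sentences of the parallel and backtranslation data. As shown in Equation 4, the first critic is trained to distinguish between the high- and low-resource languages. Similarly, in Equation 5, the second critic is trained to distinguish between English and non-English (both high- and low-resource languages).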
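As a small numerical illustration of the Wasserstein objective, the sketch below computes the difference of mean critic scores for the two critics. It is a sketch under assumptions, not our training code: the critic scores are given as plain lists of floats (one scalar score per sentence), the function names are hypothetical, and which group is subtracted from which is a sign convention chosen for the example.

```python
from statistics import fmean


def wasserstein_gap(scores_a, scores_b):
    """Difference of expected critic scores between two groups of sentences."""
    return fmean(scores_a) - fmean(scores_b)


def critic_objectives(hrl_scores_1, lrl_scores_1, hrl_scores_2, lrl_scores_2, en_scores_2):
    """Objectives of the two critics.

    Critic 1 separates HRL from LRL encoder outputs.
    Critic 2 separates non-English (HRL and LRL pooled) from English encoder outputs.
    """
    first = wasserstein_gap(hrl_scores_1, lrl_scores_1)
    second = wasserstein_gap(hrl_scores_2 + lrl_scores_2, en_scores_2)
    return first, second


# Hypothetical scores for a tiny batch.
print(critic_objectives([1.0, 3.0], [0.0, 2.0], [1.0, 3.0], [0.0, 2.0], [1.5]))
```

A gap near zero means the critic cannot tell the two groups apart on average, which is the state the encoder is trained to reach.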
Fine-tuning with Backtranslation: Finally, we found that after training with the four tasks concurrently, it is beneficial to fine-tune solely using backtranslation for one pass before inference. We posit that this is because while spurious correlations are reduced by the adversarial training, they are not completely eliminated and using solely the language tokens to control the output language is not sufficient. By fine-tuning on backtranslation, we are further adapting to the target side and encouraging the output probability distribution of the decoder to better match the desired output language.

Low-resource to English
We propose to model translating from the low-resource language to English as a domain adaptation task and design our model based on insights from the domain-adversarial neural network (DANN) (Ganin et al., 2017), a domain adaptation technique widely used in many NLP tasks. This time, we train three tasks simultaneously:
Task 1: Translation. We train high-resource to English translation on parallel data with the goal of adapting this knowledge to translate low-resource sentences. We compute this loss as follows:
$$L_{t} = L_{CE}\big(D(Z_{HRL}, [\mathrm{En}]),\ X_{En}\big), \tag{6}$$
where $Z_{HRL} = E(X_{HRL}, [\mathrm{HRL}])$.
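Task 2: Backtranslation. We train on low-resource to English backtranslation data, which we describe in Section 3.3. The objective is as follows:
$$L_{bt} = L_{CE}\big(D(Z_{LRL}, [\mathrm{En}]),\ X_{En}\big), \tag{7}$$
where $Z_{LRL} = E(Y_{LRL}, [\mathrm{HRL}])$ and $(Y_{LRL}, X_{En})$ is a low-resource to English backtranslation pair.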
Task 3: Adversarial Training. We feed the sentences from the monolingual corpora of the high- and low-resource languages into the encoder, and the encoder output is trained so that its input language cannot be distinguished by a critic. The goal is to encode the low-resource data into a shared space with the high-resource data, so that the decoder trained on the translation task can be directly used. No noise was added to the input, since we did not observe an improvement. There is only one recurrent critic, which uses the Wasserstein loss, computed as follows:
$$L_{adv} = \mathbb{E}\big[\mathrm{Disc}(Z_{HRL})\big] - \mathbb{E}\big[\mathrm{Disc}(Z_{LRL})\big], \tag{8}$$
where $\mathrm{Disc}$ denotes the critic.
Similar to the reverse direction, we initialize NMT-Adapt with a pretrained mBART, and use the same language token for the high-resource and low-resource languages in the encoder.

Iterative Training
We describe how we can alternate between training the into-English and out-of-English models to create better backtranslation data, improving overall quality. The iterative training process is described in Algorithm 1. We first create English to low-resource backtranslation data by fine-tuning mBART on the high-resource to English parallel data. Using this model, we translate monolingual low-resource text into English, treating the low-resource sentences as if they were in the high-resource language. The resulting sentence pairs are used as backtranslation data to train the first iteration of our English to low-resource model.
After training English to low-resource, we use the model to translate the English sentences in the English-HRL parallel data into the low-resource language, and use those sentence pairs as backtranslation data to train the first iteration of our low-resource to English model.
We then use the first low-resource to English model to generate backtranslation pairs for the second English to low-resource model. We iteratively repeat this process of using our model of one direction to improve the other direction.
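The alternation described above can be sketched as a short loop. This is a schematic sketch only: the three callables (`hrl_to_en_translate`, `train_en_to_lrl`, `train_lrl_to_en`) are hypothetical placeholders for the fine-tuned mBART model and the two NMT-Adapt training procedures, and each training callable is assumed to return a translation function.

```python
def iterative_training(lrl_mono, en_parallel_side, hrl_to_en_translate,
                       train_en_to_lrl, train_lrl_to_en, num_rounds=2):
    """Alternate between the two directions, refreshing backtranslation data each time.

    lrl_mono:            monolingual low-resource sentences
    en_parallel_side:    English sentences from the En-HRL parallel data
    hrl_to_en_translate: HRL->En model applied to LRL text as if it were HRL
    """
    # Bootstrap: (synthetic English, real LRL) pairs for the En->LRL model.
    bt_en_to_lrl = [(hrl_to_en_translate(s), s) for s in lrl_mono]
    en_to_lrl = lrl_to_en = None
    for _ in range(num_rounds):
        en_to_lrl = train_en_to_lrl(bt_en_to_lrl)
        # (synthetic LRL, real English) pairs for the LRL->En model.
        bt_lrl_to_en = [(en_to_lrl(e), e) for e in en_parallel_side]
        lrl_to_en = train_lrl_to_en(bt_lrl_to_en)
        # Improved LRL->En model produces data for the next En->LRL round.
        bt_en_to_lrl = [(lrl_to_en(s), s) for s in lrl_mono]
    return en_to_lrl, lrl_to_en
```

In each pair the synthetic side is the model input and the real sentence is the training target, so the decoder is always trained on genuine text.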

Datasets
We experiment on three groups of languages. In each group, we have a large quantity of parallel training data for one language (high-resource) and no parallel data for the related languages, to simulate a low-resource scenario.
Our three groupings include (i) Iberian languages, where we treat Spanish as the high-resource language; (ii) Indic languages, where we treat Hindi as the high-resource language; and (iii) Arabic, where we treat Modern Standard Arabic (MSA) as the high-resource language. The parallel corpus for each language is described in Table 1. Due to the scarcity of any parallel data for a few low-resource languages, we are not able to match the training and testing domains. For monolingual data, we randomly sample 1M sentences for each language from the CC-100 corpus 3 . For quality control, we filter out sentences if more than 40% of the characters in the sentence do not belong to the alphabet set of the language. To satisfy quality and memory constraints, we only use sentences with a length between 30 and 200 characters.
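The two monolingual filters can be written as a single predicate. The sketch below is illustrative: the function name is hypothetical, and the contents of the alphabet set (for example, whether spaces, digits and punctuation count as in-alphabet) are an assumption left to the caller.

```python
def keep_sentence(sentence, alphabet, max_foreign_ratio=0.4, min_len=30, max_len=200):
    """Return True if the sentence passes the length and alphabet filters."""
    # Length filter: keep sentences of 30 to 200 characters.
    if not (min_len <= len(sentence) <= max_len):
        return False
    # Alphabet filter: drop the sentence if more than 40% of its characters
    # fall outside the alphabet set of the language.
    foreign = sum(1 for ch in sentence if ch not in alphabet)
    return foreign / len(sentence) <= max_foreign_ratio


# Hypothetical alphabet set for a Latin-script language; spaces are counted as in-alphabet.
latin = set("abcdefghijklmnopqrstuvwxyz ")
corpus = ["this sentence is long enough to be kept", "1234567890 1234567890 1234567890 ab"]
print([s for s in corpus if keep_sentence(s, latin)])
```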
Collecting Dialectal Arabic Data. While obtaining low-resource monolingual data is relatively straightforward, as language identifiers are often readily available for even low-resource text (Jauhiainen et al., 2019), identifying dialectal data is often less straightforward. This is because many dialects have been traditionally considered oral rather than written, and often lack standardized spelling, contain significant slang, or even lack mutual intelligibility with the main language. In general, dialectal data has often been grouped in with the main language in language classifiers.
3 http://data.statmt.org/cc-100/
We describe the steps we took to obtain reliable dialectal Arabic monolingual data. As the CC-100 corpus does not distinguish between Modern Standard Arabic (MSA) and its dialectal variants, we train a finer-grained classifier that distinguishes between MSA and specific colloquial dialects. We base our language classifier on a BERT model pretrained for Arabic (Safaya et al., 2020) and fine-tune it for six-way classification: (i) Egyptian, (ii) Levantine, (iii) Gulf, (iv) Maghrebi, and (v) Iraqi dialects, as well as (vi) the literary Modern Standard Arabic (MSA). We use the data from Bouamor et al. (2018) and Zaidan and Callison-Burch (2011) as training data, and the resulting classifier has an accuracy of 91% on a held-out set. We then apply our trained Arabic dialect classifier to the Arabic monolingual data from CC-100 and select MSA, Levantine and Egyptian sentences as Arabic monolingual data for our experiments.

Training Details
We use the RMSprop optimizer with learning rate 0.01 for the critics and the Adam optimizer for the rest of the model. We train our model using eight GPUs and a batch size of 1024 tokens per GPU. We update the parameters once per eight batches. For the adversarial task, the generator is trained once per three updates, and the critic is trained every update.
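The update schedule can be made concrete with a short sketch. It is a sketch under assumptions: the function names are hypothetical, and placing the encoder's adversarial step on every third parameter update (rather than, say, the first of every three) is our choice for the example.

```python
def training_schedule(num_batches, accumulation=8, generator_every=3):
    """List (update_index, train_critic, train_generator_adversarially) per parameter update."""
    events = []
    for batch in range(1, num_batches + 1):
        if batch % accumulation != 0:
            continue  # gradients are still being accumulated
        update = batch // accumulation
        # The critic is trained on every update; the generator (encoder)
        # receives the adversarial signal once per `generator_every` updates.
        events.append((update, True, update % generator_every == 0))
    return events


def max_tokens_per_update(gpus=8, tokens_per_gpu=1024, accumulation=8):
    """Upper bound on tokens contributing to one parameter update."""
    return gpus * tokens_per_gpu * accumulation


for event in training_schedule(48):
    print(event)
print(max_tokens_per_update())
```

With the settings above, one parameter update aggregates gradients from at most 8 x 1024 x 8 = 65,536 tokens.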
Each of the tasks of (i) translation, (ii) backtranslation as well as (iii) LRL and HRL denoising (only for En→LRL direction), have the same number of samples and their cross entropy loss has equal weight. The adversarial loss, L adv , has the same weight on the critic, while it has a multiplier of −60 on the generator (encoder). This multiplier was tuned to ensure convergence and is negative as it's opposite to the discriminator loss.
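The weighting described above can be sketched as follows. This is an illustrative sketch, not our training code: the function names are hypothetical, the losses are plain floats, and treating the adversarial term as a single scalar (for example, the sum over critics) is an assumption of the example.

```python
def encoder_decoder_loss(cross_entropy_losses, adversarial_loss, adv_multiplier=-60.0):
    """Loss for the translation model: equally weighted CE terms plus the scaled adversarial term.

    The multiplier is negative because the encoder is trained to oppose the critic.
    """
    return sum(cross_entropy_losses) + adv_multiplier * adversarial_loss


def critic_loss(adversarial_loss):
    """Loss for the critic: the adversarial term with unit weight."""
    return adversarial_loss


# Hypothetical per-task values: translation, backtranslation, LRL and HRL denoising.
ce_terms = [1.0, 2.0, 0.5, 0.5]
print(encoder_decoder_loss(ce_terms, adversarial_loss=0.5), critic_loss(0.5))
```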
For the first iteration, we train 128 epochs from English to the low-resource language and 64 iterations from low-resource language to English. For the second iteration we train 55 epochs for both directions. We follow the setting of Liu et al. (2020) for all other settings and training parameters. The critics consist of four layers: the third layer is a bidirectional GRU and the remaining three are fully connected layers. The hidden layer sizes are 512, 512 and 128 and we use an SELU activation function.
Table 2: BLEU score of the first iteration on the English to low-resource direction. Both the adversarial (Adv) and backtranslation (BT) components contribute to improving the results. The fine-tuning step is omitted for Urdu as decoding is already restricted to a different script-set from the related high-resource language.
We ran experiments on eight GPUs. Each iteration took less than 3 days, and we used publicly available mBART checkpoints for initialization. The GPU memory usage of our method is only slightly larger than that of mBART. While we introduce additional parameters in the discriminators, these additional parameters are insignificant compared to the size of the mBART model.

Results
We present results of applying NMT-Adapt to low-resource language translation.

English to Low-Resource
We first evaluate performance of translating into the low-resource language. We compare the first iteration of NMT-Adapt to the following baseline systems: (i) En→HRL Model: directly using the model trained for En→HRL translation. (ii) Adversarial: Our full model without using the backtranslation objective and without the final fine-tuning.
As seen in Table 2, using only the adversarial component, we generally see improvement in the BLEU scores over using the high-resource translation model. This suggests that our proposed method of combining denoising autoencoding with adversarial loss is effective in adapting to a new target output domain.
Additionally, we observe a large improvement using only backtranslation data. This demonstrates that using the high-resource translation model to create LRL-En backtranslation data is highly effective for adapting to the low-resource target.
We further see that combining the adversarial and backtranslation tasks improves over each individually, showing that the two components are complementary. We also experimented on En-HRL translation with backtranslation but without adversarial loss. However, this yielded much worse results, showing that the improvement is not simply due to multitask learning.
For Arabic, backtranslation provides most of the gain, while for Portuguese and Nepali, the adversarial component is more important. For some languages like Marathi, the two components provide small gains individually, but show a large improvement when combined.
For Urdu, we found that backtranslation using only the Hindi model fails completely; this is intuitive, as Hindi and Urdu are written in completely different scripts, and using a Hindi model to translate Urdu results in effectively random backtranslation data. When we attempt to apply models trained with the adversarial task, the model generates sentences with mixed Hindi, Urdu, and English. To ensure our model solely outputs Urdu, we restricted the output tokens by banning all tokens containing English or Devanagari (Hindi) characters. This allowed our model to output valid and semantically meaningful translations. This is an interesting result, as it shows that our adversarial mixing allows translating similar languages even if they are written in different scripts. We report the BLEU score with the restriction. Since the tokens are already restricted, we skip the final fine-tuning step.

Low-resource to English
Table 3 shows the results of the first iteration of translating from a low-resource language into English. We compare the following systems: (i) HRL→En model: directly using the model trained for HRL→En translation. (ii) Adversarial: similar to our full model, but without using the backtranslation objective. (iii) Backtranslation: mBART fine-tuned on backtranslation data from our full model in the English-LRL direction. (iv) BT+Adv: our full model.
For this direction, we can see that both the backtranslation and the adversarial domain adaptation components are generally effective. The exception is Arabic, which may be due to the noisiness of our dialect classification compared to low-resource language classification. Another reason could be the lack of written standardization for spoken dialects in comparison to low-resource but standardized languages.
For these experiments, we did not apply any special precautions for Urdu in this direction despite it being in a different script from Hindi.

Iterative Training
Table 4 shows the results of two iterations of training. For languages other than Arabic dialects, the second iteration generally shows improvement over the first iteration, showing that we can leverage an improved model in one direction to further improve the reverse direction. We found that the improvement after the third iteration is marginal.
We compare our results with a baseline using the HRL as a pivot. The baseline uses a fine-tuned mBART (Liu et al., 2020) to perform supervised translation between English and the HRL, and uses MASS (Song et al., 2019) to perform unsupervised translation between the HRL and the LRL. The mBART is tuned on the same parallel data used in our method, and the MASS uses the same monolingual data as in our method. For all languages and directions, our method significantly outperforms the pivot baseline.

Comparison with Other Methods
In Table 5, we compare against a cross-translation method using parallel corpora with multiple languages as auxiliary data (Garcia et al., 2020b), as well as results reported by Guzmán et al. (2019) and Liu et al. (2020). All methods use the same test set, English-Hindi parallel corpus, and tokenization for a fair comparison. For English to Nepali, NMT-Adapt outperforms previous unsupervised methods using Hindi or multilingual parallel data, and is competitive with supervised methods. For the Nepali to English direction, our method achieves similar performance to previous unsupervised methods. Note that we use a different tokenization than in Tables 3 and 4, to be consistent with previous work.
Table 6 shows the first-iteration English to Marathi results while varying the amount of monolingual data used. We see that the BLEU score increased from 11.3 to 16.1 as the number of sentences increased from 10k to 1M, showing that additional monolingual data significantly improves performance.

Conclusion
We presented NMT-Adapt, a novel approach for neural machine translation of low-resource languages which assumes no parallel data or bilingual lexicon in the low-resource language. Utilizing parallel data in a similar high-resource language as well as monolingual data in the low-resource language, we apply unsupervised adaptation to facilitate translation to and from the low-resource language. Our approach combines several tasks including adversarial training, denoising language modeling, and iterative backtranslation to facilitate the adaptation. Experiments demonstrate that this combination is more effective than any task on its own and generalizes across many different language groups.