Two Parents, One Child: Dual Transfer for Low-Resource Neural Machine Translation

Neural machine translation suffers when parallel data for training is scarce. Previous works have explored transfer learning to assist training in low-resource scenarios. However, they transfer either from high-resource parallel data, or from monolingual data. In this work, we propose a framework to transfer multiple sources of auxiliary data, including both high-resource parallel data and monolingual data of involved languages. Knowledge in those sources is respectively encoded in a high-resource translation model and pretrained language models, and dually transferred to the low-resource translation model by our approach. Extensive experiments show that our approach yields consistent improvements over strong competitors for multiple translation directions. Furthermore, our approach still exhibits benefit on top of back-translation, making it a useful addition to practitioners' toolbox.


Introduction
Neural machine translation (NMT) has achieved remarkable success in recent years, but its quality critically hinges on large-scale parallel data. In the low-resource scenarios for most world languages and many domains, its performance usually deteriorates dramatically.
Although parallel data for some translation tasks may be difficult to obtain, monolingual data is usually within reach, and often comes in much larger quantity. Besides, parallel data for several high-resource languages is readily available. These corpora have been used in various methods to help train low-resource NMT. The most relevant method to our work is transfer learning.
Transfer learning starts with training a source task and then initializes the target task with the learned parameters. Recent pretrained language models (PLMs) can be seen as transfer learning, where language modeling is the source task for downstream target tasks. In low-resource NMT, pretrained language models have also provided noticeable improvements (Clinchant et al., 2019; Imamura and Sumita, 2019). As another source of transfer, high-resource NMT models have also been used for transfer learning in low-resource NMT. Zoph et al. (2016) pioneered this direction with NMT based on recurrent neural networks, and coined the high-resource and low-resource models as parent and child models, respectively.

[Table 1: transfer approaches and the data they use (Section 4.3). H/L: high/low-resource language pair; M: monolingual; P: parallel. BBERT transfer checks all the boxes but uses data in a different way from ours.]
However, it is non-trivial to transfer from both PLMs and NMT models. This limitation constrains most existing transfer-learning-based low-resource NMT to a single source of auxiliary data, either monolingual or parallel.
In this paper, we propose a framework for transfer learning low-resource NMT that utilizes both monolingual data and high-resource parallel data (Table 1). Our approach encodes monolingual knowledge in parent PLMs and translation knowledge in parent NMT models, and transfers both types of models to the child NMT model. Despite its simplicity, our approach shows consistent gains for multiple translation directions. Furthermore, it possesses several desirable features:
• It performs reasonably well even with little or no parallel data in the language pair of interest, alleviating the data issue for low-resource language pairs.
• It is complementary to back-translation, a strong data augmentation approach.
• It is agnostic to network architectures and thus applicable to any translation models.
• It is widely applicable to low-resource languages and can be applied to domain adaptation.
• The same high-resource NMT model can be used to transfer to future low-resource languages, saving computation.
Background

Transfer from Pretrained Language Models

The "pretraining-finetuning" paradigm has been highly successful for various natural language processing tasks. It first pretrains a language model through self-supervised learning, and then finetunes the model along with additional task-specific layers on downstream task data. Here, we exclude pretrained language models trained by sequence-to-sequence learning to simplify discussion 1 . Common pretrained language models include BERT (Devlin et al., 2019) and GPT (Brown et al., 2020).
In NMT with the encoder-decoder architecture (Sutskever et al., 2014;Bahdanau et al., 2015), the direct application of the "pretraining-finetuning" paradigm would be initializing the encoder with PLM and treating the decoder as task-specific layers. However, it is also possible to initialize the compatible modules in the decoder, leaving the cross attention module randomly initialized. Although initializing the decoder does not appear as useful, especially for high-resource language pairs (Rothe et al., 2020), it is not harmful either.
1 Examples of such models include MASS (Song et al., 2019) and BART (Lewis et al., 2020). If desired, pretrained encoders in these models can be used in our approach.

Transfer from High-Resource Translation Models

Even though the Transformer model (Vaswani et al., 2017) has become more popular than recurrent neural networks for NMT, the transfer procedure proposed by Zoph et al. (2016) still applies as long as the parent model and the child model share the same architecture, which is typically the case. However, one problem persists: because the high-resource languages have different vocabularies from the low-resource ones, directly transferring the word embedding layer is not possible.
One way to circumvent this issue is to prepare a joint vocabulary of the involved languages that is shared between the parent and child NMT models (Kocmi and Bojar, 2018). Known as warm-start transfer (Neubig and Hu, 2018), methods of this type need to prepare a new joint vocabulary whenever a new low-resource model is needed, and retrain both parent and child models. In contrast, cold-start transfer (Kocmi and Bojar, 2020) trains a universal parent NMT model that does not depend on child languages. Kim et al. (2019) addressed the vocabulary mismatch for cold-start transfer by matching word embeddings across languages. They first learn monolingual word embeddings of the child language with e.g. skip-gram (Mikolov et al., 2013), and then learn a cross-lingual linear mapping to connect child monolingual word embeddings and pretrained parent NMT word embeddings. The child monolingual word embeddings can then be mapped to the parent word embedding space, and be used to initialize the child NMT word embeddings. The cross-lingual linear mapping relies on a bilingual lexicon to learn, which can be induced from parent and child language monolingual data by unsupervised methods like that of Lample et al. (2018a).
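As a rough sketch of this mapping step, one can fit the linear map by least squares over lexicon-paired embeddings. This is an illustrative assumption: Kim et al. (2019) may constrain the map differently, e.g. enforcing orthogonality.

```python
import numpy as np

# Illustrative sketch of the cross-lingual mapping in Kim et al. (2019):
# given a bilingual lexicon pairing child-language word embeddings X with
# parent NMT word embeddings Y (one row per lexicon entry), fit a linear
# map W that sends the child space into the parent space. Plain least
# squares is used here; the original method may constrain W differently.

def fit_linear_map(X, Y):
    """Solve min_W ||X @ W - Y||^2 by ordinary least squares."""
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W

def map_child_embeddings(child_emb, W):
    """Project all child embeddings into the parent embedding space."""
    return child_emb @ W
```

The mapped child embeddings can then initialize the child NMT word embeddings in place of the parent ones.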
Our approach also belongs to cold-start transfer in its usage of the parent NMT model. It addresses the vocabulary mismatch by design, without relying on monolingual word embeddings and bilingual lexica.

Approach
Our approach is a general framework for transferring from any high-resource language pair to any low-resource language pair, as long as the data condition permits. Generally speaking, monolingual data and high-resource parallel data are available in large quantity. We first present the general case where we would like to transfer from the high-resource A→B to the low-resource P→Q, where capital letters denote languages. Then we discuss specific cases where some of the involved languages are the same. Figure 1 shows the pipeline of our approach, consisting of four major steps, as detailed below.

General Transfer
(1) Train PLM A and PLM B on monolingual data of A and B separately.
(2) Train PLM P and PLM Q on monolingual data of P and Q as follows.
• Initialize PLM P with PLM A (except word embeddings); freeze parameters other than word embeddings.
• Initialize PLM Q with PLM B (except word embeddings); freeze parameters other than word embeddings.
(3) Train NMT A→B on A→B parallel data as follows: Initialize NMT encoder with PLM A , and decoder with PLM B ; freeze word embeddings during training.
(4) Replace word embeddings as follows to initialize NMT P→Q , and finetune on P→Q parallel data.
• Replace NMT A→B encoder word embeddings with those in PLM P .
• Replace NMT A→B decoder word embeddings with those in PLM Q .
Note that Steps (2) and (3) are independent of each other, and therefore can be done in parallel.
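A minimal sketch of the four steps, using toy parameter dictionaries; names like "body" and "emb" are illustrative and not taken from any released implementation:

```python
import random

def random_embeddings(vocab_size, dim):
    """Placeholder random initialization for word embeddings."""
    return [[random.gauss(0.0, 0.02) for _ in range(dim)]
            for _ in range(vocab_size)]

def train_child_plm(parent_plm, child_vocab_size, dim):
    """Step (2): reuse the frozen parent PLM body; only the freshly
    initialized child word embeddings would be trained."""
    return {"body": parent_plm["body"],                       # frozen
            "emb": random_embeddings(child_vocab_size, dim)}  # trainable

def init_parent_nmt(plm_a, plm_b):
    """Step (3): initialize encoder/decoder from PLM A / PLM B;
    word embeddings stay frozen while the rest trains on A->B data."""
    return {"enc_body": plm_a["body"], "enc_emb": plm_a["emb"],
            "dec_body": plm_b["body"], "dec_emb": plm_b["emb"]}

def init_child_nmt(nmt_ab, plm_p, plm_q):
    """Step (4): swap in the child-language embeddings, keep the
    parent body, then finetune everything on P->Q data."""
    child = dict(nmt_ab)
    child["enc_emb"] = plm_p["emb"]
    child["dec_emb"] = plm_q["emb"]
    return child
```

Freezing is indicated only in comments; in a real framework it would be realized by disabling gradients for the corresponding parameters.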
Intuitively, Step (2) learns word embeddings of P and Q that lie in the same semantic spaces as A and B, respectively. Because only word embeddings are trainable, they are forced to align with the pretrained A and B body parameters to do language modeling (e.g. masked language modeling). In Step (3), NMT A→B needs to learn translation based on the frozen A and B word embedding space. With P and Q word embeddings swapped in place in Step (4), the body and embedding parameters can cooperate in a close semantic space, allowing finetuning to proceed smoothly. Like Kim et al. (2019), our approach solves the vocabulary mismatch issue by manipulation in the embedding space, allowing transfer between arbitrary languages, even with different scripts 2 . Each language now manages its own independent vocabulary. We also tie the input and output embeddings of the decoder (Press and Wolf, 2017), so a single decoder embedding block is shown in Figure 1.
We can further generalize our approach by defining transfer parameters as those responsible for transforming input into continuous representations shared across languages. In Figure 1, the transfer parameters are simply word embeddings, but we may also use other sets of transfer parameters, e.g. word and position embeddings, or even lower layers of the body. In Step (2), only transfer parameters are trainable, while in Step (3), only non-transfer parameters are trainable, and initialization changes accordingly.
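Under this view, selecting transfer parameters reduces to partitioning parameters by name; the prefixes below are hypothetical:

```python
def split_params(param_names, transfer_prefixes=("word_emb",)):
    """Partition parameter names into transfer and non-transfer sets.
    In Step (2) only the transfer set is trainable; in Step (3) only
    the non-transfer set is. Prefixes here are illustrative, not names
    from any actual codebase."""
    transfer = [n for n in param_names if n.startswith(transfer_prefixes)]
    non_transfer = [n for n in param_names
                    if not n.startswith(transfer_prefixes)]
    return transfer, non_transfer
```

Extending the transfer set, e.g. to position embeddings or lower body layers, is just a matter of passing more prefixes.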
Our approach defines a framework for transfer learning, which can be applied to various network architectures. For example, if we would like to train a low-resource RNN-based NMT, we can prepare RNN-based PLMs and a high-resource RNN-based NMT. In our experiments, we use Transformer for PLMs and NMT models.

Shared Target Transfer and Shared Source Transfer

In practice, it is rarely necessary to train on a language pair where both languages are low-resource. Typically one of the two languages is high-resource, e.g. English. In this case, we can choose a high-resource language pair that shares this language on the same side, thereby simplifying our approach.
If the target language (Q) of the low-resource language pair (P→Q) is high-resource, we can choose a high-resource language pair (A→B) with that language as the target, i.e. B=Q. In this case, there is no vocabulary mismatch on the target side, so PLM Q is no longer needed, and decoder word embeddings can be adjusted when training NMT A→B in Step (3). PLM B also becomes optional, and the randomly initialized decoder of NMT A→B may learn sufficiently from abundant A→B parallel data.
Likewise, if the source language (P) is high-resource, we can let A=P. Then PLM P is not needed, and encoder word embeddings are trainable in Step (3). PLM A may also be dispensed with, and the encoder of NMT A→B is randomly initialized.
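The case analysis above can be summarized by a small illustrative helper (the naming is ours, not from the paper's code):

```python
def required_child_plms(A, B, P, Q):
    """Which child PLMs Step (2) must train: one per side where the
    child language differs from the parent language. Shared target
    (B == Q) drops PLM_Q; shared source (A == P) drops PLM_P."""
    needed = []
    if A != P:
        needed.append("PLM_" + P)  # encoder-side vocabulary mismatch
    if B != Q:
        needed.append("PLM_" + Q)  # decoder-side vocabulary mismatch
    return needed
```

General transfer needs both child PLMs; the shared cases need only one.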

Domain Adaptation
By viewing a certain domain as a special language, our approach can also be applied to domain adaptation. In this case, A→B is a high-resource source domain, and P→Q is a low-resource target domain. By definition, this setting is general transfer, because neither B=Q nor A=P holds due to domain difference, although B and Q (and likewise A and P) will typically be the same language.

Experimental Setup
We mainly verify our approach in the more realistic shared target and shared source transfer scenarios. We take German-English (de-en) as the high-resource language pair, while Estonian-English (et-en) and Turkish-English (tr-en) are the low-resource language pairs, which have also been studied in previous works. We report BLEU computed with sacreBLEU (Post, 2018). Further details about data and hyperparameters can be found in Appendices B and C, respectively.

Data
We mainly use data from WMT 2018 4 . We use the preprocessed parallel data for training NMT models. The provided development data includes multi-parallel data for several languages, which we use for fr→es. We collect monolingual data for the involved languages and follow the same preprocessing pipeline. Training data statistics are provided in Table 2. Each language is encoded with byte pair encoding (BPE) (Sennrich et al., 2016b). The BPE codes and vocabularies are learned on each language's monolingual data, and then used to segment the parallel data. Following Kim et al. (2019), we use 50k merge operations for English, and 20k for other languages. Sentences with more than 150 subwords are removed from NMT training.
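For illustration, applying learned BPE merges to a word can be sketched as follows; this is a simplified version of the procedure in Sennrich et al. (2016b), without end-of-word markers or vocabulary thresholds:

```python
def apply_bpe(word, merges):
    """Greedily apply learned BPE merges to a single word.
    `merges` is an ordered list of symbol pairs, highest priority first,
    as produced by BPE learning on monolingual data."""
    symbols = list(word)
    for a, b in merges:
        i = 0
        merged = []
        while i < len(symbols):
            # Merge every adjacent occurrence of the pair (a, b).
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols
```

In practice one would use the subword-nmt toolkit to learn and apply the merge operations.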

Hyperparameters
We use Transformer base as our NMT model, but with slight modifications that follow the implementation of BERT 5 . The absolute position embeddings are also learned as in BERT. We apply dropout with probability 0.1. The learning rate warms up for 16,000 steps and then follows inverse square root decay. The peak learning rate is 7 × 10 −4 for the high-resource de-en. For other translation tasks, we grid search over {1, 3, 5} × 10 −4 for each approach in every experiment, and keep the best model based on development BLEU. We use 8 GPUs for de-en, and 1 GPU otherwise. Other hyperparameters follow Kim et al. (2019).

We train BERT as the PLM in our experiments, with the same number of layers and hidden size as Transformer base. The absolute position embeddings are learned up to length 128. We only train with masked language modeling and dispense with next sentence prediction. We train for 480k steps with batch size 180 on 8 GPUs. The peak learning rate is 1.8 × 10 −4 , and the number of warmup steps is 18,000.

Rothe et al. (2020) found that for the high-resource de-en pair, initializing the decoder with a PLM has no advantage over random initialization. Therefore, we only used PLM de for de→en, but for en→de, we used both PLM en and PLM de because the vocabulary mismatch is on the target side.
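The schedule just described (linear warmup, then inverse square-root decay) can be written as a small function; the exact boundary handling in the actual implementation may differ:

```python
def learning_rate(step, peak_lr=7e-4, warmup_steps=16000):
    """Linear warmup to peak_lr over warmup_steps, then inverse
    square-root decay, as used for the high-resource de-en model."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (warmup_steps / step) ** 0.5
```

For the low-resource tasks, the same shape is used with peak_lr grid-searched over {1, 3, 5} × 10^-4.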

Baselines
We compare with the following approaches.
No transfer: This baseline trains directly on the low-resource parallel data.

Zoph et al. (2016): This approach transfers from the high-resource language pair. In the original paper, random parent word embeddings are used to initialize child word embeddings. We simply initialize child word embeddings with the truncated normal initializer.
Kim et al. (2019): This approach transfers from the high-resource language pair and utilizes cross-lingual word embeddings. The authors also proposed other orthogonal data augmentation techniques, but we do not include them in our experiments.

BERT2RND: This approach transfers from the source language PLM trained on monolingual data. By comparing with BERT2BERT, we can see if the finding in (Rothe et al., 2020) holds for low-resource language pairs.

BERT2BERT: This approach transfers from the source and target language PLMs trained on monolingual data. Note that PLMs for BERT2BERT and BERT2RND are directly trained on monolingual data of P and Q, different from those obtained by Step (2) of our approach.

5 https://github.com/google-research/bert
As discussed in (Kim et al., 2019), managing independent vocabularies for each language has the advantage of flexibility. However, many approaches rely on shared vocabulary. We nevertheless report their performance for reference.
Kocmi and Bojar (2018): This approach uses a joint vocabulary of all the involved languages. It first trains the NMT model on the high-resource parallel data, and then finetunes it on the low-resource parallel data. It can be seen as multilingual NMT in which high-resource performance does not matter. We experiment with transferring from de→en to et→en, thus involving three languages. We learn joint BPE with 90k merge operations.
BBERT2BBERT: Multilingual PLMs usually rely on a shared vocabulary, and bilingual BERT (BBERT) is an example trained on non-parallel data of two languages. We learn joint BPE with 70k merge operations for the source and target languages of the low-resource language pair, and the same vocabulary is used for the source and target sides of NMT. Otherwise this approach is the same as BERT2BERT, with BBERT used in place of PLM et and PLM en when transferring from de→en to et→en.

[Table 3: BLEU on et→en, with the best in bold. "✓" in the "V" column indicates independent vocabulary, while "✗" means the approach relies on shared vocabulary. Our approach (dual transfer) has two variants, with or without position embeddings in the transfer parameters.]
In their experiments, Zoph et al. (2016) and Kim et al. (2019) only considered shared target transfer, and found that freezing certain components of the decoder during finetuning can be beneficial. In our et→en experiment, we tried freezing the decoder word and position embeddings, and optionally self attention parameters, for their approaches, our approach, and BERT2BERT. Development set results revealed that the only setting which brought improvement was freezing word and position embeddings and self attention parameters for Kim et al. (2019), possibly due to the relatively large size of the et→en data. Therefore we only use this setting for Kim et al. (2019) in our experiments.

Results
In this section, we first report extensive experiments on et→en before generalizing to other translation directions. We then present the performance of our approach when used in conjunction with back-translation and self training. Finally we demonstrate that our approach can be used for domain adaptation. Table 3 shows the BLEU scores for et→en. We report the following findings for this translation direction. The approach in (Zoph et al., 2016) only uses high-resource parallel data for transfer, and the approach in (Kim et al., 2019) additionally uses low-resource monolingual data; their BLEU scores are close to the "no transfer" baseline. The approach in (Kocmi and Bojar, 2018) shows positive transfer from high-resource parallel data by forgoing the vocabulary flexibility and relying on joint vocabulary.

Results on et→en
Using monolingual data, BERT2RND and BERT2BERT show notable improvement over the "no transfer" baseline. In this relatively low-resource setting, it appears useful to initialize the decoder with BERT, in contrast to the de-en experiments in (Rothe et al., 2020).
We expected additionally transferring position embeddings to better deal with word order divergence across languages, but after comparing the two variants of our approach, we find no benefit in including position embeddings in the transfer parameters. Our approach with word embeddings as transfer parameters achieves the best BLEU, a 3.05 improvement over the "no transfer" baseline, and 1.37 over BERT2BERT. Note that we did not use monolingual English data for our approach when the target language is English.

Effect of Low-Resource Parallel Data Size

Arguably, the parallel training data for et→en is not truly low-resource. But it provides a good test bed for manually adjusting the data size to simulate various degrees of resource scarcity. We sample subsets of {1, 5, 10, 50, 100, 500, 1000}×10 3 parallel sentence pairs, and show BLEU of different approaches in Figure 2. We observe a roughly monotonic trend of BLEU with respect to parallel data size, as expected. Our approach performs consistently better than the baselines, and the gap is larger with fewer parallel sentence pairs. In the extremely low-resource setting of one thousand pairs, our approach still achieves BLEU close to 10, while all other approaches fail with BLEU close to 0.

Zero-Shot Translation
Our approach can also be modified slightly to perform zero-shot translation. We conjecture that in Step (3) of our approach, freezing the embeddings alone is insufficient to prevent encoder body parameters from drifting too far away. Therefore we try freezing the entire encoder in Step (3). This technique helps our approach achieve a zero-shot BLEU score of 6.20, as shown in Table 4. However, it offers no advantage when parallel data is available.

Table 5 shows the results that include shared target transfer, shared source transfer, and general transfer, comparing our approach with no transfer and BERT2BERT. Our approach consistently outperforms the baselines. Previous works (Zoph et al., 2016; Kim et al., 2019) typically conducted experiments on shared target transfer only, and shared source transfer is considered more difficult (Kocmi, 2020), but our approach works well for shared source transfer as well as general transfer. Also note that we use the same de-en pair for all child languages from diverse language families, which demonstrates the robustness of our approach. It also highlights the advantage of independent vocabularies: we can prepare NMT de→en and NMT en→de for any future child language, while approaches like Kocmi and Bojar (2018) and BBERT transfer have to retrain with the high-resource language every time a new low-resource language is needed.

Back-Translation and Self Training
Back-translation (BT) (Sennrich et al., 2016a) and self training (ST) (Zhang and Zong, 2016) are data augmentation techniques that generate synthetic parallel data, using target language monolingual data and source language monolingual data, respectively. We first experiment with ST for et→en. We use the "no transfer" NMT et→en to translate 4m et monolingual sentences into en by greedy decoding, and merge them with the authentic parallel data. Results in Table 6 show that self training is not helpful for this experiment, and considerably lowers the BLEU of our approach.

We then use the same synthetic parallel data for en→et, turning to the case of BT. The upper rows in Table 7 show that BT is highly beneficial for both the baseline and our approach. Encouraged by this, we further try using all 130m et monolingual sentences, keeping lines with at most 80 tokens and 100 subwords. We upsample the authentic data to a 1:4 ratio with the synthetic data, following Caswell et al. (2019). The lower rows in Table 7 show that more BT data can further improve the "no transfer" baseline, though the small improvement appears unattractive considering the cost. As for our approach, going from 4m to 130m yields no gain. Besides, our approach with 4m BT still surpasses no transfer with 130m BT. We conjecture that our approach can work complementarily with a manageable amount of BT data, reducing the need to decode and train on a huge data size.
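The 1:4 authentic-to-synthetic mix can be sketched by upsampling the authentic data; the rounding rule here is illustrative:

```python
def mix_bt_data(authentic, synthetic, ratio=4):
    """Upsample authentic sentence pairs so that authentic : synthetic
    is roughly 1 : ratio, following the setup of Caswell et al. (2019).
    Tagging of synthetic data, if used, is omitted here."""
    copies = max(1, round(len(synthetic) / (ratio * len(authentic))))
    return authentic * copies + synthetic
```

With 130m synthetic pairs, this replicates the authentic corpus enough times to keep its relative weight during training.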
Finally, note that we use the "no transfer" NMT et→en to generate all synthetic parallel data in our experiments. In practice, the model produced by our approach can be used for decoding, which should result in higher-quality synthetic data. This might also be the reason that ST hurts our approach more than the "no transfer" baseline.

Domain Adaptation
A simple and effective approach to domain adaptation is finetuning source domain NMT on target domain data (Luong and Manning, 2015; Freitag and Al-Onaizan, 2016). This approach is possible because directly inheriting the parent NMT vocabulary is acceptable for domain adaptation. In other words, this is a special case of Kocmi and Bojar (2018) where the child vocabulary largely overlaps with the parent vocabulary. However, our approach allows using a dedicated vocabulary for the target domain. In this case, we learn BPE on the target domain monolingual data with the same number of merge operations as in the source domain. Table 8 shows that our approach can surpass the baselines, especially with the child (medical domain) vocabulary.
Related Work

Transfer learning usually utilizes a single source of knowledge. When multiple sources are available, transfer learning may be applied in a cascaded fashion (Lakew et al., 2018), but catastrophic forgetting may need to be addressed. Maimaiti et al. (2019) proposed multi-round transfer by performing transfer learning for several rounds on multiple high-resource language pairs.
Multilingual NMT (Johnson et al., 2017; Dabre et al., 2019) aims to perform translation for multiple translation pairs in a single model, and positive transfer towards low-resource language pairs typically occurs. In our experiments, we have considered a variant that solely focuses on the low-resource pair (Kocmi and Bojar, 2018; Nguyen and Chiang, 2017).
Outside NMT, Artetxe et al. (2020) proposed a similar partial freezing approach to transferring BERT cross-lingually. As they worked on BERT (a Transformer encoder) for natural language understanding tasks, several differences from our work arise. First, we need to consider the initialization of the decoder for NMT, and for the shared source case, we need to deal with vocabulary mismatch on the decoder side. Second, we find that additionally transferring position embeddings is not helpful in our experiments. Third, our approach can outperform BBERT transfer, whereas they observed slightly lower performance in their experiments.

Conclusion and Future Work
In this work, we propose a framework for transferring from both pretrained language models and neural machine translation models, so that both monolingual data and high-resource parallel data can be used to assist low-resource training. Our approach shows consistent usefulness in a variety of experiments, while also enjoying the flexibility of independent vocabulary.
Recently, a deep encoder and shallow decoder architecture has been shown to achieve comparable translation quality with faster decoding speed (Kasai et al., 2020). While our approach can be applied to such architectures, a shallow decoder means that transfer on the decoder side will be limited by the shallow PLM, which is particularly severe for shared source transfer. In future work we would like to investigate how to work around this issue.

Preparing parent models increases the cost of computation. However, we have highlighted the benefit of cold-start transfer: trained high-resource NMT models can be reused for future transfer. For example, we can reuse NMT de→en and NMT en→de for a future low-resource language X translating to and from English, and PLM X can be used for both directions if the encoder and the decoder have the same number of layers. We hope such reuse can amortize the cost of preparing parent models. We release the code to facilitate future transfer at https://github.com/huawei-noah/noah-research/tree/master/noahnmt/dual-transfer.
Besides, our experiments indicate that our approach can reduce the required amount of back-translation data, which matters because producing back-translation data and training on the augmented data are both costly.

[Table 10: development and test data.
language pair | training | dev | test
fr-es | newstest2008-2011 | newstest2012 | newstest2013
de-en medical | EMEA | random 3k of EMEA | random 3k of EMEA]

[Table 11: Runtime of each step in dual transfer (word) for NMT et→en. The runtime of the "no transfer" baseline for this language pair is also listed.]

B Data Source and Preprocessing
We list the data source in Tables 9 and 10. Most of the data is from WMT 2018, unless otherwise noted. Medical data is from WMT 2014 medical translation task 7 . The French and Spanish monolingual data is from WMT 2013 news translation task 8 .
All data sets are deduplicated. The Turkish monolingual data is further cleaned by removing lines with more than half non-Turkish characters, and we only use a subset with 100m lines.

7 http://statmt.org/wmt14/medical-task/
8 http://statmt.org/wmt13/translation-task.html
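A cleaning heuristic of this kind might look like the following sketch; the exact character set and threshold are illustrative assumptions, not the rule actually used:

```python
# Illustrative filter: keep a line only if at least half of its
# alphabetic characters are Turkish letters. The letter set below is
# the lowercase Turkish alphabet and is an assumption for this sketch.
TURKISH_LETTERS = set("abcçdefgğhıijklmnoöprsştuüvyz")

def keep_line(line):
    """Return True if the line looks mostly Turkish."""
    letters = [c for c in line.lower() if c.isalpha()]
    if not letters:
        return False  # drop lines with no alphabetic content
    turkish = sum(c in TURKISH_LETTERS for c in letters)
    return turkish * 2 >= len(letters)
```

Such a filter is cheap enough to stream over hundreds of millions of lines before subsampling to 100m.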

C Hyperparameters and Development Performance
As we grid search learning rates in {1, 3, 5} × 10 −4 , we report the best found learning rate and the corresponding development BLEU in Tables 12, 13, and 14. The development BLEU is calculated by tokenized multi-bleu.perl. Due to the large scale of the 130m BT experiment, we directly use the best learning rates for 4m BT, and set other hyperparameters as in high-resource NMT.