Unsupervised Neural Machine Translation for Low-Resource Domains via Meta-Learning

Unsupervised machine translation, which utilizes unpaired monolingual corpora as training data, has achieved performance comparable to that of supervised machine translation. However, it still underperforms in data-scarce domains. To address this issue, this paper presents a novel meta-learning algorithm for unsupervised neural machine translation (UNMT) that trains the model to adapt to another domain using only a small amount of training data. We assume that domain-general knowledge is a significant factor in handling data-scarce domains. Hence, we extend the meta-learning algorithm, which exploits knowledge learned from high-resource domains, to boost the performance of low-resource UNMT. Our model surpasses a transfer learning-based approach by up to 2-3 BLEU scores. Extensive experimental results show that our proposed algorithm is well suited for fast adaptation and consistently outperforms other baselines.


Introduction
Unsupervised neural machine translation (UNMT) leverages unpaired monolingual corpora for its training, without requiring an already labeled, parallel corpus. Recently, the state of the art in UNMT (Conneau and Lample, 2019; Song et al., 2019; Ren et al., 2019) has achieved performance comparable to that of supervised neural machine translation (NMT) approaches. In contrast to supervised NMT, which uses a parallel corpus, training a UNMT model requires a large number of monolingual sentences (e.g., 1M-3M sentences). This prerequisite limits UNMT's applicability to low-resource domains, especially for domain-specific document translation tasks. Since gathering or creating those documents requires domain-specific knowledge, the monolingual data themselves are scarce and expensive. In addition, minority languages (e.g., Uzbek and Nepali) make the problem of data scarcity even worse. (* Equal contributions. † This work was done in NAVER Corp. Our code is available at https://github.com/papago-lab/MetaGUMT.)
Yet, UNMT for low-resource domains is not an actively explored field. One naive approach is to train a model on high-resource domains (e.g., economy and sports) while hoping the model will generalize to an unseen low-resource domain (e.g., medicine). However, recent studies have shown that non-trivial domain mismatch can significantly degrade translation accuracy on supervised NMT tasks (Koehn and Knowles, 2017).
Another reasonable approach is transfer learning, particularly domain adaptation, which has shown performance improvements in the supervised NMT literature (Freitag and Al-Onaizan, 2016; Zeng et al., 2019). In this approach, the model is first pretrained using data from existing domains and then finetuned on a new domain. However, this approach can suffer from overfitting and catastrophic forgetting due to the small amount of training data and the large domain gap.
As an effective method for handling a small amount of training data, meta-learning has shown its superiority in various NLP tasks such as dialog generation, machine translation, and natural language understanding (Qian and Yu, 2019; Gu et al., 2018; Dou et al., 2019). In general, the meta-learning approach is strongly affected by the number of different tasks, where tasks are defined as languages or domains in the aforementioned studies. In practice, however, the previous studies may struggle to gather enough data to define tasks because they rely on supervised models that require labeled corpora. In this respect, we argue that applying a meta-learning approach to an unsupervised model is more feasible and achievable than to a supervised model, because multiple different tasks can be defined with unlabeled corpora. Therefore, we introduce a new meta-learning approach for UNMT, called MetaUMT, for low-resource domains, defining each task as a domain.
The objective of MetaUMT is to find the optimal initialization for the model parameters that can quickly adapt to a new domain even with only a small amount of monolingual data. As shown in Fig. 1 (a), we define two different training phases, a meta-train and a meta-test phase, and simulate the domain adaptation process to obtain optimally initialized parameters. Specifically, the meta-train phase adapts model parameters to a domain, while the meta-test phase optimizes the parameters obtained from the meta-train phase. After obtaining optimally initialized parameters through these two phases, we fine-tune the model on a target domain (i.e., a low-resource domain).
Although the initial parameters optimized through MetaUMT are suitable for adapting to a low-resource domain, these parameters may not fully maintain the knowledge of high-resource domains. Concretely, in the meta-test phase, MetaUMT optimizes the initial parameters using the adapted parameters; however, it discards the meta-train knowledge used to update the adapted parameters in the meta-train phase. Therefore, instead of validating on the same domain used in the meta-train phase, we inject generalizable knowledge into the initial parameters by utilizing another domain in the meta-test phase. This prevents the overfitting caused by data scarcity.
As shown in Fig. 1 (b), we propose an improved meta-learning approach for low-resource UNMT, called MetaGUMT, which explicitly infuses common knowledge across multiple source domains as well as generalizable knowledge from one particular domain to another. In other words, we not only encourage the model to find optimally initialized parameters that can quickly adapt to a target domain with low-resource data, but also encourage the model to maintain common knowledge (e.g., general words such as determiners, conjunctions, and pronouns) obtainable from multiple source domains. Furthermore, with only a small amount of training data in a low-resource domain, the model can suffer from overfitting; we handle this by leveraging generalizable knowledge that transfers from one domain to another. Our proposed meta-learning approach demonstrates consistent improvements over the baseline models.
Overall, our contributions can be summarized as follows: • We apply a meta-learning approach to UNMT. To the best of our knowledge, this is the first study to use a meta-learning approach for UNMT, and we argue this approach is better suited to UNMT than to supervised NMT.
• We empirically demonstrate that our enhanced method, MetaGUMT, shows fast convergence on both pre-training (i.e., meta-learning with source domains) and finetuning (i.e., adapting to a target domain).
• The model trained with MetaGUMT consistently outperforms all baseline models including MetaUMT. This demonstrates that finding optimally initialized parameters that incorporate high-resource domain knowledge and generalizable knowledge is significant in handling a low-resource domain.

Related Work
Our study leverages two components from the natural language processing (NLP) domain: low-resource NMT and meta-learning. In this section, we discuss previous studies by concentrating on these two main components.

Low-Resource Neural Machine Translation
Based on the success of attention-based models (Luong et al., 2015; Vaswani et al., 2017), NMT has obtained significant improvements on numerous language datasets, even showing promising results (Wu et al.) on others. However, the performance of NMT models depends on the size of the parallel dataset (Koehn and Knowles, 2017). To address this problem, one conventional approach is to utilize monolingual datasets. Recent studies point out the difficulty of gathering parallel data, whereas monolingual datasets are relatively easy to collect. To exploit monolingual corpora, several studies apply dual learning (He et al., 2016), back-translation (Sennrich et al., 2016b), and pretraining the model with bilingual corpora (Hu et al., 2019; Wei et al., 2020). Furthermore, as a more challenging scenario, recent studies propose UNMT methods that use no parallel corpora at all (Lample et al., 2018a; Artetxe et al., 2018). Following supervised NMT, the UNMT models achieve comparable performances by extending the back-translation method (Conneau et al., 2018) and incorporating methods such as shared Byte Pair Encoding (BPE) (Lample et al., 2018b) and cross-lingual representations (Conneau and Lample, 2019).

Figure 1: An illustration of the high-level training process for both MetaUMT and MetaGUMT. In the case of MetaGUMT, the training process is divided into two phases, a meta-train phase and a meta-test phase. The objective of the meta-train phase is to obtain adapted parameters (i.e., φ) from the initial, unadapted parameters by minimizing a meta-train loss (i.e., L[D tr N]). N represents the number of domains; D tr indicates meta-train data. In the meta-test phase, we optimize the initial parameters θ through φ by minimizing two losses, the meta-train and meta-test losses, i.e., L[D tr N] and L[D ts N; D ts other]. D ts represents meta-test data; D other is the domain data other than D N.
However, since these approaches require plenty of monolingual datasets, they suffer in a low-resource domain.
Transferring knowledge from high-resource domains to a low-resource domain is one alternative way to address this challenge. A few studies concentrate on transferring knowledge from high-resource corpora into low-resource ones, and several models (Chu and Wang, 2018; Hu et al., 2019) show better performances than when trained with the low-resource corpora only. However, these approaches are applicable only in specific scenarios where one or both of the source and target domains consist of a parallel corpus.
To address these issues, we define a new task: unsupervised domain adaptation to a low-resource dataset. Our setting is more challenging than those of previous studies, since we assume that both the low-resource target domain and the source domain corpora are monolingual.

Meta Learning
Given a small amount of training data, most machine learning models are prone to overfitting and thus fail to find a generalizable solution. To handle this issue, meta-learning approaches seek to adapt quickly and accurately to a low-resource task, and show impressive results in various domains (Finn et al., 2017; Javed and White, 2019). Meta-learning approaches aim to find an optimal initialization of the model parameters that adapts the model to a low-resource dataset within a few iterations of training (Finn et al., 2017; Ravi and Larochelle, 2016). Owing to this success, recent studies apply meta-learning to low-resource NMT tasks, including multilingual NMT (Gu et al., 2018) and domain adaptation (Li et al., 2020). These studies assume that all the training corpora consist of parallel sentences. In contrast, a recent work utilizes the meta-learning approach to find a generalized model for multiple target tasks; however, it is not focused on adapting to a specific target task, since its main goal is to handle the target task without using any low-resource data.
Our study attempts to address low-resource UNMT by exploiting meta-learning approaches. Moreover, we present two novel losses that encourage incorporating high-resource knowledge and generalizable knowledge into the model parameters. Our proposed approaches show significant performance improvements in adapting to a low-resource target domain.

Unsupervised Neural Machine Translation
In this section, we first introduce the notation of general UNMT models. We then describe the three steps of the UNMT task: initialization, language modeling, and back-translation. For each of these three steps, we illustrate how it contributes to improving the performance of UNMT.
Notations. We denote S and T as a source and a target monolingual language dataset. x and y represent the source and the target sentences from S and T. We assume the NMT model is parameterized by θ. We also denote M s→s and M t→t as language models in a source and a target language, respectively, while denoting M s→t and M t→s as the machine translation models from the source to the target language and vice versa.
Initialization. A recent UNMT model (Lample et al., 2018b) is based on a shared encoder and decoder architecture for the source and the target language. Because the encoder and decoder are shared across languages, initializing their parameters is an important step for competitive performance (Conneau et al., 2018; Lample et al., 2018a; Artetxe et al., 2018). Conneau and Lample (2019) propose the XLM (cross-lingual language model) to initialize parameters, showing significantly improved performance for UNMT. Among the various initialization methods, we leverage the XLM as our initialization method.
Language modeling. We use a denoising autoencoder (Vincent et al., 2008) to train the UNMT model, reconstructing an original sentence from a noisy version of it in a given language. The objective function is defined as

L lm = E x∼S [−log M s→s (x | C(x))] + E y∼T [−log M t→t (y | C(y))], (1)

where C is a noise function described in (Lample et al., 2018b), which randomly drops or swaps words in a given sentence. By reconstructing the sentence from its noisy version, the model learns a language model for each language.
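To make the denoising objective concrete, the noise function C can be sketched as below. This is a minimal illustration of word dropping and local shuffling; the function name and the parameters `drop_prob` and `shuffle_k` are our illustrative choices, not the authors' exact configuration.

```python
import random

def add_noise(tokens, drop_prob=0.1, shuffle_k=3, seed=0):
    """Sketch of the noise function C: randomly drop words, then
    locally shuffle the survivors within a small window."""
    rng = random.Random(seed)
    # Drop each word independently with probability drop_prob
    # (keep at least one token so the sentence is never empty).
    kept = [t for t in tokens if rng.random() > drop_prob] or tokens[:1]
    # Local shuffle: each word may move at most ~shuffle_k positions,
    # implemented by sorting on jittered position keys.
    keys = [i + rng.uniform(0, shuffle_k) for i in range(len(kept))]
    return [t for _, t in sorted(zip(keys, kept), key=lambda p: p[0])]

sentence = "the model reconstructs the original sentence".split()
print(add_noise(sentence))
```

The denoising autoencoder is then trained to map `add_noise(x)` back to `x`.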
Back-translation. Back-translation helps the model learn the mapping functions between the source and the target language using only monolingual sentences. For example, we sample sentences x and y from the source language S and the target language T. To make a pseudo-parallel pair from the sampled source sentence, we translate it into the target language, y′ = M s→t (x), resulting in the pseudo-parallel sentence pair (x, y′). Similarly, we obtain (x′, y), where x′ = M t→s (y) is the translation of the target sentence. We do not back-propagate through the generation of the pseudo-parallel sentence pairs. In short, the back-translation objective function is

L bt = E y∼T [−log M s→t (y | M t→s (y))] + E x∼S [−log M t→s (x | M s→t (x))]. (2)
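The formation of pseudo-parallel pairs can be sketched as follows, with a toy word-level lexicon standing in for the actual M t→s model; `make_pseudo_pairs` and the lexicon are illustrative names, not the paper's code. In a real implementation the generation step would be wrapped so that no gradient flows through it.

```python
def m_t2s(y_tokens, lexicon):
    # Toy stand-in for the current target-to-source model M_t2s:
    # a word-by-word lexicon lookup (unknown words pass through).
    return [lexicon.get(w, w) for w in y_tokens]

def make_pseudo_pairs(target_batch, lexicon):
    """Generate (x', y) pairs: x' = M_t2s(y) with no backprop,
    then M_s2t is trained to map x' back to y."""
    pairs = []
    for y in target_batch:
        x_prime = m_t2s(y, lexicon)  # generation step, gradients stopped
        pairs.append((x_prime, y))
    return pairs

lexicon = {"haus": "house", "hund": "dog"}
pairs = make_pseudo_pairs([["das", "haus"], ["der", "hund"]], lexicon)
print(pairs)  # pseudo-sources paired with the original targets
```

The symmetric direction, building (x, y′) pairs to train M t→s, works the same way with the roles of the two models swapped.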

Proposed Approach
This section first explains our formulation of a low-resource unsupervised machine translation task to which we can apply a meta-learning approach. Afterwards, we elaborate on our proposed methods, MetaUMT and MetaGUMT. We utilize the meta-learning approach to address the low-resource challenge for unsupervised machine translation, and we extend MetaUMT into MetaGUMT to explicitly incorporate learned knowledge from multiple domains. Finn et al. (2017) assume multiple different tasks to find initial parameters that can quickly adapt to a new task using only a few training examples. In this paper, we consider the tasks in meta-learning to be domains, where D out = {D 1 out, ..., D n out} represents n out-domain datasets (i.e., source domain datasets), and D in indicates an in-domain dataset (i.e., a target domain dataset), which can be a dataset from an arbitrary domain not included in D out. Each domain in both D out and D in is assumed to be composed of unpaired language corpora, and we create D in as a low-resource monolingual dataset 1. To adapt our model to the low-resource in-domain data, we finetune the UNMT model by minimizing both losses described in Eqs. (1) and (2) with D in.
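The construction of the low-resource D in (about 5,000 tokens, per the footnote) can be sketched as below. The function and parameter names are ours, for illustration only: sentences are sampled until a token budget is reached.

```python
import random

def sample_low_resource(sentences, token_budget=5000, seed=0):
    """Sketch of building the low-resource in-domain set D_in:
    randomly sample sentences until roughly token_budget tokens."""
    rng = random.Random(seed)
    pool = sentences[:]
    rng.shuffle(pool)
    chosen, count = [], 0
    for s in pool:
        n = len(s.split())
        if count + n > token_budget:
            break
        chosen.append(s)
        count += n
    return chosen

# 3-token sentences, budget of 12 tokens -> 4 sentences are kept.
print(len(sample_low_resource(["a b c"] * 100, token_budget=12)))  # 4
```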

MetaUMT
In order to obtain an optimal initialization of the model parameters, allowing the model to quickly adapt to a new domain with only a small amount of monolingual training data, MetaUMT uses two training phases, the meta-train phase and the meta-test phase. During the meta-train phase, the model first learns domain-specific knowledge by updating the initial model parameters θ to temporary model parameters φ i, i.e., adapted parameters. Then, in the meta-test phase, the model learns the adaptation itself by optimizing θ with respect to φ i. From the domain adaptation perspective, the two phases simulate the domain adaptation process: the model first adapts to a specific domain in the meta-train phase, and this adaptation is evaluated in the meta-test phase.
Meta-train phase. We obtain φ i for each i-th out-domain dataset by one-step gradient descent:

φ i = θ − α∇ θ L s (θ; D i out), (3)

where D i out is the i-th out-domain dataset, and α is the learning rate for the meta-train phase. As previously discussed in Section 3, the language modeling and back-translation losses are essential in facilitating unsupervised machine translation. Hence, L s consists of L lm and L bt,

L s = L lm + L bt, (4)

where each loss function is computed with D i out.

Meta-test phase. The objective of the meta-test phase is to update θ using each φ i learned from the meta-train phase, again via each L s (·; D i out). We call this update a meta-update, defined as

θ ← θ − β∇ θ ∑ i L s (φ i ; D i out), (5)

where β is another learning rate for the meta-test phase. Since Eq. (5) requires the second-order gradient, the equation is simplified by replacing the second-order term with its first-order approximation. Finn et al. (2017) showed that the first-order approximation of meta-learning maintains the performance while minimizing the computational cost.

1 We randomly sample 5,000 tokens (∼ 300 sentences) from the in-domain training dataset.
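The two phases can be illustrated numerically. Below is a minimal first-order sketch of the meta-train adaptation (Eq. (3)) and meta-update (Eq. (5)) on toy scalar losses, one quadratic per "domain", standing in for the actual UNMT loss L s = L lm + L bt; all names and constants are illustrative.

```python
def meta_step(theta, domain_grads, alpha=0.1, beta=0.1):
    """One MetaUMT step: adapt per domain, then meta-update theta."""
    total_grad = 0.0
    for g in domain_grads:
        phi = theta - alpha * g(theta)  # meta-train: one-step adaptation (Eq. (3))
        total_grad += g(phi)            # first-order approx. of the meta-gradient
    return theta - beta * total_grad    # meta-test: meta-update (Eq. (5))

# Each toy "domain" loss is (theta - c)^2, whose gradient is 2 * (theta - c).
domain_grads = [lambda t, c=c: 2.0 * (t - c) for c in (1.0, 2.0, 3.0)]
theta = 0.0
for _ in range(200):
    theta = meta_step(theta, domain_grads)
print(round(theta, 3))  # settles near 2.0, the mean of the three optima
```

With symmetric toy losses, the meta-update drives θ toward a point from which every domain is reachable in one adaptation step, which is the intuition behind the optimal initialization.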

MetaGUMT
To handle the data scarcity issue from a meta-learning perspective, it is critical that the initialized model can adapt to a data-scarce domain. However, since the small amount of training data in the new domain may cause the model to overfit and prevent it from utilizing high-resource domain knowledge, it is important to incorporate high-resource domain knowledge and generalizable knowledge into the model parameters. To this end, we extend the existing meta-learning approach with two novel losses, which we call the aggregated meta-train loss and the cross-domain loss. The former incorporates high-resource domain knowledge into the model parameters, while the latter encourages our model, after being trained on a particular domain, to still generalize well to another domain, i.e., cross-domain generalization.

Meta-train phase. As shown in Fig. 2 (C), via Eqs. (3) and (4), we obtain φ i from each i-th out-domain dataset. Since this phase is exactly the same as the meta-train phase of MetaUMT, we omit the details.
Meta-test phase. The aggregated meta-train loss, shown in Fig. 2 (D), is computed using all out-domain datasets, i.e.,

L ag = ∑ i L s (φ i ; D i out). (6)

In addition, we introduce a cross-domain loss, also shown in Fig. 2 (D), as

L cd = ∑ i L s (φ i ; D i out ∪ D i other), (7)

i.e., computing the cross-domain loss with the data from D i out as well as those from the other domains D i other. To obtain the optimal initialization θ for the model parameters, we define our total loss function, shown in Fig. 2 (E), as the sum of these two losses, i.e.,

θ ← θ − β∇ θ (L cd + L ag). (8)
In summary, our aggregated meta-train and cross-domain losses encourage our model to accurately and quickly adapt to an unseen target domain. The overall procedure is described in Algorithm A.1.
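Under the same toy scalar setup used for MetaUMT above, the MetaGUMT meta-update θ ← θ − β∇θ(L cd + L ag) can be sketched as below. The gradient `h` stands in for the loss on another domain's data D other, and the exact composition of the cross-domain term is our reading of the text, so treat this as an assumption-laden sketch rather than the authors' algorithm.

```python
def metagumt_step(theta, grads, other_grads, alpha=0.1, beta=0.05):
    """One MetaGUMT step: adapt per domain, then meta-update theta with
    both the aggregated meta-train and cross-domain gradient terms."""
    total = 0.0
    for g, h in zip(grads, other_grads):
        phi = theta - alpha * g(theta)  # meta-train phase (Eqs. (3) and (4))
        total += g(phi)                 # aggregated meta-train term (L_ag)
        total += g(phi) + h(phi)        # cross-domain term (L_cd): own + other domain
    return theta - beta * total

# Toy quadratic losses: gradient of (theta - c)^2 is 2 * (theta - c).
grads = [lambda t, c=c: 2.0 * (t - c) for c in (1.0, 2.0, 3.0)]
others = [lambda t, c=c: 2.0 * (t - c) for c in (2.0, 3.0, 1.0)]
theta = 0.0
for _ in range(400):
    theta = metagumt_step(theta, grads, others)
print(round(theta, 3))  # ~2.0 with these symmetric toy losses
```

Relative to MetaUMT, each domain's adapted parameters are also judged on another domain's data, which is what injects the cross-domain generalization pressure.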

Experiments
This section first introduces experiment settings and training details. Afterwards, we show empirical results in various scenarios.

Dataset and Preprocessing
We conduct our experiments on eight different domains 3 (Appendix T.2). Each domain dataset is publicly available on OPUS 4 (Tiedemann, 2012). We utilize the eight domains as out-domain (D out) and in-domain (D in) datasets. To build the monolingual corpora of the in-domain and out-domain datasets, we sample data from the parallel corpus, making sure to include at most one sentence from each pair of parallel sentences. For instance, we sample the first half of the sentences as unpaired source data and the other half as truly unpaired target data. Consequently, the sampled monolingual corpora contain no translated sentence in either language. The two monolingual corpora contain an equal number of sentences for each language (e.g., English and German). For our low-resource scenarios, we sample 5,000 tokens from a selected in-domain corpus for each language. Note that the out-domain datasets represent the full monolingual corpora.
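The sampling scheme described above can be sketched as follows; the function name is ours, and in the paper this split would be applied per domain to the OPUS parallel corpora. The first half of the parallel corpus contributes only source sides and the second half only target sides, so no sentence pair survives in both monolingual corpora.

```python
def split_parallel(pairs):
    """Build truly unpaired monolingual corpora from a parallel corpus:
    source sides from the first half, target sides from the second."""
    half = len(pairs) // 2
    src_mono = [s for s, _ in pairs[:half]]
    tgt_mono = [t for _, t in pairs[half:]]
    return src_mono, tgt_mono

pairs = [("en sent %d" % i, "de satz %d" % i) for i in range(6)]
src, tgt = split_parallel(pairs)
print(src)  # ['en sent 0', 'en sent 1', 'en sent 2']
print(tgt)  # ['de satz 3', 'de satz 4', 'de satz 5']
```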

Experimental Settings
As our base model, we use a Transformer (Vaswani et al., 2017), which is initialized by a masked language model from XLM (Conneau and Lample, 2019) using our out-domain datasets. All the models consist of 6 layers, 1,024 units, and 8 heads.
We establish and evaluate various baseline models as follows:
• UNMT model is trained with only the in-domain monolingual data, composed of 5,000 words for each language.
• Supervised neural machine translation model (NMT) is trained with in-domain parallel datasets, which we construct by pairing the two in-domain monolingual corpora.
• Unadapted model is pretrained with only the out-domain datasets and evaluated on the in-domain datasets.
• Transfer learning model is a finetuned model, which is pretrained with the out-domain datasets and then finetuned with a low-resource in-domain dataset.
• Mixed finetuned model (Chu et al., 2017) is similar to the transfer learning model, but it utilizes both in-domain and out-domain datasets for finetuning. That is, the training batch is sampled evenly from the in-domain and out-domain datasets.

Experimental Results
In order to verify that leveraging the high-resource domains (i.e., the source domains) is effective in handling the low-resource domain (i.e., the target domain), we compare the unsupervised and supervised models with ours and the other baseline models. As shown in Table 1, the unsupervised model trained on in-domain data suffers from data scarcity because it only uses the low-resource in-domain data. Although the unsupervised and supervised models are initialized by XLM, these models show the worst performance in all cases. This result indicates that, given a small in-domain corpus, it is better to utilize the out-domain datasets than to train only with the low-resource data. In addition, the performance of the unadapted model is far behind that of the other models, such as the mixed finetuned model, the transfer learning model, MetaUMT, and MetaGUMT. This implies that an adequate strategy for leveraging the high-resource domains is needed to improve performance.
We further compare the performance between our proposed approaches (i.e., MetaUMT and MetaGUMT) and the other two finetuning models (i.e., the transfer learning and the mixed finetuned ones). Our methods exhibit the leading performances in both directions of translation (en ↔ de), consistently achieving improvements of 2-3 BLEU points in most settings. Furthermore, MetaGUMT consistently obtains better BLEU scores and converges faster than MetaUMT. We assert that our proposed losses (i.e., the aggregated meta-train and the cross-domain losses) help the model not only to perform well on the unseen in-domain dataset but also to accelerate convergence.

Performances and Adaptation Speed in Finetuning Stage
As shown in Fig. 3 (A), we compare our proposed methods with the transfer learning approach while varying the size of the in-domain monolingual corpus. The smaller the training data, the wider the performance gap between our two approaches and the transfer learning model becomes.
This means that meta-learning is an effective approach for alleviating the performance degradation, preventing the model from overfitting to the low-resource data.
MetaUMT demonstrates better performance than the transfer learning model in various settings, and MetaGUMT performs even better, consistently across all settings, owing to our proposed losses (Eq. (8)). The transfer learning approach shows the worst performance except for the unadapted model, even though it exploits the in-domain corpus after being pretrained with the out-domain datasets.
Additionally, we analyze the number of iterations required for a model to converge on an in-domain dataset. As shown in Fig. 3 (B), the meta-learning approaches converge after only a few iterations, even faster than the transfer learning one. As the number of in-domain training words increases, the transfer learning approach requires a much larger number of iterations until convergence than our meta-learning approaches. This shows that MetaUMT and MetaGUMT rapidly adapt to an unseen domain. Moreover, owing to the knowledge encapsulated from the high-resource domains, MetaGUMT converges at an even earlier iteration than MetaUMT. In summary, the meta-learning-based methods quickly converge in the low-resource domain, improving over the transfer learning method in various low-resource settings. This indicates that the meta-learning-based approaches are suitable for alleviating the data deficiency issue in scarce domains. Furthermore, our losses in Eq. (8) enhance the model's ability to aggregate domain-general knowledge and find an adequate initialization.

Table 2: BLEU scores evaluated on out-domain and in-domain data with the initial θ and the finetuned θ, respectively. "D" denotes the domain; "Unseen" indicates a new domain evaluated with the finetuned θ. Since the transfer and mixed finetuned models use the same initial θ, we leave the corresponding row as "-".

Number of Iterations until Convergence in Pretraining Stage
An advantage of our meta-learning approaches is that they can find an optimal initialization point from which the model can quickly adapt to a low-resource in-domain dataset. The transfer learning model requires twice as many iterations until convergence as ours does. As shown in Fig. 3 (C), MetaUMT and MetaGUMT not only converge quickly but also outperform the other baseline methods. Specifically, compared to MetaUMT, MetaGUMT reaches an optimized initialization at an earlier iteration. These results indicate that our additional losses (i.e., the cross-domain and aggregated meta-train losses) are beneficial for finding an optimal initialization point when training the model with the out-domain datasets.

Analysis of MetaGUMT losses
We assume that domain generalization ability and high-resource domain knowledge help the UNMT model translate low-resource domain sentences. First, to identify whether the model encapsulates the high-resource knowledge from multiple sources, we evaluate our model on the out-domain datasets (i.e., D out) with the initial θ. As shown in Table 2, MetaGUMT shows remarkable performance over MetaUMT in all domains, even better than the transfer learning models. In other words, MetaUMT demonstrates poor performance on D out compared to MetaGUMT. This can be explained by the aggregated meta-train loss, which enables MetaGUMT to encapsulate the high-resource domain knowledge. As shown in Table 1, MetaGUMT achieves superior performance, showing that it is capable of leveraging the encapsulated knowledge when finetuning on the low-resource target domain. Second, our cross-domain loss encourages the model to retain generalization capability after adapting to the low-resource target domain. As shown in the "Unseen" column of Table 2, MetaGUMT outperforms the other models. This shows that our model retains its domain generalization ability after the finetuning stage thanks to the cross-domain loss in the meta-test phase.

Performance on Unbalanced Monolingual Data in Finetuning Stage
In UNMT, data imbalance is common: source-language (e.g., English) data are abundant while target-language (e.g., Nepali) data are scarce (Kim et al., 2020). We extend our experiments to such unbalanced scenarios to examine whether our proposed model shows the same tendency. In this scenario, the low-resource target domain dataset consists of monolingual sentences from one side with twice as many tokens as the monolingual sentences from the other side; this unbalanced in-domain corpus is the only difference from the main experimental setting of Section 5.1. As shown in Table 4, MetaGUMT outperforms the baselines in all unbalanced data cases. This shows that MetaGUMT is applicable to a practical UNMT scenario where the number of sentences differs between the source and target languages. We also include the result of the transfer learning model in Table T.4.

Ablation Study
We empirically show the effectiveness of the cross-domain and aggregated meta-train losses in Table 3. First, compared to MetaUMT, which uses neither of the two losses, incorporating the cross-domain loss improves the average BLEU score by 0.21. The cross-domain loss acts as a regularizer that prevents the model from overfitting during the finetuning stage. Second, the aggregated meta-train loss, another critical component of our model, allows the model to utilize high-resource domain knowledge in the finetuning stage; this improves the average BLEU score by 0.37 over MetaUMT. Lastly, combining both the cross-domain and aggregated meta-train losses significantly enhances the results in both directions of translation (En ↔ De), indicating that they are complementary to each other.

Impact of the Number of Source Domains
We examine how the performance changes with the number of source domains for each approach. As shown in Table 5, MetaGUMT consistently outperforms the transfer, mixed finetuned, and MetaUMT approaches. As the number of source domains increases, so does the performance gap between ours and the transfer-based models, i.e., the transfer and mixed finetuned models. This indicates that the meta-learning-based approaches are highly affected by the number of domains in the meta-train phase, and moreover, that if the number of source domains is large enough to capture general knowledge, the meta-learning-based approaches are well suited to handling the low-resource target task (i.e., machine translation in a low-resource domain).

Conclusions
This paper proposes a novel meta-learning approach for low-resource UNMT, called MetaUMT, which leverages multiple source domains to quickly and effectively adapt the model to the target domain even with a small amount of training data. Moreover, we introduce an improved method called MetaGUMT, which enhances cross-domain generalization and maintains high-resource domain knowledge. We empirically show that our proposed approach consistently outperforms the baseline methods by a nontrivial margin. We believe that our proposed methods can be extended to semi-supervised machine translation as well. In the future, we will further analyze other languages, such as Uzbek and Nepali, instead of languages like English and German.

A Implementation Details
To preprocess the datasets, we utilize Moses (Koehn et al., 2007) to tokenize the sentences. We then use byte-pair encoding (BPE) (Sennrich et al., 2016a) to build a shared sub-word vocabulary with fastBPE 7, using 60,000 BPE codes. Based on this shared sub-word vocabulary, constructed from the out-domain datasets, we split words into sub-word units for the in-domain dataset. We implement all models using the PyTorch library 8 and train them on four NVIDIA V100 GPUs for pretraining and finetuning. We evaluate all experiments with the BLEU script 9. The convergence iteration of each algorithm is defined by the best validation epoch, i.e., the epoch after which the validation score shows no further improvement over 10 additional epochs. Moreover, we have conducted comprehensive experiments on different domains to obtain our main result tables (Table 1 and Table T.1), training the model with 10 differently sampled word sets each time.
To optimize each algorithm, we choose the Adam optimizer (Kingma and Ba) for the pretraining stage and the Adam warmup optimizer (Vaswani et al., 2017) for the finetuning stage. The learning rate is set to 10−4, tuned within the range of 10−2 to 10−5. In all experiments, the number of tokens per batch is set to 1,120 and the dropout rate to 0.1. In the meta-learning approaches, we set both learning rates α and β to 10−4 in all experiments.
In the pretraining stage, we follow the same stopping criterion as Gu et al. (2018). For instance, among different target domains, we randomly select one as a validation domain. We utilize early stopping, i.e., stopping training if the validation BLEU score does not increase within the ten subsequent epochs. Similarly in the finetuning stage, we apply early stopping using a validation dataset from the target domain.
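The stopping rule above can be sketched as a simple patience counter over validation BLEU; the function below is illustrative, not the released code.

```python
def early_stop_epoch(bleu_per_epoch, patience=10):
    """Return the best validation epoch, stopping once the BLEU score
    has not improved for `patience` consecutive epochs."""
    best, best_epoch = float("-inf"), 0
    for epoch, bleu in enumerate(bleu_per_epoch):
        if bleu > best:
            best, best_epoch = bleu, epoch
        elif epoch - best_epoch >= patience:
            break  # patience exhausted; training would stop here
    return best_epoch

scores = [10.0, 12.5, 13.0, 12.9] + [12.8] * 10
print(early_stop_epoch(scores))  # → 2
```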

C Performances and Adaptation Speed in Finetuning Stage for a Law Domain

E Performance of Semi-Supervised Machine Translation in Finetuning Stage
The proposed algorithms, MetaUMT and MetaGUMT, show promising results on low-resource monolingual data. However, one may argue that creating parallel sentences from a small number of unpaired monolingual sentences (e.g., 5k tokens) is also feasible. Hence, we additionally conduct a semi-supervised machine translation experiment in the finetuning stage: we follow the same pretraining stage but utilize both monolingual and parallel sentences while finetuning the model on a low-resource domain. The number of tokens for each of the monolingual and parallel datasets is 5k. To finetune the model in the semi-supervised setting, we compute the loss as the sum of L ct and L bt, where L ct is the conventional translation loss of supervised NMT, i.e.,

L ct = E (x,y)∼P [−log M s→t (y | x)] + E (x,y)∼P [−log M t→s (x | y)].
As shown in Table T.3, we observe that MetaGUMT demonstrates promising performance against the others, even though we only utilize the monolingual out-domain datasets to pretrain the model.

F Statistics of Datasets
As shown in Table T.2, we present the overall number of sentences and words for each domain, where W/S indicates the number of words per sentence in a domain.

Table T.4: Results on unbalanced monolingual data. These are the same results as Table 4, with the additional baseline model, Transfer, included.