Generalised Unsupervised Domain Adaptation of Neural Machine Translation with Cross-Lingual Data Selection

This paper considers the unsupervised domain adaptation problem for neural machine translation (NMT), where we assume access to only monolingual text in either the source or target language in the new domain. We propose a cross-lingual data selection method to extract in-domain sentences in the missing language side from a large generic monolingual corpus. Our proposed method trains an adaptive layer on top of multilingual BERT by contrastive learning to align the representations of the source and target languages. This enables the domain classifier to be transferred between the languages in a zero-shot manner. Once the in-domain data is detected by the classifier, the NMT model is adapted to the new domain by jointly learning translation and domain discrimination tasks. We evaluate our cross-lingual data selection method on NMT across five diverse domains in three language pairs, as well as a real-world scenario of translation for COVID-19. The results show that our proposed method outperforms other selection baselines by up to +1.5 BLEU.


Introduction
Unsupervised domain adaptation (UDA) aims to generalise MT models trained on domains with typically large-scale bilingual parallel text to new domains without parallel data (Chu and Wang, 2018). Most prior work on UDA of NMT assumes the availability of either non-parallel text in both languages or only target-language monolingual text in the new domain to adapt the NMT model. The adaptation is achieved by modifying the model architecture and training jointly with auxiliary tasks (Gulcehre et al., 2015; Domhan and Hieber, 2017; Dou et al., 2019), or by constructing a parallel corpus for the new domain from general-domain parallel text using data-selection methods (Silva et al., 2018). However, very little attention has been paid to the UDA problem with only source-language monolingual text in the new domain. In practice, this setting is not rare, e.g. building a translation system from English to Shona (a low-resource African language) in a specific domain such as healthcare or disaster. While it would be very time-consuming to collect in-domain text in Shona, English corpora are more accessible.
In this paper, we consider the generalised problem of UDA in NMT, where we assume the availability of monotext in only one language, either the source or the target, in the new domain. We propose a generalised approach to the problem that uses cross-lingual data selection to extract sentences in the new domain for the missing language side from a large generic monolingual corpus. Our proposed data selection method trains an adaptive layer on top of multilingual BERT by contrastive learning, such that the representations of the source and target languages are aligned. The aligned representations allow a domain classifier trained on one language side to be transferred to the other language for in-domain data detection. Previous works have explored filtering data of the same language for MT (Moore and Lewis, 2010; Axelrod et al., 2011; Duh et al., 2013; Junczys-Dowmunt, 2018); however, utilising data in one language to detect in-domain data in the other language is under-explored.
With selected sentences in the new domain for the missing language side, the original adaptation problem is transformed into the usual UDA setting and can be approached with existing UDA methods. In this paper, we extend the discriminative domain mixing method for supervised domain adaptation (Britz et al., 2017), which jointly learns domain discrimination and translation, to the unsupervised setting. More specifically, the NMT model jointly learns to translate, via the translation loss on pseudo bitext, and to capture the characteristics of the new domain, via the domain discrimination loss on data from the old and new domains.
Our contributions can be summarised as follows:
• We introduce a generalised UDA (GUDA) problem for NMT which unifies the usual setting of having only target language monotext and the under-explored setting of having only source language monotext in the new domain.
• We propose a cross-lingual data selection method to address GUDA by retrieving in-domain sentences of the missing language from a generic monolingual corpus.
• We augment the discriminative domain mixing method for UDA by constructing an in-domain pseudo bitext via forward-translation and back-translation.
• We empirically verify the effectiveness of our approach on translation tasks across five diverse domains in three language pairs, as well as a real-world translation scenario for COVID-19. The experimental results show that our method achieves up to +1.5 BLEU improvement over other data selection baselines. The visualisation of the representations generated by the adaptive layer demonstrates that our method not only aligns the representations of the source and target languages, but also preserves the characteristics of the domains in each space.¹

Generalised Unsupervised Domain Adaptation
Domain adaptation is an important problem in NMT, as it is very expensive to obtain training data that are both large and relevant to all possible domains. The supervised adaptation problem requires both out-of-domain (OOD) bitext and in-domain bitext. The unsupervised domain adaptation problem assumes OOD bitext and in-domain monotext, usually in the target language. A domain is defined as a distribution $P(X, Y)$, where $X$ ranges over sentences in the source language $s$ and $Y$ is its translation in the target language $t$. We define generalised unsupervised domain adaptation (GUDA) for NMT as the problem of adapting an NMT model trained on an old domain $P_{old}(X, Y)$ to a new domain $P_{new}(X, Y)$, where text in only one language, either the source or the target, is available in the new domain. Since $P(X, Y) = P^{s}(X)\,P^{s,t}(Y \mid X)$, let us consider $P^{s}_{old}(X)$, the distribution over sequences in the source language $s$ in the old domain. It is usually much richer (i.e., it contains diverse categories such as news, politics, etc.) than $P^{s}_{new}(X)$, which typically covers a much more specific category to which we aim to adapt the NMT model. The conditional distribution $P^{s,t}_{old}(Y \mid X)$ specifies the encoder-decoder NMT network to be adapted to the new domain.
¹ Source code is available at https://github.com/trangvu/guda.
Given parallel bitext $D_{old} = \{(\mathbf{x}_i, \mathbf{y}_i)\}$ in the old domain, we consider two settings in GUDA:
• An initial monolingual text $X_{new} = \{\mathbf{x}_j\}$ of the source language in the new domain and a generic monolingual text $D^{t}$ of the target language.
• An initial monolingual text $Y_{new} = \{\mathbf{y}_k\}$ of the target language in the new domain and a generic monolingual text $D^{s}$ of the source language.
Crucially, in both cases we do not require any parallel text in the new domain, hence the term unsupervised domain adaptation. The goal is to adapt an NMT model, parametrised by $\theta$ and trained on the old-domain bitext $D_{old}$, to the new domain.
In the setting involving $Y_{new}$, the monotext can be used to create pseudo-parallel data via back-translation (Sennrich et al., 2016) or to adapt the decoder via multi-task learning (Gulcehre et al., 2015; Domhan and Hieber, 2017). This setting is the usual formulation of UDA for NMT (Chu and Wang, 2018). In contrast, the setting involving the source monotext $X_{new}$ is not well explored in the literature.
Our approach to GUDA is to create in-domain monotext for the language side whose new-domain data is missing. That is, given $X_{new}$, we build a classifier to select in-domain monotext $Y_{new}$ in the target language from the generic monotext $D^{t}$. We follow a similar procedure for the other case, where only in-domain $Y_{new}$ is present. We then adapt the NMT model using the bitext from the old domain as well as the source and target language monotext in the new domain. The challenge, however, is how to train a data-selection classifier for the language side with missing data. We address this problem in Section 3 and describe how to adapt the NMT model to the new domain in Section 4.

Cross-lingual In-domain Data Selection
Aharoni and Goldberg (2020) have shown that the domain clusters that emerge in BERT (Devlin et al., 2019) can be used to select in-domain bitext for NMT. Inspired by that observation, we leverage the sentence representations produced by multilingual BERT (mBERT) for cross-lingual monotext selection. We first align the source and target language representation spaces while preserving the domain clustering characteristics in each space. Using the available monotext in one language, we train a binary classifier to distinguish the old and new domains on the aligned semantic spaces. This classifier is then transferred to pick in-domain sentences in the other language (Figure 1).
Representation Alignment. We encode a sentence $\mathbf{x}$ as $\mathbf{h}(\mathrm{mBERT}(\mathbf{x}))$, where $\mathbf{h}$ mean-pools the top-layer hidden states obtained from mBERT. To align the representation spaces of the source and target languages, we learn an adaptive layer $\mathbf{g}_{\phi}(\cdot)$, a feed-forward network parametrised by $\phi$, on top of mBERT by contrastive learning. The intuition is that the representations of a translation pair $(\mathbf{x}_i, \mathbf{y}_i)$ should be close to each other in the semantic space, while the representations of non-translation pairs should be far apart. Specifically, we optimise a contrastive loss in which $\mathbf{z}_{\mathbf{x}} := \mathbf{g}_{\phi}(\mathbf{h}(\mathrm{mBERT}(\mathbf{x})))$ and $\mathbf{z}_{\mathbf{y}} := \mathbf{g}_{\phi}(\mathbf{h}(\mathrm{mBERT}(\mathbf{y})))$ are the outputs of the adaptive layer for the source and target sentences; $(\mathbf{z}^{+}_{\mathbf{x}}, \mathbf{z}^{+}_{\mathbf{y}})$ and $(\mathbf{z}^{+}_{\mathbf{x}}, \mathbf{z}^{-}_{\mathbf{y}})$ denote the positive and negative example pairs, $\tau$ is a temperature parameter, and $\mathrm{sim}(\cdot)$ is the cosine similarity, following Aharoni and Goldberg (2020). While training the parameters $\phi$ of the adaptive layer, all other layers, including the embedding and transformer layers, are frozen.
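The loss can be written in several closely related ways. As a sketch only, assuming the common InfoNCE-style normalisation over a batch of $N$ pairs, with $\mathcal{N}(i)$ (a symbol introduced here for illustration) denoting the indices of the negatives for pair $i$, one form consistent with the definitions above is:

```latex
\mathcal{L}(\phi) = -\frac{1}{N} \sum_{i=1}^{N}
  \log \frac{\exp\left(\mathrm{sim}(\mathbf{z}^{+}_{\mathbf{x}_i}, \mathbf{z}^{+}_{\mathbf{y}_i}) / \tau\right)}
            {\exp\left(\mathrm{sim}(\mathbf{z}^{+}_{\mathbf{x}_i}, \mathbf{z}^{+}_{\mathbf{y}_i}) / \tau\right)
             + \sum_{j \in \mathcal{N}(i)} \exp\left(\mathrm{sim}(\mathbf{z}^{+}_{\mathbf{x}_i}, \mathbf{z}^{-}_{\mathbf{y}_j}) / \tau\right)}
```

The averaging over the batch and the inclusion of the positive term in the denominator are assumptions of this sketch; a smaller $\tau$ sharpens the contrast between the positive pair and its negatives.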

Given a batch of $N$ training examples from the old domain, $\{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{N} \sim D_{old}$, the translation pairs from the bitext are the positive examples. Instead of blindly treating all non-translation pairs as negative examples, we create domain labels by clustering the mBERT representations of the bitext into k clusters. For a given pair $(\mathbf{x}_i, \mathbf{y}_i)$ in the training batch, we take the pairs from distinct clusters in the same batch as its negative examples. This keeps the computational cost low, since all positive and negative examples are encoded and used within the same batch. We show the benefit of this setting in § 6.1.
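The in-batch, cluster-aware negative sampling described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names, the toy vectors and the choice to skip a pair that has no negative in the batch are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(zx, zy, clusters, tau=0.2):
    """Mean contrastive loss over a batch of aligned (source, target) vectors.

    zx[i] and zy[i] form the positive pair i. The negatives for pair i are
    the targets zy[j] whose cluster label differs from that of pair i.
    """
    total, used = 0.0, 0
    for i in range(len(zx)):
        pos = math.exp(cosine(zx[i], zy[i]) / tau)
        neg = sum(
            math.exp(cosine(zx[i], zy[j]) / tau)
            for j in range(len(zy))
            if clusters[j] != clusters[i]
        )
        if neg == 0.0:
            # Assumption: a pair with no negative in the batch is skipped.
            continue
        total += -math.log(pos / (pos + neg))
        used += 1
    return total / used if used else 0.0
```

With two orthogonal pairs from different clusters, the loss is close to zero when each source vector matches its own target and grows when the targets are swapped, which is the behaviour the alignment objective relies on.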
In-domain Data Selection. Using the adaptive layer's encoding, we learn a domain classifier for the language side on which we are given monotext in the new domain. Let us assume we are given source language monotext $X_{new}$ in the new domain and the bitext $D_{old}$ in the old domain. The domain classifier $\mathbf{c}_{\psi}(\mathbf{z})$ produces the probability that an input vector $\mathbf{z}$ belongs to the new domain. We train the parameters $\psi$ of the domain classifier by minimising a binary domain-classification loss (Figure 1) over $X_{new}$ and $D^{s}_{old}$, where $D^{s}_{old}$ denotes the source language side of the parallel bitext $D_{old}$. Thanks to the aligned semantic spaces, we then transfer the trained domain classifier cross-lingually to the other language side to select a subcorpus of in-domain monotext: we select the top-k most probable sentences from the given generic corpus of the other language side.
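The train-then-transfer procedure can be illustrated with a small, self-contained sketch. It assumes the aligned sentence vectors are already computed, and it swaps in a logistic-regression classifier trained with SGD on a binary cross-entropy loss in place of the feed-forward classifier; all names and hyperparameters here are hypothetical.

```python
import math


def sigmoid(a):
    """Numerically stable logistic function."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def train_domain_classifier(new_vecs, old_vecs, epochs=200, lr=0.5):
    """Fit a logistic regression: label 1 = new domain, 0 = old domain."""
    data = [(v, 1.0) for v in new_vecs] + [(v, 0.0) for v in old_vecs]
    w, b = [0.0] * len(data[0][0]), 0.0
    for _ in range(epochs):
        for v, label in data:
            p = sigmoid(sum(wi * vi for wi, vi in zip(w, v)) + b)
            g = p - label  # gradient of the cross-entropy w.r.t. the logit
            w = [wi - lr * g * vi for wi, vi in zip(w, v)]
            b -= lr * g
    return w, b


def select_top_k(candidates, w, b, k):
    """Indices of the k candidates with the highest new-domain probability."""
    scores = [
        sigmoid(sum(wi * vi for wi, vi in zip(w, v)) + b) for v in candidates
    ]
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return order[:k]
```

In the cross-lingual setting, `new_vecs` and `old_vecs` would come from one language side and `candidates` from the generic corpus of the other; the transfer only makes sense if both sides live in the same aligned space.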

NMT Adaptation to the New Domain
Given the parallel data in the old domain $D_{old}$ and monolingual data in the new domain for both the source language $X_{new}$ and the target language $Y_{new}$, we adapt the NMT model by minimising a loss that combines bitext, source monotext and target monotext terms, as illustrated in Figure 2 and explained below.
Bitext Loss. We create pseudo-bitext $D_{new}$ by back-translating $Y_{new}$ using a reverse-direction translation model trained on $D_{old}$. The quality of the pseudo-bitext depends on the quality of the reverse-direction NMT model in the new domain. We further mix the pseudo-bitext $D_{new}$ with the old-domain bitext $D_{old}$ to form the bitext loss, in which $p_{\theta}(\mathbf{y} \mid \mathbf{x})$ is the translation probability according to the NMT model and $\lambda_1$ controls the effect of the old domain.
Source Monotext Loss. To take into account the clean source-language text of the new domain, we apply the discriminative domain mixing method (Britz et al., 2017) to push the encoder towards capturing the new domain's characteristics. For this purpose, we build a classifier $\mathbf{c}_{\psi_e}(\mathbf{z}_e)$, a feed-forward network parametrised by $\psi_e$, whose output is the probability of the new domain. Here $\mathbf{z}_e = \mathbf{h}(\mathrm{enc}_{\theta}(\mathbf{x}))$ is the representation of the sentence $\mathbf{x}$, computed by mean-pooling the encoder's top-layer states, and $\lambda_2$ controls the effect of the old domain in the loss.
Target Monotext Loss. The target monotext loss is defined analogously, where $\mathrm{dec}_{\theta}$ is the NMT decoder, $\mathbf{c}_{\psi_d}$ is the decoder-side domain classifier parametrised by $\psi_d$, $D^{t}_{old}$ is the set of target sentences in the old domain's bitext, and $\lambda_3$ controls the effect of the old domain.
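To make the roles of $\lambda_1$, $\lambda_2$ and $\lambda_3$ concrete, the three terms can be written out. The following is a sketch under stated assumptions rather than the exact objective: it assumes the terms are simply summed, that the domain terms are binary cross-entropy losses, and it introduces $\mathbf{z}_d$ (not defined in the text) for the mean-pooled decoder representation.

```latex
\begin{aligned}
\mathcal{L}(\theta, \psi_e, \psi_d) &= \mathcal{L}_{bi} + \mathcal{L}_{src} + \mathcal{L}_{tgt} \\
\mathcal{L}_{bi} &= -\sum_{(\mathbf{x}, \mathbf{y}) \in D_{new}} \log p_{\theta}(\mathbf{y} \mid \mathbf{x})
  \;-\; \lambda_1 \sum_{(\mathbf{x}, \mathbf{y}) \in D_{old}} \log p_{\theta}(\mathbf{y} \mid \mathbf{x}) \\
\mathcal{L}_{src} &= -\sum_{\mathbf{x} \in X_{new}} \log \mathbf{c}_{\psi_e}(\mathbf{z}_e)
  \;-\; \lambda_2 \sum_{\mathbf{x} \in D^{s}_{old}} \log\bigl(1 - \mathbf{c}_{\psi_e}(\mathbf{z}_e)\bigr) \\
\mathcal{L}_{tgt} &= -\sum_{\mathbf{y} \in Y_{new}} \log \mathbf{c}_{\psi_d}(\mathbf{z}_d)
  \;-\; \lambda_3 \sum_{\mathbf{y} \in D^{t}_{old}} \log\bigl(1 - \mathbf{c}_{\psi_d}(\mathbf{z}_d)\bigr)
\end{aligned}
```

Under this reading, setting every $\lambda$ to 1 weights old-domain and new-domain examples equally in each term.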

Experiments
We evaluate our proposed approach to GUDA on three language pairs covering five domains, and on a real-world translation task, namely TICO-19.

Setup
Datasets. Table 1 shows data statistics. The general-domain datasets come from WMT2014 for English-French, WMT2020 for English-German, and the news parallel corpus from OPUS for Arabic-English. We evaluate our proposed methods on the following specific domains: TED talk, Law, Medical, IT and Koran, from OPUS (Tiedemann, 2012), following the recipe in Koehn and Knowles (2017). We sample 10M English sentences from News Crawl 2007-2019 as the generic monolingual corpus. Data pre-processing is described in Appendix A.
Baselines. We evaluate the effectiveness of our GUDA framework against the zero-shot baseline (base), where the old-domain model is evaluated without any further training on the new domain.
We also evaluate our method against a pseudo-translation baseline (trans), where the old-domain model is further trained on pseudo-translations of monolingual data from the new domain. More specifically, for the to-English translation direction, the pseudo-translation training data contains sentences in the source language and their forward-translated sentences in English. For the from-English translation direction, it contains sentences in the target language and their back-translated sentences in English. We also train fully supervised models (sup.), which further train the old-domain models on in-domain parallel data and yield approximately upper-bound BLEU scores. We compare our proposed in-domain data selection method against several baselines:
• random: we randomly select English sentences from the generic monolingual pool and treat them as in-domain sentences.
• cross entropy difference (CED) (Moore and Lewis, 2010), which is a widely used data selection method in MT. The CED score of a given sentence $x$ in the generic corpus is calculated as $\mathrm{CED}(x) = H_S(x) - H_G(x)$, where $H_S(x)$ and $H_G(x)$ are the cross-entropy of the sentence $x$ according to the specific-domain and generic-domain LMs respectively. The lower the CED score is, the more likely the sentence belongs to this specific domain. In our GUDA setting, to enable cross-lingual data selection, we train a multilingual neural LM on the bitext in the old domain, then further finetune it on the available monotext in the new domain and use it to rank the generic corpus. We only run CED methods for En↔Fr and En↔De translation since we do not share vocabulary between Ar and En.
• domain-finetune (Aharoni and Goldberg, 2020), which trains a domain classifier on mBERT representations and selects the top-k in-domain sentences scored by the classifier. Despite having a similar selection mechanism to our method, the classifier in the domain-finetune technique operates on the pretrained representation space of mBERT without alignment between languages.
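The CED baseline above scores each sentence by a difference of cross-entropies. As a toy illustration, the sketch below swaps the neural LMs for add-one smoothed unigram LMs; the function names and the smoothing scheme are assumptions made for the example.

```python
import math
from collections import Counter


def train_unigram_lm(sentences):
    """Unigram counts and total token count; a toy stand-in for a neural LM."""
    counts = Counter(tok for s in sentences for tok in s.split())
    return counts, sum(counts.values())


def cross_entropy(sentence, lm, vocab_size):
    """Per-token cross-entropy (in nats) under an add-one smoothed unigram LM."""
    counts, total = lm
    toks = sentence.split()
    logp = sum(math.log((counts[t] + 1) / (total + vocab_size)) for t in toks)
    return -logp / len(toks)


def ced(sentence, specific_lm, generic_lm, vocab_size):
    """CED(x) = H_S(x) - H_G(x); lower scores indicate more in-domain text."""
    h_s = cross_entropy(sentence, specific_lm, vocab_size)
    h_g = cross_entropy(sentence, generic_lm, vocab_size)
    return h_s - h_g
```

Ranking the generic corpus by ascending `ced` and keeping the head of the list reproduces the selection step; with real LMs only `cross_entropy` would change.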
GUDA setup. We assume the availability of non-English language data and evaluate our method by selecting 500K English sentences from the generic monolingual pool. We use the multilingual Distill-BERT (mDistillBERT) (Sanh et al., 2019) to encode the sentence representation. We sample 2M sentences from the old-domain bitext and cluster them into k=5 clusters for negative example creation. To train the domain classifier, we extract the top 500K old-domain sentences whose representations have the lowest similarity to the mean representation of the monotext in the new domain. The adaptive layer is a 2-layer feed-forward network with hidden size 128. We set the temperature parameter $\tau$ in the contrastive loss to 0.2. We train the adaptive layer using the Adam optimiser with learning rate 1e-5 and a batch size of 64 sentences, for up to 20 epochs, with early stopping if the loss on the old-domain dev set does not improve for 5 epochs. The domain discriminator is also a 2-layer feed-forward network with the same hyperparameters as the adaptive layer. We use the Transformer (Vaswani et al., 2017) as the NMT model and set the mixing hyperparameters $\lambda_1$, $\lambda_2$, $\lambda_3$ to 1, i.e. the old-domain parallel data and the source and target monotext contribute equally to the training signal for the NMT model. Details of the model hyperparameters can be found in Appendix A.
Table 2 presents the results of translation to and from English, according to GUDA with source and target language monotext respectively. There is a substantial gap between the fully supervised (sup.) and zero-shot (base) scores. GUDA is able to reduce this gap, especially when the in-domain data are selected intelligently. Overall, our selection method consistently outperforms both the domain-finetune and CED strategies.
We further assess our approach on the Translation Initiative for COVID-19 task (TICO-19) for En-Fr and En-Ar (Anastasopoulos et al., 2020).
The task contains a dev set and a test set of 971 and 2100 sentences, respectively. As an emerging domain, it has no training set. We collect additional in-domain monotext: 49K for French and 17K for Arabic. As shown in Table 3, GUDA with random selection surprisingly lowers the BLEU score. It is possible that pandemic-related words have not appeared often before. Consistent with previous results, our method outperforms other methods by up to +1.2 BLEU.

Main Results
To evaluate our alignment method, we visualise the representations of the TICO-19 dev set produced by mDistillBERT and by the adaptive layer in Figure 3. The adapted French and English representations are better aligned in the semantic space than those of mDistillBERT.

Ablation
Clustering-based negative sampling. The intuition behind clustering-based negative sampling is to preserve the domain clustering characteristics that emerge in mBERT. We assess the importance of this clustering step and the effect of the number of clusters k on En↔De translation performance in the law, med and TED domains. Table 4 reports the BLEU score of the NMT model in the new domain with k ∈ {1, 2, 3, 5, 7, 10}, where k = 1 corresponds to negative sampling without pre-clustering the mBERT representation space. Overall, the NMT model trained on data selected with clustering-based negative sampling (k > 1) outperforms the one trained without clustering (k = 1). On the other hand, the effect of the number of clusters k varies with the domain and language. Empirically, we found that k ∈ {5, 7} works better than other values.
Table 5: Domain discriminative mixing ablation.
Discriminative domain mixing. We run ablation experiments to verify the contribution of each loss term in the discriminative domain mixing training objective presented in eq. (3). In particular, we evaluate the NMT model adapted to the new domain using (i) only the bitext loss (BI); (ii) the combination of the bitext loss and either the source monotext loss (BI+S) or the target monotext loss (BI+T); and (iii) the combination of all three loss terms (BI+S+T). Table 5 shows the results under both supervised domain adaptation, where we have access to the true bitext, and UDA, in which the model is trained on pseudo bitext generated by back-translation (warm-start). The size of the ground-truth bitext is shown in Table 1. The size of the pseudo-bitext is 500K, which is approximately double the size of the ground-truth bitext of the TED and med domains, and roughly the same as that of the law domain. We further evaluate the contribution of the discriminative domain loss when the NMT model is trained from scratch (cold-start).
Consistent with Britz et al. (2017), training NMT on mixed-domain data (BI) degrades performance compared with models fit to a single domain (sup.). Adding the discriminative domain loss can mitigate this negative effect in multi-domain NMT. We observe similar outcomes in domain adaptation with the true bitext and with the pseudo bitext. Overall, we found that the source monotext loss plays a more critical role than the target monotext loss. Combining both monotext losses achieves the best BLEU score in most domain adaptation scenarios.

Analysis
Domain cluster visualisation. To demonstrate the ability of our approach to preserve the domain-clustering characteristics of mBERT, we plot 2D visualisations, obtained with PCA, of the mean-pooled BERT hidden-state sentence representations and of our contrastive-based sentence representations. Following Aharoni and Goldberg (2020), we combine the development sets of all the new-domain datasets and cluster the representations using a Gaussian Mixture Model (GMM) with k pre-defined clusters, where k is the number of domains. Figure 4 visualises the obtained clusters in the semantic spaces of mDistillBERT and of the adaptive layer for each language in the translation pairs. The ellipses describe the mean and variance parameters learned for each cluster. In line with the findings of Aharoni and Goldberg (2020), the mDistillBERT representations of English sentences can be clustered by their domains with a small region of overlap. In contrast, Arabic sentences are not well clustered by domain: their domain clusters overlap heavily. Our contrastive-based representation alignment method not only preserves the domain clusters of English sentences but also learns domain-clustered representations of Arabic sentences in which the clusters overlap less.
Distribution of domain predictive score. Figure 6 plots the cumulative distribution of the domain predictive score over the generic English corpus. Only a small portion of the generic corpus is predicted to belong to the new domains. As expected, more specific domains such as med and law have fewer predicted in-domain sentences than the TED domain.
ngram analysis. A domain can be considered as a distribution over ngrams. Data selection methods mitigate the domain shift in NMT by introducing ngrams of the new domain into the training corpus. We estimate the new in-domain ngram contribution of each selection method from the overlap of ngrams between the translation hypothesis and the translation reference. The new ngram contribution is computed from $G(\mathbf{y}^{ref}_{i,new})$, $G(\tilde{\mathbf{y}}^{zero}_{i,new})$ and $G(\tilde{\mathbf{y}}^{GUDA}_{i,new})$, the sets of ngrams in the reference, the zero-shot translation hypothesis and the GUDA translation hypothesis of sentence $i$ in the new-domain test set, respectively. Figure 5 presents the percentage of new ngram contribution, for 1 ≤ n ≤ 4, of each data selection method as well as the fully supervised model for De-En translation in the law, med and TED domains. As expected, the fully supervised model contributes the highest rate of correct in-domain ngrams to the translation hypothesis. Our proposed selection method contributes a higher percentage of in-domain ngrams than other selection methods in all domains.
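One way to read the ngram contribution is as the share of reference ngrams that the adapted system produces but the zero-shot system does not. The sketch below implements that reading; the exact normalisation is an assumption, and the function names are hypothetical.

```python
def ngrams(tokens, max_n=4):
    """Set of all n-grams (as tuples) with 1 <= n <= max_n."""
    return {
        tuple(tokens[i : i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }


def new_ngram_contribution(refs, zero_hyps, guda_hyps, max_n=4):
    """Assumed definition: reference n-grams matched by the adapted hypothesis
    but absent from the zero-shot hypothesis, pooled over the test set and
    divided by the total number of reference n-grams."""
    gained, ref_total = 0, 0
    for ref, zero, guda in zip(refs, zero_hyps, guda_hyps):
        g_ref = ngrams(ref.split(), max_n)
        g_zero = ngrams(zero.split(), max_n)
        g_guda = ngrams(guda.split(), max_n)
        gained += len((g_guda & g_ref) - g_zero)
        ref_total += len(g_ref)
    return gained / ref_total if ref_total else 0.0
```

Under this definition an adapted system that merely repeats the zero-shot output scores zero, so the measure isolates ngrams gained through adaptation.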

Related Works
Unsupervised Domain Adaptation. Previous work on UDA has focused on aligning domain distributions by minimising the discrepancy between representations of the source and target domains; learning domain-invariant representations via adversarial learning (Ganin and Lempitsky, 2015; Shah et al., 2018; Moghimifar et al., 2020); and bridging the domain gap by adaptive pretraining of contextualised word embeddings (Han and Eisenstein, 2019; Vu et al., 2020). In this paper, we adapt the NMT model from the old to the new domain by learning domain-invariant representations in both the encoder and the decoder via a domain discrimination loss.

Unsupervised Domain Adaptation of NMT. There are two main approaches to UDA for NMT: model-centric and data-centric methods (Chu and Wang, 2018). In the model-centric approach, the model architecture is modified and jointly trained on the MT task and other auxiliary tasks such as language modelling (Gulcehre et al., 2015). Data-centric methods, on the other hand, focus on constructing an in-domain parallel corpus by data selection from a general corpus (Domhan and Hieber, 2017) and by back-translation (Jin et al., 2020; Mahdieh et al., 2020). Most prior work on UDA of NMT assumes the availability of in-domain data in the target language. While there are a few studies on the UDA problem with in-domain source-language data in statistical MT (Mansour and Ney, 2014; Cuong et al., 2016), this problem remains unexplored in NMT.
Data selection for NMT. To address the scarcity of MT parallel data in specific domains, data selection methods use initial in-domain training data to select relevant additional sentences from a generic parallel corpus. Previous research has used n-gram language models (Moore and Lewis, 2010; Axelrod et al., 2011; Duh et al., 2013), count-based methods (Parcheta et al., 2018), and similarity scores of sentence embeddings (Wang et al., 2017; Junczys-Dowmunt, 2018; Dou et al., 2020) to rank the generic corpus. The ranking and selection process often operates in a single language, either the source or the target, and takes advantage of the parallel corpus to retrieve the paired translation (Farajian et al., 2017). For the case where such a generic parallel corpus is unavailable, cross-lingual data selection, which uses data in one language to detect in-domain data in the other language, is under-explored.

Conclusion
We have proposed a cross-lingual data selection method for the GUDA problem in NMT, where only monolingual data from one language side is available in the new domain. We first learn an adaptive layer to align the BERT representations of the source and target languages. We then use a domain classifier trained on one language to select in-domain data for the other. Experiments on translation tasks over several language pairs and domains show the effectiveness of our method over other baselines.

A Training Procedure
Data preprocessing. We tokenise English, French and German sentences using the Moses tokenizer (Koehn et al., 2007) and remove sentences with more than 175 tokens. Arabic text is tokenised using CAMeL (Obeid et al., 2020). For Arabic, we first filter out sentences containing more than 50% Latin characters, then remove those with more than 175 tokens.
Model hyperparameters. The adaptive layer is a 2-layer feed-forward net with hidden size 128. We set the temperature parameter τ in the contrastive loss to 0.2. We train the adaptive layer using the Adam optimiser with learning rate 1e-5, batch size of 64 sentences, up to 20 epochs with early stopping if there is no improvement for 5 epochs on the loss of the dev set in the old domain. The domain discriminator is also a 2-layer feed-forward net. We train it with the same hyperparameters as in the adaptive layer.
We use the Transformer as the NMT model, with 6 encoder and decoder layers, 4 self-attention heads, hidden size 256 and feed-forward hidden size 1024, implemented in the Fairseq framework (Ott et al., 2019). The number of parameters is 64.3M. We use the Adam optimiser (Kingma and Ba, 2015) with learning rate 5e-4 and an inverse square root schedule with 1000 warm-up steps. We apply dropout and label smoothing with rates of 0.3 and 0.1 respectively. We learn a vocabulary of size 32000 using the unigram language model (Kudo, 2018), implemented in SentencePiece. For En-Fr, En-De, and En-Cs, the source and target embeddings are shared and tied with the last layer. We set the mixing hyperparameters $\lambda_1$, $\lambda_2$, $\lambda_3$ to 1, i.e. the old-domain parallel data and the source and target monotext contribute equally to the training signal for the NMT model. We train the NMT model with a batch size of 32768 tokens for up to 30 epochs, with early stopping if there is no improvement on the dev set for 5 epochs.
Our models are trained on a V100 GPU; training takes up to 4 days for the NMT model in the old domain and 1 day for the other experiments.
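For readers unfamiliar with the inverse square root schedule, the following minimal sketch shows its shape with the peak rate and warm-up length quoted above. Starting the linear warm-up from zero is an assumption of this sketch; the function name is hypothetical and this is not Fairseq's own code.

```python
def inverse_sqrt_lr(step, peak_lr=5e-4, warmup_steps=1000):
    """Linear warm-up to peak_lr, then decay proportional to 1/sqrt(step)."""
    if step < warmup_steps:
        # Assumption: warm-up starts from a learning rate of zero.
        return peak_lr * step / warmup_steps
    return peak_lr * (warmup_steps / step) ** 0.5
```

The rate peaks at step 1000 and halves by step 4000, since the decay depends on the ratio of the warm-up length to the current step.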