The Source-Target Domain Mismatch Problem in Machine Translation

While we live in an increasingly interconnected world, different places still exhibit strikingly different cultures, and many events we experience in our everyday life pertain only to the specific place we live in. As a result, people often talk about different things in different parts of the world. In this work we study the effect of local context in machine translation and postulate that it causes the domains of the source and target language to greatly mismatch. We first formalize the concept of source-target domain mismatch, propose a metric to quantify it, and provide empirical evidence for its existence. We conclude with an empirical study of how source-target domain mismatch affects the training of machine translation systems on low-resource languages. While the mismatch may severely affect back-translation, the degradation can be alleviated by combining back-translation with self-training and by increasing the amount of target-side monolingual data.


Introduction
The use of language greatly varies with geographic location (Firth, 1935; Johnstone, 2010). Even within places where people speak the same language (Britain, 2013), there is a lot of lexical variability due to changes in style and topic distribution, particularly when considering content posted on social media, blogs and news outlets. For instance, while a primary topic of discussion among British sports fans is cricket, American sports fans are more likely to discuss other sports like baseball (Leech and Fallon, 1992).
The effect of local context on the use of language is even more extreme when considering regions where different languages are spoken. Despite the increasingly interconnected world we live in, people in different places tend to talk about different things. There are several reasons for this, from cultural differences due to geographic separation and history, to the local nature of many events we experience in our everyday life; e.g., the traffic congestion in Taipei is not affected by a heavy snowfall in New York City. This phenomenon has not only interesting socio-linguistic aspects but also strong implications for machine translation (Bernardini and Zanettin, 2004). In particular, machine translation of low-resource language pairs aims at automatically translating content in two languages that are often spoken in very distant geographic locations by people with rather different cultures. In machine learning terms, and at a very high level of abstraction, this is akin to the problem of aligning two very high-dimensional and sparsely populated point clouds. The learning problem is difficult not only because very few correspondences are provided to the learner, but also because the distributions of points are rather different.
As of today and to the best of our knowledge, machine translation has been based on the often implicit assumption that content in the two languages is comparable. Sentences comprising the parallel dataset used for training are assumed to cover the same topic distribution, regardless of the originating language. Similarly, monolingual corpora are assumed to be comparable, i.e., to cover the same distribution of topics albeit in two different languages.
Unfortunately, this assumption does not hold for the vast majority of language pairs, which are low-resource, nor for the vast majority of the content produced every day on the Internet by means of blogs, social platforms and media outlets.
In this work, we introduce and formalize the concept of source-target domain mismatch (STDM), which accounts for intrinsic differences (besides the language) between source- and target-originating sentences of both the parallel and monolingual datasets used to train machine translation systems. We surmise that STDM may negatively impact the effectiveness of back-translation (Sennrich et al., 2015), which is de facto the best known approach to leverage monolingual data in low-resource settings. When STDM is considerable, back-translation is less effective because, even if the backward model were perfect, the back-translated data is out-of-domain relative to the source domain from which we aim to translate.
In order to study the effects of STDM, we introduce a controlled setting that any researcher can easily reproduce to finely tune the amount of domain mismatch. Using this synthetic setting, we study how the composition of the parallel data and the amount of monolingual and parallel data affect translation quality. Besides back-translation, we also investigate self-training (Yarowsky, 1995) as an approach to better leverage in-domain monolingual data on the source side.
Our empirical validation demonstrates that back-translation is often complementary to self-training. The former works better when there is a lot of target-side monolingual data and when STDM is not very strong. The latter works better when source-side monolingual data is more abundant and when STDM is more prominent. Moreover, the combination of self-training and back-translation often yields improvements over each baseline method. We finally demonstrate these approaches on low-resource language pairs like Nepali-English and Myanmar-English, and report improvements of 1 to 4 BLEU points over the baseline approach using parallel data only.

Related Work
The observation that topic distributions and various kinds of lexical variability depend on the local context has been known and studied for a long time. For instance, Firth (1935) says "Most of the give-and-take of conversation in our everyday life is stereotyped and very narrowly conditioned by our particular type of culture". In her seminal work, Johnstone (2010) analyzed the role of place in language, focussing on lexical variations within the same language, a subject further explored by Britain (2013). Some of these works were the basis for later studies that introduced computational models for how language changes with geographic location (Mei et al., 2006; Eisenstein et al., 2010).
Moving to cross-lingual analyses, there has been work at the intersection of linguistics and cognitive science (Pederson et al., 1998) showing how certain linguistic codings vary across languages, and how these affect how people form mental concepts. In machine translation, researchers have often made an explicit assumption on the use of comparable corpora (Fung and Yee, 1998; Munteanu et al., 2004; Irvine and Callison-Burch, 2013), i.e., corpora in the two languages that roughly cover the same set of topics. Unfortunately, monolingual corpora are seldom comparable in practice. Leech and Fallon (1992) analyze two comparable corpora, one in American English and the other in British English, and demonstrate differences that reflect the cultures of origin. Similarly, Bernardini and Zanettin (2004) observe that parallel datasets built for machine translation exhibit strong biases in the selection of the original documents, making the text collections not quite comparable.
The non-comparable nature of machine translation datasets is even more striking when considering low-resource language pairs, for which differences in local context and culture are more pronounced. Recent studies (Søgaard et al., 2018; Neubig and Hu, 2018) have warned that removing the assumption of comparable corpora strongly deteriorates the performance of lexicon induction techniques, which are at the foundation of machine translation.
To the best of our knowledge, no prior work has made explicit the intrinsic mismatch between source and target domain in machine translation, both when considering the portions of the parallel dataset originating in the source and target language, and when considering the source and target monolingual corpora. We believe that this is an important characteristic of machine translation tasks, particularly when the content is derived from blogs, social media platforms, and news outlets. We shall not attempt to make corpora comparable, because doing so would change the nature of the actual task!

Back-translation (Sennrich et al., 2015) has been the workhorse of modern neural MT, enabling very effective use of target-side monolingual data. Back-translation is beneficial because it helps regularize the model and adapt to new domains (Burlot and Yvon, 2018). However, the typical setting of current MT benchmarks as popularized by recent WMT competitions (Bojar et al., 2019) features a mismatch between training and test sets, as opposed to a mismatch between source and target domains. In that setting, vast amounts of target monolingual data in the domain of the test set can be leveraged very effectively by back-translation. Unfortunately, back-translation is much less effective when dealing with STDM, as we will show in §5.1.1.
There has been some work attempting to make better use of source-side monolingual data, as this is in-domain with the text we would like to translate at test time. Ueffing (2006) proposed to improve a statistical MT system using self-training (Yarowsky, 1995), a direction later pursued by Zhang and Zong (2016) for neural MT. In our work, we also consider this baseline approach, with a few important differences (He et al., 2019): i) we train all parameters of the model as opposed to just the encoder parameters, ii) we apply noise to the input, and iii) we use it in an iterative fashion as in the algorithm originally proposed by Yarowsky (1995). Furthermore, we show consistent improvements when combining self-training with back-translation.

The STDM Problem
In this section we formalize the definition of source-target domain mismatch (STDM). This is an intrinsic property of the data which is independent of the particular machine translation system under consideration.
Machine translation systems are often trained on several datasets, which may contain either parallel or monolingual data. In this work, we assume access to only small quantities of parallel data, but relatively large quantities of monolingual data in the source and target languages. See Fig. 1 for a toy illustration. We denote by M_S and M_T the source and target monolingual data, respectively.
Furthermore, we assume that there are two distinct domains: the domain of the source language, D_S, and the domain of the target language, D_T. We make the natural assumption that text originating in the source language belongs to the source domain and that text originating in the target language belongs to the target domain.
We are interested both in translating from the source to the target language, and vice versa. Accordingly, we assume that a portion of the parallel data originates in the source language, while the remainder originates in the target language. We denote by P_S the portion of the parallel dataset originating in the source language, and by P_T the one originating in the target language.
Our assumption on the existence of source and target domains is expressed by: {P_S, M_S} ⊂ D_S and {P_T, M_T} ⊂ D_T. Thus, even if we were to translate perfectly, i.e., if we could disregard the artifacts introduced by the translation process (Baker, 1993; Zhang and Toral, 2019; Toury, 2012), the distributional properties of source-originating text would differ from those of target-originating text. As mentioned in §2, even "comparable" corpora are affected by such domain mismatch to some extent.
The goal of a machine translation system trained on such data is to accurately translate text originating in the source language and belonging to the source domain, and vice versa for the reverse direction.
The questions we aim to answer with this study are:
1. Is STDM hampering the performance of back-translation?
2. Is self-training useful in the presence of STDM?
3. Is target-domain data useful to improve translation of source-domain data?
4. Is there a controlled setting that lets us assess how an algorithm works as a function of the degree of STDM?

Baseline Algorithms
In this section we review the basic learning algorithms we have considered. In this work, we use a state-of-the-art neural machine translation (NMT) system based on the transformer architecture (Vaswani et al., 2017) with subword vocabularies learned via byte-pair encoding (BPE) (Sennrich et al., 2015). However, our analysis is not architecture-specific and we believe it extends to other systems as well.
The most basic model is trained on the parallel data using a token-level cross-entropy loss with label smoothing (Szegedy et al., 2016) and dropout (Srivastava et al., 2014), as is standard practice in the field. Next, we describe approaches that can leverage monolingual data as well.

Back-Translation (BT)
Back-translation (BT) (Sennrich et al., 2015) is a popular and effective data augmentation technique that leverages target-side monolingual data. The algorithm proceeds in three steps. First, a reverse machine translation system is trained from target to source using the provided parallel data: θ_rev = argmax_θ E_{(x,y)∼P} [log p(x | y; θ)]. Then, the reverse model is used to translate the target monolingual data: x̂_i ≈ argmax_z p(z | t_i; θ_rev), for i = 1 ... N_T and t_i ∈ M_T; the maximization is typically approximated by beam search. Finally, the forward model is trained over the concatenation of the original parallel data and the back-translated pairs {(x̂_i, t_i)}. In practice, the parallel data is weighted more in the loss, with a weight selected via hyper-parameter search on the validation set.
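As a rough sketch, the three BT steps can be written out in code. The names `train_reverse_model` and `back_translate` are our own illustrative stand-ins: a lookup table plays the role of the trained reverse transformer, and replication of the parallel pairs stands in for the loss weight.

```python
def train_reverse_model(parallel):
    """Toy stand-in for training a target-to-source model on (x, y)
    pairs: memorize the pairs and echo the stored source sentence."""
    lookup = {y: x for x, y in parallel}
    return lambda y: lookup.get(y, "<unk>")

def back_translate(parallel, target_mono, parallel_weight=2):
    # Step 1: train the reverse (target -> source) model on parallel data.
    reverse = train_reverse_model(parallel)
    # Step 2: translate the target monolingual sentences into the
    # source language, pairing each with its genuine target sentence.
    synthetic = [(reverse(t), t) for t in target_mono]
    # Step 3: train the forward model on parallel + synthetic data,
    # upweighting the genuine parallel pairs (here by replication).
    return parallel * parallel_weight + synthetic

parallel = [("bonjour", "hello"), ("merci", "thanks")]
train_set = back_translate(parallel, ["hello", "goodbye"])
```

Note that each synthetic pair has a possibly noisy source side but a genuine, fluent target side, which is why BT helps fluency yet inherits the target domain.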
BT provides several benefits in practice: (1) since parallel datasets are typically small, augmenting the training set with large quantities of BT data improves generalization; (2) when the target-side monolingual data and the test set are from the same domain, BT helps adapt to the domain of the test data; (3) BT improves the model's fluency (Edunov et al., 2018, 2019).
In the context of STDM, however, BT has a potential weakness. Even if the reverse model were to produce perfect translations, the back-translated data belongs to the target domain, and it is therefore out-of-domain relative to the data we wish to translate, i.e., source sentences belonging to the source domain. We will verify this conjecture empirically in §5.1.1.

Self-Training (ST)
Self-training (ST) (Yarowsky, 1995), shown in Alg. 1, is another method for data augmentation, one that instead leverages monolingual data on the source side. First, a baseline forward model is trained on the parallel data (line 4). Second, this initial model is applied to the source monolingual data (line 6). Finally, the forward model is re-trained from random initialization by augmenting the original parallel dataset with the forward-translated data. As with BT, the parallel dataset receives more weight in the loss.
A potential benefit of this approach is that the synthetic parallel data added to the original parallel dataset is in-domain, as it comes from the source monolingual data. This is a crucial advantage of ST over BT in the STDM setting. A potential shortcoming is that the model may reinforce its mistakes, since the synthetic targets are produced by the model itself.
We introduce two methods to mitigate this issue. First, we make the algorithm iterative and add only the examples for which the model was most confident (line 3, the loop at line 5, and line 7, where we select the sentences with the largest average per-token log-probability). Second, we inject noise into the input sentences to further improve generalization (line 8).
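A minimal sketch of the iterative procedure follows; `score_fn`, `translate_fn` and `noise_fn` are hypothetical stand-ins for the forward model's average per-token log-probability, its beam-search output, and the input-noising function of Alg. 1.

```python
def self_train(parallel, source_mono, schedule, score_fn, translate_fn, noise_fn):
    """Iterative self-training sketch (Alg. 1): at iteration t, keep only
    the schedule[t] most confident forward translations."""
    train_set = list(parallel)
    for budget in schedule:  # A_1 < ... < A_k, growing each iteration
        # forward-translate and score every source monolingual sentence
        scored = sorted(((score_fn(s), s) for s in source_mono), reverse=True)
        top = [s for _, s in scored[:budget]]  # most confident subset
        # re-train from scratch on parallel data plus noised synthetic pairs
        train_set = list(parallel) + [(noise_fn(s), translate_fn(s)) for s in top]
        # (a real implementation would update score_fn/translate_fn here)
    return train_set

# toy run: confidence = sentence length, "translation" = uppercasing
result = self_train(
    parallel=[("src", "tgt")],
    source_mono=["aaa", "b", "cc"],
    schedule=[1, 2],
    score_fn=len,
    translate_fn=str.upper,
    noise_fn=lambda s: s,
)
```

The growing budget means early iterations train only on the translations the model is most sure about, limiting the reinforcement of its own mistakes.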

Combining BT and ST
BT and ST are clearly complementary to each other. The former has the advantage of always using correct targets, but its synthetic data is out-of-domain when there is STDM. The latter has the advantage of using in-domain source sentences, but its synthetic targets may be inaccurate. We therefore consider their combination as an additional baseline approach.
The combined learning algorithm proceeds in three steps. First, we train an initial forward and reverse model using the parallel dataset. Second, we back-translate target-side monolingual data using the reverse model (see §4.1) and iteratively forward-translate source-side monolingual data using the forward model (see §4.2 and Alg. 1). We then retrain the forward model from random initialization using the union of the original parallel dataset, the synthetic back-translated data, and the synthetic forward-translated data from the last iteration of the ST algorithm.

Results
In this section, we first introduce a controlled setting to study STDM and report a detailed analysis of the influence of various factors, such as the extent to which target-originating data is out-of-domain and the effect of monolingual data size.
We then report experiments on genuine low-resource language pairs, namely English-Myanmar and English-Nepali, and conclude with an ablation study on ST.

Controlled Setting
It is not obvious how to measure STDM. Particularly for low-resource language pairs, there is often not enough data translated from the source domain to compute meaningful statistics. Even if we had sufficient parallel data, it would be difficult to factor out the effect of translationese from pure source-target domain mismatch. Accordingly, we introduce a synthetic benchmark that enables us to finely control the domain of the target-originating data, and therefore the amount of STDM. The key idea of this controlled setting is to build the target-originating data, which comprises half of the parallel training data plus the target-side monolingual data (see Fig. 1), as a convex combination of training data from two sufficiently different domains. In this work we use EuroParl (Koehn, 2005) as our source-originating data, while our target-originating data contains a mix of data from EuroParl and OpenSubtitles (Lison and Tiedemann, 2016); see Fig. 2 for an illustration.
Specifically, we consider a French to English translation task with a parallel dataset composed of 10,000 sentences from EuroParl (which originate in French) and 10,000 sentences from the target domain (which originate in English). The source monolingual data consists of N_S sentences from EuroParl (not overlapping with the parallel set), while the target monolingual data consists of N_T sentences from the target domain. A fraction α ∈ [0, 1] of the target-originating data is taken from EuroParl and the rest from OpenSubtitles. The test set comprises sentences all in French, originating from the EuroParl source domain.
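The mixing scheme can be sketched as follows; `target_originating` is a toy helper of our own, with list slicing standing in for sampling from each corpus.

```python
def target_originating(europarl, opensubtitles, alpha, n):
    """Build n target-originating sentences: a fraction alpha from
    EuroParl (the source domain) and 1 - alpha from OpenSubtitles."""
    n_in = int(round(alpha * n))  # in-domain portion
    return europarl[:n_in] + opensubtitles[:n - n_in]

europarl = [f"ep_{i}" for i in range(10)]
opensub = [f"os_{i}" for i in range(10)]
mixed = target_originating(europarl, opensub, alpha=0.25, n=8)
```

At α = 0 the function returns OpenSubtitles sentences only (maximal STDM); at α = 1 it returns EuroParl sentences only (no mismatch).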
For instance, when α = 0, the target domain is totally out-of-domain with respect to the source domain. The parallel dataset has equal proportions of EuroParl and OpenSubtitles sentences; the source monolingual dataset is all from EuroParl, while the target monolingual dataset is all from OpenSubtitles. This is the most extreme case of STDM, as depicted on the right-hand side of Fig. 1.
When α = 1, the target domain matches the source domain perfectly and all the data comes from EuroParl. For intermediate values of α, the target domain only partially overlaps with the source domain. In other words, α lets us precisely control the amount of STDM.
We perform hyper-parameter search for the model architectures and BPE size on the validation set. For all the experiments in this controlled setting, we use a 5-layer transformer architecture with 8M parameters when training on datasets with fewer than 300K parallel sentences, and a bigger 5-layer transformer architecture with a total of 110M parameters when training on bigger datasets. The BPE size is 5,000. We report SacreBLEU (Post, 2018) for both languages.

Varying Amount of Domain Mismatch
In our first experiment, reported in Fig. 3, we benchmark our baseline approaches while varying α (see §5.1), which controls the overlap between the source and target domain.
First, we observe that increasing α improves the performance of all methods according to BLEU (Papineni et al., 2002). Second, there is a big gap between the baseline trained using the parallel data only and the methods which leverage monolingual data. Third, combining ST and BT works better than each individual method, showing that these approaches are indeed complementary. Finally, BT works better than ST, but the gap shrinks as the target domain becomes increasingly different from the source domain (small values of α). In the extreme case of STDM, α = 0, ST actually outperforms BT. In fact, we observe that the gain of BT over the baseline decreases as α decreases (notice that the amount of monolingual and parallel data is constant across all these experiments). Therefore, BT does suffer when there is strong STDM.

Varying the amount of monolingual data
We next explore how the quantity of monolingual data affects performance, and whether the relative gain of ST over BT at α = 0 disappears as BT is provided with more monolingual data. The experiment in Fig. 4 shows that a) the gain in BLEU tapers off exponentially with the amount of data (notice the log scale on the x-axis), b) for the same amount of monolingual data, ST is always better than BT and by roughly the same amount, and c) BT would require about 3 times more target monolingual data (which is out-of-domain) to match the performance of ST.

Varying the amount of in-domain data
We now explore whether, in the presence of extreme STDM (α = 0), it may be worth restricting the training data to contain only in-domain source-originating sentences (with the notation introduced in §3, P_S and M_S). In Fig. 5, we compare the restricted and unrestricted settings for various combinations of parallel, BT and ST training. Across all settings, we find that it is better to include the out-of-domain data originating on the target side (green bars) than to use only the in-domain source-originating data (blue bars). It appears that, particularly in the low-resource settings considered here, neural models benefit from all available data, even if this data is out-of-domain.
Next, we fix the quantity of parallel training data at 20,000 sentences and explore whether there exists an optimal ratio of in-domain to out-of-domain parallel data in the presence of extreme STDM (α = 0), keeping the target monolingual data unchanged and composed only of OpenSubtitles sentences. It is not obvious a priori what the optimal ratio may be, particularly when applying back-translation, which could be made more effective by training the reverse model with some target-domain parallel data.
Following our earlier synthetic setting, we introduce a hyper-parameter β ∈ [0, 1] which controls the ratio between source-domain (in-domain) and target-domain (out-of-domain) parallel data. We again consider EuroParl to be our source domain and OpenSubtitles to be our target domain, with a parallel dataset containing 20,000 sentences and 900,000 target-domain monolingual sentences. When β = 0, all parallel data comes from OpenSubtitles; when β = 1, all parallel data comes from EuroParl.

Fig. 6 shows that the best way to compose the parallel data is to take all sentences from EuroParl (β = 1) when translating from French to English (blue curves). At high values of β, we observe a slight decrease in accuracy for models trained only on back-translated data (dotted line), confirming that BT loses its effectiveness when the reverse model is trained on out-of-domain data. However, this is compensated by the gains brought by the additional in-domain parallel sentences (dashed line). In the more natural setting in which the model is trained on both parallel and back-translated data (dash-dotted line), we see a monotonic improvement in accuracy with β, with optimal accuracy reached at β = 1 (i.e., all parallel data is in-domain).
A similar trend is observed in the other direction (English to French, red lines). Therefore, if the goal is to maximize translation accuracy in both directions, an intermediate value of β (≈ 0.5) is more desirable. This is the setting we used previously in §5.1.1 and §5.1.2. Note that the performance of the English to French model trained on parallel data drops at β = 0, even though it has more in-domain parallel data than at β = 0.25. This is because the OpenSubtitles dataset has shorter sentences on average, so the parallel data contains fewer tokens as β decreases, which negatively affects model performance.

Low-Resource MT
With the findings from the controlled-setting experiments in hand, we test our approaches on low-resource language pairs, namely English-Myanmar and English-Nepali. Myanmar and Nepali are spoken in regions with a unique local context that is very distinct from English-speaking regions, making these two language pairs a good real-world use case for studying the STDM setting.

English-Myanmar
For English to Myanmar we use the parallel data provided in the WAT 2019 competition (Nakazawa et al., 2019), which consists of two datasets. The Asian Language Treebank (ALT) corpus (Thu et al., 2016; Ding et al., 2018, 2019) has 18,088 training sentences, 1,000 validation sentences and 1,018 test sentences from English-originating news articles. The UCSY dataset contains 204,539 sentences from various domains, including news articles and textbooks. The test set is taken from the ALT dataset.
For English monolingual data, we use the 2018 Newscrawl dataset provided by WMT, whose domain is also news. We apply the fastText classifier (Joulin et al., 2017) over the individual sentences to filter out non-English sentences. For Myanmar monolingual data, we use the language-split Commoncrawl data from Buck et al. (2014), which includes text in various domains crawled from the web. We use the myanmar-tools library to classify and convert all Zawgyi text to Unicode. We use 5M unique English sentences and 100K unique Myanmar sentences as our monolingual data.
To summarize, and comparing with our idealized setting of Fig. 1, this dataset has a small in-domain parallel dataset from English to Myanmar (ALT), an out-of-domain parallel dataset (UCSY), a small out-of-domain monolingual corpus in Myanmar, and a large monolingual corpus in English which is in-domain with ALT. Therefore, a priori we would expect ST to be more useful than BT when translating from English to Myanmar.
The model architecture and BPE size are selected by hyper-parameter search on the ALT validation set. We use a 5-layer transformer architecture with 42M parameters for the model trained on parallel data only, and a 6-layer transformer architecture with 186M parameters for models trained with both parallel and monolingual data. The BPE size is 10,000. We report system performance on the ALT test set following the same evaluation protocol as the WAT 2019 English-Myanmar subtask; see Chen et al. (2019) for more details on the BLEU calculation.
In Table 1, we observe that back-translation barely outperforms the baseline model, by +0.6 BLEU points, while self-training improves it by +2.5 BLEU points. This is because the source-side monolingual data is in-domain with the test set, and we have more source-side monolingual data than target-side monolingual data. We also observe that combining self-training and back-translation outperforms each individual method only slightly.

English-Nepali
We collect an English-Nepali parallel dataset by selecting sentences from public posts in English and Nepali and translating these sentences into the other language. This dataset is composed of 40,000 sentences originating in Nepali and only 7,500 sentences originating in English. We also have 1.8M monolingual sentences in Nepali and 1.8M monolingual sentences in English, also collected from public posts. This dataset is remarkably similar to our idealized setting on the right-hand side of Fig. 1, except that the two portions of the parallel dataset are grossly uneven.
The model architecture and BPE size are selected by hyper-parameter search on the validation set. We use a 5-layer transformer architecture with 39M parameters when training on parallel data alone, and a 6-layer model with 131M parameters when training on the parallel dataset augmented with synthetic data. The BPE size is 5,000 and we report tokenized BLEU scores for both languages.
We consider the translation task in both directions and report the results in Table 2. We find that combining ST and BT works better than each individual method in both directions, particularly when ST and BT have comparable performance, showing that these approaches are indeed complementary.

Ablation Study on ST
We conduct an ablation study to understand the effect of iterative training and of adding source-side noise to self-training. In particular, we consider a parallel dataset with 10,000 sentences from EuroParl and 10,000 sentences from OpenSubtitles. We also have 900,000 source monolingual sentences available for self-training. We perform four iterations of self-training, gradually increasing the number of top-scoring examples selected for training at each iteration.
In Table 3, we observe that iterative self-training performs better than the original self-training, showing the advantage of adding the training examples for which the model was most confident. Moreover, adding source-side noise to iterative self-training further improves the BLEU score by +0.6 points. Therefore, injecting source-side noise while doing iterative self-training is the setting yielding the best performance for ST.

Final Remarks & Perspectives
In this work we introduced the problem of source-target domain mismatch in machine translation. Echoing prior work in the sociolinguistic literature (Leech and Fallon, 1992; Bernardini and Zanettin, 2004), this problem is inherent to the translation task, and it is even more prominent for low-resource language pairs, for which differences in the local context are more pronounced. While the dominant approach to building machine translation corpora has been centered on making corpora comparable, we argue that using the natural distribution of the text data in each language is important if we are targeting translation of organic content produced on social platforms, blogs and even news outlets. In other words, the non-comparability of parallel and monolingual corpora is an important feature of this task; it should be made explicit and it should be taken into account when designing machine translation models.
We introduced a simple controlled setting to study STDM and tested several baseline approaches. We found that ST can perform better than BT when the target monolingual data is scarce or out-of-domain relative to the source domain. In general, the two approaches are complementary to each other and they can easily be combined. Finally, we tested these approaches on truly low-resource language pairs, reporting encouraging improvements over the baseline methods.
Looking forward, there are several directions worth future investigation. First, there is a need for a better characterization of STDM, a better understanding of its causes and effects, and possibly for measuring its prominence on a given dataset while factoring out (or accounting for) the effect of translationese from domain mismatch. Second, the approaches we introduced are merely baselines, and they clearly underperform when there is severe STDM. Better algorithms leveraging source-side monolingual data are required to make strides in this setting. Finally, the community needs to build more benchmarks exhibiting these natural phenomena, which are particularly relevant for low-resource language pairs.

Figure 1 :
Figure 1: Illustration of domain mismatch in machine translation. Each block corresponds to a dataset. Filled blocks represent data naturally occurring in a certain language; empty blocks are human translations. Blocks in the top row are in the source language, blocks in the bottom row are in the target language. Blue blocks are in domain A, red blocks are in domain B. Left: in the traditional setting (here over-simplified), parallel data in both directions belongs to the same domain, dubbed domain A, while the test set may be in another domain, domain B. For high-resource language pairs, we typically possess monolingual or small parallel datasets in domain B, enabling standard domain-adaptation approaches (e.g., back-translation). Right: in the source-target domain mismatch setting, typical of low-resource language pairs, the parallel and monolingual data originating in the source language belong to domain A, while the parallel and monolingual data originating in the target language belong to domain B. At test time, we ask for a translation of a sentence originating in the source language and belonging to domain A. In this scenario, back-translation is less effective and self-training may be used to improve translation quality.

Data: a parallel dataset P = {(x_i, y_i)}_{i=1...N_P} and a source monolingual dataset M_S = {s_i}_{i=1...N_S};
Noise: let n(x) be a function that adds noise to the input by dropping, swapping and blanking words;
Hyper-params: let k be the number of iterations and A_1 < ... < A_k ≤ N_S be the number of samples to add at each iteration;
Train a forward model: θ_fwd = argmax_θ E_{(x,y)∼P} [log p(y | x; θ)];
for t in [1 ... k] do
  forward-translate data: (ŷ_i, v_i) ≈ argmax_z p(z | s_i; θ_fwd), for i = 1 ... N_S and s_i ∈ M_S, where v_i is the score of the i-th example;
  let I be the index set of the top-A_t highest-scoring examples according to v_i;
  re-train the forward model: θ_fwd = argmax_θ E_{(x,y)∼Q} [log p(y | x; θ)] with Q = P ∪ {(n(s_j), ŷ_j)}_{j∈I} and s_j ∈ M_S;
end
Algorithm 1: Self-training learning algorithm.
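The noise function n(x) of Alg. 1 can be sketched as below; the probabilities, the `<blank>` token, and the swap window are illustrative values of ours, not the paper's.

```python
import random

def add_noise(sentence, p_drop=0.1, p_blank=0.1, k_swap=3, rng=None):
    """Sketch of n(x): drop words, blank words, and locally shuffle,
    with each word moving at most k_swap positions."""
    rng = rng or random.Random()
    words = sentence.split()
    # drop each word with probability p_drop (keep at least one word)
    words = [w for w in words if rng.random() >= p_drop] or words[:1]
    # replace each surviving word with a blank token with probability p_blank
    words = ["<blank>" if rng.random() < p_blank else w for w in words]
    # local swap: perturb each index by a bounded random offset, then sort
    keyed = [(i + rng.uniform(0, k_swap), w) for i, w in enumerate(words)]
    return " ".join(w for _, w in sorted(keyed))

noisy = add_noise("the quick brown fox jumps", rng=random.Random(0))
```

The bounded-offset shuffle mirrors a common word-noising trick: each word's position is jittered by at most k_swap, so the sentence is perturbed but remains locally recognizable.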

Figure 2 :
Figure 2: Illustration of the controlled setting varying the amount of mismatch between the source and the target domain. The source domain is taken from EuroParl. The target domain is: α EuroParl + (1 − α) OpenSubtitles. By varying α, we vary the amount of mismatch.

Figure 3 :
Figure 3: BLEU score in Fr-En as a function of the degree to which the target-originating data is in-domain. α = 0: fully out-of-domain; α = 1: fully in-domain.

Figure 4 :
Figure 4: BLEU in Fr-En as a function of the amount of monolingual data when there is extreme STDM (α = 0).

Figure 5 :
Figure 5: BLEU in Fr-En for various learning algorithms comparing the case where we use only source originating in-domain data (blue bars) and when we also add out-of-domain target originating data, with α = 0.

Figure 6 :
Figure 6: BLEU score as a function of the proportion of parallel data originating in the source domain. When β = 0 all parallel data originates from OpenSubtitles (out-of-domain); when β = 1 all parallel data originates from EuroParl (in-domain). The blue curves show BLEU in the forward direction (Fr-En translation of EuroParl data). The red curves show BLEU in the reverse direction (En-Fr translation of OpenSubtitles sentences). The three curves show BLEU for models trained using only parallel data, only synthetic back-translated data, and the union of the two.

Table 1 :
BLEU scores for the English to Myanmar translation task.

Table 3 :
Iterative self-training with source-side noise yields a better BLEU score.