On the Language Coverage Bias for Neural Machine Translation

Language coverage bias, which refers to the content-dependent differences between sentence pairs originating from the source and target languages, is important for neural machine translation (NMT) because target-original training data is not well exploited in current practice. By carefully designing experiments, we provide comprehensive analyses of the language coverage bias in the training data, and find that using only the source-original data achieves comparable performance with using the full training data. Based on these observations, we further propose two simple and effective approaches to alleviate the language coverage bias problem by explicitly distinguishing between the source- and target-original training data, which consistently improve performance over strong baselines on six WMT20 translation tasks. Complementary to the translationese effect, language coverage bias provides another explanation for the performance drop caused by back-translation. We also apply our approach to both back- and forward-translation and find that mitigating the language coverage bias can improve both of these representative data augmentation methods as well as their tagged variants.


Introduction
In recent years, there has been growing interest in investigating the effect of the original languages of parallel data on neural machine translation (Barrault et al., 2020; Edunov et al., 2020; Marie et al., 2020). Several studies have shown that target-original test examples can lead to distortions in automatic and human evaluations and should be omitted from machine translation test sets (Barrault et al., 2019; Zhang and Toral, 2019; Graham et al., 2020), and that target-original training data benefits NMT models less than source-original data (Edunov et al., 2020; Marie et al., 2020). These works attribute such phenomena to the fact that human-translated texts (i.e., translationese) exhibit formal and stylistic differences that set them apart from texts originally written in that language (Baker et al., 1993; Volansky et al., 2015; Zhang and Toral, 2019).
Complementary to the translationese bias, which is content-independent (Volansky et al., 2015), we identify another important problem, namely language coverage bias, which refers to the content-dependent differences between data originating from different languages. These differences stem from the diversity of regions and cultures. While the degree of translationese bias varies across translators (Toral, 2019), language coverage bias is an intrinsic bias between the source- and target-original data, which is hardly affected by the ability of the translator. Figure 1 shows an example, where the contents of English- and German-original texts differ significantly due to language coverage bias.
To investigate the effect of language coverage bias in the training data on NMT models, we propose an automatic method to identify the original language of each training example, which is generally unknown in practical corpora. Experimental results on three large-scale translation corpora show that there is a significant performance gap between NMT models trained on the source- and target-original data, which have different vocabulary distributions, especially for content words. Since the target-original training data performs poorly in translating content words, using only the source-original data achieves comparable performance with using the full training data. These findings motivate us to explore other data utilization methods rather than indiscriminately mixing the source- and target-original training data.
We propose to alleviate the language coverage bias problem by explicitly distinguishing between the source- and target-original training data. Specifically, two simple and effective methods are employed: bias-tagging and fine-tuning. Experimental results show that both approaches consistently improve performance on six WMT20 translation tasks. Language coverage bias also provides another explanation for the failure of back-translation on source-original test data, complementary to the translationese effect (Marie et al., 2020). We further validate our approach in the monolingual data augmentation scenario, where the language coverage bias problem is more severe due to the newly introduced monolingual data.

Contributions
The main contributions of our work are as follows:

• We demonstrate the necessity of studying language coverage bias for NMT, and identify that using the target-original data can cause poor translation adequacy on content words.

• We address the language coverage bias induced by the target-original data by explicitly distinguishing the original languages, which significantly improves translation performance on six WMT20 translation tasks.

• We show that alleviating the language coverage bias also benefits monolingual data augmentation, improving both back- and forward-translation as well as their tagged variants (Caswell et al., 2019).

Experimental Setup
Data We conducted experiments on six WMT20 benchmarks (Barrault et al., 2020), namely the English⇔German (En⇔De), English⇔Chinese (En⇔Zh), and English⇔Japanese (En⇔Ja) news translation tasks. The preprocessed training corpora contain 41.0M, 21.8M, and 13.0M sentence pairs for En⇔De, En⇔Zh, and En⇔Ja, respectively. We used the monolingual data publicly available in WMT20 to train the proposed original language detection model (Section 3.1) and for data augmentation (Section 4.2). The Appendix lists details of the data preprocessing. For En⇔De and En⇔Zh, we used newstest2019 as the validation sets. For En⇔Ja, we split the official validation set released by WMT20 into two parts by original language and used only the corresponding part for each direction. We used newstest2020 as the test sets for all six tasks. We report SacreBLEU (Post, 2018), as recommended by WMT20.
Model We used the Transformer-Big (Vaswani et al., 2017) model, which consists of a 6-layer encoder and a 6-layer decoder with a hidden size of 1024. Recent studies showed that training on large batches can further boost model performance (Wu et al., 2018). Accordingly, we followed their settings and trained models with batches of approximately 460k tokens. Please refer to the Appendix for more details on model training. We followed Ng et al. (2019) in using the Transformer-Big decoder as our language model architecture; these language models are used to detect the original language and to measure translation fluency, and are also trained with large batches.

Observing Language Coverage Bias
In this study, we first establish the existence of language coverage bias (Section 3.2), and show how the bias affects NMT performance (Section 3.3). To this end, we propose an automatic method to detect the original language of each training example (Section 3.1), which is often not available in large-scale parallel corpora (Riley et al., 2020).

Detecting Original Languages
Detection Method Intuitively, we use large-scale monolingual datasets to estimate the distribution of the contents covered by each language. For each training example, we compare its similarities to the distributions of the source and target languages, based on which we determine its original language. Formally, let $D_s$ and $D_t$ denote the source-side and target-side distributions of the covered contents. Given a training example $\langle x, y \rangle$, the probability that it is covered by each language can be expressed as

$$P(D_s \mid \langle x, y \rangle) \propto P(\langle x, y \rangle \mid D_s)\,P(D_s), \qquad P(D_t \mid \langle x, y \rangle) \propto P(\langle x, y \rangle \mid D_t)\,P(D_t).$$

We use a score function to denote the difference between the two log-probabilities:

$$\mathrm{score}(x, y) = \log P(\langle x, y \rangle \mid D_s) - \log P(\langle x, y \rangle \mid D_t) + c, \quad (1)$$

where $c = \log P(D_s) - \log P(D_t)$ has a constant value once the source and target monolingual datasets are given. Intuitively, examples with higher scores are more likely to be source-original, while those with lower scores are more likely to be target-original. We train language models $\theta^{lm}_s$ and $\theta^{lm}_t$ on the source- and target-language monolingual data to estimate the conditional probabilities:

$$P(\langle x, y \rangle \mid D_s) \approx P(x; \theta^{lm}_s), \qquad P(\langle x, y \rangle \mid D_t) \approx P(y; \theta^{lm}_t).$$

Accordingly, the score can be rewritten as

$$\mathrm{score}(x, y) = \log P(x; \theta^{lm}_s) - \log P(y; \theta^{lm}_t) + c.$$

We label examples as source-original if their scores are positive, and as target-original otherwise. To find a suitable constant for each language pair, we tune the value of $c$ to obtain the best classification performance on the validation sets, where the original languages are known.

Detection Accuracy We evaluated the detection method on the mixture of the test sets of the bidirectional translation tasks in WMT20 for each language pair. For comparison, we re-implemented the CNN-based forward-translation (FT) classifier proposed by Riley et al. (2020). The FT classifier and the language models used in our method were trained on the same monolingual datasets. Table 1 shows that our method outperforms the FT classifier on all language pairs. In addition, our model also outperforms the FT approach at detecting noisy training data, which leads to an improvement in translation performance (please refer to Table 11 in the Appendix for more results).
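To make the scoring procedure concrete, here is a minimal sketch of the labeling step. It substitutes toy unigram language models for the Transformer-Big LMs of Section 2, and the corpora, the default $c = 0$, and all function names are illustrative assumptions rather than the paper's implementation.

```python
import math
from collections import Counter

def train_unigram_lm(corpus):
    """Toy stand-in for the Transformer-Big LMs used in the paper:
    an add-one-smoothed unigram model over the training tokens."""
    counts = Counter(tok for sent in corpus for tok in sent.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 slot for unseen tokens
    def logprob(sentence):
        return sum(math.log((counts[tok] + 1) / (total + vocab))
                   for tok in sentence.split())
    return logprob

# Hypothetical monolingual data standing in for the WMT20 corpora.
lm_src = train_unigram_lm(["the match was played in berlin",
                           "a goal in the final"])
lm_tgt = train_unigram_lm(["das spiel fand in berlin statt",
                           "ein tor im finale"])

def origin_score(x, y, c=0.0):
    """score(x, y) = log P(x; lm_s) - log P(y; lm_t) + c, where c is
    tuned on a validation set with known original languages."""
    return lm_src(x) - lm_tgt(y) + c

# Positive score -> labeled source-original, otherwise target-original.
pair = ("a goal in the final", "ein tor im finale")
label = "source-original" if origin_score(*pair) > 0 else "target-original"
print(origin_score(*pair), label)
```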

Existence of Language Coverage Bias
In this section, we validate the existence of language coverage bias by (1) comparing the performance of NMT models trained on data with different original languages, and (2) directly calculating the divergence between the vocabulary distributions of the source- and target-original data.
Translation Performance Once all the training examples have been assigned a score by the detection method (Eq. (1)), we regard the R% of examples with the highest scores as the most source-original examples, and the R% of examples with the lowest scores as the most target-original examples. We investigate the effect of R% on translation performance, as shown in Figure 2. Clearly, using the most source-original examples significantly outperforms using their target-original counterparts, demonstrating that the source- and target-original data indeed differ greatly from each other. To rule out the effect of data scale, in the following experiments we by default treat the 50% of data with the highest scores as source-original data and the same amount of data with the lowest scores as target-original data.

Table 2: JS divergence of the vocabulary distributions between the source- and target-original data ("S vs T") on the training set of WMT20 En⇔Zh. "All", "Content", and "Function" denote all words, content words, and function words, respectively. For reference, we also report the JS divergence between 50% of examples selected at random and the rest ("Random").
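The ranking-based split described above can be sketched as follows; the scores array is a placeholder for the per-example scores produced by Eq. (1), and the function name is hypothetical.

```python
import numpy as np

def split_by_origin(scores, ratio=0.5):
    """Rank training examples by origin score and return index sets:
    the top `ratio` fraction as (most) source-original and the bottom
    fraction as (most) target-original. With ratio=0.5 the two halves
    are equal-sized, ruling out data-scale effects."""
    order = np.argsort(scores)      # ascending: lowest scores first
    k = int(len(scores) * ratio)
    target_original = order[:k]     # lowest-scoring examples
    source_original = order[-k:]    # highest-scoring examples
    return source_original, target_original

# Hypothetical scores for six training examples.
src_idx, tgt_idx = split_by_origin(np.array([1.2, -0.7, 0.3, -2.1, 0.9, -0.1]))
print(sorted(src_idx), sorted(tgt_idx))
```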
Since some recent works find that BLEU might be affected by the translationese problem (Edunov et al., 2020), we also conducted a side-by-side human evaluation on the Zh⇒En development set, where 500 randomly sampled examples were evaluated by six annotators (agreement (Fleiss, 1971): Fleiss' Kappa = 0.46). 37.0% of the outputs of the model using source-original data are better than those of the model using target-original data, and 21.0% are worse. By manually checking the outputs, we find that using only the target-original data tends to omit important source contents (e.g., named entities), either by ignoring them entirely or by substituting pronouns. The human evaluation thus shows the same trend as the BLEU scores in Figure 2. Given that conducting human evaluation on all six translation tasks is time-consuming and labor-intensive, we use automatic measures to further investigate this problem in Section 3.3.

Vocabulary Distributions
Complementary to previous studies that focus on the content-independent stylistic differences (Volansky et al., 2015) between translationese and original texts (Riley et al., 2020; Edunov et al., 2020; Marie et al., 2020), in this experiment we investigate the content-dependent language coverage bias between the source- and target-original data. Intuitively, if language coverage bias exists, the vocabulary distributions of the source- and target-original data should differ greatly from each other, since the covered topics tend to have different frequencies in the two (D'Alessio and Allen, 2000). We use the Jensen-Shannon (JS) divergence (Lin, 1991) to measure the difference between two vocabulary distributions $p$ and $q$:

$$\mathrm{JS}(p \,\|\, q) = \frac{1}{2}\mathrm{KL}(p \,\|\, m) + \frac{1}{2}\mathrm{KL}(q \,\|\, m), \qquad m = \frac{1}{2}(p + q),$$

where $\mathrm{KL}(\cdot\|\cdot)$ is the KL divergence (Kullback and Leibler, 1951) of two distributions.

Table 2 shows the JS divergence of the vocabulary distributions between the source- and target-original data. We also divide the words into content words and function words based on their POS tags, since content words are more related to language coverage bias, while function words are more related to the stylistic and structural differences between translationese and original texts (Lembersky et al., 2011; Volansky et al., 2015). The JS divergence between the source- and target-original data is 186× larger than that between randomly split data, which is mainly due to the difference in content words. Results for different ratios R% and other language pairs can be found in the Appendix (Tables 12 and 13); the trend holds in all cases, supporting our claim of the existence of language coverage bias.
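A small sketch of this measurement is given below, assuming whitespace-tokenized corpora and unsmoothed relative frequencies (details the paper does not specify):

```python
import math
from collections import Counter

def vocab_dist(corpus):
    """Relative-frequency vocabulary distribution of a tokenized corpus."""
    counts = Counter(tok for sent in corpus for tok in sent.split())
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

def kl(p, q):
    # KL(p || q); terms with p(w) = 0 contribute nothing.
    return sum(pw * math.log(pw / q[w]) for w, pw in p.items() if pw > 0)

def js(p, q):
    # JS(p || q) = 1/2 KL(p || m) + 1/2 KL(q || m), with m = (p + q) / 2.
    vocab = set(p) | set(q)
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical "source-original" vs. "target-original" snippets.
p = vocab_dist(["the hairy crab festival in bacheng"])
q = vocab_dist(["the football match in berlin"])
print(js(p, q))
```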

Effect of Language Coverage Bias
In this section, we investigate the effect of language coverage bias on NMT models.
Using only the source-original data achieves comparable performance with using the full data. Table 3 lists the translation performance of NMT models trained on only the source- or target-original data and on both. The results show that using only the source-original data significantly outperforms using only the target-original data on all language pairs, which reconfirms the necessity of studying language coverage bias for NMT. It should be emphasized that using only the source-original data (i.e., 50% of the whole training set) achieves translation performance on par with using the full training data. In the following experiments, we investigate why additionally using the target-original data cannot further improve performance.
Using additional target-original data does not consistently improve translation adequacy. To rule out the effect of translationese and focus on the content-dependent differences caused by language coverage bias, we examine the translation adequacy of content words in Table 4 (due to the space limit we only list results on En⇔Zh; please refer to Table 14 in the Appendix for the other language pairs). We follow Raunak et al. (2020) in using the F-measure (Neubig et al., 2019) to quantify the translation accuracy of specific types of words.
Compared with the source-original data, using only the target-original data greatly reduces the translation accuracy of content words, which we attribute to the divergence of the content word distributions between the source- and target-original data. The results also indicate that indiscriminately using all the training data cannot consistently improve the translation adequacy of content words over using only the source-original data, and in some cases using all the data is even harmful to adequacy on content words. Table 5 shows an example, which suggests that using only the target-original data tends to omit content words. This problem is potentially caused by the fact that some source-side content words are rarely or never seen in the target-original data, so indiscriminately adding target-original data induces a shift in the content word distribution.
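The adequacy measure follows the word F-measure of Neubig et al. (2019); below is a simplified sketch restricted to a given content-word set, using clipped bag-of-words matching (an assumption; the toolkit's exact bucketing is more involved, and the word list and example are hypothetical).

```python
from collections import Counter

def content_word_fmeasure(hypotheses, references, content_words):
    """Corpus-level F-measure over content-word occurrences, where
    matches are per-sentence clipped counts of shared content words."""
    matched = hyp_total = ref_total = 0
    for hyp, ref in zip(hypotheses, references):
        h = Counter(w for w in hyp.split() if w in content_words)
        r = Counter(w for w in ref.split() if w in content_words)
        matched += sum((h & r).values())   # clipped overlap
        hyp_total += sum(h.values())
        ref_total += sum(r.values())
    precision = matched / hyp_total if hyp_total else 0.0
    recall = matched / ref_total if ref_total else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

# Toy example mirroring Table 5: the content word "Bacheng" is omitted.
content = {"crabs", "Bacheng", "spokesmen"}
print(content_word_fmeasure(
    ["It is one of the city 's most well-known image spokesmen ."],
    ["Hairy crabs are the most well-known image spokesmen of Bacheng ."],
    content))
```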
Using additional target-original data only slightly improves structural fluency. Recently, Edunov et al. (2020) claimed that using additional back-translated data can improve translation fluency. Target-original bilingual data is similar to back-translated data, since both are constructed by translating sentences from the target language into the source language. One question naturally arises: can target-original bilingual data improve the fluency of NMT models?

Table 5: Example outputs of models trained on the target-original data, the source-original data, and both.
Refer.        The hairy crab is the most famous image spokesperson in Bacheng.
Target Orig.  It is one of the city's most well-known image spokesmen.
Source Orig.  Hairy crabs are the most well-known image spokesmen of Bacheng.
Both          It is the best-known icon of Bacheng.
To answer the above question, we measure the fluency of model outputs with the language models trained on monolingual data as described in Section 2. A previous study found that differences in perplexity can be caused by specific contents rather than structural differences (Lembersky et al., 2011). In particular, some source-original contents have low frequency in the target-language monolingual data (e.g., "Bacheng" in Table 5), so a language model trained on the target-language monolingual data tends to assign higher perplexities to outputs containing more source-original content words. To rule out this possibility and check whether the outputs are structurally different, we follow Lembersky et al. (2011) and abstract away the content-specific features of the outputs, measuring their fluency at the syntactic level. Table 6 shows the results. Although using only the source-original data results in high perplexities as measured by vanilla language models, the perplexities of NMT models trained on the different datasets are close to each other at the syntactic level. Using additional target-original data only slightly reduces the perplexity at the syntactic level compared to using only the source-original data.
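A sketch of the content-word abstraction step, in the spirit of Lembersky et al. (2011): content words are replaced by their POS tags before scoring with a language model. The tiny tag lookup below is a stand-in for a real POS tagger (an assumption for illustration only).

```python
# Content-word POS classes whose tokens are abstracted away; function
# words are kept so that only structural/stylistic signal remains.
CONTENT_TAGS = {"NOUN", "VERB", "ADJ", "ADV", "PROPN"}

# Toy stand-in for a real POS tagger; maps a few words to Universal POS tags.
TOY_TAGGER = {
    "crabs": "NOUN", "spokesmen": "NOUN", "bacheng": "PROPN",
    "hairy": "ADJ", "well-known": "ADJ", "are": "VERB",
    "the": "DET", "most": "ADV", "of": "ADP", "image": "NOUN",
}

def abstract_content_words(sentence):
    """Replace content words by their POS tags; keep function words."""
    out = []
    for tok in sentence.lower().split():
        tag = TOY_TAGGER.get(tok, "NOUN")  # unknown words treated as content
        out.append(tag if tag in CONTENT_TAGS else tok)
    return " ".join(out)

# "Bacheng" and the other content words vanish, so LM perplexity computed
# on the abstracted string reflects syntax rather than topical content.
print(abstract_content_words(
    "Hairy crabs are the most well-known image spokesmen of Bacheng"))
```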

Addressing Language Coverage Bias
In Section 3 we show that the target-original data performs poorly in translating content words due to language coverage bias. Accordingly, simply using the full training data without distinguishing the original languages is sub-optimal for model training. Based on these findings, we propose to address language coverage bias by explicitly distinguishing between the source- and target-original data (Section 4.1). We then investigate whether the performance improvement still holds in the monolingual data augmentation scenario (Section 4.2), where the language coverage bias problem is more severe due to the newly introduced dataset in the source or target language.

Table 6: Translation fluency measured by the perplexities (i.e., PPL) of language models with different levels of lexical abstraction. "Diff." denotes the relative change with respect to "Both". "No Abs." denotes no abstraction (i.e., a vanilla LM); "Cont. Abs." denotes abstracting all content words to their corresponding POS tags. The results are reported on the validation sets.

Bilingual Data Utilization
In this section, we aim to improve bilingual data utilization through explicitly distinguishing between the source- and target-original training data.
Methodology We distinguish the original languages with two simple and effective methods (see the sketch after this list):

• Bias-Tagging: Tagging is a commonly-used approach to distinguishing between different types of examples, such as different languages (Aharoni et al., 2019; Riley et al., 2020) or synthetic vs. authentic examples (Caswell et al., 2019). In this work, we attach a special tag to the source side of each target-original example, which enables NMT models to distinguish them from the source-original ones during training.

• Fine-Tuning: Fine-tuning (Luong and Manning, 2015) is a useful method for transferring knowledge among data from different distributions. We pre-train NMT models on the full training data, which consists of both the source- and target-original data, and then fine-tune them on only the source-original data. For a fair comparison, the total number of training steps in the pre-training and fine-tuning stages equals that of the baseline.
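A minimal sketch of the bias-tagging preprocessing is shown below. The tag string <2TO> is a hypothetical choice (the paper only states that a special tag is attached to the source side of target-original examples), and the corpus is illustrative.

```python
TARGET_ORIGINAL_TAG = "<2TO>"  # hypothetical tag token added to the source vocab

def tag_corpus(pairs, target_original_flags):
    """Prepend a special tag to the source side of every example detected
    as target-original, so the NMT model can tell the two origins apart."""
    tagged = []
    for (src, tgt), is_target_orig in zip(pairs, target_original_flags):
        if is_target_orig:
            src = f"{TARGET_ORIGINAL_TAG} {src}"
        tagged.append((src, tgt))
    return tagged

pairs = [("hairy crabs are famous", "..."), ("it is well-known", "...")]
print(tag_corpus(pairs, [False, True]))
# Presumably no tag is attached at test time, steering decoding toward
# the source-original distribution.
```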
Translation Performance Table 7 shows the results on the benchmark datasets. For comparison, we also list the results of several baselines using the vanilla Transformer architecture trained on the constrained bilingual data in the WMT20 competition (Barrault et al., 2020). Clearly, both the bias-tagging and fine-tuning approaches consistently improve translation performance on all benchmarks, which confirms our claim of the necessity of explicitly distinguishing target-original examples during model training.
Analysis Recent studies have shown that generating human-translation-like texts, as opposed to original-looking texts, can improve the BLEU score (Riley et al., 2020). To validate that our improvement partially comes from alleviating the content-dependent language coverage bias, we examine the translation adequacy of content words on the test sets, as listed in Table 8. The results indicate that explicitly distinguishing between the source- and target-original data improves the translation of content words (e.g., nouns), which is closely related to the language coverage bias problem. Table 9 lists the translation fluency at the syntactic level, where the proposed approaches maintain syntactic fluency.

Monolingual Data Augmentation
In this section, we aim to provide some insights into where monolingual data augmentation improves translation performance, and to investigate whether our approach can further improve model performance in this scenario, which potentially suffers more from the language coverage bias problem. For a fair comparison across language pairs, we augment NMT models with the same English monolingual corpus as described in Section 2. We down-sample the large-scale monolingual corpus to the same amount as the bilingual corpus of each language pair, in order to rule out the effect of the scale of the synthetic data (Fadaee and Monz, 2018). We use back-translation (Sennrich et al., 2016a) to exploit the English monolingual data for the tasks translating from another language to English ("X⇒En"), and forward-translation for the tasks in the opposite direction ("En⇒X").

Table 7: SacreBLEU reported on the WMT20 test sets. "Tag" and "Tune" denote bias-tagging and fine-tuning, respectively. We highlight the highest score in bold and the second-highest score with underlines. "↑/⇑" denotes significantly better than the baseline with p < 0.05 and p < 0.01, respectively. For comparison, we list three systems that use vanilla Transformer models trained on the bilingual data in the WMT20 competition.

The vanilla back-translation (Row 3 in Table 10) harms translation performance on average, while the vanilla forward-translation improves it, which is consistent with the findings of previous studies (Edunov et al., 2020; Marie et al., 2020). Caswell et al. (2019) showed that the tagging strategy works for back-translation but fails for forward-translation, and our results confirm these findings. Both phenomena can be attributed in part to the language coverage bias problem. Back-translated data originates from the target language and thus suffers more from language coverage bias. Accordingly, directly using the back-translated data is sub-optimal, while tagged back-translation recovers translation performance by distinguishing training examples with different origins, which is consistent with our results in Table 7. In contrast, the language coverage bias problem does not arise for source-side monolingual data (i.e., the same original language). Therefore, vanilla forward-translation can improve translation performance, while tagged forward-translation performs worse.
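To make the two augmentation setups concrete, here is a hedged sketch of how the training corpora could be assembled. The function names, the <BT> tag string, and the translate_to_source/translate_to_target callables (standing in for trained reverse and forward NMT models) are all illustrative assumptions; only the tagging scheme itself follows Caswell et al. (2019).

```python
def build_augmented_corpus(bitext, mono_target=(), mono_source=(),
                           translate_to_source=None, translate_to_target=None,
                           tag_bt=True):
    """Assemble training data with back- and/or forward-translation.

    Back-translated pairs are target-original by construction, so they
    suffer from language coverage bias and benefit from a tag; forward-
    translated pairs share the source origin, so they are left untagged."""
    data = list(bitext)
    for tgt in mono_target:                 # back-translation (X=>En setting)
        src = translate_to_source(tgt)      # synthetic source side
        if tag_bt:
            src = "<BT> " + src             # tagged BT (Caswell et al., 2019)
        data.append((src, tgt))
    for src in mono_source:                 # forward-translation (En=>X setting)
        data.append((src, translate_to_target(src)))
    return data

# Usage with a dummy "translator" standing in for a trained reverse model:
print(build_augmented_corpus([("real src", "real tgt")],
                             mono_target=["ein satz"],
                             translate_to_source=lambda s: "a sentence"))
```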

Improving Data Augmentation Our approach (Row 2) achieves improvements in translation performance comparable to those of the monolingual data augmentation approaches (e.g., averaged BLEU: 31.2 vs. 30.7, and 37.6 vs. 37.9), even though we do not use additional monolingual data to train the models. Combining them further improves performance (Rows 5-6), indicating that the two types of approaches are complementary. This is intuitive, since our approach better exploits the bilingual data, while data augmentation introduces new knowledge from additional monolingual data. In addition, our approach consistently improves performance over both the vanilla and tagged augmentation approaches, making it more robust for practical applications across datasets.

Related Work
Our work is inspired by three lines of research in the NMT community.

Table 10: Translation performance of augmenting English monolingual data with different strategies: back-translation for the X⇒En tasks (blue cells) and forward-translation for the En⇒X tasks (red cells). "Tagging" denotes adding a special tag to each synthetic sentence pair (Caswell et al., 2019). "Fine-Tune" denotes fine-tuning the pre-trained NMT models on the source-original bilingual data, as described in Section 4.1.

Translationese
Recently, the effect of translationese on NMT evaluation has attracted increasing attention (Zhang and Toral, 2019; Bogoychev and Sennrich, 2019; Edunov et al., 2020; Graham et al., 2020). Graham et al. (2020) show that source-side translationese texts can potentially lead to distortions in automatic and human evaluations. Accordingly, the WMT competition has used only source-original test sets for most translation directions since 2019.
Our study reconfirms the necessity of distinguishing the source- and target-original examples, and takes one step further by distinguishing examples in the training data. Complementary to previous works, we investigate the effect of language coverage bias on machine translation, which relates to content bias rather than differences in language style. Shen et al. (2021) also reveal the context mismatch between texts from different original languages. To alleviate this problem, they propose to combine back- and forward-translation by introducing additional monolingual data, while we focus on better exploiting the bilingual data by distinguishing the original languages, which is also helpful for back- and forward-translation. Lembersky et al. (2011, 2012) propose to adapt machine translation systems to generate texts that are more similar to human translations, while Riley et al. (2020) propose to model human-translated and original texts as separate languages in a multilingual model and perform zero-shot translation between original texts. Riley et al. (2020) and our work both aim to better utilize the bilingual training data: they aim to guide NMT models to produce original-looking text, while we focus on improving translation adequacy by alleviating the language coverage bias problem.

Data Augmentation
Concerning model training, recent works find that back-translation can harm translation on source-original test sets, and attribute the quality drop to the stylistic, content-independent differences between translationese and original texts (Edunov et al., 2020; Marie et al., 2020). In this work, we empirically show that language coverage bias is another reason for the performance drop of back-translation, as well as for the different behaviors of tagged forward-translation and tagged back-translation (Caswell et al., 2019). In addition, we show that our approach is also beneficial for data augmentation, further improving translation performance over both back- and forward-translation.

Domain Adaptation
Since high-quality, domain-specific parallel data is usually scarce or even unavailable, domain adaptation approaches are generally employed for translation in low-resource domains by leveraging out-of-domain data (Chu and Wang, 2018). Languages can also be regarded as different domains, since articles in different languages cover different topics (Bogoychev and Sennrich, 2019). Starting from this intuition, we distinguish examples with different original languages via tagging (Aharoni et al., 2019) and fine-tuning (Luong and Manning, 2015), which are commonly used in domain adaptation and multilingual translation tasks.
Our work also benefits domain adaptation: distinguishing original languages in general-domain data consistently improves the translation performance of NMT models in several specific domains (Table 16 in the Appendix), making these models better starting points for further domain adaptation.

Conclusion and Future Work
In this work, we first systematically examine why the language coverage bias problem is important for NMT models. We conducted extensive experiments on six WMT20 translation benchmarks. Empirically, we find that source-original and target-original data differ significantly in content, and that indiscriminately using the target-original data together with the source-original data is sub-optimal. Based on these observations, we propose two simple and effective approaches to distinguish the source- and target-original training data, which obtain consistent improvements on all benchmarks.
Furthermore, we link language coverage bias to two well-known phenomena in monolingual data augmentation, namely the performance drop of back-translation and the different behaviors of tagged back-translation and tagged forward-translation. We show that language coverage bias can be considered another reason for these phenomena, and that fine-tuning on the source-original bilingual training data can further improve performance over both back- and forward-translation.
Future directions include exploring advanced methods to better alleviate the language coverage bias problem, as well as validation on other language pairs. It is also interesting to investigate the language coverage bias problem in multilingual translation, where we can better understand this problem by considering language families.

A.3 Effect of Detection Methods on Translation Performance
To further compare our proposed original language detection method with the FT classifier (Riley et al., 2020), we fine-tune the NMT model pre-trained on the whole training set using the source-original data detected by each of the two methods. Note that the two detection methods are developed using the same monolingual datasets. For a fair comparison, the fine-tuning sets of the two methods are of the same size (50% of the whole training set) in this experiment.

A.4 Divergence of Vocabulary Distributions
In this section, we report the JS divergence of the vocabulary distributions in more settings. Table 12 lists the results for different ratios R% on En⇔Zh, and Table 13 shows the results for all language pairs. The results show that the divergence of the vocabulary distributions between the source- and target-original data is substantially larger than that between randomly split data, which reconfirms the existence of language coverage bias.

Table 12: JS divergence (×10⁻⁵) of the vocabulary distributions between the source- and target-original training data ("S vs T") for different labeled ratios on En⇔Zh. For reference, we also report the JS divergence between two sets of randomly selected examples ("Random", non-overlapping).

Table 16: Transformer performance on the validation set of the En⇒Zh task. We split the whole validation set into several parts by domain tag. "Ours" denotes the "Bias-Tagging" approach described in Section 4.1. The results indicate that distinguishing data with different original languages in the general-domain training data can improve the performance of NMT models in many specific domains, making the models better starting points for further domain adaptation.

A.6 Translation Adequacy on Test Sets for Other Language Pairs
We report the translation adequacy on the test sets for En⇔De and En⇔Ja in Table 15, corresponding to Table 8 in the main paper. The results show that explicitly distinguishing the source- and target-original training data consistently improves the translation adequacy of content words on all six translation tasks.

A.7 Translation Performance in Specific Domains
We evaluate NMT models trained with and without explicitly distinguishing between the source- and target-original data in several specific domains. The results, shown in Table 16, suggest that our method can improve the translation performance of NMT models in several specific domains, and can be combined with further domain adaptation approaches.