A Comparison between Pre-training and Large-scale Back-translation for Neural Machine Translation

BERT has been studied as a promising technique to improve NMT. Given that BERT is based on a Transformer architecture similar to that of NMT, and that the datasets for most MT tasks are already rather large, it remains under-explained how pre-training manages to outperform standard Transformer NMT models. We compare MT engines trained with pre-trained BERT and with back-translation using incrementally larger amounts of data, covering the two most widely used monolingual paradigms. We analyze their strengths and weaknesses based on both standard automatic metrics and intrinsic test suites that cover a wide range of linguistic phenomena. Primarily, we find that 1) BERT has limited advantages over large-scale back-translation in accuracy and consistency on morphology and syntax; 2) BERT can boost the Transformer baseline on semantic and pragmatic tasks which involve intensive understanding; 3) pre-training on huge datasets may introduce inductive social bias and thus affect translation fairness.


Introduction
Neural machine translation (NMT) has shown promising results as an end-to-end approach to automatic translation (Sutskever et al., 2014; Bahdanau et al., 2014; Vaswani et al., 2017). One reason for its success is the availability of large amounts of training resources such as high-quality parallel corpora. For low-resource languages or domain-specific settings, monolingual data have also been used effectively by NMT systems (Zhang and Zong, 2016; Siddhant et al., 2020), providing rich linguistic features for translation.
Two lines of work leverage monolingual corpora to improve translation quality. One approach is back-translation (Bojar and Tamchyna, 2011; Sennrich et al., 2016), in which an auxiliary target-to-source system is trained on genuine bitext and then used to generate synthetic source text from a large monolingual corpus on the target side. The synthetic and genuine pairs are then used together to train a source-to-target MT model, as sketched below.
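A minimal sketch of this pipeline, with hypothetical `train_nmt` and `translate`-style helpers (placeholders, not part of any specific toolkit), to make the data flow explicit:

```python
# Schematic back-translation pipeline (illustrative only; helper functions are
# placeholders rather than a real toolkit API).

def train_nmt(pairs):
    """Train an NMT model on (source, target) pairs; returns a callable model."""
    # Training code omitted; return a stub that "translates" by identity.
    return lambda sentences: sentences

def back_translate(genuine_pairs, target_monolingual):
    # 1) Train an auxiliary target-to-source model on the genuine bitext.
    reverse_pairs = [(tgt, src) for src, tgt in genuine_pairs]
    reverse_model = train_nmt(reverse_pairs)

    # 2) Generate synthetic source sentences for the target-side monolingual data.
    synthetic_sources = reverse_model(target_monolingual)
    synthetic_pairs = list(zip(synthetic_sources, target_monolingual))

    # 3) Train the final source-to-target model on genuine + synthetic pairs.
    return train_nmt(genuine_pairs + synthetic_pairs)
```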
An alternative method of using monolingual data is the pre-trained language model (Devlin et al., 2019; Radford et al., 2019), a neural network trained on large amounts of text that can be incorporated into standard NMT encoder-decoder architectures (Jean et al., 2015; Gulcehre et al., 2015; Zhu et al., 2020). Pre-trained language models have led to improvements in NMT across low-resource scenarios (Song et al., 2019), cross-lingual transfer (Conneau and Lample, 2019) and code-switching settings (Yang et al., 2020).
Of these two dominant monolingual paradigms, back-translation has received relatively more attention. For example, initial studies show that back-translation benefits machine translation by producing more fluent outputs (Edunov et al., 2020). In contrast, relatively little work has examined how pre-trained language models contribute to translation. We fill this gap by quantitatively comparing MT models trained with pre-trained language models and with back-translation under a fair large-scale setting. Specifically, for pre-trained language models we reimplement BERT-fused NMT (Zhu et al., 2020), and for back-translation we use incrementally larger amounts of data to train a range of systems, with the synthetic data being half, equal to, twice and four times the authentic data. We conduct experiments on rich-resource (WMT'14 English-to-German) and low-resource (LDC Chinese-to-English) scenarios, and evaluate performance on 8 benchmarks covering morphological, syntactic, semantic and pragmatic competences. Empirically, we find that:
1. BERT yields improvements over standard NMT in BLEU but has no remarkable advantage compared with large-scale back-translation.
2. BERT has little effect on correcting fine-grained discrepancies at the morphological and syntactic levels in NMT (Sections 5.1 & 5.2).
3. BERT brings salient gains for translation requiring heavy context understanding and intensive knowledge, but also raises concerns around bias and fairness (Sections 5.3 & 5.4).
To our knowledge, we are the first to examine the effectiveness of pre-training in NMT through a comparison with back-translation in a fair setting. We also contribute to the analysis of BERT in a bilingual situation.

Related Work
Pre-training in NMT Gulcehre et al. (2015) and Jean et al. (2015) are among the first to integrate language models into the decoder of NMT. Subsequent work extends these studies by adding pre-trained representations to the encoder (Edunov et al., 2019) or to both sides (Ramachandran et al., 2017). Zhu et al. (2020) suggest using BERT as an extra memory: they first encode the input with BERT and use the last layer's output as an extra memory, and the Transformer NMT network uses an extra attention module in each layer of both the encoder and decoder to weigh this memory. Their model shows noticeable improvements in supervised, semi-supervised and unsupervised tasks, leading to new state-of-the-art results for using BERT in NMT. Given the significant improvements achieved by their work, we adopt this model in our experiments.
Back-translation Back-translation is a widely used data augmentation technique originally introduced for SMT (Bojar and Tamchyna, 2011) and later flourishing in NMT (Sennrich et al., 2016). It has been studied with dual-learning frameworks (He et al., 2016), large-scale extensions (Edunov et al., 2018), iterative versions (Hoang et al., 2018), unsupervised scenarios (Artetxe et al., 2018; Lample et al., 2018), tagged back-translated sources (Caswell et al., 2019) as well as systematic analyses (Burlot and Yvon, 2018; Poncelas et al., 2018; Edunov et al., 2020). In line with Edunov et al. (2018), we aim to broaden the understanding of back-translation in a large-scale manner. While their focus is on different methods of generating synthetic source sentences, ours is to investigate how large-scale pre-training compares with large-scale back-translation in boosting translation performance.

Protocol for MT Evaluation
We use BLEU (Papineni et al., 2002) and 8 more focused evaluation tasks to probe MT systems with pre-trained BERT and back-translation. Below we introduce the error analysis protocols in detail.

Morphological Competence
We assess the morphological competence of MT systems translating from English into morphologically rich languages, a necessity for MT systems to overcome out-of-vocabulary source tokens and flexible word orders. We take Morpheval (Burlot and Yvon, 2017; Burlot et al., 2018) as a representative test suite, consisting of a set of contrastive pairs that can be triggered in the source language and evaluated in the target language (Table 1). This dataset covers three types of contrasts: the first evaluates a single derivational morphological feature such as number, gender or tense; the second evaluates agreement; the third concerns lexical replacements within the same category, testing whether morphological consistency still holds when a word is replaced by a hyponym.

Syntactic Competence
We evaluate whether MT models can generate coherent and grammatical sentences. We adopt LingEval97 (Sennrich, 2017), a test set of contrastive translation pairs for the analysis of a number of syntactic phenomena, including syntactic agreement over long distances, discontiguous verb-particle constructions, transliteration of names and faithful translation of polarity (Table 1).

Semantic Competence
Semantics helps MT enforce meaning preservation and handle data sparsity. We measure semantic competence on the ambiguity of content words, conjunctions and pronouns, corresponding to the tasks of homograph translation, conjunction disambiguation and pronoun coreference resolution, respectively. First, homograph translation requires models to determine the intended sense of polysemous words in context. We adopt MUCOW (Raganato et al., 2019), a lexical ambiguity benchmark in which a sentence containing an ambiguous word is paired with a correct reference and an incorrect modified translation in which the ambiguous word is replaced by a word of a different sense. Second, NMT should in theory be able to handle conjunctions with varying senses if the encoder captures clues from sentence structure. We use the test set of Popović (2019), which translates the English conjunction but into two different German conjunctions, aber or sondern. The former can be used after a positive or a negative clause, while the latter is only used after a negative clause to express a contradiction. Lastly, for coreference resolution, we adopt ContraPro (Müller et al., 2018) to evaluate the accuracy with which models translate the English pronoun it into its German counterparts es (it), sie (she) and er (he), based on a correct understanding of antecedents.

Pragmatic Competence
We further evaluate systems on 3 challenging problems involving pragmatic inference: idiom translation, commonsense reasoning and gender bias. First, idiom translation remains difficult because the meaning of idioms is non-compositional and non-literal, so word-by-word translation is incorrect. We use the CIBB dataset (Shao et al., 2018), which provides a blacklist consisting of literal translations of idiom characters; whenever a system's output triggers the blacklist, a literal translation error is counted and used to score the system. Another demanding competence for NMT is commonsense reasoning. He et al. (2020) provide a test suite covering three types of ambiguity that require commonsense knowledge to resolve: lexical ambiguity, contextless syntactic ambiguity and contextual syntactic ambiguity, which we adopt for this task. Finally, for gender bias, we examine whether systems preserve the gender of entities indicated by coreferent pronouns (metrics are detailed in the Evaluation section; examples are in Appendix A.4).
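The blacklist-based scoring for idiom translation described above can be counted as in the following sketch; the data structures are hypothetical placeholders rather than the CIBB release format.

```python
# Illustrative scoring for idiom translation with a literal-translation blacklist.
# For each test sentence, the system output is checked against the blacklisted
# literal renderings of the idiom it contains; a hit counts as one error.

def count_literal_errors(outputs, blacklists):
    """outputs: list of translated sentences; blacklists: per-sentence lists of banned phrases."""
    errors = 0
    for hypothesis, banned_phrases in zip(outputs, blacklists):
        if any(phrase in hypothesis for phrase in banned_phrases):
            errors += 1
    return errors

# Toy usage: a literal rendering of an idiom triggers the blacklist, a fluent one does not.
outputs = ["he has bamboo in his chest", "he has a well-thought-out plan"]
blacklists = [["bamboo"], ["bamboo"]]
print(count_literal_errors(outputs, blacklists))  # 1
```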

Experimental Setup
We verify the effectiveness of NMT combined with BERT (Zhu et al., 2020) and with back-translation in both rich- and low-resource scenarios.

Data and Baseline
For the rich-resource scenario, we take WMT'14 English-to-German (En→De) with a corpus size of 4.5M sentence pairs. We use newstest2013 as the validation set and newstest2014 as the test set. For the low-resource scenario, we take LDC Chinese-to-English (Zh→En) with a corpus size of 1.25M sentence pairs. We use nist06 as the validation set and report an average score over the nist02/03/04/05/08 test sets. We apply wordpieces (Wu et al., 2016) to preprocess the data, with a shared source and target vocabulary of 32K.
We train a standard Transformer NMT model (Vaswani et al., 2017) with fairseq as a baseline. We adopt transformer_big for En→De and transformer_base for Zh→En, each with a 6-layer encoder-decoder network. We set the dropout ratio to 0.25 and use beam search with a beam width of 4 and a length penalty of 0.6 for inference.

BERT-fused NMT
BERT (Devlin et al., 2019) is composed of a layered self-attention Transformer network and is pre-trained on billions of words of unlabeled text to perform masked language modeling and next sentence prediction. The former aims to restore the original sequence from noisy input, while the latter learns whether two sentences are consecutive. Zhu et al. (2020) incorporate BERT into NMT systems. On the source side, given an input x, the model first extracts the last layer's context-aware representation from the BERT encoder,

H_B = \mathrm{BERT}(x),

and then fuses H_B with each layer of the NMT encoder through attention mechanisms (following the formulation of Zhu et al., 2020):

\tilde{H}_E^l = \frac{1}{2}\big(\mathrm{attn}_S(H_E^{l-1}, H_E^{l-1}, H_E^{l-1}) + \mathrm{attn}_B(H_E^{l-1}, H_B, H_B)\big), \qquad H_E^l = \mathrm{FFN}(\tilde{H}_E^l),

where H_E^l refers to the hidden state after fusion in the l-th layer, attn_S is the multi-head self-attention layer, and attn_B is the BERT attention layer. For layer l on the target side, the decoder uses both contexts at the same time:

\hat{S}^l = \mathrm{attn}_{MS}(S^{l-1}, S^{l-1}, S^{l-1}), \qquad \tilde{S}^l = \frac{1}{2}\big(\mathrm{attn}_B(\hat{S}^l, H_B, H_B) + \mathrm{attn}_E(\hat{S}^l, H_E^L, H_E^L)\big), \qquad S^l = \mathrm{FFN}(\tilde{S}^l),

where attn_{MS}, attn_B and attn_E are the multi-head future-masked self-attention layer, the BERT-decoder attention layer and the encoder-decoder attention layer, respectively, and H_E^L is the output of the last encoder layer. Following Zhu et al. (2020), we first train a standard Transformer NMT model and then use it to initialize the weights of the BERT-fused model. We choose bert-large-cased (https://huggingface.co/bert-large-cased) with 24 layers and 1024 hidden dimensions for En→De and bert-base-chinese (https://huggingface.co/bert-base-chinese) with 12 layers and 768 hidden dimensions for Zh→En, so that the dimensions of BERT and the NMT model roughly match. BERT is fixed during training. The optimization algorithm is Adam with a learning rate of 0.0005 and the inverse sqrt scheduler.
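To make the fusion concrete, a minimal PyTorch-style sketch of one BERT-fused encoder layer is given below. It is an illustrative re-implementation of the averaging scheme above, not the authors' released code; residuals and layer normalization are simplified, dimensions are placeholders, and BERT and NMT hidden sizes are assumed to match.

```python
import torch
import torch.nn as nn

class BertFusedEncoderLayer(nn.Module):
    """One NMT encoder layer that attends to both its own states and fixed BERT output.

    Illustrative sketch of the fusion in Zhu et al. (2020); residual connections and
    layer normalization are simplified, and dimensions are placeholders.
    """
    def __init__(self, d_model=1024, n_heads=8, d_ff=4096):
        super().__init__()
        self.attn_s = nn.MultiheadAttention(d_model, n_heads, batch_first=True)  # self-attention
        self.attn_b = nn.MultiheadAttention(d_model, n_heads, batch_first=True)  # BERT attention
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, h, h_bert):
        # h:      (batch, src_len, d_model)       previous encoder layer states H_E^{l-1}
        # h_bert: (batch, bert_len, d_model)      fixed BERT output H_B
        self_out, _ = self.attn_s(h, h, h)
        bert_out, _ = self.attn_b(h, h_bert, h_bert)
        fused = self.norm1(h + 0.5 * (self_out + bert_out))  # average the two attention outputs
        return self.norm2(fused + self.ffn(fused))
```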

Back-translation
For back-translation, we use the standard Transformer baseline with the method of Sennrich et al. (2016) to synthesize augmented data. Our goal is to compare BERT-fused NMT with back-translation at different data scales, using monolingual data drawn from the same source as BERT's training data by random selection from the Wikipedia dumps (e.g., dumps.wikimedia.org/dewiki/latest for German). Previous work shows that increasing the amount of back-translated data does not consistently improve performance beyond a threshold (Poncelas et al., 2018), so we choose a suitable range and scale the synthetic data from 625k to 18M sentences, with the ratio between authentic and synthetic data being 1:0.5, 1:1, 1:2 and 1:4, respectively (see Table 2); a short computation of the resulting sizes is shown below. In total we use 18M monolingual sentences in German and 5M monolingual sentences in English. All datasets are preprocessed in the same way as the training data.
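As a sanity check on the data scales reported above, the per-ratio synthetic sizes can be derived directly from the authentic corpus sizes (4.5M for En→De and 1.25M for Zh→En); the snippet below simply reproduces that arithmetic.

```python
# Derive synthetic-data sizes for each authentic:synthetic ratio used in the experiments.
authentic = {"En-De": 4_500_000, "Zh-En": 1_250_000}
ratios = [0.5, 1, 2, 4]  # synthetic data as a multiple of the authentic data

for pair, size in authentic.items():
    print(pair, [int(size * r) for r in ratios])
# En-De [2250000, 4500000, 9000000, 18000000]  -> up to 18M German sentences
# Zh-En [625000, 1250000, 2500000, 5000000]    -> up to 5M English sentences
```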

Evaluation
We use multi-bleu.perl from Moses on tokenized sentences for BLEU evaluation of all systems. The tasks of conjunction disambiguation and idiom translation are evaluated by the percentage of outputs containing the correct conjunction and by the pre-defined blacklist of literal idiom translations, respectively. The gender bias task is evaluated with a morphological analysis from 3 aspects: overall Accuracy, the percentage of instances in which the translation preserves the gender of the entity in the original sentence; ∆G, the difference in performance between masculine and feminine scores; and ∆S, the difference in performance between pro-stereotypical and anti-stereotypical gender role assignments (see examples in Appendix A.4). A sketch of these three quantities is given after this paragraph.
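The following sketch shows how Accuracy, ∆G and ∆S could be computed from per-instance correctness flags; the field names are illustrative placeholders, since the exact annotation format of the gender test set is not specified here.

```python
# Illustrative computation of Accuracy, delta-G and delta-S for the gender bias test.
# Each instance records whether the translated entity kept the correct gender, the
# entity's gender, and whether the instance is pro- or anti-stereotypical.

def pct_correct(instances):
    return 100.0 * sum(x["correct"] for x in instances) / len(instances) if instances else 0.0

def gender_bias_metrics(instances):
    accuracy = pct_correct(instances)
    delta_g = (pct_correct([x for x in instances if x["gender"] == "male"])
               - pct_correct([x for x in instances if x["gender"] == "female"]))
    delta_s = (pct_correct([x for x in instances if x["stereotypical"]])
               - pct_correct([x for x in instances if not x["stereotypical"]]))
    return accuracy, delta_g, delta_s

# Toy usage with two instances:
toy = [{"correct": True, "gender": "male", "stereotypical": True},
       {"correct": False, "gender": "female", "stereotypical": False}]
print(gender_bias_metrics(toy))  # (50.0, 100.0, 100.0)
```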
The other tests use a contrastive-pair paradigm, which probes a model's ability to discriminate between a given good and bad translation by exploiting the fact that an NMT system can be viewed as a language model of the target language conditioned on the source text. Like a language model, an NMT model can assign a negative log-probability (i.e., a cost) to any candidate sentence. If the model's score for the actual translation is smaller than that for the contrastive translation, we treat the decision as correct; a minimal sketch of this decision rule follows. We aggregate model decisions over the whole test set and report the overall percentage of correct decisions.
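A minimal sketch of the contrastive decision rule, assuming a hypothetical `nll(model, source, target)` helper that returns the model's negative log-probability (cost) of a target sentence given the source:

```python
# Contrastive-pair evaluation: the decision is correct when the reference translation
# is assigned a lower cost (negative log-probability) than the contrastive translation.
# `nll` is a hypothetical scoring helper, not a specific toolkit function.

def contrastive_accuracy(model, pairs, nll):
    """pairs: iterable of (source, reference, contrastive) sentence triples."""
    correct = 0
    for source, reference, contrastive in pairs:
        if nll(model, source, reference) < nll(model, source, contrastive):
            correct += 1
    return 100.0 * correct / len(pairs)

# Toy usage with a dummy scorer that just counts tokens (for illustration only):
dummy_nll = lambda model, src, tgt: len(tgt.split())
pairs = [("er ist da", "he is there", "she is there extra")]
print(contrastive_accuracy(None, pairs, dummy_nll))  # 100.0
```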

Results
The overall BLEU scores are given in Table 3. In both rich- and low-resource settings, the BERT-fused model demonstrates stronger performance than the baseline. However, the systems augmented with back-translated data are better than the BERT-fused model, with the best score achieved by the model trained with 2.25M synthetic sentences (1:0.5 setting) for En→De and 1.25M synthetic sentences (1:1 setting) for Zh→En. This shows that in terms of BLEU, the advantage of large-scale pre-training over large-scale back-translation is not obvious, even though the latter requires far less training data and computational resources. Taking En→De as an example (Table 4), back-translation uses only 85% of the parameters of the BERT-fused method while achieving higher BLEU, a 3.6 times faster decoding speed, and the same target/source length ratio, which indicates an equivalent information richness in the target translation.

Morphology

Table 5 shows the results for the morphology test in En→De translation. Generally, for derivational (Table 5a), agreement (Table 5b) and consistency (Table 5c) contrasts, pre-training does not show prominent advantages over back-translation in helping the standard Transformer model convey correct morphology from source to target. Prior work on monolingual tasks (Hofmann et al., 2020; Edmiston, 2020; Haley, 2020) has shown that BERT is capable of encoding morphological information and that many morphological features can be extracted by training a simple classifier on a BERT layer. In our bilingual task, however, BERT is trained in the source context and evaluated in the target language. The performance discrepancy suggests that BERT's morphology prediction for novel words in monolingual tasks stems from high-frequency morphological patterns seen during pre-training, which help BERT memorize statistical associations over contextualized string cues. In contrast, NMT morphological rules involve both the source and target languages, which differs from BERT's training. Such surface cues are not available to BERT in the bilingual situation, so BERT cannot compute the required interlingual representations. This can explain why BERT contributes less than back-translation in conveying morphological features in bilingual scenarios.

Syntax
The results for the syntax tests in En→De are shown in Table 6. We find similar performance across all systems, indicating that the syntactic problems probed here are relatively easy for the current standard Transformer, which already achieves accuracies close to 100. Neither back-translation nor pre-training brings significant benefits over the baseline. Initial work on monolingual tasks (Goldberg, 2019; Wolf, 2019) claims that BERT learns powerful syntactic representations and shows promise on agreement phenomena. However, our results show that in translation, BERT performs at best no better than the Transformer baseline and back-translation in favoring the grammatical variants on the target side. Inspired by the results of the morphological and syntactic evaluations, we leave for future work separately incorporating source-side and target-side pre-training in the encoder and decoder of NMT, with the aim of better leveraging the linguistic information contained in language models (Guo et al., 2020).

Semantics

Figure 1 shows results for translating sentences with ambiguous words in both the news domain (in-domain) and the colloquial speech domain (out-of-domain). In the news domain, the F-score of the baseline is 0.715. With back-translation, performance fluctuates but remains worse than the BERT-fused model. The BERT-fused model performs best with an F-score of 0.735, improving over the baseline by 2.8%. In the colloquial speech domain, where words are more frequent than in the news domain and thus have more senses, the BERT-fused model still remains on top and surpasses the baseline by 11.7%. There is evidence that BERT's context-aware embeddings actually encode certain forms of sense knowledge and provide distinct clusters corresponding to word senses, but this potential is not fully exploited in cross-lingual settings. We plan to extend this point with the optimized model RoBERTa (Liu et al., 2019b) in future work.

Figure 2 shows the results for conjunction disambiguation. The accuracy of the BERT-fused model is 96.62, a clear improvement over the other systems. This shows that BERT's contextualized word embeddings are useful for capturing clues from sentence structure and forming a generic idea of conjunctions. Conjunctions can affect the structure of the surrounding sentences and relate more to fluency than to adequacy, and can therefore be more difficult than content-word ambiguity (Popović, 2019). We conclude that BERT can indeed absorb fine-grained sense information during pre-training, which helps the model learn meaningful conjunction sense distinctions.

Table 7 shows the results for coreference translation. The second column reports the total accuracy of pronoun translation. The BERT-fused model achieves a score of 52.46, outperforming the others by 0.52-1.16 points in accuracy. This corresponds to prior studies showing that BERT's attention matrices are able to do coreference resolution by effectively encoding the coreference signal in deeper layers and at specific heads (Clark et al., 2019). The last two columns reflect the models' performance when the antecedent is located inside or outside the current sentence. The accuracy of the BERT-fused model ranks highest at short antecedent distances, outperforming the others by 2-5 points, but deteriorates most sharply as the distance between the pronoun and its antecedent increases. Though all models are ineffective over larger segments, the BERT-fused model even underperforms the baseline by 0.25 points.
On the one hand, these observations demonstrate the ability of BERT's deeply bidirectional representations, conditioned on both left and right context, to capture intra-sentence dependencies that are important for understanding coreference. On the other hand, they also show BERT's limitations on long-range features in document-level contexts, which is also observed by Joshi et al. (2019). As mentioned earlier in Section 4.2, one training task of BERT is to predict the next sentence. We assume that BERT is better than the standard Transformer at capturing the relation between two sentences and thus could improve performance on translation involving long-range features. Based on our results, however, it seems that BERT's potential for capturing sentence relations is not thoroughly exploited by NMT architectures.

For idiom translation, literal translation errors continue to decrease as we add more synthetic data. However, the error count slightly rises for the system built with 2.5M synthetic sentences, showing that increasing the data size does not by itself solve idiom translation, whereas a better encoding of idiomatic expressions via pre-training may help. The data size of Zh→En is relatively small, so we further verify BERT's effectiveness in the large-scale En→De experiment (elaborated in Appendix D). The BLEU results are summarized in the last column of Table 8. The BERT-fused model still achieves the best performance with a score of 30.76. This shows that in addition to local syntactic properties, BERT's context-aware embeddings based on the preceding and following context can help the NMT encoder capture global topical properties of words, making the model more expressive and better at understanding the underlying meaning.

The commonsense reasoning results are shown in Figure 3. They clearly show that the BERT-fused model is better than the baseline and the back-translated models on all three reasoning types, with the largest margin on lexical ambiguity, a smaller margin on contextless syntactic ambiguity, and the smallest margin on contextual syntactic ambiguity. The behavior of back-translation shows that incrementally larger amounts of training data do not consistently improve the commonsense reasoning performance of NMT; it is therefore likely the knowledge implied in the pre-trained language model that enhances the commonsense reasoning ability of MT systems. Prior work has proven BERT's effectiveness in promoting commonsense ability in monolingual tasks. We further find that in the bilingual scenario, BERT can also help the model utilize such knowledge by injecting prior information into the encoder of NMT.

Pragmatics
The results for gender translation are presented in Table 9. With BERT, gender bias in MT is not mitigated. The best performance is achieved by the model trained with back-translation data in the 1:2 setting, scoring 75.1, 0.1 and 5.2 in Accuracy, ∆G and ∆S, respectively. The scores of the BERT-fused model are 71.4, 3.2 and 14.6, respectively: not competitive with the baseline on Accuracy and ∆G, and much worse on ∆S. On the one hand, this further indicates that BERT may encode unintended social correlations during pre-training (May et al., 2019; Tan and Celis, 2019) and can propagate bias to downstream MT applications. On the other hand, the poor ∆S score shows that the BERT-fused model is prone to translating based on gender stereotypes and suffers deteriorated performance when translating anti-stereotypical assignments. This is in line with prior observations in QA and relation classification (Poerner et al., 2019) which show that BERT's knowledge can come from learning stereotypical associations.

Conclusion
We presented a quantitative study of BERT in NMT as compared with large-scale back-translation.