BERT, mBERT, or BiBERT? A Study on Contextualized Embeddings for Neural Machine Translation

The success of bidirectional encoders using masked language models, such as BERT, on numerous natural language processing tasks has prompted researchers to attempt to incorporate these pre-trained models into neural machine translation (NMT) systems. However, proposed methods for incorporating pre-trained models are non-trivial and mainly focus on BERT, so the impact that other pre-trained models may have on translation performance has not been compared. In this paper, we demonstrate that simply using the output (contextualized embeddings) of a tailored and suitable bilingual pre-trained language model (dubbed BiBERT) as the input of the NMT encoder achieves state-of-the-art translation performance. Moreover, we also propose a stochastic layer selection approach and the concept of a dual-directional translation model to ensure sufficient utilization of contextualized embeddings. Without using back translation, our best models achieve BLEU scores of 30.45 for En->De and 38.61 for De->En on the IWSLT'14 dataset, and 31.26 for En->De and 34.94 for De->En on the WMT'14 dataset, which exceed all published numbers.


Introduction
Pre-trained language models (LMs), trained on large-scale unlabeled data to capture rich representations of the input, such as ELMO (Peters et al., 2018), BERT (Devlin et al., 2019), XLNET (Yang et al., 2019) and XLM (Conneau and Lample, 2019), have increasingly attracted attention in various NLP tasks. Both utilizing context-aware representations of input tokens (Peters et al., 2018) and fine-tuning the pre-trained parameters (Devlin et al., 2019) lead to significant improvements on downstream tasks.
Figure 1: The overview of methods: a series of additive improvements to the use of contextualized embeddings on the IWSLT'14 dataset. Experimenting over various pre-trained language models, we show that our BIBERT, a bilingual English-German language model, vastly outperforms all other methods (Section 2). Adding stochastic layer selection to BIBERT improves performance (Section 3). Finally, innovative dual-directional training and fine-tuning with the previous two methods yield gains of around 2 BLEU points over the previous state-of-the-art result (Section 4).
Inspired by the superior performance of BERT on many other tasks, researchers have investigated leveraging this pre-trained masked language model to enhance translation models, e.g., initializing the parameters of the model's encoder with BERT parameters (Rothe et al., 2020), and incorporating the output of BERT into each layer of the encoder (Zhu et al., 2020; Weng et al., 2020). In this paper, we demonstrate that simply using the output of a pre-trained language model as the input of NMT systems can achieve state-of-the-art results on the IWSLT'14 (Cettolo et al., 2014) and WMT'14 (Bojar et al., 2014) English↔German (En↔De) translation tasks without using back translation (Sennrich et al., 2016). After conducting a thorough evaluation of numerous pre-trained language models, we demonstrate that specialized bilingual models perform the best. We then introduce two further refinements, stochastic layer selection and dual-directional training, that yield further improvements. An overview of the methods is shown in Figure 1. Overall, our best systems beat published state-of-the-art BLEU scores by around 2 points.
Our main contributions are listed as follows:
• We release our English-German bilingual pre-trained language model, BIBERT, and demonstrate that it outperforms both monolingual and multilingual language models for machine translation (Section 2).
• Expanding upon our bilingual language model results, we introduce stochastic layer selection which incorporates information from more layers in the pre-trained language model to improve machine translation (Section 3).
• We introduce dual-directional translation models, which leverage the inherent bilingual nature of BIBERT with mixed-domain training and fine-tuning. When combined with stochastic layer selection, they achieve state-of-the-art performance, i.e., 30.45 for En→De and 38.61 for De→En on the IWSLT'14 dataset, and 31.26 for En→De on the WMT'14 dataset (Section 4).

Method
In this section, we focus on investigating the effectiveness of using the output (contextualized embeddings) of the last layer of pre-trained language models for building NMT models. Our basic NMT models are six-layer transformer translation models (Vaswani et al., 2017), though the method is model-agnostic as long as the NMT model has encoder embeddings. Specifically, our method relies on extracting contextualized embeddings of source sentences from the final layer of a frozen pre-trained language model and feeding them to the embedding layer of the NMT encoder. Rather than randomly initializing the source embedding layer, we use the output of these pre-trained models and do not allow these parameters to update during training. To allow for a deep analysis, we concentrate on one language pair, English↔German (En↔De). (Although the pre-trained language models are trained with additional monolingual data, we only use the provided bitexts during machine translation training.)
In the following subsections, we first explore how much translation performance can be improved by simply using contextualized embeddings, and then explore the internal factors of various pre-trained language models that may affect NMT models. We then introduce our bilingual pre-trained language model and demonstrate that using its contextualized embeddings achieves state-of-the-art results.
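The idea of feeding frozen final-layer outputs to the NMT encoder can be sketched in a few lines. The snippet below is only a toy illustration under stated assumptions: `toy_frozen_lm` is a hypothetical stand-in for a real pre-trained language model (it returns made-up vectors), and `encoder_input_from_last_layer` is an illustrative name, not part of any released code.

```python
from typing import Callable, List

Vector = List[float]
LayerOutput = List[Vector]  # one vector per source token


def toy_frozen_lm(tokens: List[str], num_layers: int = 12, dim: int = 4) -> List[LayerOutput]:
    """Hypothetical stand-in for a frozen pre-trained LM.

    Returns one LayerOutput per layer (index 0 = first layer, -1 = last layer).
    The values are arbitrary; a real model would return learned representations.
    """
    return [
        [[float(layer) + 0.01 * len(tok)] * dim for tok in tokens]
        for layer in range(1, num_layers + 1)
    ]


def encoder_input_from_last_layer(
    tokens: List[str],
    lm: Callable[[List[str]], List[LayerOutput]] = toy_frozen_lm,
) -> LayerOutput:
    """Use the LM's final-layer output in place of a learned source embedding table.

    The LM is only called, never updated, which mirrors keeping it frozen.
    """
    return lm(tokens)[-1]


embeddings = encoder_input_from_last_layer(["ein", "Haus"])
print(len(embeddings), len(embeddings[0]))  # tokens, embedding dimension
```

In a real system the returned vectors would have the language model's hidden size, and the NMT encoder's embedding dimension would have to match it.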

Existing Pre-Trained Models
We first describe four influential pre-trained models that we incorporate into NMT: two monolingual and two multilingual models.
ROBERTA An optimized version of BERT which is trained on a larger dataset, with a dynamic masked language model training regimen that also removes the next sentence prediction objective. This model matches or exceeds the performance of BERT on multiple NLP tasks.
GottBERT A state-of-the-art pure German ROBERTA model (Scheible et al., 2020) trained on the 145GB German portion of OSCAR (Ortiz Suárez et al., 2020), a huge multilingual corpus extracted from Common Crawl. It has been shown to outperform the other two existing German monolingual models (i.e., German BERT from deepset and dbmz BERT) on NER and text classification tasks.

XLM-R (base)
A transformer-based (Vaswani et al., 2017) masked language model trained on 100 languages, using more than two terabytes of filtered CommonCrawl data, which outperforms MBERT on a variety of cross-lingual benchmarks (Conneau et al., 2020).

How Do Pre-Trained LMs Affect NMT?
First we investigate how contextualized embeddings of the aforementioned pre-trained language models help NMT models, and explore possible positive and negative factors that may affect NMT models.
Dataset We initially consider a low-resource scenario and then show further experiments in a high-resource scenario in Section 5. We conduct experiments on the IWSLT'14 English-German dataset, which has 160K parallel bilingual sentence pairs.

Settings
Our model configuration is transformer_iwslt_de_en, a six-layer transformer architecture (Vaswani et al., 2017), with FFN dimension size 1024 and 4 attention heads. We use an embedding dimension of 768 to match the dimension of pre-trained language models. For a consistent comparison with previous works, the evaluation metric is the commonly used tokenized BLEU (Papineni et al., 2002).
Table 1: IWSLT'14 En↔De BLEU scores utilizing contextualized embeddings from various pre-trained language models. "random" means that the embedding layer of the NMT encoder is randomly initialized but uses the same vocabulary as the assigned pre-trained language model. "pre-trained" means that the embedding layer of the NMT encoder uses the output of the assigned frozen pre-trained language model during MT training. Numbers in brackets show the gain or loss relative to the corresponding model with randomly initialized embeddings.
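The Settings above can be collected into a small configuration object. This is a minimal sketch: the class and field names are hypothetical and do not correspond to fairseq's actual argument names, and the decoder depth of six is an assumption based on the "six-layer" description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NMTConfig:
    """Illustrative container for the hyperparameters stated in the Settings."""

    encoder_layers: int = 6      # six-layer transformer
    decoder_layers: int = 6      # assumption: decoder depth equals encoder depth
    ffn_dim: int = 1024          # feed-forward dimension
    attention_heads: int = 4
    embed_dim: int = 768         # chosen to match the pre-trained LM output size

    def head_dim(self) -> int:
        """Per-head dimension; the embedding size must split evenly across heads."""
        if self.embed_dim % self.attention_heads != 0:
            raise ValueError("embed_dim must be divisible by attention_heads")
        return self.embed_dim // self.attention_heads


config = NMTConfig()
print(config.head_dim())
```

With these values each attention head operates on 768 / 4 = 192 dimensions.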
Observations The main IWSLT'14 results are shown in Table 1. We first conduct experiments with randomly initialized embeddings to obtain baselines. Feeding the output of a pre-trained language model into an NMT model requires the vocabulary of the encoder to be the same as the one used for the language model. To ensure that improvements are not the result of choosing a better vocabulary, we train randomly initialized baseline systems using identical vocabularies for each encoder. For these experiments, the decoder's vocabulary size is fixed to 8K in order to make fair comparisons. We investigate decoder vocabulary size selection in more detail in Section 2.5. When the embedding layer of the MT encoder is randomly initialized, as opposed to using the pre-trained language model, we observe similar BLEU scores for all baselines from English-to-German (around 27.6) and German-to-English (around 33.7). By replacing the embedding layer with contextualized embeddings, GOTTBERT boosts the BLEU score of De→En from 33.56 to 36.32, and ROBERTA improves En→De translation from 27.3 to 28.74. However, MBERT and XLM-R only provide modest improvements in De→En translation and even degrade the performance of En→De translation.

Curse of Multilinguality
We first note the deterioration caused by MBERT and XLM-R on En→De relative to the randomly initialized baselines, as well as their comparatively small gains on De→En versus the monolingual models. We hypothesize that contextualized embeddings from MBERT and XLM-R are hurt by the curse of multilinguality (Conneau et al., 2020), i.e., low-resource language performance can be improved by adding higher-resource languages during pre-training, but high-resource performance degrades. MBERT and XLM-R are trained on 104 and 100 languages respectively, and the curse of multilinguality may lead to model capacity issues that degrade the contextualized embeddings of high-resource languages such as English and German. We attribute the slightly higher improvements of XLM-R over MBERT to the larger amounts of data used in pre-training. The large monolingual models, ROBERTA and GOTTBERT, not only significantly beat a randomized baseline, but also significantly beat the multilingual models. Note that even though XLM-R has 55.6B English tokens used for pre-training, it still helps less than ROBERTA, which uses around 28B English tokens; this is possibly due to interference and constrained capacity (Arivazhagan et al., 2019; Johnson et al., 2017; Tan et al., 2019). Therefore, a suitable pre-trained language model for NMT intuitively should be trained on a large amount of data, but with special care to avoid using too many languages during pre-training.

Customized Pre-Trained LM
Pre-trained monolingual language models can improve performance of machine translation systems, yet machine translation is inherently a bilingual task. We hypothesize that a pre-trained language model can further improve the translation performance if its training data is composed of a mixture of texts in both source and target languages. In other words, we expect the source and target language data to enrich the contextualized information for each other to better facilitate translation for both directions (En↔De). Therefore, we propose our bilingual pre-trained language models, dubbed BIBERT.
Our BIBERT EN-DE is based on the RoBERTa architecture and implemented using the fairseq framework. In order to make a direct comparison, BIBERT EN-DE is trained on the same German texts as GOTTBERT, plus an additional 146GB of English texts. These are a subset of the English portion of OSCAR, the same dataset the German texts come from. We combine English and German data and shuffle them before training. We train the model using the same number of update steps on German texts as GOTTBERT. We train a unified 52K vocabulary using the WordPiece tokenizer (Wu et al., 2016), with 67GB of English and 67GB of German texts which are randomly sampled from the training set. BIBERT EN-DE is trained on a TPU v3-8 for four weeks. More details about optimization for BIBERT EN-DE are described in Appendix B.

Vocabulary Size Selection
The vocabulary is fixed for the encoder but still undetermined for the decoder. In a low-resource machine translation setting, performance is highly sensitive to decoder vocabulary size selection. Gowda and May (2020) demonstrated that a decoder vocabulary using 8K BPE operations performed best across a large grid search. To ensure that an 8K vocabulary size is also a suitable choice for the IWSLT'14 (160K parallel sentences) dataset when combined with our method, we search over four candidate decoder vocabulary sizes (8K, 16K, 24K, and 32K) for all aforementioned pre-trained language models. As shown in Figure 2, 8K yields the highest BLEU score for all of our NMT models for De↔En. Thus we select 8K as the vocabulary size of the decoder and use this for all subsequent experiments on IWSLT'14 unless otherwise noted. Interestingly, we also notice that the performance of the translation model with BIBERT EN-DE is robust for De→En, and basically unaffected by the vocabulary size.
Analysis Based on the superior performance of BIBERT EN-DE, we hypothesize that contextualized embeddings output from BIBERT EN-DE contain richer German information than those from GOTTBERT, and that learning from the extra English data lets them better assist the model in translation. Furthermore, we theorize that training on German texts also enhances the quality of English contextualized embeddings: even though ROBERTA and BIBERT EN-DE are not directly comparable due to different English pre-training data, BIBERT EN-DE still had a 0.68 BLEU point improvement over ROBERTA while using less English training data. Some other explanations for the superior performance of BIBERT EN-DE are: 1) it learns aligned embeddings for tokens with similar meanings across the two languages, so the source embeddings can offer the encoder a hint of the aligned target embeddings to help translation; 2) embeddings of overlapping En-De sub-word units (footnote 7) fed to NMT encoders may facilitate translation through bilingual information.

Algorithms                              De→En
Adversarial MLE (Wang et al., 2019)     35.18
DynamicConv                             35.20
Macaron Net (Lu* et al., 2020)          35.40
BERT-Fuse (Zhu et al., 2020)            36.11
MAT                                     36.22
Mixed Representations                   36.41
UniDrop                                 36.88
Ours, GOTTBERT                          36.32
Ours, BIBERT                            37.58

Table 2 shows a comparison of our work with the recent literature on IWSLT'14 German to English translation. These works propose improvements to transformer models in different aspects, e.g., incorporating BERT into every layer of encoders and decoders with additional multi-head attentions (Zhu et al., 2020), multi-branch encoders, mixed representations from different tokenizers, and uniting different dropout techniques into NMT models. Our straightforward method of simply using the final layer of BIBERT EN-DE outperforms all of them. Furthermore, even the model that only uses the monolingual GOTTBERT achieves a competitive result (36.32) compared with the previous state-of-the-art approach (36.88). Our method is easy to implement, so it can be used in conjunction with other methods in the literature.

Time Costs
Leveraging an external pre-trained language model leads to higher computational complexity. Our approach takes approximately 20% additional time during training and 13% extra time during inference. We argue that the significant BLEU gains justify the higher time costs.
Footnote 7: Such as ##n, which uses shared En-De information.

Layer Selection
Jawahar et al. (2019) demonstrate that different layers of BERT capture differing linguistic information in a rich, hierarchical structure that mimics classical, compositional tree-like structures. Information in the lower layers (e.g., phrase-level information) gets gradually diluted in higher layers. Thus, to potentially leverage more information encapsulated in the pre-trained language models, we are also interested in exploring how other layers of contextualized embeddings can improve NMT models, rather than simply using the last layer.
We denote by X the collection of source language sentences. For each source sentence x ∈ X, let $H_B^i(x)$ denote the contextualized embeddings of x obtained from the i-th layer of the pre-trained language model. In our settings, we consider the top K layers of the pre-trained language model, i.e., we consider $H_B^i(x)$ for $i \in \{M-K+1, \dots, M\}$, where K is a hyperparameter and M is the total number of layers of the pre-trained language model.

Stochastic Layer Selection
During training of deep neural networks, various methods of stochastically freezing groups of parameters in a model for individual training examples have been shown to improve performance. For instance, dropout (Srivastava et al., 2014) samples from a Bernoulli distribution which parameters are not updated, and drop-net (Zhu et al., 2020) and drop-branch randomly activate one candidate net, chosen from a uniform distribution, and freeze the others. We propose stochastic layer selection, a novel approach to encapsulate more features and information from more layers of the pre-trained language models. Specifically, for each batch, we randomly pick the output from one layer rather than all of them as the input for the NMT encoder (Figure 3). We denote the input embeddings of sentence x to the NMT encoder as $H_E(x)$, which is defined in the following way during training:

$$H_E(x) = \sum_{i=1}^{K} \mathbb{1}\left(\frac{i-1}{K} \le p < \frac{i}{K}\right) H_B^{M-i+1}(x),$$

where $\mathbb{1}(\cdot)$ is the indicator function and p is a random variable which is uniformly sampled from [0,1]. In the inference step, the output is the expectation of the outputs of all layers used for training, i.e.,

$$H_E(x) = \frac{1}{K} \sum_{i=1}^{K} H_B^{M-i+1}(x).$$
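The training-time sampling and inference-time averaging described above can be sketched as follows. This is a minimal illustration in plain Python rather than the authors' implementation: the function names are hypothetical, layer outputs are represented as nested lists, and the mapping of each sampled interval to a particular layer is an assumption (any fixed mapping gives the same uniform distribution over the top K layers).

```python
import random
from typing import List

Vector = List[float]
LayerOutput = List[Vector]  # one vector per token


def select_layer_train(layer_outputs: List[LayerOutput], k: int, rng: random.Random) -> LayerOutput:
    """Training: pick one of the top-k layers uniformly at random (call once per batch)."""
    num_layers = len(layer_outputs)
    if not 1 <= k <= num_layers:
        raise ValueError("k must be between 1 and the number of layers")
    p = rng.random()                   # uniform in [0, 1)
    bucket = min(int(p * k), k - 1)    # bucket b satisfies b/k <= p < (b+1)/k
    return layer_outputs[num_layers - 1 - bucket]  # bucket 0 maps to the last layer


def select_layer_inference(layer_outputs: List[LayerOutput], k: int) -> LayerOutput:
    """Inference: element-wise average of the top-k layers (the expectation of training)."""
    if not 1 <= k <= len(layer_outputs):
        raise ValueError("k must be between 1 and the number of layers")
    top_k = layer_outputs[-k:]
    num_tokens, dim = len(top_k[0]), len(top_k[0][0])
    return [
        [sum(layer[t][d] for layer in top_k) / k for d in range(dim)]
        for t in range(num_tokens)
    ]


# Toy example: 12 layers, one token, one dimension; layer i outputs the value i.
layers = [[[float(i)]] for i in range(1, 13)]
print(select_layer_inference(layers, 8))  # average of layers 5..12
```

Setting k = 1 makes both functions return the last layer, which matches the baseline of using only the final layer.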

Experiments and Results
Based on the results of Table 1, we select the pre-trained model performing best for NMT, BIBERT EN-DE , and use it as the basis for all subsequent experiments. To be consistent with the results in Section 2, we once again use the IWSLT'14 dataset. Figure 4 illustrates the impact of stochastic layer selection. We conduct experiments for En↔De with the number of layers K ranging from 2 to M (M = 12 for BIBERT EN-DE ). Note that setting K = 1 reduces to the case of only selecting the last layer as in Section 2. In all cases, the stochastic layer selection obtains substantial gains compared with our previous best scores in En→De (29.65) and De→En (37.58) in Section 2. In both situations of En→De and De→En, the translation model gets the highest score (37.94 for De→En and 30.04 for En→De) when stochastic layer selection uses 8 layers.

One Model, Dual-Directional Translations
In this section, different from ordinary one-way translation models, we introduce our dual-directional translation models, i.e., a single model that can translate both En→De and De→En. The model architecture is the same as the one in Section 3. One of the biggest advantages of the shared English-German vocabulary of BIBERT EN-DE is that our encoder has the capability of receiving contextualized embeddings of both source and target tokens. During the training step, we feed source sentences to the model and expect the generation of a target translation, yet also, inversely, feed target sentences and expect translations in the source language. The motivation behind the dual-directional translation model is that we expect the contextualized representations of source and target sentences to enhance each other and build a better encoder for the translation model. From the aspect of data augmentation, the target sentences serve as augmented data for the task of translating from the source language to the target language, and vice versa. With the method of swapping source and target sentences once as an additional dataset, our experiments show superior performance for both translation directions. Two advantages of this method are 1) obtaining improvement without extra bitexts, and 2) requiring only a slight modification to data preprocessing and no changes to the model architecture.
Figure 4: ... De→En (right, red bars) BLEU as a function of the number of layers K considered in the stochastic layer selection module for NMT models. Note that when K = 1, it reduces to the case of selecting the last layer. However, any value of K > 1 selected for stochastic layer selection beats this very strong baseline, with K = 8 obtaining the highest BLEU scores in both directions.
Figure 5: Workflow of data preprocessing. We swap source and target sentences, and concatenate swapped sentence pairs and original sentence pairs. Finally, we shuffle the concatenated data for dual-directional translation model training.

Dataset Preprocessing
For consistent comparisons, the dataset is still IWSLT'14 En→De. The details of data preprocessing for the dual-directional translation model are illustrated in Figure 5. Using exactly the same parallel sentences in our bitext for training, we simply leverage the dataset in reverse, by swapping our original target sentences to use as new source sentences and original source sentences as new target sentences. We then concatenate and jointly shuffle the original and new data to acquire our mixed training data. We use a joint English-German vocabulary of size 12K for the decoder.
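The swap, concatenate, and shuffle steps above amount to a few lines of preprocessing. The sketch below is illustrative only: `build_dual_directional_data` is a hypothetical name, sentence pairs are plain string tuples, and the fixed seed is an assumption added for reproducibility.

```python
import random
from typing import List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def build_dual_directional_data(pairs: List[Pair], seed: int = 0) -> List[Pair]:
    """Return the original pairs plus their swapped copies, jointly shuffled."""
    swapped = [(target, source) for source, target in pairs]
    mixed = list(pairs) + swapped          # concatenate original and reversed data
    random.Random(seed).shuffle(mixed)     # joint shuffle of both directions
    return mixed


bitext = [("Hello", "Hallo"), ("Thank you", "Danke")]
mixed = build_dual_directional_data(bitext)
print(len(mixed))  # twice the number of original pairs
```

No new bitext is introduced: the mixed set contains each original sentence pair exactly twice, once per direction.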

Fine-tuning
Inspired by prior findings that training initially on a mix of in- and out-of-domain data, and then gradually fine-tuning until only in-domain data is used, substantially improves model performance, we treat our concatenated sentences as mixed-domain data in which the source and target languages are separate language domains. Each language's data can serve as the out-of-domain data for the other language. Following this perspective, we first train our dual-directional model on the mixed data, and then fine-tune it on the source or target data to obtain one-way translation models.
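The two-stage procedure can be made concrete as a list of training stages and the data each one uses. This is a sketch under assumptions: `training_stages` and the direction labels "forward" and "backward" are hypothetical, shuffling is omitted for brevity, and fine-tuning is interpreted here as continuing training on the pairs of a single direction only.

```python
from typing import List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def training_stages(pairs: List[Pair], direction: str) -> List[Tuple[str, List[Pair]]]:
    """Return (stage name, training data) for mixed training followed by fine-tuning."""
    forward = list(pairs)
    backward = [(target, source) for source, target in pairs]
    if direction == "forward":
        in_domain = forward
    elif direction == "backward":
        in_domain = backward
    else:
        raise ValueError("direction must be 'forward' or 'backward'")
    return [
        ("mixed", forward + backward),   # stage 1: dual-directional training
        ("fine-tune", in_domain),        # stage 2: one direction only
    ]


for name, data in training_stages([("Hello", "Hallo")], "backward"):
    print(name, data)
```

The same stage-1 model can be fine-tuned twice, once per direction, to obtain the two one-way models.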

Experiments and Results
We additionally train one-way translation models with a 12K bilingual vocabulary to have a fair baseline for dual-directional models. Overall results are shown in Table 3. We first discuss the models trained without stochastic layer selection. The dual-directional model substantially outperforms the one-way model by obtaining a gain of 0.52 in En→De and 0.72 in De→En. Moreover, fine-tuning on the in-domain data further improves BLEU from 29.89 to 30.33 in En→De and from 37.97 to 38.12 in De→En. These gains show that both the dual-directional model and the fine-tuning approach are effective in helping translation. A similar discussion holds for the models with the stochastic layer selection method. Compared with our previous models in Section 3 (30.04 in En→De and 37.94 in De→En), our best model achieves new state-of-the-art results in both En→De and De→En, obtaining 30.45 and 38.61 BLEU respectively.

Table 4 (fragment; columns are En→De and De→En BLEU):
...                                                     29.3   -
Evolved Transformer (So et al., 2019)                   29.8   -
BERT Initialization (12 layers) (Rothe et al., 2020)    30.6   33.6
BERT-Fuse (Zhu et al., 2020)                            30.

Following the findings that En↔De translation has similar results for vocabularies ranging from 32K to 64K in high-resource scenarios (4.5M training samples) (Gowda and May, 2020), we use a bilingual vocabulary of size 52K for the decoder, which is larger than the ones (8K and 12K) used in the IWSLT experiments.

Results
We compare our methods in Table 4 with prior works that achieve the highest scores using only the provided bitexts. With BIBERT EN-DE contextualized embeddings and stochastic layer selection, our model achieves state-of-the-art BLEU on both En→De (30.91) and De→En (34.94). Interestingly, dual-directional translation training does not show the same strong effectiveness as it did in the low-resource scenario. One possible reason is that model capacity is not large enough to handle mixed-domain data (Arivazhagan et al., 2019). However, it still additively improves En→De to 31.26 BLEU. It is worth mentioning that our NMT model achieves better performance with fewer trainable parameters: the hidden size of our NMT model is 768, whereas prior works use 1024.

Related Work

Pre-Trained Embeddings
Traditional pre-trained embeddings are investigated at the type level, e.g., word2vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014) and fastText (Bojanowski et al., 2017). Peters et al. (2018) moved beyond this line of work and proposed context-aware embeddings output from a pre-trained bidirectional LSTM (ELMO). Following the attention-based transformer module (Vaswani et al., 2017), the architectures of GPT models (Radford et al., 2018, 2019; Brown et al., 2020) and BERT (Devlin et al., 2019) are based on stacking deep transformer decoders and encoders, respectively, and significantly boost downstream tasks. Beyond pure English models, pre-trained language models for other languages have also appeared, e.g., CAMEMBERT for French (Martin et al., 2020) and ARABERT for Arabic (Baly et al., 2020). Multilingual representations, e.g., MBERT and XLM (Conneau and Lample, 2019), have been shown to be effective in facilitating cross-lingual learning. XLM-R (Conneau et al., 2020), a model learning cross-lingual representations at scale, achieved state-of-the-art results on multiple cross-lingual benchmarks. Recently, an English-Arabic bilingual BERT (Lan et al., 2020) outperformed ARABERT, MBERT and XLM-R in supervised and zero-shot transfer settings.

MT with Context-Aware Representations
Imamura and Sumita (2019) removed the NMT encoder part and directly fed the output of BERT to the attention mechanism in the decoder. They train the model with two optimization stages, i.e., only training the decoder and then fine-tuning BERT. Similarly, Clinchant et al. (2019) incorporated BERT into NMT models by replacing the embedding layer with BERT parameters and initializing the encoder with BERT, but they still notice that the NMT model with BERT is not as robust as expected. Rothe et al. (2020) also leveraged pre-trained checkpoints (e.g., BERT and GPT) to initialize a 12-layer NMT encoder and decoder and achieved state-of-the-art results. Interestingly, they showed that models with the decoder initialized by GPT fail to improve translation performance and are even worse than the one whose decoder is randomly initialized. Similarly, Ma et al. (2020) initialize both the transformer encoder and decoder with XLM-R but fine-tune it on multiple bilingual corpora to obtain a multilingual translation model. The preliminary experiments from Zhu et al. (2020) indicate that NMT models simply fed with the output of BERT outperform the models initialized by BERT or XLM. However, only limited experiments and little analysis of this method have been done in their work. They mainly focused on the BERT-fuse approach, i.e., the output of BERT is fed to each layer of the NMT encoder and decoder with extra multi-head attentions. Instead of only using the last layer of BERT, Weng et al. (2020) introduced a layer-aware attention mechanism to capture compound contextual information from BERT. Moreover, they also proposed a knowledge distillation paradigm to learn pre-trained representations in the training process. On an English-Arabic translation task, a precursor of this method has been used, though it lacks all of the refinements described here. However, it was shown to further help in downstream cross-lingual information extraction tasks.

Conclusion
We have shown that our BIBERT, trained on a large amount of mixed texts in the source and target languages, helps NMT models improve translation performance more than other existing pre-trained language models do, and achieves state-of-the-art results by simply using the output of the last layer. Moreover, we introduced the stochastic layer selection method and demonstrated its effectiveness in improving translation performance. Finally, experiments on the dual-directional translation model illustrate that source and target data can augment each other to further boost performance.