Discriminative Reranking for Neural Machine Translation

Reranking models enable the integration of rich features to select a better output hypothesis within an n-best list or lattice. These models have a long history in NLP, and we revisit discriminative reranking for modern neural machine translation models by training a large transformer architecture. This takes as input both the source sentence as well as a list of hypotheses to output a ranked list. The reranker is trained to predict the observed distribution of a desired metric, e.g. BLEU, over the n-best list. Since such a discriminator contains hundreds of millions of parameters, we improve its generalization using pre-training and data augmentation techniques. Experiments on four WMT directions show that our discriminative reranking approach is effective and complementary to existing generative reranking approaches, yielding improvements of up to 4 BLEU over the beam search output.


Introduction
Reranking models take a number of different output hypotheses generated by a baseline model and select one hypothesis based on more powerful features. Before the recent re-emergence of neural networks, these models were well studied for several NLP tasks, including parsing (Charniak and Johnson, 2005; Collins and Koo, 2005) and statistical machine translation.
Traditional statistical models (SMT) based on n-gram counts made very strong independence assumptions, where features would only capture very local context information to avoid sparsity and poor generalization. A large n-best list produced by these models would then be passed to a discriminatively trained reranker which leverages features engineered to capture more global context, yielding significant improvements to the quality of the translations.
On the other hand, modern neural models (NMT) make much weaker independence assumptions because predictions of standard sequence-to-sequence models depend on the entire source sentence as well as the target prefix generated so far. However, reranking may still be beneficial for two reasons. First, NMT systems are subject to exposure bias (Ranzato et al., 2016), i.e., models are never exposed to their own generations at training time, while a reranking model is trained on model outputs. Second, beam search with autoregressive models uses the chain rule to sum individual token-level log-probabilities to obtain a target sequence score. However, each token-level probability is based on a limited amount of target context, while a reranking model can condition on the entire target context. Indeed, recent generative reranking approaches applied to NMT, such as Noisy-Channel Decoding (NCD; Yee et al., 2019), which leverages a pre-trained language model and a backward model, show strong improvements over beam search outputs, as demonstrated in recent WMT evaluations.
In this paper, we explore whether training large transformer models using the reranking objective can further improve performance. Our model, dubbed DrNMT, takes as input the entire source sentence and an n-best list of output hypotheses to predict a distribution of sentence-level evaluation scores, such as BLEU. This setup is similar to earlier work with SMT, except that the baseline model is an NMT model and the reranker is a big transformer architecture as opposed to a log-linear model on top of discrete or human-engineered features.
Unfortunately, optimizing for the task of interest does not always lead to better performance. Overfitting to the training set is a potential concern, as the reranker has hundreds of millions of parameters yet it receives only one gradient and weight update per source/target sentence pair as opposed to one per token as for standard NMT models. In our work, we mitigate overfitting in two ways. First, we leverage the success of pre-training by finetuning masked language models (MLM; Devlin et al. 2019) which initializes the model with features trained on much more training data. Second, we augment the original dataset with back-translated data (BT; Sennrich et al. 2016).
Experiments show that DrNMT can match the performance of a strong NCD baseline and that their combination leads to further improvements as measured by BLEU, TER and also human evaluation.

Related Work
Our method is inspired by the seminal work that introduced and popularized discriminative reranking for SMT (e.g., Och, 2003). Besides using a weaker MT system to generate the n-best list, these works relied on a linear discriminator trained on human-designed features as opposed to a transformer taking the raw source sentence and hypothesis.
Most work using NMT has focused on generative reranking methods (Imamura and Sumita, 2017; Wang et al., 2017), where the reranker's parameters are optimized using a criterion which is different from the metric of interest. For instance, Yu et al. (2017) and Yee et al. (2019) perform noisy-channel decoding, where hypotheses are scored by linearly combining the output of the forward model, a target-side language model and a backward model which scores the source sentence given the hypothesis. These methods have shown remarkable improvements over the output of beam decoding, despite not being trained for the reranking task (except for the two or three hyperparameters of the linear combination of scores, which are tuned on a validation set). Another approach belonging to this class of methods is the one proposed by Salazar et al. (2019), which employs the scores from a masked language model (MLM). While this method employs a transformer architecture, it is still not trained for the task of interest.
To the best of our knowledge, the only other attempt at discriminatively training a reranker for NMT is the concurrent work of Naskar et al. (2020). They use a pair-wise margin loss on hypotheses sampled from the NMT model, while we learn to rank the full n-best list produced by beam search. Their experiments also show that the reranker performs better when directly conditioned on the source sentence. However, they neither compare nor combine their method with NCD as we do. Both their work and ours extend Deng et al. (2020), who proposed to train a discriminator to improve neural language modeling.
There is also a large body of literature on different ways to combine SMT and NMT by using one to rerank the other, since SMT is generally better at adequacy while NMT is better at fluency. For instance, Auli and Gao (2014) use an RNN discriminator to rerank the n-best list produced by a phrase-based SMT system. Ehara (2017) does the opposite, using an SMT discriminator to rerank an n-best list produced by an NMT system.
Finally, our work is also related to recent attempts at using adversarial training to improve MT (Wu et al., 2018). Unlike these approaches, our method is much simpler because we do not update the parameters of the MT system generating the hypotheses. Moreover, our discriminator is trained to predict the distribution of the desired metric and it is used at decoding time to rerank, while GAN-based MT would only retain the generator.

Model
Given a source sentence x, an NMT model generates a set of hypotheses $U(x) = \{u_1, u_2, \dots, u_n\}$ in the target language. The goal of this work is to learn a reranker that produces higher scores for hypotheses of better quality, as defined in terms of a user-specified metric µ(u, r) such as BLEU (Papineni et al., 2002a), where quality is measured with respect to a reference r.
As illustrated in Figure 1, our reranker is a transformer architecture which takes as input the concatenation of the source sentence x and hypothesis u ∈ U(x). The architecture also includes position embeddings and language embeddings, to help the model represent tokens that are shared between the two languages (Conneau and Lample, 2019). The final hidden state corresponding to the start-of-sentence token (<s>) serves as the joint representation for (x, u); let us denote this feature vector as $z \in \mathbb{R}^d$. The reranker associates a scalar score $o \in \mathbb{R}$ to (x, u) by applying a one-hidden-layer neural network with d tanh hidden units to z, as is the default in the design of the "classification head" of RoBERTa (Liu et al., 2019). The parameters of the reranker are denoted by θ and include the parameters of the transformer, all the embeddings and also the top projection block mapping the feature vector to the scalar score. Each hypothesis $u_i$ in the set U(x) is therefore processed independently and yields a score $o_i$.
Figure 1: Illustration of DrNMT, a pre-trained transformer architecture which takes as input both the source sentence as well as a hypothesis and outputs a scalar score. DrNMT is trained to output scores which reflect the distribution of sentence-level scores according to a user-specified metric over an n-best list.
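The projection block described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the released implementation: the function name `score_head`, the toy dimension d = 4 and the random weights are assumptions made for the example, and dropout is left out.

```python
import math
import random


def score_head(z, w_hidden, b_hidden, w_out, b_out):
    """Map a d-dimensional feature vector z to a scalar score o.

    One hidden layer with d tanh units followed by a linear output unit.
    Dropout is omitted in this sketch.
    """
    hidden = [
        math.tanh(sum(w * z_k for w, z_k in zip(row, z)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    return sum(w * h for w, h in zip(w_out, hidden)) + b_out


# Hypothetical toy setup: d = 4, random weights stand in for trained ones.
d = 4
rng = random.Random(0)
w_hidden = [[rng.gauss(0.0, 0.1) for _ in range(d)] for _ in range(d)]
b_hidden = [0.0] * d
w_out = [rng.gauss(0.0, 0.1) for _ in range(d)]
b_out = 0.0

# One feature vector per hypothesis; each hypothesis is scored independently.
features = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(3)]
scores = [score_head(z, w_hidden, b_hidden, w_out, b_out) for z in features]
```

In the actual model, z would be the final hidden state of the start-of-sentence token produced by the transformer for the pair (x, u).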

Training and Inference
We train the reranker discriminatively, hence the name DrNMT for Discriminative Reranker for NMT, by minimizing the KL-divergence between the target distribution and the model output distribution, $D_{KL}(p_T \| p_M)$ (Cao et al., 2007). For each x, the model output distribution is a softmax over all n hypotheses in the n-best list:
$$p_M(u_i \mid x; \theta) = \frac{\exp(o(u_i \mid x; \theta))}{\sum_{j=1}^{n} \exp(o(u_j \mid x; \theta))}$$
where we made explicit that the score $o_j = o(u_j \mid x; \theta)$ is conditioned on the input x and parameter vector θ. Notice that we do not enforce any additional factorization. In particular, we do not assume that the score is computed auto-regressively. The target distribution is defined as a normalized distribution of the end metric $\mu(u_i, r)$, which we assume to improve as it takes on larger values:
$$p_T(u_i) = \frac{\exp(\mu(u_i, r) / T)}{\sum_{j=1}^{n} \exp(\mu(u_j, r) / T)}$$
where T is the temperature that controls the smoothness of the distribution. In practice, we apply min-max normalization to µ: we subtract the minimum value in the hypothesis set from each value and divide the result by the difference between the maximum and the minimum value, so that the best hypothesis scores 1 and the worst 0. This helps the optimization as it reduces the variance of the gradients, as pointed out by Edunov et al. (2018).
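As a worked example of the target distribution, the sketch below applies min-max normalization to a toy list of sentence-level BLEU scores and then a softmax with temperature. It assumes the temperature enters as µ/T inside a softmax and uses T = 0.5, the value reported in the ablation study. The helper names and the BLEU values are illustrative, and mapping an all-equal list to zeros is an assumption of this sketch.

```python
import math


def minmax(values):
    """Rescale values so that the best entry is 1 and the worst is 0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption of this sketch: a list with identical values maps to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def target_distribution(metric_values, temperature=0.5):
    """Target distribution p_T over one n-best list."""
    scaled = [v / temperature for v in minmax(metric_values)]
    return softmax(scaled)


# Toy sentence-level BLEU scores for a 4-best list (made-up numbers).
bleu = [31.0, 27.5, 35.0, 20.0]
p_t = target_distribution(bleu)
```

With T = 0.5 the scaled scores lie in [0, 2], so the best hypothesis receives exp(2) ≈ 7.4 times the probability mass of the worst one.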
The parameters of DrNMT are then learned by minimizing the KL divergence over the training dataset. For a given training example, we have:
$$\mathcal{L}(\theta) = D_{KL}(p_T \| p_M) = \sum_{j=1}^{n} p_T(u_j) \log \frac{p_T(u_j)}{p_M(u_j \mid x; \theta)}$$
which differs from the cross-entropy $-\sum_{j} p_T(u_j) \log p_M(u_j \mid x; \theta)$ only by a term that does not depend on θ. We minimize this loss over the training set by stochastic gradient descent using standard backpropagation of the error, since all terms are differentiable. In order to alleviate overfitting, we employ dropout regularization (Srivastava et al., 2014), we pre-train the model (Conneau et al., 2019) and we also perform data augmentation by training on back-translated data (BT) (Sennrich et al., 2016). See §5.3 for details.
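The per-example loss can be illustrated numerically. The sketch below computes the KL divergence between a given target distribution and the softmax of the reranker scores, together with its gradient with respect to the scores, which for a softmax output is the difference between the model and target probabilities. Function names are illustrative, and in practice an automatic differentiation framework would compute the gradient.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of floats."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def kl_loss(target_probs, scores):
    """KL(p_T || p_M), where p_M is the softmax of the reranker scores."""
    log_p_m = log_softmax(scores)
    return sum(
        p * (math.log(p) - lq)
        for p, lq in zip(target_probs, log_p_m)
        if p > 0.0  # terms with zero target mass contribute nothing
    )


def kl_grad(target_probs, scores):
    """Gradient of the loss with respect to each score: p_M - p_T."""
    p_m = [math.exp(lq) for lq in log_softmax(scores)]
    return [q - p for p, q in zip(target_probs, p_m)]


# Toy example: the target prefers the first hypothesis, the scores are flat.
loss = kl_loss([0.9, 0.1], [0.0, 0.0])
grad = kl_grad([0.9, 0.1], [0.0, 0.0])
```

The negative gradient entry for the first hypothesis indicates that a descent step raises its score, moving the model distribution towards the target.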
At test time, generation proceeds by first having the NMT model generate the n-best list, and then by applying the reranker to select the best hypothesis. Since the score of the forward model is also available, unless otherwise specified we rerank using a weighted combination of both; this is dubbed DrNMT. In the experiments we also report results obtained by adding all the other scores from NCD, namely the backward model score and the language model score. We denote this variant by "DrNMT + NCD". Whenever we combine scores from various models, we tune the additional hyper-parameters controlling the weighted combination by random search on the validation set.
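The test-time procedure can be sketched as follows, under stated assumptions: each hypothesis carries a forward model log-probability (`fw`) and a reranker score (`dr`), a single weight is tuned, and the tuning criterion is the average sentence-level metric of the selected hypotheses (a proxy; corpus-level BLEU would normally be used). The field names, the sampling range and the toy data are hypothetical.

```python
import random


def rerank(hyps, weight):
    """Pick the hypothesis maximizing forward score + weight * reranker score."""
    return max(hyps, key=lambda h: h["fw"] + weight * h["dr"])


def random_search(dev_lists, n_trials=100, low=0.0, high=10.0, seed=0):
    """Sample weights uniformly and keep the one with the best dev quality."""
    rng = random.Random(seed)
    best_weight, best_quality = None, float("-inf")
    for _ in range(n_trials):
        weight = rng.uniform(low, high)
        quality = sum(rerank(h, weight)["metric"] for h in dev_lists) / len(dev_lists)
        if quality > best_quality:
            best_weight, best_quality = weight, quality
    return best_weight, best_quality


# Toy validation data: one 2-best list where the forward model prefers "a"
# but the reranker (and the metric) prefer "b".
dev_lists = [[
    {"text": "a", "fw": -1.0, "dr": 0.1, "metric": 20.0},
    {"text": "b", "fw": -1.5, "dr": 0.9, "metric": 40.0},
]]
best_weight, best_quality = random_search(dev_lists)
```

Adding the backward model and language model scores of NCD would simply add further weighted terms to the key function, with one more weight per term in the search.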

Experimental Setup
In this section we describe the datasets, baselines and model details.

Datasets
We experiment on four language pairs: German-English (De-En), English-German (En-De), English-Tamil (En-Ta) and Russian-English (Ru-En). For training on De-En and En-De, we use NewsCommentary from WMT'19 (Barrault et al., 2019) and NewsCrawl2018 for the parallel dataset and target side monolingual data, respectively. We validate on newstest2014 and newstest2015, and test on newstest2016, 2017, 2018 and 2019. For En-Ta, we use all bitext and monolingual data shared by the WMT'20 news translation task for training, and the officially released development and test sets for validation and testing purposes. For Ru-En, we use all the parallel data from WMT'19 (Barrault et al., 2019) and NewsCrawl2018 as the monolingual dataset for training, validate on newstest2015 and 2016, and test on newstest 2017, 2018 and 2019.
We follow prior work for data preprocessing, including sentence deduplication, language identification filtering on all bitext and monolingual data (Joulin et al., 2017) and in-domain filtering (Moore and Lewis, 2010) on Tamil CommonCrawl data. Table 1 shows the resulting size of each dataset. For the base NMT models, we learn 30K byte-pair encoding (BPE) units for De-En and En-De, 20K BPE units for En-Ta and 24K BPE units for Ru-En separately, using the sentencepiece toolkit (Kudo and Richardson, 2018). All systems are evaluated using SacreBLEU (Post, 2018).

Baselines
We use the Transformer (Vaswani et al., 2017) architecture and train MT models using bitext data only. These are the models that generate the n-best list, and which serve also as a lower bound for the performance of DrNMT. BT data is generated from beam decoding with beam size equal to 5. Since the bitext data of En-Ta originates from seven different sources, we prepend dataset tags to each source sentence to indicate the origin (Kobus et al., 2017). We do not prepend any tags on the validation and test sets when decoding, as this choice worked best during cross-validation. In general and for each language pair, we tune the model architecture and all hyper-parameters on the validation set.
In addition to beam decoding, we consider two reranking baselines. First, we consider the method recently introduced by Salazar et al. (2019). In its simplest formulation, this takes a pre-trained masked language model (MLM) on the target side, iteratively masks one word of the hypothesis at a time, and aggregates the corresponding scores to yield a score for the whole hypothesis. Then, this score is combined with the score of the forward model to rerank the n-best list; this is dubbed "fw + MLM". We also have a version of MLM which is tuned on our target side monolingual dataset; we dub this "fw + MLM-ft".
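The masking procedure of this baseline can be sketched as a pseudo-log-likelihood: each token is masked in turn and the log-probability the MLM assigns to the original token is summed. The callable `masked_log_prob` is a hypothetical interface standing in for a real MLM, and the uniform toy model below exists only to make the example runnable.

```python
import math


def pseudo_log_likelihood(tokens, masked_log_prob, mask_token="<mask>"):
    """Sum, over positions, of the log-probability of the masked-out token."""
    total = 0.0
    for i, token in enumerate(tokens):
        masked = tokens[:i] + [mask_token] + tokens[i + 1:]
        total += masked_log_prob(masked, i, token)
    return total


def uniform_masked_log_prob(masked_tokens, position, original_token, vocab_size=4):
    """Toy stand-in for an MLM: every token is equally likely."""
    return math.log(1.0 / vocab_size)


hypothesis = ["the", "cat", "sat"]
mlm_score = pseudo_log_likelihood(hypothesis, uniform_masked_log_prob)
```

A hypothesis with T tokens requires T forward passes through the MLM, one per masked position.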
Finally, we consider reranking using noisy channel decoding (NCD; Yee et al., 2019). NCD reranks by taking a weighted combination of three scores: the forward model score, the score of a target-side language model (LM), and the score of a backward model. A length penalty is then applied on the combined score. The weights and the length penalty are tuned on the validation set via random search. All LMs are transformers with 16 blocks, 16 attention heads and embedding size 1024. They are trained on the target side monolingual data only.
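The score combination can be illustrated with a short sketch. The exact form of the length penalty is not specified in the text; dividing the combined score by the hypothesis length raised to a tunable exponent is an assumption made here, as are the function and parameter names.

```python
def ncd_score(fw, lm, bw, length, w_lm, w_bw, len_pen):
    """Weighted combination of forward, LM and backward log-probabilities.

    Assumption of this sketch: the length penalty divides the combined
    score by length ** len_pen (len_pen = 0 disables it).
    """
    combined = fw + w_lm * lm + w_bw * bw
    return combined / (length ** len_pen)


def ncd_rerank(hyps, w_lm, w_bw, len_pen):
    """Return the hypothesis with the highest combined score."""
    return max(
        hyps,
        key=lambda h: ncd_score(h["fw"], h["lm"], h["bw"], h["len"], w_lm, w_bw, len_pen),
    )


# Toy 2-best list with made-up log-probabilities.
hyps = [
    {"text": "short", "fw": -2.0, "lm": -3.0, "bw": -4.0, "len": 4},
    {"text": "longer", "fw": -2.5, "lm": -2.0, "bw": -2.0, "len": 6},
]
best = ncd_rerank(hyps, w_lm=0.5, w_bw=0.5, len_pen=1.0)
```

The two weights and the penalty exponent are the hyper-parameters that would be tuned by random search on the validation set.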

Setting Up DrNMT
We use XLM-R Base (Conneau et al., 2019), a transformer-based multilingual MLM trained on more than 2.5TB of filtered CommonCrawl data in 100 languages, including En, De, Ta and Ru, as the pre-trained model for DrNMT. The same model is also used in the MLM baseline described in §5.2. The XLM-R Base model consists of 12 transformer blocks, 12 attention heads, embedding size 768 (270M params) and has a vocabulary size of 250K BPE units. As each training sample of XLM-R only contained a single language, we further enhance the model with two randomly initialized language embeddings to indicate the source and target languages for the reranker. We perform beam decoding on both bitext and BT data using the baseline MT models to generate n-best lists with 50 hypotheses. We combine n-best lists from both bitext and BT as training data for the rerankers for De-En, En-De and En-Ta, and use only BT data for Ru-En. We train DrNMT with batch size 512, use Adam (Kingma and Ba, 2015) and early-stop when the validation performance does not improve after 12K parameter updates. All hyper-parameters, including learning rate, number of warmup steps, dropout rate, etc., are tuned on the validation set. All models are implemented and trained using fairseq (see footnote 3).

Results
In this section we report the main findings of our work. When optimizing for BLEU as metric, the performance of DrNMT and baselines for De-En, En-De, En-Ta and Ru-En is summarized in Table 2. The findings are similar across the four language directions. We therefore focus the discussion on the De-En test set results.
First, we notice that all methods improve over the beam search output, with gains ranging from 1.0 to 4.1 BLEU. However, there may still be room for improvement, as the oracle performance suggests. The oracle is computed by selecting the best hypotheses based on BLEU with respect to the human reference. Of course, the oracle may not be achievable because of uncertainty in the translation task.
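The oracle selection can be sketched as picking, for each source sentence, the hypothesis with the highest sentence-level score against the reference. To keep the example self-contained, a toy token-set Jaccard similarity stands in for sentence-level BLEU; it is not the metric used in the experiments, and the sentences are made up.

```python
def jaccard(hypothesis, reference):
    """Toy stand-in for a sentence-level metric: token-set Jaccard similarity."""
    hyp_tokens, ref_tokens = set(hypothesis.split()), set(reference.split())
    union = hyp_tokens | ref_tokens
    if not union:
        return 0.0
    return len(hyp_tokens & ref_tokens) / len(union)


def oracle_select(nbest, reference, sentence_metric=jaccard):
    """Return the hypothesis that scores best against the reference."""
    return max(nbest, key=lambda hyp: sentence_metric(hyp, reference))


nbest = ["the cat sat", "a cat sits", "the cat sat down"]
reference = "the cat sat down"
oracle = oracle_select(nbest, reference)
```

Because the oracle has access to the reference, it is an upper bound on what any reranker of the same n-best list can reach.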
3 Code for reproducing the results can be found at: https://github.com/pytorch/fairseq/tree/master/examples/discriminative_reranking_nmt
Second, Salazar et al. (2019)'s method, particularly the version fine-tuned on the in-domain training dataset, improves upon beam by 1.1 BLEU points. However, the improvement over beam is not as large as with NCD, which improves upon beam by 3.2 BLEU points, suggesting that among the non-discriminative reranking methods NCD performs the best.
Third, DrNMT performs on par with (En-Ta, En-De and Ru-En) or better than (De-En) NCD, showing that discriminative reranking can be very competitive. Note that the reranker requires only one additional forward pass through the hypotheses generated by beam, while NCD requires two forward passes (one for the LM and one for the backward MT model). Therefore, our reranker works at least as well as NCD while requiring roughly half of the compute.
Fourth, the discriminative reranker and NCD are complementary to each other, since combining both achieves the best performance overall across the four language directions, with gains between 0.9 BLEU (De-En) and 0.2 (En-Ta) compared to NCD, and an overall gain between 4.1 BLEU (De-En) and 0.5 (En-Ta) compared to the beam baseline.
Fifth, the gain brought by discriminative reranking can be better appreciated by comparing "fw + LM" and DrNMT, as the major difference between the two approaches is the objective function used for training them (generative language modeling instead of prediction of the distribution of BLEU scores). We can see that in all cases, discriminative reranking yields better translations, with gains between 0.2 and 2.3 BLEU points depending on the language direction.
Finally, we notice that En-Ta is a difficult language pair, in which the baseline NMT is weak and none of the reranking approaches work nearly as well as in the other language directions. The difference between validation and test BLEU scores also suggests a certain degree of overfitting to the validation set. Despite this, our reranker still yields the largest improvement over beam. Appendix B shows similar trends when test performance is measured in terms of translation error rate (TER) (Snover et al., 2006), showing that DrNMT is not particularly overfitting to the training metric.
Human evaluation: We randomly sample 750 sentences from the De-En test sets and collect human ratings. We perform A/B testing, where a rater can see the source sentence together with translated sentences from two systems. We conduct two rounds of human evaluation by comparing the proposed "DrNMT + NCD" vs. "beam", and "DrNMT + NCD" vs. "NCD". For each sentence, we collect three ratings (between 0 and 100) and average the scores, treating sentences with a score difference of less than 5 as equally good. Out of the 750 sentences, our proposed method generates a better translation than beam on 149 sentences and a worse one on 82 sentences, and it performs better than NCD on 123 sentences and worse on 108 sentences, corroborating the gains observed when measuring with BLEU.
Next we show that DrNMT works with other user-specified metrics, study how performance varies with the number of hypotheses and perform several ablation studies to better understand its critical components.
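The aggregation rule used in the human evaluation can be sketched as follows. The function name and the example ratings are illustrative, and averaging is assumed to be a plain arithmetic mean of the three ratings per system.

```python
def compare(ratings_a, ratings_b, tie_margin=5.0):
    """Compare two systems on one sentence from their averaged ratings.

    Returns "a", "b" or "tie"; differences below tie_margin count as ties.
    """
    mean_a = sum(ratings_a) / len(ratings_a)
    mean_b = sum(ratings_b) / len(ratings_b)
    diff = mean_a - mean_b
    if abs(diff) < tie_margin:
        return "tie"
    return "a" if diff > 0 else "b"


# Made-up ratings for two sentences.
outcome_clear = compare([80, 90, 85], [60, 70, 65])  # means 85 vs 65
outcome_close = compare([70, 72, 71], [69, 70, 74])  # means 71 vs 71

# Ties implied by the counts reported above: 750 - 149 - 82 sentences.
ties_vs_beam = 750 - 149 - 82
```

Under this rule, the reported counts imply that 519 of the 750 sentences were judged equally good in the comparison against beam.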

Optimizing for a Different Metric
In order to validate the generality of DrNMT, we consider as metric µ the opposite of TER, so that larger values indicate better translation quality. Table 3 shows validation and test performance in terms of both BLEU and TER when optimizing for either one of the two metrics. While the two metrics are correlated, the best results are achieved when optimizing for the metric used at test time.

Varying the Number of Hypotheses
We examine the effect of training the reranker with different sizes of the n-best list, U(x). Even though we fix the n-best list size at training time, we can apply the reranker on n-best lists of different sizes at test time. Figure 2 shows the performance of DrNMT on De-En validation sets from four rerankers trained with 5, 10, 20 and 50 hypotheses, respectively. As the size of the n-best list at test time increases, the performance of all rerankers and NCD improves. On the other hand, the performance of beam decoding starts to saturate early, at beam size 10. A reranker trained with 50 hypotheses gives a 1.4 BLEU improvement over beam decoding when the beam size is only 5 at test time, and the improvement increases to 3.4 BLEU as we increase the beam size to 200 at test time. DrNMT consistently performs better than or as well as NCD in all training and testing scenarios.
Interestingly, a reranker trained with more hypotheses performs better than one trained with fewer hypotheses, regardless of the beam size used at test time. For instance, when the beam size is 20 at test time, the reranker trained with beam size 50 improves over beam by 2.3 BLEU points, while the one trained with beam size 20, matching the test condition, improves by 2.2 BLEU points.
To our surprise, a reranker trained with only 5 hypotheses can still yield a 3.2 BLEU gain compared with beam decoding when used to rerank 200 hypotheses during test time, indicating that the reranker suffers little from the mismatch between training and testing conditions. As a result, depending on available compute resources, one can decide to set the number of hypotheses to the largest value possible to get better test time performance with larger n-best lists, while being robust to the particular choice used at training time.

Ablation Study
We report an ablation study by probing all major design choices made. We train DrNMT by optimizing BLEU and evaluate it on the validation set of the De-En task using 50 hypotheses both at training and test time.
Pre-training: We investigate the importance of pre-training by comparing with a reranker of the same size initialized with random weights. Table 4 shows that a randomly initialized reranker performs significantly worse, with a decrease of 0.8 BLEU. In addition to lower performance, a randomly initialized reranker also trains more slowly, requiring 1.6× more weight updates than the pre-trained reranker to converge. This corroborates our choice to pre-train, as the reranking task is fairly related to the pre-training task and we lack sufficient labeled data to train such a large model from scratch. Notice that our pre-trained reranker trains for at most two passes over the data before starting to overfit to its training set.
Source sentence: When comparing "fw + LM" against DrNMT to assess the impact of training discriminatively, we did not take into account a confounding factor which is the fact that the LM does not attend over the source sentence. Indeed, Salazar et al. (2019) score hypotheses without taking into account the source sentence. What is the gain brought by considering also the source sentence? To answer this question we compare our reranker with a reranker that takes as input only the hypotheses. As shown in Table 4, including the source sentences achieves a small gain of 0.2 BLEU.
Normalization: We apply min-max normalization and set T = 0.5 when computing the target distribution in the training objective, so that for every source sentence the temperature-scaled scores µ/T of its hypotheses range between 0 and 2. This choice yields a 0.4 BLEU improvement compared to a reranker trained with the raw BLEU scores.
Training data: So far we've been training the reranker with both bitext and BT data. In Table 4, we see that training the reranker with only bitext data deteriorates the model's performance by 2 BLEU points. The model starts overfitting after 15 passes over the small bitext (around 9,000 parameter updates). Incorporating the BT data helps alleviate this issue. The model achieves the best validation performance after 1.9 passes over the combination of bitext and BT data (around 63,000 parameter updates).
Model size: We explore building the reranker using only the first few layers of the XLM-R Base model. Since beam hypotheses often differ only locally on isolated phrases, one may wonder whether more local features, such as those produced by a shallower reranker, may work better. Moreover, reducing the model capacity may help prevent overfitting. Compared with rerankers using only three or six transformer blocks, Table 4 shows that deeper and bigger models work better, despite being more prone to overfitting and despite capturing more global information about their input.

Other Training and Model Variations
We conclude our empirical evaluation by investigating how reranking works on top of baseline NMT models trained with back-translation, and by reporting two variations of model architectures. As before, we report results on the validation set of the De-En task with n-best list of size 50, using BLEU as metric.
MT trained with bitext+BT: Would the gains brought by the reranker carry over when this is applied on the n-best list produced by a baseline NMT model trained with back-translation? As shown in Table 2, the beam baseline on validation was at 24.7 BLEU, while if we train the NMT by adding back-translated data, BLEU increases to 31.6 (Table 5).
Table 5 (De-En validation BLEU; baseline NMT trained with bitext + BT):
Method                              valid
beam (fw)                           31.6
  + MLM (Salazar et al., 2019)      32.6
  + MLM-ft (Salazar et al., 2019)   32.6
  + LM                              33.1
NCD                                 33.3
DrNMT                               33.1
  + NCD                             33.6
In this case, we train the reranker using hypotheses generated by the more powerful NMT model trained with back-translated data. From Table 5, we can see that DrNMT gives a 1.5 BLEU improvement over the beam decoding baseline, and combining NCD and the reranker gives an additional gain of 0.5 BLEU, which is less than what we reported in Table 2 but still confirms the overall finding that the discriminative reranker and NCD perform similarly while being complementary to each other.
Causal vs. bidirectional: As the complete hypothesis is available during reranking, the architecture of our reranker is bidirectional, as it conditions on the whole sentence. This contrasts with how the baseline NMT model generates hypotheses and how it scores them with beam search, which leverages an auto-regressive decomposition. Here we explore the importance of joint modeling and consider an alternative reranker which consists of an encoder and a causal decoder, and which is therefore initialized from the base NMT generating the n-best list. Given a source sentence and a hypothesis as input, the output of the decoder is a T × d matrix (notice that hidden states are causal), where T is the number of tokens of the hypothesis, and d is the hidden dimension. We average the output across positions to obtain a d-dimensional representation and apply the same one-hidden-layer neural network to obtain a reranking score. Table 6 shows that our bidirectional architecture outperforms the causal architecture by 0.8 BLEU.
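The pooling step of the causal variant amounts to averaging the T decoder states over positions. A minimal sketch with a made-up 2 × 2 example follows; the function name is illustrative.

```python
def mean_pool(hidden_states):
    """Average a T x d matrix (list of T rows) over positions into d values."""
    num_positions = len(hidden_states)
    return [sum(column) / num_positions for column in zip(*hidden_states)]


# Toy decoder output with T = 2 positions and d = 2 hidden units.
pooled = mean_pool([[1.0, 2.0], [3.0, 4.0]])
```

The pooled vector then plays the role that the start-of-sentence hidden state plays in the bidirectional reranker.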
Table 6 (De-En validation BLEU; causal vs. bidirectional reranker):
Architecture                        valid
encoder + causal decoder            26.8
bi-directional (proposed)           27.6
Set reranker: While our training objective considers the full set of hypotheses of each source sentence, the reranker scores each pair of (x, u_i) in isolation; it never compares hypotheses directly. We therefore explore an architecture that computes cross-hypothesis features. In the original reranker architecture, the model produces a d-dimensional representation for each (x, u_i). We add another transformer block that computes self-attention across the set of n representations for {(x, u) | u ∈ U(x)}. We then apply the one-hidden-layer projection block to map each d-dimensional vector to a single score as before, yielding n scores for reranking. This design enables the model to have set-level information during reranking, and thus the scoring has to be performed on the full set at once. Table 7 shows that these two model variants perform the same, suggesting that set-level representations may need to be captured at a lower layer of the transformer. We leave this avenue of exploration for future work.
Table 7: Reranking with features computed over the entire n-best list (set-level reranking) vs. features from just the current hypothesis.
Reranker                            valid
set-level                           27.6
hypothesis-level (proposed)         27.6

Conclusions
Reranking is effective for both SMT and NMT. Inspired by work done almost two decades ago (Och, 2003), we studied discriminative reranking for NMT and found that it performs at least as well as the strongest generative reranking method we are aware of, namely noisy channel decoding (NCD) (Yee et al., 2019), as long as care is taken to alleviate overfitting.
There is a subtle trade-off between improvements stemming from optimizing the end metric and addressing exposure bias on the one hand, and poor generalization and sample inefficiency of discriminative training on the other hand. In this study we regularize the reranker by using dropout, by pre-training on large corpora and by performing data augmentation.
Empirically, we found that NCD and our discriminative reranker are complementary to each other, yielding sizeable improvements over each other and the beam baseline. Our reranker is computationally less demanding than NCD, since it consists of a single model while NCD requires scoring using two additional models. Our reranker is also robust to the choice of the size of the n-best list and other hyper-parameter settings.
In the future we plan to investigate better ways to alleviate sample inefficiency, as well as to design more effective architectures to score at the set level.

Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. 2007. Learning to rank: from pairwise approach to listwise approach. In Proceedings of the 24th International Conference on Machine Learning, pages 129-136.

We build the baseline MT models in Table 2 following the Transformer big architecture (Vaswani et al., 2017) with 6 layers, embedding size 1024 and 16 attention heads. Table 8 shows the additional hyper-parameters that we tune on the validation set for the best performing models of each language direction. We use Adam with $\beta_1 = 0.9$, $\beta_2 = 0.98$, $\epsilon = 10^{-8}$, and apply an inverse square root learning rate schedule with 4000 warmup steps. We train for 200 epochs for De-En, En-De and En-Ta, and 100K updates for Ru-En, and select the best checkpoint based on validation loss.
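An inverse square root schedule of the kind mentioned above can be sketched as follows, assuming linear warmup from zero to a peak rate followed by decay proportional to the inverse square root of the step number. The peak learning rate of 0.001 is a placeholder chosen for illustration, since the tuned values are reported in Table 8.

```python
def inverse_sqrt_lr(step, peak_lr, warmup_steps=4000):
    """Learning rate at a given update step (step >= 1).

    Assumption of this sketch: warmup starts from zero and is linear.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (warmup_steps / step) ** 0.5


# Placeholder peak rate; not a value taken from the paper.
peak = 0.001
lr_mid_warmup = inverse_sqrt_lr(2000, peak)   # halfway through warmup
lr_at_peak = inverse_sqrt_lr(4000, peak)      # end of warmup
lr_later = inverse_sqrt_lr(16000, peak)       # 4x the warmup steps
```

After warmup, quadrupling the number of updates halves the learning rate.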

A.2 LM
For all LMs, we use 16 transformer layers, embedding size 1024, feed-forward network embedding size 4096 and 16 attention heads. We optimize with NAG with learning rate 0.0001 and a cosine learning rate schedule with 16K warmup steps. All models are trained on 32 GPUs for a maximum of 984K steps, and the best checkpoint is selected based on validation loss.

A.3 DrNMT
We train DrNMT using Adam with $\beta_1 = 0.9$, $\beta_2 = 0.98$, $\epsilon = 10^{-6}$, and apply a polynomial learning rate decay schedule with 8000 warmup steps for De-En, En-Ta, and Ru-En, and 16K warmup steps for En-De. We use a learning rate of 0.00005 and dropout 0.2 for De-En, En-Ta, and Ru-En, and a learning rate of 0.00001 and dropout 0.1 for En-De.

Table 9 summarizes the average validation and test TER (Snover et al., 2006) of DrNMT trained with BLEU (Papineni et al., 2002b) scores.