Energy-Based Reranking: Improving Neural Machine Translation Using Energy-Based Models

The discrepancy between maximum likelihood estimation (MLE) and task measures such as BLEU score has been studied before for autoregressive neural machine translation (NMT) and has resulted in alternative training algorithms (Ranzato et al., 2016; Norouzi et al., 2016; Shen et al., 2016; Wu et al., 2018). However, MLE training remains the de facto approach for autoregressive NMT because of its computational efficiency and stability. Despite this mismatch between the training objective and the task measure, we notice that samples drawn from an MLE-trained NMT model support the desired distribution: some samples have a much higher BLEU score than the beam decoding output. To benefit from this observation, we train an energy-based model to mimic the behavior of the task measure (i.e., the energy-based model assigns lower energy to samples with higher BLEU score), which results in a re-ranking algorithm based on the samples drawn from NMT: energy-based re-ranking (EBR). We use both marginal energy models (over target sentences) and joint energy models (over both source and target sentences). Our EBR with the joint energy model consistently improves the performance of the Transformer-based NMT: +3.7 BLEU points on IWSLT'14 German-English, +3.37 BLEU points on Sinhala-English, and +1.4 BLEU points on WMT'16 English-German tasks.


Introduction
Autoregressive models are widely used for neural machine translation (NMT) (Bahdanau et al., 2015; Gehring et al., 2017; Vaswani et al., 2017). The autoregressive factorization provides a tractable likelihood computation as well as efficient sampling. The former results in effective maximum likelihood estimation (MLE) for training the parameters of NMT models. However, optimizing likelihood does not guarantee an improvement in task-based measures such as the BLEU score, which has motivated directly optimizing task measures with reinforcement learning (Ranzato et al., 2016; Norouzi et al., 2016; Shen et al., 2016; Bahdanau et al., 2017; Wu et al., 2018). For NMT, however, these training algorithms are often used in conjunction with MLE training (Wu et al., 2018) or as fine-tuning (Choshen et al., 2020).
* Amirmohammad Rooshenas is the corresponding author.
Interestingly, we observe that samples drawn from an NMT model trained using MLE may have higher quality (measured with BLEU) than the outputs of beam search. In particular, we draw 100 target samples for each source sentence from an NMT model trained using MLE on the IWSLT'14 German-English task, and observe that an oracle ranker (i.e., $\arg\max_{y \sim P_{NMT}(y|x)} \mathrm{BLEU}(y, y^*)$, where $(x, y^*)$ is the pair of source and gold target sentences) achieves a high score of 67.54, while beam decoding achieves 33.87. We also look at the distribution of the Spearman rank correlation coefficient between the BLEU scores of the drawn samples and their log probability scores under the baseline NMT (BaseNMT). Figure 1 shows that there is no strong correlation between the BLEU score ranking of samples and the log probability score ranking for the majority of source sentences; thus, maximum a posteriori (MAP) decoding is incapable of finding the desired output. In parallel to our study, Eikema and Aziz (2020) also report that the mismatch regarding MLE training of autoregressive models is attributable to the distribution of the probability mass rather than the parameter estimation, resulting in poor MAP decoding.
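The two diagnostics described above (the oracle choice among samples, and the rank correlation between model scores and BLEU) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `ranks`, `spearman`, and `oracle_index` are hypothetical, and the BLEU scores and log probabilities are assumed to be precomputed per sample.

```python
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            result[order[k]] = average_rank
        pos = end + 1
    return result


def spearman(a: Sequence[float], b: Sequence[float]) -> float:
    """Spearman coefficient: the Pearson correlation of the two rank vectors."""
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5


def oracle_index(bleu_scores: Sequence[float]) -> int:
    """Index of the sample an oracle ranker would pick (highest BLEU)."""
    return max(range(len(bleu_scores)), key=lambda i: bleu_scores[i])
```

Calling `spearman(log_probs, bleu_scores)` once per source sentence would give one coefficient per sentence, and a histogram of those coefficients corresponds to the kind of distribution plotted in Figure 1.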
Instead of looking for an alternate algorithm for parameter estimation, these results motivate us to explore training a parametric approximation of the metric, here the BLEU score: $\omega_\theta(y, x) \approx \mathrm{BLEU}(y, y^*)$. Therefore the decoding becomes:
$$\arg\max_{y \sim P_{NMT}(y|x)} \omega_\theta(y, x).$$
Figure 1: Distribution of the Spearman rank-order correlation coefficients for the training data (left) and test data (right) of the IWSLT'14 German-English task.
We use energy-based models (EBMs) to parameterize $\omega_\theta(y, x)$. EBMs (LeCun et al., 2006) are general parametric models that assign a scalar energy value to each configuration of input variables, thus defining an unnormalized probability distribution. Although computing the partition function is intractable for general EBMs, we only require the relative energy of the sampled sentences from the BaseNMT model, thus canceling out the normalization constant. In this paper we use two different energy-based models: a marginal energy model (Marginal-EBM) defined only over target sentences and a joint energy model (Joint-EBM) defined over both source and target sentences.
Figure 1 also shows the correlation coefficient of the energy ranking and BLEU score using both Marginal-EBM and Joint-EBM. The shift in the coefficient distribution suggests that decoding based on energy scores results in better BLEU scores compared to decoding based on the log probability scores of the BaseNMT model. We also observe that Joint-EBM works better than Marginal-EBM, as Joint-EBM better captures the correlation of source and target sentences, while Marginal-EBM is not directly conditioned on the source sentence.
In this paper, we describe how to train EBMs to achieve the desired ranking. Our energy ranker consistently improves the performance of Transformer-based NMT on the German-English, Romanian-English and Italian-English tasks from IWSLT'14, the French-English task from IWSLT'17, the German-English task from WMT'14, and the English-German task from WMT'16, as well as the low-resource Sinhala-English and Nepali-English tasks described in the FLoRes dataset. The EBM is trained such that its energy landscape is consistent with the BLEU score. Marginal-EBM is not conditioned on the source sentence, thus each local region is trained to have a ranking similar to that of the BLEU score for the samples in the region.

Energy-Based Reranking
Using an EBM $E_\theta$ to reweight the samples from an NMT model defines a new probability distribution over the output sentences (see Grover et al. (2019)):
$$P_\theta(y|x) \propto P_{NMT}(y|x) \exp\left(-\frac{E_\theta(y, x)}{T}\right),$$
where $T$ is the temperature. The ideal re-ranker requires an EBM with the energy function $E_\theta(y, x)$ such that $P_\theta(y|x)$ and $\mathrm{BLEU}(y, y_i)$ have similar modes for all $(x_i, y_i) \in D$, where $D$ is an empirical data distribution. To train $\theta$ we use rank-based training (Rohanimanesh et al., 2011; Rooshenas et al., 2018, 2019). Rank-based training enforces that the samples from $P_\theta(\cdot)$ have similar rankings with respect to both the energy score and the task measure (see Figure 2).
To sample from $P_\theta(y|x)$, we sample $k$ sentences from $P_{NMT}(y|x)$ using multinomial sampling from locally normalized distributions over the output, and reweight the samples based on the energy network, $\exp(-E_\theta(y, x)/T)$. Then we resample two sentences, $y_1$ and $y_2$, from the renormalized set, which defines a conditional distribution: $P^i(y|x) = \frac{\exp(-E_\theta(y, x)/T)}{\sum_k \exp(-E_\theta(y_k, x)/T)}$ (a similar sampling approach has been used in Deng et al. (2020)). Now we train the energy model such that the ranking of $y_1$ and $y_2$ with respect to the energy model is consistent with their ranking with respect to the task metric, the BLEU score.
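The reweight-and-resample step can be sketched as below. This is an illustration under stated assumptions rather than the authors' implementation: the names `energy_weights` and `resample_pair` are hypothetical, the energies of the k samples are assumed to be already computed, and the two sentences are drawn with replacement, which the text does not specify.

```python
import math
import random
from typing import List, Sequence, Tuple


def energy_weights(energies: Sequence[float], temperature: float) -> List[float]:
    """Normalize exp(-E/T) over the k sentences sampled from the NMT model."""
    scaled = [-e / temperature for e in energies]
    shift = max(scaled)  # subtract the max for numerical stability
    unnormalized = [math.exp(s - shift) for s in scaled]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def resample_pair(
    samples: Sequence[str],
    energies: Sequence[float],
    temperature: float,
    rng: random.Random,
) -> Tuple[str, str]:
    """Draw y_1 and y_2 from the renormalized set.

    Assumption: sampling is with replacement, so the pair may repeat a sentence.
    """
    probs = energy_weights(energies, temperature)
    y1, y2 = rng.choices(list(samples), weights=probs, k=2)
    return y1, y2
```

Because the samples already come from the NMT model, only the energy term appears in the weights; with a large temperature the weights become nearly uniform over the k samples.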
In general, we assume $y_h$ is the sentence with the higher BLEU score and $y_l$ is the sentence with the lower BLEU score. Therefore, the training objective of $E_\theta(y, x)$ becomes:
$$M = \alpha \left(\mathrm{BLEU}(y_h, y_i) - \mathrm{BLEU}(y_l, y_i)\right),$$
$$\xi(y_i, x_i) = M + E_\theta(y_h, x_i) - E_\theta(y_l, x_i),$$
$$\min_\theta \sum_{(y_i, x_i) \in D} \max(\xi(y_i, x_i), 0),$$
where $\xi(y_i, x_i)$ is the margin violation and $\alpha$ is the margin weight. Algorithm 1 outlines the whole training procedure. If we define the energy only over sentences of the target language, $E_\theta(y)$, we can share the energy model among multiple language pairs with the same target language. In this case we first sample the language $l$ from our language set and then sample a sentence pair from the selected language's training set $D_l$. The probability of selecting a language is proportional to the number of sentences in its training set.
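A per-pair version of this loss can be sketched as follows, assuming the margin violation takes the hinge form ξ = α (BLEU(y_h) − BLEU(y_l)) + E(y_h) − E(y_l), so that the higher-BLEU sentence is pushed toward lower energy. The function names are hypothetical, and the scale of the BLEU values (0 to 1 versus 0 to 100) is an assumption that changes the effective margin.

```python
def margin_violation(
    energy_high: float,
    energy_low: float,
    bleu_high: float,
    bleu_low: float,
    alpha: float,
) -> float:
    """xi = alpha * (BLEU(y_h) - BLEU(y_l)) + E(y_h) - E(y_l).

    y_h is the sampled sentence with the higher BLEU score, y_l the lower one.
    """
    margin = alpha * (bleu_high - bleu_low)
    return margin + energy_high - energy_low


def rank_loss(
    energy_high: float,
    energy_low: float,
    bleu_high: float,
    bleu_low: float,
    alpha: float,
) -> float:
    """Hinge on the margin violation: zero once E(y_h) is low enough."""
    return max(
        margin_violation(energy_high, energy_low, bleu_high, bleu_low, alpha), 0.0
    )
```

The loss vanishes when the energy of y_h is below that of y_l by at least the BLEU-scaled margin, so pairs that are already ranked correctly contribute no gradient.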
Algorithm 1 Rank-Based Training of EBM
  $P_{NMT}(y|x)$ ← Pretrained NMT
  $E_\theta(y, x)$ ← Energy-based model for target sentences
  repeat
    Sample $(x_i, y_i)$ from $D$
    Draw $k$ samples from $P_{NMT}(y|x_i)$
    Resample $y_1$ and $y_2$ from the $k$ samples according to $P^i(y|x_i)$
    $y_h$ ← the sentence in $\{y_1, y_2\}$ with the higher BLEU score with respect to $y_i$; $y_l$ ← the other sentence
    Update $\theta$ with a gradient step on $\max(\xi(y_i, x_i), 0)$
  until convergence

In this paper, we use BERT (Devlin et al., 2019) to parameterize both $E_\theta(y, x)$ and $E_\theta(y)$. Sections 4.3 and 4.4 discuss the construction of $E_\theta$ in detail.

Grover et al. (2019) show that importance weights can be used to make generative models better fit the desired data distribution: $p_\theta(y) \propto q(y)\,\omega_\theta(y)$, where $q(y)$ is a generative model that we can efficiently take samples from and $\omega_\theta(y)$ is the importance weight function. The importance weights can be determined using a discriminator that differentiates the generated samples from the target data. Rosenfeld et al. (2001) and Parshakova et al. define $q(y)$ as an autoregressive model and $\omega_\theta(y)$ using a log-linear model: $\omega_\theta(y) = \exp(\theta^T \phi(y))$, where $\phi(y)$ is the vector of sufficient statistics (features) evaluated at $y$. The log-linear model simplifies training the parameters $\theta$: $\nabla_\theta p_\theta(y) = \sum_{y \in D} \phi(y) - \mathbb{E}_{\hat{y} \sim p_\theta(\cdot)}\, \phi(\hat{y})$. The expectation term can be estimated using rejection sampling or importance sampling given the proposal distribution $q$. Deng et al. (2020) extend this approach for text generation by using unrestricted EBMs instead of log-linear models: $\omega_\theta(y) = \exp(-E_\theta(y))$. They train the EBM using noise contrastive estimation (Gutmann and Hyvärinen, 2010). We find this less suitable for re-ranking in the translation tasks (see Section 4).

Related Work
Discriminative re-ranking was first introduced by Shen et al. (2004) for improving the performance of machine translation (MT). They trained a linear separator using the perceptron learning algorithm to distinguish the top r translations from the rest of the translations in the n-best possible outputs. The features for the discriminator are extracted from both source and target sentences. Mizumoto and Matsumoto (2016) combine the score of MT and the linear model using more complex syntactical features to re-rank the target sentences. Here, we rely on the features learned by BERT, and given the high capacity of the energy model, we train the energy model to respect the ranking of every pair of samples. Gulcehre et al. (2017) describe using a language model (LM) to improve the performance of NMT using shallow and deep fusion. Shallow fusion combines the marginal probabilities of predicting each word in NMT and LM: $\log P_{NMT}(y_i|y_{<i}) + \lambda \log P_{LM}(y_i|y_{<i})$, while deep fusion concatenates the hidden states of the two models before predicting each word and uses parallel data to fine-tune the weights. Similar to deep fusion, Domhan and Hieber (2017) feed the unnormalized output of the LM to the decoder of NMT. Domhan and Hieber (2017) jointly train the LM and NMT using monolingual target-side data and parallel data, respectively. Sennrich et al. (2016a) augment the parallel training data with monolingual target-language data via back-translation.
Re-ranking with an LM has also been explored by , where they decode the output based on $\log p(y|x) + \lambda_1 \log p(x|y) + \lambda_2 \log p(y)$, where $p(y|x)$ is the direct model provided by NMT, $p(x|y)$ is computed via back-translation and $p(y)$ is an LM. Our approach differs from the previous methods that use LMs for re-ranking, as we train our energy-based model to be consistent with the task measure instead of using pre-trained LMs. In our experiments, we only explore the effect of using the direct model plus an LM; nevertheless, back-translation can also be added to our model for further improvement.
Recently, Salazar et al. (2020) use masked language models (MLM) such as BERT to score hypotheses from NMT. Salazar et al. (2020) describe the score of an MLM as the pseudo-log-likelihood score (PLL). To calculate the PLL score of a sentence, each token $w_i$ in the sentence is sequentially masked, which allows the calculation of $\log p(w_i | w_{\setminus i})$ from the output of the MLM. The normalized pseudo-log-probability of the sentence is the average of the log-probabilities of the masked words given the rest of the words in the sentence: $\frac{1}{N} \sum_{i=1}^{N} \log p(w_i | w_{\setminus i})$, where $N$ is the length of the sentence. We use this approach as one of our baselines.
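The normalized PLL computation can be sketched as follows. The scorer `masked_log_prob` is a hypothetical callable standing in for a masked language model; it is assumed to return log p(w_i | w_\i) for the original token at the masked position, and is not an API of any particular library.

```python
from typing import Callable, List, Sequence

MASK = "[MASK]"

# Hypothetical scorer signature: (masked tokens, masked position, original token)
# -> log-probability of the original token at that position.
MaskedScorer = Callable[[List[str], int, str], float]


def normalized_pll(tokens: Sequence[str], masked_log_prob: MaskedScorer) -> float:
    """Average over positions of log p(w_i | w_\\i), masking one token at a time."""
    if not tokens:
        raise ValueError("cannot score an empty sentence")
    total = 0.0
    for i, token in enumerate(tokens):
        masked = list(tokens)
        masked[i] = MASK  # hide only the i-th token
        total += masked_log_prob(masked, i, token)
    return total / len(tokens)
```

Scoring a sentence this way takes one forward pass of the masked model per token, which is the main cost of PLL-based re-ranking.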
In parallel to our work, Guo et al. (2020) propose using two different BERT models as an encoder of the source language (X-BERT) and a decoder of the target language (Y-BERT). Guo et al. (2020) add an extra trainable encoder-decoder adaption module followed by a feed-forward module to each layer of the decoder and a feed-forward module to each layer of the encoder. (Please see Guo et al. (2020) for more detail on the architecture.) For fine-tuning XY-BERT for translation tasks, Guo et al. (2020) keep all XY-BERT's parameters fixed except the parameters of the new modules, and use mask-predict decoding (Ghazvininejad et al., 2019) for running test-time inference. Guo et al. (2020) report a significant improvement over prior non-autoregressive models and superior performance compared to autoregressive methods on the IWSLT'14 German-English task. Their finding is consistent with our improvement using the pretrained BERT model. However, our Joint-EBM model is a different way of using BERT for translation, which does not require separate BERT models for source and target language. Please see Section 4.9 for a detailed comparison.
Finally, other works also discuss using BERT to improve the performance of NMT. Clinchant et al. (2019) describe initializing the embedding or the whole encoder with BERT's parameters. Zhu et al. (2020) use an attention model to incorporate the output of BERT into the encoder and decoder of NMT. In our approach, we use BERT as an external energy-based ranker.

Datasets
We use German-English (De→En), Romanian-English (Ro→En) and Italian-English (It→En) from the IWSLT'14 datasets and French-English (Fr→En) from the IWSLT'17 translation tasks. We also use IWSLT'14 English-German (En→De) to show that the proposed method can be extended to translation tasks with a different target language. All sentences were preprocessed using byte-pair encoding (Sennrich et al., 2016b). For all language pairs in IWSLT'14 and IWSLT'17, we merge the test datasets tst2010, tst2011, tst2012 and report BLEU on the merged dataset. We also use German-English (De→En) from the WMT'14 and English-German (En→De) from the WMT'16 translation tasks.
Finally, we use the low-resource translation tasks Nepali-English (Ne→En) and Sinhala-English (Si→En) from the FLoRes translation tasks. We follow the dataset distribution and preprocessing steps described in  using the FLoRes implementation. The FLoRes dataset contains development (dev), devtest and test datasets for both language pairs. Similar to  we use the devtest dataset for all our evaluations.

Base Model
We use the Transformer (Vaswani et al., 2017) as our BaseNMT. Our Transformer architecture includes six encoder and six decoder layers, and the number of attention heads, embedding dimension and inner-layer dimension are 8, 512 and 4096, respectively. We use dropout, weight decay, and label smoothing to regularize our models. We use layer normalization and early stopping. Models are optimized using Adam (Kingma and Ba, 2015) with parameters $\beta_1 = 0.9$, $\beta_2 = 0.98$, and $\epsilon = 10^{-8}$, and we use the same learning rate scheduler as . We trained our models on 1 Nvidia TITANX GPU.

Marginal-EBM
To construct the energy network over the sentences of the target language, we use a pretrained BERT (Devlin et al., 2019) from Huggingface (Wolf et al., 2019) as our pretrained language model. We project the hidden state of BERT for each output token into a scalar value and define the energy value of the target sentence as the average of the scalar values. We use the BERT-base uncased model with 12 encoder layers, 768 hidden state dimension, 12 attention heads and 110M parameters. For the projection layer, we use a 2-layer MLP with 256 hidden variables. In our experiments, we only train the parameters of the projection layer and the rest of BERT's parameters remain frozen. We use a margin weight of $\alpha = 10$ and temperature $T = 1000$ for our experiments. We regularize the projection layer using L2 regularization. Models are optimized using Adam (Kingma and Ba, 2015) with parameters $\beta_1 = 0.9$, $\beta_2 = 0.98$, and $\epsilon = 10^{-8}$ and a learning rate of 0.01. We run all experiments on 1 Nvidia TESLA M40 GPU.
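The projection from per-token hidden states to a sentence energy can be sketched in plain Python as below. This is only an illustration of the averaging structure: the hidden states are assumed to come from the frozen BERT encoder, the ReLU activation between the two MLP layers is an assumption (the text does not name the nonlinearity), and all function names are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def linear(x: Vector, weight: Matrix, bias: Vector) -> List[float]:
    """Affine map: one output per row of `weight`."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def token_score(h: Vector, w1: Matrix, b1: Vector, w2: Vector, b2: float) -> float:
    """2-layer MLP mapping one token's hidden state to a scalar."""
    hidden = [max(v, 0.0) for v in linear(h, w1, b1)]  # ReLU is an assumption
    return linear(hidden, [w2], [b2])[0]


def sentence_energy(
    hidden_states: Sequence[Vector], w1: Matrix, b1: Vector, w2: Vector, b2: float
) -> float:
    """Energy of a sentence: the average of its per-token scalar scores."""
    scores = [token_score(h, w1, b1, w2, b2) for h in hidden_states]
    return sum(scores) / len(scores)
```

With the dimensions given above, `w1` would have 256 rows of 768 weights and `w2` would have 256 weights; only these projection parameters would be trained.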

Joint-EBM
Joint-EBM must assign a score to a pair of sentences from the source and target languages, so to construct the Joint-EBM, similar to Marginal-EBM, we need a Joint-BERT. We feed the sentence pairs from source and target languages jointly to BERT, thus the name Joint-BERT. Since Joint-BERT has not been pre-trained to accept pairs of sentences from two different languages, we fine-tune it for 12 epochs using the input format of [CLS] Source [SEP] Target [SEP] with the pairs of source and target sentences for each translation task. For fine-tuning, we only mask the tokens of the target sentence. For all translation tasks we use the BERT-Base, Multilingual Cased model with 12 encoder layers, 768 hidden state dimension, 12 attention heads and 110M parameters. After fine-tuning Joint-BERT, we follow the same architecture as Marginal-EBM for the Joint-EBM.
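The joint input format and the restriction of masking to target tokens can be sketched as follows. The helper `build_joint_input` is hypothetical and works on already tokenized sentences; how many of the eligible positions are masked per example is not stated in the text and is left out here.

```python
from typing import List, Sequence, Tuple


def build_joint_input(
    source_tokens: Sequence[str], target_tokens: Sequence[str]
) -> Tuple[List[str], List[int]]:
    """Build [CLS] Source [SEP] Target [SEP] and list the maskable positions.

    Only positions inside the target segment are returned as maskable, so
    source tokens and special tokens are never masked during fine-tuning.
    """
    tokens = (
        ["[CLS]"] + list(source_tokens) + ["[SEP]"] + list(target_tokens) + ["[SEP]"]
    )
    target_start = len(source_tokens) + 2  # skip [CLS], the source, and first [SEP]
    maskable = list(range(target_start, target_start + len(target_tokens)))
    return tokens, maskable
```

For example, `build_joint_input(["ich", "bin"], ["i", "am"])` yields a seven-token sequence in which only the positions of "i" and "am" are eligible for masking.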

Methods
As the main baseline, we run beam decoding with a beam size of five over the trained BaseNMT (BaseNMT+Beam). We also use the samples drawn from the BaseNMT and report the BLEU score of the sample with the highest log-probability score on BaseNMT (BaseNMT+Sample). For all methods we use 100 target samples for each source sentence. BaseNMT+LM draws samples from the BaseNMT and uses log P NMT (y|x) + λ log P LM (y) to rank the samples (λ = 0.01 out of the set of {0.001, 0.01, 0.1} results in the best performance).
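The BaseNMT+LM ranking rule can be sketched as below, assuming the NMT and LM log-probabilities of each sample have already been computed. The function name `rerank_with_lm` is hypothetical.

```python
from typing import Sequence


def rerank_with_lm(
    samples: Sequence[str],
    nmt_log_probs: Sequence[float],
    lm_log_probs: Sequence[float],
    lam: float = 0.01,
) -> str:
    """Return the sample maximizing log P_NMT(y|x) + lam * log P_LM(y)."""
    scores = [n + lam * l for n, l in zip(nmt_log_probs, lm_log_probs)]
    best = max(range(len(samples)), key=lambda i: scores[i])
    return samples[best]
```

Setting `lam` to 0 recovers BaseNMT+Sample, which simply picks the sample with the highest NMT log-probability.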
In our BaseNMT+LM baseline, we use a pretrained language model to calculate $\log P_{LM}(y)$. For the {De, Fr, It, Ro, Si, Ne}→En tasks, we use a pretrained Transformer-XL (Dai et al., 2019) transfo-xl-wt103 and for the En→De task we use a pretrained XLM (Lample and Conneau, 2019) xlm-mlm-ende-1024 from Huggingface (Wolf et al., 2019). BaseNMT+MLM is similar to BaseNMT+LM but it uses $\log P_{NMT}(y|x) + \lambda \log P_{MLM}(y)$, where $\log P_{MLM}(y)$ is the average pseudo-log-probability of sample $y$ calculated using BERT. We use the same architecture of BERT as Marginal-EBM, but we fine-tuned BERT for MLM over the target sentences in the training sets for 10 epochs. We tuned $\lambda$ as for BaseNMT+LM.
EBR is our method that uses rank-based training for EBMs. We explore EBR with Marginal-EBM (Marginal-EBR) and Joint-EBM (Conditional-EBR). We also use noise-contrastive estimation to train our Marginal-EBM, similar to Deng et al. (2020), which we refer to as NCE-EBR. Next, we have Shared-EBR, which trains a single Marginal-EBM for the tasks with the same target language. Shared-EBR is only trained on the IWSLT and FLoRes tasks with English as the target. For this method, we first sample a translation task and then sample a batch from that task and follow Algorithm 1 for the training of the Marginal-EBM. Finally, as an upper bound for the best achievable result, we also extract the translations from the samples that are closest to the gold data (based on BLEU score).

Table 1 shows the performance of the described methods for the IWSLT, FLoRes, and WMT translation tasks. BaseNMT+Sample achieves a better score than beam decoding, suggesting that our multinomial sampling supports the modes of the distribution defined by the BaseNMT. Similarly, oracle values are high, indicating that the samples also support the desired distribution. This satisfies the necessary condition for $P_\theta(y|x) \propto P_{NMT}(y|x) \exp(-E_\theta(y, x)/T)$ to be closer to the desired distribution. Re-ranking with a language model using BaseNMT+LM improves over BaseNMT+Sample for De→En, Fr→En, It→En, and En→De, but fails on Ro→En, Si→En, and Ne→En.

Results
However, in all of these tasks, the difference between BaseNMT+Sample and BaseNMT+LM is not substantial. BaseNMT+MLM is consistently better than BaseNMT+LM. The performance of BaseNMT+MLM is attributable to PLL scoring, as the encoder has global information over the sentence. Marginal-EBR performs considerably better than BaseNMT+{Beam, Sample, LM, MLM} and better than NCE-EBR on all tasks except Ne→En, where NCE-EBR outperforms Marginal-EBR. The main advantage of Marginal-EBR over NCE-EBR is the use of only sampled data instead of gold data for training. See Section 4.7 for a detailed discussion.
Shared-EBR significantly improves over Marginal-EBR; in particular, it improves the low-resource task of Si→En by more than 2 BLEU points. For this task, we also show how using more language pairs in training improves performance (Table 2).
Conditional-EBR outperforms Shared-EBR on all tasks. The translation improvement from using EBR on the IWSLT and FLoRes translation tasks is larger than the improvement from using EBR on the WMT tasks. We believe that pre-trained BERT helps low-resource tasks more than large-scale translation tasks.

Effect of Using Gold Data
Noise-contrastive estimation (NCE) trains the energy model using discriminative training to distinguish gold data from the sampled data (Gutmann and Hyvärinen, 2010; Deng et al., 2020). In contrast to NCE-EBR, EBR does not directly use gold data in the training of the EBM, but only exploits it to determine the rank of two points as well as the margin. To show that our approach is effective, we introduce a parameter γ as the percentage of the time that we use gold data as one of the points (for example, $y_h$ in Algorithm 1). Table 3 shows the results for both De→En and Fr→En tasks using Marginal-EBR. As we increase the value of γ, the performance of Marginal-EBR drops. The main reason is that BaseNMT rarely produces the exact correct translation in the sample set, thus learning the ranking with respect to the gold data is not very informative. When γ is zero, the Marginal-EBM learns to re-rank the samples with respect to their distance to the gold data.

Regularized Training
We hypothesize that the performance of EBR improves as we increase the support of the base distribution toward the mode of the true distribution. To test this, we add an entropy regularization term, weighted by β, to the likelihood training of BaseNMT. Entropy regularization improves the diversity of samples, and as a result, the Oracle score increases by 0.67 BLEU points. While BaseNMT benefits by less than 0.1 BLEU points from the regularization, Conditional-EBR improves by 0.3 BLEU points (see Table 4). For this study we explored β from {0.01, 0.1}, and the reported results use β = 0.01, selected based on the validation set. BaseNMT trained with β = 0.1 has an Oracle score of 65.76 on the test set (compared to the Oracle score of 68.21 for β = 0.01), which indicates that stronger regularization reduces the sample quality.
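One common way to write an entropy-regularized likelihood loss is sketched below. The exact objective used in the paper is not recoverable from this text, so the token-level form here (negative log-likelihood minus β times the entropy of the predicted distribution) is an assumption, and the function names are hypothetical.

```python
import math
from typing import Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one predicted token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def regularized_token_loss(
    probs: Sequence[float], gold_index: int, beta: float
) -> float:
    """Assumed form: NLL of the gold token minus beta times the entropy.

    Minimizing this rewards high-entropy predictions, which encourages
    more diverse samples from the trained model.
    """
    nll = -math.log(probs[gold_index])
    return nll - beta * entropy(probs)
```

Under this form, a larger β trades likelihood for diversity, which would be in line with the observation that β = 0.1 lowers the Oracle score relative to β = 0.01.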

Using XY-BERT for Joint-EBM
To explore the effect of a different way of conditioning on the source language, we compare the EBM constructed using the Joint-BERT model with an EBM constructed using the recently introduced XY-BERT (Guo et al., 2020). To construct an EBM from XY-BERT, we remove the output layer and project each hidden state of the final layer to a scalar energy value, similar to how we build an EBM from BERT. We compare these two models on the IWSLT'14 De→En task. For XY-BERT we use German BERT for the encoder and English BERT for the decoder, following Guo et al. (2020). Our Joint-BERT uses Multilingual BERT because we feed both source and target sentences to BERT jointly. Conditional-EBR with XY-BERT achieves a BLEU score of 38.33, which is 0.75 BLEU points higher than Conditional-EBR with Joint-BERT and improves the performance of XY-BERT with mask-predict decoding (Ghazvininejad et al., 2019) by 1.84 BLEU points. We believe that the improvement in Conditional-EBR using XY-BERT is mostly attributable to using specialized BERT models. Moreover, XY-BERT has extra trainable modules, so we could fine-tune XY-BERT on the translation task for 60 epochs, while keeping the rest of the parameters fixed without causing catastrophic forgetting. Joint-BERT, on the other hand, does not have any extra parameters, so we fine-tuned all parameters for only 15 epochs. Further training of Joint-BERT resulted in poor performance. We leave adding extra modules for better fine-tuning of Joint-BERT for future studies.

Maximizing Expected Score
As another comparison, we train our models by directly maximizing the expected BLEU score instead of using rank-based training. We use the log-trick (the log-derivative trick) to calculate the gradient of this objective, and self-normalized importance sampling to draw samples from the energy-based model. We use one sample to approximate the outer expectation and 10 samples to approximate the inner expectation. We train both Marginal-EBM and Joint-EBM by maximizing the expected BLEU score on IWSLT'14 De→En. The former obtains a score of 34.20 BLEU and the latter achieves 34.77 BLEU points. Both models underperform rank-based training.
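The equations for this objective are not preserved in the text, so the following is a sketch of one standard derivation, assuming the objective is the expected BLEU under $P_\theta(y|x) \propto P_{NMT}(y|x)\exp(-E_\theta(y,x)/T)$ for a training pair $(x, y^*)$. The symbol $J(\theta)$ is introduced here for illustration only.

```latex
\begin{align}
J(\theta) &= \mathbb{E}_{y \sim P_\theta(\cdot|x)} \left[ \mathrm{BLEU}(y, y^*) \right], \\
\nabla_\theta J(\theta) &= \mathbb{E}_{y \sim P_\theta(\cdot|x)} \left[ \mathrm{BLEU}(y, y^*) \, \nabla_\theta \log P_\theta(y|x) \right], \\
\nabla_\theta \log P_\theta(y|x) &= -\frac{1}{T} \nabla_\theta E_\theta(y, x)
  + \frac{1}{T} \, \mathbb{E}_{y' \sim P_\theta(\cdot|x)} \left[ \nabla_\theta E_\theta(y', x) \right].
\end{align}
```

The second line is the log-derivative identity, and the third follows from differentiating the log partition function. Under this reading, the outer expectation is over $y$ and the inner one over $y'$, which would be consistent with approximating them with 1 and 10 samples, respectively.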

Inference Time
We compare the inference latency of EBR variations with BaseNMT (Table 5). We use 100 samples for re-ranking using Marginal-EBR, Conditional-EBR with Joint-BERT and Conditional-EBR with XY-BERT (Guo et al., 2020). Inference with Marginal-EBR takes on average about 170 milliseconds per sentence more than inference with BaseNMT, as we have to sample 100 sentences from BaseNMT and evaluate them on the energy model. We evaluate Marginal-EBR only on the target sentences, while we evaluate Conditional-EBR on sequences from both source and target languages, so the input sequence of Conditional-EBR is longer and its latency is higher compared to Marginal-EBR. We also measure the latency of Conditional-EBR when we use the XY-BERT architecture to construct the Joint-EBM. In this case, we have

Analysis
In this section, we study the sentence preference of Marginal-EBR created by the energy ranking.

Qualitative Analysis
Table 6 presents quintessential examples we find after examining 400 examples on the IWSLT'14 De→En test set. It is worth mentioning that examples do not strictly land in only one category. For example, the sentences we show in the 'Rephrase' type will also be counted as a change of pronouns. With this in mind, we compute statistics over the 400 sentences and find that each of 'Pronoun', 'Contraction' and 'Rephrase' appears approximately 30% of the time, while 10% of the sentences change 'Tense'. The other less frequent types are changes of determiners, prepositions and deletion (comparing the MAP decoding of BaseNMT and the output preferred by Marginal-EBR).

BLEU Gains by Length
Besides the qualitative analysis, we are also curious to see whether the improvement is affected by length. Table 7 shows the BLEU scores on the IWSLT'14 test set, which is divided into three bins according to the target length. Shorter sentences have the largest increase in BLEU, and the gain decreases as length increases. We reckon that it is easier for EBR to cover a larger part of the training space for shorter sentences, which is why these sentences see the largest improvement in BLEU.

Random Sentences
In the absence of access to the source sentence, the energy model ranks the outputs purely according to the features of target sentences. We hypothesize that the energy model can differentiate incoherent from coherent sentences, and we test this through the following analysis. We apply two kinds of shuffle to the IWSLT'14 test set targets: (1) global shuffle: tokens in the sentence are randomly shuffled; (2) local shuffle: we first randomly select a token and randomly shuffle the tokens within a local window of three. Then we compute the energy scores of these shuffled sentences as well as the untouched ones. The energy scores are listed in Table 8. (The energy model assigns a lower energy to its preference.) We find that 87% of the time the energy model is able to distinguish the original sentence from a locally shuffled one, and 90.5% of the time from a globally shuffled one. This supports our hypothesis that the energy model is capable of capturing the fluency of generated candidates.
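The two perturbations and the preference statistic can be sketched as follows. The function names are hypothetical, and placing the window of three so that it starts at the randomly selected token is an assumption, since the text does not say how the window is aligned.

```python
import random
from typing import List, Sequence


def global_shuffle(tokens: Sequence[str], rng: random.Random) -> List[str]:
    """Randomly permute all tokens of the sentence."""
    shuffled = list(tokens)
    rng.shuffle(shuffled)
    return shuffled


def local_shuffle(
    tokens: Sequence[str], rng: random.Random, window: int = 3
) -> List[str]:
    """Pick a random token and shuffle the tokens in a local window.

    Assumption: the window starts at the selected token and is truncated
    at the end of the sentence.
    """
    shuffled = list(tokens)
    if len(shuffled) < 2:
        return shuffled
    start = rng.randrange(len(shuffled))
    end = min(start + window, len(shuffled))
    segment = shuffled[start:end]
    rng.shuffle(segment)
    shuffled[start:end] = segment
    return shuffled


def preference_rate(
    original_energies: Sequence[float], shuffled_energies: Sequence[float]
) -> float:
    """Fraction of sentences whose original gets lower energy than its shuffle."""
    wins = sum(1 for o, s in zip(original_energies, shuffled_energies) if o < s)
    return wins / len(original_energies)
```

A shuffle can leave the sentence unchanged by chance (for instance when the window contains a single token), so a practical evaluation may want to redraw in that case.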

Conclusion and Future Work
We introduce energy-based re-ranking (EBR) to improve the performance of autoregressive neural machine translation. Despite its superior performance, EBR suffers from high latency because of its dependency on sampling from an autoregressive model. Directly sampling from the underlying EBM can speed up the inference, which is our future direction in order to benefit from the power of energy-based models for machine translation.