Rewriter-Evaluator Architecture for Neural Machine Translation

A few approaches have been developed to improve neural machine translation (NMT) models with multiple passes of decoding. However, their performance gains are limited because of lacking proper policies to terminate the multi-pass process. To address this issue, we introduce a novel architecture of Rewriter-Evaluator. Translating a source sentence involves multiple rewriting passes. In every pass, a rewriter generates a new translation to improve the past translation. Termination of this multi-pass process is determined by a score of translation quality estimated by an evaluator. We also propose prioritized gradient descent (PGD) to jointly and efficiently train the rewriter and the evaluator. Extensive experiments on three machine translation tasks show that our architecture notably improves the performances of NMT models and significantly outperforms prior methods. An oracle experiment reveals that it can largely reduce performance gaps to the oracle policy. Experiments confirm that the evaluator trained with PGD is more accurate than prior methods in determining proper numbers of rewriting.


Introduction
Encoder-Decoder architecture (Sutskever et al., 2014) has been widely used in natural language generation, especially neural machine translation (NMT) (Bahdanau et al., 2015;Gehring et al., 2017;Vaswani et al., 2017;Zhang et al., 2019;Kitaev et al., 2020). Given a source sentence, an encoder firstly converts it into hidden representations, which are then conditioned by a decoder to produce a target sentence. In analogy to the development of statistical machine translation (SMT) (Och and Ney, 2002;Shen et al., 2004;Zhang and Gildea, 2008), some recent methods in NMT attempt to improve the encoder-decoder architecture with multipass decoding (Xia et al., 2017;Zhang et al., 2018;Geng et al., 2018;Niehues et al., 2016). In these models, more than one translation is generated for a source sentence. Except for the first translation, each of the later translations is conditioned on the previous one. While these methods have achieved promising results, they lack a proper termination poqlicy for this multi-turn process. For instance, Xia et al. (2017); Zhang et al. (2018) adopt a fixed number of decoding passes, which is inflexible and can be sub-optimal. Geng et al. (2018) utilize reinforcement learning (RL) (Sutton et al., 2000) to automatically decide the number of decoding passes. However, RL is known to be unstable due to the high variance in gradient estimation (Boyan and Moore, 1995).
To address this problem, we introduce a novel architecture, Rewriter-Evaluator. This architecture contains a rewriter and an evaluator. The translation process involves multiple passes. Given a source sentence, at every turn, the rewriter generates a new target sequence to improve the translation from the prior pass, and the evaluator measures the translation quality to determine whether to end the iterative rewriting process. Hence, the translation process is continued until a certain condition is met, such as no significant improvement in the measured translation quality. In implementations, the rewriter is a conditional language model (Sutskever et al., 2014) and the evaluator is a text matching model (Wang et al., 2017).
We also propose prioritized gradient descent (PGD) that facilitates training the rewriter and the evaluator both jointly and efficiently. PGD uses a priority queue to store previous translation cases. The queue stores translations with descending order of their scores, computed from the evaluator. The capacity of the queue is limited to be a few times of batch size. Due to its limited size, the queue pops those translations with high scores and only keeps the translations with lower scores. The samples in Target  the queue are combined together with new cases from the training data to train the rewriter.
Rewriter-Evaluator has been applied to improve two mainstream NMT models, RNNSearch (Bahdanau et al., 2015) and Transformer (Vaswani et al., 2017). We have conducted extensive experiments on three translation tasks, NIST Chineseto-English, WMT'18 Chinese-to-English, and WMT'14 English-to-German. The results show that our architecture notably improves the performance of NMT models and significantly outperforms related approaches. We conduct oracle experiment to understand the source of improvements. The oracle can pick the best translation from all the rewrites. Results indicate that the evaluator helps our models achieve the performances close to the oracle, outperforming the methods of fixing the number of rewriting turns. Compared against averaged performances using a fixed number of rewriting iterations, performance gaps to the oracle can be reduced by 80.7% in the case of RNNSearch and 75.8% in the case of Transformer. Quantitatively, we find the evaluator trained with PGD is significantly more accurate in determining the optimal number of rewriting turns. For example, whereas the method in Geng et al. (2018) has 50.2% accuracy in WMT'14, the evaluator achieves 72.5% accuracy on Transformer.

Rewriter-Evaluator
Rewriter-Evaluator consists of iterative processes involving a rewriting process ψ and an evaluation process φ. The process of translating an n-length source sentence x = [x 1 , x 2 , · · · , x n ] is an application of the above processes. Assume we are at the k-th iteration (k ≥ 1). The rewriter ψ gener- given the source sentence x and the past trans- l k−1 ] from the (k − 1)-th turn. l k and l k−1 are the sentence lengths. The evaluator φ estimates the translation quality score q (k) of the new translation z (k) , which is used for determining whether to end the multiturn process. Formally, the k-th pass of a translation process is defined as . (1) Initially, z (0) and q (0) are respectively set as an empty string and −∞.
The above procedure is repeatedly carried out until not much improvement in the estimated quality score can be achieved, i.e., where is a small value tuned on the development set. Alternatively, the procedure is terminated if a certain number of iterations K > 0 is reached. In the former case, we adopt z (k−1) as the final translation. In the latter case, the last translation z (K) is accepted.

Architecture
A general architecture of Rewriter-Evaluator using Encoder-Decoder is illustrated in Fig. 1. The rewriter ψ consists of a source encoder f SE , a target decoder f T E , and a decoder g DEC . The evaluator φ shares encoders with the rewriter and contains an estimator g EST . Assume it is at the k-th pass. Firstly, the source encoder f SE casts the source sentence x into word Algorithm 1: Prioritized Gradient Descent (PGD) Input: rewriter ψ, evaluator φ, training set T , batch size B, and expected iteration number E. Output: well-trained rewriter ψ and well-trained evaluator φ. 1 Initialize an empty priority queue A with the capacity C ← B × E. 2 while Models are not converged do 3 Pop B cases with high quality scores from priority queue A and discard them.

4
Randomly sample a B-sized batch of training cases S from T .  Initialize an empty priority queue D of limited size C.
14 Optimize rewriter ψ with the samples in list F to reduce loss in Eq. (7).

15
Optimize evaluator φ with the samples in list F to reduce loss in Eq. (8).

16
Update priority queue A: A ← D.
where operation [; ] is row-wise vector concatenation. Similarly, the translation z (k−1) from the previous turn k − 1 is encoded as . (4) Then, the decoder g DEC of the rewriter ψ produces a new translation z (k) as Ultimately, the evaluator φ scores the new translation z (k) with the estimator g EST : .
The implementation can be applied to a variety of architectures. The encoders, f SE and f T E , can be any sequence model, such as CNN (Kim, 2014). The decoder g DEC is compatible with any language model (e.g., Transformer). The estimator g EST is a text matching model, e.g., ESIM (Chen et al., 2017). In Sec. 4, we apply this implementation to improve generic NMT models.

Training Criteria
We represent the ground truth target sentence as a (m + 1)-length sequence y = [y 0 , y 1 , · · · , y m ]. The rewriter ψ is trained via teacher forcing. We use o i to denote the probability of the i-th target word, which is the prediction of feeding its prior words [y 0 , y 1 , · · · , y i−1 ] into the decoder g DEC . The training loss for the rewriter is where y 0 = "[SOS]" and y m = "[EOS]", marking the ends of a target sentence. For the evaluator φ, we incur a hinge loss between the translation score of the ground truth y and that of the current translation z (k) as .
At training time, translation z (k) is generated via greedy search, instead of beam search, to reduce training time.

Prioritized Gradient Descent
We present prioritized gradient descent (PGD) to train the proposed architecture. Instead of the random sampling used in stochastic gradient descent (SGD) (Bottou and Bousquet, 2008), PGD uses a priority queue to store previous training cases that receive low scores from the evaluator. Randomly sampled training cases together with those from the priority queue are used for training.

RNN Encoder
Details of PGD are illustrated in Algorithm 1. Initially, we set a priority queue A (1-st line) with a limited size C = B × E. B is the batch size. E, the expected number of rewriting iterations, is set as K 2 . The queue A is ordered with a quality rate in descending order, where the top one corresponds to the highest rate. The quality rate of a certain sample (x, y, z (k) ) is computed as (9) where the weight ρ is controlled by an annealing schedule j j+1 with j being the current training epoch and BLEU (Papineni et al., 2002). The rate r (k) is dominated by BLEU in the first few epochs, and is later dominated by the evaluation score q (k) with an increasing number of epochs. This design is to mitigate the cold start problem when training an evaluator φ. At every training epoch, PGD firstly discards a certain number of previous training samples with high quality rates (3-rd line) from queue A. It then replaces them with newly sampled samples S (4-th to 6-th lines). Every sample (x, y, z (k−1) , r (k−1) ) in queue A is then rewritten into a new translation z (k) by the rewriter. These are scored by the evaluator φ (10-th lines). These new samples are used to respectively train the rewriter ψ and the evaluator φ (14-th to 15-th lines) with Eq. (7) and Eq. (8).
PGD keeps low-quality translations in the queue A for multi-pass rewriting until they are popped out from queue A with high scores from the eval-uator φ. Hence, the evaluator φ is jointly trained with the rewriter to learn discerning the quality of translations from the rewriter ψ, in order to help the rewriter reduce loss in Eq. (7).
PGD uses a large queue (B ×E) to aggregate the past translations and newly sampled cases. Computationally, this is more efficient than explicit B times of rewriting to obtain samples. This requires extra memory space in exchange for lowing training time. In Sec. 5.7, we will show that the additional increase of training time by PGD is less than 20%, which is tolerable.

Applications
Following Sec. 2.1, we use Rewriter-Evaluator to improve RNNSearch and Transformer.
RNNSearch w/ Rewriter-Evaluator. The improved RNNSearch is illustrated in Fig. 2. The two encoders (i.e., f SE and f T E ) and the decoder g DEC are GRU (Chung et al., 2014). We omit computation details of these modules and follow their settings in Bahdanau et al. (2015). Note that, at every decoding step, the hidden state of decoder is attended to not only h i , 1 ≤ i ≤ n but also p We apply co-attention mechanism (Parikh et al., 2016) to model the estimator f EST . Firstly, we capture the semantic alignment between the source sentence x and the translation z (k−1) as Then, we use average pooling to extract features and compute the quality score: where ⊕ is column-wise vector concatenation.
Transformer w/ Rewriter-Evaluator. The Transformer (Vaswani et al., 2017) is modified to an architecture in Fig. 3. The input to the encoder contains a source sentence x, a special symbol "ALIGN", and the past translation z (k−1) : where operation denotes the concatenation of two sequences.

Transformer Decoder Dot Product
The following mask matrix is applied to every layer in the encoder: In this way, the words in x can't attend to those in z (k−1) and vice versa. "ALIGN" can attend to the words both in x and z (k−1) . This design is to avoid cross-sentence attention in encoder layers.
In earlier studies, we find it slightly improves the performances of models. We denote the representation for "ALIGN" in the final encoder layer as h ALIGN . The estimator f EST obtains the quality score as in which v is a learnable vector.

Experiments
We have conducted extensive experiments on three machine translation tasks: NIST Chinese-to-English (Zh→En), WMT'18 Chinese-to-English, and WMT'14 English-to-German (En→De). The results show that Rewriter-Evaluator significantly improves the performances of NMT models and notably outperforms prior post-editing methods. Oracle experiment verifies the effectiveness of the evaluator. Termination accuracy analysis shows our evaluator is much more accurate than prior methods in determining the optimal number of rewriting turns. We also perform ablation studies to explore the effects of some components. We train all the models with 150k steps for NIST Zh→En, 300k steps for WMT'18 Zh→En, and 300k steps for WMT'14 En→De. We select the model that performs the best on validations and report their performances on test sets. Using multi-bleu.perl 3 , we measure case-insensitive BLEU scores and case-sensitive ones for NIST Zh→En and WMT'14 En→De, respectively. For WMT'18 Zh→En, we use the case-sensitive BLEU scores calculated by mteval-v13a.pl 4 . The improvements of the proposed models over the baselines are statistically significant with a reject probability smaller than 0.05 (Koehn, 2004).

Experimental Setup
For RNNSearch, the dimensions of word embeddings and hidden layers are both 600. Encoder has 3 layers and decoder has 2 layers. Dropout rate is set to 0.2. For Transformer, we follow the setting of Transformer-Base in Vaswani et al. (2017). Both models use beam size of 4 and the maximum number of training tokens at every step is 4096. We use Adam (Kingma and Ba, 2014) for optimization. In all the experiments, the proposed models run on NVIDIA Tesla V100 GPUs. For Rewriter-Evaluator, the maximum number of rewriting iterations K is 6 and termination threshold is 0.05. Hyper-parameters are obtained by grid search, except for the Transformer backbone.

Results on NIST Chinese-to-English
We adopt the following related baselines: 1) Deliberation Networks (Xia et al., 2017) adopts a second decoder to polish the raw sequence produced by the first-pass decoder; 2) ABD-NMT (Zhang et al., 2018) uses a backward decoder to generate a translation and a forward decoder to refine it with attention mechanism; 3) Adaptive Multi-pass Decoder (Geng et al., 2018) utilizes RL to model the iterative rewriting process. Table 1 shows the results of the proposed models and the baselines on NIST. Baseline BLEU scores are from Geng et al. (2018). There are three observations. Firstly, Rewriter-Evaluator significantly improves the translation quality of NMT models. The averaged BLEU score of RNNSearch is raised by 3.1% and that of Transformer is increased by 1.05%. Secondly, the proposed architecture notably outperforms prior multi-pass decoding methods. The performance of RNNSearch w/ Rewriter-Evaluator surpasses those of Deliberation Network by 2.46%, ABD-NMT by 2.06%, and Adaptive Multi-pass Decoder by 1.72%. Because all of these systems use the same backbone of RNN-based NMT models, these results validate that Rewriter-Evaluator is superior to other alternative methods. Lastly, the proposed architecture can improve Transformer backbone by 1.05% on average, and the improvements are consistently observed on tasks from MT03 to MT06.

Results on WMT Tasks
To further confirm the effectiveness of the proposed architecture, we make additional comparisons on WMT'14 En→De and WMT'18 Zh→En. The results are demonstrated in Table 2. Because the above methods don't have results on the two datasets, we re-implement Adaptive Multi-pass Decoding for comparisons.
These results are consistent with the observations in Sec. 5.2. We can see that the new architecture can improve BLEU scores on both RNNSearch and Transformer backbones. For example, the improvements on RNNSearch backbone are 2.13% on WMT'14 and 2.24% on WMT'18. On Transformer backbone, scores are raised by 1.38% on WMT'14 and 1.43% on WMT'18 . Furthermore, RNNSearch w/ Rewriter-Evaluator outperforms Adaptive Multi-pass Decoder by 1.31% and 1.32%, respectively, on the two tasks. Interestingly, the proposed architecture on RNNSearch backbone even surpasses Transformer on these two datasets. For example, the BLEU score on WMT'14 increases from 27.53% to 27.86%.

Oracle Experiment
We conduct oracle experiments on the test set of WMT'14 En→De to understand potential improvements of our architecture. An oracle selects the iteration that the corresponding rewrite has the highest BLEU score. Its BLEU scores are shown on the  red dashed lines in Fig. 4. The numbers on the green vertical bars are the BLEU scores of adopting a fixed number of rewriting iterations. Their averaged number is shown on the dashed blue line. BLEU score from using our evaluator is shown on the solid dark-blue line. Results show that the evaluator, with 27.86% BLEU score and 28.91 BLEU score, is much better than the strategies of using a fixed number of rewriting turns. The gaps between oracle and the averaged performance by RNNSearch and Transformer with fixed iterations are 1.92% and 1.90%. Using the evaluator, these gaps are reduced relatively by 80.7% for RNNSearch and 75.8% for Transformer, respectively, down to 0.37% and 0.46%. These results show that the evaluator is able to learn an appropriate termination policy, approximating the performances of oracle policy.

Termination Accuracy Analysis
We define a metric, percentage of accurate terminations (PAT), to measure how precise a termination policy can be. PAT is computed as where δ is the indicator function that outputs 1 if its argument is true and 0 otherwise. For each pair (x, y) in the test set U , w q (x, y) is the turn index k with the highest quality score max k q (k) and w b (x, y) is the one with the highest BLEU score  max k BLEU(z (k) , y). The translations z (k) , 1 ≤ k ≤ K and their scores q (k) , 1 ≤ k ≤ K are obtained using Eq. 5 and Eq. 6. For fair comparisons, the maximum number of rewritings is set to 6 for both Rewriter-Evaluator and Adaptive Multi-pass Decoder (Geng et al., 2018). Results in Table 3 show that PAT scores from Rewriter-Evaluator are much higher than those of Adaptive Multi-pass Decoder. For instance, RNNSearch w/ Rewriter-Evaluator surpasses Adaptive Multi-pass Decoder by 40.96% on WMT'14 and 10.35% on WMT'18. Transformer 5h23m 14m11s w/ Rewriter-Evaluator 6h36m 52m02s from 42.25% to 42.79% with the same maximum iteration number of K.

Ablation Studies
Maximum Number of Iterations. Increasing the maximum number of turns K generally improves the BLEU scores. For instance, on NIST, K = 8 outperforms K = 2 by 1.0%, K = 4 by 0.46%, and K = 6 by 0.04%. However, described in Sec. 5.7, large K (e.g., 8) can increase inference time cost. Moreover, additional gains in performance from K = 8 is small. We therefore set K = 6 by default.

Running Time Comparisons
While achieving improved translation quality, the models are trained with multiple passes of translation. Therefore, a natural question is on the increase of training time and test time. We report results on 4 GPUs with the maximum rewriting turns K = 6 and the beam size set to 8. Results on WMT'14 are listed in Table 5.
It shows that Rewriter-Evaluator increases the test time by approximately 4 times, because of multiple passes of decoding. However, training time is only relatively increased by 15% and 18%, respectively on RNNSearch and Transformer, due to the large priority queue used in PGD to store previous translation cases.

Related Work
Multi-pass decoding has been well studied in statistical machine translation (Brown et al., 1993;Koehn et al., 2003Koehn et al., , 2007Och and Ney, 2004;Chiang, 2005;Dyer et al., 2013). Och (2003); Och and Ney (2002) propose training models with minimum error rate criterion on lattices from first-pass decoder. Marie and Max (2015) introduce an iterative method to refine search space generated from simple feature with additional information from more complex feature. Shen et al. (2004) investigate reranking of hypothesis using neural models trained with discriminative criterion. Neubig et al. (2015) propose to reconfirm effectiveness of reranking. Chen et al. (2008) present a regeneration of search space from techniques such as n-gram expansion. These approaches are however applied to shallow models such as log-linear models (Och and Ney, 2002).
Our work is closely related to recent efforts in multi-pass decoding on NMT. In these recent works (Xia et al., 2017;Zhang et al., 2018;Geng et al., 2018), the models generate multiple target sentences for a source sentence and, except for the first one, each of them is based on the sentence generated in the previous turn. For example, Xia et al. (2017) propose Deliberation Networks that uses a second decoder to polish the raw sequence produced by the first-pass decoder. While these methods have achieved promising results, they lack a proper termination policy for the multi-pass translation process. Zhang et al. (2018) adopt a predefined number of decoding passes, which is not flexible. Geng et al. (2018) incorporate post-editing mechanism into NMT model via RL. However, RL can be unstable for training because of the high variance in gradient estimation. The lack of a proper termination policy results in premature terminations or over-translated sentences, which can largely limit the performance gains of these methods.

Conclusion
This paper has introduced a novel architecture, Rewriter-Evaluator, that achieves a proper termination policy for multi-pass decoding in NMT. At every translation pass, given the source sentence and its past translation, a rewriter generates a new translation, aiming at making further performance improvements over the past translations. An evaluator estimates the translation quality to determine whether to complete this iterative rewriting process. We also propose PGD that facilitates training the rewriter and the evaluator both jointly and efficiently. We have applied Rewriter-Evaluator to improve mainstream NMT models. Extensive experiments have been conducted on three translation tasks, NIST Zh→En, WMT'18 Zh→En, and WMT'14 En→De, showing that our architecture notably improves the results of NMT models and significantly outperforms other related methods. An oracle experiment and a termination accuracy analysis show that the performance gains can be attributed to the improvements in completing the rewriting process at proper iterations.