Guiding Teacher Forcing with Seer Forcing for Neural Machine Translation

Although teacher forcing has become the main training paradigm for neural machine translation, it usually makes predictions conditioned only on past information and hence lacks global planning for the future. To address this problem, we introduce another decoder, called the seer decoder, into the encoder-decoder framework during training, which involves future information in target predictions. Meanwhile, we force the conventional decoder to simulate the behaviors of the seer decoder via knowledge distillation. In this way, at test time the conventional decoder can perform like the seer decoder without the seer decoder being present. Experiment results on Chinese-English, English-German and English-Romanian translation tasks show that our method can significantly outperform competitive baselines and achieves greater improvements on bigger data sets. Besides, the experiments also show that knowledge distillation is the best way to transfer knowledge from the seer decoder to the conventional decoder, compared to adversarial learning and L_2 regularization.


Introduction
Neural machine translation (NMT) (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Bahdanau et al., 2014; Gehring et al., 2017; Vaswani et al., 2017) has achieved great success and is drawing growing attention recently. (The code is available at https://github.com/ictnlp/SeerForcingNMT.) Most NMT models are under the attention-based encoder-decoder framework, which assumes there is a common semantic space between the source and target languages. The encoder encodes the source sentence into the common space to get its meaning, and the decoder projects the source meaning into the target space to generate the corresponding target words. Whenever generating a target word at a time step, the decoder needs to retrieve the attended source information and then decode it into a target word. The underlying principle which makes the framework work is that the information held by the source sentence and its target counterpart is equivalent. Thus the translation procedure can be considered as decomposing source information into different pieces and then converting each piece to a proper target word according to the bilingual context. When all the information encoded in the source sentence has been thoroughly processed, the whole translation has been generated.
Neural machine translation models are usually trained via maximum likelihood estimation (MLE) (Johansen and Juselius, 1990) and the operational form is known as teacher forcing (Williams and Zipser, 1989). The teacher forcing strategy performs one-step-ahead predictions with the past ground truth words fed as context and forces the distribution of the next prediction to approach a 0-1 distribution where the probability of the next ground truth word corresponds to 1 and all others to 0. In this way, the predicted sequence is trained to be close to the ground truth sequence. From the perspective of information division, the function of teacher forcing is to teach the translation model how to segment source information and derive the ground truth word from the source information at a maximum probability.
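The teacher-forcing objective described above is a cross entropy between one-step-ahead predictions and the one-hot (0-1) distribution of the next ground truth word. The following minimal NumPy sketch illustrates this; the function and variable names are our own and not from the paper, and a real NMT system would compute the logits with a decoder conditioned on the past ground truth words.

```python
import numpy as np

def teacher_forcing_loss(logits, gold_ids):
    """Cross entropy of one-step-ahead predictions against the 0-1
    (one-hot) distribution of the next ground truth word.

    logits:   (I, V) scores for each target position, computed with the
              past ground truth words fed as context
    gold_ids: (I,)   indices of the ground truth words
    """
    # softmax over the target vocabulary
    z = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    # -log p(y*_i | y*_{<i}, x), summed over target positions
    return -np.log(probs[np.arange(len(gold_ids)), gold_ids]).sum()

# Toy example: two target positions, vocabulary of size 3.
logits = np.array([[2.0, 0.1, 0.1],
                   [0.1, 3.0, 0.1]])
loss = teacher_forcing_loss(logits, np.array([0, 1]))
```

Sharper logits on the ground truth words drive the loss toward zero, which is exactly the "approach a 0-1 distribution" behavior described above.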
However, teacher forcing can only provide up-to-now ground truth words for one-step-ahead predictions and hence lacks global planning for the future. This results in local optimization, especially when the next prediction is highly related to the future. Besides, as the translation grows, previous prediction errors accumulate and affect later predictions (Zhang et al., 2019c). This is an important reason why NMT models cannot always reproduce the ground truth sequence during training. Therefore, it is more feasible to achieve global optimization by getting to know the future ground truth words. This can lead to better cross-attention to the source sentence and thus better information division. Unfortunately, the ground truth can only be obtained during training and we cannot perform inference with future ground truth at test time.
To address this problem, we introduce an additional seer decoder into the encoder-decoder framework to integrate future information. During training, the seer decoder is used to guide the behaviors of the conventional decoder, while at test the translation model performs inference with only the conventional decoder, without introducing any extra parameters or computation cost. Specifically, the conventional decoder only has past information participating in the next prediction, while the seer decoder has both the past and future ground truth words engaged in the next prediction. Both decoders are trained to generate the ground truth via MLE, and meanwhile the conventional decoder is forced to simulate the behaviors of the seer decoder via knowledge distillation (Buciluǎ et al., 2006; Hinton et al., 2015). In this way, at test the conventional decoder can perform like the seer decoder as if it knew the future translation.
We conducted experiments on two small data sets (Chinese-English and English-Romanian) and two big data sets (Chinese-English and English-German), and the experiment results show that our method can outperform strong baselines on all the data sets. In addition, we also compared different mechanisms of transferring knowledge and found that knowledge distillation is more effective than adversarial learning and L_2 regularization. To the best of our knowledge, this paper is the first to explore the effects of the three mechanisms simultaneously in machine translation.
The Model

We introduce our method on the basis of Transformer, which is under the encoder-decoder framework (Vaswani et al., 2017). Our model consists of three components: the encoder, the conventional decoder and the seer decoder. The architecture is shown in Figure 1. The encoder and the conventional decoder work in the same way as the corresponding components of Transformer do. The seer decoder integrates future ground truth information into its self-attention representation and calculates cross-attention over the source hidden states with the self-attention representation as the query. During training, the encoder is shared by the two decoders and both decoders perform predictions to generate the ground truth. The behaviors of the conventional decoder are guided by the seer decoder via knowledge distillation. If the conventional decoder can predict a distribution similar to that of the seer decoder, we consider that the conventional decoder performs like the seer decoder. Then we can use only the conventional decoder at test.
The details of the encoder and the conventional decoder can be found in Vaswani et al. (2017). Assume the input sequence is x = (x_1, ..., x_J), the ground truth sequence is y* = (y*_1, ..., y*_I) and the generated translation is y = (y_1, ..., y_I). We describe the seer decoder and the training procedure in what follows.

The Seer Decoder
Although we feed the future ground truth words to the seer decoder, we do not tell it the next ground truth word to be generated, in case it only learns a copy operation rather than how to derive a word. Considering efficiency, the seer decoder does not integrate the past and future ground truth information with a single decoder, but with two separate subdecoders. As a result, the seer decoder consists of three components: the past subdecoder, the future subdecoder and the fusion layer. The architecture of the seer decoder is given in Figure 2. The past and future subdecoders are employed to decode the past and future ground truth information into hidden states respectively, and the fusion layer is used to fuse the outputs of the past and future subdecoders and calculate the final hidden state for the next prediction.
The past subdecoder is composed of N-1 layers and each layer has three sublayers: the multi-head attention sublayer, the cross-attention sublayer and the feed-forward network (FFN) sublayer, the same as in Transformer. The multi-head attention sublayer accepts the whole ground truth sequence as input and applies a mask matrix M_p to make sure only the past ground truth words participate in self-attention. Specifically, to generate the i-th target word, the corresponding mask vector in the mask matrix M_p is set to mask the words y*_i, y*_{i+1}, ..., y*_I. Then, after the cross-attention sublayer and the FFN sublayer, the past subdecoder outputs a sequence of past hidden states, the packed matrix of which is denoted as H_p.
The future subdecoder has the same structure as the past subdecoder except for the mask matrix. The future subdecoder also takes the whole ground truth sequence as input but employs a different mask matrix M_f to retain only the future ground truth information. To generate the i-th target word, the corresponding mask vector in M_f masks the words y*_1, ..., y*_{i-1}, y*_i. The packed matrix of the future hidden states generated by the future subdecoder is denoted as H_f.
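The two mask matrices can be built directly from the position indices. The sketch below is our own illustration (not the paper's code): it constructs boolean "may attend" matrices matching the description above, where for position i the past mask hides y*_i, ..., y*_I and the future mask hides y*_1, ..., y*_i. A real Transformer implementation would typically convert these into additive masks of -inf on the attention logits.

```python
import numpy as np

def past_future_masks(I):
    """Boolean attention masks for a target length I (True = may attend).

    M_p: for predicting word i (0-indexed row i), only the strictly
         earlier words are visible (row i masks positions i, ..., I-1).
    M_f: for predicting word i, only the strictly later words are
         visible (row i masks positions 0, ..., i).
    """
    idx = np.arange(I)
    M_p = idx[None, :] < idx[:, None]   # strictly lower-triangular
    M_f = idx[None, :] > idx[:, None]   # strictly upper-triangular
    return M_p, M_f

M_p, M_f = past_future_masks(4)
```

Note that neither mask lets position i see the word y*_i itself, which is the paper's safeguard against the seer decoder learning a mere copy operation.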
The fusion layer is composed of four sublayers: the multi-head attention sublayer, the linear sublayer, the cross-attention sublayer and the FFN sublayer. Except for the linear sublayer, the other three sublayers work in the same way as in Transformer. The multi-head attention sublayer encodes the outputs of the past and future subdecoders separately with the mask matrices M_p and M_f, and the packed matrices of their outputs are denoted as H_p and H_f respectively. Then we reverse the order of the vectors in H_f to get H'_f, so that the same index in H_p and H'_f corresponds to the past and future representations needed for the same prediction. Assume H_f = [h_f_1; h_f_2; ...; h_f_I]; then its reversed matrix is H'_f = [h_f_I; ...; h_f_2; h_f_1]. The linear sublayer fuses H_p and H'_f via a linear transformation as

A = W_p H_p + W_f H'_f,    (1)

where W_p and W_f are learned weight matrices. Now each representation in the matrix A incorporates the past and future information for its corresponding prediction. Then, after the cross-attention sublayer over the outputs of the encoder and the FFN sublayer, we get the target hidden states produced by the seer decoder as S^s = [s^s_1; ...; s^s_I]^T. The probability of generating the target word y_i is

p_s(y_i | y*_{>i}, y*_{<i}, x) = softmax(W_o s^s_i),    (2)

where W_o is the output projection matrix. Note that the past and future subdecoders share the same set of parameters, and the same linear transformation matrix W_o is applied to the outputs of both the conventional and seer decoders.
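The index-alignment trick and the linear fusion above can be sketched as follows. This is a toy illustration with random values standing in for the subdecoder outputs and the learned matrices W_p and W_f; the point is the reversal of H_f so that row i of both matrices serves the same prediction.

```python
import numpy as np

rng = np.random.default_rng(0)
I, d = 5, 8                      # target length, hidden size

H_p = rng.normal(size=(I, d))    # past-subdecoder states  [h_p_1; ...; h_p_I]
H_f = rng.normal(size=(I, d))    # future-subdecoder states [h_f_1; ...; h_f_I]

# Reverse the future states so that index i in both matrices refers to
# the past/future context of the *same* prediction.
H_f_rev = H_f[::-1]

# Linear fusion of the paper's Equation 1: A = W_p H_p + W_f H'_f.
# W_p and W_f are learned parameters; here they are random stand-ins.
W_p = rng.normal(size=(d, d))
W_f = rng.normal(size=(d, d))
A = H_p @ W_p + H_f_rev @ W_f    # (I, d) fused representations
```

Each row of A then serves as the query for cross-attention over the encoder outputs, exactly one row per target prediction.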

Training
In our method, only the conventional decoder is employed at test and the seer decoder is only used to guide the conventional decoder during training. Given a sentence pair (x, y*) in the training set, the conventional decoder and the seer decoder predict a distribution for target position i as p_c(y_i | y*_{<i}, x) and p_s(y_i | y*_{>i}, y*_{<i}, x), respectively. Both decoders are trained by comparing their predicted distributions with the 0-1 distribution of the ground truth word and minimizing the cross entropy, that is, by maximizing the likelihood of the corresponding ground truth word. As the two decoders involve different information for the next prediction, we call the two training strategies teacher forcing and seer forcing, respectively. The cross-entropy loss for the conventional decoder is

L_t = - Σ_{k=1}^{K} Σ_{i=1}^{I_k} log p_c(y*_i | y*_{<i}, x),    (3)

and the cross-entropy loss for the seer decoder is

L_s = - Σ_{k=1}^{K} Σ_{i=1}^{I_k} log p_s(y*_i | y*_{>i}, y*_{<i}, x),    (4)

where K is the size of the training set and I_k is the length of the k-th target sentence.
The conventional decoder is further trained to get close to the distribution of the seer decoder via knowledge distillation. In knowledge distillation, the conventional decoder (the student) has to not only match the one-hot ground truth word but also fit the distribution over the target vocabulary V produced by the seer decoder (the teacher). The knowledge distillation loss can be formalized as

L_kd = - Σ_{k=1}^{K} Σ_{i=1}^{I_k} Σ_{j=1}^{|V|} p_s(y_i = v_j | y*_{>i}, y*_{<i}, x) log p_c(y_i = v_j | y*_{<i}, x),    (5)

where |V| is the size of the target vocabulary.
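The distillation term is a cross entropy between the teacher's soft distribution and the student's distribution, summed over positions and the vocabulary. A minimal NumPy sketch (our own naming, not the paper's code):

```python
import numpy as np

def softmax(x):
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits):
    """Cross entropy between the seer (teacher) distribution and the
    conventional (student) distribution, summed over positions and
    the target vocabulary.

    Both inputs: (I, V) logits for I target positions.
    """
    p_s = softmax(teacher_logits)        # seer decoder: soft targets
    log_p_c = np.log(softmax(student_logits))
    return -(p_s * log_p_c).sum()
```

The loss is minimized (down to the teacher's entropy) when the student distribution matches the teacher distribution, unlike the one-hot terms L_t and L_s, which carry no information about the relative plausibility of non-gold words.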
The final training loss is

L = L_t + L_s + λ L_kd,    (6)

where λ is a hyper-parameter weighting the distillation term. Different from conventional knowledge distillation, which first trains the teacher via cross entropy against the ground truth and then fixes the teacher and only trains the student, we train all the parameters from scratch, but we still follow the above rule to keep the teacher (i.e., the seer decoder) unchanged in the process of distillation. To do this, we do not update the parameters of the seer decoder through the loss L_kd; that is, we only back-propagate gradients to the seer decoder through L_s, but not through L_kd.
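In an autograd framework this stop-gradient is typically a `detach()`/`stop_gradient` on the teacher distribution inside L_kd. The NumPy sketch below mimics the effect by treating the teacher distribution as a constant and forming the gradient only for the student logits (for softmax cross entropy against a fixed target q, the gradient with respect to the logits is softmax(s) - q); the variable names are ours.

```python
import numpy as np

def softmax(x):
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Student and teacher logits for one target position.
s = np.array([0.5, 0.2, -0.1])
t = np.array([2.0, 0.0, 0.0])

p_teacher = softmax(t)   # "detached": treated as a constant target

# Gradient of L_kd = -sum(p_teacher * log softmax(s)) with respect to
# the student logits; no gradient is formed with respect to t, which
# mirrors blocking L_kd's gradient into the seer decoder.
grad_student = softmax(s) - p_teacher
grad_teacher = np.zeros_like(t)   # blocked by design
```

The seer decoder therefore only receives learning signal from L_s, so it remains a stable target for the student throughout training.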

Related Work
Reinforcement-learning-based methods also encode future information in the rewards to supervise fine-tuning of the translation model. The rewards are worked out either by sampling future translations with the REINFORCE algorithm (Williams, 1992; Yu et al., 2017; Yang et al., 2018; Shao et al., 2019), or by directly calculating a value with the actor-critic algorithm (Bahdanau et al., 2016; Li et al., 2017). This set of methods only gives weak supervision to the NMT model through rewards and suffers from unstable training. In contrast, Shao et al. (2018) propose to train autoregressive NMT with the probabilistic n-gram based GLEU (Wu et al., 2016) and Shao et al. (2020) propose to minimize the bag-of-ngrams difference for non-autoregressive NMT, so that the two methods can abandon reinforcement learning and perform training directly by gradient descent.
Another set of methods introduces future information into inference with an additional pass of decoding or extra components at test. Niehues et al. (2016), Xia et al. (2017), Hassan et al. (2018) and Zhang et al. (2018) proposed two-pass decoding algorithms which first generate a draft translation and then generate the final translation referring to the draft. Geng et al. (2018) extend this line of methods by performing adaptive multi-pass decoding where the number of decoding passes is determined by a policy network. Liu et al. (2016a), Liu et al. (2016b), Hoang et al. (2017), Zhang et al. (2019d) and He et al. (2019) perform bidirectional decoding simultaneously, where the two decoders correlate with each other via an agreement term or a regularization term in the loss. Zhou et al. (2019a), Zhou et al. (2019b) and Zhang et al. (2019b) also maintain a forward decoder and a backward decoder which decode simultaneously, but the decoders interact with each other when making predictions. Zhang et al. (2019a) introduce a future-aware vector at test which is learned via the knowledge distillation framework during training. The difference between this set of methods and ours is that our method does not require any extra cost at test and is easy to use.

There are some other works which integrate future information during training while performing only one-pass decoding. Serdyuk et al. (2018) introduce a twin network to perform bidirectional decoding simultaneously during training and force the hidden states generated by the two decoders to be consistent; at inference it can use only the forward decoder. But in this method the two decoders act as counterparts to each other and neither plays the role of a teacher, which means it can only be trained via L_2 regularization, not knowledge distillation, which our experiments show to be more effective than L_2 regularization. Feng et al. (2020) introduce an evaluation module to give each translation a more reasonable evaluation when it cannot match the ground truth. The evaluation is conducted from the perspectives of fluency and faithfulness, which both need the participation of past and future information. The difference from the method proposed in this paper is that their method uses self-generated translations as past information and does not train with knowledge distillation. Some researchers introduce future information from another perspective. Zhang et al. (2020b) propose to employ future source information to guide simultaneous machine translation with knowledge distillation, so that the incompleteness of the source can be mitigated. Zheng et al. (2018) and Zheng et al. (2019) propose to model past and future information for the source to help the decoder focus on untranslated source information.

Data Preparation
We conducted experiments on two small data sets and two big data sets.
Small Data Sets

Chinese→English The training set consists of about 1.25M sentence pairs from LDC corpora, with 27.9M Chinese words and 34.5M English words respectively. We used MT02 for validation and MT03, MT04, MT05, MT06 and MT08 for test. We tokenized and lowercased the English sentences using the Moses scripts, and segmented the Chinese sentences with the Stanford Segmenter. Both sides were further segmented into subword units using Byte-Pair Encoding (BPE) (Sennrich et al., 2016) with 30K merge operations. A 32K Chinese vocabulary and a 29K English vocabulary were built for the two sides.
English→Romanian We used the preprocessed version of the WMT16 En-Ro dataset released by Lee et al. (2018), which includes 0.6M sentence pairs. We used news-dev 2016 for validation and newstest 2016 for test. The two languages share a 35K joint vocabulary generated with 40K BPE merge operations on the combined data.
Big Data Sets

Chinese→English The training data is from the WMT 2017 Zh-En translation task and contains 20.18M sentence pairs after deleting duplicates. Newsdev2017 was used as the development set and newstest2017 as the test set. To avoid the effects of translationese (Graham et al., 2019), we also tested the methods on the newstest2019 test set. We tokenized and truecased the English sentences with the Moses scripts. For the Chinese data, we performed word segmentation using the Stanford Segmenter. BPE with 32K merge operations was applied to each side of the training data separately, and then we filtered out the sentences longer than 128 subwords. A 44K Chinese vocabulary and a 33K English vocabulary were built based on the corresponding data.
English→German The training data is from WMT2016 and consists of about 4.5M sentence pairs with 118M English words and 111M German words. Newstest2014 was used as the development set and newstest2016 and newstest2019 were used as the test sets. The two languages share a 32K joint vocabulary generated with 30K BPE merge operations on the combined data.

Systems

TRANSFORMER We used an open-source toolkit called Fairseq-py released by Facebook (Ott et al., 2019), which was implemented strictly following Vaswani et al. (2017).
RL-NMT We trained Transformer under the reinforcement learning framework using the REINFORCE algorithm (Williams, 1992) with BLEU as the reward. The implementation details for the RL part are the same as in Yang et al. (2018).
ABDNMT Our implementation of Zhang et al. (2018) based on Transformer.

TWINNET Our implementation of Serdyuk et al. (2018) based on Transformer. The weight of the L_2 loss was 0.2.

EVANMT Our implementation of Feng et al. (2020).

SEER+L_2 Seer forcing with L_2 regularization. Similar to TWINNET, we set

L_2 = Σ_i ||s^t_i - g(s^s_i)||^2,

where g is a linear transformation. We first pretrained the two decoders together only with L = L_t + L_s, then trained them with the loss L = L_t + L_s + α L_2, where α = 0.2 as well. Please note that the L_2 loss did not update the seer decoder or the encoder, so that the conventional decoder would approach the seer decoder, following Serdyuk et al. (2018).
SEER+AL Seer forcing with adversarial learning. A discriminator is employed to distinguish the hidden state sequences generated by the conventional decoder and the seer decoder. The discriminator is based on a CNN, implemented according to Gu et al. (2019). The translation model and the discriminator are trained jointly via a gradient reversal layer, just like our method. The loss is L = L_t + L_s + α L_d, where L_d is the loss of the discriminator, and α = 0.3 on the EN→RO data set and α = 0.2 on the other data sets.

Our Method Implemented based on Fairseq-py. The weight λ in Equation 6 is set to 0.25 for the small Chinese→English data set and to 0.5 for the other data sets.
All the Transformer-based systems have the same configuration as the base model described in Vaswani et al. (2017), except that the dropout rate is 0.3. Translation quality was evaluated with BLEU (Papineni et al., 2002) with n = 4 using the SacreBLEU tool (Post, 2018), with case-insensitive BLEU for the small data sets and case-sensitive BLEU for the big data sets.

Main Results
We compare our method with other methods that can make global planning, including the reinforcement-learning-based method (RL-NMT), the two-pass decoding method (ABDNMT), the twin network which matches past and future information (TWINNET) and the NMT model with an evaluation module assessing fluency and faithfulness (EVANMT). In addition, we also explore learning mechanisms which can transfer knowledge from the seer decoder to the conventional decoder, including L_2 regularization (SEER+L_2), adversarial learning (SEER+AL) and knowledge distillation (our method).
(The SacreBLEU signature is BLEU+case.mixed+numrefs.1+smooth.exp+tok.13a+version.1.3.6.) We report results together with training time on the small and big data sets in Table 1 and Table 2, respectively. As for the different methods: on the small data sets, RL-NMT can only get small improvements over Transformer, which is in line with the results reported in Wu et al. (2018), and ABDNMT cannot get consistent improvements over Transformer, with an obvious difference on the EN→RO data set and a small difference on the CN→EN data set. TWINNET can get BLEU scores comparable to our method on the small data sets but mostly negative differences on the big data sets. EVANMT can achieve consistent improvements, and greater improvements on the EN→DE data set. For the learning mechanisms, knowledge distillation shows consistent superiority over L_2 regularization and adversarial learning, which is especially remarkable on the big data sets. Adversarial learning can bring improvements over Transformer on all the data sets, while L_2 regularization acts unstably on the big data sets. In summary, our method proved to be effective not only in terms of the architecture but also in terms of the learning mechanism.

Table 3: BLEU scores of teacher forcing and seer forcing with and without cross-attention on NIST CN→EN translation. CD and SD denote the conventional decoder and the seer decoder, respectively. CA represents cross-attention.

The Superiority of the Seer Decoder
To use seer forcing to guide teacher forcing, it should be ensured that the seer decoder can outperform the conventional decoder.To verify this, we trained the two decoders together with the loss L = L t + L s without knowledge distillation.
Then we evaluated their performance on the small Chinese-English translation task as follows. Both decoders are fed with ground truth words as context at test so that they can perform inference in the same way as in training: the conventional decoder uses the past ground truth words as context, and the seer decoder employs the past and future ground truth words as context in the past and future subdecoders.
Besides translation performance, we also check the superiority of the seer decoder in target language modeling. We do this by dropping cross-attention so that the decoders can only generate translations based on the target language model. In this way, the translation performance without cross-attention demonstrates the ability of the two decoders in target language modeling. We used the first reference of the test set as the ground truth and calculated BLEU scores only with this reference. From the results in Table 3, we can see that, whether with or without cross-attention, the seer decoder makes large improvements over the conventional decoder consistently on all the test sets. However, without cross-attention, the BLEU scores of both decoders decrease dramatically, which means language model information alone is not enough for the translation task. Therefore, we can conclude that the seer decoder acts much better in target language modeling and cross-language projection, and it is reasonable to use the seer decoder as the guide.

The Distillation of Future Information
As the seer decoder achieves its superiority with the help of future target information, we hope that the conventional decoder can learn future information from the seer decoder with knowledge distillation. To check this, we tested whether the hidden states of the conventional decoder could derive more future ground truth words after knowledge distillation. The underlying belief is that the future ground truth information transferred from the seer decoder can help the conventional decoder derive more future ground truth words.
Assuming the hidden states generated by the conventional decoder are S^t = [s^t_1; ...; s^t_I]^T, the future words for each target position i can be predicted with the distribution

p_{w_i} = softmax(W_w s^t_i),    (7)

where W_w is the weight matrix. During training, we can get the bag of ground truth words for position i as b*_i = {y*_{i+1}, ..., y*_I} and train W_w, with all other parameters fixed, by maximizing the likelihood of b*_i as

L_w = Σ_{k=1}^{K} Σ_{i=1}^{I_k} Σ_{w ∈ b*_i} log p_{w_i}(w),    (8)

where K is the number of training sentences, I_k is the length of the k-th target sentence, and p_{w_i}(w) is the probability of the word w under Equation 7.
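This bag-of-future-words probe can be sketched as follows. The helper names are ours; in the toy example the decoder hidden states and probe weights are tiny stand-ins for the real (frozen) model, and each bag holds the indices of the future ground truth words for a position.

```python
import numpy as np

def softmax(x):
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def future_bag_nll(S_t, W_w, future_bags):
    """Negative log-likelihood of the bags of future ground truth words.

    S_t:         (I, d) hidden states of the conventional decoder (fixed)
    W_w:         (d, V) probe weight matrix (the only trained parameters)
    future_bags: list of index lists, future_bags[i] = {y*_{i+1}, ..., y*_I}
    """
    probs = softmax(S_t @ W_w)          # p_{w_i} over the vocabulary
    nll = 0.0
    for i, bag in enumerate(future_bags):
        for w in bag:                   # sum of -log p_{w_i}(w) over the bag
            nll -= np.log(probs[i, w])
    return nll
```

Minimizing this quantity over W_w (with the decoder fixed) is equivalent to maximizing the likelihood of Equation 8, so a lower NLL means the hidden states encode more future information.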
At test, we select the top I_{b_i} words according to Equation 7 as the bag of future words b_i for position i. As we cannot access the ground truth, the size of b_i is calculated approximately as I_{b_i} = max{2, (J - i) × 2}, where J is the length of the source sentence. As we do not know the target length during prediction, it may occur that i is greater than J, and calculating I_{b_i} in this way ensures that b_i contains at least 2 words.
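The bag-size rule and top-word selection are simple to state in code. The sketch below is our own illustration of the rule as written above, with 1-based position i and source length J:

```python
import numpy as np

def future_bag(probs_i, i, J):
    """Select the predicted bag of future words for position i.

    probs_i: (V,) distribution over the vocabulary at position i
    i, J:    1-based target position and source sentence length
    Returns the indices of the top I_{b_i} = max(2, (J - i) * 2) words.
    """
    size = max(2, (J - i) * 2)
    return set(np.argsort(probs_i)[::-1][:size])
```

When i exceeds J the size (J - i) × 2 goes negative, and the max with 2 guarantees the bag still contains at least two candidate words.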
We conducted experiments on Chinese-English translation and used MT02 as the test set, with only the first reference as the ground truth. We calculated the accuracy and recall by comparing each b_i against the corresponding bag of ground truth future words. The results in Table 4 show that the conventional decoder in our method achieves higher accuracy and recall than the decoder of Transformer. This means knowledge distillation does transfer future information from the seer decoder to the conventional decoder.

The Contribution of Subdecoders
In the seer decoder of our method, the information from the past and future subdecoders is fused (as shown in Equation 1) to get the final cross-attention. The intuition is that at the beginning, the past subdecoder contains less information than the future subdecoder, so the fused information should rely more on the future subdecoder. As the translation gets longer, the information embodied in the past subdecoder grows, and the fused information should depend more on the past subdecoder. To confirm this hypothesis, we calculate the cosine similarity of the vectors in A given in Equation 1 with the corresponding weighted vectors of W_p H_p and W_f H_f. We selected 205 sentences whose lengths range in [15, 25], then calculated the cosine similarities word by word. The similarities at the same target position were then averaged, and the chart over all the target positions is given in Figure 3.
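The per-position similarity computation can be sketched as below. Random matrices stand in for the fused representations and the weighted past/future vectors (in the real analysis these come from the trained model); the names are ours.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(1)
I, d = 20, 8
A = rng.normal(size=(I, d))   # fused representations (Equation 1)
P = rng.normal(size=(I, d))   # weighted past vectors,   W_p H_p
F = rng.normal(size=(I, d))   # weighted future vectors, W_f H_f

# Similarity of the fused vector to each contribution, per position;
# in the paper these are then averaged over sentences of length 15-25.
sim_past = [cosine(A[i], P[i]) for i in range(I)]
sim_future = [cosine(A[i], F[i]) for i in range(I)]
```

Plotting the averaged sim_past and sim_future against the target position yields the kind of curves shown in Figure 3.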
The figure confirms our conjecture that at first, the fused information is highly related to the future information, and over time the similarity to past information increases gradually while the similarity to future information decreases faster.

Ablation Study
We have shown that in our method the past and future information collaborate to achieve better global planning. In this section, we explore the influence of past and future information by separately deleting the future and past subdecoders from the seer decoder. In both cases, only the structure of the seer decoder changes and the whole model is trained with knowledge distillation in the same way. We also remove the knowledge distillation loss, in which case the seer and conventional decoders only interact via the shared encoder and only optimize their own cross-entropy losses during training. The results are given in Table 5. When we exclude future or past information, the translation performance decreases dramatically, to almost the same extent, but both variants still have an obvious gain over Transformer. This demonstrates that both the past and future information are necessary for global planning. It is interesting that the translation performance still rises without the future subdecoder, even though no additional information is fed compared to Transformer. The reason may be that the conventional and seer decoders can restrict each other to avoid bad behaviors. When knowledge distillation is dropped, the performance declines greatly, which means communicating only via the encoder is not enough for the conventional and seer decoders. Hence we need knowledge distillation to reinforce the influence of the seer decoder on the conventional decoder.

Performance with Sentence Length
As the translation is generated word by word, translation errors accumulate as the translation grows, which influences later predictions. In our method, the conventional decoder can learn future information from the seer decoder and hence should make better global planning for the whole sequence. From this, we deduce that our method performs better on long sentences than Transformer.
We checked this on the NIST CN→EN translation task by splitting the sentences in all the test sets into 8 bins according to their length. We then translated each bin and computed BLEU scores. The results in Figure 4 show that our method achieves bigger improvements on longer sentences, especially in the last three bins.
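The binning step can be sketched as follows. The paper does not state the bin boundaries, so the equal-width bins and the maximum length used here are assumptions for illustration only:

```python
def length_bins(sentences, n_bins=8, max_len=80):
    """Group (source, reference) pairs into equal-width length bins.

    sentences: list of (source, reference) string pairs
    Assumed scheme: bin b covers source lengths [b*w, (b+1)*w) with
    w = max_len // n_bins; overlong sentences fall into the last bin.
    """
    width = max_len // n_bins
    bins = [[] for _ in range(n_bins)]
    for pair in sentences:
        src_len = len(pair[0].split())
        b = min(src_len // width, n_bins - 1)
        bins[b].append(pair)
    return bins
```

Each bin is then decoded separately and scored with BLEU, giving one point per bin in a plot like Figure 4.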

Conclusion
In order to help the NMT model make good global planning at inference, we propose to introduce a seer decoder, which embodies future ground truth, to guide the behaviors of the conventional decoder. To this end, we employ knowledge distillation to transfer future information from the seer decoder to the conventional decoder. At test, the conventional decoder can perform translation on its own as if it knew some future information. The experiments indicate that our method outperforms strong baselines significantly on four data sets. We are also the first to explore the learning mechanisms of knowledge distillation, adversarial learning and L_2 regularization together, and knowledge distillation has proven to be the most effective one.

Figure 1 :
Figure 1: The architecture of the proposed method

Figure 2 :
Figure 2: The architecture of the seer decoder

Figure 3 :
Figure 3: The similarity of the past and future information to the fused information

Figure 4 :
Figure 4: The BLEU scores on sentence bins with different lengths.

Table 4 :
Comparison of the predicted bags of future words between the conventional decoder of our method and the decoder of Transformer