Turning Fixed to Adaptive: Integrating Post-Evaluation into Simultaneous Machine Translation

Simultaneous machine translation (SiMT) starts its translation before reading the whole source sentence and employs either a fixed or an adaptive policy to generate the target sentence. Compared to the fixed policy, the adaptive policy achieves better latency-quality tradeoffs by adopting a flexible translation policy. If the policy can evaluate rationality before taking action, the probability of incorrect actions will also decrease. However, previous methods lack evaluation of actions before taking them. In this paper, we propose a method of performing the adaptive policy via integrating post-evaluation into the fixed policy. Specifically, whenever a candidate token is generated, our model will evaluate the rationality of the next action by measuring the change in the source content. Our model will then take different actions based on the evaluation results. Experiments on three translation tasks show that our method can exceed strong baselines under all latency.


Introduction
Simultaneous machine translation (SiMT) (Gu et al., 2017; Ma et al., 2019; Arivazhagan et al., 2019; Ma et al., 2020; Zhang and Feng, 2021b, 2022d) starts translation before reading the whole source sentence. It seeks to achieve good latency-quality tradeoffs and is suitable for various scenarios with different latency tolerances. Compared to full-sentence machine translation, SiMT is more challenging because it lacks partial source content in translation and additionally needs to decide on a translation policy.
The translation policy in SiMT directs the model to decide when to take a READ (i.e., read the next source token) or WRITE (i.e., output the generated token) action, so as to ensure that the model has appropriate source content to translate the target tokens. Because READ and WRITE actions are often decided based on available source tokens and generated target tokens, it is difficult to guarantee their accuracy. Therefore, if the SiMT model can evaluate the rationality of actions with the help of the current generated candidate token, it can reduce the probability of taking incorrect actions.
However, the previous methods, including fixed and adaptive policies, lack evaluation before taking the next action. For fixed policy (Ma et al., 2019; Elbayad et al., 2020; Zhang et al., 2021; Zhang and Feng, 2021c), the model generates translation according to predefined translation rules. Although it only relies on simple training methods, it cannot make full use of the context to decide an appropriate translation policy. For adaptive policy (Gu et al., 2017; Arivazhagan et al., 2019; Ma et al., 2020; Zhang et al., 2022), the model can obtain better translation performance. But it needs complicated training methods to obtain the translation policy and takes action immediately after making decisions, which usually does not guarantee the accuracy of actions.
Therefore, we attempt to explore some factors from the translation to reflect whether the action is correct, thereby introducing evaluation into the translation policy. The goal of translation is to convert sentences from the source language to the target language (Mujadia and Sharma, 2021), so the source and target sentences should contain the same semantics (i.e., global equivalence). To ensure the faithfulness of translation (Weng et al., 2020), the source content that has already been translated should be semantically equivalent to the previously generated target tokens at each step (i.e., partial equivalence) (Zhang and Feng, 2022c). Furthermore, by comparing the changes between adjacent steps, the increment of the source content being translated should be semantically equivalent to the current generated token (i.e., incremental equivalence). Therefore, the rationality of the generated target token can be reflected by the increment of the source content being translated between adjacent steps, which can be used to evaluate the READ and WRITE actions.
In this paper, we propose a method of performing the adaptive policy by integrating post-evaluation into the fixed policy, which directs the model to take READ or WRITE action based on the evaluation results. Using partial equivalence, our model can recognize the translation degree of source tokens (i.e., the degree to which the source token has been translated), which represents how much of the source content is translated at each step. Then naturally, by virtue of incremental equivalence, the increment of translated source content can be regarded as the change in the translation degree of available source tokens. Therefore, we can evaluate the action by measuring the change in translation degree. As shown in Figure 1, if the translation degree has significant changes after generating a candidate token, we think that the current generated token obtains enough source content, and thus WRITE action should be taken. Otherwise, the model should continue to take READ actions to wait for the arrival of the required source tokens. Experiments on WMT15 De→En and IWSLT15 En→Vi translation tasks show that our method can exceed strong baselines under all latency.

Background
Transformer (Vaswani et al., 2017), which consists of an encoder and a decoder, is the most widely used neural machine translation model. Given a source sentence x = (x_1, ..., x_I), the encoder maps it into a sequence of hidden states z = (z_1, ..., z_I).
Our method is based on wait-k policy (Ma et al., 2019) and Capsule Networks (Hinton et al., 2011) with Guided Dynamic Routing (Zheng et al., 2019b), so we briefly introduce them.

Wait-k Policy
Wait-k policy, which belongs to fixed policy, takes k READ actions first and then takes READ and WRITE actions alternately. Define a monotonic non-decreasing function g(t), which represents the number of available source tokens when translating target token y_t. For wait-k policy, g(t) can be calculated as

g(t) = min{k + t − 1, I},

where I is the length of the source sentence.
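The schedule above is simple enough to sketch directly. The helper below is illustrative, not the paper's code; the function name and signature are our own.

```python
# Sketch of the wait-k read schedule (Ma et al., 2019).
# g(t) = min(k + t - 1, I) is the number of source tokens visible
# when generating target token y_t (1-indexed).

def wait_k_schedule(k: int, src_len: int, tgt_len: int) -> list:
    """Return g(t) for t = 1..tgt_len under a wait-k policy."""
    return [min(k + t - 1, src_len) for t in range(1, tgt_len + 1)]

# With k = 3 and a 6-token source, the decoder first waits for 3 tokens,
# then reads one more source token per target token until the source ends.
print(wait_k_schedule(3, 6, 5))  # -> [3, 4, 5, 6, 6]
```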
To avoid recalculation of the encoder hidden states when a new source token is read, a unidirectional encoder (Elbayad et al., 2020) is proposed to make each source token attend only to its previous tokens. Besides, the multi-path method (Elbayad et al., 2020) optimizes the model by sampling k uniformly during training and makes a unified model obtain translation performance comparable to wait-k policy under all latency.

Capsule Networks with Guided Dynamic Routing
Guided Dynamic Routing (GDR) is a variant of the routing-by-agreement mechanism (Sabour et al., 2017) in Capsule Networks and makes input capsules route to corresponding output capsules driven by the decoding state at each step. In detail, the encoder hidden states z are regarded as a sequence of input capsules, and a layer of output capsules is added to the top of the encoder to model different categories of source information. The decoding state then directs each input capsule to find its affiliation to each output capsule at each step, thereby solving the problem of assigning source tokens to different categories.
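As a rough illustration of routing-by-agreement (the mechanism GDR specializes), the sketch below iteratively refines assignment probabilities via a softmax over cumulative logits. The per-capsule transform, squash, and agreement update are simplified assumptions, and the decoding-state guidance that GDR adds is omitted.

```python
import numpy as np

# Minimal sketch of routing-by-agreement, loosely following Sabour et al.
# (2017). Each input capsule z_i distributes its assignment probability
# c_ij over output capsules via a softmax of cumulative logits b_ij,
# which grow with the agreement between predictions and output capsules.

def route(z, num_out, iters=3, rng=None):
    rng = rng or np.random.default_rng(0)
    num_in, dim = z.shape
    W = rng.standard_normal((num_out, dim, dim)) * 0.1  # per-capsule transform (assumed)
    u = np.einsum('jde,ie->ijd', W, z)                  # predictions u_{i->j}
    b = np.zeros((num_in, num_out))                     # cumulative routing logits
    for _ in range(iters):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)      # softmax over outputs
        s = (c[:, :, None] * u).sum(axis=0)                        # weighted vote per output
        v = s / (1.0 + np.linalg.norm(s, axis=1, keepdims=True))   # squash-like shrink
        b = b + np.einsum('ijd,jd->ij', u, v)                      # agreement update
    return np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)        # final affiliations

c = route(np.random.default_rng(1).standard_normal((5, 8)), num_out=4)
# Each row of c sums to 1: capsule i's assignment distribution over outputs.
```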

The Proposed Method
The architecture of our method is shown in Figure 2. Our method first guides the model to recognize the translation degree of available source tokens based on partial equivalence during training via the introduced GDR module. Then, based on the incremental equivalence between adjacent steps, our method utilizes the changes in translation degree to post-evaluate the rationality of the READ and WRITE actions and accordingly make corrections, thereby performing an adaptive policy during inference. Besides, to enhance the robustness of the model in recognizing the translation degree during inference, our method applies disturbed-path training based on the wait-k policy, which adds some disturbance to the translation policy during training. The details are introduced in the following sections in order.

Figure 2: The architecture of our method. The R/W prediction module obtains the translation degree of the available source tokens and evaluates the next action based on the change in translation degree.

Recognizing the Translation Degree
As mentioned above, the translation degree represents the degree to which a source token has been translated and is the prerequisite of our method. Therefore, we introduce Capsule Networks with GDR to model the translation degree, which is guided by our two proposed constraints according to partial equivalence during training.
Translation Degree We define the translation degree of all source tokens at step t as d^(t) = (d_1^(t), ..., d_I^(t)). To obtain the translation degree, we need to utilize the ability of Capsule Networks with GDR to assign the source tokens to different categories. Assume that there are J+N output capsules modeling available source information that has already been translated and has not yet been translated, among which there are J translated capsules Φ^T = (Φ_1, ..., Φ_J) and N untranslated capsules Φ^U = (Φ_{J+1}, ..., Φ_{J+N}), respectively. The encoder hidden states z are regarded as input capsules. To determine how much of z_i needs to be sent to Φ_j at step t, the assignment probability c_ij^(t) is computed as

c_ij^(t) = exp(b_ij^(t)) / Σ_{j'} exp(b_ij'^(t)),

where b_ij^(t) measures the cumulative similarity between z_i and Φ_j. Then c_ij^(t) is updated iteratively, driven by the decoding state, and is seen as the affiliation of z_i belonging to Φ_j after the last iteration. For more details about Capsule Networks with GDR, please refer to Zheng et al. (2019b). On this basis, the translation degree of x_i is calculated by aggregating the assignment probability of routing to the translated capsules at step t:

d_i^(t) = Σ_{j=1}^{J} c_ij^(t).    (3)
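Given the assignment matrix, the aggregation is a one-line sum. The sketch below assumes `c` holds the converged assignment probabilities with the J translated capsules in the first columns; that layout, and the function name, are our own assumptions for illustration.

```python
import numpy as np

# Sketch of the translation-degree aggregation: the degree of source token
# x_i at step t is the total assignment probability routed to the J
# "translated" output capsules. `c` is an (I, J+N) matrix whose rows sum
# to 1; the first `num_translated` columns are assumed to be the
# translated capsules.

def translation_degree(c: np.ndarray, num_translated: int) -> np.ndarray:
    """d_i = sum over translated capsules j of c_ij."""
    return c[:, :num_translated].sum(axis=1)

c = np.array([[0.7, 0.2, 0.05, 0.05],   # mostly routed to translated capsules
              [0.1, 0.1, 0.4, 0.4]])    # mostly routed to untranslated capsules
print(translation_degree(c, num_translated=2))  # ≈ [0.9, 0.2]
```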

Segment Constraint
To ensure that the model can recognize the translation degree of source tokens, the model requires additional guidance. According to partial equivalence, the translated source content should be semantically equivalent to the generated target tokens. On the contrary, the untranslated source content and unread source tokens should be semantically equivalent to the target tokens not yet generated. So we introduce mean square error to induce the learning of the output capsules:

L_seg = || W_T Φ_t^T − H_t^T ||^2 + || W_U^e Φ_t^U + W_U^d Z̄_t − H_t^U ||^2,

where W_T, W_U^e and W_U^d are learnable parameters. H_t^T and H_t^U are the averages of the hidden states of the generated target tokens and the target tokens not yet generated, calculated respectively as

H_t^T = (1 / (t − 1)) Σ_{t'=1}^{t−1} s_{t'},    H_t^U = (1 / (M − t + 1)) Σ_{t'=t}^{M} s_{t'},

where s_{t'} denotes the hidden state of target token y_{t'} and M is the length of the target sentence. Z̄_t is the average of the hidden states of the unread source tokens at step t:

Z̄_t = (1 / (I − g(t))) Σ_{i=g(t)+1}^{I} z_i.

Φ_t^T and Φ_t^U are the translated and untranslated source information at step t, respectively.
Token Constraint To recognize the changes in translation degree more accurately, we propose the token constraint according to incremental equivalence. It encourages the translated capsules to predict the generated tokens and combines translated and untranslated capsules to predict the available source tokens at each step. It can be calculated as

L_tok = − log p_d(y_{<t} | Φ_t^T) − log p_e(x_{≤g(t)} | Φ_t^T; Φ_t^U),

where p_d(y_{<t} | Φ_t^T) represents the probability of the generated target tokens based on the translated source information and p_e(x_{≤g(t)} | Φ_t^T; Φ_t^U) is the probability of the available source tokens based on both translated and untranslated information. Then we can get the training objective of our model:

L = − log p_θ(y | x) + L_seg + L_tok,

where − log p_θ(y | x) is the negative log-likelihood.
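To make the composition of the objective concrete, here is a hedged numeric sketch: a mean-squared-error term stands in for the segment constraint and a summed negative log-probability for the token constraint. The weights `lam_seg` and `lam_tok` are illustrative knobs not specified in the text, and all names are our own.

```python
import numpy as np

# Illustrative composition of the training objective: NLL of the
# translation plus the segment (MSE) and token (NLL) constraints.
# Shapes and weighting are assumptions, not the paper's exact recipe.

def mse(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.mean((a - b) ** 2))

def total_loss(nll, phi_T, H_T, phi_U, H_U, log_p_tokens,
               lam_seg=1.0, lam_tok=1.0):
    seg = mse(phi_T, H_T) + mse(phi_U, H_U)   # segment constraint
    tok = -float(np.sum(log_p_tokens))        # token constraint (sum of log-probs)
    return nll + lam_seg * seg + lam_tok * tok
```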

Post-Evaluation Policy
With the help of token and segment constraints, our model can accurately recognize the translation degree, which can be utilized to perform our Post-Evaluation (PE) policy by measuring the changes in translation degree between adjacent steps.
Generally speaking, the core of the adaptive policy is to decide the conditions for taking different actions (Zhang and Feng, 2022b). According to incremental equivalence, the current generated token should be semantically equivalent to the increment of the source content that has been translated, which can be measured by the changes in translation degree. Therefore, we can evaluate the rationality of actions by measuring the change in the translation degree of available source tokens. We define the change in the translation degree of source tokens after generating y_t as Δd^(t) = (Δd_1^(t), ..., Δd_I^(t)), where

Δd_i^(t) = max(d_i^(t+1) − d_i^(t), 0),

d_i^(t) and d_i^(t+1) are calculated in Eq. (3), and the max(·) function ensures that the translation degree is undiminished, considering incremental equivalence. Furthermore, we introduce the hyperparameter ρ, which is the threshold to measure the change in translation degree.
As shown in Figure 3, we can get the conditions for taking different actions by comparing Δd^(t) and ρ. We first define the function max_select(·), which returns the maximum element in a vector. According to incremental equivalence, if the change in the translation degree exceeds the threshold (i.e., max_select(Δd^(t)) ≥ ρ), then the current generated token obtains enough source content, and the model should take WRITE action. Otherwise, the model should continue to take READ action. However, the generation of auxiliary tokens such as 'the' in English can not lead to a change in translation degree. This misleads the model to take READ actions consecutively, so we force the model to take WRITE actions by setting the restriction of consecutive READ actions as r. The PE policy is shown in Algorithm 1. Our model will only take WRITE actions after reading the whole source sentence.
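The decision rule can be summarized as a small function. The names and the exact ordering of the checks are our illustrative reading of the PE policy, not the authors' Algorithm 1.

```python
# Sketch of the post-evaluation decision rule: WRITE when the largest
# change in translation degree reaches threshold rho, otherwise READ,
# with at most r consecutive READ actions before a WRITE is forced
# (for auxiliary tokens that do not change the translation degree).

def pe_action(delta_d, rho, consecutive_reads, r, source_exhausted):
    """Return 'WRITE' or 'READ' for the current candidate token."""
    if source_exhausted:            # whole source read: only WRITE remains
        return "WRITE"
    if max(delta_d) >= rho:         # enough new source content was consumed
        return "WRITE"
    if consecutive_reads >= r:      # cap on consecutive READs
        return "WRITE"
    return "READ"

print(pe_action([0.05, 0.30], rho=0.24, consecutive_reads=0, r=2,
                source_exhausted=False))  # -> WRITE
```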

Disturbed-Path Training
Up to now, we have presented our adaptive policy, introduced through post-evaluation, which utilizes the translation degree. Because the adaptive policy adopts different translation paths (i.e., sequences of READ and WRITE actions) for different contexts, the model is required to learn as many translation paths as possible. However, the previous training methods (Ma et al., 2019; Elbayad et al., 2020) can only cover a small number of predefined translation paths. To enhance the ability to recognize the translation degree on different translation paths, our model is optimized across our proposed disturbed paths.
Specifically, the log-likelihood estimation based on sentence pair (x, y) through the single path g_k is computed as

log p(y | x; g_k) = Σ_{t=1}^{M} log p(y_t | x_{≤g(t;k)}, y_{<t}),    (11)

where g_k = (g(1; k), ..., g(M; k)) defines the number of available source tokens at each step and k is the number of source tokens read in advance before generation. For translation path g_k, g(t; k) is updated as

g(t; k) = min{k + t − 1 + γ, I},

where γ is uniformly sampled from [0, ..., r], and r, the restriction on READ actions in the PE policy, controls the degree of disturbance to a single translation path. This essentially simulates the situation where the model makes decisions on the next action. For (x, y), we then make the model able to recognize the translation degree under all latency by changing k. Thus, the log-likelihood estimation in Eq. (11) is modified to

log p(y | x) = E_{k∼U(K)} [ log p(y | x; g_k) ],

where k is uniformly sampled from K = [1, ..., I] and I is the length of the source sentence. Therefore, our method can perform our adaptive policy under all latency using only a unified model.
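A hedged sketch of path sampling: we illustrate one plausible reading in which a per-step disturbance γ ∈ {0, ..., r} is added to the wait-k schedule, capped at the source length and kept monotonic. The per-step sampling granularity is our assumption.

```python
import random

# Sketch of disturbed-path sampling: perturb the wait-k schedule
# g(t; k) = min(k + t - 1 + gamma, I) with gamma drawn uniformly from
# {0, ..., r}, while keeping the path monotonic non-decreasing.

def disturbed_path(k, src_len, tgt_len, r, rng=None):
    rng = rng or random.Random(0)
    path, prev = [], 0
    for t in range(1, tgt_len + 1):
        gamma = rng.randint(0, r)                 # disturbance for this step
        g = min(k + t - 1 + gamma, src_len)
        g = max(g, prev)                          # never un-read a source token
        path.append(g)
        prev = g
    return path

print(disturbed_path(k=3, src_len=8, tgt_len=5, r=2))
```

With r = 0 this degenerates to the plain wait-k path, which matches the intuition that r controls the degree of disturbance.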
Experiments

Datasets

For En→Vi task (Cettolo et al., 2016), our settings are the same as Arivazhagan et al. (2019). We replace tokens whose frequency is less than 5 with <unk>. We use TED tst2012 as the development set and TED tst2013 as the test set.
For En→De task, the model settings remain the same as Cettolo et al. (2014).
For De→En task, we keep our settings consistent with Ma et al. (2020). We apply BPE (Sennrich et al., 2016) with 32K subword units and use a shared vocabulary between source and target. We use newstest2013 as the development set and newstest2015 as the test set.

Model Settings
Since our experiments involve the following models, we briefly introduce them. Wait-k (Ma et al., 2019) is the fixed policy that first reads k source tokens and then alternates READ and WRITE actions. Multi-path (Elbayad et al., 2020) trains a unified model by uniformly sampling k during training. Adaptive-wait-k (Zheng et al., 2020) performs an adaptive policy through a heuristic composition of several wait-k models. PED represents our model, which is trained through disturbed-path and performs PE policy during inference. For all the models mentioned above, we apply Transformer-Small (6 layers, 4 heads) on En→Vi and En→De tasks and Transformer-Base (6 layers, 8 heads) on De→En task. Other model settings follow Ma et al. (2020). We implement all models by adapting Transformer from the Fairseq library (Ott et al., 2019). The settings of Capsule Networks with GDR are consistent with Zheng et al. (2019b). For our method, we empirically set r = 2 and ρ = 0.24 for all experiments, and use k as a free parameter to achieve different latency. Our proposed method is fine-tuned based on the pre-trained multi-path model. We use greedy search in decoding and evaluate these methods with translation quality measured by tokenized BLEU (Papineni et al., 2002) and latency estimated by Average Lagging (AL) (Ma et al., 2019).
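Latency is reported with Average Lagging; a minimal sketch of the metric following Ma et al. (2019) is below, using the g(t) notation from the Background section. The function name is our own.

```python
# Sketch of Average Lagging (Ma et al., 2019):
#   AL = (1/tau) * sum_{t=1}^{tau} [ g(t) - (t - 1) / r ],
# where r = |y| / |x| and tau is the first step at which the whole
# source sentence has been read.

def average_lagging(g, src_len, tgt_len):
    r = tgt_len / src_len
    tau = next(t for t, gt in enumerate(g, start=1) if gt == src_len)
    return sum(g[t - 1] - (t - 1) / r for t in range(1, tau + 1)) / tau

# Wait-3 path on a 6-token source with a 6-token target:
g = [min(3 + t - 1, 6) for t in range(1, 7)]   # [3, 4, 5, 6, 6, 6]
print(average_lagging(g, 6, 6))  # -> 3.0
```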

Main Results
The translation performance of our method and the previous methods is shown in Figure 4. It can be seen that our method can exceed previous methods under all latency on all translation tasks.
Compared to wait-k policy, our method obtains significant improvement, especially under low latency. This is because wait-k policy performs translation according to the predefined path, which usually leads to uncertain anticipation or introduces redundant latency (Ma et al., 2019). Both Multi-path and our methods can generate translation under all latency with a unified model, but our PED method surpasses its performance by performing the Post-Evaluation (PE) policy, which can evaluate the rationality of actions and then decide whether to take them. Therefore, compared with fixed policy, our PE method can achieve better performance by adjusting its translation policy.
Compared to Adaptive-wait-k policy, our model also surpasses its performance and is more reliable under high latency. Adaptive-wait-k generates translation through a heuristic composition of several models with different fixed policies, which restricts the performance under high latency and leads to a decrease in translation speed caused by frequent model switching (Zheng et al., 2020). Our method generates translation with only a unified model and integrates post-evaluation into the fixed policy to evaluate the rationality of actions. In particular, our model can approach the performance of full-sentence machine translation with lower latency on two tasks.

Analysis
To understand our proposed method, we conduct multiple analyses. All of the following results are reported on the De→En task.

Ablation Study
We conduct an ablation study on the PE policy and the disturbed-path training method to verify their effectiveness. As shown in Table 1, both the PE policy and the disturbed-path method can improve the translation performance, and better latency-quality tradeoffs can be obtained by their joint contributions.

Table 1: Ablation study of our method when k = 9. 'w/o PE' denotes our model is trained across disturbed-path and performs fixed policy. 'w/o disturbed-path' denotes our model is trained across multi-path and performs our PE policy.
We also carry out comparative experiments to understand the two constraints in subsection 3.1. The results are shown in Table 2. Both token and segment constraints have positive effects on translation performance respectively. Although the translation quality is slightly worse when the model is guided by them concurrently, the translation degree of available source tokens can be greatly improved and the latency is also reduced by their combined contributions.

Analysis of Translation Degree
To describe the translation degree intuitively, we visualize it in Figure 5. Obviously, the translation degree of each source token gradually accumulates with the progress of translation, which means that the source content is gradually utilized by the target to generate the translation and observes partial equivalence. Besides, our PE policy can take WRITE actions when the translation degree of source tokens has significant changes, which obeys incremental equivalence and ensures the rationality of actions. Therefore, our PED policy can adaptively adjust the translation path based on context to achieve better translation performance.

Following Zheng et al. (2019b), we evaluate the accuracy of the translation degree at each step by using the overlapping rate, which measures the coincidence between the predicted tokens and the ground-truth tokens. We introduce the prediction function in the token constraint to predict the target and source tokens respectively. Then we obtain the target overlapping rate R_T by comparing the predicted target tokens with the generated tokens, and the source overlapping rate R_S by comparing the predicted source tokens with the available source tokens. The results are shown in Table 3.

Latency   1     3     5     7     9
R_T (↑)   0.60  0.62  0.63  0.62  0.61
R_S (↑)   0.80  0.78  0.77  0.77  0.78

The output capsules can well represent the available source information and generated target information under all latency. Therefore, our method can recognize the translation degree accurately at each step according to partial equivalence, thereby providing the basis for our policy.
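The overlapping rate can be computed as a simple set overlap. The exact matching used by Zheng et al. (2019b) may differ, so treat this as an assumption-laden sketch with hypothetical names.

```python
# Sketch of an overlapping rate: the fraction of reference tokens that
# also appear among the predicted tokens. This treats both sides as
# sets, ignoring token multiplicity, which is a simplifying assumption.

def overlapping_rate(predicted, reference):
    pred, ref = set(predicted), set(reference)
    if not ref:
        return 0.0
    return len(pred & ref) / len(ref)

print(overlapping_rate(["the", "cat", "sat"], ["the", "cat", "ran"]))  # ≈ 0.667
```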

Analysis on Translation Path
The purpose of the translation policy is to get a better translation path, which is composed of READ and WRITE actions. To verify the effectiveness of our PE policy, we introduce sufficiency and necessity (Zhang and Feng, 2022c) as evaluation metrics.
Essentially, sufficiency measures the faithfulness of the generated translation and necessity measures how much redundant delay is introduced. We take the manually aligned De→En corpus in the RWTH dataset (https://www-i6.informatik.rwth-aachen.de/goldAlignment/) as ground-truth alignments. The comparison of sufficiency and necessity of different methods is shown in Figure 6. Obviously, the translation path decided by our PE policy exceeds other methods in terms of sufficiency and necessity. The sufficiency of wait-k policy is similar to PE policy, but it introduces too much unnecessary delay under all latency. Compared to wait-k policy, Adaptive-wait-k performs better in terms of necessity, but this is obtained at the cost of partial sufficiency.

Translation Efficiency
In order to compare the translation efficiency of our method and the previous methods, we measure the average time of generating each token. The results in Table 4 are tested on a GeForce GTX TITAN-X. It can be seen that the translation speed of our methods is lower than that of wait-k policy, but about three times that of Adaptive-wait-k policy. Besides, the translation speed of PED is about half that of 'PED w/o PE', which reflects the overhead of performing post-evaluation at each step.

Related Work

Zheng et al. (2020) implemented the adaptive policy through a composition of several fixed policies. Miao et al. (2021) proposed a generative framework to perform the adaptive policy for SiMT. Zhang and Feng (2022c) introduced duality constraints to direct the learning of translation paths during training. Instead of predicting the READ and WRITE actions, Zhang and Feng (2022a) implemented the adaptive policy by predicting the aligned source positions of each target token.
Our method focuses on the accuracy of READ and WRITE actions during inference. Our PE policy can evaluate the rationality of actions by utilizing the increment of source content before taking them, which reduces the probability of incorrect actions. Besides, our method achieves good performance under all latency with a unified model.
Capsule Networks (Hinton et al., 2011) and their assignment policies (Sabour et al., 2017; Hinton et al., 2018) initially attempted to solve the parts-to-wholes problem in computer vision. Dou et al. (2019) first employed capsule networks in an NMT (i.e., neural machine translation) model for layer representation aggregation. Zheng et al. (2019b) proposed a novel assignment policy, GDR, to model past and future source content to assist translation. Wang et al. (2019) proposed a novel capsule network for linear-time NMT.
Our PED method introduces Capsule Networks with GDR into the SiMT model and recognizes the translation degree of source tokens under the restriction of partial source information. Furthermore, we evaluate the rationality of the actions by measuring the changes in translation degree to implement the adaptive policy.

Conclusion
In this paper, we propose a new method of performing the adaptive policy by integrating post-evaluation into the fixed policy to evaluate the rationality of the actions. Besides, disturbed-path training is proposed to enhance the robustness of the model in recognizing the translation degree on different translation paths. Experiments show that our method outperforms the strong baselines under all latency and can recognize the translation degree on different paths accurately. Furthermore, the PE policy can enhance the sufficiency and necessity of translation paths to achieve better performance.

Limitations
We think our methods mainly have two limitations. On the one hand, although our method can recognize the translation degree of each source token, it still has some deviations. On the other hand, although the inference speed of our method is slightly slower than the wait-k policy, it is still faster than the Adaptive-wait-k policy, which is enough to meet the needs of the application.

Figure 1 :
Figure 1: The change in translation degree of source tokens after generating a candidate token, and the READ/WRITE action is taken accordingly.

Figure 3 :
Figure 3: Change in translation degree of available source tokens after generating y_t. The model takes WRITE action when the translation degree has significant changes. Otherwise, the model should take READ action.

Figure 4 :
Figure 4: Performance of different methods on En→Vi (Transformer-Small), En→De (Transformer-Small) and De→En (Transformer-Base) tasks. It shows the results of our methods, wait-k, multi-path, adaptive-wait-k and the offline model.

Figure 5 :
Figure 5: Translation and evaluation process of a De→En example when performing PE policy with k = 5. The horizontal direction denotes the source sentence (De), and the vertical direction denotes the generated sentence (En). 'T' represents the translation degree. 'U' represents the degree to which the source token has not yet been translated. Our PE policy can take WRITE actions accurately when the translation degree has significant changes.
Figure 6 :
Figure 6: Comparison of sufficiency and necessity of translation paths between different translation policies.

Table 2 :
Comparison among the combinations of two constraints when decoding with k = 9. The model is optimized through multi-path and performs fixed policy.

Table 3 :
The results of overlapping rate under all latency, where the higher rate is better. The model is trained across disturbed-path and performs fixed policy.