Adaptive Policy with Wait-k Model for Simultaneous Translation

Simultaneous machine translation (SiMT) requires a robust read/write (R/W) policy in conjunction with a high-quality translation model. Traditional methods rely on either a fixed wait-k policy coupled with a standalone wait-k translation model, or an adaptive policy jointly trained with the translation model. In this study, we propose a more flexible approach by decoupling the adaptive policy model from the translation model. Our motivation stems from the observation that a standalone multi-path wait-k model performs competitively with adaptive policies utilized in state-of-the-art SiMT approaches. Specifically, we introduce DaP, a divergence-based adaptive policy that makes read/write decisions for any translation model based on the potential divergence in translation distributions resulting from future information. DaP extends a frozen wait-k model with lightweight parameters, and is both memory and computation efficient. Experimental results across various benchmarks demonstrate that our approach offers an improved trade-off between translation accuracy and latency, outperforming strong baselines.


Introduction
Simultaneous Machine Translation (SiMT) (Gu et al., 2017) poses a unique challenge as it generates target tokens in real time while consuming streaming source tokens, mostly applying to the scenario of speech translation (Zhang et al., 2019, 2023; Fu et al., 2023). Unlike traditional machine translation (MT) (Bahdanau et al., 2015; Vaswani et al., 2017), where the entire source is available, SiMT requires a read/write (R/W) policy to determine whether to generate target tokens or wait for additional source tokens, along with the ability to translate from source prefixes to target prefixes (P2P) (Ma et al., 2018).

Figure 1: An example demonstrating that the optimization of cross-entropy loss on an adaptive path can be achieved through multi-path wait-k training. Note that any adaptive path can be composed of subpaths (highlighted by the green and blue circles) derived from various wait-k paths.

Conventionally, the read/write policy and the translation model are designed to work in tandem: a simple wait-k policy with an offline or wait-k translation model (Ma et al., 2018; Elbayad et al., 2020; Zhang et al., 2021b), or an adaptive policy (Gu et al., 2017; Dalvi et al., 2018; Zheng et al., 2019, 2020; Ma et al., 2020a; Guo et al., 2023) that dynamically makes read/write decisions based on context, paired with a translation model that learns to translate the prefixes determined by the policy. The latter approach has achieved state-of-the-art results (Zhang and Feng, 2022, 2023). However, it entails dedicated architecture designs and multi-task learning to jointly train the tightly coupled adaptive policy and translation model in order to balance translation quality and latency, resulting in computational complexity and challenges in optimizing individual components.
On the other hand, the multi-path wait-k approach proposed by Elbayad et al. (2020) introduces an effective method for training prefix-to-prefix translation models by randomly sampling different k values between batches. Intuitively, as depicted in Figure 1, any read/write path determined by an adaptive policy can be composed of subpaths from various wait-k paths with different k values, which can be effectively translated by a well-trained multi-path wait-k model. Although the performance of the multi-path wait-k approach falls behind its adaptive counterparts, we argue that this discrepancy should be attributed to the wait-k policy rather than the translation model. Notably, our observations indicate that multi-path wait-k models can achieve competitive results when combined with the adaptive policy proposed by Zhang and Feng (2022). This suggests a decoupled modular approach where the adaptive read/write policy can be modeled and optimized separately from the translation model, offering increased flexibility and the potential for improved performance.
A key aspect of this approach lies in acquiring high-quality signals to effectively supervise the learning of the read/write policy model. We draw inspiration from human simultaneous translation (Al-Khanji et al., 2000; Liu, 2008), where interpreters switch from listening to translating once they have gathered enough source context x_{<=g(t)} to determine how to expand the partial translation y_{<t} to produce the next target word y_t. In other words, they anticipate that seeing additional future words would not impact their current decisions. This behavior implies a small discrepancy between the interpreters' modeling of the translation distribution given the partial source context, p(y_t | y_{<t}, x_{<=g(t)}), and the translation distribution given the full source context, p(y_t | y_{<t}, x). Conversely, interpreters would wait for more source words if the discrepancy becomes significant. This observation motivates the utilization of statistical divergence (Lee, 1999) between the two conditional distributions, for any prefix-to-prefix pair given a translation model, as an informative criterion for making read/write decisions. In light of this, we propose DaP-SiMT, a novel divergence-based adaptive policy for simultaneous translation, which enables adaptive simultaneous translation using estimated divergence values, considering that the full source context is unavailable during the translation process.
While there are various options of neural architectures for the policy model and the translation model, we choose to build upon a well-trained multi-path wait-k translation model with frozen parameters and introduce additional lightweight parameters for the adaptive policy model. This design choice minimizes the memory and computation overhead introduced by the policy model, while providing an effective mechanism to achieve an adaptive read/write policy within an existing SiMT model. Our main contributions can be summarized as follows.
1. We propose a novel method to construct read/write supervision signals from a parallel training corpus based on statistical divergence.
2. We present a lightweight policy model that is both memory and computation efficient and enables adaptive read/write decision-making for a well-trained multi-path wait-k translation model.
3. Experiments conducted on multiple benchmarks demonstrate that our approach outperforms strong baselines and achieves a superior accuracy-latency trade-off.

Related Works
Existing SiMT policies are mainly classified into fixed and adaptive categories. Fixed policies (Ma et al., 2018; Elbayad et al., 2020; Zhang et al., 2021b) determine read/write operations based on predefined rules. For example, the wait-k policy (Ma et al., 2018) first reads k source tokens and then alternates between writing and reading one token. On the other hand, adaptive policies predict read/write operations based on the current source and target prefix, achieving a better balance between latency and translation quality. Reinforcement learning has been used by Gu et al. (2017) to learn the policy within a Neural Machine Translation (NMT) model. Dalvi et al. (2018) designed an incremental decoding scheme that outputs a varying number of target tokens. Meanwhile, Arivazhagan et al. (2019) and Ma et al. (2020a) presented approaches to learning the adaptive policy through attention mechanisms. Recent advancements like the wait-info policy (Zhang et al., 2022b) and ITST (Zhang and Feng, 2022) have quantified the waiting latency and information weight, respectively, to devise adaptive policies. To the best of our knowledge, ITST is currently the state-of-the-art method in SiMT. One of the most relevant works to ours is the Meaningful Unit (MU) approach for simultaneous translation (Zhang et al., 2020), which detects whether the translation of a sequence of source tokens forms a prefix of the full sentence's translation. This method was further generalized to speech translation in MU-ST (Zhang et al., 2022a).
While MU-ST inspects the target prefix in the vocabulary domain, our work advances further by examining the distribution of target tokens.

Full-sentence MT and SiMT
In the context of a full-sentence translation task, given a translation pair x = (x_1, x_2, ..., x_N) and y = (y_1, y_2, ..., y_T), an encoder-decoder model such as the Transformer (Vaswani et al., 2017) maps x into hidden representations and then autoregressively decodes the target tokens. Generally, the model is optimized by minimizing the cross-entropy loss:

    L_MT = - sum_{t=1}^{T} log p(y_t | y_{<t}, x).    (1)
For the SiMT task, given that g(t) is a monotonic non-decreasing function representing the end timestamp of the source prefix that must be consumed to generate the t-th target token, the objective function of SiMT can be modified as follows:

    L_SiMT = - sum_{t=1}^{T} log p(y_t | y_{<t}, x_{<=g(t)}).    (2)
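As a toy, framework-free sketch (not the paper's implementation), both objectives amount to summing the negative log-probabilities that the model assigns to the reference tokens; the only difference between them is which source prefix each per-token probability is conditioned on:

```python
import math

def neg_log_likelihood(token_probs):
    """Cross-entropy (NLL) of a reference sequence.

    token_probs[t] stands in for the model probability of the t-th
    reference target token given the target history and whatever
    source prefix is available (the full source in MT, x_{<=g(t)}
    in SiMT).
    """
    return -sum(math.log(p) for p in token_probs)

# A confident model (high reference-token probabilities) gets low loss.
good = neg_log_likelihood([0.9, 0.8, 0.95])
poor = neg_log_likelihood([0.3, 0.2, 0.4])
```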

Wait-k Policy and Multi-Path Wait-k
Wait-k policy (Ma et al., 2018), the most widely used fixed policy, begins by reading k source tokens and then alternates between writing and reading one token. The function g(t) for the wait-k policy can be formally calculated as

    g(t) = min{k + t - 1, N}.    (3)

In multi-path wait-k training (Elbayad et al., 2020), k is sampled between batches from a candidate set K of k values. The main advantage of this method is its ability to make inferences under different latencies with a single model. Additionally, by adopting a unidirectional encoder, it can cache the encoder hidden states of the streaming input. Previous experiments have shown that its performance is comparable to multiple wait-k models trained with different k values.
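The wait-k schedule described above can be sketched as follows (a hypothetical helper for illustration, not the paper's code):

```python
def wait_k_g(t, k, src_len):
    """g(t) = min(k + t - 1, N): the number of source tokens consumed
    before writing the t-th target token (both 1-indexed)."""
    return min(k + t - 1, src_len)

# Wait-3 over a 6-token source: read 3 tokens up front, then
# alternate read/write until the source is exhausted.
path = [wait_k_g(t, k=3, src_len=6) for t in range(1, 8)]
# path == [3, 4, 5, 6, 6, 6, 6]
```

A multi-path model is trained over many such schedules at once by sampling k per batch, which is what lets a single model serve different latency levels at inference time.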

Motivation
Typically, a fixed k value is used when performing inference with a multi-path wait-k model, following the wait-k read/write path. As exemplified in Figure 1, any adaptive read/write path can be composed of subpaths of various wait-k paths. Given that a multi-path wait-k model is trained to perform prefix-to-prefix translation for different k values, we argue that such a model can also achieve competitive performance when used with an adaptive read/write policy.
To evaluate this hypothesis, we construct a SiMT system by combining a multi-path wait-k model with the adaptive policy from ITST (Zhang and Feng, 2022). Although ITST has an integrated translation module, we only utilize its policy module to make read/write decisions and rely on the multi-path wait-k model for translation. As illustrated in the BLEU/AL (BLEU score vs. Average Lagging) curves in Figure 3, the resulting ITST-guided multi-path wait-k approach achieves competitive performance in all latency settings. It consistently outperforms the original multi-path wait-k approach. When compared to the original ITST approach, the combined approach achieves significantly better BLEU scores in the low-latency region while obtaining comparable results in mid-to-high latency settings. This positive observation leads to a natural question: Can we develop a better adaptive policy for a multi-path wait-k model?

Divergence-based Read/Write Supervision
Learning an adaptive policy requires high-quality read/write supervision training data. It is unfeasible to collect manual annotations due to both the cost and complexity of the task. Instead, we draw inspiration from human simultaneous translation and propose to create such data automatically. We consider that a good write action should only occur when the partial source information is sufficient to make accurate translations, i.e., the translations should be similar to those made with the complete source input. To be precise, we want to quantify the divergence D(p_t^part, p_t^full) between the translation distribution given the partial source, p_t^part = p(y_t | y_{<t}, x_{<=g(t)}), and the other given the full source, p_t^full = p(y_t | y_{<t}, x), where the distributions can be computed using a well-trained offline translation model or simultaneous translation model. In this paper, we quantify D(., .) with three different divergence measures: Euclidean distance, KL-divergence, and cosine distance.

We utilize the divergence measures to automatically construct read/write supervisions from a parallel corpus used for MT training. For each parallel sentence pair, we compute a divergence matrix D, where each element D_{t,g(t)} = D(p_t^part, p_t^full), for all possible prefix-to-prefix pairs. We can then make read/write decisions by comparing D_{t,g(t)} with a threshold lambda:

    action = WRITE if D_{t,g(t)} <= lambda, otherwise READ.    (7)

Figure 4 shows an example divergence matrix and a highlighted read/write path. Varying the threshold results in a different latency. For each chosen threshold, we could construct read/write samples and train a separate read/write policy model; however, the resulting model would be specific to that threshold. Instead, we treat the divergence measures computed from parallel sentences as ground-truth values, and train a single adaptive read/write policy model f to predict the ground-truth values from the partial source and target pair:

    f(x_{<=g(t)}, y_{<t}) ~ D_{t,g(t)}.    (8)
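The construction above can be sketched as follows; `cosine_distance` is one of the three measures, and `read_write_path` greedily applies the threshold rule to a precomputed divergence matrix. All names and the toy matrix are illustrative, not from the released code:

```python
import math

def cosine_distance(p, q):
    """1 - cosine similarity between two distribution vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / (norm_p * norm_q)

def read_write_path(div, threshold):
    """Greedy path over a T x N divergence matrix.

    div[t][g-1] holds D(p_t^part, p_t^full) after reading g source
    tokens. At each target step: READ while the divergence exceeds
    the threshold (and source remains), then WRITE.
    """
    n_src = len(div[0])
    g, path = 1, []
    for row in div:
        while g < n_src and row[g - 1] > threshold:
            g += 1  # READ one more source token
        path.append(g)  # WRITE after consuming g source tokens
    return path

# Toy 3x3 divergence matrix (rows: target steps, cols: source prefixes).
D = [[0.5, 0.1, 0.0],
     [0.6, 0.3, 0.1],
     [0.4, 0.2, 0.05]]
path = read_write_path(D, threshold=0.2)
# path == [2, 3, 3]
```

Lowering the threshold forces more reads before each write, trading latency for translation quality; note that g(t) is monotone non-decreasing by construction.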

SiMT with Adaptive Policy
The architecture of our proposed DaP-SiMT model, as depicted in Figure 5, integrates a SiMT model with a divergence-based adaptive read/write policy network. In contrast to the design shown in Figure 2, which employs an independent policy, our approach incorporates a policy network that leverages the hidden states from the SiMT model's encoder and decoder as inputs, tailored explicitly to enable adaptive read/write decisions within the SiMT model. More specifically, the policy network includes an additional Transformer decoder layer placed atop the original SiMT decoder, followed by a regression head responsible for predicting divergence values. The incorporation of an extra functional decoder layer aligns with common practice in previous works on NMT (Li et al., 2022, 2023). The design of the regression head follows RoBERTa (Liu et al., 2019), featuring two linear layers with a tanh activation function sandwiched in between. In terms of the learning objective, we employ Mean Squared Error (MSE) for divergence measures based on Euclidean distance or KL-divergence, and binary cross-entropy with continuous labels for measures based on cosine distance.
During training, we only tune the parameters of the adaptive policy network while keeping the parameters of the multi-path wait-k model fixed. In the inference phase, we compare the predicted divergence values with a predefined threshold to make read/write decisions, following Equation 7, and can achieve varying latency levels by adjusting the threshold. Additionally, we have empirically observed that introducing another hyper-parameter to limit the maximum number of continuous READ operations for certain languages (see the analysis in Section 5.4.2 for the impact on different language pairs) results in a better balance between translation quality and latency. The inference process of DaP-SiMT is summarized in Algorithm 1 in Appendix D.
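The inference loop can be sketched as below. Here `predict_divergence` and `translate_step` are stand-in callables for the policy head and the frozen multi-path wait-k model; the names and the exact termination handling are illustrative assumptions rather than the paper's algorithm:

```python
def simt_decode(predict_divergence, translate_step, src_stream,
                threshold=0.2, max_reads=None):
    """Threshold-based read/write loop with an optional cap on the
    number of consecutive READ actions."""
    src, tgt = [], []
    while True:
        reads = 0
        # READ while predicted divergence is above the threshold,
        # source remains, and the consecutive-read cap is not hit.
        while (src_stream
               and predict_divergence(src, tgt) > threshold
               and (max_reads is None or reads < max_reads)):
            src.append(src_stream.pop(0))
            reads += 1
        token = translate_step(src, tgt)  # WRITE one target token
        if token == "<eos>":
            return tgt
        tgt.append(token)

# Demo with dummy callables: divergence stays high until two source
# tokens are read; the "model" then emits two tokens and <eos>.
stream = ["s1", "s2", "s3"]
out = simt_decode(lambda s, t: 1.0 if len(s) < 2 else 0.0,
                  lambda s, t: ["a", "b", "<eos>"][len(t)],
                  stream)
# out == ["a", "b"]
```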

Experiments
Datasets

WMT2022 Zh→En. We use a subset with 25M sentence pairs for training. We first tokenize the Chinese and English data using the Jieba Chinese Segmentation Tool and Moses, respectively, and then apply BPE with 32,000 merge operations. We employ a validation set of 956 sentence pairs from BSTC (Zhang et al., 2021a) as the test set.

WMT15 De→En. All 4.5M sentence pairs from this dataset are used for training, and are tokenized using 32K BPE merge operations. We use newstest2013 (3000 sentence pairs) for validation and report results on newstest2015 (2169 sentence pairs).

IWSLT15 En→Vi. All 133K sentence pairs from this dataset (Luong and Manning, 2015) are used for training. We use TED tst2012 (1553 sentence pairs) for validation and TED tst2013 (1268 sentence pairs) as the test set. Following the settings in (Ma et al., 2020b), we adopt word-level tokenization and replace rare tokens (frequency < 5) with <unk>. The vocabulary sizes are 17K for English and 7.7K for Vietnamese, respectively.

Settings
All our implementations are based on the Transformer (Vaswani et al., 2017) architecture and adapted from the Fairseq library (Ott et al., 2019). For the Zh→En experiments, we utilize the Transformer-Big architecture, while the Base and Small architectures are used for the De→En and En→Vi experiments, respectively. We use the cosine distance to calculate the read/write supervision signals in the main experiments, and investigate the effects of different divergence types in Section 5.4.2.
For evaluation, following ITST (Zhang and Feng, 2022), we report case-insensitive BLEU (Papineni et al., 2002) scores to assess translation quality and Average Lagging (AL/token) (Ma et al., 2018) to measure latency. Regarding the maximum number of continuous read actions in our method, we empirically select the best-performing configurations: no constraint for Zh→En, 4 for De→En, and no constraint for En→Vi.

Main Results
The performance of our method is compared to previous approaches on three language pairs in Figure 6. First, it is evident that the performance of a multi-path wait-k model can be significantly improved when guided by an adaptive read/write policy like ITST's, compared to using a fixed wait-k policy. This enhanced performance often closely matches or even surpasses that of ITST, the previous state-of-the-art SiMT model, particularly in low-latency settings for De→En translation. These results underscore the competitiveness and flexibility of prefix-to-prefix translation within the multi-path wait-k model, a potential that remains largely untapped with a fixed wait-k policy.
Secondly, our proposed DaP-SiMT approach significantly enhances the performance of the multi-path wait-k model, outperforming all other approaches. The divergence-based adaptive policy consistently surpasses the fixed wait-k policy across all latency levels. Furthermore, when compared to the adaptive policy in ITST, it achieves comparable results in the De→En scenario and superior performance in the Zh→En and En→Vi translations, all while using the same multi-path wait-k model as the translation model. This result suggests that the divergence-based approach not only surpasses the fixed wait-k policy but also competes effectively with state-of-the-art approaches like ITST, which features a closely integrated adaptive policy and translation model.

Analysis
In our analysis, we aim to provide a more in-depth understanding of our proposed approach. Unless otherwise stated, results are based on the Zh→En Transformer-Big model.

The NLL vs. AL Curve
In addition to the commonly used BLEU vs. AL curves that assess the quality of complete translations across different latency levels, we introduce a novel evaluation metric: the NLL vs. AL curve. This metric enables a qualitative measurement of the average impact of various read/write policies on translation quality at each translation step, all while utilizing the same translation model. To compute the NLL vs. AL curve, we begin with a read/write policy and the necessary hyper-parameters, such as k for the fixed wait-k approach and the threshold lambda for the divergence-based adaptive policy, which controls the latency level. Given a parallel sentence, we first derive the read/write path, denoted as g(1), g(2), ..., g(T), under the read/write policy, and then calculate the negative log-likelihood of the translation along the read/write path, following Eq. (2). By aggregating these NLL scores and their corresponding latency levels across an entire dataset, we can generate NLL vs. AL curves for any read/write policy.
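The per-sentence computation can be sketched as follows; `step_prob` is a toy stand-in for the SiMT model, not the paper's implementation:

```python
import math

def path_nll(step_prob, path):
    """Average NLL of the reference translation along a read/write path.

    step_prob(t, g) stands in for p(y_t | y_<t, x_<=g), the SiMT
    model's probability of the t-th reference token after reading
    g source tokens; path is the list g(1), ..., g(T).
    """
    total = -sum(math.log(step_prob(t, g))
                 for t, g in enumerate(path, start=1))
    return total / len(path)

# A policy that waits longer (larger g) typically sees higher
# reference-token probabilities, hence a lower NLL at higher latency.
nll = path_nll(lambda t, g: 0.5, [1, 2, 3])
```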
Figure 7 provides a comparative analysis of NLL vs. AL curves for four read/write policies: wait-k, ITST, and two variations of DaP-SiMT. The first DaP-SiMT variant is based on divergence values predicted by the policy model, while the second relies on ground-truth divergence values computed using full sentences. The results underscore the effectiveness of our DaP-SiMT approach, as it consistently yields substantially lower NLL scores than the fixed wait-k policy at equivalent latency levels. This confirms that the SiMT model is more adept at accurately predicting the correct translation along the resulting read/write paths. Furthermore, the DaP-SiMT approach exhibits a lower NLL curve than ITST's policy model, aligning with the trends observed in the BLEU vs. AL curves depicted in Figure 6. Notably, the DaP-SiMT variant employing ground-truth divergence values has a much lower curve than its counterpart based on predicted divergence values. This suggests that there is potential for further improvement through better modeling of divergence supervision signals.

Ablation Study
Effect of the translation model for divergence supervision In our main experiment, we utilized an offline translation model to calculate the divergence supervision values. Here, we assess the impact of utilizing a SiMT model for this purpose. Figure 8(a) illustrates that supervision signals computed by the multi-path wait-k model result in almost identical performance to those obtained from the offline model.
Effect of the divergence measures We examine the influence of various divergence measures, including Euclidean distance, KL-divergence, and cosine distance, on the performance of DaP-SiMT. As depicted in Figure 8(b), the almost identical curves suggest that our approach is not sensitive to the choice of divergence measure.
Effect of the number of layers for the policy net To examine whether a single additional decoder layer can effectively model the divergence supervision signals, we conducted comparative experiments using either 0 or 3 additional decoder layers. Figure 8(c) demonstrates that configurations with 1 or 3 additional decoder layers yield similar results. Although the configuration with 0 additional decoder layers does not perform as strongly, it still manages to achieve a reasonable balance between accuracy and latency.
Effect of the max continuous READ constraint As discussed in Section 4.3, we introduce a constraint on the maximum number of continuous reads during inference, forcing a write action after reaching the limit. Figure 8(d) shows that this constraint has varying impacts on different language pairs. In the cases of Zh→En and En→Vi, this hyper-parameter has minimal influence on results, especially in the low-latency region, which is a primary focus in SiMT. However, for De→En, a substantial improvement is observed with the introduction of this constraint.
We hypothesize that this difference is related to the modeling difficulty of language pairs with varying degrees of word order variation. As quantified in Appendix B, the De→En translation direction exhibits the highest anticipation rate among the three language pairs and demonstrates the most significant divergence in word order (Wang et al., 2023). Consequently, it naturally requires more read actions for an accurate De→En translation, which is reflected in the distribution of divergence supervision signals and subsequently influences the learned policy model. By adjusting the threshold lambda to achieve low latency, we inadvertently exacerbate the negative effects of exposure bias, resulting in excessive reads, as observed in our experiments. Introducing the maximum number of continuous reads serves as an ad-hoc solution to address this challenge, and we leave a thorough investigation of this issue to future research.

Upper Bound of DaP-SiMT
We evaluate the upper-bound performance of DaP-SiMT to study the impact of modeling errors within the policy model. Specifically, we substitute the model-predicted divergence scores with the ground-truth divergence D(p_t^part, p_t^full) calculated using the complete sentence during DaP-SiMT inference, following the procedure outlined in Algorithm 1 in Appendix D. As illustrated in Figure 9, in line with the findings from Section 5.4.1, the upper-bound performance of DaP-SiMT is substantially higher than that achieved with the learned policy model. This observation highlights the potential for further improvement in policy modeling.

Examples
We provide several examples in Appendix A comparing the divergence matrix predicted by the learned policy model with the ground truth. Although there are some discrepancies, the predicted divergence matrix closely resembles the ground-truth matrix and can be used to make reasonable read/write decisions.

Conclusion
In this paper, we introduce a divergence-based adaptive policy for SiMT, which makes read/write decisions based on the potential divergence in translation distributions resulting from future information. Our approach extends a frozen multi-path wait-k translation model with lightweight parameters for the policy model, making it memory and computation efficient. Experimental results across various benchmarks demonstrate that our approach provides an improved trade-off between translation accuracy and latency compared to strong baselines. We hope that our approach can inspire a novel perspective on simultaneous translation.

Limitations
Our evaluation primarily focused on assessing the impact of the proposed adaptive policy on simultaneous translation using BLEU vs. AL and NLL vs. AL curves. However, we acknowledge that intrinsic evaluations of the policy model itself are lacking, and further investigation in this area is necessary to guide improvements. We provided only a limited exploration of modeling variations for the policy model, leaving room for more in-depth analysis and enhancements. It is worth noting that while the threshold parameter lambda controls latency, it does not have a direct one-to-one relationship with latency, as is the case with the fixed wait-k policy. This nuanced aspect requires careful consideration in future investigations.

B Anticipation Rate
Anticipation occurs in simultaneous translation when a target word is generated prior to the receipt of its corresponding source word. To detect instances of anticipation, accurate word alignment between the paired sentences is needed. Fast-align (Dyer et al., 2013) is employed to obtain the word alignment a between a source sentence X and a target sentence Y. The resulting alignments comprise a set of source-target word index pairs (s, t), where the s-th source word x_s aligns with the t-th target word y_t. A target word y_t is k-anticipated (A_k(t, a) = 1) if it aligns to at least one source word x_s where s >= t + k:

    A_k(t, a) = 1 if there exists (s, t) in a with s >= t + k, and 0 otherwise.

The k-anticipation rate (AR_k) of an (X, Y, a) triple is further defined as follows:

    AR_k = (1/T) sum_{t=1}^{T} A_k(t, a).

The anticipation rates (AR%) of different language pairs are shown in Table 2, from which we can find that the De→En translation task exhibits greater word order differences than the other two cases.
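Assuming 1-indexed alignment pairs (s, t) as defined above, the computation can be sketched as:

```python
def anticipation_rate(alignment, tgt_len, k):
    """Fraction of target positions that are k-anticipated.

    alignment is a set of 1-indexed (s, t) pairs, e.g. from
    fast-align; y_t is k-anticipated if it aligns to some x_s
    with s >= t + k.
    """
    anticipated = {t for (s, t) in alignment if s >= t + k}
    return len(anticipated) / tgt_len

# Toy alignment: y_1 aligns to x_3, y_2 to x_1, y_3 to x_3.
a = {(3, 1), (1, 2), (3, 3)}
rate = anticipation_rate(a, tgt_len=3, k=2)
# rate == 1/3: only y_1 aligns to a source word with s >= t + 2
```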

C How Sensitive Is the AL to Thresholds During Inference?

Table 1 exhibits the sensitivity of the AL to thresholds during inference based on different settings of divergence types. It can be observed that the AL is not particularly sensitive to the threshold overall, which makes the process of determining the threshold straightforward.

D Algorithm
The inference process of DaP-SiMT is summarized in Algorithm 1.

E Numerical Results
The numerical results are presented in Tables 3, 4, and 5.

Figure 4 :
Figure 4: Example of a Zh→En divergence matrix D using cosine distance, where D_{t,g(t)} = D(p_t^part, p_t^full). A potential read/write path is indicated by the red elements in the matrix and can be determined based on a predefined threshold (0.2 in this case).

Figure 5 :
Figure 5: Architecture of the DaP-SiMT approach. It is equivalent to adding an extra decoder layer to the original SiMT model. The output of the extra decoder layer is fed into a regression head to determine the read/write action.

Figure 6 :
Figure 6: Comparison of BLEU vs. AL curves between multi-path (abbreviated as Mp) wait-k, ITST, ITST-guided multi-path wait-k, and our proposed DaP-SiMT approach on three language pairs.

Figure 7 :
Figure 7: NLL vs. AL curves comparing four read/write policies utilizing the same SiMT model. "DaP-SiMT Ground Truth" indicates read/write paths derived from ground-truth divergence values calculated using the full sentence, while "DaP-SiMT Prediction" is based on divergence values predicted by the policy model.

Figure 8 :
Figure 8: Ablation studies on the proposed DaP-SiMT method

Figure 9 :
Figure 9: BLEU vs. AL curves comparing DaP-SiMT with ground-truth divergence against standard DaP-SiMT with predicted divergence.
Divergence matrix examples are shown in Figures 10, 11, and 12.

Table 1 :
The sensitivity of the AL to the threshold during inference based on different settings of divergence types.

Table 4 :
Numerical results for the number of extra decoder layers (Figure 8(c)): 0, 1, and 3 extra decoder layers.