Simultaneous Machine Translation with Tailored Reference

Simultaneous machine translation (SiMT) generates the translation while still reading the source sentence. However, existing SiMT models are typically trained using the same reference, disregarding the varying amounts of source information available at different latency. Training the model with ground-truth at low latency may introduce forced anticipations, whereas utilizing a reference consistent with the source word order at high latency results in performance degradation. Consequently, it is crucial to train the SiMT model with an appropriate reference that avoids forced anticipations during training while maintaining high quality. In this paper, we propose a novel method that provides a tailored reference for SiMT models trained at different latency by rephrasing the ground-truth. Specifically, we introduce the tailor, induced by reinforcement learning, to modify the ground-truth into the tailored reference. The SiMT model is trained with the tailored reference and jointly optimized with the tailor to enhance performance. Importantly, our method is applicable to a wide range of current SiMT approaches. Experiments on three translation tasks demonstrate that our method achieves state-of-the-art performance in both fixed and adaptive policies.


Introduction
Simultaneous machine translation (SiMT) (Gu et al., 2017; Ma et al., 2019, 2020) generates the target sentence while reading in the source sentence. Compared to full-sentence translation, it faces a greater challenge because it has to make trade-offs between latency and translation quality (Zhang and Feng, 2022a). In applications, it needs to meet different latency tolerances, such as online conferences and real-time subtitles. Therefore, the SiMT models trained at different latency should exhibit excellent translation performance.

Figure 1: An example of a Chinese-English parallel sentence, with ground-truth "It is an enjoyable activity to watch a movie." and tailored reference "Watching a movie is an enjoyable activity." The SiMT model will be forced to predict 'an enjoyable activity' before reading the corresponding source tokens. In contrast, the tailored reference avoids forced anticipations while maintaining the original semantics.
Using an inappropriate reference to train the SiMT model can significantly impact its performance, and the optimal reference varies with the latency at which the model is trained. Under high latency, it is reasonable to train the SiMT model with ground-truth, since the model can leverage sufficient source information (Zhang and Feng, 2022c). Under low latency, however, the model is constrained by limited source information and thus requires a reference consistent with the source word order (Chen et al., 2021). Therefore, the SiMT model should be trained with a correspondingly appropriate reference under each latency.
However, existing SiMT methods, which employ fixed or adaptive policies, commonly utilize only the ground-truth for training across different latency settings. Under a fixed policy (Ma et al., 2019; Elbayad et al., 2020; Zhang and Feng, 2021), the model generates translations based on predefined rules, so SiMT models are often forced to anticipate target tokens with insufficient information or to wait for unnecessary source tokens. Under an adaptive policy (Ma et al., 2020; Miao et al., 2021; Zhang and Feng, 2022b), the model can adjust its translation policy based on the translation status. Nevertheless, the policy learning of the SiMT model gradually adapts to the given reference (Zhang et al., 2020). Consequently, employing only the ground-truth for SiMT models trained at varying latency levels can hurt overall performance, as it forces them to learn an identical policy. Furthermore, Chen et al. (2021) adopt an offline approach that generates reference with a full-sentence model for training the SiMT model at different latency, but this imposes an upper bound on the performance of the SiMT model. Therefore, it is necessary to provide a high-quality and appropriate reference for models with different latency.
On these grounds, we aim to dynamically provide an appropriate reference for training SiMT models at different latency. In SiMT, the source information available to the translation model varies with latency (Ma et al., 2019). The appropriate reference should therefore allow the model to predict target tokens accurately from the available information. Otherwise, it results in forced anticipations, where the model must predict target tokens in the reference from insufficient source information (Guo et al., 2023). To explore the extent of forced anticipations when training the SiMT model with ground-truth at different latency, we introduce the anticipation rate (AR) (Chen et al., 2021). As shown in Table 1, the anticipation rate decreases as the SiMT model is trained at higher latency. Consequently, the reference requirements of the SiMT model vary with latency: the appropriate reference should avoid forced anticipations during training while maintaining high quality. We therefore propose to dynamically tailor the reference, called the tailored reference, to the latency used to train the SiMT model, thereby reducing forced anticipations. We present an intuitive example of a tailored reference in Figure 1. It avoids forced predictions during training while keeping the semantics consistent with the original sentence.
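Concretely, the anticipation rate counts the target tokens whose aligned source token has not yet been read under the policy in force. A minimal sketch of this computation, assuming word alignments are available and a Wait-k reading schedule (the function name and interface are illustrative, not from the paper's code):

```python
def anticipation_rate(alignment, k, src_len):
    """Fraction of target tokens that must be anticipated under Wait-k.

    alignment: for each target position i (0-based), the 1-based index
    of its aligned source token. A token is a forced anticipation when
    its aligned source token lies beyond the g_i tokens read so far.
    """
    anticipated = 0
    for i, a_i in enumerate(alignment):
        g_i = min(k + i, src_len)  # source tokens read before emitting y_{i+1}
        if a_i > g_i:
            anticipated += 1
    return anticipated / len(alignment)
```

For instance, with alignment [1, 4, 2] and k = 1 over a 4-token source, only the second target token is anticipated, giving an AR of 1/3.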
Therefore, we propose a new method for providing a tailored reference to SiMT models at different latency. To accomplish this, we introduce the tailor, a shallow non-autoregressive Transformer decoder (Gu et al., 2018), to modify the ground-truth into the tailored reference. Since there is no explicit supervision to train the tailor, we quantify the requirements for the tailored reference as two reward functions and optimize them using reinforcement learning (RL). On the one hand, the tailored reference should avoid forced anticipations, ensuring that the word reorderings between it and the source sentence can be handled by the SiMT model trained at that latency. To achieve this, the tailor learns from the non-anticipatory reference corresponding to that latency, which can be generated by applying the Wait-k policy to a full-sentence model (Chen et al., 2021). On the other hand, the tailored reference should maintain high quality, which can be achieved by encouraging the tailor to learn from the ground-truth. Therefore, we measure the similarity between the output of the tailor and both the non-anticipatory reference and the ground-truth, assigning them as separate rewards. The tailor is optimized by striking a balance between these two rewards. During training, the SiMT model takes the output of the tailor as its objective and is jointly optimized with the tailor. Additionally, our method is applicable to a wide range of SiMT approaches. Experiments on three translation tasks demonstrate that our method achieves state-of-the-art performance in both fixed and adaptive policies.

Background
For a SiMT task, the model reads in the source sentence x = (x_1, ..., x_J) with length J and generates the target sentence y = (y_1, ..., y_I) with length I according to its policy. To describe the policy, we define the number of source tokens read in when translating y_i as g_i. The policy can then be represented as g = (g_1, ..., g_I), and the SiMT model is trained by minimizing the cross-entropy loss:

$$\mathcal{L}_{simt} = -\sum_{i=1}^{I} \log p\left(y^{\star}_{i} \mid \mathbf{x}_{\leq g_i}, \mathbf{y}_{<i}\right),$$

where y^⋆_i represents the ground-truth token. Our approach involves the Wait-k policy (Ma et al., 2019), HMT (Zhang and Feng, 2023b) and CTC training (Libovický and Helcl, 2018), so we briefly introduce them.
Wait-k policy As the most widely used fixed policy, Wait-k first reads k tokens and then alternates between writing and reading one token. It can be formalized as:

$$g_i = \min\{k + i - 1, \; J\},$$

where J indicates the length of the source sentence.
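The reading schedule of Wait-k can be sketched in a few lines (an illustrative helper, not from the released code):

```python
def wait_k_policy(k, src_len, tgt_len):
    """g_i = min(k + i - 1, J) for 1-based target positions i."""
    return [min(k + i - 1, src_len) for i in range(1, tgt_len + 1)]
```

For k = 3 with a 5-token source and 4 target tokens this yields [3, 4, 5, 5]: the model waits for 3 tokens, then reads one source token per target token until the source is exhausted.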
HMT Hidden Markov Transformer (HMT), which derives from the Hidden Markov Model, is the current state-of-the-art SiMT model. It treats the translation policy g as hidden events and the target sentence y as observed events. During training, HMT learns when to generate the translation by minimizing the negative log-likelihood of the observed events over all possible policies:

$$\mathcal{L}_{hmt} = -\log \sum_{\mathbf{g}} p(\mathbf{y}, \mathbf{g} \mid \mathbf{x}).$$

CTC CTC (Graves et al., 2006) is applied in non-autoregressive translation (NAT) (Gu et al., 2018) due to its remarkable performance and because it needs no length predictor. A CTC-based NAT model first generates a sequence containing repetitive and blank tokens, and then reduces it to a normal sentence with the collapse function Γ⁻¹. During training, CTC considers all possible sequences a that can be reduced to y by Γ⁻¹:

$$\mathcal{L}_{ctc} = -\log \sum_{\mathbf{a}:\, \Gamma^{-1}(\mathbf{a}) = \mathbf{y}} p(\mathbf{a} \mid \mathbf{x}),$$

where p(a | x) is modeled by the NAT architecture.
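The collapse function Γ⁻¹ can be sketched as follows: it merges adjacent repeated tokens and then removes blanks, so a repeat survives only when a blank separates the two copies (a standard CTC collapse; the token representation is illustrative):

```python
BLANK = "<b>"  # illustrative blank symbol

def collapse(a):
    """Gamma^-1: merge adjacent repeats, then drop blank tokens."""
    out, prev = [], None
    for tok in a:
        if tok != prev and tok != BLANK:
            out.append(tok)
        prev = tok
    return out
```

For example, ['w', 'w', '&lt;b&gt;', 'a', 'a', 't'] collapses to ['w', 'a', 't'], while a blank between two copies of 'a' keeps both.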

Method
In this section, we introduce the architecture of our model, which incorporates the tailor into the SiMT model. To train the SiMT model with the tailor, we present a three-stage training method, in which the SiMT model benefits from training with the tailored reference and is optimized together with the tailor. During inference, the SiMT model generates the translation according to its policy. The details are introduced in the following subsections.

Model Architecture
We present the architecture of our method in Figure 2. Alongside the encoder and decoder, our method introduces the tailor module, which generates a tailored reference for the SiMT model using the ground-truth as its input. For efficiency of generation, the tailor adopts the structure of a non-autoregressive Transformer decoder (Vaswani et al., 2017). To enable the tailor to generate a tailored reference that is not limited by the length of the ground-truth, it first upsamples the ground-truth. It then cross-attends to the output of the encoder and modifies the ground-truth while considering the word order of the source sentence. Finally, the output of the tailor is transformed into the tailored reference by eliminating repetitive and blank tokens (Libovický and Helcl, 2018). The tailored reference replaces the ground-truth as the training objective for the SiMT model. Given the lack of explicit supervision for training the tailor, we quantify the requirements for the tailored reference into two rewards and optimize the model through reinforcement learning. We propose a three-stage training method for the SiMT model with the tailor, the details of which are presented in the next subsection.

Training Method
After incorporating the tailor into the SiMT model, it is essential to train the SiMT model with the assistance of the tailor to obtain better performance. In light of this, we quantify the requirements of the tailored reference into two rewards and propose a novel three-stage training method. First, we train the SiMT model using the ground-truth to equip it with good translation capability. Subsequently, we use a pre-training strategy to train the tailor, enabling it to establish a favorable initial state and converge faster. Finally, we fine-tune the tailor by optimizing the two rewards using reinforcement learning, where the output of the tailor, after being reduced, serves as the training target for the SiMT model. In the third stage, the tailor and the SiMT model are jointly optimized and share the output of the encoder. Next, we describe our three-stage training method in detail.
Training the Base Model In our architecture, the tailor cross-attends to the output of the encoder to adjust the ground-truth based on source information. As a result, before training the tailor module, we need a well-trained SiMT model as the base model. In our method, we choose the Wait-k policy (Ma et al., 2019) and the HMT model (Zhang and Feng, 2023b) as the base models for the fixed and adaptive policies, respectively. The base model is trained using the cross-entropy loss. Once the training of the base model is completed, we optimize the tailor module, which provides the tailored reference for SiMT models trained across different latency settings.

Pre-training Tailor
The tailor adopts the architecture of a non-autoregressive decoder (Gu et al., 2018), which has demonstrated excellent performance (Qian et al., 2020; Huang et al., 2022). Importantly, it enables the simultaneous generation of target tokens across all positions, making it highly efficient for reinforcement learning. However, if we train the tailor with reinforcement learning directly, it converges to a suboptimal state in which the token at each position is some frequent token (Shao et al., 2022). This behavior is attributed to the exploration-based nature of reinforcement learning, highlighting the need for a favorable initial state for the model (Lopes et al., 2012). Since the tailored reference is modified from the ground-truth, we let the tailor learn from the ground-truth during pre-training and then fine-tune it using reinforcement learning. The details of the pre-training stage are introduced below.
To keep the output of the tailor from being limited by the length of the ground-truth, the tailor upsamples the ground-truth to obtain its input, denoted as y′. During training, the CTC loss (Libovický and Helcl, 2018) is used to optimize the tailor.
Denoting the output of the tailor as a = (a_1, ..., a_T), the probability distribution modeled by the tailor can be presented as:

$$p_a(\mathbf{a} \mid \mathbf{x}, \mathbf{y}', \theta) = \prod_{t=1}^{T} p(a_t \mid \mathbf{x}, \mathbf{y}', \theta),$$

where T is the output length of the tailor (a multiple of the length of y) and θ denotes the parameters of the tailor. Subsequently, we obtain the normal sequence s by applying the collapse function Γ⁻¹ to a, and the distribution of s is calculated by considering all possible a:

$$p_s(\mathbf{s} \mid \mathbf{x}, \mathbf{y}', \theta) = \sum_{\mathbf{a}:\, \Gamma^{-1}(\mathbf{a}) = \mathbf{s}} p_a(\mathbf{a} \mid \mathbf{x}, \mathbf{y}', \theta).$$

To make the tailor learn from the ground-truth, it is optimized by minimizing the negative log-likelihood:

$$\mathcal{L}_{tailor} = -\log p_s(\mathbf{y}^{\star} \mid \mathbf{x}, \mathbf{y}', \theta),$$

which can be efficiently calculated through dynamic programming (Graves et al., 2006).
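The marginal p_s(s | x, y′, θ) sums over exponentially many alignments, but the standard CTC forward algorithm computes it in O(T·|s|) time. A self-contained sketch over explicit log-probability tables (the real model would supply these from the non-autoregressive decoder; names and interfaces are illustrative):

```python
import math

def ctc_log_prob(log_probs, target, blank=0):
    """log p(target) summed over all alignments that collapse to target,
    via the CTC forward algorithm (assumes a non-empty target).

    log_probs: T x V table of per-position log-probabilities; positions
    are independent, as in a non-autoregressive decoder.
    """
    ext = [blank]
    for c in target:          # extended sequence: blank, y1, blank, y2, ...
        ext += [c, blank]
    S, NEG = len(ext), float("-inf")
    alpha = [NEG] * S
    alpha[0] = log_probs[0][blank]
    alpha[1] = log_probs[0][ext[1]]
    for t in range(1, len(log_probs)):
        new = [NEG] * S
        for s in range(S):
            cands = [alpha[s]]                       # stay on the same slot
            if s > 0:
                cands.append(alpha[s - 1])           # advance one slot
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(alpha[s - 2])           # skip over a blank
            m = max(cands)
            if m > NEG:
                new[s] = (m + math.log(sum(math.exp(c - m) for c in cands))
                          + log_probs[t][ext[s]])
        alpha = new
    m = max(alpha[-1], alpha[-2])
    if m == NEG:
        return NEG
    return m + math.log(math.exp(alpha[-1] - m) + math.exp(alpha[-2] - m))
```

With two positions and uniform probability 0.5 over {blank, token 1}, the alignments (1, 1), (1, blank) and (blank, 1) all collapse to the single token 1, so the marginal probability is 0.75.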
RL Fine-tuning After completing pre-training, the tailor is already in a favorable initial state. We quantify the requirements for the tailored reference as two rewards and fine-tune the tailor using reinforcement learning. We now introduce the two reward functions.
On the one hand, the tailored reference should not force the model to predict target tokens before reading the corresponding source information, meaning that the SiMT model can handle the word reorderings between the tailored reference and the source sentence at that latency (Zhang et al., 2022). Therefore, we make the tailor learn from the non-anticipatory reference y^na, which is generated by applying the corresponding Wait-k policy to the full-sentence model. It has a word order that matches the latency while maintaining the original semantics (Chen et al., 2021). We employ the reward R_na to measure the similarity between the output of the tailor and the non-anticipatory reference. On the other hand, the tailored reference should remain faithful to the ground-truth. We introduce the reward R_gt to measure the similarity between the ground-truth and the output of the tailor. By striking an appropriate balance between R_na and R_gt, we can obtain the tailored reference.
Given the output a of the tailor, we obtain the normal sentence s by removing the repetitive and blank tokens (Libovický and Helcl, 2018). We use BLEU (Papineni et al., 2002) to measure the similarity between two sequences, so R_na and R_gt for the output of the tailor are calculated as:

$$R_{na}(\mathbf{s}) = \mathrm{BLEU}(\mathbf{s}, \mathbf{y}^{na}), \qquad R_{gt}(\mathbf{s}) = \mathrm{BLEU}(\mathbf{s}, \mathbf{y}^{\star}).$$

Based on these two rewards, we obtain the final reward R by balancing R_na and R_gt:

$$R(\mathbf{s}) = \alpha\, R_{na}(\mathbf{s}) + (1 - \alpha)\, R_{gt}(\mathbf{s}),$$

where α ∈ [0, 1] is a hyperparameter. Subsequently, we use the REINFORCE algorithm (Williams, 1992) to optimize the final reward and obtain the tailored reference:

$$\nabla_{\theta}\, \mathbb{E}_{\mathbf{s} \sim p_s}\left[R(\mathbf{s})\right] = \mathbb{E}_{\mathbf{s} \sim p_s}\left[\nabla_{\theta} \log p_s(\mathbf{s} \mid \mathbf{x}, \mathbf{y}', \theta)\, R(\mathbf{s})\right],$$

where θ denotes the parameters of the tailor. During training, we sample the sequence a from the distribution p_a(a | x, y′, θ) with the Monte Carlo method. As the tailor adopts a non-autoregressive structure in which all positions are independent of each other, we can sample tokens for all positions concurrently. We then apply the collapse function to a to obtain the normal sequence s, which is used to compute the reward R(s) and to update the tailor with the estimated gradient ∇_θ log p_s(s | x, y′, θ) R(s). The computation of p_s(s | x, y′, θ) is accelerated with dynamic programming. Additionally, we adopt the baseline reward strategy to reduce the variance of the estimated gradient (Weaver and Tao, 2001).
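The reward combination and the baseline trick can be sketched as follows, taking precomputed sentence-level BLEU scores as input (function names are illustrative; in practice the scores come from comparing each sampled, collapsed sequence s against y^na and the ground-truth):

```python
def final_reward(bleu_na, bleu_gt, alpha=0.2):
    """R(s) = alpha * R_na(s) + (1 - alpha) * R_gt(s)."""
    return alpha * bleu_na + (1 - alpha) * bleu_gt

def reinforce_weights(samples, alpha=0.2):
    """Per-sample scalars (R(s) - b) that weight the REINFORCE gradient
    grad log p_s(s | x, y', theta); b is the batch-mean baseline used to
    reduce variance (Weaver and Tao, 2001).

    samples: list of (bleu_vs_non_anticipatory, bleu_vs_ground_truth).
    """
    rewards = [final_reward(na, gt, alpha) for na, gt in samples]
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]
```

With α = 0.2, a sample scoring (50, 30) receives reward 34; centering by the batch mean leaves the gradient estimator unbiased while shrinking its variance.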
In this stage, we utilize reinforcement learning to optimize the final reward R(s) and train the SiMT model with the tailored reference using the loss $\mathcal{L}_{t\_simt}$. As a result, the SiMT model and the tailor are jointly optimized to enhance performance.

Datasets
We evaluate our method on three translation tasks.
IWSLT15 English→Vietnamese (En→Vi) (Cettolo et al., 2015) We use TED tst2012 as the development set and TED tst2013 as the test set. In line with Ma et al. (2020), we replace tokens occurring fewer than 5 times with ⟨unk⟩. Consequently, the vocabulary sizes of English and Vietnamese are 17K and 7.7K, respectively.
WMT16 English→Romanian (En→Ro) We use newsdev-2016 as the development set and newstest-2016 as the test set. The source and target languages employ a shared vocabulary. Other settings are consistent with Gu et al. (2018).

WMT15 German→English (De→En) Following Ma et al. (2020), we use newstest2013 as the development set and newstest2015 as the test set. We apply BPE (Sennrich et al., 2016) with 32K subword units and a shared vocabulary between source and target.

System Settings
Our experiments involve the following methods and we briefly introduce them.
Full-sentence model is the conventional full-sentence machine translation model.
Wait-k policy (Ma et al., 2019) initially reads k tokens and then alternates between writing and reading a source token.
Multi-path (Elbayad et al., 2020) introduces the unidirectional encoder and trains the model by sampling the latency k.
Adaptive Wait-k (Zheng et al., 2020) composes multiple Wait-k models through a heuristic method to achieve an adaptive policy.
MMA (Ma et al., 2020) makes each head determine the translation policy by predicting the Bernoulli variable.
MoE Wait-k (Zhang and Feng, 2021), the current state-of-the-art fixed policy, treats each head as an expert and integrates the decisions of all experts.
PED (Guo et al., 2022) implements the adaptive policy via integrating post-evaluation into the fixed translation policy.
BS-SiMT (Guo et al., 2023) constructs the optimal policy online via binary search.
ITST (Zhang and Feng, 2022b) treats the translation as information transport from source to target.
HMT (Zhang and Feng, 2023b) models simultaneous machine translation as a Hidden Markov Model, and achieves the current state-of-the-art performance in SiMT.
*+100% Pseudo-Refs (Chen et al., 2021) trains the Wait-k model with ground-truth and pseudo reference, which is generated by applying Wait-k policy to the Full-sentence model.
*+Top 40% Pseudo-Refs (Chen et al., 2021) keeps only the pseudo references in the top 40% by quality to train the model together with ground-truth.
Wait-k + Tailor applies our method to Wait-k. HMT + Tailor applies our method to HMT.

All systems are based on the Transformer architecture (Vaswani et al., 2017) and adapted from the Fairseq library (Ott et al., 2019). We apply Transformer-Small (6 layers, 4 heads) for the En→Vi task and Transformer-Base (6 layers, 8 heads) for the En→Ro and De→En tasks. Since PED and Adaptive Wait-k do not report results on the En→Ro task, we do not compare against them on it. For our method, the tailor adopts a non-autoregressive decoder structure with 2 layers. We empirically set the hyperparameter α to 0.2. The non-anticipatory reference used for RL fine-tuning of the SiMT model is generated by the Test-time Wait-k method (Ma et al., 2019) with the corresponding latency. Other system settings are consistent with Ma et al. (2020) and Zhang and Feng (2023b); detailed settings are shown in Appendix C. We use greedy search during inference and evaluate all methods with translation quality measured by BLEU (Papineni et al., 2002) and latency measured by Average Lagging (AL) (Ma et al., 2019).
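For reference, Average Lagging can be computed from a policy g as below (an illustrative re-implementation of the metric of Ma et al. (2019), not the Fairseq evaluation script):

```python
def average_lagging(g, src_len, tgt_len):
    """AL = (1/tau) * sum_{i<=tau} (g_i - (i - 1) / gamma), where
    gamma = tgt_len / src_len and tau is the first target position at
    which the full source has been read.

    g[i] is the number of source tokens read before emitting target
    token i + 1 (0-based list).
    """
    gamma = tgt_len / src_len
    tau = next(i + 1 for i, g_i in enumerate(g) if g_i >= src_len)
    return sum(g[i] - i / gamma for i in range(tau)) / tau
```

For a Wait-3 policy over equal-length sentences, g = [3, 4, 5, 5, 5] gives AL = 3, matching the intuition that the model lags k tokens behind the speaker.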

Main Results
The performance comparison between our method and other SiMT approaches on the three translation tasks is illustrated in Figures 3 and 4. Our method achieves state-of-the-art translation performance in both fixed and adaptive policies. When compared with other training methods in Figure 5, our approach also achieves superior performance.
When the most commonly used Wait-k policy is selected as the base model, our method outperforms MoE Wait-k, the current state-of-the-art fixed policy, and brings significant improvement especially under low latency. The Wait-k policy is trained on ground-truth and cannot be adjusted, which may force the model to predict tokens before reading the corresponding source information (Ma et al., 2019). In contrast, our method provides a tailored reference for the SiMT model, thereby alleviating forced anticipations. Our method also exceeds Multi-path and MoE Wait-k: these two methods are trained with multiple Wait-k policies (Elbayad et al., 2020) and gain the ability to translate under multiple latency settings (Zhang and Feng, 2021), but they still utilize the ground-truth at all latency, leading to lower performance.
Our method can further enhance SiMT performance when an adaptive policy is selected as the base model. As the current state-of-the-art adaptive policy, HMT can dynamically adjust its policy to balance latency and translation quality (Zhang and Feng, 2023b). However, it still relies on the ground-truth for training SiMT models across different latency settings. By providing a tailored reference that matches the latency, our method alleviates the latency burden of the SiMT model, resulting in state-of-the-art performance.
Our method also surpasses other training approaches. The ground-truth is not suitable for incremental input due to word reorderings, resulting in performance degradation (Zhang and Feng, 2022b). In contrast, the pseudo reference can avoid forced anticipations during training (Chen et al., 2021). However, it is constructed offline by applying the Wait-k policy to the full-sentence model, which imposes an upper bound on the performance of the SiMT model. The tailored reference avoids forced anticipations while maintaining high quality, leading to the best performance.
In addition to enhancing translation performance, our method effectively narrows the gap between fixed and adaptive policies. By leveraging our method, the SiMT model can achieve performance comparable to full-sentence translation with lower latency on the En→Vi and De→En tasks.

Analysis
To gain a comprehensive understanding of our method, we conduct multiple analyses. All of the following results are reported on the De→En task.

Ablation Study
We conduct ablation studies on the structure and training method of the tailor to investigate the influence of different settings. All experiments use the Wait-k model as the base model with k set to 3. Table 2 presents a comparison of different structures: the best performance is achieved when the tailor has 2 layers, and both excessive and insufficient layers hurt performance. Table 3 shows the ablation study on the training method. Each stage of the training method contributes to the performance of the SiMT model, and the training stage of the base model has the most significant impact. This can be attributed to the fact that a well-trained encoder provides accurate source information to the tailor, enabling the generation of appropriate tailored references. Additionally, our method yields the best performance when α is set to 0.2, indicating an optimal balance between word order and quality for the tailor.

Analysis of Tailored Reference
Anticipation Rate We further analyze the tailored reference to assess its influence. We first explore the rate of forced anticipations caused by using different references during training. Using the anticipation rate (AR) (Chen et al., 2021) as the metric, the results in Table 4 (AR[%] with ground-truth: 28.17, 8.68, 3.49, 1.12; with the tailored reference: 19.84, 8.29, 2.98, 0.90, at increasing latency) show that the tailored reference effectively reduces forced anticipations during the training of the SiMT model under all latency. This implies that, compared to the ground-truth, the word reorderings between the tailored reference and the source sentence can be handled more effectively by the SiMT model at different latency.
Quality A natural concern is whether the quality of the tailored reference deteriorates, like that of the non-anticipatory reference, after adjusting the word order. To assess this, we compare the different references against the ground-truth to measure their quality. As shown in Table 5, the tailored reference exhibits significantly higher quality than the non-anticipatory reference. Therefore, our method successfully reduces the rate of forced anticipations during training while remaining faithful to the ground-truth. To provide a better understanding of the tailored reference, we include several illustrative cases in Appendix B.

Hallucination in Translation
If the SiMT model is forced to predict target tokens before reading the corresponding source information during training, it is likely to generate hallucinations during inference (Ma et al., 2019). To quantify the presence of hallucinations in the translation, we introduce the hallucination rate (HR) (Chen et al., 2021) for evaluation. Figure 6 illustrates that the SiMT model trained with the tailored reference has a reduced probability of generating hallucinations. Moreover, even though the adaptive policy can adjust its behavior based on the translation status, our approach still effectively mitigates hallucinations by alleviating the burden of latency. This signifies that minimizing forced predictions during training can enhance the faithfulness of the translation to the source sentence, thereby improving translation quality (Ma et al., 2023).
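Given a word alignment between the translation and the source, the hallucination rate can be sketched as the fraction of unaligned translation tokens (an illustrative reading of the metric; Chen et al. (2021) obtain the alignments with an external aligner):

```python
def hallucination_rate(hyp_len, aligned_positions):
    """HR: fraction of translation tokens (0-based positions) that have
    no aligned source token and are thus counted as hallucinations."""
    aligned = set(aligned_positions)
    return sum(1 for i in range(hyp_len) if i not in aligned) / hyp_len
```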

Related Work
Simultaneous machine translation (SiMT) generates the translation while reading the source sentence. It requires a policy to determine the source information read when translating each target token, thus striking a balance between latency and translation quality. Current research on SiMT mainly focuses on two areas: policy improvement and adjustment of the training method.
Policy improvement aims to provide sufficient source information for translation while avoiding unnecessary latency. Ma et al. (2019) propose the Wait-k policy, which initially reads k tokens and then alternates between writing and reading one token. Zhang and Feng (2021) let each head obtain information with a different fixed latency and integrate the decisions of multiple heads for translation. However, a fixed policy cannot be flexibly adjusted based on context, resulting in suboptimal performance. Ma et al. (2020) allow each head to determine its own policy and let all heads jointly decide on the translation. Miao et al. (2021) propose a generative framework, which uses a re-parameterized Poisson prior to regularize the policy. Zhang and Feng (2023a) propose a segmentation policy for the source input. Zhang and Feng (2023b) model simultaneous machine translation as a Hidden Markov Model and achieve state-of-the-art performance. However, these methods are all trained with the ground-truth, leading to forced predictions at low latency.
Adjustment of the training method aims to supplement the missing full-sentence information or cater to the requirements of latency. Zhang et al. (2021) shorten the distance between the source hidden states of the SiMT model and those of the full-sentence model. This makes the source hidden states implicitly embed future information and encourages data-driven prediction. Chen et al. (2021) instead train the model with a non-anticipatory reference, which can be handled effectively by the SiMT model at that latency. However, while the non-anticipatory reference alleviates forced predictions at low latency, it hinders performance improvement at high latency.
Therefore, we aim to provide a tailored reference for SiMT models trained at different latency. The tailored reference should avoid forced anticipations and exhibit high quality. In view of the good structure and superior performance of non-autoregressive models (Gu et al., 2018; Libovický and Helcl, 2018), we utilize one to modify the ground-truth into the tailored reference.

Conclusion
In this paper, we propose a novel method to provide a tailored reference for the training of the SiMT model. Experiments and extensive analyses demonstrate that our method achieves state-of-the-art performance in both fixed and adaptive policies and effectively reduces hallucinations in translation.

Limitations
Regarding the system settings, we investigate the impact of the number of layers and of the training method on performance. We believe that further exploration of the system settings could yield even better results. Additionally, the tailor module aims to avoid forced anticipations and maintain faithfulness to the ground-truth. If language-related features could be added to the SiMT model using a heuristic method, it might produce even more suitable references. We leave this for future work.

In Figure 7, the tailored reference rephrases the ground-truth 'I never had a conversation with him like that.' as 'I had never had such a conversation with him.', so that the order of the source sentence and the tailored sentence is consistent, which makes it suitable for the Wait-1 policy.
In Figure 8, using the ground-truth as the training target of the Wait-1 policy likewise forces the model to predict 'ordering online' before reading 'Online' and 'Bestellung'. In contrast, by replacing 'ordering online' with 'an online order', the word order of the tailored reference matches the source sentence, thereby avoiding forced anticipations during the training of the Wait-1 policy.

C Hyperparameters
The system settings on the three translation tasks are shown in Table 6. For more detailed implementation issues, please refer to our publicly available code.

D Numerical Results
In addition to the translation performance comparison in Figures 3 and 4, we also provide the corresponding numerical results for reference. Tables 7, 8 and 9 report the numerical results on IWSLT15 En→Vi, WMT16 En→Ro and WMT15 De→En, respectively, measured by AL (Ma et al., 2019) and BLEU (Papineni et al., 2002).

Figure 2: The architecture of our method. The tailor module modifies the ground-truth into the tailored reference, which serves as the training target for the SiMT model. The tailor is induced to optimize two rewards by reinforcement learning.

Figure 3: Translation performance of different fixed policies on the En→Vi, En→Ro and De→En tasks.

Figure 4: Translation performance of different adaptive policies on En→Vi, En→Ro and De→En tasks.

Figure 5: Translation performance of different training methods on Wait-k policy.

Figure 6: The hallucination rate (HR) (Chen et al., 2021) of different methods. It measures the proportion of tokens in the translation that cannot find corresponding source information.

Figure 7: Case study of #319 in the De→En test set. The tokens marked in the same color share the same semantics. The tailored reference is more suitable when the model is trained with the Wait-1 policy.

Table 1: The AR(↓) on the WMT15 De→En test set at different latency. k belongs to the Wait-k policy (Ma et al., 2019) and represents the number of tokens by which the target sentence lags behind the source sentence.

Table 2: Performance of the SiMT model, compared to the Wait-k policy, when the tailor has different numbers of layers.

Table 3: Ablation study on the training method of the tailor and the ratio between the two rewards. 'w/o Base Model' removes the training stage of the base model. 'w/o Pre-training' removes the pre-training stage. 'w/o RL Fine-tuning' removes the RL fine-tuning stage.

Table 4: The anticipation rate (AR[%]) when applying the Wait-k policy to different references, based on the De→En test set.

Table 5: The quality (BLEU) of different references compared to the ground-truth for the training of the Wait-k policy. 'Non-Anti Ref' represents the reference generated by applying the Wait-k policy to the full-sentence model.
Figure 8: Case study of #1869 in the De→En test set. The tokens marked in the same color share the same semantics. The tailored reference is more suitable when the model is trained with the Wait-1 policy.

Table 6: Hyperparameters of our experiments.

Table 9: Numerical results on WMT15 De→En.