Enhanced Simultaneous Machine Translation with Word-level Policies

Recent years have seen remarkable advances in the field of Simultaneous Machine Translation (SiMT) due to the introduction of innovative policies that dictate whether to READ or WRITE at each step of the translation process. However, a common assumption in many existing studies is that operations are carried out at the subword level, even though the standard unit for input and output in most practical scenarios is typically at the word level. This paper demonstrates that policies devised and validated at the subword level are surpassed by those operating at the word level, which process multiple subwords to form a complete word in a single step. Additionally, we suggest a method to boost SiMT models using language models (LMs), wherein the proposed word-level policy plays a vital role in addressing the subword disparity between LMs and SiMT models. Code is available at https://github.com/xl8-ai/WordSiMT.


Introduction
Simultaneous Machine Translation (SiMT) commences the translation process while simultaneously receiving the input, making it an effective approach for applications that require minimal latency such as simultaneous interpretation or live broadcast. The development of a novel policy is central to research efforts in SiMT. This policy dictates the translation process by determining whether to execute a READ or WRITE action at each step of the process.
Neural SiMT models, like offline Neural Machine Translation (NMT) models, commonly employ Byte Pair Encoding (BPE) (Sennrich et al., 2016) or similar techniques to encode an input sentence into a sequence of tokens. Typically, a single READ or WRITE action of a SiMT policy is responsible for handling an encoded token, which may sometimes be a word but often a subword. The development of BPE-based SiMT models and their policies has resulted in researchers focusing on working at the subword level. The performance analysis and implementation of many SiMT systems, to the best of our knowledge, have been carried out on encoded sequences of source and target subwords, rather than on the original source and target sentences¹. This has led to two critical issues that need to be addressed.
The first issue is the lack of a standardized tokenization and encoding scheme, meaning that different implementations may employ varying token sequences to encode identical text. This variability can impact latency evaluation results and complicate score comparisons across different systems.
The second issue is the missed opportunity to process more source tokens before writing each target token without added latency. For a BPE-based SiMT model, the input must be received on a word-by-word basis to ensure proper encoding of each word into a sequence of subwords. Consequently, when the model encodes a word and performs a READ to process only a subword, it delays the reading of the remaining subwords without any benefit in actual latency², and may adversely impact translation quality by causing the model to rely on incomplete representations extracted from partially-read words. Similarly, performing a WRITE to generate a subword earlier than the remaining subwords does not necessarily reduce latency, as the subword must wait for a complete word to be generated before it can be displayed.

Figure 2: An exemplary case depicting the difference between token-level and word-level Wait-k policies and their Average Lagging (AL) scores. The token-level model begins translating in the middle of reading "Tremendously" and fails to recover from the incorrectly translated target prefix. On the other hand, the word-level model processes "Tremendously" in a single READ action and produces a correct translation.
In this paper, we show that establishing the unit for measuring and operating SiMT systems at the word level, rather than the subword level, offers a viable solution to these issues. Specifically, to tackle the first issue, we propose word-level latency metric calculation for measuring the latency of SiMT systems. This not only enables consistent comparisons between different systems but also provides a more accurate reflection of the actual latency experienced in SiMT applications that display translation results word by word.
To address the second issue, we illustrate that an existing token-level policy can be transformed into a word-level policy that inherently overcomes it, resulting in improved performance. Word-level policies take word boundaries into consideration and perform a READ or WRITE action to sequentially process the sequence of tokens that forms a complete word. This conversion process can be applied to any token-level policy, and our experiments reveal that state-of-the-art fixed and adaptive policies exhibit significantly better performance when transformed into their word-level counterparts. Notably, these word-level policies often outperform token-level policies even when evaluated using a token-level latency metric, due to their enhanced utilization of input and output tokens.
Additionally, to boost translation accuracy, we suggest incorporating a pre-trained language model (LM) into a SiMT model, where the word-level policy plays a crucial role as a pivotal component. One of the major hurdles in utilizing an LM for SiMT is the vocabulary mismatch between the SiMT model and the LM. The difficulty of handling subword disparities when utilizing an LM for a downstream task is widely acknowledged (Liu et al., 2021a; Wang et al., 2022), and it becomes particularly problematic in SiMT, as inconsistent subwords between the LM and SiMT model make processing the same source or target prefix challenging. Our study demonstrates that our proposed word-level policy effectively tackles this challenge, enabling a successful integration of LMs into SiMT systems.

Simultaneous Machine Translation
SiMT systems that employ a fixed policy utilize a pre-defined sequence of READ and WRITE operations for each source sentence. STATIC-RW (Dalvi et al., 2018) and Wait-k (Ma et al., 2019a) policies first read k source tokens, then alternate between reading and writing a single token. Elbayad et al. (2020) propose the multi-path training of a Wait-k model to train a single model that supports different k values at test time. Zhang et al. (2021) improve the Wait-k policy using knowledge distillation from an offline MT model, while Zhang and Feng (2021) suggest a Mixture-of-Experts Wait-k policy where predictions from multiple k values are combined inside a single model.
In contrast, research efforts on adaptive policies focus on the development of dynamic decision-making processes for READ/WRITE actions. Cho and Esipova (2016) first introduce model-based adaptive criteria for neural SiMT. Gu et al. (2017) propose to learn a policy using reinforcement learning. Raffel et al. (2017) introduce Monotonic Attention, which ensures monotonic alignment in the attention mechanism. Succeeding works improve it by extending the alignment window (Chiu and Raffel, 2017; Arivazhagan et al., 2019), extending it to monotonic multi-head attention (MMA) (Ma et al., 2019b), or learning transposed policies between the forward and backward models (Zhang and Feng, 2022c). Zheng et al. (2020) derive a policy by composing Wait-k models trained with different values of k. Zhang and Feng (2022a) predict the alignment between each target token and the source tokens. Zhang and Feng (2022b) measure accumulated information from source tokens to decide whether to write the current target token. Despite significant advancements, the impact of operating policies at the word level has not been thoroughly explored in existing works, which have mainly focused on developing and evaluating systems at the token level. In this paper, we address this gap by demonstrating that implementing various types of policies at the word level consistently outperforms their token-level counterparts.

Utilizing pre-trained LM for MT
Since the successful utilization of Transformer-based LMs pre-trained on large text corpora for downstream NLP tasks (Devlin et al., 2019; Liu et al., 2019; Lample and Conneau, 2019), the utilization of these models for MT has become a significant research area. Several studies have demonstrated the effectiveness of incorporating encoder-only LMs into NMT models. Weng et al. (2020) and Yang et al. (2020) combine the LM representations with the encoder's representation using a gating mechanism. Zhu et al. (2020) propose attention between BERT and both the encoder and decoder. Weng et al. (2022) leverage mBERT as an encoder and introduce a decoder that attends to grouped representations of the encoder output.
Another research direction focuses on developing LMs with the encoder-decoder architecture designed for NMT as the target downstream task (Lewis et al., 2020; Liu et al., 2020). These models show improvements particularly for low-resource language pairs. To enhance their adaptation for MT, various methods have been proposed, including fine-tuning specific parts of the LMs (Cooper Stickland et al., 2021), reducing domain mismatch and overestimation (Wang et al., 2022), and mitigating the copying behavior (Liu et al., 2021b).
The integration of pre-trained LMs into SiMT remains an underexplored area of research. To date, Indurthi et al. (2022) is the only related study we are aware of. It improves MMA by integrating the LM's prediction of the next target token, which is encoded using their model's vocabulary before being input into the model. However, this approach sacrifices the semantic coherence of the original tokens due to token fragmentation. Additionally, their approach falls under target-side LM integration, overlooking the potential advantages of source-side LM integration.
In this paper, we demonstrate an effective way of integrating a source-side LM into SiMT systems, offering a versatile solution that can be integrated into most existing neural SiMT models. Building upon previous research conducted in offline MT (Zhu et al., 2020), we introduce essential modifications, with a particular focus on word-level policies as a pivotal component. The effective management of vocabulary mismatches between the LM and the SiMT model is contingent upon the successful implementation of a word-level SiMT policy, a key aspect that we address in our study.

Proposed Methods
In this section, we propose the concept of employing a word-level latency metric and outline our process for converting token-level policies into their word-level equivalents. Additionally, we present an integration of an LM into SiMT, highlighting the advantages of utilizing word-level policies.

Preliminaries
Given a source sentence x = (x_1, x_2, ..., x_n), the goal of a SiMT model is to generate a target sentence y = (y_1, y_2, ..., y_m) while minimizing latency metrics. A SiMT model's policy, represented by the variable g_i, determines the number of source tokens to process before predicting target token y_i. The probability of generating y given x is then formulated as

p(y | x; θ) = ∏_{i=1}^{m} p(y_i | y_{<i}, x_{≤g_i}; θ),

where θ denotes the model's parameters, which are commonly optimized with a cross-entropy loss.
The Transformer encoder-decoder model (Vaswani et al., 2017) is currently the most widely used architecture for SiMT. To avoid redundant encoding of the input sequence after each READ operation, the encoder is typically modified to encode the source tokens unidirectionally (Elbayad et al., 2020). Alternatively, more advanced techniques like the recurrent Linear Transformer (Kahardipraja et al., 2021) or Partial Bidirectional Encoding (Iranzo Sanchez et al., 2022) can be adopted to further enhance the encoding capabilities.
During the evaluation of SiMT systems, translation quality is commonly assessed in conjunction with the latency required for generating translations. Various metrics have been proposed to calculate latency scores, with Average Lagging (AL) (Ma et al., 2019a) being the most commonly used metric.
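As a concrete reference, AL can be sketched as follows. This is a minimal illustration of the metric as defined by Ma et al. (2019a), not the evaluation code used in our experiments; the function name and input format are our own. Passing word-level read counts rather than token-level ones yields the word-level latency calculation proposed in this paper.

```python
def average_lagging(g, src_len, tgt_len):
    """Average Lagging (Ma et al., 2019a), sketched for illustration.

    g[i] is the number of source units read before emitting target unit i+1.
    The units may be tokens or words; word-level counts give word-level AL.
    """
    gamma = tgt_len / src_len  # target-to-source length ratio
    # tau: the first step whose read count covers the entire source
    # (the cut-off index in the AL definition); falls back to tgt_len
    tau = next((i + 1 for i, gi in enumerate(g) if gi >= src_len), tgt_len)
    return sum(g[i] - i / gamma for i in range(tau)) / tau
```

For equal source and target lengths, a Wait-k schedule yields an AL of exactly k, e.g. `average_lagging([1, 2, 3, 4], 4, 4)` gives 1.0.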

Measuring latency based on the word level
As detailed in A.1, a substantial body of prior research assesses the performance of SiMT systems by applying a latency metric to encoded source tokens under different tokenization and encoding schemes. This practice results in each system being evaluated on non-identical token sequences for the same dataset, thereby making it challenging to accurately compare scores across different systems.
To address this, we propose word-level latency score calculation by considering the word boundaries in token-level sequences. Specifically, when the first token of a source word is processed through a READ operation, we consider it as reading the corresponding word. Similarly, when the last token of a target word is written via a WRITE operation, we consider it as writing that word. By doing so, the latency scores are calculated consistently, regardless of the tokenization and encoding of the input. This ensures that results from different systems can be compared fairly.
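A minimal sketch of this bookkeeping, with hypothetical names and input format: given a token-level trace of READ/WRITE actions plus word-boundary flags, a source word counts as read when its first token is read, and a target word counts as written when its last token is written.

```python
def word_level_g(actions, src_word_starts, tgt_word_ends):
    """Map a token-level action trace to word-level read counts.

    actions: sequence of 'R' / 'W' token-level actions.
    src_word_starts[j]: True if source token j begins a word.
    tgt_word_ends[i]: True if target token i ends a word.
    Returns g_w, where g_w[k] is the number of source words read
    before the k-th target word is completed; g_w can then be fed
    to any latency metric (e.g., AL) for word-level scoring.
    """
    g_w = []
    src_read = tgt_written = words_read = 0
    for a in actions:
        if a == 'R':
            if src_word_starts[src_read]:
                words_read += 1          # first token read => word counts as read
            src_read += 1
        else:  # 'W'
            if tgt_word_ends[tgt_written]:
                g_w.append(words_read)   # last token written => word completed
            tgt_written += 1
    return g_w
```

Because the mapping depends only on word boundaries in the raw text, two systems with different subword vocabularies produce comparable word-level scores on the same sentence pair.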

Word-level SiMT policies
The proposed word-level policy restricts a SiMT policy's transition from READ to WRITE, or vice versa, to occur exclusively at the boundaries of words. Any token-level policy can be transformed to operate at the word level by following the conversion process we outline below.
Concretely, we ensure that a word-level policy does not write a target token in the middle of reading a sequence of source tokens that make up a word. To accomplish this word-level READ, we delay g_i until it reaches the nearest source word boundary. We formally define r_i, a refined value of g_i based on the word boundaries in x, as

r_i = min { b ∈ B_S : b ≥ g_i },

where B_S denotes the indices of the source words' last tokens. Substituting r_i for g_i transforms the policy into another policy that upholds the same decision-making criterion while ensuring uninterrupted reading of an entire word once the reading of one of its tokens has begun.
Similarly to the word-level READ, we design a word-level WRITE to balance READ and WRITE actions throughout the translation. To achieve this, we modify r_i such that the policy keeps writing until it produces a token that ends with an end-of-word symbol. We define w_i that satisfies this as

w_i = r_i if i − 1 ∈ B_T ∪ {0}, and w_i = w_{i−1} otherwise,

where B_T denotes the indices of the target words' last tokens, so that once writing a word begins, all tokens up to b_i, the index of the final token in the word that includes y_i, are written under the same read count. By employing w_i in place of r_i (or g_i), we ensure that the policy consistently composes entire words without interruptions from any READ actions. This approach effectively reduces latency by facilitating faster writing of certain tokens compared to the original policy, thereby compensating for the increased latency resulting from word-level READ operations. Figure 1 provides a visual comparison between word-level and token-level policies in the context of Wait-1, with the word-level policy encompassing both word-level READ and WRITE operations.
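The two refinements can be sketched together as follows. This is an illustrative implementation with hypothetical names, assuming the token-level g is non-decreasing and never exceeds the last source word boundary, and that the first target token starts a word.

```python
def to_word_level_policy(g, src_last_tokens, tgt_word_starts):
    """Convert a token-level policy g into its word-level counterpart.

    g[i]: number of source tokens read before writing target token i.
    src_last_tokens: sorted 1-based indices of each source word's
        last token (the set B_S in the text).
    tgt_word_starts[i]: True if target token i begins a target word.
    """
    # Word-level READ: delay each g_i to the nearest source word boundary.
    r = [min(b for b in src_last_tokens if b >= gi) for gi in g]
    # Word-level WRITE: once a word's first token is written, finish the
    # word without further READs (its tokens share one read count).
    w = []
    for i, ri in enumerate(r):
        w.append(ri if tgt_word_starts[i] else w[-1])
    return w
```

For instance, token-level Wait-1 over two 2-token source words (g = [1, 2, 3, 4], B_S = {2, 4}) becomes [2, 2, 4, 4]: each target word waits for a whole source word and is then written in full.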

Intra-word bidirectional encoding
Unidirectional encoding in SiMT is vital for managing computational complexity and training efficiency. However, it has the inevitable consequence of weakening the source sequence representations compared to bidirectional encoding. This is an additional factor contributing to the lower translation quality of SiMT compared to offline models, along with the early translation from partial inputs.
To mitigate this issue, we utilize a technique called intra-word bidirectional encoding. At the word level, this approach involves unidirectional encoding for each word in the input sentence, meaning past words cannot attend to future words. At the subword level, however, subwords within the same word can attend to each other, allowing past subwords to attend to future subwords within the same word. Since READ operates at the word level in word-level policies, this encoding does not require recomputation during each WRITE operation. It only necessitates a single forward pass, similar to token-level unidirectional encoding, yet it can produce a better encoded representation by enabling attention to more tokens. An example of the masking that enables intra-word bidirectional encoding is depicted in Figure 3.
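The corresponding attention mask can be sketched as follows, assuming each source token is labeled with the index of the word it belongs to (the function name and input format are our own).

```python
def intra_word_bidirectional_mask(word_ids):
    """Attention mask for intra-word bidirectional encoding.

    word_ids[j]: index of the word containing source token j.
    mask[q][k] is True when query token q may attend to key token k:
    every token of strictly earlier words, plus every token of q's
    own word, including "future" subwords of that same word.
    """
    n = len(word_ids)
    return [[word_ids[k] <= word_ids[q] for k in range(n)] for q in range(n)]
```

By contrast, plain token-level unidirectional encoding would use the causal condition k <= q, which forbids a subword from seeing the later subwords of its own word.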

Integration of LM into SiMT through word-level policies
In this subsection, we showcase an additional benefit of word-level policies when integrating an LM into a SiMT system. One of the key challenges in this integration is the vocabulary mismatch between the LM and the SiMT model, which hinders ensuring that both models process an equal amount of input prefix at each translation step. One possible solution is to use the LM's vocabulary for the SiMT model. However, the LM's training data may not align well with the specific domain targeted by the SiMT system (Wang et al., 2022). This can result in a suboptimal vocabulary for the SiMT model compared to a vocabulary obtained from in-domain data (Liu et al., 2021a). Another option is to explore methods to bridge vocabulary gaps (Kim et al., 2019; Sato et al., 2020; Liu et al., 2021a), but they are either validated only in certain transfer-learning scenarios or require an additional training phase to train adapters or fine-tune the entire LM using pre-training data.
In this paper, we introduce a method that facilitates the integration of an off-the-shelf LM into a SiMT model by utilizing a word-level policy, regardless of vocabulary mismatches and the internal structure of the SiMT model. Specifically, we employ LM-fused attention for both the encoder and decoder, following the approach outlined in (Zhu et al., 2020), but with two notable modifications.
Firstly, we replace BERT with a decoder-only auto-regressive LM (Radford et al., 2019; Lin et al., 2022) for unidirectional encoding of the input, aligning with SiMT models for efficient training and inference. Secondly, the attention between the SiMT model and the LM occurs when both models execute a word-level READ for an input word. This ensures they interact only when they have processed an equal amount of input prefix, naturally resolving the synchronization issue. Additionally, as they align at every word boundary, the SiMT model can operate independently with a vocabulary derived from in-domain data, while the LM continues to use its original vocabulary. Unlike methods targeting specific SiMT models (Indurthi et al., 2022), our approach can benefit any Neural SiMT model with any decoder-only LM. Figure 4 illustrates the proposed integration of the LM with word-level Wait-1.

We compare the following systems:

Wait-k (Elbayad et al., 2020): A Wait-k model trained with multi-path training. For the word-level Wait-k policy, we define it as reading the first k words and then alternating between reading one source word and writing one target word.³

MoE Wait-k (Zhang and Feng, 2021): A Mixture-of-Experts model initially trained with fixed Expert weights and then fine-tuned with dynamic Expert weights. The word-level Wait-k policy with a different k is applied to each expert.

³ Technically, the word-level policy derived from the token-level Wait-k through the conversion process in Section 3.3 can wait between 1 and k tokens, depending on the input encoding. Therefore, it is not equivalent to the word-level Wait-k policy we define here, which always waits for k words.
ITST (Zhang and Feng, 2022b): A SoTA SiMT model equipped with an Information-Transport-based policy that quantifies information weights from each source token to the current target token. To implement word-level ITST, we convert the number of source tokens required for the first target token of each target word into the corresponding number of source words using Equation 2. Once the required number of source words is read, we complete the translation of the word. Additionally, we calculate the latency cost at the word level.

We compare each system by training both token-level and word-level models, with and without an LM. For models with an LM, we use XGLM-564M (Lin et al., 2022) and employ the two-stage training in which we first train the SiMT model without the LM and then initialize the encoder and decoder of the LM-fused model with the pre-trained weights (Zhu et al., 2020). We also tested the single-stage training where all trainable parameters are trained jointly with the LM from scratch. The differences between these strategies are discussed in Section 5.3. We tokenized and encoded the input using sentencepiece (Kudo and Richardson, 2018) and applied BPE with a vocabulary size of 32k. We use sacreBLEU for BLEU calculation (Post, 2018). For models with token-level policies, we trained models with the official implementations⁴ ⁵ ⁶, and implemented word-level policies based on these implementations. More training details are described in A.2.
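For concreteness, the word-level Wait-k schedule used above (read the first k words, then alternate between reading one source word and writing one target word) can be sketched as follows; the function name and the action encoding are illustrative, and each action here moves a whole word, i.e., all of its subwords at once.

```python
def word_level_wait_k_actions(k, num_src_words, num_tgt_words):
    """Generate the word-level Wait-k action schedule.

    READ the first k source words, then alternate between writing one
    target word and reading one source word until the target is done.
    Reads stop early once the whole source has been consumed.
    """
    actions = []
    read = min(k, num_src_words)
    actions += ['READ'] * read
    written = 0
    while written < num_tgt_words:
        actions.append('WRITE')
        written += 1
        if read < num_src_words and written < num_tgt_words:
            actions.append('READ')
            read += 1
    return actions
```

For example, with k = 1 and three words on each side, the schedule strictly alternates READ and WRITE at the word level.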

Main results
The performance of each system is compared in Figure 5. Notably, when measuring latency at the word level, the word-level policy proves to be highly effective for all three systems across different latency levels and datasets, resulting in superior performance. The only exception is the En-Fr ITST models, which demonstrate similar levels of performance. The incorporation of an LM using the proposed LM-fused attention further enhances the performance of all word-level configurations. This observation highlights the suitability of word-level policies for the LM-fused attention approach and underscores the effectiveness of leveraging an LM to enhance SiMT systems.
Notably, as depicted in Figure 11, the word-level policies also outperform or compete with the token-level policies in token-level latency. This can be attributed to the enhanced token representations under the word-level policy, thanks to the contextual information provided by all other tokens belonging to the same word for each token.

Analysis
To validate the effectiveness of word-level policies from multiple angles, we conduct several analyses in various settings. All experiments were conducted on WMT15 De-En with Transformer-big unless specified otherwise.

Effects of Word-level READ and WRITE
To gain insights into the functionality of word-level READ and WRITE actions, we trained Wait-k models with various policy settings and conducted a performance comparison. Specifically, we examined models with the following policy settings:

WW: word-level READ and WRITE.
TW: token-level READ and word-level WRITE.
WT: word-level READ and token-level WRITE.
TkTk: a simpler baseline policy which involves alternating reading k source tokens and writing k target tokens without considering word boundaries.
The results are presented in Figure 6 (a). The word-level policy (WW) consistently outperforms TW across all latency settings, which is attributed to TW's imbalance between the number of source and target prefixes processed at each step. Additionally, WT achieves a minimal AL of approximately 3.6, indicating that it is not well-suited for scenarios that require low latency. Lastly, TkTk performs significantly worse than WW, suggesting that reading or writing a few consecutive tokens without considering semantic boundaries offers no benefits, unlike word-level policies.

Effects of intra-word bidirectional encoding
In order to assess the impact of the proposed intra-word bidirectional encoding, we trained word-level Wait-k models with and without it and compared the accuracy of the two models across different AL settings. The results are presented in Figure 6 (b).
Remarkably, the model equipped with intra-word bidirectional encoding consistently achieved higher BLEU scores than the model without it across all tested latency settings. This provides strong evidence of the effectiveness of intra-word bidirectional encoding in enhancing SiMT performance.

Effectiveness of word-level policies for LM
In this subsection, we explore the significance of word-level policies in leveraging an LM for SiMT. We compare different configurations based on three factors. Word vs. Token: the policy type that the model operates with. In-domain vocab vs. LM vocab: whether the model uses an in-domain vocabulary obtained from the in-domain training data or uses the LM's vocabulary for the source language; note that the use of "in-domain vocab" is specific to the "Word" configuration due to the vocabulary mismatch. LM-fused attention vs. LM embed: whether the model incorporates an LM using the LM-fused attention or by replacing the embedding layer of the encoder with the LM's embedding (Xu et al., 2021); the latter approach uses "LM vocab" by design. Figure 7 showcases the results. The models with word-level policies consistently outperform those with token-level policies by a significant margin in both LM-fused attention and LM embedding settings, underscoring the importance of word-level policy adoption for effective LM integration. The top-performing configuration is the proposed LMAttn-In-domain vocab-Word, demonstrating that the highest translation accuracy is achieved when the SiMT model operates with an in-domain vocabulary. Additionally, the "LM embed" approach performs notably worse than the proposed LM-fused attention, further affirming the latter's superiority.

Effects of LM-fused attention with various LMs and training configurations
To assess the effectiveness and broad applicability of our proposed LM integration, we conducted experiments on the IWSLT17 En-Fr dataset with two decoder-only LMs of different sizes: the 137M-parameter GPT-2 model (Radford et al., 2019) and the XGLM-564M model. Additionally, we explore the option of training models in a single training stage instead of the two-stage training. The results, presented in Figure 8, demonstrate that the GPT-2 model also exhibits improved performance with LM-fused attention, although the impact is naturally less pronounced compared to XGLM due to the difference in model size. Moreover, although models trained using the single-stage training generally exhibit lower performance than those trained using the two-stage training, they still outperform models without the LM in most configurations. This indicates that LM-fused attention is applicable to various types of LMs and remains effective even when using the single-stage training strategy. This flexibility allows users to choose a training approach and model configuration that best aligns with their accuracy goals and computational constraints.

Policy quality comparison
To assess the accuracy of policies in determining when to read or write, we adopt the methodology for estimating the quality of a policy used in prior research (Zhang and Feng, 2022b,c; Guo et al., 2022). We measure the quality of a policy by analyzing the proportion of aligned source words received before translating on the RWTH De-En alignment dataset⁷.
To ensure accurate word-level alignment calculation, we consider an aligned source word to be read before a ground-truth (GT) target word is written if the last token of the source word is read before the first token of the target word is written. The results of this analysis are presented in Figure 9. It is observed that word-level policies, both for ITST and Wait-k, exhibit better alignments across most latency settings. This suggests that word-level policies help revise premature WRITE actions by guiding the model to read the remaining tokens of the aligned word, without negatively impacting the model's latency.
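A minimal sketch of this measurement, with a hypothetical input format: alignments are given as (source word, target word) index pairs, and for each target word we record how many source words were fully read before its first token was written.

```python
def aligned_before_write_ratio(alignments, words_read_before):
    """Proportion of aligned source words available before translation.

    alignments: iterable of (src_word_idx, tgt_word_idx) pairs, 0-based.
    words_read_before[t]: number of source words fully read before the
        first token of target word t is written.
    A pair counts as satisfied when its source word is among those read.
    """
    pairs = list(alignments)
    hit = sum(s < words_read_before[t] for s, t in pairs)
    return hit / len(pairs)
```

A higher ratio indicates that the policy tends to wait for the source words a target word actually depends on before committing to a WRITE.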

Conclusions
This paper explores the potential benefits of word-level operations in SiMT systems. We propose word-level latency calculation to ensure fair and accurate latency comparisons. We introduce a conversion process that transforms token-level policies into word-level policies, enabling the processing of multiple subwords that form a word within a single READ or WRITE action. Additionally, we propose LM-fused attention, which incorporates an autoregressive LM into SiMT models with word-level policies. Experimental results demonstrate the superiority of word-level policies over token-level policies, as well as the effectiveness of the LM integration. Our findings highlight the crucial role of word-level policies in the integration process.

Limitations
While the proposed word-level policy implementation is widely applicable to most existing SiMT systems, it is important to note that systems operating on languages whose writing style lacks spaces or other delimiters between words or sentences (e.g., Chinese) cannot derive benefits from this approach. Furthermore, while the proposed LM-fused attention proves effective in enhancing translation quality across all latency levels, integrating a large LM may necessitate faster compute capability to fulfill the low-latency demands of the SiMT task.

A.1 Token-level latency evaluation in previous works
Many previous works in the field of SiMT have employed token-level latency calculation as a fundamental metric for evaluating system performance.
In the first group of papers, which includes (Arivazhagan et al., 2019; Ma et al., 2019b; Schneider and Waibel, 2020; Zheng et al., 2020; Zhang and Feng, 2022b), researchers explicitly stated their use of token-level latency calculation or incorporated token-level scores in their analyses. The second group, comprising (Zhang and Feng, 2022b; Elbayad et al., 2020; Zhang and Feng, 2022c, 2021, 2022a; Zhang et al., 2022; Guo et al., 2022; Zhang and Feng, 2023; Guo et al., 2023; Caglayan et al., 2020; Wang et al., 2023), can be identified by their publicly available official implementations, which incorporate token-level latency evaluation code. This underscores the prevalence of token-level latency calculation in cutting-edge SiMT research.

A.2 More training details
For Transformer model configurations, we follow the Efficient Wait-k (Elbayad et al., 2020) settings.⁸ The details of the hyperparameter settings for each model are in Table 1.
To ensure optimal performance of the ITST models, we conducted multiple training runs for each model configuration and dataset. We observed performance fluctuations across the training runs. To select the best-performing models for both token-level and word-level policies, we repeated the training process three times and selected the checkpoint with the lowest validation loss for each configuration.
For LM-integrated models, the intra-word bidirectional encoding was not applied to the LM to prevent the need for fine-tuning.

A.3 Examples and discussions
In Table 2, an illustrative case is presented to demonstrate the distinction between a token-level policy (token-level Wait-1) and a word-level policy (word-level Wait-1). At Step 2, the token-level model predicts the target token "B" after processing the subword "B" of the word "Beine" (which means "legs" in German). Subsequently, it fails to recover from this incorrect prediction and continues by predicting "ody " as the next token, even after processing the remaining token "eine ". In contrast, the word-level model accurately predicts "legs" (encoded as "leg s ") after processing the complete word "Beine" (encoded as "B eine "). Furthermore, the token-level model also makes erroneous predictions at every step when processing the word "blutüberströmt." (encoded as "bl ut über ström t .", meaning "covered in blood" in German), while the word-level model accurately predicts a correct word in a single step.
Another case can be observed in Table 3, where the token-level model initiates the translation before fully reading the word "Irgendetwas" (encoded as "Irgen det was "). As a consequence, it produces an incorrect translation. On the other hand, the word-level model accurately translates the sentence by processing the complete word before making any predictions.

A.5 Graphs of main results

Figure 1 :
Figure 1: Illustration of both token-level Wait-1 (red arrow lines) and word-level Wait-1 (green arrow lines) policies. The symbol " " at the end of a token indicates a word boundary.

Figure 3 :
Figure 3: Comparison of masking in the token-level unidirectional attention (left) and the intra-word bidirectional encoding (right). Word boundaries are represented by vertical/horizontal bars on each axis.

Figure 4 :
Figure 4: Illustration of LM-fused attention with the word-level Wait-1. Words and model activations processed at a specific stage are highlighted in red. When a source word is received, it is independently encoded with the LM's vocabulary and the in-domain vocabulary. The hidden activations of the word from the LM are then utilized in the encoder. The decoder generates a sequence of tokens for a word by using both the LM and encoder activations.

Figure 6 :
Figure 6: Ablation studies for word-level policies. (a): Comparison of word-level Wait-k policies with different policies. (b): Comparison of word-level Wait-k models with and without the intra-word bidirectional encoding.

Figure 7 :
Figure 7: Comparison of models with different LM integration.

Figure 8 :
Figure 8: Translation accuracy comparison of LM-fused attention models with different training configurations. (a) En-Fr, Transformer-small; (b) De-En, Transformer-big.

Figure 9 :
Figure 9: The percentage of aligned source words read before the translation of the target words started.

Table 1 :
Hyperparameters of each Transformer model.
Figure 11: Results of translation quality vs. token-level AL.

Table 4 :
Main results of transformer-small Wait-k models on IWSLT17 En-Fr dataset.

Table 5 :
Main results of transformer-small MoE Wait-k models on IWSLT17 En-Fr dataset.

Table 6 :
Main results of transformer-small ITST models on IWSLT17 En-Fr dataset.

Table 7 :
Main results of transformer-base Wait-k models on WMT15 De-En dataset.

Table 8 :
Main results of transformer-base MoE Wait-k models on WMT15 De-En dataset.

Table 9 :
Main results of transformer-base ITST models on WMT15 De-En dataset.

Table 10 :
Main results of transformer-big Wait-k models on WMT15 De-En dataset.

Table 11 :
Main results of transformer-big MoE Wait-k models on WMT15 De-En dataset.

Table 12 :
Main results of transformer-big ITST models on WMT15 De-En dataset.