Non-autoregressive Streaming Transformer for Simultaneous Translation

Simultaneous machine translation (SiMT) models are trained to strike a balance between latency and translation quality. However, training these models to achieve high quality while maintaining low latency often leads to a tendency for aggressive anticipation. We argue that such an issue stems from the autoregressive architecture upon which most existing SiMT models are built. To address this issue, we propose the non-autoregressive streaming Transformer (NAST), which comprises a unidirectional encoder and a non-autoregressive decoder with intra-chunk parallelism. We enable NAST to generate the blank token or repetitive tokens to adjust its READ/WRITE strategy flexibly, and train it to maximize the non-monotonic latent alignment with an alignment-based latency loss. Experiments on various SiMT benchmarks demonstrate that NAST outperforms previous strong autoregressive SiMT baselines.


Introduction
Simultaneous machine translation (SiMT; Cho and Esipova, 2016; Gu et al., 2017; Ma et al., 2019; Arivazhagan et al., 2019; Zhang and Feng, 2023), also known as real-time machine translation, is commonly used in various practical scenarios such as live broadcasting, video subtitles, and international conferences. SiMT models are required to start translating while the source sentence is still incomplete, ensuring that listeners stay synchronized with the speaker. Nevertheless, translating partial source content poses significant challenges and increases the risk of translation errors. To this end, SiMT models are trained to strike a balance between latency and translation quality by dynamically determining when to generate tokens (i.e., the WRITE action) and when to wait for additional source information (i.e., the READ action).
However, achieving the balance between latency and translation quality is non-trivial for SiMT models. Training these models to produce high-quality translations while maintaining low latency often leads to a tendency for aggressive anticipation (Ma et al., 2019), as the models are compelled to output target tokens even before the corresponding source tokens have been observed during the training stage (Zheng et al., 2020). We argue that such an issue of anticipation stems from the autoregressive (AR) model architecture upon which most existing SiMT models are built. Regardless of the specific READ/WRITE strategy utilized, AR SiMT models are typically trained using maximum likelihood estimation (MLE) via teacher forcing. As depicted in Figure 1, their training procedure can have adverse effects on AR SiMT models in two aspects: 1) non-monotonicity problem: The reference used in training might be non-monotonically aligned with the source. However, in real-time scenarios, SiMT models are expected to generate translations that align monotonically with the source to reduce latency (He et al., 2015; Chen et al., 2021). The inherent verbatim alignment assumption during the MLE training of AR SiMT models restricts their performance; 2) source-info leakage bias: Following the practice in full-sentence translation systems, AR SiMT models deploy the teacher forcing strategy during training. However, it may inadvertently result in the leakage of source information. As illustrated in Figure 1, even if the available source content does not contain the word "举行 (hold)", the AR decoder is still fed with the corresponding translation word "held" as the ground truth context in training. This discrepancy between training and inference encourages the AR SiMT model to make excessively optimistic predictions during real-time inference, leading to the issue of hallucination (Chen et al., 2021).
To address the aforementioned problems in autoregressive SiMT models, we focus on developing
SiMT models that generate target tokens in a non-autoregressive (NAR) manner (Gu et al., 2018) by removing the target-side token dependency. We argue that an NAR decoder is better suited for streaming translation tasks. Firstly, the target tokens are modeled independently in NAR models, which facilitates the development of a non-monotonic alignment algorithm between generation and reference, alleviating the non-monotonicity problem. Additionally, the conditional independence assumption of the NAR structure liberates the model from the need for teacher forcing in training, thereby eliminating the risk of source-side information leakage. These advantageous properties of the NAR structure enable SiMT models to avoid aggressive anticipation and encourage the generation of monotonic translations with fewer reorderings, which aligns with the output of professional human interpreters.
In this work, we propose the non-autoregressive streaming Transformer (NAST). NAST processes streaming input and performs unidirectional encoding. Translations are generated in a chunk-by-chunk manner, with tokens within each chunk being generated in parallel. We enable NAST to generate the blank token ϵ or repetitive tokens to build READ/WRITE paths adaptively, and train it to maximize the non-monotonic latent alignment (Graves et al., 2006; Shao and Feng, 2022) with a further developed alignment-based latency loss. In this way, NAST effectively learns to generate translations that are properly aligned with the source in a monotonic manner, achieving high-quality translation while maintaining low latency.
Extensive experiments on WMT15 German → English and WMT16 English → Romanian benchmarks demonstrate that NAST outperforms previous strong autoregressive SiMT baselines.

Simultaneous Translation
Simultaneous machine translation models often adopt a prefix-to-prefix framework to start generating the translation conditioned on partial source input. Given a source sentence x = {x_1, ..., x_m}, previous autoregressive SiMT models factorize the probability of the target sentence y = {y_1, ..., y_n} as:

p(y | x) = ∏_{t=1}^{n} p(y_t | x_{≤g(t)}, y_{<t}),

where g(t) is a monotonic non-decreasing function of t, denoting the number of observed source tokens when generating y_t. A function g(t) represents a specific READ/WRITE policy of SiMT models.
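For instance, the wait-k policy of Ma et al. (2019) instantiates g(t) as g(t) = min(k + t − 1, |x|). A minimal sketch of such a policy function (the helper name is ours, purely illustrative):

```python
def wait_k_policy(k: int, src_len: int):
    """Return the monotonic non-decreasing function g(t) for a wait-k
    policy: read k source tokens first, then alternate WRITE/READ."""
    def g(t: int) -> int:
        # when writing y_t (1-indexed), min(k + t - 1, |x|) source
        # tokens have been observed
        return min(k + t - 1, src_len)
    return g
```

Any READ/WRITE policy, fixed or adaptive, can be described by such a function, which is what the latency metric below consumes.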
In addition to translation quality, latency is a crucial factor in the assessment of SiMT models. The latency of a policy g(t) is commonly measured using Average Lagging (AL; Ma et al., 2019), which counts the number of tokens by which the output lags behind the input:

AL_g(x, y) = (1 / τ_g(|x|)) ∑_{t=1}^{τ_g(|x|)} [ g(t) − (t − 1) / r ],

where τ_g(|x|) = min{ t : g(t) = |x| } is the cut-off function that excludes the counting of problematic tokens at the end, and r = |y| / |x| represents the length ratio between the target and source sequences.
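The definition above translates directly into code; a sketch assuming a 1-indexed policy function g that reaches |x| within the target length:

```python
def average_lagging(g, src_len, tgt_len):
    """Average Lagging (Ma et al., 2019) for a policy g(t).
    A direct transcription of the definition, not the authors' code."""
    r = tgt_len / src_len
    # cut-off: first target step at which the full source has been read
    tau = next(t for t in range(1, tgt_len + 1) if g(t) == src_len)
    # average lag of the first tau tokens against an ideal rate-r policy
    return sum(g(t) - (t - 1) / r for t in range(1, tau + 1)) / tau
```

For example, a wait-1 policy on equal-length source and target sequences lags by exactly one token.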

Parallel Decoding
Non-autoregressive generation (Gu et al., 2018) was originally introduced to reduce decoding latency.¹ It removes the autoregressive dependency and generates target tokens in parallel. Given a source sentence x = {x_1, ..., x_m}, NAR models factorize the probability of the target sentence y = {y_1, ..., y_n} as:

p(y | x) = ∏_{t=1}^{n} p(y_t | x).    (4)

Connectionist Temporal Classification
Unlike autoregressive models that dynamically control the length by generating the <eos> token, NAR models often utilize a length predictor to predetermine the length of the output sequence before generation. The predicted length may be imprecise and lacks adaptability for adjustment. Connectionist Temporal Classification (CTC; Graves et al., 2006) addresses this limitation by extending the output space Y with a blank token ϵ. A generation a over the extended vocabulary is referred to as an alignment. CTC defines a mapping function β(y; T) that returns the set of all possible alignments of y of length T, and a collapsing function β⁻¹(a) that first collapses all consecutive repeated tokens in a and then removes all blanks to obtain the target. During training, CTC marginalizes out all alignments:

p(y | x) = ∑_{a ∈ β(y; T)} p(a | x),

where T is a pre-determined length and the alignment is modeled in a non-autoregressive way:

p(a | x) = ∏_{t=1}^{T} p(a_t | x).


Approach

We provide a detailed introduction to the non-autoregressive streaming Transformer (NAST) in this section.
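The collapsing function β⁻¹ introduced above is straightforward to implement; a minimal sketch, with `"<eps>"` standing in for the blank token ϵ:

```python
def ctc_collapse(alignment, blank="<eps>"):
    """beta^{-1}: collapse consecutive repeated tokens, then drop blanks."""
    out, prev = [], None
    for tok in alignment:
        if tok != prev:          # keep only the first of a repeated run
            out.append(tok)
        prev = tok
    return [t for t in out if t != blank]
```

For example, the alignment [a, a, ϵ, b, b, ϵ, ϵ, a] collapses to [a, b, a]; note that a blank between two identical tokens keeps both copies.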

Architecture Overview
NAST consists of a unidirectional encoder (Arivazhagan et al., 2019; Ma et al., 2019; Miao et al., 2021) and a non-autoregressive decoder with intra-chunk parallelism. The model architecture is depicted in Figure 2. When a source token x_i is read in, NAST passes it to the unidirectional encoder, where causal encoder self-attention allows it to attend only to the previous source contexts x_{≤i}. Concurrently, NAST upsamples x_i λ times and feeds the copies to the decoder to construct a chunk of hidden states. Within the chunk, NAST handles the λ states in a fully parallel manner. To further clarify, we introduce h to represent the sequence of decoder states; the j-th hidden state in the i-th chunk can be denoted as h_{(i−1)λ+j}, subject to 1 ≤ j ≤ λ. Through cross-attention, these states can attend to information from all currently observed source contexts, and through decoder self-attention, to information from all constructed decoder states. Following CTC (Graves et al., 2006), we extend the vocabulary to allow NAST to generate the blank token ϵ or repeated tokens from decoder states to model an implicit READ action. We refer to the outputs from a chunk of states h_{(i−1)λ+1:iλ} as the partial alignment a_{(i−1)λ+1:iλ}, which NAST generates in a non-autoregressive way:

p(a_{(i−1)λ+1:iλ} | x_{≤i}) = ∏_{j=1}^{λ} p(a_{(i−1)λ+j} | x_{≤i}).

To obtain the translation stream, we first apply the collapsing function β⁻¹ to the partial alignment generated from the i-th chunk. Then NAST concatenates the output from the current chunk to the generated prefix y_pre, dropping the first token of the collapsed chunk output if it equals y_pre_{−1}, the last token in the generated prefix. Consequently, upon receiving a token in the input stream, NAST is capable of generating 0 to λ

¹ Note that the concept of latency differs between NAR generation and SiMT: in the former it refers to the delay in generating all target tokens once all source tokens are observed, while in the latter it refers to the level of synchronization between target-side generation and source-side observation.

Figure 2: Overview of the proposed non-autoregressive streaming Transformer (NAST). Upon receiving a source token, NAST upsamples it λ times and feeds the copies to the decoder as a chunk. NAST can generate the blank token ϵ or repetitive tokens (both highlighted in gray) to find reasonable READ/WRITE paths adaptively. We train NAST using the non-monotonic latent alignment loss (Shao and Feng, 2022) together with the alignment-based latency loss to achieve high translation quality while maintaining low latency.
tokens at a time, endowing it with the ability to adjust its READ/WRITE strategy flexibly. Formally, each full alignment a ∈ β(y; λ|x|) can be considered as a concatenation of all the partial alignments, and implies a specific READ/WRITE policy to generate the reference y. Therefore, NAST jointly models the distribution of translation and READ/WRITE policy by marginalizing out the latent alignments:

p(y | x) = ∑_{a ∈ β(y; λ|x|)} ∏_{i=1}^{|x|} ∏_{j=1}^{λ} p(a_{(i−1)λ+j} | x_{≤i}).    (13)
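Putting the chunk-level collapsing and the prefix concatenation rule together, one streaming step might be sketched as follows (a hypothetical helper; dropping a leading token that repeats the last token of y_pre is our reading of the concatenation rule above):

```python
def stream_step(y_pre, chunk_alignment, blank="<eps>"):
    """Append one chunk's collapsed output to the running prefix,
    merging a token repeated across the chunk boundary."""
    # collapse within the chunk (beta^{-1})
    out, prev = [], None
    for tok in chunk_alignment:
        if tok != prev:
            out.append(tok)
        prev = tok
    out = [t for t in out if t != blank]
    # drop a leading token that repeats the last emitted one
    if out and y_pre and out[0] == y_pre[-1]:
        out = out[1:]
    return y_pre + out
```

A chunk of all blanks thus emits nothing (an implicit READ), while a chunk repeating the previous output token emits only the new material.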

Latency Control
While NAST exhibits the ability to adaptively determine an appropriate READ/WRITE policy, we want to impose some specific requirements on the trade-off between latency and translation quality.
To accomplish this, we introduce an alignment-based latency loss and a chunk wait-k strategy to effectively control the latency of NAST.

Alignment-based Latency Loss
Considering that NAST models the distribution of the READ/WRITE policy by capturing the distribution of latent alignments, it is desirable to measure the averaged latency over all latent alignments and further regularize it. Specifically, we are interested in the expected Average Lagging (AL; Ma et al., 2019) of NAST:

AL(θ; x) = E_{a ∼ p(a|x)} [ AL(g_a) ],

where g_a is the policy induced from the alignment a. Due to the exponentially large alignment space, it is infeasible to enumerate all possible g_a to compute AL(θ; x) exactly. This limitation motivates us to delve deeper into AL(θ; x) and devise an efficient estimation algorithm.
To simplify the estimation of AL(θ; x) while still excluding the lag counting of problematic tokens generated after the entire source has been read in, we deploy a new cut-off function that disregards tokens generated after all source tokens are observed, i.e., tokens from the last chunk:

τ(|x|) = λ(|x| − 1).

Then we introduce a moment function m(i) to denote the number of observed source tokens when generating the i-th position in the alignment. Given the fixed upsampling strategy of NAST, it is clear that:

m(i) = ⌈i / λ⌉.

We further define an indicator function 1(a_i) to denote whether the i-th position in the alignment is reserved after being collapsed by β⁻¹, i.e., whether a_i is non-blank and differs from a_{i−1}. With its help, the lagging of an alignment a can be expressed as a sum over the reserved positions before the cut-off, with both the numerator and the normalizing denominator written in terms of 1(a_i). This relieves us from the intractable task of enumerating g_a: we only need to handle two expectation terms, both of which reduce to the per-position reserve probabilities p(1(a_i)) (the detailed derivation is given in Appendix A), where p(1(a_i)) represents the probability that the i-th token in the alignment is reserved after collapsing and can be calculated simply as:

p(1(a_i)) = ∑_{v ≠ ϵ} p(a_i = v) (1 − p(a_{i−1} = v)).

With the assistance of the aforementioned derivation, it is efficient to estimate the expected average lagging of NAST. By applying it along with a tunable minimum lagging threshold l_min, we can train NAST to meet specific requirements of low latency:

L_latency = max( AL(θ; x), l_min ).
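Under the stated independence assumption, the reserve probabilities p(1(a_i)) can be computed directly from the per-position token distributions; a sketch (the array layout and the choice of index 0 as the blank are our assumptions):

```python
import numpy as np

def reserve_prob(p):
    """p: array of shape (T, V), p[i, v] = p(a_i = v); index 0 is blank.
    Returns p(1(a_i)): the probability that position i survives
    collapsing, i.e. a_i is non-blank and differs from a_{i-1},
    under the conditional independence assumption."""
    T, _ = p.shape
    keep = np.empty(T)
    keep[0] = 1.0 - p[0, 0]  # first position: reserved iff non-blank
    for i in range(1, T):
        # sum over non-blank tokens v of p(a_i = v) * (1 - p(a_{i-1} = v))
        keep[i] = np.sum(p[i, 1:] * (1.0 - p[i - 1, 1:]))
    return keep
```

With deterministic distributions this reduces to the indicator 1(a_i) itself: a repeated token or a blank gets probability 0 of being reserved.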

Chunk Wait-k Strategy
In addition to the desiring property of shorter lagging, there may be practical scenarios where we aim to mitigate the risk of erroneous translations by increasing the latency.To this end, we propose a chunk wait-k strategy for NAST to satisfy the requirements of better translation quality.
NAST is allowed to wait for additional k source tokens before initializing the generation of the first chunk. The first chunk is fed to the decoder at the moment the (k + 1)-th source token is read in. Subsequently, NAST feeds each following chunk as each new source token is received. The partial alignment generated from each chunk is consistently lagged by k tokens compared with the corresponding source token until the source sentence is complete.
Formally, the moment function for the chunk wait-k strategy can be formulated as:

m_k(i) = min( ⌈i / λ⌉ + k, |x| ).

As depicted in Figure 3, decoder states can further access information from the additional k observed source tokens through cross-attention, which leads NAST to prioritize better translation quality at the expense of longer delays.
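The moment function is a one-liner; the sketch below also covers the k = 0 case of the previous section:

```python
import math

def moment(i, lam, src_len, k=0):
    """m(i): number of source tokens observed when producing alignment
    position i (1-indexed), with upsampling ratio lam and chunk wait-k."""
    return min(math.ceil(i / lam) + k, src_len)
```

For λ = 3, positions 1-3 depend on one source token, positions 4-6 on two, and so on; setting k > 0 shifts the whole schedule later, capped at the source length.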

Non-monotonic Latent Alignments
While the CTC loss (Graves et al., 2006) provides the convenience of directly applying maximum likelihood estimation to train NAST, i.e., L = − log p(y|x), it only considers monotonic mappings from target positions to alignment positions. However, non-monotonic alignments are crucial in simultaneous translation. SiMT models are expected to generate translations that are monotonically aligned with the source sentence to achieve low latency. Unfortunately, in the training corpus, source and reference pairs are often non-monotonically aligned due to differences in grammatical structure between languages (e.g., SVO vs. SOV). Neglecting the non-monotonic mapping during training compels the model to predict tokens whose corresponding source has not been read, resulting in over-anticipation. To address these issues, we apply the bigram-based non-monotonic latent alignment loss (Shao and Feng, 2022) to train NAST, which maximizes the F1 score of the expected bigram matching between the target and the alignments:

L_NMLA = − ( 2 ∑_{g ∈ G_2} min( C_g(y), C_g(θ) ) ) / ( ∑_{g ∈ G_2} C_g(y) + ∑_{g ∈ G_2} C_g(θ) ),

where C_g(y) denotes the occurrence count of bigram g = (g_1, g_2) in the target, C_g(θ) represents the expected count of g for NAST, and G_2 denotes the set of all bigrams in y.
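Assuming the min-based expected match count of Shao and Feng (2022), the bigram F1 objective can be sketched as follows (here the expected counts C_g(θ) are passed in as a precomputed dict, which is a simplification of the actual differentiable computation over alignments):

```python
from collections import Counter

def bigram_f1(reference, expected_counts):
    """F1 between reference bigram counts C_g(y) and model-expected
    bigram counts C_g(theta), with min() as the match count."""
    ref = Counter(zip(reference, reference[1:]))   # C_g(y)
    match = sum(min(c, expected_counts.get(g, 0.0)) for g, c in ref.items())
    total = sum(ref.values()) + sum(expected_counts.values())
    return 2.0 * match / total if total > 0 else 0.0
```

Because matching is over bigram multisets rather than positions, a generation that reorders whole phrases relative to the reference is not penalized position-by-position, which is precisely what the monotonic CTC loss cannot offer.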

Glancing
Due to its inherent conditional independence structure, NAST may encounter challenges related to the multimodality problem (Gu et al., 2018). To address this issue, we employ the glancing strategy (Qian et al., 2021) during training. This involves randomly replacing tokens in the decoder's input chunks with tokens from the most probable latent alignment. Formally, the glancing alignment is the one that maximizes the posterior probability:

a* = argmax_{a ∈ β(y; λ|x|)} p(a | x).

Then, during training, we randomly sample some positions in the decoder input and replace the tokens at those positions with the corresponding tokens from the glancing alignment.
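The glancing replacement itself is a simple sampling step; a sketch (the fixed seed and the exact sampling scheme are illustrative assumptions, not the authors' implementation):

```python
import random

def glance(decoder_input, glancing_alignment, ratio, seed=0):
    """Randomly replace a fraction `ratio` of decoder-input positions
    with tokens from the most probable latent alignment a*."""
    rng = random.Random(seed)
    n = len(decoder_input)
    positions = rng.sample(range(n), int(ratio * n))
    out = list(decoder_input)
    for i in positions:
        out[i] = glancing_alignment[i]
    return out
```

Annealing `ratio` downward over training (as described in the experimental setup) gradually weans the decoder off these glanced hints.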

Training Strategy
To better train the NAST model to adapt to simultaneous translation tasks with different latency requirements, we propose a two-stage training strategy. In the first stage, we train NAST using the CTC loss to obtain reference-monotonic-aligned translation with adaptive latency:

L_CTC = − log p(y | x).

In the second stage, we train NAST using the combination of the non-monotonic latent alignment loss and the alignment-based latency loss:

L = L_NMLA + L_latency.

This further enables NAST to generate translations that are aligned with the source in a monotonic manner while meeting specific latency requirements.

Experimental Setup
Datasets We conduct experiments on the following benchmarks that are widely used in previous SiMT studies: WMT15 German → English (De→En, 4.5M pairs) and WMT16 English → Romanian (En→Ro, 0.6M pairs). For De→En, we use newstest2013 as the validation set and newstest2015 as the test set. For En→Ro, we use newsdev-2016 as the validation set and newstest-2016 as the test set. For each dataset, we apply BPE (Sennrich et al., 2016) with 32k merge operations to learn a joint subword vocabulary shared across the source and target languages.
Implementation Details We select a chunk upsample ratio of 3 (λ = 3) and adjust the chunk waiting parameter k and the threshold l_min in the alignment-based latency loss to achieve varying quality-latency trade-offs. For the first stage of training, we set the dropout rate to 0.3, weight decay to 0.01, and apply label smoothing with a value of 0.01. We train NAST for 300k updates on De→En and 100k updates on En→Ro. A batch size of 64k tokens is utilized, and the learning rate warms up to 5 × 10^-4 within 10k steps. The glancing ratio linearly anneals from 0.5 to 0.3 within 200k steps on De→En and 100k steps on En→Ro.
In the second stage, we apply the latency loss only if the chunk wait strategy is disabled (k = 0). The dropout rate is adjusted to 0.1 for De→En, and no label smoothing is applied to either task. We further train NAST for 10k updates on De→En and 6k updates on En→Ro. A batch size of 256k tokens is utilized to stabilize the gradients, and the learning rate warms up to 3 × 10^-4 within 500 steps. The glancing ratio is fixed at 0.3. During both training stages, all models are optimized using Adam (Kingma and Ba, 2014) with β = (0.9, 0.98) and ϵ = 10^-8. Following the practice in previous research on non-autoregressive generation, we employ sequence-level knowledge distillation (Kim and Rush, 2016) to reduce the target-side dependency in the data. We adopt Transformer-base (Vaswani et al., 2017) as the offline teacher model and train NAST on the distilled data.
Baselines We compare our system with the following strong autoregressive SiMT baselines:

Offline AT Transformer model (Vaswani et al., 2017), which initiates translation after reading all the source tokens. We utilize a unidirectional encoder and employ greedy search decoding for a fair comparison.

Wait-k The wait-k policy (Ma et al., 2019), which initially reads k tokens and subsequently alternates between WRITE and READ actions.
MoE Wait-k Mixture-of-experts wait-k policy (Zhang and Feng, 2021), which employs multiple experts to learn multiple wait-k policies during training. MoE Wait-k is the current SOTA fixed policy.
MMA Monotonic multi-head attention (MMA; Ma et al., 2020) employs a Bernoulli variable to predict the READ/WRITE action and is trained using monotonic attention (Raffel et al., 2017).
HMT Hidden Markov Transformer (HMT; Zhang and Feng, 2023), which treats the moments of starting translating as hidden events and the target sequence as the observed events, organizing them as a hidden Markov model. HMT is the current SOTA adaptive policy.
Metrics To compare SiMT models, we evaluate the translation quality using BLEU score (Papineni et al., 2002) and measure the latency using Average Lagging (AL; Ma et al., 2019). Numerical results with more latency metrics can be found in Appendix B.

Main Results
We compare NAST with the existing AR SiMT methods in Figure 4. On De→En, NAST outperforms all AR SiMT models significantly across all latency settings, particularly in scenarios with very low latency. With the latency in the range of [0, 1], where listeners are almost synchronized with the speaker, NAST achieves a translation quality of 27.73 BLEU, surpassing the current SOTA model HMT by nearly 6 BLEU points. Moreover, NAST demonstrates superior performance compared to the offline AT system even when the AL is as low as 6.85, showcasing its competitiveness in scenarios where higher translation quality is desired. On En→Ro, NAST also exhibits a substantial improvement under low latency conditions, and achieves performance comparable to other models when the latency requirement is not stringent.

Importance of Non-monotonic Alignments
NAST is trained using a non-monotonic alignment loss, enabling it to generate source-monotonically-aligned translations akin to those of human interpreters. This capability empowers NAST to achieve high-quality streaming translations while maintaining low latency. To validate the effectiveness of non-monotonic alignment, we conduct further experiments studying the performance of NAST without L_NMLA. We compare the translation quality (BLEU) and latency (AL) of models employing different chunk wait-k strategies. The results are reported in Table 1 and Table 2. Note that L_latency is not applied here for a clear comparison. We observe that incorporating L_NMLA significantly enhances translation quality by up to 1.86 BLEU, while leaving latency nearly unchanged. We also notice that the improvement is particularly substantial when the latency is low, which aligns with our motivation. Under low latency conditions, SiMT models face more severe non-monotonicity problems. Ideal simultaneous generation requires more reordering of the reference to achieve monotonic alignment with the source sentence, which leads to greater improvements from applying the non-monotonic alignment loss.

Analysis on Hallucination Rate
NAST mitigates the risk of source information leakage during training, thereby minimizing the occurrence of hallucination during inference. To demonstrate this, we compare the hallucination rate (Chen et al., 2021) of hypotheses generated by NAST with that of the current SOTA model, HMT (Zhang and Feng, 2023). A hallucination is defined as a generated token that cannot be aligned to any source word. The results are plotted in Figure 5.
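Given external word alignments between the hypothesis and the source, the hallucination rate reduces to a simple ratio; a sketch with a hypothetical input format (a set of hypothesis indices that an external word aligner linked to some source token):

```python
def hallucination_rate(hypothesis, aligned_positions):
    """Fraction of hypothesis tokens not aligned to any source word."""
    if not hypothesis:
        return 0.0
    unaligned = sum(1 for i in range(len(hypothesis))
                    if i not in aligned_positions)
    return unaligned / len(hypothesis)
```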
We note that the hallucination rates of both models decrease as the latency increases. However, NAST exhibits a significantly lower hallucination rate than HMT. We attribute this to the fact that NAST avoids the bias caused by source-info leakage and enables a more general generation-reference alignment, thus mitigating compelled predictions during training.

Performance across Difficulty Levels
To further illustrate NAST's effectiveness in handling the non-monotonicity problem, we investigate its performance when confronted with samples of varying difficulty levels. It is intuitive to expect that samples with a higher number of cross alignments between the source and reference texts pose a greater challenge for real-time translation. Therefore, we evenly partition the De→En test set into subsets based on the number of crosses in the alignments, categorizing them as easy, medium, and hard, in accordance with the approach of Zhang and Feng (2021). We compare NAST with the previous HMT model, and the results are presented in Figure 6.
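Counting alignment crosses can be sketched as follows, assuming word alignments are given as (source index, target index) pairs: two links cross when their source and target orders disagree:

```python
def count_crosses(alignment_pairs):
    """Count crossing link pairs in a word alignment given as
    (src_idx, tgt_idx) pairs; links (i, j) and (k, l) cross
    when (i - k) * (j - l) < 0."""
    crosses = 0
    for a in range(len(alignment_pairs)):
        for b in range(a + 1, len(alignment_pairs)):
            (i, j), (k, l) = alignment_pairs[a], alignment_pairs[b]
            if (i - k) * (j - l) < 0:
                crosses += 1
    return crosses
```

A perfectly monotonic sentence pair has zero crosses; heavily reordered pairs land in the hard subset.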
Despite the impressive overall performance of NAST, a closer examination of Figure 6 reveals that its superiority is particularly pronounced on the challenging subset. Even when real-time requirements are relatively relaxed, the improvement on the hard subset remains noteworthy. We attribute this to the stringent demand imposed by the hard subset, which requires SiMT models to effectively manage word reorderings to handle the non-monotonicity. NAST benefits from non-monotonic alignment training and excels in addressing these challenges, thus enhancing its performance on those harder samples.

Concerns on Fluency
While the non-autoregressive nature endows NAST with the capability to tackle the non-monotonicity problem and source-info leakage bias, it also exposes NAST to the risk of potential fluency degradation due to the absence of target-side dependency.
To better understand this problem, we measure the fluency (perplexity) of the generated translations, with results shown in Figure 7. Though NAST exhibits significantly improved translation quality, we find its non-autoregressive nature does impact fluency to some extent. However, we consider this trade-off acceptable. In practical scenarios like international conferences where SiMT models are employed, the language used by human speakers is often not perfectly fluent. In such contexts, the audience tends to prioritize the overall translation quality under low latency, rather than the fluency of the generated sentences.

Related Work
SiMT Simultaneous machine translation requires a READ/WRITE policy to balance latency and translation quality, involving fixed and adaptive strategies. For the fixed policy, Ma et al. (2019) proposed wait-k, which first reads k source tokens and then alternates between READ/WRITE actions. Elbayad et al. (2020) introduced an efficient training method for the wait-k policy, which randomly samples k during training. Zhang and Feng (2021) proposed a mixture-of-experts wait-k to learn a set of wait-k policies through multiple experts. For the adaptive policy, Gu et al. (2017) trained an agent to decide READ/WRITE via reinforcement learning. Arivazhagan et al. (2019) introduced MILk, which incorporates a Bernoulli variable to indicate the READ/WRITE action. Ma et al. (2020) proposed MMA to implement MILk on Transformer. Liu et al. (2021) introduced CAAT, which leverages RNN-T and employs a blank token to signify the READ action. Miao et al. (2021) proposed GSiMT to generate the READ/WRITE actions. Chang et al. (2022) proposed to train a causal CTC encoder with a Gumbel-Sinkhorn network (Mena et al., 2018) to reorder the states. Zhang and Feng (2023) proposed HMT to learn when to start translating in the form of an HMM, achieving the current state-of-the-art SiMT performance.
NAR Generation Non-autoregressive models generate tokens in parallel at the sacrifice of target-side dependency (Gu et al., 2018). This property eliminates the need for teacher forcing, motivating researchers to explore flexible training objectives that alleviate the strict position-wise alignment imposed by the naive MLE loss. Libovický and Helcl (2018) proposed a latent alignment model with the CTC loss (Graves et al., 2006), and Shao and Feng (2022) further explored non-monotonic latent alignments. Shao et al. (2020, 2021) introduced sequence-level training objectives with reinforcement learning and bag-of-ngrams difference. Ghazvininejad et al. (2020) trained NAT models using the best monotonic alignment, and Du et al. (2021) further extended it to an order-agnostic cross-entropy loss. In addition, some researchers focus on strengthening the expressive power to capture token dependency. Huang et al. (2022) proposed a directed acyclic graph layer and Gui et al. (2023) introduced a probabilistic context-free grammar layer. Building upon that, Shao et al. (2022) proposed Viterbi decoding and Ma et al. (2023) further explored fuzzy alignment training, achieving the current state-of-the-art NAR model performance. Apart from text translation, NAR models have also demonstrated impressive performance in diverse areas such as speech-to-text translation (Xu et al., 2023), speech-to-speech translation (Fang et al., 2023), and text-to-speech synthesis (Ren et al., 2021).

Conclusion
In this paper, we propose the non-autoregressive streaming Transformer (NAST) to address the non-monotonicity problem and the source-info leakage bias in existing autoregressive SiMT models. Comprehensive experiments demonstrate its effectiveness.

A Derivation of Equation 19
We present the detailed derivation of Equation 19 in this section. Under the conditional independence assumption of NAST, the expectation over alignments decomposes position-wise, so each expectation term reduces to a sum weighted by p(1(a_i)), where p(1(a_i)) denotes the probability that the i-th token in the alignment is reserved after collapsing:

p(1(a_i)) = ∑_{v ≠ ϵ} p(a_i = v) (1 − p(a_{i−1} = v)).

B Numerical Results with More Metrics
In addition to Average Lagging (AL; Ma et al., 2019), we also incorporate Consecutive Wait (CW; Gu et al., 2017), Average Proportion (AP; Cho and Esipova, 2016), and Differentiable Average Lagging (DAL; Arivazhagan et al., 2019) as metrics to evaluate the latency of NAST. We adjust l_min in L_latency and k in the chunk wait-k strategy to achieve varying quality-latency trade-offs. For clarity, we present the numerical results of NAST using specific hyperparameter settings in Table 3 and Table 4. Note that L_latency is applied to achieve lower latency, while the chunk wait-k strategy is employed to improve translation quality. Therefore, we apply L_latency only when k = 0.

C Case Study
To gain further insights into NAST's behavior, we examine the generation processes of two different cases from the De→En test set. We visualize the generation by plotting the generated partial alignments and the collapsed outputs at each step. In Figure 9, we depict another generation case where NAST manages word reorderings at the sentence level in comparison to the reference. To ensure low latency, NAST adjusts the sentence structure while maintaining meaning consistency with the reference. When NAST processes the source words "es sieht so aus", it promptly generates "it looks as if" and continues generating the subsequent words within this grammatical structure. This ensures that listeners keep synchronized with the speaker.

Figure 1 :
Figure 1: Illustration of the non-monotonicity problem and the source-info leakage bias in the training of autoregressive SiMT models. In this case, the AR SiMT model learns to predict at the third time step based on the source contexts "布什 (Bush)", "与 (and)", "沙龙 (Sharon)", and the ground truth contexts "Bush", "held". Although the source token "举行 (hold)" has not been read yet, it is exposed to the AR SiMT model through its corresponding token "held" in the ground truth context.

Figure 3 :
Figure 3: Illustration of cross-attention with different chunk wait-k strategies.

Figure 4 :
Figure 4: Results of translation quality (BLEU) against latency (Average Lagging) on De→En and En→Ro.

Figure 5 :
Figure 5: Results of hallucination rate against latency (Average Lagging) on De→En test set.

Figure 6 :
Figure 6: Performance on De→En test subsets categorized by difficulty.

Figure 7 :
Figure 7: Results of fluency (Perplexity) against latency (Average Lagging) on De→En test set.

Figure 9 :
Figure 9: Case study of #1083 in De→En test set, where we configure NAST with k = 0 and L latency not applied.

Table 1 :
Results of BLEU scores on the De→En test set with or without L_NMLA under different chunk wait-k strategies. L_latency is not applied here.

Table 2 :
Results of Average Lagging on the De→En test set with or without L_NMLA under different chunk wait-k strategies. L_latency is not applied here.

Table 3 :
Numerical results of NAST on De→En. "-" indicates that L_latency is not applied.

Table 4 :
Numerical results of NAST on En→Ro. "-" indicates that L_latency is not applied.