Wait-info Policy: Balancing Source and Target at Information Level for Simultaneous Machine Translation

Simultaneous machine translation (SiMT) outputs the translation while receiving the source inputs, and hence needs to balance the received source information and translated target information to make a reasonable decision between waiting for inputs or outputting translation. Previous methods always balance source and target information at the token level, either directly waiting for a fixed number of tokens or adjusting the waiting based on the current token. In this paper, we propose a Wait-info Policy to balance source and target at the information level. We first quantify the amount of information contained in each token, named info. Then during simultaneous translation, the decision of waiting or outputting is made based on the comparison between the total info of previous target outputs and received source inputs. Experiments show that our method outperforms strong baselines under all latency and achieves a better balance via the proposed info.


Introduction
Simultaneous machine translation (SiMT) (Cho and Esipova, 2016; Gu et al., 2017; Ma et al., 2019) outputs the translation while receiving the source sentence, aiming at the trade-off between translation quality and latency. Therefore, a policy is required for SiMT to decide between waiting for the source inputs (i.e., READ) or outputting translations (i.e., WRITE), the core of which is to wisely balance the received source information and the translated target information. When the source information is less, the model should wait for more inputs for a high-quality translation; conversely, when the translated target information is less, the model should output translations for a low latency.
Existing SiMT policies, involving fixed and adaptive ones, always balance source and target at the token level, i.e., treating each source and target token equally when determining READ/WRITE. Fixed policies decide READ/WRITE based on the number of received source tokens (Ma et al., 2019; Zhang and Feng, 2021c); e.g., the wait-k policy (Ma et al., 2019) simply considers each source token to be equivalent and lets the target outputs always lag the source inputs by k tokens, as shown in Figure 1(a). Fixed policies are limited by the fact that the policy cannot be adjusted according to complex inputs, making it difficult for them to achieve the best trade-off. Adaptive policies predict READ/WRITE according to the current source and target tokens (Arivazhagan et al., 2019; Ma et al., 2020) and thereby get a better trade-off, but they often ignore the differences between tokens when deciding READ/WRITE. Besides, existing adaptive policies always rely on complicated training (Ma et al., 2020; Miao et al., 2021) or additional labeled data (Zheng et al., 2019; Zhang et al., 2020; Alinejad et al., 2021), making them more computationally expensive than fixed policies.
Treating each token equally when balancing source and target is not the optimal choice for SiMT policy. Many studies have shown that different words have significantly different functions in translation (Lin et al., 2018; Moradi et al., 2019; Chen et al., 2020), often divided into content words (i.e., noun, verb, · · ·) and function words (i.e., conjunction, preposition, · · ·), where the former express more important meanings while the latter are less informative. Accordingly, tokens with different amounts of information should also play different roles in the SiMT policy, where more informative tokens should play a more dominant role because they bring more information to the SiMT model (Zhang and Feng, 2022a,b). Therefore, explicitly differentiating various tokens rather than treating them equally when determining READ/WRITE will be beneficial to developing a more precise SiMT policy.
In this paper, we differentiate various source and target tokens based on the amount of information they contain, aiming to balance received source information and translated target information at the information level. To this end, we propose wait-info policy, a simple yet effective policy for SiMT. As shown in Figure 1(b), we first quantify the amount of information contained in each token through a scalar, named info, which is jointly learned with the attention mechanism in an unsupervised manner. During simultaneous translation, READ/WRITE decisions are made by balancing the total info of the translated target outputs and the received source inputs. If the received source information exceeds the translated target information by K info or more, the model outputs a translation; otherwise the model waits for the next input. Experiments and analyses show that our method outperforms strong baselines and effectively quantifies the information contained in each token.

Related Work
SiMT Policy Recent policies fall into fixed and adaptive. For fixed policy, Ma et al. (2019) proposed wait-k policy, which first READs k source tokens and then READs/WRITEs one token alternately. Elbayad et al. (2020) proposed an efficient multi-path training for wait-k policy to randomly sample k during training. Zhang et al. (2021) proposed future-guided training for wait-k policy, which introduces a full-sentence MT model to guide training. Zhang and Feng (2021a) proposed a char-level wait-k policy. Zhang and Feng (2021c) proposed a mixture-of-experts wait-k policy to develop a universal SiMT model. For adaptive policy, Gu et al. (2017) trained an agent to decide READ/WRITE via reinforcement learning. Arivazhagan et al. (2019) proposed MILk, which predicts a Bernoulli variable to determine READ/WRITE. Ma et al. (2020) proposed MMA to implement MILk on Transformer. Zhang and Feng (2022c) proposed dual-path SiMT to enhance MMA with dual learning. Zheng et al. (2020) developed adaptive wait-k through a heuristic ensemble of multiple wait-k models. Miao et al. (2021) proposed a generative framework to generate READ/WRITE decisions. Zhang and Feng (2022a) proposed Gaussian multi-head attention to decide READ/WRITE based on alignments.
Previous policies always treat each token equally when determining READ/WRITE, ignoring the fact that tokens with different amounts of information often play different roles in SiMT policy. Our method aims to develop a more precise SiMT policy by differentiating the importance of various tokens when determining READ/WRITE.
Information Modeling in NMT Linguistics divides words into content words and function words according to their information and functions in the sentence. Therefore, modeling the information contained in each word is often used to improve NMT performance. Moradi et al. (2019) and Chen et al. (2020) used the word frequency to indicate how much information each word contains, where words with lower frequencies contain more information. Liu et al. (2020) and Kobayashi et al. (2020) found that the norm of the word embedding is related to the token information in NMT. Lin et al. (2018) and Zhang and Feng (2021b) argued that the attention mechanism for different types of words should be different, where the attention distribution of content words tends to be more concentrated.
Our method explores the usefulness of modeling information for SiMT policy, and proposes an unsupervised method to quantify the information of tokens through the attention mechanism, achieving good explainability.

Background
Full-sentence MT For a translation task, we denote the source sentence as x = (x_1, · · · , x_n) with source length n and the target sentence as y = (y_1, · · · , y_m) with target length m. Transformer (Vaswani et al., 2017) is the most widely used architecture for full-sentence MT, consisting of an encoder and a decoder. The encoder maps x to source hidden states z = (z_1, · · · , z_n). The decoder maps y to target hidden states s = (s_1, · · · , s_m), and then performs translating. Specifically, each encoder layer contains two sub-layers: self-attention and feed-forward network (FFN), while each decoder layer contains three sub-layers: self-attention, cross-attention and FFN. Both self-attention and cross-attention are implemented through the dot-product attention between query Q and key K, calculated as:

e_ij = (Q_i W_Q)(K_j W_K)^T / sqrt(d_k),    (1)
α_ij = exp(e_ij) / Σ_l exp(e_il),    (2)

where e_ij is the similarity score between Q_i and K_j, and α_ij is the normalized attention weight. d_k is the input dimension, and W_Q and W_K are projection parameters. More specifically, self-attention extracts the monolingual representation of source or target tokens, so the query and key both come from the source hidden states z or target hidden states s. Cross-attention extracts the cross-lingual representation by measuring the correlation between target and source tokens, so the query comes from the target hidden states s, and the key comes from the source hidden states z.

Wait-k Policy Simultaneous machine translation (SiMT) determines when to start translating each target token through a policy. Wait-k policy (Ma et al., 2019) is the most widely used policy for SiMT, which refers to first waiting for k source tokens and then translating and waiting for one token alternately, i.e., the target outputs always lag k tokens behind the source inputs. Formally, when translating y_i, wait-k policy forces the SiMT model to wait for g_k(i) source tokens, where g_k(i) is calculated as:

g_k(i) = min(k + i − 1, n).    (3)
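To make the background concrete, here is a minimal NumPy sketch of the dot-product attention weights and the wait-k schedule g_k(i). It is an illustration under our own simplifications: the multi-head structure is omitted and the projections W_Q, W_K are assumed already applied to the inputs.

```python
import numpy as np

def dot_product_attention(Q, K):
    """Scaled dot-product attention weights: scores e_ij, then row-wise softmax."""
    d_k = Q.shape[-1]
    e = Q @ K.T / np.sqrt(d_k)                 # similarity scores e_ij
    e = e - e.max(axis=-1, keepdims=True)      # shift for numerical stability
    w = np.exp(e)
    return w / w.sum(axis=-1, keepdims=True)   # normalized weights alpha_ij

def wait_k_g(i, k, n):
    """Number of source tokens read before translating y_i under wait-k."""
    return min(k + i - 1, n)
```

For example, with k = 3 and n = 10, the model reads 3 tokens before the first target token and caps at the full source length for late positions.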

Method
To differentiate various tokens when determining READ/WRITE, we quantify the amount of information contained in each source and target token, named info. As shown in Figure 2, we propose info-aware Transformer to jointly learn the quantified info with the attention mechanism in an unsupervised manner. Then, based on the quantified info, we propose wait-info policy to balance the received source information and translated target information. The details are as follows.

Info Quantification
To quantify the amount of information in each token, we use a scalar to represent how much information each token contains, named info. We denote the info of the source tokens and the target tokens as I^src ∈ R^{n×1} and I^tgt ∈ R^{m×1}, respectively, where I^src_j and I^tgt_i represent the info of x_j and y_i, and a higher info means that the token carries more information.
To predict I^src and I^tgt, we introduce two Info Quantizers before the encoder and decoder to respectively quantify the information of each source and target token, as shown in Figure 2. Specifically, the info quantizer is implemented by a 3-layer feed-forward network (FFN):

I^src_j = 2 × sigmoid(FFN(E(x_j))),    (4)
I^tgt_i = 2 × sigmoid(FFN(E(y_i))),    (5)

where E(·) denotes the token embedding. For the formulation of the following wait-info policy, 2 × sigmoid(·) is used to restrict the quantified info I^src_j, I^tgt_i ∈ (0, 2). Further, in a translation task, the source sentence and target sentence should be semantically equivalent (Finch et al., 2005; Guo et al., 2022), so the total information of the source tokens should be equal to that of the target tokens. To this end, we introduce an info-sum loss L_sum to constrain the total info of the source tokens and target tokens, calculated as:

L_sum = (Σ_{j=1}^{n} I^src_j − ζ)^2 + (Σ_{i=1}^{m} I^tgt_i − ζ)^2,    (6)

where ζ is a hyperparameter to represent the total info, and we set ζ = (m+n)/2 (i.e., the average length of source and target) to control the average info to be around 1. Therefore, the final loss L is:

L = L_ce + λ L_sum,    (7)

where L_ce is the original cross-entropy loss for the translation (Vaswani et al., 2017). λ is a hyperparameter and we set λ = 0.3 in our experiments.
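A rough NumPy sketch of the info quantizer and the info-sum loss L_sum. The layer sizes, random untrained weights, and weight scale are our assumptions for illustration; only the 3-layer FFN shape, the 2 × sigmoid bound, and the form of L_sum follow the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_ffn(d_model, d_hidden=64):
    """Random (untrained) weights for a toy 3-layer FFN; scale 0.1 avoids saturation."""
    return [rng.normal(scale=0.1, size=(d_model, d_hidden)),
            rng.normal(scale=0.1, size=(d_hidden, d_hidden)),
            rng.normal(scale=0.1, size=(d_hidden, 1))]

def quantize_info(H, W):
    """Map token representations H (seq_len, d_model) to scalar info in (0, 2)."""
    a = np.maximum(H @ W[0], 0)         # layer 1 + ReLU
    a = np.maximum(a @ W[1], 0)         # layer 2 + ReLU
    logits = (a @ W[2]).squeeze(-1)     # layer 3 -> one scalar per token
    return 2 / (1 + np.exp(-logits))    # 2 * sigmoid, so info lies in (0, 2)

def info_sum_loss(I_src, I_tgt):
    """L_sum: pull both info totals toward zeta = (n + m) / 2."""
    zeta = (len(I_src) + len(I_tgt)) / 2
    return (I_src.sum() - zeta) ** 2 + (I_tgt.sum() - zeta) ** 2
```

With all-ones info on both sides, L_sum is zero, matching the intent that the average info stays around 1.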

Learning of Quantified Info
The form of the quantified info I^src and I^tgt has been constrained through Eq. (4-7), and then the key challenge is how to encourage the quantified info to accurately reflect the amount of information each token contains. Since tokens with different amounts of information often show different preferences in the attention distribution (Lin et al., 2018), we propose an unsupervised method to learn the quantified info through the attention mechanism. As shown in Figure 2, we introduce an info-aware Transformer, consisting of info-aware self-attention and info-consistent cross-attention.
Info-aware Self-attention Self-attentions in both encoder and decoder are used to extract monolingual representations of tokens, where tokens with different amounts of information tend to exhibit different attention distributions (Lin et al., 2018; Zhang and Feng, 2021b). Specifically, tokens with much information, such as content words, tend to pay more attention to themselves. Tokens with less information have less meaning in themselves, so they need more context information and thereby pay less attention to themselves. Therefore, we use the quantified info to bias the tokens' attention to themselves, thereby encouraging those tokens that tend to focus more on themselves to get higher info. Specifically, based on the original self-attention in Eq. (1, 2), we add the quantified info I^τ_i, τ ∈ {src, tgt} (respectively used for encoder and decoder self-attention) to the token's similarity to itself e_ii (Lin et al., 2018), and then normalize with softmax(·) to get the info-aware self-attention β_ij, calculated as:

ê_ij = e_ij + 1[i = j] · I^τ_i,    (8)
β_ij = exp(ê_ij) / Σ_l exp(ê_il).    (9)

If I^τ_i > 1 (i.e., containing more information), the token will pay more attention to itself; otherwise the token will focus more on other tokens to extract context information. Therefore, the info can be learned from the attention distribution.
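The diagonal biasing can be sketched as follows; this is our minimal formulation of the described mechanism, applied to precomputed raw scores e.

```python
import numpy as np

def info_aware_self_attention(e, info):
    """Add info_i to the diagonal score e_ii, then softmax each row.
    e: (n, n) raw similarity scores; info: (n,) quantified info in (0, 2)."""
    biased = e + np.diag(info)                  # token i attends to itself more if info_i is large
    biased = biased - biased.max(axis=-1, keepdims=True)
    w = np.exp(biased)
    return w / w.sum(axis=-1, keepdims=True)
```

With uniform raw scores, a token whose info is close to 2 keeps a larger share of attention on itself than a token whose info is close to 0, which is exactly the preference the loss exploits.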
Info-consistent Cross-attention In addition to modeling the token info in a monolingual context, the consistency of the token info between target and source is also crucial for the SiMT policy, as it ensures that the received source information and the target information can be accurately balanced under the same criterion. For consistency, the target and source tokens with high similarity (i.e., those with high cross-attention scores) should have similar info. Therefore, we scale the cross-attention with the info consistency between target and source, where the info consistency is measured by the L1 distance between target and source info. Info-consistent cross-attention γ_ij is calculated as:

γ̂_ij = α_ij · (2 − |I^tgt_i − I^src_j|),    (10)
γ_ij = γ̂_ij / Σ_l γ̂_il,    (11)

where 2 − |I^tgt_i − I^src_j| ∈ (0, 2] measures the info consistency between y_i and x_j.
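A sketch of scaling cross-attention by the consistency term 2 − |I^tgt_i − I^src_j|. Renormalizing each row afterwards is our assumption, made so that the result remains a probability distribution over source positions.

```python
import numpy as np

def info_consistent_cross_attention(alpha, I_tgt, I_src):
    """Scale cross-attention by info consistency and renormalize rows.
    alpha: (m, n) original cross-attention; I_tgt: (m,); I_src: (n,)."""
    consistency = 2 - np.abs(I_tgt[:, None] - I_src[None, :])   # in (0, 2]
    gamma = alpha * consistency
    return gamma / gamma.sum(axis=-1, keepdims=True)
```

Source positions whose info matches the target token's info get boosted relative to mismatched ones, nudging aligned token pairs toward similar info values.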
Overall, we apply the proposed info-aware self-attention β_ij and info-consistent cross-attention γ_ij to replace the original attention for the learning of the quantified info.

Wait-info Policy
Owing to the quantification and learning of info, we get I^src and I^tgt to reflect how much information the source and target tokens contain. Then, we develop wait-info policy for SiMT to balance source and target at the information level.
Borrowing the idea from the wait-k policy that requires the target outputs to lag behind the source inputs by k tokens (Ma et al., 2019), wait-info policy keeps the target information always less than the received source information by K info, where K is the lagging info, a hyperparameter to control the latency. Formally, we denote the number of source tokens that the SiMT model waits for before translating y_i as g_K(i), calculated as:

g_K(i) = argmin_j { Σ_{l=1}^{j} I^src_l ≥ Σ_{l=1}^{i−1} I^tgt_l + K }.    (12)

The specific decoding process of wait-info policy is shown in Algorithm 1.

Algorithm 1: Wait-info Policy
Input: source inputs x (incremental), lagging info K, ŷ_0 = BeginOfSequence
Output: target outputs ŷ
Init: target idx i = 1, source idx j = 1
1: while ŷ_{i−1} ≠ EndOfSequence do
2:   if Σ_{l=1}^{j} I^src_l ≥ Σ_{l=1}^{i−1} I^tgt_l + K or x is complete then
3:     WRITE: output ŷ_i;
4:     i ← i + 1;
5:   else
6:     READ: wait for next source input x_{j+1};
7:     j ← j + 1;
8: return ŷ;
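The READ/WRITE rule above can be sketched as a small function computing the schedule g_K(i); this is a simplified reading of the policy, and with uniform info of 1 on both sides it reduces to the wait-k schedule.

```python
def wait_info_g(i, I_src, I_tgt_prefix, K):
    """Smallest j such that sum(I_src[:j]) >= sum(I_tgt[:i-1]) + K.
    Returns len(I_src) if even the full source cannot satisfy it."""
    need = sum(I_tgt_prefix[: i - 1]) + K   # target info already output, plus the lag K
    total = 0.0
    for j, info in enumerate(I_src, start=1):
        total += info
        if total >= need:
            return j                        # WRITE becomes possible after j source tokens
    return len(I_src)                       # source exhausted: read everything, then WRITE
```

For example, with all info equal to 1 and K = 3, the schedule is 3, 4, 5, ..., identical to wait-3.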
During training, we mask out the source tokens x_j with j > g_K(i) to simulate the incomplete source sentence. Besides, we apply multi-path training (Elbayad et al., 2020) to randomly sample different K in each batch to improve training efficiency.

Experiment
Datasets

IWSLT15 English→Vietnamese (En→Vi) (133K pairs) We use TED tst2012 (1553 pairs) as the dev set and TED tst2013 (1268 pairs) as the test set. Following the previous setting (Ma et al., 2020), we replace tokens whose frequency is less than 5 with ⟨unk⟩, and the vocabulary sizes of English and Vietnamese are 17K and 7.7K respectively.
WMT15 German→English (De→En) (4.5M pairs) We use newstest2013 (3000 pairs) as the dev set and newstest2015 (2169 pairs) as the test set. BPE (Sennrich et al., 2016) is applied with 32K merge operations and the vocabulary is shared.

System Settings
We conduct experiments on the following systems.
Full-sentence MT Standard Transformer model (Vaswani et al., 2017), which waits for the complete source sentence and then starts translating.
Wait-k Wait-k policy (Ma et al., 2019), which first READ k source tokens, and then alternately READ one token and WRITE one token.
Efficient Wait-k An efficient multi-path training for wait-k (Elbayad et al., 2020), which randomly samples k between batches during training.
Adaptive Wait-k An adaptive policy via a heuristic composition of a set of wait-k models (e.g., k from 1 to 13) (Zheng et al., 2020). Adaptive Wait-k uses the token numbers of the target and source to select a wait-k model to generate a target token, and then decides whether to output or not according to the generation probability.
MoE Wait-k Mixture-of-experts wait-k policy (Zhang and Feng, 2021c), which applies multiple experts to perform wait-k policy with various k to consider the translation under multiple latency.
MMA Monotonic multi-head attention (MMA) (Ma et al., 2020), which uses a Bernoulli variable 0/1 to decide READ/WRITE, where the Bernoulli variable is jointly learned with multi-head attention.
GSiMT Generative SiMT (Miao et al., 2021), which applies a generative framework to predict a Bernoulli variable to decide READ/WRITE, and uses dynamic programming to train the policy.
GMA Gaussian multi-head attention (GMA) (Zhang and Feng, 2022a), which uses a Gaussian prior to learn the alignments in attention, and then performs READ/WRITE based on the alignments.
Wait-info The proposed method in Sec. 4.
The implementations of all systems are based on Transformer (Vaswani et al., 2017) and adapted from the Fairseq Library (Ott et al., 2019). Following Ma et al. (2020), we apply Transformer-Small (4 heads) for En→Vi, and Transformer-Base (8 heads) and Transformer-Big (16 heads) for De→En. Since GSiMT involves dynamic programming with expensive training costs, we only report GSiMT on De→En with Transformer-Base, the same as its original setting (Miao et al., 2021). For evaluation, we report BLEU (Papineni et al., 2002) for translation quality and Average Lagging (AL) (Ma et al., 2019) for latency. Average lagging evaluates the number of tokens lagging behind the ideal policy, calculated as:

AL = (1/τ) Σ_{i=1}^{τ} [ g(i) − (i − 1) · n/m ],

where τ = argmin_i { g(i) = n }, and g(i) is the number of waited source tokens before translating y_i.
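AL can be computed from a read schedule g as below; a sketch following the description above (g is the list g(1..m), and the helper name is ours).

```python
def average_lagging(g, n, m):
    """AL (Ma et al., 2019): mean lag of g(i) behind the ideal diagonal,
    cut off at tau, the first position whose g(i) reaches the source length n."""
    tau = next(i for i in range(1, m + 1) if g[i - 1] == n)
    return sum(g[i - 1] - (i - 1) * n / m for i in range(1, tau + 1)) / tau
```

As a sanity check, the wait-3 schedule on a 10-token pair (g = 3, 4, ..., 10, 10, 10) yields AL = 3, i.e., a constant lag of exactly k tokens.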

Main Results
We compare the proposed wait-info policy with previous policies in Figure 3, where Wait-info outperforms the previous methods under all latency.
Compared with Wait-k and Efficient Wait-k, which directly wait for a fixed number of source tokens, Wait-info balances target outputs and source inputs at the information level, which provides a more flexible SiMT trade-off and thereby brings significant improvements. MoE Wait-k uses multiple experts to fuse the translation under multiple latency to cope with complex inputs, while Wait-info dynamically adjusts READ/WRITE based on the info and thereby deals with complex inputs in a more straightforward manner. Both Adaptive Wait-k and Wait-info are adaptive policies, but Adaptive Wait-k still decides which k to use based on the token numbers of the target outputs and received source inputs (Zheng et al., 2020), while Wait-info decides READ/WRITE based on more refined info and thus performs better. Besides, Adaptive Wait-k trains multiple wait-k models, which is computationally expensive, while Wait-info only trains one model to perform SiMT under different latency.
Compared with the adaptive policies, Wait-info also achieves better performance. Previous adaptive policies often decide READ/WRITE based on the current source and target token (Ma et al., 2020; Zhang and Feng, 2022a), while Wait-info is based on the accumulated source and target info, which is more reasonable for the SiMT policy. More importantly, most adaptive policies rely on complicated and time-consuming training (Zheng et al., 2020) since they involve dynamic programming (Ma et al., 2020; Miao et al., 2021). The training of Wait-info is as simple as that of fixed policies, while its performance is better than that of adaptive policies.

Analysis
We conduct extensive analyses on wait-info policy. Unless otherwise specified, all results are reported on De→En with Transformer-Base.

Ablation Study
Info-aware Self-attention vs. Info-consistent Cross-attention We propose two novel attention mechanisms to learn the quantified info, so we analyze their roles in Figure 4(a). Without info-aware self-attention, the SiMT performance drops 0.7 BLEU on average, showing that info-aware self-attention is beneficial to the learning of quantified info. When removing the info-consistent cross-attention, the latency becomes much higher, which is because some target info exceptionally becomes much larger than the source info. Info-consistent cross-attention ensures the info consistency between similar tokens and thus keeps the latency in a suitable range. When removing both of them, the source or target info is unconstrained and becomes the same value. However, the target info will be slightly larger than the source info (due to L_sum), which is beneficial for SiMT under low latency; we analyze this in Sec. 6.5.

Source Info vs. Target Info Wait-info policy quantifies the info of both source and target tokens, and we respectively fix the source info I^src = 1 or the target info I^tgt = 1 (i.e., degenerating into a wait-k policy that treats each source or target token equally) to compare the effect of only quantifying the source or target info. As shown in Figure 4(b), quantifying the source or target info can both bring significant improvements, where the improvements brought by target info are even more significant.

Improvements on Full-sentence MT
Besides focusing on SiMT, the proposed info-aware Transformer can also improve full-sentence MT. As the full-sentence MT results in Table 1 show, info-aware Transformer improves 0.08 BLEU on En→Vi (Small), 0.59 BLEU on De→En (Base) and 0.39 BLEU on De→En (Big), showing that explicitly modeling token info is also beneficial for NMT.

Comparison on Information Modeling
To model the amount of information contained in each token, we propose an unsupervised method to adaptively learn the info from the attention mechanism. Some previous methods apply heuristic methods to model the information, such as using the token frequency to indicate the amount of information (Moradi et al., 2019; Chen et al., 2020) or associating the norm of the embedding with the token information (Liu et al., 2020; Kobayashi et al., 2020). We apply different methods of information modeling (i.e., via attention, via token frequency and via norm of token embedding) in the proposed wait-info policy, and show the results in Figure 6.

Using the embedding norm to indicate token info is not suitable for the proposed wait-info policy; we argue that this is because the embedding norm is better at identifying specific tokens such as ⟨eos⟩ and punctuation (Kobayashi et al., 2020). Modeling info via attention or via token frequency can both achieve improvements, where our proposed method of learning info from attention performs much better, since jointly learning the info with translation is more flexible than the fixed frequency (Zhang et al., 2022).

Quality of Quantified Info
We expect that the proposed info can reflect the amount of information contained in the token, thus providing reasonable evidence for the SiMT policy.
To verify the quality of the quantified info, we further explore whether the quantified info can distinguish different types of tokens, especially content words and function words as mentioned above. In response to this question, we categorize different tokens using the Universal Part-of-Speech (POS) tagging tool, and draw the info distribution of tokens with different POS via violin plots in Figure 5. Tokens with different parts of speech have obvious differences in info distribution, where content words (e.g., VERB, NOUN, AUX, ADJ, PROPN) generally get larger info, while function words (e.g., CCONJ, SCONJ, ADP, PART, DET) have smaller info, which is in line with our expectations (Xu et al., 2019). Therefore, info can successfully learn the amount of information contained in different tokens, so as to develop a reasonable SiMT policy.

Flexibility on Length Difference
Early-stop Caused by Length Difference The length difference between the two languages is a major challenge for SiMT, especially for wait-k policy. Wait-k policy is sensitive to the length ratio between source and target and sometimes may force the model to finish the target translation before reading the complete source sentence (Ma et al., 2019; Zhang and Feng, 2022d), named early-stop, especially when the source sentence is longer than the target sentence. Formally, wait-k policy will early-stop translating when g_k(m) < n, where g_k(m) = k + m − 1 as defined in Eq. (3), and n and m are the source and target lengths.

Table 3: Proportion of early-stop.
         Wait-k              Wait-info
k      De→En    En→Vi      De→En   En→Vi
1      29.88%   0.39%      0.00%   0.00%
3      22.68%   0.16%      0.00%   0.00%
5      13.09%   0.00%      0.00%   0.00%
7      6.78%    0.00%      0.00%   0.00%
9      3.23%    0.00%      0.00%   0.00%
More importantly, the length difference is always language-specific (Ma et al., 2019), and Table 2 reports the length ratio between source and target on the En→Vi and De→En datasets. As seen, the target sentence in En→Vi is generally longer than the source sentence; on the contrary, the source sentence in De→En is longer (i.e., n > m), which is more prone to early-stop. To study the severity of early-stop, we calculate the proportion of early-stop cases under wait-k policy in Table 3, where over 20% of De→En cases early-stop translating before receiving the complete source sentence under low latency. The essential reason for early-stop is that wait-k policy balances source and target at the token level, where the token-level balance is not the best choice because the number of tokens (i.e., length) is often language-specific.
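The early-stop condition can be checked directly; a sketch using the uncapped wait-k schedule g_k(m) = k + m − 1 from above.

```python
def wait_k_early_stops(k, n, m):
    """Wait-k early-stops when g_k(m) = k + m - 1 < n, i.e. the last target
    token is emitted before the full source of length n has been read."""
    return k + m - 1 < n
```

For instance, a source of 10 tokens and a target of 8 tokens early-stops under wait-1 (it reads only 8 tokens), but not under wait-5, matching the trend in Table 3 that early-stop fades as k grows.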

Source: Gra@@ ham Ab@@ bot@@ t unter@@ zog sich im März 2012 der operation .
Reference: Gra@@ ham Ab@@ bot@@ t went in for surgery in March 2012 .

Figure 8: Case study of No. 1219 in the De→En test set, showing Wait-k (k = 5) and Wait-info (K = 1) under similar latency (AL ≈ 3). To show the process of SiMT more clearly, we align the outputs and inputs in the horizontal direction, indicating which source tokens are received when translating each target token. For source and target info, values larger than the average info (i.e., containing more information) are marked in red, and values smaller than the average info (i.e., containing less information) are marked in blue.
Wait-info Avoids Early-stop Owing to L_sum in Eq. (6), which constrains the total source info to be equal to the total target info, the proposed wait-info policy can learn to adjust the ratio between source and target info according to the length ratio, thereby avoiding early-stop. As shown in Table 2, the average quantified info ratio (target info / source info) is basically the same as the length ratio (source length / target length), which shows that L_sum successfully constrains the equality between total source info and total target info. Therefore, as shown in Table 3, wait-info policy completely avoids the early-stop caused by length difference. Different from the wait-k policy, wait-info policy balances source and target at the info level, where the total info of target and source is the same and language-independent, thereby overcoming the length difference between the two languages.
Wait-info vs. Catch-up To avoid early-stop, Ma et al. (2019) proposed a heuristic approach, Catch-up, for wait-k policy to compensate for the length difference between target and source. Catch-up requires the model to read one additional source token after generating every c target tokens (i.e., trying to read more source tokens to avoid early-stop), where c is a hyperparameter. We compare the performance of 'Wait-k + Catch-up' and Wait-info in Figure 7, where Wait-info performs better since it balances the source and target more flexibly at the info level rather than reading more source tokens according to heuristic rules.
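Catch-up's schedule can be sketched as below; the exact placement of the extra reads is our reading of the heuristic, not the paper's formal definition.

```python
def wait_k_catchup_g(i, k, c, n):
    """Wait-k with catch-up: one extra source token after every c target tokens,
    so g(i) = min(k + i - 1 + floor((i - 1) / c), n)."""
    return min(k + i - 1 + (i - 1) // c, n)
```

With c = 2, the lag behind the diagonal grows by one every two target tokens, which compensates when the source runs roughly 1.5x longer than the target.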

Case Study
To study the specific improvement of the proposed wait-info policy compared to the wait-k policy, we conduct a case study in Figure 8. In Wait-k, the model is forced to wait for a fixed 5 tokens before translating, which makes the model either too aggressive or too conservative in different cases (Zheng et al., 2020). As shown in this case, at the beginning of translation, when translating 'Graham', 2 source tokens are enough, but wait-k policy forces the model to wait for 5 tokens, resulting in unnecessary waiting. When translating the noun 'surgery', the model should have waited until receiving 'operation', but it was forced to output in advance, resulting in the wrong translation 'educated' (marked in green).
In Wait-info, this weakness is ameliorated by quantifying the information in each token rather than considering each token equally. First of all, we find the proposed info can effectively distinguish different tokens, where the content words often get larger info, such as 'sich', 'März' and 'operation' in German, and 'went', 'surgery' and 'March' in English, thereby being more important to the SiMT policy. Owing to the quantified info, when translating 'surgery', the model recognizes that the previous 'der' (i.e., a determiner in German) does not contain enough info, so the model continues to wait for 'operation' and thereby generates the correct translation 'surgery' (marked in red). Overall, in wait-info policy, tokens with larger info, such as verbs and nouns, play a more important role in the model's decision of READ/WRITE, making it easier to ensure that those content words are read before translating.

Conclusion
In this paper, we quantify the information in tokens and propose a wait-info policy accordingly. Experiments show the superiority of our method on SiMT tasks and the good explainability of the quantified info.

Limitations
In this work, we quantify the amount of information contained in each token via a scalar. Although quantifying information as a scalar is intuitive and friendly to SiMT policy, the expression space of a scalar may be limited for some particularly complex situations. Quantifying the information contained in each token through a low-dimensional vector may further improve the performance of wait-info policy. However, how to balance the info in vector form between source and target is also a new challenge, and we leave it for future work.

A Comparison on Settings of Total Info
Based on the semantic equivalence between the source sentence and the target sentence, we introduce L_sum to constrain the total info of the source tokens and target tokens in Eq. (6). L_sum can not only ensure that the total info of the source and target is equal, but also constrain the average info to be around 1, which is friendly to wait-info policy.
In our experiments, we set the total info ζ = (m+n)/2, where n is the length of the source sentence and m is the length of the target sentence. We compare the performance under different ζ settings in Figure 9.

B Extended Analyses on Early-stop
Severity of Early-stop As mentioned in Sec. 6.5, wait-k policy may early-stop translating before receiving the complete source inputs, especially under low latency. The reason for early-stop is g_k(m) < n, caused by the length difference between the source and target. To investigate how seriously early-stop affects translation quality, we calculate the BLEU scores of wait-k policy for early-stop and not-early-stop cases respectively in Figure 10. When the wait-k policy early-stops, the translation quality is 11 BLEU lower on average than in the not-early-stop cases, indicating that early-stop seriously affects SiMT performance.
Why Does Wait-info Avoid Early-stop? The wait-k policy will early-stop translating when g_k(m) < n. For wait-info policy, g_K(m) = argmin_j { Σ_{l=1}^{j} I^src_l ≥ Σ_{l=1}^{m−1} I^tgt_l + K } (defined in Eq. (12)) will almost always reach n, since we introduce the info-sum loss L_sum (defined in Eq. (6)) to constrain Σ_{j=1}^{n} I^src_j = Σ_{i=1}^{m} I^tgt_i.

C Numerical Results
Besides Average Lagging (AL) (Ma et al., 2019), we also use Consecutive Wait (CW) (Gu et al., 2017), Average Proportion (AP) (Cho and Esipova, 2016) and Differentiable Average Lagging (DAL) (Arivazhagan et al., 2019) to evaluate the latency of the SiMT model. We use g(i) to record the number of source tokens received when translating y_i. The latency metrics are calculated as follows.
Consecutive Wait (CW) (Gu et al., 2017) evaluates the average number of source tokens waited between two target tokens, calculated as:

CW = Σ_{i=1}^{m} (g(i) − g(i−1)) / Σ_{i=1}^{m} 1[g(i) − g(i−1) > 0],

where g(0) = 0 and the denominator counts the number of times the model waits for source tokens. Average Proportion (AP) (Cho and Esipova, 2016) measures the proportion of the received source tokens, calculated as:

AP = (1 / (n·m)) Σ_{i=1}^{m} g(i).

Differentiable Average Lagging (DAL) (Arivazhagan et al., 2019) is a differentiable version of average lagging, calculated as:

DAL = (1/m) Σ_{i=1}^{m} [ g′(i) − (i − 1) · n/m ],
g′(i) = g(1) if i = 1, otherwise max(g(i), g′(i−1) + n/m).

Numerical Results Tables 4, 5 and 6 report the numerical results of all systems in our experiments, evaluated with BLEU for translation quality and CW, AP, AL and DAL for latency.
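The three latency metrics can be sketched together from a read schedule g; this is our implementation of the formulas as described above, with the DAL recursion carried in g′.

```python
def latency_metrics(g, n, m):
    """Return (CW, AP, DAL) for a read schedule g = [g(1), ..., g(m)]."""
    g = list(g)
    diffs = [g[0]] + [g[i] - g[i - 1] for i in range(1, m)]   # g(i) - g(i-1), g(0) = 0
    cw = sum(diffs) / sum(1 for d in diffs if d > 0)          # tokens read per READ step
    ap = sum(g) / (n * m)                                     # proportion of source consumed
    g_prime, dal = [], 0.0
    for i in range(1, m + 1):
        prev = g_prime[-1] + n / m if g_prime else g[0]       # enforced minimum pace
        g_prime.append(g[i - 1] if i == 1 else max(g[i - 1], prev))
        dal += g_prime[-1] - (i - 1) * n / m                  # lag behind the diagonal
    return cw, ap, dal / m
```

On the wait-3 schedule with n = m = 10, this gives CW = 1.25, AP = 0.72 and DAL = 3.0.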

Figure 1: Illustration of the two policies. Wait-k policy: treats each token equally, and lags k tokens. Wait-info policy: quantifies the information in each token, named info (e.g., 0.5, 1.7, · · ·), and keeps the target information always less than the received source information by K info.

Figure 2 :
Figure 2: Architecture of the proposed info-aware Transformer, where we omit residual connection and layer normalization in the figure for clarity.
(a) Effects of the two attentions: Wait-info, w/o Info-consistent Cross-attention, w/o Info-aware Self-attention, w/o Both, and Wait-k. (b) Effects of source and target info: Wait-info (source & target info), Only target info, Only source info, Wait-k (no info), and Full-sentence MT.

Figure 4 :
Figure 4: Ablation Studies on wait-info policy.
Panels: distribution of source info on different POS; distribution of target info on different POS.

Figure 5 :
Figure 5: Info distribution on different parts of speech (POS), where POS marked in red is often the content word, POS marked in blue is often the function word.

Figure 6 :
Figure 6: Comparison of different methods of information modeling in wait-info policy, including via attention, token frequency and embedding norm.

Figure 7 :
Figure 7: Comparison of Wait-info and Catch-up.

Figure 9 :
Figure 9: Comparison on different settings of total info ζ in Eq. (6), where n is the length of the source sentence and m is the length of the target sentence.
The compared settings include ζ = (m+n)/2, ζ = m, and ζ = n. Our method is not sensitive to the setting of ζ and achieves almost similar performance under different settings.

Table 1 :
Improvements on full-sentence MT.

Table 4 :
Numerical results on En→Vi with Transformer-Small.

Table 5 :
Numerical results on De→En with Transformer-Base.

Table 6 :
Numerical results on De→En with Transformer-Big.