Reducing Position Bias in Simultaneous Machine Translation with Length-Aware Framework

Simultaneous machine translation (SiMT) starts translating while receiving the streaming source inputs, and hence the source sentence is always incomplete during translation. Different from full-sentence MT using the conventional seq-to-seq architecture, SiMT often applies a prefix-to-prefix architecture, which forces each target word to align with only a partial source prefix to adapt to the incomplete source in streaming inputs. However, the source words in the front positions are always illusorily considered more important since they appear in more prefixes, resulting in position bias, which makes the model pay more attention to the front source positions during testing. In this paper, we first analyze the phenomenon of position bias in SiMT, and then develop a Length-Aware Framework to reduce the position bias by bridging the structural gap between SiMT and full-sentence MT. Specifically, given the streaming inputs, we first predict the full-sentence length and then fill the future source positions with positional encoding, thereby turning the streaming inputs into a pseudo full-sentence. The proposed framework can be integrated into most existing SiMT methods to further improve performance. Experiments on two representative SiMT methods, including the state-of-the-art adaptive policy, show that our method successfully reduces the position bias and thereby achieves better SiMT performance.


Introduction
Simultaneous machine translation (SiMT) (Cho and Esipova, 2016; Gu et al., 2017; Ma et al., 2019; Arivazhagan et al., 2019) starts translating while receiving the streaming source inputs, which is crucial to many live scenarios, such as simultaneous interpretation, live broadcast and synchronized subtitles. Compared with full-sentence machine translation (MT), which waits for the complete source sentence, SiMT is more challenging since the source sentence is always incomplete during translation.

* Corresponding author: Yang Feng.
To process the incomplete source, SiMT has a different architecture from full-sentence MT, as shown in Figure 1. Full-sentence MT applies the seq-to-seq architecture (Sutskever et al., 2014), where each target word is translated based on the complete source sentence. SiMT instead applies the prefix-to-prefix architecture (Ma et al., 2019), which forces each target word to align with only a source prefix rather than the complete source sentence, where the source prefix consists of the source words in the front positions and its length is monotonically non-decreasing over decoding steps.
Although the prefix-to-prefix architecture effectively adapts to the streaming inputs by removing the subsequent source words, it intensifies the structural gap between SiMT and full-sentence MT, resulting in the following issues. First, since each target word is forced to align with a monotonically non-decreasing source prefix, the source words in different positions are no longer treated fairly. Specifically, the source words in the front positions participate in the translation of more target words due to their earlier appearance, and hence are always illusorily considered more important, resulting in position bias (Ko et al., 2020; Yan et al., 2021). Due to the position bias, the SiMT model prefers to pay more attention to the source words in the front positions during testing, which not only robs the attention of the words that are supposed to be aligned (increasing mis-translation errors) (Zhang and Feng, 2021b), but also results in great overlap among attention distributions (aggravating duplicate translation errors) (Elbayad et al., 2020). We will analyze the detailed causes and disadvantages of position bias in Sec.3. Second, the prefix-to-prefix architecture directly removes the subsequent source words, resulting in the loss of some potential full-sentence information (Zhang et al., 2021). Most importantly, prefix-to-prefix training makes the model insensitive to the full-sentence length, which could otherwise provide global planning for translation.
On these grounds, we propose a Length-Aware Framework (LAF) for SiMT to turn the incomplete source into a pseudo full-sentence, thereby reducing the position bias. We aim to extend the incomplete source sentence in SiMT to the full-sentence length while guaranteeing that future source words are not leaked, so that the streaming-input constraint still holds during testing. To this end, LAF first predicts the full-sentence length based on the current incomplete source sentence. Then, LAF fills the future source positions (between the current source length and the predicted full-sentence length) with positional encoding (Vaswani et al., 2017) to construct the pseudo full-sentence. Accordingly, each target word is translated based on the pseudo full-sentence and is no longer forced to align with the source prefix. LAF can be integrated into most existing SiMT methods to further improve performance by bridging the structural gap between SiMT and full-sentence MT.
We apply LAF on two representative and strong SiMT methods, and experiments on IWSLT15 En→Vi and WMT15 De→En tasks show that our method achieves better performance in both cases.

Background
We first introduce full-sentence MT and SiMT with the focus on the prefix-to-prefix architecture.

Full-sentence Machine Translation
For a translation task, we denote the source sentence as $\mathbf{x}=\{x_1,\cdots,x_J\}$ with source length $J$, and the target sentence as $\mathbf{y}=\{y_1,\cdots,y_I\}$ with target length $I$. Transformer (Vaswani et al., 2017) is currently the most widely used model for full-sentence MT, which consists of an encoder and a decoder. The encoder maps $\mathbf{x}$ into the source hidden states $\mathbf{h}=\{h_1,\cdots,h_J\}$, and the decoder generates the $i$-th target word $y_i$ based on the source hidden states $\mathbf{h}$ and the previous target words $y_{<i}$. Overall, the decoding probability of full-sentence MT is:
$$p\left(\mathbf{y}\mid\mathbf{x}\right)=\prod_{i=1}^{I}p\left(y_i\mid\mathbf{x},y_{<i}\right).$$
Attention. Transformer calculates the attention weights with dot-product attention, where the encoder-decoder cross-attention $\alpha_{ij}$ is calculated based on the previous target hidden state $s_{i-1}$ and the source hidden state $h_j$:
$$\alpha_{ij}=\mathrm{softmax}\left(\frac{\left(s_{i-1}W^{Q}\right)\left(h_jW^{K}\right)^{\top}}{\sqrt{d_k}}\right),\tag{2}$$
where $W^{K}$ and $W^{Q}$ are projection matrices, and $d_k$ is the input dimension.
Positional encoding. Transformer (Vaswani et al., 2017) encodes each position $pos$ with the sinusoidal positional encoding:
$$\mathrm{PE}_{(pos,2i)}=\sin\left(pos/10000^{2i/d_{model}}\right),\quad \mathrm{PE}_{(pos,2i+1)}=\cos\left(pos/10000^{2i/d_{model}}\right),$$
where $d_{model}$ is the dimension of the input embedding.
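The sinusoidal positional encoding above can be sketched in plain Python (the function name and the list-based vector representation are illustrative, not part of the paper):

```python
import math

def positional_encoding(pos: int, d_model: int) -> list[float]:
    """Sinusoidal positional encoding of a single position (Vaswani et al., 2017):
    even dimensions use sin, odd dimensions use cos, with geometrically
    increasing wavelengths across dimension pairs."""
    pe = [0.0] * d_model
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        pe[i] = math.sin(angle)
        if i + 1 < d_model:
            pe[i + 1] = math.cos(angle)
    return pe
```

Because the encoding depends only on the position index, it can be computed for positions whose words have not been received yet, which is what LAF later exploits.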

Simultaneous Machine Translation
Different from full-sentence MT, which waits for the complete sentence, SiMT translates concurrently with the streaming inputs, and hence the prefix-to-prefix architecture (Ma et al., 2019) is proposed to adapt to the incomplete source, where the target word $y_i$ is generated based on a partial source prefix.
Prefix-to-prefix architecture. Let $g(i)$ be a monotonically non-decreasing function of $i$ that denotes the length of the received source sentence (i.e., the source prefix) when translating the target word $y_i$. Given $g(i)$, the probability of generating the target word $y_i$ is $p\left(y_i\mid x_{\leq g(i)},y_{<i}\right)$, where $x_{\leq g(i)}$ is the first $g(i)$ source words and $y_{<i}$ is the previous target words. Overall, the decoding probability of SiMT is:
$$p\left(\mathbf{y}\mid\mathbf{x}\right)=\prod_{i=1}^{I}p\left(y_i\mid x_{\leq g(i)},y_{<i}\right).$$
To determine $g(i)$ during the translating process, SiMT requires a policy to decide whether to 'translate' a target word or 'wait' for the next source word; existing policies fall into fixed policies and adaptive policies.
Fixed policy performs 'waiting' or 'translating' according to pre-defined rules. The wait-k policy (Ma et al., 2019) is the most widely used fixed policy, which first waits for k source words and then alternately translates one target word and waits for one source word. Besides, Ma et al. (2019) also proposed a test-time wait-k policy, which uses a full-sentence model to perform the wait-k policy in testing.
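Concretely, the wait-k schedule fixes g(i) in closed form; a minimal sketch (1-indexed target position, assuming the source has src_len words in total):

```python
def g_wait_k(i: int, k: int, src_len: int) -> int:
    """Number of source words read when translating the i-th target word
    under the wait-k policy: read k words first, then alternate
    write-one / read-one, capped at the full source length."""
    return min(k + i - 1, src_len)
```

For example, with k=3 and a 10-word source, the model reads 3 words before the first target word and has consumed the whole source from the 8th target word onward.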
Adaptive policy can dynamically adjust 'waiting' or 'translating' according to the current state. Monotonic multi-head attention (MMA) (Ma et al., 2020) is the current state-of-the-art adaptive policy, which predicts a Bernoulli action READ/WRITE to decide to wait for the next source word (READ) or translate a target word (WRITE). To train the Bernoulli actions, MMA predicts the writing probability of y i when receiving x j , denoted as β ij , and uses it to approximate the READ/WRITE actions during training (Arivazhagan et al., 2019).

Preliminary Analysis on Position Bias
In this section, we analyze the influence and cause of position bias in SiMT. In full-sentence MT, the source sentence is complete, so each source word participates in the translation of all target words. In the prefix-to-prefix architecture for SiMT, however, each target word is forced to align with an increasing source prefix, which directly causes the source words in the front positions to participate in the translation of more target words during training, and hence to be illusorily considered more important during testing, resulting in position bias. Please refer to Appendix A for a theoretical analysis of the position bias.
During testing, position bias is reflected in the preference for paying more attention to the source words in front positions. To explore the specific impact of position bias, we select the samples with the same source length (77 sentences) in the WMT15 De→En test set as a bucket, and then calculate the average attention weight obtained by each source position in the bucket. Since the number of times each source position is attended to may differ in SiMT, the average attention weight is averaged over the times of being attended, so the evaluation is fair for each source position. Specifically, given the attention weight $\alpha_{ij}$ between target word $y_i$ and source word $x_j$, the average attention weight $A_j$ at source position $j$ is calculated as:

$$A_j=\frac{\sum_{i=1}^{I}\alpha_{ij}}{\sum_{i=1}^{I}\mathbb{1}_{j\leq g(i)}},\tag{6}$$
where $\sum_{i=1}^{I}\alpha_{ij}$ is the sum of attention on the $j$-th source position, and $\sum_{i=1}^{I}\mathbb{1}_{j\leq g(i)}$ counts the number of times the $j$-th source position is attended to.
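The average-attention statistic above can be sketched as follows (0-indexed positions; `alpha[i][j]` and `g[i]` mirror the notation in the text):

```python
def average_attention(alpha, g):
    """Average attention A_j per source position.
    alpha[i][j]: attention of target word i on source position j (0-indexed);
    g[i]: number of source words visible when producing target word i.
    Each A_j is normalized by the number of times position j was attendable,
    so front and back positions are compared fairly."""
    num_tgt = len(alpha)
    num_src = len(alpha[0])
    A = []
    for j in range(num_src):
        total = sum(alpha[i][j] for i in range(num_tgt) if j < g[i])
        times = sum(1 for i in range(num_tgt) if j < g[i])
        A.append(total / times if times else 0.0)
    return A
```

Note the denominator: a back position attended only once can still show a high average weight, which explains the caveat about back positions in the analysis below.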
What is position bias? Figure 4(a) shows the average attention obtained by different source positions in two representative SiMT methods, compared with full-sentence MT. SiMT differs significantly from full-sentence MT in the average attention over source positions. In full-sentence MT, the average attention on each position is similar, with the back positions getting slightly more attention (Voita et al., 2021). However, in both the fixed and adaptive policies in SiMT, the front source positions obviously get more attention due to position bias, especially the first source word. Compared with wait-k, MMA alleviates the position bias by dynamically adjusting 'waiting' or 'translating', but the first source position still abnormally gets more attention. Note that the average attention on the back positions in SiMT is higher since they are attended fewer times (the denominator in Eq.(6) is smaller).

Does position bias affect SiMT performance? To analyze whether the position bias in SiMT results in poor translation quality, we use the ratio of the average attention on the first source position to that on all positions ($A_1/\sum_j A_j$) to reflect the degree of position bias, and accordingly divide the WMT15 De→En test set into 5 equal parts. We report the translation quality of these 5 parts in Figure 3, where the position bias grows heavier from 'Bottom' to 'Top'. The translation quality of both wait-k and MMA significantly decreases as the position bias becomes heavier, while full-sentence MT retains high translation quality on all parts. More importantly, as the position bias intensifies, the performance gap between SiMT and full-sentence MT is amplified: wait-k and MMA are 9.85 BLEU and 7.03 BLEU lower than full-sentence MT respectively on the 'Top' set. Therefore, position bias is an important cause of the performance gap between SiMT and full-sentence MT.
What causes the position bias? To verify that the preference for front source positions is caused by the structural gap between SiMT and full-sentence MT rather than by the streaming inputs during testing, we compare the average attention of wait-k and 'test-time wait-k' in Figure 4(b), where 'test-time wait-k' is trained with the full-sentence structure and tested with the wait-k policy. After replacing the prefix-to-prefix architecture with the seq-to-seq architecture during training, the position bias in 'test-time wait-k' is significantly weakened, which shows that prefix-to-prefix training is the main cause of position bias. However, directly training with the full-sentence structure leaks many future source words, and the resulting training-testing mismatch leads to the inferior translation quality of 'test-time wait-k' (Ma et al., 2019). In practice, the prefix-to-prefix architecture forces the target word to assign attention to the prefix even if its corresponding source word has not been read in, which inevitably makes the attention chaotic and biased toward the front positions. This also explains why the position bias is more serious in the fixed policy: its read/write decisions cannot be adjusted, so there are more cases where the prefix does not contain the corresponding source word but is still forced to receive attention. Besides, the prefix-to-prefix architecture increases the frequency of front source positions during training, and previous works (Zhou and Liu, 2006; Luong et al., 2015) show that NMT models have a tendency towards over-fitting on high-frequency words.
Specific attention characteristics. Furthermore, we compare the characteristics of the attention distribution in full-sentence MT and SiMT (Zhang and Feng, 2022a), and find two properties of SiMT attention that are not conducive to translation. First, the biased attention on front positions robs the attention of the aligned source word, resulting in mis-translation errors. Second, the large overlap among attention distributions aggravates duplicate translation errors; a human evaluation by Elbayad et al. (2020) shows that the duplication error rate in SiMT is 500% of that in full-sentence MT. Besides, in some cases, even if the aligned source words have not been received, the prefix-to-prefix architecture still forces the target word to align with the irrelevant source prefix, resulting in confusion in attention (Chen et al., 2021).

The Proposed Method
Based on the preliminary analyses, we develop a Length-Aware Framework (LAF) to turn the streaming inputs into a pseudo full-sentence, thereby reducing the position bias. As shown in Figure 5, given the partial source words in the streaming inputs, LAF first predicts the full-sentence length. Accordingly, LAF extends the current incomplete source to the full-sentence length by filling the future source positions with positional encoding. The details are introduced in the following.

Length-Aware Framework
Length prediction. To turn the incomplete source into a pseudo full-sentence, LAF first predicts the full-sentence length. At step $i$, based on the currently received source words $x_{\leq g(i)}$, LAF predicts the full-sentence length $L_i$ through a classification task. Note that the predicted length dynamically changes as more source words are received.
Formally, the probability of the full-sentence length $L_i$ is predicted through a multi-layer perceptron (MLP) based on the mean of the hidden states of the currently received source words:
$$p\left(L_i\mid x_{\leq g(i)}\right)=\mathrm{softmax}\left(W\tanh\left(V\bar{h}_i\right)\right),\quad \bar{h}_i=\frac{1}{g(i)}\sum_{j=1}^{g(i)}h_j,$$
where $\bar{h}_i$ is the mean of the hidden states of the currently received source words, and $V\in\mathbb{R}^{d_{model}\times d_{model}}$ and $W\in\mathbb{R}^{N_{max}\times d_{model}}$ are the parameters of the MLP, where $N_{max}$ is the maximum source sentence length in the corpus. Note that $\mathrm{softmax}(\cdot)$ is normalized over all possible length values. In testing, the value with the highest probability is selected as the full-sentence length.
If the source sentence is already complete (i.e., ⟨eos⟩ has been received) or the predicted length $L_i$ is not larger than the received source length ($L_i\leq g(i)$), we use the current length $g(i)$ as the full-sentence length.
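The length prediction above can be sketched in plain Python; the tanh activation inside the MLP is an assumption (the text only specifies an MLP with parameters V and W followed by a softmax), and the list-of-lists matrix representation is purely illustrative:

```python
import math

def predict_length(h, V, W):
    """Sketch of LAF length prediction: mean-pool the hidden states of the
    received source words, apply an MLP (tanh assumed), softmax over all
    candidate lengths, and return the most probable full-sentence length.
    h: list of hidden-state vectors; V: d_model x d_model; W: N_max x d_model."""
    d = len(h[0])
    # mean of the hidden states of the currently received source words
    mean = [sum(vec[k] for vec in h) / len(h) for k in range(d)]
    hidden = [math.tanh(sum(V[r][k] * mean[k] for k in range(d)))
              for r in range(len(V))]
    logits = [sum(W[r][k] * hidden[k] for k in range(len(hidden)))
              for r in range(len(W))]
    # softmax over all possible length values (numerically stabilized)
    m = max(logits)
    probs = [math.exp(x - m) for x in logits]
    z = sum(probs)
    probs = [p / z for p in probs]
    # class r corresponds to length r + 1
    return max(range(len(probs)), key=lambda r: probs[r]) + 1
```

In the framework itself, this predicted length would then be clamped to the received length g(i) whenever the prediction is not larger, as described above.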
Pseudo full-sentence. Given the predicted full-sentence length, we fill the future source positions $(g(i),L_i]$ with positional encoding to construct the pseudo full-sentence. Formally, given the hidden states of the received source words $h_{\leq g(i)}$ and the predicted full-sentence length $L_i$, we fill the future positions with positional encoding to get the pseudo full-sentence hidden states $h^{(i)}$ at step $i$:
$$h^{(i)}_j=\begin{cases}h_j, & j\leq g(i)\\ \mathrm{PE}(j), & g(i)<j\leq L_i\end{cases}.$$
Then, the target word $y_i$ is generated based on the pseudo full-sentence hidden states $h^{(i)}$ and the previous target words $y_{<i}$, and hence the cross-attention $\alpha_{ij}$ in Eq.(2) is rewritten as:
$$\alpha_{ij}=\mathrm{softmax}\left(\frac{\left(s_{i-1}W^{Q}\right)\left(h^{(i)}_jW^{K}\right)^{\top}}{\sqrt{d_k}}\right).$$
Overall, the decoding probability of the length-aware framework is:
$$p\left(\mathbf{y}\mid\mathbf{x}\right)=\prod_{i=1}^{I}p\left(y_i\mid h^{(i)},y_{<i}\right).$$
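The pseudo full-sentence construction is a simple concatenation; a minimal sketch (names are illustrative, and `pe` stands for any positional-encoding function such as the sinusoidal one from Sec. 2):

```python
def pseudo_full_sentence(h_received, pred_len, pe):
    """Build pseudo full-sentence hidden states: keep the g(i) received
    hidden states unchanged and fill the future positions (g(i), pred_len]
    with positional encodings only, so no future word content is leaked.
    pe(j) returns the positional encoding of 0-indexed position j."""
    g = len(h_received)
    # if the predicted length is not larger than g, fall back to g (Sec. 5.1)
    return list(h_received) + [pe(j) for j in range(g, max(pred_len, g))]
```

The decoder then attends over this fixed-length sequence, so target words can place attention mass on future slots instead of being forced onto the received prefix.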

Training Objective
The length-aware framework consists of a length prediction module and a translation module. For the length prediction module, we take the complete source length $J$ as the ground-truth length label and train the module with the cross-entropy loss:
$$\mathcal{L}_{length}=-\sum_{i=1}^{I}\log p\left(J\mid x_{\leq g(i)}\right).$$
For the translation module, we complement the source prefix to the ground-truth source length $J$ with positional encoding and train the translation module by minimizing the cross-entropy loss:
$$\mathcal{L}_{trans}=-\sum_{i=1}^{I}\log p\left(y^{\star}_i\mid h^{(i)},y^{\star}_{<i}\right),$$
where $\mathbf{y}^{\star}$ is the ground-truth target sentence. During testing, we apply the predicted full-sentence length to complement the source prefix. We will compare the performance of training with the ground-truth or the predicted full-sentence length in Sec.7.1. Finally, the total loss of LAF is calculated as:
$$\mathcal{L}=\mathcal{L}_{trans}+\mathcal{L}_{length}.$$
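A minimal sketch of the combined training objective, assuming the model already exposes per-step length distributions and per-token probabilities of the ground-truth targets (the unweighted sum of the two losses follows the formulation above):

```python
import math

def laf_loss(len_probs, true_len, token_probs):
    """Total LAF loss = translation cross-entropy + length cross-entropy.
    len_probs: per-step length distributions (len_probs[i][L-1] = p(L) at step i);
    true_len: ground-truth source length J (1-indexed label);
    token_probs: model probability of each ground-truth target token."""
    l_length = -sum(math.log(p[true_len - 1]) for p in len_probs)
    l_trans = -sum(math.log(p) for p in token_probs)
    return l_trans + l_length
```

In practice both terms would be computed on batched tensors inside the training loop; this scalar version just makes the objective explicit.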

Integrated into SiMT Policy
The length-aware framework can be integrated into most existing SiMT methods. We take wait-k and MMA as representatives to introduce the slight differences when integrating it into fixed and adaptive policies respectively. LAF predicts the full-sentence length based on the currently received source words $x_{\leq g(i)}$, so the key is to calculate $g(i)$, which differs between fixed and adaptive policies.
Fixed policy. Since wait-k is a pre-defined fixed policy, $g_{wait\text{-}k}(i)$ during both training and testing is invariably calculated as:
$$g_{wait\text{-}k}(i)=\min\left\{k+i-1,\,J\right\}.$$
Adaptive policy. Since MMA dynamically predicts READ/WRITE actions, the calculation of $g(i)$ differs between training and testing. During testing, we take the number of source words received by the model when it starts to translate $y_i$ as $g(i)$. During training, MMA does not have explicit READ/WRITE actions, but predicts the writing probability $\beta_{ij}$, where $\beta_{ij}$ represents the probability of translating $y_i$ after receiving source word $x_j$. Therefore, we select the position of $x_j$ with the highest writing probability as $g_{mma}(i)$:
$$g_{mma}(i)=\mathop{\mathrm{argmax}}_{j}\ \beta_{ij}.$$
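The adaptive-policy case reduces to an argmax over the writing probabilities; a one-line sketch (1-indexed source positions, matching the notation above):

```python
def g_mma_train(beta_i):
    """During MMA training there are no explicit READ/WRITE actions, so LAF
    takes the (1-indexed) source position with the highest writing
    probability beta_ij as g(i). beta_i[j-1] = beta_{i,j}."""
    return max(range(len(beta_i)), key=lambda j: beta_i[j]) + 1
```

The fixed-policy counterpart g_wait-k(i) is the closed-form min{k+i-1, J} schedule from Sec. 2 and needs no learned quantities.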

Related Work
The main architectures of SiMT models fall into two categories: the seq-to-seq architecture and the prefix-to-prefix architecture. Early SiMT methods often used a full-sentence MT model trained with the seq-to-seq architecture to translate each segment divided by the SiMT policy (Bangalore et al., 2012; Cho and Esipova, 2016). Gu et al. (2017) used reinforcement learning to train an agent to decide whether to start translating. Later work added a predict operation based on Gu et al. (2017). Zhang et al. (2020b) proposed an adaptive segmentation policy based on meaning units. However, the mismatch between training and testing usually leads to inferior translation quality.
The recent SiMT methods, including fixed and adaptive policies, mainly use the prefix-to-prefix architecture. For the fixed policy, Ma et al. (2019) proposed the wait-k policy, which always generates the target token k words behind the source token. Zhang and Feng (2021a) proposed a char-level wait-k policy. Zhang and Feng (2021c) proposed a universal SiMT with the mixture-of-experts wait-k policy. For the adaptive policy, Zheng et al. (2019a) trained an agent with the golden read/write action sequence. Zheng et al. (2019b) added a "delay" token and introduced limited dynamic prediction. Arivazhagan et al. (2019) proposed MILk, using a Bernoulli variable to determine whether to write. Ma et al. (2020) proposed MMA to implement MILk on the Transformer. Wilken et al. (2020) and Zhang and Feng (2022b) proposed alignment-based SiMT policies. Liu et al. (2021a) proposed a cross-attention augmented transducer for SiMT. Zhang et al. (2021) and Alinejad et al. (2021) introduced a full-sentence model to guide the SiMT policy. Miao et al. (2021) proposed a generative SiMT policy.
Although the prefix-to-prefix architecture simulates the streaming inputs, it brings the position bias described in Sec.3. Therefore, we propose a length-aware framework to reduce the position bias while still respecting the streaming inputs.

Datasets
We evaluate LAF on the following datasets.
IWSLT15 English→Vietnamese (En→Vi) (133K pairs) (Cettolo et al., 2015) We use TED tst2012 as the validation set (1553 pairs) and TED tst2013 as the test set (1268 pairs). Following the previous settings (Raffel et al., 2017; Ma et al., 2020), we replace tokens whose frequency is less than 5 with ⟨unk⟩, and the vocabulary sizes are 17K and 7.7K for English and Vietnamese respectively.

Systems Setting
We conduct experiments on the following systems.
Wait-k The wait-k policy proposed by Ma et al. (2019), the most widely used fixed policy, which first waits for k source words and then alternately translates a target word and waits for a source word.
MMA Monotonic multi-head attention (MMA) proposed by Ma et al. (2020), the SOTA adaptive policy. At each step, MMA predicts a Bernoulli variable to decide whether to start translating.
* + LAF Applying proposed length-aware framework on Wait-k or MMA.
The implementations of all systems are adapted from the Fairseq library (Ott et al., 2019) based on Transformer (Vaswani et al., 2017) with the same settings as Ma et al. (2020). For En→Vi, we apply Transformer-Small (4 heads). For De→En, we apply Transformer-Base (8 heads) and Transformer-Big (16 heads). We evaluate these systems with BLEU (Papineni et al., 2002) for translation quality and Average Lagging (AL) (Ma et al., 2019) for latency. AL is calculated based on $g(t)$:
$$\mathrm{AL}=\frac{1}{\tau}\sum_{t=1}^{\tau}\left(g(t)-\frac{t-1}{|\mathbf{y}|/|\mathbf{x}|}\right),$$
where $\tau=\mathop{\mathrm{argmin}}_{t}\left(g(t)=|\mathbf{x}|\right)$, and $|\mathbf{x}|$, $|\mathbf{y}|$ are the source and target lengths respectively.

Figure 6 shows the performance improvement that LAF brings to Wait-k and MMA, where our method achieves higher translation quality under all latency levels. LAF brings a more significant improvement to the fixed policy Wait-k, improving about 0.28 BLEU on En→Vi, 1.94 BLEU on De→En(Base), and 1.50 BLEU on De→En(Big), because the position bias in the original wait-k is more serious. Combined with the SOTA adaptive policy MMA, our method also performs better and is much closer to full-sentence MT performance.
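The AL metric can be sketched as follows (1-indexed target steps; this sketch assumes the policy eventually reads the full source, i.e. g reaches |x|):

```python
def average_lagging(g, src_len, tgt_len):
    """Average Lagging (Ma et al., 2019). g[t-1]: number of source words
    read when writing target word t. AL measures how many source words the
    system lags behind an ideal translator that keeps pace with the
    target-to-source length ratio, averaged up to the first step tau at
    which the full source has been read."""
    gamma = tgt_len / src_len  # target-to-source length ratio |y|/|x|
    tau = next(t for t in range(1, tgt_len + 1) if g[t - 1] == src_len)
    return sum(g[t - 1] - (t - 1) / gamma for t in range(1, tau + 1)) / tau
```

For a wait-1-style schedule on equal-length sentences (g = [1, 2, 3] with |x| = |y| = 3), the lag is exactly one source word.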

Analysis
We conduct extensive analyses to understand the specific improvements of our method. Unless otherwise specified, all the results are reported on De→En(Base) and tested with wait-5 (AL=4.10) and MMA (AL=4.57) under similar latency.

Ablation Study
We use the ground-truth full-sentence length to train the translation module, and use the predicted full-sentence length in testing. We conduct an ablation study of using the predicted full-sentence length (Pred) or the ground-truth length (GT) for translation in training and testing respectively, reported in Table 1. LAF performs better than 'Pred LAF', indicating that using the ground-truth length during training is more helpful for learning translation. Compared with 'Oracle LAF', which uses the ground-truth full-sentence length in testing, LAF achieves comparable performance, which shows that the length prediction module in LAF performs well.

Accuracy of Predicted Length
Figure 7(a) shows the prediction accuracy of the full-sentence length in LAF, indicating that our method achieves good prediction performance. As the latency increases, the prediction accuracy of both 'Wait-k+LAF' and 'MMA+LAF' gradually increases. Specifically, 'Wait-k+LAF' predicts more accurately at low latency, which shows that the regular form of the fixed policy is more conducive to LAF learning the full-sentence length. Besides, in Figure 7(b), as more source words are received, the prediction accuracy of the full-sentence length gradually improves, which is in line with our expectations.

Reduction of Position Bias
We show the change of the average attention after applying LAF in Figure 8. With LAF, the position bias in SiMT is significantly reduced, and the front positions are no longer illusorily considered more important. By constructing the pseudo full-sentence, LAF bridges the structural gap between SiMT and full-sentence MT, so that the importance of the source positions is more similar to that in full-sentence MT, thereby reducing the position bias. (The calculation is the same as Eq.(6), without counting the future positions predicted by LAF, so the comparison is fair.)

Decreasing of Duplicate Translation
Position bias makes the target word tend to focus on the front source words, which leads to much overlap in the attention distribution, resulting in duplicate translation errors (Elbayad et al., 2020). Following See et al. (2017), we count the n-gram duplication proportion in the translations in Figure 10.
There are few duplicate n-grams in the reference and in full-sentence MT, especially when n > 2. However, position bias in SiMT makes the model always focus on some particular source words in the front positions, thereby exacerbating duplicate translation errors, especially in the fixed policy. For 3-grams, the duplicate translation of Wait-k is about 6 times that of full-sentence MT, which is in line with the previous conclusion (Elbayad et al., 2020). After applying LAF, the duplicate translation in SiMT is significantly reduced, becoming similar to full-sentence MT.
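The duplication statistic can be sketched as follows (a simple reading of the See et al. (2017) metric: the fraction of n-grams that already occurred earlier in the same translation; the exact normalization in the paper's figure may differ):

```python
def duplicate_ngram_proportion(tokens, n):
    """Proportion of n-grams in a token sequence that duplicate an earlier
    n-gram of the same sequence; repeated n-grams signal the duplicate
    translation errors discussed in the text."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    seen = set()
    dup = 0
    for gram in grams:
        if gram in seen:
            dup += 1
        seen.add(gram)
    return dup / len(grams) if grams else 0.0
```

For example, the sequence "expected to to hold" contains a duplicated 1-gram ("to"), exactly the failure mode shown in the attention visualization below.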

Improvement on Various Difficulty Levels
The word order difference is a major challenge for SiMT, where many word order inversions may force the model to start translating before reading the aligned source words (Chen et al., 2021). Following Zhang and Feng (2021c), we divide the test set according to the number of crosses in the alignments computed with fast-align (Dyer et al., 2013), and report the results on each set in Table 2. For full-sentence MT, word order reversal does not cause much challenge, so the performance gap between different sets is small. In SiMT, word order reversal often causes the model to translate before reading the aligned source words, which forces the target word to focus on some unrelated source words, resulting in poor performance on the Hard set. LAF complements the incomplete source to the full-sentence length, which allows the target word to focus on the subsequent positions instead of being forced to focus on the current irrelevant source words when the aligned word has not been received, thereby obviously improving the performance on the Hard set.

Attention Characteristics
LAF constructs the pseudo full-sentence by predicting the full-sentence length and filling the future positions with positional encoding. To verify the importance of the future positions, we count the attention weights on the future positions (i.e., those filled with positional encoding) at each decoding step in Figure 11. In the beginning, the future positions get much attention weight, receiving about 30% of the attention in the first decoding step. As more source words are received, the attention on the future positions gradually decreases. Furthermore, we visualize the attention distribution of an example in Figure 9. In Wait-k and MMA, attention is concentrated on the front positions; in particular, Wait-k extremely focuses on the first source word, which leads to the duplicate translation "expected to to hold". With LAF, when the aligned source word has not been received, the future positions tend to get more attention, e.g., when 'Wait-k+LAF' translates "take place" before receiving "beginnen". Besides, the predicted length in LAF changes dynamically and gradually approaches the full-sentence length. Overall, LAF reduces the position bias, and thus the attention in SiMT is more similar to that in full-sentence MT, resulting in better translation quality.

Conclusion
In this paper, we develop a length-aware framework for SiMT to reduce the position bias brought by the incomplete source. Experiments show that our method achieves promising results by bridging the structural gap between SiMT and full-sentence MT.