Modeling Dual Read/Write Paths for Simultaneous Machine Translation

Simultaneous machine translation (SiMT) outputs the translation while reading the source sentence and hence requires a policy to decide whether to wait for the next source word (READ) or generate a target word (WRITE); the sequence of these actions forms a read/write path. Although the read/write path is essential to SiMT performance, no direct supervision is given to the path in existing methods. In this paper, we propose a method of dual-path SiMT which introduces duality constraints to direct the read/write path. According to the duality constraints, the read/write paths in the source-to-target and target-to-source SiMT models can be mapped to each other. As a result, the two SiMT models can be optimized jointly by forcing their read/write paths to satisfy the mapping. Experiments on En-Vi and De-En tasks show that our method outperforms strong baselines under all latency levels.


Introduction
Simultaneous machine translation (SiMT) (Cho and Esipova, 2016; Gu et al., 2017; Ma et al., 2019; Arivazhagan et al., 2019), which outputs the translation while reading the source sentence, is important to many live scenarios, such as simultaneous interpretation, live broadcast and synchronized subtitles. Different from full-sentence machine translation, which waits for the whole source sentence, SiMT has to decide whether to wait for the next source word (i.e., READ action) or translate a target word (i.e., WRITE action) to complete the translation.
The sequence of READ and WRITE actions in the translation process constitutes the read/write path, which is key to SiMT performance. An improper read/write path damages translation performance: too many unnecessary READ actions before a WRITE result in high translation latency, while too few READ actions exclude indispensable source information. Therefore, the ideal read/write path is one in which the READ actions before each WRITE are just sufficient and necessary, meaning that the source words covered by consecutive READ actions and the following consecutively written target words should be translations of each other. Ensuring the sufficiency and necessity of READ actions leads to a proper read/write path and thereby good translation performance. Unfortunately, existing SiMT methods, whether employing a fixed or an adaptive policy, do not consider sufficiency or necessity in the policy. A fixed policy performs SiMT based on a pre-defined read/write path (Dalvi et al., 2018; Ma et al., 2019), where the number of READ actions before each WRITE is fixed. An adaptive policy (Gu et al., 2017; Zheng et al., 2019b; Arivazhagan et al., 2019; Zheng et al., 2019a; Ma et al., 2020; Liu et al., 2021) dynamically determines READ/WRITE actions by directly targeting translation quality and total latency, skipping any evaluation of individual READ/WRITE actions.
On these grounds, we aim to introduce an evaluation of the sufficiency and necessity of READ actions to guide the read/write path without involving external information. To this end, how to evaluate the sufficiency and necessity of READ actions is the key problem to solve. As mentioned above, in an ideal solution, the source segment (i.e., source words in consecutive READ actions) and the corresponding target segment (i.e., target words in the following consecutive WRITE actions) are supposed to be semantically equivalent and translatable to each other, constituting a separate segment pair. Hence, an ideal read/write path divides the whole sentence pair into a sequence of segment pairs, where the source sentence and the target sentence are translations of each other segment by segment. That means that if the translation direction is reversed, i.e., from target to source, the same sequence of segment pairs also derives an ideal read/write path. For example, according to the alignment in Figure 1(a), the ideal read/write paths should be 'RRWWW|RW|RW' in De→En SiMT and 'RRRWW|RW|RW' in En→De SiMT, as shown in Figure 1(b), both of which share the same segment pairs of <Fand ich, I found it>, <super, great> and <., .>. Therefore, the agreement on segment pairs between the read/write paths in source-to-target and target-to-source SiMT, named the duality constraint, is a good criterion for evaluating the sufficiency and necessity of READ actions.
Based on the above findings, we propose a method of Dual Path SiMT, which uses the SiMT model in the reverse direction to guide the SiMT model in the current direction according to the duality constraint between their read/write paths. To satisfy the duality constraint, the read/write paths in source-to-target and target-to-source SiMT should reach an agreement on the corresponding segment pairs. Along this line, we employ a source-to-target SiMT model and a target-to-source SiMT model, which respectively generate their own read/write paths using monotonic multi-head attention (Ma et al., 2020). Then, by minimizing the difference between the segment pairs constituted by the two read/write paths, the two SiMT models converge on the segment pairs and provide supervision to each other. Experiments on IWSLT15 En↔Vi and WMT15 De↔En SiMT tasks show that our method outperforms strong baselines under all latency levels, including the state-of-the-art adaptive policy.

Background
We first briefly introduce SiMT with a focus on monotonic multi-head attention (Ma et al., 2020).

Simultaneous Machine Translation
For a translation task, we denote the source sentence as $\mathbf{x} = \{x_1, \cdots, x_J\}$ and the corresponding source hidden states as $\mathbf{m} = \{m_1, \cdots, m_J\}$, where $J$ is the source length. The model generates the target sentence $\mathbf{y} = \{y_1, \cdots, y_I\}$ with target hidden states $\mathbf{s} = \{s_1, \cdots, s_I\}$, where $I$ is the target length.
Different from full-sentence machine translation, the source words received by the SiMT model are incremental during translation, and hence the model needs to decide whether to read a source word (READ) or write a target word (WRITE) at each step; the resulting action sequence is the read/write path.
A read/write path can be represented in multiple forms, such as an action sequence of READ and WRITE (e.g., RRWWWRW$\cdots$), or a path from $(0,0)$ to $(I,J)$ in the target-to-source attention matrix, as shown in Figure 1. Mathematically, a read/write path can be represented by a monotonic non-decreasing sequence $\{g_i\}_{i=1}^{I}$, where $g_i$ denotes the number of source words read in when writing the $i$-th target word $y_i$. At step $i$, given the first $g_i$ source tokens $x_{\leq g_i}$ and the previous target tokens $y_{<i}$, the probability of generating the target word $y_i$ is $p(y_i \mid x_{\leq g_i}, y_{<i})$. Therefore, the decoding probability of $\mathbf{y}$ with parameters $\theta$ is:

$$p(\mathbf{y} \mid \mathbf{x}; \theta) = \prod_{i=1}^{I} p\left(y_i \mid x_{\leq g_i}, y_{<i}; \theta\right).$$
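As an illustration, the correspondence between an action string and the sequence $\{g_i\}$ can be sketched as follows (the helper name is ours, not from the paper):

```python
def actions_to_g(actions):
    """Map a READ/WRITE action string to the monotonic sequence g,
    where g[i] is the number of source words already read when the
    (i+1)-th target word is written."""
    g, num_read = [], 0
    for a in actions:
        if a == "R":
            num_read += 1       # READ: wait for one more source word
        elif a == "W":
            g.append(num_read)  # WRITE: emit a target word
        else:
            raise ValueError(f"unknown action: {a!r}")
    return g
```

For instance, the De→En path RRWWWRWRW from Figure 1 yields g = [2, 2, 2, 3, 4].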

Monotonic Multi-head Attention
Our method involves monotonic multi-head attention (MMA) (Ma et al., 2020), which is the current state-of-the-art SiMT policy, so we briefly introduce it. MMA determines read/write action with monotonic attention (Raffel et al., 2017), and applies expectation training and latency controlling to jointly learn the translation and read/write path.
Monotonic attention For a Transformer model (Vaswani et al., 2017) with $L$ decoder layers and $H$ heads per layer, each head in MMA independently determines a READ/WRITE action at each step. Specifically, the $h$-th head in the $l$-th layer predicts a selection probability $p_{i,j}^{l,h}$, which represents the probability of writing the $i$-th target word when reading the $j$-th source word, and then a Bernoulli random variable $z_{i,j}^{l,h}$ is sampled accordingly to determine the READ or WRITE action:

$$p_{i,j}^{l,h} = \mathrm{Sigmoid}\left(\frac{\left(m_j V_{l,h}^{K}\right)\left(s_{i-1} V_{l,h}^{Q}\right)^{\top}}{\sqrt{d_k}}\right), \quad z_{i,j}^{l,h} \sim \mathrm{Bernoulli}\left(p_{i,j}^{l,h}\right),$$

where $V_{l,h}^{K}$ and $V_{l,h}^{Q}$ are learnable parameters and $d_k$ is the head dimension. 1) If $z_{i,j}^{l,h}=0$, the head performs a READ action to wait for the next source word. 2) If $z_{i,j}^{l,h}=1$, the head sets $g_i^{l,h}=j$ and then performs a WRITE action.
Note that the model starts translating a target word if and only if all $L \times H$ heads decide to perform the WRITE action; otherwise, the model performs a READ action to read the next source word.
Expectation training MMA involves sampling a discrete random variable $z_{i,j}^{l,h}$, which precludes back-propagation. Therefore, Raffel et al. (2017) proposed expectation training, which replaces $z_{i,j}^{l,h}$ with the expected writing probability $\alpha_{i,j}^{l,h}$ during training, where $\alpha_{i,j}^{l,h}$ represents the expectation of writing the $i$-th target word when reading the $j$-th source word, calculated as:

$$\alpha_{i,j}^{l,h} = p_{i,j}^{l,h} \sum_{k=1}^{j} \left( \alpha_{i-1,k}^{l,h} \prod_{l'=k}^{j-1} \left(1 - p_{i,l'}^{l,h}\right) \right).$$

Given the expected writing probability $\alpha_{i,j}^{l,h}$, the expected attention distribution $\beta_{i,j}^{l,h}$ is calculated as:

$$\beta_{i,j}^{l,h} = \sum_{k=j}^{J} \frac{\alpha_{i,k}^{l,h} \exp\left(u_{i,j}^{l,h}\right)}{\sum_{l'=1}^{k} \exp\left(u_{i,l'}^{l,h}\right)},$$

where $u_{i,j}^{l,h}$ is the original soft attention score. Accordingly, the expected context vector of a head during training is calculated as:

$$c_i^{l,h} = \sum_{j=1}^{J} \beta_{i,j}^{l,h} \left( m_j W_{l,h}^{V} \right),$$

where $W_{l,h}^{V}$ is a projection matrix.

Latency controlling To trade off translation quality against latency, MMA introduces a latency loss $\mathcal{L}_g$ to control the latency, calculated as:

$$\mathcal{L}_g = C\left(\{\bar{g}_i^{l,h}\}\right), \quad \bar{g}_i^{l,h} = \sum_{j=1}^{J} j\, \alpha_{i,j}^{l,h} \ \text{during training},$$

where $C$ is a latency metric that transforms the read/write actions into a loss, realized by Differentiable Average Lagging (Arivazhagan et al., 2019). Then, the total loss function is calculated as:

$$\mathcal{L} = -\log p(\mathbf{y} \mid \mathbf{x}; \theta) + \lambda \mathcal{L}_g,$$

where $\lambda$ is the weight of the latency loss. Please refer to Arivazhagan et al. (2019) and Ma et al. (2020) for more detailed derivations and implementation.
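A minimal numerical sketch of the expectation recurrence for $\alpha$ (a naive per-step loop written for clarity; the actual MMA implementation uses a parallelized cumulative-product form):

```python
import numpy as np

def expected_alignment(p):
    """Expected writing probability alpha[i, j] from selection
    probabilities p[i, j] (shape I x J), following Raffel et al.
    (2017): alpha[i, j] = p[i, j] * sum_{k<=j} alpha[i-1, k]
    * prod_{l=k}^{j-1} (1 - p[i, l])."""
    I, J = p.shape
    alpha = np.zeros((I, J))
    prev = np.zeros(J)
    prev[0] = 1.0  # before the first target word, attention sits at j = 0
    for i in range(I):
        for j in range(J):
            # probability that the head scans past positions k..j-1
            # starting from where it stopped at the previous step
            total = sum(prev[k] * np.prod(1.0 - p[i, k:j])
                        for k in range(j + 1))
            alpha[i, j] = p[i, j] * total
        prev = alpha[i]
    return alpha
```

Each row of the resulting matrix is a sub-probability distribution over source positions (mass below 1 corresponds to the head never writing within the source sentence).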
Since MMA requires all $L \times H$ heads to independently decide the READ/WRITE action and starts translating only when every head selects WRITE, some outlier heads that perform too many READ actions force the model to wait for too many source words, resulting in higher latency. Ma et al. (2020) try to mitigate this phenomenon by adding auxiliary loss functions, but this still cannot prevent some outlier heads from waiting for too many words, which seriously affects SiMT performance and impairs the necessity of the READ actions in the read/write path (Ma et al., 2020; Zaidi et al., 2021).

The Proposed Method
As shown in Figure 2, dual paths SiMT is composed of a source-to-target model and a target-to-source model, each called a single path SiMT, which generate their own read/write paths based on MMA (Sec.3.1). Since both share the common segment pairs between the two languages, there is supposed to be an inherent connection between the read/write paths in the two directions, named the duality constraint, as shown in Figure 1. To accomplish this duality constraint, we design a method to transpose a read/write path into the other translation direction, and then use the transposed path to guide the read/write path in that direction (Sec.3.2). Finally, the SiMT models in the two directions are jointly optimized subject to the duality constraint between their read/write paths (Sec.3.3). Details are introduced in the following.

Single Path SiMT
Our single path SiMT generates the read/write path based on monotonic multi-head attention (Ma et al., 2020), but we slightly modify it for more stable performance. Instead of letting the $H$ heads in all $L$ decoder layers independently determine the READ/WRITE action, which results in outlier heads, we share the READ/WRITE actions across decoder layers. Specifically, only the $H$ heads in the first decoder layer independently predict READ/WRITE actions based on monotonic attention, and the heads in subsequent decoder layers share these actions. As a result, the expected writing probability $\alpha_{i,j}^{l,h}$ during training (in Eq.(4)) is written as $\alpha_{i,j}^{h}$ in the following description.

Read/Write Path Transposition
For the purpose of modeling the duality constraint between the read/write paths in the two directions, we design a method of transposing a path (denoted as TP(·)) to the other direction. Since the read/write paths in the two directions are both based on a common division of segment pairs, they should satisfy the dual form, i.e., they can be transformed into each other by swapping the READ and WRITE actions within each segment pair. Therefore, to transpose the read/write path, we propose segment transposition, which consists of three steps: segment, transpose and merge, as shown in Figure 3. For example, given a primal read/write path RRWWWRWWRRW, we first divide it into multiple segment pairs, where each segment pair contains consecutive READs and the following consecutive WRITEs (RRWWW | RWW | RRW). Then, since in each segment pair the content of the READs and WRITEs should be semantically equivalent, each pair can be directly transposed (RRWWW | RWW | RRW → RRRWW | RRW | RWW). Finally, all segment pairs are merged to construct a legal transposed path. In practice, all operations are performed in matrix form in parallel, and the formal expressions are as follows.
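The segment-transpose-merge procedure can be sketched at the action-string level (a toy version of TP(·); the actual method operates on probability matrices):

```python
import re

def transpose_path(path):
    """Segment transposition: split a read/write path into segment
    pairs (a run of READs followed by a run of WRITEs), swap the
    READ/WRITE counts within each pair, then merge the pairs."""
    pairs = re.findall(r"R+W+", path)
    if "".join(pairs) != path:
        raise ValueError("path must consist of READ-then-WRITE segments")
    # a pair with r READs and w WRITEs becomes w READs and r WRITEs
    return "".join("R" * p.count("W") + "W" * p.count("R") for p in pairs)
```

For the worked example, transpose_path("RRWWWRWWRRW") returns "RRRWWRRWRWW", and applying the function twice recovers the original path, reflecting the dual form.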
Segment During training, the read/write path is realized by the expected writing probability $\alpha_{i,j}^{h}$ (in Eq.(4)) in matrix form, so we take the source position with the maximum expected writing probability as the WRITE action and the rest as READ actions, where the WRITE position at the $i$-th step is denoted as $d_i^h$, calculated as:

$$d_i^h = \mathop{\mathrm{argmax}}_{j}\ \alpha_{i,j}^{h}.$$

Based on $d_i^h$, a matrix $M_{i,j}^{h}$ is calculated to indicate the segment pairs of consecutive READ actions and WRITE actions:

$$M_{i,j}^{h} = \begin{cases} 1, & d_{i-1}^{h} < j \leq d_i^{h} \\ 1, & j = d_i^{h} \\ 0, & \text{otherwise}, \end{cases}$$

where the first row of Eq.(11) represents the consecutive READ actions (expanding on the source axis), and the second row represents the WRITE action (expanding on the target axis).
Transpose & Merge We transpose the matrix $M_{i,j}^{h}$ within each segment pair and merge all segment pairs to get the transposed read/write path $\gamma_{i,j}^{h}$ for the other direction:

$$\gamma^{h} = \mathrm{Merge}\left( \left\{ \left( M_{(k)}^{h} \right)^{\top} \right\}_{k} \right),$$

where $M_{(k)}^{h}$ is the sub-matrix of $M^{h}$ corresponding to the $k$-th segment pair, and $\gamma^{h}$ is the ideal transposed path of the writing probability $\alpha^{h}$. In the following, we refer to the path transposition of Eq.(10-12) as TP(·) for short.

Dual Paths SiMT
Dual paths SiMT consists of a forward network and a backward network, where the forward network performs SiMT on $\mathbf{x} \rightarrow \mathbf{y}$ with the forward writing probability $\alpha^{h,F}$ (in Sec.3.1) parameterized by $\theta^{F}$, and the backward network performs SiMT in the reversed direction $\mathbf{y} \rightarrow \mathbf{x}$ with the backward writing probability $\alpha^{h,B}$ parameterized by $\theta^{B}$.
To model the duality constraint, we first generate the transposed read/write paths $\gamma^{h,B}$ and $\gamma^{h,F}$ from the forward and backward writing probabilities $\alpha^{h,F}$ and $\alpha^{h,B}$ respectively through TP(·) in Sec.3.2:

$$\gamma^{h,F} = \mathrm{TP}\left(\alpha^{h,B}\right), \quad \gamma^{h,B} = \mathrm{TP}\left(\alpha^{h,F}\right).$$

Since they are based on common segment pairs, the forward writing probability $\alpha^{h,F}$ should be similar to the transposed path $\gamma^{h,F}$ of the backward read/write path, and vice versa. Therefore, we shorten the distance between them through $L_2$ regularization terms to minimize the difference between the segment pairs constituted by the two read/write paths:

$$\mathcal{L}_{dual} = \sum_{h=1}^{H} \left( \left\| \alpha^{h,F} - \gamma^{h,F} \right\|_2 + \left\| \alpha^{h,B} - \gamma^{h,B} \right\|_2 \right).$$

Overall, the total loss function is calculated as:

$$\mathcal{L} = \mathcal{L}_{\theta^{F}} + \mathcal{L}_{\theta^{B}} + \lambda_{dual} \mathcal{L}_{dual},$$

where $\mathcal{L}_{\theta^{F}}$ and $\mathcal{L}_{\theta^{B}}$ are the loss functions of the forward and backward networks respectively, calculated as in Eq.(9). The weights $\lambda$ of the latency loss in the forward and backward networks are set to the same value, and we compare the performance of different $\lambda$ in Sec.6.5. $\lambda_{dual}$ is a hyperparameter that controls the weight of the duality constraint, and we set $\lambda_{dual}=1$ in our experiments. At inference time, since the forward and backward networks have the same architecture and can perform SiMT independently, we use the forward or backward network to complete source-to-target or target-to-source SiMT respectively.
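A simplified sketch of the duality regularizer (the function names are ours, and TP is passed in as an abstract mapping rather than the full segment transposition):

```python
import numpy as np

def duality_loss(alpha_f, alpha_b, tp):
    """L2 duality regularizer: each network's writing probabilities
    should match the transposed path of the other direction.
    alpha_f has shape (I, J), alpha_b has shape (J, I); tp maps a
    writing-probability matrix to its transposed path."""
    gamma_f = tp(alpha_b)  # guide for the forward network, shape (I, J)
    gamma_b = tp(alpha_f)  # guide for the backward network, shape (J, I)
    return (np.linalg.norm(alpha_f - gamma_f)
            + np.linalg.norm(alpha_b - gamma_b))
```

When the two paths are exactly dual, the regularizer vanishes, so minimizing it pulls both networks toward a shared division of segment pairs.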

Related Work
Dual learning is widely used in dual tasks, especially machine translation. However, previous SiMT methods often lack constraints on the read/write path. Some works try to use external information, such as alignments or generated rule-based sequences, to guide the read/write path (Zheng et al., 2019a; Zhang et al., 2020; Wilken et al., 2020; Alinejad et al., 2021). However, these methods rely heavily on heuristic rules, and thus their performance is not comparable to jointly optimizing the read/write path and translation. Our method internally explores the duality between the read/write paths in the two directions, and accordingly uses this duality to constrain the read/write paths, thereby obtaining better SiMT performance.

Datasets
We evaluated our method on four translation directions of two public datasets: IWSLT15 En↔Vi and WMT15 De↔En. For De↔En, byte pair encoding (Sennrich et al., 2016) is applied with 32K merge operations and the vocabulary is shared across languages.

System Setting
We conducted experiments on the following systems.
Wait-k The widely used fixed policy (Ma et al., 2019), which first reads $k$ source tokens and then alternately writes a target word and reads a source word.
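For concreteness, the wait-k read/write path can be sketched as follows (a standard formulation, not code from the paper):

```python
def wait_k_path(k, src_len, tgt_len):
    """g_i for the wait-k policy: read k source words first, then
    alternate WRITE and READ; g_i is capped at the source length."""
    return [min(k + i, src_len) for i in range(tgt_len)]
```

For example, wait_k_path(2, src_len=4, tgt_len=3) gives [2, 3, 4]: two words are read before the first WRITE, and each later WRITE is preceded by one more READ until the source is exhausted.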
MMA Monotonic multi-head attention (MMA), proposed by Ma et al. (2020), the state-of-the-art adaptive policy for SiMT, which applies monotonic attention to each head in the Transformer.
Single Path A variation of our method, only using the SiMT model in one direction without the duality constraint (described in Sec.3.1).
Dual Paths The proposed method, described in Sec.3.

We evaluate these systems with BLEU (Papineni et al., 2002) for translation quality and Average Lagging (AL) (Ma et al., 2019) for latency. AL evaluates the number of words by which the output lags behind an ideal policy. Given a read/write path $g_i$, AL is calculated as:

$$\mathrm{AL} = \frac{1}{\tau} \sum_{i=1}^{\tau} \left( g_i - \frac{i-1}{|\mathbf{y}| / |\mathbf{x}|} \right), \quad \tau = \min\{\, i \mid g_i = |\mathbf{x}| \,\},$$

where $|\mathbf{x}|$ and $|\mathbf{y}|$ are the lengths of the source and target sentences respectively. Results with more latency metrics are shown in Appendix D.

Main Results Figure 4 shows the translation performance on the four translation directions. 'Dual Paths' outperforms the previous methods under all latency, and more importantly, the proposed duality constraint improves SiMT performance in both the source-to-target and target-to-source directions concurrently.
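For reference, AL can be computed directly from a path $g_i$, using the first step at which the full source has been read as the cutoff:

```python
def average_lagging(g, src_len, tgt_len):
    """Average Lagging (Ma et al., 2019): average number of source
    words by which the model lags behind an ideal wait-0 policy,
    summed up to the first step tau where the whole source is read."""
    gamma = tgt_len / src_len
    tau = next(i + 1 for i, gi in enumerate(g) if gi == src_len)
    return sum(g[i] - i / gamma for i in range(tau)) / tau
```

For a length-4 sentence pair, the wait-1 path g = [1, 2, 3, 4] gives AL = 1, while a full-sentence model (g = [4, 4, 4, 4]) gives AL = 4.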

Compared to 'Wait-k', our method achieves significant improvement, especially under low latency, since the read/write path in 'Wait-k' is fixed and cannot be adjusted. Compared to 'MMA', the state-of-the-art adaptive policy, our 'Single Path' achieves comparable performance and is more stable under high latency. 'MMA' allows each head of each layer to independently predict a read/write path, where some outlier heads affect the overall performance, resulting in a decline in translation quality under high latency (Ma et al., 2020). Our method applies a common read/write path instead of letting the heads in each layer predict their own, thereby reducing the possibility of outlier heads. Based on 'Single Path', 'Dual Paths' further improves SiMT performance by modeling the duality constraint between read/write paths, especially under low latency. Besides, our method brings SiMT performance close to that of full-sentence MT on En↔Vi, which shows that a more precise read/write path is the key to SiMT performance. Additionally, under the same latency weight $\lambda$, our method tends to have lower latency than 'MMA' on De↔En. 'Single Path' reduces the unnecessary latency caused by outlier heads, and the duality constraint further improves the necessity of reading source content, thereby achieving lower latency.

Analysis
We conducted extensive analyses to understand the specific improvements of our method. Unless otherwise specified, all results are reported on De→En.

Ablation Study
We conducted ablation studies on the duality constraint, where we used direct transposition to replace the proposed TP(·), constrained only the forward network, or removed the duality constraint. As shown in Table 1, the proposed TP(·) is critical to translation quality, especially the segment operation. Besides, the mutual constraint between the forward and backward networks is more conducive to SiMT performance than only constraining the forward read/write path or removing the constraint.

Evaluation of Read/Write Path
The read/write path needs to ensure sufficient content for translation while avoiding unnecessary latency, where the aligned source position is considered the oracle position at which to perform WRITE in previous work (Wilken et al., 2020; Arthur et al., 2021). Therefore, we propose two metrics, $A_{Suf}$ and $A_{Nec}$, to measure the sufficiency and necessity of READ actions in the read/write path via alignments. We denote the ground-truth aligned source position of the $i$-th target word as $a_i$, and represent the read/write path by $g_i$, the number of source words read in when writing the $i$-th target word. For sufficiency, $A_{Suf}$ evaluates whether the aligned source word is read before writing the target word, calculated as:

$$A_{Suf} = \frac{1}{I} \sum_{i=1}^{I} \mathbb{1}_{a_i \leq g_i},$$

where $\mathbb{1}_{a_i \leq g_i}$ counts the cases with $a_i \leq g_i$, and $I$ is the target length. For necessity, $A_{Nec}$ measures the distance between the output position $g_i$ and the aligned source position $a_i$, calculated as:

$$A_{Nec} = \frac{1}{\sum_{i=1}^{I} \mathbb{1}_{a_i \leq g_i}} \sum_{i:\, a_i \leq g_i} \frac{a_i}{g_i},$$

where the best case is $A_{Nec}=1$ with $g_i = a_i$, i.e., WRITE is performed exactly at the aligned position and there is no unnecessary waiting. A more detailed description is given in Appendix A.

As shown in Figure 5, we evaluate $A_{Suf}$ and $A_{Nec}$ of the read/write path on the RWTH De→En alignment dataset, whose reference alignments are manually annotated by experts (Liu et al., 2016; Ghader and Monz, 2017). The read/write paths of all methods perform similarly in the sufficiency evaluation, with our method slightly better at low latency. Except for the fixed policy 'Wait-k', which may be forced to start translating before reading the aligned source word under lower latency, 'MMA' and our method cover more than 85% of the aligned source words when starting to translate. In the necessity evaluation, our method surpasses 'Wait-k' and 'MMA' and starts translating much closer to the aligned source word, which shows that the duality constraint makes the read/write path more precise, avoiding unnecessary waiting.
Note that while avoiding unnecessary waiting, our method also improves the translation quality (see Figure 4), which further shows the importance of a precise read/write path for SiMT performance.

Effect of Duality Constraint
During training, we convert the duality constraint into regularization terms. To verify that our method indeed brings duality, we conduct a duality evaluation between the source-to-target and target-to-source read/write paths during testing. Specifically, we first express each read/write path in matrix form and generate the transposed path of the target-to-source read/write path according to TP(·) (described in Sec.3.2), and then calculate the Intersection over Union (IoU) score between the area below the transposed path and that below the source-to-target read/write path (see Figure 6), which is regarded as the duality between the read/write paths in the two directions. Appendix B gives the detailed calculation of IoU. The results of the duality evaluation are reported in Table 2, where our method effectively enhances the duality of the source-to-target and target-to-source read/write paths under all latency levels. Figure 6 shows the read/write path visualization of a De↔En example. In 'Dual Paths', there is a strong duality between the read/write paths in the two translation directions, where the target-to-source read/write path (Figure 6c) and the transposed path of the source-to-target read/write path (Figure 6b) have a high degree of overlap.

Analysis on Forward/Backward Latency
To analyze the relationship between the forward and backward networks in terms of latency, we set the latency weights ($\lambda$ in Eq.(9)) of the forward and backward networks to different values, denoted as $\lambda^{F}$ and $\lambda^{B}$ respectively (the greater the latency weight, the lower the model latency). Table 3 reports the effect of different settings of $\lambda^{B}$ on the performance of the forward network. After applying the backward network and the duality constraint, our method has much lower latency and similar translation quality compared with 'MMA' and 'Single Path'. As the latency of the backward network decreases ($\lambda^{B}$ becomes larger), the latency of the forward network also gradually decreases, which shows that the latencies of the forward and backward networks are strongly correlated. Overall, regardless of the settings of $\lambda^{F}$ and $\lambda^{B}$, 'Dual Paths' obtains a better trade-off between latency and translation quality. Furthermore, we can obtain slightly larger or smaller latency by adjusting the combination of $\lambda^{F}$ and $\lambda^{B}$.

Conclusion
In this paper, we develop dual paths SiMT to supervise the read/write path by modeling the duality constraint between SiMT in the two directions. Experiments and analyses show that our method outperforms strong baselines under all latency and achieves a high-quality read/write path.

[Figure 7: The black line indicates the ground-truth alignments between the target and source. $g_i$ is the number of source words read in when generating the $i$-th target word, and $a_i$ is the ground-truth aligned source position of the $i$-th target word. $a_i > g_i$ (colored in red) means that the $i$-th target word is forced to be translated before its aligned source word is read.]

A Evaluation Metrics of Read/Write Path
In Sec.6.2, we propose two metrics, $A_{Suf}$ and $A_{Nec}$, to measure the sufficiency and necessity of the read/write path using alignments. Here, we give a more detailed introduction to the calculation of the two metrics. Given the ground-truth alignments, we denote the aligned source position of the $i$-th target word as $a_i$. For a one-to-many alignment from target to source, we choose the furthest source word as its aligned source position. For a read/write path, we denote the number of source words read in when generating the $i$-th target word as $g_i$. Figure 7 gives an example of the calculation of $a_i$ and $g_i$.
Sufficiency $A_{Suf}$ measures how many aligned source words are read before translating the corresponding target words (i.e., $a_i \leq g_i$), which ensures the faithfulness of the translation, calculated as:

$$A_{Suf} = \frac{1}{I} \sum_{i=1}^{I} \mathbb{1}_{a_i \leq g_i},$$

where $\mathbb{1}_{a_i \leq g_i}$ counts the cases with $a_i \leq g_i$.
Taking the case in Figure 7 as an example, the sufficiency is calculated as $A_{Suf} = \frac{1}{5} \times (0+1+1+0+1) = \frac{3}{5}$, where the 1st and 4th target words are translated before their aligned source words are read ($a_i > g_i$).
Necessity $A_{Nec}$ measures how far the output position $g_i$ is from the aligned position $a_i$: the closer the output position is to the aligned position, the earlier the read/write path outputs and the less unnecessary latency there is. $A_{Nec}$ is calculated as:

$$A_{Nec} = \frac{1}{\sum_{i=1}^{I} \mathbb{1}_{a_i \leq g_i}} \sum_{i:\, a_i \leq g_i} \frac{a_i}{g_i}.$$

Note that $A_{Nec}$ only considers aligned positions that are read before the output position (i.e., $a_i \leq g_i$).

[Figure 8: Schematic diagram of calculating the Intersection over Union (IoU) score to evaluate the duality of the source-to-target and target-to-source read/write paths. The yellow area represents the union of the areas below the two paths, and the area enclosed by the black line represents the intersection.]
In the case shown in Figure 7, the necessity is calculated as $A_{Nec} = \frac{1}{3} \times \left( \frac{2}{2} + \frac{1}{4} + \frac{4}{5} \right) = \frac{41}{60}$, where we only consider the 2nd, 3rd and 5th target words.
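The two metrics can be reproduced with a short script; the alignment and path values used below are one hypothetical assignment consistent with the worked example above (the figure itself is not shown here):

```python
def sufficiency_necessity(a, g):
    """A_Suf: fraction of target words whose aligned source position
    a[i] is read before writing (a[i] <= g[i]).
    A_Nec: mean of a[i]/g[i] over those words; 1 means every WRITE
    happens exactly at its aligned position (no unnecessary wait)."""
    covered = [(ai, gi) for ai, gi in zip(a, g) if ai <= gi]
    a_suf = len(covered) / len(a)
    a_nec = sum(ai / gi for ai, gi in covered) / len(covered)
    return a_suf, a_nec
```

With a = [2, 2, 1, 5, 4] and g = [1, 2, 4, 4, 5], the 1st and 4th words violate $a_i \leq g_i$, giving $A_{Suf} = 3/5$ and $A_{Nec} = (2/2 + 1/4 + 4/5)/3 = 41/60$, matching the example.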

B IoU Score for Duality Evaluation
To verify that our proposed method indeed makes the source-to-target and target-to-source read/write paths more dual, we calculate the Intersection over Union (IoU) score to evaluate the duality in Sec.6.3. Here, we describe the detailed calculation of the IoU score. Figure 8 gives an example of calculating the IoU score. Given the source-to-target and target-to-source read/write paths $P^{s2t}$ and $P^{t2s}$ in binary matrix form, we first generate the transposed path $DP^{s2t}$ of $P^{t2s}$ with the proposed TP(·) in Sec.3.2:

$$DP^{s2t} = \mathrm{TP}\left(P^{t2s}\right).$$

Then, we calculate the Intersection over Union score between $P^{s2t}$ and $DP^{s2t}$:

$$\mathrm{IoU} = \frac{\left| P^{s2t} \cap DP^{s2t} \right|}{\left| P^{s2t} \cup DP^{s2t} \right|},$$

where a larger IoU score means that the source-to-target and target-to-source read/write paths are more dual. Ideally, the best case is $\mathrm{IoU}=1$, which means the source-to-target and target-to-source read/write paths are exactly in dual form.
In the calculation of the IoU score, for 'MMA' and 'Single Path', the source-to-target and target-to-source read/write paths come from independent models in the two directions. For 'Dual Paths', they come from the forward and backward networks of our model respectively.
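Given two paths as binary "area below the path" matrices, the IoU score can be sketched as follows (function and argument names are ours):

```python
import numpy as np

def path_iou(area_a, area_b):
    """IoU between the areas below two read/write paths, given as
    binary matrices whose (i, j) entry is 1 iff source word j is
    read before target word i is written."""
    inter = np.logical_and(area_a, area_b).sum()
    union = np.logical_or(area_a, area_b).sum()
    return inter / union
```

Identical paths give IoU = 1.0, and the score decreases as the two areas diverge.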

C Hyperparameters
The systems in our experiments use the same hyperparameters, whose values are shown in Table 4.

D Numerical Results with More Metrics
We also compare 'Dual Paths' and 'Single Path' with previous methods on the latency metrics Average Proportion (AP) (Cho and Esipova, 2016) and Differentiable Average Lagging (DAL) (Arivazhagan et al., 2019). In this section, we first give the definition of AP and DAL, and then report the expanded results and numerical results of the main experiment (Sec.5.3), using AP, AL, DAL as latency metrics.

D.1 Latency Metrics
Average Proportion (AP) (Cho and Esipova, 2016) measures the proportion of the area above a read/write path. Given the read/write path $g_i$, AP is calculated as:

$$\mathrm{AP} = \frac{1}{|\mathbf{x}||\mathbf{y}|} \sum_{i=1}^{|\mathbf{y}|} g_i.$$

Differentiable Average Lagging (DAL) (Arivazhagan et al., 2019) is a differentiable version of average lagging, which can be integrated into training. Given the read/write path $g_i$, DAL is calculated as:

$$\mathrm{DAL} = \frac{1}{|\mathbf{y}|} \sum_{i=1}^{|\mathbf{y}|} \left( g'_i - \frac{i-1}{|\mathbf{y}| / |\mathbf{x}|} \right), \quad g'_i = \begin{cases} g_i, & i = 1 \\ \max\left( g_i,\ g'_{i-1} + \frac{|\mathbf{x}|}{|\mathbf{y}|} \right), & i > 1. \end{cases}$$

D.2 Expanded Results

Figures 9, 10, 11 and 12 respectively show the expanded results on IWSLT15 En↔Vi and WMT15 De↔En, measured by AP and DAL.
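Both latency metrics can be computed directly from a path $g_i$ (a sketch of the standard definitions above):

```python
def average_proportion(g, src_len, tgt_len):
    """AP (Cho and Esipova, 2016): sum of g_i normalized by the
    area of the full attention matrix."""
    return sum(g) / (src_len * tgt_len)

def differentiable_average_lagging(g, src_len, tgt_len):
    """DAL (Arivazhagan et al., 2019): like AL, but enforces a
    running minimum delay of 1/gamma source words per emitted
    target word and sums over all target positions."""
    gamma = tgt_len / src_len
    g_prime, prev = [], None
    for gi in g:
        prev = gi if prev is None else max(gi, prev + 1.0 / gamma)
        g_prime.append(prev)
    return sum(gp - i / gamma for i, gp in enumerate(g_prime)) / tgt_len
```

For g = [1, 2, 3, 4] with src_len = tgt_len = 4, this gives AP = 0.625 and DAL = 1.0; unlike AL, a full-sentence path [4, 4, 4, 4] still accumulates lag at every position, giving DAL = 4.0.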