Universal Simultaneous Machine Translation with Mixture-of-Experts Wait-k Policy

Simultaneous machine translation (SiMT) generates translations before reading the entire source sentence and hence has to trade off between translation quality and latency. To fulfill the requirements on translation quality and latency in practical applications, previous methods usually need to train multiple SiMT models for different latency levels, resulting in large computational costs. In this paper, we propose a universal SiMT model with a Mixture-of-Experts Wait-k Policy that achieves the best translation quality under arbitrary latency with only one trained model. Specifically, our method employs multi-head attention to realize the mixture of experts, where each head is treated as a wait-k expert with its own number of waiting source words; given a test latency and the source inputs, the weights of the experts are adjusted accordingly to produce the best translation. Experiments on three datasets show that our method outperforms all strong baselines under different latency levels, including the state-of-the-art adaptive policy.


Introduction
Simultaneous machine translation (SiMT) (Cho and Esipova, 2016; Gu et al., 2017; Ma et al., 2019; Arivazhagan et al., 2019) begins outputting the translation before reading the entire source sentence and hence has a lower latency compared to full-sentence machine translation. In practical applications, SiMT usually has to fulfill requirements at different latency levels. For example, a live broadcast requires a lower latency to provide a smooth translation, while a formal conference focuses on translation quality and allows for a slightly higher latency. Therefore, an excellent SiMT model should be able to maintain high translation quality under different latency levels.
However, the existing SiMT methods, which usually employ a fixed or adaptive policy, cannot achieve the best translation performance under different latency levels with only one model (Ma et al., 2019, 2020).

Figure 1: Performance of wait-k models with different k_train v.s. k_test (k_train ∈ {1, 3, 5, 7, 9}, and the diagonal k_train = k_test) on the IWSLT15 En→Vi SiMT task. k_train and k_test denote the number of source tokens to wait for before performing translation during training and testing, respectively.

With a fixed policy, e.g., the wait-k policy (Ma et al., 2019), the SiMT model first waits for a fixed number of source words to be fed, and then alternately reads one source word and outputs one target word. In the wait-k policy, the number of words to wait for can differ between training and testing, denoted k_train and k_test respectively, and the latency is determined by k_test. Figure 1 gives the performance of models trained with different k_train under different k_test, and the results show that the best-performing model corresponds to a different k_train for each k_test. As a result, multiple models must be maintained for the best performance under different latency levels. With an adaptive policy, the SiMT model dynamically adjusts its waiting for source tokens to improve translation by directly involving the latency in the loss function (Arivazhagan et al., 2019; Ma et al., 2020). Although the adaptive policy achieves state-of-the-art performance on open datasets, multiple models still need to be trained for different latency levels, as the model latency is altered by changing the loss function during training. Therefore, to perform SiMT under different latency levels, both kinds of methods require training multiple models, leading to large costs.
On these grounds, we propose a universal simultaneous machine translation model which can self-adapt to different latency levels, so that only one model needs to be trained. To this end, we propose a Mixture-of-Experts Wait-k Policy (MoE wait-k policy) for SiMT, where each expert employs the wait-k policy with its own number of waiting source words. For the mixture of experts, we can consider different experts as corresponding to different parameter subspaces, and fortunately multi-head attention is designed to explore different subspaces with different heads (Vaswani et al., 2017). Therefore, we employ multi-head attention as the implementation of MoE by assigning different heads different numbers of waiting words (wait-1, wait-3, wait-5, · · ·). Then, the outputs of the different heads (aka experts) are combined with different weights, which are dynamically adjusted to achieve the best translation under different latency levels.
Experiments on IWSLT15 En→Vi, WMT16 En→Ro and WMT15 De→En show that although with only a universal SiMT model, our method can outperform strong baselines under all latency, including the state-of-the-art adaptive policy. Further analyses show the promising improvements of our method on efficiency and robustness.

Background
Our method is based on the mixture-of-experts approach, multi-head attention and the wait-k policy, so we first briefly introduce each of them.

Mixture of Experts
Mixture of experts (MoE) (Jacobs et al., 1991; Eigen et al., 2013; Peng et al., 2020) is an ensemble learning approach that jointly trains a set of expert modules and mixes their outputs with various weights:

Output = Σ_{i=1}^{n} G_i · E_i,    (1)

where n is the number of experts, and E_i and G_i are the output and weight of the i-th expert, respectively.
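As a concrete illustration, the mixture in Eq.(1) is just a weighted sum of expert outputs. A minimal sketch (the expert outputs and gate values below are made up for illustration):

```python
import numpy as np

def moe_mix(expert_outputs, gate_weights):
    """Mixture of experts (Eq. 1): weighted sum of expert outputs."""
    expert_outputs = np.asarray(expert_outputs)   # shape (n, d)
    gate_weights = np.asarray(gate_weights)       # shape (n,)
    assert np.isclose(gate_weights.sum(), 1.0), "gate weights should be normalized"
    return gate_weights @ expert_outputs          # shape (d,)

# Two experts with 3-dim outputs, gated 0.25 / 0.75
out = moe_mix([[1.0, 0.0, 2.0], [3.0, 4.0, 0.0]], [0.25, 0.75])
```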

Multi-head Attention
Multi-head attention is the key component of the state-of-the-art Transformer architecture (Vaswani et al., 2017), which allows the model to jointly attend to information from different representation subspaces. Multi-head attention contains h attention heads, where each head independently calculates its outputs from queries, keys and values through scaled dot-product attention. Since our method and the wait-k policy are applied to cross-attention, the following formal expressions are all based on cross-attention, where the queries come from the t-th decoder hidden state S_t, and the keys and values come from the encoder outputs Z. Thus, the output H_i^t of the i-th head when decoding the t-th target token is calculated as:

H_i^t = f_att(S_t W_i^Q, Z W_i^K, Z W_i^V; θ_i),    (2)

where f_att(·; θ_i) represents the scaled dot-product attention of the i-th head, i.e., f_att(Q, K, V) = softmax(QK^T / √d_k) V, W_i^Q, W_i^K and W_i^V are learned projection matrices, and d_k is the dimension of the keys. Then, the outputs of the h heads are concatenated and fed through a learned output matrix W^O to calculate the context vector C_t:

C_t = Concat(H_1^t, · · ·, H_h^t) W^O.    (3)
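The per-step cross-attention described above can be sketched as follows; this is a minimal single-query implementation for illustration (dimensions and random weights are arbitrary), not the Transformer's batched implementation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def cross_attention(s_t, Z, W_Q, W_K, W_V, W_O):
    """Multi-head cross-attention for one decoding step (Eqs. 2-3).

    s_t: (d_model,) decoder hidden state S_t
    Z:   (src_len, d_model) encoder outputs
    W_Q/W_K/W_V: (h, d_model, d_k) per-head projections; W_O: (h*d_k, d_model)
    """
    h, _, d_k = W_Q.shape
    heads = []
    for i in range(h):
        q = s_t @ W_Q[i]                       # query, (d_k,)
        K = Z @ W_K[i]                         # keys,  (src_len, d_k)
        V = Z @ W_V[i]                         # values,(src_len, d_k)
        attn = softmax(K @ q / np.sqrt(d_k))   # attention over source tokens
        heads.append(attn @ V)                 # head output H_i^t, (d_k,)
    return np.concatenate(heads) @ W_O         # context vector C_t, (d_model,)

rng = np.random.default_rng(0)
h, d_model, d_k, src_len = 4, 16, 4, 6
C_t = cross_attention(
    rng.normal(size=d_model), rng.normal(size=(src_len, d_model)),
    rng.normal(size=(h, d_model, d_k)), rng.normal(size=(h, d_model, d_k)),
    rng.normal(size=(h, d_model, d_k)), rng.normal(size=(h * d_k, d_model)))
```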

Wait-k Policy
Wait-k policy (Ma et al., 2019) refers to first waiting for k source tokens and then alternately reading and writing one token. Since k is input from outside the model, we call k the external lagging. We define g(t) as a monotonic non-decreasing function of t, which represents the number of source tokens read in when generating the t-th target token. In particular, for the wait-k policy with external lagging k, g(t; k) is calculated as:

g(t; k) = min{k + t − 1, |Z|},  t = 1, 2, · · ·    (4)

In the wait-k policy, the source tokens processed by the encoder are limited to the first g(t; k) tokens when generating the t-th target token. Thus, each head output in the cross-attention is calculated as:

H_i^t = f_att(S_t W_i^Q, Z_≤g(t;k) W_i^K, Z_≤g(t;k) W_i^V; θ_i),    (5)

where Z_≤g(t;k) represents the encoder outputs when the first g(t; k) source tokens have been read in.
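Eq.(4) is simple enough to write down directly; a short sketch:

```python
def g_wait_k(t, k, src_len):
    """g(t; k) = min(k + t - 1, |Z|): number of source tokens read in
    when generating the t-th target token under wait-k (Eq. 4)."""
    return min(k + t - 1, src_len)

# wait-3 on a 10-token source:
g_wait_k(1, 3, 10)  # → 3  (wait for the first 3 tokens before writing)
g_wait_k(9, 3, 10)  # → 10 (the whole source has been read)
```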
The standard wait-k policy (Ma et al., 2019) trains a set of SiMT models, where each model is trained with a fixed wait-k_train and tested with the corresponding wait-k_test (k_test = k_train). Elbayad et al. (2020a) proposed multipath training, which uniformly samples k_train in each batch during training. However, training with both small and large lagging (e.g., k_train = 1 and k_train = ∞) confuses the model parameters between different subspace distributions.

The Proposed Method
In this section, we first view multi-head attention from the perspective of the mixture of experts, and then introduce our method based on it.

Multi-head Attention from MoE View
Multi-head attention can be interpreted from the perspective of the mixture of experts (Peng et al., 2020), where each head acts as an expert. Thus, Eq.(3) can be rewritten as:

C_t = Σ_{i=1}^{h} H_i^t W_i^O    (6)
    = Σ_{i=1}^{h} E_i^t · G_i^t,    (7)

where [W_1^O; · · ·; W_h^O] is a row-wise block sub-matrix representation of W^O, E_i^t = h · H_i^t W_i^O is the output of the i-th expert at step t, and G_i^t ∈ R is the weight of E_i^t, equal to 1/h in vanilla multi-head attention. Therefore, multi-head attention can be regarded as a mixture of experts, where the experts have the same function but different parameters.
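This algebraic identity, concatenation followed by W^O equals a gated sum over row-wise blocks of W^O, can be checked numerically. A sketch with arbitrary dimensions, using the uniform weights G_i^t = 1/h of vanilla multi-head attention:

```python
import numpy as np

rng = np.random.default_rng(0)
h, d_k, d_model = 4, 8, 32
heads = rng.normal(size=(h, d_k))          # per-head outputs H_i^t
W_O = rng.normal(size=(h * d_k, d_model))  # learned output projection

# Standard form: concatenate heads, then project (Eq. 3)
concat_form = heads.reshape(-1) @ W_O

# MoE form: split W^O row-wise into per-head blocks W_i^O (Eqs. 6-7)
blocks = W_O.reshape(h, d_k, d_model)
experts = np.array([h * heads[i] @ blocks[i] for i in range(h)])  # E_i^t
gates = np.full(h, 1.0 / h)                                       # G_i^t = 1/h
moe_form = (gates[:, None] * experts).sum(axis=0)
```

The two forms agree exactly, which is what licenses treating each head as one expert.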

Mixture-of-Experts Wait-k Policy
To obtain a universal model which can perform SiMT with high translation quality under arbitrary latency, we introduce the Mixture-of-Experts Wait-k Policy (MoE wait-k) into SiMT to redefine the experts E_i^t and weights G_i^t in multi-head attention (Eq.(7)). As shown in Figure 2, the experts are given different functions, i.e., performing the wait-k policy with different lagging, and their outputs are denoted as {E_i^t}_{i=1}^{h}. Meanwhile, under the premise of normalization, the weights of the experts are no longer

Figure 2: The architecture of the Mixture-of-Experts Wait-k Policy. Each expert performs wait-k with a different lagging (such as wait-1, wait-3, wait-5, · · ·), and their outputs are combined with different weights.
equal, but are dynamically adjusted according to the source input and the latency requirement, denoted as {G_i^t}_{i=1}^{h}. The details are introduced in the following.

Experts with Different Functions
The experts in our method are assigned different functions, where each expert performs SiMT with a different latency. In addition to the external lagging k in the standard wait-k policy, we define the expert lagging K_MoE = [k_1^E, · · ·, k_h^E], where k_i^E is a hyperparameter we set to represent the fixed lagging of the i-th expert. For example, for a Transformer with 8 heads, if we set K_MoE = [1, 3, 5, 7, 9, 11, 13, 15], then each expert corresponds to one head and the 8 experts concurrently perform wait-1, wait-3, wait-5, · · ·, wait-15, respectively. Specifically, given K_MoE, the output H_i^t of the i-th head at step t is calculated as:

H_i^t = f_att(S_t W_i^Q, Z_≤g_i(t;k) W_i^K, Z_≤g_i(t;k) W_i^V; θ_i),    (8)

where g_i(t; k) = min{g(t; k_i^E), g(t; k)} is the number of source tokens processed by the i-th expert at step t, and g(t; k) is the number of all available source tokens read in at step t. During training, k is uniformly sampled in each batch with multipath training (Elbayad et al., 2020a). During testing, k is the input test lagging. Then, the output E_i^t of the i-th expert when generating the t-th target token is calculated as:

E_i^t = h · H_i^t W_i^O.    (9)
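The per-expert read-in rule, each expert sees at most g(t; k_i^E) tokens but never more than the g(t; k) tokens actually read under the external lagging, can be sketched directly:

```python
def g_wait_k(t, k, src_len):
    """g(t; k) = min(k + t - 1, |Z|), as in the standard wait-k policy."""
    return min(k + t - 1, src_len)

def expert_read_counts(t, k, expert_lagging, src_len):
    """Source tokens visible to each expert at step t:
    g_i(t; k) = min(g(t; k_i^E), g(t; k))."""
    available = g_wait_k(t, k, src_len)  # tokens actually read under external lagging k
    return [min(g_wait_k(t, kE, src_len), available) for kE in expert_lagging]

K_MoE = [1, 3, 5, 7, 9, 11, 13, 15]  # the 8-head example from the text
```

For instance, at step t = 1 with test lagging k = 5, the first three experts see 1, 3 and 5 tokens respectively, and all larger-lagging experts are capped at the 5 tokens actually read.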

Dynamic Weights for Experts
Each expert has a clear division of labor through the expert lagging K_MoE. Then, for different inputs and latency, we dynamically weight the experts with predicted weights {G_i^t}_{i=1}^{h}, where G_i^t ∈ R can be considered as the confidence in the expert output E_i^t. The factors used to predict G_i^t consist of two components:
• e_i^t: the average cross-attention score of the i-th expert at step t, averaged over all source tokens read in (Zheng et al., 2019a).
• k: the external lagging k in Eq.(8).
At step t, all e_i^t and k are concatenated and fed through a multi-layer perceptron (MLP) to predict the confidence score β_i^t of the i-th expert, which is then normalized to calculate the weight G_i^t:

β^t = MLP([e_1^t, · · ·, e_h^t, k]),    (10)
G_i^t = exp(β_i^t) / Σ_{j=1}^{h} exp(β_j^t).    (11)

Finally, the expert outputs are combined with these weights to calculate the context vector C_t:

C_t = Σ_{i=1}^{h} G_i^t · E_i^t.    (12)

The algorithmic details of the proposed MoE wait-k policy are shown in Algorithm 1. At decoding step t, each expert performs the wait-k policy with a different latency according to the expert lagging K_MoE, and then the expert outputs are dynamically weighted to calculate the context vector C_t.
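The gating step can be sketched as below. The single-hidden-layer MLP with tanh is an assumption for illustration; the paper does not fix the gating network's architecture here:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def expert_gates(e, k, W1, b1, W2, b2):
    """Predict expert weights G_i^t from the per-expert average attention
    scores e_i^t and the external lagging k (Eqs. 10-11).

    e: (h,) average cross-attention score per expert at step t.
    A one-hidden-layer MLP is assumed here (hypothetical sizes)."""
    x = np.append(e, float(k))                 # concatenate [e_1^t, ..., e_h^t, k]
    beta = np.tanh(x @ W1 + b1) @ W2 + b2      # confidence scores beta_i^t
    return softmax(beta)                       # normalized weights G_i^t

rng = np.random.default_rng(1)
h, hidden = 4, 8
G = expert_gates(rng.random(h), 5,
                 rng.normal(size=(h + 1, hidden)), np.zeros(hidden),
                 rng.normal(size=(hidden, h)), np.zeros(h))
# The context vector then mixes the expert outputs: C_t = sum_i G[i] * E_i (Eq. 12)
```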

Training Method
We apply two-stage training, where both stages apply multipath training (Elbayad et al., 2020a), i.e., randomly sampling k (the k in Eq.(8)) in every batch during training. First stage: fix the weights G_i^t to 1/h and pre-train the expert parameters. Second stage: jointly fine-tune the parameters of the experts and their weights. At inference time, the universal model is tested with arbitrary latency (test lagging). In Sec.5, we compare the proposed two-stage training method with a one-stage training method which directly trains the parameters of the experts and their weights together.
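The two-stage schedule can be sketched as a training loop. The model interface below (`freeze_gate_uniform`, `unfreeze_gate`, `train_step`) is hypothetical, a stand-in for whatever trainer is actually used, and the uniform sampling of k is the multipath training from the text:

```python
import random

def train_moe_wait_k(model, batches, stage1_epochs, stage2_epochs, k_max):
    """Two-stage training with multipath sampling of the external lagging k.
    The model interface is hypothetical (not from the paper's codebase)."""
    model.freeze_gate_uniform()                  # stage 1: G_i^t fixed to 1/h
    for _ in range(stage1_epochs):
        for batch in batches:
            model.train_step(batch, k=random.randint(1, k_max))  # sample k per batch
    model.unfreeze_gate()                        # stage 2: jointly fine-tune experts + gates
    for _ in range(stage2_epochs):
        for batch in batches:
            model.train_step(batch, k=random.randint(1, k_max))
```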
We tried block coordinate descent (BCD) training (Peng et al., 2020), which was proposed to train experts with the same function, but it is not suitable for our method, as the experts in MoE wait-k have already been assigned different functions. Therefore, our method can be stably trained directly through back-propagation.

Related Work
Mixture of experts MoE was first proposed in multi-task learning (Jacobs et al., 1991; Caruana et al., 2004; Liu et al., 2018; Ma et al., 2018; Dutt et al., 2020). Recently, MoE has also been applied to sequence learning, and some work (He et al., 2018; Shen et al., 2019; Cho et al., 2019) applied MoE for diverse generation. Peng et al. (2020) applied MoE to MT, combining h − 1 heads of the Transformer into one expert.
Previous work mostly applied MoE for diversity. In contrast, our method makes the experts more regular in the parameter space, which provides a way to improve translation quality with MoE.
SiMT Early read/write policies in SiMT used segmented translation (Bangalore et al., 2012; Cho and Esipova, 2016; Grissom II et al., 2014). Later work proposed adaptive segmentation policies, and Bahar et al. (2020) proposed an alignment-based chunking policy. A common weakness of these previous methods is that they all train separate models for different latency levels. Our method only needs one universal model to perform SiMT under all latency levels, and meanwhile achieves better translation quality.

Datasets
We evaluated our method on the following three datasets, which range in scale from small to large.
IWSLT15 English→Vietnamese (En-Vi) (133K pairs) (Cettolo et al., 2015). We use TED tst2012 (1,553 pairs) as the validation set and TED tst2013 (1,268 pairs) as the test set. Following Raffel et al. (2017) and Ma et al. (2020), we replace tokens whose frequency is less than 5 with ⟨unk⟩. After replacement, the vocabulary sizes are 17K and 7.7K for English and Vietnamese, respectively.
For En-Ro and De-En, BPE (Sennrich et al., 2016) is applied with 32K merge operations and the vocabulary is shared across languages.

System Settings
We conducted experiments on the following systems.
Offline Conventional Transformer (Vaswani et al., 2017) model for full-sentence translation, decoding with greedy search.
Standard Wait-k The standard wait-k policy proposed by Ma et al. (2019). When evaluating with test lagging k_test, we apply the result from the model trained with k_train = k_test.
Optimal Wait-k An optimal variation of standard wait-k. When decoding with k test , we traverse all models trained with different k train and apply the optimal result among them. For example, if the best result when testing with wait-1 (k test = 1) comes from the model trained by wait-5 (k train = 5), we apply this optimal result. 'Optimal Wait-k' selects the best result according to the reference, so it can be considered as an oracle.
Multipath Wait-k An efficient training method for wait-k policy (Elbayad et al., 2020a). In training, k train is no longer fixed, but randomly sampled from all possible lagging in each batch.
MU A segmentation policy based on meaning units, which obtains results comparable with the SOTA adaptive policy. At each decoding step, if a meaning unit is detected by a BERT-based classifier, 'MU' feeds the received source tokens into a full-sentence MT model to generate target tokens until the ⟨EOS⟩ token is generated.
MMA Monotonic multi-head attention (MMA) proposed by Ma et al. (2020), the state-of-the-art adaptive policy for SiMT, which is an implementation of 'MILk' (Arivazhagan et al., 2019) based on the Transformer. At each decoding step, 'MMA' predicts a Bernoulli variable to decide whether to start translating or to wait for another source token.
MoE Wait-k A variation of our method, which directly trains the parameters of experts and their weights together in one-stage training.
Equal-Weight MoE Wait-k A variation of our method, where the weight of each expert is fixed to 1/h.

MoE Wait-k + FT Our method in Sec.3.2.
The implementations of all systems are adapted from the Fairseq library, and the settings are exactly the same as Ma et al. (2019) and Ma et al. (2020). To verify that our method is effective on Transformers with different head settings, we conduct experiments on three Transformer variants, whose settings are the same as Vaswani et al. (2017). For En-Vi, we apply Transformer-Small (4 heads). For En-Ro, we apply Transformer-Base (8 heads). For De-En, we apply both Transformer-Base and Transformer-Big (16 heads). Table 2 reports the parameters of the different SiMT systems on De-En (Big). To perform SiMT under different latency levels, 'Standard Wait-k', 'Optimal Wait-k' and 'MMA' all require multiple models, while 'Multipath Wait-k', 'MU' and 'MoE Wait-k' only need one trained model. The expert lagging K_MoE in MoE wait-k is a hyperparameter we set, representing the lagging of each expert. We did not search extensively over K_MoE, but set it to be uniformly distributed in a reasonable lagging interval, as shown in Table 1. We analyze the influence of different settings of K_MoE in Sec.6.5.
We evaluate these systems with BLEU (Post, 2018) for translation quality and Average Lagging (AL) (Ma et al., 2019) for latency. Given g(t), the latency metric AL is calculated as:

AL = (1/τ) Σ_{t=1}^{τ} ( g(t) − (t − 1) / (|y| / |x|) ),  τ = argmin_t {g(t) = |x|},

where |x| and |y| are the lengths of the source sentence and target sentence, respectively (implementation: github.com/SimulTrans-demo/STACL).

Figures 3 and 4 show the comparison between our method and the previous methods on Transformers with various head settings. In all settings, 'MoE Wait-k + FT' outperforms the previous methods under all latency levels. Our method brings the performance of SiMT much closer to the offline model, almost reaching the performance of full-sentence MT when lagging 9 tokens. Compared with 'Standard Wait-k', our method improves by 0.60 BLEU on En-Vi, 2.11 BLEU on En-Ro, 2.33 BLEU on De-En (Base), and 2.56 BLEU on De-En (Big), respectively (averaged over all latency levels). More importantly, our method only needs one well-trained universal model to perform SiMT under all latency levels, while 'Standard Wait-k' requires training a separate model for each latency. Besides, 'Optimal Wait-k' traverses many models to obtain the optimal result under each latency. Our method dynamically weights the experts according to the test latency and outperforms 'Optimal Wait-k' under all latency levels, without searching among many models.
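For reference, the AL computation defined above can be sketched in a few lines; the examples use the wait-k read schedule g(t; k) from Eq.(4):

```python
def average_lagging(g, src_len, tgt_len):
    """Average Lagging (Ma et al., 2019).

    g[t-1] is the number of source tokens read when emitting target token t;
    the average runs up to the first step at which the full source is read."""
    gamma = tgt_len / src_len                        # target-to-source length ratio
    tau = next(t for t in range(1, tgt_len + 1) if g[t - 1] >= src_len)
    return sum(g[t - 1] - (t - 1) / gamma for t in range(1, tau + 1)) / tau

# wait-3 on a 5-token source with a 5-token target: g = [3, 4, 5, 5, 5]
average_lagging([3, 4, 5, 5, 5], 5, 5)  # → 3.0
```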

Both our method and 'Multipath Wait-k' can train a universal model, but our method avoids the mutual interference between the different values of k sampled during training. 'Multipath Wait-k' often improves translation quality under low latency but, conversely, degrades it under high latency (Elbayad et al., 2020b). The reason is that sampling a slightly larger k in training improves translation quality under low latency (Ma et al., 2019), but sampling a smaller k harms translation quality under high latency. Our method introduces expert lagging and dynamically weights the experts, thereby mitigating this interference.

Compared with 'MMA' and 'MU', our method performs better. 'MU' sets a threshold to perform SiMT under different latency levels and achieves good translation quality, but it struggles to perform SiMT under low latency as it is a segmentation policy. As a fixed policy, our method maintains the advantage of simple training and meanwhile catches up with the adaptive policy 'MMA' on translation quality, which is encouraging. Furthermore, our method only needs a universal model to perform SiMT under different latency levels, and the test latency can be set explicitly, which is impossible for the previous adaptive policies.

Ablation Study
We conducted ablation studies on the dynamic weights and the two-stage training, as shown in Figure 3 and Figure 4. The translation quality decreases significantly when the experts are set to equal weights. Our method dynamically adjusts the weight of each expert according to the input and the test lagging, enabling it to perform well under all latency levels concurrently. For the training methods, two-stage training makes the training of the weights more stable, thereby improving translation quality, especially under high latency.

Analysis
We conducted extensive analyses to understand the specific improvements of our method. Unless otherwise specified, all the results are reported on De-En with Transformer-Base(8 heads).

Performance on Various Difficulty Levels
The difference between target and source word order is one of the challenges of SiMT, where word-order inversions force the model to start translating before reading the aligned source words. To verify the performance of our method at various difficulty levels, we evenly divided the test set into three parts: EASY, MIDDLE and HARD. Specifically, we used fast-align (Dyer et al., 2013) to align the source with the target, and then calculated the number of crossings in the alignments (the number of reversed word orders), which is used as the basis for dividing the test set (Chen et al., 2020). After the division, the alignments in the EASY set are basically monotonic, while the sentence pairs in the HARD set contain at least 12 reversed word orders.

Our method outperforms the standard wait-k at all difficulty levels, especially improving by 3.90 BLEU on the HARD set under low latency. The HARD set contains many word-order reversals, which is disastrous for low-latency SiMT such as testing with wait-1. The standard wait-k enables the model to gain some implicit prediction ability (Ma et al., 2019), and our method further strengthens it. MoE wait-k introduces multiple experts with varying expert lagging: the larger expert lagging helps the model improve its implicit prediction ability, while the smaller expert lagging avoids learning too much future information during training and prevents the hallucination caused by over-prediction (Chen et al., 2020). With MoE wait-k, the implicit prediction ability is stronger and more stable.
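One plausible way to count such alignment crossings is sketched below; the paper's exact counting script is not given, so treat this as an illustrative definition (an alignment is a set of (source index, target index) links, and two links cross when their source and target orders disagree):

```python
def count_crossings(alignment):
    """Count reversed word-order pairs in a word alignment.

    alignment: iterable of (src_idx, tgt_idx) links, e.g. fast-align output.
    Two links cross when the source order and the target order disagree,
    the basis assumed here for the EASY / MIDDLE / HARD split."""
    links = sorted(alignment)
    crossings = 0
    for a in range(len(links)):
        for b in range(a + 1, len(links)):
            (s1, t1), (s2, t2) = links[a], links[b]
            if (s2 - s1) * (t2 - t1) < 0:  # orders reversed between the two links
                crossings += 1
    return crossings

count_crossings([(0, 0), (1, 1), (2, 2)])  # → 0 (monotonic, EASY-like)
count_crossings([(0, 2), (1, 1), (2, 0)])  # → 3 (fully reversed)
```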

Improvement on Robustness
Robustness is another major challenge for SiMT (Zheng et al., 2020b). SiMT is often used as a downstream task of streaming automatic speech recognition (ASR), but the results of streaming ASR are not stable, especially the last recognized source token (Gaido et al., 2020; Zheng et al., 2020b). At each decoding step, we randomly modified the last source token with different proportions, and the results are shown in Figure 5. Our method is more robust to a noisy last token, owing to its multiple experts. Due to the different expert lagging, the number of source tokens processed by each expert differs, and some experts do not consider the last token. Thus, a noisy last token only affects some of the experts, while the others are not disturbed, giving rise to robustness.

Differentiation of Experts Distribution
Our method clearly divides the experts into different functions and integrates the expert outputs from different subspaces for better translation. For 'Multipath Wait-k' and our method, we sampled 200 cases and reduced the dimension of the expert outputs (evaluated with wait-5) with the t-Distributed Stochastic Neighbor Embedding (t-SNE) technique; the subspace distributions of the expert outputs are shown in Figure 6.

Superiority of Dynamic Weights
Different expert outputs are dynamically weighted to achieve the best performance under the current test latency, so we calculated the average weight of each expert under different latency in Table 4.
Through dynamic weighting, the expert lagging of the expert with the highest weight is similar to the k_train of the optimal model under the standard wait-k, while avoiding the traversal over many trained models. When the test lagging is larger, the expert with larger expert lagging has a higher weight, and vice versa. Besides, the expert with a slightly larger expert lagging than k_test tends to get the highest weight for better translation, which is in line with previous conclusions (Ma et al., 2019). Furthermore, our method enables the model to comprehensively consider the various expert outputs through dynamic weights, thereby producing a more comprehensive translation.

Effect of Expert Lagging
Expert lagging K MoE is the hyperparameter we set to control the lagging of each expert. We experimented with several settings of K MoE to study the effects of different expert lagging K MoE , as shown in Figure 7.
Overall, all settings of K_MoE outperform the baseline, and different K_MoE have only a slight impact on performance, which shows that our method is not sensitive to how K_MoE is set. Nevertheless, there are some subtle differences between the settings, where the 'Original' setting performs best. 'Low interval' and 'High interval' only perform well under part of the latency range, as their K_MoE is concentrated in a small lagging interval. 'Repeated' does not perform well because the diversity of its expert lagging is poor, losing the advantage of MoE. The performance of 'Wide span' drops under low latency, because the average sentence length is about 20 tokens, and much larger laggings are not conducive to low-latency SiMT.
In summary, we give a general guideline for setting the expert lagging K_MoE: it should maintain diversity and be uniformly distributed in a reasonable lagging interval, such as from 1 to 15 tokens.

Conclusion and Future Work
In this paper, we propose the Mixture-of-Experts Wait-k Policy to develop a universal SiMT model, which can perform high-quality SiMT under arbitrary latency to fulfill different scenarios. Experiments and analyses show that our method achieves promising results in performance, efficiency and robustness.
In the future, since MoE wait-k provides a universal SiMT model with high quality, it can be applied as a SiMT kernel cooperating with refined external policies to further improve performance.