Multilingual Simultaneous Neural Machine Translation

Simultaneous machine translation (SIMT) involves translating source utterances into the target language in real time, before the speaker's utterance completes. This paper proposes a multilingual approach to SIMT, where a single model simultaneously translates between multiple language pairs. This not only results in more efficiency in terms of the number of models and parameters (hence simpler deployment), but may also lead to higher performing models by capturing commonalities among the languages. We further explore simple and effective multilingual architectures based on two strong recently proposed SIMT models. Our results on translating from two Germanic languages (German, Dutch) and three Romance languages (French, Italian, Romanian) into English show that (i) the single multilingual model is on-par or better than individual models, and (ii) multilingual SIMT models trained based on language families are on-par or better than the universal model trained for all languages.


Introduction
Simultaneous translation is the task of incrementally generating the translation while the source utterance is gradually spoken. It is crucial in multinational meetings, e.g., in business and politics, where online simultaneous translation is required for one or more language pairs. Simultaneous machine translation (SIMT) is an attempt to address the challenges of this translation scenario, i.e., trading off translation quality against latency (Cho and Esipova, 2016; Arivazhagan et al., 2019, 2020; Firat et al., 2016b).
In this paper, we investigate the multilingual SIMT setting, where a single model simultaneously translates between multiple language-pairs. This not only results in more efficiency in terms of the number of models and parameters (hence simpler deployment), but may also lead to higher performing models by capturing commonalities among the languages. The multilingual setting has been successful for the standard offline neural machine translation (NMT) and studied extensively (Johnson et al., 2017;Tan et al., 2019;Aharoni et al., 2019).
We explore simple and effective multilingual architectures based on two strong, recently proposed SIMT models: WAIT-K (Dalvi et al., 2018) and COUPLED POLICY (Arthur et al., 2020). The former waits to read a fixed number k of input tokens; afterwards, it writes (generates) one output token for each newly received input token. The latter learns a policy, via an agent, for adaptively alternating between reading and writing to reduce translation delay while maintaining quality.
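The fixed wait-k schedule can be sketched as follows. This is an illustrative reconstruction of the policy described above, not the authors' code; the function name is ours.

```python
def wait_k_actions(src_len: int, tgt_len: int, k: int):
    """Return the READ/WRITE schedule of a wait-k policy.

    The policy first reads k source tokens, then alternates WRITE/READ,
    writing one target token per newly read source token. Once the source
    is exhausted, it writes the remaining target tokens.
    """
    actions = []
    read, written = 0, 0
    while written < tgt_len:
        # READ while fewer than k + written source tokens have been consumed
        if read < min(src_len, k + written):
            actions.append("READ")
            read += 1
        else:
            actions.append("WRITE")
            written += 1
    return actions
```

For k = 2 and a 5-token source/target pair, this yields two initial READs, then alternating WRITE/READ, with the final WRITEs emitted after the source is fully read.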
COUPLED POLICY uses adaptive waiting derived from offline word alignments: it continues reading source tokens until the source token aligned to the next target token has been observed.
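This oracle construction can be illustrated with a small sketch: before writing each target token, read until the last source token it aligns to has been observed. This is our hedged reconstruction of the idea; the exact oracle generation in Arthur et al. (2020) may differ in detail.

```python
def oracle_from_alignments(src_len, tgt_len, align):
    """Derive READ/WRITE oracle actions from word alignments (sketch).

    align: dict mapping a target index to the set of source indices it is
    aligned to. For each target position, READ until the right-most aligned
    source token has been read, then WRITE. Unaligned target tokens are
    written after at least one READ.
    """
    actions, read = [], 0
    for j in range(tgt_len):
        # number of source tokens required before writing target token j
        need = max(align.get(j, {-1})) + 1
        while read < max(need, 1) and read < src_len:
            actions.append("READ")
            read += 1
        actions.append("WRITE")
    return actions
```

For example, with a 3-token source and target where the second target word aligns to the last source word, the oracle waits until the whole source is read before the second WRITE.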
On top of these underlying SIMT models, we explore a multi-task learning (MTL) framework with full and partial parameter sharing protocols across the languages, using language indicators.
Our experiments show the effectiveness of the simple strategy of sharing all SIMT components across the languages, with language tags specifying the translation task. The results on translating from two Germanic languages (German, Dutch) and three Romance languages (French, Italian, Romanian) into English show that the single multilingual model is on-par or better than individual models. Furthermore, the results show that multilingual SIMT models trained based on language families are on-par or better than the universal model trained for all languages.

Algorithm 1 Training multilingual NPI-SIMT. (F, E) denotes the language pair of corpus D_i.
  while not converged do
    for each parallel corpus D_i do
      for (x, y, a) ∈ D_i do
        update the relevant modules among (θ_A, θ_E, θ_D) on (x, y, a) by maximum likelihood
      end for
    end for
  end while

Multilingual Simultaneous Translation

The original neural programmer-interpreter (NPI) SIMT framework (Arthur et al., 2020) employs a trainable programmer θ_prog and interpreter θ_intp. The programmer (agent) issues READ/WRITE commands to control the interpreter, i.e. the NMT model. The interpreter is constructed from an encoder θ_E and a decoder θ_D. Each component is trained on triplets (x, y, a), where x is the source sentence, y is the target sentence, and a is the program oracle, using behavioral cloning (Torabi et al., 2019). For notational clarity we rename the programmer to θ_A, resulting in the triplet of trainable modules (θ_A, θ_E, θ_D).

Language-Specific Parameters
We further extend this framework with language-specific parameters θ_x^l, where x is a specific module and l is a specific language. These language-specific parameters are similar to Firat et al. (2016a), Dong et al. (2015), and Ahmadnia and Dorr (2020), where parameters are separated based on the source and target languages. In the case of SIMT, the program a is affected by both languages. This framework enables us to use multiple parallel corpora D_i and train a language-specific module using maximum likelihood estimation by updating the particular θ_x^l based on D_i. The training algorithm for our NPI-SIMT is shown in Algorithm 1.
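A minimal sketch of this training loop, under assumed (hypothetical) container names: each slot (agent θ_A, encoder θ_E, decoder θ_D) holds either one shared module or one module per language, and each corpus updates only the modules matching its language pair. The dict "modules" stand in for real neural networks.

```python
class ModuleBank:
    """Holds either one shared module (keyed '*') or one per language."""

    def __init__(self, make_module, languages=None):
        self.modules = {l: make_module() for l in (languages or ["*"])}

    def get(self, lang):
        # fall back to the shared module when no language-specific one exists
        return self.modules[lang] if lang in self.modules else self.modules["*"]


def train_epoch(banks, corpora):
    """corpora: list of (src_lang, tgt_lang, examples), with (x, y, a) triplets.
    One maximum-likelihood update per triplet on the routed modules."""
    for src, tgt, examples in corpora:
        agent = banks["A"].get(f"{src}-{tgt}")
        encoder = banks["E"].get(src)
        decoder = banks["D"].get(tgt)
        for x, y, a in examples:
            for module in (agent, encoder, decoder):
                module["updates"] += 1  # stand-in for a gradient step


banks = {
    "A": ModuleBank(lambda: {"updates": 0}),                # shared agent
    "E": ModuleBank(lambda: {"updates": 0}, ["de", "nl"]),  # per-language encoders
    "D": ModuleBank(lambda: {"updates": 0}),                # shared decoder (English)
}
train_epoch(banks, [("de", "en", [("x1", "y1", "a1")]),
                    ("nl", "en", [("x2", "y2", "a2")])])
```

With this routing, the shared decoder is updated by both corpora while each encoder only sees its own language.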

Multilingual Parameter Sharing
Multilingual parameter sharing is achieved by using only a single module for the language-specific parameter θ_x^*. Depending on the module, we can disregard the source (F) or target (E) language completely. This allows us to share the parameter across different parallel corpora. However, the embedding matrices of different D_i can differ because of varying tokenization and vocabulary construction methods. To remedy this, we can either train joint vocabulary spaces for the source and target sides, or simply join the different spaces using a union operation. Herein, we use the latter method.
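The union operation can be sketched as follows (the function name is ours): tokens from each per-corpus vocabulary are merged into one joint index, with each token keeping the first index it is assigned.

```python
def join_vocabs(vocabs):
    """Union of per-corpus vocabularies into one shared token-to-id map."""
    joint = {}
    for vocab in vocabs:
        for token in vocab:
            if token not in joint:
                joint[token] = len(joint)
    return joint


de_vocab = ["das", "haus", "<eos>"]
nl_vocab = ["het", "huis", "<eos>"]
joint = join_vocabs([de_vocab, nl_vocab])
```

Shared special tokens such as `<eos>` collapse to a single entry, so the joint vocabulary here has five tokens rather than six.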
Language Indicator Embedding When the interpreter is shared, it is difficult to communicate which pair of languages is being processed. To convey this, we pass the source and target language embeddings to the encoder and decoder, respectively. This information is combined with the word embeddings using addition.
In the programmer, we use a concatenation of both the source and target language embeddings.
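A small sketch of the two combination modes, assuming toy embedding sizes: the encoder/decoder add the language indicator to each word embedding, while the programmer receives the source and target indicators concatenated.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # toy embedding dimension
lang_emb = {l: rng.normal(size=d) for l in ("de", "nl", "en")}


def encoder_input(word_vecs, src_lang):
    # the language indicator is *added* to every word embedding
    return word_vecs + lang_emb[src_lang]


def programmer_lang_feature(src_lang, tgt_lang):
    # the programmer sees source and target indicators *concatenated*
    return np.concatenate([lang_emb[src_lang], lang_emb[tgt_lang]])


words = rng.normal(size=(3, d))  # 3 source word embeddings
enc_in = encoder_input(words, "de")
prog_feat = programmer_lang_feature("de", "en")
```

Addition keeps the encoder/decoder input dimension unchanged, whereas concatenation in the programmer doubles the language-feature dimension.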
Batch of Multilingual Instances Algorithm 1 outlines the overall training procedure of multilingual SIMT. Here, it is crucial to construct each batch as a mixture of many language pairs to achieve good multilingual training. We also need to include the source language information to create the language indicator embedding. If a module is language-agnostic, it consumes all the input; otherwise, language-specific modules process the items in the batch according to their language. Results from the different languages are then aggregated using concatenation.
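Batch mixing can be sketched as below (the function name and data layout are ours): examples from all parallel corpora are pooled, tagged with their language pair so language-specific modules can route on them, shuffled, and sliced into batches.

```python
import random


def mixed_batches(corpora, batch_size, seed=0):
    """Yield batches mixing examples from several language pairs.

    corpora: {(src_lang, tgt_lang): [examples]}. Each batch item is a
    (language_pair, example) tuple, keeping the routing tag attached.
    """
    pool = [(pair, ex) for pair, exs in corpora.items() for ex in exs]
    random.Random(seed).shuffle(pool)
    for i in range(0, len(pool), batch_size):
        yield pool[i:i + batch_size]


corpora = {("de", "en"): ["x1", "x2", "x3"],
           ("nl", "en"): ["x4", "x5", "x6"]}
batches = list(mixed_batches(corpora, batch_size=4))
```

Every item carries its language pair, so a language-agnostic module can consume the whole batch while language-specific modules filter by tag.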

Experiments
Our experiments aim to investigate the effects of multilingualism on SIMT architectures. To achieve this, we choose language pairs (1) from the same family group and (2) mixed across all groups, and investigate various parameter sharing strategies for the components of the SIMT architectures.
Datasets. We use the IWSLT 2017 (Cettolo et al., 2017) datasets for all parallel corpora, all translating into English. This choice was made because they are spoken multilingual corpora from TED talks. We choose the Germanic language group, German (DE) and Dutch (NL), and the Romance language group, Italian (IT), French (FR), and Romanian (RO). Languages within the same group generally have high syntactic similarity and the same word order. Unless otherwise specified, we use the same settings and preprocessing as described in Arthur et al. (2020).

SIMT Systems. We compare two SIMT baselines, COUPLED POLICY (Arthur et al., 2020) and the WAIT-K model (Ma et al., 2018). For a fair comparison, we choose a value of k that achieves translation quality comparable to the COUPLED POLICY system. Following some initial experiments, we choose k = 2.
Parameter Sharing. Since our model deals with a many-to-one translation task with an agent, we separate i) the encoder, ii) the agent, and iii) both encoder and agent. This idea comes from the performance improvements that a number of studies demonstrated by separating the decoder in offline one-to-many MT (Dong et al., 2015; Sachan and Neubig, 2018). In SIMT, two modules, the encoder and the agent, are tied to the source, and it is therefore reasonable to make them language-specific.
Evaluation. Following Arthur et al. (2020), we evaluate the systems based on their translation quality and delay. Translation quality is measured by case-sensitive BLEU (Papineni et al., 2002), calculated using sacrebleu (Post, 2018). We adopt two delay measurements from previous studies: (1) average proportion (AP) (Cho and Esipova, 2016), the average fraction of the source read when each target word is emitted, and (2) average lagging (AL) (Ma et al., 2019), the average number of source words the system lags behind the speaker until the full input is read.
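The two delay metrics can be computed as follows, based on their published definitions, where g(t) denotes the number of source words read when the t-th target word is written. This is our sketch, not the paper's evaluation code.

```python
def average_proportion(g, src_len):
    """AP (Cho and Esipova, 2016): mean fraction of the source that has
    been read when each target word is emitted."""
    return sum(g) / (src_len * len(g))


def average_lagging(g, src_len):
    """AL (Ma et al., 2019): average number of source words the system lags
    behind an ideal wait-0 policy, averaged up to the first target word
    written after the full source has been read.

    Assumes the full source is eventually read (g reaches src_len).
    """
    gamma = len(g) / src_len  # target-to-source length ratio
    tau = next(t for t, gt in enumerate(g, 1) if gt == src_len)
    return sum(g[t - 1] - (t - 1) / gamma for t in range(1, tau + 1)) / tau
```

For a wait-2 run on equal-length source and target, g = [2, 3, 4, 5, 5], this gives AL = 2 (i.e. AL recovers k) and AP = 0.76.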

Results
In this section, we describe the results of parameter sharing in SIMT, followed by a comparison of multilingualism under different language groups.

Parameter Sharing Strategies. Table 1 presents the results of various parameter sharing strategies for FR/IT/RO in the Romance language family. When sharing all parameters across these three languages, WAIT-2 has a slight increase in delay, but its translation quality is comparable to or better than the bilingual models. In contrast, the best parameter sharing setting for COUPLED POLICY is to have language-specific encoders and share the rest of the parameters. This has a clear advantage in both quality and delay: the BLEU score increases by up to 0.8 units, with a reduction in AL of approximately 10% for FR and IT. In both architectures, the model size reduces drastically when trained in the multilingual setting, and remains approximately the same across the different sharing strategies. These results are consistent for DE/NL in the Germanic language family. Full results are included in the supplementary material.
Multilingual Modelling Strategies. Table 2 shows the overall performance comparison in the multilingual setting. Multilingualism in SIMT evidently surpasses the bilingual baselines in translation delay, quality, and/or model size. Generally, SIMT trained on the same language family outperforms not only the bilingual baselines, but also the universal multilingual model. In the Germanic language family, training within the same language group boosts BLEU by up to 1 unit. Although the WAIT-2 bilingual baseline has a shorter delay, COUPLED POLICY surpasses it in both quality and delay. We observe that when the model is trained universally, the BLEU score falls back to, or below, that of the bilingual model.
On the other hand, the Romance language family behaves slightly differently across the SIMT models. COUPLED POLICY behaves similarly, with training under the same group positively influencing performance, but under WAIT-2, the universal model performs best. This is particularly interesting because the Romance languages have the same word order as English, so WAIT-2 should be a perfect fit for translation between two languages with the same word order. However, this is not the case: mixing all the languages regardless of word order under WAIT-2 improves translation quality more while preserving the delay. Under the same SIMT model, COUPLED POLICY performs better when trained within the same language group. The model size also decreases by 40% compared to the bilingual baselines: the same-language-family models have a total of 164.2M parameters versus 278.2M for the bilingual models. WAIT-2 shows somewhat mixed results: DE and NL have the highest BLEU when trained language-family-wise, but the Romance language family benefits most from universal training on all languages. One should also note that a lower delay under WAIT-K with the same k does not mean outputting the target sentence faster: (1) by the nature of WAIT-K, the model follows fixed READ and WRITE actions, and (2) the formulation of AL accounts not only for the lagging of the translation but also for the numbers of tokens produced as output and taken as input. Therefore, a lower AL indicates a change in the probability of producing the end of the sentence, which generates a shorter target sentence and/or stops the translation without fully observing the input, impacting the delay. Nevertheless, under WAIT-2, translation quality improves and model size decreases with multilingualism.

Discussion
The parameter sharing settings in our experiments are inspired by the observation that multilingual NMT can benefit from separating encoder and decoder parameters (Dong et al., 2015; Sachan and Neubig, 2018; Ahmadnia and Dorr, 2020). The motivation in Dong et al. (2015) and Sachan and Neubig (2018) is that separating decoder parameters in the one-to-many setting is beneficial because of the difficulty of the one-to-many translation task. Our problem is SIMT, where not only the mapping from the source language to the target language is important, but learning when to map is equally important. Hence, our assumption was that, due to the difficulty of the many-to-one SIMT task, making the encoder and the agent language-specific would help performance. Under WAIT-K, encoding the representation of the source language separately does not seem to help. However, Table 2 shows that the multilingual setting surpasses the bilingual one. This is similar to traditional NMT: the model generalizes the translation tasks across different languages and leverages the correlation across the source languages.
COUPLED POLICY is a more complex architecture than WAIT-K, as it also needs to learn the optimal policy from an oracle trajectory. However, it benefits more when trained on the same language family. Since its oracle is generated from offline word alignments between the source and target languages, its read/write mechanism depends on word order and language properties. Our results in Table 2 also support this, as the model trained on the same language family surpasses both the bilingual baselines and the universal model. The interesting observation here is that, unlike WAIT-K, a separate encoder is more advantageous than the fully shared architecture, while separating both the encoder and the agent significantly degrades BLEU. This suggests that language-specific encoders can form better representations of the source languages than a shared one, but if the agent is separated as well, the model struggles to map from the source language to the target language. This explains why the separated encoder and agent in COUPLED POLICY suffer a BLEU drop while AL is not affected as significantly. Therefore, because COUPLED POLICY takes advantage of the same word order through its oracle trajectory, the shared agent can better capture a general representation of the shared word order, while the language-specific encoders help the agent by focusing only on encoding the representation of each source language.

Related Work
Simultaneous Machine Translation SIMT has been explored as a sequential decision-making translation problem. The NPI architecture is employed to 1) choose whether to read another input token or produce an output token, using an agent (programmer), and 2) translate the partially observed input tokens into output using a neural machine translation (NMT) interpreter (Satija and Pineau, 2016; Gu et al., 2016). The initial approaches mainly trained the agent using reinforcement learning, with rewards assigned to balance the trade-off between translation quality and delay (Gu et al., 2016; Satija and Pineau, 2016; Alinejad et al., 2018). However, this has stability and robustness issues due to sparse reward signals, so imitation learning using oracle actions has been independently attempted (Zheng et al., 2019; Arthur et al., 2020; Dalvi et al., 2018).

Multilingual Machine Translation
In NMT, multilingual training is a popular MTL approach as it is very simple but effective (Johnson et al., 2017; Sachan and Neubig, 2018; Dong et al., 2015; Dabre et al., 2020). Instead of choosing entirely different NLP tasks and increasing the complexity of implementation (Niehues and Cho, 2017; Zaremoodi and Haffari, 2018), the multilingual setting only involves concatenating multiple bilingual language pairs for training (Johnson et al., 2017). The language pairs form the task space in MTL, which determines the performance of the model, so the selection of language pairs influences the overall translation performance (Tan et al., 2019).
Parameter sharing in the multilingual setting has also been extensively studied. Dong et al. (2015) initially used language-specific decoders for one-to-many translation. This was further extended to sharing decoder parameters partially (Sachan and Neubig, 2018). Ahmadnia and Dorr (2020) investigated hierarchical parameter sharing based on the similarity between languages. Such simple parameter sharing has been shown to restrict sharing among dissimilar languages, improving the translation quality of all the languages (Johnson et al., 2017; Sachan and Neubig, 2018; Cettolo et al., 2017).

Conclusions
In this paper, we have investigated multilingual SIMT using the IWSLT 2017 datasets. We have explored simple and effective multilingual architectures based on two strong, recently proposed SIMT models, namely WAIT-K and COUPLED POLICY. Experiments show that the best parameter sharing strategy for the WAIT-K model, when dealing with DE/NL (Germanic languages) and RO/IT/FR (Romance languages), is to share all SIMT components across the languages regardless of the language set. However, the best sharing strategy seems to depend on the language family when it comes to COUPLED POLICY. Under the best parameter sharing strategy, our results have shown that (i) the single multilingual model is on-par or better than individual models, and (ii) multilingual SIMT models trained based on language families are on-par or better than the universal model trained for all languages. Furthermore, (iii) COUPLED POLICY takes advantage of shared word order, so it achieves its best performance with language-specific encoders and training within the same language family.

For future work, we plan to extend this study to larger datasets, as Aharoni et al. (2019) demonstrated that the scale of the parallel corpora can lead to different conclusions in multilingual NMT. To maintain the characteristics of spoken language, the translation datasets must be selected carefully. Secondly, we will investigate different language families, including Slavic and Austronesian languages. Consistent results across different families would strengthen the claims of this paper.

A Appendices
SIMT Architecture We used a single-layer long short-term memory (LSTM) SEQ2SEQ model as the interpreter for both SIMT models. For COUPLED POLICY, the programmer is an LSTM transducer with a binary softmax that generates a READ or WRITE action. WAIT-K follows the fixed oracle actions with k = 2, without the programmer. We used scheduled sampling of 5%, 15%, and 15% for training the NPI-SIMT framework, as described in Arthur et al. (2020). Training uses Adam (Kingma and Ba, 2015) with an initial learning rate of 0.001, halved each time the perplexity increases on the development set. Training stops early after the fourth learning-rate reduction. During testing, we use a standard beam search similar to Gu et al. (2016) with b = 5. Training takes 6 hours on a single V100 GPU for one source language; multilingual training time scales linearly with the number of parallel corpora used.