Multilingual Agreement for Multilingual Neural Machine Translation

Although multilingual neural machine translation (MNMT) supports translation among multiple languages, its training process is based on independent multilingual objectives. Most multilingual models cannot explicitly exploit different language pairs to assist each other and thus ignore the relationships among them. In this work, we propose a novel agreement-based method that encourages multilingual agreement among different translation directions by minimizing the differences among them. We combine the multilingual training objectives with the agreement term by randomly substituting some fragments of the source language with their counterpart translations in auxiliary languages. To examine the effectiveness of our method, we conduct experiments on a multilingual translation task with 10 language pairs. Experimental results show that our method achieves significant improvements over previous multilingual baselines.


Introduction
Multilingual neural machine translation (MNMT) has experienced rapid growth in recent years (Johnson et al., 2017; Aharoni et al., 2019). It is not only capable of translating among multiple language pairs, encouraging crosslingual knowledge transfer to improve low-resource translation performance (Firat et al., 2016b; Zoph et al., 2016; Sen et al., 2019; Qin et al., 2020; Hedderich et al., 2020; Raffel et al., 2020), but can also handle multiple language pairs in a single model, reducing model parameters and training costs (Firat et al., 2016a; Blackwood et al., 2018; Sun et al., 2020).
Previous works in MNMT simply optimize independent translation objectives and do not use arbitrary auxiliary languages to encourage agreement across different translation directions. As shown in Figure 1, the multilingual baseline is trained separately on the French-English and German-English directions, so the two directions cannot explicitly promote each other. The German-English translation only implicitly helps the French-English translation because both directions share the same encoder, and a gap still exists between the two translation directions. As a result, minimizing the difference across translation directions through an explicit paradigm requires further exploration.
In this paper, we propose a novel agreement-based method that explicitly models the shared semantic space for multiple languages and encourages agreement across them. Our training procedure extends multilingual translation with an agreement term, which encourages the model to translate a source sentence containing multiple languages into the target sentence. As Figure 1 shows, we randomly substitute some source phrases with their counterparts in other languages, using word alignment, to create code-switched sentences. Our model is jointly trained with the multilingual translation and agreement objectives, where the code-switched sentences are translated into the target sentences. The key idea is to encourage agreement among different translation directions simultaneously by leveraging the alignment information of bilingual source sentence pairs.
[Figure caption: $x^{L_i}_{m_i}$ denotes the $m_i$-th token in the sentence of language $L_i$. We randomly substitute source phrases of language $L_{src} = L_i$ with the translations of other languages $L_{aux} \in L_{all}$ to create code-switched sentences. Different words/phrases with the same meaning may contain different numbers of tokens. Then the code-switched source sentences are translated to the target language $L_{tgt} = L_k$ by the multilingual model. This process greatly encourages multilingual agreement across different translation directions.]
* Contribution during internship at Microsoft Research Asia.
† Corresponding author.
Experimental results on the multilingual translation task of WMT demonstrate that our method outperforms the multilingual baseline by a large margin. To better explain the BLEU improvements, we visualize the sentence-level crosslingual representations and the attention weights across different languages, which show that our method effectively encourages agreement between languages.

Multilingual Machine Translation
Our multilingual model is based on a single Transformer model (Vaswani et al., 2017) and shares all embedding matrices through a common vocabulary of all languages. Given $M$ languages $L_{all} = \{L_1, \ldots, L_M\}$, the multilingual model appends special symbols to the source text to indicate the translation direction from the source language $L_{src}$ to the target language $L_{tgt}$.
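As a minimal sketch of the direction symbols described above: the tag format (`<de>`, `<2en>`) and the function name `add_direction_tags` are hypothetical, chosen only for illustration, since the text does not specify how the special symbols are written.

```python
def add_direction_tags(source_tokens, src_lang, tgt_lang):
    """Append direction symbols to a tokenized source sentence.

    The tag spelling is an assumption; the source only states that special
    symbols are appended to indicate the direction L_src -> L_tgt.
    """
    return list(source_tokens) + [f"<{src_lang}>", f"<2{tgt_lang}>"]


tagged = add_direction_tags(["Guten", "Morgen"], "de", "en")
print(tagged)
```

Because all languages share one vocabulary, such symbols would simply be extra vocabulary entries whose embeddings are learned with the rest of the model.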

Agreement-based Training
Multilingual models can translate multiple source-side languages into target-side languages. Given $N$ bilingual corpora $D_B = \{D_{B_1}, \ldots, D_{B_N}\}$, the multilingual model with parameters $\theta$ is jointly trained over $N$ language directions to optimize the combined translation objective $\mathcal{L}_{MT}$ (Equation 1), where $x, y$ denote a sentence pair in the bilingual corpus $D_{B_n}$. The agreement objective over the code-switched corpora $D_C$ is given by Equation 2, where $x^{L_{src}/L_{aux}}$ is the code-switched sentence in which some phrases are substituted by their counterpart phrases in other languages, $y$ is the target sentence, and $L_{aux}$ is the auxiliary language.
We combine the bilingual corpora $D_B$ and the code-switched corpora $D_C$ to train our agreement-based model, which minimizes the gaps among different translation directions using word alignment. The combined objective is denoted $\mathcal{L}_{ALL}$ (Equation 3).
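The three objectives can be written out as a sketch under two assumptions that the surrounding text does not confirm: that each objective is a standard negative log-likelihood, and that the combined objective is an unweighted sum. The symbol $\mathcal{L}_{AT}$ for the agreement objective is a name introduced here for illustration.

```latex
\begin{align}
\mathcal{L}_{MT} &= \sum_{n=1}^{N} \mathbb{E}_{(x,\,y) \in D_{B_n}}
    \left[ -\log P_{\theta}(y \mid x) \right] \\
\mathcal{L}_{AT} &= \mathbb{E}_{(x^{L_{src}/L_{aux}},\,y) \in D_C}
    \left[ -\log P_{\theta}\!\left(y \mid x^{L_{src}/L_{aux}}\right) \right] \\
\mathcal{L}_{ALL} &= \mathcal{L}_{MT} + \mathcal{L}_{AT}
\end{align}
```

Under this reading, the agreement term differs from the translation term only in its input: the same target $y$ must be produced from a source that mixes $L_{src}$ and $L_{aux}$.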

Constructing Training Samples
We use $L_{src}$ as the source language, $L_{tgt}$ as the target language, and $L_{aux}$ as the auxiliary languages to construct training samples. As shown in Figure 2, most words in the code-switched sentence $x^{L_{src}/L_{aux}}$ are derived from $x^{L_{src}}$, while some source phrases $x^{L_{src}}_{u:v}$ are substituted by their counterpart phrases $x^{L_{aux}}_{s:t}$. Given parallel sentences among $M$ different languages, we can construct the code-switched source sentence $x^{L_{src}/L_{aux}}$ with different auxiliary languages. The code-switched corpora $D_C$ can therefore be constructed in a similar way for other languages, encouraging agreement across different translation directions so that they help each other.
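The substitution of $x^{L_{src}}_{u:v}$ by $x^{L_{aux}}_{s:t}$ can be sketched as follows. This is an illustrative reading, not the authors' implementation: the function name `code_switch`, the inclusive-span format of `phrase_pairs`, and the token budget derived from `ratio` are all assumptions, and the aligned phrase pairs are taken as precomputed from word alignments.

```python
import random


def code_switch(src_tokens, aux_tokens, phrase_pairs, ratio=0.1, seed=0):
    """Replace aligned source spans with auxiliary-language spans.

    phrase_pairs: list of ((u, v), (s, t)) inclusive spans, meaning
    src_tokens[u..v] is aligned to aux_tokens[s..t] (assumed precomputed).
    ratio: upper bound on the fraction of source tokens that may be replaced.
    """
    rng = random.Random(seed)
    budget = int(len(src_tokens) * ratio)
    candidates = list(phrase_pairs)
    rng.shuffle(candidates)

    chosen = {}   # start index u -> (v, s, t)
    used = set()  # source positions already scheduled for replacement
    for (u, v), (s, t) in candidates:
        length = v - u + 1
        if length > budget or any(i in used for i in range(u, v + 1)):
            continue
        chosen[u] = (v, s, t)
        used.update(range(u, v + 1))
        budget -= length

    switched, i = [], 0
    while i < len(src_tokens):
        if i in chosen:
            v, s, t = chosen[i]
            switched.extend(aux_tokens[s:t + 1])  # spans may differ in length
            i = v + 1
        else:
            switched.append(src_tokens[i])
            i += 1
    return switched


src = ["fruitful", "coordination", "between", "agencies"]
aux = ["fruchtbare", "Koordinierung", "zwischen", "Behörden"]
pairs = [((1, 2), (1, 2))]
print(code_switch(src, aux, pairs, ratio=0.5))
```

Pairing each code-switched sentence with the original target $y$ would then yield one sample of $D_C$.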

Multilingual Data
We use the same training, validation, and test sets as the previous work to evaluate multilingual models on parallel data from multiple WMT datasets covering various languages, including English (En), French (Fr), Czech (Cs), German (De), Finnish (Fi), Latvian (Lv), Estonian (Et), Romanian (Ro), Hindi (Hi), Turkish (Tr), and Gujarati (Gu). For each language, we concatenate the WMT data of the latest available year and keep at most 10M sentences by random sampling. Detailed statistics of the datasets are listed in Table 3. All sentences in our experiments are tokenized by SentencePiece (Kudo and Richardson, 2018).

Baselines and Evaluation
We compare our method against the following baselines. The Bilingual baseline is trained on each language pair separately. One-to-Many and Many-to-One are trained on the En→X and X→En directions, respectively. We collect all English sentences (33M) of the bilingual corpora described above and translate them into the other languages to obtain pseudo data. We extract alignment pairs (Dyer et al., 2013) across different languages for our method. One-to-Many + Pseudo and Many-to-One + Pseudo are trained on the multilingual data combined with the pseudo data. We average the last 5 checkpoints and use beam search with a beam size of 5 for evaluation. The evaluation metric is case-sensitive detokenized sacreBLEU (Post, 2018).
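Checkpoint averaging, as used for evaluation above, amounts to an element-wise mean over saved parameters. The sketch below assumes a simplified checkpoint format (a dict from parameter name to a flat list of floats); the function name `average_checkpoints` is hypothetical, and real toolkits store tensors rather than lists.

```python
def average_checkpoints(checkpoints):
    """Return the element-wise mean of the parameters in `checkpoints`.

    Each checkpoint is assumed to be a dict mapping a parameter name to a
    flat list of floats, with identical names and shapes across checkpoints.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    count = len(checkpoints)
    averaged = {}
    for name in checkpoints[0]:
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(column) / count for column in columns]
    return averaged


last_two = [{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}]
print(average_checkpoints(last_two))
```

In the setting described above, the input would be the last 5 saved checkpoints of one training run.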

Training Details
We adopt the Transformer big architecture as the backbone model for all our experiments, which has 6 layers, an embedding size of 1024, a dropout of 0.1, a feed-forward network size of 4096, and 16 attention heads. We train multilingual models with Adam (Kingma and Ba, 2015) ($\beta_1 = 0.9$, $\beta_2 = 0.98$). The learning rate is set to 5e-4 with 4,000 warm-up steps. The models are trained with label-smoothed cross-entropy with a smoothing ratio of 0.1. The batch size is 5,120 tokens, and the parameters are updated every 16 iterations to simulate a 128-GPU environment.
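The optimization settings can be made concrete with a short sketch. Two points are assumptions rather than statements from the text: the learning-rate shape (linear warm-up followed by inverse-square-root decay, a common choice for Transformer training) and the physical GPU count of 8, inferred from 128 simulated GPUs divided by 16 accumulation steps.

```python
def inverse_sqrt_lr(step, peak_lr=5e-4, warmup_steps=4000):
    """Learning rate at a 1-based update step.

    Linear warm-up to peak_lr, then inverse-square-root decay. The decay
    shape is an assumption; the source gives only the peak and warm-up.
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    if step <= warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (warmup_steps / step) ** 0.5


def tokens_per_update(batch_tokens=5120, accumulation_steps=16, gpus=8):
    """Tokens contributing to one parameter update (gpus=8 is assumed)."""
    return batch_tokens * accumulation_steps * gpus


print(inverse_sqrt_lr(4000), inverse_sqrt_lr(16000))
print(tokens_per_update())
```

Under these assumptions, one update sees 5,120 × 128 = 655,360 tokens, the same as 128 GPUs each processing a single 5,120-token batch.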

Results
The results of our model are listed in Table 1 and Table 2. Table 1 shows that One-to-Many outperforms bilingual NMT by +1.8 BLEU points on average. Our method consistently improves further over both One-to-Many and One-to-Many + Pseudo. Using pseudo and code-switched data brings larger improvements to the low-resource languages (Et, Ro, Hi, Tr, and Gu) than to the high-resource languages (Fr, Cs, De, Fi, and Lv). These results suggest that our model encourages agreement between different translation directions.
Table 2 reports the results on the X→En test sets. Many-to-One outperforms bilingual NMT by +4.2 BLEU points on average. Combining the parallel data with the pseudo data leads to an improvement of +1.9 BLEU points over Many-to-One. Our method further outperforms Many-to-One + Pseudo by +0.5 BLEU points on average, showing the effectiveness of our agreement-based method and the significance of multilingual agreement.

Analysis
Attention Visualization
The attention weights shown in Figures 3 and 4 are averaged over all 16 heads of the last layer. Figure 3 shows the self-attention weights of a code-switched English sentence, where the source phrase "coordination between law enforcement" is substituted by the German phrase "Koordinierung zwischen Strafverfolgungsbehörden". Similar to the common attention pattern, our model can learn better crosslingual representations in this code-switching case. Figure 4 shows the cross-attention weights between the input code-switched English sentence and the output German sentence. Words with similar meanings are aligned between the code-switched input and the target output.

Crosslingual Representation
We select 500 parallel sentences across different languages and visualize their sentence vectors under the multilingual baseline and our method in Figure 5. The vector of the special language symbol of the source sentence is used as the sentence representation for visualization. Compared to Figure [...]

Substitution Strategy
We employ both word-level and phrase-level substitution strategies for code-switching. The word-level and phrase-level methods replace some words or spans of the source sentence with their counterparts in other languages. As Table 4 shows, phrase-level substitution works better. Furthermore, we investigate the effect of the substitution ratio of the source words. According to Figure 6, the best substitution ratio is 10%. When the ratio is increased to 30%, the performance gets worse, which indicates that substituting too many words may degrade the performance.
As Equation 3 formulates, our method uses both the original corpora and the code-switched corpora simultaneously, which reduces the effect of word alignment errors. Besides, fast_align (Dyer et al., 2013) is a simple, fast, and effective tool with a low alignment error rate. Therefore, our method avoids the disturbance introduced by word alignment errors as much as possible.
Time Cost of Word Alignment
In this work, we use a large pseudo-parallel corpus (33M) to train the multilingual models. In most scenarios, the parallel corpus is smaller than 33M sentences and thus takes less time to align. All alignment pairs are generated offline, only once, before the training phase. Therefore, the time cost of word alignment is much smaller than that of model training.

Related Work
Multilingual Machine Translation
Previous works (Zoph et al., 2016; Firat et al., 2016b; Johnson et al., 2017) have explored different settings of multilingual neural machine translation (MNMT). Recent studies show that MNMT (Blackwood et al., 2018; Platanios et al., 2018; Gu et al., 2018) helps improve the performance of low-resource or zero-shot translation. Some researchers [...]
[Figure 6: Average results of X→En directions under different substitution ratio settings (x-axis: substitution ratio, 0% to 30%; y-axis: BLEU, 28.9 to 29.5). A large substitution ratio may degrade the model performance and is even worse than the multilingual baseline.]
Agreement-based Learning
Many works use agreement-based methods (Liang et al., 2006, 2007; Al-Shedivat and Parikh, 2019) to encourage agreement among different translation orders and directions (Liang et al., 2006; Castilho, 2020; Yang et al., 2020a; Cheng et al., 2016). Besides, agreement-based methods are also used to minimize the difference between the representations of the source and target sentences (Yang et al., 2019). Our method further explores multilingual agreement.

Conclusion
We propose a novel agreement-based framework that encourages multilingual agreement across different translation directions through an agreement term. Experimental results on the multilingual translation task demonstrate that our method effectively reduces the gaps among different translation directions and significantly outperforms the multilingual baselines. Our analysis of the crosslingual representations shows that multilingual agreement is effective in minimizing the differences among languages.