Breaking Down Multilingual Machine Translation

While multilingual training is now an essential ingredient in machine translation (MT) systems, recent work has demonstrated that it has different effects in different multilingual settings, such as many-to-one, one-to-many, and many-to-many learning. These training settings expose the encoder and the decoder in an MT model to different data distributions. In this paper, we examine how different varieties of multilingual training contribute to learning these two components of the MT model. Specifically, we compare bilingual models with encoders and/or decoders initialized by multilingual training. We show that multilingual training is beneficial to encoders in general, while it only benefits decoders for low-resource languages (LRLs). We further identify the important attention heads for each language pair and compare their correlations during inference. Our analysis sheds light on how multilingual translation models work and also enables us to propose methods to improve performance by training with highly related languages. Our many-to-one models for high-resource languages and one-to-many models for LRLs outperform the best results reported by Aharoni et al. (2019).


Introduction
Multilingual training regimens (Dong et al., 2015; Firat et al., 2016; Ha et al., 2016) are now a key element of natural language processing, especially for low-resource languages (LRLs) (Neubig and Hu, 2018; Aharoni et al., 2019). These algorithms are presumed to be helpful because they leverage syntactic or semantic similarities between languages, and transfer processing abilities across language boundaries.
In general, English is used as a central language due to its data availability, and three different multilingual training settings are considered: (1) one-to-many: training a model with language pairs from English to many other languages; (2) many-to-one: training a model with language pairs from many languages to English; (3) many-to-many: training a model with the union of the above two settings' data. Settings (1) and (3) can be used for English-to-other (En-X) translation, while (2) and (3) can be used for other-to-English (X-En) translation.

1 We will release our scripts once accepted.
However, multilingual training has not proven equally helpful in every setting. Arivazhagan et al. (2019) showed that many-to-one training improves performance over bilingual baselines more than one-to-many training does. In this paper we consider this result from the point of view of the components of the MT model. In the many-to-one setting, the inputs of the model come from different language distributions, so the encoder can be considered a multi-domain model, whereas the decoder is trained on a single distribution. In the one-to-many setting, it is the opposite: the encoder shares data, and the decoder is multi-domain. While there are recent studies analyzing multilingual translation models (Kudugunta et al., 2019; Voita et al., 2019a; Aji et al., 2020; Mueller et al., 2020), in general they (1) do not examine the impact of different multilingual training settings such as one-to-many and many-to-one, and (2) do not examine the different components, such as the encoder and the decoder, separately. This motivates us to ask "how do various types of multilingual training interact with learning of the encoder and decoder?" To answer this question, we set up controlled experiments that decouple the contributions to the encoder and the decoder in various training settings. We first train multilingual models using the many-to-one, one-to-many, or many-to-many training paradigm. We then compare training bilingual models with and without initializing the encoder or the decoder with parameters learnt by multilingual training. We find that, for LRLs, multilingual training is beneficial to both the encoder and the decoder. Surprisingly, however, for high-resource languages (HRLs), we find multilingual training beneficial only to the encoder, not to the decoder.
To further analyze this result, we examine "to what degree are the learnt parameters shared across languages?" We use the head importance estimation method proposed by Michel et al. (2019) as a tool to identify the important attention heads in the model, and measure the consistency between the head sets that are important for different language pairs. The results suggest that the encoder does share parameters across different languages in all settings. On the other hand, the decoder can treat the representation from the encoder in a language-agnostic way for X-En translation, while less parameter sharing is observed for En-X translation. Our analyses of parameter sharing also provide a possible explanation of Kudugunta et al. (2019)'s observation that the representation from the encoder is target-language-dependent.
Our investigation of how multilingual training works leads us to a method for improving MT models. With comprehensive experiments in multilingual settings, we discover that for translation of HRLs (Ar-En, De-En, He-En, It-En), fine-tuning a multilingual model on the target bilingual data outperforms the best results in Aharoni et al. (2019) by 2.99 to 4.63 BLEU. With the analysis of parameter sharing in the decoder, we are able to identify related languages. Fine-tuning jointly with the identified related languages boosts low-resource translation (En-Az, En-Be, En-Gl, En-Sk) over the best results in Aharoni et al. (2019) by 1.66 to 4.44 BLEU. Compared to Neubig and Hu (2018), our method does not require linguistic knowledge, and thus may be more useful for less-studied low-resource languages.
In sum, our contributions are three-fold. First, our experiments can be used as a diagnostic tool for multilingual translation to investigate how an encoder and a decoder benefit from multilingual training. Second, our results provide insights into how multilingual translation works. Third, we improve the translation models based on the findings from our analysis, showing a promising path for future research on multilingual machine translation.

Experimental Settings for Multilingual Training
Before stepping into our analysis, we first explain our experimental setup. Following the settings in Aharoni et al. (2019) and Neubig and Hu (2018), we use the publicly available TED Talks Dataset (Qi et al., 2018) to train all our machine translation models. Following Neubig and Hu (2018), we break words into subwords with BPE jointly learnt over all source languages using the sentencepiece toolkit, with a vocabulary size of 32,000. We perform experiments with the Transformer architecture (Vaswani et al., 2017), using the same hyperparameters as Arivazhagan et al. (2019). All models are implemented and trained using Fairseq 0.10.0 (Ott et al., 2019). We train multilingual translation models on 60 different languages from the TED Talks Dataset with the three settings described in Section 1: one-to-many, many-to-one, and many-to-many. For the one-to-many and many-to-many settings, we add a special language token to the input of the encoder to indicate the target language. Following Aharoni et al. (2019), we evaluate our models with BLEU (Papineni et al., 2002; Post, 2018) on 8 selected languages, which are representative of different language families (Qi et al., 2018). The size of the training data is shown in Table 1.
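The target-language token mentioned above can be sketched as a simple preprocessing step. The `<2xx>` tag format below is an illustrative assumption; the paper only states that a special token indicating the target language is added to the encoder input in one-to-many and many-to-many training.

```python
def add_target_token(src_sentence: str, tgt_lang: str) -> str:
    """Prepend a target-language tag to the encoder input.

    NOTE: the '<2xx>' format is assumed for illustration; any reserved
    token that identifies the target language would serve the same role.
    """
    return f"<2{tgt_lang}> {src_sentence}"

# A toy one-to-many batch: the same English source paired with different targets.
batch = [("de", "thank you"), ("it", "thank you")]
tagged = [add_target_token(src, tgt) for tgt, src in batch]
print(tagged)  # ['<2de> thank you', '<2it> thank you']
```

In the many-to-one setting no tag is needed, since the target is always English.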

How Multilingual Training Benefits Each Component
Previous studies have shown that multilingual training generally yields stronger results than bilingual training (Arivazhagan et al., 2019). To understand how multilingual training benefits NMT, we analyze its effect on the different components of an NMT model, specifically the encoder and the decoder.

Experiments Design
To study how multilingual training benefits each component, we train models on bilingual data with components initialized differently, as follows:
• Bilingual only: Models with all parameters randomly initialized and trained only on bilingual data.
• Load encoder/decoder: Models with the trainable parameters of either the encoder or the decoder initialized with parameters learnt from multilingual training.
• Load both: Models with parameters of both encoder and decoder initialized with parameters learnt from multilingual training. This can be seen as fine-tuning the multilingual model on bilingual data.
The motivation for this paradigm is that if multilingual training is beneficial to a component, then initializing the parameters of that component should result in improvements over random initialization and training on only bilingual data. If load encoder outperforms bilingual only, then we can say that multilingual training is beneficial for the encoder, and if load decoder outperforms, we can make the analogous conclusion for the decoder. Thus, comparing these models reveals how each component benefits from multilingual training.
We also consider a load and freeze setting (Thompson et al., 2018), where we initialize a component from a multilingual model and freeze its weights when fine-tuning on bilingual data. For example, in the load decoder setting, we train the loaded decoder with a randomly initialized encoder. We suspect that learning with a randomly initialized component might ruin the other component, which is well-trained with multilingual data, especially at the beginning of training. Thus, we additionally experiment with this load and freeze setting to ensure the multilingually trained component does not deteriorate.
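The initialization schemes above can be sketched with plain dicts of parameter values standing in for a real framework's state dict. The `encoder.`/`decoder.` name prefixes mirror common NMT implementations but are assumptions here, not the paper's actual code.

```python
def build_bilingual_init(bilingual_params, multilingual_params,
                         load=("encoder",), freeze=False):
    """Return (params, frozen_names) for a bilingual model whose listed
    components are initialized from a multilingual checkpoint.

    bilingual_params: randomly initialized parameters (name -> value)
    multilingual_params: parameters learnt by multilingual training
    load: component prefixes to copy over ('encoder' and/or 'decoder')
    freeze: if True, mark the copied parameters as excluded from fine-tuning
    """
    params = dict(bilingual_params)  # default: keep random initialization
    frozen = set()
    for name, value in multilingual_params.items():
        if any(name.startswith(component + ".") for component in load):
            params[name] = value     # copy multilingual weights
            if freeze:
                frozen.add(name)     # load-and-freeze setting
    return params, frozen

rand = {"encoder.w": 0.1, "decoder.w": 0.2}
multi = {"encoder.w": 1.0, "decoder.w": 2.0}
params, frozen = build_bilingual_init(rand, multi, load=("decoder",), freeze=True)
# decoder comes from the multilingual model and is frozen; encoder stays random
```

Load both corresponds to `load=("encoder", "decoder")` with `freeze=False`, i.e. ordinary fine-tuning of the multilingual model on bilingual data.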

Results and Discussion
The overall results for X-En and En-X are shown in Table 2 and Table 3, respectively. The difference between the numbers reported in Aharoni et al. (2019) and ours is due to the different batch size and learning rate schedule we use. Because the results are highly dependent on the training data size (Table 1), we discuss them in two groups: high-resource languages (HRL; ar, de, he, it) and low-resource languages (LRL; az, be, gl, sk).

Low-Resource Language Results
For LRLs, we find that multilingual training is generally beneficial to both the encoders and the decoders in all three multilingual models. Both load encoder and load and freeze decoder achieve performance better than the bilingual baseline. This suggests that the parameters in the encoder and the decoder learnt by multilingual training contain information that is not effectively learnt from the smaller bilingual data. The results also suggest that multilingual training is more beneficial for the encoders than for the decoders: in all cases, either load encoder or load and freeze encoder outperforms both load decoder and load and freeze decoder. However, the benefits of multilingual training to the encoder and the decoder are complementary; loading both the encoder and the decoder usually improves performance over loading only one component.

High-Resource Language Results
On HRLs, we find that multilingual training is generally beneficial to the encoders in all three multilingual models, while it is not beneficial to the decoders in some settings. Load encoder always outperforms the baseline models; however, for the All-En model on X-En translation and the All-All model on En-X translation, neither load decoder nor load and freeze decoder outperforms the baseline model.
We also observe that multilingual training is generally more beneficial to the encoders than to the decoders. In all cases, load encoder achieves performance competitive with load both (within 1 BLEU, better or worse). However, in all cases, both load decoder and load and freeze decoder perform worse than load both. Therefore, multilingual training is not as beneficial to the decoders as to the encoders.

Discussion
For LRLs, because the size of the bilingual training data is small, it is not surprising that multilingual training is beneficial to both the encoder and the decoder. Our results for HRLs, however, are more surprising: it is not obvious why multilingual training should be less beneficial there. In the next section, we focus on explaining the phenomena observed on HRLs by investigating how parameters are shared across languages.

How Multilingual Parameters are Shared in Each Component
Given the previous results, we are interested in exactly how parameters are shared among different language pairs. Since we use the Transformer architecture, in which multi-head attention is a fundamental component, we use attention heads as a proxy to analyze how multilingual models behave differently when translating between different languages. Specifically, we analyze our models by identifying the attention heads that are important when translating a given language pair. Measuring the consistency between the sets of important attention heads for two language pairs gives us hints about the extent of parameter sharing.

Head Importance Estimation
First, we provide some background on head importance estimation, specifically the method proposed by Michel et al. (2019). Consider a set of multi-head attention modules, each of which can be written as

MHAtt(x) = \sum_{h=1}^{N_h} \xi_h \, Att_{W_h}(x),   (1)

where N_h is the number of attention heads, Att_{W_h} is the h-th head with parameters W_h, and \xi_h are gate variables with \xi_h = 1 for all h, introduced only for the importance analysis.

Given a loss function L and input X, the importance of head h can be estimated as

I_h = E_{x \sim X} \left| \frac{\partial L(x)}{\partial \xi_h} \right|.   (2)

The importance scores of the heads within each attention module are then normalized:

\tilde{I}_h = \frac{I_h}{\sqrt{\sum_{h'=1}^{N_h} I_{h'}^2}}.   (3)

Note that when the input X is different, the estimated importance scores can be different. Therefore, when different language pairs are fed in, the identified important heads can differ. We denote the set of attention head scores estimated on translation from language l_a to language l_b as H(l_a, l_b).
We denote the scores of attention heads in a component with a superscript; for example, H^enc represents the scores of the heads in an encoder.
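A minimal numerical sketch of the scoring step: we take the per-head estimates E_x |dL/dxi_h| as given (accumulating them requires a framework-specific backward pass, which is omitted here) and apply the layer-wise l2 normalization of Michel et al. (2019).

```python
import numpy as np

def normalized_head_scores(raw_importance):
    """Normalize head-importance estimates within each attention module.

    raw_importance: array of shape (n_layers, n_heads) holding
    E_x |dL/dxi_h| for each head; the values are assumed to have been
    accumulated by backpropagation beforehand. Each row is divided by
    its l2 norm, following Michel et al. (2019).
    """
    raw = np.asarray(raw_importance, dtype=float)
    norms = np.linalg.norm(raw, axis=1, keepdims=True)
    return raw / norms

# H(l_a, l_b) is then the flattened matrix of normalized scores for the
# sentences of that language pair (toy values here).
H_de_en = normalized_head_scores([[3.0, 4.0], [1.0, 0.0]]).ravel()
```

Each score set H(l_a, l_b) is thus a vector with one entry per attention head, comparable across language pairs because of the per-module normalization.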

Measuring Parameter Sharing by Correlation of Head Scores
With the attention head importance scores estimated by Equation 3, we can investigate how parameters are shared across languages. For each of the En-All, All-En, and All-All multilingual models, we estimate a set of head-importance scores H(l_a, l_b) for each language pair (l_a, l_b) seen in training. We calculate the head scores with the training loss function (MLE with label smoothing) on 100K randomly sampled sentences from the training set.
To investigate how many parameters are shared by two language pairs (l_a, l_b) and (l_c, l_d), we measure the agreement between H(l_a, l_b) and H(l_c, l_d). If a head is important for both (l_a, l_b) and (l_c, l_d), then important parameters for translation are shared; thus, high agreement suggests high parameter sharing. To quantify the agreement between two score sets, we use Spearman's rank correlation (Spearman, 1987). A rank-based correlation metric is used because the importance estimation was originally proposed to order the attention heads in a model. Higher correlation implies higher agreement, and thus higher parameter sharing. For each of the En-All, All-En, and All-All models, we calculate the correlation between H(l_a, l_b) and H(l_c, l_d) for all language pairs (l_a, l_b) and (l_c, l_d) used to train the model. The detailed correlation computation process can be found in Appendix A. We plot the correlation matrices of the head scores (included in the appendix) and summarize them in Table 10. We also compare the top-10 most important heads for every language pair with F1 scores, and observe similar results; we include these statistics in the appendix.
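Both agreement measures described above can be sketched in a few lines: Spearman's rank correlation over the full score vectors, plus the F1 overlap of the top-10 heads.

```python
import numpy as np
from scipy.stats import spearmanr

def head_agreement(h_ab, h_cd, k=10):
    """Agreement between two head-score sets H(l_a,l_b) and H(l_c,l_d).

    Returns Spearman's rank correlation over all heads, and the F1 score
    between the two sets of top-k most important heads (k=10 as in the
    paper's appendix comparison).
    """
    rho, _ = spearmanr(h_ab, h_cd)
    top_ab = set(np.argsort(h_ab)[-k:])   # indices of the k largest scores
    top_cd = set(np.argsort(h_cd)[-k:])
    f1 = 2 * len(top_ab & top_cd) / (len(top_ab) + len(top_cd))
    return rho, f1

rng = np.random.default_rng(0)
h1 = rng.random(48)               # e.g. 8 layers x 6 heads, flattened
h2 = h1 + 0.01 * rng.random(48)   # a nearly identical score set
rho, f1 = head_agreement(h1, h2)  # both close to 1.0
```

Identical rankings give a correlation of 1.0 and a top-k F1 of 1.0; unrelated score sets drive both toward 0.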

How Multilingual Translation Models Share Parameters
Results in Table 10, combined with Section 3, provide insights into how multilingual translation models work with respect to cross-lingual sharing.
Encoder for En-X: En-X is a set of language pairs in which the source language is always English. Therefore, if the prepended target-language token is ignored, the inputs of the encoders for all pairs in En-X come from one identical distribution; this is in contrast to X-En pairs, where the inputs are in different languages. It is natural that the encoder for En-X benefits from multilingual training, because it can generate representations tailored for different target languages with shared parameters, and the encoder correlations in Table 10 are consistent with such sharing.
Encoder for X-En: For X-En language pairs, the input of the encoder is multilingual, i.e., the inputs from different X-En language pairs have distinct distributions. Nevertheless, the correlation between different source languages is still high, showing that a high degree of parameter sharing in the encoder is possible.
Decoder for En-X: The decoders for En-X have the lowest correlation. From the correlation matrix, we do see some parameter sharing between some language pairs. However, larger model capacity might be required for a model to be proficient in all the languages.
Decoder for X-En: The decoders have average correlations as high as 0.973 and 0.967 for the All-En and All-All models, respectively. This suggests that, to decode the intermediate representations produced by the encoder, the decoder uses almost the same set of parameters regardless of the source language. However, Kudugunta et al. (2019) show that the representation encoded by the encoder is not language-agnostic. A possible explanation is that the important parameters of the decoder are largely determined by the target output, which is always English; even though the encoder representation is not language-agnostic, it is difficult to learn decoder parameters that reflect the difference. This also suggests why multilingual training does not benefit the decoder in the X-En setting: the set of English sentences is almost the same for all the HRL pairs in the TED Talks dataset, so multilingual training can hardly provide more unique English sentences than bilingual training does. If the decoder is dedicated to generation, multilingual training cannot expose it to more diverse data, and therefore the multilingually trained decoder does not perform better than the bilingual one.

Improving Translation Based on the Degree of Parameter Sharing
Insights from the previous section provide us with a new way to choose languages for multilingual training. In previous work (Lin et al., 2019; Oncevay et al., 2020), choosing languages with similar linguistic properties is a popular practice. However, Mueller et al. (2020) found that the effect is highly language-dependent; sometimes training with similar languages can be worse than training on a set of unrelated languages. Here we instead propose an entirely model-driven way to find related languages to improve multilingual translation models: we explore choosing languages whose parameters can be better shared.

Improving X-En by Related En-X Pairs
In the All-All model, we notice low parameter sharing between En-X and X-En pairs. The average correlation between H^enc(En, X) and H^enc(X, En) is 0.44 (std: 0.17), and the average correlation between H^dec(En, X) and H^dec(X, En) is 0.49 (std: 0.13). This provides a possible explanation of why training with both the En-X and the X-En pairs brings only little improvement over training with En-X alone or with X-En alone. The low correlation, combined with the results in Section 3, motivates us to experiment with improving X-En using related En-X pairs. Section 3 shows that the multilingual decoder benefits less than the encoder, which may suggest inefficient parameter sharing in the decoder. Therefore, we experiment with choosing a set of related languages based on the degree of parameter sharing in the decoder: we choose the language set L such that for all l ∈ L, the average correlation

\frac{1}{60} \sum_{i=1}^{60} \mathrm{Corr}(H^{dec}(\mathrm{En}, l), H^{dec}(l_i, \mathrm{En}))

is higher than 0.60.
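The selection rule above can be sketched directly. The correlation values below are invented for illustration; in the paper they would come from comparing decoder head-score sets across the 60 training languages.

```python
import numpy as np

def select_related_targets(corr, threshold=0.60):
    """Pick target languages l whose decoder head scores H_dec(En, l)
    correlate, on average over all source languages l_i, with
    H_dec(l_i, En) above the threshold.

    corr: dict mapping language code -> list of correlation values
    Corr(H_dec(En, l), H_dec(l_i, En)); toy values are used below.
    """
    return sorted(l for l, cs in corr.items() if np.mean(cs) > threshold)

toy = {"de": [0.70, 0.65, 0.68], "ja": [0.30, 0.40, 0.35]}
print(select_related_targets(toy))  # ['de']
```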
Results are shown in Table 5. Even though fine-tuning on related languages improves the overall performance, it is not better than fine-tuning on the All-En pairs only. Moreover, the average correlation between H^dec(En, l_a) and H^dec(l_b, En) is not improved. Our experiment demonstrates the difficulty of sharing parameters between All-En pairs and En-All pairs; we leave this problem for future work.

Improving En-X by Language Clusters
The low correlation between the attention head scores of language pairs motivates us to improve the performance of En-X using related language pairs. As shown in Table 10, the decoders have the lowest correlation scores, which we conjecture is due to the difficulty of sharing parameters between distant languages. Thus, we seek to find related language sets within each of which parameters can be shared. Again, we resort to the attention head importance scores to find related languages. Our intuition is that related languages share many parameters, so training a model on related languages should be helpful. As a sanity check of this idea, we first use t-SNE (Maaten and Hinton, 2008) to reduce the dimension of the head-importance scores H(l_a, l_b). We focus only on heads in the decoders, because the correlation between H(En, l_c) and H(En, l_d) is lower on average for the decoders. The result, visualized in Figure 1, illustrates that the distance between H(En, l_c) and H(En, l_d) tends to be shorter if languages l_c and l_d are linguistically related. Hence, determining related languages with the head scores H(En, l) is reasonable.
We then fine-tune multilingual models on related language clusters, determined by k-means++ (Arthur and Vassilvitskii, 2007) with k = 5. We consider the clusters that cover all four low-resource languages. For the All-All model, one of the clusters we consider contains Be, Gl, De, He, and It, and the other contains Az. For the En-All model, we also experiment with two clusters: one includes Ar, De, He, and It, and the other includes Az, Be, Gl, and Sk. As a baseline, we also experiment with random groups, generated by randomly splitting the 59 target languages.
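The clustering step can be sketched as follows. The head-score vectors here are synthetic stand-ins for the decoder scores H(En, l), and k = 2 is used only to keep the toy example small (the paper uses k-means++ with k = 5 over 59 target languages).

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy decoder head-score vectors per target language (invented values).
lang_scores = {
    "az": [0.9, 0.1, 0.0], "tr": [0.8, 0.2, 0.1],  # one pretend cluster
    "it": [0.1, 0.9, 0.8], "es": [0.0, 0.8, 0.9],  # another
}
langs = list(lang_scores)
X = np.array([lang_scores[l] for l in langs])

# k-means with the k-means++ seeding of Arthur and Vassilvitskii (2007).
km = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0).fit(X)

clusters = {}
for lang, label in zip(langs, km.labels_):
    clusters.setdefault(label, []).append(lang)
# Expect {'az', 'tr'} and {'it', 'es'} grouped together, up to label permutation.
```

Each resulting cluster is then used as a fine-tuning set for the multilingual model.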
The results are shown in Table 6.

Table 7: Correlation between the decoder attention head scores when estimated using the language pairs in the cluster. HL and LL represent the cluster that includes the HRL and the one that includes the LRL, respectively.

For the En-All and the All-All models, fine-tuning on clusters improves performance consistently on all the considered language pairs except En-Gl. For LRLs, fine-tuning on related language clusters is also generally better than fine-tuning on random groups. To verify whether this improvement is brought by increased parameter sharing in the decoders, we check the correlation between the H^dec scores after fine-tuning. The results shown in Table 7 indicate improvements after fine-tuning on the clusters. For the low-resource language pairs En-Az, En-Be, and En-Sk on the En-All model, we notice that only a few languages are highly correlated with them (correlation > 0.80). Therefore, we also experiment with fine-tuning the En-All model with only the language pairs with high correlation scores (> 0.80) for each of the three pairs, which boosts the performance of En-Be to 15.2 and En-Sk to 28.6.

Related Work
Early attempts at multilingual training for machine translation used a single model to translate between multiple languages (Dong et al., 2015; Firat et al., 2016; Ha et al., 2016). These works find multilingual NMT models appealing because they not only give a simple paradigm for handling mappings between multiple languages, but also improve performance on low- and zero-resource language pairs (Gu et al., 2018). However, how multilingual training contributes to the components of the translation model remains unknown.
There have been some attempts at analyzing and explaining translation models. Thompson et al. (2018) analyze the contribution of different components of an NMT model to domain adaptation by freezing the weights of components during continued training. Arivazhagan et al. (2019) provide a comprehensive study of state-of-the-art multilingual NMT models in different training and testing scenarios. Other work experiments with different parameter-sharing strategies in Transformer models, showing that sharing the parameters of the embedding, key, and query performs well in one-to-many settings. Artetxe et al. (2020) show the strong transferability of monolingual representations to different languages: the intermediate representations of BERT can be language-agnostic if the embeddings are frozen during training. The deficiency of the one-to-many setting is explored by Johnson et al. (2017), who find that only the many-to-one setting consistently improves performance across languages. Wang et al. (2018) also explore problems of the one-to-many setting, and show that language-specific components are effective in improving performance. Voita et al. (2019a) analyze how the generated sentences of NMT models are influenced by context in the encoder and decoder; their attempt to investigate the encoder and decoder separately is similar to our work. Rothe et al. (2020) explore how pretrained checkpoints can benefit the encoder and the decoder in a translation model. Zhang et al. (2021) investigate the trade-off between language-specific and shared capacity of layers in a multilingual NMT model.
Multi-head attention has been shown to be effective in different NLP tasks. Beyond improving performance, multi-head attention can help with subject-verb agreement (Tang et al., 2018), and some heads are predictive of dependency structures (Raganato and Tiedemann, 2018). Htut et al. (2019) and Clark et al. (2019) report that heads in BERT attend significantly more to words in certain syntactic positions, and show that some heads seem to specialize in certain types of syntactic relations. Michel et al. (2019), Voita et al. (2019b), and Behnke and Heafield (2020) study the importance of different attention heads in NMT models, and suggest that the less important attention heads can be pruned. Brix et al. (2020) also show that pruning NMT models can improve the sparsity level to optimize memory usage and inference speed.
However, none of these previous works directly investigate how the encoder and decoder of NMT models benefit from multilingual training, which is the key question of why multilingual training works. To the best of our knowledge, we are the first to tackle this question, and our analysis can be used to further improve multilingual NMT models.

Conclusion
In this work, we have the following findings: 1) In Section 3, we examine how multilingual training contributes to each of the components in a machine translation model. We discover that, while multilingual training is beneficial to the encoders, it is less beneficial to the decoders. 2) In Section 4, our analysis of important attention heads provides insight into the behavior of multilingual components. The results suggest that the encoder in the En-All model may generate target-language-specific representations, while the behavior of the decoder of the All-En model may be source-language-agnostic. In addition, in the All-All model, we observe indications of lower parameter sharing between X-En and En-X pairs. 3) In Section 5, we explore approaches to improving the model based on our findings. On En-X translation, we outperform the best results in Aharoni et al. (2019). With our proposed analysis as a diagnostic tool, future work may further improve multilingual systems.
Our findings suggest some possible future directions. First, parameter sharing between En-X and X-En pairs in the All-All model seems low; improving this sharing may improve performance. Second, the decoder in the All-En model seems to behave in a source-language-agnostic way. This may not be optimal, since the representation from the encoder is not source-language-agnostic (Kudugunta et al., 2019). To mitigate this issue, either the encoder must encode inputs into language-agnostic representations, or the decoder should behave differently according to the input representation. Third, our experiments can be repeated in other settings, including the non-English-centric setting (Fan et al., 2021) and larger datasets such as that of Zhang et al. (2020). We leave this exploration to future work.

Let S_(a,b) and S_(c,d) be the top-10 most important heads for language pairs (l_a, l_b) and (l_c, l_d), respectively. We calculate the F1 score between S_(a,b) and S_(c,d) to measure their similarity. The number in the table is the average F1 score.
These random clusters are generated by (1) shuffling the 59 languages and (2) randomly selecting 4 positions. The resulting 5 segments separated by the 4 positions are the 5 clusters.

E Closest Languages
The closest languages used in Section ?? are:
• Az: en-az, en-eu, en-fi, en-tr
• Be: en-be, en-it, en-uk
• Gl: en-gl, en-pt, en-es, en-lt, en-it, en-pt_br

F Experimental Details
• Infrastructure: All the experiments can be conducted on a single RTX 2080Ti GPU.
• Evaluation: We report the BLEU score calculated by FairSeq.
• Version of FairSeq: We use v0.10.0 (https://github.com/pytorch/fairseq/tree/v0.10.0)
• Dataset: It can be downloaded from https://github.com/neulab/word-embeddings-for-nmt.

Figure 4: Correlation matrix between language pairs after fine-tuning on the language clusters. The first figure is the matrix of the fine-tuned All-All model. The second and third are the matrices of the En-All model fine-tuned on the language clusters containing the high-resource and the low-resource languages, respectively. The top-left corner is the correlation between the encoder head scores H^enc, while the bottom-right corner is the correlation between the decoder head scores H^dec.