Importance-based Neuron Allocation for Multilingual Neural Machine Translation

Multilingual neural machine translation with a single model has drawn much attention due to its capability to deal with multiple languages. However, the current multilingual translation paradigm makes the model tend to preserve general knowledge while ignoring language-specific knowledge. Some previous works try to solve this problem by adding various kinds of language-specific modules to the model, but they suffer from parameter explosion and require specialized manual design. To solve these problems, we propose to divide the model neurons into general and language-specific parts based on their importance across languages. The general part is responsible for preserving general knowledge and participates in the translation of all languages, while the language-specific part is responsible for preserving language-specific knowledge and participates in the translation of some specific languages. Experimental results on several language pairs, covering the IWSLT and Europarl corpora, demonstrate the effectiveness and universality of the proposed method.


Introduction
Neural machine translation (NMT) (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Bahdanau et al., 2015; Gehring et al., 2017; Vaswani et al., 2017) has shown its superiority and drawn much attention in recent years. Although the NMT model can achieve promising results for high-resource language pairs, it is unaffordable to train separate models for all language pairs since there are thousands of languages in the world (Tan et al., 2019; Aharoni et al., 2019; Arivazhagan et al., 2019). A typical solution to reduce the model size and the training cost is to handle multiple languages in a single multilingual neural machine translation (MNMT) model (Ha et al., 2016; Firat et al., 2016; Johnson et al., 2017; Gu et al., 2018). The standard paradigm of MNMT proposed by Johnson et al. (2017) contains a language-shared encoder and decoder, with a special language indicator in the input sentence to determine the target language.
Because different languages share all of the model parameters in the standard MNMT model, the model tends to converge to a region with low errors for all languages. Therefore, the MNMT model trained on the combined data generally captures the general knowledge but ignores the language-specific knowledge, rendering itself sub-optimal for the translation of a specific language (Sachan and Neubig, 2018; Blackwood et al., 2018; Wang et al., 2020b). To retain the language-specific knowledge, some studies augment the NMT model with language-specific modules, e.g., the language-specific attention module (Blackwood et al., 2018), decoupled multilingual encoders and/or decoders (Vázquez et al., 2019; Escolano et al., 2020), and lightweight language adapters. However, these methods suffer from the parameter increment problem, because the number of parameters increases linearly with the number of languages. Besides, the structure, size, and location of the added module have a large influence on the final performance, which requires specialized manual design. These problems often prevent the application of such methods in some scenarios.
Based on the above, we aim to propose a method that can retain both the general and the language-specific knowledge, and keep a stable model size as the number of language pairs increases, without introducing any specialized module. To achieve this, we propose to divide the model neurons into two parts based on their importance: the general neurons, which retain the general knowledge of all languages, and the language-specific neurons, which retain the language-specific knowledge. Specifically, we first pre-train a standard MNMT model on all language data and then evaluate the importance of each neuron on each language pair. According to their importance, we divide the neurons into general neurons and language-specific neurons. After that, we fine-tune the translation model on all language pairs. In this process, only the general neurons and the corresponding language-specific neurons for the current language pair participate in training. Experimental results on different languages show that the proposed method outperforms several strong baselines.
Our contributions can be summarized as follows:
• We propose a method that can improve the translation performance of the MNMT model without introducing any specialized modules or adding new parameters.
• We show that similar languages share common features that can be captured by the same specific neurons of the MNMT model.
• We show that some modules tend to capture the general knowledge while other modules are more essential for capturing the language-specific knowledge.

Background
In this section, we give a brief introduction to the Transformer model (Vaswani et al., 2017) and to multilingual translation.
Transformer
Transformer is a stacked network with N identical layers containing two or three basic blocks in each layer. A single encoder layer consists of a multi-head self-attention and a position-wise feed-forward network. A single decoder layer contains, besides the above two basic blocks, a multi-head cross-attention that follows the multi-head self-attention. The input sequence x is first converted to a sequence of vectors and fed into the encoder. The output of the N-th encoder layer is then taken as the source hidden states and fed into the decoder. The final output of the N-th decoder layer gives the target hidden states, which are used to generate the target sentence.

Multilingual Translation
In the standard paradigm of MNMT, all parameters are shared across languages and the model is jointly trained on multiple language pairs. We follow Johnson et al. (2017) to reuse standard bilingual NMT models for multilingual translation by altering the source input with a language token $lang$, i.e., changing $x$ to $x = (lang, x_1, \dots, x_J)$.
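As a minimal sketch of this tagging scheme, the source-side transformation can be written as follows. The `<2xx>` token format is our own assumption for illustration; the paper only specifies that a language indicator is prepended to the source tokens.

```python
def add_language_token(src_tokens, target_lang):
    """Prepend a target-language indicator, turning x into (lang, x_1, ..., x_J)."""
    return [f"<2{target_lang}>"] + list(src_tokens)

# An English sentence tagged for translation into German:
tagged = add_language_token(["Hello", "world", "."], "de")
# tagged == ["<2de>", "Hello", "world", "."]
```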

Approach
Our goal is to build a unified model, which can achieve good performance on all language pairs. The main idea of our method is that different neurons have different importance to the translation of different languages. Based on this, we divide them into general and language-specific ones and make general neurons participate in the translation of all the languages while language-specific neurons focus on some specific languages. Specifically, the proposed approach involves the following steps shown in Figure 1. First, we pretrain the model on the combined data of all the language pairs following the normal paradigm in Johnson et al. (2017). Second, we evaluate the importance of different neurons on these language pairs and allocate them into general neurons and language-specific neurons. Last, we fine-tune the translation model on the combined data again. It should be noted that for a specific language pair only the general neurons and the language-specific neurons for this language pair will participate in the forward and backward computation when the model is trained on this language pair. Other neurons will be zeroed out during both training and inference.

Importance Evaluation
The basic idea of importance evaluation is to determine which neurons are essential to all languages and which neurons are responsible for some specific languages. For a neuron $i$, its average importance $I(i)$ across language pairs is defined as follows:

$$I(i) = \frac{1}{M} \sum_{m=1}^{M} \Theta_m(i), \qquad (1)$$

where $\Theta_m(\cdot)$ denotes the importance evaluation function on language pair $m$ and $M$ denotes the number of language pairs. This value correlates positively with how important the neuron is to all languages. For the importance evaluation function $\Theta(\cdot)$, we adopt two schemes: one based on the Taylor Expansion and the other based on the Absolute Value.

Figure 1: The whole training process of the proposed method. The red, yellow, and blue circles represent language-specific neurons that are important for $l_1$, $l_2\,\&\,l_3$, and $l_1\,\&\,l_3$, respectively.
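Equation 1 is a simple average over the per-pair importance scores; a sketch in code, where a hypothetical `theta` dict maps each language pair to its per-neuron importance list:

```python
def average_importance(theta):
    """Compute I(i) = (1/M) * sum_m Theta_m(i) for every neuron i (Eq. 1).

    theta: dict mapping language pair -> list of per-neuron importance scores.
    """
    scores = list(theta.values())
    M = len(scores)
    n_neurons = len(scores[0])
    return [sum(s[i] for s in scores) / M for i in range(n_neurons)]
```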

Taylor Expansion
We adopt a criterion based on the Taylor Expansion (Molchanov et al., 2017), where we directly approximate the change in loss when removing a particular neuron. Let $h_i$ be the output produced by neuron $i$ and $H$ represent the set of other neurons. Assuming the independence of each neuron in the model, the change of loss when removing a certain neuron can be represented as:

$$|\Delta L(h_i)| = |L(H, h_i = 0) - L(H, h_i)|, \qquad (2)$$

where $L(H, h_i = 0)$ is the loss value if neuron $i$ is pruned and $L(H, h_i)$ is the loss if it is not pruned. For the function $L(H, h_i)$, its Taylor Expansion at point $h_i = a$ is:

$$L(H, h_i) = \sum_{p=0}^{P} \frac{L^{(p)}(H, h_i = a)}{p!}(h_i - a)^p + R_P(h_i). \qquad (3)$$

Then, approximating $L(H, h_i = 0)$ with a first-order Taylor polynomial where $h_i$ equals zero:

$$L(H, h_i = 0) = L(H, h_i) - \frac{\partial L}{\partial h_i} h_i + R_1(h_i). \qquad (4)$$

The remainder $R_1(h_i)$ can be represented in the Lagrange form:

$$R_1(h_i) = \frac{\partial^2 L}{\partial h_i^2}\Big|_{h_i = \delta h_i} \frac{h_i^2}{2}, \qquad (5)$$

where $\delta \in (0, 1)$. Considering the use of the ReLU activation function (Glorot et al., 2011) in the model, the first derivative of the loss function tends to be constant, so the second-order term tends to be zero at the end of training. Thus, we can ignore the remainder and get the importance evaluation function as follows:

$$\Theta(i) = |\Delta L(h_i)| = \Big|\frac{\partial L}{\partial h_i} h_i\Big|. \qquad (6)$$

In practice, we need to accumulate the product of the activation and the gradient of the objective function w.r.t. the activation, which is easily computed during back-propagation. Finally, the evaluation function is:

$$\Theta_m(i) = \frac{1}{T_m} \sum_{t=1}^{T_m} \Big|\frac{\partial L}{\partial h_i^l} h_i^l\Big|, \qquad (7)$$

where $h_i^l$ is the activation value of the $i$-th neuron of the $l$-th layer and $T_m$ is the number of training examples of language pair $m$. The criterion is computed on the data of language pair $m$ and averaged over $T_m$.
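Equation 7 can be sketched as follows, assuming the per-example activations and the gradients of the loss w.r.t. those activations have already been collected during back-propagation. The list-of-lists layout is our illustration, not the paper's implementation.

```python
def taylor_importance(activations, gradients):
    """Theta_m(i) = (1/T_m) * sum_t |dL/dh_i * h_i| (Eq. 7).

    activations[t][i], gradients[t][i]: value and loss gradient of neuron i
    on training example t of language pair m.
    """
    T = len(activations)
    n_neurons = len(activations[0])
    return [
        sum(abs(activations[t][i] * gradients[t][i]) for t in range(T)) / T
        for i in range(n_neurons)
    ]
```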

Absolute Value
We adopt the magnitude-based neuron importance evaluation scheme (See et al., 2016), where the absolute value of each neuron's activation is treated as its importance:

$$\Theta_m(i) = \frac{1}{T_m} \sum_{t=1}^{T_m} |h_i^l|, \qquad (8)$$

where the notations are the same as those in Equation 7. After the importance of each neuron is evaluated on the combined data, we determine the role of each neuron in the fine-tuning step following the method in the next section.
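The magnitude-based scheme of Equation 8 differs from the Taylor criterion only in dropping the gradient term; a sketch under the same assumed data layout:

```python
def absolute_importance(activations):
    """Theta_m(i) = (1/T_m) * sum_t |h_i| (Eq. 8).

    activations[t][i]: activation of neuron i on training example t.
    """
    T = len(activations)
    n_neurons = len(activations[0])
    return [sum(abs(activations[t][i]) for t in range(T)) / T for i in range(n_neurons)]
```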

Neuron Allocation
In this step, we should determine which neurons are shared across all the language pairs and which neurons are shared only for some specific language pairs.
General Neurons According to the overall importance I(i) in Equation 1, the value correlates positively with how important the neuron is to all languages. Therefore, we rank the neurons in each layer by importance and take the top ρ percent as general neurons, which are responsible for capturing general knowledge.
Language-specific Neurons Next, we regard the remaining neurons as language-specific neurons and determine which language pairs to assign them to. To achieve this, we compute an importance threshold for each neuron:

$$\text{threshold}(i) = k \cdot \max_m(\Theta_m(i)), \qquad (9)$$

where $\max_m(\Theta_m(i))$ denotes the maximum importance of this neuron over all language pairs and $k$ is a hyper-parameter. The neuron will be assigned to the language pairs whose importance is larger than the threshold. Once the importance of the neurons is determined, the number of language pairs associated with each neuron can be adjusted according to $k$: the smaller the $k$, the more language pairs will be associated with each specific neuron. In this way, we flexibly determine the language pairs assigned to each neuron according to its importance in different languages. Note that the neuron allocation is based on the importance of the language pair. We have also tried other allocation variants, e.g., based on the source language or the target language, and find that the language-pair-based method is the best among these methods. The detailed results are listed in Appendix A.
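The two allocation steps above (top-ρ general neurons by average importance, then thresholded assignment of the rest) can be sketched as follows. This is a minimal single-layer illustration under our assumed data layout, not the paper's implementation; ties in the ranking are broken by list order.

```python
def allocate_neurons(theta, rho=0.9, k=0.7):
    """Split neurons into general and language-specific sets.

    theta: dict mapping language pair -> per-neuron importance list.
    Returns (general, specific): general is the top-rho fraction of neurons
    by average importance (Eq. 1); each remaining neuron is assigned to the
    pairs whose importance exceeds k * max over pairs (Eq. 9).
    """
    pairs = list(theta)
    n_neurons = len(theta[pairs[0]])
    avg = [sum(theta[m][i] for m in pairs) / len(pairs) for i in range(n_neurons)]
    ranked = sorted(range(n_neurons), key=lambda i: avg[i], reverse=True)
    general = set(ranked[: int(rho * n_neurons)])
    specific = {m: set() for m in pairs}
    for i in set(range(n_neurons)) - general:
        threshold = k * max(theta[m][i] for m in pairs)
        for m in pairs:
            if theta[m][i] > threshold:
                specific[m].add(i)
    return general, specific
```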
After this step, the model is continually fine-tuned on the combined multilingual data. If the training data is from a specific language pair, only the general neurons and the language-specific neurons for this language pair will participate in the forward computation, and only the parameters associated with them will be updated during the backward propagation.
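The per-pair masking during fine-tuning and inference can be sketched as follows: neurons outside the active set for the current pair have their outputs forced to zero, so they contribute nothing to the forward pass and receive no gradient. This is a simplified illustration over one layer's activations.

```python
def masked_forward(h, general, specific, lang_pair):
    """Zero out neurons that are neither general nor allocated to lang_pair.

    h: list of activation values for one layer.
    general: set of general neuron ids; specific: dict pair -> set of ids.
    """
    active = general | specific[lang_pair]
    return [v if i in active else 0.0 for i, v in enumerate(h)]
```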

Data Preparation
In this section, we describe the datasets used in our experiments on the many-to-many and one-to-many multilingual translation scenarios.
Many-to-Many For this translation scenario, we test our approach on the IWSLT-17 1 translation datasets, covering English, Italian, Romanian, and Dutch (briefly, En, It, Ro, Nl). We experiment in eight directions, including It↔En, Ro↔En, Nl↔En, and It↔Ro, with 231.6k, 220.5k, 237.2k, and 217.5k sentence pairs per language pair, respectively. We choose test2016 and test2017 as our development and test sets, respectively. Sentences of all languages were tokenized with the Moses scripts 2 and further segmented into subword symbols using Byte-Pair Encoding (BPE) (Sennrich et al., 2016) with 40K merge operations learned jointly for all languages.

One-to-Many
We evaluate the quality of our multilingual translation models using training data from the Europarl Corpus 3, Release V7. Our experiments focus on English to twelve primary languages: Czech, Finnish, Greek, Hungarian, Lithuanian, Latvian, Polish, Portuguese, Slovak, Slovene, Swedish, and Spanish (briefly, Cs, Fi, El, Hu, Lt, Lv, Pl, Pt, Sk, Sl, Sv, Es). For each language pair, we randomly sampled 0.6M parallel sentences as the training corpus (7.2M in all). The Europarl evaluation set dev2006 is used as our validation set, while devtest2006 is our test set. For language pairs without an available development and test set, we randomly split 1K unseen sentence pairs each from the corresponding training set as the development and test data. We tokenize and truecase the sentences with the Moses scripts and apply a jointly-learned set of 90k BPE merge operations obtained from the merged source and target sides of the training data for all twelve language pairs.

Systems
To make the evaluation convincing, we re-implement and compare our method with four baseline systems, which can be divided into two categories with respect to the number of models. The multiple-model approach requires maintaining a dedicated NMT model for each language:

Table 2: BLEU scores on one-to-many translation tasks. 'Para' of the Individual system is 62.23M for each language pair. The denotations represent the same meaning as in Table 1.
Individual An NMT model is trained for each language pair. Therefore, there are N different models for N language pairs.

The unified-model methods handle multiple languages within a single NMT model:

Multilingual (Johnson et al., 2017) Handles multiple languages in a single Transformer model which contains one encoder and one decoder, with a special language indicator lang added to the input sentence.
+TS (Blackwood et al., 2018) This method assigns language-specific attention modules to each language pair. We implement the target-specific attention mechanism because of its excellent performance in the original paper.
+Adapter This method injects tiny adapter layers for specific language pairs into the original MNMT model. We set the dimension of the projection layer to 128 and train the model from scratch.
Our Method-AV Our model is trained just as the Approach section describes. In this system, we adopt the absolute value based method to evaluate the importance of neurons across languages.
Our Method-TE This system is implemented the same as the system Our Method-AV except that we adopt the Taylor Expansion based evaluation method as shown in Equation 7.
+Expansion To make a fair comparison, we set the size of Feed Forward Network to 3000 to expand the model capacity up to the level of other baselines, and then apply our Taylor Expansion based method to this model.

Details
For fair comparisons, we implement the proposed method and the contrast methods on the Transformer model using the open-source toolkit Fairseq-py (Ott et al., 2019). We follow Vaswani et al. (2017) to set the configurations of the NMT model, which consists of 6 stacked encoder/decoder layers with a hidden size of 512. All the models were trained on 4 NVIDIA 2080Ti GPUs, each allocated a batch size of 4,096 tokens for the one-to-many scenario and 2,048 tokens for the many-to-many scenario. We train the baseline model using the Adam optimizer (Kingma and Ba, 2015) with β1 = 0.9, β2 = 0.98, and ε = 10^{-9}. The proposed models are further trained with the corresponding parameters initialized from the pre-trained baseline model. We vary the hyper-parameter ρ, which controls the proportion of general neurons in each module, from 80% to 95% and set it to 90% in our main experiments according to the performance. The detailed results for this hyper-parameter are listed in Appendix B. We set the hyper-parameter k to 0.7 and analyze it further in Section 5.3. For evaluation, we use beam search with a beam size of 4 and length penalty α = 0.6.

Results
The final translation is detokenized and then the quality is evaluated using 4-gram case-sensitive BLEU (Papineni et al., 2002) with the SacreBLEU tool (Post, 2018). 4

Many-to-Many The results are given in Table 1. We can see that the improvements brought by the +TS and +Adapter methods are limited. For the +TS method, the attention module may not be essential for capturing language-specific knowledge, making it difficult to converge to good optima. For the +Adapter method, adding an adapter module to the end of each layer may not be appropriate for some languages, leading to a loose capture of the specific features. On all language pairs, our method based on the Taylor Expansion outperforms all the baselines. Moreover, our model has the same number of parameters as the Multilingual system and fewer than the other baselines.
One-to-Many The results are given in Table 2. Our method exceeds the multilingual baseline in all language pairs and outperforms the other baselines in most language pairs without any capacity increment. When we expand the model capacity to the level of +Adapter, our approach achieves even better translation performance, which demonstrates the effectiveness of our method. Another finding is that the results of the Individual baseline are worse than the other baselines. The reason may be that the training data is not large enough: the Individual baseline cannot be optimized well on 0.6M sentences, while the MNMT model can be well trained on a total of 7.2M sentences.

Neuron Importance for Different languages
In our method, we allocate neurons based on their importance for different languages. The rationale behind this mechanism is that different neurons should have distinct importance values so that these neurons can find their relevant language pairs. Therefore, we show the importance of neurons computed by the Taylor Expansion in different modules for the one-to-many (O2M) and many-to-many (M2M) translation tasks. For clarity and convenience, we only show the importance values of three language pairs in the sixth layer of the encoder and decoder. The results of O2M are shown in Figure 2(a) and Figure 2(b), and the language pairs are En→Es, En→Pt, and En→Fi. The first two target languages are Spanish and Portuguese, both of which belong to Western Romance, a branch of the Romance languages within the Indo-European family, while the last one is Finnish, a member of the Finno-Ugric branch of the Uralic family. As we can see, the importance values of Spanish and Portuguese are similar for most neurons, but there is no obvious correlation between Finnish and the other two languages. This indicates that similar languages are also similar in the distribution of neuron importance, which implies that the common features of similar languages can be captured by the same neurons.
The results of M2M are shown in Figure 2(c) and Figure 2(d), and the language pairs are It→En, Ro→It, and En→Ro, whose BLEU scores are 0.67, 1, and 1.7 higher than the multilingual baseline, respectively. For most neurons, the highest importance value is twice as high as the lowest, and this high variance of importance provides the basis for the later neuron allocation. Moreover, we can see many importance peaks for the two language pairs Ro→It and En→Ro, which means that these neurons are especially important for generating translations for these language pairs. However, the curve of It→En is flat with almost no peaks, which means only a few neurons are specific to this language pair. This may explain why some language pairs obtain larger improvements than others.

Distribution of the Language-specific Neurons
Except for the general neurons shared by all the language pairs, our method allocates the other neurons to different language pairs based on their importance. These language-specific neurons are important for preserving the language-specific knowledge. To better understand the effectiveness of our method, we show how these specific neurons are distributed in the model. To evaluate the proportion of language-specific neurons for different language pairs at each layer, we introduce a new metric, LScore, formulated as:

$$\text{LScore}(m, l) = \frac{\tilde{I}_l^m}{\tilde{I}_l}, \qquad (10)$$

where $\tilde{I}_l^m$ denotes the number of neurons allocated to language pair $m$ in the $l$-th layer, and $\tilde{I}_l$ denotes the total number of language-specific neurons in the $l$-th layer. The larger the LScore, the more neurons are allocated to language pair $m$.

Figure 4: The average ∆BLEU over the Multilingual baseline with different hyper-parameters k on the many-to-many translation task.

We also introduce a metric to evaluate the average proportion of language-specific neurons of each language pair in different modules, formulated as:

$$\text{MScore}(l, f) = \frac{1}{M} \sum_{m=1}^{M} \frac{\tilde{I}_{l,f}^m}{\tilde{I}_{l,f}}, \qquad (11)$$

where $\tilde{I}_{l,f}^m$ denotes the number of specific neurons for language pair $m$ in module $f$ of the $l$-th layer, $\tilde{I}_{l,f}$ denotes the total number of language-specific neurons in that module, and $M$ denotes the total number of language pairs. The larger the MScore, the more language pairs each specific neuron in this module is shared by.
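Both metrics reduce to ratios over neuron counts; a minimal sketch, assuming the per-pair and per-module counts have already been collected from the allocation step (the argument names are ours):

```python
def lscore(n_pair_specific, n_layer_specific):
    """LScore(m, l): fraction of layer-l specific neurons allocated to pair m (Eq. 10)."""
    return n_pair_specific / n_layer_specific

def mscore(per_pair_counts, n_module_specific):
    """MScore(l, f): average over language pairs of the fraction of module-f
    specific neurons allocated to each pair (our reconstruction of Eq. 11)."""
    return sum(c / n_module_specific for c in per_pair_counts) / len(per_pair_counts)
```

An MScore of 1.0 corresponds to every specific neuron in the module being shared by all language pairs, matching the near-1.0 attention-module scores discussed below.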
As shown in Figure 3(a) and Figure 3(b), the language pairs have low LScores at the top and bottom layers and high LScores at the middle layers of both the encoder and decoder. The highest LScore appears at the third or fourth layer, which indicates that the neuron importance of different language pairs is similar there and the neurons of the middle layers are shared by more languages. In contrast, the bottom and top layers are more specialized for different language pairs. Next, from Figure 3(c) and Figure 3(d), we can see that the MScores of the attention modules are almost 1.0, which means the neurons in self-attention and cross-attention are almost fully shared across all language pairs. However, the MScores of the Feed Forward Network (FFN) gradually decrease as layer depth increases, which shows that the higher FFN layers are more essential for capturing the language-specific knowledge.

Effects of the Hyper-parameter k
When the importance of neurons for different languages is determined, the number of language pairs associated with each neuron can be adjusted according to k. When k = 1.0, the threshold is max(Θ_m(i)) as computed by Equation 9, so each neuron will only be allocated to the language pair with the highest importance; when k = 0, the threshold is 0, so the neurons will be shared across all language pairs, just like the Multilingual baseline. To better show the overall impact of the hyper-parameter k, we vary it from 0 to 1 and show the results in Figure 4. As we can see, the translation performance of the two proposed approaches increases with k and reaches the best performance when k equals 0.7. As k continues to increase, the performance deteriorates, which indicates that over-specific neurons are bad at capturing the common features shared by similar languages and lead to performance degradation.

Figure 5: ∆BLEU over the best performance when erasing the general or language-specific neurons randomly on the many-to-many translation task.

The Specific and General knowledge
The main idea of our method is to let the general knowledge and the language-specific knowledge be captured by different neurons. To verify whether this goal has been achieved, we conduct the following experiments. For the general knowledge, we randomly erase 20% of the general neurons of the best checkpoint of our method, i.e., we mask the output values of these neurons to 0, and then generate translations. For the language-specific knowledge, we randomly erase 50% of the specific neurons and then generate translations.
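The neuron-erasure probe can be sketched as a random selection over a neuron set; the selected ids would then have their outputs masked to zero at generation time (the function name and seeding are our illustration, not the paper's implementation):

```python
import random

def erase_random(neuron_ids, fraction, seed=0):
    """Randomly pick a fraction of the given neurons to erase (mask to 0)."""
    rng = random.Random(seed)  # fixed seed for a reproducible probe
    ids = sorted(neuron_ids)
    return set(rng.sample(ids, int(len(ids) * fraction)))
```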
As shown in Figure 5, when the general neurons are erased, the BLEU scores of all the language pairs drop a lot (about 15 to 20 BLEU), which indicates that the general neurons do capture the general knowledge across languages. For the specific neurons, we show three language pairs for brevity. We can see that when the neurons associated with the current language pair are erased, the performance of this language pair decreases greatly. However, the performance of the other language pairs only declines slightly, because the specific knowledge captured by these neurons is not so important for the other languages.

Related Work
Our work is closely related to language-specific modeling for MNMT and to model pruning; we recap both here. Early MNMT studies focus on improving the sharing capability of individual bilingual models to handle multiple languages, which includes sharing encoders (Dong et al., 2015), sharing decoders (Zoph et al., 2016), and sharing sub-layers (Firat et al., 2016). Later, Ha et al. (2016) and Johnson et al. (2017) propose a universal MNMT model with a target language token to indicate the translation direction. Since this paradigm fully explores the general knowledge between languages but makes it hard to obtain the specific knowledge of each language (Tan et al., 2019; Aharoni et al., 2019), subsequent research resorts to language-specific modeling, trying to find a better trade-off between shared and specific knowledge. Such approaches involve inserting a conditional language-specific routing layer (Zhang et al., 2021), specific attention networks (Blackwood et al., 2018; Sachan and Neubig, 2018), adding task adapters, and training models on different language clusters (Tan et al., 2019). However, these methods increase the capacity of the model, which makes the model bloated.
Moreover, our method is also related to model pruning, which usually aims to reduce the model size or improve the inference efficiency. Model pruning has been widely investigated for both computer vision (CV) (Luo et al., 2017) and natural language processing (NLP) tasks. For example, See et al. (2016) examine three magnitude-based pruning schemes, Zhu and Gupta (2018) demonstrate that large-sparse models outperform comparably-sized small-dense models, and Wang et al. (2020a) improve the utilization efficiency of parameters by introducing a rejuvenation approach. Besides, Lan et al. (2020) present two parameter reduction techniques to lower the memory consumption and increase the training speed of BERT.

Conclusion
The current standard models of multilingual neural machine translation fail to capture the characteristics of specific languages, while recent research pursues language-specific knowledge at the cost of increased model capacity and fine manual design. To solve this problem, we propose an importance-based neuron allocation method. We divide neurons into general neurons and language-specific neurons to retain general knowledge and capture language-specific knowledge without any model capacity increment or specialized design. The experiments show that our method achieves superior translation results with better general and language-specific knowledge.

Figure 6: ∆BLEU over the Multilingual baseline on many-to-many translation.

A Performance on Different Varieties
In the proposed method, we allocate neurons based on the importance of the language pair. There are three variants of our method: (a) Source-Specific, where all neurons are shared according to the source language only; (b) Target-Specific, where all neurons are shared according to the target language only; (c) Separate Enc-Dec, where encoder neurons are shared according to the source language and decoder neurons are shared according to the target language. Note that (c) is different from our method since (c) separates the neurons into two parts (encoder and decoder) and then connects the specific neurons of the two parts to form a whole, while our method is directly based on language pairs. As shown in Figure 6, we compare our Taylor Expansion method with the other three variants. Our language-pair-based approach outperforms the other variants on almost all language pairs. The second best are the target-language-based and source-language-based variants. The worst is the separated encoder-decoder variant, which may be due to the mismatch between the neurons of the encoder and decoder when they are reconnected.

B Effects of the Hyper-parameter ρ
We conducted several experiments on ρ to determine the optimal value of this hyper-parameter, i.e., the proportion of general neurons. As shown in Table 3, when ρ = 90% the model achieves the best translation result and reaches the best trade-off between general and language-specific neurons.

Table 3: BLEU scores on many-to-many translation tasks when k = 0.7.