Learning Language Specific Sub-network for Multilingual Machine Translation

Multilingual neural machine translation aims at learning a single translation model for multiple languages. These jointly trained models often suffer from performance degradation on rich-resource language pairs. We attribute this degradation to parameter interference. In this paper, we propose LaSS, an approach that jointly trains a single unified multilingual MT model and learns a Language Specific Sub-network for each language pair to counter parameter interference. Comprehensive experiments on IWSLT and WMT datasets with various Transformer architectures show that LaSS obtains gains on 36 language pairs by up to 1.2 BLEU. Besides, LaSS shows strong generalization performance, easily adapting to new language pairs and to zero-shot translation: it boosts zero-shot translation by an average of 8.3 BLEU on 30 language pairs. Code and trained models are available at https://github.com/NLP-Playground/LaSS.


Introduction
Neural machine translation (NMT) has been very successful for bilingual machine translation (Bahdanau et al., 2015; Vaswani et al., 2017; Wu et al., 2016; Hassan et al., 2018; Su et al., 2018; Wang, 2019). Recent research has demonstrated the efficacy of multilingual NMT, which supports translation from multiple source languages into multiple target languages with a single model (Johnson et al., 2017; Aharoni et al., 2019; Zhang et al., 2020; Fan et al., 2020; Siddhant et al., 2020). Multilingual NMT enjoys the advantage of simple deployment. Further, the parameter sharing of multilingual NMT encourages transfer learning across languages. An extreme case is zero-shot translation, where direct translation between a language pair never seen in training is possible (Johnson et al., 2017).
While very promising, several challenges remain in multilingual NMT. The most challenging one is insufficient model capacity. Since multiple languages are accommodated in a single model, the modeling capacity of the NMT model has to be split across different translation directions (Aharoni et al., 2019). Therefore, multilingual NMT models often suffer from performance degradation compared with their corresponding bilingual baselines, especially for rich-resource translation directions. The simplest way to alleviate insufficient model capacity is to enlarge the model (Aharoni et al., 2019; Zhang et al., 2020). However, this is neither parameter nor computation efficient, and it needs larger multilingual training datasets to avoid over-fitting. An alternative solution is to design language-aware components, such as dividing the hidden cells into shared and language-dependent ones (Wang et al., 2018), adaptation layers (Bapna and Firat, 2019; Philip et al., 2020), language-aware layer normalization and linear transformation (Zhang et al., 2020), and latent layers (Li et al., 2020).
In this work, we propose LaSS, a method to dynamically find and learn a Language Specific Sub-network for multilingual NMT. LaSS accommodates one sub-network for each language pair. Each sub-network shares parameters with some other languages and, at the same time, preserves its language specific parameters. In this way, multilingual NMT can model language specific and language universal features for each language pair in one single model without interference. Figure 1 illustrates the vanilla multilingual model and LaSS. Each language pair in LaSS has both language universal and language specific parameters, and the network itself decides the sharing strategy.
The advantages of our proposed method are:
• LaSS is parameter efficient, requiring no extra trainable parameters to model language specific features.
• LaSS alleviates parameter interference, potentially improving the model capacity and boosting performance.
• LaSS shows strong generalization performance, easily adapting to new language pairs and to zero-shot translation. LaSS can be extended to new language pairs without dramatic degradation on existing language pairs. Besides, LaSS can boost zero-shot translation by up to 26.5 BLEU.

Related Work Multilingual Neural Machine Translation
The standard multilingual NMT model uses a shared encoder and a shared decoder for different languages (Johnson et al., 2017). There is a transfer-interference trade-off in this architecture (Arivazhagan et al., 2019): one can boost the performance of low-resource languages or maintain the performance of high-resource languages, but not both.
To address this trade-off, previous works assign some parts of the model to be language specific: language specific decoders (Dong et al., 2015), language specific encoders and decoders (Firat et al., 2016; Lyu et al., 2020), and language specific hidden states and embeddings (Wang et al., 2018). Sachan and Neubig (2018) compare different sharing strategies and find that the choice of sharing strategy has a great impact on performance.
Recently, Zhang et al. (2021) analyze when and where language specific capacity matters. Li et al. (2020) use a binary conditional latent variable to decide which language each layer belongs to.
Model Pruning Our approach follows the standard pattern of model pruning: training, finding the sparse network, and fine-tuning (Frankle and Carbin, 2019). Frankle and Carbin (2019) highlight the importance of the sparse network architecture. Zhu and Gupta (2018) propose a method to automatically adjust the sparsity threshold. Sun et al. (2020) learn different sparse architectures for different tasks. Evci et al. (2020) iteratively redistribute the sparse network architecture according to the gradient.

Methodology
We describe the LaSS method in this section. The goal is to learn a single unified model for many translation directions. Our overall idea is to find a sub-network corresponding to each language pair, and then only update the parameters of that sub-network during joint training.

Multilingual NMT
A multilingual NMT model learns a mapping function f from a sentence in one of many languages to another language. We adopt the multilingual Transformer (mTransformer) as the backbone network (Johnson et al., 2017). mTransformer has the standard encoder-decoder architecture with layers of multi-head attention, residual connections, and layer normalization. In addition, it has two language-identifying tokens for the source and the target. Define a multilingual dataset $\{D_{s_i \to t_i}\}_{i=1}^{N}$, where $s_i$ and $t_i$ represent the source and target languages.
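As a concrete illustration of the language-identifying tokens, consider the following minimal sketch (not the authors' code; the `<xx>` token format is an illustrative assumption):

```python
# Hypothetical sketch: a sentence pair is tagged with language-identifying
# tokens so that one shared model can route between translation directions
# (Johnson et al., 2017). The "<xx>" token format is our assumption.
def tag_pair(src_tokens, tgt_tokens, src_lang, tgt_lang):
    src = ["<%s>" % src_lang] + src_tokens
    tgt = ["<%s>" % tgt_lang] + tgt_tokens
    return src, tgt

src, tgt = tag_pair(["Guten", "Tag"], ["Good", "day"], "de", "en")
# src → ["<de>", "Guten", "Tag"]
```

In practice the tags are entries in the shared vocabulary; here they are plain strings for clarity.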
We train an initial multilingual MT model with the following loss:

$$\mathcal{L}(\theta) = -\sum_{i=1}^{N} \sum_{(x,y) \in D_{s_i \to t_i}} \log P(y \mid x; \theta) \qquad (1)$$

where $(x, y)$ is a sentence pair from language $s_i$ to $t_i$, and $\theta$ is the model parameter.

Finding Language Specific Model Masks
Training a single model jointly on multiple language directions leads to performance degradation on rich-resource pairs (Johnson et al., 2017). The single model improves on low-resource language pairs but loses performance on pairs like English-German. Intuitively, jointly training on all translation pairs yields an "average" model. For rich-resource pairs, such averaging may hurt performance, since a multilingual MT model must distribute its modeling capacity across all translation directions. Based on this intuition, our idea is to find a sub-network of the original multilingual model that is specific to each language pair.
We start from a multilingual base model $\theta_0$ trained with Eq. (1). A sub-network is indicated by a binary mask vector $M_{s_i \to t_i} \in \{0, 1\}^{|\theta|}$ for language pair $s_i \to t_i$. An element being 1 indicates retaining the corresponding weight, and 0 indicates abandoning it. The parameters associated with the language pair are then

$$\theta_{s_i \to t_i} = \theta_0 \odot M_{s_i \to t_i}, \quad \text{i.e.,} \quad \theta_{s_i \to t_i}^{j} = \theta_0^{j} \cdot M_{s_i \to t_i}^{j}$$

where $j$ denotes the $j$-th element of $\theta_0$. The parameters $\theta_{s_i \to t_i}$ are only responsible for the particular language pair $s_i \to t_i$. We intend to find such language specific sub-networks. Figure 1 illustrates the original model and its language specific sub-networks.
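The element-wise masking $\theta_0 \odot M$ can be sketched as follows (an illustrative toy version with flat Python lists standing in for parameter tensors; the real model applies masks per weight matrix):

```python
def apply_mask(theta, mask):
    # theta_j * M_j: keep a weight where the mask is 1, zero it where 0.
    return [w * m for w, m in zip(theta, mask)]

params = apply_mask([0.5, -1.0, 2.0], [1, 0, 1])
# the masked-out second weight becomes 0
```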
Given an initial model $\theta_0$, we adopt a simple method to find the language specific mask for each language pair:

1. Start with the multilingual MT model $\theta_0$.
2. Fine-tune $\theta_0$ on the specific language pair $s_i \to t_i$. Intuitively, fine-tuning amplifies the magnitude of the weights important for $s_i \to t_i$ and diminishes the magnitude of the unimportant ones.
3. Rank the weights in the fine-tuned model by magnitude and prune the lowest $\alpha$ percent. The mask $M_{s_i \to t_i}$ is obtained by setting the indices of the remaining parameters to 1.
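The pruning step amounts to plain magnitude pruning. A minimal sketch (our toy version over a flat weight list; a real implementation would operate on framework tensors):

```python
def magnitude_mask(weights, alpha):
    """Return a binary mask keeping the (1 - alpha) fraction of weights
    with the largest magnitude; 1 = retain, 0 = prune."""
    k = int(len(weights) * alpha)                       # how many to prune
    order = sorted(range(len(weights)), key=lambda j: abs(weights[j]))
    pruned = set(order[:k])                             # smallest-|w| indices
    return [0 if j in pruned else 1 for j in range(len(weights))]

mask = magnitude_mask([0.5, -0.1, 0.9, 0.05], alpha=0.5)
# mask → [1, 0, 1, 0]: the two smallest-magnitude weights are pruned
```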

Structure-aware Joint Training
Once we get the masks $M_{s_i \to t_i}$ for all language pairs, we continue to train $\theta_0$ with language-grouped batching and structure-aware updating. First, we create random batches of bilingual sentence pairs where each batch contains only samples from one language pair. This differs from plain joint multilingual training, where each batch can contain random sentence pairs from all languages. Specifically, a batch $B_{s_i \to t_i}$ is randomly drawn from the language-specific data $D_{s_i \to t_i}$. Second, we evaluate the loss in Eq. (1) on the batch $B_{s_i \to t_i}$. During the back-propagation step, we only update the parameters of $\theta_0$ belonging to the sub-network indicated by $M_{s_i \to t_i}$. We iteratively update the parameters until convergence.
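The structure-aware update amounts to zeroing gradients outside the sub-network before the optimizer step. A toy gradient-descent version (our simplification; the paper trains a standard Transformer, where this masking would be applied to each parameter tensor's gradient):

```python
def masked_step(theta, grads, mask, lr=0.1):
    # Only parameters with mask == 1 receive the gradient update for this
    # batch; the rest of the shared model is left untouched.
    return [w - lr * g * m for w, g, m in zip(theta, grads, mask)]

theta = masked_step([1.0, 1.0], [1.0, 1.0], [1, 0])
# only the first (in-mask) parameter moves
```

This is what prevents a batch from one language pair from interfering with weights that belong exclusively to other pairs.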
In this way, we still get a single final model $\theta^*$ that is able to translate all language directions.
During inference, this model $\theta^*$ and its masks $M_{s_i \to t_i}$, $i = 1, \ldots, N$ are used together to make predictions. For a given input sentence in language $s$ and a target language $t$, the forward inference step only uses the parameters $\theta^* \odot M_{s \to t}$ to compute the model output.

Experiment Settings
Datasets and Evaluation The experiments are conducted on IWSLT and WMT benchmarks. For IWSLT, we collect 8 English-centric language pairs from IWSLT2014, whose sizes range from 89k to 169k. To simulate imbalanced-dataset scenarios, we collect 18 language pairs ranging from low-resource (Gu, 11k) to rich-resource (Fr, 37m) from previous years' WMT. The details of the datasets are listed in the Appendix. We apply byte pair encoding (BPE) (Sennrich et al., 2016) to preprocess the multilingual sentences, resulting in a vocabulary size of 30k for IWSLT and 64k for WMT. Besides, we apply over-sampling for IWSLT and WMT to balance the training data distribution, with temperatures of T = 2 and T = 5 respectively. Similar to Lin et al. (2020), we divide the language pairs into 3 categories: low-resource (<1M), medium-resource (>1M and <10M) and rich-resource (>10M). We perform many-to-many multilingual translation throughout this paper, and add special language tokens at both the source and the target side. In all our experiments, we evaluate our model on commonly used standard test sets. For zero-shot translation, where standard test sets for some language pairs (for example, Fr→Zh) are not available, we use the OPUS-100 (Zhang et al., 2020) test sets instead.
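The temperature-based over-sampling mentioned above is, in the usual formulation, $p_i \propto (n_i / \sum_j n_j)^{1/T}$; the sketch below is our reading of that standard scheme, not code from the paper:

```python
def sampling_probs(sizes, T):
    """Temperature-based sampling probabilities: p_i ∝ (n_i / N)^(1/T).
    T = 1 is proportional sampling; larger T flattens the distribution
    toward low-resource language pairs."""
    total = float(sum(sizes))
    q = [(n / total) ** (1.0 / T) for n in sizes]
    Z = sum(q)
    return [x / Z for x in q]

p1 = sampling_probs([1_000, 9_000], T=1)   # proportional: [0.1, 0.9]
p5 = sampling_probs([1_000, 9_000], T=5)   # flatter: low-resource share rises
```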
We report tokenized BLEU, as well as the win ratio (WR), i.e., the proportion of language pairs on which we outperform the baseline. In zero-shot translation, we also report translation-language accuracy, which is commonly used to measure the accuracy of translating into the right target language.
Model Settings Considering the diversity of dataset volumes, we perform our experiments with variants of the Transformer architecture. For IWSLT, we adopt a smaller Transformer (Transformer-small). For WMT, we adopt Transformer-base and Transformer-big. The pruning rate α is 0.7 for IWSLT and 0.3 for WMT. For simplicity, we report the BLEU obtained at the best pruning rate; we discuss the impact of different pruning rates on performance in Sec. 6. For more training details, please refer to the Appendix.

Experiment Results
This section shows the efficacy and generalization of LaSS. First, we show that LaSS obtains consistent performance gains on IWSLT and WMT datasets with different Transformer architecture variants. Further, we show that LaSS can easily generalize to new language pairs without losing accuracy on previous language pairs. Finally, we observe that LaSS can even improve zero-shot translation, obtaining gains of up to 26.5 BLEU.

Main Results
Results on IWSLT We first show our results on IWSLT. As shown in Table 1, LaSS consistently outperforms the multilingual baseline on all language pairs, confirming that using LaSS to alleviate parameter interference can help boost performance.
Results on WMT To further verify the generalization of LaSS, we also conduct experiments on WMT, where the dataset sizes are more imbalanced across language pairs. We adopt two Transformer architecture variants, i.e., Transformer-base and Transformer-big. As shown in Table 2, LaSS obtains consistent gains over the multilingual baseline on WMT for both Transformer-base and Transformer-big. For Transformer-base, LaSS achieves an average improvement of 1.2 BLEU on 36 language pairs over the baseline, while for Transformer-big, LaSS obtains a 0.6 BLEU improvement.
We observe that as the dataset scale of a language pair increases, the improvements in BLEU and WR become larger, suggesting that language pairs with large-scale datasets benefit more from LaSS than low-resource language pairs. This is intuitive, since rich-resource datasets suffer more from parameter interference than low-resource ones. We also find that the BLEU and WR gains obtained with Transformer-base are larger than those with Transformer-big. We attribute this to the more severe parameter interference in smaller models.
For comparison, we also include the results of LaSS with randomly initialized masks. Not surprisingly, Random underperforms the baseline by a large margin, since random masks intensify rather than alleviate parameter interference.

Generalization to New Language Pairs
LaSS has shown its efficacy in the above section. A natural question arises: can LaSS adapt to a new language or language pair that it has not seen in the training phase? In other words, can LaSS generalize to other language pairs? In this section, we show the generalization of LaSS in two settings. We first show that LaSS can adapt to new unseen languages and match bilingual models after training for only a few hundred steps, while the performance on existing language pairs hardly drops. Second, we show that LaSS can also boost performance in the zero-shot translation scenario, obtaining gains of up to 26.5 BLEU.
Figure 2 setting: the model is Transformer-big trained on the WMT dataset; En↔Ar and En↔It are both unseen language pairs.

Extensibility to New Languages
Previous works have studied easy and rapid adaptation to a new task or language pair (Bapna and Firat, 2019; Rebuffi et al., 2017). In this section, we show that LaSS can also easily adapt to new unseen languages without a dramatic drop on the existing ones. We allocate a new sub-network to each new language pair and train that sub-network on the specific language pair for a fixed number of steps. In this way, the new language pair only updates its corresponding parameters, which alleviates interference and catastrophic forgetting (Kirkpatrick et al., 2016) with respect to the other language pairs. We verify the extensibility of LaSS on 4 language pairs. For LaSS, as described in Sec. 3, we first fine-tune the multilingual base model and prune it to obtain the mask for the new language pair. For both the multilingual baseline and our method, we then train on only the specific language pair for a fixed number of steps. Figure 2 shows the trend of BLEU score along with the training steps. We observe that: 1) LaSS consistently outperforms the multilingual baseline along the training steps, and reaches the bilingual model performance with fewer steps. 2) The degradation on the other language pairs is much smoother than for the baseline. When reaching the bilingual baseline performance, LaSS hardly drops on the other language pairs, while the multilingual baseline drops by a large margin.
We attribute the easy adaptation to specific languages to the language specific sub-network. LaSS only updates the corresponding parameters, avoiding updates to all parameters that would hurt the performance of other languages. Another benefit of updating only the corresponding parameters is fast adaptation to specific language pairs.

Zero-shot
Zero-shot translation is translation between known languages that the model has never seen paired at training time (e.g., Fr→En and En→Zh are both seen in the training phase, while Fr→Zh is not). It is the ultimate goal of multilingual NMT and has been a common indicator of model capability (Johnson et al., 2017; Zhang et al., 2020). One of the biggest challenges is the off-target issue (Zhang et al., 2020): the model translates into a wrong target language.
In the previous experiments, we apply each mask to its corresponding language pair. As the training dataset is English-centric, non-English-centric masks are not available. We remedy this by merging two masks to create a non-English-centric mask: for example, we create the X→Y mask by combining the encoder mask of X→En with the decoder mask of En→Y. We select 6 languages and evaluate zero-shot translation on the language pairs between them.
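The mask-merging trick can be sketched as follows (storing each mask as a per-module dict is our illustrative assumption, not the paper's data structure):

```python
def zero_shot_mask(mask_x2en, mask_en2y):
    """Build a mask for an unseen direction X→Y from two supervised
    directions: encoder mask from X→En, decoder mask from En→Y."""
    return {"encoder": mask_x2en["encoder"], "decoder": mask_en2y["decoder"]}

fr2zh = zero_shot_mask(
    {"encoder": [1, 0, 1], "decoder": [1, 1, 0]},   # Fr→En mask
    {"encoder": [0, 1, 1], "decoder": [0, 1, 1]},   # En→Zh mask
)
# fr2zh → {"encoder": [1, 0, 1], "decoder": [0, 1, 1]}
```

The intuition: the encoder sub-network is specialized for reading X, the decoder sub-network for producing Y, so combining them yields a plausible X→Y sub-network.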
As shown in Table 3, surprisingly, by directly applying X→Y masks, LaSS obtains consistent gains over the baseline on all language pairs in both BLEU and translation-language accuracy, indicating the superiority of LaSS in learning to bridge between languages. It is worth noting that for Fr→Zh, LaSS outperforms the baseline by 26.5 BLEU, reaching 32 BLEU.
We also sample a few translation examples from Fr→Zh to analyze why LaSS can help boost zero-shot translation (more examples are listed in the Appendix).
As shown in Table 4, as well as by the translation-language accuracy in Table 3, we observe that the multilingual baseline has a severe off-target issue. In contrast, LaSS significantly alleviates the off-target issue, translating into the right target language. We attribute the success of "on-target" zero-shot translation to the language specific parameters, which act as a strong signal, apart from the language indicator, for the model to translate into the correct target language.

Analysis and Discussion
In this section, we conduct a set of analytic experiments to better understand the characteristics of language specific sub-networks. We first measure the relationship between a language specific sub-network, its capacity, and the language family. Secondly, we study how masks affect performance in the zero-shot scenario. Lastly, we discuss the relationship between the pruning rate α and performance.
We conduct our analytic experiments on IWSLT dataset. For readers not familiar with language family and clustering, Figure 4 is the hierarchical clustering according to language family.

Mask Similarity vs. Language Family
Ideally, similar languages should share more parameters since they share more language characteristics. Therefore, a natural question arises: does the model automatically capture the language family relationships defined by humans?
We calculate the similarity of masks between language pairs to measure the sub-network relationship between language pairs. We define mask similarity as the number of 1s shared by two masks divided by the number of 1s in the first mask:

$$\mathrm{Sim}(M_1, M_2) = \frac{\|M_1 \wedge M_2\|_0}{\|M_1\|_0}$$

where $\|\cdot\|_0$ represents the $L_0$ norm. Mask similarity reflects the degree of sharing among different language pairs. Figures 3(a) and 3(b) show the mask similarity in En→X and X→En. We observe that, for both En→X and X→En, the mask similarity is positively correlated with the language family similarity. The color of the grid cells in the figure is darker between similar languages (for example, es and it) and lighter between dissimilar languages (for example, es and he).
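The similarity formula, as a minimal sketch over flat binary masks:

```python
def mask_similarity(m1, m2):
    """Sim(M1, M2) = ||M1 ∧ M2||_0 / ||M1||_0: the share of M1's retained
    weights that are also retained by M2. Note the measure is asymmetric,
    since it normalizes by the first mask only."""
    shared = sum(a & b for a, b in zip(m1, m2))   # ||M1 ∧ M2||_0
    return shared / sum(m1)                       # divide by ||M1||_0

sim = mask_similarity([1, 1, 0, 1], [1, 0, 0, 1])
# sim → 2/3
```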
We also plot the similarity between En→X and X→En in Figure 3(c). We observe that, unlike within En→X or within X→En, the mask similarity does not correspond to the language family similarity. We suspect that the mask similarity is determined by the combination of source and target languages: En→Nl does not necessarily share more parameters with Nl→En than with En→De.
Where does language specific capacity matter?
To take a step further, we study how the model schedules language specific capacity across layers. Figure 5 shows the similarity of different components on the encoder and decoder sides as the layer index increases. More concretely, we plot the query, key, and value projections of the attention sub-layer and the fully-connected layers of the position-wise feed-forward sub-layer. We observe that: a) On both the encoder and decoder sides, the model tends to place more language specific capacity in the top and bottom layers rather than the middle ones. This is intuitive: the bottom layers deal more with embeddings, which are language specific, while the top layers are near the output layer, which is also language specific. b) For the fully-connected layers, the model tends to place more language specific capacity in the middle layers of the encoder, but in the top layers of the decoder.

How masks affect zero-shot?
In Sec. 4, we show that simply applying X→Y masks can boost zero-shot performance. We conduct experiments to analyze how masks affect zero-shot performance. Concretely, we take Fr→Zh as an example, replacing the encoder or decoder mask with another language's mask, respectively.
As shown in Table 5, we observe that replacing the encoder mask with another language's mask causes only a small performance drop, while replacing the decoder mask causes a dramatic performance drop. This suggests that the decoder mask is the key ingredient of the performance improvement.

About Sparsity
To better understand the pruning rate, we plot performance as the pruning rate increases in Figure 6. For WMT, the best choice of α is 0.3 for both Transformer-base and Transformer-big, while for IWSLT the best α lies between 0.6 and 0.7. The results are consistent with our intuition that large-scale training data needs a smaller pruning rate to keep the model capacity. We therefore suggest tuning α based on both the dataset and the model size.
For large datasets such as WMT, a smaller α is better, and a larger α only slightly decreases performance (by less than 0.5 BLEU). For small datasets like IWSLT, a larger α may yield better performance.

Conclusion
In this paper, we propose learning a Language Specific Sub-network (LaSS) for multilingual NMT. Extensive experiments on IWSLT and WMT have shown that LaSS is able to alleviate parameter interference and boost performance. Further, LaSS can generalize well to new language pairs by training for only a few hundred steps, while keeping the performance on existing language pairs. Surprisingly, in zero-shot translation, LaSS surpasses the multilingual baseline by up to 26.5 BLEU. Extensive analytic experiments are conducted to understand the characteristics of language specific sub-networks. Future work includes designing a more dedicated end-to-end training strategy and incorporating the insights gained from our analysis into a further improved LaSS.

A.2 Training Details
As stated in the previous section, we first train a multilingual baseline (Phase 1). Then we fine-tune the baseline on a specific language pair to obtain the mask (Phase 2). After that, we train the LaSS model with the obtained masks (Phase 3). Note that we only apply masks to linear weights; the embedding weights and layer normalization parameters are not masked out. We also exclude the output projection weight. We apply label smoothing with value 0.1 in all our experiments.
Data Following Tan et al. (2019), we first tokenize the data then apply BPE. The BPE vocab size is 30k. We apply over-sampling with a temperature of T = 2.
Training For Phase 1, we train the baseline with Adam with a learning rate schedule of (5e-4, 4k). The max tokens per batch is set to 262144. For Phase 2, we keep all other settings unchanged except that we set the max tokens to 16384 and the dropout to 0.3. For Phase 3, we keep the same setting as Phase 1, except that we apply masks on the model.

Model We replace the fixed positional embedding with a learnable one and replace ReLU with GeLU. We also use LayerNorm-embedding to stabilize training.
Data We use SentencePiece (Kudo and Richardson, 2018) to preprocess the data and learn BPE. Since the WMT dataset is highly imbalanced, we apply a temperature-based sampling strategy with T = 5. To ensure all languages are represented adequately in the vocabulary, we apply the same temperature-based sampling strategy for training the BPE model.
Training For Phase 1, we train the baseline with Adam with a learning rate schedule of (5e-4, 8k). The max tokens per batch is set to 524288. For Phase 2, the warm-up updates are set to 1000. To prevent the model from overfitting, we train on different language pairs with different numbers of steps and different batch sizes. Concretely, we fine-tune on >10k, >100k, >1m and >10m language pairs for 1k, 2k, 4k and 8k steps, with max tokens per batch of 20480, 40960, 81920 and 163840, respectively. For Phase 3, we keep the setting the same as Phase 1.