More Parameters? No Thanks!

This work studies the long-standing problems of model capacity and negative interference in multilingual neural machine translation (MNMT). We use network pruning techniques and observe that pruning 50-70% of the parameters from a trained MNMT model results in a drop of only 0.29-1.98 BLEU, suggesting that large redundancies exist even in MNMT models. These observations motivate us to use the redundant parameters to counter the interference problem efficiently. We propose a novel adaptation strategy in which we iteratively prune and retrain the redundant parameters of an MNMT model to improve bilingual representations while retaining multilinguality. Negative interference severely affects high resource languages, and our method alleviates it without any additional adapter modules. Hence, we call it a parameter-free adaptation strategy, paving the way for efficient adaptation of MNMT. We demonstrate the effectiveness of our method on a 9-language MNMT model trained on TED talks, and report an average improvement of +1.36 BLEU on high resource pairs. Code will be released here.


Introduction
Multilingual neural machine translation (MNMT) has seen various advances in recent years (Dong et al., 2015; Firat et al., 2016; Zoph et al., 2016; Tan et al., 2019; Aharoni et al., 2019; Arivazhagan et al., 2019). However, the core principle behind its effectiveness in modelling multiple languages remains the same, i.e., sharing all the model parameters between all the languages (Johnson et al., 2017). Although this approach is highly scalable and effective, the performance on high resource languages decreases as more low resource languages are added to the model; this is called negative interference. To overcome this, recent works (Zhang et al., 2020) proposed language-specific adapter modules, which provide extra parameters to learn language-specific representations and overcome the negative interference caused by a high degree of parameter sharing.
In this paper, we propose an alternative to adapter modules. Instead of adding more parameters, we show that the Transformer (Vaswani et al., 2017) has enough capacity to model multiple languages and overcome negative interference effectively. Inspired by the work of Mallya and Lazebnik (2018), we apply iterative pruning to free up the redundant parameters of an MNMT model, and retrain them to learn language-specific representations. We start with a trained MNMT model and prune a fraction of the model parameters. We then freeze the surviving parameters and retrain the free ones on a bilingual dataset. This process is applied iteratively for each bilingual pair to obtain bilingual masks over all the model parameters, as illustrated in figure 1. We show that using only a fraction of the redundant parameters significantly improves the performance on high resource languages. We also retain the multilinguality and the zero-shot translation ability after adaptation. By demonstrating the effectiveness of this approach, we open a potential research direction towards parameter-free adaptation in MNMT.

Related Work
Adding multiple tasks to a single network: Due to the over-parameterized nature of deep neural networks, prior works (Kirkpatrick et al., 2017; Lee et al., 2017; Li and Hoiem, 2017; Triki et al., 2017) aimed at developing methods to learn multiple tasks while avoiding catastrophic forgetting. Mallya and Lazebnik (2018) proposed an iterative pruning approach to free up parameters for adding new tasks and retain the previously trained parameters at the same time. Inspired by this concept, we show that an MNMT Transformer model can be heavily pruned and the freed-up parameters can be retrained to improve bilingual performance, while retaining the multilinguality.

Figure 1: (Better seen in colour.) Illustration of the evolution of model parameters. (a) shows the multilingual parameters in grey. Through 60% pruning and retraining, we arrive at (b); here white represents the free weights with value=0. The surviving weights in grey will be fixed for the rest of the method. Now, we train the free parameters on the first bilingual pair (L-1) and arrive at (c), which represents the initial parameters of L-1 in orange, sharing weights with the previously trained multilingual parameters in grey. Again, with 50% pruning and retraining on the current L-1 specific weights in orange, we get the final parameters for L-1 shown in (d) and extract the final mask for L-1 in (f). We repeat the same procedure for all the bilingual pairs and extract the masks for each pair.

Adapting multilingual model to a new language pair and domain adaptation: Prior works on adaptation (Neubig and Hu, 2018; Variš and Bojar, 2019; Stickland et al., 2020; Escolano et al., 2020; Akella et al., 2020; Zhang et al., 2020) aim at improving language-specific performance by either fine-tuning the same MNMT model or adding language-specific modules. While effective, these methods either lose their multilinguality or introduce additional parameters. Sharing the same objective, we propose a method to adapt an MNMT model without adding language-specific modules, while retaining the multilinguality at the same time. Another line of work (Thompson et al., 2018; Wuebker et al., 2018) proposed training subnetworks and freezing the rest for domain adaptation.

Method
The central idea of our method is to use magnitude pruning to free up parameters in the model and learn bilingual-specific representations. Figure 1 depicts the evolution of model weights during the training procedure, with (a) representing the initial multilingual weights in grey. We prune away a fraction of parameters using the one-shot magnitude pruning technique (Han et al., 2015), which results in a compressed multilingual representation. We further train the surviving multilingual weights for a few more epochs on the multilingual dataset to compensate for extreme pruning. From this point on, the multilingual parameters remain fixed. Then, we use the free parameters to learn the first language-specific representations: we select the first bilingual dataset and train the free parameters on it. Next, we again prune a fraction of weights, this time from the current bilingual parameters only, to make room for more bilingual representations. We repeat the same procedure for all the existing bilingual pairs. Note that during a forward pass data flows through all the shared and specific weights, while during the backward pass only the current bilingual-specific parameters get updated. Hence, the accuracy is retained for all the previously trained bilingual pairs, and the method enables a high degree of sharing and specificity at the same time.
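The loop described above can be sketched on a flat list of weights. This is a minimal illustration under stated assumptions, not the authors' fairseq implementation: the owner ids (0 for the shared multilingual weights, k for the k-th bilingual pair, -1 for free weights), the function names, and the toy values are all hypothetical.

```python
FREE = -1  # assumed marker for weights that no task owns yet

def prune_owned(weights, owner, task, ratio):
    """Free the `ratio` smallest-magnitude weights currently owned by `task`."""
    idx = [i for i, o in enumerate(owner) if o == task]
    idx.sort(key=lambda i: abs(weights[i]))
    for i in idx[: int(len(idx) * ratio)]:
        weights[i] = 0.0
        owner[i] = FREE

def claim_free(owner, task):
    """Hand every free weight to `task` before training on its data."""
    for i, o in enumerate(owner):
        if o == FREE:
            owner[i] = task

def masked_sgd_step(weights, grads, owner, task, lr):
    """Update only the weights owned by `task`; everything else stays frozen."""
    for i, o in enumerate(owner):
        if o == task:
            weights[i] -= lr * grads[i]

# Toy run: 8 multilingual weights, all owned by the shared model (id 0).
weights = [0.9, -0.1, 0.5, 0.05, -0.7, 0.2, -0.3, 0.8]
owner = [0] * len(weights)

prune_owned(weights, owner, task=0, ratio=0.5)   # first pruning step
claim_free(owner, task=1)                        # free weights go to pair 1
masked_sgd_step(weights, [1.0] * 8, owner, task=1, lr=0.1)  # stand-in gradient
prune_owned(weights, owner, task=1, ratio=0.5)   # second pruning step
```

After the toy run, the four largest multilingual weights are untouched, pair 1 keeps half of the weights it trained, and the rest are free for the next pair.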
Pruning Approach: We perform magnitude pruning (Han et al., 2015) over the weights of all layers. For simplicity, we do not use the more sophisticated pruning methods (Frankle and Carbin, 2019; Michel et al., 2019; Voita et al., 2019). We do not prune biases and layer normalization parameters, since they account for less than 1% of the total parameters, which is insignificant. We also do not prune the embeddings, as they are data-specific parameters. All of these unpruned parameters are kept fixed after training the multilingual model.
Inference: After finishing the training for each bilingual pair, we get the final mask over all the parameters of the model. Values of the mask range from 1 → N, where N is the total number of bilingual pairs. Each model parameter is masked according to the bilingual pair of interest. To predict a translation for the t-th pair, all the parameters learned for languages 1 → t are used, as shown in figure 1(f) and (g).
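As a hedged sketch of the selection rule in this paragraph, the snippet below builds a keep-mask per weight tensor and skips biases, layer normalization and embeddings. The parameter names are illustrative placeholders, the substring filter is an assumption, and pruning each tensor separately (rather than with one global threshold) is also an assumption that the text does not settle.

```python
SKIP_SUBSTRINGS = ("bias", "layer_norm", "embed")  # assumed naming convention

def is_prunable(name):
    """True for weight tensors that are subject to magnitude pruning."""
    return not any(s in name for s in SKIP_SUBSTRINGS)

def magnitude_mask(values, ratio):
    """Return a 0/1 keep-mask that drops the `ratio` smallest-magnitude entries."""
    n_drop = int(len(values) * ratio)
    order = sorted(range(len(values)), key=lambda i: abs(values[i]))
    dropped = set(order[:n_drop])
    return [0 if i in dropped else 1 for i in range(len(values))]

# Hypothetical parameters, flattened to short lists for illustration.
params = {
    "encoder.layers.0.fc1.weight": [0.4, -0.02, 0.9, 0.1],
    "encoder.layers.0.fc1.bias": [0.01, 0.02],
    "encoder.layers.0.self_attn_layer_norm.weight": [1.0, 1.0],
    "encoder.embed_tokens.weight": [0.3, 0.001],
}

masks = {
    name: magnitude_mask(values, ratio=0.5)
    for name, values in params.items()
    if is_prunable(name)
}
```

Only the feed-forward weight receives a mask here; the other three tensors are left out of pruning entirely.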
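The inference rule can be illustrated with a small masking function. This is a sketch under assumptions: the text gives mask values 1 → N for the bilingual pairs, and the id 0 used here for the shared multilingual weights, like the function name, is a hypothetical convention.

```python
def weights_for_pair(weights, owner, t):
    """Keep shared weights (id 0) and those of pairs 1..t; zero out the rest."""
    return [w if 0 <= o <= t else 0.0 for w, o in zip(weights, owner)]

# Toy example: one shared weight and one weight for each of three pairs.
weights = [0.9, 0.2, -0.4, 0.7]
owner = [0, 1, 2, 3]

active_for_pair_2 = weights_for_pair(weights, owner, t=2)
```

Because pair 2 was trained while pair 3's weights were still zero, masking them out at inference reproduces the network that pair 2 saw during training.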

Datasets
We use the TED talks (Qi et al., 2018) in all our experiments, and all the numbers are BLEU (Papineni et al., 2002) scores over the test set.¹ We train on 8 English-centric language pairs² (en-xx) covering a spectrum of sizes from high resource Ar (Arabic, 214K) to low resource Be (Belarusian, 4.5K).

Training
Architecture: We use the Transformer architecture (Vaswani et al., 2017), implemented in fairseq (Ott et al., 2019), which was modified to include the pruning and masking modules. We train a joint BPE model (Sennrich et al., 2016) on all languages with a vocabulary size of 40K. The Transformer architecture used in this work³ has 8 attention heads, 6 encoder and decoder layers, an embedding size of 512, and a feed-forward dimension of 2048. We set the dropout to 0.3.
MNMT Training: We train a standard MNMT model following settings similar to Johnson et al. (2017). A single many-to-many model is trained on all the English-centric data, using a source-side control token to indicate the target language. We use Adam (Kingma and Ba, 2015) with an inverse square root schedule, with 4500 warm-up updates and a maximum learning rate of 0.0003. We set the maximum batch size per GPU to 3050 tokens and train on 4 GPUs. Like Arivazhagan et al. (2019), to avoid the size imbalance, we use the temperature-based sampling strategy with T = 5. The MNMT model is trained for 40 epochs over 8 English-centric language pairs, i.e., 16 directions. As shown in table 1, we train a strong parent MNMT baseline.
Pruning MNMT: We prune 50% of parameters from a fully converged MNMT model, and retrain the surviving parameters on the same multilingual dataset for ten more epochs, to compensate for the lost parameters.

¹ Scores reported are SacreBLEU (Post, 2018).
² ar, az, be, de, gl, he, it, sk
³ transformer in fairseq

Adapting MNMT to bilingual specific representations: After pruning the MNMT model, we select each bidirectional dataset (en-xx and xx-en) in descending order of dataset size. We use the original source-side control token, reset the learning rate scheduler and train all the free parameters for 20 epochs. Then, we prune 75% of the current bilingual-specific parameters and retrain for ten more epochs to compensate for heavy pruning.
Pruning ratios are decided based on the trade-off between the accuracy lost and the space left to adapt all the languages. We prune 50-70% of parameters from the parent MNMT and observe that it leads to a drop of 0.29-1.98 BLEU. Therefore, we select 50% as the first pruning ratio and keep it constant in all the experiments. The second pruning ratio is set to 75% so that the last language pairs get at least 2-5% of the parameters. More variations of the second pruning ratio are demonstrated in section 5.4.
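To make the two training heuristics concrete, here is a minimal sketch of temperature-based sampling and an inverse square root learning rate schedule, using the values quoted above (T = 5, 4500 warm-up updates, maximum rate 0.0003). The function names are hypothetical, the linear warm-up from zero is an assumption about the schedule's exact form, and only the two dataset sizes stated in the paper (Ar and Be) are used; the other six pairs are omitted because their sizes are not given here.

```python
def sampling_probs(sizes, temperature=5.0):
    """Sample pair k with probability proportional to (n_k / sum n) ** (1 / T)."""
    total = sum(sizes.values())
    scaled = {k: (n / total) ** (1.0 / temperature) for k, n in sizes.items()}
    norm = sum(scaled.values())
    return {k: v / norm for k, v in scaled.items()}

def inverse_sqrt_lr(step, max_lr=3e-4, warmup=4500):
    """Linear warm-up to max_lr, then decay proportional to 1 / sqrt(step)."""
    if step < warmup:
        return max_lr * step / warmup
    return max_lr * (warmup / step) ** 0.5

sizes = {"ar": 214_000, "be": 4_500}  # only the sizes stated in the paper
probs = sampling_probs(sizes)         # T = 5 flattens the imbalance
uniform_like = sampling_probs(sizes, temperature=1e9)  # very large T: near uniform
proportional = sampling_probs(sizes, temperature=1.0)  # T = 1: proportional to size
```

With T = 1 the low resource pair is sampled in proportion to its size; raising T moves the distribution towards uniform, which is why a value such as 5 upsamples Be relative to its raw share of the data.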
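The parameter budget implied by these ratios can be worked out in a few lines. This is one reading of the schedule, stated as an assumption: every pair trains on all currently free parameters and then gives back the second pruning ratio of them, so the k-th pair keeps first_ratio × (1 − second_ratio) × second_ratio^(k−1) of the full model. Whether the final pair is also pruned is not stated in the text, and the function name is hypothetical.

```python
def share_of_pair(k, first_ratio=0.5, second_ratio=0.75):
    """Fraction of all model parameters kept by the k-th adapted pair (k >= 1)."""
    return first_ratio * (1.0 - second_ratio) * second_ratio ** (k - 1)

# Shares for 8 pairs, as percentages of the full model.
shares = [100.0 * share_of_pair(k) for k in range(1, 9)]
for k, s in enumerate(shares, start=1):
    print(f"pair {k}: {s:.2f}% of parameters")
```

Under this reading the first pair keeps 12.5% of the model and each later pair keeps three quarters of the previous share, so a smaller second ratio would favour the early pairs at the expense of the later ones.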

Overcoming interference for high resource pairs:
In table 1, we present a comparative study of a high resource language scenario, severely affected by negative interference. Adapted MNMT outperforms the parent MNMT on all 8 directions, with an average improvement of +1.40 on xx-en and +1.32 on en-xx directions, and closes the gap with high-performing bilingual baselines.

Analysing model capacity and negative interference:
We now expand on the problems of model capacity and interference. As shown in table 1, pruning 50% of parameters from the parent MNMT model leads to an average loss of just 0.29 BLEU points. This observation confirms that large redundancies exist even in a 9-language MNMT model. The drop in the performance of an MNMT model relative to its bilingual counterparts is loosely associated with the lack of capacity. As can be seen in figure 2, by using only a fraction of parameters for each bilingual pair, we can significantly improve the performance over the parent MNMT. Our results demonstrate the ability of parameter-free adaptation to fight negative interference and improve the performance of severely affected high resource language pairs.

Table 1: ... (2) Multilingual model scores. Here Aharoni et al. (2019) and  are trained on 59 and 20 languages respectively. Parent MNMT is our multilingual model trained till convergence on 9 languages. 50% pruned MNMT is the compressed parent MNMT. Adapted MNMT is the proposed model.

Analysing differences in the adaptation of high and low resource pairs:
To understand the impact of parameter-free adaptation on both the high and low resource language pairs in an unbiased setting, we train two models that add bilingual pairs in opposite orders. First, we train in the order of high to low resource languages (Ar to Be). Second, we train in the order of low to high resource languages (Be to Ar). In both cases, the first two languages added receive the same proportion of parameters: the high resource languages (Ar, He) in case 1 and the low resource languages (Be, Az) in case 2. As evident from figures 2 and 3, the improvements in Ar and He in case 1 are significantly larger than the improvements in Be and Az in case 2. This observation agrees with the fact that negative interference severely affects the high resource languages in an MNMT model, which therefore need adaptation to improve. In contrast, the performance of low resource languages in an MNMT model is already near saturation due to the positive transfer from high resource languages. Hence, to extract the most out of parameter-free adaptation, it is better to prune and retrain the network in the order of high to low resource languages. This assigns a high proportion of parameters to high resource pairs, to effectively overcome negative interference.

Zero-shot Translation:
Zero-shot translation in the context of MNMT refers to inference between pairs (xx-xx) that are not seen directly during the training phase. We show that we retain this important ability in our adapted MNMT (Johnson et al., 2017). As shown in figure 4, adapted MNMT performs as well as the parent MNMT on all 56 xx-xx directions, even with only 50% of the total parameters.

Table 2: Full-FT represents the bilingual models derived from finetuning the full parent MNMT. The rest are the adapted MNMTs, adapted over the 50% free parameters of the pruned MNMT: 1) Ar only with 50% parameters, 2) Ar, He with 25% each, 3) Ar, He, It with 16.6% each, and 4) Ar, He, It, De with 12.5% each.

5.4 Adapting to a subset of languages and retaining the multilinguality:
Due to the limited and fixed number of parameters, we cannot adapt to an arbitrary number of languages. However, this framework allows high flexibility in adapting the parent MNMT to only the languages of interest, while retaining the multilinguality. We adapt the parent MNMT to four models: 1) Ar, 2) Ar, He, 3) Ar, He, It, and 4) Ar, He, It, De. This way, we can assign all the free parameters to only the languages of interest and increase their capacities. The first pruning ratio is set to 50% for all four models. The second pruning ratio is set such that each language receives an equal proportion of parameters. From the results in table 2, we observe that assigning more parameters improves the performance only marginally. The four adapted MNMTs have similar performances, even with a significant difference in the proportion of parameters assigned to each language. The 4th model, with only 12.5% of parameters reserved for Ar, performs competitively with the 1st model, with 50% of parameters for Ar. This implies that a small fraction of parameters can effectively overcome negative interference, hence leaving space to adapt to multiple languages. To infer on the remaining languages, which are not adapted, we can use the 50% pruned MNMT weights, as done for zero-shot translation in the previous section, hence retaining the multilinguality.
In table 2, we also compare the results of the four adapted MNMTs with naive finetuning of the full parent MNMT on bilingual pairs (Full-FT). The difference between naive finetuning and the proposed adaptation approach is that the former uses 100% of the model parameters and the embeddings to adapt to a single bilingual pair, so the multilinguality is lost. In our approach, by contrast, the pruned MNMT weights and the embeddings are fixed and we retrain only the free parameters, which makes it possible to adapt to multiple languages. As can be seen in table 2, adapted MNMTs perform competitively with Full-FT while retaining the multilinguality.
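The equal-share setting can be derived with a short calculation. The sketch below assumes that each language trains on all currently free parameters and is then pruned, so that with k languages the i-th one (counting from 0) must be pruned by 1 − 1/(k − i) to keep exactly 1/k of the initial free pool; the function names are hypothetical and the paper does not list the ratios it used.

```python
def equal_share_ratios(k):
    """Second pruning ratios that give each of k languages an equal share."""
    return [1.0 - 1.0 / (k - i) for i in range(k)]

def shares_from_ratios(free, ratios):
    """Fraction of the full model kept by each language, given the ratios."""
    shares = []
    for r in ratios:
        kept = free * (1.0 - r)  # what this language keeps after pruning
        shares.append(kept)
        free -= kept             # the rest returns to the free pool
    return shares

# The four settings of table 2, with 50% of the model free after the first pruning.
for k in (1, 2, 3, 4):
    ratios = equal_share_ratios(k)
    shares = shares_from_ratios(0.5, ratios)
    print(k, [round(r, 3) for r in ratios], [round(s, 4) for s in shares])
```

For four languages this gives ratios of 0.75, 2/3, 0.5 and 0, that is, the last language keeps everything that is left, and each language ends up with 12.5% of the model, in line with the proportions quoted for table 2.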

Conclusion
We investigate the problems of model capacity and negative interference in multilingual neural machine translation. We show that even a 9-language MNMT model has a large proportion of redundant parameters, which can be efficiently retrained to overcome interference. We propose a parameter-free adaptation strategy in which we use iterative pruning and retraining to improve bilingual representations without any additional parameters. We hope that our work will attract more attention to practical and efficient ways of adapting an MNMT model.