Communication Efficient Federated Learning for Multilingual Neural Machine Translation with Adapter

Federated Multilingual Neural Machine Translation (Fed-MNMT) has emerged as a promising paradigm for institutions with limited language resources. This approach allows multiple institutions to act as clients and train a unified model through model synchronization, rather than collecting sensitive data for centralized training. This significantly reduces the cost of corpus collection and preserves data privacy. However, as pre-trained language models (PLMs) continue to increase in size, the communication cost for transmitting parameters during synchronization has become a training speed bottleneck. In this paper, we propose a communication-efficient Fed-MNMT framework that addresses this issue by keeping PLMs frozen and only transferring lightweight adapter modules between clients. Since different language pairs exhibit substantial discrepancies in data distributions, adapter parameters of clients may conflict with each other. To tackle this, we explore various clustering strategies to group parameters for integration and mitigate the negative effects of conflicting parameters. Experimental results demonstrate that our framework reduces communication cost by over 98% while achieving similar or even better performance compared to competitive baselines. Further analysis reveals that clustering strategies effectively solve the problem of linguistic discrepancy and pruning adapter modules further improves communication efficiency.


Introduction
Federated Learning (FL) (McMahan et al., 2017) provides a new training framework that utilizes data from various clients without privacy leakage. In FL, the server receives models from clients trained on their local data, aggregates all received parameters to acquire a global model, and then sends it back to all clients to start the next training round. This characteristic enables FL to be widely applied in real-world scenarios (Ge et al., 2020; Roosta et al., 2021; Passban et al., 2022; Niu and Deng, 2022). In recent years, federated multilingual neural machine translation (Fed-MNMT) has emerged as a new training paradigm, making it feasible for most institutions to train MNMT models (Roosta et al., 2021; Passban et al., 2022). FL makes it possible to leverage corpora from other organizations without privacy problems, addressing the issue that training an MNMT model requires collecting large-scale multilingual corpora, which is expensive, time-consuming, and often unaffordable for resource-constrained institutions. Therefore, Fed-MNMT is a secure and cost-effective alternative to conventional centralized training for the optimization of MNMT models. Our code is available at https://github.com/lancopku/FedMNMT.
However, the issue of communication cost is non-negligible when introducing FL to neural machine translation. Unlike local centralized learning, federated learning requires frequent communication of model parameters between the server and clients. Therefore, the communication cost grows rapidly with model size. Nowadays, pre-trained language models are widely adopted as backbone models for MNMT, whose parameter counts are usually over 10^8, e.g., 611M for mBART-50 (Tang et al., 2020) and 1.2B for M2M100 (Fan et al., 2020). Considering the increasing number of clients in realistic scenarios, where new clients frequently appear, communication costs will severely hinder the efficient training of the entire Fed-MNMT system and thus make the application of FL to MNMT impractical.
To tackle this problem, we introduce the idea of parameter-efficient tuning (Houlsby et al., 2019; Pfeiffer et al., 2021; Karimi Mahabadi et al., 2021) into Fed-MNMT. Specifically, we focus on adapters (Rebuffi et al., 2017; Houlsby et al., 2019), a popular technique for efficient tuning that requires updating only a lightweight adapter module. In adapter training, a number of randomly initialized modules are inserted into the backbone model and fine-tuned on new data. Only the parameters of these modules are updated during training, so the number of parameters to be transferred between the server and clients is substantially reduced. As Figure 1 illustrates by comparing communication costs before and after introducing adapters, this approach significantly saves communication costs and enables practical applications of Fed-MNMT.
However, directly adding adapter modules to NMT models results in a performance decline, as initially observed by Roosta et al. (2021) and also confirmed by our experimental results. This phenomenon is attributed to the divergence between different language pairs. In Fed-MNMT, corpora in diverse languages from different clients are not independently and identically distributed (non-I.I.D.), so directly aggregating parameters from clients degrades the model's performance (Zhao et al., 2018).
Considering the adverse effect of conflicting parameters from diverse languages in Fed-MNMT, we introduce clustering strategies to alleviate this issue. The core idea is to cluster clients according to the characteristics of their data and only conduct aggregation within each cluster, where clients share similar properties. Specifically, we cluster clients with different language pairs based on language family, gradient similarity, and random assignment, respectively, and systematically compare the performance of the different clustering strategies on multilingual translation benchmarks. Figure 2 gives a general view of our training framework. Our experimental results show that clustering on adapters alleviates the data non-I.I.D. problem and yields better performance in most cases. Overall, our work opens a new direction for future improvements on Fed-MNMT in the real world.
In conclusion, our primary contributions can be summarized as follows:
• Aware of the communication barrier in the training of Fed-MNMT models, we introduce a practical and efficient Fed-MNMT framework to enable real-world applications.
• By exploring adapters and clustering strategies to alleviate the undesirable effect of data discrepancy, we achieve comparable results with over 98% of the communication cost reduced compared to vanilla Fed-MNMT.

Methodology
In this section, we first define the Fed-MNMT problem in § 2.1. Next, we elaborate on the adapter modules and the investigated clustering strategies in § 2.2 and § 2.3, respectively. Last, we compare the communication costs of the original Fed-MNMT and our method in § 2.4.

Problem Formulation
For a Fed-MNMT problem, we suppose the set of clients is $\{C_i\}_{i=1}^{N}$ with $N > 1$, where client $C_i$ owns exactly one language pair $P_i$ with source language $src_i$ and target language $tgt_i$, and a corresponding dataset $D_i$ of size $n_i$. In each training round, the optimization target for $C_i$ is to minimize the cross-entropy loss between the ground truths $y_i$ and the model's predictions $\tilde{y}_i$:
$$\mathcal{L}_i = \mathrm{CE}(y_i, \tilde{y}_i). \qquad (1)$$
After the $t$-th training round, all clients deliver their local parameters to the server, which aggregates them to obtain the initial parameters for the next round's training. A commonly adopted aggregation algorithm is FedAvg (McMahan et al., 2017), which computes the weighted average of clients' parameters according to the quantities of local data samples. Let $\Theta_i^t$ denote the parameters of client $C_i$ after round $t$; FedAvg can be formulated as
$$\Theta^{t+1} = \sum_{i=1}^{N} \frac{n_i}{n}\, \Theta_i^t, \qquad (2)$$
where $n = \sum_{i=1}^{N} n_i$. The aggregated parameters are then sent back to all clients to initialize their local models for the next round of training.
However, data sizes can vary sharply between low-resource and high-resource languages in Fed-MNMT, and FedAvg cannot deal well with such data quantity skew (Wang et al., 2020). Thus we change FedAvg, which calculates the weighted mean of different clients' parameters, to directly calculating the arithmetic mean:
$$\Theta^{t+1} = \frac{1}{N} \sum_{i=1}^{N} \Theta_i^t. \qquad (3)$$
We refer to this aggregation method as FedMean in our paper.
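The two aggregation rules can be sketched as follows. This is a minimal illustration with toy parameter vectors, not the paper's actual implementation:

```python
import numpy as np

def fedavg(params, sizes):
    """Eq. (2): weighted average of client parameters by local data size."""
    n = sum(sizes)
    return sum((n_i / n) * p for p, n_i in zip(params, sizes))

def fedmean(params):
    """Eq. (3): unweighted arithmetic mean, robust to data-quantity skew."""
    return sum(params) / len(params)

# Two clients: high-resource (9000 samples) vs. low-resource (1000 samples).
theta = [np.array([1.0, 1.0]), np.array([5.0, 5.0])]
sizes = [9000, 1000]

print(fedavg(theta, sizes))  # [1.4 1.4] -- dominated by the high-resource client
print(fedmean(theta))        # [3. 3.] -- both clients contribute equally
```

The toy example shows why FedMean helps under quantity skew: the low-resource client's parameters are no longer drowned out by the high-resource client's weight.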
Considering the size of pre-trained multilingual models, the communication of model parameters between the server and clients is time-consuming. Inspired by recent progress in parameter-efficient tuning, we investigate whether adapters can be used to improve the efficiency of FL.

Adapter Modules
We introduce bottleneck adapters (Houlsby et al., 2019) into pre-trained multilingual models. Following the settings of Houlsby et al. (2019) and Pfeiffer et al. (2020), we add adapter modules after the self-attention layer and the feed-forward network (FFN) layer in each encoder layer, and an additional adapter after the cross-attention layer in each decoder layer. During training, only the parameters of the adapters and layer-norm modules are updated, so only a small proportion of parameters has to be communicated between the server and clients.
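A bottleneck adapter is a down-projection, a nonlinearity, and an up-projection wrapped in a residual connection. The sketch below uses NumPy with hypothetical dimensions (the paper's adapter hidden size is 64); it illustrates the structure only, not the exact implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_bottleneck = 16, 4  # hypothetical sizes for illustration

class BottleneckAdapter:
    """Down-project -> ReLU -> up-project, with a residual connection."""
    def __init__(self, d_model, d_bottleneck):
        # Small init keeps the adapter close to an identity map at the start.
        self.W_down = rng.normal(0, 0.02, (d_model, d_bottleneck))
        self.W_up = rng.normal(0, 0.02, (d_bottleneck, d_model))

    def __call__(self, h):
        z = np.maximum(h @ self.W_down, 0.0)  # bottleneck activation
        return h + z @ self.W_up              # residual connection

adapter = BottleneckAdapter(d_model, d_bottleneck)
h = rng.normal(size=(2, d_model))  # a batch of hidden states
out = adapter(h)
print(out.shape)  # (2, 16)
```

Each adapter holds only 2 * d_model * d_bottleneck weights, which is why transferring adapters instead of the full backbone cuts communication so drastically.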

Client Clustering Strategies
Related research (Johnson et al., 2017; Firat et al., 2016) has shown that parameter sharing among different languages in MNMT boosts the model's performance, especially for low-resource languages. Motivated by the success of language clustering in MNMT (Tan et al., 2019), we introduce language pair clustering into the Fed-MNMT problem and only allow inner-cluster parameter aggregation. Assuming the multilingual model consists of an encoder and a decoder, we first run a clustering algorithm to obtain the encoder cluster set $G_e = \{g_i\}_{i=1}^{m_e}$ and the decoder cluster set $G_d = \{g_i\}_{i=1}^{m_d}$. Each cluster $g_i$ contains the ids of the clients in that cluster. The detailed aggregation procedure is shown in Algorithm 1.
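Inner-cluster aggregation restricts averaging to clients within the same cluster. A minimal sketch, assuming FedMean within each cluster and representing each client's (encoder or decoder) parameters as a single vector:

```python
import numpy as np

def inner_cluster_aggregate(client_params, clusters):
    """Average parameters only among clients in the same cluster (FedMean)."""
    aggregated = {}
    for cluster in clusters:
        mean = sum(client_params[cid] for cid in cluster) / len(cluster)
        for cid in cluster:
            aggregated[cid] = mean  # every member receives the cluster mean
    return aggregated

# Four clients; clients 0,1 and clients 2,3 form two clusters
# (e.g. grouped by language family).
params = {0: np.array([1.0]), 1: np.array([3.0]),
          2: np.array([10.0]), 3: np.array([20.0])}
agg = inner_cluster_aggregate(params, [[0, 1], [2, 3]])
print(agg[0], agg[2])  # [2.] [15.]
```

The same routine is applied separately to the encoder clusters $G_e$ and the decoder clusters $G_d$, so parameters never cross cluster boundaries.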
We explore the following three different clustering strategies.
Language families/groups. Chronopoulou et al. (2022) have verified the strategy of sharing parameters within the same language family for MNMT. We adopt this strategy in the FL setting. We choose 8 languages belonging to 4 different language families from the TED2020 corpus, and 10 languages belonging to 4 different language groups, all within the Indo-European language family, from the Europarl corpus. The clustering of the encoder depends on the language families/groups of the source languages, and the clustering of the decoder is decided by the target languages' families/groups. Languages from the same family or group are clustered together.
Gradients. Unlike in centralized learning, clustering based on model parameters (Tan et al., 2019) is infeasible in Fed-MNMT due to privacy problems. Therefore, we use gradients as the clustering feature instead. For each language pair, we use a pre-trained multilingual model to acquire an average gradient vector over all data samples; a clustering algorithm is then applied to these gradient vectors to separate clients into groups. The number of parameters used for gradient clustering is only about 131K per client, which hardly introduces any extra communication cost.
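Gradient-based clustering can be sketched with a small k-means over the per-client average gradient vectors. The code below is a toy illustration (4-dimensional gradients instead of the ~131K used in the paper; the paper does not name its clustering algorithm, so k-means here is an assumption):

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Minimal k-means for grouping clients by average gradient vector."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # Assign each client to the nearest center.
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        # Move each center to the mean of its assigned clients.
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Hypothetical per-client average gradients: clients 0,1 are similar,
# and so are clients 2,3.
grads = np.array([[1.0, 0.0, 0.0, 0.0], [0.9, 0.1, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.9, 0.1]])
labels = kmeans(grads, k=2)
print(labels)  # clients 0,1 share a cluster; clients 2,3 share the other
```

Only these low-dimensional gradient features leave each client, which is why the strategy remains privacy-friendly and cheap to communicate.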
Random clustering. We also test randomly separating all clients into different groups as a baseline clustering strategy. In detail, we uniformly partition the clients and keep the numbers of clusters in the encoder and the decoder the same as in the language families/groups strategy.

Communication Cost Comparison
Taking the mBART-50 model (Tang et al., 2020), a popular pre-trained multilingual model, as an example, its roughly 610.9M parameters require about 2.44GB of storage in FP32 format. In comparison, after adding adapter modules, only about 8M parameters have to be transferred, saving approximately 98.7% of the communication cost. More concretely, we provide an approximation of the transmission time needed between the server and clients as follows. Assuming the maximum bandwidth of the server is 1000Mbps, the time to transfer the entire mBART model from a client to the server is around 2.44GB / 1000Mbps = 19.5 seconds. Assuming all clients share the bandwidth, the total transfer time grows linearly with the number of clients, and the synchronization process for all clients to finish transferring their models will occupy a large proportion of the total training time. In our actual experiments with 12 clients, the theoretical total transfer time is about 19.5 × 12 = 234 seconds. However, for clients with low-resource languages, local training can finish within 7 minutes, which means the transfer time occupies over half of the local training time. By contrast, the time to transfer the adapter's parameters is only about 0.26 seconds, which is negligible compared to local training time, thus significantly improving training efficiency.
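The transfer-time estimates above can be reproduced with simple arithmetic, assuming FP32 parameters and a 1000 Mbps link:

```python
def transfer_time_s(n_params, bandwidth_mbps=1000, bytes_per_param=4):
    """Seconds to send n_params FP32 parameters over a link of the given bandwidth."""
    bits = n_params * bytes_per_param * 8
    return bits / (bandwidth_mbps * 1e6)

full_model = 610.9e6  # mBART-50 parameter count
adapters = 8e6        # adapter + layer-norm parameters

print(round(transfer_time_s(full_model), 1))   # ~19.5 s per client
print(round(transfer_time_s(adapters), 2))     # ~0.26 s per client
print(round(transfer_time_s(full_model) * 12)) # ~235 s for 12 clients sharing the link
```

In practice, protocol overhead and contention would make these numbers somewhat worse, but the three-orders-of-magnitude gap between full-model and adapter transfers is what matters.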

Datasets and Evaluation Metrics
We conduct experiments in two different settings: Multi-to-English and Multi-to-Multi (hereinafter referred to as "m2en" and "m2m", respectively). We use the TED2020 corpus (Reimers and Gurevych, 2020) for the m2en setting and the Europarl corpus (Koehn, 2005) for the m2m setting. The TED2020 corpus is extracted from TED talks and covers over 100 languages from around the world. The Europarl corpus is drawn from the proceedings of the European Parliament and contains 21 European languages.
For each language pair, we divide the original corpus into training, dev, and test sets in a 6:2:2 ratio. We further sample subsets from the divided datasets. To simulate the scenario of low-resource and high-resource languages, the training data size of each language pair varies according to the corresponding original corpus size. The specific language pairs we use and their data sizes are shown in Appendix A.
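One way to implement the 6:2:2 split is sketched below; the shuffling seed and helper name are illustrative, not from the paper:

```python
import random

def split_622(corpus, seed=0):
    """Shuffle a language pair's corpus and split it into train/dev/test (6:2:2)."""
    data = list(corpus)
    random.Random(seed).shuffle(data)
    n = len(data)
    n_train, n_dev = int(0.6 * n), int(0.2 * n)
    return data[:n_train], data[n_train:n_train + n_dev], data[n_train + n_dev:]

train, dev, test = split_622(range(1000))
print(len(train), len(dev), len(test))  # 600 200 200
```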
In the m2en setting, for the clustering strategies based on language families/groups and random shuffling, clustering is only applied to the encoder, and all clients share the decoder's parameters because their target languages are all English. For gradient-based clustering, however, we also cluster the decoder's parameters, with the number of groups kept the same as in the encoder. In the m2m setting, clustering is conducted for both the encoder and the decoder. Meanwhile, the numbers of groups in the encoder and decoder are the same across all clustering strategies. We provide a further analysis of the clustering strategies in the m2m setting in § 4.2.
We choose the BLEU score as the evaluation metric, computed with the SacreBLEU (Post, 2018) package. Aside from the BLEU score on each language pair, we additionally report the macro-average and micro-average scores over all language pairs.

Baselines
We evaluate the following methods as baselines:
Centralized-model. The results of centralized training, where data from all clients is gathered together, using the original multilingual model without extra modules.
Centralized-adapter.The results of centralized training using the multi-language model with adapter modules.
Adapter-local. We train a model for each client using local data, without parameter aggregation with other language pairs.
Model-fed. We train the original multilingual model without adapter modules under the federated learning framework, where parameters are shared among all clients using the aggregation algorithm in Eq. (3) without any clustering strategy.
Adapter-fed. In this method, adapter modules are attached to the backbone model, while the remaining settings are the same as in model-fed. This baseline corresponds to directly introducing adapters without any clustering strategy.

Training Setup
We choose the mBART-50 pre-trained model as our backbone. To fairly compare the training and communication costs of different methods, we train each model for 5 rounds. We select the checkpoint with the lowest loss on the dev set and evaluate it on the test set. Parameters are aggregated every time all clients finish an epoch of local training. For every client, the batch size is 8 and the local model is updated every 16 steps. The local learning rate is 5 × 10⁻⁵ for the mBART model and 1 × 10⁻³ for the models with adapter modules. The hidden size of the adapter modules is 64.
For all experiments, we train the model with 3 random seeds and report the average scores.For the random clustering strategy, the clustering groups are different when using different random seeds.

Primary Results and Findings
The experimental results in the m2en and m2m settings are shown in Table 1 and Table 2, respectively. In general, directly adding adapter modules leads to a performance drop (comparing adapter-fed and model-fed), and all methods with clustering strategies achieve better performance than the direct baseline adapter-fed, indicating the ability of our clustering strategies to alleviate data discrepancy. In both settings, the adapter-families method performs best in macro- and micro-average scores among the three clustering strategies, even surpassing model-fed in the m2m setting.
It is noteworthy that the clustering strategies achieve more significant performance improvements in the m2m setting than in the m2en setting. The problem of conflicting parameters is more severe in the m2m setting because more languages, especially target languages, are involved. Thus, introducing clustering strategies brings more benefit to m2m translation tasks.
Meanwhile, we notice that our clustering strategies fail to beat adapter-local in the m2m setting. This can also be explained by the difference in task difficulty. In the more complicated m2m translation task, more elaborate clustering strategies should be designed to fully take advantage of other language pairs while avoiding the influence of conflicting parameters. Nevertheless, we bring the ability of multilingual translation to these clients through FL with an acceptable drop in performance compared to adapter-local.

Ablation Study
In the m2m setting, the clustering of the adapter modules attached to the encoder and the decoder is independent. To further explore the specific influence of clustering these two modules on the model's performance, we apply the clustering strategy to only the encoder and only the decoder separately and show the results in Figure 3. We use language families/groups as the clustering strategy. All methods are trained in the m2m setting on the Europarl dataset with one random seed. Other settings are the same as in our main experiments.

To our surprise, clustering in either the encoder or the decoder alone significantly improves performance compared to no-sharing strategies, even surpassing adapter-families. We attribute this phenomenon to our naive clustering strategies for the encoder and the decoder. In adapter-families, the clustering of the encoder and the decoder depends only on the source and target languages, respectively. However, parameters in both the encoder and the decoder are influenced by the source and target languages together during training. This inconsistency between the clustering strategy and the parameter updates results in adapter-families performing worse than adapter-encoder and adapter-decoder.

Case Study
We select some representative translation cases from the adapter-families and adapter-fed methods, shown in Table 3, to further study the influence of our clustering strategies. We separate the mistakes into three categories.

Opposite semantics
In cases 1 and 2, adapter-fed misses negation adverbs in its predictions, resulting in completely opposite semantics.
Inaccurate words/phrases
In case 3, adapter-fed translates the adverbial of time as "a day or two", which should actually be "four or five weeks". In case 4, adapter-fed uses the expression "save the world", which differs semantically from the original expression "change the world".
Ambiguous semantics
In cases 5 and 6, adapter-fed loses specific semantic information present in the ground truths. It fails to properly translate "children", "science", and "become a learner", using more ambiguous expressions instead.
In comparison, adapter-families makes more accurate predictions in the above cases, which suggests that appropriate clustering strategies help the model produce better translations with improvements in semantics.

Both FedMean and Clustering Contribute
In our experiments, discrepancies in data come from two aspects: data quantity skew and linguistic discrepancy (language difference). We adjust the aggregation algorithm from Eq. (2) to Eq. (3), which we call FedMean here, to tackle the problem of quantity skew. Moreover, we propose clustering strategies to prevent clients from receiving conflicting parameters from dissimilar languages. To explore how these two methods contribute to the improvement in performance, we further conduct experiments with the aggregation algorithm changed back to FedAvg (see Eq. (2)), keeping the other training settings unchanged.
As the results in Table 4 show, clustering strategies bring more significant performance improvements on the Europarl corpus than on the TED2020 corpus. Since the experiments on the Europarl corpus involve more languages and are conducted in the more complicated m2m setting, the problem of linguistic discrepancy is more severe for Europarl. For the TED2020 corpus, changing the aggregation algorithm from FedAvg to FedMean leads to more significant improvements for methods with clustering strategies than for adapter-fed. In contrast, for the Europarl corpus, adapter-fed benefits substantially from FedMean, while FedMean hardly brings any benefit to methods with clustering strategies, even causing performance drops in some cases.
Based on these observations, we contend that both aggregation algorithms and clustering strategies contribute to performance gain by alleviating data discrepancies.The specific extent of improvement depends on the extent of data quantity skew and linguistic discrepancy.

Further Cost Saving by Adapter Pruning
On top of the adapter tuning approach, adapter pruning techniques (Rücklé et al., 2021; Pfeiffer et al., 2021; Karimi Mahabadi et al., 2021) further compress the number of parameters to be updated. To further reduce the communication cost, we conduct an exploratory attempt to prune parts of the adapter modules in both the encoder and the decoder. We still use the mBART-50 model, with 12 layers each in the encoder and the decoder, for these experiments. Specifically, we evenly separate all adapter modules added to mBART into three parts: input-end adapters (adapter modules in the first 4 layers of the encoder or the decoder), middle-layer adapters (layers 5 to 8), and output-end adapters (the last 4 layers). In each strategy, only one part of the adapter modules is kept, so two-thirds of the communication cost is saved. We use adapter-families as the baseline and train all models with one random seed. The rest of the settings stay the same as in the previous main experiments. The results are shown in Table 5.

Table 5: BLEU scores of different adapter pruning strategies. We acquire similar results with only 1/3 of the communication cost compared to keeping all adapter modules.
It is encouraging that pruning adapters does not result in a sharp decrease in performance. We observe that keeping the output-end adapters achieves the highest score among the three pruning strategies, which suggests that adapters in the top layers play more important roles. Overall, the results indicate that it is possible to further reduce communication costs, and it is worthwhile to explore more elaborate pruning techniques in future work.
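The three pruning strategies simply partition the 12-layer stack into thirds and keep adapters in one third. A small helper makes the layer selection explicit (the function name is illustrative, not from the paper):

```python
def kept_adapter_layers(n_layers, part):
    """Return the 0-indexed layers whose adapters are kept under one pruning strategy."""
    third = n_layers // 3
    start = {"input": 0, "middle": third, "output": 2 * third}[part]
    return list(range(start, start + third))

# For mBART-50's 12-layer encoder/decoder:
print(kept_adapter_layers(12, "input"))   # [0, 1, 2, 3]
print(kept_adapter_layers(12, "middle"))  # [4, 5, 6, 7]
print(kept_adapter_layers(12, "output"))  # [8, 9, 10, 11]  -- best in Table 5
```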

Related Work
Federated Learning was first proposed by McMahan et al. (2017) as a decentralized training framework. Due to its decentralized and private nature, FL shows great potential in practical applications. Recently, there has been a surge of interest in the NLP community in applying federated learning to diverse NLP tasks, such as emoji prediction (Gandhi et al., 2022), named entity recognition (Ge et al., 2020), and machine translation (Roosta et al., 2021; Passban et al., 2022; Weller et al., 2022). Roosta et al. (2021) first applied FL to NMT tasks. However, training language models in the FL setting brings huge communication overheads. To address this problem, researchers have proposed to exchange only dedicated "Controller" layers (Roosta et al., 2021) between the server and clients. Moreover, Passban et al. (2022) introduced parameter pruning strategies to reduce communication bandwidth. Our adapter-based methods have an advantage in communication efficiency, with fewer parameters to be transferred compared to Controllers (see Appendix D), and other parameter pruning strategies can also be applied to our adapter modules to further reduce communication costs.

Multilingual Neural Machine Translation
(MNMT) trains a single model to handle translation between multiple language pairs (Johnson et al., 2017; Aharoni et al., 2019; Zhang et al., 2020). MNMT significantly reduces training and inference costs by eliminating the need to train a separate model for each language pair. Massively pre-trained multilingual models have been used for MNMT, such as mBART-50 (Tang et al., 2020) and M2M100 (Fan et al., 2021). In recent years, adapters have become a popular method in MNMT (Bapna and Firat, 2019; Cooper Stickland et al., 2021; Philip et al., 2020; Üstün et al., 2021; Chronopoulou et al., 2022) due to their high parameter efficiency and transferability between tasks.
Different from previous works on this topic, and inspired by recent progress in improving the efficiency of NLP methods (Strubell et al., 2019; Li et al., 2021a; Xu et al., 2021; Li et al., 2021b), we focus on communication efficiency in Fed-MNMT and make the first effort to introduce adapter modules to reduce communication costs. We also apply different clustering strategies to resolve the issue of conflicting parameters stemming from data discrepancy.

Conclusion
In this paper, we introduce adapter modules to PLMs for the Fed-MNMT problem to boost communication efficiency. We reduce the communication cost by over 98% and make the training process of Fed-MNMT practical. To deal with the performance drop after introducing adapter modules, we propose different clustering strategies that separate clients into groups to avoid the negative influence of data discrepancy. We surpass the direct baseline by a substantial margin, especially in the more complicated multi-to-multi translation setting. Furthermore, our analytic experiments indicate that both the aggregation algorithm on the server and the clustering strategies affect the performance of Fed-MNMT. We also explore the possibility of further reducing communication costs by pruning adapter modules and find that adapters in the top layers are more significant for translation performance.
In future work, we will explore better-designed clustering strategies and attach other parameter-efficient techniques to adapters to further reduce the number of parameters to be transferred.

Limitations
First, in this work, we assume that clustering in the encoder and the decoder is only related to the source and target languages, respectively. In reality, parameters in both the encoder and the decoder are influenced by the source and target languages simultaneously. Therefore, our assumption may lead to a performance drop. In future work, we plan to explore more sophisticated clustering strategies.
Moreover, our adapter-families method depends on prior linguistic knowledge. Its actual effectiveness can be affected by the distribution of language families/groups among clients; our methods mainly apply to comparatively uniform language distributions.
In addition, the effectiveness of our methods on other PLMs remains to be verified. However, it is straightforward to transfer our methods to other models, so this should not be a challenging problem.

A Training Data Sizes
The specific training data size of each language pair is shown in Table 6.

B Complete Results of Ablation Study
We show the complete results on all language pairs of the ablation study in Table 7. We find that applying clustering only to the decoder achieves the highest scores on 7 out of the 12 language pairs. To our surprise, the adapter-families method fails to reach the best average scores, which has been explained in § 4.2 in the main text.

C Uniform Data Distribution
We also conduct experiments on TED2020 with each client owning training data of equal size. The results are shown in Table 8. We observe that the performance gaps between different methods are similar to those in Table 1. Notably, adapter-families beats adapter-random by a slight margin, and both clustering strategies achieve clear performance improvements over the baseline adapter-fed. These empirical results verify that our methods apply to various data distributions.

Figure 1:
Figure 1: Estimated transfer time (bar chart) and the approximate ratio of transfer time to total training time (line chart) in one round for a low-resource client. Our method significantly reduces communication cost and improves training efficiency.

Figure 2:
Figure 2: Communication-efficient Fed-MNMT framework with adapter modules and clients clustering strategies.

Figure 3:
Figure 3: Results of the ablation study. Encoder/Decoder-cluster denotes clustering only in the encoder/decoder; both obtain significant improvements compared to the no-clustering baseline.
Algorithm 1: Inner-cluster Aggregation
Input: encoder and decoder cluster sets G_e and G_d; initial encoder and decoder parameters Θ⁰_e and Θ⁰_d
for each training round t:
    // inner-cluster aggregation of encoder parameters
    foreach g in G_e:
        Θᵗ_{e,g} ← mean of Θᵗ_{e,id} over id in g
        foreach id in g: Θᵗ_{e,id} ← Θᵗ_{e,g}
    // inner-cluster aggregation of decoder parameters
    foreach g in G_d:
        Θᵗ_{d,g} ← mean of Θᵗ_{d,id} over id in g
        foreach id in g: Θᵗ_{d,id} ← Θᵗ_{d,g}

Table 1:
BLEU scores on the TED2020 corpus. Comm. Cost, short for communication cost, denotes the number of parameters communicated between the server and each client. Adapter-random, adapter-gradients, and adapter-families refer to the clustering strategies of random clustering, gradients, and language families/groups, respectively. The best result for each language pair is highlighted in bold (only methods with adapter modules trained in the FL setting are considered).

Table 2:
BLEU scores on the Europarl corpus. The meanings of the symbols are the same as in Table 1.
… somebody tells a lie, they're not just a liar.
adapter-families: I think if somebody tells a lie, they're not just a liar.
adapter-fed: I know that when people lie, they're just lying.

Table 3:
Case study comparing adapter-families and adapter-fed. Obvious translation mistakes are highlighted in red and the corresponding correct translations are highlighted in blue.
… inconsistency problem in heterogeneous federated optimization. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020), December 6-12, 2020, virtual.
Orion Weller, Marc Marone, Vladimir Braverman, Dawn Lawrie, and Benjamin Van Durme. 2022. Pretrained models for multilingual federated learning. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1413-1421, Seattle, United States. Association for Computational Linguistics.
Jingjing Xu, Wangchunshu Zhou, Zhiyi Fu, Hao Zhou, and Lei Li. 2021. A survey on green deep learning. arXiv preprint arXiv:2111.05193.
Biao Zhang, Philip Williams, Ivan Titov, and Rico Sennrich. 2020. Improving massively multilingual neural machine translation and zero-shot translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1628-1639, Online. Association for Computational Linguistics.
Yue Zhao, Meng Li, Liangzhen Lai, Naveen Suda, Damon Civin, and Vikas Chandra. 2018. Federated learning with non-IID data. arXiv preprint arXiv:1806.00582.

Table 6:
Training data sizes of language pairs from the Europarl corpus.

D Comparison to Controllers
Controllers (Roosta et al., 2021) only exchange 8 layers of a 32-layer Transformer (4 from the encoder and 4 from the decoder) between the server and clients, which means they reduce the communication cost by approximately 66% (the number of layers in the original model without Controllers is 24). Compared with Controllers, we introduce adapter modules into Fed-MNMT without the need to define additional layers. Besides, our methods transmit a much smaller number of parameters in client-to-server exchanges than Controllers. Therefore, our proposal is superior to Controllers in terms of communication efficiency.