Condensing Multilingual Knowledge with Lightweight Language-Specific Modules

Incorporating language-specific (LS) modules is a proven method for boosting performance in multilingual machine translation. Like Mixture-of-Experts (MoE), this approach barely inflates FLOPs. However, scaling it to hundreds of languages (experts) tends to be unmanageable due to the prohibitive number of parameters introduced by the full-rank matrices of fully-connected layers. In this work, we introduce the Language-Specific Matrix Synthesis (LMS) method, which constructs LS modules by generating low-rank matrices from two significantly smaller matrices that approximate the full-rank matrix. Furthermore, we condense the multilingual knowledge of multiple LS modules into a single shared module with the Fuse Distillation (FD) technique to improve the efficiency of inference and model serialization. We show that our LMS method significantly outperforms previous LS methods and MoE methods with the same amount of extra parameters, e.g., by 1.73 BLEU points over the Switch Transformer on many-to-many multilingual machine translation. Importantly, LMS achieves comparable translation performance with far fewer parameters.


Introduction
Multilingual models confer the benefit of facilitating cross-lingual learning; however, they also grapple with the issue of language interference (Conneau et al., 2020; Wang et al., 2020a; Shaham et al., 2022). Recent studies aim to alleviate negative language interference through the introduction of language-specific (LS) modules (Zhang et al., 2020; Fan et al., 2020; Zhang et al., 2021; Fan et al., 2021; Pires et al., 2023). In this setup, each language batch is processed by its designated module rather than a shared module. Although this approach is promising and, like Mixture-of-Experts (MoE) (Shazeer et al., 2017; Lepikhin et al., 2021), barely inflates the number of FLOPs, the number of parameters becomes difficult to manage and sometimes impractical when working with a large variety of languages. This is because the fundamental element forming LS or MoE modules is typically the full-rank weight matrix of a densely connected layer, which causes a rapid increase in the number of parameters as the number of languages or experts grows.

In this paper, we first scrutinize the parameter efficiency of language-specific modules from the perspective of using fewer parameters. A necessary question arises (RQ1): can we approximate the original dense weight matrix with substantially fewer parameters? To answer this question, we propose a novel and parameter-efficient method, Language-Specific Matrix Synthesis (LMS), which achieves performance similar to the Switch Transformer even with three to four times fewer LS parameters (as shown in Figure 1).
Then, we further investigate parameter efficiency from the perspective of knowledge density in each LS module. Given recent discoveries that the performance improvement of sparsely activated models diminishes with an increase in the number of experts (Hoffmann et al., 2022; Gao et al., 2022; Xu et al., 2023), we hypothesize that the knowledge in these experts (or LS modules) is overestimated. Hence, we pose another question (RQ2): could a single shared module encapsulate the same level of knowledge as language-specific modules? To address this question, we introduce the Fuse Distillation (FD) method and examine the feasibility of condensing multilingual knowledge into a single module.
Our main contributions are summarized as follows:

• We propose the parameter-efficient and lightweight LMS method, which substantially outperforms previous LS methods and MoE with the same number of, or fewer, parameters, e.g., +1.73 BLEU over the Switch Transformer on OPUS-100 multilingual translation.
• We introduce FD to condense multilingual knowledge from LS modules into a shared module. FD uses only 2M more parameters (a 1% increase) at inference to achieve 65% of the performance gains of the Switch Transformer, which uses 760M more parameters (a 314% increase).

Lightweight LS Modules
In this section, we address RQ1 by constructing LS modules with significantly fewer parameters.

Language-Specific Matrix Synthesis
Language-specific modules are typically composed of linear projections whose weights, in previous studies, are full-rank matrices. We propose the Language-Specific Matrix Synthesis (LMS) method, which forms low-rank matrices to approximate the full-rank ones. This is inspired by the concept of "intrinsic dimension" in pre-trained language models (Aghajanyan et al., 2021; Hu et al., 2021) and "intrinsic rank" in trainable matrices, i.e., the idea that features are learned in a subspace. Specifically, as shown in Figure 2, our LS matrix is derived from the multiplication of an LS 'vertical' matrix with an LS 'flat' matrix. Formally, let W ∈ R^{r×c} be a weight matrix in the model for which we want to build parallel LS matrices of the same size. For each language l_i, i ∈ {1, 2, ..., L}, with L being the number of languages, there exists an LS vertical matrix W_v^{l_i} ∈ R^{r×d} and an LS flat matrix W_f^{l_i} ∈ R^{d×c}, with rank d ≪ min(r, c), whose product W_v^{l_i} W_f^{l_i} approximates the full-rank matrix. Here, we propose two synthesis methods: language-wise and pair-wise synthesis.
Figure 2: The difference between pair- and language-wise synthesis. Language-wise synthesis constructs a low-rank matrix using the vertical and flat matrices derived from the same language. Conversely, pair-wise synthesis forms the matrix by combining the vertical matrix of the source language with the flat matrix of the target language.
Language-Wise Synthesis Most multilingual tasks, such as conventional multilingual question answering, are characterized by a language-monolithic nature: a single example only pertains to a single language, and examples from different languages together build the multilingual data. Under such circumstances, a naive way to assemble a language-specific matrix for a given language l_i is to straightforwardly use its corresponding vertical and flat matrices, such that W^{l_i} = W_v^{l_i} W_f^{l_i}.

Pair-Wise Synthesis Cross-lingual tasks like MMT can also be accomplished using language-wise synthesis, wherein the encoder uses the source-language matrix and the decoder uses the target-language matrix. However, we posit that this is not the optimal strategy for MMT tasks due to the lack of learned bilingual information. Motivated by this, we introduce a pair-wise synthesis method to accommodate the bilingual context of each example in MMT. In this strategy, the language-specific matrix is the composition of the vertical matrix of the source language l_i and the flat matrix of the target language l_j: W^{l_i,l_j} = W_v^{l_i} W_f^{l_j}. The difference between the language-wise and pair-wise synthesis approaches is depicted in Figure 2. In Section 5, we will demonstrate that the pair-wise synthesis approach is more effective.
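The two synthesis strategies can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation; the sizes are hypothetical, and the matrices are randomly initialized purely to show the shapes and the rank bound.

```python
import numpy as np

rng = np.random.default_rng(0)
L, r, c, d = 4, 8, 6, 2  # hypothetical: languages, rows, cols, low rank

# One vertical (r x d) and one flat (d x c) matrix per language.
vertical = [rng.standard_normal((r, d)) for _ in range(L)]
flat = [rng.standard_normal((d, c)) for _ in range(L)]

def language_wise(i):
    # W^{l_i} = W_v^{l_i} @ W_f^{l_i}: both factors from the same language
    return vertical[i] @ flat[i]

def pair_wise(i, j):
    # W^{l_i, l_j} = W_v^{l_i} @ W_f^{l_j}: source vertical, target flat
    return vertical[i] @ flat[j]

W = pair_wise(0, 1)
assert W.shape == (r, c)                      # same size as the full matrix
assert np.linalg.matrix_rank(W) <= d          # but rank at most d
```

Either way, each language contributes only d·(r + c) parameters instead of r·c, which is where the savings come from.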
After deriving a language-specific matrix, we incorporate it into the original full-rank matrix, as opposed to performing an isolated forward pass as in MoE and conventional LS methods. This approach stems from our hypothesis that low-rank matrices alone may not sufficiently facilitate the learning of features. Therefore, given an input x_i associated with a source language l_i and a target language l_j (l_i and l_j are the same for language-monolithic tasks), our modified forward pass yields the output x_o:

x_o = (W + W_v^{l_i} W_f^{l_j}) x_i.   (1)

Where to Implement?
We primarily focus on incorporating language-specific matrices generated with the LMS method into the linear projections of each feed-forward network (FFN) in every transformer layer. Recall from earlier that r and c are the numbers of rows and columns of the matrix, and L is the number of languages. Thus, the total number of language-specific parameters added is 2L · N · d · (c + r), where N represents the number of layers. We also conduct an ablation study examining the performance when implementing LMS in attention layers in Section 6. For initialization, we employ a random Gaussian distribution for the vertical matrices and zeros for the flat matrices, as suggested by Hu et al. (2021).
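The parameter count above is easy to verify numerically. The sizes below are hypothetical (chosen to resemble a transformer-big FFN); the point is only to contrast the 2L·N·d·(c + r) cost of LMS with the 2L·N·r·c cost of building the same modules from full-rank matrices.

```python
def lms_extra_params(L, N, d, r, c):
    # two low-rank pairs per layer (one per FFN projection), each
    # costing d*(r+c) parameters, replicated for every language
    return 2 * L * N * d * (c + r)

def full_rank_extra_params(L, N, r, c):
    # the same LS modules built from full-rank r x c matrices instead
    return 2 * L * N * r * c

# hypothetical sizes: 15 languages, 12 layers, FFN dim r = 4096,
# embedding dim c = 1024, rank d = 32
lms = lms_extra_params(15, 12, 32, 4096, 1024)
full = full_rank_extra_params(15, 12, 4096, 1024)
assert lms == 58_982_400       # ~59M extra parameters
assert full // lms == 25       # full-rank LS needs ~25x more
```

The ratio full/lms reduces to (r·c)/(d·(r + c)), so the savings grow as the rank d shrinks relative to the layer dimensions.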
Can We Fuse Multilingual Knowledge in a Single Module?
In this section, we introduce Fuse Distillation (FD) and use a preliminary experiment to answer RQ2: whether we can condense the multilingual knowledge from language-specific modules into a single module.

Fuse Distillation
Let us first consider a language- (or task-) level MoE (Kudugunta et al., 2021), in which we replace a single FFN layer with L FFN modules, where L is the number of languages, as defined previously.
The slight difference from the original design is that we discard the routing gate and make each expert language-specific, i.e., an expert only serves batches in its corresponding language. Given recent findings that model improvements diminish with an increasing number of experts (Hoffmann et al., 2022; Gao et al., 2022; Xu et al., 2023), we hypothesize that the information contained in the experts is sparse and can be condensed into a shared module.
To fuse the knowledge of the L FFN modules into the shared one, we propose the following training scheme and name this method Fuse Distillation. We first add an additional shared FFN parallel to an existing model with L FFN modules, as shown in Figure 3. During training, each batch undergoes two forward passes and one backward pass. In the first forward pass, the batch is processed through its language-specific FFN module; in the second pass, the batch is routed through the shared FFN. To fuse the language-specific knowledge contained within the L FFN modules into the shared FFN module, a distillation loss between the outputs of the two forward passes is also incorporated:

L_FD = KL(g(p_l) ‖ p_s),   (2)

where p_l denotes the probability output of the LS pass and p_s the output of the shared pass. The function g(·) indicates that gradients are not traced back through its argument, so the shared module learns from the LS modules but the LS modules do not learn from this loss. The backward pass also optimizes the regular Cross-Entropy (CE) training loss between the targets and the predictions. Thus, the total loss is:

L = CE(y, p_l) + CE(y, p_s) + L_FD,

where y denotes the gold labels.
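The combined loss can be sketched framework-agnostically in numpy. This is a minimal illustration with made-up logits, not the paper's implementation: in a real autograd framework, the stop-gradient g(·) would be a `detach()` on the LS distribution, whereas here the comment simply marks where it would go.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl(p, q):
    # mean KL(p || q) over the batch
    return np.sum(p * (np.log(p) - np.log(q)), axis=-1).mean()

def fd_loss(logits_ls, logits_shared, targets):
    p_l, p_s = softmax(logits_ls), softmax(logits_shared)
    n = np.arange(len(targets))
    # CE of both passes against the gold labels
    ce = -np.log(p_l[n, targets]).mean() - np.log(p_s[n, targets]).mean()
    # distillation term: p_l plays the role of g(p_l), i.e. a constant
    # teacher target, so only the shared pass would receive its gradient
    return ce + kl(p_l, p_s)

z = np.array([[2.0, 0.0], [0.0, 2.0]])
t = np.array([0, 1])
# identical passes: KL term vanishes, loss is twice the per-pass CE
assert np.isclose(fd_loss(z, z, t), -2 * np.log(softmax(z)[0, 0]))
```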
Then, during the inference stage, we discard the LS modules; the model only forward-passes through the shared FFN. To evaluate whether the shared FFN has effectively learned all the LS information, we compare its results with those obtained by routing through the LS modules instead.

Preliminary Experiments
Our preliminary experiments are conducted under three settings: (1) Naive MMT: A basic multilingual translation model is trained without any modifications.
(2) FD: This setting utilizes our proposed fuse distillation method.
(3) FD-LS: We train the model with the FD method, but during the inference stage, the input is processed through its language-specific FFN module instead of the shared module as the original language-level MoE did.
We carry out our experiments on the IWSLT benchmarks, focusing on the many-to-many translation paradigm. Following Lin et al. (2021) and Xu et al. (2022), we collect 8 English-centric language pairs from the IWSLT'14 dataset, with sizes ranging from 89K to 169K sentences. We train all methods for the same number of steps and leave detailed training settings to Appendix A. We report sacreBLEU scores (Papineni et al., 2002; Post, 2018) with the FLORES-200 tokenizer (NLLB Team et al., 2022).

Results and Analysis
Overview results of these settings are shown in Table 1. The reported scores are the average of both the xx→en and en→xx directions. As anticipated, after applying language-specific modules to each FFN layer, FD-LS shows considerable enhancements over the naive MMT model (+1.50 BLEU). Importantly, after discarding the LS modules, FD performs only slightly worse than FD-LS (+1.17 vs. +1.50) with far fewer parameters at inference (48M vs. 149M). This observation underscores the feasibility of condensing multilingual knowledge into a single FFN module, thereby reducing the need for a large number of LS parameters at inference.

Combining LMS and FD
We have shown the success of multilingual knowledge condensation via fuse distillation. Interested in further reducing the parameters needed at inference, we incorporate the FD method into LMS. Similar to Section 3.1, apart from the LS vertical and flat matrices, we introduce shared vertical and flat matrices, denoted W_v^{shared} and W_f^{shared}, respectively. To employ fuse distillation, each batch undergoes two forward passes: the first pass goes through the LS matrix W + W_v^{l_i} W_f^{l_j}, while the second traverses the shared matrix W + W_v^{shared} W_f^{shared}. These two passes produce the outputs p_l and p_s, respectively. Because the parameter W is shared across both paths, we use symmetric KL divergence (Jiang et al., 2020) for distillation rather than the one-directional KL divergence:

L_FD = 1/2 (KL(p_l ‖ p_s) + KL(p_s ‖ p_l)).

Thus, the backward pass optimizes both the standard prediction loss and the fuse distillation loss.

Table 1: Average BLEU on IWSLT'14 many-to-many translation. Our proposed FD is able to fuse the majority of knowledge into a single module (+1.17 vs. +1.50) with the same parameters as the naive model during inference.
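The two LMS+FD forward paths and the symmetric KL term can be sketched as follows. All sizes and values are hypothetical, chosen only to show that both paths share the full-rank W while differing in their low-rank pair, and that the distillation term is symmetric in its arguments.

```python
import numpy as np

rng = np.random.default_rng(0)
r, c, d = 8, 6, 2
W = rng.standard_normal((r, c))                      # shared full-rank weight
Wv_ls, Wf_ls = rng.standard_normal((r, d)), rng.standard_normal((d, c))
Wv_sh, Wf_sh = rng.standard_normal((r, d)), rng.standard_normal((d, c))

x = rng.standard_normal(c)
out_ls = (W + Wv_ls @ Wf_ls) @ x     # first pass: LS low-rank pair
out_sh = (W + Wv_sh @ Wf_sh) @ x     # second pass: shared low-rank pair

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sym_kl(p, q):
    # symmetric KL used for distillation between the two passes
    kl = lambda a, b: np.sum(a * (np.log(a) - np.log(b)))
    return 0.5 * (kl(p, q) + kl(q, p))

p_l, p_s = softmax(out_ls), softmax(out_sh)
assert sym_kl(p_l, p_l) == 0.0                # zero for identical outputs
assert sym_kl(p_l, p_s) == sym_kl(p_s, p_l)   # symmetric by construction
```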
In Figure 4, we provide a comprehensive comparison of space complexity for generating extra LS (or expert) modules, among conventional LS modules, Mixture-of-Experts, and our proposed methods.Notably, our methods demonstrate substantial reductions in parameter usage during both training and inference.

Experiments
We evaluate our LMS and LMS+FD methods on three tasks: MMT, MNER, and MQA. Similar to Section 3.2, we have two routing options for the LMS+FD method at inference time: 1) evaluating the model by passing through the shared route (denoted LMS+FD-Share, the default setting), or 2) passing through the language-specific module (denoted LMS+FD-LS). We present results for both routes to show the performance difference between using the condensed module and the original LS modules. Considering the computational cost of MMT, we run all MMT methods once with the same random seed. For the other two tasks, we run experiments with 3 different random seeds and report the average scores. For ease of implementation, we build homogeneous batches (i.e., a batch only contains sentences in one language or one language direction) and only activate the corresponding LS module.

Baselines
We compare our approaches against two strong baselines that incorporate additional parameters to mitigate language interference.

CLSR: The first baseline is Conditional Language-Specific Routing (CLSR) (Zhang et al., 2021), which employs LS linear projections following FFN or attention layers. Following their best settings, we set the budget p = 0.3 for LS routing. The original setting shares LS projections across all encoder or decoder sublayers. We also consider a non-shared version, in which each sublayer has its own LS projection, and denote it CLSR*.
Switch Transformer: We also consider the Switch Transformer (Fedus et al., 2021) as the second strong baseline, which uses similar FLOPs to our methods. We use 16 experts for every two layers, with a gate-balance loss weight of 0.01.

Multilingual Machine Translation
Data and Training settings We concentrate on the many-to-many translation setting, with results reported on two benchmarks. The first is the English-centric IWSLT'14 dataset, as mentioned in Section 3.2. Additionally, we examine the OPUS-100 dataset (Zhang et al., 2020), which encompasses 100 languages in total, including 94 development/test language pairs. We preprocess the data with sentencepiece (Kudo and Richardson, 2018), establishing a vocabulary size of 32K for the IWSLT'14 dataset and 64K for the OPUS-100 dataset. We utilize transformer small and transformer big for IWSLT'14 and OPUS-100, respectively. We fix the training steps for all methods for a fair comparison. For IWSLT'14, we use d = 32 as the rank for the low-rank matrices.
For OPUS-100, we consider three settings: (i) d = 64 to match the parameter size of the Switch Transformer, (ii) d = 16 to match the parameter size of CLSR, and (iii) d = 4 for a very lightweight LS model. The default LMS setting for MMT tasks is pair-wise unless otherwise specified. We discuss more training details in Appendix A.
Evaluation We report results in terms of sacreBLEU (Post, 2018), tokenized with the FLORES-200 tokenizer (NLLB Team et al., 2022), and the win ratio (WR) (Zhang et al., 2020), i.e., the proportion of language pairs on which our method beats the baseline. For IWSLT'14, we report scores averaged over the xx→en and en→xx directions.
For OPUS-100, we split the 94 test language pairs into three groups based on their training data size, as suggested by Zhang et al. (2020): high-resource (> 0.9M, 45 languages), low-resource (< 0.1M, 21 languages), and medium-resource (the rest, 28 languages), and report the averaged scores in each category. We use beam search with a width of 5 and a length penalty of 1.
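The resource bucketing above is a simple threshold rule; a small helper makes it unambiguous. The function name is our own for illustration; the thresholds are the ones stated in the text.

```python
def resource_group(num_train_pairs):
    # thresholds from Zhang et al. (2020): >0.9M high, <0.1M low
    if num_train_pairs > 900_000:
        return "high"
    if num_train_pairs < 100_000:
        return "low"
    return "medium"

assert resource_group(1_000_000) == "high"
assert resource_group(50_000) == "low"
assert resource_group(500_000) == "medium"
```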
LMS performance: Light and Effective LS Modules The primary results for IWSLT'14 and OPUS-100 are presented in Table 2 and Table 3.

Language-Wise or Pair-Wise? We compare language- and pair-wise synthesis on both the IWSLT'14 and OPUS-100 (d = 64) datasets. On average, pair-wise synthesis outperforms language-wise synthesis by 0.27 BLEU points on IWSLT'14 (+1.05 vs. +0.78). Moreover, the pair-wise method (+3.60 and +3.35) also shows superior performance on the OPUS-100 dataset compared with the language-wise one (+2.09 and +2.09). Notably, pair-wise synthesis with d = 16 surpasses language-wise synthesis with d = 64, even though the latter has 4 times more extra parameters. This discovery strongly advocates for the use of pair-wise synthesis over the language-wise approach.
FD performance: Can FD Fuse 95 Languages? On the IWSLT'14 8-language MMT dataset, we observe negligible differences between LMS and LMS+FD (+1.05 vs. +0.88), suggesting successful condensation of the information from the various language-specific modules into the shared module. In the 95-language (94 languages plus English) scenario of OPUS-100, FD with d = 16 utilizes only an additional 2M parameters (less than a 1% increase over the 242M naive model) to attain 65% of the performance improvement of the Switch Transformer (+1.13 vs. +1.75 on average), which requires 760M additional parameters (a 314% increase). While FD may not condense all multilingual information due to restricted parameter capacity, its parameter efficiency is commendable.

Multilingual Named-Entity Recognition
Data and Settings We evaluate our methods on the Wikiann Named-Entity Recognition dataset (Pan et al., 2017). We randomly select 24 languages for the experiments. The model architecture is a pre-trained XLM-R base model with a feed-forward token-level classifier attached. We set the dropout rate to 0.1 and run 20 epochs for all methods. We set d = 32 for the low-rank matrices and report F1 scores.

Results
The overall results are shown in Table 4. When applying LMS to each FFN layer for 24 languages, the model size increases by only 70M while yielding a 0.55 F1 improvement. After implementing LMS+FD, performance improves by 0.67 with the LS route and by 0.33 with the shared route, which requires only an additional 3M parameters. Full results are shown in Appendix B.

Multilingual Question Answering
Data and Settings We pick 6 languages from the TyDiQA (Typologically Diverse Question Answering) Gold Passage task for the MQA experiments (Artetxe et al., 2020). Following Xu and Murray (2022), the subword representations from XLM-R base are input to a span classification head, i.e., a linear layer computing the answer's start and end positions. We set d = 32 for the low-rank matrices, a dropout rate of 0.1, and run 20 epochs.

Results
The overall results are shown in Table 5. Upon applying LMS and LMS+FD, all methods exhibit improved performance with a slight increase in parameters. Notably, LMS+FD-Share outperforms LMS+FD-LS. This suggests that FD may be more effective at fusing knowledge when the number of languages is relatively small. Full results are shown in Appendix C.

Ablation Study

Is LMS Parameter-Efficient?
Here, we examine the parameter efficiency of the LMS method, i.e., whether an increase in extra parameters yields a proportional enhancement in model performance. We conduct experiments with d ranging from 4 to 60 in increments of 8 and observe the resulting performance variations. For comparison, we examine the Switch Transformer with 4, 8, 12, and 16 experts to assess its parameter efficiency. We focus on the MMT task using the OPUS-100 dataset. Due to computational demands, we limit experiments to 15 randomly selected languages from OPUS-100, designated OPUS-15. We leave training details to Appendix D. We report the average BLEU gains over all translation directions in Figure 1. The plot reveals that the LMS curve is steeper than that of the Switch Transformer, indicating a higher parameter efficiency for our method, i.e., it achieves greater model performance with fewer additional parameters. Compared with a 16-expert Switch Transformer, LMS with d = 52 yields similar performance while using 3.7 times fewer parameters (51M vs. 189M). Numeric results are in Appendix E.

Applying LMS to The Attention Layer
In our default design, LMS is applied solely to FFN layers. We are interested in assessing the potential benefits of extending LMS to the attention layers (the K, Q, V, and output projections). We consider three model variants: (1) LMS applied only to FFN layers (the default design), (2) LMS applied only to the attention layers, and (3) LMS applied to both FFN and attention layers. We conduct experiments on OPUS-15 with a fixed rank of d = 20.
We show the averaged BLEU of all translation directions for the three designs in Table 6. LMS applied only to the attention layers yields inferior performance compared to LMS applied only to FFN layers with a similar number of extra parameters. Moreover, applying LMS to both FFN and attention layers results in a marginal improvement over its application solely to FFN layers. This outcome suggests that LS information is primarily situated in FFN layers, aligning with the previous findings of Wang et al. (2020b).

Related Work
Language-Specific Modules To mitigate language interference, previous studies incorporate language-specific modules into models, such as additional language-aware linear projections (Zhang et al., 2020; Fan et al., 2020; Zhang et al., 2021; Fan et al., 2021), LS layer normalization (Zhang et al., 2020), LS feed-forward networks (Kwon and Chung, 2023), or even entire language-dependent transformer layers (Escolano et al., 2021; Wang and Zhang, 2022; Pires et al., 2023). Similar to LS modules, Mixture-of-Experts (MoE) models are also able to reduce language interference (Shazeer et al., 2017; Lepikhin et al., 2021; Fedus et al., 2021; Xu et al., 2023). However, the parameter count of LS (or expert) modules drastically increases when scaling to numerous languages. Zhang et al. (2021) address this issue by sharing all LS modules across all encoder or decoder layers. However, this does not fundamentally resolve the problem, given that the complexity of constructing LS modules remains unaltered and that different layers may need to learn different kinds of LS information.
Lightweight Modules Our proposed techniques draw inspiration from another line of research, lightweight fine-tuning, wherein the model is fine-tuned on a parameter subset significantly smaller than the original model, such as prefix tuning (Li and Liang, 2021), prompt tuning (Lester et al., 2021), multitask prompt tuning (Wang et al., 2023), and LoRA (Hu et al., 2021). In the multilingual machine translation setting, previous studies use language-pair adapters (Bapna and Firat, 2019) to fine-tune a specific direction. This approach extends to language-wise adapters (Philip et al., 2020), language-family adapters (Chronopoulou et al., 2023), and hyper-adapters (Baziotis et al., 2022) to facilitate cross-lingual learning. In light of these efficient lightweight modules, we propose LMS to help LS modules scale to hundreds of languages.

Conclusion
The construction of language-specific modules (or experts) using full-rank matrices tends to be parameter-intensive and inefficient, especially as the number of languages (or experts) increases. To address this, we have introduced the Language-Specific Matrix Synthesis (LMS) method, which approximates the original full-rank matrix. Notably, pair-wise synthesis, a variant of the LMS method, exhibits commendable performance in MMT tasks. Further, we have proposed the Fuse Distillation (FD) approach to condense multilingual information into a shared module, thereby further diminishing parameter requirements during inference. Our methods outperform CLSR and the Switch Transformer in MMT tasks and also demonstrate their effectiveness in MNER and MQA tasks.

Limitations
One limitation of our LMS method is that it necessitates homogeneous batches, i.e., batches containing sentences exclusively in one language or language direction. This limitation could potentially be addressed by implementing AllToAll communication among devices, a strategy already widely employed in Mixture-of-Experts (MoE) models (Lepikhin et al., 2021), which we intend to explore in future research. In each forward pass of an FFN layer, we need an additional step that multiplies two small matrices to create the low-rank large matrix. The cost of this operation is negligible: the FLOPs per token of a feed-forward linear projection with input dimension c and output dimension r is O(r · c), while constructing the low-rank matrix with rank d costs O(d · (r + c)). For example, in our ablation study with r = 2048, c = 512, and d = 20, the synthesis step is (2048 × 512) / (20 × (512 + 2048)) ≈ 20 times cheaper than the projection itself. In terms of actual training time, no significant differences were observed; the discrepancy was less than 1 second per 100 updates. Additionally, a potentially effective strategy to enhance multilingual information encapsulation in FD could be to use a larger shared module relative to the lightweight LS modules. This could be an intriguing avenue for future research.
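The cost comparison above is pure arithmetic and can be checked in a few lines, using the same sizes as the ablation study:

```python
r, c, d = 2048, 512, 20   # sizes from the ablation study
proj_flops_per_tok = r * c            # the FFN linear projection itself
synth_flops = d * (r + c)             # one-off cost of building W_v @ W_f
ratio = proj_flops_per_tok / synth_flops
assert round(ratio) == 20             # synthesis is ~20x cheaper
```

Note the synthesis cost is paid once per batch (the low-rank matrix is built, then reused for every token), so in practice the relative overhead is even smaller than this per-token ratio suggests.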

A Training Details for IWSLT'14 and OPUS-100
To balance the training data, we over-sample low-resource languages with a temperature of T = 5 (Aharoni et al., 2019) for the OPUS-100 data and T = 2 for the IWSLT'14 data. We preprocess the data with sentencepiece (Kudo and Richardson, 2018), establishing a vocabulary size of 32K for the IWSLT'14 dataset and 64K for the OPUS-100 dataset. We prepend a special language-id token to the source sentence to indicate the target language. We build homogeneous batches (i.e., a batch only contains sentences in one language direction) and only activate the corresponding language-specific matrix. We set the dropout rate to 0.1 for both datasets. For the IWSLT'14 dataset, we fix the training steps at 150K with 8K warm-up steps for all methods, with a batch size of 4096 tokens.
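The temperature-based over-sampling above can be sketched as follows. The function name is ours; the formula is the standard one from Aharoni et al. (2019): sampling probabilities proportional to the dataset shares raised to the power 1/T, so larger T flattens the distribution toward low-resource languages.

```python
import numpy as np

def temperature_sampling_probs(sizes, T):
    # p_i ∝ (n_i / Σ_j n_j)^(1/T); T = 1 is proportional sampling,
    # larger T up-samples low-resource languages
    p = np.asarray(sizes, dtype=float)
    p /= p.sum()
    q = p ** (1.0 / T)
    return q / q.sum()

sizes = [1_000_000, 10_000]                      # hypothetical corpus sizes
prop = temperature_sampling_probs(sizes, T=1)
flat = temperature_sampling_probs(sizes, T=5)
assert prop[1] < flat[1]                         # low-resource up-sampled
assert np.isclose(flat.sum(), 1.0)
```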
For OPUS-100, we fix the training steps at 100K with 8K warm-up steps for all methods, with a batch size of 4096 tokens and gradient accumulation over 4 steps. We train all models on 4 RTX 6000 GPUs. For the IWSLT'14 dataset, we employ the transformer small model (FFN dimension 1024, embedding dimension 512), while the transformer big model (FFN dimension 4096, embedding dimension 1024) is used for the OPUS-100 dataset. The maximum learning rate is 0.0005. The optimizer is Adam (Kingma and Ba, 2014) with the inverse_sqrt learning rate scheduler and a weight decay of 0. We use beam search with a width of 5 and a length penalty of 1.

B Full Results for MNER
We show the full results of MNER in Table 7.

C Full Results for MQA
We show the full results of MQA in Table 8.

D Training Details for The Ablation Study
We randomly pick 15 languages from the OPUS-100 data to build a smaller 15-language data (OPUS-15) for the ablation study: eu, pt, bg, sk, zh, sl, de, hr, nb, ga, rw, as, fy, mr, se.We conduct the ablation study under the many-to-many translation settings.To balance the training data, we sample the data with a temperature of T = 5.
We preprocess the data with sentencepiece (Kudo and Richardson, 2018), establishing a vocabulary size of 32K. We fix the training steps at 50K with 8K warm-up steps for all methods, with a batch size of 4096 tokens. We employ the transformer base model (FFN dimension 2048, embedding dimension 512) for training on the OPUS-15 dataset. The other settings are the same as in Appendix A.

E Numeric Results for The Ablation Study
Figure 1 shows the averaged BLEU over all directions. Here, we show the detailed numeric results in Table 9.

Figure 1 :
Figure 1: We show the BLEU gains of the LMS method and the Switch Transformer as the model's parameters increase in our multilingual translation ablation study. The LMS method notably outperforms the Switch Transformer with similar extra LS (expert) parameter counts, achieving comparable performance even with four to five times fewer parameters.

Figure 3 :
Figure 3: We utilize a language-level MoE architecture to verify the feasibility of fusing multilingual knowledge from all language-specific modules into a single shared module. During training, each batch goes through the LS module in the first forward pass and through the shared module in the second pass. Then, we conduct distillation between the two outputs to condense the knowledge into the shared module. For inference, we discard the LS modules and only use the shared module.

Figure 4 :
Figure 4: Suppose we incorporate additional language-specific (LS) linear projections into a layer. We compare the space complexity of the extra LS parameters (or experts) needed across all methods for both the training and inference phases. Let L = 15 denote the number of languages, r = 4096 the output dimension, c = 1024 the input dimension, E = 8 the number of experts for Mixture-of-Experts (MoE), and d = 32 the rank of the low-rank matrices. The number adjacent to each dashed line is the parameter count calculated from these sample values. One can observe that Language-Specific Matrix Synthesis (LMS) requires significantly fewer LS parameters than the other methods during training, and fuse distillation (FD) demands a substantially reduced number of additional parameters during the inference stage.
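The per-layer counts in Figure 4 can be reproduced from the formulas implied in the text. This is our own reconstruction using the sample values from the caption, assuming one extra matrix per language (conventional LS), one per expert (MoE), one low-rank pair per language (LMS), and a single shared low-rank pair for FD at inference.

```python
L, r, c, E, d = 15, 4096, 1024, 8, 32

ls_full = L * r * c        # conventional LS: a full-rank matrix per language
moe = E * r * c            # MoE: a full-rank matrix per expert
lms = L * d * (r + c)      # LMS: one vertical + one flat matrix per language
fd_infer = d * (r + c)     # FD at inference: a single shared low-rank pair

assert lms < moe < ls_full          # LMS is the lightest training option
assert fd_infer == lms // L         # FD inference keeps 1/L of LMS params
print(ls_full, moe, lms, fd_infer)  # ~63M, ~34M, ~2.5M, ~0.16M
```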

Table 2 :
Overall BLEU results on IWSLT'14 many-to-many translation. LMS outperforms all baselines. At inference, LMS+FD-Share utilizes an extra 1M parameters to exceed baselines that enlarge the model size by 2 or 3 times.

Table 3 :
BLEU scores on OPUS-100 many-to-many translation. LMS with d = 64 outperforms all baselines on average. LMS+FD-Share with d = 16 uses 1% more parameters and achieves 65% of the BLEU gains averaged over all directions, compared to the Switch Transformer, which uses 314% more parameters.

Table 4 :
The overall MNER results (F1 score) between baseline and our three proposed methods.

Table 5 :
The overall MQA results (F1 score) between baseline and our three proposed methods.


Table 6 :
The average BLEU gains with three different LMS designs with a fixed rank d = 20.

Table 8 :
Full results for the MQA task. We report F1 scores.

Table 9 :
The numeric results for Figure 1.