Multilingual Lottery Tickets to Pretrain Language Models

The curse of multilinguality in multilingual pretrained language models (mPLMs) refers to negative interference between languages, which is especially severe when capacity is limited. While increasing the capacity may appear an intuitive remedy, it raises both training and inference costs. Our distinction is pursuing two competing goals: reducing negative interference while keeping the capacity per language roughly the same. Specifically, we first scale the model to reduce interference, then search for a per-language subnetwork, or lottery ticket, with performance comparable to the full model. By the lottery ticket hypothesis, this scale-then-find-ticket approach alleviates interfering signals as in the scaled model, yet redistributes parameters so that the per-language parameter count stays reduced. Finally, to avoid the cost of repeated retraining when searching multilingual tickets, we explore zero-shot neural architecture search (NAS) methods and identify the most appropriate one for finding multilingual tickets. Our multilingual tickets reduce the inference cost of the model for each language while boosting performance. The ticket search cost is negligible, and the tickets found qualitatively preserve linguistic similarity. Our code is publicly available.

However, when more languages are covered, or parameters are limited, negative interference between languages (Wang et al., 2020b) has been observed, known as the curse of multilinguality (Conneau et al., 2020).
For the problem of signals from languages interfering with each other, a naïve solution is to increase the capacity per language (Conneau et al., 2020; Pfeiffer et al., 2022). For example, Figure 1b illustrates adding parameters per language on top of the shared model. This design intuitively improves the average performance for each language and alleviates gradient conflict, but at the cost of enlarging the per-language parameter size, which affects both the training and inference costs of the model.
Our distinction is keeping the per-language capacity similar while reaching the same goal. To achieve the competing goals of improving performance and mitigating gradient conflict, without increasing per-language capacity, we invoke the lottery ticket hypothesis (Frankle and Carbin, 2019). The lottery ticket hypothesis claims that dense models contain subnetworks, called "winning tickets", whose performance is at least similar to that of the full model. This hypothesis has been empirically verified for popular language model architectures (Prasanna et al., 2020; Chen et al., 2020a; Zheng et al., 2022).
Our key idea is to search for a ticket per language, i.e., multilingual tickets, that achieve all of the competing goals, by scaling the model and then searching for such tickets. The parameters of each language are redistributed in the scaled model to keep the per-language capacity unaffected by scaling. To illustrate, in Figure 1c, each of l_1 and l_2 maintains the same capacity of 9 parameters before (Figure 1a) and after scaling (Figure 1c). This will i) improve performance, since the performance of the tickets will be similar to that of the scaled model, by the lottery ticket hypothesis. Moreover, disseminating the parameters will also ii) mitigate the negative interference. However, finding multilingual tickets in a scaled model incurs a prohibitive cost: even to find a single ticket, Frankle and Carbin (2019) train the model multiple times. To overcome this, we interpret the problem as neural architecture search (NAS) (Zoph and Le, 2017; Liu et al., 2019) per language. In particular, we explore recently emerging zero-shot NAS (Abdelfattah et al., 2021; Javaheripi et al., 2022; Shu et al., 2022), which aims to search architectures without training cost. We identify the most appropriate zero-shot NAS method to remove the burden of finding multilingual tickets.
Finally, to verify our claims, we devise a metric to measure negative interference during multilingual pretraining. Our measurements indicate that multilingual tickets decrease negative interference. Our experiments show that the multilingual tickets we found increase task performance while maintaining capacity, as expected. In our qualitative analysis, the locality of subnetworks also preserves linguistic similarity within the same language family.
Our contributions can be summarized as follows: • We propose a novel method to alleviate negative interference during multilingual pretraining: Searching multilingual tickets.
• We explore the most appropriate zero-shot NAS method to remove the cost of finding multilingual tickets.
• Experiments show that multilingual tickets do alleviate negative interference, increasing the task performance while keeping the capacity and computational complexity.
• Our code is publicly available.

Preliminaries

Negative Interference in mPLM
Pretraining an mPLM on corpora of N languages aims to leverage positive cross-lingual transfer (Pires et al., 2019; Devlin et al., 2019; Conneau and Lample, 2019). However, Wang et al. (2020b) unveil negative interference in mPLMs. Specifically, they measure the interference in bilingual pretraining (N = 2) using the cosine similarity between gradients (Yu et al., 2020), originally devised to measure interference in multitask training. First, the total loss of multitask training L(θ) can be denoted as a sum of per-task losses L_i(θ). Then the total gradient decomposes as follows:

∇L(θ) = Σ_i g_i(θ),    (1)

where g_i(θ) = ∇L_i(θ). Two tasks i and j are considered to interfere with each other if the cosine similarity between g_i and g_j is low. Wang et al. (2020b) use this metric to measure the interference in bilingual pretraining as follows:

gc(θ) = cos(g_1(θ), g_2(θ)) = g_1(θ) · g_2(θ) / (∥g_1(θ)∥ ∥g_2(θ)∥),    (2)

where g_1 and g_2 are the gradients from each language. In this measure, the interference increases as gc(θ) becomes lower; it is maximized when gc(θ) is −1, indicating that g_1(θ) and g_2(θ) point in opposite directions, canceling each other's updates. Wang et al. (2020b) reveal that gc(θ) is lower in bilingual pretraining (N = 2) than in monolingual pretraining (N = 1), implying that negative interference happens during multilingual pretraining.
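The gradient cosine metric gc above can be sketched in a few lines. This is an illustrative helper, not the authors' released code; it takes pre-computed, flattened gradient vectors from two per-language losses.

```python
import numpy as np

def gradient_cosine(g1, g2):
    """gc(theta): cosine similarity between flattened gradients of two
    per-language losses. Values near -1 mean the two languages pull the
    shared parameters in opposite directions (negative interference)."""
    g1, g2 = np.ravel(np.asarray(g1, float)), np.ravel(np.asarray(g2, float))
    return float(g1 @ g2 / (np.linalg.norm(g1) * np.linalg.norm(g2)))
```

In practice the two gradients would come from one backward pass per language on the same parameters, then be concatenated across all parameter tensors before computing the cosine.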
Our strategy is to scale up the model and then find lottery tickets that reduce interference, as shown in the following subsection.

Lottery Ticket Hypothesis
The lottery ticket hypothesis (Frankle and Carbin, 2019) states that every dense model contains some subnetwork whose performance is at least similar to that of the dense model. Formally, given the initial parameters θ of the dense model, let p(θ) be the performance of the network. The lottery ticket hypothesis is denoted as follows:

∃ m ∈ {0, 1}^{dim θ} : p(m ⊙ θ) ≥ p(θ),    (3)

where m ⊙ θ denotes the pruned subnetwork, and dim θ denotes the dimensionality of θ. They name m a "winning ticket" of the given dense network. Frankle and Carbin (2019) find the ticket by iterative train-and-prune: in each stage, they train the model starting from the subnetwork of the previous stage, then prune some of the parameters with the least magnitudes. Repeating this expensive process over multiple stages yields the final ticket m; the procedure is known as iterative magnitude pruning.
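The iterative train-and-prune loop can be sketched as follows. This is a minimal illustration, not the original implementation: `train_fn` is a placeholder standing in for a full training run, which is exactly the cost we avoid later with zero-shot NAS.

```python
import numpy as np

def iterative_magnitude_pruning(theta0, train_fn, stages=3, prune_frac=0.2):
    """Sketch of iterative magnitude pruning (Frankle & Carbin, 2019).

    theta0: initial dense weights (flat array).
    train_fn(theta, mask): placeholder for a full (expensive) training
    run; returns the trained weights. Each stage prunes the
    lowest-magnitude surviving weights and rewinds the rest to theta0.
    """
    mask = np.ones_like(theta0)
    theta = theta0.copy()
    for _ in range(stages):
        trained = train_fn(theta, mask)          # expensive: full training
        alive = np.abs(trained[mask == 1])
        cutoff = np.quantile(alive, prune_frac)  # prune smallest magnitudes
        mask[np.abs(trained) < cutoff] = 0
        theta = theta0 * mask                    # rewind survivors to init
    return mask
```

The repeated calls to `train_fn` are what make finding a ticket per language prohibitive at N ≈ 100 languages.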
Our key idea is to find a ticket per language, i.e., a subnetwork m_i ⊙ θ for every language l_i ∈ {l_1, ..., l_N}. However, running iterative magnitude pruning over the N ≈ 100 languages in our target problem of multilingual pretraining incurs a prohibitive cost. We thus formulate our search as a zero-cost neural architecture search (NAS) per language, removing the cost of searching the architectures.
Since NAS itself is expensive, reducing its search cost has attracted keen research interest (Pham et al., 2018; Liu et al., 2019). Recently, zero-shot NAS (Abdelfattah et al., 2021; Mellor et al., 2021; Javaheripi et al., 2022) emerged, whose goal is to make the search cost almost negligible. We study how to leverage zero-shot NAS for our purpose of finding a subnetwork per language.
Proposed Method: Multilingual Lottery Tickets with Zero-Shot NAS

This section presents our method: scaling the model and then finding a ticket m_i per language l_i (§3.1), leveraging zero-shot NAS techniques (§3.2).

Scale-then-Search Multilingual Tickets
Lower Interference, Higher Performance Allowing more space for each language to operate without interfering with each other has been shown to be beneficial in previous studies (Conneau et al., 2020;Pfeiffer et al., 2022).However, such a change would increase the total parameter size, which in turn increases both training and inference costs.
Our distinction is to first scale the baseline model, specifically by increasing the number of layers from the initial η_0 (3 in Figure 1a) to η_s (4 in Figure 1c). We then redistribute the per-language parameters by finding subnetworks that maintain the initial per-language parameter size.
However, would scaling the model from θ to θ′ and then finding the lottery ticket m′ from θ′ perform better than p(θ)? To answer this question, we reinterpret Equation 3. Since scaling (He et al., 2016; OpenAI, 2023) is a trustworthy method to improve performance, i.e., p(θ′) > p(θ), it has the effect of raising the lower bound on the performance p(m′ ⊙ θ′).
In conclusion, once we successfully identify per-language tickets from the scaled model, not only will the interference be alleviated, but the performance will also be enhanced. To ensure similar per-language capacity and computational complexity during the search procedure, it is crucial to carefully design the search space before applying NAS techniques.

NAS Search Space
Our goal in finding multilingual tickets is to keep the computational complexity and capacity per language unaffected by scaling.
First, regarding the computational complexity of matrix multiplication, we prune the subnetwork by masking rows of the key matrix K and the corresponding columns of the query matrix Q. Since masking an entire row of K (and the matching column of Q) removes that dimension from the matrix multiplication, we constrain the masking rules to such structured patterns with lower multiplication costs. Similarly, we mask the rows of the value matrix V together with the columns of the following linear layer W_0, and the rows of W_1 together with the columns of W_2. The total masked capacity is constrained to be similar to the baseline model.
Formally, given input h_i, we mask the layers to get the output h_o as follows:

h_a = W_0 (μ_v • (V h_i)) smax((Q h_i)^T (μ_q • (K h_i)) / √d),
h_o = W_2 ϕ(μ_w • (W_1 h_a)),

where • represents a scalar product along rows (i.e., row-wise masking), smax denotes the softmax function, and ϕ is the activation function.
Second, regarding controlling capacity, we let the NAS algorithm search through candidate tickets m = [μ_q; μ_v; μ_w], where ; denotes concatenation. To maintain capacity and computational complexity, we set r = |m|_0 / dim(m) ≈ η_0 / η_s, where |·|_0 denotes the number of non-zero components.
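The search space above can be illustrated with toy dimensions. The shapes and mask names below are ours, for illustration only; the point is that the ticket is the concatenation of the three structured masks, and its density r is held near η_0/η_s.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32  # toy dimensions, not the real model's

# One binary mask per masked component: attention dimensions (mu_q),
# value dimensions (mu_v), and feed-forward dimensions (mu_w). Masking
# a row of K (and the matching column of Q) removes that dimension's
# contribution entirely, so matrix-multiplication cost shrinks with it.
mu_q = rng.random(d_model) < 0.75
mu_v = rng.random(d_model) < 0.75
mu_w = rng.random(d_ff) < 0.75

# The ticket is m = [mu_q; mu_v; mu_w]; its density r = |m|_0 / dim(m)
# is constrained to roughly eta_0 / eta_s (e.g. 12/16), so the scaled
# model pruned to r keeps the original per-language capacity.
m = np.concatenate([mu_q, mu_v, mu_w])
r = m.sum() / m.size
print(f"ticket density r = {r:.2f}")
```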

Zero-Shot NAS
Choosing Zero-Shot NAS On the defined search space, we search for tickets while minimizing the search cost. For this goal, we need to choose a specific zero-shot NAS algorithm. To narrow down the candidates, we describe the characteristics the zero-shot NAS method must have.
(a) Input-adaptive: Our goal is to adapt the subnetwork m_i ⊙ θ to each language. However, recent zero-shot NAS methods (Tanaka et al., 2020; Javaheripi et al., 2022; Zhou et al., 2022; Shu et al., 2022; Sun et al., 2022), such as Synflow or TF-TAS, aim to find a task-specific structure that remains invariant to inputs. For input-invariance, they use inputs filled with 1s, which runs counter to our purpose of finding different structures for different language inputs. We thus resort to input-adaptive zero-shot NAS methods.
(b) Transformer-friendly: Our search space is based on transformers (Vaswani et al., 2017). However, some NAS methods (Mellor et al., 2021; Lin et al., 2021a) rely on particular attributes of CNNs, since these methods were mainly developed for searching CNN architectures. For example, JACOB_COV (Mellor et al., 2021) assumes the network uses ReLU (Fukushima, 1969), which is typically true for CNNs, whereas NLP models based on the transformer architecture (Devlin et al., 2019; Brown et al., 2020; OpenAI, 2023) mostly use GeLU (Hendrycks and Gimpel, 2016) instead. Such approaches are not practical for our goal.
To this end, we choose SNIP (Lee et al., 2019) as the most appropriate zero-shot NAS method for searching multilingual tickets. SNIP calculates the score of each mask unit m_k as follows:

S(m_k) = |∂L/∂m_k|,    (4)

Note that this equation coincides with importance (Molchanov et al., 2019) when we measure the scores of m only (Michel et al., 2019). Importantly, this score depends on the given inputs, which satisfies (a). Moreover, SNIP (or importance) has use cases in transformer architectures (Michel et al., 2019; Prasanna et al., 2020), which supports (b).
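A sketch of the SNIP score computation, under the standard identity that at m = 1 the chain rule gives ∂L/∂m_k = θ_k · ∂L/∂θ_k, so the saliency follows from an ordinary gradient without materializing mask variables. The helper name is ours.

```python
import numpy as np

def snip_scores(theta, grad_theta):
    """SNIP saliency per parameter (Lee et al., 2019).

    theta: initial weights; grad_theta: gradient dL/dtheta from a batch
    of the target language. Since dL/dm_k = theta_k * dL/dtheta_k when
    all mask units equal 1, the score is |theta_k * grad_k|.
    """
    return np.abs(np.asarray(theta) * np.asarray(grad_theta))
```

Because the score is computed from gradients on real per-language batches, the resulting ranking differs per language, which is exactly the input-adaptivity property (a).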

Searching Multilingual Tickets with SNIP
With the NAS search space defined and the zero-shot NAS method chosen, we search a ticket per language l_i, i.e., determine the subnetwork m_i ⊙ θ with the highest SNIP score (Eq. 4). Formally, we maximize as follows:

m_i = argmax_{m_i} Σ_k m_{i,k} S(m_{i,k})  s.t.  |m_i|_0 / dim(m_i) = r,    (5)

Since S(m_{i,k}) is always non-negative, and m_{i,k} ∈ {0, 1}, maximizing Σ_k m_{i,k} S(m_{i,k}) amounts to setting m_{i,k} = 1 when S(m_{i,k}) is within the top r% of the SNIP values. Therefore, once we collect the gradients ∂L/∂m_{i,k} with the initial weights θ and input data from l_i, we can easily decide the ticket m_i.
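The top-r% selection can be sketched as below; because every score is non-negative and each mask unit is binary, the constrained maximization reduces to a threshold. The function name is ours, for illustration.

```python
import numpy as np

def find_ticket(scores, r):
    """Keep the top-r fraction of mask units by SNIP score.

    scores: per-unit SNIP values for one language; r: target density
    (about eta_0 / eta_s). Returns a binary ticket vector.
    """
    k = int(round(r * scores.size))
    ticket = np.zeros(scores.size, dtype=int)
    ticket[np.argsort(scores)[-k:]] = 1  # indices of the k largest scores
    return ticket
```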

On Stably Measuring Negative Interference
Though a commonly used measurement of negative interference is the cosine similarity between gradients (Eq. 2), denoted as gc (orange) in Figure 2, this measurement is highly variant: the metric changes significantly over training steps. Such variance makes comparing the reported values at any given step highly unreliable. We thus propose to compare a cumulative metric: the cosine similarity between the cumulative parameter updates attributed to each language. Fortunately, in our case, since each batch consists of a single language, we can easily decompose the total updates by language. Formally, suppose language l^(t) is used at step t. The cumulative update from language l_i up to step T is

u_i(θ_T) = Σ_{t=1}^{T} 1(l^(t) = l_i) δ_t,    (6)

where 1(l^(t) = l_i) is 1 if and only if l^(t) = l_i, and δ_t is the parameter update at step t. We then define our metric for negative interference as

uc_{i,j}(θ_T) = cos(u_i(θ_T), u_j(θ_T)),    (7)

and regard the negative interference between languages l_i and l_j as larger when uc_{i,j}(θ_T) is smaller. Our proposed metric compares the accumulated effect of the influences, which is more tolerant to step-level noise: in Figure 2, we show that the relative difference of the cumulative metric uc (red) is much stabler than gc (orange). Thus, we use the metric uc in the following experimental section.
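The cumulative metric uc can be sketched as below. This is an illustrative helper, not the paper's code: it takes a log of flattened per-step update vectors and the language used at each step, which is recoverable because every batch contains a single language.

```python
import numpy as np

def cumulative_update_cosine(updates, langs, li, lj):
    """uc_{i,j}: cosine similarity between the summed parameter updates
    attributed to languages li and lj over all training steps.

    updates: list of flattened per-step update vectors delta_t.
    langs:   language used at each step t (same length as updates).
    """
    ui = sum(d for d, l in zip(updates, langs) if l == li)
    uj = sum(d for d, l in zip(updates, langs) if l == lj)
    return float(ui @ uj / (np.linalg.norm(ui) * np.linalg.norm(uj)))
```

Summing before taking the cosine is what smooths out the per-step noise that makes the raw gradient-cosine metric gc so unstable.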

Experiments
In this section, we tackle the following research questions:
• RQ1: Do multilingual tickets improve performance?
• RQ2: Are the searched multilingual tickets better than random tickets?
• RQ3: Do multilingual tickets mitigate negative interference?
• RQ4: Is our method better than an input-invariant NAS method?
• RQ5: How much more computation would be needed to match our improvement with naïve mBERT?

Figure 2: Relative difference of metrics for negative interference per every 10 steps in mBERT pretraining. Our proposed metric (red) is much stabler than the cosine similarity between gradients (orange).

Experimental Settings
Unlabeled Datasets and Languages for Pretraining We utilize Wikipedia dumps of the same languages that mBERT (Devlin et al., 2019) used, extracting articles with WIKIEXTRACTOR.

Task Datasets and Languages for Evaluation
We focus on evaluating in-language performance using the XTREME benchmarks (Hu et al., 2020; Ruder et al., 2021). Since we focus on in-language performance, we deal with the NER and POS tasks, which are the only tasks available for tens of languages in the benchmarks. Moreover, for reliable evaluation, we restrict ourselves to languages with a sufficient amount of train and test data. This results in evaluating 42 languages over 14 language families and 1 isolate: af, ar, az, bg, bn, de, el, en, es, et, eu, fa, fi, fr, he, hi, hu, id, it, ja, ka, kk, ko, lt, ml, mr, ms, nl, pl, pt, ro, ru, sw, ta, te, th, tl, tr, uk, ur, vi, zh.
Implementation Details The baseline dense model follows the BERT architecture (Devlin et al., 2019). We use η_0 = 12 layers with 768 hidden units, where the intermediate linear layer W_1 expands the dimension to 3072. We scale the model so that η_s becomes 14 or 16. When searching for multilingual tickets, we accumulate gradients over 2.5M tokens to calculate the SNIP values (Eq. 4). This takes only about 2-3 minutes on a single RTX 3090 per language.
To pretrain the model, we follow the default setting described in Devlin et al. (2019). We oversample low-resource languages with an exponential smoothing factor of 0.7. We use a learning rate of 1e-4, a batch size of 256, and update for 1M steps. We use a sequence length of 128 for 90% of the updates and 512 for the last 10%. Pretraining is conducted on a TPUv3-8.
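The exponential smoothing used for oversampling can be sketched as follows; raising each language's raw corpus share to a power below 1 and renormalizing up-weights low-resource languages. The helper name is ours.

```python
import numpy as np

def smoothed_sampling_probs(token_counts, alpha=0.7):
    """Exponentially smoothed language sampling distribution.

    token_counts: raw per-language corpus sizes. With alpha < 1 (0.7 in
    our setup), low-resource languages get a larger sampling share than
    their raw proportion, mitigating data imbalance.
    """
    p = np.asarray(token_counts, dtype=float)
    p = p / p.sum()       # raw corpus shares
    p = p ** alpha        # smooth: compress the gap between languages
    return p / p.sum()    # renormalize to a distribution
```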
Our evaluation settings largely follow the XTREME benchmarks (Hu et al., 2020; Ruder et al., 2021). We fine-tune the pretrained models with a batch size of 32 and a learning rate of 2e-5, generally for two epochs. But for NER, since the training dataset for some languages is scarce, we ensure that parameters are updated for at least 2500 iterations. We run 5 times per language, and report the average score over all languages. Fine-tuning is conducted on an RTX 3090.
Comparisons We compare the following models: • mBERT: Naïve multilingual pretraining with the baseline dense model architecture.
• random tickets: Multilingual pretraining utilizing the randomly selected tickets on the defined search space.
• multilingual tickets: Multilingual pretraining with the multilingual tickets found on the defined search space.

RQ1: Effectiveness of Multilingual Tickets
The results presented in Tables 1 and 2 demonstrate the effectiveness of our proposed multilingual ticket variants compared to other approaches. For example, multilingual tickets (η_s = 16) outperform mBERT while requiring fewer FLOPs for inference.

RQ2: Effectiveness of Chosen Zero-Shot NAS
To demonstrate the efficacy of our chosen zero-shot NAS approach, we compare it with randomly selected tickets from the same search space. Tables 1 and 2 highlight the substantial performance improvement achieved by our cost-free search method; for instance, random tickets (η_s = 16) even suffer performance degradation compared to mBERT, while our multilingual tickets outperform it.

RQ3: Alleviated Interference by multilingual tickets
To confirm that the discovered multilingual tickets mitigate negative interference, we compare the average uc (Eq. 7) values. In a controlled experiment, we search for multilingual tickets on the baseline model without scaling. With a setting of r = 0.85, we calculate the average uc for each layer at 10K update steps, considering all languages.
As depicted in Figure 3, the uc values of the multilingual tickets are higher compared to mBERT, indicating a reduction in negative interference.
RQ4: Superiority of Input-Adaptive NAS We emphasize that the selection of input-adaptive NAS is important. We establish another comparison against multilingual tickets built with a representative input-invariant zero-shot NAS method, magnitude pruning (Frankle et al., 2021). We set η_s to 16 for both methods.
Table 3 shows that SNIP, the input-adaptive zero-shot NAS method, outperforms magnitude pruning, the input-invariant zero-shot NAS method. This highlights that input-adaptive NAS is essential.

RQ5: Computational Efficiency of Our Method
To express our improvement in terms of computational cost, we compare against a scaled version of naïvely pretrained mBERT, whose hidden units we scale by 10%.
Table 4 shows that even when we add 10% of FLOPs to the baseline, multilingual tickets outperform it.

Analysis: Ticket Similarity and Language Relatedness
We further analyze the effectiveness of our multilingual tickets and provide deeper insights into their characteristics. Our hypothesis is that multilingual tickets capture language relatedness, as related languages are less prone to negative interference (Wang et al., 2020b) and benefit more from positive transfer (Pires et al., 2019; Khemchandani et al., 2021; Muller et al., 2021). The parameter redistribution strategy (Figure 1c), if effective, should thus favor sharing parameters among related languages.
To investigate whether our multilingual tickets exhibit such characteristics, we project the found tickets of each language using UMAP (McInnes et al., 2018). For comparison, we also project randomly selected tickets. We identify language families using the Glottolog database (Hammarström et al., 2021) to list related languages. Figure 4 illustrates the results: our multilingual tickets (bottom) show similarity among languages from the same language family (dots with the same color), in stark contrast to the random tickets (top). For example, among our multilingual tickets, those of the artificially constructed languages Volapük (vo) and Ido (io) are close to each other, while they are positioned at the left and right edges among the random tickets. Natural languages belonging to the Germanic family exhibit similar behavior.
One might question the Balto-Slavic languages as a counter-example, as they appear in two clusters at the bottom of Figure 4.However, with a more fine-grained taxonomy, we can further verify that the multilingual tickets correlate with linguistic genealogy.The Balto-Slavic languages on the left side belong to the East Slavic or Eastern South Slavic languages, while the languages on the right side belong to the West Slavic, Western South Slavic, and Baltic languages.One exception is Serbian (sr), which belongs to Western South Slavic but is closer to the left cluster.We hypothesize that since Serbian shares dialects, such as Torlak, with some Eastern South Slavic languages (Kortmann and van der Auwera, 2011), it is positioned closer to the left cluster rather than other Western South Slavic languages.
In summary, our multilingual tickets automatically learn language relatedness from only a few unlabeled language tokens at negligible computational cost. This finding greatly benefits multilingual pretraining.

Subnetworks for mPLMs
Several previous works have explored language-specific subnetworks in mPLMs. For the translation task, Lin et al. (2021b) identify language-specific subnetworks based on the magnitude of neurons after some updates, then further fine-tune them with the data of specific language pairs. Xie et al. (2021) investigate important neurons using an importance score, then follow a similar approach. However, all of these assume that the negative interference incurred in the pretraining stage can be alleviated in a post-hoc manner, which is challenged by Pfeiffer et al. (2022). Our distinction is eradicating the interference from the pretraining stage, such that the post-hoc methods can be applied complementarily to our approach. The closest work to ours is S3Net (Lu et al., 2022), which applies language-specific subnetworks to pretrained mPLMs for automatic speech recognition (ASR). However, it requires significant computational cost to find such subnetworks, whereas our method removes the need for an expensive subnetwork search by using zero-shot NAS.

NAS for Pretrained Language Models
After automatically searched neural networks (Zoph et al., 2018; Real et al., 2019) succeeded in outperforming manually designed architectures, researchers also applied NAS to NLP models (So et al., 2019; Wang et al., 2020a; So et al., 2021; Gao et al., 2022; Javaheripi et al., 2022). While most of these works focus on optimizing models for a single language, DARTS-ASR (Chen et al., 2020b) leverages a successful NAS method, DARTS (Liu et al., 2019), to automatically search a multilingual ASR model for four languages. Similarly, Tsai et al. (2020) perform NAS to find a shared architecture for multilingual corpora. In contrast, our approach leverages NAS techniques to address negative interference by searching for language-specific tickets, rather than for the shared model part, which previous works overlooked. Moreover, to the best of our knowledge, we are the first to explore zero-shot NAS methods for mPLMs.

Conclusion
This paper studied the curse of multilinguality by balancing the conflicting goals of reducing negative interference while maintaining similar per-language capacity. We proposed a scale-then-search approach: searching for per-language subnetworks, or lottery tickets, in a scaled model, which improves performance without increasing per-language capacity. We keep the cost of finding such tickets negligible by adopting a zero-shot NAS method. Our results show that our method reduces negative interference as expected, and the tickets discovered qualitatively preserve linguistic relatedness.

Limitation
Generalization to Unseen Languages Our research primarily focuses on the effectiveness of multilingual tickets and their impact on reducing interference. However, the performance and generalizability of our approach to languages unseen during the pretraining stage may be limited. Further investigation and adaptation of the method for resource-poor settings are necessary.
Fine-Grained Language Relatedness While we use language relatedness to qualitatively analyze the multilingual tickets found, this notion may not capture fine-grained variations in language relatedness; more sophisticated qualitative analyses may require additional research.

Figure 1 :
Figure 1: Comparison between various multilingual pretraining methods. (a): Naïve multilingual pretraining, where all the languages share all parameters; it suffers negative interference between languages. (b): Increasing per-language capacity. (c): Keeping per-language capacity while mitigating the interference between languages.

Figure 4 :
Figure 4: Random tickets (top) and the found multilingual tickets (bottom) projected via UMAP. Each color corresponds to a specific language family. Multilingual tickets from the same language families are similar to each other.

Table 3 :
Comparison between multilingual tickets with input-invariant zero-shot NAS (magnitude pruning) and input-adaptive zero-shot NAS (SNIP).We report the averaged NER and POS F1 score over all languages.

Table 4 :
Comparison between multilingual tickets and scaled mBERT.We report the averaged NER and POS F1 score over all languages.