MAD-G: Multilingual Adapter Generation for Efficient Cross-Lingual Transfer

Adapter modules have emerged as a general parameter-efficient means to specialize a pre-trained encoder to new domains. Massively multilingual transformers (MMTs) have particularly benefited from additional training of language-specific adapters. However, this approach is not viable for the vast majority of languages, due to limitations in their corpus size or compute budgets. In this work, we propose MAD-G (Multilingual ADapter Generation), which contextually generates language adapters from language representations based on typological features. In contrast to prior work, our time- and space-efficient MAD-G approach enables (1) sharing of linguistic knowledge across languages and (2) zero-shot inference by generating language adapters for unseen languages. We thoroughly evaluate MAD-G in zero-shot cross-lingual transfer on part-of-speech tagging, dependency parsing, and named entity recognition. While offering (1) improved fine-tuning efficiency (by a factor of around 50 in our experiments), (2) a smaller parameter budget, and (3) increased language coverage, MAD-G remains competitive with more expensive methods for language-specific adapter training across the board. Moreover, it offers substantial benefits for low-resource languages, particularly on the NER task in low-resource African languages. Finally, we demonstrate that MAD-G's transfer performance can be further improved via (i) multi-source training, i.e., by generating and combining adapters of multiple languages with available task-specific training data, and (ii) further fine-tuning of generated MAD-G adapters for languages with monolingual data.


Introduction
Multilingual NLP has witnessed large advances, with cross-lingual word embedding spaces (Mikolov et al., 2013; Artetxe et al., 2018; Glavaš et al., 2019) and, more recently, massively multilingual Transformers (MMTs) like mBERT (Devlin et al., 2019), XLM-R (Conneau et al., 2020), and mT5 (Xue et al., 2021) as the main vehicles of cross-lingual transfer. Although MMTs display impressive (zero-shot) cross-lingual transfer abilities (Pires et al., 2019; Wu and Dredze, 2019), their performance has been shown to drop when the target language is typologically distant from the source language, or when the size of its pretraining data is limited (Hu et al., 2020; Lauscher et al., 2020). In addition, their coverage of the world's languages, and consequently the range of language technology applications they can support, remains insufficient. Adapters (Rebuffi et al., 2017; Houlsby et al., 2019) have been proposed as a parameter-efficient means to extend multilingual models to underrepresented languages (Bapna and Firat, 2019; Üstün et al., 2020). The general practice is to train a language adapter on unlabeled data for each language (Pfeiffer et al., 2020b) via masked language modeling (MLM). However, this generally requires substantial amounts of monolingual data, which prevents adapters from serving under-resourced languages, where such additional language-specific capacity would be most useful.
To address this deficiency, we propose multilingual adapter generation (MAD-G), a novel paradigm that enables the generation of adapters for low-resource languages by sharing information across languages. Instead of learning separate adapters for each language, MAD-G leverages contextual parameter generation (CPG; Platanios et al., 2018a; Ponti et al., 2019b): it learns a single model that can generate a language adapter for an arbitrary target language. At the core of MAD-G is a contextual parameter generator which takes the typological vector of a language as input and outputs the parameters of the language-specific adapter. The generator's parameters are trained via MLM on the Wikipedias of 95 languages, selected to maximize linguistic diversity. Unlike prior CPG work (Platanios et al., 2018a; Üstün et al., 2020), MAD-G generates language adapters that are task-agnostic, thus allowing for efficient and modular cross-lingual transfer across the board, i.e., the MAD-G language adapters can be leveraged in arbitrary downstream tasks (Pfeiffer et al., 2020b).

Figure 1: Cross-lingual transfer with MAD-G. MAD-G training: the generator component learns to generate language-specific adapters given URIEL vectors of input languages; the parameters of the generator are trained with an MLM objective, where instances of the respective language are passed through the frozen Transformer layers and the generated adapter parameters. In downstream task fine-tuning, both the Transformer weights and the weights of the generated source-language adapter are frozen; an additional task adapter with randomly initialized weights is placed on top of the generated source-language adapter. During target-language downstream inference, the generated source-language adapters are replaced with the generated target-language adapters.
MAD-G shares information across languages (i) at the level of hidden representations, by sharing the parameters of the adapter generator, as well as (ii) at the typological level, by conditioning on features from the URIEL database (Littell et al., 2017). The latter additionally enables zero-shot transfer to unseen languages. Further, we propose a variant of MAD-G in which adapters are additionally conditioned on their Transformer layer position (see Section 3.2), allowing MAD-G to be much more parameter-efficient than the adapter-based transfer methods of prior work.
In experiments on zero-shot cross-lingual transfer for part-of-speech tagging (POS), dependency parsing (DP), and named entity recognition (NER), MAD-G demonstrates performance competitive with the training of more expensive language-specific adapters and shows strong results in low-resource scenarios, e.g., on the NER task for African languages.
What is more, we show that transfer performance can be further improved by (a) multilingual training of task adapters and (b) fine-tuning of generated MAD-G adapters, via MLM, on small amounts of monolingual data. Finally, we provide a nuanced analysis of transfer performance to unseen languages, highlighting the importance of the diversity of the language sample selected for pretraining.

Background
Before introducing MAD-G in detail in Section 3, we recapitulate its key components adopted from previous work. In particular, we discuss language adapters (LAs) in Section 2.1 and Contextual Parameter Generation (CPG) in Section 2.2.

(Why) Language Adapters
Massively multilingual models infamously suffer from the 'curse of multilinguality' (Arivazhagan et al., 2019; Conneau et al., 2020): for a fixed model capacity, their performance decreases as they cover more languages. Extending them to underrepresented and unseen languages is far from trivial: additional training (of all model parameters) for such languages can lead to catastrophic forgetting of the previously acquired knowledge (McCloskey and Cohen, 1989; Santoro et al., 2016). A common remedy for both the coverage-performance trade-off and the limited flexibility is to allocate additional model parameters to individual languages. This is typically achieved through adapter layers (Houlsby et al., 2019; Pfeiffer et al., 2020b). In particular, a language adapter is a lightweight component inserted into an MMT such as mBERT (Devlin et al., 2019) or XLM-R (Conneau et al., 2020) with the purpose of specializing the MMT for a particular language, in order to either (a) support a new language not covered by the MMT's original multilingual pretraining (Pfeiffer et al., 2020b; Artetxe et al., 2020) or (b) recover/improve the performance for a particular (resource-rich) language (Bapna and Firat, 2019; Rust et al., 2021). In this work, we adopt the competitive and lightweight (so-called bottleneck) adapter variant of Pfeiffer et al. (2021a). There, only one adapter module, consisting of a successive down-projection and up-projection, is injected per Transformer layer, after the feed-forward sublayer (see Figure 1). The language adapter LA_b at the b-th Transformer layer/block performs the following operation:

LA_b(h_b, r_b) = a(h_b D_b) U_b + r_b,   (1)

where h_b and r_b are the Transformer hidden state and the residual at layer b, respectively, D_b ∈ R^{h×m} and U_b ∈ R^{m×h} are the down- and up-projections, respectively (h being the Transformer's hidden layer size, and m the adapter's bottleneck dimension), and a(·) is a non-linear activation function. The residual connection r_b is the output of the Transformer's feed-forward layer, whereas h_b is the output of the subsequent layer normalisation. The parameters of a language adapter are learned through MLM, with the original parameters of the MMT kept frozen (Pfeiffer et al., 2020b).
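For concreteness, the following is a minimal PyTorch sketch of such a bottleneck adapter (our own illustration; class and variable names are ours, with mBERT's hidden size h = 768 and the bottleneck size m = 384 used later in this paper as defaults):

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Pfeiffer-style bottleneck adapter: down-project, non-linearity,
    up-project, plus the residual from the feed-forward sublayer (Eq. 1)."""

    def __init__(self, hidden_size: int = 768, bottleneck: int = 384):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)  # D_b
        self.up = nn.Linear(bottleneck, hidden_size)    # U_b
        self.act = nn.ReLU()                            # a(.)

    def forward(self, h: torch.Tensor, residual: torch.Tensor) -> torch.Tensor:
        # h: output of the layer normalisation after the feed-forward sublayer
        # residual: output of the feed-forward sublayer itself (r_b)
        return self.up(self.act(self.down(h))) + residual
```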

(Why) Contextual Parameter Generation
Language adapters are an instance of a common design pattern in multilingual NLP: training a separate model or model component for each target language. This instance-per-language approach has two crucial drawbacks: (1) the total training time and the number of learned parameters increase linearly with the number of languages; (2) there is no information sharing across languages, since the learned parameters are completely independent, which prevents low-resource languages from benefiting from their typological and genealogical ties to high(er)-resource languages. CPG is a technique introduced by Platanios et al. (2018a) to address these drawbacks. While originally conceived for neural machine translation (NMT), CPG can be applied to any neural model f parameterized by θ for which we aim to learn parameterizations for a number of different contexts; in multilingual NLP, these "contexts" are languages. In the instance-per-language approach, an independent parameterization θ(l), l ∈ {1, ..., n_l}, is learned for each of the n_l languages of interest.
In CPG, the only language-specific parameters that we learn are low-dimensional language embeddings λ(l) ∈ R^{d_l}. These are used by the generator g, a hyper-network (Ha et al., 2017) component with its own parameterization φ, to produce the language-specific parameterization of the main model: θ(l) = g_φ(λ(l)). While g can in principle be any differentiable function (i.e., an arbitrarily deep neural model), in practice it is typically set to a simple linear projection (i.e., φ = W):

θ(l) = W λ(l),   (2)

where W ∈ R^{n_p × d_l} is a learnable weight matrix, n_p being the number of parameters of f. The total number of parameters learned when training n_l independent models is n_l · n_p, whereas the number of parameters in the W matrix is d_l · n_p. Therefore, neglecting the small number of parameters dedicated to language embeddings, the CPG approach uses fewer parameters whenever d_l < n_l. More importantly, in multilingual training the generator matrix W is shared across all languages, which enables knowledge sharing across languages and leads to improved transfer performance.

Platanios et al. (2018b) and Ponti et al. (2021a) opt for randomly initializing the language embeddings λ(l) and learning them end-to-end. Specified like this, however, CPG cannot generalize to languages unseen in training, as it lacks embeddings for those languages at inference. To support generalization to arbitrary new languages, one must ground language embeddings in some external language representation that is available for many languages. To this end, Ponti et al. (2019b) exploit typological language vectors from the URIEL database (Littell et al., 2017) directly as language embeddings to generate a full set of model parameters. In a similar vein, Üstün et al. (2020) use the typological language vectors from URIEL to generate task- and language-specific adapters for dependency parsing: they learn the parameters φ of the generator g via multilingual dependency parsing training on 13 languages. In contrast, MAD-G's multilingual MLM training allows the generation of task-agnostic LAs that can support downstream cross-lingual transfer for arbitrary NLP tasks.

2 According to Pfeiffer et al. (2020a, 2021a) and Rücklé et al. (2021), such an architecture with a single adapter per Transformer layer is more parameter-efficient while performing on par with the architecture of Houlsby et al. (2019), which uses two adapters per Transformer layer (one after the multi-head attention sublayer and one after the feed-forward sublayer).

3 Other examples include the training of language-specific pretrained language models (Rust et al., 2021) as well as language pair-specific encoder-decoder models for machine translation (Luong et al., 2016; Firat et al., 2016).
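As an illustration (ours, not the original implementation), the linear CPG generator of Eq. (2) with learned per-language embeddings can be sketched as:

```python
import torch
import torch.nn as nn

class LinearCPG(nn.Module):
    """Linear contextual parameter generator: theta(l) = W @ lambda(l)."""

    def __init__(self, n_langs: int, emb_dim: int, n_params: int):
        super().__init__()
        # One learned embedding per language (the Platanios et al. variant);
        # this version cannot generalize to languages unseen in training.
        self.lang_emb = nn.Embedding(n_langs, emb_dim)         # lambda(l)
        self.W = nn.Linear(emb_dim, n_params, bias=False)      # generator matrix W

    def forward(self, lang_id: torch.Tensor) -> torch.Tensor:
        # Returns a flat parameter vector theta(l) of size n_params,
        # to be reshaped into the weights of the main model f.
        return self.W(self.lang_emb(lang_id))
```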

MAD-G: Methodology
MAD-G aims to enable resource-efficient adaptation of MMTs to a wide range of previously unseen, radically resource-poor languages, thereby contributing to more sustainable (Strubell et al., 2019; Moosavi et al., 2020) and more inclusive NLP (Joshi et al., 2020). We couple (i) the computational efficiency of lightweight adapters (cf. Section 2.1) and (ii) the knowledge-sharing and zero-shot language transfer capabilities of CPG (cf. Section 2.2) with (iii) external linguistic (i.e., typological) knowledge (Ponti et al., 2019a), towards supporting arbitrary NLP tasks for (even radically) resource-poor languages.
MAD-G mitigates important limitations of prior work. Unlike Üstün et al. (2020), we generate task-agnostic LAs, (re)usable across NLP tasks. Unlike the MAD-X framework (Pfeiffer et al., 2020b), which trains LAs independently for each language (requiring sufficient monolingual corpora), MAD-G can support unseen and resource-poor languages in downstream tasks by generating LAs from typological vectors. Moreover, MAD-G leverages typological relations between languages. We also show that the two approaches can be successfully combined: monolingual MLM fine-tuning of a MAD-G-generated LA yields further benefits.

Generating Language Adapters
Our input representation for each language is a sparse typological vector t(l) encompassing 289 binary linguistic features (103 syntactic, 28 phonological, and 158 phonetic features) from the URIEL language typology database (Littell et al., 2017). We obtain the language embedding λ(l) from t(l) using a single-layer linear down-projection: λ(l) = V t(l), with the parameter matrix V ∈ R^{d_l × 289}. Down-projecting to a dimension d_l ≪ 289 prevents W from becoming impractically large. By grounding language embeddings in external expert linguistic knowledge (i.e., URIEL vectors), we enable generalization to all languages for which such typological vectors exist, regardless of the availability of monolingual text in those languages for generator training. In multilingual MLM training, we generate the adapter parameters θ(l) for each instance from the embedding of the respective language, as specified in Eq. (2). Let n_b be the number of layers in the MMT (e.g., for mBERT (Devlin et al., 2019), n_b = 12). The MAD-G parameter matrix W then has n_b · 2 · h · m × d_l parameters, where h is the hidden size of the Transformer layer and m the bottleneck size of the adapter layer (i.e., a single adapter module has 2 · h · m parameters).
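A minimal sketch of the resulting MAD-G generator (class names and plumbing are ours; dimensions follow the training setup described later, with m = 384 and d_l = 32):

```python
import torch
import torch.nn as nn

class MADGGenerator(nn.Module):
    """Sketch: generate all per-layer adapter weights from a URIEL vector."""

    def __init__(self, uriel_dim: int = 289, lang_dim: int = 32,
                 n_layers: int = 12, hidden: int = 768, bottleneck: int = 384):
        super().__init__()
        self.n_layers, self.h, self.m = n_layers, hidden, bottleneck
        n_params = n_layers * 2 * hidden * bottleneck          # all D_b and U_b
        self.V = nn.Linear(uriel_dim, lang_dim, bias=False)    # lambda(l) = V t(l)
        self.W = nn.Linear(lang_dim, n_params, bias=False)     # theta(l) = W lambda(l)

    def forward(self, t: torch.Tensor):
        # t: URIEL typological vector of shape (289,)
        theta = self.W(self.V(t))                              # flat parameter vector
        theta = theta.view(self.n_layers, 2, self.h, self.m)
        # Per layer: D_b in R^{h x m}, U_b in R^{m x h}.
        return [(theta[b, 0], theta[b, 1].transpose(-1, -2))
                for b in range(self.n_layers)]

# With mBERT (n_b = 12, h = 768), m = 384, and d_l = 32, the matrix W alone
# has 12 * 2 * 768 * 384 * 32 ≈ 226M parameters, matching the count reported
# in the training setup section.
```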

Factoring Out Layer Embeddings
By factoring out language-specific embeddings λ(l), we force the MAD-G parameters W to share knowledge across languages. The generated language adapters at different Transformer layers are, however, still mutually independent. By additionally factoring out representations of the Transformer layer indices into layer embeddings λ(b) ∈ R^{d_b}, b ∈ {1, 2, ..., n_b}, we can condition the adapter generation not only on languages but also on layers. This has two potential benefits: (i) it allows for information sharing between adapters of different layers, and, more importantly, (ii) it substantially reduces the size of the generator W. In this model variant, dubbed MAD-G-LS, the generator outputs adapters θ(l,b) for language-layer pairs:

θ(l,b) = W [λ(l); λ(b)],   (3)

with the concatenation of the language embedding λ(l) and the layer embedding λ(b) as input. The MAD-G-LS generator has 2 · h · m × (d_l + d_b) parameters, which is, assuming language and layer embeddings of equal size (i.e., d_b = d_l), a parameter reduction by a factor of n_b/2 compared to the base MAD-G configuration from §3.1.
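The parameter savings can be verified with a quick calculation (a sketch using the configuration reported in the training setup below):

```python
# Generator parameter counts, using the paper's reported configuration:
# mBERT layers n_b = 12, hidden h = 768, bottleneck m = 384, d_l = d_b = 32.
n_b, h, m, d_l, d_b = 12, 768, 384, 32, 32

mad_g = n_b * 2 * h * m * d_l           # base MAD-G generator W
mad_g_ls = 2 * h * m * (d_l + d_b)      # MAD-G-LS generator (shared across layers)

print(f"MAD-G:    {mad_g / 1e6:.0f}M")              # ~226M
print(f"MAD-G-LS: {mad_g_ls / 1e6:.1f}M")           # ~37.7M
print(f"reduction factor: {mad_g / mad_g_ls:.1f}")  # n_b / 2 = 6.0
```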

Multi-Source Task Adapters
Once the multilingual adapter generator has been trained via multilingual MLM, the generated LAs can be used to facilitate downstream cross-lingual transfer. Here, we follow the task-specific fine-tuning setup of MAD-X (Pfeiffer et al., 2020b): we insert and train the task-specific adapter (TA) on top of the language adapter of the source language; the parameters of the LA, as well as the parameters of the original MMT, are kept frozen. In prior work, the TA is trained on data from a single source language l_s with the LA for l_s activated (with frozen parameters). At inference time, the LA for the target language l_t is plugged in instead of l_s's adapter, with the same TA (Pfeiffer et al., 2020b).
In downstream tasks with task data in multiple languages, we can resort to multi-source transfer, i.e., multilingual training of the task adapter. This is possible with per-language trained LAs (e.g., MAD-X adapters) as well as without any LAs. We hypothesized that multi-source training would be particularly beneficial with MAD-G because of the knowledge shared between the LAs of different languages, which are all produced by MAD-G's multilingual generator. In other words, with MAD-G, multi-source task adapter training is supported by a single LA generator model (see Figure 1), rather than by a set of independently trained LAs. However, our experiments show that multi-source training is greatly beneficial regardless of the language adapter type; the advantage does not appear larger for MAD-G in particular.
We employ a straightforward approach to TA training on the set of source languages L_s: in each step, we (1) randomly select a language l from L_s and sample a training batch from it, and (2) in the forward pass, before the task adapter, activate the LA of language l for that batch (see the sketch below). To the best of our knowledge, we are the first to investigate multi-source adapter-based transfer in cross-lingual settings.
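A sketch of this multi-source training loop (our illustration; `generator`, `set_language_adapter`, and the batch plumbing are hypothetical names, not an actual API):

```python
import random
import torch

def train_task_adapter(model, generator, task_adapter, source_langs,
                       uriel, task_batches, n_steps=15_000):
    """Multi-source TA training sketch: sample a source language, activate its
    generated (frozen) language adapter, and update only the task adapter."""
    optimizer = torch.optim.AdamW(task_adapter.parameters(), lr=5e-5)
    for _ in range(n_steps):
        lang = random.choice(source_langs)        # (1) sample a language from L_s
        with torch.no_grad():                     # generated LA weights stay frozen
            la_params = generator(uriel[lang])
        model.set_language_adapter(la_params)     # hypothetical plumbing
        batch = next(task_batches[lang])          # training batch in that language
        loss = model(batch, task_adapter=task_adapter).loss  # (2) LA feeds the TA
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```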

Experimental Setup
Tasks and Languages. We evaluate on three downstream tasks which provide sufficient evaluation data for low-resource languages: part-of-speech (POS) tagging and dependency parsing (DP), both on the Universal Dependencies (UD) 2.7 dataset (Zeman et al., 2020), and named entity recognition (NER) on the MasakhaNER dataset for African languages (Adelani et al., 2021). For POS and DP, we evaluate on a substantial subset of all UD languages with available treebanks. We discern between three language groups in evaluation, with some examples in Table 1: (i) mBERT-seen languages are those included in mBERT's pretraining; (ii) MAD-G-seen languages were not part of mBERT's pretraining but are included in MAD-G training; and (iii) unseen languages are those included neither in mBERT pretraining nor in MAD-G training.

Baselines and MAD-G Variants
mBERT is an MMT pretrained on the Wikipedias of 104 languages. We use mBERT as the base MMT for MAD-G. XLM-R is a state-of-the-art MMT pretrained on the CommonCrawl data of 100 languages (Conneau et al., 2020). We evaluate both MMTs in the standard transfer setup with full-model fine-tuning (-ft).
MAD-X is the state-of-the-art modular adapter-based framework for cross-lingual transfer (Pfeiffer et al., 2020b), based on independent MLM training of a dedicated LA for each language. We train our own MAD-X LAs when no pretrained ones are available, notably for the six MAD-G-seen UD languages. Training LAs for all other low-resource languages, however, is prohibitively computationally expensive, so in all MAD-X experiments the pool of languages with available MAD-X adapters consists of the 20 high-resource source languages used in the multi-source setups (see Section 4.2) and the MAD-G-seen languages. When evaluating on a target language without an available MAD-X LA, we instead choose the available MAD-X LA of the language closest to the target language.

MAD-G is the base setup of our method from Section 3.1. MAD-G-LS is the variant of MAD-G in which the adapter generation is additionally conditioned on layer embeddings, as described in Section 3.2. MAD-G-en uses the English adapter rather than that of the target language during inference on target-language instances. The purpose of this baseline is to test whether the parameters generated for different languages are actually meaningfully different and able to outperform the English LA.
TA-only trains the task adapter directly on top of the MMT, i.e., without any language adapter. With this baseline, we seek to quantify the contribution of dedicated LAs in general.

9 Although XLM-R mostly outperforms mBERT in multilingual and cross-lingual transfer experiments, mBERT was used in prior work as a more robust choice for radically resource-poor languages in general (Pfeiffer et al., 2020b). Our NER experiments on African languages confirm this (Table 3). Note that MAD-G can be applied to XLM-R as well.

10 Note that this efficiency and scalability shortcoming of MAD-X is precisely one of the main motivations for MAD-G, i.e., for language adapter generation for unseen languages.

11 We quantify the linguistic proximity of languages as the cosine similarity between their respective URIEL-based language vectors (Lauscher et al., 2020).
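The nearest-language fallback of footnote 11 is straightforward to reproduce (a sketch; `uriel_vec` is a hypothetical lookup returning a language's URIEL typological vector):

```python
import numpy as np

def closest_language(target: str, candidates: list[str], uriel_vec) -> str:
    """Pick the candidate whose URIEL vector is most cosine-similar to the target's."""
    t = uriel_vec(target)

    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    return max(candidates, key=lambda lang: cos(t, uriel_vec(lang)))
```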

MAD-G Training Setup
MLM training of MAD-G's adapter generator is run on the Wikipedias of 95 languages. We considered only languages with at least 1,000 Wikipedia articles and selected them following a greedy process that maximizes typological diversity: at each step, we select the language with the largest number of articles belonging to the language family and genus that are least represented in the current sample of languages (Ponti et al., 2020); see the Appendix for the full list. Following Pfeiffer et al. (2020b), the LA bottleneck size is m = 384. Both the language embedding dimension d_l and the layer embedding dimension d_b (if used) are set to 32. At each MLM training step, we randomly sample a batch in a language drawn from an exponentially smoothed distribution with a cap preventing oversampling of high-resource languages: the probability of selecting a language l is proportional to min(n_examples(l), 500,000)^0.5. Training runs for 200,000 steps in total over all languages; the batch size is 64 and the maximum sequence length is 256. We use a linearly decreasing learning rate, starting at 5e-5. In contrast, with the same batch size and maximum sequence length, MAD-X was trained for 100,000 steps for each language. This makes the average per-language duration of MAD-G training ≈50 times shorter than for MAD-X. Moreover, MAD-G and MAD-G-LS have 226M and 38M parameters respectively, compared to 728M for a hypothetical set of 95 dedicated MAD-X language adapters.
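The capped, exponentially smoothed sampling distribution described above can be sketched as follows (our illustration of the stated formula):

```python
def sampling_probs(n_examples: dict[str, int], cap: int = 500_000,
                   alpha: float = 0.5) -> dict[str, float]:
    """p(l) ∝ min(n_examples(l), cap) ** alpha."""
    weights = {lang: min(n, cap) ** alpha for lang, n in n_examples.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}

# Example: a high-resource language is capped, flattening the distribution
# in favour of lower-resource languages (counts here are made up).
probs = sampling_probs({"en": 6_000_000, "sw": 60_000, "yo": 30_000})
```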
Single- and Multi-Source Transfer. We train task adapters on English data with the English MAD-G adapter. For comparability, we adopt the TA configuration of MAD-X (Pfeiffer et al., 2020b): the bottleneck size is m = 48. For POS tagging and NER, we use the standard token-level single-layer multi-class classifier. For DP, we use the shallow variant (Glavaš and Vulić, 2021) of the biaffine dependency parser of Dozat and Manning (2017).
For POS tagging and DP, we train on the English EWT treebank. For consistency and comparability with the multi-source experiments, we sample 12,000 sentences for training (out of the 12,543 available examples). For NER, we train on the CoNLL 2003 English dataset (Tjong Kim Sang and De Meulder, 2003). For all tasks, we train for 15,000 steps with batch size 8 (roughly 10 epochs) and a linearly decreasing learning rate, starting at 5e-5.
For the multi-source transfer experiments, we select 20 typologically diverse high-resource source languages for POS tagging and DP using the following process: we iterate over the UD languages in descending order of treebank size and select a language if it belongs to a genus not already represented in the sample (see the sketch below). We again sample a total of 12,000 examples (600 per language).
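This greedy genus-based selection can be expressed compactly (a sketch; the `(language, genus, treebank_size)` records are assumed to come from UD metadata):

```python
def select_source_languages(ud_languages, k: int = 20):
    """Greedily pick k languages, largest treebanks first,
    admitting at most one language per genus."""
    selected, seen_genera = [], set()
    for lang, genus, size in sorted(ud_languages, key=lambda x: -x[2]):
        if genus not in seen_genera:
            selected.append(lang)
            seen_genera.add(genus)
        if len(selected) == k:
            break
    return selected
```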

Results and Discussion
In what follows, we focus on reporting and analyzing the most important global trends in the results, with accompanying discussion and side experiments. For completeness, the full results per individual target language are provided in the Appendix.
Single-Source Transfer. Relative to all methods which do not employ language adaptation, the use of MAD-G in the primary MAD-G and MAD-G-LS settings is greatly beneficial on all tasks for MAD-G-seen languages, in both the single- and multi-source transfer scenarios (see Tables 2 and 3). The very parameter-efficient MAD-G-LS is in general only slightly weaker than the base MAD-G variant, even slightly outperforming it for some languages and transfer setups. Despite having far less capacity per target language, MAD-G retains much of the performance gain of MAD-X on languages seen during language adapter training, showing that MAD-G achieves efficient yet effective language adaptation. The MAD-G-en variant does not achieve such gains on MAD-G-seen languages, demonstrating that MAD-G does generate meaningfully different adapter parameters for different languages.

Table 2: UD POS tagging accuracy scores and dependency parsing unlabeled/labeled attachment scores for various language adapter and fine-tuning settings. Values are averages over the language groups mBERT-seen, MAD-G-seen, and unseen, defined in Table 1. Task adapters are trained on English data only (en, upper part) or on 20 diverse, high-resource languages (multi, lower part). The highest score per column in each of the two setups is in bold, the second highest is underlined.

Table 3: MasakhaNER results for nine African languages: hau, ibo, kin (MAD-G-seen); swa, yor (mBERT-seen); lug, luo, pcm, wol (unseen). [Per-language scores omitted here; see the Appendix.]

Table 4: UD POS tagging accuracy scores and dependency parsing unlabeled/labeled attachment scores for various language adapter/fine-tuning settings. Values are averages over the language groups mBERT-genus, MAD-G-genus, and unseen-genus. The task adapter is trained on English data only.
The use of MAD-G is not in general beneficial for mBERT-seen languages; this is unsurprising, as mBERT's knowledge of languages observed during its own pretraining is unlikely to improve substantially through language adaptation on a much smaller amount of data. At first glance, there also does not appear to be any benefit to using MAD-G for unseen target languages, except for NER, where gains are substantial. However, averaging the results over all languages in this group does not provide a full picture, because the group consists of languages whose relationships to those observed during training differ substantially. Therefore, we provide a finer-grained analysis below.
While the use of typological vectors for generating LAs allows MAD-G to learn features which could generalize well to unseen languages, this generalization should mostly hold for unseen languages whose 'typological relatives' are available during training. To investigate the effect that the degree of typological relatedness has on MAD-G's generalization ability, we further divide the unseen languages into three subgroups: mBERT-genus (the 21 languages whose genus matches that of at least one language seen during mBERT pretraining); MAD-G-genus (the 4 languages whose genus was not seen during mBERT pretraining but was seen during MAD-G training); and unseen-genus (the 8 languages whose genus is completely unseen). Table 4 shows the POS tagging and DP performance for each of the three unseen subgroups. MAD-G is beneficial on the MAD-G-genus subgroup, while its benefits do not extend to the other two subgroups. The results for mBERT-genus versus MAD-G-genus languages mirror those for mBERT-seen versus MAD-G-seen languages: in general, mBERT's knowledge of a genus (or specific language) can be improved through language adaptation if and only if that genus/language was not observed during mBERT's pretraining. As expected, the scores on unseen-genus languages confirm the intuition that performance on languages typologically unrelated to any language seen during mBERT and/or MAD-G training cannot be recovered solely on the basis of limited external typological information. For cross-lingual generalization, the typological diversity of pretraining languages is thus paramount.
Multi-Source Transfer. When training on 20 languages, while maintaining the overall number of training examples, we observe large gains across all settings and language groups for both POS tagging and DP (see Table 2). This suggests that multi-source training yields a more general and language-agnostic task adapter representation, which transfers better to unseen languages. We investigate the effect of multi-source training further in Figure 2, where we gradually add languages to the multi-source pool while (again) maintaining the overall number of training examples. We find that the transition from one language to two languages in the source pool yields the largest relative performance increase, but performance keeps rising with the addition of more languages. In sum, in line with previous findings (Ponti et al., 2021b), our results indicate that the language diversity of the training data has strong positive effects on zero-shot transfer across multiple methods and setups.
Fine-tuning MAD-G-Initialized Adapters. Although interesting from a theoretical point of view, the scenario in which no unannotated data whatsoever is available for the target language might be unrealistic. We thus examine a setup where a small amount of unannotated data is available. In this case, we can still exploit MAD-G by generating an initialization of a language-specific adapter for a target language l_t, and then fine-tuning its parameters via MLM on the unannotated data.
We perform POS tagging and DP experiments when fine-tuning MAD-G-initialized language-specific adapters on the 14 unseen UD languages which have Wikipedias. We simulate different degrees of resource-poverty by sampling training datasets with 1,000, 3,000, 10,000, 30,000, and 100,000 words from the full Wikipedia. We compare this MAD-G-ft setting with the results of fine-tuning randomly initialized LAs on the same data (rand-ft). Figure 3 shows that there is a large and consistent improvement on the 14 unseen evaluation languages as their language adapters are fine-tuned on increasingly large amounts of unannotated text. For both tasks, performance is better when the language adapter is initialized with the weights generated by MAD-G than when the weights are randomly initialized. The difference between the two settings is modest for POS tagging, but larger for DP, and it is maintained even when 100,000 training tokens are available.

Conclusion
We proposed MAD-G, a modular and efficient cross-lingual transfer framework for low-resource languages that generates task-agnostic adapters for massively multilingual Transformers (e.g., mBERT) from typological language representations. MAD-G performs competitively with the state-of-the-art adapter-based transfer approach MAD-X, yet its training is roughly 50 times more efficient per target language. MAD-G can also be applied to unseen languages, benefiting those belonging to a genus introduced during its training, and it can be used as a better initialization for "radically low-resource languages"; there, its generated language adapters can be further refined on small amounts of text, improving downstream performance. We further show that cross-lingual performance with adapters can be greatly improved by training on multiple source languages. We release the MAD-G code online at: https://github.com/Adapter-Hub/adapter-transformers.