Cross-Lingual Transfer with Target Language-Ready Task Adapters

Adapters have emerged as a modular and parameter-efficient approach to (zero-shot) cross-lingual transfer. The established MAD-X framework employs separate language and task adapters which can be arbitrarily combined to perform the transfer of any task to any target language. Subsequently, BAD-X, an extension of the MAD-X framework, achieves improved transfer at the cost of MAD-X’s modularity by creating ‘bilingual’ adapters specific to the source-target language pair. In this work, we aim to take the best of both worlds by (i) fine-tuning task adapters adapted to the target language(s) (so-called ‘target language-ready’ (TLR) adapters) to maintain high transfer performance, but (ii) without sacrificing the highly modular design of MAD-X. The main idea of ‘target language-ready’ adapters is to resolve the training-vs-inference discrepancy of MAD-X: the task adapter ‘sees’ the target language adapter for the very first time during inference, and thus might not be fully compatible with it. We address this mismatch by exposing the task adapter to the target language adapter during training, and empirically validate several variants of the idea: in the simplest form, we alternate between using the source and target language adapters during task adapter training, which can be generalized to cycling over any set of language adapters. We evaluate different TLR-based transfer configurations with varying degrees of generality across a suite of standard cross-lingual benchmarks, and find that the most general (and thus most modular) configuration consistently outperforms MAD-X and BAD-X on most tasks and languages.


Introduction and Motivation
Recent progress in multilingual NLP has mainly been driven by massively multilingual Transformer models (MMTs) such as mBERT (Devlin et al., 2019), XLM-R (Conneau et al., 2020), and mT5 (Xue et al., 2021), which have been trained on unlabeled data in 100+ languages. Their shared multilingual representation spaces enable zero-shot cross-lingual transfer (Pires et al., 2019; K et al., 2020), that is, performing tasks with a reasonable degree of accuracy in languages that entirely lack training data for those tasks.
Zero-shot cross-lingual transfer is typically performed by fine-tuning the pretrained MMT on task-specific data in a high-resource source language (typically English), and then applying it directly to make task predictions in the target language. In this standard setup, the model's knowledge of the target language is acquired solely during the pretraining stage (Artetxe et al., 2020). In order to improve transfer performance, task fine-tuning can be preceded by fine-tuning on unlabeled data in the target language (Ponti et al., 2020; Pfeiffer et al., 2020b). Nonetheless, performance on the target languages in such scenarios remains lower than on the source language, and the difference is known as the cross-lingual transfer gap (Hu et al., 2020). Crucially, the transfer gap tends to increase for the languages where such transfer is needed the most (Joshi et al., 2020): i.e., for low-resource target languages, and for languages typologically more distant from the source language (e.g., English) (Lauscher et al., 2020).
Adapters (Rebuffi et al., 2017; Houlsby et al., 2019) have emerged as a prominent approach for aiding zero-shot cross-lingual transfer (Pfeiffer et al., 2020b; Üstün et al., 2022a; Ansell et al., 2021; Parović et al., 2022). They offer several benefits: (i) they provide additional representation capacity for target languages; (ii) they enable much more parameter-efficient fine-tuning than full-model fine-tuning, as they allow the large MMT's parameters to remain unmodified and thus preserve the multilingual knowledge the MMT has acquired during pretraining; and (iii) they provide modularity in learning and storing different facets of knowledge (Pfeiffer et al., 2020a): this property enables them to be combined in favorable ways to achieve better performance, and previously fine-tuned modules (e.g., language adapters) can be reused across different applications.
The established adapter-based cross-lingual transfer framework MAD-X (Pfeiffer et al., 2020b) trains separate language adapters (LAs) and task adapters (TAs) which can then be arbitrarily combined for the transfer of any task to any language. Despite its highly modular design, stemming primarily from dedicated per-language and per-task adapters, MAD-X's TAs lack 'adaptivity' to the target language(s) of interest: i.e., its TAs are fully target language-agnostic. More precisely, during task fine-tuning, the MAD-X TA is exposed only to the source language LA, and 'sees' the target language LA and examples from that language for the first time only at inference. This deficiency might result in an incompatibility between the TA and the target LA which emerges only at inference. BAD-X (Parović et al., 2022) trades off MAD-X's high degree of modularity by introducing 'bilingual' language adapters specialized for transfer between a particular source-target language pair. While such transfer-direction specialization results in better performance, the decrease in modularity incurs much larger computational requirements: BAD-X requires fine-tuning a dedicated bilingual LA for every language pair of interest, followed by fine-tuning a dedicated TA, again for each pair.
Prior work has not explored whether this specialization (i.e., exposing the target language at training time) can be performed solely at the level of TAs whilst preserving modularity at the LA level. Such specialization in the most straightforward bilingual setup still requires fine-tuning a dedicated TA for each target language of interest. However, this is already a more pragmatic setup than BAD-X, since TAs are much less computationally expensive to train than LAs. Moreover, as we show in this work, it is possible to extend TA fine-tuning to more target languages, moving from bilingual specialization to more universal multilingual 'exposure' and towards multilingual language-universal TAs.
In this work, we aim to create a modular design inspired by MAD-X while reaping the benefits of exposure to one or more target languages. To this end, we introduce target language-ready (TLR) task adapters designed to excel at a particular target language or at a larger set of target languages. In the simplest bilingual variant, TLR TAs are trained by alternating between the source and target LAs, while the more general version allows cycling over any set of LAs. Creating TLR TAs does not require any expensive retraining or alternative training of LAs.
We run experiments with a plethora of standard benchmarks focused on zero-shot cross-lingual transfer and low-resource languages, covering 1) NER on MasakhaNER; 2) dependency parsing (DP) on Universal Dependencies; 3) natural language inference (NLI) on AmericasNLI and XNLI; and 4) QA on XQuAD and TyDiQA-GoldP. Our results show that TLR TAs outperform MAD-X and BAD-X on all tasks on average, and offer consistent gains across a large majority of the individual target languages. Importantly, the most general TLR TA, which is shared between all target languages and thus boosts modularity and reusability, shows the strongest performance across the majority of tasks and target languages. Fine-tuning the TA in such multilingual setups also acts as multilingual regularization (Ansell et al., 2021): while the TA gets exposed to different target languages (i.e., maintaining its TLR property), it does not overfit to a single target language, as it is forced to adapt to more languages and thus learns more universal cross-language features. Our code and models are publicly available at: https://github.com/parovicm/tlr-adapters.

Background
Adapters. Following MAD-X and BAD-X, in this work we focus on the most common adapter architecture, serial adapters (Houlsby et al., 2019; Pfeiffer et al., 2021a), but we remind the reader that other adapter options are available (He et al., 2022) and might be used in the context of cross-lingual transfer. Serial adapters are lightweight bottleneck modules inserted within each Transformer layer. The architecture of an adapter at each layer consists of a down-projection, a non-linearity, and an up-projection, followed by a residual connection. Let the down-projection at layer l be a matrix D_l ∈ R^{h×d} and the up-projection be a matrix U_l ∈ R^{d×h}, where h is the hidden size of the Transformer and d is the hidden size of the adapter. If we denote the hidden state and the residual at layer l as h_l and r_l respectively, the adapter computation at layer l is then given by:

Adapter_l(h_l, r_l) = U_l(ReLU(D_l(h_l))) + r_l,

with ReLU as the activation function.
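As a concrete illustration, the per-layer adapter computation can be sketched in a few lines (a minimal NumPy sketch; bias terms are omitted and the projections are treated as plain linear maps, so the shapes and initialization here are illustrative assumptions, not the exact implementation):

```python
import numpy as np

def serial_adapter(h, r, D, U):
    """Adapter_l(h_l, r_l) = U_l(ReLU(D_l(h_l))) + r_l."""
    z = np.maximum(D @ h, 0.0)  # down-projection to the bottleneck, then ReLU
    return U @ z + r            # up-projection back, plus the residual

H, d = 768, 384  # mBERT hidden size; reduction factor 2, as used for the LAs
rng = np.random.default_rng(0)
D = 0.01 * rng.standard_normal((d, H))  # down-projection map
U = 0.01 * rng.standard_normal((H, d))  # up-projection map
h = rng.standard_normal(H)

out = serial_adapter(h, h, D, U)  # the residual is typically the layer input
print(out.shape)  # (768,)
```

Note that with zero-valued projections the module reduces to the identity on the residual stream, which is why near-zero initialization of the up-projection keeps the underlying MMT's behavior intact at the start of training.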
MAD-X and BAD-X Frameworks. MAD-X trains dedicated LAs and TAs (Pfeiffer et al., 2020b). LAs are trained on unlabeled Wikipedia data with a masked language modeling (MLM) objective. TAs are trained on task-specific data in the source language. Given a source language L_s and a target language L_t, MAD-X trains LAs for both L_s and L_t. The TA is trained while stacked on top of the L_s LA, which is frozen. To make predictions in L_t, the L_s LA is swapped with the L_t LA.
Unlike MAD-X, which is based on monolingual adapters, BAD-X trains bilingual LAs (Parović et al., 2022). A bilingual LA is trained on the unlabeled data of both L_s and L_t, and the TA is then trained on task-specific data in L_s, stacked on top of the bilingual LA. To perform inference on the task in L_t, the same configuration is kept, since the bilingual LA 'knows' both L_s and L_t.

Target Language-Ready Task Adapters
Instead of sacrificing the LAs' modularity as in BAD-X, it might be more effective to keep MAD-X's language-specific LAs and prepare only the TAs to excel at a particular target language L_t, or at a set of target languages of interest. Assuming LAs are available for the source language L_s and K target languages L_t,i, i = 1, ..., K, we cycle over all K + 1 LAs during TA training, resulting in the so-called multilingual TLR TA. This general idea is illustrated in Figure 1. The bilingual variant, with a TLR TA trained by alternating between the source and target LA, is a special case of the multilingual variant where K = 1, while the original MAD-X setup is obtained by setting K = 0. This procedure exposes a single target language (bilingual TLR TA) or multiple target languages (multilingual TLR TA) to the TA as early as its fine-tuning phase, making it better equipped (i.e., ready) for the inference phase, where the TA is combined with the single L_t LA.
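The cycling schedule itself is simple round-robin over the available LAs; a minimal sketch follows (the function and the language codes are illustrative, not taken from the released code):

```python
def active_la(step: int, language_adapters: list) -> str:
    """At training step `step`, switch on only LA number step % (K + 1);
    the forward pass then routes through that LA (all others stay off)."""
    return language_adapters[step % len(language_adapters)]

# Source LA (English) plus K = 3 target LAs -> cycle over K + 1 = 4 LAs.
las = ["en", "sw", "quy", "th"]
schedule = [active_la(s, las) for s in range(6)]
print(schedule)  # ['en', 'sw', 'quy', 'th', 'en', 'sw']

# K = 0 recovers the original MAD-X setup: the source LA is always on.
assert all(active_la(s, ["en"]) == "en" for s in range(10))
```

Each batch is thus seen through exactly one LA, so no extra forward passes are needed compared to MAD-X TA training.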
(It is also possible to train a TA directly, without relying on any LA at all. However, previous research (Ansell et al., 2021) has empirically validated that this 'TA-only' variant is consistently outperformed by MAD-X; hence, we neither discuss nor compare to 'TA-only' in this work.)

TLR Variants. While BILINGUAL TA fine-tuning follows naturally from BAD-X and seems suitable for transfer between a fixed pair of L_s and L_t, it might be better to train the TA only on top of the L_t LA. Such TARGET-only TLR TAs could be particularly effective for higher-resource languages whose LAs have been trained on sufficient corpora, to the extent that pairing them with L_s is detrimental. This could be especially noticeable for higher-resource targets that are also distant from L_s or lack adequate vocabulary overlap with it.
TARGET and BILINGUAL TLR TAs require training a dedicated TA for every L_t of interest, which makes them computationally less efficient than MAD-X, and they introduce more parameters overall. Using MULTILINGUAL TLR TAs mitigates this overhead. We consider two variants of MULTILINGUAL TAs. First, the so-called TASK-MULTI TLR variant operates over the source language and the set of all target languages available for the task under consideration (e.g., all languages represented in the MasakhaNER dataset). Second, the ALL-MULTI TLR variant combines the source language with all target languages across the datasets of multiple tasks (e.g., all languages represented in MasakhaNER, all languages represented in AmericasNLI, etc.); see §3. These variants are as modular and parameter-efficient as MAD-X for each task: a single TA is required to handle transfer to any target language. At the same time, unlike MAD-X, they are offered some exposure to the representations arising from the multiple target languages they will be used for. Handling multiple LAs at fine-tuning might also make the TAs more robust overall: multilinguality might act as a regularizer, forcing the TA to focus on more universal cross-language features (Ansell et al., 2021).

Experimental Setup
Evaluation Tasks and Languages. We comprehensively evaluate our TLR adapter framework on a suite of standard cross-lingual transfer benchmarks. They span four different task families (NER, DP, NLI, and QA), with a total of six different datasets and 35 different target languages, covering a typologically and geographically diverse language sample of both low- and high-resource languages.
For NER, we use the MasakhaNER dataset (Adelani et al., 2021), which contains 10 low-resource languages from the African continent. For DP, we use Universal Dependencies 2.7 (Zeman et al., 2020) and inherit the set of 10 typologically diverse low-resource target languages from BAD-X (Parović et al., 2022). For NLI, we rely on the AmericasNLI dataset (Ebrahimi et al., 2022), containing 10 low-resource languages from the Americas, as well as a subset of languages from XNLI (Conneau et al., 2018). Finally, for QA we use subsets of languages from XQuAD (Artetxe et al., 2020) and TyDiQA-GoldP (Clark et al., 2020). The subsets for XNLI, XQuAD, and TyDiQA-GoldP were selected to combine (i) low-resource languages (Joshi et al., 2020) with (ii) higher-resource languages for which dedicated (i.e., 'MAD-X') LAs were readily available. The full overview of all tasks, datasets, and languages with their language codes is provided in the appendix.

Underlying MMT. We report results on all tasks with mBERT, pretrained on the Wikipedias of 104 languages (Devlin et al., 2019). mBERT has been suggested by prior work as a better-performing MMT for truly low-resource languages (Pfeiffer et al., 2021b; Ansell et al., 2021). To validate the robustness of our TLR adapters, we also use XLM-R (Conneau et al., 2020) for a subset of tasks.
Language Adapters. We train LAs for 100 epochs or 100,000 steps, whichever is reached first, with a batch size of 8, a learning rate of 5·10^-5, and a maximum sequence length of 256. We evaluate the LAs every 1,000 steps for low-resource languages and every 5,000 steps for high-resource ones, and choose the LA that yields the lowest perplexity on held-out monolingual data (5% of the corpus for low-resource languages, 1% for high-resource ones). For the BAD-X baseline, we directly use the bilingual LAs from Parović et al. (2022). Following Pfeiffer et al. (2020b), the adapter reduction factor (i.e., the ratio between the MMT's hidden size and the adapter's bottleneck size) is 2 for all LAs. For the MAD-X LAs, we use the efficient Pfeiffer adapter configuration (Pfeiffer et al., 2020a) with invertible adapters, whereas the BAD-X LAs do not include them.
Task Adapters. We fine-tune TAs by stacking them on top of the corresponding LAs (see Figure 1). During their fine-tuning, the MMT's parameters and all the LAs' parameters are frozen. The adapter reduction factor for all TAs is 16, as in prior work (Pfeiffer et al., 2020b) (i.e., d = 48), and, like the LAs, they use the Pfeiffer configuration. The hyperparameters across different tasks, also borrowed from prior work, are listed in Table 1. In addition, we use early stopping with a patience of 4 when training the QA TA (i.e., we stop training when the F1 score does not increase for four consecutive evaluation cycles). We use the English SQuAD v1.1 training data (Rajpurkar et al., 2016) for TyDiQA-GoldP since (i) it is much larger than TyDiQA's native training set, and (ii) in our preliminary experiments it yielded higher performance on target languages than TyDiQA's own training data.
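The patience-based stopping criterion for the QA TA can be sketched as follows (an illustrative reconstruction; `should_stop` and the exact bookkeeping are assumptions, not from the released code):

```python
def should_stop(f1_history, patience: int = 4) -> bool:
    """Stop once the dev-set F1 has not improved for `patience`
    consecutive evaluation cycles after its best value so far."""
    if len(f1_history) <= patience:
        return False
    best_idx = max(range(len(f1_history)), key=f1_history.__getitem__)
    return len(f1_history) - 1 - best_idx >= patience

# Best F1 (71.3) was reached 4 evaluation cycles ago -> stop.
print(should_stop([70.1, 71.3, 71.0, 70.8, 70.9, 71.2]))  # True
# Still improving -> keep training.
print(should_stop([70.1, 71.3, 71.5, 71.8, 72.0]))        # False
```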
Transfer Setup: Details. In all our transfer experiments, the source language L_s is fixed to English, and we evaluate the different variants described in §2.2. For the MAD-X baseline, we rely on the 'MAD-X v2.0' variant, which drops the adapters in the last layer of the Transformer; this has been found to improve transfer performance across the board (Pfeiffer et al., 2021b). For the TASK-MULTI TLR variant, along with the English LA, we fine-tune TAs using the LAs of all our evaluation languages in that particular dataset. For instance, for DP this spans 10 languages, while for NLI, we fine-tune one TASK-MULTI TLR TA with the 10 languages from AmericasNLI, and another one with the XNLI languages. For the ALL-MULTI TLR variant, in addition to the English LA, we cycle over the LAs of all our evaluation languages from all the tasks and datasets.

Results and Discussion
Main Results. The main results with mBERT for all tasks and all languages are shown in Table 2, with the averages concisely provided in Figure 2. Additional results with XLM-R are available in Appendix B. As a general trend, we observe that all proposed TLR variants outperform MAD-X on the majority of the target languages across all tasks. Besides reaching higher averages on all tasks, the best per-task variants from the TLR framework surpass MAD-X on 9/9 (NER), 10/10 (DP), 10/10 (AmericasNLI), 6/6 (XNLI), 4/4 (XQuAD), and 5/5 (TyDiQA) target languages. We also demonstrate gains over the much less modular BAD-X on the two tasks (DP, AmericasNLI) for which we had readily available BAD-X LAs. In sum, the comprehensive set of results in Table 2 confirms the effectiveness and versatility of TLR adapters across a range of typologically diverse target languages and datasets.
Breakdown of Results across Tasks and TLR Variants. On NER and DP we observe very similar trends. Importantly, the most modular ALL-MULTI variant offers the highest performance overall: e.g., it reaches an average F1 score of 69.86% on NER, outperforming MAD-X by 1.9% on average and on all 9 target languages. Pronounced gains with this variant are also observed on DP. The TARGET and BILINGUAL variants also yield gains across the majority of languages, with BILINGUAL being the stronger of the two. However, their overall utility compared to ALL-MULTI is lower, given their lower performance coupled with lower modularity.
On AmericasNLI, all TLR variants display considerable gains over MAD-X, achieving 5-6% higher average accuracy. They outperform MAD-X on all 10 target languages, with the sole exception of the TASK-MULTI variant, which shows a slight drop on AYM. The best variant is once again the most modular ALL-MULTI variant, which is better than the baselines and all the other variants on 6/10 target languages.
On XNLI, which involves some higher-resource languages such as AR, HI, and ZH, all TLR variants reach higher average accuracy than MAD-X. The gains peak at around 5-6% on average; however, this is mainly due to SW, where MAD-X fails completely, achieving random-choice accuracy. Nonetheless, the TLR variants attain better scores on all other languages as well (the only exception is ALL-MULTI on AR). Besides SW, TH also sees a large boost of up to 11.2% with the BILINGUAL variant, while the other languages attain more modest gains of up to 2%. We remark that the BILINGUAL variant now obtains the highest average accuracy: we speculate that this could be a consequence of the target languages now being on the higher-resource end compared to MasakhaNER and AmericasNLI.
Our final task family, QA, again demonstrates the benefits of transfer with TLR adapters. On XQuAD and TyDiQA-GoldP, the best TLR variant is now the TARGET adapter. This might be partially due to the good representation of high-resource languages such as AR, HI, or ZH in mBERT and its subword vocabulary. However, we also observe gains with TARGET on lower-resource languages such as BN and SW on TyDiQA, which might indicate that the higher complexity of the QA task is at play in comparison to tasks such as NER and NLI. Crucially, the most modular ALL-MULTI TLR variant, which trains a single TA per task, yields very robust and strong performance across all tasks (including the two QA tasks), both on high-resource and low-resource languages.
Towards Language-Universal Task Adapters? Strictly speaking, if a new (K + 1)-th target language is introduced to our proposed TLR framework, it would be necessary to retrain the multilingual TLR TA to expose it to the new target language. In practice, massively multilingual TAs could still be applied even to languages 'unseen' during TA fine-tuning (e.g., in the same way as the original MAD-X framework does). This violates the TLR assumption, as the TA sees the target language only at inference. However, this setup might empirically validate another desirable property of our multilingual TLR framework from Figure 1: exposing the TA at fine-tuning to a multitude of languages (and their corresponding LAs) might equip the TA with improved transfer capability even to unseen languages. Put simply, the TA will not overfit to a single target language or a small set of languages, as it must learn to balance across a large and diverse set of languages; see §2.
We thus run experiments on MasakhaNER, UD DP, and AmericasNLI with two subvariants of the most general ALL-MULTI variant. First, in the LEAVE-OUT-TASK subvariant, we leave out all the LAs for the languages from the corresponding task dataset when fine-tuning the TA: e.g., for AmericasNLI, this subvariant covers the LAs of all the languages in all the datasets except those appearing in AmericasNLI, so that all AmericasNLI languages are effectively 'unseen' at fine-tuning. The second subvariant, termed LEAVE-OUT-TARG, leaves out only one language at a time from the corresponding dataset: e.g., when evaluating on Guarani (GN) in AmericasNLI, the only language 'unseen' by the TA at fine-tuning is GN, the current inference language.
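The LA sets used by the two subvariants amount to simple set differences over the available LA inventory; a small sketch (the helper names and the toy language codes are illustrative):

```python
def leave_out_task(all_las: set, task_las: set, source: str = "en") -> set:
    """LEAVE-OUT-TASK: drop the LAs of every language in the evaluation
    task's dataset; the source LA is always kept in the cycle."""
    return (all_las - task_las) | {source}

def leave_out_targ(all_las: set, target: str, source: str = "en") -> set:
    """LEAVE-OUT-TARG: drop only the current inference language's LA."""
    return (all_las - {target}) | {source}

all_las = {"en", "aym", "gn", "quy", "sw", "wo"}   # toy LA inventory
americasnli = {"aym", "gn", "quy"}                 # toy task language set

print(sorted(leave_out_task(all_las, americasnli)))  # ['en', 'sw', 'wo']
print(sorted(leave_out_targ(all_las, "gn")))  # ['aym', 'en', 'quy', 'sw', 'wo']
```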
The results, summarized in Tables 2(a)-(c), reveal that our MULTILINGUAL TA fine-tuning indeed increases transfer capability also for the 'TA-unseen' languages, and leads towards language-universal TAs. Both subvariants offer substantial gains over MAD-X for many languages unseen during fine-tuning, in all three tasks. This confirms that (i) MAD-X TAs tend to overfit to the source language and thus underperform in cross-lingual transfer, and (ii) such overfitting can be mitigated through our proposed 'multilingual regularization' of the TAs while keeping the same modularity benefits. Additionally, the results confirm the versatility of the proposed TLR framework, where strong transfer gains are achieved with different sets of languages included in multilingual TA fine-tuning: e.g., the scores with the two LEAVE-OUT subvariants remain strong and competitive with the full ALL-MULTI variant.
For the DP task we even observe slight gains with the LEAVE-OUT-TASK subvariant over the original ALL-MULTI variant, which 'sees' all task languages. We speculate that this might partially be due to the 'curse of multilinguality' (Conneau et al., 2020) kicking in, now at the level of the limited TA parameter budget, but leave this for future work.

Further Analyses
Robustness to LA Training Configuration. To demonstrate that our results hold even when LAs are trained with different hyperparameters, we adopt a training regime that makes the MAD-X LAs directly comparable with BAD-X as trained by Parović et al. (2022). The average results with such LAs on DP and AmericasNLI are presented in Table 3, demonstrating that the gains with the proposed TLR variants hold irrespective of the LA training setup.
Multiple Runs.Given the large number of experimental runs in this work, most scores are reported from single runs with fixed seeds.However, to validate that our findings hold under different random initializations of TAs, we also run MAD-X and all TLR variants with three different random seeds on a subset of tasks (MasakhaNER and AmericasNLI).
The main results are presented in Table 3, indicating that all the findings hold and are not due to a single favorable seed.

Do TLR Adapters Improve Alignment Between Source and Target Languages? In order to explain the consistent gains of TLR adapters over MAD-X, we analyse whether TLR adapters produce better-aligned representations between source and target languages than MAD-X. We run experiments on the NLI task, choosing 4 languages from AmericasNLI (AYM, GN, HCH, QUY) and 4 languages from XNLI (AR, HI, SW, UR), with English as the source language. The English representations are obtained using MultiNLI data, with the English LA paired with 1) the MAD-X TA for the MAD-X baseline, and 2) the ALL-MULTI TA for the TLR representations. To obtain the representations in a target language, we use its validation data and its LA, paired with either the MAD-X TA or the ALL-MULTI TA as before. The alignment scores of both MAD-X and TLR methods are measured as the cosine similarity between the English and target-language representations of mBERT's [CLS] token, using 500 examples in both languages. The results are presented in Figure 3. We observe that MAD-X suffers a much sharper drop in alignment in the last layer than the ALL-MULTI adapter, which could explain the better performance of the latter. In addition, on the AmericasNLI languages, where we observe sizable gains, the ALL-MULTI adapter achieves better alignment across the middle layers of mBERT.
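The alignment score itself can be sketched as follows (a simplified NumPy sketch assuming one [CLS] vector per example and an average of cosine similarities over row-paired examples; how examples are paired across the two languages is an assumption of this sketch):

```python
import numpy as np

def mean_cls_alignment(src_cls: np.ndarray, tgt_cls: np.ndarray) -> float:
    """Mean cosine similarity between row-paired source- and target-language
    [CLS] vectors of shape (n_examples, hidden_size)."""
    src = src_cls / np.linalg.norm(src_cls, axis=1, keepdims=True)
    tgt = tgt_cls / np.linalg.norm(tgt_cls, axis=1, keepdims=True)
    return float((src * tgt).sum(axis=1).mean())

rng = np.random.default_rng(0)
en = rng.standard_normal((500, 768))              # e.g., 500 English [CLS] vectors
sw = en + 0.1 * rng.standard_normal((500, 768))   # a nearly aligned target language

print(round(mean_cls_alignment(en, en), 3))  # 1.0: identical representations
```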

Related Work
Parameter-Efficient Fine-Tuning has emerged from an effort to overcome the need for full-model fine-tuning, especially as neural models become increasingly large. Some approaches fine-tune only a subset of model parameters while keeping the rest unmodified (Ben Zaken et al., 2022; Guo et al., 2021; Ansell et al., 2022). Other approaches keep the model's parameters fixed and introduce a fresh set of parameters that serve to learn the desired task (Li and Liang, 2021; Lester et al., 2021; Houlsby et al., 2019; Hu et al., 2022), with a tendency towards decreasing the number of newly introduced parameters while maintaining or maximizing task performance (Karimi Mahabadi et al., 2021a,b).
Adapters were introduced in computer vision research (Rebuffi et al., 2017) before being brought into NLP for parameter-efficient transfer learning across tasks (Houlsby et al., 2019). Bapna and Firat (2019) use adapters in NMT as an efficient way of adapting the model to new languages and domains, since maintaining separate models quickly becomes infeasible as the number of domains and languages increases. Wang et al. (2021) propose factual and linguistic adapters to infuse different types of knowledge into the model while avoiding the catastrophic forgetting that would otherwise occur.
Adapters for Cross-Lingual Transfer. MAD-X (Pfeiffer et al., 2020b) introduces LAs and TAs for efficient transfer; it also proposes invertible adapters for adapting MMTs to unseen languages. Subsequently, Pfeiffer et al. (2021b) introduce a vocabulary adaptation method for MAD-X that can adapt the model to low-resource languages and even to unseen scripts, the latter of which was not possible with MAD-X's invertible adapters. In another adapter-based cross-lingual transfer approach, Vidoni et al. (2020) introduce orthogonal LAs and TAs designed to store knowledge orthogonal to that already encoded within the MMT. FAD-X (Lee et al., 2022) explores whether available adapters can be composed to complement or completely replace the adapters for low-resource languages; this is done by fusing (Pfeiffer et al., 2021a) TAs trained with LAs in different languages.
Our TLR adapters do not involve any fusion, but instead benefit from a training procedure that cycles over multiple LAs. Faisal and Anastasopoulos (2022) use linguistic and phylogenetic information to improve cross-lingual transfer by leveraging closely related languages and learning language-family adapters, similar to Chronopoulou et al. (2022). This is accomplished by creating a phylogeny-informed tree hierarchy over LAs.
UDapter (Üstün et al., 2020) and MAD-G (Ansell et al., 2021) learn to generate LAs through the contextual parameter generation method (Platanios et al., 2018). Both enable generating adapter parameters from vectors of typological features through the sharing of linguistic information; the main difference between the two approaches is that MAD-G's LAs are task-agnostic, while UDapter generates them jointly with a dependency parser's parameters. Hyper-X (Üstün et al., 2022b) generates adapter weights conditioned on both task and language vectors, thus facilitating zero-shot transfer to unseen languages and task-language combinations.
Improving Cross-Lingual Transfer via Exposing Target Languages. In an extensive transfer case study focused on POS tagging, de Vries et al. (2022) showed that both the source and target language (along with other features such as language family, writing system, word order, and lexical-phonetic distance) affect cross-lingual transfer performance. XeroAlign (Gritta and Iacobacci, 2021) is a method for task-specific alignment of sentence embeddings (i.e., it encourages alignment between source task data and its target-language translation via an auxiliary loss), aiming to bring target-language performance closer to that of the source language (i.e., to close the cross-lingual transfer gap). Kulshreshtha et al. (2020) analyze the effects of existing methods for aligning multilingual contextualized embeddings and cross-lingual supervision, and propose a novel alignment method. Yang et al. (2021) introduce a new pretraining task that aligns static embeddings and multilingual contextual representations by relying on bilingual word pairs during masking. Inspired by this line of research, in this work we investigated how 'exposing' target languages, as well as conducting multilingual fine-tuning, impacts the knowledge stored in task adapters and their ability to boost adapter-based cross-lingual transfer.

Conclusion and Future Work
We have presented a novel general framework for adapter-based cross-lingual task transfer, which improves over established adapter-based transfer frameworks such as MAD-X and BAD-X. The main idea is to better equip task adapters (TAs) to handle text instances in a variety of target languages. We have demonstrated that this can be achieved via so-called target language-ready (TLR) task adapters, where we expose the TA to the target language as early as the fine-tuning stage. As another major contribution, we have also proposed a multilingual language-universal TLR TA variant which offers the best trade-off between transfer performance and modularity, learning a single universal TA that can be applied to multiple target languages. Our experiments across 6 standard cross-lingual benchmarks, spanning 4 different tasks and a wide spectrum of languages, have validated the considerable benefits of the proposed framework and the different transfer variants emerging from it. Crucially, the most modular multilingual TLR TA variant offers the strongest performance overall, and it also generalizes well even to target languages 'unseen' during TA fine-tuning.
In future work, we plan to further investigate multilingual language-universal task adapters in multi-task and multi-domain setups, and to extend the focus from serial adapters to other adapter architectures, such as parallel adapters (He et al., 2022) and sparse subnetworks (Ansell et al., 2022; Foroutan et al., 2022).

Limitations
Our experiments are based on (arguably) the most standard adapter architecture for adapter-based cross-lingual transfer and beyond, which also facilitates comparisons to prior work in this area. However, we again note that there are other emerging parameter-efficient modular methods, including different adapter architectures (He et al., 2022), that could be used with the same conceptual idea. We leave further, wider explorations along this direction for future work.
Our evaluation relies on the currently available standard multilingual benchmarks, in particular those targeting low-resource languages. While the development of better models for underrepresented languages is possible largely owing to such benchmarks, it is also inherently constrained by their quality and availability. Even though our experiments cover 35 different target languages and several different tasks, we mostly focus on generally consistent trends across multiple languages. Delving deeper into finer-grained qualitative and linguistically oriented analyses of particular low-resource languages would require access to native speakers of those languages, and conducting such analyses for the many languages in our sample is very challenging.
Due to the large number of experiments across many tasks and languages, we report all our results based on a single run. Averages over multiple runs conducted on a subset of languages and tasks confirm all the core findings; for simplicity, we report the results for all languages and tasks in the same single-run setup.

Figure 1: A general multilingual target language-ready (TLR) task adapter (TA) module at one MMT layer, showing the language adapters (LAs) for English as the source language and K target languages, along with the NLI TA. The TA is trained by cycling over the K + 1 LAs associated with the K + 1 languages. At a given training step, only the LA with index step % (K + 1) is switched on, and the forward pass goes through that LA. Setting K = 0 recovers the original MAD-X setup, where only the source LA is switched on, while K = 1 yields a bilingual TLR variant. Setting K = 1 and removing the English LA yields the TARGET-only TLR variant. See §2.2 for descriptions of all the variants. The same adapter configuration(s), but with different parameters, are added at each MMT layer.
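The cycling schedule described in the Figure 1 caption can be sketched as follows. The language codes are illustrative, and a real implementation would switch adapter modules inside each MMT layer rather than strings:

```python
def active_language_adapter(step, language_adapters):
    """Select which of the K + 1 language adapters (source + K targets)
    is switched on at a given task-adapter training step: the forward
    pass routes through the LA with index step % (K + 1)."""
    return language_adapters[step % len(language_adapters)]

# Source LA plus K = 2 target LAs (language codes are illustrative):
adapters = ["en", "sw", "jv"]
schedule = [active_language_adapter(s, adapters) for s in range(4)]
# -> ["en", "sw", "jv", "en"]
```

With a single source LA (K = 0), the schedule always selects the source adapter and the setup reduces to MAD-X.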

Table 1: Hyperparameters for different tasks; full details are given in Table 5 in Appendix A.

Table 2: Results of all methods and TLR variants on all tasks and target languages. The highest task score per language is shown in bold, excluding the two ablation subvariants of ALL-MULTI placed below the dashed horizontal lines (LEAVE-OUT-TASK and LEAVE-OUT-TARG). Better refers to the number of target languages for which each TLR variant scores higher than MAD-X. An asterisk (*) next to the best TLR variant indicates non-significant gains over MAD-X; significance was assessed with Student's t-test at p = 0.05.

Table 5: Details of the tasks, datasets, and languages involved in our cross-lingual transfer evaluation. * denotes low-resource languages seen during MMT pretraining; † denotes high-resource languages seen during MMT pretraining; all other languages are low-resource and unseen. The source language is always English.