ZGUL: Zero-shot Generalization to Unseen Languages using Multi-source Ensembling of Language Adapters

We tackle the problem of zero-shot cross-lingual transfer in NLP tasks via the use of language adapters (LAs). Most earlier works have explored training with the adapter of a single source language (often English), and testing either with the target LA or the LA of another related language. Training a target LA requires unlabeled data, which may not be readily available for low-resource unseen languages: those that are neither seen by the underlying multilingual language model (e.g., mBERT), nor do we have any (labeled or unlabeled) data for them. We posit that for more effective cross-lingual transfer, instead of just one source LA, we need to leverage LAs of multiple (linguistically or geographically related) source languages, both at train and test time, which we investigate via our novel neural architecture, ZGUL. Extensive experimentation across four language groups, covering 15 unseen target languages, demonstrates improvements of up to 3.2 average F1 points over standard fine-tuning and other strong baselines on POS tagging and NER tasks. We also extend ZGUL to settings where either (1) some unlabeled data or (2) few-shot training examples are available for the target language. We find that ZGUL continues to outperform baselines in these settings too.


Introduction
Massive multilingual pretrained language models (PLMs) such as mBERT (Devlin et al., 2019) and XLM-R (Conneau et al., 2020) support 100+ languages. We are motivated by the vision of extending NLP to the thousands (Muller et al., 2021) of unseen languages, i.e., those not present in PLMs, and for which an unlabeled corpus is also not readily available. A natural approach is zero-shot cross-lingual transfer: train the model on one or more source languages and test on the target language, zero-shot. Two common approaches are (1) Standard Fine-Tuning (SFT): fine-tune all parameters of a PLM on task-specific training data in the source language(s), and (2) Language Adapters (LAs): small trainable modules inserted within a PLM transformer and trained on the target language's unlabeled data using Masked Language Modeling (MLM). At test time, in SFT the fine-tuned PLM is applied directly to target-language inputs, whereas for the latter, the LA of the source language is replaced with that of the target language for better zero-shot performance. Unfortunately, while in the former case one would expect the presence of unlabeled data for a target language to help pre-train the PLM for good performance, the latter also requires the same unlabeled data for training a target adapter.
For the large number of low-resource languages, curating a decent-sized unlabeled corpus is a challenge. For instance, there are only 291 languages that have Wikipedias with over 1,000 articles. Consequently, existing works use English data for training on the task and use the English LA (Pfeiffer et al., 2020b; He et al., 2021) or an ensemble of related-language LAs (Wang et al., 2021) at inference time. We posit that this is sub-optimal; for better performance, we should leverage multiple source languages (ideally, related to the target language) and their LAs, both at train and test time.
To this end, we propose ZGUL (Zero-shot Generalization to Unseen Languages), which explores this hypothesis. It has three main components. First, it fuses LAs from source languages at train time by leveraging AdapterFusion (Pfeiffer et al., 2021a), which was originally developed for fusing multiple task adapters. This allows ZGUL to locally decide the relevance of each LA for each token in each layer. Second, ZGUL leverages the typological properties of languages (encoded in existing language vectors) as additional information for computing global LA attention scores. Finally, ZGUL also implements Entropy-Minimization (EM)-based test-time tuning of LA attention weights (Wang et al., 2021).
We denote a language group as a set of phylogenetically or demographically close languages, similar to Wang et al. (2021). We experiment on 15 unseen languages from four language groups: Slavic, Germanic, African and Indo-Aryan, on POS tagging and NER tasks. In each group, we train on multiple (3 to 4) source languages (including English), for which task-specific training data and LAs are available. ZGUL obtains substantial improvements on unseen languages compared to strong baselines like SFT and CPG (Conditional Parameter Generation (Üstün et al., 2020)) in a purely zero-shot setting. Detailed ablations show the importance of each component of ZGUL. We perform attention analysis to assess whether the weights learned for the source languages' LAs are consistent with their relatedness to the target language.
Further, we study two additional scenarios, where (1) some unlabeled data, and (2) some task-specific training data are available for the target language. We extend ZGUL in these settings and find that our extensions continue to outperform competitive baselines, including ones that use unlabeled data for a target language to either (1) pre-train mBERT or (2) train the target's LA.
Our contributions can be summarized as: (1) We propose a strong method (ZGUL) to combine pretrained language adapters during training itself. To the best of our knowledge, we are the first to systematically attempt this in the context of LAs. ZGUL further incorporates test-time tuning of LA weights using Entropy Minimization (EM). (2) ZGUL outperforms competitive multi-source baselines for zero-shot transfer on languages unseen in mBERT.
(3) ZGUL exhibits a strong correlation between the learned attention scores over adapters and the linguistic relatedness between source and target languages. (4) ZGUL achieves competitive results in a few-shot setting, where a limited amount of labeled (target-language) training data is available. (5) When target-language unlabeled data is available, a modification of ZGUL outperforms baselines for nine out of twelve languages. To encourage reproducibility, we publicly release our code and trained models.

Single-source Adapter Tuning: We build on MAD-X (Pfeiffer et al., 2020b), which introduces two phases of adapter training.
1. Pretraining a language adapter (LA) for each language L_i: an LA is inserted in each layer of the transformer model M (denoted L_i • M) and trained on unlabeled data for language L_i using the MLM objective.
2. Training a TA for a task T_j: the LA for the source language L_src is stacked with the TA for task T_j (denoted T_j • L_src • M), in which T_j and the task-specific prediction head are the only trainable parameters.
During inference, L_src is replaced with L_tgt, i.e., T_j • L_tgt • M is used. The MAD-X paradigm uses only one LA for a given input sentence. Also, it assumes the availability of L_tgt. If it is not available, the English adapter (He et al., 2021; Pfeiffer et al., 2020b) or a related language's adapter (Wang et al., 2021) is used at test time.
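For illustration, below is a minimal PyTorch sketch of this two-phase stacking; the bottleneck adapter, the layer wiring, and the test-time swap of L_src for L_tgt follow the description above but are not the exact MAD-X implementation.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Pfeiffer-style bottleneck adapter: down-project, non-linearity, up-project, residual."""
    def __init__(self, hidden: int = 768, reduction: int = 16):
        super().__init__()
        self.down = nn.Linear(hidden, hidden // reduction)
        self.up = nn.Linear(hidden // reduction, hidden)
        self.act = nn.ReLU()

    def forward(self, h):
        return h + self.up(self.act(self.down(h)))

class MadxLayerAdapters(nn.Module):
    """One transformer layer's stack T_j . L . M: a frozen language adapter
    followed by a trainable task adapter, applied to the layer's hidden states."""
    def __init__(self, lang_adapters: dict, task_adapter: BottleneckAdapter, active_lang: str):
        super().__init__()
        self.lang_adapters = nn.ModuleDict(lang_adapters)   # pretrained with MLM, kept frozen
        self.task_adapter = task_adapter                     # the only trainable module
        self.active_lang = active_lang

    def forward(self, h):
        h = self.lang_adapters[self.active_lang](h)          # L_src during training
        return self.task_adapter(h)

# Training: stack the source LA under the task adapter, freeze all LAs.
layer = MadxLayerAdapters({"en": BottleneckAdapter(), "hi": BottleneckAdapter()},
                          task_adapter=BottleneckAdapter(), active_lang="en")
for p in layer.lang_adapters.parameters():
    p.requires_grad_(False)

# Zero-shot inference: swap in the target LA (if one exists), keep the task adapter.
layer.active_lang = "hi"
out = layer(torch.randn(2, 5, 768))
```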
Adapter Combination: Pfeiffer et al. (2021a) introduce AdapterFusion, a technique that combines multiple pretrained TAs T_1, ..., T_n to solve a new target task T_{n+1}. It learns the attention weights over T_1, ..., T_n while being fine-tuned on the data for T_{n+1}. Vu et al. (2022) adapt this technique for fusing domains and testing on out-of-domain data. This technique has not been applied in the context of LAs so far. The recent release of 50 LAs on AdapterHub enables studying this for LAs.
Recently, Wang et al. (2021) propose EMEA (Entropy Minimized Ensembling of Adapters) for efficiently combining multiple LAs at inference time. EMEA calculates the entropy of the prediction at test time and adjusts the LA attention scores (initialized uniformly) using gradient descent, aiming to give higher importance to the LA that increases the confidence of the prediction. However, training is still conducted using English as a single source.

Generation of LA using Shared Parameters: Üstün et al. (2020) employ the Conditional Parameter Generation (CPG) technique (Platanios et al., 2018) for training on multiple source languages. They utilize a CPG module, referred to as CPGAdapter, which takes a typological language vector as input and generates a Language Adapter (LA). The CPGAdapter is shared across all source languages and trained from scratch for a specific task. Since an LA is determined by the input language's vector, this approach can directly generalize to unseen languages. However, it is worth noting that CPG is data-intensive as it learns the parameters of the CPGAdapter from scratch.
We note that CPG falls under the broader category of hypernetworks, which generate weights for a larger main network (Ha et al., 2016) and have recently been explored successfully for task mixing (Karimi Mahabadi et al., 2021). In our experiments, we include a comparison with the CPG method.
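As a rough illustration of the CPG/hypernetwork idea (not Üstün et al.'s exact architecture), the sketch below generates the weights of a bottleneck adapter from a typological language vector, so one shared generator serves all languages; the dimensions and the generator shape are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CPGAdapter(nn.Module):
    """Hypernetwork mapping a language vector to the parameters of a bottleneck adapter.
    The generator is shared across languages; only the language vector changes."""
    def __init__(self, lang_dim: int = 32, hidden: int = 768, bottleneck: int = 48):
        super().__init__()
        self.hidden, self.bottleneck = hidden, bottleneck
        n_params = hidden * bottleneck + bottleneck + bottleneck * hidden + hidden
        self.generator = nn.Sequential(nn.Linear(lang_dim, 256), nn.ReLU(), nn.Linear(256, n_params))

    def forward(self, h, lang_vec):
        p = self.generator(lang_vec)                                   # flat parameter vector
        i = 0
        w_down = p[i:i + self.hidden * self.bottleneck].view(self.bottleneck, self.hidden); i += w_down.numel()
        b_down = p[i:i + self.bottleneck]; i += self.bottleneck
        w_up = p[i:i + self.bottleneck * self.hidden].view(self.hidden, self.bottleneck); i += w_up.numel()
        b_up = p[i:]
        z = torch.relu(F.linear(h, w_down, b_down))                    # generated down-projection
        return h + F.linear(z, w_up, b_up)                             # generated up-projection, residual

adapter = CPGAdapter()
h = torch.randn(2, 5, 768)            # hidden states for a batch of 2 sentences, 5 tokens each
lang_vec = torch.rand(32)             # typological feature vector for the input language (placeholder)
out = adapter(h, lang_vec)
```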

Model for Ensembling of Adapters
Our goal is to combine a set of source LAs for optimal zero-shot cross-lingual transfer to unseen languages, which are neither in mBERT nor have readily available labeled or unlabeled data. Similar to previous works (Pfeiffer et al., 2020b; Wang et al., 2021), we focus on languages whose scripts are seen in mBERT. Handling unseen languages with unseen scripts is a more challenging task, which we leave for future research.
Our approach can be described using two high-level components: (1) train-time ensembling and (2) test-time ensembling, explained below for every layer l (for notational simplicity, we skip using l in the notation but clarify wherever required).

Train-time Ensembling
During training, we make use of an attention mechanism inspired by the combination of task adapters explored in Pfeiffer et al. (2021a). While they focus on creating an ensemble of task adapters, our focus is on combining language adapters. Additionally, we identify valuable information available in typological language vectors (Littell et al., 2017) that we can leverage. To achieve this, we design two sub-components in our architecture, which we later combine in each layer (refer to Figure 1).

Token-based Attention (FUSION): This sub-network computes the local attention weights over source LAs for each token, using the output of the feed-forward layer as its query and the individual language adapters' outputs as both the keys and the values. Mathematically, for the $t$-th token, the embedding obtained after passing through the feed-forward layer becomes the query vector $q^{(t)}$, and the individual LAs' outputs stacked together form the key and value matrices $K^{(t)}$ and $V^{(t)}$. The attention weights of the source LAs for the $t$-th token are computed using dot-product attention between the projected query $W_q q^{(t)}$ and the projected key matrix $W_k K^{(t)}$:
$$s^{(t)} = \mathrm{softmax}\big((W_q q^{(t)})^{\top} W_k K^{(t)}\big)$$
The FUSION output for the $t$-th token is the attention-weighted sum of the projected LA outputs:
$$o_F^{(t)} = \sum_{i} s_i^{(t)}\, W_v v_i^{(t)}$$
where $v_i^{(t)}$ (the $i$-th column of $V^{(t)}$) is the output of the $i$-th source LA. Here, $W_q$, $W_k$, $W_v$ are projection matrices, which are different for every layer $l$.

Language-vector-based Attention (LANG2VEC): This sub-network computes the global attention weights for the source LAs, using the input language vector (of the token) as the query, the language vectors of the source languages as the keys, and the outputs of the individual LAs as the values. Mathematically, the attention weights over the LAs are obtained through a projected dot-product attention between the input language vector $l_{inp}$ as the query vector and the source language vectors stacked as a key matrix $L_{src}$.
Here, the language vectors are derived by passing the binary language features $lf$ through a single-layer trainable MLP:
$$l_{inp} = \mathrm{MLP}(lf_{inp})$$
The LANG2VEC attention scores for the $t$-th token are given by:
$$a^{(t)} = \mathrm{softmax}\big((W_L\, \mathrm{lang}[t])^{\top} L_{src}\big)$$
Here $W_L$ is a projection matrix associated differently with each layer $l$, and $\mathrm{lang}[t]$ denotes the language vector of the $t$-th token in the input. The LANG2VEC output for the $t$-th token is the corresponding weighted sum of the individual LAs' outputs:
$$o_L^{(t)} = \sum_{i} a_i^{(t)}\, v_i^{(t)}$$
Since $\mathrm{lang}[t]$ (and hence $a^{(t)}$) is the same across all tokens in an example, and also across all examples in a language, we refer to the LANG2VEC attention as global. On the other hand, FUSION computes local attention scores that depend purely on the token-level outputs of the feed-forward layer and hence are local to each token in any given input sentence.
Combining the two ensembling modules: We pass the input sentence through both networks, and for the $t$-th token receive the outputs $o_F^{(t)}$ and $o_L^{(t)}$, corresponding to the FUSION and LANG2VEC networks, respectively. These two vectors are concatenated and passed through a fully connected layer. The output of this linear layer, denoted $o_{LA}^{(t)}$, serves as input to the task adapter (TA), which produces the final output of the layer. The above process is repeated for each layer $l$ of the transformer architecture; Figure 1 shows the FUSION network (left) and the LANG2VEC network (right), whose outputs are concatenated and sent to a linear layer followed by a TA in every transformer layer $l$. We note that the LAs are kept frozen throughout the training process, with only the TA and the parameters of the FUSION and LANG2VEC modules being trainable. The training objective is word-level cross-entropy loss for all models. For model selection, we evaluate ZGUL and other baseline models on the combined dev set of the source languages' data. We do not use any target-language dev set, as doing so would violate the zero-shot assumption on target-language data.
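To make the layer-level computation concrete, here is a minimal PyTorch sketch of one ZGUL layer under our own simplifying assumptions: the projection shapes, the language-feature dimensions, the residual wiring into the TA, and all variable names are illustrative rather than taken from the released implementation.

```python
import torch
import torch.nn as nn

class ZGULLayerEnsemble(nn.Module):
    """One layer's ensembling of n frozen source LAs via FUSION (token-level)
    and LANG2VEC (global) attention, followed by a linear combiner and the TA."""
    def __init__(self, lang_adapters, lang_feat_dim, hidden=768, lang_dim=64):
        super().__init__()
        self.lang_adapters = nn.ModuleList(lang_adapters)     # frozen, MLM-pretrained LAs
        # FUSION: per-layer query/key/value projections (AdapterFusion-style)
        self.W_q = nn.Linear(hidden, hidden, bias=False)
        self.W_k = nn.Linear(hidden, hidden, bias=False)
        self.W_v = nn.Linear(hidden, hidden, bias=False)
        # LANG2VEC: binary typological features -> language vector, plus a query projection
        self.lang_mlp = nn.Linear(lang_feat_dim, lang_dim)
        self.W_L = nn.Linear(lang_dim, lang_dim, bias=False)
        # combiner and task adapter
        self.combine = nn.Linear(2 * hidden, hidden)
        self.task_adapter = nn.Sequential(nn.Linear(hidden, 48), nn.ReLU(), nn.Linear(48, hidden))

    def forward(self, h, inp_lang_feats, src_lang_feats):
        # h: (batch, seq, hidden); *_feats: binary URIEL-style feature vectors
        la_out = torch.stack([la(h) for la in self.lang_adapters], dim=2)   # (b, t, n, d)

        # FUSION: local, token-level attention over the LA outputs
        q = self.W_q(h).unsqueeze(2)                                        # (b, t, 1, d)
        k, v = self.W_k(la_out), self.W_v(la_out)                           # (b, t, n, d)
        s = torch.softmax((q * k).sum(-1), dim=-1)                          # (b, t, n)
        o_fusion = (s.unsqueeze(-1) * v).sum(2)                             # (b, t, d)

        # LANG2VEC: global attention from language vectors (same for every token)
        l_inp = self.lang_mlp(inp_lang_feats)                               # (lang_dim,)
        l_src = self.lang_mlp(src_lang_feats)                               # (n, lang_dim)
        a = torch.softmax(l_src @ self.W_L(l_inp), dim=-1)                  # (n,)
        o_lang = torch.einsum("n,btnd->btd", a, la_out)                     # (b, t, d)

        o_la = self.combine(torch.cat([o_fusion, o_lang], dim=-1))          # concatenate and project
        return h + self.task_adapter(o_la)                                  # residual into the TA (assumed wiring)
```

Stacking such a module after the feed-forward sub-layer of every mBERT layer, with the LA parameters frozen, yields exactly the trainable pieces described above: the TA, the FUSION projections, and the LANG2VEC parameters.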

Test-time Ensembling
Wang et al. (2021) introduced EMEA, an inference-time Entropy Minimization (EM)-based algorithm that adjusts the LA attention scores, initializing them from uniform (as mentioned in Sec. 2). In our case, since we learn the attention scores during training itself, we seek to further leverage the EM algorithm by initializing it with our learnt networks' weights. First, we compute the entropy of ZGUL's predicted labels, averaged over all words in the input sentence, using an initial forward pass of our trained model. Since ZGUL has two different attention-based networks, FUSION and LANG2VEC, its trainable parameters at test time are the attention weights of both these networks. We backpropagate the computed entropy and update both sets of attention weights using SGD. In the next iteration, the entropy is computed again using a forward pass with the modified attention weights. This process is repeated for T iterations, where T and the learning rate lr are hyperparameters tuned on the dev set of the linguistically most related source language for each target (based on distributed similarity, shown in Figure 3; grid search details in Appendix A). The detailed EMEA algorithm is presented in Algo. 1 (Appendix).
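A compact sketch of this test-time loop is shown below; the `model`, `attention_params`, and the token-level `log_probs` interface are hypothetical stand-ins for the trained ZGUL model and its two sets of attention parameters.

```python
import torch

def entropy_minimization(model, batch, attention_params, T=10, lr=0.1):
    """Test-time tuning of the FUSION and LANG2VEC attention weights by
    minimizing the mean token-level prediction entropy on a single batch."""
    optimizer = torch.optim.SGD(attention_params, lr=lr)
    for _ in range(T):
        optimizer.zero_grad()
        log_probs = model(batch)                         # (batch, seq, num_labels), log-softmaxed
        entropy = -(log_probs.exp() * log_probs).sum(-1).mean()
        entropy.backward()                               # gradients flow only into the attention params
        optimizer.step()
    with torch.no_grad():
        return model(batch).argmax(-1)                   # final predictions after T EM updates
```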

Experiments
We aim to address the following questions. (1) How does ZGUL perform in a zero-shot setting compared to the other baselines on unseen languages? What is the incremental contribution of ZGUL's components to the performance on LRLs?
(2) Are the LA attention weights learnt by ZGUL interpretable, i.e., do genetically/syntactically more similar source languages get higher attention scores? (3) How does ZGUL's performance change after incorporating unlabelled target-language data? (4) How does ZGUL's performance vary in a few-shot setting, where a few training examples of the target language are provided to the model for fine-tuning?

Datasets, Tasks and Baselines
Datasets and Tasks: We experiment with 4 diverse language groups: Germanic, Slavic, African and Indo-Aryan. Following previous works (Wang et al., 2021; Pfeiffer et al., 2020b), we choose named entity recognition (NER) and part-of-speech (POS) tagging tasks. We select target languages which are unseen in mBERT, subject to the availability of test sets for each task. This leads us to a total of 15 target languages spanning Germanic and Slavic for POS, and African and Indo-Aryan for NER. For the African and Indo-Aryan NER experiments, we use the MasakhaNER (Adelani et al., 2021) and WikiAnn (Pan et al., 2017) datasets respectively. For POS experiments, we use the Universal Dependencies treebanks v2.5 (Nivre et al., 2020). We pick training languages from each group that have pre-trained adapters available. The details of training and test languages, as well as the corresponding task for each group, are presented in Table 1. For detailed statistics, please refer to Tables 13, 14 and 16.

Baselines: We experiment with two sets of baselines. In the first set, the baselines use only English as the single source language during training. In the second set, we compare against models trained on multiple source languages (belonging to a group), in addition to English:
• SFT-M: Standard fine-tuning on data from all source languages.
• MADX_multi-{S}: We naturally extend MADX-En to the multi-source scenario by dynamically switching on the LA corresponding to the input sentence's language during training (see the sketch below). S refers to the inference strategy, which can be one of the following (as described for the English baselines above): En, Rel, Uniform ensembling, or EMEA ensembling.
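A minimal sketch of the multi-source training loop implied by MADX_multi is given below; the batching-by-language and the `set_active_language_adapter` hook are hypothetical, as adapter frameworks expose this switching differently.

```python
import torch

def train_epoch(model, loaders_by_language, optimizer, loss_fn):
    """MADX_multi-style training: each batch comes from a single source language,
    and the corresponding (frozen) language adapter is activated for that batch."""
    for lang, loader in loaders_by_language.items():
        for batch in loader:
            model.set_active_language_adapter(lang)      # hypothetical hook: switch on the LA for `lang`
            optimizer.zero_grad()
            logits = model(batch["input_ids"], attention_mask=batch["attention_mask"])
            loss = loss_fn(logits.view(-1, logits.size(-1)), batch["labels"].view(-1))
            loss.backward()                              # only the task adapter / head receive gradients
            optimizer.step()
```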
It is important to note that the EM algorithm is not applicable to SFT and CPG baselines because SFT does not use an adapter, while CPG has only a single (shared) adapter.Consequently, there are no ensemble weights that can be tuned during inference for these methods.The EM algorithm is a distinctive feature of the ensemble-based methods like ZGUL, which allows for further optimization and performance improvement.
Evaluation Metric: We report micro-F1 evaluated on each token using the seqeval toolkit (Nakayama, 2018). For all experiments, we report the average F1 from three training runs of the models with three different random seeds. The standard deviation is reported in Appendix G.
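For reference, a minimal example of seqeval-based scoring of BIO-tagged sequences (the tag sequences below are toy data, not from our datasets):

```python
from seqeval.metrics import f1_score, classification_report

# Gold and predicted label sequences for two sentences (BIO-tagged NER).
y_true = [["B-PER", "I-PER", "O", "B-LOC"], ["O", "B-ORG", "O"]]
y_pred = [["B-PER", "I-PER", "O", "O"],     ["O", "B-ORG", "O"]]

print(f1_score(y_true, y_pred))              # micro-averaged F1 over spans
print(classification_report(y_true, y_pred)) # per-class precision/recall/F1
```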

Results: Zero-Shot Transfer
Tables 2 and 3 present experimental findings for Germanic and Slavic POS, and for African and Indo-Aryan NER, respectively. ZGUL outperforms the other baselines for 10 out of 15 unseen test languages. On POS, ZGUL achieves a respectable gain of 1.8 average F1 points for Germanic and a marginal improvement of 0.4 points for Slavic compared to its closest baseline, CPG, with the gains being particularly impressive for Gothic, Swiss German and Pomak. On NER, ZGUL achieves decent gains of 3.2 points and 0.9 points for the Indo-Aryan and African groups respectively over the closest baseline, SFT-M, with gains of up to 4 F1 points for Luo. Moreover, baselines trained on English as a single source perform significantly worse (up to a 24-point gap for Indo-Aryan), highlighting the importance of multi-source training for effective cross-lingual transfer. We note that CPG outperforms SFT-M for POS tagging, but the order switches for NER. This is to be expected due to the huge number of parameters in CPG (details in Sec. A) and the smaller sizes of the NER datasets compared to POS (details in Table 13).
We also observe a substantial performance gap between MADX_multi-Rel and ZGUL, with the former performing up to 7.5 average F1 points worse than ZGUL for the Indo-Aryan group. This demonstrates that relying solely on the most related LA is sub-optimal compared to ZGUL, which leverages aggregated information from multiple LAs. Additionally, MADX_multi-Uniform, which does a naive averaging of LAs, performs even worse overall. Though MADX_multi-EMEA shows some improvement over it, it remains below ZGUL's performance by an average of about 4 points over all languages. This finding highlights the effectiveness of ZGUL-style training, as the EM algorithm benefits from an informed initialization of weights, rather than a naive uniform initialization strategy. More analysis on this follows in Sec. 4.3. Ablation results in the last three rows of Tables 2 and 3 examine the impact of each of the three components, FUSION, LANG2VEC and Entropy Minimization (EM), on ZGUL's performance. We observe a positive impact of each component for each language group in terms of average F1 scores. For individual languages as well, we see an improvement in F1 due to each component, the exceptions being Kinyarwanda and Luganda, where EM marginally hurts performance. This could occur when wrong predictions are confident ones, and performing EM over those predictions might then hurt overall performance.

Interpretability w.r.t. Attention Scores
We wanted to examine if the attention weights computed by ZGUL are interpretable. In order to do this, we computed the correlation between the (final) attention scores computed by ZGUL at inference time and the language relatedness with the source, for each of the test languages. Since ZGUL has two different networks for computing the attention scores, i.e., the FUSION network and LANG2VEC, we correspondingly compute the correlation with respect to the average attention scores in both these networks. For both networks, we compute the average of scores across all tokens, layers, as well as examples in a target language. To compute the language relatedness, we use the distributed similarity metric obtained as the average of the syntactic and genetic similarities (we refer to Appendix C for details). Table 4 presents the results. Clearly, we observe a high correlation between the attention scores computed by ZGUL and language relatedness, especially for the Slavic and Indo-Aryan groups, for both attention networks. This means that the model is assigning higher attention scores to languages which are more related in a linguistic sense; in other words, language relatedness can be thought of as a reasonable proxy for deciding how much (relative) importance to give to the LA of each source language when building an effective predictive model (for our tasks). Further, we note that among the two networks, the scores are particularly higher for LANG2VEC, which we attribute to the fact that this network explicitly uses language features as its input, and is therefore in a better position to directly capture language relatedness, compared to the FUSION network, which has to rely on this information implicitly through the tokens in each language.
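A small sketch of this style of analysis, assuming Pearson correlation and using placeholder attention and similarity values (the real scores are averaged over tokens, layers and examples as described above):

```python
from scipy.stats import pearsonr

# Per target language: average attention on each source LA and the corresponding
# source-target relatedness (values below are placeholders, one entry per source).
avg_attention = {"target_1": [0.41, 0.22, 0.19, 0.18],
                 "target_2": [0.15, 0.38, 0.27, 0.20]}
relatedness   = {"target_1": [0.78, 0.45, 0.41, 0.39],
                 "target_2": [0.33, 0.61, 0.52, 0.44]}

for lang in avg_attention:
    r, p = pearsonr(avg_attention[lang], relatedness[lang])
    print(f"{lang}: correlation={r:.2f} (p={p:.2f})")
```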
For the sake of comparison, we also include the correlations for the MADX_multi-EMEA model (referred to as M_m-EM for brevity), which does LA ensembling purely at inference time, to contrast against the effect of learning the ensemble weights during training, as in ZGUL. Clearly, the scores are significantly lower in this case, indicating that EMEA alone is not able to capture a good notion of language relatedness, which possibly explains its weaker performance as well (as observed in Tables 2 and 3).
For completeness, Figures 2 and 3 present the attention scores of the LANG2VEC network and the language relatedness to the source languages, for each test language, grouped by language group. The similarity between the two heat maps (depicted via color coding) again corroborates the high correlation between ZGUL's attention scores and language relatedness.

Leveraging Unlabeled Target Data
Is ZGUL useful in case some amount of unlabeled data is available for the target language? To answer this, we use Wikipedia dumps, which are available for 12 out of our 15 target languages. For each language L_tgt, (1) we train its language adapter LA_tgt, and (2) we pre-train the mBERT model using MLM, denoted mBERT_tgt. We also extend ZGUL to ZGUL++ as follows: we initialize ZGUL's encoder weights with mBERT_tgt and fine-tune it with the additional adapter LA_tgt inserted alongside the other source LAs. This is trained only on the source languages' task-specific data (as target-language training data is not available). We compare ZGUL++ with (1) MADX_multi-Tgt, which trains MADX in the multi-source fashion and, at inference, uses LA_tgt, and (2) SFT++, which initializes SFT's encoder weights with mBERT_tgt and fine-tunes on the source languages' data.
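As a rough illustration of the mBERT_tgt pre-training step, the sketch below shows a standard Hugging Face MLM fine-tuning recipe; the file path, hyperparameters, and data sizes are placeholders rather than the settings used in our experiments.

```python
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

# Unlabeled target-language text, one sentence per line (path is a placeholder).
raw = load_dataset("text", data_files={"train": "target_wiki.txt"})
tokenized = raw.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=256),
                    batched=True, remove_columns=["text"])

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir="mbert_tgt", num_train_epochs=3,
                         per_device_train_batch_size=16, learning_rate=5e-5)

Trainer(model=model, args=args, train_dataset=tokenized["train"],
        data_collator=collator).train()    # yields the mBERT_tgt initialization for ZGUL++
```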
Table 5 shows ZGUL++'s average gain of 2 F1 points over our competitive baseline SFT++. ZGUL++ achieves SOTA performance for 9 of the 12 languages, while its ablated variant (not using LA_tgt) does so for two more. The gains for Swiss German (6.6 F1 points) are particularly impressive. Ablations for ZGUL++ show that although incorporating LA_tgt is beneficial, with an average gain of about 0.6 F1 points over all languages, the crucial component is initializing with mBERT_tgt, which leads to around 7 average F1 points of gain. Hence, the additional target pre-training step is crucial, in conjunction with utilizing the target LA, for effectively exploiting the unlabeled data.
We investigate how the performance of each model scales from the zero-shot setting (no unlabeled data) to utilizing 100% of the target-language Wikipedia data. We sample two bins, containing 25% and 50% of the full target data, and plot the average F1 scores over all 12 languages for each bin in Fig. 4. We observe that ZGUL++ is effective across the regime. Compared to SFT++, the gains are larger in the 100% regime, while compared to the MADX-Tgt baseline, a steep gain is observed with just 25% of the data.

Few-Shot Performance
In this experiment, we take the trained ZGUL and other multi-source models, i.e., CPG and SFT-M, and fine-tune them further on a few labeled examples from the train set of each target language. We do this for those test languages whose training set is available (12 out of 15). We sample training bins of sizes 10, 30, 70 and 100 examples. We observe in Fig. 5 that ZGUL scales smoothly for all language groups, maintaining its dominance over the baselines in each case, except for Slavic, where its performance is similar to the CPG baseline. The relative ordering of baselines, i.e., CPG outperforming SFT-M for Slavic and Germanic (POS), and SFT-M outperforming CPG for African and Indo-Aryan (NER), is also maintained across the regime of few-shot samples, similar to the zero-shot setting. The learning plateau is not reached in the curves for any of the language groups, showing that adding more target examples would likely result in further improvement of all the models, albeit at a smaller pace. We present detailed language-wise few-shot curves in Appendix H.
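A minimal sketch of this few-shot fine-tuning protocol is given below; the sampling seed, learning rate, number of epochs, and the batch interface of `trained_model` are illustrative assumptions.

```python
import random
import torch

def few_shot_finetune(trained_model, target_train_set, k=30, epochs=10, lr=1e-5):
    """Fine-tune an already multi-source-trained model on k labeled target examples."""
    random.seed(42)
    few_shot_bin = random.sample(list(target_train_set), k)     # e.g. k in {10, 30, 70, 100}
    optimizer = torch.optim.AdamW(trained_model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss(ignore_index=-100)
    trained_model.train()
    for _ in range(epochs):
        for ex in few_shot_bin:
            optimizer.zero_grad()
            logits = trained_model(ex["input_ids"], attention_mask=ex["attention_mask"])
            loss = loss_fn(logits.view(-1, logits.size(-1)), ex["labels"].view(-1))
            loss.backward()
            optimizer.step()
    return trained_model
```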

EM tuning using Target Dev set
In the purely zero-shot setting, we tuned the EM parameters for the ensemble-based methods, i.e., MADX_multi-EMEA and ZGUL, on the closest source language's dev set. However, if we assume target dev set availability, which indeed holds for our target languages, one can leverage it for EM hyperparameter tuning. We present the results in Tables 6 and 7. The gains of ZGUL become more pronounced, reaching up to 1.4 average F1 points in the Indo-Aryan group.

Conclusion and Future Work
We present ZGUL, a novel neural model for ensembling pre-trained language adapters (LAs) for multi-source training. This is performed by fusing the LAs at train time to compute local token-level attention scores, along with typological language vectors to compute a second, global attention score; the two are combined for effective training. Entropy Minimization (EM) is carried out at test time to further refine those attention scores. Our model obtains strong performance for languages unseen by mBERT but with seen scripts. We present various analyses, including that the learnt attention weights have significant correlation with the linguistic similarity between source and target languages, and demonstrate the scalability of our model in the unlabeled-data and few-shot labeled-data settings as well.
In the future, our approach, being task-agnostic, can be applied to more non-trivial tasks, such as generation (Kolluru et al., 2021, 2022), semantic parsing (Awasthi et al., 2023), relation extraction (Rathore et al., 2022; Bhartiya et al., 2022), and knowledge graph completion (Chakrabarti et al., 2022; Mittal et al., 2023). Our technique may complement other approaches for morphologically rich languages (Nzeyimana and Rubungo, 2022) and for domain-specific tasks that lack sufficient publicly available data in the relevant language and domain to train a strong adapter (e.g., the medical domain for African languages). Presently, our technique cannot be tested directly on unseen scripts because our tokenization/embedding layer is the same as that of mBERT and may become a bottleneck for adapters to perform well. Our approach is also not currently tested on deep semantic tasks and generation-based tasks, owing to the lack of suitable large-scale datasets for evaluation.

(Algorithm 1, Appendix: the EM loop iterates for t ← 0 to T − 1, updating the token-attention weights and the LANG2VEC attention weights in each iteration.)

We note that we made the following amendments to the originally proposed EMEA (Wang et al., 2021) for our setting: (1) we made the token-level attention weights in each layer trainable, while the original EMEA had tied them layer-wise, giving the EM method more degrees of freedom in our framework compared to EMEA; (2) we have two attention networks, each initialized with its respective trained attention weights.
We have used the seqeval framework for evaluating all the models, which is consistent with previous works and is used by XTREME. Seqeval removes the 'B' and 'I' prefixes of the labels; hence Table 21 has only 4 classes (e.g., 'B-PER' and 'I-PER' are mapped to the same label 'PER').

Figure 2 :
Figure 2: LANG2VEC attention scores for each of the test languages (clustered group-wise).

Figure 3 :
Figure 3: Language relatedness between target and source languages in each group, computed as average of syntactic and genetic similarity metrics.

Figure 4 :
Figure 4: Average F1 scores over target languages w.r.t. percentage of Wikipedia data used. 0 denotes zero-shot.

Figure 5 :
Figure 5: Few-shot F1 averaged over languages in a group for various few-shot bins. Top row: Germanic, Slavic. Bottom row: African, Indo-Aryan.

Figure 8 :
Figure 8: Various similarity metrics between source and target languages (higher means more similar). This is used to validate the assignment of the target languages to their corresponding groups, as well as to depict the correlation with the LA attention scores learnt by the LANG2VEC component.

Figure 9 :
Figure 9: Orv language

Table 1 :
Language groups or sets of related source and target languages, along with tasks.*English (En) is added to train set in each case.

Table 2 :
F1 of POS Tagging Results for Germanic and Slavic language groups. * denotes p-value < 0.005 for McNemar's test on aggregated results over all test languages in a group.

Table 3 :
F1 of NER Results for African and Indo-Aryan language groups. * denotes p-value < 0.005 for McNemar's test on aggregated results over all test languages in a group.

Table 4 :
Correlation between ZGUL's attention scores and the syntactic-genetic (averaged) similarity of source-target pairs, for both the LANG2VEC and FUSION networks and for the EMEA (multi-source) baseline.

Table 5 :
F1 scores after incorporating target unlabeled data into various models. Unlabeled data size is in # sentences.

Table 6 :
POS Tagging Results for Germanic and Slavic groups when utilizing target dev set for EM tuning

Table 7 :
NER Results for African and Indo-Aryan groups when utilizing the target dev set for EM tuning.

Table 9 :
Trainable Parameters & per-epoch training time of all models

Table 18 :
Example from the hau language

Table 19 :
Example from the luo language

F Class-wise F1 scores

Table 20 and Table 21 show the classwise F1 scores for each of the tasks. The scores are averaged over all languages in each task.

Table 20 :
Classwise F1-scores for POS task. Averaged over all languages.

Table 21 :
Classwise F1-scores for NER task. Averaged over all languages.

Table 22 :
F1 Std. Dev. (rounded to 1 decimal) of POS Tagging for Germanic and Slavic language groups. Note: the Avg column denotes the std. dev. of the average F1 and not the average of the std. devs.