Embedding Structure Matters: Comparing Methods to Adapt Multilingual Vocabularies to New Languages

Pre-trained multilingual language models underpin a large portion of modern NLP tools outside of English. A strong baseline for specializing these models for specific languages is Language-Adaptive Pre-Training (LAPT). However, retaining a large cross-lingual vocabulary and embedding matrix comes at considerable excess computational cost during adaptation. In this study, we propose several simple techniques to replace a cross-lingual vocabulary with a compact, language-specific one. Namely, we address strategies for re-initializing the token embedding matrix after vocabulary specialization. We then provide a systematic experimental comparison of our techniques, in addition to the recently proposed FOCUS method. We demonstrate that: 1) Embedding-replacement techniques in the monolingual transfer literature are inadequate for adapting multilingual models. 2) Replacing cross-lingual vocabularies with smaller specialized ones provides an efficient method to improve performance in low-resource languages. 3) Simple embedding re-initialization techniques based on script-wise sub-distributions rival techniques such as FOCUS, which rely on similarity scores obtained from an auxiliary model.


Introduction
For languages other than English and a handful of other very high-resource languages, pre-trained multilingual language models form the backbone of most current NLP systems. These models address the relative data scarcity in most non-English languages by pooling text data across many languages to train a single model that (in theory) covers all training languages (Devlin, 2019; Conneau and Lample, 2019; Conneau et al., 2020; Liu et al., 2020; Scao et al., 2023, i.a.). These models often include language-agnostic tokenization and an increased vocabulary capacity over monolingual models (Conneau et al., 2020).
However, Wu and Dredze (2020) show that these massively multilingual models still underperform on lower-resource languages. Recent efforts to cover these languages instead pre-train models that are specialized to specific languages or language families (Ogueji et al., 2021; Ogunremi et al., 2023). These approaches nonetheless require training a new model from scratch and do not leverage transferable information in existing models.
Our study builds on a line of work which instead adapts a pre-trained cross-lingual model (such as XLM-R; Conneau et al., 2020) to a single language or a smaller set of languages. Language-Adaptive Pre-Training (LAPT), which continues the MLM or CLM pre-training task on only the target language(s), is a simple and strong baseline in this regard (Chau et al., 2020).
However, LAPT with no change to the cross-lingual vocabulary comes with considerable excess computational cost: when adapting to a single language or a small subset of languages, only a small fraction of the cross-lingual vocabulary is used. The excess vocabulary still contributes to the computational cost of both the forward and backward passes, and the embedding/output matrices often constitute a large fraction of the total trainable model parameters (for XLM-R-base, 192M / 278M ≈ 69% of parameters). Additionally, the tokenization modules of cross-lingual models are usually under-optimized, in information-theoretic terms, for any given language, and especially for low-resource languages (Ács, 2019; Conneau and Lample, 2019, i.a.). For this reason, we propose several simple techniques to replace the large cross-lingual vocabulary of a pre-trained model with a compact, language-specific one during model specialization. Training a new SentencePiece or BPE tokenizer poses no special difficulties. However, re-initializing the embedding matrix for a new vocabulary, which will almost certainly introduce many new tokens lacking pre-trained embeddings, poses significant challenges. We compare several methods for such embedding re-initialization.
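As a rough back-of-envelope sketch, the parameter counts quoted above can be reproduced directly. The hidden size of 768 and the 250,002-token vocabulary are XLM-R-base's; the 32,770-token replacement vocabulary matches our experimental setup in Section 4.

```python
# Back-of-envelope check of the embedding-parameter fraction quoted above.
HIDDEN = 768
FULL_VOCAB = 250_002      # XLM-R's cross-lingual vocabulary
REDUCED_VOCAB = 32_770    # compact language-specific replacement vocabulary
TOTAL_PARAMS = 278_000_000

full_emb = FULL_VOCAB * HIDDEN        # ~192M embedding parameters
reduced_emb = REDUCED_VOCAB * HIDDEN  # ~25M embedding parameters

print(f"embedding fraction: {full_emb / TOTAL_PARAMS:.0%}")
print(f"params after replacement: {TOTAL_PARAMS - full_emb + reduced_emb:,}")
```

Replacing the vocabulary thus removes roughly 167M embedding parameters, shrinking the model to around 111M parameters before any further training.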
After reviewing related literature in Section 2, we conduct a qualitative exploration of the pre-trained embedding space for a standard multilingual model: XLM-R (Section 3.1). This exploration informs our formalization of simple techniques to align new vocabulary embeddings with the pre-trained embedding distribution of our base model (Section 3.2). We then provide a systematic experimental comparison of the embedding re-initialization techniques we propose, plus the recently proposed FOCUS re-initialization method (Dobler and de Melo, 2023) (Section 4). Our experiments cover a wide selection of low- and mid-resource target languages, i.e. those that have the most to gain from language specialization. The results of our experiments (Sections 5, 6) demonstrate the following: 1) Embedding-replacement techniques proposed in the monolingual model adaptation literature are inadequate for adapting multilingual models. 2) Replacing large cross-lingual vocabularies with smaller language-specific ones provides a computationally efficient method to improve task performance in low-resource languages. 3) The simple re-initialization techniques we propose here, based on script-wise embedding sub-distributions, rival techniques such as FOCUS, which rely on model-driven semantic similarity.

Related Work
Pre-trained Model Adaptation Extensive work has proposed re-using and modifying pre-trained models for new settings in order to retain existing model knowledge and reduce pre-training costs. Gururangan et al. (2020) show that continued training on domain-specific data effectively adapts pre-trained models to new domains in both high- and low-resource settings. This approach has also been used to adapt models to new languages (i.e. Language-Adaptive Pre-Training / LAPT; Chau et al., 2020).
Other approaches involve training new, language-specific adapter layers to augment a frozen monolingual (Artetxe et al., 2020) or multilingual encoder (Pfeiffer et al., 2020; Üstün et al., 2020; Faisal and Anastasopoulos, 2022). A comparison of these cross-lingual adaptation approaches (Ebrahimi and Kann, 2021) found that continued pre-training often outperforms more complex setups, even in low-resource settings. With this in mind, our experiments evaluate the success of models tuned for target languages with LAPT, starting from variable initializations depending on the choice of embedding adaptation technique.
Cross-lingual Vocabulary Adaptation A major limitation in adapting pre-trained models to new languages is the subword vocabulary, which often fails to cover an unseen script (Pfeiffer et al., 2021) or tokenizes target text inefficiently (Ács, 2019). Muller et al. (2021) demonstrate that script is an extremely important factor in predicting transfer success. Specifically, the pre-trained coverage of closely-related languages improves transfer, but only if the target language is written in the same script as its pre-trained relative.
One adaptation technique is to initialize new subword embeddings that cover the target language, e.g. by expanding the existing vocabulary with new tokens as necessary, then training the new (randomly initialized) embeddings (Chau et al., 2020; Wang et al., 2020). When transferring a monolingual model to a new language, Artetxe et al. (2020) and de Vries and Nissim (2021) instead completely re-initialize the embedding matrix, corresponding to a new subword vocabulary. These embeddings are then trained into alignment with the pre-trained, frozen transformer encoder. We show that this technique is not successful when adapting a multilingual model (Section 5).
Other work reuses information in pre-trained embeddings rather than initializing new ones at random. This may include scaling up smaller embedding spaces from models trained on the target language (de Vries and Nissim, 2021; Ostendorff and Rehm, 2023) or copying embeddings from the original vocabulary where there is exact vocabulary overlap (Pfeiffer et al., 2021). When transferring to a target language written in a poorly-covered script, Muller et al. (2021) show that transliterating the target into the script of a well-covered relative can lead to significant performance gains.
Finally, recent work has proposed more complex methods for mapping source embeddings onto semantically similar ones in the target space, either through cross-lingually aligned static word embeddings (e.g. the WECHSEL method; Minixhofer et al., 2022) or with bilingual lexicons (Zeng et al., 2023). In concurrent work to ours, Dobler and de Melo (2023) extend WECHSEL with the FOCUS method to specialize multilingual vocabularies to a single language. Ostendorff and Rehm (2023) use a cross-lingual progressive transfer learning approach to combine information from the source embeddings and a smaller target-language model to initialize higher-dimensional target embeddings. Unlike earlier initialization methods and our proposed setup, these methods all require additional information outside the source model and often require significant additional compute. We compare one method from this family (FOCUS) to our proposed heuristic-based initialization schemes.

Vocabulary Replacement & Embedding Re-initialization

Research transferring monolingual models from one language to another (e.g. Artetxe et al., 2020; de Vries and Nissim, 2021) has shown that random re-initialization of embeddings plus LAPT is sufficient. However, our experiments show that this technique performs poorly when transferring from a multilingual model (Section 5). For this reason, we propose several simple techniques for initializing new embeddings based on a qualitative exploration of the embedding space for XLM-R (Section 3.1), and include the more complex FOCUS technique, developed concurrently with our work, for comparison (Dobler and de Melo, 2023).

XLM-R Embedding-Space Analysis
To better understand the task of initializing new embeddings for a multilingual model, we explore the token-embedding space of XLM-R through PCA projection. Our hypothesis is that multilingual models do not process all languages homogeneously. This seems to be demonstrated in Figures 1a and 1b, where word embeddings are colored by their respective Unicode script block. We see that the highest-resource scripts in XLM-R (Common, Latin, and Cyrillic) have relatively divergent distributions, while others cluster closer together. This heterogeneity may help explain the finding from Muller et al. (2021) that pre-trained models do not transfer well to even closely-related target languages if the target script does not match that of the pre-trained relative. Secondly, each script can be further divided into two sub-distributions, roughly corresponding to a shift in the second principal component. Figure 1c shows that this division corresponds to whether a token is word-initial or word-medial. To preserve whitespace information, SentencePiece tokens include a leading underscore to indicate tokens that should be preceded by a space (word-initial tokens).
Although the model does not have access to the internal makeup of its tokens, we hypothesize that it learns to discern which tokens can begin a word and which cannot. Thus, when proposing methods to initialize new embeddings for XLM-R, we hypothesize that initializing according to script- and position-wise sub-distributions will help to align new vocabulary items with the pre-trained embedding distribution.

Embedding Re-initialization Techniques
We now formalize simple techniques for embedding re-initialization based on our exploration of XLM-R's embedding space, as well as one recently proposed technique based on an auxiliary embedding model (FOCUS). Figure 2 provides PCA visualizations of the re-initialized embeddings from each technique on a subword vocabulary specialized for languages of the Uralic family (we experiment with these languages in Section 4). The visualization of these languages' respective scripts (Common, Latin, Cyrillic) in the base model can be found in Figure 1b for comparison.
Re-initialization by Identity REINIT-IDENT first identifies tokens in the new vocabulary that exactly match a token in the original vocabulary, then sets the new embeddings of shared tokens to be identical to those in the original embedding table (Figure 2a). This is a common approach to preserve information from the original model, even when the other embeddings are randomly re-initialized (e.g., Pfeiffer et al., 2021). When identity re-initialization is applied in conjunction with another technique (such as REINIT-SCRIPT), identity takes precedence.
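As a minimal sketch (not our exact implementation), identity re-initialization amounts to copying embedding rows for token strings shared between the two vocabularies; the fallback for unmatched tokens shown here, a zero-mean Gaussian, is a hypothetical stand-in for whichever other technique is combined with identity.

```python
import random

def reinit_identity(old_vocab, old_emb, new_vocab, dim, seed=0):
    """Copy embeddings for tokens shared between vocabularies (exact string
    match); all other rows fall back to a hypothetical random baseline."""
    rng = random.Random(seed)
    new_emb = []
    for tok in sorted(new_vocab, key=new_vocab.get):  # iterate in id order
        if tok in old_vocab:
            new_emb.append(list(old_emb[old_vocab[tok]]))  # copy source row
        else:
            new_emb.append([rng.gauss(0.0, 0.02) for _ in range(dim)])
    return new_emb
```

For example, if `"▁the"` appears in both vocabularies, its new row is identical to the source row, while an unseen token receives a fresh random vector.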
Re-initialization by Script For REINIT-SCRIPT, all base XLM-R tokens are first categorized by Unicode block, as a stand-in for identifying the script/orthography. We then calculate the mean and standard deviation for each script in the original embedding space. Finally, new token embeddings for each script are distributed according to a Normal distribution with the corresponding mean and standard deviation (Figure 2b).

Re-initialization by Position REINIT-POSN is based on the observation that within each script, embeddings seem to cluster according to their word-initial vs. word-medial status (Figure 1c). Similarly to REINIT-SCRIPT, we identify the mean and standard deviation of embeddings that belong to each category. Because positional status seems to form a sub-cluster within script clusters, we only use REINIT-POSN in combination with REINIT-SCRIPT.
The mean and standard deviation for each (script, position) combination are calculated, and new embeddings are initialized accordingly (Figure 2c).
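The combined scheme can be sketched as follows. This is an illustrative toy version under simplifying assumptions: script detection uses the first character's Unicode name prefix rather than a proper Unicode-block lookup, and per-dimension statistics stand in for the full parameterization.

```python
import random
import statistics as st
import unicodedata

def script_of(token):
    """Crude stand-in for Unicode-block lookup: the name prefix of the first
    non-underscore character (e.g. 'LATIN', 'CYRILLIC')."""
    ch = token.lstrip("▁")[:1] or "▁"
    try:
        return unicodedata.name(ch).split()[0]
    except ValueError:
        return "COMMON"

def bucket(token):
    # (script, word-initial?) pair, mirroring the script- and position-wise
    # sub-distributions described above
    return (script_of(token), token.startswith("▁"))

def fit_buckets(old_tokens, old_emb):
    """Per-bucket, per-dimension mean and standard deviation."""
    groups = {}
    for tok, vec in zip(old_tokens, old_emb):
        groups.setdefault(bucket(tok), []).append(vec)
    stats = {}
    for key, vecs in groups.items():
        dims = list(zip(*vecs))
        stats[key] = ([st.mean(d) for d in dims],
                      [st.pstdev(d) for d in dims])
    return stats

def reinit_script_posn(stats, new_tokens, dim, seed=0):
    """Sample each new embedding from the Normal fit to its bucket."""
    rng = random.Random(seed)
    fallback = ([0.0] * dim, [0.02] * dim)  # hypothetical unseen-bucket case
    return [[rng.gauss(m, s) for m, s in zip(*stats.get(bucket(t), fallback))]
            for t in new_tokens]
```

Dropping the position flag from `bucket` recovers plain REINIT-SCRIPT; the fallback standard deviation of 0.02 is an assumption, not a value from the paper.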

Experiments
In our experiments, we replace the large cross-lingual embedding matrix of XLM-R and re-initialize it for a new, language-specific vocabulary. We then conduct LAPT to specialize the model for the new language(s), and evaluate performance on downstream tasks. We consider both multilingual→monolingual and multilingual→multilingual transfer scenarios, the latter being transfer to a much smaller set of languages than the original cross-lingual training set. We compare our vocabulary-replacement techniques against the baseline performance of XLM-R off-the-shelf, as well as LAPT while retaining the original, full-sized vocabulary.
Another manipulation we consider is whether the transformer-specific parameters are frozen during LAPT. This follows from the literature on transferring monolingual models, which proposes freezing the encoder parameters and only training the new embedding matrix to mitigate catastrophic forgetting during transfer learning (Artetxe et al., 2020; de Vries and Nissim, 2021). In our tables, we denote LAPT with trainable transformer layers as LAPT-FULL, and training with the transformer frozen (but trainable embeddings) as LAPT-EMB.
Target Languages We select our target languages to cover a wide selection of language families, scripts, typological characteristics, and levels of resource availability, while still having standard evaluation sets for comparison. Training data for all languages is obtained from OSCAR v22.01 (Abadji et al., 2022). For our lowest-resource languages, supplemental data is obtained from monolingual splits of the OPUS translation corpus (Tiedemann and Nygaard, 2004) and the Johns Hopkins University Bible Corpus (McCarthy et al., 2020). More data curation details may be found in Appendix A.
Our multilingual→monolingual transfer languages can be found in Table 1. In these experiments, the replacement vocabulary and LAPT training are constrained to a single target language. (Figure 5b in the Appendix verifies that the position-wise clusters discussed in Section 3 capture the initial vs. medial token distinction.) In addition, we include two multilingual→multilingual experiments. In the first, we simply transfer to the set of languages used in our monolingual experiments. Most of these languages are unrelated and cover a variety of scripts and levels of resource availability. In the second, we transfer to a set of languages belonging to a single language family: Uralic. These languages come from the same ancestor language and share broad grammatical features, but use both Cyrillic and Latin scripts. These differing settings are designed to demonstrate whether language relatedness has an effect on the success of multilingual vocabulary-replacement techniques.
Vocabulary Replacement / Re-initialization When replacing the model vocabulary, we train new SentencePiece models on a subset of the training data. For targets with less than 1GB of data, we use the entire dataset. For those with more, we use a random subset of about 250MB. For multilingual models, we sample 5 million lines according to the same distribution as the training data. All new SentencePiece models have a total vocabulary size of 32,770, including special tokens. We then initialize the embedding matrix for each new vocabulary according to one or a combination of the techniques described in Section 3.

Training All of our experiments use XLM-R as a starting point (base size; Conneau et al., 2020). We conduct LAPT for 100k training steps, with evaluation checkpoints every 1000 steps. For LAPT-FULL experiments, the transformer blocks are frozen for the first 10k steps, then unfrozen for the last 90k, so that the model does not overfit to initial (possibly poor) embedding initializations. For LAPT-EMB experiments, transformer blocks remain frozen throughout training. The checkpoint obtaining the best MLM loss on a development set is selected for task fine-tuning and evaluation.
For multilingual training, we sample languages according to a multinomial distribution parameterized by α = 0.2, following Conneau and Lample (2019) and Conneau et al. (2020), i.a. Languages are sampled sentence-wise rather than batch-wise.
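Concretely, this exponentiated-smoothing scheme sets p_i ∝ q_i^α, where q_i is language i's share of the training data; α < 1 up-weights low-resource languages relative to their raw share. A sketch (the corpus sizes in the example are made up):

```python
def sampling_probs(sizes, alpha=0.2):
    """Multinomial language-sampling distribution: p_i ∝ q_i^alpha, where
    q_i is each language's share of the training data."""
    total = sum(sizes.values())
    weights = {lang: (n / total) ** alpha for lang, n in sizes.items()}
    z = sum(weights.values())  # renormalize so probabilities sum to 1
    return {lang: w / z for lang, w in weights.items()}

# Hypothetical line counts: a language with 1% of the data ends up sampled
# far more often than its raw share under alpha = 0.2.
probs = sampling_probs({"ru": 100_000, "myv": 1_000})
```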
Evaluation We evaluate model quality with POS tagging and NER tasks. For each task and each language, the trained model is fine-tuned on task training data until evaluation-set convergence or the maximum number of epochs is reached, across four random seeds. POS performance is evaluated on Universal Dependencies (UD) treebanks (de Marneffe et al., 2021), and NER is measured on the WikiAnn benchmark (Pan et al., 2017).

Results
The results for monolingual adaptation can be found in Tables 1-2, and general multilingual adaptation in Tables 3-4. Because the results for multilingual adaptation to the Uralic family mostly echo overall trends, we provide these results in Appendix C. In order to adhere to our overall computational budget, we only conduct full-vocabulary LAPT experiments for three languages in the monolingual setting. We first note that across re-initialization methods, LAPT-FULL always outperforms LAPT-EMB. That is, training with trainable transformer layers outperforms training with frozen ones, despite the risk of catastrophic forgetting with the former. This trend persists across monolingual and multilingual experiments. For example, REINIT-FOCUS+IDENT shows a 6.9-point average POS accuracy drop between LAPT-FULL and LAPT-EMB (Table 1).
Second, although FOCUS is the best-performing re-initialization method when averaged across languages, for individual languages it does not perform significantly differently than script-based methods. For instance, Armenian and Telugu POS tagging with script-based initialization performs on par with or better than FOCUS (Tables 1, 3). In the case of the very low-resource language Erzya, script-based methods mostly outperform FOCUS. Third, for the languages with the largest amount of data in XLM-R (Estonian, Hebrew, and Russian), the off-the-shelf performance of XLM-R (top row) is slightly better than any re-initialization method. This is not unexpected, since we can expect the highest-resource languages in XLM-R to receive adequate vocabulary coverage, and their embeddings are likely the most robustly trained. Finally, LAPT with the full, original XLM-R vocabulary results in marginally better performance than other techniques. On one hand, this might be surprising given the inefficiency with which cross-lingual vocabularies often tokenize low-resource languages (Ács, 2019). On the other hand, these original pre-trained embeddings are also likely robustly aligned with the transformer encoder, which might contribute to slightly better performance.
Part of the motivation for this work, however, is to investigate efficient ways to specialize multilingual models. LAPT with the full XLM-R vocabulary is much more computationally costly than training with a new, reduced vocabulary. Figure 4 shows the trade-off between computation (in FLOPs) and performance gain in our experiments: the (often) small gains in performance we see from fine-tuning with the original vocabulary come at the cost of two to three times more FLOPs during adaptation.
Erzya POS performance provides one exception to the pattern of full-vocab LAPT providing only marginal benefits (85.1 accuracy with the full vocabulary vs. 79.0 with the reduced vocabulary). This seems surprising, given that Erzya is not included in XLM-R's pre-training data and intuitively should benefit the most from a specialized vocabulary. It could be that the reduced vocabulary size of 32k is sub-optimal for this particular target language, and/or that the new vocabulary does not overlap enough with the original (full-size) one to inherit useful Cyrillic-script embeddings. Investigating the dynamics of target vocabulary size during vocabulary replacement is left for future work.

Table 4: Multilingual LAPT: entity-wise NER F1 score after fine-tuning for NER.

Only the high-resource languages of Estonian, Hebrew, and Russian seem to be adequately covered in XLM-R to outperform our specialization techniques. Language-Adaptive Pre-Training with the full (cross-lingual) XLM-R vocabulary often produces marginally better results overall, but at a much greater computational cost, and without making the model more compact. Further training and inference after LAPT will continue to suffer from the memory and compute wasted on unused vocabulary items, which constitute a large percentage of the total model parameters.

Script-distribution initialization rivals semantic similarity methods
We introduced several methods for embedding re-initialization in Section 3, namely using the insight that token embeddings for XLM-R cluster by script and position within a word, then distributing new vocabulary items according to these pre-trained sub-distributions. We compare this to the FOCUS re-initialization method, which initializes new embeddings as a weighted combination of existing ones according to similarity scores from an auxiliary model. Averaged across languages, FOCUS yields the best performance in downstream tasks by a slight margin. Within languages, it often overlaps significantly with the performance of our script-distribution methods. For very low-resource languages like Erzya, script-based methods even show a slight advantage. This seems to show that, at least in combination with LAPT, the majority of the benefit of re-initialization can be achieved by a method that takes the structure of the pre-trained embedding distribution into account, whether or not it uses advanced methods to precisely initialize the representations of new vocabulary items. We do note that the advantage of FOCUS is more clear-cut when LAPT is conducted with transformer blocks frozen. This lends credence to the idea that FOCUS more precisely mimics the embedding distribution expected by the pre-trained transformer. However, the overall best results come when the transformer blocks are unfrozen/trainable.

Fully random initialization performs poorly
Finally, our experiments demonstrate that fully random re-initialization of embeddings during vocabulary replacement leads to overall poor performance. Across LAPT-FULL experiments, random initialization performs an average of 19.4 points worse than the next-best re-initialization method, and 24.7 points worse than the off-the-shelf baseline. The poor performance of random initialization has been noted in other works such as Dobler and de Melo (2023), but we emphasize that even extremely simple methods such as REINIT-IDENT and REINIT-SCRIPT work far better than the random baseline.

Conclusion
This work presents a systematic comparison of methods to specialize the subword vocabularies and embeddings of multilingual models for new languages. We propose simple methods for re-initializing embeddings, motivated by a qualitative exploration of the XLM-R embedding space. Our experiments show that (1) updating the encoder layers during LAPT is crucial for downstream performance, (2) vocabulary replacement provides a computationally efficient method to improve task performance in low-resource languages, and (3) our re-initialization techniques employing script-wise sub-distributions perform on par with more involved similarity-based methods. We hope these findings can be built upon in future work on multilingual model specialization, with the goal of providing the best performance for under-resourced languages while also making language modeling more accessible through more manageable compute costs and model sizes.

Limitations
One limitation of our work is the relatively narrow set of evaluation tasks available for our languages of interest. The model-adaptation techniques we compare here are most applicable to low- and medium-resource languages that are not optimally covered by pre-existing multilingual models. For most of these languages, the only standard evaluation datasets that exist are for relatively low-level tasks like Part-of-Speech tagging and Named Entity Recognition. Evaluation of embedding re-initialization techniques could be improved in future work if datasets for higher-level tasks like Natural Language Inference, question answering, and paraphrase detection were curated for these under-resourced languages.
We also make several simplifying choices to maintain a feasible scope for our work. First, we conduct model adaptation from only a single base model: XLM-R. A valuable addition in future work would be to determine whether the trends we observe here generalize to other model types (i.e. causal and seq2seq language models) and to larger model scales. Secondly, we consider only one size for newly-initialized target vocabularies (32k). Because effective per-language vocabulary allocation has been shown to be an important factor in multilingual modeling (Conneau et al., 2020, i.a.), investigating the dynamics of target vocabulary size during vocabulary re-initialization will be important for future work on this topic.

A Data Details
General information about the language data used in this study can be found in Table 5. All training data used in our experiments is cleaned and deduplicated using the OpusFilter package (Aulamo et al., 2020). For the lowest-resource languages (Erzya and Sami), we additionally filter out lines that are identified as English with a probability of 90% or higher, since positive automatic language identification for low-resource languages is likely not robust (Kreutzer et al., 2022). We additionally filter out lines composed of fewer than 2 tokens, lines with an average token length of greater than 16 characters, lines with tokens longer than 32 characters, and lines composed of fewer than 50% alphabetic characters.
For POS tagging evaluation, most languages have a standard train/dev/test split curated from the original Universal Dependencies dataset (de Marneffe et al., 2021). Erzya, however, only has a standard train/test split. To form a dev split, we randomly sample 300 sentences from the train split. The WikiAnn dataset (Pan et al., 2017) does not ship with standard train/dev/test splits, so we create random 85/5/10% splits for each language, with minimum dev/test sizes of 256 and 512 sentences respectively.

B Training Details
The main details of our experimental process can be found in Section 4. Here we provide our choice of hyperparameters and other details relevant to reproducibility. The code used to run all experiments will be released in a later version of this paper. All models are trained and fine-tuned on Nvidia Quadro RTX 6000 GPUs using the Adam optimizer (Kingma and Ba, 2015). Hyperparameters for Language-Adaptive Pre-Training (LAPT) can be found in Table 6. If NaN losses were encountered during training, max_gradient_norm was reduced to 0.5. For multilingual sampling during training, each language's training data is capped at approximately 2GB.
Hyperparameters for task fine-tuning on POS and NER are in Table 7. For NER, the reported evaluation metric is entity-wise F1, meaning tokens with label O are ignored. In order to prevent models from learning to output only the majority class O during training, the loss for the O tokens in each batch is down-weighted to have the same influence as the tokens that actually correspond to a named entity. We cap fine-tuning training data at 32,768 sequences.

C Uralic Results
The results for multilingual adaptation to the Uralic family can be found in Tables 8 and 9. These results mostly follow the trends discussed in Section 5 (LAPT-EMB consistently underperforms LAPT-FULL, off-the-shelf performance is best for high-resource languages, and LAPT with the full cross-lingual vocabulary performs marginally better than other methods). It should be noted that for both Erzya and Hungarian, the best POS accuracy is achieved with SCRIPT+POSN+IDENT initialization (better even than LAPT with the full cross-lingual vocabulary). Results for the very low-resource language Erzya are generally higher than with multilingual training on unrelated languages, which could suggest a benefit to training with closely-related languages. This observation does not clearly hold for Sami (the other very low-resource language), however. Note that Russian is not a Uralic language; we include it for multilingual training in order to robustly train embeddings for the Cyrillic script, in which Erzya is written. Erzya is also spoken primarily within the Russian Federation, making loan-words likely.

Figure 1: PCA visualizations of the embedding space for XLM-R. Subplots: (a) Distribution of embeddings for the 12 most common Unicode scripts. (b) Plot reduced to only the Common, Latin, and Cyrillic scripts for simplicity. (c) Embeddings colored by whether the token begins a word (initial) or occurs in the middle of one (medial).

FOCUS Re-initialization In addition to the heuristic-based methods introduced above, we investigate a pre-existing method for embedding transfer, termed FOCUS (Dobler and de Melo, 2023). FOCUS works by extrapolating from the embedding space of an existing model, like our heuristic methods, but further introduces an auxiliary embedding model trained on the new language(s). This auxiliary model (based on FastText; Bojanowski et al., 2017) is used to obtain similarity measures between the new vocabulary items. Embeddings corresponding to overlapping tokens in the new vocabulary keep their values from the source model (REINIT-IDENT). Completely new tokens are initialized as a weighted combination of the overlapping items, with weights obtained according to similarity in the auxiliary model.
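The weighted-combination step can be sketched as follows. This is a schematic re-creation, not the reference FOCUS implementation: it assumes dense softmax weights over cosine similarities in a hypothetical auxiliary (FastText-style) embedding table, whereas the actual method additionally sparsifies the weight set.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def focus_style_init(new_tok, overlap_toks, aux_emb, src_emb):
    """Initialize a non-overlapping token as a similarity-weighted average
    of source-model embeddings for overlapping tokens. Weights are a softmax
    over cosine similarities computed in the auxiliary embedding space."""
    sims = [cosine(aux_emb[new_tok], aux_emb[t]) for t in overlap_toks]
    m = max(sims)  # shift for numerical stability
    exps = [math.exp(s - m) for s in sims]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(src_emb[overlap_toks[0]])
    return [sum(w * src_emb[t][d] for w, t in zip(weights, overlap_toks))
            for d in range(dim)]
```

A token whose auxiliary vector is close to one overlapping token's vector thus inherits an embedding dominated by that token's source-model row.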

Figure 4: Evaluation scores plotted against total floating-point operations of LAPT (computational cost). The left point represents the cost of LAPT with the reduced vocabulary, the right point with the full vocabulary.

Figure 5: PCA visualization of re-initialized embeddings with word-initial vs. word-medial tokens highlighted. For REINIT-SCRIPT, the position-wise clustering seen in the base XLM-R embeddings (Figure 1c) is not captured. REINIT-SCRIPT+POSN and REINIT-SCRIPT+POSN+IDENT show the expected positional clustering. REINIT-FOCUS seems to allow slightly more positional overlap.

Table 5: Training data breakdown by language. XLM-R data is the amount of data used in the pre-training of that model. LAPT data is the amount used for training in our current experiments, after cleaning/deduplication.

Table 9: Uralic-family multilingual LAPT: entity-wise NER F1 score after fine-tuning. A score of 0.0 results from the model learning to output only class O (not a named entity), which is the majority class. Sami does not have enough NER data for fine-tuning.