Romanization-based Large-scale Adaptation of Multilingual Language Models



Introduction
Massively multilingual language models (mPLMs) such as mBERT (Devlin et al., 2019) and XLM-R (Conneau et al., 2020) have become the driving force for a variety of applications in multilingual NLP (Ponti et al., 2020; Hu et al., 2020; Moghe et al., 2023). However, guaranteeing and maintaining strong performance for a wide spectrum of low-resource languages is difficult due to two crucial problems. The first issue is vocabulary size: the vocabulary is bound to grow with the number of languages added if per-language performance is to be maintained (Hu et al., 2020; Artetxe et al., 2020; Pfeiffer et al., 2022). Second, pretraining mPLMs with a fixed model capacity improves cross-lingual performance only up to a point, after which it starts to decrease; this phenomenon is termed the curse of multilinguality (Conneau et al., 2020).
Transliteration refers to the process of converting language represented in one writing system to another (Wellisch et al., 1978). Latin script-centered transliteration, or romanization, is the most common form of transliteration (Lin et al., 2018; Amrhein and Sennrich, 2020; Demirsahin et al., 2022), as the Latin/Roman script is by far the most widely adopted writing script in the world (Daniels and Bright, 1996; van Esch et al., 2022). Adapting mPLMs via transliteration can address the two aforementioned critical issues. 1) Since the Latin script covers a dominant portion of the mPLM's vocabulary (e.g., 77% in the case of mBERT, see Ács), 'romanizing' the remaining part of the vocabulary might mitigate the vocabulary size issue and boost vocabulary sharing. 2) Since no new tokens are added during the romanization process, reusing pretrained embeddings from the mPLM's embedding matrix exploits the information already present within the mPLM, thereby allocating the model's parameter budget more efficiently.
However, the main drawback of transliteration seems to be the expensive process of creating effective language-specific transliterators, as they typically require language expertise to curate dictionaries that map tokens from one language and script to another. Therefore, previous attempts at mPLM adaptation to unseen languages via transliteration (Muller et al., 2021; Chau and Smith, 2021; Dhamecha et al., 2021; Moosa et al., 2023) were constrained to a handful of languages due to the limited availability of language-specific transliterators, or were applied only to languages that have 'language siblings' with developed transliterators.
In this work, unlike previous work, we propose to use, and then evaluate the usefulness of, a universal romanization tool, UROMAN (Hermjakob et al., 2018), for quick, large-scale and effective adaptation of mPLMs to low-resource languages. UROMAN dispenses with language-specific curated dictionaries and maps any UTF-8 character to the Latin script, increasing the portability of romanization; see Figure 1 for examples.
We analyze language adaptation on a massive scale via UROMAN-based romanization on a set of 14 diverse low-resource languages. We conduct experiments within the standard parameter-efficient adapter-based cross-lingual transfer setup on two tasks: Named Entity Recognition (NER) on the WikiANN dataset (Pan et al., 2017; Rahimi et al., 2019), and Dependency Parsing (DP) with Universal Dependencies v2.7 (Nivre et al., 2020). Our key results suggest that UROMAN-based transliteration can offer strong performance, on par with or even outperforming adaptation with language-specific transliterators, laying the groundwork for wider use of transliteration-based mPLM adaptation techniques in future work. The gains of romanization-based adaptation over standard adaptation baselines are particularly pronounced for languages with unseen scripts (∼8-22 performance points), without any vocabulary augmentation.

Background
Why UROMAN-Based Romanization? UROMAN-based romanization is not always fully reversible, and its usage for transliteration has thus been limited in the literature. However, due to its high portability, UROMAN can help scale the process of transliteration massively and as such benefit low-resource scenarios and wider adaptation of mPLMs. The main idea, as hinted in §1, is to (learn to) map any UTF-8 character to the Latin script, without the use of any external language-specific dictionaries (see Hermjakob et al. (2018) for technical details).
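To illustrate how such a tool slots into a preprocessing pipeline, the sketch below wraps the uroman.pl command-line script from the official release in a small Python helper; the exact invocation and the optional ISO 639-3 language-code flag are assumptions about the tool's interface rather than part of our method.

```python
import subprocess

def romanize(sentences, lang_code=None):
    """Romanize sentences by piping them through UROMAN (Hermjakob et al., 2018).

    Assumes the uroman.pl script is on PATH. The optional ISO 639-3 code only
    activates a few language-specific heuristics; UROMAN also works without it.
    """
    cmd = ["uroman.pl"] + (["-l", lang_code] if lang_code else [])
    result = subprocess.run(cmd, input="\n".join(sentences),
                            capture_output=True, text=True, check=True)
    return result.stdout.splitlines()

# Example: romanize(["Здравствуйте"], lang_code="rus") yields a Latin-script
# approximation such as ["Zdravstvuyte"].
```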
Cross-Lingual Transfer to Low-Resource Languages. Parameter-efficient and modular fine-tuning methods (Pfeiffer et al., 2023) such as adapters (Houlsby et al., 2019; Pfeiffer et al., 2020b) have been used for cross-lingual transfer, putting a particular focus on enabling transfer to low-resource languages and scenarios, including languages with scripts 'unseen' by the base mPLM (Pfeiffer et al., 2021). Adapters are small, lightweight components stitched into the base mPLM, and then trained for particular languages and tasks while keeping the parameters of the original mPLM frozen. This circumvents the issues of catastrophic forgetting and interference (McCloskey and Cohen, 1989) within the mPLM, and allows for extending its reach also to unseen languages (Pfeiffer et al., 2021; Ansell et al., 2021).
For our main empirical analyses, we adopt a state-of-the-art modular method for cross-lingual transfer: MAD-X (Pfeiffer et al., 2020b). In short, MAD-X is based on language adapters (LAs), task adapters (TAs), and invertible adapters (INV). While LAs are trained for specific languages via masked language modeling, TAs are trained on high-resource languages using task-annotated data and task-specific objectives. At inference, the source LA is replaced with the target LA while the TA is kept. In order to learn token-level embeddings across different languages in a parameter-efficient way and to deal with the vocabulary mismatch between source and target languages, Pfeiffer et al. (2020b) also propose INV adapters: they are placed on top of the embedding layer, and their inverses precede the output embedding layer. We adopt the better-performing MAD-X 2.0 setup (Pfeiffer et al., 2021), where the adapters in the last Transformer layer are dropped at inference.
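The following sketch shows how such a modular stack can be assembled with the open-source adapters library (formerly adapter-transformers). The adapter paths are placeholders for trained adapters, and the API may differ slightly across library versions; this is an illustration of the setup rather than the paper's exact code.

```python
from adapters import AutoAdapterModel
from adapters.composition import Stack

# Zero-shot MAD-X inference: keep the (English) task adapter fixed and swap in
# the target-language adapter. Paths below are placeholders.
model = AutoAdapterModel.from_pretrained("bert-base-multilingual-cased")
target_la = model.load_adapter("adapters/target_lang")  # target LA (+ INV)
ner_ta = model.load_adapter("adapters/ner_en")          # English NER task adapter
model.active_adapters = Stack(target_la, ner_ta)        # LA first, TA on top
```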

Experiments and Results
As the main means of analyzing the impact of transliteration in general and UROMAN-based romanization in particular, we train different variants of language adapters within the MAD-X framework, based on transliterated and non-transliterated versions of the target language data, outlined here.

Table 1: Languages with their ISO 639-3 codes used in our evaluation, along with their script, language family, and number of sentences available for pretraining. The dashed line separates languages with unseen scripts, placed in the bottom part of the table.
Variants with Non-Transliterated Data. For the Non-Trans LA+INV variant, we train LAs and INV adapters together. This variant serves to examine the extent to which mPLMs can adapt to unseen languages without any vocabulary extension. We compare this to Non-Trans LA+Emb_Lex, which trains a new tokenizer for the target language (Pfeiffer et al., 2021): the so-called 'lexically overlapping' tokens are initialized with the mPLM's pretrained embeddings, while the remaining embeddings are initialized randomly. All these embeddings (Emb_Lex) are fine-tuned along with the LAs; a sketch of this initialization is given below.
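The sketch below illustrates the lexically overlapping initialization; the target-tokenizer vocabulary path is a placeholder (tokenizer training is sketched further below), and the normal-initialization parameters are assumptions rather than the paper's exact procedure.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

mbert = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")
mbert_vocab = AutoTokenizer.from_pretrained("bert-base-multilingual-cased").get_vocab()

# Vocabulary of the newly trained target-language tokenizer (placeholder path).
with open("target_tokenizer/vocab.txt", encoding="utf-8") as f:
    new_vocab = {token.rstrip("\n"): idx for idx, token in enumerate(f)}

old_emb = mbert.get_input_embeddings().weight.data
# Start from a random matrix, then copy pretrained rows for tokens that also
# exist in mBERT's vocabulary (the 'lexically overlapping' tokens).
new_emb = torch.normal(0.0, 0.02, size=(len(new_vocab), old_emb.size(1)))
for token, new_id in new_vocab.items():
    if token in mbert_vocab:
        new_emb[new_id] = old_emb[mbert_vocab[token]]

mbert.resize_token_embeddings(len(new_vocab))
mbert.get_input_embeddings().weight.data.copy_(new_emb)
```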
Variants with Transliterated Data. We evaluate a Trans LA+INV variant, which uses the same setup as Non-Trans LA+INV but with transliterated data. We again note that in this efficient setup we do not extend the vocabulary size and use the fewest trainable parameters. In the Trans LA+mPLM_ft variant, we train LAs while also fine-tuning the pretrained embeddings of the mPLM (mPLM_ft). This further enhances the model capacity by fine-tuning the embedding layer instead of using invertible adapters. For both variants, transliterated data can be produced via different transliterators: (i) language-specific ones; (ii) ones from 'language siblings' (e.g., using a Georgian transliterator for Mingrelian); or (iii) UROMAN.

Experimental Setup
Data, Languages and Tasks. Following Pfeiffer et al. (2021), we select mBERT as our base mPLM. We experiment with 14 typologically diverse low-resource languages that are not part of mBERT's pretraining corpora, with 5/14 languages written in distinct scripts (see Table 1 for details). For LA training, we use Wikipedia dumps for the target languages, which we also transliterate (using different transliterators). Evaluation is conducted on two standard cross-lingual transfer tasks in zero-shot setups: 1) the WikiANN NER dataset (Pan et al., 2017) with the train, dev, and test splits from Rahimi et al. (2019); 2) for dependency parsing, the UD dataset v2.7 (Nivre et al., 2020).
LAs and TAs. English is the source language in all experiments and is used for training TAs. The English LA is obtained directly from AdapterHub.ml (Pfeiffer et al., 2020a); LAs and embeddings (when needed) are trained only for the target languages.
Finally, for the Non-Trans LA+Emb_Lex variant, we train a WordPiece tokenizer on the target language data with a vocabulary size of 10K.
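A minimal sketch of this step with the Hugging Face tokenizers library is shown below; the training-file name is a placeholder, and the casing/accent settings are assumptions not specified in the paper.

```python
import os
from tokenizers import BertWordPieceTokenizer

# Train a 10K WordPiece vocabulary on (original or transliterated) target-language text.
tokenizer = BertWordPieceTokenizer(lowercase=False, strip_accents=False)
tokenizer.train(files=["target_lang_wiki.txt"], vocab_size=10_000)

os.makedirs("target_tokenizer", exist_ok=True)
tokenizer.save_model("target_tokenizer")  # writes target_tokenizer/vocab.txt
```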
Training of Language and Task Adapters. We train all language adapters for 50 epochs or ∼50K update steps, depending on the corpus size. The batch size is set to 64 and the learning rate to 1e-4.
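For reference, the sketch below shows how a language adapter with invertible adapters can be prepared for masked-LM training using the adapters library; the configuration string, the leave_out option (dropping adapters in the last layer, as in MAD-X 2.0), and the head-adding call reflect our understanding of that library and may differ across versions.

```python
from adapters import AdapterConfig, AutoAdapterModel

model = AutoAdapterModel.from_pretrained("bert-base-multilingual-cased")

# Language adapter + invertible adapters; drop the adapters in the last layer (MAD-X 2.0).
la_config = AdapterConfig.load("pfeiffer+inv", leave_out=[11])
model.add_adapter("target_lang", config=la_config)
model.add_masked_lm_head("target_lang")
model.train_adapter("target_lang")  # freezes mBERT weights; only the adapter is updated

# The adapter is then trained with a standard masked-LM loop over the
# (romanized or original-script) target-language Wikipedia text:
# batch size 64, learning rate 1e-4, 50 epochs or ~50K update steps.
```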
We train English task adapters following the setup of Pfeiffer et al. (2020b). For NER, we directly obtain the task adapter from AdapterHub.ml; it was trained with a learning rate of 1e-4 for 10 epochs. For DP, we train a Transformer-based (Glavaš and Vulić, 2021) biaffine attention dependency parser (Dozat and Manning, 2017), using a learning rate of 5e-4 and training for 10 epochs, as in Pfeiffer et al. (2021).
Results in both tasks (NER and DP) are reported as averages over 6 random seeds. All models were trained on A100 or V100 GPUs, and no training run took more than 36 hours.

Results and Discussion
UROMAN versus Other Transliterators and Transliteration Strategies. In order to establish the utility of UROMAN as a viable transliterator, especially for low-resource languages, we compare its performance with other transliteration options using the Trans LA+INV setup as the most efficient scenario. First, we compare UROMAN with language-specific transliterators available for selected languages: amseg (Yimam et al., 2021) for Amharic, ai4bharat-transliteration (Madhani et al., 2022) for Hindi and Sindhi, lang-trans for Arabic, and transliterate for Russian and Georgian. The transliterators used in this work are outlined in Table 6, and the results are provided in Table 2. On average, UROMAN performs better than or comparably to the language-specific transliterators, which justifies using UROMAN for massive transliteration at scale. Second, we compare UROMAN to two other transliteration strategies. (i) BORROW refers to borrowing transliterators from languages within the same language family and written in the same script. Since building transliterators is costly, this gives us an estimate of whether it is possible to rely on a related transliterator when no language-specific one is at hand. (ii) RAND refers to a random setting where we associate any non-ASCII character with an arbitrary ASCII character, giving us an estimate of whether knowledge of the language is actually needed to build transliterators (a sketch of this baseline is given below). The results are provided in Table 3: UROMAN largely and consistently outperforms both BORROW and RAND, with the single exception of BORROW from Hindi to Bhojpuri. Surprisingly, RAND also yields reasonable performance and on average even outperforms the Non-Trans LA+INV variant trained on non-transliterated data (21.59 vs. 18.39; see Table 4). This provides further evidence for the utility of transliteration in general, and UROMAN-based romanization in particular, in assisting and improving language adaptation.
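The RAND baseline can be implemented in a few lines; the sketch below reflects the description above (a fixed random character table with no linguistic knowledge), with the seed and the choice of lowercase ASCII letters being arbitrary.

```python
import random
import string

def build_rand_table(corpus, seed=0):
    """Map every non-ASCII character observed in the corpus to a random ASCII letter."""
    rng = random.Random(seed)
    non_ascii = sorted({ch for line in corpus for ch in line if ord(ch) >= 128})
    return {ch: rng.choice(string.ascii_lowercase) for ch in non_ascii}

def rand_transliterate(text, table):
    """Apply the fixed table; ASCII characters pass through unchanged."""
    return "".join(table.get(ch, ch) for ch in text)

corpus = ["მადლობა", "ধন্যবাদ"]
table = build_rand_table(corpus)
print([rand_transliterate(line, table) for line in corpus])
```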
Performance on Low-Resource Languages is summarized in Table 4 and Table 5. We note that Trans LA+INV outperforms Non-Trans LA+INV for all languages with unseen scripts, and does so by huge margins (∼8-22 points for NER and ∼17 UAS points for DP). We observe similar trends for some of the languages with seen scripts, such as Min Dong (cdo), Sindhi (sd), and Mingrelian (xmf) on NER, and Erzya (myv) on DP. The less efficient Trans LA+mPLM_ft, as expected, further improves performance for all languages except Tibetan (bo). Non-Trans LA+Emb_Lex, however, now outperforms the UROMAN-based methods for a majority of the languages. This observation can be attributed to various factors related to mBERT's tokenizer, and we provide an in-depth analysis in Appendix C. Nonetheless, we observe strong and competitive performance of Trans LA+mPLM_ft in both tasks, again indicating that more attention should be paid to transliteration-based language adaptation in future work.
Sample Efficiency. Finally, we simulate a few-shot setup to study the effectiveness of using transliterated versus non-transliterated data in data-scarce scenarios. For NER, we evaluate performance on all languages as well as on the subset of languages with unseen scripts; for DP, we evaluate on all languages. Figure 2 indicates that Trans LA+INV on average performs better than all other methods at sample sizes of 100 (i.e., 100 sentences in the target language) and 1,000. However, from 10,000 sentences onward, Non-Trans LA+Emb_Lex takes the lead. We observe similar trends in the DP task (see Figure 3). This establishes the utility of transliteration for (extremely) low-resource scenarios.

Conclusion
In this work, we have systematically analyzed and confirmed the potential of romanization, implemented via the UROMAN tool, to help with the adaptation of multilingual pretrained language models. Given (i) its broad applicability and (ii) its strong performance overall and for languages with unseen scripts, we hope our study will inspire more work on transliteration-based adaptation.

Limitations
In this paper, we work with UROMAN (Hermjakob et al., 2018), an unsupervised romanization tool. While it is effective for romanization at scale, it still has potential drawbacks. Since it is based only on lexical substitution, its transliterations may not align semantically or phonetically with the source content and may differ from the transliterations preferred by native speakers. Moreover, as we have highlighted, UROMAN is not invertible and may thus be less appealing when text in the original script needs to be exactly reproduced. Our proposed method, while parameter-efficient and effective, particularly for low-resource languages, still underperforms non-transliteration methods based on language-specific tokenizers. Future work may focus on developing an improved and more efficient tokenizer for transliteration-based methods, as we highlight in the Appendix.
While there is now a growing body of available evaluation resources for low-resource languages (Ebrahimi et al., 2022; Mhaske et al., 2023; Winata et al., 2023, among others), our final selection of tasks, resources and languages has been driven and constrained by the specific goal of this short paper: studying and evaluating if and how transliteration/romanization can help with the adaptation of languages whose scripts are unseen by the pretrained multilingual language model. We thus closely follow the experimental setup of Pfeiffer et al. (2021), which used the same set of tasks and languages with unseen scripts.
Finally, romanization can be seen as a step towards providing a more universal, or rather language-agnostic, input text representation. Full-fledged comparisons against other approaches that aim to achieve language independence at the input or feature level, such as byte-level models (e.g., ByT5 (Xue et al., 2022)) and pixel-based models (e.g., PIXEL (Rust et al., 2023)), go beyond the scope of this particular work, but we point this out as a very interesting future research avenue. Moreover, the integration of these language-agnostic representations with romanization-based approaches might yield additional benefits and should also be investigated in future research.

A Transliterators in Evaluation
Besides UROMAN, we also employ various language-specific transliterators which are publicly available. We list them in Table 6.

B Performance Comparison of mBERT
We adopt the standard cross-lingual transfer setup for mBERT: the model is fine-tuned on the task data for a high-resource source language and is then used to perform inference on the low-resource target language. Table 7 compares this standard cross-lingual transfer setup for mBERT with the adapter-based methods on the NER task for languages with unseen scripts. We observe that the adapter-based methods outperform mBERT by huge margins.

C Further Analyses
Following previous work (Ács; Rust et al., 2021; Moosa et al., 2023), we further analyze the tokenization quality of the mBERT tokenizer using the following established metrics: 1) % of "UNK"s measures the percentage of "UNK" tokens produced by the tokenizer, and our aim is to compare this rate before and after transliteration; 2) fertility measures the number of subwords produced per tokenized word; 3) proportion of continued subwords measures the proportion of words that are split into at least two subwords (continuation pieces are marked with the ## symbol).
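A minimal sketch of how these three metrics can be computed with the mBERT tokenizer is given below; treating whitespace-separated tokens as 'words' is a simplification on our part.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

def tokenizer_quality(sentences):
    """Compute % of UNKs, fertility, and proportion of continued subwords."""
    n_words = n_subwords = n_unk = n_continued = 0
    for sentence in sentences:
        for word in sentence.split():  # simplification: whitespace-delimited words
            pieces = tokenizer.tokenize(word)
            n_words += 1
            n_subwords += len(pieces)
            n_unk += sum(piece == tokenizer.unk_token for piece in pieces)
            n_continued += int(len(pieces) > 1)
    return {
        "unk_rate": 100.0 * n_unk / max(n_subwords, 1),  # % of "UNK"s
        "fertility": n_subwords / max(n_words, 1),       # subwords per word
        "continued": n_continued / max(n_words, 1),      # words split into >= 2 pieces
    }

print(tokenizer_quality(["ეს მაგალითია", "this is an example"]))
```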
The results summarized in Figure 4 show that transliteration drastically reduces the % of UNKs. However, mBERT's tokenizer underperforms compared to monolingual tokenizers in terms of fertility and the proportion of continued subwords (Rust et al., 2021). Transliteration performs better for some languages where the quality of the mBERT tokenizer is similar to that of a monolingual tokenizer, such as dv, km, and cdo. On the other hand, transliteration methods perform worse on languages where the quality of the underlying mBERT tokenizer is relatively poor.
In order to test the hypothesis that tokenizer quality might be the principal reason for the performance gap between the transliteration-based and non-transliteration-based methods, we carried out an additional experiment. We adapt Non-Trans LA+Emb_Lex to operate on transliterated data and call this variant Trans LA+Emb_Lex. Here, we train a new tokenizer on the transliterated data and initialize lexically overlapping embeddings with mBERT's pretrained embeddings.
We plot the performance in Figure 5. The new variant, Trans LA+Emb_Lex, now outperforms the non-transliteration-based variant on 8/12 languages and also on average. This validates our hypothesis and is in line with previous work (Moosa et al., 2023). However, we found a drop in performance for mhr (-10.71) and cdo (-10.14) when compared to Trans LA+mPLM_ft. These drops may be attributed to a lower degree of lexical overlap with mBERT's vocabulary, and consequently a higher number of randomly initialized embeddings for those target languages.

Figure 2: Sample efficiency in the NER task.

Figure 3: Sample efficiency in the DP task.

Table 3: Comparison of various transliteration strategies on the NER task (Macro-F1).

Table 5: Results (UAS/LAS scores) in the DP task with UD, averaged over 6 random seeds.