When Being Unseen from mBERT is just the Beginning: Handling New Languages With Multilingual Language Models

Transfer learning based on pretraining language models on large amounts of raw data has become the new norm for reaching state-of-the-art performance in NLP. Still, it remains unclear how this approach should be applied to unseen languages that are not covered by any available large-scale multilingual language model and for which only a small amount of raw data is generally available. In this work, by comparing multilingual and monolingual models, we show that such models behave in multiple ways on unseen languages. Some languages greatly benefit from transfer learning and behave similarly to closely related high-resource languages, whereas others apparently do not. Focusing on the latter, we show that this failure to transfer is largely related to the impact of the script used to write such languages. We show that transliterating those languages significantly improves the potential of large-scale multilingual language models on downstream tasks. This result provides a promising direction towards making these massively multilingual models useful for a new set of unseen languages.


Introduction
Language models are now a new standard for building state-of-the-art Natural Language Processing (NLP) systems. In the past year, monolingual language models have been released for more than 20 languages including Arabic, French, German, Italian, Polish, Russian, Spanish, Swedish, and Vietnamese (Antoun et al., 2020; Martin et al., 2020b; de Vries et al., 2019; Cañete et al., 2020; Kuratov and Arkhipov, 2019; Schweter, 2020, et alia). Additionally, large-scale multilingual models covering more than 100 languages are now available (XLM-R by Conneau et al. (2020) and mBERT by Devlin et al. (2019)). Still, most of the 7000+ languages spoken in the world are not covered by these models and thus remain unseen. Even languages with millions of native speakers, like Sorani Kurdish (about 7 million speakers in the Middle East) or Bambara (spoken by around 5 million people in Mali and neighboring countries), are not covered by any available language model.
Even if training multilingual models that cover more languages and language varieties is tempting, the curse of multilinguality described by Conneau et al. (2020) makes it an impractical solution, as it would require training ever larger models. Furthermore, as shown by Wu and Dredze (2020), large-scale multilingual language models reach sub-optimal performance for languages that only account for a small portion of the pretraining data.
In this paper, we describe and analyze task and language adaptation experiments to obtain usable language model-based representations for understudied, low-resource languages. We run experiments on 16 typologically diverse unseen languages, on three NLP tasks with different characteristics: part-of-speech (POS) tagging, dependency parsing (DEP) and named entity recognition (NER).
Our results bring forth a great diversity of behaviors that we classify into three categories reflecting the abilities of pretrained multilingual language models to be used for low-resource languages. Some languages, the "Easy" ones, largely behave like high-resource languages: fine-tuning large-scale multilingual language models in a task-specific way leads to state-of-the-art performance. The "Intermediate" languages are harder to process, as large-scale multilingual language models lead to sub-optimal performance when used as-is. However, adapting them using unsupervised fine-tuning on available raw data in the target language leads to a significant boost in performance, reaching or extending the state of the art. Finally, the "Hard" languages are those for which large-scale multilingual models fail to provide decent downstream performance even after unsupervised adaptation.
"Hard" languages include both stable and en-dangered languages, but they predominantly are languages of communities that are majorly underserved by modern NLP.Hence, we direct our attention to these "Hard" languages.For those languages, we show that the script they are written in is a critical element in the transfer abilities of pretrained multilingual language models.We show that transliterating them into the script of a possibly-related high resource language leads to large gains in performance leading to outperforming non-contextual strong baselines.
To sum up, our main contributions are the following:
• Based on our empirical results, we propose a new categorization of low-resource languages that are currently not covered by any available language model: the Hard, the Intermediate and the Easy languages.
• We show that Hard languages can be better addressed by transliterating them into a better-handled script (typically Latin), providing a promising direction for rendering multilingual language models useful for a new set of unseen languages.

Background and Motivations
As Joshi et al. (2020) vividly illustrate, there is a great divergence in the coverage of languages by NLP technologies. The majority of the world's 7000+ languages are not studied by the NLP community. Some languages have very few or no annotated datasets, making the development of systems challenging.
The development of such models is a matter of first importance for the inclusion of communities, the preservation of endangered languages and, more generally, to support the rise of tailored NLP ecosystems for such languages (Schmidt and Wiegand, 2017; Stecklow, 2018; Seddah et al., 2020). In that regard, the advent of the Universal Dependencies project (Nivre et al., 2016) and the WikiAnn dataset (Pan et al., 2017) has greatly expanded the number of covered languages by providing annotated datasets for 90 languages for dependency parsing and 282 languages for named entity recognition, respectively.
Regarding modeling approaches, the emergence of multilingual representation models, first with static word embeddings (Ammar et al., 2016) and then with language model-based contextual representations (Devlin et al., 2019; Conneau et al., 2020), enabled transfer from high- to low-resource languages, leading to significant improvements in downstream task performance (Rahimi et al., 2019; Kondratyuk and Straka, 2019). Furthermore, in their most recent forms, multilingual models such as mBERT process tokens at the subword level using SentencePiece tokenization (Kudo and Richardson, 2018). This means that they work in an open-vocabulary setting. This flexibility enables such models to process any language, even those that are not part of their pretraining data.
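For illustration, a minimal sketch of this open-vocabulary behavior (the Maltese string is only an example; the tokenizer simply falls back to smaller, known subword units for words it has never seen):

```python
# Minimal sketch: a multilingual subword tokenizer never fails on an unseen
# language; unknown words are split into smaller, known subword units.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
# Maltese is not in mBERT's pretraining corpora, yet the text is still tokenizable.
print(tokenizer.tokenize("Il-lingwa Maltija"))
```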
When it comes to low-resource languages, one direction is simply to train such contextualized embedding models on whatever data is available. Another option is to adapt/fine-tune a multilingual pretrained model to the language of interest. We briefly discuss these two options.
Pretraining language models on a small amount of raw data Even though the amount of pretraining data seems to correlate with downstream task performance (e.g., compare BERT and RoBERTa), several attempts have shown that training a new model from scratch can be effective even if the amount of data in that language is limited. Indeed, Suárez et al. (2020) showed that pretraining ELMo models (Peters et al., 2018) on less than 1GB of Wikipedia text leads to state-of-the-art performance, while Martin et al. (2020a) showed for French that pretraining a BERT model on as little as 4GB of diverse enough data results in state-of-the-art performance. This was further confirmed by Micheli et al. (2020), who demonstrated that decent performance was achievable with as little as 100MB of raw text data.
Adapting large-scale models for low-resource languages Multilingual language models can be used directly on unseen languages, or they can be adapted using unsupervised methods. For example, Han and Eisenstein (2019) successfully used unsupervised model adaptation of the English BERT model to Early Modern English for sequence labeling. Instead of fine-tuning the whole model, Pfeiffer et al. (2020) recently showed that adapter layers (Houlsby et al., 2019) can be injected into multilingual language models to provide parameter-efficient task and language transfer.
Still, as of today, the availability of monolingual or multilingual language models is limited to approximately 120 languages, leaving many languages without access to valuable NLP technology, although some are spoken by millions of people, including Bambara, Maltese and Sorani Kurdish.
What can be done for unseen languages? Unseen languages vary greatly in the amount of available data, in their script (many languages, such as Sorani Kurdish and Mingrelian, use non-Latin scripts), and in their morphological and syntactic properties (most differ largely from high-resource Indo-European languages). This makes the design of a methodology to build contextualized models for such languages very challenging. In this work, by experimenting with 16 typologically diverse unseen languages, (i) we show that there is a diversity of behavior depending on the script, the amount of available data, and the relation to pretraining languages; (ii) focusing on the unseen languages that lag in performance compared to their easier-to-handle counterparts, we show that the script plays a critical role in the transfer abilities of multilingual language models: transliterating such languages to a script used by a related language seen during pretraining leads to very significant improvements in downstream performance.

Experimental Setting
We will refer to any languages that are not covered by pretrained language models as "unseen." We select a small portion of those languages within a large scope of language families and scripts. Our selection is constrained to 16 typologically diverse languages for which we have evaluation data for at least one of our three downstream tasks. Our selection includes low-resource Indo-European and Uralic languages, as well as members of the Bantu, Semitic, and Turkic families. None of these 16 languages are included in the pretraining corpora of mBERT. We report in Table 1 information about their scripts, language families, and the amount of raw data available.

Raw Data
To perform pretraining and fine-tuning on monolingual data, we use the deduplicated datasets from the OSCAR project (Ortiz Suárez et al., 2019).

Language Models
In all our experiments, we pretrain and fine-tune our language models using the Transformers library (Wolf et al., 2019).

MLM from scratch
The first approach we evaluate is to train a dedicated language model from scratch on the available raw data. To do so, we train a language-specific SentencePiece tokenizer (Kudo and Richardson, 2018) before training a Masked Language Model (MLM) using the RoBERTa (base) architecture and objective functions (Liu et al., 2019). As we work with significantly smaller pretraining sets than in the original setting, we reduce the number of layers from the original 12 to 6.
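A minimal sketch of this setup with the Transformers library is given below; the tokenizer directory and corpus path are placeholders, and the settings shown are illustrative rather than the exact hyperparameters reported in the Appendix:

```python
# Sketch: pretraining a 6-layer RoBERTa-style MLM from scratch on a small
# monolingual corpus. "./spm-tokenizer" and "raw.txt" are placeholder paths.
from transformers import (
    RobertaConfig, RobertaForMaskedLM, RobertaTokenizerFast,
    DataCollatorForLanguageModeling, LineByLineTextDataset,
    Trainer, TrainingArguments,
)

tokenizer = RobertaTokenizerFast.from_pretrained("./spm-tokenizer")  # language-specific tokenizer trained beforehand
config = RobertaConfig(
    vocab_size=tokenizer.vocab_size,
    num_hidden_layers=6,           # 6 layers instead of the original 12
    max_position_embeddings=514,   # RoBERTa reserves two extra position slots
)
model = RobertaForMaskedLM(config)

dataset = LineByLineTextDataset(tokenizer=tokenizer, file_path="raw.txt", block_size=512)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

Trainer(
    model=model,
    args=TrainingArguments(output_dir="mlm-from-scratch"),
    data_collator=collator,
    train_dataset=dataset,
).train()
```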
Multilingual Language Models We want to assess how large-scale multilingual language models can be used and adapted for languages that are not in their pretraining corpora. We work with the multilingual version of BERT (mBERT), trained on the concatenation of Wikipedia corpora in 104 languages (Devlin et al., 2019). We also run experiments with the XLM-R base version (Conneau et al., 2020), trained on 100 languages using data extracted from the Web. As the observed behaviors are very similar between both models, we report only the results using mBERT. We note that mBERT is highly biased toward Indo-European languages written in the Latin script: the basic statistics of the vocabulary show that more than 77% of the vocabulary subword types are in the Latin script, about 11.5% are in the Cyrillic script, the Arabic script takes up about 4%, and smaller scripts like the Georgian one make up less than 1% of the vocabulary (with fewer than 1,000 subwords) (Ács, 2019).
Adapting Multilingual Language Models to unseen languages with MLM-TUNING Following previous work (Han and Eisenstein, 2019; Wang et al., 2019; Pfeiffer et al., 2020), we adapt large-scale multilingual models by fine-tuning them with their Masked-Language-Model objective directly on the available raw data in the unseen target language. We refer to this process as MLM-TUNING, and to an MLM-tuned mBERT model as mBERT+MLM.
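MLM-TUNING differs from pretraining from scratch only in its starting point, as in this sketch (the corpus path is a placeholder; the data pipeline is the same as above):

```python
# Sketch of MLM-TUNING: continue masked-language-model training of mBERT on raw
# text in the unseen target language instead of initializing weights randomly.
from transformers import BertForMaskedLM, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased")
model = BertForMaskedLM.from_pretrained("bert-base-multilingual-cased")  # start from mBERT
# ... then run the same LineByLineTextDataset / DataCollator / Trainer loop as
# above on "target_raw.txt" (placeholder) to obtain the adapted model (mBERT+MLM).
```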

Downstream Tasks
We perform experiments on POS tagging, Dependency Parsing (DEP), and Named Entity Recognition (NER). We use annotated data from the Universal Dependencies project (Nivre et al., 2016) for POS tagging and parsing, and the WikiAnn dataset (Pan et al., 2017) for NER. For POS tagging and NER, we append a linear classifier layer on top of the language model. For parsing, following Kondratyuk and Straka (2019), we append a Bi-Affine graph prediction layer (Dozat and Manning, 2016). For each task, following Devlin et al. (2019), we only back-propagate through the first token of each word. We refer to the process of fine-tuning a language model in a task-specific way as TASK-TUNING.
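The "first token" rule can be sketched as follows (an illustration, not our exact implementation; -100 is the label index that PyTorch's cross-entropy loss ignores by default):

```python
# Sketch of the "first token of each word" rule: only the first subword of each
# word receives the real label; remaining subwords get -100 so the loss (and
# therefore back-propagation) ignores them.
def align_labels(words, labels, tokenizer, ignore_index=-100):
    input_ids = [tokenizer.cls_token_id]
    aligned = [ignore_index]                       # no label for [CLS]
    for word, label in zip(words, labels):
        piece_ids = tokenizer.encode(word, add_special_tokens=False)
        input_ids.extend(piece_ids)
        aligned.extend([label] + [ignore_index] * (len(piece_ids) - 1))
    input_ids.append(tokenizer.sep_token_id)
    aligned.append(ignore_index)                   # no label for [SEP]
    return input_ids, aligned
```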

Optimization
For all pretraining and fine-tuning runs, we use the Adam optimizer (Kingma and Ba, 2014). For fine-tuning, we select the hyperparameters that minimize the loss on the validation set. The reported results are the average scores of 5 runs with different random seeds, computed on the test split of each dataset. We report details about the hyperparameters for TASK-TUNING in Table 12 in the Appendix, and about pretraining and MLM-TUNING in Table 13.

Dataset Splits
For each task and language, we use the provided training, validation, and test dataset splits, except for datasets that have fewer than 500 training sentences. In that case, we concatenate the training and test sets and perform 8-fold cross-validation, using the validation set for early stopping. If no validation set is available, we isolate one of the folds for validation and report the test scores as the average over the other folds. This enables us to run training on at least 500 sentences in all our experiments (except for Swiss German, for which we only have 100 training examples) and reduces the impact of annotated dataset size on our analysis. As cross-validation results in training on a very limited number of examples, we refer to training in this cross-validation setting as few-shot learning.
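A sketch of this protocol (evaluate_fold is a hypothetical placeholder standing in for task-tuning with early stopping followed by evaluation):

```python
# Sketch of the 8-fold protocol used when fewer than 500 training sentences are
# available: pool train+test, split into 8 folds, and average the test scores.
import numpy as np
from sklearn.model_selection import KFold

def evaluate_fold(train_split, test_split):
    # Hypothetical placeholder: TASK-TUNING on train_split (with early stopping
    # on the validation set), then evaluation on test_split.
    return 0.0

sentences = np.array([f"sent {i}" for i in range(560)], dtype=object)  # pooled train+test
scores = [
    evaluate_fold(sentences[tr], sentences[te])
    for tr, te in KFold(n_splits=8, shuffle=True, random_state=0).split(sentences)
]
print(np.mean(scores))  # the reported few-shot score is the average over folds
```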

Non-contextual Baselines
For parsing and POS tagging, we use the UDPipe Future system (Straka, 2018) as our baseline. This model is an LSTM-based (Hochreiter and Schmidhuber, 1997) recurrent architecture trained using pretrained static word embeddings (Mikolov et al., 2013) (hence our "non-contextual" characterization) along with character-level embeddings. This system ranked among the very first positions for parsing and tagging in the CoNLL 2018 shared task (Zeman and Hajič, 2018). For NER, we use a similar LSTM-CRF model with character- and word-level embeddings, based on the implementation of Qi et al. (2020).

The Three Categories of Unseen Languages
For each unseen language, we experiment with our three modeling approaches: (a) training a language model from scratch on the available raw data and then fine-tuning it on any available annotated data in the target language for each task; (b) fine-tuning mBERT with TASK-TUNING directly on the target language; (c) finally, adapting mBERT to the unseen language using MLM-TUNING before fine-tuning it in a supervised way on the target task and language. We then compare all these experiments to our non-contextual baselines. By doing so, we can assess whether language models are a practical solution for handling each unseen language.
Interestingly, we find a great diversity of behaviors across languages regarding those language model training techniques. As summarized in Figure 1, we observe three clear clusters of languages. The first cluster, which we dub "Easy", corresponds to the languages that do not require extra MLM fine-tuning for mBERT to achieve good performance. mBERT has the modeling abilities to process such languages without a large amount of raw data and, as such, can outperform strong non-contextual baselines. In the second cluster, the "Intermediate" languages require MLM fine-tuning: mBERT is not able to beat strong non-contextual baselines using only TASK-TUNING, but MLM-TUNING enables it to do so. Finally, "Hard" languages are those on which mBERT fails to deliver any decent performance even after MLM- and task-fine-tuning. mBERT simply does not have the capacity to learn and process such languages.
In this section, we present in detail our results in each of these language clusters and provide insights into their linguistic properties.

Easy
Easy languages are the ones on which mBERT delivers high performance out-of-the-box, compared to strong baselines. We find that those languages match two conditions:
- They are closely related to languages used during MLM pretraining.
- They use the same script as such closely related languages.

Table 2: Faroese is an "easy" unseen language: a multilingual model (+ language-specific MLM) easily outperforms all baselines. Zero-shot performance, after task-tuning only on related languages (Danish, Norwegian, Swedish), is also high.
Perhaps the best example of such an "easy" setting is Faroese. mBERT has been trained on several languages of the North Germanic genus of the Indo-European language family, all of which use the Latin script. As a result, the multilingual mBERT model performs much better than the monolingual FaroeseBERT model that we trained on the available Faroese text (cf. rows 1-2 and 5-6 in Table 2). Fine-tuning mBERT on the Faroese text is even more effective (rows 3 and 6 in Table 2), leading to further improvements, reaching more than 96.5% POS tagging accuracy, 86% LAS for dependency parsing, and 58% NER F1 in the few-shot setting, surpassing the non-contextual baseline. In fact, even in zero-shot conditions, where we task-tune only on related languages (Danish, Norwegian, and Swedish), the model achieves remarkable performance of over 83% POS tagging accuracy and 67.8% LAS for dependency parsing.

Swiss German is another example of a language for which one can easily adapt a multilingual model and obtain good performance even in zero-shot settings. As with Faroese, simple MLM fine-tuning of the mBERT model with 200K sentences leads to an improvement of more than 25 points in both POS tagging and dependency parsing (Table 3) in zero-shot settings, with similar improvement trends in the few-shot setting.

Table 3: Swiss German is an "easy" unseen language: a multilingual model (+ language-specific MLM) outperforms all baselines in both zero-shot (task-tuning on the related High German) and few-shot settings.
The potential of similar-language pretraining along with script similarity is also showcased in the case of Naija (also known as Nigerian English or Nigerian Pidgin), an English creole spoken by millions in Nigeria. As Table 4 shows, with results after language- and task-tuning on 6K training examples, the multilingual approach surpasses the monolingual baseline.

Table 4: Performance on Naija, an English creole, is very high, so we also classify it as an "easy" unseen language.
On a side note, we can rely on the results of Han and Eisenstein (2019) to also classify Early Modern English as an easy language. Similarly, the work of Chau et al. (2020) allows us to also classify Singlish (Singaporean English) as an easy language. In both cases, these two languages are technically unseen by mBERT, but the fact that they are variants of English allows them to be easily handled by mBERT.

Intermediate
The second type of languages (which we dub "Intermediate") is generally harder to process with pretrained multilingual language models out-of-the-box. In particular, pretrained multilingual language models are typically outperformed by strong non-contextual baselines. Still, MLM-TUNING has an important impact and leads to usable, state-of-the-art models.
A good example of such an intermediate language is Maltese, a member of the Semitic family that uses the Latin script. Maltese has not been seen by mBERT, but other Semitic languages, namely Arabic and Hebrew, have been included in the pretraining languages. The results on Maltese are outlined in Table 5, where it is clear that the non-contextual baseline outperforms mBERT. Additionally, a monolingual MLM trained on only 50K sentences matches mBERT's performance for both NER and POS tagging. However, the best results are reached with MLM-TUNING: the proper use of monolingual data and the advantage of similarity to other pretraining languages render Maltese tractable, significantly outperforming our strong non-contextual baseline.

Table 5: Maltese is an "Intermediate" unseen language: a multilingual model requires language-specific MLM- and task-tuning to achieve performance competitive with a monolingual baseline.

Our Maltese dependency parsing results are in line with those of Chau et al. (2020), who also show that MLM-TUNING leads to significant improvements. They additionally show that a small vocabulary transformation allows fine-tuning to be even more effective, gaining a further 0.8 LAS points. We further discuss the vocabulary adaptation technique of Chau et al. (2020) in Section 6.
We consider Narabizi, an Arabic dialect spoken in North Africa, written in the Latin script and code-mixed with French, to fall in the same "Intermediate" category, because it follows the same pattern. Our results on Narabizi are listed in Table 6. For both POS tagging and parsing, the multilingual models outperform the monolingual NarabiziBERT. In addition, MLM-TUNING leads to significant improvements over the non-language-tuned mBERT baseline, also outperforming the non-contextual dependency parsing baseline.

We also categorize Bambara, a Niger-Congo language spoken in Mali and surrounding countries, as Intermediate, relying mostly on the POS tagging results, which follow similar patterns as Maltese and Narabizi (see Table 7). We note that the BambaraBERT that we trained achieves notably poor performance compared to the non-contextual baseline, a fact we attribute to the extremely low amount of available data (1,000 sentences only). We also note that the non-contextual baseline is the best performing model for dependency parsing, which could also potentially classify Bambara as a "Hard" language instead.

The importance of script We provide initial supporting evidence for our argument on the importance of having pretrained LMs on languages with similar scripts, even for generally high-resource language families.
We first focus on Uralic languages. Finnish, Estonian, and Hungarian are high-resource representatives of this language family that are typically included in multilingual LMs, with task-tuning data also available in large quantities. For several smaller Uralic languages, however, task-tuning data are generally unavailable. Following a similar procedure as before, we start with mBERT, perform task-tuning on Finnish and Estonian (both of which use the Latin script) and then run zero-shot experiments on Livvi and Komi, both low-resource Uralic languages (results in the top part of Table 9). We also report results on the Finnish treebanks after task-tuning, for better comparison. The difference in performance between Livvi (which uses the Latin script) and the languages that use the Cyrillic script is striking.

Table 9: The script matters for the efficacy of cross-lingual transfer. The zero-shot performance on Livvi, which is written in the same script as the task-tuning languages (Finnish, Estonian), is almost twice as good as the performance on the Uralic languages that use the Cyrillic script.
Although they are not easy enough to be tackled in a zero-shot setting, we show that the low-resource Uralic languages fall in the "Intermediate" category, since mBERT has been trained on similar languages: a small amount of annotated data is enough to improve over mBERT using task-tuning. The results for Livvi and Erzya using 8-fold cross-validation, with each run only using around 700 training instances, are shown in Table 9. For Erzya, the multilingual model along with MLM-TUNING achieves the best performance, outperforming the non-contextual baseline by more than 1.5 points for parsing and matching its performance for POS tagging.

Hard
The last category, the "Hard" unseen languages, is perhaps the most interesting one, as these languages are very hard to process. All available large-scale language models are outperformed by non-contextual baselines as well as by monolingual language models trained from scratch on the available raw data. At the same time, MLM-TUNING over the available raw data has a minimal impact on performance.
Uyghur, a Turkic language with about 10-15 million speakers in Central Asia, is a prime example of a hard language for current models. In our experiments, outlined in Table 10, the non-contextual baseline outperforms all contextual variants, both monolingual and multilingual, on the POS tagging task. The monolingual UyghurBERT achieves the best dependency parsing results with a LAS of 77, more than 5 points higher than mBERT, with similar trends for NER. Uyghur is also the only case where mBERT with MLM-TUNING does not improve over the unadapted mBERT on dependency parsing.

Table 10: Uyghur is a hard language. The non-contextual baseline outperforms all mBERT variants on POS tagging, and UyghurBERT is best for DEP.

We attribute this discrepancy to script differences: Uyghur uses the Perso-Arabic script, while the other Turkic languages that were part of mBERT pretraining use either the Latin script (e.g., Turkish) or the Cyrillic script (e.g., Kazakh).

Figure 2: An illustration of the pretraining distributions and an unseen language distribution in the case of the Turkic language family. Uyghur is unseen but related to Turkish, on which mBERT has been pretrained. Uyghur is written in the Arabic script while Turkish is written in the Latin script, making it a great challenge for mBERT.
Sorani Kurdish (also known as Central Kurdish) is a similarly hard language, mainly spoken in Iraqi Kurdistan by around 8 million speakers, which uses the Sorani alphabet, a variant of the Arabic script. We can only evaluate on the NER task, where the non-contextual baseline is the best model, achieving an 81.3 F1 score. The SoraniBERT that we trained reaches an 80.6 F1 score, while mBERT gets a 70.4 F1 score. MLM-TUNING on 380K sentences of Sorani text improves mBERT's performance to a 75.6 F1 score, but it still lags behind the baseline.

Tackling Hard Languages with Multilingual Language Models
As we have already alluded to, our hypothesis is that the script is a critical element for multilingual pretrained models to efficiently process unseen languages.
To verify this hypothesis, we assess the ability of mBERT to process an unseen language after transliterating it to another script. We focus our experiments on six languages belonging to four language families: Erzya, Buryat and Meadow Mari (Uralic), Sorani Kurdish (Iranian, Indo-European), Uyghur (Turkic) and Mingrelian (Kartvelian). We apply the following transliterations:
• Erzya/Buryat/Meadow Mari: Cyrillic script → Latin script
• Sorani Kurdish/Uyghur: Arabic script → Latin script
• Mingrelian: Georgian script → Latin script

Linguistically-motivated transliteration
The strategy we used to transliterate the above-listed languages is specific to the purpose of our experiments. Indeed, our goal is for the model to take advantage of the information it has learned during training on a related language written in the Latin script. The goal of our transliteration is therefore to transcribe each character in the source script, which we assume corresponds to a phoneme, into the most frequent (sometimes only) way this phoneme is rendered in the closest related language written in the Latin script, hereafter the target language. This process is not a transliteration strictly speaking, and it need not be reversible. It is not a phonetization either, but rather a way to render the source language in a form that maximizes the similarity between the transliterated source language and the target language.
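This character-level strategy can be sketched as follows; the mapping shows only a few illustrative Perso-Arabic-to-Turkish pairs, not our full transliteration table:

```python
# Sketch of the linguistically-motivated mapping: each source character is
# replaced by the most frequent Latin rendering of the corresponding phoneme
# in the related target language (Turkish conventions here). Illustrative subset.
UYGHUR_TO_LATIN = {
    "ش": "ş",  # /ʃ/, written "ş" in both Turkish and Kurmanji Kurdish
    "ب": "b",  # /b/
    "م": "m",  # /m/
}

def render_latin(text, table):
    # Character-by-character replacement; characters outside the table pass
    # through unchanged, so the process need not be reversible.
    return "".join(table.get(ch, ch) for ch in text)
```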
We have manually developed transliteration scripts for Uyghur and Sorani Kurdish, using Turkish and Kurmanji Kurdish as the respective target languages, only Turkish being one of the languages used to train mBERT. Note however that Turkish and Kurmanji Kurdish share a number of conventions for rendering phonemes in the Latin script (for instance, /ʃ/, rendered in English by "sh", is rendered in both languages by "ş"; as a result, the Arabic letter "ش", used in both languages, is rendered as "ş" by both our transliteration scripts). As for Erzya, Buryat and Mari, we used the readily available transliteration package transliterate, which performs a standard transliteration. We used the Russian transliteration module, as it covers the Cyrillic script. Finally, for our control experiments on Mingrelian, we used the Georgian transliteration module from the same package.
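For the Cyrillic- and Georgian-script languages, usage of the transliterate package looks roughly as follows (the example strings are illustrative):

```python
# Sketch using the `transliterate` package: reversed=True converts from the
# source script to Latin. The "ru" pack covers the Cyrillic script and the
# "ka" pack the Georgian script.
from transliterate import translit

print(translit("мастор", "ru", reversed=True))   # Cyrillic -> Latin: "mastor"
print(translit("ქართული", "ka", reversed=True))  # Georgian -> Latin
```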

Transfer via Transliteration
We train mBERT with MLM-TUNING and TASK-TUNING on the transliterated data. As a control experiment, we also train a monolingual BERT model from scratch on the transliterated data of each language.
Our results with and without transliteration are listed in Table 11. Transliteration for Sorani and Uyghur generally has a noticeable positive impact. For instance, transliterating Uyghur to Latin leads to an improvement of 16 points in DEP and 20 points in NER. For one of the low-resource Uralic languages, Meadow Mari, we observe an improvement of 8 F1 points on NER, while for other Uralic languages like Erzya the effect of transliteration is very minor. The only case where transliterating to the Latin script leads to a drop in performance for mBERT and mBERT+MLM is Mingrelian.
We interpret our results as follows. When running MLM- and task-tuning, mBERT associates the target unseen language with a set of similar languages seen during pretraining based on the script. In consequence, mBERT is not able to associate a language with its related language if they are not written in the same script. For instance, transliterating Uyghur enables mBERT to match it to Turkish, a language which accounts for a sizeable portion of mBERT's pretraining data. In the case of Mingrelian, transliteration has the opposite effect: transliterating Mingrelian into Latin harms performance, as mBERT is no longer able to associate it with Georgian, which is seen during pretraining and uses the Georgian script.
Our findings are generally in line with previous work. Transliteration to English specifically (Lin et al., 2016; Durrani et al., 2014) and named entity transliteration (Kundu et al., 2018; Grundkiewicz and Heafield, 2018) have proven useful for successful cross-lingual transfer in tasks like NER, entity translation, entity linking (Rijhwani et al., 2019) and morphological inflection (Murikinati et al., 2020).
The transliteration approach provides a viable path for rendering large pretrained models like mBERT useful for all languages of the world. Indeed, transliterating both Uyghur and Sorani leads to matching or outperforming strong non-contextual baselines and delivers usable models.

Discussion and Conclusion
Pretraining ever larger language models is a research direction that is currently receiving a lot of attention and resources from the NLP research community (Raffel et al., 2019; Brown et al., 2020). Still, a large majority of human languages are under-resourced, making the development of monolingual language models very challenging in those settings. Another path is to build large-scale multilingual language models. However, such an approach faces the inherent Zipfian structure of human languages, making the training of a single model to cover all languages an unfeasible solution (Conneau et al., 2020). Reusing large-scale pretrained language models for new unseen languages seems to be a more promising and reasonable solution from a cost-efficiency and environmental perspective (Strubell et al., 2019).
Recently, Pfeiffer et al. (2020) proposed to use adapter layers (Houlsby et al., 2019) to build parameter-efficient multilingual language models for unseen languages. However, this solution brings no significant improvement in the supervised setting compared to simpler Masked-Language Model fine-tuning. Furthermore, developing a language-agnostic adaptation method seems unrealistic given the great typological diversity of human languages.
On the other hand, the promising vocabulary adaptation technique of Chau et al. (2020), which leads to good dependency parsing results on unseen languages when combined with task-tuning, has so far been tested only on Latin-script languages (Singlish and Maltese). We expect that it will be orthogonal to our transliteration approach, but we leave the study of its applicability and efficacy on more languages and tasks for future work.
In this context, we bring empirical evidence to assess the efficiency of language model pretraining and adaptation methods on 16 low-resource and typologically diverse unseen languages. Our results show that the "Hard" languages are out of the scope of any currently available language model and are therefore left outside of current NLP progress. By focusing on those, we find that this challenge is mostly due to the script. Transliterating them to a script used by a related higher-resource language on which the language model has been pretrained leads to large improvements in downstream task performance. Our results shed new light on the importance of the script in multilingual pretrained models. While previous work suggests that multilingual language models can transfer efficiently across scripts in zero-shot settings (Pires et al., 2019; K et al., 2020), our results show that such cross-script transfer is possible only if the model has seen related languages in the same script during pretraining.
Our work paves the way for a better understanding of the mechanics at play in cross-language transfer learning in low-resource scenarios. We strongly believe that our method could contribute to bootstrapping NLP resources and tools for low-resource languages, thereby favoring the emergence of NLP ecosystems for languages that are currently under-served by the NLP community.

Figure 1: Visualizing our typology of unseen languages. X, Y positions are computed for each language as follows: X = f(mBERT), Y = f(mBERT+MLM) with f(x) = (x − Baseline) / Baseline. Easy languages are the ones on which mBERT works without MLM-TUNING, Intermediate languages are the ones that require MLM-TUNING, while Hard languages are the ones for which mBERT does not work.

Table 1: Unseen languages used for downstream experiments. #sents indicates the number of raw sentences used for MLM-TUNING. *Amount of data for all the unseen languages we work with, except for Narabizi, Naija and Faroese, for which we use data respectively collected by Seddah et al. (2020), Caron et al. (2019) and Biemann et al. (2007), as well as for Buryat, Meadow Mari, Erzya and Livvi, for which we use Wikipedia dumps. **Code-mixed with French.


Table 8: Wolof falls into the Intermediate category. MLM-TUNING enables mBERT to match or outperform strong non-contextual baselines.

Table 11: Transliterating low-resource languages into the Latin script leads to significant improvements for languages like Uyghur, Sorani, and Meadow Mari. For languages like Erzya and Buryat, transliteration does not significantly influence results, while it does not help for Mingrelian. In all cases, mBERT+MLM is the best approach.