Exploiting Language Relatedness for Low Web-Resource Language Model Adaptation: An Indic Languages Study

Recent research in multilingual language models (LM) has demonstrated their ability to effectively handle multiple languages in a single model. This holds promise for low web-resource languages (LRL), as multilingual models can enable transfer of supervision from high-resource languages to LRLs. However, incorporating a new language into an LM still remains a challenge, particularly for languages with limited corpora and in unseen scripts. In this paper we argue that relatedness among languages in a language family may be exploited to overcome some of the corpora limitations of LRLs, and propose RelateLM. We focus on Indian languages, and exploit relatedness along two dimensions: (1) script (since many Indic scripts originated from the Brahmic script), and (2) sentence structure. RelateLM uses transliteration to convert the unseen script of limited LRL text into the script of a Related Prominent Language (RPL) (Hindi in our case). To exploit similar sentence structures, RelateLM uses readily available bilingual dictionaries to pseudo translate RPL text into LRL corpora. Experiments on multiple real-world benchmark datasets validate our hypothesis that using a related language as a pivot, along with transliteration- and pseudo-translation-based data augmentation, is an effective way to adapt LMs to LRLs, more so than direct training or pivoting through English.


Introduction
* Authors contributed equally

Figure 1: Number of Wikipedia articles for the top-few Indian languages and English. The height of the English bar is not to scale, as indicated by the break. The number of English articles is roughly 400x the number of articles in Oriya and 800x the number in Assamese.

BERT-based pre-trained language models (LMs) have enabled significant advances in NLP (Devlin et al., 2019; Liu et al., 2019; Lan et al., 2020). Pre-trained LMs have also been developed for the multilingual setting, where a single multilingual model is capable of handling inputs from many different
languages. For example, the Multilingual BERT (mBERT) (Devlin et al., 2019) model was trained on 104 different languages. When fine-tuned for various downstream tasks, multilingual LMs have demonstrated significant success in generalizing across languages (Hu et al., 2020). Thus, such models make it possible to transfer knowledge and resources from resource-rich languages to Low Web-Resource Languages (LRLs). This has opened up a new opportunity for rapid development of language technologies for LRLs.
However, there is a challenge. The current paradigm for training multilingual LMs requires text corpora in the languages of interest, usually in large volumes. However, such corpora are often available only in limited quantities for LRLs. For example, in Figure 1 we present the size of Wikipedia, a common source of corpora for training LMs, for the top-few scheduled Indian languages 1 and English. The top-2 Indian languages are just one-fiftieth the size of English, and yet Hindi is seven times larger than the O(20,000) documents of languages like Oriya and Assamese, which are spoken by millions of people. This calls for the development of additional mechanisms for training multilingual LMs that are not exclusively reliant on large monolingual corpora. Recent methods of adapting a pre-trained multilingual LM to a LRL include fine-tuning the full model with an extended vocabulary, training a light-weight adapter layer while keeping the full model fixed (Pfeiffer et al., 2020b), and exploiting overlapping tokens to learn embeddings of the LRL (Pfeiffer et al., 2020c). These are general-purpose methods that do not sufficiently exploit the specific relatedness of languages within the same family.
We propose RelateLM for this task. RelateLM exploits relatedness between the LRL of interest and a Related Prominent Language (RPL). We focus on Indic languages, and consider Hindi as the RPL. The languages we consider in this paper are related along several dimensions of linguistic typology (Dryer and Haspelmath, 2013; Littell et al., 2017): phonologically; phylogenetically, as they are all part of the Indo-Aryan family; geographically; and syntactically, matching on key features like the Subject-Object-Verb (SOV) order as against the Subject-Verb-Object (SVO) order in English. Even though the scripts of several Indic languages differ, they are all part of the same Brahmic family, making it easier to design rule-based transliteration libraries across any language pair. In contrast, transliteration of Indic languages to English is harder, with considerable phonetic variation in how words are transcribed. The geographical and phylogenetic proximity has led to significant overlap of words across languages. This implies that immediately after transliteration we are able to exploit overlap with a Related Prominent Language (RPL) like Hindi. On three Indic languages we find between 11% and 26% overlapping tokens with Hindi, whereas with English the overlap is less than 8%, mostly comprising numbers and entity names. Furthermore, the syntax-level similarity between the languages allows us to generate high-quality data augmentation by exploiting pre-existing bilingual dictionaries. We generate pseudo parallel data by converting RPL text to the LRL and vice-versa. These data allow us to further align the learned embeddings across the two languages using recently proposed loss functions for aligning contextual embeddings of word translations (Cao et al., 2020; Wu and Dredze, 2020). In this paper, we make the following contributions:
• We address the problem of adding a Low Web-Resource Language (LRL) to an existing pre-trained LM, especially when monolingual corpora in the LRL are limited.
This is an important but underexplored problem. We focus on Indian languages, which have hundreds of millions of speakers but are traditionally understudied in the NLP community.
• We propose RelateLM, which exploits relatedness among languages to effectively incorporate a LRL into a pre-trained LM. We highlight the relevance of transliteration and pseudo translation for related languages, and use them effectively in RelateLM to adapt a pre-trained LM to a new LRL.
• Through extensive experiments, we find that RelateLM gains significant improvements on benchmark datasets. We demonstrate how RelateLM adapts mBERT to Oriya and Assamese, two low web-resource Indian languages, by pivoting through Hindi. Via ablation studies on bilingual models, we show that RelateLM achieves a zero-shot transfer accuracy with limited data (20K documents) that existing methods do not surpass even with four times as much data. The source code for our experiments is available at https://github.com/yashkhem1/RelateLM.

Related Work
Transformer (Vaswani et al., 2017) based language models like mBERT (Devlin et al., 2019), MuRIL (Khanuja et al., 2021), IndicBERT (Kakwani et al., 2020), and XLM-R, trained on massive multilingual datasets, have been shown to scale across a variety of tasks and languages. The zero-shot cross-lingual transferability offered by these models makes them promising for low-resource domains. Pires et al. (2019) find that cross-lingual transfer is possible even across languages of different scripts, but is more effective for typologically related languages. However, recent works (Lauscher et al., 2020; Pfeiffer et al., 2020b; Hu et al., 2020) have identified poor cross-lingual transfer to languages with limited data when jointly pre-trained. A primary reason behind the poor transfer is the model's lack of capacity to accommodate all languages simultaneously. This has led to increased interest in adapting multilingual LMs to LRLs, which we discuss in the following two settings.
LRL adaptation using monolingual data: For eleven languages outside mBERT, it has been demonstrated that adding a new target language to mBERT by simply extending the embedding layer with new weights results in better-performing models than bilingual-BERT pre-training with English as the second language. Pfeiffer et al. (2020c) adapt multilingual LMs to LRLs and to languages with scripts unseen during pre-training by learning new tokenizers for the unseen script and initializing their embedding matrix by leveraging the lexical overlap with the languages seen during pre-training. Adapter (Pfeiffer et al., 2020a) based frameworks (Pfeiffer et al., 2020b; Artetxe et al., 2020; Üstün et al., 2020) address the model's lack of capacity to accommodate multiple languages, and establish the advantages of adding language-specific adapter modules to the BERT model for accommodating LRLs. These methods generally assume access to a fair amount of monolingual LRL data and do not explicitly exploit relatedness across languages. They provide complementary gains to our method of directly exploiting language relatedness.
LRL adaptation by utilizing parallel data: When a parallel corpus of a high-resource language and its translation into a LRL is available, Conneau and Lample (2019) show that pre-training on concatenated parallel sentences results in improved cross-lingual transfer. Methods like Cao et al. (2020) and Wu and Dredze (2020) discuss the advantages of explicitly bringing together the contextual embeddings of aligned words in a translated pair. Language relatedness has been exploited in multilingual NMT systems in various ways (Neubig and Hu, 2018; Goyal and Durrett, 2019; Song et al., 2020). These methods typically involve data augmentation for a LRL with the help of a related high-resource language, or first learning the NMT model for the related language and then fine-tuning on the LRL. Wang et al. (2019) propose a soft-decoupled encoding approach that exploits subword overlap between LRLs and HRLs to improve encoder representations for LRLs. Gao et al. (2020) address the related issue of generating fluent augmented data. To the best of our knowledge, no earlier work has explored the surprising effectiveness of transliteration to a related, existing prominent language for learning multilingual LMs, although some work exists in NMT as mentioned above.

Low Web-Resource Adaptation in RelateLM
Problem Statement and Notations: Our goal is to augment an existing multilingual language model M, for example mBERT, to learn representations for a new LRL L for which the available monolingual corpus D_L is limited. We are also told that the language to be added is related to another language R on which the model M is already pre-trained, and which is of comparatively higher resource. However, the script of D_L may be distinct from the scripts of existing languages in M. In this section we present strategies for using this knowledge to better adapt M to L than the existing baseline of fine-tuning M using the standard masked language model (MLM) loss on the limited monolingual data D_L. In addition to the monolingual data D_R in the RPL and D_L in the LRL, we have access to a limited bilingual lexicon B_{L→R} that maps a word in language L to a list of synonyms in language R, and vice-versa B_{R→L}. We focus on the case where the RPL-LRL pairs are part of the Indo-Aryan language family, where several levels of relatedness exist. Our proposed approach consists of three steps: transliteration to the RPL's script, pseudo translation, and adaptation through pre-training. We describe each of these steps below. Figure 2 presents an overview of our approach.

Table 2: Motivation for pseudo translation: BLEU scores between pseudo translated prominent-language sentences and LRL sentences. BLEU with Hindi, the RPL, is much higher than with English, the distant prominent language, highlighting the effectiveness of pseudo translation from a RPL (Section 3.2). English and Hindi dictionary sizes are the same. For these experiments, we used a parallel corpus across these 5 languages obtained from TDIL (Section 4.1).

Transliteration
First, the scripts of the Indo-Aryan languages all belong to the same Brahmic script family. This makes it easy to design simple rule-based transliterators to convert a corpus from one script to another. For most languages, transliterators are readily available, for example the Indic-Trans library 2 (Bhat et al., 2015). We use D_{L_R} to denote the LRL corpus after transliterating it to the script of the RPL. We then propose to further pre-train the model M with MLM on the transliterated corpus D_{L_R} instead of D_L. Such a strategy could provide little additional gain over the baseline, or could even hurt accuracy, if the two languages were not sufficiently related. For languages in the Indo-Aryan family, however, strong phylogenetic and geographical overlap means that many words across the two languages overlap and preserve the same meaning. In Table 1 we provide statistics on the overlap of words of several transliterated Indic languages with Hindi and English. Note that the fraction of overlapping words with Hindi is much higher than with English, where the overlap consists mostly of numbers and entity names. These overlapping words serve as anchors to align the representations of the non-overlapping LRL words that share semantic space with words in the RPL.
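Because major Brahmic scripts occupy parallel 128-codepoint Unicode blocks, the core of such a rule-based transliterator can be sketched as a fixed codepoint offset. This is a deliberate simplification for illustration (the function name is ours; real tools like Indic-Trans additionally handle characters without a one-to-one mapping):

```python
# Oriya (U+0B00-U+0B7F) and Devanagari (U+0900-U+097F) are parallel Unicode
# blocks, so corresponding letters usually differ by a constant offset.
ORIYA_BASE = 0x0B00
DEVANAGARI_BASE = 0x0900
BLOCK_SIZE = 0x80

def oriya_to_devanagari(text: str) -> str:
    """Naive codepoint-shift transliteration from Oriya to Devanagari."""
    out = []
    for ch in text:
        cp = ord(ch)
        if ORIYA_BASE <= cp < ORIYA_BASE + BLOCK_SIZE:
            out.append(chr(cp - ORIYA_BASE + DEVANAGARI_BASE))
        else:
            out.append(ch)  # digits, punctuation, Latin text pass through
    return "".join(out)
```

For instance, the Oriya letter KA (U+0B15) maps to the Devanagari letter KA (U+0915) under this shift.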

Pseudo Translation with Lexicons
Parallel data between a RPL-LRL language pair has been shown to be greatly useful for efficient adaptation to the LRL (Conneau and Lample, 2019; Cao et al., 2020). However, creating parallel data requires expensive supervision, and such data is not easily available for many low web-resource languages. Back-translation is a standard method of creating pseudo parallel data, but for low web-resource languages we cannot assume the presence of a well-trained translation system. We exploit the relatedness of the Indic languages to design a pseudo translation system that is motivated by two factors:
• First, for most geographically proximal RPL-LRL language pairs, word-level bilingual dictionaries have traditionally been available to enable communication. When they are not, crowd-sourcing the creation of word-level dictionaries 3 requires less skill and fewer resources than sentence-level parallel data. Word-level lexicons can also be created semi-automatically (Artetxe et al., 2019; Xu et al., 2018).
• Second, Indic languages exhibit common syntactic properties that control how words are composed to form a sentence. For example, they usually follow the Subject-Object-Verb (SOV) order as against the Subject-Verb-Object (SVO) order in English.
We therefore create pseudo parallel data between R and L via a simple word-by-word translation using the bilingual lexicon. In a lexicon, a word can be mapped to multiple words in the other language. We choose a word with probability proportional to its frequency in the monolingual corpus D_L. We experimented with a few other methods of selecting words, which we discuss in Section 4.4. In Table 2 we present BLEU scores obtained by pseudo translating into three Indic languages from Hindi and from English. We observe much higher BLEU for translation from Hindi, highlighting the syntactic relatedness of the languages.
3 Wiktionary is one such effort.
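A minimal sketch of this word-by-word pseudo translation, assuming a lexicon mapping each source word to a list of candidate translations and a frequency table derived from the target-language monolingual corpus (the function and variable names are illustrative, not the paper's implementation):

```python
import random

def pseudo_translate(sentence, lexicon, target_freq, rng=random):
    """Translate word-by-word; when a word has multiple dictionary entries,
    sample one with probability proportional to its frequency in the
    target-language monolingual corpus. Word order is kept as-is, which is
    reasonable for syntactically close (e.g. SOV-SOV) language pairs."""
    out = []
    for word in sentence.split():
        candidates = lexicon.get(word)
        if not candidates:
            out.append(word)  # out-of-lexicon words are copied through
            continue
        weights = [target_freq.get(c, 1) for c in candidates]  # add-1 floor
        out.append(rng.choices(candidates, weights=weights, k=1)[0])
    return " ".join(out)
```

Copying out-of-lexicon words through unchanged is itself reasonable here, since overlapping words across related languages often preserve their meaning after transliteration.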
Let (D_R, B_{R→L_R}(D_R)) denote the parallel corpus formed by pseudo translating the RPL corpus via the transliterated RPL-to-LRL lexicon. Likewise, let (D_{L_R}, B_{L_R→R}(D_{L_R})) be the parallel corpus formed by pseudo translating the transliterated low web-resource corpus via the transliterated LRL-to-RPL lexicon.

Alignment Loss
The union of the two pseudo parallel corpora above, collectively called P, is used for fine-tuning M with an alignment loss similar to the one proposed by Cao et al. (2020). This loss attempts to bring the multilingual embeddings of different languages closer by aligning the corresponding word embeddings of the source language sentence and the pseudo translated target language sentence. Let C be a random batch of source and (pseudo translated) target sentence pairs from P, i.e., C = ((s_1, t_1), (s_2, t_2), ..., (s_N, t_N)), where s and t are the source and target sentences respectively. Since our parallel sentences are obtained via word-level translations, the alignment among words is known and monotonic. The alignment loss has two terms:

L = L_align + L_reg

where L_align is used to bring the contextual embeddings closer, and L_reg is a regularization loss which prevents the new embeddings from deviating far from the pre-trained embeddings. Each of these is defined below:

L_align = \sum_{(s,t) \in C} \frac{1}{\#word(s)} \sum_{i=1}^{\#word(s)} \left\| f(s, l_s(i)) - f(t, l_t(i)) \right\|_2^2

L_reg = \sum_{(s,t) \in C} \frac{1}{\#tok(s)} \sum_{j=1}^{\#tok(s)} \left\| f(s, j) - f_0(s, j) \right\|_2^2

where l_s(i) is the position of the last token of the i-th word in sentence s, and f(s, j) is the learned contextual embedding of the token at the j-th position in sentence s; i.e., for L_align we consider only the last tokens of words in a sentence, while for L_reg we consider all the tokens in the sentence. f_0(s, j) denotes the fixed pre-trained contextual embedding of the token at the j-th position in sentence s. #word(s) and #tok(s) are the number of (whole) words and tokens in sentence s, respectively.
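The two terms can be made concrete with a small NumPy sketch for a single sentence pair (array and function names are ours; a real implementation would compute this over batches of contextual embeddings from the model):

```python
import numpy as np

def alignment_loss(f_s, f_t, f0_s, last_tok_s, last_tok_t):
    """MSE-style alignment loss for one (source, pseudo-translated target)
    pair. f_s, f_t: [num_tokens, dim] contextual embeddings of the two
    sentences; f0_s: fixed pre-trained embeddings of the source tokens;
    last_tok_s / last_tok_t: indices of the last token of each whole word
    (the known, monotonic word alignment)."""
    # L_align: pull together embeddings of aligned words (last token each)
    diff = f_s[last_tok_s] - f_t[last_tok_t]
    l_align = np.sum(diff ** 2, axis=1).mean()
    # L_reg: keep all token embeddings near their pre-trained values
    l_reg = np.sum((f_s - f0_s) ** 2, axis=1).mean()
    return l_align + l_reg
```

Because pseudo translation is word-by-word, source and target have the same number of words, so the i-th word of s aligns directly with the i-th word of t.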

Experiments
We carry out the following experiments to evaluate RelateLM's effectiveness in LRL adaptation:
• First, in the full multilingual setting, we evaluate whether RelateLM is capable of extending mBERT with two unseen low-resource Indic languages: Oriya (unseen script) and Assamese (seen script). (Section 4.2)
• We then move to the bilingual setting, where we use RelateLM to adapt a model trained on a single RPL to a LRL. This setting allows us to cleanly study the impact of different adaptation strategies and to experiment with many RPL-LRL language pairs. (Section 4.3)
• Finally, Section 4.4 presents an ablation study on dictionary lookup methods, alignment losses, and corpus size.
We evaluate by measuring the efficacy of zero-shot transfer from the RPL on three different tasks: NER, POS tagging, and text classification.

Setup

LM Models
We take m-BERT as the model M for our multilingual experiments. For the bilingual experiments, we start with two separate monolingual language models, one for Hindi and one for English, to serve as M. For Hindi, we trained our own Hi-BERT model on the 160K monolingual Hindi Wikipedia articles, using a 20,000-token vocabulary generated with the WordPiece tokenizer. For English, we use the pre-trained BERT model, which is trained on almost two orders of magnitude more data. When the LRL is added in its own script, we use the bert-base-cased model; when the LRL is added after transliteration to English, we use the bert-base-uncased model.

LRLs, Monolingual Corpus, Lexicon: As LRLs we consider five Indic languages spanning four different scripts. Monolingual data was obtained from Wikipedia, as summarized in Table 4. We extend m-BERT with two unseen low web-resource languages: Assamese and Oriya. Since it was challenging to find Indic languages with task-specific labeled data that are not already in m-BERT, we could not evaluate on more than two languages. For the bilingual model experiments, we adapt each of Hi-BERT and English BERT with three different languages: Punjabi, Gujarati, and Bengali. For these languages we simulated the LRL setting by limiting the monolingual data used (Table 4).

Tasks for zero-shot transfer evaluation: After adding a LRL to M, we perform task-specific fine-tuning on the RPL separately for three tasks: NER, POS tagging, and Text Classification. Table 3 presents a summary of the training and validation data in the RPL and the test data in the LRL on which we perform zero-shot evaluation. We obtained the NER data from WikiANN (Pan et al., 2017) and XTREME (Hu et al., 2020), and the POS and Text Classification data from the Technology Development for Indian Languages (TDIL) programme 6. We downsampled the TDIL data for each language to make it class-balanced. The POS tagset used was the BIS tagset (Sardesai et al., 2012). For the English POS dataset, we had to map the Penn tagset into the BIS tagset; the mapping we used is provided in Appendix B.

Table 5: Different adaptation strategies evaluated for zero-shot transfer (F1-score) on NER, POS tagging, and Text Classification after fine-tuning with the prominent language (English or Hindi). mBERT, which is trained with much larger datasets and more languages, is not directly comparable and is presented here just for reference.

Methods compared: We contrast RelateLM with three other adaptation techniques: (1) EBERT, which extends the vocabulary and tunes with MLM on D_L as-is, (2) RelateLM without the pseudo translation loss, and (3) m-BERT, when the language exists in m-BERT.
Training Details: For pre-training with MLM we used a batch size of 2048, a learning rate of 3e-5, and a maximum sequence length of 128. We used whole word masking for MLM and the BertWordPieceTokenizer for tokenization. For pre-training Hi-BERT, the duplication factor was set to 5, with training done for 40K iterations. For all LRLs where the monolingual data used was 20K documents, the duplication factor was kept at 20 and training was done for 24K iterations. For Assamese, where the monolingual data was just 6.5K documents, a duplication factor of 60 was used with the same 24K training iterations. The MLM pre-training was done on Google v3-8 Cloud TPUs. For the alignment loss on pseudo translations we used a learning rate of 5e-5, a batch size of 64, and a maximum sequence length of 128. This training was done for 10 epochs, also on Google v3-8 Cloud TPUs. For task-specific fine-tuning we used a learning rate of 2e-5 and a batch size of 32, training for 10 epochs for NER, 5 epochs for POS, and 2400 iterations for Text Classification. The models were evaluated on a separate RPL validation dataset, and the model with the best F1-score (NER), accuracy (POS), or validation loss (Text Classification) was selected for final evaluation. All the fine-tuning experiments were done on Google Colaboratory. The results reported for all experiments are an average of 3 independent runs.
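Whole word masking groups WordPiece continuation pieces with their head token so that a word is always masked as a unit. A small sketch of the idea (our own simplification, ignoring BERT's 80/10/10 mask/random/keep scheme):

```python
import random

def whole_word_mask(tokens, mask_prob=0.15, rng=random):
    """Group '##'-prefixed WordPiece continuations with their head token,
    then mask each whole word (all of its pieces) with prob. mask_prob."""
    # Collect token-index groups, one group per whole word.
    words, cur = [], []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and cur:
            cur.append(i)
        else:
            if cur:
                words.append(cur)
            cur = [i]
    if cur:
        words.append(cur)
    # Mask entire words, never individual pieces.
    out = list(tokens)
    for word in words:
        if rng.random() < mask_prob:
            for i in word:
                out[i] = "[MASK]"
    return out
```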

Multilingual Language Models
We evaluate RelateLM's adaptation strategy on mBERT, a state-of-the-art multilingual model, with two unseen languages: Oriya and Assamese. The script of Oriya is unseen, whereas the script of Assamese is the same as Bengali (already in m-BERT). Table 6 compares different adaptation strategies, including the option of treating each of Hindi and English as the RPL to transliterate into. For both LRLs, transliterating to Hindi as the RPL provides gains over EBERT, which keeps the script as-is, and over English transliteration. We find that these gains are much more significant for Oriya than Assamese, which could be because Oriya's script is new to the model. Further augmentation with pseudo translations, with Hindi as the RPL, provides significant added gains. We have not included NER results for Assamese due to the absence of a good-quality evaluation dataset.

Figure 3: Comparison of F1-score between RelateLM-20K, EBERT-20K and EBERT-80K, where the number after the method name indicates pre-training corpus size. We find that RelateLM-20K outperforms EBERT-20K in 8 out of 9 settings, and even outperforms EBERT-80K, which is trained over 4X more data, in 7 out of 9 settings.

Bilingual Language Models
For more extensive experiments and ablation studies we move to bilingual models. Table 5 shows the results of different methods of adapting M to a LRL, with Hi-BERT and BERT as the two choices of M. We obtain much higher gains when the LRL is transliterated to Hindi than when it is transliterated to English or kept in its own script. This suggests that transliteration to a related language succeeds in enabling parameter sharing between a RPL and a LRL. Note that the English BERT model is trained on a much larger English corpus than the Hindi corpus on which Hi-BERT is trained. Yet, because of the relatedness of the languages, we get much higher accuracy when adding data transliterated to Hindi rather than to English. Next, observe that pre-training with the alignment loss on pseudo translated sentence pairs improves upon the results obtained with transliteration alone. This shows that pseudo translation is a reasonable alternative when a parallel translation corpus is not available.
Overall, we find that RelateLM provides substantial gains over the baseline. In many cases RelateLM is even better than mBERT, which was pre-trained on much more monolingual data in that language. Among the three languages, we obtain the lowest gains for Bengali, since the phonetics of Bengali varies to some extent from the other Indo-Aryan languages, and Bengali also shows influence from Tibeto-Burman languages (Kunchukuttan and Bhattacharyya, 2020). This is also evident in its lower word overlap and lower BLEU in Table 1 and Table 2 compared to the other Indic languages. We further find that for Bengali, the NER results are best when Bengali is transliterated to English rather than Hindi, which we attribute to the presence of English words in the NER evaluation dataset.

Ablation Study
Methods of Dictionary Lookups: We experimented with various methods of choosing the translated word from the lexicon, which may have multiple entries for a given word. In Table 7 we compare four methods of picking entries: first (the entry at the first position), max (the entry with the maximum frequency in the monolingual data), weighted (an entry sampled with probability proportional to that frequency), and root-weighted (an entry sampled with probability proportional to the square root of that frequency). We find that these four methods perform very close to each other, with the weighted method having a slight edge.
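The four lookup strategies can be sketched as a single selector (the function name and signature are ours, for illustration):

```python
import math
import random

def pick_translation(candidates, freq, method="weighted", rng=random):
    """Choose one translation from a lexicon entry with multiple candidates.
    freq maps target words to their frequency in the monolingual corpus."""
    if method == "first":
        return candidates[0]
    if method == "max":
        return max(candidates, key=lambda c: freq.get(c, 0))
    if method == "weighted":
        weights = [freq.get(c, 1) for c in candidates]
    elif method == "root-weighted":
        weights = [math.sqrt(freq.get(c, 1)) for c in candidates]
    else:
        raise ValueError(f"unknown method: {method}")
    return rng.choices(candidates, weights=weights, k=1)[0]
```

The square root in root-weighted dampens the dominance of very frequent words, a common smoothing choice when sampling from skewed frequency distributions.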

Alignment Loss
We compare the MSE-based loss we used with the recently proposed contrastive loss (Wu and Dredze, 2020) for L_align, but did not observe any significant improvements. Results for additional experiments are provided in Appendix A.

Increasing Monolingual Size: In Figure 3 we increase the monolingual LRL data used for adapting EBERT four-fold and compare the results. We observe that, by exploiting language relatedness, RelateLM in most cases outperforms the EBERT model even when the latter uses four times more monolingual data. These experiments show that for zero-shot generalization on NLP tasks, it is more important to improve the alignment among languages by exploiting their relatedness than to add more monolingual data.

Conclusion and Future Work
We address the problem of adapting a pre-trained language model (LM) to a Low Web-Resource Language (LRL) with limited monolingual corpora. We propose RelateLM, which exploits relatedness between the LRL and a Related Prominent Language (RPL) already present in the LM, along two dimensions: script relatedness through transliteration, and sentence-structure relatedness through pseudo translation. We focus on Indic languages, which have hundreds of millions of speakers but are understudied in the NLP community. Our experiments provide evidence that RelateLM is effective in adapting multilingual LMs (such as mBERT) to various LRLs. Moreover, RelateLM achieves zero-shot transfer with limited LRL data (20K documents) that is not surpassed by existing baselines even with 4X more data. Together, our experiments establish that using a related language as a pivot, along with data augmentation through transliteration and bilingual dictionary-based pseudo translation, is an effective way of adapting an LM to LRLs, and more effective than direct training or pivoting through English.
Integrating RelateLM with other complementary methods for adapting LMs to LRLs (Pfeiffer et al., 2020b,c) is something we plan to pursue next. We are hopeful that the idea of utilizing relatedness will be effective in adapting LMs to LRLs in other language families, such as South-east Asian and Latin American languages. We leave that, and exploring other forms of relatedness, as fruitful avenues for future work.
(e.g. Personal Pronouns, Possessive Pronouns) for the POS classification, the mapping is also done to reflect the same.

Table 9: Tagset mapping between the Penn Treebank and BIS tagsets. For some tags in the Penn Treebank (e.g. DT), we decided that a one-to-many mapping was appropriate, based on a word-level division.