Role of Language Relatedness in Multilingual Fine-tuning of Language Models: A Case Study in Indo-Aryan Languages

We explore the impact of leveraging the relatedness of languages belonging to the same family in NLP models through multilingual fine-tuning. We hypothesize and validate that multilingual fine-tuning of pre-trained language models can yield better performance on downstream NLP applications than models fine-tuned on individual languages. A first-of-its-kind detailed study is presented to track performance change as languages are added to a base language in a graded and greedy (in the sense of best performance boost) manner; it reveals that careful selection of a subset of related languages can improve performance significantly more than utilizing all related languages. The Indo-Aryan (IA) language family is chosen for the study, the exact languages being Bengali, Gujarati, Hindi, Marathi, Oriya, Punjabi, and Urdu. The script barrier is crossed by simple rule-based transliteration of the text of all languages to Devanagari. Experiments are performed on mBERT, IndicBERT, MuRIL, and two RoBERTa-based LMs, the last two being pre-trained by us. Low-resource languages, such as Oriya and Punjabi, are found to be the largest beneficiaries of multilingual fine-tuning. Textual Entailment, Entity Classification, and Section Title Prediction tasks of IndicGLUE, together with POS tagging, form our test bed. Compared to monolingual fine-tuning, we obtain relative performance improvements of up to 150% on the downstream tasks. The surprise take-away is that for any language there is a particular combination of other languages that yields the best performance, and adding any further language is in fact detrimental.

We empirically study whether (and to what extent) related languages improve the performance of models on downstream tasks under multilingual fine-tuning, in comparison to monolingual fine-tuning. To understand the quantitative advantage of including languages gradually, we explore the gradation of multilinguality by adding new languages one by one, building up to an all-language multilingual fine-tuning.
A good approximation for language relatedness is membership in the same language family, as languages of a family often share properties such as grammar, vocabulary, etymology, and writing systems. We choose the Indo-Aryan (IA) family for the study, since its constituent languages 1) include low-resource languages, 2) have similar Abugida writing systems, 3) are relatively understudied, and 4) are covered in a well-defined NLP benchmark, IndicGLUE (Kakwani et al., 2020b). Further, the fact that all constituent languages except one use similar Abugida writing systems (rooted in the ancient Brahmi script) presents an opportunity for script normalization via transliteration.
Overall, although the general notion of language relatedness has been explored, and multilingual fine-tuning has been studied in the literature, a detailed linguistic understanding of the role of language relatedness in multilingual fine-tuning remains lacking; even more so for the IA family. Further, the script-conversion aspect has not been explored in this context for multilingual fine-tuning.
To summarize, in this paper we seek to answer the following research questions (employing the Indo-Aryan language family as the experimental test-bed).
• RQ1: Does multilingual fine-tuning with a set of related languages yield improvements over monolingual fine-tuning (FT) on downstream tasks?
• RQ2: Starting from monolingual FT, how does performance vary as related languages are gradually added for multilingual FT, up to a multilingual FT with all related languages? In other words, should one use all related languages' data or only a subset?
These inquiries are critical to understanding the right balance between per-language fine-tuning and massively multilingual fine-tuning as the viable way forward. Additionally, we explore the role of a common script representation in multilingual FT of related languages.
To facilitate these inquiries, we utilize existing pre-trained models, namely IndicBERT, mBERT, and MuRIL, and also pre-train two language models for Indo-Aryan language family from scratch. We utilize various tasks of IndicGLUE (Kakwani et al., 2020b) as our test-beds.

Related Work
Multilinguality has been explored in the context of pre-training language models, for effective transfer from one language to another, and, to an extent, in multilingual fine-tuning (Kumar et al., 2020; Kakwani et al., 2020b).

Multilingual Pre-training
These approaches focus on multilingual pre-training of models. This means that once a multilingual LM is pre-trained, it is fine-tuned per task, separately for each language.

Language Transfer
It is understood that a multilingual model gains cross-lingual understanding from the sharing of layers, which allows the alignment of representations among languages; to the extent that a large vocabulary overlap between the languages is not required to bridge the alignment (Conneau et al., 2020b; Wang et al., 2019). This property facilitates zero-shot transfer between two related languages (e.g. Hindi and Urdu) reasonably well (Pires et al., 2019). Performance for zero-shot transfer further improves when the multilingual model is additionally aligned by utilizing parallel word or sentence resources (Kulshreshtha et al., 2020). Usually, the low-resource language members in a multilingual LM benefit from the presence of related languages. Further, it is likely that the presence of unrelated languages does not aid multilingual training, but rather may lead to negative interference rooted in conflicting gradients (Wang et al., 2020b) or yield substantially poorer transfer between unrelated languages (e.g. English and Japanese) (Pires et al., 2019). A recent work by Dolicki and Spanakis (2021) focuses on establishing the connection between the effectiveness of zero-shot transfer and the linguistic features of the source and target languages; interestingly, they observe that the effectiveness of zero-shot transfer is a function of the downstream task, in addition to the languages themselves.
The general understanding has been that language-specific FT serves as the skyline, and, in this line of work, the pursuit has been to bring zero-shot transfer from related language(s) closer to that skyline (Wu and Dredze, 2019).

Multilingual Fine-tuning
Tsai et al. (2019) perform multilingual fine-tuning of 48 languages for the downstream tasks of POS tagging and morphological tagging, and find these multilingual models to be slightly poorer than monolingual models. For morphological tagging and lemmatization, Kondratyuk (2019) makes a similar observation regarding the poor performance of a model fine-tuned with 66 languages in a multilingual setting compared to monolingual fine-tuning (although a second stage of per-language fine-tuning yields superior performance). These findings indicate that an arbitrary collection of languages may not be suitable for improving downstream task performance, and that a principled approach for selecting a set of languages may be preferable for multilingual fine-tuning. To this end, we hypothesize that language relatedness should be an important aspect to consider while selecting a language set for multilingual fine-tuning. Pires et al. (2019) briefly explore language set selection based on typological features (syntactic word order). Wang et al. (2020b) explore multilingual fine-tuning in strictly bilingual settings. Taking language relatedness into consideration, Tran and Bisazza (2019) show that joint fine-tuning with four European languages is better than fine-tuning with only English for the specific task of universal dependency parsing; unfortunately, they do not provide a comparison with monolingual fine-tuning for all constituent languages.
We observe that there is a void in the systematic analysis of how the presence of related languages in multilingual fine-tuning affects performance on the target language.

Methodology
Traditionally, a pre-trained LM (such as mBERT) is used as the base model, which is fine-tuned for a downstream task for a specific language (monolingual). In this work, we aim to evaluate the role of script and language relatedness in multilingual fine-tuning by employing the Indo-Aryan language family. Therefore, we include the following components in our approach: (1) multilingual fine-tuning, (2) transliteration, and (3) language models. Next, we discuss these in detail.

Multilingual Fine-Tuning
As opposed to traditional monolingual fine-tuning for a downstream task, in multilingual fine-tuning the LM is fine-tuned once per task on the aggregate labelled corpus across languages. Intuitively, related languages should assist each other on a downstream task. To draw a parallel, a polyglot (akin to a multilingual LM) who is good at guessing titles of passages written in one language can easily adapt this skill to another, related language with a few exemplars. Arguably, when the languages are put together, a greater understanding of the downstream task arises compared to what each language would yield individually, and the relatedness of the associated languages plays a key role in deciding the benefits of this approach.
Therefore, to study this systematically, for a downstream task that is relevant for a variety of languages (e.g. part-of-speech tagging or named entity recognition), first the training sets of all languages are combined to create the multilingual task training set. Then, the base LM is fine-tuned on the multilingual task training corpus. This multilingual fine-tuning now yields a model per task, and not per task-language pair.
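The aggregation step above can be sketched in a few lines (a minimal sketch; the function and field names are ours, not from the paper's codebase):

```python
import random

def build_multilingual_train_set(per_language_sets, seed=13):
    """Combine per-language labelled examples into one training set.

    per_language_sets: dict mapping a language code (e.g. "hi", "or")
    to a list of (input, label) examples for the downstream task.
    """
    combined = [ex for examples in per_language_sets.values() for ex in examples]
    random.Random(seed).shuffle(combined)  # mix languages within batches
    return combined
```

The base LM is then fine-tuned once on the combined set, yielding one model per task rather than one per task-language pair.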

Script Similarity and Transliteration
Languages of a language family often use similar writing systems. For example, in the IA family, on one hand, Hindi, Bhojpuri, Magahi, Marathi, Sanskrit, and Nepali are written in the Devanagari script. On the other hand, Bengali, Gujarati, Punjabi, and Oriya are written in their respective scripts. As Indic languages have high lexical similarity (Bhattacharyya et al., 2016), having a universal script for all these languages allows the model to exploit cross-lingual similarities. For example, the verb for "to go" is similar in Hindi (जाना jaanaa), Urdu (جانا jaanaa), Gujarati (જવું javum), Punjabi (ਜਾਣਾ jaanaa), Marathi (जाने jaane), Oriya (ଯିବାକୁ jibaku), and Bengali (যাওয়া jao), with each language morphing it in different manners. We use the indic-nlp-library (Kunchukuttan et al., 2015) (for all but Urdu) and the indic-trans tool (Bhat et al., 2015) (for Urdu) for transliteration to Devanagari.
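The paper relies on the indic-nlp-library and indic-trans tools; purely as an illustration of the underlying rule-based idea, the core of such script conversion can be sketched via the parallel layout of Brahmi-derived Unicode blocks. This is a simplification that ignores script-specific exceptions the real tools handle:

```python
# Unicode block starts for some Brahmi-derived scripts. These blocks are
# laid out position-by-position parallel to Devanagari (an ISCII legacy),
# so a fixed offset maps most characters; real transliterators also
# handle script-specific exceptions that this sketch ignores.
SOURCE_BLOCKS = [0x0980, 0x0A00, 0x0A80, 0x0B00]  # Bengali, Gurmukhi, Gujarati, Oriya
DEVANAGARI_START = 0x0900

def to_devanagari(text: str) -> str:
    out = []
    for ch in text:
        cp = ord(ch)
        for start in SOURCE_BLOCKS:
            if start <= cp < start + 0x80:
                cp = cp - start + DEVANAGARI_START
                break
        out.append(chr(cp))
    return "".join(out)
```

Characters outside the listed blocks (punctuation, digits, Latin text) pass through unchanged.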

Language Models
For this study, we use mBERT, IndicBERT, and MuRIL as existing pre-trained language models. Additionally, we pre-train language models (from scratch) specifically for the Indo-Aryan languages only, as the other LMs contain languages of other families too.

Pre-training Language Model From Scratch:
We choose to pre-train a RoBERTa (Liu et al., 2019) transformer-based model, as it has been shown to improve over BERT (Devlin et al., 2019). Existing pre-trained language models are trained on original-script data. For a fair study of the effectiveness of transliteration, we wish to pre-train separate language models on original and transliterated corpora from scratch. Our experimentation around transliteration makes existing pre-trained models (mBERT, IndicBERT, and MuRIL) somewhat incompatible: fine-tuning them on transliterated data would be akin to fine-tuning for an unseen language, albeit in a previously seen script. Thus, we settle upon pre-training contextual LMs from scratch for this purpose. Specifically, we train two LMs from scratch, one preserving the original scripts of the corpora (IndoAryan-Original) and the other after transliterating all corpora to Devanagari script (IndoAryan-Transliterated).

Experimental Setup
In this section, we describe the datasets used in our experiments, their pre-processing, and implementation details.

Data
To train the language models, we obtained text data from various sources, including: the Wikipedia dump, WMT Common Crawl, WMT News CommonCrawl, the Urdu Charles University Corpus (Bojar et al., 2014; Jawaid et al., 2014), the IIT Bombay Hindi Monolingual Corpus (Kunchukuttan et al., 2018), the Bhojpuri Monolingual Corpus (Kumar et al., 2018), and the Magahi Monolingual Corpus. Various statistics of the collected corpus are reported in Table 1. Note the major imbalance in the data, with Hindi being undoubtedly a high-resource language and the likes of Magahi, Punjabi, and Oriya being low-resource languages. The challenges of data imbalance and the insufficiency of data for training monolingual models for many of these languages are apparent from the statistics.

Data Preparation and Implementation Details
As the first step, sentences are segmented from the text corpora. Then a script-converted version of the datasets is obtained by transliterating Bengali, Gujarati, Punjabi, Oriya, and Urdu into the Devanagari script. We additionally perform de-duplication to remove repeated sentences. The statistics of the resulting set are reported in Table 1. We identify the following two challenges that can affect pre-training negatively: 1) data imbalance and 2) compute requirements.
1. Data imbalance: As reported in Table 1 (where n_i denotes the number of samples in the i-th language), the language distribution is heavily skewed. Before rescaling, per-language shares range from 0.01% to 58%; after rescaling, they range from 5% to 12%.
2. Compute requirements: Depending on the computing infrastructure, running one training epoch can typically take from a few hundred to a few thousand GPU hours. To mitigate this, we utilize a variant of the sharding technique outlined in Algorithm 1 to pre-train the model on infrastructure with limited memory (<50GB) and compute (one V100 GPU). It relies on dividing each language corpus into chunks that fit in memory, termed blocks. Each LM is trained over ∼50 sequential executions of Algorithm 1 on a single V100 GPU machine, with each execution running for a day, consuming about 1200 GPU hours overall for pre-training.
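Algorithm 1 is not reproduced here; the block-wise idea it relies on can be sketched as a streaming iterator that never holds more than one block in memory (a sketch under our own assumptions, not the exact algorithm):

```python
def iter_blocks(corpus_paths, block_size):
    """Stream each corpus file as fixed-size blocks of sentences,
    so at most one block is resident in memory at a time."""
    for path in corpus_paths:
        block = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                block.append(line.rstrip("\n"))
                if len(block) == block_size:
                    yield block
                    block = []
        if block:  # flush the final partial block
            yield block
```

Pre-training then loops over these blocks, checkpointing between executions so a run can resume after each one-day session.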
In the re-sampling step, an exponent of s = 0.1 and a scaling parameter of γ = 100 are used. A byte-level BPE tokenizer (Radford et al., 2019; Wang et al., 2020a) is used with a vocabulary size of 110K. The trained LMs use 12 layers, 12 attention heads, a hidden size of 768, and a dropout ratio of 0.1 for the attention probabilities. Our implementation uses the Huggingface (Wolf et al., 2020) library. We use a linear schedule for learning-rate decay. The maximum sequence length is set to 128 across tokenization, training, and fine-tuning; due to compute limitations, a higher maximum sequence length led to out-of-memory errors. Mini-batches are created by weighted sampling based on language priors with exponent s = 0.7. In LM pre-training, a mini-batch of 48 samples and gradient accumulation over 53 steps are used, making the effective batch size 2,544. The Apex library is used with the O1 optimization level to allow mixed-precision training. In all our fine-tuning experiments, we perform a grid search over learning-rate and batch-size values of {1,3,5}×10^−5 and {16,32,64} respectively.
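The exponent-based re-sampling can be sketched as follows (only the smoothing exponent s is modelled here; the exact role of the scaling parameter γ in the pipeline is omitted from this sketch):

```python
def smoothed_language_probs(counts, s=0.1):
    """Exponent-smoothed sampling distribution over languages.

    counts: per-language sentence counts n_i. With s < 1, the raw
    proportions p_i = n_i / sum(n) are flattened as q_i proportional
    to p_i ** s, upweighting low-resource languages.
    """
    total = float(sum(counts))
    q = [(n / total) ** s for n in counts]
    z = sum(q)
    return [x / z for x in q]
```

With s close to 0 the distribution approaches uniform; with s = 1 it reproduces the raw corpus proportions, which is how a 0.01%–58% spread can be compressed into a much narrower band.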

Experiments
To answer the research questions, we experiment on a variety of tasks suitable for multilingual fine-tuning and analyse the results. To investigate RQ1, in §5.1, the first set of experiments aims to understand the utility of multilingual FT with related languages. To investigate RQ2, in §5.2, the second set of experiments is designed to track gradual performance variation with the addition of assisting languages. With the last set of analyses, in §5.3, we investigate the role of transliteration.

Effectiveness on Multilingual Tasks
We experiment on four tasks suitable for the multilingual fine-tuning protocol, including three from IndicGLUE (Kakwani et al., 2020b) and POS tagging (Zeman et al., 2020). We do not show results on the Cloze-style Question Answering task of IndicGLUE, as it is meant to evaluate masked-token prediction of an LM and does not involve downstream task training. We utilize mBERT, IndicBERT, MuRIL, IndoAryan-Original (IA-O), and IndoAryan-Transliterated (IA-TR), the last two being pre-trained by us as detailed in §3.3. All five LMs are fine-tuned in monolingual and multilingual modes, to pursue the investigation of RQ1. Only the IA-TR model is fine-tuned with transliterated versions of the downstream task data; the remaining four models are fine-tuned with original-script downstream task data. The results of this set of experiments are reported in Table 2.
Along with absolute metrics, the relative difference between monolingual and multilingual fine-tuning (FT) is also reported. The relative difference is calculated as δ = (M_multi − M_mono) / M_mono × 100%, where M_mono and M_multi are the performance measures of monolingual and multilingual fine-tuning respectively. Key observations are as follows. Monolingual vs Multilingual Fine-Tuning: In this analysis, the higher the δ, the stronger the affirmative answer to RQ1. In Table 2, a positive δ, shown in blue, indicates the cases where multilingual FT improves over monolingual FT. It can be observed that for languages with limited labelled data for the downstream task, multilingual fine-tuning results in enormous improvements. Across all five LMs, the trend is consistent. For example, on the wikiann-ner task, the F-score on Oriya improved from 0.3882 and 0.3460 to 0.8848 and 0.6436, respectively, for the MuRIL and IA-TR models, while significant improvements are seen in other languages too. A similar trend is visible in the wiki-section-title prediction task, where improvements are seen for all languages. Broadly, across LMs, tasks, and languages, multilingual FT improves over monolingual FT. This helps formulate the answer to RQ1: multilingual fine-tuning with related languages can yield huge (up to 40% on an absolute scale) improvements for low-resource languages (such as Oriya and Punjabi), and statistically significant (up to 10%) improvements for high-resource languages (such as Hindi and Bengali), depending on the task. Note that this is in contrast to the observations of Tsai et al. (2019) and Kondratyuk (2019), which indicate slightly poorer performance with multilingual fine-tuning; they fine-tune with more than forty languages together, without considering language relatedness. We observe large improvements by selecting only the languages of the family for multilingual fine-tuning.
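For concreteness, the relative difference can be computed directly from the reported metrics:

```python
def relative_delta(m_mono: float, m_multi: float) -> float:
    """Relative improvement (%) of multilingual over monolingual FT."""
    return (m_multi - m_mono) / m_mono * 100.0
```

With the Oriya wikiann-ner numbers for MuRIL quoted above (0.3882 monolingual vs 0.8848 multilingual), this gives a δ of roughly 128%.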
Trade-off or Win-Win?: Figure 1 visualizes the improvements from multilingual fine-tuning relative to monolingual fine-tuning, along with the task training set size. It is clearly evident that the smaller the task training set, the higher the relative improvement. Arguably, the data limitation of a low-resource language is bridged by the related high-resource languages. These improvements do not come at the cost of trading off high-resource languages; it is a win-win for all languages. In fact, a decrease in performance (δ < 0), indicated in red, is observed in only 16 out of the total 105 (21 task-language pairs × 5 LMs) comparisons. Interestingly, there is no task-language pair in which the δ values corresponding to all five LMs are negative, i.e. for every task-language pair at least one LM showed improvement using multilingual fine-tuning. Figure 2 illustrates the types of improvements in predicting entity tags.
Best LM Across Tasks?: Arguably, it is unfair to compare the pre-trained LMs due to vast differences in the number of languages they are pre-trained on (ranging from 11 to 104), the size of the corpora, the nature of the corpora (monolingual-only vs parallel), model types (RoBERTa, ALBERT, BERT), number of layers (8 or 12), tokenization, pre-training objectives, and compute consumed in training. Further, the mBERT model is not pre-trained with Oriya. However, it is natural to ask if there is a clear winner among the LMs in the experimentation. The boldface figures in Table 2 show the best results per task per language. Most of the best metrics fall under either IndicBERT (for Textual Entailment tasks) or MuRIL (for Title Prediction and, mostly, for Entity Classification tasks). The lowest performance is obtained with IA-O and IA-TR.

Gradation of Multilinguality
We further delve into understanding whether the degree of improvement varies with language closeness within the language family. Specifically, we start with monolingual training, i.e. the training set contains only the target language. Then, we experiment by adding each related language to the training set separately. The language that yields the highest performance boost is selected for addition to the training set. Thus, a new training set consisting of two languages is obtained. This is repeated until all the related languages are added to the training set, resulting in all-language multilingual FT. This approach is similar to Sequential Forward Selection of features in machine learning. Further, we relate this analysis to the subfamily categorization of the IA family. Experiments are performed on the NER task with the MuRIL model, with Oriya and Punjabi as the evaluation languages. The results are reported in Table 3. For a detailed discussion, consider the case of Oriya: on one end of the spectrum, in the first row, we have a monolingually fine-tuned model with only Oriya, whereas on the other end, in the last row, we have a multilingually fine-tuned model with Oriya, Bengali, Hindi, Gujarati, Marathi, and Punjabi. In the middle span, we have Oriya aided by each of Bengali, Punjabi, Marathi, Gujarati, and Hindi separately, and by their combinations. Since adding Gujarati to Oriya yields the best result compared to adding any other language, or+gu is taken as the base training set for the next iteration. In the next iteration, adding Bengali to or+gu provides the highest boost, so or+gu+bn forms the base set for the following iteration. A similar exercise is performed with Punjabi as the base language.
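The greedy build-up can be sketched as sequential forward selection. In this sketch, `score` stands in for the dev-set metric of a model fine-tuned on the given language set (a hypothetical callable; in our experiments each call corresponds to a full fine-tuning run), and the ladder is continued to all languages so the best subset can be read off afterwards:

```python
def greedy_language_ladder(target, candidates, score):
    """Greedily add the assisting language with the best dev score.

    target: base language code (e.g. "or"); candidates: remaining
    related languages; score: callable mapping a language list to a
    dev-set metric. Returns the ladder of (language_set, score) pairs.
    """
    selected = [target]
    remaining = list(candidates)
    ladder = [(list(selected), score(selected))]
    while remaining:
        best_score, best_lang = max(
            (score(selected + [lang]), lang) for lang in remaining
        )
        selected.append(best_lang)
        remaining.remove(best_lang)
        ladder.append((list(selected), best_score))
    return ladder
```

The best-performing subset is then the ladder entry with the highest score, which need not be the full set.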
In the case of Oriya, adding the Gujarati data results in about 54 percentage points of improvement (38.8% to 92.4%), which is further improved by about 0.5 percentage points with the addition of Bengali (93.0%). It appears that Hindi, Punjabi, and Marathi each negatively interfere with the or+gu+bn set, resulting in a performance drop of at least 1.5 percentage points. The best performance of 94.19% is obtained with the or+gu+bn+mr+hi set (i.e. all but Punjabi), which is 5.7 percentage points higher than considering all the languages together.
It is natural to ask if Punjabi negatively interferes with Oriya on the task. or+pa yields a 47.7 percentage point improvement over Oriya alone, which indicates positive transfer between them. Further, or+gu and or+gu+pa are almost identical (0.9245 and 0.9231), indicating that pa is perhaps redundant to gu for assisting or on the task. However, adding pa to or+gu+bn, or+gu+bn+mr, and or+gu+bn+mr+hi results in a 2-5% drop; the common denominator being Bengali, it seems that Punjabi harms the most when the base set contains Bengali. Arguably, this indicates negative interference between Bengali and Punjabi for the task.
Also, note that the improvements are not correlated with an increase in the training set size; instead, smaller sets (e.g. or+gu with 3,425 samples) yield better results than larger sets (e.g. or+bn with 21,379). Therefore, gradual deviations should be credited to the added language rather than to training-set inflation. Overall, the answer to RQ2 emerges: within the set of related languages, there likely exists a subset of languages that yields the best performance.

Transliteration
Next, we present a set of observations pertaining to the utility of transliteration in leveraging the script similarity between the Indo-Aryan languages. For a fair comparison, the IA-Original and IA-Transliterated models are considered, as both of them are pre-trained by us on the original-script and transliterated versions of the same corpora. Thus, in this part of the analysis, the higher the δ_TR − δ_O in Table 2, the stronger the role of explicit script normalization.
Transliteration with Multilingual FT: Comparing the relative difference (δ) for the transliterated and original-script models, it is observed that in 16 out of 21 task-language pairs δ_TR > δ_O; noteworthy are δ_TR = 146.96% and δ_O = 117.73% for Oriya on the wiki-section-title prediction task. This suggests that multilingual FT is even more effective with transliteration. Based on these results, the role of common-script representation emerges: the effectiveness of multilingual fine-tuning is significantly enhanced when coupled with a common-script representation via transliteration.
Transliteration for LM: Comparing the performance of monolingual fine-tuning of the original-script LM (IA-O) and the transliterated-script LM (IA-TR) reveals that the latter is better in only a few (8 out of 21) experiments. This is somewhat counter-intuitive, as the common-script representation should have made the LM pre-train better due to the presence of cognates. We speculate on the following two rationales.
• Firstly, it indicates that, perhaps, even without explicit alignment of cognates (via transliteration), the model is able to align their embeddings implicitly, corroborating (Conneau et al., 2020b; Pires et al., 2019).
• Secondly, byte-level BPE and the Unicode block arrangements for Indo-Aryan languages may be at play underneath this phenomenon. For example, the consonant pa in Hindi प (0xe0 0xa4 0xaa), Oriya ପ (0xe0 0xac 0xaa), and Punjabi ਪ (0xe0 0xa8 0xaa) differ only by their Unicode block offsets. Thus, a model that knows the byte-level representations of the writing systems could potentially learn to map them, provided the loss function guides it.
However, we leave further inquiry into the exact phenomenon for future work.
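The byte-level observation above can be checked directly in a few lines (illustrative only):

```python
# UTF-8 encodings of the consonant "pa" across three Brahmi-derived scripts.
# All three are three-byte sequences sharing the first and third bytes;
# only the middle byte differs, tracking the Unicode block offset.
pa_hindi = "प".encode("utf-8")    # Devanagari, U+092A
pa_oriya = "ପ".encode("utf-8")    # Oriya,      U+0B2A
pa_punjabi = "ਪ".encode("utf-8")  # Gurmukhi,   U+0A2A
```

A byte-level BPE model thus sees these cognate characters as near-identical byte sequences even without transliteration, which may partly explain the implicit alignment.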

Conclusion
We show that multilingual fine-tuning efficiently leverages language relatedness, leading to improvements over the monolingual approach. We substantiate this claim on the Indo-Aryan language family with experiments on five language models. Multilingual fine-tuning is particularly effective for low-resource languages (e.g., Oriya and Punjabi show improvements of up to 150% on a relative scale). We also show that careful selection of a subset of related languages can further improve performance. Devising automatic approaches for finding the optimal subset of related languages is a promising future direction. Additionally, in multilingual fine-tuning, we see some benefits of transliteration to a common script.