Entropy-guided Vocabulary Augmentation of Multilingual Language Models for Low-resource Tasks

Multilingual language models (MLLMs) like mBERT promise to extend the benefits of NLP research to low-resource languages (LRLs). However, LRL words are under-represented in the wordpiece/subword vocabularies of MLLMs. This leads to many LRL words being replaced by UNK, or assembled from morphologically unrelated wordpieces, resulting in low task accuracy. (Pre-)training MLLMs after including LRL documents is resource-intensive in terms of both human inputs and computational resources. In response, we propose EVALM (entropy-based vocabulary augmented language model), which uses a new task-cognizant measure to detect the most vulnerable LRL words, i.e., those whose wordpiece segmentations are undesirable. EVALM then provides reasonable initializations of their embeddings, followed by limited fine-tuning using the small LRL task corpus. Our experiments show significant performance improvements, as well as some surprising limits to such vocabulary augmentation strategies, in various classification tasks for multiple diverse LRLs and for code-mixed texts. We will release the code and data to enable further research.


Introduction
It is common practice to start with a multilingual language model (MLLM) like mBERT or XLM-R (Conneau et al., 2020), which has been pre-trained on large multilingual corpora, and fine-tune the MLLM for diverse downstream tasks. Although MLLMs support many low-resource languages (LRLs), closer inspection reveals that the portion of the vocabulary allotted to LRLs can be orders of magnitude smaller than that allotted to high-resource languages (HRLs) such as English (Table 1). Due to this imbalance, an LRL word sometimes cannot be segmented into wordpieces from the MLLM vocabulary at all, and is conflated with the UNK (unknown) token. An even more insidious situation is that the MLLM vocabulary has enough (over-fragmented) wordpieces to assemble almost any LRL word (thereby dodging the obvious UNK alert), but the embeddings of these wordpieces collide with unrelated usage in HRLs, and/or are so sparsely trained that contextual aggregation fails to yield satisfactory LRL word embeddings, which may lead to poor LRL task performance. On the other hand, significant human and computational investments are needed to create task-specific LRL corpora that are large enough to augment and retrain the MLLM vocabulary.
In this work, we address the setting where an MLLM (presumably deficient in LRL coverage) must be minimally fine-tuned after a modest modification to its wordpiece vocabulary, guided by specific LRL tasks. We design a measure of the damage caused to an LRL word by wordpiece fragmentation, based on a suitably defined notion of entropy of the word and its constituent wordpieces with respect to the LRL task. This measure then guides the selection of LRL words with which the vocabulary should be augmented. Subsequently, we propose various ways to initialize the embeddings of these newly-introduced words, ranging from using information from the LRL itself to 'importing' information from HRLs. We call the resulting system EVALM (entropy-based vocabulary augmented language model).
We study the effect of EVALM on an existing MLLM during the fine-tuning stage for various downstream classification tasks covering multiple LRLs and also a code-mixed language. Our study shows that, for most of the datasets, EVALM's vocabulary augmentation strategy improves LRL task performance by larger margins than recent best practices (Hong et al., 2021; Hofmann et al., 2022). A detailed analysis of successes and failures delineates the perimeter of EVALM's capabilities and guides our design choices.

Related Work
Continued pre-training (Tai et al., 2020; Ebrahimi and Kann, 2021; Wang et al., 2020; Chau et al., 2020), with or without vocabulary augmentation of existing LMs like monolingual BERT, multilingual BERT (mBERT), XLM-R, etc., proves beneficial for improving domain- and language-specific performance over various tasks. Some works (Ruzzetti et al., 2021; Yu et al., 2021) focus on rare/OOV words. Liu et al. (2021) propose an embedding generator module in the pretrain-finetune pipeline to resolve vocabulary gaps. Adapters (Sachidananda et al., 2021; Moon and Okazaki, 2020; Hofmann et al., 2021) also show promising outcomes in LRL modeling. Chung et al. (2020) explore multilingual vocabulary generation from language clusters. Minixhofer et al. (2021) transfer English LMs to new languages without expensive computation. Hofmann et al. (2022) propose a simple algorithm that modifies the tokenization process to preserve the morphological structure of a word. Others (Wang et al., 2019; Hong et al., 2021) focus on embedding initialization for newly added vocabulary words that are word fragments, which is also among our concerns.

Our system: EVALM
EVALM has three key components. The purpose of the first component (Section 3.1) is to identify (based on only the train fold) a subset of vulnerable LRL words whose assembly from wordpieces is likely to distort the embedding information made available to LRL labeling tasks. The second component (Section 3.2) comprises various possible policies to initialize the embeddings of the newly-introduced LRL words. In the third component, as in AVocaDo (Hong et al., 2021), we prevent overfitting to a small LRL task corpus by regularizing the embeddings of corresponding wordpieces of each sentence obtained by the pre- and post-augmentation MLLM tokenizers.

[Algorithm 1: LRL vocabulary selection — for each candidate LRL word w, if n(w) ≥ θ and ∆H(w) ≥ γ, add its feature triple to the candidate list; sort candidates in decreasing ∆H. Output: a prefix of the candidates of the specified size as V_new.]

Vulnerable LRL word selection
We need a computationally efficient, task-sensitive surrogate for the value of introducing an LRL word into the wordpiece vocabulary. (Here we augment the vocabulary with whole LRL words, blocking their fragmentation entirely. More clever sharing of fragments is left for future work.) Suppose LRL word w is not in the MLLM vocabulary; w is fragmented into the wordpiece sequence T(w) = s_1, …, s_T by the MLLM tokenizer T. The LRL task has C class labels. A specific label is denoted c ∈ [C] = {1, …, C}. The counts of w and of a constituent wordpiece s_t in each class c are denoted n(w, c) and n(s_t, c). Based on these counts, we define the multinomial distribution

Pr(c | •) = n(•, c) / Σ_{c′ ∈ [C]} n(•, c′),   (1)

where • stands for w, s_t, etc. Based on this we define the entropy

H(•) = −Σ_{c ∈ [C]} Pr(c | •) log Pr(c | •).   (2)

Suppose H(w) is small. This means w is potentially a good feature for the LRL task. Now suppose a wordpiece s_t has large H(s_t). That means s_t is shared across other words that are distributed more evenly across classes. If this is the case for most s_t, then fragmentation of w may be a serious problem. To combine information from all wordpieces, we average their entropies, H_S(w) = (1/T) Σ_t H(s_t), and use the relative increase in entropy, ∆H(w) = (H_S(w) − H(w)) / H_S(w), going from the LRL word to its wordpieces, as one signal for the danger of fragmenting w. As an example, suppose the word 'धरम' (religion) occurs ten times in a three-class sentiment analysis dataset, with class counts over 'positive', 'neutral', and 'negative' of (1, 1, 8). Its wordpieces have class counts 'ध' (100, 85, 80), '##र' (130, 235, 250), and '##म' (130, 90, 125). Then, as per equation (2), H('धरम') = 0.639, H('ध') = 1.094, H('##र') = 1.062, and H('##म') = 1.086. The average wordpiece entropy is H_S('धरम') = (1.094 + 1.062 + 1.086)/3 = 1.081, and the relative reduction from average wordpiece entropy to word entropy is about 41%.
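For concreteness, the following minimal Python sketch computes the entropy of equation (2) and the resulting ∆H score, and reproduces the worked example above; the helper names are ours, not from the paper.

```python
import math

def entropy(class_counts):
    """Shannon entropy (natural log) of a per-class count vector, as in equation (2)."""
    total = sum(class_counts)
    probs = [n / total for n in class_counts if n > 0]
    return -sum(p * math.log(p) for p in probs)

def delta_H(word_counts, piece_counts_list):
    """Relative entropy increase from word to wordpieces: (H_S(w) - H(w)) / H_S(w)."""
    H_w = entropy(word_counts)
    H_S = sum(entropy(pc) for pc in piece_counts_list) / len(piece_counts_list)
    return (H_S - H_w) / H_S

# Worked example from the text: 'धरम' with class counts (1, 1, 8); wordpieces
# 'ध' (100, 85, 80), '##र' (130, 235, 250), '##म' (130, 90, 125).
print(round(entropy((1, 1, 8)), 3))                     # 0.639
print(round(delta_H((1, 1, 8), [(100, 85, 80),
                                (130, 235, 250),
                                (130, 90, 125)]), 2))   # 0.41, i.e. ~41% reduction
```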
We also retain two simpler signals: the number of fragments |T(w)|, and the frequency of w in the LRL task corpus. LRL words are sorted by the amount of entropy decrease, and the top LRL words are proposed for vocabulary augmentation. We remove words with very low frequency and retain a prefix of the specified size to obtain V_new, the set of LRL words to be added to the MLLM vocabulary. Algorithm 1 shows high-level pseudocode; a code sketch of this selection step follows.
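The sketch below, reusing the delta_H helper from the previous snippet, is one way the selection in Algorithm 1 could be realized; the thresholds theta and gamma, the budget, and the count dictionaries are illustrative placeholders rather than the paper's exact interface.

```python
def select_vocabulary(word_class_counts, piece_class_counts, mllm_vocab,
                      theta=5, gamma=0.2, budget=500):
    """Illustrative version of Algorithm 1: keep LRL words whose fragmentation
    loses the most class information, subject to a frequency floor."""
    candidates = []
    for word, counts in word_class_counts.items():
        if word in mllm_vocab:                 # already a whole token; nothing to fix
            continue
        pieces = piece_class_counts[word]      # per-class counts of each fragment of `word`
        n_w, dH = sum(counts), delta_H(counts, pieces)
        if n_w >= theta and dH >= gamma:       # thresholds from Algorithm 1
            candidates.append((dH, n_w, len(pieces), word))
    candidates.sort(reverse=True)              # decreasing entropy reduction
    return [c[-1] for c in candidates[:budget]]  # prefix of specified size = V_new
```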

Embedding initialization
Here we describe the different ways to initialize the embeddings of newly-added LRL words.
InitLRL: The embedding of the newly-introduced LRL word is initialized using other LRL wordpieces already in the MLLM dictionary. Suppose we add the Bengali word 'হাসপাতাল' ('hospital' in English). Suppose the existing MLLM tokenizer splits it into ['হ', '##◌াস', '##প', '##◌াত', '##◌াল']. Then we initialize the embedding of 'হাসপাতাল' with the average of the existing MLLM embeddings of these fragments.
InitHRL: Here we translate 'হাসপাতাল' to English ('hospital'), tokenize it using T, and take the average embedding of the tokens in the resulting list.
InitMix: We use the average of the InitLRL and InitHRL embeddings.
InitRand: We randomly initialize the embeddings of the newly-added words.
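The following sketch, using the Hugging Face Transformers API, illustrates how these policies could be applied when extending mBERT; the helper name, the scaling of the random vector, and the externally supplied translation are our assumptions, not the paper's exact code.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

def init_embedding(model, tokenizer, lrl_word, hrl_translation=None, mode="InitLRL"):
    """Build an initial vector for a word about to be added to the vocabulary."""
    emb = model.get_input_embeddings().weight             # (vocab_size, hidden_dim)

    def avg_of_pieces(text):                               # mean of existing wordpiece embeddings
        ids = tokenizer(text, add_special_tokens=False)["input_ids"]
        return emb[ids].mean(dim=0)

    if mode == "InitLRL":
        return avg_of_pieces(lrl_word)                     # fragments of the LRL word itself
    if mode == "InitHRL":
        return avg_of_pieces(hrl_translation)              # fragments of its HRL translation
    if mode == "InitMix":
        return (avg_of_pieces(lrl_word) + avg_of_pieces(hrl_translation)) / 2
    return torch.randn(emb.size(1)) * emb.std()            # InitRand

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-multilingual-cased")
with torch.no_grad():
    vec = init_embedding(model, tokenizer, "হাসপাতাল", "hospital", mode="InitMix")
    tokenizer.add_tokens(["হাসপাতাল"])                      # augment the vocabulary
    model.resize_token_embeddings(len(tokenizer))
    model.get_input_embeddings().weight[-1] = vec           # overwrite the newly added row
```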
It is challenging to learn good contextual embeddings for words in V_new because the task-specific training data is very small compared to the MLLM pre-training corpus. Therefore, we found it necessary to apply some regularization to avoid overfitting during fine-tuning. Let T, T′ be the initial and final MLLM tokenizers. For a particular sentence S = w_1, w_2, …, w_I with words w_i, we get two different tokenizations; these generally lead to different contextual embeddings E = (e_1, …, e_K) and E′ = (e′_1, …, e′_L), with K ≠ L in general. We average-pool these to get vectors e, e′, which a final layer uses for the classification task, with losses ℓ_T and ℓ_T′. We also use (e + e′)/2 for a third classification, with loss ℓ_mix. The overall training loss is ℓ_T + ℓ_T′ + ℓ_mix, where ℓ_T and ℓ_mix are expected to reduce overfitting.
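A minimal sketch of this training loss is given below, assuming a model that exposes an `encoder` returning hidden states and a `classifier` head over pooled vectors; these attribute names and the single-sentence batch are our simplifications, not the authors' implementation.

```python
import torch.nn.functional as F

def evalm_loss(model, old_tokenizer, new_tokenizer, sentence, label):
    """Classify from mean-pooled representations under the pre- (T) and
    post-augmentation (T') tokenizers, plus their mixture, and sum the losses."""
    pooled, losses = [], []
    for tok in (old_tokenizer, new_tokenizer):              # T and T'
        batch = tok(sentence, return_tensors="pt")
        hidden = model.encoder(**batch).last_hidden_state   # (1, seq_len, hidden_dim)
        e = hidden.mean(dim=1)                               # average-pool wordpiece vectors
        pooled.append(e)
        losses.append(F.cross_entropy(model.classifier(e), label))
    e_mix = (pooled[0] + pooled[1]) / 2
    losses.append(F.cross_entropy(model.classifier(e_mix), label))
    return sum(losses)                                       # l_T + l_T' + l_mix
```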

Datasets and evaluation metric
We experiment with six multi-class short-text classification tasks covering four Indian languages and a Hindi-English code-mixed dataset. We show the details of the datasets in Tables 2 and 6. We use mBERT as the MLLM and report macro-F1 (we report the accuracy metric in Appendix B). Details of model hyperparameters are given in Appendix C.

Quantitative results
In Figure 1, we plot macro-F1 against the extent of vocabulary augmentation. Green, orange, and blue lines show the performance with InitLRL, InitHRL, and InitMix initialization, respectively. The corresponding colored bands show 1-standard-deviation spreads.
V_new helps: For all tasks, including V_new is better than the baseline MLLM, and the gap is usually significant. This shows that even minimal training of newly added LRL tokens that used to be UNK or over-fragmented helps improve performance.
More augmentation ⇏ larger lift: We expected that a larger V_new would monotonically improve performance, but this was not universally the case. The inclusion of non-informative words as we grow V_new (∆H decreases with high variance, as shown in Appendix B, Figure 3) may be a reason.
Initialization does not matter much: Although there are cases where InitHRL or InitMix performs better than InitLRL, we did not find a significant performance difference between the different embedding initializations of new LRL words. Transfer of embeddings from a well-represented HRL is the likely reason. We also check the performance when randomly initializing the V_new words and find that, in almost all cases, random initialization performs worse than InitHRL, InitLRL, or InitMix, both for macro-F1 (Figure 1) and accuracy (Appendix B, Figure 2). This suggests that meaningful initialization helps.

Comparison with recent approaches:
We compare EVALM with AVocaDo (Hong et al., 2021), keeping V_new comparable in size. Table 4 shows that AVocaDo leads to performance degradation for all LRL datasets. The lack of domain-specificity in our datasets may be why AVocaDo's performance dropped. We also compare with FLOTA (Hofmann et al., 2022) in Figure 1. For all datasets except the GLUECoS Hi-En code-mixed dataset, EVALM performs better than FLOTA. A possible explanation is that the mBERT vocabulary already includes many English as well as Hindi words, which helps FLOTA compose embeddings of morphological components of English and Hindi words better than it can for other Indian languages.
Regularization helps: Table 5 shows that EVALM with AVocaDo-style regularization performs better than without it, for all datasets.
Cases where EVALM hurts: The samples in Table 3 show that EVALM generally helps by spotting words important for predicting the correct class. This is shown in the first two examples, where the added vocabulary (∆H = 100%) tipped the prediction toward the gold label. But the last two examples show cases where, for a word, the frequency distribution over target classes differs between the train and test sets. As a consequence, such words may become misleading at test time.

Conclusion
We have proposed a simple and effective method to augment an MLLM wordpiece vocabulary with LRL words that are important for LRL classification tasks. Our study, involving several Indian languages, shows a consistent positive impact of vocabulary augmentation and fine-tuning. We find that more augmentation does not guarantee performance improvement, and that different embedding initializations fail to show significant performance differences among themselves. We also show that regularization is crucial to prevent overfitting of new LRL word embeddings during fine-tuning. We have limited the augmentation to whole LRL words; a judicious selection of LRL wordpieces may improve performance further. We also want to extend to other target tasks (especially language generation) and a more diverse set of LRLs.

Limitations
While EVALM demonstrates that vocabulary augmentation with LRL task performance as the objective requires different priorities from vocabulary augmentation for improving representation for its own sake, our work opens up several avenues for exploration. Our understanding of the potential conflict between the fidelity of LRL word representation from wordpieces and LRL task class discrimination requirements remains far from complete, particularly when we extend from sequence-to-single-label applications to sequence labeling (as in POS and NER tagging) and further to sequence-to-sequence applications (such as translation). Further experiments with mBERT and other MLLMs may deepen our understanding of these trade-offs. When initializing an LRL word embedding using InitHRL or InitMix, we depend on automatic machine translation, which can be error-prone. Ranking by ∆H and picking a prefix fails to discount informative but correlated features. A more sophisticated formulation of the loss of information owing to fragmentation, taking multiple LRL words into account simultaneously, may alleviate this problem. In the short term, these two limitations deserve closer scrutiny.

Table 1 :
Representation of the vocabulary of various Indian languages in mBERT's wordpiece dictionary.
* Based on basic to extended Latin script Unicode range.

Table 2 :
Salient statistics of tasks. Note the small size of the LRL datasets. Further details in Table 6.

Table 4 :
The last two rows compare the best-performing EVALM model with AVocaDo. (a)-(f) are the datasets/tasks defined in Table 2.

Table 5 :
Ablation. The first and second rows show our best model's performance, trained with and without ℓ_reg, respectively.

Table 8 :
Hyperparameters used in experiments. We find the best hyperparameter settings using manual search according to macro-F1 performance.