Overlap-based Vocabulary Generation Improves Cross-lingual Transfer Among Related Languages

Pre-trained multilingual language models such as mBERT and XLM-R have demonstrated great potential for zero-shot cross-lingual transfer to low web-resource languages (LRLs). However, due to limited model capacity, the large difference in the sizes of available monolingual corpora between high web-resource languages (HRLs) and LRLs does not provide enough scope for co-embedding the LRLs with the HRLs, thereby hurting the downstream task performance of LRLs. In this paper, we argue that relatedness among languages in a language family along the dimension of lexical overlap may be leveraged to overcome some of the corpora limitations of LRLs. We propose Overlap BPE (OBPE), a simple yet effective modification to the BPE vocabulary generation algorithm which enhances overlap across related languages. Through extensive experiments on multiple NLP tasks and datasets, we observe that OBPE generates a vocabulary that increases the representation of LRLs via tokens shared with HRLs. This results in improved zero-shot transfer from related HRLs to LRLs without reducing HRL representation and accuracy. Unlike previous studies that dismissed the importance of token overlap, we show that in the low-resource related-language setting, token overlap matters: synthetically reducing the overlap to zero can cause as much as a four-fold drop in zero-shot transfer accuracy.


Introduction
Zero-shot cross-lingual transfer is the ability of a model to learn from labeled data in one language and transfer the learning to another language without any labeled data. Transformer (Vaswani et al., 2017) based multilingual models pre-trained on unlabeled data from multiple languages are the state-of-the-art means for cross-lingual transfer (Ruder et al., 2019; Devlin et al., 2019a). While pre-training based cross-lingual transfer holds great promise for low web-resource languages (LRLs), such techniques are found to be more effective for transfer within high web-resource languages (HRLs) (Wu and Dredze, 2020).

Table 1: Language and token frequencies: English: University (10), versity (6); German: Universitaten (2); Dutch: Universiteit (1); Western Frisian: Universiteiten (1). Starting vocab: Uni, versit, U, n, i, v, e, r, s, i, t, y, a. BPE vocab: versity, Uni, versit, U, n, i, v, e, r, s, i, t, y, a. OBPE vocab: Universit, Uni, versit, U, n, i, v, e, r, s, i, t, y, a.

Vocabulary generation is an important step in multilingual model training, where vocabulary size directly impacts model capacity. Usually, the vocabulary is generated from a union of HRL and LRL data. This often results in under-allocation of vocabulary bandwidth to LRLs, as LRL data is significantly smaller in size compared to HRL data. This under-allocation of model capacity results in lower LRL performance (Wu and Dredze, 2020), as mentioned previously. In response, prior research has explored the development of region-specific models (Antoun et al.; Khanuja et al., 2021), generating vocabulary specific to language clusters (Chung et al., 2020), and exploiting relatedness among languages to build better LMs for LRLs (Khemchandani et al., 2021). However, none of these methods has utilized relatedness among languages for better vocabulary generation during multilingual pre-training.
In this paper, we hypothesize that exploiting language relatedness can result in an overall more effective vocabulary, one that is also better representative of LRLs. Closely related languages (e.g., languages belonging to a single family) have common origins for words with similar meanings. We show some examples across three different families of related languages in Table 8. Morphological inflections of the root word lead to lexically overlapping tokens across languages. Learning representations for such subwords in lexically overlapping words shared across an HRL and its related LRLs can enable better transfer of supervision from the HRL to the LRLs. During Masked Language Modelling (MLM) pre-training (Devlin et al., 2019a), the shared tokens can serve as anchors in learning contextual representations of neighboring tokens. However, choosing the correct granularity of sharing automatically is tricky. On one extreme, we can choose a vocabulary which favours longer units frequent in the HRL without regard for sharing, leading to better semantic representation of the tokens but no cross-lingual transfer. On the other extreme, we can choose a character-level vocabulary (Ma et al., 2020), where every token is shared across languages but has no semantic significance.
Given text from a mix of high and low web-resource languages (HRLs and LRLs, respectively), Byte Pair Encoding (BPE) (Sennrich et al., 2016) and its variants like Wordpiece (Schuster and Nakajima, 2012) and Sentencepiece (Kudo and Richardson, 2018) prefer frequent tokens, most of which come from the HRLs. This causes most long HRL tokens to get included, leaving only a limited budget of short tokens for the LRLs. Any sub-token level overlap between HRL and LRL can get lost in this process. In a zero-shot setting, since the available supervision is HRL-based, this creates a bottleneck when transferring supervision from HRLs to LRLs. Over-sampling LRLs is a common strategy to offset this imbalance, but that hurts HRL performance, as shown in Conneau et al. (2020a).
In this paper, we propose Overlap BPE (OBPE). OBPE chooses a vocabulary by giving primary consideration to token overlap among the HRL and LRLs. OBPE prefers vocabulary units which are shared across multiple languages, while also encoding the input corpora compactly. Thus, OBPE balances the trade-off between cross-lingual subword sharing and the need for robust representation of individual languages in the vocabulary. This yields a more balanced vocabulary and improved performance for LRLs without hurting HRL accuracy. Table 1 shows an example that highlights this difference between OBPE and BPE.
Recently, K et al. (2020) and Conneau et al. (2020b) concluded that token overlap is unimportant for cross-lingual transfer. However, they studied language pairs where either both languages had a large corpus, or where the languages were not sufficiently related. We focus on related languages within a family and observe a drastic drop in zero-shot accuracy when we synthetically reduce the overlap to zero (F1 drops from 58% to 17% for NER, and accuracy from 72% to 30% for text classification).
This paper offers the following contributions:
• We present OBPE, a simple yet effective modification to the popular BPE algorithm to promote overlap between LRLs and a related HRL during vocabulary generation. OBPE uses a generalized-mean based formulation to quantify token overlap among languages.
• We evaluate OBPE on twelve languages across three related families, and show consistent improvement in zero-shot transfer over state-of-the-art baselines on four NLP tasks. We analyse the reasons behind the gains obtained by OBPE and show that OBPE increases the percentage of LRL tokens in the vocabulary without reducing HRL tokens. This is unlike over-sampling strategies, where increasing one reduces the other.
• Through controlled experiments on the amount of token overlap on a related HRL-LRL pair, we show that token overlap is extremely important in the low-resource, related-language setting. Recent literature concluding that token overlap is unimportant may have overlooked this important setting.
We plan to make the source code and all resources generated in this paper publicly available.

Related Work
Transformer-based multilingual language models such as mBERT (Devlin et al., 2019b) and XLM-R (Conneau et al., 2020a) are now established as the de-facto method for zero-shot cross-lingual transfer, and thus hold promise for low-resource domains. However, recent studies have indicated that even current state-of-the-art models such as XLM-R (Large) do not yield reasonable transfer performance on low-resource target languages with limited data (Wu and Dredze, 2020). This has led to a surge of interest in enhancing cross-lingual transfer of multilingual models in the low-resource setting. We categorize existing work based on the stage of the pre-training pipeline where it is relevant.

Input Data In the data creation stage, Conneau et al. (2020a) propose over-sampling of LRL documents to improve LRL representation in the vocabulary and pre-training steps. Khemchandani et al. (2021) specifically target related languages and propose transliteration of LRL documents to the script of the related HRL for greater lexical overlap. We deploy both these techniques in this paper.
Tokenization Rust et al. (2021) show that even the tokenization step can have a crucial impact on the performance accrued to each language in a multilingual model. They propose the use of a dedicated tokenizer for each language instead of the automatically generated multilingual mBERT tokenizer. However, they continue to use the default mBERT vocabulary generator.

Vocabulary Generation Sennrich et al. (2016) highlighted the importance of subword tokens in the vocabulary and proposed use of the BPE algorithm (Gage, 1994) for efficiently growing such a vocabulary incrementally. Variants like Wordpiece (Schuster and Nakajima, 2012) and Sentencepiece (Kudo and Richardson, 2018) either build on top of BPE or follow a very similar process. Kudo (2018) is a variant that chooses tokens based on a unigram LM score; we obtained better results with BPE and continued with that. All these BPE variants incrementally add subwords based on overall frequency in the combined corpus, and they all ignore language boundaries. Chung et al. (2020) observed that such a combined approach could under-represent several languages, and proposed instead to separately create vocabularies for clusters of related languages and take a union of the cluster-specific vocabularies. However, within each cluster they continue to use the default vocabulary generator. Our approach can be used as a drop-in replacement to further enhance the quality of the cluster-specific vocabularies that they obtain.

Pre-Training and Adaptation Several previous works have proposed to include an additional alignment loss between parallel (Cao et al., 2020) or pseudo-parallel (Khemchandani et al., 2021) sentences to co-embed HRLs and LRLs. Another approach is to design language-specific Adapter layers (Pfeiffer et al., 2020a,b; Artetxe et al., 2020; Üstün et al., 2020) that can be easily fine-tuned for each new language. Pfeiffer et al. (2021) leverage the pre-trained embeddings of lexically overlapping tokens between the vocabulary of the pre-trained model and that of an unseen target language to initialize the corresponding embeddings of the target language. However, they do not attempt to increase the fraction of such tokens in the vocabulary.
We are not aware of any prior work that explicitly promotes overlapping tokens between LRLs and HRLs in the vocabulary of multilingual models.

Overlap-based Vocabulary Generation
We are given monolingual data D_1, ..., D_n in a set of n languages L = {L_1, ..., L_n} and a vocabulary budget V̄. Our goal is to generate a vocabulary V that, when used to tokenize each D_i in a multilingual model, provides cross-lingual transfer to LRLs from related HRLs. We use L_LRL to denote the subset of the n languages that are low-resource; the remaining languages L − L_LRL are denoted as the set L_HRL of high-resource languages.
Existing methods of vocabulary creation start with a union D of monolingual data D 1 , ..., D n , and choose a vocabulary V that most compactly represents D. We first present an overview of BPE, a popular algorithm for vocabulary generation.

Background: BPE
Byte Pair Encoding (BPE) (Gage, 1994) is a simple data compression technique. Applied to vocabulary generation, it chooses a vocabulary V that minimizes the total size of D = ∪_i D_i when encoded using V:

V = argmin_{S : |S| ≤ V̄} |encode(D, S)|    (1)
The size of the encoding |encode(D_i, S)| can alternately be expressed as the sum of the frequencies of the tokens in S when D_i is tokenized using S:

|encode(D_i, S)| = Σ_{k ∈ S} f_ki    (2)

where f_ki denotes the frequency of a candidate token k in the corpus D_i of language L_i. This motivates the following efficient greedy algorithm to implement the above optimization (Sennrich et al., 2016).

Algorithm 1 Overlap-based BPE (OBPE)
  for i ∈ {1, 2, ..., n} do
    Split words in D_i into characters C_i, with a special marker after every word
  end for
  V ← ∪_i C_i
  repeat until |V| = V̄:
    Update token and pair frequencies on {D_i}, V
    Add to V the token k formed by merging the pair u, v ∈ V with the largest value of Eq. (5)

The BPE algorithm grows V incrementally. Initially, V comprises the characters in D. Then, until |V| reaches the budget V̄, it chooses the token k obtained by merging two existing tokens in V for which the frequency in D is maximum.
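To make the merge loop concrete, here is a toy Python sketch of frequency-greedy BPE (our own illustration, not the paper's implementation; the function name `bpe_merges` and the word-frequency input format are assumptions):

```python
from collections import Counter

def bpe_merges(word_freqs, num_merges):
    """Frequency-greedy BPE: repeatedly merge the most frequent adjacent
    pair of symbols. `word_freqs` maps word -> corpus frequency and is a
    toy stand-in for the combined corpus D."""
    # Start from character-level symbols, with an end-of-word marker.
    words = {tuple(w) + ("</w>",): f for w, f in word_freqs.items()}
    vocab = {s for w in words for s in w}
    for _ in range(num_merges):
        pairs = Counter()
        for w, f in words.items():
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += f
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merged = a + b
        vocab.add(merged)
        # Re-tokenize every word with the new merge applied left to right.
        new_words = {}
        for w, f in words.items():
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == (a, b):
                    out.append(merged)
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            key = tuple(out)
            new_words[key] = new_words.get(key, 0) + f
        words = new_words
    return vocab
```

On word counts such as {"university": 10, "uni": 3, "nice": 5}, the first two merges are ("n", "i") and then ("u", "ni"), so shared stems like "uni" surface purely from frequency; OBPE changes only the scoring rule that picks the pair to merge.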
A limitation of BPE on multilingual data is that tokens that appear mostly in a low-resource D_i may not get added to V, leading to sentences in L_i being over-tokenized. For a low-resource language, the available monolingual data D_i is often orders of magnitude smaller than that of a high-resource language. Models like mBERT and XLM-R address this limitation by over-sampling documents of low-resource languages. However, over-sampling LRLs might compromise the learned representations of HRLs, where task-specific labeled data is available. We propose an alternative vocabulary generation strategy, OBPE, that seeks to maximize transfer from HRLs to LRLs.

Our Proposal: OBPE
The key idea in OBPE is to maximize the overlap between an LRL and a closely related HRL while simultaneously encoding the input corpora compactly, as in BPE. When labeled data D_h^T for a task T is available in an HRL L_h, a multilingual model fine-tuned with D_h^T is likely to transfer better to a related LRL L_i when L_i and L_h share several tokens in common. Thus, the objective that OBPE seeks to optimize when creating a vocabulary is:

V = argmax_{S : |S| ≤ V̄} [ −α Σ_{L_i ∈ L} |encode(D_i, S)| + (1 − α) Σ_{L_i ∈ L_LRL} max_{L_h ∈ L_HRL} overlap_p(D_i, D_h, S) ]    (3)

where 0 ≤ α ≤ 1 determines the importance of the two terms. The first term compactly represents the total corpus, as in BPE's objective (Eq. (1)). The second term additionally biases towards a vocabulary with greater overlap of each LRL with one HRL, where we expect task-specific labeled data to be present. There are several ways to measure the overlap between two languages with respect to a current vocabulary. First, we encode each of D_i and D_j using the vocabulary S, which yields a multiset of tokens in each corpus. Inspired by the literature on fair allocation (Barman et al., 2021), we explore a continuously parameterized function that expresses the overlap between two languages' encodings as a generalized mean:

overlap_p(D_i, D_j, S) = Σ_{k ∈ S} ( (f_ki^p + f_kj^p) / 2 )^{1/p}    (4)

where f_ki denotes the frequency of token k when D_i is encoded with S. For different values of p, we get different trade-offs between fairness to each language and overall goodness. When p = −∞, the generalized mean reduces to the minimum function, and we get the most egalitarian allocation; however, this ignores the larger of the two frequencies.
When p = 1, we get a simple average, which the first term in Equation (3) already covers. For p = 0 and p = −1, we get the geometric and harmonic means, respectively. Due to the smaller size of LRL monolingual data, the frequency of a token shared across languages is likely to be much higher in the HRL monolingual data than in the LRL monolingual data. Hence, setting p to large negative values increases the weight given to LRLs and thus increases overlap. We present an exploration of the effect of p on zero-shot transfer in the experiment section.
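A minimal Python sketch of the generalized-mean overlap of Eq. (4), covering the p = 1, p = 0, and p = −∞ cases discussed above (the function names and the dict-based frequency representation are our own illustration, not the paper's code):

```python
import math

def gen_mean(x, y, p):
    """Generalized mean of two nonnegative token frequencies."""
    if x == 0 or y == 0:
        # For p <= 0 the generalized mean vanishes when either side is 0.
        return 0.0 if p <= 0 else ((x ** p + y ** p) / 2) ** (1 / p)
    if p == float("-inf"):
        return min(x, y)
    if p == 0:  # limit of the generalized mean as p -> 0
        return math.sqrt(x * y)
    return ((x ** p + y ** p) / 2) ** (1 / p)

def overlap(freq_i, freq_j, p):
    """Overlap of two languages' encodings under the current vocabulary:
    the sum over tokens of the generalized mean of their frequencies."""
    return sum(gen_mean(freq_i.get(k, 0), freq_j.get(k, 0), p)
               for k in set(freq_i) | set(freq_j))
```

For HRL frequencies {a: 100, b: 10} and LRL frequencies {a: 4}, p = 1 gives 57 (dominated by HRL mass), p = 0 gives 20, and p = −∞ credits only the shared mass min(100, 4) = 4, illustrating why negative p favours genuinely shared tokens.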
The greedy criterion that selects the candidate vocabulary item to be inducted in each iteration of OBPE is thus:

score(k) = α Σ_{L_i ∈ L} f_ki + (1 − α) Σ_{L_i ∈ L_LRL} max_{L_h ∈ L_HRL} ( (f_ki^p + f_kh^p) / 2 )^{1/p}    (5)

The data structures maintained by BPE to efficiently conduct such merges can be applied with little change to the OBPE algorithm. The only difference is that we need to separately maintain the frequency in each language in addition to the overall frequency. Since the time and resources used to create the vocabulary are significantly smaller than the model pre-training time, this additional overhead to the pre-training step is negligible.
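The per-iteration merge score can be sketched for the default p = −∞, where the generalized mean reduces to a minimum (a hedged sketch; `obpe_merge_score` and its argument layout are our own naming, not the paper's code):

```python
def obpe_merge_score(freqs, lrl, hrl, alpha=0.5):
    """Greedy OBPE score of a candidate merged token for the default
    p = -inf, where the generalized mean reduces to a minimum.
    `freqs` maps language -> frequency of the candidate token."""
    bpe_term = sum(freqs.values())  # plain BPE frequency term
    overlap_term = sum(
        max(min(freqs.get(l, 0), freqs.get(h, 0)) for h in hrl)
        for l in lrl)
    return alpha * bpe_term + (1 - alpha) * overlap_term
```

A token seen 100 times in the HRL and 5 times in an LRL scores 0.5·105 + 0.5·5 = 55, beating an unshared token seen 104 times in the HRL alone (score 52), so shared subwords can win merges they would lose under plain BPE.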

Experiments
We evaluate by measuring the efficacy of zero-shot transfer from the HRL on four different tasks: named entity recognition (NER), part-of-speech tagging (POS), text classification (TC), and cross-lingual natural language inference (XNLI).
Through our experiments, we evaluate the following questions:
1. Is OBPE more effective than BPE for zero-shot transfer? (Section 4.2)
2. What is the effect of token overlap on overall accuracy? (Section 4.3)
3. How does increased LRL representation in the vocabulary impact accuracy? (Section 4.4)
We report additional ablation and analysis experiments in Section 4.5.

Setup
Pre-training Data and Languages As our pre-training dataset {D_i}, we use Wikipedia dumps of these languages, as used in mBERT. We pre-train with 12 languages grouped into three families of four related languages each, as shown in Table 2. In each family, we simulate as HRL the most populous language, and treat the remaining ones as LRLs. The number of documents for languages simulated as LRLs is set to 20K. For the HRLs, we consider two corpus distributions:
• BALANCED: all three HRLs get 160K documents each.
• SKEWED: English gets one million documents, French half a million, and Hindi 160K.
We evaluate twelve-language models in each of these settings, and present results for separate four-language models per family in Table 13 in the Appendix. For the Indo-Aryan language set, the monolingual data of Punjabi and Gujarati is transliterated to Devanagari, the script of Hindi and Marathi. We use libindic's indictrans library (Bhat et al., 2015) for transliteration. Languages in the other two sets do not require transliteration as they share a common script. Thus, all four languages in each set are in the same script, so their lexical overlap can be leveraged.

Pre-Training Details To ensure that LRLs are not under-represented, we over-sample using exponentially smoothed weighting, similar to multilingual BERT (Devlin et al., 2019b), with exponentiation factor 0.7. We perform MLM pre-training on a BERT-base model with 110M parameters from scratch. We generate a vocabulary of size 30K. We chose a batch size of 2048, a learning rate of 3e-5, and a maximum sequence length of 128. Pre-training was done with duplication factor 5 for 64K iterations for HRLs. For all LRLs, the duplication factor was 20 and training ran for 24K iterations. MLM pre-training was done on Google v3-8 Cloud TPUs, where 10K iterations required 2.1 TPU hours.
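The exponentially smoothed weighting mentioned above can be sketched as follows (a sketch of the standard mBERT-style scheme p_i ∝ (n_i / Σ_j n_j)^s; the function name and document-count input format are illustrative):

```python
def smoothed_sampling_weights(doc_counts, s=0.7):
    """Exponentially smoothed language-sampling distribution,
    p_i proportional to (n_i / sum_j n_j) ** s. Values of s below 1
    over-sample low-resource languages relative to their raw share."""
    total = sum(doc_counts.values())
    raw = {lang: (n / total) ** s for lang, n in doc_counts.items()}
    z = sum(raw.values())  # renormalize so the weights sum to 1
    return {lang: p / z for lang, p in raw.items()}
```

For example, with 160K Hindi and 20K Marathi documents, s = 0.7 lifts Marathi's sampling probability above its raw 1/9 share while keeping the distribution normalized.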
Task-specific Data We evaluate on four downstream tasks: (1) NER: data from WikiANN (Pan et al., 2017) and XTREME (Hu et al., 2020); (2) XNLI: data from Conneau et al. (2018); (3) POS: data from XTREME (Hu et al., 2020) and TDIL; and (4) Text Classification (TC): data from TDIL and XGLUE (Liang et al., 2020). We down-sampled the TDIL data for each language to make it class-balanced. The POS tagset used for Indo-Aryan languages was the BIS tagset (Sardesai et al., 2012). Table 9 presents a summary. The test set used to compute LRL perplexity is formed by sampling 10K sentences from the Samanantar corpus (Ramesh et al., 2021) for Indic languages and from the Tatoeba corpus for the other languages.

Task-specific fine-tuning details We perform task-specific fine-tuning of pre-trained BERT on the task-specific training data of the HRL and evaluate on all languages in the same family. Here we used sizes chosen in an automated manner for all languages using compression rates; since the best performances are found when the compression rates are similar, we choose a size for each language corresponding to a compression rate of 0.5. The tokenizer used in this method is WordPiece. 5. OBPE (Ours) with default α = 0.5, p = −∞. We also perform ablations on these.

In Table 3 we observe that across all four tasks, zero-shot LRL accuracy improves compared to BPE. For example, the average accuracy on XNLI for the LRL languages improves from 55.6 to 58.1 just by changing the set of tokens in the vocabulary. These gains are obtained without compromising HRL performance on the tasks. The Clustered Vocabulary (CV) approach is much worse than BPE. These experiments are on the Balanced-12 model. In the supplementary section, we report results on the Skewed-12 (Table 10) and Balanced-4 models (Table 13) and show similar gains with these models as well. In this table, we averaged the gains over nine LRLs; in Supplementary Table 11 we show consistent gains for individual languages.
In addition to improving zero-shot transfer from HRLs to LRLs on downstream tasks, OBPE also leads to better intrinsic representation of LRLs. We validate this by measuring the pseudo-perplexity (Salazar et al., 2020) of a test set of LRL sentences. We find that the average perplexity of LRL sentences drops by 2.6% when we go from the BPE to the OBPE vocabulary. More details on this experiment appear in Figure 3 of the supplementary.
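Pseudo-perplexity scores a sentence by masking one position at a time and querying a masked LM for the held-out token's probability. A minimal sketch, with a placeholder scoring function standing in for a real MLM forward pass (the function names are our own, not from the paper or Salazar et al.'s code):

```python
import math

def pseudo_perplexity(tokens, masked_logprob):
    """Pseudo-perplexity in the style of Salazar et al. (2020): mask each
    position in turn, score the held-out token with a masked LM, and
    exponentiate the average negative log-likelihood.
    `masked_logprob(tokens, i)` stands in for a real MLM forward pass
    returning log P(tokens[i] | rest of the sentence with i masked)."""
    nll = -sum(masked_logprob(tokens, i) for i in range(len(tokens)))
    return math.exp(nll / len(tokens))
```

As a sanity check, a model that assigns every token a uniform probability over a 30K vocabulary yields a pseudo-perplexity of exactly 30,000.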
In order to investigate the reasons behind the OBPE gains, we first inspected the percentage of tokens in the vocabulary that belong to LRLs, HRLs, and in their overlap. We find that with OBPE both LRL tokens and overlapping tokens increase. Either of these could have led to the observed gains.
We analyze the effect of each of these factors in the following two sections. We present the impact of token overlap via two sets of experiments: first, a controlled setup where we synthetically vary the fraction of overlap, and second, a real-data setup where we measure the correlation between overlap and the gains of OBPE on the data as-is.

Effect of Token Overlap
For the controlled setup, we follow K et al. (2020) in synthetically controlling the amount of overlap between HRL and LRL. We trained a bilingual model between Hindi (HRL, 160K documents) and Marathi (LRL, 20K documents), two closely related languages in the Indo-Aryan family. To find the set of overlapping tokens between Hindi and Marathi, we first run OBPE on the Hindi-Marathi language pair to generate a vocabulary and label all tokens present in both languages as overlapping tokens. We then sample 10%, 40%, 50%, and 90% of the tokens from this set. We shift the Unicode code points of the entire Hindi monolingual data, except for the set of sampled tokens, so that there are no overlapping tokens between Hindi (hi) and Marathi (mr) monolingual data other than the sampled ones. Let us call this Hindi data SynthHindi. We then run OBPE on the SynthHindi-Marathi language pair to generate a vocabulary and pre-train the model. The task-specific Hindi data is also converted to SynthHindi during fine-tuning and testing of the model. Figure 1 shows results with increasing overlap. We observe increasing gains in LRL accuracy as we go from no overlap to full overlap on all three tasks. NER accuracy increases from 17% to 58% for the LRL (mr) while the HRL (hi) accuracy stays unchanged. For TC we observe similar gains. For POS, we get good cross-lingual transfer even without token overlap, because POS tags are driven more by structural similarity, and Hindi and Marathi follow similar structure.
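The Unicode-shifting step can be sketched as follows (a simplified, word-level illustration of the controlled setup; the actual setup preserves sampled subword tokens, and the offset value here is arbitrary):

```python
def shift_unicode(text, keep_tokens, offset=0x1000):
    """Shift every character's code point by `offset` except in preserved
    tokens, destroying lexical overlap for everything else. A word-level
    simplification of the controlled overlap-removal setup."""
    keep = set(keep_tokens)
    out = []
    for word in text.split():
        if word in keep:
            out.append(word)  # preserved overlapping token
        else:
            out.append("".join(chr(ord(c) + offset) for c in word))
    return " ".join(out)
```

Applying this to the Hindi corpus with an empty keep-set yields zero overlap with Marathi; growing the keep-set restores overlap in controlled increments.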
Our results contradict the conclusions of K et al. (2020), who claimed that token overlap is unimportant for cross-lingual transfer. However, there are two key differences in our setting: (1) unlike K et al. (2020), we explore low-resource settings, and (2) except for English-Spanish, the language pairs they considered are not linguistically related. To examine the importance of both these factors, in Table 4 we present the accuracy of English-Spanish in a simulated low-resource setting where we sample 20K Spanish documents and 160K English documents. We also repeat our Hindi-Marathi experiments where Marathi is not low-resource. We observe that (1) Spanish as an LRL benefits significantly from overlap with English, and (2) Marathi gains from token overlap with Hindi even in the high-resource setting.
Thus, we conclude that as long as the languages are related, token overlap is important, and the benefit from overlap is higher in the low-resource setting.

Table 5: Zero-shot performance of models in the same setting as Table 3, but comparing default sampling with over-sampling (S = 0.5). Note that even though BPE_overSamp improves LRL performance somewhat, it causes HRL performance to drop. OBPE with default sampling is best for both LRLs and HRLs. Also, OBPE_overSampled is better than BPE_overSampled (Section 4.4).

Overlap Vs Gain: Real-data setup We further substantiate our hypothesis that the shared tokens across languages favoured by OBPE enable transfer of supervision from HRL to LRL via statistics on real data. In Table 7 we show the Pearson product-moment correlation coefficient between overlap gain and performance gain within LRLs of the same family and task. We get a high positive correlation coefficient, with an average of 0.644.
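The reported correlation is the standard Pearson product-moment coefficient; as a self-contained sketch (with toy inputs, not the paper's numbers):

```python
import math

def pearson(xs, ys):
    """Pearson product-moment correlation coefficient between two
    equal-length sequences, e.g. overlap gains and performance gains."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

A coefficient near +1 means LRLs that gained more overlap under OBPE also gained more downstream accuracy, which is the pattern Table 7 reports.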

Effect of Increased LRL representation
We next investigate the impact of increased representation of LRL tokens in the vocabulary. OBPE increases LRL representation by favoring overlapping tokens, but LRL tokens can also be increased by simply over-sampling LRL documents. We train another BALANCED-12 model but with further over-sampling of LRLs, using S = 0.5 instead of S = 0.7. We observe in Figure 6 that this increases the LRL fraction but reduces HRL tokens in the vocabulary. Table 5 also compares the zero-shot transfer accuracy of over-sampled BPE against over-sampled OBPE. We find that OBPE, even with the default S, achieves the highest LRL gains, whereas aggressively over-sampled BPE hurts HRL accuracy. Within the same sampling setting, OBPE is better than the corresponding BPE.

Ablation study
We conducted experiments for different values of p, which controls the amount of overlap in the generalized mean function (Equation (5)). Setting p = 1 gives the original BPE algorithm. Setting p = 0 or p = −1 gives the geometric and harmonic mean respectively, and p = −∞ gives the minimum. We compare task-specific results for different values of p in Table 16 and find that the gains are highest in the p = −∞ (minimum) setting (Figure 4 in Appendix). We also experimented with α = 0.7, and found that for most languages the results were not better than our default α = 0.5.

Conclusion
In this paper, we address the problem of cross-lingual transfer from HRLs to LRLs by exploiting relatedness among them. We focus on lexical overlap during the vocabulary generation stage of multilingual pre-training. We propose Overlap BPE (OBPE), a simple yet effective modification to the BPE algorithm, which chooses a vocabulary that maximizes overlap across languages. OBPE encodes input corpora compactly while also balancing the trade-off between cross-lingual subword sharing and language-specific vocabularies. We focus on three sets of closely related languages from diverse language families. Our experiments provide evidence that OBPE is effective in leveraging overlap across related languages to improve LRL performance. In contrast to prior work, through controlled experiments on the amount of token overlap between two related HRL-LRL language pairs, we establish that token overlap is important when an LRL is paired with a related HRL.

A.3 Potential risks
Language models may amplify biases in data and also introduce new ones. Multilingual models explored in this paper are not immune to such issues. Detecting such biases and mitigating them is a topic of ongoing research. We are hopeful that our focus on better representation of LRLs in the vocabulary is a step towards more inclusive models.

Figure 2: Similar-meaning words with shared root forms across related Indo-Aryan languages. The BPE vocabulary does not capture the tokens corresponding to Punjabi, as it is an LRL, and will thus tokenize Niyukata into multiple tokens which do not capture its meaning, whereas Niyukata, when tokenized by the OBPE tokenizer, will contain Niyuk, which captures most of the meaning of Niyukata and whose representation will be learnt when pre-training on Punjabi monolingual data.

A.5 License
Tatoeba data, GLUE data, and the Wikipedia dumps use Creative Commons licenses. TDIL data used

Table 10: Zero-shot performance of models in the Skewed-12 setting of Table 2 on the same four tasks as Table 3. OBPE shows gains here too. Detailed numbers in

(2) Low: their sizes are only 20K. As the percentage of overlapping tokens retained is decreased from 100% to 0%, the accuracy drops, but the drop is higher in the low-resource setting.