Expanding Pretrained Models to Thousands More Languages via Lexicon-based Adaptation

The performance of multilingual pretrained models is highly dependent on the availability of monolingual or parallel text present in a target language. Thus, the majority of the world’s languages cannot benefit from recent progress in NLP as they have no or limited textual data. To expand possibilities of using NLP technology in these under-represented languages, we systematically study strategies that relax the reliance on conventional language resources through the use of bilingual lexicons, an alternative resource with much better language coverage. We analyze different strategies to synthesize textual or labeled data using lexicons, and how this data can be combined with monolingual or parallel text when available. For 19 under-represented languages across 3 tasks, our methods lead to consistent improvements of up to 5 and 15 points with and without extra monolingual text respectively. Overall, our study highlights how NLP methods can be adapted to thousands more languages that are under-served by current technology.


Introduction
Multilingual pretrained models (Devlin et al., 2019; Conneau and Lample, 2019; Conneau et al., 2020) have become an essential method for cross-lingual transfer on a variety of NLP tasks (Pires et al., 2019; Wu and Dredze, 2019). These models can be fine-tuned on annotated data for a downstream task in a high-resource language, often English, and the resulting model is then applied to other languages. This paradigm is supposed to benefit under-represented languages that do not have annotated data. However, recent studies have found that the cross-lingual transfer performance in a language is highly contingent on the availability of monolingual data in that language during pretraining (Hu et al., 2020). Languages with more monolingual data tend to have better performance, while languages not present during pretraining significantly lag behind. Several works propose methods to adapt pretrained multilingual models to low-resource languages, but these generally involve continued training on monolingual text from these languages (Wang et al., 2020; Chau et al., 2020; Pfeiffer et al., 2020, 2021). Therefore, the performance of these methods is still constrained by the amount of monolingual or parallel text available, making it difficult for languages with little or no textual data to benefit from the progress in pretrained models. Joshi et al. (2020) indeed argue that unsupervised pretraining makes the 'resource-poor poorer'. Fig. 1 plots the language coverage of multilingual BERT (mBERT; Devlin et al., 2019), a widely used pretrained model, and several commonly used textual data sources. (Code and data are available at: https://github.com/cindyxinyiwang/expand-via-lexicon-based-adaptation.)
Among the roughly 7,000 languages in the world, mBERT covers only about 1%, while Wikipedia and CommonCrawl, the two most common resources used for pretraining and adaptation, contain textual data from only 4% of the languages (often in quite small quantities, partially because language IDs are difficult to obtain for low-resource languages (Caswell et al., 2020)). Ebrahimi and Kann (2021) show that continued pretraining of multilingual models on a small amount of Bible data can significantly improve the performance of uncovered languages. Although the Bible has much better language coverage of 23%, its relatively small data size and constrained domain limit its utility (see § 6), and 70% of the world's languages do not even have this resource. The failure of technology to adapt to these situations raises grave concerns regarding the fairness of allocation of any benefit that may be conferred by NLP to speakers of these languages (Joshi et al., 2020; Blasi et al., 2021). On the other hand, linguists have been studying and documenting under-represented languages for years in a variety of formats (Gippert et al., 2006). Among these, bilingual lexicons or word lists are usually one of the first products of language documentation, and thus have much better coverage of the world's languages than easily accessible monolingual text, as shown in Fig. 1. There are also ongoing efforts to create these word lists for even more languages through methodologies such as "rapid word collection" (Boerger, 2017), which can create an extensive lexicon for a new language in a matter of days. As Bird (2020) notes: After centuries of colonisation, missionary endeavours, and linguistic fieldwork, all languages have been identified and classified. There is always a wordlist. . . . In short, we do not need to "discover" the language ex nihilo (L1 acquisition) but to leverage the available resources (L2 acquisition).
However, there have been few efforts to understand the best strategy for utilizing this valuable resource to adapt pretrained language models. Bilingual lexicons have been used to synthesize bilingual data for learning cross-lingual word embeddings (Gouws and Søgaard, 2015; Ruder et al., 2019) and task data for NER via word-to-word translation (Mayhew et al., 2017), but both approaches precede the adoption of pretrained multilingual LMs. Khemchandani et al. (2021) use lexicons to synthesize monolingual data for adapting LMs, but their experimentation is limited to several Indian languages and no attempt was made to synthesize downstream task data, while Hu et al. (2021) argue that bilingual lexicons may hurt performance.
In this paper, we conduct a systematic study of strategies to leverage this relatively under-studied resource of bilingual lexicons to adapt pretrained multilingual models to languages with little or no monolingual data. Utilizing lexicons from an open-source database, we create synthetic data for both continued pretraining and downstream task fine-tuning via word-to-word translation. Empirical results on 19 under-represented languages on 3 different tasks demonstrate that using synthetic data leads to significant improvements on all tasks (Fig. 2), and that the best strategy depends on the availability of monolingual data (§ 5, § 6). We further investigate methods that improve the quality of the synthetic data through a small amount of parallel data or by model distillation.

Figure 2: Results for baselines and adaptation using synthetic data for both resource settings across three NLP tasks.

Background
We focus on the cross-lingual transfer setting, where the goal is to maximize performance on a downstream task in a target language T. Due to the frequent unavailability of labeled data in the target language, a pretrained multilingual model M is typically fine-tuned on labeled data in the source language S, D^S_label = {(x^S_i, y^S_i)}^N_{i=1}, where x^S_i is a textual input, y^S_i is the label, and N is the number of labeled examples. The fine-tuned model is then directly applied to task data D^T_test = {(x^T_i, y^T_i)}_i in language T at test time. The performance on the target language T can often be improved by further adaptation of the pretrained model.

Adaptation with Text
There are two widely adopted paradigms for adapting pretrained models to a target language using monolingual or parallel text.
MLM Continued pretraining on monolingual text D^T_mono = {x^T_i}_i in the target language (Howard and Ruder, 2018; Gururangan et al., 2020) using a masked language model (MLM) objective has proven effective for adapting models to the target language (Pfeiffer et al., 2020). Notably, Ebrahimi and Kann (2021) show that using as little as several thousand sentences can significantly improve the model's performance on target languages not covered during pretraining.
Trans-Train For target languages with sufficient parallel text with the source language, D^ST_par = {(x^S_i, x^T_i)}_i, one can train a machine translation (MT) system that translates data from the source language into the target language. Using such an MT system, we can translate the labeled data D^S_label in the source language into the target language, yielding D^T_label, and fine-tune the pretrained multilingual model on both the source and translated labeled data, D^S_label ∪ D^T_label. This method often brings significant gains to the target language, especially for languages with high-quality MT systems (Hu et al., 2020; Ruder et al., 2021).
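As a concrete illustration, the Trans-train recipe above can be sketched in a few lines of Python. This is a hedged sketch, not the paper's implementation: `translate` stands in for an MT system trained on the parallel data, and all names are illustrative.

```python
def translate_train_data(source_data, translate):
    """Build the Trans-train training set.

    source_data: list of (text, label) pairs in the source language.
    translate: a stand-in for an MT system that maps a source text
    into the target language (assumed; not part of this sketch).
    Each translated example keeps its original label, and the final
    set is the union of source and translated data.
    """
    translated = [(translate(x), y) for x, y in source_data]
    return source_data + translated


# Toy demo with a dummy "MT system" that just upper-cases the text.
demo = translate_train_data([("a war is looming", "NEG")], str.upper)
```

In practice the quality of this method hinges entirely on the MT system, which is why it is unavailable for most under-represented languages.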

Challenges with Low-resource Languages
Both methods above require D^T_mono or D^ST_par in the target language T, so they cannot be directly extended to languages without this variety of data. Joshi et al. (2020) classified the roughly 7,000 languages of the world into six groups based on the availability of data in each language. The two groups posing the biggest challenges for NLP are: "The Left-Behinds," languages with virtually no unlabeled data. We refer to this as the No-Text setting.
"The Scraping-Bys," languages with a small amount of monolingual data. We refer to this as the Few-Text setting.
These languages make up 85% of languages in the world, yet they do not benefit from the development of pretrained models and adaptation methods due to the lack of monolingual and parallel text. In this paper, we conduct a systematic study of strategies directly targeted at these languages.

Adapting to Under-represented Languages Using Lexicons
Since the main bottleneck of adapting to underrepresented languages is the lack of text, we adopt a data augmentation framework (illustrated in Fig. 3) that leverages bilingual lexicons, which are available for a much larger number of languages.

Synthesizing Data Using Lexicons
Given a bilingual lexicon D^ST_lex between the source language S and a target language T, we create synthetic sentences x^T_i in T from sentences x^S_i in S via word-to-word translation, and use this synthetic data in the following adaptation methods.
Pseudo MLM We use word-to-word translation to turn source sentences into pseudo monolingual data in the target language. We keep the words that do not exist in the lexicon unchanged, so the pseudo text can include words in both S and T. We then adapt the pretrained multilingual model on this pseudo monolingual data using the MLM objective. For the Few-Text setting, where some gold monolingual data D^T_mono is available, we train the model jointly on the union of the pseudo and the gold monolingual data.
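The word-to-word synthesis can be sketched as follows. This is a minimal illustration under stated assumptions: the toy lexicon entries are invented for the example, and out-of-lexicon tokens are passed through unchanged, as described above.

```python
import random


def synthesize_sentence(src_tokens, lexicon, rng=random):
    """Word-to-word translation of one tokenized source sentence.

    lexicon maps a lowercased source word to a list of candidate
    translations. Tokens without an entry are kept unchanged, so the
    output can mix source- and target-language words.
    """
    out = []
    for tok in src_tokens:
        translations = lexicon.get(tok.lower())
        out.append(rng.choice(translations) if translations else tok)
    return out


# Toy English->Maltese-style lexicon (entries are illustrative only).
lexicon = {"the": ["il"], "state": ["stat"], "of": ["ta'"]}
print(synthesize_sentence(["abolition", "of", "the", "state"], lexicon))
# -> ['abolition', "ta'", 'il', 'stat']
```

Because "abolition" has no entry, it survives in English, mirroring the mixed-language pseudo text the paper describes.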
Pseudo Trans-train Given the source labeled data D^S_label, we create pseudo labeled data by translating each word of a labeled source example with the lexicon while retaining the original label; we keep the original word if it does not have an entry in the lexicon. We then fine-tune the model jointly on both the pseudo and the source labeled data. Since these methods only require bilingual lexicons, we can apply them to both the No-Text and Few-Text settings. We can use either of the two methods or the combination of both to adapt the model.

Challenges with Pseudo Data
Our synthetic data could be very different from true data in the target language, because the lexicons do not cover all words in S or T, and we do not consider morphological or word order differences between T and S. Nonetheless, we find that this approach yields significant improvements in practice (see Tab. 3). We also outline two strategies that aim to improve the quality of the synthetic data in the next section.

Refining the Synthetic Data
Label Distillation The pseudo labeled data could be noisy, because a synthetic example x^T_i may warrant a different label from the original label y^S_i (see Tab. 1). To alleviate this issue, we propose to automatically "correct" the labels of the pseudo data using a teacher model. Specifically, we fine-tune the pretrained multilingual model as a teacher model using only D^S_label. We use this model to generate new pseudo labels for the synthetic examples, yielding the distilled data D^T_distill. We then fine-tune the pretrained model on both the new pseudo labeled data and the source labeled data, D^T_distill ∪ D^S_label.

Induced Lexicons with Parallel Data For the Few-Text setting, we can leverage the available parallel data D^ST_par to further improve the quality of the augmented data. Specifically, we use unsupervised word alignment to extract additional word pairs from the parallel data, and use the combination of the PanLex lexicon and the induced word pairs to synthesize the pseudo data.

Table 1: Examples of pseudo monolingual data and pseudo labeled data for POS tagging for Maltese (mlt). Pseudo monolingual data (from eng "Anarchism calls for the abolition of the state, which it holds to be undesirable, unnecessary, and harmful."): "Anarchism calls g al il abolition ta' il stat , lima hi holds g al tkun undesirable , bla bzonn , u harmful ." Pseudo labeled data (from eng "I suspect the streets of Baghdad will look as if a war is looming this week."): "jien iddubita il streets ta' Bagdad xewqa hares kif jekk a gwerra is looming dan ġimg a ." Words in red have different labels between the source language and the label-distilled data. This is because "xewqa" in Maltese is a noun meaning "desire, will", while the word "will" is not used as a noun in the original English sentence.

General Experimental Setting
In this section, we outline the tasks and data setting used by all experiments. We will then introduce the adaptation methods and results for the No-Text setting in § 5 and the Few-Text setting in § 6.

Tasks, Languages and Model
We evaluate on the gold test sets of three different tasks with relatively good coverage of under-represented languages: named entity recognition (NER), part-of-speech (POS) tagging, and dependency parsing (DEP). We run each fine-tuning experiment with 3 random seeds and report the average performance. For NER and POS tagging, we follow the data processing and fine-tuning hyper-parameters in Hu et al. (2020). We use the Udify (Kondratyuk and Straka, 2019) codebase and configuration for parsing.
Languages For each task, we select languages that have task data but are not covered by the mBERT pretraining data. The languages we use can be found in Tab. 2. Most fall under the Few-Text setting (Joshi et al., 2020). We employ the same languages to simulate the No-Text setting as well.
Model We use the multilingual BERT model (mBERT) because it has competitive performance on under-represented languages (Pfeiffer et al., 2020). We find that our mBERT performance on WikiNER and POS is generally comparable to or exceeds the XLM-R large results in Ebrahimi and Kann (2021). We additionally verify that our results also hold for XLM-R in § 7.

Adaptation Data
Lexicon We extract lexicons between English and each target language from the PanLex database. 5 The number of lexicon entries varies from about 0.5k to 30k, and most of the lexicons have around 5k entries. The lexicon statistics for each language can be found in Tab. 2.
Pseudo Monolingual Data English Wikipedia articles are used to synthesize monolingual data. We first tokenize the English articles using Stanza (Qi et al., 2020) and keep the first 200k sentences. To create pseudo monolingual data for a given target language, we replace each English word with its translation if the word exists in the bilingual lexicon. If an English word has multiple possible translations, we randomly sample one, because it is difficult to estimate translation probabilities given the lack of target-language text.
Pseudo Labeled Data Using the English training data for each task, we simply replace each English word in the labeled training data with its corresponding translation and retain its original label. For the sake of simplicity, we only use lexicon entries with a single word.
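The pseudo labeled data construction can be sketched similarly. This is an illustrative sketch (function and lexicon names are not from the paper's code): each token keeps its original tag, and only single-word lexicon entries are used, mirroring the simplification described above.

```python
import random


def synthesize_labeled(tagged_sentence, lexicon, rng=random):
    """Create one pseudo labeled example for token-level tasks.

    tagged_sentence: list of (token, tag) pairs from English training
    data. Each token that has a single-word lexicon entry is replaced
    (sampling randomly among candidates) and retains its tag;
    multi-word entries are skipped and out-of-lexicon tokens are
    left unchanged.
    """
    pseudo = []
    for tok, tag in tagged_sentence:
        cands = [t for t in lexicon.get(tok.lower(), []) if " " not in t]
        pseudo.append((rng.choice(cands) if cands else tok, tag))
    return pseudo
```

Note that the retained tags are exactly the noise source that Label Distillation later tries to correct: a correct English tag is not always the correct tag for the substituted target-language word.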

No-Text Setting
We analyze the results of the following adaptation methods for the setting where we do not have any monolingual data.
Pseudo MLM The mBERT model is trained on the pseudo monolingual data using the MLM objective. We train the model for 5k steps for the NER task and 10k steps for the POS tagging and Parsing tasks.
Pseudo Trans-train We fine-tune mBERT or the model adapted with Pseudo MLM for a downstream task on the concatenation of both the English labeled data and the pseudo labeled data.
Label Distillation We use the model adapted with Pseudo MLM as the teacher model to generate new labels for the pseudo labeled data, which we use jointly with the English labeled data to finetune the final model.
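The relabeling step can be sketched as below. This is a hedged illustration: `teacher_predict` is a stand-in for the Pseudo-MLM-adapted model fine-tuned only on English labeled data, and the data layout is invented for the example.

```python
def distill_labels(pseudo_data, teacher_predict):
    """Relabel pseudo examples with a teacher model.

    pseudo_data: list of (tokens, labels) pairs whose labels were
    copied from the source sentences during word-to-word translation.
    teacher_predict: stand-in for a tagger trained on source labeled
    data only; it returns one label per token.
    The copied labels are discarded and replaced by the teacher's
    predictions, yielding the distilled training set D^T_distill.
    """
    return [(tokens, teacher_predict(tokens)) for tokens, _ in pseudo_data]
```

The distilled set is then concatenated with the English labeled data for the final fine-tuning run, as described above.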

Results
The average performance of the different adaptation methods, averaged across all languages in each task, can be found in Tab. 3.

5 https://panlex.org/snapshot/
Pseudo Trans-train is the best method for No-Text. Pseudo MLM and Pseudo Trans-train both bring significant improvements over the mBERT baseline for all tasks. Pseudo Trans-train leads to the best aggregated result across all tasks, and it is also the best method, or very close to the best method, for each task. Adding Pseudo Trans-train on top of Pseudo MLM does not add much improvement. Label Distillation generally leads to better performance, but overall it is comparable to only using Pseudo Trans-train.

Few-Text Setting
We test the same adaptation methods introduced in § 5 for the Few-Text setting, where we have a small amount of gold data. First, we introduce the additional data and adaptation methods for this setting.

Gold Data
Gold Monolingual Data We use the JHU Bible Corpus (McCarthy et al., 2020) as the monolingual data. Following the setup in Ebrahimi and Kann (2021), we use the verses from the New Testament, which contain 5000 to 8000 sentences for each target language.
Gold Parallel Data We can use the parallel data between English and the target languages from the Bible to extract additional word pairs. We use an existing unsupervised word alignment tool, eflomal (Östling and Tiedemann, 2016), to generate word alignments for each sentence in the parallel Bible data. To create high quality lexicon entries, we only keep the word pairs that are aligned more than once, resulting in about 2k extra word pairs for each language. We then augment the PanLex lexicons with the induced lexicon entries.
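The filtering of induced word pairs can be sketched as below. This is an assumption-laden sketch: `aligned_pairs` is presumed to be a flat list of (source word, target word) links already extracted from eflomal's alignment output, and "aligned more than once" is implemented as a count threshold of 2.

```python
from collections import Counter


def induce_lexicon(aligned_pairs, min_count=2):
    """Build an induced lexicon from word-alignment links.

    aligned_pairs: iterable of (src_word, tgt_word) links collected
    over all aligned sentence pairs. Keeping only pairs observed at
    least `min_count` times acts as a cheap precision filter against
    spurious one-off alignments.
    Returns a dict mapping each source word to its kept translations.
    """
    counts = Counter(aligned_pairs)
    lexicon = {}
    for (src, tgt), c in counts.items():
        if c >= min_count:
            lexicon.setdefault(src, []).append(tgt)
    return lexicon
```

The resulting entries are merged with the PanLex lexicon before synthesizing pseudo data, as described above.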

Adaptation Methods
Gold MLM The mBERT model is trained on the gold monolingual Bible data in the target language using the MLM objective. Following the setting in Ebrahimi and Kann (2021), we train for 40 epochs for the NER task, and 80 epochs for the POS and Parsing tasks.
Pseudo MLM We conduct MLM training on both the Bible monolingual data and the pseudo monolingual data in the target language. The Bible data is up-sampled to match the size of the pseudo monolingual data. We train the model for 5k steps for the NER task and 10k steps for the POS tagging and Parsing tasks.

Table 3: Average F1 score for languages in each task. We report LAS for Parsing. We compare three adaptation methods (∆ indicates gains over baselines): Pseudo Trans-train, Pseudo MLM, and Both. We also examine two data refinement methods: Label Distillation (∆ is gains over Both) and PanLex+Induced (∆ is gains over PanLex). Bold is the best result for each dataset, and underline indicates the best improvements among the three adaptation methods over the baselines. We test the significance of the average gains over the baselines in the last column using paired bootstrap resampling. * indicates significant gains with p < 0.001 and † indicates significant gains with p < 0.05.
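The up-sampling of the smaller gold corpus can be sketched as below. The paper does not specify the exact scheme, so this simple repeat-and-truncate strategy is an assumption.

```python
def upsample(gold, target_size):
    """Repeat the smaller gold corpus (truncating the final copy) so
    that it matches the size of the pseudo monolingual corpus before
    the two are concatenated for MLM training."""
    reps, rem = divmod(target_size, len(gold))
    return gold * reps + gold[:rem]
```

With roughly 5k-8k Bible verses against 200k pseudo sentences, this means each verse is seen a few dozen times per pass over the combined corpus.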

Results
The average performance in each task for Few-Text can be found in Tab. 3.
Pseudo MLM is the most competitive strategy for Few-Text. Unlike in the No-Text setting, Pseudo Trans-train only marginally improves or even decreases performance on three of the four datasets we consider. On the other hand, Pseudo MLM, which uses both gold and pseudo monolingual data for MLM adaptation, consistently and significantly improves over Gold MLM for all tasks. Again, using Pseudo Trans-train on top of Pseudo MLM does not help, and actually leads to a relatively large performance loss on the syntactic tasks (POS tagging and Parsing).
Label Distillation brings significant improvements for the two syntactic tasks. Notably, it is the best performing method for POS tagging, but it still lags behind Pseudo MLM for Parsing. This is likely because generating correct labels is much harder for Parsing than for POS tagging. The effect of Label Distillation on the NER task is less consistent: it improves over Pseudo Trans-train for WikiNER but not for MasakhaNER. This is likely because named entity tags of the same words tend to remain the same across languages, so the pseudo task data has less label noise for Label Distillation to correct.

Adding Induced Lexicons We examine the effect of using the lexicons augmented by word pairs induced from the Bible parallel data. The results can be found in Tab. 3. Adding the induced lexicon significantly improves NER performance, while it hurts the two syntactic tasks.
To understand what might have prevented the syntactic tasks from benefiting from the extra lexicon entries, we plot the distribution of the part-of-speech tags of the words in the PanLex lexicons and the lexicons induced from the Bible in Fig. 4. The PanLex lexicons contain more nouns than the Bible lexicons, while the Bible lexicons cover more verbs than PanLex. However, the higher verb coverage of the induced lexicons actually leads to a larger prediction accuracy drop for verbs in the POS tagging task. We hypothesize that the pseudo monolingual data created using the induced lexicons contains more target-language verbs in the wrong word order, which could be more harmful for syntactic tasks than for tasks that are less sensitive to word order, such as NER.

On the other hand, the English NER training data for MasakhaNER is from the news domain, which potentially makes Pseudo Trans-train a stronger method for adapting the model simultaneously to the target language and to the news domain. One advantage of Pseudo MLM is that English monolingual data is much cheaper to acquire, while Pseudo Trans-train is constrained by the amount of labeled data for a task. We show in § A.4 that Pseudo MLM has more benefit for MasakhaNER when we use a subset of the NER training data.

Analyses
Performance with XLM-R We mainly use mBERT because it has competitive performance for under-represented languages and is more computationally efficient due to its smaller size. Here we verify that our methods show the same trend on a different model, XLM-R (Conneau et al., 2020). We focus on a subset of languages in the POS tagging task for the Few-Text setting, with results in Tab. 4. We use the smaller XLM-R base for efficiency, and compare to the best result in prior work, which uses XLM-R large (Ebrahimi and Kann, 2021). Tab. 4 shows that our baseline is comparable to or better than prior work. Similar to the conclusion in § 6, Pseudo MLM is the most competitive strategy and brings significant improvements over prior work. While adding Pseudo Trans-train to Pseudo MLM does not help, using Label Distillation further improves performance.

Effect of Baseline Performance Using pseudo data might be especially effective for languages with lower performance. We plot the improvement of different languages over the baseline in Fig. 5, where languages are arranged with increasing baseline performance from left to right. We mainly plot Pseudo MLM and Pseudo Trans-train for simplicity. Fig. 5 shows that, for both resource settings, lower performing languages on the left tend to gain more from using pseudo data.
Using NMT Model to Synthesize Data One problem with the pseudo data synthesized using word-to-word translation is that it cannot capture the correct word order or syntactic structure in the target language. If we have a good NMT system that translates English into the target language, we might be able to get more natural pseudo monolingual data by translating the English sentences to the target language.
Since the target languages we consider are usually not supported by popular translation services, we train our own NMT system by fine-tuning an open sourced many-to-many NMT model on the Bible parallel data from English to the target language (details in § A.2). Instead of creating pseudo monolingual data using the lexicon, we can simply use the fine-tuned NMT model to translate English monolingual data into the target language.
The results of using NMT as opposed to lexicon for Pseudo MLM on all four tasks can be found in Tab. 5. Unfortunately, NMT is consistently worse than word-to-word translation using lexicons. We find that the translated monolingual data tend to have repeated words and phrases that are common in the Bible data, although the source sentence is from Wikipedia. This is because the NMT model overfits to the Bible data, and it fails to generate good translation for monolingual data from a different domain such as Wikipedia.
Comparison to Few-shot Learning Lauscher et al. (2020) found that using as few as 10 labeled examples in the target language can significantly outperform the zero-shot transfer baseline for languages included in mBERT. We focus on the zero-shot setting in this paper because the languages we consider have very limited data, and it could be expensive or unrealistic to annotate data in every task for thousands of languages. Nonetheless, we experiment with k-shot learning to examine its performance on low-resource languages in the MasakhaNER task. Tab. 6 shows that using 10 labeled examples brings improvements over the mBERT baseline for a subset of the languages, yet it is mostly worse than our best adapted model, which uses no labeled target-language data. When we have access to 100 examples, few-shot learning begins to reach or exceed our zero-shot model. In general, few-shot learning seems to require more data to consistently perform well on under-represented languages, while our adaptation methods bring consistent gains without any labeled data. Combining the best adapted model with few-shot learning leads to mixed results. More research is needed to understand the annotation cost and benefit of few-shot learning for low-resource languages.

Table 6: Results on MasakhaNER for k-shot learning. We compare to the zero-shot mBERT baseline and our best adapted model.

Related Work
Several methods have been proposed to adapt pretrained language models to a target language. Most of them rely on MLM training using monolingual data in the target languages (Wang et al., 2020; Chau et al., 2020; Muller et al., 2021; Pfeiffer et al., 2020; Ebrahimi and Kann, 2021), competitive NMT systems trained on parallel data (Hu et al., 2020; Ponti et al., 2021), or some amount of labeled data in the target languages (Lauscher et al., 2020). These methods cannot be easily extended to low-resource languages with no or a limited amount of monolingual data, which account for more than 80% of the world's languages (Joshi et al., 2020).
Bilingual lexicons have been commonly used for learning cross-lingual word embeddings (Mikolov et al., 2013; Ruder et al., 2019). Among these, some work uses lexicons to synthesize pseudo bilingual (Gouws and Søgaard, 2015; Duong et al., 2016) or pseudo multilingual corpora (Ammar et al., 2016). Mayhew et al. (2017) propose to synthesize task data for NER using bilingual lexicons. More recently, Khemchandani et al. (2021) synthesize monolingual data in Indian languages for adapting pretrained language models via MLM. Hu et al. (2021) argue that using bilingual lexicons for alignment hurts performance compared to word-level alignment based on parallel corpora. Such parallel corpora, however, are not available for truly under-represented languages. Reid and Artetxe (2021) employ a dictionary denoising objective where a word is replaced with its translation into a random language with a certain probability. This can be seen as a text-to-text variant of our approach applied to multilingual pretraining. None of the above works provides a systematic study of methods that utilize lexicons and limited data resources for adapting pretrained language models to languages with no or limited text.

Conclusion and Discussion
We propose a pipeline that leverages bilingual lexicons, an under-studied resource with much better language coverage than conventional data, to adapt pretrained multilingual models to underrepresented languages. Through comprehensive studies, we find that using synthetic data can significantly boost the performance of these languages while the best method depends on the data availability. Our results show that we can make concrete progress towards including under-represented languages into the development of NLP systems by utilizing alternative data sources.
Our work also has some limitations. Since we focus on different methods of using lexicons, we restrict experiments to languages in Latin script and only use English as the source language for simplicity. Future work could explore the effect of using different source languages and combining transliteration (Muller et al., 2021) or vocabulary extension (Pfeiffer et al., 2021) with lexicon-based data augmentation for languages in other scripts. We also did not test the data augmentation methods on higher-resourced languages as MLM fine-tuning and translate-train are already effective in that setting and our main goal is to support the languages with little textual data. Nonetheless, it would be interesting to examine whether our methods can deliver gains for high-resource languages, especially for test data in specialized domains.
We point to the following future directions: First, phrases instead of single-word entries could be used to create pseudo data. Second, additional lexicons beyond PanLex could be leveraged (see § A.5). Third, more effort could be spent on digitizing both existing monolingual data, such as books (Gref, 2016), and lexicons into a format easily accessible to NLP practitioners. Although PanLex already covers over 5,000 languages, some language varieties have as few as 10 words in the database, while there exist many paper dictionaries that could be digitized through technologies such as OCR (Rijhwani et al., 2020). Lexicon collection is also relatively fast, which could be a cost-effective strategy to significantly boost the performance of many languages without lexicons. Finally, the quality of synthetic data could be improved by incorporating morphology. However, we find that there are virtually no existing morphological analysis datasets or toolkits for the languages we consider. Future work could aim to improve the morphological analysis of these low-resource languages.

Figure 6: Improvements of using combined lexicons compared to PanLex lexicons for Pseudo MLM. Languages with fewer PanLex entries tend to benefit more from the combined lexicons.

A.1 Experiment Details
For all experiments using MLM training for the NER task, we train 5,000 steps, roughly equivalent to 40 epochs on the Bible; for MLM training for POS tagging and Parsing, we train 10,000 steps, equivalent to 80 epochs on the Bible. We use a learning rate of 2e-5, a batch size of 32, and a maximum sequence length of 128. We did not tune these hyperparameters because we mostly follow the ones provided in Ebrahimi and Kann (2021). To fine-tune the model for a downstream task, we use a learning rate of 2e-5 and a batch size of 32. We train all models for 10 epochs and pick the checkpoint with the best performance on the English development set.
We use a single GPU for all adaptation and fine-tuning experiments. Pseudo MLM usually takes less than 5 hours. Pseudo Trans-train and other task-specific fine-tuning usually take around 2 to 3 hours.

A.2 NMT Models
We use the many-to-many NMT models provided in the fairseq repo (Ott et al., 2019). We use the model with 175M parameters and fine-tune the NMT model for 50 epochs on the parallel data from the Bible.
We use beam size of 5 to generate translations.

A.3 Induced lexicons help languages with Fewer PanLex Entries
We plot the performance difference between using the combined lexicons and PanLex alone for the Few-Text setting in Fig. 6. The languages are arranged from left to right by increasing number of PanLex entries. For MasakhaNER, the three languages with the fewest entries in PanLex show much more significant gains from using the combined lexicon. While using the combined lexicons generally hurts POS tagging, the languages with fewer entries in PanLex tend to see a smaller performance decrease.

A.4 Effect of Task Data Size
Our experiments in Tab. 3 show that MasakhaNER benefits more from Pseudo Trans-train, likely because the labeled data is closer to the domain of the test data. However, this result might not hold when the amount of labeled data is limited. One advantage of Pseudo MLM over Pseudo Trans-train is that it only requires English monolingual data to synthesize pseudo training data, while Pseudo Trans-train is constrained by the availability of labeled data. We subsample the English NER training data for MasakhaNER and plot the average F1 score of Pseudo Trans-train, Pseudo MLM, and using both. Fig. 7 shows that the advantage of Pseudo Trans-train on MasakhaNER shrinks as the amount of labeled data decreases, and using both methods is more competitive when the task data is small.

A.5 List of Bilingual Lexicons
We provide a list of bilingual lexicons beyond PanLex:
• Swadesh lists in about 200 languages on Wikipedia
• Words in 3,156 language varieties in CLICS
• Intercontinental Dictionary Series in about 300 languages
• 40-item wordlists in 5,000+ languages in ASJP
• Austronesian Basic Vocabulary Database in 1,700+ languages
• Diachronic Atlas of Comparative Linguistics in 500 languages