Adapting Monolingual Models: Data can be Scarce when Language Similarity is High

For many (minority) languages, the resources needed to train large models are not available. We investigate the performance of zero-shot transfer learning with as little data as possible, and the influence of language similarity in this process. We retrain the lexical layers of four BERT-based models using data from two low-resource target language varieties, while the Transformer layers are independently fine-tuned on a POS-tagging task in the model's source language. By combining the new lexical layers and fine-tuned Transformer layers, we achieve high task performance for both target languages. With high language similarity, 10MB of data appears sufficient to achieve substantial monolingual transfer performance. Monolingual BERT-based models generally achieve higher downstream task performance after retraining the lexical layer than multilingual BERT, even when the target language is included in the multilingual model.


Introduction
Large pre-trained language models are the dominant approach for solving many tasks in natural language processing. These models represent linguistic structure on the basis of large corpora that exist for high-resource languages, such as English. However, for the majority of the world's languages, these large corpora are not available.
Past work on multilingual learning has found that multilingual BERT (mBERT; Devlin et al. 2019a) generalizes across languages with high zero-shot transfer performance on a variety of tasks (Pires et al., 2019; Wu and Dredze, 2019). However, it has also been observed that high-resource languages included in mBERT pre-training often have a better-performing monolingual model, while low-resource languages tend to be under-represented in mBERT.

* These authors contributed equally.
An alternative to multilingual transfer learning is the adaptation of existing monolingual models to other languages. Zoph et al. (2016) introduce a method for transferring a pre-trained machine translation model to lower-resource languages by only fine-tuning the lexical layer. This method has also been applied to BERT (Artetxe et al., 2020) and GPT-2 (de Vries and Nissim, 2020). Artetxe et al. (2020) also show that BERT models with retrained lexical layers perform well in downstream tasks, but comparatively high performance has only been demonstrated for languages for which at least 400MB of data is available.
To test if this procedure is also effective for low- to zero-resource languages, we consider two regional language varieties spoken in the north of the Netherlands, namely Gronings (a Low Saxon language variant) and West Frisian.

[Figure 1: Geographical areas where Gronings (in green) and West Frisian (in red) are spoken. Image modified from https://en.wikipedia.org/wiki/Low_German.]

Figure 1 visualizes the geographical areas where these regional language varieties are spoken. The regional Low Saxon language is spoken in the northeastern provinces of the Netherlands and in the north of Germany (shown in yellow). As part of the Low Saxon language area, Gronings is spoken in the province of Groningen (highlighted in green). The West Frisian language is spoken in the province of Friesland (shown in red), and it is the second official language of the Netherlands, next to Dutch. Dutch is the national language of the Netherlands, and it is spoken in every province of the Netherlands as well as in Flanders (the north of Belgium).
For both Gronings and West Frisian, limited data is available. In addition to unlabeled data, we have a small collection of annotated part-of-speech (POS) tagging data for both target languages, which we use for evaluating zero-shot model transfer. We use three monolingual BERT models (source languages: English, German, and Dutch) and mBERT to investigate whether linguistic structure can be transferred to Gronings and West Frisian by learning new subword embeddings. Our model source and target languages are closely related West Germanic languages (Eberhard et al., 2020). In Table 1, we show parallel sentences in Gronings, West Frisian, Dutch, German, and English to illustrate the lexical similarity between these languages. Additionally, the examples show that there are some lexical and syntactic differences.
We also evaluate to what extent the similarity between each source language of the monolingual models and the target languages is relevant for transferring monolingual representations, and assess the minimum amount of data necessary to adapt these models. Our pre-trained models for Gronings and West Frisian (which did not yet exist) are released. Additionally, our code for bringing language models to other low-resource languages is publicly available at https://github.com/wietsedv/low-resource-adapt.

Table 1: Parallel example sentences illustrating the lexical similarity between the five languages.

Gronings:      Tom is n jong en Mary is n wicht.
West Frisian:  Tom is in jonge en Mary is in famke.
Dutch:         Tom is een jongen en Mary is een meisje.
German:        Tom ist ein Junge und Mary ist ein Mädchen.
English:       Tom is a boy and Mary is a girl.

Gronings:      Zie haar n bloum ien heur haand.
West Frisian:  Se hie in blom yn har hân.
Dutch:         Ze had een bloem in haar hand.
German:        Sie hatte eine Blume in der Hand.
English:       She had a flower in her hand.

Gronings:      Dat was n poar joar leden.
West Frisian:  Dat wie in pear jier lyn.
Dutch:         Dat was een paar jaar geleden.
German:        Das war vor ein paar Jahren.
English:       That was a couple of years ago.

Materials
Models We use monolingual BERT-based models of the source languages, and multilingual BERT (mBERT; Devlin et al. 2019a). Specifically, we use BERT (Devlin et al., 2019b) for English, German BERT (gBERT; DBMDZ 2019) for German, and BERTje (de Vries et al., 2019) for Dutch. Each model shares the same architecture as the original base-sized (12 layers) BERT model of Devlin et al. (2019b). The lexical layer weights are shared between the input embedding layer and the output layer of the model, transforming discrete tokens into distributed vector representations and vice versa.
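To make the role of the shared lexical layer concrete, here is a minimal pure-Python sketch of weight tying: a single toy matrix E serves as the input embedding (token id to vector) and, reused on the output side, as the projection from hidden vectors back to vocabulary logits. The matrix and its dimensions are illustrative, not actual model weights.

```python
# Toy illustration (not the actual BERT code) of a tied lexical layer:
# the same matrix E maps token ids to vectors on the input side and,
# reused, maps hidden vectors back to vocabulary logits on the output side.

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

E = [  # toy embedding matrix: vocabulary size 3, hidden size 2
    [0.1, 0.2],
    [0.3, 0.4],
    [0.5, 0.6],
]

def embed(token_id):
    return E[token_id]            # input side: id -> vector

def output_logits(hidden):
    return matvec(E, hidden)      # output side: vector -> one logit per token

h = embed(1)                      # pretend the Transformer left h unchanged
print(output_logits(h))
```

Because the same matrix is used in both directions, retraining the lexical layer simultaneously changes how input tokens are read and how output distributions over the vocabulary are produced.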
Each monolingual model has a vocabulary of 30K cased tokens, while mBERT has a vocabulary of 120K tokens shared between the 104 languages it is pre-trained on. These languages include English, German, Dutch, and West Frisian, but not Gronings. The monolingual BERT models contain 110M parameters, 24M of which are part of the lexical embeddings. Due to its larger vocabulary size, mBERT contains 180M parameters, 92M of which are part of the lexical embeddings.
Labeled data We use POS-annotated treebanks from the Universal Dependencies (UD) project (Zeman et al., 2020), corresponding to the languages of the monolingual BERT models. For English, we use GUM (6.0K sentences; 113.4K tokens) and ParTUT (2.1K sentences; 49.6K tokens). In addition, HDT (189.9K sentences; 3.4M tokens) and GSD (15.6K sentences; 287.7K tokens) are used for German. Finally, Alpino (13.6K sentences; 208.5K tokens) and LassySmall (7.3K sentences; 98.0K tokens) are used for Dutch. All treebanks are based on various text types from a diverse set of sources. The standard data splits for each of the annotated treebanks are used for training, validation and testing.
We evaluate the performance of our language models on POS-annotated data of Gronings and West Frisian. Manually annotated texts from the Klunderloa project are used for Gronings (3.8K sentences, 49.0K tokens; fiction, poetry, and songs for children). Annotations follow the UD guidelines. West Frisian is under development in the UD project, and we consider all currently available annotations (1.0K sentences, 15.9K tokens; mainly fiction and news). For both treebanks, 25% is used for development and 75% as a test set.
Unlabeled data The new sub-word embeddings are learned from texts written in Gronings and West Frisian. In total, we have 43MB (8.3M tokens) of plain text available for Gronings. These texts are derived from the Bible, fiction and non-fiction texts, poetry, and Low Saxon Wikipedia. The West Frisian data collection consists of 59MB (10.8M tokens) of plain text extracted from fiction and nonfiction texts, and the multilingual OSCAR corpus (Ortiz Suárez et al., 2020).
Language similarity To quantify language similarity, we use the (lexical-phonetic) LDND measure (Wichmann et al., 2010) on the basis of the 40-item word lists from the ASJP database (Wichmann et al., 2010). While a syntax-based measure may be preferred, this is not available for the included language varieties. We use the LDND as a proxy, given that linguistic distance measures between different linguistic levels are correlated (Spruit et al., 2009). Figure 2 visualizes the relative linguistic distances between the five language varieties using multidimensional scaling (MDS; Torgerson, 1952). If cross-lingual transfer benefits from language similarity, we expect Gronings and West Frisian to profit most from a monolingual Dutch model and least from a monolingual English model, with a German model performing in-between.
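The core of such a lexical distance measure can be sketched as a length-normalized Levenshtein distance averaged over aligned word lists. This is a simplified sketch: the full LDND additionally normalizes by the average distance between non-synonymous word pairs to correct for chance similarity, and the word lists below are illustrative rather than the actual ASJP entries.

```python
# Hedged sketch of the core of an LDND-style lexical distance: the mean
# length-normalized Levenshtein distance over aligned word lists.
# The word lists are illustrative, not the actual 40-item ASJP lists.

def levenshtein(a, b):
    """Standard edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[-1] + 1,            # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def ldn(words_a, words_b):
    """Mean length-normalized Levenshtein distance over aligned word pairs."""
    dists = [levenshtein(a, b) / max(len(a), len(b))
             for a, b in zip(words_a, words_b)]
    return sum(dists) / len(dists)

dutch   = ["hand", "bloem", "jaar"]   # illustrative items only
frisian = ["han", "blom", "jier"]
english = ["hand", "flower", "year"]

print(ldn(dutch, frisian))  # closer language pair -> smaller distance
print(ldn(dutch, english))
```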

Model Training
Our training procedure consists of two separate fine-tuning steps. The Transformer layers in the three monolingual BERT models and mBERT are fine-tuned for the POS-tagging task. Independently, new lexical layers for each BERT model are trained for the two target languages with a masked language modeling pre-training objective. Afterwards, the retrained lexical layer and the fine-tuned Transformer layers are combined to yield a POS-tagging model that is now adapted to the target language. Optimal checkpoint combinations of retrained lexical layers and fine-tuned Transformer layers are based on their performance on the development data for each target language.
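The recombination step above can be sketched as a simple swap over named parameters: everything comes from the POS-fine-tuned source-language model except the lexical (embedding) parameters, which come from the target-language MLM model. The dictionaries below stand in for real state dicts (e.g., in PyTorch), and the parameter names are illustrative assumptions, not the actual checkpoint keys.

```python
# Toy sketch of combining a retrained lexical layer with fine-tuned
# Transformer layers. Parameter names and values are illustrative.

LEXICAL_PREFIX = "embeddings.word_embeddings"

def combine(pos_state, mlm_state):
    """Take all parameters from the POS model, except the lexical layer,
    which is swapped in from the target-language MLM model."""
    combined = dict(pos_state)
    for name, weights in mlm_state.items():
        if name.startswith(LEXICAL_PREFIX):
            combined[name] = weights
    return combined

pos_state = {"embeddings.word_embeddings.weight": "nl-vectors",
             "encoder.layer.0.attention.weight": "pos-tuned",
             "classifier.weight": "pos-head"}
mlm_state = {"embeddings.word_embeddings.weight": "gronings-vectors",
             "encoder.layer.0.attention.weight": "mlm-tuned"}

model = combine(pos_state, mlm_state)
print(model["embeddings.word_embeddings.weight"])  # from the MLM model
print(model["encoder.layer.0.attention.weight"])   # from the POS model
```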

POS-tagging
The BERT-based models are fine-tuned for POS-tagging with the UD datasets. The task-specific model consists of BERT's layers with an additional linear classification layer that yields predictions for each of the 16 possible POS tags. During training, the lexical layer of BERT is frozen, such that the fine-tuned Transformer layers rely on unchanged token representations from pre-training. The described model is trained with the Adam optimizer (Kingma and Ba, 2014) with β1 = 0.9, β2 = 0.999, ε = 1e−8 and a linearly decreasing learning rate starting at lr = 1e−5. Each model is trained until validation loss stops decreasing.
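The linearly decreasing learning-rate schedule can be written as a one-line function. Here `total_steps` is an illustrative assumption, since the paper trains until validation loss stops decreasing rather than for a fixed number of steps.

```python
# Sketch of a linearly decreasing learning-rate schedule, starting at lr0
# and reaching zero at total_steps (total_steps is an assumed value).

def linear_lr(step, lr0=1e-5, total_steps=10_000):
    return lr0 * max(0.0, 1.0 - step / total_steps)

print(linear_lr(0))       # initial rate: 1e-05
print(linear_lr(5_000))   # halfway: 5e-06
print(linear_lr(20_000))  # clamped at zero after total_steps
```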
Lexical layer retraining We retrain lexical layers for each BERT model using Gronings and West Frisian data. First, sub-word vocabularies of 10K tokens are created for Gronings and West Frisian using the WordPiece method (Devlin et al., 2019b), where each token occurs at least 100 times in the data. This vocabulary size is chosen conservatively, as we have limited data to train the lexical layer. Preliminary experiments with 30K tokens showed poor performance on the development data. The Gronings and West Frisian unlabeled documents are split into sequences of 128 tokens. Then, the models are trained with a masked language modeling objective where 15% of the input tokens are masked. The Adam optimizer is used with β1 = 0.9, β2 = 0.999, ε = 1e−8 and a linearly decreasing learning rate starting at lr = 1e−4. After retraining, we have three interchangeable lexical layers (original, Gronings, and West Frisian) for each base model.
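The masking step can be sketched as follows: for each 128-token sequence, 15% of positions are replaced by a [MASK] id, and their original tokens are kept as prediction targets. This simplified sketch omits BERT's refinement of keeping 10% of selected tokens unchanged and replacing 10% with random tokens; the mask id and token ids are illustrative.

```python
import random

# Hedged sketch of masked-language-modeling input corruption: 15% of
# positions are replaced by a [MASK] id. The id values are illustrative.

MASK_ID = 4  # assumed id for the [MASK] token

def mask_tokens(token_ids, mask_prob=0.15, seed=0):
    rng = random.Random(seed)
    n = max(1, round(len(token_ids) * mask_prob))
    positions = rng.sample(range(len(token_ids)), n)
    masked = list(token_ids)
    labels = {}
    for pos in positions:
        labels[pos] = masked[pos]   # the original token is the training target
        masked[pos] = MASK_ID
    return masked, labels

sequence = list(range(100, 228))    # a 128-token training sequence
masked, labels = mask_tokens(sequence)
print(len(labels))                  # 19 positions, i.e. 15% of 128
```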

Results and Discussion
We summarize our results in Table 2 (details per dataset in Appendix A). The monolingual language models perform poorly on Gronings and West Frisian POS-tagging when the original lexical layers are used, even though Gronings is quite similar to Dutch (see Figure 2). mBERT with its original lexical layer achieves better results than the monolingual models, but only West Frisian performance is comparable to the source language performance. Since West Frisian was included in mBERT pre-training, these results suggest that mBERT might serve languages included in pre-training well, whereas it may be less suitable for those not included (e.g., Gronings).
For all monolingual models, task performance greatly improves by retraining the lexical layer for Gronings and West Frisian (Figure 3a). Best results are obtained by (Dutch) BERTje fine-tuned on the Alpino dataset (92.4% for Gronings, 95.4% for West Frisian). In contrast, (English) BERT yields the worst performance. We find that performance scores and the linguistic distance from Gronings and West Frisian to the source languages ( Figure 2) strongly correlate (r = −0.85, p < 0.05). This suggests that measures of linguistic distance can guide the optimal choice of monolingual models to transfer to low-resource languages. Retraining mBERT's lexical layer also improves performance, especially for Gronings (Figure 3b), but with smaller gains than for monolingual models.
To estimate how our zero-shot approach compares with supervised learning, we train UDPipe (Straka et al., 2016) with five-fold cross-validation on the Gronings and West Frisian POS-tagging data. UDPipe achieves an accuracy of 91.85 (σ = 0.81) for Gronings and 90.60 (σ = 0.58) for West Frisian. These results do not reflect out-of-domain performance, since training and test data come from the same source. Moreover, the labeled data for Gronings comes from a corpus with a specific target audience (i.e., children). These results can therefore be seen as an upper bound. Our adapted models perform on par with this supervised baseline (Gronings) or better (West Frisian), with no need for labeled data in the target language.
Data size Our zero-shot transfer method relies on the availability of unlabeled Gronings and West Frisian data. Other low-resource languages may have even smaller amounts of data available than we have for West Frisian (59MB) and Gronings (43MB). We therefore assess how little data is sufficient for adequate performance by retraining the lexical layer with subsets of (independently randomly sampled) unlabeled data. Table 3 shows POS-tagging accuracies for each subset. Results are consistent across both target languages and show that ca. 10MB of data (1.9M tokens) is sufficient to achieve almost optimal performance for the monolingual models. By contrast, mBERT shows a steadier improvement with more data, suggesting that it might further improve if even more data is available than we have for Gronings and West Frisian. BERT's POS-tagging accuracy is very low compared to the other monolingual models and performance decreases with more data. These fluctuations suggest that the retrained lexical layer fits BERT poorly and it is unclear if using more data will impact performance positively.
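The subsampling step might look as follows: shuffle the corpus lines and accumulate them until a byte budget (e.g., 10MB) is reached. This is our assumption about the sampling procedure; the paper only states that the subsets are independently randomly sampled.

```python
import random

# Sketch of drawing a fixed-size subset of unlabeled text by byte budget.
# The corpus contents and budget below are illustrative.

def sample_subset(lines, max_bytes, seed=0):
    rng = random.Random(seed)
    shuffled = list(lines)
    rng.shuffle(shuffled)
    subset, size = [], 0
    for line in shuffled:
        n = len(line.encode("utf-8"))
        if size + n > max_bytes:
            break                    # budget reached
        subset.append(line)
        size += n
    return subset

corpus = ["zin %d" % i for i in range(1000)]   # stand-in for corpus lines
subset = sample_subset(corpus, max_bytes=200)
print(sum(len(s.encode("utf-8")) for s in subset) <= 200)  # True
```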

Conclusion
We adapted three monolingual BERT models and mBERT to two low-resource languages, Gronings and West Frisian, by retraining the lexical layers with new vocabularies. We found that the adaptability of mBERT is limited, suggesting that a model trained on a large number of languages may not facilitate transfer to low-resource languages. Instead, monolingual BERT models are transferable to languages with very little data if the source and target languages are relatively similar. In such cases, 10MB of unlabeled data, and no task-specific labeled data, is sufficient to achieve high (> 90% accuracy) downstream task performance.

Appendix A: Detailed Results

Table 4 shows results per adapted model per training dataset. Dutch POS-tagging accuracy is still relatively high after lexical layer replacement. Similarly, Table 5 shows the POS-tagging performance with subsets of the lexical layer retraining data per training dataset. Training on the Dutch Alpino dataset instead of LassySmall results in consistently higher performance for both Gronings and West Frisian.

Table 5: POS-tagging accuracy for Gronings and West Frisian with subsets of the unlabeled lexical layer retraining data. This is an extended version of Table 3 in the main paper, with accuracies separated by POS-tagging training dataset.