Model Selection for Cross-lingual Transfer

Transformers that are pre-trained on multilingual corpora, such as mBERT and XLM-RoBERTa, have achieved impressive cross-lingual transfer capabilities. In the zero-shot transfer setting, only English training data is used, and the fine-tuned model is evaluated on another target language. While this works surprisingly well, substantial variance has been observed in target language performance between different fine-tuning runs, and in the zero-shot setup, no target-language development data is available to select among multiple fine-tuned models. Prior work has relied on English dev data to select among models that are fine-tuned with different learning rates, numbers of steps, and other hyperparameters, often resulting in suboptimal choices. In this paper, we show that it is possible to select consistently better models when small amounts of annotated data are available in auxiliary pivot languages. We propose a machine learning approach to model selection that uses the fine-tuned model's own internal representations to predict its cross-lingual capabilities. In extensive experiments we find that this method consistently selects better models than English validation data across twenty-five languages (including eight low-resource languages), and often achieves results that are comparable to model selection using target language development data.


Introduction
Pre-trained Transformers (Vaswani et al., 2017; Devlin et al., 2019) have achieved state-of-the-art results on a range of NLP tasks, often approaching human inter-rater agreement (Joshi et al., 2020a). These models have also been demonstrated to learn effective cross-lingual representations, even without access to parallel text or bilingual lexicons (Wu and Dredze, 2019; Pires et al., 2019).
In zero-shot transfer learning, training and development data are assumed only in a high-resource source language (e.g., English), and performance is evaluated on another target language. Because no target language annotations are assumed, source language data is typically used to select among models that are fine-tuned with different hyperparameters and random seeds. However, recent work has shown that English dev accuracy does not always correlate well with target language performance (Keung et al., 2020).
In this paper, we propose an alternative strategy for model selection in zero-shot transfer. Our approach, dubbed Learned Model Selection (LMS), learns a function that scores the compatibility between a fine-tuned multilingual Transformer and a target language. The compatibility score is calculated based on features of the multilingual model's learned representations, which are obtained by aggregating representations over an unlabeled target language text corpus after fine-tuning on source language data. We show that these model-specific features effectively capture information about how the cross-lingual representations will transfer. We also make use of language embeddings from the lang2vec package (Malaviya et al., 2017), which have been shown to encode typological information that may help inform how a multilingual model will transfer to a particular target. These model and language features are combined in a bilinear layer to compute a ranking on the fine-tuned models. Parameters of the ranking function are optimized to minimize a pairwise loss on a set of held-out models, using one or more auxiliary pivot languages. Our method assumes training data in English, in addition to small amounts of auxiliary language data. This corresponds to a scenario where the multilingual model needs to be quickly adapted to a new language. LMS does not rely on any annotated data in the target language, yet it is effective in learning to predict how well fine-tuned representations will transfer.
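As a concrete illustration of the feature aggregation described above, the following is a minimal sketch of computing a model-specific feature vector by mean-pooling a fine-tuned encoder's hidden states over unlabeled text. The pooling choice (mean over tokens and sentences, final layer) and the function name are illustrative assumptions, not the exact configuration used in our experiments.

```python
import torch
from transformers import AutoModel, AutoTokenizer

def model_features(checkpoint, sentences, device="cpu"):
    """Aggregate a fine-tuned encoder's hidden states over unlabeled sentences."""
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    encoder = AutoModel.from_pretrained(checkpoint).to(device).eval()
    pooled = []
    with torch.no_grad():
        for sent in sentences:
            enc = tokenizer(sent, return_tensors="pt", truncation=True).to(device)
            hidden = encoder(**enc).last_hidden_state      # (1, seq_len, hidden_dim)
            pooled.append(hidden.mean(dim=1).squeeze(0))   # mean over tokens
    return torch.stack(pooled).mean(dim=0)                 # mean over sentences

# g_m = model_features("path/to/fine-tuned-mbert", unlabeled_target_sentences)
```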
In experiments on twenty-five languages, LMS consistently selects models with better target-language performance than those chosen using English dev data. Furthermore, our proposed approach improves performance on low-resource languages such as Quechua, Maori and Turkmen that are not included in the pretraining corpus (§6.1).

Background: Cross-Lingual Transfer Learning
The zero-shot setting considered in this paper works as follows. A Transformer model is first pre-trained using a standard masked language model objective. The only difference from the monolingual approach to contextual word representations (Peters et al., 2018; Devlin et al., 2019) is the pre-training corpus, which contains text written in multiple languages; for example, mBERT is trained on Wikipedia data from 104 languages.
After pre-training, the resulting network encodes language-independent representations that support surprisingly effective cross-lingual transfer, simply by fine-tuning with English data. For example, after fine-tuning mBERT using the English portion of the CoNLL Named Entity Recognition dataset, the resulting model can perform inference directly on Spanish text, achieving an F1 score of around 75 and outperforming prior work using cross-lingual word embeddings (Xie et al., 2018; Mikolov et al., 2013). A challenge, however, is the relatively high variance across multiple training runs. Although mean F1 on Spanish is 75, the performance of 60 fine-tuned models with different learning rates and random seeds ranges from around 70 to 78 F1. In zero-shot learning, no validation/development data is available in the target language, motivating the need for a machine learning approach to model selection.

Ranking Model Compatibility with a Target Language
Given a set of multilingual BERT-based models M = {m_1, m_2, ..., m_n} that are fine-tuned on an English training set using different hyperparameters and random seeds, our goal is to select the model that performs best on a specific target language, l_target. Our approach (LMS) learns to rank a set of models based on two sources of information: (1) the models' own internal representations, and (2) lang2vec representations of the target language (Malaviya et al., 2017). We adopt a pairwise approach to learning to rank (Burges et al., 2005; Köppel et al., 2019). The learned ranking is computed using a scoring function, s(m, l) = f(g_mBERT(m), g_lang2vec(l)), where g_mBERT(m) is a feature vector for model m, computed from the model's own hidden-state representations, and g_lang2vec(l) is the lang2vec representation of language l. The model and language features are each passed through a feed-forward neural network and then combined using a bilinear layer to calculate a final score:

s(m, l) = FF_m(g_mBERT(m))^T W FF_l(g_lang2vec(l))

Using the above score, the probability that model m_i performs better than m_j on language l is

P(m_i ≻_l m_j) = σ(s(m_i, l) − s(m_j, l)),

where σ(·) is the sigmoid function. To tune the parameters of the scoring function, which include the feed-forward and bilinear layers, we minimize the cross-entropy loss:

− Σ_l Σ_{i ≠ j} [ ȳ_{ij}^l log P(m_i ≻_l m_j) + (1 − ȳ_{ij}^l) log(1 − P(m_i ≻_l m_j)) ]    (1)

where ȳ_{ij}^l = 1[m_i ≻_l m_j] is an indicator function that takes the value 1 if m_i outperforms m_j, as evaluated using labeled development data in language l. The first sum in Equation 1 ranges over all languages where development data is available (this excludes the target language). After tuning parameters to minimize the cross-entropy loss on these languages, models are ranked based on their scores for the target language, and the highest-scoring model, m* = arg max_m s(m, l_target), is selected.
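To make the scoring and ranking objective concrete, below is a minimal PyTorch sketch of the bilinear scoring function and the pairwise cross-entropy loss defined above. Layer sizes, the depth of the feed-forward networks, and all names are illustrative assumptions rather than the exact architecture used in our experiments.

```python
import torch
import torch.nn as nn

class LMSScorer(nn.Module):
    """Scores the compatibility s(m, l) between a fine-tuned model and a language."""
    def __init__(self, model_feat_dim, lang_feat_dim, hidden_dim=128):
        super().__init__()
        self.ff_model = nn.Sequential(nn.Linear(model_feat_dim, hidden_dim), nn.ReLU())
        self.ff_lang = nn.Sequential(nn.Linear(lang_feat_dim, hidden_dim), nn.ReLU())
        self.bilinear = nn.Bilinear(hidden_dim, hidden_dim, 1)

    def forward(self, model_feats, lang_feats):
        # model_feats: (batch, model_feat_dim), lang_feats: (batch, lang_feat_dim)
        return self.bilinear(self.ff_model(model_feats), self.ff_lang(lang_feats)).squeeze(-1)

def pairwise_loss(scorer, feats_i, feats_j, lang_feats, y):
    """Cross-entropy over P(m_i beats m_j on l); y = 1.0 when m_i outperforms m_j."""
    p_ij = torch.sigmoid(scorer(feats_i, lang_feats) - scorer(feats_j, lang_feats))
    return nn.functional.binary_cross_entropy(p_ij, y)
```

At test time, each candidate model's features are scored against the target language's lang2vec features, and the model with the highest score is selected.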

Tasks and Datasets
We perform model selection experiments on five well-studied NLP tasks in the zero-shot transfer setting: part-of-speech (POS) tagging, question answering (QA), relation extraction (RE), event-argument role labeling (ARL), and named entity recognition (NER). In total, we cover 25 target languages, including 8 low-resource languages, following prior work (shown in Table 1). We adopt the best-performing model from Soares et al. (2019), [ENTITY MARKERS - ENTITY START], for RE and ARL. For other tasks, we use established task-specific layers and evaluation protocols, following the references in Table 1. Labeled training data for each task is assumed in English, and trained models are evaluated on each target language.

Low-resource Languages
To evaluate LMS on truly low-resource languages, we use the 8 target languages (summarized in Table 2) from Pfeiffer et al. (2020) and Xia et al. (2021), which use the WikiAnn NER dataset (Pan et al., 2017). These languages are considered low-resource because: 1) their Wikipedia sizes range from 4k to 22k; and 2) they are not covered by pre-trained multilingual models (i.e., by mBERT and XLM-RoBERTa). The train, development, and test partitions of Rahimi et al. (2019) are used, following the XTREME benchmark's NER setup (Hu et al., 2020). The related language used for the Pivot-Dev baseline is chosen following Xia et al. (2021), which is based on LangRank (Lin et al., 2019).

Experimental Design
For a multilingual NLP task with n languages L = {l_1, ..., l_n}, our goal is to select the model that performs best on a new target language, l_target ∈ L. We assume the available resources are English training and development data, in addition to a small development set for each of the pivot languages in L. First, a set of mBERT models, M, are fine-tuned on an English training set using different hyperparameters and random seeds, and shuffled into meta-train/dev/test sets. We then evaluate each model, m_i, on the pivot languages' dev sets to calculate a set of gold rankings, ≻_l, that are used in the cross-entropy loss (Equation 1). Model-specific features are extracted from the fine-tuned mBERTs by feeding unlabeled pivot language text as input.

Development and Evaluation. mBERT models in the meta-dev set are used to experiment with different model and language features. Evaluation is performed using models in the meta-test set. We use a leave-one-language-out setup for each task during evaluation. For each target language, we rank models using the learned scoring function, select the highest-scoring model, and report results in Table 3.
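The gold rankings used by the loss above can be read off from pivot-language dev scores as ordered pairs. The sketch below assumes dev_scores maps each pivot language code to a list of per-model dev metrics; both the data structure and the function name are hypothetical placeholders for the evaluation pipeline.

```python
from itertools import combinations

def build_training_pairs(dev_scores):
    """Turn per-model dev scores on each pivot language into pairwise ranking labels."""
    pairs = []  # (language, i, j, label): label = 1.0 if model i beats model j on language
    for lang, scores in dev_scores.items():
        for i, j in combinations(range(len(scores)), 2):
            if scores[i] == scores[j]:
                continue  # ties carry no ranking signal
            pairs.append((lang, i, j, 1.0 if scores[i] > scores[j] else 0.0))
    return pairs
```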

Baselines and Oracles
En-Dev is our main baseline, following standard practice for model selection in zero-shot transfer learning (Wu and Dredze, 2019; Pires et al., 2019). Because our approach assumes additional development data in auxiliary languages, we also include a baseline that uses pivot-language dev data. In addition, we compare against an oracle that selects models using 100 annotated sentences from the target language dev set, to examine how our approach compares with the more costly alternative of annotating small amounts of target language data. Finally, we include an oracle that simply picks the best model using the full target language development set (All-Target). All baselines and oracles are summarized below:

• En-Dev (baseline): chooses the fine-tuned mBERT with the best performance on the English dev set.
• Pivot-Dev (baseline): chooses the fine-tuned mBERT with the best performance on development data in the most similar pivot language, where similarity to the target language is measured using lang2vec embeddings (a minimal sketch of this selection follows the list).
• 100-Target (oracle): chooses the fine-tuned mBERT with the best performance on 100 labeled target language instances.
• All-Target (oracle): chooses the fine-tuned mBERT using the full target language dev set.
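The following is a minimal sketch of the Pivot-Dev baseline referenced above: pick the pivot language whose language embedding is closest to the target, then choose the model with the best dev score on that pivot. It assumes precomputed lang2vec vectors in a dictionary (lang_vecs) and per-language dev scores (dev_scores); both names are placeholders.

```python
import numpy as np

def pivot_dev_select(target, pivots, lang_vecs, dev_scores):
    """Select a model using dev data from the pivot language most similar to the target."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    nearest = max(pivots, key=lambda p: cosine(lang_vecs[p], lang_vecs[target]))
    return int(np.argmax(dev_scores[nearest]))  # index of the best model on that pivot
```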

Evaluation
Below we report model selection results on mBERTs in the meta-test set for each of the five tasks.
As shown in Table 3, our method selects models with a higher F1 score than En-Dev. Moreover, it outperforms model selection using small amounts of target-language annotations (100-Target) on Dutch (nl) and German (de), and selects a model that performs as well on Spanish (es). On average, LMS achieves a 1.6-point increase in F1 relative to Pivot-Dev. We use Wu and Dredze (2019) as a reference for zero-shot cross-lingual transfer with mBERT.

Model Score Distributions. Figure 2 visualizes the En-Dev and LMS results on the test set in the context of the score distributions of the 60 models in the meta-test set, using kernel density estimation. English development data tends to select models that perform only slightly better than average, whereas LMS does significantly better.

Evaluation on low-resource languages
We present NER results for the low-resource languages in the bottom section of Table 3, where we use 40 pivot languages from the XTREME benchmark (Hu et al., 2020) to train LMS and test on 8 target languages. LMS outperforms the En-Dev and Pivot-Dev baselines, with average gains of 4.7 and 3.3 F1, respectively. Because this setting targets truly low-resource languages for which lang2vec features may not be available, the scoring function directly predicts a score from the representations of unlabeled target language text. We use the mBERT zero-shot cross-lingual transfer results from Pfeiffer et al. (2020) as references.

Evaluation on multilingual fine-tuned models
An interesting question is whether fine-tuning on the available development data in the auxiliary languages can improve performance. Since our method assumes access to small amounts of labeled data in a set of pivot languages, we experiment with multilingual fine-tuning and show that LMS is still beneficial for selecting among models that are fine-tuned on both English and pivot language data. Part-of-speech tagging experiments are presented in Table 4, where all mBERT models are fine-tuned on English and the development sets of five pivot languages (ar, de, es, nl, zh). A single LMS is trained on English fine-tuned models, with gold rankings computed on the pivot languages; we then directly apply this LMS to select among the multilingually fine-tuned mBERT models. We find that LMS outperforms the En-Dev baseline on seven out of the ten target languages used in our evaluation, with an average gain of 0.1 points of accuracy. This also demonstrates that an LMS trained on English fine-tuned representations generalizes to multilingual fine-tuning. We use models that are fine-tuned only on English data, from Wu and Dredze (2019), as references, and find that multilingual fine-tuning yields better cross-lingual transfer performance than fine-tuning on English data alone.

Analysis
In Section 6, we empirically demonstrated that our learned scoring function, s(·), consistently selects better models than the standard approach (En-Dev), and is comparable to using small amounts of labeled target language data. Section 7.1 presents additional analysis of our approach, exploring the impact of various modeling choices using the pivot languages {ar, de, es, nl, zh}. In addition, analyses of generalization beyond mBERT and across tasks are presented in Appendices A.3 and A.4. First, we examine the choice of model-specific features by averaging performance across both language embeddings. Table 5 reports averaged evaluation metrics for each model-specific representation across all target languages, with En-Dev as a baseline.
Averaged evaluation metrics across all target languages for each language embedding are reported in Table 6. In addition to evaluating the effectiveness of each language embedding, we also experimented with a variant of our scoring function that does not include any language embeddings as input. Results are reported on mBERT models in the meta-dev set and the target languages' dev sets for all experiments in this section.
In Table 5, [PIVOT] features achieve top-2 performance in all five tasks.
[Eng] and [Target] achieve mixed results, and the fusion of the three features does not effectively combine the advantages of each representation, except in the case of ARL. Table 6 shows that lang2vec outperforms syntax on all tasks but ARL, and also outperforms the variant of our approach that does not include language embeddings. Thus, lang2vec and [PIVOT] are used for all experiments in Section 6.

Table 6: Language embedding analysis across lang2vec, syntax, and no language embedding. We use mBERT models in the meta-dev set for analysis. Each number is the average score across all target languages for a particular task.

Related Work
Recent work has explored hyper-parameter optimization (Klein et al., 2019) and model selection for a new task. task2vec (Achille et al., 2019) presents a meta-learning approach to selecting a pre-trained feature extractor from a library for a new visual task. More concretely, task2vec represents tasks in a vector space and is capable of predicting task similarities and taxonomic relations. It encodes a new task and selects the best feature extractor trained on the most similar task. Unlike task2vec, we select a trained model for a specific task, and we represent a trained model with model-specific features on a target language. MAML (Finn et al., 2017; Rajeswaran et al., 2019) is another approach to meta-learning, pre-training a single model with a meta-loss to initialize a set of parameters that can be quickly fine-tuned for related tasks. Nooralahzadeh et al. (2020) explore the use of MAML in the cross-lingual transfer setting. MAML is designed to support few-shot learning through better initialization of model parameters and does not address the problem of model selection. In contrast, our approach improves model selection in the zero-shot cross-lingual transfer setting.
Most relevant to our work, Xia et al. (2020) use regression methods to predict a model's performance on an NLP task. They formulate this as a regression problem based on features of the task (dataset size, average sentence length, etc.), incorporating a discrete feature to represent the choice of model. In contrast, LMS inspects a model's internal representations, making it suitable for predicting which of a set of fine-tuned models will best transfer to a target language. Also relevant is prior work on learning to select the best language to transfer from (Lin et al., 2019). There is a need for more NLP research on low-resource languages (Joshi et al., 2020b). Lauscher et al. (2020) present a number of challenges in transferring to languages with few resources using pre-trained Transformers. Our experiments cover a set of 8 truly low-resource languages following prior work (Pfeiffer et al., 2020; Xia et al., 2021), as well as a fairly diverse set of languages, including Arabic and Chinese. We believe there is still a need for more research on multilingual NLP for high-resource languages as well, as this is not a solved problem. Finally, we note that there are several other prominent benchmarks for evaluating cross-lingual transfer, including XTREME (Hu et al., 2020) and XGLUE (Liang et al., 2020), both of which include some datasets used in this work.

Conclusion
In this paper, we presented a machine learning approach to model selection for zero-shot cross-lingual transfer, which is appropriate when small amounts of development data are available in one or more pivot languages, but not in the target language. We showed that our approach improves over the standard practice of model selection using source language development data. Experiments on five well-studied NLP tasks show that by inspecting internal representations, our method consistently selects better models. LMS also achieves comparable results to the more expensive alternative of annotating small amounts of target-language development data. Finally, we demonstrated that LMS selects better models for low-resource languages, such as Quechua and Maori, that are not included during pretraining.
Acknowledgements
This work was supported in part by the NSF (IIS-1845670) and IARPA via the BETTER program (2019-19051600004). The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for government purposes notwithstanding any copyright annotation therein.

A.1 Data for Relation Extraction and Argument Role Labeling
In this section, we describe the datasets used for RE and ARL. Table 7 reports dataset statistics and Table 8 summarizes references and baseline results. We create a dataset using the ACE2005 corpus (Walker et al., 2006), which more closely replicates the setting a model will face in a real-world information extraction scenario. First, we shuffle documents into 80%/10%/10% train/dev/test splits, then extract candidate entity pairs from each document. For RE, the first approach in Ye et al. (2019) is adopted to extract negative instances. Negative instances whose entity-type combination never appears as a positive example in the training data are filtered out. For ARL, we create negative instances by pairing each trigger with every entity in a sentence. Details of the two datasets are summarized in Table 7.
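As an illustration of the ARL negative-instance construction described above, the sketch below pairs each trigger with every entity in a sentence and labels unannotated pairs with a null role. The data structures (sentence.triggers, sentence.entities, gold_args) are hypothetical placeholders for an ACE2005 preprocessing pipeline.

```python
def arl_instances(sentence, gold_args):
    """Pair every trigger with every entity; pairs without a gold role become negatives."""
    instances = []
    for trigger in sentence.triggers:
        for entity in sentence.entities:
            role = gold_args.get((trigger.id, entity.id), "NO_ROLE")
            instances.append((trigger, entity, role))
    return instances
```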
As a baseline for the dataset, we reimplement the Graph Convolutional Network (GCN) model of Subburathinam et al. (2019).

A.2 Variance across Different Meta-train/dev/test Splits is Relatively Low
In this section, we present a statistical analysis of model selection results for POS and QA across different meta-train/dev/test splits. Table 9 shows that LMS improves by an average of 0.74 points relative to En-Dev and 0.44 points relative to Pivot-Dev. We found that the variance in end-task performance across different meta-train/dev/test splits is relatively low. We train a single LMS with pivot languages {ar, de, es, nl, zh} for POS and {ar, de, es, zh} for QA, and test it on all target languages. All results are reported as the mean and standard deviation over five runs (different meta-train/dev/test splits). A Z-test is performed on the differences between LMS/Pivot-Dev and En-Dev. LMS is statistically significantly better (p ≤ 0.05) than the En-Dev baseline across all languages and both tasks, while Pivot-Dev fails to reach significance in three languages. LMS also obtains a lower standard deviation in model scores, except for Swedish (sv) and Vietnamese (vi).
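For reference, the significance test described above can be computed as in the following sketch: a two-sample Z statistic over the per-split means and standard deviations of two selection methods. The exact test configuration (one-sided, n = 5 runs) is an assumption based on the description above.

```python
import math
from scipy.stats import norm

def z_test(mean_a, std_a, mean_b, std_b, n=5):
    """One-sided Z-test for method A scoring higher than method B over n runs."""
    z = (mean_a - mean_b) / math.sqrt(std_a ** 2 / n + std_b ** 2 / n)
    return z, 1.0 - norm.cdf(z)  # (statistic, one-sided p-value)
```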

A.3 Does this Approach Generalize to XLM-RoBERTa?
In Section 6, we showed that our approach consistently selects better fine-tuned models than those chosen using English development data. To test the robustness of our approach with a different multilingual pre-trained Transformer, we re-train and evaluate using XLM-RoBERTa-base (Conneau et al., 2019), with the same settings used for mBERT in Section 6 for RE and ARL.

RE
In the left section of Table 10, our approach selects a model with a higher F1 score than En-Dev on Chinese, and one on par with En-Dev on Arabic.

ARL
In the right section of Table 10, our approach selects a model with a higher F1 score than En-Dev on Arabic, but performs worse on Chinese (where En-Dev even outperforms All-Target). Overall, our approach appears to be effective when used with XLM-RoBERTa.

A.4 Can Multi-task Learning Help?
Our setting does not assume access to labeled data in the target language for a particular task. However, labeled data in the target language may be available for a relevant auxiliary task, which could help the scoring function learn to better estimate whether a model is a good match for the target language.
To test whether an auxiliary task in the target language might help to select a better model for the target task, we train LMS on two tasks: RE and ARL. Gold rankings over the models are computed for each language using the pivot languages' dev sets. In addition, a "silver" ranking is computed for each language using the auxiliary task. The scoring function is then trained to rank mBERT models for both tasks. To differentiate the two tasks, we adopt a variant of the scoring function, s(m, l, t), which concatenates a randomly initialized task embedding with the language embedding.
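A minimal sketch of this multi-task variant is shown below: a learned task embedding is concatenated with the language embedding before the language-side feed-forward layer. Embedding sizes and layer widths are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiTaskLMSScorer(nn.Module):
    """Scores s(m, l, t) by conditioning the language features on a task embedding."""
    def __init__(self, model_feat_dim, lang_feat_dim, n_tasks, task_dim=16, hidden_dim=128):
        super().__init__()
        self.task_emb = nn.Embedding(n_tasks, task_dim)
        self.ff_model = nn.Sequential(nn.Linear(model_feat_dim, hidden_dim), nn.ReLU())
        self.ff_lang = nn.Sequential(nn.Linear(lang_feat_dim + task_dim, hidden_dim), nn.ReLU())
        self.bilinear = nn.Bilinear(hidden_dim, hidden_dim, 1)

    def forward(self, model_feats, lang_feats, task_ids):
        lang_task = torch.cat([lang_feats, self.task_emb(task_ids)], dim=-1)
        return self.bilinear(self.ff_model(model_feats), self.ff_lang(lang_task)).squeeze(-1)
```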
In Table 11, our approach selects a model with a higher F1 score for RE. However, multi-task training does not benefit ARL, although it still outperforms En-Dev. As a future direction, we believe an LMS trained on an auxiliary dataset could be transferred to the target dataset, removing the need for small amounts of pivot-language development data in the target dataset.

Table 11: Multi-task analysis using additional training data in the target language from another task. ([Pivot], lang2vec): baseline trained on single-task data. Model selection is based on the highest score for the target language and target task: arg max_m s(m, l_target, t_target).