A Closer Look at Few-Shot Crosslingual Transfer: The Choice of Shots Matters

Few-shot crosslingual transfer has been shown to outperform its zero-shot counterpart with pretrained encoders like multilingual BERT. Despite its growing popularity, little to no attention has been paid to standardizing and analyzing the design of few-shot experiments. In this work, we highlight a fundamental risk posed by this shortcoming, illustrating that the model exhibits a high degree of sensitivity to the selection of few shots. We conduct a large-scale experimental study on 40 sets of sampled few shots for six diverse NLP tasks across up to 40 languages. We provide an analysis of success and failure cases of few-shot transfer, which highlights the role of lexical features. Additionally, we show that a straightforward full model finetuning approach is quite effective for few-shot transfer, outperforming several state-of-the-art few-shot approaches. As a step towards standardizing few-shot crosslingual experimental designs, we make our sampled few shots publicly available.

A widely explored transfer scenario is zero-shot crosslingual transfer (Pires et al., 2019; Conneau and Lample, 2019; Artetxe and Schwenk, 2019), where a pretrained encoder is finetuned on abundant task data in the source language (e.g., English) and then directly evaluated on target-language test data, achieving surprisingly good performance (Wu and Dredze, 2019; Hu et al., 2020). However, there is evidence that zero-shot performance reported in the literature has large variance and is often not reproducible (Keung et al., 2020a; Rios et al., 2020); moreover, the results in languages distant from English fall far short of those in languages similar to English (Hu et al., 2020; Liang et al., 2020). Lauscher et al. (2020) stress the importance of few-shot crosslingual transfer instead, where the encoder is first finetuned on a source language and then further finetuned with a small number (10-100) of examples (few shots) of the target language. The few shots substantially improve model performance on the target language with negligible annotation costs (Garrette and Baldridge, 2013; Hedderich et al., 2020).
Code and resources are available at https://github.com/fsxlt.
In this work, however, we demonstrate that the gains from few-shot transfer exhibit a high degree of sensitivity to the selection of few shots. For example, different choices for the few shots can yield a performance variance of over 10% accuracy in a standard document classification task. Motivated by this, we propose to fix the few shots for fair comparisons between different crosslingual transfer methods, and provide a benchmark resembling the standard "N-way K-shot" few-shot learning configuration (Fei-Fei et al., 2006; Koch et al., 2015). We also evaluate and compare several state-of-the-art (SotA) few-shot finetuning techniques, in order to understand their performance and susceptibility to the variance related to few shots.
We also demonstrate that the effectiveness of few-shot crosslingual transfer depends on the type of downstream task. For syntactic tasks such as named-entity recognition, the few shots can improve results by up to ≈20 F1 points. For challenging tasks like adversarial paraphrase identification, the few shots do not help and even sometimes lead to worse performance than zero-shot transfer. To understand these phenomena, we conduct additional in-depth analyses, and find that the models tend to utilize shallow lexical hints (Geirhos et al., 2020) in the target language, rather than leveraging abstract crosslingual semantic features learned from the source language.
Our contributions: 1) We show that few-shot crosslingual transfer is prone to large variations in task performance; this property hinders unbiased assessments of the effectiveness of different few-shot methods. 2) To remedy this issue, we publish fixed and standardized few shots to support fair comparisons and reproducibility. 3) We empirically verify that few-shot crosslingual transfer has different performance impact on structurally different tasks; we provide in-depth analyses concerning the source of performance gains. 4) We analyze several SotA few-shot learning methods, and show that they underperform simple full model finetuning. We hope that our work will shed new light on the potential and current difficulties of few-shot learning in crosslingual setups.
Recently, Lauscher et al. (2020) and Hedderich et al. (2020) extended the focus on few-shot crosslingual transfer (FS-XLT): They assume the availability of a handful of labeled examples in a target language, which are used to further finetune a source-trained model. The extra few shots bring large performance gains at low annotation cost. In this work, we systematically analyze this recent FS-XLT scenario.
FS-XLT resembles the intermediate-task transfer (STILT) approach (Phang et al., 2018; Pruksachatkun et al., 2020). In STILT, a pretrained encoder is finetuned on a resource-rich intermediate task, and then finetuned on a (resource-lean) target task. Likewise, FS-XLT focuses on transferring knowledge and general linguistic intelligence (Yogatama et al., 2019), although such transfer is between languages in the same task instead of between different tasks.
Few-shot learning was first explored in computer vision (Miller et al., 2000; Fei-Fei et al., 2006; Koch et al., 2015); the aim there is to learn new concepts with only a few images. Methods like prototypical networks (Snell et al., 2017) and model-agnostic meta-learning (MAML; Finn et al. (2017)) have also been applied to many monolingual (typically English) NLP tasks such as relation classification (Han et al., 2018; Gao et al., 2019), named-entity recognition (Hou et al., 2020a), word sense disambiguation (Holla et al., 2020), and text classification (Yu et al., 2018; Yin, 2020; Bansal et al., 2020; Gupta et al., 2020). However, recent few-shot learning methods in computer vision consisting of two simple finetuning stages, first on base-class images and then on new-class few shots, have been shown to outperform MAML and achieve SotA scores (Wang et al., 2020; Chen et al., 2020; Tian et al., 2020; Dhillon et al., 2020). Inspired by this work, we compare various few-shot finetuning methods from computer vision in the context of FS-XLT.
Task Performance Variance. Deep neural networks' performance on NLP tasks is bound to exhibit large variance. Reimers and Gurevych (2017) and Dror et al. (2019) stress the importance of reporting score distributions instead of a single score for fair(er) comparisons. Dodge et al. (2020), Mosbach et al. (2021), and Zhang et al. (2021) show that finetuning pretrained encoders with different random seeds yields performance with large variance. In this work, we examine a specific source of variance: We show that the choice of the few shots in crosslingual transfer learning also introduces large variance in performance; consequently, we offer standardized few shots for more controlled and fair comparisons.

Method
Following Lauscher et al. (2020) and Hedderich et al. (2020), our FS-XLT method comprises two stages. First, we conduct source-training: The pretrained mBERT is finetuned with abundant annotated data in the source language. Similar to Hu et al. (2020) and Liang et al. (2020), and due to the abundant labeled data available for many NLP tasks, we choose English as the source in our experiments. Directly evaluating the source-trained model after this stage corresponds to the widely studied ZS-XLT scenario. The second stage is target-adapting: The source-trained model from the previous stage is adapted to a target language using few shots. We discuss details of sampling the few shots in §4. The development set of the target language is used for model selection in this stage.

Experimental Setup
We consider three types of tasks requiring varying degrees of semantic and syntactic knowledge transfer: sequence classification (CLS), named-entity recognition (NER), and part-of-speech tagging (POS) in up to 40 typologically diverse languages (cf. Appendix §B).

Datasets and Selection of Few Shots
For the CLS tasks, we sample few shots from four multilingual datasets: news article classification (MLDoc; Schwenk and Li (2018)), Amazon review classification (MARC), natural language inference (XNLI), and adversarial paraphrase identification (PAWSX). We use treebanks in Universal Dependencies (Nivre et al., 2020) for POS, and the WikiANN dataset (Pan et al., 2017; Rahimi et al., 2019) for NER. Table 1 reports key information about the datasets. We adopt the conventional few-shot sampling strategy (Fei-Fei et al., 2006; Koch et al., 2015; Snell et al., 2017), and conduct "N-way K-shot" sampling from the datasets; N is the number of classes and K refers to the number of shots per class. A group of N-way K-shot data is referred to as a bucket. We set N equal to the number of labels |T|. Following Wang et al. (2020), we sample 40 buckets for each target (i.e., non-English) language of a task to get a reliable estimation of model performance.
CLS Tasks. For MLDoc and MARC, each language has a train/dev/test split. We sample the buckets without replacement from the training set of each target language, so that buckets are disjoint from each other. Target languages in XNLI and PAWSX only have dev/test splits. We sample the buckets from the dev set; the remaining data serves as a single new dev set for model selection during target-adapting. For all tasks, we use K ∈ {1, 2, 4, 8}.
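The disjoint bucket sampling for CLS tasks can be sketched as follows. This is a minimal illustration rather than the released sampling code: the function name `sample_disjoint_buckets`, the `(text, label)` input format, and the toy data are assumptions made for the example.

```python
import random
from collections import defaultdict


def sample_disjoint_buckets(examples, k, num_buckets, seed=0):
    """Sample `num_buckets` disjoint N-way K-shot buckets.

    `examples` is a list of (text, label) pairs (an assumed format);
    N is the number of distinct labels. Sampling is without replacement,
    so no example appears in more than one bucket.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))

    # Each label must supply k examples to every bucket.
    for label, pool in by_label.items():
        if len(pool) < k * num_buckets:
            raise ValueError(f"label {label!r} has too few examples")
        rng.shuffle(pool)

    buckets = []
    for b in range(num_buckets):
        bucket = []
        for label in sorted(by_label):
            # Consecutive, non-overlapping slices keep the buckets disjoint.
            bucket.extend(by_label[label][b * k:(b + 1) * k])
        buckets.append(bucket)
    return buckets


# Hypothetical toy data: 4 classes with 100 examples each.
toy = [(f"doc-{c}-{i}", c) for c in "ABCD" for i in range(100)]
buckets = sample_disjoint_buckets(toy, k=2, num_buckets=40)
```

Shuffling each label pool once and slicing it is one simple way to guarantee disjointness; repeated sampling with rejection would work as well.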
POS and NER. For the two structured prediction tasks, "N-way K-shot" is not well-defined because each sentence contains one or more labeled tokens. We use a similar sampling principle as with CLS, where N is the size of the label set for each language and task, but K is set to the minimum number of occurrences for each label. In particular, we utilize the Minimum-Including Algorithm (Hou et al., 2020b,a) to satisfy the following criteria when sampling a bucket: 1) each label appears at least K times, and 2) at least one label will appear fewer than K times if any sentence is removed from the bucket. Appendix §C gives sampling details. In contrast to sampling for CLS, we do not enforce samples from different buckets to be disjoint due to the small amount of data in some low-resource languages. We only use K ∈ {1, 2, 4} and exclude K = 8, as 8-shot buckets already have lots of labeled tokens, and thus (arguably) might not be considered few-shot.
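The two bucket criteria can be met with a greedy add-then-prune procedure, sketched below. This is one possible reading of the criteria, not necessarily the Minimum-Including Algorithm as specified in Appendix §C; the function names and the toy corpus are hypothetical.

```python
import random
from collections import Counter


def label_counts(tag_sequences):
    """Count label occurrences over a list of per-sentence tag lists."""
    counts = Counter()
    for tags in tag_sequences:
        counts.update(tags)
    return counts


def sample_min_including_bucket(corpus, labels, k, seed=0):
    """Return indices of sentences forming one bucket.

    `corpus` is a list of tag sequences (one list of labels per sentence).
    The result satisfies: 1) every label occurs at least k times, and
    2) removing any single sentence drops some label below k.
    """
    rng = random.Random(seed)
    order = list(range(len(corpus)))
    rng.shuffle(order)

    chosen, counts = [], Counter()
    # Step 1: add sentences that contain a label still below k.
    for idx in order:
        if all(counts[label] >= k for label in labels):
            break
        if any(counts[tag] < k for tag in corpus[idx] if tag in labels):
            chosen.append(idx)
            counts.update(corpus[idx])
    if any(counts[label] < k for label in labels):
        raise ValueError("corpus cannot satisfy k occurrences per label")

    # Step 2: prune sentences whose removal keeps every label at >= k.
    for idx in list(chosen):
        sentence = Counter(corpus[idx])
        if all(counts[label] - sentence[label] >= k for label in labels):
            chosen.remove(idx)
            counts.subtract(sentence)
    return chosen


# Hypothetical toy POS corpus (tags only).
corpus = [["NOUN", "VERB"], ["NOUN", "ADJ"], ["VERB"], ["ADJ", "NOUN", "VERB"],
          ["NOUN"], ["ADJ", "ADJ"], ["VERB", "NOUN"]]
labels = ["NOUN", "VERB", "ADJ"]
chosen = sample_min_including_bucket(corpus, labels, k=2)
```

A single pruning pass suffices here because label counts only decrease during pruning, so a sentence that could not be removed earlier cannot become removable later.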
For source-training, we finetune the pretrained encoder for 10 epochs with batch size 32. For target-adapting to every target language, the few-shot data is a sampled bucket in this language, and we finetune on the bucket for 50 epochs with early-stopping of 10 epochs. The batch size is set to the number of shots in the bucket. Each target-adapting experiment is repeated 40 times using the 40 buckets. We use the Adam optimizer (Kingma and Ba, 2015) with default parameters in both stages with learning rates searched over {1e-5, 3e-5, 5e-5, 7e-5}. For CLS tasks, we use mBERT's [CLS] token as the final representation. For NER and POS, following Devlin et al. (2019), we use a linear classifier layer on top of the representation of each tokenized word, which is its last wordpiece (He and Choi, 2020). We set the maximum sequence length to 128 after wordpiece tokenization (Wu et al., 2016) in all experiments. Further implementation details are shown in our Reproducibility Checklist in Appendix §A.
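The target-adapting schedule (at most 50 epochs, early stopping, model selection on the target dev set) might look like the following sketch. It assumes that "early-stopping of 10 epochs" means a patience of 10 epochs without dev improvement; `train_one_epoch` and `evaluate_dev` are hypothetical callbacks standing in for the actual training and evaluation code.

```python
def target_adapt(train_one_epoch, evaluate_dev, max_epochs=50, patience=10):
    """Run target-adapting with early stopping on the dev score.

    Returns (best_epoch, best_score, last_epoch_run).
    """
    best_score, best_epoch = float("-inf"), 0
    epoch = 0
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        score = evaluate_dev()
        if score > best_score:
            # New best checkpoint; a real run would save the model here.
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs (assumed interpretation).
            break
    return best_epoch, best_score, epoch


# Hypothetical dev curve that peaks right after the first epoch.
dev_curve = iter([0.80] + [0.70] * 49)
best_epoch, best_score, last_epoch = target_adapt(
    lambda: None, lambda: next(dev_curve)
)
```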

Source-Training Results
The ZS-XLT performance from English (EN) to the target languages on the four CLS tasks is shown in the K = 0 column in Table 2. For NER and POS, the results are shown in Figure 2.
For XTREME tasks (XNLI, PAWSX, NER, POS), our implementation delivers results comparable to Hu et al. (2020). For MLDoc, our results are comparable to those of Dong and de Melo (2019), Wu and Dredze (2019), and Eisenschlos et al. (2019). It is worth noting that reproducing the exact results is challenging, as suggested by Keung et al. (2020a). For MARC, our zero-shot results are worse than those of Keung et al. (2020b), who use the dev set of each target language for model selection, while we use the EN dev set, following the common true ZS-XLT setup.

Target-Adapting Results
Variance of Few-Shot Transfer. We hypothesize that FS-XLT suffers from large variance (Dodge et al., 2020) due to the large model complexity and small amount of data in a bucket. To test this empirically, we first conduct two experiments on MLDoc and MARC. First, for a fixed random seed, we repeat 1-shot target-adapting 40 times using different 1-shot buckets in German (DE) and Spanish (ES). Second, for a fixed 1-shot bucket, we repeat the same experiment 40 times using random seeds in {0 . . . 39}. Figure 1 presents the dev set performance distribution of the 40 runs with 40 random seeds (top) and 40 1-shot buckets (bottom).
With exactly the same training data, using different random seeds yields an accuracy difference of 1-2 points for FS-XLT (Figure 1 top). A similar phenomenon has been observed in finetuning monolingual encoders (Dodge et al., 2020) and multilingual encoders with ZS-XLT (Keung et al., 2020a; Wu and Dredze, 2020b; Xia et al., 2020); we show this observation also holds for FS-XLT. The key takeaway is that varying the buckets is a more severe problem. It causes much larger variance (Figure 1 bottom). This large variance could be an issue when comparing different few-shot learning algorithms. The bucket choice is a strong confounding factor that may obscure the strength of a promising few-shot technique. Therefore, for fair comparison, it is necessary to work with a fixed set of few shots. We propose to fix the sampled buckets for unbiased comparison of different FS-XLT methods. We publish the sampled buckets from the six multilingual datasets as a fixed and standardized few-shot evaluation benchmark.
In what follows, each FS-XLT experiment is repeated 40 times using 40 different buckets with the same fixed random seed; we report mean and standard deviation. As noted, the variance due to random seeds is smaller (cf., Figure 1) and has been well studied before (Reimers and Gurevych, 2017;Dodge et al., 2020). In this work, we thus focus our attention and limited computing resources on understanding the impact of buckets, the newly detected source of variance. However, we encourage practitioners to report results with both factors considered in the future.
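Reporting a score distribution over buckets instead of a single number can be as simple as the sketch below. The helper name `summarize` and the toy scores are illustrative assumptions, and the sample standard deviation is one of several reasonable choices.

```python
from statistics import mean, stdev


def summarize(scores):
    """Summarize per-bucket scores from repeated target-adapting runs."""
    return {
        "mean": mean(scores),
        "std": stdev(scores),  # sample standard deviation (assumed choice)
        "range": max(scores) - min(scores),
    }


# Hypothetical accuracies from three buckets with a fixed random seed.
summary = summarize([80.0, 82.0, 84.0])
```

Applying the same helper once to runs that vary only the seed and once to runs that vary only the bucket gives a direct comparison of the two sources of variance.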
Different Numbers of Shots. A comparison concerning the number of shots (K), based on the few-shot results in Table 2 and Figure 2, reveals that the buckets largely improve model performance on a majority of tasks (MLDoc, MARC, POS, NER) over zero-shot results. This is in line with prior work (Lauscher et al., 2020; Hedderich et al., 2020) and follows the success of work on using bootstrapped data (  et al., 2020). In general, we observe that: 1) 1-shot buckets bring the largest relative performance improvement over ZS-XLT; 2) the gains follow the increase of K, but with diminishing returns; 3) the performance variance across the 40 buckets decreases as K increases. These observations are more pronounced for POS and NER; e.g., 1-shot EN to Urdu (UR) POS transfer shows gains of ≈22 F1 points (52.40 with zero-shot, 74.95 with 1-shot).
For individual runs, we observe that models in FS-XLT tend to overfit the buckets quickly at small K values. For example, in around 32% of NER 1-shot buckets, the model achieves the best dev score right after the first epoch; continuing the training only degrades performance. Similar observations hold for semantic tasks like MARC, where in 10 out of 40 DE 1-shot buckets, the dev set performance peaks at epoch 1 (cf. learning curve in Appendix §D Figure 6). This suggests the necessity of running the target-adapting experiments on multiple buckets if reliable conclusions are to be drawn.
Different Downstream Tasks. The models for different tasks present various levels of sensitivity to FS-XLT. Among the CLS tasks that require semantic reasoning, FS-XLT benefits MLDoc the most. This is not surprising given the fact that keyword matching can largely solve MLDoc (Artetxe et al., 2020a,b): A few examples related to target language keywords are expected to significantly improve performance. FS-XLT also yields prominent gains on the Amazon review classification dataset MARC. Similar to MLDoc, we hypothesize that just matching a few important opinion and sentiment words (Liu, 2012) in the target language brings large gains already. We provide further qualitative analyses in §5.4. XNLI and PAWSX behave differently from MLDoc and MARC. XNLI requires higher level semantic reasoning on pairs of sentences. FS-XLT performance improves modestly (XNLI) or even decreases (PAWSX-ES) compared to ZS-XLT, even with large K. PAWSX requires a model to distinguish adversarially designed non-paraphrase sentence pairs with large lexical overlap like "Flights from New York to Florida" and "Flights from Florida to New York". This poses a challenge for FS-XLT, given the small amount of target language information in the buckets. Therefore, when buckets are small (e.g., K = 1) and for challenging semantic tasks like PAWSX, the buckets do not substantially help. Annotating more shots in the target language is an intuitive solution. Designing task-specific pretraining/finetuning objectives could also be promising (Klein and Nabi, 2020; Ram et al., 2021).
Unlike CLS tasks, POS and NER benefit from FS-XLT substantially. We speculate that there are two reasons: 1) Both tasks often require little to no high-level semantic understanding or reasoning; 2) due to i.i.d. sampling, train/dev/test splits are likely to have overlapping vocabulary, and the labels in the buckets can easily propagate to dev and test. We delve deeper into these conjectures in §5.4.
Different Languages. For languages that are more distant from EN, e.g., with different scripts, small lexical overlap, or fewer common typological features (Pires et al., 2019; Wu and Dredze, 2020a), FS-XLT introduces crucial lexical and structural information to guide the update of embedding and transformer layers in mBERT.
We present several findings based on the NER and POS results for a typologically diverse language sample. Figure 2 shows that for languages with non-Latin scripts (different from EN), despite their small to non-existent lexical overlap (see Footnote 3) and diverging typological features (see Appendix §D Tables 9 and 14), the performance boosts are generally larger than those in the same-script target languages: 6.2 vs. 3.0 average gain in NER and 11.4 vs. 5.4 in POS for K = 1. This clearly manifests the large information discrepancy between target-language buckets and source-language data. EN data is less relevant to these languages, so they obtain very limited gain from source-training, reflected by their low ZS-XLT scores. With a small amount of target-language knowledge in the buckets, the performance is improved dramatically, highlighting the effectiveness of FS-XLT.
Table 3: Correlations between FS-XLT F1 score gains and the two factors (lexical overlap and the number of common linguistic features with EN) when considered independently for POS and NER: S/R denotes Spearman's/Pearson's ρ. See Footnotes 3, 4 for information on the two factors.
Footnote 3: We define lexical overlap as $\frac{|V_L \cap V_{EN}|}{|V_{EN}|}$, where $V$ denotes vocabulary. $V_L$ is computed with the 40 buckets of a target language $L$.
Table 3 shows that, besides script form, lexical overlap and the number of linguistic features common with EN (see Footnote 4) also contribute directly to FS-XLT performance difference among languages: There is a moderate negative correlation between F1 score gains and the two factors when considered independently for both syntactic tasks: The fewer overlaps/features a target language shares with EN, the larger the gain FS-XLT achieves.
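The lexical overlap measure and the two correlation coefficients can be computed as in the sketch below. It assumes lexical overlap is the size of the intersection of the target and EN vocabularies divided by the EN vocabulary size, and it implements Spearman's coefficient as Pearson's coefficient on average ranks; all function names and toy inputs are illustrative.

```python
from math import sqrt


def lexical_overlap(vocab_target, vocab_en):
    """|V_L ∩ V_EN| / |V_EN| for two sets of word types."""
    return len(vocab_target & vocab_en) / len(vocab_en)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            result[order[t]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical per-language overlaps and FS-XLT gains.
overlaps = [0.05, 0.10, 0.30, 0.40]
gains = [12.0, 9.0, 5.0, 1.0]
rho = spearman(overlaps, gains)
```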
This again stresses the importance of buckets: they contain target-language-specific knowledge about a task that cannot be obtained by ZS-XLT, which solely relies on language similarity. Interestingly, Pearson's ρ indicates that common linguistic features are much less linearly correlated with FS-XLT gains in NER than in POS.
Table 4 reports the performance drop when directly carrying out target-adapting, without any prior source-training of mBERT. We show the scores for MLDoc and PAWSX as a simple and a challenging CLS task, respectively. For NER and POS, we select two high-resource (Russian (RU), ES), two mid-resource (Vietnamese (VI), Turkish (TR)), and two low-resource languages (Tamil (TA), Marathi (MR)). The results clearly indicate that omitting the source-training stage yields large performance drops. Even larger variance is also observed in this scenario (cf. Appendix §D Table 11). Therefore, the model indeed learns, when trained on the source language, some transferable crosslingual features that are beneficial to target languages, both for semantic and syntactic tasks.

Importance of Lexical Features
We now investigate the sources of gains brought by FS-XLT over ZS-XLT. For syntactic tasks, we take Persian (FA) POS as an example. Figure 3 visualizes the lexical overlap, measured by the Jaccard index, of 10 1-shot buckets (rows) and the improved word-label predictions introduced by target-adapting on each of the buckets (columns). In more detail, for column c, we collect the set (denoted as $C_c$) of all test set words whose label is incorrectly predicted by the zero-shot model, but correctly predicted by the model trained on the c-th bucket. For row i, we denote with $B_i$ the set of words occurring in bucket i. The figure shows in cell (i, c) the Jaccard index of $B_i$ and $C_c$. The bright color (i.e., higher lexical overlap) on the diagonal reflects that the improvements introduced by a bucket are mainly those word-label predictions that are lexically more similar to the bucket than to other buckets. We also investigate the question: How many word-label predictions that are improved after FS-XLT occur in the bucket, i.e., in the training data? Figure 4 plots this for the 40 1-shot buckets in FA, UR, and Hindi (HI). We see that many test words do occur in the bucket (shown in orange), in line with recent findings (Lewis et al., 2021; Elangovan et al., 2021). These analyses shed light on why the buckets benefit NER/POS, which heavily rely on lexical information, more than higher level semantic tasks.
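The bucket-versus-improvement overlap analysis can be reproduced along the following lines. This is a sketch under the definitions given in the text (a set of bucket words per row, a set of newly corrected test words per column, Jaccard index per cell); the function names and the toy inputs are hypothetical.

```python
def jaccard(a, b):
    """Jaccard index |a ∩ b| / |a ∪ b| of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def improved_words(test_words, gold, zero_shot_pred, few_shot_pred):
    """Test words mislabeled by the zero-shot model but fixed after adapting."""
    return {
        word
        for word, g, z, f in zip(test_words, gold, zero_shot_pred, few_shot_pred)
        if z != g and f == g
    }


def overlap_matrix(bucket_vocabs, improved_sets):
    """Cell (i, c): Jaccard index of bucket i's words and improvement set c."""
    return [[jaccard(b, c) for c in improved_sets] for b in bucket_vocabs]


# Hypothetical toy example with two buckets.
test_words = ["ketab", "raft", "bozorg", "shahr"]
gold = ["NOUN", "VERB", "ADJ", "NOUN"]
zero_shot = ["VERB", "VERB", "NOUN", "ADJ"]
few_shot_by_bucket = [
    ["NOUN", "VERB", "NOUN", "ADJ"],  # bucket 0 fixes "ketab"
    ["VERB", "VERB", "ADJ", "NOUN"],  # bucket 1 fixes "bozorg" and "shahr"
]
bucket_vocabs = [{"ketab", "khub"}, {"bozorg", "shahr", "ast"}]
improved_sets = [
    improved_words(test_words, gold, zero_shot, pred) for pred in few_shot_by_bucket
]
matrix = overlap_matrix(bucket_vocabs, improved_sets)
```

In this toy case the diagonal cells are larger than the off-diagonal ones, which is the pattern described for Figure 3.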
For the CLS task MARC, which requires understanding product reviews, Figure 5 visualizes the confusion matrices of test set predictions for DE and Chinese (ZH) zero-shot and 1-shot models; axis ticks are review scores in {1, 2, 3, 4, 5}. The squares on the diagonals in the two left heatmaps show that parameter initialization on EN is a good basis for well-performing ZS-XLT: This is particularly true for DE, which is linguistically closer to EN. Two extreme review scores, 1 (for DE) and 5 (for ZH), have the largest confusions. The two right heatmaps show that improvements brought by the 1-shot buckets are mainly achieved by correctly predicting more cases of the two extreme review scores: 2 → 1 (DE) and 4 → 5 (ZH). But the more challenging cases (reviews with scores 2, 3, 4), which require non-trivial reasoning, are not significantly improved, or even become worse.
We inspect examples that are incorrectly predicted by the few-shot model (predicting 1), but are correctly predicted by the zero-shot model (predicting 2). Specifically, we compute the difference of where [CLS] attends to, before and after adapting the model on a 1-shot DE bucket. We extract and average attentions computed by the 12 heads from the topmost transformer layer. Table 5 shows that "nicht" ("not") draws high attention change from [CLS]. "Nicht" (i.e., negation) by itself is not a reliable indicator of sentiment, so giving the lowest score to reviews solely because they contain "nicht" is not a good strategy. The following review is classified as 1 by the 1-shot model, but 2 is the gold label (as the review is not entirely negative): "Die Uhr ging nicht einmal eine Minute ... Optisch allerdings sehr schön." ("The clock didn't even work one minute ... Visually, however, very nice.")
Pretrained multilingual encoders are shown to learn and store "language-agnostic" features (Pires et al., 2019; Zhao et al., 2020); §5.3 shows that source-training mBERT on EN substantially benefits other languages, even for difficult semantic tasks like PAWSX. Conditioning on such language-agnostic features, we expect that the buckets should lead to good understanding and reasoning capabilities for a target language. However, plain few-shot finetuning still relies heavily on unintended shallow lexical cues and shortcuts (Niven and Kao, 2019; Geirhos et al., 2020) that generalize poorly. Other open research questions for future work arise: How do we overcome this excessive reliance on lexical features? How can we leverage language-agnostic features with few shots? Our standardized buckets, baseline results, and analyses are the initial step towards researching and answering these questions.

Target-Adapting Methods
SotA few-shot learning methods (Chen et al., 2019; Wang et al., 2020; Tian et al., 2020; Dhillon et al., 2020) from computer vision consist of two stages: 1) training on base-class images, and 2) few-shot finetuning using new-class images. Source-training and target-adapting stages of FS-XLT, albeit among languages, follow an approach very similar to these methods. Therefore, we test their effectiveness for crosslingual transfer. These methods are built upon cosine similarity that imparts inductive bias about distance and is more effective than a fully-connected classifier layer (FC) with small K (Wang et al., 2020). Following Chen et al. (2019), Wang et al. (2020), and Tian et al. (2020), we freeze the embedding and transformer layers of mBERT, and explore four variants of the target-adapting stage using MARC.
COS+Pooler. We randomly initialize a trainable weight matrix $W \in \mathbb{R}^{h \times c}$, where $h$ is the hidden dimension size and $c$ is the number of classes. Rewriting $W$ as $[w_1, \ldots, w_i, \ldots, w_c]$, we compute the logit of an input sentence representation $x \in \mathbb{R}^{h}$ (from mBERT) belonging to class $i$ as
$$\alpha \, \frac{x^\top w_i}{\lVert x \rVert_2 \cdot \lVert w_i \rVert_2},$$
where $\alpha$ is a scaling hyperparameter, set to 10 in all experiments. During training, $W$ and mBERT's pooler layer (a linear layer followed by a tanh non-linearity) are updated.
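A cosine-similarity classifier head of this kind can be sketched in plain Python as follows. The sketch assumes the logit for class i is the α-scaled cosine similarity between the sentence representation x and the class weight vector w_i, as in the computer vision methods cited above; the function name and the toy vectors are illustrative.

```python
from math import sqrt


def cosine_logits(x, class_weights, alpha=10.0):
    """Return one logit per class: alpha * cos(x, w_i).

    `x` is a list of h floats; `class_weights` holds c vectors of length h
    (the columns w_1..w_c of W). Assumes no vector is all zeros.
    """
    x_norm = sqrt(sum(v * v for v in x))
    logits = []
    for w in class_weights:
        w_norm = sqrt(sum(v * v for v in w))
        dot = sum(a * b for a, b in zip(x, w))
        logits.append(alpha * dot / (x_norm * w_norm))
    return logits


# Toy 2-dimensional example with three classes.
logits = cosine_logits([1.0, 0.0], [[2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]])
```

Because the cosine is bounded in [-1, 1], the logits lie in [-α, α] regardless of the norms of x and w_i, which is why a scaling factor is needed before the softmax.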
FC+Pooler. During training, we update the linear classifier layer and mBERT's pooler layer.
FC only. During training, we only update the linear classifier layer. This variant largely reduces model complexity and exhibits lower variance when K is small.
FC(reset)+Pooler. Similar to FC+Pooler, but the source-trained linear classifier layer is randomly re-initialized before training.
Table 6 shows the performance of these methods along with full model finetuning (without freezing). FC+Pooler performs the best among the four for both K = 1 and K = 8 in all languages. However, it underperforms full model finetuning, especially when K = 8. FC only is sub-optimal; yet the decrease in comparison to FC+Pooler is small, highlighting that EN-trained mBERT is a strong feature extractor. COS+Pooler and FC(reset)+Pooler perform considerably worse than the other two methods and zero-shot transfer, presumably because their new parameters need to be trained from scratch with few shots.
We leave further exploration of other possibilities of exploiting crosslingual features through collapse-preventing regularization (Aghajanyan et al., 2021) to future work.

Conclusion and Future Work
We have presented an extensive study of few-shot crosslingual transfer. The focus of the study has been on an empirically detected performance variance in few-shot scenarios: The models exhibit a high level of sensitivity to the choice of few shots. We analyzed and discussed the major causes of this variance across six diverse tasks for up to 40 languages. Our results show that large language models tend to overfit to few shots quickly and mostly rely on shallow lexical features present in the few shots, though they have been trained with abundant data in English. Moreover, we have empirically validated that state-of-the-art few-shot learning methods in computer vision do not outperform a conceptually simple alternative: Full model finetuning.
Our study calls for more rigor and accurate reporting of the results of few-shot crosslingual transfer experiments. They should include score distributions over standardized and fixed few shots. To aid this goal, we have created and provided such fixed few shots as a standardized benchmark for six multilingual datasets.
Few-shot learning is promising for crosslingual transfer because it mirrors how people acquire new languages and because few-shot data annotation is feasible. In future work, we will investigate more sophisticated techniques and extend the work to more NLP tasks.
There are about 179 million parameters in mBERT. For all the tasks, we use a linear output layer. Denoting the output dimension of a task as m (e.g., m = 2 for PAWSX), we have in total 179 million + 768×m + m parameters for the task.

A.2 Computing Infrastructure
All experiments are conducted on GeForce GTX 1080 Ti GPUs. In the source-training stage, we use 4 GPUs with per-GPU batch size 32. In the target-adapting stage, we use a single GPU and the batch size is equal to the number of examples in a bucket.

A.3 Evaluation Metrics and Validation Performance
We follow the standard evaluation metrics used in XTREME (Hu et al., 2020), which are shown in Table 1; evaluation functions in scikit-learn (Pedregosa et al., 2011) and seqeval (https://github.com/chakki-works/seqeval) are used. Link to code: code/utils/eval meters.py. The validation performance of the English-trained models is shown in the first row of Table 7; the optimal learning rate for each task is shown in the second row.

D Additional Results
D.1 Learning Curve
Figure 6 visualizes the averaged learning curve of 10 out of 40 German 1-shot MARC buckets for which the best dev performance is obtained at epoch 1.