XTREME-R: Towards More Challenging and Nuanced Multilingual Evaluation

Machine learning has brought striking advances in multilingual natural language processing capabilities over the past year. For example, the latest techniques have improved the state-of-the-art performance on the XTREME multilingual benchmark by more than 13 points. While a sizeable gap to human-level performance remains, improvements have been easier to achieve in some tasks than in others. This paper analyzes the current state of cross-lingual transfer learning and summarizes some lessons learned. In order to catalyze meaningful progress, we extend XTREME to XTREME-R, which consists of an improved set of ten natural language understanding tasks, including challenging language-agnostic retrieval tasks, and covers 50 typologically diverse languages. In addition, we provide a massively multilingual diagnostic suite and fine-grained multi-dataset evaluation capabilities through an interactive public leaderboard to gain a better understanding of such models.


Introduction
Most research in natural language processing (NLP) to date has focused on developing methods that work well for English and a small set of other high-resource languages (Joshi et al., 2020). In contrast, methods for other languages can be vastly more beneficial as they enable access to language technology for more than three billion speakers of low-resource languages and prevent the NLP community from overfitting to English. Motivated by these benefits, the area of multilingual NLP has attracted increasing interest recently.
Table 1: Overview of XTREME and XTREME-R.

                 XTREME                          XTREME-R
# of languages   40                              50
# of tasks       9                               9 − 2 + 3 = 10
Task categories  Classification, structured      + language-agnostic retrieval
                 prediction, QA, retrieval
Analysis tools   –                               MULTICHECKLIST, EXPLAINABOARD
Leaderboard      Static                          Interactive, + metadata

However, evaluating multilingual models is challenging, as it requires assessing performance on a wide range of typologically distinct languages in the face of limited heterogeneous data sources. Recently, large-scale benchmarks such as XTREME (Hu et al., 2020) and XGLUE (Liang et al., 2020)
have been introduced, consolidating existing multilingual tasks and covering tens of languages. When XTREME was released, the gap between the best-performing baseline, XLM-R Large (Conneau et al., 2020), and human-level performance was roughly 25 points. This has since shrunk to less than 12 points, a much smaller but still substantial gap compared to the difference from human-level performance observed in English transfer learning (Wang et al., 2019a), which has recently been closed entirely on some evaluation suites (He et al., 2021). In order to examine the nature of this progress, we first perform an analysis of state-of-the-art multilingual models on XTREME. We observe that progress has not been uniform, but concentrated on cross-lingual retrieval tasks, where fine-tuning on other tasks and pre-training with parallel data lead to large gains. On other task categories improvements are more modest. Models still generally perform poorly on languages with limited data and non-Latin scripts. Fine-tuning on additional translated data generally leads to the best performance.
Based on this analysis, we propose XTREME-R (XTREME Revisited), a new benchmark with the dual purpose of ensuring that research in multilingual NLP focuses on the most challenging problems and equipping researchers with a broader set of tools to better understand their models (see Table 1 for a brief overview). XTREME-R follows in its predecessor's footsteps by being massively multilingual, diverse, and accessible. It expands on XTREME by covering 50 typologically diverse languages and 10 challenging, diverse tasks. To make retrieval more difficult, we introduce two new tasks that focus on "language-agnostic" retrieval (Roy et al., 2020), where targets must be retrieved from a large multilingual candidate pool. We additionally establish new state-of-the-art mT5 (Xue et al., 2021) and translate-train baselines for our tasks. XTREME-R aims to move away from a single aggregate metric summarizing a model's performance and towards a more nuanced evaluation and comparison of multilingual models (Ethayarajh and Jurafsky, 2020; Linzen, 2020). To this end, we introduce an extensible multilingual diagnostic and evaluation suite that consists of two main components: a) MULTICHECKLIST, a test suite (Ribeiro et al., 2020) for probing question answering capabilities in 50 languages. This test suite is the first of its kind and enables direct evaluation of fine-grained capabilities in a massively multilingual setting. b) We extend the multi-dataset evaluation framework EXPLAINABOARD (Fu et al., 2020; Liu et al., 2021) to additional tasks and the multilingual setting. This framework allows us to break down performance based on language and task-specific attributes, which enables a more nuanced diagnosis of a model's behaviour.
We also make several logistic improvements to improve XTREME-R's utility as a leaderboard. To make it easier to choose the best model for a use case, each submission is required to provide metadata such as the number of parameters and amount of pre-training data, which we make available via an interactive leaderboard. We also introduce task and language-specific sub-leaderboards to invite submissions of dedicated models.
In sum, we make the following contributions: a) an analysis of progress in cross-lingual modeling; b) an improved benchmark covering 50 languages, including a newly created retrieval task (Mewsli-X); c) a massively multilingual diagnostic suite; d) fine-grained evaluation capabilities; e) experiments and analyses of state-of-the-art models; and f) an interactive metadata-rich leaderboard.
2 Examining the State of Multilingual Benchmarking

Background
Benchmarking is critical to evaluate general-purpose language understanding technologies. To this end, benchmarks like GLUE (Wang et al., 2019b) and SuperGLUE (Wang et al., 2019a) provide a way to assess the transfer learning capabilities of various models. However, these benchmarks focus only on English. On the other hand, cross-lingual approaches have been evaluated on a wide range of disparate tasks (Hu et al., 2020). XTREME was proposed as a platform to unify this fragmented evaluation landscape and to catalyze advances in cross-lingual learning by including a diverse set of tasks and languages. It consists of 9 tasks covering 40 diverse languages, which can be grouped into 4 broad task types (see §3.1 for details): classification (XNLI, PAWS-X), structured prediction (UD-POS, WikiANN-NER), question answering (XQuAD, MLQA, TyDiQA-GoldP), and retrieval (Tatoeba, BUCC). XTREME focuses on zero-shot cross-lingual transfer, i.e. models can be pre-trained on any multilingual data and are fine-tuned only in English. Similarly, XGLUE (Liang et al., 2020), another cross-lingual benchmark, focuses on a smaller number of less typologically diverse languages, and includes generation tasks. Other non-English benchmarks focus on specific linguistic phenomena, e.g. code-switching (Khanuja et al., 2020); languages, e.g. Indonesian (Wilie et al., 2020) and Persian (Khashabi et al., 2020); and language families, e.g. Indian languages (Kakwani et al., 2020).

An Analysis of XTREME
As of April 15, 2021, all submissions to the XTREME leaderboard are large-scale Transformers (Vaswani et al., 2017) trained with masked language modeling (MLM; see Appendix A for further details). We analyze the performance of these models on the XTREME leaderboard in Figure 1.

Figure 1: Performance of models (a) on the XTREME leaderboard across all nine XTREME tasks, (b) on the MLQA question answering dataset, (c) on the Tatoeba retrieval task, and (d) on Universal Dependencies POS tagging across language families. Models are ordered based on their XTREME score (a). Results of models that do not evaluate on a task category are omitted, i.e. RemBERT for retrieval and mT5 for retrieval and tagging.

Overall, multilingual models have improved the average performance on XTREME from 55.8 to 81.4. Much of this improvement is concentrated in the retrieval-based tasks, where performance increased from 47.7 (mBERT) to 92.7 (VECO). In contrast, performance on question answering and structured prediction tasks has improved only slightly. Breaking down performance by language family, on Tatoeba (Figure 1c) recent models still struggle with a few low-resource languages. Models perform well for most other languages, and their scores are concentrated in a relatively small range. On MLQA (Figure 1b), scores have increased slightly
but remain well below performance on English. On POS tagging (Figure 1d), scores remain largely the same; performance is lower for some languages with non-Latin scripts and for low-resource languages. We show the scores for the remaining tasks in Appendix B. The remaining gap to English performance on these tasks is partially an artefact of the evaluation setup: zero-shot cross-lingual transfer from English favors English representations, whereas models fine-tuned on in-language monolingual data perform more similarly across languages (Clark et al., 2020; Hu et al., 2020).
Overall, representations from token-level MLM pre-training are of limited use for cross-lingual sentence retrieval, as evidenced by the comparatively poor performance of the mBERT and XLM-R models. Fine-tuning on sentence-level tasks (Phang et al., 2020; Fang et al., 2021) can mitigate this. The strong performance of recent models such as VECO and ERNIE-M on the retrieval tasks can be attributed to a combination of parallel data and new pre-training objectives that make use of it. Pre-training on parallel data improves performance on retrieval by making the pre-training task more similar to the downstream setting, but does not significantly improve performance on other tasks. Fine-tuning on automatically translated task-specific data yields strong gains and is used by most recent models to achieve the best performance (Hu et al., 2020; Ouyang et al., 2020; Luo et al., 2020). Nevertheless, key challenges such as how to learn robust cross-lingual syntactic and semantic processing capabilities during pre-training remain.

XTREME-R
In order to encourage the NLP community to tackle challenging research directions in pursuit of better cross-lingual model generalization, we propose XTREME-R (XTREME Revisited). XTREME-R shares its predecessor's core design principles for creating an accessible benchmark to evaluate crosslingual transfer but makes some key changes.
First, XTREME-R focuses on the tasks that have proven to be hardest for current multilingual models. To this end, it drops XTREME's PAWS-X and BUCC tasks since recent advances have left less room for further improvement, and they cover only a small number of less diverse languages. They are replaced by three new, more challenging tasks: one focusing on causal commonsense reasoning (§3.2.1) and two focusing on harder retrieval scenarios (§3.2.2), as retrieval has been the task category where gains have been easiest to achieve. We retain XTREME's seven other tasks as each still presents substantial challenges for state-of-the-art cross-lingual models (§3.1). Overall, XTREME-R includes 10 diverse tasks, summarized in Table 2. We also make changes to the structured prediction tasks, NER and POS. Instead of only providing examples as lists of tokens, XTREME-R always provides the full text of an input sentence, thus ensuring that the entire benchmark now supports research on models that operate directly on the raw input string (Clark et al., 2021). Furthermore, XTREME-R adopts a more realistic version of the NER task in which no gold tokenization is provided at all, meaning that systems will either have to use model-predicted tokens or embrace tokenization-free approaches. Finally, XTREME-R provides a multilingual diagnostic and evaluation suite (§3.4).

Retrieval from a Multilingual Pool
Many previous retrieval benchmarks assume that the entire candidate pool is in a single language. For instance, a French query will be used to search over only English candidates. However, practical settings often violate this assumption, e.g. the answer to a question may be available in any number of languages, possibly different from the query language. Models that cannot compare the appropriateness of retrieval results across languages are thus ineffective in such real-world scenarios. XTREME-R includes two new related cross-lingual retrieval tasks. The first seeks to measure the extent to which cross-lingual representations are "strongly aligned" (Roy et al., 2020), i.e. they place the semantically most related text pairs (e.g. a question and its answer) closest together in representation space, regardless of their language identities. The second analogously frames entity linking as retrieving from a multilingual pool of entity descriptions, given an entity mention in context (Botha et al., 2020). For both, we report performance as mean average precision at 20 (mAP@20).

LAReQA Language Agnostic Retrieval Question Answering (Roy et al., 2020) is a sentence retrieval task. Each query has target answers in multiple languages, and models are expected to rank all correct answers above all incorrect answers, regardless of language. We use the LAReQA XQuAD-R dataset, which contains 13,090 questions, each of which has 11 target answers (in 11 distinct languages) within the pool of 13,014 candidate answer sentences. Following Roy et al. (2020), we fine-tune models on the SQuAD v1.1 train set. The fine-tuned model is used to rank the 13K candidates for each question.
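For reference, mAP@20 can be sketched as below. This is a minimal pure-Python illustration, not the benchmark's official scoring script; in particular, normalizing by min(|relevant|, k) is one common convention and may differ in detail from the exact implementation used for XTREME-R.

```python
def average_precision_at_k(ranked_ids, relevant_ids, k=20):
    """Average precision over the top-k ranked candidates.

    ranked_ids: candidate ids ordered by decreasing model score.
    relevant_ids: set of gold-relevant candidate ids.
    """
    hits = 0
    precision_sum = 0.0
    for rank, cand in enumerate(ranked_ids[:k], start=1):
        if cand in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    # Normalize by the number of relevant items retrievable in the top k.
    denom = min(len(relevant_ids), k)
    return precision_sum / denom if denom else 0.0


def mean_average_precision_at_k(all_rankings, all_relevant, k=20):
    """Mean of AP@k over all queries."""
    aps = [average_precision_at_k(r, g, k)
           for r, g in zip(all_rankings, all_relevant)]
    return sum(aps) / len(aps)
```

For a query whose 11 correct answers all appear in the top 11 ranks, AP@20 is 1.0; every correct answer pushed below rank 20 lowers the score.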
Mewsli-X Mewsli (Multilingual Entities in News, linked) is an automatically extracted dataset that requires linking a contextual entity mention to its entry in a language-agnostic knowledge base by retrieving the entity's description from a multilingual candidate pool (Botha et al., 2020). For XTREME-R, we derive Mewsli-X as a new variant of Mewsli-9, still linking against WikiData (Vrandečić and Krötzsch, 2014). Mewsli-X features 15K mentions in 11 languages: given a mention in context, the task is to retrieve the single correct target entity description from a candidate pool ranging over 1M candidates across all 50 languages of XTREME-R. Fine-tuning is done on a predefined set of English-only mention-entity pairs randomly sampled from English Wikipedia hyperlinks (see Appendix E for further details).
For our baseline systems on both tasks, we follow previous work (Roy et al., 2020; Botha et al., 2020) and train a dual encoder initialized from the pre-trained model weights, optimizing for an in-batch sampled softmax loss (Gillick et al., 2018).
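As a rough sketch of this objective: for each aligned (query, target) pair in a batch, the in-batch sampled softmax treats the paired target as the positive and the other targets in the batch as negatives. The function below is a pure-Python, unbatched illustration under the assumption that the dual-encoder outputs are given as plain vectors; real implementations operate on accelerator tensors.

```python
import math


def in_batch_softmax_loss(query_vecs, cand_vecs):
    """In-batch sampled softmax: for each query i, candidate i is the
    positive and the other candidates in the batch act as negatives.

    query_vecs, cand_vecs: equal-length lists of float vectors, assumed
    to be dual-encoder outputs for aligned (query, target) pairs.
    Returns the mean negative log-likelihood of the positives.
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    n = len(query_vecs)
    total = 0.0
    for i, q in enumerate(query_vecs):
        scores = [dot(q, c) for c in cand_vecs]
        log_z = math.log(sum(math.exp(s) for s in scores))
        total += -(scores[i] - log_z)  # cross-entropy with label i
    return total / n
```

When the encoders are uninformative (all scores equal), the loss is log(batch size); it approaches zero as each query scores its own target far above the in-batch negatives.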

Diagnostic and evaluation suite
To increase the coverage of low-resource languages in XTREME-R and to enable us to systematically evaluate a model's cross-lingual generalization ability, we augment XTREME-R with a massively multilingual diagnostic and evaluation suite. Challenge sets and diagnostic suites in NLP (Wang et al., 2019a,b; Belinkov and Glass, 2019) are mostly limited to English, with a few exceptions (Gulordava et al., 2018). As challenge sets are generally created with a human in the loop, the main challenge for creating a large multilingual diagnostic suite is to scale the annotation or translation effort to many languages and to deal with each language's idiosyncrasies.

MULTICHECKLIST To address this, we build on the CHECKLIST (Ribeiro et al., 2020) framework, which facilitates creating parameterized tests for models. CHECKLIST enables the creation of test cases using templates, which test for specific behavioral capabilities of a model with regard to a downstream task. Importantly, by relying on template-based tests, we can efficiently generate a large number of diverse multilingual test cases by creating a relatively small number of templates in 50 languages. We focus on translating English tests, which consist of templates and their fill-in values. To study the feasibility of creating multilingual test cases at scale, we translate the minimum functionality tests (MFT) of CHECKLIST, which probe for general vocabulary and taxonomic knowledge in question answering. We instruct translators to create separate variants of a template to disambiguate linguistic phenomena, such as the gender of fill-in values, question type, semantics of properties, etc. We automatically fill in names in each language based on data from Wikidata and programmatically consolidate different templates in each language.
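As an illustration of why templates scale well, a single template crossed with a few filler lists already yields many concrete test cases. The sketch below is a hypothetical stand-in, not CHECKLIST's actual API; the helper name and example slots are invented for illustration.

```python
from itertools import product


def expand_template(template, fillers):
    """Instantiate a parameterized test template with all filler combinations.

    template: string with named slots, e.g. "{name} is a {profession}."
    fillers: dict mapping each slot name to a list of candidate values.
    """
    slots = list(fillers)
    cases = []
    for values in product(*(fillers[s] for s in slots)):
        cases.append(template.format(**dict(zip(slots, values))))
    return cases


# Hypothetical vocabulary-style MFT context in the spirit of the QA tests:
contexts = expand_template(
    "{name} is {adj1}. {name2} is {adj2}.",
    {"name": ["Maria", "Ana"], "adj1": ["tall"],
     "name2": ["Ivan"], "adj2": ["short"]},
)
```

With t templates and n values per slot, the number of generated cases grows multiplicatively, which is what makes a small translation effort per language go a long way.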
We show examples of templates and the tests that they generate in different languages in Table 3.
We highlight statistics of the dataset and translation process, instructions to translators, and general challenges of template translation in Appendix F. We believe that parameterized tests are a powerful tool to obtain diverse diagnostic data for otherwise resource-starved languages. We view participatory research (∀ et al., 2020) with native speakers to create template-based test cases testing for language-specific behaviour as particularly promising.

Multilingual EXPLAINABOARD The standard practice in leaderboards is to average performance across different settings (Wang et al., 2019b,a). While this provides discriminative power, it has limited utility for examining the relative advantages of systems, the characteristics of different datasets and languages, and how these factors relate to each other. To provide more granular evaluation capabilities, we extend Fu et al. (2020) and Liu et al. (2021)'s EXPLAINABOARD to the task categories and languages in XTREME-R. EXPLAINABOARD provides a more nuanced impression of a model's performance on a task by defining task-specific attributes (e.g. entity length for NER). The test set is partitioned into different buckets based on the defined attributes and performance is broken down over different attribute values. We define new task-specific attributes for the four task types as well as task-independent attributes (see Appendix K).

Metadata We additionally would like to enable practitioners to rank submissions based on other information. To this end, we ask each submission to XTREME-R for relevant metadata such as the number of parameters, the amount of pre-training data, etc. We will show this information in an interactive leaderboard (see Appendix H for the metadata of current XTREME submissions).
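The attribute-bucketing step can be sketched as follows. This is an illustrative implementation, not EXPLAINABOARD's actual interface; the attribute name, bucket edges, and the `correct` flag are assumptions made for the example.

```python
def bucketize(examples, attribute, edges):
    """Partition examples into buckets by a numeric attribute.

    examples: list of dicts carrying the attribute and a boolean 'correct'.
    edges: ascending bucket boundaries, e.g. [5, 10] creates the buckets
    (-inf, 5], (5, 10], (10, inf).
    """
    buckets = [[] for _ in range(len(edges) + 1)]
    for ex in examples:
        v = ex[attribute]
        idx = sum(v > e for e in edges)  # index of the bucket v falls into
        buckets[idx].append(ex)
    return buckets


def bucket_accuracy(buckets):
    """Accuracy per bucket; None for empty buckets."""
    return [
        sum(ex["correct"] for ex in b) / len(b) if b else None
        for b in buckets
    ]
```

Breaking accuracy down this way is what surfaces patterns like "strong on short answers (XS), weak on long answers (XL)" that a single aggregate score hides.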

Experiments
Training and evaluation setup XTREME-R focuses on zero-shot cross-lingual transfer from English. While recent work (Hu et al., 2020; Lauscher et al., 2020; Hedderich et al., 2020) demonstrates the benefits of fine-tuning on in-language data, we believe the zero-shot scenario remains the most effective way to evaluate the amount of a priori multilingual knowledge a pre-trained model captures. Due to variation in cross-lingual evaluation (Keung et al., 2020), we recommend that researchers use the validation set of a single target language for development (Artetxe et al., 2020b).

Baselines
We employ established pre-trained multilingual models as well as models using translations as baselines.

mBERT Multilingual BERT (Devlin et al., 2019) was pre-trained on the Wikipedias of 104 languages using MLM.

XLM-R XLM-R Large (Conneau et al., 2020) uses the same MLM objective with a larger model, and was trained on an order of magnitude more web data from 100 languages.

mT5 Multilingual T5 (Xue et al., 2021) is an encoder-decoder transformer that frames NLP tasks in a "text-to-text" format. It was pre-trained with MLM on a large multilingual web corpus covering 101 languages. We employ the largest mT5-XXL variant with 13B parameters.

Translate-train To evaluate the impact of MT, we fine-tune mBERT on translations of the English training data from Hu et al. (2020). We create new translations for the XCOPA and SIQa data using an in-house MT system.

Translate-train multilingual In addition, we fine-tune both mBERT and mT5 on the combined translated training data of all languages (including the original English data) jointly.

Human performance We use the human performance estimates from XTREME for the retained tasks. For XCOPA we average the proportion of annotated labels disagreeing with the majority label across all languages (Ponti et al., 2020). We are not able to obtain human performance estimates for the new retrieval tasks, as identifying a translation among a large number of candidates is too time-consuming for a human to perform.

Results
We show the main results in Table 4. As in prior work, XLM-R Large generally outperforms mBERT. Fine-tuning helps significantly on Tatoeba compared to the zero-shot setting (Hu et al., 2020).

Table 4: Overall results of baselines across all XTREME-R tasks. *: Due to compute limitations, mT5-XXL language-agnostic retrieval results are obtained with a frozen rather than a fine-tuned model.

The new tasks are challenging for current models, which show relatively lower performance on them than on the other tasks. XCOPA presents a challenging classification task that requires cross-lingual common sense reasoning, while the language-agnostic nature of Mewsli-X and LAReQA puts the cross-lingual alignment of multilingual representations to the test. Analysis of the language-agnostic retrieval results shows that a large gap remains between cross-lingual and same-language test cases. XLM-R Large improves significantly over mBERT on the cross-lingual case in exchange for a slight drop on the same-language case. This points to XLM-R Large inducing more "strongly aligned" representations (see Appendix I for details). The state-of-the-art mT5 improves performance on classification and QA tasks but performs less well on structured prediction and retrieval, highlighting settings where advances beyond scale are needed. Training on task-specific translations is beneficial in all cases and generally performs best, although improvements on QA tasks are diminishing. To obtain a more fine-grained understanding of the performance of current models, we conduct several analyses using our multilingual diagnostic suite.

MULTICHECKLIST
We show the results of XLM-R fine-tuned on English SQuAD v1.1 on the 6 tests of MULTICHECKLIST in Table 5 (see Appendix F for the full results, example failure cases, and mBERT results). While mBERT's average error rate is greater than 85% on 4/6 test categories, XLM-R demonstrates a substantially more robust cross-lingual understanding ability. XLM-R performs worst on tests in low-resource languages with limited or no pre-training data such as gu, ha, ht, qu, sw, wo, and yo, and in languages with non-Latin scripts such as he, ja, th, and zh. In addition, XLM-R displays interesting variation across languages, for instance failing to model comparisons in some languages, like Basque (eu), where it otherwise succeeds. We release the tests and test outputs to encourage deeper analysis and extension to other tasks and languages.

Footnote 6: Due to compute limitations, we extract mT5 embeddings by averaging the encoder outputs of a frozen mT5 model fine-tuned on SQuAD v1.1, as opposed to fine-tuning a dual encoder. For this reason, the mT5 language-agnostic retrieval scores are not directly comparable to those of mBERT and XLM-R.

Table 6 caption (partial): "what" (A), "how" (B), "who" (C), "when" (D) and "which" (E). In the single system diagnosis histogram, blue (red) x tick labels represent the bucket category (e.g., XS) of a specific attribute on which a system achieved the best (worst) performance. In the pairwise system diagnosis histogram, blue (red) x tick labels represent the bucket value of a specific attribute where system M1 surpasses (under-performs) M2 by the largest margin, illustrated by a blue (red) bin. Blue-only x tick labels (e.g., -D) indicate that M1 outperforms M2 in all categories of an attribute.

Nuanced Multilingual Evaluation
We showcase how nuanced multilingual evaluation enables us to perform single and pairwise system diagnosis on XQuAD in Table 6 (see Appendix K for analyses of the other tasks). We choose two systems: ERNIE-M, one of the top systems on XTREME, and XLM-R in eight languages: English, Chinese, Hindi, Greek, Russian, Turkish, Arabic, and Vietnamese (en, zh, hi, el, ru, tr, ar, vi).
Attributes We denote (X_c, X_q, X_a) as a tuple of a context, question and answer, and refer to cLen, qLen, aLen as their lengths (i.e., the number of tokens). We use BLEU (Papineni et al., 2002) to measure lexical overlap between (X_a, X_q) and (X_q, X_c) as BLEU-AQ and BLEU-QC. We report the top 5 most frequent question types (qType), which cover 85% of questions in the training set.
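A minimal sketch of computing such per-example attributes is shown below. It is a simplification made for illustration: it uses whitespace tokenization and a unigram-overlap stand-in for BLEU, so the values only approximate the BLEU-AQ attribute described above.

```python
def question_attributes(context, question, answer):
    """Compute simple diagnostic attributes for one QA example.

    Whitespace tokenization and unigram overlap are simplifications;
    the paper's exact tokenization and BLEU computation differ.
    """
    q_toks = question.split()
    c_toks = context.split()
    a_toks = answer.split()
    # Fraction of answer tokens that also occur in the question,
    # a rough unigram stand-in for BLEU-AQ.
    overlap_aq = len(set(a_toks) & set(q_toks)) / max(len(a_toks), 1)
    qtype = q_toks[0].lower() if q_toks else ""
    return {
        "cLen": len(c_toks),
        "qLen": len(q_toks),
        "aLen": len(a_toks),
        "overlap-AQ": overlap_aq,
        "qType": qtype,
    }
```

Attributes like these are then used to bucket the test set so that accuracy can be compared across, e.g., short versus long answers or "what" versus "who" questions.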
Single System Analysis For almost all languages, ERNIE-M achieves the highest performance on shorter answers (XS), but the worst performance on longer answers (XL). Especially in el, the performance difference between long and short answers is more than 40 absolute points. The influence of question and context length is language-dependent. For example, in zh the system favors long questions and contexts, while in hi it is the opposite. If the answer is lexically similar to the question (larger BLEU-AQ), the system tends to make more mistakes in all eight languages. However, a higher lexical overlap between questions and contexts (BLEU-QC) is helpful for some languages: el, ru, ar. Surprisingly, ERNIE-M struggles to answer relatively frequent question types (i.e., what and how), while it performs better on less frequent questions, indicating that although questions about person, place, and choice are less frequent, they are easier than abstract questions.

Pairwise System Analysis Although ERNIE-M outperforms XLM-R by a large margin, it is surpassed by XLM-R on a few buckets. In en, XLM-R is better at dealing with longer answers and questions. In tr, XLM-R surpasses ERNIE-M on samples with shorter answers and contexts. In zh, XLM-R performs better when dealing with questions that are lexically similar to the answers.

Conclusions
Our analyses and experiments have shed light on important directions where scale alone is not sufficient such as "strong" alignment, syntactic transfer, fine-grained natural language understanding, and answering of abstract questions. We encourage the development of better inductive biases, pre-training objectives, and evaluation resources. We make our data, translations, evaluation resources, and interactive leaderboard supporting detailed comparative analyses available to help the community gain a better understanding of multilingual models.
7 Ethical Considerations

7.1 Language representation

XTREME-R seeks to improve language representation and language diversity in NLP research, where a lack of both has been identified as a major challenge (Joshi et al., 2020). We tried to cover a set of languages that is as diverse as possible, while still providing access to evaluation data in multiple tasks for each language. Despite this, XTREME-R has little representation of languages of the Americas and Africa due to a lack of labeled datasets for these languages. In addition, some languages included in XTREME-R with few data available online are only covered in a small number of datasets (see Table 7). To ameliorate this, we release training data of tasks translated into other languages, as well as the new MULTICHECKLIST. We reiterate the ongoing need for creating labeled datasets for diverse tasks in under-represented languages, to facilitate the development and evaluation of NLP models for such languages. We emphasize the importance of participatory research (∀ et al., 2020) as a modus operandi for such work in order to involve marginalized communities in the research process.

Leaderboard chasing
New benchmarks incentivize researchers to hill-climb on aggregate metrics (Ethayarajh and Jurafsky, 2020). In addition, new benchmarks create new opportunities for models to reach "superhuman" performance, which may lead people outside the field to erroneously conclude that some model has "solved language". We hope that our inclusion of EXPLAINABOARD and MULTICHECKLIST helps to prevent such a fallacy by enabling more fine-grained evaluation that goes beyond a single aggregate metric.

Biases in multilingual models
Multilingual models have been shown to reflect biases similar to their monolingual counterparts (Zhao et al., 2020). In addition, multilingual models are biased towards languages with more pre-training data (Hu et al., 2020). Zero-shot cross-lingual transfer additionally introduces a bias towards the source language (Søgaard et al., 2018; Anastasopoulos and Neubig, 2020). Due to the paucity of training data in other languages, we nevertheless focus on English-centric transfer and encourage future dataset creation efforts to include training data in multiple languages.

B Task scores on XTREME
We show the performance of the models on the XTREME leaderboard broken down by language family on the remaining XTREME tasks in Figure 2.
C XTREME tasks retained in XTREME-R

MLQA MLQA is another cross-lingual question answering task. The evaluation data in seven languages was automatically mined from Wikipedia, annotations were crowd-sourced, and answer spans aligned. For both XQuAD and MLQA, we use their respective data for evaluation and train on SQuAD v1.1.

TyDiQA-GoldP We use the gold passage (GoldP) version of TyDiQA (Clark et al., 2020), a benchmark for information-seeking question answering, which covers nine typologically diverse languages. The GoldP version is a simplification of the primary task, using only the gold passage as context and excluding unanswerable questions. We use the English training data for training and evaluate on the test sets of the target languages.

Tatoeba We evaluate on the Tatoeba dataset (Artetxe and Schwenk, 2019), which consists of up to 1,000 English-aligned sentence pairs covering 122 languages. We find the nearest neighbor using cosine similarity. To make the setting more realistic, we move away from zero-shot retrieval and fine-tune models on SQuAD v1.1.
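The cosine-similarity nearest-neighbor step for Tatoeba-style retrieval can be sketched as follows, assuming sentence embeddings have already been computed by the model; this mirrors the evaluation setup described above but is not the benchmark's exact implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 for zero vectors."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv) if nu and nv else 0.0


def retrieve_nearest(query_vec, candidate_vecs):
    """Index of the candidate with the highest cosine similarity."""
    sims = [cosine(query_vec, c) for c in candidate_vecs]
    return max(range(len(sims)), key=sims.__getitem__)


def retrieval_accuracy(src_vecs, tgt_vecs):
    """Fraction of source sentences whose nearest target is the aligned one
    (i.e. the same index), as in Tatoeba-style evaluation."""
    correct = sum(retrieve_nearest(s, tgt_vecs) == i
                  for i, s in enumerate(src_vecs))
    return correct / len(src_vecs)
```

In practice the candidate embeddings are L2-normalized once and searched with a matrix product or an approximate-nearest-neighbor index, but the scoring is equivalent to the loop above.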

D Languages
Language characteristics We show a detailed overview of languages in XTREME-R including interesting typological differences in Table 7. Wikipedia information is taken from Wikipedia and linguistic information from WALS Online. XTREME-R includes members of the Afro-Asiatic, Austro-Asiatic, Austronesian, Dravidian, Indo-European, Japonic, Kartvelian, Kra-Dai, Niger-Congo, Sino-Tibetan, Turkic, Uralic, Creole, and Quechuan language families, as well as two isolates, Basque and Korean.

Language diversity indices We measure the language diversity of XTREME-R according to the typology and language family indices of Ponti et al. (2020), which we show in Table 8 for XTREME-R, XTREME, and XGLUE.

E Mewsli-X Dataset
Mewsli-X is constructed specifically for XTREME-R and is a more carefully sampled variant of the Mewsli-9 dataset (Botha et al., 2020), derived from WikiNews in the same way. Compared to Mewsli-9, Serbian is dropped and Polish, Romanian and Ukrainian are added to obtain 11 languages, while the entity descriptions to be retrieved range over all 50 languages in XTREME-R (Table 9). To broaden accessibility, the mention queries, candidates and Wikipedia-based training instances are all downsampled compared to the previous work.

Table 8: Diversity indices with regard to typology and language family of XTREME-R, XTREME, and XGLUE.

Mention Extraction Viable mentions were taken as hyperlinks in the WikiNews articles pointing to Wikipedia entity pages (in any language) that could be mapped successfully to items in our base WikiData entity vocabulary. V_base is defined as the set of entities that have a Wikipedia page in any of the 50 languages in XTREME-R (|V_base| ≈ 12M). The latter condition ensures that entity descriptions are available. We also extended the WikiData filter used by Botha et al. (2020) to additionally exclude entities that are instances of Wikipedia List Pages (Q13406463) or Wikimedia (Human) Disambiguation Pages (Q22808320, Q4167410), which are not of large interest for entity linking. The resolved, viable mentions were filtered to drop duplicate surface mentions of an entity in the same article, and mentions of years (e.g. 2014), which are commonly linked in WikiNews but not of great interest. We then performed stratified sampling by both mention language and entity frequency bins, seeking uniform sizes across strata. Entity frequency is estimated as the number of times an entity is referenced on pages in the 50-language Wikipedia collection, and then binned into five intervals. The resulting set of 15,000 test mentions covers 9,647 distinct gold entities (V_gold).
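The stratified sampling step can be sketched as follows. The field names, bin edges, and per-stratum quota here are illustrative assumptions, since the exact frequency intervals used for Mewsli-X are not reproduced in this excerpt.

```python
import random
from collections import defaultdict


def stratified_sample(mentions, per_stratum, bin_edges, seed=0):
    """Sample mentions roughly uniformly across (language, frequency-bin)
    strata, in the spirit of the Mewsli-X construction.

    mentions: list of dicts with 'language' and 'entity_freq' keys.
    bin_edges: ascending boundaries defining the frequency intervals.
    per_stratum: target number of mentions to keep per stratum.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for m in mentions:
        b = sum(m["entity_freq"] > e for e in bin_edges)  # frequency bin
        strata[(m["language"], b)].append(m)
    sample = []
    for key in sorted(strata):
        group = list(strata[key])
        rng.shuffle(group)
        sample.extend(group[:per_stratum])
    return sample
```

Seeking uniform stratum sizes counteracts the natural skew of web data, where a few head entities and high-resource languages would otherwise dominate the test set.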
Candidate Set Doing vector encoding and nearest-neighbor search over a full knowledge base of millions of entities is relatively time-consuming, but searching only among V gold is also unrealistic. We thus strike a balance by defining a candidate set V cand ⊃ V gold: we sample additional items from V base \ V gold, this time stratified only by entity frequency, to obtain |V cand| = 1,000,000.
Each entity e ∈ V cand is assigned a single description: we randomly sample a language from among the L_e ≤ 50 Wikipedia pages corresponding to e and take that page's first sentence as the entity description.
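The candidate-set construction and description assignment can be sketched in the same spirit (a hypothetical re-implementation; the bin edges and helper names are our own, not the paper's code):

```python
import random
from bisect import bisect_right
from collections import defaultdict

def build_candidates(v_gold, v_base, freq, n_cand, edges=(10, 100, 1000, 10000), seed=0):
    """Grow V_gold into V_cand by sampling extra entities from
    V_base \\ V_gold, stratified by entity-frequency bin only."""
    rng = random.Random(seed)
    bins = defaultdict(list)
    for e in sorted(v_base - v_gold):
        bins[bisect_right(edges, freq[e])].append(e)
    per_bin = max(1, (n_cand - len(v_gold)) // (len(edges) + 1))
    cand = set(v_gold)
    for members in bins.values():
        rng.shuffle(members)
        cand.update(members[:per_bin])
    return cand

def pick_description(pages, rng):
    """pages maps language code -> first sentence of that language's
    Wikipedia page for the entity; pick one language at random."""
    lang = rng.choice(sorted(pages))
    return lang, pages[lang]
```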
Fine-tuning Data The fine-tuning data consists of 115K English-only (mention, entity) pairs, sampled from English Wikipedia hyperlinks that map to V base \ V cand, according to the natural distribution. The Mewsli-X evaluation setting in XTREME-R is thus doubly zero-shot: no candidate or test entities are observed during fine-tuning, nor is any non-English text.

F MULTICHECKLIST

3. If there is not a literal translation, or the same translation has already been used for another word, feel free to use a translation that is similar in meaning.

4. If the translation of the template differs based on the substituted words, please create multiple translations for the template and indicate which substituted words correspond to each. For instance, if one translation of the template assumes that some substituted words have male gender but others have female gender, create a separate translation of the template that conforms to the female gender.
Challenges of template translation We highlight some of the challenges and linguistic phenomena we encountered during the process of creating MULTICHECKLIST in the following. Unless specified otherwise, we create separate templates to disambiguate each phenomenon.
• Gender agreement: Adjectives and nationalities need to be declined to match the gender of their referring expression. To keep the translation effort manageable and avoid creating separate templates for each gender, we control for gender and restrict fill-in values to male names for the affected tests (3/6). We sample genders equally for the other tests. We welcome dedicated test suites analyzing multilingual gender bias as future extensions.
• Declension: In Russian, animal and vehicle names require accusative and nominative case in different contexts.
• Normalization: For appropriate substitution, fill-in values often need to include articles. We normalize answers and predictions by removing language-specific articles in order to ensure a consistent comparison.
• Names: Our use of names based on data in Wikidata leads to certain biases. Names that are more common in Wikidata are more likely to be chosen. In some cases, names in Wikidata are not written in the native script. Japanese names from Wikidata are often written in hiragana or katakana rather than kanji. Our choice of using the first name is also not applicable to all languages. In Japanese, people are usually not referred to by their first name, e.g. Masanori Suzuki would be called Suzuki-san instead of Masanori.
• Professions: Words for certain professions are gendered (e.g. waiter/waitress), so they only occur with male or female names.
• Question syntax: In some languages, the syntax of the question changes depending on the property or adjective one asks about.
• Syntax of adjectives: In some languages, the syntax changes depending on what adjective is used. In German, the translations of "happy", "excited", and "passionate" require different prepositions.
• Stilted language: Some text appears stilted when values are filled into the translated templates. For instance, the question "どちらの 方が冷静でないですか。" is an unusual way to do negation in Japanese; if directly translated to English, it would mean "Who is more not calm?".
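The article-stripping normalization described in the list above can be sketched as follows (a minimal sketch; the article inventories shown are illustrative and far from complete):

```python
import re
import unicodedata

# Illustrative article inventories; the real lists are language-specific
# and considerably more complete.
ARTICLES = {
    "en": {"a", "an", "the"},
    "de": {"der", "die", "das", "ein", "eine"},
    "es": {"el", "la", "los", "las", "un", "una"},
    "fr": {"le", "la", "les", "un", "une"},
}

def normalize(text, lang):
    """Lowercase, strip punctuation and extra whitespace, and drop leading
    language-specific articles before comparing answers and predictions."""
    text = unicodedata.normalize("NFKC", text).lower()
    tokens = [t for t in re.split(r"[\s\.,;:!?]+", text) if t]
    articles = ARTICLES.get(lang, set())
    while tokens and tokens[0] in articles:
        tokens.pop(0)
    return " ".join(tokens)
```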
We tried to address most of these challenges by instructing translators to create additional templates to disambiguate linguistic phenomena and by consolidating the different templates programmatically. However, as this process was relatively labor-intensive, we recommend using morphologically aware templates similar to Jiang et al. (2020) in future work. Note that morphologically aware templates may not be able to resolve some of the finer linguistic differences; for this reason, we also advise working closely with native speakers to design tests that reflect natural language as closely as possible.

Full results We show the full results of XLM-R and mBERT on the MULTICHECKLIST tests in Tables 10 and 11, respectively. mBERT shows only limited cross-lingual taxonomic knowledge: while it is able to distinguish between jobs and nationalities and between animals and vehicles in some languages, it fails to do so consistently across all languages. In addition, it completely fails to distinguish between different properties and intensifiers and is not able to perform comparisons. In contrast, while XLM-R struggles with intensifiers, it demonstrates the other capabilities much more consistently across languages.
Example failure cases We provide example failure cases of XLM-R on a subset of languages in Table 12. We will publicly release a comprehensive list of failure cases for XLM-R and mBERT, the complete tests and model outputs for further analysis.

G Hyper-parameters
mBERT We use the cased version, which covers 104 languages and has 12 layers, 768 hidden units per layer, 12 attention heads, a 110k shared WordPiece vocabulary, and 110M parameters. [12] The model was trained on Wikipedia data in all 104 languages, oversampling low-resource languages with an exponential smoothing factor of 0.7. We generally fine-tune mBERT for two epochs, with a training batch size of 32 and a learning rate of 2e-5. We build on the Transformers library (Wolf et al., 2019) for training on each task.

XLM-R We use the XLM-R Large version, which covers 100 languages, uses a 200k shared BPE vocabulary, and has been trained with masked language modelling. [13] We generally fine-tune XLM-R for two epochs with a learning rate of 3e-5 and an effective batch size of 16. We use the Transformers library for training XLM-R on all tasks.

mT5 We use the publicly released mT5-XXL version, which has nearly 13 billion parameters and a vocabulary size of 250k (Xue et al., 2021). It has been trained on the multilingual C4 (mC4) corpus, which contains 6.3 trillion tokens spanning 101 languages. [14] For all downstream tasks, we fine-tune mT5-XXL for 10k steps with a constant learning rate of 0.001, a dropout rate of 0.1, and a batch size of 2^17 tokens. For early stopping, we save checkpoints every 200 steps and choose the checkpoint with the highest performance on the validation set.
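For reference, the fine-tuning settings above can be collected into a single configuration table; a minimal sketch, with field names of our own choosing (values transcribed from this section):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FinetuneConfig:
    learning_rate: float
    batch_size: int          # effective batch size (measured in tokens for mT5)
    epochs: int = 0          # 0 if the model is trained for a fixed step count
    steps: int = 0

# Per-model fine-tuning settings as reported in this section.
CONFIGS = {
    "mbert":   FinetuneConfig(learning_rate=2e-5, batch_size=32, epochs=2),
    "xlm-r":   FinetuneConfig(learning_rate=3e-5, batch_size=16, epochs=2),
    "mt5-xxl": FinetuneConfig(learning_rate=1e-3, batch_size=2**17, steps=10_000),
}
```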

H Metadata
We intend to ask each submission to XTREME-R for relevant metadata, such as the number of parameters and the amounts of pre-training and fine-tuning data. We are doing this to enhance transparency and to increase the utility of our benchmark for practitioners with varying needs. As a first step in this direction, we provide information about the number of parameters and the amount of monolingual and parallel pre-training data used by all submissions to XTREME in Table 13. Note that the different systems report their training data in different ways (e.g., number of tokens, number of examples, or size of the data). We plan to standardize this by asking submissions to XTREME-R to report training data in terms of the number of tokens seen.

[12] https://github.com/google-research/bert/blob/master/multilingual.md
[13] https://github.com/facebookresearch/XLM
[14] https://www.tensorflow.org/datasets/catalog/c4#c4multilingual

I Language-agnostic Retrieval Results
The multiway cross-language nature of Mewsli-X and LAReQA enables closer analysis of model performance by input and target language pairs. Mewsli-X can directly be split by language pair as it has a single correct target per input mention. For LAReQA, we follow the "Limit to One Target" strategy of Roy et al. (2020): instead of asking the model to retrieve all correct answers in one pass, we evaluate on each target separately, with all the other correct answers removed from the candidate pool, allowing us to report splits by language pair. Table 14 summarizes these pairwise mAP@20 scores (here, micro-averaged), showing that XLM-R Large improves substantially over mBERT on the cross-lingual case (+38% on Mewsli-X and +137% for LAReQA) in exchange for a slight drop for the same-language case. Even so, performance on the cross-lingual case is still low at 29-36 mAP@20, and remains a challenging area for future work. Figures 3 and 4 show the detailed breakdowns.
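Since each query in this setting has exactly one gold target, mAP@20 reduces to the reciprocal rank truncated at 20. A minimal sketch of the per-language-pair aggregation (helper names are our own, not the evaluation code shipped with XTREME-R):

```python
from collections import defaultdict

def ap_at_k(ranked, gold, k=20):
    """Average precision at k when there is a single relevant item:
    1/rank if `gold` appears in the top k of `ranked`, else 0."""
    for i, cand in enumerate(ranked[:k]):
        if cand == gold:
            return 1.0 / (i + 1)
    return 0.0

def map_by_language_pair(examples, k=20):
    """examples: iterable of (query_lang, target_lang, ranked_candidates, gold_id)
    tuples, where all other correct answers were already removed from the pool
    ("limit to one target"). Returns mAP@k per (query, target) language pair."""
    scores = defaultdict(list)
    for qlang, tlang, ranked, gold in examples:
        scores[(qlang, tlang)].append(ap_at_k(ranked, gold, k))
    return {pair: sum(vals) / len(vals) for pair, vals in scores.items()}
```

Averaging over all examples within a language pair, as above, corresponds to the micro-averaging used for the pairwise tables.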

J Detailed results
We show the detailed results for each task and language in Table 15 (XNLI) and the subsequent per-task tables.

K Nuanced Multilingual Evaluation
We perform nuanced multilingual evaluation by categorizing test examples into different attribute buckets and measuring system performance on each bucket. In the following, we describe the available attributes for tasks in XTREME-R and provide additional analyses of different attributes.

K.1 Attribute Definition
QA We denote by (X_c, X_q, X_a) a tuple of the corresponding context, question, and answer, and refer to cLen, qLen, and aLen as their lengths (i.e., the number of tokens). We use BLEU (Papineni et al., 2002) to measure the lexical overlap between (X_a, X_q) and (X_q, X_c), denoted BLEU-AQ and BLEU-QC. We classify questions based on their first tokens and report the top 5 most frequent question types as qType (i.e., what, how, when, where, which), which cover 85% of questions in the training set. This yields six attributes in total.
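A minimal sketch of how the length- and type-based QA attributes could be computed (whitespace tokenization is a simplification; BLEU-AQ and BLEU-QC would additionally need a BLEU implementation and are omitted here):

```python
QUESTION_WORDS = ("what", "how", "when", "where", "which")

def qa_attributes(context, question, answer):
    """Compute cLen, qLen, aLen, and qType for one (context, question,
    answer) tuple; tokens outside the five frequent question words
    fall into an 'other' type."""
    c, q, a = context.split(), question.split(), answer.split()
    first = q[0].lower() if q else ""
    return {
        "cLen": len(c),
        "qLen": len(q),
        "aLen": len(a),
        "qType": first if first in QUESTION_WORDS else "other",
    }
```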

Structured Prediction
Given a sentence X, we define the i-th word token as x_i and a span of words in the range [i, j) as X_{i:j}. We then define five attributes: the label of a span (tag), the token length of the sentence (sLen), the token length of an entity span (eLen), the character length of an entity span (tLen), and the relative token position of an entity in the sentence (rPos).
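These span attributes are straightforward to compute; a minimal sketch assuming BIO-style tags (the tag scheme is an assumption on our part):

```python
def span_attributes(tokens, tags, start, end):
    """Attributes for the entity span tokens[start:end]; assumes BIO-style
    tags so the label is recovered from the first tag of the span."""
    span = tokens[start:end]
    return {
        "tag": tags[start].split("-")[-1],   # e.g. "B-ORG" -> "ORG"
        "sLen": len(tokens),                 # sentence length in tokens
        "eLen": end - start,                 # entity length in tokens
        "tLen": sum(len(t) for t in span),   # entity length in characters
        "rPos": start / len(tokens),         # relative token position in [0, 1)
    }
```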

K.2 Attribute Buckets
Tables 27 and 28 illustrate the single-system diagnosis of ERNIE-M and XLM-R, respectively, on the WikiANN-NER task in three languages (en, es, fr). We make the following observations.

ERNIE-M In Table 27, we first observe that the effects of some attributes on ERNIE-M are language-dependent. For example, based on the attribute rPos, the system is good at predicting entities located within the first third of English sentences, while it is relatively poor at predicting entities within the first third of sentences in the other languages. Second, the system favors long sentences based on the attribute sLen; we even observe that performance increases with sentence length on es and fr. Third, across all languages, the system performs relatively poorly at predicting long entities (eLen) and entities belonging to the organization class (tag). Finally, the system is good at predicting sentences with fewer entities, based on the entity-density attribute (eDen).

XLM-R In Table 28, we observe that the influence of some attributes such as sLen, eLen, and eDen on system performance is similar between ERNIE-M and XLM-R, although ERNIE-M is significantly better than XLM-R at generalizing its predictions on es and fr.

Table 29 shows the pairwise system analysis of ERNIE-M and T-ULRv2 for the XQuAD task. We find that although T-ULRv2 outperforms ERNIE-M overall, it is surpassed by ERNIE-M on a few buckets. For example, in zh, ERNIE-M is better at dealing with samples that have long answers, long questions, and a high lexical overlap between questions and answers. In ru, ERNIE-M is better at dealing with samples that have long answers, long questions, and lower lexical overlap between questions and answers and between questions and contexts.

K.4 EXPLAINABOARD Demonstration

Figure 5 shows the interface of EXPLAINABOARD, including the selection options for viewing fine-grained analyses of systems submitted to XTREME. Each cell in the table represents the performance of a system on a specific dataset and a specific language; relevant pieces of information, such as the paper title, are also provided.

We also demonstrate how to perform single-system and pairwise-system analysis in Figures 6 and 7, respectively. To generate a fine-grained overview, we first select models in the table and then click one of the three analysis buttons, which generates a fine-grained analysis such as in Figure 6 (single-system analysis) or Figure 7 (pairwise-system analysis).

Figure 6: Single-system analysis. Each histogram represents the fine-grained evaluation results of a given system, broken down based on a pre-defined attribute (e.g., answer length).

Figure 7: Pairwise-system analysis. Each histogram illustrates the fine-grained performance gap between system 1 and system 2, broken down by different pre-defined attributes (e.g., answer length). … who (C), when (D), and which (E), ranked by their frequencies in the training set. In the pairwise system diagnosis histogram, blue (red) x tick labels represent the bucket value of a specific attribute on which system M1 surpasses (under-performs) M2 by the largest margin, illustrated by a blue (red) bin. Blue-only x tick labels (e.g., -D) indicate that M1 outperforms M2 in all categories of an attribute.

Table 13: Metadata for the current submissions to XTREME. Note that monolingual pre-training data is reported in either number of tokens or size of the data (in GB/TB). The amount of parallel data is reported in either number of pairs or size of the data (in GB/TB).