AmericasNLI: Evaluating Zero-shot Natural Language Understanding of Pretrained Multilingual Models in Truly Low-resource Languages

Pretrained multilingual models are able to perform cross-lingual transfer in a zero-shot setting, even for languages unseen during pretraining. However, prior work evaluating performance on unseen languages has largely been limited to low-level, syntactic tasks, and it remains unclear if zero-shot learning of high-level, semantic tasks is possible for unseen languages. To explore this question, we present AmericasNLI, an extension of XNLI (Conneau et al., 2018) to 10 Indigenous languages of the Americas. We conduct experiments with XLM-R, testing multiple zero-shot and translation-based approaches. Additionally, we explore model adaptation via continued pretraining and provide an analysis of the dataset by considering hypothesis-only models. We find that XLM-R’s zero-shot performance is poor for all 10 languages, with an average performance of 38.48%. Continued pretraining offers improvements, with an average accuracy of 43.85%. Surprisingly, training on poorly translated data by far outperforms all other methods with an accuracy of 49.12%.


Introduction
Pretrained multilingual models such as XLM (Lample and Conneau, 2019), multilingual BERT (mBERT; Devlin et al., 2019), and XLM-R (Conneau et al., 2020) achieve strong cross-lingual transfer results for many languages and natural language processing (NLP) tasks. However, there exists a discrepancy in terms of zero-shot performance between languages present in the pretraining data and those that are not: performance is generally highest for well-represented languages and decreases with less representation. Yet, even for unseen languages, performance is generally above chance, and model adaptation approaches have been shown to yield further improvements (Muller et al., 2020; Pfeiffer et al., 2020a,b; Wang et al., 2020). Practically all work evaluating the zero-shot performance of models on languages not seen during pretraining is currently limited to low-level, syntactic tasks such as part-of-speech tagging, dependency parsing, and named-entity recognition (Muller et al., 2020; Wang et al., 2020). This is due to the fact that most multilingual datasets for high-level, semantic tasks only cover languages which are well-resourced enough to already be contained in the pretraining data. This limits our ability to draw more general conclusions with regards to the zero-shot learning abilities of pretrained multilingual models for unseen languages.
In order to make such an evaluation possible, we introduce AmericasNLI, an extension of XNLI (Conneau et al., 2018) - a natural language inference (NLI; cf. §2.3) dataset covering 15 high-resource languages - to 10 indigenous languages spoken in the Americas: Asháninka, Aymara, Bribri, Guarani, Náhuatl, Otomí, Quechua, Rarámuri, Shipibo-Konibo, and Wixarika. All of them are truly low-resource languages: they have small or no Wikipedia corpora, and they are not present in the training data of current state-of-the-art pretrained multilingual models. This dataset enables us to address the following research questions (RQs): (1) How do existing multilingual models perform on unseen languages as compared to their performance on XNLI? (2) Do methods aimed at adapting models to unseen languages - previously evaluated exclusively on low-level, syntactic tasks - also increase performance on NLI?
We experiment with XLM-R, both with and without model adaptation via continued pretraining on monolingual corpora in the target language. Our results show that the performance of XLM-R out-of-the-box is moderately above chance, and model adaptation leads to improvements of up to 5.88 percentage points. Training on machine-translated training data, however, results in an even larger performance gain of 10.1 percentage points over the corresponding XLM-R model without adaptation. We further examine potential artifacts which may have been inherited from XNLI through experiments with hypothesis-only models, and find that their performance is above chance for most models, but still below that of models using the full example.
AmericasNLI is publicly available at nala-cub.github.io/resources. We hope that it will serve as a benchmark for measuring the zero-shot natural language understanding abilities of multilingual models for unseen languages. Additionally, we hope that our dataset will motivate the development of novel pretraining and model adaptation techniques which are suitable for truly low-resource languages.

Pretrained Multilingual Models
Pretrained multilingual models follow the standard pretraining-finetuning paradigm: they are first trained on unlabeled monolingual corpora from various languages (the pretraining languages) and later finetuned on target-task data in a - usually high-resource - source language. Having been exposed to a variety of languages through this training setup, these models achieve cross-lingual transfer results competitive with the state of the art for many languages and tasks. Commonly used models are mBERT (Devlin et al., 2019), which is pretrained on the Wikipedias of 104 languages with masked language modeling (MLM) and next sentence prediction (NSP), and XLM, which is trained on 15 languages and introduces the translation language modeling objective, a variant of MLM that uses pairs of parallel sentences. XLM-R improves over XLM and is trained on data from 100 different languages with only the MLM objective. Common to all models is a large shared subword vocabulary created using either byte-pair encoding (Sennrich et al., 2016) or SentencePiece (Kudo and Richardson, 2018) tokenization.

Evaluating Pretrained Multilingual Models
Just as in the monolingual setting, where benchmarks such as GLUE (Wang et al., 2018) and SuperGLUE (Wang et al., 2019) provide a look into the performance of models across various tasks, multilingual benchmarks (Hu et al., 2020; Liang et al., 2020) cover a wide variety of tasks involving sentence structure, classification, retrieval, and question answering. Evaluation is based on zero-shot transfer from English, providing a strong benchmark for cross-lingual transfer. Additional work has examined which mechanisms allow multilingual models to transfer across languages (Pires et al., 2019; Wu and Dredze, 2019). Wu and Dredze (2020) examine transfer performance as a function of a language's representation in the pretraining data. For languages with low representation, multiple methods have been proposed to improve performance, including extending the vocabulary, transliterating the target text, and continuing pretraining before finetuning (Lauscher et al., 2020; Chau et al., 2020; Muller et al., 2020; Pfeiffer et al., 2020a,c; Wang et al., 2020). In this work, we focus on continued pretraining to analyze the performance of model adaptation for a high-level, semantic task.

Natural Language Inference
Given two sentences, the premise and the hypothesis, the task of NLI consists of determining whether the hypothesis logically entails, contradicts, or is neutral with respect to the premise. The most widely used datasets for NLI in English are SNLI (Bowman et al., 2015) and MNLI (Williams et al., 2018). XNLI (Conneau et al., 2018) is the multilingual expansion of MNLI to 15 languages, providing manually translated evaluation sets and machine-translated training sets. While datasets for NLI or the similar task of recognizing textual entailment exist for other languages (Bos et al., 2009; Alabbas, 2013; Eichler et al., 2014; Amirkhani et al., 2020), the lack of similarity between these datasets prevents a generalized study of cross-lingual zero-shot performance. This is in contrast to XNLI, where examples for all 15 languages are parallel. To preserve this property of XNLI, when creating AmericasNLI, we choose to translate Spanish XNLI as opposed to creating examples directly in the target language.
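To make the task format concrete, the following minimal sketch shows a single example in the XNLI format; the sentences are invented for illustration and do not come from the dataset.

```python
# An invented NLI example in the XNLI format (not taken from the dataset).
example = {
    "premise": "El hombre está cocinando la cena para su familia.",
    "hypothesis": "El hombre está preparando comida.",
    "label": "entailment",  # one of: entailment, neutral, contradiction
}
```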
However, NLI datasets are not without issues: Gururangan et al. (2018) show that artifacts from the creation of MNLI allow models to classify examples based only on the hypothesis, suggesting that models may not be reasoning as expected. Motivated by this, we provide further analysis of AmericasNLI in Section 6 by comparing the performance of hypothesis-only models to models trained on full examples.

Data Collection Setup
AmericasNLI is the translation of a subset of XNLI (Conneau et al., 2018). As translators between Spanish and the target languages are more frequently available than those for English, we translate from the Spanish version. Additionally, some translators reported that code-switching is often used to describe certain topics; while many words without an exact equivalent in the target language can be worked in through translation or interpretation, others are kept in Spanish. To minimize the amount of Spanish vocabulary in the translated examples, we choose sentences from genres that we judged to be relatively easy to translate into the target languages: "face-to-face," "letters," and "telephone." We choose up to 750 examples from each of the development and test sets, with exact counts for each language in Table 1.

Languages
We now discuss the languages in AmericasNLI. For additional background on previous NLP research on indigenous languages of the Americas, we refer the reader to Mager et al. (2018).
Aymara Aymara is a polysynthetic Amerindian language spoken in Bolivia, Chile, and Peru by over two million people (Homola, 2012). Aymara has multiple dialects, including Northern and Southern Aymara: the former is spoken on the southern Peruvian shore of Lake Titicaca and around La Paz, while the latter is spoken in the eastern half of the Iquique province in northern Chile, the Bolivian department of Oruro, northern Potosi, and southwest Cochabamba. However, Southern Aymara is slowly being replaced by Quechua in the last two regions. A rare linguistic phenomenon found in Aymara is vowel elision, an omission of various sounds in a language. Aymara has an SOV word order. AmericasNLI examples are translated into the Central Aymara variant, specifically Aymara La Paz.
Asháninka Asháninka is an Amazonian language from the Arawak family, spoken in Central and Eastern Peru, in a geographical region located between the eastern foothills of the Andes and the western fringe of the Amazon basin (Mihas, 2017). A national-scale census in 2017 revealed a population of 73,567 speakers. While Asháninka in a strict sense refers to the linguistic varieties spoken along the Ene, Tambo, and Bajo Perené rivers, the name is also used to refer to the following nearby and closely related Asheninka varieties: Alto Perené, Pichis, Pajonal, Ucayali-Yurua, and Apurucayali.
Although it is the most widely spoken Amazonian language in Peru, certain varieties, such as Alto Perené, are highly endangered. Asháninka is an agglutinating and polysynthetic language with a VSO word order. The verb is the most morphologically complex word class, with a rich repertoire of aspectual and modal categories. The language lacks case, except for one locative suffix, so the grammatical relations of subject and object are indexed as affixes on the verb itself. Other notable linguistic features of the language include obligatory marking of a realis/irrealis distinction on the verb, a rich system of applicative suffixes, serial verb constructions, and a pragmatically conditioned split intransitivity. Code-switching with Spanish or Portuguese is a regular practice in everyday dialogue.
Bribri Bribri is a Chibchan language spoken by 7000 people in Southern Costa Rica (INEC, 2011). It has three dialects, and it is still spoken by children. However, it is a vulnerable language (Moseley, 2010; Sánchez Avendaño, 2013), which means that there are few settings where the language is written or used in official functions. The language does not have official status and it is not the main medium of instruction of Bribri children, but it is offered as a class in primary and secondary schools. Bribri is a tonal language, with fusional morphology, SOV syntax and an ergative-absolutive case system. Bribri grammar also includes phenomena like head-internal relative clauses, directional verbs and numerical classifiers (Jara Murillo, 2018a).
There are several orthographies which use different diacritics for the same phenomena. For example, the Constenla et al. (2004) system marks nasal vowels with a line underneath the vowel, whereas the Jara Murillo and García Segura (2013) system does this with a tilde, e.g., e'/ẽ' (there). Even for researchers who use the same orthography, the Unicode encoding of similar diacritics differs amongst authors (e.g., the combining low line, minus sign and macron are all found for the nasal marking).
The dialects of Bribri differ in their exact vocabularies, e.g., ñalà/ñolò (road), and there are phonological processes, like the deletion of unstressed vowels, which also change the tokens found in texts, e.g., dakarò/krò (chicken). In addition, Bribri has only been a written language for about 40 years, so there are very few people who produce written materials in the language, and existing materials have a large degree of idiosyncratic variation. These variations are standardized in AmericasNLI, which is written in the Amubri variant.
Guarani Guarani is spoken by 6 to 10 million people in South America. Roughly 3 million people use it as their main language, including more than 10 native nations in Paraguay, Brazil, Argentina, and Bolivia, along with Paraguayan, Argentinian, and Brazilian peoples. According to the Paraguayan Census, in 2002 there were around 1.35 million monolingual speakers, a number which has since increased to around 1.5 million (Dos Santos, 2017; Melià, 1992). Although the use of Guarani as a spoken language is much older, the first written record dates to 1591 (a catechism), followed by the first dictionary in 1639 and linguistic descriptions in 1640. Guarani usage in text continued until the Paraguay-Triple Alliance War (1864-1870) and declined thereafter. However, from the 1920s on, Guarani has slowly re-emerged and received renewed focus. In 1992, Guarani was the first American language declared an official language of a country, followed by a surge of local, national, and international recognition in the early 21st century. The official grammar of the Guarani language was approved in 2018. Guarani is an agglutinative language, with ample use of prefixes and suffixes. Code-switching with Spanish or Portuguese is common among speakers.
Náhuatl Náhuatl belongs to the Nahuan subdivision of the Uto-Aztecan language family. There are 30 recognized variants of Náhuatl spoken by over 1.5 million speakers across 17 different states of Mexico, where Náhuatl is recognized as an official language (SEGOB, 2021). Náhuatl is polysynthetic and agglutinative; different roots with or without affixes are combined to form new words. The suffixes that are added to a word modify the meaning of the original word (Sullivan and León-Portilla, 1976), and 18 prepositions stand out based on postpositions of names and adjectives (Siméon, 1977). In Náhuatl, many sentences have an SVO structure or, for emphasis, an OVS structure (MacSwan, 1998). The translations in AmericasNLI belong to the Central Náhuatl (Náhuatl de la Huasteca) dialect. As there is a lack of consensus regarding the orthographic standard, the orthography is normalized to a version similar to Classical Náhuatl.
Otomí Otomí belongs to the Oto-Pamean language family and has nine linguistic variants with different regional self-denominations, such as nähñu or ñähño, hñähñu, ñuju, ñoju, yühu, hnähño, nühú, ñanhú, ñöthó, ñható and hñothó (INALI, 2014). There are around 307,928 speakers spread across 7 Mexican states. In the state of Tlaxcala, the yuhmu or ñuhmu variant is spoken by less than 100 speakers, and we use this variant for the Otomí examples in AmericasNLI. Otomí is a tonal language and many words are homophonous to Spanish (Cajero, 1998, 2009). When speaking ñuhmu, pronunciation is elongated, especially on the last syllable. In this variant there are 13 pronunciations, and each is clearly marked in writing, where the alphabet is composed of 19 consonants, 12 vowel phonemes, and characters formed from the combination of consonants, cedillas, and circumflex accents (Cajero, 1998). Words follow an SVO order, with affirmative, negative, interrogative, exclamatory, and imperative sentences, which are simple, compound, and complex (Cajero, 1998; Lastra and de Suárez, 1997).
Quechua Quechua, or Runasimi, is an indigenous language family spoken by the Quechua peoples that primarily live in the Peruvian Andes. Derived from an ancestral language, it is the most widely spoken pre-Columbian language family of the Americas. It has around 8-10 million speakers, and approximately 25% (7.7 million) of Peruvians speak a Quechuan language. Historically, Quechua was the main language family during the Incan Empire, and it was spoken until the Peruvian struggle for independence from Spain during the 1780s. Currently, many variants of Quechua are widely spoken and it is the co-official language of many regions in Peru.
There are multiple subdivisions of Quechua, including Southern, Northern, and Central Quechua. AmericasNLI examples are translated into the standard version of Southern Quechua, Quechua Chanka, also known as Quechua Ayacucho, which is spoken in different regions of Peru and can be understood in different areas of other countries, such as Bolivia or Argentina. In the translations of AmericasNLI, the apostrophe and the pentavocalism from other regions are not used.

Rarámuri The Rarámuri language, also known as Tarahumara, which means light foot (INALI, 2017), belongs to the Taracahitan subgroup of the Uto-Aztecan language family (Goddard, 1996). Rarámuri is an official language of Mexico, spoken mainly in the Sierra Madre Occidental region in the state of Chihuahua by a total of 89,503 speakers (SEGOB, 2021). Rarámuri is a polysynthetic language, characterized by a head-marking structure (Nichols, 1986).

Shipibo-Konibo Shipibo-Konibo is a Panoan language spoken by around 35,000 native speakers in the Amazon region of Peru. It is a language with agglutinative processes, a majority of which are suffixes. However, clitics are also used, and are a widespread element in Panoan literature (Valenzuela, 2003). Shipibo-Konibo uses an SOV word order (Faust, 1973) and postpositions (Vasquez et al., 2018). The translations in AmericasNLI make use of the official alphabet and standard writing supported by the Ministry of Education in Peru.

Wixarika
The Wixarika, or Huichol, language, meaning the language of the doctors and healers (Lumholtz, 2011), is a language in the Corachol subgroup of the Uto-Aztecan language family (Campbell, 2000). Wixarika is a national language of Mexico with four variants: Northern, Southern, Eastern, and Western (INEGI, 2008). It is spoken mainly in the three Mexican states of Jalisco, Nayarit, and Durango, with a total of around 47,625 speakers (INEGI, 2001). Wixarika is a polysynthetic language with head-marking (Nichols, 1986), a head-final structure (Greenberg, 1963), nominal incorporation, argumentative marks, inflected adpositions, possession marks, as well as instrumental and directional affixes (Iturrioz and Gómez-López, 2008). Wixarika follows an SOV word order, and lexical borrowing from and code-switching with Spanish are commonly used. Translations in AmericasNLI are in Northern Wixarika and use an orthography common among native speakers (Mager-Hois, 2017).

Experiments
In this section, we detail the experimental setup we use to evaluate the performance of various approaches on AmericasNLI.

Zero-Shot Learning
Pretrained Model We use XLM-R (Conneau et al., 2020) as the pretrained multilingual model in our experiments. The architecture of XLM-R is based on RoBERTa (Liu et al., 2019), and it is trained using MLM on web-crawled data in 100 languages. It uses a shared vocabulary consisting of 250k subwords, created using SentencePiece (Kudo and Richardson, 2018) tokenization. We use the Base version of XLM-R for our experiments.
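As a minimal sketch (not the authors' exact code), the Base checkpoint can be loaded through the Transformers library under its standard identifier xlm-roberta-base, and the behavior of the shared SentencePiece vocabulary on text in an unseen language can then be inspected directly; the example word below is chosen purely for illustration.

```python
# A minimal sketch of loading XLM-R Base and inspecting its subword vocabulary;
# the exact loading path used in the experiments is an assumption.
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")

print(len(tokenizer))  # shared vocabulary of roughly 250k subwords
# Text in an unseen language is still segmented into known subwords,
# often into many short pieces (word chosen for illustration only).
print(tokenizer.tokenize("Kachkaniraqmi"))
```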
Adaptation Methods To adapt XLM-R to the various target languages, we continue training with the MLM objective on monolingual text in the target language before finetuning. To keep a fair comparison with other approaches, we only use target data which was also used to train the translation models, which we describe in Section 4.2. However, we note that one benefit of continued pretraining for adaptation is that it does not require parallel text, and could therefore benefit from text which could not be used for a translation-based approach. For continued pretraining, we use a batch size of 32 and a learning rate of 2e-5. We train for a total of 40 epochs. Each adapted model starts from the same version of XLM-R, and is adapted individually to each target language, which leads to a different model for each language. We denote models adapted with continued pretraining as +MLM.
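A minimal sketch of this adaptation step is shown below, assuming the Hugging Face Trainer API and a plain-text file of monolingual target-language sentences; the file and output paths are placeholders, and whether the authors used this exact training loop is an assumption.

```python
# A hedged sketch of continued pretraining (+MLM) on monolingual target-language
# text before finetuning; paths are placeholders.
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, LineByLineTextDataset,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")

# Monolingual target-language text, one sentence per line (hypothetical path).
dataset = LineByLineTextDataset(tokenizer=tokenizer,
                                file_path="mono.target.txt", block_size=128)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True,
                                           mlm_probability=0.15)

args = TrainingArguments(output_dir="xlmr-mlm-target",  # one adapted model per language
                         per_device_train_batch_size=32,
                         learning_rate=2e-5,
                         num_train_epochs=40)
Trainer(model=model, args=args,
        train_dataset=dataset, data_collator=collator).train()
```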
Finetuning To finetune XLM-R, we follow the approach of Devlin et al. (2019) and use an additional linear layer. We train on either the English MNLI data or the machine-translated Spanish data, and we call the final models XLM-R (en) and XLM-R (es), respectively. Following Hu et al. (2020), we use a batch size of 32 and a learning rate of 2e-5. We train for a maximum of 5 epochs, and evaluate performance every 625 steps on the XNLI development set corresponding to the finetuning language. We employ early stopping with a patience of 15 evaluation steps and use the best performing checkpoint for the final evaluation. All finetuning is done using the Huggingface Transformers library (Wolf et al., 2020) with two Nvidia V100 GPUs.
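The sketch below illustrates this finetuning setup with the Transformers Trainer, using the standard Hugging Face dataset identifiers multi_nli and xnli; the exact script, early-stopping metric, and data-loading path used in the experiments are assumptions.

```python
# A hedged sketch of finetuning XLM-R (en) on MNLI with early stopping on the
# English XNLI development set; hyperparameters follow the values given above.
import numpy as np
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          EarlyStoppingCallback, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("xlm-roberta-base",
                                                           num_labels=3)

def encode(batch):
    return tokenizer(batch["premise"], batch["hypothesis"],
                     truncation=True, max_length=128)

train = load_dataset("multi_nli", split="train").map(encode, batched=True)
dev = load_dataset("xnli", "en", split="validation").map(encode, batched=True)

def accuracy(eval_pred):
    logits, labels = eval_pred
    return {"accuracy": float((np.argmax(logits, axis=-1) == labels).mean())}

args = TrainingArguments(
    output_dir="xlmr-nli-en",
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    num_train_epochs=5,
    evaluation_strategy="steps", eval_steps=625,
    save_strategy="steps", save_steps=625,
    load_best_model_at_end=True, metric_for_best_model="accuracy",
)
Trainer(model=model, args=args, train_dataset=train, eval_dataset=dev,
        tokenizer=tokenizer, compute_metrics=accuracy,
        callbacks=[EarlyStoppingCallback(early_stopping_patience=15)]).train()
```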

Translation-based Approaches
We also experiment with two translation-based approaches, translate-train and translate-test, detailed below along with the translation models used.
Translation Models For our translation-based approaches, we train two sets of translation models: one to translate from Spanish into the target language, and one in the opposite direction. We use transformer sequence-to-sequence models (Vaswani et al., 2017) with the hyperparameters proposed by Guzmán et al. (2019). We employ the same model architecture for both translation directions, and we measure translation quality in terms of BLEU (Papineni et al., 2002) and ChrF (Popović, 2015), cf. Table 4. We use fairseq to implement all translation models; the code for the translation models can be found at https://github.com/AmericasNLP/americasnlp2021.

Translate-train For the translate-train approach, the Spanish training data provided by XNLI is translated into each target language. It is then used to finetune XLM-R for each language individually. Along with the training data, we also translate the Spanish development data, which is used for validation and early stopping. Notably, we find that the finetuning hyperparameters defined above do not reliably allow the model to converge for many of the target languages. To find suitable hyperparameters, we tune the batch size and learning rate by conducting a grid search over {5e-6, 2e-5, 1e-4} for the learning rate and {32, 64, 128} for the batch size. In order to select hyperparameters which work well across all languages, we evaluate each run using the average performance on the machine-translated Aymara and Guarani development sets, as these languages have moderate and high translation quality, respectively. We find that decreasing the learning rate to 5e-6 and keeping the batch size at 32 yields the best performance. Other than the learning rate, we use the same approach as for zero-shot finetuning.
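The following sketch outlines the hyperparameter search described above; finetune_translate_train and dev_accuracy are hypothetical helpers standing in for the finetuning and evaluation routines, not functions from the released code.

```python
# A hedged sketch of the translate-train grid search: each (learning rate,
# batch size) pair is scored by the average accuracy on the machine-translated
# Aymara (aym) and Guarani (gn) development sets.
import itertools

def finetune_translate_train(language, lr, batch_size):
    """Placeholder: finetune XLM-R on the translated training data for `language`."""
    raise NotImplementedError

def dev_accuracy(model, language):
    """Placeholder: accuracy on the machine-translated dev set for `language`."""
    raise NotImplementedError

best_score, best_config = -1.0, None
for lr, bs in itertools.product([5e-6, 2e-5, 1e-4], [32, 64, 128]):
    scores = []
    for language in ["aym", "gn"]:
        model = finetune_translate_train(language, lr, bs)
        scores.append(dev_accuracy(model, language))
    if sum(scores) / len(scores) > best_score:
        best_score, best_config = sum(scores) / len(scores), (lr, bs)

print("selected:", best_config)  # the search above selected lr=5e-6, batch size 32
```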
Translate-test For the translate-test approach, we translate the test sets of each target language into Spanish. This allows us to apply the model finetuned on Spanish, XLM-R (es), to each test set. Additionally, a benefit of translate-test over translate-train and the adapted XLM-R models is that we only need to finetune once overall, as opposed to once per language. For evaluation, we use the checkpoint with the highest performance on the Spanish XNLI development set.
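A minimal sketch of translate-test inference follows; the target-language premise and hypothesis are assumed to have already been machine-translated into Spanish with the translation models described above, and xlmr-nli-es is a hypothetical local path to the Spanish-finetuned checkpoint.

```python
# A hedged sketch of translate-test inference with the Spanish-finetuned model;
# the checkpoint path and example sentences are illustrative placeholders.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlmr-nli-es")
model = AutoModelForSequenceClassification.from_pretrained("xlmr-nli-es")
model.eval()

premise_es = "Un hombre toca la guitarra en la calle."  # translated premise (illustrative)
hypothesis_es = "Alguien está haciendo música."         # translated hypothesis (illustrative)

inputs = tokenizer(premise_es, hypothesis_es, return_tensors="pt", truncation=True)
with torch.no_grad():
    label_id = model(**inputs).logits.argmax(dim=-1).item()
print(label_id)  # predicted label index (0/1/2)
```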

Results and Discussion
Zero-shot Models We present our results in Table 5. Zero-shot performance is low for all 10 languages, with an average accuracy of 38.17% and 38.62% for the English and Spanish model, respectively. However, in all cases the performance is higher than the majority baseline. As shown in Table A.3 in the appendix, the same models achieve an average accuracy of 74.65% and 75.58%, respectively, when evaluated on the 15 XNLI languages. Thus, answering RQ 1, we conclude that zero-shot tasks are much harder for our models if the target language is unseen. Interestingly, even though code-switching with Spanish is encountered in many target languages, finetuning on Spanish labeled data only slightly outperforms the model trained on English and does not improve consistently across languages: performance is better for only five of the languages. The English model achieves its highest accuracy of 42.28% when evaluated on Náhuatl, while the Spanish model achieves its highest accuracy of 41.60% when evaluated on Bribri. The lowest performance is achieved when evaluating on Quechua and Rarámuri, for the English and Spanish model, respectively.
Turning to RQ 2, we find that model adaptation via continued pretraining improves both models, with an average gain of 5.88 percentage points for English and 4.98 percentage points for Spanish. Notably, continued pretraining increases performance for Quechua by 28.8 percentage points when finetuning on English, and 19.87 points when finetuning on Spanish. Performance only decreases for Bribri (in both cases) and for Wixarika when using Spanish data.
Translate-test Performance of the translate-test model improves over both zero-shot baselines. We see the largest increase in performance for Guarani and Quechua, with gains of 8.53 and 11.47 points, respectively, over the best-performing zero-shot model without adaptation. Considering the translation metrics in Table 4, the models for Guarani and Quechua achieve the two highest scores for both metrics. Interestingly, translate-test performance is lower than zero-shot performance for Asháninka and Otomí. While these two languages do not have the lowest translation performance, other languages with similar translation quality achieve similar or higher scores than their zero-shot counterparts. On average, translate-test does worse than the adapted zero-shot models, and in all but two cases both adapted models perform better than translate-test.

Translate-train
The most surprising result is that of translate-train, which considerably outperforms translate-test for all languages, and outperforms the zero-shot models for all but two languages. Compared to the best non-adapted zero-shot model, the largest performance gain is 18.94 points for Quechua. For the language with the lowest performance, Otomí, translate-train performs 2.54 points worse than zero-shot; however, it still outperforms translate-test. When averaged across all languages, translate-train outperforms the Spanish zero-shot model by 10.10 points, and translate-test by 8.18 points. It is important to note that the translation performance from Spanish to each target language is not particularly high: the highest ChrF score is 0.33, and the highest BLEU score is 3.26. The accuracies of both translation-based models correlate with ChrF scores, with Pearson correlation coefficients of 0.79 and 0.90 for translate-train and translate-test, respectively. Correlations are not as strong for BLEU, with coefficients of 0.25 and 0.58.
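The correlation analysis above amounts to a standard Pearson correlation over per-language (translation quality, accuracy) pairs; the sketch below uses placeholder numbers rather than the actual values from Tables 4 and 5.

```python
# A sketch of the correlation between translation quality and NLI accuracy;
# the lists are placeholders, not the paper's per-language results.
from scipy.stats import pearsonr

chrf_scores = [0.21, 0.26, 0.29, 0.30, 0.33]  # placeholder ChrF values
accuracies = [41.7, 45.1, 48.3, 49.5, 52.0]   # placeholder accuracies (%)

r, p_value = pearsonr(chrf_scores, accuracies)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")
```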
The sizable difference between translate-train and the other methods suggests that translation-based approaches may be a valuable asset for cross-lingual transfer, especially for low-resource languages. While the largest downsides to this approach are the requirements for parallel data and for multiple models, the potential performance gain over other approaches may prove worthwhile. Additionally, we believe that the performance of both translation-based approaches would improve given a stronger translation system, and future work detailing the level of translation quality necessary for the best performance would be of great practical use for NLP applications for low-resource languages.

Hypothesis-only Models
As shown by Gururangan et al. (2018), SNLI and MNLI - the datasets AmericasNLI is based on - contain artifacts created during the annotation process which models can exploit to artificially inflate performance. To analyze whether similar artifacts exist in AmericasNLI and whether they can also be exploited, we train and evaluate models using only the hypothesis for the five languages with the highest performance in the standard setting, averaged across all approaches.
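A minimal sketch of the hypothesis-only transformation is given below; whether the premise is replaced by an empty string or dropped from the input entirely is an assumption, as both realize the same idea from Gururangan et al. (2018).

```python
# A hedged sketch of the hypothesis-only setup: the premise is emptied so the
# classifier can only rely on the hypothesis.
def make_hypothesis_only(example):
    example = dict(example)
    example["premise"] = ""
    return example

example = {"premise": "El hombre está cocinando.",
           "hypothesis": "Alguien cocina.",
           "label": 0}
print(make_hypothesis_only(example))
# With a Hugging Face dataset, the same transform can be applied via
# dataset.map(make_hypothesis_only) before tokenization.
```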
Most importantly, as displayed in Table 6, the average performance across languages is better than chance for all models except for XLM-R without adaptation. Translate-train obtains the highest result with 47.35% accuracy. Thus, similarly to SNLI and MNLI, artifacts in the hypotheses can be used to predict, to some extent, the correct labels.
However, as shown in Table A.1 in the appendix, in all but two cases the performance of hypothesis-only models is lower than that of the standard ones. This indicates that the models are learning something beyond just exploiting artifacts in the hypotheses, even though AmericasNLI is a zero-shot task with the additional challenge that the languages are unseen during pretraining.

Early Stopping
Early stopping is vital to prevent overfitting in deep learning models. However, in the case of zero-shot learning, to truly mimic a realistic scenario, hand-labeled development sets in the target language cannot be used (Kann et al., 2019). Therefore, in our main experiments, when finetuning on a high-resource language, we use a development set in that language for early stopping. For translate-train, we translate the source language's development set into the target language. In both cases, performance on the development set is an imperfect signal for how the model will ultimately perform. To explore how this affects final model performance, we present the difference in results for translate-train models when an oracle translation is used for early stopping in Table 7. We find that performance is 2.74 points higher on average, with a maximum difference of 6.93 points for Asháninka. Thus, creating ways to better approximate a development set in the target language might be useful for achieving higher performance.

Table 7: Difference between translate-train results obtained using the oracle development set and the translated development set for early stopping.

Conclusion
To better understand the zero-shot abilities of pretrained multilingual models for semantic tasks in unseen languages, we present AmericasNLI, a parallel NLI dataset covering 10 low-resource languages indigenous to the Americas.
We conduct experiments with XLM-R and find that the model's zero-shot performance, while better than a majority baseline, is poor. However, it can be improved by model adaptation via continued pretraining. Additionally, we find that translation-based approaches outperform a zero-shot approach, which is surprising given the low quality of the employed translation systems. We hope that this work will not only spur further research into improving model adaptation to unseen languages, but also motivate the creation of more resources for languages not frequently studied by the NLP community.

Table A.3: XNLI results for zero-shot models. Scores are underlined when the same language used for training is used for evaluation as well.