Few-shot Learning with Multilingual Generative Language Models

Large-scale generative language models such as GPT-3 are competitive few-shot learners. While these models are known to be able to jointly represent many different languages, their training data is dominated by English, potentially limiting their cross-lingual generalization. In this work, we train multilingual generative language models on a corpus covering a diverse set of languages, and study their few- and zero-shot learning capabilities on a wide range of tasks. Our largest model with 7.5 billion parameters sets a new state of the art in few-shot learning in more than 20 representative languages, outperforming GPT-3 models of comparable size in multilingual commonsense reasoning (with +7.4% absolute accuracy improvement in 0-shot settings and +9.4% in 4-shot settings) and natural language inference (+5.4% in each of the 0-shot and 4-shot settings). On the FLORES-101 machine translation benchmark, our model outperforms GPT-3 on 171 out of 182 directions with 32 training examples, while surpassing the official supervised baseline in 45 directions. We conduct an in-depth analysis of different multilingual prompting approaches, showing in particular that strong few-shot learning performance across languages can be achieved via cross-lingual transfer through both templates and demonstration examples.


Introduction
Large autoregressive language models such as GPT-3 can be adapted, via few- and zero-shot learning, to a wide range of tasks at significantly less cost than full fine-tuning (Brown et al., 2020; Bommasani et al., 2021). These models have been developed primarily for English. Although the training data of GPT-3 contains a small percentage of non-English text (7%), allowing it to achieve some promising cross-lingual generalization, the model is almost exclusively deployed for use cases in English. Multilingual masked and sequence-to-sequence language models have been studied, including mBERT, XLM-R, mT5, and mBART (Devlin et al., 2019; Conneau et al., 2020; Xue et al., 2020; Fedus et al., 2021; Goyal et al., 2021a; Liu et al., 2020). These models are typically fine-tuned on large amounts of labeled data in downstream tasks. Despite notable recent work at smaller scales (Zhao and Schütze, 2021) and for domain-specific tasks (Winata et al., 2021), the multilingual few-shot learning capabilities of language models are less well understood.
In this paper, we train four multilingual generative language models (up to 7.5 billion parameters), the XGLM models, and present a comprehensive study of multilingual zero- and in-context few-shot learning. We train the models on a large-scale corpus of 500B tokens that comprises 30 diverse languages, up-sampling the less-resourced languages to render a more balanced language representation. We evaluate the models on multiple multilingual natural language understanding (NLU) tasks, machine translation, and a subset of the English tasks demonstrated in Brown et al. (2020).
We find that XGLM demonstrates strong cross-lingual capability: using English prompts together with non-English examples yields competitive zero- and few-shot learning performance. Our largest model (XGLM 7.5B) achieves strong zero- and few-shot learning performance on language completion and inference tasks (e.g. XStoryCloze: 65.4% 0-shot, 66.5% 4-shot; XNLI: 46.3% 0-shot, 47.3% 4-shot). It also establishes a new state of the art on few-shot machine translation across a large number of language pairs in the FLORES-101 benchmark (Goyal et al., 2021b), significantly outperforming the GPT-3 model of comparable size (6.7 billion parameters). On the other hand, multilingual pre-training causes a performance drop on English tasks: on 8 English NLU tasks, XGLM 7.5B underperforms GPT-3 6.7B by 10.9% on average in zero-shot learning. GPT-3 6.7B also surpasses XGLM 7.5B in machine translation on several high-resource language pairs, including WMT-14 en↔fr, WMT-16 en↔de and WMT-19 en↔zh.
We conduct an in-depth analysis of different multilingual prompting approaches, examining cross-lingual transfer through templates and through demonstration examples. We show that non-English prompts sometimes yield unexpectedly low zero- and few-shot learning accuracy even when they are crafted by native speakers (§4.3). Both using the English template (§4.4) and adding demonstration examples (§4.5) provide an effective remedy. However, using demonstration examples from another language often does not improve performance further once a strong prompting language (e.g. English) is used, which indicates room for improvement in cross-lingual pre-training and in-context transfer approaches.

Pre-training Data
Language selection and pre-processing. We extend the pipeline used for mining the CC100 corpus (Conneau et al., 2020; Wenzek et al., 2020) to generate CC100-XL, a significantly larger multilingual dataset covering 68 Common Crawl (CC) snapshots (from Summer 2013 to March/April 2020) and 134 languages. Our pre-training data includes 30 languages covering 16 language families. The natural data distribution is skewed, with the number of English tokens being 6 times that of the second-largest language. Following previous work on multilingual pre-training (Conneau et al., 2020; Liu et al., 2020), we up-sampled the medium- and low-resource languages to create a more balanced language distribution (Appendix F.1). Figure 1 shows the language distribution of our pre-training data before (blue) and after (green) up-sampling.
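To make the re-balancing concrete, here is a minimal Python sketch of temperature-based up-sampling in the style of Conneau et al. (2020); the exponent α = 0.3 mirrors the vocabulary-sampling setting below and is used here purely for illustration (the exact ratios we use are listed in Appendix F.1).

```python
# A minimal sketch of temperature-based language re-balancing in the style of
# Conneau et al. (2020). alpha = 0.3 is an illustrative assumption here.

def rebalanced_distribution(token_counts: dict, alpha: float = 0.3) -> dict:
    """Up-sample low-resource languages: p_i proportional to q_i ** alpha."""
    total = sum(token_counts.values())
    q = {lang: n / total for lang, n in token_counts.items()}  # natural shares
    weights = {lang: share ** alpha for lang, share in q.items()}
    z = sum(weights.values())
    return {lang: w / z for lang, w in weights.items()}

# Example: English dominates the raw counts 6:1 over the next language;
# after re-weighting, the gap shrinks substantially.
print(rebalanced_distribution({"en": 600, "ru": 100, "sw": 5}))
```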
Joint sub-word vocabulary. We process all languages with a joint vocabulary of size 250k created through unigram language modeling (Kudo, 2018), using the SentencePiece library (Kudo and Richardson, 2018). We train the unigram-LM model using 10 million sentences randomly sampled from the filtered data, according to the multinomial distribution defined in Lample and Conneau (2019) with α = 0.3.
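The vocabulary training step can be sketched as follows with the SentencePiece Python bindings; the file paths are placeholders, and the 10M training sentences would be sampled with the α = 0.3 multinomial above.

```python
# A sketch of training the joint 250k unigram-LM vocabulary with SentencePiece
# (Kudo and Richardson, 2018). File paths are placeholder assumptions.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="sampled_10M_sentences.txt",  # placeholder path
    model_prefix="xglm_unigram",
    model_type="unigram",               # unigram language model (Kudo, 2018)
    vocab_size=250_000,
)

sp = spm.SentencePieceProcessor(model_file="xglm_unigram.model")
print(sp.encode("Few-shot learning works across languages.", out_type=str))
```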

Models and Training
We train decoder-only causal language models with a Transformer architecture similar to GPT-3 (Brown et al., 2020). This allows us to study the effect of scaling up model size along both the width and depth dimensions. We compare four models with 564M, 1.7B, 2.9B and 7.5B parameters, respectively. The architecture details are summarized in Table 1. Our models match the GPT-3 models, except for the additional embedding parameters from a larger vocabulary. All models are trained for up to 500B tokens with a context length of 2048 tokens. Further training details are described in Appendix A.
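As a usage illustration, the models can be queried as ordinary decoder-only causal LMs. A minimal sketch follows; the Hugging Face checkpoint identifier is an assumption for illustration and is not part of this paper.

```python
# A usage sketch assuming publicly released checkpoints on the Hugging Face Hub
# (the "facebook/xglm-564M" identifier is an assumption, not part of this paper).
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/xglm-564M")
model = AutoModelForCausalLM.from_pretrained("facebook/xglm-564M")

# A decoder-only causal LM simply continues a (possibly non-English) prefix.
inputs = tokenizer("La capitale de la France est", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=5)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```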

Multilingual In-context Learning
We measure the performance of our multilingual language models on downstream tasks in different languages, where the task and the few-shot demonstrations are specified via prompts, without any further parameter updates (Appendix B).

Multilingual and Cross-lingual Prompting
Previous work on English in-context learning has shown that performance heavily depends on the prompt construction, and it is challenging to find the optimal prompt for a given language model (Gao et al., 2021; Perez et al., 2021). This problem is further complicated in the multilingual setting, where we need to find the optimal prompts for examples in different languages.

Figure 1: The percentage of each language i (i = 1, 2, ..., 30) in XGLM's pre-training data before up-sampling (blue, en: 49.0%), after up-sampling (green, en: 32.6%), and the corresponding percentage in GPT-3's pre-training data (en: 92.6%). We truncate the y-axis at 10% to better visualize the tail distribution.
In this work, we consider three approaches for obtaining the prompts for non-English tasks.
Handcrafting prompts. The first approach is to ask native speakers of the target language to handcraft the prompts. Prompts created this way are expected to have the most natural surface form. However, language expertise is expensive, so we further consider two alternatives.
Translating from English prompts. We assume high-quality prompts for a task can be easily sourced in English (Sanh et al., 2021; Mishra et al., 2021). Non-verbal prompts do not contain words in any particular language (e.g. the StoryCloze and WMT prompts shown in Table 2), while verbal prompts have different realizations in different languages (Table 3). If the prompt is non-verbal, we simply apply it to the other languages. If the prompt is verbal, we translate it into the other languages using automatic translation APIs.
Cross-lingual prompting. The third approach directly applies the prompts in English (or another high-resource language) to non-English examples. We expect this approach to be competitive as a result of the cross-lingual capability the model acquires from being trained on a diverse set of languages.
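The three options can be illustrated with a toy XNLI-style example; the template strings below are simplified stand-ins for the actual prompts in Tables 2 and 3.

```python
# A toy illustration of the three prompting options for an XNLI-style example.
# The templates are simplified stand-ins, not the exact Table 3 prompts.

example = {"premise": "他在巴黎工作。", "hypothesis": "他在法国工作。"}  # Chinese input

templates = {
    # (1) handcrafted by a native speaker
    "handcrafted_zh": "{premise}，对吗？是的，{hypothesis}",
    # (2) machine- or human-translated from the English template
    "translated_zh":  "{premise}，对吧？是的，{hypothesis}",
    # (3) cross-lingual: the English template applied to the Chinese example
    "english":        "{premise}, right? Yes, {hypothesis}",
}

for name, tpl in templates.items():
    print(name, "->", tpl.format(**example))
```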

Learning from Cross-lingual Demonstrations
The cross-lingual nature of multilingual language models further enables learning from a different language in context, without parameter updates. To do so, we simply append examples from another language as the demonstration examples in the language model context. Such capability enables cheap transfer from high-resource languages to low-resource target languages.

Tasks
We evaluate the zero-shot and in-context few-shot learning capabilities (Brown et al., 2020) of XGLM on a spectrum of downstream tasks (Table 4).

English tasks.
We also evaluate our models on English commonsense reasoning and QA tasks, a subset of the benchmark tasks used by Brown et al. (2020), and compare the performance to state-of-the-art English-centric few-shot learning models. The tasks are detailed in Table A1.

Setup
Scoring function and calibration. We follow the guidelines suggested by Perez et al. (2021) and adopt a cross-task generalization setting (Triantafillou et al., 2020) to select our scoring function. We reserve three held-out tasks (XNLI, XCOPA and XStoryCloze) and perform the selection based on their development set performance, then directly apply the selected settings to the rest of the tasks. In the end, we use the average per-token log-probability, ignoring the common prefix of different candidates, as the scoring function for all multilingual tasks, with no additional calibration or normalization. Appendix C.2 details the selection.
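A minimal sketch of this scoring function follows, assuming a Hugging Face causal LM (the checkpoint identifier is illustrative): it averages the per-token log-probabilities of each candidate's continuation while skipping the tokens of the shared prefix.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/xglm-564M")  # illustrative id
model = AutoModelForCausalLM.from_pretrained("facebook/xglm-564M").eval()

def avg_logprob(common_prefix: str, continuation: str) -> float:
    """Average log p(token) over the continuation tokens only.

    Assumes the prefix tokenizes identically with and without the
    continuation, which holds for word-boundary prompts like these.
    """
    prefix_len = tokenizer(common_prefix, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(common_prefix + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logprobs = torch.log_softmax(model(full_ids).logits, dim=-1)
    # the token at position i is predicted by the logits at position i - 1
    scores = [logprobs[0, i - 1, full_ids[0, i]]
              for i in range(prefix_len, full_ids.shape[1])]
    return torch.stack(scores).mean().item()

# e.g. scoring verbalized labels behind a shared premise/hypothesis prefix
candidates = [" Yes,", " No,", " Also,"]
context = "The cat sat on the mat, right?"
print(max(candidates, key=lambda c: avg_logprob(context, c)))
```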
Few-shot learning evaluation. We focus on benchmarking the 0- and 4-shot learning performance of the models on all tasks. For cross-lingual demonstrations (§4.5), scaling laws (§4.9) and translation (§4.8), we also report 1-shot and 32-shot performance. We report the average results across 5 runs, randomly sampling a different set of few-shot examples each time. Unless otherwise specified, we use few-shot examples in the same language as the target example. Appendix C.3 details our complete evaluation protocol.

Comparing Prompting Approaches
We first compare the different multilingual prompting approaches proposed in §3.1 using XGLM 7.5B on XNLI and XCOPA. Native speakers among the authors handcrafted the prompts for the following tasks: XNLI (en, zh, es and hi) and XCOPA (en, zh), as shown in Table 3. We compare the performance of these human-written prompts to English prompts, machine-translated (MT) prompts and human-translated (HT) prompts (Appendix D.1 provides the comparison between English prompts and the MT and HT prompts on the complete dev sets of XNLI and XCOPA). English templates perform the best on average across languages for both tasks, except for the 4-shot setting of XCOPA, where they slightly underperform the machine-translated templates. On the XNLI task, the English template significantly improves the performance of Chinese (zh) and Hindi (hi) over their native templates and translated templates. Similar trends are observed for Thai (th) and Swahili (sw) on XCOPA. The strong performance of English templates may be partially attributed to the fact that the non-English evaluation data of XNLI and XCOPA are obtained via translation; testing how well the English templates perform on natively authored non-English test sets is an interesting direction for future work. For both tasks there exist languages where the native templates strongly outperform the English templates (Spanish (es) for XNLI and Chinese for XCOPA), indicating significant room for future work on language-specific prompt engineering.

Cross-lingual Transfer through Templates
We further examine whether the ability to serve as a universal prompting language is specific to English and, in addition, what characterizes a language pair for which cross-lingual prompting can work. To this end, we apply each of the human-written non-English templates to the rest of the languages. As shown in Tables 5 and 6, using the Spanish prompt yields competitive 0- and 4-shot performance across all languages, with the 4-shot average performance being comparable to that of the English template. The Hindi template also achieves significantly above-random performance on XNLI for most languages (especially en). The Chinese template, however, achieves close-to-random performance for all languages on XNLI, as well as close-to-random performance for Thai (0-shot) and Swahili (0-shot) on XCOPA. We hypothesize that common sub-tokens and the amount of code-switched text in the pre-training data play a significant role in enabling cross-lingual prompting, and that, in general, high-resource languages with large amounts of pre-training data and vocabulary overlap with other languages act as better universal prompting languages. We leave a more systematic verification of this hypothesis to future work.

Cross-lingual Transfer through Demonstration Examples
We examine the capability of XGLM 7.5B to learn from cross-lingual demonstration examples (§3.2) on XNLI. We consider two settings for each train-eval language pair: same-language prompting, where the prompt templates and the example are in the same language, and source-language prompting, where the prompt templates for both the demonstration and test examples are in the source language. We use the human-translated prompts for same-language prompting.
Table 7 shows results on a subset of language pairs of XNLI, where we evaluate transfer through in-context demonstration examples from high-resource languages to lower-resourced ones, and between languages that are typologically similar. We report the difference between the 32-shot and the 0-shot learning results. The non-English templates in this experiment are obtained via human translation. While the cross-lingual settings typically underperform the in-language few-shot setting (Figure A2), most of them significantly improve over the 0-shot setting for the target language. Bulgarian is an exception, as it does not benefit from Russian despite being in the same language family. Another language that does not work well in the cross-lingual settings is Swahili (low resource), for which we examined transfer from English (high resource) and Arabic (medium resource). In contrast, Thai (medium resource) and Urdu (low resource) benefit significantly from cross-lingual demonstrations. We also observe that the benefit of cross-lingual transfer from demonstration examples is generally canceled out if a better prompt (e.g. the English prompt) is used for the target language. We report the cross-lingual demonstration experiments between all pairs of languages for XNLI, XCOPA and XStoryCloze, and provide more discussion, in Appendix D.2.

Table 7: Learning from cross-lingual demonstrations on XNLI, evaluated on the test set. The results are the absolute improvement over the zero-shot performance for the evaluated language using human-translated prompts. The first language refers to the source language and the second to the target language. Same-lang refers to the setting where the template is in the example language (e.g. "The best thing that may be said of Podhoretz and Decter is that their biological clocks can't have many more minutes left on them, right? Yes, Decter is old."); source-lang refers to the setting where the template is only in the source language.

Performance on Multilingual Tasks
Using English as the universal prompting language, we characterize the zero- and few-shot in-context learning capabilities of XGLM 7.5B on XNLI, XCOPA and XStoryCloze, and compare them to English-centric language models of comparable size.
Comparison to GPT-3. We compare XGLM 7.5B to GPT-3 6.7B on high-, medium-, low- and extremely-low-resource languages. The results are summarized in Tables 9 and 10. On all three tasks, XGLM 7.5B outperforms GPT-3 6.7B by a large margin in terms of average performance across languages, especially on medium-, low- and extremely-low-resource languages. On XNLI, GPT-3 6.7B performs well on English and similar languages, surpassing XGLM 7.5B on en, de (4-shot), es (4-shot) and fr (0-shot). A possible explanation is that these languages have a significant presence in the GPT-3 training data (fr: 1.8%, de: 1.5%, es: 0.8%, as shown in Figure 1) and benefit more from lexical cognates with English.
Comparison to a Translate-test Baseline. We also create a translate-test baseline, where we translate the non-English examples of the multilingual tasks into English using the Google Cloud Translation API and use GPT-3 6.7B repl., an in-house replication of GPT-3 6.7B, to perform inference. As shown in Tables 9 and 10, we find that translate-test is a strong baseline for multilingual zero- and few-shot learning.
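A sketch of the translate-test pipeline, assuming the google-cloud-translate client library; scoring with the English model then proceeds exactly as for a native English example.

```python
# A sketch of the translate-test baseline, assuming the google-cloud-translate
# client library and valid credentials in GOOGLE_APPLICATION_CREDENTIALS.
from google.cloud import translate_v2 as translate

client = translate.Client()

def translate_test_input(text: str, src_lang: str) -> str:
    """Translate a non-English task input to English before scoring it with
    an English-centric model (GPT-3 6.7B repl. in the paper)."""
    result = client.translate(text, source_language=src_lang, target_language="en")
    return result["translatedText"]

english_premise = translate_test_input("他在巴黎工作。", src_lang="zh")
```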

Performance on English Tasks
We also benchmark the performance of XGLM 7.5B on English tasks. Figure 2 shows the comparison between XGLM 7.5B, GPT-3 6.7B and GPT-3 6.7B repl. on a subset of the English tasks used by Brown et al. (2020). Our replication of GPT-3 6.7B, GPT-3 6.7B repl., performs better than or close to GPT-3 6.7B on all tasks. While XGLM 7.5B performs competitively on all tasks, there remains a considerable performance gap compared to GPT-3 6.7B and GPT-3 6.7B repl. On most tasks, XGLM 7.5B and GPT-3 6.7B repl. show a similar performance trend as k changes. For example, both models show a performance dip at 1-shot on HellaSwag and PIQA, and at 128-shot on COPA.
There are multiple reasons why XGLM 7.5B underperforms English-centric models on the English tasks. First, only 32.6% of XGLM 7.5B's 500B-token training data is English, while both English-centric models are trained on close to 300B English tokens. Second, the model capacity of XGLM 7.5B is shared by 30 languages, and the "curse of multilinguality" can degrade the performance across all languages (Conneau et al., 2020). Further scaling up the model capacity and training data can potentially close this gap. The differences between the training corpora of the three models may also have contributed to the performance difference: while both English-centric models incorporate high-quality English monolingual corpora such as BookCorpus (Zhu et al., 2019) in their training data (GPT-3 6.7B also up-samples such high-quality data), XGLM 7.5B is trained solely on data extracted from Common Crawl. However, we do not expect this to be the main impact factor: Scao et al. (2022) conducted a similar experiment showing that a multilingual model (1.3B parameters) pre-trained over 13 languages also significantly underperforms an English model trained from the same data source in terms of zero-shot generalization.

Performance on Machine Translation
We report machine translation results on popular WMT pairs in Table 11, and on a subset of FLORES-101 (Goyal et al., 2021b) in Table 12. We use greedy decoding for both GPT-3 and our own model, and use the same 32 examples for few-shot learning in each case.
GPT-3 yields strong results for a few languages that are best represented in its training data, narrowly surpassing our model on WMT French-English, German-English and Chinese-English, as well as on a few pairs in the FLORES-101 set. GPT-3 is particularly strong when English is the target language, presumably due to its strong English language modeling capability. However, it does poorly on the broader set of less-resourced languages. For instance, GPT-3 fails completely when translating into Korean, Arabic, Swahili, Hindi, Burmese and Tamil in FLORES-101, with a spBLEU score of 1.2 in the best case.
In contrast, our model obtains solid results across the board. In addition to surpassing GPT-3 in 171 out of 182 language pairs in the FLORES-101 set, our model is also competitive with the official supervised baseline for this dataset, even surpassing it in 45 language pairs. This suggests that large-scale multilingual language models have great potential for building machine translation systems for low-resource languages, even when little or no parallel data is available.

Figure 2: Performance on English tasks. For XGLM 7.5B and GPT-3 6.7B repl., we plot the confidence interval from 5 different runs corresponding to different training sets when k > 0. For GPT-3 6.7B, we use the performance reported by Brown et al. (2020).

Scaling up Model Size
Finally, we study the impact of scaling up the model size on its zero- and few-shot learning capabilities. Figure 3 shows the performance (k = 0, 4, 32, 128) of the four XGLM models (564M, 1.7B, 2.9B, 7.5B) on the five multilingual tasks. The y-axis represents the average accuracy across languages for each task. On the commonsense reasoning tasks (XStoryCloze, XCOPA, XWinograd), the performance of all models increases as k increases from 0 to 32. The performance gain from demonstration examples also grows as the model size increases, indicating that bigger models can better leverage the in-context examples. On XNLI, the performance of all models increases as k increases from 0 to 4, but decreases for k at 32 and above. With the same number of demonstration examples, larger models do not always benefit more. PAWS-X is a task where in-context learning struggles: the performance of all models oscillates near random (50%) as k changes. A possible reason is the adversarial nature of PAWS-X, where the paraphrase and non-paraphrase pairs by design have high lexical overlap. Given the current trend, we expect scaling to be an effective recipe for building stronger multilingual language models.

Related Work
Language model prompting. Brown et al. (2020) first demonstrated in-context few-shot learning with the GPT-3 model. This method removes the need for task-specific updates to the model parameters: the few-shot examples that one would normally use for fine-tuning are instead provided at inference time to the same model for each task. On several high-resource Latin-script language pairs, GPT-3 achieves machine translation performance that is close to or better than state-of-the-art supervised models, given only a handful of demonstration examples. Studies show that language contamination in pre-training data can effectively boost the cross-lingual capability of English-centric language models (Blevins and Zettlemoyer, 2022), and with a heavier tail of deliberately introduced multilingual data, PaLM-540B (Chowdhery et al., 2022) later achieved even stronger few-shot machine translation performance. Such a change in the learning paradigm raises new questions about multilinguality, which have not been studied as extensively (see, e.g., Winata et al., 2021).
Several approaches have been developed to facilitate cross-lingual transfer, including sub-word tokenizers that enable efficient shared vocabulary learning across languages (Kudo and Richardson, 2018) and joint training for efficient knowledge transfer across languages (Pires et al., 2019; Jiang et al., 2020; Kassner et al., 2021). A notable concurrent work is BLOOM, which scales multilingual pre-training to 46 languages and 175 billion parameters.

Conclusion
We introduce four multilingual generative language models (XGLMs) at different scales, and study their in-context few- and zero-shot learning capabilities. We show that the few-shot learning capability of XGLM improves steadily as it scales. Our largest model (7.5B parameters) sets a new state of the art for few-shot learning in more than 20 languages (including mid- and low-resource languages) on commonsense reasoning, NLI and machine translation tasks. An in-depth analysis shows the models are highly cross-lingual, which leads to strong few-shot learning performance in non-English languages.

Limitations
Although multilingual language models are an important step towards building inclusive, general-purpose foundation models, our current models have the following limitations.
Training Data. Our models are trained on a static multilingual corpus extracted from CommonCrawl, with English text comprising 32.6% of the total tokens (163B tokens). This English portion corresponds to only roughly 54% of the English data used to train GPT-3. We applied several data filtering strategies as proxies for data quality assurance (see the comprehensive list in the Data Card in Appendix F), such as removing documents and paragraphs duplicated by URL, filtering out paragraphs with a high ratio of digits and punctuation, removing paragraphs with profanity, and filtering by a maximum number of URLs and a minimum length. Such filtering may introduce bias into the remaining pre-training data, which would need further analysis to understand. Furthermore, the raw data were taken from static CommonCrawl snapshots, which do not cover entities and events beyond the time span of the snapshots (through March 2020), such as COVID-19. We also note the potential difference in genres between CommonCrawl and the corpora used by GPT-3, which, in addition to CommonCrawl, include BookCorpus and Wikipedia.
Moreover, GPT-3 is trained on 118 languages, despite the fact that 93% of its data is English. In contrast, our models are trained on 30 languages after rigorous language identification and filtering.
Performance on English tasks. As shown in Section 4.7 and Figure 2, our model underperforms English-centric models on eight tasks ranging from commonsense reasoning to QA. Several factors could contribute to this gap, such as the smaller amount of English training data and the model capacity being shared across 30 languages (Conneau et al., 2020). Additional experiments controlling for these factors would shed more light on the observed gap.
Model architecture and training objective. In this work, we only experimented with causal language models with a decoder-only architecture, which had previously demonstrated promising few-shot learning capabilities (Brown et al., 2020). However, such an architecture and pre-training objective do not leverage bidirectional context, unlike masked language models (MLMs) or sequence-to-sequence architectures with denoising autoencoder pre-training objectives.
Model evaluation via in-context learning. We compare our language models to the baselines primarily in the in-context learning paradigm, using the same prompts for all language models unless explicitly specified. Despite minimal effort engineering the prompts for any particular model, it is possible that the prompts work better with some models than others, which introduces bias into the evaluation. However, we expect this factor to have a small impact, and the relative strengths of the models can be reliably measured given the volume of tasks they were evaluated on.
Evaluation on social value tasks for more languages. We evaluate and analyze the models' performance on hate speech detection and gender bias for professional occupations. These studies are limited by the available evaluation datasets. We only investigate this problem space for six languages (English, French, Spanish, Italian, Portuguese, and Polish), a majority of which (5) pertain to the Romance language family. It would be pertinent to investigate the impact of multilingual models on social value tasks across a wider and more diversified set of languages before drawing solid conclusions. Moreover, we contend that studies on other tasks such as stereotypes (Nangia et al., 2020; Nadeem et al., 2021) and ethics (Hendrycks et al., 2020) would provide a more comprehensive view of model behavior on social value tasks.

Ethical Considerations
Devising multilingual pre-trained language models can serve as a powerful tool in the NLP arsenal for multiple reasons.
Energy and maintenance efficiency. From an engineering perspective, XGLM belongs to a family of single unified models catering to many languages, with wide applicability across use cases. Such a unified model saves on carbon footprint and energy consumption compared to the alternative of separate models for different languages, leading to greater energy efficiency. A single model, despite the risk of being a single point of failure, has the powerful advantage of being easier to maintain, access, distribute and track.

Diversity and inclusion.
Models such as XGLM represent a paradigm shift from the Anglo-centric view of the NLP world towards catering to all languages on an equal footing. Paying attention to the design of such models is critical to ensure equitability and inclusion, exemplified here by our attempt to balance language representation. The further strength of XGLM specifically is its ability to perform comparably to Anglo-centric models in zero- to few-shot settings. Possessing powerful multilingual models that perform well in such settings, especially for medium- to extremely-low-resource languages, helps alleviate the burden of creating supervised data for such languages, especially for economically challenged ones (medium to low digital presence typically goes hand in hand with economic disparities). Moreover, having such models cater to scarcer languages spurs scientific research in those languages, leading to a more diversified NLP, and more diversified science in the broader sense.
Social values. We further investigate the impact of our models on social value problems such as hate speech detection and bias (Appendix E). Despite overall inconclusive results (bordering on negative), we note that in the relatively scarcer data setting (Polish) the multilingual models outperform the Anglo-centric models, indicating that XGLM can be performant for less-resourced languages. This is especially significant for social value tasks, where obtaining training data is problematic due to the inherent expense of obtaining high-quality annotated data.
Transparency and Accountability. In the spirit of transparency and accountability for large-scale language modeling, we include a detailed model card and data card with the model and paper release.

B Multilingual In-context Learning Formulation
We extend the in-context learning framework proposed by Brown et al. (2020) to the multilingual setting.
Let M be a causal language model and T be a task. T = (D, E) consists of a task description D and a set of demonstration examples E in one or more languages. We consider the setting where the task description comes in the form of a prompt P = (t, v). t is a cloze-style template that converts an example input x into a string t(x) that contains a [Mask] symbol. For classification and multiple-choice problems, v : Y → Σ* is a verbalizer that maps each candidate label or choice y ∈ Y into a string v(y). Both t(x) and v(y) can be tokenized into a sequence of one or more tokens in the language model vocabulary Σ. An instantiated prompt p(x, y) is obtained by substituting the [Mask] symbol in t(x) with v(y). Table 2 shows the prompts used by all tasks in our main experiments.
Zero-shot learning. Given a test example x in any target language l_t, the zero-shot prediction is ŷ = argmax_{y ∈ Y} f_M(p(x, y)), where f_M is a language-model-based scoring function (§C.2).
This general formulation covers most NLP tasks. For classification problems, v is a mapping from classes to strings; for multiple-choice problems, v is an identity function that maps each candidate choice to itself. For text generation problems, v is the identity and we decode free-form text from [Mask], which in this case is positioned at the end of t(x).
Few-shot learning. Suppose we have k demonstration examples available in a source language l_s: E = {(x_i, y_i)}_{i=1..k}. In this case, we concatenate the instantiated prompts of the demonstration examples p(x_1, y_1) [Sep] ... [Sep] p(x_k, y_k) and make them the prefix of the input string used in the zero-shot setting, forming the objective ŷ = argmax_{y ∈ Y} f_M(p(x_1, y_1) [Sep] ... [Sep] p(x_k, y_k) [Sep] p(x, y)), where [Sep] is a separator symbol chosen empirically (a code sketch follows the two setups below).
• When l_s = l_t, we have the in-language few-shot learning setup.
• When  ̸ = , we have the cross-lingual few-shot learning setup.

C.1 English Evaluation Tasks
Table A1 shows all the English tasks used in our evaluation.

Table A1: English tasks used in our few-shot learning evaluation (e.g. PIQA (Bisk et al., 2020) with 16,113 training, 1,838 development and 3,084 test examples, and OpenbookQA (Mihaylov et al., 2018) with 4,957 / 500 / 500). All tasks use accuracy as the evaluation metric.

C.2 Scoring Functions
We considered the following functions for scoring an instantiated prompt using a language model: (1) the sum of per-token log-probabilities; (2) the average of per-token log-probabilities; (3) the average of per-token log-probabilities, ignoring the common prefix of different candidates.

We also considered the calibration approach proposed by Zhao et al. (2021) and the character normalization proposed by Lieber et al. (2021).

In the end, we use the average of per-token log-probabilities, ignoring the common prefix of different candidates, as the scoring function for all multilingual tasks. This is selected based on the development set performance of StoryCloze and XNLI.
For English tasks, we use the same modeling choices as Brown et al. (2020). Specifically, we use the task prompts detailed in Appendix G of Brown et al. (2020) and a single newline as the separator for few-shot learning. For WinoGrande, we take the log-likelihood of the common suffix of the different candidates as the scoring function. For ARC-easy, ARC-challenge and OpenBookQA, we normalize by the unconditional probability of each candidate, scoring by P(completion|context) / P(completion|answer_context), where we use the string "Answer: " as answer_context. For all other tasks, we take the average of per-token log-probabilities, ignoring the common prefix of the different candidates.
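In log space, the normalization for ARC and OpenBookQA amounts to a difference of two conditional scores. A sketch follows, with `sequence_logprob` as a hypothetical helper returning log P(completion | context):

```python
# A sketch of the unconditional-probability normalization used for ARC and
# OpenBookQA: score = log P(completion | context) - log P(completion | "Answer: ").
# `sequence_logprob(context, completion)` is a hypothetical helper returning
# the summed log-probability of `completion` given `context`.

def normalized_score(sequence_logprob, context: str, completion: str) -> float:
    answer_context = "Answer: "
    return (sequence_logprob(context, completion)
            - sequence_logprob(answer_context, completion))
```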

C.3 Evaluation Protocol
All few-shot learning results are obtained in the in-language setting (both the training and test examples are in the same language) unless otherwise specified. We report results on the test set for all multilingual tasks (including the held-out tasks). For English tasks, we report results on the test set for ARC-easy, ARC-challenge, OpenBookQA and StoryCloze, and on the development set for the rest, following Brown et al. (2020). For few-shot learning, we report the average results across 5 runs, randomly sampling a different set of few-shot examples each time. For tasks with a training set, we sample the few-shot examples from the training set; for tasks with no training set, we sample from the dev set and report evaluation results on the test set; for dev-set evaluation on XNLI and XCOPA, we sample few-shot examples from the test set, since these two tasks do not have training sets for all languages. While Brown et al. (2020) tuned the few-shot value k as a hyperparameter on the dev set, we pre-selected a few k values (0, 1, 4, 32, 128) and report the corresponding results.

C.3.1 Example Truncation
Following Brown et al. (2020), we truncate the input such that it fits the maximum context length of XGLM (n_ctx = 2048) and preserve only the complete demonstration examples after truncation. For each task, we report results up to the k's corresponding to the maximum fit.

Representation Bias. We observe that the language model tends to fit more in-context examples for a high-resource language than for a low-resource language. English, as the highest-resourced language (Table A10), always fits the most examples. This reflects the unequal representation of different languages in our joint multilingual BPE vocabulary (§2.1). With this vocabulary induction scheme (Sennrich et al., 2015), the under-represented languages tend to have smaller sub-word units and higher fertility (defined as the number of sub-words per linguistic word), making it more challenging to learn word- and higher-level semantics for such languages. Other factors can also impact the tokenization granularity. For example, sharing sub-strings with other high-resource languages can boost the granularity of a language, and some languages have smaller tokenization granularity as a result of their writing system (e.g. Chinese has an average sub-word length of 1.4, indicating the dominance of single-character tokens, despite being the third-largest language in our pre-training data by disk size).
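A sketch of the truncation rule above, assuming demonstrations are dropped from the end of the sampled list once the token budget is exhausted (which examples get dropped is an implementation detail not specified in the paper):

```python
# Keep only as many complete demonstration examples as fit in the
# n_ctx = 2048 token budget; never cut an example mid-way.

def fit_demonstrations(demo_token_ids: list, test_token_ids: list,
                       n_ctx: int = 2048) -> list:
    budget = n_ctx - len(test_token_ids)
    kept, used = [], 0
    for demo in demo_token_ids:       # assumption: keep earliest demos that fit
        if used + len(demo) > budget:
            break
        kept.append(demo)
        used += len(demo)
    return kept
```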

D.1 Comparing Multilingual Prompting Approaches on XNLI and XCOPA
We compare the performance of English prompts and of MT and HT prompts on two of our held-out tasks, XNLI and XCOPA, using their development sets. For MT prompts, we translate the English prompts into the target languages using the Google Cloud Translation API. We use the exact prompts shown in Table 2 as the input to the translation API and manually recover the placeholders in the API output based on bracket markers (e.g. "{Sentence 1} because [Mask]" is translated to "{Sentence 1}因为[Mask]"). When the candidate set is closed, we replace [Mask] with each verbalized label and translate them separately. For example, "{Sentence 1}, right? Yes, {Sentence 2}" is translated to "{Sentence 1}，对吗？是的，{Sentence 2}". On XNLI, we also compare to prompts manually translated from English, to eliminate the impact of translation noise on the comparison. As shown in Table A3 and Table A4, the in-context learning performance is sensitive to the prompting choices across all languages. For both XNLI and XCOPA, using the English prompts on average yields significantly better performance than using the machine-translated prompts. For XNLI, human-translated (HT) prompts significantly improve over machine-translated (MT) prompts for most languages. Surprisingly, the performance of human-translated prompts lags behind that of the English prompts in both the 0-shot and 4-shot settings.
Further examination of the per-language performance reveals that the relative strengths of different prompting approaches vary across languages. For es and de, HT prompts offer large gains compared to the MT prompts and the English prompts. However, for zh and ur, using translated prompts (either HT or MT) significantly hurts the performance. For zh, fr, vi, ar and hi, using native-speaker-translated prompts still yields significantly lower performance compared to using the English prompts in at least one setting, suggesting that translation error is not the sole cause of the performance drop.

D.2 Full Results on Learning from Cross-lingual Demonstrations
We evaluated XGLM 7.5B on XNLI in the learning-from-cross-lingual-demonstrations setting, using both the same-language-prompting and English-prompting setups. In same-language prompting, the prompt fields and the examples are always in the same language. In English prompting, English prompts are used for all examples. All few-shot performances in this section are obtained using the k-shot-per-label setting described in §D.4.
As shown in Figure A2, for many language pairs, transferring from source-language demonstrations can significantly improve over the zero-shot performance in the target language when human-translated templates are used. The improvement is especially significant for languages such as Chinese (zh), Thai (th) and Urdu (ur), whose zero-shot performance with human-translated templates is close to random. However, we found that the effects of cross-lingual transfer from templates and cross-lingual transfer from demonstration examples typically do not add up. As shown in Figure A3, using the English template significantly improves the zero-shot performance of most languages, including Chinese, Thai and Urdu. In this case, the demonstration examples in general do not help unless they are in the same language as the target example (diagonals).
Figure A4 shows the results on XCOPA.
Figure A5 shows the results on XStoryCloze, where we observed almost no improvement for any language pair. A possible reason for the poor transfer results on XStoryCloze is that it requires reasoning about implicit relations between multiple sentences, which is much harder, especially in a cross-lingual setting. (In these figures, each matrix shows the difference between 4-shot (per label) and 0-shot performance. For XStoryCloze, there is no difference between same-language prompting and English prompting since the task does not use a verbalized template. Row: target language. Column: source language.)

D.4 Majority Label Bias in Few-shot Learning

We compare learning from a uniform class distribution (randomly sampling k examples per class) to learning from a more skewed distribution (randomly sampling k × |L| examples from the total population, where L is the label set) on the XNLI task. We use the 24-shot and maximum-fit (truncated 48-shot) settings. As shown in Table A7, in both settings, learning from a uniform class distribution leads to significantly higher accuracy in all languages compared to learning from the skewed distributions. de, tr, bg and hi suffer the most from learning from the skewed distributions (> 2 absolute accuracy gap in the 24-shot setting), while es suffers the least. Moreover, the variance among few-shot trials with different random seeds shrinks considerably when the training-set class distribution is uniform. These results highlight the severity of the majority label bias issue in the multilingual in-context learning framework.

D.5 Knowledge Probing
We evaluate to what extent our multilingual language model can effectively store factual knowledge in different languages. To this end, we evaluate knowledge triplet completion using the mLAMA dataset (Kassner et al., 2021), which was translated from the English benchmark LAMA (Petroni et al., 2019) using Google Translate. The data is from T-REx (Elsahar et al., 2018), with triples of the format ⟨object, relation, subject⟩. Following the convention of LAMA, triples are converted to templates for querying the language model. For example, a triple like ⟨Paris, capital-of, France⟩ is converted to the template "Paris is the capital of [MASK]". While each query in the original mLAMA dataset contains hundreds of candidates on average, we restrict it to three candidates, one of which is the ground truth and the other two randomly sampled, to ensure fast inference and save API cost. Following the evaluation protocol of mLAMA, we report precision@1 averaged over all relations per language. We evaluate on the 25 languages covered in XGLM's pre-training data and compare to the GPT-3 6.7B model. As shown in Figure A6, both our multilingual model and GPT-3 Curie perform well on English. For non-English languages, our multilingual model maintains its performance (above 0.6), while GPT-3 Curie drops drastically, especially for medium- and low-resource languages. Overall, compared to an English-centric language model, our multilingual language model is better at retaining factual knowledge across a wider range of languages, with +7.1 points on average.
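The candidate restriction can be sketched as follows; the fixed seed is illustrative.

```python
# A sketch of the restricted-candidate evaluation: for each mLAMA query, keep
# the gold object plus two distractors sampled from the full candidate set.
import random

def restrict_candidates(gold: str, all_candidates: list, k: int = 3,
                        seed: int = 0) -> list:
    rng = random.Random(seed)
    distractors = rng.sample([c for c in all_candidates if c != gold], k - 1)
    candidates = distractors + [gold]
    rng.shuffle(candidates)
    return candidates
```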

E Safety and Bias Analysis
Given the centrality of large-scale language models, it is important to ensure such powerful models are used responsibly. Accordingly, we further examine XGLM's behavior on two tasks:
• Hate speech detection: a safety task that tests language models' ability to identify hateful and offensive text;
• Occupation identification: a bias task that studies language models' performance disparity between different gender groups when identifying occupations.
Through extensive experiments, we have the following findings. First, hate speech detection in an in-context learning setting is quite challenging; moreover, language models do not effectively leverage few-shot examples to improve performance. Second, although language models perform relatively well on the occupation identification task, they run the risk of exhibiting strong gender bias for certain occupations.

E.1.1 Setup
Datasets. We adopt the datasets introduced by Huang et al. (2020), which include hate speech data from Twitter in five languages: English, Italian, Portuguese, Polish and Spanish. All hyperlinks, usernames and hashtags are replaced with generic symbols (URL, USER, HASHTAG) to anonymize user information. We remove tweets containing more than 2 generic symbols to encourage more informative examples, and further filter out tweets shorter than 5 tokens or longer than 30 tokens. In the spirit of creating balanced data, we randomly sample 500 positive (hate speech) and 500 negative (not hate speech) examples for each language. For further comparison, we translate the non-English data into English using Google Translate and evaluate English models' performance on the translated task.
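A sketch of the tweet filter described above, assuming whitespace tokenization (the exact tokenizer is not specified in the paper):

```python
# Keep a tweet only if it has at most 2 generic anonymization symbols
# and between 5 and 30 tokens (whitespace tokenization assumed).
def keep_tweet(text: str) -> bool:
    tokens = text.split()
    generic = sum(tokens.count(sym) for sym in ("URL", "USER", "HASHTAG"))
    return generic <= 2 and 5 <= len(tokens) <= 30
```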
Prompts. We evaluate two approaches to prompting, similar to §3.1. For English prompts, we prefix "The sentence is <candidate>" to the input sentence to form a prompt. We consider 10 verbalization candidates: 5 negative ones (normal, common, ok, usual, acceptable) corresponding to the classification not hate speech, and 5 positive ones (sexist, racist, offensive, abusive, hateful) representing the classification hate speech. For code-switched prompts, we translate the English prefix and candidates into the corresponding target language using Google Translate. For example, "The sentence is normal" is translated into "Questa frase è normale." for Italian. For few-shot learning, we randomly draw examples from the training data and report the average performance across 5 runs.

Table A9: Accuracy and bias scores of our multilingual model and other English models on the occupation identification task. "|Diff|" stands for the average absolute accuracy gap between male and female groups aggregated across all occupations. We bold the highest accuracy score for each language.

E.2.1 Setup
Datasets. We use the English bios dataset introduced by De-Arteaga et al. (2019) to study gender bias in identifying a person's occupation from their bio. For multilingual bios, we use the datasets created by Zhao et al. (2020). Originally, there are 28 occupations in English, 69 in Spanish and 27 in French. To ensure we have plenty of test data for each occupation, we only keep occupations with at least 1,000 male examples and 1,000 female examples. This leaves 16 occupations in English, 6 in Spanish and 4 in French. We follow the setup of Zhao et al. (2020), where people's names and pronouns are removed from the bios. We then prefix "The occupation of this person is <candidate>" to the input bio to form a prompt. The candidate set consists of five occupations: the ground-truth one and four other randomly sampled male and female occupations (two male and two female). Male (female) occupations refer to ones having predominantly more male (female) samples.
Metrics. Similar to the metric for hate speech detection, we first obtain the scores for the 5 candidates and consider a prediction correct if the ground-truth candidate yields the highest score among the five.
We then compute the bias score as the absolute gap between the accuracy scores on the male and female samples, averaged across all occupations. A lower bias score indicates that a model has less divergence in identifying occupations for men and women.
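The metric itself is straightforward; a sketch, taking per-occupation accuracies for each gender group:

```python
# Bias score: per-occupation absolute accuracy gap between male and female
# groups, averaged over occupations (lower is better).
def bias_score(acc_male: dict, acc_female: dict) -> float:
    occupations = acc_male.keys() & acc_female.keys()
    return sum(abs(acc_male[o] - acc_female[o]) for o in occupations) / len(occupations)
```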

E.2.2 Results
We present the overall accuracy scores and the bias scores (|Diff|) in Table A9. The results indicate that the XGLM 6.7B En-only model achieves the best performance on English and Spanish, while the GPT-3 6.7B model achieves the best performance on French. The XGLM 7.5B model, instead, falls behind on all three languages, especially on Spanish and French. We think this is potentially because all pronouns and people's names are removed from the test data but not from the training data. The training data for XGLM 7.5B contains more Spanish and French than that of the other two models; thus, XGLM 7.5B may suffer a more severe morphological mismatch on Spanish and French. Regarding the bias score, the GPT-3 6.7B model is the most biased model on both English and Spanish but the least biased on French. XGLM 6.7B En-only and XGLM 7.5B exhibit the least bias on Spanish and English, respectively.

F Data Card
We follow the recommendations of Gebru et al. (2018) and provide a data card for the dataset used to train XGLM, which is a subset of CC100-XL, a larger multilingual dataset we curated.

F.1 Data Sources
Following the recent success of multilingual self-supervised pre-training (Devlin et al., 2019; Lample and Conneau, 2019; Conneau et al., 2020; Xue et al., 2020; Goyal et al., 2021a; Liu et al., 2020), we train our language models on a mixture of monolingual text in different languages. We extend the pipeline used for mining the CC100 corpus (Conneau et al., 2020; Wenzek et al., 2020) to generate CC100-XL, a significantly larger multilingual dataset covering 68 Common Crawl (CC) snapshots (from Summer 2013 to March/April 2020) and 134 languages. As a first step towards balancing the language distribution, we sampled 30% of the data from the languages that contain more than 15 billion tokens and more than 20 million documents. This resulted in an 8.4 TB multilingual corpus with 1.9 trillion tokens.

F.2 Motivation
• For what purpose was the dataset created? Was there a specific task in mind? Was there a specific gap that needed to be filled? Please provide a description. The CC100-XL dataset was collected to create a high-quality monolingual dataset for at least 100 languages. It was mainly used for training foundation multilingual language models, which may be applied to a broad list of language tasks, including neural machine translation, speech translation, question answering, etc. CC100-XL involves sentence-level filtering, preserves context, improves the filtering mechanism, and paves the way for mining 200+ languages.
• Who created the dataset (e.g., which team, research group) and on behalf of which entity (e.g., company, institution, organization)? Meta AI.
• Who funded the creation of the dataset? If there is an associated grant, please provide the name of the grantor and the grant name and number. Meta AI.
• Any other comments? No.

F.3 Composition
• What do the instances that comprise the dataset represent (e.g., documents, photos, people, countries)? Are there multiple types of instances (e.g., movies, users, and ratings; people and interactions between them; nodes and edges)? Please provide a description. The instances are textual documents sampled from Common Crawl snapshots.
• How many instances are there in total (of each type, if appropriate)?
The training dataset of XGLM contains 1.74 billion documents in total.
• Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set? If the dataset is a sample, then what is the larger set? Is the sample representative of the larger set (e.g., geographic coverage)? If so, please describe how this representativeness was validated/verified. If it is not representative of the larger set, please describe why not (e.g., to cover a more diverse range of instances, because instances were withheld or unavailable). The dataset is a subset of CC100-XL. For each language, the data is either the full set or a random subset of the CC100-XL data. In particular, the medium- and low-resource languages are up-sampled. In terms of language representation, the CC100-XL dataset contains 134 languages extracted using fastText from Common Crawl snapshots. We further selected a subset of 30 languages to train XGLM, taking the geo-location, language family and typological diversity of the languages into account.
• What data does each instance consist of? "Raw" data (e.g., unprocessed text or images) or features? In either case, please provide a description. Each instance consists of raw text data.
• Is there a label or target associated with each instance? If so, please provide a description. No.
• Is any information missing from individual instances? If so, please provide a description, explaining why this information is missing (e.g., because it was unavailable). This does not include intentionally removed information, but might include, e.g., redacted text. No.
• Are relationships between individual instances made explicit (e.g., users' movie ratings, social network links)? If so, please describe how these relationships are made explicit. A small percentage of document instances (<2%) are duplicated. Other than that, there are no relationships between individual instances.
• Are there recommended data splits (e.g., training, development/validation, testing)? If so, please provide a description of these splits, explaining the rationale behind them. This dataset is split into training and validation sets only. For each high-resource language, at least 5,000 randomly selected documents and 30,000 lines were placed in the validation set, with the remaining documents used for training; for low-resource languages, at least 100 randomly selected documents and 1,000 lines (a couple of very low-resource languages contain only 80 documents) were placed in the validation set, with the rest left for training. There are 3.5 million lines of text in total in the validation set. This split mainly ensures a good-sized validation set with coverage and balance over all languages, while keeping the validation set small enough not to affect the overall training speed.
• Are there any errors, sources of noise, or redundancies in the dataset? If so, please provide a description. 10% of the Russian sample was lost during internal data transfer. Therefore, we ended up taking a 26.7% random subset of the whole Russian data from CC100-XL.
• Is the dataset self-contained, or does it link to or otherwise rely on external resources (e.g., websites, tweets, other datasets)? It is self-contained.
• Does the dataset contain data that might be considered confidential (e.g., data that is protected by legal privilege or by doctor-patient confidentiality, data that includes the content of individuals' non-public communications)? If so, please provide a description. CC100-XL is exclusively extracted from Common Crawl; the information in it is not considered confidential.
• Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety? If so, please describe why. CC100-XL is a subset of public Common Crawl data, which could contain sentences that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety.
• Does the dataset relate to people? If not, you may skip the remaining questions in this section. Some documents in this data relate to people, such as news articles, Wikipedia descriptions, etc.
• Does the dataset identify any subpopulations (e.g., by age, gender)? If so, please describe how these subpopulations are identified and provide a description of their respective distributions within the dataset. No.
• Is it possible to identify individuals (i.e., one or more natural persons), either directly or indirectly (i.e., in combination with other data) from the dataset? If so, please describe how. Other than individuals who are celebrities, politicians, etc. and have their own Wikipedia pages, it is possible to identify other individuals by their names, Twitter account names, etc. However, we built personally identifiable information (PII) identification tools following the guidelines of the General Data Protection Regulation (GDPR) and the National Institute of Standards and Technology (NIST) and ran them against this dataset; we did not find highly sensitive PII, such as U.S. social security numbers, login credentials, etc.
• Does the dataset contain data that might be considered sensitive in any way (e.g., data that reveals racial or ethnic origins, sexual orientations, religious beliefs, political opinions or union memberships, or locations; financial or health data; biometric or genetic data; forms of government identification, such as social security numbers; criminal history)? If so, please provide a description. We use a curated special-word list covering 100 languages, which includes profanities, hate speech, bullying language, common slang and profane multi-word expressions (MWEs), to tag paragraphs and remove the documents containing them. Given the size of this data, it could still contain such sensitive information (as the above lists may not be exhaustive), but this should be a very small percentage of instances.
• What mechanisms or procedures were used to collect the data (e.g., hardware apparatus or sensor, manual human curation, software program, software API)? How were these mechanisms or procedures validated? Please refer to the main document for details.
• If the dataset is a sample from a larger set, what was the sampling strategy (e.g., deterministic, probabilistic with specific sampling probabilities)? Please refer to the main document for details.
• Who was involved in the data collection process (e.g., students, crowdworkers, contractors) and how were they compensated (e.g., how much were crowdworkers paid)? The data was mined, filtered, and sampled by machines; no human workers were involved.
• Over what timeframe was the data collected? Does this timeframe match the creation timeframe of the data associated with the instances (e.g., recent crawl of old news articles)? If not, please describe the timeframe in which the data associated with the instances was created. The data was collected from 68 Common Crawl (CC) snapshots (from Summer 2013 to March/April 2020). It therefore does not contain much information about recent events such as COVID-19.
• Were any ethical review processes conducted (e.g., by an institutional review board)? If so, please provide a description of these review processes, including the outcomes, as well as a link or other access point to any supporting documentation. No.
• Does the dataset relate to people? If not, you may skip the remainder of the questions in this section. No.
• Did you collect the data from the individuals in question directly, or obtain it via third parties or other sources (e.g., websites)? N/A
• Were the individuals in question notified about the data collection? If so, please describe (or show with screenshots or other information) how notice was provided, and provide a link or other access point to, or otherwise reproduce, the exact language of the notification itself. N/A
• Did the individuals in question consent to the collection and use of their data? If so, please describe (or show with screenshots or other information) how consent was requested and provided, and provide a link or other access point to, or otherwise reproduce, the exact language to which the individuals consented. N/A
• If consent was obtained, were the consenting individuals provided with a mechanism to revoke their consent in the future or for certain uses? If so, please provide a description, as well as a link or other access point to the mechanism (if appropriate). N/A
• Has an analysis of the potential impact of the dataset and its use on data subjects (e.g., a data protection impact analysis) been conducted? If so, please provide a description of this analysis, including the outcomes, as well as a link or other access point to any supporting documentation. Some responsible-AI-related evaluations were performed; please refer to the main document.

F.5 Preprocessing/cleaning/labeling
• Was any preprocessing/cleaning/labeling of the data done (e.g., discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values)? If so, please provide a description. If not, you may skip the remainder of the questions in this section. Yes, the detailed steps are as follows:
- Downloading and Sharding Common Crawl Snapshots: We downloaded 68 Common Crawl snapshots and divided the data into 240 shards based on web domain. At this stage, textual data is extracted from the WET files provided by Common Crawl, which involves cleaning up excessive tabs and newlines.
- Language Identification (LID) at Document Level: We ran the fastText language identification (LID) model on each entire document, which further divided the data by language. In addition to the languages originally supported by fastText, we added support for 28 romanized languages. In total, the data for each language comprises 240 shards.
- Deduplicating Documents Based on URL: We aggregated the data by URL, which yielded a 60% reduction in volume. When two documents had the same URL, we kept the one with the more recent text content.
- Document Splitting and LID at Paragraph Level: We segmented the documents on newlines and stored the order in which the paragraphs appeared in the original document (i.e., seq_num). Next, we performed LID again at the paragraph level in order to divide the original documents into clusters of paragraphs, where each cluster contains sentences belonging to a particular language.
- Deduplicating Paragraphs: Data extracted from Common Crawl snapshots still contains a lot of duplicate text, even across distinct documents. To tackle this, we applied the normalization function from CCNet (Wenzek et al., 2020) and computed a SHA-1 hash of the normalized text, which reduced the content by 88%. Choosing which <paragraph, url> combination to keep can be tricky, as it can lead to heavily fragmented documents, so we devised a strategy of keeping documents based on sorted <url, seq_num> order, which prevents fragmentation as much as possible (see the sketch after this list).
- Language Model Scores: We scored every paragraph using a 4-gram KenLM language model (Heafield, 2011) trained on monolingual data collected from the available OPUS bitexts (Tiedemann, 2012). Since these LMs were not trained on data from a specific domain, this feature helped eliminate generally non-fluent sentences.
- Heuristic-Based Approaches: We used the following techniques to further refine the filtering step (especially useful for low-resource languages with no LM or a poor-quality one; see the sketch after this list):
  * Ratio of digits + punctuation to total characters (current threshold < 0.25)
  * Maximum number of URLs per sentence (current value 1)
  * Type-token ratio (current threshold > 0.6, plus removing the bottom 1% per language)
  * Minimum number of tokens per sentence (current value 3; not applied to agglutinative languages)
- Tagging Profane Words: removing instances that contain words from the profanity lists.
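To make the filtering concrete, the sketch below combines the paragraph-level deduplication (CCNet-style normalization plus SHA-1 hashing) with the heuristic filters and profanity tagging. It is a simplified, hypothetical stand-in: the real CCNet normalization, the per-language thresholds (e.g., the bottom-1% type-token-ratio removal), and the curated 100-language MWE blocklists are more involved, and `BLOCKLIST` here is a placeholder:

```python
import hashlib
import re

# Placeholder for the curated per-language profanity lists (incl. MWEs).
BLOCKLIST = {"badword", "anotherbadword"}

def normalize(text: str) -> str:
    # Stand-in for the CCNet normalization function: lowercase,
    # strip digits and punctuation, collapse whitespace.
    return " ".join(re.sub(r"[\d\W_]+", " ", text.lower()).split())

def digit_punct_ratio(text: str) -> float:
    if not text:
        return 1.0
    n = sum(1 for c in text if c.isdigit() or (not c.isalnum() and not c.isspace()))
    return n / len(text)

def keep_paragraph(text: str, seen_hashes: set) -> bool:
    """Apply hash-based deduplication followed by the heuristic filters."""
    # Deduplicate on the SHA-1 hash of the normalized text.
    h = hashlib.sha1(normalize(text).encode("utf-8")).hexdigest()
    if h in seen_hashes:
        return False
    seen_hashes.add(h)
    tokens = text.split()
    if len(tokens) < 3:                                   # min tokens per sentence
        return False
    if digit_punct_ratio(text) >= 0.25:                   # digit+punctuation ratio
        return False
    if len(re.findall(r"https?://\S+", text)) > 1:        # max URLs per sentence
        return False
    if len(set(tokens)) / len(tokens) <= 0.6:             # type-token ratio
        return False
    if any(tok.lower() in BLOCKLIST for tok in tokens):   # profanity tagging
        return False
    return True

seen: set = set()
paragraphs = [
    "A normal, fluent paragraph of text.",
    "A normal, fluent paragraph of text.",  # duplicate -> dropped
    "1 2 3 4 5 6 7 8 9",                    # too many digits -> dropped
]
print([p for p in paragraphs if keep_paragraph(p, seen)])
```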
The "raw" data is publiclly available in in https://commoncrawl.org/the-data.
• Is the software used to preprocess/clean/label the instances available? If so, please provide a link or other access point. The software is proprietary to Meta Platforms and is currently not publicly available.
• Any other comments? No.

F.6 Uses
• Has the dataset been used for any tasks already? If so, please provide a description. Yes, this dataset and its precursor CC100 have been used to train machine translation models and multilingual language models, which are the foundation of many downstream language tasks.
• Is there a repository that links to any or all papers or systems that use the dataset? If so, please provide a link or other access point. No.
• What (other) tasks could the dataset be used for? This data can be used to pretrain multilingual language models, which are the foundation of many current and future language tasks.
• Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses? For example, is there anything that a future user might need to know to avoid uses that could result in unfair treatment of individuals or groups (e.g., stereotyping, quality of service issues) or other undesirable harms (e.g., financial harms, legal risks)? If so, please provide a description. Is there anything a future user could do to mitigate these undesirable harms? The pipeline for creating this dataset paves the way for building scalable infrastructure for mining datasets used to train large-scale models.
• Are there tasks for which the dataset should not be used? If so, please provide a description. No.
• Any other comments? No.

F.7 Distribution
• Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created? If so, please provide a description. No.
• How will the dataset be distributed (e.g., tarball on website, API, GitHub)? Does the dataset have a digital object identifier (DOI)? N/A
• When will the dataset be distributed? N/A
• Will the dataset be distributed under a copyright or other intellectual property (IP) license, and/or under applicable terms of use (ToU)? If so, please describe this license and/or ToU, and provide a link or other access point to, or otherwise reproduce, any relevant licensing terms or ToU, as well as any fees associated with these restrictions. No.
• Have any third parties imposed IP-based or other restrictions on the data associated with the instances? If so, please describe these restrictions, and provide a link or other access point to, or otherwise reproduce, any relevant licensing terms, as well as any fees associated with these restrictions. No.
• Do any export controls or other regulatory restrictions apply to the dataset or to individual instances? If so, please describe these restrictions, and provide a link or other access point to, or otherwise reproduce, any supporting documentation. N/A
• Any other comments? No.

F.8 Maintenance
• Who is supporting/hosting/maintaining the dataset? Meta AI.
• How can the owner/curator/manager of the dataset be contacted (e.g., email address)? Refer to the main document.
• Is there an erratum? If so, please provide a link or other access point. Currently no.
• Will the dataset be updated (e.g., to correct labeling errors, add new instances, delete instances)? If so, please describe how often, by whom, and how updates will be communicated to users (e.g., mailing list, GitHub)? There is no plan to update it.
• If the dataset relates to people, are there applicable limits on the retention of the data associated with the instances (e.g., were individuals in question told that their data would be retained for a fixed period of time and then deleted)? If so, please describe these limits and explain how they will be enforced. N/A
• Will older versions of the dataset continue to be supported/hosted/maintained? If so, please describe how. If not, please describe how its obsolescence will be communicated to users. N/A
• If others want to extend/augment/build on/contribute to the dataset, is there a mechanism for them to do so? If so, please provide a description. Will these contributions be validated/verified? If so, please describe how. If not, why not? Is there a process for communicating/distributing these contributions to other users? If so, please provide a description. No.
• Any other comments? No.

Figure 3: Zero-shot and in-language few-shot learning performance as a function of model size. The last plot shows the average performance over all five tasks in 0- and 4-shot learning.
(a) Same-language prompting with human-translated templates (b) English prompting

Figure A2: Cross-lingual few-shot in-context learning on the XNLI development set. The leftmost column shows the 0-shot performance (with human-translated templates) for each language. The rest of the matrix shows the difference between 4-shot (per label) and 0-shot (with human-translated templates) performance. Row: target language. Column: source language.
Figure A3: Cross-lingual few-shot in-context learning on the XNLI development set. The leftmost column shows the 0-shot performance (with English templates) for each language. The rest of the matrix shows the difference between 4-shot (per label) and 0-shot (with English templates) performance. Row: target language. Column: source language.
Figure A4: Cross-lingual few-shot in-context learning on the XCOPA development set. The leftmost column shows the 0-shot performance (with machine-translated templates). The rest of the matrix shows the difference between 4-shot (per label) and 0-shot (with machine-translated templates) performance. Row: target language. Column: source language.

Figure A5: Cross-lingual few-shot in-context learning on the XStoryCloze test set. The matrix shows the difference between 4-shot (per label) and 0-shot performance. For XStoryCloze, there is no difference between same-language prompting and English prompting since the task does not use a verbalized template. Row: target language. Column: source language.
When the few-shot training set is randomly sampled, it often has a much more skewed class distribution. For a |Y|-way classification task, a skewed training-set distribution can cause the model to score the majority class as disproportionately more likely than the other classes; Zhao et al. (2021) identify this as the majority label bias problem. As a result, previous work such as Zhao and Schütze (2021) adopts a k-shot-per-class setting, where k unique examples are randomly drawn from each class to form a training set of size k × |Y|.
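A minimal sketch of this k-shot-per-class sampling, assuming examples are dicts with a "label" field:

```python
import random
from collections import defaultdict

def sample_k_per_class(examples: list[dict], k: int, seed: int = 0) -> list[dict]:
    """Draw k unique examples from each class, yielding a training set of
    size k * |Y| with a uniform label distribution (no majority label bias)."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for ex in examples:                # assumes a "label" field per example
        by_label[ex["label"]].append(ex)
    shots = []
    for pool in by_label.values():     # requires len(pool) >= k for each class
        shots.extend(rng.sample(pool, k))
    rng.shuffle(shots)                 # avoid presenting demonstrations grouped by class
    return shots
```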
Figure A6: Knowledge probing on 25 languages. The performance of a random baseline is 0.33, since we downsampled the candidate set of each query to contain three candidates.

Table 1: Model details. size: number of parameters; l: number of layers; h: hidden dimension. Models within the same row have comparable sizes.

Table 2: Handcrafted (English) prompts for multilingual natural language understanding and translation tasks.

Table 4: Multilingual tasks used in our few-shot learning evaluation. All tasks use accuracy as the evaluation metric.

Table 5: 0/4-shot performance of XGLM 7.5B, evaluated on the first 400 examples of XNLI (development set in en, zh, es, and hi) using different prompting approaches. Top: each target language is prompted in the language specified in column 1. Bottom: each target language is prompted in its own language. HW: human-written. MT: machine-translated. HT: human-translated.

Table 10: Comparison of different models on XStoryCloze and XCOPA. †: The Google Translation API is not available for qu; for the averaged translate-test results, we directly used the GPT-3 6.7B repl. model for the qu entry.

Table A2: Average number of few-shot examples that fit within the maximum context length of XGLM (2048 tokens) for each task in our few-shot evaluation benchmark. The languages are sorted by the amount of pre-training data (high to low).
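The per-task counts in Table A2 can be viewed as the output of a simple greedy packing computation, sketched below under the assumption that each demonstration's token length is already known (the tokenizer itself is omitted):

```python
def num_demos_that_fit(demo_token_lens: list[int], max_ctx: int = 2048) -> int:
    """Greedily count how many demonstration examples fit in the context window."""
    used = n = 0
    for length in demo_token_lens:
        if used + length > max_ctx:
            break
        used += length
        n += 1
    return n

# e.g., demonstrations of ~150 tokens each: 13 fit in a 2048-token context
print(num_demos_that_fit([150] * 20))  # 13
```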

Table A4: Comparison between English prompts and MT (machine-translated) prompts for 0-shot and 4-shot learning on the XCOPA dev set using XGLM 7.5B.

Table A6: Distribution of the XNLI few-shot training sets obtained by randomly sampling from the original dev set. E: entailment, N: neutral, C: contradiction.

Table A7: In-language few-shot learning performance of XGLM 7.5B on the XNLI dev set using training sets with a uniform class distribution versus a randomly sampled class distribution. We report the mean and standard deviation (in parentheses) over 5 training sets sampled with different random seeds for each sampling strategy.
