XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented Languages

Data scarcity is a crucial issue for the development of highly multilingual NLP systems. Yet for many under-represented languages (ULs) -- languages for which NLP research is particularly far behind in meeting user needs -- it is feasible to annotate small amounts of data. Motivated by this, we propose XTREME-UP, a benchmark defined by: its focus on the scarce-data scenario rather than zero-shot; its focus on user-centric tasks -- tasks with broad adoption by speakers of high-resource languages; and its focus on under-represented languages where this scarce-data scenario tends to be most realistic. XTREME-UP evaluates the capabilities of language models across 88 under-represented languages over 9 key user-centric technologies including ASR, OCR, MT, and information access tasks that are of general utility. We create new datasets for OCR, autocomplete, semantic parsing, and transliteration, and build on and refine existing datasets for other tasks. XTREME-UP provides methodology for evaluating many modeling scenarios including text-only, multi-modal (vision, audio, and text), supervised parameter tuning, and in-context learning. We evaluate commonly used models on the benchmark. We release all code and scripts to train and evaluate models.


Introduction
The development of natural language processing (NLP) technology that serves most of the world's languages is hindered by the stark lack of data for most languages (Joshi et al., 2020). While there is increasing interest in developing datasets and models for under-represented languages (ULs), existing datasets are often informed by established research directions in the NLP community (de Marneffe et al., 2021). While linguistic tasks such as syntactic parsing have become less practically relevant (Glavaš and Vulić, 2021), other impactful capabilities such as question answering or virtual assistants (Asai et al., 2021) often depend on ancillary technologies such as language ID, data filtering, automatic speech recognition (ASR), or optical character recognition (OCR) that are typically underperforming or unavailable for ULs (Caswell et al., 2020; Bapna et al., 2022; Kreutzer et al., 2022; Rijhwani et al., 2021; Khare et al., 2021). As a result, speakers of ULs are unable to reap the benefits, even if the development of models is successful.
In order to make progress on NLP for ULs, we should thus focus on evaluating models on tasks that are most likely to benefit speakers of those languages. To this end, we propose XTREME-UP (Under-Represented and User-Centric with Paucal Data), a benchmark focusing on evaluation of multilingual models on user-centric tasks in a scarce-data setting.
We focus on tasks that technology users encounter in their daily lives: i) information access tasks reflecting generally useful NLP capabilities; and ii) input/output tasks that enable other technologies. We show the corresponding tasks and their role in interactions with language technology in Figure 1. Moving away from the cross-lingual zero-shot setting (Hu et al., 2020; Ruder et al., 2021), we standardize multilingual in-language fine-tuning based on the amount of data that can realistically be annotated within 8 hours for a language. Our results highlight the limitations of current models on ULs, demonstrate the potential of language models (LMs) to improve user-centric applications, and show the benefit of byte-based approaches.
In this work, we contribute the first massively multilingual few-example benchmark including: a) newly created data for QA, OCR, autocomplete, semantic parsing, and sentence-level transliteration; b) new task setups for named entity recognition (NER) enabling evaluation on natural rather than tokenized text, and for QA and retrieval providing a more interesting setting than the gold passage (GoldP) setup while offering a lower barrier to entry than the full TyDi QA (Clark et al., 2020) or XOR (Asai et al., 2021) tasks; c) carefully designed experimental setups, standardizing in-language fine-tuning and in-context learning and focusing on the information access scenario for ULs for ASR and MT; d) baseline results with commonly used subword and byte-based models.

Related Work
Multilingual benchmarks Some studies employ highly multilingual individual datasets for the evaluation of multilingual models, including Universal Dependencies (de Marneffe et al., 2021) or XL-Sum (Hasan et al., 2021). At the same time, there is increasing work on datasets in ULs for a variety of applications (Niyongabo et al., 2020; Winata et al., 2023; Muhammad et al., 2023). Due to their rapidly growing capabilities, NLP models are increasingly evaluated on suites of datasets. Existing multi-task multilingual benchmarks such as XTREME (Hu et al., 2020), XGLUE (Liang et al., 2020), and XTREME-R (Ruder et al., 2021) cover 20-50 mainly high-resource languages and prioritize tasks with available data, regardless of their utility to speakers. More recently, MEGA (Ahuja et al., 2023) and BUFFET (Asai et al., 2023) evaluate in-context learning on existing multilingual tasks. In contrast, XTREME-UP focuses on under-represented languages, user-centric tasks, and a more realistic scarce-data setting, and introduces new tasks and datasets.

Multilingual evaluation
The choice of the experimental setting and aggregation metric are important considerations in multilingual evaluation. Prior work focused on zero-shot cross-lingual transfer (Hu et al., 2020), which, despite being compelling from a scientific perspective (Artetxe et al., 2020), is less practically useful. While in-language fine-tuning has been explored before (Lauscher et al., 2020; Hedderich et al., 2020), XTREME-UP is the first to standardize the setting across tasks based on realistic annotation costs. Different frameworks aggregate performance in different ways across languages. Blasi et al. (2022) assess the utility of a task by weighting model performance based on the size of the speaker population, while Khanuja et al. (2023) introduce the Gini coefficient to quantify performance disparity across languages. XTREME-UP opts for a simple average over ULs, emphasizing intuitiveness and accessibility of the results.

Design Principles

XTREME-UP is motivated by the following design principles:
Under-represented languages Following Joshi et al. (2020), we select languages in categories 1-3 (e.g., Amharic, Estonian, Kinyarwanda) as under-represented, leaving categories 4-5 as high-resource languages (e.g., English, German, Hindi). We focus on tasks with existing data in ULs and tasks where we can efficiently collect data at scale (see Appendix B for an overview of ULs in XTREME-UP).
User-centric tasks We focus on widely adopted user-facing tasks benefiting speakers of high-resource languages. We further break these down into two major groups: 1) input/output tasks; and 2) information access tasks (see Figure 1).

Scarce data
We focus on a realistic scenario where a small amount of data is available in each UL. Mirroring reality, we do not restrict the amount of training data available in high-resource languages, but rather provide only as many labeled training examples as can be annotated in a realistic amount of time for ULs (see §3.2).
Efficiency We focus on massively multilingual evaluation settings that can still be run efficiently with a modest amount of compute.
Text-centric, yet multi-modal We focus on tasks that can be tackled using textual data alone and provide baseline systems that do so. We frame multi-modal tasks (OCR and ASR) so that natively multi-modal models can be evaluated fairly alongside text-only models. We accomplish this by releasing original audio, image, and text model inputs while also providing baseline system output that can be fed to second-stage text-only systems. We hope to see fully multi-modal models take up this challenge over the coming years.
We provide an overview of the tasks in XTREME-UP in Table 1. We discuss motivation and high-level information in the next section and provide more details for each task in Appendix D.

How much data?
To ensure a realistic amount of training data, we limit the training data in each task per language to the number of examples that can be annotated in 8 hours. We believe this reflects the real difficulty of annotating training and evaluation data for a very large number of languages. In this way, we design for the task first. For each task, we estimate how long it takes a trained annotator to annotate a single example. We base our estimates on prior work and our own annotation efforts. We show the data annotation time estimates in Table 1. For tasks with larger training datasets, we sub-sample the available data accordingly. Table 1 shows the sub-sampled data sizes. We show the input and output format of each task in Table 2. We provide an example instance of each task in Appendix C.
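As an illustration of how this budget translates into dataset sizes, the following sketch derives a per-language example cap from an assumed per-example annotation time and sub-samples accordingly; the time constants and helper names are illustrative, not the released sub-sampling code.

```python
import random

# Hypothetical per-example annotation times in seconds; the actual estimates
# used by XTREME-UP are those reported in Table 1.
SECONDS_PER_EXAMPLE = {"semantic_parsing": 120, "qa": 60, "ner": 40}
ANNOTATION_BUDGET_SECONDS = 8 * 60 * 60  # 8 hours per language

def max_examples(task: str) -> int:
    """Number of examples that fit into the 8-hour annotation budget."""
    return ANNOTATION_BUDGET_SECONDS // SECONDS_PER_EXAMPLE[task]

def subsample(examples: list, task: str, seed: int = 0) -> list:
    """Sub-sample one language's training data down to the budgeted size."""
    cap = max_examples(task)
    if len(examples) <= cap:
        return examples
    return random.Random(seed).sample(examples, cap)
```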

Input / Output Tasks
Automatic speech recognition (ASR; D.1) The goal of ASR is to transcribe speech into human-readable text. It thus serves as a fundamental step for enabling natural language understanding applications on speech input. In many contexts, users may strongly prefer to speak rather than type, so high-quality ASR is an enabling factor for such interactions. We employ the FLEURS dataset (Conneau et al., 2023), consisting of recordings in 102 languages for sentences from FLORES-101 (Goyal et al., 2022), which were translated from English Wikipedia to 101 languages. We evaluate on 77 under-represented languages.
Optical character recognition (OCR; D.2) OCR, the process of converting text from images into machine-readable formats, is used in a wide range of applications, from extracting data only available in paper books (Rijhwani et al., 2020) and imaging legal documents (Singh et al., 2012), to improving accessibility for people with low vision (Mowar et al., 2022). It is especially important for under-represented languages, where both training data and content that users may wish to access may not be available as digital text on the web. We create a dataset that aims to fill the gaps in previous work on OCR for ULs (see Appendix D.2) by focusing on larger-scale, typologically diverse, and user-centric data. Our dataset contains transcriptions for books in seven languages: Amharic (am), Bengali (bn), Kannada (kn), Myanmar (Burmese; my), Sanskrit (sa), Sinhala (si), and Swahili (sw). The books domain is the primary use-case for a large number of downstream users, but is one of the most challenging for OCR models (Rigaud et al., 2019). The dataset consists of transcriptions of entire pages and thus enables leveraging the full context understanding capabilities of large language models.
Autocomplete (D.3) Autocomplete (or predictive text), i.e., predicting the rest of a word a user is typing, is a useful technology that speeds up human-computer interaction (Anson et al., 2006). As such, autocomplete has become a technology that users have come to expect and rely on for input in high-resource languages. The standard next-word prediction task (Sundermeyer et al., 2012) does not accurately reflect this practical setting, as it relies on predicting entire units (words, subwords, or characters); similarly, perplexity-based evaluation makes comparisons across segmentations and languages difficult (Mielke, 2019) and ignores threshold effects associated with top-k predictions in a user interface (Tam and Wells, 2009).
To fill this gap, we introduce a new autocomplete task that unifies character-, subword-, and token-level LM settings by focusing on a "word" as the predictive unit. Models are required to complete the next word based on a left context of N words and an optional character n-gram prefix. We use accuracy@3 for evaluation to reflect the requirement of displaying a limited number of candidates to the user. We process high-quality natural language data from Universal Dependencies (de Marneffe et al., 2021), which we deduplicate against mC4 (Xue et al., 2021), the most common multilingual pre-training corpus, in order to test models' predictive rather than memorization capabilities.
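A minimal sketch of the accuracy@3 metric as described above, assuming each prediction is a ranked list of candidate completions; this is our reading of the metric, not the official scoring script.

```python
def accuracy_at_3(predictions: list[list[str]], targets: list[str]) -> float:
    """Percentage of examples whose gold completion appears among the top-3 candidates."""
    hits = sum(gold in top_k[:3] for top_k, gold in zip(predictions, targets))
    return 100.0 * hits / len(targets)

# e.g. accuracy_at_3([["forward", "for", "form"]], ["forward"]) -> 100.0
```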
Transliteration (D.4) Transliteration is the conversion of text between writing systems (Wellisch, 1978). Unlike translation, it does not change content, only script. Transliteration is important because it allows users to type in their preferred script (e.g., the Latin script) even if it is different from their preferred display script (e.g., Devanagari), and is used internally by many machine translation systems to rewrite names from different scripts.
We extend the Dakshina dataset (Roark et al., 2020), which provides romanizations of Wikipedia sentences written in the native scripts of 12 South Asian languages, with: a) romanizations of native-script Wikipedia for one new language (Amharic); and b) transliteration to a third script (Shahmukhi) for one already covered language (Punjabi). The resulting task covers 13 languages, for which transliteration occurs from the Latin script to the native script and vice versa, as well as between Shahmukhi, Gurmukhi, and Latin for Punjabi.
Machine translation (MT; App. D.5) MT is an important technology for users of ULs wishing to read text written in a different language. However, most current approaches require large amounts of parallel training data to achieve good performance, which are often not available for ULs (Haddow et al., 2022). We focus on the information dissemination scenario where content from high-resource languages (including from tasks such as cross-lingual QA) is translated to enable information access by common users; as such, XTREME-UP includes translations from English into 93 languages, covering a wide range of high-resource languages and ULs. Only 39 ULs are used for evaluation; the high-resource languages are included to allow for transfer learning. The dataset is adapted from FLORES-101 (Goyal et al., 2022), repurposing half of the dataset's original development set as a training set. See §6 for a detailed discussion of how we distinguish freely-available unsupervised data from purpose-annotated supervised data in XTREME-UP.

Information Access Tasks
Question Answering (D.6) Question answering enables responding to natural language questions with answers found in text. We focus on the information-seeking scenario (Kwiatkowski et al., 2019) where questions are asked without knowing the answer. Information-seeking question-answer pairs tend to exhibit less lexical and morphosyntactic overlap between the question and answer since they are written separately.
We include two variants of the task. In in-language QA, both question and passage are in the same language. We obtain original questions and passages from TyDi QA (Clark et al., 2020). For cross-language QA, the question is in the user's native language while the passage and answer are in a language with a large amount of answer content available (English). We use examples from TyDi XOR (Asai et al., 2021) in 7 languages. We additionally collect new data in 23 new Indic languages for cross-lingual QA by professionally translating questions and answers from existing Indic languages in XOR QA. This methodology mitigates the issue of translating Western-centric English data to locales with different topical interests. Cross-lingual QA is especially important for ULs since they lack plentiful in-language answer content on the web.
In XTREME-UP's QA task, a system is given a question, title, and a passage and must provide the answer, if any, or otherwise return that the question has "no answer" in the passage. To this end, we generalize the gold passage (Clark et al., 2020) setting, augmenting it with negative examples. These negatives are obtained from (a) passages within the same article as a passage containing the answer and (b) question-answer pairs from the full TyDi QA dataset where no answer was found in the candidate Wikipedia article. The data is split into training, validation, and test splits in such a way as to avoid duplication and overlap of splits, even across our various QA tasks.
Retrieval for QA (D.6) Within the information-seeking QA scenario, the above core QA task assumes answer candidate passages as an input. In practice, a passage retrieval system for question answering allows for the extraction of relevant text from a vast text corpus. The retrieved passages can then be used by a question-answering system to extract or generate an answer to the user's question. In XTREME-UP, we separate retrieval into two distinct tasks, in-language retrieval and cross-language retrieval. For in-language retrieval, both the questions and passages are in the same language. The preparation of negatives, deduplication, and splits are identical to the QA task above. For validation and test, we create an index of 271k in-language passages (447k English passages for the cross-language task), making for a small enough index for efficient experimentation, while containing distractors that make for a challenging task, since these distractors are drawn from the same articles containing the target passages.
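To make the text-in, text-out framing of the core QA task concrete, the sketch below serializes an example for a text-to-text baseline; the field order, separators, and the literal "No Answer" target string are plausible conventions rather than the benchmark's required format.

```python
def format_qa_example(question: str, title: str, passage: str,
                      answer: str | None) -> tuple[str, str]:
    """Build an (input, target) pair for a gold-passage-with-negatives QA example."""
    source = f"question: {question} title: {title} context: {passage}"
    target = answer if answer is not None else "No Answer"  # negatives have no answer span
    return source, target
```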
Named entity recognition (NER; D.7) NER is an important capability for information access systems that users depend on, with applications ranging from recognizing requests for entity lookups to performing information extraction to populate the knowledge graphs that handle those requests. NER is also a capability needed in spell-checking and localization systems (Li et al., 2020). Identifying entities in ULs poses challenges due to the use of different scripts, lack of capitalization, different numerical representations, etc. We build on MasakhaNER (Adelani et al., 2021) and MasakhaNER 2.0 (Adelani et al., 2022), two large NER datasets in African languages, which provide data in the standard CoNLL tokenized format (Tjong Kim Sang and De Meulder, 2003). In order to enable evaluation in a setting that is closer to the real world, we automatically map the annotated spans to the original raw text. The combined data with byte-level span annotations, termed MasakhaNER-X, covers 20 languages.
Semantic parsing (D.8) Semantic parsing is the task of mapping a natural language utterance to a logical form or a structured interpretation that can be executed by a system such as a virtual assistant. This task is especially timely as users will increasingly want to turn their interactions with assistants and chat-like dialog systems into actions on external systems, which require API calls; this capability is what the semantic parsing task evaluates.
We adapt the test split of MTOP (Li et al., 2021) with professional translators/annotators to 15 languages: Amharic, Belarusian, Bengali, Brazilian Portuguese, Finnish, German, Hausa, Hungarian, Japanese, Russian, Swahili, Tamil, Turkish, Yoruba, and Zulu. Together with the original MTOP languages, the new MTOP++ dataset covers a total of 20 languages. Unlike MTOP, we collect localized data (i.e., Western-centric entities are replaced with more culturally relevant entities for the target language), following recent trends in multilingual benchmarking (Lin et al., 2021; Ding et al., 2022; Majewska et al., 2023).
We also extend MTOP to three widely spoken but under-represented Indic languages in a code-switching setting: Hindi-English, Bengali-English, and Tamil-English. We automatically convert the test split of MTOP to code-mixed utterances using PaLM (Chowdhery et al., 2022) and run human verification on these utterances.

Overall Evaluation
For each task, we evaluate model performance by computing a task-specific score. We employ character-level metrics such as character error rate (CER) and character n-gram F-score (chrF; Popović, 2015) rather than their word-level counterparts, as they enable more fine-grained evaluation and are better suited to morphologically rich languages. We obtain a final score by averaging the scores of all tasks. For each task, we only average performance over ULs (discussed in §3.1). For metrics such as CER where lower is better, we invert the scores before averaging across tasks. For mean reciprocal rank (MRR), which lies in the 0.0-1.0 range, we rescale it to the 0-100 range before averaging. While this scalar provides a quick overall impression of a system's quality across a broad range of tasks, it is not a substitute for analyzing performance on individual tasks, languages, or types of examples.
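The aggregation can be summarized with the following sketch, which assumes per-task scores already averaged over ULs; the mapping from tasks to metric types is illustrative.

```python
def xtreme_up_score(task_scores: dict[str, float]) -> float:
    """Average per-task scores into a single scalar, after normalizing metric directions."""
    error_rate_tasks = {"asr", "ocr", "transliteration"}  # CER-style: lower is better
    normalized = []
    for task, score in task_scores.items():
        if task in error_rate_tasks:
            normalized.append(100.0 - score)   # invert so that higher is better
        elif task == "retrieval":
            normalized.append(100.0 * score)   # MRR in [0, 1] rescaled to 0-100
        else:
            normalized.append(score)           # chrF, accuracy, F1, exact match, ...
    return sum(normalized) / len(normalized)
```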

Experimental setting
Multilingual fine-tuning In contrast to prior benchmarks that focus on zero-shot cross-lingual transfer from English, XTREME-UP focuses on the more realistic scenario of fine-tuning on a small amount of data in the target language. To make this scenario scalable in a massively multilingual setting, XTREME-UP fine-tunes a single model on the combined training data across the available languages for each task. The data for each language is sub-sampled to emulate data sizes that can be realistically annotated within a reasonable time frame (see §3.2).
In-language in-context learning We also provide a 5-shot in-context learning setting where a model is provided with an English instruction and 5 exemplars in the target language, in order to evaluate progress on few-shot learning with large models for ULs. We provide the instruction for each task in Appendix E.
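A sketch of how such a 5-shot prompt could be assembled; the actual instructions are listed in Appendix E, and the "Input:"/"Output:" prefixes here are placeholders rather than the prompts used in our experiments.

```python
def build_five_shot_prompt(instruction: str,
                           exemplars: list[tuple[str, str]],
                           query: str) -> str:
    """English instruction followed by five in-language exemplars and the query."""
    parts = [instruction]
    for source, target in exemplars[:5]:
        parts.append(f"Input: {source}\nOutput: {target}")
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)
```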

Baselines
We provide results on a handful of baseline systems that have already been developed by the research community. Given that our focus in this paper is on the dataset and task setup rather than system building, we do not focus on offering novel model types, nor do we exhaustively evaluate all possible models; rather, we view these results as estimating a starting point from some well-known modeling approaches and seeding contributions from the broader research community.
Multilingual fine-tuning baselines For the main experimental setting of multilingual fine-tuning, we provide the following baselines: mT5-base (Xue et al., 2021), a subword-based multilingual encoder-decoder model; and ByT5-base (Xue et al., 2022), a byte-based multilingual encoder-decoder model.
In-context learning baseline For the in-context learning setting, we employ Flan-PaLM (Chung et al., 2022), an instruction-tuned version of PaLM (Chowdhery et al., 2022). We provide additional information on the baseline systems in Table 3.
Table 3: Additional information on baseline models, including the setting in which we evaluate them (fine-tuning vs. in-context learning), their size, their vocabulary, and the fraction of non-English pre-training data.
To offer baseline systems that allow experimentation with text-only models, we use upstream models to provide initial output for ASR and OCR, and present text-based baselines that use these as inputs. We expect these baselines to give way to fully multi-modal models as research progresses. These initial ASR and OCR outputs should be seen as part of a baseline system, not as part of the XTREME-UP benchmark itself. For ASR, we augment the data with predictions of the state-of-the-art Maestro-U (Chen et al., 2023) and then use a downstream text model to improve the outputs (Bassil and Alwani, 2012). Similarly, for OCR, we use the off-the-shelf Google Vision OCR to get first-pass outputs, and train language models to improve them (Dong and Smith, 2018; Rijhwani et al., 2020).
The choice of prompt and exemplars can have a significant impact on performance (Zhao et al., 2021a,b). We provide a single instruction and set of exemplars per task and language for replicability and leave the search for better instructions and exemplars to future work.
XTREME-UP offers a public results tracker for tracking the community's progress. We conceptualize these results not as a competition, but as offering insights about different models and their trade-offs, with each submission justifying and explaining how it should be compared to the others and how it informs the research landscape. Submissions can be made via self-service git pull requests.

Results
We show the baseline results in Table 4.
Byte-based models outperform subword-based models on ULs. The byte-based ByT5 outperforms the subword-based mT5 across most tasks. Gains are particularly pronounced for tasks that require dealing with information on the character level, such as autocomplete and transliteration, and for predicting information on the word level, such as NER and semantic parsing. These results demonstrate that as we train and evaluate our models on under-represented languages, standard modeling choices such as subword representations fall short.
In-context learning underperforms fine-tuning on limited data. The Flan-PaLM model generally performs worse than fine-tuned models, despite being much larger. Nevertheless, it achieves reasonable performance on machine translation, a task that is likely well reflected in its pre-training data.
On other tasks, however, it fails to reliably apply its English-centric knowledge to ULs. Despite fine-tuned models performing relatively well on NER, the in-context learning model is unable to consistently generalize to the task in a few-shot setting in under-represented languages. On semantic parsing, the model fails to generalize to the large number of domain-specific intents and slots using standard prompting in ULs. The autocomplete tasks in particular demonstrate the lack of robust cross-lingual information in the English-centric PaLM model: it struggles to complete a sentence given a character prefix and fails to reliably convert between different scripts in the same language. XTREME-UP thus provides a strong challenge to test the generalization abilities of in-context learning methods to ULs.
There is a lot of headroom left to improve performance on ULs. Overall, across all tasks there is still a considerable amount of headroom left.
For ASR, OCR, and transliteration, around 10% of characters are still incorrectly predicted. On autocomplete, models only make the correct prediction in about one fourth of all cases. For MT, on average only about a third of n-grams in the hypothesis are also present in the reference, and vice versa. For QA and retrieval, there are large performance differences between in-language and cross-language settings and much headroom still left. On NER, models perform relatively well but are still far from perfect performance on the task. Finally, on semantic parsing, models are only able to produce the correct output in around a third of all cases.
Table 4: Overall results of baselines across all XTREME-UP v1.0 tasks for the test split. Scores on XTREME-UP average over evaluation scores of under-represented languages. QA and retrieval performance is the average of in-language and cross-language settings (indicated in brackets as in-language / cross-language). For OCR, we do not apply any additional models (mT5 nor ByT5) on top of the baseline OCR system; we show these results in parentheses. We do not attempt in-context learning (ICL) results for retrieval since ICL is typically only used for text-in, text-out use cases. For OCR, we use the Google OCR API. † For autocomplete, while we observe reasonable performance on English completions, we find the model typically does a very poor job outside English.

Analyses
Lowest-performing languages Models generally perform poorly on African languages. On transliteration, models perform worst on the newly added Amharic language. On NER, which covers only African languages, performance is lowest for Amharic (likely due to its different script) and the extremely under-represented Ghomálá'. Similarly, translation models underperform on Amharic and Yoruba. On ASR, the lowest-performing language is Yoruba, but models also struggle with other languages such as Gaelic and many Southeast Asian languages such as Lao, Khmer, and Burmese.
Task-specific observations ByT5 provides the best performance, while the size of the model does not seem to impact performance much. Several aspects of the data lead to higher error rates in transliteration: the model struggles with input in the Perso-Arabic script and with producing Latin-script output from a different script.
In all cases, researchers should rigorously report what additional data was used and how; each use case comes with its own considerations and, above all, researchers should make a well-reasoned argument that their use of data (i) does not artificially inflate evaluation scores and (ii) reflects a real-world scenario of finding and applying data. One way to do this is to show that a system pre-trained without the contaminated pre-training data is equivalent, better, or almost as good as some other system. Note that this analysis only needs to be done once for each pre-training corpus (e.g., once for mC4), and it is very likely that organizations with enough compute to pre-train a new model on a new corpus would also have sufficient compute to calculate overlap.

Conclusion
We have presented XTREME-UP, a multilingual benchmark distinguished by its being (i) scarce-data, (ii) user-centric, and (iii) focused on under-represented languages. The benchmark contains input modalities of text, images, and audio while still allowing experimentation with text-only models. We hope this benchmark will be useful in accelerating research that is useful to speakers of under-represented languages and in highlighting both the progress and limitations of current models of language.

Limitations
The dataset presented in this work does not represent all of the world's languages, nor all of the under-represented languages. While we have made efforts to include languages and dialects across a broad variety of geographic regions and language families, it was not feasible to locate or create data in the same set of languages across all tasks. Since this is a data-focused paper, we present modeling results on a few strong modern models; this is not an exhaustive exploration of how all current models may perform on this dataset. We look forward to exploring more under-represented languages as more data becomes available.

B Language Coverage
We provide an overview of the under-represented languages in XTREME-UP in Table 5. For each language, we indicate a) the ISO 639-1 code (or ISO 639-3 code if the former is unavailable); b) its language family according to Glottolog (Nordhoff and Hammarström, 2011); c) the number of datasets in XTREME-UP including the language; d) its resource level based on the taxonomy of Joshi et al. (2020) (0 is the least and 5 the most highly resourced); and e) which tasks include the language.

C Task Examples
We provide an example instance of each task in Table 6.
Automatic speech recognition (ASR) transcribes speech inputs into human-readable text, serving as a fundamental step for various spoken language understanding applications. The transcripts are often calibrated with pre-trained language models to produce the final outputs. In this paper, we build the ASR benchmark in this way: first, transcribe input audio into text with a pre-trained speech recognition model; then calibrate the transcripts by fine-tuning pre-trained language models on paired transcripts and ground truths.
We paired the ASR transcripts with the ground truths to fine-tune the mT5 or ByT5 models. The average character error rate (CER) of Maestro-U is 8.28% across 102 languages, providing a strong baseline. Therefore, we build the ASR benchmark in a selective way: first, we compare the Maestro-U baseline CER on the dev set with the CER obtained by fine-tuned mT5 or fine-tuned ByT5. If the fine-tuned result is better, we choose the fine-tuned model for the language to rescore its test set; otherwise, we keep the baseline Maestro-U results for the test set.

D.1.3 Data structure
We followed the train, dev, and test splits of FLEURS, and filtered out the examples where the Maestro-U prediction is empty (i.e., all the deletion errors). The pairs of transcript and ground truth are saved in jsonl and tsv formats.
The individual language datasets are mostly distinguished by the language and region BCP-47 codes, e.g., the kam_ke code represents the Kamba language spoken in Kenya. In some cases, when multiple writing systems are available for a language, the ISO 15924 script code is used as well, as is the case with the code sd_arab_in that denotes Sindhi as spoken in India and recorded using the Arabic script, as opposed to its Pakistani counterpart.

D.1.5 Experiments and Discussion
We compared fine-tuned mT5-base and ByT5-base baselines, which were built on TPU. In addition, we explored compute-efficient fine-tuning on GPU, using an mT5-small model as the pre-trained model. The three models took 4500, 6500, and 4000 steps to converge, respectively. We report the character error rate for the transcripts predicted by the fine-tuned models against that of the Maestro-U baseline, which is 8.28% on average for 102 languages, quite a strict baseline. We observed small gains through fine-tuning with different pre-trained models, as shown in Table 7.
We observe that ByT5 yields better fine-tuned results than mT5, indicating that the byte is a better modeling unit when it comes to textual data in various writing systems. By calculating the average CER for the group of 24 higher-resourced languages and the group of 78 lower-resourced languages, respectively, we find that both mT5 and ByT5 fine-tuned models reduce CER from the 6.40% baseline to 6.36% for higher-resourced languages, while only ByT5 further improves CER for lower-resourced languages, from the 8.86% baseline to 8.80%.
Fine-tuned ByT5 also generalizes well to languages that were not seen in the pre-training phase. With a limited amount of fine-tuning data, ByT5 improves over the baseline on the group of unseen languages, especially on Umbundu (umb_ao, -14% relative CER). Even though only romanized Chinese was used to pre-train ByT5, the fine-tuned ByT5 outperformed the baselines for both Mandarin (in simplified Chinese, cmn_hans_cn) and Cantonese (in traditional Chinese, cmn_hant_hk).

D.2.2 Related work
While most existing datasets focus on higher-resourced languages (Nayef et al., 2017; Rigaud et al., 2019), there has been recent interest in developing OCR for ULs. This includes the creation of a dataset for OCR on endangered languages (Rijhwani et al., 2020) and a synthetic dataset for 60 languages (Ignat et al., 2022).

D.2.3 Data creation
We retrieve books that are in the public domain on Google Books. These are historic books, where the copyright has expired, as well as more recent public-domain books, used in this dataset with approval from their publishers. We focus on languages with diverse scripts, for which no existing OCR dataset is currently available. We observe that many public-domain books in such languages are religious or linguistic in nature and were created for missionary purposes. In order to identify a diverse set of high-quality books, we first conduct an annotation task where we ask annotators to look at pages of a book and assign whether it is a) not in the target language, b) religious, c) consisting mainly of tables/other structured formatting, d) linguistic (e.g., a dictionary or grammar book), e) not intelligible, or f) low quality. Based on this annotation, we filtered out some languages that did not have a sufficient amount of high-quality public-domain books available. After filtering, the dataset contains annotated documents in seven under-represented languages, described in detail in Section 3.3.

D.3.1 Task description
Autocomplete (or predictive text), i.e., predicting the rest of a word a user is typing, is a useful technology that speeds up human-computer interaction. However, while language modeling (LM) is a core natural language processing (NLP) task, current LM evaluation does not address the practical constraints of human-computer interaction, and current LMs are not directly useful for autocomplete in under-represented languages.
In order to evaluate multilingual models in an evaluation setting as close as possible to the real-world usage of autocomplete, we curated the Universal Dependencies (UD) dataset (Nivre et al., 2020; de Marneffe et al., 2021) according to a set of high-level principles that we describe in the section below.

D.3.2 Data creation
The original UD dataset was filtered to better fit the proposed user-centric paradigm. We removed a) treebanks using only ancient data, for example liturgical text written in Latin, Ancient Greek, or Sanskrit; b) languages with fewer than 100 speakers, such as Akuntsú; c) signed languages, such as Swedish Sign Language; d) highly domain-specific content, for instance SiMoNERo (Mititelu and Mitrofan, 2020), which contains texts from three medical subdomains: cardiology, diabetes, and endocrinology; e) languages that are "high resource" by XTREME-UP standards, with the exception of English, which we kept for prototyping; f) languages that do not have all three of training, validation, and test sets; and g) languages with fewer than 1,000 examples when combining training and validation sets.

D.3.3 Data structure
A data instance has two fields, input and target, for instance {input: "en_-We look f$", target: "forward"}. The input field is composed of a prefix "en_-", which indicates the language to the model, and a context string "We look f$". The target field is the word to predict. We normalize all text with Unicode NFKC normalization (Whistler, 2021).
Annotation process In the following, we describe how the example above is generated from the source data. The original sentence is "We look forward to your active participation to make this forum an exciting meeting place for like minded individuals." The steps are: a) The context sentence including the target can have at most 10 words. A random word of more than 5 characters is chosen to be the target. b) A target prefix is sampled from the target and added to the context; in this example it is the character "f". The sampling rule is to select a number of characters that can vary between 0 and the number of characters in the target minus three. In our example, the target "forward" could yield anything from "" to "forw". c) A special token "$" is added just after the target prefix.
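The steps above can be summarized in the sketch below; it is a re-implementation from the description (edge cases such as sentences with no sufficiently long word are ignored), not the released generation script.

```python
import random

def make_autocomplete_example(sentence: str, lang: str, rng: random.Random) -> dict:
    """Create an {input, target} autocomplete instance from a raw sentence."""
    words = sentence.split()[:10]                    # context (incl. target) of at most 10 words
    long_word_ids = [i for i, w in enumerate(words) if len(w) > 5]
    target_idx = rng.choice(long_word_ids)           # random word of more than 5 characters
    target = words[target_idx]
    prefix_len = rng.randint(0, len(target) - 3)     # 0 .. len(target) - 3 characters
    prefix = target[:prefix_len]
    context = " ".join(words[:target_idx])
    return {"input": f"{lang}_-{context} {prefix}$", "target": target}

# e.g. make_autocomplete_example(
#     "We look forward to your active participation ...", "en", random.Random(0))
```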

D.3.4 Data statistics
We sampled up to 2,000 examples from each language's training set, 1,000 examples from validation, and 1,000 examples from test. This prevents any language from having disproportionately more data; where the original sets were smaller than these targets, we used all available data. We display the language statistics in Table 8. Note that these experiments were done on a preliminary dataset, not the final release version of XTREME-UP.

D.3.6 Results
We observe that ByT5 achieves better performance than mT5 on both Acc@3 and chrF on the autocomplete task, as displayed in Table 9. ByT5 also requires less than half the time to fine-tune on the training set (45 minutes) compared to mT5 (1 hour and 30 minutes).

D.3.7 Analyses
Based on Acc@3 and chrF, the most challenging languages for mT5 are Eastern Armenian (hy) and Uyghur (ug), respectively, whereas Nigerian Pidgin (pcm) and Scottish Gaelic are the easiest languages. For ByT5, whether we consider Acc@3 or chrF, the most challenging language is Uyghur and the easiest language is Galician (gl). Yet, these extremes only offer a qualitative comparison of mT5 and ByT5. Next, we investigate four questions around model performance: a) Do mT5 and ByT5 have the same cross-lingual generalization pattern? b) Do some languages yield higher scores because autocompletion guesses the same words? c) Do some languages yield higher scores because they have a smaller vocabulary in their corpora? d) Does similarity to the Latin alphabet impact model performance? We test several hypotheses below, considering a relationship to be significant when the p-value is under 0.05.
Do mT5 and ByT5 have the same cross-lingual generalization pattern? mT5 and ByT5 have the same cross-lingual generalization pattern if the difficulty of generalizing to a new language is the same for both models relative to other languages.
In other words, if the models' per-language performances are ranked similarly, they share the same cross-lingual generalization pattern. To evaluate this hypothesis, we computed the Spearman's rank correlation between mT5 and ByT5 Acc@3. We obtained a Spearman's rank correlation of 0.69 with p-value < 0.001. This means that the two models have a high degree of relative agreement; in other words, if a new language is added, there is a high chance that it will be challenging (or not) for both mT5 and ByT5.
Do some languages yield higher scores because autocompletion guesses the same words? If our dataset in a given language over-represents a word to predict, then the model might achieve misleadingly good performance by always predicting the same word. This would mean that the dataset is not balanced with regard to the diversity of target words. A common way to model the diversity of a distribution of words is to compute its entropy, so we computed the Pearson correlation between the entropy of the test set's target word distribution in each language and mT5's and ByT5's Acc@3. The entropy of a distribution of words is maximal if every word is different and minimal if it consists of a single word. mT5 and ByT5 displayed correlation coefficients of −0.16 and 0.13, respectively, with p-values of 0.45 and 0.53, respectively. These results show that there is insufficient evidence to conclude that there is a significant linear relationship between target word diversity and model performance, because the p-values are far above the 0.05 significance threshold. Hence, target word diversity is not a good predictor of model performance variability across languages.
Do some languages yield higher scores because they have a smaller vocabulary in their corpora? We expect that languages with smaller vocabularies will be easier to fine-tune on because of a smaller prediction space. To test that hypothesis, we computed the Pearson correlation between the test set's vocabulary size and mT5's and ByT5's Acc@3 for each language. mT5 and ByT5 displayed correlation coefficients of −0.29 and 0.13, respectively, with p-values of 0.17 and 0.54, respectively. Thus there is insufficient evidence to conclude that there is a significant linear relationship between vocabulary size and model performance, because the p-values are above the 0.05 significance threshold.
Does similarity to the Latin alphabet impact model performance? We verify this hypothesis quantitatively by computing the similarity between a) a Latin alphabet composed of the 26 letters in lower and upper case and b) the alphabet of each language, corresponding to all the characters in its test set except punctuation and special characters. The similarity was computed with the Jaccard similarity coefficient (Jaccard, 1908), i.e., the ratio of the number of unique items in the intersection of both alphabets to the number of unique items in their union. We then used the same methodology as before and computed the Pearson correlation between the Jaccard similarity index and chrF, as this metric is more granular with respect to models' character-level performance. We observed correlations of 0.56 and 0.75 for mT5 and ByT5, respectively, both with p-values < 0.01. This indicates that the similarity between the Latin alphabet and each language's alphabet is significantly correlated with mT5 and ByT5 chrF.
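The alphabet-similarity analysis can be reproduced along the lines of the following sketch; the punctuation filter is simplified here, and the per-language inputs are assumed to be the raw test-set text and chrF scores.

```python
from scipy.stats import pearsonr

LATIN = set("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")

def jaccard_to_latin(test_text: str) -> float:
    """Jaccard similarity between a language's character inventory and the Latin alphabet."""
    alphabet = {c for c in test_text if c.isalpha()}  # drop punctuation/special characters
    return len(alphabet & LATIN) / len(alphabet | LATIN)

def alphabet_similarity_correlation(test_text_by_lang: dict, chrf_by_lang: dict):
    """Pearson correlation (and p-value) between alphabet similarity and per-language chrF."""
    langs = sorted(chrf_by_lang)
    similarities = [jaccard_to_latin(test_text_by_lang[lang]) for lang in langs]
    scores = [chrf_by_lang[lang] for lang in langs]
    return pearsonr(similarities, scores)
```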

D.3.8 Evaluation and Discussion
Whether we use a word-level metric like Acc@3 or a character-level metric like chrF, ByT5 is more accurate at autocomplete than mT5. We also observe that these models generalize more easily to languages written in an alphabet closer to the Latin alphabet, with ByT5 being more sensitive to the alphabet of the input language.

D.4.1 Task description
Transliteration is the conversion of text in one writing system to another writing system, e.g., text written in the Devanagari script to the Latin script. It differs from translation in that it does not change the language content of the text, just the script. Many languages are written in multiple scripts, and the current task involves transliterating whole sentences, not just isolated terms, from one script to another.

D.4.2 Data Creation and Annotation process
Most of the data for the task comes from the romanized full-string subset of the Dakshina dataset (Roark et al., 2020), in which 10,000 Wikipedia sentences written in the native scripts of the 12 languages were human-romanized by native speakers, resulting in parallel sentences in the native and Latin scripts. Two 10,000-sentence additions were made to this data for the current transliteration task: Amharic Wikipedia sentences were similarly manually romanized by native speakers; and the Punjabi sentences from the Dakshina dataset, originally written in the Gurmukhi (Brahmic) script, were manually transliterated by native speakers to the Shahmukhi (Perso-Arabic) script.

D.4.3 Data Preparation
The resulting collection allows for 30 tasks overall, converting between various scripts. These are summarised in Table 10, where, for each language indicated by its BCP-47 code (Phillips and Davis, 2009), the corresponding transliteration tasks are shown for scripts indicated by their ISO 15924 codes (ISO, 2004). All the native script data was normalized using Unicode NFC (Whistler, 2021). The data was then further transformed using language-specific visual normalization for Brahmic and Perso-Arabic writing systems using the Nisaba script normalization library (Johny et al., 2021; Gutkin et al., 2022). Both NFC and visual normalization operations preserve visual invariance of the input text, with visual normalization handling many ambiguous cases that fall outside the scope of standard NFC.

D.4.4 Data Statistics
For each task, we provide 2,000 training sentences, 2,000 development sentences, and close to 6,000 test sentences. Training data for any pre-trained models used in the task cannot include the Dakshina dataset. Since this is a contextual few-shot transliteration benchmark, we do not provide the romanization lexicons that were released in the Dakshina dataset along with the full-sentence romanizations.
Our few-shot contextual transliteration task covers 13 languages from 3 language families (Indo-Aryan, Dravidian and Semitic), all but one (Amharic) from South Asia.

D.4.5 Directionality and Evaluation Ambiguity
One difference between romanization in these languages and transliteration in the opposite direction (from the Latin script to the native script) is that none of the languages in the benchmark has an orthography in the Latin script, i.e., there is no single correct spelling in the Latin script for these languages. Rather, individuals tend to provide a rough phonetic transcription of the sentences using the Latin script. As a result, word identity may be difficult to achieve (hence a high word error rate), but string similarity should be relatively high between quality romanizations; hence we use character error rate to evaluate the transliterations. The ability to produce romanizations automatically has several key use cases, including simulating parallel data from mono-script language samples and multilingual modeling of languages that use different scripts. For that reason, we include both directions in the benchmark.
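For reference, character error rate as used here can be computed as the character-level edit distance divided by the reference length; the sketch below is one standard implementation, not the benchmark's scoring code.

```python
def character_error_rate(hypothesis: str, reference: str) -> float:
    """CER = Levenshtein distance between character sequences / reference length (in %)."""
    m, n = len(hypothesis), len(reference)
    # dp[i][j]: edit distance between the first i hypothesis and first j reference characters
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            substitution_cost = 0 if hypothesis[i - 1] == reference[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,                      # deletion
                           dp[i][j - 1] + 1,                      # insertion
                           dp[i - 1][j - 1] + substitution_cost)  # substitution / match
    return 100.0 * dp[m][n] / max(n, 1)
```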

D.4.6 Experimental Setup
Previously, Xue et al. (2022) performed ByT5 fine-tuning and evaluation of the transliteration and romanization directions separately on single-word, rather than full-sentence, data from the vanilla Dakshina dataset. In this benchmark, we remove the separation into transliteration and romanization by requiring all tasks to be fine-tuned jointly. In order to achieve this, during all stages of training, development, and testing, a special code is prepended to the input feature strings for each task. This task code indicates that the input features correspond to the conversion from writing system Source to writing system Target for a language lang. It is encoded as the string "lang_Source_Target". For example, for Punjabi (pa) conversion from the Shahmukhi (Arab) to the Gurmukhi (Guru) writing system, the task code is "pa_Arab_Guru".
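Concretely, an input for the joint model might be constructed as in the sketch below; the whitespace separator between the task code and the sentence is an assumption, as the text above only specifies the code itself.

```python
def format_transliteration_input(lang: str, source_script: str,
                                 target_script: str, sentence: str) -> str:
    """Prepend the task code so that all 30 directions can be fine-tuned jointly."""
    return f"{lang}_{source_script}_{target_script} {sentence}"

# e.g. format_transliteration_input("pa", "Arab", "Guru", "...") yields an input
# that starts with the task code "pa_Arab_Guru".
```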
In the full fine-tuning setup, we jointly fine-tune the 30 transliteration tasks using mT5 and ByT5 models in Small, Base, and Large configurations, which correspond to around 300M, 582M, and 1.2B parameters, respectively (Xue et al., 2021, 2022). Fine-tuning uses 10K training steps with a batch size of 128. We used Google TPU-v3 accelerators (Kumar et al., 2019) for fine-tuning all the configurations apart from ByT5 Large, for which a more powerful TPU-v4 (Pope et al., 2022) was necessary.

D.4.7 Evaluation and Discussion
The evaluation results of the full fine-tuning setup described above are provided in Table 11, which shows the character error rate (CER) for each of the 30 transliteration tasks in six configurations, along with the corresponding averages over all the tasks. Some general trends are observable in these baseline results. The ByT5 error rates are generally substantially better than mT5's, and, while the size of the configuration matters for mT5, it does not seem to matter much for ByT5. Overall, romanization is harder, i.e., transliterating into the Latin script yields higher error rates than transliterating out of it, perhaps due to the fact that there is no set orthography in the Latin script in those languages. For the best-performing configuration (ByT5-Base), 9 out of the 10 tasks with the lowest CER are from the Latin script to a native script. All of the tasks with the highest CER are into either the Latin or Perso-Arabic scripts, and all of the tasks transliterating Perso-Arabic input have worse-than-median CER. In other words: Perso-Arabic input is hard; Latin output is hard; and Perso-Arabic to Latin is particularly hard.
Why is this dataset part of XTREME-UP? Machine translation is an important tool for expanding language coverage for natural language processing tools. FLORES-101 is a high-quality, highly multilingual dataset.
Data Statistics 50% of the FLORES-101 dev split was reserved for training and the remainder for validation. The original devtest split was unchanged and reserved for testing. This results in 499/498/1012 sentence pairs for train/validation/test, respectively.

Dataset Curators
The original dataset was curated by the NLLB (No Language Left Behind) Team (flores@fb.com). The version included in XTREME-UP was curated by Parker Riley (prkriley@google.com) and Isaac Caswell (icaswell@google.com).

Curation Rationale
The original FLORES-101 dataset was created to enable evaluation of machine translation models in many languages. The version released in XTREME-UP was created to focus on low-resource languages and to provide an in-domain train split along with validation and test splits, all with sizes in line with other tasks in XTREME-UP.

Data Sources
The source data (selected by the NLLB Team) comes from Wikinews, Wikijunior, and Wikivoyage.
Dataset Creation Details of the creation of the original dataset are available in the original publication (Goyal et al., 2022).
Changes to the Original Dataset for XTREME-UP The version of the dataset in XTREME-UP only has the source and target strings, removing additional metadata. We also include 93 of the original 100 non-English languages (the subset supported by Google Translate). Of these, only 39 are used for official evaluation.
Data Structure Each example contains the following fields:
1. question: a question in the target language (string)
2. title: the title of the evidence passage; target language for the in-language setting, English for the cross-language setting (string)
3. passage: the evidence passage, which might contain an answer to the question; target language for the in-language setting, English for the cross-language setting (string)
4. answer: the answer (if any) to the question (string)
Data Example See Table 6.
Languages See Table 5.
Data Statistics See Table 1.
Data Sources Evidence text was sourced from Wikipedia.
Dataset Creation Details of the creation of the original dataset are available in the original TyDi QA and XOR QA publications.
Why is this dataset part of XTREME-UP? Named entity recognition is a fundamental task in natural language processing. The MasakhaNER datasets are high-quality multilingual datasets that provide data in 20 African languages. The data is human-annotated and thus of higher quality than automatically collected NER datasets.

D.8.2 Data creation
In this section, we describe the two processes used to extend the MTOP instances: the first involves translation and localization by professional translators, and the second involves code-switching using a language model with verification by human annotators.
In both processes, we perform a linearization step of the query and parse. Given an English utterance from the MTOP English test set and the corresponding slot information (slot names, each with start and end bytes), we add slot tags around the corresponding tokens in the query (Figure 2).
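A sketch of this linearization step is shown below; the bracketed tag syntax is a guess at the format sketched in Figure 2, and MTOP's byte offsets are treated as character offsets here for simplicity.

```python
def linearize_query(utterance: str, slots: list[dict]) -> str:
    """Wrap each slot span of an MTOP utterance in a bracketed slot tag.

    `slots` is assumed to be a list of {"name", "start", "end"} dictionaries
    with offsets into the utterance.
    """
    pieces, cursor = [], 0
    for slot in sorted(slots, key=lambda s: s["start"]):
        pieces.append(utterance[cursor:slot["start"]])
        pieces.append(f"[{slot['name']} {utterance[slot['start']:slot['end']]}]")
        cursor = slot["end"]
    pieces.append(utterance[cursor:])
    return "".join(pieces)

# e.g. linearize_query("set an alarm for 8 am",
#                      [{"name": "SL:DATE_TIME", "start": 13, "end": 21}])
# -> "set an alarm [SL:DATE_TIME for 8 am]"
```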
Motivation Recently, researchers have published more multilingual semantic parsing datasets that focus on virtual assistant domains (Li et al., 2021; FitzGerald et al., 2022; Moghe et al., 2022; Goel et al., 2023). We extend a portion of an existing semantic parsing dataset to new languages, targeting the following features: a) high-quality utterances produced by professional translators; b) a wide range of domains and intents; c) inclusion of different language families and some under-represented languages; d) sentences with culturally relevant entities; and e) code-mixed sentences, i.e., multiple languages within the same sentence, a common phenomenon in multilingual societies.
Translating MTOP to 15 languages: We take the bracketed versions of the slot-tagged English sentences from MTOP and create translation and localization tasks to be carried out by professional translators. We ran two pilots on a small sample of the data to gather feedback and improve the annotation guidelines. The translators had to translate the original utterances to a given target language, while keeping the brackets around slot value translations and localizing those where possible. Once the pilots were completed without issues, we scaled the tasks to the full test set.
We carried out manual inspections on samples of the data to check whether translation and localization were done correctly, and a set of automatic checks on the full data to ensure that slots matched between the original and translated utterances. Data was sent back to annotators until all issues were fixed.

Code-switching MTOP to 3 Indic languages:
We use PaLM to convert the linearized query into a code-mixed query using few-shot prompting. We experimented with different discrete prompt design strategies and selected the best prompts after a qualitative evaluation on a small held-out set (11 examples) covering all 11 domains. Specifically, we experimented with the three designs below.
• Naive prompting. The prompt contains (a) the task description, followed by a set of examples consisting of (b) the original English linearized query and (c) the corresponding code-mixed version.
• Parallel sentence prompting. In this case, the prompt contains (a) the task description, (b) the original English linearized query, (c) the target translated query (obtained with Google Translate), and (d) the corresponding code-mixed query.
• Parallel reordered sentence prompting. Similar to the previous design, except that the target translated queries are human-written.
We observed that parallel sentence prompting produced higher-quality utterances, with 7/11 correct conversions for Hindi-English, 6/11 for Bengali-English, and 8/11 for Tamil-English. We used this strategy to design prompts with the help of native speakers of those languages. We selected 21 sentences from the training split for creating the corresponding exemplars for the prompts. With the latter, we performed few-shot prompting with the 62B PaLM model and converted the test split of MTOP to a code-switched corpus.
Human annotators then had to check the PaLM-generated data for the presence of code-mixing and for the labeling to be consistent between the original query and the code-mixed version. The annotators were instructed to fix the automatically generated data whenever they found such issues.

D.8.3 Data structure and statistics
To create the training, validation, and test splits for MTOP, we start from the English test set and remove intents with fewer than 10 examples. This leaves us with 53 intents and a maximum of 4,223 examples for each language (some original MTOP languages may have fewer examples, while our code-switched data may have more due to multiple paraphrases).
For each intent, we randomly select training examples such that each slot is covered by at least one example, for a minimum of 5 examples. We end up with training, development, and test sets containing, respectively, a maximum of 285, 239, and 3,669 instances for each language.

D.8.4 Experiments
We fine-tune mT5 (Xue et al., 2021) and ByT5 (Xue et al., 2022) in their base and large configurations on the multilingual training data we collected. Table 12 contains the exact match accuracies of a multilingual model trained on data from all languages but the code-switched sets. Table 13 contains the results of a model that includes the code-switched sets. From both tables, we can see that ByT5-base is more accurate than the other models, even compared with the larger ones. This surprising result confirms similar findings on word-level tasks reported by Xue et al. (2022) and Nicosia and Piccinno (2022). We expect mT5 to catch up with ByT5 at larger sizes.

E In-context learning examples
We show in-context learning examples for a selection of tasks in Table 14. Each example consists of a general instruction and prefixes for the input and target, which are repeated for each exemplar.
Table 6: Examples of each task in XTREME-UP. The tasks are generally text-in, text-out with a few exceptions. On the output side, autocomplete requires generating the top-3 outputs and retrieval outputs document identifiers; current systems tend to implement retrieval by mapping both inputs and candidate outputs to vectors and performing nearest-neighbor lookup. On the input side, speech recognition has audio input and document OCR has image input; our initial baseline systems use external systems to map these to text as a preprocessing step, though we hope to see multi-modal systems eliminate this step in the near future.

Figure 1: The tasks in XTREME-UP and their role in language technology. Left: enabling access to language technology; middle: facilitating information access as part of larger systems (question answering, information extraction, virtual assistants); right: making information accessible in the speaker's language.
D.1.4 Data statistics
The FLEURS dataset contains about 1.4k hours of audio in total for 102 languages. The training data contains 271,488 examples across 102 languages, and the average length per utterance is about 20 tokens. There are 34,661 examples in the validation (dev) set, and 77,943 examples in the test set.
Point of Contact (original version): NLLB Team (flores@fb.com).
Dataset names: TyDi QA, XOR-TyDi QA. Additional cross-lingual data was collected as part of XTREME-UP, following a similar methodology.
Why is this dataset part of XTREME-UP? Question answering enables information access.


Table 1: The tasks in XTREME-UP. For each task, we show both the sum of training examples across all languages (to give some insight into training scale) and the average number of training examples for each under-represented language (to highlight the challenge of the scarce-data learning scenario). XTREME-UP does not limit supervised training data in high-resource languages (HLs), while each under-represented language (UL) has a maximum of 8 hours of annotation effort in its training split; see the last column for estimated annotation effort. We also show the sum of validation and test examples across ULs, as XTREME-UP evaluates only on ULs.

Table 2: The input and output format of each task in XTREME-UP. Tasks are generally text-in, text-out with a few exceptions. See Appendix C for task examples.

Table 5: Overview of under-represented languages covered in XTREME-UP.

Table 7: ASR tasks evaluated using the CER metric at 4K steps of fine-tuning mT5 and ByT5 Small and Base models.

Table 13: Semantic parsing: exact match (EM) accuracies of mT5 and ByT5 models of different sizes trained multilingually on few-shot data. Here the multilingual training data includes three code-switched Indic languages, and we report EM for those languages.

Table 14: In-context learning examples.