NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages

Democratizing access to natural language processing (NLP) technology is crucial, especially for underrepresented and extremely low-resource languages. Previous research has focused on developing labeled and unlabeled corpora for these languages through online scraping and document translation. While these methods have proven effective and cost-efficient, we identify limitations in the resulting corpora, including a lack of lexical diversity and cultural relevance to local communities. To address this gap, we conduct a case study on Indonesian local languages. We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets. Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content. In addition, we present the NusaWrites benchmark, encompassing 12 underrepresented and extremely low-resource languages spoken by millions of individuals in Indonesia. Our empirical experiments using existing multilingual large language models demonstrate the need to extend these models to more underrepresented languages. We release the NusaWrites dataset at https://github.com/IndoNLP/nusa-writes.

In this work, we compare three corpus collection methods for 12 underrepresented languages in Indonesia, namely Ambon (abs), Batak (btk), Betawi (bew), Bima (bhp), Buginese (bug), Javanese (jav), Madurese (mad), Makassarese (mak), Minangkabau (min), Palembang / Musi (mui), Rejang (rej), and Sundanese (sun). We chose Indonesian local languages as our case study because of the language diversity in Indonesia, with more than 700 languages spoken, most of which are underrepresented and extremely low-resource (Cohn and Ravindranath, 2014; Aji et al., 2022b). Bang et al. (2023) categorize Javanese (jav) and Sundanese (sun) as low-resource languages, and the others as extremely low-resource languages. Ambon (abs), Bima (bhp), Makassarese (mak), Musi (mui), and Rejang (rej) have no publicly available labeled or unlabeled corpora despite having millions of speakers. We provide information on the 12 low-resource languages under study in Table 10. We conduct two manual data construction efforts for the 12 languages, topic-focused paragraph writing (NusaParagraph) and human translation by native speakers (NusaTranslation), and benchmark them against online scraping. For online scraping, we utilize Wikipedia as the main source, as it covers some of the Indonesian local languages under study. Figure 1 summarizes the corpora constructed by each approach: Wikipedia, NusaParagraph, and NusaTranslation for online scraping, paragraph writing, and human translation, respectively. NusaParagraph tends to contain fewer English and Indonesian lexicons, indicating that its texts are more relevant to the local cultures than those of the other approaches.
We build a new benchmark for the 12 Indonesian local languages, namely NusaWrites, using the texts produced through topic-focused paragraph writing and human translation. NusaWrites covers 5 natural language understanding tasks (e.g., emotion classification, sentiment analysis) and one natural language generation task (i.e., machine translation), and complements NusaX (Winata et al., 2023), a contemporaneous work on 10 Indonesian local languages for sentiment analysis and machine translation. We also demonstrate the inability of (1) fine-tuned Indonesian and multilingual language models (LMs) and (2) zero-shot prompting via large LMs (LLMs) to adapt to these languages, indicating that these languages are distinct from those covered by existing models.
Our contributions in this work are four-fold:
• We compare various corpus collection methods for underrepresented and extremely low-resource languages. We show that paragraph writing is the most promising strategy for building high-quality and culturally relevant corpora.
• We extend the NLP resource coverage of Indonesian local languages with 5 new languages: Ambon (abs), Bima (bhp), Makassarese (mak), Musi (mui), and Rejang (rej).
• We propose NusaWrites, a benchmark of new high-quality, human-annotated corpora covering 12 underrepresented languages in Indonesia with 5 downstream tasks.
• We conduct extensive analysis to showcase the similarity between the languages under study and Indonesian, as well as the inability of existing LLMs to process these languages.

Indonesian Local Languages in Wikipedia
Figure 2 describes the Indonesian local languages covered in Wikipedia, compared against all other existing languages. In total, there are only 11 local languages (out of 700+ (Aji et al., 2022b)), with Minangkabau (min), Javanese (jav), and Sundanese (sun) having a relatively large number of documents, around ∼100,000 articles, while the remaining languages have fewer than ∼10,000 articles. Despite its relatively large scale in Wikipedia, the text quality is not consistently as good as reported in the WikiMatrix dataset (Schwenk et al., 2021). Kreutzer et al. (2022) further find that ∼30% of the correct translation data in English-Javanese are either boilerplate or low-quality texts.
To further verify the quality of Indonesian local languages in Wikipedia, we conduct an analysis to measure lexical diversity in two ways: 1) calculating the cumulative token distribution per language and 2) measuring length-agnostic lexical diversity metrics, i.e., moving average type-token ratio (MATTR) (Covington and McFall, 2010), measure of textual lexical diversity (MTLD) (McCarthy, 2005), and mean segmental type-token ratio (MSTTR) (Johnson, 1944). We use LexicalRichness (Shen, 2021, 2022) v0.5.0 to calculate these metrics. Based on our analysis in Table 1, we show that some Indonesian local languages in Wikipedia have much lower lexical diversity despite having a fair number of Wikipedia articles, especially Buginese (bug), Acehnese (ace), Gorontalo (gor), and Nias (nia). A further inspection of the Wikipedia corpus, presented in §4.1 and Appendix G, shows that Wikipedia articles for these languages tend to comprise many boilerplate texts, especially for the Buginese (bug) Wikipedia.
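For illustration, these length-agnostic diversity metrics can be computed with the lexicalrichness package along the lines of the sketch below; the exact API calls, parameter values, and corpus loading are assumptions for illustration rather than our released analysis script.

    # A minimal sketch of the lexical diversity measurement, assuming the
    # lexicalrichness (v0.5.0) API; corpus loading and file names are illustrative.
    from lexicalrichness import LexicalRichness

    def diversity_scores(text, mtld_threshold=0.72, window=20):
        """Compute MATTR, MTLD, and MSTTR for a corpus given as one string."""
        lex = LexicalRichness(text)
        return {
            "MATTR": lex.mattr(window_size=window),      # moving average type-token ratio
            "MTLD": lex.mtld(threshold=mtld_threshold),  # measure of textual lexical diversity
            "MSTTR": lex.msttr(segment_window=window),   # mean segmental type-token ratio
        }

    # Hypothetical usage on a per-language corpus dump:
    # scores = diversity_scores(open("bug_wikipedia.txt", encoding="utf-8").read())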

Indonesian Local Languages in Other Sources
Other than Wikipedia, there are other large multilingual corpora such as CommonCrawl, mC4 (Xue et al., 2021), OSCAR (Suárez et al., 2019), FLORES-200 (Guzmán et al., 2019; Costa-jussà et al., 2022), and the Bible corpus. Nevertheless, most sources, except the Bible corpus, only support a few widely spoken languages of Indonesia, i.e., Indonesian (ind), Javanese (jav), and Sundanese (sun), rendering them ineffective for studying the hundreds of local languages spoken in Indonesia. The Bible corpus, on the other hand, covers 14 Indonesian local languages. Interestingly, these languages have an extremely low number of speakers, with an average population of 40k people. On the contrary, Wikipedia covers Indonesian local languages with a larger number of speakers, with Nias (nia) being the smallest (nearly 770k speakers). In this work, we particularly focus on Indonesian local languages with larger population sizes (∼500k or above), and leave the exploration of smaller-scale languages for future work.

Corpus Construction for Indonesian Local Languages
We conduct corpus construction through human annotation by expert workers in two ways: (1) sentence translation and (2) paragraph writing. Sentence translation is a widely used parallel data collection method (Conneau et al., 2018; Hu et al., 2020; Winata et al., 2023), while paragraph writing (Koto et al., 2022a) is explored to capture culturally relevant aspects that are often lost in translation (Kirkpatrick and van Teijlingen, 2009). The details of our expert annotator recruitment are shown in Appendix B. In the following sections, we describe how the data construction is done for both methods.

Sentence Translation
Data Selection We sample data from two sources, i.e., IndoLEM sentiment (Koto and Rahmaningtyas, 2017; Koto et al., 2020), an Indonesian sentiment analysis dataset collected from Twitter and hotel reviews, and EmoT (Saputri et al., 2018; Wilie et al., 2020), an Indonesian emotion classification dataset collected from Twitter. We take all samples from both IndoLEM sentiment (5,048 samples) and EmoT (4,401 samples) as our source language data, resulting in a total of 9,449 sentences for translation.

Paragraph Writing
We conduct paragraph writing by instructing the annotators to write a 100-word paragraph on a given topic. The topics for paragraph writing are manually designed to cover a wide range of domains. We conduct paragraph writing in 10 languages, i.e., Batak (btk), Betawi (bew), Buginese (bug), Javanese (jav), Madurese (mad), Makassarese (mak), Minangkabau (min), Musi (mui), Rejang (rej), and Sundanese (sun). Note that, unlike sentence translation, Ambon (abs) and Bima (bhp) are not included, while Buginese (bug) is added. This is due to the difficulty of obtaining a large enough pool of annotators for the Ambon (abs) and Bima (bhp) languages.
Topic Selection We provide a list of topics before instructing the annotators to write paragraphs. The selection of varied topics is expected to enrich the vocabulary of the corpus, as different topics naturally elicit different terms. The topics provided vary widely, ranging from food and beverages, entertainment/leisure, and sports to science, history, politics, and religion. In addition, we include topics for describing emotional states such as sadness, happiness, anger, etc. We also provide more specific subtopics for each of the main topics. In total, we have 20 main topics with 20 subtopics for each main topic. The list of all topics is given in Appendix D.
Paragraph Writing Procedure The paragraph writing is done with the following criteria: (a) the paragraph consists of a minimum of 100 words, (b) only the targeted local language is used except for named entities, (c) the content of the paragraph is about the provided topics and subtopics, and (d) for each paragraph, the annotator indicates its rhetoric type, which is either narration, description, argumentation, persuasion, or exposition. More details about the paragraph writing procedure are in Appendix E.

Quality Control
Quality control is conducted to ensure the data are correct through manual and automatic validation. If the data do not meet the desired criteria, they are revised. Specifically, through a series of manual and automatic validations, we ensure that all sentences that need to be translated are translated into the target language, with minimal overlap with the source language sentence. For paragraph writing, we ensure that there is no plagiarism from external sources by validating paragraphs through search engines, and we also ensure a minimum of 30% distinction between any two paragraphs (measured using edit distance). The details of our quality control process are described in Appendix F. Quality control is conducted over several iterations, asking annotators to rewrite unqualified instances until all instances pass.

Resulting Corpora
Through sentence translation, we collect a total of 72,444 sentences: 1,579 for Bima (bhp); 1,574 each for Ambon (abs), Musi (mui), and Rejang (rej); and 9,449 each for Batak (btk), Betawi (bew), Javanese (jav), Madurese (mad), Makassarese (mak), Minangkabau (min), and Sundanese (sun) (see Appendix C). Through paragraph writing, we collect a total of 56,395 paragraphs (see Appendix E for per-language counts).

NusaWrites over Wikipedia
In this section, we compare the quality of various corpus collection methods, i.e., online scraping from Wikipedia, sentence translation (NusaTranslation), and paragraph writing (NusaParagraph), on 5 Indonesian local languages: Buginese (bug), Javanese (jav), Madurese (mad), Minangkabau (min), and Sundanese (sun). The statistics of each corpus collection method are shown in Appendix G. In general, Wikipedia has a larger token count and unique token coverage for Javanese (jav), Sundanese (sun), and Minangkabau (min). For Madurese (mad), on the other hand, the Wikipedia corpus is very small, with only 110k tokens; in this case, sentence translation and paragraph writing provide a huge advantage over collecting data from Wikipedia. Interestingly, while the #tokens of Buginese (bug) in Wikipedia is rather large, the #unique tokens is very small, even compared to the smaller Madurese (mad) data. Additionally, the #tokens/document is quite small, indicating short documents per Wikipedia article. These facts show that the Buginese (bug) data in Wikipedia comprises many short boilerplate texts, which are not useful for learning the language.
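As an illustration of how these statistics can be gathered, the sketch below computes #tokens, #unique tokens, and #tokens/document from a list of raw documents; the whitespace tokenization and document granularity are simplifying assumptions rather than our exact preprocessing.

    # A minimal sketch (not the exact preprocessing pipeline) of the corpus
    # statistics compared here: #tokens, #unique tokens, and #tokens/document.
    from collections import Counter

    def corpus_stats(documents):
        """documents: list of raw text strings (one per article/paragraph/sentence)."""
        vocab = Counter()
        n_tokens = 0
        for doc in documents:
            tokens = doc.split()  # whitespace tokenization, for illustration only
            vocab.update(tokens)
            n_tokens += len(tokens)
        return {
            "#tokens": n_tokens,
            "#unique_tokens": len(vocab),
            "#tokens/document": n_tokens / max(len(documents), 1),
        }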
As the data size differs across corpus collection methods, we further compare the corpus quality generated by each method using three criteria that are less sensitive to data size: 1) length-agnostic lexical diversity metrics; 2) the empirical language modeling quality of LMs trained on each generated corpus and evaluated on held-out text from NusaX (Winata et al., 2023), a human-translated corpus of Indonesian social media posts and online reviews; and 3) the ratio of borrowed words in each generated corpus.

Lexical Diversity
To measure lexical diversity, we compute the length-agnostic lexical diversity metrics, i.e., MATTR (Covington and McFall, 2010), MTLD (McCarthy, 2005), and MSTTR (Johnson, 1944), for each corpus collection method in Figure 3. For MTLD we use a threshold of 0.72, while for MATTR and MSTTR we use a window size of 20. For the low-resource languages, i.e., Javanese (jav) and Sundanese (sun), all three methods produce an almost equally diverse corpus, with slightly higher diversity for sentence translation. For the extremely low-resource languages, compared to the other methods, Wikipedia achieves slightly higher diversity scores on Minangkabau (min), and NusaTranslation achieves slightly higher scores on Madurese (mad). Nonetheless, using a permutation test (n = 1,000) (Koplenig, 2019), we conclude that the differences between corpora are statistically significant (p < 0.05) for all metrics and languages, except for Madurese (mad) between NusaTranslation and NusaParagraph on the MATTR and MTLD metrics, and for Sundanese (sun) between NusaTranslation and NusaParagraph on the MATTR and MSTTR metrics. Interestingly, for Buginese (bug), Wikipedia achieves very low diversity scores, while NusaParagraph achieves high diversity scores, which shows that a large number of sentences in the Buginese (bug) Wikipedia data follow a repeating, boilerplate-like pattern. In addition to the diversity metrics, we also measure the lexical overlap with Indonesian and English lexicons obtained from Panlex (Kamholz et al., 2014). As shown in Figure 1, Wikipedia has a higher overlap with the English lexicon, indicating that it covers many shared foreign terms (e.g., scientific terms) and foreign entities (e.g., names of cities, tourist attractions, etc.), which are not common in the actual day-to-day use of Indonesian local languages, as these languages are mostly used for daily conversation rather than in formal settings such as academia (Cohn and Ravindranath, 2014; Soeparno, 2015; Nurjanah et al., 2018; Nur, 2018; Sutrisno and Ariesta, 2019).
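The permutation test above compares a diversity metric between two corpora by repeatedly reshuffling documents across them; a minimal sketch is given below, where the metric function and the document-level shuffling granularity are illustrative assumptions rather than the exact protocol of Koplenig (2019).

    # A hedged sketch of a permutation test (n = 1,000) on the difference of a
    # lexical diversity metric between two corpora; details are illustrative.
    import random

    def permutation_test(docs_a, docs_b, metric, n_perm=1000, seed=0):
        """Return an empirical p-value for the metric difference between corpora."""
        rng = random.Random(seed)
        observed = abs(metric(docs_a) - metric(docs_b))
        pooled = list(docs_a) + list(docs_b)
        extreme = 0
        for _ in range(n_perm):
            rng.shuffle(pooled)
            perm_a, perm_b = pooled[:len(docs_a)], pooled[len(docs_a):]
            if abs(metric(perm_a) - metric(perm_b)) >= observed:
                extreme += 1
        return extreme / n_perm  # differences are deemed significant if p < 0.05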

Language Modeling Quality
To evaluate the quality of the generated corpora, we evaluate an LM trained on the corpus generated by each method. Specifically, we build a small two-layer decoder-only transformer with 128 hidden dimensions and ∼5.5M total parameters, comparable in size to a BERT-Tiny model (Devlin et al., 2019), using two different settings: 1) using the same number of tokens for each corpus by downsampling the larger corpora (balanced) and 2) using the original corpus size of each collection method (full). The first setting shows the expected quality of the sentences in the corpora, while the second shows the expected empirical performance when utilizing the corpus.
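A rough configuration sketch of such a model is given below; the vocabulary size, number of attention heads, and maximum context length are assumptions for illustration, since only the layer count, hidden size, and approximate parameter count are specified above.

    # An illustrative two-layer, 128-dimensional decoder-only transformer built
    # with Hugging Face Transformers; all hyperparameters other than n_layer and
    # n_embd are assumptions, not the released training configuration.
    from transformers import GPT2Config, GPT2LMHeadModel

    config = GPT2Config(
        vocab_size=32_000,   # assumed subword vocabulary size
        n_layer=2,           # two decoder layers
        n_embd=128,          # 128 hidden dimensions
        n_head=4,            # assumed number of attention heads
        n_positions=512,     # assumed maximum context length
    )
    model = GPT2LMHeadModel(config)
    print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")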
The LM perplexity for the three corpus collection methods is shown in Figure 4a. In general, the LMs trained on NusaTranslation and NusaParagraph achieve much lower perplexity than the one trained on Wikipedia, showing that these corpora are better aligned with the colloquial writing of Indonesian local languages, which is the common use case for these languages (Cohn and Ravindranath, 2014; Farisiyah and Zamzani, 2018; Soeparno, 2015; Nurjanah et al., 2018; Nur, 2018; Sutrisno and Ariesta, 2019; Aji et al., 2022b). For the balanced setting, we observe that the LMs trained on NusaTranslation produce slightly better results than those trained on NusaParagraph. This is expected, as the source domain of NusaTranslation is more similar to NusaX (Winata et al., 2023), which also covers social media content and online reviews. Nevertheless, as shown in the results from the full setting (see Figure 4b), this gap can be alleviated by increasing the coverage of the corpus.

Loan Words Ratio
To assess the cultural relevance of the generated corpora, we evaluate the ratio of loan words present within each corpus. The loan words are manually curated from the top 200 words that overlap with the English lexicon and an additional list of English loan words in each corpus. The complete list of loan words is in Appendix H. The ratio is calculated by dividing the number of loan-word occurrences by the total number of tokens in each corpus, and the results are presented in Figure 5. The findings indicate that NusaParagraph and NusaTranslation exhibit minimal ratios of loan words, approximately ∼0.1% and ∼1%, respectively. However, some languages in Wikipedia, such as Minangkabau (min), Sundanese (sun), and Buginese (bug), demonstrate significantly higher ratios of loan words, ranging from approximately 5% to 15%. Additionally, in Appendix I, we demonstrate that NusaParagraph and NusaTranslation possess a notably higher ratio of common local words, including terms like indomie and angkot, in comparison to Wikipedia. These results emphasize the superiority of manually curated methods, particularly paragraph writing, in generating culturally relevant corpora.
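For reference, a minimal sketch of this ratio computation is given below; the tokenization and lowercasing are simplifying assumptions, and the loan-word list stands in for the manually curated list in Appendix H.

    # An illustrative loan-word ratio: (#loan-word token occurrences) / (#tokens).
    def loan_word_ratio(documents, loan_words):
        """documents: list of text strings; loan_words: manually curated word list."""
        loan_set = {w.lower() for w in loan_words}
        total = loans = 0
        for doc in documents:
            for token in doc.lower().split():
                total += 1
                loans += token in loan_set
        return loans / max(total, 1)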

NusaWrites Benchmark
From our resulting corpora in §3.4, we build the NusaWrites benchmark, which consists of 12 Indonesian local languages: Ambon (abs), Batak (btk), Betawi (bew), Bima (bhp), Buginese (bug), Javanese (jav), Madurese (mad), Makassarese (mak), Minangkabau (min), Palembang / Musi (mui), Rejang (rej), and Sundanese (sun). More details of each language are in Appendix J. Four of the languages under study, i.e., Ambon (abs), Bima (bhp), Musi (mui), and Rejang (rej), have populations of <1M speakers, while the others have populations of >2M speakers but remain underrepresented in NLP research (van Esch et al., 2022; Aji et al., 2022b). The languages share a relatively high overlap of terms owing to the shared geopolitical landscape and cultural values, and all belong to the Austronesian language family under the Malayo-Polynesian subgroup. While some of the languages are written in multiple scripts, we use the Latin script in NusaWrites, which has become predominant for all covered languages.

Table 2: Overall performance on all tasks in the NusaWrites benchmark. We report macro-F1 (%) for NLU, and SacreBLEU and ChrF++ for NLG, averaged over all of the languages within each task. The best performance in each section is bolded, while the best overall performance in each column is underlined.

NusaTranslation
We develop three parallel downstream tasks, i.e., sentiment analysis, emotion recognition, and machine translation, covering 11 local languages spoken in Indonesia. We generate a new split for each downstream task and keep a reasonable number of test samples for languages with smaller sample sizes. The labels of the downstream tasks follow those of the original dataset. The statistics of each downstream task are shown in Table 11. A detailed description of each downstream task is provided in Appendix K.

NusaParagraph
We develop three downstream tasks from NusaParagraph, i.e., topic modeling, emotion recognition, and rhetoric mode classification, based on the datasets covering 10 local languages spoken in Indonesia. For the topic modeling task, we cover 8 topics: food & beverages, sports, leisure, religion, culture & heritage, slice of life, technology, and business. For the emotion recognition task, we cover the 6 basic emotions (Ekman, 1992), i.e., fear, disgust, sadness, happiness, anger, and surprise, plus an additional emotion label, shame (Poulson, 2000).
For the rhetoric mode classification task, we cover 5 rhetoric modes: narrative, persuasive, argumentative, descriptive, and expository. The statistics of the corpus and the detailed description of each task are shown in Table 12 and Appendix L.

Baselines
Classical Machine Learning In extremely low-resource settings, classical approaches can outperform neural approaches, especially when there is no pre-trained model supporting the particular language (Winata et al., 2023). Moreover, with limited computational access in many regions such as Indonesia, classical machine learning remains a popular choice for researchers and industry (Nityasya et al., 2020; Aji et al., 2022a). Hence, we include this approach in NusaWrites.
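As an illustration, a classical pipeline of the kind referred to here is sketched below; the specific features (TF-IDF bag-of-words) and classifier (logistic regression) are assumptions for illustration rather than the exact classical baselines benchmarked.

    # An illustrative classical machine learning baseline for the NLU tasks:
    # TF-IDF features with a linear classifier, evaluated with macro-F1.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import f1_score
    from sklearn.pipeline import make_pipeline

    def classical_baseline(train_texts, train_labels, test_texts, test_labels):
        clf = make_pipeline(
            TfidfVectorizer(ngram_range=(1, 2), min_df=2),  # word uni-/bigram features
            LogisticRegression(max_iter=1000),
        )
        clf.fit(train_texts, train_labels)
        preds = clf.predict(test_texts)
        return f1_score(test_labels, preds, average="macro")  # macro-F1, as reported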
Massively Multilingual LMs Fine-tuning LMs for downstream tasks has become a popular method in NLP. It enables LMs to learn from a limited dataset and perform better compared to training neural models from scratch (Devlin et al., 2019; Wilie et al., 2020; Gehrmann et al., 2022).
Moreover, recent work has shown that a fine-tuned model for a specific task can outperform general-purpose, larger language models (Bang et al., 2023; Asai et al., 2023; Zhang et al., 2023). We investigate the performance of both large pre-trained multilingual and Indonesian monolingual baseline models on the low-resource languages used in this work. We follow the hyperparameter settings of Winata et al. (2023). Details are in Appendix M.
Zero-Shot LLMs LLMs fine-tuned on diverse instructions show the capability to generalize to unseen instructions (Wei et al., 2021; Sanh et al., 2021; Ouyang et al., 2022; Yong et al., 2023). Moreover, these models have been shown to generalize across different languages, assuming the base model is multilingual (Muennighoff et al., 2022; Adilazuarda et al., 2023; Zhang et al., 2023). Therefore, to assess the zero-shot capabilities of LLMs on our datasets, we benchmark BLOOMZ and mT0 (Muennighoff et al., 2022), both of which are multilingual LLMs that have been fine-tuned with downstream task instructions. We explore models from 300M up to ∼13B parameters. For NLU, the class output is determined by selecting the most probable label generated after the prompt. For NLG, we generate the translation by using prompts. The prompts used in this experiment can be found in Appendix N.
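A hedged sketch of the NLU label selection described above is given below for a causal (BLOOMZ-style) checkpoint: each candidate label is appended to the prompt and scored by the model's log-likelihood, and the highest-scoring label is returned. The model name, prompt construction, and scoring details are assumptions for illustration, not the exact evaluation harness; mT0, being an encoder-decoder model, would be scored analogously through a seq2seq interface.

    # An illustrative zero-shot label scorer: pick the label whose tokens are most
    # probable as a continuation of the prompt under a causal multilingual LLM.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_NAME = "bigscience/bloomz-560m"  # illustrative checkpoint choice
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
    model.eval()

    def zero_shot_classify(prompt, labels):
        scores = {}
        prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
        for label in labels:
            ids = tokenizer(prompt + " " + label, return_tensors="pt").input_ids
            with torch.no_grad():
                logits = model(ids).logits[0]
            # log-probabilities of the label tokens, conditioned on the prompt
            label_ids = ids[0, prompt_len:]
            log_probs = torch.log_softmax(logits[prompt_len - 1:-1], dim=-1)
            scores[label] = log_probs.gather(1, label_ids.unsqueeze(1)).sum().item()
        return max(scores, key=scores.get)

    # Hypothetical usage (prompt and label strings are illustrative):
    # zero_shot_classify("Teks: <paragraph>\nEmosi:", ["senang", "sedih", "marah"])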

Benchmark Results and Discussion
We present the results of our NLU and NLG experiments in Table 2a and Table 2b, respectively. Although the classical baselines have never learned any prior language representations, they perform competitively with the fine-tuning baselines, i.e., the fine-tuned Indonesian monolingual models (IndoBERT, IndoBART, and IndoGPT) and the fine-tuned multilingual models (mBERT, XLM-R, mBART-50, and mT5), on both the NLU and NLG benchmarks. Furthermore, based on the per-language breakdown shown in Figure 6, except for the languages observed during pre-training, i.e., Javanese (jav) and Sundanese (sun), both Indonesian and multilingual LMs fail to outperform the classical machine learning approaches on most languages and are only able to outperform them on languages that are closely related to Indonesian (see Appendix J), i.e., Betawi (bew) and Minangkabau (min). These facts demonstrate that most extremely low-resource languages in NusaParagraph and NusaTranslation are beyond the scope of knowledge transfer from Indonesian and multilingual pre-training due to their distinct linguistic characteristics.
Second, the LLMs used in this study, BLOOMZ and mT0, consistently and significantly underperform the fine-tuned and classical baselines, e.g., by up to ∼56% on emotion recognition and ∼47% on topic modeling in NusaParagraph, as well as ∼17.5 SacreBLEU on machine translation. Despite their ability to generalize to unseen tasks (Muennighoff et al., 2022), LLMs do not generalize well to unseen languages, which indicates a challenge in knowledge transferability between languages, especially for underrepresented and extremely low-resource languages, and underlines the need for more language-inclusive LLMs.

Conclusion
In this work, we compare the effectiveness of corpus collection methods for underrepresented and extremely low-resource languages. From our thorough study, we conclude that, although online scraping is effective for high-resource languages, it is not ideal for many extremely low-resource languages. Other approaches, such as sentence translation and paragraph writing, can be a better alternative for collecting data in extremely low-resource languages because they produce corpora with higher lexical diversity and cultural relevance. Furthermore, to measure the capability of existing LLMs to process underrepresented and extremely low-resource languages, we propose the NusaWrites benchmark, which covers 12 Indonesian local languages. Based on the benchmarking results, we demonstrate that both existing zero-shot prompting LLMs and fine-tuned pre-trained LMs fail to outperform the classical baselines, suggesting that these models cannot generalize to these extremely low-resource languages, as most of the languages under study are distinct from previously learned languages. Our empirical experiments demonstrate the need to extend the language coverage of these models.

Limitations

7.1 Languages for Comparison of Corpus Collection Methods
We explore only 5 Indonesian local languages when comparing the effectiveness of different corpus collection methods, due to the difficulty of finding eligible annotator candidates for the other languages. We hope future work can explore the generalization of our analysis to a broader set of languages, especially other underrepresented and extremely low-resource languages from different language families.

Buginese Data for NusaTranslation
We do not have Buginese data in our NusaTranslation corpus due to the difficulty of finding eligible annotator candidates for Buginese. In fact, during our dataset construction, we found only one eligible annotator candidate who was willing to participate in our study.

Few-Shot LLM Prompting
Few-shot in-context learning has been shown to improve performance over zero-shot prompting (Brown et al., 2020; Sanh et al., 2022; Wei et al., 2022; Chung et al., 2022). However, few-shot in-context learning incurs a high computational cost and, due to a limited computational budget, we only explore zero-shot LLM prompting and leave the exploration of few-shot in-context learning for future work.

Ethical Consideration
Our work highlights the importance of democratizing access to Natural Language Processing (NLP) technology for underrepresented and extremely low-resource languages. During our study, we are well aware of the ethical responsibility associated with language research and the potential impact it can have on communities. Our study prioritizes inclusivity, cultural relevance, and fairness. Within this work, the annotators are properly rewarded above the national average minimum wage in Indonesia. We have obtained informed consent from all annotators and adhered to data protection and privacy regulations for releasing the corpus and benchmark. Throughout our research process, we have made conscious efforts to engage with the language communities, involve local experts, and respect their linguistic and cultural nuances. Our ultimate goal is to promote linguistic diversity and contribute to a more inclusive NLP landscape. We encourage further collaboration and engagement with underrepresented language communities to ensure that their voices are heard and their needs are addressed in future language technology development. We remain committed to the principles of ethical research, diversity, inclusivity, and fairness, striving to mitigate biases and promote social good through our work in the field of NLP.

A Indonesian Local Languages in the Bible Corpus
We list all the Indonesian local languages covered in the Bible corpus in Table 3.

B Pre-Annotation Procedure

B.1 Annotator Recruitment
Our recruitment process involves multiple steps. Firstly, we conduct a strict selection process to filter out applicants. Subsequently, we proceed with knowledge transfer sessions for the selected annotators. The primary objective of our recruitment process is to identify and engage proficient annotators with expertise in the relevant local languages.
Qualification In developing data for a local language, a team that is competent and experienced in that language is needed. Annotators play a crucial role in compiling high-quality local language data. Therefore, strict qualifications are required of the candidate annotators to be recruited. The qualifications cover educational background and language-related experience. Annotator candidates must have good knowledge of the local language they are proficient in, including its sentence structure.
Recruitment Process The recruitment process starts with an assessment test comprising three questions for each task. This test is designed to provide an overview of the candidate's abilities in sentence translation and paragraph writing in the relevant local language for future tasks. During this stage, candidate selection is prioritized based on the assessment test results, followed by employment status and educational background.
Out of a total of 892 applicants, only 127 candidates (∼14%) were eligible to participate in the annotation process, of whom only 83 (∼65%) expressed their willingness to proceed. Some of the annotators withdrew during the course of the annotation, which further increased the complexity of the recruitment process. Moreover, finding speakers of certain local languages can be difficult, making the recruitment process long and ongoing throughout the annotation process.
Knowledge Transfer All selected annotators join groups and receive explanations regarding this project through knowledge transfer and overview meetings before starting their work. The information provided covers various aspects related to project management and the annotation process in detail. Annotators gain a clear understanding of the methods and guidelines to be followed when performing the annotations. With this explanation, it is expected that the annotators will have a comprehensive understanding of their responsibilities in this work and a detailed understanding of the task. This assists them in carrying out their task effectively and producing high-quality output.

C Sentence Translation Procedure
Human translation is carried out under a set of rules defined for the translation process. We instructed the annotators to retain the meaning of the text and to keep entities, such as persons, organizations, locations, and times that have no target-language translation, unchanged. Specifically, we instructed them to: (1) maintain the sentence's sentiment polarity; (2) preserve entities; and (3) maintain the complete information content of the original text.
In addition, we asked the annotators to maintain the typography. Most sentences from the original dataset are written in an informal tone, with nonstandard spelling, e.g., elongated vowels and punctuation. When such a sentence is translated into the target language, direct translation can sound unnatural. For example, translating the Indonesian word kangeeeen (originally kangen; en: miss) to taragaaaak (originally taragak) in Minangkabau may sound unnatural. Similarly, the original sentence may also contain typos. Due to the difficulty of accurately assessing the typographical consistency of translations, we removed this as a criterion.
The translation annotation phase is planned to last approximately 2-6 weeks, depending on the number of annotators involved in each language group. Each annotator receives around 1,000-3,000 sentences (for the same reasons as explained previously). Each annotator is required to complete the translation of 500 sentences per week. However, there were issues with commitment to weekly targets and with annotator availability, extending the annotation process to 9 weeks. This translation effort yielded a total of 72,444 sentences: 1,579 sentences for Bima; 1,574 sentences each for Palembang, Rejang Lebong, and Ambon; and 9,449 sentences each for Madurese, Minangkabau, Batak, Betawi, Javanese, Sundanese, and Makassarese.

D List of Topics for Paragraph Writing
Here we provide the list of topics and subtopics for the paragraph writing data collection. Each topic consists of 10 subtopics unless stated otherwise.
1. The existence of spirits (ghosts/demons/etc.); Reincarnation; The need for college; Culture preservation; Panda conservation; The need for shaving leg hair; Death penalty; Friendship between men and women without being more than friends; The legalization of assisted suicide.

E Paragraph Writing Procedure
For paragraph writing, we initially provide a list of topics before instructing the annotators to write paragraphs. The topics provided vary widely, ranging from light topics such as food and beverages, entertainment/leisure, and sports, to heavier topics such as science, history, politics, and religion. We also provide more specific subtopics for each of the main topics. In total, we have 20 main topics with 20 subtopics for each main topic. The provision of topics (especially subtopics) aims to facilitate the annotators in the process of writing paragraphs. That way, the annotators only need to write paragraphs by developing ideas from the topics and subtopics that have been given, without needing to think about which topic to choose. In addition, the selection of varied topics is also expected to enrich the vocabulary of the corpus; different topics naturally elicit different vocabulary.
To conduct the paragraph writing, the annotators are instructed to write short paragraphs with the following criteria: (a) the paragraph consists of a minimum of 100 words, (b) only the targeted local language is used, (c) the topic follows the provided topics and subtopics, (d) the paragraph type is one of narration, description, argumentation, persuasion, or exposition, and (e) the content must not defame public entities or contain sensitive and personal information about specific individuals.
The paragraph writing procedure starts (1) with a knowledge transfer session for the annotators covering the general procedure and background knowledge about writing and paragraph types. (2) After that, every annotator is given access to their own Google Spreadsheet worksheet, which already contains all the topics and subtopics they can develop according to the procedure. Every annotator has around 15 weeks to work, finishing 100-160 paragraphs every week.
(3) Once annotators have started their paragraph writing process, QC annotators check and validate the completed paragraphs every week (around 2-3 times a week). (4) Lastly, every two weeks, all annotators gather in an online meeting to discuss and evaluate any errors found in their data in order to prevent future mistakes.
Through paragraph writing, we achieved a total of 56,395 paragraphs. The details are: 5,017 paragraphs for Madurese, 8,538 paragraphs for Minangkabau, 10,189 paragraphs for Javanese, 9,729 paragraphs for Sundanese, 9,756 paragraphs for Betawi, 4,711 paragraphs for Batak, 5,338 paragraphs for Makassarese, 1,200 paragraphs for Rejang Lebong, 1,473 paragraphs for Palembang, 1,059 paragraphs for Buginese, and 44 paragraphs for Ambonese.

F Post-Annotation Procedure

F.1 Manual Validation
For sentence translation, QC annotators manually check the data to ensure that all words are translated to the target language and that not a single word is skipped by the translator. For paragraph writing, QC annotators check the data by skimming through the paragraphs one by one, checking for any apparent typos, and making sure that the annotators are using the local language and not Indonesian. There are some cases where local-language paragraphs still use Indonesian words, but these should make up less than 30% of a paragraph, while the rest must be in the desired local language. To ensure there is no plagiarism, QC annotators also sample some paragraphs from the data and check whether a similar paragraph can be found through a search engine.

F.2 Automatic Validation
To further ensure the diversity of the samples, we run an automatic validation to check that no two annotators have written similar paragraphs. Our automatic validation compares two paragraphs by first removing all punctuation marks, then computing the Levenshtein distance between them, and normalizing the distance by dividing by the average length of the two paragraphs. We conduct this process for all paragraph pairs and ask the corresponding annotators to revise whenever the normalized distance between two paragraphs is less than 30%.
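For illustration, the check can be implemented as sketched below; the python-Levenshtein dependency and the character-level distance are assumptions consistent with, but not necessarily identical to, our validation script.

    # An illustrative automatic validation: strip punctuation, compute the
    # Levenshtein distance, normalize by the average paragraph length, and flag
    # pairs whose normalized distance (distinctness) falls below 30%.
    import string
    import Levenshtein  # pip install python-Levenshtein

    def normalized_distance(paragraph_a, paragraph_b):
        strip = str.maketrans("", "", string.punctuation)
        a, b = paragraph_a.translate(strip), paragraph_b.translate(strip)
        avg_len = (len(a) + len(b)) / 2
        return Levenshtein.distance(a, b) / max(avg_len, 1)

    def needs_revision(paragraph_a, paragraph_b, threshold=0.30):
        return normalized_distance(paragraph_a, paragraph_b) < threshold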

G Token Statistics of the Corpora Under Study
We provide the token statistics of Wikipedia, NusaTranslation, and NusaParagraph in Figure 7. For Buginese (bug) in particular, the document length and the number of unique tokens in Wikipedia are rather low, indicating that there is a lot of boilerplate text in the Buginese Wikipedia data.

H List of Loan Words in Indonesian Local Languages
We present the list of manually curated loan words with their frequency and proportion in the corresponding corpus for each language in Table 4, Table 5, and Table 6, for NusaTranslation, NusaParagraph, and Wikipedia, respectively.

I List of Common Local Words in Indonesian Local Languages
We present the list of manually curated common local words with their frequency and proportion in the corresponding corpus for each language in Table 7, Table 8, and Table 9 for NusaTranslation, NusaParagraph, and Wikipedia, respectively.
Batak languages (btk) are a subgroup of the Northwest Sumatra-Barrier Islands languages spoken by the Batak people in the North Sumatra province and surrounding areas. The Batak languages can be divided into three groups: Northern, Simalungan, and Southern. The Northern group consists of three languages: Batak Alas-Kluet (btz), Batak Dairi (btd), and Batak Karo (btx). The Simalungan group has only one language, i.e., Batak Simalungun (bts). The Southern group consists of three languages: Batak Angkola (akb), Batak Mandailing (btm), and Batak Toba (bbc) (Eberhard et al., 2021). The Batak languages are predicate-initial and have verb systems reminiscent of Philippine languages, although they differ from them in many details (Blust et al., 2013). They were written using the Batak script, but the Latin script is now used for most writing. Our annotators originate from Batak Toba and Batak Mandailing, which are part of the Southern group. Batak Mandailing (btm) is spoken in North Sumatra (south interior from Padang Sidempuan into Riau) and West Sumatra provinces. The speakers are shifting to Indonesian in urban and migrant areas (Eberhard et al., 2021). It is written in the Batak script.
Batak Toba (bbc) is a language spoken in the North Sumatra province. Similarly to Acehnese, it is slowly being replaced by Indonesian in urban and migrant areas. It used to be written in the Batak script but is mainly written in Latin script now. The Batak languages are verb-initial and have verb systems reminiscent of Philippine languages, although they differ from them in many details (Blust et al., 2013).

Table 10: List of all languages under study in the NusaWrites benchmark along with their status of language development versus language endangerment.

Javanese (jav) is a language spoken mainly on Java island. It is the de facto language of provincial identity in central and eastern Java. The word order is SVO. It has 21 consonants and 8 vowels. It used to be written in the Javanese script, but since the 20th century it has mostly been written in Latin script. Javanese differs from most other languages of western Indonesia in contrasting dental and retroflex stops and in the feature of breathy voice or murmur as a phonetic property of its voiced obstruents. Javanese also differs from most languages of the Philippines and western Indonesia in allowing a number of word-initial consonant clusters. It has an elaborate system of speech levels (Blust et al., 2013).
Madurese (mad) is a language spoken in the East Java province, mainly on Madura Island, south and west of Surabaya city, and on Bawean, Kangean, and Sapudi islands. It has vowel harmony, gemination, rich affixation, three types of reduplication, and SVO basic word order (Davies, 2010).
Makassarese (mak) is mainly spoken in the South Sulawesi province. It has three dialects that form a chain: Lakiung (Gowa), Turatea (Jeneponto), and Bantaeng (Maros-Pangkep). The Gowa dialect is prestigious. It has 17 consonants and 5 vowels. The stress is on the penultimate syllable. Similar to other Western Malayo-Polynesian languages, it has inclusive and exclusive pronouns, noun head initials, prepositions, definite markers, classifiers, passive markers, and aspect markers (Eberhard et al., 2021). The speakers, especially young people in the cities, are shifting to Indonesian and Makassar Indonesian. It is taught as a subject in primary schools and written in Latin script. The Makassar script is no longer used.
Minangkabau (min) is a language spoken mainly in West Sumatra and in other provinces on Sumatra Island such as Bengkulu and Riau. Although it is classified as Malay, it is not mutually intelligible with Indonesian. The word order is SVO, and it is written in the Latin script. Standard Minangkabau voice can be characterized as an Indonesian-type system, whereas colloquial Minangkabau voice is more effectively characterized as a Sundic-type system (Crouch, 2009).

For the learning rate, we follow the configuration of NusaX (Winata et al., 2023), while the rest of the hyperparameters are shown in the following table.

M.3 Multi-Class Classification
Table 14 shows the hyperparameters used for the deep learning models in the classification experiments in this work. The tasks that follow these parameters include sentiment analysis, rhetoric mode classification, emotion recognition, and topic modeling. We follow the hyperparameter settings of Winata et al. (2023), which were found to work best.

N List of Zero-Shot Prompts
We provide the full list of prompts used in our zero-shot prompting experiments in Table 15.

Figure 2: Distribution of Indonesian languages in Wikipedia, compared against all existing languages.

Figure 3: The (left) MATTR, (center) MTLD, and (right) MSTTR scores of different corpus collection methods. Paragraph writing and translation achieve higher diversity on the extremely low-resource languages, i.e., Madurese (mad) and Buginese (bug), compared to scraping from Wikipedia.

Figure 5: Ratio of loan words per language for different corpus collection methods. Wiki: Wikipedia, NusaP: NusaParagraph, NusaT: NusaTranslation. The ratio is presented on a log10 scale.

Figure 8: Taxonomy of the languages under study. We show all 12 Indonesian local languages under study and the national language of Indonesia, i.e., Indonesian (ind).

Table 1: Lexical diversity of various Indonesian local language corpora in Wikipedia. X-LRL = extremely low-resource language, LRL = low-resource language, and MRL = medium-resource language.

Table 3: Description of the Indonesian local languages covered in the Bible corpus.

Table 4: Common loan words in NusaTranslation from the top-200 overlap with the English lexicon and the loan word list.

Table 5: Common loan words in NusaParagraph from the top-200 overlap with the English lexicon and the loan word list.

Ambonese Malay (abs) is spoken in various parts of Maluku province. It was developed on the

Table 6: Common loan words in Wikipedia from the top-200 overlap with the English lexicon and the loan word list. We only show the top 50 words for Buginese (bug) and Minangkabau (min); in total, the top-200 overlapping Buginese (bug) and Minangkabau (min) data with the English lexicon contain 54 and 90 loan words, respectively.

Betawi (bew) is spoken in Jakarta and some cities in West Java province such as Depok, Bekasi, Bogor, and Karawang. It is a Malay-based creole distinct from both Indonesian and other Malay-based pidgins and creoles. It evolved around the mid-19th century. It functions as a Low variety in a diglossic situation, but has covert prestige when used by the upper class. It has unique phonological, morphological, and lexical traits. It was influenced by the Peranakan Indonesian language and Balinese.

Bima (bhp) is spoken in the Komodo island area in East Nusa Tenggara province and on some islands in West Nusa Tenggara province such as Sumbawa island and the Banta and Sangeang islands. It has five dialects: Kolo, Sangar, Toloweri, Bima, and Mbojo

Table 7: Common local words with their frequency and proportion in the NusaTranslation corpus.

Table 8: Common local words with their frequency and proportion in the NusaParagraph corpus.

Table 9: Common local words with their frequency and proportion in the Wikipedia corpus.

Table 15: List of prompts used in our zero-shot prompting experiments.