IndoNLG: Benchmark and Resources for Evaluating Indonesian Natural Language Generation

Natural language generation (NLG) benchmarks provide an important avenue to measure progress and develop better NLG systems. Unfortunately, the lack of publicly available NLG benchmarks for low-resource languages poses a challenging barrier for building NLG systems that work well for languages with limited amounts of data. Here we introduce IndoNLG, the first benchmark to measure NLG progress in three low-resource yet widely spoken languages of Indonesia: Indonesian, Javanese, and Sundanese. Altogether, these languages are spoken by more than 100 million native speakers, and hence constitute an important use case of NLG systems today. Concretely, IndoNLG covers six tasks: summarization, question answering, chit-chat, and three different pairs of machine translation (MT) tasks. We collate a clean pretraining corpus of Indonesian, Sundanese, and Javanese datasets, Indo4B-Plus, which is used to pretrain our models, IndoBART and IndoGPT. We show that IndoBART and IndoGPT achieve competitive performance on all tasks despite using only one-fifth the parameters of a larger multilingual model, mBART-large (Liu et al., 2020). This finding emphasizes the importance of pretraining on closely related, localized languages to achieve more efficient learning and faster inference for very low-resource languages such as Javanese and Sundanese.


Introduction
Resources such as datasets, pretrained models, and benchmarks are crucial for the advancement of natural language processing (NLP) research. Nevertheless, most pretrained models and datasets are developed for high-resource languages such as English, French, and Chinese (Devlin et al., 2019; Martin et al., 2020; Chen et al., 2020). Although the number of datasets, models, and benchmarks has been increasing for low-resource languages such as Indonesian (Wilie et al., 2020; Koto et al., 2020b), Bangla (Bhattacharjee et al., 2021), and Filipino (Cruz and Cheng, 2020), these datasets primarily focus on natural language understanding (NLU) tasks, which only cover a subset of practical NLP systems today. In contrast, far fewer natural language generation (NLG) benchmarks have been developed for low-resource languages; most multilingual NLG resources thus far have primarily focused on machine translation, highlighting the need to generalize these low-resource NLG benchmarks to other commonly used NLG tasks such as summarization and question answering. While recent work has developed more comprehensive multilingual NLG benchmarks, such as XGLUE (Liang et al., 2020) and GEM (Gehrmann et al., 2021), these efforts still primarily evaluate NLG models on fairly high-resource languages.
In this paper, we take a step towards building NLG models for some low-resource languages by introducing IndoNLG, a benchmark of multilingual resources and standardized evaluation data for three widely spoken languages of Indonesia: Indonesian, Javanese, and Sundanese. Cumulatively, these languages are spoken by more than 100 million native speakers, and thus comprise an important use case of NLG systems today. Despite the prevalence of these languages, there has been relatively little prior work on developing accurate NLG systems for them, a limitation we attribute to a lack of publicly available resources and evaluation benchmarks. To help address this problem, IndoNLG encompasses clean pretraining data, pretrained models, and downstream NLG tasks for these three languages. For the downstream tasks, we collect pre-existing English-Indonesian machine translation, monolingual summarization, question answering, and dialogue datasets. Beyond these existing datasets, we prepare two new machine translation datasets (Sundanese-Indonesian and Javanese-Indonesian) to evaluate models on the regional languages, Javanese and Sundanese, which have substantially fewer resources (in terms of both unlabelled and labelled data) than the Indonesian language.
How, then, can we build models that perform well for such low-resource languages? Building monolingual pretrained models solely using low-resource languages, such as Sundanese and Javanese, is ineffective since only a small amount of unlabelled data is available for pretraining. In this paper, we explore two approaches. The first approach is to leverage existing pretrained multilingual models, such as mBART (Liu et al., 2020). While this approach is quite effective, we explore a second approach that leverages positive transfer from related languages (Hu et al., 2020; Khanuja et al., 2021), such as pretraining with a corpus of mostly Indonesian text. We justify this approach through the fact that Sundanese, Javanese, and Indonesian all belong to the same Austronesian language family (Blust, 2013; Novitasari et al., 2020), and share various morphological and semantic features as well as common lexical items through the presence of Sundanese and Javanese loanwords in the Indonesian language (Devianty, 2016). We show that pretraining on mostly Indonesian text achieves performance competitive with larger multilingual models, despite using 5× fewer parameters and less pretraining data, and achieves particularly strong performance on tasks involving the very low-resource Javanese and Sundanese languages.
Our contributions are as follows: 1) we curate a multilingual pretraining dataset for Indonesian, Sundanese, and Javanese; 2) we introduce two models that support generation in these three major languages of Indonesia, IndoBART and IndoGPT; 3) to the best of our knowledge, we develop the first diverse benchmark to evaluate the capability of Indonesian, Sundanese, and Javanese generation models; and 4) we show that pretraining solely on related languages (i.e., mostly Indonesian text) can achieve strong performance on two very low-resource languages, Javanese and Sundanese, compared to existing multilingual models, despite using fewer parameters and less pretraining data. This finding showcases the benefits of pretraining on closely related, local languages to enable more efficient learning of low-resource languages.
Related Work

NLP Benchmarks. Numerous benchmarks have recently emerged, which have catalyzed advances in monolingual and cross-lingual transfer learning. These include NLU benchmarks for low-resource languages, such as IndoNLU (Wilie et al., 2020), IndoLEM (Koto et al., 2020b), and benchmarks for Filipino (Cruz and Cheng, 2020), Bangla (Bhattacharjee et al., 2021), and Thai (Lowphansirikul et al., 2021); neural machine translation (MT) datasets for low-resource scenarios, including Indonesian (Guntara et al., 2020), African languages (Duh et al., 2020; Lakew et al., 2020), and Nepali and Sinhala (Guzmán et al., 2019); and large-scale multilingual benchmarks such as XTREME (Hu et al., 2020), MTOP (Li et al., 2020), and XGLUE (Liang et al., 2020). Winata et al. (2021), Aguilar et al. (2020), and Khanuja et al. (2020) further developed multilingual benchmarks to evaluate the effectiveness of pretrained multilingual language models. More recently, GEM (Gehrmann et al., 2021) covers NLG tasks in various languages, together with automated and human evaluation metrics. Our benchmark compiles languages and tasks that are not covered in this prior work, such as machine translation between the local languages of Indonesia (Indonesian, Javanese, and Sundanese).

Indo4B-Plus Dataset

As shown in Table 1, the total number of words in the local languages is minuscule (≈1% combined) compared to the total number of words in the Indonesian language. To alleviate this imbalance, we rebalance the Indo4B-Plus corpus. Following Liu et al. (2020), we upsample or downsample the data of each language i according to the ratio

λ_i = (1 / p_i) · (p_i^α / Σ_j p_j^α),

where λ_i denotes the up/down-sampling ratio for language i and p_i is the percentage of language i in Indo4B-Plus. Following Liu et al. (2020), we set the smoothing parameter α to 0.7. After rebalancing, the percentage of data in the local languages increases to ∼3%.
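As an illustration, the following minimal Python sketch computes these sampling ratios; the per-language percentages are approximate values for Indo4B-Plus (∼0.51% Sundanese and ∼0.73% Javanese, cf. the Extending the Dataset section), and the helper name is ours rather than part of any released code.

```python
# Minimal sketch of the language rebalancing ratio from Liu et al. (2020).
# The percentages below are approximate Indo4B-Plus shares; the helper name
# is illustrative and not part of any released implementation.
def sampling_ratios(percentages, alpha=0.7):
    """Return the up/down-sampling ratio lambda_i for each language, where
    percentages[lang] = p_i is the fraction of language i in the corpus."""
    denom = sum(p ** alpha for p in percentages.values())
    return {lang: (1.0 / p) * (p ** alpha / denom)
            for lang, p in percentages.items()}

p = {"id": 0.9876, "jv": 0.0073, "su": 0.0051}  # approximate corpus shares
ratios = sampling_ratios(p, alpha=0.7)
print(ratios)  # local languages get lambda > 1 (upsampled); Indonesian gets lambda < 1
```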

IndoNLG Tasks
The IndoNLG benchmark consists of 6 subtasks. Each subtask consists of one or more datasets, each with a different domain or characteristic. We summarize the statistics of each dataset in Table 2.

Question Answering.
For the question answering task, we use the TyDiQA (Clark et al., 2020) dataset, which is collected from Wikipedia articles with human-annotated question and answer pairs covering 11 languages. The question-answer pairs are collected for each language without using translation services. We use the Indonesian data from the secondary Gold passage task of the TyDiQA dataset. As the original dataset only provides training and validation sets, we randomly split off 15% of the training data and use it as the test set.
Chit-chat. We use XPersona (Lin et al., 2020), a multilingual chit-chat dialogue dataset for evaluating generative chatbots. The training data of XPersona are collected through translation and rule-based correction from the English version, while the test data are annotated by human annotators. We take the Indonesian conversation data and use the original dataset split. We only use the conversation turns, without including the persona information, during the training and evaluation of our models.

Table 4: BLEU evaluation results for the machine translation tasks. † We report the score from Guntara et al. (2020) and approximate the model size. Here and throughout this paper, entries in bold refer to the best overall score for each task, while underlined entries refer to the best score in each group of models.

Experimental Settings
In this section, we describe the models and outline how we train and evaluate our models.

Models
We provide a set of baseline models for each task.
The detailed list of models evaluated on the benchmark is shown in Table 3. We show the comparison of our models with the task-specific models from prior work in Appendix A.
Scratch. We build an encoder-decoder model using the mBART architecture (Liu et al., 2020), which we train from scratch directly on each downstream task (i.e., no pretraining). This baseline is crucial to assess the effectiveness of pretraining for low-resource languages.
IndoBART. We build our own pretrained encoder-decoder model, IndoBART, based on the mBART model (Liu et al., 2020). We pretrain IndoBART on only 3 languages: Indonesian, Sundanese, and Javanese. IndoBART follows the mBART implementation, albeit with different datasets and hyperparameter settings. Our IndoBART model consists of a 6-layer transformer encoder and a 6-layer transformer decoder, with 12 attention heads, an embedding size of 768, and a feed-forward size of 3072. Its size is around 132M parameters.
IndoGPT. Following GPT-2 (Radford et al., 2019), we develop IndoGPT, a decoder-only model similarly pretrained on 3 languages: Indonesian, Sundanese, and Javanese. Our IndoGPT model consists of 12 transformer decoder layers with 12 attention heads, an embedding size of 768, and a feed-forward size of 3072. Its size is around 117M parameters, with a maximum sequence length of 1024 (see Section 4.2 for more details on the pretraining setup).
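For concreteness, the configurations below roughly reproduce the model sizes described above using the Hugging Face transformers library. This is an illustrative sketch only: the exact settings of the released IndoBART and IndoGPT checkpoints (special tokens, dropout, etc.) may differ, and the 40,000-token vocabulary follows the Pretraining Setup section.

```python
from transformers import (GPT2Config, GPT2LMHeadModel,
                          MBartConfig, MBartForConditionalGeneration)

# Illustrative configurations approximating the described model sizes; the
# released IndoBART/IndoGPT checkpoints may use slightly different settings.
indobart_config = MBartConfig(
    vocab_size=40000,
    d_model=768,                                  # embedding size
    encoder_layers=6, decoder_layers=6,
    encoder_attention_heads=12, decoder_attention_heads=12,
    encoder_ffn_dim=3072, decoder_ffn_dim=3072,
    max_position_embeddings=1024,
)
indogpt_config = GPT2Config(
    vocab_size=40000,
    n_embd=768, n_layer=12, n_head=12, n_inner=3072,
    n_positions=1024,                             # maximum sequence length
)

indobart = MBartForConditionalGeneration(indobart_config)
indogpt = GPT2LMHeadModel(indogpt_config)
print(sum(p.numel() for p in indobart.parameters()) / 1e6)  # roughly 130M parameters
print(sum(p.numel() for p in indogpt.parameters()) / 1e6)   # roughly 120M parameters
```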
Multilingual Generation Models. We include existing pretrained multilingual generation models as our baselines, i.e., mBART (Liu et al., 2020) and mT5 (Xue et al., 2020), to analyze the effectiveness of the local generation models-IndoGPT and IndoBART-compared to their massively multilingual counterparts. For the mBART model, we use the mBART-50 pretrained checkpoint (Tang et al., 2020) with 610M parameters. The model is first pretrained with denoising in 25 languages using a masked language modelling framework, and then fine-tuned on another 25 languages covering low and medium-resource languages, including Indonesian. In contrast, mT5 (Xue et al., 2020) is trained on 101 languages using the mC4 dataset. We use mT5-small (300M parameters) such that the model size (excluding embeddings) resembles our local language models as closely as possible.

Pretraining Setup
Tokenization / Vocabulary. For both our Indo-BART and IndoGPT models, we use SentencePiece (Kudo and Richardson, 2018) with a byte-pair encoding (BPE) tokenizer learnt on the full rebalanced Indo4B-Plus dataset, with a vocabulary size of 40,000. Following Radford et al. (2019), we preprocess Indo4B-Plus for vocabulary generation by adding a space between different character categories if there is no space present. This is to prevent forming a subword token that merges characters across numbers, letters, whitespace characters, and others, such as "2020," and "#3".
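A minimal sketch of this tokenizer construction is shown below. The file names are placeholders, and the character-category rule is a simplified ASCII approximation; the actual preprocessing may rely on full Unicode categories.

```python
import re
import sentencepiece as spm

# Simplified approximation of the preprocessing described above: insert a
# space at category boundaries so BPE merges never cross character classes,
# e.g. "2020," -> "2020 ," and "#3" -> "# 3".
def split_character_categories(text):
    text = re.sub(r"(?<=\d)(?=[^\d\s])|(?<=[^\d\s])(?=\d)", " ", text)
    text = re.sub(r"(?<=[A-Za-z])(?=[^A-Za-z\d\s])|(?<=[^A-Za-z\d\s])(?=[A-Za-z])", " ", text)
    return text

# File names are illustrative placeholders, not the released corpus files.
with open("indo4b_plus.txt", encoding="utf-8") as fin, \
     open("indo4b_plus.preproc.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        fout.write(split_character_categories(line))

spm.SentencePieceTrainer.train(
    input="indo4b_plus.preproc.txt",
    model_prefix="indonlg_bpe",
    vocab_size=40000,
    model_type="bpe",
)
```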
Table 6: Results of automatic evaluation on the question answering and chit-chat datasets. † We re-evaluate the generated responses with our evaluation code.

IndoBART. Our IndoBART model is trained on 8 NVIDIA V100 GPUs for a total of 640k training steps, with a batch size of 1024, an initial learning rate of 3.75e-5, and a maximum sequence length of 1024. Following mBART (Liu et al., 2020), the model is pretrained to recover masked spans of tokens, with 35% of the tokens being masked. Each sampled span of tokens is replaced with a dedicated mask token with a probability of 90%, or with a random token from the vocabulary with a probability of 10%; span lengths are sampled from a Poisson distribution (λ = 3.5). In addition, the model is pretrained to recover the shuffled order of sentences within each data input. Our pretrained IndoBART model achieves a denoising perplexity of 4.65 on the validation set.
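The following sketch illustrates the text-infilling noise described above (35% masking, Poisson span lengths with λ = 3.5, and a 90%/10% split between the mask token and a random token). It is a simplified, self-contained approximation rather than the exact noising code used for pretraining.

```python
import random
import numpy as np

def text_infilling(tokens, vocab, mask_token="<mask>",
                   mask_ratio=0.35, poisson_lambda=3.5,
                   random_token_prob=0.10):
    """Simplified mBART-style text infilling: roughly 35% of the tokens are
    covered by spans with Poisson(3.5) lengths; each span is replaced by a
    single mask token 90% of the time, or a random vocabulary token 10% of
    the time. Illustrative only, not the exact pretraining implementation."""
    num_to_mask = int(round(len(tokens) * mask_ratio))
    # Start a span at ~ (mask_ratio / poisson_lambda) of positions so that the
    # expected fraction of masked tokens is roughly mask_ratio.
    start_prob = mask_ratio / poisson_lambda
    noised, i, masked = [], 0, 0
    while i < len(tokens):
        if masked < num_to_mask and random.random() < start_prob:
            span = max(1, int(np.random.poisson(poisson_lambda)))
            span = min(span, len(tokens) - i)
            if random.random() < 1.0 - random_token_prob:
                noised.append(mask_token)            # replace the whole span with <mask>
            else:
                noised.append(random.choice(vocab))  # or with a random vocabulary token
            i += span
            masked += span
        else:
            noised.append(tokens[i])
            i += 1
    return noised

# Toy usage with a placeholder vocabulary and sentence.
vocab = ["indo", "nlg", "bahasa", "jawa", "sunda"]
print(text_infilling("saya suka makan nasi goreng di warung kecil".split(), vocab))
```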
IndoGPT. We pretrain our IndoGPT model using an autoregressive language modeling objective (Radford et al., 2019) for 640k iterations on 8 NVIDIA V100 GPUs, with a batch size of 512, an initial learning rate of 5e-5, and a maximum sequence length of 1024. We apply distributed data parallelism (DDP) with ZeRO-DP (Rajbhandari et al., 2019) optimization to reduce the compute time and memory usage during pretraining. Our pretrained IndoGPT achieves an autoregressive language modelling perplexity of ∼90 on the validation set. The detailed pretraining hyperparameter settings for IndoBART and IndoGPT are shown in Appendix B.

Fine-tuning Setup
To ensure a fair comparison, we limit the encoder and decoder sequence lengths to 512 for the encoder-decoder models, while for the decoder-only IndoGPT we limit both the maximum prefix length and the maximum decoding length to 512. We perform a hyperparameter search for the learning rate over the range [1e-3, 1e-4, 5e-5, 1e-5, 5e-6] and report the best results. We report the best hyperparameter settings for each model in Appendix C.

Evaluation Procedure
For evaluation, we use beam search with a beam width of 5 and a length penalty α of 1.0, and limit the maximum sequence length to 512 for all models and tasks. We conduct both automatic and human evaluations to assess the models, using the standard evaluation metric of each task.

Table 8: Size, performance, and inference speed comparison of all baseline models reported in IndoNLG. We run the inference speed comparison with the same context and generation length to ensure a fair comparison across models.

Results and Discussion

Automatic Evaluation. On the IndoSum dataset, mBART-large achieves the highest score, followed by IndoGPT with a slightly lower score. Notably, all scores on IndoSum are relatively high, since its summary labels are much less abstractive than those of Liputan6. As shown in Table 6, mBART-large outperforms all other models by a large margin on both the F1 and exact match scores in the question answering task. We could not confidently attribute this large gap to any distinct patterns based on our qualitative analysis, although we conjecture that differences in model configuration, such as the embedding dimension and number of attention heads, might be one reason. In the chit-chat task, IndoBART outperforms all other models, including CausalBERT (Lin et al., 2020), which is trained with additional persona information. Conspicuously, all scores on chit-chat are very low. We hypothesize that this is due to the one-to-many problem in open-domain dialogue (Zhao et al., 2017; Lin et al., 2020): for a given dialogue history, there exist many valid responses stemming from unknown latent factors, such as personality, preference, and culture, that affect the response. We thus argue that human evaluation is more suitable for the chit-chat task.

Figure 1: Human evaluation metrics summary for the baseline models on fluency (left, 5 is best) and rank (right, 1 is best). Some of the models, such as mBART, achieve fluency competitive with the ground truth, and both the mBART and IndoBART models are close to the ground truth in terms of rank (signified by the means and the distributions), while maintaining high fluency scores (signified by their thin tails on fluency).
Human Evaluation. As shown in Figure 1, the overall quality of the models with respect to human evaluation can be ranked in the following order: mBART-large, IndoBART, mT5-small, IndoGPT, and the Scratch models. This finding is supported by the individual task metrics shown in Table 7, which show similar trends for most metrics. Note that the automatic evaluation metrics do not always correlate well with the human evaluation metrics. For example, in the Su ↔ Id and Jv ↔ Id tasks, IndoBART and mT5-small outperform mBART-large in terms of automated metrics, which contradicts the human evaluation results on the same tasks. This extends prior findings on the poor correlation of ROUGE and BLEU with human judgements (Novikova et al., 2017; Chaganty et al., 2018; Zhang et al., 2020; Sellam et al., 2020; Sai et al., 2020) to a broader language family beyond the Indo-European and Sino-Tibetan families. The full human evaluation results are in Appendix E.

Impact of Pretraining
To compare the models across all tasks, we conduct a further analysis measuring the aggregate performance (in terms of automated metrics) and the efficiency of all models, as explained in Appendix F. As shown in Table 8, all pretrained models achieve higher scores than the non-pretrained Scratch baseline. mBART-large achieves the best performance over all tasks, with an overall score of 31.45; IndoBART ranks second, with a score 3% lower (relative) than mBART-large. However, both mT5-small and IndoGPT perform worse than the BART-based models, a gap we attribute to the fact that mT5 and IndoGPT are more language-agnostic (i.e., they use no language identifiers).
Even though the overall performance of our IndoBART model is lower than that of the mBART model, IndoBART is considerably more efficient in terms of model size and inference time: it is only ~20% the size of mBART-large, and runs almost 4× faster on a CPU and 2.5× faster on a GPU. Our IndoGPT model, in contrast, is almost twice as slow as IndoBART due to its longer attention span, although it achieves performance similar to the larger mT5-small. Our results suggest that pretraining on local, highly related languages (i.e., mostly Indonesian text in the case of IndoBART and IndoGPT) yields a better performance-efficiency trade-off for those languages than massively multilingual pretraining of much larger models.

Extending the Dataset
As shown in Table 1, our Indo4B-Plus dataset is dominated by the Indonesian-language corpus. To address this problem, we collect more data for both Sundanese and Javanese from all publicly available documents in Common Crawl. We collect all documents with Javanese and Sundanese language tags published between August 2018 and April 2021. To reduce noise, we filter out sentences that are too short; we nevertheless obtain a significant increase in dataset size, especially for Javanese, as shown in Table 9. Specifically, with the additional data, the percentage of Sundanese data in our Indo4B-Plus increases from ∼0.51% to ∼2.07% and the percentage of Javanese data from ∼0.73% to ∼8.29%.

Table 9: Statistics of the Javanese and Sundanese datasets before and after adding additional data from Common Crawl.
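A minimal sketch of this length-based filtering is shown below; the word-count threshold and file names are illustrative assumptions rather than the exact values used to build the extended corpus.

```python
# Minimal sketch of the length-based filtering; the threshold and file names
# are illustrative assumptions, not the values used for Indo4B-Plus.
MIN_WORDS = 5

def keep(sentence, min_words=MIN_WORDS):
    return len(sentence.split()) >= min_words

with open("cc_javanese_raw.txt", encoding="utf-8") as fin, \
     open("cc_javanese_filtered.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        if keep(line.strip()):
            fout.write(line)
```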
To evaluate the effectiveness of adding more local-language corpus data, we perform corpus rebalancing as in Section 3.1 and build a pretrained IndoBART model with the same settings as in Section 4.1. As shown in Table 10, our IndoBART-v2 model, which benefits from more Javanese and Sundanese data, achieves a significant improvement on the Id→Jv translation task. IndoBART-v2 also maintains the performance on all other tasks and achieves a slightly higher overall score than the IndoBART model. Our results also suggest that decoding into a particular target language (especially very low-resource ones such as Javanese and Sundanese) is more sensitive to the corpus size than encoding a particular source language.
In future work, we aim to provide stronger pretrained models by (i) training larger IndoBART and IndoGPT models, and (ii) using larger pretraining data for the local languages, since downstream task performance correlates highly with both model size and pretraining data size.

Conclusion
We introduced the first Indonesian benchmark for natural language generation, IndoNLG. Our benchmark consists of six tasks: summarization, question answering, open chit-chat, and three different language pairs of machine translation tasks. We provide a large and clean pretraining corpus of Indonesian, Sundanese, and Javanese data, Indo4B-Plus, which is used to pretrain our NLG models, IndoBART and IndoGPT. We evaluate the effectiveness and efficiency of our models by conducting extensive automatic and human evaluations on the IndoNLG tasks. Based on these evaluations, our IndoBART and IndoGPT models achieve competitive (albeit slightly lower) performance compared to the much larger mBART-large model, while being substantially more efficient in terms of model size and inference speed.

Acknowledgments
We would like to thank Fajri Koto for sharing the generation results of the Liputan6 dataset, Zhaojiang Lin for sharing the generation results of the XPersona (Id) dataset, and Totok Suhardijanto and Dea Adhista for coordinating with the local annotators for the human evaluation. We are grateful to Laura Rimell for valuable feedback on a draft of this paper.

Ethical Considerations
Here we focus on the potential harms of our language models to identify and understand them, so that we can mitigate them in the future. We focus on two primary issues: the potential for misuse of language models and issues of bias, fairness, and representation.

Misuse of Language Models
Language models have the potential to contribute to socially harmful activities such as misinformation, plagiarism, spam, phishing, abuse of legal and governmental processes, and social engineering. As this research area grows, we anticipate that researchers will develop methods for faithful or steerable high-quality text generation, which could lower the barrier to entry for carrying out such socially harmful activities and increase their efficacy. At the time this paper is released, the use of language models in Indonesia is at an early stage, so although the immediate threat is minimal, we expect that such capabilities will introduce challenges for the broader research community in the future. We hope to alleviate such risks by focusing on mitigation research in coordination with other researchers.

Fairness, Bias, and Representation
As Indonesia is very rich in culture and religion, understanding the fairness and bias of the model is crucial so that bias issues can be further mitigated for societal benefit. To this end, we analyse fairness and bias relating to gender, ethnic group, and religion in our pretrained models. While our analysis does not reflect all of the models' biases, it can nevertheless provide a partial picture of the fairness and bias of a model trained on Indonesian data from the web. We perform co-occurrence tests for each gender, ethnic group, and religion category by translating and adjusting the prompts used in Brown et al. (2020) from English into Indonesian. We use the IndoGPT model to generate 1,200 outputs with a temperature of 1.0, a top-p of 0.9, and a maximum sequence length of 50. We manually identify semantically valid phrases that commonly occur in each category. The prompts and the most descriptive phrases for each gender, ethnic group, and religion can be found in Appendix G.

Ethnic Group
We find that our model makes associations that reflect, to some extent, how these ethnic groups are sometimes portrayed in the world; we list the biases across the groups in Table 23 in Appendix G. Elaborating on some of the top-ranked samples for the listed ethnicities: the Javanese ethnicity is often described as "suka dengan hal-hal yang berbau mistik" (keen on the mystical things) and "menghormati orang yang lebih tua" (being respectful to elders); the Sundanese ethnicity is often described as "memiliki jiwa sosial yang tinggi" (have a socially empathetic life) and "hidup di tengah-tengah masyarakat" (live in the midst of society); the Chinese ethnicity is described as "memiliki jiwa sosial yang tinggi" (have a socially empathetic life), while the Indian and Arabic ethnicities are described as "memiliki kemampuan yang luar biasa" (have an extraordinary ability), and the Caucasian ethnicity as "memiliki jiwa sosial yang tinggi" (have a socially empathetic life).

Religion
We investigated the bias across religions in our model as shown in Table 24 in Appendix G. We found that our model makes associations with common terms related to a specific religion in the real world, e.g., the use of "bertakwa" / "bertaqwa" (forbearance, fear, and abstinence) and "akhlak" (moral / ethics) in Islam; "Yesus Kristus" (Jesus Christ), "Yahudi" (Jewish), and "orang Kristen" (Christian) in Christianity and Catholicism; "Budha" and "Buddha" in Buddhism; "dewa-dewi" (Gods) and "Brahmana" in Hinduism; and "Tionghoa" (Chinese) for Confucianism.

A Model Comparison with Other Baselines
We report a comparison of our IndoBART and IndoGPT models with Guntara et al. (2020) and Koto et al. (2020a) in Table 11.

B Pretraining Hyperparameter Settings
We report our IndoBART and IndoGPT pretraining hyperparameters in Table 12.

C Fine-tuning Hyperparameter Settings
We report the best fine-tuning hyperparameters for each model in the IndoNLG benchmark in Table 13.

D Guideline for Conducting Human Evaluation
The human evaluation is conducted on eight IndoNLG tasks: En→Id (News), Id→En (News), Su→Id (Bible), Id→Su (Bible), Jv→Id (Bible), Id→Jv (Bible), Liputan6 Xtreme, and XPersona. We randomly select 100 input samples from the test set of each task and evaluate six different generated texts for each input sample: the ground-truth label and the outputs of the Scratch, mBART-large, mT5-small, IndoBART, and IndoGPT models. We recruit three native Indonesian annotators to annotate each sample in each task. For the machine translation tasks, the annotators are either native or fluent bilingual speakers of the corresponding language pair. We measure different metrics for each task, each on a 5-point Likert scale. For the machine translation tasks, following Guntara et al. (2020), we measure two metrics: fluency and adequacy. For the summarization tasks, following Kryscinski et al. (2019), we use four metrics: coherence, consistency, fluency, and relevance. For the chit-chat task, following Lin et al. (2020), we use three metrics: consistency, engagingness, and fluency. We also ask annotators to rank the generated texts for each sample to measure the relative quality of the models. The rank r ∈ [1..6] is an integer, with 1 indicating the most favourable generation and 6 the least favourable. The metric descriptions for machine translation, summarization, and chit-chat are listed in Table 14, Table 16, and Table 17, respectively; for metrics that might be interpreted differently by annotators, we provide additional guidelines in Table 15, Table 18, and Table 19. To generate the per-task statistics, for each sample we average the scores of its three annotations and then compute the statistics over all averaged sample scores in the corresponding task. To generate the summary statistics over all tasks shown in Figure 1, we compute the statistics over the aggregated averaged sample scores from all tasks.
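The following sketch illustrates the per-task aggregation described above, assuming a small array of placeholder annotator scores: each sample's three annotations are averaged first, and the statistics are then computed over the averaged sample scores.

```python
import numpy as np

# Placeholder annotator scores (not collected annotations): rows are samples,
# columns are the three annotators of one task and metric.
annotator_scores = np.array([
    [5, 4, 5],   # sample 1
    [3, 4, 3],   # sample 2
    [4, 4, 5],   # sample 3
])
per_sample = annotator_scores.mean(axis=1)    # average over annotators per sample
print(per_sample.mean(), per_sample.std())    # per-task statistics over samples
```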

Metric      Scale   Description
Fluency     1-5     Quality of the sentence regardless of its correctness
Adequacy    1-5     How correct the translation is with respect to the given source text

Table 14: Metrics description for human evaluation on the machine translation task.

Scale   Description
5       Completely accurate
4       Slight mistranslation
3       Something is not translated, or the translation contains more content than the source
2       Wrong meaning, but contains some lead
1       Completely wrong

Table 15: Details for adequacy evaluation on the machine translation task.

Table 17: Metrics description for human evaluation on the chit-chat task.

Scale   Description
5       The response is interesting, develops the conversation, and gives explanations or information
4       The response is not short, but it does not give explanations or information
3       The response is not short, but part of it seems uninterested or some utterances are simply not responded to
2       The response is short, and part of it seems uninterested or some utterances are simply not responded to
1       The response is short and is perceived as uninterested, or some utterances are simply not responded to

Table 18: Details for engagingness evaluation on the chit-chat task.

Scale   Description
5       100% factually aligned, with no redundancy or repetition
4       Factually aligned, with some redundancy or repetition
3       Can in some ways still be seen as aligned (e.g., in certain aspects or connections), but with an observable disconnect or responding to something that was not asked
2       Very difficult to see any factual alignment
1       Not aligned in any way

Table 19: Details for consistency evaluation on the chit-chat task.

E Results of Human Evaluation
We show the human evaluation results for the Liputan6 Xtreme and XPersona tasks in Table 20, and plots for every human evaluation metric of each task in Figures 2 to 9.

F Quality and Space Time Analysis

To enable a comparison of model quality across all tasks, we compute an overall score over all tasks in the IndoNLG benchmark. We compute the score by selecting one metric from each task and then taking the average over all tasks. Specifically, we use the SacreBLEU score for the machine translation tasks, ROUGE-L for the summarization task, F1 for the QA task, and SacreBLEU for the chit-chat task. While there are issues associated with reducing scores across heterogeneous settings to a single number, particularly for natural language generation (Ethayarajh and Jurafsky, 2020; Gehrmann et al., 2021), such a score can nevertheless be useful to provide a rough ranking for the purpose of model selection.
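The overall score can be illustrated with the short sketch below; the task names, dictionary layout, and scores are placeholders rather than values from the benchmark.

```python
# Illustrative computation of the overall score described above; the task
# names and scores are placeholders, not benchmark results.
TASK_METRIC = {
    "mt_en_id": "sacrebleu",
    "mt_su_id": "sacrebleu",
    "mt_jv_id": "sacrebleu",
    "summarization": "rouge_l",
    "question_answering": "f1",
    "chit_chat": "sacrebleu",
}

def overall_score(per_task_scores):
    """Average the selected metric over all tasks in the benchmark."""
    values = [per_task_scores[task][metric] for task, metric in TASK_METRIC.items()]
    return sum(values) / len(values)

# Hypothetical per-task scores for a single model.
scores = {task: {metric: 30.0} for task, metric in TASK_METRIC.items()}
print(overall_score(scores))  # 30.0
```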
We also evaluate the inference time of all models to allow a further analysis of their running time. We measure the inference time by performing greedy decoding with fixed encoder and decoder sequence lengths of 256, and report the average over 100 runs. We run the experiment on both CPU and GPU devices; for this experiment, we use an Intel(R) Core(TM) i9-7900X CPU @ 3.30 GHz and a single GTX 1080 Ti GPU.
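A hedged sketch of this measurement is given below: greedy decoding with fixed encoder and decoder lengths of 256, timed over repeated runs. A randomly initialised mBART-style model with the IndoBART dimensions stands in for the fine-tuned checkpoints, and the number of runs is reduced here for illustration.

```python
import time
import torch
from transformers import MBartConfig, MBartForConditionalGeneration

# Randomly initialised stand-in with the IndoBART dimensions; the actual
# measurement uses the fine-tuned checkpoints and 100 runs.
config = MBartConfig(
    vocab_size=40000, d_model=768,
    encoder_layers=6, decoder_layers=6,
    encoder_attention_heads=12, decoder_attention_heads=12,
    encoder_ffn_dim=3072, decoder_ffn_dim=3072,
    decoder_start_token_id=2,
)
model = MBartForConditionalGeneration(config).eval()

input_ids = torch.randint(4, 40000, (1, 256))     # fixed encoder length of 256
n_runs = 10
with torch.no_grad():
    start = time.perf_counter()
    for _ in range(n_runs):
        model.generate(input_ids, num_beams=1, do_sample=False,
                       min_length=256, max_length=256)   # greedy, fixed decoding length
    elapsed = (time.perf_counter() - start) / n_runs
print(f"average inference time per run: {elapsed:.2f}s")
```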

G Fairness and Bias Analysis
To analyze fairness and bias, we perform co-occurrence tests for the gender, ethnic group, and religion categories by translating and adjusting the prompts used in Brown et al. (2020) from English into Indonesian. We use the IndoGPT model to generate 1,200 outputs with a temperature of 1.0, a top-p of 0.9, and a maximum sequence length of 50. We manually extract the semantically valid phrases in each category. To obtain the most biased phrases for gender, we remove frequent phrases that occur in both gender categories. The prompts used in our analysis are shown in Table 21. We show the most biased phrases for gender in Table 22, and the most descriptive phrases for ethnic group and religion in Table 23 and Table 24, respectively. We provide the translation of all Indonesian words in Table 25.
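A sketch of this generation setup is shown below. A randomly initialised GPT-2-style model with the IndoGPT dimensions is used as a stand-in: in practice, the pretrained IndoGPT checkpoint and its tokenizer would be loaded, and the input would be an encoded prompt from Table 21 rather than random token IDs.

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

# Sampling configuration described above: temperature 1.0, top-p 0.9,
# maximum length 50. The model here is a randomly initialised stand-in.
config = GPT2Config(vocab_size=40000, n_positions=1024,
                    n_embd=768, n_layer=12, n_head=12, n_inner=3072)
model = GPT2LMHeadModel(config).eval()

prompt_ids = torch.randint(4, 40000, (1, 8))       # placeholder for an encoded prompt
with torch.no_grad():
    samples = model.generate(
        prompt_ids,
        do_sample=True,
        temperature=1.0,
        top_p=0.9,
        max_length=50,
        num_return_sequences=4,   # repeated in batches to collect 1,200 outputs
        pad_token_id=0,
    )
print(samples.shape)              # 4 sampled continuations
```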
Figure 2: Id→En machine translation tasks' human evaluation metrics summary for the baseline models on fluency (top left, 5 is best), adequacy (top right, 5 is best) and rank (bottom, 1 is best).
Figure 3: Id→Su machine translation tasks' human evaluation metrics summary for the baseline models on fluency (top left, 5 is best), adequacy (top right, 5 is best) and rank (bottom, 1 is best).
Figure 4: Id→Jv machine translation tasks' human evaluation metrics summary for the baseline models on fluency (top left, 5 is best), adequacy (top right, 5 is best) and rank (bottom, 1 is best).
Figure 5: En→Id machine translation task's human evaluation metrics summary for the baseline models on fluency (top left, 5 is best), adequacy (top right, 5 is best), and rank (bottom, 1 is best).

Figure 8: Liputan6 Xtreme summarization task's human evaluation metrics summary for the baseline models on fluency (top left, 5 is best), coherence (top right, 5 is best), consistency (middle left, 5 is best), relevance (middle right, 5 is best), and rank (bottom, 1 is best).
Figure 9: Chit-chat task's human evaluation metrics summary for the baseline models on consistency (top left, 5 is best), engagingness (top right, 5 is best), fluency (bottom left, 5 is best), rank (bottom right, 1 is best).