Replicable Benchmarking of Neural Machine Translation (NMT) on Low-Resource Local Languages in Indonesia

Neural machine translation (NMT) for low-resource local languages in Indonesia faces significant challenges, including the need for a representative benchmark and limited data availability. This work addresses these challenges by comprehensively analyzing the training of NMT systems for four low-resource local languages in Indonesia: Javanese, Sundanese, Minangkabau, and Balinese. Our study encompasses various training approaches, paradigms, and data sizes, along with a preliminary study of using large language models to generate synthetic parallel data for low-resource languages. We reveal specific trends and insights into practical strategies for low-resource language translation. Our research demonstrates that despite limited computational resources and textual data, several of our NMT systems achieve competitive performance, rivaling the translation quality of zero-shot gpt-3.5-turbo. These findings advance NMT for low-resource languages and offer guidance for researchers in similar contexts.


Introduction
Neural Machine Translation (NMT) plays a crucial role for local languages in Indonesia, supporting language documentation (Abney and Bird, 2010), native language preservation (Bird and Chiang, 2012; Costa-jussà et al., 2022), and bridging socioeconomic gaps (Azzizah, 2015). However, challenges unique to low-resource languages have hindered progress in this field (Aji et al., 2022). Our work addresses these challenges for four prominent local languages in Indonesia: Javanese, Sundanese, Minangkabau, and Balinese.
Impressive NMT advancements often come from well-resourced entities (e.g., Google's PaLM2 (Anil et al., 2023), OpenAI's gpt-3.5-turbo (Brown et al., 2020), Facebook's NLLB-200 (Costa-jussà et al., 2022)), focusing primarily on high-resource languages like English. This phenomenon highlights a research gap for languages with limited resources in data availability and computing power. For instance, benchmark NMT systems like NLLB-200 (Costa-jussà et al., 2022) rely on substantial computing power, a luxury many researchers lack, especially those working with local Indonesian languages (Cahyawijaya et al., 2022). This hampers progress because it is difficult to gauge whether a new approach, method, architecture, or data augmentation would help improve model performance.
In this work, our contribution is a replicable benchmark of NMT systems for these local Indonesian languages, trained on publicly available data and tested on the publicly available FLORES-200 dataset. We prioritize accessible computing resources. Our base cross-lingual XLM model (Conneau and Lample, 2019) uses only a modest compute setup. It is trained with only two languages at a time and on a single GPU with at most 48GB of memory, which we believe is within the reach of most researchers in this domain. We also use only publicly available data sources for training, including the NusaCrowd repository of Indonesian languages (Cahyawijaya et al., 2022) and parsed wikidumps. Extending prior work such as Winata et al. (2023), we benchmark NMT models on multiple domains.
In addition, in a preliminary study, we explore the impact of using gpt-3.5 (Brown et al., 2020) for synthetic low-resource language data generation to augment training. We also investigate the potential of code-switching (Kuwanto et al., 2021) for improving low-resource language NMT.
Prior works (Costa-jussà et al., 2022; Cahyawijaya et al., 2022; Winata et al., 2023) have contributed to the creation of these NMT benchmarks in two significant ways: (1) the creation and compilation of datasets, accompanied by (2) the exploration and evaluation of different methodologies. Costa-jussà et al. (2022) focused on developing NMT benchmarks for low-resource languages. Their NLLB-200 model supports 200+ languages with more than 40K translation directions. Among these 200+ languages, some are local languages in Indonesia. They obtain state-of-the-art results for many translation directions through massive data collection efforts and computing resources.
Similarly, Winata et al. (2023) collect a multilingual dataset for both machine translation and sentiment analysis for ten local languages in Indonesia. They use the collected dataset to create a benchmark for both tasks, obtaining impressive results in the machine translation tasks for the review domain by fine-tuning pre-trained models. Winata et al. (2023) also show that fine-tuning a non-English-centric pre-trained model on local Indonesian languages outperforms its English-centric counterpart for the machine translation task.
However, it is essential to recognize that the NMT benchmark created by Costa-jussà et al. (2022) is challenging to replicate. Many researchers and institutions, including top Indonesian universities (Cahyawijaya et al., 2022), do not have access to the massive compute resources or the extensive and proprietary training data required to train the NLLB-200 models. Meanwhile, the NMT benchmark created by Winata et al. (2023) is limited to the review domain. These limitations leave a research gap that needs to be filled by benchmark NMT models that are replicable and cover more general domains.

High-resource vs Low-resource NMT
Unlike low-resource NMT systems (where either the source or target language is a low-resource language), NMT systems for high-resource languages have achieved impressive results (Costa-jussà et al., 2022). Even with the progress achieved by the grassroots movement mentioned in the previous section, the performance gaps remain wide. This is because research in NMT and NLP is dominated by English and other major languages: considerably more effort has been put into developing language technologies for these major languages, and more data have been collected and made available for them (Akhbardeh et al., 2021; Kocmi et al., 2022). As a result, while low-resource NMT faces problems no longer found in high-resource NMT, insufficient resources and attention are being allocated to it.
One significant issue NMT systems face is the pivotal role parallel data plays in model performance (Koehn and Knowles, 2017). By definition, low-resource languages have little to no parallel data. One problem that negatively impacts model performance is out-of-vocabulary (OOV) occurrences (Aji et al., 2022; Wibowo et al., 2021), where the model has not seen a token often enough during training to learn what it means. While this issue exists even in high-resource languages, the rate of occurrence for low-resource languages is substantially higher. However, the use of byte pair encoding (BPE) (Sennrich et al., 2016b) can alleviate this issue to some degree (Lample et al., 2018b; Yang et al., 2020).
While previous research has made noteworthy strides in addressing OOV occurrences, the most effective solution continues to be expanding the available training data. However, building textual resources for translation tasks necessitates a significant investment of money, time, and expertise. Because of this, current research increasingly centers on finding innovative ways to augment the training of NMT models.

Augmenting Training for NMT
To combat the issue of data starvation, many researchers aim to utilize monolingual data to train NMT systems (Lample et al., 2018a; Artetxe et al., 2018; Conneau and Lample, 2019) and find ways to generate more training data, either comparable or synthetic. Comparable data are extracted using various bitext retrieval methods (Zhao and Vogel, 2002; Fan et al., 2021; Jones and Wijaya, 2021; Kocyigit et al., 2022), multimodal signals (Hewitt et al., 2018; Rasooli et al., 2021), or dictionary- or knowledge-based approaches (Wijaya and Mitchell, 2016; Wijaya et al., 2017; Tang and Wijaya, 2022), while synthetic data are created through innovative training data augmentation (Kuwanto et al., 2021), automatic back-translation (Sennrich et al., 2016a; Wang et al., 2019), or outright generation using generative models (Lu et al., 2023), which has recently gained increasing attention from the community due to the advancement of large language models (LLMs).
Both Artetxe et al. (2018) and Lample et al. (2018a) show that NMT systems can be trained using only monolingual data while achieving impressive results. Conneau and Lample (2019) then create the XLM architecture, which allows NMT systems to be trained using monolingual and parallel data. Afterward, Kuwanto et al. (2021) exploit the cross-lingual nature of the XLM models by corrupting the monolingual data using code-switching, which makes a single training instance contain multiple languages. The result is an improvement in the model's performance for low-resource translation.
In addition, prior works also focus on obtaining synthetic training data by turning monolingual data into parallel data through automatic back-translation (Sennrich et al., 2016a) or by using LLMs such as the gpt-family models (Brown et al., 2020), which have been gaining popularity in recent years in many fields (Lu et al., 2023). While back-translation has evolved into a prominent method in the field of NMT (Artetxe et al., 2018; Conneau and Lample, 2019), using LLMs to generate synthetic data has yet to be thoroughly explored. This trend of using generative AI to generate synthetic training data displays initial potential, considering the remarkable performance of these models compared to the state of the art in machine translation (Zhu et al., 2023). However, further research with ablation studies and broader language coverage is still needed.

Methodology
In this section, we outline our methodology for creating a replicable NMT benchmark for four Indonesian languages: Javanese (jv), Sundanese (su), Minangkabau (min), and Balinese (ban). We aim to systematically explore different training approaches and paradigms for NMT while maintaining a consistent base architecture (XLM), fixed hyperparameters, and a controlled computing environment. Our compute environment is given a strict upper bound: a total of 48 GPU hours on a single GPU for each model training run, totaling up to 96 GPU hours for NMT systems that utilize pre-trained language models. We also limit the memory of the GPU used to a maximum of 48 GB.

Training Approaches
We employ three primary training approaches to build our NMT models:
From Scratch (Scratch): In this approach, models are trained from the ground up without any reliance on pre-trained language models. This approach acts as a baseline and lets us gauge model performance without pre-training.
Pre-trained Cross-Lingual (PreXL): Here, an NMT model utilizes a cross-lingual model (XLM) (Conneau and Lample, 2019) pre-trained on two sets of monolingual data. One of the sets is the Indonesian monolingual data, and the other is the low-resource local language monolingual data. This provides a strong starting point for the NMT system by initializing the model with knowledge of the source and target languages. Therefore, each language pair in this work is given its own model. The number of pre-trained models for PreXL equals the number of language pairs in our work, which is four.
Code-switched Pre-trained Cross-Lingual (CodeXL): This approach involves pre-training the language model using additional augmented data created from the two sets of monolingual data and a bilingual dictionary through code-switching, explained later in section 3.6. Code-switching allows for a bilingual context within each training instance. The pre-trained model is then fine-tuned for translation. The tasks used to fine-tune CodeXL and PreXL depend on the training paradigm used (section 3.2). The number of pre-trained models for CodeXL is the same as for PreXL, which is four.
We chose the XLM architecture due to its modest compute resource requirements and its capability for cross-lingual language modeling. Moreover, the architecture is widely used for many low-resource language pairs and shows impressive results despite its modest size (Wang et al., 2019). We use Masked Language Modeling (MLM) (Devlin et al., 2019) to pre-train all the XLM models.
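As a rough illustration of the MLM objective used for pre-training, the sketch below corrupts a token sequence in the style described by Devlin et al. (2019). The masking rate (15%) and the 80/10/10 split between mask token, random token, and unchanged token are taken from that paper and are an assumption here, since this work does not report its masking hyperparameters. The function name mlm_mask and the toy vocabulary are hypothetical.

```python
import random

MASK = "[MASK]"  # placeholder mask symbol; the real symbol depends on the tokenizer


def mlm_mask(tokens, vocab, mask_prob=0.15, rng=None):
    """Corrupt `tokens` for masked language modeling.

    Returns (corrupted, labels). labels[i] is the original token at the
    positions the model must predict, and None everywhere else.
    Rates follow Devlin et al. (2019); they are assumptions for this sketch.
    """
    rng = rng or random.Random()
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # this position contributes to the loss
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK)               # 80%: replace with mask token
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: replace with random token
            else:
                corrupted.append(tok)                # 10%: keep the original token
        else:
            labels.append(None)  # position is not predicted
            corrupted.append(tok)
    return corrupted, labels


if __name__ == "__main__":
    sentence = "saya makan nasi di rumah".split()
    print(mlm_mask(sentence, vocab=sentence, rng=random.Random(0)))
```

In XLM-style pre-training the same corruption is applied to text from each language of the pair, so a single model learns representations for both.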

Training Paradigms
We also explore two training paradigms, each influencing how the NMT models learn and what data are used for training:
Unsupervised NMT (Unsup): This paradigm trains the NMT system using only monolingual data of the source and target languages, utilizing both denoising autoencoding (Vincent et al., 2010) and automatic back-translation (Sennrich et al., 2016a). Note that even though CodeXL utilizes a bilingual dictionary, it does not use any parallel data during pre-training.
Semi-supervised NMT (Semisup): This paradigm trains the NMT system using both monolingual data and parallel data of the source and target languages. Monolingual data are utilized for training through automatic back-translation.
We employ these shortened terms throughout our experiments to refer to the respective training approaches and paradigms. Our results indicate that the performance of each combination of approach and paradigm on the evaluation dataset depends heavily on the amount of available data for the language: the Unsup paradigm works better for very low-resource languages. In contrast, the Semisup paradigm performs better when at least 10K parallel sentences are available (Artetxe et al., 2018). We do not conduct training using a strictly supervised NMT paradigm because prior work has shown the substantial impact of automatic back-translation in improving the performance of low-resource NMT systems (Sennrich et al., 2016a).
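To make the denoising-autoencoding objective more concrete, the sketch below shows one common way to corrupt a sentence before asking the model to reconstruct the original: local word shuffling, word dropout, and word blanking, in the spirit of the noise model popularized by Lample et al. (2018a). This work does not specify its noise function, so the three noise types, the default rates, and the names add_noise and "<blank>" are all assumptions for illustration.

```python
import random


def add_noise(tokens, drop_prob=0.1, blank_prob=0.1, shuffle_window=3,
              blank="<blank>", rng=None):
    """Return a corrupted copy of `tokens` for denoising autoencoding.

    The model would be trained to map the corrupted sequence back to `tokens`.
    All default values are illustrative assumptions.
    """
    rng = rng or random.Random()

    # 1) Local shuffle: add a random offset in [0, shuffle_window) to each
    #    index and sort, so tokens only move a few positions.
    keys = [i + rng.uniform(0, shuffle_window) for i in range(len(tokens))]
    order = sorted(range(len(tokens)), key=lambda i: keys[i])
    shuffled = [tokens[i] for i in order]

    # 2) Word dropout: remove each token with probability drop_prob,
    #    but never return an empty sentence.
    kept = [tok for tok in shuffled if rng.random() >= drop_prob]
    if not kept and shuffled:
        kept = [rng.choice(shuffled)]

    # 3) Word blanking: replace each remaining token with a placeholder.
    return [blank if rng.random() < blank_prob else tok for tok in kept]


if __name__ == "__main__":
    sentence = "kami pergi ke pasar pagi ini".split()
    print(add_noise(sentence, rng=random.Random(1)))
```

Training pairs for this objective are (add_noise(x), x) for each monolingual sentence x, which is why no parallel data is required.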

Training with Synthetic Data
Following recent trends of using generative AI to generate synthetic training data (Lu et al., 2023; Zhu et al., 2023), we explore the impact of synthetically generated data on low-resource language NMT systems. We define two main approaches to generating synthetic data: (1) generating parallel data using generative AI and (2) translating monolingual data using an existing model.
To gauge the impact of the synthetically generated training data, we train NMT systems with these additional data using the Scratch and CodeXL training approaches. Scratch is also used in our preliminary experiments to identify the synthetic data generation approach that yields the best empirical results. Once we identify the best approach, we apply the same synthetic data generation approach to all our language pairs and use the generated data to augment the training of our NMT approach with the Semisup paradigm.
Through the preliminary experiments (reported in Appendix C), we find that synthetic data generated using generative AI (gpt-3.5-turbo) has the most positive impact on training NMT systems. We generate 5000 parallel sentences for each language pair via a zero-shot prompt: "Generate a long parallel sentence in SRC and TGT", where SRC and TGT are the pair of languages we want to generate the sentences in. Appendix C provides justifications for these choices.

Fine-tuning Objectives
Denoising autoencoding (DAE) (Vincent et al., 2010) is a popular training objective for fine-tuning pre-trained LMs for unsupervised MT tasks (Lample et al., 2018b; Wang et al., 2019) because it increases the robustness of NMT models. By utilizing the XLM architecture, our NMT system can perform multi-way translation. Thus, we also utilize automatic back-translation (BT) (Sennrich et al., 2016a) when fine-tuning our NMT models with the Unsup and Semisup paradigms. By performing back-translation using the same model that is being trained, synthetic parallel data is obtained and used automatically during training.

Training Data
We obtain our monolingual data from multiple publicly available sources. For Indonesian (id), we use the 201M monolingual sentences available from the Indo4B curated dataset (Wilie et al., 2020). We obtain monolingual data for the local languages from publicly available sources such as wikidumps, cc100 (Conneau et al., 2020), imdb-jv (Wongso et al., 2021), jadi-ide (Hidayatullah et al., 2020), and su-emot (Putra et al., 2020). The number of monolingual sentences used to train each language is available in Table 1, with a further breakdown available in Appendix A.
All parallel data we use to train the models are also publicly available from the NusaCrowd repository (Cahyawijaya et al., 2022). We scan the repository for datasets that contain parallel data of the local language paired with Indonesian. The number of parallel sentences used to train each language pair is available in Table 1. Our largest language in terms of monolingual and parallel sentences, Javanese, has only a tiny fraction (almost a 20th and a 500th, respectively) of the sentences that NLLB-200 reports for Javanese. From the publicly available resources in Table 1, we can see that these four languages are indeed low-resource languages. A further breakdown is available in Appendix B.
The sentence counts in Table 1 are those obtained after we filter both monolingual and parallel data. For monolingual data, we remove sentences that contain fewer than three words or more than 250 words. We also perform simple filtering for sentences obtained from Wikipedia, including deduplication, removing HTML tags, removing sentences with only numbers, removing sentences that do not start with a letter, and removing metadata, bullet points, or number ordering from sentences. For parallel data, we remove sentences that contain fewer than three words or more than 250 words and remove sentence pairs whose source sentences have a word count ratio above 1.5 relative to their translations, following the setup of Ghazvininejad et al. (2023).
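The length and ratio filters described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the exact filtering script: words are counted by whitespace splitting, the length limits are applied to both sides of a parallel pair, and the ratio is computed as source words divided by target words. The function names are hypothetical.

```python
def word_count(sentence):
    """Count words by whitespace splitting (an assumption of this sketch)."""
    return len(sentence.split())


def keep_monolingual(sentence, min_words=3, max_words=250):
    """Keep sentences with at least 3 and at most 250 words."""
    return min_words <= word_count(sentence) <= max_words


def keep_parallel(src, tgt, min_words=3, max_words=250, max_ratio=1.5):
    """Keep a pair if both sides pass the length filter and the
    source-to-target word count ratio does not exceed max_ratio."""
    if not (keep_monolingual(src, min_words, max_words)
            and keep_monolingual(tgt, min_words, max_words)):
        return False
    return word_count(src) / word_count(tgt) <= max_ratio


if __name__ == "__main__":
    pairs = [
        ("a b c d", "x y z"),      # ratio 1.33, kept
        ("a b c d e f", "x y z"),  # ratio 2.0, removed
        ("a b", "x y"),            # too short, removed
    ]
    print([keep_parallel(s, t) for s, t in pairs])
```

The Wikipedia-specific cleaning steps (deduplication, HTML stripping, and so on) are not covered by this sketch.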

Code Switch
Figure 1: Illustration of generating synthetic data using code-switching. Using a bilingual dictionary, some words in each monolingual sentence are translated (i.e., blue and orange words). The candidate word for translation is chosen randomly, and not all words will be translated (green box).
In this paper, code-switching is done by utilizing the system made by Kuwanto et al. (2021). Creating synthetic training data through code-switching utilizes training data from the monolingual datasets of both languages in the system. Using a bilingual dictionary, obtained and parsed from Winata et al. (2023), each instance of training data from both monolingual datasets is augmented. Figure 1 illustrates this process, while Table 2 shows how much augmented training data is available for each NMT system.
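The dictionary-based augmentation illustrated in Figure 1 can be sketched as below. This is not the system of Kuwanto et al. (2021); it is a simplified illustration that assumes whitespace tokens, a word-level dictionary mapping each source word to a list of translations, and an independent replacement probability per word (switch_prob, a hypothetical parameter). The toy Indonesian-to-Javanese entries are for illustration only.

```python
import random


def code_switch(tokens, dictionary, switch_prob=0.3, rng=None):
    """Randomly replace words that have a dictionary entry with a translation.

    Words without an entry, and words not selected, are left unchanged,
    so the output mixes both languages in one training instance.
    """
    rng = rng or random.Random()
    switched = []
    for tok in tokens:
        translations = dictionary.get(tok.lower())
        if translations and rng.random() < switch_prob:
            switched.append(rng.choice(translations))  # translated word
        else:
            switched.append(tok)                       # original word kept
    return switched


if __name__ == "__main__":
    # Toy dictionary, illustrative only.
    id_to_jv = {"saya": ["aku"], "makan": ["mangan"], "nasi": ["sega"]}
    sentence = "saya makan nasi di rumah".split()
    print(" ".join(code_switch(sentence, id_to_jv, rng=random.Random(0))))
```

Applying the same function with the reverse dictionary to the local-language corpus would produce code-switched data from both monolingual datasets.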

Experiment Results
We concentrate our efforts on four language pairs: id-jv, id-su, id-min, and id-ban. Indonesian (id) is spoken by approximately 198 million people worldwide, whereas Javanese (jv), Sundanese (su), Minangkabau (min), and Balinese (ban) are spoken by roughly 68.2 million, 32.4 million, 4.8 million, and 3.3 million people, respectively, according to Eberhard et al. (2023). Unsurprisingly, monolingual and parallel data availability for these four local languages in Indonesia generally follows a similar pattern. Javanese boasts the most extensive corpus of monolingual text data, while Balinese has the smallest. Regarding parallel data, Sundanese leads the way, closely followed by Javanese, while Balinese trails behind. Due to the substantial variation in training data availability, we present our findings for each local language separately. This approach allows us to assess the impact of different training methods and paradigms while also assessing the influence of training data size.
We conduct experiments using three training approaches (Scratch, PreXL, CodeXL) and two training paradigms (Unsup, Semisup). For each combination of these approaches and paradigms, four NMT systems are trained (one for each language pair mentioned above). In total, 24 different NMT systems are trained this way.
In addition, we conduct experiments using synthetic parallel datasets. We only generate parallel training data, so these data do not affect the Unsup training paradigm. To evaluate the impact of these synthetic datasets, we employ two distinct training approaches: Scratch and CodeXL. We denote the process of training NMT systems with additional synthetic parallel training data as Scratch_AUG and CodeXL_AUG, respectively. These comprise the remaining 8 NMT systems created in this work, for a total of 32. Since we use gpt-3.5-turbo to generate our synthetic parallel data in a zero-shot manner, we also benchmark the zero-shot translation performance of gpt-3.5-turbo (Turbo) on our evaluation dataset.
The results of these experiments, shown in Table 3 (all scores are in spm200BLEU), reveal a consistent trend: the CodeXL approach yields significantly better-performing NMT systems than Scratch and PreXL. An exception to this pattern is noted in the id-su language pair when employing the Unsup training paradigm.
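The system counts given above can be checked by enumerating the experimental grid. The short sketch below only restates the setup already described; the identifier spellings (for example "Scratch_AUG") are chosen for this illustration.

```python
from itertools import product

APPROACHES = ["Scratch", "PreXL", "CodeXL"]
PARADIGMS = ["Unsup", "Semisup"]
PAIRS = ["id-jv", "id-su", "id-min", "id-ban"]
AUGMENTED = ["Scratch_AUG", "CodeXL_AUG"]  # trained with synthetic parallel data

# Every approach is combined with every paradigm for every language pair.
base_systems = list(product(APPROACHES, PARADIGMS, PAIRS))

# Synthetic parallel data only affects Semisup, so each augmented
# approach contributes one system per language pair.
aug_systems = list(product(AUGMENTED, ["Semisup"], PAIRS))

all_systems = base_systems + aug_systems
print(len(base_systems), len(aug_systems), len(all_systems))  # 24 8 32
```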

Javanese
The id-jv language pair is particularly significant due to its relevance in Indonesia, where approximately 198 million people speak Indonesian (id), and Javanese (jv) is spoken by roughly 68.2 million (Eberhard et al., 2023).
Looking at Table 3, we observe a substantial gap in translation performance between id→jv and jv→id, emphasizing the performance asymmetry. Notably, when training NMT systems using the Unsup paradigm, CodeXL consistently outperforms other approaches for both translation directions, reinforcing the findings of Kuwanto et al. (2021), which highlight the generalization capability of this approach.
Surprisingly, when we train NMT models using additional parallel data generated by gpt-3.5-turbo (CodeXL_AUG), we notice a slight decline in performance for id→jv translation compared to our best-performing model (CodeXL). A more detailed comparison is discussed in the next section.

Sundanese
Table 3 shows where id-su differs from the id-jv language pair. For the id-su language pair, the Unsup paradigm shows a different trend, where CodeXL performs slightly worse than PreXL in the id→su translation. However, this trend shifts when using the Semisup paradigm, with CodeXL regaining its superiority.
As with the id-jv language pair, an intriguing phenomenon arises when we train NMT models using additional parallel data generated by gpt-3.5-turbo (CodeXL_AUG) for the id-su language pair. While this approach does not create a better-performing model for id→su translation, it does result in a slightly better model for su→id. This trend indicates that the impact of the generated synthetic parallel data depends heavily on the generative AI's translation performance. For both the id-jv and id-su language pairs, gpt-3.5-turbo's zero-shot translation performance on id→x is worse than that of CodeXL for each respective language pair; therefore, CodeXL_AUG does not result in improved performance. Meanwhile, the reverse is true for x→id: gpt-3.5-turbo's translation performance is better than that of CodeXL in this direction, and hence CodeXL_AUG performs better in this direction.

Minangkabau
The results we obtained for id-min follow a pattern similar to that of id-jv. The trend where CodeXL models perform better than Scratch and PreXL continues for id-min in both translation directions.
However, unlike id-jv and id-su, using synthetically generated parallel data to train NMT systems for id-min (CodeXL_AUG) performed better than CodeXL on the min→id translation. This is surprising because CodeXL performed better than Turbo on min→id, yet the parallel data generated by Turbo was able to create CodeXL_AUG, which is a better-performing NMT system. This breaks the previous trends set by id-jv, id-su, and even id-ban in the later section.

Balinese
id-ban continues the trend set by the majority of previous language pairs. Following the id-jv and id-min language pairs, CodeXL consistently has superior performance compared to Scratch and PreXL.
Additionally, id-ban follows the trend set by id-jv and id-su, where the use of synthetically generated parallel data from Turbo creates a better NMT system compared to others that do not use them. For the id-ban language pair specifically, Turbo's translation performance is much higher than that of CodeXL, and the data Turbo generated has a significant impact during training, as seen in CodeXL_AUG. The scores of CodeXL and CodeXL_AUG differ by 3+ and 6+ spm200BLEU for id→ban and ban→id, respectively. This performance difference is much larger than for id-jv, id-su, and id-min, where the performance difference is less than 1 spm200BLEU. This finding supports the idea that the impact of the generated synthetic parallel data depends heavily on the generative AI's translation performance. Moreover, if the initial parallel data is limited, as in the case of id-ban (only 0.9K), the addition of synthetic data can greatly improve performance.
However, Table 3 shows that the Unsup training paradigm created the best-performing NMT system for the id-ban language pair. While this would not be surprising for Scratch, given the limited amount of parallel training data, it is surprising that Scratch_AUG does not result in a considerably better NMT system, as the parallel training data grows to more than 6x its original size (i.e., from 0.9K to 5.9K). This indicates that denoising autoencoding plays a more significant role in model performance than parallel data when the parallel training data is limited.

Conclusion
In this work, we create a replicable NMT benchmark under low-resource settings. We comprehensively train and analyze NMT systems for four low-resource Indonesian local languages: Javanese, Sundanese, Minangkabau, and Balinese. Our experiments shed light on the impact of different training approaches, paradigms, data sizes, and generated synthetic parallel data in low-resource local languages in Indonesia. In conclusion, we observe that the CodeXL training approach generally creates NMT systems with better performance than the Scratch and PreXL approaches. This further supports the robustness of the approach suggested by Kuwanto et al. (2021), where code-switching is used to give a stronger cross-lingual signal during model pre-training. Code-switching benefits translation performance for x→id more than for id→x. For reference, NMT systems created using the CodeXL training approach and the Semisup paradigm have an average performance of 23.80 and 18.15 spm200BLEU for x→id and id→x, respectively.
Furthermore, even after pre-training a language model using the MLM objective, fine-tuning the model using the denoising autoencoding objective might play a more prominent role in extremely low-resource NMT than just training the model to be more robust for the translation task. This is shown in the id-ban language pair NMT systems, where the Unsup paradigm created better NMT systems than the Semisup paradigm. It is noteworthy that the addition of 5000 synthetic parallel sentences might not be enough to significantly improve NMT system performance, as visible when comparing the CodeXL-Unsup entry with the CodeXL_AUG-Semisup entry in Table 3, since the resulting parallel data is still very limited (i.e., fewer than 6K sentences).
Lastly, we also observe a trend in which generative AI can help augment the training process by generating synthetic parallel data. In most cases, excluding the id-min language pair, the parallel data generated by generative AI can help NMT systems approach or even outperform the generative AI's own translation performance with far fewer compute and data resources.

Future Work
Along with the above conclusions, our work also opens several avenues for future research. Further ablation studies are needed to fully understand the impact of denoising autoencoding on translation tasks. Our results indicate that the denoising autoencoding objective not only increases model robustness but may also play a role in cross-lingual language understanding in extremely low-resource NMT.
In addition, further investigations into the quality and diversity of synthetically generated parallel data are crucial. We observe a trend where synthetically generated parallel data from gpt-3.5-turbo impacts the training of NMT systems such that their performance approaches or even outperforms gpt-3.5-turbo's zero-shot translation performance.

Limitation
While our work has given insight into NMT systems for low-resource local languages in Indonesia, it is essential to note that we have utilized different GPUs (TitanV, RTX8000, RTX6000, A6000, A40) with a maximum memory capacity of 48GB for different experiments.These GPU architectures and memory capacity variations may have influenced the observed performance.However, it is crucial to recognize that hardware differences alone cannot fully account for all the performance gaps observed.Future research should conduct experiments using a more standardized GPU setup to understand the impact of hardware variations better.
Additionally, all of our experiments that utilize gpt-3.5-turbo are problematic because it is a closed-source model. This raises concerns about transparency and future reproducibility. Future work should continue performing ablation studies on open-source LLMs.

Acknowledgement
Authors from Indonesia are supported by the MoE-CRT ACE Open Research program. The authors also thank the Indonesian government for their funding and Boston University for providing essential computing resources. We also thank Garry Kuwanto from Boston University for his help in utilizing his code-switching system.

A Monolingual Training Data Breakdown
Table 4: Monolingual training data breakdown (columns: Lang, wikidumps, cc100, imdb-jv, jadi-ide, su-emot, Total, Filter). Lang denotes the language identifier, Total denotes the total monolingual sentences for each language, and Filter denotes how many monolingual sentences remain after filtering, as listed in our Methodology section.
The monolingual data of local languages in Indonesia are obtained from multiple sources, including parsed wikidumps, cc100 (Conneau et al., 2020), imdb-jv (Wongso et al., 2021), jadi-ide (Hidayatullah et al., 2020), and su-emot (Putra et al., 2020). Excluding the wikidumps monolingual data, which was taken in December 2022, all of these sources are obtained from the compilation done by Cahyawijaya et al. (2022) and were taken in January 2023. The breakdown for these monolingual data of local languages in Indonesia is found in Table 4.
For the monolingual data of the Indonesian language, we use the Indo4B curated dataset (Wilie et al., 2020). Excluding the data obtained from Twitter, the number of monolingual Indonesian sentences is 201 million. No sentences were filtered out due to the high quality of the dataset.
As with our monolingual data breakdown, all of our parallel data were obtained from the NusaCrowd repository in January 2023. The datasets we use include su-id (Suryani et al., 2015), min-nlp (Koto and Koto, 2020), code-mixed (Tho et al., 2021), bible (Cahyawijaya et al., 2021), nusantara (Sujaini, 2020), and nusax (Winata et al., 2023).
We performed exploratory experiments regarding different methods of generating parallel data. As mentioned in our methodology, we define two main approaches to generating synthetic data: (1) generating parallel data using generative AI and (2) translating monolingual data using an already trained model.

C Ablation of
The models we use to generate the parallel data in approach (1) are gpt-3.5-turbo and text-davinci-003. We limit our exploratory experiment to the id↔jv translation direction. First, we compare the zero-shot translation performance of these models on the FLORES-200 test set, where gpt-3.5-turbo achieved a considerably higher spm200BLEU score. The full breakdown is available in Table 6.
We give each model the following prompt to generate these parallel data: "Generate a long parallel sentence in SRC and TGT". Our internal experiments show that without the keyword "long", the model generates short and simple parallel sentences consisting of regularly occurring words. We conduct these experiments using two approaches: zero-shot generation and ten-shot generation. In zero-shot generation, we give the model the prompt above without additional context. We then parse the generated text and split it into Indonesian sentences and Javanese sentences. In ten-shot generation, we sample 10 parallel sentences from our original training data and feed them to the model as examples. Figure 2 illustrates this process.
The impact these generated synthetic data have on training is found in Table 7. These performances align with the benchmark results in Table 6, where gpt-3.5-turbo is better than text-davinci-003 in both translation directions. The results in Table 7 show that zero-shot generation with gpt-3.5-turbo creates parallel data with the most positive impact on NMT system training. They also indicate that few-shot prompting may not differ considerably in performance from zero-shot prompting for LLMs on translation tasks.
Besides prompting gpt-3.5 to generate parallel sentences directly, we also compare this with generating additional data by translating existing monolingual datasets. We use gpt-3.5 and the baseline XLM model to translate Wikipedia monolingual sentences. gpt-3.5-turbo is used instead of text-davinci-003 based on the results of the experiments shown in Table 6. Our findings show that additional data from translating a monolingual corpus using the baseline XLM model does not yield any significant performance increase, or even hurts performance, as shown in Table 8, whereas a monolingual corpus translated using gpt-3.5 yields an increase of over 1 BLEU in the Javanese-to-Indonesian translation direction.
However, this increase is modest compared to the results shown by directly generating parallel sentences from gpt-3.5-turbo as additional parallel data.Therefore, we move forward with approach (1).We apply the same procedure to the remaining language pairs: Sundanese, Minangkabau, and Balinese.Synthetic data generation is a promising research avenue in which both approaches (1) and (2) should still be included.
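As a rough illustration of the zero-shot and ten-shot generation setup described in this appendix, the sketch below builds the two prompt variants and parses a generated pair. Only the prompt string comes from the text. The function names, the layout of the ten examples inside the prompt, and the `Label: sentence` output format expected by the parser are assumptions made for illustration, not the exact implementation used in this work. No LLM API is called here.

```python
import random

# Prompt wording quoted from the text; SRC and TGT are filled in per pair.
PROMPT_TEMPLATE = "Generate a long parallel sentence in {src} and {tgt}"


def sample_shots(parallel_data, k=10, seed=0):
    """Sample k distinct (src, tgt) pairs from the original training data."""
    rng = random.Random(seed)
    return rng.sample(parallel_data, k)


def build_prompt(src, tgt, examples=None):
    """Return the zero-shot prompt, or a few-shot prompt when examples are given.

    The example layout ("<language>: <sentence>") is a hypothetical choice.
    """
    instruction = PROMPT_TEMPLATE.format(src=src, tgt=tgt)
    if not examples:
        return instruction
    shots = [f"{src}: {s}\n{tgt}: {t}" for s, t in examples]
    return "\n\n".join(shots + [instruction])


def parse_pair(generated, src, tgt):
    """Split generated text into a (src, tgt) sentence pair.

    Assumes each sentence sits on its own line prefixed by its language
    name and a colon; returns None when either side is missing.
    """
    found = {}
    for line in generated.splitlines():
        label, sep, text = line.partition(":")
        label, text = label.strip(), text.strip()
        if sep and label in (src, tgt) and text:
            found[label] = text
    if src in found and tgt in found:
        return found[src], found[tgt]
    return None
```

In use, the string returned by `build_prompt` would be sent to the chosen model, and `parse_pair` would be applied to each response, with unparseable responses discarded.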
Benchmarks for Low-resource Local Languages in Indonesia
Neural Machine Translation (NMT) benchmarks are pivotal in documenting and preserving low-resource local languages like those in Indonesia.

Table 7: Performance of NMT systems trained from scratch using a Semisup paradigm when the original parallel data is mixed with synthetically generated data from gpt-3.5-turbo or text-davinci-003. The first row is the baseline model performance. All BLEU scores are from XLM's automated BLEU scoring. Bolded entries indicate the model with the best performance.

Figure 2: Illustration of using LLMs to generate synthetic parallel data. The pipeline was first tested on the id-jv language pair and then used to generate synthetic parallel data for the other language pairs.

Figure 3: Illustration of translating monolingual data using an already trained model. In this work, we choose gpt-3.5-turbo and our NMT model trained on the id-jv language pair using the Scratch approach and the Semisup paradigm.
Curriculum Training
Code-switching is used to create synthetic training data by utilizing a bilingual dictionary. The generated data is used only during pre-training and is treated as a third language (labeled cs), where each training instance contains tokens from the other two languages. Results obtained by Kuwanto et al. (2021) imply that this method helps the model by giving stronger cross-lingual signals, which helps translation tasks during fine-tuning.
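A minimal sketch of dictionary-based code-switching follows, assuming whitespace tokenization and a per-token replacement probability. Neither detail is specified in the text, and the function name, the `ratio` parameter, and the toy dictionary are hypothetical.

```python
import random


def code_switch(sentence, bilingual_dict, ratio=0.5, seed=0):
    """Replace dictionary words with their translations to build a cs instance.

    Each token found in bilingual_dict is swapped with probability `ratio`
    (an assumed knob); all other tokens are kept unchanged, so the output
    mixes tokens from both languages.
    """
    rng = random.Random(seed)
    switched = []
    for token in sentence.split():
        translation = bilingual_dict.get(token.lower())
        if translation is not None and rng.random() < ratio:
            switched.append(translation)
        else:
            switched.append(token)
    return " ".join(switched)


# Toy Indonesian-to-Javanese dictionary, for illustration only.
toy_dict = {"saya": "aku", "makan": "mangan"}
mixed = code_switch("saya makan nasi", toy_dict, ratio=1.0)
```

With `ratio=1.0` every word covered by the dictionary is replaced, while `ratio=0.0` returns the sentence unchanged. Intermediate values give instances that mix both languages.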

Table 2: Total number of monolingual training instances for each language in each NMT system. Mono_id, Mono_x, and Mono_cs denote the number of training instances for Indonesian, the regional language, and the third (code-switched) language, respectively.

Table 3: Performance of NMT systems for Indonesian (id), Javanese (jv), Sundanese (su), Minangkabau (min), and Balinese (ban) translations. Turbo refers to gpt-3.5-turbo's zero-shot translation performance. The zero-shot paradigm indicates translation without training. AUG denotes models trained with synthetic data generated by gpt-3.5-turbo. Bold values represent the best overall performance, while italicized values indicate the best performance within each paradigm.

Table 5: Parallel training data breakdown for each language pair of Indonesian and the local language denoted by Lang (e.g., the entry jv lists how much parallel data for the pair ind, jv is obtained from each source). Total denotes the total number of parallel sentences for each pair of Indonesian and a local language, and Filter denotes how many remain after filtering, as described in our Methodology section.
Different Methods in Generating Parallel Data

Table 6: Zero-shot translation spm200BLEU scores of generative AI models on the FLORES200 test set. Results indicate that gpt-3.5-turbo performs significantly better than text-davinci-003 on zero-shot translation.

Table 8: Comparison of additional data generation techniques. BT stands for back-translation, in which we sample sentences from external monolingual corpora and translate them using the model indicated in parentheses.
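The back-translation setup compared in Table 8 can be sketched as below, under stated assumptions. Here `translate` is a hypothetical callable standing in for either translation model, the synthetic output is placed on the source side, and exact pairs already in the original data are dropped when mixing. The text does not confirm any of these choices.

```python
import random


def back_translate(monolingual, translate, n, seed=0):
    """Sample up to n monolingual sentences and pair each with its translation.

    Returns (synthetic_source, original_sentence) pairs; which side holds
    the synthetic text is an assumption of this sketch.
    """
    rng = random.Random(seed)
    sampled = rng.sample(monolingual, min(n, len(monolingual)))
    return [(translate(sentence), sentence) for sentence in sampled]


def mix(original_pairs, synthetic_pairs):
    """Append synthetic pairs to the original data, skipping any pair
    that already appears in the original data."""
    seen = set(original_pairs)
    return list(original_pairs) + [p for p in synthetic_pairs if p not in seen]
```

Any function mapping a sentence to its translation can be passed as `translate`, which keeps the sketch independent of a specific model or API.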