Workshop in South East Asian Language Processing (2025)


up

pdf (full)
bib (full)
Proceedings of the Second Workshop in South East Asian Language Processing

pdf bib
Proceedings of the Second Workshop in South East Asian Language Processing
Derry Wijaya | Alham Fikri Aji | Clara Vania | Genta Indra Winata | Ayu Purwarianti

pdf bib
bAI-bAI: A Context-Aware Transliteration System for Baybayin Scripts
Jacob Simon D. Bernardo | Maria Regina Justina E. Estuar

Baybayin, a pre-colonial writing system from the Philippines, has seen a resurgence in recent years. Research in computational linguistics has shown an increasing interest in Baybayin OCR, which focuses on the recognition and classification of script characters. However, existing studies face challenges with ambiguous Baybayin words that have multiple possible transliterations. This study introduces a disambiguation technique that employs word embeddings (WE) for contextual analysis and uses part-of-speech (POS) tagging as an initial filtering step. This approach is compared with an LLM method that prompts GPT-4o mini to determine the most appropriate transliteration given a sentence input. The proposed disambiguation process is integrated into existing Baybayin OCR systems to develop bAI-bAI, a context-aware Baybayin transliteration system capable of handling ambiguous words. Results show that incorporating POS as a filter does not significantly affect performance. The WE-Only method yields an accuracy of 77.46% and takes 5.35ms to process one sample while leveraging GPT-4o mini peaks at a higher accuracy of 90.52% but with a much longer runtime of 3280ms per sample. These findings present an opportunity to further explore and improve NLP approaches in disambiguation methods.

pdf bib
NusaBERT: Teaching IndoBERT to be Multilingual and Multicultural
Wilson Wongso | David Samuel Setiawan | Steven Limcorn | Ananto Joyoadikusumo

We present NusaBERT, a multilingual model built on IndoBERT and tailored for Indonesia’s diverse languages. By expanding vocabulary and pre-training on a regional corpus, NusaBERT achieves state-of-the-art performance on Indonesian NLU benchmarks, enhancing IndoBERT’s multilingual capability. This study also addresses NusaBERT’s limitations and encourages further research on Indonesia’s underrepresented languages.

pdf bib
Evaluating Sampling Strategies for Similarity-Based Short Answer Scoring: a Case Study in Thailand
Pachara Boonsarngsuk | Pacharapon Arpanantikul | Supakorn Hiranwipas | Wipu Watcharakajorn | Ekapol Chuangsuwanich

Automatic short answer scoring is a task whose aim is to help grade written works by learners of some subject matter. In niche subject domains with small examples, existing methods primarily utilized similarity-based scoring, relying on predefined reference answers to grade each student’s answer based on the similarity to the reference. However, these reference answers are often generated from a randomly selected set of graded student answer, which may fail to represent the full range of scoring variations. We propose a semi-automatic scoring framework that enhances the selective sampling strategy for defining the reference answers through a K-center-based and a K-means-based sampling method. Our results demonstrate that our framework outperforms previous similarity-based scoring methods on a dataset with Thai and English. Moreover, it achieves competitive performance compared to human reference performance and LLMs.

pdf bib
Thai Winograd Schemas: A Benchmark for Thai Commonsense Reasoning
Phakphum Artkaew

Commonsense reasoning is one of the important aspects of natural language understanding, with several benchmarks developed to evaluate it. However, only a few of these benchmarks are available in languages other than English. Developing parallel benchmarks facilitates cross-lingual evaluation, enabling a better understanding of different languages. This research introduces a collection of Winograd Schemas in Thai, a novel dataset designed to evaluate commonsense reasoning capabilities in the context of the Thai language. Through a methodology involving native speakers, professional translators, and thorough validation, the schemas aim to closely reflect Thai language nuances, idioms, and cultural references while maintaining ambiguity and commonsense challenges. We evaluate the performance of popular large language models on this benchmark, revealing their strengths, limitations, and providing insights into the current state-of-the-art. Results indicate that while models like GPT-4 and Claude-3-Opus achieve high accuracy in English, their performance significantly drops in Thai, highlighting the need for further advancements in multilingual commonsense reasoning.

pdf bib
Anak Baik: A Low-Cost Approach to Curate Indonesian Ethical and Unethical Instructions
Sulthan Abiyyu Hakim | Rizal Setya Perdana | Tirana Noor Fatyanosa

This study explores the ethical challenges faced by Indonesian Large Language Models (LLMs), particularly focusing on their ability to distinguish between ethical and unethical instructions. As LLMs become increasingly integrated into sensitive applications, ensuring their ethical operation is crucial. A key contribution of this study is the introduction of the Anak Baik dataset, a resource designed to enhance the ethical reasoning capabilities of Indonesian LLMs. The phrase “Anak Baik”, meaning “Good Boy”, symbolizes the ideal of ethical behavior, as a well-behaved child refrains from engaging in harmful actions. The dataset comprises instruction-response pairs in Indonesian, crafted for Supervised Fine-Tuning (SFT) tasks. It includes examples of both ethical and unethical responses to guide models in learning to generate responses that uphold moral standards. Leveraging Low-Rank Adaptation (LoRA) on models such as Komodo and Cendol shows a significant improvement in ethical decision-making processes. This enhanced performance is quantitatively validated through substantial increases in BLEU and ROUGE scores, indicating a stronger alignment with socially responsible behavior.

pdf bib
Indonesian Speech Content De-Identification in Low Resource Transcripts
Rifqi Naufal Abdjul | Dessi Puji Lestari | Ayu Purwarianti | Candy Olivia Mawalim | Sakriani Sakti | Masashi Unoki

Advancements in technology and the increased use of digital data threaten individual privacy, especially in speech containing Personally Identifiable Information (PII). Therefore, systems that can remove or process privacy-sensitive data in speech are needed, particularly for low-resource transcripts. These transcripts are minimally annotated or labeled automatically, which is less precise than human annotation. However, using them can simplify the development of de-identification systems in any language. In this study, we develop and evaluate an efficient speech de-identification system. We create an Indonesian speech dataset containing sensitive private information and design a system with three main components: speech recognition, information extraction, and masking. To enhance performance in low-resource settings, we incorporate transcription data in training, use data augmentation, and apply weakly supervised learning. Our results show that our techniques significantly improve privacy detection performance, with approximately 29% increase in F1 score, 20% in precision, and 30% in recall with minimally labeled data.

pdf bib
IndoMorph: a Morphology Engine for Indonesian
Ian Kamajaya | David Moeljadi

Indonesian is an agglutinative language and rich in morphology. Although it has more than 250 million speakers, it is a low resource language in NLP field. Many Indonesian NLP resources are scattered, undocumented, and not publicly available. In this paper we address the issue of analyzing morphology as well as generating Indonesian words. We introduce IndoMorph, a morphology analyzer and word generator for Indonesian. In an agglutinative language, morphology deconstruction can be crucial to understand the structure and meaning of words. IndoMorph can be useful for language modeling and testing certain analyses. In addition, it can be employed to make a new Indonesian subword representation resource such as Indonesian morphology dictionary (IMD), used as a language education tool, or embedded in various applications such as text analysis applications. We hope that IndoMorph can be employed not only in the Indonesian NLP research development, but also in the NLP research of any agglutinative languages.

pdf bib
NusaDialogue: Dialogue Summarization and Generation for Underrepresented and Extremely Low-Resource Languages
Ayu Purwarianti | Dea Adhista | Agung Baptiso | Miftahul Mahfuzh | Yusrina Sabila | Aulia Adila | Samuel Cahyawijaya | Alham Fikri Aji

Developing dialogue summarization for extremely low-resource languages is a challenging task. We introduce NusaDialogue, a dialogue summarization dataset for three underrepresented languages in the Malayo-Polynesian language family: Minangkabau, Balinese, and Buginese. NusaDialogue covers 17 topics and 185 subtopics, with annotations provided by 73 native speakers. Additionally, we conducted experiments using fine-tuning on a specifically designed medium-sized language model for Indonesian, as well as zero- and few-shot learning on various multilingual large language models (LLMs). The results indicate that, for extremely low-resource languages such as Minangkabau, Balinese, and Buginese, the fine-tuning approach yields significantly higher performance compared to zero- and few-shot prompting, even when applied to LLMs with considerably larger parameter sizes.