David Samuel Setiawan
2026
Context Volume Drives Performance: Tackling Domain Shift in Extremely Low-Resource Translation via RAG
David Samuel Setiawan | Raphaël Merx | Jey Han Lau
Proceedings of the Ninth Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2026)
Neural Machine Translation (NMT) models for low-resource languages suffer significant performance degradation under domain shift. We quantify this challenge using Dhao, an indigenous language of Eastern Indonesia with no digital footprint beyond the New Testament (NT). When applied to the unseen Old Testament (OT), a standard NMT model fine-tuned on the NT drops from an in-domain score of 36.17 chrF++ to 27.11 chrF++. To recover this loss, we introduce a hybrid framework where a fine-tuned NMT model generates an initial draft, which is then refined by a Large Language Model (LLM) using Retrieval-Augmented Generation (RAG). The final system achieves 35.21 chrF++ (+8.10 recovery), effectively matching the original in-domain quality. Our analysis reveals that this performance is driven primarily by the number of retrieved examples rather than the choice of retrieval algorithm. Qualitative analysis confirms the LLM acts as a robust "safety net," repairing severe failures in zero-shot domains.
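A minimal sketch of the retrieve-and-refine step this abstract describes, not the paper's released code. BM25 is used purely as an illustrative retriever (the abstract reports that the number of retrieved examples matters more than the retrieval algorithm), and all names here (build_retriever, refinement_prompt, nt_pairs, ot_source, nmt_draft) are hypothetical:

```python
# Sketch of the hybrid pipeline: an NMT draft is refined by an LLM prompted
# with examples retrieved from the in-domain (New Testament) parallel data.
from rank_bm25 import BM25Okapi


def build_retriever(nt_pairs):
    """Index (source, target) NT sentence pairs by their source side."""
    tokenized = [src.lower().split() for src, _ in nt_pairs]
    return BM25Okapi(tokenized)


def refinement_prompt(ot_source, nmt_draft, nt_pairs, retriever, k=20):
    """Assemble an LLM prompt from the k most similar NT examples plus the
    NMT draft to repair; the prompt wording is an assumption, and k is the
    'context volume' knob the paper finds most important."""
    scores = retriever.get_scores(ot_source.lower().split())
    top_idx = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    examples = "\n".join(f"{nt_pairs[i][0]} => {nt_pairs[i][1]}" for i in top_idx)
    return (
        "You are refining a draft translation into Dhao.\n"
        f"Parallel examples:\n{examples}\n\n"
        f"Source: {ot_source}\nDraft: {nmt_draft}\n"
        "Return only the corrected Dhao translation."
    )
```

The returned prompt would then be sent to the refining LLM; swapping BM25 for an embedding retriever only changes build_retriever.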
2025
NusaBERT: Teaching IndoBERT to be Multilingual and Multicultural
Wilson Wongso | David Samuel Setiawan | Steven Limcorn | Ananto Joyoadikusumo
Proceedings of the Second Workshop in South East Asian Language Processing
We present NusaBERT, a multilingual model built on IndoBERT and tailored to Indonesia’s diverse languages. By expanding its vocabulary and continuing pre-training on a corpus of regional languages, NusaBERT achieves state-of-the-art performance on Indonesian NLU benchmarks, enhancing IndoBERT’s multilingual capability. We also discuss NusaBERT’s limitations and encourage further research on Indonesia’s underrepresented languages.
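As an illustration of the vocabulary-expansion step mentioned above, here is a minimal sketch using the Hugging Face transformers API; the checkpoint name is the public IndoBERT base release, while the added tokens are placeholders rather than NusaBERT's actual tokenizer output:

```python
# Illustrative sketch, not the NusaBERT training code: expand an IndoBERT
# tokenizer with regional-language tokens and resize the embedding matrix
# before continued pre-training on a regional corpus.
from transformers import AutoModelForMaskedLM, AutoTokenizer

checkpoint = "indobenchmark/indobert-base-p1"  # public IndoBERT base model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# Placeholder tokens; in practice these would come from training a tokenizer
# on the regional-language corpus and diffing against the original vocabulary.
new_tokens = ["beta", "iko", "hauva"]
tokenizer.add_tokens(new_tokens)

# Give the new tokens trainable embedding rows, then continue masked-language
# pre-training on the expanded corpus as usual.
model.resize_token_embeddings(len(tokenizer))
```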