NusaBERT: Teaching IndoBERT to be Multilingual and Multicultural

Wilson Wongso, David Samuel Setiawan, Steven Limcorn, Ananto Joyoadikusumo


Abstract
We present NusaBERT, a multilingual model built on IndoBERT and tailored for Indonesia’s diverse languages. By expanding vocabulary and pre-training on a regional corpus, NusaBERT achieves state-of-the-art performance on Indonesian NLU benchmarks, enhancing IndoBERT’s multilingual capability. This study also addresses NusaBERT’s limitations and encourages further research on Indonesia’s underrepresented languages.
Anthology ID:
2025.sealp-1.2
Volume:
Proceedings of the Second Workshop in South East Asian Language Processing
Month:
January
Year:
2025
Address:
Online
Editors:
Derry Wijaya, Alham Fikri Aji, Clara Vania, Genta Indra Winata, Ayu Purwarianti
Venues:
sealp | WS
Publisher:
Association for Computational Linguistics
Pages:
10–26
URL:
https://aclanthology.org/2025.sealp-1.2/
Cite (ACL):
Wilson Wongso, David Samuel Setiawan, Steven Limcorn, and Ananto Joyoadikusumo. 2025. NusaBERT: Teaching IndoBERT to be Multilingual and Multicultural. In Proceedings of the Second Workshop in South East Asian Language Processing, pages 10–26, Online. Association for Computational Linguistics.
Cite (Informal):
NusaBERT: Teaching IndoBERT to be Multilingual and Multicultural (Wongso et al., sealp 2025)
PDF:
https://aclanthology.org/2025.sealp-1.2.pdf