Cendol: Open Instruction-tuned Generative Large Language Models for Indonesian Languages

Samuel Cahyawijaya; Holy Lovenia; Fajri Koto; Rifki Afina Putri; Emmanuel Dave; Jhonson Lee; Nuur Shadieq; Wawan Cenggoro; Salsabil Maulana Akbar; Muhammad Ihza Mahendra; Dea Annisayanti Putri; Bryan Wilie; Genta Indra Winata; Alham Fikri Aji; Ayu Purwarianti; Pascale Fung

doi:10.18653/v1/2024.acl-long.796

Abstract

Large language models (LLMs) show remarkable human-like capability in various domains and languages. However, a notable quality gap arises in low-resource languages, e.g., Indonesian indigenous languages, rendering them ineffective and inefficient in such linguistic contexts. To bridge this quality gap, we introduce Cendol, a collection of Indonesian LLMs encompassing both decoder-only and encoder-decoder architectures across a range of model sizes. We highlight Cendol’s effectiveness across a diverse array of tasks, attaining ~20% improvement, and demonstrate its capability to generalize to unseen tasks and indigenous languages of Indonesia. Furthermore, Cendol models showcase improved human favorability despite their limitations in capturing indigenous knowledge and cultural values in Indonesia. In addition, we discuss the shortcomings of parameter-efficient tuning methods, such as LoRA, for language adaptation. As an alternative, we propose the use of vocabulary adaptation to enhance efficiency. Lastly, we evaluate the safety of Cendol and show that safety acquired during pre-training in one language, such as English, is transferable to low-resource languages, such as Indonesian, even without RLHF and safety fine-tuning.

Anthology ID: 2024.acl-long.796
Volume: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: August
Year: 2024
Address: Bangkok, Thailand
Editors: Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 14899–14914
URL: https://aclanthology.org/2024.acl-long.796/
DOI: 10.18653/v1/2024.acl-long.796
Cite (ACL): Samuel Cahyawijaya, Holy Lovenia, Fajri Koto, Rifki Afina Putri, Emmanuel Dave, Jhonson Lee, Nuur Shadieq, Wawan Cenggoro, Salsabil Maulana Akbar, Muhammad Ihza Mahendra, Dea Annisayanti Putri, Bryan Wilie, Genta Indra Winata, Alham Fikri Aji, Ayu Purwarianti, and Pascale Fung. 2024. Cendol: Open Instruction-tuned Generative Large Language Models for Indonesian Languages. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14899–14914, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal): Cendol: Open Instruction-tuned Generative Large Language Models for Indonesian Languages (Cahyawijaya et al., ACL 2024)
PDF: https://aclanthology.org/2024.acl-long.796.pdf
